## UltraViolet Analytics - Titanic Tutorial 

### This is my 3rd submission, after failing with basic models using GraphLab and sklearn
#### This time, as a beginner, I chose to follow an existing tutorial that walks through the data set <br>
<a href=http://www.ultravioletanalytics.com/2014/10/30/kaggle-titanic-competition-part-i-intro/> Tutorial Available Here </a>

In [1]:
# Import all required libraries
import pandas as pd
from sklearn import linear_model
from sklearn import ensemble
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import re
%matplotlib inline

### Part-1, Loading the data

In [2]:
# read the training and test datasets 
titanic_train_df = pd.read_csv('train.csv', header=0)
titanic_test_df = pd.read_csv('test.csv', header=0)

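# Combine both datasets so that the same cleaning and feature engineering is applied to each
# (the test rows have no 'Survived' label, so they do not add labelled training data)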
titanic_df = pd.concat([titanic_train_df, titanic_test_df])

# Renumber the rows so that there are no duplicate index labels;
# drop=True discards the old index instead of adding it as a new column
titanic_df = titanic_df.reset_index(drop=True)

# Restore the column order of the training set
# (reindex_axis has been removed from pandas; reindex(columns=...) does the same job)
titanic_df = titanic_df.reindex(columns=titanic_train_df.columns)

print("{} columns: {}".format(titanic_df.shape[1], titanic_df.columns.values))
print("Row count: {}".format(titanic_df.shape[0]))


12 columns: ['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']
Row count: 1309


In [3]:
# Let's see how the data is stored in the DataFrame
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


### Part-2, Working with Missing Values

There are many ways to handle missing values. We can either drop the records that have missing values, or fill the gaps with something meaningful such as the mean or median. Dropping is a reasonable approach when a column is missing most of its values, because we cannot reliably fill it in from the few data points available, whereas filling in is the better approach in general.
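
As a minimal sketch of the two options on a toy frame (the values below are made up for illustration and are not read from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in both columns
toy = pd.DataFrame({
    "Fare": [7.25, np.nan, 8.05, 53.10],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill numeric gaps with the median and text gaps with a placeholder
filled = toy.copy()
filled["Fare"] = filled["Fare"].fillna(filled["Fare"].median())
filled["Cabin"] = filled["Cabin"].fillna("U0")

print("{} of {} rows survive dropna".format(len(dropped), len(toy)))
print(filled)
```

Dropping keeps only one of the four rows here, which is why filling is usually preferred when the gaps are spread across many records.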

In [4]:
# Assign a value indicating that the value is missing.

# Cabin has many missing values (only 295 of 1309 are present), so we fill the gaps with the placeholder 'U0'
titanic_df['Cabin'] = titanic_df['Cabin'].fillna('U0')
# Wherever Cabin is 'U0', we can treat it as missing data

In [5]:
# Assign average values or most frequent values

# We can apply this to the fare (ticket price) and to the port of embarkation
# Fill missing fares with the median
titanic_df['Fare'] = titanic_df['Fare'].fillna(titanic_df['Fare'].median())
# The statement above can also be written as:
# titanic_df.loc[titanic_df['Fare'].isnull(), 'Fare'] = titanic_df['Fare'].median()

# Fill missing ports with the most frequent port; assigning through .loc
# avoids the SettingWithCopyWarning that chained indexing triggers
titanic_df.loc[titanic_df['Embarked'].isnull(), 'Embarked'] = titanic_df['Embarked'].mode()[0]


In [6]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


Age still has missing values. There are several ways to handle entries like these, but a better approach is to train a regression model to predict the missing values.<br>
We will use RandomForestRegressor from sklearn.ensemble to fill in the missing values

In [7]:
def setMissingAge(df):
    
    # Features used in the regression. 'Names' and 'CabinLetter' are created in
    # later cells, so call this function only after those cells have run.
    # 'Embarked' (still text) and 'Title_id' (never created in this notebook) are
    # left out, because the regressor needs numeric columns that exist.
    age_df = df[['Age','Fare','Parch','SibSp','Pclass','Names','CabinLetter']]
    
    # Split the rows by whether Age is known or missing
    knownAge = age_df.loc[ (df.Age.notnull()) ]
    unknownAge = age_df.loc[ (df.Age.isnull()) ]
    
    # Target column, y
    y = knownAge.values[:,0]  # all rows, first column (Age)
    
    # Training data, X
    X = knownAge.values[:, 1:]
    
    # Create and fit a model
    rtr_model = RandomForestRegressor(n_estimators=2000, n_jobs=-1)
    rtr_model.fit(X, y)
    
    # Use the model to predict the missing ages
    predictedAges = rtr_model.predict(unknownAge.values[:,1:])
    
    # Assign the predictions back to the original dataset
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges
    
    return df
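
To see regression-based imputation in isolation, here is a small sketch on a toy frame. The values, the two-feature set and the settings `n_estimators=50` and `random_state=0` are assumptions chosen to keep the example fast and repeatable; they are not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy frame (hypothetical values): two ages are missing
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "Pclass": [3, 1, 3, 1, 3, 1],
})
features = ["Fare", "Pclass"]

# Rows with a known age train the model; rows without one get predictions
missing = toy["Age"].isnull()
known = toy[~missing]
unknown = toy[missing]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(known[features], known["Age"])

# Write the predictions back into the gaps only
toy.loc[missing, "Age"] = model.predict(unknown[features])
print(toy)
```

A random forest predicts by averaging training targets, so the imputed ages always fall within the range of the known ages.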

### Part-3, Feature Engineering, Variable Transformation
Scikit-learn requires everything to be numeric so we’ll have to do some work to transform the raw data.<br>

Different types of transformations can be applied to different types of variables. Qualitative transformations include:

In [8]:
# Dummy variables: this method is useful when the selected column has only a few distinct values.
# Embarked takes only the values 'S', 'C' and 'Q'. sklearn also needs
# every value to be numeric rather than string/text data
dummies_df = pd.get_dummies(titanic_df['Embarked'])
dummies_df = dummies_df.rename(columns = lambda x: 'Embarked_' + str(x))

titanic_df = pd.concat([titanic_df,dummies_df],axis=1)
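
A small sketch of what dummy encoding produces, using a toy Series rather than the real column. It uses the `prefix` argument of `pd.get_dummies`, which should give the same column names as renaming with a lambda.

```python
import pandas as pd

# Toy port values (hypothetical sample)
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One indicator column per distinct value, named Embarked_<value>
dummies = pd.get_dummies(ports, prefix="Embarked")

print(dummies.columns.tolist())
print(dummies)
```

Each row has exactly one indicator switched on, so the three new columns together carry the same information as the original text column.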

In [9]:
# Factorizing: pandas has a function called factorize() that creates a numerical categorical variable from any other variable,
# assigning a unique ID to each distinct value encountered.
# This method is used for alphanumeric data,
# so we can apply it to the Cabin column.

titanic_df['Cabin'] = titanic_df['Cabin'].fillna('U0')  # already done above; repeating it here changes nothing

# Extract the leading letter(s) of each Cabin value
titanic_df['CabinLetter'] = titanic_df['Cabin'].map(lambda x: re.compile("[a-zA-Z]+").search(x).group())

# Convert the CabinLetter values to incremental integers using factorize
titanic_df['CabinLetter'] = pd.factorize(titanic_df['CabinLetter'])[0]  # [0] holds the integer codes, [1] the unique values
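
To make the two return values of `pd.factorize` concrete, here is a sketch on a toy Series of cabin letters (hypothetical values):

```python
import pandas as pd

letters = pd.Series(["U", "C", "E", "C", "U"])

# codes: one integer per row; uniques: the distinct values, in order of first appearance
codes, uniques = pd.factorize(letters)

print(codes.tolist())
print(list(uniques))
```

The integer assigned to a value is simply its position in `uniques`, so the numbering reflects the order in which values first appear and carries no ranking.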


Quantitative transformations include:

In [10]:
# Scaling: some models treat variables in proportion to the magnitude of their values,
# so variables with wildly different scales should be brought onto a common scale.

# StandardScaler subtracts the mean from each value and then scales to unit variance
# We will apply this later, once the NaN values in Age have been filled in
# scaler = preprocessing.StandardScaler()
# titanic_df['Age_Scaled'] = scaler.fit_transform(titanic_df[['Age']]).ravel()
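
A minimal sketch of standard scaling on toy ages (hypothetical values). `StandardScaler` expects a 2-D input, so the column is selected with double brackets and the result is flattened before it is stored.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy ages (hypothetical values) with no missing entries
ages = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0]})

scaler = StandardScaler()
# fit_transform returns an (n_rows, 1) array; ravel() flattens it to one dimension
ages["Age_scaled"] = scaler.fit_transform(ages[["Age"]]).ravel()

print(ages)
```

After scaling, the column has a mean of zero and a population standard deviation of one.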

In [11]:
# Binning: binning groups continuous values into a small number of ranges (bins)
# Let's apply this method to Fare

titanic_df['Fare_bin'] = pd.qcut(titanic_df['Fare'], 4) 
# qcut creates a new variable that identifies the quartile range each fare falls into

# Factorize the result to get an integer id for each bin
titanic_df['Fare_bin_id'] = pd.factorize(titanic_df['Fare_bin'])[0]
# Each fare now has a Fare_bin_id of 0, 1, 2 or 3
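
A sketch of quantile binning on toy fares (hypothetical values). One point worth checking: by default `pd.factorize` numbers values in order of first appearance, so its ids need not follow the bin order. If ordered ids are wanted, the categorical codes from `.cat.codes` are one alternative.

```python
import pandas as pd

# Toy fares (hypothetical values)
fares = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08])

# Four quantile bins, each holding roughly the same number of fares
fare_bin = pd.qcut(fares, 4)

# ids in order of first appearance
by_appearance = pd.factorize(fare_bin)[0]
# ids that follow the bin order: 0 is the cheapest quartile, 3 the most expensive
by_order = fare_bin.cat.codes

print(fare_bin.value_counts().sort_index())
print(by_order.tolist())
```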

### Part-4, Feature Engineering : Derived Variables
 - Any variable that is generated from one or more existing variables is called a “derived” variable.

In [12]:
# Get useful information from the Name

# Find out how many names each person has
titanic_df['Names'] = titanic_df['Name'].map(lambda x: len(re.split(" ", x)))

# Now find the title of each person; matching the space after the comma
# keeps it out of the captured title ('Mr' rather than ' Mr')
titanic_df['Title'] = titanic_df['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])

# Fold rarely occurring titles into common ones; .loc avoids SettingWithCopyWarning
titanic_df.loc[titanic_df.Title == 'Jonkheer', 'Title'] = 'Master'
titanic_df.loc[titanic_df.Title.isin(['Ms','Mlle']), 'Title'] = 'Miss'
titanic_df.loc[titanic_df.Title == 'Mme', 'Title'] = 'Mrs'
titanic_df.loc[titanic_df.Title.isin(['Capt', 'Don', 'Major', 'Col', 'Sir']), 'Title'] = 'Sir'
titanic_df.loc[titanic_df.Title.isin(['Dona', 'Lady', 'the Countess']), 'Title'] = 'Lady'

# Build binary features
titanic_df = pd.concat([titanic_df,pd.get_dummies(titanic_df['Title']).rename(columns=lambda x: 'Title_' + str(x))], axis=1)
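
When a title is pulled out of a name with a regular expression, the space after the comma matters: if it is captured, the result is `' Mr'` rather than `'Mr'`, and comparisons such as `Title == 'Mr'` silently match nothing. A small sketch follows; the helper name `extract_title` and the `'Unknown'` fallback are assumptions made for illustration.

```python
import re

# \s* consumes the whitespace after the comma so it stays out of the captured group
TITLE_RE = re.compile(r",\s*(.*?)\.")

def extract_title(name):
    """Hypothetical helper: return the title found between the comma and the first period."""
    match = TITLE_RE.search(name)
    return match.group(1) if match else "Unknown"

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
titles = [extract_title(n) for n in names]
print(titles)
```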


In [44]:
# Fix the cabin
# Create a feature for the deck
titanic_df['Deck'] = titanic_df['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
titanic_df['Deck'] = pd.factorize(titanic_df['Deck'])[0]

# Create binary features for each deck
decks = pd.get_dummies(titanic_df['Deck']).rename(columns=lambda x: 'Deck_' + str(x))
titanic_df = pd.concat([titanic_df, decks], axis=1)

# Create feature for the room number
# titanic_df['Room'] = titanic_df['Cabin'].map( lambda x : re.compile("([0-9]+)").search(x).group()).astype(int) + 1

In [46]:
def processTicket(df):
    # Takes the DataFrame as an argument and returns the updated frame,
    # e.g. titanic_df = processTicket(titanic_df)
    
    # extract and clean up the ticket prefix
    df['TicketPrefix'] = df['Ticket'].map( lambda x : getTicketPrefix(x.upper()))
    df['TicketPrefix'] = df['TicketPrefix'].map( lambda x: re.sub(r'[\.?\/?]', '', x) )
    df['TicketPrefix'] = df['TicketPrefix'].map( lambda x: re.sub('STON', 'SOTON', x) )
    # create binary features for each prefix
    prefixes = pd.get_dummies(df['TicketPrefix']).rename(columns=lambda x: 'TicketPrefix_' + str(x))
    df = pd.concat([df, prefixes], axis=1)
    
    # factorize the prefix to create a numerical categorical variable
    df['TicketPrefixId'] = pd.factorize(df['TicketPrefix'])[0]
    
    # extract the ticket number
    df['TicketNumber'] = df['Ticket'].map( lambda x: getTicketNumber(x) )
    
    # create a feature for the number of digits in the ticket number
    # (np.int has been removed from NumPy; the built-in int is equivalent)
    df['TicketNumberDigits'] = df['TicketNumber'].map( lambda x: len(x) ).astype(int)
    
    # create a feature for the starting number of the ticket number
    df['TicketNumberStart'] = df['TicketNumber'].map( lambda x: x[0:1] ).astype(int)
    
    # The prefix and (probably) number themselves aren't useful
    return df.drop(['TicketPrefix', 'TicketNumber'], axis=1)
    
 
def getTicketPrefix(ticket):
    match = re.compile(r"([a-zA-Z\.\/]+)").search(ticket)
    if match:
        return match.group()
    else:
        return 'U'
 
def getTicketNumber(ticket):
    match = re.compile(r"([\d]+$)").search(ticket)
    if match:
        return match.group()
    else:
        return '0'
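
To illustrate what the prefix and number rules above yield, here is a self-contained sketch on a few ticket strings. The helper `split_ticket` is a hypothetical name that combines both steps; it only strips `.` and `/` from the prefix.

```python
import re

PREFIX_RE = re.compile(r"[a-zA-Z./]+")  # first run of letters, dots and slashes
NUMBER_RE = re.compile(r"\d+$")         # run of digits at the very end

def split_ticket(ticket):
    """Hypothetical helper: return (prefix, number) for one raw ticket string."""
    prefix_match = PREFIX_RE.search(ticket.upper())
    prefix = prefix_match.group() if prefix_match else "U"
    # drop punctuation, then normalise the STON spelling to SOTON
    prefix = re.sub(r"[./]", "", prefix).replace("STON", "SOTON")
    number_match = NUMBER_RE.search(ticket)
    number = number_match.group() if number_match else "0"
    return prefix, number

samples = ["A/5 21171", "STON/O2. 3101282", "113803", "LINE"]
parsed = [split_ticket(t) for t in samples]
for raw, (prefix, number) in zip(samples, parsed):
    print("{!r} -> prefix={!r}, number={!r}".format(raw, prefix, number))
```

Note that the prefix match stops at the first digit, so `A/5` contributes only `A` and `STON/O2.` becomes `SOTONO`.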

### Part-5, Feature Engineering - Interaction Variables and Correlation
Interaction variables capture effects of the relationship between variables. They are constructed by performing mathematical operations on sets of features.

KeyError: "None of [['Age_scaled', 'Fare_scaled', 'Pclass_scaled', 'Parch_scaled', 'SibSp_scaled', 'Names_scaled', 'CabinNumber_scaled', 'Age_bin_id_scaled', 'Fare_bin_id_scaled']] are in the [columns]"
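
The KeyError above reports that none of the listed `*_scaled` columns exist in the frame at this point. As a sketch of the idea itself, the following builds pairwise product and sum features from a few raw numeric columns of a toy frame and then inspects their correlations. The values and the `a*b` / `a+b` column naming are assumptions made for illustration.

```python
import itertools
import pandas as pd

# Toy numeric features (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "SibSp": [1, 1, 0, 1],
})

# Build one product and one sum feature for every pair of base columns
base_cols = list(toy.columns)
for a, b in itertools.combinations(base_cols, 2):
    toy[a + "*" + b] = toy[a] * toy[b]
    toy[a + "+" + b] = toy[a] + toy[b]

# Absolute pairwise correlations; values near 1 flag near-duplicate features
corr = toy.corr().abs()
print(toy.columns.tolist())
print(corr.round(2))
```

Interaction features multiply quickly (three base columns already give six new ones), so a correlation check like this is one way to spot candidates for removal.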