In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import tree

%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Example files are from the Titanic tutorial on Kaggle:  
https://www.kaggle.com/c/titanic/details/tutorials  
Since we're not submitting to Kaggle, we only need the training set.

In [2]:
train = pd.read_csv("titanic_train.csv")

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


Some features are too complex for this introductory talk, so we simply drop them and accept poorer results.  
That said, better features can of course be engineered from them; see https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling for a nice example.

In [5]:
train.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1, inplace=True)

Pclass is ordinal (1, 2, 3), so keeping it as a number probably wouldn't hurt. Once it is scaled (standardized), though, it becomes harder to interpret.  
Treating it as a categorical variable is just my preference.

In [6]:
train["Pclass"] = train["Pclass"].astype("category")
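
One practical effect of the cast, sketched on a toy frame (the values are made up, only the column name matches the data): `get_dummies()` expands object and category columns by default and passes numeric columns through unchanged.

```python
import pandas as pd

# Toy frame, not the real Titanic data.
toy = pd.DataFrame({"Pclass": [1, 2, 3, 3]})

# Integer dtype: get_dummies leaves the column as it is.
as_number = pd.get_dummies(toy)

# Category dtype: one indicator column per distinct value.
as_category = pd.get_dummies(toy.astype({"Pclass": "category"}))

print(list(as_number.columns))
print(list(as_category.columns))
```

So without the cast, Pclass would simply remain a single numeric column after encoding.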

Most scikit-learn estimators can't handle missing values, so we have to deal with them.  
A quick and dirty way is to fill them with the mean (for numbers) or the mode (for categories) of the existing values.  
A nicer approach would be to estimate each missing value from the other features (i.e. build a model that predicts the value from the information available).

In [7]:
train["Age"] = train["Age"].fillna(train["Age"].mean())
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])
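
The model-based alternative could look roughly like this. It is only a sketch on a made-up toy frame: the choice of `LinearRegression` and of the predictor columns is my assumption for illustration, not something this notebook uses further on.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame with made-up values; only the column names match the Titanic data.
df = pd.DataFrame({
    "Age":   [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 2, 0],
    "Fare":  [7.25, 71.28, 7.93, 53.10, 21.08, 51.86],
})

predictors = ["SibSp", "Parch", "Fare"]  # assumed to contain no missing values
known = df["Age"].notna()

# Fit on the rows where Age is known ...
model = LinearRegression()
model.fit(df.loc[known, predictors], df.loc[known, "Age"])

# ... and predict Age only for the rows where it is missing.
df.loc[~known, "Age"] = model.predict(df.loc[~known, predictors])

print(df["Age"].isna().sum())
```

Rows with a known age keep their original value; only the gaps are replaced by predictions.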

Scikit-learn needs categorical features to be one-hot encoded. The easiest way to do that is to use Pandas' `get_dummies()` function.  
Be aware, though, that this breaks if the set of distinct values differs between training and test set, because the resulting columns no longer match. So either encode before the train-test split or use sklearn's `OneHotEncoder`.
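
As an aside, here is a small sketch of that mismatch and of the `OneHotEncoder` route. The two toy frames and the unseen port "X" are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy splits (hypothetical): the test part lacks "C" and "Q" and has an unseen "X".
train_part = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test_part = pd.DataFrame({"Embarked": ["S", "X"]})

# get_dummies derives the columns from whatever values it sees,
# so the two splits end up with different columns.
dummies_train = pd.get_dummies(train_part)
dummies_test = pd.get_dummies(test_part)
print(list(dummies_train.columns))
print(list(dummies_test.columns))

# OneHotEncoder learns the categories once, on the training part only.
# handle_unknown="ignore" encodes unseen values as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_part)

encoded_test = encoder.transform(test_part).toarray()
print(encoder.categories_[0])
print(encoded_test)
```

The encoder always produces one column per category learned from the training part, whatever shows up later. The cells below stay with `get_dummies()` because the whole file is encoded in one go.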

In [8]:
train = pd.get_dummies(train, dummy_na=True, drop_first=True)

In [9]:
train.columns.values

array(['Survived', 'Age', 'SibSp', 'Parch', 'Fare', 'Pclass_2.0',
       'Pclass_3.0', 'Pclass_nan', 'Sex_male', 'Sex_nan', 'Embarked_Q',
       'Embarked_S', 'Embarked_nan'], dtype=object)

Drop the autogenerated 'nan' columns: they carry no information, because no missing values are left after the filling step above.

In [10]:
train.drop(["Sex_nan", "Pclass_nan", "Embarked_nan"], axis=1, inplace=True)

In [11]:
train.rename(columns={"Pclass_2.0":"Pclass_2", "Pclass_3.0":"Pclass_3"}, inplace=True)

In [12]:
train.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,0,22.0,1,0,7.25,0,1,1,0,1
1,1,38.0,1,0,71.2833,0,0,0,0,0
2,1,26.0,0,0,7.925,0,1,0,0,1
3,1,35.0,1,0,53.1,0,0,0,0,1
4,0,35.0,0,0,8.05,0,1,1,0,1


In [13]:
train.to_csv("titanic.csv", index=False)