## EDA To Prediction (DieTanic)


The notebook is about predicting survival using the Titanic data.

Goal of the notebook: This notebook helps you get familiar with Kishu. Please go through it and follow the guide to try Kishu's different features. You **don't** need to read the code; the markdown for each cell tells you what the cell does. **You just need to read the markdown and then execute the code.**


### Import packages and data

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

data=pd.read_csv('./titanic.csv')

# Try Kishu: Browse Commit Information
Please check the kishuboard; you'll find that another commit has been generated. You can also check the variables and code for that commit in the right panels.

### ETL (fill missing values, convert strings to numeric, etc.)

In [None]:
# impute age feature
# extract the salutation (the word right before the first period) from each name
data['Initial']=data.Name.str.extract(r'([A-Za-z]+)\.',expand=False)
# map rare salutations onto the five groups used below
data['Initial']=data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])
# fill missing ages with the ceiling of the mean age for each salutation group
data.loc[(data.Age.isnull())&(data.Initial=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Initial=='Master'),'Age']=5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Initial=='Other'),'Age']=46
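The fill values in the cell above are hard-coded. As a rough sketch of where such numbers can come from, the snippet below extracts salutations and fills missing ages with the ceiling of the per-salutation mean. The `toy` frame, its rows, and the `fill_values` name are made up for illustration and are not part of `titanic.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real notebook reads these columns from titanic.csv.
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William",
        "Palsson, Master. Gosta",
        "Moran, Mr. James",
        "Sandstrom, Miss. Marguerite",
    ],
    "Age": [22.0, 26.0, 35.0, 2.0, np.nan, np.nan],
})

# Extract the word right before the first period (the salutation).
toy["Initial"] = toy["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Ceiling of the mean known age per salutation (NaN ages are skipped by mean).
fill_values = np.ceil(toy.groupby("Initial")["Age"].mean())

# Fill only the missing ages, using the value for each row's salutation.
toy["Age"] = toy["Age"].fillna(toy["Initial"].map(fill_values))
print(toy[["Initial", "Age"]])
```

On the real data, the same pattern would be applied after the rare salutations have been merged into the five groups.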

In [None]:
# impute embark feature
data['Embarked']=data['Embarked'].fillna('S')

# Age and fare feature band (convert continuous values into categorical values)
data['Age_band']=0
data.loc[data['Age']<=16,'Age_band']=0
data.loc[(data['Age']>16)&(data['Age']<=32),'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head(2)

data['Fare_cat']=0
data.loc[data['Fare']<=7.91,'Fare_cat']=0
data.loc[(data['Fare']>7.91)&(data['Fare']<=14.454),'Fare_cat']=1
data.loc[(data['Fare']>14.454)&(data['Fare']<=31),'Fare_cat']=2
data.loc[(data['Fare']>31)&(data['Fare']<=513),'Fare_cat']=3

# Converting string values into numeric codes
data['Sex']=data['Sex'].replace(['male','female'],[0,1])
data['Embarked']=data['Embarked'].replace(['S','C','Q'],[0,1,2])
data['Initial']=data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4])
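The banding in the cell above is written as one `.loc` assignment per band. A more compact alternative, shown here only as a sketch on made-up sample ages, is `pd.cut` with right-closed bins that match the same thresholds (16, 32, 48, 64). The `ages` series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages, chosen to sit on and around the band edges.
ages = pd.Series([5, 16, 17, 32, 48, 49, 64, 65, 80])

# Right-closed intervals: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
age_band = pd.cut(
    ages,
    bins=[-np.inf, 16, 32, 48, 64, np.inf],
    labels=[0, 1, 2, 3, 4],
).astype(int)

print(age_band.tolist())  # [0, 0, 1, 1, 2, 3, 3, 4, 4]
```

Because `pd.cut` closes each interval on the right by default, an age of exactly 16 falls in band 0 and exactly 32 in band 1, the same as the `<=` comparisons in the cell.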

In [None]:
#drop unneeded features
data.drop(['Name','Age','Ticket','Fare','Cabin','PassengerId','Initial'],axis=1,inplace=True)

### Import the packages for predictive modeling

In [None]:
#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix

Divide the data into train and test sets

In [None]:
# stratify on the label so both splits keep the same survival ratio
train,test=train_test_split(data,test_size=0.3,random_state=0,stratify=data['Survived'])
# 'Survived' is the first column; every other column is a feature
train_X=train[train.columns[1:]]
train_Y=train['Survived']
test_X=test[test.columns[1:]]
test_Y=test['Survived']
X=data[data.columns[1:]]
Y=data['Survived']
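To see what `stratify` does in the split above, here is a small sketch on a made-up frame with 40% positive labels. The `toy` frame and its `feature` column are hypothetical stand-ins for the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 100 rows, 40% positive labels, one dummy feature.
toy = pd.DataFrame({"Survived": [1] * 40 + [0] * 60, "feature": np.arange(100)})

train, test = train_test_split(
    toy, test_size=0.3, random_state=0, stratify=toy["Survived"]
)

# Both splits should keep a positive rate close to the original 0.4.
print(len(train), len(test))
print(train["Survived"].mean(), test["Survived"].mean())
```

Without `stratify`, the positive rate in the test split could drift away from the rate in the full data, which makes accuracy numbers harder to compare.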

Train the radial basis function support vector machine (rbf-SVM)

In [None]:
model=svm.SVC(kernel='rbf',C=1,gamma=0.1)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
# accuracy_score expects the true labels first, then the predictions
print('Accuracy for rbf SVM is ',metrics.accuracy_score(test_Y,prediction))
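The imports above also bring in `confusion_matrix`, which the SVM cell does not use. As a sketch of how it could complement the accuracy score, the snippet below fits the same kind of model on synthetic data from `make_classification`. The data and the names `features`, `labels`, `clf`, `pred`, and `cm` are assumptions for illustration, not the notebook's Titanic features.

```python
from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic features).
features, labels = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels
)

clf = svm.SVC(kernel="rbf", C=1, gamma=0.1)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Rows of the matrix are true classes, columns are predicted classes.
accuracy = metrics.accuracy_score(y_test, pred)
cm = metrics.confusion_matrix(y_test, pred)
print("Accuracy:", accuracy)
print(cm)
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total equals the accuracy.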

# Try Kishu: Checkout and Branch
In the **4th cell**, you converted the age feature into categorical groups, and then **deleted the original ages in the 6th cell** from `data`. Say you now want to try keeping the original age values instead of banding them, to see whether the result improves. Your plan is to check out the commit **right before** the feature banding happens, then try the new feature engineering method.

You can do it as follows:
1. Find the commit where you banded age. Please try both of the following ways to find the commit you need.
   - Find the commit with the corresponding execution count.
   - Search for the first few words of the cell you want.
2. Check out the variables to the commit **before** the feature banding happens.
3. Skip the age banding (delete the whole age banding part), don't drop the ages, and then rerun the model.

You find that the result is no better, so please **check out back to the previous branch**, where you banded the age feature and deleted the original ages.

# Try Kishu: Track Variable (Search + Inspect)
Say you find that the shape of your `data` variable is not what you expected (it should have 15 columns instead of 8), which means some columns were accidentally dropped. You want to see which of your previous executions changed **the shape of** the `data` variable. You can do so in the following steps.
1. Search for all the commits where "data" is changed.
2. Fix the "data" variable in the variable panel, and browse the highlighted commits to see when the shape of "data" changed.
3. Locate the erroneous commit, and check out "data" to the commit before it.
4. Try printing the shape of "data" to see if the checkout was successful.

In [None]:
data.shape
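Besides the shape, it can help to see which columns are missing. The snippet below is a plain pandas sketch, not a Kishu feature. The `expected_columns` list and the `frame` variable are hypothetical stand-ins for your own expectations and for `data`.

```python
import pandas as pd

# Hypothetical expectation and a frame that has lost two columns.
expected_columns = ["Survived", "Pclass", "Sex", "Age", "Fare_cat"]
frame = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1], "Sex": [0, 1]})

# Columns that were expected but are no longer present.
missing = [col for col in expected_columns if col not in frame.columns]
print(frame.shape, missing)
```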

# Try Kishu: Recover
Say the kernel has now restarted and all your previous executions are gone. Try using Kishu to quickly recover the kernel state to where it was before the kernel was shut down.

In [None]:
# run data.shape to see if the recovery was successful
data.shape