# Tutorial (Kishu Group)

This notebook tries to predict passenger survival on the Titanic.

**Goal**: This notebook helps you get familiar with Kishu. You only need to **read the markdown instructions; you do not need to read the code**.

Please make sure you <span style="color: red">run every cell</span> in the task. 

Once you see "CHOOSE ONE", copy, paste, and execute <span style="color: red"> any one of the choices </span> in the cell below.

You are free to run other cells.

At the end, you should be able to:
- **Browse** different commits and **read** detailed commit information
- **Checkout** (i.e., restore) a notebook from a commit and **branch out** to explore different coding routes
- **Search** for variable changes and **inspect** variable type, size, and value
- **Recover** from a Jupyter kernel restart

# Task 1: Data Analysis

## Part 0: Load Libraries and Read Data

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from joblib import dump, load
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier #Random Forest and Bagging (used in Part 3, Choice 1)
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix

%matplotlib inline

In [None]:
data = pd.read_csv('./titanic.csv')

### Try Kishu: Browse Commits

Open Kishuboard. You should find a newly created **commit** in the left panel. Click on the commit.

In the right panel, you should find the **list of executed cells** (top) and the **variables** **as they were when the commit was made** (bottom).

In the variable panel, you should see variable **names**, their **types**, **sizes**, and **value summaries**. Clicking on "view detail" displays **detailed values** of the variables.

**Quiz**: In the second commit, what shape does `data` have? What's its 11th row?
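
To cross-check what Kishuboard reports, the same information can be read in the notebook itself. This is a minimal sketch that uses a small made-up frame (`toy`) in place of the Titanic data. Note that pandas counts rows from 0, so the 11th row sits at position 10.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the real frame comes from titanic.csv.
toy = pd.DataFrame({"PassengerId": range(1, 13), "Age": [20 + i for i in range(12)]})

n_rows, n_cols = toy.shape      # (number of rows, number of columns)
eleventh_row = toy.iloc[10]     # position 10 is the 11th row (0-based)

print(n_rows, n_cols)
print(eleventh_row)
```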

## Part 1: Data Preprocessing

In [None]:
# impute missing age feature
data['Initial']=data.Name.str.extract(r'([A-Za-z]+)\.',expand=False) # extract the salutation (title) from each name
data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
# Fill missing ages with the rounded-up mean age for each title
data.loc[(data.Age.isnull())&(data.Initial=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Initial=='Master'),'Age']=5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Initial=='Other'),'Age']=46

In [None]:
# impute embark feature
data['Embarked'].fillna('S',inplace=True)

# Age and fare feature bands (convert continuous values into categorical values)
data['Age_band']=0
data.loc[data['Age']<=16,'Age_band']=0
data.loc[(data['Age']>16)&(data['Age']<=32),'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head(2)

data['Fare_cat']=0
data.loc[data['Fare']<=7.91,'Fare_cat']=0
data.loc[(data['Fare']>7.91)&(data['Fare']<=14.454),'Fare_cat']=1
data.loc[(data['Fare']>14.454)&(data['Fare']<=31),'Fare_cat']=2
data.loc[(data['Fare']>31)&(data['Fare']<=513),'Fare_cat']=3

# Converting String Values into Numeric
data['Sex'].replace(['male','female'],[0,1],inplace=True)
data['Embarked'].replace(['S','C','Q'],[0,1,2],inplace=True)
data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4],inplace=True)
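
The age banding in the cell above can also be written with `pd.cut`. The sketch below is an alternative formulation, not the code this tutorial runs. It uses made-up ages and the same band edges (16, 32, 48, 64), with each band closed on the right.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to hit every band and both sides of each edge.
ages = pd.Series([5, 16, 17, 32, 40, 48, 60, 64, 70])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
age_band = pd.cut(
    ages,
    bins=[-np.inf, 16, 32, 48, 64, np.inf],
    labels=[0, 1, 2, 3, 4],
).astype(int)

print(age_band.tolist())
```

One difference to keep in mind: `pd.cut` leaves a missing age as missing, whereas the cell above starts every row at band 0. The cast to `int` in this sketch therefore assumes the ages have already been imputed.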

### Try Kishu: Fold and Unfold Commits

Try unfolding the commits to see every commit, then fold them again.

## Part 2: Feature Engineering

In [None]:
#drop unneeded features
data.drop(['Name','Ticket','Fare','Cabin','PassengerId','Initial','Age'],axis=1,inplace=True)

## Part 3: Model Training

In [None]:
# divide the data into train and test sets
train,test=train_test_split(data,test_size=0.3,random_state=0,stratify=data['Survived'])
train_X=train[train.columns[1:]]
train_Y=train[train.columns[:1]]
test_X=test[test.columns[1:]]
test_Y=test[test.columns[:1]]
X=data[data.columns[1:]]
Y=data['Survived']

### Select the model to use (CHOOSE ONE)

Let's try either of the two models.

1. **Choice 1**: Bagging KNN
```python
model=BaggingClassifier(KNeighborsClassifier(n_neighbors=3),random_state=0,n_estimators=700)
model.fit(train_X, train_Y)
```
2. **Choice 2**: rbf-SVM
```python
model=svm.SVC(kernel='rbf',C=1,gamma=0.1)
model.fit(train_X,train_Y)
```

In [None]:
prediction=model.predict(test_X)
accu = metrics.accuracy_score(test_Y,prediction) # arguments are (y_true, y_pred)
print('Accuracy for the model is ',accu)

## Part 4: Alternative Feature Engineering

### Try Kishu: Checkout and Branch

Suppose you would like to explore a new approach. Instead of the 5th cell (if executed in order):

```python
data.drop(['Name','Ticket','Fare','Cabin','PassengerId','Initial','Age'],axis=1,inplace=True)
```

You would like to **change the code** to not drop the `Age` column:

```python
data.drop(['Name','Ticket','Fare','Cabin','PassengerId','Initial'],axis=1,inplace=True)
```
You'll need to replace the feature engineering step and rerun Part 3 (Model Training) with the new features.

Thanks to Kishu's **checkout and branch**, you can do this without re-executing every cell! To do so:

1. **Find the commit to restore**
   
   **<span style="color:red">Important Quiz</span>**: Which commit do you need to check out? Discuss the answer with the staff.
2. **Checkout** variables by right-clicking the commit > "Checkout" > "Variables".
3. **Modify** the cell to not drop the `Age` column.
4. **Branch out** by executing the modified cell and the following cells as usual.
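
Conceptually, checking out variables restores the notebook's variable state to what it was at a commit, and running cells from there starts a new branch. The toy sketch below illustrates that idea only. `ToyHistory`, `commit`, and `checkout` are hypothetical names, and this is not how Kishu is implemented.

```python
import copy


class ToyHistory:
    """Hypothetical illustration: store deep copies of a variable namespace."""

    def __init__(self):
        self._snapshots = []

    def commit(self, namespace):
        # Deep copy so later in-place changes do not alter the snapshot.
        self._snapshots.append(copy.deepcopy(namespace))
        return len(self._snapshots) - 1  # commit id

    def checkout(self, commit_id):
        # Return a fresh copy so the stored snapshot stays reusable.
        return copy.deepcopy(self._snapshots[commit_id])


history = ToyHistory()
state = {"values": [1, 2]}
first = history.commit(state)

state["values"].append(3)        # original route
history.commit(state)

state = history.checkout(first)  # restore the earlier state
state["values"].append(99)       # new route, branching from the first commit
print(state["values"])
```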

**Quiz**: What was the `accu` score before you checked out? Try to find it directly in Kishuboard.

Answer: XXXX.XXXX

___________________________

Unfortunately, not dropping `Age` leads to no significant improvement.

**Quiz**: Revert by **checking out** the original commit, then export the original model by running the code below:
```python
dump(model, 'tutorial.joblib')
```
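
To confirm that an exported model can be read back, a round trip with `load` is one option. The sketch below assumes a tiny made-up model and a temporary file (`toy_model.joblib` is a hypothetical name), so it does not touch `tutorial.joblib`.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, only to have a fitted model to export.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
toy_model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    dump(toy_model, path)
    restored = load(path)

# The restored model should predict exactly what the original predicts.
same_predictions = list(restored.predict(X_toy)) == list(toy_model.predict(X_toy))
print(same_predictions)
```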

# Task 2: Identifying

### Try Kishu: Search and Track Variable Changes

Suppose you would like to find when `data` has changed, for example, when its number of columns changes from 15 to 8. 

Luckily, Kishu's Search and Inspect enables you to do this easily! To do so:

1. **Locate** the search bar at the top.
2. **Select** "Select search target" > "variable changes".
3. **Search** for the variable `data` by typing its name in the search bar and clicking the search icon.
4. **Unfold** any folded commit that is highlighted before inspecting it.
5. **Inspect** `data` in the highlighted commits to find the one that changed its shape.
   - *Hint: Variable size is shown in the bottom right panel.*
   - *Tip: You can pin `data` to the top of the panel by typing `data` into "Add a watch to the variable you want to inspect".*

**Quiz**: Find which columns were dropped when the number of columns in `data` changed from 15 to 8.
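
Once two versions of a frame are in hand, the column change between them can be computed rather than read by eye. This is a minimal sketch with made-up frames (`before` and `after` are hypothetical names, not variables from this notebook).

```python
import pandas as pd

# Hypothetical snapshots of one frame at two commits (made-up column names).
before = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})
after = before.drop(columns=["b", "d"])

# Index.difference returns the labels present in one index but not the other.
dropped = before.columns.difference(after.columns).tolist()
added = after.columns.difference(before.columns).tolist()

print(before.shape, "->", after.shape)
print("dropped:", dropped, "added:", added)
```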


**Quiz**: Find the commit(s) in which `X` changed.

In [None]:
data.shape

### Try Kishu: Recover

Suppose the kernel is restarted (please restart it manually via "Kernel" > "Restart Kernel...").

Kishu helps you **recover** your previous work!

To do this,
1. **Checkout** the latest commit (right-clicking the commit > "Checkout" > "Variables").
2. That's it!

In [None]:
# Print accu to check that the recovery succeeded
print(accu)