### Importing libraries

In [2]:
import altair as alt #best library for visualization!
import numpy as np
import pandas as pd

from collections import Counter

#Importing Models and metrics
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier


### Reading Data

In [3]:
df_train = pd.read_csv('train.csv')

## Analyzing Data

I am starting with `.info()` as it gives a quick glance at the columns/features along with their datatypes (numerical or categorical). It also gives a head start on the null values.

In [4]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


We have 12 columns, with null values in `Age` (177 missing), `Cabin` (687 missing) and `Embarked` (2 missing).

`.head()` shows a glimpse of the dataset to give a better understanding of the datatypes.
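The null counts can also be read off directly instead of subtracting from the row count. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not the Titanic data); the same two calls should work on `df_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

null_counts = toy.isnull().sum()            # missing values per column
null_share = toy.isnull().mean().round(2)   # fraction of rows missing per column
print(pd.DataFrame({"nulls": null_counts, "share": null_share}))
```

Looking at the share as well as the count makes it easier to judge whether a column is worth imputing or better dropped.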

In [5]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


`Survived, Sex, Embarked` are categorical and `Pclass` is ordinal.

`Age, Fare, SibSp, Parch` are numerical, of which the first two are continuous.

`Cabin` is alphanumeric.

I am ignoring the other features, `PassengerId, Name, Ticket`, as they probably don't affect the outcome. (Do they??)

We can get a better understanding of each numerical feature with the help of `.describe()`.

In [6]:
df_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


1. Total passenger count is 891.
2. `Survived` is categorical but encoded as binary, where 1 means survived.
3. Most of the passengers belonged to third class.
4. 75% of the passengers with a known age are 38 or younger.
5. Most of the passengers travelled without siblings or a spouse.
6. More than 75% of passengers travelled without parents or children.
7. Though about 75% of passengers paid 31$ or less, a few passengers paid as much as 512$.


### Outliers

Before we do any analysis, it's better to take care of outliers as they might bias our assumptions.

There are various strategies available to detect outliers. One of the simplest is Tukey's method. From the `describe()` output above, 75% of passengers paid 31$ or less for `Fare`, but the max is 512$, which is a drastic jump and might heavily affect our prediction if `Fare` is an important feature. Tukey's method is very helpful for taking care of such outliers. It takes the interquartile range (IQR), the distance between the 25th and 75th percentiles, and flags any observation below the 25th percentile minus 1.5×IQR or above the 75th percentile plus 1.5×IQR as an outlier. The 1.5 multiplier (the outlier step) can be varied to make the rule stricter or more lenient.
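As a worked example of the rule, the fences for `Fare` can be computed by hand from the quartiles reported by `describe()` (7.9104 and 31.0). This is only a sketch of the arithmetic; `is_outlier` is a hypothetical helper, not part of the notebook.

```python
# Quartiles of Fare as reported by describe() above
q1, q3 = 7.9104, 31.0

iqr = q3 - q1              # interquartile range
step = 1.5 * iqr           # conventional Tukey multiplier
lower_fence = q1 - step
upper_fence = q3 + step

print(f"IQR = {iqr:.4f}")
print(f"fences = [{lower_fence:.4f}, {upper_fence:.4f}]")

def is_outlier(fare):
    """Hypothetical helper: True if fare lies outside the Tukey fences."""
    return fare < lower_fence or fare > upper_fence

print(is_outlier(512.3292), is_outlier(14.4542))
```

Because fares cannot be negative, only the upper fence matters here: any fare above roughly 65.63 counts as an outlier for this column.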

In [7]:
def outliers(df, columns, num):
    """
    Return the index labels of rows in df that are Tukey outliers in more than
    num of the given columns.
    """
    outlier_idx = []
    for col in columns:
        # Note: np.percentile returns NaN when the column has missing values
        # (as Age does here), so no row gets flagged for such a column.
        P25 = np.percentile(df[col], 25) # 25th percentile
        P75 = np.percentile(df[col], 75) # 75th percentile
        Prange = P75 - P25 # Interquartile range
        outlier_step = 1.5*Prange # Leniency for values just outside the range
        # Rows whose value in this column falls outside the fences
        list_of_idx = df[(df[col] < P25 - outlier_step) | (df[col] > P75 + outlier_step)].index
        # Collect the indices found for every column in one common list
        outlier_idx.extend(list_of_idx)
    # Count how many columns flagged each row
    outlier_idx = Counter(outlier_idx)
    # Keep only rows flagged in more than num columns
    final_outliers = list(i for i, c in outlier_idx.items() if c > num)

    return final_outliers

In [8]:
outliers_list = outliers(df_train, ['Age','SibSp','Parch','Fare'],2)
df_train.loc[outliers_list]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
159,160,0,3,"Sage, Master. Thomas Henry",male,,8,2,CA. 2343,69.55,,S
180,181,0,3,"Sage, Miss. Constance Gladys",female,,8,2,CA. 2343,69.55,,S
201,202,0,3,"Sage, Mr. Frederick",male,,8,2,CA. 2343,69.55,,S
324,325,0,3,"Sage, Mr. George John Jr",male,,8,2,CA. 2343,69.55,,S
341,342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0,C23 C25 C27,S
792,793,0,3,"Sage, Miss. Stella Anna",female,,8,2,CA. 2343,69.55,,S
846,847,0,3,"Sage, Mr. Douglas Bullen",male,,8,2,CA. 2343,69.55,,S
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.55,,S


Looks like rows with very high `SibSp` (8) and `Fare` (263$) are considered outliers.

In [9]:
# df_train = df_train.drop(outliers_list)
# df_train.info()

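Out of 891 entries, 10 are flagged as outliers; dropping them would leave 881 entries. The drop above is commented out, though, so all 891 rows are kept for the rest of the analysis.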

### Plotting Analysis

Our main priority is to check which features are highly correlated with the survival rate. We can do that by pivoting on features or, better, by visualizing with plots.

In [10]:
df_train.groupby(['Pclass'],as_index=False)['Survived'].mean()

Unnamed: 0,Pclass,Survived
0,1,0.62963
1,2,0.472826
2,3,0.242363


This clearly shows the survival rate is higher in first (63%) and second class (47%) than in third class (24%).

In [11]:
df_train.groupby(['Sex'],as_index=False)['Survived'].mean()

Unnamed: 0,Sex,Survived
0,female,0.742038
1,male,0.188908


Almost 75% of women and fewer than 20% of men survived.

In [12]:
df_train.groupby(['SibSp','Parch'],as_index=False)['Survived'].mean()

Unnamed: 0,SibSp,Parch,Survived
0,0,0,0.303538
1,0,1,0.657895
2,0,2,0.724138
3,0,3,1.0
4,0,4,0.0
5,0,5,0.0
6,1,0,0.520325
7,1,1,0.596491
8,1,2,0.631579
9,1,3,0.333333


I can't see a clear correlation of `SibSp` and `Parch` with the survival rate, but will have a detailed look in the plots.

#### Visualizing

From the above tables, we can see first/second class passengers and women have a better survival rate (Nicely done Rose :) ) compared to third class passengers and men (poor Jack '_').

What about `Age` and `Fare`?

In [13]:
alt.Chart(df_train).mark_bar().encode(
    alt.X(alt.repeat("row"),bin=alt.Bin(maxbins=20)),
    y = 'count()',
    color = 'Survived:N',
    tooltip = 'count()'
).repeat(
    row = ['Age', 'Fare']
).interactive()



The graphs are interactive: hover over the bars to view exact counts (cool!)

It's better to take a closer look at the passengers who paid more than 100$.

In [14]:
df_fare = df_train.loc[df_train['Fare'] > 100,:]
alt.Chart(df_fare).mark_bar().encode(
    alt.X('Fare',bin=alt.Bin(step=50)),
    y = 'count()',
    color = 'Survived:N',
    tooltip = 'count()'
).interactive()

Seems like most of the high fare payers survived.

When we look at the graphs, `Age` looks roughly normally distributed, but `Fare` appears heavily right-skewed, which is not desirable. So we can transform it with the log function to reduce the skewness.
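Skewness can also be put into a number rather than judged by eye. The sketch below uses synthetic right-skewed values (an assumption standing in for `Fare`, not the Titanic data) and compares `Series.skew()` before and after a log transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "fares"; these are not the Titanic values
fares = pd.Series(rng.lognormal(mean=2.5, sigma=1.0, size=1000))

logged = np.log(fares)          # every synthetic fare is > 0, so log is defined
skew_before = fares.skew()      # strongly positive for right-skewed data
skew_after = logged.skew()      # close to 0 for a roughly symmetric shape
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

When a column contains zeros, `np.log1p` avoids the special case for zero, though it produces slightly different values than a plain log with zeros mapped to 0.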

In [15]:
df_train['Fare'] = df_train['Fare'].map(lambda x: np.log(x) if x > 0 else 0)

In [16]:
bar = alt.Chart(df_train).mark_bar().encode(
    alt.X('Fare',bin=alt.Bin(maxbins=20)),
    y = 'count()',
    color = 'Survived:N'
).interactive()
line = alt.Chart(df_train).mark_line(color = 'red').encode(
    alt.X('Fare',bin=alt.Bin(maxbins=20)),
    y = 'count()'
).interactive()
bar + line


Looks better!

In [17]:
alt.Chart(df_train).mark_bar().encode(
    alt.X(alt.repeat("column"), type = 'nominal'),
    y = 'count()',
    color = 'Survived:N',
    tooltip = 'count()'
).repeat(
    column = ['Pclass','Sex']
).interactive()

We can confirm `Pclass` and `Sex` clearly impact survival.

In [18]:
alt.Chart(df_train).mark_bar().encode(
    alt.X(alt.repeat("row"), type = 'nominal'),
    y = 'count()',
    color = 'Survived:N',
    tooltip = 'count()'
).repeat(
    row = ['SibSp','Parch']
).interactive()


It is unclear, but passengers with fewer siblings/spouses seem to have a better record of surviving. From `Parch`, no evident conclusion can be drawn.

In [19]:
alt.Chart(df_train).mark_bar().encode(
    x = 'Embarked',
    y = 'count()',
    color = 'Survived:N',
    tooltip = ['count()','Sex:N']
).interactive()

From the above plots, we can say that `Pclass, Sex` and `Fare` clearly impact the survival rate. It might not be entirely clear, but `Age` and `Embarked` may also have an impact. `SibSp` and `Parch` may not be very useful features, but as our dataset is not very large, it won't hurt to keep them around.
We did not talk about `Cabin` as it has a lot of null values. Trying to fill them and establish a correlation might lead to errors, so it's better to ignore this feature. We therefore drop `Cabin` along with `PassengerId, Name` and `Ticket`, and keep the remaining seven features to build the model.

In [20]:
df_train = df_train.drop(['PassengerId','Name','Ticket','Cabin'], axis=1)
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


### Null values

We know that `Age` is an important feature but it has quite a few missing values. There are different methods to fill null values. The simplest one is to fill them with random values drawn using the mean and standard deviation. But a better approach would be to estimate the age from its correlation with other features.

In [21]:
alt.Chart(df_train).mark_boxplot().encode(
    alt.X(alt.repeat("column")),
    y = 'Age:Q'
).repeat(
    column = ['Sex','Pclass','Parch','SibSp']
).interactive()



From the above boxplots, a fair amount of correlation can be seen. Looks like `Sex` is not influencing `Age`. But there seems to be some influence from `Pclass, Parch` and `SibSp`. A correlation matrix might give more detailed information.

`Sex` is categorical. In order to check its correlation, I will convert it to numerical.

In [22]:
df_train['Sex'] = df_train['Sex'].map({
    'male' :0, 'female' : 1
})

In [23]:
df_train[['Age','Sex','Pclass','SibSp','Parch']].corr()

Unnamed: 0,Age,Sex,Pclass,SibSp,Parch
Age,1.0,-0.093254,-0.369226,-0.308247,-0.189119
Sex,-0.093254,1.0,-0.1319,0.114631,0.245489
Pclass,-0.369226,-0.1319,1.0,0.083081,0.018443
SibSp,-0.308247,0.114631,0.083081,1.0,0.414838
Parch,-0.189119,0.245489,0.018443,0.414838,1.0


Looks like `Age` is negatively correlated with `Pclass, SibSp` and `Parch`. So I will fill each missing age with the median age of the rows that share the same values of these correlated features.

In [24]:
index_NaN_age = list(df_train["Age"][df_train["Age"].isnull()].index)

for idx in index_NaN_age :
    # Overall median, used as a fallback when no similar row has a known age
    age_med = df_train["Age"].median()
    row = df_train.loc[idx]
    # Median age of the rows sharing SibSp, Parch and Pclass with this row
    age_pred = df_train["Age"][((df_train['SibSp'] == row["SibSp"]) & (df_train['Parch'] == row["Parch"]) & (df_train['Pclass'] == row["Pclass"]))].median()
    # Assign through .loc on the DataFrame itself; chained indexing such as
    # df_train['Age'].iloc[idx] = ... raises SettingWithCopyWarning
    if not np.isnan(age_pred) :
        df_train.loc[idx, 'Age'] = age_pred
    else :
        df_train.loc[idx, 'Age'] = age_med


In [25]:
df_train['Age'].isnull().sum()

0
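The group-median fill used for `Age` can also be written without an explicit loop. The following is a sketch on a made-up toy frame (the `toy` DataFrame and the `Age_filled` column are assumptions for illustration). Unlike a row-by-row loop, it computes every median from the originally known ages only, so results could differ slightly from the notebook's.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for df_train; the values are made up
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 2],
    "SibSp":  [0, 0, 0, 1, 1, 0],
    "Parch":  [0, 0, 0, 0, 0, 2],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, np.nan],
})

# Median age within each (Pclass, SibSp, Parch) group, aligned to the rows
group_median = toy.groupby(["Pclass", "SibSp", "Parch"])["Age"].transform("median")

# Group median first, then the overall median where the whole group is unknown
toy["Age_filled"] = toy["Age"].fillna(group_median).fillna(toy["Age"].median())
print(toy)
```

The last row has no similar passenger with a known age, so it falls back to the overall median, mirroring the `else` branch of the loop.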

### One hot encoding

In [26]:
df_train = pd.get_dummies(df_train, columns=['Embarked'], prefix = 'Em') #Prefix to decrease the name length
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Em_C,Em_Q,Em_S
0,0,3,0,22.0,1,0,1.981001,0,0,1
1,1,1,1,38.0,1,0,4.266662,1,0,0
2,1,3,1,26.0,0,0,2.070022,0,0,1
3,1,1,1,35.0,1,0,3.972177,0,0,1
4,0,3,0,35.0,0,0,2.085672,0,0,1


## Building Models

Our dataset is almost ready to be used to build a ML model. We will split the features and output.

In [27]:
X_train = df_train.drop('Survived', axis=1)
Y_train = df_train['Survived']
X_train

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Em_C,Em_Q,Em_S
0,3,0,22.0,1,0,1.981001,0,0,1
1,1,1,38.0,1,0,4.266662,1,0,0
2,3,1,26.0,0,0,2.070022,0,0,1
3,1,1,35.0,1,0,3.972177,0,0,1
4,3,0,35.0,0,0,2.085672,0,0,1
...,...,...,...,...,...,...,...,...,...
886,2,0,27.0,0,0,2.564949,0,0,1
887,1,1,19.0,0,0,3.401197,0,0,1
888,3,1,13.5,1,2,3.154870,0,0,1
889,1,0,26.0,0,0,3.401197,1,0,0


### What model to choose?

We have labeled data and need to predict whether a passenger survived or not, which is a binary classification problem in a supervised learning setting.

#### Logistic Regression

It is one of the simplest models in machine learning. It's always good to start with a simple model, as we usually prefer simpler models over complex ones.

In [28]:
mod_log = LogisticRegression(max_iter=1000)
mod_log.fit(X_train, Y_train)
Y_pred = mod_log.predict(X_train)
accuracy_score(Y_train, Y_pred)

0.8002244668911336

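This score is computed on the same data the model was trained on, so it is likely optimistic. Cross validation gives a better picture of model performance because each fold is scored on data the model has not seen during fitting.

In general, k-fold cross validation splits the data into folds without looking at the labels, so some folds may end up with a class balance quite different from the full set. So instead we will use stratified k-fold, which keeps the proportion of survivors roughly the same in every fold.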
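The difference between the two splitters is easiest to see on labels that are sorted by class. This is a minimal sketch with made-up labels (30 positives followed by 70 negatives, an assumption chosen to exaggerate the effect); `positive_rates` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels: 30 positives followed by 70 negatives (made up)
y = np.array([1] * 30 + [0] * 70)
X = np.zeros((len(y), 1))  # feature values do not matter for the split itself

def positive_rates(splitter):
    """Share of positives in each validation fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]

plain_rates = positive_rates(KFold(n_splits=5))
strat_rates = positive_rates(StratifiedKFold(n_splits=5))
print("KFold:          ", plain_rates)
print("StratifiedKFold:", strat_rates)
```

With plain `KFold` and no shuffling, some folds contain only one class, while the stratified folds all keep the overall 30% positive rate.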

In [29]:
kfold = StratifiedKFold(n_splits=10)

In [30]:
mod_log = LogisticRegression(max_iter=1000)
scores = cross_val_score(mod_log, X_train, Y_train,cv = kfold)
pd.DataFrame(scores)

Unnamed: 0,0
0,0.788889
1,0.797753
2,0.752809
3,0.842697
4,0.786517
5,0.764045
6,0.786517
7,0.786517
8,0.775281
9,0.820225


In [31]:
scores.mean()

0.7901248439450687

The scores above are accuracy scores. But depending on the application, we might need other performance metrics.
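As a quick reminder of how these metrics relate, here is a worked example from confusion-matrix counts. The counts are hypothetical, chosen only to illustrate the arithmetic for one validation fold of 90 passengers.

```python
# Hypothetical confusion-matrix counts for one validation fold of 90 passengers
tp, fp, fn, tn = 24, 8, 11, 47

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy counts every correct prediction equally, whereas precision and recall look only at the survivor class, which is why they can tell a different story when one class is easier to predict than the other.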

In [32]:
mod_log = LogisticRegression(max_iter=1000)
log_scores = cross_validate(
    mod_log, X_train, Y_train,cv = kfold, scoring={'Accuracy':make_scorer(accuracy_score), 'Precision':make_scorer(precision_score), 'Recall':make_scorer(recall_score), 'F1-score':make_scorer(f1_score)}
    )
mod_log_scores = pd.DataFrame(log_scores)
mod_log_scores

Unnamed: 0,fit_time,score_time,test_Accuracy,test_Precision,test_Recall,test_F1-score
0,0.019957,0.002758,0.788889,0.75,0.685714,0.716418
1,0.021358,0.00286,0.797753,0.735294,0.735294,0.735294
2,0.019361,0.003203,0.752809,0.730769,0.558824,0.633333
3,0.018726,0.002794,0.842697,0.777778,0.823529,0.8
4,0.021765,0.004473,0.786517,0.682927,0.823529,0.746667
5,0.024462,0.002584,0.764045,0.709677,0.647059,0.676923
6,0.010801,0.004689,0.786517,0.758621,0.647059,0.698413
7,0.016872,0.002561,0.786517,0.758621,0.647059,0.698413
8,0.016407,0.002512,0.775281,0.71875,0.676471,0.69697
9,0.032118,0.00285,0.820225,0.787879,0.742857,0.764706


In [33]:
print(f"Mean of accuracy is {mod_log_scores['test_Accuracy'].mean()} with a deviation of {mod_log_scores['test_Accuracy'].std()}")
print(f"Mean of f1-score is {mod_log_scores['test_F1-score'].mean()} with a deviation of {mod_log_scores['test_F1-score'].std()}")

Mean of accuracy is 0.7901248439450687 with a deviation of 0.025951960417775678
Mean of f1-score is 0.7167136081165932 with a deviation of 0.04723183385329963
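The same four metrics are requested for every model in this notebook, so the scoring definition could be written once and reused. The sketch below is one way to do that, assuming scikit-learn's built-in scorer names; the `SCORING` constant and the synthetic data are assumptions for illustration. Note that the result columns are then named `test_accuracy`, `test_f1` and so on, rather than `test_Accuracy` and `test_F1-score`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary-classification data, used only to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Built-in scorer names; an alternative to wrapping each metric in make_scorer
SCORING = ["accuracy", "precision", "recall", "f1"]

results = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=StratifiedKFold(n_splits=5), scoring=SCORING,
)
# Keep only the score columns and summarize them across folds
summary = pd.DataFrame(results).filter(like="test_").agg(["mean", "std"])
print(summary)
```

Summarizing with `agg(["mean", "std"])` gives the mean and deviation of every metric in one table instead of one print statement per metric.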


#### K Nearest Neighbors

KNN is one of the simplest models for classification and pattern recognition. As the name says, it predicts the output for a sample from the outputs of its nearest neighbors in feature space.

In [34]:
mod_knn = KNeighborsClassifier()
knn_scores = cross_validate(
    mod_knn, X_train, Y_train, cv = kfold, scoring={'Accuracy':make_scorer(accuracy_score), 'Precision':make_scorer(precision_score), 'Recall':make_scorer(recall_score), 'F1-score':make_scorer(f1_score)}
)
mod_knn_scores = pd.DataFrame(knn_scores)
mod_knn_scores

Unnamed: 0,fit_time,score_time,test_Accuracy,test_Precision,test_Recall,test_F1-score
0,0.001944,0.00588,0.777778,0.777778,0.6,0.677419
1,0.002369,0.008737,0.752809,0.730769,0.558824,0.633333
2,0.002808,0.006068,0.707865,0.653846,0.5,0.566667
3,0.002709,0.004651,0.797753,0.735294,0.735294,0.735294
4,0.001755,0.004601,0.775281,0.71875,0.676471,0.69697
5,0.001718,0.003589,0.719101,0.68,0.5,0.576271
6,0.001429,0.003212,0.820225,0.8,0.705882,0.75
7,0.001309,0.003772,0.797753,0.722222,0.764706,0.742857
8,0.002205,0.003357,0.797753,0.735294,0.735294,0.735294
9,0.001414,0.003175,0.786517,0.766667,0.657143,0.707692


In [35]:
mod_knn_scores['test_Accuracy'].mean()

0.7732833957553058

There is a mistake in the way we approached KNN. Since KNN is a distance-based algorithm, it's better to bring all the features to the same scale. This should give us better results.
Scaling can be done with the preprocessing methods available in sklearn. `StandardScaler` centers each feature to zero mean and scales it to unit variance. If we need more robust scaling, `QuantileTransformer` is also a good option, and it does a good job of dealing with outliers too.
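One caveat with scaling before cross validation is that the scaler then sees the validation folds too. A common alternative, sketched here on synthetic data (this is not what the notebook does, and the data and variable names are assumptions), is to put the scaler and the model in a pipeline so that the scaler is refit on the training part of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature's scale is exaggerated on purpose
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 100.0

cv = StratifiedKFold(n_splits=10)

# KNN on the raw features
raw_scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)

# Scaler + KNN: the scaler is fit on the training folds only
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_scores = cross_val_score(scaled_knn, X, y, cv=cv)

print(f"raw: {raw_scores.mean():.3f}, scaled: {scaled_scores.mean():.3f}")
```

The pipeline object can be passed to `cross_validate` or `GridSearchCV` in the same way as a bare estimator.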

In [36]:
X_train_scaled = StandardScaler().fit_transform(X_train)
pd.DataFrame(X_train_scaled, columns=X_train.columns)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Em_C,Em_Q,Em_S
0,0.827377,-0.737695,-0.545734,0.432793,-0.473674,-0.910717,-0.482043,-0.307562,0.619306
1,-1.566107,1.355574,0.655962,0.432793,-0.473674,1.369616,2.074505,-0.307562,-1.614710
2,0.827377,1.355574,-0.245310,-0.474545,-0.473674,-0.821904,-0.482043,-0.307562,0.619306
3,-1.566107,1.355574,0.430644,0.432793,-0.473674,1.075818,-0.482043,-0.307562,0.619306
4,0.827377,-0.737695,0.430644,-0.474545,-0.473674,-0.806291,-0.482043,-0.307562,0.619306
...,...,...,...,...,...,...,...,...,...
886,-0.369365,-0.737695,-0.170204,-0.474545,-0.473674,-0.328130,-0.482043,-0.307562,0.619306
887,-1.566107,1.355574,-0.771053,-0.474545,-0.473674,0.506169,-0.482043,-0.307562,0.619306
888,0.827377,1.355574,-1.184136,0.432793,2.008933,0.260416,-0.482043,-0.307562,0.619306
889,-1.566107,-0.737695,-0.245310,-0.474545,-0.473674,0.506169,2.074505,-0.307562,-1.614710


In [37]:
mod_knn = KNeighborsClassifier()
knn_scores = cross_validate(
    mod_knn, X_train_scaled, Y_train, cv = kfold, scoring={'Accuracy':make_scorer(accuracy_score), 'Precision':make_scorer(precision_score), 'Recall':make_scorer(recall_score), 'F1-score':make_scorer(f1_score)}
)
mod_knn_scores = pd.DataFrame(knn_scores)
mod_knn_scores

Unnamed: 0,fit_time,score_time,test_Accuracy,test_Precision,test_Recall,test_F1-score
0,0.000937,0.004236,0.755556,0.685714,0.685714,0.685714
1,0.000739,0.003427,0.842697,0.833333,0.735294,0.78125
2,0.000909,0.00344,0.741573,0.73913,0.5,0.596491
3,0.000815,0.00319,0.820225,0.714286,0.882353,0.789474
4,0.000662,0.002898,0.797753,0.722222,0.764706,0.742857
5,0.000682,0.002804,0.820225,0.821429,0.676471,0.741935
6,0.000662,0.002828,0.820225,0.8,0.705882,0.75
7,0.000857,0.003339,0.775281,0.75,0.617647,0.677419
8,0.001012,0.003158,0.786517,0.714286,0.735294,0.724638
9,0.000633,0.002939,0.820225,0.827586,0.685714,0.75


In [38]:
mod_knn_scores['test_Accuracy'].mean()

0.7980274656679149

In [39]:
mod_knn.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

The number of neighbors can have a lot of influence on the model. By default, it is set to 5. But we can try different settings, and this can be easily done with `GridSearchCV`.

In [40]:
grid_knn = GridSearchCV(
    estimator = mod_knn,
    param_grid={'n_neighbors': [1,2,3,4,5,6,7,8,9,10]},
    cv = kfold,
    scoring = {'Accuracy':make_scorer(accuracy_score), 'Precision':make_scorer(precision_score), 'Recall':make_scorer(recall_score), 'F1-score':make_scorer(f1_score)},
    refit = 'Accuracy' # Metric used to pick best_params_ and refit the final model
)
grid_knn.fit(X_train_scaled, Y_train)
mod_knn_scores = pd.DataFrame(grid_knn.cv_results_)
mod_knn_scores

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,params,split0_test_Accuracy,split1_test_Accuracy,split2_test_Accuracy,split3_test_Accuracy,...,split3_test_F1-score,split4_test_F1-score,split5_test_F1-score,split6_test_F1-score,split7_test_F1-score,split8_test_F1-score,split9_test_F1-score,mean_test_F1-score,std_test_F1-score,rank_test_F1-score
0,0.000728,8.1e-05,0.002935,0.000243,1,{'n_neighbors': 1},0.7,0.696629,0.707865,0.764045,...,0.72,0.704225,0.712329,0.677419,0.656716,0.704225,0.714286,0.676433,0.039314,9
1,0.000721,0.000179,0.002869,0.000303,2,{'n_neighbors': 2},0.744444,0.741573,0.730337,0.775281,...,0.6875,0.701754,0.701754,0.690909,0.655172,0.709677,0.677966,0.661571,0.045736,10
2,0.000747,0.000224,0.002925,0.000356,3,{'n_neighbors': 3},0.722222,0.831461,0.730337,0.831461,...,0.788732,0.764706,0.8,0.761905,0.677419,0.753623,0.787879,0.737393,0.060992,2
3,0.000684,0.00012,0.002868,0.000204,4,{'n_neighbors': 4},0.766667,0.831461,0.764045,0.820225,...,0.764706,0.761905,0.689655,0.758621,0.666667,0.733333,0.709677,0.711518,0.047383,8
4,0.000731,0.000164,0.003373,0.000554,5,{'n_neighbors': 5},0.755556,0.842697,0.741573,0.820225,...,0.789474,0.742857,0.741935,0.75,0.677419,0.724638,0.75,0.723978,0.054303,7
5,0.000674,9.8e-05,0.002983,0.000194,6,{'n_neighbors': 6},0.8,0.808989,0.764045,0.853933,...,0.816901,0.793651,0.733333,0.77193,0.642857,0.769231,0.688525,0.725615,0.061056,6
6,0.00082,0.000386,0.003054,0.000285,7,{'n_neighbors': 7},0.8,0.842697,0.764045,0.853933,...,0.821918,0.835821,0.78125,0.806452,0.633333,0.746269,0.764706,0.754234,0.065633,1
7,0.000695,0.000137,0.002978,0.000266,8,{'n_neighbors': 8},0.833333,0.831461,0.752809,0.853933,...,0.821918,0.84375,0.711864,0.758621,0.596491,0.75,0.757576,0.736911,0.073589,4
8,0.000702,0.000175,0.003202,0.000606,9,{'n_neighbors': 9},0.777778,0.831461,0.752809,0.831461,...,0.8,0.848485,0.741935,0.745763,0.62069,0.727273,0.776119,0.735247,0.066776,5
9,0.000657,7.9e-05,0.003213,0.000279,10,{'n_neighbors': 10},0.822222,0.853933,0.764045,0.842697,...,0.810811,0.8125,0.721311,0.745763,0.631579,0.75,0.738462,0.737083,0.060318,3


In [41]:
#Plotting graph for better visualization
alt.Chart(mod_knn_scores).transform_fold(
    ['mean_test_Accuracy','mean_test_Precision', 'mean_test_Recall','mean_test_F1-score'],
).mark_line().encode(
    x = 'param_n_neighbors',
    y = alt.Y('value', type = 'quantitative', title = 'Score'),
    color = alt.Color('key', type = 'nominal', title = 'Metrics')
).interactive()



Hmm, seems like the performance metrics are swinging up and down...
One likely reason is that with an even number of neighbors, the vote between the two classes can end in a tie. So it's usually advised to choose an odd k.
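A tiny example makes the tie visible. This sketch uses three made-up one-dimensional points (an assumption for illustration) and looks at the class vote shares for a query point with k = 2 and k = 3.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Three made-up training points on a line, with their class labels
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0, 1, 1])
query = np.array([[0.4]])

probas = {}
for k in (2, 3):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # With uniform weights, predict_proba is the share of neighbor votes per class
    probas[k] = knn.predict_proba(query)[0]
    print(f"k={k}: vote shares {probas[k]}")
```

With k = 2 the two nearest neighbors belong to different classes, so the vote is split evenly and the predicted label depends on how the tie is broken. With k = 3 there is a clear majority.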

In [42]:
grid_knn = GridSearchCV(
    estimator = mod_knn,
    param_grid={'n_neighbors': [1,3,5,7,9,11,13,15]},
    cv = kfold,
    scoring = {'Accuracy':make_scorer(accuracy_score), 'Precision':make_scorer(precision_score), 'Recall':make_scorer(recall_score), 'F1-score':make_scorer(f1_score)},
    refit = 'Accuracy' # Metric used to pick best_params_ and refit the final model
)
grid_knn.fit(X_train_scaled, Y_train)
Y_pred = grid_knn.predict(X_train_scaled)
mod_knn_scores = pd.DataFrame(grid_knn.cv_results_)
mod_knn_scores

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,params,split0_test_Accuracy,split1_test_Accuracy,split2_test_Accuracy,split3_test_Accuracy,...,split3_test_F1-score,split4_test_F1-score,split5_test_F1-score,split6_test_F1-score,split7_test_F1-score,split8_test_F1-score,split9_test_F1-score,mean_test_F1-score,std_test_F1-score,rank_test_F1-score
0,0.00075,0.000103,0.003121,0.000289,1,{'n_neighbors': 1},0.7,0.696629,0.707865,0.764045,...,0.72,0.704225,0.712329,0.677419,0.656716,0.704225,0.714286,0.676433,0.039314,8
1,0.000655,4.9e-05,0.002852,0.000325,3,{'n_neighbors': 3},0.722222,0.831461,0.730337,0.831461,...,0.788732,0.764706,0.8,0.761905,0.677419,0.753623,0.787879,0.737393,0.060992,4
2,0.000625,5.1e-05,0.002746,0.000148,5,{'n_neighbors': 5},0.755556,0.842697,0.741573,0.820225,...,0.789474,0.742857,0.741935,0.75,0.677419,0.724638,0.75,0.723978,0.054303,7
3,0.000619,3e-05,0.002812,0.000107,7,{'n_neighbors': 7},0.8,0.842697,0.764045,0.853933,...,0.821918,0.835821,0.78125,0.806452,0.633333,0.746269,0.764706,0.754234,0.065633,1
4,0.000707,0.000215,0.002923,0.000249,9,{'n_neighbors': 9},0.777778,0.831461,0.752809,0.831461,...,0.8,0.848485,0.741935,0.745763,0.62069,0.727273,0.776119,0.735247,0.066776,6
5,0.000641,5.3e-05,0.003059,0.00034,11,{'n_neighbors': 11},0.822222,0.831461,0.752809,0.842697,...,0.810811,0.8125,0.761905,0.745763,0.655172,0.735294,0.757576,0.742426,0.054963,3
6,0.000658,9.4e-05,0.003076,0.000238,13,{'n_neighbors': 13},0.822222,0.831461,0.752809,0.820225,...,0.789474,0.830769,0.774194,0.766667,0.631579,0.738462,0.776119,0.744469,0.060915,2
7,0.0007,0.000216,0.003172,0.000262,15,{'n_neighbors': 15},0.811111,0.831461,0.775281,0.820225,...,0.789474,0.8125,0.754098,0.758621,0.631579,0.727273,0.746269,0.736755,0.052438,5


In [43]:
grid_knn.best_params_

{'n_neighbors': 7}

In [44]:
#Plotting graph for better visualization
alt.Chart(mod_knn_scores).transform_fold(
    ['mean_test_Accuracy','mean_test_Precision', 'mean_test_Recall','mean_test_F1-score'],
).mark_line().encode(
    x = 'param_n_neighbors',
    y = alt.Y('value', type = 'quantitative', title = 'Score'),
    color = alt.Color('key', type = 'nominal', title = 'Metrics')
).interactive()



We can see that at `n_neighbors = 7`, accuracy is at its highest and so is the F1-score (the harmonic mean of precision and recall).

In [45]:
df_knn = mod_knn_scores.loc[grid_knn.best_index_,:] # Row of the best setting (n_neighbors = 7)
print(f"Mean of accuracy is {df_knn['mean_test_Accuracy']} with a deviation of {df_knn['std_test_Accuracy']}")
print(f"Mean of f1-score is {df_knn['mean_test_F1-score']} with a deviation of {df_knn['std_test_F1-score']}")

Mean of accuracy is 0.8226966292134831 with a deviation of 0.03936691274360046
Mean of f1-score is 0.7542338712930381 with a deviation of 0.06563345780595786


In [46]:
#print(classification_report(Y_train, Y_pred))

#### Support Vector Machine

In [47]:
mod_svm = SVC()
svm_scores = cross_validate(
    mod_svm, X_train, Y_train, cv = kfold, scoring={'Accuracy':make_scorer(accuracy_score), 'Precision':make_scorer(precision_score), 'Recall':make_scorer(recall_score), 'F1-score':make_scorer(f1_score)}
)
mod_svm_scores = pd.DataFrame(svm_scores)
mod_svm_scores

Unnamed: 0,fit_time,score_time,test_Accuracy,test_Precision,test_Recall,test_F1-score
0,0.01327,0.006572,0.666667,0.727273,0.228571,0.347826
1,0.012224,0.00402,0.640449,0.6,0.176471,0.272727
2,0.011838,0.00383,0.685393,0.714286,0.294118,0.416667
3,0.011795,0.003813,0.775281,0.888889,0.470588,0.615385
4,0.011419,0.005016,0.651685,0.615385,0.235294,0.340426
5,0.012235,0.003878,0.674157,0.727273,0.235294,0.355556
6,0.011889,0.00431,0.719101,0.909091,0.294118,0.444444
7,0.012282,0.003885,0.696629,0.888889,0.235294,0.372093
8,0.011894,0.003845,0.786517,1.0,0.441176,0.612245
9,0.012645,0.004043,0.685393,0.888889,0.228571,0.363636


In [48]:
mod_svm_scores['test_Accuracy'].mean()

0.69812734082397

In [49]:
mod_svm.get_params()

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

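By default, the RBF kernel is chosen. Changing the kernel might give better results. (SVMs are also sensitive to feature scale, and this run uses the unscaled `X_train`.) We can use `GridSearchCV` to decide the best kernel.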

In [50]:
grid_svm = GridSearchCV(
    estimator = mod_svm,
    param_grid={'kernel': ['linear','rbf','poly']},
    cv = kfold,
    scoring = {'Accuracy':make_scorer(accuracy_score), 'Precision':make_scorer(precision_score), 'Recall':make_scorer(recall_score), 'F1-score':make_scorer(f1_score)},
    refit = 'Accuracy' # Metric used to pick best_params_ and refit the final model
)
grid_svm.fit(X_train, Y_train)
mod_svm_scores = pd.DataFrame(grid_svm.cv_results_)
mod_svm_scores

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kernel,params,split0_test_Accuracy,split1_test_Accuracy,split2_test_Accuracy,split3_test_Accuracy,...,split3_test_F1-score,split4_test_F1-score,split5_test_F1-score,split6_test_F1-score,split7_test_F1-score,split8_test_F1-score,split9_test_F1-score,mean_test_F1-score,std_test_F1-score,rank_test_F1-score
0,1.685364,0.643753,0.002944,0.00014,linear,{'kernel': 'linear'},0.811111,0.797753,0.764045,0.842697,...,0.8,0.742857,0.677419,0.666667,0.622951,0.730159,0.686567,0.70822,0.050041,1
1,0.01521,0.002828,0.004919,0.000525,rbf,{'kernel': 'rbf'},0.666667,0.640449,0.685393,0.775281,...,0.615385,0.340426,0.355556,0.444444,0.372093,0.612245,0.363636,0.4141,0.108742,2
2,0.014266,0.000733,0.003425,0.000618,poly,{'kernel': 'poly'},0.677778,0.651685,0.707865,0.696629,...,0.341463,0.25641,0.333333,0.363636,0.210526,0.418605,0.363636,0.315155,0.079724,3


In [51]:
grid_svm.best_index_

0

In [52]:
#Plotting graph for better visualization
alt.Chart(mod_svm_scores).transform_fold(
    ['mean_test_Accuracy','mean_test_Precision', 'mean_test_Recall','mean_test_F1-score'],
).mark_bar().encode(
    x = 'param_kernel',
    y = alt.Y('value', type = 'quantitative', title = 'Score'),
    color = alt.Color('key', type = 'nominal', title = 'Metrics'),
    tooltip = ['mean_test_Accuracy','mean_test_Precision', 'mean_test_Recall','mean_test_F1-score']
).interactive()



Looks like the linear kernel worked better than RBF (mean F1-score of 0.71 versus 0.41).

In [53]:
df_svm = mod_svm_scores.loc[grid_svm.best_index_,:] # Row of the best kernel (linear)
print(f"Mean of accuracy is {df_svm['mean_test_Accuracy']} with a deviation of {df_svm['std_test_Accuracy']}")
print(f"Mean of f1-score is {df_svm['mean_test_F1-score']} with a deviation of {df_svm['std_test_F1-score']}")

Mean of accuracy is 0.786729088639201 with a deviation of 0.028599350013463757
Mean of f1-score is 0.7082203851092008 with a deviation of 0.05004076586746738


#### Stochastic Gradient Descent Classifier

SGD is not a model. It is just an optimizer. (More on this below.)

In [54]:
mod_sgd = SGDClassifier()
sgd_scores = cross_validate(
    mod_sgd, X_train, Y_train, cv = kfold, scoring={'Accuracy':make_scorer(accuracy_score), 'Precision':make_scorer(precision_score), 'Recall':make_scorer(recall_score), 'F1-score':make_scorer(f1_score)}
)
mod_sgd_scores = pd.DataFrame(sgd_scores)
mod_sgd_scores

Unnamed: 0,fit_time,score_time,test_Accuracy,test_Precision,test_Recall,test_F1-score
0,0.003319,0.002228,0.766667,0.818182,0.514286,0.631579
1,0.002224,0.002596,0.707865,0.586957,0.794118,0.675
2,0.00426,0.002749,0.651685,0.8,0.117647,0.205128
3,0.002963,0.002581,0.831461,0.952381,0.588235,0.727273
4,0.002897,0.001875,0.775281,0.659091,0.852941,0.74359
5,0.002383,0.002133,0.752809,0.75,0.529412,0.62069
6,0.00421,0.001953,0.764045,1.0,0.382353,0.553191
7,0.002193,0.001819,0.797753,0.833333,0.588235,0.689655
8,0.003855,0.002013,0.393258,0.378049,0.911765,0.534483
9,0.002944,0.001945,0.730337,0.823529,0.4,0.538462


In [55]:
mod_sgd_scores['test_Accuracy'].mean()

0.717116104868914

In [56]:
mod_sgd.get_params()

{'alpha': 0.0001,
 'average': False,
 'class_weight': None,
 'early_stopping': False,
 'epsilon': 0.1,
 'eta0': 0.0,
 'fit_intercept': True,
 'l1_ratio': 0.15,
 'learning_rate': 'optimal',
 'loss': 'hinge',
 'max_iter': 1000,
 'n_iter_no_change': 5,
 'n_jobs': None,
 'penalty': 'l2',
 'power_t': 0.5,
 'random_state': None,
 'shuffle': True,
 'tol': 0.001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

As mentioned before, SGD is just an optimization technique. If the `loss` parameter is set to 'log_loss', the SGD classifier fits a logistic regression model with SGD as the optimizer (sklearn's `LogisticRegression` fits the same kind of model with other solvers, `lbfgs` by default). If the `loss` is 'hinge' (the default, as the parameters above show), it fits a linear SVM classifier.

In [57]:
grid_sgd = GridSearchCV(
    estimator = mod_sgd,
    param_grid={'loss': ['hinge','log_loss','perceptron'], 'penalty': ['l2','l1']},
    cv = kfold,
    scoring = {'Accuracy':make_scorer(accuracy_score), 'Precision':make_scorer(precision_score), 'Recall':make_scorer(recall_score), 'F1-score':make_scorer(f1_score)},
    refit = 'Accuracy' # Metric used to pick best_index_/best_params_ and to refit the final model
)
grid_sgd.fit(X_train, Y_train)
mod_sgd_scores = pd.DataFrame(grid_sgd.cv_results_)
mod_sgd_scores

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_loss,param_penalty,params,split0_test_Accuracy,split1_test_Accuracy,split2_test_Accuracy,...,split3_test_F1-score,split4_test_F1-score,split5_test_F1-score,split6_test_F1-score,split7_test_F1-score,split8_test_F1-score,split9_test_F1-score,mean_test_F1-score,std_test_F1-score,rank_test_F1-score
0,0.003003,0.000607,0.002067,0.000341,hinge,l2,"{'loss': 'hinge', 'penalty': 'l2'}",0.655556,0.786517,0.696629,...,0.722892,0.653061,0.48,0.454545,0.7,0.25641,0.5,0.506055,0.180192,6
1,0.003158,0.000824,0.001799,8.2e-05,hinge,l1,"{'loss': 'hinge', 'penalty': 'l1'}",0.711111,0.764045,0.685393,...,0.75,0.746269,0.686567,0.694444,0.730159,0.707692,0.727273,0.68026,0.079373,2
2,0.003964,0.00074,0.001932,0.000192,log_loss,l2,"{'loss': 'log_loss', 'penalty': 'l2'}",0.777778,0.707865,0.662921,...,0.554622,0.44898,0.363636,0.552846,0.688525,0.651163,0.666667,0.560144,0.141221,5
3,0.007121,0.002549,0.002301,0.000412,log_loss,l1,"{'loss': 'log_loss', 'penalty': 'l1'}",0.788889,0.797753,0.730337,...,0.818182,0.736842,0.586207,0.689655,0.71875,0.724138,0.682927,0.696859,0.065729,1
4,0.002996,0.000485,0.001969,0.000429,perceptron,l2,"{'loss': 'perceptron', 'penalty': 'l2'}",0.733333,0.662921,0.719101,...,0.766667,0.753247,0.557377,0.698413,0.697674,0.716418,0.631579,0.627449,0.157864,4
5,0.004153,0.001314,0.002387,0.000521,perceptron,l1,"{'loss': 'perceptron', 'penalty': 'l1'}",0.722222,0.808989,0.662921,...,0.8125,0.704225,0.545455,0.666667,0.735632,0.703704,0.682927,0.673094,0.085206,3


In [58]:
print(grid_sgd.best_index_)
grid_sgd.best_params_

3


{'loss': 'log_loss', 'penalty': 'l1'}

In [59]:
df_sgd = mod_sgd_scores.loc[1,:] # row 1 is hinge + l1; the best row by accuracy in the run above is grid_sgd.best_index_ (3)
print(f"Mean of accuracy is {df_sgd['mean_test_Accuracy']} with a deviation of {df_sgd['std_test_Accuracy']}")
print(f"Mean of f1-score is {df_sgd['mean_test_F1-score']} with a deviation of {df_sgd['std_test_F1-score']}")

Mean of accuracy is 0.7654931335830212 with a deviation of 0.038601424802378666
Mean of f1-score is 0.6802596996539052 with a deviation of 0.0793726759736491


#### Gaussian Naive Bayes

Naive Bayes is a popular ML technique, used mostly for text-based classification. We train it on the scaled data (`X_train_scaled`).

In [60]:
mod_gnb = GaussianNB()
gnb_scores = cross_validate(
    mod_gnb, X_train_scaled, Y_train, cv = kfold, scoring={'Accuracy':make_scorer(accuracy_score), 'Precision':make_scorer(precision_score), 'Recall':make_scorer(recall_score), 'F1-score':make_scorer(f1_score)}
)
mod_gnb_scores = pd.DataFrame(gnb_scores)
mod_gnb_scores

Unnamed: 0,fit_time,score_time,test_Accuracy,test_Precision,test_Recall,test_F1-score
0,0.000915,0.001597,0.733333,0.641026,0.714286,0.675676
1,0.000535,0.001304,0.707865,0.590909,0.764706,0.666667
2,0.000484,0.001281,0.775281,0.705882,0.705882,0.705882
3,0.000468,0.001452,0.730337,0.608696,0.823529,0.7
4,0.000486,0.001378,0.775281,0.666667,0.823529,0.736842
5,0.000478,0.001451,0.764045,0.685714,0.705882,0.695652
6,0.000488,0.002157,0.808989,0.742857,0.764706,0.753623
7,0.000615,0.0014,0.808989,0.742857,0.764706,0.753623
8,0.000564,0.001413,0.764045,0.675676,0.735294,0.704225
9,0.000485,0.001295,0.831461,0.777778,0.8,0.788732


In [61]:
mod_gnb_scores['test_Accuracy'].mean()

0.7699625468164794

In [62]:
mod_gnb.get_params()

{'priors': None, 'var_smoothing': 1e-09}
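One possible refinement, sketched under the assumption that the scaler behind `X_train_scaled` was fitted on the full training set: putting the scaler and the model in a pipeline makes `cross_validate` refit the scaler on the training part of each fold only. The data is synthetic, and `X_demo`, `y_demo`, `demo_kfold`, `gnb_pipe` and `gnb_pipe_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_clusters_per_class=1, random_state=0
)

demo_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# The scaler is fitted inside every fold, so the held-out part never
# influences the scaling statistics
gnb_pipe = make_pipeline(StandardScaler(), GaussianNB())
gnb_pipe_scores = cross_validate(
    gnb_pipe, X_demo, y_demo, cv=demo_kfold, scoring=["accuracy", "f1"]
)
print(gnb_pipe_scores["test_accuracy"].mean())
```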

### Future Improvements

I tried to cover the core concepts of machine learning and the sklearn tools that support them, but there are many more things that can be tried.
1. Hyperparameter Tuning

    I used GridSearchCV to optimize hyperparameters. Grid search serves our purpose, but it is slow because it evaluates every combination. Since our dataset is small, this was not an issue. Other alternatives are:
    * Random Search CV

       Instead of trying every combination, it samples a few **random** combinations. This is faster, but it may miss the best combination that grid search would find.
    * Bayesian Optimization

       This is more of an intelligent-guess approach. It starts with a few random combinations, then chooses the next set of parameters by analyzing the results of the previously chosen ones. This can be implemented using a library called **hyperopt**.