In [None]:
from IPython.display import Image
from IPython.core.display import HTML 

## Where to find data?
- in the wild (scraping websites, Twitter, GAMS, etc)
- for practice - Kaggle (https://www.kaggle.com/datasets)

Kaggle is a really useful service where you can find data to play with and Notebooks to take inspiration and learn from. 
It also offers ML models you can try out and a community of people with data science interests.

You can also form teams and work on dataset challenges to win competitions (sometimes with cash prizes) - https://www.kaggle.com/competitions.

Let's try out one of the intro-level challenges that already has a bunch of submissions - the Titanic dataset challenge (https://www.kaggle.com/competitions/titanic/overview/description). Kaggle has its own API that you can use to easily download the data.

Let's install the Kaggle library. People who have Anaconda can do it easily through the Navigator, otherwise:

```
pip install kaggle
```

or 

```
conda install -c conda-forge kaggle
```

Documentation:
- https://github.com/Kaggle/kaggle-api
- https://anaconda.org/conda-forge/kaggle


In [None]:
import kaggle


To be able to import this library, you also need to create a Kaggle user account, get your 'kaggle.json' API token file and place it under C:/Users/{username}/.kaggle (on Linux and macOS: ~/.kaggle).

More on how to do that here: https://github.com/Kaggle/kaggle-api
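
Before importing the library, it can help to check that the token file is where it is expected. The sketch below only uses the standard library; the helper names `kaggle_credentials_path` and `has_kaggle_credentials` are made up for this notebook and are not part of the Kaggle API.

```python
from pathlib import Path


def kaggle_credentials_path(home=None):
    """Return the expected location of kaggle.json under the given home directory."""
    base = Path(home) if home is not None else Path.home()
    return base / ".kaggle" / "kaggle.json"


def has_kaggle_credentials(home=None):
    """Return True when kaggle.json exists at the expected location."""
    return kaggle_credentials_path(home).is_file()


print("Expected credentials file:", kaggle_credentials_path())
print("Found:", has_kaggle_credentials())
```

If this prints `Found: False`, download the token from your Kaggle account page and copy it to the printed path.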

In [None]:
 ! kaggle datasets list

In [None]:
import os


In [None]:
%cd data


In [None]:
%mkdir Titanic

In [None]:
%cd Titanic

In [None]:
! kaggle competitions download -c titanic

Now that we have our data, let's unzip it and check what is there. There should be a train set and a test set, which we will use to build our models.

The train set (the ground truth set) contains the outcome for each passenger of the Titanic (whether they survived or not) and has features such as gender and passenger class. For the test set, it's our job to build a model that predicts whether each passenger survived or died.

The gender_submission.csv file is a set of predictions that serves as an example of what the competition submission file should look like. More info: https://www.kaggle.com/competitions/titanic/data
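
The unzip step could look roughly like this. It is a minimal sketch that assumes the download produced a file called `titanic.zip` in the current folder and that we want the files in a `titanic` subfolder; both names are assumptions, and `extract_archive` is a helper written for this notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every file in zip_path into target_dir and return the sorted file names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())


archive_path = Path("titanic.zip")  # assumed name of the downloaded archive
if archive_path.exists():
    print(extract_archive(archive_path, "titanic"))
else:
    print(f"{archive_path} not found; run the download cell first")
```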

In [None]:
import pandas as pd
import numpy as np
import sklearn

from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelBinarizer
from sklearn.svm import SVC
from sklearn.preprocessing import OneHotEncoder, LabelEncoder


In [None]:
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.pyplot as plt

import seaborn as sns

In [None]:
train_data = pd.read_csv("titanic/train.csv")
train_data.head()

In [None]:
train_data.shape

In [None]:
test_data = pd.read_csv("titanic/test.csv")
test_data.head(10)

In [None]:
test_data.shape

## EDA (Exploratory data analysis)

The Titanic competition also has a data dictionary, which explains the columns that make up the data set. Below are the descriptions contained in that data dictionary:



| Column      | Description                                                                               |
|-------------|-------------------------------------------------------------------------------------------|
| PassengerID | A column added by Kaggle to identify each row and make submissions easier                 |
| Survived    | Whether the passenger survived or not and the value we are predicting (0=No, 1=Yes)       |
| Pclass      | The class of the ticket the passenger purchased (1=1st, 2=2nd, 3=3rd)                     |
| Sex         | The passenger's sex                                                                       |
| Age         | The passenger's age in years                                                              |
| SibSp       | The number of siblings or spouses the passenger had aboard the Titanic                    |
| Parch       | The number of parents or children the passenger had aboard the Titanic                    |
| Ticket      | The passenger's ticket number                                                             |
| Fare        | The fare the passenger paid                                                               |
| Cabin       | The passenger's cabin number                                                              |
| Embarked    | The port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)          |



#### How many null values?

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()
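
Raw counts are easier to judge as a share of all rows. The sketch below uses a small made-up frame (the values are toy data, not the real Titanic rows) to show the idea; on the real data you would call the same expression on `train_data`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Cabin": [None, None, None, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.1],
})

# isnull() gives True/False per cell; the mean of booleans is the share of missing values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```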

#### Data types?

In [None]:
train_data.dtypes

In [None]:
train_data.describe()

### How many people survived?

In [None]:
train_data['Survived'].value_counts().plot(kind='pie')

### How many people are there per class?

In [None]:
train_data['Pclass'].value_counts().plot(kind='pie')

### How many women survived and how many men survived in our training set?

In [None]:
sex_survived = train_data.groupby(['Sex','Survived'])['Survived'].count()
sex_survived

In [None]:
sex_survived.plot(kind="bar")

### How many people from different classes survived? (Did the wealthy have a higher chance of survival?)


In [None]:
pd.crosstab([train_data.Sex,train_data.Survived],train_data.Pclass,margins=True).style.background_gradient(cmap='summer_r')
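
Counts answer "how many", but the question about chances is easier to read as a rate. Since `Survived` is 0 or 1, its mean per group is the survival rate. A minimal sketch on made-up rows (toy data, not the real passengers):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column per group = share of survivors in that group
survival_rate = toy.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")
print(survival_rate)
```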


### How many people survived depending on age?


In [None]:
print('Oldest passenger:', train_data['Age'].max(), 'years')
print('Youngest passenger:', train_data['Age'].min(), 'years')
print('Average age on the ship:', train_data['Age'].mean(), 'years')


In [None]:
fig = px.violin(train_data, y="Age", x="Survived", color="Sex", box=True, points="all",
          hover_data=train_data.columns)
fig.show()

### Correlations?

In [None]:
# look at numeric and categorical values separately 
train_num = train_data[['Age','SibSp','Parch','Fare']]
train_cat = train_data[['Survived','Pclass','Sex','Ticket','Cabin','Embarked']]

In [None]:
print(train_num.corr())
sns.heatmap(train_num.corr(),annot=True)

## Predict who's going to survive the Titanic disaster

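In this tutorial, we're going to use three classification algorithms: Logistic Regression, which builds on the Linear Regression we already know, and two that we're going to learn about now, k-Nearest Neighbours and Linear Support Vector Machines.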

## Clean data (NaN values)

The Cabin column is mostly NaN, and Ticket is a free-text identifier that is hard to use as a feature, so we drop both columns. Working with fewer features speeds up our notebook and eases the analysis.


In [None]:
train_data = train_data.drop(['Ticket', 'Cabin'], axis=1)
test_data = test_data.drop(['Ticket', 'Cabin'], axis=1)


In [None]:
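# assign the result back instead of using inplace=True on a column, which newer pandas versions warn about
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())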

In [None]:
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])
test_data['Embarked'] = test_data['Embarked'].fillna(test_data['Embarked'].mode()[0])


In [None]:
train_data['Fare'] = train_data['Fare'].fillna(train_data['Fare'].median())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())
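
The cells above fill each set with its own median or mode. A common alternative, shown here only as a sketch on toy data, is to compute the fill values on the training set and reuse them for the test set, so that nothing about the test set influences the preprocessing. The frames and the `fill_values` dictionary are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 30.0, 40.0],
                          "Embarked": ["S", None, "S", "C"]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Embarked": ["Q", None]})

# Fill values are learned from the training set only
fill_values = {
    "Age": toy_train["Age"].median(),
    "Embarked": toy_train["Embarked"].mode()[0],
}

# DataFrame.fillna accepts a dict that maps column names to fill values
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
print(toy_test)
```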


### Feature Engineering
Here we will try to create new features from the existing ones. Well-chosen features can improve our model's performance.



In [None]:
# Here the train and test datasets are stored in a list so that we don't have to manipulate them one by one
data_cleaner = [train_data, test_data]


In [None]:
### CREATE: feature engineering for the train and test/validation datasets
for dataset in data_cleaner:

    # Discrete variable: family size = siblings/spouses + parents/children + the passenger
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

    # 1 if the passenger travels alone, 0 if family size is greater than 1
    # (a single .loc call avoids pandas' chained-assignment pitfall)
    dataset['IsAlone'] = 1
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0

    # Quick and dirty code to split the title from the name: http://www.pythonforbeginners.com/dictionary/python-split
    # This stores the title of each person, taken from 'Name'.
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

    # Continuous variable bins; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
    # Fare bins using qcut (equal-frequency bins): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
    # In our case we want 4 fare bins.
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)

    # Age bins using cut (equal-width bins): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
    # Here we will have 6 age categories.
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 6)


In [None]:
train_data.head()

In [None]:

print(test_data['Title'].value_counts())


The next cell groups all titles that occur fewer than 10 times into a single 'Misc' category.



In [None]:
# clean up rare title names
stat_min = 10 # while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
title_names = (train_data['Title'].value_counts() < stat_min) # boolean Series with the title name as index

# apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
train_data['Title'] = train_data['Title'].apply(lambda x: 'Misc' if title_names.loc[x] else x)
print(train_data['Title'].value_counts())

# same cleanup for the test set, based on the test set's own title counts
title_names_test = (test_data['Title'].value_counts() < stat_min)
test_data['Title'] = test_data['Title'].apply(lambda x: 'Misc' if title_names_test.loc[x] else x)
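
The cell above decides what counts as rare separately for train and test. A possible variant, sketched here with illustrative names in the dataset's "Surname, Title. First names" format, takes the list of frequent titles from the training set and applies it to both sets, which also covers titles that only appear in the test set. `extract_title` and `group_rare_titles` are helper names invented for this sketch.

```python
import pandas as pd


def extract_title(names):
    """Pull the text between the comma and the first period out of each name."""
    return names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()


def group_rare_titles(train_titles, test_titles, min_count=10):
    """Replace titles seen fewer than min_count times in the training set with 'Misc'."""
    counts = train_titles.value_counts()
    frequent = set(counts[counts >= min_count].index)
    return (train_titles.where(train_titles.isin(frequent), "Misc"),
            test_titles.where(test_titles.isin(frequent), "Misc"))


# Illustrative names only
toy_train_names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
                             "Allen, Mr. William Henry", "Uruchurtu, Don. Manuel E"])
toy_test_names = pd.Series(["Kelly, Mr. James", "Oliva y Ocana, Dona. Fermina"])

toy_train_titles, toy_test_titles = group_rare_titles(
    extract_title(toy_train_names), extract_title(toy_test_names), min_count=2)
print(toy_train_titles.tolist(), toy_test_titles.tolist())
```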

### Label Encoding
Label encoding transforms non-numerical labels (nominal categorical variables) into numerical labels. The numerical labels are always between 0 and n_classes-1.

The code below creates two columns, 'AgeBin_Code' and 'FareBin_Code', which convert the bins (e.g. AgeBin has {1-16}, {16-24}) to numeric values labelled according to the bins.



In [None]:
label = LabelEncoder()
for dataset in data_cleaner:
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])
    
print(train_data.columns)
train_data.head()
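
To see what the encoder does, here is a small sketch on toy values. It also shows `.cat.codes`, a pandas alternative for bins created with `pd.cut` or `pd.qcut`, which numbers the bins in their natural order. The variable names and values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the distinct labels and numbers them 0..n_classes-1
ports = ["S", "C", "Q", "S"]  # toy values
port_codes = LabelEncoder().fit_transform(ports).tolist()
print(port_codes)  # [2, 0, 1, 2] because C=0, Q=1, S=2

# Bins from pd.cut are an ordered categorical, so their codes follow the bin order
ages = pd.Series([2, 20, 38, 55, 80])  # toy ages
age_bins = pd.cut(ages, bins=[0, 16, 32, 48, 64, 80])
age_codes = age_bins.cat.codes.tolist()
print(age_codes)  # [0, 1, 2, 3, 4]
```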

### Correlation between features


In [None]:
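# numeric_only=True skips text and interval columns, which recent pandas versions refuse to correlate
sns.heatmap(train_data.corr(numeric_only=True),annot=True,cmap='RdYlGn',linewidths=0.2)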


Next, we select the features we want to include in our model.



In [None]:
features_col=['Pclass','Sex','Embarked','IsAlone','Title','AgeBin_Code','FamilySize','FareBin_Code']
train_ds = train_data[features_col]
test_ds = test_data[features_col]
train_label = train_data['Survived']
print(train_ds.columns)
train_ds.head()

## One Hot Encoding
One hot encoding is a process by which categorical variables are converted into a numeric form that ML algorithms can work with, which helps them do a better job in prediction.

Suppose the Sex column has the classes 'male' and 'female'. One hot encoding will create two columns, 'Sex_male' and 'Sex_female', and store the values as binary.



In [None]:
one_hot_encoded_training_predictors = pd.get_dummies(train_ds)
one_hot_encoded_testing_predictors = pd.get_dummies(test_ds)
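
Encoding train and test separately only works if both contain the same categories. If a category is missing from one set, the columns no longer line up. A minimal sketch of one way to guard against that, on toy data, is to reindex the test columns to the training columns:

```python
import pandas as pd

# Toy data for illustration only; the test set has no 'female', 'C' or 'Q' rows
toy_train = pd.DataFrame({"Sex": ["male", "female", "female"],
                          "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Sex": ["male", "male"],
                         "Embarked": ["S", "S"]})

train_encoded = pd.get_dummies(toy_train, dtype=int)
test_encoded = pd.get_dummies(toy_test, dtype=int)

# Give the test set exactly the training columns; missing ones are filled with 0
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_encoded)
```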


In [None]:
sns.heatmap(one_hot_encoded_training_predictors.corr(),annot=True,cmap='RdYlGn',linewidths=0.2) #data.corr()-->correlation matrix
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()


The most strongly correlated features are (Sex_male, Title_Mr) and (Sex_female, Title_Miss), so chances are they carry redundant information. It's better to remove one feature from each pair.



In [None]:
# Remove the correlated feature to reduce redundancy in model.
corln_col=['Title_Miss','Sex_male']
one_hot_encoded_training_predictors = one_hot_encoded_training_predictors.drop(corln_col,axis=1)
one_hot_encoded_testing_predictors = one_hot_encoded_testing_predictors.drop(corln_col,axis=1)
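
Reading strongly correlated pairs off a large heatmap is error-prone. The sketch below lists them programmatically on a toy frame; `highly_correlated_pairs` is a helper written for this notebook and the threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "Sex_male": [1, 0, 0, 1, 1, 0],
    "Title_Mr": [1, 0, 0, 1, 0, 0],
    "Pclass": [3, 1, 2, 2, 1, 3],
})
strong_pairs = highly_correlated_pairs(toy, threshold=0.6)
print(strong_pairs)
```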


### Predictive Modeling


So now we will predict whether a passenger survives or not using some well-known classification algorithms. These are the algorithms I will use to build the models:

* Logistic Regression

* Support Vector Machines

* K-Nearest Neighbours





In [None]:
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn import svm #support vector Machine
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix


### Train-test split


The data we use is usually split into training data and test data. The training set contains a known output, and the model learns on this data so that it can generalize to other data later on. We keep the test dataset (or subset) aside in order to check our model’s predictions on data it has not seen.

The test_size=0.20 inside the function indicates the fraction of the data (here 20%) that should be held out for testing. The split is usually around 80/20 or 70/30.



In [None]:
from sklearn.model_selection import train_test_split

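# random_state fixes the shuffle so the split (and the accuracies below) are reproducible
train_X, test_X, train_y, test_y = train_test_split(one_hot_encoded_training_predictors, train_label, test_size=0.20, random_state=42)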
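
When one class is less common than the other, a plain random split can leave the test set with a different class balance than the training set. `train_test_split` has a `stratify` argument that keeps the proportions equal. A minimal sketch on made-up arrays (the `*_demo` names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 12 of class 0 and 8 of class 1
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 12 + [1] * 8)

# stratify=y_demo keeps the 60/40 class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

print("test size:", len(y_te), "positives in test:", int(y_te.sum()))
```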


### Logistic Regression


In [None]:
logreg_clf = LogisticRegression()
logreg_clf.fit(train_X, train_y)
pred_logreg = logreg_clf.predict(test_X)
acc_logreg = accuracy_score(test_y, pred_logreg)

print(acc_logreg)


### K-Nearest Neighbours(KNN)

In [None]:
Image(url= "img/knn.png")
# source: Machine Learning for Absolute Beginners, Oliver Theobald

- k-NN classifies new data points based on their position relative to nearby data points
- using k-NN we can predict the category of a new data point from the categories of the data points closest to it
- we need to set k to determine how many data points we want to use to classify the new data point (e.g. if we set it to 3, k-NN will classify it with respect to its 3 nearest data points, its neighbours)
- the default in scikit-learn is 5
- it is useful to test several values of k to find the best fit and avoid setting k too low or too high
- too low: predictions become sensitive to noise (overfitting); too high: the decision boundary is oversmoothed (bias) and prediction gets more computationally expensive
- this algorithm works best with continuous variables
- binary variables should be used only when critical for the model’s accuracy
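
Testing several values of k can be sketched as a simple loop with cross-validation. The example below uses synthetic data from `make_classification` rather than the Titanic features, and the list of k values is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

k_scores = {}
for k in (1, 3, 5, 11, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    k_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(k_scores, key=k_scores.get)
print(k_scores)
print("best k:", best_k)
```

On the Titanic data you would pass `train_X` and `train_y` instead of the synthetic arrays.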


In [None]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(train_X, train_y)
pred_knn = knn_clf.predict(test_X)
acc_knn = accuracy_score(test_y, pred_knn)

print(acc_knn)

### Linear Support Vector Machines


In [None]:
Image(url= "img/svm.png")
# source: Machine Learning for Absolute Beginners, Oliver Theobald

- useful to mitigate outliers and handle complex relationships
- wide margin - more mistakes on the training data, narrow margin - fewer mistakes, the goal is to strike a balance
- you can use the hyperparameter C to make the margin softer/harder
- SVM is not recommended when the feature-to-row ratio is low (low number of features relative to rows)
- works well at untangling outliers from complex small and medium datasets and managing high dimensional data
- regularization and standardization (a data scrubbing step) are often used with this algorithm
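
The effect of C can be explored with a small loop. In scikit-learn a smaller C means stronger regularization, i.e. a softer margin. The sketch below uses synthetic data and standardizes the features first; the C values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

c_scores = {}
for c in (0.01, 0.1, 1.0, 10.0):
    # Standardize inside the pipeline so each fold is scaled on its own training part
    model = make_pipeline(StandardScaler(), LinearSVC(C=c, dual=False))
    c_scores[c] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

print(c_scores)
```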


In [None]:
linsvc_clf = LinearSVC(dual=False)
linsvc_clf.fit(train_X, train_y)
pred_linsvc = linsvc_clf.predict(test_X)
acc_linsvc = accuracy_score(test_y, pred_linsvc)

print(acc_linsvc)

In [None]:
#Compare all model performance.
model_performance = pd.DataFrame({
    'Model': [ 'Linear SVM', 
              'Logistic Regression', 'K Nearest Neighbors'],
    'Accuracy': [ acc_linsvc, 
              acc_logreg, acc_knn]
})

model_performance.sort_values(by='Accuracy', ascending=False)
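
Accuracy is a single number and does not show which kind of mistake a model makes. A confusion matrix splits the predictions into correct and incorrect ones per class. A minimal sketch with made-up labels (on the real data you would pass `test_y` and one of the `pred_*` arrays):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels for illustration only
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

print(matrix)
print(accuracy)
```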

## Choosing the right estimator


https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [None]:
Image(url= "img/sklearn.png")
# source: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

## Literature, tutorials

* Exercises taken from - Machine Learning for Absolute Beginners, Oliver Theobald (https://www.amazon.de/gp/product/B08RWBSKQB/ref=ppx_yo_dt_b_d_asin_title_o00?ie=UTF8&psc=1) (https://bmansoori.ir/book/Machine%20Learning%20For%20Absolute%20Beginners.pdf)
* Tutorial accompanying the book - https://scatterplotpress.teachable.com/courses/enrolled/1247161
* Good tutorial on Kaggle - https://www.kaggle.com/learn/intro-to-machine-learning
* Titanic notebook - https://www.kaggle.com/code/vjgupta/titanic-simple-model-beginners
* Titanic dataset - https://www.kaggle.com/competitions/titanic
* SKLearn tutorial - https://scikit-learn.org/stable/index.html