# Introduction to Machine Learning

<img src="logo.jpg" style="float: left; width: 15%" />

[CSE204-2018](https://moodle.polytechnique.fr/course/view.php?id=6784) Lab session #01

Jérémie DECOCK

<a href="https://colab.research.google.com/github/jeremiedecock/polytechnique-cse204-2018/blob/master/lab_session_01.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

<a href="https://mybinder.org/v2/gh/jeremiedecock/polytechnique-cse204-2018/master?filepath=lab_session_01.ipynb"><img align="left" src="https://mybinder.org/badge.svg" alt="Open in Binder" title="Open and Execute in Binder"></a>

In [None]:
#!pip install seaborn==0.9.0

## Introduction

Welcome to [CSE 204](https://moodle.polytechnique.fr/course/view.php?id=6784) lab session: *Introduction to Machine Learning*.

This lab session assumes you are familiar with the Python programming language.
Check [CSE 101](https://moodle.polytechnique.fr/course/view.php?id=5390) if you need to refresh your knowledge.

Additional documentation can be found here:
- [Python 3 documentation](https://docs.python.org/3/)
- [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)
- [The Python Standard Library](https://docs.python.org/3/library/index.html)
- [The Python Language Reference](https://docs.python.org/3/reference/index.html)

### Jupyter Notebooks

The document you are reading is a [Jupyter Notebook](https://jupyter.org/).

The Jupyter Notebook (formerly IPython Notebook) is an interactive web application for creating documents that contain live code, equations, visualizations and narrative text.

The notebook consists of a sequence of cells. A cell is a multiline text input field, and **its contents can be executed** by pressing `Shift+Enter`, by clicking the `Play` button in the toolbar, or by selecting Cell, Run in the menu bar. The execution behavior of a cell is determined by the cell’s type. There are three types of cells: *code cells*, *markdown cells*, and *raw cells*. Every cell starts off being a code cell, but its type can be changed by using a drop-down on the toolbar (which will be “Code”, initially).

#### Code cells

A code cell allows you to edit and write new code, with full syntax highlighting and tab completion. The programming language used here is Python.

When a code cell is executed, the code it contains is sent to the kernel associated with the notebook. The results returned from this computation are then displayed in the notebook as the cell’s output. The output is not limited to text: many other forms of output are possible, including figures and HTML tables.

**Tips**:
- press the `Tab` key to use auto completion in code cells
- press `Shift+Tab` to display the documentation of the current object (the one on which the cursor is)

See e.g. https://miykael.github.io/nipype_tutorial/notebooks/introduction_jupyter-notebook.html

#### Markdown cells

You can document the computational process in a literate way, alternating descriptive text with code, using rich text. In IPython this is accomplished by marking up text with the Markdown language. The corresponding cells are called Markdown cells. The Markdown language provides a simple way to perform this text markup, that is, to specify which parts of the text should be emphasized (italics), bold, form lists, etc.

For more information about Jupyter Notebooks, read its [official documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#structure-of-a-notebook-document).

#### Save your work!

Don't forget to regularly save your work!
If you opened this notebook in Google Colab, you can either save the notebook in your Google Drive (if you have a Google account) or download the `.ipynb` file from the menu bar. Downloaded `.ipynb` files can then be imported from the same menu to restore your work.

## Objectives

What we will practice today:
- Pandas basics
- Perform an exploratory analysis on an actual dataset
- Build a first predictive model for this dataset

## Pandas Basics

[Pandas](http://pandas.pydata.org/) is a popular data analysis toolkit for Python.

We will use it to explore data with heterogeneous types and/or missing values.

Additional documentation can be found here:
- http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
- http://pandas.pydata.org/pandas-docs/stable/
- https://jakevdp.github.io/PythonDataScienceHandbook/
- http://www.jdhp.org/docs/notebook/python_pandas_en.html

### Import directives

To begin with, let's import the Pandas library as the *pd* alias. Select the following code cell and execute it with the `Shift + Enter` shortcut key.

We also import:
- Numpy to generate arrays
- Seaborn to make plots
- Scikit-learn (`sklearn`) to train models on the data

The `%matplotlib inline` line is a "magic" command that tells Jupyter Notebook to display figures within the document.

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import seaborn as sns
import sklearn

### Make DataFrames (2D data)

Let's make a *DataFrame* now. DataFrame is the name given to 2D arrays in Pandas. We will assign this DataFrame to the `df` variable and display it (with the `df` line at the end of the cell).

In [None]:
data = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(data)
df

The previous command has made a DataFrame with automatic *indices* (row labels) and *columns* (column labels).

To make a DataFrame with defined indices and columns, use:

In [None]:
data = [[1, 2, 3], [4, 5, 6]]
index = [10, 20]
columns = ['A', 'B', 'C']

df = pd.DataFrame(data=data, index=index, columns=columns)
df

A Python dictionary can be used to define data; its keys define the **column** labels. Scalar values (here `'foo'` and `3.14`) are repeated on every row.

In [None]:
data_dict = {'A': 'foo',
             'B': [10, 20, 30],
             'C': 3.14}
df = pd.DataFrame(data_dict, index=[10, 20, 30])
df

#### Get information about a dataframe

Its indices:

In [None]:
df.index

Its column labels:

In [None]:
df.columns

Its shape (i.e. number of rows and number of columns):

In [None]:
df.shape

The number of rows in `df` is:

In [None]:
df.shape[0]

The number of columns in `df` is:

In [None]:
df.shape[1]

The data type of each column:

In [None]:
df.dtypes

Additional information about columns:

In [None]:
df.info()

In [None]:
df.describe()

### Select a single column

Here are 3 equivalent syntaxes to get the column "C":

In [None]:
df.C

In [None]:
df["C"]

In [None]:
df.loc[:,"C"]

### Select multiple columns

Here are 2 equivalent syntaxes to get both columns "A" and "B":

In [None]:
df[['A','B']]

In [None]:
df.loc[:,['A','B']]

### Select a single row

In [None]:
df.loc[10]

In [None]:
df.loc[10,:]

### Select multiple rows

In [None]:
df.loc[[10, 20],:]
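The selections above use `.loc`, which works with *labels*. Pandas also provides `.iloc`, which works with integer *positions* (it is used later in this lab to split the dataset). The following is a minimal sketch of the difference; `df_demo` is an illustrative name chosen here so that `df` is left untouched.

```python
import pandas as pd

# Same layout as the DataFrame used above: the row labels are 10, 20 and 30.
df_demo = pd.DataFrame({'A': 'foo', 'B': [10, 20, 30], 'C': 3.14},
                       index=[10, 20, 30])

row_by_label = df_demo.loc[10]       # the row whose index label is 10
row_by_position = df_demo.iloc[0]    # the first row, whatever its label

# Both expressions select the same row here.
print(row_by_label.equals(row_by_position))

# The last two rows, selected by position.
print(df_demo.iloc[-2:])
```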

### Select rows based on values

In [None]:
df.B < 30.

In [None]:
df.loc[df.B < 30.]

In [None]:
df.loc[(df.B < 20) | (df.B >= 30)]

In [None]:
df.loc[(df.B >= 20) & (df.B < 30)]
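The conditions used above are boolean Series (one `True`/`False` per row). They are combined element-wise with `|` (or), `&` (and) and `~` (not). The parentheses around each comparison are required because `|` and `&` bind more tightly than comparison operators in Python. A small sketch, using an illustrative `df_demo` variable:

```python
import pandas as pd

df_demo = pd.DataFrame({'B': [10, 20, 30]}, index=[10, 20, 30])

low = df_demo.B < 20      # boolean Series: True, False, False
high = df_demo.B >= 30    # boolean Series: False, False, True

either = low | high       # element-wise OR
neither = ~(low | high)   # element-wise NOT of the previous mask

print(df_demo.loc[either])
print(df_demo.loc[neither])
```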

### Select rows and columns

In [None]:
df.loc[(df.B < 20) | (df.B >= 30), 'C']

In [None]:
df.loc[(df.B < 20) | (df.B >= 30), ['A','B']]

### Setting values

### Apply a function to selected columns values

In [None]:
df.B *= 2.
df

In [None]:
df.B = pow(df.B, 2)
df

### Apply a function to selected rows values

In [None]:
df.loc[df.B < 500., 'A'] = -1
df

In [None]:
df.loc[(df.B < 500.) | (df.B > 2000), 'C'] = 0
df

### Handling missing data

Missing data are symbolized by a *NaN* ("Not a Number").

In [None]:
data = [[3, 2, 3],
        [float("nan"), 4, 4],
        [5, float("nan"), 5],
        [float("nan"), 3, 6],
        [7, 1, 1]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

To obtain the boolean mask where values are NaN, type:

In [None]:
df.isnull()

To drop any rows that have missing data:

In [None]:
df.dropna()

To fill missing data with a chosen value (e.g. 999):

In [None]:
df.fillna(value=999)

To count the number of NaN values in each column:

In [None]:
df.isnull().sum()
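Filling with an arbitrary constant such as 999 can distort statistics computed afterwards. One common alternative (not the one used in this lab) is to fill each column with its own mean. The sketch below assumes that this simple strategy is acceptable for the data at hand; `df_demo`, `col_means` and `df_filled` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame([[3, 2, 3],
                        [float("nan"), 4, 4],
                        [5, float("nan"), 5],
                        [float("nan"), 3, 6],
                        [7, 1, 1]], columns=['A', 'B', 'C'])

# One mean per column, computed while skipping NaN values.
col_means = df_demo.mean()

# fillna accepts a Series indexed by column label:
# each column is filled with its own value.
df_filled = df_demo.fillna(col_means)
print(df_filled)
```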

### Exercise 1

In [None]:
df = pd.DataFrame([[i+j*10 for i in range(10)] for j in range(20)], columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"])
df

Considering the DataFrame `df` created in the previous cell:
- Write the code to extract column 'B' and assign this sub-DataFrame to a variable `df2`
- Write the code to extract columns 'B' and 'G' and assign this sub-DataFrame to a variable `df3`
- Write the code to extract rows where the value in column 'C' is greater than 100 and assign this sub-DataFrame to a variable `df4`
- Write the code to extract rows where the value in column 'C' is greater than 100 or less than 40 and assign this sub-DataFrame to a variable `df5`
- Write the code to extract rows where the value in column 'C' is less than 100 and greater than 40 and assign this sub-DataFrame to a variable `df6`
- Write the code to extract columns 'A' and 'B' of rows where the value in column 'C' is greater than 100 or less than 40 and assign this sub-DataFrame to a variable `df7`

In [None]:
df2 = df.loc[:, ['B']]
df2

In [None]:
df3 = df.loc[:, ['B', 'G']]
df3

In [None]:
df4 = df.loc[df.C > 100]
df4

In [None]:
df5 = df.loc[(df.C > 100) | (df.C < 40)]
df5

In [None]:
df6 = df.loc[(df.C < 100) & (df.C > 40)]
df6

In [None]:
df7 = df.loc[(df.C > 100) | (df.C < 40), ['A', 'B']]
df7

## Exploratory analysis of the Titanic dataset

### Problem description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this lab session, you will complete the analysis of what sorts of people were likely to survive and you will apply machine learning methods to predict which passengers survived the tragedy.

([Description from Kaggle](https://www.kaggle.com/c/titanic))

### Load data

We start by loading the dataset into the `df` Pandas DataFrame.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/jeremiedecock/polytechnique-cse204-2018/master/titanic_train.csv")
df

### Variables description

Here are the *features* available in the dataset:

In [None]:
for column in df.columns:
    print(column)

- *Survived*: survived (1) or died (0)
- *Pclass*: passenger's class
- *Name*: passenger's name
- *Sex*: passenger's sex
- *Age*: passenger's age
- *SibSp*: number of siblings/spouses aboard
- *Parch*: number of parents/children aboard
- *Ticket*: ticket number
- *Fare*: fare
- *Cabin*: cabin
- *Embarked*: port of embarkation
  - C = Cherbourg
  - Q = Queenstown
  - S = Southampton


([Description from Kaggle](https://www.kaggle.com/c/titanic/data))

### Exercise 2

- List *categorical* features in the Titanic dataset.

In [None]:
df.dtypes

Name, Sex, Cabin, Embarked

- List *numerical* features.

PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare

- List *mixed data types* features.

Ticket

- Which features may contain errors or typos?

Mostly Name

- Which features contain blank, null or empty values?

In [None]:
df.isnull().sum()

Age, Cabin, Embarked

- We want to complete the analysis of what sorts of people were likely to survive i.e. to predict for a new passenger whether he/she will survive or not. What kind of machine learning problem is it? A classification problem? A regression problem? A clustering problem? A reinforcement learning problem? Why?

Binary classification problem (predicted outputs are discrete values defined in {0, 1} or {'died', 'survived'}).

### Display a quick summary of the dataset
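A quick summary can be obtained with the `describe()` method introduced in the Pandas section. The sketch below uses a few hypothetical rows that only mimic some Titanic columns (they are not taken from the real dataset); on the actual data the equivalent call would be `df.describe()`.

```python
import pandas as pd

# Hypothetical rows that mimic a few Titanic columns (not real passengers).
sample = pd.DataFrame({'Age': [22.0, 38.0, float('nan'), 35.0],
                       'Fare': [7.25, 71.25, 8.0, 53.5],
                       'Sex': ['male', 'female', 'female', 'male']})

# By default, describe() summarizes numerical columns only.
# The 'count' row ignores missing values.
summary = sample.describe()
print(summary)
```

The `max`, `mean` and quartile rows (`25%`, `50%`, `75%`) of this table are the statistics asked for in the next exercise.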

### Exercise 3

- What is the age of the oldest passenger (or crew)?
- What is the average fare?
- What are the fare quartiles?

In [None]:
df.Age.max()

In [None]:
df.Fare.mean()

In [None]:
df.Fare.describe()

### Explore correlations between the numerical features and survival of passengers

Let us start by understanding correlations between numerical features and the label we want to predict (Survived).

#### Correlation survival vs age

In [None]:
bins = np.arange(0, 100, 5)
print("Bins:", bins)

ax = df.loc[df.Survived == 0, "Age"].hist(bins=bins, color="red", alpha=0.5, label="died", figsize=(18, 8))
df.loc[df.Survived == 1, "Age"].hist(bins=bins, color="blue", ax=ax, alpha=0.5, label="survived")
ax.legend();

Don't forget to check missing values...

In [None]:
df.loc[(df.Age.isnull()) & (df.Survived == 0)].shape[0]

In [None]:
df.loc[(df.Age.isnull()) & (df.Survived == 1)].shape[0]

### Exercise 4

What useful observation can you extract from the previous plot?

The Age variable seems to have useful correlations with the survival label (infants seem to have more chances to survive than adults, for instance).
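Such a visual impression can also be checked numerically by grouping ages into intervals and computing the survival rate in each interval. The following sketch uses hypothetical passengers and arbitrarily chosen bin edges; `sample`, `age_group` and `rate_by_age` are illustrative names.

```python
import pandas as pd

# Hypothetical passengers (illustrative values, not the real dataset).
sample = pd.DataFrame({'Age':      [2, 4, 30, 45, 28, 60, float('nan')],
                       'Survived': [1, 1, 0, 1, 0, 0, 1]})

# Assign each age to an interval; rows with a missing age get no interval.
age_group = pd.cut(sample['Age'], bins=[0, 10, 40, 100])

# The mean of a 0/1 column is the fraction of survivors in each group.
# Rows without an interval are left out of the groups.
rate_by_age = sample.groupby(age_group, observed=False)['Survived'].mean()
print(rate_by_age)
```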

### Exercise 5

Plot the correlation between Fare and Survival. Use the code from the previous cell as a starting point.

In [None]:
bins = np.arange(0, 100, 5)
print("Bins:", bins)

ax = df.loc[df.Survived == 0, "Fare"].hist(bins=bins, color="red", alpha=0.5, label="died", figsize=(18, 8))
df.loc[df.Survived == 1, "Fare"].hist(bins=bins, color="blue", ax=ax, alpha=0.5, label="survived")
ax.legend();

What useful observation can you extract from the previous plot?

The Fare variable seems to have useful correlations with the survival label (passengers who paid a high fare seem to have more chances to survive than those who paid a low fare, for instance).

### Explore correlations between the categorical features and survival of passengers

#### Correlation survival vs class

In [None]:
g = sns.catplot(y="Survived", x="Pclass", kind="bar", data=df)
g.set_ylabels("survival probability")

### Exercise 6

What useful observation can you extract from the previous plot?

People in 1st class seem to have more chances to survive.

Plot the correlation between
* Sex and Survival
* SibSp and Survival
* Parch and Survival
* Embarked and Survival

Use the code from the previous cell as a starting point.

In [None]:
g = sns.catplot(y="Survived", x="Sex", kind="bar", data=df)
g.set_ylabels("survival probability")

In [None]:
g = sns.catplot(y="Survived", x="SibSp", kind="bar", data=df)
g.set_ylabels("survival probability")

In [None]:
g = sns.catplot(y="Survived", x="Parch", kind="bar", data=df)
g.set_ylabels("survival probability")

In [None]:
g = sns.catplot(y="Survived", x="Embarked", kind="bar", data=df)
g.set_ylabels("survival probability")

What useful observation can you extract from these plots?

Females seem to have more chances to survive than males. Sex seems to be a good discriminative variable.

Other variables are less discriminative.

### Exercise 7

The following code computes the survival rate of women in first class,
i.e. an estimate of $P(\text{Survived} = 1 | \text{Sex}=\text{female}, \text{Pclass}=1)$

In [None]:
df.loc[(df.Sex == "female") & (df.Pclass == 1), "Survived"].mean()

Compute the survival rate corresponding to:

- P(Survived = 1 | Sex=female, Pclass=2)
- P(Survived = 1 | Sex=female, Pclass=3)
- P(Survived = 1 | Sex=male, Pclass=1)
- P(Survived = 1 | Sex=male, Pclass=2)
- P(Survived = 1 | Sex=male, Pclass=3)

In [None]:
df.loc[(df.Sex == "female") & (df.Pclass == 2), "Survived"].mean()

In [None]:
df.loc[(df.Sex == "female") & (df.Pclass == 3), "Survived"].mean()

In [None]:
df.loc[(df.Sex == "male") & (df.Pclass == 1), "Survived"].mean()

In [None]:
df.loc[(df.Sex == "male") & (df.Pclass == 2), "Survived"].mean()

In [None]:
df.loc[(df.Sex == "male") & (df.Pclass == 3), "Survived"].mean()
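The six conditional survival rates can also be computed in a single call with `groupby`. The sketch below uses a few hypothetical rows so that it runs on its own; on the Titanic DataFrame the same expression would be `df.groupby(['Sex', 'Pclass'])['Survived'].mean()`.

```python
import pandas as pd

# Hypothetical rows (illustrative only).
sample = pd.DataFrame({'Sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
                       'Pclass':   [1, 3, 1, 3, 3, 1],
                       'Survived': [1, 0, 1, 0, 1, 1]})

# One survival rate per (Sex, Pclass) pair present in the data.
rates = sample.groupby(['Sex', 'Pclass'])['Survived'].mean()
print(rates)

# The boolean-mask version gives the same number for a given pair.
mask_rate = sample.loc[(sample.Sex == 'female') & (sample.Pclass == 1), 'Survived'].mean()
print(mask_rate == rates.loc[('female', 1)])
```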

The following plots display the survival rate considering multiple variables.

In [None]:
sns.catplot(x="Pclass", y="Survived", hue="Sex", kind="bar", data=df)

In [None]:
sns.catplot(y="Survived", x="Embarked", hue="Pclass", kind="bar", data=df)

Explore other variable combinations with similar plots.

After considering these plots, which variables do you expect to be good predictors of survival?

Age, Sex and Fare seem to be the most discriminative. But remember that all these distributions are marginal distributions: they can hide more complex correlations or show spurious ones!
This is only a hint; the predictor will give us the actual answer.

## Make a predictive model

After this brief exploration of the dataset, let's try to train a model to predict the survival of "new" (i.e. unknown) passengers.

A *decision tree* classifier will be used to complete this task.

To begin with, import the decision tree package (named `tree`) implemented in scikit-learn (a.k.a. `sklearn`).

In [None]:
import sklearn.tree

Then reload the dataset:

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/jeremiedecock/polytechnique-cse204-2018/master/titanic_train.csv")

### Exercise 8

Following the investigations made in the exploratory analysis, which variables should be removed from the dataset? Complete the following cell to remove useless data.

In [None]:
df = df.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)

Once you have removed useless variables, remove examples with missing values from the dataset.

In [None]:
df = df.dropna()

Let's finish cleaning the dataset by converting categorical features to numerical ones.

In [None]:
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).astype(int)
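Mapping ports to 0, 1 and 2 introduces an arbitrary order between categories. A common alternative (not the one used in this lab) is *one-hot encoding*, where each category gets its own indicator column. The sketch below shows it with `pd.get_dummies` on a few hypothetical rows.

```python
import pandas as pd

# Hypothetical rows (illustrative only).
sample = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S'],
                       'Fare': [7.25, 71.25, 8.0, 53.5]})

# Replace the 'Embarked' column with one indicator column per port.
encoded = pd.get_dummies(sample, columns=['Embarked'])
print(encoded)
```

Each row then has exactly one active indicator among `Embarked_C`, `Embarked_Q` and `Embarked_S`.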

The following cell gives an overview of our dataset (showing only the first lines)

In [None]:
df.head()

Now that the dataset is ready, we split it into two subsets:
- the training set (`X_train` and `Y_train`)
- the testing set (`X_test` and `Y_test`)

`X_xxx` contains the examples' *features* and `Y_xxx` contains the examples' *labels*.

In [None]:
X = df.drop("Survived", axis=1)
Y = df["Survived"]

X_train = X.iloc[:-10]
Y_train = Y.iloc[:-10]

X_test = X.iloc[-10:]
Y_test = Y.iloc[-10:]

X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
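Taking the last ten rows as the testing set is simple, but it relies on the row order being unrelated to the labels. Scikit-learn provides `train_test_split`, which shuffles the examples before splitting. The following is a sketch on hypothetical data; the 25% test size and the `random_state` value are arbitrary choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature table and labels (illustrative only).
X_demo = pd.DataFrame({'f1': np.arange(20), 'f2': np.arange(20) * 2})
Y_demo = pd.Series(np.arange(20) % 2, name='label')

# Shuffle the rows, then keep 25% of them for testing.
# random_state makes the split reproducible.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.25, random_state=0, shuffle=True)

print(X_tr.shape, X_te.shape)
```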

### Exercise 9

Explain why it is necessary to split the dataset into these two subsets (training set and testing set).

The most important task of our model is to be able to generalize predictions on new examples, not to be excellent on the training dataset. Thus we should evaluate it with data it has not used for the training.

The classifier is created and trained with the following code:

In [None]:
decision_tree = sklearn.tree.DecisionTreeClassifier()

decision_tree.fit(X_train, Y_train)

Then the success rate of predictions is checked on both sets:

In [None]:
decision_tree.score(X_train, Y_train)

In [None]:
decision_tree.score(X_test, Y_test)
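For a classifier, `score` returns the mean accuracy, i.e. the fraction of examples whose predicted label equals the true label. The sketch below checks this on hypothetical toy data; `clf`, `X_demo` and `Y_demo` are illustrative names.

```python
import numpy as np
import sklearn.tree

# Hypothetical toy data: one feature, the label is 1 when the feature exceeds 5.
X_demo = np.arange(10).reshape(-1, 1)
Y_demo = (X_demo.ravel() > 5).astype(int)

clf = sklearn.tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_demo, Y_demo)

# Fraction of correctly predicted labels, computed by hand.
manual_accuracy = (clf.predict(X_demo) == Y_demo).mean()
print(manual_accuracy, clf.score(X_demo, Y_demo))
```

An unconstrained decision tree can usually fit its training set almost perfectly, which is why the score on the testing set is the more informative of the two.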

Finally we make some predictions and compare them to the truth:

In [None]:
Y_pred = decision_tree.predict(X_test)

pd.DataFrame(np.array([Y_pred, Y_test]).T, columns=('Predicted', 'Actual'))

### Plot feature importances (for the trained model)

The following code plots the relative importance of features for the prediction task (for the trained model).

In [None]:
imp = pd.DataFrame(decision_tree.feature_importances_,
                   index=X_train.columns,
                   columns=['Importance'])
imp.sort_values(['Importance'], ascending=False)

In [None]:
imp.sort_values(['Importance'], ascending=False).plot.bar();
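The importances are relative: they are normalized to sum to 1, and a feature that is never used in a split gets 0. The sketch below illustrates this on hypothetical toy data where one feature determines the label and the other is constant.

```python
import numpy as np
import pandas as pd
import sklearn.tree

# Hypothetical toy data: the label depends on 'useful' only;
# 'constant' has a single value, so the tree cannot split on it.
X_demo = pd.DataFrame({'useful': np.arange(10), 'constant': np.zeros(10)})
Y_demo = (X_demo['useful'] > 4).astype(int)

clf = sklearn.tree.DecisionTreeClassifier(random_state=0).fit(X_demo, Y_demo)

imp_demo = pd.Series(clf.feature_importances_, index=X_demo.columns)
print(imp_demo)
print(imp_demo.sum())
```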

### Exercise 10

Compare this ranking to the assumptions made during the exploratory analysis.
Did you predict a similar ranking?

### Bonus: display the decision tree

In [None]:
import graphviz            # "conda install python-graphviz"

# Export the trained tree to the Graphviz "dot" format (returned as a string).
dot_data = sklearn.tree.export_graphviz(decision_tree, out_file=None,
                                        feature_names=list(X_train.columns),
                                        class_names=['Died', 'Survived'],
                                        filled=True, rounded=True,
                                        special_characters=True)

graph = graphviz.Source(dot_data, format='png')

In [None]:
graph.view()

<img src="Source.gv.png" />

## Going further

Here is a good tutorial to complete the lab session: [Understanding and diagnosing your machine-learning models](http://gael-varoquaux.info/interpreting_ml_tuto/) (by Gael Varoquaux)