Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). On JupyterLab, you may want to hit the "Validate" button as well.

Caution: do not mess with the notebook's metadata; do not change a pre-existing cell's type; do not copy pre-existing cells (add new ones with the + button instead). This will break autograding; you will get a 0; you are warned.

<table style="width: 100%; border: none;" cellspacing="0" cellpadding="0" border="0">
  <tr>
    <td><img src="https://www.planetegrandesecoles.com/wp-content/uploads/2021/07/Identite%CC%81-visuelle-Plane%CC%80te-BAC-8-600x398.png" style="float: left; width: 100%" />
</td>
    <td><a style="font-size: 3em; text-align: center; vertical-align: middle;" href="https://moodle.polytechnique.fr/course/view.php?id=19260">[CSC2S004EP - 2024] - Introduction to Machine Learning</a>
</td>
  </tr>
</table>

<a style="font-size: 3em;">Lab Session 1: Jupyter Notebooks, Pandas, Sklearn basics</a>

# Introduction

Welcome to the 1st [CSC2S004EP](https://moodle.polytechnique.fr/course/view.php?id=19260) lab session: *Introduction to Machine Learning*.

This lab session assumes you are familiar with the Python programming language.

Note that the 1st year courses used scripting/packaging tools (*e.g.* Spyder) and relied on relatively "low-level" abstractions (lists, numpy arrays, etc.) and manipulation techniques (*e.g.* `for` loops). While knowledge of these concepts and tools is mandatory for this course and for your future use of Python, this course focuses instead on Jupyter Notebooks (more on these later) for writing code, and on "high-level", "more Pythonic" abstractions (e.g. `pandas`, scikit-learn, plotting libraries) as a coding style.

Additional documentation can be found here:
- [Python 3 documentation](https://docs.python.org/3/)
- [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)
- [The Python Standard Library](https://docs.python.org/3/library/index.html)
- [The Python Language Reference](https://docs.python.org/3/reference/index.html)

# Jupyter Notebooks

The document you are reading is a [Jupyter Notebook](https://jupyter.org/).

The Jupyter Notebook (formerly IPython Notebook) is an interactive web application for creating documents that contain live code, equations, visualizations and narrative text. While notebooks are a good way to build a tutorial or a report, you will still need to rely on more conventional IDEs like Spyder or PyCharm when working on a "project" (e.g. when organizing functions in modules to industrialize a data science product).

The notebook consists of a sequence of cells. A cell is a multiline text input field, and **its contents can be executed** by using `Shift+Enter`, or by clicking either the `Play` button on the toolbar, or Cell > Run in the menu bar. The execution behavior of a cell is determined by the cell’s type. There are three types of cells: *code cells*, *markdown cells*, and *raw cells*. Every cell starts off being a code cell, but its type can be changed by using a drop-down on the toolbar (which will be “Code”, initially). **As part of this course, DO NOT change the type of a pre-existing cell, DO NOT duplicate a pre-existing cell**. However, you can create and change the type of any new cell.

## Code cells

A code cell allows you to edit and write new code, with full syntax highlighting and tab completion. The programming language used here is Python.

When a code cell is executed, the code that it contains is sent to the kernel associated with the notebook. The results that are returned from this computation are then displayed in the notebook as the cell’s output. The output is not limited to text: many other forms of output are possible, including figures and HTML tables.

**Tips**:
- press the `Tab` key to use auto completion in code cells
- press `Shift+Tab` to display the documentation of the current object (the one on which the cursor is)

See e.g. https://miykael.github.io/nipype_tutorial/notebooks/introduction_jupyter-notebook.html

## Markdown cells

You can document the computational process in a literate way, alternating descriptive text with code, using rich text. In IPython this is accomplished by marking up text with the Markdown language. The corresponding cells are called Markdown cells. The Markdown language provides a simple way to perform this text markup, that is, to specify which parts of the text should be emphasized (italics), bold, form lists, etc. See [this cheatsheet](https://www.edureka.co/blog/wp-content/uploads/2018/10/Jupyter_Notebook_CheatSheet_Edureka.pdf).

Markdown cells will also render $\LaTeX$ code escaped with \\$\LaTeX\\$.

For more information about Jupyter Notebooks, read its [official documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#structure-of-a-notebook-document).

## Read-only cells

Most cells in these assignments have been marked "read-only" meaning you will not be able to modify their content, to avoid mistakes and guarantee some reproducibility when grading them. If you need to inspect what is going on inside such a cell, create a new cell and copy over the content that you want to run/inspect.

## `NotImplementedError`

Whenever you see a (modifiable) cell with a `raise NotImplementedError` instruction, it means you should implement something! This cell will generally be preceded by a Markdown cell explaining what you need to do. The following cell usually contains visible and/or hidden tests (to test your code). If you see (seemingly) empty cells after a code cell where you implemented something, it usually means there are only hidden tests that are run against your code! 

## Save your work!

Don't forget to regularly save your work! There is some autosaving but hitting `Ctrl+S` will ease your mind.

## Validate button

If you correctly followed the [guidelines to set up your Python environment](https://moodle.polytechnique.fr/mod/page/view.php?id=499510&forceview=1), you should see a "Validate" button in the toolbar, next to the Save, +, Scissors, ... buttons.
When you hit this button, it should turn grey and say "Validating...", then show a message a little later, e.g. "Success! Your notebook passes all the tests!". Please use and abuse this button before submitting your assignment on Moodle.

## Objectives

What we will practice today:
- Pandas basics
- Making an exploratory analysis on a real dataset
- Making a first predictive model for this dataset

# Pandas Basics

[Pandas](http://pandas.pydata.org/) is a popular data analysis toolkit for Python.

We will use it to explore data with heterogeneous types and/or missing values.

Additional documentation can be found here:
- https://pandas.pydata.org/docs/getting_started/index.html
- http://pandas.pydata.org/pandas-docs/stable/
- https://jakevdp.github.io/PythonDataScienceHandbook/
- http://www.jdhp.org/docs/notebook/python_pandas_en.html

## Import directives

To begin with, let's import the Pandas library as the *pd* alias. Select the following code cell and execute it with the `Shift + Enter` shortcut key.

We also import:
- Numpy to generate arrays
- Seaborn to make plots
- Sklearn to learn from data
- Matplotlib to plot some stuff

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

## Make DataFrames (2D data)

Let's make a *DataFrame* now. DataFrame is the name given to 2D arrays in Pandas. We will assign this DataFrame to the `df` variable and display it (with the `df` line at the end of the cell).

When the data carries a meaning, i.e. comes from a real data source (e.g. the `titanic` dataset that we will load later on, which consists of characteristics of passengers of the Titanic), always use explicit names such as `df_titanic`. See [PEP8](https://www.python.org/dev/peps/pep-0008/), a Python code style guide, for details.

In [None]:
data = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(data)
df

The previous command has made a DataFrame with automatic *indices* (rows index) and *columns* (columns label).

To make a DataFrame with a custom index (`pandas` allows you to define multiple indices gathered in a `MultiIndex`, but this feature will not be covered in this course) and custom column names, use:

In [None]:
data = [[1, 2, 3], [4, 5, 6]]
index = [10, 20]
columns = ['A', 'B', 'C']

df = pd.DataFrame(data=data, index=index, columns=columns)
df

A Python dictionary can be used to define data; its keys define its **columns'** labels.

In [None]:
data_dict = {'A': 'foo',
             'B': [10, 20, 30],
             'C': 3.14}
df = pd.DataFrame(data_dict, index=[10, 20, 30])
df

### Get information about a dataframe

Its (row) index (possibly a `MultiIndex`):

In [None]:
df.index

Its columns' labels (also an index!):

In [None]:
df.columns

Its shape (i.e. number of rows and number of columns):

In [None]:
df.shape

The number of lines in `df` is:

In [None]:
df.shape[0]

Or equivalently:

In [None]:
len(df)

The number of columns in `df` is:

In [None]:
df.shape[1]

The data type of each column:

In [None]:
df.dtypes

Additional information about columns:

In [None]:
df.info()

In [None]:
df.describe()

### Select a single column

Here are 3 equivalent syntaxes to get the column "C":

In [None]:
df.C

In [None]:
df["C"]

In [None]:
df.loc[:, "C"]

Note that a "column" corresponds to a `pandas.Series` object:

In [None]:
type(df["C"])

Selecting multiple columns, or a list containing at least one column, will output a `pandas.DataFrame`. This is subtle but may be the cause of subsequent errors.

In [None]:
df[["C"]]

In [None]:
type(df[["C"]])
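The difference between the two results is easiest to see through their shapes. The following is a minimal sketch on a hypothetical toy frame (named `demo` here so that the notebook's `df` is left untouched):

```python
import pandas as pd

# Hypothetical toy frame, named `demo` so that the notebook's `df` is not overwritten.
demo = pd.DataFrame({"A": "foo", "B": [10, 20, 30], "C": 3.14}, index=[10, 20, 30])

as_series = demo["C"]     # a single label gives a pandas.Series (1D)
as_frame = demo[["C"]]    # a list of labels gives a pandas.DataFrame (2D)

print(as_series.shape, as_series.ndim)  # (3,) 1
print(as_frame.shape, as_frame.ndim)    # (3, 1) 2
```

Functions that expect a 2D table of features typically need the second form, even when there is only one column.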

### Select multiple columns

Here are 2 equivalent syntaxes to get both columns "A" and "B":

In [None]:
df[['A', 'B']]

In [None]:
df.loc[:, ['A', 'B']]

### Select a single row

In [None]:
df.loc[10]

In [None]:
df.loc[10, :]

Note that again, the result being a "vector", the output is of type `pandas.Series`:

In [None]:
type(df.loc[10, :])

### Select multiple rows

In [None]:
df.loc[[10, 20], :]

### Select rows based on values

In [None]:
df.B < 30.  # this is a Series

In [None]:
df.loc[df.B < 30.]  # this is a DataFrame

In [None]:
df.loc[(df.B < 20) | (df.B >= 30)]

In [None]:
df.loc[(df.B >= 20) & (df.B < 30)]
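The conditions above are boolean Series combined element-wise with `|` (or), `&` (and) and `~` (not). The sketch below, on a hypothetical toy frame named `demo`, spells out the intermediate masks and shows why the parentheses matter:

```python
import pandas as pd

# Hypothetical toy frame, named `demo` so that the notebook's `df` is not overwritten.
demo = pd.DataFrame({"B": [10, 20, 30]}, index=[10, 20, 30])

low = demo.B < 20     # boolean Series: True, False, False
high = demo.B >= 30   # boolean Series: False, False, True

either = low | high   # element-wise OR
both = low & high     # element-wise AND
neither = ~either     # element-wise NOT

# Parentheses are required: `|` and `&` bind tighter than `<` and `>=`.
selected = demo.loc[(demo.B < 20) | (demo.B >= 30)]
print(selected)

# Python's `or` / `and` keywords do not work here: `low or high` raises a
# ValueError because the truth value of a whole Series is ambiguous.
```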

### Select rows and columns

In [None]:
df.loc[(df.B < 20) | (df.B >= 30), 'C']  # this is a Series

In [None]:
df.loc[(df.B < 20) | (df.B >= 30), ['A', 'B']]  # this is a DataFrame

### Select columns based on a condition

Which column of the row labeled `10` (here, the 0th row) takes the value `'foo'`?

In [None]:
df.loc[10] == 'foo'

Of those, which value(s) are `True`?

**The 0th**

In [None]:
np.where(df.loc[10] == 'foo')

To which column name (or `Index`) does column 0 correspond?

In [None]:
df.columns[np.where(df.loc[10] == 'foo')]

In [None]:
df.columns[np.where(df.loc[10] == 'foo')].to_list()
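As an alternative to `np.where`, the same result can be obtained with boolean indexing on the row itself. This is only a sketch on a hypothetical toy frame named `demo`, not the approach used in the cells above:

```python
import pandas as pd

# Hypothetical toy frame mirroring the one built earlier from a dictionary.
demo = pd.DataFrame({"A": "foo", "B": [10, 20, 30], "C": 3.14}, index=[10, 20, 30])

row = demo.loc[10]                    # Series whose index holds the column labels
mask = row == "foo"                   # boolean Series: one value per column
matching = row[mask].index.to_list()  # keep the labels where the mask is True
print(matching)                       # ['A']
```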

### Apply a function to selected columns' values

In [None]:
df.B *= 2.
df

In [None]:
df.B = pow(df.B, 2)
df

### Apply a function to selected rows' values

In [None]:
df.loc[df.B < 500., 'A'] = -1
df

In [None]:
df.loc[(df.B < 500.) | (df.B > 2000), 'C'] = 0
df

### Handling missing data

Missing data are represented by a *NaN* ("Not a Number").

In [None]:
# equivalently, use np.nan
data = [[3, 2, 3],
        [float("nan"), 4, 4],
        [5, float("nan"), 5],
        [float("nan"), 3, 6],
        [7, 1, 1]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df

To obtain the boolean mask where values are NaN, type:

In [None]:
df.isnull()
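A dedicated method is needed because NaN never compares equal to anything, including itself, so a test such as `df == np.nan` finds nothing. A minimal sketch on a hypothetical frame named `demo`:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False: NaN is not equal to itself

# Hypothetical toy frame with a single missing value.
demo = pd.DataFrame({"A": [3.0, np.nan, 5.0]})

found_by_equality = (demo == np.nan).sum().sum()  # comparison misses the NaN
found_by_isnull = demo.isnull().sum().sum()       # isnull() detects it
print(found_by_equality, found_by_isnull)         # 0 1
```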

To drop any rows that have missing data:

In [None]:
df.dropna()

Note: this does not replace df (as suggested by the fact that a DataFrame is printed: something is returned!).

In [None]:
df

If you wish to overwrite `df`, you may reassign (`df = ...`) or use `inplace=True` (common to lots of `pandas`' methods).
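The two options can be compared on a small example. This sketch uses hypothetical names (`demo`, `cleaned`, `demo_inplace`) so that the notebook's `df` is left as it is:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: only the first row is complete.
demo = pd.DataFrame({"A": [3.0, np.nan, 5.0], "B": [2.0, 4.0, np.nan]})

# Option 1: keep the returned DataFrame; `demo` itself is unchanged.
cleaned = demo.dropna()
print(len(demo), len(cleaned))  # 3 1

# Option 2: modify the object itself with inplace=True.
demo_inplace = demo.copy()
demo_inplace.dropna(inplace=True)
print(len(demo_inplace))        # 1
```

Reassigning is often easier to follow, since each line makes explicit which object holds the result.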

To fill missing data with a chosen value (e.g. 999):

In [None]:
df.fillna(value=999)  # Again, this returns a new DataFrame

To count the number of NaN values in each column:

In [None]:
df.isnull().sum()

The `axis` keyword (common to lots of `pandas`' methods) allows you to do the summation across each line rather than down each column:

In [None]:
df.isnull().sum(axis=1)
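The effect of `axis` is perhaps clearer on plain numbers. In this sketch (hypothetical frame `demo`), the default `axis=0` gives one result per column and `axis=1` gives one result per line:

```python
import pandas as pd

# Hypothetical toy frame with two lines and three columns.
demo = pd.DataFrame({"A": [1, 2], "B": [10, 20], "C": [100, 200]})

col_sums = demo.sum()        # axis=0 (default): one sum per column
row_sums = demo.sum(axis=1)  # axis=1: one sum per line

print(col_sums.to_dict())    # {'A': 3, 'B': 30, 'C': 300}
print(row_sums.to_list())    # [111, 222]
```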

### Exercise 1

In [None]:
df = pd.DataFrame([[i + j * 10 for i in range(10)] for j in range(20)],
                  columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"])
df

Considering the dataframe `df` created in the previous cell:
- Write the code to extract the column 'C' and assign this subarray to a variable `df2` **of type DataFrame**.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(df2) == pd.DataFrame

- Write the code to extract the columns 'A' and 'F' and assign this sub-`DataFrame` to a variable `df3`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(df3) == pd.DataFrame

- Write the code to extract lines where the value in column 'D' is greater than 100 and assign this sub-`DataFrame` to a variable `df4`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(df4) == pd.DataFrame

- Write the code to extract lines where the value in column 'D' is greater than 100 or less than 40 and assign this sub-`DataFrame` to a variable `df5`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(df5) == pd.DataFrame

- Write the code to extract lines where the value in column 'D' is less than 100 and greater than 40 and assign this sub-`DataFrame` to a variable `df6`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(df6) == pd.DataFrame

- Write the code to extract the values of columns 'D' and 'E' of lines where column 'B' is greater than 100 or less than 40 and assign this sub-`DataFrame` to a variable `df7`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(df7) == pd.DataFrame

# Exploratory analysis of the Titanic dataset

## Problem description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this lab session, you will complete the analysis of what sorts of people were likely to survive and you will apply machine learning methods to predict which passengers survived the tragedy.

([Description from Kaggle](https://www.kaggle.com/c/titanic))

## Load data

We start by loading the dataset into the `titanic` Pandas DataFrame.

In [None]:
titanic = pd.read_csv("https://raw.githubusercontent.com/ogrisel/parallel_ml_tutorial/master/notebooks/titanic_train.csv")
titanic

## Variables description

Here are the *features* available in the dataset:

In [None]:
for column in titanic.columns:
    print(column)

- *Survived*: survived (1) or died (0)
- *Pclass*: passenger's class
- *Name*: passenger's name
- *Sex*: passenger's sex
- *Age*: passenger's age
- *SibSp*: number of siblings/spouses aboard
- *Parch*: number of parents/children aboard
- *Ticket*: ticket number
- *Fare*: fare
- *Cabin*: cabin
- *Embarked*: port of embarkation
  - C = Cherbourg
  - Q = Queenstown
  - S = Southampton


([Description from Kaggle](https://www.kaggle.com/c/titanic/data))

## Exercise 2

- List the *categorical* features in the Titanic dataset and store them in `titanic_categorical` (of type `list`).

**Hint**: the types of all features can be retrieved with `titanic.dtypes`. Which one(s) are **not** *numerical*?

**Warning**: a systematic approach is expected, DO NOT simply copy/paste column names from above.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- List the *numerical* features and store them in `titanic_numerical`. **Hint**: if they're numerical, they're not *categorical*.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- Which features contain blank, null or empty values? Store a list of them in `titanic_features_blank`.

**Warning**: a list of feature names is expected.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- We want to complete the analysis of what sorts of people were likely to survive, i.e. to predict for a new passenger whether he/she will (or would have...) survive(d) or not. What kind of machine learning problem is it?
- A supervised / unsupervised problem?
- A classification / regression / clustering / reinforcement learning problem?
- **Why?**

YOUR ANSWER HERE

## Display a quick summary of the dataset

In [None]:
titanic.describe()

## Exercise 3

- What is the age of the oldest passenger (or crew)? Store it in `age_oldest_on_titanic`.

**Warning**: use a systematic approach. DO NOT simply copy/paste from the table above. A `float` is expected.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- What is the average fare? Store it in `average_fare_on_titanic`.

**Warning**: use a systematic approach. DO NOT simply copy/paste from the table above. A `float` is expected.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

- What are the fare quartiles (the 25%, 50%, 75% quantiles)?

**Warning**: use a systematic approach. DO NOT simply copy/paste from the table above. A `Series` is expected.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Explore correlations between the numerical features and survival of passengers

Let us start by understanding correlations between numerical features and the label we want to predict (Survived).

### Correlation survival vs age

In [None]:
bins = np.arange(0, 100, 5)  # this controls the width of the age categories
print("Bins:", bins)

ax = titanic.loc[titanic.Survived == 0, "Age"].hist(  # we select dead people's rows and use hist(ogram)
    bins=bins, color="red", alpha=0.5, label="died", figsize=(18, 8))  # let's plot them in red
titanic.loc[titanic.Survived == 1, "Age"].hist(  # we select surviving people's rows and use hist(ogram)
    bins=bins, color="blue", ax=ax, alpha=0.5, label="survived")  # let's plot them in blue
ax.legend();  # let's add a legend
plt.show()

Don't forget to check for missing values, as they could bias the plot above!

In [None]:
titanic.loc[(titanic.Age.isnull()) & (titanic.Survived == 0)].shape[0]

In [None]:
titanic.loc[(titanic.Age.isnull()) & (titanic.Survived == 1)].shape[0]

## Exercise 4

What useful qualitative observation can you extract from the previous plot?

YOUR ANSWER HERE

## Exercise 5

Plot correlation between Fare and Survival. Use the code from the previous cell as a starting point.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

What useful observation can you extract from the previous plot?

YOUR ANSWER HERE

## Explore correlations between the categorical features and survival of passengers

### Correlation survival vs class

In [None]:
g = sns.catplot(y="Survived", x="Pclass", kind="bar", data=titanic)
g.set_ylabels("survival probability");

Survival probability is simply the ratio of surviving passengers in each class.
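Because `Survived` only takes the values 0 and 1, this ratio is the mean of the column within each class, which is what the bars show (the mean is the default estimator of a seaborn bar plot). A minimal sketch on made-up data (the `toy` frame below is hypothetical and unrelated to the real dataset):

```python
import pandas as pd

# Made-up data for illustration only: six passengers in three classes.
toy = pd.DataFrame({"Pclass":   [1, 1, 2, 2, 2, 3],
                    "Survived": [1, 0, 1, 1, 0, 0]})

# Mean of a 0/1 column = number of survivors / number of passengers.
rates = toy.groupby("Pclass")["Survived"].mean()
print(rates)

# Same value computed by hand for class 2: 2 survivors out of 3 passengers.
in_class_2 = toy.loc[toy.Pclass == 2, "Survived"]
manual_rate_2 = in_class_2.sum() / len(in_class_2)
```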

## Exercise 6

What useful observation can you extract from the previous plot?

YOUR ANSWER HERE

Plot correlation between
* Sex and Survival
* SibSp and Survival
* Parch and Survival
* Embarked and Survival

Use the code from the previous cell as a starting point.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

What useful observation can you extract from these plots?

YOUR ANSWER HERE

## Exercise 7

The following code computes the survival rate of women in the first class, i.e. an estimate of $P(\text{Survived} = 1 | \text{Sex}=\text{female}, \text{Pclass}=1)$.

Note how we're making good use of the (frustrating) first exercises.

In [None]:
p_survived_1_given_female_1 = titanic.loc[(titanic.Sex == "female") & (titanic.Pclass == 1), "Survived"].mean()
p_survived_1_given_female_1

Compute the survival rate corresponding to:

- $P(\text{Survived} = 1 | \text{Sex}=\text{female}, \text{Pclass}=2)$ and store it in `p_survived_1_given_female_2`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
p_survived_1_given_female_2

- $P(\text{Survived} = 1 | \text{Sex}=\text{female}, \text{Pclass}=3)$ and store it in `p_survived_1_given_female_3`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
p_survived_1_given_female_3

- $P(\text{Survived}=1 | \text{Sex}=\text{male}, \text{Pclass}=1)$ and store it in `p_survived_1_given_male_1`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
p_survived_1_given_male_1

- $P(\text{Survived} = 1 | \text{Sex}=\text{male}, \text{Pclass}=2)$ and store it in `p_survived_1_given_male_2`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
p_survived_1_given_male_2

- $P(\text{Survived}=1 | \text{Sex}=\text{male}, \text{Pclass}=3)$ and store it in `p_survived_1_given_male_3`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
p_survived_1_given_male_3

The following plots display the survival rate considering multiple variables. This might exhibit "multivariate" or "compound" effects on survival.

In [None]:
g = sns.catplot(x="Pclass", y="Survived", hue="Sex", kind="bar", data=titanic)  # keep the returned grid in g
g.set_ylabels("survival probability");

In [None]:
g = sns.catplot(y="Survived", x="Embarked", hue="Pclass", kind="bar", data=titanic)  # keep the returned grid in g
g.set_ylabels("survival probability");

Explore other combinations of features with similar plots.

After considering these plots, which variables do you expect to be good predictors of survival?

YOUR ANSWER HERE

# Make a predictive model

After this brief exploration of the dataset, let's try to train a model to predict the survival of "new" (i.e. unknown) passengers.

A *decision tree* classifier will be used to complete this task.

To begin with, import the decision tree package (named `tree`) implemented in Scikit Learn (a.k.a. `sklearn`).

In [None]:
import sklearn.tree

Then reload the dataset:

In [None]:
titanic = pd.read_csv("https://raw.githubusercontent.com/ogrisel/parallel_ml_tutorial/master/notebooks/titanic_train.csv")

## Exercise 8

Based on investigations made in the exploratory analysis, which variables can be ignored (they cannot convey any meaningful information - think IDs...)? Complete the following cell to remove useless features.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Once you have removed useless variables, remove examples with missing values from the dataset.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Let's finish the dataset cleaning by converting categorical features to numerical ones. This is often required by Machine Learning algorithms' implementations (in particular in `sklearn`) since most of these algorithms rely on matrix calculus / algebra, as we will see later on in this course.

In [None]:
titanic['Embarked'] = titanic['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1}).astype(int)
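To see what `map` does with a dictionary, here is a minimal sketch on a hypothetical Series (`sex_demo`). It also illustrates why missing values have to be removed first: any value absent from the dictionary becomes NaN, and a column containing NaN cannot be converted with `.astype(int)`.

```python
import pandas as pd

# Hypothetical Series; "unknown" is deliberately absent from the mapping.
sex_demo = pd.Series(["male", "female", "female", "unknown"])

encoded = sex_demo.map({"male": 0, "female": 1})
print(encoded)  # 0.0, 1.0, 1.0, NaN (float dtype because of the NaN)

# Number of values that were not covered by the dictionary.
n_unmapped = encoded.isnull().sum()
print(n_unmapped)  # 1
```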

The following cell gives an overview of our dataset (only the first lines displayed).

In [None]:
titanic.head()

Now that the dataset is ready, we split it in two subsets:
- the training set (`X_train` and `Y_train`);
- the testing set (`X_test` and `Y_test`).

`X_xxx` contains the examples' *features* and `Y_xxx` contains the examples' *labels*.

In [None]:
X = titanic.drop("Survived", axis=1)  # X cannot contain the label
Y = titanic["Survived"]

X_train = X.iloc[:-10]  # all samples except 10 last ones
Y_train = Y.iloc[:-10]

X_test = X.iloc[-10:]  # 10 last samples
Y_test = Y.iloc[-10:]

X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
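The cell above splits by position (the last 10 lines form the test set). A common alternative, not used in this notebook, is a shuffled split with `train_test_split` from `sklearn.model_selection`. The sketch below runs on made-up data; the names `X_demo`, `y_demo` and the `_tr` / `_te` suffixes are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 50 examples, 2 features, binary labels.
X_demo = pd.DataFrame({"x1": np.arange(50), "x2": np.arange(50) * 2})
y_demo = pd.Series(np.arange(50) % 2)

# test_size=10 keeps 10 examples for testing; random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=10, random_state=0)

print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)  # (40, 2) (40,) (10, 2) (10,)
```

Shuffling avoids depending on the order in which the lines happen to be stored in the file.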

## Exercise 9

Explain qualitatively your intuition about why it is necessary to split the dataset in these two subsets (training set - think **past** results of dice rolls {1, 2, 1} - and testing set - think **new** dice rolls {6, 4}) to understand the behavior / performance of a Machine Learning model (think predicting the average of a dice throw).

YOUR ANSWER HERE

The classifier is instantiated (it is now an object of a class) and trained (thanks to a method implemented in this class) with the following code:

In [None]:
decision_tree = sklearn.tree.DecisionTreeClassifier()

decision_tree.fit(X_train, Y_train)

Note that in `sklearn` most algorithms are implemented as classes.

A particular instance is created, and we call its methods, such as `fit`, `transform` or `predict`, on appropriate arguments to perform operations.

Then the success rate of predictions is checked on both sets:

In [None]:
decision_tree.score(X_train, Y_train)

In [None]:
decision_tree.score(X_test, Y_test)
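For a classifier, `score` returns the accuracy, i.e. the fraction of examples whose predicted label equals the true label. The sketch below checks this by hand on made-up data (all names ending in `_demo` or `_new` are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up, perfectly separable training data: one feature, two classes.
X_demo = np.array([[0], [1], [2], [3], [4], [5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Made-up "new" examples with their true labels.
X_new = np.array([[1.5], [2.4], [4.5], [0.5]])
y_new = np.array([0, 1, 1, 0])

# Accuracy computed by hand: proportion of correct predictions.
manual_accuracy = (model.predict(X_new) == y_new).mean()
print(manual_accuracy, model.score(X_new, y_new))  # the two values agree
```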

Finally we make some predictions and compare them to the truth:

In [None]:
Y_pred = decision_tree.predict(X_test)

pd.DataFrame(np.array([Y_pred, Y_test]).T, columns=('Predicted', 'Actual'))

## Plot feature importances (for the trained model)

The following code plots the relative importance of features for the prediction task (for the trained model).

(What "importance" means is beyond the scope of this lab; the interested reader might refer to `sklearn`'s documentation)

In [None]:
imp = pd.DataFrame(decision_tree.feature_importances_,
                   index=X_train.columns,
                   columns=['Importance'])
imp.sort_values(['Importance'], ascending=False)

In [None]:
imp.sort_values(['Importance'], ascending=False).plot.bar();

## Exercise 10

Compare this list to the assumptions made during the exploratory analysis.
Did you predict a similar ranking?

YOUR ANSWER HERE

## Display the decision tree

In [None]:
from sklearn import tree
tree.plot_tree(decision_tree);

Something should be striking and feel wrong on this plot... Try to think about what could have happened!

# Going further

Here is a good tutorial to complete the lab session: [Understanding and diagnosing your machine-learning models](http://gael-varoquaux.info/interpreting_ml_tuto/) (by Gael Varoquaux)