<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-the-needed-Libraries" data-toc-modified-id="Import-the-needed-Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import the needed Libraries</a></span></li><li><span><a href="#Data-leakage" data-toc-modified-id="Data-leakage-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data leakage</a></span><ul class="toc-item"><li><span><a href="#Think-🤔" data-toc-modified-id="Think-🤔-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Think 🤔</a></span></li><li><span><a href="#Think-🤔" data-toc-modified-id="Think-🤔-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Think 🤔</a></span></li><li><span><a href="#Check-your-understanding" data-toc-modified-id="Check-your-understanding-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Check your understanding</a></span></li><li><span><a href="#How-to-access-the-value-of-a-cell" data-toc-modified-id="How-to-access-the-value-of-a-cell-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>How to access the value of a cell</a></span></li></ul></li><li><span><a href="#Specifying-the-column-and-the-line" data-toc-modified-id="Specifying-the-column-and-the-line-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Specifying the column and the line</a></span></li><li><span><a href="#The-.loc()-method" data-toc-modified-id="The-.loc()-method-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The .loc() method</a></span></li><li><span><a href="#Mask" data-toc-modified-id="Mask-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Mask</a></span><ul class="toc-item"><li><span><a href="#Select-data-by-masks" data-toc-modified-id="Select-data-by-masks-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Select data by masks</a></span></li></ul></li><li><span><a href="#Statistics" data-toc-modified-id="Statistics-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Statistics</a></span></li></ul></div>

# Titanic data

## Import the needed Libraries

In [19]:
import pandas as pd

# Load Titanic dataset

In [23]:
titanic_filename = 'titanic.csv'
titanic = pd.read_csv(titanic_filename, sep=',', header=0)
titanic

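| | PassengerID | PClass | Age | Sex | Survived |
|---|---|---|---|---|---|
| 0 | 1 | 1st | 29.00 | female | 1 |
| 1 | 2 | 1st | 2.00 | female | 0 |
| 2 | 3 | 1st | 30.00 | male | 0 |
| 3 | 4 | 1st | 25.00 | female | 0 |
| 4 | 5 | 1st | 0.92 | male | 1 |
| ... | ... | ... | ... | ... | ... |
| 1308 | 1309 | 3rd | 27.00 | male | 0 |
| 1309 | 1310 | 3rd | 26.00 | male | 0 |
| 1310 | 1311 | 3rd | 22.00 | male | 0 |
| 1311 | 1312 | 3rd | 24.00 | male | 0 |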


## Data leakage

Data **leakage** is a big problem in machine learning when developing predictive models.

Data **leakage** occurs when a predictive model is trained on information that is present in the training data but would not be available at prediction time in production.

In our example, the goal is to predict whether a passenger survived the sinking of the Titanic.

### Think 🤔
What information will likely cause leakage?
1. Passenger ID
1. Age
1. Class
1. Sex

If you load the file in the default way, PassengerID ends up as an ordinary column, so it would be treated as a feature. The answer is therefore `1`.

In [44]:
titanic.head(4)
titanic.isna().value_counts()
titanic.dtypes

PassengerID      int64
PClass          object
Age            float64
Sex             object
Survived         int64
dtype: object

Source: A. Boschetti and L. Massaron, Chapters 2 and 6

Technically, nothing is wrong here.

But an index should NOT be mistaken for a feature. An index is a unique identifier: each value points to exactly one row, and therefore to exactly one outcome.

The passenger ID is the index. That means that if you know a passenger’s ID, you can look up whether that passenger survived the Titanic.

So keeping it as a column is not wrong, but it is not ‘useful’. The model cannot “learn” anything general from the PassengerID column.

What you need to do is keep PassengerID separate from the features in the data set.

### Think 🤔
> In a prostate cancer dataset, PROSSURG (whether the patient had received prostate surgery) could cause leakage. Why is that so?

If someone has had prostate surgery, it is likely that the person has prostate cancer. This variable is not useful for the machine to learn from, because the information will not be available in fresh data.

The same reasoning applies to an index column such as PassengerID: if it is used during the learning phase of your model, you may incur a case of "leakage", which is one of the major sources of error in machine learning.

If the index is just a random number, no harm is done to your model's efficacy.

However, if the index contains progressive, temporal, or otherwise informative elements (for example, certain numeric ranges used for positive outcomes and others for negative ones), you might incorporate leaked information into the model. The model will then be unable to reproduce its predictions on fresh data.
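To make the "numeric ranges" case concrete, here is a minimal sketch on made-up data (the `record_id` column, the values, and the `predict_from_id` helper are all hypothetical and not taken from the Titanic file). It assumes IDs were handed out with the positive outcomes first, so a rule that looks only at the ID appears perfect on the training rows and falls apart on new ones.

```python
import pandas as pd

# Hypothetical training data: IDs were assigned in order, survivors first.
train = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "age":       [29, 41, 35, 52, 23, 60],
    "survived":  [1, 1, 1, 0, 0, 0],
})

def predict_from_id(ids, threshold=3):
    """A 'model' that only looks at the ID: low IDs are predicted to survive."""
    return (ids <= threshold).astype(int)

train_accuracy = (predict_from_id(train["record_id"]) == train["survived"]).mean()

# Hypothetical fresh data: new records get new IDs, unrelated to the outcome.
fresh = pd.DataFrame({
    "record_id": [7, 8, 9, 10],
    "age":       [31, 47, 19, 66],
    "survived":  [1, 0, 1, 1],
})
fresh_accuracy = (predict_from_id(fresh["record_id"]) == fresh["survived"]).mean()

print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"accuracy on fresh data:    {fresh_accuracy:.2f}")
```

The rule never looked at `age` at all; everything it "knew" came from how the IDs happened to be assigned.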

Therefore, while loading such a dataset, we might want to specify that PassengerID is the index column.

We can do this by passing the column name to `index_col` (since PassengerID is the first column, `index_col=0` would select the same column):

In [79]:
titanic = pd.read_csv('titanic.csv',sep=',', index_col="PassengerID")
titanic.head()

| PassengerID | PClass | Age | Sex | Survived |
|---|---|---|---|---|
| 1 | 1st | 29.0 | female | 1 |
| 2 | 1st | 2.0 | female | 0 |
| 3 | 1st | 30.0 | male | 0 |
| 4 | 1st | 25.0 | female | 0 |
| 5 | 1st | 0.92 | male | 1 |


Note: [.read_csv documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)

### Check your understanding
1. The index column is not a feature.
1. We need to leave the index column out during the learning/training phase.
1. The index can be useful for making predictions.
1. To solve this problem, we can specify the index column while loading the data set in pandas.

Only statement `3` is incorrect. An index can look incredibly predictive on the training data, but that apparent signal is not useful for predicting new or unseen instances.
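As a quick check of statements 1 and 4, the sketch below loads three rows copied from the table above out of an in-memory string (the `csv_text` variable stands in for the real file, purely so the example is self-contained). It shows that PassengerID disappears from the columns once it is declared as the index.

```python
import io
import pandas as pd

# Three rows from the table above, held in memory instead of titanic.csv.
csv_text = """PassengerID,PClass,Age,Sex,Survived
1,1st,29.0,female,1
2,1st,2.0,female,0
3,1st,30.0,male,0
"""

# Default load: PassengerID is an ordinary column (a would-be feature).
as_feature = pd.read_csv(io.StringIO(csv_text))

# Declaring the index at load time keeps it out of the columns.
as_index = pd.read_csv(io.StringIO(csv_text), index_col="PassengerID")

print(list(as_feature.columns))
print(list(as_index.columns))
print(as_index.index.name)

# The same fix for a DataFrame that has already been loaded.
fixed = as_feature.set_index("PassengerID")
```

`set_index` is handy when the data has already been read and you only notice the identifier column afterwards.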

### How to access the value of a cell

To access the value of a cell, we need to understand a special data structure: the DataFrame.

When the file is loaded, its contents are stored in a DataFrame. You can think of a DataFrame this way:

> A DataFrame is like a table or an Excel sheet. It is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes.

<img src="dataframe.png" width=400>

[reference](http://stackoverflow.com/questions/25773245/ambiguity-in-pandas-dataframe-numpy-array-axis-definition)

> `axis=0` refers to the rows (the index) and `axis=1` refers to the columns.
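The axis convention is easy to mix up, so here is a small sketch on a made-up DataFrame (the `scores` table and its labels are hypothetical). It shows the two common situations: dropping labels along an axis, and aggregating along an axis.

```python
import pandas as pd

scores = pd.DataFrame(
    {"a": [1, 2, 3], "b": [10, 20, 30]},
    index=["x", "y", "z"],
)

# Dropping: axis=0 looks for the label among the rows, axis=1 among the columns.
without_row = scores.drop("x", axis=0)
without_col = scores.drop("b", axis=1)

# Aggregating: axis=0 moves down the rows, giving one value per column;
# axis=1 moves across the columns, giving one value per row.
col_totals = scores.sum(axis=0)
row_totals = scores.sum(axis=1)

print(col_totals)
print(row_totals)
```

A way to remember it: the axis you pass is the one that gets consumed, so `sum(axis=0)` collapses the rows and leaves the column labels.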

These are a few ways to access the value of a cell.

## Specifying the column and the line
You can simply specify the column and then the row (by its index label) you are interested in.

To extract the Age of the fourth row (the one labelled `PassengerID=4`), you can use the following command:

In [6]:
titanic['Age'][4]

25.0

Be careful with this operation: a DataFrame is not a matrix, so you might be tempted to give the row first and then the column.

**Remember** that this is a pandas DataFrame: the `[ ]` operator first selects a column, and the second `[ ]` then selects an element of the resulting pandas Series.

In [7]:
Age_s = titanic['Age']

In [8]:
Age_s[4]

25.0

In [9]:
Sex = titanic['Sex']
Sex[5]

'male'
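To see why the order matters, the following sketch rebuilds the first four rows of the table in a small DataFrame (the name `people` is just for illustration) and tries both orders. Asking for the row first fails, because `[ ]` on a DataFrame looks for a column called `4`.

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [29.0, 2.0, 30.0, 25.0], "Sex": ["female", "female", "male", "female"]},
    index=pd.Index([1, 2, 3, 4], name="PassengerID"),
)

age_column = people["Age"]   # first [ ]: select a column, giving a Series
age_of_4 = age_column[4]     # second [ ]: look up label 4 in that Series

# Row first does not work: there is no column named 4.
try:
    people[4]["Age"]
    row_first_worked = True
except KeyError:
    row_first_worked = False

print(age_of_4)
print(row_first_worked)
```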

## The .loc() method

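As an alternative to the preceding way of accessing data, you can use `.loc`. Note that, despite the heading, `.loc` is an indexer used with square brackets, not a method called with parentheses.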

In [10]:
titanic.loc[4, 'Age']

25.0

With `.loc`, you specify the row label (the index value) first and then the column or columns you're interested in.
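Beyond a single cell, `.loc` also accepts lists and slices of labels. The sketch below uses a four-row stand-in for the Titanic table (the variable names are illustrative only).

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [29.0, 2.0, 30.0, 25.0], "Sex": ["female", "female", "male", "female"]},
    index=pd.Index([1, 2, 3, 4], name="PassengerID"),
)

single = people.loc[4, "Age"]                  # one cell: row label 4, column Age
several = people.loc[[1, 4], ["Age", "Sex"]]   # lists of labels give a DataFrame
label_slice = people.loc[2:3, "Age"]           # label slices include both ends

print(single)
print(several)
print(label_slice)
```

Note that `2:3` selects the labels 2 and 3, unlike ordinary Python slicing, which would stop before 3.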

## Mask
If you need to apply a function to only some of the rows, you can create a mask.
> A mask is a Series of Boolean values (True or False) that tells whether each row is selected or not.

### Select data by masks

In [113]:
survived = titanic['Survived'] == 1
females = titanic['Sex'] == 'female'
males = titanic['Sex'] == 'male'
females.value_counts()
females[survived].value_counts()

True     308
False    142
Name: Sex, dtype: int64

In [115]:
titanic[survived]

| PassengerID | PClass | Age | Sex | Survived |
|---|---|---|---|---|
| 1 | 1st | 29.00 | female | 1 |
| 5 | 1st | 0.92 | male | 1 |
| 6 | 1st | 47.00 | male | 1 |
| 7 | 1st | 63.00 | female | 1 |
| 9 | 1st | 58.00 | female | 1 |
| ... | ... | ... | ... | ... |
| 1280 | 3rd | 22.00 | male | 1 |
| 1290 | 3rd | NaN | male | 1 |
| 1294 | 3rd | 45.00 | female | 1 |
| 1303 | 3rd | NaN | male | 1 |


In the preceding simple example, the mask marks each observation as True or False, and indexing the DataFrame with it keeps only the rows that fit the selection query.

Now we want to check the ‘Age’ of those who survived. We can use the following command:
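Masks can also be combined. The sketch below uses five made-up rows (not the real file) and the element-wise operators `&` (and), `|` (or) and `~` (not). Python's plain `and`/`or` do not work on Series, so these operators are needed.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Sex": ["female", "female", "male", "female", "male"],
        "Survived": [1, 0, 0, 0, 1],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerID"),
)

survived = people["Survived"] == 1
females = people["Sex"] == "female"

female_survivors = people[females & survived]    # both conditions hold
male_or_survived = people[~females | survived]   # not female, or survived

# True counts as 1, so summing a mask counts the selected rows.
n_survivors = survived.sum()

print(female_survivors)
print(n_survivors)
```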

In [13]:
Age_Q = titanic.loc[titanic['Survived'] == 1,'Age']
Age_Q.value_counts().sort_values(ascending=False).head(20)
Age_Q.sort_values(ascending=False).head(20)

PassengerID
74      69.0
68      64.0
1265    63.0
7       63.0
252     62.0
105     60.0
39      60.0
111     60.0
272     60.0
38      59.0
9       58.0
29      58.0
43      58.0
124     58.0
232     56.0
206     56.0
163     55.0
274     55.0
71      55.0
121     54.0
Name: Age, dtype: float64

## Statistics
The previous section showed how to access values using a mask.

If you want to see some statistics about each feature, you can group the rows by the values of a column and then aggregate each group.

In [14]:
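# numeric_only=True skips the text columns (PClass, Sex), which recent pandas versions refuse to average
titanic.groupby(['Survived']).mean(numeric_only=True)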

| Survived | Age |
|---|---|
| 0 | 31.13167 |
| 1 | 29.359585 |
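If you want more than the mean, one option is to select the column of interest after grouping and pass several aggregations to `.agg`. This is a sketch on four made-up rows, so the numbers do not match the real data.

```python
import pandas as pd

people = pd.DataFrame({
    "Sex": ["female", "male", "male", "female"],
    "Age": [20.0, 40.0, 30.0, 60.0],
    "Survived": [1, 0, 0, 1],
})

# Selecting "Age" after grouping avoids averaging the text column.
mean_age = people.groupby("Survived")["Age"].mean()

# Several statistics at once: one row per group, one column per statistic.
summary = people.groupby("Survived")["Age"].agg(["count", "mean", "max"])

print(mean_age)
print(summary)
```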


If you need to sort the observations by the values of a column, you can use the `.sort_values()` method.

In [15]:
titanic.sort_values(by='Age',ascending=False).head(10)

| PassengerID | PClass | Age | Sex | Survived |
|---|---|---|---|---|
| 506 | 2nd | 71.0 | male | 0 |
| 120 | 1st | 71.0 | male | 0 |
| 10 | 1st | 71.0 | male | 0 |
| 73 | 1st | 70.0 | male | 0 |
| 74 | 1st | 69.0 | female | 1 |
| 253 | 1st | 67.0 | male | 0 |
| 773 | 3rd | 65.0 | male | 0 |
| 180 | 1st | 65.0 | male | 0 |
| 104 | 1st | 64.0 | male | 0 |
| 68 | 1st | 64.0 | female | 1 |
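Two details of `.sort_values()` are worth knowing: missing ages are placed last by default, and you can sort by several columns at once. The sketch below uses four rows taken from the tables above (PassengerIDs 6, 10, 506 and 1290); the variable names are only for illustration.

```python
import pandas as pd

people = pd.DataFrame(
    {"PClass": ["1st", "3rd", "2nd", "1st"], "Age": [47.0, None, 71.0, 71.0]},
    index=pd.Index([6, 1290, 506, 10], name="PassengerID"),
)

# Oldest first; the row with a missing Age ends up at the bottom.
oldest_first = people.sort_values(by="Age", ascending=False)

# Sort by class (ascending), then by age within each class (descending).
by_class_then_age = people.sort_values(by=["PClass", "Age"], ascending=[True, False])

print(oldest_first)
print(by_class_then_age)
```

Passing `na_position="first"` would move the missing values to the top instead.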
