# Tutorial: Machine Learning in scikit-learn
By Izaskun Mendia, from the GitHub repository (https://github.com/izmendi/)

![Machine learning](images/01_robot.png)

# Cleaning Data with Pandas and Scikit-learn

Pandas is a Python library that contains high-level data structures and manipulation tools designed for data analysis. Scikit-learn, on the other hand, is an open-source machine learning library for Python.

While Scikit-learn does a lot of the heavy lifting, it is equally important to process the raw data so that we can 'feed' it to Scikit-learn. The ability to manipulate raw data with Pandas therefore makes it an indispensable part of our toolkit.

In the following set of exercises, we will be reviewing the data from the Kaggle Titanic competition. Our aim is to make predictions on whether or not specific passengers on the Titanic survived, based on characteristics such as age, sex and class.

## Pandas - Extracting data

First, we load the training data from a .csv file. This is similar to the data found on the Kaggle website:

https://www.kaggle.com/c/titanic-gettingStarted/data

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('./data/titanic.txt', sep=',')
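
To try the loading step without the data file at hand, here is a minimal sketch that reads a tiny CSV from an in-memory string. The three sample rows and the names `sample_csv` and `sample_df` are made up for illustration and are not taken from the Titanic file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout.
sample_csv = io.StringIO(
    "pclass,survived,age,sex\n"
    "1st,1,29.0,female\n"
    "3rd,0,,male\n"
    "2nd,1,41.0,female\n"
)

# sep=',' is already the default; it is spelled out to mirror the cell above.
sample_df = pd.read_csv(sample_csv, sep=',')

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # the empty age field is parsed as NaN
```

Empty fields are read as NaN, which is why a column such as age ends up with a floating-point type even when every recorded value is a whole number.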

## Pandas - Cleaning data

We then review a selection of the data. 

In [2]:
df.head()

| | row.names | pclass | survived | name | age | embarked | home.dest | room | ticket | boat | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1st | 1 | Allen, Miss Elisabeth Walton | 29.0 | Southampton | St Louis, MO | B-5 | 24160 L221 | 2 | female |
| 1 | 2 | 1st | 0 | Allison, Miss Helen Loraine | 2.0 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | NaN | female |
| 2 | 3 | 1st | 0 | Allison, Mr Hudson Joshua Creighton | 30.0 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | (135) | male |
| 3 | 4 | 1st | 0 | Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) | 25.0 | Southampton | Montreal, PQ / Chesterville, ON | C26 | NaN | NaN | female |
| 4 | 5 | 1st | 1 | Allison, Master Hudson Trevor | 0.9167 | Southampton | Montreal, PQ / Chesterville, ON | C22 | NaN | 11 | male |


We notice that the columns describe features of the Titanic passengers, such as **Age, Sex** and **Pclass**. Of particular interest is the column **Survived**, which describes whether or not the passenger survived. When training our model, what we are essentially doing is assessing how each feature impacts whether or not the passenger survived (or if the feature makes an impact at all).

We observe that the columns **Row.names, Name, Home.dest, Room, Ticket** and **Boat** are, for our current purposes, irrelevant. We proceed to remove them from our data set.

In [3]:
df = df.drop(['row.names','name','home.dest','ticket','room','boat'], axis=1)
df.head()

| | pclass | survived | age | embarked | sex |
|---|---|---|---|---|---|
| 0 | 1st | 1 | 29.0 | Southampton | female |
| 1 | 1st | 0 | 2.0 | Southampton | female |
| 2 | 1st | 0 | 30.0 | Southampton | male |
| 3 | 1st | 0 | 25.0 | Southampton | female |
| 4 | 1st | 1 | 0.9167 | Southampton | male |
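
As a small illustration of the column removal above, the sketch below shows two equivalent ways to get the same result on a toy frame: dropping the unwanted columns, or selecting the ones to keep. The frame `toy` and its values are invented for the example.

```python
import pandas as pd

# Hypothetical toy frame with two columns we want and two we do not.
toy = pd.DataFrame({
    'name': ['A', 'B'],
    'ticket': ['t1', 't2'],
    'age': [29.0, 2.0],
    'sex': ['female', 'female'],
})

# Option 1: drop by name; columns=[...] is equivalent to passing axis=1.
dropped = toy.drop(columns=['name', 'ticket'])

# Option 2: keep an explicit list of columns.
kept = toy[['age', 'sex']]

print(dropped.equals(kept))
```

Selecting the columns to keep can be easier to read when only a few of many columns are needed.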


Next, we review the type of data in the columns, and their respective counts.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 5 columns):
pclass      1313 non-null object
survived    1313 non-null int64
age         633 non-null float64
embarked    821 non-null object
sex         1313 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 51.4+ KB


We notice that the columns **Age** and **Embarked** have *NaNs* or missing values.

In [5]:
df.age.isnull().sum()

680

In [6]:
df.isnull().sum()

pclass        0
survived      0
age         680
embarked    492
sex           0
dtype: int64
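
The counts above correspond to roughly 52% of the ages (680 of 1313) and 37% of the embarkation ports (492 of 1313). A minimal sketch of getting such fractions directly, using an invented four-row frame named `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing ages and one missing port.
toy = pd.DataFrame({
    'age': [29.0, np.nan, 30.0, np.nan],
    'embarked': ['Southampton', None, 'Cherbourg', 'Southampton'],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```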

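Every column is complete except **age** and **embarked**, and age in particular has many missing values. We can fill those gaps either with a summary statistic such as the mean or median, or with interpolated values. Pandas has a convenient interpolate() function that replaces each NaN with a value interpolated from its neighbours. Note that the call below returns a new Series and does not modify df; in the cell after it, we fill the missing ages with the column mean instead.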

In [7]:
df['age'].interpolate()

0       29.0000
1        2.0000
2       30.0000
3       25.0000
4        0.9167
5       47.0000
6       63.0000
7       39.0000
8       58.0000
9       71.0000
10      47.0000
11      19.0000
12      26.7500
13      34.5000
14      42.2500
15      50.0000
16      24.0000
17      36.0000
18      37.0000
19      47.0000
20      26.0000
21      25.0000
22      25.0000
23      19.0000
24      28.0000
25      45.0000
26      39.0000
27      30.0000
28      58.0000
29      51.5000
         ...   
1283    19.0000
1284    19.0000
1285    19.0000
1286    19.0000
1287    19.0000
1288    19.0000
1289    19.0000
1290    19.0000
1291    19.0000
1292    19.0000
1293    19.0000
1294    19.0000
1295    19.0000
1296    19.0000
1297    19.0000
1298    19.0000
1299    19.0000
1300    19.0000
1301    19.0000
1302    19.0000
1303    19.0000
1304    19.0000
1305    19.0000
1306    19.0000
1307    19.0000
1308    19.0000
1309    19.0000
1310    19.0000
1311    19.0000
1312    19.0000
Name: age, dtype: float64
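
The output above shows two properties of the default linear interpolation: gaps between known values are filled with evenly spaced steps (19, 26.75, 34.5, 42.25, 50), and missing values at the end of the column simply repeat the last known value (the long run of 19.0). The sketch below reproduces both on a short, made-up Series named `ages`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: a three-value gap in the middle and two missing values at the end.
ages = pd.Series([19.0, np.nan, np.nan, np.nan, 50.0, np.nan, np.nan])

# interpolate() returns a new Series; 'ages' itself is left unchanged.
filled = ages.interpolate()

print(filled.tolist())
print(ages.isnull().sum())  # still 5 missing values in the original
```

Because trailing gaps are filled by repetition rather than by any real estimate, interpolation is a questionable fit for data whose row order carries no meaning, as is the case for a passenger list.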

In [8]:
my_mean = df['age'].mean()
df['age'] = df['age'].fillna(my_mean)
df.isnull().sum()

pclass        0
survived      0
age           0
embarked    492
sex           0
dtype: int64
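
The cell above fills with the mean. As a hedged comparison, this sketch fills the same made-up Series with the mean and with the median; the values in `ages` are invented and include one large outlier to make the difference visible.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one outlier (71) and one missing value.
ages = pd.Series([2.0, 25.0, 29.0, 30.0, 71.0, np.nan])

# mean() and median() both skip NaN by default.
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

print(mean_filled.iloc[-1])    # value used when filling with the mean
print(median_filled.iloc[-1])  # value used when filling with the median
```

The median is less sensitive to extreme values, so it can be the safer choice when a column is skewed.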

For **Embarked**, we take the simpler approach of removing the rows with missing values.

In [9]:
df = df.dropna()
df.isnull().sum()

pclass      0
survived    0
age         0
embarked    0
sex         0
dtype: int64

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 821 entries, 0 to 838
Data columns (total 5 columns):
pclass      821 non-null object
survived    821 non-null int64
age         821 non-null float64
embarked    821 non-null object
sex         821 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 38.5+ KB


The dataset is now reduced from 1313 rows to 821, which means we are discarding data. Machine learning models need plenty of training data to perform well, so it is better to preserve as much of the data as we can. We will see how to do this later.
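
One possible way to keep those rows, shown here only as a sketch and not necessarily the approach used later, is to fill a categorical column with its most frequent value instead of dropping rows. The frame `toy` and the name `most_common` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical frame where two of five rows lack an embarkation port.
toy = pd.DataFrame({
    'embarked': ['Southampton', None, 'Cherbourg', 'Southampton', None],
    'survived': [1, 0, 1, 0, 1],
})

# mode() ignores missing values and returns the most frequent entries; take the first.
most_common = toy['embarked'].mode()[0]

# assign() returns a new frame, leaving 'toy' untouched.
toy_filled = toy.assign(embarked=toy['embarked'].fillna(most_common))

print(len(toy.dropna()), len(toy_filled))  # rows kept by dropping vs. by filling
```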

[NOTE] I am going to take the liberty of changing the label from numbers to text, just to help explain the next notebook. This is not recommended at all!

In [11]:
# Map the numeric label to a text label (0 -> 'no survived', 1 -> 'survived')
survived_mapping = {
           0: 'no survived',
           1: 'survived'}

df['survived'] = df['survived'].map(survived_mapping)
df.head()

| | pclass | survived | age | embarked | sex |
|---|---|---|---|---|---|
| 0 | 1st | survived | 29.0 | Southampton | female |
| 1 | 1st | no survived | 2.0 | Southampton | female |
| 2 | 1st | no survived | 30.0 | Southampton | male |
| 3 | 1st | no survived | 25.0 | Southampton | female |
| 4 | 1st | survived | 0.9167 | Southampton | male |
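
One detail of map() worth knowing when relabelling with a dictionary: any value that has no key in the dictionary becomes NaN. The sketch below shows this on an invented Series named `labels` that contains an unexpected value 2.

```python
import pandas as pd

# Hypothetical labels; the value 2 has no entry in the mapping.
labels = pd.Series([0, 1, 1, 2])
mapping = {0: 'no survived', 1: 'survived'}

mapped = labels.map(mapping)

print(mapped.tolist())
print(mapped.isnull().sum())  # number of values the mapping did not cover
```

Checking for NaN after a map() call is a cheap way to catch labels the dictionary missed.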


We save our new DataFrame to a CSV file.

In [12]:
df.to_csv('./data/titanic_curso.csv', sep=',', index=False)
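
The index=False argument matters when the file is read back: without it, the row index is written as an extra unnamed column, which read_csv then labels "Unnamed: 0". A minimal round-trip sketch, using an invented two-row frame and a temporary directory rather than the tutorial's data folder:

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row frame.
toy = pd.DataFrame({'pclass': ['1st', '3rd'], 'age': [29.0, 0.9167]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write without the index, as in the cell above, and read it back.
    clean_path = os.path.join(tmp_dir, 'without_index.csv')
    toy.to_csv(clean_path, sep=',', index=False)
    restored = pd.read_csv(clean_path)

    # Write with the index (the default) to see the extra column on reload.
    indexed_path = os.path.join(tmp_dir, 'with_index.csv')
    toy.to_csv(indexed_path, sep=',')
    restored_with_index = pd.read_csv(indexed_path)

print(restored.columns.tolist())
print(restored_with_index.columns.tolist())
```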