# Tutorial: Machine Learning in scikit-learn
From [Izaskun Mendia](https://github.com/izmendi/)'s GitHub repository.

![Machine learning](images/01_robot.png)

# Cleaning Data with Pandas and Scikit-learn

Pandas is a Python library that contains high-level data structures and manipulation tools designed for data analysis. Scikit-learn, on the other hand, is an open-source machine learning library for Python.

While Scikit-learn does a lot of the heavy lifting, it is equally important to process the raw data so that we can 'feed' it to Scikit-learn. The ability to manipulate raw data with Pandas therefore makes it an indispensable part of our toolkit.

In the following set of exercises, we will be reviewing the data from the Kaggle Titanic competition. Our aim is to make predictions on whether or not specific passengers on the Titanic survived, based on characteristics such as age, sex and class.

## Pandas - Extracting data

First, we load the training data from a .csv file. This is similar to the data found on the Kaggle website:

https://www.kaggle.com/c/titanic-gettingStarted/data

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('./data/titanic.txt', sep=',')
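To see what `read_csv` does without needing the file on disk, here is a minimal sketch that parses a small in-memory CSV. The three rows and their values are invented for illustration; only the column names are borrowed from this tutorial.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration.
sample_csv = """row.names,pclass,survived,name,age,sex
1,1st,1,"Doe, Jane",29.0,female
2,2nd,0,"Roe, John",,male
3,3rd,0,"Poe, Ann",41.5,female
"""

# io.StringIO lets read_csv treat the string as if it were a file.
sample = pd.read_csv(io.StringIO(sample_csv), sep=',')

print(sample.shape)                   # (rows, columns)
print(sample['age'].isnull().sum())   # the empty age field is parsed as NaN
```

Note that the quoted names contain a comma and are still read as a single field, and that the empty `age` cell becomes a missing value.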

## Pandas - Cleaning data

We then review a selection of the data. 

In [None]:
df.head()

We notice that the columns describe features of the Titanic passengers, such as **Age, Sex** and **Pclass**. Of particular interest is the column **Survived**, which describes whether or not the passenger survived. When training our model, what we are essentially doing is assessing how each feature impacts whether or not the passenger survived (or if the feature makes an impact at all).

We observe that the columns **Row.names, Name, Home.dest, Room, Ticket** and **Boat** are, for our current purposes, irrelevant. We proceed to remove them from our data set.

In [None]:
df = df.drop(['row.names','name','home.dest','ticket','room','boat'], axis=1)
df.head()
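As a small sketch of the same step on a toy frame (the data here is invented), `drop` also accepts a `columns=` keyword, which is equivalent to passing the labels with `axis=1`. The `errors='ignore'` argument is an optional extra, not something the tutorial uses: it skips labels that are not present instead of raising an error.

```python
import pandas as pd

# Toy frame with made-up values; it intentionally has no 'boat' column.
toy = pd.DataFrame({
    'row.names': [1, 2],
    'name': ['A', 'B'],
    'age': [30.0, None],
    'survived': [1, 0],
})

to_drop = ['row.names', 'name', 'boat']

# drop returns a new DataFrame; toy itself is left unchanged.
reduced = toy.drop(columns=to_drop, errors='ignore')

print(list(reduced.columns))
print(list(toy.columns))
```

Because `drop` returns a new object, the result has to be assigned (as the tutorial does with `df = df.drop(...)`) for the change to persist.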

Next, we review the type of data in the columns, and their respective counts.

In [None]:
df.info()

We notice that the columns **Age** and **Embarked** have *NaNs* or missing values.

In [None]:
df.age.isnull().sum()

In [None]:
df.isnull().sum()
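Raw counts are easier to judge next to the share of rows they represent. The sketch below, on an invented four-row frame, puts both side by side; since `isnull()` yields booleans, taking their mean gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'embarked': ['Southampton', None, 'Cherbourg', 'Southampton'],
    'sex': ['male', 'female', 'female', 'male'],
})

missing_count = toy.isnull().sum()    # number of NaNs per column
missing_share = toy.isnull().mean()   # fraction of NaNs per column

summary = pd.DataFrame({'missing': missing_count, 'share': missing_share})
print(summary)
```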

All is good except **age**, which has many missing values. Let's fill them in, either with a summary statistic such as the mean or median, or with interpolation. Pandas has a handy `interpolate()` method that replaces each NaN with a value interpolated from its neighbours.

In [None]:
# Returns a new Series; df itself is not modified because the result is not assigned.
df['age'].interpolate()

In [None]:
my_mean = df['age'].mean()
df['age'] = df['age'].fillna(my_mean)
df.isnull().sum()
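The filling strategies mentioned above give different results. This sketch compares them on a short, invented series of ages. Keep in mind that the default linear interpolation uses the neighbouring rows, so its result depends on row order.

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps.
ages = pd.Series([20.0, np.nan, 40.0, np.nan, 90.0])

# Linear interpolation: each gap gets the midpoint of its neighbours.
interpolated = ages.interpolate()

# Constant fill: every gap gets the same summary statistic.
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

comparison = pd.DataFrame({
    'original': ages,
    'interpolated': interpolated,
    'mean': mean_filled,
    'median': median_filled,
})
print(comparison)
```

The mean and median ignore NaNs by default, so they are computed from the three known ages only.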

For the rows that still contain missing values in other columns, we take the simple approach of removing them.

In [None]:
df = df.dropna()
df.isnull().sum()

In [None]:
df.info()

The dataset is now reduced from 1313 rows to 821, which means we are throwing data away. Machine learning models need enough training data to perform well, so we should preserve and use as much of it as we can. We will see how later.
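To make the cost of `dropna()` concrete, the sketch below counts rows before and after on an invented frame. It also shows the `subset` argument, which restricts the check to chosen columns. This is only one possible way to keep more rows, not necessarily the approach taken later in this series.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only 'embarked' has gaps.
toy = pd.DataFrame({
    'age': [22.0, 35.0, 41.0, 58.0],
    'embarked': ['Southampton', None, 'Cherbourg', None],
    'survived': [0, 1, 1, 0],
})

# Drops every row that has a NaN in any column.
all_complete = toy.dropna()

# Drops a row only if one of the listed columns is missing.
needed_only = toy.dropna(subset=['age', 'survived'])

print(len(toy), len(all_complete), len(needed_only))
```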

[NOTE] I am going to take the liberty of changing the label values, purely to set up the next notebook. This is not recommended in general!

In [None]:
# Map the numeric labels to descriptive strings.
label_mapping = {
           0: 'no survived',
           1: 'survived'}

df['survived'] = df['survived'].map(label_mapping)
df.head()
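One detail of `map` with a dictionary is worth knowing: any value that is not a key in the dictionary becomes NaN. The sketch below shows this on an invented series that contains a deliberately unexpected value.

```python
import pandas as pd

# Made-up labels; 2 is an intentionally unexpected value.
labels = pd.Series([0, 1, 1, 2])
mapping = {0: 'no survived', 1: 'survived'}

mapped = labels.map(mapping)

print(mapped.tolist())
print(mapped.isnull().sum())   # number of values that had no entry in the mapping
```

For the same reason, running the mapping cell twice on the same column turns every label into NaN, because the strings produced by the first run are not keys of the dictionary.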

We save our new DataFrame to a CSV file.

In [None]:
df.to_csv('./data/titanic_curso.csv', sep=',', index=False)
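A quick way to check what `to_csv` writes is to round-trip through an in-memory buffer. The sketch below does this with an invented two-row frame, using `io.StringIO` in place of a file path.

```python
import io
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    'age': [22.0, 35.5],
    'survived': ['no survived', 'survived'],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, sep=',', index=False)
print(buffer.getvalue())

# Rewind and read the same text back.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```

With `index=False` the row index is not written. Without it, the index would be saved as an extra unnamed column and would come back as `Unnamed: 0` when the file is read again.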