# Práctica 2 - Interligencia Artificial

#### Autores: Andreu Marqués Valerià y Álvaro Pimentel Lorente
#### Fecha: 12/12/2020




In [117]:
import numpy as np 
import pandas as pd

from sklearn.preprocessing import OneHotEncoder

import matplotlib.pyplot as plt

## Llegim les dades

Per llegir les dades emprarem la llibreria de ``pandas``. El fitxer en qüestió és el fitxer que heu descarregat de Kaggle.

In [118]:
df_train = pd.read_csv('dades.csv')
pd.set_option('display.max_columns', None)
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Eliminamos las columnas ``Name`` y ``Ticket`` ya que no aportan información relevante para el entrenamiento del modelo.

In [119]:
df_train = df_train.drop(columns=['Name', 'Ticket'])
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,7.25,,S
1,2,1,1,female,38.0,1,0,71.2833,C85,C
2,3,1,3,female,26.0,0,0,7.925,,S
3,4,1,1,female,35.0,1,0,53.1,C123,S
4,5,0,3,male,35.0,0,0,8.05,,S


Convertirmos la columna ``Sex`` a numérica, ya que el modelo no puede trabajar con datos categóricos. Para ello, sustituimos los valores ``male`` por ``0`` y ``female`` por ``1``. A demás, eliminamos la columna ``PassengerId`` ya que no aporta información relevante para el entrenamiento del modelo.

In [120]:
df_train['Sex'].replace(['male','female'],[0,1],inplace=True)
df_train.drop(columns=['PassengerId'], inplace=True)
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,0,22.0,1,0,7.25,,S
1,1,1,1,38.0,1,0,71.2833,C85,C
2,1,3,1,26.0,0,0,7.925,,S
3,1,1,1,35.0,1,0,53.1,C123,S
4,0,3,0,35.0,0,0,8.05,,S


Realizamos una matriz de correlación para observar qué variables están más correlacionadas entre sí. En este caso, podemos observar que las variables más correlacionadas con la variable ``Survived`` son ``Sex``, ``Pclass`` y ``Fare`` (en ese orden).

In [121]:
corr = df_train.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
Survived,1.0,-0.338481,0.543351,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.338481,1.0,-0.1319,-0.369226,0.083081,0.018443,-0.5495
Sex,0.543351,-0.1319,1.0,-0.093254,0.114631,0.245489,0.182333
Age,-0.077221,-0.369226,-0.093254,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.035322,0.083081,0.114631,-0.308247,1.0,0.414838,0.159651
Parch,0.081629,0.018443,0.245489,-0.189119,0.414838,1.0,0.216225
Fare,0.257307,-0.5495,0.182333,0.096067,0.159651,0.216225,1.0


La columna ``Cabin`` tiene muchos valores nulos, por lo que una opción sería eliminarla. Sin embargo, nos aporta infomación sobre si el pasajero tiene camarote a su nombre o no. Por lo tanto, sustituiremos los valores nulos por ``0`` y los valores no nulos por ``1``.

In [122]:
HasCabin = df_train['Cabin'].notnull().astype('int')
df_train['Cabin'] = HasCabin
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,0,22.0,1,0,7.25,0,S
1,1,1,1,38.0,1,0,71.2833,1,C
2,1,3,1,26.0,0,0,7.925,0,S
3,1,1,1,35.0,1,0,53.1,1,S
4,0,3,0,35.0,0,0,8.05,0,S


sklearn.model_selection.GridSearchCV
Será necesario realizar one-hot encoding de las variables categóricas ``Embarked`` y ``Pclass``. Para ello, utilizaremos la funcion ``get_dummies`` de ``pandas``. Antes de realizar en one-hot encoding, nos aseguraremos de que no hay valores nulos en las columnas que vamos a utilizar.  

En el caso de la columna ``Pclass`` un valor nulo supondría que se trata del personal del barco, por lo que no es necesario eliminar la fila y los valores nulos se sustituirán por ``0``.
En el caso de la columna ``Embarked``, eliminaremos la fila puesto que habrá sido un error en el registro de los datos puesto que tanto los pasajeros como el personal del barco, tuvo que subir al barco en algún puerto.

In [123]:
df_train['Pclass'].fillna(0, inplace=True)

df_train['Embarked'].dropna(inplace=True)


A continuación, realizamos el one-hot encoding de las columnas ``Embarked`` y ``Pclass``.

In [124]:
df_onehot_Pclass = pd.get_dummies(df_train['Pclass'], prefix='Pclass')

df_onehot_Embarked = pd.get_dummies(df_train['Embarked'], prefix='Embarked')

df_onehot_Embarked.head()

df_train = pd.concat([df_train, df_onehot_Pclass, df_onehot_Embarked], axis=1)

df_train.drop(['Pclass', 'Embarked'], axis=1, inplace=True)
df_train.head()


Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Fare,Cabin,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S
0,0,0,22.0,1,0,7.25,0,0,0,1,0,0,1
1,1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,1,1,26.0,0,0,7.925,0,0,0,1,0,0,1
3,1,1,35.0,1,0,53.1,1,1,0,0,0,0,1
4,0,0,35.0,0,0,8.05,0,0,0,1,0,0,1


In [125]:
corr = df_train.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Fare,Cabin,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S
Survived,1.0,0.543351,-0.077221,-0.035322,0.081629,0.257307,0.316912,0.285904,0.093349,-0.322308,0.16824,0.00365,-0.15566
Sex,0.543351,1.0,-0.093254,0.114631,0.245489,0.182333,0.140391,0.098013,0.064746,-0.137143,0.082853,0.074115,-0.125722
Age,-0.077221,-0.093254,1.0,-0.308247,-0.189119,0.096067,0.249732,0.348941,0.006954,-0.312271,0.036261,-0.022405,-0.032523
SibSp,-0.035322,0.114631,-0.308247,1.0,0.414838,0.159651,-0.04046,-0.054582,-0.055932,0.092548,-0.059528,-0.026354,0.070941
Parch,0.081629,0.245489,-0.189119,0.414838,1.0,0.216225,0.036987,-0.017633,-0.000734,0.01579,-0.011069,-0.081228,0.063036
Fare,0.257307,0.182333,0.096067,0.159651,0.216225,1.0,0.482075,0.591711,-0.118557,-0.413333,0.269335,-0.117216,-0.166603
Cabin,0.316912,0.140391,0.249732,-0.04046,0.036987,0.482075,1.0,0.788773,-0.172413,-0.539291,0.208528,-0.129572,-0.110087
Pclass_1,0.285904,0.098013,0.348941,-0.054582,-0.017633,0.591711,0.788773,1.0,-0.288585,-0.626738,0.296423,-0.155342,-0.170379
Pclass_2,0.093349,0.064746,0.006954,-0.055932,-0.000734,-0.118557,-0.172413,-0.288585,1.0,-0.56521,-0.125416,-0.127301,0.192061
Pclass_3,-0.322308,-0.137143,-0.312271,0.092548,0.01579,-0.413333,-0.539291,-0.626738,-0.56521,1.0,-0.153329,0.237449,-0.009511
