# Practical 2 - Artificial Intelligence

#### Authors: Andreu Marqués Valerià and Álvaro Pimentel Lorente
#### Date: 12/12/2020




In [297]:
import numpy as np 
import pandas as pd

from sklearn.preprocessing import OneHotEncoder

import matplotlib.pyplot as plt

## Reading the data

To read the data we use the ``pandas`` library. The file in question is the one you downloaded from Kaggle.

In [298]:
df_train = pd.read_csv('dades.csv')
pd.set_option('display.max_columns', None)
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Handling missing values (``NaN``)

We drop the ``PassengerId``, ``Name`` and ``Ticket`` columns, since they do not provide relevant information for training the model.

In [299]:
df_train.drop(columns=['Name', 'Ticket', 'PassengerId'], inplace=True)
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.925,,S
3,1,1,female,35.0,1,0,53.1,C123,S
4,0,3,male,35.0,0,0,8.05,,S


Next, we check which columns contain ``NaN`` values. As the output shows, only three columns have missing values: ``Age``, ``Cabin`` and ``Embarked``.

In [300]:
df_train.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64
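
Raw counts are easier to judge as a share of the rows. The following is a minimal sketch on a toy frame; ``toy`` and its values are illustrative assumptions, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() flags each missing cell; mean() turns the flags into a per-column share.
missing_share = toy.isnull().mean()

# Keep only the names of columns that actually contain missing values.
columns_with_nulls = missing_share[missing_share > 0].index.tolist()

print(missing_share)
print(columns_with_nulls)
```

Applied to ``df_train``, the same two lines would list the affected columns together with the fraction of rows each one is missing.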

The ``Cabin`` column has many missing values (687), so one option would be to drop it. However, it still tells us whether a cabin is recorded for the passenger. We therefore replace missing values with ``0`` and non-missing values with ``1``.

In [301]:
HasCabin = df_train['Cabin'].notnull().astype('int')
df_train['Cabin'] = HasCabin
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,0,S
1,1,1,female,38.0,1,0,71.2833,1,C
2,1,3,female,26.0,0,0,7.925,0,S
3,1,1,female,35.0,1,0,53.1,1,S
4,0,3,male,35.0,0,0,8.05,0,S
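
The conversion can be checked in isolation on a small series. This is a sketch with made-up cabin labels; ``cabin`` and ``has_cabin`` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Toy cabin column: two recorded cabins, two missing entries.
cabin = pd.Series(["C85", np.nan, "C123", np.nan], name="Cabin")

# notnull() yields True where a cabin is recorded; astype(int) maps True/False to 1/0.
has_cabin = cabin.notnull().astype(int)

print(has_cabin.tolist())
```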


Next is the ``Age`` column, which has some ``NaN`` values (177). Here we replace the missing values with the median of the ``Age`` column.

In [302]:
# Assign the result back: in-place fillna on a selected column is unreliable in recent pandas
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].median())
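
A minimal sketch of median imputation on toy ages (illustrative values, not taken from the dataset). It also prints the mean for comparison, since the median is less affected by a single extreme age.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing entry and one unusually high value.
age = pd.Series([22.0, 38.0, np.nan, 35.0, 80.0], name="Age")

# median() and mean() both skip NaN by default.
median_age = age.median()   # median of 22, 35, 38, 80
mean_age = age.mean()       # pulled upwards by the value 80

# Fill the gap with the median and keep the result as a new series.
age_filled = age.fillna(median_age)

print(median_age, mean_age)
print(age_filled.tolist())
```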


For the ``Embarked`` column we follow the same approach. To determine the replacement value we call the ``mode`` method on the ``Embarked`` column. It returns the most frequent value in the column, which here is the port where most passengers embarked.

In [303]:
# mode() returns a Series (there may be ties), so take its first entry
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode()[0])
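
A small sketch of mode imputation on a toy series; the port codes mirror those in the data, but the counts are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy embarkation ports with one missing entry.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")

# mode() ignores NaN and returns a Series of the most frequent values;
# [0] selects the first of them.
most_common = embarked.mode()[0]

embarked_filled = embarked.fillna(most_common)

print(most_common)
print(embarked_filled.tolist())
```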

## Converting values and columns

We convert the ``Sex`` column to numeric, since the model cannot work with categorical data. To do so, we replace ``male`` with ``0`` and ``female`` with ``1``.

In [304]:
# Explicit mapping assigned back to the column: male -> 0, female -> 1
df_train['Sex'] = df_train['Sex'].map({'male': 0, 'female': 1})
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,0,22.0,1,0,7.25,0,S
1,1,1,1,38.0,1,0,71.2833,1,C
2,1,3,1,26.0,0,0,7.925,0,S
3,1,1,1,35.0,1,0,53.1,1,S
4,0,3,0,35.0,0,0,8.05,0,S
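
A minimal sketch of this encoding on toy data, with an added check (not part of the original notebook) that every label was mapped. ``sex_codes`` is a hypothetical name for the mapping.

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# Explicit mapping; any label missing from the dictionary would become NaN.
sex_codes = {"male": 0, "female": 1}
sex_numeric = sex.map(sex_codes)

# Guard against unexpected labels silently turning into NaN.
unmapped = sex[sex_numeric.isnull()].unique().tolist()
if unmapped:
    raise ValueError(f"Unmapped labels in Sex: {unmapped}")

print(sex_numeric.tolist())
```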


We compute a correlation matrix to see which variables are most strongly correlated with each other. The values are absolute, so they show the strength of each linear relationship but not its sign. The variables most correlated with ``Survived`` are ``Sex`` (0.54), ``Pclass`` (0.34), ``Cabin`` (0.32) and ``Fare`` (0.26), in that order.

In [305]:
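# Embarked still holds strings here, so restrict the correlation to numeric columns
corr = df_train.select_dtypes(include='number').corr().abs()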
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin
Survived,1.0,0.338481,0.543351,0.06491,0.035322,0.081629,0.257307,0.316912
Pclass,0.338481,1.0,0.1319,0.339898,0.083081,0.018443,0.5495,0.725541
Sex,0.543351,0.1319,1.0,0.081163,0.114631,0.245489,0.182333,0.140391
Age,0.06491,0.339898,0.081163,1.0,0.233296,0.172482,0.096688,0.240314
SibSp,0.035322,0.083081,0.114631,0.233296,1.0,0.414838,0.159651,0.04046
Parch,0.081629,0.018443,0.245489,0.172482,0.414838,1.0,0.216225,0.036987
Fare,0.257307,0.5495,0.182333,0.096688,0.159651,0.216225,1.0,0.482075
Cabin,0.316912,0.725541,0.140391,0.240314,0.04046,0.036987,0.482075,1.0
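
The ranking against the target can also be extracted programmatically instead of being read off the colored matrix. The sketch below uses a toy frame with illustrative values; ``toy`` and ``ranking`` are hypothetical names.

```python
import pandas as pd

# Toy frame with a target, two numeric features and one string column.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex":      [0, 1, 1, 1, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
})

# Keep numeric columns only, then take absolute Pearson correlations.
corr = toy.select_dtypes(include="number").corr().abs()

# Correlation of every feature with the target, strongest first.
ranking = corr["Survived"].drop("Survived").sort_values(ascending=False)

print(ranking)
```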


Next, we one-hot encode the categorical variables ``Embarked`` and ``Pclass``. To do so, we use the ``get_dummies`` function from ``pandas``.

In [306]:
# One-hot encode both categorical columns; dtype=int keeps the 0/1 values shown in the output
df_onehot_Pclass = pd.get_dummies(df_train['Pclass'], prefix='Pclass', dtype=int)
df_onehot_Embarked = pd.get_dummies(df_train['Embarked'], prefix='Embarked', dtype=int)

# Append the indicator columns and drop the original categorical columns
df_train = pd.concat([df_train, df_onehot_Pclass, df_onehot_Embarked], axis=1)
df_train.drop(['Pclass', 'Embarked'], axis=1, inplace=True)

df_train.head()


Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Fare,Cabin,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S
0,0,0,22.0,1,0,7.25,0,0,0,1,0,0,1
1,1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,1,1,26.0,0,0,7.925,0,0,0,1,0,0,1
3,1,1,35.0,1,0,53.1,1,1,0,0,0,0,1
4,0,0,35.0,0,0,8.05,0,0,0,1,0,0,1
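
A compact sketch of one-hot encoding on a toy frame. Here ``get_dummies`` receives the whole frame together with a ``columns`` list, so it encodes and drops the source columns in one call; the toy values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [3, 1, 2], "Embarked": ["S", "C", "Q"]})

# Each listed column is replaced by one indicator column per distinct value,
# named <column>_<value>; dtype=int yields 0/1 rather than booleans.
encoded = pd.get_dummies(toy, columns=["Pclass", "Embarked"], dtype=int)

print(encoded)
```

Every row then contains exactly one ``1`` per encoded variable, which is a quick sanity check on the result.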


Next, we combine the ``SibSp`` and ``Parch`` columns into a new ``Familiars`` column: the number of family members aboard, counting the passenger.

In [307]:
df_train['Familiars'] = df_train['SibSp'] + df_train['Parch'] + 1

df_train

Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Fare,Cabin,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Familiars
0,0,0,22.0,1,0,7.2500,0,0,0,1,0,0,1,2
1,1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0,2
2,1,1,26.0,0,0,7.9250,0,0,0,1,0,0,1,1
3,1,1,35.0,1,0,53.1000,1,1,0,0,0,0,1,2
4,0,0,35.0,0,0,8.0500,0,0,0,1,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,0,27.0,0,0,13.0000,0,0,1,0,0,0,1,1
887,1,1,19.0,0,0,30.0000,1,1,0,0,0,0,1,1
888,0,1,28.0,1,2,23.4500,0,0,0,1,0,0,1,4
889,1,0,26.0,0,0,30.0000,1,1,0,0,1,0,0,1
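
The same computation as a small helper, checked against three ``SibSp``/``Parch`` pairs that appear in the output above (rows 0, 2 and 888). ``add_family_size`` is a hypothetical name introduced for this sketch.

```python
import pandas as pd

def add_family_size(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a Familiars column: SibSp + Parch + 1."""
    out = df.copy()
    # Siblings/spouses plus parents/children, plus one for the passenger.
    out["Familiars"] = out["SibSp"] + out["Parch"] + 1
    return out

toy = pd.DataFrame({"SibSp": [1, 0, 1], "Parch": [0, 0, 2]})
result = add_family_size(toy)

print(result["Familiars"].tolist())
```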


In [308]:
corr = abs(df_train.corr())
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Fare,Cabin,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Familiars
Survived,1.0,0.543351,0.06491,0.035322,0.081629,0.257307,0.316912,0.285904,0.093349,0.322308,0.16824,0.00365,0.149683,0.016639
Sex,0.543351,1.0,0.081163,0.114631,0.245489,0.182333,0.140391,0.098013,0.064746,0.137143,0.082853,0.074115,0.119224,0.200988
Age,0.06491,0.081163,1.0,0.233296,0.172482,0.096688,0.240314,0.323896,0.015831,0.291955,0.030248,0.031415,0.006729,0.245619
SibSp,0.035322,0.114631,0.233296,1.0,0.414838,0.159651,0.04046,0.054582,0.055932,0.092548,0.059528,0.026354,0.068734,0.890712
Parch,0.081629,0.245489,0.172482,0.414838,1.0,0.216225,0.036987,0.017633,0.000734,0.01579,0.011069,0.081228,0.060814,0.783111
Fare,0.257307,0.182333,0.096688,0.159651,0.216225,1.0,0.482075,0.591711,0.118557,0.413333,0.269335,0.117216,0.162184,0.217138
Cabin,0.316912,0.140391,0.240314,0.04046,0.036987,0.482075,1.0,0.788773,0.172413,0.539291,0.208528,0.129572,0.101139,0.009175
Pclass_1,0.285904,0.098013,0.323896,0.054582,0.017633,0.591711,0.788773,1.0,0.288585,0.626738,0.296423,0.155342,0.161921,0.046114
Pclass_2,0.093349,0.064746,0.015831,0.055932,0.000734,0.118557,0.172413,0.288585,1.0,0.56521,0.125416,0.127301,0.18998,0.038594
Pclass_3,0.322308,0.137143,0.291955,0.092548,0.01579,0.413333,0.539291,0.626738,0.56521,1.0,0.153329,0.237449,0.015104,0.071142
