# Introduction to Machine Learning
## Activity 3
### Lorenzo Tomas Diez

In [49]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [50]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [51]:
train_types = df_train.dtypes
print("===== TRAIN TYPES =====")
print(train_types)

===== TRAIN TYPES =====
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


In [52]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [53]:
df_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


In [54]:
print('===== AGE NULL PERCENTAGE =====')
print((df_train['Age'].isnull().sum()*100)/df_train.shape[0])

print('===== CABIN NULL PERCENTAGE =====')
print((df_train['Cabin'].isnull().sum()*100)/df_train.shape[0])

print('===== EMBARKED NULL PERCENTAGE =====')
print((df_train['Embarked'].isnull().sum()*100)/df_train.shape[0])

===== AGE NULL PERCENTAGE =====
19.865319865319865
===== CABIN NULL PERCENTAGE =====
77.10437710437711
===== EMBARKED NULL PERCENTAGE =====
0.2244668911335578


#### Answer `A`
Three columns contain null values: `Age`, `Cabin`, and `Embarked`.

- For `Age` (about 19.9% missing), we will replace null values with the **median**. The median is preferable here because ages may include outliers that would distort the mean, and the median is less sensitive to extreme values.

- For `Embarked` (2 missing values, about 0.2%), we will fill with the **most frequent value**. With so few gaps, imputing the most common port of embarkation leaves the distribution of this variable essentially unchanged.

- For `Cabin`, about 77% of the values are missing, so the column will not be used.
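
The two imputation choices above can be illustrated on toy data. The sketch below uses only the standard library; the `ages` and `ports` lists are made-up values for illustration, not rows from `train.csv`.

```python
from statistics import mean, median, mode

# Hypothetical ages with one extreme value; None marks a missing entry.
ages = [22, 24, 26, 28, 30, 80, None]
known_ages = [a for a in ages if a is not None]

# The outlier (80) pulls the mean upward, while the median stays central.
age_mean = mean(known_ages)      # 35
age_median = median(known_ages)  # 27.0
ages_filled = [age_median if a is None else a for a in ages]

# Hypothetical embarkation ports; fill gaps with the most frequent category.
ports = ["S", "C", "S", None, "Q", "S"]
known_ports = [p for p in ports if p is not None]
port_mode = mode(known_ports)    # "S"
ports_filled = [port_mode if p is None else p for p in ports]

print(f"mean={age_mean}, median={age_median}, mode={port_mode}")
```

Here a single age of 80 moves the mean to 35, above five of the six known ages, while the median of 27 still sits in the middle of the data.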

#### Answer `B`

- Fill missing values in `Age` using the median

In [55]:
df_train = df_train.fillna({'Age': df_train['Age'].median()})

In [56]:
print('====== TRAIN VALUES ======')
print(f'Null values on Age: {df_train["Age"].isnull().sum()}')

Null values on Age: 0


- Fill missing values in `Embarked` with the most frequent value in the TRAIN table

In [57]:
print(df_train['Embarked'].mode())

0    S
Name: Embarked, dtype: object


In [58]:
df_train = df_train.fillna({'Embarked': 'S'})

In [59]:
print('====== TRAIN VALUES ======')
print(f'Null values on Embarked: {df_train["Embarked"].isnull().sum()}')

Null values on Embarked: 0


In [60]:
df_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64


#### Answer `C` - Column selection and value replacement - Modeling

- We exclude `Cabin` because roughly 77% of its values are missing.

- We also exclude `Name`, `PassengerId`, and `Ticket`: they act as identifiers whose values are unique or nearly unique per row, so the model cannot learn a general pattern from them.

- scikit-learn models require numeric input and do not accept strings, so the non-numeric columns must be converted. We will encode `Sex` and `Embarked`.

- We select the variables `['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']`.

##### Encoding `Sex` and `Embarked`
- Both variables are categorical, so we use `LabelEncoder` to replace each category with an integer code.

In [61]:
from sklearn.preprocessing import LabelEncoder

pre_columns = ['Sex', 'Embarked']

for column in pre_columns:
    encoder = LabelEncoder()
    df_train[column] = encoder.fit_transform(df_train[column].astype(str))
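
To see which integers the encoder assigns, note that `LabelEncoder` numbers the classes in sorted order. The sketch below mimics that mapping in plain Python rather than calling scikit-learn; `label_codes` is a hypothetical helper, and the sample values assume `Embarked` takes the values `C`, `Q`, and `S`.

```python
def label_codes(values):
    """Map each distinct value to its index in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}

# Made-up sample values, not rows from train.csv.
sex_codes = label_codes(["male", "female", "female", "male"])
embarked_codes = label_codes(["S", "C", "Q", "S"])

print(sex_codes)       # {'female': 0, 'male': 1}
print(embarked_codes)  # {'C': 0, 'Q': 1, 'S': 2}
```

The codes for `Embarked` imply an ordering (`C` < `Q` < `S`) that has no real meaning. A decision tree can still split on them, but the numbers should not be read as magnitudes.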

In [62]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    int64  
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    int64  
dtypes: float64(2), int64(7), object(3)
memory usage: 83.7+ KB


In [63]:
imp_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

- We split our dataset into the target and the feature columns that explain it.

In [64]:
seed_number = 28
label_column = 'Survived'
x, y = df_train[imp_cols], df_train[label_column]

- We split the data into `training` and `validation` sets.

In [65]:
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.20, random_state=seed_number)
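
As a quick check on what `test_size=0.20` means for 891 rows, the arithmetic can be worked out by hand. This is a sketch that assumes the validation count is rounded up and the remaining rows go to training, which is how a fractional `test_size` is normally resolved; it does not call scikit-learn.

```python
import math

n_samples = 891      # rows in df_train, per df_train.info()
test_size = 0.20

# Assumption: the validation count is rounded up and training gets the rest.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 712 179
```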

- Modeling with a decision tree

In [66]:
from sklearn.tree import DecisionTreeClassifier

arb = DecisionTreeClassifier(random_state=seed_number)
arb.fit(x_train, y_train)
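
`DecisionTreeClassifier` uses Gini impurity as its default split criterion. The sketch below shows how that quantity is computed for a node and for a candidate split. It is a plain-Python illustration, not scikit-learn's implementation; `gini`, `split_impurity`, and the label lists are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# Hypothetical survival labels before and after splitting on one feature.
parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1]

print(gini(parent))                 # 0.5
print(split_impurity(left, right))  # 0.25
```

A split is attractive when the weighted impurity of the children (0.25 here) is lower than the impurity of the parent (0.5).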

#### Answer `D` - Determine Accuracy, F1-Score, Recall

In [67]:
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

y_pred = arb.predict(x_val)

acc = accuracy_score(y_val, y_pred)
pre = precision_score(y_val, y_pred)
rec = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

print(f'Accuracy: {acc:.4f}')
print(f'Precision: {pre:.4f}')
print(f'Recall: {rec:.4f}')
print(f'F1 Score: {f1:.4f}')

Accuracy: 0.7542
Precision: 0.6282
Recall: 0.7656
F1 Score: 0.6901
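
To make the relationship between these four numbers explicit, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: they are one set of values consistent with the reported metrics on a 179-row validation set, inferred by hand rather than taken from the notebook's output.

```python
# Counts inferred from the reported metrics (an assumption, not notebook output).
tp, fp, fn, tn = 49, 29, 15, 86

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted survivors who survived
recall = tp / (tp + fn)                     # share of actual survivors that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy: {accuracy:.4f}")    # 0.7542
print(f"Precision: {precision:.4f}")  # 0.6282
print(f"Recall: {recall:.4f}")        # 0.7656
print(f"F1 Score: {f1:.4f}")          # 0.6901
```

Under these counts, recall is higher than precision: the tree identifies most of the survivors but also labels a fair number of non-survivors as survivors.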
