# Project

Gustavo Alvarado. Student ID # 20063401

In [87]:
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from datetime import datetime

In [88]:
if tf.__version__.startswith("2."):
    import tensorflow.compat.v1 as tf
    tf.compat.v1.disable_v2_behavior()
    tf.compat.v1.disable_eager_execution()
    print("Enabled compatibility to tf1.x")

Enabled compatibility to tf1.x


In [29]:
titanic_data = pd.read_csv('data_titanic_proyecto.csv')
titanic_data.head()

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,passenger_class,passenger_sex,passenger_survived
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,Lower,M,N
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,Upper,F,Y
2,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,Lower,F,Y
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,Upper,F,Y
4,5,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,Lower,M,N


## Feature Engineering

**Selection**: The non-relevant *features* are excluded: **PassengerId**, **Name**, **Ticket**, **Cabin** and **Embarked**.


In [30]:
selected_data = titanic_data[['Age', 'SibSp', 'Parch', 'Fare', 'passenger_class', 'passenger_sex', 'passenger_survived']]
selected_data.head()

Unnamed: 0,Age,SibSp,Parch,Fare,passenger_class,passenger_sex,passenger_survived
0,22.0,1,0,7.25,Lower,M,N
1,38.0,1,0,71.2833,Upper,F,Y
2,26.0,0,0,7.925,Lower,F,Y
3,35.0,1,0,53.1,Upper,F,Y
4,35.0,0,0,8.05,Lower,M,N


**Transformation**: Missing **Age** values are replaced with the overall mean age, and the non-numeric columns **passenger_class**, **passenger_sex** and **passenger_survived** are mapped to numeric codes.

In [43]:
selected_data.isna().sum()

Age                   177
SibSp                   0
Parch                   0
Fare                    0
passenger_class         0
passenger_sex           0
passenger_survived      0
dtype: int64

In [63]:
# Work on a copy so that selected_data is not modified in place
transformed_data = selected_data.copy()
# Replace missing Age values with the overall mean age
transformed_data['Age'] = transformed_data['Age'].fillna(transformed_data['Age'].mean())

# Numeric codes for the categorical columns
class_mapping = {'Lower': 1, 'Middle': 2, 'Upper': 3}
sex_mapping = {'M': 1, 'F': 2}
survived_mapping = {'Y': 1, 'N': 0}

transformed_data = transformed_data.replace({'passenger_class': class_mapping,
                                             'passenger_sex': sex_mapping,
                                             'passenger_survived': survived_mapping})
transformed_data.head()

Unnamed: 0,Age,SibSp,Parch,Fare,passenger_class,passenger_sex,passenger_survived
0,22.0,1,0,7.25,1,1,0
1,38.0,1,0,71.2833,3,2,1
2,26.0,0,0,7.925,1,2,1
3,35.0,1,0,53.1,3,2,1
4,35.0,0,0,8.05,1,1,0


In [64]:
transformed_data.isna().sum()

Age                   0
SibSp                 0
Parch                 0
Fare                  0
passenger_class       0
passenger_sex         0
passenger_survived    0
dtype: int64

## Training, validation, and test data

In [65]:
x = transformed_data
y = x.pop('passenger_survived')

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20)

In [71]:
x_train, x_validation, y_train, y_validation = train_test_split(x_train, y_train, test_size = 0.20)

In [72]:
x_train

Unnamed: 0,Age,SibSp,Parch,Fare,passenger_class,passenger_sex
382,32.000000,0,0,7.9250,1,1
181,29.699118,0,0,15.0500,2,1
887,19.000000,0,0,30.0000,3,2
134,25.000000,0,0,13.0000,2,1
40,40.000000,1,0,9.4750,1,2
...,...,...,...,...,...,...
700,18.000000,1,0,227.5250,3,2
569,32.000000,0,0,7.8542,1,1
252,62.000000,0,0,26.5500,3,1
552,29.699118,0,0,7.8292,1,1


In [73]:
y_train

382    0
181    0
887    1
134    0
40     0
      ..
700    1
569    1
252    0
552    0
602    0
Name: passenger_survived, Length: 569, dtype: int64
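
The `Length: 569` reported above is what two consecutive 80/20 splits would produce, assuming the dataset has 891 rows (the size of the standard Titanic training set; the notebook itself does not print the row count). The sketch below reproduces that arithmetic on placeholder arrays. `n_rows`, the dummy data, and the `random_state` values are assumptions made for illustration, not part of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # assumed row count of the dataset (not printed in the notebook)

# Placeholder features and labels; only the number of rows matters here
x_all = np.arange(n_rows).reshape(-1, 1)
y_all = np.zeros(n_rows, dtype=int)

# First split: hold out 20% of all rows as the test set
x_rest, x_test_demo, y_rest, y_test_demo = train_test_split(
    x_all, y_all, test_size=0.20, random_state=0)

# Second split: hold out 20% of the remaining rows as the validation set
x_train_demo, x_val_demo, y_train_demo, y_val_demo = train_test_split(
    x_rest, y_rest, test_size=0.20, random_state=0)

# test: 179, validation: 143, train: 569
print(len(x_test_demo), len(x_val_demo), len(x_train_demo))
```

Under that assumption the final proportions are roughly 64% training, 16% validation, and 20% test.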

## Ensemble learning

Note: No *bootstrapping* is performed in this training process; for real problems, however, it is recommended. Bootstrapping requires choosing a sample size and a number of repetitions. A sample of the chosen size is drawn at random from the dataset with replacement, so the same element may appear more than once. The draw is repeated the chosen number of times, and the required statistics are computed for each sample. This can be done with the **resample** function from **scikit-learn**.
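
As a rough illustration of that procedure, the following sketch bootstraps the mean of a numeric column with `sklearn.utils.resample`. The data is synthetic, and `sample_size`, `n_repetitions`, and the choice of the mean as the statistic are assumptions made for the example; it is not part of the training pipeline used here.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic stand-in for a numeric column such as Age (assumption)
rng = np.random.RandomState(0)
values = rng.normal(loc=30.0, scale=14.0, size=200)

sample_size = len(values)   # size of each bootstrap sample (assumption)
n_repetitions = 1000        # number of bootstrap repetitions (assumption)

# Draw each sample with replacement and record its mean
boot_means = np.array([
    resample(values, replace=True, n_samples=sample_size, random_state=i).mean()
    for i in range(n_repetitions)
])

# Percentile-based 95% interval for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print("sample mean:", values.mean())
print("bootstrap 95% interval:", lower, upper)
```

The same loop could fit a classifier on each resampled training set instead of computing a mean, which is the usual starting point for bagging-style ensembles.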

### Decision tree

In [84]:
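def trainDecisionTree(x_train, y_train, x_test, y_test):
    # Fit a decision tree with default hyperparameters
    classifier = DecisionTreeClassifier()
    classifier.fit(x_train, y_train)
    y_pred = classifier.predict(x_test)
    # Report the confusion matrix and per-class metrics on the test set
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))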

In [83]:
trainDecisionTree(x_train, y_train, x_test, y_test)

[[88 19]
 [21 51]]
              precision    recall  f1-score   support

           0       0.81      0.82      0.81       107
           1       0.73      0.71      0.72        72

    accuracy                           0.78       179
   macro avg       0.77      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179



### SVM

In [93]:
def trainSVM(x_train, y_train, x_test, y_test):
    svm_classifier = SVC(kernel='linear')
    svm_classifier.fit(x_train, y_train)
    y_pred = svm_classifier.predict(x_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

In [94]:
trainSVM(x_train, y_train, x_test, y_test)

[[91 16]
 [30 42]]
              precision    recall  f1-score   support

           0       0.75      0.85      0.80       107
           1       0.72      0.58      0.65        72

    accuracy                           0.74       179
   macro avg       0.74      0.72      0.72       179
weighted avg       0.74      0.74      0.74       179



### Naive Bayes

In [None]:
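def trainNaiveBayes(x_train, y_train, x_test, y_test):
    raise NotImplementedError("Naive Bayes training is not implemented yet")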
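
One possible way to fill in this step, following the same pattern as the decision tree and SVM functions, is sketched below. It assumes Gaussian Naive Bayes (`GaussianNB` from scikit-learn); the notebook does not say which variant was intended. The function name `train_naive_bayes_sketch` and the demo data are hypothetical.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def train_naive_bayes_sketch(x_train, y_train, x_test, y_test):
    """Fit a Gaussian Naive Bayes classifier and print test-set metrics."""
    classifier = GaussianNB()  # assumed variant; not specified in the notebook
    classifier.fit(x_train, y_train)
    y_pred = classifier.predict(x_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    return classifier


# Synthetic two-class data so that the sketch runs on its own (assumption)
rng = np.random.RandomState(0)
x_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),
                    rng.normal(3.0, 1.0, size=(100, 3))])
y_demo = np.array([0] * 100 + [1] * 100)

xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.20, random_state=0)
demo_model = train_naive_bayes_sketch(xd_train, yd_train, xd_test, yd_test)
```

With the notebook's own data the call would mirror the earlier ones, passing `x_train, y_train, x_test, y_test`.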
    