# Solution: Decision Trees Exercise - Titanic
#### Unit 2. Supervised Learning
#### Module: Machine Learning Systems
#### IES de Teis (Vigo), Cristina Gómez Alonso

## 1. Importing packages and the dataset

In [6]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

In [7]:
titanic = pd.read_csv('../ML.2.4.Logistic-Regression/data/titanic_train.csv')
titanic_test = pd.read_csv('../ML.2.4.Logistic-Regression/data/titanic_test.csv')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## 2. Preprocessing

In [8]:
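# Drop the columns that are not used as features
titanic = titanic.drop(['PassengerId','Name','Ticket','Cabin','Embarked'], axis=1)
# One-hot encode the remaining categorical column (Sex)
titanic = pd.get_dummies(titanic)
# Replace missing values (such as missing Age values) with 0.0
titanic = titanic.fillna(0.0)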
titanic.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male
0,0,3,22.0,1,0,7.25,0,1
1,1,1,38.0,1,0,71.2833,1,0
2,1,3,26.0,0,0,7.925,1,0
3,1,1,35.0,1,0,53.1,1,0
4,0,3,35.0,0,0,8.05,0,1


In [9]:
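# Drop the columns that are not used as features
titanic_test = titanic_test.drop(['PassengerId','Name','Ticket','Cabin','Embarked'], axis=1)
# One-hot encode the remaining categorical column (Sex)
titanic_test = pd.get_dummies(titanic_test)
# Replace missing values (such as missing Age values) with 0.0
titanic_test = titanic_test.fillna(0.0)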
titanic_test.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male
0,3,34.5,0,0,7.8292,0,1
1,3,47.0,1,0,7.0,1,0
2,2,62.0,0,0,9.6875,0,1
3,3,27.0,0,0,8.6625,0,1
4,3,22.0,1,1,12.2875,1,0
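
Because pd.get_dummies is applied to the training and test frames separately, the two could end up with different columns if a category were missing from one of them. In this notebook both files yield the same feature columns, as the outputs above show, so nothing needs fixing here. The following is only a minimal sketch, on small made-up frames (not the Titanic data, and with hypothetical variable names), of one way to guard against that situation using DataFrame.reindex:

```python
import pandas as pd

# Toy frames (hypothetical data): the test frame happens to contain only one sex.
train_raw = pd.DataFrame({"Age": [22.0, 38.0, None],
                          "Sex": ["male", "female", "female"]})
test_raw = pd.DataFrame({"Age": [34.5, 47.0],
                         "Sex": ["male", "male"]})

# Same preprocessing as in the notebook: one-hot encode, then fill gaps with 0.0.
train_enc = pd.get_dummies(train_raw).fillna(0.0)
test_enc = pd.get_dummies(test_raw).fillna(0.0)

# Align the test columns to the training columns; dummies that are absent
# from the test frame are created and filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

After the call to reindex, both frames expose the same columns in the same order, which is what a fitted scikit-learn model expects at prediction time.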


## 3. Splitting the dataset

In [10]:
from sklearn.model_selection import train_test_split
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
print(titanic.columns)
X_train

Index(['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_female',
       'Sex_male'],
      dtype='object')


Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male
445,1,4.0,0,2,81.8583,0,1
650,3,0.0,0,0,7.8958,0,1
172,3,1.0,1,1,11.1333,1,0
450,2,36.0,1,2,27.7500,0,1
314,2,43.0,1,1,26.2500,0,1
...,...,...,...,...,...,...,...
106,3,21.0,0,0,7.6500,1,0
270,1,0.0,0,0,31.0000,0,1
860,3,41.0,2,0,14.1083,0,1
435,1,14.0,1,2,120.0000,1,0
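
In a notebook cell only the value of the last expression is echoed, so the tuple of shapes in the cell above is not displayed. As a rough sketch of how the split sizes can be checked, here is the same train_test_split call on a small synthetic array (the data and the names X_demo and y_demo are assumptions for illustration only). The stratify argument is an optional extra that keeps the class proportions similar in both parts; the notebook itself does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the labels (20 rows, 2 columns).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Same split parameters as in the notebook, plus optional stratification.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# Printing explicitly makes the shapes visible regardless of cell position.
print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test:", y_te.mean())
```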


## 4. Creating the decision tree model

In [11]:
tree_clf = DecisionTreeClassifier(max_depth=4)

## 5. Training

In [12]:
tree_clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=4)
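
Once a tree has been fitted, a few of its properties can be inspected directly. The sketch below trains a tree with the same max_depth on synthetic data generated with make_classification (an assumption made so the example runs without the Titanic files; clf, X_demo and y_demo are illustrative names) and prints its depth, its number of leaves and the importance scikit-learn assigns to each feature.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 7 features, standing in for the Titanic feature matrix.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)

# random_state is fixed only to make the sketch reproducible.
clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_demo, y_demo)

# max_depth is an upper bound: the fitted tree may be shallower.
print("depth:", clf.get_depth())
print("leaves:", clf.get_n_leaves())
# Impurity-based importances, one value per feature, summing to 1.
print("importances:", clf.feature_importances_.round(3))
```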

## 6. Visualizing the decision tree

We can visualize the decision tree by using the export_graphviz() function to export a graph description file (in DOT format) and then converting it to PNG:

In [13]:
from sklearn.tree import export_graphviz

In [21]:
export_graphviz(tree_clf, 
                out_file='./img/titanic_tree.dot',
                feature_names=X.columns,
                class_names=['Not survived', 'survived'],
                rounded=True,
                filled=True)

We convert the DOT file into a .png file with the Graphviz dot command:

In [20]:
! dot -Tpng ./img/titanic_tree.dot -o ./img/titanic_tree.png

And this is the result:

![Result](img/titanic_tree.png)
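
When the Graphviz dot command is not available, scikit-learn's export_text function offers a plain-text view of the same kind of structure. The following is a small sketch on synthetic data with generic feature names (both are assumptions for illustration, not the Titanic tree shown above):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data and generic column names, used only for illustration.
feature_names = [f"feature_{i}" for i in range(7)]
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# A shallow tree keeps the printed rules short.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text returns the decision rules as an indented string.
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

Each indented line is a threshold test on one feature, and each terminal line reports the class predicted in that leaf.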

## 7. Making predictions

In [16]:
# Making predictions
y_train_pred = tree_clf.predict(X_train)
y_test_pred = tree_clf.predict(X_test)
# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)


The training accuracy is 0.8330658105939005
The test accuracy is 0.8097014925373134
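
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on a small hand-made pair of label arrays (hypothetical values, not the Titanic predictions) and also prints the confusion matrix, which shows where the errors fall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration: 8 samples, 2 of them misclassified.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Fraction of positions where prediction and truth agree.
manual = (y_true == y_pred).mean()
print(manual, accuracy_score(y_true, y_pred))  # both are 0.75

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```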


## 8. Improvement: adding hyperparameters to the model

In [17]:
# Training the model
model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=6, min_samples_split=10)
model.fit(X_train, y_train)
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculating accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The improved training accuracy is', train_accuracy)
print('The improved test accuracy is', test_accuracy)


The improved training accuracy is 0.8651685393258427
The improved test accuracy is 0.7873134328358209
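
In the run above, the training accuracy rises from about 0.833 to about 0.865, while the test accuracy falls from about 0.810 to about 0.787. This suggests that the deeper tree fits the training data more closely without generalizing better. One common way to choose hyperparameters less arbitrarily is a cross-validated grid search. The sketch below uses GridSearchCV on synthetic data; the data, the grid values and the variable names are assumptions for illustration, not tuned settings for the Titanic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

# Candidate values (chosen arbitrarily) for the three hyperparameters used above.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 6],
    "min_samples_split": [2, 10],
}

# 5-fold cross-validation on the training part only; the test part stays untouched.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("mean cross-validated accuracy:", search.best_score_)
print("accuracy on the held-out test part:", search.score(X_te, y_te))
```

Selecting the combination by cross-validation keeps the test split as an independent check instead of using it to pick the model.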


### 8.1. Estimating class membership probabilities

A decision tree can also estimate the probability that a given instance belongs to a given class. It simply returns the proportion of training instances of that class among all the training instances in the leaf that the instance falls into.

We can check this with scikit-learn's predict_proba method:

In this example we pass the first passenger of the test split, which has the same seven feature columns the model was trained on. The result contains one probability per class, in the order given by tree_clf.classes_ (0 = did not survive, 1 = survived). Passing a row with a different number of features raises a ValueError.

In [18]:
tree_clf.predict_proba(X_test.iloc[[0]])

In [None]:
tree_clf.predict(X_test.iloc[[0]])


Note: we obtain the same probability for every point that falls inside the region assigned to a given leaf. It does not matter whether the new data point lies close to the decision boundaries.
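
Both statements can be checked numerically. The sketch below, on synthetic data (an assumption so that it runs without the Titanic files; all variable names are illustrative), recomputes the class proportions in one leaf by hand, compares them with predict_proba, and then confirms that every training sample in that leaf receives identical probabilities. It assumes a tree trained without class or sample weights, as in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Index of the leaf reached by every training sample, and by one query sample.
train_leaves = clf.apply(X_demo)
query = X_demo[:1]
leaf_id = clf.apply(query)[0]

# Class proportions among the training samples that ended up in that leaf.
in_leaf = y_demo[train_leaves == leaf_id]
manual_proba = np.array([(in_leaf == c).mean() for c in clf.classes_])

print("manual proportions:", manual_proba)
print("predict_proba:     ", clf.predict_proba(query)[0])

# All samples routed to the same leaf get the same probability vector.
same_leaf_proba = clf.predict_proba(X_demo[train_leaves == leaf_id])
print("identical within leaf:", np.allclose(same_leaf_proba, same_leaf_proba[0]))
```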