## <center>Practice. A decision tree for predicting the survival of Titanic passengers.

We predict whether a Titanic passenger survived or not.

In [51]:
import numpy as np
import pandas as pd
from sklearn.utils import shuffle
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix, average_precision_score
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

**Load the dataset and shuffle it; it is split into train and test parts further below.**

In [52]:
df = pd.read_csv("train.csv")
df = shuffle(df, random_state=42) # shuffle the rows of the dataset

### Data preprocessing

In [53]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 709 | 710 | 1 | 3 | Moubarek, Master. Halim Gonios ("William George") | male | NaN | 1 | 1 | 2661 | 15.2458 | NaN | C |
| 439 | 440 | 0 | 2 | Kvillner, Mr. Johan Henrik Johannesson | male | 31.0 | 0 | 0 | C.A. 18723 | 10.5 | NaN | S |
| 840 | 841 | 0 | 3 | Alhomaki, Mr. Ilmari Rudolf | male | 20.0 | 0 | 0 | SOTON/O2 3101287 | 7.925 | NaN | S |
| 720 | 721 | 1 | 2 | Harper, Miss. Annie Jessie "Nina" | female | 6.0 | 0 | 1 | 248727 | 33.0 | NaN | S |
| 39 | 40 | 1 | 3 | Nicola-Yarred, Miss. Jamila | female | 14.0 | 1 | 0 | 2651 | 11.2417 | NaN | C |


In [54]:
df.describe(include='all')

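| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 | 204 | 889 |
| unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681.0 | NaN | 147 | 3 |
| top | NaN | NaN | NaN | Watson, Mr. Ennis Hastings | male | NaN | NaN | NaN | 347082.0 | NaN | B96 B98 | S |
| freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7.0 | NaN | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 1.0 | 0.0 | 1.0 | NaN | NaN | 0.42 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN |
| 25% | 223.5 | 0.0 | 2.0 | NaN | NaN | 20.125 | 0.0 | 0.0 | NaN | 7.9104 | NaN | NaN |
| 50% | 446.0 | 0.0 | 3.0 | NaN | NaN | 28.0 | 0.0 | 0.0 | NaN | 14.4542 | NaN | NaN |
| 75% | 668.5 | 1.0 | 3.0 | NaN | NaN | 38.0 | 1.0 | 0.0 | NaN | 31.0 | NaN | NaN |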


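**Fill the missing values with medians (columns `Age` and `Embarked`, in both train and test).** Note that a median exists only for numeric columns such as `Age`; `Embarked` holds strings.

Hint: look up `fillna()` and `median()` in the pandas documentation.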

In [97]:
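# numeric_only=True restricts the median to numeric columns such as Age;
# string columns (Embarked, Cabin) have no median and keep their NaN values
df.fillna(df.median(numeric_only=True), inplace=True)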
print(df)

     PassengerId  Survived  Pclass  \
709          710         1       3   
439          440         0       2   
840          841         0       3   
720          721         1       2   
39            40         1       3   
290          291         1       1   
300          301         1       3   
333          334         0       3   
208          209         1       3   
136          137         1       1   
137          138         0       1   
696          697         0       3   
485          486         0       3   
244          245         0       3   
344          345         0       2   
853          854         1       1   
621          622         1       1   
653          654         1       3   
886          887         0       2   
110          111         0       1   
294          295         0       3   
447          448         1       1   
192          193         1       3   
682          683         0       3   
538          539         0       3   
...
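The median is not defined for string columns, so `Embarked` needs a different fill value. A minimal sketch on a small made-up frame, assuming the most frequent value (the mode) is an acceptable substitute; the frame `toy` and its values are hypothetical and are not taken from `train.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic frame (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: fill the gap with the median of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# String column: no median exists, so use the most frequent value instead
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```

Whether the mode is a sensible choice depends on the data; with only two missing `Embarked` values in this dataset the effect is likely small.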

**Encode the categorical features `Pclass`, `Sex`, `SibSp`, `Parch`, and `Embarked` with one-hot encoding.**

In [98]:
# pd.concat joins DataFrames; the axis parameter sets the direction of the join
# pd.get_dummies one-hot encodes a column
# DataFrame.drop removes part of the dataset
# General rule: axis=1 refers to columns, axis=0 refers to rows
# see the documentation

cat_cols = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']

# Drop the original categorical columns and join their one-hot encodings.
# A prefix such as 'Pclass_' yields column names like 'Pclass__1'.
df2 = df.drop(cat_cols, axis=1).join(
        pd.get_dummies(df['Pclass'], prefix = 'Pclass_')).join(
        pd.get_dummies(df['Sex'])).join(
        pd.get_dummies(df['SibSp'], prefix = 'SibSp_')).join(
        pd.get_dummies(df['Parch'], prefix = 'Parch_')).join(
        pd.get_dummies(df['Embarked'], prefix = 'Embarked_'))
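The same encoding can be written more compactly with the `columns=` parameter of `pd.get_dummies`. Without `columns=`, only object, string, and category columns are encoded, so integer-coded features such as `Pclass` would be left unchanged. The sketch below uses a hypothetical frame `toy` rather than the real data, and its default column prefixes differ from the double-underscore names produced in the cell above:

```python
import pandas as pd

# Hypothetical frame with one integer-coded and one string categorical column
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [71.3, 7.9, 13.0, 8.1],
})

# columns= forces encoding of the listed columns, including integer ones,
# and removes the original columns from the result
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex"])

print(encoded.columns.tolist())
```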

In [99]:
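# Features: every column except the target and the free-text columns
X = df2.drop(['Survived', 'Name', 'Ticket', 'Cabin'], axis = 1)
y = df2['Survived']

# 60% of the rows go to training; no random_state is passed, so the split
# (and the metrics below) change from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.6)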
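Because the split is not seeded, the scores reported later cannot be reproduced exactly. A minimal sketch of a seeded, stratified split on synthetic data; `X_toy`, `y_toy`, and the seed 17 are illustrative assumptions, not the notebook's actual setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the Survived labels
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.6, random_state=17, stratify=y_toy)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

Stratification matters most when one class is rare; here it simply guarantees that both parts contain 40% positives.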



In [100]:
df2.head()

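| | PassengerId | Survived | Name | Age | Ticket | Fare | Cabin | Pclass__1 | Pclass__2 | Pclass__3 | ... | Parch__0 | Parch__1 | Parch__2 | Parch__3 | Parch__4 | Parch__5 | Parch__6 | Embarked__C | Embarked__Q | Embarked__S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 709 | 710 | 1 | Moubarek, Master. Halim Gonios ("William George") | 28.0 | 2661 | 15.2458 | NaN | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 439 | 440 | 0 | Kvillner, Mr. Johan Henrik Johannesson | 31.0 | C.A. 18723 | 10.5 | NaN | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 840 | 841 | 0 | Alhomaki, Mr. Ilmari Rudolf | 20.0 | SOTON/O2 3101287 | 7.925 | NaN | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 720 | 721 | 1 | Harper, Miss. Annie Jessie "Nina" | 6.0 | 248727 | 33.0 | NaN | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 39 | 40 | 1 | Nicola-Yarred, Miss. Jamila | 14.0 | 2651 | 11.2417 | NaN | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |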


## 1. Decision tree without parameter tuning

**Train a decision tree (`DecisionTreeClassifier`) with a maximum depth of 2 on the available sample. Use `random_state=17` so that the results are reproducible.**

In [128]:
modelt = DecisionTreeClassifier(max_depth = 2, random_state = 17)
modelt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=17,
            splitter='best')
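A tree this shallow is easy to read directly. One way to look at its split rules, sketched here on synthetic data, is `sklearn.tree.export_text`; the feature names `f0`..`f3` and the generated dataset are placeholders, not the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for the Titanic features
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=17)

tree = DecisionTreeClassifier(max_depth=2, random_state=17).fit(X_toy, y_toy)

# Plain-text dump of the split rules; f0..f3 are placeholder feature names
rules = export_text(tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

For the notebook's model, passing `feature_names=list(X_train.columns)` should give readable rules, assuming a scikit-learn version that ships `export_text`.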

**Use the trained model to make predictions for the test set.**

In [130]:
tep = modelt.predict_proba(X_test)[:, 1]

#### Compute ROC AUC and Average Precision Score

In [131]:
print('ROC AUC:', roc_auc_score(y_test, tep))
print('Average precision : ', average_precision_score(y_test, tep))

ROC AUC: 0.8308692495544848
Average precision :  0.7585654276542566
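Both metrics depend only on how the predicted scores rank the objects, which is why `predict_proba` is used rather than `predict`. A toy example with four made-up objects shows the arithmetic:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Four toy objects: true labels and predicted scores for class 1
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC AUC is the share of (positive, negative) pairs ranked correctly:
# here 3 of the 4 pairs have the positive scored above the negative
auc = roc_auc_score(y_true, y_score)

# Average precision weights the precision at each recall step:
# 0.5 * 1.0 + 0.5 * (2 / 3)
ap = average_precision_score(y_true, y_score)

print(auc)  # 0.75
print(ap)   # about 0.833
```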


## 2. Decision tree with parameter tuning

**Train a decision tree (`DecisionTreeClassifier`) on the available sample, again with `random_state=17`. Tune the maximum depth and the minimum number of samples per leaf with 5-fold cross-validation using `GridSearchCV`.**

In [135]:
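# parameter grid for the search
tree_params = {'max_depth': list(range(1, 5)), 
               'min_samples_leaf': list(range(1, 5))}
modelt2 = DecisionTreeClassifier(random_state = 17)

# cv=5 requests 5-fold cross-validation explicitly
# (the default number of folds differs between scikit-learn versions)
cv = GridSearchCV(modelt2, tree_params, cv=5)
cv = cv.fit(X_train, y_train)

# with the default refit=True, best_estimator_ is already refit on X_train
modelt2 = cv.best_estimator_
modelt2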



DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=17,
            splitter='best')
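Besides `best_estimator_`, a fitted `GridSearchCV` exposes the winning parameter combination and its cross-validated score. The sketch below runs on synthetic data; the choice of `scoring="roc_auc"` is an assumption made here so that the selection criterion matches the metric reported on the test set (the default scoring for a classifier is accuracy):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=17)

params = {"max_depth": list(range(1, 5)),
          "min_samples_leaf": list(range(1, 5))}

# 5-fold cross-validation over all 16 parameter combinations
search = GridSearchCV(DecisionTreeClassifier(random_state=17), params,
                      cv=5, scoring="roc_auc")
search.fit(X_toy, y_toy)

print(search.best_params_)  # the selected max_depth and min_samples_leaf
print(search.best_score_)   # mean cross-validated ROC AUC of that combination
```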

**Use the trained model to make predictions for the test set.**

In [136]:
tep1 = modelt2.predict_proba(X_test)[:, 1]

#### Compute ROC AUC and Average Precision Score and compare them with the previous values

In [137]:
print('ROC AUC:', roc_auc_score(y_test, tep1))
print('Average precision : ', average_precision_score(y_test, tep1))

ROC AUC: 0.8509174311926606
Average precision :  0.7901010851572567


## 3. Logistic regression

#### Train a logistic regression and see how well it performs (by ROC AUC and Average Precision)

In [138]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

In [139]:
logreg.fit(X_train, y_train)
ytep = logreg.predict_proba(X_test)[:, 1]

print('ROC AUC test: ', roc_auc_score(y_test, ytep))
print('Average precision test: ', average_precision_score(y_test, ytep))

ROC AUC test:  0.8574021516731569
Average precision test:  0.8247185843230949
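Unlike trees, logistic regression is sensitive to feature scale, and the feature matrix here mixes 0/1 indicators with much larger values such as `Fare` and `PassengerId`. A hedged sketch of one common remedy, standardizing inside a pipeline, on synthetic data; whether it would improve the scores above has not been checked:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the Titanic features
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=17)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.6, random_state=17)

# The scaler is fit on the training part only and then applied to the test part
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

proba = pipe.predict_proba(X_te)[:, 1]
print(roc_auc_score(y_te, proba))
```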


