# Technical background and prerequisites

## Libraries and imports

For the analysis we use the Python programming language together with several data analysis libraries that are commonly referred to as the "PyData" stack. [PyData](https://pydata.org/) is also a community that organises events such as talks, meetups and conferences.

In [None]:
# Data Structures
import numpy as np
import pandas as pd
import sklearn

# Plotting Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter Notebook Magic
#sns.set_style('whitegrid')
%matplotlib inline
from IPython.core.pylabtools import figsize

![Libraries](static/libs.png)

## Python + Jupyter basics

* More examples and explanations for Python: https://learnxinyminutes.com/docs/python/
* An introduction to JupyterLab by me: https://www.youtube.com/watch?v=aSChciAOvcE

In [None]:
talk_title = "Einführung in datengetriebene Projekte"

In [None]:
type(talk_title)

In [None]:
my_list = [1, 2, 3, talk_title]

In [None]:
my_list[3]

In [None]:
my_list[:2]

In [None]:
my_talk = {"author": "Nico Kreiling", "title": talk_title, "date": "17.2.2020"}   
marcels_talk = {"author": "Marcel Kurovski", "title": "Recommender Systems", "date": "19.2.2020"}   
inovex_talks = [my_talk, marcels_talk]

In [None]:
my_talk

In [None]:
for talk in inovex_talks:
    print(f'{talk["author"]} gives a talk on "{talk["title"]}" on {talk["date"]}')

In [None]:
tage_mit_inovex_vortrag = [talk["date"] for talk in inovex_talks]
print(set(tage_mit_inovex_vortrag))

In [None]:
df = pd.DataFrame(inovex_talks)
df

In [None]:
print(type(df))
print(type(df.author))

# Survive on the Titanic


**Approach**

Our approach in this workshop follows [CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining), an industry-independent standard process for data science (originally data mining) projects.

It describes an iterative procedure: once the task is understood in terms of content, the data is examined in an exploratory analysis. If the data and the requirements are in line, a model development phase starts, in which the data is prepared so that one or more machine learning methods can be applied to it. If the resulting model reaches sufficient offline quality, it is evaluated in a suitable A/B test and then put into production.

![Crisp DM](static/crisp_dm.png)

# Business Understanding

This phase is about understanding the actual task in order to determine an appropriate approach.

## Titanic

On April 15, 1912, the largest passenger liner ever made collided with an iceberg during her maiden voyage. When the Titanic sank it killed 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck resulted in such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.

The titanic.csv file contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe different attributes about the person including whether they survived (S), their age (A), their passenger-class (C), their sex (G) and the fare they paid (X).

In [None]:
from IPython.display import HTML
# Youtube
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ItjXTieWKyI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


**Task:** Predict whether a passenger survives the sinking of the Titanic.

**Machine learning classification**
* Supervised machine learning
* Classification
* Binary classification

**No further constraints:**

* We do not have to explain why we think a passenger dies or survives (explainability)
* In theory, arbitrary computing resources are available to us
* We are free in our choice of technology
* The classes are not weighted
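
Before any model is trained, it helps to know what a trivial predictor would achieve on a binary task like this. The following is only a rough sketch: it assumes that the casualty figures quoted in the Titanic description above (1502 of 2224 people died) roughly reflect the class balance of the data set, which may not hold exactly. The variable names are illustrative.

```python
# Majority-class baseline from the figures quoted in the text
# (assumption: the data set has a similar class balance).
total_people = 2224
died = 1502
survived = total_people - died

# Always predicting the majority class ("died") is correct
# for every person who actually died.
baseline_accuracy = max(died, survived) / total_people
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained classifier should beat this value clearly before it is worth further tuning.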

# Data Understanding

With this background knowledge we will analyse the given data to assess the feasibility of the task.

## Data Loading

In [None]:
train = pd.read_csv("./data/titanic/train.csv")
test  = pd.read_csv("./data/titanic/test.csv")

In [None]:
train_raw = train.copy()
test_raw  = test.copy()

print("Train Dimensions:", train.shape)
print("Test Dimensions:", test.shape)

# preview the data
train.head()

In [None]:
import qgrid
col_options = {
    'width': 70,
}

def qshow(df, ops=None):
    if ops is None:
        ops = dict(
            column_options=col_options,grid_options={'forceFitColumns': True}
        )
    return qgrid.show_grid(df, **ops)

#qshow(train)

In [None]:
# Columns that only exist in the training set
set(train.columns)-set(test.columns)

In [None]:
# Get Datatypes
train.info()

In [None]:
columns2drop = ["PassengerId"]
train[columns2drop].head()

In [None]:
for df in [train, test]:
    df.drop(columns2drop, axis=1, inplace=True)

## Missing data

In [None]:
train.isna().sum()

In [None]:
figsize(15,5)
sns.heatmap(train.isnull(), cbar=False)

In [None]:
print("training:",train.isnull().sum())
print("test:",test.isnull().sum())

## Target variable

In [None]:
train.Survived.value_counts()

In [None]:
figsize(15,5)
train.Survived.value_counts().sort_index().plot(kind="bar")

## Independent variables (features)

### Passenger Class (Pclass)

In [None]:
figsize(15,5)
train.Pclass.value_counts().sort_index().plot(kind="bar")

In [None]:
train.groupby(["Pclass","Survived"]).size().unstack("Survived").plot(kind="bar")

### Ticket fare

In [None]:
# fare distribution per passenger class
facet = sns.FacetGrid(train, hue="Pclass",aspect=4)
facet.map(sns.kdeplot,'Fare',shade= True)
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()

### Sex

In [None]:
train.Sex.value_counts()

In [None]:
train.Sex.value_counts().sort_index().plot(kind="bar")

In [None]:
train.groupby(["Sex","Survived"]).size().unstack("Survived").plot(kind="pie", subplots=True)

### Age

In [None]:
train.Age.describe()

In [None]:
train.Age.hist()

In [None]:
train.Age.to_frame().plot(kind="kde", bw_method=0.3)

In [None]:
# peaks for survived/not survived passengers by their age
facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()

### Family relations

In [None]:
train.SibSp.value_counts().sort_index().plot(kind="bar")

In [None]:
train.groupby(["SibSp","Survived"]).size().unstack("Survived").plot(kind="bar")

In [None]:
train.Parch.value_counts().sort_index().plot(kind="bar")

In [None]:
train.groupby(["Parch","Survived"]).size().unstack("Survived").plot(kind="bar")

## Interactive dashboard with Panel

In [None]:
df = train
def plot_categorial(column):
    %matplotlib agg
    ab = df.groupby([column,"Survived"]).size().unstack("Survived").plot(kind="bar").get_figure()
    return ab

def plot_numerical(column):
    %matplotlib agg
    # density of the given column for survived/not survived passengers
    facet = sns.FacetGrid(df, hue="Survived",aspect=4)
    facet.map(sns.kdeplot,column,shade= True)
    facet.set(xlim=(0, df[column].max()))
    facet.add_legend()
    return facet.fig
    
def plot_column(column):
    # np.float was removed from NumPy; use the pandas dtype check instead
    if pd.api.types.is_float_dtype(df[column]):
        fig = plot_numerical(column)
    else:
        fig = plot_categorial(column)
    return fig
    
fig = plot_column("Age")        

In [None]:
import panel as pn
pn.extension()
cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
fig = pn.interact(plot_column, column=cols)
fig

## Exercises
* Collect findings and ideas that you find interesting and that could later help with predicting survivors.
* Analyse the "Embarked" column and try to explain why port C differs from Q and S.
* Extend the dashboard with an interactive title that describes what is shown.

### Findings from the data analysis

### Why does port of embarkation C differ from Q and S?

### Dashboard extension

# Basic Data Preparation

In [None]:
train_with_nas = train.copy()
test_with_nas = test.copy()

## Missing Values

Missing values, i.e. unknown values, are a problem for many (though not all) machine learning algorithms, because these essentially operate on numerical values. It is therefore generally a good idea to replace missing values. There are numerous ways to do so, for example:

* Replacing with a fixed value
    * For numerical values, use the mean (average) or the median.
    * For categorical values, the most frequent class is a good choice.
    
* Replacing with a representative value
    * As above, but the value is computed within a representative subgroup
    * Automatically formed groups of similar rows (e.g. via kNN)
    
* Replacing with an algorithmically determined value
    * Regression and classification models
    * Deep learning
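
The difference between a fixed and a representative replacement value can be illustrated on a few rows. This is a minimal sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration); it assumes that the passenger class is a sensible subgroup for age, which would have to be checked on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the notebook itself would use `train` instead.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# 1) Fixed value: the global median of all known ages.
global_median = toy["Age"].median()
age_fixed = toy["Age"].fillna(global_median)

# 2) Representative value: the median within the passenger's own class.
class_median = toy.groupby("Pclass")["Age"].transform("median")
age_by_class = toy["Age"].fillna(class_median)

print(age_fixed.tolist())
print(age_by_class.tolist())
```

With the group-wise variant, the missing first-class age is filled from first-class passengers only, so it no longer gets pulled towards the younger third-class passengers.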

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat works in all versions
df = pd.concat([train, test])
df.isnull().sum().sort_values(ascending=False)

In [None]:
default_age = train.Age.median()

# Assign the result back instead of calling fillna(inplace=True) on a column
for df in [train, test]:
    df["Age"] = df["Age"].fillna(default_age)

In [None]:
default_price = train.Fare.mean()

for df in [train, test]:
    df["Fare"] = df["Fare"].fillna(default_price)

In [None]:
default_harbor = train.Embarked.mode()[0]

for df in [train, test]:
    df["Embarked"] = df["Embarked"].fillna(default_harbor)

In [None]:
df = pd.concat([train, test])
df.isnull().sum().sort_values(ascending=False)

**Alternative:** scikit-learn's missing value imputer

In [None]:
train.Embarked.values.reshape(-1, 1)[:5]

In [None]:
from sklearn.impute import SimpleImputer

embarked_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
embarked_imp.fit_transform(train.Embarked.values.reshape(-1, 1))
without_nas = embarked_imp.transform(test.Embarked.values.reshape(-1, 1))

# Check if there are still NAs?
pd.Series(np.ravel(without_nas)).isna().sum()

## Feature Transformation

Not only missing values are problematic for algorithms, but also other non-numerical values. Many methods automatically convert boolean values into a 1/0 representation, but it is still advisable to convert strings and categorical values explicitly. Common techniques are:

* Binarize: representation as binary values (1 and 0 for True and False)
* [One Hot Encoding](http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example)
* Dummy Encoding
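
One-hot and dummy encoding differ only in whether one category is dropped as a reference level. The following minimal sketch shows this on a hypothetical toy column holding the three harbour codes C, Q and S; it uses `pd.get_dummies`, which is one of several ways to do this.

```python
import pandas as pd

# Hypothetical toy column with the three harbour codes of the data set.
harbour = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(harbour, prefix="Embarked")

# Dummy encoding: the first category is dropped and acts as the reference level.
dummy = pd.get_dummies(harbour, prefix="Embarked", drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```

Dummy encoding avoids perfectly collinear columns, which matters for linear models; tree-based models usually cope with either form.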

In [None]:
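from sklearn.preprocessing import label_binarize
for df in [train, test]:
    # classes must be passed by keyword; ravel() flattens the (n, 1) result into a column
    df['Sex'] = label_binarize(df.Sex, classes=['male', 'female']).ravel()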

In [None]:
from sklearn.preprocessing import LabelEncoder
harbour_encocer = LabelEncoder()
harbour_encocer.fit(train.Embarked)
for df in [train, test]:
    df["Embarked"] = harbour_encocer.transform(df.Embarked)

In [None]:
df.dtypes

In [None]:
df.select_dtypes(exclude=[np.number])

In [None]:
for df in [train, test]:
    df.drop(["Name","Ticket"], axis=1, inplace=True)

## Simple feature engineering

Not all information can be converted into machine-readable data in such a standardised way. Feature engineering is about creating new columns that provide additional, valuable information to the algorithms.

### Ship cabin

Some passengers have a cabin number, which might allow conclusions about their position on the ship (deck, inside or outside cabin, hull or stern area), although this is not obvious. Because so many values are missing, however, the mere information that a passenger had a cabin (which was not a matter of course) is already valuable, and we want to use it.
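
As a rough illustration of both ideas, the sketch below derives a simple "has a cabin" flag and, as an optional finer-grained variant, a deck letter. It runs on hypothetical cabin values and assumes that a cabin identifier starts with its deck letter (as in "C85"); whether that assumption holds for every entry would have to be verified on the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values; assumption: the first character is the deck letter.
cabins = pd.Series(["C85", np.nan, "E46", np.nan, "B96 B98"], name="Cabin")

# Binary flag: does the passenger have a cabin entry at all?
has_cabin = cabins.notna()

# Optional finer-grained feature: deck letter, "Unknown" where no cabin is recorded.
deck = cabins.str[0].fillna("Unknown")

print(has_cabin.tolist())
print(deck.tolist())
```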

In [None]:
len(train.loc[train.Cabin.isna()]) / len(train.Cabin)

In [None]:
for df in [train, test]:
    df["has_cabin"] = ~df.Cabin.isna()
    
for df in [train, test]:
    df.drop("Cabin", axis=1, inplace=True)

### Family size

The data set contains the two columns SibSp (siblings/spouses) and Parch (parents/children). This somewhat confusing split can be represented much more simply by combining both values into a family size.

In [None]:
df.SibSp + df.Parch

In [None]:
for df in [train, test]:
    df["family_size"] = df.SibSp + df.Parch

## Exercises

* Check how the family size feature relates to the probability of survival
* Create a feature that states whether a person is travelling alone
* Remove all columns that are not needed for model training
* Write a function that cleans all missing values in a DataFrame, and make it configurable whether the DataFrame itself or a copy is cleaned

### Family size analysis

### Travelling-alone feature

### Removing unnecessary columns

### Missing value function

# Machine Learning (Modeling 1)

Check again that only numeric and boolean values occur and that no missing values remain (except for Survived in the test set).

In [None]:
# Check the data again
# There should be only bool and numeric columns and no NAs apart from Survived (which of course is unknown in the test set)
df = pd.concat([train, test])
pd.DataFrame(zip(df.columns, df.dtypes, df.isna().sum()), columns=["column", "type", "NAs"])

In [None]:
X = train.drop(columns="Survived")
y = np.ravel(train[['Survived']])

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
print(X_train.shape)
print(X_test.shape)

## Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

In [None]:
from sklearn.tree import plot_tree  # public import path instead of the private _export module
figsize(20, 8)
fig = plot_tree(classifier, label='root', feature_names = X.columns, impurity=True, filled=True) 

In [None]:
classifier = DecisionTreeClassifier(max_depth=3)
classifier.fit(X_train, y_train)

figsize(20, 8)
fig = plot_tree(classifier, label='root', feature_names = X.columns, impurity=True, filled=True) 

## Model evaluation

In [None]:
predictions = classifier.predict(X_test)

In [None]:
pd.DataFrame(zip(predictions, y_test), columns=["predicted", "actual"]).sample(5, random_state=1)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predictions)
tn, fp, fn, tp = cm.ravel()
pd.DataFrame(cm)

In [None]:
sns.heatmap((pd.DataFrame(cm)/len(predictions)).round(decimals=2), annot=True)

![Binary classification metrics](static/metrics.png)

### False Positive Rate / Type I error

How many of the negative events (passenger died) were wrongly predicted as positive?

=> For how many passengers did we predict survival although they died?

In [None]:
false_positive_rate = fp / (fp + tn)
false_positive_rate

### False Negative Rate / Type II error

How many of the positive events (passenger survived) were wrongly predicted as negative?

=> For how many passengers did we predict that they would drown although they survived?

In [None]:
false_negative_rate = fn / (tp + fn)
false_negative_rate

### True Negative Rate / Specificity

How many of the negative events (died) did we predict correctly?

In [None]:
specificity = tn / (tn + fp)
specificity

### True Positive Rate / Recall / Sensitivity

How many of the positive events (survived) did we predict correctly?

In [None]:
recall = tp / (tp + fn)
recall

# or alternatively
from sklearn.metrics import recall_score
recall_score(y_test, predictions)

### Positive Predictive Value / Precision
How many of the events predicted as positive were actually positive (did actually survive)?

In [None]:
precision = tp/ (tp + fp)

# or alternatively
from sklearn.metrics import precision_score
precision_score(y_test, predictions)

### Accuracy

How many of the predictions are correct?

In [None]:
accuracy = (tp + tn) / (tp + fp + fn + tn)

# or alternatively
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

### F1-Score

The harmonic mean of precision and recall

In [None]:
f1 = 2 * precision * recall / (precision + recall)

# or alternatively
from sklearn.metrics import f1_score
f1_score(y_test, predictions)
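
To see how the metrics of this section relate to each other, it can help to compute them once by hand. The sketch below uses hypothetical confusion-matrix counts (they are not results from this notebook) and plain Python only; "positive" means survived and "negative" means died, as in the rest of the section.

```python
# Hypothetical confusion-matrix counts (not results from the notebook).
tn, fp, fn, tp = 50, 10, 5, 35

false_positive_rate = fp / (fp + tn)   # died, but predicted "survived"
false_negative_rate = fn / (fn + tp)   # survived, but predicted "died"
specificity = tn / (tn + fp)           # equals 1 - false_positive_rate
recall = tp / (tp + fn)                # equals 1 - false_negative_rate
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("FPR", false_positive_rate), ("FNR", false_negative_rate),
                    ("specificity", specificity), ("recall", recall),
                    ("precision", precision), ("accuracy", accuracy), ("F1", f1)]:
    print(f"{name}: {value:.3f}")
```

Note that specificity and the false positive rate always add up to 1, as do recall and the false negative rate, so each pair carries the same information.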

## Model analysis

In [None]:
classifier.feature_importances_

In [None]:
pd.DataFrame(zip(X.columns, classifier.feature_importances_), columns=["feature", "importance"])\
.set_index("feature").sort_values("importance")\
.plot.barh()

## Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(classifier, X, y, scoring = "accuracy", cv=10)
print(cv_scores)
np.mean(cv_scores)

In [None]:
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(classifier, X, y, scoring = "accuracy", cv=2)
print(cv_scores)
np.mean(cv_scores)

## Learning Curves

From the [Scikit-Learn Documentation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html)

In [None]:
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate 3 plots: the test and training learning curve, the training
    samples vs fit times curve, the fit times vs score curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    axes : array of 3 axes, optional (default=None)
        Axes to use for plotting the curves.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes, return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-')
    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    axes[2].grid()
    axes[2].plot(fit_times_mean, test_scores_mean, 'o-')
    axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1)
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(10, 15))

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

plot_learning_curve(classifier, "Decision Tree max_depth=3", X, y, axes=axes[:], cv=cv, n_jobs=4)

plt.show()

## Exercises
* Write a function that computes and displays the most important metrics
* Try different parameters for the decision tree and analyse their effects

### Scoring function

### Try out different model parameters

# Modeling 2

## HyperParameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

dt_hpt_params = dict()
dt_hpt_params["max_depth"] = [2,3,5,10,None]
dt_hpt_params["min_samples_leaf"] = [1,2,3,5,10]
dt_hpt_params["min_samples_split"] = [2, 3, 4, 5, 10]  # integer values must be >= 2

In [None]:
dt = DecisionTreeClassifier()
dt_random = RandomizedSearchCV(estimator = dt, param_distributions = dt_hpt_params,\
                               n_iter = 50, scoring="f1", cv = 5, verbose=2, random_state=42, n_jobs = -1)
dt_random.fit(X_train, y_train)

In [None]:
dt_random.best_params_

In [None]:
from sklearn.metrics import classification_report

fig, axes = plt.subplots(3, 1, figsize=(10, 15))

classifier = DecisionTreeClassifier(**dt_random.best_params_)
classifier.fit(X_train, y_train)
plot_learning_curve(classifier, "Decision Tree", X, y, axes=axes[:], cv=cv, n_jobs=4)

predictions = classifier.predict(X_test)
# get_scores() and baseline are expected to come from the scoring-function exercise above
diff = pd.Series(get_scores(y_test, predictions))-baseline
print(diff)

# class 0 = drowned, class 1 = survived
print(classification_report(y_test, predictions, target_names=["Drowned", "Survived"]))
plt.show()

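## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold
fig, axes = plt.subplots(3, 1, figsize=(10, 15))

# random_state is only accepted together with shuffle=True
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# max_iter is raised so that the solver converges on the unscaled features
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
plot_learning_curve(classifier, "Logistic Regression", X, y, axes=axes[:], cv=cv, n_jobs=4)

predictions = classifier.predict(X_test)
# get_scores() and baseline are expected to come from the scoring-function exercise above
diff = pd.Series(get_scores(y_test, predictions))-baseline
print(diff)

# class 0 = drowned, class 1 = survived
print(classification_report(y_test, predictions, target_names=["Drowned", "Survived"]))
plt.show()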

## Comparison of different algorithms

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve

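# random_state is only accepted together with shuffle=True
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

X_train = train.drop(columns="Survived")
y_train = np.ravel(train[['Survived']])  # 1-D target, as expected by scikit-learn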

classifiers = [DecisionTreeClassifier(), LogisticRegression(), KNeighborsClassifier(), GradientBoostingClassifier()]

algorithm_names = []
cv_means = []
cv_std = []
for classifier in classifiers:
    classifier.random_state=42
    algorithm_names.append(type(classifier).__name__)
    cv_result = cross_val_score(classifier, X_train, y = y_train, scoring = "f1", cv = cv, n_jobs=-1)
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
    
#algorithm_names = ["SVC","DecisionTree","AdaBoost","RandomForest","ExtraTrees","GradientBoosting","MultipleLayerPerceptron","KNeighboors","LogisticRegression","LinearDiscriminantAnalysis"]
cv_res = pd.DataFrame({"mean":cv_means,"std": cv_std,"Algorithm":algorithm_names})

In [None]:
cv_res

In [None]:
figsize(15,5)
g = sns.barplot(x="mean", y="Algorithm", data=cv_res, xerr=cv_std)
g.set_xlabel("Mean F1 score")
g = g.set_title("Cross validation scores")

In [None]:
fig, axes = plt.subplots(3, len(classifiers), figsize=(20, 20))

for i, estimator in enumerate(classifiers):
    title = f"{estimator.__class__.__name__}"
    plot_learning_curve(estimator, title, X, y, axes=axes[:, i],
                        cv=cv, n_jobs=4)

plt.show()

## Exercises

* Build the best classification model you can

# Evaluation & Deployment

## Define pipeline

In [None]:
#%%writefile model_preperation.py
import numpy as np 
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import label_binarize

class ModelPreperation(BaseEstimator, TransformerMixin):
    """Imputes missing values, encodes categorical columns and drops unused ones."""
    def __init__( self ):
        self.title_encoder = LabelEncoder()
        self.harbor_encoder = LabelEncoder()
     
    def fit( self, X, y=None ):
        # Learn all defaults from the data passed to fit (the training data)
        self.default_age = X.Age.median()
        self.default_price = X.Fare.mean()
        self.default_harbor = X.Embarked.mode()[0]
        self.harbor_encoder.fit(X.Embarked.astype("str"))
        return self
        
    def transform( self, df):
        df = df.copy()  # do not modify the caller's DataFrame
        df["Age"] = df["Age"].fillna(self.default_age)
        df["Fare"] = df["Fare"].fillna(self.default_price)
        df["Embarked"] = df["Embarked"].fillna(self.default_harbor)
        df["has_cabin"] = ~df.Cabin.isna()
        # classes must be passed by keyword; ravel() flattens the (n, 1) result
        df['Sex'] = label_binarize(df.Sex, classes=['male', 'female']).ravel()
        df["Embarked"] = self.harbor_encoder.transform(df.Embarked)
        return df.drop(["Cabin","Name","Ticket","PassengerId"], axis=1)

In [None]:
def train_pipe():
    train = pd.read_csv("./data/titanic/train.csv")
    X = train.drop(columns="Survived")
    y = np.ravel(train[['Survived']])

    gb_params = {'n_estimators': 200,'min_samples_split': 16,'min_samples_leaf': 16,'max_features': 5,'max_depth': 3,'learning_rate': 0.25}
    pipe = Pipeline(steps=[("prepare",ModelPreperation()), ("clr",GradientBoostingClassifier(**gb_params))])
    return pipe.fit(X, y)
pipe = train_pipe()

## Build API

In [None]:
%%writefile api.py

from typing import Optional

import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

# Needed so that the custom transformer inside the pipeline can be unpickled;
# the pipeline must have been built with the class imported from model_preperation.py
from model_preperation import ModelPreperation

app = FastAPI()

# Assumption: the trained pipeline was saved beforehand in the notebook,
# e.g. with joblib.dump(pipe, "pipe.joblib"); the file name is hypothetical.
pipe = joblib.load("pipe.joblib")

class Passenger(BaseModel):
    PassengerId: int = 1
    Pclass: int = 3
    Name: str = 'Nico, Rare. Kreiling'
    Sex: str = 'female'  # must be 'male' or 'female', as in the training data
    Age: float = 30
    SibSp: int = 0
    Parch: int = 3
    Ticket: str = ''
    Fare: float = 100
    Cabin: Optional[str] = None  # None means "no cabin recorded"
    Embarked: str = 'C'
        
@app.get("/")
async def root():
    return {"message": "Hello World"}

@app.get("/survived/passanger_id/{user_id}")
async def read_item(user_id):
    return {"item_id": user_id}

@app.post("/survived/custom")
async def predict_custom(passenger: Passenger):
    # Build a one-row DataFrame so that every column keeps its own dtype
    prediction = pipe.predict(pd.DataFrame([dict(passenger)]))
    if prediction[0] == 0:
        return {"message": "Sorry, you die!"}
    else:
        return {"message": "Yeaaah, you will survive :)"}

In [None]:
import requests
form_data = {
    "Pclass": 3, 
    "Name": 'Nico, Mr. Kreiling',
    "Sex": 'male',  # same spelling as in the training data
    "Age": 30,
    "SibSp": 4,
    "Parch": 0,
    "Fare": 100,
    "Embarked": 'C'
}
r = requests.post('http://127.0.0.1:8000/survived/custom', json=form_data)
r.status_code
r.json()["message"]

# Bonus

## Title

In [None]:
train_raw.Name.sample(10)

In [None]:
train_raw.Name.str.extract(r' ([A-Za-z]+)\.', expand=False).value_counts().plot(kind="bar")

In [None]:
def get_title(df):
    titles = df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
    titles = titles.replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],'Rare')
    titles = titles.replace('Mlle','Miss')
    titles = titles.replace('Ms','Miss')
    titles = titles.replace('Mme','Mrs')
    return titles

get_title(train_raw).value_counts().plot(kind="bar")

In [None]:
train["title"] = get_title(train_raw)
test["title"] = get_title(test_raw)

from sklearn.preprocessing import LabelEncoder
title_encoder = LabelEncoder()
title_encoder.fit(train.title)
for df in [train, test]:
    df["title"] = title_encoder.transform(df.title)

## Auto-Features (PClass * Age)

In [None]:
for df in [train, test]:
    df["comb"] = df.Pclass * df.Age

## Exercise
Analyse the influence of the new features on your model

## Scale numerical features and one-hot encode

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

#float_cols = list(train.dtypes[train.dtypes == float].index)
float_cols = ['Age', 'Fare', 'comb', 'family_size']
scaler.fit(train[float_cols])

for df in [train, test]:
    df[float_cols] = scaler.transform(df[float_cols])

In [None]:
print(list(train.dtypes[train.dtypes == int].index))
dummie_cols = ['Pclass', 'Embarked', 'title']    
for c in dummie_cols:  
    print(c)
    for df in [train, test]:
        if len(df[c].unique()) > 2:
            dummies = pd.get_dummies(df[c], prefix=c, drop_first=False)
            for d in dummies:
                df[d] = dummies[d]
            df.drop(columns=c, inplace=True)