## Decision Trees

In this classification task, we'd like to distinguish three species of iris flowers (Setosa, Versicolor, and Virginica) based on the length and width of their petals and sepals.

This data set is built into scikit-learn, so it's straightforward to load. See the [Wikipedia article](https://en.wikipedia.org/wiki/Iris_flower_data_set) for more details on the iris data set.

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
# Load in features X and outputs y.
# Then split (X, y) into a training and test set.
# 30% of the data is put into the test set and 70% in the training set.
iris_data = load_iris()
X = iris_data.data
y = iris_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
y_actual = y_test

In [None]:
print("Features are", iris_data.feature_names)
print("Targets are", iris_data.target_names)
print("Training set size is", len(X_train))
print("Test set size is", len(X_test))
for a, b in zip(X_train[:10], y_train[:10]):
    print("Input = {0}; output = {1} (species = {2})".format(a, b,
                                                             iris_data.target_names[b]))

## Build and output the model
In the next two steps, we build the decision tree model from the training set and export it to a DOT file for viewing in Graphviz.

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

In [None]:
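# Write the fitted tree to a DOT file that Graphviz can render.
# export_graphviz writes to the open file handle, so its return value
# is not assigned (assigning it to f would shadow the handle).
with open("iris.dot", "w") as f:
    tree.export_graphviz(model,
                         out_file=f,
                         feature_names=iris_data.feature_names,
                         class_names=iris_data.target_names,
                         filled=True,
                         rounded=True,
                         special_characters=True)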
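
If Graphviz is not installed, the same tree can be inspected as plain text. The following is a minimal sketch, assuming a scikit-learn version that provides `sklearn.tree.export_text`; the `random_state=0` passed to the classifier is an added assumption so that the printed tree is reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Same data and split as in the cells above.
iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.3, random_state=0)

# random_state is an assumption added here for reproducibility.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Render the learned splits as indented text rules.
tree_rules = export_text(model, feature_names=list(iris_data.feature_names))
print(tree_rules)
```

Each indented line is a split on one feature, and each `class:` line is a leaf. To render the DOT file itself, a command along the lines of `dot -Tpng iris.dot -o iris.png` should work where Graphviz is available.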

## Model evaluation

Let's first print the confusion matrix as we usually do.

In [None]:
y_pred = model.predict(X_test)
print(confusion_matrix(y_actual, y_pred))
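
To make the layout of the matrix concrete, here is a small sketch on hand-made labels (the arrays `y_true_toy` and `y_pred_toy` are hypothetical, not taken from the iris data). In scikit-learn's convention, rows correspond to the actual class and columns to the predicted class, so correct predictions sit on the diagonal.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = setosa, 1 = versicolor, 2 = virginica.
y_true_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 2, 1, 2, 2, 1])

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Row i, column j counts samples of actual class i predicted as class j.
# The diagonal holds the correct predictions, so accuracy is trace / total.
accuracy = np.trace(cm) / cm.sum()
print("Accuracy:", accuracy)
```

Here one versicolor is mistaken for virginica (row 1, column 2) and one virginica for versicolor (row 2, column 1).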

Now let's print the **precision**, **recall** and **$F_1$ score** for each class.

**Example**: For the "versicolor" class, the precision is equal to the proportion of irises predicted to be "versicolor" that were indeed "versicolor".

The recall is equal to the proportion of irises that are in fact "versicolor" that the classifier correctly predicted to be "versicolor".

If our classifier hypothetically labelled everything as "versicolor", this would give us a low precision and high recall (100%) for this class.

If our classifier labelled only a single iris (where it was absolutely sure of its prediction) as "versicolor", this would give us a high precision (100%) and low recall for this class.
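
The two extremes above can be checked numerically. This is a minimal sketch on a hypothetical set of ten labels (not the iris test set), treating "versicolor" as the positive class and everything else as negative.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth: 1 = versicolor, 0 = any other species.
# Three of the ten irises are versicolor.
is_versicolor = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])

# Extreme 1: label everything as versicolor.
pred_all = np.ones_like(is_versicolor)
p_all = precision_score(is_versicolor, pred_all)
r_all = recall_score(is_versicolor, pred_all)

# Extreme 2: label only a single, truly versicolor iris as versicolor.
pred_one = np.zeros_like(is_versicolor)
pred_one[3] = 1
p_one = precision_score(is_versicolor, pred_one)
r_one = recall_score(is_versicolor, pred_one)

print("Everything versicolor: precision = %.2f, recall = %.2f" % (p_all, r_all))
print("One sure prediction:   precision = %.2f, recall = %.2f" % (p_one, r_one))
```

In the first case only 3 of the 10 positive predictions are right, yet no versicolor is missed. In the second case the single prediction is right, but two of the three versicolor irises are missed.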

Typically, we have to trade off precision against recall based on what is most important for our problem.

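The $F_1$ score is the harmonic mean of precision and recall: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. It weights precision and recall equally, but unlike the arithmetic mean it is pulled towards the smaller of the two, so a class only scores well when both are high.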
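
As a worked example, the sketch below computes $F_1$ by hand and compares it with `sklearn.metrics.f1_score`. The helper `f1_from` and the toy labels are hypothetical and only serve as an illustration.

```python
from sklearn.metrics import f1_score

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical binary labels: 1 = versicolor, 0 = other.
y_true_toy = [1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 0, 0, 0]

# One positive prediction, and it is correct: precision = 1/1.
# One of three actual positives found: recall = 1/3.
precision, recall = 1.0, 1.0 / 3.0

f1_manual = f1_from(precision, recall)
f1_sklearn = f1_score(y_true_toy, y_pred_toy)
arithmetic_mean = (precision + recall) / 2

print("F1 (manual):    ", f1_manual)
print("F1 (sklearn):   ", f1_sklearn)
print("Arithmetic mean:", arithmetic_mean)
```

The harmonic mean comes out lower than the arithmetic mean because it penalises the imbalance between a perfect precision and a poor recall.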

In [None]:
print(classification_report(y_actual,
                            y_pred,
                            target_names=iris_data.target_names))

In [None]:
model.predict_proba(X_test)

In [None]:
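def second_element(x):
    """Return the second item of a tuple; used as a sort key."""
    return x[1]

bla = [("hello", 5), ("bla", 2), ("wah", 1)]

# Sort in place by the second element, in ascending order.
# key=second_element is equivalent to key=lambda x: x[1].
bla.sort(key=second_element, reverse=False)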

In [None]:
bla

In [None]:
a = [1, 2, 3]
b = ["hello", "world", "yes"]
# In Python 3, zip returns a lazy iterator; wrap it in list() to see the pairs.
list(zip(a, b))

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

In [None]:
iris = pd.DataFrame(X, columns=iris_data.feature_names)

In [None]:
iris['species'] = y
iris['petal_width_bins'] = pd.qcut(iris['petal width (cm)'],
                                   q=3,
                                   labels=["low width", "medium width", "high width"])

iris['sepal_length_bins'] = pd.qcut(iris['sepal length (cm)'],
                                    q=3,
                                    labels=["low length", "medium length", "high length"])

iris.sample(10)

In [None]:
sns.boxplot(x="species", y="sepal length (cm)", data=iris)
plt.show()

In [None]:
sns.boxplot(x="species", y="petal length (cm)", data=iris)
plt.show()

In [None]:
sns.barplot(x="petal_width_bins", y="species", hue="sepal_length_bins", data=iris)
plt.show()

In [None]:
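# distplot is deprecated in recent seaborn releases; a density histogram
# with a KDE overlay is the closest replacement.
sns.histplot(iris['sepal width (cm)'], kde=True, stat="density")
plt.show()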

In [None]:
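# factorplot was renamed to catplot in later seaborn releases.
# The saturation argument is dropped because it does not apply to point plots.
# errorbar=None replaces ci=None in recent releases (use ci=None on older ones).
g = sns.catplot(x="petal width (cm)", y="sepal length (cm)", col="species",
                data=iris, kind="point", errorbar=None, aspect=.6)

plt.show()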

In [None]:
iris.head()

In [None]:
# Correlate numeric columns only; the binned columns are categorical.
corrmat = iris.select_dtypes(include="number").corr()
fig, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()

In [None]:
corrmat = iris.select_dtypes(include="number").corr()
corrmat

In [None]:
words_freqs = {'words': ['hello', 'world', 'chicken'],
               'freqs': [10, 5, 3]}
words_freqs_df = pd.DataFrame(words_freqs)

words_freqs_df

In [None]:
my_plot = sns.barplot(x="words", y="freqs", data=words_freqs_df)
# Save the figure before showing it.
my_plot.get_figure().savefig("bla3.png", dpi=200)
plt.show()

In [None]:
list1 = [1, 2, 3, 4]
list2 = [2, 3, 5, 9]

set(list1) & set(list2)

In [None]:
plt.savefig('bla.png')