### True Learning Objectives

- How can I process data in Python?

#### Decision Tree

The book covers the theory. How do we use decision trees in applied scenarios?

Another Python library: **scikit-learn**:
- Free machine learning library for Python.
- Implements various algorithms for classification, regression, and clustering.
- Convenient and easy to use.
- A list of all currently available algorithms can be found at https://scikit-learn.org/stable/user_guide.html

Let's start by looking at our data. 

In [None]:
from sklearn import tree
import pandas as pd

In [None]:
data =  {'Home Owner': pd.Series(['Yes','No','No','Yes','No','No','Yes','No','No','No'], 
                                 index=['1', '2', '3','4','5','6','7','8','9','10']),
         'Marital Status': pd.Series(['Single','Married','Single','Married','Divorced','Married',
                                      'Divorced','Single','Married','Single'], 
                                     index=['1', '2', '3','4','5','6','7','8','9','10']),
         'Annual Income': pd.Series([125,100,70,120,95,60,220,85,75,90],
                                   index=['1', '2', '3','4','5','6','7','8','9','10']),
         'Defaulted Borrower': pd.Series(['No','No','No','No','Yes','No','No','Yes','No','Yes'], 
                                 index=['1', '2', '3','4','5','6','7','8','9','10']) }

df = pd.DataFrame(data)
print(df)

In this case, we want to build a decision tree that predicts the `Defaulted Borrower` column. Therefore, we split the DataFrame into two parts: `X`, holding the attributes (features), and `y`, holding the corresponding labels.

In [None]:
X = df.drop('Defaulted Borrower', axis=1)
y = df['Defaulted Borrower']

print(X)
print(y)

**Scikit-learn** provides a number of tools to make your work easier. `train_test_split` is one such tool: it generates a random split of your data into a training set and a test set. Here we hold out 20% of the rows for testing.

In [None]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)  
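Because the split is random, the rows that land in the test set change on every run. A minimal sketch of how to make the split repeatable, assuming a fixed seed is acceptable for the exercise: `random_state=0` is an arbitrary value chosen for illustration, and `toy`, `toy_X`, and `toy_y` are hypothetical names for a cut-down copy of the loan table.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Cut-down copy of the loan table above (hypothetical variable names).
toy = pd.DataFrame({
    "Annual Income": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Defaulted Borrower": ["No", "No", "No", "No", "Yes",
                           "No", "No", "Yes", "No", "Yes"],
})
toy_X = toy.drop("Defaulted Borrower", axis=1)
toy_y = toy["Defaulted Borrower"]

# The same seed gives the same split; the value 0 is an arbitrary choice.
split_a = train_test_split(toy_X, toy_y, test_size=0.20, random_state=0)
split_b = train_test_split(toy_X, toy_y, test_size=0.20, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))                   # 8 2
print(X_te.index.equals(split_b[1].index))    # True
```

With ten rows and `test_size=0.20`, two rows go to the test set and eight to the training set.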

In [None]:
print(X_train)

In [None]:
print(X_test)

In [None]:
print(y_train)

In [None]:
print(y_test)

In the code below, we first create a `DecisionTreeClassifier` object, then call its `fit` method on our training data.

In [None]:
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

Oops, what happened?

It turns out that **scikit-learn**'s decision tree only accepts numerical input, so `fit` raises an error when it meets the text columns. Does this mean that it does not work for us? Fortunately, we can preprocess the data to convert the text-based categorical columns to numeric values.

- For column `Home Owner`, let's set `Yes` to be 1, and `No` to be 0.
- For column `Marital Status`, let's set `Married` to be 3, `Single` to be 2, and `Divorced` to be 1.

In [None]:
col_Home = pd.Series(range(0,10), index = ['1', '2', '3','4','5','6','7','8','9','10'])
col_Married = pd.Series(range(0,10), index = ['1', '2', '3','4','5','6','7','8','9','10'])

In [None]:
# The index labels are strings ('1'..'10'), so use .iloc for positional access
for i in range(0, 10):
    if X['Home Owner'].iloc[i] == 'Yes':
        col_Home.iloc[i] = 1
    else:
        col_Home.iloc[i] = 0
print(col_Home)

In [None]:
# Same positional access as above: Married -> 3, Single -> 2, Divorced -> 1
for i in range(0, 10):
    if X['Marital Status'].iloc[i] == 'Married':
        col_Married.iloc[i] = 3
    elif X['Marital Status'].iloc[i] == 'Single':
        col_Married.iloc[i] = 2
    else:
        col_Married.iloc[i] = 1
print(col_Married)

We can replace the `Home Owner` and `Marital Status` columns with their corresponding numerical conversion. 

In [None]:
X_enc = pd.concat((X.drop(['Home Owner', 'Marital Status'], axis = 1), 
                   col_Home.rename('Home Owner'),
                   col_Married.rename('Marital Status')), axis = 1)
print(X_enc)
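The loops above spell out the conversion step by step. As a more compact alternative, the same mapping can be sketched with pandas' `Series.map` and a dictionary. The names `features`, `home_codes`, `marital_codes`, and `encoded` are hypothetical, and the four rows are a shortened illustration rather than the full table.

```python
import pandas as pd

# Shortened illustration of the loan features (hypothetical variable names).
features = pd.DataFrame(
    {
        "Home Owner": ["Yes", "No", "No", "Yes"],
        "Marital Status": ["Single", "Married", "Divorced", "Married"],
        "Annual Income": [125, 100, 95, 120],
    },
    index=["1", "2", "3", "4"],
)

# Same codes as in the text above.
home_codes = {"Yes": 1, "No": 0}
marital_codes = {"Married": 3, "Single": 2, "Divorced": 1}

encoded = features.copy()
encoded["Home Owner"] = encoded["Home Owner"].map(home_codes)
encoded["Marital Status"] = encoded["Marital Status"].map(marital_codes)
print(encoded)
```

Any value missing from the dictionary becomes `NaN` under `map`, which makes unexpected categories easy to spot before fitting.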

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_enc, y, test_size=0.2)  
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)
print(y_pred)
print(y_test)
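Comparing the two printouts by eye works for two test rows, but it does not scale. One way to summarize the comparison is scikit-learn's `accuracy_score`, sketched here on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical and are not results from the model above).

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only, not output of the classifier above.
y_true_demo = ["No", "Yes", "No", "No"]
y_pred_demo = ["No", "No", "No", "No"]

# Fraction of predictions that match the true labels: 3 of 4 here.
acc = accuracy_score(y_true_demo, y_pred_demo)
print(acc)  # 0.75
```

In the notebook, the same call would take `y_test` and `y_pred` as its two arguments.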

To visualize the fitted tree we need graphviz. To install it, open the Anaconda Prompt from the Start Menu, then run

```
$ conda install -c anaconda python-graphviz
```

In [None]:
import graphviz 
# class_names must follow the sorted label order stored in clf.classes_
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=list(X_enc.columns),
                                class_names=list(clf.classes_),
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data) 
graph
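`export_graphviz` expects `class_names` in ascending order of the class labels, which is the order the fitted classifier keeps in its `classes_` attribute. A small sketch of how to check that order, using a made-up four-row data set (`demo_X`, `demo_y`, and `demo_clf` are hypothetical names):

```python
from sklearn import tree

# Made-up training data: one numeric feature, two text labels.
demo_X = [[125], [95], [60], [85]]
demo_y = ["No", "Yes", "No", "Yes"]

demo_clf = tree.DecisionTreeClassifier().fit(demo_X, demo_y)

# Labels are stored sorted, regardless of their order in demo_y.
print(demo_clf.classes_)  # ['No' 'Yes']
```

Passing the names in any other order would attach the wrong label to each leaf in the drawing.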

The example above is too small to show automated decision tree construction at work: with only ten observations, the tree has very little to learn from. Let's look at the Iris data set again.

In [None]:
df = pd.read_csv('data/iris.csv', header=None)
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
print(df.head())
print(df.shape)

In [None]:
X = df.drop('class', axis=1)  
y = df['class']

print(X.head())
print(y.head())

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)
print(y_pred)
print(y_test)

In [None]:
import graphviz 
# class_names must follow the sorted label order stored in clf.classes_
iris_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=list(X.columns),
                                class_names=list(clf.classes_),
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(iris_data) 
graph

Reading the tree from the root (the exact thresholds vary with the random split):

- If petal length is at most 2.45, it is Iris-setosa.
- If petal length is greater than 2.45, petal width is at most 1.65, and petal length is at most 4.95, it is Iris-versicolor.
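Rules like the ones above can also be printed as plain text with `tree.export_text`, which avoids the graphviz dependency. The sketch below uses the copy of Iris bundled with scikit-learn (`load_iris`) so that it runs without `data/iris.csv`. That the bundled copy matches the course file is an assumption, and `clf_demo` and `rules` are hypothetical names.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Bundled Iris data; assumed to match the course's CSV file.
iris = load_iris()

# random_state=0 is an arbitrary seed so the printed tree is repeatable.
clf_demo = tree.DecisionTreeClassifier(random_state=0)
clf_demo.fit(iris.data, iris.target)

# One line per split or leaf, indented by depth.
rules = tree.export_text(clf_demo, feature_names=list(iris.feature_names))
print(rules)
```

Here the leaves show numeric class codes, because `load_iris` stores the targets as 0, 1, and 2. The species names are available in `iris.target_names`.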

### Question:

Import the Titanic data set (`data/titatic.csv`), then build and visualize two decision trees, one using the `gini` criterion and one using `entropy`, to classify the survival label (0 or 1). Decide how to convert any non-numeric columns to numbers. Write a short narrative interpreting each decision tree.
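The split criterion is selected through the `criterion` argument of `DecisionTreeClassifier`. As a hint rather than a solution, here is a minimal sketch on the Iris data bundled with scikit-learn (`load_iris`), not on the Titanic file. The names `trees` and `model` are hypothetical, and `random_state=0` is an arbitrary seed.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# Fit one tree per criterion and keep both for later inspection.
trees = {}
for criterion in ("gini", "entropy"):
    model = tree.DecisionTreeClassifier(criterion=criterion, random_state=0)
    model.fit(iris.data, iris.target)
    trees[criterion] = model
    # Depth and leaf count give a quick feel for how the two trees differ.
    print(criterion, model.get_depth(), model.get_n_leaves())
```

Each fitted tree can then be passed to `tree.export_graphviz` in the same way as in the earlier cells.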