# Big Data Processes - Exercise No. 9
# <font color= green>Decision Trees <font color= black> & </font>Random Forest <font color= black> & </font> Boosting and Bagging</font>


# <font color = green>Data preparation
    
<br>    

## 1. <font color= orange>Importing various libraries</font>

<font size = 3> First, we import some libraries. We rely on the usual pandas and numpy packages. To create the models we use Sklearn, as usual, and Matplotlib and Seaborn are imported to display our results.

<br>There are two ways to import a library. You can import the whole package with **import library_name**, or, if you know exactly which part (class or function) of the package you need, you can import just that with **from library_name import class_name**.

In [None]:
# Libraries to work with the data object
import pandas as pd 
import numpy as np

# libraries to visualize
import seaborn as sns
import matplotlib.pyplot as plt

# sklearn, one of the most widely used machine learning packages
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

<br>


## 2. <font color= orange>Load and process the dataset</font>

<font size = 3>Again, we are using the **Auto** dataset that we also used for Linear and Logistic Regression. After loading and reading, we do some minor data cleaning and add the binary variable **mpg_binary** to be used as a label for classification. 

In [None]:
#read data
path = 'Auto.csv'
df = pd.read_csv(path, sep=';',encoding='utf-8', header = 0)

#replace separator and change data type
df['mpg'] = df['mpg'].str.replace(',','.').astype('float')
df['acceleration'] = df['acceleration'].str.replace(',','.').astype('float')
df['displacement'] = df['displacement'].str.replace(',','.').astype('float')

#create binary variable 
df['mpg_binary'] = (df['mpg'] >= df['mpg'].median()).astype(int)
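
To see what the median threshold does, here is a minimal sketch on a small made-up dataframe (the values and the name `df_demo` are assumptions for illustration, not rows from Auto.csv). Splitting at the median gives two classes of roughly equal size.

```python
import pandas as pd

# Hypothetical toy values, not taken from the Auto dataset
df_demo = pd.DataFrame({'mpg': [18.0, 15.0, 36.0, 27.0, 24.0]})

# The median of these five values is 24.0
threshold = df_demo['mpg'].median()

# 1 if the car is at or above the median, 0 otherwise
df_demo['mpg_binary'] = (df_demo['mpg'] >= threshold).astype(int)

# Check how many rows fall into each class
print(df_demo['mpg_binary'].value_counts())
```

Running `value_counts()` on the real **mpg_binary** column is a quick way to confirm that the classes are balanced.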

### <font color= #b300b3>Take a closer look at the data</font>

<font size = 3>Examine the dataframe we have just prepared using the **.info()** and **.head()** methods, just as before. 

In [None]:
df.info()

In [None]:
df.head()

<font size = 3> The target variable **mpg_binary** takes only the values 0 and 1, so it has 2 classes. Displaying the last rows with **.tail()** lets us check that the new column was added as expected.

In [None]:
df.tail(5)

<font size = 3>Look a bit more in depth at your variables... 

In [None]:
df.describe()

In [None]:
X = df[['horsepower','year','cylinders','acceleration']].values
y = df['mpg_binary'].values

###  <font color=  #b300b3> Create the holdout dataset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html 

<font size = 3>An important issue arises when the rows of a dataset are ordered, for example by label. If we simply cut such data into train and test sets, nothing guarantees that both sets include all labels and all types of feature vectors. **The train_test_split() function reduces this risk by randomly shuffling the data before splitting (shuffle=True is the default).** Shuffling alone does not guarantee identical class proportions in both sets; pass **stratify=y** if you need that.
    
  

<font size = 3>We can verify the shuffling by inspecting one of the new vectors.

In [None]:
y_train
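
The effect of shuffling and stratifying can be sketched on a small ordered toy dataset (the arrays `X_demo` and `y_demo` are made up for illustration). The sketch compares a split without shuffling to a stratified split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples whose labels are sorted (ten 0s, then ten 1s)
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.array([0] * 10 + [1] * 10)

# shuffle=False keeps the original order: the test set is simply the last rows
_, _, y_tr_ordered, y_te_ordered = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=False)

# stratify=y shuffles and keeps the class proportions equal in both sets
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

print("test labels without shuffling:", y_te_ordered)
print("stratified train class counts:", np.bincount(y_tr_strat))
print("stratified test class counts: ", np.bincount(y_te_strat))
```

Without shuffling, the test set contains only label 1, which is exactly the problem described in the text.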

<br>


# <font color = green>Decision Tree
   
    
<br>    

## 1. <font color= orange>Create Decision Tree classification </font> by Sklearn 


<font size = 3>The default Sklearn implementation uses Gini impurity to choose the splits. You can switch to information gain as the splitting criterion by changing the **criterion** parameter from **gini** to **entropy**. Read the documentation for more insight. 
    
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

![image.png](https://miro.medium.com/max/1275/0*0dN6d8THyImxwPeD.png)
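
To make the two criteria concrete, here is a minimal sketch that computes Gini impurity and entropy for the labels in a single node. The helper functions `gini` and `entropy` are written for this illustration and are not part of Sklearn; the entropy uses a base-2 logarithm.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Entropy of a node in bits: sum of p * log2(1 / p) over the classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

# Hypothetical nodes: perfectly mixed, mostly one class, and pure
mixed = np.array([0, 0, 0, 0, 1, 1, 1, 1])
skewed = np.array([0, 0, 0, 1])
pure = np.array([1, 1, 1, 1])

for name, node in [("mixed", mixed), ("skewed", skewed), ("pure", pure)]:
    print(f"{name}: gini = {gini(node):.3f}, entropy = {entropy(node):.3f}")
```

Both measures are 0 for a pure node and largest for an evenly mixed node, so a good split is one that lowers them as much as possible in the child nodes.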

<font size = 3><font color = red>Creating a model, step 1.: </font> **Make an object** from the model's class

<font size = 3 > Here **max_depth** is the maximum depth of the tree, that is, the largest number of splits between the root node and any leaf. The default value is None, meaning that nodes are expanded until all leaves are pure or contain fewer than **min_samples_split** samples. We limit **max_depth** to reduce the risk of overfitting. 
    
Explore with different depths and check the documentation to understand the default parameters!

In [None]:
# first, create an object of the DecisionTreeClassifier() class
model_DT = DecisionTreeClassifier(max_depth = 5, criterion = 'entropy')

#https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

<font size = 3><font color = red>Creating a model, step 2.: </font>**Fit** your model object on your data, which is the feature vectors (X) and the target vector (y)

In [None]:
# then apply the .fit() method on the object
model_DT.fit(X_train,y_train)

#https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [None]:
from sklearn import tree

plt.figure(figsize=(12,8))
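# the model is already fitted, so we plot it directly instead of fitting it again
tree.plot_tree(model_DT, feature_names=['horsepower', 'year', 'cylinders', 'acceleration'], filled=True)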
plt.show()

<font size = 3><font color = red>Creating a model, step 3.: </font>**Make prediction** on the (new / unseen) **test** dataset

In [None]:
model_DT.predict(X_test)

<font size = 3>Let's measure our model's **accuracy**.

In [None]:
# measure the accuracy and print it
accuracy_train = round(model_DT.score(X_train,y_train),4)
accuracy_test = round(model_DT.score(X_test,y_test),4)

print("The model's accuracy for the train data is: \t", accuracy_train)
print("The model's accuracy for the test data is: \t", accuracy_test)

#https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
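
One way to explore different depths is to fit one tree per depth and compare train and test accuracy. The sketch below uses synthetic data from `make_classification` instead of the Auto dataset so that it runs on its own; the data and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 4 features (not the Auto dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

depth_scores = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow until the leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, criterion='entropy',
                                 random_state=0)
    clf.fit(Xd_train, yd_train)
    depth_scores[depth] = (clf.score(Xd_train, yd_train),
                           clf.score(Xd_test, yd_test))
    print(f"max_depth = {depth}: train = {depth_scores[depth][0]:.3f}, "
          f"test = {depth_scores[depth][1]:.3f}")
```

A train accuracy that keeps rising while the test accuracy stalls or drops is the typical sign of overfitting.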

### <font color= #b300b3>Confusion Matrix

<font size = 3>Print out the confusion matrix. 

In [None]:
prediction = model_DT.predict(X_test)
print("Confusion matrix for model_dt on test set: \n\n", confusion_matrix(y_test, prediction))
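
To read a confusion matrix, recall that in Sklearn the rows are the true classes and the columns are the predicted classes. The sketch below uses small made-up label vectors (an assumption for illustration) and derives accuracy, precision and recall from the four cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for 10 samples
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true_demo, y_pred_demo)

# For a binary problem, ravel() returns the cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted 1s that are truly 1
recall = tp / (tp + fn)           # share of true 1s that were found

print(cm)
print(f"accuracy = {accuracy:.3f}, precision = {precision:.3f}, recall = {recall:.3f}")
```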


# <font color = green> Random Forest

<font size = 3> A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. 
    
Behold! Documentation!
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html 

## 1. <font color= orange> Create Random forest classification </font> by Sklearn 

### <font color= #b300b3> Building the model


<font size = 3><font color = red>Creating a model, step 1.: </font> **Make an object** from the model's class.
    
Here **n_estimators** is the number of trees in the forest; the default value is 100. We use only 5 trees to keep the example small and fast, and we limit **max_depth** to reduce the risk of overfitting in each individual tree. 

In [None]:
model_RF = RandomForestClassifier(n_estimators = 5, max_depth = 5)

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

<font size = 3><font color = red>Creating a model, step 2.: </font>**Fit, or train**, your model object on your **training data**, which is the feature vectors (X_train) and the target vector (y_train)

In [None]:
model_RF.fit(X_train,y_train)

<font size = 3><font color = red>Creating a model, step 3.: </font>**Make prediction** on the (new / unseen) **test** dataset

In [None]:
model_RF.predict(X_test)

<font size = 3> Let's measure the accuracy score for both test and train data and print the confusion matrices. 

In [None]:
# measure the accuracy and print it
accuracy_train = round(model_RF.score(X_train,y_train),4)
accuracy_test = round(model_RF.score(X_test,y_test),4)

print("The model's accuracy for the train data is: \t", accuracy_train)
print("The model's accuracy for the test data is: \t", accuracy_test)

### <font color= #b300b3>Confusion Matrix

<font size = 3>Let's print out the confusion matrix for the prediction on the **train set**. What would you expect?

In [None]:
prediction = model_RF.predict(X_train)
print("Confusion matrix for model_RF on train set: \n\n",confusion_matrix(y_train, prediction))

<font size = 3>Let's print out the confusion matrix for the prediction on the **test set**. What would you expect?

In [None]:
prediction = model_RF.predict(X_test)
print("Confusion matrix for model_RF on test set: \n\n",confusion_matrix(y_test, prediction))
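
To get a feeling for the **n_estimators** parameter, you can fit forests of different sizes and compare their test accuracy. This is a sketch on synthetic data from `make_classification`, not on the Auto dataset; the data and the variable names are assumptions for illustration. It also prints **feature_importances_**, which shows how much each feature contributed to the splits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 4 features (not the Auto dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

forest_scores = {}
for n_trees in [1, 5, 100]:
    rf = RandomForestClassifier(n_estimators=n_trees, max_depth=5,
                                random_state=0)
    rf.fit(Xd_train, yd_train)
    forest_scores[n_trees] = rf.score(Xd_test, yd_test)
    print(f"n_estimators = {n_trees}: test accuracy = {forest_scores[n_trees]:.3f}")

# Importances of the last (100-tree) forest; they sum to 1
importances = rf.feature_importances_
print("feature importances:", importances.round(3))
```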

<br>


# <font color = green> Bagging and Boosting

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

##  1. <font color= orange>Create Bagging and Boosting classifiers <font color = black > by Sklearn

<font size = 3>Let's import two new classes

In [None]:
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

### <font color= #b300b3> Building the model

<font size = 3><font color = red>Creating a model, step 1.: </font> **Make an object** from the model's class.

Meaning of the parameters:

**max_samples** (Bagging only) is the number of samples to draw from X to train each base estimator; a float such as 0.5 is read as a fraction of the training samples 

**max_features** (Bagging only) is the number of features to draw from X to train each base estimator; a float such as 1.0 is read as a fraction of the features 

**n_estimators** is the number of base estimators (here decision trees) in the ensemble 

In [None]:
model_BG = BaggingClassifier(DecisionTreeClassifier(), max_samples = 0.5, max_features = 1.0, n_estimators = 2)
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html

model_BO = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators = 2)
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
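
The idea behind bagging can be sketched by hand: draw several bootstrap samples, fit one tree on each, and let the trees vote. The code below is a simplified illustration on synthetic data, not how BaggingClassifier is implemented internally (for example, Sklearn averages predicted probabilities when the base estimator provides them). All names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Auto dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)

rng = np.random.default_rng(0)
n_estimators = 5                           # odd number, so votes cannot tie
max_samples = 0.5                          # fraction of rows per bootstrap sample
n_draw = int(max_samples * len(X_demo))

trees = []
for _ in range(n_estimators):
    # Bootstrap: sample row indices with replacement
    idx = rng.integers(0, len(X_demo), size=n_draw)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# One row of votes per tree, one column per sample
all_votes = np.array([t.predict(X_demo) for t in trees])

# Majority vote over the trees for the binary labels 0 and 1
bagged_prediction = (all_votes.mean(axis=0) > 0.5).astype(int)

print("accuracy of the hand-made ensemble:", (bagged_prediction == y_demo).mean())
```

Boosting differs in that the estimators are trained one after another, with each new estimator giving more weight to the samples the previous ones misclassified.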

<font size = 3><font color = red>Creating a model, step 2.: </font>**Fit, or train**, your model object on your **training data**, which is the feature vectors (X_train) and the target vector (y_train)

In [None]:
model_BG.fit(X_train,y_train)

In [None]:
model_BO.fit(X_train,y_train)

<font size = 3><font color = red>Creating a model, step 3.: </font>**Make prediction** on the (new / unseen) **test** dataset

In [None]:
# measure the accuracy and print it
accuracy_BG = round(model_BG.score(X_test,y_test),4)
accuracy_BO = round(model_BO.score(X_test,y_test),4)

print("The Bagging model's accuracy on the test data is: \t", accuracy_BG)
print("The Boosting model's accuracy on the test data is: \t", accuracy_BO)

### <font color= #b300b3>Confusion Matrix

In [None]:
prediction = model_BG.predict(X_test)
print("Confusion matrix for model_BG on test set: \n\n", confusion_matrix(y_test, prediction))

In [None]:
prediction = model_BO.predict(X_test)
print("Confusion matrix for model_BO on test set: \n\n", confusion_matrix(y_test, prediction))

<br>


# <font color = green> Multiple Model Ensemble

##  1. <font color= orange>Trying out 3 different classifiers

<font size = 3>You already have the **DecisionTreeClassifier** as **model_DT**. Let's import the classes and fit two other classifiers on this data: **Logistic Regression** and **K Nearest Neighbor**. In addition, we import the **VotingClassifier** to combine the predictions of the three classifiers and the **StandardScaler** to standardize the features, which matters most for the distance-based **KNN** classifier. 

In [None]:
#import libraries
from sklearn.linear_model import LogisticRegression  
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import VotingClassifier 

In [None]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
# reuse the mean and standard deviation learned from the training data; do not refit on the test set
X_test = sc_X.transform(X_test)
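
The scaler should learn its mean and standard deviation from the training data only and then apply the same transformation to the test data. The sketch below shows the difference on tiny made-up arrays (the values and names are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature train and test sets
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[2.0], [4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns mean = 2.0 from the train data
X_test_scaled = scaler.transform(X_test_demo)        # reuses the train mean and std

# Refitting on the test set would use the test mean (3.0) instead
X_test_refit = StandardScaler().fit_transform(X_test_demo)

print("scaled with train statistics:", X_test_scaled.ravel())
print("scaled with test statistics: ", X_test_refit.ravel())
```

With the train statistics, the test value 2.0 maps to 0 because it equals the training mean. After refitting on the test set, the same value maps to -1, so the model would see it on a different scale than during training.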

### <font color= #b300b3> Building the models

<font size = 3><font color = red>Creating a model, step 1.: </font> **Make an object** from the model's class

In [None]:
model_LR = LogisticRegression(solver='liblinear')
model_KNN = KNeighborsClassifier()
model_DT = DecisionTreeClassifier()

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

<font size = 3><font color = red>Creating a model, step 2.: </font>**Fit, or train**, your model object on your **training data**, which is the feature vectors (X_train) and the target vector (y_train)

In [None]:
model_LR.fit(X_train,y_train)
model_KNN.fit(X_train,y_train)
model_DT.fit(X_train,y_train)

##  2. <font color= orange>Apply the VotingClassifier

<font size = 3><font color = red>Creating a model, step 1.: </font> **Make an object** from the model's class

In [None]:
VC = VotingClassifier(estimators= [("model_LR",model_LR),("model_KNN", model_KNN),("model_DT", model_DT)], voting = 'hard')
# voting = 'hard' -> vote is on the labels and not the probabilities

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
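
The difference between hard and soft voting can be sketched with plain numpy. The probabilities below are made up for illustration and do not come from the fitted models; the sketch shows the principle, not the internals of VotingClassifier.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1:
# one row per sample, one column per classifier
proba_class1 = np.array([[0.05, 0.55, 0.60],
                         [0.10, 0.45, 0.55],
                         [0.90, 0.80, 0.70]])

# Each classifier's own label prediction (threshold 0.5)
label_votes = (proba_class1 > 0.5).astype(int)

# Hard voting: the majority of the predicted labels wins
hard_vote = (label_votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities first, then threshold
soft_vote = (proba_class1.mean(axis=1) > 0.5).astype(int)

print("label votes:\n", label_votes)
print("hard voting:", hard_vote)
print("soft voting:", soft_vote)
```

For the first sample, two of three classifiers predict class 1, so hard voting returns 1, while the average probability is only 0.4, so soft voting returns 0.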

<font size = 3><font color = red>Creating a model, step 2.: </font>**Fit, or train**, your model object on your **training data**, which is the feature vectors (X_train) and the target vector (y_train)

In [None]:
VC.fit(X_train,y_train)

<font size = 3><font color = red>Creating a model, step 3.: </font>**Make prediction** on the (new / unseen) **test** dataset

In [None]:
VC.score(X_test, y_test)

### <font color= #b300b3> Understanding the results

<font size = 3> Using the **.transform()** method, we can get the prediction of each classifier. If we put these next to the final prediction of the voting process and the actual target labels, we get a nice overview of our classifiers.

In [None]:
np.concatenate([VC.transform(X_test), VC.predict(X_test).reshape(-1,1), y_test.reshape(-1,1)], axis=1)

<font size = 3>If numpy arrays are hard to read, let's create a pandas dataframe, which we are already used to.

In [None]:
results = pd.DataFrame(np.concatenate([VC.transform(X_test), VC.predict(X_test).reshape(-1,1), y_test.reshape(-1,1)], axis=1), columns = ['model_RL', 'model_KNN', 'model_DT', 'Prediction', 'Truth'])
results

<font size = 3>Having these data in one dataframe opens endless opportunities to explore them further... Just recall our processes from the data cleaning classes.
<br>
    
For example, we can create an extract for those lines in which the voted prediction differs from the actual target value.

In [None]:
results_diff = results.loc[results['Prediction'] != results['Truth']]
results_diff

<font size = 3>Using the same method, let's put the number of differences and matches for each classifier and the actual target label into a dictionary.

In [None]:
difference_dict = {'model_RL': results['model_RL'][results['model_RL'] != results['Truth']].count(),
                'model_KNN':results['model_KNN'][results['model_KNN'] != results['Truth']].count(),
                'model_DT': results['model_DT'][results['model_DT'] != results['Truth']].count()}

match_dict = {'model_RL': results['model_RL'][results['model_RL'] == results['Truth']].count(),
                'model_KNN':results['model_KNN'][results['model_KNN'] == results['Truth']].count(),
                'model_DT': results['model_DT'][results['model_DT'] == results['Truth']].count()}

print("Number of different predictions: \t", difference_dict)
print("Number of matching predictions: \t", match_dict)
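
A more compact way to compare the classifiers is to compute the share of matching predictions per column. The sketch below uses a small made-up dataframe with hypothetical column names (`model_A`, `model_B`) rather than the **results** dataframe from this notebook.

```python
import pandas as pd

# Hypothetical predictions of two classifiers and the true labels
results_demo = pd.DataFrame({'model_A': [0, 1, 1, 0],
                             'model_B': [0, 1, 0, 0],
                             'Truth':   [0, 1, 1, 1]})

# Compare every prediction column with 'Truth' row by row,
# then average the True/False values to get the accuracy per model
predictions = results_demo.drop(columns='Truth')
accuracy_per_model = predictions.eq(results_demo['Truth'], axis=0).mean()

print(accuracy_per_model)
```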

### <font color= black>What model would you take?

**E1.** Think about whether you can apply a DecisionTree, Bagging and Boosting on your own dataset. (Think hard)

## <font color = green>That's it! Have a nice weekend once you are done progressing in your groups. :)