# Decision Trees and Random Forest Exercise

**Import the usual libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

**Load the data**

In [3]:
df = pd.read_csv('kyphosis.csv')

In [4]:
df.head()

Unnamed: 0,Kyphosis,Age,Number,Start
0,absent,71,3,5
1,absent,158,3,14
2,present,128,4,5
3,absent,2,5,1
4,absent,1,4,15


**Split the data into training and test sets**

In [70]:
from sklearn.model_selection import train_test_split

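# Drop the label and the unnamed CSV index column, which is a row counter rather than a feature
X = df.drop(['Unnamed: 0', 'Kyphosis'], axis=1)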
y = df['Kyphosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
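
The split above is unseeded, so each run produces a different partition and different scores. A minimal sketch of a repeatable, class-balanced split follows; the `toy` frame and its values are hypothetical stand-ins for the kyphosis data, and `random_state=42` is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the kyphosis data: 20 rows, 25% "present"
toy = pd.DataFrame({
    "Kyphosis": ["present"] * 5 + ["absent"] * 15,
    "Age": range(1, 21),
    "Number": [3, 4] * 10,
    "Start": range(20, 0, -1),
})

X_toy = toy.drop("Kyphosis", axis=1)
y_toy = toy["Kyphosis"]

# random_state makes the split repeatable; stratify keeps the class ratio
# roughly equal in the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

With a minority class as small as the one in this dataset, stratifying helps ensure that the test set contains some "present" cases at all.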

**Import the Decision Tree Classifier and fit the training data**

In [104]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()

dtree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
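
To see what a fitted tree has learned, its split rules can be printed as text. The sketch below uses scikit-learn's `export_text` on a tiny hypothetical dataset (the `X_toy` and `y_toy` values are invented for illustration and are not the kyphosis data), where the label depends only on the second feature.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: columns are [Age, Start]; the label depends only on Start
X_toy = [[10, 2], [20, 3], [30, 12], [40, 14], [50, 4], [60, 15]]
y_toy = ["present", "present", "absent", "absent", "present", "absent"]

# A shallow tree keeps the printed rules short
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_toy, y_toy)

# Render the learned splits as indented if/else style rules
rules = export_text(tree, feature_names=["Age", "Start"])
print(rules)
```

The same call should work on `dtree` with `feature_names=list(X_train.columns)`.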

**Predict on the test set and generate a classification report**

In [105]:
predictions = dtree.predict(X_test)

from sklearn.metrics import classification_report

print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

     absent       0.80      0.89      0.84        18
    present       0.60      0.43      0.50         7

avg / total       0.74      0.76      0.75        25
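
The per-class figures in a report like this come directly from confusion-matrix counts. The sketch below recomputes the "present" row; the four counts are not printed by the notebook, they are inferred from the reported supports (18 and 7) and the reported precision and recall values, so treat them as an assumption.

```python
# Counts inferred from the decision tree report above, with "present" as the positive class
tp = 3    # true "present" predicted "present"
fn = 4    # true "present" predicted "absent"
fp = 2    # true "absent" predicted "present"
tn = 16   # true "absent" predicted "absent"

precision = tp / (tp + fp)                          # share of "present" predictions that were right
recall = tp / (tp + fn)                             # share of true "present" cases that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all predictions that were right

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.60 recall=0.43 f1=0.50 accuracy=0.76
```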



**Import the Random Forest Classifier and fit the training data**

In [106]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

**Predict on the test set and generate a classification report**

In [107]:
rfc_pred = rfc.predict(X_test)

print(classification_report(y_test,rfc_pred))

             precision    recall  f1-score   support

     absent       0.75      1.00      0.86        18
    present       1.00      0.14      0.25         7

avg / total       0.82      0.76      0.69        25
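
The `avg / total` row is a support-weighted average of the per-class rows, so the majority class dominates it. The sketch below recomputes that row from the rounded per-class values in the random forest report and contrasts it with an unweighted (macro) average; the variable names are illustrative.

```python
# Per-class rows copied from the random forest report: (precision, recall, f1, support)
rows = {
    "absent": (0.75, 1.00, 0.86, 18),
    "present": (1.00, 0.14, 0.25, 7),
}

total_support = sum(r[3] for r in rows.values())

# Weighted average: each class counts in proportion to its support
weighted = [
    sum(r[i] * r[3] for r in rows.values()) / total_support for i in range(3)
]

# Macro average: each class counts equally, regardless of support
macro = [sum(r[i] for r in rows.values()) / len(rows) for i in range(3)]

print("weighted precision/recall/f1:", [round(v, 2) for v in weighted])
print("macro recall:", round(macro[1], 2))
```

The weighted recall of 0.76 hides the fact that only 14% of the "present" cases were found; the macro recall of 0.57 makes that weakness more visible.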



#### Scikit-Optimize, or skopt, is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimization. skopt is reusable in many contexts and accessible.

**Install Scikit-Optimize**

In [108]:
!pip install scikit-optimize

In [111]:
from skopt import BayesSearchCV

**Using BayesSearchCV, a drop-in replacement for scikit-learn's GridSearchCV, we can now optimize our RandomForestClassifier by tuning the 'n_estimators' and 'max_features' parameters**

In [112]:
opt = BayesSearchCV(
    RandomForestClassifier(),
    {
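        'n_estimators': (1, 100),  # integer range, both bounds inclusive
        # 'auto' is only accepted by scikit-learn < 1.3, where it is an alias for 'sqrt'
        'max_features': ["auto", "sqrt", "log2", None]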
    },
    n_iter=32
)

In [117]:
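# Fit the search on the training split so the test set stays unseen during tuning
opt.fit(X_train, y_train)

**We can check which parameters gave the best cross-validated score**

In [114]:
opt.best_params_

{'max_features': 'auto', 'n_estimators': 100}

**Once we predict on the test set and print the classification report, we can compare the tuned Random Forest Classifier with the untuned one**

The report recorded below came from a run in which the search was fitted on X_test itself, so its perfect scores reflect a model evaluated on the data it was tuned on, not a genuine gain in accuracy.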

In [115]:
o = opt.predict(X_test)

In [116]:
print(classification_report(y_test, o))

             precision    recall  f1-score   support

     absent       1.00      1.00      1.00        18
    present       1.00      1.00      1.00         7

avg / total       1.00      1.00      1.00        25
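
A perfect report can arise simply from evaluating a model on the data it was fitted on. The sketch below illustrates this on synthetic data whose labels are pure noise (the arrays, sizes, and seeds are all hypothetical): a forest fitted on the test split scores almost perfectly on that split, while one fitted on the training split scores near chance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: features and labels are independent, so nothing can be learned
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(250, 3))
y_noise = rng.integers(0, 2, size=250)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.30, random_state=0
)

# Leaky: fitted on the same rows it is later scored on
leaky = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_te, y_te)
# Honest: fitted on the training rows only
honest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

leaky_acc = accuracy_score(y_te, leaky.predict(X_te))
honest_acc = accuracy_score(y_te, honest.predict(X_te))

print(f"fitted on test rows:  {leaky_acc:.2f}")
print(f"fitted on train rows: {honest_acc:.2f}")
```

Only the second number estimates how the model would behave on unseen patients.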

