## ICP04 Decision Trees

### 1. Importing Libraries

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from matplotlib import pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.svm import SVC
np.random.seed(1)

In [2]:
import os
print(os.getcwd())

C:\Users\Kavya Reddy Basupall


### 2. Loading data

In [3]:
X_train = pd.read_csv("C:/Users/Kavya Reddy Basupall/Downloads/airbnb_train_X_price_gte_150.csv") 
y_train = pd.read_csv("C:/Users/Kavya Reddy Basupall/Downloads/airbnb_train_y_price_gte_150.csv") 
X_test = pd.read_csv("C:/Users/Kavya Reddy Basupall/Downloads/airbnb_test_X_price_gte_150.csv") 
y_test = pd.read_csv("C:/Users/Kavya Reddy Basupall/Downloads/airbnb_test_y_price_gte_150.csv") 

### 3. Fitting SVM Model using poly kernel for the data 

In [4]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

In [5]:
svm_poly_model = SVC(kernel="poly", degree=3, coef0=1, C=10)
_ = svm_poly_model.fit(X_train, np.ravel(y_train))

In [6]:
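# NOTE: predictions are made on the training set, so the metrics below are training metrics
model_preds = svm_poly_model.predict(X_train)
c_matrix = confusion_matrix(y_train, model_preds)
# confusion_matrix rows are true labels and columns are predicted labels
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]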
performance = pd.concat([performance, pd.DataFrame({'model':"poly svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [7]:
performance

|   | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | poly svm | 0.927653 | 0.910769 | 0.948718 | 0.929356 |
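
The four measures in the table can be cross-checked against scikit-learn's own metric functions. The following is a minimal sketch under stated assumptions: `metrics_from_confusion` is a hypothetical helper name, and `y_true` / `y_pred` are toy labels made up for illustration, not the Airbnb data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def metrics_from_confusion(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from a binary confusion matrix."""
    # For labels [0, 1], ravel() returns the cells in the order TN, FP, FN, TP
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical toy labels, used only to illustrate the calculation
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

manual = metrics_from_confusion(y_true, y_pred)

# The same measures computed by scikit-learn, for comparison
library = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
    "F1": f1_score(y_true, y_pred),
}
print(manual)
print(library)
```

If the manual and library values agree, the indexing of TP, TN, FP and FN in the confusion matrix is consistent with scikit-learn's convention.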


### NOTE: In the next section, GridSearchCV uses k-fold cross-validation
K-fold cross-validation estimates how well each parameter combination generalizes to unseen data, which reduces the risk of choosing a model that overfits a single train/validation split.

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimate.
![cross validation diagram](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
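
The partitioning described above can be sketched on a tiny example. This is an illustration only: the 14 samples and their labels are made up. It uses `StratifiedKFold` because, when `cv` is an integer and the estimator is a classifier with binary or multiclass targets, `GridSearchCV` uses stratified folds, which keep the class proportions similar in every fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 14 hypothetical samples with 2 features each, and balanced binary labels
X = np.arange(28).reshape(14, 2)
y = np.array([0, 1] * 7)

skf = StratifiedKFold(n_splits=7)
folds = list(skf.split(X, y))

for i, (train_idx, val_idx) in enumerate(folds):
    # Each sample index appears in exactly one validation fold
    print(f"fold {i}: train={train_idx.tolist()} validate={val_idx.tolist()}")

# Collect every validation index to confirm full coverage of the data
all_val = sorted(idx for _, val_idx in folds for idx in val_idx.tolist())
```

With 7 folds, each model is trained on 6/7 of the data and validated on the remaining 1/7, and the 7 validation scores are averaged.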



### 4. Fitting a Decision Tree model

In [None]:
score_measure = "precision"
kfolds = 7

param_grid = {
    'min_samples_split': np.arange(39,48),  
    'min_samples_leaf': np.arange(6,12),
    'min_impurity_decrease': np.arange(0.0048, 0.0054, 0.0001),
    'max_leaf_nodes': np.arange(162,168), 
    'max_depth': np.arange(15,21), 
    'criterion': ['entropy'],
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, np.ravel(y_train))  # flatten y to 1-D, as in the SVM cell

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestPrecisionTree = grid_search.best_estimator_  # the search optimizes precision, not recall

Fitting 7 folds for each of 13608 candidates, totalling 95256 fits
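
An exhaustive search over 13,608 candidates is expensive. The notebook imports `RandomizedSearchCV` but does not use it; the sketch below shows how a random sample of the same grid could be searched instead. It is an illustration under assumptions: the data comes from `make_classification` rather than the Airbnb files, and `n_iter=20` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Airbnb data set)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# Same parameter ranges as the grid search cell above
param_distributions = {
    "min_samples_split": np.arange(39, 48),
    "min_samples_leaf": np.arange(6, 12),
    "min_impurity_decrease": np.arange(0.0048, 0.0054, 0.0001),
    "max_leaf_nodes": np.arange(162, 168),
    "max_depth": np.arange(15, 21),
    "criterion": ["entropy"],
}

rand_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=20,            # evaluate 20 random candidates instead of all of them
    cv=7,
    scoring="precision",
    random_state=1,
)
rand_search.fit(X_demo, y_demo)

print(f"Best precision: {rand_search.best_score_}")
print(f"Best parameters: {rand_search.best_params_}")
```

This runs 20 × 7 = 140 fits instead of 95,256. The trade-off is that the best combination in the full grid may not be among the sampled candidates.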


In [None]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")

### Importance of features

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(125,50))
tree = plot_tree(grid_search.best_estimator_, 
              #feature_names=train_inputs.columns.values, # our feature names are stripped from the data set
              class_names=['lt 150', 'gte 150'], 
              filled=True, 
              rounded=True, 
              fontsize=14)
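
Besides plotting the tree, a fitted `DecisionTreeClassifier` exposes `feature_importances_`, which gives a numeric ranking of the features. The sketch below is a minimal illustration on synthetic data; the `feature_i` names are hypothetical placeholders. In this notebook the equivalent would presumably be `grid_search.best_estimator_.feature_importances_`, indexed by whatever column labels `X_train` carries.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Airbnb data set)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=1
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

tree_demo = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=1)
tree_demo.fit(X_demo, y_demo)

# Importances are normalized impurity reductions; they sum to 1 across features
importances = pd.Series(tree_demo.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Features that never appear in a split receive an importance of 0.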

### CONCLUSION:
Both models are summarized by accuracy, precision, recall and F1 score.

The SVM model with a polynomial kernel gives (computed on the training set):
Accuracy = 0.927653; Precision = 0.910769; Recall = 0.948718; F1 score = 0.929356

The decision tree tuned with grid search gives (computed on the test set):
Accuracy = 0.8462980; Precision = 0.8379374; Recall = 0.8568738; F1 score = 0.8472998

1. The decision tree scores lower than the polynomial-kernel SVM on every measure: accuracy, precision, recall and F1 score.
2. Tuning the decision tree is more time-consuming and resource-intensive: the grid search fits 13,608 candidates over 7 folds, 95,256 fits in total.

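On these numbers, the SVM model performs better than the decision tree.
Precision is the scoring measure being optimized, and the precision of the polynomial-kernel SVM (0.910769) is higher than that of the decision tree (0.8379374). Note that the SVM metrics were computed on the training set and the decision tree metrics on the test set, so the two are not strictly comparable; evaluating the SVM on the test set would make the comparison like-for-like.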
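
A like-for-like comparison scores both models on the same held-out data. The sketch below illustrates the pattern on synthetic data from `make_classification`; the tree's `max_depth=5` and the 75/25 split are arbitrary assumptions, and the resulting numbers say nothing about the Airbnb data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Airbnb data set)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

models = {
    "poly svm": SVC(kernel="poly", degree=3, coef0=1, C=10),
    "decision tree": DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=1),
}

rows = []
for name, model in models.items():
    # Fit on the training split, then score on the same held-out split for every model
    preds = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "Accuracy": accuracy_score(y_te, preds),
        "Precision": precision_score(y_te, preds),
        "Recall": recall_score(y_te, preds),
        "F1": f1_score(y_te, preds),
    })

comparison = pd.DataFrame(rows)
print(comparison)
```

In the notebook itself, the same effect could be had by calling `svm_poly_model.predict(X_test)` and building the confusion matrix against `y_test`.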
