## Decision Tree for Classification

In this section we will predict whether a bank note is authentic or fake based on four attributes of an image of the note. The attributes are the variance of the wavelet-transformed image, the skewness and kurtosis (spelled `Curtosis` in the dataset) of the image, and its entropy.

In [33]:
### Importing the libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

__Since our file is in CSV format, we will use pandas' read_csv function to read it. Execute the following script to do so:__

In [36]:
#Reading the data using pandas

bill_data = pd.read_csv('Desktop/bill_authentication.csv')
bill_data.head()

|   | Variance | Skewness | Curtosis | Entropy | Class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


__Execute the command to see the number of rows and columns in our dataset:__

In [37]:
bill_data.shape

(1372, 5)

In [39]:
### gives the Numerical description of the dataset
bill_data.describe()

|   | Variance | Skewness | Curtosis | Entropy | Class |
|---|---|---|---|---|---|
| count | 1372.0 | 1372.0 | 1372.0 | 1372.0 | 1372.0 |
| mean | 0.433735 | 1.922353 | 1.397627 | -1.191657 | 0.444606 |
| std | 2.842763 | 5.869047 | 4.31003 | 2.101013 | 0.497103 |
| min | -7.0421 | -13.7731 | -5.2861 | -8.5482 | 0.0 |
| 25% | -1.773 | -1.7082 | -1.574975 | -2.41345 | 0.0 |
| 50% | 0.49618 | 2.31965 | 0.61663 | -0.58665 | 0.0 |
| 75% | 2.821475 | 6.814625 | 3.17925 | 0.39481 | 1.0 |
| max | 6.8248 | 12.9516 | 17.9274 | 2.4495 | 1.0 |


In [81]:
#### check for any null values

bill_data.isnull().sum()

Variance    0
Skewness    0
Curtosis    0
Entropy     0
Class       0
dtype: int64

### Preparing the Data

__In this section we will divide our data into attributes and labels and will then divide the resultant data into both training and test sets. By doing this we can train our algorithm on one set of data and then test it out on a completely different set of data that the algorithm hasn't seen yet. This provides you with a more accurate view of how your trained algorithm will actually perform.__

To divide the data into attributes and labels, we will use the following code:

In [82]:
X = bill_data.drop(['Class'],axis = 1)
y = bill_data['Class']

__The final preprocessing step is to divide our data into training and test sets. The model_selection module of Scikit-Learn contains the train_test_split function, which we'll use to randomly split the data into training and testing sets. Execute the following code to do so:__

In [83]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3)

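In the code above, the test_size parameter specifies the fraction of the data held out for testing: 0.3 puts 30% of the rows in the test set and leaves 70% for training. No random_state is passed in this call, so the split is different on every run and the results below will vary slightly.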
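
Because train_test_split shuffles the rows before splitting, the exact rows in each set change from run to run unless a seed is given. Here is a minimal sketch on synthetic data, assuming you want a repeatable 70/30 split. The seed value 3, the `stratify` argument, and the `*_demo` names are illustrative choices, not taken from the run above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank note data: 100 rows, 4 features, binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the 60/40 class ratio in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3, stratify=y_demo
)

print(X_tr.shape, X_te.shape)   # (70, 4) (30, 4)
print(y_te.mean())              # share of class 1 in the test set
```

With a fixed seed, rerunning the cell gives the same rows in each set, which makes later results comparable between runs.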

### Training and Making Predictions

__Once the data has been divided into training and testing sets, the final step is to train the decision tree algorithm on the training data and make predictions. Scikit-Learn's tree module contains built-in classes for various decision tree algorithms. Since we are performing a classification task here, we will use the DecisionTreeClassifier class. Its fit method trains the algorithm on the training data, which is passed to the method as arguments. Execute the following script to train the algorithm:__

In [84]:
from sklearn.tree import DecisionTreeClassifier  
classifier = DecisionTreeClassifier(criterion = 'entropy')  
classifier.fit(X_train, y_train)  
 

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

__Now that the classifier has been trained, let's make predictions on the test data. To make predictions, the predict method of the DecisionTreeClassifier class is used.__

In [85]:
#### prediction using criteria -"ENTROPY"


y_pred = classifier.predict(X_test)

###  Evaluating the Algorithm
 

__Now that the algorithm has made some predictions, we'll see how accurate they are. For classification tasks, some commonly used metrics are the confusion matrix, precision, recall, and F1 score. Scikit-Learn's metrics module contains the classification_report and confusion_matrix functions, which can be used to calculate these metrics:__

In [86]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[225   5]
 [  3 179]]
             precision    recall  f1-score   support

          0       0.99      0.98      0.98       230
          1       0.97      0.98      0.98       182

avg / total       0.98      0.98      0.98       412
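
To see where the numbers in the report come from, the same metrics can be recomputed directly from the confusion matrix. This is a small sketch using only NumPy. It relies on Scikit-Learn's convention that rows are true classes and columns are predicted classes, and the variable names are illustrative:

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predicted classes.
cm = np.array([[225, 5],
               [3, 179]])

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)     # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)        # per class: correct / actually that class
f1 = 2 * precision * recall / (precision + recall)

print(round(float(accuracy), 4))
print(precision.round(2), recall.round(2), f1.round(2))
```

The rounded per-class values agree with the precision, recall, and f1-score columns of the report.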



In [87]:
from sklearn.tree import DecisionTreeClassifier  

##### Prediction using Gini criteria


classifier = DecisionTreeClassifier(criterion = 'gini')  
classifier.fit(X_train, y_train)  
y_pred_gini = classifier.predict(X_test)
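
The criterion argument selects the impurity measure the tree tries to reduce at each split. As a rough illustration (not Scikit-Learn's internal code), the two measures can be computed by hand for a single node. The helper names `entropy` and `gini` below are hypothetical:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

balanced = np.array([0, 0, 1, 1])   # maximally mixed two-class node
pure = np.array([1, 1, 1, 1])       # node containing a single class

print(entropy(balanced), gini(balanced))   # 1.0 0.5
print(entropy(pure), gini(pure))           # both are zero for a pure node
```

Both measures are largest for an evenly mixed node and zero for a pure one, so either can guide the choice of splits.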

In [88]:
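from sklearn.metrics import classification_report, confusion_matrix  
# Evaluate the gini tree's predictions (y_pred_gini); passing y_pred would repeat the entropy results
print(confusion_matrix(y_test, y_pred_gini))  
print(classification_report(y_test, y_pred_gini))

[[225   5]
 [  3 179]]
             precision    recall  f1-score   support

          0       0.99      0.98      0.98       230
          1       0.97      0.98      0.98       182

avg / total       0.98      0.98      0.98       412



__The numbers shown match the entropy run exactly because they were recorded with y_pred, the entropy tree's predictions; rerun the cell to see the gini tree's own scores. From the confusion matrix, we can see that out of 412 test instances, the classifier misclassified only 8 (5 + 3), an accuracy of about 98.06%.__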

##  Decision Tree for Regression

__We can also use a decision tree for regression. The process of solving a regression problem with a decision tree in Scikit-Learn is very similar to that of classification. However, for regression we use the DecisionTreeRegressor class of the tree module. The evaluation metrics for regression also differ from those used for classification. The rest of the process is almost the same.__

In [94]:
petrol_consume = pd.read_csv('Desktop/petrol_consumption.csv')
petrol_consume.head()

|   | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| 0 | 9.0 | 3571 | 1976 | 0.525 | 541 |
| 1 | 9.0 | 4092 | 1250 | 0.572 | 524 |
| 2 | 9.0 | 3865 | 1586 | 0.58 | 561 |
| 3 | 7.5 | 4870 | 2351 | 0.529 | 414 |
| 4 | 8.0 | 4399 | 431 | 0.544 | 410 |


In [95]:
petrol_consume.shape

(48, 5)

In [96]:
petrol_consume.describe()

|   | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| count | 48.0 | 48.0 | 48.0 | 48.0 | 48.0 |
| mean | 7.668333 | 4241.833333 | 5565.416667 | 0.570333 | 576.770833 |
| std | 0.95077 | 573.623768 | 3491.507166 | 0.05547 | 111.885816 |
| min | 5.0 | 3063.0 | 431.0 | 0.451 | 344.0 |
| 25% | 7.0 | 3739.0 | 3110.25 | 0.52975 | 509.5 |
| 50% | 7.5 | 4298.0 | 4735.5 | 0.5645 | 568.5 |
| 75% | 8.125 | 4578.75 | 7156.0 | 0.59525 | 632.75 |
| max | 10.0 | 5342.0 | 17782.0 | 0.724 | 968.0 |


In [98]:
### preparing the data 

X = petrol_consume.drop('Petrol_Consumption', axis=1)  
y = petrol_consume['Petrol_Consumption']  

In [99]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  

In [100]:
### importing the "DecisionTreeRegressor"


from sklearn.tree import DecisionTreeRegressor  
regressor = DecisionTreeRegressor()  
regressor.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
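
With the default squared-error criterion, a regression tree predicts the mean target value of the training samples that fall into the same leaf. Here is a minimal sketch on made-up data. The arrays `X_toy` and `y_toy` and the name `stump` are illustrative, not part of the petrol dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: two well-separated groups of targets.
X_toy = np.array([[1], [2], [3], [10], [11], [12]])
y_toy = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

# max_depth=1 allows a single split, so the tree has exactly two leaves.
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

# Each prediction is the mean target of the leaf the sample lands in.
print(stump.predict(np.array([[2], [11]])))   # [ 6. 21.]
```

This also explains why the predictions of a fully grown tree, such as the one trained above, are always values (or averages of values) seen in the training targets.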

In [101]:
y_pred = regressor.predict(X_test) 

In [102]:
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})  
df  

|   | Actual | Predicted |
|---|---|---|
| 29 | 534 | 541.0 |
| 4 | 410 | 414.0 |
| 26 | 577 | 554.0 |
| 30 | 571 | 554.0 |
| 32 | 577 | 554.0 |
| 37 | 704 | 574.0 |
| 34 | 487 | 648.0 |
| 40 | 587 | 649.0 |
| 7 | 467 | 414.0 |
| 10 | 580 | 464.0 |


In [103]:
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  

('Mean Absolute Error:', 59.6)
('Mean Squared Error:', 6434.2)
('Root Mean Squared Error:', 80.21346520379231)
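
The three error metrics can also be reproduced by hand from the Actual/Predicted table, which makes their definitions explicit. This is a short sketch using only NumPy, with the values copied from the table above and illustrative variable names:

```python
import numpy as np

# Actual and predicted values copied from the comparison table above.
actual = np.array([534, 410, 577, 571, 577, 704, 487, 587, 467, 580])
predicted = np.array([541, 414, 554, 554, 554, 574, 648, 649, 414, 464])

errors = actual - predicted
mae = np.abs(errors).mean()      # mean absolute error
mse = (errors ** 2).mean()       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error

print(mae, mse, rmse)
# RMSE relative to the mean petrol consumption (576.770833) from describe()
print(rmse / 576.770833)
```

For scale, the RMSE of about 80.2 is roughly 14% of the mean petrol consumption of 576.77 reported by describe() above.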
