Decision Trees in Python with Scikit-Learn
-------------------------------------------------------------

Introduction
-----------------
A decision tree is one of the most widely **used supervised machine learning algorithms, and it can perform both regression and classification tasks**. Hence the name <font color='green'><b>CART</b> - <u>C</u>lassification <u>A</u>nd <u>R</u>egression <u>T</u>rees.</font>

The decision tree algorithm builds a node for each attribute it splits on, and the attribute that best separates the data is placed at the root node. To make a prediction, we start at the root node and work our way down the tree, at each node following the branch whose condition (or "decision") our sample meets. This process continues until a leaf node is reached, which contains the prediction or the outcome of the decision tree.

![decison_tree_image](https://drive.google.com/uc?id=1nCTEZVfy_m6dMu_Qz7SQDZWNhGZZOFKz 'decison_tree_image')

#### VERY IMPORTANT READ : 
https://towardsdatascience.com/entropy-and-information-gain-in-decision-trees-c7db67a3a293
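
To make "the attribute that best separates the data" concrete, here is a minimal sketch of entropy and information gain, the two measures covered in the article linked above. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions, not part of scikit-learn. Note that scikit-learn's DecisionTreeClassifier uses Gini impurity by default; entropy can be selected with `criterion='entropy'`.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted_child_entropy = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted_child_entropy

# Toy labels (hypothetical): 4 authentic (0) and 4 fake (1) notes
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes perfectly vs. one that does not help at all
perfect_split = [[0, 0, 0, 0], [1, 1, 1, 1]]
useless_split = [[0, 0, 1, 1], [0, 0, 1, 1]]

gain_perfect = information_gain(labels, perfect_split)
gain_useless = information_gain(labels, useless_split)
print(gain_perfect, gain_useless)
```

The attribute whose split gives the larger gain is the better candidate for the root node.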

![decison_tree_example](https://drive.google.com/uc?id=1F_T2ICas2htr6b-GDBFRIo3DTbPnIzSI 'decison_tree_example')

>**Advantages of Decision Trees**
------------------------------

There are several advantages of using decision trees for predictive analysis:

1> Decision trees can be used to predict both continuous and discrete values i.e. they work well for both regression and classification tasks.

2> They require relatively little effort to train.

3> They can be used to classify non-linearly separable data.

4> They're fast at prediction time: classifying a sample takes a single walk from the root to a leaf, whereas KNN has to compare the sample against the stored training data.
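
Point 3 above can be illustrated with a small sketch. The one-feature toy data below is an assumption made up for this example: class 1 sits in the middle and class 0 on both sides, so no single threshold (a linear boundary in one dimension) separates the classes, yet a tree can combine two thresholds.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): class 1 in the middle, class 0 on both ends
X_band = [[0], [1], [2], [3], [4], [5]]
y_band = [0, 0, 1, 1, 0, 0]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_band, y_band)

# The tree needs two splits (roughly at 1.5 and 3.5) to isolate the middle band
predictions = tree.predict(X_band).tolist()
print(predictions)
print("tree depth:", tree.get_depth())
```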


# 1. Decision Tree for Classification
---------------------------------------------------------
<font color='green'><b>( We will be using DecisionTreeClassifier from sklearn.tree.</b> It is fast and simple, and it takes care of all the math. We will concentrate only on coding and solving a real-world problem. )</font><br><br>
<font color='red'>
Here, we will predict whether a <b>bank note is authentic or fake</b> based on four attributes of the image of the note. The <u>attributes</u> are the variance of the wavelet-transformed image, the skewness of the image, the kurtosis of the image (spelled Curtosis in the dataset), and the entropy of the image.</font>

**Note :** In the dataset the **class** variable can be **0 or 1**. **0 indicates authentic BankNote and 1 indicates fake BankNote.**

In [1]:
# Steps to upload any dataset into your Colab NB :
# step 1 : First Download the dataset to your local PC. 
#          The link for downloading our dataset for practicing is 
#          https://drive.google.com/open?id=19YvsKMdlIZ_bxgJOSg4waIkVTyJ-UYx_
# step 2 : Run the below code and select the (above downloaded) dataset. 
# from google.colab import files
# files.upload()

In [2]:
# doing the minimum necessary imports
# more modules would be imported as and when needed

import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

# reading data from CSV file. 
# reading bank currency note data into pandas dataframe.
bankdata = pd.read_csv("bill_authentication.csv")  

# Exploratory Data Analysis
print(bankdata.shape)  
print("------------")

#bankdata.head()

# shuffling the 100% of the data
print(bankdata.sample(random_state=100, frac=1).head(10)) 

## shuffle the original dataframe
# bankdata = bankdata.sample(random_state=100, frac=1)

(1372, 5)
------------
      Variance  Skewness  Curtosis  Entropy  Class
1058  -1.56210   -2.2121   4.25910  0.27972      1
714    2.55590    3.3605   2.03210  0.26809      0
1061  -2.31470    3.6668  -0.69690 -1.24740      1
399    2.96950    5.6222   0.27561 -1.15560      0
382    0.86202    2.6963   4.29080  0.54739      0
376    3.23030    7.8384  -3.53480 -1.21510      0
987   -0.55648    3.2136  -3.30850 -2.79650      1
416    4.34830   11.1079  -4.08570 -4.25390      0
945   -1.76970    3.4329  -1.21440 -2.37890      1
595    3.18360    7.2321  -1.07130 -2.59090      0


In [3]:
bankdata.info()  # this helps in finding any missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Variance  1372 non-null   float64
 1   Skewness  1372 non-null   float64
 2   Curtosis  1372 non-null   float64
 3   Entropy   1372 non-null   float64
 4   Class     1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


<b> Analysis : </b> There is no missing data. This data is clean.

In [4]:
# Data Preprocessing
# Data preprocessing involves 
# (1) Dividing the data into attributes and labels and 
# (2) dividing the data into training and testing sets.

# To divide the data into attributes and labels, do :
X = bankdata.drop('Class', axis=1)  
y = bankdata['Class']  

# the final preprocessing step is to divide data into training and test sets
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=100)
# default test_size parameter value is 0.25

# Training the Algorithm. Here we would use DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier  
classifier = DecisionTreeClassifier()  
classifier.fit(X_train, y_train)

# make predictions on the test data
y_pred = classifier.predict(X_test)

# Evaluating the Algorithm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score  
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

# Remember : for evaluating classification-based ML algo use  
# confusion_matrix, classification_report and accuracy_score.

# And for evaluating regression-based ML Algo use Mean Squared Error(MSE) 
# or RMSE (Root Mean Squared Error), ...

[[197   1]
 [  3 142]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.99       198
           1       0.99      0.98      0.99       145

    accuracy                           0.99       343
   macro avg       0.99      0.99      0.99       343
weighted avg       0.99      0.99      0.99       343

0.9883381924198251


<b><font color='green'>Analysis</font></b> : From the confusion matrix, you can see that out of 343 test instances, our algorithm misclassified only 4. This is approx 99% accuracy. 
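
The numbers in this analysis can be re-derived by hand from the confusion matrix. The sketch below copies the matrix from the output above and assumes scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
# Confusion matrix copied from the output above
# rows = actual class, columns = predicted class
cm = [[197, 1],
      [3, 142]]

total = sum(sum(row) for row in cm)     # all test instances
correct = cm[0][0] + cm[1][1]           # diagonal = correct predictions
misclassified = total - correct
accuracy = correct / total

# Precision and recall for class 0 (authentic notes)
precision_0 = cm[0][0] / (cm[0][0] + cm[1][0])   # 197 / 200
recall_0 = cm[0][0] / (cm[0][0] + cm[0][1])      # 197 / 198

print(total, misclassified)
print(accuracy)
print(precision_0, recall_0)
```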

# 2. Decision Tree for Regression
------------------------------------------------------
<font color='green'><b>( We will be using DecisionTreeRegressor from sklearn.tree.</b> It is fast and simple, and it takes care of all the math. We will concentrate only on coding and solving a real-world problem. )</font><br><br>
<font color='red'>
We will use the petrol_consumption.csv dataset and <b>try to predict gas consumption</b> (in millions of gallons) in 48 US states <u>based upon</u> gas tax (in cents), per capita income (dollars), paved highways (in miles) and the proportion of the population with a driver's license. </font>

**Note :** In the dataset **Petrol_Consumption** is the target variable. 

In [4]:
# Steps to upload any dataset into your Colab NB :
# step 1 : First Download the dataset to your local PC. 
#          The link for downloading our dataset for practicing is https://drive.google.com/open?id=1_YH4VnFwlZBd3MS45TYyU7dzchn-4skk
# step 2 : Run the below code and select the (above downloaded) dataset. 
# from google.colab import files
# files.upload()

In [6]:
# Importing Libraries
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

# Importing the Dataset
dataset = pd.read_csv('petrol_consumption.csv')

dataset.head()  

   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                         0.580                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


In [7]:
# To see statistical details of the dataset, execute the following command:

#dataset.describe()
dataset['Petrol_Consumption'].mean()  # avg of the target var. 

576.7708333333334

In [None]:
# Preparing the Data
# divide the data into attributes and labels
X = dataset.drop('Petrol_Consumption', axis=1)  
y = dataset['Petrol_Consumption']  

# dividing data into training and testing set
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)  

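# Training and Making Predictions
# Note : we will be using the DecisionTreeRegressor class, not DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

# To make predictions on the test set, 
y_pred = regressor.predict(X_test)

# Now let's compare some of our predicted values with the actual values 
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df.head())
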


**Note** : 

In your case the records compared may be different, depending on the training and testing split. The train_test_split method splits the data randomly, so without a fixed random_state you likely won't get the same training and test sets. With random_state=0, as used above, the split is the same on every run.
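
A quick way to convince yourself of the effect of random_state is the sketch below. It uses a tiny made-up array (an assumption, not the petrol data) and calls train_test_split twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state
_, X_test_a, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The same rows land in the test set both times
same_split = np.array_equal(y_test_a, y_test_b)
print(same_split)   # True
```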


In [9]:
# Evaluating the Algorithm
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  

Mean Absolute Error: 54.9
Mean Squared Error: 5181.5
Root Mean Squared Error: 71.98263679527167


The root mean squared error for our algorithm is ------ , which is more than *10 percent of the mean* of all the values in the '**Petrol_Consumption**' column ( 10% of the mean is about **57.7** ). This means that our algorithm did not do a great prediction job; getting a value **less than 10%** <u>would have been better</u>.

There could be many reasons for a regression algorithm not to perform well. One important reason is : 

**`IMP`** The data sample is too small for training the model. So instead of testing the model on just 20% of the data and judging it, we can do a better job by **CROSS VALIDATING** our model. 
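
The comparison described above can be reproduced from the numbers already printed in this notebook. This is only a sketch of the arithmetic; the 10%-of-mean threshold is the rule of thumb used in this notebook, not a universal standard.

```python
from math import sqrt

mse = 5181.5                        # Mean Squared Error printed above
target_mean = 576.7708333333334     # mean of Petrol_Consumption printed earlier

rmse = sqrt(mse)                    # RMSE is the square root of MSE
threshold = 0.10 * target_mean      # 10% of the mean of the target
relative_rmse = rmse / target_mean  # RMSE as a fraction of the mean

print(f"RMSE = {rmse:.2f}")
print(f"10% of mean = {threshold:.2f}")
print(f"RMSE / mean = {relative_rmse:.1%}")
print("within 10% of the mean?", rmse <= threshold)
```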

### Learners should apply CROSS VALIDATION --> weekly H.w
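
As a starting point for this homework, here is a minimal sketch of 5-fold cross validation with scikit-learn's `cross_val_score`. To keep it self-contained it uses synthetic data shaped like the petrol dataset (48 rows, 4 features); that data, the fold count and the variable names are all assumptions. In your own notebook you would pass the real `X` and `y` instead.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the petrol data (hypothetical): 48 rows, 4 features
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(48, 4))
y_demo = 500 + 200 * X_demo[:, 3] - 50 * X_demo[:, 0] + rng.normal(0, 20, size=48)

# Every row is used for testing exactly once across the 5 folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeRegressor(random_state=0),
    X_demo, y_demo,
    cv=cv,
    scoring="neg_root_mean_squared_error",  # scikit-learn reports errors as negative scores
)

rmse_per_fold = -scores
print("RMSE per fold:", np.round(rmse_per_fold, 2))
print("Mean RMSE:", rmse_per_fold.mean())
```

Averaging the error over several folds gives a steadier estimate than a single 80/20 split, which matters most when the dataset is this small.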



<font color='red'><b><u>Problem Solving (25-30 mins):</u></b> </font>
--
<br>
<b><h4>Can you predict tomorrow's stock price for <u>HDFC bank</u> on BSE ?</h4></b>

<font color='green'> <b>Follow the steps </b></font>

**Introduction** <br>
The stock market is a market that enables the seamless buying and selling of company stocks. Every stock exchange has its own stock index value. The index is an average value calculated by combining several stocks. 

This helps in representing the entire stock market and predicting the market’s movement over time. The stock market can have a huge impact on people and the country’s economy as a whole. Therefore, predicting the stock trends in an efficient manner can minimize the risk of loss and maximize profit.


**How does stock market work?** <br>
The concept behind how the stock market works is pretty simple. Operating much like an auction house, the stock market enables buyers and sellers to negotiate prices and make trades.

The stock market works through a network of exchanges; you may have heard of the New York Stock Exchange, Nasdaq, the BSE or the NSE. Companies list shares of their stock on an exchange through a process called an initial public offering or IPO. Investors purchase those shares, which allows the company to raise money to grow its business. Investors can then buy and sell these stocks among themselves, and the exchange tracks the supply and demand of each listed stock.

That supply and demand help determine the price for each security or the levels at which stock market participants — investors and traders — are willing to buy or sell.


**How Share Prices Are Set** <br>
To actually buy shares of a stock on a stock exchange, investors go through brokers — an intermediary trained in the science of stock trading, who can get an investor a stock at a fair price, at a moment’s notice. Investors simply let their broker know what stock they want, how many shares they want, and usually at a general price range. That’s called a “bid” and sets the stage for the execution of a trade. If an investor wants to sell shares of a stock, they tell their broker what stock to sell, how many shares, and at what price level. That process is called an “offer” or “ask price.”

**Predicting**  <br>
Predicting how the stock market will perform is one of the **most difficult things to do**. There are so many factors involved in the prediction: physical factors vs. psychological ones, rational and irrational behavior, etc. All these aspects combine to make share prices volatile and very difficult to predict with a high degree of accuracy.
<br>
<font color = 'green'> <br>
We will try predicting the <b>next day's stock price</b> using <b>DECISION TREE REGRESSOR</b> </font>

Understanding the Problem Statement
--

Broadly, stock market analysis is divided into two parts – Fundamental Analysis and Technical Analysis.

**Fundamental Analysis** involves analyzing the company’s future profitability on the basis of its current business environment and financial performance.

**Technical Analysis**, on the other hand, includes reading the charts and using statistical figures to identify the trends in the stock market.

As you might have guessed, our focus will be on the technical analysis part. We’ll be using a dataset from **Quandl** (you can find historical data for various stocks here) and for this particular project, I have used the data for ‘HDFC bank Ltd- BSE’. 

`data source` : https://www.quandl.com/data/BSE/BOM500180-HDFC-Bank-Ltd-EOD-Prices

<u>Important</u> : <br>
<b>1. Please create <u>Student</u> Account on Quandl.com.</b> <br>
<b>2. Search</b> <font color='blue'>HDFC-Bank-Ltd, BSE</font> data and <b>download the csv file. You would get the latest data up to yesterday.</b>
<br><br>
<small>Our SuvenML team is not readily giving you the dataset, as we have done in previous NB's / case-studies.</small>


In [10]:
# doing minimum necessary imports

import pandas as pd                            # for loading and analysing data
import matplotlib.pyplot as plt                # for data visualization 
from sklearn.tree import DecisionTreeRegressor # our decision tree regressor

In [11]:
!pip install Quandl

Defaulting to user installation because normal site-packages is not writeable
Collecting Quandl
  Downloading Quandl-3.6.1-py2.py3-none-any.whl (26 kB)
Collecting inflection>=0.3.1
  Downloading inflection-0.5.1-py2.py3-none-any.whl (9.5 kB)
Installing collected packages: inflection, Quandl
Successfully installed Quandl-3.6.1 inflection-0.5.1
You should consider upgrading via the '/usr/local/bin/python3.7 -m pip install --upgrade pip' command.


In [12]:
import quandl        # Stock market API for fetching Data

In [36]:
# ### import Quandl into your program as
import quandl  # Stock market API for fetching Data

# you can then fetch the stock data directly into your code as :
quandl.ApiConfig.api_key = 'YOUR_API_KEY'   ## enter your own key; avoid sharing a real key in a notebook
#stock_data = quandl.get("BSE/HDFC-BANK-LTD", start_date="2020-07-16", end_date="2021-07-16")  
stock_data = quandl.get('BSE/BOM500180', start_date='2020-10-13', end_date='2021-10-13')
# choose dates up to yesterday's date 
print(stock_data.head())   ### Let's see the data
print("----------------------")
print(stock_data.shape)   ### Let's see the number of rows and columns

               Open     High      Low    Close      WAP  No. of Shares  \
Date                                                                     
2020-10-13  1214.15  1221.50  1195.10  1198.15  1206.64       201283.0   
2020-10-14  1193.30  1214.80  1175.85  1210.85  1193.66       246863.0   
2020-10-15  1214.80  1216.45  1164.40  1169.15  1190.37       382299.0   
2020-10-16  1182.00  1203.00  1172.95  1199.00  1192.02       401346.0   
2020-10-19  1220.00  1228.80  1193.00  1203.35  1206.85       820336.0   

            No. of Trades  Total Turnover  Deliverable Quantity  \
Date                                                              
2020-10-13         6118.0     242876052.0               43263.0   
2020-10-14         8946.0     294671648.0               51586.0   
2020-10-15        11771.0     455077129.0              154798.0   
2020-10-16        12293.0     478414159.0              212464.0   
2020-10-19        24011.0     990019613.0              140712.0   

           

In [37]:
## working with shift()  --> https://www.geeksforgeeks.org/python-pandas-dataframe-shift/
print(stock_data[['Open', 'Close']].tail(4))
shifted_Open_Close = stock_data.loc[:,['Open', 'Close']].shift(-1)
print(shifted_Open_Close.tail(4))
print("-------------------------------")

shifted_Stock_Data = stock_data.copy()

print("-------------------------------")
shifted_Stock_Data['Open'] = shifted_Open_Close['Open']
print(shifted_Stock_Data[['Open', 'Close']].tail(4))

              Open    Close
Date                       
2021-10-06  1601.0  1615.05
2021-10-07  1624.1  1611.25
2021-10-08  1612.0  1602.20
2021-10-11  1603.0  1634.35
              Open    Close
Date                       
2021-10-06  1624.1  1611.25
2021-10-07  1612.0  1602.20
2021-10-08  1603.0  1634.35
2021-10-11     NaN      NaN
-------------------------------
-------------------------------
              Open    Close
Date                       
2021-10-06  1624.1  1615.05
2021-10-07  1612.0  1611.25
2021-10-08  1603.0  1602.20
2021-10-11     NaN  1634.35
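
The effect of `shift(-1)` is easier to see on a tiny frame. The sketch below uses made-up prices and a hypothetical `Next_Open` column name; the notebook itself overwrites `Open` in place instead of adding a new column.

```python
import pandas as pd

# Toy prices (hypothetical) for three consecutive trading days
toy = pd.DataFrame(
    {"Open": [10.0, 11.0, 12.0], "Close": [10.5, 11.5, 12.5]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
)

# shift(-1) moves every value one row up:
# the row for day t now holds the Open of day t+1
toy["Next_Open"] = toy["Open"].shift(-1)
print(toy)

# The last day has no "next day", so its target is NaN
# and the row has to be dropped before training
train_ready = toy.dropna()
print(len(train_ready))
```

This is why each row can pair today's High, Low, Close and so on with the next day's opening price as the target.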


In [38]:
## would be used later for testing purpose 
## last date value
print(shifted_Stock_Data.values[-1:])

[[           nan 1.64485000e+03 1.60000000e+03 1.63435000e+03
  1.63190000e+03 1.44074000e+05 6.35500000e+03 2.35114992e+08
  7.79400000e+04 5.41000000e+01 4.48500000e+01 3.13500000e+01]]


In [39]:
# How many samples do we have ?
#stock_data.shape
shifted_Stock_Data.shape

(248, 12)

In [40]:
# checking whether any column or feature has missing values
#stock_data.isnull().sum()

shifted_Stock_Data.isnull().sum()

Open                         1
High                         0
Low                          0
Close                        0
WAP                          0
No. of Shares                0
No. of Trades                0
Total Turnover               0
Deliverable Quantity         0
% Deli. Qty to Traded Qty    0
Spread H-L                   0
Spread C-O                   0
dtype: int64

In [41]:
shifted_Stock_Data.dropna(inplace=True)
shifted_Stock_Data.isnull().sum()

Open                         0
High                         0
Low                          0
Close                        0
WAP                          0
No. of Shares                0
No. of Trades                0
Total Turnover               0
Deliverable Quantity         0
% Deli. Qty to Traded Qty    0
Spread H-L                   0
Spread C-O                   0
dtype: int64

Now, the most important and a simple thing :

> Decide and divide data into Dependent and Independent variables

> Using the `Date` column may not be useful in predicting the **Opening Price**; for that we would have to look at a **Time Series Forecasting** approach. In a simple regression approach, using the date to predict the opening price may not be a good idea.

> <font color='green'>Now we have to predict the next day's <b>open price</b>, so this column is our <u>dependent variable</u>; we treat the open price as depending on <b>High,Low,Close,.....,Turnover.</b></font>

In [42]:
# Let's select our features

X = shifted_Stock_Data.drop(['Open'] , axis=1) 
y = shifted_Stock_Data.loc[ : ,'Open']

In [43]:
X.head(2) # head() shows the earliest 2 records

              High      Low    Close      WAP  No. of Shares  No. of Trades  \
Date                                                                          
2020-10-13  1221.5  1195.10  1198.15  1206.64       201283.0         6118.0   
2020-10-14  1214.8  1175.85  1210.85  1193.66       246863.0         8946.0   

            Total Turnover  Deliverable Quantity  % Deli. Qty to Traded Qty  \
Date                                                                          
2020-10-13     242876052.0               43263.0                      21.49   
2020-10-14     294671648.0               51586.0                      20.90   

            Spread H-L  Spread C-O  
Date                                
2020-10-13       26.40      -16.00  
2020-10-14       38.95       17.55  


In [44]:
y.head(2) # earliest 2 target values (next day's Open)

Date
2020-10-13    1193.3
2020-10-14    1214.8
Name: Open, dtype: float64

In [45]:
y.tail(2) # latest 2 target values (next day's Open)

Date
2021-10-07    1612.0
2021-10-08    1603.0
Name: Open, dtype: float64

In [46]:
# split the entire data into Training and Test 
# keep 80% for training and 20% for Testing
# so the test_size = 0.2
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X, y, 
                                                 test_size = 0.2, 
                                                 random_state = 0)
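
The cell above shuffles the rows before splitting, which is what the rest of this notebook uses. Because stock prices are ordered in time, one alternative worth knowing is a chronological split, where the most recent rows form the test set. The sketch below is not the approach followed in this notebook; it only shows the `shuffle=False` option of train_test_split on made-up data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in (hypothetical): 10 "days" in chronological order
X_days = np.arange(10).reshape(-1, 1)
y_days = np.arange(10) * 1.0

# shuffle=False keeps the row order:
# the first 80% of rows go to training, the last 20% to testing
X_tr, X_te, y_tr, y_te = train_test_split(X_days, y_days, test_size=0.2, shuffle=False)

print(X_tr.ravel())
print(X_te.ravel())   # [8 9]
```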

In [47]:
# Let's fit our DecisionTree Model over the training data.
regressor = DecisionTreeRegressor() # making the object of DecisionTreeRegressor
regressor.fit(x_train,y_train)

DecisionTreeRegressor()

In [48]:
# Get the predictions on the test set 
y_pred = regressor.predict(x_test)

# Evaluating the Algorithm

from sklearn import metrics  
import numpy as np
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred))) 

Root Mean Squared Error: 18.716061284362155


In [49]:
# Now, what's the mean of the 'Close' price ??
# please note we will find the mean of the 'Close' column over the whole dataset.
# Splitting data into Training and Testing is only for the purpose of evaluating the Model
stock_data['Close'].mean()

1464.9403225806443

<font color='green'> <b>Analysis : </b> Now the 10% of ---- is ~ <b>-----</b>. Our RMSE was ~<b>-----</b>. <br>I am very happy that it's within the 10% range. That means our <u>model</u> is doing a good job. </font>

<font color='red'> Now let's predict the <u><b>Open</b></u> price for today. </font>

In [50]:
# Trying to predict Tomorrow's rate, i.e 14th Oct 2021 in my case. 
#test_data = [[1530.0,	1519.35,	1521.7,	1524.06,	91268.0,	4591.0,	139097555.0]]
#test_data = [[1522.0,	1530.0,	1519.35,	1524.06,	91268.0,	4591.0,	139097555.0,	32924.0,	36.07,	10.65,	-0.3]]

## testing for shifted data
test_data = [[ 1.64485000e+03, 1.60000000e+03, 1.63435000e+03,  1.63190000e+03, 1.44074000e+05, 6.35500000e+03, 
              2.35114992e+08, 7.79400000e+04, 5.41000000e+01, 4.48500000e+01, 3.13500000e+01]]

regressor.predict(test_data)

array([1624.1])

<font color='green'><b>Observation</b> : So our Model says that <u>HDFC Bank Ltd(BSE)</u> would Open at ___ on ___. </font>

<font color='red'><b>What you should do ?</b></font>
1. Train and test the Model as above. 
2. Quandl data is always available only up to one day prior. Say today is 1st April 2021, 6:00 am; then Quandl would give data only up to 30th March 2021, 3:30 pm. So you can Google and fetch the data for 31st March and use it for predicting the <b>Open price</b> for 1st April 2021.  

**Note** : I am sure you are aware that the BSE market opens for trading at <u>9:15 am IST</u> and ends the day at <u>3:30 pm IST</u>.

Do connect with me on LinkedIn here :  https://www.linkedin.com/in/rocky-jagtiani-3b390649/