# Notebook Instructions

This notebook is incomplete. Go to the MODIFY section and follow the instructions there to complete the notebook.
You can run the notebook sequentially (one cell at a time) by pressing Shift + Enter. While a cell is running, a [*] is displayed on its left. Once it has run, a number is displayed instead, indicating the order in which the cell was run in the notebook, for example [8].

Enter edit mode by pressing Enter or using the mouse to click on a cell's editor area. Edit mode is indicated by a green cell border and a prompt showing in the editor area.

## Class Weights in Decision Trees

When we build a decision tree model, the dataset may contain very few data points for its most important classes. In that case, the decision tree algorithm will tend to maximize accuracy on the most common labels at the expense of the rare ones.

To adjust for this, we re-assign weights to the data points of the most important labels. In the scikit-learn library, this is done with the class_weight argument of the decision tree classifier. Let us take an example to illustrate this.
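
To make the idea of re-weighting concrete, here is a minimal sketch of how the 'balanced' setting of class_weight derives its weights. The label array `y_toy` is a made-up example, not the notebook's data; `compute_class_weight` is scikit-learn's helper for the same calculation.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: class 0 is common, class 1 is rare.
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y_toy)

# 'balanced' weights follow n_samples / (n_classes * count_of_class).
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))
auto = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The rare class receives the larger weight.
print(dict(zip(classes.tolist(), auto.tolist())))
```

With these weights, each class contributes the same total weight to the impurity calculation, however many samples it has.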

#### Example:

We will input raw data of ACC Ltd. stock from a csv file. The data consists of Open-High-Low-Close prices and Volume data. Predictor and target variables are created using this raw data. 

In [1]:
import pandas as pd
df = pd.read_csv("ACC.csv")

df.tail()

Unnamed: 0,Date,OPEN,HIGH,LOW,CLOSE
243,6/11/2018,1315.9,1338.8,1311.3,1320.2
244,6/12/2018,1321.0,1329.4,1306.6,1315.7
245,6/13/2018,1317.3,1349.4,1311.6,1331.0
246,6/14/2018,1331.0,1333.3,1304.4,1307.2
247,6/15/2018,1309.7,1314.9,1295.0,1302.1


#### Computing Technical Indicators and Daily Future Returns

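We compute the Average Directional Index (ADX), Relative Strength Index (RSI), and Simple Moving Average (SMA) using the TA-Lib package. These will be used as predictor variables in the decision tree model. Next, we compute the daily future returns on the close price: the return from today's close to the next day's close, stored on today's row. The last row has no next-day close, so it is dropped together with the indicator warm-up rows. The code is shown below.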


In [2]:
import numpy as np
import talib as ta 

df['ADX'] = ta.ADX(df['HIGH'].values, df['LOW'].values, df['CLOSE'].values, timeperiod=14)
df['RSI'] = ta.RSI(df['CLOSE'].values, timeperiod=14)
df['SMA'] = ta.SMA(df['CLOSE'].values, timeperiod=20)

df['Return'] = df['CLOSE'].pct_change(1).shift(-1)  
df = df.dropna()

df.tail()

Unnamed: 0,Date,OPEN,HIGH,LOW,CLOSE,ADX,RSI,SMA,Return
242,6/8/2018,1314.0,1325.6,1305.5,1315.1,34.818493,34.913609,1342.575,0.003878
243,6/11/2018,1315.9,1338.8,1311.3,1320.2,33.309646,36.468425,1336.025,-0.003409
244,6/12/2018,1321.0,1329.4,1306.6,1315.7,32.050041,35.658985,1331.225,0.011629
245,6/13/2018,1317.3,1349.4,1311.6,1331.0,30.094175,40.494969,1327.235,-0.017881
246,6/14/2018,1331.0,1333.3,1304.4,1307.2,28.520287,35.966371,1322.57,-0.003901
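
The combination of `pct_change(1)` and `shift(-1)` is easy to misread, so here is a small sketch on made-up prices (the `close` values are hypothetical, not ACC data) showing how the future return ends up aligned with the current row.

```python
import pandas as pd

# Hypothetical closing prices for four consecutive days.
close = pd.Series([100.0, 110.0, 99.0, 99.0])

# pct_change(1) is the return from the previous day to the current day.
daily_return = close.pct_change(1)

# shift(-1) moves each value up one row, so row t holds the return from t to t+1.
future_return = daily_return.shift(-1)

# The last row has no next day, so its future return is NaN.
print(pd.DataFrame({"close": close, "future_return": future_return}))
```

This alignment means the indicators on a given row are paired with the return that follows them, which is what a predictive model needs.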


#### Categorize Returns into Multiple Classes

We define a function called 'returns_to_class' that uses an if/elif/else chain to categorize returns into multiple classes. The range of returns for each class is specified inside this function. The function is then applied row by row to our dataframe, df, to get the multi-class target variable.


In [3]:
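def returns_to_class(row):
    # Map one row's next-day return to a class label.
    # The branches are checked in order, so each one only needs an upper bound.
    if row.Return <= 0.0:
        return 0   # zero or negative return
    elif row.Return < 0.02:
        return 1   # 0% < return < 2%
    elif row.Return < 0.03:
        return 2   # 2% <= return < 3% (a return of exactly 0.02 no longer falls through to class 3)
    else:
        return 3   # return >= 3%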

df['Class'] = df.apply(returns_to_class,axis=1)
df.tail()

Unnamed: 0,Date,OPEN,HIGH,LOW,CLOSE,ADX,RSI,SMA,Return,Class
242,6/8/2018,1314.0,1325.6,1305.5,1315.1,34.818493,34.913609,1342.575,0.003878,1
243,6/11/2018,1315.9,1338.8,1311.3,1320.2,33.309646,36.468425,1336.025,-0.003409,0
244,6/12/2018,1321.0,1329.4,1306.6,1315.7,32.050041,35.658985,1331.225,0.011629,1
245,6/13/2018,1317.3,1349.4,1311.6,1331.0,30.094175,40.494969,1327.235,-0.017881,0
246,6/14/2018,1331.0,1333.3,1304.4,1307.2,28.520287,35.966371,1322.57,-0.003901,0
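
As an alternative to a row-wise `apply`, the same binning can be sketched in vectorized form with `np.select`. This is an illustration on made-up returns, assuming the same thresholds and that a return of exactly 0.02 belongs to class 2 and exactly 0.03 to class 3.

```python
import numpy as np
import pandas as pd

# Hypothetical returns, including the boundary values 0.0, 0.02, and 0.03.
returns = pd.Series([-0.01, 0.0, 0.01, 0.02, 0.025, 0.03, 0.05]).to_numpy()

# Conditions are checked in order; the first one that holds decides the class.
conditions = [returns <= 0.0, returns < 0.02, returns < 0.03]
labels = np.select(conditions, [0, 1, 2], default=3)

print(labels.tolist())  # [0, 0, 1, 2, 2, 3, 3]
```

Note that a NaN return fails every condition and would therefore land in the default class 3, so missing values should be dropped first, as the notebook does.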


#### View the Multi-Class Distribution

Once we have defined the classes for the target variable, we can view their distribution using the groupby method. The majority of the data points (126) belong to class '0', which signifies zero or negative returns. By contrast, only 9 and 3 data points belong to classes '2' and '3', respectively.

In [4]:
df.groupby('Class').count()

Class,Date,OPEN,HIGH,LOW,CLOSE,ADX,RSI,SMA,Return
0,126,126,126,126,126,126,126,126,126
1,82,82,82,82,82,82,82,82,82
2,9,9,9,9,9,9,9,9,9
3,3,3,3,3,3,3,3,3,3
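
To see how skewed this distribution is, the counts can be turned into shares. The sketch below re-enters the four counts from the table by hand; on the notebook's dataframe, `df['Class'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Class counts copied from the groupby output above.
counts = pd.Series({0: 126, 1: 82, 2: 9, 3: 3})

# Share of each class in the 220 labelled rows.
share = (counts / counts.sum()).round(3)

# Ratio of the most common class to the rarest one.
imbalance_ratio = counts.max() / counts.min()

print(pd.DataFrame({"count": counts, "share": share}))
print(imbalance_ratio)  # 42.0
```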


#### Create Predictor Variables and Target Variable

Let us now define the predictor variables, X, and the target variable, y, for building a decision tree model.

In [5]:
X = df[['ADX','RSI','SMA']]
y = df.Class

  
We will consider two scenarios:   

1) Building a decision tree model without applying the class weights and    
2) Building a decision tree model with class weights.
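
Both scenarios split the data with `stratify=y`, which keeps the class proportions roughly the same in the train and test sets. The sketch below illustrates this on stand-in labels that only mimic the class counts of the target; `X_demo` is a placeholder feature, not the real predictors.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the target (126, 82, 9, 3).
y_demo = np.repeat([0, 1, 2, 3], [126, 82, 9, 3])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature, not real data

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Count how many samples of each class ended up in each split.
train_counts = np.bincount(y_tr, minlength=4)
test_counts = np.bincount(y_te, minlength=4)
print(train_counts, test_counts)
```

Without stratification, a class with only three samples could easily be missing from the test set altogether.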


### Scenario 1 - Build a decision tree model without applying the Class weights 

In [6]:
# Split into Train and Test datasets 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42,stratify=y)
  
#print (X_train.shape, y_train.shape)
#print (X_test.shape, y_test.shape)

# Fit a model on train data
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_leaf=5) 
clf = clf.fit(X_train, y_train)

# Use the trained model to make predictions on the test data
y_pred = clf.predict(X_test)  
  
# Evaluate the model performance 
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print (report)

              precision    recall  f1-score   support

           0       0.55      0.87      0.67        38
           1       0.00      0.00      0.00        24
           2       0.00      0.00      0.00         3
           3       0.00      0.00      0.00         1

   micro avg       0.50      0.50      0.50        66
   macro avg       0.14      0.22      0.17        66
weighted avg       0.32      0.50      0.39        66





In [7]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[33,  5,  0,  0],
       [24,  0,  0,  0],
       [ 2,  1,  0,  0],
       [ 1,  0,  0,  0]], dtype=int64)

As the classification report and the confusion matrix show, the unweighted decision tree favors the most common label: it predicts class 0 for 60 of the 66 test samples and does not classify a single sample from classes 1, 2, or 3 correctly, so the underrepresented labels get zero recall.
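
The per-class figures can be read off the confusion matrix directly. The sketch below re-enters the Scenario 1 matrix by hand (rows are true classes, columns are predicted classes) and recomputes the recall of each class.

```python
import numpy as np

# Scenario 1 confusion matrix, copied from the output above.
cm = np.array([[33, 5, 0, 0],
               [24, 0, 0, 0],
               [ 2, 1, 0, 0],
               [ 1, 0, 0, 0]])

# Recall per class: correct predictions divided by the number of true samples.
recall = cm.diagonal() / cm.sum(axis=1)

# Column sums show how often each class was predicted at all.
predicted_per_class = cm.sum(axis=0)

print(recall.round(2))       # [0.87 0.   0.   0.  ]
print(predicted_per_class)   # [60  6  0  0]
```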

### Scenario 2 - Build a decision tree model with Class weights 

MODIFY:
Let us use the class_weight parameter when defining the decision tree classifier to correct for the underrepresented labels. Read https://towardsdatascience.com/practical-tips-for-class-imbalance-in-binary-classification-6ee29bcdb8a7 and set the class_weight parameter of the decision tree classifier so that the classes are weighted as if they appeared with equal frequency.

As the classification report below shows, using class weights raises the recall on the underrepresented labels '2' and '3' (from 0.00 to 0.33 and 1.00, respectively). This comes at the cost of overall accuracy, which falls from 0.50 to about 0.20.

In [8]:
# Split into Train and Test datasets 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42,stratify=y)


# Fit a model on train data
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, min_samples_leaf = 5, class_weight = 'balanced') 
clf = clf.fit(X_train, y_train)

# Use the trained model to make predictions on the test data
y_pred = clf.predict(X_test)  
  
# Evaluate the model performance 
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred)
print (report)

              precision    recall  f1-score   support

           0       0.64      0.24      0.35        38
           1       0.29      0.08      0.13        24
           2       0.03      0.33      0.05         3
           3       0.20      1.00      0.33         1

   micro avg       0.20      0.20      0.20        66
   macro avg       0.29      0.41      0.21        66
weighted avg       0.48      0.20      0.25        66



----------------------------------------------------------------------------------------------------------------
### <font color='red'>ANSWER</font>
##### We added a new parameter (class_weight = "balanced") to tackle the label imbalance during model declaration
----------------------------------------------------------------------------------------------------------------
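
The string 'balanced' is a shortcut for an explicit weight per class. As a minimal sketch on synthetic data (generated with `make_classification`, not the ACC dataset; all variable names here are illustrative), the example below passes the same weights once as 'balanced' and once as a dictionary and checks that both trees make the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced stand-in data with three classes.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.8, 0.15, 0.05],
    random_state=0,
)

# Explicit weights equivalent to class_weight='balanced'.
classes = np.unique(y_syn)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_syn)
weight_dict = {int(c): float(w) for c, w in zip(classes, weights)}

params = dict(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=0)
tree_balanced = DecisionTreeClassifier(class_weight="balanced", **params).fit(X_syn, y_syn)
tree_dict = DecisionTreeClassifier(class_weight=weight_dict, **params).fit(X_syn, y_syn)

# Both trees should agree on every sample.
same = np.array_equal(tree_balanced.predict(X_syn), tree_dict.predict(X_syn))
print(weight_dict, same)
```

Passing a dictionary is useful when the weights should reflect how important each class is rather than only how rare it is.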

In [12]:
# Checking the confusion matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

[[ 9  4 21  4]
 [ 4  2 18  0]
 [ 1  1  1  0]
 [ 0  0  0  1]]


In [13]:
# Obtaining the accuracy of the model

from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print(acc)

0.19696969696969696
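
Accuracy alone hides what the class weights changed. The sketch below re-enters the Scenario 2 confusion matrix by hand and compares plain accuracy with the macro-averaged recall, which gives every class the same importance regardless of its size.

```python
import numpy as np

# Scenario 2 confusion matrix, copied from the output above.
cm_weighted = np.array([[9, 4, 21, 4],
                        [4, 2, 18, 0],
                        [1, 1,  1, 0],
                        [0, 0,  0, 1]])

# Accuracy: correct predictions (the diagonal) over all test samples.
accuracy = np.trace(cm_weighted) / cm_weighted.sum()

# Macro recall: the unweighted mean of the per-class recalls.
macro_recall = (cm_weighted.diagonal() / cm_weighted.sum(axis=1)).mean()

print(round(accuracy, 3), round(macro_recall, 3))  # 0.197 0.413
```

The accuracy matches the value printed by `accuracy_score`, and the macro recall matches the 0.41 in the macro avg row of the classification report.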
