# Group Activity Week 17

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
diabetes_df = pd.read_csv('../SupervisedML_13/diabetes.csv')
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,precision_score,recall_score 

In [35]:
X = diabetes_df.iloc[:,:-1]
y = diabetes_df.iloc[:,-1]

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=6, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

#estimator = model
rf = RandomForestClassifier(n_estimators=200,random_state=42)

rf = rf.fit(X_train, y_train)
print('Accuracy Score',rf.score(X_test, y_test).round(2)*100,'%')
predictions = rf.predict(X_test)
print(classification_report(y_test,predictions))

Accuracy Score 80.0 %
              precision    recall  f1-score   support

           0       0.81      0.91      0.86       150
           1       0.78      0.60      0.68        81

    accuracy                           0.80       231
   macro avg       0.79      0.76      0.77       231
weighted avg       0.80      0.80      0.79       231



### 1. Write simple (straightforward) definitions for the following parameters for RandomForestClassifier  and indicate how they correlate with the precision and recall for the basic diabetes model we built in class. You will need to rerun the model multiple times to do so.

### **1.n_estimator**- number of trees in the forest.

   * It correlated **positively with precision** (precision increased if no of trees increased and decreased if no of trees reduced) and changed about 0.1 with each number increase.  
   
   * There is a **negative correlation between recall and n_estimator**. I have increased n_estimator from 20 to 100 and then 100 to 300, recall reduced about 0.02 and 0.01 respectively.

### **2.max_depth** - the longest path between the root node and the leaf node

* max_depth is **negatively correlated with precision**. I changed max_depth from 3 to 5. precision was reduced 0.02.

* It is **positively correlated with recall**. It is increased 0.04 for the depth increased from 3 to 5 and then for 5 to 10 it is increased for 0.05.

### **3.min_samples_split** - the minimum number of samples required to split an internal node(default value is 2).

* There is a **strong positive correlation between precision and min_samples_split**. When I increased min_samples_split from 5 to 200, I got an increment of 0.07 in precision. 

* Opposite to precision, **there is negative correlation with recall**. For the same amount of increase in min_samples_split, my recall reduced very well(0.18).

### **4.min_samples_leaf** -  specifies the minimum number of samples that should be present in the leaf node(default is 1).

* There is a **strong positive correlation** between precision and min_samples_leaf. When I increased min_samples_leaf from 5 to 100, I got an increment of 0.13 in precision.

* Opposite to precision, there is **negative correlation with recall**. For the same amount of increase in min_samples_leaf, my recall reduced very well(0.18).

### **5. min_weight_fraction_leaf** - the fraction of the input samples required to be at a leaf node. min_weight_fraction_leaf must in [0, 0.5]

* It is **positively correlated with precision**. Great improvement in precision for each increment of 0.1. 
* It is **negatively correlated with recall**. For first increment of 0.1 in min_weight_fraction_leaf, our recall reduced 0.05, and for another increment of 0.1, recall is reduced for 0.09.

### **6.max_leaf_nodes** - This hyperparameter sets a condition on the splitting of the nodes in the tree and hence restricts the growth of the tree. If this parameter value goes beyond 25, then tree starts to overfit.

* Precision started decreasing if I increase max_leaf_node. So **its negative correlation**. Increment of 2 from max_leaf_nodes impacted our precision about 0.06.

* Recall is **positively correlated with max_leaf_nodes**. We are getting improved recall from 0.37 to 0.44 for the same increment of max_leaf_nodes.

### **7.min_impurity_decrease** - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

* There is a **positive correlation with precision**. We are getting highest precision for min_impurity_decrease of 0.04.
* There is a **negative correlation with recall**. We are getting very low recall for min_impurity_decrease of 0.04. So increasing min_impurity_decrease value is a bad idea for this diabetes dataset. Because we want good recall score.

* Using my function, I did tuned my hyperparameters.

In [56]:
def estimator(n):
    rf = RandomForestClassifier(n_estimators=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Precision for n_estimator = ',n, 'is',precision_score(y_test,predictions).round(2))
    
estimator(20)    
estimator(100)
estimator(300)
estimator(500)

Precision for n_estimator =  20 is 0.65
Precision for n_estimator =  100 is 0.68
Precision for n_estimator =  300 is 0.68
Precision for n_estimator =  500 is 0.69


In [57]:
def estimator(n):
    rf = RandomForestClassifier(n_estimators=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Recall for n_estimator = ',n, 'is',recall_score(y_test,predictions).round(2))
    
estimator(20)
estimator(100)
estimator(300)
estimator(500)

Recall for n_estimator =  20 is 0.6
Recall for n_estimator =  100 is 0.58
Recall for n_estimator =  300 is 0.57
Recall for n_estimator =  500 is 0.57


In [30]:
def max_depth(n):
    rf = RandomForestClassifier(max_depth=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Precision for max_depth = ',n, 'is',precision_score(y_test,predictions).round(2))
max_depth(3)
max_depth(5)
max_depth(10)
max_depth(30)

Precision for max_depth =  3 is 0.7
Precision for max_depth =  5 is 0.68
Precision for max_depth =  10 is 0.69
Precision for max_depth =  30 is 0.68


In [31]:
def max_depth(n):
    rf = RandomForestClassifier(max_depth=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Recall for max_depth = ',n, 'is',recall_score(y_test,predictions).round(2))
max_depth(3)
max_depth(5)
max_depth(10)
max_depth(30)

Recall for max_depth =  3 is 0.48
Recall for max_depth =  5 is 0.52
Recall for max_depth =  10 is 0.57
Recall for max_depth =  30 is 0.58


In [61]:
def min_samples_split(n):
    rf = RandomForestClassifier(min_samples_split=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Precision for min_samples_split = ',n, 'is',precision_score(y_test,predictions).round(2))
min_samples_split(5)
min_samples_split(50)
min_samples_split(100)
min_samples_split(200)

Precision for min_samples_split =  5 is 0.69
Precision for min_samples_split =  50 is 0.69
Precision for min_samples_split =  100 is 0.71
Precision for min_samples_split =  200 is 0.76


In [62]:
def min_samples_split(n):
    rf = RandomForestClassifier(min_samples_split=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Recall for min_samples_split = ',n, 'is',recall_score(y_test,predictions).round(2))
min_samples_split(5)
min_samples_split(50)
min_samples_split(100)
min_samples_split(200)

Recall for min_samples_split =  5 is 0.54
Recall for min_samples_split =  50 is 0.53
Recall for min_samples_split =  100 is 0.46
Recall for min_samples_split =  200 is 0.36


In [76]:
def min_samples_leaf(n):
    rf = RandomForestClassifier(min_samples_leaf=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Precision for min_samples_leaf = ',n, 'is',precision_score(y_test,predictions).round(2))
min_samples_leaf(5)
min_samples_leaf(50)
min_samples_leaf(100)

Precision for min_samples_leaf =  5 is 0.68
Precision for min_samples_leaf =  50 is 0.69
Precision for min_samples_leaf =  100 is 0.81


In [77]:
def min_samples_leaf(n):
    rf = RandomForestClassifier(min_samples_leaf=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Recall for min_samples_leaf = ',n, 'is',recall_score(y_test,predictions).round(2))
min_samples_leaf(5)
min_samples_leaf(50)
min_samples_leaf(100)

Recall for min_samples_leaf =  5 is 0.53
Recall for min_samples_leaf =  50 is 0.44
Recall for min_samples_leaf =  100 is 0.31


In [9]:
def min_weight_fraction_leaf(n):
    rf = RandomForestClassifier(min_weight_fraction_leaf=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Precision for min_weight_fraction_leaf = ',n, 'is',precision_score(y_test,predictions).round(2))
min_weight_fraction_leaf(0.1)
min_weight_fraction_leaf(0.2)
min_weight_fraction_leaf(0.3)

Precision for min_weight_fraction_leaf =  0.1 is 0.69
Precision for min_weight_fraction_leaf =  0.2 is 0.74
Precision for min_weight_fraction_leaf =  0.3 is 0.78


In [7]:
def min_weight_fraction_leaf(n):
    rf = RandomForestClassifier(min_weight_fraction_leaf=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Recall for min_weight_fraction_leaf = ',n, 'is',recall_score(y_test,predictions).round(2))
min_weight_fraction_leaf(0.1)
min_weight_fraction_leaf(0.2)
min_weight_fraction_leaf(0.3)

Recall for min_weight_fraction_leaf =  0.1 is 0.47
Recall for min_weight_fraction_leaf =  0.2 is 0.42
Recall for min_weight_fraction_leaf =  0.3 is 0.31


In [13]:
def max_leaf_nodes(n):
    rf = RandomForestClassifier(max_leaf_nodes=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Precision for max_leaf_nodes = ',n, 'is',precision_score(y_test,predictions).round(2))
max_leaf_nodes(3)
max_leaf_nodes(5)
max_leaf_nodes(10)
max_leaf_nodes(20)

Precision for max_leaf_nodes =  3 is 0.79
Precision for max_leaf_nodes =  5 is 0.71
Precision for max_leaf_nodes =  10 is 0.7
Precision for max_leaf_nodes =  20 is 0.69


In [14]:
def max_leaf_nodes(n):
    rf = RandomForestClassifier(max_leaf_nodes=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Recall for max_leaf_nodes = ',n, 'is',recall_score(y_test,predictions).round(2))
max_leaf_nodes(3)
max_leaf_nodes(5)
max_leaf_nodes(10)
max_leaf_nodes(20)

Recall for max_leaf_nodes =  3 is 0.37
Recall for max_leaf_nodes =  5 is 0.44
Recall for max_leaf_nodes =  10 is 0.48
Recall for max_leaf_nodes =  20 is 0.51


In [24]:
def min_impurity_decrease(n):
    rf = RandomForestClassifier(min_impurity_decrease=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Precision for min_impurity_decrease = ',n, 'is',precision_score(y_test,predictions).round(2))
min_impurity_decrease(0.002)
min_impurity_decrease(0.03)
min_impurity_decrease(0.04)


Precision for max_leaf_nodes =  0.002 is 0.73
Precision for max_leaf_nodes =  0.03 is 0.84
Precision for max_leaf_nodes =  0.04 is 0.89


In [27]:
def min_impurity_decrease(n):
    rf = RandomForestClassifier(min_impurity_decrease=n,random_state=6)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('Recall for min_impurity_decrease = ',n, 'is',recall_score(y_test,predictions).round(2))
min_impurity_decrease(0.002)
min_impurity_decrease(0.03)
min_impurity_decrease(0.04)


Recall for max_leaf_nodes =  0.002 is 0.53
Recall for max_leaf_nodes =  0.03 is 0.33
Recall for max_leaf_nodes =  0.04 is 0.2



#### Summarized Table 



|   Parameter            |    Correlation with Precision                  |     Correlation with Recall                   | 
|:----------------------:|:----------------------------------------------:|:---------------------------------------------:|
|1.n_estimator           | correlated positively(0.01 value is increasing)|negatively correlated                          | 
|2. max_depth            | not correlated well, almost getting same value |positively correlated                          |  
|3. min_samples_split    | strong positive correlation                    |negative correlation                           |
|4. min_samples_leaf     | strong positive correlation                    |negative correlation                           |
|5. min_weig_frac_leaf   | positively correlated                          |negatively correlated                          |
|6. max_leaf_nodes       | negatively correlated                          |positively correlated                          |
|7. min_impurity_decrease| positively correlated                          |negatively correlated                          |




## 2. How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so

#### Performance with default bootstrap = True

In [46]:
rf = RandomForestClassifier(n_estimators=200,random_state=6)
rf = rf.fit(X_train, y_train)
print('Accuracy Score',rf.score(X_test, y_test).round(2)*100,'%')
predictions = rf.predict(X_test)
print(classification_report(y_test,predictions))   

Accuracy Score 81.0 %
              precision    recall  f1-score   support

           0       0.82      0.90      0.86       150
           1       0.77      0.63      0.69        81

    accuracy                           0.81       231
   macro avg       0.80      0.76      0.78       231
weighted avg       0.80      0.81      0.80       231



#### Performance with bootstrap = False

In [47]:
rf = RandomForestClassifier(bootstrap = False,n_estimators=200,random_state=6)
rf = rf.fit(X_train, y_train)
print('Accuracy Score',rf.score(X_test, y_test).round(2)*100,'%')
predictions = rf.predict(X_test)
print(classification_report(y_test,predictions))

Accuracy Score 79.0 %
              precision    recall  f1-score   support

           0       0.81      0.87      0.84       150
           1       0.73      0.63      0.68        81

    accuracy                           0.79       231
   macro avg       0.77      0.75      0.76       231
weighted avg       0.78      0.79      0.78       231



#### Results discussion about setting bootstrap = False

* Bootstrap is a boolean indicate whether samples are drawn with replacement. Default is True **(that is nature of bagging-(bootstrap aggregating))**. Some data points will be used more than once and others not.


* If we set bootstrap = False, each estimator will be trained using every data exactly once.


* In our class exercise Random Forest classifier with bootstrapping, it gave good accuracy and precision score. Because each classifier trained with Multiple bootstrap samples from the dataset, the mean calculated on each and gave a robust and accurate scores. With Bootstrapping = False, recall didn't change but precision and accuracy score reduced.


* For this classifier, setting **bootstrap = True, performs well than bootstrap = False**.