# Write simple (straightforward) definitions for the following parameters for RandomForestClassifier
(https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and indicate how they correlate with the precision and recall for the basic
diabetes model we built in class. You will need to rerun the model multiple times to do so.

# Definitons:
estimators: the number of models to use in the random forest classifier. 
max_depth: the maximum depth of the tree.  This is the number of nodes from the root node down to the leaf following the longest path. 

min_samples_split: this is the minimum number of samples (data points) required to split the node.  Each node asks a question about the data and evaluates that question on data points.  This parameter checks to see how many data points are being evaluated and only splits the node if that number is present. 

min_samples_leaf: This is the minimum number of samples required to be a leaf.  A leaf is the final node of the tree which represents a class.

min_weight_fraction_leaf: Each leaf node has a weight of the total weights of all the leaves.  If this is specified, it means that every leaf must have at least the weight specified.

max_leaf_nodes: This specifies the maximum number of leaf nodes that are possible in the decision tree.  If this is unspecified, a maximum number of leaf nodes is possible.  If it is specified, then the model will pick the leaves that gain the maximum information. 

min_impurity_decrease: impurity is the chance that an outcome would be incorrectly labeled by a specifically node.  If a minimum impurity decrease value is set, it means that a node will be split into a further node if it decreases impurity more than the minimum value set. 

min_impurity_split: This is a threshold value that determines whether a node will be split or become a leaf node, depending on whether the impurity of the node is above this threshold (whether the node will incorrecly classify a certain percentage of the samples). 

In [1]:
import pandas as pd
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix
import pydotplus
from IPython.display import Image

diabetes_df = pd.read_csv("../week_13/diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

rf = RandomForestClassifier(n_estimators=200, random_state =42)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)
y_pred = rf.predict(X_test)

print(classification_report(y_test, y_pred))
print(precision_score(y_test, y_pred, average='weighted'))
print(recall_score(y_test, y_pred, average='weighted'))

              precision    recall  f1-score   support

           0       0.80      0.86      0.83       100
           1       0.70      0.59      0.64        54

    accuracy                           0.77       154
   macro avg       0.75      0.73      0.73       154
weighted avg       0.76      0.77      0.76       154

0.7610055001359349
0.7662337662337663


In [13]:
def corr_with_estimators(n_est):
    rf = RandomForestClassifier(n_estimators=n_est, random_state=42)
    rf = rf.fit(X_train, y_train)
    rf.score(X_test, y_test)
    y_pred = rf.predict(X_test)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    return precision, recall

In [14]:
corr_with_estimators(100)

(0.7473701821527908, 0.7532467532467533)

In [36]:
list_of_estimators = [10, 20, 30, 50, 100, 150, 200, 250, 300]
precision_recall_list = list(map(lambda x: corr_with_estimators(x), list_of_estimators))

In [37]:
precision = [x[0] for x in precision_recall_list]
recall = [x[1] for x in precision_recall_list]

In [38]:
print(precision)
print(recall)

[0.7251114954546856, 0.7349148493016417, 0.7473701821527908, 0.7538901465506971, 0.7473701821527908, 0.7610055001359349, 0.7610055001359349, 0.7545792843068642, 0.7545792843068642]
[0.7337662337662337, 0.7402597402597403, 0.7532467532467533, 0.7597402597402597, 0.7532467532467533, 0.7662337662337663, 0.7662337662337663, 0.7597402597402597, 0.7597402597402597]


In [39]:
from scipy.stats.stats import pearsonr
correlation, p_value = pearsonr(list_of_estimators, precision)
print("The correlation between estimators and precision is " + str(correlation))

correlation, p_value = pearsonr(list_of_estimators, recall)
print("The correlation between estimators and recall is " + str(correlation))


The correlation between estimators and precision is 0.6738727856361035
The correlation between estimators and recall is 0.6679749434076905


In [40]:
def corr_with_max_depth(max_d):
    rf = RandomForestClassifier(n_estimators=200, max_depth=max_d, random_state=42)
    rf = rf.fit(X_train, y_train)
    rf.score(X_test, y_test)
    y_pred = rf.predict(X_test)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    return precision, recall

In [41]:
list_of_max_depths = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]
precision_recall_list = list(map(lambda x: corr_with_max_depth(x), list_of_max_depths))

In [42]:
precision = [x[0] for x in precision_recall_list]
recall = [x[1] for x in precision_recall_list]

In [43]:
correlation, p_value = pearsonr(list_of_max_depths, precision)
print("The correlation between max_depth and precision is " + str(correlation))

correlation, p_value = pearsonr(list_of_max_depths, recall)
print("The correlation between max_depth and recall is " + str(correlation))


The correlation between max_depth and precision is 0.8279812237900358
The correlation between max_depth and recall is 0.850321470558266


In [44]:
def corr_with_min_samples_split(min_ss):
    rf = RandomForestClassifier(n_estimators=200, min_samples_split=min_ss, random_state=42)
    rf = rf.fit(X_train, y_train)
    rf.score(X_test, y_test)
    y_pred = rf.predict(X_test)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    return precision, recall

In [85]:
list_of_min_samples_split = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50]
precision_recall_list = list(map(lambda x: corr_with_min_samples_split(x), list_of_min_samples_split))

In [46]:
precision = [x[0] for x in precision_recall_list]
recall = [x[1] for x in precision_recall_list]

In [47]:
correlation, p_value = pearsonr(list_of_min_samples_split, precision)
print("The correlation between min_samples_split and precision is " + str(correlation))

correlation, p_value = pearsonr(list_of_min_samples_split, recall)
print("The correlation between min_samples_split and recall is " + str(correlation))


The correlation between min_samples_split and precision is -0.9373147401219183
The correlation between min_samples_split and recall is -0.9277873584477501


In [89]:
def corr_with_min_samples_leaf(min_sl):
    rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=min_sl, random_state=42)
    rf = rf.fit(X_train, y_train)
    rf.score(X_test, y_test)
    y_pred = rf.predict(X_test)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    return precision, recall

In [90]:
list_of_min_samples_leaf = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50]
precision_recall_list = list(map(lambda x: corr_with_min_samples_leaf(x), list_of_min_samples_leaf))

In [91]:
precision = [x[0] for x in precision_recall_list]
recall = [x[1] for x in precision_recall_list]

In [92]:
correlation, p_value = pearsonr(list_of_min_samples_leaf, precision)
print("The correlation between min_samples_leaf and precision is " + str(correlation))

correlation, p_value = pearsonr(list_of_min_samples_leaf, recall)
print("The correlation between min_samples_leaf and recall is " + str(correlation))


The correlation between min_samples_leaf and precision is -0.543867975157251
The correlation between min_samples_leaf and recall is -0.5268862526734289


In [48]:
def corr_with_min_weight_fraction_leaf(min_w):
    rf = RandomForestClassifier(n_estimators=200, min_weight_fraction_leaf=min_w, random_state=42)
    rf = rf.fit(X_train, y_train)
    rf.score(X_test, y_test)
    y_pred = rf.predict(X_test)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    return precision, recall

In [107]:
import warnings
warnings.filterwarnings('ignore')
list_of_min_weight_fraction_leaf = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
precision_recall_list = list(map(lambda x: corr_with_min_weight_fraction_leaf(x), list_of_min_weight_fraction_leaf))

In [51]:
precision = [x[0] for x in precision_recall_list]
recall = [x[1] for x in precision_recall_list]

In [53]:
correlation, p_value = pearsonr(list_of_min_weight_fraction_leaf, precision)
print("The correlation between min_weight_fraction_leaf and precision is " + str(correlation))

correlation, p_value = pearsonr(list_of_min_weight_fraction_leaf, recall)
print("The correlation between min_weight_fraction_leaf and recall is " + str(correlation))

The correlation between min_weight_fraction_leaf and precision is -0.7362889364195605
The correlation between min_weight_fraction_leaf and recall is -0.8988991088231968


In [54]:
def corr_with_max_leaf_nodes(max_leaf):
    rf = RandomForestClassifier(n_estimators=200, max_leaf_nodes=max_leaf, random_state=42)
    rf = rf.fit(X_train, y_train)
    rf.score(X_test, y_test)
    y_pred = rf.predict(X_test)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    return precision, recall

In [55]:
list_of_max_leaf_nodes = [2, 3, 4, 5, 6, 10, 15, 20, 25, 30, 35, 40, 45, 50]
precision_recall_list = list(map(lambda x: corr_with_max_leaf_nodes(x), list_of_max_leaf_nodes))

In [56]:
precision = [x[0] for x in precision_recall_list]
recall = [x[1] for x in precision_recall_list]

In [57]:
correlation, p_value = pearsonr(list_of_max_leaf_nodes, precision)
print("The correlation between max_leaf_nodes and precision is " + str(correlation))

correlation, p_value = pearsonr(list_of_max_leaf_nodes, recall)
print("The correlation between max_leaf_nodes and recall is " + str(correlation))

The correlation between max_leaf_nodes and precision is 0.5000049538344156
The correlation between max_leaf_nodes and recall is 0.6347010628362819


In [58]:
def corr_with_min_impurity_decrease(min_imp):
    rf = RandomForestClassifier(n_estimators=200, min_impurity_decrease=min_imp, random_state=42)
    rf = rf.fit(X_train, y_train)
    rf.score(X_test, y_test)
    y_pred = rf.predict(X_test)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    return precision, recall

In [106]:
import warnings
warnings.filterwarnings('ignore')
list_of_min_impurity_decrease = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
precision_recall_list = list(map(lambda x: corr_with_min_impurity_decrease(x), list_of_min_impurity_decrease))

In [79]:
precision = [x[0] for x in precision_recall_list]
recall = [x[1] for x in precision_recall_list]

In [80]:
correlation, p_value = pearsonr(list_of_min_impurity_decrease, precision)
print("The correlation between min_impurity_decrease and precision is " + str(correlation))

correlation, p_value = pearsonr(list_of_min_impurity_decrease, recall)
print("The correlation between min_impurity_decrease and recall is " + str(correlation))

The correlation between min_impurity_decrease and precision is -0.5
The correlation between min_impurity_decrease and recall is -0.49999999999999994


In [93]:
def corr_with_min_impurity_split(min_imp_spl):
    rf = RandomForestClassifier(n_estimators=200, min_impurity_split=min_imp_spl, random_state=42)
    rf = rf.fit(X_train, y_train)
    rf.score(X_test, y_test)
    y_pred = rf.predict(X_test)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    return precision, recall

In [108]:
import warnings
warnings.filterwarnings('ignore')
list_of_min_impurity_split = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
precision_recall_list = list(map(lambda x: corr_with_min_impurity_split(x), list_of_min_impurity_split))

In [100]:
precision = [x[0] for x in precision_recall_list]
recall = [x[1] for x in precision_recall_list]

In [101]:
correlation, p_value = pearsonr(list_of_min_impurity_split, precision)
print("The correlation between min_impurity_split and precision is " + str(correlation))

correlation, p_value = pearsonr(list_of_min_impurity_split, recall)
print("The correlation between min_impurity_split and recall is " + str(correlation))

The correlation between min_impurity_split and precision is -0.8853585906752075
The correlation between min_impurity_split and recall is -0.9192718348789796


Parameter | Correlation with Precision | Correlation with Recall
-|-|-
estimators | 0.674 | 0.668
max_depth | 0.828 | 0.850
min_samples_split | -0.937 | -0.928
min_samples_leaf | -0.544 | -0.527
min_weight_fraction_leaf | -0.736 | -0.899
max_leaf_nodes | 0.500 | 0.635
min_impurity_decrease | -0.500 | -0.499
min_impurity_split | -0.885 | -0.919




# 2. How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so.

In [102]:
rf = RandomForestClassifier(n_estimators=200, random_state =42)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)
y_pred = rf.predict(X_test)

print(classification_report(y_test, y_pred))
print(precision_score(y_test, y_pred, average='weighted'))
print(recall_score(y_test, y_pred, average='weighted'))

              precision    recall  f1-score   support

           0       0.80      0.86      0.83       100
           1       0.70      0.59      0.64        54

    accuracy                           0.77       154
   macro avg       0.75      0.73      0.73       154
weighted avg       0.76      0.77      0.76       154

0.7610055001359349
0.7662337662337663


In [105]:
rf = RandomForestClassifier(n_estimators=200, bootstrap=False, random_state =42)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)
y_pred = rf.predict(X_test)

print(classification_report(y_test, y_pred))
print(precision_score(y_test, y_pred, average='weighted'))
print(recall_score(y_test, y_pred, average='weighted'))

              precision    recall  f1-score   support

           0       0.79      0.84      0.82       100
           1       0.67      0.59      0.63        54

    accuracy                           0.75       154
   macro avg       0.73      0.72      0.72       154
weighted avg       0.75      0.75      0.75       154

0.7483459936290124
0.7532467532467533


When bootstrap was set to false, model performance decreased.  The precision score went from 0.761 to 0.748 and the recall score went from 0.766 to 0.753.  When bootstrapping is on, it runs the model on multiple random sample sets of the dataset with replacement.  This can remove the effects of outliers because you are only taking a sample of the entire population multiple times. 