# Assignment: Decision Trees and Random Forests, Part 2

There are two notebooks for the assignment on Decision Trees and Random Forests. This is Part 2, which covers classification.

## Classification with Decision Trees and Random Forests

We are going to repeat the process from the "Assignment-TreesAndForests-Part1.ipynb" notebook, this time for *classification* on [scikit-learn's Breast Cancer Wisconsin dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html).
* We have visited this dataset before with decision trees, specifically in the notebook "Week03_NB1_TreeClassification.ipynb".
* In all of the exercises below, I would like you to use all of the feature variables for classification, and to assess every trained model by its accuracy, precision, and recall on the test data.

Perform classification on the breast cancer dataset using one decision tree with scikit-learn's DecisionTreeClassifier without specifically initializing any hyperparameter values.

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

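# Load the dataset and use all 30 features (target: 0 = malignant, 1 = benign)
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Hold out 30% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Decision tree with default hyperparameters (random_state only fixes the seed)
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

y_pred = dt_classifier.predict(X_test)

# Evaluate on the test set; precision and recall treat label 1 as the positive class
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)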

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")

Accuracy: 0.9415
Precision: 0.9712
Recall: 0.9352
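
The precision and recall above use scikit-learn's default positive class, label 1, which in this dataset is the benign class. A minimal sketch of how one might also inspect the confusion matrix and the metrics for the malignant class (label 0), assuming the same split and seed as the cell above. The names `tree`, `cm`, `precision_malignant`, and `recall_malignant` are illustrative, and no output is shown because this was not part of the original run:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Rows are true labels, columns are predicted labels, ordered [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(list(cancer.target_names))  # class names in label order
print(cm)

# Metrics with malignant (label 0) treated as the positive class.
precision_malignant = precision_score(y_test, y_pred, pos_label=0)
recall_malignant = recall_score(y_test, y_pred, pos_label=0)
print(f"Malignant precision: {precision_malignant:.4f}")
print(f"Malignant recall: {recall_malignant:.4f}")
```

Which class counts as "positive" matters when reading recall: a missed malignant case appears as a false positive under the default labeling, but as a false negative when `pos_label=0`.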


Perform classification using another decision tree for which you impose one or more constraints with regularization hyperparameter(s).

In [2]:
# Regularize the tree by limiting its depth to 5 levels
dt_classifier_regularized = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_classifier_regularized.fit(X_train, y_train)

y_pred_regularized = dt_classifier_regularized.predict(X_test)

accuracy_regularized = accuracy_score(y_test, y_pred_regularized)
precision_regularized = precision_score(y_test, y_pred_regularized)
recall_regularized = recall_score(y_test, y_pred_regularized)

print(f"Regularized Decision Tree Accuracy: {accuracy_regularized:.4f}")
print(f"Regularized Decision Tree Precision: {precision_regularized:.4f}")
print(f"Regularized Decision Tree Recall: {recall_regularized:.4f}")

Regularized Decision Tree Accuracy: 0.9532
Regularized Decision Tree Precision: 0.9630
Regularized Decision Tree Recall: 0.9630
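
To see what the `max_depth` constraint changes in the tree itself, and not only in the scores, one option is to compare the size of the two fitted trees and their training versus test accuracy. This is a sketch under the assumption that the split and seed match the cells above. `full_tree` and `shallow_tree` are illustrative names, and the printed values are not reproduced here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(
    X_train, y_train
)

# A large gap between training and test accuracy suggests overfitting.
for name, model in [("default", full_tree), ("max_depth=5", shallow_tree)]:
    print(
        f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
        f"train acc={model.score(X_train, y_train):.4f}, "
        f"test acc={model.score(X_test, y_test):.4f}"
    )
```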


Perform classification using a random forest model.  When doing so:
* Use the same regularization hyperparameters as well as a hyperparameter for the number of trees
* Use grid search cross-validation to scan across a grid of hyperparameter values and find an optimal combination

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


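# Grid of regularization hyperparameters plus the number of trees
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_classifier = RandomForestClassifier(random_state=42)

# 5-fold cross-validated grid search, selecting by mean accuracy
grid_search = GridSearchCV(rf_classifier, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, y_train)

# Best combination, refit on the full training set
best_rf_classifier = grid_search.best_estimator_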

y_pred_rf = best_rf_classifier.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)

print(f"Best Random Forest Parameters: {grid_search.best_params_}")
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")
print(f"Random Forest Precision: {precision_rf:.4f}")
print(f"Random Forest Recall: {recall_rf:.4f}")

Best Random Forest Parameters: {'max_depth': 7, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}
Random Forest Accuracy: 0.9708
Random Forest Precision: 0.9640
Random Forest Recall: 0.9907
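
The grid search keeps the cross-validated score of every candidate, which helps in judging how much the chosen combination actually stands out. The sketch below ranks the candidates using `cv_results_`. It assumes a deliberately small grid (`small_grid`, a hypothetical stand-in for the full `param_grid` above) so that it runs quickly, so its ranking will not match the result printed above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Assumption: a reduced grid, chosen only to keep the sketch fast.
small_grid = {"n_estimators": [25, 50], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

results = search.cv_results_
# Sort candidates by rank (1 = best mean cross-validated accuracy).
order = results["rank_test_score"].argsort()
for i in order:
    print(
        f"rank {results['rank_test_score'][i]}: {results['params'][i]} "
        f"mean={results['mean_test_score'][i]:.4f} "
        f"std={results['std_test_score'][i]:.4f}"
    )

print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")
```

If the mean scores of the top candidates differ by less than their standard deviations, the "optimal" combination is probably not meaningfully better than its neighbors.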


Comment on the differences in your model trees & forests and evaluation metrics for all three models.

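Accuracy improved at each step: 0.9415 for the default decision tree, 0.9532 for the tree limited to `max_depth=5`, and 0.9708 for the tuned random forest. Recall improved in the same order (0.9352, 0.9630, 0.9907). Precision did not: the default tree scored highest (0.9712), ahead of the regularized tree (0.9630) and the random forest (0.9640). The gain from limiting the depth is consistent with the unconstrained tree overfitting the training data. The random forest, with 200 trees and the hyperparameters chosen by grid search (`max_depth=7`, `min_samples_leaf=2`, `min_samples_split=2`), gave the best accuracy and recall overall.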

## Submit

* Save your work (File -> Save Notebook)
* Verify that your notebook runs without error by restarting the kernel (or closing and opening the notebook) and selecting the top menu item for Run -> Run All Cells.  It should run successfully all the way to the bottom.
* Save your notebook again.  Keep all the output visible when saving the final version.
* Submit the file through the Canvas Assignment.