# Assignment 4 - Trees

In this assignment, you will work with decision trees and two of their ensemble counterparts, random forests and XGBoost.

You'll use the breast cancer dataset to see how these algorithms are applied to a classification task, and you'll also explore feature importance and hyperparameter tuning.

## Tasks:

### 1. Data Loading and Exploration (1 point)

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Part 1: Data Loading and Exploration
from sklearn.datasets import load_breast_cancer

# Load the dataset
"""
Question: Load Breast Cancer Dataset (1 point)
Using Scikit-learn's datasets module, load the breast cancer dataset.

Hints:

- Import the necessary function from Scikit-learn's datasets module.
- Assign the feature data to a variable named `X` and target labels to a variable named `y`.
- Remember to utilize the `load_breast_cancer()` function.
"""

data = 
X = 
y = 
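
The loading pattern described in the hints can be sketched on a different dataset. The example below uses `load_wine()` rather than the breast cancer data, and the names `wine`, `X_wine`, and `y_wine` are illustrative assumptions, not part of the assignment. It shows that the loader returns a dict-like `Bunch` whose `data`, `target`, and `feature_names` attributes hold the pieces you need.

```python
from sklearn.datasets import load_wine

# load_wine() returns a Bunch: a dict-like object that also supports attribute access
wine = load_wine()
X_wine = wine.data      # feature matrix, shape (n_samples, n_features)
y_wine = wine.target    # integer class labels, shape (n_samples,)

print(X_wine.shape, y_wine.shape)
print(wine.feature_names[:3])   # names of the first three feature columns
print(wine.target_names)        # human-readable class names
```

The other loaders in `sklearn.datasets` follow the same pattern, so the same attributes should be available on the object returned by `load_breast_cancer()`.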

### 2. Data Preprocessing (1 point)

In [None]:
# Part 2: Data Preprocessing
from sklearn.model_selection import train_test_split

"""
Question: Dataset Preprocessing (1 point)
Using the loaded breast cancer dataset, divide it into training and testing sets.

Hints:

- Use the train_test_split function.
- Remember to set test_size and random_state.
"""

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = 
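
To see what `test_size` and `random_state` do, here is a minimal sketch on synthetic data. The arrays `X_toy` and `y_toy`, the 80/20 split, and the use of `stratify` are assumptions chosen for illustration; the assignment does not prescribe them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 4 features, balanced binary labels (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array([0, 1] * 50)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle reproducible;
# stratify keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))   # class counts in each subset
```

Fixing `random_state` matters for grading and debugging: without it, every run produces a different split and therefore slightly different accuracies.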

### 3. Decision Trees (4 points)

In [None]:
# Part 3: Decision Trees
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score

"""
Question: Decision Trees Implementation (4 points)
Using the training data, implement a decision tree classifier and evaluate its accuracy on the testing data.

Hints:

- Initialize a DecisionTreeClassifier.
- Train the classifier using the fit method on the training data.
- Predict the labels of the testing data.
- Calculate and print the accuracy of the classifier.
- Visualize the decision tree using the plot_tree function. Make sure to use the feature_names from the data for clarity.
"""

# Decision Trees
dt = 
# Fit the decision tree dt on the training data
y_pred = 
accuracy = 
print(f"Decision Tree Accuracy: {accuracy*100:.2f}%")
plt.figure(figsize=(20,10))
plot_tree(dt, filled= , feature_names=)
plt.show()
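
The fit, predict, and score steps listed in the hints can be sketched on the wine dataset. The variable names, the `max_depth=3` limit, and the 75/25 split are assumptions for illustration only. Instead of `plot_tree`, this sketch prints the learned rules with `export_text`, a text-based alternative that needs no figure.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=0
)

# A shallow tree keeps the printed rules short; random_state fixes tie-breaking between splits
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_tr, y_tr)

# accuracy_score takes the true labels first, then the predictions
tree_acc = accuracy_score(y_te, tree.predict(X_te))
print(f"accuracy: {tree_acc:.3f}, depth: {tree.get_depth()}")

# Text rendering of the split rules, labeled with the dataset's feature names
print(export_text(tree, feature_names=list(wine.feature_names)))
```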

### 4. Random Forest (8 points)

In [None]:
# Part 4: Random Forest
from sklearn.ensemble import RandomForestClassifier

"""
Question: Random Forest Implementation (4 points)
Using the training data, implement a random forest classifier and evaluate its accuracy on the testing data.

Hints:

- Initialize a RandomForestClassifier.
- Train the classifier using the fit method on the training data.
- Predict the labels of the testing data.
- Calculate and print the accuracy of the classifier.
"""


rf = 
# fit the random forest rf
y_pred = 
accuracy = 
print(f"Random Forest Accuracy: {accuracy*100:.2f}%")

# Feature importances

"""
Question: Feature Importance Analysis (4 points)
Using the trained Random Forest model, rank the features by importance and list them in descending order.

Hints:

- Use the feature_importances_ attribute of the trained Random Forest model to get the importance of each feature.
- Sort the features based on their importance in descending order.
- Print the ranked features along with their importance values.
"""

# See the RandomForestClassifier documentation in scikit-learn (feature_importances_ attribute)
importances = 
# See https://numpy.org/doc/stable/reference/generated/numpy.argsort.html
indices = 
print("Feature ranking:")
for f in range(X.shape[1]):
    print(f"{f+1}. {data.feature_names[indices[f]]} ({importances[indices[f]]:.4f})")
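
The ranking step relies on the fact that `np.argsort` sorts in ascending order, so the index array has to be reversed to get a descending ranking. A minimal sketch on the wine dataset follows; the names `forest`, `scores`, and `order` and the choice of 50 trees are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(wine.data, wine.target)

# One non-negative score per feature; the scores are normalized to sum to 1
scores = forest.feature_importances_

# argsort returns indices that sort ascending, so reverse them for a descending ranking
order = np.argsort(scores)[::-1]

# Print the five highest-ranked features
for rank, idx in enumerate(order[:5], start=1):
    print(f"{rank}. {wine.feature_names[idx]} ({scores[idx]:.4f})")
```

These impurity-based importances are computed from the training data, so they describe what the forest used, not necessarily what generalizes best.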

### 5. XGBoost (4 points)

In [None]:
# Part 5: XGBoost
import xgboost as xgb

"""
Question: Implementing XGBoost (4 points)
Using the XGBoost library, train a classifier on the training data and evaluate its accuracy on the test set.

Hints:

- Initialize the XGBoost classifier using xgb.XGBClassifier().
- Fit the classifier on the training data.
- Predict the labels for the test set using the trained model.
- Evaluate and print the accuracy of the predictions against the true test labels.
"""

clf = 
# Fit clf on the training data
y_pred = 
accuracy = 
print(f"XGBoost Accuracy: {accuracy*100:.2f}%")

### 6. Hyperparameter Tuning (2 points)

In [None]:
# Part 6: Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

"""
Question: Hyperparameter Tuning for Random Forest (2 points)
Using the provided param_grid for hyperparameters, apply Grid Search to find the optimal hyperparameters for a Random Forest classifier.

Hints:

- Use GridSearchCV with the given parameter grid param_grid and 5-fold cross-validation (cv=5).
- Fit the GridSearch on the training data.
- After finding the best parameters, retrieve the best estimator using the best_estimator_ attribute.
- Predict the labels for the test set using the optimized Random Forest model.
- Evaluate and print the accuracy of the predictions against the true test labels.
"""

# In this assignment, only the Random Forest is tuned
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# define GridSearchCV with cv=5
grid_search = 
grid_search.fit(X_train, y_train)
# find best estimator
best_rf = 
y_pred = best_rf.predict(X_test)
print(f"Optimized Random Forest Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")
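
The grid search workflow can be sketched on a much smaller problem. The example below tunes a single decision tree on the wine dataset with a 6-combination grid and 3-fold cross-validation; the grid `small_grid`, the fold count, and all variable names are assumptions chosen so the example runs quickly, and they differ from the assignment's settings.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=0, stratify=wine.target
)

# 3 x 2 = 6 parameter combinations, each scored with 3-fold cross-validation
small_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), small_grid, cv=3, scoring="accuracy"
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"mean CV accuracy: {search.best_score_:.3f}")

# With the default refit=True, best_estimator_ is retrained on all of X_tr
best_tree = search.best_estimator_
test_acc = best_tree.score(X_te, y_te)
print(f"test accuracy: {test_acc:.3f}")
```

For comparison, the assignment's `param_grid` contains 4 x 4 x 3 x 3 = 144 combinations, which with 5 folds means 720 model fits, so expect it to take noticeably longer. Passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel on all available cores.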