# Random Forests, Cross-Validation, and Grid Searching

In this lab, you will implement a Random Forest classifier using Scikit-Learn, evaluate its performance with k-fold cross-validation, and tune its hyperparameters using `GridSearchCV`. Working through all three steps covers the full cycle of model training, evaluation, and tuning.

## Part 1: Implementing Random Forests

In this section, you will:

- **Load the Dataset**: Begin by importing the necessary libraries and loading a dataset (like the Iris dataset).
- **Data Splitting**: Split the dataset into training and testing sets to ensure that you can evaluate model performance on unseen data.
- **Model Initialization**: Initialize the Random Forest classifier with a small number of trees and a fixed random state, leaving the other hyperparameters at their defaults.
- **Model Training**: Train the Random Forest model using the training data and evaluate its performance on the test set.

**Step 1: Import Libraries**

Begin by importing the necessary libraries. You will need pandas for data manipulation and scikit-learn for loading the dataset, splitting it, building the model, and computing metrics.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
```

**Step 2: Load the Dataset**

For this exercise, we will use the Iris dataset, which is a classic dataset for classification tasks. It contains features related to different species of iris flowers.

```python
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
```

**Step 3: Split the Data**

Next, split the dataset into training and testing sets. This is crucial to evaluate the model's performance on unseen data. We will use 80% of the data for training and 20% for testing. Use a `random_state` of `42`.

If you can't remember how to split your data, take a look back at our previous labs!

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Import libraries, load the data, and make the splits!
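
If you get stuck, here is a minimal sketch of one way to do the split described in Step 3. It assumes the default Iris loader and the 80/20 split with `random_state=42` mentioned above; try the exercise on your own before reading it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the features and labels as NumPy arrays
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Quick sanity check on the sizes of the two sets
print(X_train.shape, X_test.shape)
```

The Iris dataset has 150 rows and 4 features, so the training set should have 120 rows and the test set 30.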

**Step 4: Initialize the Random Forest Classifier**

Now, initialize the Random Forest Classifier. You can set the number of trees (`n_estimators`) and the random state for reproducibility.

```python
# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=5, random_state=42)
```
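
Any hyperparameter you do not set keeps its library default. As a small sketch, you can inspect those defaults with `get_params()`; the name `default_rf` is only illustrative, and the exact defaults may depend on your installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# A classifier created with no arguments uses the library defaults
default_rf = RandomForestClassifier()
params = default_rf.get_params()

# Print the three hyperparameters tuned later in this lab
for name in ("n_estimators", "max_depth", "min_samples_split"):
    print(name, "=", params[name])
```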

**Step 5: Train the Model**

Fit the model to the training data. This step involves training the Random Forest on the features and corresponding labels.

```python
# Train the model
rf_classifier.fit(X_train, y_train)
```

**Step 6: Make Predictions**

After training the model, use it to make predictions on the test set. This will help you understand how well the model performs on unseen data.

```python
# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)
```

**Step 7: Evaluate the Model**

Finally, evaluate the model’s performance using accuracy and a classification report, which provides a detailed breakdown of the model's performance.

```python
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Per-class precision, recall, and F1
print(classification_report(y_test, y_pred))
```

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Initialize the `RandomForestClassifier`, train it, make predictions, and evaluate the model! Be sure to use more than just `accuracy_score` for your evaluations. Look back at previous labs if you need a refresher!

## Part 2: Implementing k-Fold Cross-Validation

In this section, you will:

- **Define K-Fold**: Set up a k-fold cross-validation strategy to evaluate the model's performance more reliably.
- **Model Evaluation**: For each fold, train the model on a subset of the data and evaluate it on the remaining data.
- **Performance Metrics**: Collect and compute the average cross-validation scores to assess model stability and robustness.


**Step 1: Import Additional Libraries**

You will need the `KFold` class and the `cross_val_score` function from Scikit-Learn to perform cross-validation.

```python
from sklearn.model_selection import cross_val_score, KFold
```

**Step 2: Initialize the Random Forest Classifier Again**

Create an instance of the Random Forest Classifier for use in cross-validation.

```python
# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=5, random_state=42)
```

**Step 3: Perform Cross-Validation**

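`KFold` is a cross-validation strategy that splits the dataset into *k* folds (or subsets). By default the folds are consecutive blocks of rows; passing `shuffle=True` shuffles the rows before they are split.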

It provides an iterator that can be used to generate training and testing splits for model evaluation. You need to manually loop over these folds to train and evaluate your model.

By using `KFold`, you have more control over the training and testing process, allowing you to customize what happens in each fold (e.g., applying different preprocessing steps).

```python
# Set up K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Collect scores from each fold
cv_scores = []
for train_index, test_index in kf.split(X_train):
    X_fold_train, X_fold_test = X_train[train_index], X_train[test_index]
    y_fold_train, y_fold_test = y_train[train_index], y_train[test_index]
    
    rf_classifier.fit(X_fold_train, y_fold_train)
    y_fold_pred = rf_classifier.predict(X_fold_test)
    score = accuracy_score(y_fold_test, y_fold_pred)
    cv_scores.append(score)

# Print the per-fold scores and their average
print("K-Fold Cross-Validation Scores:", cv_scores)
print(f"Mean Cross-Validation Accuracy: {sum(cv_scores) / len(cv_scores):.2f}")
```
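
To see what the `KFold` iterator actually yields, it can help to run it on a tiny array. The sketch below uses ten made-up samples (`toy_X` is purely illustrative) and no shuffling, so each test fold is a consecutive block of two row indices.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, used only to show which row indices land in each fold
toy_X = np.arange(10).reshape(-1, 1)

# Without shuffling, each test fold is a consecutive block of indices
toy_kf = KFold(n_splits=5)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in toy_kf.split(toy_X)
]

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {i}: train={train_idx} test={test_idx}")
```

Every row appears in exactly one test fold, and each model is trained on the other eight rows.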

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Perform k-Fold Cross-Validation on your Random Forest. Try editing the above loop to keep track of another performance metric (e.g., precision, recall, F1).

But guess what!? There's a method that folds together these steps for us (the splitting and the re-training for each fold).

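`cross_val_score` offers a simpler interface for evaluating models with cross-validation, automatically handling the splitting and scoring process. However, it gives you less control over what happens in each fold. Note that when `cv` is an integer and the estimator is a classifier, it uses stratified folds without shuffling, so its scores will not exactly match those from the shuffled `KFold` loop above.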

```python
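# Perform 5-fold cross-validation with 'cross_val_score'
# (use the training data, as in the manual loop above)
cv_scores = cross_val_score(rf_classifier, X_train, y_train, cv=5)
print("Cross-Validation Scores:", cv_scores)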
```

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Try out this simpler function. What do you notice about its output?
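
`cross_val_score` is a little more flexible than it first appears: `cv` can be a splitter object instead of an integer, and `scoring` selects the metric. The sketch below assumes the built-in `"f1_macro"` scorer and reloads Iris so that it runs on its own; the `_demo` names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)
rf_demo = RandomForestClassifier(n_estimators=5, random_state=42)

# Pass a KFold object to control shuffling, and a scorer name to change the metric
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=kf_demo, scoring="f1_macro")

print("Macro-F1 per fold:", f1_scores)
print(f"Mean macro-F1: {f1_scores.mean():.2f}")
```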

## Part 3: Implementing a Grid Search

In this section, you will:

- **Hyperparameter Tuning**: Define a grid of hyperparameters to search over, optimizing the Random Forest model for better performance.
- **Grid Search Setup**: Use GridSearchCV to automate the process of hyperparameter tuning with cross-validation.
- **Model Selection**: Identify the best hyperparameters based on cross-validation scores and make predictions using the optimized model.

**Step 1: Define a grid of hyperparameters to search over.**

Three hyperparameters are commonly tuned for a Random Forest classifier:

1. `n_estimators`
  - **Definition**: This hyperparameter specifies the number of decision trees in the Random Forest model.
  - **Impact**: Increasing the number of estimators generally improves the model's performance and stability, as it allows for more diverse trees to contribute to the final prediction. However, it also increases computational cost and may lead to diminishing returns after a certain point.
  - **Typical Values**: Common choices are 50, 100, and 200. The optimal value often depends on the dataset size and complexity.
2. `max_depth`
  - **Definition**: This hyperparameter controls the maximum depth of each decision tree in the forest.
  - **Impact**: A deeper tree can capture more complex patterns in the data but is also more likely to overfit, especially with limited training data. Setting `max_depth` to `None` allows the trees to grow until all leaves are pure or contain fewer than the minimum samples required to split, which may lead to overfitting.
  - **Typical Values**: You might explore values such as None, 10, 20, and 30. A lower value can help reduce overfitting.
3. `min_samples_split`
  - **Definition**: This hyperparameter defines the minimum number of samples required to split an internal node of a tree.
  - **Impact**: Increasing this value can lead to more generalized trees, as it prevents the model from creating splits based on very few samples, thus reducing the likelihood of overfitting. A smaller value allows for more splits, which can help capture more complexity in the data but may also result in overfitting.
  - **Typical Values**: Common choices are 2, 5, and 10. The best value can vary based on the dataset size and characteristics.
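
As a rough illustration of what `max_depth` does, the sketch below fits two small forests on Iris, one uncapped and one capped at depth 2, and compares how deep their trees actually grow. The variable names and the choice of 10 trees are only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# The same forest, with and without a cap on tree depth
unlimited = RandomForestClassifier(n_estimators=10, max_depth=None, random_state=42)
capped = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)
unlimited.fit(X_demo, y_demo)
capped.fit(X_demo, y_demo)

# Each fitted tree in the forest reports its own depth
unlimited_depths = [tree.get_depth() for tree in unlimited.estimators_]
capped_depths = [tree.get_depth() for tree in capped.estimators_]

print("Depths with max_depth=None:", unlimited_depths)
print("Depths with max_depth=2:   ", capped_depths)
```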

We can specify values of these hyperparameters that we want to test and format them in a "grid". All combinations of these values will then be tested in cross-validation.

```python
# Import our new function
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
```
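
Because every combination is tested, grids grow quickly. As a quick check, scikit-learn's `ParameterGrid` can enumerate the combinations so you can count them before launching a search; `n_folds = 5` here is an assumption matching the 5-fold setup used earlier in this lab.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# 3 * 4 * 3 combinations of hyperparameter values
n_combinations = len(ParameterGrid(param_grid))

# Each combination is fitted once per fold during cross-validation
n_folds = 5
n_fits = n_combinations * n_folds

print("Combinations:", n_combinations)
print("Model fits during the search:", n_fits)
```

This grid has 36 combinations, so a 5-fold search trains 180 models before refitting the best one.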

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Create this grid and then create one of your own!

**Step 2: Perform Grid Search with Cross-Validation**

Use GridSearchCV to perform hyperparameter tuning. This will automatically use cross-validation to evaluate the combinations of hyperparameters.

```python
# Set up Grid Search with cross-validation
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=kf, n_jobs=-1)
grid_search.fit(X_train, y_train)
```

**Step 3: Get the Best Parameters**

Retrieve the best hyperparameters from the grid search.

```python
# Get the best parameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)
```
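
Beyond the single best combination, a fitted `GridSearchCV` stores the score of every combination in `cv_results_`. The sketch below loads those results into a pandas DataFrame to compare them; it uses a deliberately small grid (`small_grid`, an illustrative choice) so that it runs quickly on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A deliberately small grid so the example runs quickly
small_grid = {"n_estimators": [10, 50], "max_depth": [None, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(X_tr, y_tr)

# One row per hyperparameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
print(f"Best mean CV accuracy: {search.best_score_:.2f}")
```

Looking at the full table shows whether the best combination clearly wins or whether several settings perform about equally well.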

**Step 4: Make Predictions with the Best Model**

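Use the best estimator to make predictions on the test set. By default (`refit=True`), `GridSearchCV` refits this estimator on the full training set using the best hyperparameters.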

```python
# Make predictions on the test set using the best model
best_rf_classifier = grid_search.best_estimator_
y_pred = best_rf_classifier.predict(X_test)
```

**Step 5: Evaluate the Model**

Finally, evaluate the performance of the model with the best hyperparameters.

```python
# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred))
```

##### <font color='red'>**TRY IT**</font> &#x1f9e0;: Perform a Grid Search both for the example grid above and for the one you made on your own. Which set of hyperparameters produced the best results?