# Lab 4 - Math 178, Winter 2025

You are encouraged to work in groups of up to 3 total students, but each student should make a submission on Canvas. (It's fine for everyone in the group to submit the same link.)

Put the full names of everyone in your group (even if you're working alone) here. This makes grading easier.

**Names**:

This notebook gives you practice with tree and forest models. We will continue using the same dataset from last week.

* Dataset: `world_cup22.csv`
* Goal: Can we use decision trees and random forests to predict the number of goals scored in World Cup 2022 matches?

### Introduction

In this lab, we will explore two powerful machine learning algorithms: Decision Trees and Random Forests. These algorithms are widely used for both classification and regression tasks due to their simplicity and interpretability.

#### Decision Trees

A Decision Tree is a flowchart-like structure where each internal node represents a decision based on a feature, each branch represents the outcome of the decision, and each leaf node represents a class label or a continuous value. The paths from the root to the leaf represent classification rules.
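
To make the flowchart picture concrete, a tiny tree can be written by hand as nested `if`/`else` statements. The sketch below is purely illustrative: the feature names `attempts` and `possession`, the thresholds, and the labels are hypothetical and are not taken from `world_cup22.csv`.

```python
def toy_tree_predict(attempts, possession):
    """Hand-written two-level decision tree (hypothetical features and thresholds)."""
    # Root node: the first decision looks at a single feature
    if attempts <= 10:
        # Leaf node: no further questions, so return the label
        return "low scoring"
    # Internal node: a second decision on another feature
    if possession <= 50:
        return "low scoring"
    return "high scoring"


# Each call follows one root-to-leaf path, i.e. one classification rule
print(toy_tree_predict(attempts=8, possession=60))    # low scoring
print(toy_tree_predict(attempts=15, possession=55))   # high scoring
```

A fitted tree has the same shape; the difference is that the algorithm chooses the features and thresholds from the training data instead of having them written by hand.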

#### Random Forests

A Random Forest is an ensemble method that builds multiple decision trees and merges them together to get a more accurate and stable prediction. It reduces overfitting by averaging the results of multiple trees, which are trained on different subsets of the data.
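
The idea of training trees on different subsets and combining their predictions can be sketched by hand. The example below is a simplified illustration on synthetic stand-in data (not the World Cup dataset): it trains 25 shallow trees on bootstrap samples and takes a majority vote. This is plain bagging; scikit-learn's `RandomForestClassifier` additionally considers a random subset of features at each split. The number of trees, the depth, and the data sizes are arbitrary choices made for the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Every tree votes on every row; the ensemble predicts the majority class
votes = np.stack([t.predict(X_demo) for t in trees])      # shape (25, 200)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)   # labels are 0 or 1

print("ensemble accuracy on the training data:", (ensemble_pred == y_demo).mean())
```

Because each tree sees a slightly different sample, their individual errors differ, and the vote tends to smooth them out.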

In this lab, we will:

1. Train a Decision Tree classifier on the `world_cup22.csv` dataset.
2. Visualize the decision tree to understand the decision-making process.
3. Train a Random Forest classifier and analyze the feature importances.
4. Compare the performance and interpretability of Decision Trees and Random Forests.

Let's get started!

### Prepare the Data

1. **Load the Dataset**:
    - Ensure `world_cup22.csv` is in your working directory.
    - Load the dataset into a DataFrame.

    ```python
    import pandas as pd

    df = pd.read_csv("world_cup22.csv")
    ```

2. **Inspect the Data**:
    - Display the first five rows to understand its structure.

    ```python
    df.head()
    ```

3. **Split the Data**:
    - Separate the features and the target variable (`number of goals`).

    ```python
    X = df.drop('number of goals', axis=1)
    y = df['number of goals']
    ```

4. **Train-Test Split**:
    - Split the data into training and testing sets.

    ```python
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    ```

By following these steps, you will have a clean and prepared dataset ready for training machine learning models.
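
As a quick sanity check on what `train_test_split` returns, the sketch below runs the same call on a small made-up DataFrame. The column names `feature_a`, `feature_b`, and `target` are hypothetical placeholders; only the splitting behavior carries over to the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows standing in for the real dataset
toy = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(10, 20),
    "target": [0, 1] * 5,
})

X_toy = toy.drop("target", axis=1)
y_toy = toy["target"]

# test_size=0.2 holds out 20% of the rows (2 of 10) for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

The features and the target are shuffled together, so each row of `X_tr` still lines up with the matching entry of `y_tr`.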


### Draw the Decision Tree

To draw the decision tree, follow these steps:



1. **Import Required Libraries**:
    Ensure you have imported the necessary libraries for plotting the decision tree.
    ```python
    from sklearn.tree import plot_tree
    import matplotlib.pyplot as plt
    ```


2. **Create and Train the Decision Tree Classifier**:
    Create and train the `DecisionTreeClassifier`.

3. **Plot the Decision Tree**:
    Use the `plot_tree` function to visualize the decision tree.
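
For reference, here is a minimal sketch of the fit-then-plot pattern on scikit-learn's built-in iris dataset, used as a stand-in so that it runs without `world_cup22.csv`. The `max_depth=2` and `figsize` values are arbitrary choices to keep the picture small; your own cell should use the training data prepared above.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in data: the iris dataset that ships with scikit-learn
iris = load_iris()

# A shallow tree is easier to read when plotted
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(
    clf,
    feature_names=iris.feature_names,        # labels used in the decision rules
    class_names=list(iris.target_names),     # labels for the majority class
    filled=True,                             # color nodes by majority class
    ax=ax,
)
plt.show()
```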


The plot above is the logic tree built by the decision tree algorithm. To classify a new observation, we start at the <i>root node</i> at the top. If the observation satisfies the logic statement at the top, we go left and it is classified as $0$; otherwise we go right and it is classified as $1$. The two <i>children</i> of the root node are known as <i>leaf nodes</i> or <i>terminal nodes</i> because they have no children of their own, so we simply predict the majority class contained in that node.

This is essentially the decision rule we came up with (which is the objectively correct one by the way), so in this example the decision tree did well.

Looking at the plot above, we notice several statistics in each node:

- `samples`: the number of samples in the node. In this case, it is the number of matches that reach that decision point.
- `gini`: the Gini impurity of the node (more on this in a moment). It measures how mixed the target values in the node are: $0$ means every sample in the node has the same target value, and larger values mean a more even mix.
- `value`: the breakdown of the number of samples of each target value in the node. For example, `value = [50, 30, 20, 10, 5, 1]` means the node contains 50 matches with 0 goals, 30 with 1 goal, 20 with 2 goals, 10 with 3 goals, 5 with 4 goals, and 1 with 5 goals.
- A decision rule: the rule used for the next split. Samples for which the rule evaluates to True go to the left child; samples for which it evaluates to False go to the right child.
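
The `gini` number can be reproduced by hand from a node's `value` list. With $p_k$ denoting the fraction of the node's samples that have target value $k$, the Gini impurity is $G = 1 - \sum_k p_k^2$. The short sketch below applies this formula to the example counts from the list above; the helper name `gini_impurity` is our own and is not a scikit-learn function.

```python
def gini_impurity(counts):
    """Gini impurity 1 - sum(p_k^2) computed from per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


value = [50, 30, 20, 10, 5, 1]          # example counts from the text
print(round(gini_impurity(value), 3))   # 0.708

print(gini_impurity([40, 0]))    # 0.0  (pure node: only one target value)
print(gini_impurity([20, 20]))   # 0.5  (two values, evenly mixed)
```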



### Write here

Write down what you observed here. Is the decision tree able to identify the importance of the categorical variable?




### Find the Most Important Features

#### Forests are an Ensemble of Trees

The *random forest* model is created by building many different decision trees. These trees are made "different" through a variety of random perturbations (more on this later in the notebook). Random forests are thus an ensemble of decision tree models.

We will demonstrate the advantages of this ensemble with the synthetic dataset we used in our decision tree classification notebook. In the following code, we will use the `RandomForestClassifier` to identify the most important features in our dataset. The classifier will be trained on the data, and then we will extract and plot the feature importances to see which features have the most influence on the target variable.




1. **Import the RandomForestClassifier**:
    ```python
    from sklearn.ensemble import RandomForestClassifier
    ```




2. **Create and Train the RandomForestClassifier**:
    ```python
    forest_clf = RandomForestClassifier(n_estimators=1000, max_depth=3)
    forest_clf.fit(df.iloc[???], df.iloc[???])
    ```



3. **Find the Most Important Features**:
    ```python
    import numpy as np

    importances = forest_clf.???
    # Sort the indices of the importances in descending order
    indices = np.argsort(importances)[::-1]
    ```


4. **Print the Feature Ranking**:





5. **Final Step: Plot the Feature Importances**:

    ```python
    plt.figure(figsize=(15, 5))
    plt.title("Feature importances")
    ???
    plt.show()
    ```
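
If the `np.argsort(...)[::-1]` idiom from step 3 is unfamiliar, the sketch below shows what it does on a small made-up array. The feature names and importance values here are hypothetical; in the lab they come from your fitted forest and the columns of your DataFrame.

```python
import numpy as np

feature_names = ["feature_a", "feature_b", "feature_c"]   # hypothetical names
importances = np.array([0.2, 0.1, 0.7])                   # made-up values

# argsort returns the indices that sort ascending; [::-1] reverses to descending
indices = np.argsort(importances)[::-1]

# Print a ranking from most to least important
for rank, i in enumerate(indices, start=1):
    print(f"{rank}. {feature_names[i]} ({importances[i]:.2f})")
```

Indexing with `indices` reorders the importances and the names the same way, so bars and labels stay matched when plotted.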
