# DS 3000 - Lab 10: Classification Trees

**Student Name**: [Julia Ouritskaya]

**Date**: [11/10/2023]

### Submission Instructions
<div class="alert alert-block alert-success">
In this lab you'll work with the [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) and train your first classification tree. The dataset is already loaded in your environment. Complete the questions in the lab and submit this `ipynb` file with your solution.
</div>

`Note:` The `ipynb` format stores the outputs from the last time you ran the notebook, so when you open a notebook it shows the figures and outputs from that run. To ensure that your submitted `ipynb` file reflects your latest code, run `Kernel > Restart & Run All` just before uploading the file to Gradescope.

<div class="alert alert-block alert-danger">
Please do not delete the cells that are provided nor add any extra cells. Ensure that you write your code in the given cells where indicated. <br>
<strong>Do not delete any empty cells.</strong><br>
</div>


In [1]:
# Import any Libraries
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import math
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Question 1: Load the data (0 pts)
In this question you will load the iris data (this has already been done for you). DO NOT change the names of the input `X` and the labels `y`.

In [2]:
#load the data from sklearn.datasets
data     = load_iris()

#divide the data into the input 'X' and the labels 'y'
X        = data['data'] #the observations
y        = data['target'] #the label


In [3]:
#DO NOT DELETE THIS CELL


### Question 2: partition the data (1 pt)

Complete the function `partition_data(...)`. The function takes as input `X`, `y`, and `seed`. Ensure that you perform the following steps inside the function:
- Split the data (i.e. X and y) into 80% train and 20% test.
- Ensure that your partitions are **stratified** and **reproducible**.
- Return the resulting features for X_train and X_test; and the labels for y_train and y_test.


In [4]:
SEED = 7

In [5]:

def partition_data(X: np.ndarray, y: np.ndarray, seed: int) -> np.ndarray:
    """
    Input numpy arrays for X and y.

    Parameters:
    - X (np.ndarray): a numpy array that contains the explanatory variables
    - y (np.ndarray): a numpy array that contains the target variable
    - seed (int): the random state for reproducibility

    Returns:
    - np.ndarray: X_train, X_test, y_train, y_test #Ensure that you return the arrays in this order
    """

    # # TIP: use the train_test_split() function to partition your data. For example:
    # X_train, X_test, y_train, y_test = train_test_split( 
    #                                     #TODO: provide the data, 
    #                                     #TODO: set the test set size,
    #                                     #TODO: ensure stratified samples,
    #                                     #TODO: ensure reproducible partitions
    #                                     ) 

    # Split the data (i.e. X and y) into 80% train and 20% test, ensuring partitions are stratified and reproducible
    X_train, X_test, y_train, y_test = train_test_split(
                                                        X,                  #the input features
                                                        y,                  #the label
                                                        test_size=0.2,      #set aside 20% of the data as the test set  
                                                        random_state=seed,  #reproduce the results
                                                        stratify=y          #preserve the distribution of the labels
                                                        )
    
    # Return the resulting features for X_train and X_test and the labels for y_train and y_test
    return X_train, X_test, y_train, y_test


In [6]:
#DO NOT DELETE THIS CELL
#test the function
X_train, X_test, y_train, y_test = partition_data(X, y, SEED) # DO NOT DELETE OR MODIFY THIS LINE

#view samples from the data
print('X_train data: {}'.format(X_train[0:5]))
print('-------------------------------------')
print('X_test data: {}'.format(X_test[0:5]))
print('-------------------------------------')
print('y_train data: {}'.format(y_train[0:5]))
print('-------------------------------------')
print('y_test data: {}'.format(y_test[0:5]))

X_train data: [[7.4 2.8 6.1 1.9]
 [5.1 3.7 1.5 0.4]
 [7.7 3.8 6.7 2.2]
 [6.7 3.1 5.6 2.4]
 [6.3 3.3 6.  2.5]]
-------------------------------------
X_test data: [[6.2 2.8 4.8 1.8]
 [5.1 3.4 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [6.  2.9 4.5 1.5]
 [6.3 2.9 5.6 1.8]]
-------------------------------------
y_train data: [2 0 2 2 2]
-------------------------------------
y_test data: [2 0 0 1 2]
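
The two requirements from Question 2, stratified and reproducible partitions, can be checked directly. The sketch below is self-contained and reloads the data rather than relying on the cells above; the variable names `train_counts`, `test_counts`, and `split_again` are chosen here for illustration. It counts the samples per class in each partition and repeats the split with the same seed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same settings as partition_data(...): 80/20 split, stratified, seeded
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

# Stratified: each of the three classes keeps its share in both partitions
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train class counts:", train_counts)
print("test class counts: ", test_counts)

# Reproducible: repeating the split with the same seed gives identical arrays
split_again = train_test_split(X, y, test_size=0.2, random_state=7, stratify=y)
print("same test labels on re-run:", np.array_equal(y_test, split_again[3]))
```

Because iris has 50 samples per class, a stratified 80/20 split places 40 samples of each class in the training set and 10 of each in the test set.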


In [7]:
#DO NOT DELETE THIS CELL


### Question 3: Train your classification tree (3 pts)
Complete the function `build_classifier(...)` to build a decision tree classifier that predicts the type of iris flower. The function takes as input the partitioned data and a `seed`, and returns the predictions. Ensure that you perform the following steps inside the function:
- Instantiate a DecisionTreeClassifier with **max_depth** = 3, **criterion** = 'entropy' and set the **random_state** to ensure reproducible results. [Click here to view the documentation for the decision tree classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
- Fit the decision tree to the training data
- Predict the test set labels and assign the result to y_pred.
- Return the predictions
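
With `criterion='entropy'`, the tree scores candidate splits by how much they reduce the Shannon entropy of the class labels (the information gain). The sketch below illustrates that calculation on a toy node. The helper `entropy` and the example split are hypothetical and written for this illustration; this is not scikit-learn's internal implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Toy parent node: three balanced classes, like the full iris dataset
parent = [0] * 50 + [1] * 50 + [2] * 50

# Hypothetical split that isolates class 0 in the left child
left = [0] * 50
right = [1] * 50 + [2] * 50

# Information gain = parent entropy - size-weighted entropy of the children
h_parent = entropy(parent)
h_children = (len(left) / len(parent)) * entropy(left) \
           + (len(right) / len(parent)) * entropy(right)
info_gain = h_parent - h_children

print(f"parent entropy:   {h_parent:.3f} bits")
print(f"information gain: {info_gain:.3f} bits")
```

A pure child (all one class) has zero entropy, so splits that produce purer children yield a larger gain.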

In [8]:
def build_classifier(X_train: np.ndarray, X_test: np.ndarray, y_train: np.ndarray, y_test: np.ndarray, seed: int) -> np.ndarray:
    """
    Input numpy arrays for the partitioned train and test sets, and an integer seed.

    Parameters:
    - X_train (np.ndarray): a numpy array that contains the explanatory variables in the training set
    - y_train (np.ndarray): a numpy array that contains the target variables in the training set
    - X_test (np.ndarray): a numpy array that contains the explanatory variables in the test set
    - y_test (np.ndarray): a numpy array that contains the target variables in the test set
    - seed (int): the random state for reproducibility

    Returns:
    - np.ndarray: y_pred
    """

    # Instantiate a DecisionTreeClassifier with maximum_depth = 3, criterion = 'entropy' and set the random_state to ensure reproducible results
    dt = DecisionTreeClassifier(max_depth=3, criterion='entropy', random_state=seed)
    
    # Fit the decision tree to the training data
    dt.fit(X_train, y_train)
    
    # Predict the test set labels and assign the result to y_pred
    y_pred = dt.predict(X_test)
    
    # Return the predictions
    return y_pred
    

In [9]:
#print the predictions 
predictions = build_classifier(X_train, X_test, y_train, y_test, SEED)
print("The predictions are: ", predictions)

The predictions are:  [1 0 0 1 2 1 2 0 2 2 1 0 0 1 1 1 0 0 1 1 2 0 1 0 2 2 2 1 0 2]
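
Since `build_classifier(...)` returns only the predictions, the fitted tree itself is not available for inspection. As an optional sketch outside the graded function, the same configuration can be refit and its decision rules printed with scikit-learn's `export_text`. The variable names `dt` and `rules` are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], test_size=0.2, random_state=7, stratify=data["target"]
)

# Same settings as in Question 3
dt = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=7)
dt.fit(X_train, y_train)

# Text rendering of the learned splits, one line per node
rules = export_text(dt, feature_names=list(data["feature_names"]))
print(rules)
print("tree depth:", dt.get_depth())
```

Each indented line is a threshold test on one feature, and each leaf reports the predicted class.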


In [10]:
#DO NOT DELETE THIS CELL


### Question 4: Evaluate the model (1 pt)
Now that you have fit your first classification tree, let's evaluate its performance on the test set using the accuracy metric. Complete the function `evaluate_model(...)` to compare the predicted labels against the actual labels. The function takes as input the predicted labels and the expected labels. Ensure that you perform the following steps inside the function:
- Calculate the **accuracy** of the results.
- Round the calculation to 2 decimal places
- Return the accuracy


In [11]:
def evaluate_model(y_pred: np.ndarray, y_test: np.ndarray) -> float:
    """
    Input numpy arrays for y_pred and y_test.

    Parameters:
    - y_test (np.ndarray): a numpy array that contains the expected labels for the test set
    - y_pred (np.ndarray): a numpy array that contains the predicted labels for the test set

    Returns:
    - float: accuracy rounded to 2 decimal places
    """

    # Calculate the accuracy of the results and round the calculation to 2 decimal places
    accuracy = round(accuracy_score(y_test, y_pred), 2)
    
    # Return the accuracy
    return accuracy
    

In [12]:
#print accuracy 
score = evaluate_model(predictions, y_test)
print("Test set accuracy: {:.2f}".format(score))

Test set accuracy: 0.97
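
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand with NumPy on a small set of hypothetical labels (made up for illustration, not the lab's test set), mirroring what `accuracy_score` does for this kind of input.

```python
import numpy as np

# Hypothetical labels for illustration only
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1])
y_hat = np.array([0, 1, 2, 1, 1, 0, 1, 2, 0, 1])

# Element-wise comparison gives a boolean array; its mean is the accuracy
n_correct = int(np.sum(y_true == y_hat))
accuracy = round(float(np.mean(y_true == y_hat)), 2)

print(f"{n_correct} of {len(y_true)} correct, accuracy = {accuracy:.2f}")
```

Applied to the lab's 30 test samples, a reported accuracy of 0.97 corresponds to 29 correct predictions, since 29/30 rounds to 0.97.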


In [13]:
#DO NOT DELETE THIS CELL


**Good job with the lab!** You have learned how to create a classification tree, using the same dataset from your k-nn assignment. Next, you will build a regression tree in the assignment.

`Consider the following:` Did your decision tree perform better than k-nn on this dataset? While you are not required to submit a response to this question, keep in mind that an algorithm's performance will vary depending on the dataset. This is the reason we often experiment with different techniques. This experimentation will be crucial in your Data Science project.
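
One way to explore that question is to fit both models on the same partition and compare their test accuracy. The sketch below is a rough illustration only: the choice of `n_neighbors=5` is an assumption and may not match the settings used in the k-nn assignment, and the `models` and `scores` dictionaries are names chosen here.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=7),
    "k-nn": KNeighborsClassifier(n_neighbors=5),  # k=5 is an assumed value
}

# Fit each model on the same training data and score it on the same test data
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = round(accuracy_score(y_test, model.predict(X_test)), 2)
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```

A single 30-sample test set gives a coarse comparison, so small differences between the two scores should not be over-interpreted.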

### Additional Resources:

#### [1. Scikit-learn Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
#### [2. Scikit-learn Accuracy Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)