# Activity: Build a random forest model

## **Introduction**


As you're learning, random forests are popular statistical learning algorithms. Some of their primary benefits include reducing variance, bias, and the chance of overfitting.

This activity is a continuation of the project you began modeling with decision trees for an airline. Here, you will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers. It includes data points such as class, flight distance, and inflight entertainment. Your random forest model will be used to predict whether a customer will be satisfied with their flight experience.

**Note:** Because this lab uses a real dataset, this notebook first requires exploratory data analysis, data cleaning, and other manipulations to prepare it for modeling.

## **Step 1: Imports** 


Import relevant Python libraries and modules, including the `numpy` and `pandas` libraries for data processing; the `pickle` package to save the model; and the `sklearn` library, containing:
- The module `ensemble`, which has the class `RandomForestClassifier`
- The module `model_selection`, which has the function `train_test_split` and the classes `PredefinedSplit` and `GridSearchCV`
- The module `metrics`, which has the functions `f1_score`, `precision_score`, `recall_score`, and `accuracy_score`


In [1]:
# Import `numpy`, `pandas`, `pickle`, and `sklearn`.
# Import the relevant functions from `sklearn.ensemble`, `sklearn.model_selection`, and `sklearn.metrics`.

import numpy as np
import pandas as pd
import pickle
import dataframe_image as dfi
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import GridSearchCV, train_test_split, PredefinedSplit
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, confusion_matrix

As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# RUN THIS CELL TO IMPORT YOUR DATA. 

### YOUR CODE HERE ###

air_data = pd.read_csv("Customer_Survey.csv")

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

The `read_csv()` function from the `pandas` library can be helpful here.
 
</details>

Now, you're ready to begin cleaning your data. 

## **Step 2: Data cleaning** 

To get a sense of the data, display the first 10 rows.

In [3]:
# Display first 10 rows.

air_data.head(10)

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,...,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,...,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,...,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,...,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,...,2,2,2,4,5,3,4,2,30,26.0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

The `head()` function from the `pandas` library can be helpful here.
 
</details>

Now, display the variable names and their data types. 

In [4]:
# Display variable names and types.

air_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 22 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   satisfaction                       129880 non-null  object 
 1   Customer Type                      129880 non-null  object 
 2   Age                                129880 non-null  int64  
 3   Type of Travel                     129880 non-null  object 
 4   Class                              129880 non-null  object 
 5   Flight Distance                    129880 non-null  int64  
 6   Seat comfort                       129880 non-null  int64  
 7   Departure/Arrival time convenient  129880 non-null  int64  
 8   Food and drink                     129880 non-null  int64  
 9   Gate location                      129880 non-null  int64  
 10  Inflight wifi service              129880 non-null  int64  
 11  Inflight entertainment             1298

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

DataFrames have a method that outputs variable names and data types in one result.
 
</details>

**Question:** What do you observe about the differences in data types among the variables included in the data?

Four columns (`satisfaction`, `Customer Type`, `Type of Travel`, and `Class`) are of type `object`; the remaining columns are numeric.

Next, to understand the size of the dataset, identify the number of rows and the number of columns.

In [5]:
# Identify the number of rows and the number of columns.

air_data.shape


(129880, 22)

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

DataFrames have an attribute that outputs the number of rows and the number of columns in one result.

</details>

Now, check for missing values in the rows of the data. Start with `.isna()` to get Booleans indicating whether each value in the data is missing. Then, use `.any(axis=1)` to get Booleans indicating whether there are any missing values along the columns in each row. Finally, use `.sum()` to get the number of rows that contain missing values.

In [6]:
# Count the missing values in each column.
# (The row-wise count described above would be `air_data.isna().any(axis=1).sum()`.)

air_data.isna().sum()


satisfaction                           0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Seat comfort                           0
Departure/Arrival time convenient      0
Food and drink                         0
Gate location                          0
Inflight wifi service                  0
Inflight entertainment                 0
Online support                         0
Ease of Online booking                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Cleanliness                            0
Online boarding                        0
Departure Delay in Minutes             0
Arrival Delay in Minutes             393
dtype: int64
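
The column-wise count above and the row-wise count described in the instructions answer slightly different questions. The following is a minimal sketch on a made-up DataFrame (the column names and values are hypothetical, not taken from the survey data) showing that the two can differ when one row has more than one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has two missing values, row 2 has one.
toy = pd.DataFrame({
    "delay_out": [0.0, np.nan, 12.0],
    "delay_in": [5.0, np.nan, np.nan],
    "age": [30, 41, 25],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = toy.isna().any(axis=1).sum()

print(missing_per_column)
print(rows_with_missing)
```

In this notebook the two counts agree (393) because only one column contains missing values.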

**Question:** How many rows of data are missing values?

393 rows have missing values, all of them in the `Arrival Delay in Minutes` column.

Drop the rows with missing values. This is an important step in data cleaning, as it makes the data more useful for analysis and modeling. Then, save the resulting pandas DataFrame in a variable named `air_data_subset`.

In [7]:
# Drop missing values.
# Save the DataFrame in variable `air_data_subset`.

air_data_subset = air_data.dropna()
air_data_subset.isna().sum()

satisfaction                         0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
dtype: int64

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

The `dropna()` function is helpful here.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

The axis parameter passed in to this function should be set to 0 (if you want to drop rows containing missing values) or 1 (if you want to drop columns containing missing values).
</details>
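
To make the effect of the `axis` parameter concrete, here is a small sketch on made-up data (the DataFrame `toy` is hypothetical). It assumes only the documented behavior of `DataFrame.dropna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single missing value in column "b".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, np.nan, 6.0],
})

rows_dropped = toy.dropna(axis=0)  # removes the row that contains the NaN
cols_dropped = toy.dropna(axis=1)  # removes the column that contains the NaN

print(rows_dropped.shape)
print(cols_dropped.shape)
```

Because `axis=0` is the default, calling `dropna()` with no arguments drops rows, which is what this lab needs.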

Next, display the first 10 rows to examine the data subset.

In [8]:
# Display the first 10 rows.

air_data_subset.head(10)


Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,...,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,...,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,...,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,...,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,...,2,2,2,4,5,3,4,2,30,26.0


Confirm that it does not contain any missing values.

In [9]:
# Count of missing values.

air_data_subset.isna().sum()

satisfaction                         0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
dtype: int64

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can use `.isna().sum()` to get the number of missing values for each variable.

</details>

Next, convert the categorical features to indicator (one-hot encoded) features. 

**Note:** The `drop_first` argument can be kept as default (`False`) during one-hot encoding for random forest models, so it does not need to be specified. Also, the target variable, `satisfaction`, does not need to be encoded and will be extracted in a later step.

In [28]:
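# Convert categorical features to one-hot encoded features.
# This cell also encodes the target `satisfaction` and sets `drop_first=True`,
# so the first level of each categorical column is dropped.
air_data_subset_dummies = pd.get_dummies(air_data_subset, columns=['satisfaction', 'Customer Type', 'Type of Travel', 'Class'], drop_first=True)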

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can use the `pd.get_dummies()` function to convert categorical variables to one-hot encoded variables.
</details>

**Question:** Why is it necessary to convert categorical data into dummy variables?

The tree-based models in `sklearn` require numeric inputs, because each split compares a feature value against a threshold.
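
As an illustration of what one-hot encoding produces, the sketch below encodes a single made-up `Class` column (the four rows are hypothetical) with and without `drop_first=True`:

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco Plus", "Eco"],
    "Age": [65, 47, 15, 60],
})

# Default behavior: one indicator column per category level.
full = pd.get_dummies(toy, columns=["Class"])

# drop_first=True: the first level (alphabetically, "Business") is dropped.
reduced = pd.get_dummies(toy, columns=["Class"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a row with zeros in both remaining indicator columns corresponds to the dropped level.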

Next, display the first 10 rows to review the `air_data_subset_dummies`. 

In [29]:
# Display the first 10 rows.
air_data_subset_dummies.head(10)


Unnamed: 0,Age,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,Ease of Online booking,...,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction_satisfied,Customer Type_disloyal Customer,Type of Travel_Personal Travel,Class_Eco,Class_Eco Plus
0,65,265,0,0,0,2,2,4,2,3,...,5,3,2,0,0.0,1,0,1,1,0
1,47,2464,0,0,0,3,0,2,2,3,...,2,3,2,310,305.0,1,0,1,0,0
2,15,2138,0,0,0,3,2,0,2,2,...,4,4,2,0,0.0,1,0,1,1,0
3,60,623,0,0,0,3,3,4,3,1,...,4,1,3,0,0.0,1,0,1,1,0
4,70,354,0,0,0,3,4,3,4,2,...,4,2,5,0,0.0,1,0,1,1,0
5,30,1894,0,0,0,3,2,0,2,2,...,5,4,2,0,0.0,1,0,1,1,0
6,66,227,0,0,0,3,2,5,5,5,...,5,5,3,17,15.0,1,0,1,1,0
7,10,1812,0,0,0,3,2,0,2,2,...,5,4,2,0,0.0,1,0,1,1,0
8,56,73,0,0,0,3,5,3,5,4,...,5,4,4,0,0.0,1,0,1,0,0
9,22,1556,0,0,0,3,2,0,2,2,...,3,4,2,30,26.0,1,0,1,1,0


Then, check the variables of `air_data_subset_dummies`.

In [30]:
# Display variables.

# Cast the indicator columns to integers so that every column is numeric.
dummy_cols = ['satisfaction_satisfied', 'Customer Type_disloyal Customer',
              'Type of Travel_Personal Travel', 'Class_Eco', 'Class_Eco Plus']
air_data_subset_dummies[dummy_cols] = air_data_subset_dummies[dummy_cols].astype(int)
air_data_subset_dummies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129487 entries, 0 to 129879
Data columns (total 23 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Age                                129487 non-null  int64  
 1   Flight Distance                    129487 non-null  int64  
 2   Seat comfort                       129487 non-null  int64  
 3   Departure/Arrival time convenient  129487 non-null  int64  
 4   Food and drink                     129487 non-null  int64  
 5   Gate location                      129487 non-null  int64  
 6   Inflight wifi service              129487 non-null  int64  
 7   Inflight entertainment             129487 non-null  int64  
 8   Online support                     129487 non-null  int64  
 9   Ease of Online booking             129487 non-null  int64  
 10  On-board service                   129487 non-null  int64  
 11  Leg room service                   1294

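**Question:** What changes do you observe after converting the string data to dummy variables?

The number of columns increased from 22 to 23: the four string columns were replaced by five indicator columns that hold 0/1 values. The indicator columns were created with a `uint` data type and then cast to `int` in the cell above.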

## **Step 3: Model building** 

The first step to building your model is separating the labels (y) from the features (X).

In [31]:
# Separate the dataset into labels (y) and features (X).

y = air_data_subset_dummies['satisfaction_satisfied']
X = air_data_subset_dummies.drop(['satisfaction_satisfied'], axis = 1)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Save the labels (the values in the `satisfaction` column) as `y`.

Save the features as `X`. 

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

To obtain the features, drop the `satisfaction` column from the DataFrame.

</details>

Once separated, split the data into train, validate, and test sets. 

In [32]:
# Separate into train, validate, test sets.

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, stratify = y, random_state = 0)

X_tr, X_val, y_tr, y_val = train_test_split(X_train,y_train, test_size = 0.25, stratify = y_train, random_state = 0)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `train_test_split()` function twice to create train/validate/test sets, passing in `random_state` for reproducible results. 

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Split `X`, `y` to get `X_train`, `X_test`, `y_train`, `y_test`. Set the `test_size` argument to the proportion of data points you want to select for testing. 

Split `X_train`, `y_train` to get `X_tr`, `X_val`, `y_tr`, `y_val`. Set the `test_size` argument to the proportion of data points you want to select for validation. 

</details>
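
Splitting twice with `test_size=0.25` does not give three equal parts. The sketch below repeats the two-stage split on made-up arrays (1,600 hypothetical rows, chosen so the sizes divide evenly) to show the resulting proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,600 rows with a balanced binary label.
X = np.arange(1600).reshape(-1, 1)
y = np.tile([0, 1], 800)

# First split: hold out 25% of all rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Second split: hold out 25% of the remaining rows for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)

sizes = {"train": len(X_tr), "validate": len(X_val), "test": len(X_test)}
fractions = {name: n / len(X) for name, n in sizes.items()}
print(sizes)
print(fractions)
```

The validation set ends up at 0.25 × 0.75 = 18.75% of the full data, leaving 56.25% for training.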

### Tune the model

Now, fit and tune a random forest model with a separate validation set. Begin by determining a set of hyperparameters for tuning the model using `GridSearchCV`.


In [33]:
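# Determine set of hyperparameters.

# 'sqrt' is the usual `max_features` choice for classification.
# Note: scikit-learn reads a float `min_samples_leaf` as a fraction of the training
# samples and an int `max_features` as a count of features, so 0.75 and 1 below mean
# "at least 75% of the samples in every leaf" and "one feature per split".
cv_params = {'n_estimators' : [50,100,150], 
              'max_depth' : [8,10,50],
              'min_samples_leaf' : [0.75,1], 
              'min_samples_split' : [2],
              'max_features' : ["sqrt",0.75,1], 
              'max_samples' : [.5,.9]}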

scoring = ['accuracy', 'precision', 'recall', 'f1']

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Create a dictionary `cv_params` that maps each hyperparameter name to a list of values. The GridSearch you conduct will set the hyperparameter to each possible value, as specified, and determine which value is optimal.

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

The main hyperparameters here include `'n_estimators', 'max_depth', 'min_samples_leaf', 'min_samples_split', 'max_features', and 'max_samples'`. These will be the keys in the dictionary `cv_params`.

</details>
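
A grid search fits one model per combination of values, so the size of the grid matters for run time. As a rough check, `ParameterGrid` from `sklearn.model_selection` can count the combinations in the grid used in this notebook:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
cv_params = {'n_estimators': [50, 100, 150],
             'max_depth': [8, 10, 50],
             'min_samples_leaf': [0.75, 1],
             'min_samples_split': [2],
             'max_features': ["sqrt", 0.75, 1],
             'max_samples': [.5, .9]}

# 3 * 3 * 2 * 1 * 3 * 2 combinations.
n_candidates = len(ParameterGrid(cv_params))
print(n_candidates)
```

With a single predefined validation split, each candidate is fit once during the search.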

Next, create a list of split indices.

In [36]:
# Create list of split indices.
# 0 marks rows used for validation; -1 marks rows that always stay in training.
split_indices = [0 if idx in X_val.index else -1 for idx in X_train.index]

custom_split = PredefinedSplit(split_indices)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use list comprehension, iterating over the indices of `X_train`. The list can consist of 0s to indicate data points that should be treated as validation data and -1s to indicate data points that should be treated as training data.

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Use `PredefinedSplit()`, passing in `split_indices`, saving the output as `custom_split`. This will serve as a custom split that will identify which data points from the train set should be treated as validation data during GridSearch.

</details>
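
To see how the 0 and -1 markers are interpreted, the following sketch builds a `PredefinedSplit` from a short made-up list (five hypothetical rows) and inspects the single train/validation split it yields:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical markers: rows 2 and 3 form the validation fold (0);
# rows 0, 1, and 4 are always kept in training (-1).
test_fold = np.array([-1, -1, 0, 0, -1])
split = PredefinedSplit(test_fold)

folds = list(split.split())
train_idx, val_idx = folds[0]

print(split.get_n_splits())
print(train_idx, val_idx)
```

Because only one fold label (0) appears, the search evaluates each candidate on exactly one validation set instead of cross-validating.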

Now, instantiate your model.

In [37]:
# Instantiate model.

rf = RFC(random_state = 0)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `RandomForestClassifier()`, specifying the `random_state` argument for reproducible results. This will help you instantiate a random forest model, `rf`.

</details>

Next, use GridSearchCV to search over the specified parameters.

In [39]:
# Search over specified parameters.

rf_GSCV = GridSearchCV(rf, cv_params, scoring=scoring, cv=custom_split, refit = 'f1', n_jobs = -1, verbose=3, error_score='raise')

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `GridSearchCV()`, passing in `rf` and `cv_params` and specifying `cv` as `custom_split`. Additional arguments that you can specify include: `refit='f1', n_jobs = -1, verbose = 1`. 

</details>
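
When several scoring metrics are passed, `refit` names the metric used to pick the best candidate. The sketch below runs a deliberately tiny search on synthetic data (the dataset, grid, and `cv=3` are illustrative assumptions, not the settings of this lab) to show which attributes become available:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the airline survey.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 10], "max_depth": [2, 4]},
    scoring=["accuracy", "precision", "recall", "f1"],
    refit="f1",  # the best candidate is chosen by mean F1
    cv=3,
)
search.fit(X, y)

# Best hyperparameters according to the refit metric.
print(search.best_params_)

# One "mean_test_<metric>" entry per scoring metric.
print(sorted(k for k in search.cv_results_ if k.startswith("mean_test_")))
```

After fitting, `best_estimator_` holds a model refit on all of the data passed to `fit()` using the best hyperparameters.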

Now, fit your model.

In [48]:
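%%time
# Fit the model.

# The search was fit once and cached; uncomment to refit and re-save.
# rf_GSCV.fit(X_train, y_train)
# with open('Pickle/rf_GSCV.pickle', 'wb') as f:
#     pickle.dump(rf_GSCV, f)

# Load the previously fitted search object.
with open('Pickle/rf_GSCV.pickle', 'rb') as f:
    rf_GSCV = pickle.load(f)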

CPU times: total: 93.8 ms
Wall time: 169 ms


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `fit()` method to train the GridSearchCV model on `X_train` and `y_train`. 

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Add the magic function `%%time` to keep track of the amount of time it takes to fit the model and display this information once execution has completed. Remember that this code must be the first line in the cell.

</details>
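
Saving a fitted object with `pickle` avoids repeating a long fit. The sketch below shows the round trip with a plain dictionary standing in for the fitted model and a temporary file path; both are hypothetical placeholders:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object works the same way.
model_like = {"max_depth": 50, "n_estimators": 100}

# Hypothetical location for the saved file.
path = os.path.join(tempfile.mkdtemp(), "model.pickle")

# Write the object to disk; the `with` block closes the file afterwards.
with open(path, "wb") as f:
    pickle.dump(model_like, f)

# Read it back in a later session.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_like)
```

Only load pickle files that come from a source you trust, since unpickling can execute arbitrary code.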

Finally, obtain the optimal parameters.

In [49]:
# Obtain optimal parameters.
rf_GSCV.best_params_

{'max_depth': 50,
 'max_features': 0.75,
 'max_samples': 0.9,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 100}

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `best_params_` attribute to obtain the optimal values for the hyperparameters from the GridSearchCV model.

</details>

## **Step 4: Results and evaluation** 

Use the selected model to predict on your test data. Use the optimal parameters found via GridSearchCV.

In [50]:
# Use optimal parameters on GridSearchCV.

rf_opt = rf_GSCV.best_estimator_

In [51]:
pd.DataFrame(rf_GSCV.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_max_features,param_max_samples,param_min_samples_leaf,param_min_samples_split,param_n_estimators,...,std_test_precision,rank_test_precision,split0_test_recall,mean_test_recall,std_test_recall,rank_test_recall,split0_test_f1,mean_test_f1,std_test_f1,rank_test_f1
0,0.866409,0.0,0.116561,0.0,8,sqrt,0.5,0.75,2,50,...,0.0,55,1.000000,1.000000,0.0,1,0.707498,0.707498,0.0,55
1,1.715956,0.0,0.180621,0.0,8,sqrt,0.5,0.75,2,100,...,0.0,55,1.000000,1.000000,0.0,1,0.707498,0.707498,0.0,55
2,2.398426,0.0,0.236095,0.0,8,sqrt,0.5,0.75,2,150,...,0.0,55,1.000000,1.000000,0.0,1,0.707498,0.707498,0.0,55
3,3.132303,0.0,0.216556,0.0,8,sqrt,0.5,1,2,50,...,0.0,42,0.934537,0.934537,0.0,84,0.922905,0.922905,0.0,38
4,6.431217,0.0,0.349815,0.0,8,sqrt,0.5,1,2,100,...,0.0,40,0.934537,0.934537,0.0,84,0.923146,0.923146,0.0,37
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103,2.003195,0.0,0.445715,0.0,50,1,0.9,0.75,2,100,...,0.0,55,1.000000,1.000000,0.0,1,0.707498,0.707498,0.0,55
104,2.656012,0.0,0.308250,0.0,50,1,0.9,0.75,2,150,...,0.0,55,1.000000,1.000000,0.0,1,0.707498,0.707498,0.0,55
105,6.230460,0.0,0.549834,0.0,50,1,0.9,1,2,50,...,0.0,15,0.935741,0.935741,0.0,81,0.942621,0.942621,0.0,18
106,9.696617,0.0,1.145196,0.0,50,1,0.9,1,2,100,...,0.0,14,0.938600,0.938600,0.0,74,0.944785,0.944785,0.0,14


In [52]:
def make_results(model_name: list, model_object, metric: list):
    '''
    Arguments:
    model_name (list of str): names for the rows of the output table
    model_object: a fit GridSearchCV object
    metric (list of str): one of precision, recall, f1, or accuracy per row,
        in the same order as model_name

    Returns a pandas df with one row per (name, metric) pair, holding the F1, recall,
    precision, and accuracy scores of the candidate with the best mean 'metric' score
    across all validation folds.
    '''
    
    table = pd.DataFrame(columns = ['Model', 'Precision', 'Recall', 'F1', 'Accuracy'])
    
    # Create dictionary that maps input metric to actual metric name in GridSearchCV
    metric_dict = {'precision': 'mean_test_precision',
                     'recall': 'mean_test_recall',
                     'f1': 'mean_test_f1',
                     'accuracy': 'mean_test_accuracy',
                     }
    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)
    
    for i in range(len(model_name)):
        # Isolate the row of the df with the max(metric) score
        best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric[i]]].idxmax(), :]

        # Extract Accuracy, precision, recall, and f1 score from that row
        f1 = best_estimator_results.mean_test_f1
        recall = best_estimator_results.mean_test_recall
        precision = best_estimator_results.mean_test_precision
        accuracy = best_estimator_results.mean_test_accuracy

        # Create table of results
        table.loc[len(table)]={'Model': model_name[i],
                            'Precision': precision,
                            'Recall': recall,
                            'F1': f1,
                            'Accuracy': accuracy,
                            }

    return table

out_rf = make_results(['RF_F1','RF_AC','RF_RC','RF_PC'],rf_GSCV,['f1','accuracy','recall','precision'])
out_rf

Unnamed: 0,Model,Precision,Recall,F1,Accuracy
0,RF_F1,0.966871,0.948683,0.957691,0.954117
1,RF_AC,0.966871,0.948683,0.957691,0.954117
2,RF_RC,0.547387,1.0,0.707498,0.547387
3,RF_PC,0.966871,0.948683,0.957691,0.954117


In [54]:
y_preds_rf = rf_GSCV.best_estimator_.predict(X_test)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `RandomForestClassifier()`, specifying the `random_state` argument for reproducible results and passing in the optimal hyperparameters found in the previous step. To distinguish this from the previous random forest model, consider naming this variable `rf_opt`.

</details>

In [55]:
def get_test_scores(model_name:str, preds, y_test_data):
    '''
    Generate a table of test scores.

    In:
    model_name (string): Your choice: how the model will be named in the output table
    preds: numpy array of test predictions
    y_test_data: numpy array of y_test data

    Out:
    table: a pandas df of precision, recall, f1, and accuracy scores for your model
    '''
    accuracy = accuracy_score(y_test_data, preds)
    precision = precision_score(y_test_data, preds)
    recall = recall_score(y_test_data, preds)
    f1 = f1_score(y_test_data, preds)

    table = pd.DataFrame({'Model': [model_name],
                        'Precision': [precision],
                        'Recall': [recall],
                        'F1': [f1],
                        'Accuracy': [accuracy]
                        })

    return table

results_rf = get_test_scores('RF Test', y_preds_rf, y_test)
results_rf = pd.concat([out_rf.iloc[0:1, :], results_rf], axis=0)
# Rename the first row with a single positional assignment (avoids chained indexing).
results_rf.iloc[0, results_rf.columns.get_loc('Model')] = 'RF CV'

results_rf.dfi.export('Figures/RF Evaluation Metrics.png', dpi=300)

results_rf

Unnamed: 0,Model,Precision,Recall,F1,Accuracy
0,RF CV,0.966871,0.948683,0.957691,0.954117
0,RF Test,0.969121,0.949269,0.959092,0.955672


Once again, fit the optimal model.

In [56]:
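# Fit the optimal model.

# `best_estimator_` was already refit on `X_train` by GridSearchCV (`refit='f1'`),
# so the explicit refit below is left commented out.
#rf_opt.fit(X_train,y_train)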

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `fit()` method to train `rf_opt` on `X_train` and `y_train`.

</details>

And predict on the test set using the optimal model.

In [57]:
# Predict on test set.

y_pred = rf_opt.predict(X_test)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `predict()` function to make predictions on `X_test` using `rf_opt`. Save the predictions now (for example, as `y_pred`), to use them later for comparing to the true labels. 

</details>

### Obtain performance scores

First, get your precision score.

In [None]:
# Get precision score.
print(rf_opt.classes_)
print(precision_score(y_test, y_pred, average=None))  # one score per class, in the order of `classes_`
# The target was encoded as 0/1 (`satisfaction_satisfied`), so the positive label is 1.
PC = precision_score(y_test, y_pred, pos_label=1)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `precision_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred` and specifying the `pos_label` argument as `"satisfied"`.
</details>

Then, collect the recall score.

In [None]:
# Get recall score.

print(rf_opt.classes_)
print(recall_score(y_test, y_pred, average=None))  # one score per class, in the order of `classes_`
# The target was encoded as 0/1 (`satisfaction_satisfied`), so the positive label is 1.
RC = recall_score(y_test, y_pred, pos_label=1)


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `recall_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred` and specifying the `pos_label` argument as `"satisfied"`.
</details>

Next, obtain your accuracy score.

In [None]:
# Get accuracy score.

AC = accuracy_score(y_test,y_pred)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `accuracy_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred`.
</details>
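
Precision, recall, and accuracy can all be derived from the four counts in a confusion matrix. The sketch below uses eight made-up labels (not results from this lab) and checks the hand calculation against the `sklearn.metrics` functions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions (1 = satisfied, 0 = dissatisfied).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct

print(precision, recall, accuracy)
print(precision_score(y_true, y_pred, pos_label=1),
      recall_score(y_true, y_pred, pos_label=1),
      accuracy_score(y_true, y_pred))
```

Accuracy has no notion of a positive class, which is why `accuracy_score()` takes no `pos_label` argument.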

Finally, collect your F1-score.

In [None]:
# Get F1 score.

print(rf_opt.classes_)
f1_per_class = f1_score(y_test, y_pred, average=None)  # one score per class, in the order of `classes_`
print('The F1 score for:')
print(f'Satisfied = {f1_per_class[1]:.3f}')
print(f'Dissatisfied = {f1_per_class[0]:.3f}')

# The target was encoded as 0/1 (`satisfaction_satisfied`), so the positive label is 1.
F1 = f1_score(y_test, y_pred, pos_label=1)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `f1_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred` and specifying the `pos_label` argument as `"satisfied"`.
</details>

**Question:** How is the F1-score calculated?

The F1-score is the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall).
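
As a quick arithmetic check, the harmonic mean can be computed by hand from the test-set precision and recall reported in the results table earlier in this notebook (0.969121 and 0.949269):

```python
# Test-set precision and recall, copied from the "RF Test" row above.
precision = 0.969121
recall = 0.949269

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

Up to rounding of the inputs, this agrees with the F1 value reported in the same row of the table.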

**Question:** What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?

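* Pros: It's quicker, because no separate validation set has to be held out.
* Cons: The model can overfit to the test data. The tuned hyperparameters may suit that particular dataset only, so the test score stops being an unbiased estimate of performance on new data.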

### Evaluate the model

Now that you have results, evaluate the model. 

**Question:** What are the four basic parameters for evaluating the performance of a classification model?

[Write your response here. Double-click (or enter) to edit.]

**Question:**  What do the four scores demonstrate about your model, and how do you calculate them?

[Write your response here. Double-click (or enter) to edit.]

Calculate the scores: precision score, recall score, accuracy score, F1 score.

In [None]:
# Precision score on test data set.

### YOUR CODE HERE ###


In [None]:
# Recall score on test data set.

### YOUR CODE HERE ###


In [None]:
# Accuracy score on test data set.

### YOUR CODE HERE ###


In [None]:
# F1 score on test data set.

### YOUR CODE HERE ###


**Question:** How does this model perform based on the four scores?

[Write your response here. Double-click (or enter) to edit.]

### Evaluate the model

Finally, create a table of results that you can use to evaluate the performance of your model.

In [None]:
# Create table of results.

### YOUR CODE HERE ###
table = pd.DataFrame({'Model': ["Tuned Decision Tree", "Tuned Random Forest"],
                        'F1':  [0.945422, F1],
                        'Recall': [0.935863, RC],
                        'Precision': [0.955197, PC],
                        'Accuracy': [0.940864, AC]
                      }
                    )
table


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Build a table to compare the performance of the models. Create a DataFrame using the `pd.DataFrame()` function.

</details>

**Question:** How does the random forest model compare to the decision tree model you built in the previous lab?

The random forest scores higher on F1 (about 0.959 on the test set versus 0.945 for the tuned decision tree), indicating that it may classify better when both false positives and false negatives are taken into account.


## **Considerations**


**What summary would you provide to stakeholders?**

* The random forest model predicted satisfaction with about 95.6% accuracy on the test set. The precision is about 96.9% and the recall is about 94.9%. 
* The random forest model outperformed the tuned decision tree in all four scores. This indicates that the random forest model may perform better.
* Because stakeholders were interested in learning about the factors that are most important to customer satisfaction, this would be shared based on the tuned random forest. 
* In addition, you would provide details about the precision, recall, accuracy, and F1 scores to support your findings. 
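
One common way to report which factors matter most is the `feature_importances_` attribute of a fitted random forest. The sketch below uses synthetic data with hypothetical feature names; in this lab, the same idea would apply to `rf_opt` and the columns of `X`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
X_arr, y = make_classification(n_samples=300, n_features=5,
                               n_informative=2, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, sorted from most to least important.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

These importances are relative scores that sum to 1; they rank features but do not say in which direction a feature affects satisfaction.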

