# Phase 3 Project

Our data science team has been contracted by a fairly new, up-and-coming motorcycle manufacturer that is struggling to find its formula for sales success. Being a company of motorcycle riders, they have tasked our team with coming up with a plan to replicate the successes of platforms like the Yamaha MT-07 and the Suzuki SV650. We will use the 'all_bikez_curated' dataset from Kaggle, curated by Emmanuel F. Werr.


<img src="pictures/092220-2021-bmw-m1000rr-f.webp"  />
<center>2022 BMW M1000RR</center>

In [1]:
# Import warnings 
import warnings
warnings.filterwarnings('ignore')

# Import pandas to read our data
import pandas as pd
# 'all_bikez_curated.csv', with a 'z'
df = pd.read_csv('all_bikez_curated.csv')
# Show our data 
df.head()

Unnamed: 0,Brand,Model,Year,Category,Rating,Displacement (ccm),Power (hp),Torque (Nm),Engine cylinder,Engine stroke,...,Dry weight (kg),Wheelbase (mm),Seat height (mm),Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension,Color options
0,acabion,da vinci 650-vi,2011,Prototype / concept model,3.2,,804.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
1,acabion,gtbo 55,2007,Sport,2.6,1300.0,541.0,420.0,In-line four,four-stroke,...,360.0,,,,,,,,,
2,acabion,gtbo 600 daytona-vi,2011,Prototype / concept model,3.5,,536.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
3,acabion,gtbo 600 daytona-vi,2021,Prototype / concept model,,,536.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
4,acabion,gtbo 70,2007,Prototype / concept model,3.1,1300.0,689.0,490.0,In-line four,four-stroke,...,300.0,,,,,,,,,Custom made.


## Ever heard of 'acabion'? 

<img src="pictures/acabion_2.jpg" />
<center>Acabion GTBO450 (Not listed in dataset)</center>

Let's look at the unique values for 'Brand' as it would immediately appear that we probably aren't familiar with all of these motorcycles. Also, at ~700 hp, this thing looks like a literal DEATHTRAP.

In [2]:
# .unique() will give you a view of all of the potential outcomes in 'Brand'
df['Brand'].unique()

array(['acabion', 'access', 'ace', 'adiva', 'adler', 'adly', 'aeon',
       'aermacchi', 'agrati', 'ajp', 'ajs', 'alfer', 'alligator',
       'allstate', 'alphasports', 'alta', 'amazonas', 'american eagle',
       'american ironhorse', 'apc', 'aprilia', 'apsonic', 'arch',
       'arctic cat', 'ardie', 'ariel', 'arlen ness', 'arqin', 'askoll',
       'aspes', 'ather', 'atk', 'atlas honda', 'aurora',
       'avanturaa choppers', 'avinton', 'avon', 'azel', 'bajaj', 'balkan',
       'baltmotors', 'bamx', 'baotian', 'barossa', 'batavus', 'beeline',
       'benelli', 'bennche', 'beta', 'better', 'big bear choppers',
       'big dog', 'bimota', 'bintelli', 'black douglas', 'blackburne',
       'blata', 'bluroc', 'bmc choppers', 'bmw', 'boom trikes', 'borile',
       'boss hoss', 'bourget', 'bpg', 'brammo', 'bridgestone', 'britten',
       'brixton', 'brockhouse', 'brough superior', 'brudeli', 'bsa',
       'buccimoto', 'buell', 'bullit', 'bultaco', 'cagiva',
       'california scooter', 'can-

## Brands you've never heard of aside, what are we looking for? 

Though there have historically been a huge number of manufacturers with varying degrees of success, we're specifically looking for diamonds, regardless of country of origin, make, or model: if our goal is critical acclaim, let's then look at what the critics say, but let's first make sure that the critics actually said something. 

In [7]:
df.isnull().sum().sum()

219918

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38472 entries, 0 to 38471
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brand                38472 non-null  object 
 1   Model                38444 non-null  object 
 2   Year                 38472 non-null  int64  
 3   Category             38472 non-null  object 
 4   Rating               21788 non-null  float64
 5   Displacement (ccm)   37461 non-null  float64
 6   Power (hp)           26110 non-null  float64
 7   Torque (Nm)          16634 non-null  float64
 8   Engine cylinder      38461 non-null  object 
 9   Engine stroke        38461 non-null  object 
 10  Gearbox              32675 non-null  object 
 11  Bore (mm)            28689 non-null  float64
 12  Stroke (mm)          28689 non-null  object 
 13  Fuel capacity (lts)  31704 non-null  float64
 14  Fuel system          27844 non-null  object 
 15  Fuel control         22008 non-null 

In [9]:
df = df.dropna()
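
Note that `dropna()` with no arguments removes every row that has a null in *any* column, which is why the dataset shrinks from 38472 rows to 1747. The following is a minimal sketch on a toy frame (all values are invented for illustration and are not taken from the bikes dataset) that contrasts dropping on every column with dropping on a chosen subset:

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; not taken from the bikes dataset
toy = pd.DataFrame({
    'Rating': [3.2, np.nan, 3.5, 3.1],
    'Power (hp)': [80.0, 54.0, np.nan, 68.0],
    'Color options': [None, 'Red', 'Blue', 'Black'],
})

# Share of null values per column, largest first
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)

# Dropping on every column keeps only fully populated rows
all_cols = toy.dropna()
# Dropping on a subset keeps rows that are complete in those columns only
subset_only = toy.dropna(subset=['Rating', 'Power (hp)'])

print(len(toy), len(all_cols), len(subset_only))
```

Dropping on a subset keeps more rows, at the cost of leaving nulls in the other columns to deal with later.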

In [12]:
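# Which numeric columns are correlated with 'Rating'?
# Compute the correlations once, on numeric columns only
rating_corr = df.select_dtypes(include='number').corr()['Rating']
# A threshold of 0 keeps every column with a non-zero correlation, including 'Rating' itself
preds = [col for col in rating_corr.index if abs(rating_corr[col]) > 0]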

In [13]:
df[preds].corr()

Unnamed: 0,Year,Rating,Displacement (ccm),Power (hp),Torque (Nm),Bore (mm),Fuel capacity (lts),Dry weight (kg),Wheelbase (mm),Seat height (mm)
Year,1.0,-0.124482,0.073572,0.047298,0.090871,0.073073,0.030029,0.073778,0.072141,-0.035917
Rating,-0.124482,1.0,0.264354,0.214983,0.265243,0.217992,0.279262,0.209896,0.182865,0.038402
Displacement (ccm),0.073572,0.264354,1.0,0.661359,0.962469,0.773344,0.654265,0.811809,0.711076,-0.255896
Power (hp),0.047298,0.214983,0.661359,1.0,0.797149,0.572155,0.566838,0.327473,0.324695,0.24649
Torque (Nm),0.090871,0.265243,0.962469,0.797149,1.0,0.761586,0.675614,0.728967,0.648251,-0.110939
Bore (mm),0.073073,0.217992,0.773344,0.572155,0.761586,1.0,0.502672,0.543578,0.574575,-0.003251
Fuel capacity (lts),0.030029,0.279262,0.654265,0.566838,0.675614,0.502672,1.0,0.66736,0.597111,0.033131
Dry weight (kg),0.073778,0.209896,0.811809,0.327473,0.728967,0.543578,0.66736,1.0,0.805321,-0.442269
Wheelbase (mm),0.072141,0.182865,0.711076,0.324695,0.648251,0.574575,0.597111,0.805321,1.0,-0.345611
Seat height (mm),-0.035917,0.038402,-0.255896,0.24649,-0.110939,-0.003251,0.033131,-0.442269,-0.345611,1.0


In [14]:
# Preview first five rows and admire our work
df.head()

Unnamed: 0,Brand,Model,Year,Category,Rating,Displacement (ccm),Power (hp),Torque (Nm),Engine cylinder,Engine stroke,...,Dry weight (kg),Wheelbase (mm),Seat height (mm),Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension,Color options
195,aeon,cobra 50,2012,ATV,2.6,49.3,3.0,3.7,Single cylinder,two-stroke,...,129.0,1050.0,800.0,Expanding brake (drum brake),Single disc,19/7-8,18/10-8,"Dual hydraulic shock, Single A-arm","Single hydraulic shock, Unit swing arm","White, black"
203,aeon,crossland x4 400,2012,ATV,3.5,346.0,20.1,30.0,Single cylinder,four-stroke,...,236.0,1230.0,850.0,Double disc,Expanding brake (drum brake),23/7-12,23/10-12,Double A-Arm,Swing Arm,"Red, black"
226,aeon,urban 350i,2012,Scooter,3.6,313.0,22.8,30.0,Single cylinder,four-stroke,...,177.0,1545.0,815.0,Single disc. Hydraulic,Single disc. Hydraulic,120/70-16,140/70-15,Telescopic fork,Dual-damper unit swing arm,"White, black, silver"
361,ajp,pr4 125 enduro,2010,Enduro / offroad,3.3,124.0,12.5,8.5,Single cylinder,four-stroke,...,105.0,1410.0,910.0,Single disc. 2 piston calliper,Single disc. 4 piston calliper,90/90-21,120/90-18,"Paioli Hydraulic fork,",Sachs mono shock progressive action,Black
408,ajp,pr7 adventure 650,2018,Enduro / offroad,3.7,659.7,48.0,58.0,Single cylinder,four-stroke,...,155.0,1540.0,920.0,Single disc. 2-piston calipers,Single disc. Single-piston caliper,90/90-21,140/80-18,ZF Sachs Ø48mm fully adjustable,ZF Sachs progressive system with reservatory f...,White/red/black


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1747 entries, 195 to 38298
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brand                1747 non-null   object 
 1   Model                1747 non-null   object 
 2   Year                 1747 non-null   int64  
 3   Category             1747 non-null   object 
 4   Rating               1747 non-null   float64
 5   Displacement (ccm)   1747 non-null   float64
 6   Power (hp)           1747 non-null   float64
 7   Torque (Nm)          1747 non-null   float64
 8   Engine cylinder      1747 non-null   object 
 9   Engine stroke        1747 non-null   object 
 10  Gearbox              1747 non-null   object 
 11  Bore (mm)            1747 non-null   float64
 12  Stroke (mm)          1747 non-null   object 
 13  Fuel capacity (lts)  1747 non-null   float64
 14  Fuel system          1747 non-null   object 
 15  Fuel control         1747 non-null 

<div class="alert alert-block alert-info">
There are a few things that we should look into prior to making a train_test_split; what do we believe will contribute to a higher rating for a motorcycle? Is there any data here that we can identify that may or may not give us those answers? Specifically we should see if we can make any sense out of all of the null values in this dataset. 
</div>

# What makes a rating, good? 

In [None]:
# .unique() will give you all of the potential scores found in the 'Rating' column 
df['Rating'].unique()

<div class="alert alert-block alert-info">
Since it would appear that our ratings work on a 5 point system, let's take a look at the motorcycles whose sales performance we would like to replicate: the Yamaha MT-07 and the Suzuki SV650.
</div>

# Yamaha MT-07 & Suzuki SV650

If like me, you are an avid motorcycle rider, you know, have known, or are someone on one of these bikes. 

<center>
<table><tr>
<td> <img src="pictures/yamaha mt07.jfif" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="pictures/suzuki sv650.jfif" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>
MT-07 and SV650, respectively</center>
<!-- <img src="pictures/yamaha mt07.jfif" />
<img src="pictures/suzuki sv650.jfif" /> -->

In [None]:
# Find all included models that are an MT-07
df_yamaha = df.loc[df['Model'] == 'mt-07']
df_yamaha.head(1)

In [None]:
# Find all models that have the name SV650
df_suzuki = df.loc[df['Model'] == 'sv650']
df_suzuki.head(1)

In [None]:
# Bring in numpy
import numpy as np

In [None]:
# Average of ratings scores for the MT-07
yamaha_mean = np.mean(df_yamaha['Rating'].unique())
print('The average score of all year-model MT-07 in dataset:{}'.format(yamaha_mean))

In [None]:
# Average of ratings scores for the SV650
suzuki_mean = np.mean(df_suzuki['Rating'].unique())
print('The average score of all year-model SV650 in dataset:{}'.format(suzuki_mean))

In [None]:
# Nothing fancy here, find the average of the two scores
rating_success = (yamaha_mean + suzuki_mean) / 2
print('Ultimately, this will be our ratings goal:{}'.format(rating_success))
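
One thing to keep in mind: `np.mean(df_yamaha['Rating'].unique())` averages the *distinct* rating values, so a score that appears for several model years counts only once. A small sketch with made-up ratings (not the real MT-07 or SV650 scores) shows how this can differ from the plain mean over all rows:

```python
import pandas as pd

# Hypothetical ratings for several model years of one bike (values invented)
ratings = pd.Series([3.5, 3.5, 3.5, 3.2])

mean_of_unique = ratings.unique().mean()   # each distinct score counted once
mean_of_rows = ratings.mean()              # each model year counted once

print(mean_of_unique, mean_of_rows)
```

Which of the two is appropriate depends on whether repeated scores across model years should carry extra weight.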

# Binary values for 'Rating'
Now that we have an average score for 'Rating', let's make a new column that will give us binary results: 1 for a rating of 3.4 or above, 0 for a rating below 3.4. 

In [None]:
# Let's look for all of the unique values for the column 'Rating'
df['Rating'].unique()

In [None]:
# As detailed above, anything lower than 3.4 is outside of our target
rating_num = np.where(df["Rating"]>=(3.4), 1, 0)

In [None]:
# Our "Rating" column will now return a '1' for bikes that got a rating of 3.4 or above, 0 for scores
# below 3.4
rating_num

In [None]:
# In order to attach this data to our current table we will need to make it into a dataframe.
# Reuse df's index so that the rows line up when we concatenate
data = pd.DataFrame(rating_num, columns = ['Rating_'], index = df.index)
data.head()

In [None]:
# Preview binary rating counts
data.value_counts()

In [None]:
# Drop 'Rating' column 
df = df.drop(columns = 'Rating')
df.head()

In [None]:
# Put our new 'Rating_' column into our dataframe
df = pd.concat([df,data], axis =1)
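
`pd.concat(..., axis=1)` lines rows up by index label, not by position. If `df` has kept its original row labels after `dropna()` (195, 203, 226, ...), a second frame with a fresh 0..n-1 index will not line up with it. The sketch below uses toy data (invented values) to show the pitfall, along with an alternative that assigns the binary column directly on the frame:

```python
import numpy as np
import pandas as pd

# Toy frame with non-contiguous index labels, as left behind by dropna()
toy = pd.DataFrame({'Rating': [2.6, 3.5, 3.6]}, index=[195, 203, 226])

# Pitfall: the new frame gets a default 0..2 index, so concat cannot align rows
flags = pd.DataFrame(np.where(toy['Rating'] >= 3.4, 1, 0), columns=['Rating_'])
misaligned = pd.concat([toy, flags], axis=1)
print(misaligned)  # rows from both indexes, NaN wherever the labels did not match

# Alternative: build the column on the frame itself, so the index is shared
toy['Rating_'] = (toy['Rating'] >= 3.4).astype(int)
print(toy)
```

Checking `df.info()` after the concat is a quick way to notice this: unexpected nulls and a larger row count are the telltale signs.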

In [None]:
# Make sure it worked
df.head()

In [None]:
# Preview data
df.info()

# Preprocessing data for train_test_split

## There's no replacement for displacement

In [None]:
# 'Displacement (ccm)' would be how many cubic centimeters of engine displacement/size a bike's 
# engine has
print(df['Displacement (ccm)'].value_counts())
print()
print('There are {} null values in "Displacement (ccm)" '.format(df['Displacement (ccm)'].isnull().sum()))

<div class="alert alert-block alert-info">
Even though 8857 null values in 'Displacement (ccm)' would seem like a lot, engine displacement is an incredibly subjective parameter for a motorcycle. 
</div>

In [None]:
# Drop null values for displacement
df = df.dropna(subset =['Displacement (ccm)'])

In [None]:
# Preview dataset and identify what other work needs to be done
df.head()

<div class="alert alert-block alert-info">
Now that we have preprocessed one column in our dataset, let's take a look at what other columns we should consider processing. 
</div>

In [None]:
# Number of null values in df['Displacement (ccm)']
df['Displacement (ccm)'].isnull().sum()

In [None]:
# Preview data
df.info()

<div class="alert alert-block alert-warning">
There are a few considerations that we should make: there are 21684 total entries, and we can assume that any column with a lower non-null count contains null values. There are, however, a few things to keep in mind when deciding whether we should drop the rows with null values or drop the column altogether. 
<br><br>    
Looking specifically at 'Torque (Nm)', a value here would imply that the motorcycle in question has at one point or another been put on a dyno to read out torque specs. While this is typically considered the "fun" value in a bike, let's first make a model with all null values in 'Torque (Nm)' expunged. Once we have that model created and fitted, let's also look at a model that does not take into account torque figures. 
</div>



A dynamometer test is typically used to measure the torque output of an engine. First developed in 1798, the technology has come a _long_ way since. <br><br>

<img src="pictures/DIY_dyno_complete1.jpg.crdownload" width = '500'/>
<center>Photo of DIY Dyno build from skrunkwerks.com</center><br><br>
Modern dynamometer testing is done by pushing air into the air filter (or turbo if you're on a drag monster) in order to simulate flow at speed and to push cool air into the radiator. By strapping the bike to the dyno, we are then able to get power readouts via sensors in the flywheel where the rear tire can put power to the device. <br><br>
I suspect that the bikes that do not have a torque reading either did not get tested due to extenuating circumstances (smaller displacement bikes, cruisers, and dirt bikes are usually not tested) or, the modern dynamometer was not available for some years of model submission. 

# Model with torque numbers intact

As suggested, we will first look at bikes that _have_ 'Torque (Nm)' information. 

In [None]:
# Remove null values from df['Torque (Nm)']
df = df.dropna(subset =['Torque (Nm)'])

In [None]:
# Preview our data to see what other columns need pruning
df.info()

<div class="alert alert-block alert-warning">
There are a few considerations that we should make in regard to null values and whether or not we believe these rows should be included in the first place. Since we have not yet encoded the categorical values as numbers and therefore cannot yet see their correlation to our target, the responsible choice is to keep only rows with FULL information. <br><br> 
    
In other words, we should rid our dataset of null values prior to cutting out any column information. 
    
<br>
This isn't always the case, but for our purposes:
</div>

# If the value is null, it has to go

In [None]:
# Drop rows with null values in the columns that we intend to use for modeling
df = df.dropna(subset =['Model', 'Power (hp)', 'Engine cylinder', 'Engine stroke', 'Gearbox', 'Bore (mm)', 'Stroke (mm)',
                       'Fuel capacity (lts)', 'Fuel system', 'Fuel control', 'Cooling system', 'Transmission type', 
                       'Dry weight (kg)', 'Wheelbase (mm)', 'Seat height (mm)', 'Front brakes', 'Rear brakes', 'Front tire',
                       'Rear tire', 'Front suspension', 'Rear suspension', 'Rating_'])

Because 'Color options' would seem to be a pretty subjective thing, and motorcycle manufacturers typically have up to 4-5 years to introduce new colorways per model generation, this column will ultimately be unnecessary for our model. 

In [None]:
# Remove df['Color options']
df = df.drop(columns = 'Color options')

In [None]:
# Preview our new dataset
df.head()

In [None]:
# Information on our new dataset
df.info()

To round out our preprocessing step, we should take out any unnecessary information, like 'Brand' and 'Model'.  

In [None]:
# Drop df['Brand']
df = df.drop(columns = 'Brand')

In [None]:
# Drop df['Model']
df = df.drop(columns = 'Model')

<div class="alert alert-block alert-info">
Now that we have pruned all null information from our dataset, let's address the remaining categorical data by dummying it. 
</div>

# Dummy categorical columns

Let's collect our categorical columns into a list and dummy them into binary columns.

In [None]:
# Categorical/object columns can be found from df.info(), above
categoricals = ['Category','Engine cylinder', 'Engine stroke', 'Gearbox', 'Stroke (mm)', 'Fuel system', 'Fuel control', 
                'Cooling system', 'Transmission type', 'Front brakes', 'Rear brakes', 'Front tire', 'Rear tire', 
                'Front suspension', 'Rear suspension']

# Our dataframe needs to be encoded. OneHotEncoder (OHE) could also be used; this time we will use
# pd.get_dummies()
df = pd.get_dummies(df, columns=categoricals)
# Preview our dummied columns
df.head()

<div class="alert alert-block alert-success">
It looks like our columns were successfully dummied. Now we will need to split our data for modeling. 
</div>
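
Since scikit-learn's `OneHotEncoder` is mentioned as an alternative to `pd.get_dummies()`, here is a minimal sketch of how it could be used. The toy values are invented, and the point of interest is `handle_unknown='ignore'`, which lets the fitted encoder cope with a category that was not present when it was fit:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data, invented for illustration
train = pd.DataFrame({'Engine stroke': ['four-stroke', 'two-stroke', 'four-stroke']})
unseen = pd.DataFrame({'Engine stroke': ['Electric']})

# handle_unknown='ignore' encodes categories not seen during fit as all zeros
ohe = OneHotEncoder(handle_unknown='ignore')
train_encoded = ohe.fit_transform(train).toarray()
unseen_encoded = ohe.transform(unseen).toarray()

print(ohe.categories_)
print(train_encoded)
print(unseen_encoded)
```

One reason to consider this approach is that the encoder can be fit on the training split only and then applied to the test split, whereas `pd.get_dummies()` derives its columns from whatever data it is given.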

## train_test_split


In [None]:
# target is 'y'
target = df['Rating_']
# X is the new name that we will use for our data. For testing purposes we may re-use 
#'df' if it produces better results.
X = df.drop(columns = 'Rating_')

In [None]:
#import train_test_split
from sklearn.model_selection import train_test_split

# Create variables for modeling
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state = 42)
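
With a binary target, it can help to keep the class balance the same in both splits. The sketch below uses the `stratify` argument of `train_test_split` on synthetic labels; the data is invented, and the 25% test share is simply the function's default rather than a choice made in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary target: 80 zeros and 20 ones
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

# Share of positive labels in each split
print(y_tr.mean(), y_te.mean())
```

Whether stratification changes the results here depends on how balanced 'Rating_' turns out to be, which `data.value_counts()` above would show.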

#  Baseline Model 1.0: 
First we will try a single DecisionTree

In [None]:
#import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier 
dtc = DecisionTreeClassifier(random_state=42)  
# fit classifier onto training data
dtc.fit(X_train, y_train) 

In [None]:
print('DecisionTree Model training score: {}'.format(dtc.score(X_train, y_train)))

In [None]:
print('DecisionTree Model test score: {}'.format(dtc.score(X_test, y_test)))

<div class="alert alert-block alert-danger">
Our first model is built on a decision tree and the performance is not great. This is done with no regard to hyperparameters, so let's call this our baseline model. If that's the case, let's see how we can improve it. 
</div>

# KNN: K-Nearest-Neighbors
KNN or K-Nearest-Neighbors is a distance based estimator; we will need to scale our data prior to modeling it in order to get best results. 
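
To illustrate why scaling matters for a distance-based estimator, here is a rough sketch with four invented bikes described by two features on very different scales. It finds the nearest neighbor of the first bike by Euclidean distance, once on the raw values and once after `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four invented bikes: [displacement (ccm), fuel capacity (lts)]
X_toy = np.array([
    [600.0, 13.0],    # the bike we search neighbors for
    [610.0, 15.0],
    [650.0, 13.0],
    [1200.0, 14.0],
])

def nearest_to_first(X):
    """Return the row position (1..n-1) closest to row 0 by Euclidean distance."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_nearest = nearest_to_first(X_toy)
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(X_toy))
print(raw_nearest, scaled_nearest)
```

On the raw values the displacement column dominates the distance because its numbers are so much larger. In this toy example the nearest neighbor changes once both features are on a comparable scale.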


In [None]:
# This next step can be done without instantiating a pipeline but the tool makes the process simpler
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


knn_pipeline = Pipeline([('ss', StandardScaler()),
                        ('knn', KNeighborsClassifier())])

In [None]:
# Fit our knn_pipeline on the training data, then score it on both the training and test data
knn_pipeline.fit(X_train,y_train)

print('KNN Training performance:{}'.format(knn_pipeline.score(X_train,y_train)))
print('KNN Test performance:{}'.format(knn_pipeline.score(X_test,y_test)))

Performance on our KNN model is not bad, but let's see if there is something that can give us a better test performance. 

# Bagged Decision Tree Model: 

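Bagging, short for bootstrap aggregating, is a form of ensemble training: several models are each fit on a random sample of the training data drawn with replacement, and their predictions are combined.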
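
The idea can be sketched by hand in a few lines. This is an illustration on synthetic data (a stand-in for the bikes dataset, which is an assumption of the sketch), not a description of how `BaggingClassifier` is implemented internally. For simplicity it combines the trees by majority vote, whereas scikit-learn's `BaggingClassifier` averages predicted probabilities when the base estimator provides them:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the bikes dataset (an assumption for the sketch)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(20):
    # Bootstrap: sample row positions with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeClassifier(max_depth=5, random_state=42)
    trees.append(tree.fit(X_toy[idx], y_toy[idx]))

# Aggregate: majority vote across the individual trees
votes = np.stack([t.predict(X_toy) for t in trees])   # shape (20, 200)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print('Agreement with labels:', (bagged_pred == y_toy).mean())
```

Because each tree sees a slightly different sample, their individual errors tend to differ, and combining them usually reduces variance compared with a single tree.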

In [None]:
# import BaggingClassifier
from sklearn.ensemble import BaggingClassifier
b_tree =  BaggingClassifier(DecisionTreeClassifier(criterion='entropy', max_depth=5), 
                                 n_estimators=20, random_state = 42)
# fit model on training data
b_tree.fit(X_train, y_train)

In [None]:
# Training data score
print('Bagged model training score: {}'.format(b_tree.score(X_train, y_train)))


In [None]:
# Testing data score
print('Bagged model test score: {}'.format(b_tree.score(X_test, y_test)))


<div class="alert alert-block alert-success">
Our model has made improvements and does not appear to be overfitting. We can move forward with these parameters but first, let's see if GridSearchCV can build a better model. 
</div>

# DecisionTree Model 2.0: 

We will try using cross-validation to see if we can up the score of our base Decision Tree Model 1.0. Cross-validation gives a more reliable estimate of how a model generalizes and so helps to detect overfitting; even though it did not appear that our model overfit the data, it is still worth a try.  

In [None]:
# import cross_val_score from sklearn
from sklearn.model_selection import cross_val_score
# We will re-use 'dtc', our DecisionTreeClassifier() from earlier

dtc_cross_score = cross_val_score(dtc, X_train, y_train, cv = 3)
mean_dtc_cross_score = np.mean(dtc_cross_score)

print('Mean Cross-Validation Score:{}'.format(mean_dtc_cross_score))
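
For intuition, `cross_val_score` with `cv = 3` fits a fresh copy of the model on three different train/validation partitions and reports one score per fold. The sketch below reproduces that loop by hand on synthetic data (invented for illustration) and compares it with the helper:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=150, n_features=5, random_state=42)
model = DecisionTreeClassifier(random_state=42)

manual_scores = []
# For classifiers, an integer cv is treated as stratified K-fold without shuffling
for train_idx, val_idx in StratifiedKFold(n_splits=3).split(X_toy, y_toy):
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[val_idx], y_toy[val_idx]))

auto_scores = cross_val_score(model, X_toy, y_toy, cv=3)
print(np.round(manual_scores, 3), np.round(auto_scores, 3))
```

Note that the scores come from models trained on only two thirds of the training data each, so a mean cross-validation score is an estimate of generalization rather than the score of a new, different model.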

<div class="alert alert-block alert-danger">
The performance on this model is worse than the first base DecisionTree that we put together. We will scrap this version for now and move forward with our Bagged Decision Tree. 
</div>

# Hyperparameter tuning on our First Decision Tree:

We will tune our DecisionTree model using GridSearchCV. <br><br>
GridSearchCV is a very powerful tool that cross-validates every combination in a parameter grid and tells us which one scores best on this dataset. 

In [None]:
# Bring in hyperparameter keys for DecisionTreeClassifier
DecisionTreeClassifier().get_params().keys()

In [None]:
# Define a param_grid for this section and import GridSearchCV
from sklearn.model_selection import GridSearchCV


# DecisionTreeClassifier has no 'n_estimators' parameter, so only random_state is set here.
# The 'log_loss' criterion requires scikit-learn 1.1 or newer
dtc_grid_search = GridSearchCV(estimator = DecisionTreeClassifier(random_state = 42), 
                               param_grid = {
                                   'criterion': ['gini', 'entropy','log_loss'],
                                   'max_depth': [10,20,30],
                                   'random_state': [42]
                               }, 
                                  cv = 3
                                  )

#Fit GridSearchCV function to our data
dtc_grid_search.fit(X_train, y_train)

In [None]:
# Training score of the best estimator found by the grid search
dtc_training_score = dtc_grid_search.score(X_train, y_train)

# Test score of the best estimator
dtc_testing_score = dtc_grid_search.score(X_test, y_test)

print(f"Training Score: {dtc_training_score :.2%}")
print(f"Test Score: {dtc_testing_score :.2%}")
print("Best Parameter Combination Found During Grid Search:")
dtc_grid_search.best_params_
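
A fitted `GridSearchCV` exposes two different kinds of score, and it is easy to mix them up. `best_score_` is the mean cross-validated score of the winning parameter combination, while `.score()` evaluates the best estimator after it has been refit on the full training data. A sketch on synthetic data (the dataset and the small grid are assumptions made for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 8]},
    cv=3,
)
search.fit(X_tr, y_tr)

print('Best params:', search.best_params_)
print('Mean CV score of best params:', search.best_score_)
print('Refit model, training score:', search.score(X_tr, y_tr))
print('Refit model, test score:', search.score(X_te, y_te))
```

Comparing `best_score_` with the test score gives a sense of whether the cross-validated estimate was optimistic.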

# Hyperparameter tuning on our Bagged Decision Tree:
Since we tuned the hyperparameters on our worst performing model, let's see if we can improve the score on our best performing model. 

In [None]:
# Since we are hyperparameter tuning a BaggingClassifier for the first time, let's see what exactly
# we are allowed to tune. Further reading : 
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
from sklearn.ensemble import BaggingClassifier
BaggingClassifier().get_params().keys()

In [None]:
classifier = BaggingClassifier(DecisionTreeClassifier(criterion='entropy', max_depth=5), 
                                 n_estimators=20, random_state = 42)
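
When a `BaggingClassifier` wraps a decision tree, the tree's own hyperparameters are exposed under nested names of the form `<prefix>__<parameter>`, and those nested names are what a grid search over the tree's settings has to use. The prefix depends on the scikit-learn version (`estimator` in newer releases, `base_estimator` in older ones), so this sketch looks it up rather than assuming one:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                        n_estimators=20, random_state=42)

# Nested parameter names belonging to the wrapped tree
nested = [name for name in bag.get_params() if name.endswith('__max_depth')]
print(nested)

# set_params accepts the same nested name
bag.set_params(**{nested[0]: 10})
print(bag.get_params()[nested[0]])
```

Passing a plain `'max_depth'` key in a `param_grid` for the bagging model would be rejected, because `BaggingClassifier` itself has no such parameter.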

In [None]:
# For GridSearchCV, we can set possible parameters in this cell and fit GridsearchCV to cross-validate
# all possible parameters for best outcome. 

# param_grid = {
# #     'base_estimator': [None], # This will default to a DecisionTreeClassifier
#     'bootstrap': [True], # Default option, samples are drawn with replacement
#     'bootstrap_features': [True], # Draw features with replacement (the default is False)
#     'max_features': [1,2,3], # Number of features to draw from X to train each base_estimator
#     'max_samples': [1,2,3,4], # Number of samples to draw from X to train each base_estimator
#     'n_estimators': [20,50,100], # Number of base_estimators in ensemble
# #     'n_jobs': [-1], # -1 uses all processors to fit and predict on model
#     'oob_score': [True,False], # Whether or not to use 'out-of-bag' samples to estimate the generalization error
#     'random_state': [42], # The answer to the universe
#     'verbose': [0], # Controls how much progress output is printed while fitting
#     'warm_start': [True,False] # When 'True', reuses previous solution and adds estimators to ensemble
    
# }

In [None]:
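# Fun Fact: This will take around 3-4 minutes to run, 'Alright' by Kendrick Lamar is a good choice

# We can re-use our Bagged DecisionTree, b_tree, as the estimator to tune
# import GridSearchCV
from sklearn.model_selection import GridSearchCV

# The tree's parameters are nested inside the bagging model, so they need a prefix.
# The prefix is 'estimator' in newer scikit-learn releases and 'base_estimator' in older ones
prefix = 'estimator' if 'estimator' in b_tree.get_params() else 'base_estimator'

# Instantiate GridSearchCV
b_tree_grid_search = GridSearchCV(b_tree, # b_tree is our bagged DecisionTree
                                  param_grid = {
                                   prefix + '__criterion': ['gini', 'entropy','log_loss'],
                                   prefix + '__max_depth': [10,20,30],
                                   'random_state': [42]
                               }, 
                                  cv = 3 
                                  )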

#Fit GridSearchCV function to our data
b_tree_grid_search.fit(X_train, y_train)

In [None]:
# Training score of the best estimator found by the grid search
b_tree_training_score = b_tree_grid_search.score(X_train, y_train)

# Test score of the best estimator
b_tree_testing_score = b_tree_grid_search.score(X_test, y_test)

print(f"Training Score: {b_tree_training_score :.2%}")
print(f"Test Score: {b_tree_testing_score :.2%}")
print("Best Parameter Combination Found During Grid Search:")
b_tree_grid_search.best_params_