# Phase 3 Project

Our data science team has been contracted by a fairly new, up-and-coming motorcycle manufacturer that is struggling to find its formula for sales success. Being a company of motorcycle riders, they have tasked our team with coming up with a plan to replicate the successes of platforms like the Yamaha MT-07 and the Suzuki SV650. We will use the 'all_bikez_curated' dataset from Kaggle, curated by Emmanuel F. Werr.


<img src="pictures/092220-2021-bmw-m1000rr-f.webp"  />
<center>2022 BMW M1000RR</center>

In [1]:
# Import warnings 
import warnings
warnings.filterwarnings('ignore')

# Import pandas to read our data
import pandas as pd
# 'all_bikez_curated.csv', with a 'z'
df = pd.read_csv('all_bikez_curated.csv')
# Show our data 
df.head()

Unnamed: 0,Brand,Model,Year,Category,Rating,Displacement (ccm),Power (hp),Torque (Nm),Engine cylinder,Engine stroke,...,Dry weight (kg),Wheelbase (mm),Seat height (mm),Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension,Color options
0,acabion,da vinci 650-vi,2011,Prototype / concept model,3.2,,804.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
1,acabion,gtbo 55,2007,Sport,2.6,1300.0,541.0,420.0,In-line four,four-stroke,...,360.0,,,,,,,,,
2,acabion,gtbo 600 daytona-vi,2011,Prototype / concept model,3.5,,536.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
3,acabion,gtbo 600 daytona-vi,2021,Prototype / concept model,,,536.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
4,acabion,gtbo 70,2007,Prototype / concept model,3.1,1300.0,689.0,490.0,In-line four,four-stroke,...,300.0,,,,,,,,,Custom made.


## Ever heard of 'acabion'? 

<img src="pictures/acabion_2.jpg" />
<center>Acabion GTBO450 (Not listed in dataset)</center>

Let's look at the unique values for 'Brand' as it would immediately appear that we probably aren't familiar with all of these motorcycles. Also, at ~700 hp, this thing looks like a literal DEATHTRAP.

In [2]:
# .unique() will give you a view of all of the potential outcomes in 'Brand'
df['Brand'].unique()

array(['acabion', 'access', 'ace', 'adiva', 'adler', 'adly', 'aeon',
       'aermacchi', 'agrati', 'ajp', 'ajs', 'alfer', 'alligator',
       'allstate', 'alphasports', 'alta', 'amazonas', 'american eagle',
       'american ironhorse', 'apc', 'aprilia', 'apsonic', 'arch',
       'arctic cat', 'ardie', 'ariel', 'arlen ness', 'arqin', 'askoll',
       'aspes', 'ather', 'atk', 'atlas honda', 'aurora',
       'avanturaa choppers', 'avinton', 'avon', 'azel', 'bajaj', 'balkan',
       'baltmotors', 'bamx', 'baotian', 'barossa', 'batavus', 'beeline',
       'benelli', 'bennche', 'beta', 'better', 'big bear choppers',
       'big dog', 'bimota', 'bintelli', 'black douglas', 'blackburne',
       'blata', 'bluroc', 'bmc choppers', 'bmw', 'boom trikes', 'borile',
       'boss hoss', 'bourget', 'bpg', 'brammo', 'bridgestone', 'britten',
       'brixton', 'brockhouse', 'brough superior', 'brudeli', 'bsa',
       'buccimoto', 'buell', 'bullit', 'bultaco', 'cagiva',
       'california scooter', 'can-

## Brands you've never heard of aside, what are we looking for? 

There have historically been a huge number of manufacturers with varying degrees of success, but we're specifically looking for diamonds, regardless of country of origin, make, or model. If our goal is critical acclaim, let's look at what the critics say, but first let's make sure that the critics actually said something. 

In [3]:
# Drop any null values in 'Rating'
df = df.dropna(subset=['Rating'])

In [4]:
# Preview first five rows and admire our work
df.head()

Unnamed: 0,Brand,Model,Year,Category,Rating,Displacement (ccm),Power (hp),Torque (Nm),Engine cylinder,Engine stroke,...,Dry weight (kg),Wheelbase (mm),Seat height (mm),Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension,Color options
0,acabion,da vinci 650-vi,2011,Prototype / concept model,3.2,,804.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
1,acabion,gtbo 55,2007,Sport,2.6,1300.0,541.0,420.0,In-line four,four-stroke,...,360.0,,,,,,,,,
2,acabion,gtbo 600 daytona-vi,2011,Prototype / concept model,3.5,,536.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
4,acabion,gtbo 70,2007,Prototype / concept model,3.1,1300.0,689.0,490.0,In-line four,four-stroke,...,300.0,,,,,,,,,Custom made.
13,access,ams 8.57 lux lv,2016,ATV,3.9,781.0,56.3,,Single cylinder,four-stroke,...,362.0,,,Single disc,Single disc,25/8-14,25/10-14,,,"Black, white, red"


<div class="alert alert-block alert-info">
There are a few things that we should look into prior to making a train_test_split; what do we believe will contribute to a higher rating for a motorcycle? Is there any data here that we can identify that may or may not give us those answers? Specifically we should see if we can make any sense out of all of the null values in this dataset. 
</div>
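Before deciding what to do with the nulls, it can help to see how many each column has. The sketch below is one way to do that; `null_summary` is a hypothetical helper (not part of the original notebook) and the toy frame uses made-up values in place of the real dataset.

```python
import numpy as np
import pandas as pd


def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of null values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_share": counts / len(frame),
    })
    return summary.sort_values("null_count", ascending=False)


# Toy stand-in for the bikes data (values are made up for illustration)
toy = pd.DataFrame({
    "Rating": [3.2, 2.6, 3.5, 3.1],
    "Power (hp)": [804.0, 541.0, np.nan, 689.0],
    "Torque (Nm)": [np.nan, 420.0, np.nan, 490.0],
})

print(null_summary(toy))
```

Calling `null_summary(df)` on the real dataframe should give the same kind of table, with the sparsest columns at the top.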

# What makes a rating good? 

In [5]:
# .unique() will give you all of the potential scores found in the 'Rating' column 
df['Rating'].unique()

array([3.2, 2.6, 3.5, 3.1, 3.9, 2.8, 2.9, 1.9, 3.4, 2.2, 3.3, 3. , 2.7,
       2.5, 3.7, 2. , 3.6, 3.8, 2.1, 2.4, 4. , 4.1, 2.3, 1.8, 4.2, 1.7,
       4.4, 1.4, 4.3, 4.6, 1.6, 4.5, 1.5])

<div class="alert alert-block alert-info">
Since it would appear that our ratings work on a 5 point system, let's take a look at the motorcycles whose sales performance we would like to replicate: the Yamaha MT-07 and the Suzuki SV650.
</div>

# Yamaha MT-07 & Suzuki SV650

If like me, you are an avid motorcycle rider, you know, have known, or are someone on one of these bikes. 

<center>
<table><tr>
<td> <img src="pictures/yamaha mt07.jfif" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="pictures/suzuki sv650.jfif" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>
MT-07 and SV650, respectively</center>

In [6]:
# Find all included models that are an MT-07
df_yamaha = df.loc[df['Model'] == 'mt-07']
df_yamaha.head(1)

Unnamed: 0,Brand,Model,Year,Category,Rating,Displacement (ccm),Power (hp),Torque (Nm),Engine cylinder,Engine stroke,...,Dry weight (kg),Wheelbase (mm),Seat height (mm),Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension,Color options
35463,yamaha,mt-07,2014,Naked bike,3.5,689.0,47.3,87.5,Twin,four-stroke,...,175.0,1440.0,815.0,Double disc. Hydraulic.,Single disc. Hydraulic.,120/70-ZR17,180/55-ZR17,Telescopic forks,"Swingarm, (Link type suspension)","Black, white"


In [7]:
# Find all models that have the name SV650
df_suzuki = df.loc[df['Model'] == 'sv650']
df_suzuki.head(1)

Unnamed: 0,Brand,Model,Year,Category,Rating,Displacement (ccm),Power (hp),Torque (Nm),Engine cylinder,Engine stroke,...,Dry weight (kg),Wheelbase (mm),Seat height (mm),Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension,Color options
30500,suzuki,sv650,2008,Naked bike,3.9,645.0,,,V2,four-stroke,...,168.0,1440.0,800.0,Double disc,Single disc,120/60-ZR17,160/60-ZR17,"Telescopic, coil spring, oil damped, fully adj...","Link-type, 7-way adjustable spring preload","Blue, Gray"


In [8]:
# Bring in numpy
import numpy as np

In [9]:
# Average of the distinct rating values for the MT-07 (.unique() counts each repeated score only once)
yamaha_mean = np.mean(df_yamaha['Rating'].unique())
print('The average score of all year-model MT-07 in dataset:{}'.format(yamaha_mean))

The average score of all year-model MT-07 in dataset:3.35


In [10]:
# Average of the distinct rating values for the SV650 (.unique() counts each repeated score only once)
suzuki_mean = np.mean(df_suzuki['Rating'].unique())
print('The average score of all year-model SV650 in dataset:{}'.format(suzuki_mean))

The average score of all year-model SV650 in dataset:3.457142857142857


In [11]:
# Nothing fancy here, find the average of the two scores
rating_success = (yamaha_mean + suzuki_mean) / 2
print('Ultimately, this will be our ratings goal:{}'.format(rating_success))

Ultimately, this will be our ratings goal:3.4035714285714285
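The averages above are taken over `.unique()` ratings, so a score that repeats across model years only counts once. The sketch below, using made-up ratings rather than the real MT-07 or SV650 rows, shows how that can differ from a plain mean over every row.

```python
import pandas as pd

# Hypothetical ratings for three model years of one bike
ratings = pd.Series([3.5, 3.5, 3.2])

mean_of_distinct = ratings.unique().mean()   # averages 3.5 and 3.2 once each
mean_of_rows = ratings.mean()                # averages all three rows

print(round(mean_of_distinct, 2), round(mean_of_rows, 2))
```

Either choice is defensible as a benchmark; the point is only that the two numbers are not interchangeable.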


In [12]:
# Preview data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21788 entries, 0 to 38461
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brand                21788 non-null  object 
 1   Model                21784 non-null  object 
 2   Year                 21788 non-null  int64  
 3   Category             21788 non-null  object 
 4   Rating               21788 non-null  float64
 5   Displacement (ccm)   21684 non-null  float64
 6   Power (hp)           14892 non-null  float64
 7   Torque (Nm)          10064 non-null  float64
 8   Engine cylinder      21778 non-null  object 
 9   Engine stroke        21778 non-null  object 
 10  Gearbox              19569 non-null  object 
 11  Bore (mm)            17184 non-null  float64
 12  Stroke (mm)          17184 non-null  object 
 13  Fuel capacity (lts)  18986 non-null  float64
 14  Fuel system          14546 non-null  object 
 15  Fuel control         13444 non-null 

# Preprocessing data for train_test_split

## There's no replacement for displacement

In [13]:
# 'Displacement (ccm)' would be how many cubic centimeters of engine displacement/size a bike's 
# engine has
print(df['Displacement (ccm)'].value_counts())
print()
print('There are {} null values in "Displacement (ccm)" '.format(df['Displacement (ccm)'].isnull().sum()))

124.0    810
125.0    727
249.0    722
49.0     644
998.0    307
        ... 
464.0      1
216.0      1
329.6      1
99.6       1
59.9       1
Name: Displacement (ccm), Length: 1083, dtype: int64

There are 104 null values in "Displacement (ccm)" 


<div class="alert alert-block alert-info">
With only 104 null values in 'Displacement (ccm)', dropping those rows costs us very little data. Engine displacement also varies widely from one motorcycle to the next, so filling in the gaps would be guesswork. 
</div>

In [14]:
# Drop null values for displacement
df = df.dropna(subset =['Displacement (ccm)'])

In [15]:
# Preview dataset and identify what other work needs to be done
df.head()

Unnamed: 0,Brand,Model,Year,Category,Rating,Displacement (ccm),Power (hp),Torque (Nm),Engine cylinder,Engine stroke,...,Dry weight (kg),Wheelbase (mm),Seat height (mm),Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension,Color options
1,acabion,gtbo 55,2007,Sport,2.6,1300.0,541.0,420.0,In-line four,four-stroke,...,360.0,,,,,,,,,
4,acabion,gtbo 70,2007,Prototype / concept model,3.1,1300.0,689.0,490.0,In-line four,four-stroke,...,300.0,,,,,,,,,Custom made.
13,access,ams 8.57 lux lv,2016,ATV,3.9,781.0,56.3,,Single cylinder,four-stroke,...,362.0,,,Single disc,Single disc,25/8-14,25/10-14,,,"Black, white, red"
25,access,shade xtreme 850,2018,ATV,2.8,781.0,57.7,,Single cylinder,four-stroke,...,329.0,,,Single disc,Single disc,25/8-12,25/10-12,Independent,Independent,"White, black"
35,adiva,ad 125,2009,Scooter,2.8,124.0,14.8,12.0,Single cylinder,four-stroke,...,175.0,1759.0,,Single disc,Single disc,120/70-14,140/70-14,Telescopic fork,Twin shock,"Silver, black"


<div class="alert alert-block alert-info">
Now that we have preprocessed one column in our dataset, let's take a look at what other columns we should consider processing. 
</div>

In [16]:
# Number of null values in df['Displacement (ccm)']
df['Displacement (ccm)'].isnull().sum()

0

In [17]:
# Preview data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21684 entries, 1 to 38461
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brand                21684 non-null  object 
 1   Model                21680 non-null  object 
 2   Year                 21684 non-null  int64  
 3   Category             21684 non-null  object 
 4   Rating               21684 non-null  float64
 5   Displacement (ccm)   21684 non-null  float64
 6   Power (hp)           14804 non-null  float64
 7   Torque (Nm)          10007 non-null  float64
 8   Engine cylinder      21674 non-null  object 
 9   Engine stroke        21674 non-null  object 
 10  Gearbox              19511 non-null  object 
 11  Bore (mm)            17184 non-null  float64
 12  Stroke (mm)          17184 non-null  object 
 13  Fuel capacity (lts)  18973 non-null  float64
 14  Fuel system          14544 non-null  object 
 15  Fuel control         13442 non-null 

<div class="alert alert-block alert-warning">
There are a few considerations to make: there are 21684 total entries, so any column with a lower non-null count contains null values. There are, however, a few things to keep in mind when deciding whether to drop the rows with null values or to drop the column altogether. 
<br><br>    
Looking specifically at 'Torque (Nm)', a non-null value implies that the motorcycle in question has at one point or another been put on a dyno to read out torque specs. While this is typically considered the "fun" value in a bike, let's first make a model with all null values in 'Torque (Nm)' expunged. Once we have that model created and fitted, let's also look at a model that does not take torque figures into account. 
</div>
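The two options described above, keeping torque and losing rows versus keeping rows and losing torque, can be sketched side by side. The toy frame below is made up for illustration, and the names `with_torque` and `without_torque` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Rating": [3.5, 3.9, 2.8, 3.3],
    "Displacement (ccm)": [689.0, 645.0, 124.0, 124.0],
    "Torque (Nm)": [87.5, np.nan, 12.0, np.nan],
})

# Option 1: keep the torque column, lose every row that lacks a reading
with_torque = toy.dropna(subset=["Torque (Nm)"])

# Option 2: keep every row, lose the torque column
without_torque = toy.drop(columns="Torque (Nm)")

print(with_torque.shape, without_torque.shape)
```

The trade-off is fewer rows with more features against more rows with fewer features.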



A dynamometer test is typically used to tell you the torque capabilities in your engine. Developed first in 1798, the technology has come a _long_ way since. <br><br>

<img src="pictures/DIY_dyno_complete1.jpg.crdownload" width = '500'/>
<center>Photo of DIY Dyno build from skrunkwerks.com</center><br><br>
Modern dynamometer testing is done by pushing air into the air filter (or turbo if you're on a drag monster) in order to simulate flow at speed and to push cool air into the radiator. By strapping the bike to the dyno, we are then able to get power readouts via sensors in the flywheel where the rear tire can put power to the device. <br><br>
I suspect that the bikes that do not have a torque reading either did not get tested due to extenuating circumstances (smaller displacement bikes, cruisers, and dirt bikes are usually not tested) or, the modern dynamometer was not available for some years of model submission. 

# Model with torque numbers intact

As suggested, we will first look at bikes that _have_ 'Torque (Nm)' information. 

In [18]:
# Remove null values from df['Torque (Nm)']
df = df.dropna(subset =['Torque (Nm)'])

In [19]:
# Preview our data to see what other columns need pruning
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10007 entries, 1 to 38298
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brand                10007 non-null  object 
 1   Model                10005 non-null  object 
 2   Year                 10007 non-null  int64  
 3   Category             10007 non-null  object 
 4   Rating               10007 non-null  float64
 5   Displacement (ccm)   10007 non-null  float64
 6   Power (hp)           9092 non-null   float64
 7   Torque (Nm)          10007 non-null  float64
 8   Engine cylinder      10001 non-null  object 
 9   Engine stroke        10001 non-null  object 
 10  Gearbox              9045 non-null   object 
 11  Bore (mm)            7764 non-null   float64
 12  Stroke (mm)          7764 non-null   object 
 13  Fuel capacity (lts)  8634 non-null   float64
 14  Fuel system          7709 non-null   object 
 15  Fuel control         5886 non-null  

<div class="alert alert-block alert-warning">
There are a few considerations that we should make in regards to null values and whether or not we believe the affected rows should be included in the first place. Since we have not yet encoded the categorical values as numbers, and therefore cannot yet see their correlation to our target, the responsible choice is to only use rows with FULL information. <br><br> 
    
In other words, we should rid our dataset of null values prior to cutting out any column information. 
    
<br>
This isn't always the case, but for our purposes:
</div>

# If the value is null, it has to go

Because 'Color options' is a largely subjective matter, and motorcycle manufacturers typically introduce new colorways over the 4-5 year life of a model generation, this column will ultimately be unnecessary for our model. 

In [20]:
# Remove df['Color options']
df = df.drop(columns='Color options')

In [21]:
# Drop null values data that we intend on using for modeling
df = df.dropna(subset =['Model', 'Power (hp)', 'Engine cylinder', 'Engine stroke', 'Gearbox', 'Bore (mm)', 'Stroke (mm)',
                       'Fuel capacity (lts)', 'Fuel system', 'Fuel control', 'Cooling system', 'Transmission type', 
                       'Dry weight (kg)', 'Wheelbase (mm)', 'Seat height (mm)', 'Front brakes', 'Rear brakes', 'Front tire',
                       'Rear tire', 'Front suspension', 'Rear suspension', 'Rating'])

In [22]:
# Preview our new dataset
df.head()

Unnamed: 0,Brand,Model,Year,Category,Rating,Displacement (ccm),Power (hp),Torque (Nm),Engine cylinder,Engine stroke,...,Transmission type,Dry weight (kg),Wheelbase (mm),Seat height (mm),Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension
195,aeon,cobra 50,2012,ATV,2.6,49.3,3.0,3.7,Single cylinder,two-stroke,...,Shaft drive,129.0,1050.0,800.0,Expanding brake (drum brake),Single disc,19/7-8,18/10-8,"Dual hydraulic shock, Single A-arm","Single hydraulic shock, Unit swing arm"
203,aeon,crossland x4 400,2012,ATV,3.5,346.0,20.1,30.0,Single cylinder,four-stroke,...,Shaft drive,236.0,1230.0,850.0,Double disc,Expanding brake (drum brake),23/7-12,23/10-12,Double A-Arm,Swing Arm
226,aeon,urban 350i,2012,Scooter,3.6,313.0,22.8,30.0,Single cylinder,four-stroke,...,Chain,177.0,1545.0,815.0,Single disc. Hydraulic,Single disc. Hydraulic,120/70-16,140/70-15,Telescopic fork,Dual-damper unit swing arm
361,ajp,pr4 125 enduro,2010,Enduro / offroad,3.3,124.0,12.5,8.5,Single cylinder,four-stroke,...,Chain,105.0,1410.0,910.0,Single disc. 2 piston calliper,Single disc. 4 piston calliper,90/90-21,120/90-18,"Paioli Hydraulic fork,",Sachs mono shock progressive action
365,ajp,pr4 125 sm,2010,Super motard,2.9,124.0,12.6,8.5,Single cylinder,four-stroke,...,Chain,104.0,1410.0,910.0,Single disc. 2 piston caliper,Single disc. 2 piston caliper,100/80-17,130/70-17,"Paioli Hydraulic fork,",Sachs mono shock progressive action


In [23]:
# Information on our new dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2003 entries, 195 to 38298
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brand                2003 non-null   object 
 1   Model                2003 non-null   object 
 2   Year                 2003 non-null   int64  
 3   Category             2003 non-null   object 
 4   Rating               2003 non-null   float64
 5   Displacement (ccm)   2003 non-null   float64
 6   Power (hp)           2003 non-null   float64
 7   Torque (Nm)          2003 non-null   float64
 8   Engine cylinder      2003 non-null   object 
 9   Engine stroke        2003 non-null   object 
 10  Gearbox              2003 non-null   object 
 11  Bore (mm)            2003 non-null   float64
 12  Stroke (mm)          2003 non-null   object 
 13  Fuel capacity (lts)  2003 non-null   float64
 14  Fuel system          2003 non-null   object 
 15  Fuel control         2003 non-null 

# Correlation to 'Rating'

In [24]:
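# Which numeric columns have a non-zero correlation with 'Rating'?
# Compute the correlation with 'Rating' once instead of on every loop iteration
rating_corr = df.corr()['Rating']
preds = []
for i in rating_corr.index:
    if abs(rating_corr[i]) > 0:
        preds.append(i)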

In [25]:
# Correlation to 'Rating', all null values removed
df[preds].corr()

Unnamed: 0,Year,Rating,Displacement (ccm),Power (hp),Torque (Nm),Bore (mm),Fuel capacity (lts),Dry weight (kg),Wheelbase (mm),Seat height (mm)
Year,1.0,-0.138174,0.07997,0.075026,0.104296,0.069367,0.019955,0.060349,0.046049,-0.042666
Rating,-0.138174,1.0,0.256191,0.209496,0.257185,0.212381,0.276855,0.20307,0.17553,0.052586
Displacement (ccm),0.07997,0.256191,1.0,0.665488,0.962458,0.754995,0.649006,0.811249,0.702888,-0.237294
Power (hp),0.075026,0.209496,0.665488,1.0,0.800018,0.538924,0.570315,0.339695,0.314449,0.228673
Torque (Nm),0.104296,0.257185,0.962458,0.800018,1.0,0.738544,0.6704,0.729453,0.637432,-0.100945
Bore (mm),0.069367,0.212381,0.754995,0.538924,0.738544,1.0,0.483364,0.51909,0.56713,0.038141
Fuel capacity (lts),0.019955,0.276855,0.649006,0.570315,0.6704,0.483364,1.0,0.658547,0.572765,0.038698
Dry weight (kg),0.060349,0.20307,0.811249,0.339695,0.729453,0.51909,0.658547,1.0,0.798966,-0.43277
Wheelbase (mm),0.046049,0.17553,0.702888,0.314449,0.637432,0.56713,0.572765,0.798966,1.0,-0.318805
Seat height (mm),-0.042666,0.052586,-0.237294,0.228673,-0.100945,0.038141,0.038698,-0.43277,-0.318805,1.0
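
Beyond correlation to 'Rating', the table above also shows features that track each other very closely, such as 'Displacement (ccm)' and 'Torque (Nm)' at roughly 0.96. The sketch below is one way to list such pairs; `correlated_pairs` and its `threshold` are hypothetical, and the data is synthetic, built so that torque follows displacement.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, target, threshold=0.9):
    """List feature pairs (excluding the target) whose |correlation| exceeds threshold."""
    corr = frame.drop(columns=target).corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs


rng = np.random.default_rng(42)
displacement = rng.uniform(50, 1300, size=200)
toy = pd.DataFrame({
    "Rating": rng.uniform(1.5, 4.5, size=200),
    "Displacement (ccm)": displacement,
    # Torque built to track displacement closely, mimicking the table above
    "Torque (Nm)": displacement * 0.1 + rng.normal(0, 5, size=200),
    "Seat height (mm)": rng.uniform(700, 950, size=200),
})

print(correlated_pairs(toy, target="Rating"))
```

Tree-based models tolerate correlated features reasonably well, but knowing about them helps when interpreting which features matter.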


To round out our preprocessing step, we should take out any unnecessary information, like 'Brand' and 'Model'.  

In [26]:
# Drop df['Brand']
df = df.drop(columns='Brand')

In [27]:
# Drop df['Model']
df = df.drop(columns='Model')

<div class="alert alert-block alert-info">
Now that we have pruned all null information from our data set, let's address the remaining categorical data by dummying it. 
</div>

# Dummy categorical columns

Let's collect our categorical columns into a list and dummy them into binary columns.

In [28]:
# Categorical/object columns can be found from df.info(), above
categoricals = ['Category','Engine cylinder', 'Engine stroke', 'Gearbox', 'Stroke (mm)', 'Fuel system', 'Fuel control', 
                'Cooling system', 'Transmission type', 'Front brakes', 'Rear brakes', 'Front tire', 'Rear tire', 
                'Front suspension', 'Rear suspension']

# Our dataframe needs to be encoded. OneHotEncoder (OHE) could also be used; this time we will use 
# pd.get_dummies()
df = pd.get_dummies(df, columns=categoricals)
# Preview our dummied columns
df.head()

Unnamed: 0,Year,Rating,Displacement (ccm),Power (hp),Torque (Nm),Bore (mm),Fuel capacity (lts),Dry weight (kg),Wheelbase (mm),Seat height (mm),...,"Rear suspension_Öhlins TTX36 twin tube Monoshock with rebound and compression damping,","Rear suspension_Öhlins TTX36 twin tube monoshock with piggy back reservoir, adjustable, rebound and compression damping.","Rear suspension_Öhlins TTX36 twin tube monoshock with preload, rebound and compression damping","Rear suspension_Öhlins TTX36 twin tube monoshock with preload, rebound and compression damping,","Rear suspension_Öhlins electronic suspension w/ single shock w/piggyback reservoir, 4-way adjustable","Rear suspension_Öhlins fully adjustable monoshock, Aluminium casted single-sided swingarm.","Rear suspension_Öhlins mono-shock with integrated piggy-back, fully adjustable in spring preload with full adjustment","Rear suspension_Öhlins monoshock, pre-load and rebound adjustable",Rear suspension_Öhlins shock,"Rear suspension_Öhlins, 300 mm wheel travel, adjustable pre-load, compression and rebound"
195,2012,2.6,49.3,3.0,3.7,40.0,5.0,129.0,1050.0,800.0,...,0,0,0,0,0,0,0,0,0,0
203,2012,3.5,346.0,20.1,30.0,82.0,14.0,236.0,1230.0,850.0,...,0,0,0,0,0,0,0,0,0,0
226,2012,3.6,313.0,22.8,30.0,78.0,13.5,177.0,1545.0,815.0,...,0,0,0,0,0,0,0,0,0,0
361,2010,3.3,124.0,12.5,8.5,56.5,7.5,105.0,1410.0,910.0,...,0,0,0,0,0,0,0,0,0,0
365,2010,2.9,124.0,12.6,8.5,59.5,7.5,104.0,1410.0,910.0,...,0,0,0,0,0,0,0,0,0,0


<div class="alert alert-block alert-success">
It looks like our columns were successfully dummied. Now we will need to split our data for modeling. 
</div>
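The dummied preview above ends in a long run of nearly unique 'Rear suspension_...' columns, because free-text spec fields rarely repeat exactly. One possible way to tame that, which the notebook itself does not do, is to fold rare categories into a single label before dummying. `collapse_rare`, `min_count`, and the toy values below are all hypothetical.

```python
import pandas as pd


def collapse_rare(series: pd.Series, min_count: int = 2, other: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times with a single label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)


toy = pd.DataFrame({
    "Rear suspension": [
        "Swing arm", "Swing arm", "Twin shock", "Twin shock",
        "Öhlins shock", "Öhlins TTX36 twin tube monoshock",
    ]
})

raw = pd.get_dummies(toy, columns=["Rear suspension"])
toy["Rear suspension"] = collapse_rare(toy["Rear suspension"])
collapsed = pd.get_dummies(toy, columns=["Rear suspension"])

print(raw.shape[1], collapsed.shape[1])
```

Fewer, denser dummy columns give a tree less opportunity to memorize individual bikes.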

## train_test_split


In [29]:
# target is 'y'
target = df['Rating']
# X is the new name that we will use for our data. For testing purposes we may re-use 
#'df' if it produces better results.
X = df.drop(columns='Rating')

In [30]:
#import train_test_split
from sklearn.model_selection import train_test_split

# Create variables for modeling
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=42)
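
With no `test_size` given, `train_test_split` holds out 25% of the rows. The numpy-only sketch below illustrates the idea of a shuffled 75/25 split on row positions; it is not scikit-learn's implementation, and the rounding of the test size is my assumption about how the library behaves.

```python
import numpy as np

rng = np.random.default_rng(42)

n_rows = 2003                        # rows left after dropping nulls above
shuffled = rng.permutation(n_rows)   # shuffle row positions

n_test = int(np.ceil(0.25 * n_rows))   # 25% held out, rounded up (assumed)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print(len(train_idx), len(test_idx))
```

The key property is that no row position lands in both sets.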

#  Baseline Model 1.0: 
First we will try a single DecisionTree

In [31]:
# import DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor
dtc = DecisionTreeRegressor(random_state=42)  
# fit regressor onto training data
dtc.fit(X_train, y_train) 

DecisionTreeRegressor(random_state=42)

In [32]:
print('DecisionTree Model training score: {}'.format(dtc.score(X_train, y_train)))

DecisionTree Model training score: 0.9926781059884243


In [33]:
print('DecisionTree Model test score: {}'.format(dtc.score(X_test, y_test)))

DecisionTree Model test score: -0.3848992705800627


<div class="alert alert-block alert-danger">
Our first model is a single decision tree, and it badly overfits: the score (R² for a regressor) is 0.99 on the training data but -0.38 on the test data. This was done with no regard to hyperparameters, so let's call this our baseline model and see how we can improve it. 
</div>
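A negative test score can look odd at first. For a regressor, `.score()` reports R², which is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean, so anything below 0 means the model did worse than that trivial guess. The sketch below computes it by hand on made-up numbers; `r_squared` is a hypothetical helper.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the score reported by regressors' .score()."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 3.5, 4.0]          # made-up ratings
always_mean = [3.5, 3.5, 3.5]     # predicting the mean every time
way_off = [4.0, 3.0, 3.0]         # predictions worse than the mean

print(r_squared(y_true, always_mean), r_squared(y_true, way_off))
```

Predicting the mean scores exactly 0, which is the bar our models need to clear on the test data.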

The baseline decision tree leaves plenty of room for improvement, so let's see if there is something that can give us a better test performance. 

# Bagged Decision Tree Model: 

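Bagging, short for bootstrap aggregating, is a form of ensemble model training: each base model is fit on a random sample of the training data drawn with replacement, and the ensemble's prediction is the average of the individual predictions.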
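
To make the two halves of the name concrete, here is a hand-rolled sketch of bootstrap aggregating on synthetic data. It is an illustration only, not how `BaggingRegressor` is implemented, and `X_demo`, `y_demo`, and `X_new` are hypothetical names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic regression data (not the bikes dataset)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=200)

n_estimators = 20
trees = []
for _ in range(n_estimators):
    # Bootstrap: draw len(X_demo) row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(max_depth=5)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Aggregate: average the individual trees' predictions
X_new = np.array([[2.5], [7.0]])
bagged_prediction = np.mean([tree.predict(X_new) for tree in trees], axis=0)
print(bagged_prediction)
```

Because each tree sees a slightly different sample, their individual errors partly cancel when averaged.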

In [34]:
# import BaggingRegressor
from sklearn.ensemble import BaggingRegressor
b_tree =  BaggingRegressor(DecisionTreeRegressor(max_depth=5), 
                                 n_estimators=20, random_state = 42)
# fit model on training data
b_tree.fit(X_train, y_train)

BaggingRegressor(base_estimator=DecisionTreeRegressor(max_depth=5),
                 n_estimators=20, random_state=42)

In [35]:
# Training data score
print('Bagged model training score: {}'.format(b_tree.score(X_train, y_train)))


Bagged model training score: 0.3013774259805192


In [36]:
# Testing data score
print('Bagged model test score: {}'.format(b_tree.score(X_test, y_test)))


Bagged model test score: 0.17352393788263876


<div class="alert alert-block alert-success">
Our model has made improvements and does not appear to be overfitting. Since these are our best scores so far we should keep them in mind but for now let's move onto models 2.0.
</div>

# DecisionTree Model 2.0: Cross-Validation

We will use cross-validation to check the score of our base Decision Tree Model 1.0 across several splits. Cross-validation does not fix overfitting by itself, but it gives a more reliable read on it than a single train/test split, and since the base tree scored far better on training data than on test data, it is worth a try.  

In [37]:
# import cross_val_score from sklearn
from sklearn.model_selection import cross_val_score
# We will re-use 'dtc', our DecisionTreeRegressor() from earlier

dtc_cross_score = cross_val_score(dtc, X_train, y_train, cv = 5)
mean_dtc_cross_score = np.mean(dtc_cross_score)

print('Mean Cross-Validation Score:{}'.format(mean_dtc_cross_score))

Mean Cross-Validation Score:-0.7092948626495937


<div class="alert alert-block alert-danger">
The mean cross-validated score is even lower than the test score of our base DecisionTree, which confirms that the unconstrained tree does not generalize. We will scrap this version for now and move forward with our Bagged Decision Tree. 
</div>
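The mean hides how much the score moves from fold to fold. The sketch below unrolls a 5-fold loop by hand on synthetic data so each fold's R² is visible. It assumes plain unshuffled `KFold` splitting, and `X_demo`, `y_demo`, and `fold_scores` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(150, 3))                  # synthetic features
y_demo = X_demo[:, 0] * 0.5 + rng.normal(0, 1.0, size=150)  # synthetic target

fold_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # .score() on a regressor is R^2 on the held-out fold
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

print(np.round(fold_scores, 3), np.mean(fold_scores))
```

A wide spread between folds would suggest the score depends heavily on which rows end up held out.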

In [39]:
from sklearn.ensemble import RandomForestRegressor
# forest variable, random_state = 42, max_depth = 5
forest = RandomForestRegressor(random_state = 42, max_depth = 5)
# fit regressor on training data
forest.fit(X_train, y_train)

RandomForestRegressor(max_depth=5, random_state=42)

In [40]:
forest.score(X_train, y_train), forest.score(X_test, y_test)

(0.30588509546552856, 0.18970623662520958)

In [41]:
RandomForestRegressor().get_params().keys()

dict_keys(['bootstrap', 'ccp_alpha', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])

In [43]:
forest1 = RandomForestRegressor(random_state = 42, max_depth = 5, n_estimators = 500)
forest1.fit(X_train, y_train)

forest1.score(X_train, y_train), forest1.score(X_test, y_test)

(0.30650789700786696, 0.19344006129729818)

In [45]:
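from sklearn.ensemble import RandomForestClassifier

# A classifier expects discrete class labels, so fitting it on the continuous
# 'Rating' target raises the ValueError shown below
forest_clf = RandomForestClassifier(random_state = 42)


In [46]:
forest_clf.fit(X_train, y_train)
forest_clf.score(X_train, y_train), forest_clf.score(X_test, y_test)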

ValueError: Unknown label type: 'continuous'
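
The error above comes from handing a classifier a continuous target. If we did want to treat this as classification, one hypothetical approach (not something the notebook does) would be to turn 'Rating' into a yes/no label, for example whether a bike meets the ratings goal of roughly 3.4 computed earlier. The ratings below are made up, and `rating_goal` and `meets_goal` are hypothetical names.

```python
import pandas as pd

# Made-up ratings; 3.4 is roughly the ratings goal computed earlier
ratings = pd.Series([2.6, 3.5, 3.1, 3.9, 3.4])
rating_goal = 3.4

# 1 if the bike meets or beats the goal, 0 otherwise
meets_goal = (ratings >= rating_goal).astype(int)

print(meets_goal.tolist())
```

A label like this is discrete, so a classifier can be fit on it, at the cost of discarding how far above or below the goal each bike sits.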

# RandomForestClassifier()
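We can identify and interpret feature importance from our second ensemble model, RandomForestClassifier(). Note that a classifier needs discrete class labels: fit on the continuous 'Rating' target as-is, it raises "ValueError: Unknown label type: 'continuous'", so the cells in this section will only run once 'Rating' has been converted into discrete classes. 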

In [None]:
# import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
# forest variable, random_state = 42, max_depth = 5
forest = RandomForestClassifier(random_state = 42, max_depth = 5)
# fit classifier on training data
forest.fit(X_train, y_train)

In [None]:
print('RandomForest training score: {}'.format(forest.score(X_train, y_train)))

In [None]:
print('RandomForest testing score: {}'.format(forest.score(X_test,y_test)))
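
Feature importance was mentioned at the top of this section, so here is a minimal sketch of reading it off a fitted forest. It uses a `RandomForestRegressor` on synthetic data, since our target is continuous; the column names 'signal' and 'noise' and the variables `X_demo`, `y_demo`, and `demo_forest` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.uniform(0, 10, size=300),   # drives the target
    "noise": rng.uniform(0, 10, size=300),    # unrelated to the target
})
y_demo = 2.0 * X_demo["signal"] + rng.normal(0, 0.5, size=300)

demo_forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)
demo_forest.fit(X_demo, y_demo)

# Importances sum to 1; larger means the feature was more useful for splits
importances = pd.Series(demo_forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

The same `feature_importances_` attribute is available on the forests fitted earlier, indexed by `X_train.columns`.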

<div class="alert alert-block alert-danger">
Though this would not appear to be a successful model, let's begin hyperparameter tuning to see if we can get a score above, say, 80% on both the training and testing data.
</div>

# Hyperparameter tuning on RandomForestClassifier()

In [None]:
RandomForestClassifier().get_params().keys()

In [None]:
# Set a new variable for our second version of the RandomForestClassifier()
# As a reminder: 
# forest = RandomForestClassifier(random_state =42, max_depth = 5)
forest1 = RandomForestClassifier(random_state =42, max_depth =5, n_estimators=10)
    

In [None]:
forest1.fit(X_train,y_train)
print('RandomForestClassifier training score n_estimators = 10: {}'.format(forest1.score(X_train, y_train)))
print('RandomForestClassifier test score n_estimators =10: {}'.format(forest1.score(X_test, y_test)))

In [None]:
forest2 = RandomForestClassifier(criterion = 'gini',random_state = 42, max_depth =5, n_estimators =50)
forest2.fit(X_train,y_train)
print('RandomForestClassifier training score n_estimators = 50: {}'.format(forest2.score(X_train, y_train)))
print('RandomForestClassifier test score n_estimators =50: {}'.format(forest2.score(X_test, y_test)))

<div class="alert alert-block alert-danger">
Not exactly what we had in mind. We have created an ensemble model but are having a hard time determining parameters that will increase score. For this, let's use Random Search. 
</div>

# Random Search Training

In [None]:
# random_grid works the same way that a param_grid would: we will see what parameters are selected as 
# the 'best' parameters and build our next model based on these results. 
# Note: max_features='auto' was removed in scikit-learn 1.3; on newer versions use 'sqrt' only.
random_grid = {'bootstrap': [True, False],
               'max_depth': [10,30,50,70,90,100, None],
               'max_features': ['auto', 'sqrt'], 
               'n_estimators': [200, 400, 600]    
}
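
It is worth counting how many combinations this grid holds before choosing `n_iter`. The sketch below just multiplies the list lengths. As far as I understand, when every entry is a plain list, `RandomizedSearchCV` samples without replacement and will not try more candidates than the grid contains, but treat that as an assumption to check against your scikit-learn version.

```python
from math import prod

grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 30, 50, 70, 90, 100, None],
    "max_features": ["auto", "sqrt"],
    "n_estimators": [200, 400, 600],
}

n_combinations = prod(len(values) for values in grid.values())
n_iter = 100

print(n_combinations, min(n_iter, n_combinations))
```

With fewer combinations than `n_iter`, the "random" search is effectively an exhaustive one, which helps explain the long run times.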

In [None]:
# This takes almost 20 minutes to run, Weird Al Yankovic's 'Sauerkraut' x 2 would be unbearable but...
from sklearn.model_selection import RandomizedSearchCV
rf = RandomForestClassifier()
rf_cv = RandomizedSearchCV(estimator =rf,
                          param_distributions = random_grid, 
                           n_iter = 100,
                          cv = 3
                          )
rf_cv.fit(X_train, y_train)
rf_cv.score(X_train, y_train)
# Expected Output: 0.697560975609756

In [None]:
# The best parameters in this first tuned model
rf_cv.best_params_
# Expected output: {'n_estimators': 600,
#  'max_features': 'auto',
#  'max_depth': 10,
#  'bootstrap': False}

<div class="alert alert-block alert-danger">
Thinking critically on these parameters: 600 estimators was the largest option in our set, so let's allow more in the following model; a max_depth of 10 was the smallest option, so let's also try values in the single digits; and we will keep max_features: 'auto' and 'bootstrap': False.
</div>

# Random Search Training 2.0
Again, taking into account the parameters that were selected as the 'best', let's build our next set of parameters around them, exploring values close to our previous 'best_params_'. 

In [None]:
# Set new parameters for new random_grid1
random_grid1 = {'bootstrap': [False],
               'max_depth': [1,5,7,10],
               'max_features': ['auto'], 
               'n_estimators': [600,800,1000,1200]    
}
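
Both winning values sat at an edge of the first grid (600 was the largest `n_estimators`, 10 the smallest `max_depth`), so it may be worth probing on both sides of each. A minimal sketch of how such a neighbourhood could be generated follows; `neighbours` is a hypothetical helper, and the step sizes are arbitrary choices for illustration, not values taken from this project.

```python
def neighbours(best, step, count=2, minimum=1):
    """Return sorted integer candidates centred on `best`, spaced `step` apart."""
    values = {best + k * step for k in range(-count, count + 1)}
    return sorted(v for v in values if v >= minimum)

# Best values reported by the first random search.
best_params = {'n_estimators': 600, 'max_depth': 10}

refined_grid = {
    'n_estimators': neighbours(best_params['n_estimators'], step=200),
    'max_depth': neighbours(best_params['max_depth'], step=3),
}
print(refined_grid)
# {'n_estimators': [200, 400, 600, 800, 1000], 'max_depth': [4, 7, 10, 13, 16]}
```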

In [None]:
rf_cv = RandomizedSearchCV(estimator=rf,
                           param_distributions=random_grid1,
                           n_iter=100,
                           cv=3)
rf_cv.fit(X_train, y_train)
rf_cv.score(X_train, y_train)

In [None]:
rf_cv.score(X_test, y_test)

In [None]:
rf_cv.best_params_

<div class="alert alert-block alert-danger">
Well, there's a slight improvement, and we are closer to our best score so far, which came from BaggingClassifier(DecisionTreeClassifier()), but let's keep tuning.
</div>

# Random Search Training 3.0

In [None]:
random_grid2 = {'bootstrap': [False],
               'max_depth': [10,11,12,13,14,15,16,17,18,19,20],
               'max_features': ['auto'], 
               'n_estimators': [700, 750, 800, 850, 900]    
}

In [None]:
# 'Stairway to Heaven' like thrice
rf_cv = RandomizedSearchCV(estimator=rf,
                           param_distributions=random_grid2,
                           n_iter=100,
                           cv=5)  # Changed from 3 to 5, probably accounts for the time
rf_cv.fit(X_train, y_train)
rf_cv.score(X_train, y_train)
# Expected output: 0.7346341463414634

In [None]:
rf_cv.score(X_test, y_test)
# Expected output: 0.652046783625731

In [None]:
rf_cv.best_params_
#{'n_estimators': 900,
#  'max_features': 'auto',
#  'max_depth': 12,
#  'bootstrap': False}

In [None]:
random_grid3 = {'bootstrap': [False],
               'max_depth': [12],
               'max_features': ['auto'], 
               'n_estimators': [900, 925, 950, 975]    
}

In [None]:
# Stairway to Heaven twice?
rf_cv = RandomizedSearchCV(estimator=rf,
                           param_distributions=random_grid3,  # the narrowed grid defined above
                           n_iter=100,
                           cv=3)
rf_cv.fit(X_train, y_train)
rf_cv.score(X_train, y_train)
# Output recorded from a run that searched random_grid2: 0.7404878048780488

In [None]:
rf_cv.score(X_test, y_test)
# Output recorded from a run that searched random_grid2: 0.6549707602339181

# Consider using gradient boosting here instead
# Hyperparameter tuning on our First Decision Tree:
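We will tune our DecisionTree model using AdaBoost.

AdaBoost is an ensemble method that fits a sequence of weak learners (our first DecisionTree is a good example). After each round it inflates the weights of the samples the previous learner got wrong, so that the mistakes and the correct predictions carry equal weight (a 50/50 split) and the next learner has to focus on the hard cases.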
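
The re-weighting behind AdaBoost can be sketched with a toy example. This is the classic binary update written with the standard library only, under the assumption of 10 equally weighted samples of which the weak learner misclassifies 3; it is not scikit-learn's internal code, and `AdaBoostClassifier` implements a multiclass variant whose details may differ.

```python
import math

# Toy round: 10 samples with equal weights; the weak learner gets 3 wrong.
n_samples = 10
weights = [1.0 / n_samples] * n_samples
misclassified = [True, True, True] + [False] * 7

# Weighted error of the weak learner and the size of its vote in the ensemble.
error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
alpha = 0.5 * math.log((1 - error) / error)

# Inflate the weights of the mistakes, shrink the rest, then renormalise.
updated = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
total = sum(updated)
updated = [w / total for w in updated]

# Weighted error of the same learner under the new weights.
new_error = sum(w for w, wrong in zip(updated, misclassified) if wrong)
print(round(error, 3), round(new_error, 3))  # 0.3 0.5
```

After the update, the three mistakes hold half of the total weight, so the next weak learner cannot do well by repeating the same predictions.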

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Hyperparameter tuning on our First Decision Tree:

## GridsearchCV

In [None]:
# Bring in hyperparameter keys for DecisionTreeClassifier
DecisionTreeClassifier().get_params().keys()

In [None]:
# Tune our first DecisionTree, dtc, with an exhaustive grid search
# import GridSearchCV
from sklearn.model_selection import GridSearchCV

dtc_grid_search = GridSearchCV(dtc, param_grid = {'criterion': ['gini', 'entropy', 'log_loss'], # Function that measures the quality of a split
    'max_depth': [None,1,5,10], # Default (None) grows until all leaves are pure or until all leaves contain fewer than min_samples_split samples
    'max_leaf_nodes': [None], # Unlimited number of leaf nodes
    'min_impurity_decrease': [0.0],  
    'min_samples_leaf': [1,2,3], 
    'min_samples_split': [2], 
    'min_weight_fraction_leaf': [0.0], 
    'random_state': [42], # As usual
    'splitter': ['best']}, # Strategy is to select the 'best' split at each node
                                  cv = 3)

#Fit GridSearchCV function to our data
dtc_grid_search.fit(X_train, y_train)

In [None]:
# Training score of the best estimator, which GridSearchCV refits on the full training set
dtc_training_score = dtc_grid_search.score(X_train, y_train)

# Test score of the same best estimator
# (the mean cross-validated score is stored separately in dtc_grid_search.best_score_)
dtc_testing_score = dtc_grid_search.score(X_test, y_test)

print(f"Training Score: {dtc_training_score :.2%}")
print(f"Test Score: {dtc_testing_score :.2%}")
print("Best Parameter Combination Found During Grid Search:")
dtc_grid_search.best_params_

# Hyperparameter tuning on our Bagged Decision Tree:
Since we tuned the hyperparameters on our worst performing model, let's see if we can improve the score on our best performing model. 

In [None]:
# Since we are hyperparameter tuning a BaggingClassifier for the first time, let's see what exactly
# we are allowed to tune. Further reading : 
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
BaggingClassifier(DecisionTreeClassifier()).get_params().keys()

In [None]:
bag_classifier = BaggingClassifier(DecisionTreeClassifier())

In [None]:
# For GridSearchCV, we can set the candidate parameters in this cell and fit GridSearchCV to cross-validate
# every combination and find the best outcome.

param_grid = {
#     'base_estimator': [None], # This will default to a DecisionTreeClassifier
    'bootstrap': [True], # Default option: samples are drawn with replacement
    'bootstrap_features': [True], # Features are drawn with replacement (the default is False)
    'max_features': [1,2,3], # Number of features to draw from X to train each base estimator
    'max_samples': [1,2,3,4], # Number of samples to draw from X to train each base estimator (ints are counts, floats are fractions)
    'n_estimators': [20,50,100], # Number of base estimators in the ensemble
#     'n_jobs': [-1], # -1 uses all processors to fit and predict
    'oob_score': [True,False], # Whether to use 'out-of-bag' samples to estimate the generalization error
    'random_state': [42], # The answer to the universe
    'verbose': [0], # Controls how much progress output is printed while fitting and predicting; 0 is silent
    'warm_start': [True,False] # When True, reuses the previous fit and adds estimators to the ensemble (cannot be combined with oob_score=True)
    
}
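
To make 'bootstrap' and 'out-of-bag' concrete, here is a small standard-library sketch of a single bootstrap draw. It assumes each estimator draws as many rows as the dataset holds (1000 hypothetical rows here), which is not what the small integer `max_samples` values in the grid above do; the row count and seed are illustrative only.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

n_rows = 1000
rows = list(range(n_rows))

# Bootstrap sample: draw n_rows row indices *with replacement*.
in_bag = [random.choice(rows) for _ in range(n_rows)]

# Rows that were never drawn are "out-of-bag" for this estimator
# and can serve as a built-in validation set for it.
out_of_bag = set(rows) - set(in_bag)
oob_fraction = len(out_of_bag) / n_rows
print(f"{oob_fraction:.1%} of rows are out-of-bag")
```

On average roughly 37% of the rows end up out-of-bag in a draw like this, which is why `oob_score` can give a usable estimate without a separate hold-out set.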

In [None]:
# Fun Fact: This will take around 3-4 minutes to run, 'Alright' by Kendrick Lamar is a good choice

# Tune the fresh BaggingClassifier, bag_classifier, to try to improve on our Bagged DecisionTree
# import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate GridSearchCV
b_tree_grid_search = GridSearchCV(bag_classifier, 
                                  param_grid, 
                                  cv = 3 
                                  )

#Fit GridSearchCV function to our data
b_tree_grid_search.fit(X_train, y_train)

In [None]:
# Training score of the best estimator, which GridSearchCV refits on the full training set
b_tree_training_score = b_tree_grid_search.score(X_train, y_train)

# Test score of the same best estimator
# (the mean cross-validated score is stored separately in b_tree_grid_search.best_score_)
b_tree_testing_score = b_tree_grid_search.score(X_test, y_test)

print(f"Training Score: {b_tree_training_score :.2%}")
print(f"Test Score: {b_tree_testing_score :.2%}")
print("Best Parameter Combination Found During Grid Search:")
b_tree_grid_search.best_params_