## DAB200 -- Lab 7

In this lab, you will gain some experience in dealing with missing data and further practice converting non-numeric features in a dataset to numeric.

**Target**: to predict `Comb Unadj FE - Conventional Fuel`

**Data set**: will be assigned by the instructor in class

### Part 0

Please provide the following information by editing this cell:
 - Name:  Jalpa Tank
 - Student Number: 0804303

### Part 1 - Create and evaluate an initial model

In this part you should: 
 
 - read in the data
 - isolate all numeric features from original data set
 - fill in any missing values with 0
 - create and evaluate an initial model 
 - use 150 decision trees in your random forest model

#### Code (10 marks)

**Code for reading data:**

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from rfpimp import * 
from pandas.api.types import is_categorical_dtype

data_string = 'veh18_missing.csv'
df = pd.read_csv('veh18_missing.csv')
df.head().T

Unnamed: 0,0,1,2,3,4
Eng Displ,,6.2,2.0,2.0,
# Cyl,6,8,4,4,8
Comb Unadj FE - Conventional Fuel,26.21,21.8108,35.6914,36.7994,18.9885
# Gears,6,8,8,7,6
Max Ethanol % - Gasoline,15.0,10.0,10.0,10.0,15.0
Intake Valves Per Cyl,2,1,2,2,2
Exhaust Valves Per Cyl,2,1,2,2,2
Fuel Metering Sys Desc,Multipoint/sequential fuel injection,Spark Ignition Direct Injection,unknown,unknown,Multipoint/sequential fuel injection
Air Aspiration Method Desc,Naturally Aspirated,Naturally Aspirated,Turbocharged,Turbocharged,Naturally Aspirated
Trans Desc,XXXXX,XXXXX,Semi-Automatic,Automated Manual,XXXXX


**Code for exploring data:**

In [2]:
numfeatures = df[['Eng Displ','# Cyl','# Gears','Max Ethanol % - Gasoline','Intake Valves Per Cyl','Exhaust Valves Per Cyl','Comb Unadj FE - Conventional Fuel']]
numfeatures.isnull().sum()
numfeatures = numfeatures.fillna(0)
numfeatures.isnull().sum()

Eng Displ                            0
# Cyl                                0
# Gears                              0
Max Ethanol % - Gasoline             0
Intake Valves Per Cyl                0
Exhaust Valves Per Cyl               0
Comb Unadj FE - Conventional Fuel    0
dtype: int64

#### Create and evaluate an initial model after isolating numeric features:
Put all your code inside the following function definition.  

Your solution must return:

- **first item:** Mean OOB score of 10 runs
- **second item:** The last (i.e., the 10th) random forest regressor object (fitted)
- **third item:** Feature array
- **fourth item:** Target array

In [3]:
def estimate_mean_oob_score_baseline():
    ### BEGIN SOLUTION
    numfeatures = df[['Eng Displ','# Cyl','# Gears','Max Ethanol % - Gasoline','Intake Valves Per Cyl','Exhaust Valves Per Cyl','Comb Unadj FE - Conventional Fuel']]
    numfeatures = numfeatures.fillna(0)
    X = numfeatures.drop('Comb Unadj FE - Conventional Fuel',axis=1)
    y = df['Comb Unadj FE - Conventional Fuel']
    
    oob_scores = []
    
    for i in range(10):
        rf = RandomForestRegressor(n_estimators=150, n_jobs=-1, oob_score=True)
        rf.fit(X, y)
        oob_scores.append(rf.oob_score_)
    return np.mean(oob_scores),rf,X,y
    ### END SOLUTION

**Running the following cell should not throw any error if your code in the cell above is correct.  Do not edit the following cell.**

In [4]:
baseline_oob_score, rf, X, y = estimate_mean_oob_score_baseline()

num_trees = len(rf.estimators_)

print(baseline_oob_score, num_trees)

if data_string == 'veh14_missing.csv':
    output = 0.696
elif  data_string == 'veh15_missing.csv':
    output = 0.730
elif  data_string == 'veh16_missing.csv':
    output = 0.738
elif  data_string == 'veh17_missing.csv':
    output = 0.721
elif  data_string == 'veh18_missing.csv':
    output = 0.673
elif  data_string == 'veh19_missing.csv':
    output = 0.704
    

if (np.isclose(baseline_oob_score , output, rtol = 0.01) == True) and (num_trees==150):
    part1_marks = 10
    
assert np.isclose(baseline_oob_score , output, rtol = 0.01)

0.6726395284335707 150


In [5]:
import pickle
pickle_out = open("rf.pkl","wb")
pickle.dump(rf,pickle_out)
pickle_out.close()

### Part 2 - Normalize missing values for string data

In this part you should: 
 - use Section 7.4 of the textbook as a guide
 - convert **all** representations of missing data to a **single** representation
 
#### Code (15 marks)

**Code for exploring missing values:**

In [9]:
df.head().T
df['Eng Displ'].unique()
df['Max Ethanol % - Gasoline'].unique()
df['Fuel Metering Sys Desc'].unique()
df['Trans Desc'].unique()
df['Cyl Deact?'].unique()
# df['# Cyl'].unique()
# df['# Gears'].unique()
# df['Intake Valves Per Cyl'].unique()
# df['Exhaust Valves Per Cyl'].unique()
# df['Air Aspiration Method Desc'].unique()
# df['Var Valve Lift?'].unique()

array(['N', '   ', nan, 'Y'], dtype=object)

**Code for normalizing missing values:**

Put all your code inside the following function definition. 

In [14]:
from pandas.api.types import is_string_dtype, is_object_dtype

def normalize_missing_values():
    ### BEGIN SOLUTION
     for col in df.columns:
            if is_string_dtype(df[col]) or is_object_dtype(df[col]):
                df[col] = df[col].str.lower()
                df[col] = df[col].fillna(np.nan) # make None -> np.nan
                df[col] = df[col].replace('   ',np.nan)
                df[col] = df[col].replace('unknown', np.nan)
                df[col] = df[col].replace('none', np.nan)
                df[col] = df[col].replace('xxxxx',np.nan)
                df[col] = df[col].replace('', np.nan)
            return df
    
    ### END SOLUTION

In [16]:
df['Eng Displ'].unique()
df['Max Ethanol % - Gasoline'].unique()
df['Fuel Metering Sys Desc'].unique()
df['Trans Desc'].unique()
df['Cyl Deact?'].unique()

array(['n', nan, 'y'], dtype=object)

**Running the following cell should not throw any error if your code in the cell above is correct.  Do not edit the following cell.**

In [17]:
df = normalize_missing_values()

print(df.head())

vissing_mals = ['not provided', '##-##', '   ', 'none', 'not filled in', '^^', 'unknown', 'XXXXX', 
                'not specified', '*****', '@@@@@', 'none given', '%%%%%', 'missing', 'mod']

col_missing = []

for col in df.columns:
    col_missing.append(all(df[col].isin(vissing_mals)))
    
if sum(col_missing) == 0:
    part2_marks = 15
else:
    part2_marks = 0
    
assert sum(col_missing) == 0

   Eng Displ  # Cyl  Comb Unadj FE - Conventional Fuel  # Gears  \
0        NaN      6                            26.2100        6   
1        6.2      8                            21.8108        8   
2        2.0      4                            35.6914        8   
3        2.0      4                            36.7994        7   
4        NaN      8                            18.9885        6   

   Max Ethanol % - Gasoline  Intake Valves Per Cyl  Exhaust Valves Per Cyl  \
0                      15.0                      2                       2   
1                      10.0                      1                       1   
2                      10.0                      2                       2   
3                      10.0                      2                       2   
4                      15.0                      2                       2   

                 Fuel Metering Sys Desc Air Aspiration Method Desc  \
0  multipoint/sequential fuel injection        naturally a

### Part 3 - Encoding Categorical features

In this part you should: 
 - use Section 7.5.1 as a guide
 - only use label encoding 
 - convert **all** non-numeric features to numeric 
 - handle any missing values
 
#### Code (25 marks)

**Provide your code to convert all non-numeric features to numeric using label encoding. Use only `pandas` for encoding (don't use `category encoders`). Make sure that missing values are encoded as zero. All your code must be inside the function below**

In [18]:
def label_encoding_non_numeric_cols():
    ### BEGIN SOLUTION
        for col in df.columns:
            if is_string_dtype(df[col]):
                df[col] = df[col].astype('category').cat.as_ordered()
            if is_categorical_dtype(df[col]):
                df[col] = df[col].cat.codes + 1
        return df
    
    ### END SOLUTION

**Running the following cell should not throw any error if your code in the cell above is correct.  Do not edit the following cell.**

In [19]:
df = label_encoding_non_numeric_cols()

print(df.head())

from pandas.api.types import is_numeric_dtype

cols_are_numeric = []
for col in df.columns:
    if col != 'Comb Unadj FE - Conventional Fuel':
        cols_are_numeric.append(is_numeric_dtype(df[col]))
    
    
if (df.shape[1]) - 1 == sum(cols_are_numeric):
    part3_marks = 25
else:
    part3_marks = 0
    
assert (df.shape[1]) - 1 == sum(cols_are_numeric)

   Eng Displ  # Cyl  Comb Unadj FE - Conventional Fuel  # Gears  \
0        NaN      6                            26.2100        6   
1        6.2      8                            21.8108        8   
2        2.0      4                            35.6914        8   
3        2.0      4                            36.7994        7   
4        NaN      8                            18.9885        6   

   Max Ethanol % - Gasoline  Intake Valves Per Cyl  Exhaust Valves Per Cyl  \
0                      15.0                      2                       2   
1                      10.0                      1                       1   
2                      10.0                      2                       2   
3                      10.0                      2                       2   
4                      15.0                      2                       2   

   Fuel Metering Sys Desc  Air Aspiration Method Desc  Trans Desc  Cyl Deact?  \
0                       2                      

### Part 4 - Numeric features

In this part you should: 
 - use Section 7.5.2 as a guide
 - handle any missing values
 
#### Code (30 marks)

Return only the dataframe

In [21]:
df['Eng Displ'].unique()
df['Max Ethanol % - Gasoline'].unique()

array([15., 10., nan, 85.])

In [22]:
def fill_missing_vals_num():
    ### BEGIN SOLUTION
    df['Eng Displ'+'_na'] = pd.isnull(df['Eng Displ'])
    df['Eng Displ'].fillna(df['Eng Displ'].median(), inplace=True)
    
    df['Max Ethanol % - Gasoline'+'_na'] = pd.isnull(df['Max Ethanol % - Gasoline'])
    df['Max Ethanol % - Gasoline'].fillna(df['Max Ethanol % - Gasoline'].median(), inplace=True)
    return df
    ### END SOLUTION

In [25]:
df['Eng Displ'].unique()
df['Max Ethanol % - Gasoline'].unique()

array([15., 10., 85.])

**Running the following cell should not throw any error if your code in the cell above is correct. Do not edit the following cell.**

In [26]:
df = fill_missing_vals_num()

cols_not_null = []
for col in df.columns:
    cols_not_null.append(sum(pd.isnull(df[col])))
    
if all(cols_not_null) == 0:
    part4_marks = 30
else:
    part4_marks = 0
    
assert all(cols_not_null) == 0

### Part 5 - Create and evaluate a final model

In this part you should:
 - create and evaluate a model using all the features after processing them in Parts 2, 3, and 4 above 
 - use 150 decision trees
 

The following function must return:
- **first item:** Mean OOB score of 10 runs
- **second item:** The last (i.e., the 10th) random forest regressor object (fitted)
- **third item:** Feature array
- **fourth item:** Target array

#### Code (10 marks)

In [27]:
def estimate_mean_oob_score_final():
    ### BEGIN SOLUTION
    for col in df.columns:
        if is_string_dtype(df[col]) or is_object_dtype(df[col]):
            df[col] = df[col].str.lower()
            df[col] = df[col].fillna(np.nan) # make None -> np.nan
            df[col] = df[col].replace('   ',np.nan)
            df[col] = df[col].replace('unknown', np.nan)
            df[col] = df[col].replace('none', np.nan)
            df[col] = df[col].replace('xxxxx',np.nan)
            df[col] = df[col].replace('', np.nan)
        
        if is_string_dtype(df[col]):
                df[col] = df[col].astype('category').cat.as_ordered()
        if is_categorical_dtype(df[col]):
                df[col] = df[col].cat.codes + 1

        df['Eng Displ'+'_na'] = pd.isnull(df['Eng Displ'])
        df['Eng Displ'].fillna(df['Eng Displ'].median(), inplace=True)
    
        df['Max Ethanol % - Gasoline'+'_na'] = pd.isnull(df['Max Ethanol % - Gasoline'])
        df['Max Ethanol % - Gasoline'].fillna(df['Max Ethanol % - Gasoline'].median(), inplace=True)


    X = df.drop('Comb Unadj FE - Conventional Fuel',axis=1)
    y = df['Comb Unadj FE - Conventional Fuel']
    
    oob_scores = []
    
    for i in range(10):
        rf = RandomForestRegressor(n_estimators=150, n_jobs=-1, oob_score=True)
        rf.fit(X, y)
        oob_scores.append(rf.oob_score_)
    return np.mean(oob_scores),rf,X,y
    
    ### END SOLUTION

**Running the following cell should not throw any error if your code in the cell above is correct. Do not edit the following cell.**

In [28]:
final_oob_score, rf, X, y = estimate_mean_oob_score_final()

num_trees = len(rf.estimators_)

print(final_oob_score, num_trees)

if final_oob_score > baseline_oob_score:
    part5_marks = 10
else:
    part5_marks = 0
    
assert final_oob_score > baseline_oob_score

0.7825112921401013 150


In [29]:
parts_1_to_5_marks = part1_marks + part2_marks + part3_marks + part4_marks + part5_marks
parts_1_to_5_marks

90