
####  Nabin Bagale

### Part 1 - Create and evaluate an initial model

In this part you should: 
 - read in the data
 - isolate all numeric features from original data set
 - fill in any missing values with 0
 - create and evaluate a baseline model 


In [53]:
#Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np
import warnings
from sklearn.preprocessing import OrdinalEncoder
import category_encoders as ce
warnings.filterwarnings('ignore')

In [78]:
#Reading data
df = pd.read_csv("veh16_missing.csv")
df.head()

Unnamed: 0,Eng Displ,# Cyl,Comb Unadj FE - Conventional Fuel,# Gears,Max Ethanol % - Gasoline,Intake Valves Per Cyl,Exhaust Valves Per Cyl,Trans Creeper Gear,Unique Label?,Air Aspiration Method Desc,Fuel Metering Sys Desc,Cyl Deact?
0,,4,32.8729,6,10.0,2,2,N,N,Naturally Aspirated,Multipoint/sequential fuel injection,N
1,,4,41.5766,6,10.0,2,2,N,N,none given,Spark Ignition Direct Injection,N
2,1.8,4,42.3624,6,15.0,2,2,N,N,none given,Multipoint/sequential fuel injection,
3,3.5,6,29.3963,6,10.0,2,2,N,N,none given,Spark Ignition Direct Injection,
4,,6,25.4694,6,85.0,2,2,N,N,Naturally Aspirated,Multipoint/sequential fuel injection,N


In [79]:
#getting numeric values and identifying missing values
df_num = df.select_dtypes(include=None, exclude=object)
df_num.columns

Index(['Eng Displ', '# Cyl', '# Gears', 'Max Ethanol % - Gasoline',
       'Intake Valves Per Cyl', 'Exhaust Valves Per Cyl'],
      dtype='object')

### Selecting numeric features
In this part we extract all the numeric features and convert the response variable 'Comb Unadj FE - Conventional Fuel' from type object to numeric.


In [80]:
#selecting numeric features
vehicle_numeric = ['Eng Displ', '# Cyl', '# Gears', 'Max Ethanol % - Gasoline', 'Intake Valves Per Cyl', 'Exhaust Valves Per Cyl']
X = df[vehicle_numeric]
y = df['Comb Unadj FE - Conventional Fuel']
#values that cannot be parsed as numbers become NaN
y = pd.to_numeric(y, errors='coerce')

In this part we replace missing values in the numeric columns with 0.

In [81]:
X = X.fillna(0)
y=y.fillna(0)


To create a baseline model on the numeric columns, we replaced all missing values with 0 in the step above. In the following step we fit the baseline model and calculate the initial out-of-bag (OOB) score. The OOB score is the average of 10 runs, and each random forest uses 150 trees. To compare this model with our final model, we assign the mean OOB score to the variable oob_from_1st_part.

In [82]:
oob_scores = []

for i in range(10):
    rf = RandomForestRegressor(n_estimators=150, n_jobs=-1, oob_score=True)
    rf.fit(X, y)
    oob_scores.append(rf.oob_score_)

#mean OOB score over the 10 runs, kept for comparison with the final model
oob_from_1st_part = np.mean(oob_scores)
print("Mean oob score: ", oob_from_1st_part)


Mean oob score:  0.7379333943346686


### Part 2 - Normalize missing values

In this part you should: 
 - convert **all** representations of missing data to a **single** representation


In this part we normalize the different representations of missing values into a single representation, `np.nan`, so that all missing data and unusual values are replaced by the same NaN value.

In [83]:
from pandas.api.types import is_string_dtype, is_object_dtype
def df_normalize_strings(df):
    for col in df.columns:
        if is_string_dtype(df[col]) or is_object_dtype(df[col]):
            df[col] = df[col].str.lower()
            df[col] = df[col].fillna(np.nan) # make None -> np.nan
            df[col] = df[col].replace('none or unspecified', np.nan)
            df[col] = df[col].replace('none', np.nan)
            df[col] = df[col].replace('#name?', np.nan)
            df[col] = df[col].replace('   ', np.nan)
            df[col] = df[col].replace('\n',np.nan, regex=True)
            df[col] = df[col].replace('nan', np.nan)
            df[col] = df[col].replace('@@@@@',np.nan)
            df[col] = df[col].replace('none given',np.nan)
            df[col] = df[col].replace('', np.nan)

In [84]:
df_normalize_strings(df)


In [85]:
df.head()

Unnamed: 0,Eng Displ,# Cyl,Comb Unadj FE - Conventional Fuel,# Gears,Max Ethanol % - Gasoline,Intake Valves Per Cyl,Exhaust Valves Per Cyl,Trans Creeper Gear,Unique Label?,Air Aspiration Method Desc,Fuel Metering Sys Desc,Cyl Deact?
0,,4,32.8729,6,10.0,2,2,n,n,naturally aspirated,multipoint/sequential fuel injection,n
1,,4,41.5766,6,10.0,2,2,n,n,,spark ignition direct injection,n
2,1.8,4,42.3624,6,15.0,2,2,n,n,,multipoint/sequential fuel injection,
3,3.5,6,29.3963,6,10.0,2,2,n,n,,spark ignition direct injection,
4,,6,25.4694,6,85.0,2,2,n,n,naturally aspirated,multipoint/sequential fuel injection,n


#### Question (5 marks)

Note here all the different ways missing data was represented in the data.   

The data set contains two kinds of columns: numeric and object. Some columns are complete, while others have missing values. Unusual and meaningless values in the object columns are also treated as missing values.
For the numeric columns, we replaced missing values with 0 in each field. For the object columns, we treated *empty* cells as missing and replaced them with NaN. Similarly, **none or unspecified**, **#name?**, **none given**, blank strings such as **'  '** and **''**, and cells made up of several spaces are treated as missing values and replaced with NaN. Moreover, the value **@@@@@** appears in the data set and carries no meaning, so we treated it as missing as well and replaced it with numpy's `np.nan` (not a number).
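
The representations listed above can also be mapped to `np.nan` in a single pass. The following is only a minimal sketch on a hypothetical toy column (the values are illustrative and not read from `veh16_missing.csv`); the helper name `normalize_missing` and the `MISSING_TOKENS` set are assumptions introduced here. It also assumes that stripping surrounding whitespace is acceptable, which is what turns blank cells into empty strings.

```python
import numpy as np
import pandas as pd

# Tokens treated as missing (compared after lower-casing and stripping).
# Assumed list, based on the representations described above.
MISSING_TOKENS = {"none or unspecified", "none", "none given",
                  "#name?", "@@@@@", "nan", ""}

def normalize_missing(series):
    """Return a copy of a string column with every missing token set to np.nan."""
    # strip() reduces whitespace-only cells to '' so one lookup covers them all
    cleaned = series.str.lower().str.strip()
    return cleaned.mask(cleaned.isin(MISSING_TOKENS), np.nan)

# Hypothetical toy column for illustration only
toy = pd.Series(["Naturally Aspirated", "none given", "   ", "@@@@@", None, "#NAME?"])
result = normalize_missing(toy)
print(result.tolist())
```

Only the first value survives in this toy column; every other cell ends up as NaN. Unlike the in-place function used in this notebook, the sketch returns a new column, so it would have to be assigned back to the dataframe.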


### Part 3 - Categorical features

In this part you should: 
 - only use ordinal encoding 
 - convert **all** non-numeric features to numeric 
 - handle any missing values


In this part, we convert the categorical columns to numeric values with an ordinal encoder, which replaces each category with an integer code. Missing values in each column are then represented by 0.

In [86]:
cat_features = ['Trans Creeper Gear', 'Unique Label?', 'Air Aspiration Method Desc', 'Fuel Metering Sys Desc', 'Cyl Deact?']
cat_col = df[cat_features]
#ordinal encoding; missing values stay NaN here and are filled with 0 in a later step
ordinal_enc = ce.ordinal.OrdinalEncoder(cols=cat_features, handle_missing='return_nan')
cat_col = ordinal_enc.fit_transform(cat_col)
cat_col.head()

Unnamed: 0,Trans Creeper Gear,Unique Label?,Air Aspiration Method Desc,Fuel Metering Sys Desc,Cyl Deact?
0,1.0,1.0,1.0,1.0,1.0
1,1.0,1.0,,2.0,1.0
2,1.0,1.0,,1.0,
3,1.0,1.0,,2.0,
4,1.0,1.0,1.0,1.0,1.0


In [87]:
#checking whether the categorical columns have missing values or not
def sniff_modified(df):
    with pd.option_context("display.max_colwidth", 20):
        info = pd.DataFrame()
        info['data type'] = df.dtypes
        info['percent missing'] = df.isnull().sum()*100/len(df)
        info['No. unique'] = df.apply(lambda x: len(x.unique()))
        info['unique values'] = df.apply(lambda x: x.unique())
        return info.sort_values('data type')

In [88]:
sniff_modified(cat_col)


Unnamed: 0,data type,percent missing,No. unique,unique values
Trans Creeper Gear,float64,0.0,2,"[1.0, 2.0]"
Unique Label?,float64,6.347898,3,"[1.0, 2.0, nan]"
Air Aspiration Method Desc,float64,43.363561,5,"[1.0, nan, 3.0, 4.0, 5.0]"
Fuel Metering Sys Desc,float64,30.667766,6,"[1.0, 2.0, nan, 4.0, 5.0, 6.0]"
Cyl Deact?,float64,43.528442,3,"[1.0, nan, 3.0]"


The step above shows that there are missing values in the encoded columns, so we replace all of them with 0 in the next step.

In [89]:
#replace missing values by 0 and rechecking missing value
cat_col.fillna(0,inplace=True)
sniff_modified(cat_col)

Unnamed: 0,data type,percent missing,No. unique,unique values
Trans Creeper Gear,float64,0.0,2,"[1.0, 2.0]"
Unique Label?,float64,0.0,3,"[1.0, 2.0, 0.0]"
Air Aspiration Method Desc,float64,0.0,5,"[1.0, 0.0, 3.0, 4.0, 5.0]"
Fuel Metering Sys Desc,float64,0.0,6,"[1.0, 2.0, 0.0, 4.0, 5.0, 6.0]"
Cyl Deact?,float64,0.0,3,"[1.0, 0.0, 3.0]"


At this point there are no missing values left in the categorical columns of our vehicle data set. However, there are still some missing values in the numeric columns, which are handled in Part 4.
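
For comparison, the same idea (integer codes for categories, 0 for missing) can be sketched with pandas alone. This is an illustrative alternative, not the encoder used above: the helper name `ordinal_encode` and the toy values are assumptions, and the integer codes it assigns need not match those produced by `category_encoders`.

```python
import numpy as np
import pandas as pd

def ordinal_encode(series):
    """Map each distinct value to 1..k in order of first appearance; missing -> 0."""
    # pd.factorize assigns -1 to missing values, so adding 1 sends them to 0
    codes, _ = pd.factorize(series)
    return pd.Series(codes + 1, index=series.index, name=series.name)

# Hypothetical toy column for illustration only
toy = pd.Series(["n", "y", np.nan, "n"], name="Cyl Deact?")
encoded = ordinal_encode(toy)
print(encoded.tolist())
```

Here the two categories receive the codes 1 and 2, and the missing cell receives 0, which mirrors the encode-then-`fillna(0)` sequence used in this part.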

### Part 4 - Numeric features

In this part you should: 
 - handle any missing values
 


In this part we replace missing values in each numeric column with the median of that column. For this we created two functions: one fills the missing values of a single column with its median, and the other applies that fill to every numeric column of the dataframe.

In [90]:
#handling missing values for numeric features
numeric_col = df[vehicle_numeric].copy()  #copy, so we do not write into a view of df

def fix_missing_num(dff, colname):
    #df[colname+'_na'] = pd.isnull(df[colname])
    #replace missing values in one column with that column's median
    dff[colname] = dff[colname].fillna(dff[colname].median())

def numeric_value(num_col):
    #apply the median fill to every numeric column of the dataframe passed in
    for col in num_col.select_dtypes(include='number'):
        fix_missing_num(num_col, col)



In [91]:
numeric_value(numeric_col)
numeric_col.head()

Unnamed: 0,Eng Displ,# Cyl,# Gears,Max Ethanol % - Gasoline,Intake Valves Per Cyl,Exhaust Valves Per Cyl
0,3.0,4,6,10.0,2,2
1,3.0,4,6,10.0,2,2
2,1.8,4,6,15.0,2,2
3,3.5,6,6,10.0,2,2
4,3.0,6,6,85.0,2,2


In [92]:
#Checking whether any column still have missing values or not
sniff_modified(numeric_col)

Unnamed: 0,data type,percent missing,No. unique,unique values
# Cyl,int64,0.0,7,"[4, 6, 12, 8, 3, 10, 5]"
# Gears,int64,0.0,7,"[6, 8, 7, 9, 5, 1, 4]"
Intake Valves Per Cyl,int64,0.0,2,"[2, 1]"
Exhaust Valves Per Cyl,int64,0.0,2,"[2, 1]"
Eng Displ,float64,0.0,42,"[3.0, 1.8, 3.5, 3.4, 2.5, 2.0, 6.3, 6.2, 1.5, ..."
Max Ethanol % - Gasoline,float64,0.0,3,"[10.0, 15.0, 85.0]"


From the output above it is clear that none of the numeric columns has any missing values left.
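
The median fill can be illustrated on a small example. The sketch below is not the notebook's function: the helper `fill_with_median`, its `add_indicator` option, and the toy values are assumptions made for illustration. The indicator column follows the idea of the commented-out `_na` line above, recording which rows were originally missing.

```python
import numpy as np
import pandas as pd

def fill_with_median(frame, colname, add_indicator=True):
    """Return a copy of frame with NaNs in colname replaced by the column median."""
    out = frame.copy()
    if add_indicator:
        # remember where values were missing before they are overwritten
        out[colname + "_na"] = out[colname].isna()
    out[colname] = out[colname].fillna(out[colname].median())
    return out

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"Eng Displ": [np.nan, 1.8, 3.5, 3.0, np.nan]})
filled = fill_with_median(toy, "Eng Displ")
print(filled)
```

The median of the three observed values is 3.0, so both missing rows are filled with 3.0 while the `Eng Displ_na` column keeps track of them. Whether such an indicator helps the random forest is something that would have to be checked on the real data.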

### Part 5 - Create and evaluate a final model

In this part you should:
 - create and evaluate a model using all the features after processing them in Parts 2, 3, and 4 above 


In this part we combine the features processed in Parts 2, 3 and 4 into a single dataframe.

In [94]:
#combining data from part 2,3 and part 4
numbers = numeric_col.columns
final_X = cat_col[cat_features].join(numeric_col[numbers])
final_X.head()

Unnamed: 0,Trans Creeper Gear,Unique Label?,Air Aspiration Method Desc,Fuel Metering Sys Desc,Cyl Deact?,Eng Displ,# Cyl,# Gears,Max Ethanol % - Gasoline,Intake Valves Per Cyl,Exhaust Valves Per Cyl
0,1.0,1.0,1.0,1.0,1.0,3.0,4,6,10.0,2,2
1,1.0,1.0,0.0,2.0,1.0,3.0,4,6,10.0,2,2
2,1.0,1.0,0.0,1.0,0.0,1.8,4,6,15.0,2,2
3,1.0,1.0,0.0,2.0,0.0,3.5,6,6,10.0,2,2
4,1.0,1.0,1.0,1.0,1.0,3.0,6,6,85.0,2,2


Here we calculate the OOB score on the cleaned data. The score is the mean of 10 runs, and each random forest model has 150 trees.

In [95]:
oob_scores = []
for i in range(10):
    rf = RandomForestRegressor(n_estimators=150, n_jobs=-1, oob_score=True)
    rf.fit(final_X, y)
    oob_scores.append(rf.oob_score_)

#mean OOB score over the 10 runs, kept for comparison with the baseline
oob_from_final_part = np.mean(oob_scores)
print("Mean oob score: ", oob_from_final_part)

Mean oob score:  0.7185256783344203


#### Questions (5 marks)

Provide answers to the following:
 1. calculate the percent difference between the results of Part 1 and Part 5 (make sure you are using the correct formula for percent difference) 
 

In [96]:
print(f"1.The percent difference between the results of out of bag score of Part1 and Part5  is {round(((oob_from_final_part-oob_from_1st_part)/oob_from_final_part)*100,2)} percent" )

1.The percent difference between the results of out of bag score of Part1 and Part5  is -2.7 percent


2. based on the percent difference, state whether or not the results of Part 5 are an improvement over the results of Part 1


The percent difference between Part 1 and Part 5 is about -2.7 percent, so the results of Part 5 are not an improvement over Part 1: the mean OOB score dropped from about 0.738 to about 0.719 after we cleaned the data and converted the categorical features to numeric. The difference is negative because the baseline in Part 1, which used only the numeric features and ignored all categorical columns, scored higher than the final model.
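
Since "percent difference" can be defined in more than one way, it may be worth checking the result against two common conventions. The sketch below is an assumption about which formulas could be intended, not a statement of the required one: `percent_change` divides by the baseline (Part 1) value, and `percent_difference` divides by the mean of the two values. The function names are hypothetical, and the inputs are the mean OOB scores printed above.

```python
def percent_change(old, new):
    """Signed change of new relative to old, in percent."""
    return (new - old) / old * 100

def percent_difference(a, b):
    """Symmetric difference relative to the mean of a and b, in percent."""
    return abs(a - b) / ((a + b) / 2) * 100

# Mean OOB scores printed in Part 1 and Part 5 above
oob_part1 = 0.7379333943346686
oob_part5 = 0.7185256783344203

print(round(percent_change(oob_part1, oob_part5), 2))
print(round(percent_difference(oob_part1, oob_part5), 2))
```

With these inputs the change relative to the baseline is roughly -2.63 percent and the symmetric difference is roughly 2.67 percent. Both are close to the -2.7 percent reported above and lead to the same conclusion: the Part 5 score is lower than the Part 1 score.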