# Exercise 13

This Automobile Data Set mixes categorical and continuous features and is relatively easy to understand. Because domain understanding matters when deciding how to encode categorical values, it makes a good case study.
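
As a small illustration of why domain knowledge matters: columns such as `num_doors` and `num_cylinders` spell out quantities in words, so they can be mapped to integers instead of being treated as unordered labels. The sketch below uses a tiny hand-made frame rather than the real data, and the `word_to_number` mapping is an assumption introduced only for this example.

```python
import pandas as pd

# Toy rows standing in for the real data set (illustrative values only)
toy = pd.DataFrame({
    "num_doors": ["two", "four", "four"],
    "num_cylinders": ["four", "six", "five"],
})

# Hypothetical mapping: these words name quantities, so their order is meaningful
word_to_number = {"two": 2, "three": 3, "four": 4, "five": 5,
                  "six": 6, "eight": 8, "twelve": 12}

# Map every column through the same dictionary
encoded = toy.apply(lambda col: col.map(word_to_number))
print(encoded)
```

A mapping like this keeps the ordering information (six cylinders is more than four), which arbitrary integer codes or dummy columns do not express.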

Read the data into pandas.

In [1]:
import pandas as pd

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [2]:
df.shape

(205, 26)

In [3]:
df.dtypes

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

In [4]:
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi


# Exercise 13.1

Does the data set contain missing values? If so, replace them using one of the methods explained in class.
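
A minimal sketch of the idea on a toy frame, assuming three common imputation choices: the mean and the median for numeric columns, and the most frequent value for a categorical column. The values are made up, and the mode-based fill for `num_doors` is shown only as one possible option.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative values only)
toy = pd.DataFrame({
    "bore": [3.0, np.nan, 3.4],
    "peak_rpm": [5000.0, 5500.0, np.nan],
    "num_doors": ["four", None, "four"],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()

filled = toy.copy()
# Numeric column: fill with the mean
filled["bore"] = filled["bore"].fillna(filled["bore"].mean())
# Numeric column: fill with the median
filled["peak_rpm"] = filled["peak_rpm"].fillna(filled["peak_rpm"].median())
# Categorical column: fill with the most frequent value
filled["num_doors"] = filled["num_doors"].fillna(filled["num_doors"].mode()[0])

print(missing_per_column)
print(filled)
```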

In [5]:
df.isnull().sum()

symboling             0
normalized_losses    41
make                  0
fuel_type             0
aspiration            0
num_doors             2
body_style            0
drive_wheels          0
engine_location       0
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_type           0
num_cylinders         0
engine_size           0
fuel_system           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 4
dtype: int64

In [6]:
df.dropna().shape

(159, 26)

In [7]:
df.normalized_losses.describe()

count    164.000000
mean     122.000000
std       35.442168
min       65.000000
25%       94.000000
50%      115.000000
75%      150.000000
max      256.000000
Name: normalized_losses, dtype: float64

In [8]:
df.stroke.describe()

count    201.000000
mean       3.255423
std        0.316717
min        2.070000
25%        3.110000
50%        3.290000
75%        3.410000
max        4.170000
Name: stroke, dtype: float64

In [9]:
df.bore.describe()

count    201.000000
mean       3.329751
std        0.273539
min        2.540000
25%        3.150000
50%        3.310000
75%        3.590000
max        3.940000
Name: bore, dtype: float64

In [10]:
df.peak_rpm.describe()

count     203.000000
mean     5125.369458
std       479.334560
min      4150.000000
25%      4800.000000
50%      5200.000000
75%      5500.000000
max      6600.000000
Name: peak_rpm, dtype: float64

In [11]:
df.peak_rpm.median()

5200.0

In [12]:
df.peak_rpm.mode()

0    5500.0
dtype: float64

In [13]:
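# Assign the filled column back; fillna(inplace=True) on an attribute-accessed
# column is deprecated in recent pandas and may not update df
df["normalized_losses"] = df["normalized_losses"].fillna(df["normalized_losses"].mean())
df["bore"] = df["bore"].fillna(df["bore"].mean())
df["stroke"] = df["stroke"].fillna(df["stroke"].mean())
df["peak_rpm"] = df["peak_rpm"].fillna(df["peak_rpm"].median())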

In [14]:
df.horsepower.describe()

count    203.000000
mean     104.256158
std       39.714369
min       48.000000
25%       70.000000
50%       95.000000
75%      116.000000
max      288.000000
Name: horsepower, dtype: float64

In [15]:
df.isnull().sum()

symboling            0
normalized_losses    0
make                 0
fuel_type            0
aspiration           0
num_doors            2
body_style           0
drive_wheels         0
engine_location      0
wheel_base           0
length               0
width                0
height               0
curb_weight          0
engine_type          0
num_cylinders        0
engine_size          0
fuel_system          0
bore                 0
stroke               0
compression_ratio    0
horsepower           2
peak_rpm             0
city_mpg             0
highway_mpg          0
price                4
dtype: int64

In [16]:
df.dropna().shape

(197, 26)

In [17]:
df.dropna(inplace=True)
df.shape

(197, 26)

# Exercise 13.2

Split the data into training and testing sets

Train a Random Forest Regressor to predict the price of a car using the nominal features

In [18]:
obj_df = df.select_dtypes(include=['object']).copy()
obj_df_cat=obj_df.copy()

In [19]:
# Replace every categorical column with its integer category codes
for col in obj_df_cat.columns:
    obj_df_cat[col] = obj_df_cat[col].astype("category").cat.codes
obj_df_cat.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,0,1,0,1,0,2,0,0,2,5
1,0,1,0,1,0,2,0,0,2,5
2,0,1,0,1,2,2,0,4,3,5
3,1,1,0,0,3,1,0,2,2,5
4,1,1,0,0,3,0,0,2,1,5
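
To make the integer codes easier to read, the short sketch below shows how `cat.codes` behaves on a toy series: categories are sorted, codes are assigned in that order, and a missing value receives the code `-1`. The series is invented for illustration.

```python
import pandas as pd

# Toy column with a missing entry (illustrative values only)
fuel = pd.Series(["gas", "diesel", None, "gas"], name="fuel_type")

as_category = fuel.astype("category")
categories = list(as_category.cat.categories)  # sorted unique labels
codes = as_category.cat.codes                  # position of each label; -1 marks missing

print(categories)
print(codes.tolist())
```

Because the codes follow alphabetical order, they carry no domain meaning; a model that splits on them treats `diesel < gas` as if it were a real ordering.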


In [20]:
val_df = df.select_dtypes(include=['float64','int64']).copy()
df2 = pd.concat([val_df, obj_df_cat], axis=1)

In [21]:
X = df2.drop(['price'], axis=1)
y = df2['price']
# train/test split
from sklearn.model_selection import train_test_split, cross_val_score
train_features, test_features, train_labels, test_labels = train_test_split(X, y, random_state=1)

In [22]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
rf = RandomForestRegressor(n_estimators= 1000, random_state=42)
rf.fit(train_features, train_labels);
predictions = rf.predict(test_features)

# Performance metrics
Mtrs = pd.DataFrame(columns=['Model', 'RMSE', 'ACC', 'Average abs error'], data=[])
errors = abs(predictions - test_labels)       # absolute error, in the units of price
mape = np.mean(100 * (errors / test_labels))  # mean absolute percentage error
ACC = (round(100 - mape, 2), '%')             # "ACC" is 100 - MAPE, not a classification accuracy
print('Metrics for Random Forest Trained on Expanded Data')
# The 'degrees.' label is a leftover; the value is a mean absolute error in price units
ABS = (round(np.mean(errors), 2), 'degrees.')
RMSE = round(np.sqrt(metrics.mean_squared_error(predictions, test_labels)), 2)
Mtrs.loc[0] = ['RF_Nominal_Feat', RMSE, ACC, ABS]
Mtrs

Metrics for Random Forest Trained on Expanded Data


Unnamed: 0,Model,RMSE,ACC,Average abs error
0,RF_Nominal_Feat,2135.32,"(88.7, %)","(1510.63, degrees.)"
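
The three columns of the table can be reproduced by hand. The sketch below, using only NumPy and two made-up prices, computes the mean absolute error, the RMSE, and the quantity reported as `ACC`, which is 100 minus the mean absolute percentage error. The function name `regression_report` is an assumption for this example.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, RMSE, MAPE and 100 - MAPE for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = np.abs(y_pred - y_true)
    mae = errors.mean()                               # same units as the target
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))   # penalises large errors more
    mape = np.mean(100.0 * errors / y_true)           # percentage, needs y_true != 0
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "100 - MAPE": 100.0 - mape}

# Two invented prices and predictions
report = regression_report([10000, 20000], [11000, 18000])
print(report)
```

Here both predictions are off by 10%, so `100 - MAPE` is 90 even though the absolute errors (1000 and 2000) differ, which is why RMSE and the percentage-based score can rank models differently.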


# Exercise 13.3

Create dummy variables for the categorical features

Train a Random Forest Regressor and compare

In [23]:
obj_df_dum= obj_df.copy()
# Encode all categorical columns in one call; drop_first removes one level per column
obj_df_dum = pd.get_dummies(
    obj_df_dum,
    columns=["fuel_type", "make", "aspiration", "num_doors", "body_style",
             "drive_wheels", "engine_location", "engine_type",
             "num_cylinders", "fuel_system"],
    drop_first=True,
)
obj_df_dum.shape

(197, 48)
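
The column count follows from `drop_first=True`: a column with k distinct levels contributes k - 1 dummy columns. A minimal sketch on an invented two-column frame (the `dtype=int` argument is used here only to print 0/1 instead of booleans):

```python
import pandas as pd

# Toy frame: fuel_type has 2 levels, drive_wheels has 3 (illustrative values only)
toy = pd.DataFrame({
    "fuel_type": ["gas", "diesel", "gas"],
    "drive_wheels": ["rwd", "fwd", "4wd"],
})

# One call encodes every object column; the first (sorted) level of each is dropped
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies)
```

The result has (2 - 1) + (3 - 1) = 3 columns; the dropped levels (`diesel` and `4wd`) are represented by rows where all dummies of that feature are zero.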

In [24]:
df3 = pd.concat([val_df, obj_df_dum], axis=1)
df3.head()

Unnamed: 0,symboling,normalized_losses,wheel_base,length,width,height,curb_weight,engine_size,bore,stroke,...,num_cylinders_three,num_cylinders_twelve,num_cylinders_two,fuel_system_2bbl,fuel_system_4bbl,fuel_system_idi,fuel_system_mfi,fuel_system_mpfi,fuel_system_spdi,fuel_system_spfi
0,3,122.0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,...,0,0,0,0,0,0,0,1,0,0
1,3,122.0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,...,0,0,0,0,0,0,0,1,0,0
2,1,122.0,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,...,0,0,0,0,0,0,0,1,0,0
3,2,164.0,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,...,0,0,0,0,0,0,0,1,0,0
4,2,164.0,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,...,0,0,0,0,0,0,0,1,0,0


In [25]:
X = df3.drop(['price'], axis=1)
y = df3['price']
# train/test split
from sklearn.model_selection import train_test_split, cross_val_score
train_features, test_features, train_labels, test_labels = train_test_split(X, y, random_state=1)

In [26]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
rf2 = RandomForestRegressor(n_estimators= 1000, random_state=42)
rf2.fit(train_features, train_labels);
predictions = rf2.predict(test_features)

# Performance metrics
errors = abs(predictions - test_labels)
mape = np.mean(100 * (errors / test_labels))
ACC = (round(100 - mape,2),'%')
print('Metrics for Random Forest Trained on Expanded Data')
ABS= (round(np.mean(errors), 2), 'degrees.')
RMSE=round(np.sqrt(metrics.mean_squared_error(predictions, test_labels)),2)
Mtrs.loc[1] = ['RF_Dummies_Feat',RMSE,ACC,ABS]
Mtrs

Metrics for Random Forest Trained on Expanded Data


Unnamed: 0,Model,RMSE,ACC,Average abs error
0,RF_Nominal_Feat,2135.32,"(88.7, %)","(1510.63, degrees.)"
1,RF_Dummies_Feat,2148.99,"(88.53, %)","(1526.82, degrees.)"


# Exercise 13.4

Apply two other methods of categorical encoding.

Compare the results.

In [27]:
obj_df.dropna().shape

(197, 10)

In [28]:
import category_encoders as ce
#Polynomial Coding
obj_df_pol = ce.PolynomialEncoder().fit_transform(obj_df)
obj_df_pol.dropna().shape

(189, 49)

In [29]:
df4 = pd.concat([val_df, obj_df_pol], axis=1)
df4.dropna(inplace=True)
df4.head()

Unnamed: 0,symboling,normalized_losses,wheel_base,length,width,height,curb_weight,engine_size,bore,stroke,...,num_cylinders_3,num_cylinders_4,num_cylinders_5,fuel_system_0,fuel_system_1,fuel_system_2,fuel_system_3,fuel_system_4,fuel_system_5,fuel_system_6
0,3,122.0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,...,0.241747,-0.109109,0.032898,-0.540062,0.540062,-0.43082,0.282038,-0.149786,0.061546,-0.01707
1,3,122.0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,...,0.241747,-0.109109,0.032898,-0.540062,0.540062,-0.43082,0.282038,-0.149786,0.061546,-0.01707
2,1,122.0,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,...,-0.564076,0.436436,-0.197386,-0.540062,0.540062,-0.43082,0.282038,-0.149786,0.061546,-0.01707
3,2,164.0,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,...,0.241747,-0.109109,0.032898,-0.540062,0.540062,-0.43082,0.282038,-0.149786,0.061546,-0.01707
4,2,164.0,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,...,0.080582,-0.545545,0.493464,-0.540062,0.540062,-0.43082,0.282038,-0.149786,0.061546,-0.01707
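
For intuition about where these fractional values come from, here is a sketch of orthonormal polynomial contrasts built with NumPy. It assumes equally spaced levels and uses a QR decomposition of a Vandermonde matrix; the helper name `polynomial_contrast` is hypothetical, and `category_encoders` may differ in details such as column naming or an added intercept column.

```python
import numpy as np

def polynomial_contrast(k):
    """Orthonormal polynomial contrasts for k equally spaced levels (k x (k-1))."""
    x = np.arange(1, k + 1, dtype=float)
    x -= x.mean()                               # centre the level scores
    vander = np.vander(x, k, increasing=True)   # columns: 1, x, x^2, ...
    q, r = np.linalg.qr(vander)
    q = q * np.sign(np.diag(r))                 # fix signs: positive leading coefficients
    return q[:, 1:]                             # drop the constant column

c3 = polynomial_contrast(3)
c8 = polynomial_contrast(8)
print(np.round(c3, 4))
print(round(c8[0, 0], 6))
```

With eight levels, the linear coefficient of the first level is about -0.540062, which agrees with the `fuel_system_0` value shown in the preview above.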


In [30]:
X = df4.drop(['price'], axis=1)
y = df4['price']
# train/test split
from sklearn.model_selection import train_test_split, cross_val_score
train_features, test_features, train_labels, test_labels = train_test_split(X, y, random_state=1)

In [31]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
rf3 = RandomForestRegressor(n_estimators= 1000, random_state=42)
rf3.fit(train_features, train_labels);
predictions = rf3.predict(test_features)

# Performance metrics
errors = abs(predictions - test_labels)
mape = np.mean(100 * (errors / test_labels))
ACC = (round(100 - mape,2),'%')
print('Metrics for Random Forest Trained on Expanded Data')
ABS= (round(np.mean(errors), 2), 'degrees.')
RMSE=round(np.sqrt(metrics.mean_squared_error(predictions, test_labels)),2)
Mtrs.loc[2] = ['RF_Polynom_Feat',RMSE,ACC,ABS]
Mtrs

Metrics for Random Forest Trained on Expanded Data


Unnamed: 0,Model,RMSE,ACC,Average abs error
0,RF_Nominal_Feat,2135.32,"(88.7, %)","(1510.63, degrees.)"
1,RF_Dummies_Feat,2148.99,"(88.53, %)","(1526.82, degrees.)"
2,RF_Polynom_Feat,1973.71,"(89.8, %)","(1318.53, degrees.)"


In [32]:
#Helmert Coding
obj_df_hel = ce.HelmertEncoder().fit_transform(obj_df)
obj_df_hel.dropna().shape

(189, 49)

In [33]:
df5 = pd.concat([val_df, obj_df_hel], axis=1)
df5.dropna(inplace=True)
df5.head()

Unnamed: 0,symboling,normalized_losses,wheel_base,length,width,height,curb_weight,engine_size,bore,stroke,...,num_cylinders_3,num_cylinders_4,num_cylinders_5,fuel_system_0,fuel_system_1,fuel_system_2,fuel_system_3,fuel_system_4,fuel_system_5,fuel_system_6
0,3,122.0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
1,3,122.0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
2,1,122.0,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
3,2,164.0,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
4,2,164.0,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
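
A sketch of a Helmert contrast matrix may help in reading these values. It assumes the convention in which column j compares level j + 1 with the mean of the preceding levels; the helper name `helmert_contrast` is hypothetical, and the library's exact convention and column layout may differ.

```python
import numpy as np

def helmert_contrast(k):
    """Helmert contrasts for k levels as a k x (k-1) matrix."""
    m = np.zeros((k, k - 1))
    for j in range(k - 1):
        m[: j + 1, j] = -1.0   # all earlier levels get -1
        m[j + 1, j] = j + 1    # the compared level gets a positive weight
    return m

h3 = helmert_contrast(3)
h4 = helmert_contrast(4)
print(h3)
print(h4)
```

Under this convention the first level is coded as -1 in every column, which appears consistent with the rows of -1.0 in the preview above.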


In [34]:
X = df5.drop(['price'], axis=1)
y = df5['price']
# train/test split
from sklearn.model_selection import train_test_split, cross_val_score
train_features, test_features, train_labels, test_labels = train_test_split(X, y, random_state=1)

In [35]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
rf4 = RandomForestRegressor(n_estimators= 1000, random_state=42)
rf4.fit(train_features, train_labels);
predictions = rf4.predict(test_features)

# Performance metrics
errors = abs(predictions - test_labels)
mape = np.mean(100 * (errors / test_labels))
ACC = (round(100 - mape,2),'%')
print('Metrics for Random Forest Trained on Expanded Data')
ABS= (round(np.mean(errors), 2), 'degrees.')
RMSE=round(np.sqrt(metrics.mean_squared_error(predictions, test_labels)),2)
Mtrs.loc[3] = ['RF_Helmert_Feat',RMSE,ACC,ABS]
Mtrs

Metrics for Random Forest Trained on Expanded Data


Unnamed: 0,Model,RMSE,ACC,Average abs error
0,RF_Nominal_Feat,2135.32,"(88.7, %)","(1510.63, degrees.)"
1,RF_Dummies_Feat,2148.99,"(88.53, %)","(1526.82, degrees.)"
2,RF_Polynom_Feat,1973.71,"(89.8, %)","(1318.53, degrees.)"
3,RF_Helmert_Feat,1990.65,"(90.08, %)","(1305.77, degrees.)"


**Conclusions:**

The full data set has $205$ samples, $46$ of which have missing values across the categorical and numerical variables. Imputing the missing values in four numerical variables ("normalized_losses", "bore", "stroke" and "peak_rpm") leaves only $8$ samples with missing values. Of these, $4$ are missing the target "price", so I consider it best to drop them. The other $4$ are missing "horsepower" ($2$) or "num_doors" ($2$): the first has high variability and the second is a categorical variable whose value would take considerable work to determine, so I drop those as well.

Four models are then trained, each with a different treatment of the categorical variables. The first uses integer category codes and reaches an accuracy (computed as $100$ minus the mean absolute percentage error) of $88.7$%; the second uses dummy variables, with $88.53$%; the third uses **Polynomial coding**, with a higher $89.8$%; and the last uses **Helmert coding**, which gives the best accuracy of the four, $90.08$%, although Polynomial coding has the lowest RMSE ($1973.71$ versus $1990.65$). For these last two methods, however, applying the encoding produces another $8$ missing values, which are dropped, leaving 189 of the 205 initial samples, so their scores are not measured on exactly the same rows as the first two models.