# Focused Feature Engineering
* Order Categorical Variables
* Remove time components from random forest

Now that we have identified a few select features, it's time to see if we can dedicate some more engineering time to them.

- Are there categories that we can make ordinal? Does that improve our model?
- Are there any potential interactions between features we can try?<br>(e.g., the difference between two times)
- Can we generate new features from these important features?<br>(e.g., in the Airbnb example, we saw a feature named listing_requires_[reviews, id]. Perhaps what really matters is that the listing requires reviews. Was that a feature in our dataset? Or maybe the number of requirements is what's predictive. Should we add that as a feature?)
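
As a rough illustration of the first two questions, the sketch below builds an ordinal column and one interaction column on a few toy rows. The rows, the `door_order` mapping, and the `footprint` feature are assumptions made up for this example; they are not part of the analysis that follows.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "num_of_doors": ["two", "four", "four", "two"],
    "length": [168.8, 176.6, 192.7, 157.3],
    "width": [64.1, 66.2, 71.4, 63.8],
})

# Ordinal encoding: the categories have a natural order, so map them to numbers.
door_order = {"two": 2, "four": 4}
toy["num_of_doors_ord"] = toy["num_of_doors"].map(door_order)

# Interaction feature (hypothetical): length times width as a rough footprint.
toy["footprint"] = toy["length"] * toy["width"]

print(toy)
```

Whether such columns help is an empirical question; each candidate would still need to be checked against the validation score.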

## Loading the datasets

In [1]:
#loading the original dataset with 26 features before feature engineering
import pandas as pd

df1 = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data", 
                  header=None,
                 names = ['symboling',
             'normalized_losses',
             'make',
             'fuel_type',
             'aspiration',
             'num_of_doors',
             'body_style',
             'drive_wheels',
             'engine_location',
             'wheel-base',
             'length',
             'width',
             'height',
             'curb_weight',
             'engine_type',
             'num_of_cylinders',
             'engine_size',
             'fuel_system',
             'bore',
             'stroke',
             'compression_ratio',
             'horsepower',
             'peak_rpm',
             'city_mpg',
             'highway_mpg',
             'price'])
df1.head(2)

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel-base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500


In [2]:
df1.shape

(205, 26)

In [3]:
df1.columns

Index(['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration',
       'num_of_doors', 'body_style', 'drive_wheels', 'engine_location',
       'wheel-base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
       'num_of_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke',
       'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
       'highway_mpg', 'price'],
      dtype='object')

In [4]:
df1['make'].unique()

array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'mazda', 'mercedes-benz', 'mercury',
       'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'renault',
       'saab', 'subaru', 'toyota', 'volkswagen', 'volvo'], dtype=object)

In [5]:
make_values = df1['make'].value_counts(normalize=True)
make_values

toyota           0.156098
nissan           0.087805
mazda            0.082927
honda            0.063415
mitsubishi       0.063415
volkswagen       0.058537
subaru           0.058537
peugot           0.053659
volvo            0.053659
dodge            0.043902
mercedes-benz    0.039024
bmw              0.039024
plymouth         0.034146
audi             0.034146
saab             0.029268
porsche          0.024390
isuzu            0.019512
jaguar           0.014634
chevrolet        0.014634
alfa-romero      0.014634
renault          0.009756
mercury          0.004878
Name: make, dtype: float64

In [6]:
make_other = make_values[make_values <= 0.05].index
make_other

Index(['dodge', 'mercedes-benz', 'bmw', 'plymouth', 'audi', 'saab', 'porsche',
       'isuzu', 'jaguar', 'chevrolet', 'alfa-romero', 'renault', 'mercury'],
      dtype='object')

#### In the previous task, I grouped the values of the 'make' column by their share of value_counts: every make accounting for 5% or less of the rows was relabeled "other" before creating dummy variables. Here I go the other way and recover the original values behind the aggregated "make_other" feature as a new feature, then check whether the model's score improves.<br>The makes that were grouped into "other" are shown above: 'dodge', 'mercedes-benz', 'bmw', 'plymouth', 'audi', 'saab', 'porsche', 'isuzu', 'jaguar', 'chevrolet', 'alfa-romero', 'renault', 'mercury'.
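
For reference, the grouping of rare makes into "other" can be sketched roughly as follows. This is a minimal reconstruction on toy data, assuming the same 5% threshold as `make_other`; the `makes` series and the variable names are invented for the example and are not the code from the previous task.

```python
import pandas as pd

# Toy data for illustration only: two common makes and two rare ones.
makes = pd.Series(["toyota"] * 24 + ["nissan"] * 14 + ["mercury", "renault"], name="make")

# Share of rows per make, then the makes at or below the 5% threshold.
shares = makes.value_counts(normalize=True)
rare = shares[shares <= 0.05].index

# Keep common makes as they are and relabel the rare ones as "other".
grouped = makes.where(~makes.isin(rare), "other")
print(grouped.value_counts())
```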

In [7]:
# Loading the dataset with the 10 features selected in the previous task.
import pandas as pd

X_train = pd.read_feather('./X_train') 
y_train = pd.read_feather('./y_train')

In [8]:
X_train.columns

Index(['symboling', 'num_of_doors', 'wheel-base', 'length', 'width', 'height',
       'peak_rpm', 'highway_mpg', 'price', 'make_other'],
      dtype='object')

## 1. Categorical Variables

In [9]:
#loading the original dataset with selected features
df2 = df1[['symboling', 'num_of_doors', 'wheel-base', 
     'length', 'width', 'height', 'peak_rpm', 'highway_mpg', 'price', 'make', 'normalized_losses']]

In [10]:
df2.head(2)

Unnamed: 0,symboling,num_of_doors,wheel-base,length,width,height,peak_rpm,highway_mpg,price,make,normalized_losses
0,3,two,88.6,168.8,64.1,48.8,5000,27,13495,alfa-romero,?
1,3,two,88.6,168.8,64.1,48.8,5000,27,16500,alfa-romero,?


In [11]:
import numpy as np

# Helper: replace the '?' placeholder with NaN (numpy's Not a Number).
# The frame is modified in place, so the function returns nothing.

def replaced_df(df):
    df.replace({'?': np.nan}, inplace=True)

In [12]:
replaced_df(df2)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  method=method,
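
The warning above most likely appears because `df2` was created by selecting columns from `df1` and was then modified in place. One common way to avoid it is to take an explicit copy before editing. The sketch below shows the idea on a toy frame; `raw` and `subset` are names assumed for this example.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
raw = pd.DataFrame({"price": ["13495", "?"], "make": ["audi", "bmw"], "extra": [1, 2]})

# Take an explicit copy so later in-place edits clearly target a new object.
subset = raw[["price", "make"]].copy()
subset.replace({"?": np.nan}, inplace=True)

print(subset)
print(raw)  # the original frame still holds '?'
```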


In [13]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 11 columns):
symboling            205 non-null int64
num_of_doors         203 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
peak_rpm             203 non-null object
highway_mpg          205 non-null int64
price                201 non-null object
make                 205 non-null object
normalized_losses    164 non-null object
dtypes: float64(4), int64(2), object(5)
memory usage: 17.7+ KB


In [14]:
df2['num_of_doors'].unique()

array(['two', 'four', nan], dtype=object)

In [15]:
mapping1 = {'four': 4, 'two': 2}

# Helper: replace number words ('two', 'four') with the numbers themselves.
# The frame is modified in place, so the function returns nothing.

def replaced_number(df):
    df.replace(mapping1, inplace=True)

In [16]:
replaced_number(df2)
df2.head(2)

Unnamed: 0,symboling,num_of_doors,wheel-base,length,width,height,peak_rpm,highway_mpg,price,make,normalized_losses
0,3,2.0,88.6,168.8,64.1,48.8,5000,27,13495,alfa-romero,
1,3,2.0,88.6,168.8,64.1,48.8,5000,27,16500,alfa-romero,


In [17]:
import pandas as pd

# Convert the listed columns to a numeric dtype; values that cannot be parsed become NaN.
def convert_multiple_numeric(df, column_list):
    for column in column_list:
        df[column] = pd.to_numeric(df[column], errors='coerce')
    return df
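
To make the effect of `errors='coerce'` concrete, here is a small sketch on a toy series (the values are invented for the example): strings that look like numbers are parsed, and anything else turns into NaN instead of raising an error.

```python
import pandas as pd

# Toy column for illustration only: numbers stored as text plus a '?' placeholder.
raw = pd.Series(["5000", "?", "5500"])

# Unparseable entries become NaN, so the result is a float column.
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
```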

In [18]:
convert_multiple_numeric(df2, ['peak_rpm', 'price', 'num_of_doors', 'normalized_losses']).head(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


Unnamed: 0,symboling,num_of_doors,wheel-base,length,width,height,peak_rpm,highway_mpg,price,make,normalized_losses
0,3,2.0,88.6,168.8,64.1,48.8,5000.0,27,13495.0,alfa-romero,
1,3,2.0,88.6,168.8,64.1,48.8,5000.0,27,16500.0,alfa-romero,


In [19]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 11 columns):
symboling            205 non-null int64
num_of_doors         203 non-null float64
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
peak_rpm             203 non-null float64
highway_mpg          205 non-null int64
price                201 non-null float64
make                 205 non-null object
normalized_losses    164 non-null float64
dtypes: float64(8), int64(2), object(1)
memory usage: 17.7+ KB


In [20]:
df3 = df2.copy()

In [21]:
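# Keep the make only for the rarer makes listed in make_other; every other row becomes NaN.
df3['make'] = df2['make'].where(df2['make'].isin(make_other))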

In [22]:
df3['make'].value_counts()

dodge            9
mercedes-benz    8
bmw              8
audi             7
plymouth         7
saab             6
porsche          5
isuzu            4
alfa-romero      3
chevrolet        3
jaguar           3
renault          2
mercury          1
Name: make, dtype: int64

In [23]:
print(df2.shape, df3.shape)

(205, 11) (205, 11)


In [24]:
make_other

Index(['dodge', 'mercedes-benz', 'bmw', 'plymouth', 'audi', 'saab', 'porsche',
       'isuzu', 'jaguar', 'chevrolet', 'alfa-romero', 'renault', 'mercury'],
      dtype='object')

### So far, I have built a new dataset, df3, containing only the selected features, with 'make' kept only for the makes that were previously grouped into 'make_other'. Next, I check whether converting this column to a category and integer-coding it improves the model.

In [25]:
df3.head(1)

Unnamed: 0,symboling,num_of_doors,wheel-base,length,width,height,peak_rpm,highway_mpg,price,make,normalized_losses
0,3,2.0,88.6,168.8,64.1,48.8,5000.0,27,13495.0,alfa-romero,


In [26]:
# Return the names of the columns that contain NaNs.
def some_nans(df):
    some_nans_bools = pd.isnull(df).any()
    return some_nans_bools.index[some_nans_bools]

# Fill the NaNs in each numeric column with that column's mean.
# numeric_only=True skips non-numeric columns such as 'make', which stay unfilled.
def impute_means(df):
    nan_cols = some_nans(df)
    col_means = df[nan_cols].mean(numeric_only=True)
    imputed_df = df.fillna(col_means)
    return imputed_df

In [27]:
df3 = impute_means(df3)
df3.head(2)

Unnamed: 0,symboling,num_of_doors,wheel-base,length,width,height,peak_rpm,highway_mpg,price,make,normalized_losses
0,3,2.0,88.6,168.8,64.1,48.8,5000.0,27,13495.0,alfa-romero,122.0
1,3,2.0,88.6,168.8,64.1,48.8,5000.0,27,16500.0,alfa-romero,122.0
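
One caveat worth keeping in mind: the means above are computed over all rows, before any train/test split, and they also fill `normalized_losses`, which is used as the target later on. A possible variation, sketched below on toy frames, computes the means on the training rows only and reuses them for the held-out rows. The `train` and `test` frames and their values are assumptions for this example, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frames for illustration only.
train = pd.DataFrame({"peak_rpm": [5000.0, np.nan, 5500.0], "make": ["audi", None, "bmw"]})
test = pd.DataFrame({"peak_rpm": [np.nan, 4800.0], "make": ["saab", None]})

# Means come from the training rows only; non-numeric columns are skipped.
train_means = train.mean(numeric_only=True)

# The same training means fill the gaps in both frames.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)
print(test_filled)
```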


In [28]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 11 columns):
symboling            205 non-null int64
num_of_doors         205 non-null float64
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
peak_rpm             205 non-null float64
highway_mpg          205 non-null int64
price                205 non-null float64
make                 66 non-null object
normalized_losses    205 non-null float64
dtypes: float64(8), int64(2), object(1)
memory usage: 17.7+ KB


In [29]:
make_other_categories = df3['make'].astype('category').cat.categories

In [30]:
make_other_categories

Index(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'isuzu', 'jaguar',
       'mercedes-benz', 'mercury', 'plymouth', 'porsche', 'renault', 'saab'],
      dtype='object')

In [31]:
df3['make'] = df3['make'].astype('category').cat.codes


In [32]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 11 columns):
symboling            205 non-null int64
num_of_doors         205 non-null float64
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
peak_rpm             205 non-null float64
highway_mpg          205 non-null int64
price                205 non-null float64
make                 205 non-null int8
normalized_losses    205 non-null float64
dtypes: float64(8), int64(2), int8(1)
memory usage: 16.3 KB


In [33]:
df3.head(2)

Unnamed: 0,symboling,num_of_doors,wheel-base,length,width,height,peak_rpm,highway_mpg,price,make,normalized_losses
0,3,2.0,88.6,168.8,64.1,48.8,5000.0,27,13495.0,0,122.0
1,3,2.0,88.6,168.8,64.1,48.8,5000.0,27,16500.0,0,122.0


In [34]:
df3.make.value_counts()

-1     139
 4       9
 7       8
 2       8
 9       7
 1       7
 12      6
 10      5
 5       4
 6       3
 3       3
 0       3
 11      2
 8       1
Name: make, dtype: int64
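
The `-1` in the counts above is the code pandas assigns to missing values in a categorical column, so the 139 rows whose make was set to NaN all share that single code. A minimal sketch on a toy series (values invented for the example):

```python
import numpy as np
import pandas as pd

# Toy column for illustration only, with one missing value.
makes = pd.Series(["audi", np.nan, "bmw", "audi"])
as_cat = makes.astype("category")

print(list(as_cat.cat.categories))  # ['audi', 'bmw']
print(as_cat.cat.codes.tolist())    # [0, -1, 1, 0]
```

Because the codes are plain integers, a tree-based model will treat them as an ordered numeric feature, which may or may not be meaningful for car makes.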

In [35]:
make_other_categories

Index(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'isuzu', 'jaguar',
       'mercedes-benz', 'mercury', 'plymouth', 'porsche', 'renault', 'saab'],
      dtype='object')

In [36]:
y = df3.normalized_losses
y

0      122.0
1      122.0
2      122.0
3      164.0
4      164.0
       ...  
200     95.0
201     95.0
202     95.0
203     95.0
204     95.0
Name: normalized_losses, Length: 205, dtype: float64

### Split the data & Train the model with this focused feature-engineered dataset

In [37]:
X = df3.iloc[:, :-1]

In [38]:
X.head(2)

Unnamed: 0,symboling,num_of_doors,wheel-base,length,width,height,peak_rpm,highway_mpg,price,make
0,3,2.0,88.6,168.8,64.1,48.8,5000.0,27,13495.0,0
1,3,2.0,88.6,168.8,64.1,48.8,5000.0,27,16500.0,0


In [39]:
y = df3.iloc[:, -1]
y

0      122.0
1      122.0
2      122.0
3      164.0
4      164.0
       ...  
200     95.0
201     95.0
202     95.0
203     95.0
204     95.0
Name: normalized_losses, Length: 205, dtype: float64

In [40]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
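
The two calls above first hold out 20% of the rows as a test set and then hold out 20% of the remainder as a validation set. The arithmetic below reproduces the resulting sizes, assuming the held-out share is rounded up to a whole number of rows, which matches the shapes reported for this dataset.

```python
import math

n_total = 205

# First split: 20% of all rows go to the test set.
n_test = math.ceil(0.2 * n_total)
n_rest = n_total - n_test

# Second split: 20% of the remaining rows go to the validation set.
n_val = math.ceil(0.2 * n_rest)
n_train = n_rest - n_val

print(n_train, n_val, n_test)  # 131 33 41
```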

In [41]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
print(X_val.shape, y_val.shape)

(131, 10) (131,)
(41, 10) (41,)
(33, 10) (33,)


In [42]:
from sklearn.ensemble import RandomForestRegressor

rfr_tuned = RandomForestRegressor(min_samples_leaf = 4,
                                  n_estimators = 14,
                                  max_features = 0.5, random_state=1)

In [43]:
rfr_tuned.fit(X_train, y_train)
rfr_tuned.score(X_val, y_val)

0.5004318787607926

In [44]:
combined_X = np.vstack((X_train, X_val))
print(combined_X.shape)

combined_y = np.concatenate((y_train, y_val))
print(combined_y.shape)

rfr_tuned.fit(combined_X, combined_y)
rfr_tuned.score(X_test, y_test)

(164, 10)
(164,)


0.5290820769798088
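
The value returned by `score` for a regressor is the coefficient of determination, R². As a reminder of what that number means, the sketch below computes R² by hand with NumPy; the helper name `r_squared` and the toy inputs are assumptions for this example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect predictions: 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # always predicting the mean: 0.0
```

A score near 0.5 therefore means the model explains roughly half of the variance in `normalized_losses` on the rows it was scored on.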

### The best score from the previous step was 0.5167, obtained after selecting 10 features with eli5 permutation importance and correlation analysis. With the new feature, the validation score dropped to 0.5004, so I decided not to use it and will keep the model selected in the previous task. However, as shown above, the score was higher (0.5291) when I refit on the combined training and validation data and evaluated against the test set.

## 2. Remove time components from random forest

In [46]:
X_train.head()

Unnamed: 0,symboling,num_of_doors,wheel-base,length,width,height,peak_rpm,highway_mpg,price,make
28,-1,4.0,103.3,174.6,64.6,59.8,5000.0,30,8921.0,4
153,0,4.0,95.7,169.7,63.6,59.1,4800.0,37,6918.0,-1
47,0,4.0,113.0,199.6,69.6,52.8,4750.0,19,32250.0,6
46,2,2.0,96.0,172.6,65.2,51.4,5000.0,29,11048.0,5
117,0,4.0,108.0,186.7,68.3,56.0,5600.0,24,18150.0,-1


In [47]:
X_train.columns

Index(['symboling', 'num_of_doors', 'wheel-base', 'length', 'width', 'height',
       'peak_rpm', 'highway_mpg', 'price', 'make'],
      dtype='object')

### Looking at the selected features shown above, I do not find any time components. Hence, I will keep the model with 10 features.
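
The same check can also be done programmatically rather than by eye. The sketch below lists the datetime-typed columns of a toy frame; the `listed_on` column is hypothetical and only there to show a positive match. Note that this only finds columns already stored with a datetime dtype, so dates kept as plain strings or integers would still need a manual look.

```python
import pandas as pd

# Toy frame for illustration only; 'listed_on' is a hypothetical time column.
toy = pd.DataFrame({
    "price": [13495.0, 16500.0],
    "listed_on": pd.to_datetime(["2020-01-01", "2020-02-01"]),
})

# Collect the names of all datetime-typed columns.
time_cols = toy.select_dtypes(include=["datetime"]).columns.tolist()
print(time_cols)  # ['listed_on']
```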