# Feature Engineering 2

The features chosen for the first iterations of modeling produced strong models, each scoring above 90% accuracy with low false positive rates (3% to 4.5%).

Now we'll compare against a different feature engineering paradigm. Rather than treating the routes as "route sentences", we'll one-hot encode the interchange states and the origin and destination points to see how the models compare. Downstream, we'll also use a simpler 1D CNN model rather than the hybrid MLP/CNN model we used for the other features.

In [1]:
import pandas as pd
import numpy as np

import plaidml.keras as pk
pk.install_backend()
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

import pickle

In [2]:
df = pd.read_csv('./waybill_relevant_data.csv', low_memory=False)
df.replace('', np.nan, inplace=True)
df.head()

Unnamed: 0,is_hazardous,car_ownership_category_code,all_rail_intermodal_code,estimated_short_line_miles,number_of_articulated_units,origin_location,interchange_state_1,interchange_state_2,interchange_state_3,terminal_location
0,1,P,1,2120,0,"Chicago-Gary-Kenosha, IL-IN-WI",,,,"Los Angeles-Riverside-Orange County, CA-AZ"
1,1,P,9,810,0,"Chicago-Gary-Kenosha, IL-IN-WI",,,,"Philadelphia-Wilmington-Atlantic City, PA-NJ-D..."
2,1,P,1,350,0,"New Orleans, LA-MS",AL,,,"Birmingham, AL"
3,1,P,9,2470,4,"Baton Rouge, LA-MS",IL,AB,,Alberta
4,1,P,1,860,0,"Chicago-Gary-Kenosha, IL-IN-WI",,,,"Shreveport-Bossier City, LA-AR"


In [3]:
num_cols = [
    'estimated_short_line_miles',
    'number_of_articulated_units'
]

cat_cols = [
    'car_ownership_category_code', 
    'all_rail_intermodal_code',
    'origin_location',
    'interchange_state_1',
    'interchange_state_2',
    'interchange_state_3',
    'terminal_location'
]

In [4]:
nums = df[num_cols]
cats = df[cat_cols].astype(str)

In [16]:
# df['all_rail_intermodal_code'] = df['all_rail_intermodal_code'].astype(str)
# encoded_cats = pd.get_dummies(cats)
# encoded_cats.head()

# Fit the one-hot encoder on the categorical columns, treating missing values as their own 'None' category
encoder = OneHotEncoder()
encoded_cats = encoder.fit_transform(df[cat_cols].fillna('None'))
encoded_cats.shape

(68486, 418)
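
To make the column count above less opaque, here is a minimal sketch on toy data (the `toy` frame and the `toy_*` names are illustrative, not part of the notebook). The encoded width is the total number of distinct categories across the encoded columns. The sketch also passes `handle_unknown="ignore"`, which the notebook does not use; with scikit-learn's default setting, transforming a category that was never seen during fitting raises an error, so this option may be worth considering for deployment.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for two of the categorical columns (illustrative values only)
toy = pd.DataFrame({
    "origin_location": ["New Orleans, LA-MS", "Baton Rouge, LA-MS", "New Orleans, LA-MS"],
    "interchange_state_1": ["AL", "IL", None],
})
toy = toy.fillna("None")  # same missing-value convention as the notebook

# Assumption: handle_unknown="ignore" is NOT what the notebook uses
toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoded = toy_encoder.fit_transform(toy)

# One output column per distinct category in each input column
n_columns = sum(len(categories) for categories in toy_encoder.categories_)

# An origin never seen during fitting encodes as all zeros in its block
unseen = pd.DataFrame({
    "origin_location": ["Birmingham, AL"],
    "interchange_state_1": ["AL"],
})
unseen_row = toy_encoder.transform(unseen).toarray()
```

Here the toy encoder produces 2 origin columns plus 3 interchange columns, and the unseen origin contributes no active column.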

We will pickle the fitted encoder here, and the scaler once it has been fit below, so that our deployment can apply the same transformations.

In [17]:
with open('./encoder.2.pickle', 'wb') as f:
    pickle.dump(encoder, f)

In [6]:
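# OneHotEncoder.transform returns a SciPy sparse matrix, which has no .to_numpy(); densify it with .toarray()
X_cats = encoded_cats.toarray()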

In [7]:
scaler = MinMaxScaler()
X_nums = scaler.fit_transform(nums)

X_nums[:10]

array([[0.38061041, 0.        ],
       [0.1454219 , 0.        ],
       [0.06283662, 0.        ],
       [0.44344704, 0.8       ],
       [0.15439856, 0.        ],
       [0.27648115, 0.        ],
       [0.04847397, 0.        ],
       [0.01256732, 0.        ],
       [0.04847397, 0.        ],
       [0.11849192, 0.        ]])
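
For reference, with its default feature range of 0 to 1, `MinMaxScaler` rescales each column independently as `(x - min) / (max - min)`. The sketch below reproduces that arithmetic with plain NumPy on the five rows shown by `df.head()`; the `toy_*` names are illustrative, and because the minimum and maximum come from only those five rows, the values will not match the output above.

```python
import numpy as np

# The two numeric columns from the five rows shown by df.head()
toy_nums = np.array([
    [2120, 0],
    [810, 0],
    [350, 0],
    [2470, 4],
    [860, 0],
], dtype=float)

# Per-column minimum and maximum, as MinMaxScaler learns during fit
col_min = toy_nums.min(axis=0)
col_max = toy_nums.max(axis=0)

# Rescale every column into the [0, 1] range
toy_scaled = (toy_nums - col_min) / (col_max - col_min)
```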

In [8]:
with open('./min_max_scaler.2.pickle', 'wb') as f:
    pickle.dump(scaler, f)

In [9]:
X = np.concatenate([X_nums, X_cats], axis=1)
X.shape

(68486, 420)

We'll pickle the prepared data as well.

In [12]:
y = df['is_hazardous'].to_numpy()
num_data = (X, y)

with open('./big_dummy_data.pickle', 'wb') as f:
    pickle.dump(num_data, f)
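
As a rough sketch of how the pickled encoder and scaler might be used at deployment time, the example below fits toy stand-ins, round-trips them through pickle files in a temporary directory, and builds a feature row in the same order as `X` (scaled numeric columns first, then the one-hot columns). The toy frame, the file names, and the `toy_*`, `loaded_*`, and `x_new` names are assumptions for illustration; the real deployment code is not shown in this notebook.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-ins for the training data and column lists (illustrative only)
toy_train = pd.DataFrame({
    "estimated_short_line_miles": [2120, 810, 350],
    "interchange_state_1": ["None", "None", "AL"],
})
toy_num_cols = ["estimated_short_line_miles"]
toy_cat_cols = ["interchange_state_1"]

with tempfile.TemporaryDirectory() as tmp:
    encoder_path = Path(tmp) / "encoder.pickle"
    scaler_path = Path(tmp) / "scaler.pickle"

    # Training side: fit the transformers and persist them
    with open(encoder_path, "wb") as f:
        pickle.dump(OneHotEncoder().fit(toy_train[toy_cat_cols]), f)
    with open(scaler_path, "wb") as f:
        pickle.dump(MinMaxScaler().fit(toy_train[toy_num_cols]), f)

    # Deployment side: reload the fitted transformers
    with open(encoder_path, "rb") as f:
        loaded_encoder = pickle.load(f)
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)

# Transform a new record with the reloaded transformers,
# keeping the column order used for training: numeric, then one-hot
new_row = pd.DataFrame({
    "estimated_short_line_miles": [810],
    "interchange_state_1": ["AL"],
})
x_new = np.concatenate([
    loaded_scaler.transform(new_row[toy_num_cols]),
    loaded_encoder.transform(new_row[toy_cat_cols]).toarray(),
], axis=1)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.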

Next, we will feed this data into the waybill.model_def.4.new_features version of our model definition.