# Feature Engineering 1

This is our first iteration of feature engineering. Feature selection was completed during EDA, so all that's left is to prepare the data with encoding and scaling. We'll also build "route sentences," the text representation of our train routes.

We'll start by reading in the data.

In [1]:
import pandas as pd
import numpy as np

import plaidml.keras as pk
pk.install_backend()
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import MinMaxScaler

import pickle

In [2]:
df = pd.read_csv('./waybill_relevant_data.csv', low_memory=False)
df.replace('', np.nan, inplace=True)
df.head()

| | is_hazardous | car_ownership_category_code | all_rail_intermodal_code | estimated_short_line_miles | number_of_articulated_units | origin_location | interchange_state_1 | interchange_state_2 | interchange_state_3 | terminal_location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | P | 1 | 2120 | 0 | Chicago-Gary-Kenosha, IL-IN-WI | | | | Los Angeles-Riverside-Orange County, CA-AZ |
| 1 | 1 | P | 9 | 810 | 0 | Chicago-Gary-Kenosha, IL-IN-WI | | | | Philadelphia-Wilmington-Atlantic City, PA-NJ-D... |
| 2 | 1 | P | 1 | 350 | 0 | New Orleans, LA-MS | AL | | | Birmingham, AL |
| 3 | 1 | P | 9 | 2470 | 4 | Baton Rouge, LA-MS | IL | AB | | Alberta |
| 4 | 1 | P | 1 | 860 | 0 | Chicago-Gary-Kenosha, IL-IN-WI | | | | Shreveport-Bossier City, LA-AR |


### For sequence data

We're going to try some NLP techniques on the train routes, since each route is currently stored as a sequence of strings. We'll engineer these features the same way we would engineer sentences.

Framed this way, predicting `is_hazardous` from a route becomes comparable to a simple case of sentiment analysis: a label predicted from a short piece of text.

In [3]:
seq_cols = [
    'origin_location',
    'interchange_state_1',
    'interchange_state_2',
    'interchange_state_3',
    'terminal_location'
]


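def to_sentence(row):
    """Join the non-missing stops of a route into one space-separated string."""
    ls = []
    for c in seq_cols:
        # pd.isna catches None and every NaN, not only the np.nan singleton
        if pd.isna(row[c]):
            continue

        # Collapse each stop into a single token by stripping separators
        ls.append(row[c].replace(', ', '').replace(' ', '').replace('-', ''))

    return ' '.join(ls)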

    
routes = df.apply(to_sentence, axis=1).to_list()

In [4]:
routes[:10]

['ChicagoGaryKenoshaILINWI LosAngelesRiversideOrangeCountyCAAZ',
 'ChicagoGaryKenoshaILINWI PhiladelphiaWilmingtonAtlanticCityPANJDEMD',
 'NewOrleansLAMS AL BirminghamAL',
 'BatonRougeLAMS IL AB Alberta',
 'ChicagoGaryKenoshaILINWI ShreveportBossierCityLAAR',
 'CasperWYIDUT TX BeaumontPortArthurTX',
 'NewOrleansLAMS BeaumontPortArthurTX',
 'LosAngelesRiversideOrangeCountyCAAZ LosAngelesRiversideOrangeCountyCAAZ',
 'NewOrleansLAMS BeaumontPortArthurTX',
 'HoustonGalvestonBrazoriaTX TX TX Mexico']
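
As a quick sanity check of the transformation, here is a minimal pure-Python sketch of the same idea applied to one row from the table above. The names `route_sentence` and `is_missing`, and the use of a plain dict in place of a DataFrame row, are illustrative assumptions and not part of the notebook.

```python
import math

SEQ_COLS = [
    "origin_location",
    "interchange_state_1",
    "interchange_state_2",
    "interchange_state_3",
    "terminal_location",
]


def is_missing(value):
    """True for None and for float NaN, the two ways a blank cell can appear."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def route_sentence(row):
    """Hypothetical helper: one token per stop, separators stripped."""
    tokens = []
    for col in SEQ_COLS:
        value = row.get(col)
        if is_missing(value):
            continue
        tokens.append(value.replace(", ", "").replace(" ", "").replace("-", ""))
    return " ".join(tokens)


row = {
    "origin_location": "New Orleans, LA-MS",
    "interchange_state_1": "AL",
    "interchange_state_2": float("nan"),
    "interchange_state_3": None,
    "terminal_location": "Birmingham, AL",
}
print(route_sentence(row))  # NewOrleansLAMS AL BirminghamAL
```

Each stop becomes exactly one token, so the number of words in a sentence equals the number of stops on the route.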

Now that we have our "route sentences" we need to know the vocabulary size.

In [5]:
# Split every route sentence into tokens and keep only the distinct ones
vocab_set = set(' '.join(routes).split())

print(len(vocab_set))

212


We have 212 distinct words in our vocab, so we'll choose a vocab size of 300 to leave room for unseen values. A route has at most five stops (an origin, up to three interchanges, and a terminal), so our max sentence length will be 5 words.
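
One caveat worth quantifying: a hashing-based encoder does not reserve one index per word, so some words can end up sharing an index. The sketch below estimates how many distinct indices 212 words would occupy, assuming the hash spreads words uniformly and that index 0 is kept for padding (leaving `vocab_size - 1` usable indices). Both assumptions, and the function name, are mine.

```python
def expected_occupied_buckets(n_words, n_buckets):
    """Expected number of distinct buckets hit when n_words are hashed uniformly."""
    return n_buckets * (1 - (1 - 1 / n_buckets) ** n_words)


n_words = 212        # distinct route tokens counted above
n_buckets = 300 - 1  # assumption: indices 1..vocab_size-1, with 0 kept for padding

occupied = expected_occupied_buckets(n_words, n_buckets)
print(f"expected distinct indices: {occupied:.0f}")
print(f"expected words sharing an earlier word's index: {n_words - occupied:.0f}")
```

Under these assumptions roughly 150 indices are occupied, so on the order of 60 words would share an index with another word. A larger `vocab_size` reduces the overlap, and an explicit word-to-index mapping removes it.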

In [6]:
vocab_size = 300
max_length = 5

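Now we apply the encoding and padding, then pickle the encoder and the numpy arrays of our prepared data. Note that Keras's `one_hot` hashes each word into the range `[1, vocab_size - 1]`, so distinct words are not guaranteed distinct indices, and pickling the function stores only a reference to it, not a fitted vocabulary.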

In [7]:
encoded_routes = [one_hot(r, vocab_size) for r in routes]
encoded_routes[:5]

[[162, 174], [162, 201], [196, 7, 187], [208, 233, 236, 200], [162, 268]]
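
The indices above depend on the hash function in use. Python's built-in string `hash` is randomized per process unless `PYTHONHASHSEED` is fixed, so an encoder based on it may assign different indices in a later session. A minimal sketch of a run-to-run stable alternative using MD5 from the standard library follows. The names `stable_word_index` and `encode_route` are hypothetical, and the lowercasing and whitespace split are simplifying assumptions, so the output is not expected to match the Keras indices shown above.

```python
import hashlib


def stable_word_index(word, vocab_size):
    """Map a word to an index in [1, vocab_size - 1] using MD5, which is stable across runs."""
    digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    return digest % (vocab_size - 1) + 1


def encode_route(route, vocab_size):
    """Hypothetical stand-in for the encoder: lowercase, split on whitespace, hash each token."""
    return [stable_word_index(w, vocab_size) for w in route.lower().split()]


print(encode_route("NewOrleansLAMS AL BirminghamAL", 300))
```

Because the mapping depends only on the word and `vocab_size`, the same route encodes identically at training and inference time.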

In [8]:
with open('./one_hot_encoder.pickle', 'wb') as f:
    pickle.dump(one_hot, f)
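
If the goal is to reuse exactly the same word-to-index mapping later, one option is to build the vocabulary explicitly and pickle that instead. The sketch below is an alternative to the hashing approach used in this notebook, not a description of it. `build_vocab`, `encode`, and the shared index for unseen words are illustrative choices.

```python
import pickle


def build_vocab(sentences):
    """Assign each distinct token an index starting at 1; 0 stays free for padding."""
    tokens = sorted({tok for s in sentences for tok in s.split()})
    return {tok: i for i, tok in enumerate(tokens, start=1)}


def encode(sentence, vocab, unknown_index):
    """Look up each token, falling back to a shared index for unseen tokens."""
    return [vocab.get(tok, unknown_index) for tok in sentence.split()]


sample_routes = [
    "NewOrleansLAMS AL BirminghamAL",
    "BatonRougeLAMS IL AB Alberta",
]
vocab = build_vocab(sample_routes)
unknown_index = len(vocab) + 1

# The dict is plain data, so it survives pickling with its mapping intact
restored = pickle.loads(pickle.dumps(vocab))
print(encode("NewOrleansLAMS TX BirminghamAL", restored, unknown_index))  # [7, 8, 5]
```

Here `TX` was not in the sample vocabulary, so it falls back to the shared unknown index 8.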

In [9]:
padded_routes = pad_sequences(encoded_routes, maxlen=max_length, padding='post')
padded_routes[:5]

array([[162, 174,   0,   0,   0],
       [162, 201,   0,   0,   0],
       [196,   7, 187,   0,   0],
       [208, 233, 236, 200,   0],
       [162, 268,   0,   0,   0]], dtype=int32)
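
To make the padding step concrete, here is a small pure-Python sketch of post-padding. It is meant to mirror the behavior seen above, not to replace `pad_sequences`. The function name `pad_post` is hypothetical, and dropping items from the front of over-long sequences is an assumption based on the Keras default as I understand it. No route here exceeds five stops, so that branch is never exercised.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; drop leading items if it is too long.

    Assumes maxlen is a positive integer.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # assumption: truncate from the front
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


encoded = [[162, 174], [196, 7, 187], [208, 233, 236, 200]]
print(pad_post(encoded, maxlen=5))
# [[162, 174, 0, 0, 0], [196, 7, 187, 0, 0], [208, 233, 236, 200, 0]]
```

Since 0 is used as the padding value, it should never be assigned to a real word.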

In [10]:
X_seq = padded_routes
y = df['is_hazardous'].to_numpy()

In [11]:
seq_data = (X_seq, y)

with open('sequence_data.pickle', 'wb') as f:
    pickle.dump(seq_data, f)

### For numerical and category data

For our traditional numerical and category data, we'll apply min-max scaling and one-hot encoding, respectively.

In [12]:
num_cols = [
    'estimated_short_line_miles',
    'number_of_articulated_units'
]

cat_cols = [
    'car_ownership_category_code', 
    'all_rail_intermodal_code'
]

In [13]:
nums = df[num_cols]
cats = df[cat_cols].astype(str)

In [14]:
# `cats` was already cast to str above, so numeric codes are treated as categories
encoded_cats = pd.get_dummies(cats)
encoded_cats.head()

| | car_ownership_category_code_P | car_ownership_category_code_R | car_ownership_category_code_T | all_rail_intermodal_code_1 | all_rail_intermodal_code_2 | all_rail_intermodal_code_9 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 1 | 0 | 0 |


In [15]:
X_cats = encoded_cats.to_numpy()
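
For readers who want to see what the one-hot step does without pandas, here is a minimal sketch that builds one 0/1 column per distinct value. The function name `one_hot_columns` is hypothetical, and the input is only the five `all_rail_intermodal_code` values shown in the first table, so the code `2` seen in the full dataset does not appear.

```python
def one_hot_columns(values, prefix):
    """Return (column_names, rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))
    names = [f"{prefix}_{c}" for c in categories]
    rows = [[int(v == c) for c in categories] for v in values]
    return names, rows


codes = ["1", "9", "1", "9", "1"]  # codes cast to str, as in the notebook
names, rows = one_hot_columns(codes, "all_rail_intermodal_code")
print(names)  # ['all_rail_intermodal_code_1', 'all_rail_intermodal_code_9']
print(rows)   # [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]
```

The set of columns depends on the values present when the encoding is built, so data encoded later needs to be aligned to the same column list.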

In [16]:
scaler = MinMaxScaler()
X_nums = scaler.fit_transform(nums)

X_nums[:10]

array([[0.38061041, 0.        ],
       [0.1454219 , 0.        ],
       [0.06283662, 0.        ],
       [0.44344704, 0.8       ],
       [0.15439856, 0.        ],
       [0.27648115, 0.        ],
       [0.04847397, 0.        ],
       [0.01256732, 0.        ],
       [0.04847397, 0.        ],
       [0.11849192, 0.        ]])
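
Min-max scaling maps each column linearly so that its minimum becomes 0 and its maximum becomes 1: `x_scaled = (x - min) / (max - min)`. The sketch below applies that formula to the five mileage values shown in the first table. Because it uses only those five rows and not the full column the scaler was fitted on, its output will differ from `X_nums` above. The function name and the handling of a constant column are my own choices.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


miles = [2120, 810, 350, 2470, 860]
scaled = min_max_scale(miles)
print([round(s, 4) for s in scaled])
```

The minimum and maximum learned from the training data are what the pickled scaler carries forward, so new data is scaled against the same range.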

Now that the data is scaled and encoded, we will pickle the scaler and the combined numerical and category dataset.

In [17]:
with open('./min_max_scaler.pickle', 'wb') as f:
    pickle.dump(scaler, f)

In [18]:
X = np.concatenate([X_nums, X_cats], axis=1)
X.shape

(68486, 8)

In [19]:
X_seq.shape

(68486, 5)

In [20]:
num_data = (X, y)

with open('./numerical_data.pickle', 'wb') as f:
    pickle.dump(num_data, f)
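
On the training side, the pickled tuples can be read back in the same `(features, labels)` layout. The sketch below shows the round trip with small stand-in lists written to a temporary directory. The stand-in values and variable names are illustrative, and the real files hold numpy arrays.

```python
import os
import pickle
import tempfile

# Stand-in data with the same (features, labels) tuple layout used in this notebook
features = [[0.38, 0.0, 1, 0, 0, 1, 0, 0], [0.15, 0.0, 1, 0, 0, 0, 0, 1]]
labels = [1, 1]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numerical_data.pickle")
    with open(path, "wb") as f:
        pickle.dump((features, labels), f)

    # Pickle files must be opened in binary mode for reading as well
    with open(path, "rb") as f:
        X_loaded, y_loaded = pickle.load(f)

print(len(X_loaded), len(y_loaded))  # 2 2
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.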

Now we are ready to send our data to model training!