# Part 1. Feature Engineering

In [1]:
import pandas as pd

df = pd.read_csv('cash or e-zpass train.csv', low_memory=False)

# Exploratory Data Analysis

Before making any decisions, I want to look at the shape of the data. The dataset has 7 columns and 6,010,967 rows, which is large enough that training time matters. To decide which features to keep or discard, we start by inspecting the attributes.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6010967 entries, 0 to 6010966
Data columns (total 7 columns):
 #   Column                          Dtype 
---  ------                          ----- 
 0   Date                            object
 1   Entrance                        object
 2   Exit                            object
 3   Interval Beginning Time         int64 
 4   Vehicle Class                   object
 5   Vehicle Count                   int64 
 6   Payment Type (Cash or E-ZPass)  object
dtypes: int64(2), object(5)
memory usage: 321.0+ MB


# Encoded Features

This is a binary classification problem, so we will train classifiers rather than regressors. Most scikit-learn classifiers do not accept string-valued features, so each string column must be encoded as numbers. I use label encoding here because it keeps one column per feature and works well with tree-based models. Note that it assigns an arbitrary integer order to the categories, which can mislead models that treat the codes as magnitudes.


In [3]:
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()

# drop() returns a new DataFrame, so the result must be assigned
encoded_df = df.drop('Date', axis=1)
encoded_df['Vehicle Class'] = labelencoder.fit_transform(df['Vehicle Class'])
encoded_df['Entrance'] = labelencoder.fit_transform(df['Entrance'])
encoded_df['Exit'] = labelencoder.fit_transform(df['Exit'])
encoded_df['Payment Type (Cash or E-ZPass)'] = labelencoder.fit_transform(df['Payment Type (Cash or E-ZPass)'])
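
Reusing a single `LabelEncoder` for every column works, but each `fit_transform` call overwrites the previous mapping, so the integer codes cannot be decoded later. A minimal sketch of one alternative, keeping one encoder per column, on a small made-up frame (the category values and the `encoders` dictionary are illustrative and not taken from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the category values here are invented for illustration.
toy = pd.DataFrame({
    'Vehicle Class': ['2L', '3H', '2L', '4H'],
    'Entrance': ['15', '24', '15', '50'],
})

encoders = {}              # hypothetical helper: column name -> fitted encoder
encoded_toy = toy.copy()
for column in toy.columns:
    encoders[column] = LabelEncoder()
    encoded_toy[column] = encoders[column].fit_transform(toy[column])

# Each encoder keeps its own mapping, so codes can be turned back into labels.
decoded = encoders['Vehicle Class'].inverse_transform(encoded_toy['Vehicle Class'])
```

Keeping the fitted encoders around also makes it possible to apply the same mapping to a separate test file with `transform`.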

# Feature Engineering

This dataset contains 6 features: Date, Entrance, Exit, Interval Beginning Time, Vehicle Class, and Vehicle Count. I expect Interval Beginning Time and Vehicle Class to be the most important for predicting the payment type, because they seem most directly related to the target. I am less sure how much Date, Entrance, Exit, and Vehicle Count contribute. Therefore, I train a decision tree on 6 different combinations of these features and compare the results.

1. Interval Beginning Time, Vehicle Class
2. Interval Beginning Time, Vehicle Class, Entrance, Exit
3. Interval Beginning Time, Vehicle Class, Vehicle Count
4. Interval Beginning Time, Vehicle Class, Date
5. All features except Date
6. All features



In [4]:
target = df['Payment Type (Cash or E-ZPass)']

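# Split each date string on '/'; the code below treats the first part
# as the month and the second part as the day.
date = df['Date'].apply(lambda x: x.split('/'))

month = date.apply(lambda parts: int(parts[0]))
day = date.apply(lambda parts: int(parts[1]))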
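
The split above assumes dates are written month first. Under the same assumption, the extraction can also be sketched with `pd.to_datetime`, shown here on made-up dates. The explicit `format` string is an assumption about the file and should be checked against the real `Date` column:

```python
import pandas as pd

# Made-up dates in an assumed MM/DD/YYYY layout.
toy_dates = pd.Series(['01/15/2017', '12/03/2017'])

# Parsing with an explicit format fails loudly if a row does not match it.
parsed = pd.to_datetime(toy_dates, format='%m/%d/%Y')
toy_month = parsed.dt.month
toy_day = parsed.dt.day
```

Parsing to a datetime column also gives access to other fields such as `parsed.dt.dayofweek`, should they turn out to be useful.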

In [5]:
dataset_1 = encoded_df[['Interval Beginning Time', 'Vehicle Class']]
dataset_2 = encoded_df[['Interval Beginning Time', 'Vehicle Class', 'Entrance', 'Exit']]
dataset_3 = encoded_df[['Interval Beginning Time', 'Vehicle Class', 'Vehicle Count']]

dataset_4 = dataset_1.copy()
dataset_4['day'] = day
dataset_4['month'] = month

dataset_5 = encoded_df[['Interval Beginning Time', 'Vehicle Class', 'Entrance', 'Exit', 'Vehicle Count']]

dataset_6 = dataset_5.copy()
dataset_6['day'] = day
dataset_6['month'] = month

# Decision Tree

Since this dataset has more than 6 million rows, the algorithm used to compare the feature sets should not take too long to train. I therefore chose the decision tree classifier, which trains quickly and needs no feature scaling.

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def decision_tree_accuracy(features, target):
    """Train a decision tree on a 70/30 split and return the test accuracy."""
    train_x, test_x, train_y, test_y = train_test_split(features, target, test_size=0.3, random_state=11)
    clf = DecisionTreeClassifier()
    clf.fit(train_x, train_y)
    hyp = clf.predict(test_x)
    return accuracy_score(test_y, hyp)

In [7]:
datasets = [dataset_1, dataset_2, dataset_3, dataset_4, dataset_5, dataset_6]

# One accuracy per feature set, in the same order as `datasets`
accuracies = [decision_tree_accuracy(dataset, target) for dataset in datasets]
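
A bare list of scores is easy to misread, since position 0 corresponds to dataset 1. A sketch of the same comparison keyed by name, on a small synthetic frame: the data, the column names `a` and `b`, and the `evaluate` helper are all hypothetical, and `random_state` is also set on the tree here so that reruns build the same tree.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the target depends on column 'a' only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.integers(0, 10, 500), 'b': rng.integers(0, 10, 500)})
toy_target = (toy['a'] > 4).astype(int)

def evaluate(features, target):
    """Hypothetical helper: 70/30 split, fit a tree, return test accuracy."""
    train_x, test_x, train_y, test_y = train_test_split(
        features, target, test_size=0.3, random_state=11)
    clf = DecisionTreeClassifier(random_state=11)
    clf.fit(train_x, train_y)
    return accuracy_score(test_y, clf.predict(test_x))

# Name each feature set so the scores are self-describing.
feature_sets = {'a only': ['a'], 'b only': ['b'], 'a and b': ['a', 'b']}
scores = {name: evaluate(toy[cols], toy_target) for name, cols in feature_sets.items()}
best_name = max(scores, key=scores.get)
```

With a dictionary, the winning feature set can be read off by name instead of by counting list positions.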


# Result

As the following cell shows, the best feature set is dataset 5, with an accuracy of about 0.73 on the held-out 30% of the data. Its features are **Interval Beginning Time, Vehicle Class, Entrance, Exit, Vehicle Count**.  
Therefore, I will use these features in further processing.

In [8]:
accuracies

[0.6880891658639676,
 0.6673493074606373,
 0.6876860140709403,
 0.6862170331909825,
 0.7317787312197532,
 0.6743531687342753]
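
Accuracy figures like these are easier to interpret next to a majority-class baseline, because the class balance of the payment types is not shown in this notebook. A minimal sketch using scikit-learn's `DummyClassifier` on made-up labels (the 7-to-3 split below is invented purely for illustration and says nothing about the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels: 7 of one class, 3 of the other.
toy_y = np.array(['E-ZPass'] * 7 + ['Cash'] * 3)
toy_X = np.zeros((len(toy_y), 1))   # features are ignored by this baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X, toy_y)
baseline_accuracy = baseline.score(toy_X, toy_y)
```

A model is only informative to the extent that it beats this number on the same test split.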