### Apply ANNs to ANY dataset!
Building a model with the iris dataset was nice and all, but that data is cleaner than what we can expect in the "real world". The features were continuous numbers on the same scale. There were no categorical variables. We didn't have to do any feature engineering. And there was a "clean" set of 3 target classes. How do we extend ANNs to more realistic tabular data?

For this example, we'll use some NY Taxi data. Given several features (pickup and dropoff lat/long, pickup time, day of week, distance of ride, ...), can we predict the price of the taxi ride?

### Feature Engineering
Turn lat/long and datetime cols into useful features

In [1]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Read in data
df = pd.read_csv('../Data/NYCTaxiFares.csv')  # 120k-record sample of the 55M+ record dataset on Kaggle
df.head()

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2010-04-19 08:17:56 UTC,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1
1,2010-04-17 15:43:53 UTC,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1
2,2010-04-17 11:23:26 UTC,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2
3,2010-04-11 21:25:03 UTC,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1
4,2010-04-17 02:19:01 UTC,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1


In [3]:
df['fare_amount'].describe()

count    120000.000000
mean         10.040326
std           7.500134
min           2.500000
25%           5.700000
50%           7.700000
75%          11.300000
max          49.900000
Name: fare_amount, dtype: float64

Haversine formula, where $\varphi$ is latitude and $\lambda$ is longitude (both in radians) and $r$ is the radius of the sphere:

$$d = 2r\arcsin\left(\sqrt{\sin^{2}\left(\frac{\varphi_{2}-\varphi_{1}}{2}\right)+\cos(\varphi_{1})\,\cos(\varphi_{2})\,\sin^{2}\left(\frac{\lambda_{2}-\lambda_{1}}{2}\right)}\right)$$

In [8]:
# 1) Engineer a travel-distance feature from the pickup and dropoff lat/long pairs.
# The haversine formula gives the great-circle distance between 2 lat/long coord pairs on a sphere

def haversine_distance(df, lat1, long1, lat2, long2):
    """Calculates the haversine distance (in km) between 2 sets of GPS coords in df.

    lat1, long1, lat2, long2 are the names of the df columns holding the coords in degrees.
    """
    r = 6371  # avg radius of Earth in km
    
    phi1 = np.radians(df[lat1])
    phi2 = np.radians(df[lat2])
    
    delta_phi = np.radians(df[lat2] - df[lat1])
    delta_lambda = np.radians(df[long2] - df[long1])

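    # a is the term under the square root in the haversine formula
    a = np.sin(delta_phi/2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda/2)**2
    # 2*arctan2(sqrt(a), sqrt(1-a)) is equivalent to 2*arcsin(sqrt(a)) for 0 <= a <= 1
    c = 2*np.arctan2(np.sqrt(a), np.sqrt(1-a))
    d = r*c  # in km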

    return d
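
As a quick sanity check, the sketch below runs the same computation on the coordinates of row 0 (copied from the `df.head()` output above) and compares the `arctan2` form used in the function with the `arcsin` form written in the formula. `haversine_km` is a hypothetical scalar helper written only for this check; it is not part of the notebook.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Hypothetical scalar helper: haversine distance in km, computed in both forms."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)
    a = np.sin(delta_phi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2) ** 2
    d_arcsin = 2 * r * np.arcsin(np.sqrt(a))                    # form used in the formula
    d_arctan2 = 2 * r * np.arctan2(np.sqrt(a), np.sqrt(1 - a))  # form used in the function
    return d_arcsin, d_arctan2

# Pickup and dropoff coordinates of row 0 in df.head()
d_arcsin, d_arctan2 = haversine_km(40.730521, -73.992365, 40.744746, -73.975499)
print(d_arcsin, d_arctan2)  # both should be roughly 2.13 km
```

The two forms agree to floating-point precision, so the choice between them is a matter of preference here.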

In [9]:
# We now engineer the new "distance_km" feature, which is more useful than some of the original features
df['distance_km'] = haversine_distance(
    df, 'pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'
)
df.head()

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance_km
0,2010-04-19 08:17:56 UTC,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1,2.126312
1,2010-04-17 15:43:53 UTC,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1,1.392307
2,2010-04-17 11:23:26 UTC,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2,3.326763
3,2010-04-11 21:25:03 UTC,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1,1.864129
4,2010-04-17 02:19:01 UTC,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1,7.231321


In [10]:
# Convert pickup_datetime from string to datetime
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df.head()

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance_km
0,2010-04-19 08:17:56+00:00,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1,2.126312
1,2010-04-17 15:43:53+00:00,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1,1.392307
2,2010-04-17 11:23:26+00:00,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2,3.326763
3,2010-04-11 21:25:03+00:00,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1,1.864129
4,2010-04-17 02:19:01+00:00,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1,7.231321


In [12]:
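# Shift UTC to New York local time. A fixed 4-hour offset is EDT (UTC-4), which fits these April dates;
# the column keeps the name pickup_EST and its values still carry the +00:00 UTC label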
df['pickup_EST'] = df['pickup_datetime'] - pd.Timedelta(hours=4)
df.head()

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance_km,pickup_EST
0,2010-04-19 08:17:56+00:00,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1,2.126312,2010-04-19 04:17:56+00:00
1,2010-04-17 15:43:53+00:00,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1,1.392307,2010-04-17 11:43:53+00:00
2,2010-04-17 11:23:26+00:00,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2,3.326763,2010-04-17 07:23:26+00:00
3,2010-04-11 21:25:03+00:00,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1,1.864129,2010-04-11 17:25:03+00:00
4,2010-04-17 02:19:01+00:00,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1,7.231321,2010-04-16 22:19:01+00:00
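
The fixed 4-hour shift is only correct while New York is on daylight time. For data that spans the whole year, one alternative is to let pandas apply the offset per date with `tz_convert`. The sketch below is an illustration under that assumption, not the approach the notebook takes; the two timestamps and the variable names `utc` and `local` are made up for the example.

```python
import pandas as pd

# Two UTC timestamps: one in April (daylight time) and one in January (standard time)
utc = pd.to_datetime(pd.Series(["2010-04-19 08:17:56", "2010-01-19 08:17:56"]), utc=True)

# tz_convert picks the offset that applies on each date: UTC-4 (EDT) or UTC-5 (EST)
local = utc.dt.tz_convert("America/New_York")

print(local.dt.hour.tolist())
```

The April timestamp lands on hour 4, matching the fixed-offset result above, while the January one lands on hour 3.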


In [19]:
df['hour'] = df['pickup_EST'].dt.hour
df['am'] = np.where(df['hour'] < 12, 1, 0)
df['weekday'] = df['pickup_EST'].dt.strftime('%a')
df.head()

Unnamed: 0,pickup_datetime,fare_amount,fare_class,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,distance_km,pickup_EST,Hour,weekday,am,hour
0,2010-04-19 08:17:56+00:00,6.5,0,-73.992365,40.730521,-73.975499,40.744746,1,2.126312,2010-04-19 04:17:56+00:00,4,Mon,1,4
1,2010-04-17 15:43:53+00:00,6.9,0,-73.990078,40.740558,-73.974232,40.744114,1,1.392307,2010-04-17 11:43:53+00:00,11,Sat,1,11
2,2010-04-17 11:23:26+00:00,10.1,1,-73.994149,40.751118,-73.960064,40.766235,2,3.326763,2010-04-17 07:23:26+00:00,7,Sat,1,7
3,2010-04-11 21:25:03+00:00,8.9,0,-73.990485,40.756422,-73.971205,40.748192,1,1.864129,2010-04-11 17:25:03+00:00,17,Sun,0,17
4,2010-04-17 02:19:01+00:00,19.7,1,-73.990976,40.734202,-73.905956,40.743115,1,7.231321,2010-04-16 22:19:01+00:00,22,Fri,0,22


In [20]:
df.columns

Index(['pickup_datetime', 'fare_amount', 'fare_class', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count', 'distance_km', 'pickup_EST', 'Hour', 'weekday', 'am',
       'hour'],
      dtype='object')

In [38]:
# Separate categorical from continuous columns
cat_cols = ['hour', 'weekday']
cont_cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude', 'passenger_count', 'distance_km']
target = 'fare_amount'  # We will treat this as a regression problem where we try to predict fare amount

In [39]:
# Change categorical cols to category dtype
for cat in cat_cols:
    df[cat] = df[cat].astype('category')
df.dtypes

pickup_datetime      datetime64[ns, UTC]
fare_amount                      float64
fare_class                         int64
pickup_longitude                 float64
pickup_latitude                  float64
dropoff_longitude                float64
dropoff_latitude                 float64
passenger_count                    int64
distance_km                      float64
pickup_EST           datetime64[ns, UTC]
Hour                               int64
weekday                         category
am                                 int32
hour                            category
dtype: object

In [40]:
df['hour'].head()

0     4
1    11
2     7
3    17
4    22
Name: hour, dtype: category
Categories (24, int64): [0, 1, 2, 3, ..., 20, 21, 22, 23]

In [41]:
df['weekday'].head()

0    Mon
1    Sat
2    Sat
3    Sun
4    Fri
Name: weekday, dtype: category
Categories (7, object): ['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed']

In [48]:
# Turn categorical cols into numpy arrays of integer codes (then PyTorch tensors)
hr = df['hour'].cat.codes.values
wkday = df['weekday'].cat.codes.values
cats = np.stack([hr, wkday], axis=1)
cats

array([[ 4,  1],
       [11,  2],
       [ 7,  2],
       ...,
       [14,  3],
       [ 4,  5],
       [12,  2]], dtype=int8)
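
Each code is the position of the value in the column's category list, and pandas sorts that list by default. That is why 'Mon' in the first row becomes 1 rather than 0. A minimal sketch of the mapping, using the weekdays of the first five rows; the calendar-ordered variant at the end is an assumption added for illustration, and the notebook keeps the default alphabetical order.

```python
import pandas as pd

# Weekdays of the first five rows shown in df.head()
raw = ["Mon", "Sat", "Sat", "Sun", "Fri"]
weekday = pd.Series(raw).astype("category")

print(list(weekday.cat.categories))  # categories are stored in sorted order
print(weekday.cat.codes.tolist())    # each code is the index into that sorted list

# Illustrative alternative: fix the category order explicitly (calendar order)
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
ordered = pd.Series(pd.Categorical(raw, categories=days, ordered=True))
print(ordered.cat.codes.tolist())
```

Because the codes only serve as indices into an embedding table, either ordering works as long as it stays the same between training and inference.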

In [49]:
# OR, as a one-liner...
# cats = np.stack([df[col].cat.codes.values for col in cat_cols], axis=1)
# cats

In [53]:
# Convert the categorical codes and the continuous columns to PyTorch tensors
cats = torch.tensor(cats, dtype=torch.int64)
conts = np.stack([df[col].values for col in cont_cols], axis=1)
conts = torch.tensor(conts, dtype=torch.float)


In [59]:
# Reshape the target into a column of shape (n, 1) instead of 1 long row of shape (n,)
y = torch.tensor(df[target].values, dtype=torch.float).reshape(-1,1)
print(y)

tensor([[ 6.5000],
        [ 6.9000],
        [10.1000],
        ...,
        [12.5000],
        [ 4.9000],
        [ 5.3000]])


In [60]:
# Determine the number of distinct categories in each categorical column
cat_szs = [len(df[col].cat.categories) for col in cat_cols]
cat_szs

[24, 7]

In [None]:
# Use embeddings - an embedding layer is a simple lookup table that maps each category index to a learned dense vector. It takes the place of one-hot encoding the categories (the same mechanism is used to store word vectors and retrieve them by index)

# General rule of thumb: set the embedding dimension to about n/2 (rounded up), where n is the # of categories in a column, and cap it at 50
emb_szs = [(size,min(50, (size+1)//2)) for size in cat_szs]
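
To make the sizes concrete, here is a minimal sketch of how `emb_szs` might feed `nn.Embedding` layers. The sample rows are the first three rows of the `cats` array printed earlier; the names `embeds`, `sample`, and `x_cat` are illustrative and not taken from the notebook.

```python
import torch
import torch.nn as nn

cat_szs = [24, 7]  # categories per column, as computed above
emb_szs = [(size, min(50, (size + 1) // 2)) for size in cat_szs]
print(emb_szs)  # [(24, 12), (7, 4)]

# One lookup table per categorical column: (number of categories, embedding dimension)
embeds = nn.ModuleList([nn.Embedding(n_cat, n_dim) for n_cat, n_dim in emb_szs])

# First three rows of the cats array: (hour code, weekday code)
sample = torch.tensor([[4, 1], [11, 2], [7, 2]], dtype=torch.int64)

# Look up each column in its own table, then concatenate along the feature axis
vectors = [emb(sample[:, i]) for i, emb in enumerate(embeds)]
x_cat = torch.cat(vectors, dim=1)
print(x_cat.shape)  # torch.Size([3, 16])
```

Each row ends up as 12 + 4 = 16 learned values, which can then be combined with the continuous features before the first linear layer.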