### Prediction of New York City Taxi Fares
In this section we transform the dataset to create new datetime and distance features. With these features, together with the remaining columns except the fare class, we try to predict the fare amount of each taxi ride with a regression model. In the second part we try to predict the fare class using PyTorch embedding layers for the zipcodes.

Description of the dataset

Features:
 
- pickup_datetime - timestamp value indicating when the taxi ride started.
- pickup_longitude - float for longitude coordinate of where the taxi ride started.
- pickup_latitude - float for latitude coordinate of where the taxi ride started.
- dropoff_longitude - float for longitude coordinate of where the taxi ride ended.
- dropoff_latitude - float for latitude coordinate of where the taxi ride ended.
- passenger_count - integer indicating the number of passengers in the taxi ride.

Target
- fare_amount: float dollar amount of the cost of the taxi ride. This value is only in the training set; this is what you are predicting in the test set and it is required in your submission CSV.

### Get the traveled distance with the coordinates

A regression approach that relates the traveled distance to the fare amount of each ride needs the distance to be computed from the coordinates provided. For this purpose we can derive a new feature with the Haversine formula, which gives the great-circle distance on a sphere between two pairs of GPS coordinates. This simplifies each trip to the shortest path over the Earth's surface between pickup and dropoff, rather than the route actually driven.

The distance formula works out to

$${\displaystyle d=2r\arcsin \left({\sqrt {\sin ^{2}\left({\frac {\varphi _{2}-\varphi _{1}}{2}}\right)+\cos(\varphi _{1})\:\cos(\varphi _{2})\:\sin ^{2}\left({\frac {\lambda _{2}-\lambda _{1}}{2}}\right)}}\right)}$$

where

$\begin{split} r&: \textrm {radius of the sphere (Earth's radius averages 6371 km)}\\
\varphi_1, \varphi_2&: \textrm {latitudes of point 1 and point 2}\\
\lambda_1, \lambda_2&: \textrm {longitudes of point 1 and point 2}\end{split}$
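As a quick sanity check of the formula, here is a minimal scalar sketch using only the standard library. The helper name `haversine_km` is hypothetical (it is not part of the notebook), and the coordinates are the pickup and dropoff points of the first row of the dataset, so the result should land close to that row's `dist_km` value.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # term under the square root in the Haversine formula
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# pickup and dropoff coordinates of the first ride in the dataset
d = haversine_km(40.730521, -73.992365, 40.744746, -73.975499)
print(round(d, 2))
```

This uses the arcsin form exactly as written in the formula. An arctan2 form, `2 * atan2(sqrt(a), sqrt(1 - a))`, is mathematically equivalent.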

In [1]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader

In [2]:
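# is_available() also checks that an MPS device can actually be used,
# not only that PyTorch was built with MPS support
has_mps = torch.backends.mps.is_available()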
device = "mps" if has_mps else "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
torch.manual_seed(42)
# Defining an early stopping class for PyTorch
import copy
class EarlyStopping:
  def __init__(self, patience=5, min_delta=0, restore_best_weights=True):
    self.patience = patience
    self.min_delta = min_delta
    self.restore_best_weights = restore_best_weights
    self.best_model = None
    self.best_loss = None
    self.patience_counter = 0
    self.status = ""

  def __call__(self, model, val_loss):
    if self.best_loss is None:
      self.best_loss = val_loss
      self.best_model = copy.deepcopy(model.state_dict())
    elif self.best_loss - val_loss >= self.min_delta:
      self.best_model = copy.deepcopy(model.state_dict())
      self.best_loss = val_loss
      self.status = f"Improvement!!!, actual counter {self.patience_counter}"
      self.patience_counter = 0
    else:
      self.patience_counter += 1
      self.status = f"NO improvement in the last {self.patience_counter} epochs"
      if self.patience_counter >= self.patience:
        self.status = f"Early stopping triggered after {self.patience_counter} epochs."
        if self.restore_best_weights:
          model.load_state_dict(self.best_model)
        return True
    return False

Using device: cpu
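To make the patience logic of the early stopping class easier to follow, here is a torch-free sketch that replays the same counter rule on a plain list of validation losses. The function name `first_stopping_epoch` and the loss values are made up for illustration; weight saving and restoring are left out.

```python
def first_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which patience-based early stopping triggers, or None."""
    best_loss = None
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss is None or best_loss - loss >= min_delta:
            # improvement: remember the new best loss and reset the counter
            best_loss = loss
            counter = 0
        else:
            # no improvement: count the epoch and stop once patience is exhausted
            counter += 1
            if counter >= patience:
                return epoch
    return None

# illustrative validation losses, not taken from a real training run
losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.69, 0.70, 0.71, 0.72]
stop_epoch = first_stopping_epoch(losses, patience=3)
print(stop_epoch)
```

With `patience=3` the improvement at epoch 6 resets the counter, so the stop only happens after three further epochs without improvement.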


In [3]:
# read the NYC taxi fares dataset
df = pd.read_csv("data/NYCTaxiFares.csv", na_values=["NA", "?"])

# check for missing values
missing_values = df.isnull().sum()
print(missing_values)

pickup_datetime      0
fare_amount          0
fare_class           0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   pickup_datetime    120000 non-null  object 
 1   fare_amount        120000 non-null  float64
 2   fare_class         120000 non-null  int64  
 3   pickup_longitude   120000 non-null  float64
 4   pickup_latitude    120000 non-null  float64
 5   dropoff_longitude  120000 non-null  float64
 6   dropoff_latitude   120000 non-null  float64
 7   passenger_count    120000 non-null  int64  
dtypes: float64(5), int64(2), object(1)
memory usage: 7.3+ MB


In [5]:
# function to calculate the distance of the travel
def haversine_distance(df, lat1, lon1, lat2, lon2):
    
    # average radius of the Earth in (km)
    r = 6371
    
    phi1 = np.radians(df[lat1])
    phi2 = np.radians(df[lat2])
    delta_phi = np.radians(df[lat2] - df[lat1])
    delta_lambda = np.radians(df[lon2] - df[lon1])
    
    a = np.sin(delta_phi/2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    d = (r * c)
    
    return d

In [6]:
# append a 'dist_km' new feature in the dataframe
df['dist_km'] = haversine_distance(df, 'pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude')
df.head()


| | pickup_datetime | fare_amount | fare_class | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | dist_km |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-04-19 08:17:56 UTC | 6.5 | 0 | -73.992365 | 40.730521 | -73.975499 | 40.744746 | 1 | 2.126312 |
| 1 | 2010-04-17 15:43:53 UTC | 6.9 | 0 | -73.990078 | 40.740558 | -73.974232 | 40.744114 | 1 | 1.392307 |
| 2 | 2010-04-17 11:23:26 UTC | 10.1 | 1 | -73.994149 | 40.751118 | -73.960064 | 40.766235 | 2 | 3.326763 |
| 3 | 2010-04-11 21:25:03 UTC | 8.9 | 0 | -73.990485 | 40.756422 | -73.971205 | 40.748192 | 1 | 1.864129 |
| 4 | 2010-04-17 02:19:01 UTC | 19.7 | 1 | -73.990976 | 40.734202 | -73.905956 | 40.743115 | 1 | 7.231321 |


### Time datatypes transformations
To work with timestamps, convert the column to the pandas datetime dtype with the pd.to_datetime function.

In [7]:
# convert the column to the pandas datetime dtype and inspect one value
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
time_sample = df['pickup_datetime'][0]
print(time_sample)
type(time_sample)

2010-04-19 08:17:56+00:00


pandas._libs.tslibs.timestamps.Timestamp

Correcting pickup_datetime for daylight saving time: the rides are from April, so there is a 4 hour difference between the UTC value in the dataframe and the local NYC time, Eastern Daylight Time (EDT).

In [8]:
df['EDTdate'] = df['pickup_datetime'] - pd.Timedelta(hours=4)
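The fixed 4 hour shift is only valid while daylight saving time is in effect. As an alternative sketch (not what this notebook does), pandas can convert timezone-aware timestamps with `tz_convert`, which picks the offset per timestamp. The two timestamps below are illustrative: one in April and one in January.

```python
import pandas as pd

# two UTC timestamps: one in April (daylight saving time) and one in January (standard time)
utc_times = pd.Series(pd.to_datetime(["2010-04-19 08:17:56", "2010-01-19 08:17:56"], utc=True))

# convert to New York local time; the UTC offset is chosen per timestamp
local_times = utc_times.dt.tz_convert("America/New_York")

print(local_times.dt.hour.tolist())
```

The April timestamp is shifted by 4 hours and the January one by 5, so a dataset spanning both seasons would need this kind of conversion rather than a constant offset.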

Extract new features from the time series

In [9]:
df['Hour'] = df['EDTdate'].dt.hour
df['AMorPM'] = np.where(df['Hour']<12, 'am', 'pm')
df['Weekday'] = df['EDTdate'].dt.strftime("%a")
df.head()

| | pickup_datetime | fare_amount | fare_class | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | dist_km | EDTdate | Hour | AMorPM | Weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-04-19 08:17:56+00:00 | 6.5 | 0 | -73.992365 | 40.730521 | -73.975499 | 40.744746 | 1 | 2.126312 | 2010-04-19 04:17:56+00:00 | 4 | am | Mon |
| 1 | 2010-04-17 15:43:53+00:00 | 6.9 | 0 | -73.990078 | 40.740558 | -73.974232 | 40.744114 | 1 | 1.392307 | 2010-04-17 11:43:53+00:00 | 11 | am | Sat |
| 2 | 2010-04-17 11:23:26+00:00 | 10.1 | 1 | -73.994149 | 40.751118 | -73.960064 | 40.766235 | 2 | 3.326763 | 2010-04-17 07:23:26+00:00 | 7 | am | Sat |
| 3 | 2010-04-11 21:25:03+00:00 | 8.9 | 0 | -73.990485 | 40.756422 | -73.971205 | 40.748192 | 1 | 1.864129 | 2010-04-11 17:25:03+00:00 | 17 | pm | Sun |
| 4 | 2010-04-17 02:19:01+00:00 | 19.7 | 1 | -73.990976 | 40.734202 | -73.905956 | 40.743115 | 1 | 7.231321 | 2010-04-16 22:19:01+00:00 | 22 | pm | Fri |


### Managing categorical and continuous values in Pandas
Defining the categorical columns, the continuous columns and the target feature

In [10]:
df.columns

Index(['pickup_datetime', 'fare_amount', 'fare_class', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'passenger_count', 'dist_km', 'EDTdate', 'Hour', 'AMorPM', 'Weekday'],
      dtype='object')

In [11]:
cat_cols = ['Hour', 'AMorPM', 'Weekday']
cont_cols = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'passenger_count', 'dist_km']

# target feature for regression task
y_col = ['fare_amount']

Check the current pandas dtypes of each column

In [12]:
df.dtypes

pickup_datetime      datetime64[ns, UTC]
fare_amount                      float64
fare_class                         int64
pickup_longitude                 float64
pickup_latitude                  float64
dropoff_longitude                float64
dropoff_latitude                 float64
passenger_count                    int64
dist_km                          float64
EDTdate              datetime64[ns, UTC]
Hour                               int32
AMorPM                            object
Weekday                           object
dtype: object

Convert the categorical columns to the pandas category dtype

In [13]:
for cat in cat_cols:
    df[cat] = df[cat].astype('category')

df.dtypes

pickup_datetime      datetime64[ns, UTC]
fare_amount                      float64
fare_class                         int64
pickup_longitude                 float64
pickup_latitude                  float64
dropoff_longitude                float64
dropoff_latitude                 float64
passenger_count                    int64
dist_km                          float64
EDTdate              datetime64[ns, UTC]
Hour                            category
AMorPM                          category
Weekday                         category
dtype: object

In [14]:
# list the categories inside a pandas categorical field
df['AMorPM'].cat.categories

Index(['am', 'pm'], dtype='object')

In [15]:
# return a numpy array with the integer code of each row's category in the given column
hr = df['Hour'].cat.codes.values
ampm = df['AMorPM'].cat.codes.values
wkdy = df['Weekday'].cat.codes.values
print(hr, ampm, wkdy)

[ 4 11  7 ... 14  4 12] [0 0 0 ... 1 0 1] [1 2 2 ... 3 5 2]
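The weekday codes above follow alphabetical order, not calendar order, because `astype('category')` sorts the unique values to build the categories. The small sketch below illustrates this on a made-up series, and shows an optional explicit ordering with `pd.CategoricalDtype` (an alternative, not what this notebook does).

```python
import pandas as pd

# a small made-up sample of weekday labels
days = pd.Series(["Mon", "Sat", "Sat", "Sun", "Fri"])

# default: categories are the sorted unique values, so codes follow alphabetical order
default_cat = days.astype("category")
print(list(default_cat.cat.categories))
print(default_cat.cat.codes.tolist())

# explicit calendar order, so that Mon maps to 0 and Sun to 6
calendar = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
ordered_cat = days.astype(pd.CategoricalDtype(categories=calendar, ordered=True))
print(ordered_cat.cat.codes.tolist())
```

For embedding layers the order of the codes does not matter, as long as each category keeps the same code at training and inference time.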


In [16]:
# stack together in a single array matrix
cats = np.stack([hr, ampm, wkdy], axis=1)
cats

array([[ 4,  0,  1],
       [11,  0,  2],
       [ 7,  0,  2],
       ...,
       [14,  1,  3],
       [ 4,  0,  5],
       [12,  1,  2]], dtype=int8)

In [17]:
# create categorical codes matrix with list comprehension
cats_l = np.stack([df[col].cat.codes.values for col in cat_cols], axis=1)
cats_l

array([[ 4,  0,  1],
       [11,  0,  2],
       [ 7,  0,  2],
       ...,
       [14,  1,  3],
       [ 4,  0,  5],
       [12,  1,  2]], dtype=int8)

In [18]:
# create continuous matrix with list comprehension
conts_l = np.stack([df[col].values for col in cont_cols], axis=1)

### PyTorch features and target tensors

In [19]:
# create categorical tensor
cats = torch.tensor(cats_l, dtype=torch.int32)

# create continuous tensor
conts = torch.tensor(conts_l, dtype=torch.float32)

# create label (y) tensor
y = torch.tensor(df[y_col].values, dtype=torch.float32)

In [20]:
# categorical sizes list
cat_sizes = [len(df[col].cat.categories) for col in cat_cols]
cat_sizes

[24, 2, 7]

In [21]:
# embedding sizes list: half the number of unique entries in each column (rounded up), capped at 50
emb_sizes = [(size, min(50,(size+1)//2)) for size in cat_sizes]
emb_sizes

[(24, 12), (2, 1), (7, 4)]
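The sizing rule can be checked by hand with a small helper. The name `embedding_size` is hypothetical, and the value 1000 is added only to show the cap taking effect.

```python
def embedding_size(n_categories, max_dim=50):
    """Half the number of categories, rounded up, capped at max_dim."""
    return min(max_dim, (n_categories + 1) // 2)

# 24 hours, 2 am/pm values, 7 weekdays, plus a large made-up cardinality
for n in [24, 2, 7, 1000]:
    print(n, embedding_size(n))
```

Adding 1 before the integer division rounds odd counts up, which is why the 7 weekdays get 4 dimensions rather than 3.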

### Illustration of how the embeddings are treated inside the "Tabular model"
The name "Tabular Model" comes from the fastai library and documentation: https://docs.fast.ai/tabular.model.html

In [22]:
# take a sample of the categorical tensor
cat_sample = cats[:2]
cat_sample

tensor([[ 4,  0,  1],
        [11,  0,  2]], dtype=torch.int32)

In [23]:
# with the embedding sizes list create torch embedding layers and store them in a ModuleList
# nn.ModuleList is iterable like a Python list and registers each layer with the parent module
def create_emb_layers(emb_sizes):
    mod_list = []
    for ni, nf in emb_sizes:
        mod_list.append(nn.Embedding(ni, nf))

    self_embeddings = nn.ModuleList(mod_list)
    return self_embeddings

self_embeddings = create_emb_layers(emb_sizes)
self_embeddings

ModuleList(
  (0): Embedding(24, 12)
  (1): Embedding(2, 1)
  (2): Embedding(7, 4)
)

In [24]:
# list comprehension version to create the embedding layers
self_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_sizes])
self_embeddings

ModuleList(
  (0): Embedding(24, 12)
  (1): Embedding(2, 1)
  (2): Embedding(7, 4)
)

In [25]:
# recreation of the forward method
embeddings_sample = []
for i,e in enumerate(self_embeddings):
    embeddings_sample.append(e(cat_sample[:,i]))
embeddings_sample

[tensor([[ 0.1937,  0.5139,  0.2956,  1.1011, -0.6396,  1.6985, -1.2358, -2.6161,
          -0.2337,  0.3107,  0.8509,  0.2859],
         [-0.1178, -0.7355,  0.6945,  0.5319,  0.3181,  0.2426,  1.7855, -0.5563,
          -0.4952, -0.3381, -1.3796,  0.6923]], grad_fn=<EmbeddingBackward0>),
 tensor([[-0.4390],
         [-0.4390]], grad_fn=<EmbeddingBackward0>),
 tensor([[-1.1825,  0.6685, -1.7540, -1.5680],
         [-1.3548, -0.8874, -0.7264, -0.1126]], grad_fn=<EmbeddingBackward0>)]

In [26]:
# index tensors fed to each embedding layer: one column of cat_sample per categorical feature
emb_sample = []
for i,e in enumerate(self_embeddings):
    emb_sample.append(cat_sample[:,i])
emb_sample

[tensor([ 4, 11], dtype=torch.int32),
 tensor([0, 0], dtype=torch.int32),
 tensor([1, 2], dtype=torch.int32)]

In [27]:
# concatenate the embedded tensors along dim=1, so each row keeps all of its embedding values side by side
z = torch.cat(embeddings_sample, 1)
z

tensor([[ 0.1937,  0.5139,  0.2956,  1.1011, -0.6396,  1.6985, -1.2358, -2.6161,
         -0.2337,  0.3107,  0.8509,  0.2859, -0.4390, -1.1825,  0.6685, -1.7540,
         -1.5680],
        [-0.1178, -0.7355,  0.6945,  0.5319,  0.3181,  0.2426,  1.7855, -0.5563,
         -0.4952, -0.3381, -1.3796,  0.6923, -0.4390, -1.3548, -0.8874, -0.7264,
         -0.1126]], grad_fn=<CatBackward0>)
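An embedding layer is a trainable lookup table: each integer code selects one row of its weight matrix. The sketch below, with made-up sizes and indices, checks that the layer output equals plain indexing into `weight`, and that concatenating along `dim=1` widens each row instead of adding rows.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(7, 4)            # 7 categories, 4-dimensional vectors
idx = torch.tensor([1, 2, 1])       # three rows, two of them share category 1

out = emb(idx)
# the lookup returns the rows of the weight matrix selected by the indices
same_as_indexing = torch.equal(out, emb.weight[idx])
print(out.shape, same_as_indexing)

# concatenating along dim=1 joins the feature columns of each row
other = nn.Embedding(2, 1)(torch.tensor([0, 0, 1]))
z = torch.cat([out, other], dim=1)
print(z.shape)
```

Rows with the same code receive the same vector, which is why both sample rows above share the value for the `am` category.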

### Tabular model definition

In [28]:
class TabularModel(nn.Module):
    def __init__(self, emb_sizes_s, n_cont, out_sizes, layers, p=0.5):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_sizes_s])
        self.emb_drop = nn.Dropout(p)
        self.batch_norm_cont = nn.BatchNorm1d(n_cont)
        
        layer_list = []
        n_emb = sum([nf for ni, nf in emb_sizes_s])
        n_in = n_emb + n_cont
        for i in layers:
            layer_list.append(nn.Linear(n_in, i))
            layer_list.append(nn.ReLU(inplace=True))
            layer_list.append(nn.BatchNorm1d(i))
            layer_list.append(nn.Dropout(p))
            n_in = i
            
        layer_list.append(nn.Linear(layers[-1], out_sizes))
        # The asterisk (*) unpacks a list or tuple: nn.Sequential(*layer_list) passes
        # the elements of layer_list as individual arguments to the nn.Sequential constructor.
        self.layers = nn.Sequential(*layer_list)
    
    def forward(self, x_cat, x_cont):
        embeddings = []
        for i, e in enumerate(self.embeds):
            embeddings.append(e(x_cat[:, i]))
        
        x = torch.cat(embeddings, 1)
        x = self.emb_drop(x)
        
        x_cont = self.batch_norm_cont(x_cont)
        x = torch.cat([x, x_cont], 1)
        x = self.layers(x)
        return x

### Training instances and datasets

In [30]:
# model and early stop instances
model = TabularModel(emb_sizes, conts.shape[1], 1, [200, 100], p=0.4).to(device)
early_stop = EarlyStopping(patience=40)
model

TabularModel(
  (embeds): ModuleList(
    (0): Embedding(24, 12)
    (1): Embedding(2, 1)
    (2): Embedding(7, 4)
  )
  (emb_drop): Dropout(p=0.4, inplace=False)
  (batch_norm_cont): BatchNorm1d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=23, out_features=200, bias=True)
    (1): ReLU(inplace=True)
    (2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.4, inplace=False)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): ReLU(inplace=True)
    (6): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.4, inplace=False)
    (8): Linear(in_features=100, out_features=1, bias=True)
  )
)
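The layer widths in the printed model can be reproduced with simple bookkeeping. This sketch mirrors the loop in the model constructor using plain Python; the variable names are chosen for illustration.

```python
emb_sizes = [(24, 12), (2, 1), (7, 4)]   # (number of categories, embedding width)
n_cont = 6                                # continuous columns
hidden = [200, 100]                       # hidden layer widths
out_size = 1                              # single regression output

# input width = total embedding width + number of continuous features
n_in = sum(nf for _, nf in emb_sizes) + n_cont
linear_shapes = []
for width in hidden:
    linear_shapes.append((n_in, width))
    n_in = width
linear_shapes.append((n_in, out_size))

print(linear_shapes)
```

The first pair corresponds to the 17 embedding values plus the 6 continuous features that enter the first linear layer.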

In [None]:
# combining categorical and continuous tensors for shuffling
combined = torch.cat([cats, conts, y], dim=1)

# torch to numpy array
combined = combined.numpy()

# splitting the combined data into train and test sets
train_data, test_data = tra