## Use PyTorch `Dataset` and `DataLoader` with a structured dataset

* **IMPORTANT:** replace `URL = None` in the following cell with the URL value from your learning materials.

In [0]:
import os

import pandas as pd
import torch as pt

from torch import nn
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset

pt.set_default_dtype(pt.float64)

URL = None

assert URL and (type(URL) is str), "Be sure to initialize URL using the value from your learning materials"
os.environ['URL'] = URL

Download and unzip the contents of the `URL` to the `data` subdirectory.

In [0]:
%%bash
wget -q $URL
mkdir -p data
find *.zip | xargs unzip -o -d data/

Read the files that match `part-*.csv` from the `data` subdirectory into a Pandas data frame named `df`.

In [0]:
from pathlib import Path

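# ignore_index=True gives the combined frame a single unique RangeIndex,
# which the index-based train/test split later in the notebook relies on.
df = pd.concat(
    (pd.read_csv(file) for file in Path('data/').glob('part-*.csv')),
    ignore_index=True,
)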


## Explore the `df` data frame, including the column names, the first few rows of the dataset, and the data frame's memory usage.

In [4]:
df[:5]

|   | fareamount | origindatetime_tr | origin_block_latitude | origin_block_longitude | destination_block_latitude | destination_block_longitude |
|---|---|---|---|---|---|---|
| 0 | 4.6 | 09/04/2018 00:00 | 38.898318 | -77.0328 | 38.899817 | -77.026514 |
| 1 | 7.3 | 09/04/2018 00:00 | 38.888945 | -77.03949 | 38.899817 | -77.026514 |
| 2 | 5.14 | 09/04/2018 00:00 | 38.909652 | -77.033254 | 38.91861 | -77.035028 |
| 3 | 11.67 | 09/04/2018 00:00 | 38.896667 | -76.982929 | 38.854779 | -76.974019 |
| 4 | 10.27 | 09/04/2018 00:00 | 38.897204 | -77.008388 | 38.907244 | -77.045287 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3289206 entries, 0 to 3289205
Data columns (total 6 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   fareamount                   float64
 1   origindatetime_tr            object 
 2   origin_block_latitude        float64
 3   origin_block_longitude       float64
 4   destination_block_latitude   float64
 5   destination_block_longitude  float64
dtypes: float64(5), object(1)
memory usage: 150.6+ MB


## Drop the `origindatetime_tr` column from the data frame. 

For now, you will predict the taxi fare using only the latitude/longitude coordinates of the pickup and drop-off locations. Remove the `origindatetime_tr` column from the data frame to produce your working dataset.

In [6]:
working_df = df.drop('origindatetime_tr', axis = 1)
working_df.shape

(3289206, 5)

## Sample 10% of your working dataset into a test dataset data frame

* **hint:** use the Pandas `sample` method of the data frame. Specify a value for `random_state` to make the sample reproducible.

In [7]:
test_df = working_df.sample(frac = 0.10, random_state = 42)
test_df.shape

(328921, 5)

## Drop the rows that exist in your test dataset from the working dataset to produce a training dataset.

* **hint:** DataFrame's `drop` function can use index values from a data frame to drop specific rows.

In [8]:
train_df = working_df.drop(index = test_df.index)
train_df.shape

(2960285, 5)
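
The `sample` then `drop` pattern above can be checked on a small frame. The following is a minimal sketch, assuming a hypothetical 10-row `toy_df` with a unique index in place of `working_df`; it confirms that the held-out rows and the remaining rows do not overlap.

```python
import pandas as pd

# Hypothetical toy frame standing in for working_df; its index is unique.
toy_df = pd.DataFrame({"fareamount": [float(i) for i in range(10)]})

# Hold out 20% of the rows; random_state makes the draw repeatable.
toy_test = toy_df.sample(frac=0.20, random_state=42)

# Drop the held-out rows by index label to obtain the training rows.
toy_train = toy_df.drop(index=toy_test.index)

# The two parts share no index labels and together cover the original frame.
overlap = toy_train.index.intersection(toy_test.index)
print(len(toy_test), len(toy_train), len(overlap))
```

Because `drop(index=...)` works on index labels, this split is only clean when the labels are unique. If several rows shared a label, `drop` would remove all of them.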

## Define 2 Python lists: 1st for the feature column names; 2nd for the target column name

In [0]:
FEATURES = ['origin_block_latitude','origin_block_longitude','destination_block_latitude','destination_block_longitude']
TARGET = ['fareamount']

## Create `X` and `y` tensors with the values of your feature and target columns in the training dataset

In [0]:
X = pt.tensor(train_df[FEATURES].values)
y = pt.tensor(train_df[TARGET].values)
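
Selecting columns with a list (as `TARGET` is defined above) keeps the result two-dimensional, so `y` has shape `(n, 1)` and lines up with the `(n, 1)` output of a single-output linear layer. A minimal sketch of the difference, using a hypothetical three-row `toy_df` and illustrative column names:

```python
import pandas as pd
import torch as pt

# Hypothetical toy frame; the column names are illustrative only.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0, 6.0],
    "target": [0.5, 1.5, 2.5],
})

X_toy = pt.tensor(toy_df[["a", "b"]].values)    # list of columns -> shape (3, 2)
y_toy = pt.tensor(toy_df[["target"]].values)    # one-element list -> shape (3, 1)
y_flat = pt.tensor(toy_df["target"].values)     # plain string -> shape (3,)

print(X_toy.shape, y_toy.shape, y_flat.shape)
print(X_toy.dtype)  # float64 columns become float64 tensors
```

Keeping the target two-dimensional avoids unintended broadcasting when a loss compares it against `(n, 1)` predictions.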

## Create a `TensorDataset` instance with the `y` and `X` tensors (in that order)

In [0]:
train_ds = TensorDataset(y, X)
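
The argument order matters because each item of a `TensorDataset` is a tuple in the same order as the constructor arguments. A small sketch with hypothetical toy tensors:

```python
import torch as pt
from torch.utils.data import TensorDataset

# Hypothetical toy tensors: three targets and three 2-feature rows.
y_toy = pt.tensor([[10.0], [20.0], [30.0]])
X_toy = pt.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

toy_ds = TensorDataset(y_toy, X_toy)

# Indexing returns (target, features), matching the constructor order.
first_y, first_X = toy_ds[0]
print(len(toy_ds), first_y.shape, first_X.shape)
```

Any code that unpacks items or batches from this dataset therefore has to unpack the target first and the features second.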

## Create a `DataLoader` instance specifying a custom batch size

A batch size of `2 ** 18 = 262,144` should work well.

In [12]:
BATCH_SIZE = 2 ** 18
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)
len(train_dl)

12
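
The value `12` follows from the training set size and the batch size. A short worked check, assuming the `DataLoader` default of keeping the final partial batch (`drop_last=False`):

```python
import math

N_TRAIN = 2_960_285    # rows in train_df, from the shape shown earlier
BATCH_SIZE = 2 ** 18   # 262,144

# With the partial batch kept, the batch count is the ceiling of the ratio.
n_batches = math.ceil(N_TRAIN / BATCH_SIZE)

# Rows left over for the final, smaller batch.
last_batch_rows = N_TRAIN - (n_batches - 1) * BATCH_SIZE
print(n_batches, last_batch_rows)
```

So eleven batches are full and the twelfth holds the remaining 76,701 rows.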

## Create a model using `nn.Linear`

In [0]:
w = nn.Linear(len(FEATURES), 1)


## Create an instance of the `AdamW` optimizer for the model

In [0]:
optimizer = pt.optim.AdamW(w.parameters())

## Declare your `forward`, `loss` and `metric` functions

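* **hint:** if you are tired of computing MSE by hand, you can use `nn.functional.mse_loss` instead.

In the cell below, `loss` returns both the MSE (used as the loss) and the RMSE (used as the metric).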

In [0]:
def forward(X):
  return w(X)

def loss(y_pred, y):
  mse = nn.functional.mse_loss(y_pred, y)
  return mse, mse.sqrt()
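
As a sanity check on the hint, the following sketch compares a hand-computed MSE with `nn.functional.mse_loss` on hypothetical toy values and derives the RMSE the same way `loss` does.

```python
import torch as pt
from torch import nn

# Hypothetical predictions and targets, both of shape (3, 1).
y_pred = pt.tensor([[2.0], [4.0], [6.0]])
y_true = pt.tensor([[1.0], [4.0], [8.0]])

# By hand: mean of squared errors, here (1 + 0 + 4) / 3.
mse_manual = ((y_pred - y_true) ** 2).mean()

# Built-in equivalent (the default reduction is the mean).
mse_builtin = nn.functional.mse_loss(y_pred, y_true)
rmse = mse_builtin.sqrt()

print(mse_manual.item(), mse_builtin.item(), rmse.item())
```

RMSE is reported alongside MSE because it is in the same units as the target (here, the fare amount).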

## Iterate over the batches returned by your `DataLoader` instance

For every step of gradient descent, print out the MSE, RMSE, and the batch index
* **hint:** you can use Python's `enumerate` with an iterable
* **hint:** each batch returned while enumerating the `DataLoader` has the same structure as your `TensorDataset` instance: the `y` values first, then the `X` values

In [16]:
for batch_idx, batch in enumerate(train_dl):
  y, X = batch
  y_pred = forward(X)
  mse, rmse = loss(y_pred, y)
  mse.backward()
  print("Loss: ", mse.item(), " RMSE: ", rmse.item(), " Batch Idx: ", batch_idx)
  optimizer.step()
  optimizer.zero_grad()


Loss:  1209.8594636144935  RMSE:  34.78303413468258  Batch Idx:  0
Loss:  1196.3749611549224  RMSE:  34.58865364761864  Batch Idx:  1
Loss:  1178.9680632450134  RMSE:  34.33610436908959  Batch Idx:  2
Loss:  1168.8755737473123  RMSE:  34.18882235098648  Batch Idx:  3
Loss:  1173.7366406115639  RMSE:  34.25984005525367  Batch Idx:  4
Loss:  1164.8815089894983  RMSE:  34.13036051654741  Batch Idx:  5
Loss:  1153.9715867701436  RMSE:  33.97015729681191  Batch Idx:  6
Loss:  1185.5553946710122  RMSE:  34.43189502003937  Batch Idx:  7
Loss:  1135.4230635040444  RMSE:  33.69603928511546  Batch Idx:  8
Loss:  1095.7779420329684  RMSE:  33.1025367915054  Batch Idx:  9
Loss:  1138.6996221462741  RMSE:  33.74462360356497  Batch Idx:  10
Loss:  1140.6698334085045  RMSE:  33.77380395230162  Batch Idx:  11
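
The loop above calls `optimizer.zero_grad()` after every step because PyTorch adds each new gradient to whatever is already stored in `.grad`. A minimal sketch of that accumulation, using a hypothetical two-feature model and toy data:

```python
import torch as pt
from torch import nn

pt.manual_seed(0)
model = nn.Linear(2, 1)  # hypothetical toy model
X_toy = pt.tensor([[1.0, 2.0], [3.0, 4.0]])
y_toy = pt.tensor([[1.0], [2.0]])

# The first backward pass populates .grad.
nn.functional.mse_loss(model(X_toy), y_toy).backward()
grad_once = model.weight.grad.clone()

# A second backward pass without zeroing adds to the stored gradient.
nn.functional.mse_loss(model(X_toy), y_toy).backward()
grad_twice = model.weight.grad.clone()

print(pt.allclose(grad_twice, 2 * grad_once))

# Clearing resets the gradient (to None or zeros, depending on the version).
model.zero_grad()
```

Without the reset, each optimizer step would use the sum of the gradients from all earlier batches rather than the gradient of the current batch.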


## Implement 10 epochs of gradient descent training

For every step of gradient descent, print out the MSE, RMSE, epoch index, and batch index.

* **hint:** you can call `enumerate` on the same `DataLoader` instance repeatedly, once per epoch, inside an outer `for` loop

In [17]:
for epoch in range(10):
  for batch_idx, batch in enumerate(train_dl):
    y, X = batch
    y_pred = forward(X)
    mse, rmse = loss(y_pred, y)
    mse.backward()
    print(" Loss: ", mse.item(), " RMSE: ", rmse.item(), " Epoch: ", epoch, " Batch Idx: ", batch_idx)
    optimizer.step()
    optimizer.zero_grad()

 Loss:  1024.9110619839025  RMSE:  32.014232178578055  Epoch:  0  Batch Idx:  0
 Loss:  1012.6604717743405  RMSE:  31.822326624154  Epoch:  0  Batch Idx:  1
 Loss:  996.8449188331958  RMSE:  31.572850977274697  Epoch:  0  Batch Idx:  2
 Loss:  987.6885137099952  RMSE:  31.42751205090844  Epoch:  0  Batch Idx:  3
 Loss:  992.5071579049934  RMSE:  31.504081607071065  Epoch:  0  Batch Idx:  4
 Loss:  984.6750405196823  RMSE:  31.37953219089925  Epoch:  0  Batch Idx:  5
 Loss:  974.9285092346067  RMSE:  31.223845202578858  Epoch:  0  Batch Idx:  6
 Loss:  1004.5903348602639  RMSE:  31.69527306808168  Epoch:  0  Batch Idx:  7
 Loss:  958.650496555152  RMSE:  30.962081592734556  Epoch:  0  Batch Idx:  8
 Loss:  922.4228107403496  RMSE:  30.371414368454257  Epoch:  0  Batch Idx:  9
 Loss:  962.3974534173578  RMSE:  31.022531383131163  Epoch:  0  Batch Idx:  10
 Loss:  964.6529105287545  RMSE:  31.058862028875986  Epoch:  0  Batch Idx:  11
 Loss:  858.4349159380986  RMSE:  29.299059983864645  
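
Per-batch printouts get long over many epochs. One possible variation, sketched here on synthetic data rather than the taxi dataset, records the average loss of each epoch instead. The data, the model size, and the learning rate `lr=0.05` are assumptions chosen for this toy problem; the notebook itself uses the `AdamW` defaults.

```python
import torch as pt
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

pt.manual_seed(0)

# Synthetic regression data (hypothetical, not the taxi dataset).
X_toy = pt.randn(1024, 4)
true_w = pt.tensor([[1.0], [-2.0], [0.5], [3.0]])
y_toy = X_toy @ true_w + 0.1 * pt.randn(1024, 1)

# Same (y, X) ordering as the notebook's TensorDataset.
toy_dl = DataLoader(TensorDataset(y_toy, X_toy), batch_size=256)
model = nn.Linear(4, 1)
optimizer = pt.optim.AdamW(model.parameters(), lr=0.05)  # lr is an assumption

epoch_rmse = []
for epoch in range(10):
    batch_mse = []
    for y_batch, X_batch in toy_dl:
        mse = nn.functional.mse_loss(model(X_batch), y_batch)
        mse.backward()
        optimizer.step()
        optimizer.zero_grad()
        batch_mse.append(mse.item())
    # Square root of the mean batch MSE for this epoch.
    epoch_rmse.append((sum(batch_mse) / len(batch_mse)) ** 0.5)

print([round(v, 3) for v in epoch_rmse])
```

One value per epoch makes it easier to see whether training is still improving than scanning every batch line.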

Copyright 2020 CounterFactual.AI LLC. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.