## **TODO:** Set the value of `URL` to the URL from your learning materials

In [1]:
import os

URL = "https://s3.amazonaws.com/courses.axel.net/data_working_with_pytorch.zip"
assert URL and isinstance(URL, str), "Be sure to initialize URL using the value from your learning materials"
os.environ['URL'] = URL

In [2]:
%%bash
wget -q $URL -O ./data.zip
mkdir -p data
find *.zip | xargs unzip -o -d data/

Archive:  data.zip
  inflating: data/Consumer_Complaints.csv  
  inflating: data/WA_Fn-UseC_-Sales-Win-Loss.csv  
  inflating: data/part-00000-8f28ed67-aaf4-4f30-9fdf-c27b83547562-c000.csv  
  inflating: data/part-00000-e4c68082-53e1-4e99-add3-ec4b4d46a1e9-c000.csv  
  inflating: data/states.csv         


## Use PyTorch `Dataset` and `DataLoader` with a structured dataset

In [3]:
import os

import pandas as pd
import torch as pt

from torch import nn
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset

pt.set_default_dtype(pt.float64)

Read the files that match `part-*.csv` from the `data` subdirectory into a Pandas data frame named `df`.

In [4]:
from pathlib import Path

df = pd.concat(
    pd.read_csv(file) for file in Path('data/').glob('part-*.csv')
)


## Explore the `df` data frame, including the column names, the first few rows of the dataset, and the data frame's memory usage.

In [5]:
df[:5]

Unnamed: 0,fareamount,origindatetime_tr,origin_block_latitude,origin_block_longitude,destination_block_latitude,destination_block_longitude,id
0,4.87,06/01/2017 07:00,38.898314,-77.028849,38.902521,-77.030791,751d10ef2403c770a3bd4e220db8594b656d6774962b63...
1,12.7,06/01/2017 14:00,38.904683,-77.046645,38.940181,-77.061193,a9ddc1ab38a3cc3f360e4d2408678d707658762c418e6c...
2,5.14,06/01/2017 12:00,38.910635,-77.042514,38.909652,-77.033254,1f804117b3d98193b5ab7fddc15a543a8165cd60b6b20e...
3,5.14,06/02/2017 13:00,38.889184,-77.021907,38.897207,-77.023477,21af1912855db837c7892fb073f4c59678c305aec0b23b...
4,14.32,06/01/2017 13:00,38.901336,-77.037534,38.942216,-77.073508,26dcdd256e6269e4c6f1ccd2119c345c4deed788a35082...


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6368133 entries, 0 to 3289205
Data columns (total 7 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   fareamount                   float64
 1   origindatetime_tr            object 
 2   origin_block_latitude        float64
 3   origin_block_longitude       float64
 4   destination_block_latitude   float64
 5   destination_block_longitude  float64
 6   id                           object 
dtypes: float64(5), object(2)
memory usage: 388.7+ MB


## Drop the `origindatetime_tr` column from the data frame. 

For now, you will predict the taxi fare based only on the latitude/longitude coordinates of the pickup and drop-off locations. Remove the `origindatetime_tr` column from the data frame to create your working dataset.

In [7]:
working_df = df.drop('origindatetime_tr', axis = 1)
working_df.shape

(6368133, 6)

## Sample 10% of your working dataset into a test dataset data frame

* **hint:** call the Pandas `sample` method on the data frame. Specify a value for `random_state` to make the sample reproducible.

In [8]:
test_df = working_df.sample(frac = 0.10, random_state = 42)
test_df.shape

(636813, 6)

## Drop the rows that exist in your test dataset from the working dataset to produce a training dataset.

* **hint:** DataFrame's `drop` method can use index values from another data frame to drop specific rows.

In [9]:
train_df = working_df.drop(index = test_df.index)
train_df.shape

(5177451, 6)
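
Note that the training set above has fewer rows than the expected 90% (6,368,133 - 636,813 = 5,731,320). The `df.info()` output shows index labels running only from 0 to 3289205 for 6,368,133 rows, so labels repeat across the concatenated files, and `drop(index=...)` removes every row that shares a sampled label. The following is a minimal sketch on toy data (all names and values here are hypothetical), assuming you want a unique index, which `ignore_index=True` provides:

```python
import pandas as pd

# Two toy "part" files, each with its own 0..2 index (hypothetical data).
part_a = pd.DataFrame({"fareamount": [1.0, 2.0, 3.0]})
part_b = pd.DataFrame({"fareamount": [4.0, 5.0, 6.0]})

# Default concat keeps the per-file labels, so 0, 1, 2 each appear twice.
dup_df = pd.concat([part_a, part_b])
dup_test = dup_df.iloc[[0]]                    # one "sampled" row, label 0
dup_train = dup_df.drop(index=dup_test.index)  # drops BOTH rows labeled 0

# ignore_index=True relabels the rows 0..n-1, so every label is unique.
uniq_df = pd.concat([part_a, part_b], ignore_index=True)
uniq_test = uniq_df.sample(frac=0.5, random_state=42)
uniq_train = uniq_df.drop(index=uniq_test.index)

print(len(dup_train))                    # 4, not 5
print(len(uniq_train) + len(uniq_test))  # 6
```

With a unique index, the train and test row counts add up to the size of the working dataset.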

## Define 2 Python lists: 1st for the feature column names; 2nd for the target column name

In [10]:
FEATURES = ['origin_block_latitude','origin_block_longitude','destination_block_latitude','destination_block_longitude']
TARGET = ['fareamount']

## Create `X` and `y` tensors with the values of your feature and target columns in the training dataset

In [11]:
X = pt.tensor(train_df[FEATURES].values)
y = pt.tensor(train_df[TARGET].values)
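
The tensors above inherit `float64` from the NumPy arrays behind the data frame, which is likely why the notebook calls `pt.set_default_dtype(pt.float64)` earlier: layers created afterwards get `float64` parameters that match the data. A small sketch with a hypothetical one-row array and a hypothetical `layer`:

```python
import numpy as np
import torch as pt
from torch import nn

# NumPy float64 values (what DataFrame.values returns for float64 columns)
# keep their dtype when passed to pt.tensor.
values = np.array([[38.9, -77.0, 38.9, -77.1]], dtype=np.float64)
X_demo = pt.tensor(values)

# Layers created after set_default_dtype pick up float64 parameters,
# so they can consume X_demo without a cast.
pt.set_default_dtype(pt.float64)
layer = nn.Linear(4, 1)

print(X_demo.dtype)         # torch.float64
print(layer.weight.dtype)   # torch.float64
print(layer(X_demo).shape)  # torch.Size([1, 1])
```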

## Create a `TensorDataset` instance with the `y` and `X` tensors (in that order)

In [12]:
train_ds = TensorDataset(y, X)
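
The argument order matters because a `TensorDataset` returns its items as tuples in the same order the tensors were passed in. A toy sketch (tensor values are hypothetical) of what indexing such a dataset gives back:

```python
import torch as pt
from torch.utils.data import TensorDataset

# Toy tensors: 3 rows, 1 target column, 2 feature columns.
y_demo = pt.tensor([[10.0], [20.0], [30.0]])
X_demo = pt.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

demo_ds = TensorDataset(y_demo, X_demo)

# Indexing returns a tuple in constructor order: target first, then features.
target, features = demo_ds[1]
print(len(demo_ds))  # 3
print(target)        # tensor([20.])
print(features)      # tensor([3., 4.])
```

This is why the training loops in this notebook unpack each batch as `y, X = batch`.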

## Create a `DataLoader` instance specifying a custom batch size

A batch size of `2 ** 18 = 262,144` should work well.

In [13]:
BATCH_SIZE = 2 ** 18
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)
len(train_dl)

20
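
The 20 batches reported above follow from the row count and the batch size. As a quick check, assuming the `DataLoader` default of keeping the final partial batch (`drop_last=False`):

```python
import math

n_rows = 5_177_451    # rows in train_df, from the shape printed earlier
batch_size = 2 ** 18  # 262,144

# With drop_last=False the final partial batch is kept, so the batch
# count is the ceiling of rows / batch size.
n_batches = math.ceil(n_rows / batch_size)
last_batch_rows = n_rows - (n_batches - 1) * batch_size

print(n_batches)        # 20
print(last_batch_rows)  # 196715
```

The last batch is smaller than the others, so its loss is averaged over fewer rows.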

## Create a model using `nn.Linear`

In [14]:
w = nn.Linear(len(FEATURES), 1)
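
For reference, `nn.Linear(4, 1)` holds a weight matrix and a bias vector and computes `X @ W.T + b`. A short sketch with a hypothetical `layer` and random inputs that checks this by hand:

```python
import torch as pt
from torch import nn

pt.manual_seed(0)
layer = nn.Linear(4, 1)  # 4 input features, 1 output

# Parameters: a (1, 4) weight matrix and a length-1 bias vector.
print(layer.weight.shape)  # torch.Size([1, 4])
print(layer.bias.shape)    # torch.Size([1])

# The layer computes X @ W.T + b for a batch of rows.
X_demo = pt.rand(3, 4)
manual = X_demo @ layer.weight.T + layer.bias
print(pt.allclose(layer(X_demo), manual))  # True
```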


## Create an instance of the `AdamW` optimizer for the model

In [15]:
optimizer = pt.optim.AdamW(w.parameters())

## Declare your `forward`, `loss` and `metric` functions

* **hint:** if you are tired of computing MSE by hand, you can use `nn.functional.mse_loss` instead.

In [16]:
def forward(X):
  return w(X)

def loss(y_pred, y):
  mse = nn.functional.mse_loss(y_pred, y)
  return mse, mse.sqrt()
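
To see what `nn.functional.mse_loss` replaces, here is a small sketch with hypothetical values that computes the MSE by hand and compares it to the built-in function:

```python
import torch as pt
from torch import nn

y_true = pt.tensor([[4.0], [10.0], [6.0]])
y_hat = pt.tensor([[5.0], [8.0], [6.0]])

# MSE by hand: mean of the squared errors, here (1 + 4 + 0) / 3.
mse_manual = ((y_hat - y_true) ** 2).mean()
mse_builtin = nn.functional.mse_loss(y_hat, y_true)
rmse = mse_builtin.sqrt()

print(mse_manual.item())
print(pt.allclose(mse_manual, mse_builtin))  # True
```

RMSE is in the same units as the target (here, the fare amount), which makes it easier to interpret than MSE.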

## Iterate over the batches returned by your `DataLoader` instance

For every step of gradient descent, print out the MSE, RMSE, and the batch index.

* **hint:** you can use Python's `enumerate` on an iterable
* **hint:** each batch returned through `enumerate` has the same layout as your `TensorDataset` instance: `y` first, then `X`

In [17]:
for batch_idx, batch in enumerate(train_dl):
  y, X = batch
  y_pred = forward(X)
  mse, rmse = loss(y_pred, y)
  mse.backward()
  print("Loss: ", mse.item(), " RMSE: ", rmse.item(), " Batch Idx: ", batch_idx)
  optimizer.step()
  optimizer.zero_grad()


Loss:  162.40557078868719  RMSE:  12.743844427357358  Batch Idx:  0
Loss:  157.32902296721116  RMSE:  12.543086660276694  Batch Idx:  1
Loss:  151.71445317775908  RMSE:  12.317242109245035  Batch Idx:  2
Loss:  145.69940587968418  RMSE:  12.070600891409018  Batch Idx:  3
Loss:  141.17782961739232  RMSE:  11.881827705256137  Batch Idx:  4
Loss:  135.73619647110962  RMSE:  11.650587816548555  Batch Idx:  5
Loss:  131.65678235158947  RMSE:  11.47417894019391  Batch Idx:  6
Loss:  125.80096598686318  RMSE:  11.216102976830374  Batch Idx:  7
Loss:  120.78131545116773  RMSE:  10.990055297912187  Batch Idx:  8
Loss:  110.72372013145163  RMSE:  10.52253392161088  Batch Idx:  9
Loss:  101.43828374380561  RMSE:  10.071657447699739  Batch Idx:  10
Loss:  96.75772865869156  RMSE:  9.836550648407782  Batch Idx:  11
Loss:  93.41987616380972  RMSE:  9.665395809991939  Batch Idx:  12
Loss:  94.75716708797569  RMSE:  9.734329308584936  Batch Idx:  13
Loss:  94.82296206264344  RMSE:  9.737708255161655  

## Implement 10 epochs of gradient descent training

For every step of gradient descent, print out the MSE, RMSE, epoch index, and batch index.

* **hint:** you can call `enumerate(DataLoader)` repeatedly in a `for` loop

In [18]:
for epoch in range(10):
  for batch_idx, batch in enumerate(train_dl):
    y, X = batch
    y_pred = forward(X)
    mse, rmse = loss(y_pred, y)
    mse.backward()
    print(" Loss: ", mse.item(), " RMSE: ", rmse.item(), " Epoch: ", epoch, " Batch Idx: ", batch_idx)
    optimizer.step()  
    optimizer.zero_grad()

 Loss:  75.38039074442797  RMSE:  8.682188131135375  Epoch:  0  Batch Idx:  0
 Loss:  72.57649468776552  RMSE:  8.519183921466041  Epoch:  0  Batch Idx:  1
 Loss:  69.42255570641075  RMSE:  8.332019905545758  Epoch:  0  Batch Idx:  2
 Loss:  65.86260107421259  RMSE:  8.115577679636404  Epoch:  0  Batch Idx:  3
 Loss:  63.73823582508914  RMSE:  7.983622976136156  Epoch:  0  Batch Idx:  4
 Loss:  60.74078185801436  RMSE:  7.793637267541668  Epoch:  0  Batch Idx:  5
 Loss:  58.883556055362305  RMSE:  7.673562149051919  Epoch:  0  Batch Idx:  6
 Loss:  55.60594412265404  RMSE:  7.456939326738151  Epoch:  0  Batch Idx:  7
 Loss:  53.09937395815983  RMSE:  7.2869317245435905  Epoch:  0  Batch Idx:  8
 Loss:  47.83739762121358  RMSE:  6.916458459443936  Epoch:  0  Batch Idx:  9
 Loss:  42.71465067342297  RMSE:  6.535644625698598  Epoch:  0  Batch Idx:  10
 Loss:  40.854123091603796  RMSE:  6.391723014305595  Epoch:  0  Batch Idx:  11
 Loss:  39.20203050186332  RMSE:  6.261152489906576  Epoch:
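
Per-batch output gets long quickly. One possible variation, sketched here on synthetic data, is to report a row-weighted average MSE per epoch instead. The data, the names `demo_dl`, `model`, and `opt`, and the learning rate of 0.05 are all assumptions chosen for this toy example, not values from the notebook:

```python
import torch as pt
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

pt.manual_seed(0)

# Synthetic regression data (hypothetical): 256 rows, 4 features.
X_all = pt.rand(256, 4)
y_all = X_all @ pt.tensor([[1.0], [-2.0], [3.0], [0.5]]) + 1.0

demo_dl = DataLoader(TensorDataset(y_all, X_all), batch_size=64)
model = nn.Linear(4, 1)
opt = pt.optim.AdamW(model.parameters(), lr=0.05)  # lr chosen for the toy data

epoch_mse = []
for epoch in range(10):
    total, rows = 0.0, 0
    for y_b, X_b in demo_dl:
        mse = nn.functional.mse_loss(model(X_b), y_b)
        mse.backward()
        opt.step()
        opt.zero_grad()
        # Weight each batch by its row count so partial batches count fairly.
        total += mse.item() * len(y_b)
        rows += len(y_b)
    epoch_mse.append(total / rows)
    print(f"Epoch: {epoch}  MSE: {epoch_mse[-1]:.4f}  RMSE: {epoch_mse[-1] ** 0.5:.4f}")
```

The per-epoch average smooths out batch-to-batch noise, which makes it easier to see whether training is still improving.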

Copyright 2021 CounterFactual.AI LLC. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.