## Task: Predict number of bikers on a given day using linear regression

You are given a dataset about Seattle's Fremont Bridge as a CSV file.
Each row describes one day: the day of the week, whether it was a holiday, the hours of daylight, rainfall, temperature, and whether the day was dry (see the dataframe preview below for details). Each row also records how many bikers were observed crossing the bridge that day.

You are provided with the code to download and load the csv file.

Your task is to train a linear regression model that takes the features of a day (you may drop any columns you think you don't need) and predicts the number of bikers for that day.

In [1]:
from IPython.display import clear_output

In [2]:
# Don't modify this code


%pip install gdown==4.5


clear_output()

In [3]:
# Download the CSV file.
!gdown 1_eJU8Y-31_l0oq1sSJT6pROJyo-ufuvD

Downloading...
From: https://drive.google.com/uc?id=1_eJU8Y-31_l0oq1sSJT6pROJyo-ufuvD
To: /content/bikers_data.csv
100% 213k/213k [00:00<00:00, 124MB/s]


In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [5]:
data_df = pd.read_csv('bikers_data.csv')

In [6]:
data_df.head()

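| | Date | Number of bikers | Mon | Tue | Wed | Thu | Fri | Sat | Sun | holiday | daylight_hrs | Rainfall (in) | Temp (F) | dry day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-10-03 | 14084.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.277359 | 0.0 | 56.0 | 1 |
| 1 | 2012-10-04 | 13900.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.219142 | 0.0 | 56.5 | 1 |
| 2 | 2012-10-05 | 12592.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 11.161038 | 0.0 | 59.5 | 1 |
| 3 | 2012-10-06 | 8024.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.103056 | 0.0 | 60.5 | 1 |
| 4 | 2012-10-07 | 8568.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 11.045208 | 0.0 | 60.5 | 1 |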


In [7]:
data_y = data_df['Number of bikers'] # target
data_x = data_df.drop(['Number of bikers', 'Date'], axis=1) # input features

In [8]:
data_x.head()

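| | Mon | Tue | Wed | Thu | Fri | Sat | Sun | holiday | daylight_hrs | Rainfall (in) | Temp (F) | dry day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.277359 | 0.0 | 56.0 | 1 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.219142 | 0.0 | 56.5 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 11.161038 | 0.0 | 59.5 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.103056 | 0.0 | 60.5 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 11.045208 | 0.0 | 60.5 | 1 |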


In [9]:
data_y

0       14084.0
1       13900.0
2       12592.0
3        8024.0
4        8568.0
         ...   
2641     4552.0
2642     3352.0
2643     3692.0
2644     7212.0
2645     4568.0
Name: Number of bikers, Length: 2646, dtype: float64

In [10]:
X = data_x.values
y = data_y.values

In [11]:
# Prepend a column of ones so that the first coefficient acts as the intercept
ones = np.ones(X.shape[0])
X = np.column_stack((ones, X))
# Split the data: 80% for training, 20% for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [13]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim

X_train_tensor = torch.tensor(X_train.astype(np.float32))
y_train_tensor = torch.tensor(y_train.astype(np.float32))

train_set = torch.utils.data.TensorDataset(X_train_tensor,y_train_tensor)

X_test_tensor = torch.tensor(X_val.astype(np.float32))
y_test_tensor = torch.tensor(y_val.astype(np.float32))

test_set = torch.utils.data.TensorDataset(X_test_tensor,y_test_tensor)
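`Dataset` and `DataLoader` are imported above, but the training loop further down feeds the whole training set to the model in one go. As a minimal sketch of how mini-batches could be drawn instead, the example below wraps a `TensorDataset` in a `DataLoader`. The tensors, `demo_set`, `demo_loader` and the batch size of 4 are made up for illustration and are not part of the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up tensors standing in for X_train_tensor and y_train_tensor.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
targets = torch.arange(10, dtype=torch.float32)

demo_set = TensorDataset(features, targets)
demo_loader = DataLoader(demo_set, batch_size=4, shuffle=True)

batch_shapes = []
for X_batch, y_batch in demo_loader:
    # Each iteration yields one mini-batch; the last one may be smaller.
    batch_shapes.append((tuple(X_batch.shape), tuple(y_batch.shape)))

print(batch_shapes)
```

With mini-batches, each optimizer step uses only part of the data, which is what the "stochastic" in stochastic gradient descent usually refers to.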


In [14]:
# Closed-form least squares (normal equation): solve (X^T X) theta = X^T y
theta = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
theta

array([-144699.42869775,  142505.49236531,  142932.60973733,
        142880.7705828 ,  142311.65824937,  140966.10242066,
        135282.63327398,  135026.75862069,   -4901.9998505 ,
           398.70965478,   -2820.87512163,     181.90344125,
          2218.55243908])
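The very large coefficients of opposite sign (an intercept near -144,700 against day-of-week weights near +140,000) suggest that the design matrix is close to rank-deficient. If every row has exactly one of the `Mon` ... `Sun` flags set, those seven columns add up to the column of ones used as the intercept. The sketch below reproduces this on synthetic data (`X_demo`, `y_demo` and their values are invented, not taken from the bikers file) and uses `np.linalg.lstsq`, which returns a minimum-norm solution, as one possible alternative to `np.linalg.solve`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the real features: one-hot day of week plus one numeric column.
days = np.eye(7)[rng.integers(0, 7, size=n)]   # shape (n, 7), exactly one 1 per row
temp = rng.normal(60.0, 10.0, size=(n, 1))
X_demo = np.column_stack((np.ones(n), days, temp))  # intercept + 7 dummies + temp

# The seven dummy columns add up to the intercept column, so one column is redundant.
dummies_sum_to_one = np.allclose(days.sum(axis=1), 1.0)
rank = np.linalg.matrix_rank(X_demo)

# Made-up target: a per-day baseline plus a temperature effect and noise.
y_demo = (days @ np.arange(7.0)) * 100 + 5.0 * temp[:, 0] + rng.normal(0, 1, size=n)

# lstsq copes with the rank-deficient system and returns the minimum-norm solution.
theta_demo, _, lstsq_rank, _ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
demo_mse = np.mean((X_demo @ theta_demo - y_demo) ** 2)

print(f"columns: {X_demo.shape[1]}, rank: {rank}")
print(f"largest |coefficient|: {np.abs(theta_demo).max():.2f}, MSE: {demo_mse:.3f}")
```

Dropping one of the day columns (or the column of ones) is another common way to remove the redundancy. The predictions are the same in either case, only the individual coefficients change.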

In [15]:
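model = nn.Sequential(
    # One hidden layer with a ReLU makes this a small neural network rather than a purely linear model
    nn.Linear(X.shape[1], 30),  # input features (including the column of ones) -> 30 hidden units
    nn.ReLU(),
    nn.Linear(30, 1)  # hidden -> output; one unit because we predict a single number per day
)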

In [16]:
# Loss function: mean squared error
loss_fn = nn.MSELoss()
# Optimizer: stochastic gradient descent
# w = w - lr * gradient
optimizer = optim.SGD(model.parameters(), lr=1e-4)  # 1e-4 is 0.0001

num_epochs = 50

for epoch in range(num_epochs):
  # Zero the gradients
  optimizer.zero_grad()
  # Use the whole training set as a single batch
  X = train_set[:][0]  # train_set[rows][0 = features, 1 = targets]
  y = train_set[:][1]

  # prediction has shape (N, 1) but y has shape (N,), so MSELoss broadcasts them
  # to (N, N) and warns; model(X).squeeze(1) would compare them element-wise
  prediction = model(X)
  loss = loss_fn(prediction, y)
  loss.backward()  # Compute gradients
  optimizer.step()  # Update the weights: w = w - lr * gradient

  print(f"Epoch {epoch+1} loss: {loss.item()}")


  return F.mse_loss(input, target, reduction=self.reduction)


Epoch 1 loss: 151377136.0
Epoch 2 loss: 6051688960.0
Epoch 3 loss: 151646656.0
Epoch 4 loss: 151598064.0
Epoch 5 loss: 151549456.0
Epoch 6 loss: 151500912.0
Epoch 7 loss: 151452384.0
Epoch 8 loss: 151403872.0
Epoch 9 loss: 151355344.0
Epoch 10 loss: 151306864.0
Epoch 11 loss: 151258400.0
Epoch 12 loss: 151209968.0
Epoch 13 loss: 151161536.0
Epoch 14 loss: 151113120.0
Epoch 15 loss: 151064752.0
Epoch 16 loss: 151016384.0
Epoch 17 loss: 150968032.0
Epoch 18 loss: 150919696.0
Epoch 19 loss: 150871376.0
Epoch 20 loss: 150823120.0
Epoch 21 loss: 150774832.0
Epoch 22 loss: 150726592.0
Epoch 23 loss: 150678368.0
Epoch 24 loss: 150630144.0
Epoch 25 loss: 150581968.0
Epoch 26 loss: 150533792.0
Epoch 27 loss: 150485616.0
Epoch 28 loss: 150437504.0
Epoch 29 loss: 150389392.0
Epoch 30 loss: 150341280.0
Epoch 31 loss: 150293216.0
Epoch 32 loss: 150245152.0
Epoch 33 loss: 150197104.0
Epoch 34 loss: 150149072.0
Epoch 35 loss: 150101088.0
Epoch 36 loss: 150053104.0
Epoch 37 loss: 150005152.0
Epoch 38 
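The stray `return F.mse_loss(...)` line at the top of this output appears to be what is left of a PyTorch warning about mismatched sizes. The model returns predictions of shape `(N, 1)` while the targets have shape `(N,)`, so the two are broadcast to `(N, N)` before the squared differences are averaged. The sketch below shows the effect on three made-up values and one way to avoid it; the tensors are illustrative only.

```python
import warnings

import torch
import torch.nn as nn

loss_fn = nn.MSELoss()

# Made-up values: predictions shaped (N, 1), as returned by nn.Linear(..., 1),
# and targets shaped (N,), as produced from a 1-D NumPy array.
prediction = torch.tensor([[1.0], [2.0], [3.0]])
target = torch.tensor([1.0, 2.0, 3.0])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the size-mismatch warning for this demo
    # Broadcasting compares every prediction with every target: a 3 x 3 grid.
    broadcast_loss = loss_fn(prediction, target)

# Dropping the trailing dimension makes the comparison element-wise.
elementwise_loss = loss_fn(prediction.squeeze(1), target)

print(broadcast_loss.item(), elementwise_loss.item())
```

The predictions here match the targets exactly, yet the broadcast version still reports a non-zero loss, because it also counts the off-diagonal pairs.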

In [17]:
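def mean_squared_error(y, y_pred):
  return np.mean((y - y_pred) ** 2)


# Note: the targets have shape (N,) while the model output has shape (N, 1), so the
# subtraction broadcasts to (N, N); calling .squeeze(1) on the predictions would compare them element-wise
print(f"Validation loss: {mean_squared_error(test_set[:][1].detach().numpy(), model(test_set[:][0]).detach().numpy())}")
print(f"Train loss: {mean_squared_error(train_set[:][1].detach().numpy(), model(train_set[:][0]).detach().numpy())}")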

Validation loss: 144636016.0
Train loss: 149335696.0
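The model trained above contains a hidden layer and a ReLU and is fed unscaled features, so it is not a like-for-like gradient-descent version of linear regression. As a point of comparison, here is a minimal sketch of linear regression in PyTorch with a single `nn.Linear` layer, standardized inputs and targets, and matching tensor shapes. It runs on synthetic data, and names such as `linear_model`, `X_raw` and the learning rate of 0.1 are assumptions chosen for the illustration.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Synthetic data (not the bikers dataset): two features on different scales.
n = 500
X_raw = np.column_stack((rng.normal(60, 10, n), rng.normal(12, 2, n)))
y_raw = 150.0 * X_raw[:, 0] + 400.0 * X_raw[:, 1] + rng.normal(0, 50, n)

# Standardize inputs and target so a single learning rate suits every weight.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_std = (y_raw - y_raw.mean()) / y_raw.std()

X_t = torch.tensor(X_std, dtype=torch.float32)
y_t = torch.tensor(y_std, dtype=torch.float32)

# A single linear layer is linear regression; its bias replaces the column of ones.
linear_model = nn.Linear(X_t.shape[1], 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(linear_model.parameters(), lr=0.1)

num_epochs = 200
for _ in range(num_epochs):
    optimizer.zero_grad()
    loss = loss_fn(linear_model(X_t).squeeze(1), y_t)  # both have shape (n,)
    loss.backward()
    optimizer.step()

# Closed-form least-squares reference on the same standardized data.
A = np.column_stack((np.ones(n), X_std))
theta_ref, *_ = np.linalg.lstsq(A, y_std, rcond=None)
mse_ref = np.mean((A @ theta_ref - y_std) ** 2)

print(f"gradient descent MSE: {loss.item():.4f}, closed-form MSE: {mse_ref:.4f}")
```

On this toy problem the two losses should agree closely, which is a useful sanity check before moving to the real features.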


In [18]:
# Closed-form solution: theta comes from the normal equation, so no iterative training is needed
y_train_pred = X_train @ theta
def mean_squared_error(y, y_pred):
  return np.mean((y - y_pred) ** 2)

train = mean_squared_error(y_train,y_train_pred)
print(f"Train loss = {train}")

y_val_pred = X_val @ theta

val = mean_squared_error(y_val,y_val_pred)
print(f"Validation loss = {val}")

Train loss = 4955794.611579515
Validation loss = 4779102.002645696
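MSE is expressed in squared bikers, which is hard to interpret. Taking the square root gives the root mean squared error (RMSE) in the same unit as the target. For the closed-form losses printed above this comes to roughly 2,200 bikers per day. The short sketch below does the arithmetic; the two constants are copied from that output.

```python
import numpy as np

# MSE values copied from the closed-form output above.
train_mse = 4955794.611579515
val_mse = 4779102.002645696

# RMSE has the same unit as the target (bikers per day).
train_rmse = np.sqrt(train_mse)
val_rmse = np.sqrt(val_mse)

print(f"Train RMSE = {train_rmse:.1f} bikers")
print(f"Validation RMSE = {val_rmse:.1f} bikers")
```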
