CS544: Introduction to Big Data Systems (Spring 2024)

Student: Anais Corona Perez

# P2: Predicting COVID Deaths with PyTorch

## Imports

In [1]:
import numpy as np
import pandas as pd
import torch

## Part 1: Setup
Build the Dockerfile we give you (feel free to make edits if you like) to create your environment. Run the container, set up an SSH tunnel, and open JupyterLab in your browser. Create a notebook called p2.ipynb in the nb directory.

In [2]:
# Import Data
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

# Setup Training and Testing Datasets
# Labels are the last column; reshape(-1, 1) infers the row count instead of hardcoding it
trainY = torch.tensor(train.iloc[:, -1].values, dtype = torch.float64).reshape(-1, 1)
trainX = torch.tensor(train.iloc[:, 0:10].values, dtype = torch.float64)

testY = torch.tensor(test.iloc[:, -1].values, dtype = torch.float64).reshape(-1, 1)
testX = torch.tensor(test.iloc[:, 0:10].values, dtype = torch.float64)

**Q1: about how many bytes does trainX consume?**

In [3]:
#q1
int(trainX.shape[0]*trainX.shape[1]*64/8)

83520
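
The same figure can be read from the tensor itself instead of being worked out from the bit width by hand. A minimal sketch, assuming a stand-in tensor with the same shape and dtype as `trainX` (the name `demo` is illustrative only):

```python
import torch

# Stand-in for trainX: 1044 rows x 10 columns of float64 (shape assumed from the cells above)
demo = torch.zeros(1044, 10, dtype=torch.float64)

# element_size() is the bytes per element; nelement() is the total element count
n_bytes = demo.element_size() * demo.nelement()

# The same data stored as float16 uses 2 bytes per element instead of 8
half = demo.to(torch.float16)
n_bytes_half = half.element_size() * half.nelement()

print(n_bytes, n_bytes_half)
```

This counts only the raw element storage, which is why the question asks "about" how many bytes.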

**Q2: what is the biggest difference we would have in any one cell if we used float16 instead of float64?**

In [4]:
#q2

# Convert trainX to float16 --- then to float64
converted_mat = trainX.to(dtype = torch.float16).to(dtype = torch.float64)

# Subtract converted matrix from original
torch.max(torch.abs(trainX - converted_mat)).item()

0.0
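
A difference of exactly 0.0 is plausible if every cell holds a whole-number count small enough for float16 to store exactly (an assumption about the data, not something verified here). float16 has an 11-bit significand, so integers up to 2048 survive the round trip, while larger ones may not. A small sketch with made-up values:

```python
import torch

# Hypothetical cell values: whole numbers around the float16 exact-integer limit
vals = torch.tensor([0.0, 17.0, 2048.0, 2049.0], dtype=torch.float64)

# Same round trip as in the cell above: float64 -> float16 -> float64
round_trip = vals.to(torch.float16).to(torch.float64)

# Per-cell absolute error introduced by the float16 step
err = torch.abs(vals - round_trip)
print(err.tolist())
```

Only the last value, which lies just past the exact-integer range, picks up an error.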

**Q3: is a CUDA GPU available on your VM?**

In [5]:
#q3
torch.cuda.is_available()

False
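
Since no GPU is available here, everything in this notebook runs on the CPU. A common pattern, sketched below with an illustrative `device` variable, is to choose the device once and create tensors on it so the same code runs either way:

```python
import torch

# Fall back to the CPU when no CUDA device is present
device = "cuda" if torch.cuda.is_available() else "cpu"

# Tensors can be created directly on the chosen device
t = torch.ones(3, dtype=torch.float64, device=device)
print(t.device.type)
```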

## Part 2: Prediction with Hardcoded Model

Let's predict the number of COVID deaths in the test dataset under the assumption that the death rate is 0.004 for those under 60 and 0.03 for those 60 and older. Encode these assumptions as coefficients in a tensor by pasting the following:

In [6]:
coef = torch.tensor([
        [0.0040],
        [0.0040],
        [0.0040],
        [0.0040],
        [0.0040],
        [0.0040], # POS_50_59_CP
        [0.0300], # POS_60_69_CP
        [0.0300],
        [0.0300],
        [0.0300]
], dtype=trainX.dtype)

**Q4: what is the predicted number of deaths for the first census tract?**

In [7]:
#q4
(testX[0]@coef).item()

9.844

**Q5: what is the average number of predicted deaths, over the whole testX dataset?**

In [8]:
#q5
torch.mean(torch.matmul(testX, coef)).item()

12.073632183908048
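
To make the matrix product concrete, here is a toy sketch with two invented rows and the same ten-column layout as above (six under-60 columns followed by four 60-and-over columns). The numbers are illustrative and not taken from `test.csv`.

```python
import torch

# Rates from the text: 0.004 for the six under-60 columns, 0.03 for the four 60+ columns
rates = torch.tensor([[0.004]] * 6 + [[0.03]] * 4, dtype=torch.float64)

# Two made-up tracts: positive-case counts per age bracket
toy_X = torch.tensor([
    [100.0] * 6 + [0.0] * 4,    # cases only under 60
    [0.0] * 6 + [100.0] * 4,    # cases only at 60 or older
], dtype=torch.float64)

# (2, 10) @ (10, 1) -> (2, 1): one predicted death count per tract
pred = toy_X @ rates
print(pred.squeeze(1).tolist())
```

Each prediction is a weighted sum of the row's case counts: 600 cases at 0.004 give about 2.4 deaths, and 400 cases at 0.03 give about 12.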

## Part 3: Optimization

Let's say `y = x^2 - 8x + 19`. We want to find the `x` value that minimizes `y`.

**Q6: first, what is y when x is a tensor containing 0.0?**

In [9]:
#q6
x = torch.tensor(0.0, requires_grad = True)
y = x**2 - 8*x + 19
float(y)

19.0

**Q7: what x value minimizes y?**

Write an optimization loop that uses `torch.optim.SGD`. You can experiment with the learning rate and number of iterations, as long as you find a setup that gets approximately the right answer.

In [10]:
#q7
optimizer = torch.optim.SGD([x], lr = 0.001)

for epoch in range(10000):
    y = x**2 - 8*x + 19
    y.backward()
    optimizer.step()
    optimizer.zero_grad()
x.item()

3.9999403953552246
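
For reference, the exact minimum follows from setting the derivative `2x - 8` to zero, which gives `x = 4`. The sketch below applies the update rule that plain SGD (no momentum) uses, written out by hand with ordinary Python floats; the names `x_val`, `lr`, and `steps` are illustrative and simply mirror the settings above.

```python
# dy/dx for y = x^2 - 8x + 19
def grad(x):
    return 2 * x - 8

x_val = 0.0      # same starting point as in Q6
lr = 0.001       # same learning rate as the optimizer above
steps = 10000    # same number of iterations

for _ in range(steps):
    # Gradient descent update: move against the gradient
    x_val -= lr * grad(x_val)

print(x_val)
```

Python floats are 64-bit, so this version can end up slightly closer to 4 than the float32 tensor used in the notebook.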

## Part 4: Linear Regression

Use the `torch.zeros` function to initialize a 2-dimensional `coef` matrix of size and type that allows us to compute `trainX @ coef` (we won't bother with a bias factor in this exercise).


**Q8: what is the MSE (mean-square error) when we make predictions using this vector of zero coefficients?**

You'll be comparing `trainX @ coef` to `trainY`

In [11]:
#q8
coef = torch.zeros(10, 1, dtype = trainX.dtype, requires_grad = True)
# Mean of the squared residuals, computed with tensor ops instead of Python's built-in sum
MSE = torch.mean((trainY - trainX @ coef) ** 2)
MSE.item()

197.8007662835249
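
With all-zero coefficients every prediction is 0, so the MSE reduces to the mean of the squared labels. A quick sketch with invented labels (the `toy_` names are illustrative), which also checks the hand-written formula against `torch.nn.MSELoss`:

```python
import torch

# Made-up labels and features; only the shapes matter here
toy_Y = torch.tensor([[1.0], [2.0], [3.0]], dtype=torch.float64)
toy_X = torch.ones(3, 4, dtype=torch.float64)
zero_coef = torch.zeros(4, 1, dtype=torch.float64)

# Hand-written mean-square error
manual = torch.mean((toy_Y - toy_X @ zero_coef) ** 2)

# Built-in equivalent (default reduction is the mean)
builtin = torch.nn.MSELoss()(toy_X @ zero_coef, toy_Y)

print(manual.item(), builtin.item())
```

Both values equal (1 + 4 + 9) / 3, the mean of the squared labels.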

**Optimization**

In [12]:
torch.manual_seed(544)

## Setup DataLoader ##
ds = torch.utils.data.TensorDataset(trainX, trainY)
dl = torch.utils.data.DataLoader(ds, batch_size=50, shuffle=True)

## Setup Functions ##
coef = torch.zeros(10, 1, dtype = trainX.dtype, requires_grad = True)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD([coef], lr = 0.000002)


## Train ##
for epoch in range(500):
    for batchX, batchY in dl:
        predictions = batchX@coef
        loss = loss_fn(predictions, batchY)
        loss.backward() # Compute gradients of the loss with respect to coef
        optimizer.step()
        optimizer.zero_grad()
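
One way to sanity-check a loop like this is to run it on synthetic data where the right answer is known. The sketch below is illustrative only: the data, `true_coef`, `fit_coef`, the learning rate, and the epoch count are all made up and unrelated to the COVID dataset.

```python
import torch

torch.manual_seed(0)

# Synthetic, noise-free data: Y = X @ true_coef
true_coef = torch.tensor([[1.0], [2.0], [3.0]], dtype=torch.float64)
X = torch.randn(200, 3, dtype=torch.float64)
Y = X @ true_coef

ds = torch.utils.data.TensorDataset(X, Y)
dl = torch.utils.data.DataLoader(ds, batch_size=50, shuffle=True)

fit_coef = torch.zeros(3, 1, dtype=torch.float64, requires_grad=True)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD([fit_coef], lr=0.1)

for epoch in range(20):
    for batchX, batchY in dl:
        loss = loss_fn(batchX @ fit_coef, batchY)
        loss.backward()        # fill fit_coef.grad
        optimizer.step()       # update fit_coef using the gradient
        optimizer.zero_grad()  # reset the gradient for the next batch

print(fit_coef.detach().squeeze(1).tolist())
```

If the loop is wired correctly, the fitted values should land close to `true_coef`. The real features are much larger in scale than these standard-normal inputs, which is one reason the notebook needs a far smaller learning rate.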

**Q9: what is the MSE over the training data, using the coefficients resulting from the above training?**

In [13]:
#q9
loss_fn(trainX @ coef, trainY).item()

26.8113940147193

**Q10: what is the MSE over the test data?**

In [14]:
#q10
loss_fn(testX @ coef, testY).item()

29.05854692548551