# **HW1: Regression**
In *assignment 1*, you need to finish:

1.  Basic Part: Implement two regression models to predict the systolic blood pressure (SBP) of a patient. You will need to implement **both Matrix Inversion and Gradient Descent**.


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement one regression model to predict the SBP of multiple patients in a different way than the basic part. You can choose **either** of the two methods for this part.

# **1. Basic Part (55%)**
In the first part, you need to implement regression to predict SBP from the given diastolic blood pressure (DBP).


## 1.1 Matrix Inversion Method (25%)


*   Save the prediction result in a csv file **hw1_basic_mi.csv**
*   Print your coefficient


### *Import Packages*

> Note: You **cannot** import any other package

In [25]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

In [26]:
# from google.colab import drive
# drive.mount('/content/drive')

### *Global attributes*
Define the global attributes

In [27]:
training_dataroot = 'hw1_basic_training.csv' # Training data file, named 'hw1_basic_training.csv'
testing_dataroot = 'hw1_basic_testing.csv'   # Testing data file, named 'hw1_basic_testing.csv'
output_dataroot = 'hw1_basic_mi.csv' # Output file will be named 'hw1_basic_mi.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 20 * 1 matrix and saved as numpy array
                      # The format of each row should be ['sbp']

You can add your own global attributes here


In [28]:
def Mape(actual, predicted):
  ape = np.abs((actual - predicted) / np.maximum(np.abs(actual), 1e-7)) * 100
  mape = np.mean(ape)
  return round(mape, 4)

training_dataset = [] # dataset for DBP, i.e. features
validation_dataset = [] # dataset for SBP, i.e. labels
testing_dataset = []
weights = [] # computed weights from matrix inversion

training_mode = True

### *Load the Input File*
First, load the basic input files **hw1_basic_training.csv** and **hw1_basic_testing.csv**.

The input data is stored in *training_datalist* and *testing_datalist*.

In [29]:
# Read input csv to datalist
with open(training_dataroot, newline='') as csvfile:
  training_datalist = np.array(list(csv.reader(csvfile)))

with open(testing_dataroot, newline='') as csvfile:
  testing_datalist = np.array(list(csv.reader(csvfile)))

### *Implement the Regression Model*

> Note: It is recommended to use the functions we defined, but you can also define your own functions


#### Step 1: Split Data
Split data in *training_datalist* into training dataset and validation dataset
* Validation dataset is used to validate your own model without the testing data
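
A hold-out split is one common way to obtain a validation set. The sketch below is only an illustration under stated assumptions: the helper name `holdout_split`, the 80/20 ratio, the fixed seed, and the toy DBP/SBP values are all hypothetical and not part of the assignment template.

```python
import numpy as np

def holdout_split(features, labels, validation_ratio=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for validation."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))  # random row order
    num_validation = int(len(features) * validation_ratio)
    val_idx, train_idx = order[:num_validation], order[num_validation:]
    return features[train_idx], labels[train_idx], features[val_idx], labels[val_idx]

# Toy DBP/SBP pairs, made up for illustration only.
dbp = np.array([60, 65, 70, 75, 80, 85, 90, 95, 100, 105])
sbp = np.array([100, 108, 112, 118, 125, 131, 138, 142, 150, 155])

x_train, y_train, x_val, y_val = holdout_split(dbp, sbp)
print(x_train.shape, x_val.shape)
```

Fitting on `x_train` and computing MAPE on `x_val` gives an error estimate on rows the model has not seen, which a score computed on the training rows cannot provide.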



In [30]:
def SplitData():
  _training = []
  _validation = []
  _testing = []

  for training_data in training_datalist[1:]:
    _training.append(int(training_data[0]))
    _validation.append(int(training_data[1]))

  for testing_data in testing_datalist[1:]:
    _testing.append(int(testing_data[0]))

  return np.array(_training), np.array(_validation), np.array(_testing)

#### Step 2: Preprocess Data
Handle the unreasonable data
> Hint: Outliers and missing data can be handled by removing the affected rows or by filling in values derived from statistics (for example, the mean or median)
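
As one possible reading of the hint, out-of-range readings can be replaced with the median of the in-range readings instead of a fixed constant. This is a minimal sketch, assuming a hypothetical helper `replace_outliers_with_median` and made-up sample values; the 20 to 240 bounds are borrowed from the range check used in this notebook.

```python
import numpy as np

def replace_outliers_with_median(values, lower, upper):
    """Replace values outside [lower, upper] (and NaNs) with the median of the valid values."""
    values = np.asarray(values, dtype=float).copy()
    valid = (values >= lower) & (values <= upper)  # NaN compares as False, so it is treated as invalid
    median = np.median(values[valid])
    values[~valid] = median
    return values

# Made-up DBP readings containing two implausible entries (0 and 999).
dbp_raw = np.array([70, 82, 0, 75, 999, 68])
dbp_clean = replace_outliers_with_median(dbp_raw, lower=20, upper=240)
print(dbp_clean)
```

Using the median of the valid readings keeps the replacement value tied to the data at hand, whereas a hard-coded constant has to be chosen by hand.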

In [31]:
def PreprocessData():
  # Typical ranges are roughly dbp: 60~80, sbp: 90~120.
  # SplitData() already casts every value to int, so only range checks are needed here.
  num_data = training_dataset.size
  for i, dbp, sbp in zip(range(num_data), training_dataset, validation_dataset):
    # replace physiologically implausible readings with a typical value
    if dbp < 20 or dbp > 240:
      training_dataset[i] = 70
    if sbp < 30 or sbp > 360:
      validation_dataset[i] = 105

#### Step 3: Implement Regression
> use Matrix Inversion to finish this part
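
For reference, the matrix inversion approach solves the least-squares problem in closed form: with design matrix X (a column of ones next to the feature column) and labels y, the weights satisfy (XᵀX) w = Xᵀy. The sketch below uses made-up data lying exactly on the line y = 50 + x and compares the normal-equation solution with the pseudo-inverse; the variable names are illustrative assumptions.

```python
import numpy as np

# Made-up data lying exactly on y = 50 + 1 * x.
x = np.array([60.0, 70.0, 80.0, 90.0])
y = 50.0 + 1.0 * x

# Design matrix: first column is the bias term, second column is the feature.
X = np.column_stack((np.ones_like(x), x))

# Normal equations: solve (X^T X) w = X^T y without forming an explicit inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudo-inverse route; it also works when X^T X is singular.
w_pinv = np.linalg.pinv(X) @ y

print(w_normal, w_pinv)
```

Both routes should give the same weights `[w0, w1]` when XᵀX is invertible.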




In [32]:
def MatrixInversion():
  global weights
  # build the design matrix: a column of ones (bias term) next to the DBP column
  training_bias = np.vstack((np.ones_like(training_dataset), training_dataset)).T
  # compute the Moore-Penrose pseudo-inverse of the design matrix
  training_bias_inverse = np.linalg.pinv(training_bias)
  # least-squares weights: [w0 (intercept), w1 (slope)]
  weights = training_bias_inverse.dot(validation_dataset)

#### Step 4: Make Prediction
Predict on the testing dataset and store the values in *output_datalist*.
The final *output_datalist* should look something like this:
> [ [100], [80], ... , [90] ] where each row contains the predicted SBP

In [33]:
def MakePrediction(testing_dataset):
  testing_bias = np.vstack((np.ones_like(testing_dataset), testing_dataset)).T
  predictions = testing_bias.dot(weights)
  predictions = (np.round(predictions, decimals=4)).astype(float)

  return predictions

#### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be:
```
3 2 1
```
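
The example output lists the highest-degree coefficient first. A small sketch of printing in that order, assuming (hypothetically) that the weights are stored lowest degree first as `[w0, w1, w2]`:

```python
import numpy as np

# Hypothetical weights stored lowest degree first: 1 + 2x + 3x^2.
weights = np.array([1.0, 2.0, 3.0])

# Reverse the array so the highest-degree coefficient is printed first.
highest_first = weights[::-1]
print(*highest_first)
```

If the weights are already stored highest degree first, the reversal is not needed.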





In [34]:
training_dataset, validation_dataset, testing_dataset = SplitData()
PreprocessData()
MatrixInversion() # compute weights
print(*weights) # prints w0 (intercept) first, then w1 (slope)
output_datalist = MakePrediction(testing_dataset)

if training_mode:
  validations = MakePrediction(training_dataset)
  mape = Mape(validation_dataset, validations)
  print(f'Validations Mape: {mape}')


47.49053479094964 0.9927147813478133
Validations Mape: 5.5434


### *Write the Output File*
Write the prediction to output csv
> Format: 'sbp'




In [35]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow([row])

## 1.2 Gradient Descent Method (30%)


*   Save the prediction result in a csv file **hw1_basic_gd.csv**
*   Output your coefficient update in a csv file **hw1_basic_coefficient.csv**
*   Print your coefficient





### *Global attributes*

In [36]:
output_dataroot = 'hw1_basic_gd.csv' # Output file will be named 'hw1_basic_gd.csv'
coefficient_output_dataroot = 'hw1_basic_coefficient.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 20 * 1 matrix and saved as numpy array
                      # The format of each row should be ['sbp']

coefficient_output = [] # Your coefficient updates during gradient descent
                   # Should be a (number of iterations * number of coefficients) matrix
                   # The format of each row should be ['w0', 'w1', ...., 'wn']

Your own global attributes

In [37]:
# computed weights and bias from gradient descent
trained_weights = []
trained_bias = 0

### *Implement the Regression Model*


#### Step 1: Split Data

In [38]:
def SplitData():
  _training = []
  _validation = []
  _testing = []

  for training_data in training_datalist[1:]:
    _training.append(int(training_data[0]))
    _validation.append(int(training_data[1]))

  for testing_data in testing_datalist[1:]:
    _testing.append(int(testing_data[0]))

  return np.array(_training), np.array(_validation), np.array(_testing)

#### Step 2: Preprocess Data

In [39]:
def PreprocessData():
  global training_dataset, testing_dataset
  # the values are already integers after SplitData(), so only range checks are needed
  num_data = training_dataset.size
  for i, dbp, sbp in zip(range(num_data), training_dataset, validation_dataset):
    # replace physiologically implausible readings with a typical value
    if dbp < 20 or dbp > 240:
      training_dataset[i] = 70
    if sbp < 30 or sbp > 360:
      validation_dataset[i] = 105

  training_dataset = training_dataset.reshape(-1, 1)
  testing_dataset = testing_dataset.reshape(-1, 1)

#### Step 3: Implement Regression
> use Gradient Descent to finish this part
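
Gradient descent on raw blood-pressure values tends to converge slowly because the feature and the bias term live on very different scales. One common remedy, shown here only as a sketch under stated assumptions, is to standardize the feature, run gradient descent, and convert the result back to the original scale. The function name, learning rate, iteration count, and toy data are all hypothetical.

```python
import numpy as np

def gradient_descent_standardized(x, y, learning_rate=0.1, num_iterations=2000):
    """Fit y = slope * x + intercept by gradient descent on a standardized feature."""
    mean, std = x.mean(), x.std()
    z = (x - mean) / std  # zero mean, unit variance
    weight, bias = 0.0, 0.0
    for _ in range(num_iterations):
        errors = weight * z + bias - y
        # both gradients use the errors computed before either parameter is updated
        weight -= learning_rate * np.mean(errors * z)
        bias -= learning_rate * np.mean(errors)
    # map the parameters back to the original feature scale
    slope = weight / std
    intercept = bias - slope * mean
    return slope, intercept

# Made-up data lying exactly on y = 50 + 1 * x.
x = np.array([60.0, 70.0, 80.0, 90.0, 100.0])
y = 50.0 + 1.0 * x
slope, intercept = gradient_descent_standardized(x, y)
print(slope, intercept)
```

With a standardized feature the two updates decouple, so a much larger learning rate can be used than on the raw values.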

In [40]:
def GradientDescent(learning_rate, num_iterations):
  # derive the sizes from the data instead of hard-coding them
  num_samples, num_features = training_dataset.shape
  weights = np.zeros((num_features,))
  bias = 0
  coefficients = []

  for i in range(num_iterations):
    predictions = np.dot(training_dataset, weights) + bias

    errors = predictions - validation_dataset

    gradient_weights = (1 / num_samples) * np.dot(training_dataset.T, errors)
    gradient_bias = (1 / num_samples) * np.sum(errors)

    weights -= learning_rate * gradient_weights
    bias -= learning_rate * gradient_bias

    # store the weight after each update (the bias is tracked separately)
    # weights has shape (1,), so weights[0] extracts the scalar value
    coefficients.append(weights[0])

  return weights, bias, coefficients

#### Step 4: Make Prediction

Predict on the testing dataset and store the values in *output_datalist*.
The final *output_datalist* should look something like this:
> [ [100], [80], ... , [90] ] where each row contains the predicted SBP

Remember to also store your coefficient updates in *coefficient_output*.
The final *coefficient_output* should look something like this:
> [ [1, 0, 3, 5], ... , [0.1, 0.3, 0.2, 0.5] ] where each row contains the [w0, w1, ..., wn] of your coefficients
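
One way to build a history in that shape is to append the full `[w0, w1]` pair after every update. The sketch below is an illustration on made-up data, not the notebook's own function; `gradient_descent_with_history` and its arguments are hypothetical names.

```python
import numpy as np

def gradient_descent_with_history(x, y, learning_rate, num_iterations):
    """Run gradient descent and record [w0, w1] after every iteration."""
    w0, w1 = 0.0, 0.0
    history = []
    for _ in range(num_iterations):
        errors = w0 + w1 * x - y
        # both gradients use the errors computed before either parameter changes
        w0 -= learning_rate * np.mean(errors)
        w1 -= learning_rate * np.mean(errors * x)
        history.append([w0, w1])
    return np.array(history)

# Made-up data lying on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
coefficient_history = gradient_descent_with_history(x, y, learning_rate=0.1, num_iterations=5)
print(coefficient_history.shape)
```

The result has one row per iteration and one column per coefficient, matching the layout described above.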





In [41]:
def MakePrediction(testing_dataset):
  predictions = np.dot(testing_dataset, trained_weights) + trained_bias
  predictions = (np.round(predictions, decimals=4)).astype(float)

  if training_mode:
    print(f'dataset shape: {testing_dataset.shape}')
    print(f'weights shape: {trained_weights.shape}')
    print(f'predictions shape: {predictions.shape}')

  return predictions

#### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be:
```
3 2 1
```



In [42]:
learning_rate = 0.0001
num_iterations = 50000

PreprocessData()
trained_weights, trained_bias, coefficient_output = GradientDescent(learning_rate, num_iterations)
print(*trained_weights) # prints only the weight w1; the intercept w0 is held in trained_bias
output_datalist = MakePrediction(testing_dataset)

if training_mode:
  validations = MakePrediction(training_dataset)
  mape = Mape(validation_dataset, validations)
  print(f'Validations Mape: {mape}')

1.4964836443931884
dataset shape: (20, 1)
weights shape: (1,)
predictions shape: (20,)
dataset shape: (373, 1)
weights shape: (1,)
predictions shape: (373,)
Validations Mape: 6.859


### *Write the Output File*

Write the prediction to output csv
> Format: 'sbp'

**Write the coefficient update to csv**
> Format: 'w0', 'w1', ..., 'wn'
>*   The number of columns is based on your number of coefficients
>*   The number of rows is based on your number of iterations
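
When each history entry is itself a list of coefficients, `csv.writer.writerows` writes one CSV row per iteration with one column per coefficient. The sketch below writes to an in-memory buffer so it can run anywhere; `io.StringIO` is used purely for the demonstration and stands in for a real file opened with `open(path, 'w', newline='')`. The history values are made up.

```python
import csv
import io

# Made-up coefficient history: one [w0, w1] row per iteration.
history = [[0.0, 0.0], [0.5, 0.1], [0.9, 0.2]]

buffer = io.StringIO()  # stand-in for a file handle, demonstration only
writer = csv.writer(buffer)
writer.writerows(history)  # each inner list becomes one comma-separated row

csv_text = buffer.getvalue()
print(csv_text)
```

Passing a whole list to `writerow` (or using `writerows`) yields separate columns, whereas wrapping a list in another list would place it in a single cell.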

In [43]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow([row])

with open(coefficient_output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in coefficient_output:
    writer.writerow([row])

# **2. Advanced Part (40%)**
In the second part, you need to implement regression in a different way than in the basic part to predict the SBP of multiple patients.

You can choose **either** Matrix Inversion or Gradient Descent method.

The training data will be in **hw1_advanced_training.csv** and the testing data will be in **hw1_advanced_testing.csv**.

Output your prediction in **hw1_advanced.csv**

Notice:
> You cannot import any package other than those given
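
Since either method is allowed, the closed-form approach extends directly to several features: the design matrix simply gains one column per feature. The following is a minimal sketch on synthetic, noiseless data; the four random feature columns and the chosen "true" weights are assumptions for illustration and are unrelated to the real dataset.

```python
import numpy as np

# Synthetic data: 50 rows, 4 features, generated from known weights without noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 4))
true_weights = np.array([1.0, -2.0, 0.5, 3.0])
labels = 10.0 + features @ true_weights

# Design matrix: a leading column of ones for the bias, then the feature columns.
design = np.column_stack((np.ones(len(features)), features))

# Least-squares coefficients [w0, w1, w2, w3, w4] via the pseudo-inverse.
coefficients = np.linalg.pinv(design) @ labels
print(coefficients)
```

On noiseless data the recovered coefficients should match the generating values, which makes this a convenient sanity check before running on real measurements.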



### Input the training and testing dataset

In [44]:
training_dataroot_adv = 'hw1_advanced_training.csv' # Training data file, named 'hw1_advanced_training.csv'
testing_dataroot_adv = 'hw1_advanced_testing.csv'   # Testing data file, named 'hw1_advanced_testing.csv'
output_dataroot_adv = 'hw1_advanced.csv' # Output file will be named 'hw1_advanced.csv'

training_datalist_adv =  [] # Training datalist, saved as numpy array
testing_datalist_adv =  [] # Testing datalist, saved as numpy array

output_datalist_adv =  [] # Your prediction, should be 220 * 1 matrix and saved as numpy array
                      # The format of each row should be ['sbp']

### Your Implementation

In [45]:
training_dataroot_adv = 'hw1_advanced_training.csv' # Training data file, named 'hw1_advanced_training.csv'
testing_dataroot_adv = 'hw1_advanced_testing.csv'   # Testing data file, named 'hw1_advanced_testing.csv'
output_dataroot_adv = 'hw1_advanced.csv' # Output file will be named 'hw1_advanced.csv'

training_datalist_adv =  [] # Training datalist, saved as numpy array
testing_datalist_adv =  [] # Testing datalist, saved as numpy array

output_datalist_adv =  [] # Your prediction, should be 220 * 1 matrix and saved as numpy array
                      # The format of each row should be ['sbp']

# global attributes
training_dataset_adv = [] # dataset for all features

validation_dataset_adv = [] # dataset for known labels
testing_dataset_adv = []
trained_weights_adv = []
trained_bias_adv = 0

# Read input csv to datalist
with open(training_dataroot_adv, newline='') as csvfile:
  training_datalist_adv = np.array(list(csv.reader(csvfile)))

with open(testing_dataroot_adv, newline='') as csvfile:
  testing_datalist_adv = np.array(list(csv.reader(csvfile)))


# split data
def SplitData_adv():
  _training = []
  _validation = []
  _testing = []

  for training_data in training_datalist_adv[1:]:
    _training.append(training_data[2:6])
    _validation.append(int(training_data[6]))

  for testing_data in testing_datalist_adv[1:]:
    _testing.append(testing_data[2:6])

  return np.array(_training), np.array(_validation), np.array(_testing)

# preprocess data
def PreprocessData_adv():
  global training_dataset_adv, validation_dataset_adv, testing_dataset_adv

  num_samples, num_features = training_dataset_adv.shape
  for i in range(num_samples):
    for j in range(num_features):
      if training_dataset_adv[i][j] == '':
        training_dataset_adv[i][j] = -1

  training_dataset_adv = training_dataset_adv.astype(float)
  validation_dataset_adv = validation_dataset_adv.astype(float)
  testing_dataset_adv = testing_dataset_adv.astype(float)

  for i in range(num_samples):
    # temperature
    if training_dataset_adv[i][0] < 80 or training_dataset_adv[i][0] > 130:
      training_dataset_adv[i][0] = 98
    # heartrate
    if training_dataset_adv[i][1] < 40 or training_dataset_adv[i][1] > 180:
      training_dataset_adv[i][1] = 80
    # resprate
    if training_dataset_adv[i][2] < 5 or training_dataset_adv[i][2] > 40:
      training_dataset_adv[i][2] = 16
    # o2sat
    if training_dataset_adv[i][3] < 70 or training_dataset_adv[i][3] > 100:
      training_dataset_adv[i][3] = 95

    if validation_dataset_adv[i] < 30 or validation_dataset_adv[i] > 360:
      validation_dataset_adv[i] = 105

  training_dataset_adv = training_dataset_adv.reshape(-1, 4)
  testing_dataset_adv = testing_dataset_adv.reshape(-1, 4)


# implement regression
def GradientDescent_adv(learning_rate_adv, num_iterations):
  num_samples, num_features = training_dataset_adv.shape
  weights = np.zeros((num_features,))
  bias = 0

  for _ in range(num_iterations):
    predictions = np.dot(training_dataset_adv, weights) + bias

    errors = predictions - validation_dataset_adv

    gradient_weights = (1 / num_samples) * np.dot(training_dataset_adv.T, errors)
    gradient_bias = (1 / num_samples) * np.sum(errors)

    weights -= learning_rate_adv * gradient_weights
    bias -= learning_rate_adv * gradient_bias

  return weights, bias

# make predictions
def MakePrediction_adv(testing_dataset_adv):
  predictions = np.dot(testing_dataset_adv, trained_weights_adv) + trained_bias_adv
  predictions = (np.round(predictions, decimals=4)).astype(float)

  if training_mode:
    print(f'testing dataset shape: {testing_dataset_adv.shape}')
    print(f'weights shape: {trained_weights_adv.shape}')
    print(f'predictions shape: {predictions.shape}')

  return predictions


# train model and generate results
learning_rate_adv = 0.00001
num_iteration_adv = 100000

training_dataset_adv, validation_dataset_adv, testing_dataset_adv = SplitData_adv()
PreprocessData_adv()
trained_weights_adv, trained_bias_adv = GradientDescent_adv(learning_rate_adv, num_iteration_adv)
output_datalist_adv = MakePrediction_adv(testing_dataset_adv)

if training_mode:
  validations = MakePrediction_adv(training_dataset_adv)
  mape = Mape(validation_dataset_adv, validations)
  print(f'Validations Mape: {mape}')

testing dataset shape: (220, 4)
weights shape: (4,)
predictions shape: (220,)
testing dataset shape: (5696, 4)
weights shape: (4,)
predictions shape: (5696,)
Validations Mape: 13.5698


### Output your Prediction

> your filename should be **hw1_advanced.csv**

In [46]:
# write the advanced predictions to hw1_advanced.csv (not the basic-part output file)
with open(output_dataroot_adv, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist_adv:
    writer.writerow([row])

# Report *(5%)*

Report should be submitted as a pdf file **hw1_report.pdf**

*   Briefly describe the difficulties you encountered
*   Summarize your work and your reflections
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)