# **HW1: Regression**
In *assignment 1*, you need to finish:

1.  Basic Part: Implement two regression models to predict the Systolic blood pressure (SBP) of a patient. You will need to implement **both Matrix Inversion and Gradient Descent**.


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement one regression model to predict the SBP of multiple patients in a different way than the basic part. You can choose **either** of the two methods for this part.

# **1. Basic Part (55%)**
In the first part, you need to implement regression to predict SBP from the given DBP.


## 1.1 Matrix Inversion Method (25%)


*   Save the prediction result in a csv file **hw1_basic_mi.csv**
*   Print your coefficient


### *Import Packages*

> Note: You **cannot** import any other package

In [57]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

### *Global attributes*
Define the global attributes

In [58]:
training_dataroot = 'hw1_basic_training.csv' # Training data file, named 'hw1_basic_training.csv'
testing_dataroot = 'hw1_basic_testing.csv'   # Testing data file, named 'hw1_basic_testing.csv'
output_dataroot = 'hw1_basic_mi.csv' # Output file will be named 'hw1_basic_mi.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be a 20 * 1 matrix saved as a numpy array
                      # The format of each row should be ['sbp']

You can add your own global attributes here


In [59]:
PDBasicTrainingData_IM = pd.read_csv(training_dataroot)
PDBasicTestingData_IM = pd.read_csv(testing_dataroot)

##Testing Data to Numpy Array
BasicTestingDataArray_IM = PDBasicTestingData_IM.to_numpy()

### *Load the Input File*
First, load the basic input file **hw1_basic_training.csv** and **hw1_basic_testing.csv**

Input data would be stored in *training_datalist* and *testing_datalist*

In [60]:
# Read input csv to datalist
with open(training_dataroot, newline='') as csvfile:
  training_datalist = np.array(list(csv.reader(csvfile)))

with open(testing_dataroot, newline='') as csvfile:
  testing_datalist = np.array(list(csv.reader(csvfile)))

### *Implement the Regression Model*

> Note: It is recommended to use the functions we defined, but you can also define your own functions


#### Step 1: Split Data
Split data in *training_datalist* into training dataset and validation dataset
* Validation dataset is used to validate your own model without the testing data



In [61]:
def SplitData(DataList):
  ##70% Training
  ##30% Validation
  LenDataList = len(DataList)
  LenTraining = math.floor(LenDataList*70/100)
  LenValidation = LenDataList - LenTraining
  TrainingList = DataList[:LenTraining]
  ValidationList = DataList[LenTraining:]

  return TrainingList, ValidationList
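
The split above takes the first 70% of the rows in file order. If the rows happen to be ordered (for example by time), a shuffled split may give a more representative validation set. The following is only a minimal sketch, assuming the data is a NumPy array; `shuffled_split`, `train_percent`, and `seed` are hypothetical names that are not part of the assignment template.

```python
import numpy as np

def shuffled_split(data, train_percent=70, seed=0):
    """Shuffle the rows, then split them into training and validation parts."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)        # fixed seed keeps the split reproducible
    order = rng.permutation(len(data))       # random ordering of the row indices
    n_train = len(data) * train_percent // 100
    return data[order[:n_train]], data[order[n_train:]]

# Toy data: 10 rows with two columns (hypothetical values)
demo = np.arange(20).reshape(10, 2)
train_part, valid_part = shuffled_split(demo)
print(train_part.shape, valid_part.shape)
```

Every row ends up in exactly one of the two parts, so the split sizes match those of `SplitData` for the same input length.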

#### Step 2: Preprocess Data
Handle the unreasonable data
> Hint: Outliers and missing data can be handled by removing the affected rows or by filling in values with the help of statistics

In [62]:
def PreprocessData(DataframeT):
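  ##Only run once: DataframeT_temp1 and DataframeT_temp2 both refer to DataframeT itself (no copy),
  ##so the drops below modify the caller's dataframe. Reload the original dataframe before running again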
  DataframeT_temp1 = DataframeT
  DataframeT_temp2 = DataframeT

  ##Using IQR Method
  ##DBP
  Q1_dbp = DataframeT_temp1['dbp'].quantile(0.25)
  Q3_dbp = DataframeT_temp1['dbp'].quantile(0.75)
  IQR_dbp = Q3_dbp-Q1_dbp
  LowerPart_dbp = Q1_dbp - 1.5*IQR_dbp
  UpperPart_dbp = Q3_dbp + 1.5*IQR_dbp

  ##Remove Outliers
  LowerData_dbp = np.where(DataframeT_temp1['dbp']<=LowerPart_dbp)
  DataframeT_temp1.drop(index=LowerData_dbp[0], inplace=True)
  DataframeT_temp1.reset_index(inplace=True, drop=True)
  UpperData_dbp = np.where(DataframeT_temp1['dbp']>=UpperPart_dbp)
  DataframeT_temp1.drop(index=UpperData_dbp[0], inplace=True)
  DataframeT_temp1.reset_index(inplace=True, drop=True)

  print("DBP")
  print(LowerPart_dbp)
  print(UpperPart_dbp)
  print(LowerData_dbp)
  print(UpperData_dbp)

  ##SBP
  Q1_sbp = DataframeT_temp2['sbp'].quantile(0.25)
  Q3_sbp = DataframeT_temp2['sbp'].quantile(0.75)
  IQR_sbp = Q3_sbp-Q1_sbp
  LowerPart_sbp = Q1_sbp - 1.5*IQR_sbp
  UpperPart_sbp = Q3_sbp + 1.5*IQR_sbp

  ##Remove Outliers
  LowerData_sbp = np.where(DataframeT_temp2['sbp']<=LowerPart_sbp)
  DataframeT_temp2.drop(index=LowerData_sbp[0], inplace=True)
  DataframeT_temp2.reset_index(inplace=True, drop=True)
  UpperData_sbp = np.where(DataframeT_temp2['sbp']>=UpperPart_sbp)
  DataframeT_temp2.drop(index=UpperData_sbp[0], inplace=True)
  DataframeT_temp2.reset_index(inplace=True, drop=True)

  print("SBP")
  print(LowerPart_sbp)
  print(UpperPart_sbp)
  print(LowerData_sbp)
  print(UpperData_sbp)

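  ##Both names refer to the same filtered dataframe, so this concat stacks it on itself;
  ##drop_duplicates then removes the second copy and any rows that repeat the same (dbp, sbp) pair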
  TrueDataframeT = pd.concat([DataframeT_temp1, DataframeT_temp2], ignore_index=True)
  TrueDataframeT = TrueDataframeT.drop_duplicates()
  # TrueDataframeT = DataframeT_temp

  print(len(TrueDataframeT))
  print(TrueDataframeT)
  # TrueDataframeT.to_csv("testing.csv")

  ##Dataframe to numpy array
  BasicTrainingDataArray = TrueDataframeT.to_numpy()

  # return TrueDataframeT
  return BasicTrainingDataArray
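
The IQR rule used above can also be written with a boolean mask, which leaves the input dataframe untouched and makes the function safe to call repeatedly. This is a sketch under assumptions, not the notebook's method: `remove_iqr_outliers` and `k` are hypothetical names, and the fences for every column are computed on the unfiltered data, which differs slightly from the sequential filtering in `PreprocessData`.

```python
import pandas as pd

def remove_iqr_outliers(df, columns, k=1.5):
    """Return a copy of df without rows on or outside the IQR fences of any listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # Strict inequalities also drop values exactly on a fence,
        # mirroring the <= and >= comparisons used in PreprocessData
        keep &= (df[col] > q1 - k * iqr) & (df[col] < q3 + k * iqr)
    return df[keep].reset_index(drop=True)

# Toy data (hypothetical values): one implausible dbp reading
demo = pd.DataFrame({"dbp": [70, 72, 75, 78, 80, 200],
                     "sbp": [110, 115, 118, 121, 125, 128]})
clean = remove_iqr_outliers(demo, ["dbp", "sbp"])
print(len(demo), len(clean))
```

Because the result is a new dataframe, `demo` still holds all six rows after the call.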

#### Step 3: Implement Regression
> use Matrix Inversion to finish this part




In [63]:
def MatrixInversion(Training_DS, Eval_DS):
  ##Y = XW + E
  X, Y = Training_DS[:,0], Training_DS[:,1]
  X_arr = np.vstack((np.ones(len(X)), X)).T ##Make X to be (nx2)

  ##Weight_IM = (X^T . X)^-1 . X^T . Y
  Weight_IM =np.linalg.inv(X_arr.T.dot(X_arr)).dot(X_arr.T).dot(Y)

  X_Eval, Y_Eval = Eval_DS[:,0], Eval_DS[:,1]
  Predic_Eval = X_Eval*Weight_IM[1]+Weight_IM[0]

  ## MAPE = 100/n * sum(|(Y_true - Y_pred) / Y_true|), with n = number of evaluation rows
  MAPE_IM = 100/len(Y_Eval)*np.sum(abs(np.divide(np.subtract(Y_Eval, Predic_Eval), Y_Eval)))

  return Weight_IM, MAPE_IM
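
As a cross-check on the explicit inverse, the same least-squares problem can be solved with `np.linalg.lstsq`, which avoids forming the inverse of X^T X. The sketch below is not a replacement for the required matrix inversion method; `fit_line_lstsq` and the toy data are hypothetical.

```python
import numpy as np

def fit_line_lstsq(x, y):
    """Least-squares fit of y = w0 + w1*x; returns array([w0, w1])."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack((np.ones_like(x), x))   # n x 2 matrix [1, x]
    weights, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    return weights

# Noise-free toy data generated from y = 50 + 1*x
x_demo = np.array([60.0, 70.0, 80.0, 90.0])
y_demo = 50.0 + 1.0 * x_demo
w_lstsq = fit_line_lstsq(x_demo, y_demo)

# Same fit through the explicit normal equation, as in MatrixInversion
design = np.column_stack((np.ones_like(x_demo), x_demo))
w_inverse = np.linalg.inv(design.T @ design) @ design.T @ y_demo
print(w_lstsq, w_inverse)
```

Both routes should agree up to floating-point error on well-conditioned data such as this.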

In [64]:
##Preprocess Data, Split Data, Then Matrix Inversion
BasicTrainingDataArray_IM = PreprocessData(PDBasicTrainingData_IM)
TrainingList_IM, ValidationList_IM = SplitData(BasicTrainingDataArray_IM)
Weight_IM, MAPE_IM = MatrixInversion(TrainingList_IM, ValidationList_IM)
Weight_IM_All, MAPE_IM_All = MatrixInversion(BasicTrainingDataArray_IM, BasicTrainingDataArray_IM)

DBP
47.0
119.0
(array([  2, 142]),)
(array([157]),)
SBP
87.5
171.5
(array([], dtype=int64),)
(array([244, 364, 365]),)
320
     dbp  sbp
0     92  142
1     86  119
2     60  112
3     85  122
4     77  119
..   ...  ...
361   98  150
362   98  166
363   94  147
364   72  119
366   94  154

[320 rows x 2 columns]


#### Step 4: Make Prediction
Predict the SBP for the testing dataset and store the values in *output_datalist*.
The final *output_datalist* should look something like this:
> [ [100], [80], ... , [90] ] where each row contains the predicted SBP

In [65]:
def MakePrediction(Target, Coeff):
  Prediction = Target*Coeff[1]+Coeff[0]
  return Prediction

##BasicTestingDataArray (TestingData)
X_Testing_IM = BasicTestingDataArray_IM[:,0]
Prediction_Testing_IM = MakePrediction(X_Testing_IM, Weight_IM_All)

##Ready for Output
ContentArrayCSV_IM = Prediction_Testing_IM.tolist()
df_tocsv_IM = {'sbp': ContentArrayCSV_IM}
df_tocsv_IM = pd.DataFrame(data = df_tocsv_IM)
csvcontent_prep_list_IM = df_tocsv_IM.values.tolist()
# output_datalist = [['sbp']] + csvcontent_prep_list_IM
output_datalist = csvcontent_prep_list_IM

#### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points would be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be:
```
3 2 1
```





In [66]:
print("Coefficients: " + str(Weight_IM_All[1]) + ' ' + str(Weight_IM_All[0]))
print("MAPE (Mean Absolute Percentage Error): ", MAPE_IM_All)

Coefficients: 0.9675189481717094 49.68242736458167
MAPE (Mean Absolute Percentage Error):  5.459864845434109


### *Write the Output File*
Write the prediction to output csv
> Format: 'sbp'




In [67]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

## 1.2 Gradient Descent Method (30%)


*   Save the prediction result in a csv file **hw1_basic_gd.csv**
*   Output your coefficient update in a csv file **hw1_basic_coefficient.csv**
*   Print your coefficient





### *Global attributes*

In [68]:
output_dataroot = 'hw1_basic_gd.csv' # Output file will be named 'hw1_basic_gd.csv'
coefficient_output_dataroot = 'hw1_basic_coefficient.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be a 20 * 1 matrix saved as a numpy array
                      # The format of each row should be ['sbp']

coefficient_output = [] # Your coefficient update during gradient descent
                   # Should be a (number of iterations * number_of coefficient) matrix
                   # The format of each row should be ['w0', 'w1', ...., 'wn']

Your own global attributes

In [69]:
PDBasicTrainingData_GD = pd.read_csv(training_dataroot)
PDBasicTestingData_GD = pd.read_csv(testing_dataroot)

##Testing Data to Numpy Array
BasicTestingDataArray_GD = PDBasicTestingData_GD.to_numpy()

### *Implement the Regression Model*


#### Step 1: Split Data

In [70]:
def SplitData(DataList):
  ##70% Training
  ##30% Validation
  LenDataList = len(DataList)
  LenTraining = math.floor(LenDataList*70/100)
  LenValidation = LenDataList - LenTraining
  TrainingList = DataList[:LenTraining]
  ValidationList = DataList[LenTraining:]

  return TrainingList, ValidationList

#### Step 2: Preprocess Data

In [71]:
def PreprocessData(DataframeT):
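  ##Only run once: DataframeT_temp1 and DataframeT_temp2 both refer to DataframeT itself (no copy),
  ##so the drops below modify the caller's dataframe. Reload the original dataframe before running again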
  DataframeT_temp1 = DataframeT
  DataframeT_temp2 = DataframeT

  ##Using IQR Method
  ##DBP
  Q1_dbp = DataframeT_temp1['dbp'].quantile(0.25)
  Q3_dbp = DataframeT_temp1['dbp'].quantile(0.75)
  IQR_dbp = Q3_dbp-Q1_dbp
  LowerPart_dbp = Q1_dbp - 1.5*IQR_dbp
  UpperPart_dbp = Q3_dbp + 1.5*IQR_dbp

  ##Remove Outliers
  LowerData_dbp = np.where(DataframeT_temp1['dbp']<=LowerPart_dbp)
  DataframeT_temp1.drop(index=LowerData_dbp[0], inplace=True)
  DataframeT_temp1.reset_index(inplace=True, drop=True)
  UpperData_dbp = np.where(DataframeT_temp1['dbp']>=UpperPart_dbp)
  DataframeT_temp1.drop(index=UpperData_dbp[0], inplace=True)
  DataframeT_temp1.reset_index(inplace=True, drop=True)

  print("DBP")
  print(LowerPart_dbp)
  print(UpperPart_dbp)
  print(LowerData_dbp)
  print(UpperData_dbp)

  ##SBP
  Q1_sbp = DataframeT_temp2['sbp'].quantile(0.25)
  Q3_sbp = DataframeT_temp2['sbp'].quantile(0.75)
  IQR_sbp = Q3_sbp-Q1_sbp
  LowerPart_sbp = Q1_sbp - 1.5*IQR_sbp
  UpperPart_sbp = Q3_sbp + 1.5*IQR_sbp

  ##Remove Outliers
  LowerData_sbp = np.where(DataframeT_temp2['sbp']<=LowerPart_sbp)
  DataframeT_temp2.drop(index=LowerData_sbp[0], inplace=True)
  DataframeT_temp2.reset_index(inplace=True, drop=True)
  UpperData_sbp = np.where(DataframeT_temp2['sbp']>=UpperPart_sbp)
  DataframeT_temp2.drop(index=UpperData_sbp[0], inplace=True)
  DataframeT_temp2.reset_index(inplace=True, drop=True)

  print("SBP")
  print(LowerPart_sbp)
  print(UpperPart_sbp)
  print(LowerData_sbp)
  print(UpperData_sbp)

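  ##Both names refer to the same filtered dataframe, so this concat stacks it on itself;
  ##drop_duplicates then removes the second copy and any rows that repeat the same (dbp, sbp) pair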
  TrueDataframeT = pd.concat([DataframeT_temp1, DataframeT_temp2], ignore_index=True)
  TrueDataframeT = TrueDataframeT.drop_duplicates()
  # TrueDataframeT = DataframeT_temp

  print(len(TrueDataframeT))
  print(TrueDataframeT)
  # TrueDataframeT.to_csv("testing.csv")

  ##Dataframe to numpy array
  BasicTrainingDataArray = TrueDataframeT.to_numpy()

  # return TrueDataframeT
  return BasicTrainingDataArray

#### Step 3: Implement Regression
> use Gradient Descent to finish this part

In [72]:
def GradientDescent(Max_Steps, LearnRate, MaxCoeff, W0, W1, Training_DS):
  X_NP = np.array(Training_DS[:,0])
  Y_NP = np.array(Training_DS[:,1])
  ##Coefficient histories start from the initial weights
  Coeff0_Stor = [W0]
  Coeff1_Stor = [W1]
  CurSteps = 0

  #Loop
  while(CurSteps<Max_Steps):
    #Gradient of the mean squared error with respect to W0 and W1 (printed under the "Loss" label)
    W0_L = -2/len(Y_NP)*np.sum(Y_NP - (W0 + W1*X_NP))
    W1_L = -2/len(Y_NP)*np.sum(X_NP*(Y_NP - (W0 + W1*X_NP)))

    print("====Loss: ====")
    print(W0_L)
    print(W1_L)

    #Stop once the W1 update falls below the MaxCoeff tolerance
    if(abs(LearnRate*W1_L)<MaxCoeff):
      break

    #Get New Weights
    W0 = W0 - (LearnRate*W0_L)
    W1 = W1 - (LearnRate*W1_L)

    Coeff0_Stor.append(W0)
    Coeff1_Stor.append(W1)
    print("New Weights: ")
    print(W0)
    print(W1)
    print("CurSteps: " + str(CurSteps))
    print()

    CurSteps = 1 + CurSteps

  return W0, W1, Coeff0_Stor, Coeff1_Stor

def MAPE_Result(Eval_DS, W0_GD, W1_GD):
  X_Eval, Y_Eval = Eval_DS[:,0], Eval_DS[:,1]
  Predic_Eval = X_Eval*W1_GD+W0_GD

  ## MAPE = 100/n * sum(|(Y_true - Y_pred) / Y_true|), with n = number of evaluation rows
  MAPE_GD = 100/len(X_Eval)*np.sum(abs(np.divide(np.subtract(Y_Eval, Predic_Eval), Y_Eval)))
  return MAPE_GD
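
In the recorded run of this notebook, gradient descent ends with an intercept near 0.02, while the matrix inversion fit reports about 49.68, which suggests the intercept has not converged within the step budget. One common remedy is to standardize the input before running gradient descent and map the weights back afterwards. The sketch below illustrates that idea under assumptions; `gradient_descent_standardized` and its parameters are hypothetical and not part of the assignment template.

```python
import numpy as np

def gradient_descent_standardized(x, y, learn_rate=0.1, max_steps=500, tol=1e-10):
    """Fit y = w0 + w1*x by gradient descent on a standardized copy of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean, std = x.mean(), x.std()
    z = (x - mean) / std                 # standardized feature: mean 0, std 1
    a0, a1 = 0.0, 0.0                    # weights in the standardized space
    for _ in range(max_steps):
        residual = y - (a0 + a1 * z)
        grad_a0 = -2.0 * residual.mean()          # d(MSE)/d(a0)
        grad_a1 = -2.0 * (z * residual).mean()    # d(MSE)/d(a1)
        # Stop when both updates are smaller than the tolerance
        if max(abs(learn_rate * grad_a0), abs(learn_rate * grad_a1)) < tol:
            break
        a0 -= learn_rate * grad_a0
        a1 -= learn_rate * grad_a1
    # Map back to the original scale: y = a0 + a1 * (x - mean) / std
    w1 = a1 / std
    w0 = a0 - a1 * mean / std
    return w0, w1

# Noise-free toy data generated from y = 50 + 0.97*x
x_demo = np.array([60.0, 70.0, 80.0, 90.0, 100.0])
y_demo = 50.0 + 0.97 * x_demo
w0_demo, w1_demo = gradient_descent_standardized(x_demo, y_demo)
print(w0_demo, w1_demo)
```

After standardization both weights see the same curvature, so a single learning rate works for the intercept and the slope alike.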

In [73]:
##Settings
Max_Steps_GD = 1000
LearnRate_GD = 0.00001
MaxCoeff_GD = 0.000001

W0_GDT = 0
W1_GDT = 0

BasicTrainingDataArray_GD = PreprocessData(PDBasicTrainingData_GD)
TrainingList_GD, ValidationList_GD = SplitData(BasicTrainingDataArray_GD)
W0_GD, W1_GD, Coeff0_Stor_GD, Coeff1_Stor_GD = GradientDescent(Max_Steps_GD, LearnRate_GD,
                                                               MaxCoeff_GD, W0_GDT, W1_GDT,
                                                               BasicTrainingDataArray_GD)
MAPE_GD = MAPE_Result(BasicTrainingDataArray_GD, W0_GD, W1_GD)

df_tocsv_Coeff_GD = {"Coef0": Coeff0_Stor_GD, "Coef1": Coeff1_Stor_GD}
df_tocsv_Coeff_GD = pd.DataFrame(data=df_tocsv_Coeff_GD)
csvcontent_prep_list_Coeff = df_tocsv_Coeff_GD.values.tolist()
coefficient_output = csvcontent_prep_list_Coeff

DBP
47.0
119.0
(array([  2, 142]),)
(array([157]),)
SBP
87.5
171.5
(array([], dtype=int64),)
(array([244, 364, 365]),)
320
     dbp  sbp
0     92  142
1     86  119
2     60  112
3     85  122
4     77  119
..   ...  ...
361   98  150
362   98  166
363   94  147
364   72  119
366   94  154

[320 rows x 2 columns]
====Loss: ====
-259.16875
-21703.581250000003
New Weights: 
0.0025916875000000002
0.21703581250000004
CurSteps: 0

====Loss: ====
-223.31603276914063
-18675.344892402343
New Weights: 
0.004824847827691407
0.4037892614240235
CurSteps: 1

====Loss: ====
-192.46573273151546
-16069.628868152819
New Weights: 
0.006749505155006562
0.5644855501055517
CurSteps: 2

====Loss: ====
-165.91987828569364
-13827.480238639471
New Weights: 
0.008408703937863499
0.7027603524919463
CurSteps: 3

====Loss: ====
-143.07788362147014
-11898.171584111053
New Weights: 
0.009839482774078201
0.8217420683330569
CurSteps: 4

====Loss: ====
-123.42296078546624
-10238.053322795948
New Weights: 
0.01107371238

#### Step 4: Make Prediction

Make prediction of testing dataset and store the values in *output_datalist*
The final *output_datalist* should look something like this
> [ [100], [80], ... , [90] ] where each row contains the predicted SBP

Remember to also store your coefficient update in *coefficient_output*
The final *coefficient_output* should look something like this
> [ [1, 0, 3, 5], ... , [0.1, 0.3, 0.2, 0.5] ] where each row contains the [w0, w1, ..., wn] of your coefficient





In [74]:
def MakePrediction(Target, W0, W1):
  Prediction = Target*W1+W0
  return Prediction

##BasicTestingDataArray (TestingData)
X_Testing_GD = BasicTestingDataArray_GD[:,0]
Prediction_Testing_GD = MakePrediction(X_Testing_GD, W0_GD, W1_GD)

##Ready for Output
ContentArrayCSV_GD = Prediction_Testing_GD.tolist()
df_tocsv_GD = {'sbp': ContentArrayCSV_GD}
df_tocsv_GD = pd.DataFrame(data = df_tocsv_GD)
csvcontent_prep_list_GD = df_tocsv_GD.values.tolist()
# output_datalist = [['sbp']] + csvcontent_prep_list_GD
output_datalist = csvcontent_prep_list_GD

In [75]:
print(Prediction_Testing_GD)

[146.2358266  130.68097335 129.12548803 157.12422388  96.4602962
 154.01325323 141.56937063 141.56937063 132.23645868 122.90354673
 121.3480614  126.01451738 110.45966413 121.3480614  144.68034128
 136.90291465 164.9016505  146.2358266  149.34679725 160.23519453]


#### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points would be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be:
```
3 2 1
```



In [76]:
print("Coefficients: " + str(W1_GD) + ' ' + str(W0_GD))
print("MAPE (Mean Absolute Percentage Error): ", MAPE_GD)

Coefficients: 1.55548532498866 0.020206054720652983
MAPE (Mean Absolute Percentage Error):  7.175590217096497


### *Write the Output File*

Write the prediction to output csv
> Format: 'sbp'

**Write the coefficient update to csv**
> Format: 'w0', 'w1', ..., 'wn'
>*   The number of columns is based on your number of coefficients
>*   The number of rows is based on your number of iterations

In [77]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

with open(coefficient_output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in coefficient_output:
    writer.writerow(row)

# **2. Advanced Part (40%)**
In the second part, you need to implement the regression in a different way than in the basic part to predict the SBP of multiple patients.

You can choose **either** Matrix Inversion or Gradient Descent method.

The training data will be in **hw1_advanced_training.csv** and the testing data will be in **hw1_advanced_testing.csv**.

Output your prediction in **hw1_advanced.csv**

Notice:
> You cannot import any package other than those given



### Input the training and testing dataset

In [78]:
training_dataroot = 'hw1_advanced_training.csv' # Training data file, named 'hw1_advanced_training.csv'
testing_dataroot = 'hw1_advanced_testing.csv'   # Testing data file, named 'hw1_advanced_testing.csv'
output_dataroot = 'hw1_advanced.csv' # Output file will be named 'hw1_advanced.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 220 * 1 matrix and saved as numpy array
                      # The format of each row should be ['sbp']

In [79]:
PDBasicTrainingData_AP = pd.read_csv(training_dataroot)
PDBasicTestingData_AP = pd.read_csv(testing_dataroot)

##Testing Data to Numpy Array
BasicTestingDataArray_AP = PDBasicTestingData_AP.to_numpy()

### Your Implementation

#### Step 1: Preprocess Data

##### Preprocess Data (Part 1): Remove Temperature Column and NaNs

In [80]:
##Remove Temperature & Remove NaNs
PDBasicTrainingData_AP = PDBasicTrainingData_AP.drop(columns='temperature')
PDBasicTrainingData_AP = PDBasicTrainingData_AP.dropna()
PDBasicTrainingData_AP.reset_index(inplace=True, drop=True)

##### Preprocess Data (Part 2): Time Format to Seconds

In [81]:
##Convert ChartTime format to Seconds
def ChartTime_to_Secs(ChartTimeStr):
  Split1 = ChartTimeStr.split(' days ')
  Split2 = Split1[1].split(':')
  Days = int(Split1[0])
  Hours = int(Split2[0])
  Minutes = int(Split2[1])
  Seconds = int(Split2[2])

  TotalSecs = Days*24*3600 + Hours*3600 + Minutes*60 + Seconds
  return TotalSecs

##Convert the ChartTime in List to Seconds Format
def ChartTimeList_Conv_Secs(ChartTimeList):
  NewChartTime = []
  for ChartTimes in ChartTimeList:
    NewTime = ChartTime_to_Secs(ChartTimes)
    NewChartTime.append(NewTime)
  return NewChartTime

##Convert the ChartTimelist to a DataFrame
def ChartTimeList_to_DataF(ChartTimeList):
  NewDic = {"charttime": ChartTimeList}
  NewDF = pd.DataFrame(data = NewDic)
  return NewDF

##Dataframe of Data, convert the Charttime to Seconds Format
def df_charttime_conversion(DataframeT):
  TempDataframe = DataframeT

  ##Pick the charttime and convert to list
  ChartTimeSelec = DataframeT['charttime']
  ChartTimeList = ChartTimeSelec.tolist()

  ##Conversion to Seconds Format
  NewChartTimeList = ChartTimeList_Conv_Secs(ChartTimeList)

  ##To Dataframe
  NewDF_ChartTime = ChartTimeList_to_DataF(NewChartTimeList)

  ##Replace Old Dataframe with New Data
  TempDataframe.charttime = NewDF_ChartTime.charttime.values
  return TempDataframe

##Execution
PDBasicTrainingData_AP = df_charttime_conversion(PDBasicTrainingData_AP)
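
The manual string splitting above can also be expressed with pandas' timedelta parsing. This is a minimal sketch, assuming the `charttime` strings follow the `N days HH:MM:SS` pattern implied by `ChartTime_to_Secs`; `charttime_to_seconds` and the sample strings are hypothetical.

```python
import pandas as pd

def charttime_to_seconds(series):
    """Convert strings such as '7 days 17:32:00' to whole seconds."""
    return pd.to_timedelta(series).dt.total_seconds().astype("int64")

# Sample strings in the assumed 'N days HH:MM:SS' format
demo = pd.Series(["0 days 00:00:00", "7 days 17:32:00"])
seconds = charttime_to_seconds(demo)
print(seconds.tolist())  # [0, 667920]
```

The second value is 7*86400 + 17*3600 + 32*60 = 667920 seconds, the same arithmetic that `ChartTime_to_Secs` performs.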

##### Preprocess Data (Part 3): Split Data based on Patient ID

In [82]:
##Get Numpy Array of UniqueSubjectIDs
def get_UniqueSubjectID(Dataframe):
  UniqueSubjectID = Dataframe.subject_id.unique()
  return UniqueSubjectID

##Split Dataframes by Groups to List Containing Dataframes
def Split_DF_byGroup(GroupTarg, NameGroup, DataframeT):
  ListOfGroups = []
  for SubjectID in GroupTarg:
    ##Select this subject's rows and renumber them from 0
    SubjectGroup = DataframeT[DataframeT[NameGroup]==SubjectID].reset_index(drop=True)
    ListOfGroups.append(SubjectGroup)
  return ListOfGroups

##Execution
UniqueSubjectID_Arr = get_UniqueSubjectID(PDBasicTrainingData_AP)
ListOfFrames_SubjID = Split_DF_byGroup(UniqueSubjectID_Arr, 'subject_id',
                                      PDBasicTrainingData_AP)
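
Iterating over a single `groupby` yields each patient's rows directly, which avoids grouping once per subject. The sketch below is an alternative under assumptions, not the notebook's implementation; `split_by_subject` and the toy dataframe are hypothetical, and it returns a dictionary keyed by subject ID instead of a list.

```python
import pandas as pd

def split_by_subject(df, key="subject_id"):
    """Return {subject id: that subject's rows, re-indexed from 0}."""
    # sort=False keeps the subjects in order of first appearance
    return {sid: group.reset_index(drop=True)
            for sid, group in df.groupby(key, sort=False)}

# Toy data (hypothetical values) with two subjects
demo = pd.DataFrame({"subject_id": [1, 1, 2, 2, 2],
                     "sbp": [120, 125, 140, 138, 142]})
groups = split_by_subject(demo)
print({sid: len(g) for sid, g in groups.items()})
```

Keeping the subject ID as the dictionary key makes it easy to look up the matching frame when predicting for a specific patient.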

In [83]:
UniqueSubjectID_Arr

array([11526383, 12923910, 14699420, 15437705, 15642911, 16298357,
       17331999, 17593883, 18733920, 18791093, 19473413])

In [84]:
ListOfFrames_SubjID

[     subject_id  charttime  heartrate  resprate  o2sat  sbp
 0      11526383          0       88.0      16.0  100.0  136
 1      11526383     667920       85.0      18.0   97.0  142
 2      11526383     679560       92.0      18.0   97.0  141
 3      11526383    5000820       95.0      15.0  100.0  167
 4      11526383    5017500      100.0      16.0  100.0  181
 ..          ...        ...        ...       ...    ...  ...
 554    11526383  148261260       91.0      18.0  100.0  164
 555    11526383  149552400       88.0      16.0   98.0  113
 556    11526383  149556660       85.0      18.0   98.0  141
 557    11526383  149566140       82.0      17.0   98.0  138
 558    11526383  152487120       84.0      18.0   95.0  126
 
 [559 rows x 6 columns],
      subject_id  charttime  heartrate  resprate  o2sat  sbp
 0      12923910          0       71.0      18.0   97.0  167
 1      12923910       6480       75.0      18.0   98.0  174
 2      12923910      14520       68.0      16.0   99.0  1

In [85]:
PDBasicTrainingData_AP

      subject_id  charttime  heartrate  resprate  o2sat  sbp
0       11526383          0       88.0      16.0  100.0  136
1       11526383     667920       85.0      18.0   97.0  142
2       11526383     679560       92.0      18.0   97.0  141
3       11526383    5000820       95.0      15.0  100.0  167
4       11526383    5017500      100.0      16.0  100.0  181
...          ...        ...        ...       ...    ...  ...
5431    19473413  209152200       78.0      15.0  100.0  132
5432    19473413  209179260       82.0      16.0   99.0  128
5433    19473413  209843640      122.0      18.0   95.0  126
5434    19473413  209869020       92.0      16.0   96.0  124


##### Preprocess Data (Part 4): Remove Subject_ID Column

In [86]:
SubjectIDGroup = []
for PDFrames in ListOfFrames_SubjID:
  SubjectIDGroup.append(PDFrames.drop(columns="subject_id"))
PDBasicTrainingData_AP = PDBasicTrainingData_AP.drop(columns = "subject_id")

In [87]:
SubjectIDGroup

[     charttime  heartrate  resprate  o2sat  sbp
 0            0       88.0      16.0  100.0  136
 1       667920       85.0      18.0   97.0  142
 2       679560       92.0      18.0   97.0  141
 3      5000820       95.0      15.0  100.0  167
 4      5017500      100.0      16.0  100.0  181
 ..         ...        ...       ...    ...  ...
 554  148261260       91.0      18.0  100.0  164
 555  149552400       88.0      16.0   98.0  113
 556  149556660       85.0      18.0   98.0  141
 557  149566140       82.0      17.0   98.0  138
 558  152487120       84.0      18.0   95.0  126
 
 [559 rows x 5 columns],
      charttime  heartrate  resprate  o2sat  sbp
 0            0       71.0      18.0   97.0  167
 1         6480       75.0      18.0   98.0  174
 2        14520       68.0      16.0   99.0  170
 3        18660       70.0      16.0   96.0  183
 4       571380       85.0      16.0   95.0  131
 ..         ...        ...       ...    ...  ...
 416  108817260       69.0      18.0   97.

In [88]:
PDBasicTrainingData_AP

      charttime  heartrate  resprate  o2sat  sbp
0             0       88.0      16.0  100.0  136
1        667920       85.0      18.0   97.0  142
2        679560       92.0      18.0   97.0  141
3       5000820       95.0      15.0  100.0  167
4       5017500      100.0      16.0  100.0  181
...         ...        ...       ...    ...  ...
5431  209152200       78.0      15.0  100.0  132
5432  209179260       82.0      16.0   99.0  128
5433  209843640      122.0      18.0   95.0  126
5434  209869020       92.0      16.0   96.0  124


##### Preprocess Data (Part 5): Remove Outliers

In [89]:
def PreprocessData_AP(DataframeT):
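  ##Only run once: the four DataframeT_temp names below all refer to DataframeT itself (no copy),
  ##so the drops modify the caller's dataframe. Reload the original dataframe before running again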
  DataframeT_temp1 = DataframeT
  DataframeT_temp1.reset_index(inplace=True, drop=True)
  DataframeT_temp2 = DataframeT
  DataframeT_temp2.reset_index(inplace=True, drop=True)
  DataframeT_temp3 = DataframeT
  DataframeT_temp3.reset_index(inplace=True, drop=True)
  DataframeT_temp4 = DataframeT
  DataframeT_temp4.reset_index(inplace=True, drop=True)

  ##Using IQR Method
  ##heartrate
  Q1_hr = DataframeT_temp1['heartrate'].quantile(0.25)
  Q3_hr = DataframeT_temp1['heartrate'].quantile(0.75)
  IQR_hr = Q3_hr-Q1_hr
  LowerPart_hr = Q1_hr - 1.5*IQR_hr
  UpperPart_hr = Q3_hr + 1.5*IQR_hr

  ##Remove Outliers
  LowerData_hr = np.where(DataframeT_temp1['heartrate']<=LowerPart_hr)
  DataframeT_temp1.drop(index=LowerData_hr[0], inplace=True)
  DataframeT_temp1.reset_index(inplace=True, drop=True)
  print("GOOD")
  UpperData_hr = np.where(DataframeT_temp1['heartrate']>=UpperPart_hr)
  DataframeT_temp1.drop(index=UpperData_hr[0], inplace=True)
  DataframeT_temp1.reset_index(inplace=True, drop=True)

  print("HeartRate")
  print(LowerPart_hr)
  print(UpperPart_hr)
  print(LowerData_hr)
  print(UpperData_hr)

  ##resprate
  Q1_rr = DataframeT_temp2['resprate'].quantile(0.25)
  Q3_rr = DataframeT_temp2['resprate'].quantile(0.75)
  IQR_rr = Q3_rr-Q1_rr
  LowerPart_rr = Q1_rr - 1.5*IQR_rr
  UpperPart_rr = Q3_rr + 1.5*IQR_rr

  ##Remove Outliers
  LowerData_rr = np.where(DataframeT_temp2['resprate']<=LowerPart_rr)
  DataframeT_temp2.drop(index=LowerData_rr[0], inplace=True)
  DataframeT_temp2.reset_index(inplace=True, drop=True)
  print("GOOD")
  UpperData_rr = np.where(DataframeT_temp2['resprate']>=UpperPart_rr)
  DataframeT_temp2.drop(index=UpperData_rr[0], inplace=True)
  DataframeT_temp2.reset_index(inplace=True, drop=True)

  print("RespRate")
  print(LowerPart_rr)
  print(UpperPart_rr)
  print(LowerData_rr)
  print(UpperData_rr)

  ##o2sat
  Q1_o2 = DataframeT_temp3['o2sat'].quantile(0.25)
  Q3_o2 = DataframeT_temp3['o2sat'].quantile(0.75)
  IQR_o2 = Q3_o2-Q1_o2
  LowerPart_o2 = Q1_o2 - 1.5*IQR_o2
  UpperPart_o2 = Q3_o2 + 1.5*IQR_o2

  ##Remove Outliers
  LowerData_o2 = np.where(DataframeT_temp3['o2sat']<=LowerPart_o2)
  DataframeT_temp3.drop(index=LowerData_o2[0], inplace=True)
  DataframeT_temp3.reset_index(inplace=True, drop=True)
  print("GOOD")
  UpperData_o2 = np.where(DataframeT_temp3['o2sat']>=UpperPart_o2)
  DataframeT_temp3.drop(index=UpperData_o2[0], inplace=True)
  DataframeT_temp3.reset_index(inplace=True, drop=True)

  print("o2Sat")
  print(LowerPart_o2)
  print(UpperPart_o2)
  print(LowerData_o2)
  print(UpperData_o2)

  ##SBP
  Q1_sbp = DataframeT_temp4['sbp'].quantile(0.25)
  Q3_sbp = DataframeT_temp4['sbp'].quantile(0.75)
  IQR_sbp = Q3_sbp-Q1_sbp
  LowerPart_sbp = Q1_sbp - 1.5*IQR_sbp
  UpperPart_sbp = Q3_sbp + 1.5*IQR_sbp

  ##Remove Outliers
  LowerData_sbp = np.where(DataframeT_temp4['sbp']<=LowerPart_sbp)
  DataframeT_temp4.drop(index=LowerData_sbp[0], inplace=True)
  DataframeT_temp4.reset_index(inplace=True, drop=True)
  print("GOOD")
  UpperData_sbp = np.where(DataframeT_temp4['sbp']>=UpperPart_sbp)
  DataframeT_temp4.drop(index=UpperData_sbp[0], inplace=True)
  DataframeT_temp4.reset_index(inplace=True, drop=True)

  print("SBP")
  print(LowerPart_sbp)
  print(UpperPart_sbp)
  print(LowerData_sbp)
  print(UpperData_sbp)

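  ##All four names refer to the same filtered dataframe, so this concat stacks it four times;
  ##drop_duplicates then removes the extra copies and any rows with identical values in every column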
  TrueDataframeT = pd.concat([DataframeT_temp1, DataframeT_temp2, DataframeT_temp3
                              , DataframeT_temp4], ignore_index=True)
  TrueDataframeT = TrueDataframeT.drop_duplicates()
  # TrueDataframeT = DataframeT_temp

  print(len(TrueDataframeT))
  print(TrueDataframeT)
  # TrueDataframeT.to_csv("testing.csv")

  ##Dataframe to numpy array
  BasicTrainingDataArray = TrueDataframeT.to_numpy()

  # return TrueDataframeT
  return BasicTrainingDataArray


##Execution
ListofArrays_SubjID = []
for PDFrames in SubjectIDGroup:
  # print(PDFrames)
  ListofArrays_SubjID.append(PreprocessData_AP(PDFrames))

NonFramesData = PreprocessData_AP(PDBasicTrainingData_AP)

GOOD
HeartRate
67.0
123.0
(array([180]),)
(array([  8,   9,  66,  73, 133, 147, 160, 184, 185, 188, 366, 416, 427,
       435, 436, 452, 476, 492]),)
GOOD
RespRate
11.5
23.5
(array([122, 395]),)
(array([ 29,  46, 112, 132, 163, 245, 246, 250, 280, 281, 359, 360, 361,
       402, 465, 500]),)
GOOD
o2Sat
95.0
103.0
(array([ 11,  47,  48, 133, 234, 274, 352, 354, 425, 449, 468, 469, 492,
       514, 515, 516, 521]),)
(array([], dtype=int64),)
GOOD
SBP
88.5
220.5
(array([210, 214]),)
(array([  8,  11,  16,  17,  18,  23,  39,  40,  44, 401, 410]),)
492
     charttime  heartrate  resprate  o2sat  sbp
0            0       88.0      16.0  100.0  136
1       667920       85.0      18.0   97.0  142
2       679560       92.0      18.0   97.0  141
3      5000820       95.0      15.0  100.0  167
4      5017500      100.0      16.0  100.0  181
..         ...        ...       ...    ...  ...
487  147708960       88.0      20.0   98.0  156
488  148261260       91.0      18.0  100.0  164
489  14955240

In [90]:
ListofArrays_SubjID

[array([[0.0000000e+00, 8.8000000e+01, 1.6000000e+01, 1.0000000e+02,
         1.3600000e+02],
        [6.6792000e+05, 8.5000000e+01, 1.8000000e+01, 9.7000000e+01,
         1.4200000e+02],
        [6.7956000e+05, 9.2000000e+01, 1.8000000e+01, 9.7000000e+01,
         1.4100000e+02],
        ...,
        [1.4955240e+08, 8.8000000e+01, 1.6000000e+01, 9.8000000e+01,
         1.1300000e+02],
        [1.4955666e+08, 8.5000000e+01, 1.8000000e+01, 9.8000000e+01,
         1.4100000e+02],
        [1.4956614e+08, 8.2000000e+01, 1.7000000e+01, 9.8000000e+01,
         1.3800000e+02]]),
 array([[0.0000000e+00, 7.1000000e+01, 1.8000000e+01, 9.7000000e+01,
         1.6700000e+02],
        [6.4800000e+03, 7.5000000e+01, 1.8000000e+01, 9.8000000e+01,
         1.7400000e+02],
        [1.4520000e+04, 6.8000000e+01, 1.6000000e+01, 9.9000000e+01,
         1.7000000e+02],
        ...,
        [1.0921476e+08, 7.8000000e+01, 1.8000000e+01, 9.9000000e+01,
         1.4100000e+02],
        [1.0923276e+08, 8.500000

In [91]:
NonFramesData

array([[0.0000000e+00, 8.8000000e+01, 1.6000000e+01, 1.0000000e+02,
        1.3600000e+02],
       [6.6792000e+05, 8.5000000e+01, 1.8000000e+01, 9.7000000e+01,
        1.4200000e+02],
       [6.7956000e+05, 9.2000000e+01, 1.8000000e+01, 9.7000000e+01,
        1.4100000e+02],
       ...,
       [2.0984364e+08, 1.2200000e+02, 1.8000000e+01, 9.5000000e+01,
        1.2600000e+02],
       [2.0986902e+08, 9.2000000e+01, 1.6000000e+01, 9.6000000e+01,
        1.2400000e+02],
       [2.1986814e+08, 8.7000000e+01, 1.8000000e+01, 9.7000000e+01,
        1.5000000e+02]])

#### Step 2: Split Data

In [92]:
##Split Data:
def SplitData(DataList):
  ##70% Training
  ##30% Validation
  LenDataList = len(DataList)
  LenTraining = math.floor(LenDataList*70/100)
  LenValidation = LenDataList - LenTraining
  TrainingList = DataList[:LenTraining]
  ValidationList = DataList[LenTraining:]

  return TrainingList, ValidationList

##Execution
ListofArrays_SubjID_TrainingList = []
ListofArrays_SubjID_ValidationList = []
for PDFrames in ListofArrays_SubjID:
  TrainingListTemp, ValidationListTemp = SplitData(PDFrames)
  ListofArrays_SubjID_TrainingList.append(TrainingListTemp)
  ListofArrays_SubjID_ValidationList.append(ValidationListTemp)

NonFramesDataTrainingList, NonFramesDataValidationList = SplitData(NonFramesData)
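`SplitData` slices rows in their stored order, so the validation set is always the tail of each array. A minimal sketch of that behaviour on a toy array follows; the name `split_data`, the `train_percent` parameter, and the toy values are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def split_data(data, train_percent=70):
    """Return the first `train_percent`% of rows and the remaining rows."""
    n_train = len(data) * train_percent // 100  # integer floor, no float rounding
    return data[:n_train], data[n_train:]

# Toy data: 10 rows, 5 columns (hypothetical values, not the assignment data).
toy = np.arange(50, dtype=float).reshape(10, 5)
train, valid = split_data(toy)

print(train.shape, valid.shape)  # (7, 5) (3, 5)
```

Because the slice is sequential, a patient whose readings drift over time will be validated only on the latest readings. Whether that is desirable depends on the evaluation goal.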

##### Total Prepared Variables:

Whole Training Dataset [SPLIT] (Not Divided between Patient IDs):
> NonFramesDataTrainingList

> NonFramesDataValidationList

Whole Training Dataset [NON-SPLIT]
>NonFramesData

Training Dataset [SPLIT] (Divided between Patient IDs):
> ListofArrays_SubjID_TrainingList

> ListofArrays_SubjID_ValidationList

Training Dataset [NON-SPLIT]
>ListofArrays_SubjID

Stored Subject ID: (Numpy Array)
> UniqueSubjectID_Arr

Columns:
> charttime

> heartrate

> resprate

> o2sat

> sbp

In [93]:
NonFramesDataTrainingList

array([[0.0000000e+00, 8.8000000e+01, 1.6000000e+01, 1.0000000e+02,
        1.3600000e+02],
       [6.6792000e+05, 8.5000000e+01, 1.8000000e+01, 9.7000000e+01,
        1.4200000e+02],
       [6.7956000e+05, 9.2000000e+01, 1.8000000e+01, 9.7000000e+01,
        1.4100000e+02],
       ...,
       [2.1935658e+08, 9.3000000e+01, 1.8000000e+01, 9.5000000e+01,
        1.0400000e+02],
       [2.1936864e+08, 8.2000000e+01, 1.8000000e+01, 9.5000000e+01,
        1.0200000e+02],
       [2.1938292e+08, 6.7000000e+01, 1.8000000e+01, 9.6000000e+01,
        1.1300000e+02]])

In [94]:
NonFramesDataValidationList

array([[2.1938970e+08, 6.6000000e+01, 1.8000000e+01, 9.6000000e+01,
        1.2800000e+02],
       [2.1940056e+08, 7.2000000e+01, 1.8000000e+01, 9.7000000e+01,
        1.2900000e+02],
       [2.1959646e+08, 1.0800000e+02, 2.0000000e+01, 9.7000000e+01,
        1.2400000e+02],
       ...,
       [2.0984364e+08, 1.2200000e+02, 1.8000000e+01, 9.5000000e+01,
        1.2600000e+02],
       [2.0986902e+08, 9.2000000e+01, 1.6000000e+01, 9.6000000e+01,
        1.2400000e+02],
       [2.1986814e+08, 8.7000000e+01, 1.8000000e+01, 9.7000000e+01,
        1.5000000e+02]])

In [95]:
NonFramesData

array([[0.0000000e+00, 8.8000000e+01, 1.6000000e+01, 1.0000000e+02,
        1.3600000e+02],
       [6.6792000e+05, 8.5000000e+01, 1.8000000e+01, 9.7000000e+01,
        1.4200000e+02],
       [6.7956000e+05, 9.2000000e+01, 1.8000000e+01, 9.7000000e+01,
        1.4100000e+02],
       ...,
       [2.0984364e+08, 1.2200000e+02, 1.8000000e+01, 9.5000000e+01,
        1.2600000e+02],
       [2.0986902e+08, 9.2000000e+01, 1.6000000e+01, 9.6000000e+01,
        1.2400000e+02],
       [2.1986814e+08, 8.7000000e+01, 1.8000000e+01, 9.7000000e+01,
        1.5000000e+02]])

In [96]:
ListofArrays_SubjID_TrainingList

[array([[0.000000e+00, 8.800000e+01, 1.600000e+01, 1.000000e+02,
         1.360000e+02],
        [6.679200e+05, 8.500000e+01, 1.800000e+01, 9.700000e+01,
         1.420000e+02],
        [6.795600e+05, 9.200000e+01, 1.800000e+01, 9.700000e+01,
         1.410000e+02],
        ...,
        [9.122862e+07, 8.600000e+01, 1.800000e+01, 1.000000e+02,
         1.210000e+02],
        [9.123924e+07, 8.700000e+01, 1.800000e+01, 1.000000e+02,
         1.170000e+02],
        [9.346326e+07, 7.000000e+01, 1.600000e+01, 1.000000e+02,
         1.370000e+02]]),
 array([[0.000000e+00, 7.100000e+01, 1.800000e+01, 9.700000e+01,
         1.670000e+02],
        [6.480000e+03, 7.500000e+01, 1.800000e+01, 9.800000e+01,
         1.740000e+02],
        [1.452000e+04, 6.800000e+01, 1.600000e+01, 9.900000e+01,
         1.700000e+02],
        ...,
        [8.325246e+07, 7.000000e+01, 1.600000e+01, 9.900000e+01,
         1.190000e+02],
        [8.326146e+07, 7.400000e+01, 1.800000e+01, 9.800000e+01,
         1.040000

In [97]:
ListofArrays_SubjID_ValidationList

[array([[9.3475200e+07, 7.2000000e+01, 1.8000000e+01, 1.0000000e+02,
         1.3800000e+02],
        [9.3484560e+07, 7.0000000e+01, 1.8000000e+01, 1.0000000e+02,
         1.3400000e+02],
        [9.3503640e+07, 8.4000000e+01, 1.8000000e+01, 1.0000000e+02,
         1.4400000e+02],
        [9.3983940e+07, 1.0500000e+02, 1.8000000e+01, 1.0000000e+02,
         1.6100000e+02],
        [9.4000860e+07, 1.0300000e+02, 2.0000000e+01, 1.0000000e+02,
         1.3800000e+02],
        [9.4538100e+07, 8.4000000e+01, 1.8000000e+01, 1.0000000e+02,
         1.4700000e+02],
        [9.4544820e+07, 7.9000000e+01, 1.6000000e+01, 1.0000000e+02,
         1.4500000e+02],
        [9.4557000e+07, 8.1000000e+01, 1.6000000e+01, 1.0000000e+02,
         1.4200000e+02],
        [9.4559940e+07, 8.5000000e+01, 1.8000000e+01, 9.8000000e+01,
         1.5300000e+02],
        [9.4757700e+07, 9.5000000e+01, 2.0000000e+01, 1.0000000e+02,
         1.6900000e+02],
        [9.4773060e+07, 9.1000000e+01, 1.9000000e+01, 9.9000

In [98]:
ListofArrays_SubjID

[array([[0.0000000e+00, 8.8000000e+01, 1.6000000e+01, 1.0000000e+02,
         1.3600000e+02],
        [6.6792000e+05, 8.5000000e+01, 1.8000000e+01, 9.7000000e+01,
         1.4200000e+02],
        [6.7956000e+05, 9.2000000e+01, 1.8000000e+01, 9.7000000e+01,
         1.4100000e+02],
        ...,
        [1.4955240e+08, 8.8000000e+01, 1.6000000e+01, 9.8000000e+01,
         1.1300000e+02],
        [1.4955666e+08, 8.5000000e+01, 1.8000000e+01, 9.8000000e+01,
         1.4100000e+02],
        [1.4956614e+08, 8.2000000e+01, 1.7000000e+01, 9.8000000e+01,
         1.3800000e+02]]),
 array([[0.0000000e+00, 7.1000000e+01, 1.8000000e+01, 9.7000000e+01,
         1.6700000e+02],
        [6.4800000e+03, 7.5000000e+01, 1.8000000e+01, 9.8000000e+01,
         1.7400000e+02],
        [1.4520000e+04, 6.8000000e+01, 1.6000000e+01, 9.9000000e+01,
         1.7000000e+02],
        ...,
        [1.0921476e+08, 7.8000000e+01, 1.8000000e+01, 9.9000000e+01,
         1.4100000e+02],
        [1.0923276e+08, 8.500000

#### Step 3: Implement Regression (Inverse Matrix)

In [99]:
def PrepareForMatrixInversion_AP(X_Arr):
  XTarget=X_Arr
  XTarget_Df = pd.DataFrame(XTarget, columns = ['heartrate','resprate','o2sat'])
  XTarget_Df.insert(0, 'onescol', np.ones(XTarget.shape[0]))
  XTarget_array = XTarget_Df.to_numpy()
  return XTarget_array

def MatrixInversion_AP(Training_DS):
  ##Y = XW + E
  X, Y = Training_DS[:,1:4], Training_DS[:,4]
  X_arr = PrepareForMatrixInversion_AP(X)
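  ##Lambda and Identity_Matrix are used only by the ridge variant commented out below
  Lambda = 9999000000
  Identity_Matrix = np.identity(X_arr.shape[1])

  ##Weight_IM = (X^T . X)^-1 . X^T . Y
  Weight_IM =np.linalg.inv(X_arr.T.dot(X_arr)).dot(X_arr.T).dot(Y)
  ##Ridge variant: Weight_IM = (X^T . X + Lambda . I)^-1 . X^T . Y
  # Weight_IM =np.linalg.inv(X_arr.T.dot(X_arr)+Identity_Matrix.dot(Lambda)).dot(X_arr.T).dot(Y)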

  return Weight_IM

def MAPE_Result_AP(Eval_DS, Weight):
  X_Eval, Y_Eval = Eval_DS[:,1:4], Eval_DS[:,4]
  X_arr = PrepareForMatrixInversion_AP(X_Eval)
  Predic_Eval = np.dot(X_arr,Weight)

  ## MAPE = 100/N * SUM|(Y_Actual - Y_Prediction) / Y_Actual|, where N = len(Y_Eval)
  MAPE = 100/len(Y_Eval)*np.sum(abs(np.divide(np.subtract(Y_Eval, Predic_Eval), Y_Eval)))
  return MAPE
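As a sanity check of the closed-form fit and the MAPE formula above, the sketch below fits noise-free synthetic data generated from known weights and confirms they are recovered. All names (`add_intercept`, `fit_normal_equation`, `mape`) and the data are hypothetical. It also swaps in `np.linalg.solve` instead of an explicit inverse, which is a substitution on my part and not what the cell above does.

```python
import numpy as np

def add_intercept(features):
    """Prepend a column of ones so the first weight acts as the intercept."""
    return np.column_stack([np.ones(features.shape[0]), features])

def fit_normal_equation(features, targets):
    """Solve (X^T X) w = X^T y without forming the inverse explicitly."""
    x = add_intercept(features)
    return np.linalg.solve(x.T @ x, x.T @ targets)

def mape(targets, predictions):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((targets - predictions) / targets))

# Synthetic, noise-free data (hypothetical values) built from known weights.
rng = np.random.default_rng(0)
features = rng.uniform(60.0, 100.0, size=(50, 3))
true_weights = np.array([10.0, 0.5, -0.2, 1.0])  # intercept first
targets = add_intercept(features) @ true_weights

weights = fit_normal_equation(features, targets)
error = mape(targets, add_intercept(features) @ weights)
print(np.round(weights, 6), error)
```

With noise-free targets the recovered weights should match `true_weights` up to floating-point error and the MAPE should be close to zero. On real data neither will hold exactly.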

#### Step 4: Train Model and Generate Result

##### Train Model and Get MAPEs

In [100]:
WholeDatasetSplit_Weight = MatrixInversion_AP(NonFramesDataTrainingList)
WholeDatasetSplit_Weight_MAPE = MAPE_Result_AP(NonFramesDataValidationList,WholeDatasetSplit_Weight)

WholeDatasetNonSplit_Weight = MatrixInversion_AP(NonFramesData)
WholeDatasetNonSplit_Weight_MAPE= MAPE_Result_AP(NonFramesData,WholeDatasetNonSplit_Weight)

List_GroupedDatasetSplit_Weight = []
List_GroupedDatasetSplit_Weight_MAPE = []
for i in range(len(UniqueSubjectID_Arr)):
  GroupedDatasetSplit_Weight = MatrixInversion_AP(ListofArrays_SubjID_TrainingList[i])
  List_GroupedDatasetSplit_Weight.append(GroupedDatasetSplit_Weight)
  GroupedDatasetSplit_Weight_MAPE = MAPE_Result_AP(ListofArrays_SubjID_ValidationList[i], GroupedDatasetSplit_Weight)
  List_GroupedDatasetSplit_Weight_MAPE.append(GroupedDatasetSplit_Weight_MAPE)

List_GroupedDatasetNonSplit_Weight = []
List_GroupedDatasetNonSplit_Weight_MAPE = []
for List_Frames in ListofArrays_SubjID:
  GroupDatasetNonSplit_Weight = MatrixInversion_AP(List_Frames)
  List_GroupedDatasetNonSplit_Weight.append(GroupDatasetNonSplit_Weight)
  GroupedDatasetNonSplit_Weight_MAPE = MAPE_Result_AP(List_Frames, GroupDatasetNonSplit_Weight)
  List_GroupedDatasetNonSplit_Weight_MAPE.append(GroupedDatasetNonSplit_Weight_MAPE)

##### Get Prediction

In [101]:
def MakePrediction_AP(DatasetPred, Weight):
  #Only the first three columns (heartrate, resprate, o2sat) are needed for prediction
  X_Pred = DatasetPred[:,0:3]
  X_arr = PrepareForMatrixInversion_AP(X_Pred)
  Predic_Eval = np.dot(X_arr,Weight)

  return Predic_Eval

In [102]:
#Default Weight = No Split Dataset & No Grouping
#Subject Weight = No Split Dataset & Grouping

#UniqueSubjectID (Training) becomes List, used below to look up each testing patient
UniqueSubjectID_List = UniqueSubjectID_Arr.tolist()

#Drop Temperature & Charttime Columns
PDBasicTestingData_AP = PDBasicTestingData_AP.drop(columns='temperature')
PDBasicTestingData_AP = PDBasicTestingData_AP.drop(columns='charttime')

#Get Testing's SubjectIDs
Testing_UniqueSubjectID_Arr = get_UniqueSubjectID(PDBasicTestingData_AP)
Testing_ListOfFrames_SubjID = Split_DF_byGroup(Testing_UniqueSubjectID_Arr, 'subject_id',
                                               PDBasicTestingData_AP)

#Drop Subject_id Column
Testing_Group = []
for PDFrames in Testing_ListOfFrames_SubjID:
  CurrentFrame = PDFrames.drop(columns="subject_id")
  Testing_Group.append(CurrentFrame.to_numpy())

In [103]:
#Make Prediction
EndPrediction = []
for i, EachPatient in enumerate(Testing_UniqueSubjectID_Arr):
  #Testing_Group[i] holds the rows of the i-th testing patient
  if EachPatient in UniqueSubjectID_List:
    #Use Patient Coeff/Weight
    CurrentIndex = UniqueSubjectID_List.index(EachPatient)
    CurrentWeight = List_GroupedDatasetNonSplit_Weight[CurrentIndex]
  else:
    #Use Default Weight
    CurrentWeight = WholeDatasetNonSplit_Weight
  CurrentPrediction = MakePrediction_AP(Testing_Group[i], CurrentWeight)
  EndPrediction.append(CurrentPrediction)
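The selection rule in this cell (use a patient's own weights when the patient appeared in training, otherwise fall back to the whole-dataset weights) can also be written as a dictionary lookup. The sketch below is one possible way to do it, using made-up subject IDs, weights, and a single feature column; none of these values come from the assignment data.

```python
import numpy as np

# Hypothetical per-patient weights (intercept first) and a global fallback.
patient_weights = {
    101: np.array([1.0, 2.0]),
    102: np.array([0.0, 3.0]),
}
default_weights = np.array([5.0, 1.0])

# Hypothetical test features per patient: one feature column each.
test_groups = {
    101: np.array([[1.0], [2.0]]),
    999: np.array([[4.0]]),  # patient not seen in training
}

predictions = {}
for subject_id, features in test_groups.items():
    # dict.get returns the fallback when the patient has no fitted weights.
    weights = patient_weights.get(subject_id, default_weights)
    design = np.column_stack([np.ones(len(features)), features])
    predictions[subject_id] = design @ weights

print(predictions[101], predictions[999])  # [3. 5.] [9.]
```

Keying both the weights and the test rows by subject ID removes the need for a separate position counter, so features and weights cannot get out of step.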

In [104]:
#Combine the Arrays
EndPrediction_temp = EndPrediction
EndPredictionArray = np.array([])
for i in range(len(EndPrediction_temp)):
  EndPredictionArray = np.concatenate((EndPredictionArray, EndPrediction_temp[i]))

#Numpy Array to List
EndPredictionList = EndPredictionArray.tolist()

#Prepare Data and Insert to "output_datalist"
df_EndPredictionList = {'sbp': EndPredictionList}
df_EndPredictionList = pd.DataFrame(data = df_EndPredictionList)
EndPrediction_Prep_AP = df_EndPredictionList.values.tolist()
output_datalist = EndPrediction_Prep_AP
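The flattening step above can be expressed more compactly. The following sketch assumes a non-empty list of one-dimensional prediction arrays, with hypothetical values, and produces the same list-of-single-value rows without going through a DataFrame.

```python
import numpy as np

# Hypothetical per-patient prediction arrays of different lengths.
per_patient = [np.array([120.0, 125.0]), np.array([130.0])]

# np.concatenate accepts the whole list at once (the list must be non-empty).
flat = np.concatenate(per_patient)

# One single-element row per prediction, the shape csv.writer.writerow expects.
rows = [[value] for value in flat.tolist()]
print(rows)  # [[120.0], [125.0], [130.0]]
```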

##### Results (Weight & MAPE):
[Weight Listed: heartrate, resprate, o2sat, Intercept]

###### A. Split Dataset & No Grouping

In [105]:
print("Weight: ", end="")
print(str(WholeDatasetSplit_Weight[1]) + ' ' + str(WholeDatasetSplit_Weight[2]) + ' ' + str(WholeDatasetSplit_Weight[3])
+ ' ' + str(WholeDatasetSplit_Weight[0]))
print("MAPE: ", end="")
print(WholeDatasetSplit_Weight_MAPE)

Weight: 0.20299817823482913 0.9927774777295677 1.8304272565793411 -81.18195962769232
MAPE: 12.545819074834151


###### B. No Split Dataset & No Grouping

In [106]:
print("Weight: ", end="")
print(str(WholeDatasetNonSplit_Weight[1]) + ' ' + str(WholeDatasetNonSplit_Weight[2])
+ ' ' + str(WholeDatasetNonSplit_Weight[3]) + ' ' + str(WholeDatasetNonSplit_Weight[0]))
print("MAPE: ", end="")
print(WholeDatasetNonSplit_Weight_MAPE)

Weight: 0.1547189508352477 0.7782067354516141 1.5882972603004724 -50.365408827435395
MAPE: 12.522210154569816


###### C. Split Dataset & Grouping

In [107]:
for i in range(len(UniqueSubjectID_Arr)):
  print("Patient " + str(UniqueSubjectID_Arr[i]) + ":")
  print("Weight: ", end="")
  print(str(List_GroupedDatasetSplit_Weight[i][1]) + ' ' + str(List_GroupedDatasetSplit_Weight[i][2])
  + ' ' + str(List_GroupedDatasetSplit_Weight[i][3]) + ' ' + str(List_GroupedDatasetSplit_Weight[i][0]))
  print("MAPE: ", end="")
  print(str(List_GroupedDatasetSplit_Weight_MAPE[i]))
  print()

print("Mean MAPE (Average of Each Patient MAPE): ", end="")
print(np.mean(List_GroupedDatasetSplit_Weight_MAPE))

Patient 11526383:
Weight: 0.9130753044658042 0.28710912673258315 2.9920862167170177 -232.72662747225206
MAPE: 10.988219145059345

Patient 12923910:
Weight: 0.5043496741666356 -1.1409747084157384 1.0051706449471194 22.89485336231972
MAPE: 11.344716456263608

Patient 14699420:
Weight: 0.08114513371087989 1.6844117903478648 -2.063375467738666 282.9577541491813
MAPE: 16.247779494971777

Patient 15437705:
Weight: -0.5657186737691301 0.9107345652038181 0.8466483762066028 65.54213140429957
MAPE: 16.680516931400618

Patient 15642911:
Weight: 0.10495135156694753 0.2019216689322779 1.1500758430631648 1.4588107254517726
MAPE: 8.22996421019789

Patient 16298357:
Weight: 0.21722540153483416 0.9540479758498567 0.8356992836568541 15.15616690176373
MAPE: 10.012837122355355

Patient 17331999:
Weight: 0.14289271376357474 1.0528341016467744 0.5696913879039596 32.58063741724518
MAPE: 8.076234024957989

Patient 17593883:
Weight: 0.12503940329801871 1.0052259847557576 0.04160284888593238 93.01103462119761
M

###### D. No Split Dataset & Grouping

In [108]:
for i in range(len(UniqueSubjectID_Arr)):
  print("Patient " + str(UniqueSubjectID_Arr[i]) + ":")
  print("Weight: ", end="")
  print(str(List_GroupedDatasetNonSplit_Weight[i][1]) + ' ' + str(List_GroupedDatasetNonSplit_Weight[i][2])
  + ' ' + str(List_GroupedDatasetNonSplit_Weight[i][3]) + ' ' + str(List_GroupedDatasetNonSplit_Weight[i][0]))
  print("MAPE: ", end="")
  print(str(List_GroupedDatasetNonSplit_Weight_MAPE[i]))
  print()

print("Mean MAPE (Average of Each Patient MAPE): ", end="")
print(np.mean(List_GroupedDatasetNonSplit_Weight_MAPE))

Patient 11526383:
Weight: 0.8778633925521987 0.5898851254960009 2.4490078997448252 -181.0109331695322
MAPE: 11.134407049242462

Patient 12923910:
Weight: 0.4654533983964739 -0.23181409078869897 0.7528761346274664 34.421892590731815
MAPE: 12.31853191027431

Patient 14699420:
Weight: 0.04539456115976877 1.1660674806974873 -2.077989774961317 294.003069195536
MAPE: 13.942257818432491

Patient 15437705:
Weight: -0.6549379721957997 1.8097234334306265 1.662875823258355 -17.273378118218076
MAPE: 14.946975956086712

Patient 15642911:
Weight: 0.10248304526218255 0.435008580024659 1.043351271770905 8.338035300009679
MAPE: 8.894840318728766

Patient 16298357:
Weight: 0.18916040520549085 0.5411071323679504 0.7951027461480772 29.212708056524328
MAPE: 10.871442849300621

Patient 17331999:
Weight: 0.0899098802807443 0.6629070979681698 0.9951634679387651 2.949793132591772
MAPE: 8.324936582626744

Patient 17593883:
Weight: 0.15049354968725348 0.8267854463674151 0.04375508677632567 93.42810515815465
MAPE

### Output your Prediction

> your filename should be **hw1_advanced.csv**

In [109]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)
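To check that the file was written as intended, it can be read back with `csv.reader`. The sketch below writes two hypothetical prediction rows to a temporary file (the path and values are illustrative only) and reads them back. Note that `csv.reader` returns every field as a string.

```python
import csv
import os
import tempfile

rows = [[120.5], [131.0]]  # hypothetical predictions, one value per row

# Write to a throwaway location so no real output file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "example_output.csv")
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    csv.writer(csvfile).writerows(rows)

with open(path, newline="", encoding="utf-8") as csvfile:
    read_back = [row for row in csv.reader(csvfile)]

print(read_back)  # [['120.5'], ['131.0']]
```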

# Report *(5%)*

The report should be submitted as a PDF file **hw1_report.pdf**

*   Briefly describe the difficulties you encountered
*   Summarize your work and your reflections
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)