# COVID-19 Survey Linear Regression

Let's fit a multiple linear regression model to the COVID-19 survey data. The model accepts any number of input variables; here we use step count and stress level as inputs and sleep latency as the output. Once fitted, the model can predict sleep latency for any combination of step count and stress level.

In [115]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import pickle

# Choose which variables to include in each analysis
input_variables = ['Steps', 'Stress']

# Number of input features (model dimensionality)
n = len(input_variables)

# Load preprocessed data
filename = '../data/covid_data_preprocessed.csv'
df = pd.read_csv(filename)
df_latency = df[input_variables + ['Latency']]
df_sleeptime = df[input_variables + ['Sleeptime (h)']]
df_wakes = df[input_variables + ['Wakes']]

# Next fit the model to predict latency
# Choose input and output data
x1 = df_latency.drop('Latency', axis = 1)
y1 = df_latency['Latency']
# Split the data into train and test sets
test_size = 0.1 # For all splits
x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1, test_size = test_size)
# Fit the model with train data
model_latency = LinearRegression().fit(x1_train, y1_train)
# Make predictions with test data
pred_latency = model_latency.predict(x1_test)
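
With two inputs, the fitted model is a plane: latency ≈ b0 + b1 * Steps + b2 * Stress, where scikit-learn stores b0 in `intercept_` and the slopes in `coef_`. The following is a minimal sketch on synthetic data (the values, the seed, and the "true" coefficients are made up for illustration and do not come from the survey) showing how the coefficients can be read back and how a prediction follows from them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the survey data (assumed values, not real measurements)
rng = np.random.default_rng(0)
steps = rng.uniform(0, 10000, size=200)
stress = rng.integers(1, 6, size=200).astype(float)
X = np.column_stack([steps, stress])

# Assumed "true" relationship used only to generate the synthetic target
b0, b1, b2 = 20.0, -0.001, 2.0
latency = b0 + b1 * steps + b2 * stress + rng.normal(0, 0.5, size=200)

model = LinearRegression().fit(X, latency)
print("Intercept:", model.intercept_)
print("Coefficients (Steps, Stress):", model.coef_)

# A prediction is the intercept plus the weighted sum of the inputs
x_new = np.array([2000.0, 3.0])
manual = model.intercept_ + model.coef_ @ x_new
sklearn_pred = model.predict(x_new.reshape(1, -1))[0]
print("Manual:", manual, "predict():", sklearn_pred)
```

Inspecting `coef_` this way also shows the sign and size of each input's effect, which is useful when R^2 is low.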

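We need to measure the accuracy of the model. The function below averages the per-sample deviation sqrt((true - pred)^2) = |true - pred| over the entire test set. Despite its name, it therefore reports the mean absolute error rather than a conventional root-mean-square error: it tells us how far the prediction lies from the true value on average.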

In [116]:
def avg_root_mean_square(true, pred):
    """Average of the per-sample deviations sqrt((true - pred)**2).

    Since sqrt(d**2) == |d|, this equals the mean absolute error.
    """
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    assert len(true) == len(pred)
    return np.mean(np.abs(true - pred))

rms_deviation_latency = avg_root_mean_square(y1_test, pred_latency)
print("Latency model RMS deviation (mins):", rms_deviation_latency)
# Coefficient of determination (R^2); note that it is evaluated on the training data here
print("Latency model R^2:", model_latency.score(x1_train, y1_train))

Latency model RMS deviation (mins): 9.300615469331843
Latency model R^2: 0.09262626578260291
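
The metric above takes the square root of each squared error before averaging, which makes it a mean absolute error (MAE). A conventional root-mean-square error (RMSE) averages the squared errors first and takes the square root last, so it penalizes large misses more heavily. The sketch below contrasts the two on a tiny made-up example; the function names `mean_absolute_error` and `root_mean_square_error` are introduced here for illustration only.

```python
import numpy as np

def mean_absolute_error(true, pred):
    """Average of |true - pred| (what the notebook's metric computes)."""
    true, pred = np.asarray(true, dtype=float), np.asarray(pred, dtype=float)
    return np.mean(np.abs(true - pred))

def root_mean_square_error(true, pred):
    """Square root of the average squared error (conventional RMSE)."""
    true, pred = np.asarray(true, dtype=float), np.asarray(pred, dtype=float)
    return np.sqrt(np.mean((true - pred) ** 2))

# Made-up values: three perfect predictions and one miss of 4
true = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 2.0, 3.0, 8.0]

mae = mean_absolute_error(true, pred)
rmse = root_mean_square_error(true, pred)
print("MAE: ", mae)   # 1.0
print("RMSE:", rmse)  # 2.0
```

The two values agree only when every error has the same magnitude; otherwise RMSE is the larger of the two.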


Fit similar models for sleep time and number of wakes.

In [117]:
# Same for the sleeptime model
x2 = df_sleeptime.drop('Sleeptime (h)', axis = 1)
y2 = df_sleeptime['Sleeptime (h)']
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size = test_size)
model_sleeptime = LinearRegression().fit(x2_train, y2_train)
pred_sleeptime = model_sleeptime.predict(x2_test)
rms_deviation_sleeptime = avg_root_mean_square(y2_test, pred_sleeptime)
print("Sleeptime model RMS deviation (hrs):", rms_deviation_sleeptime)
print("Sleeptime model R^2:", model_sleeptime.score(x2_train, y2_train))

# Same for the wakes model
x3 = df_wakes.drop('Wakes', axis = 1)
y3 = df_wakes['Wakes']
x3_train, x3_test, y3_train, y3_test = train_test_split(x3, y3, test_size = test_size)
model_wakes = LinearRegression().fit(x3_train, y3_train)
pred_wakes = model_wakes.predict(x3_test)
rms_deviation_wakes = avg_root_mean_square(y3_test, pred_wakes)
print("Wakes model RMS deviation (# of wakes):", rms_deviation_wakes)
print("Wakes model R^2:", model_wakes.score(x3_train, y3_train))

Sleeptime model RMS deviation (hrs): 0.7394676009833971
Sleeptime model R^2: 0.0055423306690909335
Wakes model RMS deviation (# of wakes): 0.8388761653119496
Wakes model R^2: 0.03779460983166982
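
The three models repeat the same split, fit, and score steps, and their R^2 is computed on the training set. One possible way to reduce the repetition is a small helper. The sketch below is an illustration under stated assumptions, not the notebook's code: `fit_and_evaluate` is a hypothetical name, the data are synthetic, and `random_state` is fixed only so that the split is reproducible. It reports R^2 on the held-out test set alongside the training value.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_and_evaluate(df, inputs, target, test_size=0.1, random_state=0):
    """Split df, fit a linear model, and return it with a few metrics."""
    x_train, x_test, y_train, y_test = train_test_split(
        df[inputs], df[target], test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(x_train, y_train)
    pred = model.predict(x_test)
    metrics = {
        "mae": float(np.mean(np.abs(y_test.to_numpy() - pred))),
        "r2_train": model.score(x_train, y_train),
        "r2_test": model.score(x_test, y_test),
    }
    return model, metrics

# Synthetic stand-in for the preprocessed survey data (assumed values)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Steps": rng.uniform(0, 10000, 300),
    "Stress": rng.integers(1, 6, 300),
})
demo["Latency"] = 20 - 0.001 * demo["Steps"] + 2 * demo["Stress"] + rng.normal(0, 1, 300)

model, metrics = fit_and_evaluate(demo, ["Steps", "Stress"], "Latency")
print(metrics)
```

A test-set R^2 far below the training value would suggest overfitting; with only two inputs and a linear model, the two are usually close.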


Make a few predictions to try out the models.

In [118]:
# Define a helper function for making single predictions
def predict(x, model, n = 1):
    # Reshape the input into rows of n features and predict the first row
    x_tmp = np.array(x).reshape(-1, n)
    y_pred = model.predict(x_tmp)[0]
    # Clamp negative predictions to zero, since none of the targets can be negative
    return max(y_pred, 0)

# Predict sleep time and quality from steps and stress
steps = 2000
stress_level = 3

x = [steps, stress_level]
pred_latency = predict(x, model_latency, n)
pred_sleeptime = predict(x, model_sleeptime, n)
pred_wakes = predict(x, model_wakes, n)

print('Predicted sleep time in hours: ', round(pred_sleeptime, 2))
print('Predicted sleep latency in minutes: ', round(pred_latency, 2))
print('Predicted number of wakes: ', int(round(pred_wakes, 0)))

Predicted sleep time in hours:  7.71
Predicted sleep latency in minutes:  23.35
Predicted number of wakes:  1




Save the fitted models with pickle.

In [119]:
with open('../data/models/covid_model_latency.pkl', 'wb') as f:
    pickle.dump(model_latency, f)

with open('../data/models/covid_model_sleeptime.pkl', 'wb') as f:
    pickle.dump(model_sleeptime, f)

with open('../data/models/covid_model_wakes.pkl', 'wb') as f:
    pickle.dump(model_wakes, f)
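
A saved model is only useful if it can be loaded again. The sketch below shows the round trip with `pickle.load`, assuming a small model fitted on made-up numbers and a temporary directory instead of the notebook's `../data/models/` paths; the file name `demo_model.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up data set, used only to have a fitted model to save
X = np.array([[1000.0, 1.0], [2000.0, 3.0], [5000.0, 2.0], [8000.0, 4.0]])
y = np.array([25.0, 23.0, 18.0, 15.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    # Save the model in binary mode
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load it back in binary mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code, and a pickled scikit-learn model is not guaranteed to load correctly under a different scikit-learn version.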