## Project Stage IV (Basic Machine Learning), deadline: 04/28/2023

## Goals

The goal of Stage IV is to utilize machine learning and statistical models to predict the trend of COVID-19 cases and deaths.


### Tasks for Stage IV:

#### Task 1: (70 pts)
- Team: (30)
    - Develop Linear and Non-Linear (polynomial) regression models for predicting cases and deaths in the US.
        - Start your data from the first day of infections in the US. X-Axis - number of days since the first case, Y-Axis - number of new cases and deaths.
        - Calculate and report Root Mean Square Error (RMSE) for your models (linear and non-linear). Discuss the bias versus variance tradeoff.
        - Plot the trend line for the data along with a forecast of 1 week ahead.
        - Describe the trends as compared to other countries. 
- Member: (40 pts)
    - Utilize Linear and Non-Linear (polynomial) regression models to compare trends for a single state (each member should choose a different state) and its counties (top 5 with the highest number of cases). Start your data from the first day of infections.
        - X-Axis - number of days since the first case, Y-Axis - number of new cases and deaths. Calculate error using RMSE.
        - Identify which counties are most at risk. Model for top 5 counties with cases within a state and describe their trends.
        - Utilize the hospital data to calculate the point of no return for a state. Use percentage occupancy / utilization to see which states are close and what their trend looks like.
        - Perform hypothesis tests on questions identified in Stage II.
            - e.g., *Does higher employment (overall employment numbers) lead to higher COVID case numbers or a more rapid increase in COVID cases?* Here you would compare the COVID cases to the state- or county-level enrichment data to test your null hypothesis. In this case, use a two-tailed two-sample t-test to see whether there is a difference, and then a one-tailed two-sample t-test to show whether it is higher or lower.
        - Depending on your type of data, you can also perform a Chi-square test for categorical hypothesis testing.
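
The two-step t-test and the Chi-square test described above can be sketched with `scipy.stats`. This is only an illustration under stated assumptions: the two samples and the contingency table are synthetic, hypothetical stand-ins for real case and enrichment data, and Welch's unequal-variance variant of the t-test is an assumed choice rather than a requirement of the assignment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical samples: cases per 100k in counties with high vs. low employment.
# These numbers are synthetic and only illustrate the mechanics of the tests.
high_emp = rng.normal(loc=120.0, scale=15.0, size=40)
low_emp = rng.normal(loc=100.0, scale=15.0, size=40)

# Step 1: two-tailed two-sample t-test (H0: the two means are equal).
t_two, p_two = stats.ttest_ind(high_emp, low_emp, equal_var=False)

# Step 2: one-tailed test (H1: the high-employment mean is greater).
t_one, p_one = stats.ttest_ind(
    high_emp, low_emp, equal_var=False, alternative="greater"
)

print(f"two-tailed: t={t_two:.3f}, p={p_two:.3g}")
print(f"one-tailed: t={t_one:.3f}, p={p_one:.3g}")

# Chi-square test of independence on a hypothetical 2x2 table:
# rows = employment level (high, low), columns = case level (high, low).
table = np.array([[30, 10],
                  [15, 25]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
print(f"chi-square: chi2={chi2:.3f}, p={p_chi:.3g}, dof={dof}")
```

A small p-value lets you reject the null hypothesis at the chosen significance level; a large one means you fail to reject it, which is not the same as proving it.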

    
#### Task 2: (30 pts)
- Member:
    - For each of the aforementioned analyses, plot graphs showing:
        - trend line
        - confidence intervals (error in prediction)
        - prediction path (forecast)
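
As a rough sketch of how the trend line, error band, and forecast fit together, the example below fits a polynomial trend with NumPy and builds an approximate 95% band from the residual standard deviation. The data are synthetic, the names (`days`, `new_cases`, `degree`) are hypothetical, and the band ignores uncertainty in the fitted coefficients, so it is a simplification rather than a full prediction interval.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic daily counts standing in for real data (hypothetical values).
days = np.arange(100)
new_cases = 50 + 3.0 * days + rng.normal(0, 20, size=days.size)

degree = 1  # 1 = linear trend; raise it for a polynomial trend
coeffs = np.polyfit(days, new_cases, deg=degree)
trend = np.polyval(coeffs, days)  # trend line over the observed days

# Residual standard deviation, using n - (degree + 1) degrees of freedom.
residuals = new_cases - trend
dof = days.size - (degree + 1)
sigma = np.sqrt(np.sum(residuals ** 2) / dof)

# Prediction path: the 7 days after the last observed day.
future_days = np.arange(days.size, days.size + 7)
forecast = np.polyval(coeffs, future_days)

# Approximate 95% band, assuming roughly normal residuals.
lower = forecast - 1.96 * sigma
upper = forecast + 1.96 * sigma

for d, f, lo, hi in zip(future_days, forecast, lower, upper):
    print(f"day {d}: forecast {f:.1f} (approx. 95% band {lo:.1f} to {hi:.1f})")
```

In a notebook, `plt.plot(days, trend)` draws the trend line, `plt.plot(future_days, forecast)` the prediction path, and `plt.fill_between(future_days, lower, upper, alpha=0.3)` shades the band.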

**Deliverable**
- Each member creates separate notebooks for member tasks. Upload all notebooks and reports to Canvas. To avoid potential plagiarism, do not push your work to GitHub before the submission deadline.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
#from mlxtend.evaluate import bias_variance_decomp
from math import sqrt
from sklearn.svm import SVR
import warnings
warnings.filterwarnings('ignore')

In [None]:
confirmed_cases = pd.read_csv('data/covid_confirmed_usafacts.csv')
confirmed_cases

In [None]:
confirmed_deaths = pd.read_csv('data/covid_deaths_usafacts.csv')
confirmed_deaths

In [None]:
def question1(confirmed_cases):
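    # Sum the date columns over all counties to get the US total per day.
    # The first four columns are identifiers, so they are skipped.
    group_cases_USA = confirmed_cases.iloc[:, 4:].sum()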

    # X is the day index (0, 1, 2, ...), one per date column; Y is the total for that day.
    X = np.arange(len(group_cases_USA)).reshape(-1, 1)
    Y = group_cases_USA.values.astype('int')

    X_train_cases, X_test_cases, y_train_cases, y_test_cases = train_test_split(X, Y, test_size=.3)

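    # Note: LogisticRegression is a classifier, so it treats every distinct
    # count in y as a separate class label rather than as a numeric target.
    lrg = LogisticRegression()
    lrl = LinearRegression()
    lrg.fit(X_train_cases,y_train_cases)
    lrl.fit(X_train_cases,y_train_cases)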
    
    pr = LinearRegression()
    poly = PolynomialFeatures(degree=3)
    X_poly_train = poly.fit_transform(X_train_cases)
    X_poly_test = poly.transform(X_test_cases)
    pr.fit(X_poly_train, y_train_cases)

    y_pred_lrl = lrl.predict(X_test_cases)
    y_pred_lrg = lrg.predict(X_test_cases)
    y_pred_pr = pr.predict(X_poly_test)

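    # RMSE on the held-out test set. MSE is never negative, so no abs() is needed,
    # and predictions are compared as-is rather than after taking their absolute value.
    rmse_lrl = sqrt(mean_squared_error(y_test_cases, y_pred_lrl))
    rmse_lrg = sqrt(mean_squared_error(y_test_cases, y_pred_lrg))
    rmse_pr = sqrt(mean_squared_error(y_test_cases, y_pred_pr))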

    print(f'Logistic regression Root Mean Square Error (RMSE): {round(rmse_lrg,2)}')
    print(f'Linear regression Root Mean Square Error (RMSE): {round(rmse_lrl,2)}')
    print(f'Polynomial regression Root Mean Square Error (RMSE): {round(rmse_pr,2)}')
    
    diff_pred_act = y_test_cases - y_pred_lrg
    sum_diff = sum(diff_pred_act)
    bias = sum_diff/len(y_test_cases)
    print(f'Bias logistic regression model: {round(bias,2)}')
    
    pred_var = y_pred_lrg.var()
    print(f'Variance logistic regression model: {round(pred_var)}')
    
    plt.figure(figsize=(20,15))
    plt.title('Cases along with trend line')
    plt.plot(X,
        lrg.predict(X),
        color='b',
        label = 'Prediction')
    
    # Predict on all of X (not just the test split) so the curve has one value per plotted day.
    plt.plot(X,
             pr.predict(poly.transform(X)),
             color='lime',
             label='polynomial')

    plt.scatter(X,Y,facecolor="none",
        edgecolor='m',
        s=10,
           label = 'Actual')
    plt.legend(
    loc="upper center",
    bbox_to_anchor=(0.5, 1.1),
    ncol=1,
    fancybox=True,
    shadow=True,
    )
    plt.show()
    
    
    
    # Day indices for the 7 days following the last observed day.
    next_week = np.arange(len(X), len(X) + 7).reshape(-1, 1)

    y_next_week = lrg.predict(next_week)
    X_next = np.append(X, next_week)
    X_next = X_next.reshape(-1,1)
    
    plt.figure(figsize=(20,15))
    plt.title('Cases along with trend line and forecast of 1 week ahead')
    plt.plot(X_next,
            lrg.predict(X_next),
            color='b',
            label = 'Prediction')

    plt.scatter(X,Y,facecolor="none",
            edgecolor='m',
            s=10,
               label = 'Actual')
    
    plt.legend(
        loc="upper center",
        bbox_to_anchor=(0.5, 1.1),
        ncol=1,
        fancybox=True,
        shadow=True,
    )
    plt.show()
    
    

In [None]:
question1(confirmed_cases)

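The RMSE of both the logistic regression and linear regression models decreases as the size of the training set increases, and it decreases the most for the logistic regression model.

The calculated bias is very low compared with the high variance. Low bias combined with high variance usually points to overfitting rather than underfitting: the model follows the training points closely but does not generalize well. Note, however, that the variance printed above is the variance of the predicted values themselves, not the variance term of a bias-variance decomposition, so part of its size simply reflects the wide range of the counts.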
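
One way to examine the bias versus variance tradeoff more directly is to compare training and test RMSE as model complexity grows. The sketch below does this with NumPy polynomial fits on synthetic data; the curve, the noise level, the split, and the degrees tried are all hypothetical choices for illustration, not results from the COVID-19 dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: a smooth curve plus noise (hypothetical, for illustration only).
x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)

# Random 70/30 split into training and test points.
idx = rng.permutation(x.size)
train_idx, test_idx = idx[:42], idx[42:]


def rmse(y_true, y_pred):
    """Root mean square error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    train_rmse = rmse(y[train_idx], np.polyval(coeffs, x[train_idx]))
    test_rmse = rmse(y[test_idx], np.polyval(coeffs, x[test_idx]))
    results[degree] = (train_rmse, test_rmse)
    print(f"degree {degree}: train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}")
```

Training RMSE can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. A model that underfits (high bias) shows high error on both splits, while a model that overfits (high variance) shows low training error with a noticeably higher test error.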


### Predicting Deaths in USA

In [None]:
question1(confirmed_deaths)