# MACHINELEARNING

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import statsmodels.formula.api as sm
import numpy as np

In [2]:
def multiline(dataframe, exemptions):
    """Plot every column of `dataframe` against its index, skipping the `exemptions` column."""
    for col in dataframe.columns:
        if col != exemptions:
            plt.plot(dataframe.index, dataframe[col], label=col)
    # Place one x tick every 24 rows (every two years for monthly data).
    plt.xticks(np.arange(0, len(dataframe), 24))
    plt.xticks(rotation=45)
    plt.legend()
    plt.show()

## Terms to Know
- **R-squared**: The R-squared (R²) value represents how much of the variance in the dependent variable can be explained by the model. The higher the value, the better the model fits.
- **Adjusted R-squared**: The adjusted R² value corrects a weakness of the raw R² score, which never decreases when a variable is added. Adjusted R² accounts for the number of variables in the model and increases only if a new variable aids prediction to a degree greater than what would occur simply by chance. This makes it a more reliable way to judge the fit of a model.
    - **Best**: above 0.75
    - **Decent**: above 0.4
- **RMSE**: RMSE stands for root-mean-square error. In statistical terms, RMSE is the standard deviation of the residuals, where a residual measures the distance between a data point and the regression line. RMSE tells you how spread out the data points are around the regression line, that is, how well the model fits the data. Lower RMSE values mean a better fit because the data lie *closer* to the model.
    - **Best**: below 0.2
    - **Decent**: below 0.5
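
The definitions above can be made concrete with a short sketch. It assumes NumPy arrays of actual and predicted values; the function names `r_squared` and `rmse` and the toy numbers are illustrative only and are not part of this notebook's analysis.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up 0/1 targets and predictions, purely for illustration.
toy_true = [0, 0, 1, 1]
toy_pred = [0.1, 0.2, 0.8, 0.9]
print(r_squared(toy_true, toy_pred))
print(rmse(toy_true, toy_pred))
```

Predictions closer to the actual values push R-squared toward 1 and RMSE toward 0.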

## Contents
1. [DUMMY_VARIABLES](#1.-DUMMY_VARIABLES)

2. [LINEAR_REGRESSION](#2.-LINEAR_REGRESSION)
    
    2.1 [COVID](#2.1-COVID)
    
    2.2 [H1N1](#3.-H1N1)

    2.3 [Both](#4.-Both)

3. [DISCUSSION](#3.-Discussion)

## 1. DUMMY_VARIABLES

"A dummy variable is a variable that takes values of 0 and 1, where the values indicate the presence or absence of something" (https://www.displayr.com/what-are-dummy-variables/). Dummy variables are very useful in statistics because they simplify the approach to answering a question: one only has to estimate how likely a 0 or a 1 is at a given instance, based on the surrounding variables. In our case, we use three dummy variables: one marking when there are COVID deaths, one marking when there are H1N1 deaths, and one marking when either of them has deaths.

In [3]:
df = pd.read_csv("for_ml.csv")

In [4]:
df.head()

,date,us_pop,alcohol_deaths,alcohol_sales,drug_deaths,homicides,perc_unemp,suicide_deaths,h1n1_deaths,covid_deaths,gdp_period,yearly_infl
0,2000-01-31,280730000.0,0.665408,1.511417,0.631568,0.0,0.0,0.923307,0.0,0.0,3.562918,0.001203
1,2000-02-29,280940000.0,0.58411,1.762654,0.590162,0.0,0.0,0.832206,0.0,0.0,3.560255,0.001202
2,2000-03-31,281160000.0,0.578318,2.136862,0.609973,0.0,0.0,0.896287,0.0,0.0,3.557469,0.001201
3,2000-04-30,281420000.0,0.55291,1.902139,0.627176,0.0,0.0,0.882311,0.0,0.0,3.641433,0.0012
4,2000-05-31,281640000.0,0.56384,2.284832,0.64657,0.0,0.0,0.930976,0.0,0.0,3.638588,0.001199


In [5]:
print(df.shape)
df.head()

(252, 12)


,date,us_pop,alcohol_deaths,alcohol_sales,drug_deaths,homicides,perc_unemp,suicide_deaths,h1n1_deaths,covid_deaths,gdp_period,yearly_infl
0,2000-01-31,280730000.0,0.665408,1.511417,0.631568,0.0,0.0,0.923307,0.0,0.0,3.562918,0.001203
1,2000-02-29,280940000.0,0.58411,1.762654,0.590162,0.0,0.0,0.832206,0.0,0.0,3.560255,0.001202
2,2000-03-31,281160000.0,0.578318,2.136862,0.609973,0.0,0.0,0.896287,0.0,0.0,3.557469,0.001201
3,2000-04-30,281420000.0,0.55291,1.902139,0.627176,0.0,0.0,0.882311,0.0,0.0,3.641433,0.0012
4,2000-05-31,281640000.0,0.56384,2.284832,0.64657,0.0,0.0,0.930976,0.0,0.0,3.638588,0.001199


In [6]:
df.set_index('date',inplace=True)
# The inplace parameter modifies the original DataFrame instead of returning a new one, so we don't have to
# assign the result of the command back to df.
df.head()

date,us_pop,alcohol_deaths,alcohol_sales,drug_deaths,homicides,perc_unemp,suicide_deaths,h1n1_deaths,covid_deaths,gdp_period,yearly_infl
2000-01-31,280730000.0,0.665408,1.511417,0.631568,0.0,0.0,0.923307,0.0,0.0,3.562918,0.001203
2000-02-29,280940000.0,0.58411,1.762654,0.590162,0.0,0.0,0.832206,0.0,0.0,3.560255,0.001202
2000-03-31,281160000.0,0.578318,2.136862,0.609973,0.0,0.0,0.896287,0.0,0.0,3.557469,0.001201
2000-04-30,281420000.0,0.55291,1.902139,0.627176,0.0,0.0,0.882311,0.0,0.0,3.641433,0.0012
2000-05-31,281640000.0,0.56384,2.284832,0.64657,0.0,0.0,0.930976,0.0,0.0,3.638588,0.001199
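
The effect of `inplace=True` described in the cell above can be seen on a tiny stand-in DataFrame. This is only a sketch: `demo` and its values are made up and unrelated to `for_ml.csv`.

```python
import pandas as pd

# Hypothetical two-row frame used only to illustrate set_index.
demo = pd.DataFrame({"date": ["2000-01-31", "2000-02-29"], "value": [1.0, 2.0]})

# Without inplace: a new DataFrame is returned and `demo` keeps its old index.
indexed = demo.set_index("date")
print(demo.index.name, indexed.index.name)

# With inplace=True: `demo` itself is modified and the call returns None.
result = demo.set_index("date", inplace=True)
print(result, demo.index.name)
```

Either style works; the non-inplace form simply requires assigning the result to a name.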


In [7]:
# These commands create our dummy variables. Each line adds a new column to the dataframe under a unique name.
# The comparison is True wherever the value (h1n1, covid, or either) is NOT equal to zero, and multiplying by 1
# turns True/False into 1/0 in the respective dummy column.

df['h1n1_dum'] = 1 *(df['h1n1_deaths'] != 0)
df['covid_dum'] = 1 *(df['covid_deaths'] != 0)
df['pandemic'] = 1* ((df['h1n1_deaths'] != 0) | (df['covid_deaths'] != 0)  )
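
As a small illustration of the pattern used in this cell, the sketch below applies it to a made-up Series and compares it with an equivalent, more explicit spelling. The name `deaths` and its values are hypothetical.

```python
import pandas as pd

# Toy series standing in for a deaths column; the values are invented.
deaths = pd.Series([0.0, 0.0, 1.3, 19.0, 0.0])

dummy_a = 1 * (deaths != 0)         # style used in this notebook
dummy_b = deaths.ne(0).astype(int)  # same result, spelled out with ne() and astype()

print(dummy_a.tolist())
print(dummy_b.tolist())
```

Both forms mark the non-zero entries with 1 and everything else with 0.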

In [8]:
display(df.shape)
df.head()

(252, 14)

date,us_pop,alcohol_deaths,alcohol_sales,drug_deaths,homicides,perc_unemp,suicide_deaths,h1n1_deaths,covid_deaths,gdp_period,yearly_infl,h1n1_dum,covid_dum,pandemic
2000-01-31,280730000.0,0.665408,1.511417,0.631568,0.0,0.0,0.923307,0.0,0.0,3.562918,0.001203,0,0,0
2000-02-29,280940000.0,0.58411,1.762654,0.590162,0.0,0.0,0.832206,0.0,0.0,3.560255,0.001202,0,0,0
2000-03-31,281160000.0,0.578318,2.136862,0.609973,0.0,0.0,0.896287,0.0,0.0,3.557469,0.001201,0,0,0
2000-04-30,281420000.0,0.55291,1.902139,0.627176,0.0,0.0,0.882311,0.0,0.0,3.641433,0.0012,0,0,0
2000-05-31,281640000.0,0.56384,2.284832,0.64657,0.0,0.0,0.930976,0.0,0.0,3.638588,0.001199,0,0,0


In [9]:
# Check that the command above worked correctly. It did!
df.loc[df['covid_dum']==1]
# This command returns every row where the queried column equals 1. You can use the same pattern to find
# all the rows matching any value in any column.

date,us_pop,alcohol_deaths,alcohol_sales,drug_deaths,homicides,perc_unemp,suicide_deaths,h1n1_deaths,covid_deaths,gdp_period,yearly_infl,h1n1_dum,covid_dum,pandemic
2020-02-29,329240000.0,0.983477,3.422124,2.032256,0.443749,3.5,1.109221,0.0,0.000304,6.524531,0.000375,0,1,1
2020-03-31,329340000.0,1.080646,4.062974,2.294286,0.511933,4.4,1.199976,0.0,1.306856,6.52255,0.000375,0,1,1
2020-04-30,331450000.0,1.156132,3.690451,2.493287,0.525569,14.7,1.04782,0.0,19.04963,5.876435,0.000372,0,1,1
2020-05-31,331420000.0,1.221713,4.223342,2.94068,0.62187,13.2,1.133305,0.0,31.494478,5.876967,0.000372,0,1,1
2020-06-30,331450000.0,1.27259,4.861065,2.562377,0.677327,11.0,1.19475,0.0,38.455876,5.876435,0.000372,0,1,1
2020-07-31,331500000.0,1.360181,4.726998,2.662142,0.726395,10.2,1.257315,0.0,46.413876,6.376644,0.000372,0,1,1
2020-08-31,331560000.0,1.315599,4.608819,2.591688,0.701532,8.4,1.218482,0.0,55.335987,6.37549,0.000372,0,1,1
2020-09-30,331630000.0,1.274312,4.675693,2.347797,0.651931,7.9,1.178422,0.0,62.374333,6.374144,0.000372,0,1,1
2020-10-31,331700000.0,1.314139,4.880314,2.325294,0.707869,6.9,1.139885,0.0,69.494121,6.475007,0.000372,0,1,1
2020-11-30,331750000.0,1.334439,4.5052,2.305652,0.669781,6.7,1.116503,0.0,80.793067,6.474031,0.000372,0,1,1


# 2. LINEAR_REGRESSION
<hr>
Linear regression is a statistical approach that models the relationship between one or more independent variables and a single dependent variable. We will use each of the 3 dummy variables we created as the dependent variable in turn (as we are trying to show how much a pandemic "depends" on our selected factors). The linear regression will help us determine how strongly each independent variable is associated with each dependent variable.
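
To make the idea of "fitting a line" concrete, here is a minimal sketch of ordinary least squares on toy data using NumPy. The data are generated from a known line, so the recovered intercept and slope can be checked by eye; none of these names or numbers come from the notebook's dataset.

```python
import numpy as np

# Toy data generated from y = 1 + 2*x with no noise, purely for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
y_line = 1.0 + 2.0 * x

# Add a column of ones so that the first fitted weight is the intercept.
design = np.column_stack([np.ones_like(x), x])

# Least squares picks the weights that minimise the sum of squared residuals.
weights, *_ = np.linalg.lstsq(design, y_line, rcond=None)
fit_intercept, fit_slope = weights
print(fit_intercept, fit_slope)
```

With several independent variables the design matrix simply gains one column per variable, and each fitted weight plays the role of a coefficient.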

## 2.1 COVID

In [10]:
# here we set our variables. "X" = independent, "y" = dependent. 
#X=df[['alcohol_deaths','alcohol_sales','drug_deaths','homicides','suicide_deaths','perc_unemp','gdp_period','yearly_infl']]
X=df[['alcohol_deaths','alcohol_sales','drug_deaths','homicides','suicide_deaths','perc_unemp']]


y=df[['covid_dum']]

The command below splits our data into a training set and a test set. The training set is used to train the model, basically teaching the model the best weight to assign to each variable. The test set is used to evaluate the trained model on data it has not seen. The testing of the model is demonstrated below.

As for my split, I put 80% of the data in the training set and 20% of the data in the test set. This is set with the "test_size" parameter of the "train_test_split" function, which gives the fraction of the data held out for testing. I found this split to be effective at building a good model; however, feel free to play around with different sizes to see how it affects your model.
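
As a quick check on what `test_size` controls, the sketch below splits stand-in arrays with the same number of rows as `df` (252). The arrays `X_demo` and `y_demo` are placeholders with arbitrary values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 252 rows; the values themselves are arbitrary.
X_demo = np.arange(252 * 2).reshape(252, 2)
y_demo = np.arange(252)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# test_size is the share held out for testing; the remainder is used for training.
print(len(X_tr), len(X_te))
```

Roughly one fifth of the rows land in the test set, and the rest are used for training.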

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [12]:
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

LinearRegression()

### Model

In [13]:
# The model was already trained by fit() above; this loop prints the weight (coefficient) learned for each variable.
# A positive coefficient means that as the independent variable increases, the dependent variable also tends to increase.
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))

The coefficient for alcohol_deaths is 0.7934712813658187
The coefficient for alcohol_sales is -0.008359996820817608
The coefficient for drug_deaths is 0.15812528944120274
The coefficient for homicides is -0.13470939302663698
The coefficient for suicide_deaths is -0.832655245881351
The coefficient for perc_unemp is 0.01456194530873043


In [14]:
intercept = regression_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))
# intercept_ holds the y-intercept of the fitted model (the predicted value of y when every x variable is 0)

The intercept for our model is 0.1042107130109392


In [15]:
print("y = ",intercept, "+")
for idx, col_name in enumerate(X_train.columns):
    print("{}({}) + ".format(col_name, round(regression_model.coef_[0][idx],2)))
#Our regression line for this model is: 

y =  0.1042107130109392 +
alcohol_deaths(0.79) + 
alcohol_sales(-0.01) + 
drug_deaths(0.16) + 
homicides(-0.13) + 
suicide_deaths(-0.83) + 
perc_unemp(0.01) + 


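In THIS model (COVID) the largest positive coefficient is for alcohol_deaths (0.79). This means that, holding the other variables fixed, higher alcohol deaths are associated with a higher predicted value of covid_dum. drug_deaths (0.16) and perc_unemp (0.01) also have positive coefficients. The remaining variables (alcohol_sales, homicides, and suicide_deaths) have negative coefficients, meaning that higher values of those variables are associated with a lower predicted value of covid_dum; suicide_deaths (-0.83) is the largest in magnitude.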
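
The regression line can also be evaluated by hand, which shows how the coefficients combine into a prediction. The sketch below copies the intercept and coefficients printed above and uses the 2005-08-31 row from this notebook's `X_test.head()` output; the dictionary names are illustrative.

```python
# Intercept and coefficients as printed by the cells above (COVID model).
covid_intercept = 0.1042107130109392
covid_coefs = {
    "alcohol_deaths": 0.7934712813658187,
    "alcohol_sales": -0.008359996820817608,
    "drug_deaths": 0.15812528944120274,
    "homicides": -0.13470939302663698,
    "suicide_deaths": -0.832655245881351,
    "perc_unemp": 0.01456194530873043,
}

# Feature values for 2005-08-31, copied from the X_test.head() output.
row_2005_08 = {
    "alcohol_deaths": 0.584914,
    "alcohol_sales": 2.872502,
    "drug_deaths": 0.908476,
    "homicides": 0.0,
    "suicide_deaths": 0.930453,
    "perc_unemp": 4.9,
}

# prediction = intercept + sum of (coefficient * feature value)
manual_prediction = covid_intercept + sum(
    covid_coefs[name] * row_2005_08[name] for name in covid_coefs
)
print(round(manual_prediction, 3))
```

The result is close to 0, which fits a month well before COVID. Because this is an ordinary linear regression on a 0/1 target, predictions are not restricted to the range 0 to 1.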

### R squared

In [16]:
regression_model.score(X_test, y_test)

0.6259613884585976

In [17]:
#lm = sm.ols(formula='covid_dum ~ alcohol_deaths + alcohol_sales + drug_deaths + homicides + suicide_deaths + perc_unemp + gdp_period + yearly_infl', data=df)
lm = sm.ols(formula='covid_dum ~ alcohol_deaths + alcohol_sales + drug_deaths + homicides + suicide_deaths + perc_unemp', data=df)
fit = lm.fit()
fit.summary()

0,1,2,3
Dep. Variable:,covid_dum,R-squared:,0.576
Model:,OLS,Adj. R-squared:,0.565
Method:,Least Squares,F-statistic:,55.38
Date:,"Wed, 20 Jul 2022",Prob (F-statistic):,6.49e-43
Time:,16:10:23,Log-Likelihood:,150.62
No. Observations:,252,AIC:,-287.2
Df Residuals:,245,BIC:,-262.5
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.1632,0.107,1.526,0.128,-0.048,0.374
alcohol_deaths,0.8868,0.154,5.741,0.000,0.583,1.191
alcohol_sales,-0.0117,0.023,-0.507,0.612,-0.057,0.034
drug_deaths,0.1927,0.067,2.897,0.004,0.062,0.324
homicides,-0.1504,0.064,-2.364,0.019,-0.276,-0.025
suicide_deaths,-0.9793,0.116,-8.426,0.000,-1.208,-0.750
perc_unemp,0.0149,0.004,4.188,0.000,0.008,0.022

0,1,2,3
Omnibus:,89.129,Durbin-Watson:,0.447
Prob(Omnibus):,0.0,Jarque-Bera (JB):,436.488
Skew:,1.34,Prob(JB):,1.65e-95
Kurtosis:,8.864,Cond. No.,152.0


Our adjusted R-squared for this model is 0.565. This value represents the proportion of variation in the dependent variable that can be explained by the independent variable(s), adjusted for the number of variables. In other words, this is how well the different factors can be used to predict when there will be COVID deaths. Note that this summary is fitted on all 252 observations, whereas the 0.626 score above was computed on the test set only, which is why the two R-squared values differ. Although the optimal range for adjusted R-squared is above 0.75, the value we produced still tells us that our model is relatively effective at predicting the dependent variable (covid_dum).
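
The adjusted value can be reproduced from the other numbers in the summary using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The helper name `adjusted_r2` is illustrative; the inputs are the rounded figures reported in the table above.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared from R-squared, sample size, and number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-squared, No. Observations, and Df Model as reported in the summary above.
covid_adj = adjusted_r2(0.576, 252, 6)
print(round(covid_adj, 3))
```

The result agrees with the reported 0.565 up to rounding of the inputs. With 252 observations and only 6 predictors the penalty is small, so the adjusted value sits just below the raw R-squared.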

In [18]:
X_test.head()

date,alcohol_deaths,alcohol_sales,drug_deaths,homicides,suicide_deaths,perc_unemp
2005-08-31,0.584914,2.872502,0.908476,0.0,0.930453,4.9
2020-12-31,1.399548,4.923888,2.387641,0.667069,1.072796,6.7
2019-04-30,0.951684,3.948267,1.788982,0.440459,1.228953,3.6
2013-06-30,0.726703,3.5914,1.23302,0.471169,1.101612,7.5
2007-08-31,0.633499,3.165174,1.093864,0.566833,0.994362,4.6


In [19]:
from sklearn.metrics import mean_squared_error

y_predict = regression_model.predict(X_test)

#y_test
dates = y_test[y_test["covid_dum"]==1]
print(dates)
#print(y_test)
#print(y_predict)

            covid_dum
date                 
2020-12-31          1
2020-11-30          1
2020-05-31          1
2020-09-30          1
2020-02-29          1


In [20]:
# scikit-learn's signature is (y_true, y_pred); MSE is symmetric, so the value is unchanged
regression_model_mse = mean_squared_error(y_test, y_predict)
regression_model_mse

0.03307530974799022

### RMSE

In [21]:
import math
math.sqrt(regression_model_mse)

0.18186618637885996

Our model has an RMSE of about 0.18, below the 0.2 threshold, meaning that the typical distance between the model's predictions and the actual values is small. This indicates that the model is effective at prediction.

In [22]:
X_test.head()

date,alcohol_deaths,alcohol_sales,drug_deaths,homicides,suicide_deaths,perc_unemp
2005-08-31,0.584914,2.872502,0.908476,0.0,0.930453,4.9
2020-12-31,1.399548,4.923888,2.387641,0.667069,1.072796,6.7
2019-04-30,0.951684,3.948267,1.788982,0.440459,1.228953,3.6
2013-06-30,0.726703,3.5914,1.23302,0.471169,1.101612,7.5
2007-08-31,0.633499,3.165174,1.093864,0.566833,0.994362,4.6


## 3. H1N1

We will be going through the same steps we did for COVID. 

In [23]:
y=df[['h1n1_dum']]

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [25]:
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

LinearRegression()

### MODEL

In [26]:
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))

The coefficient for alcohol_deaths is -0.23429776313547857
The coefficient for alcohol_sales is -0.024199751903267033
The coefficient for drug_deaths is -0.0340369072352831
The coefficient for homicides is 0.1406696895454995
The coefficient for suicide_deaths is 0.07304739433637272
The coefficient for perc_unemp is 0.027684104666223553


In [27]:
intercept = regression_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))

The intercept for our model is 0.06134064104131683


In [28]:
print("y = ",intercept, "+")
for idx, col_name in enumerate(X_train.columns):
    print("{}({}) + ".format(col_name, round(regression_model.coef_[0][idx],2)))
#Our regression line for this model is:

y =  0.06134064104131683 +
alcohol_deaths(-0.23) + 
alcohol_sales(-0.02) + 
drug_deaths(-0.03) + 
homicides(0.14) + 
suicide_deaths(0.07) + 
perc_unemp(0.03) + 


It seems that the coefficients for the H1N1 model behave in the opposite way to those from the COVID model. The lowest coefficient in this model is alcohol_deaths (-0.23), whereas in the COVID model alcohol_deaths was the highest coefficient (0.79).

### R-Squared

In [29]:
import statsmodels.formula.api as sm
#lm = sm.ols(formula='h1n1_dum ~ alcohol_deaths+alcohol_sales+drug_deaths+homicides+suicide_deaths+perc_unemp+gdp_period+yearly_infl', data=df)
lm = sm.ols(formula='h1n1_dum ~ alcohol_deaths + alcohol_sales + drug_deaths + homicides + suicide_deaths + perc_unemp', data=df)
fit = lm.fit()
fit.summary()

0,1,2,3
Dep. Variable:,h1n1_dum,R-squared:,0.176
Model:,OLS,Adj. R-squared:,0.156
Method:,Least Squares,F-statistic:,8.741
Date:,"Wed, 20 Jul 2022",Prob (F-statistic):,1.23e-08
Time:,16:10:23,Log-Likelihood:,47.068
No. Observations:,252,AIC:,-80.14
Df Residuals:,245,BIC:,-55.43
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0241,0.161,0.150,0.881,-0.294,0.342
alcohol_deaths,-0.0300,0.233,-0.129,0.898,-0.489,0.429
alcohol_sales,-0.0311,0.035,-0.898,0.370,-0.099,0.037
drug_deaths,-0.1017,0.100,-1.013,0.312,-0.299,0.096
homicides,0.1712,0.096,1.784,0.076,-0.018,0.360
suicide_deaths,0.0470,0.175,0.268,0.789,-0.298,0.392
perc_unemp,0.0301,0.005,5.586,0.000,0.019,0.041

0,1,2,3
Omnibus:,174.85,Durbin-Watson:,0.213
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1200.813
Skew:,2.917,Prob(JB):,1.76e-261
Kurtosis:,11.962,Cond. No.,152.0


As you can see, the adjusted R-squared is much lower for H1N1 (0.156) than it was for COVID (0.565). This means that our independent variables are not as good at predicting an H1N1 pandemic as a COVID pandemic: the model explains less than a third as much of the variation in the occurrence of H1N1 deaths as the COVID model did for COVID deaths. This value indicates that the model is not effective.

In [30]:
X_test.head()

date,alcohol_deaths,alcohol_sales,drug_deaths,homicides,suicide_deaths,perc_unemp
2005-08-31,0.584914,2.872502,0.908476,0.0,0.930453,4.9
2020-12-31,1.399548,4.923888,2.387641,0.667069,1.072796,6.7
2019-04-30,0.951684,3.948267,1.788982,0.440459,1.228953,3.6
2013-06-30,0.726703,3.5914,1.23302,0.471169,1.101612,7.5
2007-08-31,0.633499,3.165174,1.093864,0.566833,0.994362,4.6


In [31]:
y_predict = regression_model.predict(X_test)
regression_model_mse = mean_squared_error(y_test, y_predict)
regression_model_mse

0.04402313983036657

In [32]:
dates = y_test[y_test["h1n1_dum"]==1]
print(dates)

            h1n1_dum
date                
2009-09-30         1
2010-01-31         1
2009-12-31         1


In [33]:
math.sqrt(regression_model_mse)

0.20981691979048442

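The RMSE of this model (about 0.21) indicates low variation of the data from the model, which seems to contradict the low adjusted R-squared. One possible explanation is that h1n1_dum is 0 in almost every month, so predictions near 0 produce small errors on most rows even when the model explains little. This model still might be effective at predicting the occurrence of H1N1 deaths. Still, it performed worse than the COVID model.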
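
One way to put this RMSE in context is to compare it with a trivial baseline that predicts 0 for every row. The sketch below assumes the test set has 51 rows (what `test_size=0.2` should yield on 252 rows) and uses the three H1N1 rows shown in the printed `dates` output; `all_zero_rmse` is a hypothetical helper name.

```python
import math

def all_zero_rmse(n_rows, n_ones):
    """RMSE of a baseline that predicts 0 for every row of a 0/1 target."""
    return math.sqrt(n_ones / n_rows)

# Assumption: test_size=0.2 on 252 rows leaves 51 test rows.
n_test_rows = 51
# Three test rows have h1n1_dum == 1 (see the printed dates above).
h1n1_baseline = all_zero_rmse(n_test_rows, 3)
h1n1_model_rmse = 0.20981691979048442  # value computed in the cell above

print(round(h1n1_baseline, 3), round(h1n1_model_rmse, 3))
```

Under these assumptions the fitted model improves on the all-zero baseline only modestly, which is consistent with a low adjusted R-squared appearing alongside a low RMSE when the target is 0 in most rows.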

## 4. Both 

In [34]:
y=df[['pandemic']]

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [36]:
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

LinearRegression()

In [37]:
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))

The coefficient for alcohol_deaths is 0.5591735182303401
The coefficient for alcohol_sales is -0.0325597487240848
The coefficient for drug_deaths is 0.12408838220591967
The coefficient for homicides is 0.005960296518862664
The coefficient for suicide_deaths is -0.7596078515449779
The coefficient for perc_unemp is 0.042246049974953966


In [38]:
intercept = regression_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))

The intercept for our model is 0.16555135405225618


In [39]:
print("y = ",intercept, "+")
for idx, col_name in enumerate(X_train.columns):
    print("{}({}) + ".format(col_name, round(regression_model.coef_[0][idx],2)))
#Our regression line for this model is:

y =  0.16555135405225618 +
alcohol_deaths(0.56) + 
alcohol_sales(-0.03) + 
drug_deaths(0.12) + 
homicides(0.01) + 
suicide_deaths(-0.76) + 
perc_unemp(0.04) + 


In [40]:
regression_model.score(X_test, y_test)

0.4493291115521598

In [41]:
import statsmodels.formula.api as sm
#lm = sm.ols(formula='pandemic ~ alcohol_deaths+alcohol_sales+drug_deaths+homicides+suicide_deaths+perc_unemp+gdp_period+yearly_infl', data=df)
lm = sm.ols(formula='pandemic ~ alcohol_deaths + alcohol_sales + drug_deaths + homicides + suicide_deaths + perc_unemp', data=df)
fit = lm.fit()
fit.summary()

0,1,2,3
Dep. Variable:,pandemic,R-squared:,0.345
Model:,OLS,Adj. R-squared:,0.329
Method:,Least Squares,F-statistic:,21.47
Date:,"Wed, 20 Jul 2022",Prob (F-statistic):,3.12e-20
Time:,16:10:23,Log-Likelihood:,4.5517
No. Observations:,252,AIC:,4.897
Df Residuals:,245,BIC:,29.6
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.1874,0.191,0.981,0.328,-0.189,0.564
alcohol_deaths,0.8568,0.276,3.107,0.002,0.314,1.400
alcohol_sales,-0.0428,0.041,-1.043,0.298,-0.124,0.038
drug_deaths,0.0910,0.119,0.766,0.444,-0.143,0.325
homicides,0.0208,0.114,0.183,0.855,-0.203,0.244
suicide_deaths,-0.9323,0.207,-4.493,0.000,-1.341,-0.524
perc_unemp,0.0450,0.006,7.065,0.000,0.032,0.058

0,1,2,3
Omnibus:,113.645,Durbin-Watson:,0.303
Prob(Omnibus):,0.0,Jarque-Bera (JB):,364.317
Skew:,2.018,Prob(JB):,7.75e-80
Kurtosis:,7.291,Cond. No.,152.0


This adjusted R-squared (0.329) falls right in the middle of the other two models; however, it is a suboptimal value, indicating poor predictive ability.

In [42]:
X_test.head()

date,alcohol_deaths,alcohol_sales,drug_deaths,homicides,suicide_deaths,perc_unemp
2005-08-31,0.584914,2.872502,0.908476,0.0,0.930453,4.9
2020-12-31,1.399548,4.923888,2.387641,0.667069,1.072796,6.7
2019-04-30,0.951684,3.948267,1.788982,0.440459,1.228953,3.6
2013-06-30,0.726703,3.5914,1.23302,0.471169,1.101612,7.5
2007-08-31,0.633499,3.165174,1.093864,0.566833,0.994362,4.6


In [43]:
y_predict = regression_model.predict(X_test)
regression_model_mse = mean_squared_error(y_test, y_predict)
regression_model_mse

0.07282998293966053

In [44]:
dates = y_test[y_test["pandemic"]==1]
print(dates)

            pandemic
date                
2020-12-31         1
2020-11-30         1
2020-05-31         1
2009-09-30         1
2010-01-31         1
2009-12-31         1
2020-09-30         1
2020-02-29         1


In [45]:
math.sqrt(regression_model_mse)

0.26987030762879516

This is our highest RMSE so far, which means that this model is the least effective at prediction.

# 3. Discussion

Research questions:
- Does H1N1 classify as a pandemic compared to the COVID-19 pandemic?
- What factors are correlated most with mental health decline during a pandemic?
- Did the H1N1 pandemic (2009-2010) have a negative effect on mental health?


- Covid: 
    - alcohol_deaths(0.79) 
    - suicide_deaths(-0.83)
    - Adjusted R-squared = 0.565
    - RMSE = 0.18186618637885996
- h1n1: 
    - alcohol_deaths(-0.23)
    - homicides(0.14)
    - Adjusted R-squared = 0.156
    - RMSE = 0.20981691979048442
- both: 
    - alcohol_deaths(0.56)
    - suicide_deaths(-0.76)
    - Adjusted R-squared = 0.329
    - RMSE = 0.26987030762879516

After our analysis, which includes the visualizations as well as the linear models above, we have created a story that shows the differences between H1N1 and COVID. In our scaled visualizations we saw the huge difference between the number of H1N1 deaths and the number of COVID deaths. We investigated the differences further by building linear regressions to predict the occurrence of deaths due to each virus based on a group of independent variables: alcohol_deaths, alcohol_sales, drug_deaths, homicides, suicide_deaths, and perc_unemp. The adjusted R-squared of our COVID model was 0.565, more than 3 times higher than the 0.156 of the H1N1 model. This indicates that the COVID model was a much better fit for the data; it explained more than 3 times as much of the variation as the H1N1 model did. Our RMSE scores for the two models were more similar: COVID scored 0.18 and H1N1 scored 0.21. This also supports the conclusion that the COVID model was more precise, as the average deviation of the data from the model was lower. I ran a third model that grouped the dummy variables from both H1N1 and COVID. Its adjusted R-squared was right in the middle of the other two, but its RMSE was the highest of all three. I believe that this is because the model was being built to predict at two different points, making the likelihood of error higher.