# MACHINELEARNING

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import statsmodels.formula.api as sm
import numpy as np
import pingouin as pg

In [2]:
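def multiline(dataframe, exemptions):
    # Plot every column against the index, skipping the exempted column(s).
    # `exemptions` may be a single column name or a list of names.
    if isinstance(exemptions, str):
        exemptions = [exemptions]
    for col in dataframe.columns:
        if col not in exemptions:
            plt.plot(dataframe.index, dataframe[col], label=col)
    # One tick every 24 rows, set once after the loop instead of on every pass.
    plt.xticks(np.arange(0, len(dataframe), 24), rotation=45)
    plt.legend()
    plt.show()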

In [3]:
df = pd.read_csv("/Users/natemcdowell/Desktop/python_for_the_datajam/datasets/for_ml.csv")

## Terms to Know
- **R-squared**: The R² value (the 2 means "squared") represents the proportion of the variance in the dependent variable that the model explains. The higher the value, the better the model fits.
- **Adjusted R-squared**: Adjusted R² corrects a weakness of the raw R² score, which never decreases when more variables are added. It accounts for the number of variables in the model and only increases when a new variable improves prediction more than would be expected by chance. This makes it a more reliable measure of fit when comparing models with different numbers of variables.
    - **Best**: above 0.75
    - **Decent**: above 0.4
- **RMSE**: RMSE stands for root-mean-square error (also called root-mean-square deviation). In statistical terms, RMSE is the standard deviation of the residuals, where a residual measures the distance between a data point and the regression line. RMSE tells you how spread out the data points are around the regression line, that is, how well the data fits the model. Lower values mean a better fit because the data lies closer to the model. RMSE is in the same units as the dependent variable, so the thresholds below apply to data on the scale used in this notebook.
    - **Best**: below 0.2
    - **Decent**: below 0.5
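The three terms above can be written out directly with NumPy. This is a minimal sketch on made-up toy values, not the notebook's data; the helper names `r_squared`, `adjusted_r_squared`, and `rmse` are hypothetical and exist only for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def rmse(y_true, y_pred):
    """Root of the mean squared residual, in the units of y."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, chosen only to make the arithmetic easy to follow.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y_true, y_pred)
adj = adjusted_r_squared(r2, n_obs=4, n_predictors=1)
err = rmse(y_true, y_pred)
print(round(r2, 3), round(adj, 3), round(err, 3))  # 0.98 0.97 0.158
```

Adjusted R² is always at or below the raw R², and the gap grows as predictors are added relative to the number of observations.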

## Contents
1. [CORRELATION](#1.-CORRELATION)

2. [LINEAR_REGRESSION](#2.-LINEAR_REGRESSION)
    
    2.1 [COVID](#2.1-COVID)
    
    2.2 [H1N1](#3.-H1N1)

    2.3 [Both](#4.-Both)

3. [DISCUSSION](#3.-Discussion)

In [4]:
print(df.shape)
df.head()

(252, 9)


Unnamed: 0,date,alcohol_deaths,alcohol_sales,drug_deaths,homicides,perc_unemp,suicide_deaths,h1n1_deaths,covid_deaths
0,2000-01-31,0.665408,1.511417,0.631568,0.0,0.0,0.923307,0.0,0.0
1,2000-02-29,0.58411,1.762654,0.590162,0.0,0.0,0.832206,0.0,0.0
2,2000-03-31,0.578318,2.136862,0.609973,0.0,0.0,0.896287,0.0,0.0
3,2000-04-30,0.55291,1.902139,0.627176,0.0,0.0,0.882311,0.0,0.0
4,2000-05-31,0.56384,2.284832,0.64657,0.0,0.0,0.930976,0.0,0.0


## 1. CORRELATION
Correlation is a measurement between -1 and 1 that shows the strength of the linear relationship between two variables. Since our dataframe has many factors, we will look at the correlation between every pair of them. The correlation matrix below shows these pairwise relationships.

In [5]:
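# numeric_only=True skips the text `date` column (needs pandas 1.5+; required on pandas 2.x)
df.corr(numeric_only=True).round(3)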

Unnamed: 0,alcohol_deaths,alcohol_sales,drug_deaths,homicides,perc_unemp,suicide_deaths,h1n1_deaths,covid_deaths
alcohol_deaths,1.0,0.797,0.945,0.644,0.192,0.74,-0.061,0.576
alcohol_sales,0.797,1.0,0.823,0.719,0.295,0.755,-0.054,0.387
drug_deaths,0.945,0.823,1.0,0.715,0.252,0.799,-0.057,0.461
homicides,0.644,0.719,0.715,1.0,0.444,0.676,0.065,0.261
perc_unemp,0.192,0.295,0.252,0.444,1.0,0.209,0.282,0.19
suicide_deaths,0.74,0.755,0.799,0.676,0.209,1.0,-0.073,0.147
h1n1_deaths,-0.061,-0.054,-0.057,0.065,0.282,-0.073,1.0,-0.031
covid_deaths,0.576,0.387,0.461,0.261,0.19,0.147,-0.031,1.0


The table has a diagonal line of 1.000 running through it. This is because each factor appears on both the rows and the columns, so each diagonal entry is a factor's correlation with itself. Positive numbers are interpreted as positive correlations: the stronger the positive correlation, the more closely the two factors move together (when one goes up, the other tends to go up too). Negative numbers are interpreted as negative correlations: the closer to -1, the stronger the negative relationship, meaning that as one factor increases the other tends to decrease.
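To make the sign and size of a correlation concrete, here is a small sketch that computes the Pearson coefficient by hand and checks it against NumPy. The series `up`, `also_up`, and `down` are made-up toy values, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()  # centre each series on its mean
    y_c = y - y.mean()
    return float(np.sum(x_c * y_c) / np.sqrt(np.sum(x_c ** 2) * np.sum(y_c ** 2)))

up = [1, 2, 3, 4, 5]
also_up = [2, 4, 5, 4, 6]   # mostly rises with `up`
down = [10, 8, 6, 4, 2]     # falls in a straight line as `up` rises

r_self = pearson_r(up, up)       # a series against itself: the diagonal of the table
r_pos = pearson_r(up, also_up)   # positive, but below 1
r_neg = pearson_r(up, down)      # perfect negative relationship

print(round(r_self, 3), round(r_pos, 3), round(r_neg, 3))  # 1.0 0.853 -1.0
```

`np.corrcoef(up, also_up)[0, 1]` returns the same value as `pearson_r(up, also_up)`.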

In [6]:
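# Flatten the matrix into (row, column) pairs, sort by value, drop repeated values,
# and keep the last five: the strongest positive correlations.
df.corr(numeric_only=True).unstack().sort_values().drop_duplicates().tail()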

alcohol_sales   alcohol_deaths    0.797358
suicide_deaths  drug_deaths       0.799104
drug_deaths     alcohol_sales     0.823369
alcohol_deaths  drug_deaths       0.945271
                alcohol_deaths    1.000000
dtype: float64

In [7]:
# Same pipeline, but keep the first five: the most negative correlations.
df.corr(numeric_only=True).unstack().sort_values().drop_duplicates().head()

suicide_deaths  h1n1_deaths      -0.073000
h1n1_deaths     alcohol_deaths   -0.061280
drug_deaths     h1n1_deaths      -0.057400
h1n1_deaths     alcohol_sales    -0.054427
covid_deaths    h1n1_deaths      -0.030727
dtype: float64

Partial correlation is a measure of the correlation between two factors after controlling for the influence of the other factors in the dataframe. This measure gives us a better idea of how two factors are directly related to each other.

In [8]:
df.pcorr().round(3)

Unnamed: 0,alcohol_deaths,alcohol_sales,drug_deaths,homicides,perc_unemp,suicide_deaths,h1n1_deaths,covid_deaths
alcohol_deaths,1.0,0.124,0.78,-0.087,-0.204,0.145,0.071,0.507
alcohol_sales,0.124,1.0,0.114,0.253,0.081,0.242,-0.067,0.067
drug_deaths,0.78,0.114,1.0,0.209,0.104,0.213,-0.054,-0.148
homicides,-0.087,0.253,0.209,1.0,0.322,0.156,0.067,-0.029
perc_unemp,-0.204,0.081,0.104,0.322,1.0,-0.002,0.282,0.199
suicide_deaths,0.145,0.242,0.213,0.156,-0.002,1.0,-0.072,-0.431
h1n1_deaths,0.071,-0.067,-0.054,0.067,0.282,-0.072,1.0,-0.077
covid_deaths,0.507,0.067,-0.148,-0.029,0.199,-0.431,-0.077,1.0


In [9]:
df.pcorr().unstack().sort_values().drop_duplicates().tail()

perc_unemp      homicides         0.321795
homicides       perc_unemp        0.321795
covid_deaths    alcohol_deaths    0.506915
drug_deaths     alcohol_deaths    0.779932
alcohol_deaths  alcohol_deaths    1.000000
dtype: float64

In [10]:
df.pcorr().unstack().sort_values().drop_duplicates().head()

suicide_deaths  covid_deaths     -0.431050
covid_deaths    suicide_deaths   -0.431050
alcohol_deaths  perc_unemp       -0.204477
perc_unemp      alcohol_deaths   -0.204477
covid_deaths    drug_deaths      -0.147779
dtype: float64
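The difference between plain and partial correlation is easiest to see on synthetic data. The sketch below uses one standard way of computing partial correlations, from the inverse of the covariance matrix; it is an illustration on made-up data, not a description of how `pcorr` is implemented, and `partial_corr_matrix` is a hypothetical helper name.

```python
import numpy as np

def partial_corr_matrix(data):
    """Partial correlation of each column pair, controlling for all other columns.

    data: 2-D array, one row per observation and one column per variable.
    """
    precision = np.linalg.inv(np.cov(data, rowvar=False))  # inverse covariance matrix
    scale = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

# Synthetic data: x and y are both driven by z and have no direct link to each other.
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + 0.5 * rng.normal(size=2000)
y = z + 0.5 * rng.normal(size=2000)
data = np.column_stack([x, y, z])

raw_xy = np.corrcoef(x, y)[0, 1]               # high, because both follow z
partial_xy = partial_corr_matrix(data)[0, 1]   # close to 0 once z is controlled for
print(round(raw_xy, 2), round(partial_xy, 2))
```

This is the same pattern seen in the tables above, where several pairs with a strong plain correlation show a much weaker partial correlation.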

# 2. LINEAR_REGRESSION
<hr>
Linear regression is a statistical approach that models the relationship between one or more independent variables and a single dependent variable. We will be using the 3 dummy variables we created as our dependent variables (as we are trying to show how much a pandemic "depends" on our selected factors); the run shown below uses suicide_deaths as the dependent variable. The linear regression will help us determine how strongly each independent variable is associated with the dependent variable.

In [11]:
# here we set our variables. "X" = independent, "y" = dependent. 
#X=df[['alcohol_deaths','alcohol_sales','drug_deaths','homicides','suicide_deaths','perc_unemp','gdp_period','yearly_infl']]
X=df[['alcohol_deaths','alcohol_sales','drug_deaths','homicides','covid_deaths','h1n1_deaths','perc_unemp']]


y=df[['suicide_deaths']]

The command below splits our data into a training set and a test set. The training set is used to train the model, basically teaching the model the best weight to give each variable. The test set is held back and used to check how well the trained model predicts data it has not seen. The testing of the model is demonstrated below.

As for my split, I put 80% of the data in the training set and 20% in the test set. This is set with the "test_size" parameter of the "train_test_split" function (test_size=0.2). I found this split to be effective at building a good model; however, feel free to play around with different sizes to see how it affects your model.
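A quick way to confirm which side of the split `test_size` controls is to check the sizes it produces. This sketch uses placeholder arrays with the same number of rows as our dataframe (252); `X_demo` and `y_demo` are hypothetical stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 252 rows, the same length as the notebook's dataframe.
X_demo = np.arange(252 * 2).reshape(252, 2)
y_demo = np.arange(252)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# test_size=0.2 sends roughly 20% of the rows to the test set; the rest train the model.
print(len(X_tr), len(X_te))  # 201 51
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in each set on every run.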

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [13]:
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

LinearRegression()

### Model

In [14]:
# The model was already trained by fit() above; this loop only prints the
# weight (coefficient) learned for each column. A positive coefficient means the
# dependent variable tends to increase as that independent variable increases;
# a negative one means it tends to decrease.
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))

The coefficient for alcohol_deaths is 0.1474922442037885
The coefficient for alcohol_sales is 0.04424300762766949
The coefficient for drug_deaths is 0.1434200522662822
The coefficient for homicides is 0.06381968069648356
The coefficient for covid_deaths is -0.004278357870728257
The coefficient for h1n1_deaths is -0.005993410056251478
The coefficient for perc_unemp is 0.0005237198255524248


In [15]:
intercept = regression_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))
# this cell reads off the y-intercept of the fitted model (the predicted y when every x is 0)

The intercept for our model is 0.5975400056055207


In [16]:
print("y = ",intercept, "+")
for idx, col_name in enumerate(X_train.columns):
    print("{}({}) + ".format(col_name, round(regression_model.coef_[0][idx],2)))
#Our regression line for this model is: 

y =  0.5975400056055207 +
alcohol_deaths(0.15) + 
alcohol_sales(0.04) + 
drug_deaths(0.14) + 
homicides(0.06) + 
covid_deaths(-0.0) + 
h1n1_deaths(-0.01) + 
perc_unemp(0.0) + 


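In this model the largest coefficient is for alcohol_deaths (about 0.15), closely followed by drug_deaths (about 0.14). With the other variables held fixed, a one-unit rise in alcohol_deaths goes with a rise of about 0.15 in predicted suicide_deaths. Only covid_deaths and h1n1_deaths have negative coefficients, meaning predicted suicide_deaths falls slightly as they rise. Coefficients are per-unit effects, so they are not directly comparable across columns on different scales: covid_deaths reaches values above 100 in this dataset, while alcohol_deaths is of order 1.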
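A fitted linear model predicts by adding the intercept to the coefficient-weighted sum of the inputs, and each coefficient depends on the units of its column. The sketch below shows both points on synthetic, noise-free data with known weights; `X_demo`, `y_demo`, and the chosen weights are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights: y = 0.5 + 2.0 * x0 - 1.0 * x1 (no noise).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 0.5 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

# predict() is the regression line: intercept + sum(coefficient * value).
manual = model.intercept_ + X_demo @ model.coef_
same = np.allclose(manual, model.predict(X_demo))

# Rescaling a column rescales its coefficient: x0 * 100 shrinks the weight by 100,
# even though the relationship with y is unchanged.
X_scaled = X_demo.copy()
X_scaled[:, 0] *= 100
model_scaled = LinearRegression().fit(X_scaled, y_demo)

print(same, np.round(model.coef_, 3), np.round(model_scaled.coef_, 3))
```

This is why a small coefficient on a large-valued column can still matter as much as a large coefficient on a small-valued one.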

### R squared

In [17]:
regression_model.score(X_test, y_test)

0.6673326963243675

In [18]:
#lm = sm.ols(formula='covid_dum ~ alcohol_deaths + alcohol_sales + drug_deaths + homicides + suicide_deaths + perc_unemp + gdp_period + yearly_infl', data=df)
lm = sm.ols(formula='suicide_deaths ~ alcohol_deaths + alcohol_sales + drug_deaths + homicides + covid_deaths + h1n1_deaths + perc_unemp', data=df)
fit = lm.fit()
fit.summary()

0,1,2,3
Dep. Variable:,suicide_deaths,R-squared:,0.744
Model:,OLS,Adj. R-squared:,0.736
Method:,Least Squares,F-statistic:,101.1
Date:,"Fri, 29 Jul 2022",Prob (F-statistic):,1.93e-68
Time:,14:48:45,Log-Likelihood:,327.53
No. Observations:,252,AIC:,-639.1
Df Residuals:,244,BIC:,-610.8
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.5978,0.037,16.177,0.000,0.525,0.671
alcohol_deaths,0.2011,0.088,2.282,0.023,0.028,0.375
alcohol_sales,0.0434,0.011,3.894,0.000,0.021,0.065
drug_deaths,0.1112,0.033,3.406,0.001,0.047,0.176
homicides,0.0774,0.031,2.472,0.014,0.016,0.139
covid_deaths,-0.0036,0.000,-7.462,0.000,-0.005,-0.003
h1n1_deaths,-0.0084,0.007,-1.125,0.262,-0.023,0.006
perc_unemp,-6.179e-05,0.002,-0.033,0.974,-0.004,0.004

0,1,2,3
Omnibus:,5.759,Durbin-Watson:,0.987
Prob(Omnibus):,0.056,Jarque-Bera (JB):,5.62
Skew:,-0.364,Prob(JB):,0.0602
Kurtosis:,3.073,Cond. No.,272.0


Our adjusted R-squared for this model is 0.736. This value represents the proportion of variation in the dependent variable that can be explained by the independent variables, after adjusting for how many of them there are. In other words, it measures how well the different factors can be used to predict suicide_deaths. The value sits just below the 0.75 threshold we set for the best models, so the model is relatively effective at predicting the dependent variable (suicide_deaths). Note that this summary is fitted on all 252 rows, whereas the score of 0.667 above is the plain R-squared on the held-out test set.
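The adjusted R-squared in the summary table can be reproduced from three numbers printed in it. This is a quick arithmetic check using the rounded R-squared shown in the output, so the result agrees with the table only up to rounding.

```python
# Values read from the OLS summary above.
n_obs = 252        # "No. Observations"
n_predictors = 7   # "Df Model"
r2 = 0.744         # "R-squared" (already rounded to three places)

# Standard adjustment: penalise R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(round(adj_r2, 3))  # 0.737, versus 0.736 in the table (difference is input rounding)
```

With 252 observations and only 7 predictors the penalty is small, which is why the raw and adjusted values are so close.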

In [19]:
X_test.head()

Unnamed: 0,alcohol_deaths,alcohol_sales,drug_deaths,homicides,covid_deaths,h1n1_deaths,perc_unemp
67,0.584914,2.872502,0.908476,0.0,0.0,0.0,4.9
251,1.399548,4.923888,2.387641,0.667069,104.310475,0.0,6.7
231,0.951684,3.948267,1.788982,0.440459,0.0,0.0,3.6
161,0.726703,3.5914,1.23302,0.471169,0.0,0.0,7.5
91,0.633499,3.165174,1.093864,0.566833,0.0,0.0,4.6


In [20]:
from sklearn.metrics import mean_squared_error
y_predict = regression_model.predict(X_test)

In [21]:
# scikit-learn's argument order is (y_true, y_pred); MSE is symmetric, so the value is unchanged
regression_model_mse = mean_squared_error(y_test, y_predict)
regression_model_mse

0.0056526235461199946

### RMSE

In [22]:
import math
math.sqrt(regression_model_mse)

0.0751839314356465

Our model has an RMSE of about 0.075, well below 0.2, meaning that the typical distance between the test-set observations and the model's predictions is small. This indicates that the model is effective at prediction.

# 3. Discussion

Research questions:
- Does H1N1 classify as a pandemic compared to the COVID-19 pandemic?
- What factors are correlated most with mental health decline during a pandemic?
- Did the H1N1 pandemic (2009-2010) have a negative effect on mental health?


- Covid: 
    - alcohol_deaths(0.79) 
    - suicide_deaths(-0.83)
    - R = 0.565
    - RMSE = 0.18186618637885996
- h1n1: 
    - alcohol_deaths(-0.23)
    - homicide(0.14)
    - R = 0.156
    - RMSE = 0.20981691979048442
- both: 
    - alcohol_deaths(0.56)
    - suicide_deaths(-0.76)
    - R = 0.329
    - RMSE = 0.26987030762879516

After our analysis, which includes the visualizations as well as the linear models above, we have created a story that shows the differences between H1N1 and COVID. In our scaled visualizations we saw the huge difference between the number of H1N1 deaths and the number of COVID deaths. We investigated the differences further by building linear regressions to predict the occurrence of deaths due to each virus from a group of independent variables: alcohol_deaths, alcohol_sales, drug_deaths, homicides, suicide_deaths, and perc_unemp.

The adjusted R-squared of our COVID model was 0.565, more than 3 times the score for the H1N1 model (0.156). This indicates that the COVID model was a much better fit for the data: it explained more than 3 times as much of the variation as the H1N1 model. The RMSE scores for the two models were closer, at 0.18 for COVID and 0.21 for H1N1. These scores also support the conclusion that the COVID model was more precise, as its predictions were on average closer to the data. I ran a third model that grouped the dummy variables from both H1N1 and COVID. Its R-squared score (0.329) fell between the other two, but its RMSE (0.27) was the highest of all three. I believe this is because the model was being built to predict two different events at once, making error more likely.