# OLS Regression Analysis

_Primary Author: Pete King_

This notebook uses the statsmodels library to fit an ordinary least squares (OLS) regression of the depression index (`DI`) on daylight hours (`DH`), looking for a linear association between daylight and depression. We also calculate the Pearson correlation coefficient, which ranges from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation), with 0 indicating no linear correlation.
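As a reminder of what the Pearson coefficient measures, here is a minimal sketch that computes it from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the toy values are hypothetical and are not part of the analysis data; the notebook itself relies on `scipy.stats.pearsonr`.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled to lie in [-1, +1]."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered cross-products and squares (the 1/n factors cancel).
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values, chosen to hit the two ends of the range.
toy_hours = [8.0, 9.0, 10.0, 11.0]
falling = [4.0, 3.0, 2.0, 1.0]   # decreases in lockstep with toy_hours
rising = [1.0, 2.0, 3.0, 4.0]    # increases in lockstep with toy_hours

r_falling = pearson_r(toy_hours, falling)
r_rising = pearson_r(toy_hours, rising)
print(r_falling, r_rising)
```

A perfectly linear decreasing relationship sits at the lower end of the range and a perfectly linear increasing one at the upper end; real data such as ours falls somewhere in between.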

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as sps
import statsmodels.api as sm

In [2]:
PATH = 'data/'

In [3]:
analysis_df = pd.read_csv(PATH + 'analysis_2022.csv')
analysis_df.head()

| | State | DH | DI | ADDEPEV3 | MENTHLTH | DH_Z | DI_Z |
|---|---|---|---|---|---|---|---|
| 0 | AL | 10.033333 | 0.0 | 0.0 | 0.0 | 1.137661 | -0.598581 |
| 1 | AL | 10.033333 | 0.0 | 0.0 | 0.0 | 1.137661 | -0.598581 |
| 2 | AL | 10.033333 | 0.444444 | 0.0 | 0.8 | 1.137661 | -0.450494 |
| 3 | AL | 10.033333 | 0.0 | 0.0 | 0.0 | 1.137661 | -0.598581 |
| 4 | AL | 10.033333 | 0.0 | 0.0 | 0.0 | 1.137661 | -0.598581 |


In [4]:
# Note how the standardized columns (DH_Z and DI_Z) have approximately zero mean and unit variance.
analysis_df.describe()

| | DH | DI | ADDEPEV3 | MENTHLTH | DH_Z | DI_Z |
|---|---|---|---|---|---|---|
| count | 402198.0 | 402198.0 | 402198.0 | 402198.0 | 402198.0 | 402198.0 |
| mean | 9.312854 | 1.796478 | 2.069503 | 1.164158 | 4.725539e-15 | 4.465206e-17 |
| std | 0.6333 | 3.00123 | 4.051201 | 2.233103 | 1.000001 | 1.000001 |
| min | 6.366667 | 0.0 | 0.0 | 0.0 | -4.652124 | -0.5985814 |
| 25% | 9.0 | 0.0 | 0.0 | 0.0 | -0.4940058 | -0.5985814 |
| 50% | 9.333333 | 0.0 | 0.0 | 0.0 | 0.03233819 | -0.5985814 |
| 75% | 9.566667 | 2.222222 | 0.0 | 1.066667 | 0.400779 | 0.1418567 |
| max | 11.05 | 10.0 | 10.0 | 8.0 | 2.74301 | 2.73339 |
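The standardization step itself is not shown in this notebook, so the following is only a sketch of how z-scores like `DH_Z` and `DI_Z` are typically produced. It assumes the population standard deviation (ddof=0) was used, which would be consistent with the `std` of 1.000001 reported above: `describe()` uses the sample standard deviation (ddof=1), and sqrt(402198 / 402197) is roughly 1.000001. The helper `z_scores` and the toy values are hypothetical.

```python
import statistics

def z_scores(values):
    """Center on the mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # ddof=0; an assumption about the preprocessing
    return [(v - mean) / sd for v in values]

# Hypothetical toy values standing in for a column such as DH.
toy_hours = [9.0, 9.5, 10.0, 8.5, 9.25]
z = z_scores(toy_hours)

z_mean = statistics.fmean(z)        # approximately 0
z_pop_sd = statistics.pstdev(z)     # approximately 1
z_sample_sd = statistics.stdev(z)   # sqrt(n / (n - 1)), slightly above 1
print(z_mean, z_pop_sd, z_sample_sd)
```

With only five values the gap between the two standard deviations is large; with about 400,000 rows it shrinks to the sixth decimal place, as in the table.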


In [5]:
# Prep data for an OLS Regression using the statsmodels library
# Add a column of 1's for the intercept calculation:
X = sm.add_constant(analysis_df.DH.values)
y = analysis_df.DI.values
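The column of ones matters because `sm.OLS` does not add an intercept on its own; without it, the fitted line would be forced through the origin. For a single predictor, the least squares solution has a simple closed form, sketched below in plain Python. The function `simple_ols` and the toy data are hypothetical and serve only to illustrate what the intercept and slope represent.

```python
def simple_ols(x, y):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    # The intercept makes the line pass through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical toy data generated from y = 3 + 2x, with no noise.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [5.0, 7.0, 9.0, 11.0]

toy_intercept, toy_slope = simple_ols(toy_x, toy_y)
print(toy_intercept, toy_slope)
```

On noiseless data the fit recovers the generating line; on our data, the same two quantities correspond to the `intercept` and `Daylight Hours` rows of the summary table.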

In [6]:
# Perform OLS Regression and look at the summary table
model = sm.OLS(y, X)
results = model.fit()
summary = results.summary(xname=['intercept', 'Daylight Hours'], yname='Depression Index (0-10)')
print(summary)

                               OLS Regression Results                              
Dep. Variable:     Depression Index (0-10)   R-squared:                       0.000
Model:                                 OLS   Adj. R-squared:                  0.000
Method:                      Least Squares   F-statistic:                     17.16
Date:                     Mon, 17 Feb 2025   Prob (F-statistic):           3.44e-05
Time:                             02:40:49   Log-Likelihood:            -1.0127e+06
No. Observations:                   402198   AIC:                         2.025e+06
Df Residuals:                       402196   BIC:                         2.025e+06
Df Model:                                1                                         
Covariance Type:                 nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------

In [7]:
# Calculate Pearson's correlation coefficient
pearson = sps.pearsonr(analysis_df.DH.values, analysis_df.DI.values)
r, p_value = pearson.statistic, pearson.pvalue
conf_int = pearson.confidence_interval(0.95)
print(f'Pearson R: {r}')
print(f'p-value: {p_value}')
print(conf_int)

Pearson R: -0.006531579023685456
p-value: 3.4384199781493494e-05
ConfidenceInterval(low=-0.009621882142233964, high=-0.003441151141966374)
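To see where an interval like this comes from, here is a sketch of the standard Fisher z-transformation approach: transform r with `atanh`, build a normal interval with standard error 1/sqrt(n - 3), and transform back with `tanh`. The helper `fisher_ci` is hypothetical, and we are assuming this matches what scipy does internally; the inputs are the r and sample size printed above.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson r via Fisher's z-transform."""
    z = math.atanh(r)                    # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)          # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# Values taken from the outputs above.
low, high = fisher_ci(-0.006531579023685456, 402198)
print(low, high)
```

Both bounds are negative, so the interval excludes zero, but it is also very narrow and very close to zero: the data pin down a correlation that is real yet tiny.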


## Interpreting the Results

_For a detailed writeup, see the final report._

In summary, the t-statistic of -4.142, with a p-value reported as 0.000 (about 3.4e-05 before rounding), provides strong evidence of a statistically significant negative linear correlation between our two variables of interest. Statistical significance is not the same as practical importance, however: with roughly 400,000 observations, even a very weak association is detectable. We cannot assume a causal relationship, but if we did assume one, we could say that the magnitude of the effect of 'daylight hours' on our 'depression index' is extremely small, as evidenced by both the R-squared value of 0.000 and the Pearson correlation coefficient of -0.0065.
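In a regression with a single predictor, the headline statistics are all functions of the Pearson r and the sample size, which gives a quick consistency check on the figures quoted above. The sketch below uses the standard identities R-squared = r², t = r * sqrt((n - 2) / (1 - r²)), and F = t²; the variable names are ours, and the inputs are copied from the outputs earlier in the notebook.

```python
import math

# Values taken from the Pearson and OLS outputs above.
r = -0.006531579023685456
n = 402198

r_squared = r ** 2                                   # share of variance explained
t_stat = r * math.sqrt((n - 2) / (1 - r_squared))    # t-statistic for the slope
f_stat = t_stat ** 2                                 # F-statistic with one predictor

print(f'R-squared: {r_squared:.3f}')
print(f't: {t_stat:.3f}')
print(f'F: {f_stat:.2f}')
```

The large t-statistic comes almost entirely from the sqrt(n - 2) factor (about 634), not from the strength of the relationship, which is why a correlation this weak can still be highly significant.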

What does this mean, practically speaking? It means that, while there does appear to be a correlation between daylight and depression, the 'effect' is minimal. We could say that for every additional hour of daylight available to a person, we can expect their depression index score (on a scale of 0-10) to decrease by about 0.03 points, on average. Other factors appear to play a much larger role in determining the degree to which a person feels depressed.
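The figure of roughly 0.03 points per hour can be recovered from numbers already printed in this notebook, since the slope of a one-predictor regression equals r multiplied by the ratio of the standard deviations of y and x. The sketch below uses the rounded values from the `describe()` output, so it is an approximation; the variable names are ours.

```python
# Values taken from the describe() and Pearson outputs above (rounded as displayed).
r = -0.006531579023685456
sd_daylight = 0.6333       # std of DH
sd_depression = 3.00123    # std of DI
dh_min, dh_max = 6.366667, 11.05

# Slope of DI on DH: change in depression index per additional daylight hour.
slope = r * sd_depression / sd_daylight

# Predicted difference between the least and most daylight observed in the data.
full_range_change = slope * (dh_max - dh_min)

print(f'slope: {slope:.4f}')
print(f'change across observed range: {full_range_change:.3f}')
```

Even across the full range of daylight hours in the data, the predicted difference amounts to a small fraction of one point on the 0-10 scale.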

## Record Dependencies

In [8]:
%load_ext watermark
%watermark
%watermark --iversions

Last updated: 2025-02-17T02:40:50.032503+00:00

Python implementation: CPython
Python version       : 3.10.11
IPython version      : 8.17.2

Compiler    : GCC 11.3.0
OS          : Linux
Release     : 6.5.0-1020-aws
Machine     : x86_64
Processor   : x86_64
CPU cores   : 64
Architecture: 64bit

scipy      : 1.10.1
statsmodels: 0.14.0
pandas     : 2.0.2
numpy      : 1.24.3

