## Coursera
### Wesleyan University Data Analysis and Interpretation Specialization

Course 3: Regression Modeling in Practice<br>
Week 4: Test a Logistic Regression Model<br>
Author: Matt Clark


### Instructions:


> This week's assignment is to test a logistic regression model.
> 
> Data preparation for this assignment:
> 
> 1) If your response variable is categorical with more than two categories, you will need to collapse it down to two categories, or subset your data to select observations from 2 categories.
> 
> 2) If your response variable is quantitative, you will need to bin it into two categories.
> 
> The assignment:
> 
> Write a blog entry that summarizes in a few sentences 1) what you found, making sure you discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (odds ratios, p-values, and 95% confidence intervals for the odds ratios) in your summary. 2) Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable. 3) Discuss whether or not there was evidence of confounding for the association between your primary explanatory variable and the response variable (Hint: adding additional explanatory variables to your model one at a time will make it easier to identify which of the variables are confounding variables).
> 
> What to Submit: Write a blog entry and submit the URL for your blog. Your blog entry should include 1) the summary of your results that addresses parts 1-3 of the assignment, 2) the output from your logistic regression model.
> 
> Example of how to write logistic regression results:
> 
> After adjusting for potential confounding factors (list them), the odds of having nicotine dependence were more than two times higher for participants with major depression than for participants without major depression (OR=2.36, 95% CI = 1.44-3.81, p=.0001). Age was also significantly associated with nicotine dependence, such that older participants were significantly less likely to have nicotine dependence (OR= 0.81, 95% CI=0.40-0.93, p=.041).
> 
> Review criteria
> 
> Your assessment will be based on the evidence you provide that you have completed all of the steps. When relevant, gradients in the scoring will be available to reward clarity (for example, you will get one point for submitting an inaccurate or incomplete description of your results, but two points if the description is accurate and complete). In all cases, consider that the peer assessing your work is likely not an expert in the field you are analyzing. You will be assessed equally on all parts of the assignment, and whether you post your program and output.
> 
> 1) Summarize in a few sentences what you found, making sure you discuss the results for the associations between all of your explanatory variables and your response variable. Make sure to include statistical results (odds ratios, p-values, and 95% confidence intervals for the odds ratios) in your summary.
> 2) Report whether or not your results supported your hypothesis for the association between your primary explanatory variable and your response variable.
> 3) Discuss whether or not there was evidence of confounding for the association between your primary explanatory variable and the response variable.
> 4) Include your logistic regression output in your blog entry.


### Data Preparation:

#### Import libraries.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import scipy
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from pathlib import Path

#### Dataframe generation.

In [2]:
# The codebook lives two directory levels above the current working directory.
root_dir = Path().resolve().parents[1]
df = pd.read_csv(root_dir / 'mycodebook.csv', low_memory=False)

#### Dichotomize variables.

In [3]:
# Dichotomize explanatory variable PPEDUCAT
# We bin PPEDUCAT into two categories: 0 = High School or less, 1 = more than High School.

def collapse_ppeducat(value):
    # Codes 3 (Some College) and 4 (Bachelor's Degree or Higher) map to 1.0;
    # every other value maps to 0.0.
    if value >= 3:
        return 1.0
    else:
        return 0.0

df['PPEDUCAT'] = df['PPEDUCAT'].apply(collapse_ppeducat)  # apply collapse_ppeducat to dichotomize PPEDUCAT
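
The same recoding can be sketched without a row-wise function. This is an illustrative alternative on a toy Series, not the code that produced the results below. One behavioral difference is worth noting: in `collapse_ppeducat`, a missing value fails the `>= 3` comparison and is coded as 0.0, whereas the sketch keeps it missing. Whether PPEDUCAT actually contains missing values in this data set is not something the output here shows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the PPEDUCAT column (codes 1-4 plus one missing value).
educ = pd.Series([1, 2, 3, 4, np.nan], name="PPEDUCAT")

# 1.0 where the code is 3 or 4, 0.0 where it is 1 or 2.
collapsed = (educ >= 3).astype(float)

# A comparison with NaN evaluates to False, so restore NaN explicitly
# instead of silently coding missing values as 0.0.
collapsed = collapsed.where(educ.notna())

print(collapsed.tolist())
```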


In [4]:
# Dichotomize explanatory variable W1_C1
# We bin the variable W1_C1 into two categories: 0: Republican 1: Democrat

di_w1_c1 = {-1: np.nan, 1: 1, 2: 0, 3: np.nan, 4: np.nan} # dictionary that maps W1_C1 variable onto 0: Republican 1: Democrat
df = df.replace({"W1_C1": di_w1_c1}).dropna() # use the dictionary to map W1_C1 and drop NA values.

In [5]:
# Dichotomize response variable W1_F6
# We exclude the -1 refusals to respond, and categorize 0-5: less optimistic, 6-10: more optimistic

di_w1_f6 = {-1: np.nan, 0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1}
df = df.replace({"W1_F6": di_w1_f6}).dropna() # use the dictionary to drop NA values and replace numeric values with binary.
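
Note that `dropna()` with no arguments removes a row when any column is missing, not only the recoded one. The sketch below uses a toy frame to show the difference between that and restricting the check with `subset=`. The `OTHER` column and the `toy` frame are hypothetical and exist only for illustration; the analysis in this post uses the unrestricted `dropna()` shown above.

```python
import numpy as np
import pandas as pd

# Toy frame; "OTHER" is a hypothetical unrelated column with one missing value.
toy = pd.DataFrame({
    "W1_F6": [-1, 0, 5, 6, 10],
    "OTHER": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# Codes 0-5 -> 0 (less optimistic), 6-10 -> 1 (more optimistic).
mapping = {code: int(code >= 6) for code in range(11)}
# Series.map turns any code missing from the dictionary (here -1) into NaN.
toy["W1_F6"] = toy["W1_F6"].map(mapping)

# Drop rows only when the recoded variable itself is missing.
by_subset = toy.dropna(subset=["W1_F6"])
# Dropping on every column also removes the row where only OTHER is missing.
by_any = toy.dropna()

print(len(by_subset), len(by_any))
```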


#### Check code.

In [7]:
# We generate a frequency table for each recoded variable to check our coding.

check = df['PPEDUCAT'].value_counts(sort=False, dropna=False)
print(check)
check1 = df['W1_C1'].value_counts(sort=False, dropna=True)
print(check1)
check2 = df['W1_F6'].value_counts(sort=False, dropna=True)
print(check2)

0.0    329
1.0    599
Name: PPEDUCAT, dtype: int64
1.0    212
0.0    716
Name: W1_C1, dtype: int64
1.0    662
0.0    266
Name: W1_F6, dtype: int64


### Logistic Regression:

#### Logistic regression with Education Level.

In [8]:

lreg1 = smf.logit(formula = 'W1_F6 ~ PPEDUCAT', data = df).fit()
print (lreg1.summary())
# odds ratios
print ("Odds Ratios")
print (np.exp(lreg1.params))


Optimization terminated successfully.
         Current function value: 0.582779
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                  W1_F6   No. Observations:                  928
Model:                          Logit   Df Residuals:                      926
Method:                           MLE   Df Model:                            1
Date:                Thu, 08 Oct 2020   Pseudo R-squ.:                 0.02727
Time:                        19:19:07   Log-Likelihood:                -540.82
converged:                       True   LL-Null:                       -555.98
Covariance Type:            nonrobust   LLR p-value:                 3.666e-08
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4131      0.113      3.668      0.000       0.192       0.634
PPEDUCAT       0.8215      0.

#### Odds ratios with 95% confidence intervals.

In [9]:

params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (np.exp(conf))

           Lower CI  Upper CI        OR
Intercept  1.212072  1.884774  1.511450
PPEDUCAT   1.697583  3.046139  2.273999
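
The odds ratios in this table are the exponentials of the log-odds coefficients in the model summary. As a quick arithmetic check, the sketch below exponentiates the two coefficients as printed (0.4131 and 0.8215). Because the summary rounds its coefficients, the results agree with the table only to about three decimal places.

```python
import math

# Log-odds coefficients as printed in the model summary (rounded there).
coefficients = {"Intercept": 0.4131, "PPEDUCAT": 0.8215}

# An odds ratio is the exponential of the log-odds coefficient.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, value in odds_ratios.items():
    print(f"{name}: {value:.3f}")
```

The exponentiated intercept is the estimated odds of optimism in the reference group (PPEDUCAT = 0) rather than a ratio between groups.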


#### Logistic regression with Level of Education and Political Association.

In [10]:

lreg2 = smf.logit(formula = 'W1_F6 ~ PPEDUCAT + W1_C1', data = df).fit()
print (lreg2.summary())

Optimization terminated successfully.
         Current function value: 0.580260
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                  W1_F6   No. Observations:                  928
Model:                          Logit   Df Residuals:                      925
Method:                           MLE   Df Model:                            2
Date:                Thu, 08 Oct 2020   Pseudo R-squ.:                 0.03147
Time:                        19:19:07   Log-Likelihood:                -538.48
converged:                       True   LL-Null:                       -555.98
Covariance Type:            nonrobust   LLR p-value:                 2.521e-08
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3378      0.118      2.864      0.004       0.107       0.569
PPEDUCAT       0.8076      0.

#### Odds ratios with 95% confidence intervals.

In [11]:

params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (np.exp(conf))


           Lower CI  Upper CI        OR
Intercept  1.112492  1.766325  1.401793
PPEDUCAT   1.672679  3.006719  2.242605
W1_C1      1.030273  2.143557  1.486085
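
One additional way to ask whether adding W1_C1 improves on the education-only model is a likelihood-ratio test between the two nested models. This test was not part of the original analysis; the sketch below only reuses the Log-Likelihood values printed in the two summaries. With one added parameter the statistic is compared to a chi-square distribution with 1 degree of freedom, whose upper tail can be written with the standard library as `erfc(sqrt(x / 2))`.

```python
import math

# Log-Likelihood values copied from the two model summaries.
ll_educ_only = -540.82     # W1_F6 ~ PPEDUCAT
ll_educ_party = -538.48    # W1_F6 ~ PPEDUCAT + W1_C1

# Likelihood-ratio statistic for the nested models (1 extra parameter).
lr_stat = 2 * (ll_educ_party - ll_educ_only)

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(f"LR statistic = {lr_stat:.2f}, p = {p_value:.4f}")
```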


### Summary:

Recall that at the beginning of Course 1 we chose the Outlook on Life data set. Our Null Hypothesis ($H_0$) postulated that there is no association between the Level of Education and the Economic Optimism endorsed by survey respondents; our Alternative Hypothesis ($H_a$) was that there is a significant positive association between those two variables.

Before running our logistic regression, we dichotomized the variables as follows.

The variable PPEDUCAT is a categorical variable measuring the respondents' Level of Education with four levels of response:

1) Less than High School
2) High School
3) Some College
4) Bachelor's Degree or Higher

To obtain a binary explanatory variable, we collapsed these into two categories:

0) Up to completion of High School (Less than High School or High School)
1) More than High School (Some College or Bachelor's Degree or Higher)

We collapsed the other explanatory variable and the response variable similarly, then ran our regression.

A logistic regression of Economic Optimism on Level of Education alone shows a significant positive association (OR = 2.274, 95% CI = 1.698-3.046, $p < 0.0001$). The odds of being optimistic about achieving the American Dream are therefore estimated to be between about 1.7 and 3.0 times as high for a respondent with more than a high school education as for a respondent with a high school education or less.

Next we add political party affiliation to the logistic regression model to examine its association with economic optimism. With a p-value of 0.034, it is significantly and positively associated with economic optimism after controlling for level of education. Because the confidence intervals of the two odds ratios overlap, however, we cannot say that one of these variables is more strongly associated with economic optimism than the other.

To recapitulate, after adjusting for political party affiliation, level of education is associated with economic optimism (OR = 2.243, 95% CI = 1.673-3.007), and after adjusting for level of education, political party affiliation is associated with economic optimism (OR = 1.486, 95% CI = 1.030-2.144, p = 0.034). We can say confidently that Level of Education is significantly and positively associated with Economic Optimism, consistent with $H_a$. Adding Political Party Affiliation changed the odds ratio for Level of Education only slightly (from 2.274 to 2.243) and the association remained significant, so the evidence that Political Party Affiliation confounds this association is weak.
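
A common rule of thumb for judging confounding, used here as an assumption and not as the course's own criterion, is to look at how much the primary odds ratio changes when the additional variable enters the model; a change of roughly 10% or more is often taken as a signal. The sketch below applies that heuristic to the two PPEDUCAT odds ratios reported above.

```python
# PPEDUCAT odds ratios copied from the two odds-ratio tables.
or_unadjusted = 2.273999   # education-only model
or_adjusted = 2.242605     # after adding W1_C1

# Relative change in the estimate, expressed as a percentage.
pct_change = 100 * (or_unadjusted - or_adjusted) / or_unadjusted

print(f"Change in odds ratio: {pct_change:.2f}%")
```

Some analysts compute the change on the log-odds scale instead; for a shift this small the two versions lead to the same reading.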
