# Linear and Logistic Regression Using the NHANES Dataset
---
## Linear Regression

In [1]:
# Imports 
import pandas as pd 
import numpy as np 
from pathlib import Path
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
%matplotlib inline

In [2]:
# import NHANES data
df = pd.read_csv(Path('data/nhanes_2015_2016.csv'))

# keep only the columns used throughout this notebook
# (.copy() avoids pandas' SettingWithCopyWarning when the subset is modified later)
df = df[["BPXSY1", "RIDAGEYR", "RIAGENDR", "RIDRETH1", "DMDEDUC2", "BMXBMI", "SMQ020"]].copy()

# rename columns with meaningful names
df.columns = ['systolic_blood_pressure', 'age','gender','race_hispanic_origin', 'education_level', 'body_mass_index', 'smoke']

# change integer codes to string values
df.gender = df.gender.replace({1:'male', 2:'female'})
df.race_hispanic_origin = df.race_hispanic_origin.replace({1:'Mexican American', 2:'Other Hispanic',3:'Non-Hispanic White',
                                                        4:'Non-Hispanic Black', 5:'Other Race'})
df.education_level = df.education_level.replace({1:'Less than 9th grade', 2:'9-11th grade', 3:'Highschool graduate',4:'Some college',
                                                5:'College graduate', 7:'Refused',9:"Don't Know"})
df.smoke = df.smoke.replace({1:'yes', 2:'no', 7:np.nan, 9:np.nan})

# drop all null values
df.dropna(inplace = True)
df.sample(5)

| | systolic_blood_pressure | age | gender | race_hispanic_origin | education_level | body_mass_index | smoke |
|---|---|---|---|---|---|---|---|
| 5445 | 132.0 | 38 | female | Non-Hispanic Black | Some college | 29.5 | yes |
| 5505 | 124.0 | 56 | female | Mexican American | Some college | 27.8 | yes |
| 3281 | 142.0 | 55 | male | Non-Hispanic White | Some college | 27.1 | no |
| 748 | 124.0 | 42 | male | Non-Hispanic Black | Highschool graduate | 43.4 | no |
| 1829 | 106.0 | 33 | male | Mexican American | Less than 9th grade | 25.6 | yes |


## Question 1:

Use linear regression to relate the expected body mass index (BMI) to a person's age.

[Reference](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.from_formula.html)
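Before fitting with statsmodels, it can help to see what ordinary least squares computes for a single predictor. The sketch below is only an illustration: `fit_simple_ols` is a hypothetical helper (not part of statsmodels), and the toy numbers are made up rather than taken from NHANES. It uses the textbook closed-form solution, slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x), and also reports R-squared as 1 - SS_res / SS_tot.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope, r_squared) for the model y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    # Closed-form least-squares estimates for one predictor.
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x

    # R-squared: share of the variation in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return intercept, slope, r_squared


# Made-up toy data for illustration only (not NHANES values).
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]

toy_intercept, toy_slope, toy_r_squared = fit_simple_ols(toy_x, toy_y)
print(f"intercept={toy_intercept:.3f}, slope={toy_slope:.3f}, R-squared={toy_r_squared:.3f}")
```

`sm.OLS.from_formula` estimates the same two quantities (reported as `Intercept` and `age`) and additionally provides standard errors, test statistics and diagnostics.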

In [3]:
# instantiate ordinary least squares model and fit
model = sm.OLS.from_formula('body_mass_index ~ age', data=df).fit()

# list the public attributes and methods of the fitted results object
print([x for x in dir(model) if not x.startswith('_')])

['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']


In [4]:
# print model summary
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | body_mass_index | R-squared: | 0.0 |
| Model: | OLS | Adj. R-squared: | 0.0 |
| Method: | Least Squares | F-statistic: | 2.52 |
| Date: | Fri, 17 Dec 2021 | Prob (F-statistic): | 0.112 |
| Time: | 01:13:06 | Log-Likelihood: | -17124.0 |
| No. Observations: | 5094 | AIC: | 34250.0 |
| Df Residuals: | 5092 | BIC: | 34270.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 29.0754 | 0.291 | 100.077 | 0.000 | 28.506 | 29.645 |
| age | 0.0088 | 0.006 | 1.587 | 0.112 | -0.002 | 0.020 |

| | | | |
|---|---|---|---|
| Omnibus: | 934.389 | Durbin-Watson: | 2.011 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1852.985 |
| Skew: | 1.105 | Prob(JB): | 0.0 |
| Kurtosis: | 4.962 | Cond. No. | 156.0 |


In [5]:
# print parameter names
model.params.keys()

Index(['Intercept', 'age'], dtype='object')

In [6]:
# print R_squared
print(f'R-square: {model.rsquared}')

# print average bmi difference between 40 year old and 20 year old 
print(f'average bmi difference between a 40 year old and a 20 year old: {model.params.age*40 - model.params.age*20}')

R-square: 0.000494668079549121
average bmi difference between a 40 year old and a 20 year old: 0.1756872046290707
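
The difference printed above does not depend on the intercept. Writing $\hat\beta_0$ for the fitted intercept and $\hat\beta_{age}$ for the fitted age slope (notation introduced here for illustration), the intercept cancels and the 20-year gap is simply 20 times the slope:

$\widehat{BMI}(40) - \widehat{BMI}(20) = (\hat\beta_0 + 40\,\hat\beta_{age}) - (\hat\beta_0 + 20\,\hat\beta_{age}) = 20\,\hat\beta_{age} \approx 20 \times 0.0088 \approx 0.176$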


Based on the fitted coefficients, the expected body mass index increases by about `0.0088` units for each additional year of age, so the model predicts a slightly higher BMI for an older person than for a younger one. However, the `p-value` for the `age` coefficient shows that this *association is not statistically significant*.

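$H_0: \beta_{age} = 0\ \ \ \text{(no linear association between age and BMI)}$

$H_a: \beta_{age} \neq 0\ \ \ \text{(age and BMI are linearly associated)}$

The `P>|t|` value reported by statsmodels is already the two-sided p-value for this test, so it is compared directly with the significance level: $0.112 > 0.05$. Halving it, $0.112 \div 2 = 0.056$, would give the one-sided p-value, which is also above $0.05$.

At the 5% significance level (95% confidence), we fail to reject the null hypothesis. Consistent with this, the 95% confidence interval for the slope, $[-0.002,\ 0.020]$, contains zero. The data do not provide sufficient evidence of a linear relationship between age and body mass index. Furthermore, the `R-squared` value of about `0.0005` means that age explains almost none of the variation in body mass index.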
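
As a rough check on how the `P>|t|` column relates to the t statistic, the sketch below recomputes the p-value for `age` from the rounded value `t = 1.587` in the summary table. It assumes that, with 5092 residual degrees of freedom, the t distribution is close enough to the standard normal to use `statistics.NormalDist` from the standard library. statsmodels itself uses the t distribution, so the result is an approximation.

```python
from statistics import NormalDist

# t statistic for the age coefficient, copied (rounded) from the summary table.
t_stat = 1.587

# Assumption: with 5092 residual degrees of freedom the t distribution is
# practically indistinguishable from the standard normal, so the normal CDF
# is used here as an approximation.
upper_tail = 1 - NormalDist().cdf(abs(t_stat))

p_one_sided = upper_tail       # area in one tail only
p_two_sided = 2 * upper_tail   # area in both tails, the quantity shown as P>|t|

print(f"one-sided p-value: {p_one_sided:.3f}")
print(f"two-sided p-value: {p_two_sided:.3f}")
```

The two-sided value comes out close to the reported `0.112` and the one-sided value close to `0.056`. Small discrepancies are expected because the t statistic is rounded and the normal distribution is used in place of the t distribution.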