<a href="https://colab.research.google.com/github/pharringtonp19/business-analytics/blob/main/notebooks/regression3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

For background on the dataset, see: https://cps.ipums.org/cps-action/variables/group

### **Clone Library**

In [1]:
!git clone https://github.com/pharringtonp19/business-analytics.git

Cloning into 'business-analytics'...
remote: Enumerating objects: 949, done.
remote: Counting objects: 100% (566/566), done.
remote: Compressing objects: 100% (215/215), done.
remote: Total 949 (delta 414), reused 428 (delta 328), pack-reused 383 (from 1)
Receiving objects: 100% (949/949), 18.46 MiB | 11.55 MiB/s, done.
Resolving deltas: 100% (542/542), done.


### **Import Packages**

In [14]:
import jax
import jax.numpy as jnp
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

### **Read In Data Set**

In [3]:
df = pd.read_csv('/content/business-analytics/datasets/cps_00009.csv.gz', compression="gzip")
# Recode the 999999999 sentinel in INCTOT as missing. Assigning the result back
# avoids the chained in-place call that pandas warns will stop working in 3.0.
df['INCTOT'] = df['INCTOT'].replace(999999999, np.nan)
# Keep respondents younger than 80.
df = df[df['AGE'] < 80]


### **Show Data**

In [4]:
df.head()

Unnamed: 0,YEAR,SERIAL,MONTH,CPSID,ASECFLAG,ASECWTH,PERNUM,CPSIDV,CPSIDP,ASECWT,AGE,SEX,RACE,MARST,VETSTAT,FTOTVAL,INCTOT
2,2024,6,3,20230200357800,1,601.57,3,202302003578031,20230200357803,1312.04,15,2,100,6,0,0,0.0
3,2024,7,3,20230100363800,1,1559.99,1,202301003638011,20230100363801,1559.99,53,2,100,1,1,50838,10801.0
4,2024,7,3,20230100363800,1,1559.99,2,202301003638021,20230100363802,1559.99,52,1,100,1,1,50838,40037.0
5,2024,8,3,20240200366300,1,580.6,1,202402003663011,20240200366301,580.6,68,1,100,5,1,9648,9648.0
6,2024,8,3,20240200366300,1,580.6,2,202402003663021,20240200366302,1351.94,14,2,100,6,0,9648,


### **Check NAN Values**

In [5]:
print(df['AGE'].isna().sum(), df['INCTOT'].isna().sum())
condition = df['AGE'].isna() | df['INCTOT'].isna()
df = df[~condition]

0 28429
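
The mask-and-filter step above drops every row where either variable is missing. As a minimal sketch on a toy frame (the values below are hypothetical and only mimic the shape of the data), the same rows are kept by `DataFrame.dropna` with a `subset` argument:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 999999999 plays the role of the sentinel code.
toy = pd.DataFrame({
    "AGE": [15, 53, 52, 68, 14],
    "INCTOT": [0.0, 10801.0, 40037.0, 9648.0, 999999999],
})
toy["INCTOT"] = toy["INCTOT"].replace(999999999, np.nan)

# Approach used in the notebook: build a boolean mask and negate it.
mask = toy["AGE"].isna() | toy["INCTOT"].isna()
by_mask = toy[~mask]

# Equivalent one-liner: drop rows with a missing value in either column.
by_dropna = toy.dropna(subset=["AGE", "INCTOT"])

print(by_mask.equals(by_dropna))
```

Either form works; the explicit mask makes it easy to count how many rows are being dropped before dropping them.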


### **Single Variable Regression**

In [16]:
reg0 = smf.ols('INCTOT ~ AGE', data = df).fit()
reg0.params

Unnamed: 0,0
Intercept,30852.211683
AGE,571.919726


### **Multivariate Regression**

In [17]:
reg1 = smf.ols('INCTOT ~ AGE + C(SEX)', data = df).fit()
reg1.params

Unnamed: 0,0
Intercept,42560.167388
C(SEX)[T.2],-24788.693718
AGE,595.412553
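
Read as a prediction rule, these coefficients say that at any fixed age the fitted INCTOT for SEX = 2 is about 24,789 lower than for SEX = 1 (in the IPUMS CPS coding, 1 is male and 2 is female), and each additional year of age adds about 595 for either group. A small sketch of that arithmetic follows; `predict_inctot` is a hypothetical helper, not part of statsmodels, and the constants are copied from the output above.

```python
# Coefficients copied from reg1.params above.
INTERCEPT = 42560.167388
COEF_SEX_2 = -24788.693718   # shift for SEX == 2 relative to SEX == 1
COEF_AGE = 595.412553        # change in fitted INCTOT per year of age


def predict_inctot(age, sex):
    """Fitted INCTOT from reg1 for a given AGE and SEX code (1 or 2)."""
    return INTERCEPT + COEF_AGE * age + (COEF_SEX_2 if sex == 2 else 0.0)


# Holding age fixed at 40, the gap between the groups is the SEX coefficient.
gap = predict_inctot(40, 2) - predict_inctot(40, 1)
print(round(predict_inctot(40, 1), 2), round(predict_inctot(40, 2), 2), round(gap, 2))
```

Because the model has no interaction term, the gap is the same at every age: the two fitted lines are parallel.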


### **View Results**

In [20]:
print(summary_col([reg0, reg1],
                  stars=True,
                  float_format='%0.2f'))


                 INCTOT I   INCTOT II  
---------------------------------------
Intercept      30852.21*** 42560.17*** 
               (673.45)    (705.49)    
AGE            571.92***   595.41***   
               (13.94)     (13.79)     
C(SEX)[T.2]                -24788.69***
                           (493.47)    
R-squared      0.02        0.04        
R-squared Adj. 0.02        0.04        
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
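
The stars reflect how large each coefficient is relative to its standard error. As a rough check, the sketch below recomputes the t-ratios from the rounded numbers printed in the table, so the results are approximate. The 2.576 cutoff is the two-sided 1% critical value of the standard normal, used here on the assumption that the sample is large.

```python
# (coefficient, standard error) pairs copied from the rounded table above.
table = {
    "reg0 AGE": (571.92, 13.94),
    "reg1 AGE": (595.41, 13.79),
    "reg1 C(SEX)[T.2]": (-24788.69, 493.47),
}

CRIT_1PCT = 2.576  # two-sided 1% cutoff under the normal approximation

# t-ratio = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in table.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.1f}, beyond 1% cutoff: {abs(t) > CRIT_1PCT}")
```

Exact t-statistics and p-values are available on the fitted results as `reg0.tvalues` and `reg0.pvalues`.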


### **Residualized Viewpoint**

In [25]:
df['AGE_Fitted'] = smf.ols('AGE ~ C(SEX)', data = df).fit().fittedvalues
df['AGE_residualized'] = df['AGE'] - df['AGE_Fitted']
reg_residualized = smf.ols('INCTOT ~ 0 + AGE_residualized', data=df).fit()
reg_residualized.params

Unnamed: 0,0
AGE_residualized,595.412553
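
The residualized coefficient equals the AGE coefficient from the multivariate regression (595.412553). This is the Frisch-Waugh-Lovell result: regressing the outcome on the part of AGE that SEX cannot explain recovers the same slope as the full regression. A minimal sketch on synthetic data, assuming only NumPy; the variable names and data-generating numbers are hypothetical and unrelated to the CPS file.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical data: a binary group code, an age that depends on it, and income.
sex = rng.integers(1, 3, size=n)              # codes 1 and 2
female = (sex == 2).astype(float)             # dummy for code 2
age = 40 + 3 * female + rng.normal(0, 10, size=n)
income = 30_000 + 600 * age - 25_000 * female + rng.normal(0, 5_000, size=n)

# Full regression: income on intercept, age, and the dummy.
X_full = np.column_stack([np.ones(n), age, female])
beta_full, *_ = np.linalg.lstsq(X_full, income, rcond=None)

# Step 1: residualize age on the controls (intercept and dummy).
X_ctrl = np.column_stack([np.ones(n), female])
gamma, *_ = np.linalg.lstsq(X_ctrl, age, rcond=None)
age_resid = age - X_ctrl @ gamma

# Step 2: regress income on residualized age with no intercept.
beta_resid = (age_resid @ income) / (age_resid @ age_resid)

print(beta_full[1], beta_resid)
```

The intercept can be omitted in the second step because the residuals have mean zero by construction, so including it would not change the slope.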
