# Week 6: From cross tabs to multiple regression

Today we will practice the principles of multiple regression.
Using data from the 2018 General Social Survey (GSS), we will examine earnings differentials among individuals and the factors that shape those disparities.

In [1]:
cd "C:\Users\Hyunsu\Desktop\UCM\10. Spring 2020\(SOC 211) Graduate Statistics 2 (TA)\Probelm Sets\#2"
set more off 
use "GSS2018.dta", clear


C:\Users\Hyunsu\Desktop\UCM\10. Spring 2020\(SOC 211) Graduate Statistics 2 (TA)
> \Probelm Sets\#2




Our dependent variable is income in constant dollars (realrinc).

The three sets of independent variables are:
(1) Demographic characteristics: age (age), gender (sex), and race (race)
(2) Human capital: years of schooling (educ) and job tenure (yearsjob)
(3) Family background: father's education (paeduc) and father's socioeconomic index (PASEI10)

In [2]:
keep realrinc age sex race educ yearsjob paeduc PASEI10

In [4]:
%head if _n<=5

Unnamed: 0,age,educ,paeduc,sex,race,yearsjob,realrinc,PASEI10
1,43,14,12,male,white,1,IAP,58.4
2,74,10,0,female,white,.i,IAP,24.6
3,42,16,12,male,white,15,45400,77.4
4,63,16,16,female,white,25,54480,67.7
5,71,18,12,male,black,.i,IAP,58.4


# First, examine our dependent variable using a histogram

In [None]:
hist realrinc

(bin=31, start=227, width=4865.2813)


The income variable is strongly right-skewed rather than normally distributed, so a log transformation is appropriate here. Using logged income also tends to reduce heteroskedasticity in the regression residuals.

In [5]:
gen loginc = log(realrinc)

(985 missing values generated)


In [None]:
hist loginc

If you use log transformed variables, you need to be careful when interpreting the results. We will discuss this soon.
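
The heteroskedasticity point can be checked by running a Breusch-Pagan test before and after the transformation. The following is a minimal sketch, assuming the dataset is still loaded and loginc has been generated; the single predictor (educ) is used purely for illustration and is not the final specification.

```stata
* Illustration only: bivariate models, not the full specification
quietly regress realrinc educ
* Breusch-Pagan / Cook-Weisberg test with income in dollars
estat hettest
quietly regress loginc educ
* Same test with logged income
estat hettest
```

A smaller chi-squared statistic in the second test would be consistent with the log transformation reducing heteroskedasticity, although the test on its own does not prove it.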

# Then, let's take a look at the relationships between our DV and IVs.

Let's generate scatter plots, with fitted lines, between our DV and each continuous IV.

In [None]:
foreach x of var age educ yearsjob paeduc PASEI10 {
  graph twoway (scatter loginc `x') (lfit loginc `x') 
}

FYI, we can also combine these plots into a single figure. For instance,

In [7]:
qui {
  graph twoway (scatter loginc age) (lfit loginc age), name(age_scatter)
  graph twoway (scatter loginc educ) (lfit loginc educ), name(educ_scatter)
  graph twoway (scatter loginc yearsjob) (lfit loginc yearsjob), name(yearsjob_scatter)
  graph twoway (scatter loginc paeduc) (lfit loginc paeduc), name(paeduc_scatter)
  graph twoway (scatter loginc PASEI10) (lfit loginc PASEI10), name(PASEI10_scatter)
}

graph combine age_scatter educ_scatter yearsjob_scatter paeduc_scatter PASEI10_scatter

You may also want to run a correlation analysis to see the linear relationships among the variables.

In [6]:
pwcorr loginc age educ paeduc yearsjob PASEI10, sig


             |   loginc      age     educ   paeduc yearsjob  PASEI10
-------------+------------------------------------------------------
      loginc |   1.0000 
             |
             |
         age |   0.1673   1.0000 
             |   0.0000
             |
        educ |   0.3364  -0.0230   1.0000 
             |   0.0000   0.2658
             |
      paeduc |   0.1416  -0.2686   0.4129   1.0000 
             |   0.0000   0.0000   0.0000
             |
    yearsjob |   0.2931   0.4823   0.0907  -0.0363   1.0000 
             |   0.0000   0.0000   0.0006   0.2391
             |
     PASEI10 |   0.1796  -0.1075   0.3427   0.5259   0.0480   1.0000 
             |   0.0000   0.0000   0.0000   0.0000   0.1081
             |


For gender and race, we can examine group differences using a t-test (gender) and a one-way ANOVA (race).

In [7]:
ttest loginc, by(sex)
oneway loginc race, tab



Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    male |     646    9.808501    .0470873    1.196795    9.716038    9.900964
  female |     717    9.329179      .04586    1.227988    9.239143    9.419215
---------+--------------------------------------------------------------------
combined |    1363    9.556356     .033486    1.236265    9.490666    9.622046
---------+--------------------------------------------------------------------
    diff |            .4793225    .0658176                .3502076    .6084375
------------------------------------------------------------------------------
    diff = mean(male) - mean(female)                              t =   7.2826
Ho: diff = 0                                     degrees of freedom =     1361

    Ha: di

Based on these results, we can sketch the group differences in income and the relationships between the variables.

# Let's run regression analysis

In [8]:
reg loginc age i.sex i.race educ yearsjob paeduc PASEI10


      Source |       SS       df       MS              Number of obs =     886
-------------+------------------------------           F(  8,   877) =   33.18
       Model |  261.082962     8  32.6353703           Prob > F      =  0.0000
    Residual |  862.614325   877  .983596721           R-squared     =  0.2323
-------------+------------------------------           Adj R-squared =  0.2253
       Total |  1123.69729   885  1.26971445           Root MSE      =  .99176

------------------------------------------------------------------------------
      loginc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0010792    .002864     0.38   0.706    -.0045418    .0067002
             |
         sex |
     female  |  -.4815536   .0672573    -7.16   0.000    -.6135577   -.3495494
             |
        race |
      black  |  -.0780743   .1086333    -0.72   0.473     -.291286   

Our regression model can be written as:

log(income) = 7.729 + .001(age) - .481(female) - .078(black) - .039(other) + .119(educ) + .028(yearsjob) + .008(paeduc) + .003(PASEI10)

# How would you interpret the coefficient of .119 for educ?

If the dependent variable were income in dollars, we would say that one additional year of education is associated with a 12-cent ($.119) increase in annual income, holding all other variables constant.

However, as noted, our dependent variable is log transformed, so that interpretation is not correct.
Here, .119 is the expected change in the log of income for a one-unit increase in education. To express this as a change in income itself, we exponentiate the coefficient.

In [9]:
di exp(.119)

1.1263699


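Instead, exp(.119) = 1.126 is a multiplicative factor: one additional year of education is associated with annual income that is about 1.13 times as high, that is, roughly a 12.6% increase, holding all other variables constant.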
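
The percent change implied by a coefficient b on a logged outcome is 100 * (exp(b) - 1). A quick check in Stata, where the local macro name b_educ is only an illustrative label for the rounded coefficient:

```stata
* Rounded educ coefficient from the regression table above
local b_educ = .119
* Percent change in income per additional year of schooling
display 100 * (exp(`b_educ') - 1)
```

This comes to about 12.6, slightly more than the 11.9 that a direct reading of the coefficient would suggest; the two only agree closely when the coefficient is small.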

# Let's do this with a dummy variable. What about the coefficient of -.481 for female? Are you able to interpret the coefficient?

In [10]:
di exp(-.481)

.61816492
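
Because sex is a dummy variable, the exponentiated coefficient compares women with men directly: exp(-.481) is roughly the ratio of women's to men's predicted income when the other covariates are held equal. A sketch of the conversion to a percent difference:

```stata
* Ratio of women's to men's predicted income, other covariates held equal
display exp(-.481)
* The same comparison expressed as a percent difference
display 100 * (exp(-.481) - 1)
```

The second number is roughly -38, so women's predicted income is about 38% lower than men's, not 48% lower as a direct reading of the coefficient might suggest.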


# Based on the regression result, we can compute adjusted means as a function of the IVs using the -margins- command.

In [11]:
margins race


Predictive margins                                Number of obs   =        886
Model VCE    : OLS

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
      white  |   9.769248   .0384574   254.03   0.000     9.693769    9.844727
      black  |   9.691174     .10122    95.74   0.000     9.492512    9.889835
      other  |    9.72988    .099706    97.59   0.000      9.53419     9.92557
------------------------------------------------------------------------------


The -marginsplot- command lets us visualize these predictions.

In [12]:
  marginsplot
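
The margins above are on the log scale. To read them in dollars, one option is to have -margins- exponentiate the linear prediction through its expression() option. This is a sketch, assuming the regression above is still the active estimation result. Note that exponentiating a predicted log yields something closer to a geometric mean of income than an arithmetic mean.

```stata
* Predicted income in dollars by race: exp() of the linear prediction
margins race, expression(exp(predict(xb)))
```

Running -marginsplot- afterwards would plot these dollar-scale predictions in the same way as before.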

In [13]:
margins, at(educ=(0(1)20))


Predictive margins                                Number of obs   =        886
Model VCE    : OLS

Expression   : Linear prediction, predict()

1._at        : educ            =           0

2._at        : educ            =           1

3._at        : educ            =           2

4._at        : educ            =           3

5._at        : educ            =           4

6._at        : educ            =           5

7._at        : educ            =           6

8._at        : educ            =           7

9._at        : educ            =           8

10._at       : educ            =           9

11._at       : educ            =          10

12._at       : educ            =          11

13._at       : educ            =          12

14._at       : educ            =          13

15._at       : educ            =          14

16._at       : educ            =          15

17._at       : educ            =          16

18._at       : educ            =          17

19._at       : educ        

In [None]:
  marginsplot

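We can also compute these predictions for a subsample or at chosen covariate values. Below, the if qualifier restricts the margins to respondents with more than 10 years of job tenure.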

In [14]:
margins race if yearsjob > 10


Predictive margins                                Number of obs   =        279
Model VCE    : OLS

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |
      white  |   10.16887   .0549759   184.97   0.000     10.06097    10.27677
      black  |   10.09079    .109513    92.14   0.000     9.875855    10.30573
      other  |    10.1295   .1113244    90.99   0.000     9.911006    10.34799
------------------------------------------------------------------------------


In [15]:
margins sex, at(race=(1) educ=(12)) atmeans


Adjusted predictions                              Number of obs   =        886
Model VCE    : OLS

Expression   : Linear prediction, predict()
at           : age             =    45.27314 (mean)
               1.sex           =    .5022573 (mean)
               2.sex           =    .4977427 (mean)
               race            =           1
               educ            =          12
               yearsjob        =    8.929176 (mean)
               paeduc          =    12.35892 (mean)
               PASEI10         =    48.36524 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
       male  |   9.715199   .0594978   163.29   0.000     9.598424    9.831973
     female  |   9.233645   .0613634   150.47   0.000     9.113209    9.354081
----

FYI, the user-written command -prvalue- (part of the SPost package) has similar functions, but it does not support factor-variable notation.
Depending on how you specify your model, you can use either one.

# Lastly, Stata lets us report regression results in a preferred style using the -eststo- and -esttab- commands (from the user-written estout package).
For instance,

In [22]:
est clear
quietly eststo: reg loginc age i.sex i.race educ yearsjob paeduc PASEI10
esttab, stats(r2 N, labels("R-Squared" "N")) cells(b(star fmt(3)) se(fmt(3) par)) ///
  nobase mlabel(Model 1) starlevels(* .05 ** .01 *** .001) ///
  coeflabels (_cons "Constant" 2.sex "female" 2.race "black" 3.race "other race" educ "years of schooling" yearsjob "job tenure" paeduc "father's education" PASEI10 "father's socioeconomic index")


----------------------------
                      (1)   
                    Model   
                     b/se   
----------------------------
age                 0.001   
                  (0.003)   
female             -0.482***
                  (0.067)   
black              -0.078   
                  (0.109)   
other race         -0.039   
                  (0.108)   
years of s~g        0.119***
                  (0.013)   
job tenure          0.028***
                  (0.004)   
father's e~n        0.008   
                  (0.011)   
father's s~x        0.003   
                  (0.002)   
Constant            7.729***
                  (0.232)   
----------------------------
R-Squared           0.232   
N                 886.000   
----------------------------
