# Using and Interpreting Indicator (Dummy) Variables

Oh, man -- Indicator Variables. We give them so little time in econometrics courses, and yet I can think of few topics I've seen cause more problems for established scholars. So let's see if we can give them the treatment they deserve, and hopefully by the end of this, you'll have a firm understanding not only of how to use Indicator Variables, but also (and more importantly) how to interpret them.

This tutorial proceeds as follows:

1) Discussion of indicator variables for categorical variables with only two values (like gender, at least as it is implemented in most surveys)

2) Discussion of indicator variables for categorical variables with more than two values (like race or party)

3) Discussion of indicator variables interacted with continuous variables




## What are indicator variables?

Indicator variables -- sometimes also referred to as dummy variables, though I don't know why -- are variables that take on only the values 0 and 1, and are used to *indicate* whether a given observation belongs to a discrete category in a way that can be used in regression models. 

For example, indicator variables can be used to indicate if a survey respondent is a woman (if the variable is 1 for women, 0 otherwise) or a democrat (if the variable is 1 for democrats, 0 otherwise). In addition, as discussed in more detail below, a collection of indicator variables can also be used for categorical variables that take on more than 2 values -- for example, a *collection* of indicator variables can be used to represent an individual's political party registration. 

## The TWO things that you must understand when using indicator variables

When you put an indicator variable in a regression model, there are two things you must always keep in mind about interpreting the coefficients associated with the indicator variable: 

1) The coefficient on an indicator variable is an estimate of the average **DIFFERENCE** in the dependent variable for the group identified by the indicator variable (after taking into account other variables in the regression) and

2) the **REFERENCE GROUP**, which is the set of observations for which the indicator variable is always zero. 

If you always remember that the coefficient on an indicator variable is an estimate of a **DIFFERENCE** with respect to a **REFERENCE GROUP** (also sometimes referred to as the "omitted category"), you're 90% of the way to understanding indicator variables. 

OK, let's get concrete. 


# Indicator Variables for variables with 2 categories

Let's do a simple example predicting voter turnout using data from North Carolina. Suppose in particular we're interested in looking at how turnout varies by gender, which is dichotomous in the North Carolina voter file (obviously this is somewhat problematic given what we've come to know about gender, but in most datasets you'll find a dichotomous coding). 

In [1]:
# Load data we'll use. Should work for anyone. 
voters = read.csv("https://raw.githubusercontent.com/nickeubank/css_tutorials/master/exercise_data/voter_turnout.csv")
head(voters)

age,gender,voted,party,race
71,FEMALE,1,UNAFFILIATED,WHITE
47,MALE,1,UNAFFILIATED,WHITE
29,MALE,0,DEMOCRATIC,WHITE
60,MALE,1,REPUBLICAN,WHITE
84,MALE,0,DEMOCRATIC,WHITE
56,MALE,1,DEMOCRATIC,BLACK or AFRICAN AMERICAN


In [2]:
# Create a 0/1 variable for female. 
voters$female = as.numeric(voters$gender == "FEMALE") # you can also leave as TRUE / FALSE, 
                                                      # but easier to see this way. 
head(voters) 

age,gender,voted,party,race,female
71,FEMALE,1,UNAFFILIATED,WHITE,1
47,MALE,1,UNAFFILIATED,WHITE,0
29,MALE,0,DEMOCRATIC,WHITE,0
60,MALE,1,REPUBLICAN,WHITE,0
84,MALE,0,DEMOCRATIC,WHITE,0
56,MALE,1,DEMOCRATIC,BLACK or AFRICAN AMERICAN,0


In [3]:
lm(voted ~ female, voters)


Call:
lm(formula = voted ~ female, data = voters)

Coefficients:
(Intercept)       female  
    0.74610      0.00754  


OK, so how do we interpret this coefficient of 0.0075 on female? As we said before, it is the average **DIFFERENCE** in the dependent variable (whether the person votes) with respect to a **REFERENCE GROUP**. In this case, the reference group is anyone for whom the indicator is always equal to zero, which is the set of male voters. 

So this says that women are about 0.75 percentage points *more likely to vote (in North Carolina) than men.*
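
If you want to convince yourself that the coefficient really is just a difference in group averages, here's a minimal sketch. It uses simulated data rather than the North Carolina file: the `toy` data frame and the turnout rates in it are made up purely for illustration.

```r
# Simulated stand-in for the voter file (made-up data, NOT the NC file).
set.seed(42)
toy <- data.frame(female = rbinom(1000, 1, 0.5))
toy$voted <- rbinom(1000, 1, 0.70 + 0.05 * toy$female)

fit <- lm(voted ~ female, data = toy)

# Average turnout in each group, computed directly.
mean_men   <- mean(toy$voted[toy$female == 0])
mean_women <- mean(toy$voted[toy$female == 1])

# The intercept is the average for the reference group (men), and the
# coefficient on female is the women-minus-men difference in averages.
all.equal(unname(coef(fit)["(Intercept)"]), mean_men)
all.equal(unname(coef(fit)["female"]), mean_women - mean_men)
```

Both comparisons should come back `TRUE`: with a single indicator and nothing else in the model, the regression is just a fancy way of comparing two group means.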

Now let's try a more interesting example: Democrats. 

In [4]:
unique(voters$party) # OK, note here that there are THREE party registrations in this data. 

In [5]:
# So let's do the same thing as before for democrats
voters$democrat = as.numeric(voters$party == "DEMOCRATIC")
head(voters)

age,gender,voted,party,race,female,democrat
71,FEMALE,1,UNAFFILIATED,WHITE,1,0
47,MALE,1,UNAFFILIATED,WHITE,0,0
29,MALE,0,DEMOCRATIC,WHITE,0,1
60,MALE,1,REPUBLICAN,WHITE,0,0
84,MALE,0,DEMOCRATIC,WHITE,0,1
56,MALE,1,DEMOCRATIC,BLACK or AFRICAN AMERICAN,0,1


In [6]:
lm(voted ~ democrat, voters)


Call:
lm(formula = voted ~ democrat, data = voters)

Coefficients:
(Intercept)     democrat  
    0.77535     -0.05619  


So now how do we interpret this coefficient on democrat (-0.056)? As before it's the average **DIFFERENCE** in the dependent variable between the indicated group (democrats) and the reference group. But what's the reference group? Republicans?

No -- the **reference group** or **omitted category** is anyone for whom the indicator variable is always zero -- in this case, all non-democrats, whether they're Republicans or Unaffiliated. 

So this result says that Democrats are less likely to vote than non-democrats, but NOT that they're less likely to vote than republicans per se. 
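
To see the "everyone else" reference group in action, here's a small sketch with simulated three-party data. The `toy` data frame and the turnout rates in `turnout_prob` are invented for illustration and have nothing to do with the real voter file.

```r
# Simulated three-party data (made up for illustration).
set.seed(1)
party <- sample(c("DEMOCRATIC", "REPUBLICAN", "UNAFFILIATED"),
                3000, replace = TRUE)
turnout_prob <- c(DEMOCRATIC = 0.70, REPUBLICAN = 0.80, UNAFFILIATED = 0.74)
toy <- data.frame(party = party,
                  voted = rbinom(3000, 1, unname(turnout_prob[party])))
toy$democrat <- as.numeric(toy$party == "DEMOCRATIC")

fit <- lm(voted ~ democrat, data = toy)

# Democrats vs. ALL non-democrats (Republicans and Unaffiliated pooled).
diff_vs_everyone_else <- mean(toy$voted[toy$democrat == 1]) -
  mean(toy$voted[toy$democrat == 0])

# Democrats vs. Republicans only.
diff_vs_republicans <- mean(toy$voted[toy$party == "DEMOCRATIC"]) -
  mean(toy$voted[toy$party == "REPUBLICAN"])

# The coefficient matches the first comparison, not the second.
c(coefficient      = unname(coef(fit)["democrat"]),
  vs_everyone_else = diff_vs_everyone_else,
  vs_republicans   = diff_vs_republicans)
```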

So how do we deal with multiple categories? With multiple indicator variables!

# Indicator Variables for variables with more than 2 categories

To deal with categorical variables with more than 2 categories, we create indicator variables for all values of the variable *except one*. The one group for which we do not create an indicator variable will become the **reference group** for the regression. The choice of which value to make the reference category won't substantively change the results of the regression -- for example, if you also have a control for age, the coefficient on age will always be the same regardless of the reference group used -- but it does influence how easily you can interpret the results of the regression. 

So since we're interested in the difference in turnout between Democrats and Republicans, let's make Republicans the reference category, and make dummies for democrats and unaffiliated voters. 

In [7]:
voters$unaffiliated = as.numeric(voters$party == "UNAFFILIATED")
head(voters)

age,gender,voted,party,race,female,democrat,unaffiliated
71,FEMALE,1,UNAFFILIATED,WHITE,1,0,1
47,MALE,1,UNAFFILIATED,WHITE,0,0,1
29,MALE,0,DEMOCRATIC,WHITE,0,1,0
60,MALE,1,REPUBLICAN,WHITE,0,0,0
84,MALE,0,DEMOCRATIC,WHITE,0,1,0
56,MALE,1,DEMOCRATIC,BLACK or AFRICAN AMERICAN,0,1,0


In [8]:
summary(lm(voted ~ democrat + unaffiliated, voters))


Call:
lm(formula = voted ~ democrat + unaffiliated, data = voters)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7988  0.2012  0.2012  0.2808  0.2808 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.798811   0.007438 107.399  < 2e-16 ***
democrat     -0.079652   0.009868  -8.072 7.74e-16 ***
unaffiliated -0.060559   0.011950  -5.068 4.10e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4315 on 9916 degrees of freedom
Multiple R-squared:  0.006737,	Adjusted R-squared:  0.006536 
F-statistic: 33.63 on 2 and 9916 DF,  p-value: 2.786e-15


So how do we interpret this? 

First, we see that the coefficient on democrat is -0.08. That means that the DIFFERENCE in turnout between democrats and the reference group (here, Republicans) is 8 percentage points. So democrats have 8 percentage point lower turnout on average in this data than Republicans. 

Second, we see that the coefficient on unaffiliated is -0.06. That means that the DIFFERENCE in turnout between unaffiliated voters and the reference group (here, Republicans) is 6 percentage points. So unaffiliated voters have 6 percentage point lower turnout on average in this data than Republicans. 

Moreover, the p-values on these indicator variables tell us if these differences are significant. And indeed, they show clearly that the difference between democrats and republicans is significant (p-value = 7.7e-16), and the difference between unaffiliated voters and republicans is significant (p-value = 4.1e-07). 

But what about the difference between democrats and unaffiliated voters? Well, turns out the regression doesn't give us that directly. To get that, we have to do some additional math. 

For example, the difference between democrats and unaffiliated voters is equal to:

dem - unaffiliated = (dem - republican) - (unaffiliated - republican)
                   = -0.079652 - (-0.060559)
                   = -0.019093
                   
So in other words, dems have about 2 percentage points lower turnout than unaffiliated voters. 
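
You don't have to copy numbers off the screen to do this arithmetic. Here's a rough sketch of pulling the coefficients out with `coef()`, again on simulated data (the `toy` data frame and its turnout rates are made up for illustration):

```r
# Simulated three-party data (made up for illustration).
set.seed(7)
party <- sample(c("DEMOCRATIC", "REPUBLICAN", "UNAFFILIATED"),
                3000, replace = TRUE)
turnout_prob <- c(DEMOCRATIC = 0.72, REPUBLICAN = 0.80, UNAFFILIATED = 0.74)
toy <- data.frame(party = party,
                  voted = rbinom(3000, 1, unname(turnout_prob[party])))
toy$democrat     <- as.numeric(toy$party == "DEMOCRATIC")
toy$unaffiliated <- as.numeric(toy$party == "UNAFFILIATED")

# Republicans are the reference group, as in the regression above.
fit <- lm(voted ~ democrat + unaffiliated, data = toy)
b <- coef(fit)

# (dem - republican) - (unaffiliated - republican) = dem - unaffiliated
dem_minus_unaff <- unname(b["democrat"] - b["unaffiliated"])

# The same quantity computed straight from the two group averages.
direct <- mean(toy$voted[toy$democrat == 1]) -
  mean(toy$voted[toy$unaffiliated == 1])

all.equal(dem_minus_unaff, direct)
```

The Republican average cancels out of the subtraction, which is why the two numbers agree.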

But is this difference statistically significant? For that we use the `linearHypothesis` function from the `car` package:

In [14]:
#install.packages("car")
library(car)
result = lm(voted ~ democrat + unaffiliated, voters)
linearHypothesis(result, "democrat = unaffiliated")

Res.Df,RSS,Df,Sum of Sq,F,Pr(>F)
9917,1846.441,,,,
9916,1845.917,1.0,0.523837,2.813977,0.09347796


Voila -- p-value of 0.09.
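
If you're curious what a test like this is doing, here's a sketch of the same kind of test done by hand in base R. It assumes the usual t-test for a difference between two coefficients, built from the coefficient covariance matrix that `vcov()` returns. The data are simulated (the `toy` data frame is made up), so the numbers won't match the output above.

```r
# Simulated three-party data (made up for illustration).
set.seed(7)
party <- sample(c("DEMOCRATIC", "REPUBLICAN", "UNAFFILIATED"),
                3000, replace = TRUE)
turnout_prob <- c(DEMOCRATIC = 0.72, REPUBLICAN = 0.80, UNAFFILIATED = 0.74)
toy <- data.frame(party = party,
                  voted = rbinom(3000, 1, unname(turnout_prob[party])))
toy$democrat     <- as.numeric(toy$party == "DEMOCRATIC")
toy$unaffiliated <- as.numeric(toy$party == "UNAFFILIATED")

fit <- lm(voted ~ democrat + unaffiliated, data = toy)
b <- coef(fit)
V <- vcov(fit)  # estimated variances and covariances of the coefficients

# Estimated difference between the two coefficients.
diff_hat <- unname(b["democrat"] - b["unaffiliated"])

# Var(a - b) = Var(a) + Var(b) - 2 * Cov(a, b)
se <- sqrt(V["democrat", "democrat"] +
           V["unaffiliated", "unaffiliated"] -
           2 * V["democrat", "unaffiliated"])

t_stat  <- diff_hat / se
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

# With a single restriction, the F statistic is the t statistic squared.
c(difference = diff_hat, t = t_stat, F = t_stat^2, p = p_value)
```

The key ingredient is the covariance term: the two coefficients share the same reference group, so their estimates are correlated and you can't just add up the two standard errors.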

Wanna confirm it? Let's change our reference group to unaffiliated, then look at the coefficient on democrat, which will then be the difference between democrats and the new reference group (unaffiliated voters).

In [18]:
voters$republican = as.numeric(voters$party == "REPUBLICAN")
summary(lm(voted ~ democrat + republican, voters))


Call:
lm(formula = voted ~ democrat + republican, data = voters)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7988  0.2012  0.2012  0.2808  0.2808 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.738252   0.009353  78.932  < 2e-16 ***
democrat    -0.019092   0.011382  -1.677   0.0935 .  
republican   0.060559   0.011950   5.068  4.1e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4315 on 9916 degrees of freedom
Multiple R-squared:  0.006737,	Adjusted R-squared:  0.006536 
F-statistic: 33.63 on 2 and 9916 DF,  p-value: 2.786e-15


As we can now see, the coefficient on democrat (now the difference between democrats and unaffiliated voters) is what we'd calculated above (-0.019) and has the p-value we calculated previously (0.09). This just goes to show that the choice of reference group doesn't change what's actually being estimated; it just changes the interpretation of the coefficients, which statistics pop right out of the regression output, and which values require a little extra work.


One other quick note: indicator variables are so common that there's a shorthand to ask R to just make the indicators when it's running the regression -- `factor()`. This often works well, but by default it makes the first level of the variable the reference group (for text values, the alphabetically first one, here `DEMOCRATIC`), which can be annoying. So to run a regression with indicator variables for party, instead of building the indicators by hand as in `lm(voted ~ democrat + unaffiliated, voters)`, you can run: 

In [10]:
lm(voted ~ factor(party), voters)


Call:
lm(formula = voted ~ factor(party), data = voters)

Coefficients:
              (Intercept)    factor(party)REPUBLICAN  
                  0.71916                    0.07965  
factor(party)UNAFFILIATED  
                  0.01909  
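
If the default reference group annoys you, one option is base R's `relevel()`, which moves a level of your choosing to the front of the factor. A minimal sketch, using a small simulated data frame (`toy` and the `party_f` column are made-up names for illustration):

```r
# Small simulated data set (made up for illustration).
set.seed(3)
toy <- data.frame(
  party = sample(c("DEMOCRATIC", "REPUBLICAN", "UNAFFILIATED"),
                 300, replace = TRUE),
  voted = rbinom(300, 1, 0.75)
)

# Default: levels are sorted, so DEMOCRATIC comes first and
# would be the reference group.
levels(factor(toy$party))

# Move REPUBLICAN to the front so it becomes the reference group.
toy$party_f <- relevel(factor(toy$party), ref = "REPUBLICAN")
levels(toy$party_f)

# The regression now reports differences relative to Republicans.
fit <- lm(voted ~ party_f, data = toy)
names(coef(fit))
```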


# Interactions: Like plain indicators, but for differences in SLOPE rather than differences in LEVELS

Congratulations! You're a pro at indicator variables. But now let's say you want to do INTERACTIONS!

An interaction (where you interact an indicator variable with a continuous variable) is just like a regular indicator variable, except that instead of reporting the difference in **average value** of the dependent variable between the indicated group and the reference group, the coefficient on an interaction term is the **DIFFERENCE** in the **SLOPE** associated with the continuous variable between the indicated group and the reference group. 

Let's be concrete: let's suppose we think that turnout among men increases as they get older by a larger amount than for women. In other words, we think that turnout increases with age for both groups, but that there's a **DIFFERENCE** in the amount it increases with age. 

To test this, we need to create some interaction terms. But first, a quick note: when doing interactions, it's critical to not only include all the interaction terms that interest you, but also all the variables in the interaction as stand-alone variables. So for this we want `age` interacted with `female`. But while the coefficient on that interaction is what we're interested in, to get the right results we also need to include just `age` and just `female`.
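
R's formula syntax can take care of this bookkeeping: writing `female * age` in a formula expands to `female + age + female:age`, so the stand-alone terms come along automatically. A quick sketch on simulated data (the `toy` data frame and the turnout probabilities are invented for illustration) showing that the shorthand and the by-hand version give the same coefficients:

```r
# Simulated data (made up for illustration).
set.seed(11)
n <- 2000
toy <- data.frame(female = rbinom(n, 1, 0.5),
                  age    = sample(18:90, n, replace = TRUE))
toy$voted <- rbinom(n, 1, 0.60 + 0.05 * toy$female + 0.002 * toy$age -
                          0.001 * toy$age * toy$female)

# By hand: build the interaction column, include all three terms.
toy$age_x_female <- toy$age * toy$female
by_hand <- lm(voted ~ female + age + age_x_female, data = toy)

# Shorthand: `*` adds both stand-alone terms plus the interaction.
shorthand <- lm(voted ~ female * age, data = toy)
names(coef(shorthand))

# Same estimates either way; only the coefficient labels differ.
all.equal(unname(coef(by_hand)), unname(coef(shorthand)))
```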


In [11]:
voters$age_x_female = voters$age * voters$female
summary(lm(voted ~ female + age + age_x_female, voters))


Call:
lm(formula = voted ~ female + age + age_x_female, data = voters)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7953  0.2047  0.2443  0.2535  0.2817 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6960718  0.0246740  28.211  < 2e-16 ***
female        0.0933369  0.0330511   2.824  0.00475 ** 
age           0.0008553  0.0004069   2.102  0.03559 *  
age_x_female -0.0014570  0.0005411  -2.693  0.00710 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4328 on 9915 degrees of freedom
Multiple R-squared:  0.0008071,	Adjusted R-squared:  0.0005047 
F-statistic:  2.67 on 3 and 9915 DF,  p-value: 0.04589


Ok, so the coefficient on our interaction is -0.001457. Does that mean that as women get older, their turnout rate declines by 0.146 percentage points per year? **NO!**

It says that the change in women's turnout rate per year of age is 0.146 percentage points LOWER THAN THE CHANGE FOR MEN. It is the **DIFFERENCE** in slopes between the two groups.

The coefficient on `age` tells us how turnout varies with age **for the reference group**. So it says that for men, turnout increases by 0.09 percentage points per year. 

But if you want to know how women's turnout varies with age, you have to **ADD** the coefficient on `age` (the rate of change for men) and the coefficient on `age_x_female` (the difference between the rates for women and men). 


So going through all these coefficients, we have: 

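`female` (0.09): At age zero (an extrapolation well outside the data), women are 9 percentage points more likely to vote than men. Because of the interaction, the gap between women and men changes with age, so this coefficient is the gap only where `age` is 0. 
`age` (0.0009): As men get one year older, they become 0.09 percentage points more likely to vote. 
`age_x_female` (-0.0015): As women get one year older, the change in the likelihood they vote is 0.15 percentage points *lower* than the change for men. 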

So how much does women's turnout increase if they age one year? 0.0008553 + -0.0014570 = -0.0006. So it actually declines by 0.06 percentage points a year. 
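
Here's the same addition done in code, as a sketch on simulated data (the `toy` data frame and its turnout probabilities are made up). As a sanity check, it also fits a regression on women only: because the model lets both the level and the slope differ by gender, the women-only slope should match the sum of the two coefficients.

```r
# Simulated data (made up for illustration).
set.seed(11)
n <- 2000
toy <- data.frame(female = rbinom(n, 1, 0.5),
                  age    = sample(18:90, n, replace = TRUE))
toy$voted <- rbinom(n, 1, 0.60 + 0.05 * toy$female + 0.002 * toy$age -
                          0.001 * toy$age * toy$female)
toy$age_x_female <- toy$age * toy$female

fit <- lm(voted ~ female + age + age_x_female, data = toy)
b <- coef(fit)

# Slope for the reference group (men) is the coefficient on age.
slope_men <- unname(b["age"])

# Slope for women is the men's slope PLUS the difference in slopes.
slope_women <- unname(b["age"] + b["age_x_female"])

# Cross-check: regress turnout on age using only the women.
women_only <- lm(voted ~ age, data = toy[toy$female == 1, ])
all.equal(slope_women, unname(coef(women_only)["age"]))
```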

Now let's talk statistical significance. The p-value on age shows us that there's a statistically significant relationship between age and turnout **for men**. The p-value for `age_x_female` tells us that there's a statistically significant **difference** between men and women in how turnout varies with age. But is there a statistically significant relationship between age and turnout for women?

Again, we don't actually get an answer from our regression. To see, we have to run the following: 


In [15]:
result = lm(voted ~ female + age + age_x_female, voters)
linearHypothesis(result, "age + age_x_female = 0")

Res.Df,RSS,Df,Sum of Sq,F,Pr(>F)
9916,1857.47,,,,
9915,1856.937,1.0,0.5331515,2.84673,0.09159147


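So the p-value is 0.09 for the relationship between age and turnout for women, which means that relationship is not statistically significant at the conventional 0.05 level.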
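
For the curious, here's a sketch of how a test of "sum of two coefficients equals zero" can be done by hand in base R, assuming the usual t-test built from the coefficient covariance matrix. It runs on simulated data (the `toy` data frame is made up), so the numbers won't match the output above.

```r
# Simulated data (made up for illustration).
set.seed(11)
n <- 2000
toy <- data.frame(female = rbinom(n, 1, 0.5),
                  age    = sample(18:90, n, replace = TRUE))
toy$voted <- rbinom(n, 1, 0.60 + 0.05 * toy$female + 0.002 * toy$age -
                          0.001 * toy$age * toy$female)
toy$age_x_female <- toy$age * toy$female

fit <- lm(voted ~ female + age + age_x_female, data = toy)
b <- coef(fit)
V <- vcov(fit)  # estimated variances and covariances of the coefficients

# Women's slope: men's slope plus the difference in slopes.
slope_women <- unname(b["age"] + b["age_x_female"])

# Var(a + b) = Var(a) + Var(b) + 2 * Cov(a, b)
se <- sqrt(V["age", "age"] +
           V["age_x_female", "age_x_female"] +
           2 * V["age", "age_x_female"])

t_stat  <- slope_women / se
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

c(slope_women = slope_women, se = se, t = t_stat, p = p_value)
```

Another way to get the same answer is to flip the indicator (make women the reference group with a `male` dummy), in which case the coefficient on `age` is the women's slope and its p-value pops right out of the regression table.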