# Logistic Regression: custom reference category

When reporting odds ratios for a categorical variable in a logistic regression model, one category is chosen as the "reference", and odds ratios for the other categories are reported relative to it.

What if we want to report numbers using another category as the base case?

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import scipy
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf





# Logistic Regression

Consider this excellent writeup of the statistical analysis of an admissions data set:

https://stats.idre.ucla.edu/r/dae/logit-regression/

In [2]:
# data: whether students were admitted (admit=1) or not (admit=0), with their GRE and GPA scores and the rank of their institution
raw_data = pd.read_csv('https://stats.idre.ucla.edu/stat/data/binary.csv')
raw_data

,admit,gre,gpa,rank
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.00,1
3,1,640,3.19,4
4,0,520,2.93,4
...,...,...,...,...
395,0,620,4.00,2
396,0,560,3.04,3
397,0,460,2.63,2
398,0,700,3.65,2


In [3]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   admit   400 non-null    int64  
 1   gre     400 non-null    int64  
 2   gpa     400 non-null    float64
 3   rank    400 non-null    int64  
dtypes: float64(1), int64(3)
memory usage: 12.6 KB


In [4]:
raw_data.describe()

,admit,gre,gpa,rank
count,400.0,400.0,400.0,400.0
mean,0.3175,587.7,3.3899,2.485
std,0.466087,115.516536,0.380567,0.94446
min,0.0,220.0,2.26,1.0
25%,0.0,520.0,3.13,2.0
50%,0.0,580.0,3.395,2.0
75%,1.0,660.0,3.67,3.0
max,1.0,800.0,4.0,4.0


In [5]:
# convert rank to categorical
# via https://stackoverflow.com/a/39092877
mydata = raw_data.copy()
mydata['rank'] = pd.Categorical(mydata['rank'])

In [6]:
mydata_crosstab = pd.crosstab(
    mydata['admit'],
    mydata['rank'], 
    margins = False
)
mydata_crosstab

rank,1,2,3,4
admit,,,,
0,28,97,93,55
1,33,54,28,12


In [7]:
mydata.dtypes

admit       int64
gre         int64
gpa       float64
rank     category
dtype: object

In [8]:
mylogit = smf.mnlogit(
    'admit ~ gre + gpa + rank',
    data=mydata
).fit()

mylogit

Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6


<statsmodels.discrete.discrete_model.MultinomialResultsWrapper at 0x7f92878f6f50>

In [9]:
mylogit.summary()

Dep. Variable:,admit,No. Observations:,400.0
Model:,MNLogit,Df Residuals:,394.0
Method:,MLE,Df Model:,5.0
Date:,"Sun, 04 Apr 2021",Pseudo R-squ.:,0.08292
Time:,15:04:31,Log-Likelihood:,-229.26
converged:,True,LL-Null:,-249.99
Covariance Type:,nonrobust,LLR p-value:,7.578e-08

admit=1,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.99,1.14,-3.5,0.0,-6.224,-1.756
rank[T.2],-0.6754,0.316,-2.134,0.033,-1.296,-0.055
rank[T.3],-1.3402,0.345,-3.881,0.0,-2.017,-0.663
rank[T.4],-1.5515,0.418,-3.713,0.0,-2.37,-0.733
gre,0.0023,0.001,2.07,0.038,0.0,0.004
gpa,0.804,0.332,2.423,0.015,0.154,1.454


In [10]:
mylogit.summary2()

Model:,MNLogit,Pseudo R-squared:,0.083
Dependent Variable:,admit,AIC:,470.5175
Date:,2021-04-04 15:04,BIC:,494.4663
No. Observations:,400,Log-Likelihood:,-229.26
Df Model:,5,LL-Null:,-249.99
Df Residuals:,394,LLR p-value:,7.5782e-08
Converged:,1.0000,Scale:,1.0
No. Iterations:,6.0000,,

admit = 0,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
Intercept,-3.99,1.14,-3.5001,0.0005,-6.2242,-1.7557
rank[T.2],-0.6754,0.3165,-2.1342,0.0328,-1.2958,-0.0551
rank[T.3],-1.3402,0.3453,-3.8812,0.0001,-2.017,-0.6634
rank[T.4],-1.5515,0.4178,-3.7131,0.0002,-2.3704,-0.7325
gre,0.0023,0.0011,2.0699,0.0385,0.0001,0.0044
gpa,0.804,0.3318,2.4231,0.0154,0.1537,1.4544


In [11]:
# mylogit.__dict__

In [12]:
mylogit.params

,0
Intercept,-3.989979
rank[T.2],-0.675443
rank[T.3],-1.340204
rank[T.4],-1.551464
gre,0.002264
gpa,0.804038


The model above uses rank=1 as the reference category, so each reported rank coefficient is a log odds ratio relative to that category, with gre and gpa held fixed:

log( odds(admit | rank=2) / odds(admit | rank=1) ) = rank[T.2] = -0.675443

and similarly for the others: rank[T.3] = -1.340204 and rank[T.4] = -1.551464.
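
To read these coefficients as odds ratios, exponentiate them. A minimal sketch, with the coefficient values printed above hard-coded in a dictionary (the names `coefs_ref1` and `odds_ratios` are only for illustration):

```python
import numpy as np

# Rank coefficients copied from mylogit.params above (reference category: rank=1)
coefs_ref1 = {
    "rank[T.2]": -0.675443,
    "rank[T.3]": -1.340204,
    "rank[T.4]": -1.551464,
}

# exp(log odds ratio) = odds ratio relative to rank=1
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs_ref1.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

All three ratios are below 1: with gre and gpa held fixed, the odds of admission are lower for ranks 2 to 4 than for rank 1.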



## Statement of the problem

How can we obtain the log odds ratios with respect to another reference category, e.g. rank=2?

In [13]:

# Option 1: custom function that permutes categories to put rank=2 as reference
# https://www.statsmodels.org/stable/example_formulas.html#functions
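
One way Option 1 could look is patsy's `C()` with a `Treatment` contrast, which lets the formula name the reference level directly. A minimal sketch, assuming statsmodels' patsy-based formula handling. It runs on synthetic data (the `demo` frame and the coefficients used to generate it are made up for illustration, not the UCLA file) and uses `smf.logit` rather than the `mnlogit` call used in this notebook:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the admissions data (hypothetical values)
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "rank": rng.integers(1, 5, size=n),      # ranks 1..4
    "gpa": rng.normal(3.4, 0.4, size=n),
})
# Admission probability falls with rank and rises with gpa
logit_p = -2.0 - 0.5 * (demo["rank"] - 1) + 0.6 * demo["gpa"]
demo["admit"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

# Default treatment coding: the first level (rank=1) is the reference
fit_ref1 = smf.logit("admit ~ gpa + C(rank)", data=demo).fit(disp=False)

# Same model, with rank=2 named as the reference level in the formula
fit_ref2 = smf.logit(
    "admit ~ gpa + C(rank, Treatment(reference=2))", data=demo
).fit(disp=False)

print(fit_ref2.params)
```

The fitted model is the same in both cases. Only the parameterisation changes, so the rank coefficients in `fit_ref2` should equal differences of the rank coefficients in `fit_ref1`.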

In [14]:
# Option 2: order categoricals so rank=2 comes first in the list at creation time
mydata2 = raw_data.copy()
mydata2['rank'] = pd.Categorical(mydata2['rank'], categories=[2,1,3,4])
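
If the column has already been converted to a categorical, as `mydata['rank']` was earlier, the levels can also be reordered after the fact. A minimal sketch with pandas' `Series.cat.reorder_categories`, on a handful of rank values taken from the rows displayed earlier (the names `ranks` and `releveled` are illustrative):

```python
import pandas as pd

# Small illustrative sample of rank values (rows 0-4 and 395-398 shown above)
ranks = pd.Series(pd.Categorical([3, 3, 1, 4, 4, 2, 3, 2, 2]))
print(list(ranks.cat.categories))       # default order: sorted levels

# Move rank=2 to the front; the values themselves are unchanged
releveled = ranks.cat.reorder_categories([2, 1, 3, 4])
print(list(releveled.cat.categories))
```

`reorder_categories` requires the new list to contain exactly the existing levels, so it only changes their order.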

In [15]:
mylogit2 = smf.mnlogit(
    'admit ~ gre + gpa + rank',
    data=mydata2
).fit()

mylogit2.summary()

Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6


Dep. Variable:,admit,No. Observations:,400.0
Model:,MNLogit,Df Residuals:,394.0
Method:,MLE,Df Model:,5.0
Date:,"Sun, 04 Apr 2021",Pseudo R-squ.:,0.08292
Time:,15:04:32,Log-Likelihood:,-229.26
converged:,True,LL-Null:,-249.99
Covariance Type:,nonrobust,LLR p-value:,7.578e-08

admit=1,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-4.6654,1.109,-4.205,0.0,-6.84,-2.491
rank[T.1],0.6754,0.316,2.134,0.033,0.055,1.296
rank[T.3],-0.6648,0.283,-2.346,0.019,-1.22,-0.109
rank[T.4],-0.876,0.367,-2.389,0.017,-1.595,-0.157
gre,0.0023,0.001,2.07,0.038,0.0,0.004
gpa,0.804,0.332,2.423,0.015,0.154,1.454


In [16]:
mylogit2.params

,0
Intercept,-4.665422
rank[T.1],0.675443
rank[T.3],-0.664761
rank[T.4],-0.876021
gre,0.002264
gpa,0.804038


In [18]:
# Option 3: calculate the odds ratio R3/R2 from the mylogit coefficients
# using arithmetic in log-space
# (Rk denotes the odds of admission at rank=k, with gre and gpa held fixed)

# From mylogit we have the log odds ratios
#  log(R2/R1) = -0.675443
#  log(R3/R1) = -1.340204

log_R2_over_R1 = mylogit.params[0]['rank[T.2]']
log_R3_over_R1 = mylogit.params[0]['rank[T.3]']

# We want
#  log(R3/R2)

# in odds-space (ratios of probs) the calculation is
#    R3/R2 = (R3/R1) / (R2/R1) = R3_over_R1/R2_over_R1
# in log-space
#    log(R3/R2) = log(R3/R1) - log(R2/R1) = log_R3_over_R1 - log_R2_over_R1

log_R3_over_R2 = log_R3_over_R1 - log_R2_over_R1
log_R3_over_R2

-0.6647609885043283

In [19]:
# check (by comparing to value obtained in mylogit2 where rank=2 is the reference)
np.isclose(log_R3_over_R2, mylogit2.params[0]['rank[T.3]'])

True
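
The same subtraction re-bases every rank coefficient at once, and the intercept shifts by the coefficient of the new reference. A minimal sketch in plain Python, using the `mylogit` values printed earlier (`rebase_rank` is a hypothetical helper written for this illustration, not a statsmodels function):

```python
# Coefficients copied from mylogit.params above (reference category: rank=1)
params_ref1 = {
    "Intercept": -3.989979,
    "rank[T.2]": -0.675443,
    "rank[T.3]": -1.340204,
    "rank[T.4]": -1.551464,
}

def rebase_rank(params, old_ref, new_ref):
    """Hypothetical helper: re-express rank coefficients relative to new_ref."""
    shift = params[f"rank[T.{new_ref}]"]
    # The intercept is the log odds at the reference level, so it moves by `shift`
    rebased = {"Intercept": params["Intercept"] + shift}
    # The old reference had an implicit coefficient of 0
    rebased[f"rank[T.{old_ref}]"] = -shift
    for name, beta in params.items():
        if name.startswith("rank[T.") and name != f"rank[T.{new_ref}]":
            rebased[name] = beta - shift
    return rebased

params_ref2 = rebase_rank(params_ref1, old_ref=1, new_ref=2)
for name, beta in params_ref2.items():
    print(f"{name}: {beta:.6f}")
```

The result can be compared with `mylogit2.params` above. The gre and gpa coefficients are not affected by the choice of reference category, so they are left out here.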

In [20]:
# Option 4: calculate the odds ratio R3/R2 from the mylogit coefficients
# using arithmetic in odds space

# From mylogit we have the log odds ratios
#  log(R2/R1) = -0.675443
#  log(R3/R1) = -1.340204

R2_over_R1 = np.exp(mylogit.params[0]['rank[T.2]'])
R3_over_R1 = np.exp(mylogit.params[0]['rank[T.3]'])

# We want
#  R3/R2

# in odds-space (ratios of probs) the calculation is
#    R3/R2 = (R3/R1) / (R2/R1) = R3_over_R1/R2_over_R1

R3_over_R2 = R3_over_R1/R2_over_R1
R3_over_R2

0.5143964596821659

In [21]:
# check (by comparing log of odds ratio to value obtained in mylogit2)
np.isclose(np.log(R3_over_R2), mylogit2.params[0]['rank[T.3]'])

True
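
Options 3 and 4 reproduce the point estimate only. A standard error or confidence interval for the re-based coefficient cannot be obtained by subtracting the reported standard errors, because the two estimates are correlated. As a sketch, writing $\hat\beta_2$ and $\hat\beta_3$ for `rank[T.2]` and `rank[T.3]` in `mylogit` (notation introduced here for illustration), the standard variance-of-a-difference identity gives:

```latex
\operatorname{Var}(\hat\beta_3 - \hat\beta_2)
  = \operatorname{Var}(\hat\beta_3) + \operatorname{Var}(\hat\beta_2)
  - 2\,\operatorname{Cov}(\hat\beta_3, \hat\beta_2)
```

The covariance term would have to come from the fitted model's covariance matrix (in statsmodels, `cov_params()` on the results object), which the summaries above do not print. Refitting with the new reference, as in Option 2, reports these standard errors directly.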