## Multinomial Dependent Variable
When the endogenous variable is categorical but not binary, the most suitable option is to estimate either an unordered (multinomial) logit or an ordered logit. The choice depends on whether the categories have a natural ordering.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.miscmodels.ordinal_model import OrderedModel

### Unordered Multinomial Logit
We use a dataset on the choice of program track `prog` (general, academic, or vocational) as a function of socioeconomic status `ses`, the reading score `read`, and the writing score `write`, among other variables.
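To make the model concrete, here is a minimal sketch of how an unordered multinomial logit turns linear indices into choice probabilities. It assumes one category is the reference with its coefficients fixed at zero; the arrays `x` and `betas` are hypothetical values chosen for illustration, not estimates from this dataset.

```python
import numpy as np

def mnlogit_probs(x, betas):
    """Choice probabilities for one observation.

    x     : (k,) regressors, including the constant
    betas : (J-1, k) coefficients of the non-reference categories
    """
    # The reference category has its linear index fixed at zero
    eta = np.concatenate(([0.0], betas @ x))
    eta = eta - eta.max()          # subtract the max for numerical stability
    expeta = np.exp(eta)
    return expeta / expeta.sum()

x = np.array([1.0, 0.5])                      # hypothetical: constant and one regressor
betas = np.array([[0.2, 1.0], [-0.3, 0.4]])   # hypothetical coefficients
probs = mnlogit_probs(x, betas)
print(probs)
```

Each coefficient is therefore read relative to the reference category: a positive value raises the odds of that category against the reference, not necessarily its probability.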

In [2]:
dta = pd.read_stata("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
dta.to_csv("./data/hsbdemo.csv")

# Keep the original labels in `genr` and build a 0/1 `female` dummy
dta.rename({"female": "genr"}, axis=1, inplace=True)
dta["female"] = (dta["genr"] == "female").astype("int64")

# Select regressors and outcome
X = dta[["female", "ses", "read", "write"]]
y = dta["prog"]

dta.head()

| | id | genr | ses | schtyp | prog | read | write | math | science | socst | honors | awards | cid | female |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45.0 | female | low | public | vocation | 34.0 | 35.0 | 41.0 | 29.0 | 26.0 | not enrolled | 0.0 | 1 | 1 |
| 1 | 108.0 | male | middle | public | general | 34.0 | 33.0 | 41.0 | 36.0 | 36.0 | not enrolled | 0.0 | 1 | 0 |
| 2 | 15.0 | male | high | public | vocation | 39.0 | 39.0 | 44.0 | 26.0 | 42.0 | not enrolled | 0.0 | 1 | 0 |
| 3 | 67.0 | male | low | public | vocation | 37.0 | 37.0 | 42.0 | 33.0 | 32.0 | not enrolled | 0.0 | 1 | 0 |
| 4 | 153.0 | male | middle | public | vocation | 39.0 | 31.0 | 40.0 | 39.0 | 51.0 | not enrolled | 0.0 | 1 | 0 |


In [3]:
dta.dtypes

id          float32
genr       category
ses        category
schtyp     category
prog       category
read        float32
write       float32
math        float32
science     float32
socst       float32
honors     category
awards      float32
cid           int16
female        int64
dtype: object

In [4]:
dta["prog"].value_counts()

academic    105
vocation     50
general      45
Name: prog, dtype: int64

In [5]:
pd.pivot_table(dta, index="ses", values="id", columns="prog", aggfunc="count")

| ses \ prog | general | academic | vocation |
|---|---|---|---|
| low | 16 | 19 | 12 |
| middle | 20 | 44 | 31 |
| high | 9 | 42 | 7 |


In [6]:
# Replace the categories with their integer codes
dta["ses"] = dta["ses"].cat.codes
dta["prog"] = dta["prog"].cat.codes
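Once the labels are replaced by codes, the model output only shows numbers such as `prog=1`, so it helps to record which code belongs to which label before overwriting the column. The sketch below uses a hypothetical toy series whose category order is set by hand to match the column order of the pivot table above; the actual order in `dta` should be checked with `dta["prog"].cat.categories` before the replacement is run.

```python
import pandas as pd

# Hypothetical toy series; the category order is set explicitly for the example
prog = pd.Series(pd.Categorical(
    ["vocation", "general", "academic", "academic"],
    categories=["general", "academic", "vocation"],
))

# Build the code -> label mapping before the labels are discarded
code_to_label = dict(enumerate(prog.cat.categories))
codes = prog.cat.codes

print(code_to_label)
print(codes.tolist())
```

Under that assumed order, code 0 would be `general`, which would make it the reference category in the regression, since the coefficient blocks start at `prog=1`.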

In [7]:
model = smf.mnlogit("prog ~ female + C(ses) + read + write", data=dta)
results = model.fit()

print(results.summary())

Optimization terminated successfully.
         Current function value: 0.873627
         Iterations 6
                          MNLogit Regression Results                          
Dep. Variable:                   prog   No. Observations:                  200
Model:                        MNLogit   Df Residuals:                      188
Method:                           MLE   Df Model:                           10
Date:                Sun, 01 Jan 2023   Pseudo R-squ.:                  0.1439
Time:                        20:28:52   Log-Likelihood:                -174.73
converged:                       True   LL-Null:                       -204.10
Covariance Type:            nonrobust   LLR p-value:                 6.263e-09
     prog=1       coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      -3.7529      1.285     -2.921      0.003      -6.271      -1.234
C(ses)[T.1]     0.4387    

In [8]:
MEM = results.get_margeff(at="mean", method="dydx")

print(MEM.summary())

       MNLogit Marginal Effects      
Dep. Variable:                   prog
Method:                          dydx
At:                              mean
     prog=0      dy/dx    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
C(ses)[T.1]    -0.1084      0.077     -1.403      0.161      -0.260       0.043
C(ses)[T.2]    -0.1561      0.092     -1.692      0.091      -0.337       0.025
female         -0.0393      0.072     -0.549      0.583      -0.180       0.101
read           -0.0053      0.004     -1.231      0.218      -0.014       0.003
write          -0.0012      0.005     -0.264      0.792      -0.010       0.008
-------------------------------------------------------------------------------
     prog=1      dy/dx    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
C(ses)[T.1]     0.0012      0.099      0.012    
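For reference, the marginal effect of a continuous regressor in a multinomial logit is the derivative of each category's probability, which equals p_j times the gap between that category's coefficient and the probability-weighted average coefficient. The following is a minimal sketch of that formula evaluated at a single point; `x_mean` and `betas` are hypothetical values, not the estimates above. It treats every regressor as continuous, which is also what `get_margeff` does for dummies unless `dummy=True` is passed.

```python
import numpy as np

def mnlogit_margeff(x, betas):
    """dP_j/dx for a multinomial logit evaluated at x.

    betas : (J, k) coefficients, with the reference category's row equal to zero.
    Returns a (J, k) array of marginal effects.
    """
    eta = betas @ x
    p = np.exp(eta - eta.max())
    p = p / p.sum()
    beta_bar = p @ betas                     # probability-weighted average coefficient
    return p[:, None] * (betas - beta_bar)

x_mean = np.array([1.0, 0.5])                               # hypothetical sample means, constant first
betas = np.array([[0.0, 0.0], [0.2, 1.0], [-0.3, 0.4]])     # hypothetical coefficients
me = mnlogit_margeff(x_mean, betas)
print(me)
```

Because the probabilities always add up to one, the effects of any regressor sum to zero across categories, which is a quick sanity check when reading the table block by block.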

### Ordered Multinomial Logit
We use data on the likelihood of applying to a university, `apply`. The exogenous variables are whether at least one parent is a graduate (`pared`), whether the student attended a public school (`public`), and the grade point average (`gpa`).
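As a rough illustration of what the ordered logit assumes, the sketch below computes category probabilities from a single linear index and a set of increasing cutpoints: the cumulative probability of being at or below category j is the logistic CDF evaluated at the j-th cutpoint minus the index. The index value and cutpoints are hypothetical numbers, not estimates.

```python
import numpy as np

def ordered_logit_probs(xb, cutpoints):
    """Category probabilities for a linear index xb (no constant) and increasing cutpoints."""
    cutpoints = np.asarray(cutpoints, dtype=float)
    cdf = 1.0 / (1.0 + np.exp(-(cutpoints - xb)))   # P(y <= j) at each cutpoint
    cum = np.concatenate(([0.0], cdf, [1.0]))        # pad with P = 0 and P = 1
    return np.diff(cum)                              # P(y = j)

probs = ordered_logit_probs(xb=0.8, cutpoints=[0.5, 2.0])   # hypothetical values
print(probs)
```

A larger index shifts probability mass toward the higher categories, which is why a single set of slope coefficients is enough when the categories are ordered.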

In [9]:
dta = pd.read_stata("https://stats.idre.ucla.edu/stat/data/ologit.dta")
dta.to_csv("./data/ologit.csv")

X = dta[["pared", "public", "gpa"]]
y = dta["apply"].cat.codes

dta.head()

| | apply | pared | public | gpa |
|---|---|---|---|---|
| 0 | very likely | 0 | 0 | 3.26 |
| 1 | somewhat likely | 1 | 0 | 3.21 |
| 2 | unlikely | 1 | 1 | 3.94 |
| 3 | somewhat likely | 0 | 0 | 2.81 |
| 4 | somewhat likely | 0 | 0 | 2.53 |


In [10]:
dta.dtypes

apply     category
pared         int8
public        int8
gpa        float32
dtype: object

In [11]:
dta["apply"].value_counts()

unlikely           220
somewhat likely    140
very likely         40
Name: apply, dtype: int64

In [12]:
model = OrderedModel(y, X, dist="logit")
results = model.fit()

print(results.summary())

Optimization terminated successfully.
         Current function value: 0.896869
         Iterations: 361
         Function evaluations: 573
                             OrderedModel Results                             
Dep. Variable:                      y   Log-Likelihood:                -358.75
Model:                   OrderedModel   AIC:                             727.5
Method:            Maximum Likelihood   BIC:                             747.5
Date:                Sun, 01 Jan 2023                                         
Time:                        20:28:53                                         
No. Observations:                 400                                         
Df Residuals:                     395                                         
Df Model:                           5                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------
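The summary reports five parameters: three slopes and two threshold parameters. As far as the statsmodels documentation describes it, `OrderedModel` stores the first threshold directly and every later one as the log of its increment over the previous threshold, so the raw threshold rows are not cutpoints on the index scale. The sketch below reproduces that transformation in plain NumPy under this assumption; the helper name `thresholds_from_params` and the input values are hypothetical.

```python
import numpy as np

def thresholds_from_params(th_params):
    """Convert [first cutpoint, log-increments...] into increasing cutpoints."""
    th_params = np.asarray(th_params, dtype=float)
    # First entry is taken as is; the remaining ones are exponentiated increments
    steps = np.concatenate((th_params[:1], np.exp(th_params[1:])))
    return np.cumsum(steps)

cuts = thresholds_from_params([2.0, np.log(1.5)])   # hypothetical threshold parameters
print(cuts)
```

On a fitted model, the `transform_threshold_params` method of the model object should give the same cutpoints, padded with minus and plus infinity at the ends.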

### References:
* https://www.statsmodels.org/stable/generated/statsmodels.miscmodels.ordinal_model.OrderedModel.html#statsmodels.miscmodels.ordinal_model.OrderedModel
* https://www.statsmodels.org/stable/generated/statsmodels.discrete.discrete_model.MNLogit.html#statsmodels.discrete.discrete_model.MNLogit