In [1]:
import pandas as pd
import statsmodels.api as sm

Let's create a simple data frame:

In [2]:
df = pd.DataFrame({'color':['red','blue','green','red','green','green','blue'],'size':[4,3,5,3,6,7,1]})
df

,color,size
0,red,4
1,blue,3
2,green,5
3,red,3
4,green,6
5,green,7
6,blue,1


We start with one-hot encoding, which is what you should be doing for a nominal variable such as color:

In [3]:
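# dtype=int keeps the dummies as 0/1 integers (newer pandas versions default to booleans)
dfnew = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)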
dfnew

,size,color_green,color_red
0,4,0,1
1,3,0,0
2,5,1,0
3,3,0,1
4,6,1,0
5,7,1,0
6,1,0,0
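
The table has no `color_blue` column because `drop_first=True` removes the first category in sorted order, which makes blue the baseline. The following is a minimal sketch of that effect, not part of the original notebook; the names `full`, `reduced`, and `dummy_cols` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'green', 'green', 'blue'],
    'size': [4, 3, 5, 3, 6, 7, 1],
})

# One dummy column per color (nothing dropped).
full = pd.get_dummies(df, columns=['color'], dtype=int)

# Same encoding, but the first category (blue) is dropped.
reduced = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding, the dummies of every row add up to 1.
# Together with a constant term this makes the columns linearly dependent,
# which is why one category is dropped before running the regression.
dummy_cols = ['color_blue', 'color_green', 'color_red']
print(full[dummy_cols].sum(axis=1).tolist())
```

A row with zeros in both remaining dummy columns is therefore a blue observation.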


Let's run a regression:

In [4]:
X = dfnew[['color_green','color_red']]
Y = dfnew[['size']]

X = sm.add_constant(X)
lm = sm.OLS(Y, X).fit()
print(lm.summary())

                            OLS Regression Results                            
Dep. Variable:                   size   R-squared:                       0.819
Model:                            OLS   Adj. R-squared:                  0.728
Method:                 Least Squares   F-statistic:                     9.048
Date:                Sun, 31 Oct 2021   Prob (F-statistic):             0.0328
Time:                        16:54:51   Log-Likelihood:                -8.3862
No. Observations:                   7   AIC:                             22.77
Df Residuals:                       4   BIC:                             22.61
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           2.0000      0.750      2.667      

  warn("omni_normtest is not valid with less than 8 observations; %i "


What does this say? The constant is the baseline value $2$: the average size of the "blue" observations, since blue is the category dropped by `drop_first=True`. Green observations have an average size of $2 + 4 = 6$, and red observations have an average size of $2 + 1.5 = 3.5$.
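
With only a constant and dummy variables as regressors, the fitted values are simply the per-color averages, so the numbers above can be checked without running a regression. A small sketch of that check follows; the variable names `group_means` and `offsets` are illustrative and do not appear in the notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'green', 'green', 'blue'],
    'size': [4, 3, 5, 3, 6, 7, 1],
})

# Average size per color.
group_means = df.groupby('color')['size'].mean()

# Differences to the baseline category (blue); these correspond to the
# dummy coefficients, while the blue mean corresponds to the constant.
offsets = group_means - group_means['blue']

print(group_means)
print(offsets)
```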

Now, let's try ordinal encoding. We will just use an arbitrary sequence:

In [5]:
from sklearn.preprocessing import OrdinalEncoder

dfnew = df.copy()
encoder = OrdinalEncoder(categories=[['green','blue','red']]) 
dfnew[['color']] = encoder.fit_transform(dfnew[['color']])
dfnew

,color,size
0,2.0,4
1,1.0,3
2,0.0,5
3,2.0,3
4,0.0,6
5,0.0,7
6,1.0,1


We again run a regression:

In [6]:
X = dfnew[['color']]
Y = dfnew[['size']]

X = sm.add_constant(X)
lm = sm.OLS(Y, X).fit()
print(lm.summary())

                            OLS Regression Results                            
Dep. Variable:                   size   R-squared:                       0.389
Model:                            OLS   Adj. R-squared:                  0.267
Method:                 Least Squares   F-statistic:                     3.189
Date:                Sun, 31 Oct 2021   Prob (F-statistic):              0.134
Time:                        16:54:51   Log-Likelihood:                -12.641
No. Observations:                   7   AIC:                             29.28
Df Residuals:                       5   BIC:                             29.17
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.3529      0.945      5.665      0.0

  warn("omni_normtest is not valid with less than 8 observations; %i "


Green is encoded as 0. Hence, the regression says that the average green observation should have size 5.3. Blue is 1, so the average blue observation should have size $5.3 - 1 \cdot 1.4 = 3.9$. Red is 2, so the average red observation should have size $5.3 - 2 \cdot 1.4 = 2.5$. Of course, we can easily see that blue observations actually have a smaller average than red (2 versus 3.5). But the order we gave puts an additional constraint on the model: the predicted averages must change by the same amount with each step along the sequence. As a result, the model fits much worse (adjusted $R^2$ of 0.267 instead of 0.728).
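
To see how much the result depends on the chosen sequence, one can refit the same straight line for every possible ordering of the three colors. The sketch below is an illustration added here, not part of the original notebook: it uses `numpy.polyfit` in place of statsmodels, and the helper `r_squared` is a hypothetical name.

```python
from itertools import permutations

import numpy as np

colors = ['red', 'blue', 'green', 'red', 'green', 'green', 'blue']
sizes = np.array([4, 3, 5, 3, 6, 7, 1], dtype=float)


def r_squared(order):
    """R^2 of a straight-line fit of size on the ordinal code of color."""
    # The position of each color in `order` is its ordinal code (0, 1, 2).
    codes = np.array([order.index(c) for c in colors], dtype=float)
    slope, intercept = np.polyfit(codes, sizes, 1)
    residuals = sizes - (intercept + slope * codes)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((sizes - sizes.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


# Fit once per ordering of the three categories.
results = {order: r_squared(order)
           for order in permutations(['green', 'blue', 'red'])}

for order, r2 in sorted(results.items(), key=lambda kv: -kv[1]):
    print(order, round(r2, 3))
```

The ordering `('green', 'blue', 'red')` reproduces the $R^2$ of 0.389 reported above. Every ordering stays below the one-hot $R^2$ of 0.819, because a line through the codes 0, 1, 2 is a restricted version of the model with one free average per color.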