### Example Multiple Linear Regression 4.5

For example, based on the **gender** variable, we can create a new variable that takes the form
\begin{equation*}
x_{i}
=
\begin{cases}
1&\text{if $ i $th person is female}\\
0&\text{if $ i $th person is male}
\end{cases}
\end{equation*}

and use this variable as a predictor in the regression equation. This results in the model
\begin{equation}\label{eq:regr_dummy_credit}
y_{i}
=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}
=
\begin{cases}
\beta_{0}+\beta_{1}+\epsilon_{i}&\text{if $ i $th person is female}\\
\beta_{0}+\epsilon_{i}&\text{if $ i $th person is male}
\end{cases}
\end{equation}

Now, $ \beta_{0} $ can be interpreted as the average credit card balance among males, $\beta_{0} + \beta_{1} $ as the average credit card balance among females, and $ \beta_{1} $ as the average difference in credit card balance between females and males. 
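
The interpretation above can be checked numerically. The following is a minimal sketch on synthetic data (the group sizes, means and noise level are made up for illustration and are not taken from the Credit data set); it uses plain least squares from `numpy` rather than `statsmodels`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dummy variable: 0 = male (90 people), 1 = female (110 people)
x = np.repeat([0.0, 1.0], [90, 110])
# Hypothetical balances: the numbers below are assumptions, not real data
y = 500.0 + 20.0 * x + rng.normal(0.0, 50.0, size=x.size)

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones_like(x), x])
beta0, beta1 = np.linalg.lstsq(X, y, rcond=None)[0]

mean_male = y[x == 0].mean()
mean_female = y[x == 1].mean()

print("beta0         :", beta0, " male mean  :", mean_male)
print("beta0 + beta1 :", beta0 + beta1, " female mean:", mean_female)
print("beta1         :", beta1, " difference :", mean_female - mean_male)
```

With a single 0/1 dummy, the least squares fit reproduces the two group means exactly, which is why the coefficients can be read as a baseline average and a difference of averages.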

In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Load data
df = pd.read_csv('./data/Credit.csv')

balance = df['Balance']

# Initiate dummy variable with zeros:
gender = np.zeros(len(balance))
# Make 1 for Female:
indices_Fem = df[df['Gender']=='Female'].index.values
gender[indices_Fem] = 1

# Fit model
gender_sm = sm.add_constant(gender)
model = sm.OLS(balance, gender_sm).fit()

# Print summary:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                Balance   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.1836
Date:                Thu, 02 Oct 2025   Prob (F-statistic):              0.669
Time:                        10:39:25   Log-Likelihood:                -3019.3
No. Observations:                 400   AIC:                             6043.
Df Residuals:                     398   BIC:                             6051.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        509.8031     33.128     15.389      0.0

The **.summary()** method of a fitted model reports the coefficient estimates for the intercept and for the *dummy variable* associated with **gender**. The average credit card debt for males is estimated to be $ 509.80 $, whereas females are estimated to carry $ 19.73 $ in additional debt, for a total of $ 509.80 + 19.73 = 529.53 $.

However, the p-value for the dummy variable coefficient $ \beta_{1} $ is $ 0.6690 $, which is very high. This indicates that there is no statistical evidence of a difference in average credit card balance between the genders.
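
The test behind this p-value can be sketched by hand. With a single 0/1 dummy, the t-statistic of $ \beta_{1} $ coincides with the pooled two-sample t-statistic for the difference of the group means. The example below assumes synthetic data (all numbers are hypothetical) and stops at the t-statistic; turning it into a p-value would additionally require the t distribution with $ n-2 $ degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical groups: 30 observations with x = 0, 40 with x = 1
x = np.repeat([0.0, 1.0], [30, 40])
y = 500.0 + 20.0 * x + rng.normal(0.0, 100.0, size=x.size)

# Least squares fit and the usual standard error of the slope
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
sigma2 = resid @ resid / (len(y) - 2)        # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)        # covariance of the estimates
t_regression = beta[1] / np.sqrt(cov[1, 1])

# Pooled two-sample t-statistic computed directly from the two groups
y0, y1 = y[x == 0], y[x == 1]
pooled_var = ((len(y0) - 1) * y0.var(ddof=1)
              + (len(y1) - 1) * y1.var(ddof=1)) / (len(y) - 2)
t_pooled = (y1.mean() - y0.mean()) / np.sqrt(
    pooled_var * (1 / len(y0) + 1 / len(y1)))

print("t from regression :", t_regression)
print("t from pooled test:", t_pooled)
```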

### Example Multiple Linear Regression 4.6
If we had coded males as 1 and females as 0, then the estimates for $ \beta_{0} $ and $ \beta_{1} $ would have been 529.53 and -19.73 respectively, leading once again to a prediction of credit card debt of $ 529.53 - 19.73 = 509.80 $ for males and a prediction of 529.53 for females. This is the same result we obtained with the coding scheme of Example 4.5.
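
The effect of swapping the coding can also be worked out by hand. Writing $ \tilde{x}_{i} = 1 - x_{i} $ for the male-coded dummy (a symbol introduced here only for illustration) and substituting into the model of Example 4.5 gives
\begin{equation*}
\begin{aligned}
y_{i}
&=\beta_{0}+\beta_{1}(1-\tilde{x}_{i})+\epsilon_{i}\\
&=(\beta_{0}+\beta_{1})-\beta_{1}\tilde{x}_{i}+\epsilon_{i}
\end{aligned}
\end{equation*}

so the new intercept is $ \beta_{0}+\beta_{1} = 509.80 + 19.73 = 529.53 $ and the new slope is $ -\beta_{1} = -19.73 $.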

If we wish to change the coding scheme for the dummy variable in **Python**, we set the dummy variable to 1 for males instead of females and leave it at 0 otherwise.

In [3]:
# Following Example 4.5
# Initiate dummy variable with zeros:
gender = np.zeros(len(balance))
# Make 1 for Male (the label is matched as ' Male', with a leading space):
indices_Mal = df[df['Gender'] == ' Male'].index.values
gender[indices_Mal] = 1

# Fit model
gender_sm = sm.add_constant(gender)
model = sm.OLS(balance, gender_sm).fit()

# Print summary:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                Balance   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.1836
Date:                Thu, 02 Oct 2025   Prob (F-statistic):              0.669
Time:                        10:39:28   Log-Likelihood:                -3019.3
No. Observations:                 400   AIC:                             6043.
Df Residuals:                     398   BIC:                             6051.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        529.5362     31.988     16.554      0.0

### Example Multiple Linear Regression 4.7
Alternatively, instead of a $ 0/1 $ coding scheme, we could create a dummy variable 
\begin{equation*}
x_{i}
=
\begin{cases}
1&\text{if $ i $th person is female}\\
-1&\text{if $ i $th person is male}
\end{cases}
\end{equation*}

and use this variable in the regression equation. This results in the model
\begin{equation*}
y_{i}
=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}
=
\begin{cases}
\beta_{0}+\beta_{1}+\epsilon_{i}&\text{if $ i $th person is female}\\
\beta_{0}-\beta_{1}+\epsilon_{i}&\text{if $ i $th person is male}
\end{cases}
\end{equation*}

Now $ \beta_{0} $ can be interpreted as the overall average credit card balance (ignoring the gender effect), and $ \beta_{1} $ is the amount by which females are above this average and males are below it.

In [4]:
# Following Example 4.6
# Initiate dummy variable with ones:
gender = np.ones(len(balance))
# Make -1 for Male (the label is matched as ' Male', with a leading space):
indices_Mal = df[df['Gender'] == ' Male'].index.values
gender[indices_Mal] = -1

# Fit model
gender_sm = sm.add_constant(gender)
model = sm.OLS(balance, gender_sm).fit()

# Print summary:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                Balance   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.1836
Date:                Thu, 02 Oct 2025   Prob (F-statistic):              0.669
Time:                        10:39:35   Log-Likelihood:                -3019.3
No. Observations:                 400   AIC:                             6043.
Df Residuals:                     398   BIC:                             6051.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        519.6697     23.026     22.569      0.0

In this example, the estimate for $ \beta_{0} $ is 519.67, halfway between the male and female averages of 509.80 and 529.53. The estimate for $ \beta_{1} $ is 9.87, which is half of 19.73, the average difference between females and males.
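
The halfway property of the $ 1/-1 $ coding can be reproduced on synthetic data. The sketch below assumes made-up group sizes and balances; only the coding scheme follows the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample: 80 males and 120 females (deliberately unbalanced)
female = np.repeat([False, True], [80, 120])
y = np.where(female, 530.0, 510.0) + rng.normal(0.0, 50.0, size=female.size)

# 1 / -1 coding: +1 for female, -1 for male
x = np.where(female, 1.0, -1.0)
X = np.column_stack([np.ones_like(x), x])
beta0, beta1 = np.linalg.lstsq(X, y, rcond=None)[0]

mean_f = y[female].mean()
mean_m = y[~female].mean()

print("beta0:", beta0, " midpoint of group means:", (mean_f + mean_m) / 2)
print("beta1:", beta1, " half the difference    :", (mean_f - mean_m) / 2)
# For comparison: with unequal group sizes the sample mean of y
# generally differs from the midpoint of the two group means.
print("sample mean of y:", y.mean())
```

Note that $ \beta_{0} $ is the midpoint of the two group averages; it equals the sample mean of all observations only when both groups have the same size.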

### Example Multiple Linear Regression 4.8
For example, the **ethnicity** variable has *three* levels, so we create *two* dummy variables. The first could be
\begin{equation*}
x_{i1}
=
\begin{cases}
1&\text{if $ i $th person is Asian}\\
0&\text{if $ i $th person is not Asian}
\end{cases}
\end{equation*}
and the second could be
\begin{equation*}
x_{i2}
=
\begin{cases}
1&\text{if $ i $th person is Caucasian}\\
0&\text{if $ i $th person is not Caucasian}
\end{cases}
\end{equation*}
Then both of these variables can be used in the regression equation, in 
order to obtain the model
\begin{equation}\label{eq:two_dummy_variables}
y_{i}
=\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\epsilon_{i}
=
\begin{cases}
\beta_{0}+\beta_{1}+\epsilon_{i}&\text{if $ i $th person is Asian}\\
\beta_{0}+\beta_{2}+\epsilon_{i}&\text{if $ i $th person is Caucasian}\\
\beta_{0}+\epsilon_{i}&\text{if $ i $th person is African American}
\end{cases}
\end{equation}

Now $ \beta_{0} $ can be interpreted as the average credit card balance for African Americans, $ \beta_{1} $ can be interpreted as the difference in the average balance between the Asian and African American categories, and $ \beta_{2} $ can be interpreted as the difference in the average balance between the Caucasian and African American categories. 

- There will always be one fewer dummy variable than the number of levels.
- The level with no dummy variable (African American in this example) is known as the *baseline*.
- The equation 
    \begin{equation*}
    y_{i}
    =\beta_{0}+\beta_{1}+\beta_{2}+\epsilon_{i}
    \end{equation*}

    does not make sense, since this person would be Asian *and* Caucasian.
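
The points above can be illustrated with a short sketch. It builds the two dummy variables for a three-level factor and checks how the coefficients relate to the group means. The data are synthetic and the group sizes are assumptions; only the level names are taken from the example.

```python
import numpy as np

rng = np.random.default_rng(2)

levels = ["African American", "Asian", "Caucasian"]
# Hypothetical sample with 50, 60 and 90 people per level
ethnicity = np.repeat(levels, [50, 60, 90])
y = rng.normal(520.0, 100.0, size=ethnicity.size)

# Two dummy variables for three levels; African American is the baseline
x1 = (ethnicity == "Asian").astype(float)
x2 = (ethnicity == "Caucasian").astype(float)

X = np.column_stack([np.ones_like(x1), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

means = {lvl: y[ethnicity == lvl].mean() for lvl in levels}

print("beta0:", beta[0], " baseline mean    :", means["African American"])
print("beta1:", beta[1], " Asian - baseline :",
      means["Asian"] - means["African American"])
print("beta2:", beta[2], " Caucasian - base.:",
      means["Caucasian"] - means["African American"])
# No observation has both dummies equal to 1
print("max of x1 + x2:", (x1 + x2).max())
```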

From the summary below, we see that the estimated **balance** for the baseline, African American, is 531.00.

In [5]:
# Following Example 4.7
# Initiate dummy variable with zeros:
ethnicity = np.zeros((len(balance),2))
# Find indices 
indices_Asi = df[df['Ethnicity'] == 'Asian'].index.values
indices_Cau = df[df['Ethnicity'] == 'Caucasian'].index.values
# Set values
ethnicity[indices_Asi, 0] = 1
ethnicity[indices_Cau, 1] = 1

# Fit model
ethnicity_sm = sm.add_constant(ethnicity)
model = sm.OLS(balance, ethnicity_sm).fit()

# Print summary:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                Balance   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.005
Method:                 Least Squares   F-statistic:                   0.04344
Date:                Thu, 02 Oct 2025   Prob (F-statistic):              0.957
Time:                        10:44:07   Log-Likelihood:                -3019.3
No. Observations:                 400   AIC:                             6045.
Df Residuals:                     397   BIC:                             6057.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        531.0000     46.319     11.464      0.0

It is estimated that the Asian category will have $ 18.69 $ less debt than the African American category, and that the Caucasian category will have $ 12.50 $ less debt than the African American category. However, the p-values associated with the coefficient estimates for the two dummy variables are very large, suggesting no statistical evidence of a real difference in credit card balance between the ethnicities.

Once again, the level selected as the baseline category is arbitrary, and the final predictions for each group will be the same regardless of this choice. However, the coefficients and their p-values do depend on the choice of dummy variable coding. Rather than rely on the individual coefficients, we can use an F-test to test
\begin{equation*}
H_{0}:\quad
\beta_{1}
=\beta_{2}
=0
\end{equation*}

The p-value of this F-test does not depend on the coding. Here it is 0.96, indicating that we cannot reject the null hypothesis that there is *no* relationship between **balance** and **ethnicity**.
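
That the F-test is unaffected by the choice of baseline can be checked directly. The sketch below is an illustration on synthetic data with assumed group sizes; the helper names `f_statistic` and `dummies_without` are hypothetical. It computes the F-statistic from the residual sums of squares of the intercept-only model and the full model, once for each possible baseline. The p-value would follow from the F distribution with $ 2 $ and $ n-3 $ degrees of freedom and is not computed here.

```python
import numpy as np

rng = np.random.default_rng(4)

levels = ["African American", "Asian", "Caucasian"]
ethnicity = np.repeat(levels, [50, 60, 90])      # hypothetical group sizes
y = rng.normal(520.0, 100.0, size=ethnicity.size)


def dummies_without(baseline):
    """0/1 dummy columns for every level except the chosen baseline."""
    return np.column_stack([(ethnicity == lvl).astype(float)
                            for lvl in levels if lvl != baseline])


def f_statistic(dummies, y):
    """F-statistic for H0: all dummy coefficients are zero."""
    n, q = dummies.shape
    X_full = np.column_stack([np.ones(n), dummies])
    beta = np.linalg.lstsq(X_full, y, rcond=None)[0]
    rss_full = float(((y - X_full @ beta) ** 2).sum())
    rss_null = float(((y - y.mean()) ** 2).sum())  # intercept-only model
    return ((rss_null - rss_full) / q) / (rss_full / (n - q - 1))


f_by_baseline = {b: f_statistic(dummies_without(b), y) for b in levels}
for baseline, f_value in f_by_baseline.items():
    print(f"baseline = {baseline:17s} F = {f_value:.6f}")
```

All three codings span the same set of fitted values, so the residual sum of squares, and with it the F-statistic, is identical in each case.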