# Q2. Graduate School Acceptance Analysis
#### 0. Estimate a binomial model with intercept only using the logit link function. Interpret the intercept coefficient.
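Based on the model summary below, the intercept coefficient (−0.7653) is the log-odds of acceptance in an intercept-only model.

The odds of acceptance are $odds = e^{Intercept} = e^{−0.7653} ≈ 0.465$, that is, roughly 0.465 acceptances for every rejection. Odds are a ratio, not a percentage; the corresponding probability is $0.465 / (1 + 0.465) ≈ 0.3175$.

The p-value (P < 0.05) indicates that the intercept is statistically significant: the baseline log-odds differ significantly from 0, so the acceptance probability differs significantly from 0.5.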
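The conversion from log-odds to odds and then to a probability can be checked by hand. The following is a minimal sketch that assumes only the rounded intercept from the summary; the variable names are illustrative and not part of the original notebook.

```python
import math

# Intercept (log-odds) copied from the model summary, rounded to four decimals.
intercept = -0.7653

# Exponentiating the log-odds gives the odds, which are a ratio, not a percentage.
odds = math.exp(intercept)

# Odds convert to a probability via p = odds / (1 + odds).
probability = odds / (1 + odds)

print(f"odds of acceptance:        {odds:.3f}")
print(f"probability of acceptance: {probability:.4f}")
```

The probability obtained this way should agree, up to rounding, with the share of accepted applicants in the data.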

In [149]:
import pandas as pd
import numpy as np

df = pd.read_csv("grad school acceptance.csv")
df.head()

Unnamed: 0,accepted,gre,gpa,ranking
0,0,380,3.61,RANK03
1,1,660,3.67,RANK03
2,1,800,4.0,RANK01
3,1,640,3.19,RANK04
4,0,520,2.93,RANK04


In [151]:
import statsmodels.api as sm

intercept_model = sm.GLM(
    df['accepted'],
    sm.add_constant(np.zeros(len(df))), 
    family=sm.families.Binomial(link=sm.families.links.logit())
).fit()

print(intercept_model.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               accepted   No. Observations:                  400
Model:                            GLM   Df Residuals:                      399
Model Family:                Binomial   Df Model:                            0
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -249.99
Date:                Mon, 02 Dec 2024   Deviance:                       499.98
Time:                        23:08:56   Pearson chi2:                     400.
No. Iterations:                     4   Pseudo R-squ. (CS):              0.000
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.7653      0.107     -7.125      0.0



#### 1. Compute the average acceptance rate from the model results.

In [154]:
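# Positional access via .iloc avoids the pandas deprecation of integer keys on a labelled Series
intercept_value = intercept_model.params.iloc[0]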

avg_acceptance_rate = 1 / (1 + np.exp(-intercept_value))
avg_acceptance_rate

0.31750000000011214

In [156]:
# See if it matches the manual calculation of the average acceptance probability (number of accepted / total rows of the data) 
sum(df['accepted']) / len(df)

0.3175

#### 2. Estimate a model with intercept and GPA scores using the logit link function. What is the impact of a unit change in GPA scores on the odds?
Based on the result below, each one-unit increase in GPA multiplies the odds of acceptance by $e^{1.0511} ≈ 2.861$, an increase of about 186%.
The GPA coefficient is statistically significant, so a higher GPA is associated with a higher likelihood of acceptance.
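Since GPA is measured on a narrow scale, a full one-unit change is large, and it can help to express the same coefficient for smaller steps. The sketch below assumes the rounded GPA coefficient of 1.0511; the helper `odds_ratio` is a hypothetical name introduced for illustration.

```python
import math

# GPA coefficient from the fitted model, rounded as quoted in the text.
beta_gpa = 1.0511

def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds for a `delta`-unit change in the predictor."""
    return math.exp(beta * delta)

or_one_point = odds_ratio(beta_gpa)         # full one-unit GPA increase
or_tenth_point = odds_ratio(beta_gpa, 0.1)  # 0.1-unit GPA increase
pct_one_point = (or_one_point - 1) * 100    # same effect as a percent change in odds

print(f"odds ratio per 1.0 GPA: {or_one_point:.3f} ({pct_one_point:.0f}% higher odds)")
print(f"odds ratio per 0.1 GPA: {or_tenth_point:.3f}")
```

The odds ratio is constant across GPA levels, whereas the implied change in probability depends on the starting probability.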

In [159]:
X_GPA = df['gpa']

GPA_model = sm.GLM(
    df['accepted'],
    sm.add_constant(X_GPA), 
    family=sm.families.Binomial(link=sm.families.links.logit())
).fit()

print(GPA_model.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               accepted   No. Observations:                  400
Model:                            GLM   Df Residuals:                      398
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -243.48
Date:                Mon, 02 Dec 2024   Deviance:                       486.97
Time:                        23:08:59   Pearson chi2:                     401.
No. Iterations:                     4   Pseudo R-squ. (CS):            0.03200
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -4.3576      1.035     -4.209      0.0



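#### 3. Estimate a model with intercept and GPA scores using the logit link function. What is the impact of a unit change in GPA scores on the probability of acceptance for an individual with an average GPA score?
Based on the result below, an applicant with the average GPA has a predicted acceptance probability of about 31.1%. A one-unit increase in GPA raises it to about 56.4%, an increase of roughly 25.3 percentage points.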

In [168]:
average_gpa = df['gpa'].mean()

# Log-odds and probability for average GPA
log_odds_avg_gpa = GPA_model.params['const'] + GPA_model.params['gpa'] * average_gpa
probability_avg_gpa = 1 / (1 + np.exp(-log_odds_avg_gpa)) * 100

# Change in probability after one-unit increase in GPA
log_odds_unit_increase = GPA_model.params['const'] + GPA_model.params['gpa'] * (average_gpa + 1)
probability_unit_increase = 1 / (1 + np.exp(-log_odds_unit_increase)) * 100
change_in_probability = probability_unit_increase - probability_avg_gpa

print(f"Probability of acceptance at the average GPA: {probability_avg_gpa:.2f}%")
print(f"Probability of acceptance after a one-unit increase in GPA: {probability_unit_increase:.2f}%")
print(f"Change in probability for a one-unit increase in GPA: {change_in_probability:.2f} percentage points")

Probability of acceptance at the average GPA: 31.12%
Probability of acceptance after a one-unit increase in GPA: 56.38%
Change in probability for a one-unit increase in GPA: 25.26 percentage points
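The discrete change above can be compared with the instantaneous marginal effect of the logit model, $\beta \, p (1 - p)$, evaluated at the average GPA. This is a sketch under stated assumptions: it hard-codes the rounded GPA coefficient (1.0511) and the printed probability at the average GPA, and the helper names are illustrative.

```python
import math

def sigmoid(z):
    """Inverse logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

beta_gpa = 1.0511            # rounded GPA coefficient from the GPA-only model
p_avg = 0.3112174333764286   # predicted probability at the average GPA (printed output)

# Discrete change: move one full GPA unit along the logistic curve.
discrete_change = sigmoid(logit(p_avg) + beta_gpa) - p_avg

# Marginal effect: slope of the logistic curve at the average GPA.
marginal_effect = beta_gpa * p_avg * (1.0 - p_avg)

print(f"discrete change for +1 GPA: {discrete_change * 100:.2f} percentage points")
print(f"marginal effect at average: {marginal_effect * 100:.2f} percentage points per unit")
```

The two numbers differ because the logistic curve is not linear: the slope at the average GPA only approximates the change over a full unit.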


#### 4. Estimate the binomial model with logit link function and all available covariates. Interpret the results, including coefficients, z-values, p-values, and residual deviance.

In [170]:
# Create dummy variables for ranking
df = pd.get_dummies(df, columns=['ranking'], drop_first=True)
dummy_columns = df.select_dtypes(include='bool').columns
df[dummy_columns] = df[dummy_columns].astype(int)
df.head()

Unnamed: 0,accepted,gre,gpa,ranking_RANK02,ranking_RANK03,ranking_RANK04
0,0,380,3.61,0,1,0
1,1,660,3.67,0,1,0
2,1,800,4.0,0,0,0
3,1,640,3.19,0,0,1
4,0,520,2.93,0,0,1


In [172]:
# Fit a logistic regression model with all covariates
X_full = df.drop(columns=['accepted'])

full_model = sm.GLM(
    df['accepted'],
    sm.add_constant(X_full), 
    family=sm.families.Binomial(link=sm.families.links.logit())
).fit()

print(full_model.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               accepted   No. Observations:                  400
Model:                            GLM   Df Residuals:                      394
Model Family:                Binomial   Df Model:                            5
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -229.26
Date:                Mon, 02 Dec 2024   Deviance:                       458.52
Time:                        23:09:37   Pearson chi2:                     397.
No. Iterations:                     4   Pseudo R-squ. (CS):            0.09846
Covariance Type:            nonrobust                                         
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -3.9900      1.140     -3.



In [174]:
from scipy.stats import chi2

# The deviance and residual df of GPA_model and full_model
deviance_GPA = GPA_model.deviance
deviance_full = full_model.deviance

df_resid_GPA = GPA_model.df_resid
df_resid_full = full_model.df_resid

# Residual-deviance goodness-of-fit p-values (chi-square upper tail); a small value signals lack of fit
p_value_GPA = chi2.sf(deviance_GPA, df_resid_GPA)
p_value_full = chi2.sf(deviance_full, df_resid_full)

print(f'deviance of GPA Model: {deviance_GPA}')
print(f'deviance of full Model: {deviance_full}')
print(f"P-value of GPA Model: {p_value_GPA}")
print(f"P-value of Full Model: {p_value_full}")

deviance of GPA Model: 486.96762254232175
deviance of full Model: 458.51749247589896
P-value of GPA Model: 0.0014977845010662725
P-value of Full Model: 0.013653471581259666
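Because the GPA-only model is nested in the full model, the two deviances can also be compared directly with a likelihood-ratio test on their difference. The sketch below is one way to do this, not necessarily the approach intended by the assignment. It uses the printed deviances and the residual degrees of freedom from the summaries, and a closed-form chi-square tail that holds only for an even number of degrees of freedom (`scipy.stats.chi2.sf` covers the general case). The function names are hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**j / math.factorial(j) for j in range(df // 2))

def likelihood_ratio_test(dev_small, df_resid_small, dev_large, df_resid_large):
    """Compare a smaller model against a larger model that nests it."""
    lr_stat = dev_small - dev_large            # drop in deviance
    df_diff = df_resid_small - df_resid_large  # number of added parameters
    return lr_stat, df_diff, chi2_sf_even_df(lr_stat, df_diff)

# Deviances and residual df taken from the notebook output.
lr_stat, df_diff, p_value = likelihood_ratio_test(
    486.96762254232175, 398,   # GPA-only model
    458.51749247589896, 394,   # full model
)

print(f"LR statistic: {lr_stat:.2f} on {df_diff} df, p-value: {p_value:.2e}")
```

A small p-value here would indicate that GRE and the ranking dummies jointly improve the fit over GPA alone.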


##### **Results Interpretation:**

##### **1. Model with GPA Only**
- **Intercept Coefficient = -4.3576**
- **GPA Coefficient = 1.0511**: A one-unit increase in GPA multiplies the odds of acceptance by $e^{1.0511} ≈ 2.861$.
- **Z-value & P-value**:
  - GPA has a significant effect on acceptance ($z = 3.517$, $p < 0.001$).
  - The intercept is also significant ($p < 0.001$).
- **Residual Deviance**: 486.97
- **Goodness of Fit**:
  - Log-Likelihood: -243.48
  - Pseudo R-squared: 0.03200

##### **2. Model with GRE, GPA, and Rankings**
- **Intercept Coefficient = -3.9900**
- **GRE Coefficient = 0.0023**: Holding the other covariates fixed, a one-point increase in GRE multiplies the odds of acceptance by $e^{0.0023} ≈ 1.0023$. This is statistically significant ($z = 2.070$, $p = 0.038$).
- **GPA Coefficient = 0.8040**: Holding the other covariates fixed, a one-unit increase in GPA multiplies the odds of acceptance by $e^{0.8040} ≈ 2.234$. This is also significant ($z = 2.423$, $p = 0.015$).
- **Ranking Coefficients**: All ranking coefficients are negative. Relative to the baseline category RANK01 (dropped by `drop_first=True`), each of the other ranking categories is associated with lower log-odds of acceptance.
  - **RANK02 = -0.6754**: Significant ($z = -2.134$, $p = 0.033$).
  - **RANK03 = -1.3402**: Highly significant ($z = -3.881$, $p < 0.001$).
  - **RANK04 = -1.5515**: Highly significant ($z = -3.713$, $p < 0.001$).
- **Residual Deviance**: 458.52
- **P-value (residual deviance chi-square test)**: 0.0137
- **Goodness of Fit**:
  - Log-Likelihood: -229.26
  - Pseudo R-squared: 0.09846
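The full-model coefficients are easier to compare once they are exponentiated into odds ratios. The sketch below hard-codes the rounded coefficients quoted in the list above, so the results are approximate. With the fitted model at hand, `np.exp(full_model.params)` would give the unrounded values.

```python
import math

# Rounded coefficients from the full model, as quoted in the interpretation.
coefs = {
    "gre": 0.0023,
    "gpa": 0.8040,
    "ranking_RANK02": -0.6754,
    "ranking_RANK03": -1.3402,
    "ranking_RANK04": -1.5515,
}

# Odds ratio per one-unit change, holding the other covariates fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<16} odds ratio = {ratio:.4f}")

# A one-point GRE change is tiny, so a 100-point change is easier to read.
or_gre_100 = math.exp(100 * coefs["gre"])
print(f"gre (per 100 points) odds ratio = {or_gre_100:.4f}")
```

Odds ratios below 1 for the ranking dummies mean lower odds of acceptance than in the baseline category RANK01.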

---
##### **Conclusion:**
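  - The full model has a lower residual deviance (458.52) than the GPA-only model (486.97). The drop of 28.45 for 4 additional parameters is well above the 5% chi-square critical value of 9.49, which indicates that the full model fits better.
  - Rankings, particularly RANK03 and RANK04 (the numerically highest rank categories), have the largest effects on the log-odds of acceptance, with strongly negative coefficients relative to the RANK01 baseline.
  - The residual-deviance p-value is smaller for the GPA model (0.0015) than for the full model (0.0137). These p-values test each model's goodness of fit, where a small value signals lack of fit, so they do not measure how strong a predictor GPA is. The chi-square approximation behind them is also unreliable for ungrouped binary outcomes.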
---

#### 5. An alternative to the logit link function is the probit link function. $$ F^{-1}(p_i) = x_i^T \beta $$ where $F$ is the cumulative distribution function of the standard normal distribution. Estimate the binomial probit model using the probit link function. Interpret the results.

In [178]:
X_probit = df.drop(columns=['accepted'])

probit_model = sm.GLM(
    df['accepted'],
    sm.add_constant(X_probit), 
    family=sm.families.Binomial(link=sm.families.links.probit())
).fit()

print(probit_model.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               accepted   No. Observations:                  400
Model:                            GLM   Df Residuals:                      394
Model Family:                Binomial   Df Model:                            5
Link Function:                 probit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -229.21
Date:                Mon, 02 Dec 2024   Deviance:                       458.41
Time:                        23:09:40   Pearson chi2:                     398.
No. Iterations:                     5   Pseudo R-squ. (CS):            0.09869
Covariance Type:            nonrobust                                         
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -2.3868      0.674     -3.
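Probit and logit coefficients are on different scales, so they cannot be compared directly. The sketch below uses only the two intercepts visible in the summaries (−3.9900 for logit, −2.3868 for probit) and the sample acceptance rate of 0.3175. It relies on the standard library's `statistics.NormalDist` for the normal CDF, and the helper names are illustrative. A commonly cited rule of thumb puts logit coefficients at roughly 1.6 to 1.8 times their probit counterparts.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    """Inverse of the probit link: the standard normal CDF."""
    return STD_NORMAL.cdf(eta)

# Intercepts copied from the two full-model summaries.
logit_intercept = -3.9900
probit_intercept = -2.3868
ratio = logit_intercept / probit_intercept

# The same probability expressed on each link scale.
p = 0.3175
eta_logit = math.log(p / (1.0 - p))
eta_probit = STD_NORMAL.inv_cdf(p)

print(f"logit / probit intercept ratio: {ratio:.3f}")
print(f"p = {p}: logit scale {eta_logit:.4f}, probit scale {eta_probit:.4f}")
```

The nearly identical deviances of the two full models (458.52 and 458.41) suggest that the choice of link makes little practical difference for these data.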

