<a href="https://colab.research.google.com/github/hardikdhamija96/Jamboree_CaseStudy/blob/main/Jumbooree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

🔷Problem Statement — Jamboree Graduate Admissions
================================================================

Business context
----------------

Jamboree has launched an online “Chance of Admit” checker for study-abroad aspirants. The tool should be credible, transparent, and useful for counselling teams and marketing funnels. We need a data-driven model that explains what drives admits and gives an accurate probability estimate for each student profile.

Objective
---------

*   **Primary**: Build an interpretable regression model to estimate **Chance of Admit** for an applicant, given profile attributes.
    
*   **Secondary**: Identify the **key drivers** that most influence admit probability and quantify their impact so counselling can guide students on practical improvements.
    

Decisions this will enable
--------------------------

*   **Student guidance**: What to improve first (GRE, TOEFL, SOP, LOR, GPA, research) to meaningfully lift admit chance.
    
*   **Lead qualification**: Prioritise high potential leads for counsellor follow-ups.
    
*   **Content and prep strategy**: Which score bands and profile gaps to target in blogs, ads, and workshops.
    
*   **Scholarship or premium service targeting**: Identify segments with strong lift potential.
    

Target variable and inputs
--------------------------

*   **Target**: Chance of Admit (0 to 1).
    
*   **Predictors**: GRE Score, TOEFL Score, University Rating, SOP Strength, LOR Strength, Undergrad GPA, Research Experience (0 or 1).
    
*   **Identifier to drop**: Serial No.
    

Modelling scope and approach
----------------------------

*   **Scope**: Supervised regression with **explanatory focus**.
    
*   **Baseline**: OLS Linear Regression using statsmodels to get coefficients, p-values, confidence intervals, and model diagnostics.
    
*   **Regularised variants**: Ridge and Lasso to handle multicollinearity and improve generalisation; compare with OLS.
    
*   **Assumption checks**: VIF for multicollinearity, residual mean near zero, linearity via residual plots, homoscedasticity, and normality of residuals.
    

Success criteria
----------------

*   **Quality**: Reasonable Train vs Test parity on **MAE, RMSE, R², Adjusted R²** with no obvious overfit.
    
*   **Interpretability**: Clear ranking of drivers and practical interpretation of coefficients.
    
*   **Calibration**: Predicted probabilities align with observed bands on hold-out data.
    
*   **Actionability**: Concrete recommendations that a student can act on to lift chances.
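
The calibration criterion above can be checked with a simple banded comparison. The sketch below is one possible way to do it, not the notebook's own method: `calibration_by_band` and the band edges are hypothetical, and the demo arrays are synthetic rather than Jamboree data.

```python
import numpy as np

def calibration_by_band(y_true, y_pred, edges=(0.0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Compare mean predicted and mean observed values within prediction bands."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # The upper edge is included only in the last band;
        # predictions outside the edges are ignored.
        if hi == edges[-1]:
            mask = (y_pred >= lo) & (y_pred <= hi)
        else:
            mask = (y_pred >= lo) & (y_pred < hi)
        if mask.any():
            rows.append({
                "band": f"{lo:.1f}-{hi:.1f}",
                "n": int(mask.sum()),
                "mean_pred": float(y_pred[mask].mean()),
                "mean_actual": float(y_true[mask].mean()),
            })
    return rows

# Synthetic illustration only (not the Jamboree data)
rng = np.random.default_rng(0)
demo_pred = rng.uniform(0.4, 0.95, size=200)
demo_true = np.clip(demo_pred + rng.normal(0, 0.05, size=200), 0, 1)
for row in calibration_by_band(demo_true, demo_pred):
    print(row)
```

A well-calibrated model should show `mean_pred` and `mean_actual` close to each other in every band that has a reasonable number of rows.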
    

Constraints and considerations
------------------------------

*   Data represents past applicants; there may be **selection bias** and noisy proxies like University Rating.
    
*   Relationships may be **nonlinear**; we begin with linear and expand only if diagnostics demand.
    
*   Keep the tool **simple and transparent** for counselling conversations.

In [None]:
!pip install gdown



In [None]:
import gdown

file_id = "1Ym9Zt60vgOReap1cGFM1L7-actKbynyk"
gdown.download(f"https://drive.google.com/uc?id={file_id}", "jumboree.csv", quiet=False)

In [None]:
import pandas as pd
df = pd.read_csv("jumboree.csv")
df.head()

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
df.describe()

*   Dataset has 500 complete entries with no missing values. GRE, TOEFL, and CGPA show strong academic profiles (averages: GRE ~316, TOEFL ~107, CGPA ~8.6).
    
*   SOP/LOR average ~3.3–3.5; ~56% of applicants have research experience; Chance of Admit has mean ~0.72 (range 0.34–0.97).
    
*   Data is clean, realistic, and well suited for regression after dropping the Serial No. column.

In [None]:
df.drop(columns=["Serial No."], inplace=True)

In [None]:
df.info()

In [None]:
df.duplicated().sum()

In [None]:
df.isnull().sum().sum()

In [None]:
df.info()

*   **Unique ID dropped**: Serial No. removed as it has no predictive value.
    
*   **Duplicates & nulls**: None found (0 duplicates, 0 nulls) → dataset is clean and consistent.
    
*   ✅ Data is now model-ready for EDA and preprocessing, with 8 meaningful variables remaining.

In [None]:
continous_features = ['GRE Score', 'TOEFL Score','CGPA','Chance of Admit']

In [None]:
df.rename(columns={'Chance of Admit ':'Chance of Admit'}, inplace=True)

In [None]:
print(df.columns.tolist())

In [None]:
df[continous_features].describe()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
fig,axes = plt.subplots(1,4, figsize=(14,4))

for ax,col in zip(axes,continous_features):
  ax.hist(df[col])
  ax.set_title(col)
  ax.set_xlabel("Value")
  ax.set_ylabel("Frequency")
  ax.grid(True)

plt.tight_layout()
plt.show()

*   **GRE & TOEFL**: Both show near-normal distributions with concentration in the upper ranges, confirming applicants are generally high scorers.
    
*   **CGPA**: Clustered around 8–9, with only a few low values → reflects a competitive pool.
    
*   **Chance of Admit**: Most students fall in 0.6–0.9 range, with a clear skew towards higher admit probabilities.

In [None]:
fig,axes = plt.subplots(1,4, figsize=(14,4))

for ax,col in zip(axes,continous_features):
  ax.boxplot(df[col])
  ax.set_title(col)
  ax.grid(True)

plt.tight_layout()
plt.show()

*   **GRE & TOEFL**: Scores are tightly distributed with no major outliers, most students performing in higher bands.
    
*   **CGPA**: Concentrated between 8–9; spread is limited and no extreme anomalies.
    
*   **Chance of Admit**: Mostly between 0.6–0.9, with a few lower-end outliers (<0.4) indicating weaker profiles.
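
The points a boxplot draws as outliers follow the 1.5 × IQR rule (matplotlib's default whisker setting), so they can also be listed directly instead of being read off the chart. A minimal sketch, with a hypothetical helper `iqr_outliers` and made-up demo values:

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return values outside [Q1 - whis*IQR, Q3 + whis*IQR] and the two fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)

# Made-up values for illustration, not taken from the dataset
demo = np.array([0.34, 0.62, 0.65, 0.70, 0.72, 0.75, 0.78, 0.80, 0.85, 0.90])
outliers, fences = iqr_outliers(demo)
print("outliers:", outliers)
print("fences:", fences)
```

In the notebook this could be applied as `iqr_outliers(df["Chance of Admit"])` to see exactly which values fall below the lower fence.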

In [None]:
fig,axes = plt.subplots(1,4, figsize=(14,4))

for ax,col in zip(axes,continous_features):
  sns.kdeplot(df[col], ax=ax, fill=True)
  ax.set_title(col)
  ax.grid(True)

plt.tight_layout()
plt.show()

*   **GRE & TOEFL**: Both are near-normal with peaks around ~315 (GRE) and ~107 (TOEFL), showing most candidates are strong scorers.
    
*   **CGPA**: Bell-shaped distribution centered near 8.5–9, with few at the lower end → indicates competitive academic consistency.
    
*   **Chance of Admit**: Slight left (negative) skew, peaking near 0.7–0.8 with a longer tail toward low values → majority of applicants cluster in medium-to-high admit probabilities.
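
Skew direction is easy to misread from a density plot, so a numeric check can help. The sketch below computes the moment-based skewness coefficient by hand; a negative value means the longer tail points toward low values. The helper name and the demo array are illustrative assumptions.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5

# Made-up values with most mass high and a tail toward low values
left_tailed = np.array([0.40, 0.60, 0.70, 0.75, 0.80, 0.80, 0.85, 0.90])
print(sample_skewness(left_tailed))        # negative: tail toward low values
print(sample_skewness(1.0 - left_tailed))  # mirrored data flips the sign
```

On the real data, `df[continous_features].skew()` gives a bias-adjusted version of the same quantity for each column.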

In [None]:
numerical_features = ['University Rating', 'SOP', 'LOR ', 'Research']

In [None]:
for i in numerical_features:
  print(df[i].value_counts())

In [None]:
fig,axes = plt.subplots(2,len(numerical_features)//2,figsize=(14,6))

axes = axes.flatten()

for ax,col in zip(axes,numerical_features):
  sns.countplot(x=col,data=df,ax=ax)

plt.show()

*   **University Rating**: The majority of applicants fall in the mid-tier ratings (2–4), with Rating 3 the most common (162).
    
*   **SOP Strength**: Distribution leans towards stronger SOPs (3.0–4.5), very few weak SOPs (≤1.5).
    
*   **LOR Strength**: Similar to SOP, clustered around 3.0–4.0; weak LORs (≤2) are rare.
    
*   **Research**: 56% of applicants (280) have research experience, 44% (220) do not — showing a balanced but slightly research-heavy dataset.

In [None]:
plt.figure(figsize=(5,4))
sns.scatterplot(x='GRE Score',y='Chance of Admit',data=df)
plt.show()

*   Strong **positive linear relationship** between GRE Score and Chance of Admit — higher GRE consistently aligns with higher admission probability.
    
*   Spread is tighter at the top (GRE ≥ 325 mostly admit chance ≥ 0.8), while at lower GRE (≤ 305) admit chances vary widely → GRE is influential but not the sole determinant.

In [None]:
plt.figure(figsize=(5,4))
sns.scatterplot(x='CGPA',y='Chance of Admit',data=df)
plt.show()

*   **Very strong positive linear trend**: Higher CGPA almost directly translates into higher admit chances.
    
*   Students with **CGPA ≥ 9.0** generally have admission chances above 0.8, while those below 8.0 face wider uncertainty (0.4–0.7 range).

In [None]:
plt.figure(figsize=(5,4))
sns.scatterplot(x='GRE Score',y='TOEFL Score',data=df)
plt.show()

*   Clear **positive correlation**: students with higher GRE scores also tend to have higher TOEFL scores.
    
*   Some spread exists in the mid-range (GRE 305–315) where TOEFL scores vary between 95–115, but overall the relationship is quite linear.

In [None]:
plt.figure(figsize=(10,4))

sns.kdeplot(df,x="Chance of Admit",hue="University Rating",fill=True,common_norm = False,palette='crest')
plt.show()

*   **Clear upward shift**: Higher-rated universities are associated with higher admit probabilities.
    
*   Distribution centers:
    
    *   Rating 1 peaks around ~0.55,
        
    *   Rating 3 around ~0.70,
        
    *   Rating 5 sharply concentrated near ~0.9.
        
*   Overlap exists between adjacent ratings, but the trend is **monotonic upward**.

In [None]:
plt.figure(figsize=(10,4))

sns.kdeplot(df,x="Chance of Admit",hue="Research",fill=True,common_norm = False,palette='crest')
plt.show()

*   Applicants **with research experience (1)** show a distribution centered higher (~ 0.85) compared to those **without research (0)** (~0.75).
    
*   The curve for research students is narrower and shifted right, indicating consistently better admit chances.
    
*   Non-research profiles still get admits, but with lower and more spread-out probabilities.

In [None]:
plt.figure(figsize=(8,4))

sns.kdeplot(df,x="Chance of Admit",hue="SOP",fill=True,common_norm = False,palette='Paired')
plt.show()

*   **Clear upward shift**: Higher SOP scores align with higher admit probabilities.
    
*   SOP = 1–2 peaks around 0.5–0.6, while SOP ≥ 4 shifts sharply to 0.8–0.9+.
    
*   Distributions overlap, but the progression is monotonic — stronger SOP consistently improves chances.

In [None]:
df.columns

In [None]:
df.rename(columns={"LOR ":"LOR"},inplace=True)

In [None]:
plt.figure(figsize=(8,4))

sns.kdeplot(df,x="Chance of Admit",hue="LOR",fill=True,common_norm = False,palette='Paired')
plt.show()

*   **Positive trend**: Higher LOR ratings shift the admit probability curve to the right.
    
*   Weak LORs (≤2.0) mostly peak around 0.5–0.6, while strong LORs (≥4.0) concentrate in the 0.8–0.9+ range.
    
*   Some overlap exists, but overall higher LOR strength is strongly associated with higher admit chances.

In [None]:
corr = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", linewidths=0.5,cmap='crest')
plt.title("Correlation Matrix Heatmap")
plt.show()

*   **Strongest predictors of Chance of Admit**: CGPA (0.88), GRE (0.81), and TOEFL (0.79). These are the **primary academic drivers** of admission chances.
    
*   **Moderate impact variables**: University Rating (0.69), SOP (0.68), LOR (0.65) → they supplement academics by improving overall profile strength.
    
*   **Research**: Positively correlated (0.55), showing value-add but less dominant than scores or GPA.
    
*   **High inter-correlation** among GRE, TOEFL, and CGPA (≥0.8) → raises potential **multicollinearity risk**, requiring VIF check before regression.

In [None]:
# GRE, TOEFL and CGPA are highly correlated, so rescale GRE and TOEFL to a 0-10 range and sum all three into one feature
df['Total Score'] = (df['GRE Score']*10/340) + (df['TOEFL Score']*10/120) + df['CGPA']
df.drop(['GRE Score', 'TOEFL Score', 'CGPA'], axis=1, inplace=True)

df.head()

*   **New academic composite created**: Total Score combines GRE, TOEFL, and CGPA into one holistic metric on a ~20–30 scale.
    
*   **Remaining features**: University Rating, SOP, LOR, and Research act as qualitative differentiators, with Chance of Admit as target.
    
*   ✅ Dataset is now balanced: **1 composite academic feature + 4 qualitative features → target**.
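
To make the composite concrete, here is the same arithmetic as a small function. GRE is divided by its maximum of 340 and TOEFL by its maximum of 120 so that each contributes on a 0-10 scale alongside CGPA. The applicant values are hypothetical and chosen only to illustrate the calculation.

```python
def total_score(gre, toefl, cgpa):
    """Rescale GRE (max 340) and TOEFL (max 120) to 0-10 and add CGPA (0-10)."""
    return gre * 10 / 340 + toefl * 10 / 120 + cgpa

# Hypothetical applicant, for illustration only
example = total_score(gre=320, toefl=110, cgpa=8.5)
print(round(example, 2))

# A perfect profile reaches the theoretical maximum of the scale
print(total_score(340, 120, 10))
```

Because the three parts are simply added, each is weighted equally on the 0-10 scale; a different weighting would be a modelling choice to validate separately.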

In [None]:
X = df.drop('Chance of Admit', axis=1)
y = df['Chance of Admit']
X.shape, y.shape

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
X_train.head()

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train

In [None]:
X_train_columns = X.columns
# X_train_columns

In [None]:
X_train = pd.DataFrame(X_train, columns=X_train_columns)
X_test = pd.DataFrame(X_test, columns=X_train_columns)
X_train.head()

In [None]:
X_test.head()

In [None]:
X_train.describe()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

In [None]:
print(f"Coefficients: {lr_model.coef_}")
print(f"Intercept: {lr_model.intercept_}")

*   **Total Score dominates**: With the highest coefficient (0.103), academic strength is the **single biggest driver** of admission chances. Because the features were standardized, each coefficient is the change in Chance of Admit per one-standard-deviation increase in that feature.
    
*   **Qualitative lifts**: LOR (0.019) and Research (0.013) add smaller but meaningful boosts, highlighting the importance of profile quality beyond scores.
    
*   **Minor role**: SOP (0.003) and University Rating (0.003) have small positive effects.
    
*   **Overall**: Admissions depend primarily on strong academics, but **supporting factors help differentiate applicants in the competitive middle band**.
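
Since the model was fitted on standardized inputs, a coefficient can be converted to a per-raw-unit effect by dividing it by that feature's standard deviation. The sketch below checks this identity on synthetic data; `per_unit_effects` is a hypothetical helper and the numbers are not from the notebook.

```python
import numpy as np

def per_unit_effects(std_coefs, feature_stds):
    """Convert coefficients fitted on standardized features to per-raw-unit effects."""
    return np.asarray(std_coefs, dtype=float) / np.asarray(feature_stds, dtype=float)

# Synthetic check: slope on standardized x, divided by std(x), equals slope on raw x
rng = np.random.default_rng(3)
x = rng.normal(loc=26.0, scale=2.0, size=200)
y = 0.05 * x + rng.normal(scale=0.05, size=200)
z = (x - x.mean()) / x.std()

slope_raw = np.polyfit(x, y, 1)[0]
slope_std = np.polyfit(z, y, 1)[0]
print(slope_raw, per_unit_effects([slope_std], [x.std()])[0])
```

In the notebook, the equivalent would be `lr_model.coef_ / scaler.scale_`, which gives the expected change in Chance of Admit per one raw unit of each feature.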

In [None]:
y_train_pred = lr_model.predict(X_train)
y_test_pred  = lr_model.predict(X_test)

In [None]:
n_train, p = X_train.shape
n_test = X_test.shape[0]

In [None]:
import numpy as np

In [None]:
mse_tr = mean_squared_error(y_train, y_train_pred)
rmse_tr = np.sqrt(mse_tr)
r2_tr = r2_score(y_train, y_train_pred)
adj_r2_tr = 1 - (1 - r2_tr) * (n_train - 1) / (n_train - p - 1)

print("Training Set")
print(f"  RMSE    : {rmse_tr:.3f}")
print(f"  R^2     : {r2_tr:.3f}")
print(f"  Adj R^2 : {adj_r2_tr:.3f}")

In [None]:
mse_te = mean_squared_error(y_test, y_test_pred)
rmse_te = np.sqrt(mse_te)
r2_te = r2_score(y_test, y_test_pred)
adj_r2_te = 1 - (1 - r2_te) * (n_test - 1) / (n_test - p - 1)

print("\nTest Set")
print(f"  RMSE    : {rmse_te:.3f}")
print(f"  R^2     : {r2_te:.3f}")
print(f"  Adj R^2 : {adj_r2_te:.3f}")

*   **Training vs Test**: Very close scores (Train R² = 0.81, Test R² = 0.80) → model is **stable and not overfitting**.
    
*   **Error levels**: RMSE ~0.06 on both sets → the typical prediction error is about 0.06, i.e. 6 percentage points of admit probability, which is **quite accurate** for real-world use.
    
*   **Adjusted R² (~0.79–0.81)**: Confirms that predictors collectively explain ~80% of the variance in admit chances.
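
The success criteria also list MAE, which the cells above do not print. A small helper can report all four metrics in one place; this is a sketch with a hypothetical function name and made-up demo arrays, using the same adjusted R² formula as the cells above.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Return MAE, RMSE, R^2 and adjusted R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "Adj R2": adj_r2}

# Made-up values for illustration only
demo_true = np.array([0.50, 0.60, 0.70, 0.80, 0.90, 0.65, 0.75, 0.85])
demo_pred = np.array([0.55, 0.58, 0.72, 0.76, 0.88, 0.70, 0.73, 0.80])
for name, value in regression_report(demo_true, demo_pred, n_features=5).items():
    print(f"{name}: {value:.3f}")
```

With the notebook's variables this would be called as `regression_report(y_test, y_test_pred, p)`. MAE is never larger than RMSE, and a large gap between the two points to a few big misses.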

In [None]:
plt.figure(figsize=(10,4))

# Actual vs Predicted
sns.scatterplot(x=y_test, y=y_test_pred, color="blue")
sns.regplot(x=y_test, y=y_test_pred, scatter=False, color="red")
plt.xlabel("Actual Chance of Admit")
plt.ylabel("Predicted Chance of Admit")
plt.title("Actual vs. Predicted (Linear Regression)")
plt.show()

*   Points lie close to the red trend line (a fit of predicted on actual values, not the y = x diagonal) → the model is capturing the relationship well.
    
*   Only minor deviations exist, no major systematic bias.
    
*   ✅ Confirms **good predictive accuracy**.

In [None]:
plt.figure(figsize=(10,3))

residuals = y_test - y_test_pred   # use predictions on test set
sns.scatterplot(x=y_test, y=residuals, color="blue")
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Actual Values")
plt.ylabel("Residuals")
plt.title("Actual vs. Residuals (Test Set)")
plt.show()

*   Residuals are scattered randomly around zero without a clear pattern → satisfies **linearity assumption**.
    
*   Spread is fairly constant across the range of actual values → indicates **no serious heteroscedasticity**.
    
*   ✅ Confirms residuals are centered near zero and model assumptions largely hold.

In [None]:

# Get coefficients
coefficients = lr_model.coef_
features = X_train.columns  # feature names from your dataset

# Put into DataFrame for easy plotting
coef_df = pd.DataFrame({
    "Feature": features,
    "Coefficient": coefficients
}).sort_values(by="Coefficient", ascending=False)

# Plot
plt.figure(figsize=(10,3))
plt.bar(coef_df["Feature"], coef_df["Coefficient"], color="steelblue")
plt.title("Model Coefficients")
plt.ylabel("Weights")
plt.xticks(rotation=90)
plt.show()

*   **Total Score** is by far the strongest driver, confirming that overall academic strength dominates admit decisions.
    
*   **LOR** and **Research** have meaningful positive influence, showing that strong recommendations and research exposure significantly improve chances.
    
*   **SOP** and **University Rating** contribute positively but comparatively less, acting more as supporting enhancers.

In [None]:
import statsmodels.api as sm

In [None]:
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)

ols_model = sm.OLS(np.array(y_train), X_train_sm).fit()
print(ols_model.summary())

*   Model explains ~81% of admit chance variance (R² = 0.81) → strong fit.
    
*   **Total Score** is the dominant driver; **LOR** and **Research** add significant positive impact.
    
*   **SOP** and **Univ Rating** are statistically weak in this setup, likely overshadowed by academics.
    
*   Residual checks show no major issues → model is robust and reliable.

In [None]:
X_train_sm_new = X_train_sm.drop(['SOP'], axis=1)
ols_model_new = sm.OLS(np.array(y_train), X_train_sm_new).fit()
print(ols_model_new.summary())

*   Model fit is strong (R² = 0.81).
    
*   **Total Score** dominates; **LOR** and **Research** add significant lifts.
    
*   **University Rating** is not statistically significant.
    
*   **SOP was dropped** this time due to its low impact in the earlier run, simplifying the model without hurting performance.

In [None]:
X_train_sm_new = X_train_sm.drop(['University Rating'], axis=1)
ols_model_new = sm.OLS(np.array(y_train), X_train_sm_new).fit()
print(ols_model_new.summary())

*   Model fit strong (R² = 0.81).
    
*   **Total Score** dominates, with **LOR** and **Research** significant boosters.
    
*   **SOP remains statistically weak** (p = 0.42).
    
*   **University Rating was dropped** this time, simplifying the model without loss of accuracy.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
def calculate_vif(X):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_data

In [None]:
vif_scores = calculate_vif(X_train)
print(vif_scores)

*   All VIF values are **< 5**, so **no serious multicollinearity** concern.
    
*   Total Score (2.86) and SOP (2.76) are the highest but still within safe limits.
    
*   ✅ Model predictors are independent enough for reliable OLS estimation.
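
For intuition, VIF for feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. The sketch below implements that definition directly with NumPy on synthetic columns. It illustrates the formula and is not a replacement for the statsmodels call; on standardized columns the two should agree closely.

```python
import numpy as np

def vif_by_hand(X):
    """VIF_j = 1 / (1 - R^2_j), regressing each column on all the others."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)          # centering stands in for an intercept
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - (resid @ resid) / (target @ target)
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# Synthetic columns: b is built from a, c is unrelated
rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = 0.9 * a + rng.normal(scale=0.3, size=300)
c = rng.normal(size=300)
print(vif_by_hand(np.column_stack([a, b, c])).round(2))
```

The two related columns get high VIFs while the unrelated one stays near 1, which is the pattern the VIF < 5 rule of thumb screens for.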

In [None]:
residuals = y_test - y_test_pred

sns.set_theme(style='whitegrid')

plt.figure(figsize=(10, 3))
plt.scatter(y_test_pred, residuals, alpha=0.8, color='blue', edgecolor='k', s=50)

plt.axhline(0, color='red', linestyle='--')

plt.title('Residuals vs. Predicted Values', fontsize=16, fontweight='bold')
plt.xlabel('Predicted Values', fontsize=14)
plt.ylabel('Residuals', fontsize=14)

plt.grid(False)
plt.show()

*   Residuals are randomly scattered around zero → **linearity assumption holds**.
    
*   Spread is fairly uniform across predicted values → no clear signs of **heteroscedasticity**.
    
*   ✅ Confirms that the model errors behave randomly, supporting reliability of regression results.

In [None]:
import scipy.stats as stats
sns.set_theme(style='dark')
fig, axs = plt.subplots(1, 2, figsize=(15, 6))

# Histogram of Residuals
sns.histplot(residuals, bins=30, kde=True, color='blue', ax=axs[0])
axs[0].set_title('Histogram of Residuals', fontsize=16)
axs[0].set_xlabel('Residuals', fontsize=14)
axs[0].set_ylabel('Frequency', fontsize=14)
axs[0].grid(False)

# Q-Q Plot
stats.probplot(residuals, dist="norm", plot=axs[1])
axs[1].get_lines()[1].set_color('orange')
axs[1].get_lines()[0].set_markerfacecolor('red')
axs[1].set_title('Q-Q Plot of Residuals', fontsize=16)
axs[1].grid(True)

plt.tight_layout()
plt.show()

*   **Histogram**: Residuals are roughly bell-shaped and centered near zero, though with slight skew.
    
*   **Q-Q Plot**: Points mostly follow the diagonal line → residuals are approximately normal, with mild deviations at the tails.
    
*   ✅ Assumption of normality is reasonably satisfied for regression.

In [None]:
shapiro_stat, shapiro_p_value = stats.shapiro(residuals)
print(f'Shapiro-Wilk Test Statistic: {shapiro_stat}, p-value: {shapiro_p_value}')

if shapiro_p_value > 0.05:
    # Failing to reject does not prove normality; it only means no evidence against it
    print("Fail to reject the null hypothesis: no evidence against normality of residuals.")
else:
    print("Reject the null hypothesis: Residuals are not normally distributed.")

*   **Test result**: p-value ≈ 2.5e-05 < 0.05 → reject null; residuals are **not perfectly normal**.
    
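*   **But**: Shapiro-Wilk is sensitive to a few tail observations, and the residuals tested here come from the 100-row test set (20% of 500), not the 400 training rows. The plots (histogram, Q-Q) showed residuals that are _approximately_ normal, with deviations mainly in the tails.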
    
*   ✅ Conclusion: Slight non-normality exists, but it’s not severe enough to invalidate regression results.

In [None]:
y_test_pred = ols_model.predict(X_test_sm)

y_test = y_test.reset_index(drop=True)
y_test_pred = y_test_pred.reset_index(drop=True)

residuals = y_test - y_test_pred
residuals.shape

In [None]:
from statsmodels.stats.diagnostic import het_breuschpagan

bp_test = het_breuschpagan(residuals, X_test_sm)

bp_labels = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
bp_results = dict(zip(bp_labels, bp_test))

print(bp_results)

In [None]:
if bp_results['p-value'] > 0.05:
    print("Fail to reject the null hypothesis: No evidence of heteroscedasticity.")
else:
    print("Reject the null hypothesis: Heteroscedasticity may be present.")

*   **Breusch-Pagan result**: p-value ≈ 0.16 > 0.05 → fail to reject null.
    
*   ✅ No evidence of heteroscedasticity at the 5% level, which is consistent with constant residual variance.
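
To show what the test measures, the Lagrange multiplier statistic in the studentized (Koenker) form of Breusch-Pagan is n × R² from regressing the squared residuals on the explanatory variables. The sketch below computes only that statistic on synthetic noise. It is an illustration under that assumption, not a substitute for `het_breuschpagan`, which also returns p-values.

```python
import numpy as np

def breusch_pagan_lm(residuals, exog):
    """LM = n * R^2 from regressing squared residuals on exog (constant included in exog)."""
    u2 = np.asarray(residuals, dtype=float) ** 2
    Z = np.asarray(exog, dtype=float)
    coef, *_ = np.linalg.lstsq(Z, u2, rcond=None)
    fitted = Z @ coef
    ss_res = np.sum((u2 - fitted) ** 2)
    ss_tot = np.sum((u2 - u2.mean()) ** 2)
    return u2.size * (1 - ss_res / ss_tot)

# Synthetic "residuals": one series with constant spread, one whose spread grows with x
rng = np.random.default_rng(1)
x = rng.uniform(1, 5, size=400)
Z = np.column_stack([np.ones_like(x), x])
constant_noise = rng.normal(scale=1.0, size=400)
growing_noise = rng.normal(scale=1.0, size=400) * x

print(breusch_pagan_lm(constant_noise, Z))
print(breusch_pagan_lm(growing_noise, Z))
```

The statistic is far larger when the error spread depends on x, which is what a small p-value in the test output would signal.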

In [None]:
from statsmodels.stats.diagnostic import het_goldfeldquandt

gq_test = het_goldfeldquandt(ols_model.resid, ols_model.model.exog)
gq_test_statistic = gq_test[0]
gq_p_value = gq_test[1]

print("Goldfeld-Quandt Test Statistic:", gq_test_statistic)
print("p-value:", gq_p_value)

In [None]:
if gq_p_value > 0.05:
    print("Fail to reject the null hypothesis: No evidence of heteroscedasticity.")
else:
    print("Reject the null hypothesis: Heteroscedasticity may be present.")

*   **Goldfeld-Quandt result**: Test statistic ≈ 1.01, p-value ≈ 0.48 (> 0.05) → fail to reject null.
    
*   ✅ Consistent with **no heteroscedasticity** in residuals.

In [None]:
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)

In [None]:
# Predictions
y_train_pred = lasso_model.predict(X_train)
y_test_pred  = lasso_model.predict(X_test)

# ----- Training set -----
mse_train = mean_squared_error(y_train, y_train_pred)
rmse_train = np.sqrt(mse_train)
r2_train = r2_score(y_train, y_train_pred)
adj_r2_train = 1 - (1 - r2_train) * (len(y_train) - 1) / (len(y_train) - X_train.shape[1] - 1)

print("Training Set Evaluation:")
print(f"  RMSE    : {rmse_train:.2f}")
print(f"  R^2     : {r2_train:.2f}")
print(f"  Adj R^2 : {adj_r2_train:.2f}")

# ----- Test set -----
mse_test = mean_squared_error(y_test, y_test_pred)
rmse_test = np.sqrt(mse_test)
r2_test = r2_score(y_test, y_test_pred)
adj_r2_test = 1 - (1 - r2_test) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

print("\nTest Set Evaluation:")
print(f"  RMSE    : {rmse_test:.2f}")
print(f"  R^2     : {r2_test:.2f}")
print(f"  Adj R^2 : {adj_r2_test:.2f}")

In [None]:
y_test_pred = lasso_model.predict(X_test)

plt.figure(figsize=(10,4))

# Scatterplot
sns.scatterplot(x=y_test, y=y_test_pred, color="blue", alpha=0.6)

# Regression line
sns.regplot(x=y_test, y=y_test_pred, scatter=False, color="red")

plt.xlabel("Actual Chance of Admit")
plt.ylabel("Predicted Chance of Admit")
plt.title("Actual vs Predicted (Lasso Regression)")
plt.grid(True)
plt.show()

In [None]:
y_test_pred = lasso_model.predict(X_test)

# Residuals = Actual - Predicted
residuals = y_test - y_test_pred

plt.figure(figsize=(7,5))
sns.scatterplot(x=y_test, y=residuals, color="blue", alpha=0.6)

# Horizontal line at 0 (perfect predictions)
plt.axhline(0, color="red", linestyle="--")

plt.xlabel("Actual Chance of Admit")
plt.ylabel("Residuals (Actual - Predicted)")
plt.title("Actual vs Residuals (Lasso Regression)")
plt.grid(True)
plt.show()

In [None]:
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X_train, y_train)

In [None]:
# Predictions
y_train_pred = ridge_model.predict(X_train)
y_test_pred  = ridge_model.predict(X_test)

# ----- Training set -----
mse_train = mean_squared_error(y_train, y_train_pred)
rmse_train = np.sqrt(mse_train)
r2_train = r2_score(y_train, y_train_pred)
adj_r2_train = 1 - (1 - r2_train) * (len(y_train) - 1) / (len(y_train) - X_train.shape[1] - 1)

print("Training Set Evaluation:")
print(f"  RMSE    : {rmse_train:.2f}")
print(f"  R^2     : {r2_train:.2f}")
print(f"  Adj R^2 : {adj_r2_train:.2f}")

# ----- Test set -----
mse_test = mean_squared_error(y_test, y_test_pred)
rmse_test = np.sqrt(mse_test)
r2_test = r2_score(y_test, y_test_pred)
adj_r2_test = 1 - (1 - r2_test) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

print("\nTest Set Evaluation:")
print(f"  RMSE    : {rmse_test:.2f}")
print(f"  R^2     : {r2_test:.2f}")
print(f"  Adj R^2 : {adj_r2_test:.2f}")

In [None]:
y_test_pred = ridge_model.predict(X_test)

plt.figure(figsize=(10,4))

# Scatterplot
sns.scatterplot(x=y_test, y=y_test_pred, color="blue", alpha=0.6)

# Regression line
sns.regplot(x=y_test, y=y_test_pred, scatter=False, color="red")

plt.xlabel("Actual Chance of Admit")
plt.ylabel("Predicted Chance of Admit")
plt.title("Actual vs Predicted (Ridge Regression)")
plt.grid(True)
plt.show()

In [None]:
y_test_pred = ridge_model.predict(X_test)

# Residuals = Actual - Predicted
residuals = y_test - y_test_pred

plt.figure(figsize=(7,5))
sns.scatterplot(x=y_test, y=residuals, color="blue", alpha=0.6)

# Horizontal line at 0 (perfect predictions)
plt.axhline(0, color="red", linestyle="--")

plt.xlabel("Actual Chance of Admit")
plt.ylabel("Residuals (Actual - Predicted)")
plt.title("Actual vs Residuals (Ridge Regression)")
plt.grid(True)
plt.show()

*   **OLS**: Performed strongly (R² ≈ 0.81, RMSE ≈ 0.06) with interpretable coefficients; assumptions mostly satisfied.
    
*   **Ridge**: Similar predictive power, helps stabilise coefficients if correlated features exist; slight shrinkage but no major gains since multicollinearity was low (VIF < 5).
    
*   **Lasso**: Introduces feature selection by shrinking less important coefficients toward zero; confirmed SOP/Univ Rating had low impact.
    
*   **Conclusion**: OLS is interpretable and sufficient here, while Ridge/Lasso validate robustness and highlight variable importance.
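
One point worth keeping in mind when comparing the three models: the Lasso cell uses `alpha=0.1`, which is about the same size as the largest OLS coefficient reported earlier (0.103). scikit-learn's Lasso minimises (1 / (2n)) · ‖y − Xw‖² + alpha · ‖w‖₁, and for a single standardized feature the solution is a soft-threshold of the unpenalised coefficient. The sketch below shows that effect on synthetic data; the helper names and the data are assumptions for illustration.

```python
import numpy as np

def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; return 0 when |value| does not exceed alpha."""
    return np.sign(value) * max(abs(value) - alpha, 0.0)

def one_feature_lasso(x, y, alpha):
    """Lasso coefficient for one feature: standardize x, center y, then soft-threshold."""
    z = (x - x.mean()) / x.std()
    yc = y - y.mean()
    return soft_threshold(np.mean(z * yc), alpha)

# Synthetic target on a 0-1-like scale with a true standardized effect near 0.10
rng = np.random.default_rng(7)
x = rng.normal(size=500)
y = 0.72 + 0.10 * x + rng.normal(scale=0.06, size=500)

for alpha in (0.001, 0.01, 0.1):
    print(alpha, one_feature_lasso(x, y, alpha))
```

With a penalty this large, almost all of the signal is shrunk away, so a weak Lasso result at `alpha=0.1` says more about the penalty than about the features. Choosing alpha by cross-validation (for example with `LassoCV` or `RidgeCV` from `sklearn.linear_model`) would give a fairer comparison with OLS.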

🔹 Insights
-----------

*   **Model performance**: Both the base Linear Regression and Ridge Regression models performed strongly, explaining ~81% of the variance in admission chances (R² ≈ 0.81, RMSE ≈ 0.06). The Lasso model, while useful for feature selection, did not improve predictive accuracy in this dataset.
    
*   **Feature importance**: The engineered Total Score (combined GRE, TOEFL, and CGPA) emerged as the most influential predictor of admit probability. Among qualitative variables, **Letter of Recommendation (LOR)** and **Research Experience** showed significant positive contributions, while **University Rating** and **SOP Strength** had limited standalone impact.
    
*   **Collinearity considerations**: The original exam-related features (GRE, TOEFL, CGPA) were highly correlated with each other. Consolidating them into a single Total Score feature improved interpretability and reduced multicollinearity without reducing model accuracy.
    
*   **Assumption validation**: All major assumptions of linear regression were satisfied — residuals showed linearity and homoscedasticity, and VIF values indicated no serious multicollinearity. Although the Shapiro-Wilk test flagged residuals as not perfectly normal, both histogram and Q-Q plots showed only minor deviations, acceptable for practical regression use.
    

👉 **Business takeaway**: Admission outcomes are primarily determined by **academic strength**, with **research exposure and strong recommendations** acting as key differentiators. SOPs and University Rating play a marginal role in this dataset, but could matter in borderline or subjective cases.

🔹 Recommendations
------------------

*   **Improve dataset balance**: The target variable (Chance of Admit) is left-skewed: most applicants have medium-to-high admit chances, with only a thin tail of low values. Collecting more data on low-chance or rejected candidates would widen coverage of that range and improve prediction robustness.
    
*   **Enhance feature set**: To capture the holistic nature of graduate admissions, additional independent variables can be introduced, such as:
    
    *   **Work Experience** – indicates practical skills and maturity.
        
    *   **Internships** – reflect application of knowledge in real-world settings.
        
    *   **Extracurricular Activities** – highlight leadership, teamwork, and diverse strengths.
        
    *   **Diversity Variables** – capture socio-cultural diversity that institutions often value.
        
*   **Model choice**: For interpretability, OLS remains the preferred baseline, while Ridge can be deployed in production for coefficient stability if future features increase multicollinearity. Lasso is useful when feature selection is required on larger datasets.
    
*   **Business application**: The model can be integrated into Jamboree’s admission counselling platform as a **“Chance of Admit Estimator”**, guiding students on where to invest efforts (e.g., improve GRE/CGPA, pursue research, strengthen LORs). This not only supports students but also helps Jamboree **prioritise high-potential leads** and tailor services more effectively.
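
If the linear model is exposed as an estimator, note that a linear prediction is not bounded and can fall below 0 or above 1 for extreme profiles. A minimal sketch of a prediction wrapper that standardizes inputs and clips the output is shown below. The function name and every number are placeholders for illustration, not fitted values from this notebook.

```python
import numpy as np

def estimate_chance(features, coefs, intercept, means, scales):
    """Standardize raw inputs, apply a linear model, and clip to the valid 0-1 range."""
    z = (np.asarray(features, dtype=float) - means) / scales
    raw = intercept + z @ coefs
    return float(np.clip(raw, 0.0, 1.0))

# Placeholder parameters for two features (not the notebook's fitted values)
coefs = np.array([0.10, 0.02])
intercept = 0.72
means = np.array([26.0, 3.5])
scales = np.array([2.0, 0.9])

print(estimate_chance([27.0, 4.0], coefs, intercept, means, scales))
print(estimate_chance([40.0, 5.0], coefs, intercept, means, scales))  # clipped at 1.0
```

In practice the parameters would come from the fitted objects (`lr_model.coef_`, `lr_model.intercept_`, `scaler.mean_`, `scaler.scale_`), and inputs far outside the training range are better flagged than silently clipped.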