# About Dataset

## Data Source

Synthetic data generated from the Wharton Class of 2025's statistics.

## Metadata

- **application_id**: Unique identifier for each application
- **gender**: Applicant's gender (Male, Female)
- **international**: International student (TRUE/FALSE)
- **gpa**: Grade Point Average of the applicant (on 4.0 scale)
- **major**: Undergraduate major (Business, STEM, Humanities)
- **race**: Racial background of the applicant (e.g., White, Black, Asian, Hispanic, Other; null for international students)
- **gmat**: GMAT score of the applicant (out of 800 points)
- **work_exp**: Work experience of the applicant (in years)
- **work_industry**: Industry of the applicant's previous work experience (e.g., Consulting, Finance, Technology, etc.)
- **admission**: Admission status (Admit, Waitlist, Null: Deny)
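
Two columns in the list above use null as a meaningful value: a null `race` marks an international applicant and a null `admission` means the application was denied. The sketch below makes both conventions explicit using only the standard library. The three rows are hypothetical and are not taken from `MBA.csv`; the `decode` helper and the `"Unknown"` and `"Deny"` labels are illustrative choices.

```python
import csv
import io

# Hypothetical rows that follow the column list above; NOT real data from MBA.csv.
raw = """application_id,gender,international,gpa,major,race,gmat,work_exp,work_industry,admission
1,Female,False,3.30,Business,Asian,620,3,Financial Services,Admit
2,Male,True,3.28,Humanities,,680,5,Investment Management,
3,Female,False,3.25,STEM,White,710,4,Consulting,Waitlist
"""

def decode(row):
    """Replace the two meaningful nulls with explicit labels (illustrative helper)."""
    row = dict(row)
    row["race"] = row["race"] or "Unknown"         # empty race: international applicant
    row["admission"] = row["admission"] or "Deny"  # empty admission: denied
    return row

rows = [decode(r) for r in csv.DictReader(io.StringIO(raw))]
for r in rows:
    print(r["application_id"], r["race"], r["admission"])
```

Labelling the nulls before any aggregation keeps international applicants and denied applications from being silently dropped.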

# EDA

## General Analysis with Ydata Profiling

In [62]:
from ydata_profiling import ProfileReport
import pandas as pd
from sklearn.metrics import mutual_info_score

In [63]:
df = pd.read_csv('dataset/MBA.csv')
# profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
# profile.to_file('report.html')

## Feature Importance

For training purposes, let's treat Waitlist as Admit and convert the admission column to a binary format: 1 for Admit or Waitlist, 0 for Deny (null).

In [64]:
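# A null race marks an international applicant; label it explicitly so that
# groupby (which drops NaN keys by default) keeps these rows as their own group.
df.fillna(value={'race': "Unknown"}, inplace=True)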

In [65]:
categorical = ['gender', 'international', 'major', 'race', 'work_industry']
numerical = ['gpa', 'gmat', 'work_exp']
# Merge Waitlist into Admit, then encode Admit as 1 and everything else (null = Deny) as 0.
df.loc[df['admission'] == 'Waitlist', 'admission'] = 'Admit'
df['admission'] = (df['admission'] == 'Admit').astype(int)
overall_admission = df.admission.mean()
overall_admission

np.float64(0.16144656118824668)

Now, let's evaluate how each categorical feature affects the likelihood of admission.

In [66]:
for category in categorical:
    data_category = df.groupby(category).admission.agg(mean='mean', count='count')
    # difference: represents how much the group's admission rate differs from the overall average admission rate (overall_admission).
    data_category['difference'] = data_category['mean'] - overall_admission
    # risk_ratio: indicates how much more or less likely a group is to be admitted compared to the overall average.
    data_category['risk_ratio'] = data_category['mean'] / overall_admission
    display(data_category)
    print()

| gender | mean | count | difference | risk_ratio |
|---|---|---|---|---|
| Female | 0.222124 | 2251 | 0.060677 | 1.375833 |
| Male | 0.126807 | 3943 | -0.03464 | 0.785443 |

| international | mean | count | difference | risk_ratio |
|---|---|---|---|---|
| False | 0.159007 | 4352 | -0.002439 | 0.984892 |
| True | 0.16721 | 1842 | 0.005763 | 1.035696 |

| major | mean | count | difference | risk_ratio |
|---|---|---|---|---|
| Business | 0.158868 | 1838 | -0.002578 | 0.98403 |
| Humanities | 0.16445 | 2481 | 0.003003 | 1.018602 |
| STEM | 0.16 | 1875 | -0.001447 | 0.99104 |

| race | mean | count | difference | risk_ratio |
|---|---|---|---|---|
| Asian | 0.18483 | 1147 | 0.023383 | 1.144837 |
| Black | 0.098253 | 916 | -0.063193 | 0.608581 |
| Hispanic | 0.11745 | 596 | -0.043997 | 0.727483 |
| Other | 0.21097 | 237 | 0.049524 | 1.306751 |
| Unknown | 0.16721 | 1842 | 0.005763 | 1.035696 |
| White | 0.18544 | 1456 | 0.023993 | 1.148613 |

| work_industry | mean | count | difference | risk_ratio |
|---|---|---|---|---|
| CPG | 0.184211 | 114 | 0.022764 | 1.141 |
| Consulting | 0.15874 | 1619 | -0.002707 | 0.983235 |
| Energy | 0.09375 | 32 | -0.067697 | 0.580688 |
| Financial Services | 0.210643 | 451 | 0.049196 | 1.304723 |
| Health Care | 0.143713 | 334 | -0.017734 | 0.890156 |
| Investment Banking | 0.155172 | 580 | -0.006274 | 0.961138 |
| Investment Management | 0.222892 | 166 | 0.061445 | 1.38059 |
| Media/Entertainment | 0.152542 | 59 | -0.008904 | 0.944847 |
| Nonprofit/Gov | 0.150538 | 651 | -0.010909 | 0.93243 |
| Other | 0.149644 | 421 | -0.011803 | 0.926893 |

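As a sanity check on the two metrics, the gender rows can be recomputed by hand. The sketch below copies the per-group rates and counts from the gender output above, rebuilds the overall admission rate as a count-weighted average, and derives `difference` and `risk_ratio` from it. The `groups` and `results` names are illustrative, and the inputs are the rounded values shown in the output, so the last digits may differ slightly from the notebook's.

```python
# Per-group admission rate and group size, copied from the gender output above.
groups = {
    "Female": {"mean": 0.222124, "count": 2251},
    "Male": {"mean": 0.126807, "count": 3943},
}

# The overall rate is the count-weighted average of the group rates.
total = sum(g["count"] for g in groups.values())
overall = sum(g["mean"] * g["count"] for g in groups.values()) / total

results = {}
for name, g in groups.items():
    difference = g["mean"] - overall  # absolute gap, in probability units
    risk_ratio = g["mean"] / overall  # relative gap; 1.0 means "same as average"
    results[name] = (difference, risk_ratio)
    print(f"{name}: difference={difference:+.4f}, risk_ratio={risk_ratio:.3f}")

# Female: difference=+0.0607, risk_ratio=1.376
# Male: difference=-0.0346, risk_ratio=0.785
```

A risk ratio reads as a multiplier: female applicants in this synthetic data are admitted at roughly 1.38 times the average rate, male applicants at roughly 0.79 times.
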

In [67]:
df[categorical].apply(lambda x: mutual_info_score(x, df['admission'])).sort_values()

major            0.000023
international    0.000052
work_industry    0.001364
race             0.004419
gender           0.007527
dtype: float64
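
To see what these scores measure, the gender value can be reproduced from a contingency table. `mutual_info_score` reports mutual information in nats (natural logarithm). The sketch below is a plain-Python version of the same formula, not scikit-learn's implementation. The admitted counts (500 per gender) are an assumption derived by multiplying `mean` by `count` in the gender output and rounding to the nearest integer.

```python
from math import log

def mutual_information(contingency):
    """Mutual information in nats from a {(x, y): count} contingency table."""
    n = sum(contingency.values())
    px, py = {}, {}
    for (x, y), c in contingency.items():
        px[x] = px.get(x, 0) + c  # marginal count of x
        py[y] = py.get(y, 0) + c  # marginal count of y
    # Sum of p(x, y) * log( p(x, y) / (p(x) * p(y)) ) over non-empty cells.
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in contingency.items()
        if c > 0
    )

# Counts derived from the gender output: admitted = round(mean * count).
gender_counts = {
    ("Female", 1): 500, ("Female", 0): 1751,
    ("Male", 1): 500, ("Male", 0): 3443,
}
mi_gender = mutual_information(gender_counts)
print(round(mi_gender, 4))  # 0.0075
```

A score of 0 means the feature and the target are independent, so values this close to 0 indicate that no single categorical feature says much about admission on its own.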

In [68]:
df[numerical].corrwith(df.admission).sort_values()

work_exp    0.006821
gpa         0.289618
gmat        0.353645
dtype: float64