# TU Admissions Modeling

In [389]:
# Importing all necessary packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import boxcox
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error, cohen_kappa_score
import statsmodels.api as sm
import pingouin as pg
import pickle as pk
import warnings

# Configuring warnings and display settings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

# TU Admissions Brief Write Up
### Introduction
The goal of this project was to build a classification model that predicts whether an admitted applicant will enroll at Trinity University. Understanding why some admitted applicants enroll and others do not can help the Office of Admission improve yield, defined here as the number of admitted applicants who accept the offer of admission.

### Dataset Overview
The dataset contains personal, academic, and demographic variables on applicants who were admitted to Trinity University, any of which might affect an applicant's decision to enroll. It covers five entry terms, Fall 2017 through Fall 2021. The response variable, "Decision," is coded 1 if the applicant accepted the offer to enroll and 0 otherwise. The data must be cleaned and preprocessed before a machine learning model can produce accurate and meaningful results.

### Best Model and Most Important Features Discussion 
Boosting was my best model, with a train kappa of 0.570 and a test kappa of 0.514. Total Event Participation, Count of Campus Visits, Athlete_Athlete, Legacy, and Academic Index were among the most important features in the modeling stage.
1) The most significant predictor of enrollment decisions was Total Event Participation, with a Gini of 0.360977. Applicants who are more engaged with university activities are more likely to accept an offer of admission. This indicates that early engagement is very important and should be fostered through both virtual and in-person events. The admissions team should therefore focus on creating more interactive opportunities such as webinars, open houses, and campus tours. Moreover, patterns of participation in these events could give insight into an applicant's enrollment probability, so that follow-up communications can be tailored more effectively to individual needs. Yield might also improve with better experiences during a campus visit, for example a tour that is personalized or led by an admission officer.
2) Other high-value features included Count of Campus Visits (Gini: 0.151382) and Decision Plan (Early Decision I and II; Gini: 0.124239 and 0.052272). Applicants who visit campus are more likely to accept the offer, so the admissions team should focus on attracting visitors to campus. Targeted invitations, travel stipends, and special events for visitors are strategies that could encourage visits. Applicants who apply under an early decision plan are also likely to enroll, since early decision is a binding commitment. Increasing outreach about early decision, and offering those applicants special support and incentives, could raise the number of students who apply early decision.
3) Lastly, factors such as Legacy (Gini: 0.039456) and Academic Index (Gini: 0.033812) suggest that personalized outreach to high-potential applicants can significantly impact yield. Legacy applicants, who are connected to the university through family, were more likely to enroll, so targeted communications or events may strengthen this group's attachment to the university. For example, there could be events that encourage alumni in a certain age range to bring their children to learn about attending Trinity. Similarly, students with a higher Academic Index were more likely to attend; focusing recruiting efforts on the highest-achieving students through honors programs, scholarship offers, and personal contact with recruitment staff has the potential to increase yield. Furthermore, the Full Ride merit award (Gini: 0.010574) indicates that financial aid, and full-ride scholarships in particular, plays a vital role in applicants' decisions. The admissions team may therefore work on enhancing financial aid packages and making them more visible to top candidates early in the process.
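
The kappa values quoted above are Cohen's kappa, which compares the observed agreement between predicted and actual decisions with the agreement expected by chance. A minimal sketch in plain Python, using made-up labels rather than the project's data (`cohen_kappa` is a hypothetical helper, not part of the notebook):

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(y_true)
    # Observed agreement: share of cases where both labels match
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label counts, summed over labels
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    # Undefined (division by zero) when both sequences contain one identical class
    return (p_observed - p_chance) / (1 - p_chance)

# Illustrative labels only (1 = enrolled, 0 = did not enroll)
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 3))
```

A kappa of 0 means the predictions agree with the labels no more often than chance would, and 1 means perfect agreement.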


# Refinement

### Reading in data

In [390]:
#import train
df_train=pd.read_csv('cleaneddftrain.csv', index_col='ID')
train = df_train.drop(columns='Unnamed: 0')

In [391]:
#import test
df_test=pd.read_csv('cleaneddftest.csv', index_col='ID')
test = df_test.drop(columns='Unnamed: 0')

In [392]:
missing_in_train = set(test.columns) - set(train.columns)
missing_in_test = set(train.columns) - set(test.columns)

print("Missing in train:", missing_in_train)
print("Missing in test:", missing_in_test)

Missing in train: set()
Missing in test: set()
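
The same two-way column comparison is repeated later in the notebook, so it could be wrapped in a small helper. A sketch, where `column_diff` and the toy frames are hypothetical names used only for illustration:

```python
import pandas as pd

def column_diff(train_df, test_df):
    """Return (columns only in test, columns only in train) as sorted lists."""
    train_cols, test_cols = set(train_df.columns), set(test_df.columns)
    return sorted(test_cols - train_cols), sorted(train_cols - test_cols)

# Toy frames standing in for train and test
toy_train = pd.DataFrame({"Legacy": [0, 1], "Sex": ["F", "M"]})
toy_test = pd.DataFrame({"Legacy": [1, 0], "Sex": ["M", "F"], "Extra": [1, 2]})

missing_in_train, missing_in_test = column_diff(toy_train, toy_test)
print("Missing in train:", missing_in_train)
print("Missing in test:", missing_in_test)
```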


In [393]:
# Columns that are not predictors: the target and the train/test indicator
columns_to_drop = ['Decision', 'train-test']

# Drop them from both the train and test sets
# (no constant is added here; sm.add_constant is applied when the logistic model is fit)
x_train = train.drop(columns=columns_to_drop)
x_test = test.drop(columns=columns_to_drop)

# Extract target variable separately
y_train = train['Decision']
y_test = test['Decision']

In [394]:
x_train.columns

Index(['Entry Term (Application)', 'Permanent Country', 'Sex', 'Ethnicity',
       'Race', 'Religion', 'Application Source', 'Decision Plan', 'Legacy',
       'Athlete', 'Sport 1 Sport', 'Sport 1 Rating', 'Sport 2 Sport',
       'Sport 3 Sport', 'Academic Interest 1', 'Academic Interest 2',
       'First_Source Origin First Source Summary', 'Total Event Participation',
       'Count of Campus Visits', 'School 1 Class Rank (Numeric)',
       'School 1 Class Size (Numeric)', 'School 1 GPA Recalculated',
       'ACT Composite',
       'SAT R Evidence-Based Reading and Writing Section + Math Section',
       'Permanent Geomarket', 'Citizenship Status', 'Academic Index',
       'Intend to Apply for Financial Aid?', 'Merit Award',
       'Submit_FirstSource', 'Submit_Inquiry', 'School 1 Top Percent in Class',
       'ACT Composite Grouped'],
      dtype='object')

### Creating a new variable from Submit_Inquiry and Submit_FirstSource, then treating it as ordinal

In [395]:
x_train['Days_to_Submit_since_Inquiry'] = x_train['Submit_Inquiry'] - x_train['Submit_FirstSource']

In [396]:
x_test['Days_to_Submit_since_Inquiry'] = x_test['Submit_Inquiry'] - x_test['Submit_FirstSource']

In [397]:
ordinal_cols = ['Legacy','Sport 1 Rating','Count of Campus Visits','Total Event Participation',
               'Days_to_Submit_since_Inquiry']

def encodetrain(x_train, ordinal_columns=None):
    # If ordinal columns are not provided, you could define a default list or handle it dynamically
    if ordinal_columns is None:
        ordinal_columns = []  # Adjust this list if you have predefined ordinal columns

    # Loop through columns
    for col in x_train.columns:
        if x_train[col].dtype == 'object' or x_train[col].dtype == 'category':
            # Check if it's an ordinal column
            if col in ordinal_columns:
                le = LabelEncoder()
                x_train[col] = le.fit_transform(x_train[col])
            else:
                # Non-ordinal categorical, apply one-hot encoding
                x_train = pd.get_dummies(x_train, columns=[col], drop_first=True)

    return x_train

x_train = encodetrain(x_train, ordinal_columns=ordinal_cols)

In [398]:
def encodetest(x_test, ordinal_columns=None):
    # If ordinal columns are not provided, you could define a default list or handle it dynamically
    if ordinal_columns is None:
        ordinal_columns = []  # Adjust this list if you have predefined ordinal columns

    # Loop through columns
    for col in x_test.columns:
        if x_test[col].dtype == 'object' or x_test[col].dtype == 'category':
            # Check if it's an ordinal column
            if col in ordinal_columns:
                le = LabelEncoder()
                x_test[col] = le.fit_transform(x_test[col])
            else:
                # Non-ordinal categorical, apply one-hot encoding
                x_test = pd.get_dummies(x_test, columns=[col], drop_first=True)

    return x_test

x_test = encodetest(x_test, ordinal_columns=ordinal_cols)
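
`encodetrain` and `encodetest` each fit their own `LabelEncoder`, so a category can receive different integer codes in the two sets if their observed values differ. One way to keep the codes aligned is to learn the mapping from the training column and reuse it on the test column. The sketch below does this with a plain dictionary; the helper name `encode_like_train` and the toy values are assumptions for illustration, not the notebook's method:

```python
import pandas as pd

def encode_like_train(train_col, test_col):
    """Integer-encode both columns with a mapping learned from train only."""
    # Sorted unique training values, the same ordering LabelEncoder uses
    categories = sorted(train_col.dropna().unique())
    mapping = {category: code for code, category in enumerate(categories)}
    # Missing values and values never seen in training are encoded as -1
    train_codes = train_col.map(mapping).fillna(-1).astype(int)
    test_codes = test_col.map(mapping).fillna(-1).astype(int)
    return train_codes, test_codes

toy_train = pd.Series(["High", "Low", "Medium", "Low"])
toy_test = pd.Series(["Medium", "Low", "Unknown"])

train_codes, test_codes = encode_like_train(toy_train, toy_test)
print(train_codes.tolist())
print(test_codes.tolist())
```

How to treat the `-1` code for unseen values is a separate modeling decision that this sketch leaves open.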

### Removing variables with correlations lower than 0.015

In [399]:
# Dictionary to store correlations
correlations = {}

# Loop to calculate correlations and store them
for col in x_train.columns:
    correlation = x_train[col].corr(y_train)  # Calculate correlation with y_train
    correlations[col] = correlation

# Flag columns whose absolute correlation with the target is below 0.015
columns_to_remove = [col for col, corr in correlations.items() if abs(corr) < 0.015]

# Remove columns from x_train that have low correlation
x_train = x_train.drop(columns=columns_to_remove)
x_test = x_test.drop(columns=columns_to_remove)

# Print the columns that were removed
print(f'Columns removed due to low correlation: {columns_to_remove}')

Columns removed due to low correlation: ['School 1 Class Size (Numeric)', 'Submit_FirstSource', 'Entry Term (Application)_Fall 2019', 'Entry Term (Application)_Fall 2020', 'Entry Term (Application)_Fall 2021', 'Permanent Country_Belgium', 'Permanent Country_Belize', 'Permanent Country_Bolivia', 'Permanent Country_Brazil', 'Permanent Country_Cambodia', 'Permanent Country_Canada', 'Permanent Country_China', 'Permanent Country_Colombia', 'Permanent Country_Costa Rica', 'Permanent Country_Ecuador', 'Permanent Country_Ethiopia', 'Permanent Country_France', 'Permanent Country_Germany', 'Permanent Country_Ghana', 'Permanent Country_Greece', 'Permanent Country_Guatemala', 'Permanent Country_Honduras', 'Permanent Country_Hong Kong S.A.R.', 'Permanent Country_Indonesia', 'Permanent Country_Japan', 'Permanent Country_Jordan', 'Permanent Country_Kazakhstan', 'Permanent Country_Kuwait', 'Permanent Country_Lebanon', 'Permanent Country_Mexico', 'Permanent Country_Morocco', 'Permanent Country_Nepal', 
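
The loop above can also be written with `DataFrame.corrwith`, which computes each column's Pearson correlation with the target in one call. A sketch on toy data (the frame, the target, and the 0.5 threshold are made up so the effect is visible on six rows; the notebook's threshold is 0.015):

```python
import pandas as pd

toy_x = pd.DataFrame({
    "visits": [0, 1, 2, 3, 4, 5],     # rises with the target
    "noise": [1, -1, 1, -1, 1, -1],   # only weakly related to the target
})
toy_y = pd.Series([0, 0, 0, 1, 1, 1])

threshold = 0.5  # illustrative value only
correlations = toy_x.corrwith(toy_y)
columns_to_remove = correlations[correlations.abs() < threshold].index.tolist()
toy_x_filtered = toy_x.drop(columns=columns_to_remove)
print(columns_to_remove)
```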

### Binning variables

In [400]:
x_train.columns

Index(['Legacy', 'Sport 1 Rating', 'Total Event Participation',
       'Count of Campus Visits', 'School 1 Class Rank (Numeric)',
       'School 1 GPA Recalculated', 'ACT Composite',
       'SAT R Evidence-Based Reading and Writing Section + Math Section',
       'Academic Index', 'Intend to Apply for Financial Aid?',
       'Submit_Inquiry', 'School 1 Top Percent in Class',
       'Days_to_Submit_since_Inquiry', 'Entry Term (Application)_Fall 2018',
       'Permanent Country_Cyprus', 'Permanent Country_El Salvador',
       'Permanent Country_India', 'Permanent Country_Jamaica',
       'Permanent Country_United States', 'Race_Asian',
       'Race_Black or African American', 'Race_White', 'Religion_Christian',
       'Religion_Hindu', 'Religion_Lutheran', 'Religion_Methodist',
       'Religion_Not specified', 'Religion_OtherRelgiousAffiliation',
       'Application Source_Coalition', 'Application Source_CommonApp',
       'Application Source_Select Scholar', 'Decision Plan_Early Action 

In [401]:
x_train['School 1 GPA Recalculated'].describe()

count    10000.000000
mean         3.701429
std          0.292452
min          2.330000
25%          3.500000
50%          3.790000
75%          3.960000
max          4.000000
Name: School 1 GPA Recalculated, dtype: float64

In [402]:
# Create bin edges from 0 to 4 in steps of 0.25
bins = np.arange(0, 4.25, 0.25)  # the stop value 4.25 makes 4.0 the last edge

# Define labels for each 0.25 interval
labels = [f"{i}-{i+0.25}" for i in bins[:-1]]  # Generate labels like '0.0-0.25', '0.25-0.5', etc.

# Apply pd.cut with these bins and labels.
# Note: with right=False the bins are [low, high), so a GPA of exactly 4.0 lies outside the last bin and becomes NaN
x_train['GPA_category'] = pd.cut(x_train['School 1 GPA Recalculated'], bins=bins, labels=labels, right=False)
x_test['GPA_category'] = pd.cut(x_test['School 1 GPA Recalculated'], bins=bins, labels=labels, right=False)
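
With `right=False` each bin is closed on the left and open on the right, so a value equal to the last edge is not assigned to any bin. A short check on made-up GPAs shows the effect, along with one possible alternative (`right=True` with `include_lowest=True`); this is an illustration, not a change to the notebook's binning:

```python
import numpy as np
import pandas as pd

gpas = pd.Series([2.33, 3.75, 4.0])  # made-up values
bins = np.arange(0, 4.25, 0.25)
labels = [f"{float(lo)}-{float(lo) + 0.25}" for lo in bins[:-1]]

# Bins are [low, high): 4.0 equals the last edge and is left unassigned (NaN)
left_closed = pd.cut(gpas, bins=bins, labels=labels, right=False)

# Bins are (low, high], with the very first edge included: 4.0 lands in '3.75-4.0'
right_closed = pd.cut(gpas, bins=bins, labels=labels, right=True, include_lowest=True)

print(left_closed.tolist())
print(right_closed.tolist())
```

Note that switching to right-closed bins also moves values that sit exactly on an edge (such as 3.75) into the lower bin.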

In [403]:
x_train['School 1 Class Rank (Numeric)'].describe()

count    4643.000000
mean       62.093905
std        78.059632
min         1.000000
25%         9.000000
50%        33.000000
75%        83.000000
max       655.000000
Name: School 1 Class Rank (Numeric), dtype: float64

In [404]:
# Create bins from 0 to 700 with a step size of 100
bins = np.arange(0, 701, 100)  # 701 ensures 700 is included

# Define labels for each 100-step interval
labels = [f"{i}-{i+100}" for i in bins[:-1]]  # Generate labels like '0-100', '100-200', etc.

# Apply pd.cut with these bins and labels
x_train['Class_Rank_Category'] = pd.cut(x_train['School 1 Class Rank (Numeric)'], bins=bins, labels=labels, right=False)
x_test['Class_Rank_Category'] = pd.cut(x_test['School 1 Class Rank (Numeric)'], bins=bins, labels=labels, right=False)

In [405]:
x_train['School 1 Top Percent in Class'].describe()

count    10000.000000
mean        13.371305
std         10.331136
min          0.101729
25%          4.614411
50%         10.952852
75%         16.109453
max        100.000000
Name: School 1 Top Percent in Class, dtype: float64

In [406]:
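# Create bin edges from 0 to 100 with a step size of 5
bins = np.arange(0, 105, 5)  # the stop value 105 makes 100 the last edge

# Define labels for each 5-step interval
labels = [f"{i}-{i+5}" for i in bins[:-1]]  # Generate labels like '0-5', '5-10', etc.

# Apply pd.cut with these bins and labels.
# Note: with right=False the bins are [low, high), so a value of exactly 100 lies outside the last bin and becomes NaN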
x_train['Top_Percent_Category'] = pd.cut(x_train['School 1 Top Percent in Class'], bins=bins, labels=labels, right=False)
x_test['Top_Percent_Category'] = pd.cut(x_test['School 1 Top Percent in Class'], bins=bins, labels=labels, right=False)

### Dropping similar variables

In [407]:
# List of columns to drop
columns_to_drop = ['School 1 Class Rank (Numeric)', 
                   'School 1 GPA Recalculated','ACT Composite',
                   'SAT R Evidence-Based Reading and Writing Section + Math Section',
                   'School 1 Top Percent in Class','Submit_Inquiry']

# Drop columns in place
x_train.drop(columns=columns_to_drop, inplace=True)
x_test.drop(columns=columns_to_drop, inplace=True)

In [408]:
missing_in_train = set(x_test.columns) - set(x_train.columns)
missing_in_test = set(x_train.columns) - set(x_test.columns)

print("Missing in train:", missing_in_train)
print("Missing in test:", missing_in_test)

Missing in train: {'Permanent Geomarket_Unknown'}
Missing in test: set()


In [409]:
x_test = x_test.drop(columns='Permanent Geomarket_Unknown')
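
Dropping the extra column by name works for this one mismatch. A more general option is to reindex the test frame to the training columns, which drops extras and fills any missing dummy columns with 0. A sketch with toy frames (the values are invented for illustration):

```python
import pandas as pd

toy_train = pd.DataFrame({"Legacy": [0, 1], "Permanent Geomarket_South": [1, 0]})
toy_test = pd.DataFrame({
    "Legacy": [1, 0],
    "Permanent Geomarket_Unknown": [1, 0],  # present only in test
})

# Keep exactly the training columns, in the training order;
# columns absent from test are created and filled with 0
toy_test_aligned = toy_test.reindex(columns=toy_train.columns, fill_value=0)
print(list(toy_test_aligned.columns))
```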

### Ensuring Correct Variable Coding

In [410]:
ordinal_cols = ['Legacy', 'Sport 1 Rating', 'Count of Campus Visits', 'Total Event Participation']

def dummytrain(x_train, ordinal_columns=None):
    # Loop through columns
    for col in x_train.columns:
        if x_train[col].dtype == 'object' or x_train[col].dtype == 'category':
            # Check if it's an ordinal column
            if col in ordinal_columns:
                le = LabelEncoder()
                x_train[col] = le.fit_transform(x_train[col])
            else:
                # Non-ordinal categorical, apply one-hot encoding
                x_train = pd.get_dummies(x_train, columns=[col], dtype=int, drop_first=True)
        
        # Handle boolean columns (True/False -> 1/0)
        elif x_train[col].dtype == 'bool':
            x_train[col] = x_train[col].astype(int)

    return x_train

x_train = dummytrain(x_train, ordinal_columns=ordinal_cols)

In [411]:
def dummytest(x_test, ordinal_columns=None):
    # Loop through columns
    for col in x_test.columns:
        if x_test[col].dtype == 'object' or x_test[col].dtype == 'category':
            # Check if it's an ordinal column
            if col in ordinal_columns:
                le = LabelEncoder()
                x_test[col] = le.fit_transform(x_test[col])
            else:
                # Non-ordinal categorical, apply one-hot encoding
                x_test = pd.get_dummies(x_test, columns=[col], dtype=int, drop_first=True)
        
        # Handle boolean columns (True/False -> 1/0)
        elif x_test[col].dtype == 'bool':
            x_test[col] = x_test[col].astype(int)

    return x_test

x_test = dummytest(x_test, ordinal_columns=ordinal_cols)

In [412]:
missing_in_train = set(x_test.columns) - set(x_train.columns)
missing_in_test = set(x_train.columns) - set(x_test.columns)

print("Missing in train:", missing_in_train)
print("Missing in test:", missing_in_test)

Missing in train: set()
Missing in test: set()


## Logistic Model

In [413]:
# Passing VIF scores

In [414]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(x_train.shape[1])]
vif['variable'] = x_train.columns
vif[:]

Unnamed: 0,VIF,variable
0,18.074502,Legacy
1,18.400583,Sport 1 Rating
2,1.654042,Total Event Participation
3,1.492536,Count of Campus Visits
4,41.411248,Academic Index
5,3.335852,Intend to Apply for Financial Aid?
6,1.553281,Days_to_Submit_since_Inquiry
7,1.430814,Entry Term (Application)_Fall 2018
8,1.012958,Permanent Country_Cyprus
9,1.026283,Permanent Country_El Salvador
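
Each VIF equals 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below computes it with NumPy least squares on synthetic data. It adds an intercept to each auxiliary regression, which is an assumption of this sketch; the cell above passes the design matrix without a constant, so the two sets of numbers can differ:

```python
import numpy as np

def vif_with_intercept(X):
    """VIF for each column of X, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        # R^2 = 1 - SSE / SST; the common 1/n factor cancels in the ratio
        r_squared = 1 - residuals.var() / target.var()
        vifs.append(1 / (1 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a, so strongly collinear
vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

In this synthetic example the two collinear columns get large VIFs while the independent column stays close to 1.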


In [415]:
# List of columns to drop: the dummy columns for GPA bins below 2.25. It is safe to assume no one applies
# with a GPA that low; no such GPA exists in train or test
columns_to_drop = ['GPA_category_0.25-0.5','GPA_category_0.5-0.75','GPA_category_0.75-1.0',
                   'GPA_category_1.0-1.25','GPA_category_1.25-1.5','GPA_category_1.5-1.75',
                   'GPA_category_1.75-2.0','GPA_category_2.0-2.25']

# Drop columns in place
x_train.drop(columns=columns_to_drop, inplace=True)
x_test.drop(columns=columns_to_drop, inplace=True)

In [416]:
x_train = x_train.drop(columns='Citizenship Status_US Citizen')

In [417]:
x_test = x_test.drop(columns='Citizenship Status_US Citizen')

In [418]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(x_train.shape[1])]
vif['variable'] = x_train.columns
vif[:]

Unnamed: 0,VIF,variable
0,17.201859,Legacy
1,18.052735,Sport 1 Rating
2,1.649569,Total Event Participation
3,1.486523,Count of Campus Visits
4,39.57184,Academic Index
5,3.297854,Intend to Apply for Financial Aid?
6,1.551144,Days_to_Submit_since_Inquiry
7,1.430463,Entry Term (Application)_Fall 2018
8,1.012938,Permanent Country_Cyprus
9,1.024949,Permanent Country_El Salvador


In [419]:
# List of columns to drop
columns_to_drop = ['Sport 1 Sport_No Sport','Sport 2 Sport_No 2ndSport','Sport 3 Sport_No 3rdSport']

# Drop columns in place
x_train.drop(columns=columns_to_drop, inplace=True)
x_test.drop(columns=columns_to_drop, inplace=True)

In [420]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(x_train.shape[1])]
vif['variable'] = x_train.columns
vif[:]

Unnamed: 0,VIF,variable
0,16.950847,Legacy
1,17.945078,Sport 1 Rating
2,1.648165,Total Event Participation
3,1.483927,Count of Campus Visits
4,38.897722,Academic Index
5,3.286561,Intend to Apply for Financial Aid?
6,1.551096,Days_to_Submit_since_Inquiry
7,1.425203,Entry Term (Application)_Fall 2018
8,1.012932,Permanent Country_Cyprus
9,1.024626,Permanent Country_El Salvador


In [421]:
with open('x_train.pkl', 'wb') as f:
    pk.dump(x_train, f)

In [422]:
with open('x_test.pkl', 'wb') as f:
    pk.dump(x_test, f)

In [378]:
# List of columns to drop
columns_to_drop = ['Athlete_Non-Athlete','Permanent Country_United States','Academic Index']

# Drop columns in place
x_train.drop(columns=columns_to_drop, inplace=True)
x_test.drop(columns=columns_to_drop, inplace=True)

In [379]:
x_test.head()

Unnamed: 0_level_0,Legacy,Sport 1 Rating,Total Event Participation,Count of Campus Visits,Intend to Apply for Financial Aid?,Days_to_Submit_since_Inquiry,Entry Term (Application)_Fall 2018,Permanent Country_Cyprus,Permanent Country_El Salvador,Permanent Country_India,Permanent Country_Jamaica,Race_Asian,Race_Black or African American,Race_White,Religion_Christian,Religion_Hindu,Religion_Lutheran,Religion_Methodist,Religion_Not specified,Religion_OtherRelgiousAffiliation,Application Source_Coalition,Application Source_CommonApp,Application Source_Select Scholar,Decision Plan_Early Action I,Decision Plan_Early Decision I,Decision Plan_Early Decision II,Decision Plan_Regular Decision,"Athlete_Athlete, Opt Out",Sport 1 Sport_Cross Country,Sport 1 Sport_Diving,Sport 1 Sport_Golf,Sport 1 Sport_Soccer,Sport 1 Sport_Sport,Sport 1 Sport_Swimming,Sport 1 Sport_Tennis,Sport 1 Sport_Track,Academic Interest 1_Business - International Business,Academic Interest 1_Business - Management,Academic Interest 1_Communication,Academic Interest 1_Engineering Science,Academic Interest 1_Environmental Studies,Academic Interest 1_History,Academic Interest 1_Mathematics,Academic Interest 1_Political Science,Academic Interest 1_Pre-Medical,Academic Interest 1_Psychology,Academic Interest 1_Undecided,Academic Interest 2_Biochemistry & Molecular Biology,Academic Interest 2_Business,Academic Interest 2_Others,Academic Interest 2_Pre-Medical,Academic Interest 2_Psychology,First_Source Origin First Source Summary_ATH,First_Source Origin First Source Summary_CAP,First_Source Origin First Source Summary_CBINQ,First_Source Origin First Source Summary_CF,First_Source Origin First Source Summary_OAPP,First_Source Origin First Source Summary_Other Sources,First_Source Origin First Source Summary_SRCH,First_Source Origin First Source Summary_TIF,First_Source Origin First Source Summary_VST,First_Source Origin First Source Summary_WEBTU,First_Source Origin First Source Summary_YUVST,Permanent 
Geomarket_Northeast,Permanent Geomarket_South,Permanent Geomarket_West,Citizenship Status_Permanent Resident,Merit Award_Full Ride,Merit Award_International Student Scholarship,Merit Award_No Merit Scholarship,ACT Composite Grouped_23.0,ACT Composite Grouped_24.0,ACT Composite Grouped_25.0,ACT Composite Grouped_26.0,ACT Composite Grouped_27.0,ACT Composite Grouped_28.0,ACT Composite Grouped_29.0,ACT Composite Grouped_33.0,ACT Composite Grouped_34.0,ACT Composite Grouped_35.0,ACT Composite Grouped_36.0,ACT Composite Grouped_ACTBelow21,GPA_category_2.25-2.5,GPA_category_2.5-2.75,GPA_category_2.75-3.0,GPA_category_3.0-3.25,GPA_category_3.25-3.5,GPA_category_3.5-3.75,GPA_category_3.75-4.0,Class_Rank_Category_100-200,Class_Rank_Category_200-300,Class_Rank_Category_300-400,Class_Rank_Category_400-500,Class_Rank_Category_500-600,Class_Rank_Category_600-700,Top_Percent_Category_5-10,Top_Percent_Category_10-15,Top_Percent_Category_15-20,Top_Percent_Category_20-25,Top_Percent_Category_25-30,Top_Percent_Category_30-35,Top_Percent_Category_35-40,Top_Percent_Category_40-45,Top_Percent_Category_45-50,Top_Percent_Category_50-55,Top_Percent_Category_55-60,Top_Percent_Category_60-65,Top_Percent_Category_65-70,Top_Percent_Category_70-75,Top_Percent_Category_75-80,Top_Percent_Category_80-85,Top_Percent_Category_85-90,Top_Percent_Category_90-95,Top_Percent_Category_95-100
ID
10001,2,2,1,0,1.0,-30.0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10002,2,2,2,1,0.0,-9.0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10003,2,2,0,0,1.0,0.0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10004,2,2,0,0,1.0,6096.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10005,2,2,0,1,0.0,0.0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [380]:
# The logistic model would not run because these dummy columns take a single value in the training data,
# so this step identifies them and removes them from both sets
constant_cols = [col for col in x_train.columns if x_train[col].nunique() == 1]
print(f"Constant columns: {constant_cols}")
x_train = x_train.drop(columns=constant_cols)
x_test = x_test.drop(columns=constant_cols)

Constant columns: ['Top_Percent_Category_70-75', 'Top_Percent_Category_85-90']


In [381]:
# Define the logistic model

model_logistic = sm.Logit(endog = y_train, exog = sm.add_constant(x_train)).fit()

model_logistic.summary() # output model summary

         Current function value: 0.348539
         Iterations: 35


0,1,2,3
Dep. Variable:,Decision,No. Observations:,10000.0
Model:,Logit,Df Residuals:,9887.0
Method:,MLE,Df Model:,112.0
Date:,"Sat, 14 Dec 2024",Pseudo R-squ.:,0.3372
Time:,20:47:41,Log-Likelihood:,-3485.4
converged:,False,LL-Null:,-5258.9
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.6565,0.278,-5.949,0.000,-2.202,-1.111
Legacy,-0.4932,0.057,-8.581,0.000,-0.606,-0.381
Sport 1 Rating,-0.1025,0.065,-1.584,0.113,-0.229,0.024
Total Event Participation,1.6454,0.056,29.599,0.000,1.536,1.754
Count of Campus Visits,0.7539,0.041,18.177,0.000,0.673,0.835
Intend to Apply for Financial Aid?,0.1664,0.069,2.427,0.015,0.032,0.301
Days_to_Submit_since_Inquiry,-4.589e-06,1.05e-05,-0.436,0.663,-2.52e-05,1.61e-05
Entry Term (Application)_Fall 2018,0.0260,0.082,0.315,0.752,-0.135,0.187
Permanent Country_Cyprus,2.2716,1.241,1.830,0.067,-0.161,4.704


In [382]:
# Predictions using x_test

y_test_pred_prob = model_logistic.predict(sm.add_constant(x_test))

In [383]:
y_train_pred_prob = model_logistic.predict(sm.add_constant(x_train))

In [384]:
# Selecting the optimal cut-off probability
# The cut-off probability begins at 0.1, increments by 0.01, and ends at 0.9.
Kappa_test_logis = []
cut_off_prob = []

for prob in np.arange(0.1, 0.91, 0.01):
    cut_off_prob.append(round(prob, 2))
    y_test_pred_class = np.where(y_test_pred_prob >= prob, 1, 0)
    Kappa = cohen_kappa_score(y1 = y_test_pred_class, y2 = y_test) 
    Kappa_test_logis.append(Kappa)
    
print(Kappa_test_logis.index(max(Kappa_test_logis))) # Get the index of the largest test Kappa
print(cut_off_prob[Kappa_test_logis.index(max(Kappa_test_logis))]) # Get the corresponding cut-off probability
print(max(Kappa_test_logis)) # Get the corresponding test Kappa

# Conclusion: the optimal cut-off probability is 0.27, giving us the best test Kappa of 0.5028468539607458.

17
0.27
0.5028468539607458


In [385]:
y_test_pred_class = np.where(y_test_pred_prob >= 0.27, 1, 0)
print(y_test_pred_class)

[0 1 0 ... 1 0 1]


In [386]:
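# Classify the training predictions with the same 0.27 cut-off
y_train_pred_class = np.where(y_train_pred_prob >= 0.27, 1, 0)
print(y_test_pred_class)  # prints the test predictions again, not y_train_pred_class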

[0 1 0 ... 1 0 1]


In [388]:
Kappa_train_logis = cohen_kappa_score(y1 = y_train_pred_class, y2 = y_train) 
print(Kappa_train_logis)

0.5284355534384654


In [387]:
Kappa_test_logis = cohen_kappa_score(y1 = y_test_pred_class, y2 = y_test) 
print(Kappa_test_logis)

0.5028468539607458


In [None]:
# Training Kappa: 0.528
# Test Kappa: 0.503

## KNN Model

In [269]:
x_train = pd.read_pickle("x_train.pkl")
x_test = pd.read_pickle("x_test.pkl")

In [270]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() 
x_train = scaler.fit_transform(x_train) # fit to data, then transform it.
x_test = scaler.fit_transform(x_test) 
print(np.mean(x_train), np.std(x_train)) 
print(np.mean(x_test), np.std(x_test)) 

4.979872165156258e-18 0.9914161502144208
3.0819694187832462e-18 0.9827573280377847


In [271]:
# Choosing the Optimal K value
from sklearn.neighbors import KNeighborsClassifier
# Splitting training data into sub-training and sub-test sets using a 75-25 split.
x_subtrain, x_subtest, y_subtrain, y_subtest = train_test_split(x_train, y_train, train_size = 0.75, test_size = 0.25,
                                                       random_state = 123, stratify = y_train)

# Defining the Loop
Kappa_subtest_knn = []
k = [] 
for i in np.arange(1,101,1): # K is from 1 to 100
    k.append(i)
    model_knn = KNeighborsClassifier(n_neighbors = i) 
    model_knn.fit(x_subtrain,y_subtrain)
    y_subtest_pred_class = model_knn.predict(x_subtest)
    Kappa = cohen_kappa_score(y1 = y_subtest_pred_class, y2 = y_subtest) 
    Kappa_subtest_knn.append(Kappa)
    
# Output the best K value    
print(k[Kappa_subtest_knn.index(max(Kappa_subtest_knn))]) # Get the K value with the largest subtest Kappa
print((max(Kappa_subtest_knn))) # Get the corresponding subtest Kappa
# Conclusion: the optimal K is 1, giving us the best subtest Kappa of 0.25526566748343604.

1
0.25526566748343604
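
The loop above picks K from a single 75-25 split. As a hedged alternative, K can be chosen by cross-validation with Kappa as the scoring metric, which averages over several splits. The sketch below runs on synthetic data from `make_classification`. The pipeline, grid, and variable names are assumptions for illustration, not the notebook's own setup.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the training data
X_demo, y_demo = make_classification(n_samples = 400, n_features = 10,
                                     weights = [0.7, 0.3], random_state = 123)

# Scaling inside the pipeline is refit on each training fold
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_grid = {'kneighborsclassifier__n_neighbors': list(range(1, 32, 2))}
kf_knn = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 123)

cv_knn = GridSearchCV(estimator = knn_pipe,
                      param_grid = knn_grid,
                      scoring = make_scorer(cohen_kappa_score),
                      cv = kf_knn)
cv_knn.fit(X_demo, y_demo)
print(cv_knn.best_params_, cv_knn.best_score_)
```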


In [272]:
# 2.4 Use the best K value to build a knn model

# Define and fit the knn model (K = 1, the best value found above)
model_1nn = KNeighborsClassifier(n_neighbors = 1).fit(x_train, y_train)

# Classification performance for training data
y_train_pred_class = model_1nn.predict(x_train)
Kappa_train_1nn = cohen_kappa_score(y1 = y_train_pred_class, y2 = y_train)
print(Kappa_train_1nn) # =0.9991

# Classification performance for test data
y_test_pred_class = model_1nn.predict(x_test)
Kappa_test_1nn = cohen_kappa_score(y1 = y_test_pred_class, y2 = y_test)
print(Kappa_test_1nn) # =0.2391

0.9991234389616656
0.23908810969537264


In [None]:
# Training Kappa: 0.999
# Test Kappa: 0.239

## Simple Classification Tree

In [438]:
# Simple Tree
simple_class_tree = DecisionTreeClassifier(criterion='gini',
                                    max_features = x_train.shape[1],
                                    random_state = 123,
                                    min_samples_split = 2, 
                                    min_samples_leaf = 1,
                                    max_depth = None,
                                    ccp_alpha = 0).fit(x_train, y_train)

In [439]:
# 3.4 Training Kappa
simple_class_tree_pred_train = simple_class_tree.predict(x_train)
print(simple_class_tree_pred_train)
Kappa_train_simple_class_tree = cohen_kappa_score(y1 = simple_class_tree_pred_train, y2 = y_train)
print(Kappa_train_simple_class_tree)

[1 0 1 ... 0 0 0]
0.9991231511935198


In [440]:
# 3.5 Test Kappa
simple_class_tree_pred_test = simple_class_tree.predict(x_test)
print(simple_class_tree_pred_test)
Kappa_test_simple_class_tree = cohen_kappa_score(y1 = simple_class_tree_pred_test, y2 = y_test)
print(Kappa_test_simple_class_tree)

[0 0 0 ... 0 0 1]
0.3723508689147814


In [432]:
# Training Kappa: 0.999
# Test Kappa: 0.372

## Pruned Decision Tree

In [441]:
# Define cross-validation data splitting strategy
from sklearn.model_selection import KFold
kf_simple_class_tree = KFold(n_splits = 10, shuffle = True, random_state = 42)


In [442]:
# Define a search grid for the tuning parameter

unpruned_tree_leaf_nodes = simple_class_tree.tree_.n_leaves # The total number of leaves in the fully grown tree
print(unpruned_tree_leaf_nodes)

1676


In [443]:
param_grid_class_tree = {'max_leaf_nodes': list(range(2, 1676, 150))} 

In [444]:
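# Cross-validation for each max_leaf_nodes value in the grid
from sklearn.model_selection import GridSearchCV

# Assumption: the estimator is not defined in the cells shown here, so it is taken to be
# an unfitted tree with the same settings as simple_class_tree.
simple_class_tree_estimator = DecisionTreeClassifier(criterion = 'gini',
                                                     max_features = x_train.shape[1],
                                                     random_state = 123)
cv_class_tree_pruning = GridSearchCV(estimator = simple_class_tree_estimator,
                              param_grid = param_grid_class_tree, 
                              scoring = 'balanced_accuracy',
                              cv = kf_simple_class_tree) 
cv_class_tree_pruning.fit(x_train, y_train)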

print(cv_class_tree_pruning.best_params_, cv_class_tree_pruning.best_score_) 

{'max_leaf_nodes': 152} 0.6988789942872344


In [445]:
# Fit the pruned classification tree
simple_class_tree_pruned = DecisionTreeClassifier(criterion = 'gini',  
                                    min_samples_split = 2, 
                                    min_samples_leaf = 1,
                                    max_depth = None,
                                    max_leaf_nodes = 152,
                                    max_features = x_train.shape[1],
                                    random_state=123,
                                    ccp_alpha = 0.0).fit(x_train, y_train)

In [446]:
# Training Kappa
simple_class_tree_pruned_pred_train = simple_class_tree_pruned.predict(x_train)
print(simple_class_tree_pruned_pred_train)
Kappa_train_simple_class_tree_pruned = cohen_kappa_score(y1 = simple_class_tree_pruned_pred_train, y2 = y_train)
print(Kappa_train_simple_class_tree_pruned)

[1 0 1 ... 0 0 0]
0.6080964341334703


In [447]:
# Test Kappa
simple_class_tree_pruned_pred_test = simple_class_tree_pruned.predict(x_test)
print(simple_class_tree_pruned_pred_test)
Kappa_test_simple_class_tree_pruned = cohen_kappa_score(y1 = simple_class_tree_pruned_pred_test, y2 = y_test)
print(Kappa_test_simple_class_tree_pruned)

[0 1 0 ... 1 0 1]
0.4571679199540757


In [None]:
# Training Kappa: 0.608
# Test Kappa: 0.457

## Random Forest

In [448]:
# Defining cross-validation data splitting strategy
kf_rf_class_tree = KFold(n_splits = 10, shuffle = True, random_state = 42)

In [449]:
# Defining the RF estimator
import math

rf_class_tree_estimator = RandomForestClassifier(n_estimators=500, 
                                        criterion='gini', 
                                        max_depth = None,
                                        min_samples_split = 2,
                                        min_samples_leaf = 1, 
                                        max_leaf_nodes = None,
                                        max_features= x_train.shape[1], 
                                        bootstrap=True,
                                        max_samples = math.ceil(x_train.shape[0] * 0.9) -1,
                                        random_state = 42)

In [450]:
unpruned_tree_leaf_nodes = simple_class_tree.tree_.n_leaves # The total number of leaves in the fully grown tree
print(unpruned_tree_leaf_nodes)

1676


In [452]:
# Defining a search grid for the tuning parameters
param_grid_rf_class_tree = {'max_leaf_nodes': list(range(2, 1676,250)),
                            'max_features': list(range(1, x_train.shape[1],5))} 

In [453]:
# Executing cross-validation for each combination in the grid
cv_rf_class_tree = GridSearchCV(estimator = rf_class_tree_estimator,
                              param_grid = param_grid_rf_class_tree, 
                              scoring = 'balanced_accuracy',
                              cv = kf_rf_class_tree,
                              n_jobs = -1) 
cv_rf_class_tree.fit(x_train, y_train)

# Outputting the best max_features and max_leaf_nodes combination and the corresponding cross-validation score
print(cv_rf_class_tree.best_params_, cv_rf_class_tree.best_score_) 

{'max_features': 41, 'max_leaf_nodes': 752} 0.7117356215565324


In [454]:
# Fitting the optimal RF
rf_class_tree_best = RandomForestClassifier(n_estimators=500, 
                                        criterion='gini', 
                                        max_depth = None,
                                        min_samples_split = 2,
                                        min_samples_leaf = 1, 
                                        max_leaf_nodes = 752,
                                        max_features= 41, 
                                        bootstrap=True,
                                        max_samples = x_train.shape[0],
                                        random_state = 42).fit(x_train, y_train)

In [455]:
# Training Kappa
rf_class_tree_best_pred_train = rf_class_tree_best.predict(x_train)
Kappa_train_rf_class_tree_best = cohen_kappa_score(y1 = rf_class_tree_best_pred_train, y2 = y_train)
print(Kappa_train_rf_class_tree_best)

0.8889640095342903


In [456]:
# Test Kappa
rf_class_tree_best_pred_test = rf_class_tree_best.predict(x_test)
Kappa_test_rf_class_tree_best = cohen_kappa_score(y1 = rf_class_tree_best_pred_test, y2 = y_test)
print(Kappa_test_rf_class_tree_best)

0.4796968398888146


In [None]:
# Training Kappa: 0.889
# Test Kappa: 0.480

## Boosting

In [464]:
# Perform K-fold cross-validation to select the best combination of n_estimators and max_leaf_nodes

# Define the estimator
from sklearn.ensemble import GradientBoostingClassifier
Tree_class_boosting_estimator = GradientBoostingClassifier(loss = 'log_loss',
                                                 learning_rate = 0.01,
                                                 max_features = x_train.shape[1],
                                                 criterion = 'friedman_mse',
                                                 random_state = 42)


In [466]:
# Cross-validation data splitting strategy
kf_boosting = KFold(n_splits = 2, shuffle = True, random_state = 42)

# Defining a search grid

param_grid_boosting = {'n_estimators': list(range(1000, 5001, 50)),
                        'max_leaf_nodes': [2, 3, 4, 5, 6, 7]
}


In [467]:
# Cross-validation for all combinations to find the best set of parameters
cv_tree_class_boosting = GridSearchCV(estimator = Tree_class_boosting_estimator,
                                      param_grid = param_grid_boosting, 
                                      scoring = 'neg_log_loss',
                                      cv = kf_boosting, 
                                      n_jobs = -1).fit(x_train, y_train)

In [468]:
# Outputting the best combination
print("Best Parameters: ", cv_tree_class_boosting.best_params_)
print("Best Score: ", cv_tree_class_boosting.best_score_)

Best Parameters:  {'max_leaf_nodes': 6, 'n_estimators': 1650}
Best Score:  -0.35249858290199554


In [469]:
# Using the best combination of parameters

Tree_class_boost_best = GradientBoostingClassifier(loss = 'log_loss',
                                                 learning_rate = 0.01,
                                                 n_estimators = 1650,
                                                 max_leaf_nodes = 6,
                                                 max_features = x_train.shape[1],
                                                 criterion = 'friedman_mse',
                                                 random_state = 42).fit(x_train, y_train)

In [470]:
# Output the predicted class probabilities for each observation in the test data

print(Tree_class_boost_best.classes_)
print(Tree_class_boost_best.predict_proba(x_test))
pred_prob_test = Tree_class_boost_best.predict_proba(x_test)[:,1]
print(pred_prob_test)
# predict_proba returns an array with the probability of Y=0 (left column) and the probability of Y=1 (right column).

# How large must a predicted probability be for an observation to be assigned to class 1?
# We can write a loop to select the best cut-off probability.

[0 1]
[[0.65414683 0.34585317]
 [0.48853784 0.51146216]
 [0.94854685 0.05145315]
 ...
 [0.28747857 0.71252143]
 [0.84986689 0.15013311]
 [0.23590347 0.76409653]]
[0.34585317 0.51146216 0.05145315 ... 0.71252143 0.15013311 0.76409653]


In [471]:
# Selecting the optimal cut-off probability

# The cut-off probability begins at 0.1, increments by 0.01, and ends at 0.9.
Kappa_test_boost = []
cut_off_prob = []

for prob in np.arange(0.1, 0.91, 0.01):
    cut_off_prob.append(round(prob, 2))
    y_test_pred_class = np.where(pred_prob_test >= prob, 1, 0)
    Kappa = cohen_kappa_score(y1 = y_test_pred_class, y2 = y_test) 
    Kappa_test_boost.append(Kappa)
    
print(Kappa_test_boost.index(max(Kappa_test_boost))) # Get the index of the largest test Kappa
print(cut_off_prob[Kappa_test_boost.index(max(Kappa_test_boost))]) # Get the corresponding cut-off probability
print(max(Kappa_test_boost)) # Get the corresponding test Kappa

# Conclusion: the optimal cut-off probability is 0.38, giving us the best test Kappa of 0.519.

28
0.38
0.518853565880776
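
The loop above selects the cut-off that maximizes Kappa on the test data, so the reported test Kappa is likely to be somewhat optimistic. One hedged alternative is to pick the cut-off on a validation split carved out of the training data and then score the test set once. The sketch below illustrates this on synthetic data. The split sizes, model settings, and names such as `X_val` and `best_cutoff` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the admissions data
X_demo, y_demo = make_classification(n_samples = 1000, n_features = 10,
                                     weights = [0.75, 0.25], random_state = 42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size = 0.25,
                                          stratify = y_demo, random_state = 42)
# Hold out part of the training data for choosing the cut-off
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size = 0.25,
                                              stratify = y_tr, random_state = 42)

model_demo = GradientBoostingClassifier(n_estimators = 100, random_state = 42).fit(X_fit, y_fit)
val_prob = model_demo.predict_proba(X_val)[:, 1]

# Cut-offs from 0.10 to 0.90 in steps of 0.01
cutoffs = np.linspace(0.10, 0.90, 81)
val_kappas = [cohen_kappa_score(y_val, np.where(val_prob >= c, 1, 0)) for c in cutoffs]
best_cutoff = cutoffs[int(np.argmax(val_kappas))]

# The test set is scored once, using the cut-off chosen on validation data
test_prob = model_demo.predict_proba(X_te)[:, 1]
test_kappa = cohen_kappa_score(y_te, np.where(test_prob >= best_cutoff, 1, 0))
print(round(best_cutoff, 2), test_kappa)
```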


In [472]:
# Training Kappa
pred_prob_train = Tree_class_boost_best.predict_proba(x_train)[:,1]
pred_class_train = np.where(pred_prob_train >= 0.38, 1, 0)
Kappa_train_boost = cohen_kappa_score(y1 = pred_class_train, y2 = y_train)
print(Kappa_train_boost)

0.5698498325282015


In [473]:
# Test Kappa
pred_prob_test = Tree_class_boost_best.predict_proba(x_test)[:,1]
pred_class_test = np.where(pred_prob_test >= 0.38, 1, 0)
Kappa_test_boost = cohen_kappa_score(y1 = pred_class_test, y2 = y_test)
print(Kappa_test_boost)

0.518853565880776


In [478]:
# Feature Importance
Gini_reduction = Tree_class_boost_best.feature_importances_

# Create a feature importance data frame
feature_importance_dict = {'feature_name': x_train.columns, 'Gini_reduction': Gini_reduction}
feature_importance_df = pd.DataFrame(feature_importance_dict).sort_values(by='Gini_reduction', ascending=False)

# Filter features with Gini reduction greater than 0.005
filtered_features_df = feature_importance_df[feature_importance_df['Gini_reduction'] > 0.005]

# Output the filtered features
print(filtered_features_df)

                       feature_name  Gini_reduction
2         Total Event Participation        0.360977
3            Count of Campus Visits        0.151382
26   Decision Plan_Early Decision I        0.124239
29         Athlete_Athlete, Opt Out        0.067084
27  Decision Plan_Early Decision II        0.052272
30              Athlete_Non-Athlete        0.050303
0                            Legacy        0.039456
4                    Academic Index        0.033812
6      Days_to_Submit_since_Inquiry        0.015647
70            Merit Award_Full Ride        0.010574
1                    Sport 1 Rating        0.007658
12  Permanent Country_United States        0.005528
81       ACT Composite Grouped_34.0        0.005524
82       ACT Composite Grouped_35.0        0.005198
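
`feature_importances_` is an impurity-based measure computed from the training data. As a rough cross-check, permutation importance measures how much a held-out score drops when one feature is shuffled. The sketch below uses scikit-learn's `permutation_importance` on synthetic data. The feature names and model settings are placeholders, not the admissions features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in with placeholder feature names
X_arr, y_demo = make_classification(n_samples = 600, n_features = 6, n_informative = 3,
                                    n_redundant = 0, random_state = 42)
X_demo = pd.DataFrame(X_arr, columns = [f"feature_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size = 0.25,
                                          stratify = y_demo, random_state = 42)

model_demo = GradientBoostingClassifier(n_estimators = 100, random_state = 42).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the drop in balanced accuracy on held-out data
perm_result = permutation_importance(model_demo, X_te, y_te,
                                     scoring = 'balanced_accuracy',
                                     n_repeats = 10, random_state = 42)

perm_importance_df = pd.DataFrame({'feature_name': X_demo.columns,
                                   'importance_mean': perm_result.importances_mean,
                                   'importance_std': perm_result.importances_std}
                                  ).sort_values(by = 'importance_mean', ascending = False)
print(perm_importance_df)
```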


In [None]:
# Training Kappa: 0.570
# Test Kappa: 0.519

## Bagging

In [457]:
# Bagging 

kf_bag_class_tree = KFold(n_splits = 10, shuffle = True, random_state = 42)

In [460]:
# Define the bagging estimator (a random forest that considers all features at each split)
# and tune max_leaf_nodes with 10-fold cross-validation
bag_class_tree_estimator = RandomForestClassifier(n_estimators=500, 
                                        criterion='gini', 
                                        max_depth = None,
                                        min_samples_split = 2,
                                        min_samples_leaf = 1, 
                                        max_leaf_nodes = None,
                                        max_features= x_train.shape[1], 
                                        bootstrap=True,
                                        max_samples = math.ceil(x_train.shape[0] * 0.9) -1,
                                        random_state = 42)

param_grid_bag_class_tree = {'max_leaf_nodes': list(range(2, 1676,250))}
                       
cv_bag_class_tree = GridSearchCV(estimator = bag_class_tree_estimator,
                              param_grid = param_grid_bag_class_tree, 
                              scoring = 'balanced_accuracy',
                              cv = kf_bag_class_tree,
                              n_jobs = -1) 
cv_bag_class_tree.fit(x_train, y_train)

print(cv_bag_class_tree.best_params_, cv_bag_class_tree.best_score_)

{'max_leaf_nodes': 252} 0.7067514288018935


In [461]:
bag_class_tree_best = RandomForestClassifier(n_estimators=500, 
                                        criterion='gini', 
                                        max_depth = None,
                                        min_samples_split = 2,
                                        min_samples_leaf = 1, 
                                        max_leaf_nodes = 252,
                                        max_features= x_train.shape[1], 
                                        bootstrap=True,
                                        max_samples = x_train.shape[0],
                                        random_state = 42).fit(x_train, y_train)

In [462]:
# Training Kappa of the optimal bagged tree
bag_class_tree_best_pred_train = bag_class_tree_best.predict(x_train)
Kappa_train_bag_class_tree_best = cohen_kappa_score(y1 = bag_class_tree_best_pred_train, y2 = y_train)
print(Kappa_train_bag_class_tree_best)

0.7378852285076154


In [463]:
# Test Kappa of the optimal bagged tree
bag_class_tree_best_pred_test = bag_class_tree_best.predict(x_test)
Kappa_test_bag_class_tree_best = cohen_kappa_score(y1 = bag_class_tree_best_pred_test, y2 = y_test)
print(Kappa_test_bag_class_tree_best)

0.4791460336625407


In [None]:
# Training Kappa: 0.738
# Test Kappa: 0.479

## Support Vector Machine with Linear Kernel

In [334]:
# Tune the C (cost) parameter using 5-fold cross-validation

from sklearn.svm import SVC

# Define the estimator
svml_estimator = SVC(kernel = 'linear', 
                 C = 1,
                 degree = 3,
                 gamma = 0.0)

In [335]:
# Define cross-validation data splitting strategy
kf_svml = KFold(n_splits = 5, shuffle = True, random_state = 42)

In [338]:
# Define base and power ranges
base_values = np.arange(2, 3, 1)  # A single base value: 2
power_values = np.arange(-5, 5.5, 1)  # Exponents from -5 to 5 with a step size of 1

# Generate the search grid for the 'C' parameter: 2^-5, 2^-4, ..., 2^5
# (the exponents are floats, so the result is a float array and no complex dtype is needed)
search_grid_svml = {'C': np.power.outer(base_values, power_values).flatten()}

# Print the result
print(search_grid_svml)

{'C': array([3.125e-02, 6.250e-02, 1.250e-01, 2.500e-01, 5.000e-01, 1.000e+00,
       2.000e+00, 4.000e+00, 8.000e+00, 1.600e+01, 3.200e+01])}


In [339]:
# Cross-validation to find the best C value
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Define the estimator: Linear SVM with 'linear' kernel
estimator = SVC(kernel='linear')

# Define the parameter grid for 'C'
search_grid_svml = {'C': np.power.outer(base_values, power_values, dtype=complex).flatten().real}

# Define the cross-validation procedure with GridSearchCV
cv_svml = GridSearchCV(estimator=estimator, 
                       param_grid=search_grid_svml, 
                       scoring='accuracy', 
                       cv=5,  # 5-fold cross-validation
                       n_jobs=-1)  # Use all available cores for parallelism

# Fit the model using the training data
cv_svml.fit(x_train, y_train)

# Output the best parameter and best score
print(f"Best C value: {cv_svml.best_params_['C']}")
print(f"Best cross-validation accuracy: {cv_svml.best_score_}")

Best C value: 0.0625
Best cross-validation accuracy: 0.8329000000000001


In [341]:
# Fit a SVM with linear kernel model using the best C

svm_linear_best = SVC(kernel = 'linear', 
                 C = 0.0625,
                 degree = 3,
                 gamma = 0.0).fit(x_train, y_train)

In [342]:
# Classification Performance using the best model
# Training Kappa
svml_best_pred_train = svm_linear_best.predict(x_train)
Kappa_train_svml_best = cohen_kappa_score(y1 = svml_best_pred_train, y2 = y_train)
print("Training Kappa:", Kappa_train_svml_best)

# Test Kappa
svml_best_pred_test = svm_linear_best.predict(x_test)
Kappa_test_svml_best = cohen_kappa_score(y1 = svml_best_pred_test, y2 = y_test)
print("Test Kappa:", Kappa_test_svml_best)

# Training Kappa: 0.424
# Test Kappa: 0.405

Training Kappa: 0.42445572303223156
Test Kappa: 0.40530622729441124
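
The training and test Kappa values reported so far can be gathered into one table to compare the models and the size of each train-test gap. The sketch below hard-codes the rounded values printed in the cells above. The model labels are informal, and the "Logistic regression" label is inferred from the variable name `Kappa_test_logis`.

```python
import pandas as pd

# Rounded Kappa values copied from the outputs above
kappa_summary = pd.DataFrame({
    'model': ['Logistic regression', 'KNN (K = 1)', 'Simple tree', 'Pruned tree',
              'Random forest', 'Boosting', 'Bagging', 'SVM (linear kernel)'],
    'train_kappa': [0.528, 0.999, 0.999, 0.608, 0.889, 0.570, 0.738, 0.424],
    'test_kappa':  [0.503, 0.239, 0.372, 0.457, 0.480, 0.519, 0.479, 0.405],
})

# A large gap between training and test Kappa points to overfitting
kappa_summary['gap'] = kappa_summary['train_kappa'] - kappa_summary['test_kappa']
kappa_summary = kappa_summary.sort_values(by = 'test_kappa', ascending = False).reset_index(drop = True)
print(kappa_summary)
```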
