# Trinity Admission Data Preparation <br>


The goal of this data preparation project is to ready the data for constructing a classification model to determine whether an accepted applicant will decide to attend the University, based on information collected regarding the University’s recently accepted applicants, ranging in entry term from Fall 2017, Fall 2018, Fall 2019, Fall 2020, and Fall 2021.  Your final data set should be ready for modeling. 

This script demonstrates how to clean some typical variables for the test dataset and the cleaning requiremens for all variables. 

The test dataset is a subset of the original dataset, which is used to test the model to understand the relationships between variables. Then the tested model will predict the target variable using predictors in the test dataset.

You need to handle all the columns/variables that are not processed in this scirpt following the intrsuctions in the comments.


In [482]:
import pandas as pd
#Read in TU.csv

TU = pd.read_csv("TU.csv")
# pd.set_option('display.max_columns', None)
TU.head(1)

Unnamed: 0,ID,train-test,Entry Term (Application),Admit Type,Permanent Postal,Permanent Country,Sex,Ethnicity,Race,Religion,...,SAT Concordance Score (of SAT R),ACT Concordance Score (of SAT R),ACT Concordance Score (of SAT),Test Optional,SAT I Critical Reading,SAT I Math,SAT I Writing,SAT R Evidence-Based Reading and Writing Section,SAT R Math Section,Decision
0,1,train,Fall 2017,FY,87507-7944,United States,F,Non Hispanic/Latino,White,Roman Catholic,...,,,,,,,,,,1


In [483]:
# Display all columns in one output
TU.columns

Index(['ID', 'train-test', 'Entry Term (Application)', 'Admit Type',
       'Permanent Postal', 'Permanent Country', 'Sex', 'Ethnicity', 'Race',
       'Religion', 'First_Source Origin First Source Date', 'Inquiry Date',
       'Submitted', 'Application Source', 'Decision Plan',
       'Staff Assigned Name', 'Legacy', 'Athlete', 'Sport 1 Sport',
       'Sport 1 Rating', 'Sport 2 Sport', 'Sport 2 Rating', 'Sport 3 Sport',
       'Sport 3 Rating', 'Academic Interest 1', 'Academic Interest 2',
       'First_Source Origin First Source Summary', 'Total Event Participation',
       'Count of Campus Visits', 'School #1 Organization Category',
       'School 1 Code', 'School 1 Class Rank (Numeric)',
       'School 1 Class Size (Numeric)', 'School 1 GPA', 'School 1 GPA Scale',
       'School 1 GPA Recalculated', 'School 2 Class Rank (Numeric)',
       'School 2 Class Size (Numeric)', 'School 2 GPA', 'School 2 GPA Scale',
       'School 2 GPA Recalculated', 'School 3 Class Rank (Numeric)',
     

In [484]:
# Divide the dataframe into training and test datasets
# In this course, you will only work on the variables in the training set
# You will need to clean the test set following the methods used in this script, when you work on modeling in later chapter.

TUtrain=TU[TU['train-test']=='train']
TUtest=TU[TU['train-test']=='test']

In [485]:
#Column1 - ID

#Check NAs
TUtest['ID'].isna().sum()
#No NA,so no cleaning is required. But ID will be removed in the modeling stage. Why?


0

In [486]:
#Column2 - Entry Term Application

#Check NAs
print(TUtest['Entry Term (Application)'].isna().sum())
#No NA.
TUtest['Entry Term (Application)'].unique()
#No irregular categories.


0


array(['Fall 2019', 'Fall 2020', 'Fall 2021', 'Fall 2017', 'Fall 2018'],
      dtype=object)

In [487]:
#Column 3 - Admit Type

#Check NAs
print(TUtest['Admit Type'].isna().sum())
#No NA.
print(TUtest['Admit Type'].unique())
#No irregular categories.

0
['FY']


In [488]:
#Since the data set only has first years (i.e.,only one category), 
# Admit.Type should be removed.
print(TUtest['Admit Type'])
TUtest=TUtest.drop('Admit Type',axis='columns')
TUtest.columns

10000    FY
10001    FY
10002    FY
10003    FY
10004    FY
         ..
15138    FY
15139    FY
15140    FY
15141    FY
15142    FY
Name: Admit Type, Length: 5143, dtype: object


Index(['ID', 'train-test', 'Entry Term (Application)', 'Permanent Postal',
       'Permanent Country', 'Sex', 'Ethnicity', 'Race', 'Religion',
       'First_Source Origin First Source Date', 'Inquiry Date', 'Submitted',
       'Application Source', 'Decision Plan', 'Staff Assigned Name', 'Legacy',
       'Athlete', 'Sport 1 Sport', 'Sport 1 Rating', 'Sport 2 Sport',
       'Sport 2 Rating', 'Sport 3 Sport', 'Sport 3 Rating',
       'Academic Interest 1', 'Academic Interest 2',
       'First_Source Origin First Source Summary', 'Total Event Participation',
       'Count of Campus Visits', 'School #1 Organization Category',
       'School 1 Code', 'School 1 Class Rank (Numeric)',
       'School 1 Class Size (Numeric)', 'School 1 GPA', 'School 1 GPA Scale',
       'School 1 GPA Recalculated', 'School 2 Class Rank (Numeric)',
       'School 2 Class Size (Numeric)', 'School 2 GPA', 'School 2 GPA Scale',
       'School 2 GPA Recalculated', 'School 3 Class Rank (Numeric)',
       'School 3 Cl

In [489]:
#Column 4 - Permanent Postal
print(TUtest['Permanent Postal'].isna().sum())
#105 NAs.
TUtest['Permanent Postal'].unique()
#However, the column "Permanent.Geomarket" had already provided needed information
#regarding the postal codes of different states. Therefore, this column might be
#redundant, and we might just use "Permanent.Geomarket"
#Therefore, let's remove this column and there is no need to handle the missing values
print(TUtest['Permanent Postal'])
TUtest=TUtest.drop('Permanent Postal',axis='columns')
TUtest.columns

57
10000    77573-3387
10001    78731-1541
10002    76710-7247
10003    78589-4116
10004    78751-3134
            ...    
15138    91006-1737
15139    77494-5298
15140    55443-1016
15141    75024-2138
15142    78624-6081
Name: Permanent Postal, Length: 5143, dtype: object


Index(['ID', 'train-test', 'Entry Term (Application)', 'Permanent Country',
       'Sex', 'Ethnicity', 'Race', 'Religion',
       'First_Source Origin First Source Date', 'Inquiry Date', 'Submitted',
       'Application Source', 'Decision Plan', 'Staff Assigned Name', 'Legacy',
       'Athlete', 'Sport 1 Sport', 'Sport 1 Rating', 'Sport 2 Sport',
       'Sport 2 Rating', 'Sport 3 Sport', 'Sport 3 Rating',
       'Academic Interest 1', 'Academic Interest 2',
       'First_Source Origin First Source Summary', 'Total Event Participation',
       'Count of Campus Visits', 'School #1 Organization Category',
       'School 1 Code', 'School 1 Class Rank (Numeric)',
       'School 1 Class Size (Numeric)', 'School 1 GPA', 'School 1 GPA Scale',
       'School 1 GPA Recalculated', 'School 2 Class Rank (Numeric)',
       'School 2 Class Size (Numeric)', 'School 2 GPA', 'School 2 GPA Scale',
       'School 2 GPA Recalculated', 'School 3 Class Rank (Numeric)',
       'School 3 Class Size (Numeric)',

In [490]:
#Column 5 - Permanent Country
#0 NA.
print(TUtest['Permanent Country'].isna().sum())
#No irregular categories.
TUtest['Permanent Country'].unique()

1


array(['United States', 'China', 'Vietnam', 'India', 'Iceland',
       'Nicaragua', 'Mexico', 'Honduras', 'El Salvador', 'Taiwan', 'Peru',
       'Japan', 'United Kingdom', 'Norway', 'Netherlands', 'Colombia',
       'Ecuador', 'United Arab Emirates', 'Jordan', 'Ghana',
       'South Africa', 'Spain', 'South Korea', 'Hong Kong S.A.R.',
       'Guatemala', 'Egypt', 'Pakistan', 'Australia', 'Nepal',
       'Kazakhstan', nan, 'Costa Rica', 'Saudi Arabia', 'Jamaica',
       'Thailand', 'Uruguay', 'Canada', 'Singapore', 'Brazil', 'Ukraine',
       'Belgium', 'Nigeria', 'Panama', 'Greece', 'Ethiopia', 'Romania',
       'Uganda', 'Tanzania', 'Bolivia', 'Kuwait', 'France', 'Cambodia',
       'Belize', 'Turkey', 'Portugal', 'Zimbabwe', 'Germany', 'Oman',
       'Lebanon', 'Switzerland', 'Philippines', 'Indonesia', 'Mongolia',
       'Russia', 'Morocco', 'Chile', 'Cyprus', 'Albania',
       'Dominican Republic'], dtype=object)

In [491]:
# List of countries that are unique to the test set
MissingInTrain = ['Oman', 'Romania', 'Australia', 'Ukraine', 'Chile', 'Portugal', 'Dominican Republic', 'Zimbabwe', 'Uganda', 'Iceland', 'Egypt', 'Mongolia']

# Make all missing coutries into one category
TUtest['Permanent Country'] = TUtest['Permanent Country'].apply(
    lambda x: 'UniqueCountry' if x in MissingInTrain else x
)

print(TUtest['Permanent Country'].unique())

['United States' 'China' 'Vietnam' 'India' 'UniqueCountry' 'Nicaragua'
 'Mexico' 'Honduras' 'El Salvador' 'Taiwan' 'Peru' 'Japan'
 'United Kingdom' 'Norway' 'Netherlands' 'Colombia' 'Ecuador'
 'United Arab Emirates' 'Jordan' 'Ghana' 'South Africa' 'Spain'
 'South Korea' 'Hong Kong S.A.R.' 'Guatemala' 'Pakistan' 'Nepal'
 'Kazakhstan' nan 'Costa Rica' 'Saudi Arabia' 'Jamaica' 'Thailand'
 'Uruguay' 'Canada' 'Singapore' 'Brazil' 'Belgium' 'Nigeria' 'Panama'
 'Greece' 'Ethiopia' 'Tanzania' 'Bolivia' 'Kuwait' 'France' 'Cambodia'
 'Belize' 'Turkey' 'Germany' 'Lebanon' 'Switzerland' 'Philippines'
 'Indonesia' 'Russia' 'Morocco' 'Cyprus' 'Albania']


In [492]:
# List of countries that are missing in the training data
MissingInTrain = [
    'Barbados', 'Dominica', 'Palestine', 'Poland', 'Georgia', 'Venezuela', 'Italy', 
    'Czech Republic', 'Ireland', 'Cayman Islands', 'Cameroon', 'Malaysia', 'Iran', 
    'The Bahamas', 'New Zealand', 'Bosnia and Herzegovina', 'Paraguay', 'Lithuania', 
    'Trinidad and Tobago', 'Bangladesh', 'Luxembourg', 'Montenegro', 'Kenya', 
    "Cote D'Ivoire", 'Uzbekistan', 'Mozambique'
]

# Apply the transformation to the 'Permanent Country' column in the TUtest dataframe
TUtest['Permanent Country'] = TUtest['Permanent Country'].apply(
    lambda x: 'UniqueCountry' if x in MissingInTrain else x
)

# Check the result (optional)
print(TUtest['Permanent Country'].unique())

['United States' 'China' 'Vietnam' 'India' 'UniqueCountry' 'Nicaragua'
 'Mexico' 'Honduras' 'El Salvador' 'Taiwan' 'Peru' 'Japan'
 'United Kingdom' 'Norway' 'Netherlands' 'Colombia' 'Ecuador'
 'United Arab Emirates' 'Jordan' 'Ghana' 'South Africa' 'Spain'
 'South Korea' 'Hong Kong S.A.R.' 'Guatemala' 'Pakistan' 'Nepal'
 'Kazakhstan' nan 'Costa Rica' 'Saudi Arabia' 'Jamaica' 'Thailand'
 'Uruguay' 'Canada' 'Singapore' 'Brazil' 'Belgium' 'Nigeria' 'Panama'
 'Greece' 'Ethiopia' 'Tanzania' 'Bolivia' 'Kuwait' 'France' 'Cambodia'
 'Belize' 'Turkey' 'Germany' 'Lebanon' 'Switzerland' 'Philippines'
 'Indonesia' 'Russia' 'Morocco' 'Cyprus' 'Albania']


In [493]:
TUtest['Permanent Country'].unique()

array(['United States', 'China', 'Vietnam', 'India', 'UniqueCountry',
       'Nicaragua', 'Mexico', 'Honduras', 'El Salvador', 'Taiwan', 'Peru',
       'Japan', 'United Kingdom', 'Norway', 'Netherlands', 'Colombia',
       'Ecuador', 'United Arab Emirates', 'Jordan', 'Ghana',
       'South Africa', 'Spain', 'South Korea', 'Hong Kong S.A.R.',
       'Guatemala', 'Pakistan', 'Nepal', 'Kazakhstan', nan, 'Costa Rica',
       'Saudi Arabia', 'Jamaica', 'Thailand', 'Uruguay', 'Canada',
       'Singapore', 'Brazil', 'Belgium', 'Nigeria', 'Panama', 'Greece',
       'Ethiopia', 'Tanzania', 'Bolivia', 'Kuwait', 'France', 'Cambodia',
       'Belize', 'Turkey', 'Germany', 'Lebanon', 'Switzerland',
       'Philippines', 'Indonesia', 'Russia', 'Morocco', 'Cyprus',
       'Albania'], dtype=object)

In [494]:
#Column6 - Sex
print(TUtest['Sex'].isna().sum())
#No NA.
print(TUtest['Sex'].unique())
#No irregular categories.


0
['M' 'F']


In [495]:
#Column7 - Ethnicity
print(TUtest['Ethnicity'].isna().sum())
#158 NAs.
print(TUtest['Ethnicity'].unique())
#No irregular categories.
# It is fair to replace NAs with "Not Specified" as we do not have other columns for inter-field checking.

TUtest['Ethnicity'].fillna("Not specified",inplace=True)
TUtest['Ethnicity'].isna().sum()

69
['Hispanic/Latino' 'Non Hispanic/Latino' nan]


0

In [496]:
#Column8 - Race
print(TUtest['Race'].isna().sum())
#389 NAs.
print(TUtest['Race'].unique())

166
['White' nan 'Asian'
 'American Indian or Alaska Native, Black or African American, White'
 'Black or African American' 'Asian, White'
 'Native Hawaiian or Other Pacific' 'Black or African American, White'
 'American Indian or Alaska Native, White'
 'American Indian or Alaska Native'
 'Asian, Native Hawaiian or Other Pacific'
 'Asian, Native Hawaiian or Other Pacific, White'
 'Asian, Black or African American, White'
 'Asian, Black or African American'
 'American Indian or Alaska Native, Asian, White'
 'Native Hawaiian or Other Pacific, White'
 'American Indian or Alaska Native, Asian'
 'Asian, Black or African American, Native Hawaiian or Other Pacific, White'
 'American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific, White'
 'American Indian or Alaska Native, Asian, Black or African American, White'
 'American Indian or Alaska Native, Black or African American'
 'Black or African American, Native Hawaiian or Other Pacific, White']


In [497]:
#No irregular categories.
#Impute NAs with "Not specified", similar to what we do for Ethnicity.
TUtest['Race'].fillna("Not specified", inplace=True)

In [498]:
#The current classification of Race is too detailed, which leads to very 
#low frequencies for some categories.Let's take a look at the value frequencies.
TUtest['Race'].value_counts()

Race
White                                                                                                          3530
Asian                                                                                                           843
Black or African American                                                                                       255
Not specified                                                                                                   166
Asian, White                                                                                                    158
American Indian or Alaska Native, White                                                                          59
Black or African American, White                                                                                 45
American Indian or Alaska Native                                                                                 37
Asian, Black or African American, White                            

In [499]:
#So let's combine some of the categories because a category with a small number of cases won't have 
#a significant effect on the target variable in the modeling stage.

#Generate a race list that would be kept, the rest will be classified as 'others'
RaceList = list(TUtest['Race'].value_counts()[:7].index)
RaceList

['White',
 'Asian',
 'Black or African American',
 'Not specified',
 'Asian, White',
 'American Indian or Alaska Native, White',
 'Black or African American, White']

In [500]:
TUtest['Race'] = \
TUtest['Race'].apply(lambda x: 'Others' if x not in RaceList else x)
TUtest['Race'].value_counts()

Race
White                                      3530
Asian                                       843
Black or African American                   255
Not specified                               166
Asian, White                                158
Others                                       87
American Indian or Alaska Native, White      59
Black or African American, White             45
Name: count, dtype: int64

In [501]:
#Column 9 - Religion
# print # of NAs
print(TUtest['Religion'].isnull().sum()) # 4177 null values

# print unique values for Religion
print(TUtest.Religion.unique())

#Impute NAs with "Not specified", similar to what we do for Race.
TUtest["Religion"] = TUtest['Religion'].apply(lambda x: "Not specified" if pd.isnull(x) else x) #.apply iterates thru the dataframe, in this case replacing the nulls with "Not Specified"

#The current classification of Race is too detailed, which leads to very 
#low frequencies for some categories.Print the value frequencies.
print(TUtest["Religion"].value_counts())

#Religion has lots of categories, with some categories having a very small number of cases. 
#Let's combine similar levels into one level( for example:['Bible Churches','Christian Reformed','Christian Scientist','Church of Christ','Church of God'] )
TUtest['Religion'] = TUtest.Religion.apply(lambda x: "OtherRelgiousAffiliation" if x in ['Pentecostal',
                                                      'Unitarian','Protestant','Mormon-Latter Day Saints',
                                                      'Evangelical','Assembly of God','Bible Churches',
                                                      'Christian Reformed', 'Christian Scientist',
                                                      'Church of Christ','Church of God', 'Southern Baptist', 
                                                      'United Methodist', 'United Church of Christ',
                                                      'Society of Friends (Quaker)',
                                                      'Presbyterian Church of America',
                                                      'Lutheran-Missourie Synod',"Jehovah's Witnesses",
                                                      'Coptic Church (Egypt)','Mennonite','Episcopal',
                                                      'Eastern Orthodox','Lutheran-Missouri Synod','Baha',
                                                      'Jewish Messianic','Zoroastrian',"Baha'I",'Jain','Sikh',
                                                      'Buddhism','Other','Church of the Nazarene','Independent'
                                                                                        
                                                                                        
                                                                                        ] else x)
#I combined all small christian denominations below   100 


#then combine levels with less than 100 cases into "Other" because a level accounting for lower than 1% 
#of testing set is very unlikely to have a significant effect on the target variable.

print(TUtest.Religion.value_counts())

2145
['Christian' 'Methodist' 'Episcopal' nan 'Roman Catholic' 'Other'
 'Buddhism' 'Unitarian' 'Church of Christ' 'Baptist' 'Anglican'
 'Presbyterian' 'Hindu' 'Lutheran' 'Non-Denominational' 'Eastern Orthodox'
 'Islam/Muslim' 'Jewish' 'Pentecostal' 'Protestant'
 'United Church of Christ' 'Mormon-Latter Day Saints'
 "Jehovah's Witnesses" 'Southern Baptist' 'Jain' "Baha'I"
 'Christian Scientist' 'Presbyterian Church of America' 'Evangelical'
 'Assembly of God' 'Bible Churches' 'Society of Friends (Quaker)'
 'United Methodist' 'Sikh' 'Church of God' 'Coptic Church (Egypt)'
 'Lutheran-Missouri Synod' 'Christian Reformed' 'Independent'
 'Church of the Nazarene']
Religion
Not specified                     2145
Roman Catholic                     934
Christian                          551
Methodist                          264
Baptist                            213
Presbyterian                       147
Hindu                              136
Other                              116
Anglican     

In [502]:
#Column 10 - First_Source Origin First Source Date
print(TUtest['First_Source Origin First Source Date'].isna().sum())
#No NAs.

#convert to date format
TUtest['First_Source Origin First Source Date'] = pd.to_datetime(
    TUtest['First_Source Origin First Source Date'], errors='coerce').fillna(pd.to_datetime('1900-01-01'))
# Display the converted column
print(TUtest['First_Source Origin First Source Date'])

0
10000   2018-02-19 11:11:00
10001   2018-02-19 11:11:00
10002   2019-10-10 12:03:00
10003   2016-11-02 06:00:00
10004   2018-10-08 21:59:00
                ...        
15138   2018-02-19 11:11:00
15139   2018-02-16 16:31:00
15140   2019-06-25 12:50:00
15141   1900-01-01 00:00:00
15142   2017-01-30 17:24:00
Name: First_Source Origin First Source Date, Length: 5143, dtype: datetime64[ns]


In [503]:
#Column11 - Inquiry Date
print(TUtest['Inquiry Date'].isna().sum())
#3181 NAs.

#convert to date format
TUtest['Inquiry Date']= pd.to_datetime(TUtest['Inquiry Date'], errors='coerce')

# Filling NaNs with a statement saying they did not inquire abount trinity admissions
TUtest['Inquiry Date'].fillna("Did not Inquire", inplace=True)
TUtest['Inquiry Date']

1578


10000    2018-09-18 22:16:00
10001    2018-04-25 10:59:00
10002    2019-10-10 11:52:00
10003        Did not Inquire
10004    2018-10-08 22:06:00
                ...         
15138        Did not Inquire
15139    2018-08-28 09:26:00
15140        Did not Inquire
15141    2017-09-10 10:24:00
15142    2018-02-19 11:19:00
Name: Inquiry Date, Length: 5143, dtype: object

In [504]:
#Column 12 - Submitted
print(TUtest['Submitted'].isna().sum())
#No NAs.

#convert to date format
TUtest['Submitted'] = pd.to_datetime(TUtest['Submitted'], errors='ignore').fillna(pd.to_datetime('1900-01-01'))

0


In [505]:
# Column10-12
# After viewing Column10-12, it would be interesting to see
# whether the differences between submission date and First_Source date,
# and the differences between submission date and inquiry date, affect the response.
# So let's calculate the time difference between submission date and first_source date.

# Convert 'Submitted' and 'First_Source Origin First Source Date' to datetime, setting any missing values to '1900-01-01'
TUtest['Submitted'] = pd.to_datetime(TUtest['Submitted'], errors='coerce').fillna(pd.to_datetime('1900-01-01'))
TUtest['First_Source Origin First Source Date'] = pd.to_datetime(TUtest['First_Source Origin First Source Date'], errors='coerce').fillna(pd.to_datetime('1900-01-01'))

# Convert 'Inquiry Date' to datetime, setting any missing values to a placeholder date
# Add an indicator column to mark rows with no inquiry
TUtest['Inquiry Date'] = pd.to_datetime(TUtest['Inquiry Date'], errors='coerce').fillna(pd.to_datetime('1900-01-01'))
TUtest['Inquiry Status'] = TUtest['Inquiry Date'].apply(lambda x: "Did not Inquire" if x == pd.to_datetime('1900-01-01') else "Inquired")

# Calculate the time difference in weeks between 'Submitted' and 'First_Source Origin First Source Date'
TUtest['Submit_FirstSource'] = (TUtest['Submitted'] - TUtest['First_Source Origin First Source Date']).dt.days / 7

# Calculate the time difference in weeks between 'Submitted' and 'Inquiry Date'
TUtest['Submit_Inquiry'] = (TUtest['Submitted'] - TUtest['Inquiry Date']).dt.days / 7

# Optionally, round the calculated week differences to whole numbers
TUtest['Submit_FirstSource'] = TUtest['Submit_FirstSource'].round(0)
TUtest['Submit_Inquiry'] = TUtest['Submit_Inquiry'].round(0)

In [506]:
#There are NAs in Inquiry.Date,
#thus leading to NAs in Submit_Inquiry.
#Impute NAs in Submit_Inquiry with median values.

TUtest['Submit_Inquiry'].fillna(TUtest['Submit_Inquiry'].median(),inplace=True)
TUtest['Submit_Inquiry'].isna().sum()

0

In [507]:
#Remove Column10-12 after you created new variables above.  
TUtest.drop('First_Source Origin First Source Date', axis='columns', inplace=True)
TUtest.drop('Inquiry Date', axis='columns', inplace=True)
TUtest.drop('Submitted', axis='columns', inplace=True)
TUtest.drop("Inquiry Status", axis = 'columns', inplace = True)

In [508]:
#Column13 - Application.Source
print( TUtest['Application Source'].isna().sum())
#No NAs.
print(TUtest['Application Source'].unique())
#No irregular categories.

0
['ApplyTexas' 'CommonApp' 'Coalition' 'Select Scholar']


In [509]:
#Column14 - Decision.Plan
print(TUtest['Decision Plan'].isna().sum())
#No NAs.
print(TUtest['Decision Plan'].unique())
#No irregular categories.

0
['Early Action I' 'Early Action' 'Regular Decision' 'Early Action II'
 'Early Decision II' 'Early Decision I']


In [510]:
#Column15 - Staff.Assigned.Name
#Based on variable description, this variable might not be useful and provide
#insightful information in the modeling.
#Also, some staffs already left Trinity.
#So remove this variable
TUtest.drop(['Staff Assigned Name'], axis='columns', inplace=True)

In [511]:
#Column16 - Legacy
print(TUtest['Legacy'].isna().sum())
#No NAs.
print(TUtest['Legacy'].unique())
#No irregular categories.
#Impute NAs with "No Legacy"
TUtest['Legacy'].fillna("No Legacy",inplace=True)
TUtest['Legacy'].isna().sum()

#Legacy has many options, leading some options to having only a small number of cases.
#Let's group all the options into 3 categories ('Legacy',"No Legacy", "Legacy, Opt Out") 
#so that each category has the chance to affect the response variable.
TUtest['Legacy']=\
TUtest['Legacy'].apply(lambda x: 'Legacy, Opt Out' if x not in ['Legacy','No Legacy'] else x)

4659
[nan 'Legacy' 'Legacy, Opt Out' 'Legacy, VIP' 'Athlete, Legacy, Opt Out'
 'Athlete, Legacy' 'Legacy, Opt Out, VIP' 'Fine Arts, Legacy, VIP'
 'Athlete, Legacy, VIP' 'Fine Arts, Legacy, Opt Out' 'Fine Arts, Legacy'
 'Athlete, Fine Arts, Legacy, Opt Out, VIP'
 'Fine Arts, Legacy, Opt Out, VIP' 'Athlete, Legacy, Opt Out, VIP'
 'Athlete, Fine Arts, Legacy, VIP']


In [512]:
#Column17 - Athlete
# print # NAs.
print(TUtest['Athlete'].isnull().sum()) #checking for NAs / sum, 8683 null values

# print unique value counts.
print(TUtest['Athlete'].value_counts()) #unique value counts returning amount of groups

#Impute NAs with "Non-Athlete"
TUtest['Athlete'] = TUtest.Athlete.apply(lambda x: "Non-Athlete" if pd.isnull(x) else x) 
#.apply() lambda x returning "Non-Athlete" for any null value 

#Similar to Legacy, Athlete has many categories with a few cases.
#Group all options into three categories: 
#Athlete, Non-Athlete, and Athlete, Opt Out.

TUtest['Athlete'] = \
TUtest.Athlete.apply(lambda x: "Athlete, Opt Out" if x in ["Athlete, Opt Out", "Athlete, Legacy, Opt Out", "Athlete, Legacy, Opt Out, VIP","Athlete, Opt Out, VIP","Athlete, Fine Arts, Opt Out","Athlete, Fine Arts, Legacy, Opt Out, VIP"] else x) 
#grouping variables into one group called Athlete Opt Out
# Done by checking if x is in the specified list

TUtest['Athlete'] = \
TUtest.Athlete.apply(lambda x: "Athlete" if x not in ["Non-Athlete","Athlete, Opt Out"] else x) 
#this is grouping items into an Athlete group if they are NOT in these non athlete or opt out groups

4437
Athlete
Athlete                                     460
Athlete, Opt Out                            161
Athlete, Legacy                              34
Athlete, Legacy, Opt Out                     16
Athlete, Fine Arts                           10
Athlete, Legacy, VIP                          9
Athlete, VIP                                  7
Athlete, Opt Out, VIP                         3
Athlete, Fine Arts, Opt Out                   2
Athlete, Legacy, Opt Out, VIP                 2
Athlete, Fine Arts, Legacy, Opt Out, VIP      1
Athlete, Fine Arts, Legacy, VIP               1
Name: count, dtype: int64


In [513]:
print(TUtest['Athlete'].value_counts()) 
# Just checking to see the cleaning was successful

Athlete
Non-Athlete         4437
Athlete              521
Athlete, Opt Out     185
Name: count, dtype: int64


In [514]:
# Print NAs
print(TUtest['Sport 1 Sport'].isnull().sum())  # 8683 null values

# Print unique value counts
print(TUtest['Sport 1 Sport'].value_counts())  # Displays the unique sport counts

# Impute NAs with "No Sport"
TUtest['Sport 1 Sport'] = TUtest['Sport 1 Sport'].fillna("No Sport")

# Remove gender-specific suffixes from sport names (e.g., "Men", "Women")
# Create a mapping to remove gender-based distinctions
gender_removal_map = {
    "Football Men": "Football", 
    "Football Women": "Football",
    "Baseball Men": "Baseball", 
    "Baseball Women": "Baseball",
    "Cross Country Men": "Cross Country", 
    "Cross Country Women": "Cross Country",
    "Soccer Men": "Soccer", 
    "Soccer Women": "Soccer",
    "Track Men": "Track", 
    "Track Women": "Track",
    "Basketball Men": "Basketball", 
    "Basketball Women": "Basketball",
    "Swimming Men": "Swimming", 
    "Swimming Women": "Swimming",
    "Tennis Men": "Tennis", 
    "Tennis Women": "Tennis",
    "Golf Men": "Golf", 
    "Golf Women": "Golf",
    "Diving Men": "Diving", 
    "Diving Women": "Diving",
    "Softball": "Softball",
    "Volleyball": "Volleyball"
}

# Apply the gender_removal_map to group the sports
TUtest['Sport 1 Sport'] = TUtest['Sport 1 Sport'].map(gender_removal_map).fillna(TUtest['Sport 1 Sport'])

# Now the 'Sport 1 Sport' column should only contain the sport names without gender distinctions
print(TUtest['Sport 1 Sport'].value_counts())  # Check updated value counts


4437
Sport 1 Sport
Football               169
Soccer Men              75
Baseball                63
Cross Country Men       45
Basketball Men          39
Cross Country Women     38
Soccer Women            35
Track Women             35
Swimming Men            30
Tennis Women            30
Track Men               29
Swimming Women          26
Softball                25
Volleyball              21
Basketball Women        16
Tennis Men              11
Golf Men                 8
Diving Women             6
Golf Women               5
Name: count, dtype: int64
Sport 1 Sport
No Sport         4437
Football          169
Soccer            110
Cross Country      83
Track              64
Baseball           63
Swimming           56
Basketball         55
Tennis             41
Softball           25
Volleyball         21
Golf               13
Diving              6
Name: count, dtype: int64


In [515]:
#Column18 - Sport 1 Sport

# print # NAs.
print(TUtest['Sport 1 Sport'].isnull().sum()) #8683 null values

# print unique value counts.
print(TUtest['Sport 1 Sport'].value_counts()) 
# value counts shows there are 20 different groups, 2 of which have over 100 observations

#Impute NAs with "No Sport"
TUtest['Sport 1 Sport'] = TUtest['Sport 1 Sport'].apply(lambda x: "No Sport" if pd.isnull(x) else x) 

#Group sport men and sport women into one group
#so that each group has sufficient cases to have an impact on the response.
TUtest['Sport 1 Sport'] = TUtest['Sport 1 Sport'].apply(lambda x: "Sport" if x in 
                                                          ["Football", "Baseball", "Cross Country Men", 
                                                           "Soccer Men", "Track Men", "Basketball Men", 
                                                           "Swimming Men", "Tennis Men", "Golf Men", 
                                                           "Diving Men", "Track Women","Cross Country Women",
                                                           "Soccer Women","Swimming Women","Tennis Women",
                                                           "Softball","Volleyball","Basketball Women",
                                                           "Golf Women","Diving Women"] else x ) 
# Making it into two easily defined groups

0
Sport 1 Sport
No Sport         4437
Football          169
Soccer            110
Cross Country      83
Track              64
Baseball           63
Swimming           56
Basketball         55
Tennis             41
Softball           25
Volleyball         21
Golf               13
Diving              6
Name: count, dtype: int64


In [516]:
print(TUtest['Sport 1 Sport'].value_counts()) 
# Checking to see if it worked

Sport 1 Sport
No Sport         4437
Sport             278
Soccer            110
Cross Country      83
Track              64
Swimming           56
Basketball         55
Tennis             41
Golf               13
Diving              6
Name: count, dtype: int64


In [517]:
#Column19 - Sport 1 Rating
# print # NAs.
print(TUtest['Sport 1 Rating'].isnull().sum()) # 8683 null values

# print unique value counts.
print(TUtest['Sport 1 Rating'].value_counts()) # found 3 different groups all 250 observations of each other

#Impute NAs with "No Sport"
TUtest['Sport 1 Rating'] = TUtest["Sport 1 Rating"].apply(lambda x: "No Sport" if pd.isna(x) else x)
# Just as before, I used the .apply function to replace any null values found thru pd.isna() function, 
# in this case with the value "No Sport"

4437
Sport 1 Rating
Blue Chip    316
Varsity      210
Franchise    180
Name: count, dtype: int64


In [518]:
#Column20 - Sport 2 Sport

# print # NAs.
print(TUtest['Sport 2 Sport'].isnull().sum()) #found 9583 null values using .isnull().sum()

# print unique value counts.
print(TUtest['Sport 2 Sport'].value_counts()) 
# value counts shows 20 different unique groups where only 1 group has over 125 observations

#impute NAs with "No 2ndSport".
TUtest['Sport 2 Sport'] = TUtest["Sport 2 Sport"].apply(lambda x: "No 2ndSport" if pd.isna(x) else x) 
#used the .apply function to replace any null values found thru pd.isna() function with the string "No 2ndSport"

#The number of cases for each sport type is very small (< about 1% of the data set).
#It's better to group all options into 2 categories: 2ndSport vs. No 2ndSport.
TUtest['Sport 2 Sport'] = TUtest["Sport 2 Sport"].apply(lambda x: "2nd Sport" if x not in ["No 2ndSport"] else x) 
#grouped  anything that was not found in "No 2ndSport" into a new group called "2nd Sport" using .apply()


4930
Sport 2 Sport
Track & Field          65
Basketball             27
Football               20
Soccer                 20
Baseball               19
Cross Country          14
Swimming               14
Tennis                 12
Golf                    6
Volleyball              5
Track Men               4
Track Women             2
Cross Country Women     1
Cross Country Men       1
Softball                1
Tennis Men              1
Diving                  1
Name: count, dtype: int64


In [519]:
print(TUtest['Sport 2 Sport'].value_counts()) 
# Verifying code worked

Sport 2 Sport
No 2ndSport    4930
2nd Sport       213
Name: count, dtype: int64


In [520]:
#Column21 - Sport 2 Rating
print(TUtest['Sport 2 Rating'].isna().sum())
#9957 NAs.
print(TUtest['Sport 2 Rating'].unique())
#Only 43 out of 10000 observations are rated, which is less than 0.5% of the data set!
#Sport.2.Rating will not have much impact on the target.
#Remove it in the modeling stage.

TUtest.drop('Sport 2 Rating',axis='columns', inplace=True)

5128
[nan 'Varsity' 'Blue Chip' 'Franchise']


In [521]:
#Column22 - Sport 3 Sport

# print # NAs.
print(TUtest['Sport 3 Sport'].isna().sum()) #9838 null values found with .isna().sum()

# print unique value counts.
print(TUtest['Sport 3 Sport'].value_counts())#value counts found 12 different unique values all having under 41 observations

#impute NAs with "No 3rdSport".
TUtest['Sport 3 Sport'] = TUtest["Sport 3 Sport"].apply(lambda x: "No 3rdSport" if pd.isna(x) else x) 
#used the .apply function to replace any null values found thru pd.isna() function with the string "No 3rdSport"

#The number of cases for each sport type is very small (< 0.5% of the data set).
#It's better to group all options into 2 categories: 3rdSport vs. No 3rdSport.
TUtest['Sport 3 Sport'] = TUtest["Sport 3 Sport"].apply(lambda x: "3rdSport" if x not in ["No 3rdSport"] else x) 
# Same technique as before but the replacing value is 3rdSport

5069
Sport 3 Sport
Basketball           17
Track & Field        14
Cross Country         8
Soccer                6
Volleyball            6
Swimming              5
Baseball              5
Football              4
Golf                  4
Diving                2
Softball              1
Tennis                1
Cross Country Men     1
Name: count, dtype: int64


In [522]:
print(TUtest['Sport 3 Sport'].value_counts())#value counts found 12 different unique values all having under 41 observations
# Verifying the code worked

Sport 3 Sport
No 3rdSport    5069
3rdSport         74
Name: count, dtype: int64


In [523]:
#Column23 - Sport.3.Rating

print(TUtest['Sport 3 Rating'].isna().sum())
#9998 NAs.
print(TUtest['Sport 3 Rating'].unique())
#No questionable category.
#Only 2 out of 10000 observations are rated, which will not provide much insightful
#information. Therefore,remove this column
TUtest.drop('Sport 3 Rating',axis='columns', inplace=True)

5142
[nan 'Varsity']


In [524]:
#Column24 - Academic Interest 1
print(TUtest['Academic Interest 1'].isna().sum())
#1 NAs.
print(TUtest['Academic Interest 1'].unique())



2
['Engineering Science' 'English' 'Psychology' 'Pre-Medical' 'Biochemistry'
 'Business - Communication Management' 'Biology' 'International Studies'
 'Physics' 'Political Science' 'Business - Accounting'
 'Business Legal Studies' 'History' 'Mathematics' 'Pre-Law' 'Finance'
 'Business Analytics & Technology' 'Computer Science' 'Theatre'
 'Neuroscience' 'Business - Management' 'Chemistry' 'Undecided' 'Music'
 'Business' 'Art' 'Biochemistry & Molecular Biology'
 'Business - Marketing' 'Economics' 'German' 'Environmental Studies'
 'Business - International Business' 'Creative Writing' 'Sociology'
 'Geosciences' 'Pharmacy' 'Art History' 'Architecture' 'Entrepreneurship'
 'Communication' 'Nursing' 'Anthropology' 'Classical Languages'
 'Education' 'Mathematical Finance' 'Latin' 'Business - Sport Management'
 'Chinese' 'Urban Studies' 'Philosophy' 'New Media' 'Religion'
 'Architectural Studies' 'Spanish' 'East Asian Studies' 'Music Education'
 'Agriculture' 'French' 'Human Communication' 'Rus

In [525]:
# Step1: Most of the NAs for Academic.Interest.1 have a value for Academic.Interest.2
#We may assign the corresponding values in Academic.Interest.2 
#to NAs in Academic.Interest.1 if Academic.Interest.2 has a value.

# When update values in a subset of dataframes, 
# Try using .loc[row_indexer,col_indexer] = value instead to avoid chained indexing issue

for i,row in TUtest.iterrows():
    if pd.isna(row['Academic Interest 1']):
        print(i,row['Academic Interest 1'],row['Academic Interest 2'])
        TUtest.loc[i,'Academic Interest 1']=TUtest.loc[i,'Academic Interest 2']

11756 nan Business - Management
13813 nan nan


In [526]:
# Step2:For the remaining NAs in Academic.Interest.1, assign Undecided.
TUtest['Academic Interest 1'].fillna('Undecided',inplace=True)
TUtest['Academic Interest 1'].unique()

array(['Engineering Science', 'English', 'Psychology', 'Pre-Medical',
       'Biochemistry', 'Business - Communication Management', 'Biology',
       'International Studies', 'Physics', 'Political Science',
       'Business - Accounting', 'Business Legal Studies', 'History',
       'Mathematics', 'Pre-Law', 'Finance',
       'Business Analytics & Technology', 'Computer Science', 'Theatre',
       'Neuroscience', 'Business - Management', 'Chemistry', 'Undecided',
       'Music', 'Business', 'Art', 'Biochemistry & Molecular Biology',
       'Business - Marketing', 'Economics', 'German',
       'Environmental Studies', 'Business - International Business',
       'Creative Writing', 'Sociology', 'Geosciences', 'Pharmacy',
       'Art History', 'Architecture', 'Entrepreneurship', 'Communication',
       'Nursing', 'Anthropology', 'Classical Languages', 'Education',
       'Mathematical Finance', 'Latin', 'Business - Sport Management',
       'Chinese', 'Urban Studies', 'Philosophy', 'New Me

In [527]:
#  Step3:Group Business related options into "Business".
TUtest['Academic Interest 1'] = \
TUtest['Academic Interest 1'].apply(lambda x: 'Business' if x in['Finance','Entrepreneurship'] else x)

#Group options with a low number of cases (< 100 cases) into "Others".
Majorlist=list(TUtest['Academic Interest 1'].value_counts()[:27].index)
TUtest['Academic Interest 1'] = \
TUtest['Academic Interest 1'].apply(lambda x: 'Others' if x not in Majorlist else x)
TUtest['Academic Interest 1'].value_counts()

Academic Interest 1
Pre-Medical                            517
Biology                                455
Engineering Science                    435
Business                               410
Others                                 406
Psychology                             310
Computer Science                       297
Undecided                              235
Political Science                      199
Neuroscience                           196
Biochemistry & Molecular Biology       147
Biochemistry                           131
International Studies                  125
Business - Management                  115
English                                111
Mathematics                            109
Economics                              103
Business - International Business       96
Business - Marketing                    93
Pre-Law                                 92
Business - Accounting                   88
Chemistry                               82
Environmental Studies             

In [528]:
#Column25 - Academic.Interest.2#Column25 - Academic.Interest.2

#Check NAs
print(  TUtest['Academic Interest 2'].isna().sum())
#94 NAs.

#Replace repeated academic interests with Undecided, 
#then make NAs Undecided
TUtest['Academic Interest 2'].fillna('Undecided')

#Group Business related options into "Business".
TUtest['Academic Interest 2'] = \
TUtest['Academic Interest 2'].apply(lambda x: 'Business' if x in['Finance','Entrepreneurship'
                                                                ,'Business Analytics & Technology'] else x)


#Group options with a low number of cases (< 100 cases) into "Others".
Majorlist=list(TUtest['Academic Interest 2'].value_counts()[:27].index)
TUtest['Academic Interest 2']= \
TUtest['Academic Interest 2'].apply(lambda x: 'Others' if x not in Majorlist else x)
TUtest['Academic Interest 2'].value_counts()


# Additional line to replace 'Music' with 'Others' 
# (this is because Music is missing in the other dataframe, necessary step for modeling)
TUtest['Academic Interest 2'] = TUtest['Academic Interest 2'].apply(
    lambda x: 'Others' if x == 'Music' else x
)

65


In [529]:
TUtest['Academic Interest 2'].value_counts()
# Checking code worked

Academic Interest 2
Others                               873
Business                             455
Biology                              389
Pre-Medical                          295
Psychology                           254
Engineering Science                  235
Biochemistry & Molecular Biology     210
Computer Science                     180
Political Science                    180
Undecided                            167
Biochemistry                         162
Business - Management                160
Neuroscience                         159
Economics                            149
Pre-Law                              135
Mathematics                          120
Business - Marketing                 118
Chemistry                            118
Environmental Studies                115
Business - International Business    101
International Studies                 97
English                               88
Physics                               84
History                              

In [530]:
#Column26 - First_Source Origin First Source Summary

# print # NAs.
print(TUtest["First_Source Origin First Source Summary"].isnull().sum()) 
# No nulls

# print unique value counts.
print(TUtest["First_Source Origin First Source Summary"].value_counts()) 

#Similar to Academic.Interest.2, group options with a low number of cases (< 100) into "Other Sources".
TUtest['First_Source Origin First Source Summary']= \
TUtest['First_Source Origin First Source Summary'].apply(lambda x: 'Other Sources' if x not in ["CBINQ","OAPP","PSAT","SRCH","VST","CF","WEBTU","CAPIQ","CAP","ACT","HSV","ATH","TIF","SIB","SATR","YUVST"] else x) 
# Same as before but for other sources

0
First_Source Origin First Source Summary
CBINQ    2406
OAPP      507
SRCH      280
PSAT      249
VST       238
CF        186
WEBTU     180
CAPIQ     156
CAP       146
HSV        91
ACT        87
ATH        70
SATR       66
SIB        60
YUVST      59
TIF        59
OEVNT      54
HOBS       33
NHI        31
DOC        24
ATHWB      21
GRP        19
APPTX      14
OTH        14
EM         13
ACTPL      10
NICHE      10
CHEGG      10
ALUM        7
SAT         7
TVINT       7
WEBCA       6
TFL         5
TVOTH       4
AP          3
CLNIQ       3
DBT         2
MPC         1
DUOL        1
WEBOT       1
APCU        1
TEL         1
ATS         1
Name: count, dtype: int64


In [531]:
print(TUtest["First_Source Origin First Source Summary"].value_counts()) 
# Seeing code worked

First_Source Origin First Source Summary
CBINQ            2406
OAPP              507
Other Sources     303
SRCH              280
PSAT              249
VST               238
CF                186
WEBTU             180
CAPIQ             156
CAP               146
HSV                91
ACT                87
ATH                70
SATR               66
SIB                60
YUVST              59
TIF                59
Name: count, dtype: int64


In [532]:
#Column27 - Total Event Participation
print(TUtest['Total Event Participation'].isna().sum())
#No NAs.
print(TUtest['Total Event Participation'].unique())

#3, 4, 5 combined accounts for < 1% of the data set.
#Compared to the number of cases in 0, 1, and 2, the number of cases
#in 3, 4, and 5 won't be very useful in predicting the response.
#So group 3, 4, and 5 into "2 or more".

TUtest['Total Event Participation']=\
TUtest['Total Event Participation'].apply(lambda x: '2 or more' if x in[3,4,5] else x)
TUtest['Total Event Participation'].unique()

0
[1 2 0 3 4 5]


array([1, 2, 0, '2 or more'], dtype=object)

In [533]:
#Column28 - Count of Campus Visits
TUtest["Count of Campus Visits"] = TUtest["Count of Campus Visits"].astype(str) 
# I turned the column into string values considering there were a low amount of groups in the column 
# and the grouping instructions in the next few lines. 

# print # NAs.
print(TUtest["Count of Campus Visits"].isnull().sum()) 
# No nulls

# print unique value counts.
print(TUtest["Count of Campus Visits"].value_counts()) 

# group 5, 6, and 8 into '4 or more'.
TUtest["Count of Campus Visits"] = \
TUtest["Count of Campus Visits"].apply(lambda x: '4 or more' if x in ["4","5", "6", "8"] else x) 
#combined groups 4 - 8 into one group called "4 or more" using .apply function


0
Count of Campus Visits
0    3677
1    1156
2     216
3      66
4      18
5      10
Name: count, dtype: int64


In [534]:
print(TUtest["Count of Campus Visits"].value_counts()) 
# Checking code ran correctly

Count of Campus Visits
0            3677
1            1156
2             216
3              66
4 or more      28
Name: count, dtype: int64


In [535]:
#Column29 - School #1 Organization Category
print(TUtest['School #1 Organization Category'].isna().sum())
#25  NAs.
print(TUtest['School #1 Organization Category'].value_counts())
#Only 8 cases belong to College but 9967 cases belong to High School.
#Remove this variable.
TUtest.drop('School #1 Organization Category', axis='columns', inplace=True)

13
School #1 Organization Category
High School    5122
College           8
Name: count, dtype: int64


In [536]:
#Column30 - School 1 Code
print(TUtest['School 1 Code'].isna().sum())
#7842 NAs.
print(TUtest['School 1 Code'].unique())
#School Code will not matter much to produce insightful information.
#Additionally, there are 7842 missing values.
#so remove this column in the modeling stage.
TUtest.drop('School 1 Code', axis='columns', inplace=True)

4055
[    nan 447321. 140187. 443668. 481115. 443425. 446025.  52225. 440186.
 446785. 446365.  52347. 242277. 440294. 241635. 442958. 220595. 343880.
 999999. 445570. 442434. 441761. 444980. 443624. 440393. 442230. 443747.
 440914. 431615. 446203. 300180. 446220.  53197. 444632. 441758. 446782.
  52618. 446088. 444836. 340088. 440324. 443191. 440416. 446148. 440125.
  60414. 444599. 261590. 445052. 444368. 440310.  51106. 444596. 445056.
 443435. 617001. 501157. 332903. 445980. 100432. 440382. 444833. 440748.
 440010.  60275. 310904.  51267. 332360. 440730. 442602. 443541. 440331.
 130042. 431045. 446122. 440557. 724717. 443447. 440900.  54851. 471845.
 442572. 446207. 281315. 447586. 442047. 192601. 443644. 445499. 443361.
 311078. 440343. 444360. 300185. 441755. 443405.  10109. 442558. 999998.
 480145. 446784. 161215. 440326. 430001. 192045. 447481. 441471. 443291.
 112534. 442624. 171680. 220940. 445715. 380678. 430894. 390535. 380011.
 220950.  50485. 444865. 445412. 444209. 44651

In [537]:
#Column31 - School 1 Class Rank (Numeric)
print(TU.loc[TU['train-test']=='test','School 1 Class Rank (Numeric)'].isna().sum())
#2779 NAs.

2779


In [538]:
# Column32 - School 1 Class Size (Numeric)

print(TUtest['School 1 Class Size (Numeric)'].isna().sum())

#5357 NAs.
#Percentage rank can more accurately reflect a student's academic performance than numeric rank. 

#Create a New Column - School 1 Top Percent in Class

TUtest['School 1 Top Percent in Class'] =\
100 *(TUtest['School 1 Class Rank (Numeric)']/TUtest['School 1 Class Size (Numeric)'])

TUtest['School 1 Top Percent in Class'].isna().sum()

2779


2779

In [539]:
# #Impute the 5357 NAs based on Academic.Index column. 

# #Since we need to handle NAs in School 1 Top Percent in Class
# according to Academic.Index, first let's see whether Academic Index needs to be cleaned.
print(TUtest['Academic Index'].isna().sum())
#829 NAs.
print(TUtest['Academic Index'].value_counts())
#No questionable level.
#Impute 829 NAs with the most common level.
TUtest['Academic Index'].fillna(3,inplace=True)
TUtest['Academic Index'].unique()
#No missing values in Academic Index now.

302
Academic Index
3.0    1642
1.0    1284
2.0    1206
4.0     608
5.0     101
Name: count, dtype: int64


array([4., 2., 3., 1., 5.])

In [540]:
#calculate school 1 top percent in class for each academic index group
grouped=TUtest.groupby('Academic Index')
grouped
average=grouped.mean('School 1 Top Percent in Class')
average['School 1 Top Percent in Class']

Academic Index
1.0     4.451811
2.0     8.451642
3.0    16.449033
4.0    29.870245
5.0    39.644202
Name: School 1 Top Percent in Class, dtype: float64

In [541]:
#Impute missing values in 'School 1 Top Percent in Class' based on Academic Index group average
for i,row in TUtest.iterrows():

    if (row['Academic Index']== 1.0) & (pd.isna(row['School 1 Top Percent in Class'])):
        TUtest.loc[i,'School 1 Top Percent in Class']= average['School 1 Top Percent in Class'][1.0]
    elif (row['Academic Index']== 2.0) & (pd.isna(row['School 1 Top Percent in Class'])):
            TUtest.loc[i,'School 1 Top Percent in Class']= average['School 1 Top Percent in Class'][2.0]
    elif (row['Academic Index']== 3.0) & (pd.isna(row['School 1 Top Percent in Class'])):
            TUtest.loc[i,'School 1 Top Percent in Class']= average['School 1 Top Percent in Class'][3.0]
    elif (row['Academic Index']== 4.0) & (pd.isna(row['School 1 Top Percent in Class'])):
            TUtest.loc[i,'School 1 Top Percent in Class']= average['School 1 Top Percent in Class'][4.0]
    elif (row['Academic Index']== 5.0) & (pd.isna(row['School 1 Top Percent in Class'])):
            TUtest.loc[i,'School 1 Top Percent in Class']= average['School 1 Top Percent in Class'][5.0]
print(TUtest['Academic Index'].isna().sum())

0


In [542]:
#Column33 - School 1 GPA

#Remove this variable in the modeling stage
#because School.1.GPA.Recalculated is more accurate.

TUtest.drop('School 1 GPA', axis='columns', inplace=True)

In [543]:
#Column34 - School 1 GPA Scale
#Remove this variable in the modeling stage as it is irrelevant.

TUtest.drop('School 1 GPA Scale', axis='columns', inplace=True)

In [544]:
#Column35 - School 1 GPA Recalculated

#Check NAs

print(TUtest['School 1 GPA Recalculated'].isna().sum())
#0 NAs.

TUtest['School 1 GPA Recalculated'].skew()
#Check skewness

# if the skewness score is below -1 or above 1, the variable is high skewed
#if the skewness score is positive, it means it is right tail skew
# if the skewness score is negative, it means it is left tail skew
#Since it is moderately skewed, and it is understandable for the left skewness as 
#a lot of students got into Trinity with a high GPA (almost 4.0), it is unnecessary to do transformation.
# Some modeling methods requires variables following a normal distribution, therefore you need to transform the skewed data
# before inputting it into the model, such as liner regression analysis.

0


-0.9345840107494408

In [545]:
#Column36 - School 2 Class Rank (Numeric)
#Check NAs
print(TUtest['School 2 Class Rank (Numeric)'].isna().sum()) # 10000 null values 

#Should we keep or remove this variable. Justify your decision in comments.
TUtest.drop('School 2 Class Rank (Numeric)', axis='columns', inplace=True) 
#Dropping column as all cases are blank so we do not need the variable

5143


In [546]:
#Column37 - School 2 Class Size (Numeric)

#Check NAs
print(TUtest['School 2 Class Size (Numeric)'].isna().sum())# 10000 null values 

#Should we keep or remove this variable. Justify your decision in comments.
TUtest.drop('School 2 Class Size (Numeric)', axis='columns', inplace=True)
# Same as the last one, all cases are blank. We do not need this variable.

5143


In [547]:
#Column38 - School 2 GPA

#Check NAs
print(TUtest['School 2 GPA'].isna().sum())# 10000 null values 

#Should we keep or remove this variable. Justify your decision in comments.
TUtest.drop('School 2 GPA', axis='columns', inplace=True)
# Same as before, dropping column as we don't need it 


5143


In [548]:
#Column39 - School 2 GPA Scale
#Check NAs
print(TUtest['School 2 GPA Scale'].isna().sum())# 10000 null values

#Should we keep or remove this variable. Justify your decision in comments.
TUtest.drop('School 2 GPA Scale', axis='columns', inplace=True)
# Same as before, dropping column as we don't need it 

5143


In [549]:
#Column40 - School 2 GPA Recalculated
#Check NAs
print(TUtest['School 2 GPA Recalculated'].isna().sum())# 10000 null values

#Should we keep or remove this variable. Justify your decision in comments.
TUtest.drop('School 2 GPA Recalculated', axis='columns', inplace=True)
# Same as before, dropping column as we don't need it

5143


In [550]:
#Column41 - School 3 Class Rank (Numeric)
#Check NAs
print(TUtest['School 3 Class Rank (Numeric)'].isna().sum())# 10000 null values

#Should we keep or remove this variable. Justify your decision in comments.
TUtest.drop('School 3 Class Rank (Numeric)', axis='columns', inplace=True)
# Same as before, dropping column as we don't need it

5143


In [551]:
#Column42 - School 3 Class Size (Numeric)
#Check NAs
print(TUtest['School 3 Class Size (Numeric)'].isna().sum())# 10000 null values

#Should we keep or remove this variable. Justify your decision in comments.
TUtest.drop('School 3 Class Size (Numeric)', axis='columns', inplace=True)
# Same as before, dropping column as we don't need it


5143


In [552]:
#Column43 - School 3 GPA
#Check NAs
print(TUtest['School 3 GPA'].isna().sum())# 10000 null values

#Should we keep or remove this variable. Justify your decision in comments.
TUtest.drop('School 3 GPA', axis='columns', inplace=True)
# Same as before, dropping column as we don't need it

5143


In [553]:
#Column44 - School 3 GPA Scale
#Check NAs
print(TUtest['School 3 GPA Scale'].isna().sum())# 10000 null values

#Should we keep or remove this variable. Justify your decision in comments.
TUtest.drop('School 3 GPA Scale', axis='columns', inplace=True)
# Same as before, dropping column as we don't need it

5143


In [554]:
#Column45 - School 3 GPA Recalculated
#Check NAs
print(TUtest['School 3 GPA Recalculated'].isna().sum())# 10000 null values

#Should we keep or remove this variable. Justify your decision in comments.
TUtest.drop('School 3 GPA Recalculated', axis='columns', inplace=True)
# Same as before, dropping column as we don't need it

5143


In [555]:
# Column46 ACT Composite

# print # NAs.
print(TUtest["ACT Composite"].isna().sum())  # 4945 null values 

# print unique value counts.
print(TUtest["ACT Composite"].value_counts()) 

# Replace missing ACT scores with SAT Concordance scores. 
# Convert 'SAT R Evidence-Based Reading and Writing Section + Math Section scores' into ACT based on the ACT-SAT concordance table pdf;

sat_to_act = {
    range(1570, 1601): 36,  # dictionary which corresponds to ACT concordance table
    range(1530, 1561): 35,
    range(1490, 1521): 34,
    range(1450, 1481): 33,
    range(1420, 1441): 32,
    range(1390, 1411): 31,
    range(1360, 1381): 30,
    range(1330, 1351): 29,
    range(1300, 1321): 28,
    range(1260, 1291): 27, 
    range(1230, 1251): 26,
    range(1200, 1221): 25,
    range(1160, 1191): 24,
    range(1130, 1151): 23,
    range(1100, 1121): 22,
    range(1060, 1091): 21,
    range(1030, 1051): 20,
    range(990, 1021): 19,
    range(960, 981): 18,
    range(920, 951): 17,
    range(880, 911): 16,
    range(830, 871): 15,
    range(780, 821): 14,
    range(730, 771): 13,
    range(690, 721): 12,
    range(650, 681): 11,
    range(620, 641): 10,
    range(590, 611): 9
}

def calcACT(sat): 
    if pd.isna(sat):  # Ensure no error for missing SAT values
        return None
    for score_range in sat_to_act.keys(): #Loops over SAT ranges dictionary, checks if score is in range. 
        if sat in score_range: # If score in range, 
            return sat_to_act[score_range] #return corresponding ACT score.
    return None  # If SAT score doesn't fit in any range, return None

# Apply the function to fill missing ACT scores using SAT values
TUtest["ACT Composite"] = TUtest.apply(
    lambda row: calcACT(row['SAT R Evidence-Based Reading and Writing Section + Math Section']) 
    if pd.isnull(row['ACT Composite']) else row['ACT Composite'], axis=1
)

# Check again for any missing ACT scores
print(TUtest["ACT Composite"].isna().sum())  # Check how many NAs are left

# Replace any remaining missing ACT scores with the mean of the ACT Composite
TUtest["ACT Composite"] = TUtest["ACT Composite"].apply(
    lambda x: TUtest["ACT Composite"].mean() if pd.isnull(x) else x
).round()

# Final check
print(TUtest["ACT Composite"].isna().sum()) 

2557
ACT Composite
32.0    316
31.0    312
30.0    304
33.0    298
34.0    281
29.0    218
28.0    212
35.0    180
27.0    170
26.0    119
25.0     77
24.0     34
36.0     30
23.0     20
22.0     11
21.0      3
20.0      1
Name: count, dtype: int64
555
0


In [556]:
# Group ACT scores with fewer than 10 occurrences into one category, 
# I can see that all the values below 21 ACT score have less than 10 observations so that is what I will
# call the category

# Check the current value counts for ACT Composite
act_counts = TUtest["ACT Composite"].value_counts()

# Create a new column to categorize ACT scores
def group_rare_scores(act_score):
    if act_counts.get(act_score, 0) < 10:  # If the score appears fewer than 10 times
        return 'ACTBelow21'  # Group them into Below 21
    else:
        return act_score  # Otherwise, keep the original score

# Apply the function to create the new grouped categories
TUtest["ACT Composite Grouped"] = TUtest["ACT Composite"].apply(group_rare_scores)

# Check the distribution of the new categories
print(TUtest["ACT Composite Grouped"].value_counts())

ACT Composite Grouped
30.0          1055
31.0           528
33.0           510
34.0           485
32.0           482
29.0           432
28.0           428
27.0           350
35.0           293
26.0           216
25.0           158
24.0            86
36.0            63
23.0            33
22.0            17
ACTBelow21       7
Name: count, dtype: int64


In [557]:
print(TUtest["ACT Composite"].isna().sum()) #Nulls are gone

0


In [558]:
#Column47 ACT English

#Check NAs 
print(TUtest["ACT English"].isna().sum())

# Since ACT Composite is already a good indicator for ACT scores generally,scores on each section will not matter much to make analyses.
#Remove this variable.
TUtest.drop("ACT English", axis = 'columns', inplace = True) 

2678


In [559]:
#Column48 ACT Reading
print(TUtest["ACT Reading"].isna().sum()) # returned 5205 null values with .isna().sum()

# Since ACT Composite is already a good indicator for ACT scores generally,scores on each section will not matter much to make analyses.
#Remove this variable.
TUtest.drop("ACT Reading", axis = 'columns', inplace = True) 

2678


In [560]:
#Column50 ACT Math
print(TUtest["ACT Math"].isna().sum()) #returned 5205 null values with .isna().sum()

# Since ACT Composite is already a good indicator for ACT scores generally,scores on each section will not matter much to make analyses.
#Remove this variable.
TUtest.drop("ACT Math", axis = 'columns', inplace = True) 

2678


In [561]:
#Column51 ACT Science Reasoning
print(TUtest["ACT Science Reasoning"].isna().sum()) #returned 5205 null values with .isna().sum()

# Since ACT Composite is already a good indicator for ACT scores generally,scores on each section will not matter much to make analyses.
#Remove this variable.
TUtest.drop("ACT Science Reasoning", axis = 'columns', inplace = True) 

2678


In [562]:
#Column52 ACT Writing
print(TUtest["ACT Writing"].isna().sum()) #returned 9830 null values with .isna().sum()

# Since ACT Composite is already a good indicator for ACT scores generally,scores on each section will not matter much to make analyses.
#Remove this variable.
TUtest.drop("ACT Writing", axis = 'columns', inplace = True)

5056


In [563]:
#Column53 ACT SAT I CR + M
print(TUtest["SAT I CR + M"].isna().sum()) #returned 9610 null values with .isna().sum()

# Since ACT Composite is already a good indicator for ACT scores generally,scores on each section will not matter much to make analyses.
#Remove this variable.
TUtest.drop("SAT I CR + M", axis = 'columns', inplace = True) 

4959


In [564]:
#Column54 SAT R Evidence-Based Reading and Writing Section + Math Section
# This column is used to generate ATC concordance scores.
# No further processing needed.

In [565]:
print(TUtest["Permanent Geomarket"].value_counts()) # shows the most frequent geomarket area is TX 
print(TUtest["Permanent Geomarket"].isna().sum()) # No nulls

Permanent Geomarket
TX-06     645
TX-16     608
TX-13     282
TX-15     282
TX-14     188
         ... 
NY-04       1
INT-BE      1
INT-UP      1
PA-09       1
OH-10       1
Name: count, Length: 307, dtype: int64
1


In [566]:
# Column55 Permanent Geomarket
# First, replace the missing values with the most frequent geo market value.
print(TUtest["Permanent Geomarket"].value_counts()) # shows the most frequent geomarket area is TX 
print(TUtest["Permanent Geomarket"].isna().sum()) # No nulls

# Second, group geomarket values into different regions. Refer to the region .csv for grouping.
unique_values = TUtest["Permanent Geomarket"].unique()
print(unique_values)

# Define a single regions dictionary
dictRegions = {
    'West': ['AK', 'HI', 'WA', 'OR', 'CA', 'ID', 'NV', 'MT', 'WY', 'UT', 'CO', 'AZ', 'NM'],
    'Midwest': ['ND', 'SD', 'NE', 'KS', 'MN', 'IA', 'MO', 'WI', 'IL', 'MI', 'IN', 'OH'],
    'South': ['TX', 'OK', 'AR', 'LA', 'KY', 'TN', 'MS', 'AL', 'WV', 'MD', 'DE', 'DC', 'VA', 'NC', 'SC', 'GA', 'FL', 'US'],
    'Northeast': ['PA', 'NY', 'NJ', 'ME', 'VT', 'NH', 'MA', 'CT', 'RI'],
    'International': ['INT', 'PR']
}

# Function to categorize regions
def categorize_region(geomarket):
    # Clean the geomarket value (e.g., remove any suffix after '-')
    clean_geomarket = geomarket.split('-')[0]
    
    # Check in the regions dictionary
    for region, states in dictRegions.items():
        if clean_geomarket in states:
            return region
            
    return "Unknown"  # Default value if not found

Permanent Geomarket
TX-06     645
TX-16     608
TX-13     282
TX-15     282
TX-14     188
         ... 
NY-04       1
INT-BE      1
INT-UP      1
PA-09       1
OH-10       1
Name: count, Length: 307, dtype: int64
1
['TX-18' 'TX-06' 'TX-07' 'TX-11' 'TX-10' 'CA-23' 'TX-13' 'TX-19' 'TX-23'
 'TX-20' 'INT-CH' 'TX-14' 'NM-01' 'MD-06' 'CA-11' 'TX-16' 'IL-12' 'TX-17'
 'TX-15' 'GA-02' 'TX-24' 'AZ-01' 'TX-12' 'WA-01' 'OR-01' 'CA-07' 'OK-02'
 'INT-VM' 'CO-02' 'NC-06' 'CA-20' 'TX-21' 'CA-16' 'MN-01' 'AR-01' 'MA-10'
 'TX-22' 'INT-IN' 'NC-07' 'INT-IC' 'VA-02' 'AZ-02' 'FL-02' 'LA-03' 'TX-05'
 'NJ-10' 'TX-08' 'CA-13' 'TX-02' 'GA-05' 'TN-03' 'INT-NU' 'NH-01' 'CA-28'
 'CA-15' 'CA-03' 'INT-MX' 'NY-28' 'CO-03' 'NC-03' 'TX-01' 'MO-03' 'KS-01'
 'INT-HO' 'LA-02' 'TX-04' 'INT-ES' 'FL-01' 'CA-26' 'MD-07' 'CT-02' 'CA-06'
 'ID-01' 'PR-01' 'WI-01' 'CA-29' 'TN-04' 'INT-TW' 'VA-05' 'FL-04' 'AL-01'
 'US-AE' 'TX-09' 'WI-03' 'NJ-05' 'ID-02' 'KS-02' 'MA-06' 'NY-16' 'WA-02'
 'MO-02' 'NJ-08' 'MS-01' 'OK-01' 'IL-11' 'SD-0

In [567]:
print(TUtest["Permanent Geomarket"].value_counts())

Permanent Geomarket
TX-06     645
TX-16     608
TX-13     282
TX-15     282
TX-14     188
         ... 
NY-04       1
INT-BE      1
INT-UP      1
PA-09       1
OH-10       1
Name: count, Length: 307, dtype: int64


In [568]:
# Apply the function to create a new column for region categorization
TUtest['Permanent Geomarket'] = TUtest['Permanent Geomarket'].astype(str)

TUtest['Permanent Geomarket'] = TUtest['Permanent Geomarket'].apply(categorize_region)

print(TUtest)

          ID train-test Entry Term (Application) Permanent Country Sex  \
10000  10001       test                Fall 2019     United States   M   
10001  10002       test                Fall 2020     United States   F   
10002  10003       test                Fall 2021     United States   F   
10003  10004       test                Fall 2017     United States   M   
10004  10005       test                Fall 2019     United States   M   
...      ...        ...                      ...               ...  ..   
15138  15139       test                Fall 2020     United States   F   
15139  15140       test                Fall 2019     United States   F   
15140  15141       test                Fall 2020     United States   M   
15141  15142       test                Fall 2018     United States   F   
15142  15143       test                Fall 2019     United States   F   

                 Ethnicity                       Race  \
10000      Hispanic/Latino                      White 

In [569]:
#Column56 Citizenship Status

# This column is used for inter-field checking so keep it

In [570]:
#Column57 Academic Index

# This column is used for inter-field checking so keep it

In [571]:
#Column58 Intend to Apply for Financial Aid?
print(TUtest["Intend to Apply for Financial Aid?"].isna().sum()) #return 18 null values
print(TUtest["Intend to Apply for Financial Aid?"].value_counts()) # value counts shows two unique values

#Handling missing values. Justify your choice.
TUtest["Intend to Apply for Financial Aid?"] = \
TUtest["Intend to Apply for Financial Aid?"].apply(lambda x: 1 if pd.isna(x) else x )
# I decided to add all missing values to the 1 group (meaning receiving financial aid) because it is by far 
# the most frequent observation doubling the 0 group (not receiving any financial aid).


3
Intend to Apply for Financial Aid?
1.0    3478
0.0    1662
Name: count, dtype: int64


In [572]:
#Column59 Merit Award
print(TUtest["Merit Award"].isna().sum()) # No null values
print(TUtest["Merit Award"].value_counts()) 
#Refer to the Merit Award Code.csv for grouping. 
#Recategorize all the levels into fewer levels, Justify your grouping policy in comments

International = [
    'I10', 'I12','I12.5','I15','I17',
    'I18','I19','I20','I21','I24','I25',
    'I26','I27','I28','I30','I32','I33',
    'I35','I38','I5','I9','I40','I50',
    'I22', 'I52', 'I7.5', 'I43', 'I23','I45'
]

TUtest['Merit Award'] = \
TUtest['Merit Award'].apply(lambda x: 'International Student Scholarship' if x in International else x)
TUtest['Merit Award'].value_counts()
# Grouped all international financial awards together as there are too many levels of scholarship. 
# These categories will be much easier to work with

0
Merit Award
P23      557
T22      518
T23      496
P17      461
T21      430
T25      390
M30      306
M27      300
M26      264
M25      206
D18      195
D20      189
M24      156
P18       97
D12.5     64
TTS       42
I25       41
TT10      40
Z0        35
I23       34
I30       34
I22       29
TT12      27
TT9       25
TT125     25
I17       21
I18       19
I21       17
I35       15
I26       13
I27       12
I10       12
SEM       11
I24       10
I19        9
I20        9
X0         7
I28        6
I12.5      5
I40        5
I15        4
I38        2
I33        1
I45        1
I32        1
I9         1
I0         1
Name: count, dtype: int64


Merit Award
P23                                  557
T22                                  518
T23                                  496
P17                                  461
T21                                  430
T25                                  390
M30                                  306
International Student Scholarship    301
M27                                  300
M26                                  264
M25                                  206
D18                                  195
D20                                  189
M24                                  156
P18                                   97
D12.5                                 64
TTS                                   42
TT10                                  40
Z0                                    35
TT12                                  27
TT9                                   25
TT125                                 25
SEM                                   11
X0                                     7
I0  

In [573]:
DomesticMeritBased = [
    'D12.5','D18', 'D20','M24','M30', 'M27','M26',
    'M25', 'P17','P23','P18','TT9','T21',
    'T23','T22','T25', 'TT10','TT12','TT125'
]

TUtest['Merit Award'] = \
TUtest['Merit Award'].apply(lambda x: 'Domestic Merit-Based Scholarship' if x in DomesticMeritBased else x)
TUtest['Merit Award'].value_counts()
# Grouped these into the Domestic Meritbased Scholarships as they are all domestic scholarships that have
# different kinds of standards and requirements

Merit Award
Domestic Merit-Based Scholarship     4746
International Student Scholarship     301
TTS                                    42
Z0                                     35
SEM                                    11
X0                                      7
I0                                      1
Name: count, dtype: int64

In [574]:
FullRide = [
    'SEM', 'TTS','X0', 'Y0'
]

TUtest['Merit Award'] = \
TUtest['Merit Award'].apply(lambda x: 'Full Ride' if x in FullRide else x)
TUtest['Merit Award'].value_counts()
# Tuition exchange is basically a full scholarship so I grouped all of these into a full ride
# to show all students who have a full scholarship

Merit Award
Domestic Merit-Based Scholarship     4746
International Student Scholarship     301
Full Ride                              60
Z0                                     35
I0                                      1
Name: count, dtype: int64

In [575]:
NoMeritList = [
    'Z0', 'I0'
]

TUtest['Merit Award'] = \
TUtest['Merit Award'].apply(lambda x: 'No Merit Scholarship' if x in NoMeritList else x)
TUtest['Merit Award'].value_counts()

# Because it is international no merit it makes more sense to say no merit than put them with international.
# Both these do not get any merit scholarship

Merit Award
Domestic Merit-Based Scholarship     4746
International Student Scholarship     301
Full Ride                              60
No Merit Scholarship                   36
Name: count, dtype: int64

In [576]:
#Column60 SAT Concordance Score (of SAT R)

#Remove this variable.Justify why you remove it.
TUtest.drop("SAT Concordance Score (of SAT R)", axis = 'columns', inplace = True)
# Because we have our ACT composite column and were able to transfer the SAT scores into the relative
# ACT score, this column is redundent and therefore needs to be dropped 

In [577]:
#Column61 ACT Concordance Score (of SAT R)
#Remove this variable.Justify why you remove it.
TUtest.drop("ACT Concordance Score (of SAT R)", axis = 'columns', inplace = True)
# Because we have our ACT composite column and were able to transfer the SAT scores into the relative
# ACT score, this column is redundent and therefore needs to be dropped 

In [578]:
#Column62 ACT Concordance Score (of SAT)
#Remove this variable.Justify why you remove it.
TUtest.drop("ACT Concordance Score (of SAT)", axis = 'columns', inplace = True)
# Because we have our ACT composite column and were able to transfer the SAT scores into the relative
# ACT score, this column is redundent and therefore needs to be dropped 

In [579]:
#Column63 Test Optional
#Remove this variable.Justify why you remove it.
TUtest.drop("Test Optional", axis = 'columns', inplace = True)
# Nulls are now the mean for act and sat so it is not necessary to show test optional anymore

In [580]:
#Column64 SAT I Critical Reading
#Remove this variable.Justify why you remove it.
TUtest.drop("SAT I Critical Reading", axis = 'columns', inplace = True)
# Because we have our ACT composite column and were able to transfer the SAT scores into the relative
# ACT score, this column is redundent and therefore needs to be dropped 

In [581]:
#Column65 SAT I Math
#Remove this variable.Justify why you remove it.
TUtest.drop("SAT I Math", axis = 'columns', inplace = True)
# Because we have our ACT composite column and were able to transfer the SAT scores into the relative
# ACT score, this column is redundent and therefore needs to be dropped 

In [582]:
#Column66 SAT I Writing
#Remove this variable.Justify why you remove it.
TUtest.drop("SAT I Writing", axis = 'columns', inplace = True)
# Because we have our ACT composite column and were able to transfer the SAT scores into the relative
# ACT score, this column is redundent and therefore needs to be dropped 

In [583]:
#Column67 SAT R Evidence-Based Reading and Writing Section
#Remove this variable.Justify why you remove it.
TUtest.drop("SAT R Evidence-Based Reading and Writing Section", axis = 'columns', inplace = True)
# Because we have our ACT composite column and were able to transfer the SAT scores into the relative
# ACT score, this column is redundent and therefore needs to be dropped 

In [584]:
#Column68 SAT R Math Section
#Remove this variable.Justify why you remove it.
TUtest.drop("SAT R Math Section", axis = 'columns', inplace = True)
# Because we have our ACT composite column and were able to transfer the SAT scores into the relative
# ACT score, this column is redundent and therefore needs to be dropped 

In [585]:
#Column69 Decision

# This would be your dependent variable in the classification model so keep it.

In [586]:
# After you clean all variables, output the cleaned dataframe to a csv file
# the csv file can be found in your current working directory


TUtest.to_csv('cleaneddftest.csv')

In [587]:
# Compare the your table structure with the screenshot in the submission box
# Make sure all primary predictors and target variables do not have any missing values and only have
# regualr and correct values.

TUtest.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5143 entries, 10000 to 15142
Data columns (total 36 columns):
 #   Column                                                           Non-Null Count  Dtype  
---  ------                                                           --------------  -----  
 0   ID                                                               5143 non-null   int64  
 1   train-test                                                       5143 non-null   object 
 2   Entry Term (Application)                                         5143 non-null   object 
 3   Permanent Country                                                5142 non-null   object 
 4   Sex                                                              5143 non-null   object 
 5   Ethnicity                                                        5143 non-null   object 
 6   Race                                                             5143 non-null   object 
 7   Religion                                  