# PISA 2022 Amazon SageMaker XGBoost
Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes

More info on SageMaker Immersion Day: [Workshop Link](https://catalog.us-east-1.prod.workshops.aws/workshops/63069e26-921c-4ce1-9cc7-dd882ff62575/en-US/lab2-model-training/pro-code)


### ***Change country name below!***

In [1]:
country_name = 'Korea'

In [2]:
country_name_edited = country_name.replace("_", "-")
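
The replacement above presumably exists because the country name is later embedded in resource names (the S3 prefix in the next cell, and possibly SageMaker job names), where hyphens are safer than underscores. A slightly more defensive sketch follows. `to_resource_slug` is a hypothetical helper, not part of this notebook or the SageMaker SDK, and it assumes that spaces and other punctuation should also collapse to hyphens.

```python
import re

def to_resource_slug(name: str) -> str:
    """Collapse every run of non-alphanumeric characters into a single hyphen."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", name)
    # Drop hyphens left over at either end (e.g. from a leading underscore).
    return slug.strip("-")

print(to_resource_slug("Korea"))                 # Korea
print(to_resource_slug("United_Arab Emirates"))  # United-Arab-Emirates
```

For a simple name such as `'Korea'` this returns the same value as the one-line `replace` call, so it could be swapped in without changing the prefix.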

In [3]:
# cell 02
import sagemaker
bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/xgboost-' + country_name_edited

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
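
To make the role of `bucket` and `prefix` concrete, here is a small sketch of how S3 locations might be derived from them. The `s3_uri` helper and the `train` and `validation` subfolder names are assumptions for illustration, and the bucket name is a placeholder rather than the real default bucket.

```python
def s3_uri(bucket: str, prefix: str, *parts: str) -> str:
    """Join a bucket, a key prefix, and optional sub-keys into an s3:// URI."""
    key = "/".join(p.strip("/") for p in (prefix, *parts) if p)
    return f"s3://{bucket}/{key}"

example_bucket = "my-example-bucket"        # placeholder, not the real bucket
example_prefix = "sagemaker/xgboost-Korea"  # same shape as `prefix` above

print(s3_uri(example_bucket, example_prefix, "train"))
# s3://my-example-bucket/sagemaker/xgboost-Korea/train
print(s3_uri(example_bucket, example_prefix, "validation"))
# s3://my-example-bucket/sagemaker/xgboost-Korea/validation
```

Keeping the country in the prefix means artifacts from runs for different countries end up under separate keys.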


Now let's bring in the Python libraries that we'll use throughout the analysis.

In [4]:
# cell 03
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
import zipfile                                    # For working with zip archives

#### Download PISA 2022 Prepared Dataset

This dataset is the output of our data-cleaning notebook, available [here](https://7z4vtvpqcoxouiu.studio.us-west-2.sagemaker.aws/jupyterlab/default/lab/tree/RTC%3Amids-capstone/notebooks/eda/Data_merging.ipynb).


In [5]:
%%time 

# cell 06

# Define local file path
local_file_path = "PISA_cleaned_dataset.csv"  # Change as needed

# Define S3 details
bucket_name = "sagemaker-us-west-2-986030204467"
file_key = "capstone/testfiles/PISA_cleaned_dataset.csv"

# Check if the file exists locally
if os.path.exists(local_file_path):
    print("📂 Loading data from local file...")
    data = pd.read_csv(local_file_path, usecols=None)
    
else:
    print("☁️ Downloading data from S3...")
    
    # Create S3 client
    s3_client = boto3.client("s3")

    # Download the file from S3
    response = s3_client.get_object(Bucket=bucket_name, Key=file_key)

    # Read the file into pandas DataFrame
    data = pd.read_csv(response["Body"], usecols=None)

    # Save a local copy for future use
    data.to_csv(local_file_path, index=False)
    print(f"✅ File saved locally as {local_file_path}")

# Display first few rows
#data.head()

pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
data

📂 Loading data from local file...
CPU times: user 23.8 s, sys: 3.26 s, total: 27 s
Wall time: 27 s


Unnamed: 0,CNT,CNTSCHID,CNTSTUID,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,EFFORT1,EFFORT2,ST259Q01JA,WB164Q01HA,HOMEPOS,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,Option_UH,OECD,PAREDINT,BMMJ1,BFMJ2,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,RELATST,BELONG,BULLIED,FEELSAFE,SCHRISK,PERSEVAGR,CURIOAGR,COOPAGR,EMPATAGR,ASSERAGR,STRESAGR,EMOCOAGR,GROSAGR,INFOSEEK,FAMSUP,DISCLIM,TEACHSUP,COGACRCO,COGACMCO,EXPOFA,EXPO21ST,MATHEFF,MATHEF21,FAMCON,ANXMAT,MATHPERS,CREATEFF,CREATSCH,CREATFAM,CREATAS,CREATOOS,CREATOP,OPENART,IMAGINE,SCHSUST,LEARRES,PROBSELF,FAMSUPSL,FEELLAH,SDLEFF,ICTRES,ESCS,FLSCHOOL,FLMULTSB,FLFAMILY,ACCESSFP,FLCONFIN,FLCONICT,ACCESSFA,ATTCONFM,FRINFLFM,ICTSCH,ICTHOME,ICTQUAL,ICTSUBJ,ICTENQ,ICTFEED,ICTOUT,ICTWKDY,ICTWKEND,ICTREG,ICTINFO,ICTEFFIC,BODYIMA,SOCONPA,LIFESAT,PSYCHSYM,SOCCON,EXPWB,CURSUPP,PQMIMP,PQMCAR,PARINVOL,PQSCHOOL,PASCHPOL,ATTIMMP,CREATHME,CREATACT,CREATOPN,CREATOR,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC209Q04JA,SC209Q05JA,SC209Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,...,DIGDVPOL,TEAFDBK,MTTRAIN,DMCVIEWS,NEGSCLIM,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,STDTEST,TDTEST,ALLACTIV,BCREATSC,CREENVSC,ACTCRESC,OPENCUL,PROBSCRI,SCPREPBP,SCPREPAP,DIGPREP,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
0,Albania,800282,800001,0,,,,0,0,0,0,0,,,5.0,5.0,3.0,,1.0,1.0,,10.0,10.0,10.0,,1.5995,1.0,0.0,0.0,9.0,0.0,,,1.0,,4.0,10.0,0.0,0,0,14.5,73.91,16.50,,,4.0,1.0,2.0,3.0,7.0,6.0,,10.0,5.0,,,,4.0,3.0,10.0,2.0,1.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.9905,-0.2327,-1.2280,1.1246,-0.6386,,3.3518,,,,,,-0.5185,,1.8355,0.6387,1.5558,0.8246,2.4962,-0.2284,2.4031,-1.4413,,,0.5440,-0.0085,2.4021,0.0590,0.8155,4.1226,,,0.7507,2.0225,,,,,,,4.9507,1.1112,,,,,,,,,,,,-1.1989,-2.0261,-1.7886,,,,,0.8373,0.6984,,,,,,,,,,,,,,,,,,,0.0,10.0,3.0,100.0,3.0,23.0,,24.0,,1.0,1.0,1.0,2.0,1.0,1.0,1.0,45.0,0,0,1,0,0,1,0,0,1,4.0,4.0,4.0,4.0,3.0,3.0,3.0,4.0,2.0,4.0,2.0,2.0,2.0,1.0,74.0,26.0,1.0,1.0,1.0,1.0,100.0,28.0,5.0,0,0,0,1,3.0,30.0,30.0,61.0,62.0,11.0,50.0,10.0,90.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.5220,0.9868,1.0982,2.1585,-0.4315,-0.0097,-0.2805,-0.9198,0.5521,2.0709,2.0131,1.1162,-0.3682,1.3541,0.3430,0.4217,1.1110,-0.8314,0.8462,0.5908,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Albania,800115,800002,0,,,,0,0,0,0,0,,,,,,,,,,9.0,8.0,7.0,,-3.8115,2.0,-1.0,0.0,,7.0,6.0,10.0,1.0,0.0,1.0,7.0,0.0,0,0,9.0,24.16,,,,3.0,1.0,4.0,2.0,,,,,,5.0,5.0,5.0,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,0.3226,0.5031,1.3336,1.1246,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-3.4930,-3.0507,,,,,,,,,,0.4062,0.3346,-0.1403,-2.0261,0.6198,-0.3848,0.2149,,,0.3729,1.3060,-0.4933,,,,,,,,,,,,,,,,,,,,4.0,,1.0,25.0,,15.0,,1.0,1.0,1.0,1.0,2.0,1.0,2.0,45.0,0,0,0,0,0,0,0,0,0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,2.0,2.0,1.0,2.0,1.0,1.0,90.0,10.0,2.0,1.0,1.0,,100.0,28.0,0.0,0,0,0,0,3.0,75.0,85.0,50.0,75.0,80.0,75.0,,80.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.4729,-0.4120,0.6955,0.3610,0.3386,-1.4551,2.9595,-0.1936,-2.0409,0.0400,-0.6686,-0.5714,0.1019,1.0791,-0.5544,-0.5450,0.1705,-0.8314,-1.1166,0.0988,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Albania,800242,800003,0,,,,0,0,0,0,0,,,,,,,,,4.0,10.0,10.0,8.0,,0.2314,2.0,-1.0,0.0,,0.0,,4.0,1.0,0.0,1.0,10.0,0.0,0,0,12.0,,,,,4.0,0.0,,,,2.0,,0.0,,,,5.0,,2.0,10.0,,,,,0.0,0.0,0.0,0.0,1.0,0.0,,,,,,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.8637,-0.6386,,,,,,,,,,,-0.8615,,,,,,,,,,,,,,,,,,,,,,,,,0.4307,-0.1867,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,1.0,,,,1.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,45.0,0,0,1,0,0,1,0,0,0,1.0,,4.0,4.0,2.0,2.0,4.0,3.0,4.0,2.0,1.0,2.0,2.0,1.0,100.0,0.0,1.0,1.0,1.0,1.0,100.0,18.0,3.0,0,0,0,1,3.0,100.0,100.0,100.0,100.0,10.0,10.0,100.0,60.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.1884,1.2416,1.0982,2.1585,-0.9382,0.1683,0.1753,-2.0719,-0.4985,0.5750,1.5226,0.5086,0.3731,0.9015,0.5400,1.2274,0.6353,1.1784,-0.6374,-0.8981,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Albania,800245,800005,0,1.0,6.0,1.0,0,1,0,0,0,2.0,4.0,3.0,3.0,,3.0,3.0,3.0,0.0,,,5.0,,-2.5956,1.0,-2.0,1.0,4.0,5.0,5.0,12.0,1.0,1.0,1.0,10.0,0.0,0,0,6.0,,14.82,,,3.0,0.0,3.0,4.0,30.0,4.0,,10.0,,,,,4.0,1.0,5.0,4.0,4.0,4.0,4.0,1.0,0.0,1.0,0.0,0.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,10.0,,1.8580,0.5159,0.9885,-0.7560,-0.6386,,-0.7687,,,,,,0.1371,2.2134,-0.7468,0.4426,1.5558,-0.7146,-0.1216,-0.2207,0.3556,-1.3156,2.2322,0.4222,0.5653,-0.2546,-0.4909,-0.3010,-1.0261,1.0191,1.4468,-0.5423,-0.0564,-0.8763,1.5382,0.4308,0.4516,0.0427,-2.1941,-0.9408,-2.1392,-3.2198,,,,,,,,,,-1.7984,-1.5118,-0.3516,-0.1594,0.8946,0.8435,0.4035,,,2.8904,1.2637,,,,,,,,,,,,,,,,,,,0.0,10.0,1.0,,5.0,11.0,,30.0,,1.0,1.0,1.0,1.0,1.0,2.0,2.0,45.0,0,0,1,0,0,1,0,0,0,3.0,3.0,4.0,4.0,2.0,2.0,2.0,3.0,2.0,3.0,2.0,2.0,2.0,1.0,100.0,0.0,1.0,2.0,1.0,1.0,69.5,13.0,4.0,0,0,0,1,3.0,91.0,84.0,93.0,64.0,82.0,97.0,100.0,0.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.5587,0.6480,-0.0703,-0.1332,-1.6916,-1.4551,0.4399,-0.5010,-1.4190,0.1011,0.1724,0.4559,-0.3682,1.0478,0.5608,0.4217,,,,0.0419,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Albania,800285,800006,1,1.0,4.0,6.0,0,0,0,0,1,3.0,1.0,3.0,,1.0,1.0,1.0,1.0,1.0,10.0,9.0,8.0,,-0.5632,1.0,0.0,0.0,,3.0,2.0,13.0,1.0,0.0,4.0,10.0,0.0,0,0,12.0,17.00,30.11,,,2.0,0.0,3.0,4.0,30.0,3.0,,10.0,3.0,3.0,4.0,5.0,4.0,3.0,10.0,2.0,1.0,4.0,,1.0,0.0,0.0,0.0,1.0,0.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,2.0,,1.7382,0.7639,-1.2280,1.1246,-0.6386,,0.5342,,,,,,-0.3061,0.6761,-0.5122,0.4029,0.1475,-0.0073,0.7927,-0.6616,-1.0257,-0.5867,0.9425,1.1266,-0.2704,-0.1735,-0.7475,-0.1405,-0.9293,1.6583,1.8557,0.9322,0.9037,-0.4033,0.2241,1.7224,1.6004,1.5114,,1.0353,-0.5542,-1.0548,,,,,,,,,,-2.8292,-3.3582,1.0161,,0.8886,-0.0643,0.9861,,,2.0196,1.6029,-0.2354,,,,,,,,,,,,,,,,,,0.0,4.0,,37.0,1.0,9.0,,,,1.0,1.0,1.0,1.0,1.0,2.0,2.0,45.0,1,0,0,1,0,0,0,0,1,3.0,3.0,3.0,3.0,2.0,2.0,2.0,2.0,4.0,4.0,2.0,2.0,2.0,1.0,80.0,20.0,2.0,1.0,1.0,1.0,100.0,33.0,2.0,0,0,0,0,1.0,67.0,18.0,12.0,21.0,19.0,3.0,21.0,90.0,7.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.3483,1.0430,0.6888,2.1585,-0.6145,-0.7828,0.1000,-0.6199,-0.0485,0.7086,0.7899,0.9383,0.1019,1.6939,0.8448,1.0318,0.0074,-0.8314,-0.7625,3.0051,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...
591852,Uzbekistan,86000120,86007488,0,1.0,2.0,1.0,0,0,1,0,0,1.0,,,,,,,,4.0,10.0,10.0,9.0,,-0.9146,1.0,0.0,0.0,9.0,,,,1.0,0.0,1.0,10.0,0.0,0,0,3.0,17.00,28.95,,,4.0,0.0,,,36.0,6.0,,10.0,,,,,5.0,6.0,10.0,4.0,2.0,4.0,,1.0,1.0,1.0,1.0,1.0,1.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,4.0,,,-1.0817,-1.2280,0.6942,-0.6386,,0.3063,,,,,,0.5765,-1.0979,1.5941,1.7598,1.5558,2.3368,2.5872,0.1530,-2.2416,2.2815,2.3441,,0.8819,2.2393,2.1524,0.5032,-0.0326,,,-0.4280,,-0.1324,,,,,,,,-2.7487,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,10.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,120.0,1,0,0,1,0,0,0,0,0,1.0,1.0,4.0,2.0,1.0,1.0,4.0,1.0,1.0,4.0,1.0,2.0,1.0,1.0,100.0,0.0,1.0,2.0,1.0,1.0,1.4,28.0,5.0,0,0,0,1,1.0,0.0,0.0,1.0,0.0,0.0,70.0,30.0,73.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0977,0.4554,0.2023,-0.7457,-1.4918,,-1.4212,,-1.3372,0.6904,0.0175,1.7104,0.4397,0.7711,,1.2405,-0.5687,-0.8314,-1.1382,0.5571,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
591853,Uzbekistan,86000140,86007489,0,,,,0,0,0,0,0,,,1.0,,2.0,2.0,1.0,1.0,0.0,10.0,10.0,3.0,,-2.1015,2.0,0.0,0.0,,,,,,1.0,5.0,3.0,1.0,0,0,16.0,73.91,30.11,,,4.0,,,,,,,7.0,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,4.0,,,-0.2482,-1.2280,-0.7560,-0.6386,,0.0167,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-0.2024,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,115.0,0,0,1,0,0,1,0,0,1,3.0,3.0,3.0,3.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,2.0,1.0,1.0,60.0,40.0,1.0,1.0,1.0,1.0,100.0,53.0,5.0,0,0,0,1,2.0,81.0,85.0,88.0,96.0,68.0,85.0,63.0,100.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-1.8150,1.0904,-1.6751,-2.6032,-1.4918,,-1.4212,,-1.1342,2.0709,2.0131,3.4880,1.5231,-0.2686,,0.3221,-1.1097,-0.8314,0.8462,-0.1857,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
591854,Uzbekistan,86000024,86007490,0,1.0,1.0,1.0,0,0,0,0,0,,,1.0,1.0,,4.0,1.0,1.0,,,,6.0,,-1.5194,2.0,1.0,0.0,7.0,,,,1.0,0.0,4.0,9.0,0.0,0,0,9.0,17.00,25.71,,,4.0,0.0,,,31.0,6.0,,10.0,,,,,5.0,5.0,10.0,3.0,3.0,3.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,,,,,,4.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,10.0,,,-0.3261,-0.5168,0.4417,-0.6386,,-0.0140,,,,,,0.2429,0.2973,-1.0296,0.3521,0.8211,1.0932,0.9323,-0.3998,0.6856,0.3926,0.9997,,-0.2907,0.6311,0.0846,0.5352,-0.5679,0.4911,0.6097,0.4185,-0.3483,-0.1783,,,,,,,,-2.0506,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,120.0,1,0,0,0,0,1,0,0,1,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,100.0,0.0,1.0,1.0,1.0,1.0,100.0,28.0,5.0,0,0,0,1,2.0,90.0,50.0,100.0,100.0,70.0,85.0,0.0,93.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,0.5796,1.4724,1.0982,-3.1484,-1.4918,,0.2650,,-1.9660,2.0709,2.0131,1.7685,1.5231,2.1631,,2.8331,-1.6218,1.5159,0.8462,0.8376,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
591855,Uzbekistan,86000174,86007491,0,,,,0,0,0,0,0,,,1.0,1.0,1.0,1.0,,1.0,,7.0,6.0,9.0,,-0.3975,1.0,0.0,0.0,,,,,1.0,0.0,4.0,10.0,0.0,0,0,12.0,73.91,75.43,,,3.0,1.0,,,35.0,6.0,,10.0,,,,,3.0,2.0,7.0,2.0,2.0,3.0,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,9.0,,,0.5337,-1.2280,1.1246,-0.6386,,2.2987,,,,,,1.2952,,,1.7598,1.5558,1.5399,1.3822,0.3331,0.3322,-0.1652,2.4215,,,,,,,,,,,,,,,,,,,-0.1290,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,10.0,1.0,0.0,0.0,6.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,120.0,0,0,0,0,0,0,0,0,0,1.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,100.0,0.0,1.0,2.0,1.0,1.0,100.0,28.0,5.0,0,0,0,1,1.0,75.0,21.0,77.0,70.0,85.0,69.0,0.0,79.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.3081,0.7604,1.0982,1.2033,-1.4918,,1.2048,,-0.2361,0.6904,0.6028,1.2086,0.7589,0.8065,,0.7825,0.5093,-0.8314,0.1102,-0.4657,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
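
The load-from-disk-else-download logic in the cell above can be factored into a reusable helper. The sketch below is one way to do it under stated assumptions: `load_with_cache` and its `fetch` argument are hypothetical names, and `fetch` stands in for whatever retrieves the raw bytes (for example a thin wrapper around `s3_client.get_object(...)["Body"].read()`).

```python
import os
import tempfile
from typing import Callable

def load_with_cache(local_path: str, fetch: Callable[[], bytes]) -> bytes:
    """Return the file contents, calling fetch() only when no local copy exists."""
    if os.path.exists(local_path):
        with open(local_path, "rb") as f:
            return f.read()
    payload = fetch()
    with open(local_path, "wb") as f:  # keep a copy for the next run
        f.write(payload)
    return payload

# Tiny demonstration with a fake downloader that records each call.
calls = []

def fake_fetch() -> bytes:
    calls.append(1)
    return b"CNT,MATH_Proficient\nKorea,1\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    first = load_with_cache(path, fake_fetch)   # downloads and caches
    second = load_with_cache(path, fake_fetch)  # served from the local file

print(len(calls))  # 1
```

The returned bytes can then be handed to pandas with `pd.read_csv(io.BytesIO(payload))`. Passing the downloader in as an argument keeps the caching logic testable without touching S3.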


#### Download dictionary for the variable names

In [6]:
# Download the file from S3
s3_client = boto3.client("s3")
dictionary_file = s3_client.get_object(Bucket=bucket_name, Key="capstone/testfiles/Variable_dictionary.csv")

# Read the file into pandas DataFrame
dictionary = pd.read_csv(dictionary_file["Body"], usecols=None)
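
With several hundred coded column names, a quick code-to-label lookup is handy. The sketch below assumes the dictionary file has one column of variable codes and one of human-readable labels. The column names `variable` and `description` are guesses (check `dictionary.columns` for the real ones), and the two sample rows are inline stand-ins rather than the contents of `Variable_dictionary.csv`.

```python
import csv
import io

# Inline stand-in for the dictionary file; column names are hypothetical.
sample_csv = """variable,description
ESCS,"Index of economic, social and cultural status"
HOMEPOS,Home possessions
"""

def build_lookup(csv_text, code_col="variable", label_col="description"):
    """Map each variable code to its label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[code_col]: row[label_col] for row in reader}

lookup = build_lookup(sample_csv)
print(lookup.get("ESCS", "<not in dictionary>"))
print(lookup.get("XYZ", "<not in dictionary>"))
```

With the pandas DataFrame loaded above, the equivalent mapping would be `dict(zip(dictionary[code_col], dictionary[label_col]))` once the real column names are known.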

#### Subset the data to a specific COUNTRY

In [7]:
# Take an explicit copy so later column edits do not trigger SettingWithCopyWarning
model_data = data[data['CNT'] == country_name].copy()
print(model_data.shape)
model_data.head()

(6454, 570)


Unnamed: 0,CNT,CNTSCHID,CNTSTUID,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,EFFORT1,EFFORT2,ST259Q01JA,WB164Q01HA,HOMEPOS,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,Option_UH,OECD,PAREDINT,BMMJ1,BFMJ2,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,RELATST,BELONG,BULLIED,FEELSAFE,SCHRISK,PERSEVAGR,CURIOAGR,COOPAGR,EMPATAGR,ASSERAGR,STRESAGR,EMOCOAGR,GROSAGR,INFOSEEK,FAMSUP,DISCLIM,TEACHSUP,COGACRCO,COGACMCO,EXPOFA,EXPO21ST,MATHEFF,MATHEF21,FAMCON,ANXMAT,MATHPERS,CREATEFF,CREATSCH,CREATFAM,CREATAS,CREATOOS,CREATOP,OPENART,IMAGINE,SCHSUST,LEARRES,PROBSELF,FAMSUPSL,FEELLAH,SDLEFF,ICTRES,ESCS,FLSCHOOL,FLMULTSB,FLFAMILY,ACCESSFP,FLCONFIN,FLCONICT,ACCESSFA,ATTCONFM,FRINFLFM,ICTSCH,ICTHOME,ICTQUAL,ICTSUBJ,ICTENQ,ICTFEED,ICTOUT,ICTWKDY,ICTWKEND,ICTREG,ICTINFO,ICTEFFIC,BODYIMA,SOCONPA,LIFESAT,PSYCHSYM,SOCCON,EXPWB,CURSUPP,PQMIMP,PQMCAR,PARINVOL,PQSCHOOL,PASCHPOL,ATTIMMP,CREATHME,CREATACT,CREATOPN,CREATOR,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC209Q04JA,SC209Q05JA,SC209Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,...,DIGDVPOL,TEAFDBK,MTTRAIN,DMCVIEWS,NEGSCLIM,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,STDTEST,TDTEST,ALLACTIV,BCREATSC,CREENVSC,ACTCRESC,OPENCUL,PROBSCRI,SCPREPBP,SCPREPAP,DIGPREP,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
301772,Korea,41000167,41000001,1,1.0,4.0,1.0,0,0,1,0,0,1.0,2.0,5.0,4.0,4.0,1.0,4.0,,2.0,10.0,10.0,3.0,,-0.5049,2.0,0.0,0.0,7.0,1.0,1.0,8.0,1.0,0.0,4.0,9.0,1.0,0,1,16.0,70.89,35.34,,,3.0,0.0,3.0,2.0,35.0,4.0,,3.0,1.0,1.0,1.0,1.0,4.0,4.0,7.0,2.0,1.0,4.0,,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,4.0,,1.2586,2.4581,-1.228,1.1246,-0.6386,-0.4897,0.8647,0.0823,-0.1419,0.7322,-0.2509,0.4423,0.362,-0.083,1.7139,1.0989,0.8211,-0.0411,0.1654,0.2486,-0.0243,-1.2853,-0.5961,1.5682,-0.3794,0.0995,-0.9474,0.4672,-0.0504,0.5406,-0.8105,-0.3719,-0.0077,0.0608,-0.8055,-2.8172,-0.3344,0.083,,-0.9291,,0.4168,,,,,,,,,,-4.1572,-4.7013,-0.9232,-0.8257,-2.3763,-1.6286,-2.6018,0.4639,0.3311,0.2743,0.0083,-0.509,,,,,,,-0.6418,-0.9554,-1.2159,-1.9132,-0.2402,-0.5022,-1.0694,-0.4112,0.6792,-0.5566,-0.9733,0.0,2.0,4.0,0.0,2.0,3.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,50.0,0,0,1,0,0,1,0,0,1,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,4.0,3.0,2.0,2.0,1.0,92.0,8.0,1.0,1.0,1.0,1.0,71.5,18.0,4.0,0,0,0,1,3.0,93.0,90.0,90.0,90.0,89.0,90.0,91.0,35.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-1.1792,0.8013,1.0982,,-0.0139,,2.9595,0.8919,0.4397,0.6952,0.5379,1.2021,0.3056,1.2183,0.1406,0.3844,,0.7922,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301773,Korea,41000067,41000002,1,1.0,2.0,1.0,0,1,0,0,0,1.0,2.0,2.0,,2.0,1.0,1.0,1.0,4.0,7.0,9.0,6.0,,1.0966,1.0,0.0,0.0,7.0,7.0,6.0,,1.0,0.0,4.0,6.0,0.0,0,1,16.0,47.83,81.13,,,2.0,0.0,2.0,3.0,40.0,6.0,,10.0,1.0,1.0,1.0,1.0,7.0,6.0,7.0,3.0,2.0,4.0,,1.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,5.0,6.0,,,,,,,,,,,,,,,,,,,,,,1.0,,-0.7982,0.7289,-1.228,1.1246,-0.6386,-0.1105,-0.5983,0.5526,-0.4662,0.2503,-0.1896,0.0363,0.6831,0.4578,0.1657,1.6396,-0.5635,-0.6228,-0.8374,-1.3596,-0.5187,0.1923,-0.5321,2.1529,0.1031,0.0491,0.0522,-0.9688,-1.0213,0.3856,-0.8105,-0.2202,-1.2067,-0.8257,0.8488,-0.4684,-1.4482,-0.4415,-0.0664,-0.0944,,1.2884,,,,,,,,,,0.4062,0.3346,2.8889,-1.1559,0.3109,-0.5112,2.9804,-0.2585,-0.5889,0.1169,0.6984,2.1919,,,,,,,-0.2065,0.4398,-1.2159,-0.2783,-0.2402,-0.2161,-0.4189,-0.9239,0.6119,-1.5479,-0.715,0.0,1.0,4.0,100.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,120.0,1,0,0,0,0,1,1,0,0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,4.0,4.0,4.0,1.0,2.0,2.0,2.0,95.0,5.0,1.0,2.0,1.0,1.0,91.5556,23.0,4.0,0,0,1,0,1.0,10.0,0.0,40.0,40.0,100.0,100.0,0.0,0.0,2.0,1.0,0.0,1.0,1.0,1.0,1.0,...,-0.6785,0.0154,0.2779,,-0.7166,,-1.4212,-1.6202,0.1304,0.4927,-3.0016,1.3626,0.5864,1.5434,0.1735,0.8816,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301774,Korea,41000118,41000003,0,1.0,1.0,1.0,0,0,0,0,0,,,,1.0,1.0,1.0,4.0,1.0,3.0,,,10.0,,-0.1689,1.0,0.0,0.0,7.0,7.0,6.0,8.0,1.0,0.0,1.0,10.0,0.0,0,1,16.0,17.0,63.03,,,2.0,0.0,1.0,3.0,20.0,5.0,,10.0,4.0,1.0,1.0,1.0,4.0,3.0,6.0,2.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,2.0,6.0,3.0,,,,,,,,,,,,,,,,,,,,,,0.0,,1.0128,-0.5636,-1.228,-0.0918,0.181,0.0456,-1.1632,2.1694,0.1874,-0.3307,0.0667,-1.6409,1.4602,-0.2135,1.7613,1.6491,0.8211,0.526,-0.033,0.5772,0.3681,-0.1415,0.7063,3.9552,-2.3558,0.2302,-2.5141,0.4672,-0.0065,1.4476,-0.8105,-2.6217,-0.4735,-1.2752,,,,,,,,0.4107,,,,,,,,,,0.4062,0.3346,0.3623,-0.1588,-0.3584,0.3074,0.118,0.4639,0.3311,-2.672,0.6984,2.1919,,,,,,,-0.4719,0.7793,-1.2159,-0.1047,-0.8083,0.2052,-1.1773,-0.4384,0.4673,-0.8318,-0.5948,0.0,9.0,5.0,0.0,1.0,15.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,50.0,0,0,1,0,0,1,0,0,1,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,1.0,2.0,2.0,2.0,70.0,30.0,2.0,1.0,1.0,2.0,100.0,23.0,2.0,0,0,0,0,2.0,20.0,20.0,33.0,39.0,41.0,32.0,11.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.0209,-0.538,-0.7213,,-0.2063,,-1.4212,-1.2862,-0.3382,-0.0292,0.969,0.9088,-0.9419,-0.8879,0.3171,-0.8379,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301775,Korea,41000035,41000004,1,0.0,4.0,1.0,0,1,0,0,0,1.0,3.0,5.0,5.0,1.0,,,2.0,3.0,,,6.0,,0.5594,1.0,0.0,0.0,4.0,4.0,6.0,9.0,1.0,0.0,4.0,7.0,0.0,0,1,12.0,58.77,71.55,,,2.0,0.0,3.0,2.0,34.0,2.0,,9.0,5.0,5.0,1.0,1.0,4.0,2.0,7.0,2.0,1.0,4.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,-0.401,-0.1988,-1.228,1.1246,-0.6386,-0.2694,-0.6783,-0.6274,-0.1899,-0.2549,-0.1246,0.6598,-0.0144,-0.0534,0.1645,0.4375,-0.1002,-1.8553,-1.1158,-1.5577,-0.7331,-0.9584,-0.2419,-0.1581,-0.5191,-0.4413,-0.4754,0.4757,-1.003,0.5254,-0.1674,-0.3836,0.6258,-0.4403,1.0989,0.4677,-0.5105,-0.4444,,-0.543,,0.203,,,,,,,,,,-2.2978,0.3346,0.3623,1.95,-0.7786,-1.6286,-2.6018,-0.1845,0.2498,2.5996,-0.0269,-0.3241,,,,,,,-0.3715,1.2955,0.4708,-1.8514,-0.2402,0.2052,0.0351,0.6984,1.0611,0.6556,0.1257,0.0,3.0,5.0,0.0,2.0,2.0,0.0,0.0,0.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,50.0,0,0,1,0,0,1,0,0,1,4.0,4.0,4.0,4.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,2.0,1.0,2.0,100.0,0.0,1.0,2.0,1.0,1.0,96.6667,28.0,4.0,0,1,0,0,1.0,14.0,0.0,16.0,17.0,26.0,5.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,-0.9137,-1.0079,1.0982,,3.395,,0.2681,0.2077,0.3918,-1.3541,-0.864,1.8425,-0.9419,0.237,0.1248,0.3844,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301776,Korea,41000141,41000005,1,1.0,2.0,1.0,0,0,1,0,0,2.0,3.0,5.0,5.0,,2.0,4.0,3.0,3.0,8.0,8.0,5.0,,0.6772,1.0,0.0,0.0,6.0,7.0,6.0,4.0,1.0,0.0,5.0,4.0,0.0,0,1,12.0,35.34,35.34,,,3.0,0.0,2.0,2.0,37.0,5.0,,6.0,3.0,3.0,2.0,,5.0,3.0,7.0,2.0,1.0,4.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,2.0,6.0,,,,,,,,,,,,,,,,,,,,,,1.0,,-0.7958,-0.3485,-0.1325,-0.756,-0.6386,-0.8069,0.0565,-0.9088,-0.4784,0.6325,-0.6158,0.3757,-0.3061,-0.5947,-0.275,-0.2938,-0.1002,0.6523,-0.81,0.2678,-0.2039,1.1309,-0.4689,1.7745,0.3378,-1.2798,-0.7976,-0.805,-0.8053,0.5741,0.9333,-0.5442,-0.0077,-0.7272,-0.6932,-0.1612,-0.1803,-0.4921,-0.1428,0.2618,,-0.4307,,,,,,,,,,0.4062,0.3346,-1.5125,-0.35,-0.0822,-0.2822,0.9468,-0.3542,-0.2535,0.1612,-1.0712,-0.0962,,,,,,,-0.4962,-0.4256,0.0486,-1.0038,-0.1169,-0.6869,-0.7676,0.2201,-0.1247,-0.1689,-0.5973,0.0,4.0,6.0,0.0,0.0,7.0,0.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,50.0,1,0,0,1,0,0,1,0,0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0,4.0,3.0,2.0,1.0,1.0,100.0,0.0,1.0,1.0,1.0,1.0,46.963,18.0,5.0,0,0,0,1,3.0,45.0,45.0,30.0,40.0,40.0,65.0,0.0,28.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.2275,0.3428,1.0982,,-1.4394,,-1.4212,-2.227,-0.6367,0.6491,0.3048,2.5994,-1.1911,0.237,1.1995,-1.254,,-0.8314,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Take out additional variables

In [8]:
# Define the list of columns to drop
columns_to_remove = ["CNTSCHID", "CNTSTUID", "OECD",
    "HOMEPOS", "RELATST", "BELONG", "BULLIED", "FEELSAFE", "SCHRISK", "PERSEVAGR", "CURIOAGR", 
    "COOPAGR", "EMPATAGR", "ASSERAGR", "STRESAGR", "EMOCOAGR", "GROSAGR", "INFOSEEK", "FAMSUP", 
    "DISCLIM", "TEACHSUP", "COGACRCO", "COGACMCO", "EXPOFA", "EXPO21ST", "MATHEFF", "MATHEF21", 
    "FAMCON", "ANXMAT", "MATHPERS", "CREATEFF", "CREATSCH", "CREATFAM", "CREATAS", "CREATOOS", 
    "CREATOP", "OPENART", "IMAGINE", "SCHSUST", "LEARRES", "PROBSELF", "FAMSUPSL", "FEELLAH", 
    "SDLEFF", "ICTRES", "FLSCHOOL", "FLMULTSB", "FLFAMILY", "ACCESSFP", "FLCONFIN", "FLCONICT", 
    "ACCESSFA", "ATTCONFM", "FRINFLFM", "ICTSCH", "ICTHOME", "ICTQUAL", "ICTSUBJ", "ICTENQ", 
    "ICTFEED", "ICTOUT", "ICTWKDY", "ICTWKEND", "ICTREG", "ICTINFO", "ICTEFFIC", "BODYIMA", 
    "SOCONPA", "LIFESAT", "PSYCHSYM", "SOCCON", "EXPWB", "CURSUPP", "PQMIMP", "PQMCAR", 
    "PARINVOL", "PQSCHOOL", "PASCHPOL", "ATTIMMP", "CREATHME", "CREATACT", "CREATOPN", 
    "CREATOR", "SCHAUTO", "TCHPART", "EDULEAD", "INSTLEAD", "ENCOURPG", "DIGDVPOL", "TEAFDBK", 
    "MTTRAIN", "DMCVIEWS", "NEGSCLIM", "STAFFSHORT", "EDUSHORT", "STUBEHA", "TEACHBEHA", 
    "STDTEST", "TDTEST", "ALLACTIV", "BCREATSC", "CREENVSC", "ACTCRESC", "OPENCUL", 
    "PROBSCRI", "SCPREPBP", "SCPREPAP", "DIGPREP", 
    "ESCS", "BMMJ1", "BFMJ2", "EFFORT1", "EFFORT2", "Option_UH", "SC209Q04JA", "SC209Q05JA", "SC209Q06JA"
]

# Drop the columns above
model_data = model_data.drop(columns=columns_to_remove, errors='ignore')  # `errors='ignore'` prevents errors if a column isn't found
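
Because `errors='ignore'` silently skips names that are not present, a misspelled entry in the list would leave its column in the data without any message. The following is a minimal sketch of how the skipped names could be listed before dropping. It uses a toy DataFrame with made-up values, and `ESCS_TYPO` is a deliberately hypothetical name, not a PISA variable.

```python
import pandas as pd

# Toy stand-in for model_data; the values are made up for illustration.
toy = pd.DataFrame({"CNTSCHID": [1, 2], "ESCS": [0.1, -0.3], "GRADE": [0.0, 1.0]})

# "ESCS_TYPO" is a deliberately misspelled, hypothetical name.
to_remove = ["CNTSCHID", "ESCS_TYPO"]

# Names requested for removal that do not exist in the frame
missing = [col for col in to_remove if col not in toy.columns]
print("Not found (skipped by errors='ignore'):", missing)

# Same call pattern as the cell above: absent names are ignored
toy = toy.drop(columns=to_remove, errors="ignore")
print("Remaining columns:", list(toy.columns))
```

An empty `missing` list would indicate that every name in the drop list matched a column.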


In [9]:
print(model_data.shape)
model_data.head()

(6454, 453)


Unnamed: 0,CNT,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,ST259Q01JA,WB164Q01HA,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,PAREDINT,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q09TA,SC200Q01JA,SC200Q02JA,SC200Q03JA,SC200Q04JA,SC224Q01JA,RATCMP1,RATCMP2,RATTAB,SCHSEL,SCHLTYPE_1,SCHLTYPE_2,SCHLTYPE_3,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC042Q01TA,SC042Q02TA,SC214Q01JA,SC214Q02JA,SC214Q03JA,SC215Q01JA,
SC215Q02JA,SC215Q03JA,SC215Q04JA,SC215Q05JA,SC215Q06JA,SC215Q07JA,SC215Q08JA,SC216Q06JA,SC216Q07JA,SC216Q08JA,SC216Q09JA,SC217Q01JA,SC217Q02JA,SC217Q03JA,SC217Q04JA,SC217Q05JA,SC217Q06JA,SC217Q07JA,SC217Q08JA,SC217Q10JA,SC218Q01JA,SC219Q01JA,SC220Q01JA,SC221Q01JA,SC221Q02JA,SC221Q03JA,SC221Q04JA,SCSUPRTED,SCSUPRT,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC037Q08TA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,
LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
301772,Korea,1,1.0,4.0,1.0,0,0,1,0,0,1.0,2.0,5.0,4.0,4.0,1.0,4.0,,2.0,3.0,,2.0,0.0,0.0,7.0,1.0,1.0,8.0,1.0,0.0,4.0,9.0,1.0,16.0,,,3.0,0.0,3.0,2.0,35.0,4.0,,3.0,1.0,1.0,1.0,1.0,4.0,4.0,7.0,2.0,1.0,4.0,,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,4.0,,0.0,2.0,4.0,0.0,2.0,3.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,50.0,0,0,1,0,0,1,0,0,1,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,4.0,3.0,2.0,2.0,1.0,92.0,8.0,1.0,1.0,1.0,1.0,71.5,18.0,4.0,0,0,0,1,3.0,93.0,90.0,90.0,90.0,89.0,90.0,91.0,35.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,0.07,,0.8,2.0,0,0,1,3.0,2.0,3.0,2.0,1.0,3.0,3.0,2.0,1.0,3.0,5.0,5.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301773,Korea,1,1.0,2.0,1.0,0,1,0,0,0,1.0,2.0,2.0,,2.0,1.0,1.0,1.0,4.0,6.0,,1.0,0.0,0.0,7.0,7.0,6.0,,1.0,0.0,4.0,6.0,0.0,16.0,,,2.0,0.0,2.0,3.0,40.0,6.0,,10.0,1.0,1.0,1.0,1.0,7.0,6.0,7.0,3.0,2.0,4.0,,1.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,5.0,6.0,,,,,,,,,,,,,,,,,,,,,,1.0,,0.0,1.0,4.0,100.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,1.0,1.0,120.0,1,0,0,0,0,1,1,0,0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,4.0,4.0,4.0,1.0,2.0,2.0,2.0,95.0,5.0,1.0,2.0,1.0,1.0,91.5556,23.0,4.0,0,0,1,0,1.0,10.0,0.0,40.0,40.0,100.0,100.0,0.0,0.0,2.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,,,,,4.0,0.2491,1.0,0.1068,1.0,0,0,1,2.0,1.0,1.0,1.0,3.0,1.0,3.0,3.0,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301774,Korea,0,1.0,1.0,1.0,0,0,0,0,0,,,,1.0,1.0,1.0,4.0,1.0,3.0,10.0,,1.0,0.0,0.0,7.0,7.0,6.0,8.0,1.0,0.0,1.0,10.0,0.0,16.0,,,2.0,0.0,1.0,3.0,20.0,5.0,,10.0,4.0,1.0,1.0,1.0,4.0,3.0,6.0,2.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,2.0,6.0,3.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.0,9.0,5.0,0.0,1.0,15.0,0.0,0.0,0.0,2.0,2.0,1.0,2.0,50.0,0,0,1,0,0,1,0,0,1,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,1.0,2.0,2.0,2.0,70.0,30.0,2.0,1.0,1.0,2.0,100.0,23.0,2.0,0,0,0,0,2.0,20.0,20.0,33.0,39.0,41.0,32.0,11.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,0.4525,,1.0,2.0,0,1,0,1.0,1.0,3.0,1.0,1.0,2.0,3.0,1.0,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301775,Korea,1,0.0,4.0,1.0,0,1,0,0,0,1.0,3.0,5.0,5.0,1.0,,,2.0,3.0,6.0,,1.0,0.0,0.0,4.0,4.0,6.0,9.0,1.0,0.0,4.0,7.0,0.0,12.0,,,2.0,0.0,3.0,2.0,34.0,2.0,,9.0,5.0,5.0,1.0,1.0,4.0,2.0,7.0,2.0,1.0,4.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.0,3.0,5.0,0.0,2.0,2.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,50.0,0,0,1,0,0,1,0,0,1,4.0,4.0,4.0,4.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,2.0,1.0,2.0,100.0,0.0,1.0,2.0,1.0,1.0,96.6667,28.0,4.0,0,1,0,0,1.0,14.0,0.0,16.0,17.0,26.0,5.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,3.0,0.2469,1.0,0.2716,1.0,0,0,1,1.0,2.0,3.0,3.0,1.0,2.0,3.0,3.0,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301776,Korea,1,1.0,2.0,1.0,0,0,1,0,0,2.0,3.0,5.0,5.0,,2.0,4.0,3.0,3.0,5.0,,1.0,0.0,0.0,6.0,7.0,6.0,4.0,1.0,0.0,5.0,4.0,0.0,12.0,,,3.0,0.0,2.0,2.0,37.0,5.0,,6.0,3.0,3.0,2.0,,5.0,3.0,7.0,2.0,1.0,4.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,2.0,6.0,,,,,,,,,,,,,,,,,,,,,,1.0,,0.0,4.0,6.0,0.0,0.0,7.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,50.0,1,0,0,1,0,0,1,0,0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0,4.0,3.0,2.0,1.0,1.0,100.0,0.0,1.0,1.0,1.0,1.0,46.963,18.0,5.0,0,0,0,1,3.0,45.0,45.0,30.0,40.0,40.0,65.0,0.0,28.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,0.2644,1.0,0.1442,1.0,0,1,0,2.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,2.0,3.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,1.0,1.0,1.0,1.0,1.0,2.0,0.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format.  **Note that the first column must be the target variable and the CSV should not include headers.**  Although repetitive, it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.
* `MATH_Proficient`: binary target. The cell below counts 1 as proficient and 0 as not proficient, i.e. falling behind in Math (average of the 10 Math plausible values < 420.07).
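
The file layout described above (target in the first column, no header row, no index column) can be illustrated on a toy frame. This is only a sketch: `feature_a`, `feature_b` and all values are hypothetical, and the output is written to an in-memory buffer rather than to the files used in this notebook.

```python
import io
import pandas as pd

# Toy frame; column names other than MATH_Proficient are hypothetical.
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "MATH_Proficient": [1, 0, 1],
    "feature_b": [0.5, None, 0.7],
})

# Move the target to the first position
ordered = toy[["MATH_Proficient"] + [c for c in toy.columns if c != "MATH_Proficient"]]

# Write without the index and without a header row
buffer = io.StringIO()
ordered.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()
print(csv_text)

# The first value on every line is the target; missing values become empty fields
first_line = csv_text.splitlines()[0]
second_line = csv_text.splitlines()[1]
```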

In [10]:
# Get percent of students not proficient in Math
proficient_n = (model_data['MATH_Proficient'] == 1).sum()
not_proficient_n = (model_data['MATH_Proficient'] == 0).sum()
not_proficient_p = round( not_proficient_n / (not_proficient_n + proficient_n) * 100, 1)
print("Students who are NOT proficient in Math: ", not_proficient_n, "(", not_proficient_p, "%)")

Students who are NOT proficient in Math:  987 ( 15.3 %)


In [11]:
# Get imbalance ratio (used as a parameter in xgboost)
not_proficient_pp = not_proficient_n / (not_proficient_n + proficient_n)

if not_proficient_pp < 0.5:
    imbalance_ratio = (1 - not_proficient_pp) / not_proficient_pp
else:
    imbalance_ratio = not_proficient_pp / (1 - not_proficient_pp)
    
print("Imbalance ratio:", round(imbalance_ratio,1))

Imbalance ratio: 5.5
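
As a worked check of the ratio, the arithmetic can be reproduced from the counts printed above. The sketch assumes the target has no missing values, so that the proficient count equals the total minus 987. How the ratio is passed to XGBoost is not shown in this part of the notebook; for reference, the XGBoost documentation suggests `sum(negative instances) / sum(positive instances)` as a typical value for `scale_pos_weight`, so the second ratio below is included for comparison, with label 1 treated as the positive class.

```python
# Counts taken from the outputs above; proficient_n assumes no missing target values.
total_n = 6454
not_proficient_n = 987
proficient_n = total_n - not_proficient_n  # 5467 under that assumption

# Majority-to-minority ratio, equivalent to the if/else in the cell above
imbalance_ratio = max(proficient_n, not_proficient_n) / min(proficient_n, not_proficient_n)
print("Imbalance ratio:", round(imbalance_ratio, 1))

# Ratio in the form suggested by the XGBoost docs for scale_pos_weight:
# negative (label 0) count divided by positive (label 1) count
neg_over_pos = not_proficient_n / proficient_n
print("negative / positive:", round(neg_over_pos, 2))
```

Which of the two values is appropriate depends on which class is coded as 1 and which class the model should weight more heavily.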


In [12]:
# Reorder columns to bring 'MATH_Proficient' first
new_order = ['MATH_Proficient'] + [col for col in model_data.columns if col != 'MATH_Proficient']
model_data = model_data[new_order]

# Check the shape after reordering (it should be unchanged)
print(model_data.shape)

model_data.head()

(6454, 453)


Unnamed: 0,MATH_Proficient,CNT,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,ST259Q01JA,WB164Q01HA,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,PAREDINT,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q09TA,SC200Q01JA,SC200Q02JA,SC200Q03JA,SC200Q04JA,SC224Q01JA,RATCMP1,RATCMP2,RATTAB,SCHSEL,SCHLTYPE_1,SCHLTYPE_2,SCHLTYPE_3,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC042Q01TA,SC042Q02TA,SC214Q01JA,SC214Q02JA,SC214Q03JA,SC215Q01JA,
SC215Q02JA,SC215Q03JA,SC215Q04JA,SC215Q05JA,SC215Q06JA,SC215Q07JA,SC215Q08JA,SC216Q06JA,SC216Q07JA,SC216Q08JA,SC216Q09JA,SC217Q01JA,SC217Q02JA,SC217Q03JA,SC217Q04JA,SC217Q05JA,SC217Q06JA,SC217Q07JA,SC217Q08JA,SC217Q10JA,SC218Q01JA,SC219Q01JA,SC220Q01JA,SC221Q01JA,SC221Q02JA,SC221Q03JA,SC221Q04JA,SCSUPRTED,SCSUPRT,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC037Q08TA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,
LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
301772,1,Korea,1.0,4.0,1.0,0,0,1,0,0,1.0,2.0,5.0,4.0,4.0,1.0,4.0,,2.0,3.0,,2.0,0.0,0.0,7.0,1.0,1.0,8.0,1.0,0.0,4.0,9.0,1.0,16.0,,,3.0,0.0,3.0,2.0,35.0,4.0,,3.0,1.0,1.0,1.0,1.0,4.0,4.0,7.0,2.0,1.0,4.0,,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,4.0,,0.0,2.0,4.0,0.0,2.0,3.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,50.0,0,0,1,0,0,1,0,0,1,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,4.0,3.0,2.0,2.0,1.0,92.0,8.0,1.0,1.0,1.0,1.0,71.5,18.0,4.0,0,0,0,1,3.0,93.0,90.0,90.0,90.0,89.0,90.0,91.0,35.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,0.07,,0.8,2.0,0,0,1,3.0,2.0,3.0,2.0,1.0,3.0,3.0,2.0,1.0,3.0,5.0,5.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301773,1,Korea,1.0,2.0,1.0,0,1,0,0,0,1.0,2.0,2.0,,2.0,1.0,1.0,1.0,4.0,6.0,,1.0,0.0,0.0,7.0,7.0,6.0,,1.0,0.0,4.0,6.0,0.0,16.0,,,2.0,0.0,2.0,3.0,40.0,6.0,,10.0,1.0,1.0,1.0,1.0,7.0,6.0,7.0,3.0,2.0,4.0,,1.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,5.0,6.0,,,,,,,,,,,,,,,,,,,,,,1.0,,0.0,1.0,4.0,100.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,1.0,1.0,120.0,1,0,0,0,0,1,1,0,0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,4.0,4.0,4.0,1.0,2.0,2.0,2.0,95.0,5.0,1.0,2.0,1.0,1.0,91.5556,23.0,4.0,0,0,1,0,1.0,10.0,0.0,40.0,40.0,100.0,100.0,0.0,0.0,2.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,,,,,4.0,0.2491,1.0,0.1068,1.0,0,0,1,2.0,1.0,1.0,1.0,3.0,1.0,3.0,3.0,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301774,0,Korea,1.0,1.0,1.0,0,0,0,0,0,,,,1.0,1.0,1.0,4.0,1.0,3.0,10.0,,1.0,0.0,0.0,7.0,7.0,6.0,8.0,1.0,0.0,1.0,10.0,0.0,16.0,,,2.0,0.0,1.0,3.0,20.0,5.0,,10.0,4.0,1.0,1.0,1.0,4.0,3.0,6.0,2.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,2.0,6.0,3.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.0,9.0,5.0,0.0,1.0,15.0,0.0,0.0,0.0,2.0,2.0,1.0,2.0,50.0,0,0,1,0,0,1,0,0,1,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,1.0,2.0,2.0,2.0,70.0,30.0,2.0,1.0,1.0,2.0,100.0,23.0,2.0,0,0,0,0,2.0,20.0,20.0,33.0,39.0,41.0,32.0,11.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,0.4525,,1.0,2.0,0,1,0,1.0,1.0,3.0,1.0,1.0,2.0,3.0,1.0,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301775,1,Korea,0.0,4.0,1.0,0,1,0,0,0,1.0,3.0,5.0,5.0,1.0,,,2.0,3.0,6.0,,1.0,0.0,0.0,4.0,4.0,6.0,9.0,1.0,0.0,4.0,7.0,0.0,12.0,,,2.0,0.0,3.0,2.0,34.0,2.0,,9.0,5.0,5.0,1.0,1.0,4.0,2.0,7.0,2.0,1.0,4.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.0,3.0,5.0,0.0,2.0,2.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,50.0,0,0,1,0,0,1,0,0,1,4.0,4.0,4.0,4.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,2.0,1.0,2.0,100.0,0.0,1.0,2.0,1.0,1.0,96.6667,28.0,4.0,0,1,0,0,1.0,14.0,0.0,16.0,17.0,26.0,5.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,3.0,0.2469,1.0,0.2716,1.0,0,0,1,1.0,2.0,3.0,3.0,1.0,2.0,3.0,3.0,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
301776,1,Korea,1.0,2.0,1.0,0,0,1,0,0,2.0,3.0,5.0,5.0,,2.0,4.0,3.0,3.0,5.0,,1.0,0.0,0.0,6.0,7.0,6.0,4.0,1.0,0.0,5.0,4.0,0.0,12.0,,,3.0,0.0,2.0,2.0,37.0,5.0,,6.0,3.0,3.0,2.0,,5.0,3.0,7.0,2.0,1.0,4.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,2.0,6.0,,,,,,,,,,,,,,,,,,,,,,1.0,,0.0,4.0,6.0,0.0,0.0,7.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,50.0,1,0,0,1,0,0,1,0,0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0,4.0,3.0,2.0,1.0,1.0,100.0,0.0,1.0,1.0,1.0,1.0,46.963,18.0,5.0,0,0,0,1,3.0,45.0,45.0,30.0,40.0,40.0,65.0,0.0,28.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,0.2644,1.0,0.1442,1.0,0,1,0,2.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,2.0,3.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,1.0,1.0,1.0,1.0,1.0,2.0,0.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


We'll randomly split the data into three uneven groups.  **The model will be trained on 70% of the data and then evaluated on 15% of the data, which gives us an estimate of the accuracy we can expect on "new" data. The remaining 15% will be held back as a final test dataset to be used later on.**

A seed is included in the code so the splits can be replicated!

In [13]:
# cell 12
# Randomly sort the data then split out first 70%, second 15%, and last 15%
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.85 * len(model_data))])   



In [14]:
print("Number of rows in FULL dataset:", model_data.shape[0])

train_data_percent = round(train_data.shape[0]/model_data.shape[0] * 100, 0)
print("Number of rows in TRAINING dataset:", train_data.shape[0], "(", train_data_percent, "% )")

validation_data_percent = round(validation_data.shape[0]/model_data.shape[0] * 100, 0)
print("Number of rows in VALIDATION dataset:", validation_data.shape[0], "(", validation_data_percent, "% )")

test_data_percent = round(test_data.shape[0]/model_data.shape[0] * 100, 0)
print("Number of rows in TEST dataset:", test_data.shape[0], "(", test_data_percent, "% )")

Number of rows in FULL dataset: 6454
Number of rows in TRAINING dataset: 4517 ( 70.0 % )
Number of rows in VALIDATION dataset: 968 ( 15.0 % )
Number of rows in TEST dataset: 969 ( 15.0 % )
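
The same 70/15/15 cut can also be written with positional slicing, which avoids passing a DataFrame to `np.split` (a call that may emit a deprecation warning in recent pandas versions). The sketch below uses a synthetic frame with the same number of rows as `model_data`; the column `x` and all values are arbitrary stand-ins.

```python
import numpy as np
import pandas as pd

# Synthetic frame with the same number of rows as model_data (values are arbitrary)
n_rows = 6454
toy = pd.DataFrame({"MATH_Proficient": np.arange(n_rows) % 2, "x": np.arange(n_rows)})

# Shuffle once with a fixed seed, then cut at the 70% and 85% marks
shuffled = toy.sample(frac=1, random_state=1729)
train_end = int(0.7 * len(shuffled))
valid_end = int(0.85 * len(shuffled))

train = shuffled.iloc[:train_end]
validation = shuffled.iloc[train_end:valid_end]
test = shuffled.iloc[valid_end:]

print(len(train), len(validation), len(test))

# Re-running the shuffle with the same seed reproduces the same row order
reshuffled = toy.sample(frac=1, random_state=1729)
same_order = reshuffled.index.equals(shuffled.index)
```

With 6454 rows the three pieces contain 4517, 968 and 969 rows, matching the counts printed above, and no row appears in more than one piece.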


#### Drop country names from the dataset

In [15]:
# cell 13
# Drop non-numeric columns (e.g., country names or IDs that are not numeric).
# The column list is taken from the training split and applied to all three splits.
non_numeric_columns = train_data.select_dtypes(exclude=['number']).columns

train_data = train_data.drop(columns=non_numeric_columns)
validation_data = validation_data.drop(columns=non_numeric_columns)
test_data = test_data.drop(columns=non_numeric_columns)

# Save train dataset 
train_data.to_csv('train.csv', index=False, header=False)

# Save validation dataset 
validation_data.to_csv('validation.csv', index=False, header=False)


In [16]:
# Training data - Saved later to S3 as CSV
print(train_data.shape)
train_data.head()

(4517, 452)


Unnamed: 0,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,ST259Q01JA,WB164Q01HA,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,PAREDINT,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q09TA,SC200Q01JA,SC200Q02JA,SC200Q03JA,SC200Q04JA,SC224Q01JA,RATCMP1,RATCMP2,RATTAB,SCHSEL,SCHLTYPE_1,SCHLTYPE_2,SCHLTYPE_3,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC042Q01TA,SC042Q02TA,SC214Q01JA,SC214Q02JA,SC214Q03JA,SC215Q01JA,
SC215Q02JA,SC215Q03JA,SC215Q04JA,SC215Q05JA,SC215Q06JA,SC215Q07JA,SC215Q08JA,SC216Q06JA,SC216Q07JA,SC216Q08JA,SC216Q09JA,SC217Q01JA,SC217Q02JA,SC217Q03JA,SC217Q04JA,SC217Q05JA,SC217Q06JA,SC217Q07JA,SC217Q08JA,SC217Q10JA,SC218Q01JA,SC219Q01JA,SC220Q01JA,SC221Q01JA,SC221Q02JA,SC221Q03JA,SC221Q04JA,SCSUPRTED,SCSUPRT,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC037Q08TA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANG
N_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
302980,1,1.0,1.0,1.0,0,0,0,0,0,,,5.0,5.0,2.0,5.0,4.0,,,8.0,,1.0,0.0,0.0,7.0,7.0,6.0,11.0,1.0,0.0,4.0,8.0,0.0,16.0,,,2.0,0.0,3.0,3.0,50.0,6.0,,10.0,4.0,2.0,1.0,4.0,10.0,5.0,2.0,2.0,2.0,3.0,,0.0,0.0,0.0,0.0,1.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,4.0,,0.0,4.0,4.0,0.0,0.0,20.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,50.0,1,0,0,1,0,0,1,0,0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,80.0,20.0,1.0,1.0,1.0,1.0,71.0,28.0,4.0,0,0,1,0,3.0,92.0,92.0,91.0,92.0,92.0,92.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,0.3141,1.0,0.0,3.0,1,0,0,3.0,3.0,3.0,3.0,5.0,5.0,5.0,5.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
307302,1,,4.0,1.0,0,1,0,0,0,1.0,2.0,5.0,5.0,,,,5.0,3.0,5.0,,1.0,0.0,0.0,7.0,7.0,6.0,3.0,1.0,0.0,4.0,5.0,0.0,12.0,,,1.0,0.0,2.0,1.0,35.0,6.0,,7.0,4.0,4.0,1.0,3.0,4.0,6.0,5.0,1.0,1.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.0,10.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,50.0,1,0,0,0,0,1,0,0,1,4.0,,,,,,,,,,,2.0,1.0,1.0,,,1.0,1.0,2.0,1.0,83.8824,23.0,5.0,0,0,0,1,2.0,100.0,99.0,97.0,99.0,99.0,99.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,0.3361,1.0,0.3361,3.0,0,0,1,2.0,3.0,3.0,4.0,3.0,3.0,3.0,3.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
307349,1,1.0,1.0,1.0,0,0,0,0,0,,,5.0,5.0,1.0,1.0,,1.0,,7.0,,2.0,-1.0,0.0,7.0,3.0,6.0,15.0,1.0,0.0,2.0,9.0,0.0,16.0,,,2.0,0.0,3.0,1.0,31.0,5.0,,10.0,1.0,1.0,,,4.0,3.0,6.0,4.0,4.0,4.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,3.0,,0.0,7.0,6.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,2.0,2.0,2.0,45.0,0,0,1,0,0,1,0,0,1,3.0,3.0,4.0,2.0,2.0,2.0,2.0,3.0,3.0,3.0,1.0,2.0,2.0,1.0,100.0,0.0,1.0,1.0,1.0,1.0,100.0,28.0,4.0,0,0,1,0,1.0,30.0,90.0,40.0,60.0,20.0,10.0,0.0,100.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,3.0,0.3636,1.0,0.2727,1.0,0,0,1,2.0,1.0,5.0,2.0,1.0,1.0,5.0,5.0,3.0,3.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
307320,1,1.0,4.0,1.0,0,1,0,0,0,2.0,2.0,5.0,1.0,,2.0,2.0,1.0,2.0,6.0,,2.0,0.0,0.0,7.0,0.0,6.0,5.0,1.0,1.0,4.0,6.0,0.0,16.0,,,1.0,0.0,3.0,3.0,34.0,2.0,,5.0,1.0,1.0,1.0,1.0,4.0,3.0,7.0,1.0,1.0,3.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,9.0,,0.0,2.0,5.0,2.0,1.0,2.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,50.0,0,0,1,0,0,1,1,0,0,3.0,4.0,4.0,3.0,3.0,3.0,3.0,4.0,2.0,4.0,3.0,2.0,1.0,1.0,100.0,0.0,1.0,2.0,1.0,1.0,59.5385,18.0,3.0,0,0,1,0,2.0,60.0,60.0,75.0,90.0,0.0,100.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,0.3361,0.75,0.5378,1.0,0,1,0,3.0,2.0,3.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
302365,0,0.0,6.0,6.0,1,0,0,0,0,3.0,4.0,,,,,,,3.0,10.0,,2.0,-1.0,0.0,,0.0,0.0,16.0,1.0,1.0,2.0,10.0,0.0,12.0,,,3.0,0.0,4.0,4.0,0.0,1.0,,1.0,,,,,0.0,1.0,1.0,4.0,4.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,10.0,,0.0,1.0,4.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,1.0,1.0,2.0,45.0,0,0,0,0,0,1,0,0,0,2.0,3.0,3.0,3.0,1.0,1.0,1.0,3.0,3.0,3.0,1.0,2.0,2.0,1.0,90.0,10.0,1.0,1.0,1.0,1.0,100.0,28.0,2.0,0,0,1,0,2.0,2.0,1.0,1.0,7.0,0.0,4.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,2.0,2.0,2.0,2.0,3.0,0.2113,1.0,0.6338,1.0,0,0,1,1.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,3.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,2.0,1.0,1.0,2.0,1.0,2.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [17]:
# Validation data - Saved later to S3 as CSV
print(validation_data.shape)
validation_data.head()

(968, 452)


Unnamed: 0,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,ST259Q01JA,WB164Q01HA,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,PAREDINT,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q09TA,SC200Q01JA,SC200Q02JA,SC200Q03JA,SC200Q04JA,SC224Q01JA,RATCMP1,RATCMP2,RATTAB,SCHSEL,SCHLTYPE_1,SCHLTYPE_2,SCHLTYPE_3,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC042Q01TA,SC042Q02TA,SC214Q01JA,SC214Q02JA,SC214Q03JA,SC215Q01JA,SC215Q0
2JA,SC215Q03JA,SC215Q04JA,SC215Q05JA,SC215Q06JA,SC215Q07JA,SC215Q08JA,SC216Q06JA,SC216Q07JA,SC216Q08JA,SC216Q09JA,SC217Q01JA,SC217Q02JA,SC217Q03JA,SC217Q04JA,SC217Q05JA,SC217Q06JA,SC217Q07JA,SC217Q08JA,SC217Q10JA,SC218Q01JA,SC219Q01JA,SC220Q01JA,SC221Q01JA,SC221Q02JA,SC221Q03JA,SC221Q04JA,SCSUPRTED,SCSUPRT,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC037Q08TA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANG
N_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
302894,1,1.0,2.0,1.0,0,1,0,0,0,1.0,2.0,5.0,5.0,2.0,1.0,1.0,,3.0,4.0,,1.0,0.0,0.0,3.0,7.0,6.0,14.0,1.0,0.0,4.0,5.0,0.0,16.0,,,4.0,0.0,3.0,2.0,30.0,4.0,,1.0,3.0,1.0,1.0,1.0,3.0,3.0,8.0,1.0,1.0,3.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,2.0,,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,50.0,0,0,1,0,0,1,0,0,1,1.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,2.0,2.0,1.0,90.0,10.0,1.0,2.0,1.0,1.0,100.0,28.0,1.0,0,0,1,0,2.0,20.0,0.0,20.0,20.0,10.0,10.0,0.0,5.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,,,,,3.0,0.566,1.0,0.9906,3.0,0,0,1,1.0,2.0,2.0,1.0,1.0,2.0,3.0,1.0,3.0,3.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
305312,1,,3.0,1.0,0,0,1,0,0,1.0,2.0,,,,,,1.0,4.0,8.0,,2.0,0.0,0.0,4.0,7.0,6.0,9.0,1.0,0.0,4.0,9.0,0.0,14.5,,,4.0,0.0,2.0,3.0,4.0,2.0,,10.0,3.0,3.0,3.0,3.0,4.0,1.0,8.0,2.0,1.0,2.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,10.0,,0.0,10.0,5.0,0.0,0.0,11.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,50.0,1,0,0,1,0,0,0,0,1,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,3.0,2.0,1.0,1.0,95.0,5.0,1.0,1.0,1.0,1.0,68.0,18.0,3.0,0,0,1,0,3.0,31.0,23.0,17.0,22.0,0.0,7.0,0.0,20.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,0.1348,,0.5618,1.0,1,0,0,2.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
306963,1,1.0,5.0,1.0,0,1,0,0,0,1.0,2.0,5.0,,4.0,,,2.0,0.0,5.0,,1.0,0.0,0.0,3.0,6.0,6.0,,1.0,0.0,4.0,5.0,0.0,12.0,,,2.0,0.0,3.0,3.0,30.0,3.0,,6.0,4.0,1.0,1.0,1.0,4.0,1.0,3.0,3.0,3.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,4.0,,0.0,10.0,5.0,0.0,0.0,20.0,0.0,0.0,0.0,2.0,1.0,2.0,2.0,50.0,0,0,1,0,0,1,0,0,1,4.0,4.0,4.0,4.0,3.0,3.0,3.0,4.0,4.0,4.0,2.0,2.0,2.0,1.0,40.0,60.0,1.0,2.0,1.0,1.0,100.0,13.0,2.0,0,0,1,0,1.0,10.0,20.0,30.0,60.0,3.0,2.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,3.0,0.5426,,0.6977,3.0,0,1,0,2.0,2.0,3.0,3.0,1.0,1.0,3.0,2.0,3.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
302965,1,,2.0,,0,0,1,0,0,1.0,3.0,5.0,5.0,4.0,,5.0,2.0,3.0,5.0,,1.0,0.0,0.0,7.0,7.0,6.0,12.0,1.0,0.0,4.0,3.0,0.0,12.0,,,2.0,0.0,3.0,1.0,43.0,4.0,,6.0,1.0,1.0,1.0,1.0,7.0,3.0,7.0,2.0,2.0,4.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,2.0,,0.0,5.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,50.0,0,0,0,0,0,0,0,0,0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,100.0,0.0,1.0,1.0,1.0,1.0,71.2857,23.0,4.0,0,1,0,0,2.0,46.0,54.0,57.0,53.0,56.0,55.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,3.0,0,0,1,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
302211,1,1.0,4.0,1.0,0,0,1,0,0,1.0,3.0,,5.0,3.0,1.0,1.0,1.0,,5.0,,2.0,0.0,0.0,4.0,7.0,6.0,7.0,1.0,0.0,4.0,9.0,0.0,12.0,,,4.0,0.0,2.0,2.0,40.0,3.0,,3.0,1.0,1.0,1.0,1.0,4.0,5.0,8.0,3.0,3.0,3.0,,0.0,1.0,0.0,1.0,0.0,0.0,,,,,,1.0,3.0,6.0,,,,,,,,,,,,,,,,,,,,,,5.0,,0.0,5.0,5.0,0.0,0.0,11.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,50.0,1,0,0,1,0,0,0,0,1,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,3.0,2.0,1.0,1.0,95.0,5.0,1.0,1.0,1.0,1.0,68.0,18.0,3.0,0,0,1,0,3.0,31.0,23.0,17.0,22.0,0.0,7.0,0.0,20.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,0.1348,,0.5618,1.0,1,0,0,2.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [18]:
# Test data - NOT SAVED TO S3
print(test_data.shape)
test_data.head()

(969, 452)


Unnamed: 0,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,ST259Q01JA,WB164Q01HA,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,PAREDINT,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q09TA,SC200Q01JA,SC200Q02JA,SC200Q03JA,SC200Q04JA,SC224Q01JA,RATCMP1,RATCMP2,RATTAB,SCHSEL,SCHLTYPE_1,SCHLTYPE_2,SCHLTYPE_3,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC042Q01TA,SC042Q02TA,SC214Q01JA,SC214Q02JA,SC214Q03JA,SC215Q01JA,SC215Q0
2JA,SC215Q03JA,SC215Q04JA,SC215Q05JA,SC215Q06JA,SC215Q07JA,SC215Q08JA,SC216Q06JA,SC216Q07JA,SC216Q08JA,SC216Q09JA,SC217Q01JA,SC217Q02JA,SC217Q03JA,SC217Q04JA,SC217Q05JA,SC217Q06JA,SC217Q07JA,SC217Q08JA,SC217Q10JA,SC218Q01JA,SC219Q01JA,SC220Q01JA,SC221Q01JA,SC221Q02JA,SC221Q03JA,SC221Q04JA,SCSUPRTED,SCSUPRT,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC037Q08TA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANG
N_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
304288,1,1.0,3.0,1.0,0,0,1,0,0,1.0,2.0,3.0,3.0,3.0,3.0,,4.0,,5.0,,2.0,0.0,0.0,,7.0,6.0,8.0,1.0,0.0,4.0,7.0,0.0,12.0,,,3.0,0.0,2.0,3.0,3.0,3.0,,2.0,1.0,1.0,1.0,1.0,3.0,2.0,8.0,2.0,2.0,3.0,3.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,1.0,,0.0,5.0,5.0,0.0,2.0,25.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,50.0,0,0,1,0,0,1,0,0,1,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,3.0,3.0,1.0,2.0,2.0,1.0,70.0,30.0,1.0,1.0,1.0,1.0,94.875,28.0,5.0,0,0,0,1,2.0,15.0,26.0,26.0,86.0,0.0,13.0,0.0,60.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,0.1751,1.0,0.1946,1.0,0,0,1,2.0,3.0,3.0,3.0,1.0,3.0,3.0,3.0,2.0,2.0,4.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,3.0,3.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
304050,1,1.0,2.0,1.0,0,0,1,0,0,3.0,3.0,1.0,,5.0,3.0,1.0,1.0,,7.0,,2.0,-1.0,0.0,,7.0,6.0,5.0,1.0,0.0,2.0,6.0,0.0,12.0,,,3.0,0.0,3.0,3.0,3.0,4.0,,6.0,4.0,4.0,4.0,4.0,3.0,3.0,7.0,3.0,2.0,3.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,10.0,,0.0,6.0,4.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,1.0,1.0,2.0,45.0,0,0,0,0,0,1,0,0,0,2.0,3.0,3.0,3.0,1.0,1.0,1.0,3.0,3.0,3.0,1.0,2.0,2.0,1.0,90.0,10.0,1.0,1.0,1.0,1.0,100.0,28.0,2.0,0,0,1,0,2.0,2.0,1.0,1.0,7.0,0.0,4.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,2.0,2.0,2.0,2.0,3.0,0.2113,1.0,0.6338,1.0,0,0,1,1.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,3.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,2.0,1.0,1.0,2.0,1.0,2.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
302086,1,1.0,1.0,1.0,0,0,0,0,0,,,5.0,5.0,3.0,,,4.0,3.0,6.0,,1.0,0.0,0.0,7.0,7.0,6.0,5.0,1.0,0.0,4.0,6.0,0.0,16.0,,,2.0,0.0,3.0,1.0,42.0,2.0,,10.0,4.0,3.0,4.0,2.0,7.0,3.0,5.0,3.0,2.0,4.0,,0.0,0.0,0.0,0.0,1.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,4.0,,0.0,0.0,6.0,0.0,2.0,6.0,0.0,0.0,0.0,2.0,2.0,1.0,2.0,50.0,1,0,0,0,0,1,0,0,1,4.0,4.0,4.0,4.0,2.0,2.0,2.0,4.0,4.0,3.0,1.0,2.0,2.0,1.0,92.0,8.0,1.0,2.0,1.0,1.0,86.8,23.0,4.0,0,0,0,1,2.0,40.0,0.0,40.0,20.0,0.0,2.0,0.0,80.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,3.0,0.0,,0.2679,1.0,0,1,0,1.0,3.0,3.0,1.0,1.0,3.0,3.0,1.0,3.0,3.0,5.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,3.0,4.0,2.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
306009,1,1.0,2.0,1.0,0,1,0,0,0,1.0,3.0,5.0,5.0,2.0,2.0,5.0,,3.0,8.0,,2.0,0.0,0.0,9.0,7.0,6.0,9.0,1.0,0.0,4.0,8.0,0.0,16.0,,,2.0,0.0,3.0,2.0,60.0,5.0,,10.0,1.0,1.0,1.0,4.0,17.0,4.0,7.0,3.0,2.0,3.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,8.0,,0.0,7.0,6.0,1.0,0.0,10.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0,50.0,0,0,1,0,0,1,0,0,1,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,2.0,3.0,1.0,2.0,2.0,1.0,95.0,5.0,1.0,1.0,1.0,1.0,87.5,33.0,5.0,0,0,0,1,2.0,100.0,54.0,100.0,100.0,53.0,100.0,20.0,40.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,0.2078,1.0,0.1039,3.0,1,0,0,2.0,2.0,3.0,2.0,1.0,3.0,3.0,2.0,1.0,2.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,3.0,3.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
302334,1,1.0,1.0,1.0,0,0,0,0,0,,,5.0,5.0,1.0,,5.0,1.0,4.0,6.0,,1.0,0.0,0.0,4.0,7.0,6.0,16.0,1.0,0.0,4.0,1.0,0.0,12.0,,,2.0,0.0,3.0,3.0,30.0,5.0,,10.0,3.0,3.0,3.0,3.0,11.0,4.0,8.0,2.0,4.0,4.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,,,,,,1.0,3.0,4.0,,,,,,,,,,,,,,,,,,,,,,6.0,,2.0,3.0,6.0,0.0,2.0,6.0,0.0,0.0,0.0,2.0,2.0,1.0,2.0,50.0,1,0,0,0,0,1,0,0,1,4.0,4.0,4.0,4.0,2.0,2.0,2.0,4.0,4.0,3.0,1.0,2.0,2.0,1.0,92.0,8.0,1.0,2.0,1.0,1.0,86.8,23.0,4.0,0,0,0,1,2.0,40.0,0.0,40.0,20.0,0.0,2.0,0.0,80.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,3.0,0.0,,0.2679,1.0,0,1,0,1.0,3.0,3.0,1.0,1.0,3.0,3.0,1.0,3.0,3.0,5.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,3.0,4.0,2.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Now we'll copy the training and validation files to S3 for Amazon SageMaker's managed training to pick up.

In [19]:
# cell 14
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
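
The object keys above are built with `os.path.join`, which produces forward slashes on the Linux-based environment this notebook runs in. As a minimal sketch of a platform-independent alternative, the keys could be built with `posixpath.join` instead. The helper name `s3_key` and the example prefix are hypothetical and only for illustration; they are not part of boto3 or the SageMaker SDK.

```python
import posixpath


def s3_key(prefix, *parts):
    """Join S3 key segments with forward slashes on any operating system.

    `s3_key` is a hypothetical helper used only for illustration.
    """
    return posixpath.join(prefix, *parts)


# Example prefix mirroring the notebook's naming scheme (assumed value).
example_prefix = "sagemaker/xgboost-Korea"

train_key = s3_key(example_prefix, "train", "train.csv")
validation_key = s3_key(example_prefix, "validation", "validation.csv")

print(train_key)
print(validation_key)
```

The resulting strings can be passed to `Object(...)` in place of the `os.path.join(...)` calls.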

## Training 

At a high level, gradient boosted trees work by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models.  By doing this, the collection of simple models can actually outperform large, complex models.  Other Amazon SageMaker notebooks elaborate further on gradient boosted trees and how they differ from similar algorithms.
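
The idea described above can be illustrated with a toy example. The sketch below is plain gradient boosting for squared error using one-split "stumps" on a single feature; it is not XGBoost's actual algorithm (which also uses second-order gradients and regularization), and the data, function names, and settings are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single-split stump that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, num_round=50, eta=0.3):
    """Each round fits a stump to what the current ensemble still gets wrong."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + eta * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Toy data with a "bump" that a single split cannot capture.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.0, 0.2, 0.9, 1.1, 1.0, 0.3, 0.2]

base, boosted = boost(xs, ys)
one_stump = fit_stump(xs, [y - base for y in ys])
single = [base + predict_stump(one_stump, x) for x in xs]

baseline_error = mse(ys, [base] * len(ys))
single_error = mse(ys, single)
boosted_error = mse(ys, boosted)
print(baseline_error, single_error, boosted_error)
```

On this toy data the boosted ensemble fits the training points more closely than either the constant baseline or a single stump, which is the sense in which many weak models combine into a stronger one.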

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

First we'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost.

In [20]:
# cell 15
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

Then, because we're training with the CSV file format, we'll create `TrainingInput` objects that our training function can use as pointers to the files in S3, and which also specify that the content type is CSV.

In [21]:
# cell 16
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')
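
For CSV input, SageMaker's built-in XGBoost expects the label in the first column and no header row. The following is a rough sketch of that layout using only the standard library. The function name `to_headerless_label_first` and the tiny example rows are hypothetical; the column names are borrowed from the dataset shown earlier, but the values are made up.

```python
import csv
import io


def to_headerless_label_first(rows, header, label):
    """Return CSV text with the label column moved first and no header row.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    label_index = header.index(label)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        features = [v for i, v in enumerate(row) if i != label_index]
        writer.writerow([row[label_index]] + features)
    return out.getvalue()


# Made-up rows; only the column names come from the dataset.
header = ["SISCO", "MATH_Proficient", "GRADE"]
rows = [
    [1.0, 1, 0.0],
    [0.0, 0, -1.0],
]

csv_text = to_headerless_label_first(rows, header, "MATH_Proficient")
print(csv_text)
```

With pandas, a similar effect is usually achieved by reordering the columns and writing with `header=False, index=False`.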

Next we'll need to specify training parameters to the estimator.  This includes:
1. The `xgboost` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

And then a `.fit()` function which specifies:
1. S3 locations for the input data.  In this case we have both a training and a validation set, which are passed in as separate channels.

In [22]:
# cell 17
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

In [23]:
xgb.set_hyperparameters(# seed=42,       # Random seed (turned off because we are using a different seed per iteration)  
                        seed_per_iteration=True,   # Different seed for each boosting iteration, can prevent overfitting
                        early_stopping_rounds=10,   # Stop if AUC doesn’t improve for 10 rounds
                        # scale_pos_weight=imbalance_ratio, # Helps when outcome is imbalanced (but specificity decreased)
                        objective='binary:logistic',
                        eval_metric='auc', # AUCPR is better than AUC-ROC when the outcome is not balanced (but can't use with auto tuning)
                        num_round=100,   # Number of boosting rounds for training                    
                        eta=0.05,   # Learning rate, lower value is more robust to overfitting but requires more boosting rounds
                        max_depth=5,   # Deeper trees can model more complex patterns but may overfit
                        min_child_weight=10,   # Higher value ensures leaf nodes have sufficient samples, preventing overfitting
                        gamma=4,    # Higher values make it harder to partition a leaf node, making the algorithm more conservative
                        subsample=0.8,   # Fraction of training instances to use for each boosting round, < 1 can prevent overfitting
                        alpha=5   # L1 regularization term on weights, higher value leads to more regularization                       
                        )

# xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

#### Use auto-tuning to find best hyperparameters

Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and the ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in the best-performing model, as measured by a metric that you choose.
For example, suppose that you want to solve a binary classification problem on this PISA dataset. Your goal is to maximize the area under the curve (AUC) metric by training an XGBoost model. You don't know which values of the eta, alpha, min_child_weight, and max_depth hyperparameters will produce the best model. To find them, you specify ranges of values for Amazon SageMaker hyperparameter tuning to search. Hyperparameter tuning launches training jobs that use hyperparameter values in those ranges and returns the training job with the highest AUC on the validation set.
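
The search loop described above can be sketched in a few lines. This is a simplified random search, not what SageMaker runs (the `HyperparameterTuner` defaults to a Bayesian strategy), and `toy_validation_auc` is a made-up stand-in for a real training job. The ranges mirror a subset of those used in this notebook.

```python
import random

# Search space mirroring a subset of the ranges used in this notebook.
ranges = {
    "eta": ("continuous", 0.01, 0.1),
    "max_depth": ("integer", 3, 6),
    "min_child_weight": ("integer", 5, 20),
}

def sample_config(rng):
    """Draw one hyperparameter combination from the ranges."""
    config = {}
    for name, (kind, low, high) in ranges.items():
        config[name] = rng.randint(low, high) if kind == "integer" else rng.uniform(low, high)
    return config

def toy_validation_auc(config):
    """Hypothetical stand-in for a training job; highest at eta=0.1 and max_depth=3."""
    return 0.9 - abs(config["eta"] - 0.1) - 0.01 * (config["max_depth"] - 3)

def random_search(max_jobs, seed=0):
    """Evaluate max_jobs sampled configurations and return the best one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(max_jobs)]
    return max(trials, key=toy_validation_auc), trials

best, trials = random_search(max_jobs=50)
print(best, toy_validation_auc(best))
```

Here `max_jobs` plays the same role as the `max_jobs` argument of the tuner: it caps how many configurations are tried, and therefore the cost of the search.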

In [24]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {'num_round': IntegerParameter(50, 300),
                         'eta': ContinuousParameter(0.01, 0.1),
                         'max_depth': IntegerParameter(3, 6),
                         'min_child_weight': IntegerParameter(5, 20),
                         'gamma': IntegerParameter(1, 10),     
                         'subsample': ContinuousParameter(0.7, 1.0),
                         'alpha': IntegerParameter(1, 10),
                         }


In [25]:
tuner = HyperparameterTuner(estimator=xgb,
                            objective_metric_name='validation:auc',
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=50,  
                            max_parallel_jobs=5)

# May need to adjust number of jobs depending on budget!

In [26]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


..............................................................................................................................................!


In [27]:
# cell 26
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

'Completed'

In [28]:
# cell 27
# Return the best training job name
best_training_job = tuner.best_training_job()
print("Best training job:", best_training_job)

Best training job: xgboost-250222-2035-038-6edb14f5


In [29]:
# Print out hyperparameters of BEST model

response = boto3.client('sagemaker').describe_training_job(TrainingJobName=best_training_job)
best_hyperparameters = response["HyperParameters"]

best_num_round = int(best_hyperparameters["num_round"])
best_eta = float(best_hyperparameters["eta"])
best_max_depth = int(best_hyperparameters["max_depth"])
best_min_child_weight = int(best_hyperparameters["min_child_weight"])
best_gamma = int(best_hyperparameters["gamma"])
best_subsample = float(best_hyperparameters["subsample"])
best_alpha = int(best_hyperparameters["alpha"])

print("BEST num_round: ", best_num_round)
print("BEST eta: ", round(best_eta, 2))
print("BEST max_depth: ", best_max_depth)
print("BEST min_child_weight: ", best_min_child_weight)
print("BEST gamma: ", best_gamma)
print("BEST subsample: ", round(best_subsample, 2))
print("BEST alpha: ", best_alpha)


BEST num_round:  252
BEST eta:  0.1
BEST max_depth:  3
BEST min_child_weight:  18
BEST gamma:  4
BEST subsample:  0.71
BEST alpha:  1


## Deploy the model (the best model identified by HyperparameterTuner)

In [30]:
# cell 28
tuner_predictor = tuner.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')


2025-02-22 20:45:27 Starting - Found matching resource for reuse
2025-02-22 20:45:27 Downloading - Downloading the training image
2025-02-22 20:45:27 Training - Training image download completed. Training in progress.
2025-02-22 20:45:27 Uploading - Uploading generated training model
2025-02-22 20:45:27 Completed - Resource reused by training job: xgboost-250222-2035-043-92bef2f4
-------!

In [31]:
# cell 29
# Create a serializer
tuner_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [32]:
# Define function for predictions
def predict(data, predictor, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])
    return np.fromstring(predictions[1:], sep=',')
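
As a rough check on the batching arithmetic in `predict`, the sketch below reproduces how many sections `int(n / rows + 1)` requests and the chunk sizes `np.array_split` would then produce. Both helper names are hypothetical, and 969 is the test-set size implied by the confusion matrices later in this notebook.

```python
def batch_count(n_rows, rows=500):
    """Number of sections requested by predict(): int(n / rows + 1)."""
    return int(n_rows / float(rows) + 1)

def array_split_sizes(n_rows, sections):
    """Chunk sizes np.array_split yields: the first n % k chunks get one extra row."""
    base, extra = divmod(n_rows, sections)
    return [base + 1] * extra + [base] * (sections - extra)

print(array_split_sizes(969, batch_count(969)))    # [485, 484]
print(array_split_sizes(1000, batch_count(1000)))  # [334, 333, 333]
```

Note that when the row count is an exact multiple of `rows`, the formula requests one more batch than strictly needed. This is harmless, since every batch still stays under the `rows` limit.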


In [33]:
# Make predictions - Probabilities
predictions = predict(test_data.drop(['MATH_Proficient'], axis=1).to_numpy(),tuner_predictor)

In [None]:
# Save the real values for the test set
real_values = test_data['MATH_Proficient']
real_values.to_csv('real_values.csv', index=False, header=False)

# Save the predicted values for the test set
predicted_values_full = predictions
predicted_values_full = pd.DataFrame(predicted_values_full, columns=['Predicted Values'])
predicted_values_full.to_csv('predicted_values_full.csv', index=False, header=False)

In [34]:
# Clean up
tuner_predictor.delete_endpoint(delete_endpoint_config=True)

## Explain the trained model using Clarify

In [35]:
from datetime import datetime

session = sagemaker.Session()

model_name = "Clarify-{}-{}".format(country_name_edited, datetime.now().strftime("%d-%m-%Y-%H-%M-%S"))

best_model = sagemaker.estimator.Estimator.attach(best_training_job)  # Attach the best training job

model = best_model.create_model(name=model_name)  # Create a model from the best job

container_def = model.prepare_container_def()

session.create_model(model_name, role, container_def)


2025-02-22 20:45:27 Starting - Found matching resource for reuse
2025-02-22 20:45:27 Downloading - Downloading the training image
2025-02-22 20:45:27 Training - Training image download completed. Training in progress.
2025-02-22 20:45:27 Uploading - Uploading generated training model
2025-02-22 20:45:27 Completed - Resource reused by training job: xgboost-250222-2035-043-92bef2f4


'Clarify-Korea-22-02-2025-20-51-46'

In [36]:
test_features = test_data.drop(["MATH_Proficient"], axis=1)
test_target = test_data["MATH_Proficient"]
test_features.to_csv("test_features.csv", index=False, header=False)

In [37]:
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.2xlarge", sagemaker_session=session
)

model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type="ml.m5.large",
    instance_count=1,
    accept_type="text/csv",
    content_type="text/csv",
)

In [38]:
from sagemaker.s3 import S3Downloader

# Download data from S3 to local instance
local_path = S3Downloader.download('s3://{}/{}/train'.format(bucket, prefix), './tmp/train_data')

In [39]:
# Load and sample
full_data = pd.read_csv('./tmp/train_data/train.csv', header=None)
n = min(3000, len(full_data))  # Cap at 3000 rows; the Clarify job takes about 2 hours at this size, so consider lowering it
sampled_data = full_data.sample(n=n)  # If full_data has fewer than 3000 rows, use the full sample

# Save sampled data back to S3
sampled_path = 'sampled_train_data.csv'
sampled_data.to_csv(sampled_path, index=False)

from sagemaker.s3 import S3Uploader
sampled_s3_uri = S3Uploader.upload(sampled_path, 's3://{}/{}/sampled_train'.format(bucket, prefix))

In [40]:
print(sampled_data.shape)
sampled_data.head()

(3000, 452)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451
2209,0,0.0,2.0,2.0,0,0,1,0,0,1.0,1.0,3.0,3.0,,4.0,3.0,3.0,4.0,4.0,,2.0,0.0,1.0,3.0,7.0,6.0,8.0,1.0,2.0,4.0,2.0,0.0,16.0,,,3.0,1.0,4.0,4.0,3.0,1.0,,0.0,4.0,4.0,4.0,4.0,0.0,1.0,1.0,2.0,3.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,2.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,2.0,,0.0,0.0,4.0,0.0,5.0,28.0,0.0,0.0,0.0,1.0,2.0,2.0,1.0,50.0,0,0,1,0,0,1,0,0,1,4.0,4.0,4.0,4.0,3.0,3.0,3.0,3.0,4.0,4.0,1.0,2.0,2.0,1.0,100.0,0.0,1.0,2.0,1.0,1.0,100.0,18.0,2.0,0,0,1,0,2.0,30.0,30.0,20.0,50.0,0.0,10.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,3.0,3.1126,1.0,0.0861,3.0,0,0,1,2.0,1.0,3.0,3.0,2.0,1.0,3.0,3.0,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,2.0,2.0,1.0,2.0,1.0,2.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1805,0,1.0,1.0,1.0,0,0,0,0,0,,,,,,,,,2.0,10.0,,1.0,0.0,0.0,9.0,7.0,6.0,8.0,1.0,0.0,4.0,3.0,0.0,16.0,,,2.0,0.0,2.0,2.0,5.0,6.0,,10.0,3.0,3.0,3.0,3.0,5.0,6.0,7.0,1.0,1.0,3.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,,,,,,1.0,2.0,4.0,,,,,,,,,,,,,,,,,,,,,,2.0,,0.0,4.0,4.0,0.0,0.0,20.0,0.0,0.0,0.0,3.0,2.0,2.0,2.0,50.0,0,0,0,0,0,0,0,0,0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,1.0,70.0,30.0,1.0,2.0,1.0,1.0,100.0,23.0,1.0,0,0,1,0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,110.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,3.0,0.8721,1.0,2.1512,3.0,0,1,0,2.0,2.0,3.0,3.0,1.0,1.0,4.0,4.0,3.0,2.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,3.0,1.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,11.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3086,1,1.0,1.0,1.0,0,0,0,0,0,,,5.0,,3.0,2.0,4.0,2.0,3.0,8.0,,2.0,0.0,0.0,8.0,7.0,6.0,2.0,1.0,0.0,4.0,7.0,0.0,14.5,,,3.0,0.0,1.0,1.0,35.0,4.0,,10.0,5.0,5.0,1.0,1.0,4.0,5.0,4.0,2.0,2.0,4.0,,0.0,0.0,1.0,0.0,0.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,4.0,,0.0,10.0,4.0,0.0,2.0,3.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,50.0,0,0,1,0,0,1,0,0,1,3.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,4.0,3.0,2.0,2.0,1.0,92.0,8.0,1.0,1.0,1.0,1.0,71.5,18.0,4.0,0,0,0,1,3.0,93.0,90.0,90.0,90.0,89.0,90.0,91.0,35.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,0.07,,0.8,2.0,0,0,1,3.0,2.0,3.0,2.0,1.0,3.0,3.0,2.0,1.0,3.0,5.0,5.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2783,1,1.0,3.0,1.0,0,1,0,0,0,1.0,2.0,5.0,5.0,5.0,2.0,,2.0,4.0,5.0,,1.0,0.0,0.0,4.0,7.0,6.0,6.0,1.0,0.0,4.0,7.0,0.0,12.0,,,1.0,0.0,2.0,2.0,33.0,2.0,,5.0,4.0,3.0,3.0,3.0,3.0,1.0,3.0,2.0,2.0,1.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,6.0,5.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.0,0.0,4.0,5.0,0.0,14.0,5.0,5.0,0.0,2.0,2.0,2.0,1.0,50.0,0,0,1,0,0,1,0,0,1,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,1.0,100.0,0.0,1.0,1.0,1.0,1.0,100.0,23.0,2.0,0,0,1,0,3.0,52.0,100.0,99.0,100.0,0.0,90.0,0.0,14.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.9048,1.0,0.3175,3.0,0,1,0,2.0,2.0,3.0,5.0,2.0,1.0,3.0,5.0,3.0,3.0,5.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3201,1,1.0,1.0,1.0,0,0,0,0,0,,,5.0,5.0,3.0,1.0,1.0,,,5.0,,1.0,0.0,0.0,7.0,5.0,6.0,11.0,1.0,0.0,4.0,7.0,0.0,12.0,,,2.0,0.0,2.0,1.0,34.0,4.0,,7.0,1.0,1.0,1.0,1.0,4.0,2.0,10.0,2.0,2.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.0,2.0,3.0,0.0,1.0,2.0,0.0,0.0,0.0,3.0,1.0,2.0,1.0,50.0,1,0,0,1,0,0,0,0,1,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,3.0,3.0,2.0,2.0,2.0,1.0,90.0,10.0,1.0,2.0,1.0,1.0,82.4286,23.0,5.0,0,0,0,1,3.0,20.0,0.0,91.0,57.0,0.0,83.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,2.0,,,,4.0,0.3403,1.0,1.0209,1.0,0,0,1,2.0,2.0,3.0,3.0,2.0,2.0,3.0,3.0,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [41]:
shap_config = clarify.SHAPConfig(
    baseline=[test_features.iloc[0].values.tolist()],
    num_samples=3000,  
    agg_method="mean_abs",
    save_local_shap_values=True
)

explainability_output_path = "s3://{}/{}/clarify-explainability".format(bucket, prefix)

explainability_data_config = clarify.DataConfig(
    #s3_data_input_path='s3://{}/{}/train'.format(bucket, prefix),
    s3_data_input_path=sampled_s3_uri,
    s3_output_path=explainability_output_path,
    label='MATH_Proficient',
    headers=train_data.columns.to_list(),
    dataset_type="text/csv",
)

In [42]:
# Set logging level for 'sagemaker.clarify' to WARNING (hides INFO messages)
import logging

logging.getLogger("sagemaker.clarify").setLevel(logging.WARNING)

clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config
)

INFO:sagemaker:Creating processing-job with name Clarify-Explainability-2025-02-22-20-51-53-309


................
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
We are not in a supported iso region, /bin/sh exiting gracefully with no changes.
INFO:sagemaker-clarify-processing:Starting SageMaker Clarify Processing job
INFO:analyzer.data_loading.data_loader_util:Analysis config path: /opt/ml/processing/input/config/analysis_config.json
INFO:analyzer.data_loading.data_loader_util:Analysis result path: /opt/ml/processing/output
INFO:analyzer.data_loading.data_loader_util:This host is algo-1.
INFO:analyzer.data_loading.data_loader_util:This host is the leader.
INFO:analyzer.data_loading.data_loader_util:Number of hosts in the cluster is 1.
INFO:sagemaker-clarify-processing:Running Python / Pandas based analyzer.
INFO:analyzer.data_loading.data_lo

## Train the model again with the top 20 predictors
#### Get the list of top 20 predictors

In [43]:
# Replace with your actual bucket name and prefix used in explainability_output_path
# bucket = "your-bucket-name"
# prefix = "your-prefix"  # e.g., the folder structure used in your explainability_output_path

# Construct the S3 key for the output file
key = f"{prefix}/clarify-explainability/analysis.json"

# Initialize boto3 client for S3 and download the JSON report
s3 = boto3.client("s3")
response = s3.get_object(Bucket=bucket, Key=key)
content = response["Body"].read().decode("utf-8")
report = json.loads(content)

# Navigate to the global SHAP values dictionary
global_shap = report["explanations"]["kernel_shap"]["label0"]["global_shap_values"]

# Sort the items by the SHAP value in descending order and take the top 20
top_20 = sorted(global_shap.items(), key=lambda item: item[1], reverse=True)[:20]

# Extract just the feature names
top_20_features = [feature for feature, value in top_20]

# Print
print("Top 20 features with the highest mean absolute SHAP values:")
for feature in top_20_features:
    print(feature)


INFO:botocore.httpchecksum:Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].


Top 20 features with the highest mean absolute SHAP values:
ST059Q02JA
SC211Q03JA
STUDYHMW
ST268Q07JA
ST293Q04JA
IC180Q08JA
ST272Q01JA
ST322Q01JA
ST268Q01JA
ST296Q01JA
SC064Q03TA
EXPECEDU
SC188Q05JA
RATCMP1
ST016Q01NA
IC184Q03JA
IC184Q02JA
IC180Q01JA
ICTDISTR
ST004D01T
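
The selection step above amounts to ranking a dictionary by value and keeping the first `k` keys. A minimal sketch, with made-up feature names and SHAP values (not real results) and a hypothetical `top_k_features` helper:

```python
# Hypothetical global SHAP values (mean absolute SHAP per feature); not real results.
global_shap = {"feature_a": 0.12, "feature_b": 0.03, "feature_c": 0.30, "feature_d": 0.07}

def top_k_features(shap_values, k):
    """Return the k feature names with the largest SHAP values, largest first."""
    ranked = sorted(shap_values.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

top_2 = top_k_features(global_shap, 2)
variables_to_keep = ["MATH_Proficient"] + top_2
print(variables_to_keep)  # ['MATH_Proficient', 'feature_c', 'feature_a']
```

Sorting on the raw value is appropriate here because `agg_method="mean_abs"` yields non-negative values. With a signed aggregation, sorting on the absolute value would likely be the safer choice.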


In [44]:
# Make a subset of the training dataset (with only 20 predictors)
variables_to_keep = ["MATH_Proficient"] + top_20_features
train_data_small = train_data[variables_to_keep]
print(train_data_small.shape)
train_data_small.head()

(4517, 21)


Unnamed: 0,MATH_Proficient,ST059Q02JA,SC211Q03JA,STUDYHMW,ST268Q07JA,ST293Q04JA,IC180Q08JA,ST272Q01JA,ST322Q01JA,ST268Q01JA,ST296Q01JA,SC064Q03TA,EXPECEDU,SC188Q05JA,RATCMP1,ST016Q01NA,IC184Q03JA,IC184Q02JA,IC180Q01JA,ICTDISTR,ST004D01T
302980,1,50.0,20.0,10.0,3.0,,3.0,2.0,5.0,2.0,5.0,92.0,7.0,4.0,0.3141,8.0,1.0,2.0,3.0,11.0,1.0
307302,1,35.0,0.0,7.0,4.0,2.0,1.0,5.0,5.0,1.0,6.0,99.0,7.0,,0.3361,5.0,1.0,4.0,2.0,3.0,1.0
307349,1,31.0,0.0,10.0,4.0,1.0,1.0,6.0,5.0,4.0,3.0,10.0,7.0,2.0,0.3636,9.0,,1.0,3.0,15.0,2.0
307320,1,34.0,2.0,5.0,3.0,2.0,3.0,7.0,5.0,1.0,3.0,100.0,7.0,3.0,0.3361,6.0,1.0,1.0,3.0,5.0,2.0
302365,0,0.0,2.0,1.0,4.0,4.0,4.0,1.0,,4.0,1.0,4.0,,1.0,0.2113,10.0,,,4.0,16.0,2.0


In [45]:
# Save train dataset 
train_data_small.to_csv('train_small.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train_small/train_small.csv')).upload_file('train_small.csv')

In [46]:
# Make a subset of the validation dataset (with only 20 predictors)
validation_data_small = validation_data[variables_to_keep]
print(validation_data_small.shape)
validation_data_small.head()

(968, 21)


Unnamed: 0,MATH_Proficient,ST059Q02JA,SC211Q03JA,STUDYHMW,ST268Q07JA,ST293Q04JA,IC180Q08JA,ST272Q01JA,ST322Q01JA,ST268Q01JA,ST296Q01JA,SC064Q03TA,EXPECEDU,SC188Q05JA,RATCMP1,ST016Q01NA,IC184Q03JA,IC184Q02JA,IC180Q01JA,ICTDISTR,ST004D01T
302894,1,30.0,0.0,1.0,3.0,2.0,2.0,8.0,5.0,1.0,3.0,10.0,3.0,3.0,0.566,5.0,1.0,1.0,3.0,14.0,1.0
305312,1,4.0,11.0,10.0,2.0,,3.0,8.0,,2.0,1.0,7.0,4.0,3.0,0.1348,9.0,3.0,3.0,2.0,9.0,2.0
306963,1,30.0,20.0,6.0,3.0,1.0,3.0,3.0,5.0,3.0,1.0,2.0,3.0,3.0,0.5426,5.0,1.0,1.0,3.0,,1.0
302965,1,43.0,0.0,6.0,4.0,1.0,1.0,7.0,5.0,2.0,3.0,55.0,7.0,3.0,1.0,3.0,1.0,1.0,3.0,12.0,1.0
302211,1,40.0,11.0,3.0,3.0,,2.0,8.0,,3.0,5.0,7.0,4.0,3.0,0.1348,9.0,1.0,1.0,2.0,7.0,2.0


In [47]:
# Save validation dataset 
validation_data_small.to_csv('validation_small.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation_small/validation_small.csv')).upload_file('validation_small.csv')

#### Train the model using the hyperparameters from the best model

In [48]:
# cell 15
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


In [49]:
# cell 16
s3_input_train_small = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train_small'.format(bucket, prefix), content_type='csv')
s3_input_validation_small = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation_small/'.format(bucket, prefix), content_type='csv')

In [50]:
# cell 17
sess = sagemaker.Session()

xgb_small = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

xgb_small.set_hyperparameters(# seed=42,       # Random seed (turned off because we are using a different seed per iteration)  
                        seed_per_iteration=True,   # Different seed for each boosting iteration, can prevent overfitting
                        early_stopping_rounds=10,   # Stop if AUC doesn’t improve for 10 rounds
                        # scale_pos_weight=imbalance_ratio, # Helps when outcome is imbalanced (but specificity decreased)
                        objective='binary:logistic',
                        eval_metric='auc',
                        num_round=best_num_round,   # Number of boosting rounds for training                    
                        eta=best_eta,   # Learning rate, lower value is more robust to overfitting but requires more boosting rounds
                        max_depth=best_max_depth,   # Deeper trees can model more complex patterns but may overfit
                        min_child_weight=best_min_child_weight,   # Higher value ensures leaf nodes have sufficient samples, preventing overfitting
                        gamma=best_gamma,    # Higher values make it harder to partition a leaf node, making the algorithm more conservative
                        subsample=best_subsample,   # Fraction of training instances to use for each boosting round, < 1 can prevent overfitting
                        alpha=best_alpha   # L1 regularization term on weights, higher value leads to more regularization                       
                        )

xgb_small.fit({'train': s3_input_train_small, 'validation': s3_input_validation_small}) 

INFO:sagemaker:Creating training-job with name: xgboost-2025-02-22-22-49-58-017


2025-02-22 22:49:59 Starting - Starting the training job...
2025-02-22 22:50:14 Starting - Preparing the instances for training...
2025-02-22 22:50:40 Downloading - Downloading input data...
2025-02-22 22:51:10 Downloading - Downloading the training image......
2025-02-22 22:52:31 Training - Training image download completed. Training in progress.
2025-02-22 22:52:31 Uploading - Uploading generated training model
Arguments: train
[2025-02-22:22:52:23:INFO] Running standalone xgboost training.
[2025-02-22:22:52:23:INFO] File size need to be processed in the node: 0.43mb. Available memory size in the node: 8559.36mb
[2025-02-22:22:52:23:INFO] Determined delimiter of CSV input is ','
[22:52:23] S3DistributionType set as FullyReplicated
[22:52:23] 4517x20 matrix with 90340 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2025-02-22:22:52:23:INFO] Determined delimiter of CSV input is ','
[2

## Deploy the model
Now that we've trained the `xgboost` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

In [51]:
test_data_small = test_data[variables_to_keep]

In [52]:
# cell 18
xgb_small_predictor = xgb_small.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: xgboost-2025-02-22-22-53-15-075
INFO:sagemaker:Creating endpoint-config with name xgboost-2025-02-22-22-53-15-075
INFO:sagemaker:Creating endpoint with name xgboost-2025-02-22-22-53-15-075


------!

In [53]:
# cell 19
xgb_small_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll reuse the `predict` function defined earlier to score the test set in mini-batches through the new endpoint (notice, we again drop the target variable from our dataset first).

In [54]:
# Make predictions - Probabilities
predictions_small = predict(test_data_small.drop(['MATH_Proficient'], axis=1).to_numpy(), xgb_small_predictor)

In [None]:
# Save the predicted values for the test set
predicted_values_small = predictions_small
predicted_values_small = pd.DataFrame(predicted_values_small, columns=['Predicted Values'])
predicted_values_small.to_csv('predicted_values_small.csv', index=False, header=False)

In [55]:
# Clean up
xgb_small_predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting endpoint configuration with name: xgboost-2025-02-22-22-53-15-075
INFO:sagemaker:Deleting endpoint with name: xgboost-2025-02-22-22-53-15-075


## Summary

#### Hyperparameters

In [57]:
print("seed_per_iteration: True")
print("early_stopping_rounds: 10")
print("objective: binary:logistic")
print("eval_metric: auc")
print("\nnum_round: ", best_num_round)
print("eta: ", round(best_eta, 2))
print("max_depth: ", best_max_depth)
print("min_child_weight: ", best_min_child_weight)
print("gamma: ", best_gamma)
print("subsample: ", round(best_subsample, 2))
print("alpha: ", best_alpha)


seed_per_iteration: True
early_stopping_rounds: 10
objective: binary:logistic
eval_metric: auc

num_round:  252
eta:  0.1
max_depth:  3
min_child_weight:  18
gamma:  4
subsample:  0.71
alpha:  1


#### Number of students not proficient in Math

In [71]:
#print("Students who are proficient: ", proficient_n)
print("Students who are NOT proficient in Math: ", not_proficient_n, "(", not_proficient_p, "%)")

Students who are NOT proficient in Math:  987 ( 15.3 %)


#### Model performance (model with all the predictors)

In [74]:
suggested_threshold = (100 - not_proficient_p)/100
print("Suggested threshold:", round(suggested_threshold, 2))

Suggested threshold: 0.85


***Adjust the threshold for the FINAL PREDICTIONS if necessary!!*** 

The model predicts a student as Math proficient when the predicted probability is at or above this threshold. (Raising the threshold above 0.5 reduces the number of students predicted as "Math proficient", among both students who are actually proficient and those who are not.)
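
The effect of the threshold can be seen on a small example. The probabilities and labels below are made up for illustration and the helper names are hypothetical. The suggested threshold is recomputed from the 15.3% share of non-proficient students reported above.

```python
# Hypothetical predicted probabilities and true labels (1 = proficient); not real results.
probabilities = [0.95, 0.90, 0.87, 0.80, 0.60, 0.40]
actuals = [1, 1, 0, 1, 0, 0]

def classify(probabilities, threshold):
    """Label a case 1 when its probability is at or above the threshold."""
    return [int(p >= threshold) for p in probabilities]

def positives(probabilities, actuals, threshold):
    """Return (true positives, false positives) at the given threshold."""
    predicted = classify(probabilities, threshold)
    tp = sum(1 for p, a in zip(predicted, actuals) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actuals) if p == 1 and a == 0)
    return tp, fp

# Suggested threshold from the class balance: 15.3% of students are not proficient.
suggested_threshold = (100 - 15.3) / 100

print(round(suggested_threshold, 2))                             # 0.85
print(positives(probabilities, actuals, 0.5))                    # (3, 2)
print(positives(probabilities, actuals, suggested_threshold))    # (2, 1)
```

Moving from 0.5 to the suggested threshold removes one false positive but also one true positive, which is the trade-off between specificity and recall that the final threshold has to balance.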

In [44]:
threshold = 0.863

print("Threshold:", threshold)

Threshold: 0.863


In [32]:
import pandas as pd
import numpy as np

# Read in the real values
real_values = pd.read_csv('real_values.csv', usecols=[0], header=None)
real_values = real_values.values.ravel()

# Read in the predicted values (using the full model)
predicted_values_full = pd.read_csv('predicted_values_full.csv', usecols=[0], header=None)
predicted_values_full = predicted_values_full.values.ravel()

In [43]:
cm = pd.crosstab(index=real_values, 
                 columns=np.round( (predicted_values_full >= threshold).astype(int) ), 
                 rownames=['actuals'], 
                 colnames=['predictions'])

TN = cm.loc[0.0, 0.0]
FP = cm.loc[0.0, 1.0]
FN = cm.loc[1.0, 0.0]
TP = cm.loc[1.0, 1.0]

accuracy = (TP + TN) / (TP + TN + FP + FN) * 100
precision = TP / (TP + FP) * 100 if (TP + FP) > 0 else 0
recall = TP / (TP + FN) * 100 if (TP + FN) > 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
specificity = TN / (TN + FP) * 100 if (TN + FP) > 0 else 0

print("MODEL USING ALL FEATURES \n")
print(cm)

print("\nAccuracy: {:.1f}".format(accuracy))
print("F1 Score: {:.1f}".format(f1_score))
print("Precision: {:.1f}".format(precision))
print("Recall: {:.1f}".format(recall))
print("Specificity: {:.1f}".format(specificity))

MODEL USING ALL FEATURES 

predictions    0    1
actuals              
0            120   29
1            156  664

Accuracy: 80.9
F1 Score: 87.8
Precision: 95.8
Recall: 81.0
Specificity: 80.5


#### Model performance (model with 20 predictors)

In [34]:
# Read in the predicted values (using 20 predictors)
predicted_values_small = pd.read_csv('predicted_values_small.csv', usecols=[0], header=None)
predicted_values_small = predicted_values_small.values.ravel()

In [41]:
# Cross-tabulate actual labels against thresholded predictions (1 = Math proficient)
cm_small = pd.crosstab(index=real_values,
                       columns=(predicted_values_small >= threshold).astype(int),
                       rownames=['actuals'],
                       colnames=['predictions'])

TN_small = cm_small.loc[0.0, 0.0]
FP_small = cm_small.loc[0.0, 1.0]
FN_small = cm_small.loc[1.0, 0.0]
TP_small = cm_small.loc[1.0, 1.0]

accuracy_small = (TP_small + TN_small) / (TP_small + TN_small + FP_small + FN_small) * 100
precision_small = TP_small / (TP_small + FP_small) * 100 if (TP_small + FP_small) > 0 else 0
recall_small = TP_small / (TP_small + FN_small) * 100 if (TP_small + FN_small) > 0 else 0
f1_score_small = 2 * (precision_small * recall_small) / (precision_small + recall_small) if (precision_small + recall_small) > 0 else 0
specificity_small = TN_small / (TN_small + FP_small) * 100 if (TN_small + FP_small) > 0 else 0

print("MODEL USING 20 FEATURES \n")
print(cm_small)

print("\nAccuracy: {:.1f}".format(accuracy_small))
print("F1 Score: {:.1f}".format(f1_score_small))
print("Precision: {:.1f}".format(precision_small))
print("Recall: {:.1f}".format(recall_small))
print("Specificity: {:.1f}".format(specificity_small))

MODEL USING 20 FEATURES 

predictions    0    1
actuals              
0            120   29
1            170  650

Accuracy: 79.5
F1 Score: 86.7
Precision: 95.7
Recall: 79.3
Specificity: 80.5
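
The two evaluation cells above repeat the same formulas. As a minimal sketch, the metrics could be computed by one helper and compared side by side. The function name `classification_metrics` is hypothetical (it is not part of the notebook); the counts are copied from the two confusion matrices printed above.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return accuracy, F1, precision, recall and specificity as percentages."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) * 100 if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) * 100 if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {
        "accuracy": (tp + tn) / total * 100 if total > 0 else 0.0,
        "f1_score": f1,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp) * 100 if (tn + fp) > 0 else 0.0,
    }

# Counts taken from the confusion matrices printed above
full = classification_metrics(tn=120, fp=29, fn=156, tp=664)
small = classification_metrics(tn=120, fp=29, fn=170, tp=650)

for name in full:
    diff = small[name] - full[name]
    print(f"{name:<12} full={full[name]:5.1f}  small={small[name]:5.1f}  diff={diff:+.1f}")
```

With these counts, the 20-feature model misclassifies 14 more proficient students than the full model (170 false negatives versus 156), while the row for non-proficient students is identical, which is why specificity is unchanged and recall drops.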


#### Top 20 features

In [85]:
pd.set_option('display.max_colwidth', None)
from IPython.display import display, Markdown

# Filter the DataFrame to only include rows where Variable_name is in top_20_features
top_20_dictionary = dictionary[dictionary["Variable_name"].isin(top_20_features)]
top_20_table = top_20_dictionary.set_index("Variable_name").loc[top_20_features].reset_index()
display(Markdown(top_20_table.to_markdown()))

|    | Variable_name   | Variable_label                                                                                                                          |
|---:|:----------------|:----------------------------------------------------------------------------------------------------------------------------------------|
|  0 | ST059Q02JA      | Total number of [class periods] per week for all subjects, including mathematics                                                        |
|  1 | SC211Q03JA      | Percentage [15-year-old modal grade] students who: Students from socioeconomically disadvantaged homes                                  |
|  2 | STUDYHMW        | Studying for school or homework before or after school                                                                                  |
|  3 | ST268Q07JA      | Agree/disagree: I want to do well in my mathematics class.                                                                              |
|  4 | ST293Q04JA      | This school year, how often: I gave up when I did not understand the mathematics material that was being taught.                        |
|  5 | IC180Q08JA      | Agree/disagree: I share made-up information on social networks without flagging its inaccuracy.                                         |
|  6 | ST272Q01JA      | On 1-10 scale, rate quality of mathematics instruction this school year? Quality of mathematics instruction?                            |
|  7 | ST322Q01JA      | How often: I turn off notifications from social networks and apps on my [digital devices] during class.                                 |
|  8 | ST268Q01JA      | Agree/disagree: Mathematics is one of my favourite subjects.                                                                            |
|  9 | ST296Q01JA      | How much time spent on homework in: Mathematics homework                                                                                |
| 10 | SC064Q03TA      | Proportion parent/guardians who: Participated in local school government (e.g. parent council or school management committee)           |
| 11 | EXPECEDU        | Highest expected educational level                                                                                                      |
| 12 | SC188Q05JA      | Extent structures your school's math programme: Results from [local or municipal] assessments                                           |
| 13 | RATCMP1         | Availability of computers                                                                                                               |
| 14 | ST016Q01NA      | Overall, how satisfied are you with your life as a whole these days?                                                                    |
| 15 | IC184Q03JA      | How often: I use [digital resources] for simulations and modelling (e.g. [GeoGebra], [NetLogo]), virtual laboratories (e.g. [Labster]). |
| 16 | IC184Q02JA      | How often: I use [digital resources] to solve equations.                                                                                |
| 17 | IC180Q01JA      | Agree/disagree: I trust what I read online.                                                                                             |
| 18 | ICTDISTR        | Distress from online content and cyberbullying                                                                                          |
| 19 | ST004D01T       | Student (Standardized) Gender                                                                                                           |
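
The lookup cell above filters with `isin` and then re-selects with `.loc[top_20_features]`. A small sketch of why the second step matters, using toy stand-ins (`toy_dictionary` and `toy_top_features` are hypothetical names with made-up values): `isin` alone keeps the dictionary's row order, whereas `.loc` with the list returns rows in the list's order, which is presumably the feature ranking.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's `dictionary` and `top_20_features`
toy_dictionary = pd.DataFrame({
    "Variable_name": ["A01", "B02", "C03", "D04"],
    "Variable_label": ["label A", "label B", "label C", "label D"],
})
toy_top_features = ["C03", "A01"]  # assumed to be in ranking order

# Step 1: keep only the listed variables (rows stay in dictionary order)
filtered = toy_dictionary[toy_dictionary["Variable_name"].isin(toy_top_features)]

# Step 2: reorder the rows to follow the list
ordered = filtered.set_index("Variable_name").loc[toy_top_features].reset_index()

print(filtered["Variable_name"].tolist())
print(ordered["Variable_name"].tolist())
```

Note that `.loc` with a list raises a `KeyError` if any listed name is missing from the dictionary, so this pattern also acts as a check that every selected feature has a label.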