# PISA 2022 Amazon SageMaker KNN

More info on SageMaker Immersion Day: [Workshop Link](https://catalog.us-east-1.prod.workshops.aws/workshops/63069e26-921c-4ce1-9cc7-dd882ff62575/en-US/lab2-model-training/pro-code)


### ***Change country name below!***

In [1]:
country_name = 'United_States'

In [2]:
country_name_edited = country_name.replace("_", "-")

In [3]:
# cell 02
import sagemaker
bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/knn-Elijah-' + country_name_edited
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


Now let's bring in the Python libraries that we'll use throughout the analysis.

In [4]:
# cell 03
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
import zipfile                                    # For working with zip archives

In [5]:
bucket_name = "sagemaker-us-west-2-986030204467"

#### Download PISA 2022 Prepared Dataset

This dataset is the output of our data cleaning and merging notebook, available [here](https://7z4vtvpqcoxouiu.studio.us-west-2.sagemaker.aws/jupyterlab/default/lab/tree/RTC%3Amids-capstone/notebooks/eda/Data_merging.ipynb).


In [6]:
%%time 

# cell 06

# Define local file path
local_file_path = "../eda/with-wle-latent/PISA_cleaned_dataset.csv"  # Change as needed

# Define S3 details
bucket_name = "sagemaker-us-west-2-986030204467"
file_key = "capstone/testfiles/PISA_cleaned_dataset.csv"

# Check if the file exists locally
if os.path.exists(local_file_path):
    print("📂 Loading data from local file...")
    data = pd.read_csv(local_file_path, usecols=None)
    
else:
    print("☁️ Downloading data from S3...")
    
    # Create S3 client
    s3_client = boto3.client("s3")

    # Download the file from S3
    response = s3_client.get_object(Bucket=bucket_name, Key=file_key)

    # Read the file into pandas DataFrame
    data = pd.read_csv(response["Body"], usecols=None)

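    # Save a local copy for future use (create the target folder if it is missing)
    os.makedirs(os.path.dirname(local_file_path) or ".", exist_ok=True)
    data.to_csv(local_file_path, index=False)
    print(f"✅ File saved locally as {local_file_path}")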

# Display first few rows
#data.head()

pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
data

📂 Loading data from local file...
CPU times: user 22.3 s, sys: 3.29 s, total: 25.6 s
Wall time: 25.6 s


Preview of `data` (first eight columns and the last column; `...` marks omitted rows and columns):

| Unnamed: 0 | CNT | CNTSCHID | CNTSTUID | MATH_Proficient | SISCO | ST347Q01JA | ST347Q02JA | ... | LANGN_922 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 800282 | 800001 | 0 | NaN | NaN | NaN | ... | 0 |
| 1 | Albania | 800115 | 800002 | 0 | NaN | NaN | NaN | ... | 0 |
| 2 | Albania | 800242 | 800003 | 0 | NaN | NaN | NaN | ... | 0 |
| 3 | Albania | 800245 | 800005 | 0 | 1.0 | 6.0 | 1.0 | ... | 0 |
| 4 | Albania | 800285 | 800006 | 1 | 1.0 | 4.0 | 6.0 | ... | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 591852 | Uzbekistan | 86000120 | 86007488 | 0 | 1.0 | 2.0 | 1.0 | ... | 0 |
| 591853 | Uzbekistan | 86000140 | 86007489 | 0 | NaN | NaN | NaN | ... | 0 |
| 591854 | Uzbekistan | 86000024 | 86007490 | 0 | 1.0 | 1.0 | 1.0 | ... | 0 |
| 591855 | Uzbekistan | 86000174 | 86007491 | 0 | NaN | NaN | NaN | ... | 0 |
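
Loading every row takes roughly 25 seconds here, although only one country is modeled later. As a minimal sketch, assuming the CSV keeps the `CNT` column shown in the output, the file could instead be read in chunks and filtered while it streams. The helper name `load_country_rows` and the default `chunksize` are illustrative choices, not part of the original notebook.

```python
import pandas as pd


def load_country_rows(csv_path, country, chunksize=50_000, country_col="CNT"):
    """Read a large CSV in chunks and keep only the rows for one country.

    Hypothetical helper: the chunk size is an arbitrary starting point.
    """
    kept = []
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        # Keep only the rows whose country column matches the requested name
        kept.append(chunk[chunk[country_col] == country])
    if not kept:
        # No data rows at all: return an empty frame with just the header
        return pd.read_csv(csv_path, nrows=0)
    # The original row index is preserved, as in a boolean filter on the full frame
    return pd.concat(kept)
```

A call such as `load_country_rows(local_file_path, country_name)` would return the same rows as filtering the full frame on `CNT`, while holding only one chunk plus the matching rows in memory at a time.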


#### Download dictionary for the variable names

In [7]:
# Download the file from S3
s3_client = boto3.client("s3")
dictionary_file = s3_client.get_object(Bucket=bucket_name, Key="capstone/testfiles/all_vars.csv")

# Read the file into pandas DataFrame
dictionary = pd.read_csv(dictionary_file["Body"], usecols=None)
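
The dictionary can be used to translate cryptic PISA column codes into readable descriptions. The layout of `all_vars.csv` is not shown in this notebook, so the sketch below assumes two hypothetical columns, `variable` and `label`, and runs on a small made-up frame; adjust `name_col` and `label_col` to whatever the real file contains.

```python
import pandas as pd


def describe_variables(dictionary, names, name_col="variable", label_col="label"):
    """Map variable codes to their descriptions.

    The column names are assumptions about the dictionary file's layout.
    """
    lookup = (
        dictionary.drop_duplicates(subset=name_col)  # keep one label per code
        .set_index(name_col)[label_col]
    )
    # Codes missing from the dictionary get a visible placeholder
    return {name: lookup.get(name, "(no description found)") for name in names}


# Made-up rows for illustration only; the real labels come from all_vars.csv
demo_dictionary = pd.DataFrame({
    "variable": ["ESCS", "HOMEPOS"],
    "label": ["illustrative label for ESCS", "illustrative label for HOMEPOS"],
})
print(describe_variables(demo_dictionary, ["ESCS", "ANXMAT"]))
```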

#### Subset the data to a specific COUNTRY

In [8]:
model_data = data[data['CNT'] == country_name]
print(model_data.shape)
model_data.head()

(4552, 570)


Preview of `model_data.head()` (first eight columns and the last column; `...` marks omitted columns):

| Unnamed: 0 | CNT | CNTSCHID | CNTSTUID | MATH_Proficient | SISCO | ST347Q01JA | ST347Q02JA | ... | LANGN_922 |
|---|---|---|---|---|---|---|---|---|---|
| 573394 | United_States | 84000060 | 84000002 | 1 | 1.0 | 4.0 | 1.0 | ... | 0 |
| 573395 | United_States | 84000055 | 84000003 | 1 | 0.0 | 2.0 | 1.0 | ... | 0 |
| 573396 | United_States | 84000121 | 84000004 | 1 | 1.0 | 5.0 | 1.0 | ... | 0 |
| 573397 | United_States | 84000013 | 84000005 | 1 | 1.0 | 5.0 | 1.0 | ... | 0 |
| 573398 | United_States | 84000010 | 84000006 | 1 | 1.0 | 4.0 | 1.0 | ... | 0 |


#### Take out additional variables

In [9]:
# Define the list of columns to drop
columns_to_remove = ["CNT", "CNTSCHID", "CNTSTUID", "OECD",
    "HOMEPOS", "RELATST", "BELONG", "BULLIED", "FEELSAFE", "SCHRISK", "PERSEVAGR", "CURIOAGR", 
    "COOPAGR", "EMPATAGR", "ASSERAGR", "STRESAGR", "EMOCOAGR", "GROSAGR", "INFOSEEK", "FAMSUP", 
    "DISCLIM", "TEACHSUP", "COGACRCO", "COGACMCO", "EXPOFA", "EXPO21ST", "MATHEFF", "MATHEF21", 
    "FAMCON", "ANXMAT", "MATHPERS", "CREATEFF", "CREATSCH", "CREATFAM", "CREATAS", "CREATOOS", 
    "CREATOP", "OPENART", "IMAGINE", "SCHSUST", "LEARRES", "PROBSELF", "FAMSUPSL", "FEELLAH", 
    "SDLEFF", "ICTRES", "FLSCHOOL", "FLMULTSB", "FLFAMILY", "ACCESSFP", "FLCONFIN", "FLCONICT", 
    "ACCESSFA", "ATTCONFM", "FRINFLFM", "ICTSCH", "ICTHOME", "ICTQUAL", "ICTSUBJ", "ICTENQ", 
    "ICTFEED", "ICTOUT", "ICTWKDY", "ICTWKEND", "ICTREG", "ICTINFO", "ICTEFFIC", "BODYIMA", 
    "SOCONPA", "LIFESAT", "PSYCHSYM", "SOCCON", "EXPWB", "CURSUPP", "PQMIMP", "PQMCAR", 
    "PARINVOL", "PQSCHOOL", "PASCHPOL", "ATTIMMP", "CREATHME", "CREATACT", "CREATOPN", 
    "CREATOR", "SCHAUTO", "TCHPART", "EDULEAD", "INSTLEAD", "ENCOURPG", "DIGDVPOL", "TEAFDBK", 
    "MTTRAIN", "DMCVIEWS", "NEGSCLIM", "STAFFSHORT", "EDUSHORT", "STUBEHA", "TEACHBEHA", 
    "STDTEST", "TDTEST", "ALLACTIV", "BCREATSC", "CREENVSC", "ACTCRESC", "OPENCUL", 
    "PROBSCRI", "SCPREPBP", "SCPREPAP", "DIGPREP", 
    "ESCS", "BMMJ1", "BFMJ2", "EFFORT1", "EFFORT2", "Option_UH", "SC209Q04JA", "SC209Q05JA", "SC209Q06JA"
]

# Drop the columns above
model_data = model_data.drop(columns=columns_to_remove, errors='ignore')  # `errors='ignore'` prevents errors if a column isn't found
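
Because `errors='ignore'` silently skips names that are not in the DataFrame, a misspelled entry in `columns_to_remove` would go unnoticed. A minimal sketch of a pre-check is shown here on a small stand-in DataFrame; the toy values and the helper name `split_present_missing` are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

def split_present_missing(df, columns):
    """Return (present, missing) lists for the requested column names."""
    present = [c for c in columns if c in df.columns]
    missing = [c for c in columns if c not in df.columns]
    return present, missing

# Toy stand-in for model_data (hypothetical values)
toy = pd.DataFrame({"CNT": ["USA", "USA"], "ESCS": [0.1, -0.3], "GRADE": [0.0, 1.0]})
to_remove = ["CNT", "ESCS", "HOMEPOS"]  # "HOMEPOS" is absent from the toy frame

present, missing = split_present_missing(toy, to_remove)
print("Will drop:", present)
print("Not found (skipped):", missing)

# Same call pattern as the notebook cell above
toy_reduced = toy.drop(columns=to_remove, errors="ignore")
print(toy_reduced.columns.tolist())
```

Printing the skipped names before dropping makes it easy to spot a variable that was renamed or removed upstream.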


In [10]:
print(model_data.shape)
model_data.head()

(4552, 452)


Unnamed: 0,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,ST259Q01JA,WB164Q01HA,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,PAREDINT,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q09TA,SC200Q01JA,SC200Q02JA,SC200Q03JA,SC200Q04JA,SC224Q01JA,RATCMP1,RATCMP2,RATTAB,SCHSEL,SCHLTYPE_1,SCHLTYPE_2,SCHLTYPE_3,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC042Q01TA,SC042Q02TA,SC214Q01JA,SC214Q02JA,SC214Q03JA,SC215Q01JA,SC215Q0
2JA,SC215Q03JA,SC215Q04JA,SC215Q05JA,SC215Q06JA,SC215Q07JA,SC215Q08JA,SC216Q06JA,SC216Q07JA,SC216Q08JA,SC216Q09JA,SC217Q01JA,SC217Q02JA,SC217Q03JA,SC217Q04JA,SC217Q05JA,SC217Q06JA,SC217Q07JA,SC217Q08JA,SC217Q10JA,SC218Q01JA,SC219Q01JA,SC220Q01JA,SC221Q01JA,SC221Q02JA,SC221Q03JA,SC221Q04JA,SCSUPRTED,SCSUPRT,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC037Q08TA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANG
N_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
573394,1,1.0,4.0,1.0,0,1,0,0,0,1.0,3.0,,,,,,,2.0,9.0,,2.0,0.0,0.0,9.0,7.0,6.0,,1.0,0.0,3.0,,0.0,16.0,,,4.0,0.0,3.0,2.0,35.0,3.0,,2.0,,,,,5.0,1.0,,1.0,3.0,3.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,10.0,,0.0,2.0,5.0,1.0,3.0,1.0,0.0,0.0,0.0,2.0,2.0,2.0,1.0,55.0,0,0,1,0,1,0,0,0,0,4.0,4.0,3.0,3.0,4.0,3.0,2.0,4.0,3.0,4.0,1.0,1.0,1.0,1.0,,,1.0,2.0,1.0,1.0,100.0,33.0,2.0,0,0,0,1,2.0,40.0,15.0,15.0,40.0,0.0,15.0,45.0,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,3.0,0.1884,1.0,4.0612,1.0,0,0,1,3.0,2.0,5.0,5.0,2.0,2.0,5.0,5.0,2.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573395,1,0.0,2.0,1.0,0,1,0,0,0,1.0,3.0,,,,,,,,9.0,,2.0,0.0,0.0,4.0,7.0,6.0,,1.0,0.0,1.0,,0.0,12.0,,,3.0,1.0,2.0,1.0,45.0,1.0,,6.0,5.0,,,,5.0,1.0,,2.0,3.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,10.0,,0.0,0.0,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573396,1,1.0,5.0,1.0,0,1,0,0,0,1.0,3.0,,,,,,,2.0,5.0,,2.0,0.0,0.0,8.0,7.0,6.0,6.0,1.0,0.0,3.0,,0.0,16.0,,,4.0,0.0,2.0,3.0,35.0,4.0,,9.0,4.0,4.0,3.0,4.0,3.0,3.0,,4.0,4.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,5.0,,0.0,5.0,4.0,14.0,10.0,19.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,90.0,1,0,0,1,0,0,1,0,0,4.0,4.0,3.0,3.0,1.0,3.0,3.0,3.0,4.0,4.0,1.0,1.0,1.0,1.0,85.0,15.0,1.0,1.0,2.0,2.0,100.0,23.0,5.0,0,0,0,1,2.0,10.0,5.0,25.0,25.0,10.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,0.0,1.0,0,0,1,2.0,2.0,5.0,1.0,2.0,2.0,5.0,1.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573397,1,1.0,5.0,1.0,0,0,1,0,0,1.0,3.0,,,,,,,3.0,6.0,,1.0,0.0,0.0,4.0,7.0,6.0,9.0,1.0,0.0,1.0,,0.0,12.0,,,4.0,1.0,3.0,3.0,35.0,4.0,,3.0,4.0,4.0,4.0,4.0,5.0,2.0,,1.0,1.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,2.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,10.0,10.0,3.0,1.0,13.0,10.0,0.0,8.0,0.0,1.0,1.0,1.0,1.0,50.0,0,1,0,0,0,1,0,0,1,4.0,4.0,1.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,3.0,2.0,1.0,1.0,84.0,16.0,2.0,2.0,1.0,1.0,100.0,23.0,2.0,0,0,0,0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,0.1163,1.0,0,0,1,2.0,1.0,4.0,5.0,2.0,1.0,5.0,5.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573398,1,1.0,4.0,1.0,0,0,0,0,1,2.0,4.0,,,,,,,2.0,5.0,,1.0,1.0,0.0,9.0,7.0,6.0,5.0,1.0,2.0,2.0,,0.0,12.0,,,4.0,0.0,3.0,1.0,20.0,2.0,,1.0,5.0,5.0,5.0,,4.0,1.0,,1.0,2.0,3.0,,0.0,0.0,0.0,1.0,1.0,0.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,5.0,,5.0,0.0,5.0,60.0,35.0,70.0,30.0,30.0,20.0,2.0,1.0,1.0,1.0,90.0,0,1,0,0,1,0,0,1,0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,60.0,40.0,2.0,2.0,2.0,1.0,100.0,28.0,1.0,0,0,0,0,2.0,50.0,75.0,50.0,75.0,50.0,10.0,30.0,90.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,3.0,1.0,1.0,0.0,2.0,0,0,1,2.0,2.0,5.0,1.0,2.0,2.0,5.0,1.0,2.0,2.0,5.0,5.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,3.0,2.0,3.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,9.0,3.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Amazon SageMaker's built-in k-NN algorithm accepts training data in CSV or RecordIO-protobuf format.  **For CSV input, the first column must be the target variable and the file should not include headers.**  Although repetitive, it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.
* `MATH_Proficient`: Is the student proficient in Math? `0` means the student is falling behind (average of the 10 Math plausible values < 420.07); `1` means the student is proficient, which is how the cells below count the two classes.
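
The label definition above can be sketched as follows. This is an illustration only: the column names `PV1MATH` to `PV10MATH`, the constant name `PROFICIENCY_CUTOFF`, and the toy scores are assumptions, and the encoding (1 = proficient, 0 = falling behind) is inferred from how the cells below count `MATH_Proficient == 1` as proficient.

```python
import pandas as pd

PROFICIENCY_CUTOFF = 420.07  # cutoff quoted in the text above

# Assumed names for the 10 Math plausible-value columns
pv_cols = [f"PV{i}MATH" for i in range(1, 11)]

# Two hypothetical students: one below and one above the cutoff
toy = pd.DataFrame([[400.0] * 10, [500.0] * 10], columns=pv_cols)

# Average the plausible values per student, then compare with the cutoff
pv_mean = toy[pv_cols].mean(axis=1)
toy["MATH_Proficient"] = (pv_mean >= PROFICIENCY_CUTOFF).astype(int)

print(toy["MATH_Proficient"].tolist())  # [0, 1]
```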

In [11]:
# Get percent of students not proficient in Math
proficient_n = (model_data['MATH_Proficient'] == 1).sum()
not_proficient_n = (model_data['MATH_Proficient'] == 0).sum()
not_proficient_p = round( not_proficient_n / (not_proficient_n + proficient_n) * 100, 1)
print("Students who are NOT proficient in Math: ", not_proficient_n, "(", not_proficient_p, "%)")

Students who are NOT proficient in Math:  1607 ( 35.3 %)


In [12]:
# Get imbalance ratio 
not_proficient_pp = not_proficient_n / (not_proficient_n + proficient_n)

if not_proficient_pp < 0.5:
    imbalance_ratio = (1 - not_proficient_pp) / not_proficient_pp
else:
    imbalance_ratio = not_proficient_pp / (1 - not_proficient_pp)
    
print("Imbalance ratio:", round(imbalance_ratio,1))

Imbalance ratio: 1.8


In [13]:
# Reorder columns to bring 'MATH_Proficient' first
new_order = ['MATH_Proficient'] + [col for col in model_data.columns if col != 'MATH_Proficient']
model_data = model_data[new_order]

# Get number of features
n_features_original = model_data.shape[1]-1

# Check the shape after reordering (unchanged, since no columns are dropped here)
print(model_data.shape)

model_data.head()

(4552, 452)


Unnamed: 0,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,ST259Q01JA,WB164Q01HA,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,PAREDINT,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q09TA,SC200Q01JA,SC200Q02JA,SC200Q03JA,SC200Q04JA,SC224Q01JA,RATCMP1,RATCMP2,RATTAB,SCHSEL,SCHLTYPE_1,SCHLTYPE_2,SCHLTYPE_3,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC042Q01TA,SC042Q02TA,SC214Q01JA,SC214Q02JA,SC214Q03JA,SC215Q01JA,SC215Q0
2JA,SC215Q03JA,SC215Q04JA,SC215Q05JA,SC215Q06JA,SC215Q07JA,SC215Q08JA,SC216Q06JA,SC216Q07JA,SC216Q08JA,SC216Q09JA,SC217Q01JA,SC217Q02JA,SC217Q03JA,SC217Q04JA,SC217Q05JA,SC217Q06JA,SC217Q07JA,SC217Q08JA,SC217Q10JA,SC218Q01JA,SC219Q01JA,SC220Q01JA,SC221Q01JA,SC221Q02JA,SC221Q03JA,SC221Q04JA,SCSUPRTED,SCSUPRT,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC037Q08TA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANG
N_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
573394,1,1.0,4.0,1.0,0,1,0,0,0,1.0,3.0,,,,,,,2.0,9.0,,2.0,0.0,0.0,9.0,7.0,6.0,,1.0,0.0,3.0,,0.0,16.0,,,4.0,0.0,3.0,2.0,35.0,3.0,,2.0,,,,,5.0,1.0,,1.0,3.0,3.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,10.0,,0.0,2.0,5.0,1.0,3.0,1.0,0.0,0.0,0.0,2.0,2.0,2.0,1.0,55.0,0,0,1,0,1,0,0,0,0,4.0,4.0,3.0,3.0,4.0,3.0,2.0,4.0,3.0,4.0,1.0,1.0,1.0,1.0,,,1.0,2.0,1.0,1.0,100.0,33.0,2.0,0,0,0,1,2.0,40.0,15.0,15.0,40.0,0.0,15.0,45.0,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,1.0,1.0,1.0,3.0,0.1884,1.0,4.0612,1.0,0,0,1,3.0,2.0,5.0,5.0,2.0,2.0,5.0,5.0,2.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573395,1,0.0,2.0,1.0,0,1,0,0,0,1.0,3.0,,,,,,,,9.0,,2.0,0.0,0.0,4.0,7.0,6.0,,1.0,0.0,1.0,,0.0,12.0,,,3.0,1.0,2.0,1.0,45.0,1.0,,6.0,5.0,,,,5.0,1.0,,2.0,3.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,10.0,,0.0,0.0,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573396,1,1.0,5.0,1.0,0,1,0,0,0,1.0,3.0,,,,,,,2.0,5.0,,2.0,0.0,0.0,8.0,7.0,6.0,6.0,1.0,0.0,3.0,,0.0,16.0,,,4.0,0.0,2.0,3.0,35.0,4.0,,9.0,4.0,4.0,3.0,4.0,3.0,3.0,,4.0,4.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,5.0,,0.0,5.0,4.0,14.0,10.0,19.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,90.0,1,0,0,1,0,0,1,0,0,4.0,4.0,3.0,3.0,1.0,3.0,3.0,3.0,4.0,4.0,1.0,1.0,1.0,1.0,85.0,15.0,1.0,1.0,2.0,2.0,100.0,23.0,5.0,0,0,0,1,2.0,10.0,5.0,25.0,25.0,10.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,0.0,1.0,0,0,1,2.0,2.0,5.0,1.0,2.0,2.0,5.0,1.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573397,1,1.0,5.0,1.0,0,0,1,0,0,1.0,3.0,,,,,,,3.0,6.0,,1.0,0.0,0.0,4.0,7.0,6.0,9.0,1.0,0.0,1.0,,0.0,12.0,,,4.0,1.0,3.0,3.0,35.0,4.0,,3.0,4.0,4.0,4.0,4.0,5.0,2.0,,1.0,1.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,2.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,10.0,10.0,3.0,1.0,13.0,10.0,0.0,8.0,0.0,1.0,1.0,1.0,1.0,50.0,0,1,0,0,0,1,0,0,1,4.0,4.0,1.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,3.0,2.0,1.0,1.0,84.0,16.0,2.0,2.0,1.0,1.0,100.0,23.0,2.0,0,0,0,0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,0.1163,1.0,0,0,1,2.0,1.0,4.0,5.0,2.0,1.0,5.0,5.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573398,1,1.0,4.0,1.0,0,0,0,0,1,2.0,4.0,,,,,,,2.0,5.0,,1.0,1.0,0.0,9.0,7.0,6.0,5.0,1.0,2.0,2.0,,0.0,12.0,,,4.0,0.0,3.0,1.0,20.0,2.0,,1.0,5.0,5.0,5.0,,4.0,1.0,,1.0,2.0,3.0,,0.0,0.0,0.0,1.0,1.0,0.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,5.0,,5.0,0.0,5.0,60.0,35.0,70.0,30.0,30.0,20.0,2.0,1.0,1.0,1.0,90.0,0,1,0,0,1,0,0,1,0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,60.0,40.0,2.0,2.0,2.0,1.0,100.0,28.0,1.0,0,0,0,0,2.0,50.0,75.0,50.0,75.0,50.0,10.0,30.0,90.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,3.0,1.0,1.0,0.0,2.0,0,0,1,2.0,2.0,5.0,1.0,2.0,2.0,5.0,1.0,2.0,2.0,5.0,5.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,3.0,2.0,3.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,9.0,3.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Drop columns with more than 20% missing values

***The code below is active in this run: it drops columns with more than 20% missing values. k-NN computes distances between rows, so unlike XGBoost it is unlikely to accept missing values (linear learner rejected them as well). If the algorithm turns out to handle missing values, this cell and the median fill further down can be commented out.***

In [14]:
model_data.dropna(thresh=int(0.8 * len(model_data)), axis=1, inplace=True)
print(model_data.shape)

(4552, 367)
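
To make the `thresh` argument concrete: `dropna(thresh=N, axis=1)` keeps a column only if it has at least `N` non-missing values. The toy DataFrame below is a hypothetical stand-in, not PISA data, and shows that a column with exactly 20% missing values is kept while a sparser one is dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, 4.0, np.nan],     # 80% present -> kept
    "too_sparse":  [1.0, np.nan, np.nan, 4.0, 5.0],  # 60% present -> dropped
    "complete":    [1, 2, 3, 4, 5],                  # 100% present -> kept
})

# Minimum number of non-missing values a column needs in order to survive
min_non_null = int(0.8 * len(toy))

kept = toy.dropna(thresh=min_non_null, axis=1)

print(toy.isna().mean().to_dict())  # share of missing values per column
print(kept.columns.tolist())        # ['mostly_full', 'complete']
```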


In [15]:
n_features_final = model_data.shape[1]-1
print("Number of features (before dropping features with more than 20% missing):", n_features_original)
print("Number of features (after dropping features with more than 20% missing):", n_features_final)
print("Number of features with more than 20% missing:", n_features_original - n_features_final)

Number of features (before dropping features with more than 20% missing): 451
Number of features (after dropping features with more than 20% missing): 366
Number of features with more than 20% missing: 85


In [16]:
feature_dim=n_features_final

#### For the remaining columns (roughly 20% missing values or fewer), fill missing values with the column median

In [17]:
model_data.fillna(model_data.median(), inplace=True)
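
The cell above computes medians over all rows, including rows that later land in the validation and test sets. A common alternative, sketched here on toy data and not what this notebook does, is to compute the medians on the training rows only and reuse them for the held-out rows, so that no information from held-out data enters the preprocessing. The column name `x` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, np.nan, 100.0, np.nan]})

# Pretend the first four rows are training data and the rest are held out
train = toy.iloc[:4].copy()
holdout = toy.iloc[4:].copy()

train_medians = train.median()           # statistics from training rows only
train = train.fillna(train_medians)
holdout = holdout.fillna(train_medians)  # reuse the training medians

print(train["x"].tolist())    # [1.0, 2.0, 3.0, 2.0]
print(holdout["x"].tolist())  # [100.0, 2.0]
```

Whether the difference matters depends on the data; with a few thousand rows and a median statistic the effect is often small.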

We'll randomly split the data into 3 uneven groups.  **The model will be trained on 70% of data, it will then be evaluated on 15% of data to give us an estimate of the accuracy we hope to have on "new" data, and 15% will be held back as a final testing dataset which will be used later on.**

A seed is included in the code so the splits can be replicated!

In [18]:
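# cell 12
# Randomly shuffle the data, then split out the first 70%, the next 15%, and the last 15%
# (positional slicing avoids the warning that np.split emits on a DataFrame in recent pandas versions)
shuffled_data = model_data.sample(frac=1, random_state=1729)
train_end, validation_end = int(0.7 * len(shuffled_data)), int(0.85 * len(shuffled_data))
train_data, validation_data, test_data = shuffled_data.iloc[:train_end], shuffled_data.iloc[train_end:validation_end], shuffled_data.iloc[validation_end:]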



In [19]:
print("Number of rows in FULL dataset:", model_data.shape[0])

train_data_percent = round(train_data.shape[0]/model_data.shape[0] * 100, 0)
print("Number of rows in TRAINING dataset:", train_data.shape[0], "(", train_data_percent, "% )")

validation_data_percent = round(validation_data.shape[0]/model_data.shape[0] * 100, 0)
print("Number of rows in VALIDATION dataset:", validation_data.shape[0], "(", validation_data_percent, "% )")

test_data_percent = round(test_data.shape[0]/model_data.shape[0] * 100, 0)
print("Number of rows in TEST dataset:", test_data.shape[0], "(", test_data_percent, "% )")

Number of rows in FULL dataset: 4552
Number of rows in TRAINING dataset: 3186 ( 70.0 % )
Number of rows in VALIDATION dataset: 683 ( 15.0 % )
Number of rows in TEST dataset: 683 ( 15.0 % )
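
Since the classes are imbalanced, it can be worth checking that a purely random split leaves a similar share of non-proficient students in each part. The sketch below repeats the 70/15/15 split on a hypothetical 100-row stand-in for `model_data` (the `feature` column and the 35/65 label mix are made up for illustration).

```python
import pandas as pd

# Toy stand-in for model_data: 100 rows, 35 of them labeled 0 (hypothetical)
toy = pd.DataFrame({"MATH_Proficient": [0] * 35 + [1] * 65, "feature": range(100)})

# Shuffle with a fixed seed, then cut at 70% and 85% of the rows
shuffled = toy.sample(frac=1, random_state=1729)
train_end, validation_end = int(0.7 * len(shuffled)), int(0.85 * len(shuffled))
splits = {
    "train": shuffled.iloc[:train_end],
    "validation": shuffled.iloc[train_end:validation_end],
    "test": shuffled.iloc[validation_end:],
}

for name, part in splits.items():
    share_not_proficient = (part["MATH_Proficient"] == 0).mean()
    print(f"{name}: {len(part)} rows, {share_not_proficient:.1%} not proficient")
```

If the shares differ a lot between parts, a stratified split is one option to consider.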


In [20]:
# Save train dataset 
train_data.to_csv('train.csv', index=False, header=False)

# Save validation dataset 
validation_data.to_csv('validation.csv', index=False, header=False)
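
A quick way to confirm that a file written this way has the label in the first column and no header row is to read it back. The sketch below does this with a hypothetical three-row frame and a temporary directory, so it does not touch the real `train.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for train_data, with the label already in the first column
toy_train = pd.DataFrame({
    "MATH_Proficient": [1, 0, 1],
    "GRADE": [0.0, 1.0, 0.0],
    "REPEAT": [0.0, 0.0, 1.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")
    # Same options as the notebook cell: no index column, no header row
    toy_train.to_csv(path, index=False, header=False)

    with open(path) as f:
        first_line = f.readline().strip()
    read_back = pd.read_csv(path, header=None)

print(first_line)               # first data row, not column names
print(read_back.shape)          # (3, 3)
print(read_back[0].tolist())    # labels from the first column
```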


In [21]:
# Training data - Saved later to S3 as CSV
print(train_data.shape)
train_data.head()

(3186, 367)


Unnamed: 0,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST259Q01JA,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,IMMIG,TARDYSD,ST226Q01JA,MISSSC,PAREDINT,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,STUDYHMW,IC184Q01JA,IC184Q02JA,ST059Q01TA,ST296Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,ST258Q01JA,ST294Q01JA,ST295Q01JA,EXERPRAC,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q09TA,SC224Q01JA,RATCMP1,RATCMP2,SCHSEL,SCHLTYPE_1,SCHLTYPE_2,SCHLTYPE_3,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC042Q01TA,SC042Q02TA,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC037Q08TA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379
,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
575474,0,0.0,4.0,1.0,0,1,0,0,0,5.0,1.0,0.0,0.0,7.0,7.0,6.0,3.0,0.0,3.0,0.0,14.5,4.0,0.0,3.0,1.0,8.0,1.0,0.0,5.0,5.0,3.0,1.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,6.0,0.0,1.0,1.0,4.0,49.0,12.0,53.0,19.0,37.0,2.0,1.0,1.0,1.0,1.0,60.0,0,0,0,0,0,0,0,0,0,4.0,4.0,3.0,3.0,3.0,3.0,2.0,3.0,3.0,3.0,2.0,1.0,1.0,1.0,75.0,25.0,1.0,2.0,1.0,1.0,28.0,3.0,0,0,0,0,2.0,22.0,30.0,30.0,41.0,10.0,7.0,17.0,60.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,0,0,1,2.0,2.0,5.0,4.0,2.0,2.0,5.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
574994,1,1.0,4.0,2.0,0,0,0,0,1,3.0,1.0,1.0,0.0,7.0,7.0,6.0,1.0,2.0,2.0,0.0,12.0,3.0,1.0,3.0,3.0,8.0,1.0,1.0,4.0,4.0,4.0,1.0,1.0,3.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.0,2.0,1.0,0.0,0.0,3.0,13.0,13.0,41.0,12.0,30.0,0.0,1.0,2.0,2.0,2.0,90.0,1,0,0,1,0,0,1,0,0,4.0,4.0,2.0,4.0,2.0,2.0,1.0,2.0,3.0,3.0,1.0,2.0,1.0,1.0,65.0,35.0,2.0,1.0,1.0,1.0,33.0,2.0,0,0,0,0,1.0,10.0,30.0,20.0,20.0,5.0,5.0,4.0,230.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,4.0,1.0,1.0,1.0,0,0,1,3.0,2.0,5.0,5.0,2.0,3.0,5.0,5.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
574164,0,1.0,3.0,1.0,0,0,0,0,1,6.0,1.0,0.0,0.0,9.0,7.0,6.0,1.0,1.0,3.0,0.0,16.0,4.0,0.0,3.0,3.0,1.0,4.0,4.0,3.0,3.0,1.0,1.0,1.0,1.0,3.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,2.0,6.0,9.0,0.0,6.0,4.0,9.0,16.0,83.0,4.0,9.0,0.0,1.0,1.0,1.0,1.0,50.0,0,0,1,1,0,0,0,0,1,3.0,3.0,2.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,1.0,50.0,50.0,1.0,1.0,2.0,1.0,33.0,5.0,0,0,0,1,2.0,35.0,50.0,50.0,50.0,1.0,15.0,1.0,75.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,0,0,1,3.0,3.0,4.0,5.0,2.0,3.0,4.0,5.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576638,0,0.0,4.0,1.0,0,1,0,0,0,7.0,1.0,0.0,0.0,8.0,6.0,5.0,1.0,1.0,3.0,0.0,16.0,2.0,0.0,2.0,2.0,20.0,3.0,6.0,4.0,4.0,3.0,1.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.0,6.0,6.0,0.0,10.0,3.0,65.0,6.0,1.0,65.0,80.0,0.0,1.0,1.0,1.0,1.0,90.0,0,0,1,0,0,1,1,0,0,4.0,4.0,2.0,1.0,4.0,4.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,60.0,40.0,2.0,2.0,2.0,1.0,23.0,2.0,0,0,0,0,3.0,40.0,80.0,40.0,90.0,0.0,20.0,0.0,100.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,3.0,5.4598,1.0,2.0,0,0,1,3.0,5.0,5.0,1.0,3.0,5.0,5.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
577351,1,1.0,1.0,1.0,0,0,0,0,0,6.0,1.0,0.0,0.0,7.0,7.0,6.0,1.0,1.0,3.0,0.0,16.0,4.0,0.0,3.0,1.0,20.0,4.0,4.0,5.0,3.0,5.0,2.0,4.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,4.0,5.0,0.0,0.0,0.0,3.0,20.0,2.0,22.0,3.0,1.0,0.0,1.0,1.0,1.0,1.0,90.0,0,0,1,0,0,1,1,0,0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,95.0,5.0,1.0,1.0,1.0,1.0,18.0,3.0,0,0,0,1,3.0,56.0,79.0,100.0,99.0,23.0,10.0,0.0,80.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,3.0,0,0,1,3.0,3.0,5.0,5.0,3.0,2.0,5.0,5.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [22]:
# Validation data - Saved later to S3 as CSV
print(validation_data.shape)
validation_data.head()

(683, 367)


Unnamed: 0,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST259Q01JA,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,IMMIG,TARDYSD,ST226Q01JA,MISSSC,PAREDINT,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,STUDYHMW,IC184Q01JA,IC184Q02JA,ST059Q01TA,ST296Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,ST258Q01JA,ST294Q01JA,ST295Q01JA,EXERPRAC,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q09TA,SC224Q01JA,RATCMP1,RATCMP2,SCHSEL,SCHLTYPE_1,SCHLTYPE_2,SCHLTYPE_3,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC042Q01TA,SC042Q02TA,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC037Q08TA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379
,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
574478,1,1.0,6.0,1.0,0,0,0,0,1,7.0,1.0,0.0,0.0,6.0,7.0,6.0,1.0,0.0,3.0,0.0,12.0,4.0,0.0,1.0,3.0,3.0,3.0,9.0,3.0,4.0,3.0,1.0,4.0,4.0,4.0,0.0,1.0,1.0,0.0,0.0,0.0,5.0,1.0,6.0,3.0,0.0,10.0,4.0,40.0,30.0,79.0,11.0,18.0,4.0,1.0,1.0,1.0,1.0,60.0,0,0,0,0,0,0,0,0,0,4.0,4.0,3.0,3.0,3.0,3.0,2.0,3.0,3.0,3.0,2.0,1.0,1.0,1.0,75.0,25.0,1.0,2.0,1.0,1.0,28.0,3.0,0,0,0,0,2.0,22.0,30.0,30.0,41.0,10.0,7.0,17.0,60.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,0,0,1,2.0,2.0,5.0,4.0,2.0,2.0,5.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573678,1,1.0,4.0,1.0,0,0,0,0,0,9.0,2.0,0.0,0.0,7.0,7.0,6.0,1.0,1.0,3.0,0.0,16.0,3.0,1.0,2.0,2.0,7.0,1.0,0.0,4.0,4.0,7.0,1.0,2.0,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,6.0,0.0,5.0,0.0,3.0,1.0,13.0,22.0,0.0,4.0,0.0,2.0,2.0,1.0,1.0,50.0,0,0,1,0,0,1,0,1,0,4.0,3.0,2.0,4.0,2.0,2.0,2.0,4.0,4.0,4.0,2.0,1.0,1.0,2.0,75.0,25.0,1.0,2.0,2.0,1.0,23.0,4.0,0,0,0,1,2.0,15.0,32.0,50.0,40.0,50.0,5.0,30.0,45.0,18.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,0,0,1,3.0,2.0,5.0,1.0,3.0,2.0,5.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576927,0,1.0,4.0,4.0,0,0,0,0,1,5.0,1.0,0.0,0.0,9.0,4.0,6.0,1.0,1.0,4.0,1.0,16.0,4.0,0.0,2.0,2.0,4.0,1.0,4.0,4.0,4.0,1.0,1.0,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,6.0,0.0,0.0,0.0,4.0,5.0,2.0,83.0,10.0,7.0,0.0,1.0,1.0,1.0,1.0,80.0,0,0,0,0,0,0,0,0,0,4.0,4.0,3.0,3.0,3.0,3.0,2.0,3.0,3.0,3.0,2.0,1.0,1.0,1.0,75.0,25.0,1.0,2.0,1.0,1.0,23.0,1.0,0,0,0,0,2.0,22.0,30.0,30.0,41.0,10.0,7.0,17.0,60.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,3.4043,1.0,1.0,0,0,1,2.0,2.0,5.0,4.0,2.0,2.0,5.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576321,1,0.0,4.0,1.0,0,1,0,0,0,8.0,1.0,1.0,0.0,7.0,7.0,6.0,2.0,0.0,2.0,0.0,16.0,3.0,1.0,2.0,2.0,20.0,6.0,5.0,3.0,3.0,5.0,1.0,1.0,1.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,6.0,6.0,0.0,0.0,5.0,2.0,44.0,12.0,0.0,0.0,0.0,0.0,2.0,2.0,1.0,1.0,40.0,0,0,1,0,0,1,0,0,1,2.0,2.0,2.0,4.0,1.0,1.0,1.0,4.0,3.0,3.0,1.0,2.0,2.0,2.0,100.0,0.0,2.0,2.0,2.0,1.0,13.0,3.0,0,0,0,0,2.0,90.0,100.0,90.0,100.0,50.0,50.0,60.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,3.0,1.1765,1.0,2.0,1,0,0,1.0,2.0,4.0,3.0,1.0,2.0,3.0,3.0,2.0,3.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576339,1,1.0,3.0,1.0,0,0,0,0,1,4.0,2.0,-1.0,0.0,4.0,7.0,6.0,1.0,0.0,1.0,0.0,12.0,4.0,0.0,2.0,2.0,7.0,5.0,2.0,4.0,4.0,7.0,1.0,4.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,6.0,0.0,0.0,4.0,2.0,0.0,18.0,30.0,0.0,0.0,0.0,1.0,2.0,2.0,2.0,55.0,0,0,1,0,0,1,0,0,1,2.0,4.0,2.0,3.0,4.0,4.0,2.0,3.0,2.0,3.0,2.0,1.0,1.0,1.0,80.0,20.0,1.0,1.0,1.0,1.0,23.0,4.0,0,0,0,1,2.0,50.0,100.0,50.0,100.0,25.0,25.0,75.0,20.0,5.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,3.0,0,0,1,2.0,2.0,5.0,1.0,2.0,1.0,5.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [23]:
# Test data - NOT SAVED TO S3
print(test_data.shape)
test_data.head()

(683, 367)


Unnamed: 0,MATH_Proficient,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST259Q01JA,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,IMMIG,TARDYSD,ST226Q01JA,MISSSC,PAREDINT,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,STUDYHMW,IC184Q01JA,IC184Q02JA,ST059Q01TA,ST296Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,ST258Q01JA,ST294Q01JA,ST295Q01JA,EXERPRAC,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,SC037Q09TA,SC224Q01JA,RATCMP1,RATCMP2,SCHSEL,SCHLTYPE_1,SCHLTYPE_2,SCHLTYPE_3,SC034Q01NA,SC034Q02NA,SC034Q03TA,SC034Q04TA,SC195Q01JA,SC195Q02JA,SC195Q03JA,SC195Q04JA,SC042Q01TA,SC042Q02TA,SC212Q01JA,SC212Q02JA,SC212Q03JA,SC037Q08TA,SC032Q01TA,SC032Q02TA,SC032Q03TA,SC032Q04TA,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379
,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
577660,1,1.0,4.0,1.0,0,0,0,0,0,7.0,2.0,0.0,0.0,7.0,7.0,6.0,1.0,0.0,3.0,0.0,16.0,3.0,0.0,2.0,2.0,8.0,3.0,4.0,4.0,4.0,4.0,1.0,2.0,3.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.0,6.0,5.0,0.0,4.0,2.0,1.0,13.0,29.0,1.0,1.0,0.0,2.0,1.0,1.0,1.0,90.0,0,0,1,1,0,0,0,0,1,3.0,3.0,3.0,3.0,2.0,2.0,2.0,4.0,2.0,2.0,1.0,2.0,1.0,1.0,65.0,35.0,1.0,2.0,1.0,2.0,23.0,4.0,0,0,0,1,3.0,10.0,20.0,20.0,30.0,10.0,10.0,10.0,100.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,3.0,1.7647,1.0,1.0,0,0,1,2.0,2.0,4.0,5.0,2.0,2.0,4.0,5.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
577072,0,1.0,3.0,1.0,0,0,1,0,0,9.0,1.0,0.0,0.0,8.0,7.0,6.0,1.0,2.0,4.0,0.0,16.0,4.0,1.0,3.0,1.0,7.0,4.0,8.0,5.0,4.0,1.0,2.0,2.0,2.0,4.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,5.0,6.0,7.0,3.0,8.0,3.0,9.0,13.0,44.0,3.0,4.0,0.0,1.0,1.0,1.0,1.0,60.0,0,0,0,0,0,0,0,0,0,4.0,4.0,3.0,3.0,3.0,3.0,2.0,3.0,3.0,3.0,2.0,1.0,1.0,1.0,75.0,25.0,1.0,2.0,1.0,1.0,28.0,3.0,0,0,0,0,2.0,22.0,30.0,30.0,41.0,10.0,7.0,17.0,60.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,0,0,1,2.0,2.0,5.0,4.0,2.0,2.0,5.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576813,0,1.0,4.0,2.0,0,1,0,0,0,8.0,1.0,0.0,0.0,8.0,7.0,6.0,1.0,0.0,4.0,0.0,16.0,4.0,0.0,3.0,3.0,4.0,1.0,6.0,4.0,5.0,2.0,1.0,1.0,1.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,5.0,6.0,2.0,4.0,8.0,3.0,1.0,3.0,5.0,0.0,2.0,0.0,1.0,2.0,2.0,2.0,80.0,0,0,1,0,0,1,0,0,1,4.0,4.0,3.0,4.0,3.0,2.0,1.0,4.0,3.0,2.0,1.0,1.0,1.0,1.0,87.0,13.0,2.0,2.0,2.0,1.0,23.0,3.0,0,0,0,0,2.0,10.0,10.0,23.0,23.0,0.0,9.0,3.0,60.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,3.0,1.0137,0.9865,1.0,0,0,1,2.0,2.0,4.0,1.0,2.0,4.0,4.0,1.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
577850,0,1.0,1.0,1.0,0,0,0,0,0,7.0,2.0,0.0,0.0,7.0,7.0,6.0,1.0,0.0,3.0,0.0,12.0,3.0,0.0,2.0,2.0,8.0,3.0,4.0,4.0,4.0,4.0,1.0,2.0,3.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.0,6.0,5.0,0.0,4.0,3.0,5.0,46.0,80.0,0.0,10.0,0.0,2.0,1.0,1.0,1.0,45.0,0,0,1,0,0,1,0,0,1,3.0,4.0,1.0,3.0,3.0,1.0,1.0,3.0,4.0,3.0,1.0,2.0,1.0,2.0,90.0,10.0,1.0,1.0,2.0,2.0,13.0,3.0,0,0,0,1,2.0,40.0,40.0,40.0,40.0,0.0,5.0,10.0,0.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,0,0,1,2.0,1.0,2.0,5.0,2.0,3.0,4.0,4.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573800,1,1.0,4.0,1.0,0,1,0,0,0,8.0,1.0,0.0,0.0,8.0,7.0,6.0,1.0,0.0,3.0,0.0,16.0,1.0,1.0,2.0,1.0,7.0,5.0,7.0,4.0,5.0,4.0,3.0,1.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,6.0,0.0,0.0,0.0,2.0,18.0,14.0,77.0,6.0,5.0,0.0,1.0,1.0,1.0,1.0,65.0,1,0,0,0,0,1,1,0,0,4.0,4.0,2.0,3.0,3.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,45.0,55.0,1.0,2.0,2.0,1.0,28.0,2.0,0,0,0,1,3.0,28.0,37.0,35.0,36.0,12.0,6.0,17.0,60.0,15.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,0,0,1,4.0,2.0,5.0,1.0,2.0,2.0,5.0,1.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Now we'll copy the training and validation files to S3 for Amazon SageMaker's managed training to pick up.

In [24]:
# cell 14
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
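
The upload cell builds its object keys with `os.path.join`, which follows the local platform's path separator, whereas S3 keys always use `/`. On a Linux notebook host the two agree, so the cell works as written. The sketch below shows a platform-independent alternative; `s3_key` and `s3_uri` are hypothetical helper names (not part of boto3 or the SageMaker SDK), and the bucket and prefix values are placeholders.

```python
import posixpath

def s3_key(prefix, *parts):
    """Join key components with '/', regardless of the local OS (hypothetical helper)."""
    return posixpath.join(prefix, *parts)

def s3_uri(bucket, key):
    """Format an s3:// URI from a bucket name and an object key (hypothetical helper)."""
    return f"s3://{bucket}/{key}"

# Placeholder names, for illustration only
example_key = s3_key("sagemaker/knn-example", "train", "train.csv")
print(example_key)
print(s3_uri("my-bucket", example_key))
```

The same key could then be passed to `Bucket(bucket).Object(key)` in place of the `os.path.join(...)` expression.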

## Training 

***In the code below, you should change "xgboost" to something that works for knn***

In [25]:
# cell 15
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='knn', version='latest')

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: latest.


Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3, which also specify that the content type is CSV.

***In the code below, you might have to change "text/csv" to something else, depending on how knn works***

In [26]:
# cell 16
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='text/csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='text/csv')
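
With `content_type='text/csv'`, SageMaker's built-in algorithms generally expect training files with no header row and the target in the first column, which matches the layout shown above where `MATH_Proficient` comes first. A quick local check before uploading can catch a stray header or a misplaced label. This is a minimal sketch using only the standard library; `check_training_csv` is a hypothetical helper and the sample text is made up.

```python
import csv
import io

def check_training_csv(text, expected_columns, allowed_labels=(0, 1)):
    """Verify headerless CSV text with the label in the first column.

    Hypothetical helper for illustration; returns the number of data rows.
    """
    n_rows = 0
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != expected_columns:
            raise ValueError(
                f"row {row_number}: expected {expected_columns} columns, got {len(row)}"
            )
        try:
            label = float(row[0])
        except ValueError:
            raise ValueError(f"row {row_number}: first column is not numeric (header row?)")
        if label not in allowed_labels:
            raise ValueError(f"row {row_number}: unexpected label {row[0]!r}")
        n_rows += 1
    return n_rows

# Made-up sample: label first, then two features
sample = "1,4.0,1.0\n0,3.0,2.0\n"
print(check_training_csv(sample, expected_columns=3))
```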

In [27]:
print(f"Training data location: {s3_input_train.config['DataSource']['S3DataSource']['S3Uri']}")
print(f"Validation data location: {s3_input_validation.config['DataSource']['S3DataSource']['S3Uri']}")
print(f"Training data location: {s3_input_train}")
print(f"Validation data location: {s3_input_validation}")

Training data location: s3://sagemaker-us-west-2-986030204467/sagemaker/knn-Elijah-United-States/train
Validation data location: s3://sagemaker-us-west-2-986030204467/sagemaker/knn-Elijah-United-States/validation/
Training data location: <sagemaker.inputs.TrainingInput object at 0x7f468be165d0>
Validation data location: <sagemaker.inputs.TrainingInput object at 0x7f468be16450>


***In the code below, you should change "linear-learner" to something that works for knn***

In [28]:
# cell 17
sess = sagemaker.Session()
algorithm_image_uri = sagemaker.image_uris.retrieve("knn", sess.boto_region_name)

#### Use auto-tuning to find best hyperparameters

***In the code below, change the hyperparameters to something that is relevant to KNN***

In [29]:
# Get your session's region
region = boto3.Session().region_name
print(f"Your SageMaker session is running in region: {region}")

# Recreate the estimator with the correct output path
knn_estimator = sagemaker.estimator.Estimator(
    image_uri=algorithm_image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=f's3://{bucket}/{prefix}/output',
    sagemaker_session=sess
)

# Set the hyperparameters again
knn_estimator.set_hyperparameters(
    k=10,  
    sample_size=5000,
    predictor_type='classifier',
    feature_dim=feature_dim,
    index_metric='COSINE'
)

Your SageMaker session is running in region: us-west-2
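
To make the `k` and `index_metric='COSINE'` settings concrete, here is a toy k-nearest-neighbors classifier in plain Python. It is only an illustration of the idea (rank training points by cosine distance, then take a majority vote among the `k` closest). It is not how the SageMaker KNN container is implemented, and the function names and data points are invented for the example.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)

def knn_classify(train_rows, train_labels, query, k):
    """Predict a label by majority vote among the k nearest training rows."""
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: cosine_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented toy data: two points per class
train_rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = [1, 1, 0, 0]
print(knn_classify(train_rows, train_labels, [0.8, 0.2], k=3))
```

A larger `k` smooths the vote at the cost of blurring class boundaries, which is why `k` is a natural hyperparameter to tune.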


***In the code below, you might have to change "validation:roc_auc_score" to something else that works for KNN***

In [30]:
# Now proceed with training
knn_estimator.fit({'train': s3_input_train, 'test': s3_input_validation})

INFO:sagemaker:Creating training-job with name: knn-2025-03-11-00-39-14-000


2025-03-11 00:39:15 Starting - Starting the training job...
2025-03-11 00:39:28 Starting - Preparing the instances for training...
2025-03-11 00:40:17 Downloading - Downloading the training image............
2025-03-11 00:42:18 Training - Training image download completed. Training in progress.....
Docker entrypoint called with argument(s): train
Running default environment configuration script
[03/11/2025 00:42:44 INFO 140361102260032] Reading default configuration from /opt/amazon/lib/python3.9/site-packages/algorithm/resources/default-conf.json: {'_kvstore': 'dist_async', '_log_level': 'info', '_num_gpus': 'auto', '_num_kv_servers': '1', '_tuning_objective_metric': '', '_faiss_index_nprobe': '5', 'epochs': '1', 'feature_dim': 'auto', 'faiss_index_ivf_nlists': 'auto', 'index_metric': 'L2', 'index_type': 'faiss.Flat', 'mini_batch_size': '5000', '_enable_profiler': 'false'}
[03/11/2025 00:42:44 INFO 140361102260032] Merging with provided configuration fro

In [31]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, HyperparameterTuner

# Define KNN-specific hyperparameter ranges
hyperparameter_ranges = {
    'k': IntegerParameter(1, 30),  # Number of nearest neighbors
    'sample_size': IntegerParameter(1000, min(10000, train_data.shape[0]))  # Subsample size for large datasets
}
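
The upper bound for `sample_size` is capped at the number of training rows, but if a country's training set has fewer rows than the lower bound of 1000, the range becomes empty. A small guard can make that case explicit. This is a sketch only: `sample_size_bounds` is a hypothetical helper, and raising an error for a degenerate range is a design choice made here, not a documented SageMaker rule.

```python
def sample_size_bounds(n_rows, low=1000, high=10000):
    """Return (min, max) for a sample_size search range (hypothetical helper).

    Raises ValueError when the data is too small for a meaningful range, in which
    case sample_size could be set as a fixed hyperparameter instead.
    """
    upper = min(high, n_rows)
    if upper <= low:
        raise ValueError(
            f"only {n_rows} rows available; cannot search sample_size above {low}"
        )
    return low, upper

print(sample_size_bounds(5000))   # range capped by the row count
print(sample_size_bounds(50000))  # range capped by the default upper limit
```

The returned pair could be passed straight to `IntegerParameter(*sample_size_bounds(train_data.shape[0]))`.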


In [32]:
tuner = HyperparameterTuner(estimator=knn_estimator,
                            objective_metric_name='test:accuracy',
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=50,  
                            max_parallel_jobs=5)

In [33]:
tuner.fit({'train': s3_input_train, 'test': s3_input_validation})

INFO:sagemaker:Creating hyperparameter tuning job with name: knn-250311-0043


..........................................................................................................................................................!


In [34]:
# cell 26
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)['HyperParameterTuningJobStatus']

'Completed'
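
The status check above is a one-off call. If a tuning job is launched without blocking, the same call can be wrapped in a polling loop. The sketch below assumes that `Completed`, `Failed`, and `Stopped` are the terminal states (worth confirming against the API reference); `wait_for_status` is a hypothetical helper that takes any zero-argument callable, so it is demonstrated here with a stand-in rather than a real AWS call.

```python
import time

# Assumed terminal states for a tuning job; confirm against the API reference
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_status(get_status, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal status (hypothetical helper)."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"no terminal status after {max_polls} polls")

# Stand-in for the describe call, so the sketch runs without AWS access
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)
```

In the notebook, `get_status` would be a lambda wrapping the `describe_hyper_parameter_tuning_job` expression from the cell above.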

In [35]:
# cell 27
# Return the best training job name
best_training_job = tuner.best_training_job()
print("Best training job:", best_training_job)

Best training job: knn-250311-0043-035-8d547ba1


## Deploy the model (the best model identified by HyperparameterTuner)

In [36]:
# cell 28
knn_predictor = tuner.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')


2025-03-11 00:53:53 Starting - Found matching resource for reuse
2025-03-11 00:53:53 Downloading - Downloading the training image
2025-03-11 00:53:53 Training - Training image download completed. Training in progress.
2025-03-11 00:53:53 Uploading - Uploading generated training model
2025-03-11 00:53:53 Completed - Resource reused by training job: knn-250311-0043-039-ac66a051

INFO:sagemaker:Creating model with name: knn-2025-03-11-00-56-47-128





INFO:sagemaker:Creating endpoint-config with name knn-250311-0043-035-8d547ba1
INFO:sagemaker:Creating endpoint with name knn-250311-0043-035-8d547ba1


---------!

In [37]:
# cell 29
# Create a serializer
knn_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll get predictions for the test set by:
1. Dropping the target variable from our test dataset
1. Serializing the remaining feature rows to a CSV payload
1. Retrieving predictions by invoking the KNN endpoint
1. Parsing the JSON output our model provides and collecting the predicted labels into a NumPy array
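
The prediction cell that follows sends the whole test set in a single request, which is fine for a few hundred rows. Real-time endpoints limit the request payload size, so larger test sets may need to be split into mini-batches. A minimal sketch, assuming a `predict_fn` callable that takes a list of rows and returns one prediction per row; both helper names are hypothetical, and `fake_predict` is a stand-in for the endpoint call.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of rows with at most batch_size items each."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def predict_in_batches(rows, predict_fn, batch_size=500):
    """Call predict_fn on each mini-batch and concatenate the results."""
    predictions = []
    for batch in iter_batches(rows, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions

def fake_predict(batch):
    """Stand-in for an endpoint call: label a row 1.0 when its features sum above 1."""
    return [float(sum(row) > 1) for row in batch]

rows = [[0.2, 0.3], [0.9, 0.8], [0.5, 0.6], [0.1, 0.1], [1.0, 1.0]]
print(predict_in_batches(rows, fake_predict, batch_size=2))
```

Because results are appended in batch order, the output lines up row-for-row with the input.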

In [38]:
# Get the raw prediction output
raw_predictions = knn_predictor.predict(test_data.drop(['MATH_Proficient'], axis=1).to_numpy())

# Decode and parse JSON
parsed_predictions = json.loads(raw_predictions.decode("utf-8"))

# Extract the scores
predictions = np.array([pred["predicted_label"] for pred in parsed_predictions["predictions"]])
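
The parsing steps above can be folded into one small function, which also makes the expected response shape explicit. The sketch below assumes the same structure the cell reads (a `predictions` list whose items carry a `predicted_label` field); `parse_predicted_labels` is a hypothetical helper and the sample payload is hand-written, not captured from the endpoint.

```python
import json

def parse_predicted_labels(raw_response):
    """Extract predicted labels from a JSON response given as bytes or str."""
    if isinstance(raw_response, (bytes, bytearray)):
        raw_response = raw_response.decode("utf-8")
    body = json.loads(raw_response)
    return [item["predicted_label"] for item in body["predictions"]]

# Hand-written payload mirroring the structure parsed in the cell above
sample_response = b'{"predictions": [{"predicted_label": 1.0}, {"predicted_label": 0.0}]}'
print(parse_predicted_labels(sample_response))
```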


In [39]:
# Save the real values for the test set
real_values = test_data['MATH_Proficient']
real_values.to_csv('real_values.csv', index=False, header=False)

# Save the predicted values for the test set
predicted_values_full = predictions
predicted_values_full = pd.DataFrame(predicted_values_full, columns=['Predicted Values'])
predicted_values_full.to_csv('predicted_values_full.csv', index=False, header=False)
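
Once the real and predicted values are saved, they can be compared directly. The sketch below counts the confusion-matrix cells for a binary label and derives accuracy from them. The helper names are hypothetical and the toy lists are invented; they are not results from this notebook.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for a binary label (hypothetical helper)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

def accuracy(counts):
    """Fraction of correct predictions, or 0.0 when there are no rows."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0

# Invented toy values, for illustration only
toy_actual = [1, 0, 1, 1, 0]
toy_predicted = [1.0, 0.0, 0.0, 1.0, 1.0]
counts = confusion_counts(toy_actual, toy_predicted)
print(counts, accuracy(counts))
```

In the notebook, `real_values.tolist()` and `predictions.tolist()` could be passed in place of the toy lists.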

In [40]:
# Clean up
knn_predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting endpoint configuration with name: knn-250311-0043-035-8d547ba1
INFO:sagemaker:Deleting endpoint with name: knn-250311-0043-035-8d547ba1


## Explain the trained model using Clarify

In [41]:
from datetime import datetime

session = sagemaker.Session()

model_name = "Clarify-{}-{}".format(country_name_edited, datetime.now().strftime("%d-%m-%Y-%H-%M-%S"))

best_model = sagemaker.estimator.Estimator.attach(best_training_job)  # Attach the best training job

model = best_model.create_model(name=model_name)  # Create a model from the best job

container_def = model.prepare_container_def()

session.create_model(model_name, role, container_def)


2025-03-11 00:53:53 Starting - Found matching resource for reuse
2025-03-11 00:53:53 Downloading - Downloading the training image
2025-03-11 00:53:53 Training - Training image download completed. Training in progress.
2025-03-11 00:53:53 Uploading - Uploading generated training model
2025-03-11 00:53:53 Completed - Resource reused by training job: knn-250311-0043-039-ac66a051

INFO:sagemaker:Creating model with name: Clarify-United-States-11-03-2025-01-01-50





'Clarify-United-States-11-03-2025-01-01-50'
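
The model name is built from `country_name_edited`, where underscores were replaced with hyphens. That matters because SageMaker resource names are restricted in which characters they accept. The sketch below assumes the rule is letters, digits, and hyphens, with no leading or trailing hyphen and at most 63 characters; treat that pattern as an assumption and check the API reference. Both helpers are hypothetical.

```python
import re

# Assumed rule: letters, digits, and hyphens; no leading or trailing hyphen
_MODEL_NAME_PATTERN = re.compile(r"[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?")

def is_valid_model_name(name, max_length=63):
    """Check a candidate name against the assumed naming rule."""
    return len(name) <= max_length and _MODEL_NAME_PATTERN.fullmatch(name) is not None

def make_model_name(country_name, timestamp, prefix="Clarify"):
    """Build a hyphen-separated name, replacing underscores in the country name."""
    return "-".join([prefix, country_name.replace("_", "-"), timestamp])

name = make_model_name("United_States", "11-03-2025-01-01-50")
print(name, is_valid_model_name(name))
```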

In [42]:
test_features = test_data.drop(["MATH_Proficient"], axis=1)
test_target = test_data["MATH_Proficient"]
test_features.to_csv("test_features.csv", index=False, header=False)

***In the code below, you might have to change "text/csv" to something else that works for KNN***

In [43]:
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.2xlarge", sagemaker_session=session
)

model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type="ml.m5.large",
    instance_count=1,
    accept_type="application/json",
    content_type="text/csv"
)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.0.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


In [44]:
from sagemaker.s3 import S3Downloader

# Download data from S3 to local instance
local_path = S3Downloader.download('s3://{}/{}/train'.format(bucket, prefix), './tmp/train_data')

In [45]:
# Load and sample
full_data = pd.read_csv('./tmp/train_data/train.csv', header=None)
n = min(3000, len(full_data))  
sampled_data = full_data.sample(n=n)  # If full_data has less than n, use the full sample

# Save sampled data back to S3
sampled_path = 'sampled_train_data.csv'
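# Write without a header row, matching train.csv (column names are passed to Clarify via `headers`)
sampled_data.to_csv(sampled_path, index=False, header=False)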

from sagemaker.s3 import S3Uploader
sampled_s3_uri = S3Uploader.upload(sampled_path, 's3://{}/{}/sampled_train'.format(bucket, prefix))

In [46]:
print(sampled_data.shape)
sampled_data.head()

(3000, 367)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366
391,1,1.0,4.0,1.0,0,0,0,0,0,2.0,2.0,0.0,0.0,7.0,7.0,6.0,2.0,1.0,3.0,0.0,12.0,4.0,0.0,2.0,1.0,20.0,3.0,3.0,1.0,5.0,5.0,2.0,2.0,3.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,6.0,6.0,10.0,0.0,7.0,4.0,10.0,14.0,74.0,4.0,22.0,0.0,1.0,1.0,1.0,1.0,90.0,0,0,1,0,0,1,0,0,1,4.0,2.0,2.0,4.0,1.0,4.0,1.0,3.0,2.0,4.0,1.0,2.0,1.0,1.0,71.0,29.0,1.0,2.0,1.0,1.0,33.0,3.0,0,0,1,0,2.0,10.0,25.0,8.0,21.0,1.0,2.0,2.0,5.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,0,0,1,1.0,2.0,4.0,1.0,1.0,1.0,5.0,1.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1504,1,1.0,1.0,1.0,0,0,0,0,0,5.0,2.0,0.0,0.0,5.0,7.0,6.0,2.0,0.0,3.0,0.0,12.0,4.0,0.0,2.0,2.0,10.0,5.0,5.0,5.0,5.0,5.0,2.0,4.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,6.0,6.0,10.0,0.0,10.0,2.0,2.0,9.0,42.0,6.0,6.0,0.0,1.0,1.0,1.0,1.0,90.0,0,1,0,0,1,0,0,1,0,1.0,4.0,1.0,4.0,3.0,3.0,2.0,3.0,3.0,3.0,1.0,1.0,1.0,1.0,60.0,40.0,2.0,2.0,1.0,1.0,23.0,0.0,0,0,0,0,3.0,10.0,33.0,33.0,33.0,10.0,2.0,20.0,60.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,3.0,1.0,1.0,1.0,0,0,1,2.0,2.0,4.0,1.0,2.0,1.0,4.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1982,0,1.0,1.0,1.0,0,0,0,0,0,5.0,2.0,0.0,0.0,6.0,7.0,6.0,1.0,0.0,1.0,0.0,16.0,4.0,0.0,2.0,2.0,8.0,1.0,6.0,4.0,4.0,3.0,1.0,3.0,3.0,3.0,0.0,1.0,0.0,1.0,0.0,0.0,4.0,1.0,6.0,7.0,0.0,9.0,5.0,17.0,13.0,80.0,30.0,50.0,0.0,2.0,1.0,2.0,2.0,80.0,1,0,0,1,0,0,0,0,1,4.0,4.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,75.0,25.0,2.0,1.0,2.0,1.0,38.0,1.0,0,0,0,0,1.0,20.0,20.0,20.0,20.0,10.0,10.0,20.0,220.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,0,0,1,3.0,2.0,5.0,1.0,3.0,2.0,5.0,1.0,3.0,3.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1929,1,1.0,4.0,1.0,0,0,1,0,0,6.0,1.0,0.0,0.0,6.0,7.0,5.0,1.0,0.0,3.0,0.0,14.5,2.0,1.0,2.0,3.0,4.0,3.0,2.0,4.0,4.0,1.0,1.0,2.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,5.0,6.0,2.0,0.0,5.0,1.0,4.0,14.0,37.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,85.0,0,0,1,0,0,1,0,0,1,2.0,4.0,3.0,2.0,2.0,3.0,3.0,1.0,3.0,2.0,2.0,2.0,1.0,1.0,73.0,27.0,1.0,2.0,1.0,1.0,28.0,2.0,0,0,0,1,3.0,22.0,80.0,50.0,84.0,0.0,10.0,24.0,50.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,3.0,0,0,1,3.0,2.0,5.0,5.0,2.0,2.0,5.0,5.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2664,1,1.0,4.0,1.0,0,1,0,0,0,6.0,1.0,0.0,0.0,8.0,7.0,6.0,1.0,0.0,3.0,0.0,16.0,2.0,0.0,2.0,1.0,35.0,3.0,6.0,3.0,3.0,5.0,1.0,3.0,3.0,4.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0,4.0,3.0,0.0,6.0,3.0,9.0,13.0,44.0,3.0,4.0,0.0,1.0,1.0,1.0,1.0,50.0,0,0,1,0,0,1,0,0,1,4.0,4.0,2.0,4.0,4.0,4.0,3.0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,80.0,20.0,1.0,1.0,2.0,1.0,28.0,4.0,0,0,0,1,2.0,26.0,26.0,26.0,26.0,15.0,5.0,5.0,90.0,15.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,3.0,0,0,1,3.0,2.0,5.0,5.0,3.0,2.0,5.0,5.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


***In the code below, you might have to change "text/csv" to something else that works for KNN***

In [47]:
shap_config = clarify.SHAPConfig(
    baseline=[test_features.iloc[0].values.tolist()],
    num_samples=500,  
    agg_method="mean_abs",
    save_local_shap_values=True
)

explainability_output_path = "s3://{}/{}/clarify-explainability".format(bucket, prefix)

explainability_data_config = clarify.DataConfig(
    s3_data_input_path=sampled_s3_uri,
    s3_output_path=explainability_output_path,
    label='MATH_Proficient',
    headers=train_data.columns.to_list(),
    dataset_type="text/csv",
)

In [48]:
predictions_config = clarify.ModelPredictedLabelConfig(label="predictions[*].predicted_label")
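
The label path above, `predictions[*].predicted_label`, is a JMESPath-style expression that tells Clarify where to find the predicted label in the endpoint's JSON response. As a minimal sketch, the plain-Python equivalent is shown below. The response body here is made up and only assumed to have the same shape as the one parsed later in this notebook; `extract_predicted_labels` is a hypothetical helper, not part of the SageMaker SDK.

```python
import json

# Hypothetical response body (assumed shape: a "predictions" list of objects
# that each carry a "predicted_label" field)
response_body = '{"predictions": [{"predicted_label": 1.0}, {"predicted_label": 0.0}]}'


def extract_predicted_labels(body):
    """Plain-Python equivalent of the path predictions[*].predicted_label."""
    parsed = json.loads(body)
    return [item["predicted_label"] for item in parsed["predictions"]]


labels = extract_predicted_labels(response_body)
print(labels)
```

If the endpoint's response used a different field name, the label path would need to change accordingly.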

In [49]:
# Set logging level for 'sagemaker.clarify' to WARNING (hides INFO messages)
import logging

logging.getLogger("sagemaker.clarify").setLevel(logging.WARNING)

clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    model_scores=predictions_config,
    explainability_config=shap_config
)

INFO:sagemaker:Creating processing-job with name Clarify-Explainability-2025-03-11-01-01-57-021


........................sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
We are not in a supported iso region, /bin/sh exiting gracefully with no changes.
INFO:sagemaker-clarify-processing:Starting SageMaker Clarify Processing job
INFO:analyzer.data_loading.data_loader_util:Analysis config path: /opt/ml/processing/input/config/analysis_config.json
INFO:analyzer.data_loading.data_loader_util:Analysis result path: /opt/ml/processing/output
INFO:analyzer.data_loading.data_loader_util:This host is algo-1.
INFO:analyzer.data_loading.data_loader_util:This host is the leader.
INFO:analyzer.data_loading.data_loader_util:Number of hosts in the cluster is 1.
INFO:sagemaker-clarify-processing:Running Python / Pandas based analyzer.
INFO:analyzer.data_loading

## Train the model again with the top 20 predictors
#### Get the list of top 20 predictors

In [50]:
# Replace with your actual bucket name and prefix used in explainability_output_path
# bucket = "your-bucket-name"
# prefix = "your-prefix"  # e.g., the folder structure used in your explainability_output_path

# Construct the S3 key for the output file
key = f"{prefix}/clarify-explainability/analysis.json"

# Initialize boto3 client for S3 and download the JSON report
s3 = boto3.client("s3")
response = s3.get_object(Bucket=bucket, Key=key)
content = response["Body"].read().decode("utf-8")
report = json.loads(content)

# Navigate to the global SHAP values dictionary
global_shap = report["explanations"]["kernel_shap"]["1.0"]["global_shap_values"]

# Sort the items by the SHAP value in descending order and take the top 20
top_20 = sorted(global_shap.items(), key=lambda item: item[1], reverse=True)[:20]

# Extract just the feature names
top_20_features = [feature for feature, value in top_20]

# Print
print("Top 20 features with the highest mean absolute SHAP values:")
for feature in top_20_features:
    print(feature)


INFO:botocore.httpchecksum:Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].


Top 20 features with the highest mean absolute SHAP values:
SC211Q03JA
SC213Q01JA
ST059Q02JA
SC175Q01JA
SC211Q01JA
LANGN_922
WORKHOME
EXERPRAC
SC064Q05WA
SC064Q07WA
PAREDINT
SC064Q06WA
STUDYHMW
SC211Q05JA
SC064Q02TA
SC064Q04NA
ST294Q01JA
SC064Q01TA
ST059Q01TA
SC178Q02JA
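
The lookup above hardcodes the label key `"1.0"`. A minimal sketch of the same ranking that reads the label key from the report instead is shown below. It assumes the report has the nesting used above (`explanations` → `kernel_shap` → label → `global_shap_values`) and exactly one label entry; `top_features` and `toy_report` are hypothetical names with made-up values.

```python
def top_features(report, n=20, method="kernel_shap"):
    """Return the n feature names with the largest global SHAP values."""
    by_label = report["explanations"][method]
    # Assumption: a single label entry; take its key rather than hardcoding it
    label_key = next(iter(by_label))
    global_shap = by_label[label_key]["global_shap_values"]
    ranked = sorted(global_shap.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Toy report with made-up values, mirroring the structure read above
toy_report = {
    "explanations": {
        "kernel_shap": {
            "1.0": {"global_shap_values": {"A": 0.02, "B": 0.10, "C": 0.05}}
        }
    }
}

top2 = top_features(toy_report, n=2)
print(top2)
```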


In [51]:
# Make a subset of the training dataset (with only 20 predictors)
variables_to_keep = ["MATH_Proficient"] + top_20_features
train_data_small = train_data[variables_to_keep]
print(train_data_small.shape)
train_data_small.head()

(3186, 21)


Unnamed: 0,MATH_Proficient,SC211Q03JA,SC213Q01JA,ST059Q02JA,SC175Q01JA,SC211Q01JA,LANGN_922,WORKHOME,EXERPRAC,SC064Q05WA,SC064Q07WA,PAREDINT,SC064Q06WA,STUDYHMW,SC211Q05JA,SC064Q02TA,SC064Q04NA,ST294Q01JA,SC064Q01TA,ST059Q01TA,SC178Q02JA
575474,0,53.0,60.0,8.0,60.0,49.0,0,1.0,0.0,22.0,17.0,14.5,30.0,0.0,37.0,41.0,10.0,1.0,30.0,3.0,25.0
574994,1,41.0,230.0,8.0,90.0,13.0,0,0.0,1.0,10.0,4.0,12.0,30.0,1.0,30.0,20.0,5.0,4.0,20.0,4.0,35.0
574164,0,83.0,75.0,1.0,50.0,9.0,0,6.0,9.0,35.0,1.0,16.0,50.0,4.0,9.0,50.0,1.0,2.0,50.0,1.0,50.0
576638,0,1.0,100.0,20.0,90.0,65.0,0,10.0,6.0,40.0,0.0,16.0,80.0,6.0,80.0,90.0,0.0,4.0,40.0,3.0,40.0
577351,1,22.0,80.0,20.0,90.0,20.0,0,0.0,0.0,56.0,0.0,16.0,79.0,4.0,1.0,99.0,23.0,4.0,100.0,5.0,5.0


In [52]:
# Save train dataset 
train_data_small.to_csv('train_small.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train_small/train_small.csv')).upload_file('train_small.csv')

In [53]:
# Make a subset of the validation dataset (with only 20 predictors)
validation_data_small = validation_data[variables_to_keep]
print(validation_data_small.shape)
validation_data_small.head()

(683, 21)


Unnamed: 0,MATH_Proficient,SC211Q03JA,SC213Q01JA,ST059Q02JA,SC175Q01JA,SC211Q01JA,LANGN_922,WORKHOME,EXERPRAC,SC064Q05WA,SC064Q07WA,PAREDINT,SC064Q06WA,STUDYHMW,SC211Q05JA,SC064Q02TA,SC064Q04NA,ST294Q01JA,SC064Q01TA,ST059Q01TA,SC178Q02JA
574478,1,79.0,60.0,3.0,60.0,40.0,0,10.0,3.0,22.0,17.0,12.0,30.0,9.0,18.0,41.0,10.0,1.0,30.0,3.0,25.0
573678,1,22.0,45.0,7.0,50.0,1.0,0,0.0,0.0,15.0,30.0,16.0,32.0,0.0,4.0,40.0,50.0,1.0,50.0,7.0,25.0
576927,0,83.0,60.0,4.0,80.0,5.0,0,0.0,0.0,22.0,17.0,16.0,30.0,4.0,7.0,41.0,10.0,4.0,30.0,1.0,25.0
576321,1,0.0,0.0,20.0,40.0,44.0,0,5.0,0.0,90.0,60.0,16.0,100.0,5.0,0.0,100.0,50.0,6.0,90.0,5.0,0.0
576339,1,30.0,20.0,7.0,55.0,0.0,0,4.0,0.0,50.0,75.0,12.0,100.0,2.0,0.0,100.0,25.0,1.0,50.0,7.0,20.0


In [54]:
# Save validation dataset 
validation_data_small.to_csv('validation_small.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation_small/validation_small.csv')).upload_file('validation_small.csv')

#### Train the model using the hyperparameters from the best model

***The code below retrieves the container image for the built-in KNN algorithm (`framework='knn'`).***

In [55]:
# cell 15
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='knn', version='latest')

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


***In the code below, you might have to change "text/csv" to something else that works for KNN***

In [56]:
# cell 16
s3_input_train_small = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train_small'.format(bucket, prefix), content_type='text/csv')
s3_input_validation_small = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation_small/'.format(bucket, prefix), content_type='text/csv')

In [57]:
knn_small = sagemaker.estimator.Estimator(
    image_uri=algorithm_image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=f's3://{bucket}/{prefix}/output',
    sagemaker_session=sess
)

knn_small.set_hyperparameters(
    k=10,  # Number of nearest neighbors
    sample_size=5000,  # Size of the sample used for training
    predictor_type='classifier',  # 'classifier' or 'regressor'
    feature_dim=20,  # Number of features
    index_metric='COSINE'  # Distance metric
)
knn_small.fit({'train': s3_input_train_small, 'validation': s3_input_validation_small})

INFO:sagemaker:Creating training-job with name: knn-2025-03-11-01-30-48-550


2025-03-11 01:30:50 Starting - Starting the training job...
2025-03-11 01:31:05 Starting - Preparing the instances for training...
2025-03-11 01:31:30 Downloading - Downloading input data...
2025-03-11 01:31:55 Downloading - Downloading the training image............
2025-03-11 01:34:16 Training - Training image download completed. Training in progress....Docker entrypoint called with argument(s): train
Running default environment configuration script
[03/11/2025 01:34:43 INFO 140622802691904] Reading default configuration from /opt/amazon/lib/python3.9/site-packages/algorithm/resources/default-conf.json: {'_kvstore': 'dist_async', '_log_level': 'info', '_num_gpus': 'auto', '_num_kv_servers': '1', '_tuning_objective_metric': '', '_faiss_index_nprobe': '5', 'epochs': '1', 'feature_dim': 'auto', 'faiss_index_ivf_nlists': 'auto', 'index_metric': 'L2', 'index_type': 'faiss.Flat', 'mini_batch_size': '5000', '_enable_profiler': 'false'}
[03/11/2025 01:34:43 IN

## Deploy the model

In [58]:
test_data_small = test_data[variables_to_keep]

In [59]:
# cell 18
knn_small_predictor = knn_small.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: knn-2025-03-11-01-35-36-450
INFO:sagemaker:Creating endpoint-config with name knn-2025-03-11-01-35-36-450
INFO:sagemaker:Creating endpoint with name knn-2025-03-11-01-35-36-450


---------!

In [60]:
# cell 19
knn_small_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll get predictions for the test set:
1. Drop the target variable from the test dataset
1. Send the remaining feature rows to the KNN endpoint as a CSV payload (in a single request, without mini-batching)
1. Decode the JSON response returned by the endpoint
1. Collect the `predicted_label` values into a NumPy array
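
For a larger test set, the rows could be sent in mini-batches instead of one request. The sketch below is one way to do that, assuming the endpoint returns UTF-8 JSON with a `predictions` list of `predicted_label` entries. `predict_in_batches` and `fake_predict` are hypothetical; `fake_predict` is a local stand-in for a real predictor so the example runs without an endpoint.

```python
import json


def predict_in_batches(predict_fn, rows, batch_size=500):
    """Call predict_fn on consecutive slices of rows and collect predicted labels."""
    labels = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        raw = predict_fn(batch)  # assumed to return JSON as UTF-8 bytes
        parsed = json.loads(raw.decode("utf-8"))
        labels.extend(pred["predicted_label"] for pred in parsed["predictions"])
    return labels


def fake_predict(batch):
    """Stand-in for an endpoint call: label is 1.0 when the first feature is positive."""
    preds = [{"predicted_label": float(row[0] > 0)} for row in batch]
    return json.dumps({"predictions": preds}).encode("utf-8")


rows = [[1.0, 2.0], [-1.0, 0.5], [3.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
batched_labels = predict_in_batches(fake_predict, rows, batch_size=2)
print(batched_labels)
```

With a deployed endpoint, the predictor's `predict` method would take the place of `fake_predict`, and the returned list can be wrapped in `np.array(...)`.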

In [61]:
# Get the raw prediction output
raw_predictions_small = knn_small_predictor.predict(test_data_small.drop(['MATH_Proficient'], axis=1).to_numpy())

# Decode and parse JSON
parsed_predictions_small = json.loads(raw_predictions_small.decode("utf-8"))

# Extract the predicted labels
predictions_small = np.array([pred["predicted_label"] for pred in parsed_predictions_small["predictions"]])

In [62]:
# Save the predicted values for the test set
predicted_values_small = predictions_small
predicted_values_small = pd.DataFrame(predicted_values_small, columns=['Predicted Values'])
predicted_values_small.to_csv('predicted_values_small.csv', index=False, header=False)

In [63]:
# Clean up
knn_small_predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting endpoint configuration with name: knn-2025-03-11-01-35-36-450
INFO:sagemaker:Deleting endpoint with name: knn-2025-03-11-01-35-36-450


## Summary

#### Number of students not proficient in Math

In [64]:
#print("Students who are proficient: ", proficient_n)
print("Students who are NOT proficient in Math: ", not_proficient_n, "(", not_proficient_p, "%)")

Students who are NOT proficient in Math:  1607 ( 35.3 %)


#### Model performance (model with all the predictors)

In [65]:
suggested_threshold = (100 - not_proficient_p)/100
print("Suggested threshold:", round(suggested_threshold, 2))

Suggested threshold: 0.65


***Adjust the threshold for the FINAL PREDICTIONS if necessary!***

A student is predicted as Math proficient when the predicted value is at or above this threshold. (Raising the threshold above 0.5 reduces the number of students predicted as "Math proficient", among both students who are actually proficient and students who are not.)
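
Whether the threshold changes anything depends on what the prediction files contain. The sketch below uses made-up values to show the difference. If the saved predictions are hard 0/1 labels (as the `predicted_label` values saved for the 20-predictor model appear to be), every threshold in (0, 1] gives the same split. Only probability-like scores respond to the threshold. `predicted_positive_count` is a hypothetical helper.

```python
def predicted_positive_count(values, threshold):
    """Number of entries classified as positive under a >= threshold rule."""
    return sum(1 for v in values if v >= threshold)


# Made-up examples: hard class labels versus probability-like scores
hard_labels = [0.0, 1.0, 1.0, 0.0, 1.0]
scores = [0.2, 0.9, 0.66, 0.4, 0.7]

for t in (0.5, 0.65, 0.68):
    print(t, predicted_positive_count(hard_labels, t), predicted_positive_count(scores, t))
```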

In [66]:
threshold = 0.68

print("Threshold:", threshold)

Threshold: 0.68


In [67]:
import pandas as pd
import numpy as np

# Read in the real values
real_values = pd.read_csv('real_values.csv', usecols=[0], header=None)
real_values = real_values.values.ravel()

# Read in the predicted values (using the full model)
predicted_values_full = pd.read_csv('predicted_values_full.csv', usecols=[0], header=None)
predicted_values_full = predicted_values_full.values.ravel()

In [68]:
cm = pd.crosstab(index=real_values, 
                 columns=np.round( (predicted_values_full >= threshold).astype(int) ), 
                 rownames=['actuals'], 
                 colnames=['predictions'])

TN = cm.loc[0.0, 0.0]
FP = cm.loc[0.0, 1.0]
FN = cm.loc[1.0, 0.0]
TP = cm.loc[1.0, 1.0]

accuracy = (TP + TN) / (TP + TN + FP + FN) * 100
precision = TP / (TP + FP) * 100 if (TP + FP) > 0 else 0
recall = TP / (TP + FN) * 100 if (TP + FN) > 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
specificity = TN / (TN + FP) * 100 if (TN + FP) > 0 else 0

print("MODEL USING ALL FEATURES \n")
print(cm)

print("\nAccuracy: {:.1f}".format(accuracy))
print("F1 Score: {:.1f}".format(f1_score))
print("Precision: {:.1f}".format(precision))
print("Recall: {:.1f}".format(recall))
print("Specificity: {:.1f}".format(specificity))

MODEL USING ALL FEATURES 

predictions    0    1
actuals              
0            105  121
1             80  377

Accuracy: 70.6
F1 Score: 79.0
Precision: 75.7
Recall: 82.5
Specificity: 46.5
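
The `cm.loc[...]` lookups above raise a `KeyError` if one of the classes is never predicted, because that column is then missing from the crosstab. As a minimal sketch of an alternative, the hypothetical helpers below (`confusion_counts`, `classification_metrics`) count the four cells directly, so an absent cell is simply 0. The last call reuses the counts from the confusion matrix printed above.

```python
def confusion_counts(actual, predicted):
    """Return (TN, FP, FN, TP) for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def classification_metrics(tn, fp, fn, tp):
    """Metrics in percent, with 0 returned when a denominator is empty."""
    def pct(num, den):
        return 100.0 * num / den if den > 0 else 0.0

    precision = pct(tp, tp + fp)
    recall = pct(tp, tp + fn)
    total = precision + recall
    return {
        "accuracy": pct(tp + tn, tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / total if total > 0 else 0.0,
        "specificity": pct(tn, tn + fp),
    }


# Class 0 is never predicted here, yet no lookup fails
counts = confusion_counts([0, 1, 1, 0], [1, 1, 1, 1])

# Counts taken from the confusion matrix printed above
full_model = classification_metrics(tn=105, fp=121, fn=80, tp=377)
print(counts, {name: round(value, 1) for name, value in full_model.items()})
```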


#### Model performance (model with 20 predictors)

In [69]:
# Read in the predicted values (using 20 predictors)
predicted_values_small = pd.read_csv('predicted_values_small.csv', usecols=[0], header=None)
predicted_values_small = predicted_values_small.values.ravel()

In [70]:
cm_small = pd.crosstab(index=real_values, 
                       columns=np.round( (predicted_values_small >= threshold).astype(int) ), 
                       rownames=['actuals'], 
                       colnames=['predictions'])

TN_small = cm_small.loc[0.0, 0.0]
FP_small = cm_small.loc[0.0, 1.0]
FN_small = cm_small.loc[1.0, 0.0]
TP_small = cm_small.loc[1.0, 1.0]

accuracy_small = (TP_small + TN_small) / (TP_small + TN_small + FP_small + FN_small) * 100
precision_small = TP_small / (TP_small + FP_small) * 100 if (TP_small + FP_small) > 0 else 0
recall_small = TP_small / (TP_small + FN_small) * 100 if (TP_small + FN_small) > 0 else 0
f1_score_small = 2 * (precision_small * recall_small) / (precision_small + recall_small) if (precision_small + recall_small) > 0 else 0
specificity_small = TN_small / (TN_small + FP_small) * 100 if (TN_small + FP_small) > 0 else 0

print("MODEL USING 20 FEATURES \n")
print(cm_small)

print("\nAccuracy: {:.1f}".format(accuracy_small))
print("F1 Score: {:.1f}".format(f1_score_small))
print("Precision: {:.1f}".format(precision_small))
print("Recall: {:.1f}".format(recall_small))
print("Specificity: {:.1f}".format(specificity_small))

MODEL USING 20 FEATURES 

predictions    0    1
actuals              
0            142   84
1             96  361

Accuracy: 73.6
F1 Score: 80.0
Precision: 81.1
Recall: 79.0
Specificity: 62.8


#### Top 20 features

In [71]:
pd.set_option('display.max_colwidth', None)
from IPython.display import display, Markdown

# Filter the DataFrame to only include rows where Variable_name is in top_20_features
top_20_dictionary = dictionary[dictionary["Variable_name"].isin(top_20_features)]
top_20_table = top_20_dictionary.set_index("Variable_name").loc[top_20_features].reset_index()
display(Markdown(top_20_table.to_markdown()))

|    | Variable_name   | Variable_label                                                                                                                                                                           |
|---:|:----------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  0 | SC211Q03JA      | Percentage [15-year-old modal grade] students who: Students from socioeconomically disadvantaged homes                                                                                   |
|  1 | SC213Q01JA      | In last 3 yrs, how many school days was school closed due to: Number of school days closed because of COVID-19:                                                                          |
|  2 | ST059Q02JA      | Total number of [class periods] per week for all subjects, including mathematics                                                                                                         |
|  3 | SC175Q01JA      | Minutes in [class period] for: Mathematics                                                                                                                                               |
|  4 | SC211Q01JA      | Percentage [15-year-old modal grade] students who: Students whose [heritage language] is different from [test language]                                                                  |
|  5 | LANGN_922       | Language at home - Other language in Uzbekistan                                                                                                                                          |
|  6 | WORKHOME        | Working in household/take care of family members before or after school                                                                                                                  |
|  7 | EXERPRAC        | Exercise or practice a sport before or after school                                                                                                                                      |
|  8 | SC064Q05WA      | Proportion parent/guardians who: Discussed their child's behaviour with a teacher on the parents' or guardians' own initiative                                                           |
|  9 | SC064Q07WA      | Proportion parent/guardians who: Assisted in fundraising for the school                                                                                                                  |
| 10 | PAREDINT        | Index highest parental education (international years of schooling scale)                                                                                                                |
| 11 | SC064Q06WA      | Proportion parent/guardians who: Discussed their child's behaviour on the initiative of one of their child's teachers                                                                    |
| 12 | STUDYHMW        | Studying for school or homework before or after school                                                                                                                                   |
| 13 | SC211Q05JA      | Percentage [15-year-old modal grade] students who: Students who have parents who have immigrated                                                                                         |
| 14 | SC064Q02TA      | Proportion parent/guardians who: Discussed their child's progress on the initiative of one of their child's teachers                                                                     |
| 15 | SC064Q04NA      | Proportion parent/guardians who: Volunteered in physical or extra-curricular activities, (e.g. building maintenance, carpentry, gardening or yard work, school play, sports, field trip) |
| 16 | ST294Q01JA      | How many days/wk before school: Eat breakfast                                                                                                                                            |
| 17 | SC064Q01TA      | Proportion parent/guardians who: Discussed their child's progress with a teacher on the parents' or guardians' own initiative                                                            |
| 18 | ST059Q01TA      | Number of [class periods] per week in mathematics                                                                                                                                        |
| 19 | SC178Q02JA      | Percentage of students who received these [marks] in math in last [school report]: [Marks] below the [pass mark]                                                                         |