# PISA 2022 Amazon SageMaker XGBoost - USA
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

***More info on SageMaker Immersion Day:*** [Workshop Link](https://catalog.us-east-1.prod.workshops.aws/workshops/63069e26-921c-4ce1-9cc7-dd882ff62575/en-US/lab2-model-training/pro-code)

---

## Contents

1. [Background](#Background)
1. [Preparation](#Preparation)
1. [Data](#Data)
    1. [Exploration](#Exploration)
    1. [Transformation](#Transformation)
1. [Training](#Training)
1. [Hosting](#Hosting)
1. [Evaluation](#Evaluation)
1. [Extensions](#Extensions)

---

## Background

This notebook runs an Amazon SageMaker pipeline to predict whether students will fall behind in Math, using the PISA 2022 dataset. It covers:

* Preparing your Amazon SageMaker notebook
* Downloading data from the internet into Amazon SageMaker
* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms
* Estimating a model using the Gradient Boosting algorithm
* Evaluating the effectiveness of the model
* Setting the model up to make on-going predictions

---

## Preparation

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

***Change country name below!***

In [1]:
country_name = 'United_States'
country_name_edited = country_name.replace("_", "-")

In [2]:
# cell 02
import sagemaker
bucket=sagemaker.Session().default_bucket()
#prefix = 'sagemaker/xgboost-'+country_name_edited
prefix = 'sagemaker/xgboost-'+country_name_edited+"WithWLE"

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
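
To make the role of `bucket` and `prefix` concrete, here is a minimal sketch of how the two could be combined into S3 locations for training and model data. The helper names `build_prefix` and `s3_uri`, the bucket name, and the `train/train.csv` key are hypothetical and used only for illustration; the prefix format mirrors the cell above.

```python
def build_prefix(country_name: str, suffix: str = "WithWLE") -> str:
    """Mirror the notebook's prefix: underscores become hyphens, then the suffix is appended."""
    return "sagemaker/xgboost-" + country_name.replace("_", "-") + suffix


def s3_uri(bucket: str, prefix: str, key: str) -> str:
    """Join bucket, prefix and object key into an s3:// URI without doubled slashes."""
    parts = [bucket.strip("/"), prefix.strip("/"), key.strip("/")]
    return "s3://" + "/".join(p for p in parts if p)


# Hypothetical bucket and key, for illustration only
example_prefix = build_prefix("United_States")
example_uri = s3_uri("my-example-bucket", example_prefix, "train/train.csv")
print(example_uri)
```

Keeping the country name inside the prefix means each country's artifacts land under their own S3 path, so changing `country_name` does not overwrite a previous run.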


Now let's bring in the Python libraries that we'll use throughout the analysis.

In [3]:
# cell 03
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
import zipfile                                    # For working with zip archives

In [4]:
# cell 04
pd.__version__

'2.2.3'

Make sure the pandas version is 1.2.4 or later. If it is not, upgrade pandas and restart the kernel before going further.
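
The version requirement can also be checked programmatically rather than by eye. The sketch below is one possible way to do it with the standard library only; `version_tuple` and `meets_minimum` are hypothetical helper names, not part of pandas or SageMaker.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn '2.2.3' (or '1.2.4rc1') into a tuple of leading integers, one per component."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


MIN_PANDAS = (1, 2, 4)  # minimum version stated in the text above


def meets_minimum(version: str, minimum: tuple = MIN_PANDAS) -> bool:
    """Return True when the version string is at least the minimum."""
    return version_tuple(version) >= minimum


print(meets_minimum("2.2.3"))
```

In the notebook, the call would be `meets_minimum(pd.__version__)`.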

---

## Download PISA 2022 Prepared Dataset

This is the dataset produced by our cleaning notebook, available [here](https://7z4vtvpqcoxouiu.studio.us-west-2.sagemaker.aws/jupyterlab/default/lab/tree/RTC%3Amids-capstone/notebooks/eda/Data_merging.ipynb).


In [5]:
%%time 

# cell 06

# Define local file path
local_file_path = "PISA_cleaned_dataset.csv"  # Change as needed

# Define S3 details
bucket_name = "sagemaker-us-west-2-986030204467"
file_key = "capstone/testfiles/PISA_cleaned_dataset.csv"

# Check if the file exists locally
if os.path.exists(local_file_path):
    print("📂 Loading data from local file...")
    data = pd.read_csv(local_file_path, usecols=None)
    
else:
    print("☁️ Downloading data from S3...")
    
    # Create S3 client
    s3_client = boto3.client("s3")

    # Download the file from S3
    response = s3_client.get_object(Bucket=bucket_name, Key=file_key)

    # Read the file into pandas DataFrame
    data = pd.read_csv(response["Body"], usecols=None)

    # Save a local copy for future use
    data.to_csv(local_file_path, index=False)
    print(f"✅ File saved locally as {local_file_path}")

# Display first few rows
#data.head()

pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
data

📂 Loading data from local file...
CPU times: user 23.6 s, sys: 3.64 s, total: 27.3 s
Wall time: 27.3 s


Output preview (first and last rows shown; intermediate columns elided as `...`; blank cells are missing values):

| | CNT | CNTSCHID | CNTSTUID | MATH_Proficient | SISCO | ST347Q01JA | ... | LANGN_922 |
|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 800282 | 800001 | 0 | | | ... | 0 |
| 1 | Albania | 800115 | 800002 | 0 | | | ... | 0 |
| 2 | Albania | 800242 | 800003 | 0 | | | ... | 0 |
| 3 | Albania | 800245 | 800005 | 0 | 1.0 | 6.0 | ... | 0 |
| 4 | Albania | 800285 | 800006 | 1 | 1.0 | 4.0 | ... | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 591852 | Uzbekistan | 86000120 | 86007488 | 0 | 1.0 | 2.0 | ... | 0 |
| 591853 | Uzbekistan | 86000140 | 86007489 | 0 | | | ... | 0 |
| 591854 | Uzbekistan | 86000024 | 86007490 | 0 | 1.0 | 1.0 | ... | 0 |
| 591855 | Uzbekistan | 86000174 | 86007491 | 0 | | | ... | 0 |


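Let's talk about the data.  At a high level, each row is one student, identified by country (`CNT`), school ID (`CNTSCHID`), and student ID (`CNTSTUID`), followed by questionnaire responses and derived indices.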

_**Specifics on each of the features:**_

*Target variable:*
* `MATH_Proficient`: Is the student proficient in Math per PISA statistics (average of the 10 Math plausible values >= 420.07)? (binary: 1 = yes, 0 = no)
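
As an illustration of the target definition, the sketch below derives the flag from a student's plausible values. It assumes the rule is exactly "mean of the plausible values at or above 420.07"; the function name `math_proficient` and the sample scores are made up, and the actual flag is computed upstream in the cleaning notebook.

```python
from statistics import mean

PROFICIENCY_CUTOFF = 420.07  # threshold quoted in the target description


def math_proficient(plausible_values):
    """Return 1 if the mean of the plausible values reaches the cutoff, else 0."""
    if len(plausible_values) == 0:
        raise ValueError("need at least one plausible value")
    return int(mean(plausible_values) >= PROFICIENCY_CUTOFF)


# Hypothetical students with ten made-up plausible values each
student_a = [455.1, 430.8, 441.0, 462.3, 448.7, 439.9, 451.2, 444.4, 437.6, 458.0]
student_b = [401.5, 415.2, 398.7, 422.9, 409.3, 411.8, 405.0, 418.6, 400.1, 413.4]
print(math_proficient(student_a), math_proficient(student_b))
```

Note that `student_b` has one plausible value above the cutoff but is still labeled 0, because the rule applies to the average rather than to any single value.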

### Exploration
Let's start exploring the data in our data prep widget.  First, let's understand how the features are distributed.

In [6]:
print(data['MATH_Proficient'].shape)

(591857,)


In [7]:
print(data.columns.duplicated().any()) 

False


In [6]:
# cell 10
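# Indicator encoding is disabled here; the cleaned dataset already contains indicator columns (e.g. LANGN_*)
# model_data = pd.get_dummies(data, dtype=float)
# Keep only the rows for the selected country
model_data = data[data['CNT'] == country_name]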
print(model_data.shape)

(4552, 570)
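
Since the notebook frames this as a problem with unbalanced classes, it can help to check the share of each target class once the country filter is applied. The sketch below uses only the standard library and made-up labels; `class_shares` is a hypothetical helper. In the notebook, the labels would come from `model_data['MATH_Proficient']`, and pandas' `value_counts(normalize=True)` gives the same result.

```python
from collections import Counter


def class_shares(labels):
    """Map each label to its fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only
toy_labels = [1, 1, 1, 0, 1, 0, 1, 1]
shares = class_shares(toy_labels)
print(shares)
```

If one class dominates, accuracy alone becomes a misleading metric, which is worth keeping in mind for the evaluation step.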


***The code below needs cleaning up if you use a cleaned dataset without the post-processed variables and with `MATH_Proficient` as the first variable.***

In [7]:
# Define the list of columns to drop
# columns_to_remove = [
#     "HOMEPOS", "RELATST", "BELONG", "BULLIED", "FEELSAFE", "SCHRISK", "PERSEVAGR", "CURIOAGR", 
#     "COOPAGR", "EMPATAGR", "ASSERAGR", "STRESAGR", "EMOCOAGR", "GROSAGR", "INFOSEEK", "FAMSUP", 
#     "DISCLIM", "TEACHSUP", "COGACRCO", "COGACMCO", "EXPOFA", "EXPO21ST", "MATHEFF", "MATHEF21", 
#     "FAMCON", "ANXMAT", "MATHPERS", "CREATEFF", "CREATSCH", "CREATFAM", "CREATAS", "CREATOOS", 
#     "CREATOP", "OPENART", "IMAGINE", "SCHSUST", "LEARRES", "PROBSELF", "FAMSUPSL", "FEELLAH", 
#     "SDLEFF", "ICTRES", "FLSCHOOL", "FLMULTSB", "FLFAMILY", "ACCESSFP", "FLCONFIN", "FLCONICT", 
#     "ACCESSFA", "ATTCONFM", "FRINFLFM", "ICTSCH", "ICTHOME", "ICTQUAL", "ICTSUBJ", "ICTENQ", 
#     "ICTFEED", "ICTOUT", "ICTWKDY", "ICTWKEND", "ICTREG", "ICTINFO", "ICTEFFIC", "BODYIMA", 
#     "SOCONPA", "LIFESAT", "PSYCHSYM", "SOCCON", "EXPWB", "CURSUPP", "PQMIMP", "PQMCAR", 
#     "PARINVOL", "PQSCHOOL", "PASCHPOL", "ATTIMMP", "CREATHME", "CREATACT", "CREATOPN", 
#     "CREATOR", "SCHAUTO", "TCHPART", "EDULEAD", "INSTLEAD", "ENCOURPG", "DIGDVPOL", "TEAFDBK", 
#     "MTTRAIN", "DMCVIEWS", "NEGSCLIM", "STAFFSHORT", "EDUSHORT", "STUBEHA", "TEACHBEHA", 
#     "STDTEST", "TDTEST", "ALLACTIV", "BCREATSC", "CREENVSC", "ACTCRESC", "OPENCUL", 
#     "PROBSCRI", "SCPREPBP", "SCPREPAP", "DIGPREP", "ESCS", "BMMJ1", "BFMJ2"
# ]

columns_to_remove = []

# Drop the columns above
model_data = model_data.drop(columns=columns_to_remove, errors='ignore')  # `errors='ignore'` prevents errors if a column isn't found

# Get the list of columns
cols = list(model_data.columns)

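# Move the target column ('MATH_Proficient') to the front by name rather than by position;
# SageMaker's built-in XGBoost expects the target in the first column of CSV training data
target_col = 'MATH_Proficient'
new_order = [target_col] + [c for c in cols if c != target_col]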

# Apply the new column order to model_data
model_data = model_data[new_order]

# Check the shape after dropping
print("Shape after dropping post-processed variables:", model_data.shape)

Shape after dropping post-processed variables: (4552, 570)


In [8]:
model_data.head()

Output preview (intermediate columns elided as `...`):

| | MATH_Proficient | CNT | CNTSCHID | CNTSTUID | SISCO | ST347Q01JA | ... | LANGN_922 |
|---|---|---|---|---|---|---|---|---|
| 573394 | 1 | United_States | 84000060 | 84000002 | 1.0 | 4.0 | ... | 0 |
| 573395 | 1 | United_States | 84000055 | 84000003 | 0.0 | 2.0 | ... | 0 |
573396,1,United_States,84000121,84000004,1.0,5.0,1.0,0,1,0,0,0,1.0,3.0,,,,,,,2.0,9.0,10.0,5.0,,1.1761,2.0,0.0,0.0,8.0,7.0,6.0,6.0,1.0,0.0,3.0,,0.0,1,1,16.0,67.94,82.41,,,4.0,0.0,2.0,3.0,35.0,4.0,,9.0,4.0,4.0,3.0,4.0,3.0,3.0,,4.0,4.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,5.0,,0.1729,-0.3397,-1.228,-0.756,0.5348,,,,,,,,0.4093,-0.1336,-0.0759,0.3494,0.1475,0.3115,0.5258,-0.1153,0.5318,0.2647,1.4576,3.8646,-0.8811,0.2501,,,,,,,,,-0.613,0.5977,0.4215,0.1281,,0.1381,1.002,1.3463,0.2562,-0.9246,0.7196,-0.3331,0.4209,0.4078,0.0747,0.5131,0.0098,0.4062,0.3346,0.9323,0.8077,0.501,-0.034,-0.1167,-0.9287,-0.0838,0.1936,-0.6519,0.8652,,,,,,,,,,,,,,,,,,0.0,5.0,4.0,14.0,10.0,19.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,90.0,1,0,0,1,0,0,1,0,0,4.0,4.0,3.0,3.0,1.0,3.0,3.0,3.0,4.0,4.0,1.0,1.0,1.0,1.0,85.0,15.0,1.0,1.0,2.0,2.0,100.0,23.0,5.0,0,0,0,1,2.0,10.0,5.0,25.0,25.0,10.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,,-0.8024,1.0982,,0.6029,-0.209,-1.4212,-0.2527,0.2266,-0.355,0.1509,2.1298,,,,,,,,1.9844,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573397,1,United_States,84000013,84000005,1.0,5.0,1.0,0,0,1,0,0,1.0,3.0,,,,,,,3.0,10.0,7.0,6.0,,-0.9389,1.0,0.0,0.0,4.0,7.0,6.0,9.0,1.0,0.0,1.0,,0.0,1,1,12.0,24.98,,,,4.0,1.0,3.0,3.0,35.0,4.0,,3.0,4.0,4.0,4.0,4.0,5.0,2.0,,1.0,1.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,2.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,-0.476,-0.4867,0.845,-0.756,1.6441,,,,,,,,-0.2794,-0.1891,-0.3054,-2.4517,-1.0693,-0.2027,-0.173,-0.4268,-0.2228,-1.2825,-0.3274,-0.9285,2.5078,-0.1429,,,,,,,,,-0.9827,-1.114,1.9628,-0.4236,-0.3047,-0.3446,-0.748,-1.3108,-1.5638,-0.2047,1.9974,0.6641,1.5367,2.0781,0.4997,-0.1178,0.5052,0.4062,0.3346,0.1515,0.113,0.8108,0.0753,-0.3164,0.6306,0.6194,0.1612,0.6984,2.2012,,,,,,,,,,,,,,,,,,10.0,10.0,3.0,1.0,13.0,10.0,0.0,8.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,50.0,0,1,0,0,0,1,0,0,1,4.0,4.0,1.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,3.0,2.0,1.0,1.0,84.0,16.0,2.0,2.0,1.0,1.0,100.0,23.0,2.0,0,0,0,0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,-1.9591,1.0982,,0.6029,0.5236,-1.4212,0.7768,-0.4415,0.1368,0.4801,1.1965,,,,,,,,1.521,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573398,1,United_States,84000010,84000006,1.0,4.0,1.0,0,0,0,0,1,2.0,4.0,,,,,,,2.0,9.0,9.0,5.0,,0.2333,1.0,1.0,0.0,9.0,7.0,6.0,5.0,1.0,2.0,2.0,,0.0,1,1,12.0,,16.5,,,4.0,0.0,3.0,1.0,20.0,2.0,,1.0,5.0,5.0,5.0,,4.0,1.0,,1.0,2.0,3.0,,0.0,0.0,0.0,1.0,1.0,0.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,5.0,,1.1739,2.7562,0.379,1.1246,-0.6386,,,,,,,,0.7577,0.2189,-0.896,1.5567,0.4357,0.1869,-0.5739,-0.86,0.1874,-0.798,0.0348,0.8133,-0.5006,2.0741,,,,,,,,,-0.9161,0.0822,-0.7514,-0.6898,-0.1882,1.7786,1.7606,-0.9745,0.5646,0.6268,1.1982,0.267,-0.9491,0.6847,0.3443,-0.8863,-1.8695,0.4062,0.3346,1.2692,0.9944,1.4769,1.2131,1.5408,0.2304,0.2188,0.2255,0.8296,0.2461,,,,,,,,,,,,,,,,,,5.0,0.0,5.0,60.0,35.0,70.0,30.0,30.0,20.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,90.0,0,1,0,0,1,0,0,1,0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,60.0,40.0,2.0,2.0,2.0,1.0,100.0,28.0,1.0,0,0,0,0,2.0,50.0,75.0,50.0,75.0,50.0,10.0,30.0,90.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,0.2189,1.0982,,0.9177,0.7007,0.1312,1.7484,0.6101,1.9769,0.5645,0.5485,,,,,0.5308,-0.8314,0.8462,-0.1166,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


We'll randomly split the data into three uneven groups.  **The model is trained on 70% of the data, then evaluated on a further 15% to estimate the accuracy we can expect on "new" data, and the remaining 15% is held back as a final test dataset for use later on.**

A seed is included in the code so the splits can be replicated!

In [9]:
# cell 12
# Shuffle the rows, then split off the first 70%, the next 15%, and the last 15%
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.85 * len(model_data))])

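
In recent pandas versions, calling `np.split` on a DataFrame can emit a deprecation warning. As a minimal sketch of the same shuffle-then-cut idea using positional slicing instead, assuming a hypothetical helper `split_70_15_15` and a toy frame standing in for `model_data`:

```python
import numpy as np
import pandas as pd

def split_70_15_15(df, seed=1729):
    """Shuffle df and cut it into 70% / 15% / 15% pieces by position."""
    shuffled = df.sample(frac=1, random_state=seed)
    n = len(shuffled)
    train_end = int(0.7 * n)         # same cut points as the np.split call
    validation_end = int(0.85 * n)
    return (
        shuffled.iloc[:train_end],
        shuffled.iloc[train_end:validation_end],
        shuffled.iloc[validation_end:],
    )

# Toy frame standing in for model_data (hypothetical values)
toy = pd.DataFrame({"MATH_Proficient": np.arange(100) % 2, "x": np.arange(100)})
train_part, validation_part, test_part = split_70_15_15(toy)
print(len(train_part), len(validation_part), len(test_part))
```

The cut points are computed the same way as in the cell above, so with the same seed this should assign rows to the same three groups.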


In [10]:
print("Number of rows in FULL dataset:", model_data.shape[0])

train_data_percent = round(train_data.shape[0]/model_data.shape[0] * 100, 0)
print("Number of rows in TRAINING dataset:", train_data.shape[0], ",", train_data_percent, "%")

validation_data_percent = round(validation_data.shape[0]/model_data.shape[0] * 100, 0)
print("Number of rows in VALIDATION dataset:", validation_data.shape[0], ",", validation_data_percent, "%")

test_data_percent = round(test_data.shape[0]/model_data.shape[0] * 100, 0)
print("Number of rows in TEST dataset:", test_data.shape[0], ",", test_data_percent, "%")

Number of rows in FULL dataset: 4552
Number of rows in TRAINING dataset: 3186 , 70.0 %
Number of rows in VALIDATION dataset: 683 , 15.0 %
Number of rows in TEST dataset: 683 , 15.0 %
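
Because this is a binary problem with unbalanced classes, it may also be worth checking that a purely random split leaves a similar share of positive labels in each group. A minimal sketch, assuming the target column is `MATH_Proficient` coded as 0/1 and using a hypothetical helper `positive_share` on toy data:

```python
import pandas as pd

def positive_share(df, target="MATH_Proficient"):
    """Return the fraction of rows where the binary target equals 1."""
    return float((df[target] == 1).mean())

# Toy splits standing in for train_data and validation_data (hypothetical values)
toy_splits = {
    "train": pd.DataFrame({"MATH_Proficient": [1, 0, 0, 1]}),
    "validation": pd.DataFrame({"MATH_Proficient": [0, 0, 1, 0]}),
}
shares = {name: positive_share(part) for name, part in toy_splits.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} positive")
```

In the notebook, the same helper could be applied to `train_data`, `validation_data`, and `test_data`; large differences between the shares would suggest trying a stratified split.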


Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format.  **Note that the first column must be the target variable and the CSV should not include headers.**  Although repetitive, it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.

In [11]:
# cell 13
#pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
#pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)

# Drop non-numeric (string) columns, e.g. the country name
non_numeric_columns = train_data.select_dtypes(exclude=['number']).columns
non_numeric_columns_test = test_data.select_dtypes(exclude=['number']).columns

train_data = train_data.drop(columns=non_numeric_columns)
validation_data = validation_data.drop(columns=non_numeric_columns)
test_data = test_data.drop(columns=non_numeric_columns_test)

# Save train dataset 
train_data.to_csv('train.csv', index=False, header=False)

# Save validation dataset 
validation_data.to_csv('validation.csv', index=False, header=False)
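
The `head()` output below shows `MATH_Proficient` as the first data column, which is what the container expects. As a safeguard, the ordering can be made explicit before saving. This is a sketch only, assuming a hypothetical helper `write_xgboost_csv` and a toy frame rather than the real data:

```python
import os
import tempfile
import pandas as pd

def write_xgboost_csv(df, target, path):
    """Write df as a headerless, index-free CSV with the target column first."""
    ordered = [target] + [c for c in df.columns if c != target]
    df[ordered].to_csv(path, index=False, header=False)

# Toy frame with the target deliberately placed in the middle (hypothetical values)
toy = pd.DataFrame({"x1": [0.5, 1.5], "MATH_Proficient": [1, 0], "x2": [3.0, 4.0]})
csv_path = os.path.join(tempfile.mkdtemp(), "toy_train.csv")
write_xgboost_csv(toy, "MATH_Proficient", csv_path)

# Read the file back without a header to confirm the layout
check = pd.read_csv(csv_path, header=None)
first_column = check.iloc[:, 0].tolist()
print(first_column)
```

Reading the file back with `header=None` confirms that no header row was written and that the first column holds the labels.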


In [12]:
# Training data - Saved to S3 as CSV
print(train_data.shape)
train_data.head()

(3186, 569)


Unnamed: 0,MATH_Proficient,CNTSCHID,CNTSTUID,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,EFFORT1,EFFORT2,ST259Q01JA,WB164Q01HA,HOMEPOS,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,Option_UH,OECD,PAREDINT,BMMJ1,BFMJ2,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,RELATST,BELONG,BULLIED,FEELSAFE,SCHRISK,PERSEVAGR,CURIOAGR,COOPAGR,EMPATAGR,ASSERAGR,STRESAGR,EMOCOAGR,GROSAGR,INFOSEEK,FAMSUP,DISCLIM,TEACHSUP,COGACRCO,COGACMCO,EXPOFA,EXPO21ST,MATHEFF,MATHEF21,FAMCON,ANXMAT,MATHPERS,CREATEFF,CREATSCH,CREATFAM,CREATAS,CREATOOS,CREATOP,OPENART,IMAGINE,SCHSUST,LEARRES,PROBSELF,FAMSUPSL,FEELLAH,SDLEFF,ICTRES,ESCS,FLSCHOOL,FLMULTSB,FLFAMILY,ACCESSFP,FLCONFIN,FLCONICT,ACCESSFA,ATTCONFM,FRINFLFM,ICTSCH,ICTHOME,ICTQUAL,ICTSUBJ,ICTENQ,ICTFEED,ICTOUT,ICTWKDY,ICTWKEND,ICTREG,ICTINFO,ICTEFFIC,BODYIMA,SOCONPA,LIFESAT,PSYCHSYM,SOCCON,EXPWB,CURSUPP,PQMIMP,PQMCAR,PARINVOL,PQSCHOOL,PASCHPOL,ATTIMMP,CREATHME,CREATACT,CREATOPN,CREATOR,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC209Q04JA,SC209Q05JA,SC209Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03
JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,...,DIGDVPOL,TEAFDBK,MTTRAIN,DMCVIEWS,NEGSCLIM,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,STDTEST,TDTEST,ALLACTIV,BCREATSC,CREENVSC,ACTCRESC,OPENCUL,PROBSCRI,SCPREPBP,SCPREPAP,DIGPREP,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673
,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
575474,0,84000082,84003756,0.0,4.0,1.0,0,1,0,0,0,1.0,1.0,,,,,,,,1.0,1.0,5.0,,-1.1481,1.0,0.0,0.0,,7.0,6.0,,3.0,0.0,3.0,,0.0,1,1,14.5,24.53,80.78,,,4.0,0.0,3.0,1.0,8.0,1.0,,0.0,5.0,5.0,5.0,5.0,3.0,1.0,,1.0,2.0,1.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,-0.6183,-0.7308,0.5154,-0.756,-0.6386,,,,,,,,-1.3676,-0.9894,-1.5073,1.0855,-0.3322,-2.3562,-1.9366,2.3809,-2.2407,-3.3656,-2.0022,-3.6782,2.5429,-2.6752,,,,,,,,,-1.7994,-1.676,0.4139,-1.6587,-2.6173,-2.1826,-0.2097,0.0772,-1.5638,-1.6481,-0.4553,-1.3771,-2.184,-2.1589,-0.2971,-1.5754,-0.5383,0.4062,0.3346,0.5981,0.6764,-1.1226,-0.9048,0.7546,3.9236,1.4846,-0.8917,-0.027,-0.3083,,,,,,,,,,,,,,,,,,1.0,1.0,4.0,49.0,12.0,53.0,19.0,37.0,2.0,,,,,,,,,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,100.0,,,0,0,0,0,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
574994,1,84000118,84002904,1.0,,2.0,0,0,0,0,1,2.0,3.0,,,,,,,,7.0,9.0,3.0,,-0.2785,1.0,1.0,0.0,7.0,7.0,6.0,9.0,1.0,2.0,2.0,,0.0,1,1,12.0,30.9,18.13,,,3.0,1.0,3.0,3.0,,1.0,,1.0,4.0,4.0,4.0,4.0,,,,1.0,3.0,4.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,4.0,2.0,,,,,,,,,,,,,,,,,,,,,,1.0,,-0.5702,-2.1726,0.0102,-1.3019,0.9993,,,,,,,,0.7577,1.209,-0.5722,,,,,1.2671,0.7737,-0.5362,0.1895,1.3158,0.3025,-0.0523,,,,,,,,,0.3775,0.1809,0.7439,-1.0628,-0.0534,-0.3865,-0.3003,-0.9194,-0.5552,1.5186,-1.3402,-0.556,-0.9413,0.0905,0.1818,-0.1522,0.4889,0.4062,0.3346,0.1203,,-0.0668,0.2913,-0.2136,0.5391,0.8846,-0.6861,-0.4148,-0.82,,,,,,,,,,,,,,,,,,0.0,0.0,3.0,13.0,13.0,41.0,12.0,30.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,90.0,1,0,0,1,0,0,1,0,0,4.0,4.0,2.0,4.0,2.0,2.0,1.0,2.0,3.0,3.0,1.0,2.0,1.0,1.0,65.0,35.0,2.0,1.0,1.0,1.0,100.0,33.0,2.0,0,0,0,0,1.0,10.0,30.0,20.0,20.0,5.0,5.0,4.0,230.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,,0.656,1.0982,,-0.0945,1.1703,-1.4212,1.2222,0.933,0.3525,0.1781,0.9381,,,,,-0.2829,0.7922,0.8462,-0.0391,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
574164,0,84000031,84001409,1.0,3.0,1.0,0,0,0,0,1,1.0,1.0,,,,,,,2.0,7.0,8.0,6.0,,-0.8731,1.0,0.0,0.0,9.0,7.0,6.0,8.0,1.0,1.0,3.0,,0.0,1,1,16.0,57.0,17.0,,,4.0,0.0,3.0,3.0,1.0,4.0,,4.0,3.0,3.0,4.0,4.0,1.0,1.0,,1.0,1.0,3.0,2.0,1.0,0.0,0.0,1.0,1.0,0.0,,,,,,1.0,2.0,6.0,,,,,,,,,,,,,,,,,,,,,,9.0,,0.6031,-0.4514,-1.228,-0.756,-0.6386,,,,,,,,0.5481,2.4397,-0.2439,-0.0504,0.4357,0.7143,0.241,-0.1372,0.3595,-0.9866,0.7063,-0.2182,1.0161,0.0248,,,,,,,,,-0.9729,-1.0323,0.7005,0.3146,0.2668,-0.2182,-0.6172,-0.0014,0.9734,0.0506,-0.1845,0.6217,0.4118,0.4078,0.049,0.2737,0.9251,0.4062,0.3346,0.3623,0.9928,0.0084,-0.0179,-0.3155,0.8678,0.5069,0.5201,-0.111,-0.7498,,,,,,,,,,,,,,,,,,0.0,6.0,4.0,9.0,16.0,83.0,4.0,9.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,50.0,0,0,1,1,0,0,0,0,1,3.0,3.0,2.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,1.0,50.0,50.0,1.0,1.0,2.0,1.0,100.0,33.0,5.0,0,0,0,1,2.0,35.0,50.0,50.0,50.0,1.0,15.0,1.0,75.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,-1.2499,1.0982,,0.5063,-0.2766,-1.4212,0.9694,0.2266,0.947,0.5134,2.1298,,,,,0.7167,0.7823,0.8462,2.0888,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576638,0,84000004,84005803,0.0,4.0,1.0,0,1,0,0,0,1.0,,,,,,,,2.0,7.0,10.0,7.0,,-0.9628,1.0,0.0,0.0,8.0,6.0,5.0,5.0,1.0,1.0,3.0,,0.0,1,1,16.0,62.13,26.8,,,2.0,0.0,2.0,2.0,20.0,3.0,,6.0,,,,,3.0,1.0,,2.0,2.0,4.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,6.0,,-0.3931,-0.9068,-1.228,-0.756,-0.6386,,,,,,,,0.3305,0.5613,-0.7277,1.7914,-0.1002,-2.2388,-0.2284,-0.7486,-0.8234,-0.1005,-0.2102,-0.4084,0.6387,0.1803,,,,,,,,,1.9506,0.4948,2.8316,0.7804,2.098,,-0.6,0.0574,0.1025,,,,,,,,,-0.0831,0.263,-0.9557,0.445,0.2475,0.6371,2.7353,0.4019,0.0028,0.5645,0.1649,2.2293,,,,,,,,,,,,,,,,,,0.0,10.0,3.0,65.0,6.0,1.0,65.0,80.0,0.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,90.0,0,0,1,0,0,1,1,0,0,4.0,4.0,2.0,1.0,4.0,4.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,60.0,40.0,2.0,2.0,2.0,1.0,100.0,23.0,2.0,0,0,0,0,3.0,40.0,80.0,40.0,90.0,0.0,20.0,0.0,100.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,,0.095,0.1983,,-0.3827,-0.5393,-1.4212,0.014,0.2397,1.9769,0.4252,1.0179,,,,,-0.0105,1.5573,0.1102,0.925,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
577351,1,84000075,84007073,1.0,1.0,1.0,0,0,0,0,0,,,,,,,,,,9.0,9.0,6.0,,-0.1205,1.0,0.0,0.0,7.0,7.0,6.0,2.0,1.0,1.0,3.0,,0.0,1,1,16.0,29.14,78.69,,,4.0,0.0,3.0,1.0,20.0,4.0,,4.0,5.0,3.0,1.0,,5.0,2.0,,4.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,4.0,5.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.1047,0.5159,-1.228,0.1413,-0.6386,,,,,,,,0.7577,0.6888,-0.6354,0.6461,1.5558,1.0162,0.4067,-0.2172,0.5446,0.7226,-0.0224,1.9154,-2.249,0.3149,,,,,,,,,,,,,,,0.4484,0.7268,-0.6368,-1.1953,0.6086,0.1251,-0.0716,0.1553,0.0071,0.6231,-0.7656,0.4062,0.3346,1.6428,0.6342,0.1404,-0.2549,1.4466,-0.3215,0.3133,0.5313,-0.3784,0.0093,,,,,,,,,,,,,,,,,,0.0,0.0,3.0,20.0,2.0,22.0,3.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,90.0,0,0,1,0,0,1,1,0,0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,95.0,5.0,1.0,1.0,1.0,1.0,100.0,18.0,3.0,0,0,0,1,3.0,56.0,79.0,100.0,99.0,23.0,10.0,0.0,80.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,1.1488,1.0982,,-1.1403,-2.1066,-1.4212,-2.5157,-1.2943,0.8532,1.8799,0.1364,,,,,-3.0365,-0.8314,0.8462,3.7675,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [13]:
# Validation data - Saved to S3 as CSV
print(validation_data.shape)
validation_data.head()

(683, 569)


Unnamed: 0,MATH_Proficient,CNTSCHID,CNTSTUID,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,EFFORT1,EFFORT2,ST259Q01JA,WB164Q01HA,HOMEPOS,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,Option_UH,OECD,PAREDINT,BMMJ1,BFMJ2,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,RELATST,BELONG,BULLIED,FEELSAFE,SCHRISK,PERSEVAGR,CURIOAGR,COOPAGR,EMPATAGR,ASSERAGR,STRESAGR,EMOCOAGR,GROSAGR,INFOSEEK,FAMSUP,DISCLIM,TEACHSUP,COGACRCO,COGACMCO,EXPOFA,EXPO21ST,MATHEFF,MATHEF21,FAMCON,ANXMAT,MATHPERS,CREATEFF,CREATSCH,CREATFAM,CREATAS,CREATOOS,CREATOP,OPENART,IMAGINE,SCHSUST,LEARRES,PROBSELF,FAMSUPSL,FEELLAH,SDLEFF,ICTRES,ESCS,FLSCHOOL,FLMULTSB,FLFAMILY,ACCESSFP,FLCONFIN,FLCONICT,ACCESSFA,ATTCONFM,FRINFLFM,ICTSCH,ICTHOME,ICTQUAL,ICTSUBJ,ICTENQ,ICTFEED,ICTOUT,ICTWKDY,ICTWKEND,ICTREG,ICTINFO,ICTEFFIC,BODYIMA,SOCONPA,LIFESAT,PSYCHSYM,SOCCON,EXPWB,CURSUPP,PQMIMP,PQMCAR,PARINVOL,PQSCHOOL,PASCHPOL,ATTIMMP,CREATHME,CREATACT,CREATOPN,CREATOR,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC209Q04JA,SC209Q05JA,SC209Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03
JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,...,DIGDVPOL,TEAFDBK,MTTRAIN,DMCVIEWS,NEGSCLIM,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,STDTEST,TDTEST,ALLACTIV,BCREATSC,CREENVSC,ACTCRESC,OPENCUL,PROBSCRI,SCPREPBP,SCPREPAP,DIGPREP,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673
,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
574478,1,84000008,84001986,1.0,6.0,1.0,0,0,0,0,1,1.0,3.0,,,,,,,,10.0,10.0,7.0,,0.8067,1.0,0.0,0.0,6.0,7.0,6.0,4.0,1.0,0.0,3.0,,0.0,1,1,12.0,26.64,25.95,,,4.0,0.0,1.0,3.0,3.0,3.0,,9.0,3.0,4.0,3.0,,3.0,1.0,,4.0,4.0,4.0,,0.0,1.0,1.0,0.0,0.0,0.0,,,,,,5.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,3.0,,0.1045,0.0311,-1.228,-0.756,1.1023,,,,,,,,0.3305,-0.4405,-0.0655,0.1997,1.5558,2.1044,1.5715,-0.4304,0.3636,0.5076,1.5288,0.3847,2.5026,2.8491,,,,,,,,,1.023,0.7935,0.479,0.5361,1.2196,0.0219,1.0229,-0.5402,-0.5552,0.0324,1.8656,0.5406,0.2255,1.2021,1.5151,0.5408,-0.9579,0.4062,0.3346,-0.1177,1.8109,2.9787,1.6331,2.1015,0.2928,0.2424,0.0831,2.6014,0.0986,,,,,,,,,,,,,,,,,,0.0,10.0,4.0,40.0,30.0,79.0,11.0,18.0,4.0,,,,,,,,,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573678,1,84000068,84000520,,,,0,0,0,0,0,,,,,,,,,,8.0,10.0,9.0,,-0.0418,2.0,0.0,0.0,,,,,1.0,1.0,3.0,,0.0,1,1,16.0,38.88,30.78,,,3.0,1.0,,,7.0,1.0,,0.0,,,,,7.0,1.0,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.2316,-0.1965,-1.228,-0.756,0.4456,,,,,,,,-0.0994,,,0.1166,-0.1002,,,,,,,,,,,,,,,,,,,,,,,,-0.7123,0.0085,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0,0.0,,1.0,13.0,22.0,0.0,,0.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,50.0,0,0,1,0,0,1,0,1,0,4.0,3.0,2.0,4.0,2.0,2.0,2.0,4.0,4.0,4.0,2.0,1.0,1.0,2.0,75.0,25.0,1.0,2.0,2.0,1.0,,23.0,4.0,0,0,0,1,2.0,15.0,32.0,50.0,40.0,50.0,5.0,30.0,45.0,18.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,0.1848,0.1756,,0.5245,0.961,-0.4413,1.35,0.9355,0.947,0.6721,1.1965,,,,,0.2762,-0.8314,0.1102,0.1131,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576927,0,84000049,84006320,1.0,,4.0,0,0,0,0,1,1.0,2.0,,,,,,,2.0,7.0,8.0,5.0,,-1.3368,1.0,0.0,0.0,9.0,4.0,,,1.0,1.0,4.0,,1.0,1,1,16.0,,,,,4.0,0.0,,,4.0,1.0,,4.0,,,,,1.0,1.0,,1.0,1.0,3.0,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,-0.8798,-0.5964,-1.228,-0.756,0.4139,,,,,,,,-0.2794,-0.5128,1.6461,-0.6761,-1.0693,0.437,-0.5169,0.5942,2.1164,-3.4378,-2.3493,2.0407,2.5078,-0.1177,,,,,,,,,-2.508,0.3302,0.5352,-0.5082,-0.2159,-2.5236,-1.2528,-0.6956,,,1.0098,,0.1788,0.6273,1.0014,,0.9251,0.1606,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,4.0,5.0,2.0,83.0,10.0,7.0,,,,,,,,,80.0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,1.0,,,,,23.0,1.0,0,0,0,0,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,1.5605,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576321,1,84000138,84005234,0.0,4.0,1.0,0,1,0,0,0,1.0,3.0,,,,,,,1.0,8.0,,8.0,,1.8748,1.0,1.0,0.0,7.0,7.0,6.0,4.0,2.0,0.0,2.0,,0.0,1,1,16.0,63.03,65.01,,,3.0,1.0,2.0,2.0,20.0,6.0,,5.0,3.0,3.0,,,5.0,1.0,,1.0,1.0,4.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,-0.098,-0.3485,-1.228,-0.756,-0.6386,,,,,,,,0.7577,0.1352,0.8359,-0.2863,0.4357,-0.2533,-0.7195,0.0117,0.0065,-1.2399,-0.4296,0.6755,0.429,-0.4855,,,,,,,,,-0.0508,0.3561,-0.2545,-0.3973,-0.4074,-0.9785,2.8019,1.3137,-1.5638,0.3684,-0.4993,-0.2928,-0.2246,0.924,0.3975,-0.407,-0.8774,0.4062,0.3346,0.13,0.4646,0.0962,-0.8331,-0.4407,0.3179,0.668,0.0574,0.0811,0.0044,,,,,,,,,,,,,,,,,,0.0,5.0,2.0,44.0,12.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,40.0,0,0,1,0,0,1,0,0,1,2.0,2.0,2.0,4.0,1.0,1.0,1.0,4.0,3.0,3.0,1.0,2.0,2.0,2.0,100.0,0.0,2.0,2.0,2.0,1.0,65.2727,13.0,3.0,0,0,0,0,2.0,90.0,100.0,90.0,100.0,50.0,50.0,60.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,...,,,-0.0381,,0.0359,0.0572,-0.6344,-0.2527,0.2397,-1.7236,-0.4116,2.1298,,,,,,,,1.1612,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576339,1,84000124,84005268,1.0,3.0,1.0,0,0,0,0,1,2.0,3.0,,,,,,,3.0,10.0,7.0,4.0,,-1.3411,2.0,-1.0,0.0,4.0,,,,,0.0,1.0,,0.0,1,1,12.0,43.85,24.53,,,4.0,0.0,,,7.0,5.0,,2.0,,,,,7.0,1.0,,4.0,4.0,4.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.2898,-1.0249,-0.267,-0.756,-0.6386,,,,,,,,0.6466,-0.7264,-1.4343,-0.3272,-0.1002,-0.8184,-0.5026,-2.772,-0.4285,-0.7925,-0.6121,-0.639,-0.3794,0.082,,,,,,,,,-0.2235,0.3296,-0.5591,-0.4879,0.6688,-0.3988,-0.8601,-1.1249,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,4.0,2.0,0.0,18.0,30.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,55.0,0,0,1,0,0,1,0,0,1,2.0,4.0,2.0,3.0,4.0,4.0,2.0,3.0,2.0,3.0,2.0,1.0,1.0,1.0,80.0,20.0,1.0,1.0,1.0,1.0,100.0,23.0,4.0,0,0,0,1,2.0,50.0,100.0,50.0,100.0,25.0,25.0,75.0,20.0,5.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,,1.0689,-0.0313,,-0.1428,3.7658,-0.3123,-0.7973,0.5395,1.9769,1.2251,1.0123,,,,,-0.2488,0.5151,0.8462,0.1902,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [14]:
# Test data - NOT SAVED TO S3
print(test_data.shape)
test_data.head()

(683, 569)


Unnamed: 0,MATH_Proficient,CNTSCHID,CNTSTUID,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,EFFORT1,EFFORT2,ST259Q01JA,WB164Q01HA,HOMEPOS,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,Option_UH,OECD,PAREDINT,BMMJ1,BFMJ2,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,RELATST,BELONG,BULLIED,FEELSAFE,SCHRISK,PERSEVAGR,CURIOAGR,COOPAGR,EMPATAGR,ASSERAGR,STRESAGR,EMOCOAGR,GROSAGR,INFOSEEK,FAMSUP,DISCLIM,TEACHSUP,COGACRCO,COGACMCO,EXPOFA,EXPO21ST,MATHEFF,MATHEF21,FAMCON,ANXMAT,MATHPERS,CREATEFF,CREATSCH,CREATFAM,CREATAS,CREATOOS,CREATOP,OPENART,IMAGINE,SCHSUST,LEARRES,PROBSELF,FAMSUPSL,FEELLAH,SDLEFF,ICTRES,ESCS,FLSCHOOL,FLMULTSB,FLFAMILY,ACCESSFP,FLCONFIN,FLCONICT,ACCESSFA,ATTCONFM,FRINFLFM,ICTSCH,ICTHOME,ICTQUAL,ICTSUBJ,ICTENQ,ICTFEED,ICTOUT,ICTWKDY,ICTWKEND,ICTREG,ICTINFO,ICTEFFIC,BODYIMA,SOCONPA,LIFESAT,PSYCHSYM,SOCCON,EXPWB,CURSUPP,PQMIMP,PQMCAR,PARINVOL,PQSCHOOL,PASCHPOL,ATTIMMP,CREATHME,CREATACT,CREATOPN,CREATOR,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC209Q04JA,SC209Q05JA,SC209Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03
JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,...,DIGDVPOL,TEAFDBK,MTTRAIN,DMCVIEWS,NEGSCLIM,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,STDTEST,TDTEST,ALLACTIV,BCREATSC,CREENVSC,ACTCRESC,OPENCUL,PROBSCRI,SCPREPBP,SCPREPAP,DIGPREP,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673
,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
577660,1,84000126,84007619,,,,0,0,0,0,0,,,,,,,,,,,,,,,2.0,0.0,,,,,,,,,,,1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,1.0,13.0,29.0,1.0,1.0,0.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,90.0,0,0,1,1,0,0,0,0,1,3.0,3.0,3.0,3.0,2.0,2.0,2.0,4.0,2.0,2.0,1.0,2.0,1.0,1.0,65.0,35.0,1.0,2.0,1.0,2.0,100.0,23.0,4.0,0,0,0,1,3.0,10.0,20.0,20.0,30.0,10.0,10.0,10.0,100.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,,0.0086,1.0982,,0.2343,-0.2766,-1.4212,-0.3873,0.23,0.1301,-0.0107,1.2851,,,,,-0.035,1.3112,0.8462,0.8175,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
577072,0,84000158,84006571,1.0,3.0,,0,0,1,0,0,1.0,2.0,,,,,,,,9.0,10.0,9.0,,-0.1963,1.0,0.0,0.0,8.0,7.0,6.0,8.0,1.0,2.0,4.0,,0.0,1,1,16.0,39.02,,,,4.0,1.0,3.0,1.0,7.0,4.0,,8.0,5.0,4.0,3.0,4.0,1.0,2.0,,2.0,2.0,4.0,3.0,1.0,1.0,0.0,1.0,1.0,0.0,,,,,,1.0,5.0,6.0,,,,,,,,,,,,,,,,,,,,,,7.0,,1.1452,-0.0092,0.9815,-0.756,2.6536,,,,,,,,0.6466,0.3281,-0.8258,0.4277,0.8211,0.1438,0.429,-0.2065,0.0996,-0.6534,0.9135,0.8158,-0.7447,-0.1438,,,,,,,,,-0.2055,0.3135,1.2651,0.4585,,-0.9969,1.1454,-0.0542,0.9734,0.0203,0.7058,0.5403,0.81,0.8213,0.781,0.3395,-0.2178,0.4062,0.3346,-0.348,0.4662,0.4707,0.5964,-0.4631,1.7979,1.397,0.0782,0.281,0.2629,,,,,,,,,,,,,,,,,,3.0,8.0,3.0,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576813,0,84000153,84006127,1.0,4.0,2.0,0,1,0,0,0,1.0,3.0,,,,,,,1.0,7.0,,8.0,,0.5803,1.0,0.0,0.0,8.0,7.0,6.0,12.0,1.0,0.0,4.0,,0.0,1,1,16.0,56.0,,,,4.0,0.0,3.0,3.0,4.0,1.0,,6.0,4.0,5.0,5.0,3.0,2.0,1.0,,1.0,1.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,5.0,6.0,,,,,,,,,,,,,,,,,,,,,,2.0,,-0.5742,-0.6456,0.9967,-0.756,0.181,,,,,,,,-1.0347,0.0189,-0.54,0.6387,-0.5635,-0.0731,-1.4597,0.4731,0.0897,-1.5902,-0.5393,0.1193,2.6109,0.2635,,,,,,,,,-0.0213,-0.0057,0.5219,-0.0733,-0.4078,0.0322,1.0891,0.5953,0.4019,-1.4392,-1.8093,-0.0406,-0.5238,-0.3051,0.2438,-0.7975,0.0098,0.4062,0.3346,0.3623,-0.2033,0.8478,0.0587,-0.6154,1.3149,1.0051,0.8206,-0.2455,-0.6665,,,,,,,,,,,,,,,,,,4.0,8.0,3.0,1.0,3.0,5.0,0.0,2.0,0.0,,,,1.0,2.0,2.0,2.0,80.0,0,0,1,0,0,1,0,0,1,4.0,4.0,3.0,4.0,3.0,2.0,1.0,4.0,3.0,2.0,1.0,1.0,1.0,1.0,87.0,13.0,2.0,2.0,2.0,1.0,88.8125,23.0,3.0,0,0,0,0,2.0,10.0,10.0,23.0,23.0,0.0,9.0,3.0,,,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,-0.3955,-0.3854,,0.2079,-0.2982,-0.2376,0.8767,0.9355,0.2793,-0.0107,1.512,,,,,-0.2488,0.527,-1.5036,-0.8263,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
577850,0,84000045,84007989,,1.0,1.0,0,0,0,0,0,,,,,,,,,2.0,8.0,9.0,,,-0.0002,2.0,0.0,0.0,,,,,1.0,0.0,,,,1,1,12.0,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.3378,,,,,,,,,,,,,,,,-0.039,-1.0048,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,5.0,46.0,80.0,0.0,10.0,0.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,45.0,0,0,1,0,0,1,0,0,1,3.0,4.0,1.0,3.0,3.0,1.0,1.0,3.0,4.0,3.0,1.0,2.0,1.0,2.0,90.0,10.0,1.0,1.0,2.0,2.0,92.5,13.0,3.0,0,0,0,1,2.0,40.0,40.0,40.0,40.0,0.0,5.0,10.0,0.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,-1.0859,1.0982,,-0.0731,-0.6909,0.1,-0.2527,-2.0409,-0.3082,0.4865,-0.8237,,,,,,,,0.6818,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573800,1,84000127,84000730,1.0,4.0,1.0,0,1,0,0,0,1.0,2.0,,,,,,,,9.0,10.0,8.0,,0.6254,1.0,0.0,0.0,8.0,7.0,6.0,11.0,1.0,0.0,3.0,,0.0,1,1,16.0,82.41,82.41,,,1.0,1.0,2.0,1.0,7.0,5.0,,7.0,4.0,5.0,,,4.0,3.0,,1.0,2.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,2.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,-0.9717,-1.0406,1.1008,0.4417,1.2046,,,,,,,,0.3305,0.6196,0.8067,-0.8697,-0.8059,0.7165,-1.7079,-0.0582,-0.1406,0.2544,0.2056,1.106,1.8003,-0.0266,,,,,,,,,-0.0622,0.0074,0.7422,0.8228,,-0.8522,0.4668,1.113,-0.5113,0.5683,1.712,0.2079,-0.6026,0.906,0.2596,0.0167,0.6198,0.4062,0.3346,-0.1464,0.3801,0.6102,0.6498,0.7372,0.2265,0.5245,0.5762,1.5174,0.5137,,,,,,,,,,,,,,,,,,0.0,0.0,2.0,18.0,14.0,77.0,6.0,5.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,65.0,1,0,0,0,0,1,1,0,0,4.0,4.0,2.0,3.0,3.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,45.0,55.0,1.0,2.0,2.0,1.0,,28.0,2.0,0,0,0,1,3.0,28.0,37.0,35.0,36.0,12.0,6.0,17.0,60.0,15.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,-1.3317,0.2231,,-1.6916,0.7007,-1.4212,0.6585,0.5986,0.8999,-0.3751,-0.0199,,,,,1.0327,-0.8314,0.8462,-2.0883,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Now we'll copy the training and validation files to S3 for Amazon SageMaker's managed training to pick up.

In [17]:
# cell 14
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

---

## Training - XGBoost
***Change this to automatic model tuning!***

At a high level, gradient boosted trees work by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models.  By doing this, the collection of simple models can actually outperform large, complex models.  Other Amazon SageMaker notebooks elaborate further on gradient boosted trees and how they differ from similar algorithms.
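The idea of each simple model correcting its predecessors can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the algorithm XGBoost actually runs: it uses one-split "stumps", a squared-error loss on a single made-up feature, and a learning rate `ETA`; the helper names (`fit_stump`, `boost`, `mse`) and the data are hypothetical.

```python
# Toy gradient boosting: each stump is fit to the residuals left by the
# ensemble so far. Real XGBoost uses deeper trees, second-order gradients,
# and regularization; this only shows the "fix the previous errors" loop.

def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, num_round, eta):
    """Return (base prediction, list of stumps) after num_round rounds."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + eta * predict_stump(stump, x) for x, p in zip(xs, preds)]
    return base, stumps


def mse(model, xs, ys, eta):
    """Mean squared error of the boosted model on (xs, ys)."""
    base, stumps = model
    total = 0.0
    for x, y in zip(xs, ys):
        pred = base + eta * sum(predict_stump(s, x) for s in stumps)
        total += (pred - y) ** 2
    return total / len(xs)


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 6.2, 5.9]
ETA = 0.2

mse_1 = mse(boost(xs, ys, 1, ETA), xs, ys, ETA)
mse_10 = mse(boost(xs, ys, 10, ETA), xs, ys, ETA)
print(mse_1, mse_10)  # training error shrinks as rounds are added
```

Each round lowers the training error a little; the learning rate controls how much any single stump is trusted, which is the role `eta` plays in the hyperparameters used later in this notebook.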

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

First we'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost.

In [15]:
# cell 15
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

Then, because we're training with the CSV file format, we'll create `TrainingInput` objects that our training function can use as pointers to the files in S3, and which also specify that the content type is CSV.

In [16]:
# cell 16
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')
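For CSV input, the built-in XGBoost algorithm expects the label in the first column and no header row (the training log in this notebook reports `label_column=0`). The sketch below shows one way to produce that layout with the standard library. The helper `to_xgboost_csv` and the three-column toy rows are hypothetical and are not taken from the notebook's earlier cells.

```python
import csv
import io


def to_xgboost_csv(rows, header, label_name):
    """Return CSV text with the label moved to column 0 and no header row."""
    label_idx = header.index(label_name)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        features = row[:label_idx] + row[label_idx + 1:]
        writer.writerow([row[label_idx]] + features)
    return buffer.getvalue()


# Toy subset of columns; the values are illustrative only.
header = ["HOMEPOS", "MATH_Proficient", "GRADE"]
rows = [[-1.3368, 0, 0.0], [1.8748, 1, 1.0]]

csv_text = to_xgboost_csv(rows, header, "MATH_Proficient")
print(csv_text)
```

With pandas, the equivalent is typically to reorder the columns so the label comes first and write with `header=False, index=False`.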

Next, we'll specify the training parameters for the estimator.  These include:
1. The `xgboost` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

And then a `.fit()` call, which specifies:
1. S3 locations of the input data.  In this case, both a training set and a validation set are passed in.
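One of the algorithm hyperparameters set in the next cell, `early_stopping_rounds`, relies on the validation set passed to `.fit()`. The sketch below illustrates the general patience rule under simplifying assumptions: a lower metric is better, and the list `errors` is made up. XGBoost's own bookkeeping may differ in detail, and the function name is hypothetical.

```python
def rounds_run(validation_errors, early_stopping_rounds):
    """Return how many boosting rounds run before early stopping triggers.

    validation_errors[i] stands in for the validation metric after round i+1.
    Training stops once the metric has failed to improve for
    `early_stopping_rounds` consecutive rounds.
    """
    best_error = float("inf")
    rounds_since_best = 0
    for round_index, error in enumerate(validation_errors, start=1):
        if error < best_error:
            best_error = error
            rounds_since_best = 0
        else:
            rounds_since_best += 1
        if rounds_since_best >= early_stopping_rounds:
            return round_index
    return len(validation_errors)


# Hypothetical validation errors for 10 rounds: best value at round 3.
errors = [0.40, 0.35, 0.33, 0.34, 0.34, 0.35, 0.36, 0.36, 0.37, 0.38]

stopped_patience_3 = rounds_run(errors, 3)
stopped_patience_10 = rounds_run(errors, 10)
print(stopped_patience_3, stopped_patience_10)
```

With a patience of 3 the loop stops at round 6, three rounds after the best value. With a patience of 10 and only 10 rounds in total it never stops early, because the first round always counts as an improvement. The same reasoning applies to `num_round=10` combined with `early_stopping_rounds=10`: early stopping only has an effect if `num_round` is raised or the patience is lowered.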

In [17]:
# cell 17
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=10,  # Number of boosting rounds; adjust as needed
                        seed=42,       # Fixed seed for reproducibility
                        seed_per_iteration=42,  # Documented by XGBoost as a boolean flag that seeds the PRNG from the iteration number
                        early_stopping_rounds=10  # Cannot trigger while num_round is also 10
                       )

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

INFO:sagemaker:Creating training-job with name: xgboost-2025-02-20-17-54-18-071


2025-02-20 17:54:19 Starting - Starting the training job...
2025-02-20 17:54:52 Downloading - Downloading input data...
2025-02-20 17:55:23 Downloading - Downloading the training image......
2025-02-20 17:56:24 Training - Training image download completed. Training in progress...
Arguments: train
[2025-02-20:17:56:35:INFO] Running standalone xgboost training.
[2025-02-20:17:56:35:INFO] File size need to be processed in the node: 6.04mb. Available memory size in the node: 8565.71mb
[2025-02-20:17:56:35:INFO] Determined delimiter of CSV input is ','
[17:56:35] S3DistributionType set as FullyReplicated
[17:56:35] 3186x568 matrix with 1809648 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2025-02-20:17:56:35:INFO] Determined delimiter of CSV input is ','
[17:56:35] S3DistributionType set as FullyReplicated
[17:56:35] 683x568 matrix with 387944 entries loaded from /opt/ml/input/d

## Explain the model using Clarify

In [21]:
from datetime import datetime

session = sagemaker.Session()
model_name = "Clarify-{}-{}".format(country_name_edited, datetime.now().strftime("%d-%m-%Y-%H-%M-%S"))
model = xgb.create_model(name=model_name)
container_def = model.prepare_container_def()
session.create_model(model_name, role, container_def)

INFO:sagemaker:Creating model with name: Clarify-United-States-20-02-2025-16-14-35


'Clarify-United-States-20-02-2025-16-14-35'

In [22]:
train_data.head(5)

Unnamed: 0,MATH_Proficient,CNTSCHID,CNTSTUID,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,EFFORT1,EFFORT2,ST259Q01JA,WB164Q01HA,HOMEPOS,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,Option_UH,OECD,PAREDINT,BMMJ1,BFMJ2,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,RELATST,BELONG,BULLIED,FEELSAFE,SCHRISK,PERSEVAGR,CURIOAGR,COOPAGR,EMPATAGR,ASSERAGR,STRESAGR,EMOCOAGR,GROSAGR,INFOSEEK,FAMSUP,DISCLIM,TEACHSUP,COGACRCO,COGACMCO,EXPOFA,EXPO21ST,MATHEFF,MATHEF21,FAMCON,ANXMAT,MATHPERS,CREATEFF,CREATSCH,CREATFAM,CREATAS,CREATOOS,CREATOP,OPENART,IMAGINE,SCHSUST,LEARRES,PROBSELF,FAMSUPSL,FEELLAH,SDLEFF,ICTRES,ESCS,FLSCHOOL,FLMULTSB,FLFAMILY,ACCESSFP,FLCONFIN,FLCONICT,ACCESSFA,ATTCONFM,FRINFLFM,ICTSCH,ICTHOME,ICTQUAL,ICTSUBJ,ICTENQ,ICTFEED,ICTOUT,ICTWKDY,ICTWKEND,ICTREG,ICTINFO,ICTEFFIC,BODYIMA,SOCONPA,LIFESAT,PSYCHSYM,SOCCON,EXPWB,CURSUPP,PQMIMP,PQMCAR,PARINVOL,PQSCHOOL,PASCHPOL,ATTIMMP,CREATHME,CREATACT,CREATOPN,CREATOR,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC209Q04JA,SC209Q05JA,SC209Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03
JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,...,DIGDVPOL,TEAFDBK,MTTRAIN,DMCVIEWS,NEGSCLIM,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,STDTEST,TDTEST,ALLACTIV,BCREATSC,CREENVSC,ACTCRESC,OPENCUL,PROBSCRI,SCPREPBP,SCPREPAP,DIGPREP,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673
,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
575474,0,84000082,84003756,0.0,4.0,1.0,0,1,0,0,0,1.0,1.0,,,,,,,,1.0,1.0,5.0,,-1.1481,1.0,0.0,0.0,,7.0,6.0,,3.0,0.0,3.0,,0.0,1,1,14.5,24.53,80.78,,,4.0,0.0,3.0,1.0,8.0,1.0,,0.0,5.0,5.0,5.0,5.0,3.0,1.0,,1.0,2.0,1.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,-0.6183,-0.7308,0.5154,-0.756,-0.6386,,,,,,,,-1.3676,-0.9894,-1.5073,1.0855,-0.3322,-2.3562,-1.9366,2.3809,-2.2407,-3.3656,-2.0022,-3.6782,2.5429,-2.6752,,,,,,,,,-1.7994,-1.676,0.4139,-1.6587,-2.6173,-2.1826,-0.2097,0.0772,-1.5638,-1.6481,-0.4553,-1.3771,-2.184,-2.1589,-0.2971,-1.5754,-0.5383,0.4062,0.3346,0.5981,0.6764,-1.1226,-0.9048,0.7546,3.9236,1.4846,-0.8917,-0.027,-0.3083,,,,,,,,,,,,,,,,,,1.0,1.0,4.0,49.0,12.0,53.0,19.0,37.0,2.0,,,,,,,,,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,100.0,,,0,0,0,0,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
574994,1,84000118,84002904,1.0,,2.0,0,0,0,0,1,2.0,3.0,,,,,,,,7.0,9.0,3.0,,-0.2785,1.0,1.0,0.0,7.0,7.0,6.0,9.0,1.0,2.0,2.0,,0.0,1,1,12.0,30.9,18.13,,,3.0,1.0,3.0,3.0,,1.0,,1.0,4.0,4.0,4.0,4.0,,,,1.0,3.0,4.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,4.0,2.0,,,,,,,,,,,,,,,,,,,,,,1.0,,-0.5702,-2.1726,0.0102,-1.3019,0.9993,,,,,,,,0.7577,1.209,-0.5722,,,,,1.2671,0.7737,-0.5362,0.1895,1.3158,0.3025,-0.0523,,,,,,,,,0.3775,0.1809,0.7439,-1.0628,-0.0534,-0.3865,-0.3003,-0.9194,-0.5552,1.5186,-1.3402,-0.556,-0.9413,0.0905,0.1818,-0.1522,0.4889,0.4062,0.3346,0.1203,,-0.0668,0.2913,-0.2136,0.5391,0.8846,-0.6861,-0.4148,-0.82,,,,,,,,,,,,,,,,,,0.0,0.0,3.0,13.0,13.0,41.0,12.0,30.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,90.0,1,0,0,1,0,0,1,0,0,4.0,4.0,2.0,4.0,2.0,2.0,1.0,2.0,3.0,3.0,1.0,2.0,1.0,1.0,65.0,35.0,2.0,1.0,1.0,1.0,100.0,33.0,2.0,0,0,0,0,1.0,10.0,30.0,20.0,20.0,5.0,5.0,4.0,230.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,,0.656,1.0982,,-0.0945,1.1703,-1.4212,1.2222,0.933,0.3525,0.1781,0.9381,,,,,-0.2829,0.7922,0.8462,-0.0391,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
574164,0,84000031,84001409,1.0,3.0,1.0,0,0,0,0,1,1.0,1.0,,,,,,,2.0,7.0,8.0,6.0,,-0.8731,1.0,0.0,0.0,9.0,7.0,6.0,8.0,1.0,1.0,3.0,,0.0,1,1,16.0,57.0,17.0,,,4.0,0.0,3.0,3.0,1.0,4.0,,4.0,3.0,3.0,4.0,4.0,1.0,1.0,,1.0,1.0,3.0,2.0,1.0,0.0,0.0,1.0,1.0,0.0,,,,,,1.0,2.0,6.0,,,,,,,,,,,,,,,,,,,,,,9.0,,0.6031,-0.4514,-1.228,-0.756,-0.6386,,,,,,,,0.5481,2.4397,-0.2439,-0.0504,0.4357,0.7143,0.241,-0.1372,0.3595,-0.9866,0.7063,-0.2182,1.0161,0.0248,,,,,,,,,-0.9729,-1.0323,0.7005,0.3146,0.2668,-0.2182,-0.6172,-0.0014,0.9734,0.0506,-0.1845,0.6217,0.4118,0.4078,0.049,0.2737,0.9251,0.4062,0.3346,0.3623,0.9928,0.0084,-0.0179,-0.3155,0.8678,0.5069,0.5201,-0.111,-0.7498,,,,,,,,,,,,,,,,,,0.0,6.0,4.0,9.0,16.0,83.0,4.0,9.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,50.0,0,0,1,1,0,0,0,0,1,3.0,3.0,2.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,1.0,50.0,50.0,1.0,1.0,2.0,1.0,100.0,33.0,5.0,0,0,0,1,2.0,35.0,50.0,50.0,50.0,1.0,15.0,1.0,75.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,-1.2499,1.0982,,0.5063,-0.2766,-1.4212,0.9694,0.2266,0.947,0.5134,2.1298,,,,,0.7167,0.7823,0.8462,2.0888,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576638,0,84000004,84005803,0.0,4.0,1.0,0,1,0,0,0,1.0,,,,,,,,2.0,7.0,10.0,7.0,,-0.9628,1.0,0.0,0.0,8.0,6.0,5.0,5.0,1.0,1.0,3.0,,0.0,1,1,16.0,62.13,26.8,,,2.0,0.0,2.0,2.0,20.0,3.0,,6.0,,,,,3.0,1.0,,2.0,2.0,4.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,6.0,,-0.3931,-0.9068,-1.228,-0.756,-0.6386,,,,,,,,0.3305,0.5613,-0.7277,1.7914,-0.1002,-2.2388,-0.2284,-0.7486,-0.8234,-0.1005,-0.2102,-0.4084,0.6387,0.1803,,,,,,,,,1.9506,0.4948,2.8316,0.7804,2.098,,-0.6,0.0574,0.1025,,,,,,,,,-0.0831,0.263,-0.9557,0.445,0.2475,0.6371,2.7353,0.4019,0.0028,0.5645,0.1649,2.2293,,,,,,,,,,,,,,,,,,0.0,10.0,3.0,65.0,6.0,1.0,65.0,80.0,0.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,90.0,0,0,1,0,0,1,1,0,0,4.0,4.0,2.0,1.0,4.0,4.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,60.0,40.0,2.0,2.0,2.0,1.0,100.0,23.0,2.0,0,0,0,0,3.0,40.0,80.0,40.0,90.0,0.0,20.0,0.0,100.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,,0.095,0.1983,,-0.3827,-0.5393,-1.4212,0.014,0.2397,1.9769,0.4252,1.0179,,,,,-0.0105,1.5573,0.1102,0.925,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
577351,1,84000075,84007073,1.0,1.0,1.0,0,0,0,0,0,,,,,,,,,,9.0,9.0,6.0,,-0.1205,1.0,0.0,0.0,7.0,7.0,6.0,2.0,1.0,1.0,3.0,,0.0,1,1,16.0,29.14,78.69,,,4.0,0.0,3.0,1.0,20.0,4.0,,4.0,5.0,3.0,1.0,,5.0,2.0,,4.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,4.0,5.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.1047,0.5159,-1.228,0.1413,-0.6386,,,,,,,,0.7577,0.6888,-0.6354,0.6461,1.5558,1.0162,0.4067,-0.2172,0.5446,0.7226,-0.0224,1.9154,-2.249,0.3149,,,,,,,,,,,,,,,0.4484,0.7268,-0.6368,-1.1953,0.6086,0.1251,-0.0716,0.1553,0.0071,0.6231,-0.7656,0.4062,0.3346,1.6428,0.6342,0.1404,-0.2549,1.4466,-0.3215,0.3133,0.5313,-0.3784,0.0093,,,,,,,,,,,,,,,,,,0.0,0.0,3.0,20.0,2.0,22.0,3.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,90.0,0,0,1,0,0,1,1,0,0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,95.0,5.0,1.0,1.0,1.0,1.0,100.0,18.0,3.0,0,0,0,1,3.0,56.0,79.0,100.0,99.0,23.0,10.0,0.0,80.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,1.1488,1.0982,,-1.1403,-2.1066,-1.4212,-2.5157,-1.2943,0.8532,1.8799,0.1364,,,,,-3.0365,-0.8314,0.8462,3.7675,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [23]:
validation_data.head(5)

Unnamed: 0,MATH_Proficient,CNTSCHID,CNTSTUID,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,EFFORT1,EFFORT2,ST259Q01JA,WB164Q01HA,HOMEPOS,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,Option_UH,OECD,PAREDINT,BMMJ1,BFMJ2,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,RELATST,BELONG,BULLIED,FEELSAFE,SCHRISK,PERSEVAGR,CURIOAGR,COOPAGR,EMPATAGR,ASSERAGR,STRESAGR,EMOCOAGR,GROSAGR,INFOSEEK,FAMSUP,DISCLIM,TEACHSUP,COGACRCO,COGACMCO,EXPOFA,EXPO21ST,MATHEFF,MATHEF21,FAMCON,ANXMAT,MATHPERS,CREATEFF,CREATSCH,CREATFAM,CREATAS,CREATOOS,CREATOP,OPENART,IMAGINE,SCHSUST,LEARRES,PROBSELF,FAMSUPSL,FEELLAH,SDLEFF,ICTRES,ESCS,FLSCHOOL,FLMULTSB,FLFAMILY,ACCESSFP,FLCONFIN,FLCONICT,ACCESSFA,ATTCONFM,FRINFLFM,ICTSCH,ICTHOME,ICTQUAL,ICTSUBJ,ICTENQ,ICTFEED,ICTOUT,ICTWKDY,ICTWKEND,ICTREG,ICTINFO,ICTEFFIC,BODYIMA,SOCONPA,LIFESAT,PSYCHSYM,SOCCON,EXPWB,CURSUPP,PQMIMP,PQMCAR,PARINVOL,PQSCHOOL,PASCHPOL,ATTIMMP,CREATHME,CREATACT,CREATOPN,CREATOR,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC209Q04JA,SC209Q05JA,SC209Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03
JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,...,DIGDVPOL,TEAFDBK,MTTRAIN,DMCVIEWS,NEGSCLIM,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,STDTEST,TDTEST,ALLACTIV,BCREATSC,CREENVSC,ACTCRESC,OPENCUL,PROBSCRI,SCPREPBP,SCPREPAP,DIGPREP,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673
,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
574478,1,84000008,84001986,1.0,6.0,1.0,0,0,0,0,1,1.0,3.0,,,,,,,,10.0,10.0,7.0,,0.8067,1.0,0.0,0.0,6.0,7.0,6.0,4.0,1.0,0.0,3.0,,0.0,1,1,12.0,26.64,25.95,,,4.0,0.0,1.0,3.0,3.0,3.0,,9.0,3.0,4.0,3.0,,3.0,1.0,,4.0,4.0,4.0,,0.0,1.0,1.0,0.0,0.0,0.0,,,,,,5.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,3.0,,0.1045,0.0311,-1.228,-0.756,1.1023,,,,,,,,0.3305,-0.4405,-0.0655,0.1997,1.5558,2.1044,1.5715,-0.4304,0.3636,0.5076,1.5288,0.3847,2.5026,2.8491,,,,,,,,,1.023,0.7935,0.479,0.5361,1.2196,0.0219,1.0229,-0.5402,-0.5552,0.0324,1.8656,0.5406,0.2255,1.2021,1.5151,0.5408,-0.9579,0.4062,0.3346,-0.1177,1.8109,2.9787,1.6331,2.1015,0.2928,0.2424,0.0831,2.6014,0.0986,,,,,,,,,,,,,,,,,,0.0,10.0,4.0,40.0,30.0,79.0,11.0,18.0,4.0,,,,,,,,,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573678,1,84000068,84000520,,,,0,0,0,0,0,,,,,,,,,,8.0,10.0,9.0,,-0.0418,2.0,0.0,0.0,,,,,1.0,1.0,3.0,,0.0,1,1,16.0,38.88,30.78,,,3.0,1.0,,,7.0,1.0,,0.0,,,,,7.0,1.0,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.2316,-0.1965,-1.228,-0.756,0.4456,,,,,,,,-0.0994,,,0.1166,-0.1002,,,,,,,,,,,,,,,,,,,,,,,,-0.7123,0.0085,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0,0.0,,1.0,13.0,22.0,0.0,,0.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,50.0,0,0,1,0,0,1,0,1,0,4.0,3.0,2.0,4.0,2.0,2.0,2.0,4.0,4.0,4.0,2.0,1.0,1.0,2.0,75.0,25.0,1.0,2.0,2.0,1.0,,23.0,4.0,0,0,0,1,2.0,15.0,32.0,50.0,40.0,50.0,5.0,30.0,45.0,18.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,0.1848,0.1756,,0.5245,0.961,-0.4413,1.35,0.9355,0.947,0.6721,1.1965,,,,,0.2762,-0.8314,0.1102,0.1131,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576927,0,84000049,84006320,1.0,,4.0,0,0,0,0,1,1.0,2.0,,,,,,,2.0,7.0,8.0,5.0,,-1.3368,1.0,0.0,0.0,9.0,4.0,,,1.0,1.0,4.0,,1.0,1,1,16.0,,,,,4.0,0.0,,,4.0,1.0,,4.0,,,,,1.0,1.0,,1.0,1.0,3.0,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,1.0,4.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,-0.8798,-0.5964,-1.228,-0.756,0.4139,,,,,,,,-0.2794,-0.5128,1.6461,-0.6761,-1.0693,0.437,-0.5169,0.5942,2.1164,-3.4378,-2.3493,2.0407,2.5078,-0.1177,,,,,,,,,-2.508,0.3302,0.5352,-0.5082,-0.2159,-2.5236,-1.2528,-0.6956,,,1.0098,,0.1788,0.6273,1.0014,,0.9251,0.1606,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,4.0,5.0,2.0,83.0,10.0,7.0,,,,,,,,,80.0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,1.0,,,,,23.0,1.0,0,0,0,0,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,1.5605,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576321,1,84000138,84005234,0.0,4.0,1.0,0,1,0,0,0,1.0,3.0,,,,,,,1.0,8.0,,8.0,,1.8748,1.0,1.0,0.0,7.0,7.0,6.0,4.0,2.0,0.0,2.0,,0.0,1,1,16.0,63.03,65.01,,,3.0,1.0,2.0,2.0,20.0,6.0,,5.0,3.0,3.0,,,5.0,1.0,,1.0,1.0,4.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,-0.098,-0.3485,-1.228,-0.756,-0.6386,,,,,,,,0.7577,0.1352,0.8359,-0.2863,0.4357,-0.2533,-0.7195,0.0117,0.0065,-1.2399,-0.4296,0.6755,0.429,-0.4855,,,,,,,,,-0.0508,0.3561,-0.2545,-0.3973,-0.4074,-0.9785,2.8019,1.3137,-1.5638,0.3684,-0.4993,-0.2928,-0.2246,0.924,0.3975,-0.407,-0.8774,0.4062,0.3346,0.13,0.4646,0.0962,-0.8331,-0.4407,0.3179,0.668,0.0574,0.0811,0.0044,,,,,,,,,,,,,,,,,,0.0,5.0,2.0,44.0,12.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,40.0,0,0,1,0,0,1,0,0,1,2.0,2.0,2.0,4.0,1.0,1.0,1.0,4.0,3.0,3.0,1.0,2.0,2.0,2.0,100.0,0.0,2.0,2.0,2.0,1.0,65.2727,13.0,3.0,0,0,0,0,2.0,90.0,100.0,90.0,100.0,50.0,50.0,60.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,...,,,-0.0381,,0.0359,0.0572,-0.6344,-0.2527,0.2397,-1.7236,-0.4116,2.1298,,,,,,,,1.1612,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576339,1,84000124,84005268,1.0,3.0,1.0,0,0,0,0,1,2.0,3.0,,,,,,,3.0,10.0,7.0,4.0,,-1.3411,2.0,-1.0,0.0,4.0,,,,,0.0,1.0,,0.0,1,1,12.0,43.85,24.53,,,4.0,0.0,,,7.0,5.0,,2.0,,,,,7.0,1.0,,4.0,4.0,4.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,1.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.2898,-1.0249,-0.267,-0.756,-0.6386,,,,,,,,0.6466,-0.7264,-1.4343,-0.3272,-0.1002,-0.8184,-0.5026,-2.772,-0.4285,-0.7925,-0.6121,-0.639,-0.3794,0.082,,,,,,,,,-0.2235,0.3296,-0.5591,-0.4879,0.6688,-0.3988,-0.8601,-1.1249,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,4.0,2.0,0.0,18.0,30.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,55.0,0,0,1,0,0,1,0,0,1,2.0,4.0,2.0,3.0,4.0,4.0,2.0,3.0,2.0,3.0,2.0,1.0,1.0,1.0,80.0,20.0,1.0,1.0,1.0,1.0,100.0,23.0,4.0,0,0,0,1,2.0,50.0,100.0,50.0,100.0,25.0,25.0,75.0,20.0,5.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,,1.0689,-0.0313,,-0.1428,3.7658,-0.3123,-0.7973,0.5395,1.9769,1.2251,1.0123,,,,,-0.2488,0.5151,0.8462,0.1902,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [24]:
test_data.head(5)

Unnamed: 0,MATH_Proficient,CNTSCHID,CNTSTUID,SISCO,ST347Q01JA,ST347Q02JA,ST349Q01JA_0,ST349Q01JA_1,ST349Q01JA_2,ST349Q01JA_3,ST349Q01JA_4,ST350Q01JA,ST356Q01JA,ST322Q01JA,ST322Q02JA,ST322Q03JA,ST322Q04JA,ST322Q06JA,ST322Q07JA,DURECEC,EFFORT1,EFFORT2,ST259Q01JA,WB164Q01HA,HOMEPOS,ST004D01T,GRADE,REPEAT,EXPECEDU,ICTAVSCH,ICTAVHOM,ICTDISTR,IMMIG,TARDYSD,ST226Q01JA,ST016Q01NA,MISSSC,Option_UH,OECD,PAREDINT,BMMJ1,BFMJ2,WB163Q06HA,WB163Q07HA,ST230Q01JA,SKIPPING,IC180Q01JA,IC180Q08JA,ST059Q02JA,ST296Q04JA,WB176Q01HA,STUDYHMW,IC184Q01JA,IC184Q02JA,IC184Q03JA,IC184Q04JA,ST059Q01TA,ST296Q01JA,ST272Q01JA,ST268Q01JA,ST268Q04JA,ST268Q07JA,ST293Q04JA,ST297Q01JA,ST297Q03JA,ST297Q05JA,ST297Q06JA,ST297Q07JA,ST297Q09JA,WB165Q01HA,WB166Q01HA,WB166Q02HA,WB166Q03HA,WB166Q04HA,ST258Q01JA,ST294Q01JA,ST295Q01JA,WB150Q01HA,WB156Q01HA,WB158Q01HA,WB160Q01HA,WB161Q01HA,WB171Q01HA,WB171Q02HA,WB171Q03HA,WB171Q04HA,WB172Q01HA,WB173Q01HA,WB173Q02HA,WB173Q03HA,WB173Q04HA,WB177Q01HA,WB177Q02HA,WB177Q03HA,WB177Q04HA,WB032Q01NA,WB032Q02NA,WB031Q01NA,EXERPRAC,STUBMI,RELATST,BELONG,BULLIED,FEELSAFE,SCHRISK,PERSEVAGR,CURIOAGR,COOPAGR,EMPATAGR,ASSERAGR,STRESAGR,EMOCOAGR,GROSAGR,INFOSEEK,FAMSUP,DISCLIM,TEACHSUP,COGACRCO,COGACMCO,EXPOFA,EXPO21ST,MATHEFF,MATHEF21,FAMCON,ANXMAT,MATHPERS,CREATEFF,CREATSCH,CREATFAM,CREATAS,CREATOOS,CREATOP,OPENART,IMAGINE,SCHSUST,LEARRES,PROBSELF,FAMSUPSL,FEELLAH,SDLEFF,ICTRES,ESCS,FLSCHOOL,FLMULTSB,FLFAMILY,ACCESSFP,FLCONFIN,FLCONICT,ACCESSFA,ATTCONFM,FRINFLFM,ICTSCH,ICTHOME,ICTQUAL,ICTSUBJ,ICTENQ,ICTFEED,ICTOUT,ICTWKDY,ICTWKEND,ICTREG,ICTINFO,ICTEFFIC,BODYIMA,SOCONPA,LIFESAT,PSYCHSYM,SOCCON,EXPWB,CURSUPP,PQMIMP,PQMCAR,PARINVOL,PQSCHOOL,PASCHPOL,ATTIMMP,CREATHME,CREATACT,CREATOPN,CREATOR,WORKPAY,WORKHOME,SC001Q01TA,SC211Q01JA,SC211Q02JA,SC211Q03JA,SC211Q04JA,SC211Q05JA,SC211Q06JA,SC209Q04JA,SC209Q05JA,SC209Q06JA,SC037Q11JA,SC183Q02JA,SC183Q03JA,SC183Q04JA,SC175Q01JA,SC177Q01JA_1,SC177Q01JA_2,SC177Q01JA_3,SC177Q02JA_1,SC177Q02JA_2,SC177Q02JA_3,SC177Q03JA_1,SC177Q03JA_2,SC177Q03
JA_3,SC188Q01JA,SC188Q02JA,SC188Q03JA,SC188Q04JA,SC188Q05JA,SC188Q06JA,SC188Q07JA,SC188Q08JA,SC188Q09JA,SC188Q10JA,SC188Q11JA,SC198Q01JA,SC198Q02JA,SC198Q03JA,SC178Q01JA,SC178Q02JA,SC180Q01JA,SC189Q02WA,SC189Q03WA,SC189Q04WA,SMRATIO,MCLSIZE,MACTIV,MATHEXC_0,MATHEXC_1,MATHEXC_2,MATHEXC_3,ABGMATH,SC064Q05WA,SC064Q06WA,SC064Q01TA,SC064Q02TA,SC064Q04NA,SC064Q03TA,SC064Q07WA,SC213Q01JA,SC213Q02JA,SC037Q01TA,SC037Q02TA,SC037Q03TA,SC037Q04TA,SC037Q05NA,SC037Q06NA,SC037Q07TA,...,DIGDVPOL,TEAFDBK,MTTRAIN,DMCVIEWS,NEGSCLIM,STAFFSHORT,EDUSHORT,STUBEHA,TEACHBEHA,STDTEST,TDTEST,ALLACTIV,BCREATSC,CREENVSC,ACTCRESC,OPENCUL,PROBSCRI,SCPREPBP,SCPREPAP,DIGPREP,LANGN_105,LANGN_108,LANGN_112,LANGN_113,LANGN_118,LANGN_121,LANGN_130,LANGN_133,LANGN_137,LANGN_140,LANGN_147,LANGN_148,LANGN_150,LANGN_154,LANGN_156,LANGN_160,LANGN_170,LANGN_195,LANGN_200,LANGN_202,LANGN_204,LANGN_232,LANGN_237,LANGN_244,LANGN_246,LANGN_254,LANGN_258,LANGN_263,LANGN_264,LANGN_266,LANGN_272,LANGN_273,LANGN_275,LANGN_286,LANGN_301,LANGN_313,LANGN_316,LANGN_317,LANGN_322,LANGN_325,LANGN_327,LANGN_329,LANGN_338,LANGN_340,LANGN_344,LANGN_351,LANGN_358,LANGN_363,LANGN_369,LANGN_371,LANGN_375,LANGN_379,LANGN_381,LANGN_382,LANGN_383,LANGN_404,LANGN_409,LANGN_415,LANGN_420,LANGN_422,LANGN_428,LANGN_434,LANGN_442,LANGN_449,LANGN_451,LANGN_463,LANGN_465,LANGN_467,LANGN_471,LANGN_472,LANGN_474,LANGN_492,LANGN_493,LANGN_494,LANGN_495,LANGN_496,LANGN_500,LANGN_503,LANGN_514,LANGN_517,LANGN_520,LANGN_523,LANGN_527,LANGN_529,LANGN_531,LANGN_540,LANGN_547,LANGN_555,LANGN_561,LANGN_562,LANGN_563,LANGN_565,LANGN_566,LANGN_567,LANGN_600,LANGN_601,LANGN_602,LANGN_605,LANGN_606,LANGN_607,LANGN_608,LANGN_611,LANGN_614,LANGN_615,LANGN_616,LANGN_618,LANGN_619,LANGN_621,LANGN_622,LANGN_623,LANGN_624,LANGN_625,LANGN_626,LANGN_627,LANGN_628,LANGN_630,LANGN_631,LANGN_634,LANGN_635,LANGN_639,LANGN_640,LANGN_641,LANGN_642,LANGN_648,LANGN_650,LANGN_661,LANGN_662,LANGN_663,LANGN_665,LANGN_666,LANGN_667,LANGN_668,LANGN_669,LANGN_670,LANGN_673
,LANGN_674,LANGN_675,LANGN_676,LANGN_677,LANGN_678,LANGN_800,LANGN_801,LANGN_802,LANGN_804,LANGN_805,LANGN_806,LANGN_807,LANGN_808,LANGN_809,LANGN_810,LANGN_811,LANGN_812,LANGN_813,LANGN_814,LANGN_815,LANGN_816,LANGN_817,LANGN_818,LANGN_819,LANGN_821,LANGN_823,LANGN_824,LANGN_825,LANGN_826,LANGN_827,LANGN_828,LANGN_829,LANGN_831,LANGN_832,LANGN_833,LANGN_836,LANGN_837,LANGN_838,LANGN_839,LANGN_840,LANGN_841,LANGN_842,LANGN_843,LANGN_844,LANGN_845,LANGN_846,LANGN_849,LANGN_850,LANGN_851,LANGN_852,LANGN_854,LANGN_855,LANGN_857,LANGN_859,LANGN_860,LANGN_861,LANGN_865,LANGN_866,LANGN_868,LANGN_870,LANGN_872,LANGN_873,LANGN_877,LANGN_879,LANGN_881,LANGN_885,LANGN_890,LANGN_892,LANGN_895,LANGN_896,LANGN_897,LANGN_898,LANGN_899,LANGN_900,LANGN_901,LANGN_902,LANGN_903,LANGN_904,LANGN_905,LANGN_906,LANGN_907,LANGN_908,LANGN_909,LANGN_910,LANGN_911,LANGN_912,LANGN_913,LANGN_914,LANGN_916,LANGN_917,LANGN_918,LANGN_919,LANGN_920,LANGN_921,LANGN_922
577660,1,84000126,84007619,,,,0,0,0,0,0,,,,,,,,,,,,,,,2.0,0.0,,,,,,,,,,,1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,1.0,13.0,29.0,1.0,1.0,0.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,90.0,0,0,1,1,0,0,0,0,1,3.0,3.0,3.0,3.0,2.0,2.0,2.0,4.0,2.0,2.0,1.0,2.0,1.0,1.0,65.0,35.0,1.0,2.0,1.0,2.0,100.0,23.0,4.0,0,0,0,1,3.0,10.0,20.0,20.0,30.0,10.0,10.0,10.0,100.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,,0.0086,1.0982,,0.2343,-0.2766,-1.4212,-0.3873,0.23,0.1301,-0.0107,1.2851,,,,,-0.035,1.3112,0.8462,0.8175,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
577072,0,84000158,84006571,1.0,3.0,,0,0,1,0,0,1.0,2.0,,,,,,,,9.0,10.0,9.0,,-0.1963,1.0,0.0,0.0,8.0,7.0,6.0,8.0,1.0,2.0,4.0,,0.0,1,1,16.0,39.02,,,,4.0,1.0,3.0,1.0,7.0,4.0,,8.0,5.0,4.0,3.0,4.0,1.0,2.0,,2.0,2.0,4.0,3.0,1.0,1.0,0.0,1.0,1.0,0.0,,,,,,1.0,5.0,6.0,,,,,,,,,,,,,,,,,,,,,,7.0,,1.1452,-0.0092,0.9815,-0.756,2.6536,,,,,,,,0.6466,0.3281,-0.8258,0.4277,0.8211,0.1438,0.429,-0.2065,0.0996,-0.6534,0.9135,0.8158,-0.7447,-0.1438,,,,,,,,,-0.2055,0.3135,1.2651,0.4585,,-0.9969,1.1454,-0.0542,0.9734,0.0203,0.7058,0.5403,0.81,0.8213,0.781,0.3395,-0.2178,0.4062,0.3346,-0.348,0.4662,0.4707,0.5964,-0.4631,1.7979,1.397,0.0782,0.281,0.2629,,,,,,,,,,,,,,,,,,3.0,8.0,3.0,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
576813,0,84000153,84006127,1.0,4.0,2.0,0,1,0,0,0,1.0,3.0,,,,,,,1.0,7.0,,8.0,,0.5803,1.0,0.0,0.0,8.0,7.0,6.0,12.0,1.0,0.0,4.0,,0.0,1,1,16.0,56.0,,,,4.0,0.0,3.0,3.0,4.0,1.0,,6.0,4.0,5.0,5.0,3.0,2.0,1.0,,1.0,1.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,5.0,6.0,,,,,,,,,,,,,,,,,,,,,,2.0,,-0.5742,-0.6456,0.9967,-0.756,0.181,,,,,,,,-1.0347,0.0189,-0.54,0.6387,-0.5635,-0.0731,-1.4597,0.4731,0.0897,-1.5902,-0.5393,0.1193,2.6109,0.2635,,,,,,,,,-0.0213,-0.0057,0.5219,-0.0733,-0.4078,0.0322,1.0891,0.5953,0.4019,-1.4392,-1.8093,-0.0406,-0.5238,-0.3051,0.2438,-0.7975,0.0098,0.4062,0.3346,0.3623,-0.2033,0.8478,0.0587,-0.6154,1.3149,1.0051,0.8206,-0.2455,-0.6665,,,,,,,,,,,,,,,,,,4.0,8.0,3.0,1.0,3.0,5.0,0.0,2.0,0.0,,,,1.0,2.0,2.0,2.0,80.0,0,0,1,0,0,1,0,0,1,4.0,4.0,3.0,4.0,3.0,2.0,1.0,4.0,3.0,2.0,1.0,1.0,1.0,1.0,87.0,13.0,2.0,2.0,2.0,1.0,88.8125,23.0,3.0,0,0,0,0,2.0,10.0,10.0,23.0,23.0,0.0,9.0,3.0,,,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,-0.3955,-0.3854,,0.2079,-0.2982,-0.2376,0.8767,0.9355,0.2793,-0.0107,1.512,,,,,-0.2488,0.527,-1.5036,-0.8263,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
577850,0,84000045,84007989,,1.0,1.0,0,0,0,0,0,,,,,,,,,2.0,8.0,9.0,,,-0.0002,2.0,0.0,0.0,,,,,1.0,0.0,,,,1,1,12.0,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.3378,,,,,,,,,,,,,,,,-0.039,-1.0048,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,5.0,46.0,80.0,0.0,10.0,0.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,45.0,0,0,1,0,0,1,0,0,1,3.0,4.0,1.0,3.0,3.0,1.0,1.0,3.0,4.0,3.0,1.0,2.0,1.0,2.0,90.0,10.0,1.0,1.0,2.0,2.0,92.5,13.0,3.0,0,0,0,1,2.0,40.0,40.0,40.0,40.0,0.0,5.0,10.0,0.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,-1.0859,1.0982,,-0.0731,-0.6909,0.1,-0.2527,-2.0409,-0.3082,0.4865,-0.8237,,,,,,,,0.6818,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
573800,1,84000127,84000730,1.0,4.0,1.0,0,1,0,0,0,1.0,2.0,,,,,,,,9.0,10.0,8.0,,0.6254,1.0,0.0,0.0,8.0,7.0,6.0,11.0,1.0,0.0,3.0,,0.0,1,1,16.0,82.41,82.41,,,1.0,1.0,2.0,1.0,7.0,5.0,,7.0,4.0,5.0,,,4.0,3.0,,1.0,2.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,2.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,-0.9717,-1.0406,1.1008,0.4417,1.2046,,,,,,,,0.3305,0.6196,0.8067,-0.8697,-0.8059,0.7165,-1.7079,-0.0582,-0.1406,0.2544,0.2056,1.106,1.8003,-0.0266,,,,,,,,,-0.0622,0.0074,0.7422,0.8228,,-0.8522,0.4668,1.113,-0.5113,0.5683,1.712,0.2079,-0.6026,0.906,0.2596,0.0167,0.6198,0.4062,0.3346,-0.1464,0.3801,0.6102,0.6498,0.7372,0.2265,0.5245,0.5762,1.5174,0.5137,,,,,,,,,,,,,,,,,,0.0,0.0,2.0,18.0,14.0,77.0,6.0,5.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,65.0,1,0,0,0,0,1,1,0,0,4.0,4.0,2.0,3.0,3.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,45.0,55.0,1.0,2.0,2.0,1.0,,28.0,2.0,0,0,0,1,3.0,28.0,37.0,35.0,36.0,12.0,6.0,17.0,60.0,15.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,-1.3317,0.2231,,-1.6916,0.7007,-1.4212,0.6585,0.5986,0.8999,-0.3751,-0.0199,,,,,1.0327,-0.8314,0.8462,-2.0883,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [25]:
test_features = test_data.drop(["MATH_Proficient"], axis=1)
test_target = test_data["MATH_Proficient"]
test_features.to_csv("test_features.csv", index=False, header=False)

In [26]:
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.2xlarge", sagemaker_session=session
)

model_config = clarify.ModelConfig(
    model_name=model_name,
    instance_type="ml.m5.large",
    instance_count=1,
    accept_type="text/csv",
    content_type="text/csv",
)

INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.0.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


In [27]:
from sagemaker.s3 import S3Downloader

# Download data from S3 to local instance
local_path = S3Downloader.download('s3://{}/{}/train'.format(bucket, prefix), './tmp/train_data')

In [28]:
# Load the training data and draw a sample for the Clarify job
full_data = pd.read_csv('./tmp/train_data/train.csv', header=None)
n = min(1500, len(full_data))  # Cap at 1500 rows; if the Clarify job runs too long, a smaller cap (e.g. 1000) is an option
sampled_data = full_data.sample(n=n)  # Uses every row when full_data has fewer than 1500

# Save the sample locally without a header row (train.csv has none; column names
# are passed to Clarify separately via `headers`), then upload it to S3
sampled_path = 'sampled_train_data.csv'
sampled_data.to_csv(sampled_path, index=False, header=False)

from sagemaker.s3 import S3Uploader
sampled_s3_uri = S3Uploader.upload(sampled_path, 's3://{}/{}/sampled_train'.format(bucket, prefix))
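
The sampling step above is not seeded, so each run may send a different subset to Clarify. Below is a minimal sketch of the same cap-and-sample logic with a fixed seed, using only the standard library. The function name `sample_rows`, the seed value, and the toy rows are illustrative assumptions, not part of the notebook. In pandas, the analogous change would be passing `random_state` to `DataFrame.sample`.

```python
import random

def sample_rows(rows, max_rows=1500, seed=42):
    """Return at most max_rows rows, sampled without replacement.

    seed is a hypothetical fixed value so repeated runs pick the same subset.
    """
    n = min(max_rows, len(rows))   # use the full data if it is smaller than the cap
    rng = random.Random(seed)      # local generator; leaves the global random state alone
    return rng.sample(rows, n)

rows = [[i, i * 2] for i in range(2000)]   # toy stand-in for the training rows
subset = sample_rows(rows)                 # capped at 1500 rows
small = sample_rows(rows[:10])             # fewer rows than the cap: all are kept
```

With a fixed seed, rerunning the cell reproduces the same sample, which makes SHAP results easier to compare across runs.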

In [29]:
sampled_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,...,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568
3121,1,84000048,84005365,1.0,3.0,1.0,0,0,0,0,1,1.0,3.0,,,,,,,2.0,9.0,10.0,10.0,,0.6638,1.0,0.0,0.0,9.0,7.0,6.0,2.0,1.0,0.0,3.0,,0.0,1,1,16.0,46.76,73.71,,,1.0,0.0,3.0,2.0,3.0,3.0,,5.0,5.0,,1.0,1.0,3.0,2.0,,1.0,3.0,4.0,,0.0,0.0,1.0,1.0,0.0,0.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,5.0,,-0.0212,0.6428,-1.228,0.6342,-0.6386,,,,,,,,0.0382,0.0605,1.5146,0.3658,1.5558,-0.3441,-0.749,1.3816,-0.4751,2.0169,0.3735,1.5493,-0.755,1.3299,,,,,,,,,1.8008,0.7014,-1.8511,-0.0797,0.6099,0.382,1.0285,0.965,1.2318,0.6865,-0.8823,0.3207,-0.076,0.7001,0.3986,0.6763,0.1343,0.4062,0.3346,1.5552,1.8109,1.0539,-0.0385,1.0766,-0.4207,-0.4255,-0.8917,-0.836,0.3007,,,,,,,,,,,,,,,,,,0.0,5.0,2.0,0.0,10.0,26.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,105.0,0,0,0,0,0,0,0,0,0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,3.0,3.0,2.0,1.0,1.0,1.0,75.0,25.0,2.0,1.0,2.0,1.0,100.0,28.0,3.0,0,0,0,0,2.0,23.0,31.0,20.0,41.0,30.0,32.0,5.0,250.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,0.6984,1.0982,,0.8784,0.2841,0.1,-0.2527,0.2266,1.5709,1.0165,2.1298,,,,,0.4403,-0.8314,0.8462,-0.1166,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
377,0,84000090,84007573,1.0,5.0,1.0,0,0,0,1,0,1.0,1.0,,,,,,,,8.0,10.0,7.0,,0.2794,1.0,0.0,0.0,5.0,7.0,6.0,8.0,1.0,1.0,1.0,,0.0,1,1,16.0,,,,,2.0,0.0,3.0,2.0,,5.0,,9.0,4.0,,4.0,4.0,,3.0,,1.0,2.0,4.0,,1.0,1.0,0.0,0.0,0.0,1.0,,,,,,5.0,3.0,6.0,,,,,,,,,,,,,,,,,,,,,,6.0,,-0.7508,-0.3397,-0.267,-0.756,0.9202,,,,,,,,0.3305,-0.6794,0.0002,0.2002,-0.8059,-1.0286,-1.9245,0.111,0.3211,-1.5247,-0.1945,0.6022,0.429,-0.2908,,,,,,,,,-0.2141,-0.0262,0.5849,0.0479,,-0.9723,0.8398,1.1002,0.4019,,0.6102,0.8948,-0.5238,-0.7413,0.4268,-0.2666,0.5565,0.4062,0.3346,0.3623,1.2373,0.9329,0.5964,-0.0348,0.7734,0.7175,0.6428,-0.1308,0.0441,,,,,,,,,,,,,,,,,,0.0,8.0,3.0,8.0,18.0,34.0,2.0,3.0,0.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,55.0,0,0,1,0,0,1,0,1,0,3.0,3.0,3.0,4.0,4.0,4.0,3.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,80.0,20.0,2.0,2.0,2.0,1.0,100.0,28.0,1.0,0,0,0,0,3.0,50.0,51.0,53.0,56.0,11.0,7.0,51.0,20.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,,0.5931,1.0982,,-1.6916,-0.1914,-1.4212,-0.2712,-1.2356,0.8999,0.9364,0.7833,,,,,0.5308,-0.8314,0.8462,0.324,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
823,1,84000099,84006223,1.0,6.0,1.0,0,0,0,0,1,1.0,1.0,,,,,,,,7.0,9.0,8.0,,1.2892,2.0,0.0,0.0,9.0,7.0,6.0,8.0,1.0,2.0,3.0,,0.0,1,1,16.0,68.7,75.13,,,4.0,1.0,2.0,1.0,20.0,5.0,,4.0,4.0,4.0,,,3.0,3.0,,2.0,3.0,4.0,,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,3.0,4.0,,,,,,,,,,,,,,,,,,,,,,3.0,,-0.1242,-1.2544,1.7055,-0.756,0.9202,,,,,,,,-0.0994,0.8359,0.1487,1.1855,0.4357,-0.4533,-1.2785,2.2077,0.2375,-0.5168,0.4711,1.9075,0.7774,-0.2176,,,,,,,,,-0.106,0.1195,1.114,-0.1382,0.2354,0.0628,0.511,1.2567,-1.0699,,0.8671,-1.1763,-0.0266,0.12,-0.7573,-0.0891,-0.1079,0.4062,0.3346,-0.4429,0.1839,0.651,0.6629,0.3514,-0.2575,-0.379,0.2941,-0.0942,-0.0904,,,,,,,,,,,,,,,,,,0.0,6.0,3.0,4.0,10.0,60.0,3.0,6.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,80.0,1,0,0,0,0,0,0,1,0,4.0,3.0,2.0,3.0,2.0,3.0,2.0,4.0,4.0,3.0,3.0,2.0,1.0,1.0,68.0,32.0,2.0,2.0,1.0,1.0,100.0,18.0,2.0,0,0,0,0,3.0,49.0,35.0,27.0,42.0,5.0,3.0,3.0,55.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,...,,-0.1221,1.0982,,0.8848,-0.5635,-1.4212,0.8826,0.5642,0.8532,0.4252,1.1965,,,,,1.2511,-0.8314,0.1102,-0.8929,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3063,0,84000057,84008135,,,,0,0,0,0,0,,,,,,,,,,7.0,9.0,,,,1.0,0.0,,,,,,,,,,,1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,,,,,,,2.0,2.0,2.0,1.0,1.0,1.0,1.0,50.0,0,0,1,0,0,1,0,0,1,4.0,4.0,2.0,4.0,4.0,4.0,3.0,4.0,4.0,4.0,4.0,1.0,1.0,1.0,80.0,20.0,1.0,1.0,2.0,1.0,,28.0,4.0,0,0,0,1,2.0,26.0,26.0,26.0,26.0,15.0,5.0,5.0,90.0,15.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,0.4001,0.4671,,0.7602,0.7007,0.1,0.2446,0.9255,1.3442,0.9861,1.1965,,,,,0.0379,0.7951,0.8462,-0.1166,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
432,1,84000025,84005802,,,,0,0,0,0,0,,,,,,,,,,8.0,10.0,,,,1.0,0.0,,,,,,,,,,,1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,16.0,10.0,18.0,4.0,4.0,0.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,80.0,1,0,0,1,0,0,1,0,0,4.0,4.0,3.0,2.0,3.0,3.0,1.0,1.0,1.0,3.0,2.0,2.0,1.0,1.0,72.0,28.0,1.0,1.0,1.0,1.0,100.0,33.0,5.0,0,0,0,1,2.0,19.0,30.0,45.0,33.0,6.0,10.0,16.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,0.7044,1.0982,,-0.2244,-2.1066,-1.4212,-0.3348,-0.7733,-0.3082,0.6708,1.512,,,,,,,,0.324,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [30]:
shap_config = clarify.SHAPConfig(
    baseline=[test_features.iloc[0].values.tolist()],
    num_samples=3000,  
    agg_method="mean_abs",
    save_local_shap_values=True
)

explainability_output_path = "s3://{}/{}/clarify-explainability".format(bucket, prefix)

explainability_data_config = clarify.DataConfig(
    #s3_data_input_path='s3://{}/{}/train'.format(bucket, prefix),
    s3_data_input_path=sampled_s3_uri,
    s3_output_path=explainability_output_path,
    label='MATH_Proficient',
    headers=train_data.columns.to_list(),
    dataset_type="text/csv",
)
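
The SHAP baseline above is the first test row, which (as the `test_data.head(5)` output shows) can consist largely of missing values. One alternative is a column-wise average of the feature rows. The sketch below computes such a baseline with the standard library, skipping missing entries. The helper `mean_baseline` and the toy rows are hypothetical, and whether a mean baseline suits this dataset is an assumption to verify, not a recommendation from the notebook.

```python
import math

def mean_baseline(rows):
    """Column-wise mean over rows, ignoring None/NaN; all-missing columns get 0.0."""
    baseline = []
    for column in zip(*rows):
        present = [v for v in column
                   if v is not None and not (isinstance(v, float) and math.isnan(v))]
        baseline.append(sum(present) / len(present) if present else 0.0)
    return baseline

nan = float("nan")
toy_features = [            # made-up feature rows with gaps, label already dropped
    [1.0, nan, 4.0],
    [3.0, 2.0, nan],
    [nan, 4.0, nan],
]
# A list containing one row, the same shape the notebook passes as `baseline`
baseline = [mean_baseline(toy_features)]
```

The fallback value of 0.0 for columns with no observed values is itself an arbitrary choice made for this sketch.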

In [None]:
clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config
)

INFO:sagemaker.clarify:Analysis Config: {'dataset_type': 'text/csv', 'headers': ['MATH_Proficient', 'CNTSCHID', 'CNTSTUID', 'SISCO', 'ST347Q01JA', 'ST347Q02JA', 'ST349Q01JA_0', 'ST349Q01JA_1', 'ST349Q01JA_2', 'ST349Q01JA_3', 'ST349Q01JA_4', 'ST350Q01JA', 'ST356Q01JA', 'ST322Q01JA', 'ST322Q02JA', 'ST322Q03JA', 'ST322Q04JA', 'ST322Q06JA', 'ST322Q07JA', 'DURECEC', 'EFFORT1', 'EFFORT2', 'ST259Q01JA', 'WB164Q01HA', 'HOMEPOS', 'ST004D01T', 'GRADE', 'REPEAT', 'EXPECEDU', 'ICTAVSCH', 'ICTAVHOM', 'ICTDISTR', 'IMMIG', 'TARDYSD', 'ST226Q01JA', 'ST016Q01NA', 'MISSSC', 'Option_UH', 'OECD', 'PAREDINT', 'BMMJ1', 'BFMJ2', 'WB163Q06HA', 'WB163Q07HA', 'ST230Q01JA', 'SKIPPING', 'IC180Q01JA', 'IC180Q08JA', 'ST059Q02JA', 'ST296Q04JA', 'WB176Q01HA', 'STUDYHMW', 'IC184Q01JA', 'IC184Q02JA', 'IC184Q03JA', 'IC184Q04JA', 'ST059Q01TA', 'ST296Q01JA', 'ST272Q01JA', 'ST268Q01JA', 'ST268Q04JA', 'ST268Q07JA', 'ST293Q04JA', 'ST297Q01JA', 'ST297Q03JA', 'ST297Q05JA', 'ST297Q06JA', 'ST297Q07JA', 'ST297Q09JA', 'WB165Q01HA'

.................sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
We are not in a supported iso region, /bin/sh exiting gracefully with no changes.
INFO:sagemaker-clarify-processing:Starting SageMaker Clarify Processing job
INFO:analyzer.data_loading.data_loader_util:Analysis config path: /opt/ml/processing/input/config/analysis_config.json
INFO:analyzer.data_loading.data_loader_util:Analysis result path: /opt/ml/processing/output
INFO:analyzer.data_loading.data_loader_util:This host is algo-1.
INFO:analyzer.data_loading.data_loader_util:This host is the leader.
INFO:analyzer.data_loading.data_loader_util:Number of hosts in the cluster is 1.
INFO:sagemaker-clarify-processing:Running Python / Pandas based analyzer.
INFO:analyzer.data_loading.data_l

## Train the model again
### Using only the influential variables and automatic tuning

In [None]:
import json
import boto3

# Replace with your actual bucket name and prefix used in explainability_output_path
# bucket = "your-bucket-name"
# prefix = "your-prefix"  # e.g., the folder structure used in your explainability_output_path

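# Construct the S3 key for the Clarify analysis output. Clarify writes its results,
# including the global SHAP values, to analysis.json under the output path
# (adjust the filename if your job's output differs).
key = f"{prefix}/clarify-explainability/analysis.json"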

# Initialize boto3 client for S3 and download the JSON report
s3 = boto3.client("s3")
response = s3.get_object(Bucket=bucket, Key=key)
content = response["Body"].read().decode("utf-8")
report = json.loads(content)

# Navigate to the global SHAP values dictionary
global_shap = report["explanations"]["kernel_shap"]["label0"]["global_shap_values"]

# Sort the items by the SHAP value in descending order and take the top 20
top_20 = sorted(global_shap.items(), key=lambda item: item[1], reverse=True)[:20]

# Extract just the feature names
top_20_features = [feature for feature, value in top_20]

print("Top 20 features with the highest mean absolute SHAP values:")
for feature in top_20_features:
    print(feature)
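
Once the most influential features are known, the training data can be narrowed to just those columns before retraining. The following is a minimal sketch of that step, not the notebook's actual code: `select_top_features` is a hypothetical helper, and the SHAP values and rows are toy numbers made up for illustration. It keeps the target as the first column, which is the layout SageMaker XGBoost expects for CSV training data.

```python
import pandas as pd


def select_top_features(df, global_shap, target, k=20):
    """Keep the target column plus the k features with the largest |SHAP| value."""
    # Rank feature names by absolute global SHAP value, largest first
    ranked = sorted(global_shap, key=lambda name: abs(global_shap[name]), reverse=True)
    # Ignore names that are not columns of df, and never pick the target itself
    keep = [name for name in ranked if name in df.columns and name != target][:k]
    # Target goes first, followed by the selected features
    return df[[target] + keep]


# Toy stand-ins (assumed values, not real Clarify output or PISA rows)
toy_shap = {"HOMEPOS": 0.30, "GRADE": 0.12, "REPEAT": 0.05}
toy_df = pd.DataFrame(
    {
        "MATH_Proficient": [1, 0],
        "HOMEPOS": [0.5, -1.2],
        "GRADE": [0, -1],
        "REPEAT": [0, 1],
    }
)

reduced = select_top_features(toy_df, toy_shap, target="MATH_Proficient", k=2)
print(list(reduced.columns))
```

In the notebook, the same helper could presumably be applied to the train, validation, and test frames with `global_shap` from the report, as long as all three are reduced to the same columns in the same order.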

---

## Hosting
Now that we've trained the `xgboost` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

In [18]:
# cell 18
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: xgboost-2025-02-20-17-57-04-832
INFO:sagemaker:Creating endpoint-config with name xgboost-2025-02-20-17-57-04-832
INFO:sagemaker:Creating endpoint with name xgboost-2025-02-20-17-57-04-832


------!

---

## Evaluation
First we'll need to determine how we pass data into and receive data from our endpoint.  Our data is currently stored as NumPy arrays in the memory of our notebook instance.  To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the CSV response that comes back.

*Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.*

In [19]:
# cell 19
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [20]:
# cell 20
def predict(data, predictor, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])
    return np.fromstring(predictions[1:], sep=',')

# Use the updated target variable and drop it from test data
predictions = predict(test_data.drop(['MATH_Proficient'], axis=1).to_numpy(), xgb_predictor)
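
The batching logic can be tried locally without an endpoint. The sketch below is an illustration under stated assumptions: `FakePredictor` is a hypothetical stand-in that returns one comma-separated score per row as bytes (the response shape the `predict` function above relies on), and `predict_in_batches` is an alternative formulation that parses each response separately instead of concatenating strings.

```python
import numpy as np


class FakePredictor:
    """Hypothetical stand-in for an endpoint: one score per row, returned as CSV bytes."""

    def predict(self, array):
        # Use the row mean as a fake "score" so the result is easy to check
        scores = np.asarray(array, dtype=float).mean(axis=1)
        return ",".join(f"{s:.4f}" for s in scores).encode("utf-8")


def predict_in_batches(data, predictor, rows=500):
    """Send `data` in batches of at most `rows` rows and collect the scores."""
    n_batches = max(1, int(np.ceil(data.shape[0] / rows)))
    parts = []
    for batch in np.array_split(data, n_batches):
        # Decode the CSV response and convert each field to float
        fields = predictor.predict(batch).decode("utf-8").split(",")
        parts.append(np.array(fields, dtype=float))
    return np.concatenate(parts)


# Six toy rows, sent in batches of at most four rows
toy = np.arange(12, dtype=float).reshape(6, 2)
scores = predict_in_batches(toy, FakePredictor(), rows=4)
print(scores)
```

The result should contain exactly one score per input row, in the original row order, which is what the confusion matrix step depends on.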


Now we'll check our confusion matrix to see how well we predicted versus actuals.

In [23]:
# cell 21

# Generate the confusion matrix (ensure predictions are rounded appropriately)
cm = pd.crosstab(index=test_data['MATH_Proficient'], 
                 columns=np.round(predictions), 
                 rownames=['actuals'], 
                 colnames=['predictions'])
print("Confusion Matrix:")
print(cm)

# Extract values from the confusion matrix
# Assuming that:
# - actual class 0 is negative
# - actual class 1 is positive
TN = cm.loc[0.0, 0.0]
FP = cm.loc[0.0, 1.0]
FN = cm.loc[1.0, 0.0]
TP = cm.loc[1.0, 1.0]

# Calculate Accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN) * 100

# Calculate Precision (for the positive class)
precision = TP / (TP + FP) * 100 if (TP + FP) > 0 else 0

# Calculate Recall (for the positive class)
recall = TP / (TP + FN) * 100 if (TP + FN) > 0 else 0

# Calculate F1 Score
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

# Calculate Specificity (True Negative Rate)
specificity = TN / (TN + FP) * 100 if (TN + FP) > 0 else 0

# Print out the calculated metrics
print("\n Accuracy: {:.2f}".format(accuracy))
print("\n Precision: {:.2f}".format(precision))
print("\n Recall: {:.2f}".format(recall))
print("\n F1 Score: {:.2f}".format(f1_score))
print("\n Specificity: {:.2f}".format(specificity))

Confusion Matrix:
predictions  0.0  1.0
actuals              
0            153   73
1             77  380

 Accuracy: 78.04

 Precision: 83.89

 Recall: 83.15

 F1 Score: 83.52

 Specificity: 67.70


The model can (and should) be tuned to improve this.  

_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above._
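
As a check on the figures above, the metrics can be recomputed directly from the four confusion matrix counts. This is a small self-contained sketch; `binary_metrics` is a hypothetical helper name, and the counts are the ones printed in the output above.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute standard binary classification metrics as fractions in [0, 1]."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }


# Counts taken from the confusion matrix printed above
metrics = binary_metrics(tn=153, fp=73, fn=77, tp=380)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}")
```

Recall is noticeably higher than specificity here, so the model is better at recognising class 1 than class 0. With unbalanced classes, accuracy alone can hide that gap.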

### Clean Up
Delete any resources you created in this notebook that you no longer wish to use.

In [None]:
# cell 22
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

---
## Serverless Deployment (Optional)
After training the model, retrieve the model artifacts so that we can deploy the model to an endpoint.

In [None]:
# Setup clients
import boto3

client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")

In [None]:
# Retrieve model data from training job
model_artifacts = xgb.model_data
model_artifacts

### Model Creation
Create a model by providing your model artifacts, the container image URI, environment variables for the container (if applicable), a model name, and the SageMaker IAM role.

In [None]:
from time import gmtime, strftime

model_name = "xgboost-serverless" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)

# dummy environment variables
byo_container_env_vars = {"SAGEMAKER_CONTAINER_LOG_LEVEL": "20", "SOME_ENV_VAR": "myEnvVar"}

create_model_response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": container,
            "Mode": "SingleModel",
            "ModelDataUrl": model_artifacts,
            "Environment": byo_container_env_vars,
        }
    ],
    ExecutionRoleArn=role,
)

print("Model Arn: " + create_model_response["ModelArn"])

### Endpoint Configuration Creation
This is where you can adjust the Serverless Configuration for your endpoint. `MaxConcurrency`, the maximum number of concurrent invocations for a single endpoint, can be any value from 1 to 200, and `MemorySizeInMB` can be any of the following: 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB.

In [None]:
xgboost_epc_name = "xgboost-serverless-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name,
    ProductionVariants=[
        {
            "VariantName": "byoVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 1,
            },
        },
    ],
)

print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])
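
Because an invalid serverless setting is only rejected when the API call is made, it can help to check the values beforehand. The sketch below is a hypothetical convenience, not part of boto3: `make_serverless_config` simply encodes the limits quoted in the text above (1 to 200 for `MaxConcurrency`, and the six listed memory sizes) and returns a dictionary in the shape used for `ServerlessConfig`.

```python
# Limits as quoted in the text above; check the current service limits before relying on them
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def make_serverless_config(memory_mb, max_concurrency):
    """Validate the settings and return a ServerlessConfig-shaped dictionary."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}, got {memory_mb}")
    if not 1 <= max_concurrency <= 200:
        raise ValueError(f"max_concurrency must be between 1 and 200, got {max_concurrency}")
    return {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}


config = make_serverless_config(memory_mb=4096, max_concurrency=1)
print(config)

# An unsupported memory size is caught locally instead of by the API call
try:
    make_serverless_config(memory_mb=1000, max_concurrency=1)
    rejected = False
except ValueError:
    rejected = True
print("Invalid memory size rejected:", rejected)
```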

### Serverless Endpoint Creation
Now that we have an endpoint configuration, we can create a serverless endpoint and deploy our model to it. When creating the endpoint, provide the name of your endpoint configuration and a name for the new endpoint.

In [None]:
endpoint_name = "xgboost-serverless-ep" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=xgboost_epc_name,
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Wait until the endpoint status is InService before invoking the endpoint.

In [None]:
# Poll describe_endpoint until the endpoint leaves the "Creating" state
# (the final status is "InService" on success or "Failed" on error)

import time

describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)

while describe_endpoint_response["EndpointStatus"] == "Creating":
    # Sleep first so we do not wait again after the final status is reached
    time.sleep(15)
    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])

describe_endpoint_response
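
A polling loop like the one above can also guard against a failed deployment or an endpoint that never becomes ready. The following is a minimal sketch under stated assumptions: `wait_for_endpoint` is a hypothetical helper, and the `describe` argument is any callable that returns a dictionary with an `EndpointStatus` key. Here it is exercised with a fake sequence of statuses so it runs without AWS access.

```python
import time


def wait_for_endpoint(describe, poll_seconds=15, timeout_seconds=1800, sleep=time.sleep):
    """Poll `describe()` until the status is InService; raise on failure or timeout."""
    waited = 0
    while True:
        status = describe()["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint creation failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake status sequence standing in for repeated describe_endpoint calls
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(
    lambda: {"EndpointStatus": next(fake_statuses)},
    sleep=lambda seconds: None,  # skip real sleeping in this illustration
)
print(final_status)
```

With a real client, `describe` could be `lambda: client.describe_endpoint(EndpointName=endpoint_name)`. boto3 also ships a built-in waiter for this, `client.get_waiter("endpoint_in_service")`, which may be the simpler choice when custom handling is not needed.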

### Endpoint Invocation
Invoke the endpoint by sending a request to it. The payload below is a single sample row hard-coded as a CSV string.

***Need to change code below to read in a row from our dataset!***

In [None]:
payload ="800001,0,,,,0,0,0,0,0,,,5.0,5.0,3.0,,1.0,1.0,,10.0,10.0,10.0,,1.5995,1.0,0.0,0.0,9.0,0.0,,,1.0,,4.0,10.0,0.0,0,0,14.5,73.91,16.5,,,4.0,1.0,2.0,3.0,7.0,6.0,,10.0,5.0,,,,4.0,3.0,10.0,2.0,1.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,1.0,6.0,6.0,,,,,,,,,,,,,,,,,,,,,,0.0,,0.9905,-0.2327,-1.228,1.1246,-0.6386,,3.3518,,,,,,-0.5185,,1.8355,0.6387,1.5558,0.8246,2.4962,-0.2284,2.4031,-1.4413,,,0.544,-0.0085,2.4021,0.059,0.8155,4.1226,,,0.7507,2.0225,,,,,,,4.9507,1.1112,,,,,,,,,,,,-1.1989,-2.0261,-1.7886,,,,,0.8373,0.6984,,,,,,,,,,,,,,,,,,,0.0,10.0,3.0,100.0,3.0,23.0,,24.0,,1.0,1.0,1.0,2.0,1.0,1.0,1.0,45.0,0,0,1,0,0,1,0,0,1,4.0,4.0,4.0,4.0,3.0,3.0,3.0,4.0,2.0,4.0,2.0,2.0,2.0,1.0,74.0,26.0,1.0,1.0,1.0,1.0,100.0,28.0,5.0,0,0,0,1,3.0,30.0,30.0,61.0,62.0,11.0,50.0,10.0,90.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,0.2813,0.5556,0.625,3.0,0,0,1,3.0,4.0,4.0,1.0,1.0,3.0,3.0,1.0,3.0,3.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,3.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,10.0,3.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-0.3418,,0.5374,0.3734,0.113,0.522,0.9868,1.0982,2.1585,-0.4315,-0.0097,-0.2805,-0.9198,0.5521,2.0709,2.0131,1.1162,-0.3682,1.3541,0.343,0.4217,1.111,-0.8314,0.8462,0.5908,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0"

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=payload,
    ContentType="text/csv",
)

print(response["Body"].read())
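
The note above asks for the payload to be read from the dataset rather than hard-coded. One way this might be done is sketched below, assuming the test data is a pandas DataFrame that still contains the target column. `row_to_csv_payload` is a hypothetical helper, and the toy frame stands in for `test_data`.

```python
import numpy as np
import pandas as pd


def row_to_csv_payload(df, target, position=0):
    """Return one row of `df`, without the target column, as a CSV string."""
    features = df.drop(columns=[target]).iloc[[position]]
    # Missing values are written as empty fields, like the empty entries in the sample payload
    return features.to_csv(header=False, index=False).strip()


# Toy stand-in for test_data (assumed values, not real PISA rows)
toy_test_data = pd.DataFrame(
    {
        "MATH_Proficient": [1, 0],
        "HOMEPOS": [0.5, np.nan],
        "GRADE": [0, -1],
    }
)

payload_0 = row_to_csv_payload(toy_test_data, "MATH_Proficient", position=0)
payload_1 = row_to_csv_payload(toy_test_data, "MATH_Proficient", position=1)
print(payload_0)
print(payload_1)
```

The resulting string could then be passed as `Body` to `invoke_endpoint`, provided the columns are in the same order as the data the model was trained on.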

### Clean Up
Delete any resources you created in this notebook that you no longer wish to use.

In [None]:
client.delete_model(ModelName=model_name)
client.delete_endpoint_config(EndpointConfigName=xgboost_epc_name)
client.delete_endpoint(EndpointName=endpoint_name)

## Automatic Model Tuning (Optional)
Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in the best-performing model, as measured by a metric that you choose.

For example, suppose that you want to solve the binary classification problem in this notebook on the PISA 2022 dataset. Your goal is to maximize the area under the curve (AUC) metric by training an XGBoost model. You don't know which values of the `eta`, `alpha`, `min_child_weight`, and `max_depth` hyperparameters produce the best model. To find them, you specify ranges of values for Amazon SageMaker hyperparameter tuning to search. Hyperparameter tuning launches training jobs that use hyperparameter values in those ranges and returns the training job with the highest AUC.


In [None]:
# cell 22
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                            'min_child_weight': ContinuousParameter(1, 10),
                            'alpha': ContinuousParameter(0, 2),
                            'max_depth': IntegerParameter(1, 10)}


In [None]:
# cell 23
objective_metric_name = 'validation:auc'

In [None]:
# cell 24
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=3,
                            max_parallel_jobs=3)
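
To make the idea of searching over ranges concrete, the sketch below runs a plain random search locally. It is an illustration only and not how SageMaker selects candidates: the ranges mirror the ones defined above, three trials mirror `max_jobs=3`, and `toy_objective` is a made-up function standing in for the validation AUC that a real training job would report.

```python
import random

# Same ranges as hyperparameter_ranges above, written as plain tuples
CONTINUOUS_RANGES = {"eta": (0.0, 1.0), "min_child_weight": (1.0, 10.0), "alpha": (0.0, 2.0)}
INTEGER_RANGES = {"max_depth": (1, 10)}


def sample_hyperparameters(rng):
    """Draw one candidate uniformly at random from every range."""
    params = {name: rng.uniform(low, high) for name, (low, high) in CONTINUOUS_RANGES.items()}
    params.update({name: rng.randint(low, high) for name, (low, high) in INTEGER_RANGES.items()})
    return params


def toy_objective(params):
    """Made-up score standing in for validation AUC; higher is better."""
    return 1.0 - abs(params["eta"] - 0.3) - 0.01 * abs(params["max_depth"] - 5)


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = [sample_hyperparameters(rng) for _ in range(3)]  # three trials, like max_jobs=3
best = max(trials, key=toy_objective)
print("Best candidate:", best)
```

With only three jobs, any search covers very little of a four-dimensional space, so a larger `max_jobs` is usually needed before tuning makes a visible difference.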


In [None]:
# cell 25
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

In [None]:
# cell 26
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

In [None]:
# cell 27
# return the best training job name
tuner.best_training_job()

In [None]:
# cell 28
#  Deploy the best trained or user specified model to an Amazon SageMaker endpoint
tuner_predictor = tuner.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

In [None]:
# cell 29
# Create a serializer
tuner_predictor.serializer = sagemaker.serializers.CSVSerializer()

In [None]:
# cell 30
# Predict
predictions = predict(test_data.drop(['MATH_Proficient'], axis=1).to_numpy(),tuner_predictor)

In [None]:
# cell 31

# Generate the confusion matrix (ensure predictions are rounded appropriately)
cm = pd.crosstab(index=test_data['MATH_Proficient'], 
                 columns=np.round(predictions), 
                 rownames=['actuals'], 
                 colnames=['predictions'])
print("Confusion Matrix:")
print(cm)

# Extract values from the confusion matrix
# Assuming that:
# - actual class 0 is negative
# - actual class 1 is positive
TN = cm.loc[0.0, 0.0]
FP = cm.loc[0.0, 1.0]
FN = cm.loc[1.0, 0.0]
TP = cm.loc[1.0, 1.0]

# Calculate Accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN) * 100

# Calculate Precision (for the positive class)
precision = TP / (TP + FP) * 100 if (TP + FP) > 0 else 0

# Calculate Recall (for the positive class)
recall = TP / (TP + FN) * 100 if (TP + FN) > 0 else 0

# Calculate F1 Score
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

# Calculate Specificity (True Negative Rate)
specificity = TN / (TN + FP) * 100 if (TN + FP) > 0 else 0

# Print out the calculated metrics
print("\n Accuracy: {:.2f}".format(accuracy))
print("\n Precision: {:.2f}".format(precision))
print("\n Recall: {:.2f}".format(recall))
print("\n F1 Score: {:.2f}".format(f1_score))
print("\n Specificity: {:.2f}".format(specificity))

### Clean Up

If you are done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
# cell 33
tuner_predictor.delete_endpoint(delete_endpoint_config=True)