Preprocessing is performed via the `SchoolYear` class, found in the `src` folder.

In [8]:
import os, sys
# Set absolute path to the root folder of the directory
full_path = os.getcwd()
home_folder = 'CPS_GradRate_Analysis'
root = full_path.split(home_folder)[0] + home_folder + '/'
sys.path.append(root)

In [9]:
%load_ext autoreload
%autoreload 2

from src.preprocessing_schoolid import SchoolYear




The SchoolYear class represents data from a CPS school gathered during an individual school year.  

The SchoolYear class takes two arguments: 
  - a path to a School Profile CSV from a given year.
  - a path to a Progress Report CSV from the same year.
  
The School Profile CSVs were downloaded from the [Chicago Data Portal](https://data.cityofchicago.org/):

  - [2016-2017 Profile](https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/8i6r-et8s)
  - [2017-2018 Profile](https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/w4qj-h7bg)
  - [2018-2019 Profile](https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/kh4r-387c)

Files should be downloaded and placed in the `data/chicago_data_portal_csv_files` folder.

In [10]:
# After downloading the CSVs, instantiate a SchoolYear object
path_to_pr_1819 = '../data/chicago_data_portal_csv_files/Chicago_Public_Schools_-_School_Progress_Reports_SY1819.csv'
path_to_sp_1819 = '../data/chicago_data_portal_csv_files/Chicago_Public_Schools_-_School_Profile_Information_SY1819.csv'
sy_1819 = SchoolYear(path_to_sp_1819, path_to_pr_1819)


In [11]:
path_to_pr_1718 = '../data/chicago_data_portal_csv_files/Chicago_Public_Schools_-_School_Progress_Reports_SY1718.csv'
path_to_sp_1718 = '../data/chicago_data_portal_csv_files/Chicago_Public_Schools_-_School_Profile_Information_SY1718.csv'
sy_1718 = SchoolYear(path_to_sp_1718, path_to_pr_1718)

The original data has been converted into dataframes, which can be accessed by the `sp_df` and `pr_df` attributes.

In [12]:
sy_1819.sp_df.sample()

Unnamed: 0,School_ID,Legacy_Unit_ID,Finance_ID,Short_Name,Long_Name,Primary_Category,Is_High_School,Is_Middle_School,Is_Elementary_School,Is_Pre_School,Summary,Administrator_Title,Administrator,Secondary_Contact_Title,Secondary_Contact,Address,City,State,Zip,Phone,Fax,CPS_School_Profile,Website,Facebook,Twitter,Youtube,Pinterest,Attendance_Boundaries,Grades_Offered_All,Grades_Offered,Student_Count_Total,Student_Count_Low_Income,Student_Count_Special_Ed,Student_Count_English_Learners,Student_Count_Black,Student_Count_Hispanic,Student_Count_White,Student_Count_Asian,Student_Count_Native_American,Student_Count_Other_Ethnicity,Student_Count_Asian_Pacific_Islander,Student_Count_Multi,Student_Count_Hawaiian_Pacific_Islander,Student_Count_Ethnicity_Not_Available,Statistics_Description,Demographic_Description,Dress_Code,PreK_School_Day,Kindergarten_School_Day,School_Hours,Freshman_Start_End_Time,After_School_Hours,Earliest_Drop_Off_Time,Classroom_Languages,Bilingual_Services,Refugee_Services,Title_1_Eligible,PreSchool_Inclusive,Preschool_Instructional,Significantly_Modified,Hard_Of_Hearing,Visual_Impairments,Transportation_Bus,Transportation_El,Transportation_Metra,School_Latitude,School_Longitude,Average_ACT_School,Mean_ACT,College_Enrollment_Rate_School,College_Enrollment_Rate_Mean,Graduation_Rate_School,Graduation_Rate_Mean,Overall_Rating,Rating_Status,Rating_Statement,Classification_Description,School_Year,Third_Contact_Title,Third_Contact_Name,Fourth_Contact_Title,Fourth_Contact_Name,Fifth_Contact_Title,Fifth_Contact_Name,Sixth_Contact_Title,Sixth_Contact_Name,Seventh_Contact_Title,Seventh_Contact_Name,Network,Is_GoCPS_Participant,Is_GoCPS_PreK,Is_GoCPS_Elementary,Is_GoCPS_High_School,Open_For_Enrollment_Date,Closed_For_Enrollment_Date
197,610039,4490,24201,VON LINNE,Carl von Linne Elementary School,ES,False,True,True,True,Linne is a neighborhood school with a rich tra...,Principal,Ms.Renee P Mackin,Assistant Principal,Gabriel Parra,3221 N SACRAMENTO AVE,Chicago,Illinois,60618,7735345000.0,7735345000.0,http://cps.edu/Schools/Pages/school.aspx?Schoo...,http://www.linneschool.org/,,https://twitter.com/VonLinneElem,,,True,"PK,K,1,2,3,4,5,6,7,8","PK,K-8",654,525,96,205,23,543,61,9,6,0,0,8,4,0,There are 654 students enrolled at VON LINNE. ...,The largest demographic at VON LINNE is Hispan...,True,Half Day,Full Day,08:00 AM-03:00 PM,,3:00-6:00,7:30,Spanish,True,,True,,,,,,,Blue,,41.940009,-87.702635,,,,68.2,,78.2,Level 1+,GOOD STANDING,"This school received a Level 1+ rating, which ...",Schools that have an attendance boundary. Gene...,School Year 2018-2019,,,,,,,,,,,ISP,True,False,True,False,09/01/2004 12:00:00 AM,


In [13]:
# For the 2018/2019 school year, there are 660 records and 95 columns in the school profile csv: 
print(sy_1819.sp_df.shape)

(660, 95)


In [14]:
# For the 2018/2019 school year, there are 654 records and 182 columns in the progress report csv: 
print(sy_1819.pr_df.shape)

(654, 182)


The `merge_pr_and_sp` method merges the Progress Report (`pr_df`) and School Profile (`sp_df`) dataframes on `School_ID`.
It is called in the object's `__init__` method, and the result is stored in:
   - `merged_df`: the merged dataframe, which the later preprocessing and filtering methods alter.
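
The merge described above can be illustrated with a minimal sketch on toy data. The `_sp` and `_pr` suffixes match the column names that appear in `merged_df`; the inner join is an assumption (the merged row count is lower than either input), and the function body here is hypothetical, not the code in `src`.

```python
import pandas as pd

# Toy stand-ins for the School Profile (sp) and Progress Report (pr) tables.
sp_df = pd.DataFrame({
    "School_ID": [1, 2, 3, 4],
    "Short_Name": ["A", "B", "C", "D"],
    "Student_Count_Total": [100, 200, 0, 50],
})
pr_df = pd.DataFrame({
    "School_ID": [2, 3, 4, 5],
    "Short_Name": ["B", "C", "D", "E"],
    "School_Type": ["Charter", "Neighborhood", "Charter", "Magnet"],
})


def merge_pr_and_sp(sp_df, pr_df):
    """Join the two tables on School_ID (inner join is an assumption)."""
    # Columns that exist in both tables get the _sp / _pr suffixes.
    return sp_df.merge(pr_df, on="School_ID", how="inner", suffixes=("_sp", "_pr"))


merged_df = merge_pr_and_sp(sp_df, pr_df)
print(merged_df.shape)
print(list(merged_df.columns))
```

Only schools present in both tables survive an inner join, and the key column appears once, so the merged column count is the sum of the two inputs minus one.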

In [15]:
sy_1819.merged_df.sample()

Unnamed: 0,School_ID,Legacy_Unit_ID,Finance_ID,Short_Name_sp,Long_Name_sp,Primary_Category_sp,Is_High_School,Is_Middle_School,Is_Elementary_School,Is_Pre_School,Summary,Administrator_Title,Administrator,Secondary_Contact_Title,Secondary_Contact,Address_sp,City_sp,State_sp,Zip_sp,Phone_sp,Fax_sp,CPS_School_Profile_sp,Website_sp,Facebook,Twitter,Youtube,Pinterest,Attendance_Boundaries,Grades_Offered_All,Grades_Offered,Student_Count_Total,Student_Count_Low_Income,Student_Count_Special_Ed,Student_Count_English_Learners,Student_Count_Black,Student_Count_Hispanic,Student_Count_White,Student_Count_Asian,Student_Count_Native_American,Student_Count_Other_Ethnicity,Student_Count_Asian_Pacific_Islander,Student_Count_Multi,Student_Count_Hawaiian_Pacific_Islander,Student_Count_Ethnicity_Not_Available,Statistics_Description,Demographic_Description,Dress_Code,PreK_School_Day,Kindergarten_School_Day,School_Hours,Freshman_Start_End_Time,After_School_Hours,Earliest_Drop_Off_Time,Classroom_Languages,Bilingual_Services,Refugee_Services,Title_1_Eligible,PreSchool_Inclusive,Preschool_Instructional,Significantly_Modified,Hard_Of_Hearing,Visual_Impairments,Transportation_Bus,Transportation_El,Transportation_Metra,School_Latitude_sp,School_Longitude_sp,Average_ACT_School,Mean_ACT,College_Enrollment_Rate_School,College_Enrollment_Rate_Mean,Graduation_Rate_School,Graduation_Rate_Mean,Overall_Rating,Rating_Status,Rating_Statement,Classification_Description,School_Year,Third_Contact_Title,Third_Contact_Name,Fourth_Contact_Title,Fourth_Contact_Name,Fifth_Contact_Title,Fifth_Contact_Name,Sixth_Contact_Title,Sixth_Contact_Name,Seventh_Contact_Title,Seventh_Contact_Name,Network,Is_GoCPS_Participant,Is_GoCPS_PreK,Is_GoCPS_Elementary,Is_GoCPS_High_School,Open_For_Enrollment_Date,Closed_For_Enrollment_Date,Short_Name_pr,Long_Name_pr,School_Type,Primary_Category_pr,Address_pr,City_pr,State_pr,Zip_pr,Phone_pr,Fax_pr,CPS_School_Profile_pr,Website_pr,Progress_Report_Year,Blue_Ribbon_Award_Year,Excelerate_
Award_Gold_Year,Spot_Light_Award_Year,Improvement_Award_Year,Excellence_Award_Year,Student_Growth_Rating,Student_Growth_Description,Growth_Reading_Grades_Tested_Pct_ES,Growth_Reading_Grades_Tested_Label_ES,Growth_Math_Grades_Tested_Pct_ES,Growth_Math_Grades_Tested_Label_ES,Student_Attainment_Rating,Student_Attainment_Description,Attainment_Reading_Pct_ES,Attainment_Reading_Lbl_ES,Attainment_Math_Pct_ES,Attainment_Math_Lbl_ES,Culture_Climate_Rating,Culture_Climate_Description,School_Survey_Student_Response_Rate_Pct,School_Survey_Student_Response_Rate_Avg_Pct,School_Survey_Teacher_Response_Rate_Pct,School_Survey_Teacher_Response_Rate_Avg_Pct,School_Survey_Parent_Response_Rate_Pct,School_Survey_Parent_Response_Rate_Avg_Pct,Healthy_School_Certification,Healthy_School_Certification_Description,Creative_School_Certification,Creative_School_Certification_Description,NWEA_Reading_Growth_Grade_3_Pct,NWEA_Reading_Growth_Grade_3_Lbl,NWEA_Reading_Growth_Grade_4_Pct,NWEA_Reading_Growth_Grade_4_Lbl,NWEA_Reading_Growth_Grade_5_Pct,NWEA_Reading_Growth_Grade_5_Lbl,NWEA_Reading_Growth_Grade_6_Pct,NWEA_Reading_Growth_Grade_6_Lbl,NWEA_Reading_Growth_Grade_7_Pct,NWEA_Reading_Growth_Grade_7_Lbl,NWEA_Reading_Growth_Grade_8_Pct,NWEA_Reading_Growth_Grade_8_Lbl,NWEA_Math_Growth_Grade_3_Pct,NWEA_Math_Growth_Grade_3_Lbl,NWEA_Math_Growth_Grade_4_Pct,NWEA_Math_Growth_Grade_4_Lbl,NWEA_Math_Growth_Grade_5_Pct,NWEA_Math_Growth_Grade_5_Lbl,NWEA_Math_Growth_Grade_6_Pct,NWEA_Math_Growth_Grade_6_Lbl,NWEA_Math_Growth_Grade_7_Pct,NWEA_Math_Growth_Grade_7_Lbl,NWEA_Math_Growth_Grade_8_Pct,NWEA_Math_Growth_Grade_8_Lbl,NWEA_Reading_Attainment_Grade_2_Pct,NWEA_Reading_Attainment_Grade_2_Lbl,NWEA_Reading_Attainment_Grade_3_Pct,NWEA_Reading_Attainment_Grade_3_Lbl,NWEA_Reading_Attainment_Grade_4_Pct,NWEA_Reading_Attainment_Grade_4_Lbl,NWEA_Reading_Attainment_Grade_5_Pct,NWEA_Reading_Attainment_Grade_5_Lbl,NWEA_Reading_Attainment_Grade_6_Pct,NWEA_Reading_Attainment_Grade_6_Lbl,NWEA_Reading_Attainment_Grade_7_Pct,
NWEA_Reading_Attainment_Grade_7_Lbl,NWEA_Reading_Attainment_Grade_8_Pct,NWEA_Reading_Attainment_Grade_8_Lbl,NWEA_Math_Attainment_Grade_2_Pct,NWEA_Math_Attainment_Grade_2_Lbl,NWEA_Math_Attainment_Grade_3_Pct,NWEA_Math_Attainment_Grade_3_Lbl,NWEA_Math_Attainment_Grade_4_Pct,NWEA_Math_Attainment_Grade_4_Lbl,NWEA_Math_Attainment_Grade_5_Pct,NWEA_Math_Attainment_Grade_5_Lbl,NWEA_Math_Attainment_Grade_6_Pct,NWEA_Math_Attainment_Grade_6_Lbl,NWEA_Math_Attainment_Grade_7_Pct,NWEA_Math_Attainment_Grade_7_Lbl,NWEA_Math_Attainment_Grade_8_Pct,NWEA_Math_Attainment_Grade_8_Lbl,School_Survey_Involved_Families,School_Survey_Supportive_Environment,School_Survey_Ambitious_Instruction,School_Survey_Effective_Leaders,School_Survey_Collaborative_Teachers,School_Survey_Safety,Suspensions_Per_100_Students_Year_1_Pct,Suspensions_Per_100_Students_Year_2_Pct,Suspensions_Per_100_Students_Avg_Pct,Misconducts_To_Suspensions_Year_1_Pct,Misconducts_To_Suspensions_Year_2_Pct,Misconducts_To_Suspensions_Avg_Pct,Average_Length_Suspension_Year_1_Pct,Average_Length_Suspension_Year_2_Pct,Average_Length_Suspension_Avg_Pct,Behavior_Discipline_Year_1,Behavior_Discipline_Year_2,School_Survey_School_Community,School_Survey_Parent_Teacher_Partnership,School_Survey_Quality_Of_Facilities,Student_Attendance_Year_1_Pct,Student_Attendance_Year_2_Pct,Student_Attendance_Avg_Pct,Teacher_Attendance_Year_1_Pct,Teacher_Attendance_Year_2_Pct,Teacher_Attendance_Avg_Pct,One_Year_Dropout_Rate_Year_1_Pct,One_Year_Dropout_Rate_Year_2_Pct,One_Year_Dropout_Rate_Avg_Pct,Other_Metrics_Year_1,Other_Metrics_Year_2,Freshmen_On_Track_School_Pct_Year_2,Freshmen_On_Track_CPS_Pct_Year_2,Freshmen_On_Track_School_Pct_Year_1,Freshmen_On_Track_CPS_Pct_Year_1,Graduation_4_Year_School_Pct_Year_2,Graduation_4_Year_CPS_Pct_Year_2,Graduation_4_Year_School_Pct_Year_1,Graduation_4_Year_CPS_Pct_Year_1,Graduation_5_Year_School_Pct_Year_2,Graduation_5_Year_CPS_Pct_Year_2,Graduation_5_Year_School_Pct_Year_1,Graduation_5_Year_CPS_Pct_Year_1,College_Enr
ollment_School_Pct_Year_2,College_Enrollment_CPS_Pct_Year_2,College_Enrollment_School_Pct_Year_1,College_Enrollment_CPS_Pct_Year_1,College_Persistence_School_Pct_Year_2,College_Persistence_CPS_Pct_Year_2,College_Persistence_School_Pct_Year_1,College_Persistence_CPS_Pct_Year_1,Progress_Toward_Graduation_Year_1,Progress_Toward_Graduation_Year_2,State_School_Report_Card_URL,Mobility_Rate_Pct,Chronic_Truancy_Pct,Empty_Progress_Report_Message,School_Survey_Rating_Description,Supportive_School_Award,Supportive_School_Award_Desc,Parent_Survey_Results_Year,School_Latitude_pr,School_Longitude_pr,PSAT_Grade_9_Score_School_Avg,PSAT_Grade_10_Score_School_Avg,SAT_Grade_11_Score_School_Avg,SAT_Grade_11_Score_CPS_Avg,Growth_PSAT_Grade_9_School_Pct,Growth_PSAT_Grade_9_School_Lbl,Growth_PSAT_Reading_Grade_10_School_Pct,Growth_PSAT_Reading_Grade_10_School_Lbl,Growth_SAT_Grade_11_School_Pct,Growth_SAT_Grade_11_School_Lbl,Attainment_PSAT_Grade_9_School_Pct,Attainment_PSAT_Grade_9_School_Lbl,Attainment_PSAT_Grade_10_School_Pct,Attainment_PSAT_Grade_10_School_Lbl,Attainment_SAT_Grade_11_School_Pct,Attainment_SAT_Grade_11_School_Lbl,Attainment_All_Grades_School_Pct,Attainment_All_Grades_School_Lbl,Growth_PSAT_Math_Grade_10_School_Pct,Growth_PSAT_Math_Grade_10_School_Lbl,Growth_SAT_Reading_Grade_11_School_Pct,Growth_SAT_Reading_Grade_11_School_Lbl,Growth_SAT_Math_Grade_11_School_Pct,Growth_SAT_Math_Grade_11_School_Lbl
507,610044,4540,24251,LOWELL,James Russell Lowell Elementary School,ES,False,True,True,True,Our vision at Lowell Elementary School and Mun...,Principal,Gladys Betty Rivera,Other,,3320 W HIRSCH ST,Chicago,Illinois,60651,7735344000.0,7735344000.0,http://cps.edu/Schools/Pages/school.aspx?Schoo...,,,,,,True,"PK,K,1,2,3,4,5,6,7,8","PK,K-8",406,396,115,149,91,299,9,0,0,0,0,7,0,0,There are 406 students enrolled at LOWELL. 97...,The largest demographic at LOWELL is Hispanic....,True,Half Day,Full Day,07:30 AM-02:30 PM,,,,,True,,True,Y,Y,True,,,"72, 82",,,41.90652,-87.710217,,,,68.2,,78.2,Level 2+,GOOD STANDING,"This school received a Level 2+ rating, which ...",Schools that have an attendance boundary. Gene...,School Year 2018-2019,,,,,,,,,,,Network 5,True,False,True,False,09/01/2004 12:00:00 AM,,LOWELL,James Russell Lowell Elementary School,Neighborhood,ES,3320 W HIRSCH ST,Chicago,Illinois,60651,7735344000.0,7735344000.0,http://cps.edu/Schools/Pages/school.aspx?Schoo...,,2018,,2018.0,,2013.0,,AVERAGE,Student Growth measures the change in standard...,71.0,71st,28.0,28th,AVERAGE,Student Attainment measures how well the schoo...,46.0,46th,31.0,31st,WELL ORGANIZED,Results are based on student and teacher respo...,79.4,81.4,95.5,79.9,< 30%,35.6,Not Achieved,Students learn better at healthy schools! This...,EXCELLING,This school is Excelling in the arts. 
It meets...,99.0,99th,34.0,34th,99.0,99th,84.0,84th,1.0,1st,84.0,84th,75.0,75th,1.0,1st,28.0,28th,66.0,66th,1.0,1st,97.0,97th,48.0,48th,48.0,48th,56.0,56th,54.0,54th,46.0,46th,29.0,29th,46.0,46th,59.0,59th,34.0,34th,24.0,24th,6.0,6th,26.0,26th,22.0,22nd,72.0,72nd,STRONG,NEUTRAL,STRONG,STRONG,STRONG,NEUTRAL,0.8,2.0,5.6,3.4,2.1,13.5,2.8 days,2.6 days,2.0 days,2018.0,2017.0,NOT ENOUGH DATA,NOT ENOUGH DATA,NOT ENOUGH DATA,94.0,95.0,93.3,94.1,94.6,95.0,,,6.4,2017.0,2018.0,,89.4,,88.7,,75.6,,74.7,,78.2,,77.5,,68.2,,59.8,,72.3,,71.9,2017.0,2018.0,http://iirc.niu.edu/School.aspx?schoolid=15016...,18.7,27.7,,This school is “Well-Organized for Improvement...,EMERGING,This school has developed an action plan to su...,2018.0,41.90652,-87.710217,,,,969.0,,,,,,,,,,,,,,,,,,,,


In [16]:
# The merged dataframe has 651 rows and 276 columns (95 + 182, minus the shared School_ID key):
sy_1819.merged_df.shape

(651, 276)

# School Count

The SchoolYear class has attributes describing the number of schools and total number of high schools each year.

In 2018-2019, there are 651 total schools in CPS, 176 of which are high schools.

In [17]:
sy_1819.total_school_count

651

In [18]:
sy_1819.total_high_school_count

176

In 2017-2018, there are 661 total schools, and 184 high schools.

In [19]:
sy_1718.total_school_count

661

In [20]:
sy_1718.total_high_school_count

184
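
As a rough illustration, counts like these can be derived directly from a merged dataframe. The sketch below uses toy data and assumes the counts are a row count and a sum over the boolean `Is_High_School` column; the actual attributes in `SchoolYear` may be computed differently.

```python
import pandas as pd

# Toy merged dataframe; values are made up for illustration.
merged_df = pd.DataFrame({
    "School_ID": [10, 11, 12, 13],
    "Is_High_School": [True, False, True, False],
})

# One row per school, so the row count is the school count.
total_school_count = len(merged_df)

# Summing a boolean column counts the True values.
total_high_school_count = int(merged_df["Is_High_School"].sum())

print(total_school_count, total_high_school_count)
```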

Below is a list of school IDs that appear in the 2017-18 school year but not in the 2018-19 school year.

In [21]:
# Build the 2018-19 ID set once, then keep the 2017-18 IDs that are missing from it
ids_1819 = set(sy_1819.merged_df['School_ID'])
schools_not_in_1819 = [id_ for id_ in sy_1718.merged_df['School_ID'] if id_ not in ids_1819]
schools_not_in_1819

[610580,
 610506,
 610566,
 400087,
 610591,
 610567,
 400078,
 610581,
 400045,
 400102]
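
The same comparison can be written with `Series.isin`, which also makes it easy to check the opposite direction (schools that are new in the later year). This is a sketch on made-up IDs, not the notebook's data.

```python
import pandas as pd

# Toy ID columns for two consecutive school years (hypothetical values).
ids_1718 = pd.Series([1, 2, 3, 4], name="School_ID")
ids_1819 = pd.Series([3, 4, 5], name="School_ID")

# IDs present in the earlier year but missing from the later one.
dropped = ids_1718[~ids_1718.isin(ids_1819)].tolist()

# IDs that only appear in the later year.
added = ids_1819[~ids_1819.isin(ids_1718)].tolist()

print(dropped, added)
```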

# Preprocessing (which will not cause leakage with train-test-split)

Several preprocessing steps make analysis, visualization, and modeling easier. The steps below are meant to be performed prior to a train-test split or cross-validation. They do not cause data leakage, because each one transforms a row using only that row's own values.

The preprocessing methods are as follows:

  - **convert_is_high_school_to_bool**: some School Profiles encode `Is_High_School` as a boolean, others encode it as Y/N. This method ensures all values are booleans.
  - **make_percent_low_income**: creates a `perc_low_income` column by dividing `Student_Count_Low_Income` by `Student_Count_Total`.
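
A minimal sketch of what these two steps might look like as standalone functions on toy data. The function bodies are assumptions, not the code in `src`. In particular, setting the ratio to 0.0 when `Student_Count_Total` is 0 is a choice made here; it matches the 0.0 values shown for zero-enrollment schools later in this notebook.

```python
import pandas as pd

# Toy data with the Y/N encoding used in some School Profile files.
df = pd.DataFrame({
    "Is_High_School": ["Y", "N", "Y"],
    "Student_Count_Total": [200, 400, 0],
    "Student_Count_Low_Income": [150, 100, 0],
})


def convert_is_high_school_to_bool(df):
    """Return a copy where Is_High_School is boolean."""
    out = df.copy()
    # Leave the column alone if it is already boolean.
    if not pd.api.types.is_bool_dtype(out["Is_High_School"]):
        out["Is_High_School"] = out["Is_High_School"] == "Y"
    return out


def make_percent_low_income(df):
    """Return a copy with a perc_low_income ratio column."""
    out = df.copy()
    # Mask zero totals to avoid dividing by zero, then fill those rows with 0.0.
    total = out["Student_Count_Total"].where(out["Student_Count_Total"] > 0)
    out["perc_low_income"] = (out["Student_Count_Low_Income"] / total).fillna(0.0)
    return out


processed = make_percent_low_income(convert_is_high_school_to_bool(df))
print(processed[["Is_High_School", "perc_low_income"]])
```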

In [22]:
# Ensure Is_High_School is a boolean
sy_1819.convert_is_high_school_to_bool()['Is_High_School'][:5]

0     True
1    False
2     True
3    False
4    False
Name: Is_High_School, dtype: bool

In [23]:
# Create perc_low_income column

sy_1819.make_percent_low_income()['perc_low_income'][:5]

0    0.654028
1    0.082397
2    0.974026
3    0.783133
4    0.266667
Name: perc_low_income, dtype: float64

# Filtering

Various filtering decisions will be crucial to analysis and model building. For example, when modeling graduation rates, only high schools with recorded graduation rates should be included in the data set. For the same graduation rate problem, one may also want to filter out Options schools. Options schools serve special populations and may have different missions than non-Options schools. Their graduation rates, for example, can be near zero.
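
One way to filter out Options schools is sketched below on toy data. It assumes that Options schools can be identified by the value `Options` in the `Network` column, as in the sample record for YCCS - VIRTUAL shown later; whether that covers every Options school is not verified here.

```python
import pandas as pd

# Toy data; the graduation rates are made up for illustration.
df = pd.DataFrame({
    "School_ID": [1, 2, 3],
    "Network": ["Network 5", "Options", "ISP"],
    "Graduation_Rate_School": [81.0, 4.5, 92.3],
})

# Keep every row whose Network is not "Options".
# Rows with a missing Network are kept, since NaN != "Options" is True.
non_options = df[df["Network"] != "Options"]

print(non_options["School_ID"].tolist())
```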

In [24]:
# Make another copy of the 1819 School Year to compare changes after filtering
sy_1819_unaltered = SchoolYear(path_to_sp_1819, path_to_pr_1819)

## Isolating High Schools

For modeling graduation rates, a first step is to remove all schools other than high schools from the dataset.
The `isolate_high_schools` method does that. 

In [25]:
sy_1819.isolate_high_schools()
sy_1819.merged_df.sample(5)['Is_High_School']

341    True
530    True
286    True
487    True
117    True
Name: Is_High_School, dtype: bool

# Drop No Graduation Rate Schools

In [6]:
sy_1819.drop_no_gr_schools()
sy_1819.merged_df['Graduation_Rate_School'].isna().sum()

0

# Drop Schools which don't have recorded students

Two school records in the original dataframe have a student count of 0. Neither has a recorded graduation rate, so dropping schools with no graduation rate also removes them.

In [26]:
sy_1819.merged_df[sy_1819.merged_df['Student_Count_Total'] == 0]

Unnamed: 0,School_ID,Legacy_Unit_ID,Finance_ID,Short_Name_sp,Long_Name_sp,Primary_Category_sp,Is_High_School,Is_Middle_School,Is_Elementary_School,Is_Pre_School,Summary,Administrator_Title,Administrator,Secondary_Contact_Title,Secondary_Contact,Address_sp,City_sp,State_sp,Zip_sp,Phone_sp,Fax_sp,CPS_School_Profile_sp,Website_sp,Facebook,Twitter,Youtube,Pinterest,Attendance_Boundaries,Grades_Offered_All,Grades_Offered,Student_Count_Total,Student_Count_Low_Income,Student_Count_Special_Ed,Student_Count_English_Learners,Student_Count_Black,Student_Count_Hispanic,Student_Count_White,Student_Count_Asian,Student_Count_Native_American,Student_Count_Other_Ethnicity,Student_Count_Asian_Pacific_Islander,Student_Count_Multi,Student_Count_Hawaiian_Pacific_Islander,Student_Count_Ethnicity_Not_Available,Statistics_Description,Demographic_Description,Dress_Code,PreK_School_Day,Kindergarten_School_Day,School_Hours,Freshman_Start_End_Time,After_School_Hours,Earliest_Drop_Off_Time,Classroom_Languages,Bilingual_Services,Refugee_Services,Title_1_Eligible,PreSchool_Inclusive,Preschool_Instructional,Significantly_Modified,Hard_Of_Hearing,Visual_Impairments,Transportation_Bus,Transportation_El,Transportation_Metra,School_Latitude_sp,School_Longitude_sp,Average_ACT_School,Mean_ACT,College_Enrollment_Rate_School,College_Enrollment_Rate_Mean,Graduation_Rate_School,Graduation_Rate_Mean,Overall_Rating,Rating_Status,Rating_Statement,Classification_Description,School_Year,Third_Contact_Title,Third_Contact_Name,Fourth_Contact_Title,Fourth_Contact_Name,Fifth_Contact_Title,Fifth_Contact_Name,Sixth_Contact_Title,Sixth_Contact_Name,Seventh_Contact_Title,Seventh_Contact_Name,Network,Is_GoCPS_Participant,Is_GoCPS_PreK,Is_GoCPS_Elementary,Is_GoCPS_High_School,Open_For_Enrollment_Date,Closed_For_Enrollment_Date,Short_Name_pr,Long_Name_pr,School_Type,Primary_Category_pr,Address_pr,City_pr,State_pr,Zip_pr,Phone_pr,Fax_pr,CPS_School_Profile_pr,Website_pr,Progress_Report_Year,Blue_Ribbon_Award_Year,Excelerate_
Award_Gold_Year,Spot_Light_Award_Year,Improvement_Award_Year,Excellence_Award_Year,Student_Growth_Rating,Student_Growth_Description,Growth_Reading_Grades_Tested_Pct_ES,Growth_Reading_Grades_Tested_Label_ES,Growth_Math_Grades_Tested_Pct_ES,Growth_Math_Grades_Tested_Label_ES,Student_Attainment_Rating,Student_Attainment_Description,Attainment_Reading_Pct_ES,Attainment_Reading_Lbl_ES,Attainment_Math_Pct_ES,Attainment_Math_Lbl_ES,Culture_Climate_Rating,Culture_Climate_Description,School_Survey_Student_Response_Rate_Pct,School_Survey_Student_Response_Rate_Avg_Pct,School_Survey_Teacher_Response_Rate_Pct,School_Survey_Teacher_Response_Rate_Avg_Pct,School_Survey_Parent_Response_Rate_Pct,School_Survey_Parent_Response_Rate_Avg_Pct,Healthy_School_Certification,Healthy_School_Certification_Description,Creative_School_Certification,Creative_School_Certification_Description,NWEA_Reading_Growth_Grade_3_Pct,NWEA_Reading_Growth_Grade_3_Lbl,NWEA_Reading_Growth_Grade_4_Pct,NWEA_Reading_Growth_Grade_4_Lbl,NWEA_Reading_Growth_Grade_5_Pct,NWEA_Reading_Growth_Grade_5_Lbl,NWEA_Reading_Growth_Grade_6_Pct,NWEA_Reading_Growth_Grade_6_Lbl,NWEA_Reading_Growth_Grade_7_Pct,NWEA_Reading_Growth_Grade_7_Lbl,NWEA_Reading_Growth_Grade_8_Pct,NWEA_Reading_Growth_Grade_8_Lbl,NWEA_Math_Growth_Grade_3_Pct,NWEA_Math_Growth_Grade_3_Lbl,NWEA_Math_Growth_Grade_4_Pct,NWEA_Math_Growth_Grade_4_Lbl,NWEA_Math_Growth_Grade_5_Pct,NWEA_Math_Growth_Grade_5_Lbl,NWEA_Math_Growth_Grade_6_Pct,NWEA_Math_Growth_Grade_6_Lbl,NWEA_Math_Growth_Grade_7_Pct,NWEA_Math_Growth_Grade_7_Lbl,NWEA_Math_Growth_Grade_8_Pct,NWEA_Math_Growth_Grade_8_Lbl,NWEA_Reading_Attainment_Grade_2_Pct,NWEA_Reading_Attainment_Grade_2_Lbl,NWEA_Reading_Attainment_Grade_3_Pct,NWEA_Reading_Attainment_Grade_3_Lbl,NWEA_Reading_Attainment_Grade_4_Pct,NWEA_Reading_Attainment_Grade_4_Lbl,NWEA_Reading_Attainment_Grade_5_Pct,NWEA_Reading_Attainment_Grade_5_Lbl,NWEA_Reading_Attainment_Grade_6_Pct,NWEA_Reading_Attainment_Grade_6_Lbl,NWEA_Reading_Attainment_Grade_7_Pct,
NWEA_Reading_Attainment_Grade_7_Lbl,NWEA_Reading_Attainment_Grade_8_Pct,NWEA_Reading_Attainment_Grade_8_Lbl,NWEA_Math_Attainment_Grade_2_Pct,NWEA_Math_Attainment_Grade_2_Lbl,NWEA_Math_Attainment_Grade_3_Pct,NWEA_Math_Attainment_Grade_3_Lbl,NWEA_Math_Attainment_Grade_4_Pct,NWEA_Math_Attainment_Grade_4_Lbl,NWEA_Math_Attainment_Grade_5_Pct,NWEA_Math_Attainment_Grade_5_Lbl,NWEA_Math_Attainment_Grade_6_Pct,NWEA_Math_Attainment_Grade_6_Lbl,NWEA_Math_Attainment_Grade_7_Pct,NWEA_Math_Attainment_Grade_7_Lbl,NWEA_Math_Attainment_Grade_8_Pct,NWEA_Math_Attainment_Grade_8_Lbl,School_Survey_Involved_Families,School_Survey_Supportive_Environment,School_Survey_Ambitious_Instruction,School_Survey_Effective_Leaders,School_Survey_Collaborative_Teachers,School_Survey_Safety,Suspensions_Per_100_Students_Year_1_Pct,Suspensions_Per_100_Students_Year_2_Pct,Suspensions_Per_100_Students_Avg_Pct,Misconducts_To_Suspensions_Year_1_Pct,Misconducts_To_Suspensions_Year_2_Pct,Misconducts_To_Suspensions_Avg_Pct,Average_Length_Suspension_Year_1_Pct,Average_Length_Suspension_Year_2_Pct,Average_Length_Suspension_Avg_Pct,Behavior_Discipline_Year_1,Behavior_Discipline_Year_2,School_Survey_School_Community,School_Survey_Parent_Teacher_Partnership,School_Survey_Quality_Of_Facilities,Student_Attendance_Year_1_Pct,Student_Attendance_Year_2_Pct,Student_Attendance_Avg_Pct,Teacher_Attendance_Year_1_Pct,Teacher_Attendance_Year_2_Pct,Teacher_Attendance_Avg_Pct,One_Year_Dropout_Rate_Year_1_Pct,One_Year_Dropout_Rate_Year_2_Pct,One_Year_Dropout_Rate_Avg_Pct,Other_Metrics_Year_1,Other_Metrics_Year_2,Freshmen_On_Track_School_Pct_Year_2,Freshmen_On_Track_CPS_Pct_Year_2,Freshmen_On_Track_School_Pct_Year_1,Freshmen_On_Track_CPS_Pct_Year_1,Graduation_4_Year_School_Pct_Year_2,Graduation_4_Year_CPS_Pct_Year_2,Graduation_4_Year_School_Pct_Year_1,Graduation_4_Year_CPS_Pct_Year_1,Graduation_5_Year_School_Pct_Year_2,Graduation_5_Year_CPS_Pct_Year_2,Graduation_5_Year_School_Pct_Year_1,Graduation_5_Year_CPS_Pct_Year_1,College_Enr
ollment_School_Pct_Year_2,College_Enrollment_CPS_Pct_Year_2,College_Enrollment_School_Pct_Year_1,College_Enrollment_CPS_Pct_Year_1,College_Persistence_School_Pct_Year_2,College_Persistence_CPS_Pct_Year_2,College_Persistence_School_Pct_Year_1,College_Persistence_CPS_Pct_Year_1,Progress_Toward_Graduation_Year_1,Progress_Toward_Graduation_Year_2,State_School_Report_Card_URL,Mobility_Rate_Pct,Chronic_Truancy_Pct,Empty_Progress_Report_Message,School_Survey_Rating_Description,Supportive_School_Award,Supportive_School_Award_Desc,Parent_Survey_Results_Year,School_Latitude_pr,School_Longitude_pr,PSAT_Grade_9_Score_School_Avg,PSAT_Grade_10_Score_School_Avg,SAT_Grade_11_Score_School_Avg,SAT_Grade_11_Score_CPS_Avg,Growth_PSAT_Grade_9_School_Pct,Growth_PSAT_Grade_9_School_Lbl,Growth_PSAT_Reading_Grade_10_School_Pct,Growth_PSAT_Reading_Grade_10_School_Lbl,Growth_SAT_Grade_11_School_Pct,Growth_SAT_Grade_11_School_Lbl,Attainment_PSAT_Grade_9_School_Pct,Attainment_PSAT_Grade_9_School_Lbl,Attainment_PSAT_Grade_10_School_Pct,Attainment_PSAT_Grade_10_School_Lbl,Attainment_SAT_Grade_11_School_Pct,Attainment_SAT_Grade_11_School_Lbl,Attainment_All_Grades_School_Pct,Attainment_All_Grades_School_Lbl,Growth_PSAT_Math_Grade_10_School_Pct,Growth_PSAT_Math_Grade_10_School_Lbl,Growth_SAT_Reading_Grade_11_School_Pct,Growth_SAT_Reading_Grade_11_School_Lbl,Growth_SAT_Math_Grade_11_School_Pct,Growth_SAT_Math_Grade_11_School_Lbl,perc_low_income
15,610592,9692,0,ENGLEWOOD STEM HS,Englewood STEM High School,HS,True,False,False,False,The New Englewood STEM High School will open i...,Principal,,,,6835 S NORMAL,Chicago,Illinois,60621,,,http://cps.edu/Schools/Pages/school.aspx?Schoo...,,,,,,True,9,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,There is not any Demographic information for t...,False,,,,,,,,,,,,,,,,,,,41.770251,-87.639061,,,,59.8,,77.5,,,,Schools that have an attendance boundary. Gene...,School Year 2018-2019,,,,,,,,,,,,True,False,,True,07/01/2019 12:00:00 AM,,ENGLEWOOD STEM HS,Englewood STEM High School,Neighborhood,HS,6835 S NORMAL,Chicago,Illinois,60621,7735354000.0,7735354000.0,http://cps.edu/Schools/Pages/school.aspx?Schoo...,https://englewoodstemhs.cps.edu,2018,,,,,,,,,,,,,,,,,,,,,,,,,35.6,Not Achieved,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,This school does not have enough data to displ...,,,,,41.770251,-87.639061,,,,,,,,,,,,,,,,,,,,,,,,,0.0
582,400142,9059,66626,YCCS - VIRTUAL,YCCS-Virtual HS,HS,True,False,False,False,,Director,Ms.Mary Bradley,,,1900 W VAN BUREN ST,Chicago,Illinois,60612,3124290000.0,3122436000.0,http://cps.edu/Schools/Pages/school.aspx?Schoo...,https://cps.edu/yccsvirtualhs,,,,,False,9101112,9-12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,There is not any Demographic information for t...,False,,,,,,,,,,,,,,,,,,,41.876317,-87.674138,,,,68.2,,78.2,Inability to Rate,NOT APPLICABLE,This school did not have enough data to receiv...,"Schools that are open to all Chicago children,...",School Year 2018-2019,,,,,,,,,,,Options,False,False,False,False,07/01/2012 12:00:00 AM,,YCCS - VIRTUAL,YCCS-Virtual HS,Charter,HS,1900 W VAN BUREN ST,Chicago,Illinois,60612,3124290000.0,3122436000.0,http://cps.edu/Schools/Pages/school.aspx?Schoo...,https://cps.edu/yccsvirtualhs,2018,,,,,,NO DATA AVAILABLE,Student Growth measures the change in standard...,,,,,NO DATA AVAILABLE,Student Attainment measures how well the schoo...,,,,,NOT ENOUGH DATA,Results are based on student and teacher respo...,,81.4,,79.9,< 30%,35.6,Not Achieved,Students learn better at healthy schools! This...,INCOMPLETE DATA,This school has an arts designation of Incompl...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NOT ENOUGH DATA,NOT ENOUGH DATA,NOT ENOUGH DATA,NOT ENOUGH DATA,NOT ENOUGH DATA,NOT ENOUGH DATA,,,5.6,,,13.5,,,2.0 days,2018.0,2017.0,NOT ENOUGH DATA,NOT ENOUGH DATA,NOT ENOUGH DATA,,,93.3,,,95.0,,,6.4,2017.0,2018.0,,89.4,,88.7,,75.6,,74.7,,78.2,,77.5,,68.2,,59.8,,72.3,41.7,71.9,2017.0,2018.0,http://iirc.niu.edu/School.aspx?schoolid=15016...,,,A School Progress Report customized for CPS Op...,This school does not have enough data for a re...,NOT RATED,This school has not submitted an action plan t...,2018.0,41.876317,-87.674138,,,,969.0,,,,,,,,,,,,,,,,,,,,,0.0


In [27]:
sy_1819.drop_no_student_schools()
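
The three filtering steps in this section can be sketched as plain pandas operations on toy data. The real methods modify `merged_df` on the `SchoolYear` object; the standalone version below is only an assumption about the logic they apply.

```python
import pandas as pd

# Toy merged data: one elementary school, two high schools without students,
# one of which also lacks a graduation rate.
df = pd.DataFrame({
    "School_ID": [1, 2, 3, 4, 5],
    "Is_High_School": [True, True, False, True, True],
    "Graduation_Rate_School": [80.0, None, 70.0, 65.0, 90.0],
    "Student_Count_Total": [300, 0, 500, 0, 120],
})

# 1. Keep only high schools.
high_schools = df[df["Is_High_School"]]

# 2. Drop schools with no recorded graduation rate.
with_gr = high_schools.dropna(subset=["Graduation_Rate_School"])

# 3. Drop schools with no recorded students.
with_students = with_gr[with_gr["Student_Count_Total"] > 0]

print(len(df), len(high_schools), len(with_gr), len(with_students))
```

Printing the row count after each step makes it easy to see how many schools each filter removes.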

# Isolate Important Columns



The preprocessing function `isolate_important_columns` reduces the number of columns in the datasets from 92 to 20.

In [None]:
from src.preprocessing.preprocessing import isolate_important_columns

df_dict = {year: isolate_important_columns(df_dict[year]) for year in df_dict}
df_dict['2017-2018']

After this reduction, the following columns are left:

  - School_ID
  - Graduation_Rate_School
  - Student_Count_Total
  - Student_Count_Low_Income
  - Student_Count_Special_Ed
  - Student_Count_English_Learners
  - 10 Columns Counting Populations of Different Ethnicities
  - **Is_High_School**
  - **Dress_Code**
  - Classroom_Languages
  - Transportation_El
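
A column reduction of this kind can be sketched as a simple selection. The ten ethnicity column names below are inferred from the `Student_Count_*` headers in the School Profile output above, and the function body is hypothetical rather than the code in `src`.

```python
import pandas as pd

# Ethnicity count columns, inferred from the School Profile header (assumption).
ETHNICITY_COLUMNS = [
    "Student_Count_Black",
    "Student_Count_Hispanic",
    "Student_Count_White",
    "Student_Count_Asian",
    "Student_Count_Native_American",
    "Student_Count_Other_Ethnicity",
    "Student_Count_Asian_Pacific_Islander",
    "Student_Count_Multi",
    "Student_Count_Hawaiian_Pacific_Islander",
    "Student_Count_Ethnicity_Not_Available",
]

IMPORTANT_COLUMNS = [
    "School_ID",
    "Graduation_Rate_School",
    "Student_Count_Total",
    "Student_Count_Low_Income",
    "Student_Count_Special_Ed",
    "Student_Count_English_Learners",
    *ETHNICITY_COLUMNS,
    "Is_High_School",
    "Dress_Code",
    "Classroom_Languages",
    "Transportation_El",
]


def isolate_important_columns(df):
    """Return a copy of df restricted to the columns listed above."""
    return df[IMPORTANT_COLUMNS].copy()


# Toy frame holding every important column plus one that should be dropped.
toy = pd.DataFrame({col: [0] for col in IMPORTANT_COLUMNS})
toy["Summary"] = ["free text that is not needed for modeling"]

reduced = isolate_important_columns(toy)
print(reduced.shape)
```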
  
The bolded columns require preprocessing, which is shown below.

# Is_High_School

The school profiles for 2016-2017 and 2017-2018 encode `Is_High_School` as 'Y/N', whereas 2018-2019 encodes it as 'True/False'.  

The function below converts Y/N to True/False to ensure consistency.

In [None]:
from src.preprocessing.preprocessing import convert_is_high_school_to_bool

df_dict = {year: convert_is_high_school_to_bool(df_dict[year]) for year in df_dict}
df_dict['2016-2017']['Is_High_School']

# Dress_Code

The same conversion is applied to the `Dress_Code` column.

In [None]:
from src.preprocessing.preprocessing import convert_dress_code_to_bool

df_dict = {year: convert_dress_code_to_bool(df_dict[year]) for year in df_dict}
df_dict['2016-2017']['Dress_Code']

In [None]:
# Add Year column to dataframes

In [None]:
df_dict['2018-2019']

In [None]:
# Interesting: Primary_Category could be turned into a Primary_Is_High_School feature.
# This would signal whether a school is specifically a high school.
df_hs['2018-2019']['Primary_Category']