The school year profile CSV files were downloaded from the [Chicago Data Portal](https://data.cityofchicago.org/).

# School Year Profiles

The first source of data is the set of school year profiles:

  - [2016-2017 Profile](https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/8i6r-et8s)
  - [2017-2018 Profile](https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/w4qj-h7bg)
  - [2018-2019 Profile](https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/kh4r-387c)

Files should be downloaded and placed in the `data/chicago_data_portal_csv_files` folder.

There are slight differences between the CSV files that require quick preprocessing steps. These steps are packaged in the `src/preprocessing` folder.

In [1]:
# Make the project root importable so that the `src` modules resolve.

import os, sys

# Build the absolute path to the project root folder
full_path = os.getcwd()
home_folder = 'CPS_GradRate_Analysis'
root = full_path.split(home_folder)[0] + home_folder + '/'
sys.path.append(root)

%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
from src.preprocessing.preprocessing import years, paths
from src.preprocessing.preprocessing import create_sp_path_dictionary, import_multiple_sy_profiles, isolate_high_schools

2015-2016 missing


In [3]:
# Load available csvs into a dictionary of dataframes
sp_paths = create_sp_path_dictionary(years[:-1], paths)
df_dict = import_multiple_sy_profiles(sp_paths)
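
As a rough illustration of the loading step above, the sketch below reads several CSV sources into a dictionary of dataframes keyed by school year. It is not the actual `import_multiple_sy_profiles` implementation: `load_profiles`, `csv_sources`, and the tiny in-memory CSVs are hypothetical stand-ins so the example runs without the downloaded files.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the downloaded files: tiny in-memory CSVs.
csv_sources = {
    "2016-2017": io.StringIO("School_ID,Is_High_School\n1,Y\n2,N\n"),
    "2017-2018": io.StringIO("School_ID,Is_High_School\n1,Y\n"),
}


def load_profiles(sources):
    """Read each CSV source into a DataFrame, keyed by school year."""
    return {year: pd.read_csv(src) for year, src in sources.items()}


profiles = load_profiles(csv_sources)
for year, df in profiles.items():
    print(year, df.shape)
```

With real files, the dictionary values would be file paths rather than `StringIO` objects; `pd.read_csv` accepts both.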

In [4]:
len(df_dict['2017-2018'].columns)

92

# Isolate Important Columns



The preprocessing function `isolate_important_columns` reduces the number of columns in each dataset from 92 to 21.

In [5]:
from src.preprocessing.preprocessing import isolate_important_columns

df_dict = {year: isolate_important_columns(df_dict[year]) for year in df_dict}
df_dict['2017-2018']

Unnamed: 0,School_ID,Short_Name,Graduation_Rate_School,Student_Count_Total,Student_Count_Low_Income,Student_Count_Special_Ed,Student_Count_English_Learners,Student_Count_Black,Student_Count_Hispanic,Student_Count_White,Student_Count_Asian,Student_Count_Native_American,Student_Count_Other_Ethnicity,Student_Count_Asian_Pacific_Islander,Student_Count_Multi,Student_Count_Hawaiian_Pacific_Islander,Student_Count_Ethnicity_Not_Available,Is_High_School,Dress_Code,Classroom_Languages,Transportation_El
0,610521,DAVIS M,,237,227,45,10,216,20,0,0,0,0,0,1,0,0,N,Y,,
1,609750,SIMPSON HS,23.1,34,25,3,3,25,9,0,0,0,0,0,0,0,0,Y,N,Spanish,Pink
2,610386,PEACE AND EDUCATION HS,,94,58,16,13,29,62,2,0,0,0,0,1,0,0,Y,Y,"French, Spanish",
3,400123,YCCS - SCHOLASTIC ACHIEVEMENT,,172,77,33,1,165,6,0,0,1,0,0,0,0,0,Y,Y,,Green
4,400116,MONTESSORI ENGLEWOOD,,333,191,48,2,323,9,1,0,0,0,0,0,0,0,N,N,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
656,610030,KOZMINSKI,,257,133,27,5,241,8,1,0,2,0,0,4,1,0,N,Y,,
657,610197,TALCOTT,,473,267,81,196,34,386,34,2,3,0,0,13,1,0,N,Y,Spanish,Orange
658,610084,KELLER,,227,56,2,5,81,25,52,60,0,0,0,6,2,1,N,N,Spanish,Orange
659,609711,HARPER HS,55.9,87,73,23,1,84,3,0,0,0,0,0,0,0,0,Y,Y,Spanish,Blue
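
A column reduction of this kind can be sketched as a simple selection by name. The helper `keep_columns`, the shortened `KEEP` list, and the toy dataframe are assumptions made for illustration; the real `isolate_important_columns` may work differently.

```python
import pandas as pd

# Hypothetical, shortened list of columns to keep.
KEEP = ["School_ID", "Short_Name", "Graduation_Rate_School", "Is_High_School"]


def keep_columns(df, columns):
    """Return a copy with only the requested columns that exist in df."""
    present = [c for c in columns if c in df.columns]
    return df[present].copy()


# Toy frame with two extra columns that should be dropped.
raw = pd.DataFrame(
    {
        "School_ID": [1],
        "Short_Name": ["EXAMPLE HS"],
        "Graduation_Rate_School": [50.0],
        "Is_High_School": ["Y"],
        "Phone": ["555-0100"],
        "Website": ["example.org"],
    }
)

slim = keep_columns(raw, KEEP)
print(list(slim.columns))
```

Filtering on `c in df.columns` keeps the helper tolerant of a year whose file lacks one of the requested columns.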


After this reduction, the following columns are left:

  - School_ID
  - Short_Name
  - Graduation_Rate_School
  - Student_Count_Total
  - Student_Count_Low_Income
  - Student_Count_Special_Ed
  - Student_Count_English_Learners
  - 10 columns counting student populations by ethnicity (`Student_Count_Black` through `Student_Count_Ethnicity_Not_Available`)
  - **Is_High_School**
  - **Dress_Code**
  - Classroom_Languages
  - Transportation_El
  
The bolded columns require preprocessing, which is shown below.

# Is_High_School

The school profiles for 2016-2017 and 2017-2018 encode `Is_High_School` as `Y`/`N`, whereas 2018-2019 encodes it as `True`/`False`.

The function below converts `Y`/`N` to `True`/`False` so that the encoding is consistent across years.

In [6]:
from src.preprocessing.preprocessing import convert_is_high_school_to_bool

df_dict = {year: convert_is_high_school_to_bool(df_dict[year]) for year in df_dict}
df_dict['2016-2017']['Is_High_School']

0      False
1       True
2       True
3       True
4      False
       ...  
656    False
657     True
658     True
659     True
660    False
Name: Is_High_School, Length: 661, dtype: bool
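
One way to write such a conversion is sketched below. The helper `yn_to_bool` is hypothetical and is not the code in `src/preprocessing`; it assumes the column contains only `Y`/`N` strings or values that are already boolean, and raises on anything else instead of guessing.

```python
import pandas as pd


def yn_to_bool(series):
    """Map 'Y'/'N' strings to booleans; pass boolean columns through."""
    if series.dtype == bool:
        return series
    mapped = series.map({"Y": True, "N": False, True: True, False: False})
    if mapped.isna().any():
        # Unexpected or missing values would silently become True under
        # astype(bool), so fail loudly instead.
        raise ValueError("column contains values other than Y/N/True/False")
    return mapped.astype(bool)


older_years = pd.Series(["Y", "N", "Y"], name="Is_High_School")
newest_year = pd.Series([True, False], name="Is_High_School")

print(yn_to_bool(older_years).tolist())
print(yn_to_bool(newest_year).tolist())
```

Because boolean input is returned unchanged, the same helper can be applied to every year in the dictionary.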

# Dress_Code

The same conversion is applied to the `Dress_Code` column.

In [7]:
from src.preprocessing.preprocessing import convert_dress_code_to_bool

df_dict = {year: convert_dress_code_to_bool(df_dict[year]) for year in df_dict}
df_dict['2016-2017']['Dress_Code']

0      False
1       True
2      False
3       True
4       True
       ...  
656     True
657    False
658     True
659     True
660     True
Name: Dress_Code, Length: 661, dtype: bool

In [8]:
# Add Year column to dataframes
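
The cell above contains only a comment. A minimal sketch of tagging each dataframe with its school year and stacking the results is shown here; `add_year`, the `Year` column name, and the toy frames are assumptions made for illustration.

```python
import pandas as pd

# Toy frames standing in for the per-year profile dataframes.
frames = {
    "2016-2017": pd.DataFrame({"School_ID": [1, 2]}),
    "2017-2018": pd.DataFrame({"School_ID": [1]}),
}


def add_year(frames_by_year):
    """Return new frames, each with a Year column holding its dictionary key."""
    # DataFrame.assign returns a copy, so the input frames are not modified.
    return {year: df.assign(Year=year) for year, df in frames_by_year.items()}


tagged = add_year(frames)
combined = pd.concat(list(tagged.values()), ignore_index=True)
print(combined)
```

Keeping the year as a column lets the yearly frames be concatenated into one table without losing track of where each row came from.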

In [11]:
df_dict['2018-2019']

Unnamed: 0,School_ID,Short_Name,Graduation_Rate_School,Student_Count_Total,Student_Count_Low_Income,Student_Count_Special_Ed,Student_Count_English_Learners,Student_Count_Black,Student_Count_Hispanic,Student_Count_White,Student_Count_Asian,Student_Count_Native_American,Student_Count_Other_Ethnicity,Student_Count_Asian_Pacific_Islander,Student_Count_Multi,Student_Count_Hawaiian_Pacific_Islander,Student_Count_Ethnicity_Not_Available,Is_High_School,Dress_Code,Classroom_Languages,Transportation_El
0,400172,ASPIRA - BUSINESS & FINANCE HS,,633,414,130,195,17,597,10,4,1,0,0,4,0,0,True,True,"Spanish, Spanish for Heritage Speakers",Blue
1,609794,EDISON,,267,22,10,1,11,22,160,43,1,0,0,29,1,0,False,False,French,Brown
2,609780,MARINE LEADERSHIP AT AMES HS,,847,825,79,158,17,817,8,2,1,0,0,2,0,0,True,True,Spanish,
3,400039,ERIE,,415,325,69,197,56,342,7,1,2,0,0,6,1,0,False,True,Spanish,
4,610590,BRONZEVILLE CLASSICAL,,90,24,2,1,55,6,8,16,0,0,0,5,0,0,False,False,,"Green, Red"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
655,400022,CHIARTS HS,84.4,606,220,38,5,205,214,120,11,2,0,0,15,1,38,True,False,"French, Spanish","Blue, Red"
656,610383,SOCIAL JUSTICE HS,81.0,304,290,59,74,37,264,2,0,1,0,0,0,0,0,True,True,"French, Spanish, Spanish for Heritage Speakers",
657,610589,SOR JUANA,,92,44,5,21,6,75,6,2,1,0,0,1,0,1,False,False,,
658,400130,YCCS - YOUTH DEVELOPMENT,,96,94,27,0,93,0,0,0,2,0,0,0,0,1,True,True,,


In [9]:
# Interesting: Primary_Category would be a good feature to convert to Primary_Is_High_School.
# This would signal whether a school is specifically a high school.
df_hs['2018-2019']['Primary_Category']

NameError: name 'df_hs' is not defined
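
The feature suggested in the comment above could be derived roughly as follows. This sketch assumes that `Primary_Category` holds short codes such as `HS`, `ES`, and `MS`; the column's actual values are not shown in this notebook, so the toy frame below is hypothetical.

```python
import pandas as pd

# Toy frame; the category codes are an assumption about the real data.
schools = pd.DataFrame(
    {
        "School_ID": [1, 2, 3],
        "Primary_Category": ["HS", "ES", "MS"],
    }
)

# True only where the primary category is high school.
schools["Primary_Is_High_School"] = schools["Primary_Category"].eq("HS")
print(schools)
```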