The school year profile CSVs were downloaded from the [Chicago Data Portal](https://data.cityofchicago.org/).

# School Year Profiles

The first data source is the set of school year profiles:

  - [2016-2017 Profile](https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/8i6r-et8s)
  - [2017-2018 Profile](https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/w4qj-h7bg)
  - [2018-2019 Profile](https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/kh4r-387c)

Files should be downloaded and placed in the `data/chicago_data_portal_csv_files` folder.

The CSV files differ slightly from year to year, so they need a few quick preprocessing steps. These steps are packaged in the `src/preprocessing` folder.

In [25]:
# Add the project root to sys.path so the `src` package can be imported.

import os, sys

# Build the absolute path to the project root folder
full_path = os.getcwd()
home_folder = 'CPS_GradRate_Analysis'
root = full_path.split(home_folder)[0] + home_folder + '/'
sys.path.append(root)

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [26]:
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
from src.preprocessing.preprocessing import years, paths
from src.preprocessing.preprocessing import create_sp_path_dictionary, import_multiple_sy_profiles, isolate_high_schools

In [27]:
# Load available csvs into a dictionary of dataframes
sp_paths = create_sp_path_dictionary(years[:-1], paths)
df_dict = import_multiple_sy_profiles(sp_paths)

In [28]:
len(df_dict['2017-2018'].columns)

92
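
The source of `import_multiple_sy_profiles` is not shown here, so the following is only a minimal sketch of the general idea: read one CSV per school year into a dictionary of DataFrames keyed by year. The helper name `load_profiles` and the in-memory sample rows are hypothetical stand-ins for the downloaded files.

```python
import io
import pandas as pd

def load_profiles(sources):
    """Read one CSV per school year into a dict of DataFrames keyed by year."""
    return {year: pd.read_csv(src) for year, src in sources.items()}

# Tiny in-memory stand-ins for the downloaded files (made-up rows).
sources = {
    "2016-2017": io.StringIO("School_ID,Is_High_School\n1,Y\n2,N\n"),
    "2017-2018": io.StringIO("School_ID,Is_High_School,Dress_Code\n1,Y,N\n"),
}
profiles = load_profiles(sources)

# Column counts per year make differences between the files easy to spot.
column_counts = {year: len(df.columns) for year, df in profiles.items()}
print(column_counts)
```

With real files, the dictionary values would be file paths instead of `io.StringIO` objects; `pd.read_csv` accepts both.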

# Isolate Important Columns



The preprocessing function `isolate_important_columns` reduces the number of columns in the datasets from 92 to 20.

In [37]:
from src.preprocessing.preprocessing import isolate_important_columns

df_dict = {year: isolate_important_columns(df_dict[year]) for year in df_dict}
df_dict['2017-2018']

Unnamed: 0,School_ID,Graduation_Rate_School,Student_Count_Total,Student_Count_Low_Income,Student_Count_Special_Ed,Student_Count_English_Learners,Student_Count_Black,Student_Count_Hispanic,Student_Count_White,Student_Count_Asian,Student_Count_Native_American,Student_Count_Other_Ethnicity,Student_Count_Asian_Pacific_Islander,Student_Count_Multi,Student_Count_Hawaiian_Pacific_Islander,Student_Count_Ethnicity_Not_Available,Is_High_School,Dress_Code,Classroom_Languages,Transportation_El
0,610521,,237,227,45,10,216,20,0,0,0,0,0,1,0,0,False,Y,,
1,609750,23.1,34,25,3,3,25,9,0,0,0,0,0,0,0,0,True,N,Spanish,Pink
2,610386,,94,58,16,13,29,62,2,0,0,0,0,1,0,0,True,Y,"French, Spanish",
3,400123,,172,77,33,1,165,6,0,0,1,0,0,0,0,0,True,Y,,Green
4,400116,,333,191,48,2,323,9,1,0,0,0,0,0,0,0,False,N,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
656,610030,,257,133,27,5,241,8,1,0,2,0,0,4,1,0,False,Y,,
657,610197,,473,267,81,196,34,386,34,2,3,0,0,13,1,0,False,Y,Spanish,Orange
658,610084,,227,56,2,5,81,25,52,60,0,0,0,6,2,1,False,N,Spanish,Orange
659,609711,55.9,87,73,23,1,84,3,0,0,0,0,0,0,0,0,True,Y,Spanish,Blue
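
A column reduction of this kind could be sketched as follows. This is an illustration under assumptions, not the actual body of `isolate_important_columns`: the helper `keep_columns`, the `BASE_COLUMNS` list, and the sample columns `Long_Name` and `Website` are all hypothetical.

```python
import pandas as pd

# Hypothetical list of non-count columns to keep.
BASE_COLUMNS = ["School_ID", "Graduation_Rate_School", "Is_High_School", "Dress_Code"]

def keep_columns(df, base_columns=BASE_COLUMNS, prefix="Student_Count_"):
    """Return a copy with only the base columns and those starting with `prefix`."""
    wanted = [c for c in df.columns if c in base_columns or c.startswith(prefix)]
    return df[wanted].copy()

raw = pd.DataFrame({
    "School_ID": [1, 2],
    "Long_Name": ["A", "B"],            # made-up column, dropped
    "Student_Count_Total": [100, 200],
    "Student_Count_Black": [40, 50],
    "Website": ["x", "y"],              # made-up column, dropped
    "Is_High_School": ["Y", "N"],
})
reduced = keep_columns(raw)
print(list(reduced.columns))
```

Matching the `Student_Count_` prefix avoids listing every count column by name, and base columns that are missing from a given year are simply skipped.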


After this reduction, the following columns are left:

  - School_ID
  - Graduation_Rate_School
  - Student_Count_Total
  - Student_Count_Low_Income
  - Student_Count_Special_Ed
  - Student_Count_English_Learners
  - 10 Columns Counting Populations of Different Ethnicities
  - **Is_High_School**
  - **Dress_Code**
  - Classroom_Languages
  - Transportation_El
  
The bolded columns require preprocessing, which is shown below.

# Is_High_School

The school profiles for 2016-2017 and 2017-2018 encode `Is_High_School` as `Y`/`N`, whereas 2018-2019 encodes it as `True`/`False`.

The function below converts Y/N to True/False to ensure consistency.

In [33]:
from src.preprocessing.preprocessing import convert_is_high_school_to_bool

df_dict = {year: convert_is_high_school_to_bool(df_dict[year]) for year in df_dict}
df_dict['2016-2017']['Is_High_School']

0      False
1       True
2       True
3       True
4      False
       ...  
656    False
657     True
658     True
659     True
660    False
Name: Is_High_School, Length: 661, dtype: bool
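
The Y/N to True/False conversion could look roughly like the sketch below. The helper `yn_to_bool` is hypothetical and is not the code of `convert_is_high_school_to_bool`; it assumes the column holds either `'Y'`/`'N'` strings or values that are already boolean.

```python
import pandas as pd

def yn_to_bool(series):
    """Convert a 'Y'/'N' column to booleans; pass boolean columns through."""
    if pd.api.types.is_bool_dtype(series):
        return series
    converted = series.map({"Y": True, "N": False})
    # Anything other than 'Y' or 'N' maps to NaN, so fail loudly instead of guessing.
    if converted.isna().any():
        raise ValueError("unexpected or missing value in Y/N column")
    return converted.astype(bool)

old_style = pd.Series(["Y", "N", "Y"], name="Is_High_School")
new_style = pd.Series([True, False, True], name="Is_High_School")

print(yn_to_bool(old_style).tolist())  # [True, False, True]
print(yn_to_bool(old_style).equals(yn_to_bool(new_style)))
```

Raising on unmapped values is a design choice for this sketch: casting a `NaN` to `bool` would silently produce `True`.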

# Dress_Code

The same conversion is applied to the `Dress_Code` column.

In [41]:
from src.preprocessing.preprocessing import convert_dress_code_to_bool

df_dict = {year: convert_dress_code_to_bool(df_dict[year]) for year in df_dict}
df_dict['2016-2017']['Dress_Code']

0      False
1       True
2      False
3       True
4       True
       ...  
656     True
657    False
658     True
659     True
660     True
Name: Dress_Code, Length: 661, dtype: bool

In [77]:
df_hs = {year: isolate_high_schools(df_dict[year]) for year in df_dict}

0      HS
2      HS
5      HS
9      HS
13     HS
       ..
651    HS
652    HS
655    HS
656    HS
658    HS
Name: Primary_Category, Length: 183, dtype: object
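
Row filtering of this kind is usually a boolean mask. The sketch below assumes the filter uses the boolean `Is_High_School` flag; the real `isolate_high_schools` may rely on a different column, and the helper `keep_high_schools` and its sample data are hypothetical.

```python
import pandas as pd

def keep_high_schools(df, flag="Is_High_School"):
    """Return only the rows whose boolean flag column is True."""
    return df.loc[df[flag]].copy()

schools = pd.DataFrame({
    "School_ID": [1, 2, 3, 4],
    "Is_High_School": [False, True, True, False],
})
high_schools = keep_high_schools(schools)
print(high_schools.index.tolist())  # [1, 2]
```

A mask keeps the original index labels, which is consistent with the non-contiguous index in the output above. Calling `reset_index(drop=True)` afterwards would renumber the rows if needed.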

In [None]:
# Interesting: Primary_Category would be a good feature to convert into Primary_Is_High_School.
# This would signal whether a school is specifically a high school.
df_hs['2018-2019']['Primary_Category']
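
The feature suggested in the comment could be derived with a single comparison. This is a sketch only: the `'HS'` label comes from the output shown earlier, while the other category labels in the sample data are made up for illustration.

```python
import pandas as pd

# Sample data; only 'HS' is taken from the notebook output, the rest is illustrative.
df = pd.DataFrame({"Primary_Category": ["HS", "ES", "HS", "MS"]})

# True where the school's primary category is high school.
df["Primary_Is_High_School"] = df["Primary_Category"].eq("HS")
print(df["Primary_Is_High_School"].tolist())  # [True, False, True, False]
```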