## Notebook Summary
#### *Capstone: Data Cleaning #1*
---
This notebook covers data cleaning for the 2 main datasets for each year. Because the unique identifiers are messy, this process required heavy data cleaning.

### Datasets
---

The following datasets are included in the [`datasets`](./datasets/) folder:

* [`18-20-eoy-student-discipline.xlsx`](../Capstone/datasets/18-20-eoy-student-discipline.xlsx): school discipline data including in-school, out-of-school, and expulsion incidents
* [`18-20-Report-Card-Public-Data-Set.xlsx`](../Capstone/datasets/18-20-Report-Card-Public-Data-Set.xlsx): school academic, demographic, race, and other descriptions

# Data Cleaning & Merging of Student Discipline & School Report Card Datasets

---

In [1]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import warnings
warnings.filterwarnings('ignore')

## 1. Read & Clean Discipline Data from State of IL
---

In [2]:
# read in the data from all 3 sheets
discipline_18_raw = pd.read_excel('datasets/18-20-eoy-student-discipline.xlsx', sheet_name = '2018', header=3)
discipline_19_raw = pd.read_excel('datasets/18-20-eoy-student-discipline.xlsx', sheet_name = '2019', header=3)
discipline_20_raw = pd.read_excel('datasets/18-20-eoy-student-discipline.xlsx', sheet_name = '2020', header=3)

In [3]:
print(discipline_18_raw.shape)
print(discipline_19_raw.shape)
print(discipline_20_raw.shape)

(5756, 33)
(5389, 33)
(4967, 33)


In [4]:
discipline_18_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5756 entries, 0 to 5755
Data columns (total 33 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   District Name                              5756 non-null   object 
 1   School Name                                5755 non-null   object 
 2   ActionCode                                 5755 non-null   float64
 3   ActionDesc                                 5755 non-null   object 
 4   Total Incidents                            5756 non-null   int64  
 5   Total Students                             5756 non-null   int64  
 6   Female                                     5475 non-null   float64
 7   Male                                       5475 non-null   float64
 8   Hispanic or Latino                         4042 non-null   float64
 9   American Indian or Alaska Native           71 non-null     float64
 10  Black or African America

In [5]:
discipline_18_raw.head(2)

Unnamed: 0,District Name,School Name,ActionCode,ActionDesc,Total Incidents,Total Students,Female,Male,Hispanic or Latino,American Indian or Alaska Native,...,Dangerous Weapon: Other,Other Reason,Tobacco,Less than 1,"[1,2)","[2,3)","[3,4)","[4,10]",GREATER THAN 10,NOT REPORTED
0,A E R O Spec Educ Coop,P R I D E School,3.0,In-School Suspension,1,1,,,,,...,0,1,0,0,1,0,0,0,0,0
1,Abingdon-Avon CUSD 276,Avon Elem Sch,3.0,In-School Suspension,1,1,0.0,1.0,,,...,0,1,0,0,1,0,0,0,0,0


In [6]:
discipline_18_raw[discipline_18_raw['School Name'] == 'Zion-Benton Twnshp Hi Sch']

# some schools have multiple rows of data, one per action type (e.g., In-School Suspension, Out-of-School Suspension, Expulsion)
# using total incidents summed across all of a school's rows as the modeling target

Unnamed: 0,District Name,School Name,ActionCode,ActionDesc,Total Incidents,Total Students,Female,Male,Hispanic or Latino,American Indian or Alaska Native,...,Dangerous Weapon: Other,Other Reason,Tobacco,Less than 1,"[1,2)","[2,3)","[3,4)","[4,10]",GREATER THAN 10,NOT REPORTED
5751,Zion-Benton Twp HSD 126,Zion-Benton Twnshp Hi Sch,1.0,Expulsion - Received Educational Services,5,5,4.0,1.0,1.0,,...,0,0,0,0,5,0,0,0,0,0
5752,Zion-Benton Twp HSD 126,Zion-Benton Twnshp Hi Sch,2.0,Expulsion - Did not Receive Educational Services,1,1,0.0,1.0,0.0,,...,0,0,0,0,1,0,0,0,0,0
5753,Zion-Benton Twp HSD 126,Zion-Benton Twnshp Hi Sch,4.0,Out-of School Suspension,148,126,72.0,76.0,41.0,,...,0,0,0,0,8,3,29,105,3,0


In [7]:
discipline_19_raw.head(2)

Unnamed: 0,District Name,School Name,ActionCode,ActionDesc,Total Incidents,Total Students,Female,Male,Hispanic or Latino,American Indian or Alaska Native,...,Dangerous Weapon: Other,Other Reason,Tobacco,Less than 1,"[1,2)","[2,3)","[3,4)","[4,10]",GREATER THAN 10,NOT REPORTED
0,A E R O Spec Educ Coop,P R I D E School,3.0,In-School Suspension,1,1,,,,,...,0,1,0,0,1,0,0,0,0,0
1,A E R O Spec Educ Coop,P R I D E School,4.0,Out-of School Suspension,2,2,,,,,...,0,2,0,0,0,0,1,1,0,0


In [8]:
discipline_20_raw.head()

Unnamed: 0,District Name,School Name,ActionCode,ActionDesc,Total Incidents,Total Students,Female,Male,Hispanic or Latino,American Indian or Alaska Native,...,Dangerous Weapon: Other,Other Reason,Tobacco,Less than 1,"[1,2)","[2,3)","[3,4)","[4,10]",GREATER THAN 10,NOT REPORTED
0,Abingdon-Avon CUSD 276,Hedding Grade Sch,3.0,In-School Suspension,2,1,0.0,2.0,0.0,,...,0,2,0,0,1,0,1,0,0,0
1,Abingdon-Avon CUSD 276,Abingdon-Avon Middle Sch,3.0,In-School Suspension,21,16,5.0,16.0,,,...,0,19,1,11,7,2,1,0,0,0
2,Abingdon-Avon CUSD 276,Abingdon-Avon High Sch,3.0,In-School Suspension,20,12,10.0,10.0,0.0,,...,0,12,8,0,13,7,0,0,0,0
3,Abingdon-Avon CUSD 276,Abingdon-Avon High Sch,4.0,Out-of School Suspension,23,13,11.0,12.0,0.0,,...,0,18,0,2,6,8,4,3,0,0
4,Abingdon-Avon CUSD 276,Abingdon-Avon Middle Sch,4.0,Out-of School Suspension,8,6,1.0,7.0,,,...,0,6,0,0,2,2,3,1,0,0


#### Drop all columns that aren't needed for modeling

In [9]:
discipline_18 = discipline_18_raw[['District Name', 'School Name', 'Total Incidents', 'Total Students']].copy()
discipline_19 = discipline_19_raw[['District Name', 'School Name', 'Total Incidents', 'Total Students']].copy()
discipline_20 = discipline_20_raw[['District Name', 'School Name', 'Total Incidents', 'Total Students']].copy()

#### Check data types

In [10]:
discipline_18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5756 entries, 0 to 5755
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   District Name    5756 non-null   object
 1   School Name      5755 non-null   object
 2   Total Incidents  5756 non-null   int64 
 3   Total Students   5756 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 180.0+ KB


In [11]:
discipline_19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5389 entries, 0 to 5388
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   District Name    5389 non-null   object
 1   School Name      5388 non-null   object
 2   Total Incidents  5389 non-null   int64 
 3   Total Students   5389 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 168.5+ KB


In [12]:
discipline_20.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4967 entries, 0 to 4966
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   District Name    4967 non-null   object
 1   School Name      4966 non-null   object
 2   Total Incidents  4967 non-null   int64 
 3   Total Students   4967 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 155.3+ KB


#### Drop 'total' row from dataset

In [13]:
#Delete total row
#discipline_18.iloc[[5755]]
# discipline_19[discipline_19['District Name'] == 'Total']
# discipline_20[discipline_20['District Name'] == 'Total']

discipline_18 = discipline_18.drop(index=5755)
discipline_19 = discipline_19.drop(index=5388)
discipline_20 = discipline_20.drop(index=4966)

In [14]:
print(discipline_18.shape)
print(discipline_19.shape)
print(discipline_20.shape)

(5755, 4)
(5388, 4)
(4966, 4)
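The hard-coded indices above only work if the total row is always the last row of each sheet. As a minimal sketch of a label-based alternative, the block below filters on the `District Name` value instead. It assumes the summary row is marked `'Total'` in `District Name`, as the commented-out checks suggest; the toy frame and the helper name `drop_total_row` are hypothetical.

```python
import pandas as pd

# Toy stand-in for one discipline sheet (the real sheets have 33 columns).
toy = pd.DataFrame({
    'District Name': ['Abingdon-Avon CUSD 276', 'Zion ESD 6', 'Total'],
    'School Name': ['Avon Elem Sch', 'West Elementary School', None],
    'Total Incidents': [1, 77, 78],
})

def drop_total_row(df, label='Total'):
    """Return a copy of df without rows whose 'District Name' equals label."""
    # Assumption: the summary row is identified by this label.
    return df[df['District Name'] != label].copy()

cleaned = drop_total_row(toy)
print(cleaned.shape)
```

Filtering by label keeps working if rows are added or reordered, whereas `drop(index=...)` silently removes whatever happens to sit at that position.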


#### Checking for nulls

In [15]:
discipline_18.isna().sum()

District Name      0
School Name        0
Total Incidents    0
Total Students     0
dtype: int64

In [16]:
discipline_19.isna().sum()

District Name      0
School Name        0
Total Incidents    0
Total Students     0
dtype: int64

In [17]:
discipline_20.isna().sum()

District Name      0
School Name        0
Total Incidents    0
Total Students     0
dtype: int64

#### Adding a column for school year

In [18]:
discipline_18['school_year'] = '17-18'
discipline_19['school_year'] = '18-19'
discipline_20['school_year'] = '19-20'

#### Grouping 'Total Incidents' by school

In [19]:
discipline_18 = discipline_18.groupby(['school_year','District Name','School Name'])[['Total Incidents']].sum().reset_index()
discipline_19 = discipline_19.groupby(['school_year','District Name','School Name'])[['Total Incidents']].sum().reset_index()
discipline_20 = discipline_20.groupby(['school_year','District Name','School Name'])[['Total Incidents']].sum().reset_index()

In [20]:
print(discipline_18.shape)
print(discipline_19.shape)
print(discipline_20.shape)

(3281, 4)
(3083, 4)
(2973, 4)
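To illustrate what the grouping step does, here is a small sketch on toy data. It reuses the three Zion-Benton rows shown earlier for 2018 (5, 1, and 148 incidents) and collapses them into one row per school; the toy frame itself is an illustration, not part of the notebook's pipeline.

```python
import pandas as pd

# Three action-type rows for a single school, taken from the 2018 preview above.
toy = pd.DataFrame({
    'school_year': ['17-18'] * 3,
    'District Name': ['Zion-Benton Twp HSD 126'] * 3,
    'School Name': ['Zion-Benton Twnshp Hi Sch'] * 3,
    'Total Incidents': [5, 1, 148],
})

# Sum incidents across action types so each school appears once per year.
per_school = (
    toy.groupby(['school_year', 'District Name', 'School Name'])[['Total Incidents']]
       .sum()
       .reset_index()
)
print(per_school)
```

Note that the sum includes every action type present for the school, expulsions as well as suspensions.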


## 2. Read & Clean Report Card Public Dataset from State of IL
---

In [21]:
schoolreportcard_18_raw = pd.read_excel('datasets/18-20-Report-Card-Public-Data-Set.xlsx', sheet_name = '17-18')
schoolreportcard_19_raw = pd.read_excel('datasets/18-20-Report-Card-Public-Data-Set.xlsx', sheet_name = '18-19')
schoolreportcard_20_raw = pd.read_excel('datasets/18-20-Report-Card-Public-Data-Set.xlsx', sheet_name = '19-20')

In [22]:
print(schoolreportcard_18_raw.shape)
print(schoolreportcard_19_raw.shape)
print(schoolreportcard_20_raw.shape)

(4754, 390)
(4738, 851)
(4727, 908)


In [23]:
# filter rows where type = School (.copy() so later in-place edits don't act on a slice of the raw df)
schoolreportcard_18 = schoolreportcard_18_raw[schoolreportcard_18_raw['Type'] == 'School'].copy()
schoolreportcard_19 = schoolreportcard_19_raw[schoolreportcard_19_raw['Type'] == 'School'].copy()
schoolreportcard_20 = schoolreportcard_20_raw[schoolreportcard_20_raw['Type'] == 'School'].copy()

print(schoolreportcard_18.shape)
print(schoolreportcard_19.shape)
print(schoolreportcard_20.shape)

(3888, 390)
(3872, 851)
(3859, 908)


In [24]:
schoolreportcard_18.head(5)

Unnamed: 0,RCDTS,Type,School Name,District,City,County,District Type,District Size,School Type,Grades Served,...,Five Essential Survey Involved Families Level,Five Essential Survey Supportive Environment,Five Essential Survey Supportive Environment Level,Five Essential Survey Ambitious Instruction,Five Essential Survey Ambitious Instruction.1,Five Essential Survey Student Response Rate %,Five Essential Survey Teacher Response Rate %,Five Essential Survey Schools with over 50% Response Rate %,Five Essential Survey Student Response Rate Median,Five Essential Survey Teacher Response Rate Median
1,10010010260001,School,Seymour High School,Payson CUSD 1,Payson,Adams,UNIT,MEDIUM,HIGH SCHOOL,7 8 9 10 11 12,...,3.0,49.0,3.0,45.0,3.0,91.4,75.0,,,
2,10010010262002,School,Seymour Elementary School,Payson CUSD 1,Payson,Adams,UNIT,MEDIUM,ELEMENTARY,PK K 1 2 3 4 5 6,...,2.0,56.0,3.0,40.0,3.0,97.5,99.9,,,
4,10010020260001,School,Liberty High School,Liberty CUSD 2,Liberty,Adams,UNIT,MEDIUM,HIGH SCHOOL,7 8 9 10 11 12,...,0.0,0.0,0.0,0.0,0.0,0.0,47.1,,,
5,10010020262002,School,Liberty Elementary School,Liberty CUSD 2,Liberty,Adams,UNIT,MEDIUM,ELEMENTARY,PK K 1 2 3 4 5 6,...,0.0,0.0,0.0,0.0,0.0,0.0,48.7,,,
7,10010030260001,School,Central High School,Central CUSD 3,Camp Point,Adams,UNIT,MEDIUM,HIGH SCHOOL,9 10 11 12,...,,,,,,,,,,


#### One off changes that need to be made across multiple datasets

In [25]:
# One-off change that needs to be made in all report card datasets.
schoolreportcard_18.replace({'Northwood Jr High School': 'Northwood Middle School'}, inplace=True)
schoolreportcard_19.replace({'Northwood Jr High School': 'Northwood Middle School'}, inplace=True)
schoolreportcard_20.replace({'Northwood Jr High School': 'Northwood Middle School'}, inplace=True)

In [26]:
#renaming 1 district/school that doesn't match the other dataset
#assigning the result back instead of calling replace(..., inplace=True) on a single column
discipline_18['District Name'] = discipline_18['District Name'].replace({'ACE Amandla Charter School':'Amandla Charter School'})
discipline_18['School Name'] = discipline_18['School Name'].replace({'ACE Amandla Charter School':'Amandla Charter School'})

discipline_19['District Name'] = discipline_19['District Name'].replace({'ACE Amandla Charter School':'Amandla Charter School'})
discipline_19['School Name'] = discipline_19['School Name'].replace({'ACE Amandla Charter School':'Amandla Charter School'})

#### Adding an identifier col to the discipline & reportcard datasets

In [27]:
discipline_18['district_school'] = discipline_18['District Name'] + '_' + discipline_18['School Name']
discipline_19['district_school'] = discipline_19['District Name'] + '_' + discipline_19['School Name']
discipline_20['district_school'] = discipline_20['District Name'] + '_' + discipline_20['School Name']

In [28]:
#create a key column to be used for merge - 'district_school'
schoolreportcard_18['district_school'] = schoolreportcard_18['District'] + '_' + schoolreportcard_18['School Name']
schoolreportcard_19['district_school'] = schoolreportcard_19['District'] + '_' + schoolreportcard_19['School Name']
schoolreportcard_20['district_school'] = schoolreportcard_20['District'] + '_' + schoolreportcard_20['School Name']
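Because the key is built by plain string concatenation, stray whitespace in either name produces a different key. The sketch below shows one possible way to normalize names before building the key. This is not what the notebook does, and applying it would change which names match; the helper `make_key` and the padded toy values are hypothetical.

```python
import pandas as pd

def make_key(district, school):
    """Build a 'district_school' key after trimming and collapsing whitespace."""
    def clean(s):
        # Strip leading/trailing spaces and collapse internal runs of whitespace.
        return s.str.strip().str.replace(r'\s+', ' ', regex=True)
    return clean(district) + '_' + clean(school)

# Toy rows with deliberately messy spacing.
toy = pd.DataFrame({
    'District Name': ['A E R O  Spec Educ Coop', ' Zion ESD 6'],
    'School Name': ['P R I D E School', 'West Elementary School '],
})
toy['district_school'] = make_key(toy['District Name'], toy['School Name'])
print(toy['district_school'].tolist())
```

The same helper would need to be applied to both the discipline and report card frames for the keys to stay comparable.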

## 3. Setup DFs for Merge
---

#### Read in school codes dataset

In [31]:
#read in dataset with RCDTS codes
codes = pd.read_csv('../Capstone/datasets/IL_RCDTS_codes.csv')
codes.head(2)

Unnamed: 0,CountyName,RecType,Region-2\nCounty-3\nDistrict-4,Type,School,RCDTS,FacilityName,NCES ID
0,Adams,Dist,10010010,26,0,10010010260000,Payson CUSD 1,1730990
1,Adams,Dist,10010020,26,0,10010020260000,Liberty CUSD 2,1722770


### *2017-2018 Dataset*

In [32]:
d1 = discipline_18['District Name'].unique().tolist()
d2 = schoolreportcard_18['District'].unique().tolist()

#### Dropping Districts That We Don't Need - these are schools/districts not included in the report card public data set

In [33]:
# districts not in the report card df - most likely due to naming issues or because they are not in the dataset
# I don't want to lose schools over a naming convention, so I'm going to explore more

set(d1) - set(d2)

{'A E R O  Spec Educ Coop',
 'Achievement Centers',
 'Adam/Brwn/Cass/Morgn/Pik/Sctt ROE',
 'Alxndr/Jcksn/Pulsk/Prry/Union ROE',
 'Arrowhead Ranch',
 'Baby Fold',
 'Bi-County Special Educ Coop',
 'Black Hawk Area Sp Ed District',
 'Bond/Christian/Effingham/Fayette/Montgomery ROE',
 'Boone/Winnebago ROE',
 'Calhoun/Greene/Jersy/Macoupin ROE',
 'Camelot Education',
 'Carroll/Jo Daviess/Stephenson ROE',
 'Champaign/Ford ROE',
 'Childrens Home Assoc of IL',
 'Clay/Cwford/Jsper/Lwrnce/Rhland',
 'Clintn/Jeffrsn/Marin/Washngtn ROE',
 'Clk/Cls/Cmbn/Dglas/Edgr/Mltr/Shlb',
 'Connections Day School',
 'Coordinated Youth & Human Service',
 'Cunningham Childrens Home',
 'De Kalb ROE',
 'DeWitt/Livingstn/Logan/McLean ROE',
 'DuPage ROE',
 'Eastern IL Area of Spec Educ',
 'Edw/Glt/Hlt/Hdn/Pop/Sln/Wbh/Wn/Wh ROE',
 'Elgin Coll Dist 509',
 'Frankln/Johnsn/Massc/Willimsn ROE',
 'Giant Steps Illinois',
 'Grundy/Kendall ROE',
 'Hancck/Fultn/Schuylr/McDonogh ROE',
 'Illinois Mathematics & Science Academy',
 

In [34]:
#dropping these districts from the dataset b/c they are different school entities not included in the public dataset
drop = list(set(d1) - set(d2))

#find the index of the rows with the items in the drop list
to_drop = list(discipline_18[discipline_18['District Name'].isin(drop)].index)

#drop those indices
discipline_18.drop(index=to_drop, axis=0, inplace=True)

#preview with a reset index (the result isn't assigned, so discipline_18 keeps its original index)
discipline_18.reset_index(drop=True)

Unnamed: 0,school_year,District Name,School Name,Total Incidents,district_school
0,17-18,Amandla Charter School,Amandla Charter School,300,Amandla Charter School_Amandla Charter School
1,17-18,Abingdon-Avon CUSD 276,Abingdon-Avon High Sch,86,Abingdon-Avon CUSD 276_Abingdon-Avon High Sch
2,17-18,Abingdon-Avon CUSD 276,Abingdon-Avon Middle Sch,16,Abingdon-Avon CUSD 276_Abingdon-Avon Middle Sch
3,17-18,Abingdon-Avon CUSD 276,Avon Elem Sch,3,Abingdon-Avon CUSD 276_Avon Elem Sch
4,17-18,Abingdon-Avon CUSD 276,Hedding Grade Sch,9,Abingdon-Avon CUSD 276_Hedding Grade Sch
...,...,...,...,...,...
3151,17-18,Zion ESD 6,Shiloh Park Elem School,38,Zion ESD 6_Shiloh Park Elem School
3152,17-18,Zion ESD 6,West Elementary School,77,Zion ESD 6_West Elementary School
3153,17-18,Zion ESD 6,Zion Central Middle School,253,Zion ESD 6_Zion Central Middle School
3154,17-18,Zion-Benton Twp HSD 126,New Tech High - Zion-Benton East,8,Zion-Benton Twp HSD 126_New Tech High - Zion-B...
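The drop-by-index pattern above can also be written as a single boolean filter. The sketch below is a toy illustration: it keeps only rows whose district appears in the report card and assigns the reset index back. The set `report_districts` and the toy frame are assumptions for the example. The later outputs in this notebook still reflect the original, non-reset index.

```python
import pandas as pd

# Toy discipline rows; the first district is absent from the report card.
disc = pd.DataFrame({
    'District Name': ['A E R O  Spec Educ Coop', 'Abingdon-Avon CUSD 276', 'Zion ESD 6'],
    'Total Incidents': [1, 86, 38],
})
report_districts = {'Abingdon-Avon CUSD 276', 'Zion ESD 6'}  # stand-in for set(d2)

# Keep matching districts and store the frame with a fresh 0..n-1 index.
kept = disc[disc['District Name'].isin(report_districts)].reset_index(drop=True)
print(kept.index.tolist())
```

Without the assignment, `reset_index(drop=True)` only returns a new frame for display and leaves the original untouched.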


#### Comparing district_school values between the two datasets for merge

In [35]:
#names don't match or don't exist in the report card

s1 = discipline_18['district_school'].unique().tolist()
s2 = schoolreportcard_18['district_school'].unique().tolist()

#create a list that has the index of the mismatched schools
find = list(discipline_18[discipline_18['district_school'].isin(list(set(s1) - set(s2)))].index)

#create a new df with those indices only
df = discipline_18.loc[discipline_18.index.isin(find)]

df.head()

Unnamed: 0,school_year,District Name,School Name,Total Incidents,district_school
32,17-18,Alton CUSD 11,Mark Twain,44,Alton CUSD 11_Mark Twain
51,17-18,Arbor Park SD 145,Arbor Elementary School,2,Arbor Park SD 145_Arbor Elementary School
53,17-18,Arbor Park SD 145,Scarlet Oak Elementary School,3,Arbor Park SD 145_Scarlet Oak Elementary School
161,17-18,Belleville Twp HSD 201,Belleville Twp HS-Night/Alt Sch,13,Belleville Twp HSD 201_Belleville Twp HS-Night...
294,17-18,CHSD 117,Gateway School,1,CHSD 117_Gateway School


In [36]:
df.shape

(107, 5)
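The set difference plus `isin` lookup above is an anti-join. As a sketch of an equivalent approach, `merge` with `indicator=True` can flag the rows that exist only in the discipline frame. The keys below are made up, and unlike the `isin` version this resets the row index.

```python
import pandas as pd

# Toy frames with made-up keys.
disc = pd.DataFrame({
    'district_school': ['D1_A', 'D1_B', 'D2_C'],
    'Total Incidents': [3, 9, 1],
})
report = pd.DataFrame({'district_school': ['D1_A', 'D2_C', 'D3_Z']})

# Left-merge against the unique report card keys and record where each row matched.
flagged = disc.merge(
    report[['district_school']].drop_duplicates(),
    on='district_school',
    how='left',
    indicator=True,
)

# 'left_only' marks discipline rows with no report card counterpart.
unmatched = flagged[flagged['_merge'] == 'left_only'].drop(columns='_merge')
print(unmatched)
```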

In [37]:
#retrieve RCDTS codes for schools whose name in the discipline dataframe exactly matches a FacilityName in the codes file
df2 = df.merge(codes, how='inner', left_on='School Name', right_on='FacilityName')
df2.rename(columns={'School Name': 'School Name Disc'}, inplace=True)
df2.rename(columns={'district_school': 'district_school_disc'}, inplace=True)
df2.head(2)

Unnamed: 0,school_year,District Name,School Name Disc,Total Incidents,district_school_disc,CountyName,RecType,Region-2\nCounty-3\nDistrict-4,Type,School,RCDTS,FacilityName,NCES ID
0,17-18,Alton CUSD 11,Mark Twain,44,Alton CUSD 11_Mark Twain,Madison,Sch,410570110,26,3006,410570110263006,Mark Twain,170360001473
1,17-18,Arbor Park SD 145,Arbor Elementary School,2,Arbor Park SD 145_Arbor Elementary School,Cook,Sch,70161450,2,2004,70161450022004,Arbor Elementary School,170393000080


In [38]:
df2.shape  #this shows that 10 of the names weren't found in the RCDTS codes

(97, 13)
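One caveat with merging on school name alone: if the same facility name appears more than once in the codes file, the merge returns one row per match, so the row count no longer equals the number of schools found. The sketch below checks for that on toy data; the RCDTS values are placeholders, not real codes.

```python
import pandas as pd

# Toy codes table in which one facility name appears twice (placeholder RCDTS values).
codes_toy = pd.DataFrame({
    'FacilityName': ['Mark Twain', 'Mark Twain', 'Arbor Elementary School'],
    'RCDTS': [111, 222, 333],
})
unmatched_toy = pd.DataFrame({'School Name': ['Mark Twain', 'Arbor Elementary School']})

# Facility names shared by more than one row in the codes table.
dupes = codes_toy[codes_toy['FacilityName'].duplicated(keep=False)]

# A name-only merge fans out: 2 input schools become 3 rows.
merged = unmatched_toy.merge(
    codes_toy, how='inner', left_on='School Name', right_on='FacilityName'
)
print(len(unmatched_toy), len(merged), len(dupes))
```

Running a check like `codes['FacilityName'].duplicated(keep=False)` on the real file would show whether this affects the counts here.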

In [39]:
#now I'm searching to see which of these 97 RCDTS codes are actually in the '18 df to let me know to keep those

df3 = df2.merge(schoolreportcard_18, how='inner', left_on='RCDTS', right_on='RCDTS')
df3.shape

(44, 403)

In [40]:
df3.head(2)

Unnamed: 0,school_year,District Name,School Name Disc,Total Incidents,district_school_disc,CountyName,RecType,Region-2\nCounty-3\nDistrict-4,Type_x,School,...,Five Essential Survey Supportive Environment,Five Essential Survey Supportive Environment Level,Five Essential Survey Ambitious Instruction,Five Essential Survey Ambitious Instruction.1,Five Essential Survey Student Response Rate %,Five Essential Survey Teacher Response Rate %,Five Essential Survey Schools with over 50% Response Rate %,Five Essential Survey Student Response Rate Median,Five Essential Survey Teacher Response Rate Median,district_school
0,17-18,Arbor Park SD 145,Arbor Elementary School,2,Arbor Park SD 145_Arbor Elementary School,Cook,Sch,70161450,2,2004,...,0.0,0.0,0.0,0.0,0.0,99.9,,,,Arbor Park SD 145_Morton Gingerwood Elem School
1,17-18,Arbor Park SD 145,Scarlet Oak Elementary School,3,Arbor Park SD 145_Scarlet Oak Elementary School,Cook,Sch,70161450,2,2003,...,0.0,0.0,0.0,0.0,0.0,95.0,,,,Arbor Park SD 145_Scarlet Oak Elem School


In [41]:
#this is the list of schools we want to make sure we keep from the difference list
schools_tokeep = list(df3['district_school_disc'].unique())

mismatch = list(set(s1) - set(s2))

#subtract the schools we want to keep from the mismatch
remove = list(set(mismatch) - set(schools_tokeep))


In [42]:
#find the index of the rows with the items in the drop list
to_drop = list(discipline_18[discipline_18['district_school'].isin(remove)].index)

#drop those indices
discipline_18_final = discipline_18.drop(index=to_drop, axis=0)

In [43]:
#these are the schools we want to get merged into the report dataset
discipline_18_final.shape

(3093, 5)

In [44]:
#create a df that has the district and school names across both datasets
school_mismatch = df3[['RCDTS', 'NCES ID', 'District Name','School Name Disc', 'district_school_disc', 'District', 'School Name','district_school']]
school_mismatch.head()

Unnamed: 0,RCDTS,NCES ID,District Name,School Name Disc,district_school_disc,District,School Name,district_school
0,70161450022004,170393000080,Arbor Park SD 145,Arbor Elementary School,Arbor Park SD 145_Arbor Elementary School,Arbor Park SD 145,Morton Gingerwood Elem School,Arbor Park SD 145_Morton Gingerwood Elem School
1,70161450022003,170393000081,Arbor Park SD 145,Scarlet Oak Elementary School,Arbor Park SD 145_Scarlet Oak Elementary School,Arbor Park SD 145,Scarlet Oak Elem School,Arbor Park SD 145_Scarlet Oak Elem School
2,190220940160001,174044004071,CHSD 94,West Chicago Community High School,CHSD 94_West Chicago Community High School,CHSD 94,Community High School,CHSD 94_Community High School
3,500821870261014,170804006232,Cahokia CUSD 187,Wirth/Parks Middle School,Cahokia CUSD 187_Wirth/Parks Middle School,Cahokia CUSD 187,7th Grade Academy,Cahokia CUSD 187_7th Grade Academy
4,80083990262001,170940004571,Chadwick-Milledgeville CUSD 399,Chadwick-Milledgeville Elem School,Chadwick-Milledgeville CUSD 399_Chadwick-Mille...,Chadwick-Milledgeville CUSD 399,Chadwick Elem School,Chadwick-Milledgeville CUSD 399_Chadwick Elem ...


In [45]:
#creating a dictionary to do a quick name change

rename_list = dict(zip(school_mismatch['School Name Disc'], school_mismatch['School Name']))

#rename schools to match the report card
discipline_18_final = discipline_18_final.replace(rename_list)
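`DataFrame.replace` with a dictionary substitutes matching values in every column and for every district, so a school in another district that shares a name would be renamed too. The sketch below shows one possible way to scope the rename to a specific district and school pair by mapping on the combined key. The second district (`'Other SD 1'`) and the dictionary `rename_by_key` are hypothetical.

```python
import pandas as pd

# Two districts that happen to share a school name; 'Other SD 1' is made up.
disc = pd.DataFrame({
    'District Name': ['Arbor Park SD 145', 'Other SD 1'],
    'School Name': ['Scarlet Oak Elementary School', 'Scarlet Oak Elementary School'],
})
disc['district_school'] = disc['District Name'] + '_' + disc['School Name']

# Map from the discipline key to the report card school name.
rename_by_key = {
    'Arbor Park SD 145_Scarlet Oak Elementary School': 'Scarlet Oak Elem School',
}

# Rename only rows whose full key is in the mapping; others keep their name.
disc['School Name'] = disc['district_school'].map(rename_by_key).fillna(disc['School Name'])
print(disc['School Name'].tolist())
```

In the notebook's terms, such a mapping could be built from `district_school_disc` and `School Name` in `school_mismatch` rather than from the school name alone.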

In [46]:
#need to redo the district_school col with new school name updates

#drop columns
discipline_18_final.drop(columns='district_school', inplace=True)

#recreate column
discipline_18_final['key'] = discipline_18_final['District Name'] + '_' + discipline_18_final['School Name']

#rename the School Name column so it doesn't clash with the one in the report card df
discipline_18_final.rename(columns={'School Name': 'School'}, inplace=True)

#drop columns
schoolreportcard_18.drop(columns='district_school', inplace=True)

#recreate column
schoolreportcard_18['key'] = schoolreportcard_18['District'] + '_' + schoolreportcard_18['School Name']

In [47]:
discipline_18_final.head()

Unnamed: 0,school_year,District Name,School,Total Incidents,key
1,17-18,Amandla Charter School,Amandla Charter School,300,Amandla Charter School_Amandla Charter School
2,17-18,Abingdon-Avon CUSD 276,Abingdon-Avon High Sch,86,Abingdon-Avon CUSD 276_Abingdon-Avon High Sch
3,17-18,Abingdon-Avon CUSD 276,Abingdon-Avon Middle Sch,16,Abingdon-Avon CUSD 276_Abingdon-Avon Middle Sch
4,17-18,Abingdon-Avon CUSD 276,Avon Elem Sch,3,Abingdon-Avon CUSD 276_Avon Elem Sch
5,17-18,Abingdon-Avon CUSD 276,Hedding Grade Sch,9,Abingdon-Avon CUSD 276_Hedding Grade Sch


In [48]:
key1 = discipline_18_final['key'].unique().tolist()
key2 = schoolreportcard_18['key'].unique().tolist()

set(key1) - set(key2)

#all discipline data for '18 can now be merged with report card data!

set()
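With an empty difference, every discipline key has a counterpart in the report card. A minimal sketch of the merge this sets up is shown below on toy data. The choice of `how='inner'` and the `validate` check are assumptions for illustration, not the notebook's confirmed merge settings.

```python
import pandas as pd

# Toy frames with made-up keys.
disc_final = pd.DataFrame({'key': ['D1_A', 'D2_C'], 'Total Incidents': [3, 1]})
report = pd.DataFrame({
    'key': ['D1_A', 'D2_C', 'D3_Z'],
    'School Type': ['HIGH SCHOOL', 'ELEMENTARY', 'ELEMENTARY'],
})

# validate raises a MergeError if either side has duplicate keys.
merged = report.merge(disc_final, on='key', how='inner', validate='one_to_one')
print(merged.shape)
```

The `validate` argument is a cheap guard against duplicate keys silently multiplying rows.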

### *2018-2019 Dataset*

In [49]:
d1 = discipline_19['District Name'].unique().tolist()
d2 = schoolreportcard_19['District'].unique().tolist()

#### Dropping Districts That We Don't Need - these are schools/districts not included in the report card public data set

In [50]:
# districts not in the report card df - most likely due to naming issues or because they are not in the dataset
# I don't want to lose schools over a naming convention, so I'm going to explore more

set(d1) - set(d2)

{'A E R O  Spec Educ Coop',
 'Achievement Centers',
 'Adam/Brwn/Cass/Morgn/Pik/Sctt ROE',
 'Adventist GlenOaks Hospital',
 'Alxndr/Jcksn/Pulsk/Prry/Union ROE',
 'Baby Fold',
 'Bi-County Special Educ Coop',
 'Bond/Christian/Effingham/Fayette/Montgomery ROE',
 'Boone/Winnebago ROE',
 'Camelot Education',
 'Champaign/Ford ROE',
 'Childrens Home Assoc of IL',
 'Clintn/Jeffrsn/Marin/Washngtn ROE',
 'Connections Day School',
 'Coordinated Youth & Human Service',
 'Cunningham Childrens Home',
 'De Kalb ROE',
 'DeWitt/Livingstn/Logan/McLean ROE',
 'Edw/Glt/Hlt/Hdn/Pop/Sln/Wbh/Wn/Wh ROE',
 'Eisenhower Cooperative',
 'Exc Children Have Opportunities',
 'Family Guidance Centers Inc',
 'Frankln/Johnsn/Massc/Willimsn ROE',
 'Great Circle',
 'Grundy County Spec Educ Coop',
 'Grundy/Kendall ROE',
 'Hancck/Fultn/Schuylr/McDonogh ROE',
 'Henry-Stark County Spec Ed Dist',
 'Iroquois/Kankakee ROE',
 'Kane ROE',
 'Knox Warren Special Education Districts',
 'La Salle/Marshall/Putnam ROE',
 'Lake ROE',
 'Le

In [51]:
#dropping these districts from the dataset b/c they are different school entities not included in the public dataset
drop = list(set(d1) - set(d2))

#find the index of the rows with the items in the drop list
to_drop = list(discipline_19[discipline_19['District Name'].isin(drop)].index)

#drop those indices
discipline_19.drop(index=to_drop, axis=0, inplace=True)

#preview with a reset index (the result isn't assigned, so discipline_19 keeps its original index)
discipline_19.reset_index(drop=True)

Unnamed: 0,school_year,District Name,School Name,Total Incidents,district_school
0,18-19,A-C Central CUSD 262,A-C Central High School,8,A-C Central CUSD 262_A-C Central High School
1,18-19,Amandla Charter School,Amandla Charter School,192,Amandla Charter School_Amandla Charter School
2,18-19,Abingdon-Avon CUSD 276,Abingdon-Avon High Sch,53,Abingdon-Avon CUSD 276_Abingdon-Avon High Sch
3,18-19,Abingdon-Avon CUSD 276,Abingdon-Avon Middle Sch,61,Abingdon-Avon CUSD 276_Abingdon-Avon Middle Sch
4,18-19,Addison SD 4,Army Trail Elem School,12,Addison SD 4_Army Trail Elem School
...,...,...,...,...,...
2961,18-19,Zion ESD 6,Shiloh Park Elem School,20,Zion ESD 6_Shiloh Park Elem School
2962,18-19,Zion ESD 6,West Elementary School,18,Zion ESD 6_West Elementary School
2963,18-19,Zion ESD 6,Zion Central Middle School,270,Zion ESD 6_Zion Central Middle School
2964,18-19,Zion-Benton Twp HSD 126,New Tech High - Zion-Benton East,12,Zion-Benton Twp HSD 126_New Tech High - Zion-B...


#### Comparing district_school values between the two datasets for merge

In [52]:
#names don't match or don't exist in the report card

s1 = discipline_19['district_school'].unique().tolist()
s2 = schoolreportcard_19['district_school'].unique().tolist()

#create a list that has the index of the mismatched schools
find = list(discipline_19[discipline_19['district_school'].isin(list(set(s1) - set(s2)))].index)

#create a new df with those indices only
df = discipline_19.loc[discipline_19.index.isin(find)]

df.head()

Unnamed: 0,school_year,District Name,School Name,Total Incidents,district_school
34,18-19,Alton CUSD 11,Mark Twain,53,Alton CUSD 11_Mark Twain
54,18-19,Arbor Park SD 145,Arbor Elementary School,2,Arbor Park SD 145_Arbor Elementary School
56,18-19,Arbor Park SD 145,Kimberly Heights Elementary School,3,Arbor Park SD 145_Kimberly Heights Elementary ...
57,18-19,Arbor Park SD 145,Scarlet Oak Elementary School,1,Arbor Park SD 145_Scarlet Oak Elementary School
148,18-19,Belleville Twp HSD 201,Belleville Twp HS-Night/Alt Sch,3,Belleville Twp HSD 201_Belleville Twp HS-Night...


In [53]:
df.shape

(90, 5)

In [54]:
#retrieve RCDTS codes for schools whose name in the discipline dataframe exactly matches a FacilityName in the codes file
df2 = df.merge(codes, how='inner', left_on='School Name', right_on='FacilityName')
df2.rename(columns={'School Name': 'School Name Disc'}, inplace=True)
df2.rename(columns={'district_school': 'district_school_disc'}, inplace=True)
df2.head(2)

Unnamed: 0,school_year,District Name,School Name Disc,Total Incidents,district_school_disc,CountyName,RecType,Region-2\nCounty-3\nDistrict-4,Type,School,RCDTS,FacilityName,NCES ID
0,18-19,Alton CUSD 11,Mark Twain,53,Alton CUSD 11_Mark Twain,Madison,Sch,410570110,26,3006,410570110263006,Mark Twain,170360001473
1,18-19,Arbor Park SD 145,Arbor Elementary School,2,Arbor Park SD 145_Arbor Elementary School,Cook,Sch,70161450,2,2004,70161450022004,Arbor Elementary School,170393000080


In [55]:
df2.shape

(83, 13)

In [56]:
#now I'm searching to see which of these RCDTS codes are actually in the '19 df to let me know to keep those

df3 = df2.merge(schoolreportcard_19, how='inner', left_on='RCDTS', right_on='RCDTS')
df3.shape

(31, 864)

In [57]:
df3.head(2)

Unnamed: 0,school_year,District Name,School Name Disc,Total Incidents,district_school_disc,CountyName,RecType,Region-2\nCounty-3\nDistrict-4,Type_x,School,...,State Performance Plan Indicator 8 - Met State Target?,State Performance Plan Indicator 9 - Met State Target?,State Performance Plan Indicator 10 - Met State Target?,State Performance Plan Indicator 11 - Met State Target?,State Performance Plan Indicator 12 - Met State Target?,State Performance Plan Indicator 13 - Met State Target?,State Performance Plan Indicator 14A - Met State Target?,State Performance Plan Indicator 14B - Met State Target?,State Performance Plan Indicator 14c - Met State Target?,district_school
0,18-19,Arbor Park SD 145,Arbor Elementary School,2,Arbor Park SD 145_Arbor Elementary School,Cook,Sch,70161450,2,2004,...,,,,,,,,,,Arbor Park SD 145_Morton Gingerwood Elem School
1,18-19,Arbor Park SD 145,Kimberly Heights Elementary School,3,Arbor Park SD 145_Kimberly Heights Elementary ...,Cook,Sch,70161450,2,2002,...,,,,,,,,,,Arbor Park SD 145_Kimberly Heights Elem School


In [58]:
#this is the list of schools we want to make sure we keep from the difference list
schools_tokeep = list(df3['district_school_disc'].unique())

mismatch = list(set(s1) - set(s2))

#subtract the schools we want to keep from the mismatch
remove = list(set(mismatch) - set(schools_tokeep))

In [59]:
#find the index of the rows with the items in the drop list
to_drop = list(discipline_19[discipline_19['district_school'].isin(remove)].index)

#drop those indices
discipline_19_final = discipline_19.drop(index=to_drop, axis=0)

In [60]:
#these are the schools we want to get merged into the report dataset
discipline_19_final.shape

(2907, 5)

In [61]:
#create a df that has the district and school names across both datasets
school_mismatch = df3[['RCDTS','NCES ID','District Name','School Name Disc', 'district_school_disc', 'District', 'School Name','district_school']]
school_mismatch.head()

Unnamed: 0,RCDTS,NCES ID,District Name,School Name Disc,district_school_disc,District,School Name,district_school
0,70161450022004,170393000080,Arbor Park SD 145,Arbor Elementary School,Arbor Park SD 145_Arbor Elementary School,Arbor Park SD 145,Morton Gingerwood Elem School,Arbor Park SD 145_Morton Gingerwood Elem School
1,70161450022002,170393000079,Arbor Park SD 145,Kimberly Heights Elementary School,Arbor Park SD 145_Kimberly Heights Elementary ...,Arbor Park SD 145,Kimberly Heights Elem School,Arbor Park SD 145_Kimberly Heights Elem School
2,70161450022003,170393000081,Arbor Park SD 145,Scarlet Oak Elementary School,Arbor Park SD 145_Scarlet Oak Elementary School,Arbor Park SD 145,Scarlet Oak Elem School,Arbor Park SD 145_Scarlet Oak Elem School
3,190220940160001,174044004071,CHSD 94,West Chicago Community High School,CHSD 94_West Chicago Community High School,CHSD 94,Community High School,CHSD 94_Community High School
4,500821870261014,170804006232,Cahokia CUSD 187,Wirth/Parks Middle School,Cahokia CUSD 187_Wirth/Parks Middle School,Cahokia CUSD 187,7th Grade Academy,Cahokia CUSD 187_7th Grade Academy


In [62]:
#creating a dictionary to do a quick name change

rename_list = dict(zip(school_mismatch['School Name Disc'], school_mismatch['School Name']))

#rename schools to match the report card
discipline_19_final = discipline_19_final.replace(rename_list)

In [63]:
#need to redo the district_school col with new school name updates

#drop columns
discipline_19_final.drop(columns='district_school', inplace=True)

#recreate column
discipline_19_final['key'] = discipline_19_final['District Name'] + '_' + discipline_19_final['School Name']

#rename the School Name column so it doesn't clash with the report card's School Name column
discipline_19_final.rename(columns={'School Name': 'School'}, inplace=True)

#drop columns
schoolreportcard_19.drop(columns='district_school', inplace=True)

#recreate column
schoolreportcard_19['key'] = schoolreportcard_19['District'] + '_' + schoolreportcard_19['School Name']

In [64]:
discipline_19_final.head()

Unnamed: 0,school_year,District Name,School,Total Incidents,key
1,18-19,A-C Central CUSD 262,A-C Central High School,8,A-C Central CUSD 262_A-C Central High School
2,18-19,Amandla Charter School,Amandla Charter School,192,Amandla Charter School_Amandla Charter School
3,18-19,Abingdon-Avon CUSD 276,Abingdon-Avon High Sch,53,Abingdon-Avon CUSD 276_Abingdon-Avon High Sch
4,18-19,Abingdon-Avon CUSD 276,Abingdon-Avon Middle Sch,61,Abingdon-Avon CUSD 276_Abingdon-Avon Middle Sch
10,18-19,Addison SD 4,Army Trail Elem School,12,Addison SD 4_Army Trail Elem School


In [65]:
key1 = discipline_19_final['key'].unique().tolist()
key2 = schoolreportcard_19['key'].unique().tolist()

set(key1) - set(key2)

#all discipline data for '19 can now be merged with report card data!

set()
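
The empty set above is the condition the merge relies on, so it may be worth turning the check into an assertion that fails loudly if the data changes. The helper below is a hypothetical sketch (the name `unmatched_keys` is not part of the notebook), and the toy frames stand in for `discipline_19_final` and `schoolreportcard_19`.

```python
import pandas as pd

def unmatched_keys(left: pd.DataFrame, right: pd.DataFrame, key: str = 'key') -> list:
    """Return the sorted key values present in `left` but missing from `right`."""
    return sorted(set(left[key].dropna()) - set(right[key].dropna()))

# Toy stand-ins for the discipline and report card frames
disc = pd.DataFrame({'key': [
    'Addison SD 4_Army Trail Elem School',
    'CHSD 94_Community High School',
]})
report = pd.DataFrame({'key': [
    'Addison SD 4_Army Trail Elem School',
    'CHSD 94_Community High School',
    'Zion ESD 6_West Elementary School',
]})

missing = unmatched_keys(disc, report)

# Fail early if any discipline key has no partner in the report card
assert not missing, f'{len(missing)} discipline keys have no report card match: {missing[:5]}'
```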

### *2019-2020 Dataset*

In [66]:
d1 = discipline_20['District Name'].unique().tolist()
d2 = schoolreportcard_20['District'].unique().tolist()

#### Dropping districts that are not included in the report card public data set

In [67]:
# districts in the discipline data that are not in the report card df - most likely due to naming issues or just not in the dataset
# I don't want to lose schools over a naming convention, so I'm going to explore more

set(d1) - set(d2)

{'Adam/Brwn/Cass/Morgn/Pik/Sctt ROE',
 'Adventist GlenOaks Hospital',
 'Allendale Association',
 'Alxndr/Jcksn/Pulsk/Prry/Union ROE',
 'Baby Fold',
 'Bi-County Special Educ Coop',
 'Bond/Christian/Effingham/Fayette/Montgomery ROE',
 'Boone/Winnebago ROE',
 'Calhoun/Greene/Jersy/Macoupin ROE',
 'Carroll/Jo Daviess/Stephenson ROE',
 'Center on Deafness',
 'Champaign/Ford ROE',
 'Childrens Home Assoc of IL',
 'Clintn/Jeffrsn/Marin/Washngtn ROE',
 'Clk/Cls/Cmbn/Dglas/Edgr/Mltr/Shlb',
 'Connections Academy East',
 'Coordinated Youth & Human Service',
 'Cunningham Childrens Home',
 'De Kalb ROE',
 'DeWitt/Livingstn/Logan/McLean ROE',
 'DuPage ROE',
 'Easter Seals Metropolitan Chicago',
 'Edw/Glt/Hlt/Hdn/Pop/Sln/Wbh/Wn/Wh ROE',
 'Elgin Coll Dist 509',
 'Grundy County Spec Educ Coop',
 'Grundy/Kendall ROE',
 'Hancck/Fultn/Schuylr/McDonogh ROE',
 'Iroquois/Kankakee ROE',
 'Kane ROE',
 'Knox Warren Special Education Districts',
 'La Salle/Marshall/Putnam ROE',
 'LaSalle Putnam Alliance',
 'Lake 

In [68]:
#dropping these districts from the dataset b/c they are different school entities not included in the public dataset
drop = list(set(d1) - set(d2))

#find the index of the rows with the items in the drop list
to_drop = list(discipline_20[discipline_20['District Name'].isin(drop)].index)

#drop those rows in place
discipline_20.drop(index=to_drop, axis=0, inplace=True)

#preview with a fresh index; the result is not assigned, so discipline_20 keeps its original index labels
discipline_20.reset_index(drop=True)

Unnamed: 0,school_year,District Name,School Name,Total Incidents,district_school
0,19-20,Abingdon-Avon CUSD 276,Abingdon-Avon High Sch,43,Abingdon-Avon CUSD 276_Abingdon-Avon High Sch
1,19-20,Abingdon-Avon CUSD 276,Abingdon-Avon Middle Sch,29,Abingdon-Avon CUSD 276_Abingdon-Avon Middle Sch
2,19-20,Abingdon-Avon CUSD 276,Hedding Grade Sch,6,Abingdon-Avon CUSD 276_Hedding Grade Sch
3,19-20,Addison SD 4,Army Trail Elem School,5,Addison SD 4_Army Trail Elem School
4,19-20,Addison SD 4,Indian Trail Jr High School,70,Addison SD 4_Indian Trail Jr High School
...,...,...,...,...,...
2845,19-20,Zion ESD 6,Shiloh Park Elem School,7,Zion ESD 6_Shiloh Park Elem School
2846,19-20,Zion ESD 6,West Elementary School,20,Zion ESD 6_West Elementary School
2847,19-20,Zion ESD 6,Zion Central Middle School,79,Zion ESD 6_Zion Central Middle School
2848,19-20,Zion-Benton Twp HSD 126,New Tech High - Zion-Benton East,21,Zion-Benton Twp HSD 126_New Tech High - Zion-B...
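
If a contiguous index is actually wanted after the filter, one option is a boolean mask followed by an assigned `reset_index`, which avoids building a list of index labels first. This is a sketch on toy data, not the notebook's method; the incident counts for the two dropped districts are hypothetical.

```python
import pandas as pd

# Toy stand-in for discipline_20
discipline = pd.DataFrame({
    'District Name': ['Addison SD 4', 'Baby Fold', 'Zion ESD 6', 'Kane ROE'],
    'Total Incidents': [5, 3, 20, 1],
})

# Toy stand-in for the districts present in the report card
report_districts = {'Addison SD 4', 'Zion ESD 6'}

# Keep only rows whose district appears in the report card,
# then assign the reset result so the new index is actually stored
keep = discipline['District Name'].isin(report_districts)
discipline = discipline[keep].reset_index(drop=True)
```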


#### Comparing district_school values between the two datasets for merge

In [69]:
#names don't match or don't exist in the report card

s1 = discipline_20['district_school'].unique().tolist()
s2 = schoolreportcard_20['district_school'].unique().tolist()

#create a list that has the index of the mismatched schools
find = list(discipline_20[discipline_20['district_school'].isin(list(set(s1) - set(s2)))].index)

#create a new df with those indices only
df = discipline_20.loc[discipline_20.index.isin(find)]

df.head()

Unnamed: 0,school_year,District Name,School Name,Total Incidents,district_school
30,19-20,Alton CUSD 11,Mark Twain,53,Alton CUSD 11_Mark Twain
51,19-20,Arbor Park SD 145,Arbor Elementary School,2,Arbor Park SD 145_Arbor Elementary School
157,19-20,Belleville Twp HSD 201,Belleville Twp HS-Night/Alt Sch,13,Belleville Twp HSD 201_Belleville Twp HS-Night...
295,19-20,CHSD 218,Delta Learning Center,46,CHSD 218_Delta Learning Center
297,19-20,CHSD 94,West Chicago Community High School,493,CHSD 94_West Chicago Community High School


In [70]:
df.shape

(70, 5)

In [71]:
#retrieve RCDTS codes by matching the discipline School Name against FacilityName in the codes table
df2 = df.merge(codes, how='inner', left_on='School Name', right_on='FacilityName')
df2.rename(columns={'School Name': 'School Name Disc'}, inplace=True)
df2.rename(columns={'district_school': 'district_school_disc'}, inplace=True)
df2.head(2)

Unnamed: 0,school_year,District Name,School Name Disc,Total Incidents,district_school_disc,CountyName,RecType,Region-2\nCounty-3\nDistrict-4,Type,School,RCDTS,FacilityName,NCES ID
0,19-20,Alton CUSD 11,Mark Twain,53,Alton CUSD 11_Mark Twain,Madison,Sch,410570110,26,3006,410570110263006,Mark Twain,170360001473
1,19-20,Arbor Park SD 145,Arbor Elementary School,2,Arbor Park SD 145_Arbor Elementary School,Cook,Sch,70161450,2,2004,70161450022004,Arbor Elementary School,170393000080


In [72]:
df2.shape

(65, 13)

In [73]:
#now I'm searching to see which of these RCDTS codes are actually in the '20 report card df to let me know to keep those

df3 = df2.merge(schoolreportcard_20, how='inner', left_on='RCDTS', right_on='RCDTS')
df3.shape

(19, 921)
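
The inner merges above discard unmatched rows without reporting them (the shapes go from 70 to 65 to 19 rows). One way to see which schools fall out at a given step is a left merge with `indicator=True`. The sketch below uses toy frames and made-up placeholder codes (`code_A`, `code_B`) rather than real RCDTS values.

```python
import pandas as pd

# Toy stand-in for df2: mismatched discipline schools with a looked-up code
mismatched = pd.DataFrame({
    'School Name Disc': ['Arbor Elementary School', 'Delta Learning Center'],
    'RCDTS': ['code_A', 'code_B'],  # placeholder codes, not real RCDTS values
})

# Toy stand-in for the report card frame
report = pd.DataFrame({
    'RCDTS': ['code_A'],
    'School Name': ['Morton Gingerwood Elem School'],
})

# A left merge keeps every mismatched school; _merge records whether it found a partner
traced = mismatched.merge(report, how='left', on='RCDTS', indicator=True)

recovered = traced[traced['_merge'] == 'both']        # code exists in the report card
unresolved = traced[traced['_merge'] == 'left_only']  # code not found in the report card
```

Inspecting `unresolved` before dropping those rows makes it easier to tell a genuine absence from a formatting difference in the code column.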

In [74]:
df3.head(2)

Unnamed: 0,school_year,District Name,School Name Disc,Total Incidents,district_school_disc,CountyName,RecType,Region-2\nCounty-3\nDistrict-4,Type_x,School,...,"% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - Black or African American","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - Hispanic or Latino","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - Asian","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - Native Hawaiian or Other Pacific Islander","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - American Indian or Alaska Native","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - Two or More Race","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - IEP","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - EL","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - Low Income",district_school
0,19-20,Arbor Park SD 145,Arbor Elementary School,2,Arbor Park SD 145_Arbor Elementary School,Cook,Sch,70161450,2,2004,...,,,,,,,,,,Arbor Park SD 145_Morton Gingerwood Elem School
1,19-20,CHSD 94,West Chicago Community High School,493,CHSD 94_West Chicago Community High School,Dupage,Sch,190220940,16,1,...,,,,,,,,,,CHSD 94_Community High School


In [75]:
#this is the list of schools we want to make sure we keep from the difference list
schools_tokeep = list(df3['district_school_disc'].unique())

mismatch = list(set(s1) - set(s2))

#subtract the schools we want to keep from the mismatch
remove = list(set(mismatch) - set(schools_tokeep))

In [76]:
#find the index of the rows with the items in the drop list
to_drop = list(discipline_20[discipline_20['district_school'].isin(remove)].index)

#drop those rows (the index is not reset, so the original index labels are kept)
discipline_20_final = discipline_20.drop(index=to_drop, axis=0)

In [77]:
#these are the schools we want to get merged into the report dataset
discipline_20_final.shape

(2799, 5)

In [78]:
#create a df that has the district and school names across both datasets
school_mismatch = df3[['RCDTS', 'NCES ID', 'District Name','School Name Disc', 'district_school_disc', 'District', 'School Name','district_school']]
school_mismatch.head()

Unnamed: 0,RCDTS,NCES ID,District Name,School Name Disc,district_school_disc,District,School Name,district_school
0,70161450022004,170393000080,Arbor Park SD 145,Arbor Elementary School,Arbor Park SD 145_Arbor Elementary School,Arbor Park SD 145,Morton Gingerwood Elem School,Arbor Park SD 145_Morton Gingerwood Elem School
1,190220940160001,174044004071,CHSD 94,West Chicago Community High School,CHSD 94_West Chicago Community High School,CHSD 94,Community High School,CHSD 94_Community High School
2,500821870261014,170804006232,Cahokia CUSD 187,Wirth/Parks Middle School,Cahokia CUSD 187_Wirth/Parks Middle School,Cahokia CUSD 187,7th Grade Academy,Cahokia CUSD 187_7th Grade Academy
3,150162990252283,170993000868,City of Chicago SD 299,Chicago World Language Academy,City of Chicago SD 299_Chicago World Language ...,City of Chicago SD 299,Jackson A Elem Language Acad,City of Chicago SD 299_Jackson A Elem Language...
4,150162990252052,170993000763,City of Chicago SD 299,Harriet Tubman Elem School,City of Chicago SD 299_Harriet Tubman Elem School,City of Chicago SD 299,Agassiz Elem School,City of Chicago SD 299_Agassiz Elem School


In [79]:
#build a lookup from the discipline school names to the report card school names

rename_list = dict(zip(school_mismatch['School Name Disc'], school_mismatch['School Name']))

#rename schools to match the report card (replace() with a dict is applied to every column)
discipline_20_final = discipline_20_final.replace(rename_list)

In [80]:
#need to redo the district_school col with new school name updates

#drop columns
discipline_20_final.drop(columns='district_school', inplace=True)

#recreate column
discipline_20_final['key'] = discipline_20_final['District Name'] + '_' + discipline_20_final['School Name']

#rename the School Name column so it doesn't clash with the report card's School Name column
discipline_20_final.rename(columns={'School Name': 'School'}, inplace=True)

#drop columns
schoolreportcard_20.drop(columns='district_school', inplace=True)

#recreate column
schoolreportcard_20['key'] = schoolreportcard_20['District'] + '_' + schoolreportcard_20['School Name']

In [81]:
discipline_20_final.head()

Unnamed: 0,school_year,District Name,School,Total Incidents,key
0,19-20,Abingdon-Avon CUSD 276,Abingdon-Avon High Sch,43,Abingdon-Avon CUSD 276_Abingdon-Avon High Sch
1,19-20,Abingdon-Avon CUSD 276,Abingdon-Avon Middle Sch,29,Abingdon-Avon CUSD 276_Abingdon-Avon Middle Sch
2,19-20,Abingdon-Avon CUSD 276,Hedding Grade Sch,6,Abingdon-Avon CUSD 276_Hedding Grade Sch
6,19-20,Addison SD 4,Army Trail Elem School,5,Addison SD 4_Army Trail Elem School
7,19-20,Addison SD 4,Indian Trail Jr High School,70,Addison SD 4_Indian Trail Jr High School


In [82]:
key1 = discipline_20_final['key'].unique().tolist()
key2 = schoolreportcard_20['key'].unique().tolist()

set(key1) - set(key2)

#all discipline data for '20 can now be merged with report card data!

set()

## 4. Merge discipline data with report card data for all years
---

In [83]:
print('reportcard18:', schoolreportcard_18.shape)
print('reportcard19:', schoolreportcard_19.shape)
print('reportcard20:', schoolreportcard_20.shape)
print('discipline18:', discipline_18_final.shape)
print('discipline19:', discipline_19_final.shape)
print('discipline20:', discipline_20_final.shape)

reportcard18: (3888, 391)
reportcard19: (3872, 852)
reportcard20: (3859, 909)
discipline18: (3093, 5)
discipline19: (2907, 5)
discipline20: (2799, 5)


In [84]:
#merge the two dfs - keep all of the school report card rows
df_18 = schoolreportcard_18.merge(discipline_18_final, how='left', on='key')
df_19 = schoolreportcard_19.merge(discipline_19_final, how='left', on='key')
df_20 = schoolreportcard_20.merge(discipline_20_final, how='left', on='key')

In [85]:
print(df_18.shape)
print(df_19.shape)
print(df_20.shape)

(3888, 395)
(3872, 856)
(3859, 913)
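
The row counts are unchanged by the merge, which suggests that `key` is unique on the discipline side. `DataFrame.merge` can enforce that directly through its `validate` argument, and `indicator=True` shows how many report card rows found a discipline match. A minimal sketch on toy frames (all values hypothetical):

```python
import pandas as pd

# Toy stand-ins for a report card frame and a discipline frame
report = pd.DataFrame({'key': ['a_x', 'a_y', 'b_z'], 'Enrollment': [100, 200, 300]})
disc = pd.DataFrame({'key': ['a_x', 'b_z'], 'Total Incidents': [8, 12]})

# validate='many_to_one' raises pandas.errors.MergeError if `key` repeats in disc,
# which is the case that would silently duplicate report card rows
merged = report.merge(disc, how='left', on='key',
                      validate='many_to_one', indicator=True)

# A left merge against unique right-hand keys keeps the row count unchanged
assert len(merged) == len(report)

# 'both' = matched a discipline row, 'left_only' = no discipline record
match_counts = merged['_merge'].value_counts()
```

The `_merge` column can be dropped once the counts have been checked.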


In [86]:
# imputing school year for the rows that are null
df_18['school_year'] = df_18['school_year'].fillna('17-18')
df_19['school_year'] = df_19['school_year'].fillna('18-19')
df_20['school_year'] = df_20['school_year'].fillna('19-20')

In [87]:
# imputing schools that didn't report any discipline incidents with 0

In [88]:
df_18['Total Incidents'] = df_18['Total Incidents'].fillna(0)
df_19['Total Incidents'] = df_19['Total Incidents'].fillna(0)
df_20['Total Incidents'] = df_20['Total Incidents'].fillna(0)
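
Both imputations can also be expressed in a single `fillna` call that takes a column-to-value dict. Since filling with 0 assumes that a school missing from the discipline file had no incidents, the sketch below also records which rows were imputed. The `reported_discipline` column name is hypothetical and the toy frame stands in for `df_18`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_18 after the left merge
df = pd.DataFrame({
    'key': ['a_x', 'a_y'],
    'school_year': ['17-18', np.nan],
    'Total Incidents': [8.0, np.nan],
})

# Remember which schools actually appeared in the discipline data
# (hypothetical helper column, recorded before the nulls are filled)
df['reported_discipline'] = df['Total Incidents'].notna()

# Fill each column with its own default in one call
df = df.fillna({'school_year': '17-18', 'Total Incidents': 0})
```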

## 5. Merge in NCES ID from RCDTS codes & drop rows where NCES ID is not found
---
Having the NCES ID on every row will make it easier to bring in other data later on.

In [89]:
ncesid = codes[['RCDTS', 'NCES ID']]

In [90]:
df_18 = df_18.merge(ncesid, how='left', on='RCDTS')
df_19 = df_19.merge(ncesid, how='left', on='RCDTS')
df_20 = df_20.merge(ncesid, how='left', on='RCDTS')

In [91]:
#drop rows where NCES ID couldn't be matched, then store the frame with a fresh index
df_18 = df_18.dropna(subset=['NCES ID']).reset_index(drop=True)
df_19 = df_19.dropna(subset=['NCES ID']).reset_index(drop=True)
df_20 = df_20.dropna(subset=['NCES ID']).reset_index(drop=True)

df_20

Unnamed: 0,RCDTS,Type,School Name,District,City,County,District Type,District Size,School Type,Grades Served,...,"% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - Two or More Race","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - IEP","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - EL","% Students Identified as Gifted, Taught by Gifted-endorsed Teachers - Low Income",key,school_year,District Name,School,Total Incidents,NCES ID
0,010010010260001,School,Seymour High School,Payson CUSD 1,Payson,Adams,UNIT,MEDIUM,HIGH SCHOOL,7 8 9 10 11 12,...,,,,,Payson CUSD 1_Seymour High School,19-20,Payson CUSD 1,Seymour High School,50.0,173099003226
1,010010010262002,School,Seymour Elementary School,Payson CUSD 1,Payson,Adams,UNIT,MEDIUM,ELEMENTARY,PK K 1 2 3 4 5 6,...,,,,,Payson CUSD 1_Seymour Elementary School,19-20,Payson CUSD 1,Seymour Elementary School,6.0,173099003225
2,010010020260001,School,Liberty High School,Liberty CUSD 2,Liberty,Adams,UNIT,MEDIUM,HIGH SCHOOL,7 8 9 10 11 12,...,,,,,Liberty CUSD 2_Liberty High School,19-20,Liberty CUSD 2,Liberty High School,12.0,172277002524
3,010010020262002,School,Liberty Elementary School,Liberty CUSD 2,Liberty,Adams,UNIT,MEDIUM,ELEMENTARY,PK K 1 2 3 4 5 6,...,,,,,Liberty CUSD 2_Liberty Elementary School,19-20,Liberty CUSD 2,Liberty Elementary School,10.0,172277002523
4,010010030260001,School,Central High School,Central CUSD 3,Camp Point,Adams,UNIT,MEDIUM,HIGH SCHOOL,9 10 11 12,...,,,,,Central CUSD 3_Central High School,19-20,Central CUSD 3,Central High School,48.0,170822000431
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3809,601054280303050,School,IYC Chicago,IDJJ Sch Dist 428,Chicago,Dept Of Corrections,UNIT,,HIGH SCHOOL,6 7 8 9 10 11 12,...,,,,,IDJJ Sch Dist 428_IYC Chicago,19-20,,,0.0,170000603793
3810,651089010800001,School,University High School,ISU Laboratory Schools,Normal,State Of Illinois,UNIT,,HIGH SCHOOL,9 10 11 12,...,,,,,ISU Laboratory Schools_University High School,19-20,,,0.0,170009904506
3811,651089010802001,School,Thomas Metcalf School,ISU Laboratory Schools,Normal,State Of Illinois,UNIT,,ELEMENTARY,PK K 1 2 3 4 5 6 7 8,...,,,,,ISU Laboratory Schools_Thomas Metcalf School,19-20,,,0.0,170009904504
3812,651089020800001,School,University of Illinois High Sch,University of Ill Lab School,Urbana,State Of Illinois,HIGH SCHOOL,,HIGH SCHOOL,7 8 9 10 11 12,...,,,,,University of Ill Lab School_University of Ill...,19-20,,,0.0,170010004505


In [92]:
print(df_18.shape)
print(df_19.shape)
print(df_20.shape)

(3797, 396)
(3801, 857)
(3814, 914)


In [93]:
#export
# df_18.to_csv('../Capstone/cleaned_datasets/cleaning/df_18.csv')
# df_19.to_csv('../Capstone/cleaned_datasets/cleaning/df_19.csv')
# df_20.to_csv('../Capstone/cleaned_datasets/cleaning/df_20.csv')
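
When the export is enabled, two details may be worth considering: `to_csv` writes the index as an extra unnamed column unless `index=False` is passed, and identifier columns with leading zeros such as `RCDTS` are read back as integers unless a string dtype is requested. The sketch below round-trips a toy frame through an in-memory buffer instead of the real output path.

```python
import io

import pandas as pd

# Toy stand-in for one row of df_20
df = pd.DataFrame({'RCDTS': ['010010010260001'], 'Total Incidents': [50.0]})

# Write without the index so no extra unnamed column appears on reload
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read back, keeping RCDTS as text so the leading zero survives
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'RCDTS': str})
```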