# <u> The National Alliance of Concurrent Enrollment Partnerships </u>
## 2015-16 Civil Rights Data Collection (CRDC)
## Advanced Placement (AP) v. Dual Enrollment (DE)
### Initial Filtration
#### Alijah O'Connor - 2018
------------------------------------------------------------------------------------
---
This entire project relies on accurately establishing a dataset that includes only 'Traditional High Schools.' The definition we have chosen centers on schools that contain 11th or 12th grade and are not special education, juvenile justice, or alternative schools; further filters remove virtual schools, adult schools, and schools without a matching National Center for Education Statistics (NCES) identifier (with some exceptions).  Note that the dataset used herein includes both the 2015-2016 CRDC school information and information gathered in the 2015-2016 NCES National Survey.  Below is the official filtration procedure:
    - Join the uncompiled 2015-2016 NCES dataset into one dataset
    - Join the compiled NCES dataset with the 2015-2016 CRDC dataset
    
    - Filter Out (Dataset Attribute in Parentheses)
        - Special Education, Alternative, Juvenile Justice Schools (CRDC)
        - Schools without 11th or 12th Grade (CRDC)
        - Virtual Schools (NCES)
        - Schools reported as 'Elementary', 'Middle', or 'Other' (NCES)
        - Special Education and Alternative/Other (NCES)
        - Schools with 'Adult' in the name
    - Recover Some Schools that did not have matching NCES identifiers
        - Join recovered schools with dataset
        - Remove any remaining schools that did not match
------------------------------------------------------------------------------------
---

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from my_functions import combokey_converter

%matplotlib inline
sns.set_style('whitegrid')
plt.rc('axes', titlesize = 14, titleweight = 'bold', labelweight = 'bold')

# <font color = green> II. Data Cleaning/Joining </font>

# crdc_1516 Data
<div class="alert alert-block alert-info"><b> 96,360 Schools before any filtering <br>
111 Fields (Matches the crdc_cols)</b></div>
<br><br>
Used combokey_converter.convert to create a csv-compatible "COMBOKEY"
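
The `combokey_converter` module is local and not shown in this notebook. Judging from the COMBOKEY values displayed further down (for example `='010000201705'` for LEAID 100002 and SCHID 1705), it appears to zero-pad the district id to 7 digits and the school id to 5 digits, then wrap the result in `='...'`. The function below is a hypothetical stand-in written under that assumption, not the author's actual code; `make_combokey` is an invented name.

```python
import pandas as pd

def make_combokey(df, lea_col="LEAID", sch_col="SCHID"):
    """Hypothetical stand-in for combokey_converter.convert (assumed behavior)."""
    # Zero-pad: 7 digits for the district id, 5 digits for the school id.
    lea = df[lea_col].astype(str).str.zfill(7)
    sch = df[sch_col].astype(str).str.zfill(5)
    # Assumption: the ='...' wrapper is there so that spreadsheet tools keep
    # the leading zeros when the CSV is reopened.
    return "='" + lea + sch + "'"

demo = pd.DataFrame({"LEAID": ["100002", "100005"], "SCHID": [1705, 870]})
demo["COMBOKEY"] = make_combokey(demo)
print(demo["COMBOKEY"].tolist())
```

The two demo rows reuse ids that appear in the `crdc_1516.head()` output, so the generated keys can be compared against the ones shown there.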

In [2]:
crdc_1516 = pd.read_csv('../filtered_data/00_crdc_1516_initial.csv', 
                        dtype = {'LEAID': object})  # np.object was removed in NumPy 1.24; the builtin object is equivalent

In [3]:
crdc_1516['COMBOKEY'] = combokey_converter.convert(crdc_1516, 'LEAID', 'SCHID')

In [4]:
crdc_1516.head()

Unnamed: 0,LEA_STATE,LEA_STATE_NAME,LEAID,LEA_NAME,SCHID,SCH_NAME,COMBOKEY,JJ,SCH_GRADE_PS,SCH_GRADE_KG,...,SCH_IBENR_WH_M,SCH_IBENR_WH_F,SCH_IBENR_TR_M,SCH_IBENR_TR_F,TOT_IBENR_M,TOT_IBENR_F,SCH_IBENR_LEP_M,SCH_IBENR_LEP_F,SCH_IBENR_IDEA_M,SCH_IBENR_IDEA_F
0,AL,ALABAMA,100002,Alabama Youth Services,1705,Wallace Sch - Mt Meigs Campus,='010000201705',Yes,No,No,...,-9,-9,-9,-9,-9,-9,-9,-9,-9,-9
1,AL,ALABAMA,100002,Alabama Youth Services,1706,McNeel Sch - Vacca Campus,='010000201706',Yes,No,No,...,-9,-9,-9,-9,-9,-9,-9,-9,-9,-9
2,AL,ALABAMA,100002,Alabama Youth Services,1876,Alabama Youth Services,='010000201876',No,No,No,...,-9,-9,-9,-9,-9,-9,-9,-9,-9,-9
3,AL,ALABAMA,100002,Alabama Youth Services,99995,AUTAUGA CAMPUS,='010000299995',Yes,No,No,...,-9,-9,-9,-9,-9,-9,-9,-9,-9,-9
4,AL,ALABAMA,100005,Albertville City,870,Albertville Middle School,='010000500870',No,No,No,...,-9,-9,-9,-9,-9,-9,-9,-9,-9,-9


In [28]:
str(len(crdc_1516.index)) + ' Total Schools in the 2015-2016 CRDC Survey'

'96360 Total Schools in the 2015-2016 CRDC Survey'

# nces_1516 Data
<div class="alert alert-block alert-info"><b> The nces_1516 Data was recorded in separate files (each with different numbers of schools), so I will have to join the separate files to avoid corruption/loss of data. </b><br>
    <u>Files</u><br>
    1. Characteristics <br>
    2. Directory <br>
    3. Geographic <br>
</div><div class = 'alert alert-block alert-info'>
Like the crdc data, the combokey field was generated using my combokey_converter.convert function.<br></div>

<div class="alert alert-block alert-warning">
1. **100232 Initial Schools**<br><br>
2. **After first inner join (Directory and Characteristics) --> 100232 schools**<br>
Note: I ran a check to ensure that all of the matching combokeys have matching school names -- 100% identical.<br><br>
3. **After second inner join (above_combined and Geographic) --> 100087**<br> Note:  The join itself returned 100096 schools.  I ran the same check to ensure that all of the schools matched, and nearly 9000 came back as non-matching.  I then compared the first character of each of the two name fields, and only 9 schools came back as non-matching.  After close examination, I decided to cull these 9 schools, leaving 100087.<br></div><div class = 'alert alert-block alert-warning'>
**CSV saved to '../filtered_data/01_nces_1516_initial_ccd.csv'**</div>

In [6]:
nces_1516_characteristics = pd.read_csv('../filtered_data/01_nces_1516_initial_school_characteristics.csv')

In [7]:
nces_1516_characteristics['combokey'] = combokey_converter.convert(nces_1516_characteristics, 'LEAID', 'SCHID')

In [8]:
str(len(nces_1516_characteristics.index)) + ' NCES Schools'

'100232 NCES Schools'

In [9]:
nces_1516_directory = pd.read_csv('../filtered_data/01_nces_1516_initial_school_directory.csv')

In [10]:
nces_1516_directory['combokey'] = combokey_converter.convert(nces_1516_directory, 'LEAID', 'SCHID')

**First Join:  Directory + Characteristics**

In [11]:
nces_1516 = nces_1516_characteristics.set_index('combokey').join(nces_1516_directory.set_index('combokey'), how = 'inner', lsuffix = 'dir_')

In [12]:
len(nces_1516.index)

100232

In [13]:
len(nces_1516[nces_1516.SCH_NAME == nces_1516.SCH_NAMEdir_].index)

100232

In [14]:
nces_1516 = nces_1516.drop(['LEAIDdir_', 'SCHIDdir_', 'SCH_NAMEdir_'], axis = 1)

**Second Join: combined + geo**

In [15]:
nces_1516_geo = pd.read_csv('../filtered_data/01_nces_1516_initial_geographic.csv',  dtype = {'LOCALE15': object})

In [16]:
nces_1516_geo['combokey'] = combokey_converter.convert(nces_1516_geo, 'LEAID', 'SCHID')

In [17]:
nces_1516_test = nces_1516.join(nces_1516_geo.set_index('combokey'), how = 'inner', rsuffix = 'dir_')

In [18]:
len(nces_1516_test.index)

100096

In [19]:
"""How many schools have matching School Names between the combined NCES data and the Geographic file?"""
len(nces_1516_test[nces_1516_test.SCH_NAME == nces_1516_test.NAME].index)

91091

In [20]:
def name_checker(sch1, sch2):
    """Return 0 if both names start with the same character (case-insensitive), else 1."""
    sch1 = sch1.lower()
    sch2 = sch2.lower()
    
    if sch1[0] == sch2[0]:
        return 0
    return 1

nces_1516_test['no_match_name'] = nces_1516_test.apply(lambda row: name_checker(row['SCH_NAME'], row['NAME']), axis = 1)
nces_1516_test[nces_1516_test.no_match_name == 1][['NAME', 'SCH_NAME']]

Unnamed: 0_level_0,NAME,SCH_NAME
combokey,Unnamed: 1_level_1,Unnamed: 2_level_1
='051266001562',HYLTON JUNIOR HIGH SCHOOL,LAKESIDE JUNIOR HIGH SCHOOL
='090147001810',Stowe - Early Learning Center (S,EPS PK STEAM Academy
='090171001700',Alternative High School Programs,Greenwich Alternative High School
='090192001616',STEM Magnet School at Dwight,Betances STEM Magnet School
='090279000148',Hyde School of Health Science an,Cortlandt V.R. Creed Health and Sport Sciences...
='090279001543',Helene Grant Headstart,Dr. Mayo Early Childhood School
='090279001585',Katherine Brennan/Clarence Roger,Brennan Rogers School
='090351201476',Education Connection Special Edu,GFLC/ACCESS School
='090423001808',Hatton Preschool Program,Southington Public Schools Preschool Program a...


In [21]:
nces_1516_full = nces_1516_test[nces_1516_test.no_match_name == 0].drop(['LEAIDdir_', 'SCHIDdir_', 'no_match_name', 'NAME'], axis = 1)

In [22]:
nces_1516_full.head()

Unnamed: 0_level_0,TITLEI,LEAID,LEA_NAME,STABR,SCHID,SCH_NAME,SCH_TYPE_TEXT,SCH_TYPE,LEVEL,VIRTUAL,GSLO,GSHI,NMCNTY15,LOCALE15,LAT1516,LON1516
combokey,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
='010000200277',-9,100002,Alabama Youth Services,AL,277,Sequoyah Sch - Chalkville Campus,Alternative Education School,4,3,No,7,12,Jefferson County,21,33.673661,-86.628755
='010000201667',-9,100002,Alabama Youth Services,AL,1667,Camps,Alternative Education School,4,3,No,7,12,Autauga County,41,32.521681,-86.530132
='010000201670',-9,100002,Alabama Youth Services,AL,1670,Det Ctr,Alternative Education School,4,3,No,7,12,Clarke County,41,31.938444,-87.750529
='010000201705',-9,100002,Alabama Youth Services,AL,1705,Wallace Sch - Mt Meigs Campus,Alternative Education School,4,3,No,7,12,Montgomery County,41,32.374812,-86.08236
='010000201706',-9,100002,Alabama Youth Services,AL,1706,McNeel Sch - Vacca Campus,Alternative Education School,4,3,No,7,12,Jefferson County,12,33.583385,-86.710058


In [23]:
len(nces_1516_full.index)

100087

In [24]:
# nces_1516_full.to_csv('../filtered_data/01_nces_1516_initial_combined_ccd.csv')

# NCES (combined) and CRDC join
<div class="alert alert-block alert-warning">Out of the 96360 schools in the crdc1516 dataset, <b>3861</b> schools did not have a matching Combokey. These non-matching schools were kept in the dataset.<br><br>

Using the name checker function from above, another <b>182</b> schools were found to have School Names that did not begin with the same character in the NCES and CRDC sets.  Erring on the side of caution, these schools were indiscriminately culled.<br><br>

**Final school count in the combined dataset:  96178**</div>
<div class = 'alert alert_block alert-info'>Dataset saved to '03_crdc_nces_1516_raw_combined.csv'</div>

In [25]:
crdc_nces1516_test = crdc_1516.set_index('COMBOKEY').join(nces_1516_full, how = 'left', rsuffix=('_'))

In [26]:
crdc_nces1516_test[crdc_nces1516_test.SCH_NAME_.isnull()].LEAID.count()

3861
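
The count of CRDC schools without an NCES match can also be obtained without inspecting suffixed columns. The sketch below uses toy data and assumes the key is kept as an ordinary column; `indicator` and `validate` are standard `DataFrame.merge` parameters, while the frame names and values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the CRDC and NCES tables (values are illustrative only).
crdc = pd.DataFrame({"COMBOKEY": ["a", "b", "c"],
                     "SCH_NAME": ["A High", "B High", "C High"]})
nces = pd.DataFrame({"combokey": ["a", "c", "d"],
                     "LEVEL": ["3", "3", "1"]})

merged = crdc.merge(
    nces,
    how="left",
    left_on="COMBOKEY",
    right_on="combokey",
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
    validate="one_to_one",  # raises MergeError if either key has duplicates
)

# Rows that exist only on the CRDC side had no NCES match.
unmatched = int((merged["_merge"] == "left_only").sum())
print(unmatched)
```

The `validate` argument is a cheap guard against the row duplication that a join on non-unique keys would otherwise introduce silently.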

In [29]:
crdc_nces_1516 = crdc_nces1516_test.drop(['LEA_NAME_', 'LEAID_', 'SCHID_', 'SCH_NAME_'], axis = 1)

In [30]:
len(crdc_nces_1516.index)

96360

In [31]:
crdc_nces_1516 = crdc_nces_1516.fillna('Missing')

In [32]:
# crdc_nces_1516.to_csv('../filtered_data/03_crdc_nces_1516_raw_combined.csv')

# <font color = green> IV. Filtration </font>

# Remove Schools without 11th or 12th Grade (CRDC)

In [134]:
filter1_crdc_nces_1516 = crdc_nces_1516.copy()

In [135]:
from my_functions.extra_functions import students_in_11_or_12
filter1_crdc_nces_1516['Students_in_11_12'] = filter1_crdc_nces_1516.apply(lambda row: students_in_11_or_12(row['SCH_GRADE_G11'], row['SCH_GRADE_G12']), axis = 1)

In [136]:
filtered_out_1 = filter1_crdc_nces_1516[(filter1_crdc_nces_1516.Students_in_11_12 == 'No')]
filter1_crdc_nces_1516 = filter1_crdc_nces_1516[(filter1_crdc_nces_1516.Students_in_11_12 == 'Yes')]

In [138]:
display(len(filter1_crdc_nces_1516.index))
len(filtered_out_1)

25051

71309
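
The helper `students_in_11_or_12` comes from a local module that is not included here. Since the CRDC grade columns hold 'Yes'/'No' strings and the result is compared against 'Yes' and 'No', it presumably behaves like the hypothetical function below. This is a sketch under that assumption, together with a vectorized equivalent that avoids the row-wise `apply`.

```python
import pandas as pd

def students_in_11_or_12_sketch(g11, g12):
    """Assumed logic: 'Yes' if the school reports grade 11 or grade 12, else 'No'."""
    return "Yes" if (g11 == "Yes" or g12 == "Yes") else "No"

demo = pd.DataFrame({
    "SCH_GRADE_G11": ["Yes", "No", "No"],
    "SCH_GRADE_G12": ["No", "No", "Yes"],
})

# Vectorized equivalent of applying the function to every row.
has_upper_grades = (demo["SCH_GRADE_G11"] == "Yes") | (demo["SCH_GRADE_G12"] == "Yes")
demo["Students_in_11_12"] = has_upper_grades.map({True: "Yes", False: "No"})
print(demo)
```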

# Select Non-[Juvenile Justice, Special Education, and Alternative Schools] (CRDC)
<div class = 'alert alert-block alert-info'>Schools that answered 'No' to each of those three questions on the CRDC Survey.<br><br> 
I also used a keyword filter to remove any remaining "Juvenile Justice"-esque Institutions.</div>

In [140]:
filter2_crdc_nces_1516 = filter1_crdc_nces_1516.copy()
filtered_out_2 = filter2_crdc_nces_1516[(filter2_crdc_nces_1516.JJ == 'Yes') | (filter2_crdc_nces_1516.SCH_STATUS_ALT == 'Yes') | (filter2_crdc_nces_1516.SCH_STATUS_SPED == 'Yes')]
filter2_crdc_nces_1516 = filter2_crdc_nces_1516[(filter2_crdc_nces_1516.JJ == 'No') & (filter2_crdc_nces_1516.SCH_STATUS_ALT == 'No') & (filter2_crdc_nces_1516.SCH_STATUS_SPED == 'No')]

In [141]:
def jj_keyword_remove(name):
    kws = ['behavioral', 'juvenile', 'correction']
    for kw in kws:
        if kw in name.strip().lower():
            return False
    return True

filter2_crdc_nces_1516 = filter2_crdc_nces_1516[filter2_crdc_nces_1516.SCH_NAME.apply(lambda x: jj_keyword_remove(x))]
filter2_crdc_nces_1516 = filter2_crdc_nces_1516[filter2_crdc_nces_1516.LEA_NAME.apply(lambda x: jj_keyword_remove(x))]

In [143]:
display(len(filter2_crdc_nces_1516.index))
len(filtered_out_2)

20646

4356

# Remove Virtual Schools (NCES)
<div class = 'alert alert-block alert-info'>
1. Remove any Schools that reported 'Yes' to the Virtual Schools Question<br>
2. Remove Schools that have certain keywords that likely indicate an online school
</div>
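
The two steps listed above can be combined into a single boolean mask. The following is a minimal sketch, assuming the same keyword list as the notebook's keyword function and the same substring matching; `virtual_mask` is a hypothetical name and the school names in the demo frame are invented.

```python
import pandas as pd

VIRTUAL_KEYWORDS = ["virtual", "cyber", "electronic", "internet", "online", "distance"]

def virtual_mask(df, name_col="SCH_NAME", flag_col="VIRTUAL"):
    """True for rows flagged as virtual or whose name contains a keyword."""
    flagged = df[flag_col] == "Yes"
    # Case-insensitive substring match against any keyword.
    pattern = "|".join(VIRTUAL_KEYWORDS)
    by_name = df[name_col].str.contains(pattern, case=False, regex=True)
    return flagged | by_name

demo = pd.DataFrame({
    "SCH_NAME": ["Lincoln High School", "Ohio Cyber Academy",
                 "Central High", "Distance Learning Ctr"],
    "VIRTUAL": ["No", "No", "Yes", "Missing"],
})
mask = virtual_mask(demo)
kept, removed = demo[~mask], demo[mask]
print(len(kept), len(removed))
```

Building the kept and removed sets from one mask guarantees that they partition the input, which makes the running counts easier to audit.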

In [149]:
filter3_crdc_nces_1516 = filter2_crdc_nces_1516.copy()
filtered_out_3 = filter3_crdc_nces_1516[filter3_crdc_nces_1516.VIRTUAL == 'Yes']
filter3_crdc_nces_1516 = filter3_crdc_nces_1516[filter3_crdc_nces_1516.VIRTUAL != 'Yes']

In [150]:
len(filter3_crdc_nces_1516.index)

20334

In [151]:
def any_missed_virtuals(name):
    kws = ['virtual', 'cyber', 'electronic', 'internet', 'online', 'distance']
    for kw in kws:
        if kw in name.strip().lower():
            return False
    return True
filtered_out_3 = pd.concat([filtered_out_3, filter3_crdc_nces_1516[~filter3_crdc_nces_1516.SCH_NAME.apply(lambda x: any_missed_virtuals(x))]])  # DataFrame.append was removed in pandas 2.0
filter3_crdc_nces_1516 = filter3_crdc_nces_1516[filter3_crdc_nces_1516.SCH_NAME.apply(lambda x: any_missed_virtuals(x))]

In [152]:
display(len(filter3_crdc_nces_1516.index))
len(filtered_out_3)

20269

377

# Remove schools reported as elementary, middle, or 'N' (NCES)
<div class = 'alert alert-block alert-info'>Even with the grade filter applied, I wanted to ensure that no non-typical high schools (as reported by the NCES LEVEL field) are retained.  The Other category is perhaps the most important to cull here, as many of the very large charter-type schools are listed in this category.
<br><br>
Schools with Missing Values were retained.
</div>
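
Retaining schools with missing NCES values amounts to treating the 'Missing' placeholder as one more allowed code. A minimal sketch of that allow-list pattern, assuming LEVEL is stored as strings as in the cells of this section (the demo index labels are invented):

```python
import pandas as pd

# Codes kept by the LEVEL filter in this section, plus the placeholder
# written by fillna('Missing') for schools without an NCES match.
KEEP_LEVELS = ["3", "4", "Missing"]

demo = pd.DataFrame(
    {"LEVEL": ["3", "1", "Missing", "N", "4", "2"]},
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)

keep = demo["LEVEL"].isin(KEEP_LEVELS)
retained, filtered_out = demo[keep], demo[~keep]
print(retained.index.tolist(), filtered_out.index.tolist())
```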

In [156]:
filter4_crdc_nces_1516 = filter3_crdc_nces_1516.copy()

In [157]:
filter4_crdc_nces_1516.LEVEL.value_counts()

3          16311
4           2798
Missing     1004
1             66
N             65
2             25
Name: LEVEL, dtype: int64

In [158]:
filtered_out_4 = filter4_crdc_nces_1516[(filter4_crdc_nces_1516.LEVEL == 'N') | (filter4_crdc_nces_1516.LEVEL == '1') | (filter4_crdc_nces_1516.LEVEL == '2')]
filter4_crdc_nces_1516 = filter4_crdc_nces_1516[(filter4_crdc_nces_1516.LEVEL == 'Missing') | (filter4_crdc_nces_1516.LEVEL == '3') | (filter4_crdc_nces_1516.LEVEL == '4')]

In [159]:
display(len(filter4_crdc_nces_1516.index))
len(filtered_out_4)

20113

156

# Remove Special Education and Alternative/Other Schools (NCES)
<div class = 'alert alert-block alert-info'>Removed Schools with a SCH_TYPE that was not 1 (Regular) or 3 (Vocational).  Culls additional "Special Education", and "Alternative/Other" schools.
<br><br>
Schools with Missing Values were retained.
</div>

In [49]:
filter5_crdc_nces_1516 = filter4_crdc_nces_1516.copy()

In [50]:
filter5_crdc_nces_1516.SCH_TYPE.value_counts()

1.0        17692
4.0         1040
Missing     1004
3.0          342
2.0           35
Name: SCH_TYPE, dtype: int64

In [51]:
filter5_crdc_nces_1516 = filter5_crdc_nces_1516[(filter5_crdc_nces_1516.SCH_TYPE == 'Missing') | (filter5_crdc_nces_1516.SCH_TYPE == 1) | (filter5_crdc_nces_1516.SCH_TYPE == 3)]

In [52]:
len(filter5_crdc_nces_1516.index)

19038

**Mini-Filter:  Remove schools with 'adult' in the Name (CRDC)**

In [53]:
filter5_crdc_nces_1516 = filter5_crdc_nces_1516[~filter5_crdc_nces_1516.SCH_NAME.str.contains('adult', case=False)]

In [54]:
len(filter5_crdc_nces_1516)

19012

# <font color = green> V. Dealing with Missing Values </font>
<div class = 'alert alert-cell alert-info'> With nearly 1200 schools missing NCES data, including schools from prominent districts like "NEW YORK CITY PUBLIC SCHOOLS" and "Green Dot Public Schools," it is important to try to recover as many of these schools as possible.
<br><br>
The problem that I found was that the CRDC lumped a number of school districts together; therefore, the combokeys of schools in these districts do not match those of the NCES.
</div>

<div class = 'alert alert-cell alert-info'>
**I tried a number of methods to try to properly join these missing schools:**<br>
- Using only the school name:  This was difficult because many schools share the same name, so a join gives each of these schools the values of every other school with that name (i.e. it creates many duplicate rows).
- Using the NCES data from 2013:  This was also problematic, as most of the schools missing from this dataset were also missing from the 2013-2014 dataset.<br>
- Using the District and the name together:  This also suffered from the fact that the CRDC data combines some school districts; therefore, the names of the districts still did not match up.<br>
- **Finally, I used a combination of the name of the school and the state:  Only a handful of name-and-state combinations were duplicated among the schools with missing values.**<br><br>
</div>
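
The name-plus-state approach can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's exact procedure: it sets aside keys that are ambiguous on either side before joining, whereas the notebook joins first and removes duplicates by hand afterwards. `add_name_state_key` and the key column name `SCH_NAME_ST` are hypothetical, and the rows are toy data.

```python
import pandas as pd

def add_name_state_key(df, name_col, state_col, key_col="SCH_NAME_ST"):
    """Add a lowercase school name + state abbreviation key."""
    out = df.copy()
    out[key_col] = out[name_col].str.lower() + out[state_col]
    return out

missing = add_name_state_key(pd.DataFrame({
    "SCH_NAME": ["Beacon High School", "Performance Learning Center",
                 "Performance Learning Center"],
    "LEA_STATE": ["NY", "GA", "GA"],
}), "SCH_NAME", "LEA_STATE")

nces = add_name_state_key(pd.DataFrame({
    "SCH_NAME": ["BEACON HIGH SCHOOL", "Performance Learning Center"],
    "STABR": ["NY", "GA"],
    "LEVEL": ["3", "3"],
}), "SCH_NAME", "STABR")

# Keys occurring more than once on either side cannot be matched unambiguously.
dup_missing = missing.loc[missing["SCH_NAME_ST"].duplicated(keep=False), "SCH_NAME_ST"]
dup_nces = nces.loc[nces["SCH_NAME_ST"].duplicated(keep=False), "SCH_NAME_ST"]
ambiguous = set(dup_missing) | set(dup_nces)

safe = missing[~missing["SCH_NAME_ST"].isin(ambiguous)]
joined = safe.merge(nces[["SCH_NAME_ST", "LEVEL"]], on="SCH_NAME_ST", how="left")
print(joined[["SCH_NAME_ST", "LEVEL"]])
```

The ambiguous keys can then be reviewed manually, which mirrors the hand-checking described later in this section.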

<div class = 'alert alert-cell alert-warning'>
**821 (out of 1194)** Missing Schools were recovered using this method </div>

<div class = 'alert alert-cell alert-info'>
Next, I recovered the remaining schools in the 'New York City Public Schools District', because it was clear that they were simply missing due to an LEA reporting error in the CRDC data.  This process had two parts:<br>
- First, because most of these remaining New York schools seemed to have an incorrect LEAID, I used the school id and state abbreviation to create a unique identifier.<br>
- Second, I used the NCES database to manually search for the remaining schools and correct their combokeys.
</div>

<div class = 'alert alert-cell alert-warning'>
**36** More High Schools Recovered  </div>

<div class = 'alert alert-cell alert-info'>
I performed the same filtration steps that rely on NCES-provided fields on the recovered data.  Then, I hand-removed duplicate values by checking the original filtered data for matching records. </div>

<div class = 'alert alert-cell alert-warning'>
**468** Recovered High Schools Total  </div>

In [55]:
"""Which districts had the most missing schools?"""
with pd.option_context('display.max_rows', 1200):
    display(filter5_crdc_nces_1516[filter5_crdc_nces_1516.LEVEL == 'Missing'].groupby('LEA_NAME')['LEAID'].count().sort_values(ascending = False))

LEA_NAME
NEW YORK CITY PUBLIC SCHOOLS                                                               477
Green Dot Public Schools                                                                    11
NORMAN                                                                                       7
Dept. of Svs. for Children Youth & Their Families                                            5
OFFICE OF EDUCATION DEPARTMENT OF CHILDREN AND FAMILIES                                      4
Ombudsman Educational Services Ltd. a subsidiary of Educ 2                                   4
TULSA                                                                                        3
Boston                                                                                       3
Cherokee County                                                                              3
Clayton County                                                                               3
Coweta County                            

In [56]:
filter5_missing_leas = filter5_crdc_nces_1516[filter5_crdc_nces_1516.LEVEL == 'Missing'].groupby('LEA_NAME')['LEAID'].count().sort_values(ascending = False)

In [57]:
# filter5_missing_leas.to_csv('../filtered_data/04_inital_filter_missing_LEAs.csv')

In [58]:
"""How many missing schools?"""
filter5_missing_schools = filter5_crdc_nces_1516[filter5_crdc_nces_1516.LEVEL == 'Missing']
len(filter5_missing_schools.index)

989

In [59]:
# filter5_missing_schools.to_csv('../filtered_data/04_intital_filter_missing_schools.csv')

**Manipulate missing schools and original nces data --> join**

In [60]:
filter5_schname_state = filter5_missing_schools.copy()

In [61]:
filter5_schname_state = filter5_schname_state.reset_index()

In [62]:
filter5_schname_state['SCH_NAME'] = filter5_schname_state['SCH_NAME'].apply(lambda x: x.lower())
filter5_schname_state['SCH_NAME_ST_NUM'] = filter5_schname_state.SCH_NAME + filter5_schname_state.LEA_STATE

In [63]:
"""How many duplicate schools in the filter5 dataset?"""
filter5_schname_state.groupby('SCH_NAME_ST_NUM')['SCH_NAME_ST_NUM'].count().sort_values(ascending = False).head(10)

SCH_NAME_ST_NUM
community collaborative charterCA                                 2
harlem village academies highNY                                   2
performance learning centerGA                                     2
yuba city charterCA                                               1
emma lazarus high schoolNY                                        1
esperanza prepatory academyNY                                     1
escuela popular/center for training and careers, family lrngCA    1
escuela popular accelerated family learning center (k-8)CA        1
escondido charter highCA                                          1
erie 2-chautauqua-cattaraugus boces @ iroquoisNY                  1
Name: SCH_NAME_ST_NUM, dtype: int64

In [64]:
filter5_schname_state[filter5_schname_state.SCH_NAME_ST_NUM == 'performance learning centerGA']

Unnamed: 0,COMBOKEY,LEA_STATE,LEA_STATE_NAME,LEAID,LEA_NAME,SCHID,SCH_NAME,JJ,SCH_GRADE_PS,SCH_GRADE_KG,...,LEVEL,VIRTUAL,GSLO,GSHI,NMCNTY15,LOCALE15,LAT1516,LON1516,Students_in_11_12,SCH_NAME_ST_NUM
301,='130129003727',GA,GEORGIA,1301290,Cobb County,3727,performance learning center,No,No,No,...,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Yes,performance learning centerGA
313,='130270003728',GA,GEORGIA,1302700,Harris County,3728,performance learning center,No,No,No,...,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Yes,performance learning centerGA


In [65]:
nces_1516_schname_state = nces_1516_full.copy()

In [66]:
nces_1516_schname_state = nces_1516_schname_state.reset_index()

In [67]:
nces_1516_schname_state['SCH_NAME'] = nces_1516_schname_state['SCH_NAME'].apply(lambda x: x.lower())
nces_1516_schname_state['SCH_NAME_ST_NUM'] = nces_1516_schname_state.SCH_NAME + nces_1516_schname_state.STABR

In [68]:
"""Join the NCES and filter5 datasets on the SCH_NAME_ST_NUM column"""
schname_combined = filter5_schname_state.set_index('SCH_NAME_ST_NUM').join(nces_1516_schname_state.set_index('SCH_NAME_ST_NUM'), how = 'left', rsuffix = '_')

In [69]:
"""How many schools have duplicated values?"""
schname_combined.SCH_NAME_.value_counts().sort_values(ascending = False).head(10)

tarrant co j j a e p               6
community collaborative charter    4
hart el                            2
accelerated achievement academy    2
university high                    2
beacon high school                 2
performance learning center        2
dewitt clinton high school         1
brooklyn latin school (the)        1
julian charter                     1
Name: SCH_NAME_, dtype: int64

In [70]:
"""How many more schools were matched?"""
len(schname_combined[schname_combined.SCH_NAME_.notnull()].index)

688

In [71]:
"""How many schools still did not have a match?"""
len(schname_combined[schname_combined.SCH_NAME_.isnull()].index)

312

## Recover the NY Schools

In [72]:
schname_combined_missing = schname_combined.copy()
schname_combined_missing = schname_combined_missing[schname_combined_missing.SCH_NAME_.isnull()]

schname_combined_missing_ny = schname_combined_missing.copy()
schname_combined_missing_ny = schname_combined_missing_ny[schname_combined_missing_ny['LEA_NAME'] == 'NEW YORK CITY PUBLIC SCHOOLS']

In [73]:
print(len(schname_combined_missing_ny.index))
print(schname_combined_missing_ny.SCHID.nunique())

22
22


In [74]:
schname_combined_missing_ny = schname_combined_missing_ny.drop(['TITLEI_', 'STABR_', 'SCH_TYPE_TEXT_', 'SCH_TYPE_',
                                                                'LEVEL_', 'VIRTUAL_', 'GSLO_', 'GSHI_', 
                                            'NMCNTY15_', 'LOCALE15_', 'LAT1516_', 'LON1516_', 'combokey',
                                            'LEAID_', 'LEA_NAME_', 'SCH_NAME_', 'SCHID_'], axis = 1)

In [75]:
def schid_state_maker(schid, state):
    schid = str(schid).zfill(5)
    return schid + state

In [76]:
schname_combined_missing_ny['schid_state'] = schname_combined_missing_ny.apply(lambda row: schid_state_maker(row['SCHID'], row['LEA_STATE']), axis = 1)

In [77]:
nces_for_missing_ny = nces_1516_full.copy()

nces_for_missing_ny['schid_state'] = nces_for_missing_ny.apply(lambda row: schid_state_maker(row['SCHID'], row['STABR']), axis = 1)

In [78]:
missing_ny_joined = schname_combined_missing_ny.set_index('schid_state').join(nces_for_missing_ny.reset_index().set_index('schid_state'), how = 'left', rsuffix = "_")

In [79]:
""" Join the missing NY schools with NCES """
missing_ny_joined[missing_ny_joined.LEVEL_.notnull()][['SCH_NAME','SCH_NAME_']]

Unnamed: 0_level_0,SCH_NAME,SCH_NAME_
schid_state,Unnamed: 1_level_1,Unnamed: 2_level_1
01409NY,"law, government and community service high school",LAW GOVERNMENT AND COMMUNITY SERVICE HIGH SCHOOL
02961NY,"bronx school for law, government and justice",BRONX SCHOOL FOR LAW GOVERNMENT AND JUSTICE
03091NY,"high school of enterprise, business & technology",HIGH SCHOOL OF ENTERPRISE BUSINESS & TECHNOLOGY
04873NY,"new explorations into science,tech and math hi...",NEW EXPLORATIONS INTO SCIENCETECH AND MATH HIG...
05113NY,"high school for law, advocacy and community ju...",HIGH SCHOOL FOR LAW ADVOCACY AND COMMUNITY JUS...
05516NY,"science, tech & research high school at erasmus",SCIENCE TECH & RESEARCH HIGH SCHOOL AT ERASMUS
05521NY,ms 223 laboratory school of finance and techno...,MS 223 LABORATORY SCHOOL OF FINANCE AND TECHNO...
05536NY,"queens high school of teaching, liberal arts a...",QUEENS HIGH SCHOOL OF TEACHING LIBERAL ARTS AN...
05677NY,"marie curie high sch-nursing, medicine & appli...",MARIE CURIE HIGH SCH-NURSING MEDICINE & APPLIE...
05774NY,"high school for arts, imagination and inquiry",HIGH SCHOOL FOR ARTS IMAGINATION AND INQUIRY


In [80]:
""" Dealing with remaining missing NY Schools """
missing_ny_2 = missing_ny_joined.copy()
missing_ny_2 = missing_ny_2[missing_ny_2.LEVEL_.isnull()]

len(missing_ny_2.index)

5

In [81]:
missing_ny_2 = missing_ny_2.drop(['TITLEI_', 'STABR_', 'SCH_TYPE_TEXT_', 'SCH_TYPE_',
                   'LEVEL_', 'VIRTUAL_', 'GSLO_', 'GSHI_', 
                   'NMCNTY15_', 'LOCALE15_', 'LAT1516_', 'LON1516_', 'combokey',
                   'LEAID_', 'LEA_NAME_', 'SCH_NAME_', 'SCHID_'], axis = 1)

In [82]:
missing_ny_2['actual_combokey'] = pd.Series(np.resize(0, len(missing_ny_2.index)), dtype = object)

# missing_ny_2.at["99780NY", 'actual_combokey'] = "='360012306528'"
# missing_ny_2.at["99796NY", 'actual_combokey'] = "='360012306535'"
# missing_ny_2.at["99775NY", 'actual_combokey'] = "='360012006484'"
# missing_ny_2.at["99776NY", 'actual_combokey'] = "='360010106508'"
# missing_ny_2.at["99805NY", 'actual_combokey'] = "='360008306490'"
missing_ny_2.at["99874NY", 'actual_combokey'] = "='360007706372'"
missing_ny_2.at["99933NY", 'actual_combokey'] = "='360008106380'"
missing_ny_2.at["99968NY", 'actual_combokey'] = "='360007606296'"
missing_ny_2.at["99992NY", 'actual_combokey'] = "='360009706274'"
missing_ny_2.at["99995NY", 'actual_combokey'] = "='360009506273'"

In [83]:
""" Join again on the NCES """
missing_ny_2_joined = missing_ny_2.set_index('actual_combokey').join(nces_1516_full, how = 'left', rsuffix = '_')

In [84]:
"""How many matched?"""
len(missing_ny_2_joined[missing_ny_2_joined.LEVEL_.notnull()].index)

5

## Combine recovered schools and perform filters

**Concatenate the two recovered Missing NY Schools sets**

In [85]:
missing_ny_joined_matching = missing_ny_joined[missing_ny_joined.LEVEL_.notnull()]

In [86]:
all_missing_ny_recovered = pd.concat([missing_ny_2_joined, missing_ny_joined_matching])  # DataFrame.append was removed in pandas 2.0

**Join the original recovered schools (using schname_st identifier) with the recovered NY schools**

In [87]:
recovered_schools = schname_combined.copy()
recovered_schools = recovered_schools.fillna("Missing")

In [88]:
recovered_schools = recovered_schools[recovered_schools['SCH_NAME_'] != "Missing"]

In [89]:
recovered_schools_all = pd.concat([recovered_schools, all_missing_ny_recovered])

**Reformat the Columns**: Need to make sure that the recovered schools dataset's columns match the original filtered dataset's columns (required for concatenating the two sets properly)
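
Comparing the number of columns confirms that the two frames have the same width, but not that the names agree. A stricter check, sketched below with a hypothetical helper `column_mismatch` and toy column names, lists the columns that appear in only one of the two frames.

```python
import pandas as pd

def column_mismatch(left, right):
    """Return the columns present in only one of the two frames."""
    only_left = sorted(set(left.columns) - set(right.columns))
    only_right = sorted(set(right.columns) - set(left.columns))
    return only_left, only_right

a = pd.DataFrame(columns=["LEAID", "SCH_NAME", "LEVEL"])
b = pd.DataFrame(columns=["LEAID", "SCH_NAME", "LEVEL_"])

only_a, only_b = column_mismatch(a, b)
print(only_a, only_b)
```

Two empty lists would indicate that the frames can be concatenated without producing extra, partially empty columns.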

In [90]:
"""Drop original nces columns (the ones with missing values)"""    
recovered_schools_all = recovered_schools_all.drop(['TITLEI', 'STABR', 'SCH_TYPE_TEXT', 'SCH_TYPE', 'LEVEL', 'VIRTUAL', 'GSLO', 'GSHI', 
                                            'NMCNTY15', 'LOCALE15', 'LAT1516', 'LON1516', 'combokey',
                                            'LEAID_', 'LEA_NAME_', 'SCH_NAME_', 'SCHID_'], axis = 1)
"""Rename new matching columns to replace the columns above (necessary for a proper concatenation later)"""
recovered_schools_all = recovered_schools_all.rename(lambda x: x.strip('_'), axis = 'columns')
recovered_schools_all = recovered_schools_all.set_index('COMBOKEY')

In [91]:
"""Do the columns between the original filtered set and recovered missing values set match"""
print(len(recovered_schools_all.columns.values))
print(len(filter5_crdc_nces_1516.columns.values))

123
123


In [92]:
""" How many schools recovered? """
len(recovered_schools_all.index)

710

In [93]:
"""Store the recovered schools for use in other jupyter notebooks (like 04_Filtered_School_Analysis.csv)"""
%store recovered_schools_all

Stored 'recovered_schools_all' (DataFrame)


**Non-Virtual Schools**

In [94]:
recovered_schools_filter1 = recovered_schools_all.copy()

In [95]:
filtered_out_recovered = recovered_schools_filter1[recovered_schools_filter1.VIRTUAL == 'Yes']
recovered_schools_filter1 = recovered_schools_filter1[recovered_schools_filter1.VIRTUAL != 'Yes']

In [96]:
"""How many schools remain?"""
len(recovered_schools_filter1.index)

697

**NCES-Reported High Schools**

In [97]:
recovered_schools_filter2 = recovered_schools_filter1.copy()

In [98]:
filtered_out_recovered = pd.concat([filtered_out_recovered,
                                    recovered_schools_filter2[(recovered_schools_filter2.LEVEL=='1') |
                                                              (recovered_schools_filter2.LEVEL=='2') |
                                                              (recovered_schools_filter2.LEVEL=='N')]])
recovered_schools_filter2 = recovered_schools_filter2[(recovered_schools_filter2.LEVEL == '3') |
                                                      (recovered_schools_filter2.LEVEL == '4')]

In [99]:
"""How many schools remain?"""
len(recovered_schools_filter2.index)

692

**NCES-Reported Regular / Vocational**

In [100]:
recovered_schools_filter3 = recovered_schools_filter2.copy()

In [101]:
recovered_schools_filter3.SCH_TYPE.value_counts()

1.0    648
4.0     27
3.0     17
Name: SCH_TYPE, dtype: int64

In [102]:
filtered_out_recovered = pd.concat([filtered_out_recovered,
                                    recovered_schools_filter3[(recovered_schools_filter3.SCH_TYPE == 2) |
                                                              (recovered_schools_filter3.SCH_TYPE == 4)]])
recovered_schools_filter3 = recovered_schools_filter3[(recovered_schools_filter3.SCH_TYPE == 1) |\
                                                      (recovered_schools_filter3.SCH_TYPE == 3)]

In [103]:
"""How many schools remain?"""
len(recovered_schools_filter3.index)

665

**Remove Schools with 'Adult' in the Name**

In [104]:
filtered_out_recovered = pd.concat([filtered_out_recovered, recovered_schools_filter3[recovered_schools_filter3.SCH_NAME.str.contains('Adult', case=False)]])
recovered_schools_filter3 = recovered_schools_filter3[~recovered_schools_filter3.SCH_NAME.str.contains('Adult', case=False)]

In [105]:
len(filtered_out_recovered)

45

**Clean Duplicate Values**

In [106]:
"""NOTE: the community collaborative charter duplication appears to be legitimate (two campuses of the same school?)"""
recovered_schools_filter3.groupby('SCH_NAME')['SCH_NAME'].count().sort_values(ascending = False).head(5)

In [107]:
"""Beacon High School in Dutchess County is already in the filter5 dataset -- Remove"""
recovered_schools_filter4 = recovered_schools_filter3.copy()
recovered_schools_filter4 = recovered_schools_filter4[(recovered_schools_filter4.SCH_NAME != 'beacon high school') | (recovered_schools_filter4.NMCNTY15 != 'Dutchess County')]

In [108]:
"""Both of the performance learning centers here actually matched to a different 'performance learning center' record;
therefore, they should both be removed"""
recovered_schools_filter4 = recovered_schools_filter4[recovered_schools_filter4.SCH_NAME != 'performance learning center']

In [109]:
"""The University High in Irvine was already accounted for, so it needs to be removed from the recovered schools"""
recovered_schools_filter4 = recovered_schools_filter4[(recovered_schools_filter4.SCH_NAME != 'university high') | (recovered_schools_filter4.NMCNTY15 != 'Orange County')]

In [110]:
"""How many final recovered values?"""
len(recovered_schools_filter4.index)

661
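
The removals above drop a single school by combining two `!=` tests with `|`. By De Morgan's law this is the same as negating "name matches AND county matches", which may be easier to read. A small sketch on toy rows (the rows, including the 'Kings County' entry, are illustrative assumptions and not records from the dataset):

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    'SCH_NAME': ['beacon high school', 'beacon high school', 'university high'],
    'NMCNTY15': ['Dutchess County', 'Kings County', 'Orange County'],
})

# The row to remove: name AND county both match.
is_target = (toy.SCH_NAME == 'beacon high school') & (toy.NMCNTY15 == 'Dutchess County')

# Form 1: negate the combined match.
via_not = toy[~is_target]

# Form 2: the style used in the cells above (name differs OR county differs).
via_or = toy[(toy.SCH_NAME != 'beacon high school') | (toy.NMCNTY15 != 'Dutchess County')]

print(via_not.equals(via_or))
```

Both forms keep the same-named school in the other county, which is why the county condition is needed in addition to the name.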

In [111]:
"""Store the final recovered_schools to be filtered-in/out"""
%store recovered_schools_filter4
%store filtered_out_recovered

Stored 'recovered_schools_filter4' (DataFrame)
Stored 'filtered_out_recovered' (DataFrame)


# <font color = green> VI. Concatenating Recovered Missing Values with the original Filtered Dataset </font>
<div class = 'alert alert-cell alert-info'> Finally, I concatenated the recovered high schools with the original filtered set.<br><br>

I ensured that no duplicate values were added in the process.

I then saved the result to "../filtered_data/04_filter_final.csv". </div>
<div class = 'alert alert-cell alert-warning'>
Final Total:  **15725 High Schools**
</div>

In [112]:
"""Remove the missing values"""
filter6_crdc_nces_1516 = filter5_crdc_nces_1516.copy()
filter6_crdc_nces_1516 = filter6_crdc_nces_1516[filter6_crdc_nces_1516.LEVEL != "Missing"]

In [113]:
"""How many initial duplicates?
Interestingly enough, these duplicates appear to be legitimate; the problem seems to be that the schools actually have 
different names (e.g. the two "ADAIR CO. HIGH" records are actually supposed to be labeled ADAIR Co. R-I High and ADAIR Co. R-II BRASHEAR)"""
filter6_crdc_nces_1516.groupby(['STABR','SCH_NAME','NMCNTY15'])['SCH_NAME'].count().sort_values(ascending=False).head()

STABR  SCH_NAME           NMCNTY15        
TX     TAYLOR H S         Harris County       2
       LEE H S            Harris County       2
       STERLING H S       Harris County       2
MO     ADAIR CO. HIGH     Adair County        2
KS     Smoky Valley High  McPherson County    1
Name: SCH_NAME, dtype: int64

In [114]:
"""Any duplications in the recovered schools?
    The community collaborative charter schools are two different schools."""
recovered_schools_filter4.groupby(['STABR','SCH_NAME','NMCNTY15'])['SCH_NAME'].count().sort_values(ascending=False).head()

STABR  SCH_NAME                                   NMCNTY15         
CA     community collaborative charter            Sacramento County    2
TX     ischool high of hickory creek              Denton County        1
NY     arts and media preparatory academy         Kings County         1
       baccalaureate school for global education  Queens County        1
       aviation career and technical high school  Queens County        1
Name: SCH_NAME, dtype: int64

In [115]:
# DataFrame.append was removed in pandas 2.0; sort=True keeps the alphabetical column sort of the original append
filtered_and_recovered = pd.concat([filter6_crdc_nces_1516, recovered_schools_filter4], sort=True)

In [116]:
"""Do the numbers of columns match?"""
print(len(filter6_crdc_nces_1516.columns.values))
len(filtered_and_recovered.columns.values)

123


123
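
The count comparison above would catch an extra column introduced by the recovered schools, but it would not reveal a column that exists only in the original frame (those cells are silently filled with NaN). A hedged sketch of a stricter check using set differences, on toy frames whose names and values are illustrative assumptions:

```python
import pandas as pd

# Toy frames for illustration only; `right` is missing the LEVEL column.
left = pd.DataFrame({'SCH_NAME': ['a'], 'STABR': ['TX'], 'LEVEL': ['3']})
right = pd.DataFrame({'SCH_NAME': ['b'], 'STABR': ['NY']})

combined = pd.concat([left, right], sort=True)

# The count-based check passes even though the frames do not line up.
count_check_passes = len(left.columns) == len(combined.columns)

# Set differences name the exact columns that differ.
missing_from_right = set(left.columns) - set(right.columns)
extra_in_right = set(right.columns) - set(left.columns)

print(count_check_passes, missing_from_right, extra_in_right)
```

Applied to the notebook's frames, the same two set differences on `filter6_crdc_nces_1516.columns` and `recovered_schools_filter4.columns` should both be empty if the schemas truly match.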

In [117]:
"""When the two frames' columns are not in the same order, Pandas sorts the columns alphabetically 
on an append/concatenation.  I reordered the columns to show SCH_NAME first"""
schName = ['SCH_NAME']
reorder = schName + [c for c in filtered_and_recovered.columns if c not in schName]
filtered_and_recovered = filtered_and_recovered[reorder]

In [118]:
"""No added duplicate records"""
filtered_and_recovered.groupby(['STABR','SCH_NAME','NMCNTY15'])['SCH_NAME'].count().sort_values(ascending=False).head(6)

STABR  SCH_NAME                            NMCNTY15         
CA     community collaborative charter     Sacramento County    2
TX     STERLING H S                        Harris County        2
       TAYLOR H S                          Harris County        2
MO     ADAIR CO. HIGH                      Adair County         2
TX     LEE H S                             Harris County        2
KS     Victoria Junior-Senior High School  Ellis County         1
Name: SCH_NAME, dtype: int64
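
The `groupby(...).count()` check above shows how many rows share a state/name/county key. To inspect the full duplicated records side by side, `DataFrame.duplicated` with `keep=False` is one option. A minimal sketch on toy rows; the `COMBOKEY` values here are made up for illustration:

```python
import pandas as pd

key = ['STABR', 'SCH_NAME', 'NMCNTY15']

# Toy rows for illustration only; COMBOKEY values are hypothetical.
toy = pd.DataFrame({
    'STABR': ['TX', 'TX', 'MO', 'KS'],
    'SCH_NAME': ['LEE H S', 'LEE H S', 'ADAIR CO. HIGH', 'Smoky Valley High'],
    'NMCNTY15': ['Harris County', 'Harris County', 'Adair County', 'McPherson County'],
    'COMBOKEY': ['0001', '0002', '0003', '0004'],
})

# keep=False flags every member of a duplicated group, not just the repeats.
dupes = toy[toy.duplicated(subset=key, keep=False)].sort_values(key)

print(dupes)
```

Seeing the other columns for each flagged row makes it easier to judge whether a repeated name is a true duplicate or two distinct schools, as with the Harris County and Adair County cases.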

In [119]:
"How many total high schools in the set?"
len(filtered_and_recovered.index)

18684

In [120]:
# filtered_and_recovered.to_csv('../filtered_data/04_filter_final.csv')

# Final Missing Schools
<div class = 'alert alert-cell alert-info'>**348 Schools**<br> Saved to '04_final_missing.csv'</div>

In [121]:
final_missing = schname_combined[(schname_combined.SCH_NAME_.isnull()) & (schname_combined.LEA_NAME != 'NEW YORK CITY PUBLIC SCHOOLS')]

In [122]:
""" How many final missing schools? """
len(final_missing.index)

290

In [123]:
# final_missing.to_csv('../filtered_data/04_final_missing.csv')

In [124]:
""" Top remaining unaccounted districts """
final_missing.groupby('LEA_NAME')['LEAID'].count().sort_values(ascending = False).head(10)

LEA_NAME
NORMAN                                                     7
Dept. of Svs. for Children Youth & Their Families          5
OFFICE OF EDUCATION DEPARTMENT OF CHILDREN AND FAMILIES    4
ERIE 2-CHAUTAUQUA-CATTARAUGUS BOCES                        3
Clayton County                                             3
NASSAU BOCES                                               3
Cherokee County                                            3
WINDSOR SCHOOL DISTRICT                                    3
Boston                                                     3
TULSA                                                      3
Name: LEAID, dtype: int64