# Create 2018-2019 School Datasets
### This program uses all flattened raw datasets to create the school dataset files within the NCEA repository.

1. This notebook reads raw dataset .csv files directly from the \EducationDataNC\2018\Raw Datasets folder.
2. Each raw dataset is transformed to contain only one record per public school campus or unique agency_code.
3. Many raw datasets have more than one record per campus per year. In these cases, table pivots create new columns from row-level entries and reduce each dataset to one record per school. This adds many new columns to the flattened dataset (see the code below for more details).
4. School datasets merge all flattened files into one dataset with one record per agency_code.
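
The pivot described in step 3 can be sketched as follows. This is a minimal illustration rather than the notebook's actual flattening code: the `subgroup` and `pct` columns, the agency codes, and the values are all hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: several rows per school (agency_code).
raw = pd.DataFrame({
    "agency_code": ["010303", "010303", "010304", "010304"],
    "subgroup": ["ALL", "EDS", "ALL", "EDS"],
    "pct": [55.0, 40.0, 62.5, 48.0],
})

# Pivot row-level subgroup entries into columns, leaving one row per agency_code.
flat = raw.pivot_table(index="agency_code", columns="subgroup",
                       values="pct", aggfunc="first")

# Give the new columns descriptive, lowercase names, then restore agency_code
# as an ordinary column so the table can be merged on it later.
flat.columns = [f"pct_{c.lower()}" for c in flat.columns]
flat = flat.reset_index()

print(flat)
```

Each distinct `subgroup` value becomes its own column, which is why the flattened datasets end up with so many columns.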

In [1]:
#import required Libraries
import pandas as pd
import numpy as np
import os
import string

#**********************************************************************************
# Set the following variables before running this code!!!
#**********************************************************************************

#'C:/Users/Jake/Documents/GitHub/EducationDataNC/2018/
dirPath = 'D:/BenepactLLC/Belk/NC_Report_Card_Data/2020/October 2020/2019/'

#Location where copies of the raw data files will be read in from csv files.
# 'C:/Users/Jake/Documents/GitHub/EducationDataNC/2018/Raw Datasets/'
dataDir = dirPath + 'Raw Datasets/Flattened Datasets/'

#Location where the new school datasets will be created.
# 'C:/Users/Jake/Documents/GitHub/EducationDataNC/2018/School Datasets/'
outputDir = dirPath + 'School Datasets/'

#All raw data files are processed for the year below
schoolYear = 2019

# Read in the Raw Data Files
### This section reads raw data files directly from the \Raw Datasets folder.

* The file input location is specified by the dataDir parameter.
* The file output location is specified by the outputDir parameter.
* The schoolYear parameter specifies the school year to process.

# A List of All Files Processed

In [2]:
#Use wildcards to find files in a directory
import os
import glob
import ntpath

#Get a list of all .csv file names in dataDir
rcdFiles = glob.glob(dataDir + '*.csv')

rcdFileNames = [os.path.splitext(ntpath.basename(x))[0] for x in rcdFiles]

print('A List of File Names and Record Counts for Processing:\n')

#Create dataframes for each file
for fileName in rcdFileNames:
    #create one dataframe for each .csv file in rcdFileNames
    exec(fileName + ' = pd.read_csv("' + dataDir + '" + "' + fileName + '" + ".csv", low_memory=False, dtype={"agency_code": object})')  
    print(fileName + ', ' + str(len(eval(fileName).index)) )
    

A List of File Names and Record Counts for Processing:

ec_pupils_pct, 2504
ratio, 2609
rcd_acc_aapart, 56
rcd_acc_act, 627
rcd_acc_awa, 583
rcd_acc_cgr, 627
rcd_acc_eds, 2654
rcd_acc_eg, 2548
rcd_acc_elp, 1774
rcd_acc_essa_desig, 2654
rcd_acc_gp, 596
rcd_acc_irm, 1278
rcd_acc_lowperf, 2654
rcd_acc_ltg, 2525
rcd_acc_ltg_detail, 2665
rcd_acc_mcr, 610
rcd_acc_part, 2537
rcd_acc_part_detail, 2537
rcd_acc_pc, 2596
rcd_acc_rta, 1462
rcd_acc_spg2, 2543
rcd_acc_wk, 404
rcd_adm, 2647
rcd_ap, 468
rcd_arts, 2504
rcd_charter, 270
rcd_chronic_absent, 2612
rcd_college, 602
rcd_course2, 664
rcd_cte_concentrators, 553
rcd_cte_credentials, 533
rcd_cte_endorsement, 547
rcd_cte_enrollment, 1189
rcd_dlmi, 2698
rcd_effectiveness3, 2623
rcd_eq, 2659
rcd_funds2, 2456
rcd_ib, 30
rcd_improvement2, 109
rcd_inc2, 2602
rcd_location, 2702
rcd_pk_enroll, 889
rcd_readiness, 1249
rcd_sar, 2542
rcd_sat, 578
rcd_welcome, 1171
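
The cell above creates one variable per file with `exec` and `eval`. A possible alternative, sketched here as an assumption and not as the notebook's method, keeps the dataframes in a dictionary keyed by file name. The helper `load_csv_folder` and the demo file `demo_table.csv` are hypothetical names used only for illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csv_folder(data_dir):
    """Return {file name without extension: DataFrame} for every .csv in data_dir."""
    frames = {}
    for path in sorted(glob.glob(os.path.join(data_dir, "*.csv"))):
        name = os.path.splitext(os.path.basename(path))[0]
        # Read agency_code as text so leading zeros are preserved.
        frames[name] = pd.read_csv(path, low_memory=False,
                                   dtype={"agency_code": object})
    return frames


# Demo with a throwaway folder holding one hypothetical file.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"agency_code": ["010303"], "x": [1]})
    demo.to_csv(os.path.join(tmp, "demo_table.csv"), index=False)

    frames = load_csv_folder(tmp)
    for name, df in frames.items():
        print(f"{name}, {len(df)}")
```

With this approach, later cells would look up `frames[fileName]` instead of calling `eval(fileName)`.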


# Merge all datasets to one master dataset with one record per school
* **Starting with the location table we left outer join on agency_code, merging data from each reshaped table into one master record.**
* **The report below ensures that merges by location result in one unique record per public school campus.**
* **This report also shows changes to the final dataset's column and row counts as each flattened raw dataset is merged into the final Public School Datasets.**

In [3]:
rcd_location.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2702 entries, 0 to 2701
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   agency_code           2702 non-null   object 
 1   category_code_loc     2701 non-null   object 
 2   agency_level_loc      2702 non-null   object 
 3   lea_code_loc          2701 non-null   object 
 4   designation_type_loc  2701 non-null   object 
 5   name_loc              2702 non-null   object 
 6   county_loc            2701 non-null   object 
 7   street_addr_loc       2701 non-null   object 
 8   stree_addr2_loc       30 non-null     object 
 9   city_loc              2701 non-null   object 
 10  state_loc             2701 non-null   object 
 11  zip_loc               2701 non-null   float64
 12  phone_loc             2698 non-null   object 
 13  grade_span_loc        2701 non-null   object 
 14  school_type_loc       2701 non-null   object 
 15  calendar_type_loc    

In [4]:
#Make a copy of a variable (by value) using copy() or deepcopy()
import copy 

#Remove state and district level location records before performing campus level merges
rcd_location = rcd_location[(rcd_location['agency_code'] != 'NC-SEA') & 
                            (rcd_location['agency_code'].str.contains("LEA") == False) & 
                            (rcd_location['agency_code'] != 'NAT')
                           ]

#Do not merge rcd_location with itself; it is the base table for all merges
mergeFileNames = copy.deepcopy(rcdFileNames)
mergeFileNames.remove('rcd_location')


print('*********************************Start: RCD Location Data*********************************')
rcd_location.info(verbose=False)

for fileName in mergeFileNames:
    rcd_location = rcd_location.merge(eval(fileName),how='left',on='agency_code', suffixes=('', '_Drop'))
    print('*********************************After: ' + fileName + '**************************')
    rcd_location.info(verbose=False)
    

#Rename final merged rcd file! 
PublicSchools = rcd_location

#Delete all of the duplicate / overlapping columns 
#i.e. When two tables have columns with identical names, the column from the table inside the merge() is deleted.
dropCols = [x for x in PublicSchools.columns if x.endswith('_Drop')]
PublicSchools = PublicSchools.drop(dropCols, axis=1)

#Delete any masking columns that were missed. 
dropCols = [x for x in PublicSchools.columns if x.endswith('_masking')]
PublicSchools = PublicSchools.drop(dropCols, axis=1)

print('*********************************After: Deleting Duplicated Columns*********')
PublicSchools.info(verbose=False)

*********************************Start: RCD Location Data*********************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2702 entries, 0 to 2701
Columns: 23 entries, agency_code to street_addr2_loc
dtypes: float64(2), object(21)
memory usage: 506.6+ KB
*********************************After: ec_pupils_pct**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2702 entries, 0 to 2701
Columns: 51 entries, agency_code to two_or_more_female_pct
dtypes: float64(28), object(23)
memory usage: 1.1+ MB
*********************************After: ratio**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2702 entries, 0 to 2701
Columns: 67 entries, agency_code to ada_adm_rank_2017_thru_2019_attendance
dtypes: float64(35), object(32)
memory usage: 1.4+ MB
*********************************After: rcd_acc_aapart**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2702 entries, 0 to 2701
Columns: 87 entries, agency_co

*********************************After: rcd_cte_credentials**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2702 entries, 0 to 2701
Columns: 4415 entries, agency_code to cred_earned_pct_cte_cred
dtypes: float64(4369), object(46)
memory usage: 91.0+ MB
*********************************After: rcd_cte_endorsement**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2702 entries, 0 to 2701
Columns: 4420 entries, agency_code to global_language_cte_end_cte_enroll
dtypes: float64(4374), object(46)
memory usage: 91.1+ MB
*********************************After: rcd_cte_enrollment**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2702 entries, 0 to 2701
Columns: 4423 entries, agency_code to pct
dtypes: float64(4377), object(46)
memory usage: 91.2+ MB
*********************************After: rcd_dlmi**************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2702 entries, 0 to 2701
Columns: 4436 entries, 

### Standardize all Column Names
**All column names should be lowercase, with words separated by underscores.**

In [5]:
# Replace all spaces with underscores and convert column names to lowercase
PublicSchools.columns = [i.replace(' ','_').lower() for i in PublicSchools.columns]
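
To show the effect of this renaming, here is a short sketch on a dataframe with hypothetical column names. It also shows an equivalent form using the pandas string accessor.

```python
import pandas as pd

# Hypothetical column names mixing spaces and capital letters.
df = pd.DataFrame(columns=["Agency Code", "SPG Score", "name_loc"])

# Same approach as the cell above: replace spaces, then lowercase.
renamed = [c.replace(" ", "_").lower() for c in df.columns]

# Equivalent form using the Index string accessor.
renamed_vectorized = list(df.columns.str.replace(" ", "_").str.lower())

df.columns = renamed
print(list(df.columns))
```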

### Save the final school datasets. 

In [6]:
#Save the master file to disk
PublicSchools.to_csv(outputDir + 'PublicSchools' + str(schoolYear) + '.csv', sep=',', index=False)

print('*********************************All Public Schools****************************')
PublicSchools.info(verbose=False)

#Filter regular public high schools
HighSchools = PublicSchools[PublicSchools.category_code_loc.isin(['H','A','T'])]

#Save the file to disk
HighSchools.to_csv(outputDir + 'PublicHighSchools' + str(schoolYear) + '.csv', sep=',', index=False)

print('*********************************Regular Public High Schools*******************')
HighSchools.info(verbose=False)

#Filter regular public middle schools
MiddleSchools = PublicSchools[PublicSchools.category_code_loc.isin(['M','T','A','I'])]

#Save the file to disk
MiddleSchools.to_csv(outputDir + 'PublicMiddleSchools' + str(schoolYear) + '.csv', sep=',', index=False)

print('*********************************Regular Public Middle Schools******************')
MiddleSchools.info(verbose=False)

#Filter regular public elementary schools
ElementarySchools = PublicSchools[PublicSchools.category_code_loc.isin(['E','A','I'])]

#Save the file to disk
ElementarySchools.to_csv(outputDir + 'PublicElementarySchools' + str(schoolYear) + '.csv', sep=',', index=False)

print('*********************************Regular Public Elementary Schools**************')
ElementarySchools.info(verbose=False)

*********************************All Public Schools****************************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2702 entries, 0 to 2701
Columns: 4674 entries, agency_code to welcome_url
dtypes: float64(4616), object(58)
memory usage: 96.4+ MB
*********************************Regular Public High Schools*******************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 716 entries, 0 to 2700
Columns: 4674 entries, agency_code to welcome_url
dtypes: float64(4616), object(58)
memory usage: 25.5+ MB
*********************************Regular Public Middle Schools******************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 846 entries, 0 to 2700
Columns: 4674 entries, agency_code to welcome_url
dtypes: float64(4616), object(58)
memory usage: 30.2+ MB
*********************************Regular Public Elementary Schools**************
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1603 entries, 0 to 2687
Columns: 4674 entries, agency_code to welcome_url
dtypes: float
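
The three filters above share some category codes, so a school can appear in more than one output file. This is consistent with the counts above, where 716 + 846 + 1603 exceeds the 2702 rows in the full dataset. The sketch below reproduces the filters on a hypothetical six-row table. The code lists are copied from the cell above; the `LEVELS` dictionary and the sample agency codes are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical schools, one per category code used in the filters.
schools = pd.DataFrame({
    "agency_code": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "category_code_loc": ["E", "M", "H", "A", "T", "I"],
})

# Category codes per school level, as used in the notebook's filters.
LEVELS = {
    "High": ["H", "A", "T"],
    "Middle": ["M", "T", "A", "I"],
    "Elementary": ["E", "A", "I"],
}

# Build one subset per level with the same isin() filter.
subsets = {
    level: schools[schools["category_code_loc"].isin(codes)]
    for level, codes in LEVELS.items()
}

for level, subset in subsets.items():
    print(level, len(subset), list(subset["category_code_loc"]))
```

Anyone combining the level-specific files should deduplicate on `agency_code` first.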

# Data Columns Available in the Public School Dataset

In [7]:
PublicSchools.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2702 entries, 0 to 2701
Data columns (total 4674 columns):
 #    Column                                         Dtype  
---   ------                                         -----  
 0    agency_code                                    object 
 1    category_code_loc                              object 
 2    agency_level_loc                               object 
 3    lea_code_loc                                   object 
 4    designation_type_loc                           object 
 5    name_loc                                       object 
 6    county_loc                                     object 
 7    street_addr_loc                                object 
 8    stree_addr2_loc                                object 
 9    city_loc                                       object 
 10   state_loc                                      object 
 11   zip_loc                                        float64
 12   phone_loc                      