# Create NCDPI 2017-2018 Raw Datasets
### This program downloads all original datasets from www.ncpublicschools.org and saves them as .csv files. These data files are used to create all the flattened and machine learning datasets within the NCEA repository.

1. This notebook downloads raw datasets directly from NCDPI specific URLs.
2. Each raw dataset is filtered by school year and saved in the original layout as a .csv file.
3. For consistency, both the Year and School code fields are renamed to "year" and "agency_code" in all files.
4. All masking is removed from raw data fields using the following code: replace({"*":0, ">95":100, "<5":0, "<10":5 })
5. All * or carriage returns are removed from column names.
6. All raw datasets created by this program are used to create the "flattened" and "machine learning" Public School datasets.
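
Steps 4 and 5 can be sketched on a toy table as follows. This is a minimal illustration, not the notebook's actual cleaning code: the helper name `unmask` and the sample values are hypothetical, and only the replacement mapping comes from step 4.

```python
import pandas as pd

# Replacement mapping quoted in step 4
MASK_VALUES = {"*": 0, ">95": 100, "<5": 0, "<10": 5}

def unmask(df):
    """Replace masked cells (whole-cell matches only) and tidy column names."""
    cleaned = df.replace(MASK_VALUES)
    # Step 5: strip '*' and carriage returns / line feeds from column names
    cleaned.columns = [c.replace("*", "").replace("\r", "").replace("\n", "").strip()
                       for c in cleaned.columns]
    return cleaned

# Hypothetical sample; object dtype, as when masked text and numbers share a column
sample = pd.DataFrame({"agency_code": ["010303", "010304"],
                       "pct*\n": [">95", "<5"]}, dtype=object)
unmasked = unmask(sample)
print(unmasked)
```

Because the mapping matches whole cells, a value such as `"95*"` would pass through unchanged.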

In [1]:
#Run this to add the correct packages path to your Jupyter environment, if it is missing.
#import sys
#sys.path.append('C:/Users/Jake/Anaconda2/envs/example_env/Lib/site-packages')

In [2]:
#import required Libraries
import pandas as pd
import numpy as np

#**********************************************************************************
# Set the following variables before running this code!!!
#**********************************************************************************

#Location where copies of the raw data files will be downloaded and saved as csv files.
dataDir = 'D:/BenepactLLC/Belk/NC_Report_Card_Data/2020/June 2020/2018/Raw Datasets/'

#All raw data files are filtered for the year below
schoolYear = 2018

## Download and Save Copy of the Original SRC Data

In [3]:
import urllib.request

#Download and save an original copy of the raw SRC data 
url="http://www.ncpublicschools.org/docs/src/researchers/src-datasets.zip"
zipFilePath = dataDir + 'src-datasets.zip'

#Comment out the next line after downloading the original data one time! 
#urllib.request.urlretrieve(url, zipFilePath)

import zipfile

#Extract the zip file and all school datasets to the //Raw Datasets/ folder
zip_ref = zipfile.ZipFile(zipFilePath, 'r')
zip_ref.extractall(dataDir)
zip_ref.close()
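
The download-once instruction in the comments above could be automated along these lines. This is a sketch under assumptions: `fetch_and_extract` is a hypothetical helper, not part of the original notebook. It skips the download when the zip is already on disk and uses a context manager so the archive is closed even if extraction fails.

```python
import os
import urllib.request
import zipfile

def fetch_and_extract(url, zip_path, dest_dir):
    """Download url to zip_path only if it is missing, then extract into dest_dir."""
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(dest_dir)
        # Return the archive members so the caller can see what was extracted
        return zip_ref.namelist()
```

With the variables defined above, the call would look like `fetch_and_extract(url, zipFilePath, dataDir)`.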

## Notes on rcd_effectiveness.csv, Max Year: 2017
* The rcd_effectiveness.csv file for 2017-18 only had data up to 2017.  
* When the 2018-19 file was published, it had the data for 2017-18 as well. 
* So I moved the file with the correct data from 2018-19 to 2017-18 and republished all of the school and ML data files. 

## Notes on rcd_college.csv, Max Year: 2017
* The rcd_college.csv file for 2017-18 only had data up to 2017.  
* When the 2018-19 file was published, it had data for 2017-18 as well. 
* So I moved the file with the correct data from 2018-19 to 2017-18 and republished all of the school and ML data files. 
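
A quick check like the one below could flag files whose newest data lags the target school year, which is how the two cases above would surface. It is only a sketch: `files_behind` is a hypothetical helper, and the example tuples are illustrative values shaped like the `(fileName, maxYear)` pairs returned by `CleanUpRcdFiles`.

```python
def files_behind(file_years, school_year):
    """Return (file name, max year) pairs whose newest data predates school_year."""
    return [(name, int(year)) for name, year in file_years if int(year) < school_year]

# Illustrative values only, not real output
example_years = [('rcd_effectiveness.csv', 2017),
                 ('rcd_att.csv', 2018),
                 ('rcd_chronic_absent.csv', 2018.0)]

stale = files_behind(example_years, 2018)
print(stale)
```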

# Get the Most Recent Year of Data from Each File

In [4]:
# Update the dataDir path for this part
dataDir = dataDir + 'SRC_Datasets/'

In [5]:
#Use ntpath.basename to get a filename from a filepath
import ntpath

def CleanUpRcdFiles(filePath):
    fileName = ntpath.basename(filePath)
    schFile = pd.read_csv(filePath, dtype={'agency_code': object}, low_memory=False)
    maxYear = schFile['year'].max()
    
    #Filter records for the most recent year
    schFile = schFile[schFile['year'] == maxYear]
    
    #Remove state and district level summary records 
    #schFile = schFile[(schFile['agency_code'] != 'NC-SEA') & (schFile['agency_code'].str.contains("LEA") == False)]
        
    #Remove * character from any fields. 
    schFile = schFile.replace({'*':''})
    schFile.to_csv(dataDir + fileName, sep=',', index=False)
    
    print(fileName + ', Max Year: ' + str(maxYear))
    return (fileName, maxYear)
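
One detail of `replace({'*':''})` is worth illustrating: the dict form matches whole cells, so a `*` embedded in a longer value is left alone. The toy frame below is hypothetical and compares that form with a regex replacement that strips the character wherever it appears; which behavior is wanted for the SRC files is a judgment call.

```python
import pandas as pd

demo = pd.DataFrame({'pct': ['*', '95*', '42']})

# Dict form: only cells that are exactly '*' are replaced
whole_cell = demo.replace({'*': ''})

# Regex form: every '*' character is removed, wherever it appears in a cell
any_position = demo.replace(r'\*', '', regex=True)

print(whole_cell['pct'].tolist())
print(any_position['pct'].tolist())
```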

In [6]:
#Use wildcards to find files in a directory
import glob

#Get and display a list of all .csv file names for 2018 download
rcdFiles = glob.glob(dataDir + 'rcd*.csv')

print('Saving Files to: ' + dataDir + '\n')

file_data_years = []
for filePth in rcdFiles:
    fileName = ntpath.basename(filePth)
    if fileName != 'rcd_code_desc.csv': 
        file_data_years.append(CleanUpRcdFiles(filePth))

Saving Files to: D:/BenepactLLC/Belk/NC_Report_Card_Data/2020/June 2020/2018/Raw Datasets/SRC_Datasets/

rcd_161.csv, Max Year: 2012
rcd_acc_aapart.csv, Max Year: 2018
rcd_acc_act.csv, Max Year: 2018
rcd_acc_awa.csv, Max Year: 2018
rcd_acc_cgr.csv, Max Year: 2018
rcd_acc_eds.csv, Max Year: 2018
rcd_acc_elp.csv, Max Year: 2018
rcd_acc_essa_desig.csv, Max Year: 2018
rcd_acc_gp.csv, Max Year: 2018
rcd_acc_irm.csv, Max Year: 2018
rcd_acc_lowperf.csv, Max Year: 2018
rcd_acc_ltg.csv, Max Year: 2018
rcd_acc_ltg_detail.csv, Max Year: 2018
rcd_acc_mcr.csv, Max Year: 2018
rcd_acc_part.csv, Max Year: 2018
rcd_acc_part_detail.csv, Max Year: 2018
rcd_acc_pc.csv, Max Year: 2018
rcd_acc_rta.csv, Max Year: 2018
rcd_acc_spg1.csv, Max Year: 2017
rcd_acc_spg2.csv, Max Year: 2018
rcd_acc_wk.csv, Max Year: 2018
rcd_adm.csv, Max Year: 2018
rcd_ap.csv, Max Year: 2018
rcd_arts.csv, Max Year: 2018
rcd_att.csv, Max Year: 2018
rcd_charter.csv, Max Year: 2018
rcd_chronic_absent.csv, Max Year: 2018.0
rcd_college.c

In [7]:
#Remove comma from amount field in rcd_improvement
rcd_improvement = pd.read_csv(dataDir + 'rcd_improvement.csv', low_memory=False, dtype={'agency_code': object})
rcd_improvement['amount'] = rcd_improvement['amount'].astype(str).str.replace(',', '').astype(float)
rcd_improvement.to_csv(dataDir + 'rcd_improvement.csv', sep=',', index=False)

In [8]:
import os
# Manually remove what appear to be retired data files

# rcd_161.csv, Max Year: 2012, appears retired
os.remove(dataDir + 'rcd_161.csv')
# rcd_esea_att.csv, Max Year: 2015, appears retired
os.remove(dataDir + 'rcd_esea_att.csv')
# rcd_hqt.csv, Max Year: 2016, appears retired
os.remove(dataDir + 'rcd_hqt.csv')

# Manually remove any split data files that do not have data for the current year. 

# rcd_acc_spg1.csv, Max Year: 2017, rcd_acc_spg2 has 2018
os.remove(dataDir + 'rcd_acc_spg1.csv')
# rcd_courses1.csv, Max Year: 2017, rcd_courses2 has 2018 
os.remove(dataDir + 'rcd_courses1.csv')
# rcd_inc1.csv, Max Year: 2017, rcd_inc2 has 2018
os.remove(dataDir + 'rcd_inc1.csv')

# Manually remove files that do not have campus level data

# rcd_naep.csv, National Data and State Only
os.remove(dataDir + 'rcd_naep.csv')
# rcd_prin_demo.csv, 1 column of District Level Data Only
os.remove(dataDir + 'rcd_prin_demo.csv')
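
`os.remove` raises `FileNotFoundError` when the file is already gone, so re-running the cell above fails after the first pass. A tolerant variant might look like this; `remove_if_present` is a hypothetical helper, not part of the original notebook.

```python
import os

def remove_if_present(directory, file_names):
    """Delete each named file under directory; return the names actually removed."""
    removed = []
    for name in file_names:
        path = os.path.join(directory, name)
        if os.path.exists(path):
            os.remove(path)
            removed.append(name)
    return removed
```

For example, `remove_if_present(dataDir, ['rcd_161.csv', 'rcd_hqt.csv'])` would delete whichever of the two still exists and report which ones it removed.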

In [9]:
# Update our file lists
file_data_years = [i for i in file_data_years if i[0] not in 
                   ['rcd_161.csv','rcd_esea_att.csv','rcd_hqt.csv','rcd_acc_spg1.csv','rcd_courses1.csv',
                    'rcd_inc1.csv','rcd_naep.csv','rcd_prin_demo.csv']]

#dataDir already ends with 'SRC_Datasets/' at this point, so no extra folder is appended
rcdFiles = glob.glob(dataDir + 'rcd*.csv') 

print('Remaining Files')
print('----------------------------------')

for f in file_data_years:
    print(f)

Remaining Files
----------------------------------
('rcd_acc_aapart.csv', 2018)
('rcd_acc_act.csv', 2018)
('rcd_acc_awa.csv', 2018)
('rcd_acc_cgr.csv', 2018)
('rcd_acc_eds.csv', 2018)
('rcd_acc_elp.csv', 2018)
('rcd_acc_essa_desig.csv', 2018)
('rcd_acc_gp.csv', 2018)
('rcd_acc_irm.csv', 2018)
('rcd_acc_lowperf.csv', 2018)
('rcd_acc_ltg.csv', 2018)
('rcd_acc_ltg_detail.csv', 2018)
('rcd_acc_mcr.csv', 2018)
('rcd_acc_part.csv', 2018)
('rcd_acc_part_detail.csv', 2018)
('rcd_acc_pc.csv', 2018)
('rcd_acc_rta.csv', 2018)
('rcd_acc_spg2.csv', 2018)
('rcd_acc_wk.csv', 2018)
('rcd_adm.csv', 2018)
('rcd_ap.csv', 2018)
('rcd_arts.csv', 2018)
('rcd_att.csv', 2018)
('rcd_charter.csv', 2018)
('rcd_chronic_absent.csv', 2018.0)
('rcd_college.csv', 2018)
('rcd_courses2.csv', 2018)
('rcd_cte_concentrators.csv', 2018)
('rcd_cte_credentials.csv', 2018)
('rcd_cte_endorsement.csv', 2018)
('rcd_cte_enrollment.csv', 2018)
('rcd_dlmi.csv', 2018)
('rcd_effectiveness.csv', 2018)
('rcd_experience.csv', 2018)
('rc

# Flatten the Raw Data Files
### This section reads raw data files directly from the \\Raw Datasets folder and flattens each file.
1. Each agency_code could represent National, State, District, or School Campus level data.
2. This code creates new data columns using pivots until there is only one record per agency_code.
3. Percentage fields are always used for pivot values in cases where count, denominators, or percentages are available.  
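
The flattening idea can be sketched on a toy long-format table. The data and the `_DEMO` suffix are made up for illustration; the pivot and column-naming steps mirror what this notebook does for the real files.

```python
import pandas as pd

# Hypothetical long-format data: several rows per agency_code
long_df = pd.DataFrame({
    'agency_code': ['010303', '010303', '010304', '010304'],
    'subgroup':    ['ALL', 'EDS', 'ALL', 'EDS'],
    'pct':         [80.0, 60.0, 90.0, 70.0],
})

# Pivot so each subgroup becomes its own column
wide = pd.pivot_table(long_df, values=['pct'], index='agency_code', columns=['subgroup'])

# Join the multi-level column names and append a file-specific suffix
wide.columns = ['_'.join(str(i) for i in col) + '_DEMO' for col in wide.columns]

# Turn the index back into a regular column for later merges
wide = wide.reset_index()
print(wide)
```

The result has one row per agency_code and one column per subgroup.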

In [10]:
#Get and display a list of all .csv file names for 2018 download
rcdFiles = glob.glob(dataDir + 'rcd*.csv')

rcdFileNames = [ntpath.basename(x)[:-4] for x in rcdFiles]

In [11]:
#Do not process the rcd_code_desc file  
rcdFileNames.remove('rcd_code_desc')

In [12]:
def PivotCsv(dataDir, fileName, pivValues, pivIndex, pivColumns, colSuffix):
    pivFile = pd.read_csv(dataDir + fileName, low_memory=False, dtype={pivIndex: object})
    
    pivFile = pd.pivot_table(pivFile, values=pivValues,index=pivIndex,columns=pivColumns)
    
    #concatenate multiindex column names using a list comprehension.
    pivFile.columns = [ '_'.join(str(i) for i in col) + colSuffix for col in pivFile.columns]

    #Make our index a column for merges later
    pivFile.reset_index(level=0, inplace=True)
    return pivFile
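
`pd.pivot_table` aggregates with the mean by default, so rows that share the same index and column keys are silently averaged. A check like the one below could be run before pivoting; `duplicate_pivot_keys` is a hypothetical helper and the data is made up.

```python
import pandas as pd

def duplicate_pivot_keys(df, piv_index, piv_columns):
    """Return the rows that share the same index + column key combination."""
    keys = [piv_index] + list(piv_columns)
    return df[df.duplicated(subset=keys, keep=False)]

demo = pd.DataFrame({
    'agency_code': ['010303', '010303', '010304'],
    'subgroup':    ['ALL', 'ALL', 'ALL'],
    'pct':         [80.0, 60.0, 90.0],
})

dupes = duplicate_pivot_keys(demo, 'agency_code', ['subgroup'])

# Without the check, the two '010303' rows collapse into their mean
averaged = pd.pivot_table(demo, values=['pct'], index='agency_code', columns=['subgroup'])
print(dupes)
print(averaged)
```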

In [13]:
#Pivot File - rcd_161 - Retired
#rcd_161 = PivotCsv(dataDir, 'rcd_161.csv',['ccc_pct'],'agency_code', ['status','subgroup'],'_161')

#Pivot File - rcd_acc_aapart 
rcd_acc_aapart = PivotCsv(dataDir, 'rcd_acc_aapart.csv',['pct'],'agency_code', ['subject','grade'],'_AAPART')

#Pivot File - rcd_acc_act 
rcd_acc_act = PivotCsv(dataDir, 'rcd_acc_act.csv',['pct'],'agency_code', ['subject','subgroup'],'_ACT')

#Pivot File - rcd_acc_awa 
rcd_acc_awa = PivotCsv(dataDir, 'rcd_acc_awa.csv',['pct'],'agency_code', ['subgroup'],'_AWA')

#Pivot File - rcd_acc_cgr
rcd_acc_cgr = PivotCsv(dataDir, 'rcd_acc_cgr.csv',['pct'],'agency_code', ['cgr_type', 'subgroup'],'_CGR')

#File - rcd_acc_eds
rcd_acc_eds = pd.read_csv(dataDir + 'rcd_acc_eds.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_eds = rcd_acc_eds[['agency_code', 'pct_eds']]

#Pivot File - rcd_acc_elp
rcd_acc_elp = PivotCsv(dataDir, 'rcd_acc_elp.csv',['pct'],'agency_code', ['subgroup'],'_ELP')

#File - rcd_acc_essa_desig
rcd_acc_essa_desig = pd.read_csv(dataDir + 'rcd_acc_essa_desig.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_essa_desig.drop(['year'], axis=1, inplace=True)

#File - rcd_acc_gp
rcd_acc_gp = pd.read_csv(dataDir + 'rcd_acc_gp.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_gp.drop(['year'], axis=1, inplace=True)

#Pivot File - rcd_acc_irm
rcd_acc_irm = PivotCsv(dataDir, 'rcd_acc_irm.csv',['pct_prof'],'agency_code', ['grade'],'gr_irm')

#File - rcd_acc_lowperf
rcd_acc_lowperf = pd.read_csv(dataDir + 'rcd_acc_lowperf.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_lowperf = rcd_acc_lowperf[['agency_code', 'lp_school','rlp_school','clpc_school']]

#Pivot File - rcd_acc_ltg
rcd_acc_ltg = PivotCsv(dataDir, 'rcd_acc_ltg.csv',['pct_met'],'agency_code', ['target'],'_LTG')

#File - rcd_acc_ltg_detail
rcd_acc_ltg_detail = pd.read_csv(dataDir + 'rcd_acc_ltg_detail.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_ltg_detail.drop(['year'], axis=1, inplace=True)

#Pivot File - rcd_acc_mcr
rcd_acc_mcr = PivotCsv(dataDir, 'rcd_acc_mcr.csv',['pct'],'agency_code', ['subgroup'],'_MCR')

#Pivot File - rcd_acc_part
rcd_acc_part = PivotCsv(dataDir, 'rcd_acc_part.csv',['pct_met'],'agency_code', ['target'],'_PART')

#Pivot File - rcd_acc_part_detail
rcd_acc_part_detail = PivotCsv(dataDir, 'rcd_acc_part_detail.csv',['pct'],'agency_code', ['target','subgroup'],'_PART_DET')

#Pivot File - rcd_acc_pc - WARNING 3323 columns!!! 
rcd_acc_pc = PivotCsv(dataDir, 'rcd_acc_pc.csv',['pct'],'agency_code', ['standard','subject','grade','subgroup'],'_PC')

#Pivot File - rcd_acc_rta
rcd_acc_rta = PivotCsv(dataDir, 'rcd_acc_rta.csv',['pct'],'agency_code', ['metric'],'_RTA')

#File - rcd_acc_spg1 _ Retired
# rcd_acc_spg1 = pd.read_csv(dataDir + 'rcd_acc_spg1.csv', low_memory=False, dtype={'agency_code': object})
# rcd_acc_spg1.drop(['year'], axis=1, inplace=True)

#File - rcd_acc_spg2
pivVals = ['asm_option','k2_feeder','aaa_score','awa_score','cgrs_score','elp_score',
           'mcr_score','scgs_score','bi_score','ach_score','eg_status','eg_score',
           'spg_score','spg_grade','mags_score','ma_eg_status','ma_eg_score',
           'ma_spg_score','ma_spg_grade','rdgs_score','rd_eg_status','rd_eg_score',
           'rd_spg_score','rd_spg_grade']
rcd_acc_spg2 = PivotCsv(dataDir, 'rcd_acc_spg2.csv',pivVals,'agency_code', ['subgroup'],'_SPG2')

#pivVals = ['aaa_score','awa_score','cgrs_score','elp_score','mcr_score','scgs_score','bi_score',
#           'ach_score','eg_status','eg_score','spg_score','spg_grade']
#           
#rcd_acc_spg2 = PivotCsv(dataDir, 'rcd_acc_spg2.csv',pivVals,'agency_code', ['subgroup'],'_SPG2')

#Pivot File - rcd_acc_wk
rcd_acc_wk = PivotCsv(dataDir, 'rcd_acc_wk.csv',['pct'],'agency_code', ['subgroup'],'_WK')

#File - rcd_adm
rcd_adm = pd.read_csv(dataDir + 'rcd_adm.csv', low_memory=False, dtype={'agency_code': object})
rcd_adm.drop(['year','category_code'], axis=1, inplace=True)

#File - rcd_ap
#Found 0 duplicate agency_codes in this file, no pivot 
rcd_ap = pd.read_csv(dataDir + 'rcd_ap.csv', low_memory=False, dtype={'agency_code': object})
rcd_ap.drop(['year','category_code'], axis=1, inplace=True)

#File - rcd_arts
rcd_arts = pd.read_csv(dataDir + 'rcd_arts.csv', low_memory=False, dtype={'agency_code': object})
rcd_arts.drop(['year'], axis=1, inplace=True)

#File - rcd_att
#Found 0 duplicate agency_codes in this file, no pivot 
rcd_att = pd.read_csv(dataDir + 'rcd_att.csv', low_memory=False, dtype={'agency_code': object})
rcd_att.drop(['year','category_code'], axis=1, inplace=True)

#Pivot File - rcd_charter
#rcd_charter = PivotCsv(dataDir, 'rcd_charter.csv',['pct_enrolled'],'agency_code', ['home_lea','subgroup'],'_CHARTER')
rcd_charter = PivotCsv(dataDir, 'rcd_charter.csv',['pct_enrolled'],'agency_code', ['subgroup'],'_CHARTER')

#Pivot File - rcd_chronic_absent
rcd_chronic_absent = PivotCsv(dataDir, 'rcd_chronic_absent.csv',['pct'],'agency_code', ['subgroup'],'_CHRON_ABSENT')

#Pivot File - rcd_college
rcd_college = PivotCsv(dataDir, 'rcd_college.csv',['pct_enrolled'],'agency_code', ['Status','subgroup'],'_COLLEGE')

#File - rcd_courses1 - Retired - 2017 DATA
#Found 0 duplicate agency_codes in this file, no pivot 
#rcd_courses1 = pd.read_csv(dataDir + 'rcd_courses1.csv', low_memory=False, dtype={'agency_code': object})
#rcd_courses1.drop(['year','category_code'], axis=1, inplace=True)

#Pivot File - rcd_courses2
pivCols = ['tot_num_ap','subgroup','pct_ap','tot_num_ccp','pct_ccp','tot_num_ib','pct_ib']
rcd_courses2 = PivotCsv(dataDir, 'rcd_courses2.csv',pivCols,'agency_code', 
                        ['category_code','subgroup'],'_COURSES2')

#rcd_courses2 = PivotCsv(dataDir, 'rcd_courses2.csv',['pct_ap','pct_ccp','pct_ib'],'agency_code', ['category_code','subgroup'],
#                        '_COURSES2')

#Pivot File - rcd_cte_concentrators
rcd_cte_concentrators = PivotCsv(dataDir, 'rcd_cte_concentrators.csv',['num_concentrators'],'agency_code',
                                 ['career_cluster'],'')

#File - rcd_cte_credentials
rcd_cte_credentials = pd.read_csv(dataDir + 'rcd_cte_credentials.csv', low_memory=False, dtype={'agency_code': object})
rcd_cte_credentials.drop(['year'], axis=1, inplace=True)

#File - rcd_cte_endorsement
rcd_cte_endorsement = pd.read_csv(dataDir + 'rcd_cte_endorsement.csv', low_memory=False, dtype={'agency_code': object})
rcd_cte_endorsement.drop(['year'], axis=1, inplace=True)

#File - rcd_cte_enrollment
rcd_cte_enrollment = pd.read_csv(dataDir + 'rcd_cte_enrollment.csv', low_memory=False, dtype={'agency_code': object})
rcd_cte_enrollment['cte_enrollment_pct'] = rcd_cte_enrollment['pct'] 
rcd_cte_enrollment.drop(['year','pct'], axis=1, inplace=True)

#File - rcd_dlmi
rcd_dlmi = pd.read_csv(dataDir + 'rcd_dlmi.csv', low_memory=False, dtype={'agency_code': object})
rcd_dlmi.drop(['year'], axis=1, inplace=True)

#Pivot File - rcd_effectiveness (file taken from the 2018-19 release, which includes 2017-18 data; see notes above)
rcd_effectiveness = PivotCsv(dataDir, 'rcd_effectiveness.csv',['pct_rating'],'agency_code', ['ee_standard','ee_rating'],'')

#File - rcd_esea_att - Retired - 2015 DATA
#Found 0 duplicate agency_codes in this file, no pivot 
#rcd_esea_att = pd.read_csv(dataDir + 'rcd_esea_att.csv', low_memory=False, dtype={'agency_code': object})
#rcd_esea_att.drop(['year','category_code'], axis=1, inplace=True)

#Pivot File - rcd_experience
expPivColumns = ['pct_experience_0','pct_experience_10','pct_experience_4',
                 'pct_adv_degree','pct_turnover','total_class_teach','avg_class_teach']
rcd_experience = PivotCsv(dataDir, 'rcd_experience.csv',expPivColumns,'agency_code', ['staff'],'Exp')

#File - rcd_funds - !!!DISTRICT LEVEL DATA!!!
rcd_funds = pd.read_csv(dataDir + 'rcd_funds.csv', low_memory=False, dtype={'agency_code': object})
rcd_funds.drop(['year'], axis=1, inplace=True)

#Pivot File - rcd_hqt - Retired - !!!2016 DATA!!!
# rcd_hqt = PivotCsv(dataDir, 'rcd_hqt.csv',['highqual_class_pct'],'agency_code', ['category_code'],'')

#File - rcd_ib
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_ib = pd.read_csv(dataDir + 'rcd_ib.csv', low_memory=False, dtype={'agency_code': object})
rcd_ib.drop(['year','category_code'], axis=1, inplace=True)

#Pivot File - rcd_improvement
rcd_improvement = PivotCsv(dataDir, 'rcd_improvement.csv',['amount'],'agency_code', ['strategy'],'_Improve_Amt')

#File - rcd_inc1 - Retired 
#Found 0 duplicate agency_codes in this file at school level, no pivot 
#rcd_inc1 = pd.read_csv(dataDir + 'rcd_inc1.csv', low_memory=False, dtype={'agency_code': object})
#rcd_inc1.drop(['year','category_code'], axis=1, inplace=True)

#File - rcd_inc2 
pivFields = ['iss_per1000','sts_per1000','lts_per1000',
             'exp_per1000','crime_per1000','blhr_per1000',
             'rplw_per1000','arre_per1000']
rcd_inc2 = PivotCsv(dataDir, 'rcd_inc2.csv',pivFields,'agency_code', ['subgroup'],'')

#File - rcd_licenses
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_licenses = pd.read_csv(dataDir + 'rcd_licenses.csv', low_memory=False, dtype={'agency_code': object})
rcd_licenses.drop(['year','category_code'], axis=1, inplace=True)

#File - rcd_location
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_location = pd.read_csv(dataDir + 'rcd_location.csv', low_memory=False, dtype={'agency_code': object})
rcd_location.drop(['year'], axis=1, inplace=True)


#Pivot File - rcd_naep !!!NATIONAL & STATE LEVEL DATA!!!
#pivCols = ['grade','naep_subject','subgroup','Proficiency_level']
#rcd_naep = PivotCsv(dataDir, 'rcd_naep.csv',['percent_proficient'],'agency_code', pivCols,'_NAEP')

#File - rcd_nbpts
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_nbpts = pd.read_csv(dataDir + 'rcd_nbpts.csv', low_memory=False, dtype={'agency_code': object})
rcd_nbpts.drop(['year','category_code','total_nbpts_num'], axis=1, inplace=True)

#Pivot File - rcd_pk_enroll
rcd_pk_enroll = PivotCsv(dataDir, 'rcd_pk_enroll.csv',['pct', 'count'],'agency_code', ['subgroup'],'_PK_ENROLL')

#Pivot File - rcd_prin_demo - !!! District Level Data !!!
# rcd_prin_demo = PivotCsv(dataDir, 'rcd_prin_demo.csv',['pct_prin_demo'],'agency_code', ['subgroup'],'')

#File - rcd_readiness
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_readiness = pd.read_csv(dataDir + 'rcd_readiness.csv', low_memory=False, dtype={'agency_code': object})
rcd_readiness.drop(['year','category_code'], axis=1, inplace=True)

#Pivot File - rcd_sar
rcd_sar = PivotCsv(dataDir, 'rcd_sar.csv',['avg_size'],'agency_code', ['grade_eoc'],'_SAR')

#File - rcd_sat
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_sat = pd.read_csv(dataDir + 'rcd_sat.csv', low_memory=False, dtype={'agency_code': object})
rcd_sat.drop(['year','category_code'], axis=1, inplace=True)

#File - rcd_welcome
rcd_welcome = pd.read_csv(dataDir + 'rcd_welcome.csv', low_memory=False, dtype={'agency_code': object})
rcd_welcome.drop(['year'], axis=1, inplace=True)

# Save All Flattened Files to \\Raw Datasets Directory
**This code saves all the flattened file versions as .csv files in the \\Raw Datasets\\Flattened Datasets folder.**

In [14]:
#Set dataDir back to the original value (one folder up from current location)
import os 
dataDir = os.path.dirname(os.path.dirname(dataDir)) + '/'
dataDir

'D:/BenepactLLC/Belk/NC_Report_Card_Data/2020/June 2020/2018/Raw Datasets/'

In [15]:
print('Saving Flattened Versions and Record Counts for the Following Raw Data Files: \n')
for fileName in rcdFileNames:
    eval(fileName).to_csv(dataDir + 'Flattened Datasets/' + fileName + '.csv', sep=',', index=False)
    print(fileName + ', ' + str(len(eval(fileName).index)))
    

Saving Flattened Versions and Record Counts for the Following Raw Data Files: 

rcd_acc_aapart, 53
rcd_acc_act, 742
rcd_acc_awa, 688
rcd_acc_cgr, 738
rcd_acc_eds, 2761
rcd_acc_elp, 1809
rcd_acc_essa_desig, 2645
rcd_acc_gp, 658
rcd_acc_irm, 1276
rcd_acc_lowperf, 2760
rcd_acc_ltg, 2461
rcd_acc_ltg_detail, 2631
rcd_acc_mcr, 717
rcd_acc_part, 2527
rcd_acc_part_detail, 2527
rcd_acc_pc, 2697
rcd_acc_rta, 1576
rcd_acc_spg2, 2538
rcd_acc_wk, 517
rcd_adm, 3197
rcd_ap, 563
rcd_arts, 2509
rcd_att, 3115
rcd_charter, 241
rcd_chronic_absent, 2719
rcd_college, 690
rcd_courses2, 652
rcd_cte_concentrators, 492
rcd_cte_credentials, 436
rcd_cte_endorsement, 537
rcd_cte_enrollment, 1184
rcd_dlmi, 2723
rcd_effectiveness, 2777
rcd_experience, 2758
rcd_funds, 292
rcd_ib, 51
rcd_improvement, 19
rcd_inc2, 2792
rcd_licenses, 3121
rcd_location, 2759
rcd_nbpts, 3124
rcd_pk_enroll, 988
rcd_readiness, 1323
rcd_sar, 2630
rcd_sat, 613
rcd_welcome, 1311
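
The save loop looks up each DataFrame by name with `eval`. One alternative is to keep the frames in a dictionary keyed by file name. The sketch below assumes that approach; `save_all` and the stand-in frames are hypothetical.

```python
import os
import pandas as pd

# Stand-ins for the flattened DataFrames built earlier
flattened = {
    'rcd_demo_a': pd.DataFrame({'agency_code': ['010303', '010304']}),
    'rcd_demo_b': pd.DataFrame({'agency_code': ['010303']}),
}

def save_all(frames, out_dir):
    """Write each frame to out_dir/<name>.csv and return the record counts."""
    counts = {}
    for name, frame in frames.items():
        frame.to_csv(os.path.join(out_dir, name + '.csv'), sep=',', index=False)
        counts[name] = len(frame.index)
    return counts
```

A dictionary also makes it harder to save a variable that was accidentally reassigned elsewhere in the notebook.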


## Download and Save Copy of the Original Statistical Profiles Data

## -----------------  Manual Download Required!!! -----------------

In [16]:
#Statistical Profiles - Student Body Racial Compositions at the School Level
#import io
#import requests

#url='http://apps.schools.nc.gov/ords/f?p=145:221::CSV::::'
statProfPath = dataDir + 'SRC_Datasets/' + 'ec_pupils.csv'

#Passing this URL directly into pd.read_csv() threw HTTP errors - This is my workaround
#s = requests.get(url).content
#ec_pupils = pd.read_csv(io.StringIO(s.decode('utf-8')), low_memory=False
#                        , dtype={'LEA': object,'School': object})

ec_pupils = pd.read_csv(statProfPath, low_memory=False
                        , dtype={'LEA': object,'School': object})

#Rename year for consistency
ec_pupils.rename({'Year':'year', '____LEA Name____':'LEA Name', '___School Name___':'school name',
                 'Two or  More Male':'two or more male', 'Two or  More Female':'two or more female'}, axis=1, inplace=True)

#Create agency_code from LEA and School code as an index
ec_pupils['agency_code'] = ec_pupils['LEA'] + ec_pupils['School']

#Filter to 2018 school year (There is already 2019 school year data in this file)
#ec_pupils = ec_pupils[ec_pupils.year == schoolYear]

#Some schools are missing race data.  Get the most recent year of data available for each agency code
ec_pupils = ec_pupils.sort_values(by=['agency_code', 'year'])
ec_pupils = ec_pupils.drop_duplicates(subset=["agency_code"], keep="last")

ec_pupils.columns = [i.lower() for i in ec_pupils.columns]

#Save the original data to the source datasets folder 
ec_pupils.to_csv(statProfPath, sep=',', index=False)
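
The "most recent year per agency code" step can be seen on a toy table (values are made up). Because the year filter above is commented out, the row kept for a school can come from a later year than `schoolYear` when the file already contains newer data.

```python
import pandas as pd

demo = pd.DataFrame({
    'agency_code': ['010303', '010303', '010304'],
    'year':        [2019, 2018, 2017],
    'white male':  [12, 10, 7],
})

# Sort so the newest year is last within each agency_code, then keep that row
latest = (demo.sort_values(by=['agency_code', 'year'])
              .drop_duplicates(subset=['agency_code'], keep='last'))
print(latest)
```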

## Create Flattened Statistical Profiles with Racial Composition Percentages

In [17]:
#***********************************************************************
# Statistical Profiles - Student Body Racial Compositions at the School Level Reshape
#
# Statistical Profiles data are already one record per public school but must be converted to percentages
# Creates a new dataset - ec_pupils_pct.csv
#
#***********************************************************************

#Statistical Profiles - Student Body Racial Compositions at the School Level
ec_pupils = pd.read_csv(statProfPath, low_memory=False, dtype={'agency_code': object})

#Create Racial Composition summary variables
ec_pupils['indian'] = ec_pupils['indian male'] + ec_pupils['indian female']
ec_pupils['asian'] = ec_pupils['asian male'] + ec_pupils['asian female']
ec_pupils['hispanic'] = ec_pupils['hispanic male'] + ec_pupils['hispanic female']
ec_pupils['black'] = ec_pupils['black male'] + ec_pupils['black female']
ec_pupils['white'] = ec_pupils['white male'] + ec_pupils['white female']
ec_pupils['pacific island'] = ec_pupils['pacific island male'] + ec_pupils['pacific island female']
ec_pupils['two or more'] = ec_pupils['two or more male'] + ec_pupils['two or more female']

#The original total field is corrupted with non-printable characters and will not convert to int or float 
ec_pupils.drop(['total'], axis=1, inplace=True)
#Create a new totals field by summing race composition fields
ec_pupils['total'] = ec_pupils['indian'] + ec_pupils['asian'] + \
                     ec_pupils['hispanic'] + ec_pupils['black'] + \
                     ec_pupils['white'] + ec_pupils['pacific island'] + ec_pupils['two or more']
#Convert totals to float64 for division later
ec_pupils['total'] = ec_pupils['total'].astype(np.float64)

#Create minority summary variables 
ec_pupils['minority male'] = ec_pupils['indian male'] + ec_pupils['asian male'] \
                           + ec_pupils['hispanic male'] + ec_pupils['black male'] \
                           + ec_pupils['pacific island male'] + ec_pupils['two or more male'] 
ec_pupils['minority female'] = ec_pupils['indian female'] + ec_pupils['asian female'] \
                           + ec_pupils['hispanic female'] + ec_pupils['black female'] \
                           + ec_pupils['pacific island female'] + ec_pupils['two or more female']
ec_pupils['minority'] = ec_pupils['minority male'] + ec_pupils['minority female']

#Create Student Body Racial Composition PERCENTAGES at the School Level
ec_pupils_pct = pd.DataFrame({'agency_code'   : ec_pupils['agency_code']
                            , 'lea' : ec_pupils['lea']
                            , 'lea_name' : ec_pupils['lea name']
                            , 'school' : ec_pupils['school']
                            , 'school_name' : ec_pupils['school name']
                            , 'indian_pct'   : ec_pupils['indian'] / ec_pupils['total']  
                            , 'asian_pct'    : ec_pupils['asian'] / ec_pupils['total']
                            , 'hispanic_pct' : ec_pupils['hispanic'] / ec_pupils['total']
                            , 'black_pct'    : ec_pupils['black'] / ec_pupils['total']
                            , 'white_pct'    : ec_pupils['white'] / ec_pupils['total']
                            , 'pacific_Island_pct': ec_pupils['pacific island'] / ec_pupils['total']
                            , 'two_or_more_pct': ec_pupils['two or more'] / ec_pupils['total']
                            , 'minority_pct' : ec_pupils['minority'] / ec_pupils['total']
                            
                              
                            , 'indian_male_pct'   : ec_pupils['indian male'] / ec_pupils['total']  
                            , 'asian_male_pct'    : ec_pupils['asian male'] / ec_pupils['total']
                            , 'hispanic_male_pct' : ec_pupils['hispanic male'] / ec_pupils['total']
                            , 'black_male_pct'    : ec_pupils['black male'] / ec_pupils['total']
                            , 'white_male_pct'    : ec_pupils['white male'] / ec_pupils['total']
                            , 'pacific_Island_male_pct': ec_pupils['pacific island male'] / ec_pupils['total']
                            , 'two_or_more_male_pct': ec_pupils['two or more male'] / ec_pupils['total']  
                            , 'minority_male_pct' : ec_pupils['minority male'] / ec_pupils['total']
                                                          
                            , 'indian_female_pct'   : ec_pupils['indian female'] / ec_pupils['total']  
                            , 'asian_female_pct'    : ec_pupils['asian female'] / ec_pupils['total']
                            , 'hispanic_female_pct' : ec_pupils['hispanic female'] / ec_pupils['total']
                            , 'black_female_pct'    : ec_pupils['black female'] / ec_pupils['total']
                            , 'white_female_pct'    : ec_pupils['white female'] / ec_pupils['total']
                            , 'minority_female_pct' : ec_pupils['minority female'] / ec_pupils['total'] 
                            , 'pacific_Island_female_pct': ec_pupils['pacific island female'] / ec_pupils['total']
                            , 'two_or_more_female_pct': ec_pupils['two or more female'] / ec_pupils['total']
                             })

ec_pupils_pct.columns = [i.lower() for i in ec_pupils_pct.columns]

#Save the flattened racial composition percentage data to disk 
ec_pupils_pct.to_csv(dataDir + 'Flattened Datasets/' + 'ec_pupils_pct.csv', sep=',', index=False)

#Print file details
print('Saving Flattened Versions and Record Counts for the Following Raw Data Files: \n')
print('ec_pupils_pct' + ', ' + str(len(ec_pupils_pct.index)))

Saving Flattened Versions and Record Counts for the Following Raw Data Files: 

ec_pupils_pct, 2494
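
The percentage columns above are written out one by one. A more compact way to get the same kind of result is to divide all count columns by the row total at once, sketched here on hypothetical counts for two schools; a school with a zero total yields NaN.

```python
import numpy as np
import pandas as pd

counts = pd.DataFrame({
    'agency_code':  ['010303', '010304'],
    'white male':   [10, 0],
    'white female': [30, 0],
    'black male':   [40, 0],
    'black female': [20, 0],
})

count_cols = [c for c in counts.columns if c != 'agency_code']

# Row totals as float64, then divide every count column by its row total
total = counts[count_cols].sum(axis=1).astype(np.float64)
pct = counts[count_cols].div(total, axis=0)
pct.columns = [c.replace(' ', '_') + '_pct' for c in count_cols]
pct.insert(0, 'agency_code', counts['agency_code'])
print(pct)
```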


## Process School Attendance Data

## -----------------  Manual Download Required!!! -----------------

### Notes on Attendance Data 
* **Location** - https://www.dpi.nc.gov/districts-schools/district-operations/financial-and-business-services/demographics-and-finances/student-accounting-data#average-daily-attendance-&-average-daily-membership-ratios-(ada:adm)
* **Download: Average Daily Attendance & Average Daily Membership Ratios (ADA:ADM), Three-year historical attendance and membership data** - https://files.nc.gov/dpi/documents/fbs/accounting/data/adm/ratio.xlsx
* **File Name** - Ratio.csv
* This file needs manual processing:
    1. Delete the notes page from the spreadsheet.
    2. Remove empty columns in the report and save as a .csv file.
    3. Edit the .csv file: delete the 2 rows of column headings, then add the column headings below.
* **Column Names to Cut and Paste** - agency_code,lea_no,lea_name,school_no,school_name,grade_span,ada_ct_2017,adm_ct_2017,ada_adm_ratio_2017,ada_ct_2018,adm_ct_2018,ada_adm_ratio_2018,ada_ct_2019,adm_ct_2019,ada_adm_ratio_2019,ada_adm_ratio_2017_thru_2019,ada_adm_rank_2017_thru_2019
* **MAKE SURE THE YEAR NUMBERS IN THE COLUMN NAMES ARE ACCURATE**
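
Since the year numbers are easy to mistype by hand, the header line could be generated instead. This is a sketch: `ratio_columns` is a hypothetical helper, and it assumes the file always covers three consecutive years in the column layout listed above.

```python
def ratio_columns(first_year):
    """Build the ratio.csv column names for three consecutive years."""
    years = [first_year, first_year + 1, first_year + 2]
    cols = ['agency_code', 'lea_no', 'lea_name', 'school_no', 'school_name', 'grade_span']
    for y in years:
        cols += ['ada_ct_%d' % y, 'adm_ct_%d' % y, 'ada_adm_ratio_%d' % y]
    span = '%d_thru_%d' % (years[0], years[-1])
    cols += ['ada_adm_ratio_' + span, 'ada_adm_rank_' + span]
    return cols

# Header line ready to paste into the .csv file
header = ','.join(ratio_columns(2017))
print(header)
```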

### Metadata Details
* **ada** = average daily attendance
* **adm** = average daily membership

### Check for duplicate agency codes
**The duplicate agency code below was removed manually**

In [18]:
#File - ratio.csv
#List any agency_code that appears more than once; an empty Series means no duplicates remain
ratio = pd.read_csv(dataDir + 'SRC_Datasets/' + 'ratio.csv', low_memory=False, dtype={'agency_code': object})
vc = ratio['agency_code'].value_counts()
vc[vc >= 2]

Series([], Name: agency_code, dtype: int64)

In [19]:
ratio[ratio.agency_code == '840361']

 

Unnamed: 0,agency_code,lea_no,lea_name,school_no,school_name,grade_span,ada_ct_2017,adm_ct_2017,ada_adm_ratio_2018,ada_ct_2018,adm_ct_2018,ada_adm_ratio_2018.1,ada_ct_2019,adm_ct_2019,ada_adm_ratio_2019,ada_adm_ratio_2017_thru_2019,ada_adm_rank_2017_thru_2019
2140,840361,840.0,Stanly County Schools,361.0,Stanly Early College High,09-13,202,207,97.58,201,206,97.57,193,197,97.97,97.71,92.0


In [18]:
#File - ratio.csv
#Found 0 duplicate agency_codes in this file at school level, no pivot 
ratio = pd.read_csv(dataDir + 'SRC_Datasets/' + 'ratio.csv', low_memory=False, dtype={'agency_code': object})
#Suffix every column except agency_code with _attendance (assumes agency_code is the first column)
ratio.columns = ['agency_code'] + [i.lower() + '_attendance' for i in ratio.columns if i != 'agency_code']
ratio.to_csv(dataDir + 'Flattened Datasets/' + 'ratio.csv', sep=',', index=False)