# Create NCDPI 2017-2018 Raw Datasets
### This program downloads all original datasets from www.ncpublicschools.org and saves them as .csv files. These data files are used to create all the flattened and machine learning datasets within the NCEA repository.

1. This notebook downloads raw datasets directly from NCDPI-specific URLs.
2. Each raw dataset is filtered by school year and saved in the original layout as a .csv file.
3. For consistency, both the Year and School code fields are renamed to "year" and "agency_code" in all files.
4. All masking is removed from raw data fields using the following code: replace({"*":0, ">95":100, "<5":0, "<10":5 })
5. All * characters and carriage returns are removed from column names.
6. All raw datasets created by this program are used to create the "flattened" and "machine learning" Public School datasets.
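The masking rule in step 4 can be tried on a toy column. This is a minimal sketch, assuming the masked values arrive as text; the column name `pct` and the sample values are invented for illustration, and only the mapping itself comes from the step above.

```python
import pandas as pd

# Masking codes taken from step 4 above.
MASK_MAP = {"*": 0, ">95": 100, "<5": 0, "<10": 5}

# Hypothetical raw column; dtype=object keeps the strings as a csv reader would deliver them.
raw = pd.DataFrame({"pct": pd.Series(["*", ">95", "<5", "<10", "42.5"], dtype=object)})

# replace() only swaps cells that match a key exactly, so "42.5" is left alone.
unmasked = raw.replace(MASK_MAP)

# With the masks gone, the column can be converted to numbers.
unmasked["pct"] = pd.to_numeric(unmasked["pct"])
print(unmasked)
```

Because the match is on whole cells, a value such as `>95%` would not be caught by this mapping and would need its own key.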

In [95]:
#Run this to add the correct packages path to your Jupyter environment, if it is missing.
#import sys
#sys.path.append('C:/Users/Jake/Anaconda2/envs/example_env/Lib/site-packages')

In [54]:
#import required Libraries
import pandas as pd
import numpy as np

#**********************************************************************************
# Set the following variables before running this code!!!
#**********************************************************************************

#Location where copies of the raw data files will be downloaded and saved as csv files.
dataDir = 'D:/BenepactLLC/Belk/NC_Report_Card_Data/2020/May 2020/2018/Raw Datasets/'

#All raw data files are filtered for the year below
schoolYear = 2018

## Download and Save Copy of the Original SRC Data

In [55]:
import urllib.request

#Download and save an original copy of the raw SRC data 
url="http://www.ncpublicschools.org/docs/src/researchers/src-datasets.zip"
zipFilePath = dataDir + 'src-datasets.zip'

#Comment out the next line after downloading the original data one time! 
#urllib.request.urlretrieve(url, zipFilePath)

import zipfile

#Extract the zip file and all school datasets to the //Raw Datasets/ folder
#The with-block closes the archive automatically, even if extraction fails
with zipfile.ZipFile(zipFilePath, 'r') as zip_ref:
    zip_ref.extractall(dataDir)
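
The download line above is meant to be commented out by hand after the first run. One possible alternative, sketched here under assumptions, is to skip the download whenever the zip file already exists. The helper name `download_once` and its `fetch` parameter are hypothetical and not part of the original notebook.

```python
import os
import urllib.request

def download_once(url, target_path, fetch=urllib.request.urlretrieve):
    """Download url to target_path unless that file already exists.

    Returns True when a download happened and False when the existing copy was kept.
    The fetch argument is a hypothetical hook so the function can be tried offline.
    """
    if os.path.exists(target_path):
        return False
    fetch(url, target_path)
    return True
```

Calling `download_once(url, zipFilePath)` with the notebook's variables would then be safe to rerun, since a second call leaves the saved copy untouched.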

# Get the Most Recent Year of Data from Each File

In [56]:
# Update the dataDir path for this part
dataDir = dataDir + 'SRC_Datasets/'

In [57]:
#Use ntpath.basename to get a filename from a filepath
import ntpath

def CleanUpRcdFiles(filePath):
    fileName = ntpath.basename(filePath)
    schFile = pd.read_csv(filePath, dtype={'agency_code': object}, low_memory=False)
    maxYear = schFile['year'].max()
    
    #Filter records for the most recent year
    schFile = schFile[schFile['year'] == maxYear]
    
    #Remove state and district level summary records 
    #schFile = schFile[(schFile['agency_code'] != 'NC-SEA') & (schFile['agency_code'].str.contains("LEA") == False)]
        
    #Replace masked cells (values equal to '*') with empty strings.
    schFile = schFile.replace({'*':''})
    schFile.to_csv(dataDir + fileName, sep=',', index=False)
    
    print(fileName + ', Max Year: ' + str(maxYear))
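
The core of `CleanUpRcdFiles` is the "keep only the latest year" filter. A minimal sketch on invented data follows; the csv text is made up, and reading `agency_code` as `object` mirrors the notebook so that leading zeros survive.

```python
import io
import pandas as pd

# Invented stand-in for one rcd_*.csv file.
csv_text = (
    "year,agency_code,pct\n"
    "2017,010303,0.91\n"
    "2018,010303,0.93\n"
    "2018,010304,0.90\n"
)

# Reading agency_code as object keeps codes such as 010303 from becoming 10303.
frame = pd.read_csv(io.StringIO(csv_text), dtype={"agency_code": object})

# Keep only rows from the most recent year found in the file.
max_year = frame["year"].max()
latest = frame[frame["year"] == max_year]
print(latest)
```

Each file is filtered on its own maximum, so a file whose newest data is older than `schoolYear` keeps that older year.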

In [58]:
#Use wildcards to find files in a directory
import glob

#Get and display a list of all .csv file names for 2018 download
rcdFiles = glob.glob(dataDir + 'rcd*.csv')

print('Saving Files to: ' + dataDir + '\n')

for filePth in rcdFiles:
    fileName = ntpath.basename(filePth)
    if fileName != 'rcd_code_desc.csv': 
        CleanUpRcdFiles(filePth)

Saving Files to: D:/BenepactLLC/Belk/NC_Report_Card_Data/2020/May 2020/2018/Raw Datasets/SRC_Datasets/

rcd_161.csv, Max Year: 2012
rcd_acc_aapart.csv, Max Year: 2018
rcd_acc_act.csv, Max Year: 2018
rcd_acc_awa.csv, Max Year: 2018
rcd_acc_cgr.csv, Max Year: 2018
rcd_acc_eds.csv, Max Year: 2018
rcd_acc_elp.csv, Max Year: 2018
rcd_acc_essa_desig.csv, Max Year: 2018
rcd_acc_gp.csv, Max Year: 2018
rcd_acc_irm.csv, Max Year: 2018
rcd_acc_lowperf.csv, Max Year: 2018
rcd_acc_ltg.csv, Max Year: 2018
rcd_acc_ltg_detail.csv, Max Year: 2018
rcd_acc_mcr.csv, Max Year: 2018
rcd_acc_part.csv, Max Year: 2018
rcd_acc_part_detail.csv, Max Year: 2018
rcd_acc_pc.csv, Max Year: 2018
rcd_acc_rta.csv, Max Year: 2018
rcd_acc_spg1.csv, Max Year: 2017
rcd_acc_spg2.csv, Max Year: 2018
rcd_acc_wk.csv, Max Year: 2018
rcd_adm.csv, Max Year: 2018
rcd_ap.csv, Max Year: 2018
rcd_arts.csv, Max Year: 2018
rcd_att.csv, Max Year: 2018
rcd_charter.csv, Max Year: 2018
rcd_chronic_absent.csv, Max Year: 2018.0
rcd_college.cs

In [59]:
#Remove comma from amount field in rcd_improvement
rcd_improvement = pd.read_csv(dataDir + 'rcd_improvement.csv', low_memory=False, dtype={'agency_code': object})
rcd_improvement['amount'] = rcd_improvement['amount'].astype(str).str.replace(',', '').astype(float)
rcd_improvement.to_csv(dataDir + 'rcd_improvement.csv', sep=',', index=False)
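
As a possible alternative to stripping commas after loading, `pd.read_csv` accepts a `thousands` argument that removes the separator while parsing. The sketch below uses invented rows and an invented strategy name; whether it suits the real `rcd_improvement.csv` depends on how that file is quoted, which is an assumption here.

```python
import io
import pandas as pd

# Invented rows; amounts are quoted because they contain the field delimiter.
csv_text = (
    'agency_code,strategy,amount\n'
    '010303,Tutoring,"1,250"\n'
    '010304,Tutoring,"12,000.50"\n'
)

# thousands="," tells the parser to ignore commas inside numeric fields.
frame = pd.read_csv(io.StringIO(csv_text), dtype={"agency_code": object}, thousands=",")
print(frame["amount"])
```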

# Flatten the Raw Data Files
### This section reads raw data files directly from the \\Raw Datasets folder and flattens each file.
1. Each agency_code could represent National, State, District, or School Campus level data.
2. This code creates new data columns using pivots until there is only one record per agency_code.
3. Percentage fields are always used for pivot values in cases where counts, denominators, or percentages are available.
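
The flattening described above can be sketched on a toy table. The data and the `_AWA` suffix below are only illustrative; the pattern (pivot, join the multi-level column names, reset the index) follows the same steps the notebook's pivot helper uses.

```python
import pandas as pd

# Invented long-format data: two rows per agency_code, one per subgroup.
long_df = pd.DataFrame({
    "agency_code": ["010303", "010303", "010304", "010304"],
    "subgroup": ["ALL", "EDS", "ALL", "EDS"],
    "pct": [50.0, 40.0, 60.0, 30.0],
})

# Pivot so that each subgroup becomes its own column.
wide = pd.pivot_table(long_df, values=["pct"], index="agency_code", columns=["subgroup"])

# Column labels are now tuples such as ("pct", "ALL"); join them into flat names.
wide.columns = ["_".join(str(part) for part in col) + "_AWA" for col in wide.columns]

# Turn the index back into a regular column for later merges.
wide = wide.reset_index()
print(wide)
```

The result has one row per `agency_code`, with columns `pct_ALL_AWA` and `pct_EDS_AWA`.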

In [61]:
#Get and display a list of all .csv file names for 2018 download
rcdFiles = glob.glob(dataDir + 'rcd*.csv')

rcdFileNames = [ntpath.basename(x)[:-4] for x in rcdFiles]

In [62]:
#Do not process the rcd_code_desc file  
rcdFileNames.remove('rcd_code_desc')
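
The name list can also be built without slicing off a fixed four characters. This is a small sketch with invented paths; `os.path.splitext` drops the extension whatever its length, and `ntpath.basename` accepts both slash styles.

```python
import ntpath
import os

# Invented example paths; both slash styles can appear on Windows.
paths = [
    "D:/data/SRC_Datasets/rcd_adm.csv",
    "D:\\data\\SRC_Datasets\\rcd_code_desc.csv",
    "D:/data/SRC_Datasets/rcd_sat.csv",
]

# Files that should not be flattened.
SKIP = {"rcd_code_desc"}

names = [os.path.splitext(ntpath.basename(p))[0] for p in paths]
names = [n for n in names if n not in SKIP]
print(names)
```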

In [63]:
def PivotCsv(dataDir, fileName, pivValues, pivIndex, pivColumns, colSuffix):
    pivFile = pd.read_csv(dataDir + fileName, low_memory=False, dtype={pivIndex: object})
    
    pivFile = pd.pivot_table(pivFile, values=pivValues,index=pivIndex,columns=pivColumns)
    
    #concatenate multiindex column names using a list comprehension.
    pivFile.columns = [ '_'.join(str(i) for i in col) + colSuffix for col in pivFile.columns]

    #Make our index a column for merges later
    pivFile.reset_index(level=0, inplace=True)
    return pivFile
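
`pd.pivot_table` aggregates with the mean unless told otherwise, and text values cannot be averaged. If a value column holds text (for instance a letter grade, which is an assumption about the data here), passing `aggfunc="first"` is one way to carry the text through the pivot. A sketch on invented rows:

```python
import pandas as pd

# Invented rows with a text-valued column.
toy = pd.DataFrame({
    "agency_code": ["010303", "010303", "010304"],
    "subgroup": ["ALL", "EDS", "ALL"],
    "spg_grade": ["B", "C", "A"],
})

# aggfunc="first" keeps the first value per cell instead of averaging.
wide = pd.pivot_table(toy, values=["spg_grade"], index="agency_code",
                      columns=["subgroup"], aggfunc="first")
wide.columns = ["_".join(str(part) for part in col) + "_SPG2" for col in wide.columns]
wide = wide.reset_index()
print(wide)
```

This choice assumes each agency_code and subgroup pair occurs only once, so "first" and "only" are the same value.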

In [64]:
#Pivot File - rcd_161 
rcd_161 = PivotCsv(dataDir, 'rcd_161.csv',['ccc_pct'],'agency_code', ['status','subgroup'],'_161')

#Pivot File - rcd_acc_aapart 
rcd_acc_aapart = PivotCsv(dataDir, 'rcd_acc_aapart.csv',['pct'],'agency_code', ['subject','grade'],'_AAPART')

#Pivot File - rcd_acc_act 
rcd_acc_act = PivotCsv(dataDir, 'rcd_acc_act.csv',['pct'],'agency_code', ['subject','subgroup'],'_ACT')

#Pivot File - rcd_acc_awa 
rcd_acc_awa = PivotCsv(dataDir, 'rcd_acc_awa.csv',['pct'],'agency_code', ['subgroup'],'_AWA')

#Pivot File - rcd_acc_cgr
rcd_acc_cgr = PivotCsv(dataDir, 'rcd_acc_cgr.csv',['pct'],'agency_code', ['cgr_type', 'subgroup'],'_CGR')

#File - rcd_acc_eds
rcd_acc_eds = pd.read_csv(dataDir + 'rcd_acc_eds.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_eds = rcd_acc_eds[['agency_code', 'pct_eds']]

#Pivot File - rcd_acc_elp
rcd_acc_elp = PivotCsv(dataDir, 'rcd_acc_elp.csv',['pct'],'agency_code', ['subgroup'],'_ELP')

#File - rcd_acc_essa_desig
rcd_acc_essa_desig = pd.read_csv(dataDir + 'rcd_acc_essa_desig.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_essa_desig.drop(['year'], axis=1, inplace=True)

#File - rcd_acc_gp
rcd_acc_gp = pd.read_csv(dataDir + 'rcd_acc_gp.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_gp.drop(['year'], axis=1, inplace=True)

#Pivot File - rcd_acc_irm
rcd_acc_irm = PivotCsv(dataDir, 'rcd_acc_irm.csv',['pct_prof'],'agency_code', ['grade'],'gr_irm')

#File - rcd_acc_lowperf
rcd_acc_lowperf = pd.read_csv(dataDir + 'rcd_acc_lowperf.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_lowperf = rcd_acc_lowperf[['agency_code', 'lp_school','rlp_school','clpc_school']]

#Pivot File - rcd_acc_ltg
rcd_acc_ltg = PivotCsv(dataDir, 'rcd_acc_ltg.csv',['pct_met'],'agency_code', ['target'],'_LTG')

#File - rcd_acc_ltg_detail
rcd_acc_ltg_detail = pd.read_csv(dataDir + 'rcd_acc_ltg_detail.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_ltg_detail.drop(['year'], axis=1, inplace=True)

#Pivot File - rcd_acc_mcr
rcd_acc_mcr = PivotCsv(dataDir, 'rcd_acc_mcr.csv',['pct'],'agency_code', ['subgroup'],'_MCR')

#Pivot File - rcd_acc_part
rcd_acc_part = PivotCsv(dataDir, 'rcd_acc_part.csv',['pct_met'],'agency_code', ['target'],'_PART')

#Pivot File - rcd_acc_part_detail
rcd_acc_part_detail = PivotCsv(dataDir, 'rcd_acc_part_detail.csv',['pct'],'agency_code', ['target','subgroup'],'_PART_DET')

#Pivot File - rcd_acc_pc - WARNING 3323 columns!!! 
rcd_acc_pc = PivotCsv(dataDir, 'rcd_acc_pc.csv',['pct'],'agency_code', ['standard','subject','grade','subgroup'],'_PC')

#Pivot File - rcd_acc_rta
rcd_acc_rta = PivotCsv(dataDir, 'rcd_acc_rta.csv',['pct'],'agency_code', ['metric'],'_RTA')

#File - rcd_acc_spg1
rcd_acc_spg1 = pd.read_csv(dataDir + 'rcd_acc_spg1.csv', low_memory=False, dtype={'agency_code': object})
rcd_acc_spg1.drop(['year'], axis=1, inplace=True)

#File - rcd_acc_spg2
pivVals = ['aaa_score','awa_score','cgrs_score','elp_score','mcr_score','scgs_score','bi_score',
           'ach_score','eg_status','eg_score','spg_score','spg_grade']
           
rcd_acc_spg2 = PivotCsv(dataDir, 'rcd_acc_spg2.csv',pivVals,'agency_code', ['subgroup'],'_SPG2')

#Pivot File - rcd_acc_wk
rcd_acc_wk = PivotCsv(dataDir, 'rcd_acc_wk.csv',['pct'],'agency_code', ['subgroup'],'_WK')

#File - rcd_adm
rcd_adm = pd.read_csv(dataDir + 'rcd_adm.csv', low_memory=False, dtype={'agency_code': object})
rcd_adm.drop(['year','category_code'], axis=1, inplace=True)

#File - rcd_ap
#Found 0 duplicate agency_codes in this file, no pivot 
rcd_ap = pd.read_csv(dataDir + 'rcd_ap.csv', low_memory=False, dtype={'agency_code': object})
rcd_ap.drop(['year','category_code'], axis=1, inplace=True)

#File - rcd_arts
rcd_arts = pd.read_csv(dataDir + 'rcd_arts.csv', low_memory=False, dtype={'agency_code': object})
rcd_arts.drop(['year'], axis=1, inplace=True)

#File - rcd_att
#Found 0 duplicate agency_codes in this file, no pivot 
rcd_att = pd.read_csv(dataDir + 'rcd_att.csv', low_memory=False, dtype={'agency_code': object})
rcd_att.drop(['year','category_code'], axis=1, inplace=True)

#Pivot File - rcd_charter
rcd_charter = PivotCsv(dataDir, 'rcd_charter.csv',['pct_enrolled'],'agency_code', ['home_lea','subgroup'],'_CHARTER')

#Pivot File - rcd_chronic_absent
rcd_chronic_absent = PivotCsv(dataDir, 'rcd_chronic_absent.csv',['pct'],'agency_code', ['subgroup'],'_CHRON_ABSENT')

#Pivot File - rcd_college
rcd_college = PivotCsv(dataDir, 'rcd_college.csv',['pct_enrolled'],'agency_code', ['status','subgroup'],'_COLLEGE')

#File - rcd_courses1 - 2017 DATA
#Found 0 duplicate agency_codes in this file, no pivot 
rcd_courses1 = pd.read_csv(dataDir + 'rcd_courses1.csv', low_memory=False, dtype={'agency_code': object})
rcd_courses1.drop(['year','category_code'], axis=1, inplace=True)

#Pivot File - rcd_courses2
rcd_courses2 = PivotCsv(dataDir, 'rcd_courses2.csv',['pct_ap','pct_ccp','pct_ib'],'agency_code', ['category_code','subgroup'],
                        '_COURSES2')

#Pivot File - rcd_cte_concentrators
rcd_cte_concentrators = PivotCsv(dataDir, 'rcd_cte_concentrators.csv',['num_concentrators'],'agency_code',
                                 ['career_cluster'],'')

#File - rcd_cte_credentials
rcd_cte_credentials = pd.read_csv(dataDir + 'rcd_cte_credentials.csv', low_memory=False, dtype={'agency_code': object})
rcd_cte_credentials.drop(['year'], axis=1, inplace=True)

#File - rcd_cte_endorsement
rcd_cte_endorsement = pd.read_csv(dataDir + 'rcd_cte_endorsement.csv', low_memory=False, dtype={'agency_code': object})
rcd_cte_endorsement.drop(['year'], axis=1, inplace=True)

#File - rcd_cte_enrollment
rcd_cte_enrollment = pd.read_csv(dataDir + 'rcd_cte_enrollment.csv', low_memory=False, dtype={'agency_code': object})
rcd_cte_enrollment['cte_enrollment_pct'] = rcd_cte_enrollment['pct'] 
rcd_cte_enrollment.drop(['year','pct'], axis=1, inplace=True)

#File - rcd_dlmi
rcd_dlmi = pd.read_csv(dataDir + 'rcd_dlmi.csv', low_memory=False, dtype={'agency_code': object})
rcd_dlmi.drop(['year'], axis=1, inplace=True)

#Pivot File - rcd_effectiveness - 2017 Data
rcd_effectiveness = PivotCsv(dataDir, 'rcd_effectiveness.csv',['pct_rating'],'agency_code', ['ee_standard','ee_rating'],'')

#File - rcd_esea_att - 2015 DATA
#Found 0 duplicate agency_codes in this file, no pivot 
rcd_esea_att = pd.read_csv(dataDir + 'rcd_esea_att.csv', low_memory=False, dtype={'agency_code': object})
rcd_esea_att.drop(['year','category_code'], axis=1, inplace=True)

#Pivot File - rcd_experience
expPivColumns = ['pct_experience_0','pct_experience_10','pct_experience_4',
                 'pct_adv_degree','pct_turnover','total_class_teach','avg_class_teach']
rcd_experience = PivotCsv(dataDir, 'rcd_experience.csv',expPivColumns,'agency_code', ['staff'],'Exp')

#File - rcd_funds - !!!DISTRICT LEVEL DATA!!!
rcd_funds = pd.read_csv(dataDir + 'rcd_funds.csv', low_memory=False, dtype={'agency_code': object})
rcd_funds.drop(['year'], axis=1, inplace=True)

#Pivot File - rcd_hqt - !!!2016 DATA!!!
rcd_hqt = PivotCsv(dataDir, 'rcd_hqt.csv',['highqual_class_pct'],'agency_code', ['category_code'],'')

#File - rcd_ib
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_ib = pd.read_csv(dataDir + 'rcd_ib.csv', low_memory=False, dtype={'agency_code': object})
rcd_ib.drop(['year','category_code'], axis=1, inplace=True)

#Pivot File - rcd_improvement
rcd_improvement = PivotCsv(dataDir, 'rcd_improvement.csv',['amount'],'agency_code', ['strategy'],'_Improve_Amt')

#File - rcd_inc1
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_inc1 = pd.read_csv(dataDir + 'rcd_inc1.csv', low_memory=False, dtype={'agency_code': object})
rcd_inc1.drop(['year','category_code'], axis=1, inplace=True)

#File - rcd_inc2 
pivFields = ['iss_per1000','sts_per1000','lts_per1000',
             'exp_per1000','crime_per1000','blhr_per1000',
             'rplw_per1000','arre_per1000']
rcd_inc2 = PivotCsv(dataDir, 'rcd_inc2.csv',pivFields,'agency_code', ['subgroup'],'')

#File - rcd_licenses
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_licenses = pd.read_csv(dataDir + 'rcd_licenses.csv', low_memory=False, dtype={'agency_code': object})
rcd_licenses.drop(['year','category_code'], axis=1, inplace=True)

#File - rcd_location
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_location = pd.read_csv(dataDir + 'rcd_location.csv', low_memory=False, dtype={'agency_code': object})
rcd_location.drop(['year'], axis=1, inplace=True)


#Pivot File - rcd_naep !!!NATIONAL & STATE LEVEL DATA!!!
pivCols = ['grade','naep_subject','subgroup','Proficiency_level']
rcd_naep = PivotCsv(dataDir, 'rcd_naep.csv',['percent_proficient'],'agency_code', pivCols,'_NAEP')

#File - rcd_nbpts
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_nbpts = pd.read_csv(dataDir + 'rcd_nbpts.csv', low_memory=False, dtype={'agency_code': object})
rcd_nbpts.drop(['year','category_code','total_nbpts_num'], axis=1, inplace=True)

#Pivot File - rcd_pk_enroll
rcd_pk_enroll = PivotCsv(dataDir, 'rcd_pk_enroll.csv',['pct'],'agency_code', ['subgroup'],'_PK_ENROLL')

#Pivot File - rcd_prin_demo - !!! District Level Data !!!
rcd_prin_demo = PivotCsv(dataDir, 'rcd_prin_demo.csv',['pct_prin_demo'],'agency_code', ['subgroup'],'')

#File - rcd_readiness
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_readiness = pd.read_csv(dataDir + 'rcd_readiness.csv', low_memory=False, dtype={'agency_code': object})
rcd_readiness.drop(['year','category_code'], axis=1, inplace=True)

#Pivot File - rcd_sar
rcd_sar = PivotCsv(dataDir, 'rcd_sar.csv',['avg_size'],'agency_code', ['grade_eoc'],'_SAR')

#File - rcd_sat
#Found 0 duplicate agency_codes in this file at school level, no pivot 
rcd_sat = pd.read_csv(dataDir + 'rcd_sat.csv', low_memory=False, dtype={'agency_code': object})
rcd_sat.drop(['year','category_code'], axis=1, inplace=True)

#File - rcd_welcome
rcd_welcome = pd.read_csv(dataDir + 'rcd_welcome.csv', low_memory=False, dtype={'agency_code': object})
rcd_welcome.drop(['year'], axis=1, inplace=True)
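
Several comments above record that a file had "0 duplicate agency_codes" and so needed no pivot. A check of that kind might look like the sketch below; the helper name `count_duplicate_codes` and the sample frames are hypothetical.

```python
import pandas as pd

def count_duplicate_codes(frame, key="agency_code"):
    """Return how many rows repeat a key value already seen in an earlier row."""
    return int(frame[key].duplicated().sum())

# One record per agency_code: no pivot needed.
single = pd.DataFrame({"agency_code": ["010303", "010304"], "avg_score": [1.0, 2.0]})

# Several records per agency_code: needs a pivot before merging.
stacked = pd.DataFrame({
    "agency_code": ["010303", "010303", "010304"],
    "subgroup": ["ALL", "EDS", "ALL"],
})

print(count_duplicate_codes(single), count_duplicate_codes(stacked))
```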

# Save All Flattened Files to \\Raw Datasets Directory
**This code saves all the flattened file versions as .csv files in the Flattened Datasets folder under \\Raw Datasets.**

In [65]:
#Set dataDir back to the original value (one folder up from current location)
import os 
dataDir = os.path.dirname(os.path.dirname(dataDir)) + '/'
dataDir

'D:/BenepactLLC/Belk/NC_Report_Card_Data/2020/May 2020/2018/Raw Datasets/'
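
`os.path.dirname` is applied twice because the path ends in a slash: the first call only drops that trailing slash, and the second call moves up one folder. A short sketch with an invented path:

```python
import os

# Invented path with a trailing slash, shaped like dataDir at this point.
src_dir = "D:/data/Raw Datasets/SRC_Datasets/"

step_one = os.path.dirname(src_dir)    # strips only the trailing slash
step_two = os.path.dirname(step_one)   # now moves up to the parent folder
parent_dir = step_two + "/"

print(step_one)
print(parent_dir)
```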

In [66]:
print('Saving Flattened Versions and Record Counts for the Following Raw Data Files: \n')
for fileName in rcdFileNames:
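    #globals()[fileName] looks up the DataFrame whose variable name matches the file name
    flatFile = globals()[fileName]
    flatFile.to_csv(dataDir + 'Flattened Datasets/' + fileName + '.csv', sep=',', index=False)
    print(fileName + ', ' + str(len(flatFile.index)))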
    

Saving Flattened Versions and Record Counts for the Following Raw Data Files: 

rcd_161, 644
rcd_acc_aapart, 53
rcd_acc_act, 742
rcd_acc_awa, 688
rcd_acc_cgr, 738
rcd_acc_eds, 2761
rcd_acc_elp, 1809
rcd_acc_essa_desig, 2645
rcd_acc_gp, 658
rcd_acc_irm, 1276
rcd_acc_lowperf, 2760
rcd_acc_ltg, 2461
rcd_acc_ltg_detail, 2631
rcd_acc_mcr, 717
rcd_acc_part, 2527
rcd_acc_part_detail, 2527
rcd_acc_pc, 2697
rcd_acc_rta, 1576
rcd_acc_spg1, 2584
rcd_acc_spg2, 2538
rcd_acc_wk, 517
rcd_adm, 3197
rcd_ap, 563
rcd_arts, 2509
rcd_att, 3115
rcd_charter, 241
rcd_chronic_absent, 2719
rcd_college, 690
rcd_courses1, 773
rcd_courses2, 636
rcd_cte_concentrators, 492
rcd_cte_credentials, 436
rcd_cte_endorsement, 537
rcd_cte_enrollment, 1184
rcd_dlmi, 2723
rcd_effectiveness, 2724
rcd_esea_att, 2700
rcd_experience, 2758
rcd_funds, 292
rcd_hqt, 2697
rcd_ib, 51
rcd_improvement, 19
rcd_inc1, 3097
rcd_inc2, 2792
rcd_licenses, 3121
rcd_location, 2759
rcd_naep, 2
rcd_nbpts, 3124
rcd_pk_enroll, 988
rcd_prin_demo, 116
r
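
The save loop looks up each DataFrame by its variable name. A sketch of an alternative, under the assumption that the flattened frames are collected in a dictionary as they are built, is shown here. The frame names, the data, and the temporary output folder are all invented for the example.

```python
import os
import tempfile
import pandas as pd

# Hypothetical registry: file name -> flattened DataFrame.
frames = {
    "rcd_demo_a": pd.DataFrame({"agency_code": ["010303", "010304"], "pct": [0.5, 0.6]}),
    "rcd_demo_b": pd.DataFrame({"agency_code": ["010303"], "avg_size": [21]}),
}

# Create the output folder if it is missing (a temp folder stands in for dataDir).
out_dir = os.path.join(tempfile.mkdtemp(), "Flattened Datasets")
os.makedirs(out_dir, exist_ok=True)

record_counts = {}
for name, frame in frames.items():
    frame.to_csv(os.path.join(out_dir, name + ".csv"), index=False)
    record_counts[name] = len(frame.index)
    print(name + ", " + str(record_counts[name]))
```

With a dictionary, a file that was never flattened is simply missing from the registry rather than raising a `NameError` in the middle of the loop.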

## Download and Save Copy of the Original Statistical Profiles Data

In [67]:
#Statistical Profiles - Student Body Racial Compositions at the School Level
import io
import requests

url='http://apps.schools.nc.gov/ords/f?p=145:221::CSV::::'
statProfPath = dataDir + 'SRC_Datasets/' + 'ec_pupils.csv'

#Passing this URL directly into pd.read_csv() threw HTTP errors - This is my workaround
s = requests.get(url).content
ec_pupils = pd.read_csv(io.StringIO(s.decode('utf-8')), low_memory=False
                        , dtype={'LEA': object,'School': object})

#Rename year for consistency
ec_pupils.rename({'Year':'year'}, axis=1, inplace=True)

#Create agency_code from LEA and School code as an index
ec_pupils['agency_code'] = ec_pupils['LEA'] + ec_pupils['School']

#Filter to 2018 school year (There is already 2019 school year data in this file)
#ec_pupils = ec_pupils[ec_pupils.year == schoolYear]

#Some schools are missing race data.  Get the most recent year of data available for each agency code
ec_pupils = ec_pupils.sort_values(by=['agency_code', 'year'])
ec_pupils = ec_pupils.drop_duplicates(subset=["agency_code"], keep="last")

#Save the original data to the source datasets folder 
ec_pupils.to_csv(statProfPath, sep=',', index=False)

#Get data for the most recent school year
#CleanUpRcdFiles(statProfPath)
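
The "most recent row per agency_code" step can be checked on a few invented rows. The sketch also shows an optional cap at a given school year, which is an assumption added for illustration and not something the cell above does.

```python
import pandas as pd

# Invented rows; LEA and School are text so leading zeros are kept.
toy = pd.DataFrame({
    "year": [2019, 2017, 2018, 2018],
    "LEA": ["010", "010", "010", "020"],
    "School": ["303", "303", "303", "304"],
})
toy["agency_code"] = toy["LEA"] + toy["School"]

# Sort so the newest year is last within each agency_code, then keep that last row.
latest = (toy.sort_values(by=["agency_code", "year"])
             .drop_duplicates(subset=["agency_code"], keep="last"))

# Optional variant (assumption): ignore rows newer than the target school year.
school_year = 2018
capped = (toy[toy["year"] <= school_year]
          .sort_values(by=["agency_code", "year"])
          .drop_duplicates(subset=["agency_code"], keep="last"))

print(latest[["agency_code", "year"]])
print(capped[["agency_code", "year"]])
```

Without the cap, a school that already has a 2019 row keeps that 2019 row, because the filter on `schoolYear` is commented out.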

## Create Flattened Statistical Profiles with Racial Composition Percentages

In [68]:
#***********************************************************************
# Statistical Profiles - Student Body Racial Compositions at the School Level Reshape
#
# Statistical Profiles data are already one record per public school but must be converted to percentages
# Creates a new dataset - ec_pupils_pct.csv
#
#***********************************************************************

#Statistical Profiles - Student Body Racial Compositions at the School Level
ec_pupils = pd.read_csv(statProfPath, low_memory=False, dtype={'agency_code': object})

#Create Racial Composition summary variables
ec_pupils['Indian'] = ec_pupils['Indian Male'] + ec_pupils['Indian Female']
ec_pupils['Asian'] = ec_pupils['Asian Male'] + ec_pupils['Asian Female']
ec_pupils['Hispanic'] = ec_pupils['Hispanic Male'] + ec_pupils['Hispanic Female']
ec_pupils['Black'] = ec_pupils['Black Male'] + ec_pupils['Black Female']
ec_pupils['White'] = ec_pupils['White Male'] + ec_pupils['White Female']
ec_pupils['Pacific Island'] = ec_pupils['Pacific Island Male'] + ec_pupils['Pacific Island Female']
ec_pupils['Two or  More'] = ec_pupils['Two or  More Male'] + ec_pupils['Two or  More Female']

#The original total field is corrupted with non-printable characters and will not convert to int or float 
ec_pupils.drop(['Total'], axis=1, inplace=True)
#Create a new totals field by summing race composition fields
ec_pupils['Total'] = ec_pupils['Indian'] + ec_pupils['Asian'] + \
                     ec_pupils['Hispanic'] + ec_pupils['Black'] + \
                     ec_pupils['White'] + ec_pupils['Pacific Island'] + ec_pupils['Two or  More']
#Convert Totals to float64 for division later
ec_pupils['Total'] = ec_pupils['Total'].astype(np.float64)

#Create Minority summary variables 
ec_pupils['Minority Male'] = ec_pupils['Indian Male'] + ec_pupils['Asian Male'] \
                           + ec_pupils['Hispanic Male'] + ec_pupils['Black Male'] \
                           + ec_pupils['Pacific Island Male'] + ec_pupils['Two or  More Male'] 
ec_pupils['Minority Female'] = ec_pupils['Indian Female'] + ec_pupils['Asian Female'] \
                           + ec_pupils['Hispanic Female'] + ec_pupils['Black Female'] \
                           + ec_pupils['Pacific Island Female'] + ec_pupils['Two or  More Female']
ec_pupils['Minority'] = ec_pupils['Minority Male'] + ec_pupils['Minority Female']

#Create Student Body Racial Composition PERCENTAGES at the School Level
ec_pupils_pct = pd.DataFrame({'agency_code'   : ec_pupils['agency_code']
                            , 'School Name' : ec_pupils['___School Name___']
                            , 'IndianPct'   : ec_pupils['Indian'] / ec_pupils['Total']  
                            , 'AsianPct'    : ec_pupils['Asian'] / ec_pupils['Total']
                            , 'HispanicPct' : ec_pupils['Hispanic'] / ec_pupils['Total']
                            , 'BlackPct'    : ec_pupils['Black'] / ec_pupils['Total']
                            , 'WhitePct'    : ec_pupils['White'] / ec_pupils['Total']
                            , 'PacificIslandPct': ec_pupils['Pacific Island'] / ec_pupils['Total']
                            , 'TwoOrMorePct': ec_pupils['Two or  More'] / ec_pupils['Total']
                            , 'MinorityPct' : ec_pupils['Minority'] / ec_pupils['Total']
                            
                              
                            , 'IndianMalePct'   : ec_pupils['Indian Male'] / ec_pupils['Total']  
                            , 'AsianMalePct'    : ec_pupils['Asian Male'] / ec_pupils['Total']
                            , 'HispanicMalePct' : ec_pupils['Hispanic Male'] / ec_pupils['Total']
                            , 'BlackMalePct'    : ec_pupils['Black Male'] / ec_pupils['Total']
                            , 'WhiteMalePct'    : ec_pupils['White Male'] / ec_pupils['Total']
                            , 'PacificIslandMalePct': ec_pupils['Pacific Island Male'] / ec_pupils['Total']
                            , 'TwoOrMoreMalePct': ec_pupils['Two or  More Male'] / ec_pupils['Total']  
                            , 'MinorityMalePct' : ec_pupils['Minority Male'] / ec_pupils['Total']
                                                          
                            , 'IndianFemalePct'   : ec_pupils['Indian Female'] / ec_pupils['Total']  
                            , 'AsianFemalePct'    : ec_pupils['Asian Female'] / ec_pupils['Total']
                            , 'HispanicFemalePct' : ec_pupils['Hispanic Female'] / ec_pupils['Total']
                            , 'BlackFemalePct'    : ec_pupils['Black Female'] / ec_pupils['Total']
                            , 'WhiteFemalePct'    : ec_pupils['White Female'] / ec_pupils['Total']
                            , 'MinorityFemalePct' : ec_pupils['Minority Female'] / ec_pupils['Total'] 
                            , 'PacificIslandFemalePct': ec_pupils['Pacific Island Female'] / ec_pupils['Total']
                            , 'TwoOrMoreFemalePct': ec_pupils['Two or  More Female'] / ec_pupils['Total']
                             })

#Save the flattened racial composition percentage data to disk 
ec_pupils_pct.to_csv(dataDir + 'Flattened Datasets/' + 'ec_pupils_pct.csv', sep=',', index=False)

#Print file details
print('Saving Flattened Versions and Record Counts for the Following Raw Data Files: \n')
print('ec_pupils_pct' + ', ' + str(len(ec_pupils_pct.index)))

Saving Flattened Versions and Record Counts for the Following Raw Data Files: 

ec_pupils_pct, 2558
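
The share calculation above repeats the same pattern for every group. A compact sketch of that pattern, limited to two invented groups, is given below. Replacing a zero total with NaN is an assumption added here so that a school with no enrollment yields missing values in every share column.

```python
import numpy as np
import pandas as pd

# Invented counts; the second school has no students recorded.
toy = pd.DataFrame({
    "agency_code": ["010303", "010304"],
    "Asian Male": [10, 0], "Asian Female": [10, 0],
    "White Male": [30, 0], "White Female": [30, 0],
})

groups = ["Asian", "White"]
for group in groups:
    # Combine the male and female counts for each group.
    toy[group] = toy[group + " Male"] + toy[group + " Female"]

# Row totals as floats; zero totals become NaN so the division gives NaN.
total = toy[groups].sum(axis=1).astype(np.float64).replace(0, np.nan)

# Divide every group column by the row total and rename with a Pct suffix.
shares = toy[groups].div(total, axis=0).add_suffix("Pct")
shares.insert(0, "agency_code", toy["agency_code"])
print(shares)
```

As in the cell above, the values are fractions between 0 and 1 and are not multiplied by 100.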


## Create rcd_pk_enroll.csv Counts Flattened File
* The rcd_pk_enroll.csv percentages always equal 1 for the _ALL subgroup (they only show the distribution by race).
* This cell adds the actual PK enrollment counts so that PK enrollment growth can be tracked.

In [69]:
#Pivot File - rcd_pk_enroll
rcd_pk_enroll_ct = PivotCsv(dataDir, 'SRC_Datasets/rcd_pk_enroll.csv',['count'],'agency_code', ['subgroup'],'_PK_ENROLL')

rcd_pk_enroll_ct.to_csv(dataDir + 'Flattened Datasets/' + 'rcd_pk_enroll_ct.csv', sep=',', index=False)
print('rcd_pk_enroll_ct' + ', ' + str(len(rcd_pk_enroll_ct.index)))

rcd_pk_enroll_ct, 988
