# Add Student Counts
This program reads the unzipped National Center for Education Statistics (NCES) Common Core of Data (CCD) school membership file and adds student counts by grade level to the school location data.

## Description of Program
- program:    NCES_2cv1_AddStudentCount
- task:       Add Count of Students by Grade Level to School Data
- version:    2022-01-07
- project:    Interdependent Networked Community Resilience Modeling Environment (IN-CORE) Subtask 5.2 - Social Institutions
- funding:    NIST Financial Assistance Award Numbers: 70NANB15H044 and 70NANB20H008 
- author:     Nathanael Rosenheim

- Suggested Citation:
Rosenheim, N. (2021) "Obtain, Clean, and Explore School Location and Attendance Boundary Data".
Archived on Github and ICPSR.

In [1]:
# Import Python Packages Required for program
import pandas as pd       # Pandas for reading in data 
#import geopandas as gpd   # Geopandas for reading Shapefiles
import numpy as np        # Numpy helps with selected data
import os                 # Operating System (os) For folders and finding working directory
#import folium as fm       # folium has more dynamic maps - but requires internet connection

In [2]:
# Display versions being used - important information for replication
import sys
print("Python Version     ", sys.version)
print("pandas version:    ", pd.__version__)
#print("geopandas version: ", gpd.__version__)
print("numpy version:     ", np.__version__)
#print("folium version:    ", fm.__version__)

Python Version      3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:22:46) [MSC v.1916 64 bit (AMD64)]
pandas version:     1.3.5
numpy version:      1.22.0


In [3]:
# Store Program Name for output files to have the same name
programname = "NCES_2cv1_AddStudentCount_2022-01-07"
# Make directory to save output
if not os.path.exists(programname):
    os.mkdir(programname)

In [4]:
os.getcwd()

'g:\\Shared drives\\HRRC_IN-CORE\\Tasks\\P4.9 Testebeds\\Lumberton_LaborMarketAllocation\\SourceData\\nces.ed.gov\\WorkNPR'

## Read in NCES Student Count Files
Files for the CCD were downloaded manually and saved to the Source Data Folder.

In [None]:
sourcefolder = '../ccd_data/ccd_sch_052_1516_w_2a_011717_csv/'
sourcefile = 'ccd_sch_052_1516_w_2a_011717.csv'

In [11]:
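# Check encoding
# https://stackoverflow.com/questions/37177069/how-to-check-encoding-of-a-csv-file/52648848
# Note: open() in text mode reports Python's default (locale) encoding for this
# platform; it does not inspect the bytes of the file itself.
with open(sourcefolder+sourcefile) as f:
    print(f)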

<_io.TextIOWrapper name='../ccd_data/ccd_sch_052_1516_w_2a_011717_csv/ccd_sch_052_1516_w_2a_011717.csv' mode='r' encoding='cp1252'>
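
Because the check above only reports the platform default, a stricter test is to decode the raw bytes with a few candidate encodings. The following is a minimal sketch, not part of the original program: the helper `first_working_encoding`, the candidate list, and the demo file are all hypothetical. A successful decode shows only that the encoding is compatible with the bytes, not that it is the encoding the file was written in.

```python
import os
import tempfile


def first_working_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes the whole file without error.

    Hypothetical helper for illustration; returns None if no candidate works.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


# Demo on a small made-up file written as cp1252 (not the CCD file itself).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write("SCH_NAME\nEscuela Peña\n".encode("cp1252"))
    detected = first_working_encoding(demo_path)

print(detected)
```

For the CCD file, the same helper could be called with `sourcefolder+sourcefile` as the path.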


In [12]:
sourcefolder = '../ccd_data/ccd_sch_052_1516_w_2a_011717_csv/'
sourcefile = 'ccd_sch_052_1516_w_2a_011717.csv'
ccd_sch = pd.read_csv(sourcefolder+sourcefile, encoding='cp1252')
ccd_sch.head()

Unnamed: 0,SURVYEAR,FIPST,STABR,STATENAME,SEANAME,LEAID,ST_LEAID,LEA_NAME,SCHID,ST_SCHID,...,BLALF,WH,WHALM,WHALF,HP,HPALM,HPALF,TR,TRALM,TRALF
0,2015-2016,1,AL,ALABAMA,Alabama Department Of Education,100002,210,Alabama Youth Services,277,210-0020,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
1,2015-2016,1,AL,ALABAMA,Alabama Department Of Education,100002,210,Alabama Youth Services,1667,210-0050,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
2,2015-2016,1,AL,ALABAMA,Alabama Department Of Education,100002,210,Alabama Youth Services,1670,210-0060,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
3,2015-2016,1,AL,ALABAMA,Alabama Department Of Education,100002,210,Alabama Youth Services,1705,210-0030,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
4,2015-2016,1,AL,ALABAMA,Alabama Department Of Education,100002,210,Alabama Youth Services,1706,210-0040,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1


In [13]:
ccd_sch.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
FIPST,99264.0,2.894247e+01,1.681424e+01,1.000000e+00,1.300000e+01,2.900000e+01,4.200000e+01,7.800000e+01
LEAID,99264.0,2.905620e+06,1.680978e+06,1.000020e+05,1.302340e+06,2.908370e+06,4.218840e+06,7.800030e+06
SCHID,99264.0,2.896736e+03,3.627689e+03,1.000000e+00,7.080000e+02,1.672000e+03,3.838250e+03,9.048000e+04
NCESSCH,99264.0,2.905620e+11,1.680978e+11,1.000020e+10,1.302340e+11,2.908370e+11,4.218840e+11,7.800030e+11
PK,99264.0,1.083756e+01,3.549082e+01,-9.000000e+00,-2.000000e+00,-2.000000e+00,5.000000e+00,1.374000e+03
...,...,...,...,...,...,...,...,...
HPALM,99264.0,9.761646e-01,1.019700e+01,-9.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,7.990000e+02
HPALF,99264.0,9.143395e-01,9.262055e+00,-9.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,7.150000e+02
TR,99264.0,1.731654e+01,2.406278e+01,-9.000000e+00,2.000000e+00,9.000000e+00,2.400000e+01,9.210000e+02
TRALM,99264.0,8.777734e+00,1.229852e+01,-9.000000e+00,1.000000e+00,5.000000e+00,1.200000e+01,4.490000e+02


In [25]:
# Try to find correct id to merge
idcols = [col for col in ccd_sch if "ID" in col]
idcols

['LEAID', 'ST_LEAID', 'SCHID', 'ST_SCHID']

In [31]:
ccd_sch[idcols+['NCESSCH']].astype(str).describe()

Unnamed: 0,ncesid,NCESSCH
count,99264,99264
unique,13219,99264
top,42,10000200277
freq,50,1


## Read in NCES School Location File
School location files were obtained and cleaned with the previous program `NCES_2av1_SelectCountySchools`.

In [14]:
sourceprogram = 'NCES_2av1_SelectCountySchools_2021-06-06'
selected_schools = pd.read_csv(sourceprogram+'/'+sourceprogram+'.csv')
selected_schools.head()

Unnamed: 0.1,Unnamed: 0,ncesid,name,addr,city,stabbr,zip,cnty15,geometry,level,schtype,lat,lon,schyr
0,0,370004002349,CIS Academy,818 West 3rd Street,Pembroke,NC,28372,37155,POINT (-79.20335664043833 34.68503759480223),99,1,34.685038,-79.203357,2015-2016
1,1,370034603302,Southeastern Academy,12251 NC HWY 41 North,Lumberton,NC,28358,37155,POINT (-78.87378865362859 34.65169717880272),99,1,34.651697,-78.873789,2015-2016
2,2,370225003249,Sandy Grove Middle,300 Chason Road,Lumber Bridge,NC,28357,37155,POINT (-79.06581931618486 34.89650979378273),99,1,34.89651,-79.065819,2015-2016
3,3,370393001569,Deep Branch Elementary,4045 Deep Branch Road,Lumberton,NC,28360,37155,POINT (-79.14600999186194 34.63037683072379),99,1,34.630377,-79.14601,2015-2016
4,4,370393001570,Fairgrove Middle,1953 Fairgrove Sch Road,Fairmont,NC,28340,37155,POINT (-79.17370687961406 34.49329831006692),99,1,34.493298,-79.173707,2015-2016


In [28]:
# Try to find correct id to merge
idcols = [col for col in selected_schools if "id" in col]
idcols

['ncesid']

In [30]:
selected_schools[idcols].astype(str).describe()

Unnamed: 0,ncesid
count,55
unique,55
top,370004002349
freq,1


## Merge Student Count Data with School Locations

In [32]:
ccd_sch['ncesid'] = ccd_sch['NCESSCH'].apply(str)
ccd_sch.head()

Unnamed: 0,SURVYEAR,FIPST,STABR,STATENAME,SEANAME,LEAID,ST_LEAID,LEA_NAME,SCHID,ST_SCHID,...,WH,WHALM,WHALF,HP,HPALM,HPALF,TR,TRALM,TRALF,ncesid
0,2015-2016,1,AL,ALABAMA,Alabama Department Of Education,100002,210,Alabama Youth Services,277,210-0020,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,10000200277
1,2015-2016,1,AL,ALABAMA,Alabama Department Of Education,100002,210,Alabama Youth Services,1667,210-0050,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,10000201667
2,2015-2016,1,AL,ALABAMA,Alabama Department Of Education,100002,210,Alabama Youth Services,1670,210-0060,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,10000201670
3,2015-2016,1,AL,ALABAMA,Alabama Department Of Education,100002,210,Alabama Youth Services,1705,210-0030,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,10000201705
4,2015-2016,1,AL,ALABAMA,Alabama Department Of Education,100002,210,Alabama Youth Services,1706,210-0040,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,10000201706
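
The Alabama rows above show 11-character `ncesid` values because `NCESSCH` was read as an integer, which drops the leading zero of the state FIPS code. NCES school IDs are 12 characters long, so one possible fix is to left-pad the string. This is a sketch on a small made-up frame named `demo`, not the author's method. It does not change the North Carolina schools used here, whose IDs already have 12 digits, and it assumes the other side of the merge stores IDs in the same padded form.

```python
import pandas as pd

# Hypothetical two-row sample: one Alabama ID (leading zero lost) and one
# North Carolina ID (already 12 digits).
demo = pd.DataFrame({"NCESSCH": [10000200277, 370004002349]})

# Convert to string and left-pad with zeros to the 12-character NCES format.
demo["ncesid"] = demo["NCESSCH"].astype(str).str.zfill(12)

print(demo["ncesid"].tolist())
```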


In [40]:
student_count_cols = ['PK','KG','G01','G02','G03','G04','G05','G06',
                      'G07','G08','G09','G10','G11','G12','G13','TOTAL']

In [41]:
# Right join keeps every selected school, including any without a CCD record
addstudents = pd.merge(left = ccd_sch[['ncesid']+student_count_cols],
                       right = selected_schools,
                       on = 'ncesid',
                       how = "right")

In [42]:
addstudents.ncesid.describe()

count               55
unique              55
top       370004002349
freq                 1
Name: ncesid, dtype: object

In [43]:
addstudents['G01'].describe()

count     44.000000
mean      47.840909
std       54.331002
min       -2.000000
25%       -2.000000
50%       42.500000
75%       89.000000
max      178.000000
Name: G01, dtype: float64
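
The `G01` count of 44 against 55 schools suggests that some selected schools may have no matching CCD record. One way to list them is the `indicator` option of `pd.merge`, which adds a `_merge` column. The sketch below uses small made-up frames (`counts`, `schools`) with hypothetical names and one invented ID, so the result illustrates the technique only and says nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-ins for ccd_sch and selected_schools.
counts = pd.DataFrame({
    "ncesid": ["370004002349", "370034603302"],
    "G01": [0, 21],
})
schools = pd.DataFrame({
    "ncesid": ["370004002349", "370034603302", "379999999999"],  # last ID is invented
    "name": ["School A", "School B", "School C"],
})

# validate raises MergeError if either side has duplicate keys;
# indicator labels each row as 'both', 'left_only', or 'right_only'.
checked = pd.merge(counts, schools, on="ncesid", how="right",
                   validate="one_to_one", indicator=True)

# Schools present in the location file but absent from the count file.
unmatched = checked.loc[checked["_merge"] == "right_only", ["ncesid", "name"]]
print(unmatched)
```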

## Work with Missing and NA Data

For Numeric Data:
* -1 denotes missing, not available, or not reported data items.
* -2 denotes "not applicable" data items.
* -9 indicates suppressed data because it failed the multi-year edit.

Source of these codes (CCD companion file): G:\Shared drives\HRRC_IN-CORE\Tasks\P4.9 Testebeds\Lumberton_LaborMarketAllocation\SourceData\nces.ed.gov\ccd_data\2015-16_CCD_Companion_School_Membership.xlsx

In [45]:
# Recode "not applicable" (-2) as zero students; -1 and -9 are left unchanged
for col in student_count_cols:
    addstudents.loc[addstudents[col] == -2,col] = 0
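
The loop above handles only the -2 code. If the -1 (missing) and -9 (suppressed) codes should not enter sums or means as negative counts, one option is to recode them as `NaN` in the same pass. This is a sketch of that assumption on a made-up frame named `demo`. Treating -1 and -9 as `NaN` is not something the original program does.

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing all three NCES codes.
demo = pd.DataFrame({"PK": [-2, 15, -1], "G01": [-9, 75, -2]})
cols = ["PK", "G01"]

# -2 (not applicable) becomes 0, as in the original loop;
# -1 (missing) and -9 (suppressed) become NaN (an added assumption).
demo[cols] = demo[cols].replace({-2: 0, -1: np.nan, -9: np.nan})

print(demo)
```

pandas skips `NaN` by default in `describe()`, `sum()`, and `mean()`, so coded values would no longer pull the statistics down.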

In [47]:
addstudents.head()

Unnamed: 0,ncesid,PK,KG,G01,G02,G03,G04,G05,G06,G07,...,city,stabbr,zip,cnty15,geometry,level,schtype,lat,lon,schyr
0,370004002349,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,40.0,...,Pembroke,NC,28372,37155,POINT (-79.20335664043833 34.68503759480223),99,1,34.685038,-79.203357,2015-2016
1,370034603302,0.0,21.0,21.0,23.0,23.0,23.0,24.0,25.0,26.0,...,Lumberton,NC,28358,37155,POINT (-78.87378865362859 34.65169717880272),99,1,34.651697,-78.873789,2015-2016
2,370225003249,0.0,0.0,0.0,0.0,0.0,0.0,0.0,174.0,159.0,...,Lumber Bridge,NC,28357,37155,POINT (-79.06581931618486 34.89650979378273),99,1,34.89651,-79.065819,2015-2016
3,370393001569,15.0,60.0,75.0,56.0,68.0,49.0,59.0,45.0,0.0,...,Lumberton,NC,28360,37155,POINT (-79.14600999186194 34.63037683072379),99,1,34.630377,-79.14601,2015-2016
4,370393001570,0.0,0.0,0.0,0.0,0.0,53.0,58.0,66.0,58.0,...,Fairmont,NC,28340,37155,POINT (-79.17370687961406 34.49329831006692),99,1,34.493298,-79.173707,2015-2016


In [39]:
addstudents.to_csv(programname+"/"+programname+".csv")
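
The `Unnamed: 0` and `Unnamed: 0.1` columns in the school location file come from saving the DataFrame index on each write and reading it back as data. A possible way to avoid this in later steps is to write without the index and read the ID back as a string. The sketch below uses a made-up frame and a temporary folder rather than the program's output path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame standing in for addstudents.
demo = pd.DataFrame({"ncesid": ["370004002349"], "G01": [0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "demo.csv")
    # index=False stops the row index from being written as an extra column.
    demo.to_csv(out_path, index=False)
    # dtype keeps the ID as text so it is not re-read as an integer.
    roundtrip = pd.read_csv(out_path, dtype={"ncesid": str})

print(roundtrip.columns.tolist())
```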