# Student Record File Workflow

## Overview
Functions to obtain and clean the data required for the Student Record File, written in Python and using the Census API and NCES files.

The workflow produces a Student Person Record (SREC) file that can be linked to the Person Record File. The file includes student-level records with sex, grade level, race, and ethnicity.

Based on NCES data. 

The output of this workflow is a CSV file containing the student records.

The output CSV is designed to be used in the Interdependent Networked Community Resilience Modeling Environment (IN-CORE) for the housing unit allocation model.

IN-CORE is an open-source Python package that can be used to model the resilience of a community. To download IN-CORE, see:

https://incore.ncsa.illinois.edu/


## Instructions
Users can run the workflow by executing each block of code in the notebook.

Users can modify the code to select one county or multiple counties.
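
County selection in this notebook is driven by a list of five-digit county FIPS codes (`county_list`). The following is a minimal sketch of how a single-county or multi-county list might be written and checked. The helper `validate_county_fips` and the variable `multi_county_list` are hypothetical and are not part of pyncoda.

```python
def validate_county_fips(county_list):
    """Return the list unchanged if every entry is a 5-digit FIPS string.

    Hypothetical helper for illustration; not part of pyncoda.
    """
    for fips in county_list:
        if not (isinstance(fips, str) and len(fips) == 5 and fips.isdigit()):
            raise ValueError(f"Expected a 5-digit county FIPS string, got {fips!r}")
    return county_list


# One county (Galveston County, TX), as used later in this notebook
county_list = validate_county_fips(['48167'])

# Several counties: keep each code as a string so leading zeros survive
# ('37155' is the Robeson County, NC code that appears commented out later)
multi_county_list = validate_county_fips(['48167', '37155'])

# The first two digits of a county FIPS code are the state FIPS code
state_fips = sorted({fips[:2] for fips in multi_county_list})
print(state_fips)
```

Keeping the codes as strings matters because states with FIPS codes below 10 have a leading zero that would be lost if the codes were stored as integers.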

## Description of Program
- program:    ncoda_07ev1_run_SREC_workflow
- task:       Obtain School Location and Attendance Boundaries
- version:    2023-02-10
- project:    Interdependent Networked Community Resilience Modeling Environment (IN-CORE) Subtask 5.2 - Social Institutions
- funding:    NIST Financial Assistance Award Numbers: 70NANB15H044 and 70NANB20H008
- author:     Nathanael Rosenheim

- Suggested Citation:
Rosenheim, Nathanael (2021) “Detailed Household and Housing Unit Characteristics: Data and Replication Code.” DesignSafe-CI. 
https://doi.org/10.17603/ds2-jwf6-s535.

## Setup Python Environment

In [1]:
# Import Python Packages Required for program
import pandas as pd       # Pandas for reading in data 
import geopandas as gpd   # Geopandas for reading Shapefiles
import numpy as np        # Numpy for working with arrays
import os                 # Operating System (os) For folders and finding working directory
import sys                # System (sys) for modifying the module search path
import zipfile            # Zipfile for working with compressed Zipped files
import wget               # Wget for downloading files from the web
import scooby             # Scooby for reporting the Python environment

In [2]:
# Generate report of Python environment
print(scooby.Report(additional=['pandas']))


--------------------------------------------------------------------------------
  Date: Thu Feb 16 11:14:49 2023 Central Standard Time

                OS : Windows
            CPU(s) : 12
           Machine : AMD64
      Architecture : 64bit
               RAM : 31.6 GiB
       Environment : Jupyter

  Python 3.8.16 | packaged by conda-forge | (default, Feb  1 2023, 15:53:35)
  [MSC v.1929 64 bit (AMD64)]

            pandas : 1.5.3
             numpy : 1.24.2
             scipy : 1.10.0
           IPython : 8.10.0
        matplotlib : 3.6.3
            scooby : 0.7.1
--------------------------------------------------------------------------------


In [3]:
# To replicate this notebook, clone the GitHub package to a folder that is a sibling of this notebook.
# To access the sibling package, append the parent directory ('..') to the system path list,
# i.e. set github_code_path to the directory that contains the GitHub repository.
# This step is not required when the package is in a folder below the notebook file.
github_code_path  = ""
sys.path.append(github_code_path)
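
As a hedged illustration of the path setup described in the comments above, the sketch below resolves the parent directory to an absolute path and appends it only once. It assumes the cloned repository sits in the parent of the current working directory; adjust the path to match your own folder layout.

```python
import os
import sys

# Assumption: the cloned repository sits in the parent of the current
# working directory; change this to match your own layout.
github_code_path = os.path.abspath(os.path.join(os.getcwd(), ".."))

# Avoid adding the same entry again when the cell is re-run
if github_code_path not in sys.path:
    sys.path.append(github_code_path)

print(github_code_path in sys.path)
```

Using an absolute path keeps the import working even if the working directory changes later in the session.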

In [4]:
os.getcwd()

'c:\\Users\\nathanael99\\MyProjects\\github\\intersect-community-data'

In [5]:
# To reload submodules need to use this magic command to set autoreload on
%load_ext autoreload
%autoreload 2
# open, read, and execute python program with reusable commands
from pyncoda.CommunitySourceData.nces_ed_gov.nces_01a_downloadfiles \
    import *
from pyncoda.CommunitySourceData.nces_ed_gov.nces_00c_cleanutils \
    import *
from pyncoda.CommunitySourceData.nces_ed_gov.nces_02c_SRECcleanCCD \
    import *
from pyncoda.CommunitySourceData.nces_ed_gov.nces_02d_SRECtidy \
    import *

## Obtain NCES Files
This section of code provides details on the web addresses for obtaining the NCES data. These data files are quite large, so it is recommended that the files are downloaded only once. To facilitate downloading the files, a Comma Separated Values (CSV) file was created using Microsoft Excel (note that CSV files are easier to read into the notebook). The CSV file includes the descriptions and important file names to be obtained. This input file can be modified for different school years.
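
The "download once" idea can be sketched as follows. The column names `description`, `url`, and `filename`, the example URLs, and the helper `files_still_needed` are all assumptions made for illustration; the actual layout of the download list CSV is not shown in this notebook. The sketch only decides which files are still missing and does not contact any server.

```python
import csv
import io
import os
import tempfile

# Hypothetical download list; the real CSV's column names are not shown
# in this notebook, so 'description', 'url' and 'filename' are assumptions.
example_csv = """description,url,filename
Example school locations,https://example.com/schools.zip,schools.zip
Example documentation,https://example.com/filedoc.pdf,filedoc.pdf
"""


def files_still_needed(download_rows, folder):
    """Return the rows whose file is not yet present in `folder`."""
    needed = []
    for row in download_rows:
        if not os.path.exists(os.path.join(folder, row["filename"])):
            needed.append(row)
    return needed


rows = list(csv.DictReader(io.StringIO(example_csv)))

with tempfile.TemporaryDirectory() as folder:
    # Pretend one file was downloaded on an earlier run
    open(os.path.join(folder, "schools.zip"), "w").close()
    missing = [row["filename"] for row in files_still_needed(rows, folder)]

print(missing)
```

Only the rows returned by the check would then be passed to a downloader, which is what keeps repeated runs of the notebook from fetching the large files again.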

In [6]:
folder_path = 'pyncoda\\CommunitySourceData\\nces_ed_gov\\'
filename = 'nces_00b_ObtainSchoolData_2023-02-10.csv'
downloadlistcsv = folder_path + filename
#county_list = ['37155']
#communityname = "RobesonCounty_NC"
county_list = ['48167']
communityname = "GalvestonCounty_TX"
outputfolder = f"OutputData\\{communityname}\\01_CommunitySourceData"
outputfolder_tidy = f"OutputData\\{communityname}\\02_TidyCommunitySourceData"

year = '2015-2016'

In [25]:
schoollist_community = create_schoolist_community(downloadlistcsv, 
                        county_list,
                        communityname,
                        outputfolder,
                        year)

Directory Outputdata\00_SourceData\nces_ed_gov Already exists.
Directory Outputdata\00_SourceData\nces_ed_gov\unzipped Already exists.
   Checking to see if file EDGE_GEOCODE_POSTSEC_FILEDOC.pdf has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PUBLIC_FILEDOC.pdf has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PUBLICLEA_1516.zip has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PUBLIC_FILEDOC.pdf has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PUBLICSCH_1516.zip has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PSS1718_FILEDOC.pdf has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PRIV

In [8]:
SAB_community = create_sab_community(downloadlistcsv, 
                        schoollist_community,
                        communityname,
                        outputfolder,
                        year)

Directory Outputdata\00_SourceData\nces_ed_gov Already exists.
Directory Outputdata\00_SourceData\nces_ed_gov\unzipped Already exists.
   Checking to see if file EDGE_GEOCODE_POSTSEC_FILEDOC.pdf has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PUBLIC_FILEDOC.pdf has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PUBLICLEA_1516.zip has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PUBLIC_FILEDOC.pdf has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PUBLICSCH_1516.zip has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PSS1718_FILEDOC.pdf has been downloaded...
   file already exists in folder 
   Checking to see if file EDGE_GEOCODE_PRIV

In [9]:
sab_boundaries = split_SAB_gradelevel(SAB_community,
                    outputfolder,
                    year)

PrimarySAB
MiddleSAB
HighSAB
OtherSAB
    No data for OtherSAB . Skipping.
OpenEnrollSAB
    No data for OpenEnrollSAB . Skipping.


## Generate NCES Student Record File
Based on the Common Core of Data (CCD) file for schools.

In [11]:
ccd_df = nces_clean_student_ccd(outputfolder,
        county_list = county_list, 
        communityname = communityname)

NCESSCH
FIPST
LEAID
SCHNO
STID09
SEASCH09
LEANM09
LEANM09 not converted to integer
SCHNAM09
SCHNAM09 not converted to integer
PHONE09
PHONE09 not converted to integer
MSTREE09
MSTREE09 not converted to integer
MCITY09
MCITY09 not converted to integer
MSTATE09
MSTATE09 not converted to integer
MZIP09
     0 Observations with missing value code -1
    Code meaning: when numeric data are missing; that is, a value is expected but none was measured.
    Missing values replaces with 0
     0 Observations with missing value code -2
    Code meaning: when numeric data are not applicable; that is, a value is neither expected nor measured.
    Missing values replaces with 0
     0 Observations with missing value code -9
    Code meaning: when the submitted data item does not meet NCES data quality standards; the value is suppressed.
    Missing values replaces with 0
MZIP409
MZIP409 not converted to integer
LSTREE09
LSTREE09 not converted to integer
LCITY09
LCITY09 not converted to integer
LSTAT
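
The log above reports three missing-value codes (-1, -2, -9) and replaces each with 0. A minimal pandas sketch of that kind of replacement is shown below. The helper `replace_missing_codes`, the column name `example_count`, and the data are hypothetical; this is not the pyncoda implementation.

```python
import pandas as pd

# Missing-value codes and meanings as printed in the log above
NCES_MISSING_CODES = {
    -1: "missing: a value is expected but none was measured",
    -2: "not applicable: a value is neither expected nor measured",
    -9: "suppressed: does not meet NCES data quality standards",
}


def replace_missing_codes(df, column, fill_value=0):
    """Report and replace NCES missing-value codes in one column.

    Hypothetical helper for illustration; not the pyncoda implementation.
    Returns a modified copy and leaves the input frame untouched.
    """
    df = df.copy()
    for code in NCES_MISSING_CODES:
        count = int((df[column] == code).sum())
        print(f"{count} observations with missing value code {code}")
    df[column] = df[column].replace(list(NCES_MISSING_CODES), fill_value)
    return df


# Made-up counts for illustration only
example = pd.DataFrame({"example_count": [84, -1, 27, -2, -9]})
cleaned = replace_missing_codes(example, "example_count")
print(cleaned["example_count"].tolist())
```

Replacing the codes with 0 keeps the count columns summable, at the cost of no longer distinguishing a true zero from a missing or suppressed value.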

In [12]:
ccd_df.head()

Unnamed: 0,NCESSCH,FIPST,LEAID,SCHNO,STID09,SEASCH09,LEANM09,SCHNAM09,PHONE09,MSTREE09,...,HPALF09,TR09,TRALM09,TRALF09,TOTETH09,TotalStudents,CheckTotal1,CheckTotal2,CountFlag1,CountFlag2
83407,480004307907,48,4800043,7907,101809,101809001,BAY AREA CHARTER INC,ED WHITE MEMORIAL HIGH SCHOOL,2813160001,P O BOX 2126,...,0,0,0,0,84,84,0,0,0,0
83409,480004310710,48,4800043,10710,101809,101809041,BAY AREA CHARTER INC,BAY AREA CHARTER MIDDLE,2813160001,P O BOX 2126,...,0,0,0,0,27,27,0,0,0,0
83446,480005607888,48,4800056,7888,84801,84801101,MAINLAND PREPARATORY ACADEMY,MAINLAND PREPARATORY ACADEMY,4099349100,319 NEWMAN RD,...,0,0,0,0,481,481,0,0,0,0
83516,480008211124,48,4800082,11124,15819,15819103,SHEKINAH RADIANCE ACADEMY,SHEKINAH RADIANCE ACADEMY ABUNDANT LIFE,4099358773,5130 CASEY,...,0,0,0,0,315,315,0,0,0,0
83555,480010808213,48,4800108,8213,84802,84802001,ODYSSEY ACADEMY INC,ODYSSEY ACADEMY INC,4097509289,2412 61ST ST,...,0,0,0,0,506,506,0,0,0,0


In [35]:
srec_df = tidy_SREC_nces(outputfolder_tidy, 
                ccd_df = ccd_df,
                communityname = communityname,
                year = '09')


File OutputData\GalvestonCounty_TX\02_TidyCommunitySourceData/nces_tidy_SCREC_ccd_GalvestonCounty_TX_09.csv Already exists - Skipping Tidy SCREC NCES.
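
The tidy step yields one row per student rather than one row per school. The internals of `tidy_SREC_nces` are not shown in this notebook, so the following is only a sketch of the general idea of expanding aggregate counts into individual records. The function `expand_counts`, the `sex` field values, and the example counts are hypothetical; the id pattern is modeled on the `srecid` values that appear in this notebook.

```python
def expand_counts(ncessch, school_year, counts):
    """Expand aggregate counts into one record per student.

    `counts` maps (gradelevel, sex) to a number of students. Hypothetical
    sketch for illustration; not the pyncoda implementation.
    """
    records = []
    counter = 0
    for (gradelevel, sex), n in sorted(counts.items()):
        for _ in range(n):
            counter += 1
            records.append({
                # Assumed id layout: S + NCESSCH + syr + school year + c + counter
                "srecid": f"S{ncessch}syr{school_year}c{counter:04d}",
                "NCESSCH": ncessch,
                "gradelevel": gradelevel,
                "sex": sex,
            })
    return records


# Made-up counts for illustration only
example_counts = {("PK", "F"): 2, ("PK", "M"): 1, ("KG", "F"): 1}
students = expand_counts("480010808213", "20092010", example_counts)
print(len(students))
print(students[0]["srecid"])
```

Restarting the counter for each school, combined with the school id in the prefix, is one way to keep every student id unique across the whole file.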


In [36]:
# Manual fixes for North Carolina schools (NCESSCH beginning with 37),
# carried over from the Robeson County, NC run.
# For Galveston County, TX the conditions match no rows; the assignments
# only add empty ncessch_5 and ncessch_6 columns.

# Manually fix issue with L Gilbert Carroll Middle SAB
# Gilbert Carroll is an overlapping SAB
condition = (srec_df['NCESSCH'] == '370393002235')
srec_df.loc[condition,'ncessch_5'] = '370393002235'

# Manually fix issue with CIS ACADEMY
# Located in Pembroke, North Carolina
condition = (srec_df['NCESSCH'] == '370004002349')
srec_df.loc[condition,'ncessch_6'] = '370004002349'

In [37]:
srec_df.head()


Unnamed: 0,index,NCESSCH,SCHNAM09,LEVEL09,CHARTR09,LATCOD09,LONCOD09,racecat5,race,hispan,...,srecid,ncessch_1,ncessch_2,ncessch_3,ncessch_4,gradelevel1,gradelevel2,gradelevel3,ncessch_5,ncessch_6
0,4,480010808213,ODYSSEY ACADEMY INC,1,1,29.275459,-94.830946,1,3,0,...,S480010808213syr20092010c0001,480010808213,,,,PK,PK,PK,,
1,66,482028012362,KIPP COASTAL VILLAGE,1,1,29.30625,-94.777931,1,3,0,...,S482028012362syr20092010c0001,482028012362,,,,PK,PK,PK,,
2,75,482331008572,HITCHCOCK HEADSTART,1,2,29.36007,-95.029256,1,3,0,...,S482331008572syr20092010c0001,482331008572,,,,PK,PK,PK,,
3,85,482616008215,EARLY CHILDHOOD LEARNING CENTER,1,2,29.37781,-94.980339,1,3,0,...,S482616008215syr20092010c0001,482616008215,,,,PK,PK,PK,,
4,153,482028001997,CRENSHAW EL AND MIDDLE,1,2,29.488015,-94.556004,1,3,0,...,S482028001997syr20092010c0001,482028001997,,,,PK,PK,PK,,


In [38]:
srec_df.head(1).T

Unnamed: 0,0
index,4
NCESSCH,480010808213
SCHNAM09,ODYSSEY ACADEMY INC
LEVEL09,1
CHARTR09,1
LATCOD09,29.275459
LONCOD09,-94.830946
racecat5,1
race,3
hispan,0


In [39]:

srec_df.srecid.astype(str).describe()


count                             53555
unique                            53555
top       S480010808213syr20092010c0001
freq                                  1
Name: srecid, dtype: object
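
In the summary above, `count` equals `unique`, which indicates that every `srecid` occurs exactly once. The sketch below shows the same uniqueness check on a small list and splits an id into its apparent parts. The layout (`S`, a 12-digit NCESSCH, `syr`, an 8-digit school year, `c`, and a counter) is inferred from the ids shown in this notebook and is an assumption, as are the names `SRECID_PATTERN` and `parse_srecid`.

```python
import re

# Assumed layout, inferred from the ids shown above:
# 'S' + 12-digit NCESSCH + 'syr' + 8-digit school year + 'c' + counter
SRECID_PATTERN = re.compile(
    r"^S(?P<ncessch>\d{12})syr(?P<schoolyear>\d{8})c(?P<counter>\d+)$"
)


def parse_srecid(srecid):
    """Split an srecid into its parts, or raise ValueError if it does not match."""
    match = SRECID_PATTERN.match(srecid)
    if match is None:
        raise ValueError(f"Unexpected srecid format: {srecid!r}")
    return match.groupdict()


parts = parse_srecid("S480010808213syr20092010c0001")
print(parts)

# Uniqueness check equivalent to comparing 'count' and 'unique' above
ids = ["S480010808213syr20092010c0001", "S480010808213syr20092010c0002"]
all_unique = len(ids) == len(set(ids))
print(all_unique)
```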

In [44]:
# Collapse data by NCESSCH with total enrollment
year = '2009-2010'
school_total_enroll = srec_df[['NCESSCH','SCHNAM09','srecid']].groupby(['NCESSCH','SCHNAM09']).count().reset_index()
school_total_enroll.rename(columns={'srecid':f'total_enroll{year}'}, inplace=True)
school_total_enroll

Unnamed: 0,NCESSCH,SCHNAM09,total_enroll2009-2010
0,480004307907,ED WHITE MEMORIAL HIGH SCHOOL,84
1,480004310710,BAY AREA CHARTER MIDDLE,27
2,480005607888,MAINLAND PREPARATORY ACADEMY,481
3,480008211124,SHEKINAH RADIANCE ACADEMY ABUNDANT LIFE,315
4,480010808213,ODYSSEY ACADEMY INC,506
...,...,...,...
86,484251004864,ROOSEVELT-WILSON EL,632
87,484251004865,TEXAS CITY H S,1621
88,484251006177,FRY INTERMEDIATE,910
89,484251007611,TEXAS CITY J J A E P,13


In [45]:
# save as csv
school_total_enroll.to_csv(f"{outputfolder_tidy}\\{communityname}_studentcount_{year}.csv", index=False)
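
Because each row of `srec_df` represents one student, the per-school row counts should agree with the school totals in the CCD file (the `TotalStudents` column shown in `ccd_df.head()`). The following is a sketch of such a consistency check using small made-up frames. Only the column names `NCESSCH`, `srecid`, and `TotalStudents` come from the notebook; the frame names, ids, and values are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for srec_df and ccd_df; only the column names
# NCESSCH, srecid and TotalStudents come from the notebook.
srec_example = pd.DataFrame({
    "NCESSCH": ["A", "A", "A", "B", "B"],
    "srecid": ["SAc0001", "SAc0002", "SAc0003", "SBc0001", "SBc0002"],
})
ccd_example = pd.DataFrame({
    "NCESSCH": ["A", "B"],
    "TotalStudents": [3, 2],
})

# Count student rows per school
enroll = (srec_example.groupby("NCESSCH")["srecid"]
          .count()
          .rename("total_enroll")
          .reset_index())

# Compare against the reported totals; '_merge' shows whether a school
# appears in both frames or in only one of them
check = enroll.merge(ccd_example, on="NCESSCH", how="outer", indicator=True)
check["matches"] = check["total_enroll"] == check["TotalStudents"]
print(check[["NCESSCH", "total_enroll", "TotalStudents", "matches"]])
```

With real data, the `NCESSCH` columns in the two frames may need to be cast to the same type (for example with `astype(str)`) before merging.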

In [43]:
# Create list of unique school IDs (NCESSCH)
schoollist = srec_df['NCESSCH'].unique().tolist()
# loop over the list of schools and create a dataframe for each school
for school in schoollist:
    # Create a dataframe for each school
    school_df = srec_df[srec_df['NCESSCH'] == school]
    print(school_df[['CHARTR09','SCHNAM09','LEVEL09','gradelevel']].\
        groupby(['CHARTR09','LEVEL09','gradelevel']).describe())

                            SCHNAM09                                 
                               count unique                  top freq
CHARTR09 LEVEL09 gradelevel                                          
1        1       G01              46      1  ODYSSEY ACADEMY INC   46
                 G02              57      1  ODYSSEY ACADEMY INC   57
                 G03              60      1  ODYSSEY ACADEMY INC   60
                 G04              37      1  ODYSSEY ACADEMY INC   37
                 G05              38      1  ODYSSEY ACADEMY INC   38
                 G06              30      1  ODYSSEY ACADEMY INC   30
                 G07              24      1  ODYSSEY ACADEMY INC   24
                 G08              35      1  ODYSSEY ACADEMY INC   35
                 KG               64      1  ODYSSEY ACADEMY INC   64
                 PK              115      1  ODYSSEY ACADEMY INC  115
                            SCHNAM09                                  
                   

In [42]:
srec_df[['CHARTR09','SCHNAM09','LEVEL09','gradelevel']].\
    groupby(['CHARTR09','LEVEL09','gradelevel']).describe()

CHARTR09,LEVEL09,gradelevel,SCHNAM09 count,SCHNAM09 unique,SCHNAM09 top,SCHNAM09 freq
1,1,G01,250,4,KIPP COASTAL VILLAGE,109
1,1,G02,137,3,MAINLAND PREPARATORY ACADEMY,58
1,1,G03,142,3,ODYSSEY ACADEMY INC,60
1,1,G04,104,3,MAINLAND PREPARATORY ACADEMY,42
1,1,G05,99,3,MAINLAND PREPARATORY ACADEMY,39
1,1,G06,83,3,ODYSSEY ACADEMY INC,30
1,1,G07,51,2,MAINLAND PREPARATORY ACADEMY,27
1,1,G08,56,2,ODYSSEY ACADEMY INC,35
1,1,KG,243,4,KIPP COASTAL VILLAGE,89
1,1,PK,322,4,KIPP COASTAL VILLAGE,119
