# Student Record File Workflow

## Overview
This notebook provides Python functions to obtain and clean the data required for the Student Record File, using the Census API and NCES files.

The workflow produces a Student Person Record (SREC) file that can be linked to the Person Record File. The file includes student-level records with sex, grade level, race, and ethnicity.

The student records are based on NCES data.

The output of this workflow is a CSV file containing the student records.

The output CSV is designed to be used in the Interdependent Networked Community Resilience Modeling Environment (IN-CORE) for the housing unit allocation model.

IN-CORE is an open-source Python package that can be used to model the resilience of a community. To download IN-CORE, see:

https://incore.ncsa.illinois.edu/


## Instructions
Users can run the workflow by executing each block of code in the notebook.

Users can modify the code to select one county or multiple counties.
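
As a minimal sketch of what the county selection amounts to, the example below filters a small made-up table by a list of county FIPS codes. The table, school names, and the second and third FIPS codes are illustrative assumptions, not NCES data; only `37155` (the county used later in this notebook) and the `CNTY15` column name come from the workflow itself. FIPS codes are kept as strings so that leading zeros are preserved.

```python
import pandas as pd

# Toy table standing in for an NCES school file (values are made up)
schools = pd.DataFrame({
    'NAME':   ['School A', 'School B', 'School C', 'School D'],
    'CNTY15': ['37155', '37051', '37155', '48041'],
})

# Select a single county ...
one_county = schools[schools['CNTY15'].isin(['37155'])]
# ... or several counties at once by extending the list
multi_county = schools[schools['CNTY15'].isin(['37155', '37051'])]

print(len(one_county), "schools selected for one county")
print(len(multi_county), "schools selected for two counties")
```

In the notebook itself the same effect is obtained by editing the `county_list` variable before running the download loop.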

## Description of Program
- program:    ncoda_07ev1_run_SREC_workflow
- task:       Obtain School Location and Attendance Boundaries
- version:    2023-02-10
- project:    Interdependent Networked Community Resilience Modeling Environment (IN-CORE) Subtask 5.2 - Social Institutions
- funding:    NIST Financial Assistance Award Numbers: 70NANB15H044 and 70NANB20H008
- author:     Nathanael Rosenheim

- Suggested Citation:
Rosenheim, Nathanael (2021) “Detailed Household and Housing Unit Characteristics: Data and Replication Code.” DesignSafe-CI. 
https://doi.org/10.17603/ds2-jwf6-s535.

## Setup Python Environment

In [19]:
# Import Python Packages Required for program
import pandas as pd       # Pandas for reading in data 
import geopandas as gpd   # Geopandas for reading Shapefiles
import numpy as np        # Numpy for working with arrays
import os                 # Operating System (os) For folders and finding working directory
import sys                # System (sys) for adding folders to the module search path
import zipfile            # Zipfile for working with compressed Zipped files
import wget               # Wget for downloading files from the web
import scooby             # Scooby for reporting the Python environment

In [2]:
# Generate report of Python environment
print(scooby.Report(additional=['pandas']))


--------------------------------------------------------------------------------
  Date: Fri Feb 10 13:56:55 2023 Central Standard Time

                OS : Windows
            CPU(s) : 12
           Machine : AMD64
      Architecture : 64bit
               RAM : 31.6 GiB
       Environment : Jupyter

  Python 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:14:58) [MSC
  v.1929 64 bit (AMD64)]

            pandas : 1.3.5
             numpy : 1.24.2
             scipy : 1.10.0
           IPython : 8.10.0
        matplotlib : 3.6.3
            scooby : 0.5.12
--------------------------------------------------------------------------------


In [3]:
# To replicate this notebook, clone the GitHub package to a folder that is a sibling of this notebook.
# To access the sibling package, append the parent directory ('..') to the system path list.
# Set github_code_path to the directory that contains the GitHub repository.
# This step is not required when the package is in a folder below the notebook file.
github_code_path  = ""
sys.path.append(github_code_path)

In [4]:
os.getcwd()

'c:\\Users\\nathanael99\\MyProjects\\github\\intersect-community-data'

In [5]:
# Turn on autoreload with this magic command so that edited submodules are reloaded
%load_ext autoreload
%autoreload 2
# open, read, and execute python program with reusable commands
#from pyncoda.ncoda_07e_generate_prec import generate_prec_functions

## Obtain NCES Files
This section of code provides details on the web addresses for obtaining the NCES data. These data files are quite large, so it is recommended that they be downloaded only once. To facilitate downloading the files, a Comma Separated Values (CSV) file was created using Microsoft Excel (note that CSV files are easier to read into the notebook). The CSV file includes the descriptions and the important file names to be obtained. This input file can be modified for different school years.
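
The download-once idea can be sketched as follows. This is an illustrative helper, not part of the notebook: the function name `files_to_download` is hypothetical, and a temporary folder stands in for the real output directory. The file names and base URLs are taken from the file list used in this workflow. The sketch only builds the list of files that still need downloading; it does not contact the NCES server.

```python
import os
import tempfile

def files_to_download(file_urls, folder):
    """Return {filename: full_url} for files not yet present in folder."""
    missing = {}
    for filename, base_url in file_urls.items():
        if not os.path.exists(os.path.join(folder, filename)):
            # The full URL is the base URL followed by the file name
            missing[filename] = base_url + filename
    return missing

file_urls = {
    'EDGE_GEOCODE_PUBLIC_FILEDOC.pdf':
        'https://nces.ed.gov/programs/edge/docs/',
    'EDGE_GEOCODE_PUBLICSCH_1516.zip':
        'https://nces.ed.gov/programs/edge/data/',
}

with tempfile.TemporaryDirectory() as folder:
    # Pretend the documentation file was downloaded on an earlier run
    open(os.path.join(folder, 'EDGE_GEOCODE_PUBLIC_FILEDOC.pdf'), 'w').close()
    todo = files_to_download(file_urls, folder)

print(todo)
```

Only the zip file is reported as missing, so a rerun of the notebook would skip the documentation file that is already on disk.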

In [6]:
folder_path = 'pyncoda\\CommunitySourceData\\nces_ed_gov\\'
filename = 'NCES_1av1_ObtainSchoolData_2021-06-06.csv'
filelist_df = pd.read_csv(folder_path+filename)
filelist_df

Unnamed: 0,Program,File Description,School Year,Documentation File Name,Data File Name,Unzipped Shapefile File Location,Documentation File URL,Data File URL
0,EDGE,Postsecondary School File,2015-2016,EDGE_GEOCODE_POSTSEC_FILEDOC.pdf,EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip,EDGE_GEOCODE_POSTSECONDARYSCH_1516/EDGE_GEOCOD...,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
1,EDGE,Public District File,2015-2016,EDGE_GEOCODE_PUBLIC_FILEDOC.pdf,EDGE_GEOCODE_PUBLICLEA_1516.zip,EDGE_GEOCODE_PUBLICLEA_1516/EDGE_GEOCODE_PUBLI...,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
2,EDGE,Public School File,2015-2016,EDGE_GEOCODE_PUBLIC_FILEDOC.pdf,EDGE_GEOCODE_PUBLICSCH_1516.zip,EDGE_GEOCODE_PUBLICSCH_1516/EDGE_GEOCODE_PUBLI...,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
3,EDGE,Private School File,2015-2016,EDGE_GEOCODE_PSS1718_FILEDOC.pdf,EDGE_GEOCODE_PRIVATESCH_15_16.zip,EDGE_GEOCODE_PRIVATESCH_15_16.shp,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
4,EDGE,School Attendance Boundaries Single Shapefile,2015-2016,EDGE_SABS_2015_2016_TECHDOC.pdf,SABS_1516.zip,SABS_1516/SABS_1516.shp,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
5,CCD,Common Core Data Staff,2015-2016,2015-16_CCD_Companion_School_Staff.xlsx,ccd_sch_059_1516_w_2a_011717_csv.zip,,https://nces.ed.gov/ccd/xls/,https://nces.ed.gov/ccd/data/zip/
6,CCD,Common Core Data School Characteristics,2015-2016,2015-16_CCD_Companion_School_CCD_School.xlsx,ccd_sch_129_1516_w_2a_011717_csv.zip,,https://nces.ed.gov/ccd/xls/,https://nces.ed.gov/ccd/data/zip/


## Set Up Directories

In [7]:
output_sourcedata = 'Outputdata\\00_SourceData'
output_directory = 'Outputdata\\00_SourceData\\nces_ed_gov'
# Make directories to save output
# os.makedirs also creates missing parent folders (os.mkdir would fail)
if not os.path.exists(output_sourcedata):
    print("Making new directory to save output: ",
        output_sourcedata)
    os.makedirs(output_sourcedata)
if not os.path.exists(output_directory):
    print("Making new directory to save output: ",
        output_directory)
    os.makedirs(output_directory)
else:
    print("Directory",output_directory,"Already exists.")

unzipped_output_directory = output_directory+'\\unzipped'
# Make directory to save unzipped files
if not os.path.exists(unzipped_output_directory):
    print("Making unzipped_output_directory directory"+
        " to save output: ",unzipped_output_directory)
    os.makedirs(unzipped_output_directory)
else:
    print("Directory",unzipped_output_directory,
        "Already exists.")

Directory Outputdata\00_SourceData\nces_ed_gov Already exists.


In [21]:
def select_var(data, selectvar: str, selectlist):
    """
    Select rows where a variable matches any value in a list.

    Args:
        :param data: data to select from
        :type data: pandas dataframe or geopandas dataframe
        :param selectvar: Variable to select from
        :param selectlist: List of values to select

    Returns:
        dataframe: selected values from data
    """
    
    # Make a copy of object - deep = True - creates a new object
    data_selected = data[data[selectvar].isin(selectlist)].copy(deep=True)
    
    # How many observations selected 
    obs = len(data_selected.index)
    print(obs,"observations selected using ",selectvar,
            " in list ",selectlist)
    
    # Return selected data
    return data_selected

# Select School Attendance Boundary data
def select_sabs(data,NCESSCH_list,LEAID_list):
    '''
    Select School Attendance Boundary data
    based on the list of school ids (NCESSCH) and 
    school district ids (LEAID)
    '''
    
    data['slcncessch'] = np.where(data['ncessch'].isin(NCESSCH_list),1,0)
    data['slcleaid']   = np.where(data['leaid'].isin(LEAID_list),1,0)
    
    data_selected = data[(data['slcncessch'] == 1) |
                         (data['slcleaid'] == 1)].copy(deep=True)
    
    # How many observations selected 
    obs = len(data_selected.index)
    print(obs,"observations selected")
    
    return data_selected

def prepare_nces_data_for_append(gdf,
                            copyvars,
                            level,
                            schtype,
                            years):
    '''
    individual nces files have different column names
    this function renames the columns to match the
    column names in the appended file
    '''
    append_gdf = gdf[copyvars].copy()

    # All data frames need to have the same column names
    colnames = ['ncesid','name','addr','city','stabbr','zip','cnty15','geometry']
    append_gdf.columns = colnames
    
    # level relates to the level of the school
    # 1 = elementary to 5 = postsecondary
    append_gdf['level'] = level
    # School type 
    # 1 = district, 2 = public, 3 = charter, 4 = private
    append_gdf['schtype'] = schtype
    append_gdf['lat'] = append_gdf['geometry'].centroid.y
    append_gdf['lon'] = append_gdf['geometry'].centroid.x
    append_gdf['schyr'] = years
    
    return append_gdf

def split_SAB_gradelevel(df, outputfolder=None, year='2015-2016'):
    '''
    Split SABs by level and open enrollment

    The SAB file has 5 different levels (`level`)

    - 1 = Primary
    - 2 = Middle
    - 3 = High
    - 4 = Other
    - N = Not Applicable

    The SAB file has a flag for schools 
    that allow open enrollment (`openEnroll`).

    The SAB file can be split into non-overlapping 
    files that represent the different levels and 
    whether the school allows open enrollment.

    If outputfolder is given, each non-empty split
    is also saved there as a shapefile.
    '''

    sab_boundaries = {}
    condition2 = (df['openEnroll']=='0')

    sab_boundaries[('Primary SAB', year)] = \
        df[(df['level']=='1') & condition2].copy(deep=True)
    sab_boundaries[('Middle SAB', year)] = \
        df[(df['level']=='2') & condition2].copy(deep=True)
    sab_boundaries[('High SAB', year)] = \
        df[(df['level']=='3') & condition2].copy(deep=True)
    sab_boundaries[('Other SAB', year)] = \
        df[(df['level']=='4') & condition2].copy(deep=True)
    sab_boundaries[('Open Enroll SAB', year)] = \
        df[(df['openEnroll']=='1')].copy(deep=True)
    
    for key in sab_boundaries:
        # Set Coordinate Reference System to WGS84
        sab_boundaries[key] = sab_boundaries[key].to_crs(epsg=4326)
        # Save as shapefile. The original code referenced the undefined
        # names programname and newfilename; the outputfolder argument
        # and the file name pattern below are assumptions.
        if outputfolder is not None and len(sab_boundaries[key]) > 0:
            newfilename = key[0].replace(' ','_')+'_'+key[1]+'.shp'
            sab_boundaries[key].to_file(outputfolder+"/"+newfilename)

    return sab_boundaries
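
The splitting rule in `split_SAB_gradelevel` can be checked on a small table without any geometry. The sketch below is a simplified illustration, assuming `level` and `openEnroll` are stored as strings as in the function above. The five rows and their IDs are made up.

```python
import pandas as pd

# Toy SAB table without geometry (values are made up)
sab = pd.DataFrame({
    'ncessch':    ['A', 'B', 'C', 'D', 'E'],
    'level':      ['1', '2', '3', '1', '4'],
    'openEnroll': ['0', '0', '0', '1', '0'],
})

closed = sab['openEnroll'] == '0'
labels = {'1': 'Primary SAB', '2': 'Middle SAB',
          '3': 'High SAB', '4': 'Other SAB'}

# One group per level for schools without open enrollment
groups = {label: sab[(sab['level'] == code) & closed]
          for code, label in labels.items()}
# All open-enrollment schools go into their own group, whatever the level
groups['Open Enroll SAB'] = sab[sab['openEnroll'] == '1']

counts = {label: len(group) for label, group in groups.items()}
print(counts)
```

Each row lands in exactly one group here, so the group sizes sum to the number of rows. Rows with level `N` and no open enrollment would not appear in any group under this rule.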

In [None]:
'''
Loop through file list and download the data file
and documentation for each file.
'''

In [8]:
# create empty dictionary to store geodataframes for each file
schooldata = {} 
select_schooldata = {} 
county_list = ['37155']
outputfolder = "OutputData\\RobesonCounty_NC\\01_CommunitySourceData"
communityname = "RobesonCounty_NC"

for index, files in filelist_df.iterrows():
    print("\nDownloading",files['File Description'],"Files for School Year",files['School Year'])
    
    # Create dictionary with documentation and 
    # data file names and associated URL
    downloadfiles = \
        {files['Documentation File Name']:
            files['Documentation File URL'],
         files['Data File Name']:
            files['Data File URL']}
    for file in downloadfiles:
        # Set file path where file will be downloaded
        filepath = output_directory+"/"+file
        print("   Checking to see if file",file,
            "has been downloaded...")
        
        # set URL for where the file is located
        url = downloadfiles[file]+file
        
        # Check if file exists - if not then download
        if not os.path.exists(filepath):
            print("   Downloading: ",file, "from \n",url)
            wget.download(url, out=output_directory)
        else:
            print("   file",file,"already exists in folder ",
                output_directory)
            print("   original file was downloaded from", 
                url)

    # unzip the downloaded file
    print("\n Unzipping",files['File Description'],
        "Files for School Year",files['School Year'])
    
    datafile = files['Data File Name']
    # Set file path of the downloaded zip file
    filepath = output_directory+"/"+datafile
    print("   files will be unzipped to", unzipped_output_directory)
    with zipfile.ZipFile(filepath, 'r') as zip_ref:
        zip_ref.extractall(unzipped_output_directory)

    # Convert shapefiles to geopandas dataframe
    # Location of the unzipped shapefile
    shapefile = files['Unzipped Shapefile File Location']
    # Set keys
    key1 = files['File Description']
    key2 = files['School Year']
    # Check if shapefile exists; pandas reads an empty CSV cell as NaN
    if pd.isna(shapefile) or shapefile == 'None':
        print("   No shapefile to read in for this file")
        # Skip county selection for files without geometry
        continue
    filepath = unzipped_output_directory+"/"+shapefile
    print("   file saved as geopandas dataframe ",
        "in a dictionary with 2 keys.")
    # Read shapefile and set Coordinate Reference System to WGS84
    schooldata[(key1,key2)] = \
        gpd.read_file(filepath).to_crs(epsg=4326)

    # The SAB file does not have a county variable
    if "School Attendance Boundaries" in str(key1):
        print("School Attendance Boundaries can not be "
                "selected by geography.")
    else:            
        select_schooldata[(key1,key2)] = \
            select_var(schooldata[(key1,key2)],
                'CNTY15',county_list)
        # prepare data for appending
        select_schooldata[(key1,key2)] = \
            prepare_nces_data_for_append( 
                select_schooldata[(key1,key2)],
                files['copyvars'],
                files['level'],
                files['schtype'],
                files['School Year'])
        # save as shapefile (create the output folder if needed)
        os.makedirs(outputfolder, exist_ok=True)
        root_filename = files['Data File Name'][:-4]
        stem_filename = '_'+communityname+'.shp'
        new_filename = root_filename+stem_filename
        select_schooldata[(key1,key2)].to_file(outputfolder+"/"+new_filename)



Downloading Postsecondary School File Files for School Year 2015-2016
   Checking to see if file EDGE_GEOCODE_POSTSEC_FILEDOC.pdf has been downloaded...
   file EDGE_GEOCODE_POSTSEC_FILEDOC.pdf already exists in folder  Outputdata\00_SourceData\nces_ed_gov
   original file was downloaded from https://nces.ed.gov/programs/edge/docs/EDGE_GEOCODE_POSTSEC_FILEDOC.pdf
   Checking to see if file EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip has been downloaded...
   file EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip already exists in folder  Outputdata\00_SourceData\nces_ed_gov
   original file was downloaded from https://nces.ed.gov/programs/edge/data/EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip

Downloading Public District File Files for School Year 2015-2016
   Checking to see if file EDGE_GEOCODE_PUBLIC_FILEDOC.pdf has been downloaded...
   file EDGE_GEOCODE_PUBLIC_FILEDOC.pdf already exists in folder  Outputdata\00_SourceData\nces_ed_gov
   original file was downloaded from https://nces.ed.gov/programs/ed

In [None]:
# Append School Data
append_schooldata = pd.concat(select_schooldata, 
                              ignore_index=True, sort=False)


append_schooldata.head()

## Select NCES SAB data for a single county

In [18]:
# Create list of school identification numbers (NCESSCH).
# prepare_nces_data_for_append renames the columns of each selected
# file, so the IDs are read from `ncesid` (assuming the first entry
# of copyvars is the ID column).
NCESSCH_list = select_schooldata[
        ('Public School File', '2015-2016')]['ncesid'].tolist()
# Create list of local education agency identification numbers (LEAID)
LEAID_list = select_schooldata[
    ('Public District File', '2015-2016')]['ncesid'].tolist()
datafile = ('School Attendance Boundaries Single Shapefile', '2015-2016')
data = schooldata[datafile]
select_schooldata[datafile] = \
    select_sabs(data,NCESSCH_list,LEAID_list)

('Postsecondary School File', '2015-2016')
2 observations selected using  CNTY15  in list  ['37155']
('Public District File', '2015-2016')
3 observations selected using  CNTY15  in list  ['37155']
('Public School File', '2015-2016')
45 observations selected using  CNTY15  in list  ['37155']
('Private School File', '2015-2016')
5 observations selected using  CNTY15  in list  ['37155']
('School Attendance Boundaries Single Shapefile', '2015-2016')
School Attendance Boundaries can not be selected by geography.
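
Because the SAB file has no county variable, `select_sabs` keeps a boundary when either its school ID or its district ID belongs to the selected county. A minimal sketch of that either-or rule on a made-up table follows; the IDs are illustrative, not real NCES identifiers, and the flag columns that `select_sabs` adds are replaced here by a single boolean mask.

```python
import pandas as pd

# Toy SAB table (IDs are made up)
sab = pd.DataFrame({
    'ncessch': ['370001', '370002', '450001', '450002'],
    'leaid':   ['3700',   '3799',   '4500',   '3700'],
})

NCESSCH_list = ['370001']   # schools located in the county
LEAID_list = ['3700']       # districts located in the county

# Keep a boundary if the school OR its district is in the county
keep = sab['ncessch'].isin(NCESSCH_list) | sab['leaid'].isin(LEAID_list)
selected = sab[keep]

print(selected)
```

The last row is kept through its district even though its school ID is not in `NCESSCH_list`, which is the case the district list is meant to catch.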


In [None]:
df = select_schooldata[('School Attendance Boundaries Single Shapefile', '2015-2016')]
sab_boundaries = split_SAB_gradelevel(df)