# Student Record File Workflow

## Overview
This notebook provides Python functions to obtain and clean the data required for the Student Record File, using the Census API and National Center for Education Statistics (NCES) files.

The workflow produces a Student Person Record (SREC) file that can be linked to the Person Record File. The file includes student-level records with sex, grade level, race, and ethnicity.

The records are based on NCES data.

The workflow outputs the student record file as a CSV file.

The output CSV is designed to be used in the Interdependent Networked Community Resilience Modeling Environment (IN-CORE) for the housing unit allocation model.

IN-CORE is an open-source Python package that can be used to model the resilience of a community. To download IN-CORE, see:

https://incore.ncsa.illinois.edu/


## Instructions
Users can run the workflow by executing each block of code in the notebook.

Users can modify the code to select one county or multiple counties.

## Description of Program
- program:    ncoda_07ev1_run_SREC_workflow
- task:       Obtain School Location and Attendance Boundaries
- version:    2023-02-10
- project:    Interdependent Networked Community Resilience Modeling Environment (IN-CORE) Subtask 5.2 - Social Institutions
- funding:    NIST Financial Assistance Award Numbers: 70NANB15H044 and 70NANB20H008
- author:     Nathanael Rosenheim

- Suggested Citation:
Rosenheim, Nathanael (2021) “Detailed Household and Housing Unit Characteristics: Data and Replication Code.” DesignSafe-CI. 
https://doi.org/10.17603/ds2-jwf6-s535.

## Setup Python Environment

In [4]:
# Import Python Packages Required for program
import pandas as pd       # Pandas for reading in data 
import geopandas as gpd   # Geopandas for reading Shapefiles
import os                 # Operating System (os) For folders and finding working directory
import sys                # System (sys) for modifying the module search path
import zipfile            # Zipfile for working with compressed Zipped files
import wget               # Wget for downloading files from the web
import scooby             # Scooby for reporting the Python environment

In [2]:
# Generate report of Python environment
print(scooby.Report(additional=['pandas']))


--------------------------------------------------------------------------------
  Date: Fri Feb 10 13:17:32 2023 Central Standard Time

                OS : Windows
            CPU(s) : 12
           Machine : AMD64
      Architecture : 64bit
               RAM : 31.6 GiB
       Environment : Jupyter

  Python 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:14:58) [MSC
  v.1929 64 bit (AMD64)]

            pandas : 1.3.5
             numpy : 1.24.2
             scipy : 1.10.0
           IPython : 8.10.0
        matplotlib : 3.6.3
            scooby : 0.5.12
--------------------------------------------------------------------------------


In [5]:
# To replicate this notebook, clone the GitHub package to a folder that is a sibling of this notebook.
# To access a sibling package, append the parent directory ('..') to the system path list.
# Set github_code_path to the directory that contains the GitHub repository.
# This step is not required when the package is in a folder below the notebook file.
github_code_path  = ""
sys.path.append(github_code_path)

In [6]:
os.getcwd()

'c:\\Users\\nathanael99\\MyProjects\\IN-CORE\\Tasks\\PublishHUIv2\\HousingUnitInventories_2022-03-03\\ReplicationCode\\intersect-community-data'

In [7]:
# To reload submodules need to use this magic command to set autoreload on
%load_ext autoreload
%autoreload 2
# open, read, and execute python program with reusable commands
#from pyncoda.ncoda_07e_generate_prec import generate_prec_functions

## Obtain NCES Files
This section provides the web addresses for obtaining the NCES data. These data files are quite large, so it is recommended that the files be downloaded only once. To facilitate downloading, a Comma Separated Values (CSV) file was created using Microsoft Excel (note that CSV files are easier to read into the notebook). The CSV file includes the descriptions and file names of the files to be obtained. This input file can be modified for different school years.

In [22]:
folder_path = 'pyncoda\\CommunitySourceData\\nces_ed_gov\\'
filename = 'NCES_1av1_ObtainSchoolData_2021-06-06.csv'
filelist_df = pd.read_csv(folder_path+filename)
filelist_df

Unnamed: 0,Program,File Description,School Year,Documentation File Name,Data File Name,Unzipped Shapefile File Location,Documentation File URL,Data File URL
0,EDGE,Postsecondary School File,2015-2016,EDGE_GEOCODE_POSTSEC_FILEDOC.pdf,EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip,EDGE_GEOCODE_POSTSECONDARYSCH_1516/EDGE_GEOCOD...,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
1,EDGE,Public District File,2015-2016,EDGE_GEOCODE_PUBLIC_FILEDOC.pdf,EDGE_GEOCODE_PUBLICLEA_1516.zip,EDGE_GEOCODE_PUBLICLEA_1516/EDGE_GEOCODE_PUBLI...,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
2,EDGE,Public School File,2015-2016,EDGE_GEOCODE_PUBLIC_FILEDOC.pdf,EDGE_GEOCODE_PUBLICSCH_1516.zip,EDGE_GEOCODE_PUBLICSCH_1516/EDGE_GEOCODE_PUBLI...,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
3,EDGE,Private School File,2015-2016,EDGE_GEOCODE_PSS1718_FILEDOC.pdf,EDGE_GEOCODE_PRIVATESCH_15_16.zip,EDGE_GEOCODE_PRIVATESCH_15_16.shp,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
4,EDGE,School Attendance Boundaries Single Shapefile,2015-2016,EDGE_SABS_2015_2016_TECHDOC.pdf,SABS_1516.zip,SABS_1516/SABS_1516.shp,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
5,CCD,Common Core Data Staff,2015-2016,2015-16_CCD_Companion_School_Staff.xlsx,ccd_sch_059_1516_w_2a_011717_csv.zip,,https://nces.ed.gov/ccd/xls/,https://nces.ed.gov/ccd/data/zip/
6,CCD,Common Core Data School Characteristics,2015-2016,2015-16_CCD_Companion_School_CCD_School.xlsx,ccd_sch_129_1516_w_2a_011717_csv.zip,,https://nces.ed.gov/ccd/xls/,https://nces.ed.gov/ccd/data/zip/


### Notice - Data files have Documentation Files
It is important to download the data files and the documentation files.

To match the School Attendance Zones, the 2015-2016 data for school locations will be used.
Data for other years also exist. To use a different school year, update the file names in the input CSV file.

In [23]:
output_sourcedata = 'Outputdata\\00_SourceData'
output_directory = 'Outputdata\\00_SourceData\\nces_ed_gov'
# Make directory to save output
if not os.path.exists(output_sourcedata):
    print("Making new directory to save output: ",output_sourcedata)
    os.mkdir(output_sourcedata)
if not os.path.exists(output_directory):
    print("Making new directory to save output: ",output_directory)
    os.mkdir(output_directory)
else:
    print("Directory",output_directory,"Already exists.")

Directory Outputdata\00_SourceData\nces_ed_gov Already exists.


### Loop through file list and download the data file and documentation for each file
This loop steps through each row (`iterrows`) in the dataframe. For each row, the loop creates a dictionary that maps the documentation file name and the data file name to their URLs. An inner loop then steps through the two files to download. It first checks whether the file has already been downloaded. If the file has `not` been downloaded, the program uses the `wget` function to download the data from the `url`. If the file has been downloaded (`else`), the program prints a message that the file already exists. This loop helps manage the downloading of many complex files and the associated documentation. The structure of the loop reinforces the provenance of the data, which will help future project members understand the source of the school location and attendance data.

In [24]:
for index, files in filelist_df.iterrows():
    print("\nDownloading",files['File Description'],"Files for School Year",files['School Year'])
    
    # Create dictionary with documentation and data file names and associated URL
    downloadfiles = {files['Documentation File Name']:files['Documentation File URL'],
                     files['Data File Name']:files['Data File URL']}
    for file in downloadfiles:
        # Set file path where file will be downloaded
        filepath = output_directory+"/"+file
        print("   Checking to see if file",file,"has been downloaded...")
        
        # set URL for where the file is located
        url = downloadfiles[file]+file
        
        # Check if file exists - if not then download
        if not os.path.exists(filepath):
            print("   Downloading: ",file, "from \n",url)
            wget.download(url, out=output_directory)
        else:
            print("   file",file,"already exists in folder ",output_directory)
            print("   original file was downloaded from", url)


Downloading Postsecondary School File Files for School Year 2015-2016
   Checking to see if file EDGE_GEOCODE_POSTSEC_FILEDOC.pdf has been downloaded...
   file EDGE_GEOCODE_POSTSEC_FILEDOC.pdf already exists in folder  Outputdata\00_SourceData\nces_ed_gov
   original file was downloaded from https://nces.ed.gov/programs/edge/docs/EDGE_GEOCODE_POSTSEC_FILEDOC.pdf
   Checking to see if file EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip has been downloaded...
   file EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip already exists in folder  Outputdata\00_SourceData\nces_ed_gov
   original file was downloaded from https://nces.ed.gov/programs/edge/data/EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip

Downloading Public District File Files for School Year 2015-2016
   Checking to see if file EDGE_GEOCODE_PUBLIC_FILEDOC.pdf has been downloaded...
   file EDGE_GEOCODE_PUBLIC_FILEDOC.pdf already exists in folder  Outputdata\00_SourceData\nces_ed_gov
   original file was downloaded from https://nces.ed.gov/programs/ed
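
The check-then-download step in the loop above can also be written as a small reusable function. The following is a minimal sketch, not the notebook's own code: `download_if_missing` is a hypothetical helper name, and it assumes the standard-library `urllib.request.urlretrieve` as a stand-in for the `wget` package. The `fetch` parameter is also an assumption, added so the download function can be swapped out.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(base_url, filename, out_dir, fetch=urlretrieve):
    """Download base_url + filename into out_dir unless the file already exists.

    Hypothetical helper: returns (filepath, downloaded), where downloaded is
    True only when the file was fetched during this call.
    """
    # Create the output folder (and any missing parents) if needed
    os.makedirs(out_dir, exist_ok=True)
    filepath = os.path.join(out_dir, filename)

    # Skip the download when the file is already on disk
    if os.path.exists(filepath):
        return filepath, False

    # The URL is built the same way as in the loop: folder URL + file name
    fetch(base_url + filename, filepath)
    return filepath, True
```

Inside the loop, a call such as `download_if_missing(files['Data File URL'], files['Data File Name'], output_directory)` would replace the `if not os.path.exists(...)` block. Returning the `downloaded` flag keeps the provenance messages in the calling code.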

## Unzip Folders
Each zip folder has a different structure for storing the spatial data.
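
Because the structure differs between archives, it can help to look inside a zip file before extracting it. The following is a minimal sketch, assuming only the standard-library `zipfile` module. `summarize_zip` is a hypothetical helper name that is not part of the original workflow.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_zip(zip_path):
    """Return (top-level entries, file-extension counts) for a zip archive.

    Hypothetical helper for inspecting an archive without extracting it.
    """
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()

    # Member names inside a zip archive use "/" as the separator
    top_level = sorted({name.split("/")[0] for name in names})

    # Count file extensions, skipping directory entries (names ending in "/")
    extensions = Counter(
        PurePosixPath(name).suffix.lower()
        for name in names
        if not name.endswith("/")
    )
    return top_level, extensions
```

For example, `summarize_zip(output_directory + "/SABS_1516.zip")` would show whether the shapefile sits in a subfolder or at the top level of the archive.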

In [29]:
unzipped_output_directory = output_directory+'\\unzipped'
# Make directory to save output
if not os.path.exists(unzipped_output_directory):
    print("Making unzipped_output_directory directory to save output: ",unzipped_output_directory)
    os.mkdir(unzipped_output_directory)
else:
    print("Directory",unzipped_output_directory,"Already exists.")

Making unzipped_output_directory directory to save output:  Outputdata\00_SourceData\nces_ed_gov\unzipped


In [11]:
filelist_df

Unnamed: 0,File Description,School Year,Documentation File Name,Data File Name,Documentation File URL,Data File URL
0,Postsecondary School File,2015-2016,EDGE_GEOCODE_POSTSEC_FILEDOC.pdf,EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
1,Public District File,2015-2016,EDGE_GEOCODE_PUBLIC_FILEDOC.pdf,EDGE_GEOCODE_PUBLICLEA_1516.zip,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
2,Public School File,2015-2016,EDGE_GEOCODE_PUBLIC_FILEDOC.pdf,EDGE_GEOCODE_PUBLICSCH_1516.zip,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
3,Private School File,2015-2016,EDGE_GEOCODE_PSS1718_FILEDOC.pdf,EDGE_GEOCODE_PRIVATESCH_15_16.zip,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/
4,School Attendance Boundaries Single Shapefile,2015-2016,EDGE_SABS_2015_2016_TECHDOC.pdf,SABS_1516.zip,https://nces.ed.gov/programs/edge/docs/,https://nces.ed.gov/programs/edge/data/


In [30]:
for index, files in filelist_df.iterrows():
    print("\n Unzipping",files['File Description'],"Files for School Year",files['School Year'])
    
    file = files['Data File Name']
    # Set file path where the zip file was downloaded
    filepath = output_directory+"/"+file
    print("   Checking to see if zip file exists",file,"has been downloaded...")

    # Check if zip file exists - if not then warn, otherwise unzip
    if not os.path.exists(filepath):
        print("   Warning file: ",file, "has not been downloaded - run first part of program first")
    else:
        print("   file",file,"already exists in folder ",output_directory)
        print("   files will be unzipped. to", unzipped_output_directory)
        with zipfile.ZipFile(filepath, 'r') as zip_ref:
            zip_ref.extractall(unzipped_output_directory)


 Unzipping Postsecondary School File Files for School Year 2015-2016
   Checking to see if zip file exists EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip has been downloaded...
   file EDGE_GEOCODE_POSTSECONDARYSCH_1516.zip already exists in folder  Outputdata\00_SourceData\nces_ed_gov
   files will be unzipped. to Outputdata\00_SourceData\nces_ed_gov\unzipped


FileNotFoundError: [WinError 206] The filename or extension is too long: 'Outputdata\\00_SourceData\\nces_ed_gov\\unzipped\\EDGE_GEOCODE_POSTSECONDARYSCH_1516\\EDGE_GEOCODE_POSTSECONDARYSCH_1516.gdb'
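
Windows raises `WinError 206` when a path is too long. A likely cause here, though not confirmed by the output, is that the long working directory combined with the nested `.gdb` folder name exceeds the Windows path-length limit. The following is a minimal sketch of one possible workaround, not part of the original workflow: it extracts to an absolute path with the Windows extended-length prefix (`\\?\`). `long_path` and `extract_zip` are hypothetical helper names, and the sketch assumes a local drive path rather than a network (UNC) path.

```python
import os
import zipfile


def long_path(path):
    """Return an absolute path; on Windows add the extended-length prefix.

    Hypothetical helper. Paths that already start with two backslashes
    (UNC or already prefixed) are returned unchanged.
    """
    abs_path = os.path.abspath(path)
    if os.name == "nt" and not abs_path.startswith("\\\\"):
        return "\\\\?\\" + abs_path
    return abs_path


def extract_zip(zip_path, target_dir):
    """Extract zip_path into target_dir and return the path that was used."""
    target = long_path(target_dir)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # extractall creates the target folder and subfolders as needed
        zip_ref.extractall(target)
    return target
```

In the unzip loop, `extract_zip(filepath, unzipped_output_directory)` would replace the `with zipfile.ZipFile(...)` block. Another option is to keep the code unchanged and move the project to a folder with a shorter path.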