Goal: The student race/ethnicity reports give only percentages, so we use the DESE student enrollment reports to get total student enrollment at each school by year.
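As a rough sketch of why the totals matter: once a total is available per school and year, a percentage can be turned back into an approximate head count. The numbers below are made up, and "Pct Group A" and "Count Group A" are hypothetical column names, not ones taken from the DESE reports.

```python
import pandas as pd

# Made-up numbers; "Pct Group A" is a hypothetical column name for illustration only
demo = pd.DataFrame({
    "Org Code": ["00000001", "00000002"],
    "TOTAL": [400, 200],
    "Pct Group A": [12.5, 40.0],
})

# Approximate count = percentage / 100 * total enrollment, rounded to whole students
demo["Count Group A"] = (demo["Pct Group A"] / 100 * demo["TOTAL"]).round().astype(int)
print(demo)
```

Because the published percentages are themselves rounded, counts recovered this way are estimates.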

In [22]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import re
import glob
import os

The export button for the student enrollment report showed us where the year argument goes in the download link, so we wrote a loop to download the .xls files for 2010 through 2017.

In [None]:
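# The export link takes the school year as the `year` query parameter
base = 'http://profiles.doe.mass.edu/state_report/enrollmentbygrade.aspx?mode=school&year={}&Continue.x=9&Continue.y=6&export_excel=yes'

for year in range(2010, 2018):  # 2010 through 2017
    url = base.format(year)
    print(url)
    filename = "student_enrollment_{}.xls".format(year)
    response = requests.get(url)
    response.raise_for_status()  # stop on a failed download instead of saving an error page
    # Write the raw bytes so the file is saved exactly as the server sent it
    with open(filename, 'wb') as output:
        output.write(response.content)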
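To illustrate how the year slots into the export link, the URL and file name can be built by two small functions and checked without touching the network. This is only a sketch: `build_export_url` and `export_filename` are hypothetical helper names, not part of the original notebook.

```python
# Hypothetical helpers for illustration; nothing is downloaded here
BASE = ('http://profiles.doe.mass.edu/state_report/enrollmentbygrade.aspx'
        '?mode=school&year={}&Continue.x=9&Continue.y=6&export_excel=yes')

def build_export_url(year):
    """Return the export link with the given year filled in."""
    return BASE.format(year)

def export_filename(year):
    """Return the local file name used for that year's export."""
    return "student_enrollment_{}.xls".format(year)

# One URL per year, 2010 through 2017
urls = {year: build_export_url(year) for year in range(2010, 2018)}
for year, url in urls.items():
    print(export_filename(year), url)
```

Keeping the URL and file-name logic in functions makes it easy to inspect the links before committing to eight downloads.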

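The function below reads each .xls file into a dataframe with pd.read_html, promotes the first row to the header, and tags every row with its source file and year. Files from 2010 to 2015 are passed through the function and added to a master dataframe.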

In [None]:
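def clean_file(file):
    # pd.read_html returns a list of dataframes; the enrollment table is the first one
    df = pd.read_html(file)[0]
    # The first row holds the column names, so promote it to the header
    df.columns = df.iloc[0]
    df = df[1:].copy()  # copy so the assignments below don't write to a slice
    # Record the source file (without extension) and take the year from its last four characters
    df["Source"] = os.path.splitext(file)[0]
    df["Year"] = df["Source"].str[-4:]
    return df

clean_file('student_enrollment_2016.xls')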
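To make the header-promotion step concrete, here is a small sketch on a made-up table shaped like a parsed export whose column names landed in row 0. The school names are invented, and the frame is built by hand rather than read from a file.

```python
import pandas as pd

# Made-up rows: the real column names sit in the first data row
raw = pd.DataFrame([
    ["SCHOOL", "ORG CODE", "TOTAL"],
    ["Example - School A", "00010003", "532"],
    ["Example - School B", "00010505", "558"],
])

toy = raw[1:].copy()        # drop the header row from the data
toy.columns = raw.iloc[0]   # use that row as the column names
toy["Source"] = "student_enrollment_2010"
toy["Year"] = toy["Source"].str[-4:]  # last four characters of the source name
print(toy)
```

Because the header row shares each column with the data, the values stay as strings, which is why org codes such as 00010003 keep their leading zeros.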

In [24]:
all_files = sorted(glob.glob("student_enrollment_201[0-5].xls"))  # sorted so the years stay in order
df_list = []

for file in all_files:
    df_list.append(clean_file(file))

df = pd.concat(df_list)

We rename "ORG CODE" to "Org Code" and set it as the index before writing the master file to a .csv.

In [25]:
## Rename the ORG CODE column to Org Code
df = df.rename(columns={'ORG CODE': 'Org Code'})
df

Unnamed: 0,SCHOOL,Org Code,PK,K,1,2,3,4,5,6,7,8,9,10,11,12,SP,TOTAL,Source,Year
1,Abby Kelley Foster Charter Public (District) -...,04450105,0,98,124,136,117,137,124,132,126,122,117,89,66,38,0,1426,student_enrollment_2010,2010
2,Abington - Abington ECC,00010003,52,154,170,156,0,0,0,0,0,0,0,0,0,0,0,532,student_enrollment_2010,2010
3,Abington - Abington High,00010505,0,0,0,0,0,0,0,0,0,0,141,131,141,145,0,558,student_enrollment_2010,2010
4,Abington - Center,00010005,0,0,0,0,66,57,56,56,0,0,0,0,0,0,0,235,student_enrollment_2010,2010
5,Abington - Frolio Middle School,00010405,0,0,0,0,0,0,0,0,209,188,0,0,0,0,0,397,student_enrollment_2010,2010
6,Abington - Woodsdale,00010015,0,0,0,0,99,135,116,117,0,0,0,0,0,0,0,467,student_enrollment_2010,2010
7,Academy Of the Pacific Rim Charter Public (Dis...,04120530,0,0,0,0,0,0,84,83,75,65,49,49,44,33,0,482,student_enrollment_2010,2010
8,Acton - Douglas,00020020,0,65,67,69,73,73,74,78,0,0,0,0,0,0,0,499,student_enrollment_2010,2010
9,Acton - Gates,00020025,0,63,67,69,74,71,73,73,0,0,0,0,0,0,0,490,student_enrollment_2010,2010
10,Acton - Luther Conant,00020030,0,65,69,70,72,70,78,76,0,0,0,0,0,0,0,500,student_enrollment_2010,2010


In [27]:
df = df.set_index('Org Code')

In [28]:
df.to_csv("student_enrollment_all.csv")
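One thing to watch when the .csv is loaded again later: the org codes shown above start with zeros, and pd.read_csv parses an all-digit column as integers by default, which drops those zeros. The sketch below uses a made-up two-row stand-in for the file (invented school names) to show the difference that passing `dtype` makes.

```python
import io
import pandas as pd

# Made-up stand-in for the contents of student_enrollment_all.csv
csv_text = (
    "Org Code,SCHOOL,TOTAL\n"
    "00010003,Example - School A,532\n"
    "04450105,Example - School B,1426\n"
)

# Default parsing treats the codes as integers and loses the leading zeros
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Reading the column as text keeps the codes exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Org Code": str})

print(as_numbers["Org Code"].tolist())
print(as_text["Org Code"].tolist())
```

Keeping "Org Code" as text matters if the enrollment totals are later joined to other reports by that code.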