Goal: Download student race and ethnicity data for all schools from past years and combine it into one master CSV called student_raceeth_all.csv.

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import re
import glob
import os

The export button on the student race/ethnicity report page shows where the year parameter goes in the URL, so we wrote a loop to download eight years of data (2010 through 2017) and save each report as an .xls file (the file extension the state uses). We did not commit the .xls files to GitHub.

In [None]:
years = range(2010, 2018)  # 2010 through 2017

# URL template from the report page's export button; {} is replaced by the year
base = 'http://profiles.doe.mass.edu/state_report/enrollmentbyracegender.aspx?mode=school&year={}&Continue=View+Report&export_excel=yes'

for year in years:
    url = base.format(year)
    print(url)
    filename = "student_raceeth_{}.xls".format(year)
    response = requests.get(url)
    response.raise_for_status()  # stop on a failed request instead of saving an error page
    with open(filename, 'w') as output:
        output.write(response.text)
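
As a rough sketch of how the year slots into the export URL, the query string can also be built from a dictionary rather than by formatting one long string. The helper names `report_url` and `report_filename` are hypothetical and not part of the original notebook. The parameter names and values are copied from the URL above, and nothing is downloaded here.

```python
from urllib.parse import urlencode

# Report page address, copied from the URL used in the download loop.
REPORT_PAGE = "http://profiles.doe.mass.edu/state_report/enrollmentbyracegender.aspx"


def report_url(year):
    """Build the export URL for one year (hypothetical helper)."""
    query = urlencode({
        "mode": "school",
        "year": year,
        "Continue": "View Report",  # urlencode turns the space into '+'
        "export_excel": "yes",
    })
    return f"{REPORT_PAGE}?{query}"


def report_filename(year):
    """Name of the local file for one year's report (hypothetical helper)."""
    return f"student_raceeth_{year}.xls"


for year in range(2010, 2018):
    print(report_filename(year), report_url(year))
```

Keeping the parameters in a dictionary makes it easier to see which part of the URL changes from year to year.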

The .xls files contained HTML elements, so we wrote a function that reads each file as HTML, converts the first table to a DataFrame, and tags each row with its source file and year.

In [16]:
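def clean_file(file):
    # read_html returns a list of DataFrames, one per <table>; the report is the first
    df = pd.read_html(file)[0]
    # the first row holds the column names, so promote it to the header
    df.columns = df.iloc[0]
    df = df[1:].copy()  # copy so the new columns are not assigned on a slice
    # record the source file (without its extension) and the year from its last four characters
    df["Source"] = os.path.splitext(file)[0]
    df["Year"] = df["Source"].str[-4:]
    return df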
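
To illustrate the header promotion and year tagging without needing a downloaded report, here is a small sketch on made-up rows shaped roughly like what `read_html` returns (integer column labels, with the header text in row 0). The school names, codes, and the "White" column values are invented for illustration, and `promote_header` is a hypothetical stand-in for the steps inside `clean_file`.

```python
import pandas as pd

# Made-up rows: integer column labels, real header text sitting in row 0.
raw = pd.DataFrame([
    ["SCHOOL", "ORG CODE", "White"],
    ["Example School A", "00010001", "50.0"],
    ["Example School B", "00010002", "62.5"],
])


def promote_header(frame, source_name):
    """Use row 0 as the header, then tag rows with source and year."""
    out = frame.iloc[1:].copy()           # drop the header row from the data
    out.columns = frame.iloc[0].tolist()  # row 0 becomes the column names
    out["Source"] = source_name
    out["Year"] = out["Source"].str[-4:]  # last four characters are the year
    return out.reset_index(drop=True)


tidy = promote_header(raw, "student_raceeth_2012")
print(tidy)
```

Note that the year comes out as a string, because it is sliced from the file name.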

We focused only on data from 2010 through 2015, so we wrote a loop to clean those files and combine them into one master DataFrame.

In [17]:
# [0-5] matches a single digit, so this picks up 2010 through 2015; sort for a stable order
all_files = sorted(glob.glob("student_raceeth_201[0-5].xls"))
df_list = []

for file in all_files:
    df_list.append(clean_file(file))

df = pd.concat(df_list)
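
As a quick check of which downloads the pattern selects, the sketch below applies it to a made-up directory listing using `fnmatch`, the standard-library module that `glob` relies on for matching. The list of file names is an assumption based on the naming used in the download loop.

```python
import fnmatch

# Hypothetical directory listing: the eight downloaded reports.
downloaded = [f"student_raceeth_{year}.xls" for year in range(2010, 2018)]

# [0-5] matches exactly one character in that range, so 2016 and 2017 are excluded.
selected = sorted(fnmatch.filter(downloaded, "student_raceeth_201[0-5].xls"))
print(selected)
```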

We renamed the "ORG CODE" column to "Org Code" and set it as the index to match the other school data files before writing to a CSV.

In [None]:
df = df.rename(columns={'ORG CODE': 'Org Code'})
df = df.set_index('Org Code')
df.to_csv("student_raceeth_all.csv")
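
One thing to watch when the CSV is read back: if the org codes have leading zeros (an assumption here, and the codes below are made up), pandas will parse them as integers and drop the zeros unless the column is read as text. A minimal sketch using an in-memory CSV:

```python
import io

import pandas as pd

# Made-up codes with leading zeros, indexed the same way as the master file.
sample = pd.DataFrame(
    {"Org Code": ["00010001", "00010002"], "Year": ["2010", "2010"]}
).set_index("Org Code")

csv_text = sample.to_csv()  # with no path given, to_csv returns the CSV as a string

# Default parsing treats the codes as integers and loses the zeros.
naive = pd.read_csv(io.StringIO(csv_text)).set_index("Org Code")

# Reading the column as text keeps the codes exactly as written.
safe = pd.read_csv(io.StringIO(csv_text), dtype={"Org Code": str}).set_index("Org Code")

print(naive.index.tolist())
print(safe.index.tolist())
```

Reading "Org Code" as a string in every file that will be joined keeps the keys comparable across the school data files.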