# w209 Final Project Load Data

This code loads and processes the FitnessGram and academic test results into a dataset that will be used in our final project visualization.

The resulting dataset has observations for each school, district, and county, broken out by subgroup (e.g. male, female, black, hispanic, economically disadvantaged). For each observation, the fitness data report the percentage of 5th, 7th, and 9th grade students in the healthy fitness zone, the needs improvement zone, and the high risk zone for aerobic capacity and body composition.
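
As a rough illustration of the layout described above, the sketch below generates the percentage column names used later in this notebook (one per grade, fitness zone, and fitness metric). The reading of `NI_HR` as the high risk zone is an assumption based on the rename dictionaries further down.

```python
# Sketch of the wide column layout; names follow the notebook's rename dictionaries.
grades = [5, 7, 9]
zones = ['HFZ', 'NI', 'NI_HR']   # healthy fitness zone, needs improvement, high risk (assumed)
metrics = ['aerobic', 'bodycomp']

percent_columns = [
    'Perc{0}{1}_{2}'.format(grade, zone, metric)
    for metric in metrics
    for grade in grades
    for zone in zones
]

print(len(percent_columns))
print(percent_columns[:3])
```

Each observation therefore carries 18 percentage columns alongside its identifier columns.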

In [74]:
#Import Packages
import os
import pandas as pd


#*******************************************************************************
#*******************************************************************************
#Set these file paths for your own local machine before running
#*******************************************************************************
#*******************************************************************************

#Set file path containing fitnessgram data
fitnessgram_datapath = "/Users/nwchen24/Desktop/UC_Berkeley/w209_Data_Viz/final_project_data/Fitnessgram_Results"

#Set file path containing academic test data
academic_datapath = "/Users/nwchen24/Desktop/UC_Berkeley/w209_Data_Viz/final_project_data/Test_Results"

#Set file path where you want to write the combined data
combined_datapath = '/Users/nwchen24/Desktop/UC_Berkeley/w209_Data_Viz/final_project_data/Combined_Data/Combined_Physical_Fitness_Data.pkl'

# FitnessGram Data

In [2]:
#initialize lists to hold filepaths
Phys_files_list = []
Entities_files_list = []

#Walk the data directory and get all filepaths
for root, dirs, files in os.walk(fitnessgram_datapath):
    for filename in files:
        #Get full list of filepaths to the physical fitness test files
        if filename.endswith('.txt'):    
            if filename[:4] == "Phys":
                Phys_files_list.append(fitnessgram_datapath + "/PFT_" + filename[7:11] + "/" + filename)
            if filename[8:16] == "Research":
                Phys_files_list.append(fitnessgram_datapath + "/PFT_" + str(int(filename[:4])+1) + "/" + filename)

            #Get full list of filepaths to the entities files        
            if filename[:8] == "Entities":
                Entities_files_list.append(fitnessgram_datapath + "/PFT_" + filename[8:13] + "/" + filename)
            if filename[8:16] == "Entities":
                Entities_files_list.append(fitnessgram_datapath + "/PFT_" + str(int(filename[:4])+1) + "/" + filename)

In [6]:
#get list of all columns in the file from each year
Phys_col_list = []

#read PhysFit files and append column names to the list
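for filepath in Phys_files_list:
    #take the year from the "PFT_<year>" directory name rather than a fixed character
    #position, which breaks whenever fitnessgram_datapath changes length
    year = int(os.path.basename(os.path.dirname(filepath))[4:])

    #only the 2014 and later files are used
    if year >= 2014:
        #read the file
        temp_df = pd.read_csv(filepath)
        #print the year and shape
        print(year)
        print(temp_df.shape)
        #get the columns
        temp_col_list = temp_df.columns
        print(temp_col_list)
        #add columns not already encountered to the column list
        for colname in temp_col_list:
            if colname not in Phys_col_list:
                Phys_col_list.append(colname)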


2014
(2233435, 24)
Index([u'Level_Number', u'Report_Number', u'Table_Number', u'Line_Number',
       u'CO', u'DIST', u'SCHL', u'Line_Text', u'NoStud5', u'NoHFZ5', u'Perc5a',
       u'Perc5b', u'Perc5c', u'NoStud7', u'NoHFZ7', u'Perc7a', u'Perc7b',
       u'Perc7c', u'NoStud9', u'NoHFZ9', u'Perc9a', u'Perc9b', u'Perc9c',
       u'ChrtNum'],
      dtype='object')
2015
(2255801, 24)
Index([u'Level_Number', u'Report_Number', u'Table_Number', u'Line_Number',
       u'CO', u'DIST', u'SCHL', u'Line_Text', u'NoStud5', u'NoHFZ5', u'Perc5a',
       u'Perc5b', u'Perc5c', u'NoStud7', u'NoHFZ7', u'Perc7a', u'Perc7b',
       u'Perc7c', u'NoStud9', u'NoHFZ9', u'Perc9a', u'Perc9b', u'Perc9c',
       u'ChrtNum'],
      dtype='object')
2016
(2279222, 24)
Index([u'Level_Number', u'Report_Number', u'Table_Number', u'Line_Number',
       u'CO', u'DIST', u'SCHL', u'Line_Text', u'NoStud5', u'NoHFZ5', u'Perc5a',
       u'Perc5b', u'Perc5c', u'NoStud7', u'NoHFZ7', u'Perc7a', u'Perc7b',
       u'Perc7c', u'NoSt

In [5]:
#The columns are consistent among the 2014 - 2016 files
Phys_col_list

['Level_Number',
 'Report_Number',
 'Table_Number',
 'Line_Number',
 'CO',
 'DIST',
 'SCHL',
 'Line_Text',
 'NoStud5',
 'NoHFZ5',
 'Perc5a',
 'Perc5b',
 'Perc5c',
 'NoStud7',
 'NoHFZ7',
 'Perc7a',
 'Perc7b',
 'Perc7c',
 'NoStud9',
 'NoHFZ9',
 'Perc9a',
 'Perc9b',
 'Perc9c',
 'ChrtNum']

In [8]:
#collect one dataframe per year, then concatenate them
Physfit_frames = []

#read PhysFit files (2014 and later only)
for filepath in Phys_files_list:
    #take the year from the "PFT_<year>" directory name rather than a fixed character
    #position, which breaks whenever fitnessgram_datapath changes length
    year = os.path.basename(os.path.dirname(filepath))[4:]

    if int(year) >= 2014:
        temp_df = pd.read_csv(filepath)
        temp_df['Year'] = year
        Physfit_frames.append(temp_df)
        print(year + " Read Successfully")

Physfit_df = pd.concat(Physfit_frames, ignore_index = True)
    

2014 Read Successfully
2015 Read Successfully
2016 Read Successfully


### Subset Data to Keep only Observations we Want

In [62]:
Physfit_df.shape

#Keep only summaries of certain fitness tests
Physfit_df_2 = Physfit_df.loc[Physfit_df.Table_Number == 1]

#Keep only Aerobic capacity and body composition reports
#remove trailing spaces from line descriptor
Physfit_df_2['Line_Text'] = Physfit_df_2['Line_Text'].map(lambda x: x.strip())
Physfit_df_2 = Physfit_df_2.loc[(Physfit_df_2.Line_Text == "Aerobic Capacity") | (Physfit_df_2.Line_Text == "Body Composition")]

#Remove state level summaries
Physfit_df_2 = Physfit_df_2.loc[Physfit_df_2.Level_Number != 4]

#Keep only the subgroup reports (report numbers 0 - 13); the summary reports should already have been removed by the Table_Number filter above
Physfit_df_2 = Physfit_df_2.loc[Physfit_df_2.Report_Number < 14]

#********************************
#Convert to wide format
#Take aerobic capacity and body comp subsets
Physfit_df_2a = Physfit_df_2.loc[Physfit_df_2.Line_Text == "Aerobic Capacity"]
Physfit_df_2b = Physfit_df_2.loc[Physfit_df_2.Line_Text == "Body Composition"]

#rename columns
body_comp_col_dict = {}
body_comp_col_dict['NoHFZ5'] = 'NoHFZ5_bodycomp'
body_comp_col_dict['NoHFZ7'] = 'NoHFZ7_bodycomp'
body_comp_col_dict['NoHFZ9'] = 'NoHFZ9_bodycomp'

body_comp_col_dict['NoStud5'] = 'NoStud5_bodycomp'
body_comp_col_dict['NoStud7'] = 'NoStud7_bodycomp'
body_comp_col_dict['NoStud9'] = 'NoStud9_bodycomp'

body_comp_col_dict['Perc5a'] = 'Perc5HFZ_bodycomp'
body_comp_col_dict['Perc5b'] = 'Perc5NI_bodycomp'
body_comp_col_dict['Perc5c'] = 'Perc5NI_HR_bodycomp'

body_comp_col_dict['Perc7a'] = 'Perc7HFZ_bodycomp'
body_comp_col_dict['Perc7b'] = 'Perc7NI_bodycomp'
body_comp_col_dict['Perc7c'] = 'Perc7NI_HR_bodycomp'

body_comp_col_dict['Perc9a'] = 'Perc9HFZ_bodycomp'
body_comp_col_dict['Perc9b'] = 'Perc9NI_bodycomp'
body_comp_col_dict['Perc9c'] = 'Perc9NI_HR_bodycomp'


aero_col_dict = {}
aero_col_dict['NoHFZ5'] = 'NoHFZ5_aerobic'
aero_col_dict['NoHFZ7'] = 'NoHFZ7_aerobic'
aero_col_dict['NoHFZ9'] = 'NoHFZ9_aerobic'

aero_col_dict['NoStud5'] = 'NoStud5_aerobic'
aero_col_dict['NoStud7'] = 'NoStud7_aerobic'
aero_col_dict['NoStud9'] = 'NoStud9_aerobic'

aero_col_dict['Perc5a'] = 'Perc5HFZ_aerobic'
aero_col_dict['Perc5b'] = 'Perc5NI_aerobic'
aero_col_dict['Perc5c'] = 'Perc5NI_HR_aerobic'

aero_col_dict['Perc7a'] = 'Perc7HFZ_aerobic'
aero_col_dict['Perc7b'] = 'Perc7NI_aerobic'
aero_col_dict['Perc7c'] = 'Perc7NI_HR_aerobic'

aero_col_dict['Perc9a'] = 'Perc9HFZ_aerobic'
aero_col_dict['Perc9b'] = 'Perc9NI_aerobic'
aero_col_dict['Perc9c'] = 'Perc9NI_HR_aerobic'


Physfit_df_2a = Physfit_df_2a.rename(columns = aero_col_dict)
Physfit_df_2b = Physfit_df_2b.rename(columns = body_comp_col_dict)

#delete the line number and test descriptor columns; the renamed columns now identify the fitness metric
Physfit_df_2a = Physfit_df_2a.drop(['Line_Number', 'Line_Text'], axis = 1)
Physfit_df_2b = Physfit_df_2b.drop(['Line_Number', 'Line_Text'], axis = 1)

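#merge on all shared columns (no `on` given): the level, report, table, county, district, school, charter, and year identifiers
Physfit_df_2_comb = pd.merge(left = Physfit_df_2a, right = Physfit_df_2b, how = 'inner')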

#Add labels to the subgroup identifiers
subgroup_label_dict = {}
#These mappings are based on the data descriptions
subgroup_label_dict[0] = 'All'
subgroup_label_dict[1] = 'Female'
subgroup_label_dict[2] = 'Male'
subgroup_label_dict[3] = 'Black'
subgroup_label_dict[4] = 'American_Indian'
subgroup_label_dict[5] = 'Asian'
subgroup_label_dict[6] = 'Filipino'
subgroup_label_dict[7] = 'Hispanic'
subgroup_label_dict[8] = 'Hawaiian'
subgroup_label_dict[9] = 'White'
subgroup_label_dict[10] = 'Multiracial'
subgroup_label_dict[11] = 'Economic_disadv'
subgroup_label_dict[12] = 'NOT_economic_disadv'
subgroup_label_dict[13] = 'No_economic_info'

Physfit_df_2_comb = Physfit_df_2_comb.replace({"Report_Number": subgroup_label_dict})

#Drop the columns counting the number of students in each group
Physfit_df_2_comb = Physfit_df_2_comb.drop(['NoHFZ5_aerobic', 'NoHFZ7_aerobic', 'NoHFZ9_aerobic', 'NoStud5_aerobic', 'NoStud7_aerobic', 'NoStud9_aerobic',
                                           'NoHFZ5_bodycomp', 'NoHFZ7_bodycomp', 'NoHFZ9_bodycomp', 'NoStud5_bodycomp', 'NoStud7_bodycomp', 'NoStud9_bodycomp'],
                                          axis = 1)

Physfit_df_2_comb.head()

#Physfit_df_2.head()
#print Physfit_df_2_comb.columns
#print Physfit_df_2_comb.shape
#Physfit_df_2b.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,CO,ChrtNum,DIST,Level_Number,Perc5HFZ_aerobic,Perc5NI_aerobic,Perc5NI_HR_aerobic,Perc7HFZ_aerobic,Perc7NI_aerobic,Perc7NI_HR_aerobic,...,Year,Perc5HFZ_bodycomp,Perc5NI_bodycomp,Perc5NI_HR_bodycomp,Perc7HFZ_bodycomp,Perc7NI_bodycomp,Perc7NI_HR_bodycomp,Perc9HFZ_bodycomp,Perc9NI_bodycomp,Perc9NI_HR_bodycomp
0,10.0,0,73965.0,1.0,66.1,25.8,8.1,0.0,0.0,0.0,...,2014,58.9,21.0,20.1,0.0,0.0,0.0,0.0,0.0,0.0
1,10.0,0,73965.0,1.0,0.0,0.0,0.0,78.0,11.9,10.1,...,2014,0.0,0.0,0.0,65.6,16.3,18.1,0.0,0.0,0.0
2,10.0,0,73999.0,1.0,0.0,0.0,0.0,**,**,**,...,2014,0.0,0.0,0.0,**,**,**,0.0,0.0,0.0
3,10.0,0,73999.0,1.0,41.5,42.4,16.1,0.0,0.0,0.0,...,2014,39.8,22.9,37.3,0.0,0.0,0.0,0.0,0.0,0.0
4,10.0,0,73999.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2014,0.0,0.0,0.0,0.0,0.0,0.0,**,**,**


# Academic Test Data

In [77]:
#initialize lists to hold filepaths
academic_files_list = []
academic_entities_files_list = []

#Walk the data directory and get all filepaths
for root, dirs, files in os.walk(academic_datapath):
    for filename in files:
        #Get full list of filepaths to the academic test files
        if filename[7:10] == "all":    
            academic_files_list.append(academic_datapath + "/" + filename[2:6] + "/" + filename)

In [78]:
academic_files_list

['/Users/nwchen24/Desktop/UC_Berkeley/w209_Data_Viz/final_project_data/Test_Results/2014/ca2014_all_csv_v2.txt',
 '/Users/nwchen24/Desktop/UC_Berkeley/w209_Data_Viz/final_project_data/Test_Results/2015/ca2015_all_csv_v3.txt',
 '/Users/nwchen24/Desktop/UC_Berkeley/w209_Data_Viz/final_project_data/Test_Results/2016/ca2016_all_csv_v3.txt']

In [86]:
#get list of all columns in the file from each year
academic_col_list = []

#read academic test files and append column names to the list
for filepath in academic_files_list:
    #read the file
    temp_df = pd.read_csv(filepath)
    #print the year and shape
    print(filepath[89:93])
    print(temp_df.shape)
    #get the columns
    temp_col_list = temp_df.columns
    print(temp_col_list)
    #add columns not already encountered to the column list
    for colname in temp_col_list:
        if colname not in academic_col_list:
            academic_col_list.append(colname)

print(len(academic_col_list))

2014
(1177424, 21)
Index([u'County Code', u'District Code', u'School Code', u'Charter Number',
       u'Test Year', u'Subgroup ID', u'Test Type', u'CAPA Assessment Level',
       u'Total Tested At Entity Level', u'Total Tested At Subgroup Level',
       u'Grade', u'Test Id', u'Students Tested', u'Mean Scale Score',
       u'Percentage Advanced', u'Percentage Proficient',
       u'Percentage At Or Above Proficient', u'Percentage Basic',
       u'Percentage Below Basic', u'Percentage Far Below Basic',
       u'Students with Scores'],
      dtype='object')
2015
(727771, 21)
Index([u'County Code', u'District Code', u'School Code', u'filler',
       u'Test Year', u'Subgroup ID', u'Test Type', u'CAPA Assessment Level',
       u'Total Tested At Entity Level', u'Total Tested At Subgroup Level',
       u'Grade', u'Test Id', u'Students Tested', u'Mean Scale Score',
       u'Percentage Advanced', u'Percentage Proficient',
       u'Percentage At Or Above Proficient', u'Percentage Basic',
       u'

In [93]:
#academic_col_list.remove('filler')
academic_col_list.remove('CAPA Assessment Level')
academic_col_list.remove('CAPA Science Assessment Level')
academic_col_list.remove('Total CAASPP Enrollment')
academic_col_list.remove('Total Students with Scores')
academic_col_list.remove('Students Tested')
academic_col_list.remove('Total Tested At Subgroup Level')
academic_col_list.remove('Total Tested At Entity Level')
academic_col_list.remove('Percentage Advanced')
academic_col_list.remove('Percentage Proficient')
academic_col_list.remove('Percentage At Or Above Proficient')
academic_col_list.remove('Percentage Basic')
academic_col_list.remove('Percentage Below Basic')
academic_col_list.remove('Percentage Far Below Basic')
academic_col_list.remove('Students with Scores')
academic_col_list

['County Code',
 'District Code',
 'School Code',
 'Charter Number',
 'Test Year',
 'Subgroup ID',
 'Test Type',
 'Grade',
 'Test Id',
 'Mean Scale Score']

In [122]:
#collect one dataframe per year, then concatenate them
academic_frames = []

#read academic test files
for filepath in academic_files_list:
    temp_df = pd.read_csv(filepath)
    academic_frames.append(temp_df)
    print(filepath[89:93] + " Read Successfully")

academic_df = pd.concat(academic_frames, ignore_index = True)

#Keep only the columns that we want
academic_df = academic_df.drop(['CAPA Assessment Level', 'CAPA Science Assessment Level', 'Total CAASPP Enrollment',\
                               'Total Students with Scores', 'Students Tested', 'Total Tested At Subgroup Level',\
                               'Total Tested At Entity Level', 'Percentage Advanced', 'Percentage Proficient',\
                               'Percentage At Or Above Proficient', 'Percentage Basic', 'Percentage Below Basic',\
                               'Percentage Far Below Basic', 'Students with Scores', 'filler'], axis = 1)

#replace spaces in column names with underscores
academic_df.columns = [c.replace(' ', '_') for c in academic_df.columns]

2014 Read Successfully
2015 Read Successfully
2016 Read Successfully


In [178]:
#Subset the academic data to only the observations we're interested in.
#See subgroup ID mappings here: http://caaspp.cde.ca.gov/caaspp2015/research_fixfileformat.aspx
#We want all students, and the following subgroups individually:
#male, female, black, american indian, asian, filipino, hispanic, hawaiian, white, multiracial, economically disadvantaged, not economically disadvantaged, and no economic info

#instantiate dict to map subgroup IDs to the groups we're interested in
subgroup_label_dict = {}

subgroup_label_dict[1] = 'All'
subgroup_label_dict[4] = 'Female'
subgroup_label_dict[3] = 'Male'
subgroup_label_dict[74] = 'Black'
subgroup_label_dict[75] = 'American_Indian'
subgroup_label_dict[76] = 'Asian'
subgroup_label_dict[77] = 'Filipino'
subgroup_label_dict[78] = 'Hispanic'
subgroup_label_dict[79] = 'Hawaiian'
subgroup_label_dict[80] = 'White'
subgroup_label_dict[144] = 'Multiracial'
subgroup_label_dict[31] = 'Economic_disadv'
subgroup_label_dict[111] = 'NOT_economic_disadv'

#Keep only observations for the subgroups we're interested in
academic_df_2 = academic_df.loc[academic_df.Subgroup_ID.isin([1.0, 4.0, 3.0, 74.0, 75.0, 76.0, 77.0, 78.0, 79.0, 80.0, 144.0, 31.0, 111.0])]

#Keep only grades 5, 7, and 9
academic_df_2 = academic_df_2.loc[academic_df_2.Grade.isin([5.0, 7.0, 9.0])]

#convert mean scale score to numeric; values that cannot be parsed become NaN
academic_df_2['Mean_Scale_Score'] = pd.to_numeric(academic_df_2['Mean_Scale_Score'], errors = 'coerce')

#Get average score for these three grades for each school
academic_df_2 = academic_df_2.drop(['Grade', 'Test_Id', 'Test_Type'], axis = 1)
academic_df_2 = academic_df_2.groupby(['Charter_Number', 'County_Code', 'District_Code', 'School_Code', 'Subgroup_ID', 'Test_Year'], as_index=False).mean()

#relabel subgroup IDs
academic_df_2 = academic_df_2.replace({"Subgroup_ID": subgroup_label_dict})

#Mean academic test scores now ready to merge with fitnessgram data
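
One thing that may be worth checking in the grouping step above: the 2015 file lists a `filler` column where the 2014 file has `Charter Number`, so `Charter_Number` could be missing for the later years. By default pandas `groupby` excludes rows whose grouping key is NaN. The sketch below uses made-up toy rows (not project data) to show the effect; whether it actually explains any lost rows here is an assumption to verify against the real files.

```python
import numpy as np
import pandas as pd

# Toy rows: the second school has no charter number recorded.
toy = pd.DataFrame({
    'Charter_Number': [0.0, np.nan, np.nan],
    'School_Code': [101.0, 202.0, 202.0],
    'Mean_Scale_Score': [350.0, 400.0, 420.0],
})

# Grouping on a key that contains NaN drops those rows from the result.
with_charter = toy.groupby(['Charter_Number', 'School_Code'], as_index=False).mean()

# Leaving the sparse key out of the grouping keeps every school.
without_charter = toy.groupby(['School_Code'], as_index=False)['Mean_Scale_Score'].mean()

print(len(with_charter))
print(len(without_charter))
```

If the real data behaves this way, dropping `Charter_Number` from the grouping keys (it is not one of the merge keys) would be one option to try.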

In [179]:
Physfit_df_2_comb.columns


Index([u'County_Code', u'Charter_Number', u'District_Code', u'Level_Number',
       u'Perc5HFZ_aerobic', u'Perc5NI_aerobic', u'Perc5NI_HR_aerobic',
       u'Perc7HFZ_aerobic', u'Perc7NI_aerobic', u'Perc7NI_HR_aerobic',
       u'Perc9HFZ_aerobic', u'Perc9NI_aerobic', u'Perc9NI_HR_aerobic',
       u'Subgroup', u'School_Code', u'Year', u'Perc5HFZ_bodycomp',
       u'Perc5NI_bodycomp', u'Perc5NI_HR_bodycomp', u'Perc7HFZ_bodycomp',
       u'Perc7NI_bodycomp', u'Perc7NI_HR_bodycomp', u'Perc9HFZ_bodycomp',
       u'Perc9NI_bodycomp', u'Perc9NI_HR_bodycomp'],
      dtype='object')

# Merge Academic Results with Fitnessgram

In [181]:
#rename columns in both datasets so the merge keys share the same names
Physfit_col_mapping = {}
Physfit_col_mapping['CO'] = 'County_Code'
Physfit_col_mapping['ChrtNum'] = 'Charter_Number'
Physfit_col_mapping['DIST'] = 'District_Code'
Physfit_col_mapping['SCHL'] = 'School_Code'
Physfit_col_mapping['Report_Number'] = 'Subgroup'

academic_col_mapping = {}
academic_col_mapping['Test_Year'] = 'Year'
academic_col_mapping['Subgroup_ID'] = 'Subgroup'
academic_col_mapping['Mean_Scale_Score'] = 'Mean_Academic_Test_Score'

academic_df_2 = academic_df_2.rename(columns = academic_col_mapping)
Physfit_df_2_comb = Physfit_df_2_comb.rename(columns = Physfit_col_mapping)

#convert year to numeric in fitnessgram dataset
Physfit_df_2_comb['Year'] = pd.to_numeric(Physfit_df_2_comb['Year'], errors = 'coerce')

#delete table number from fitnessgram dataset (only one value)
#del Physfit_df_2_comb['Table_Number']

#Keep only school level observations
Physfit_df_2_comb = Physfit_df_2_comb.loc[Physfit_df_2_comb.Level_Number == 1]
#academic_df_2 = academic_df_2.loc[academic_df_2.School_Code > 100]


In [182]:
academic_df_2.head()
academic_df_2.dtypes

Charter_Number              float64
County_Code                 float64
District_Code               float64
School_Code                 float64
Subgroup                     object
Year                        float64
Mean_Academic_Test_Score    float64
dtype: object

In [183]:
Physfit_df_2_comb.head()
Physfit_df_2_comb.dtypes



County_Code            float64
Charter_Number          object
District_Code          float64
Level_Number           float64
Perc5HFZ_aerobic        object
Perc5NI_aerobic         object
Perc5NI_HR_aerobic      object
Perc7HFZ_aerobic        object
Perc7NI_aerobic         object
Perc7NI_HR_aerobic      object
Perc9HFZ_aerobic        object
Perc9NI_aerobic         object
Perc9NI_HR_aerobic      object
Subgroup                object
School_Code            float64
Year                     int64
Perc5HFZ_bodycomp       object
Perc5NI_bodycomp        object
Perc5NI_HR_bodycomp     object
Perc7HFZ_bodycomp       object
Perc7NI_bodycomp        object
Perc7NI_HR_bodycomp     object
Perc9HFZ_bodycomp       object
Perc9NI_bodycomp        object
Perc9NI_HR_bodycomp     object
dtype: object
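
The percentage columns above are `object` dtype, most likely because some cells hold `**` instead of a number (as seen in the earlier `head()` output). A minimal sketch of converting them to floats follows, assuming `**` marks a suppressed value that can be treated as missing. The toy frame is illustrative only.

```python
import pandas as pd

# Toy stand-in for a few FitnessGram percentage columns.
toy = pd.DataFrame({
    'Perc7HFZ_aerobic': ['78.0', '**', '0.0'],
    'Perc7NI_aerobic': ['11.9', '**', '0.0'],
})

# Convert every percentage column; anything non-numeric (such as '**') becomes NaN.
perc_cols = [c for c in toy.columns if c.startswith('Perc')]
for col in perc_cols:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

print(toy.dtypes)
```

The same loop could be pointed at `Physfit_df_2_comb` once it is clear how suppressed values should be handled in the visualization.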

In [190]:
#Merge
print(Physfit_df_2_comb.shape)
print(academic_df_2.shape)

combined_DF = pd.merge(left = Physfit_df_2_comb, right = academic_df_2, how = 'inner', on = ['County_Code', 'District_Code', 'School_Code', 'Subgroup', 'Year'])
print(combined_DF.shape)




(376819, 25)
(69982, 7)


# MERGE TO DO  

There are a lot of observations in the fitnessgram dataset that do not show up in the academic test results data. We need to figure out why there are so many missing observations. The cells below are intended to investigate the observations in the fitnessgram dataset that do not appear in the test results data.
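
One way to approach this investigation is the `indicator` option of `pd.merge`, which adds a `_merge` column recording whether each row was found in the left frame only, the right frame only, or both. The sketch below uses small toy frames standing in for `Physfit_df_2_comb` and `academic_df_2`; the values are illustrative, not real results.

```python
import pandas as pd

keys = ['County_Code', 'District_Code', 'School_Code', 'Subgroup', 'Year']

# Toy stand-ins for the FitnessGram and academic frames.
fitness = pd.DataFrame({
    'County_Code': [10.0, 10.0], 'District_Code': [73965.0, 73999.0],
    'School_Code': [6120539.0, 1033430.0], 'Subgroup': ['All', 'All'],
    'Year': [2014, 2014], 'Perc7HFZ_aerobic': [78.0, 72.5],
})
academic = pd.DataFrame({
    'County_Code': [10.0], 'District_Code': [73965.0],
    'School_Code': [6120539.0], 'Subgroup': ['All'],
    'Year': [2014], 'Mean_Academic_Test_Score': [352.1],
})

# Left merge keeps every fitness row and labels where each one came from.
checked = pd.merge(fitness, academic, how='left', on=keys, indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only']

# Counting unmatched rows by year and subgroup can show where the gaps concentrate.
print(unmatched.groupby(['Year', 'Subgroup']).size())
```

Filtering on `_merge` avoids flagging rows that matched but happen to contain a missing value in some other column.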

In [None]:
pd.options.display.max_columns = 100

unmerged_df = pd.merge(left = Physfit_df_2_comb, right = academic_df_2, how = 'left', on = ['County_Code', 'District_Code', 'School_Code', 'Subgroup', 'Year'])

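#keep the rows with a missing value in any column; unmatched fitnessgram rows have null academic columns
unmerged_df = unmerged_df.loc[unmerged_df.isnull().any(axis=1)]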


In [191]:
unmerged_df.head()

Unnamed: 0,County_Code,Charter_Number_x,District_Code,Level_Number,Perc5HFZ_aerobic,Perc5NI_aerobic,Perc5NI_HR_aerobic,Perc7HFZ_aerobic,Perc7NI_aerobic,Perc7NI_HR_aerobic,Perc9HFZ_aerobic,Perc9NI_aerobic,Perc9NI_HR_aerobic,Subgroup,School_Code,Year,Perc5HFZ_bodycomp,Perc5NI_bodycomp,Perc5NI_HR_bodycomp,Perc7HFZ_bodycomp,Perc7NI_bodycomp,Perc7NI_HR_bodycomp,Perc9HFZ_bodycomp,Perc9NI_bodycomp,Perc9NI_HR_bodycomp,Charter_Number_y,Mean_Academic_Test_Score
1,10.0,0,73965.0,1.0,0.0,0.0,0.0,78.0,11.9,10.1,0.0,0.0,0.0,All,6120539.0,2014,0.0,0.0,0.0,65.6,16.3,18.1,0.0,0.0,0.0,,
2,10.0,0,73999.0,1.0,0.0,0.0,0.0,**,**,**,0.0,0.0,0.0,All,1.0,2014,0.0,0.0,0.0,**,**,**,0.0,0.0,0.0,,
4,10.0,0,73999.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,**,**,**,All,1033422.0,2014,0.0,0.0,0.0,0.0,0.0,0.0,**,**,**,,
5,10.0,0,73999.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,72.5,20.3,7.2,All,1033430.0,2014,0.0,0.0,0.0,0.0,0.0,0.0,61.2,24.3,14.5,,
7,10.0,0,73999.0,1.0,0.0,0.0,0.0,77.3,13.5,9.2,0.0,0.0,0.0,All,6006696.0,2014,0.0,0.0,0.0,54.1,18.6,27.3,0.0,0.0,0.0,,


In [87]:
#Save serialized version to file
Physfit_df.to_pickle(combined_datapath)

## Physical Fitness Data Description

There appears to have been a change in reporting procedure in 2012. Starting in 2012, for each of grades 5, 7, and 9, the percentage of students not in the healthy fitness zone is split between 'Needs Improvement' and 'High Risk'. We will want to determine whether the cutoff for the healthy fitness zone stayed the same after this reporting change was implemented.

Report_Number (and possibly report type) identifies the group covered by an observation (e.g. all students, male students, female students, black students, white students).

Line_Number and Line_Text identify the fitness measure being reported (for example, Line_Text is "Aerobic Capacity" or "Body Composition" in the rows kept above).