# Preprocessing

In [1]:
import pandas as pd
import numpy as np
import sys

### Step 1
Write a function that processes a single .txt file. It must:
* drop rows that do not contain a unique school identifier.
* drop rows that correspond to elementary/middle school education. We are focusing on high school data.

In [30]:
def read_text_file(textfilepath):
    """
    Input: textfilepath, a path to the text file to be generated to a datafrmae
    Output: Pandas DataFrame corresponding to input text file
    """
    df = pd.read_csv(textfilepath, sep="\t", encoding="ISO-8859-1")
    return df 

In [160]:
df1 = read_text_file("./txt_files/absenteeism_txt_files/2016-17_ChronAbsenteeism.txt")
df2 = read_text_file("./txt_files/absenteeism_txt_files/2017-18_ChronAbsenteeism.txt")

In [161]:
def drop_rows(df):
    """
    Input: Pandas DataFrame
    Output: Pandas DataFrame with (a) rows with no school code and (b) rows corresponding to elementary/
            middle school education removed
    """
    rows_to_drop = []
    #drop rows that do not have unique school code
    for i, code in enumerate(df["SchoolCode"]):
        if pd.isnull(code) or code == 0:
            rows_to_drop.append(i)
    #drop rows that correspond to elementary and middle school data
    for i, reporting_category in enumerate(df["ReportingCategory"]):
        if reporting_category in ["GRKN", "GRK", "GR13", "GR46", "GR78", "GRK8", "GRUG"]:
            rows_to_drop.append(i)
    df = df.drop(rows_to_drop,axis=0)
    return df

In [162]:
df1 = drop_rows(df1)
df2 = drop_rows(df2)

### Step 2
Write a function that takes generates a single DataFrame given multiple dataframes from different time periods of the same category. The resultant DataFrame should organize each school's data in chronological order.

In [165]:
def combine_single_category_dataframes(dfs):
    """
    Input: a list of DataFrames to be combined. It is *imperative* that the list of DataFrames be
           given in chronological order. (i.e. a dataset for 2016-2017 is before 2017-2018)
    Output: A single DataFrame that combines information from the input DataFrames. It organizes
            each school's data in chronological order
    """
    #generate list of school codes captured across all dataframes
    school_codes = []
    for df in dfs:
        school_codes.extend(df["SchoolCode"].unique())
    school_codes = [int(school_code) for school_code in school_codes]
    unique_school_codes = list(set(school_codes))

    #convert academic year to integers (so comparisons between year can produce chronological ordering)
    for df in dfs:
        if isinstance(df["AcademicYear"][df.index[0]],str):
            integer_years = []
            for year in df["AcademicYear"]:
                integer_years.append(int(year[0:4]))
            df["AcademicYear"] = integer_years
        
    #iterate over school codes, pull rows associated with that school code from each dataframe
    for desired_code in unique_school_codes:
        associated_rows = {}
        for i,df in enumerate(dfs):
            year = df["AcademicYear"][df.index[0]]
            associated_rows[year] = []
            for j,code in enumerate(df["SchoolCode"]):
                if code == desired_code:
                    associated_rows[year].append(j)
        #sort associated rows by date 
        sorted_years = sorted(associated_rows.keys())
        new_df = pd.DataFrame()
        for i,year in enumerate(sorted_years):
            new_df = new_df.append(dfs[i].iloc[associated_rows[year]])
    return new_df

In [166]:
new_df = combine_single_category_dataframes([df1,df2])

In [167]:
new_df

Unnamed: 0,AcademicYear,AggregateLevel,CountyCode,DistrictCode,SchoolCode,CountyName,DistrictName,SchoolName,CharterYN,ReportingCategory,ChronicAbsenteeismEligibleCumula,ChronicAbsenteeismCount,ChronicAbsenteeismRate
194133,2016,S,38,68478.0,6062079.0,San Francisco,San Francisco Unified,Presidio Middle,All,GF,558.0,0.0,0.0
194250,2016,S,38,68478.0,6062079.0,San Francisco,San Francisco Unified,Presidio Middle,All,GM,596.0,0.0,0.0
194752,2016,S,38,68478.0,6062079.0,San Francisco,San Francisco Unified,Presidio Middle,All,RA,553.0,0.0,0.0
194867,2016,S,38,68478.0,6062079.0,San Francisco,San Francisco Unified,Presidio Middle,All,RB,61.0,0.0,0.0
194980,2016,S,38,68478.0,6062079.0,San Francisco,San Francisco Unified,Presidio Middle,All,RD,71.0,0.0,0.0
195096,2016,S,38,68478.0,6062079.0,San Francisco,San Francisco Unified,Presidio Middle,All,RF,34.0,0.0,0.0
195210,2016,S,38,68478.0,6062079.0,San Francisco,San Francisco Unified,Presidio Middle,All,RH,125.0,0.0,0.0
195299,2016,S,38,68478.0,6062079.0,San Francisco,San Francisco Unified,Presidio Middle,All,RI,,,
195390,2016,S,38,68478.0,6062079.0,San Francisco,San Francisco Unified,Presidio Middle,All,RP,,,
195502,2016,S,38,68478.0,6062079.0,San Francisco,San Francisco Unified,Presidio Middle,All,RT,58.0,0.0,0.0


The code in the above cell can be optimized. It has to search through every dataframe for every school code. Instead, keep track of a dictionary for each dataframe with keys being school codes and values being associated rows.

### Step 3
Write a function that combines multiple DataFrames that were generated in Step 2.  