# Just for fun: Automated QA example for scaling to larger datasets

**Input:** SAT.txt (Raw Data)

**Output:** Targeted_QA_Report_Table.txt with Error Type column (**Note:** Hypothetical output; see dataframe below for example)

## This is just an example for TOTAL errors and instances of Null values

**Note:** This was done quickly, just to show that the QA could be automated to generate a report that:
1. Avoids the data loss that comes from simply dropping records where issues/errors arise.
2. Targets only the subset of records that have issues.
3. Can be used by an analyst to follow up on the data integrity issues by:
    - Calling schools or the DOE to acquire accurate data and/or assess the issue(s).
    - Backtracking to the initial data extraction to identify potential issues.
    - Getting creative and resourceful in identifying other avenues that help create a robust, final dataset.
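
As a rough illustration of the first point, the sketch below contrasts dropping records with flagging them. The three-row `toy_df` and the `HAS_NULL` column are made up for this example and are not part of SAT.txt.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table: the middle row is missing its DIST_CODE.
toy_df = pd.DataFrame({
    'DIST_CODE': [120.0, np.nan, 5410.0],
    'TOTAL': [1502, 1199, 1491],
})

# Dropping rows with nulls silently discards the middle record.
dropped = toy_df.dropna()

# Flagging keeps every record and marks the ones that need follow-up.
flagged = toy_df.copy()
flagged['HAS_NULL'] = flagged.isnull().any(axis=1)

print(len(dropped), 'rows survive dropna()')
print(flagged[flagged['HAS_NULL']])
```

The flagged version still holds all three rows, so the problem record can be handed to an analyst instead of disappearing.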


In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Import tab-delimited text file as dataframe (df)
sat_df = pd.read_table('SAT.txt', sep='\t')

# View a sample of rows from SAT dataframe to be sure the data read in correctly
sat_df.sample(5)

Unnamed: 0,CO_CODE,DIST_CODE,SCH_CODE,TOTAL,MATHEMATICS,CRITICAL_READING,WRITING,SAT_1550,N_STUDENTS_SCORED
269,27,3385.0,50,1410,557,540,541,63.3,903
1,1,120.0,10,1502,515,506,481,41.7,660
53,3,5410.0,30,1491,519,488,484,37.2,409
379,41,4100.0,50,1543,529,513,501,44.1,782
363,39,4290.0,50,1347,466,445,436,21.9,350


In [3]:
# Set up shorthand variables for the three score columns.
math = sat_df['MATHEMATICS']
reading = sat_df['CRITICAL_READING']
writing = sat_df['WRITING']

In [4]:
def create_QA_report(dataframe):
    '''Find records with errors and return them in a table that holds the original
    SAT data along with a column named ERROR_TYPE. The table can be used for further
    exploration of problematic data regarding school performance metrics.'''
    
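    # Add together the avg. scores from the 3 subject columns and populate new column NEW_TOTAL.
    # Use the columns of the dataframe passed in, not variables defined outside the function.
    dataframe['NEW_TOTAL'] = (dataframe['MATHEMATICS'] + dataframe['CRITICAL_READING']
                              + dataframe['WRITING'])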
    
    # Add column ERROR_TYPE and use conditional statement to populate with type of error
    dataframe['ERROR_TYPE'] = np.where(dataframe['TOTAL'] != dataframe['NEW_TOTAL'], 
                                 'Totals Differ', 'None')
    
    # Select only the records that had errors from 'totals' and store in dataframe
    total_errors = dataframe.loc[dataframe['ERROR_TYPE'] == 'Totals Differ']
    
    # Select the records that contain any Null value and label them 'Null Present'
    null_errors = dataframe[dataframe.isnull().any(axis=1)].copy()
    null_errors['ERROR_TYPE'] = 'Null Present'
    
    # Create list of dataframes with reported errors and then concatenate them by stacking
    dataframes = [total_errors, null_errors]
    qa_report_table = pd.concat(dataframes)
    
    # Reorder columns to show the ERROR_TYPE, TOTAL and NEW_TOTAL columns first
    qa_report_table = qa_report_table[['ERROR_TYPE', 'TOTAL', 'NEW_TOTAL',
                                       'MATHEMATICS', 'CRITICAL_READING', 'WRITING', 
                                       'CO_CODE', 'DIST_CODE', 'SCH_CODE', 
                                       'SAT_1550', 'N_STUDENTS_SCORED']]
    return qa_report_table
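
Because the report stacks the two error subsets, a record that has both a mismatched total and a Null value would appear twice. A possible variant, sketched below, builds one boolean mask per check and labels each record once. The function name `label_all_errors`, the combined label text, and the rows in `toy_df` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def label_all_errors(dataframe):
    """Hypothetical variant: one row per problem record, naming every error found."""
    report = dataframe.copy()  # leave the caller's dataframe untouched
    report['NEW_TOTAL'] = (report['MATHEMATICS'] + report['CRITICAL_READING']
                           + report['WRITING'])

    # One boolean mask per check.
    totals_differ = report['TOTAL'] != report['NEW_TOTAL']
    null_present = dataframe.isnull().any(axis=1)

    # np.select uses the first condition that matches, so the combined case goes first.
    report['ERROR_TYPE'] = np.select(
        [totals_differ & null_present, totals_differ, null_present],
        ['Totals Differ; Null Present', 'Totals Differ', 'Null Present'],
        default='None')
    return report[report['ERROR_TYPE'] != 'None']

# Made-up rows covering each case: clean, bad total, Null, and both.
toy_df = pd.DataFrame({
    'DIST_CODE': [120.0, 800.0, np.nan, np.nan],
    'TOTAL': [1500, 1400, 1500, 1400],
    'MATHEMATICS': [500, 500, 500, 500],
    'CRITICAL_READING': [500, 500, 500, 500],
    'WRITING': [500, 500, 500, 500],
})
toy_report = label_all_errors(toy_df)
print(toy_report[['ERROR_TYPE', 'TOTAL', 'NEW_TOTAL']])
```

Working on a copy also means the input dataframe does not pick up the NEW_TOTAL and ERROR_TYPE columns as a side effect.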

In [5]:
# Run function on 2013 SAT data to produce a targeted list of records with errors
create_QA_report(sat_df)

Unnamed: 0,ERROR_TYPE,TOTAL,NEW_TOTAL,MATHEMATICS,CRITICAL_READING,WRITING,CO_CODE,DIST_CODE,SCH_CODE,SAT_1550,N_STUDENTS_SCORED
88,Totals Differ,1700,1734,594,566,574,7,800.0,30,74.1,628
269,Totals Differ,1410,1638,557,540,541,27,3385.0,50,63.3,903
328,Totals Differ,1346,1402,490,452,460,35,490.0,20,20.8,674
106,Null Present,1199,1199,423,386,390,11,,20,5.5,361
107,Null Present,1380,1380,472,454,454,11,,30,28.5,32
108,Null Present,1363,1363,464,453,446,11,,50,19.4,254
109,Null Present,1411,1411,490,464,457,11,,50,28.1,127
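
The hypothetical Targeted_QA_Report_Table.txt named at the top could be produced by writing the returned table out as tab-delimited text. The sketch below is one way to do that. It uses a two-row stand-in (`qa_report_table`, with values copied from rows 88 and 106 of the output above and only a few of the columns) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in for the table returned by create_QA_report (two rows copied from the output above).
qa_report_table = pd.DataFrame(
    {
        'ERROR_TYPE': ['Totals Differ', 'Null Present'],
        'TOTAL': [1700, 1199],
        'NEW_TOTAL': [1734, 1199],
        'DIST_CODE': [800.0, np.nan],
    },
    index=[88, 106],
)

# Write a tab-delimited file, keeping the original row index so that
# each record can be traced back to SAT.txt.
qa_report_table.to_csv('Targeted_QA_Report_Table.txt', sep='\t', index=True)

# Read the file back to confirm that it round-trips.
round_trip = pd.read_table('Targeted_QA_Report_Table.txt', index_col=0)
print(round_trip)
```

Keeping the index in the file is a design choice: it lets an analyst match each flagged record to its row in the raw data.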
