# Table of Contents

This project consists of the following files, organized into the following directories:
1. ## root
    1. This Document
    1. Project_1_slides.pdf
        * Slides used during this presentation
1. ## /code
    1. starter-code.ipynb
        * The code provided with this assignment, including coding challenges and explicit responses to questions. Much of my work was initially here before being reorganized. If something is broken, there's a good chance it's because I moved it. You may be able to find a working copy here.
    1. EDA.ipynb
        * More organized and thorough exploratory data analysis
    1. sat_cleaning
        * Web scraping and initial cleaning of SAT percentile scores
    1. act_cleaning
        * Web scraping and initial cleaning of ACT percentile scores
    1. visualizations.ipynb
        * Visualizations, including those used in my presentation and some others
    1. cleaning_DFs.ipynb
        * Cleaning of provided data and further cleaning of the SAT and ACT percentile documents
1. ## /data
    1. All CSVs and Pickles used in this project, including those provided
1. ## /visualizations
    1. Scatter plots
        * 2017_act_scatterplot.png
        * 2017_sat_scatterplot.png
    1. Regression plots
        * Reg_ACT.png
        * Reg_SAT.png
    1. Choropleths
        * ACTCOMP2017.png (key: CompKey.png)
        * ACTPART2017.png (key: PartKey.png)

# Data Dictionary
|Feature|Type|Dataset|Description|
|---|---|---|---|
|state|*string*|all by-year ACT and SAT data|The US state under consideration|
|participation|*float*|all by-year ACT and SAT data|Reported portion of students taking a given test. For this analysis, it's assumed that high test participation is correlated with test preparation|
|evidence-based_reading_and_writing|*int*|all SAT data|Score between 200 and 800. Not directly considered here, but included for completeness|
|math|*int*|all SAT data|Score between 200 and 800. Not directly considered here, but included for completeness|
|composite|*float*|all|The final score on the given test. For the SAT: the sum of the previous two entries. For the ACT: the mean of the following four|
|science|*float*|ACT|Score between 1 and 36. Not directly referenced|
|math|*float*|ACT|Score between 1 and 36. Not directly referenced|
|reading|*float*|ACT|Score between 1 and 36. Not directly referenced|
|writing|*float*|ACT|Score between 1 and 36. Not directly referenced|
|code|*str*|visualization DataFrames|Two-letter state abbreviations. Required for Plotly choropleths|
|participation_score_index|*float*|EDA|The state's composite score divided by the state's participation rate (PSI). Used as a metric to explore test preparation effectiveness|
|norm|*float*|EDA|The normalized participation_score_index: (PSI - mean of PSIs) / standard deviation of PSIs|
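
The last two rows of the dictionary can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook code: the helper name `add_psi` is hypothetical, the sample rows are copied from the first ACT table in the Addenda, and the use of the sample standard deviation (pandas' default) is an assumption, since the source does not say which form was used.

```python
import pandas as pd

# Five rows copied from the first ACT table in the Addenda
df = pd.DataFrame({
    "state": ["Maine", "Rhode Island", "Delaware", "New Hampshire", "Pennsylvania"],
    "participation": [0.06, 0.12, 0.13, 0.14, 0.17],
    "composite": [24.3, 24.7, 24.1, 25.0, 23.6],
})


def add_psi(frame):
    """Return a copy with participation_score_index and norm columns added.

    add_psi is a hypothetical helper name used only for this sketch.
    """
    out = frame.copy()
    # PSI: composite score divided by participation rate
    out["participation_score_index"] = out["composite"] / out["participation"]
    psi = out["participation_score_index"]
    # norm: z-score of the PSI (assumes the sample standard deviation, ddof=1)
    out["norm"] = (psi - psi.mean()) / psi.std()
    return out


result = add_psi(df)
print(result[["state", "participation_score_index", "norm"]])
```

Because only five rows are used here, the `norm` values will differ from those computed over all states in the project data.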


# Introduction

This project is an exploratory data analysis of the effect of preparation on standardized test performance. Ultimately, it is about the options colleges have to quantify and correct for the effects of test preparation.

Assumptions:
* States with higher participation rates have more institutional support, so participation rate can be used to approximate preparation
* In the absence of an effect from participation, there will be a linear fall in mean test performance as more non-college-bound students' scores are included
* Any correlation other than an inverse linear one may be an indication of effects from preparation

# Key Findings

I found an inverse correlation between participation rate and composite score by dividing the composite score by the participation rate, giving the participation-score index (PSI). The signal was clear: the SAT and the ACT each have a different set of high-PSI states, but within each test the same states have a high PSI every year. A simple linear regression didn't yield interesting results, but there are signs that more sophisticated regression models might do better. More research is needed.
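
As a rough sketch of how the direction of that relationship can be checked, the snippet below computes a Pearson correlation with pandas. The five rows are copied from the `indices_2016_act_with_percentiles` table in the Addenda. They cover only the highest-PSI states, so the number it prints illustrates the call and is not a finding. The variable names are my own.

```python
import pandas as pd

# Five rows copied from the 'indices_2016_act_with_percentiles' table in the Addenda.
# These are only the highest-PSI states, so the result is illustrative.
sample = pd.DataFrame({
    "state": ["Maine", "New Hampshire", "Delaware", "Rhode Island", "Pennsylvania"],
    "participation": [0.08, 0.18, 0.18, 0.21, 0.23],
    "composite": [24.3, 25.5, 24.1, 24.0, 23.7],
})

# Pearson correlation coefficient: a negative value means the composite
# score tends to fall as the participation rate rises.
r = sample["participation"].corr(sample["composite"])
print(f"Pearson r = {r:.3f}")
```

Running the same call on a full by-year table (all states) would be the meaningful version of this check.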

# Addenda

## Tables

First, a bit of code to read in the data:

```python
import os

import pandas as pd

# Collect every file in data/ whose name contains "indices"
filenames = [file for file in os.listdir('data/') if 'indices' in file]

# Show each file name followed by its table (display is provided by Jupyter)
for filename in filenames:
    df = pd.read_pickle(f'data/{filename}')
    display(filename)
    display(df)
```

'indices_2018_act_with_percentiles.pkl.pkl'

|Unnamed: 0|state|participation|composite|year|rounded|percentile|participation_score_index|psi_mean|psi_std|
|---|---|---|---|---|---|---|---|---|---|
|0|Maine|0.06|24.3|2019|24|74|405.0|67.505032|68.731658|
|1|Rhode Island|0.12|24.7|2019|25|78|205.833333|67.505032|68.731658|
|2|Delaware|0.13|24.1|2019|24|74|185.384615|67.505032|68.731658|
|3|New Hampshire|0.14|25.0|2019|25|78|178.571429|67.505032|68.731658|
|4|Pennsylvania|0.17|23.6|2019|24|74|138.823529|67.505032|68.731658|


'indices_2016_with_percentiles.pkl.pkl'

|Unnamed: 0|state|participation|composite|year|rounded|percentile|PERCENTILE|participation_score_index|psi_mean|psi_std|
|---|---|---|---|---|---|---|---|---|---|---|
|0|Maine|0.08|24.3|2017|24|0|8.0|303.75|51.48321|48.33932|
|1|New Hampshire|0.18|25.5|2017|26|0|2.0|141.666667|51.48321|48.33932|
|2|Delaware|0.18|24.1|2017|24|0|8.0|133.888889|51.48321|48.33932|
|3|Rhode Island|0.21|24.0|2017|24|0|8.0|114.285714|51.48321|48.33932|
|4|Pennsylvania|0.23|23.7|2017|24|0|8.0|103.043478|51.48321|48.33932|


'indices_2016_act_with_percentiles.pkl.pkl'

|Unnamed: 0|state|participation|composite|year|rounded|percentile|participation_score_index|psi_mean|psi_std|
|---|---|---|---|---|---|---|---|---|---|
|0|Maine|0.08|24.3|2017|24|74|303.75|53.004333|49.410601|
|1|New Hampshire|0.18|25.5|2017|26|83|141.666667|53.004333|49.410601|
|2|Delaware|0.18|24.1|2017|24|74|133.888889|53.004333|49.410601|
|3|Rhode Island|0.21|24.0|2017|24|74|114.285714|53.004333|49.410601|
|4|Pennsylvania|0.23|23.7|2017|24|74|103.043478|53.004333|49.410601|


'indices_2017_act_with_percentiles.pkl.pkl'

|Unnamed: 0|state|participation|composite|year|rounded|percentile|participation_score_index|psi_mean|psi_std|
|---|---|---|---|---|---|---|---|---|---|
|0|Maine|0.07|24.0|2018|24|73|342.857143|64.138941|68.008284|
|1|Maine|0.07|24.0|2018|24|73|342.857143|64.138941|68.008284|
|2|Rhode Island|0.15|24.2|2018|24|73|161.333333|64.138941|68.008284|
|3|New Hampshire|0.16|25.1|2018|25|78|156.875|64.138941|68.008284|
|4|Delaware|0.17|23.8|2018|24|73|140.0|64.138941|68.008284|


'indices_2019_sat_with_percentile.pkl'

|Unnamed: 0|state|participation|evidence-based_reading_and_writing|math|composite|year|rounded|code|percentiles|participation_score_index|psi_mean|psi_std|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|North Dakota|0.02|640|643|1283|2018|1280|ND|86|64150.0|12816.416479|16403.816494|
|1|Wisconsin|0.03|641|653|1294|2018|1290|WI|87|43133.333333|12816.416479|16403.816494|
|2|Iowa|0.03|634|631|1265|2018|1260|IA|83|42166.666667|12816.416479|16403.816494|
|3|Wyoming|0.03|633|625|1257|2018|1260|WY|83|41900.0|12816.416479|16403.816494|
|4|Nebraska|0.03|629|623|1252|2018|1250|NE|82|41733.333333|12816.416479|16403.816494|


'indices_2017_sat_with_percentile.pkl'

|Unnamed: 0|state|participation|evidence-based_reading_and_writing|math|composite|year|rounded|code|percentiles|participation_score_index|psi_mean|psi_std|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|North Dakota|0.02|640|643|1283|2018|1280|ND|86|64150.0|12816.416479|16403.816494|
|1|Wisconsin|0.03|641|653|1294|2018|1290|WI|87|43133.333333|12816.416479|16403.816494|
|2|Iowa|0.03|634|631|1265|2018|1260|IA|83|42166.666667|12816.416479|16403.816494|
|3|Wyoming|0.03|633|625|1257|2018|1260|WY|83|41900.0|12816.416479|16403.816494|
|4|Nebraska|0.03|629|623|1252|2018|1250|NE|82|41733.333333|12816.416479|16403.816494|


'indices_2018_sat_with_percentile.pkl'

|Unnamed: 0|state|participation|evidence-based_reading_and_writing|math|composite|year|rounded|percentile|participation_score_index|psi_mean|psi_std|
|---|---|---|---|---|---|---|---|---|---|---|---|
|0|North Dakota|0.02|627|636|1263|2019|1260.0|82|63150.0|12278.27121|16285.779395|
|1|Wisconsin|0.03|635|648|1283|2019|1280.0|84|42766.666667|12278.27121|16285.779395|
|2|South Dakota|0.03|633|635|1268|2019|1270.0|83|42266.666667|12278.27121|16285.779395|
|3|Nebraska|0.03|628|631|1260|2019|1260.0|82|42000.0|12278.27121|16285.779395|
|4|Iowa|0.03|622|622|1244|2019|1240.0|80|41466.666667|12278.27121|16285.779395|


'indices_2017_with_percentiles.pkl.pkl'

|Unnamed: 0|state|participation|composite|year|rounded|percentile|PERCENTILE|participation_score_index|psi_mean|psi_std|
|---|---|---|---|---|---|---|---|---|---|---|
|0|Maine|0.07|24.0|2018|24|0|7.0|342.857143|64.138941|68.008284|
|1|Maine|0.07|24.0|2018|24|0|7.0|342.857143|64.138941|68.008284|
|2|Rhode Island|0.15|24.2|2018|24|0|7.0|161.333333|64.138941|68.008284|
|3|New Hampshire|0.16|25.1|2018|25|0|3.0|156.875|64.138941|68.008284|
|4|Delaware|0.17|23.8|2018|24|0|7.0|140.0|64.138941|68.008284|


These are the final tables, displaying the states with the highest PSI in each file.

# Notes

I consider this to be a work in progress. Here are some things I'd like to add:
* Better organization of the data folder
* More choropleths
* More types of graphs
* A dashboard
* More complex regression models. For the record, it looks to me like a log function is going to be the best fit
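
One way the last item could be explored is to fit both a straight line and a log curve and compare their R² values. The sketch below does this with `numpy.polyfit`. It is not part of the notebooks: the function names `r_squared` and `compare_fits` are hypothetical, and the demo data is synthetic, generated from an exact log curve purely to show that the comparison behaves as expected. It says nothing about which model fits the real data.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination for predictions y_hat against y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def compare_fits(participation, composite):
    """Return (R² of linear fit, R² of log fit). Hypothetical helper.

    participation must be strictly positive, since the log model uses ln(x).
    """
    x = np.asarray(participation, dtype=float)
    y = np.asarray(composite, dtype=float)

    # Linear model: y = m * x + b
    m, b = np.polyfit(x, y, 1)
    r2_linear = r_squared(y, m * x + b)

    # Log model: y = k * ln(x) + c, fitted as a line in ln(x)
    k, c = np.polyfit(np.log(x), y, 1)
    r2_log = r_squared(y, k * np.log(x) + c)

    return r2_linear, r2_log


# SYNTHETIC demo data (not project data): an exact log relationship
x_demo = np.linspace(0.02, 1.0, 50)
y_demo = 20.0 - 1.5 * np.log(x_demo)

r2_linear, r2_log = compare_fits(x_demo, y_demo)
print(f"linear R^2 = {r2_linear:.3f}, log R^2 = {r2_log:.3f}")
```

To try it on the project data, pass the `participation` and `composite` columns of one of the by-year tables to `compare_fits`.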

All notebooks run on my computer and should run on yours. Since the pickles have already been generated, it shouldn't matter in what order they're run.

# Acknowledgements and works cited:
* The GeeksforGeeks website, Stack Exchange, and the pandas reference documentation (https://pandas.pydata.org/docs/reference) were indispensable for general background
* Ebdolabi, Mehran. "Why the Test Preparation Industry May Finally Get Out of the Classroom." Forbes.com. https://www.forbes.com/sites/forbestechcouncil/2020/04/29/why-the-test-preparation-industry-may-finally-get-out-of-the-classroom/
* Brad Soloman, Stack Overflow comment: https://stackoverflow.com/questions/46915495/normalization-vs-numpy-way-to-normalize
* SAT Percentiles: https://blog.prepscholar.com/historical-percentiles-new-sat
* ACT Percentiles: https://blog.prepscholar.com/historical-act-percentiles-2020-2019-2018-2017-2016