Fetching contributors…
Cannot retrieve contributors at this time
182 lines (139 sloc) 9.61 KB

## Statistical Analysis of Participation in Standardized Tests [view code]

The code is available here or by clicking on the [view code] link above.

## Overview

Suppose that the College Board, a nonprofit organization responsible for administering the SAT (Scholastic Aptitude Test), seeks to increase the rate of high-school graduates who participate in its exams. This project's aim is to make recommendations about which measures the College Board might take in order to achieve that.

## Problem Statement

The problem we need to solve is to how to make actionable suggestions to the College Board to help them increase the participation rates in their exams. For that we need to perform an exploratory data analysis (EDA) to find appropriate metrics that can be adjusted by the College Board accordingly. Performing the EDA we must among other things:

• Find relevants patterns in the data
• Search for possible relations between subsets of the data (for example, are scores and participation rates correlated? If yes how?)
• Test hypotheses about the data using statistical inference method
• Identify possible biases in the data and, if possible, suggest corrections

## Brief introduction to the data

The data is based on the SAT and the ACT (which stands for American College Testing and it is administered by another institution, namely, the ACT. Inc) exams from around the United States in 2017.

The data contains:

• Average SAT and ACT scores by state (scores for each section of each exam)
• Participation rates for both exams by state.

Both SAT and ACT are standardized tests for college admissions and are similar in content type but have some differences in structure. A few relevant differences are:

• The ACT has a Science Test and the SAT does not
• There is a SAT Math Section for which the student is not allowed to use a calculator
• The SAT's College Board joins Reading and Writing into one score, the "Evidence-Based Reading and Writing" whereas in the ACT the tests are separated.

## EDA Steps

### Importing basic modules

We first need to import Python libraries including:

• `Pandas`, for data manipulation and analysis
• `SciPy` which is a Python-based ecosystem of software for mathematics, science, and engineering.
• `NumPy` which is a library consisting of multidimensional array objects and a collection of routines for processing of array.
• `Statsmodels` which is a Python package that allows users to explore data, estimate statistical models, and perform statistical tests and complements `SciPy`;
• `Matplotlib` is a plotting library for the Python and NumPy;
• `Seaborn` is complimentary to Matplotlib and it specifically targets statistical data visualization
• `Pylab` is embedded inside Matplotlib and provides a Matlab-like experience for the user. It imports portions of Matplotlib and NumPy.

This information was taken directly from the documentation.

To read our data (which is in the form of `csv` files), into a `DataFrame` structure, we use the method `read_csv()` and pass in each file name as a string:

``````sat = pd.read_csv('sat.csv')
``````

Note that the first columns of both tables seem to be identical to the DataFrame indexes. We can quickly confirm that using an `assert` statement. When an `assert` statement is encountered, Python evaluates it and if the expression is false raises an `AssertionError` exception this link contains more details.

``````assert (sat.index.tolist() == sat['Unnamed: 0'].tolist())
assert (act.index.tolist() == act['Unnamed: 0'].tolist())
``````

After some data wrangling and feature engineering, the SAT (top) and ACT (bottomw) `DataFrames` become:

The SAT table displays three different averages for each state:

• The first column is the state
• The second column is the average participation of students in each state
• The third and fourth columns are the average scores in the Math and Evidence-Based Reading and Writing tests (the name EBRW is explained above).

The ACT table displays the following averages for each State:

• The first column is the state
• The second column is the average participation of students in that state
• The third, fourth, fifth and sixth columns are the scores in the English, Math, Reading and Science tests

We can look for problems with the data for example:

• Using `info()`
• Using `describe()`
• Looking at the last rows and or last columns which frequently contain aggregate values
• Looking for null values
• Outliers

The third item was taken care of. There were no null values but there are outliers as we shall see when we perform the plotting. I will convert the columns 'Participation' into `floats` using a function to extract the % but keeping the scale between 0 and 100. The `.replace( )` method takes the argument `regex` =`True` because `type('%')` = `str`.

``````def perc_into_float(df,col):
return df[col].replace('%','',regex=True).astype('float')
df_sat['Participation'] = perc_into_float(df_sat,'Participation')
df_act['Participation'] = perc_into_float(df_act,'Participation')
``````

I now will create a dictionary for each column mapping the State to its respective value for that column using the function:

``````def dict_all(df,cols,n):
return [df.set_index('State').to_dict()[cols[i]] for i in range(1,len(cols))][n]
``````

The dicitonaries are:

``````dsat_part = dict_all(df_sat,df_sat.columns.tolist(),0)
dsat_EBRW = dict_all(df_sat,df_sat.columns.tolist(),1)
dsat_math = dict_all(df_sat,df_sat.columns.tolist(),2)
``````

Now I create one dictionary where each key is the column name, and each value is an iterable (a list or a Pandas Series) of all the values in that column. The following function accomplishes that:

``````def dict_col(df):
return {col:df[col].tolist() for col in df.columns}
``````

Merging the dataframes on the state column and changing the names of the columns to distinguish between the SAT columns and the ACT columns:

``````df_total = pd.merge(df_sat, df_act, on='State')
df_total = pd.merge(df_sat, df_act, on='State')
``````

I know write a function from scratch without loops to compute standard deviation and calculate the standard deviation of each numeric column in both data sets:

``````def stdev(X):
n = len(X)
return ((1.0/n)*np.sum([(x-np.mean(X))**2 for x in X]))**(0.5)
cols = df_total.columns[1:].tolist()
sd = [round(stdev([df_total[col].tolist() for col in cols][i] ) ,3) for i in range(0,len(cols))]
``````

Pandas calculates `std` DataFrame using n-1 as denominator instead of n. Setting the number of `ddof`=0 solves this issue. I will now turn the list `sd` into a new observation in your dataset. I first put `State` as index and then concatenate the new row, renaming it.

``````df_total_new = df_total.copy()
df_total_new = df_total_new.set_index('State')
df2 = pd.DataFrame([[34.929, 45.217, 84.073, 31.824, 2.33, 1.962, 2.047, 3.151]],columns=df_total_new.columns)
df_total_new = pd.concat([df2,df_total_new])
df_total_new = df_total_new.rename(index={df_total_new.index[0]: 'sd'})
``````

Sorting the dataframe by the values in a numeric column e.g. observations descending by SAT participation rate:

``````df_total_new = df_total.copy()
df_total_new = df_total_new.set_index('State').sort_values("Participation_SAT (%)",ascending=False)
df_total_new = pd.concat([df2,df_total_new])
df_total_new = df_total_new.rename(index={df_total_new.index[0]: 'sd'})
``````

I will now use a boolean filter to display only observations with a score above a certain threshold

``````df_total_new = df_total_new[df_total_new['Participation_SAT (%)']>50]
``````

## Descriptive and Inferential Statistics

### Confidence Interval

In inferential statistics usually the true parameter is not known, and all one can do is to determine a confidence interval which is a range of likely values for it. Confidence intervals are centered at a point estimate and then include the standard error multiplied by a multiplier.

## To be continued

The full analysis is contained in the notebook.

You can’t perform that action at this time.