# Descriptive Stats (Work in progress)

By Kenneth Burchfiel

Released under the MIT license

Now that we've learned how to retrieve, reformat, and clean data, we can finally begin analyzing it! This notebook demonstrates how to calculate descriptive statistics in Python using the Pandas library. One benefit of performing these calculations in Python (rather than Excel, Google Sheets, or another spreadsheet program) is that, once you have scripted these tasks, you can quickly rerun them whenever the original data gets updated.\* You can even have your computer run the script on a daily or hourly basis, freeing up time that you'd otherwise spend on busywork for more interesting tasks.

For example, suppose leaders at NVCU would like to know, on a daily basis, how spring survey results differ from fall and winter ones. (This number could change each day as new spring survey data gets released.) One way to accomplish this task would be to retrieve survey data from your database each day; paste it into Excel or Google Sheets; pivot the data, and then share the output. However, you could also accomplish these same steps in Python. While this would likely take you longer the first time around, you could then create updated analyses of your data in mere seconds. This notebook will show you how!

\* There are certainly ways to automate Excel tasks as well (e.g. using Visual Basic). I don't have any experience with Visual Basic, so I'm not the best person to compare these two tools; however, I have no doubt that learning it would take some time, and given Python's versatility and power, I would recommend applying that time to learning Python instead. (You can get an estimate of the world's interest in Python versus Visual Basic by checking out the [TIOBE index](https://www.tiobe.com/tiobe-index/).)

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

We'll first import our combined set of fall, spring, and winter student survey results; these results were created within data_prep.ipynb. (Note that these results also include college and level data that we merged in from NVCU's curr_enrollment SQL table; that way, we can evaluate average results by level and college.)

In [2]:
df_survey_results = pd.read_csv('../Data_Prep/2023_survey_results.csv')
df_survey_results.head()

Unnamed: 0,student_id,starting_year,season,score,season_order,college,level,level_for_sorting
0,2020-1,2023,Fall,88,0,STC,Fr,0
1,2020-2,2023,Fall,37,0,STM,Fr,0
2,2020-3,2023,Fall,54,0,STC,Fr,0
3,2020-4,2023,Fall,56,0,STC,Fr,0
4,2020-5,2023,Fall,77,0,STM,Fr,0


## Evaluating changes in average university-wide results during the school year

Our dataset contains survey results from the fall, winter, and spring. In order to determine how the mean survey score has changed over the course of the year, we can use Pandas' [`pivot_table()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html)--which I consider to be one of the most useful functions in the Pandas library.

The pivot_table() call below introduces three key arguments:

`index`: the list of values by which to group results. Although our dataset only contains data for one year, we'll still include `starting_year` in our results in order to (1) allow the function to accommodate other school years and (2) demonstrate to the viewer that all of this data comes from 2023. We'll also add both `season_order` and `season` to our list (in that order) so as to display results by season in chronological order. (Without the `season_order` value, our results would be sorted alphabetically: by Fall, Spring, and then Winter.)

`values`: the metric to assess. We're interested in analyzing changes in average score by season, so we'll pass `score` as our argument.

`aggfunc`: the aggregate function to apply to our list of values. We'll use `mean` here, but we could also have chosen `median` as a measure of the average.

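I generally like to add `reset_index()` to the result of `pivot_table()` in order to convert the index levels into regular columns. This removes the blank cells that would otherwise appear in the header row and gives each row a simple numeric index.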

In [3]:
df_results_by_season = df_survey_results.pivot_table(
    index=['starting_year', 'season_order', 'season'], 
    values='score', aggfunc='mean').reset_index()
df_results_by_season

Unnamed: 0,starting_year,season_order,season,score
0,2023,0,Fall,69.682251
1,2023,1,Winter,64.199483
2,2023,2,Spring,72.049622


These results show that the average score fell around 5 points from the fall to the winter, then increased nearly 8 points from the winter to the spring.
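The `aggfunc` description above mentions that `median` could serve as an alternative measure of the average. As a minimal sketch, assuming a small made-up dataset (`df_toy` and its scores are hypothetical and not part of the NVCU data), both statistics can be requested at once by passing a list to `aggfunc`:

```python
import pandas as pd

# Hypothetical toy data (not the NVCU survey results):
df_toy = pd.DataFrame({
    'season_order': [0, 0, 0, 1, 1],
    'season': ['Fall', 'Fall', 'Fall', 'Winter', 'Winter'],
    'score': [50, 60, 100, 40, 60]})

# Passing a list to aggfunc produces one set of columns per function.
# The resulting columns are two-level: (function name, value name).
df_toy_stats = df_toy.pivot_table(
    index=['season_order', 'season'],
    values='score', aggfunc=['mean', 'median'])

fall_mean = df_toy_stats.loc[(0, 'Fall'), ('mean', 'score')]
fall_median = df_toy_stats.loc[(0, 'Fall'), ('median', 'score')]
print(df_toy_stats)
```

In this toy example, the single high fall score of 100 pulls the fall mean above the fall median, which is one reason to look at both.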

Here's what the output looks like without the trailing reset_index() call:

In [4]:
df_survey_results.pivot_table(
    index=['starting_year', 'season_order', 'season'], 
    values='score', aggfunc='mean')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,score
starting_year,season_order,season,Unnamed: 3_level_1
2023,0,Fall,69.682251
2023,1,Winter,64.199483
2023,2,Spring,72.049622


We can also find the average score across seasons by setting `margins` to True. The `margins_name` argument lets us assign a name to this row; if we leave it blank, the row will be titled 'All'.

In [5]:
df_survey_results.pivot_table(
    index = ['starting_year', 'season_order', 'season'], 
    values = 'score', aggfunc = 'mean', margins=True, margins_name='2023 Average').reset_index()

Unnamed: 0,starting_year,season_order,season,score
0,2023,0.0,Fall,69.682251
1,2023,1.0,Winter,64.199483
2,2023,2.0,Spring,72.049622
3,2023 Average,,,68.877736
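Note that the margin value above (68.877736) is not the simple average of the three seasonal means (which would be about 68.64). This is consistent with the margin being calculated across all individual rows, so that seasons with more responses carry more weight. The following sketch, which uses hypothetical toy data and a hypothetical margin name ('Overall'), illustrates the difference:

```python
import pandas as pd

# Hypothetical toy data: two fall responses but only one winter response
df_toy = pd.DataFrame({
    'season': ['Fall', 'Fall', 'Winter'],
    'score': [10, 20, 40]})

df_toy_pivot = df_toy.pivot_table(
    index='season', values='score', aggfunc='mean',
    margins=True, margins_name='Overall')

# The margin row's value:
margin_value = df_toy_pivot.loc['Overall', 'score']
# The mean of every individual row:
mean_of_all_rows = df_toy['score'].mean()
# The unweighted mean of the two seasonal means:
mean_of_season_means = df_toy.groupby('season')['score'].mean().mean()

print(margin_value, mean_of_all_rows, mean_of_season_means)
```

Here, the margin matches the mean of all rows rather than the mean of the seasonal means.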


We can also use the pivot_table() function to determine survey response rates as a percentage of our current enrollment. We can import this current enrollment value from our NVCU database:

In [6]:
e = create_engine('sqlite:///../Appendix/nvcu_db.db')

Calculating our current enrollment by counting the number of rows in our curr_enrollment table:

In [7]:
enrollment_count = len(pd.read_sql(
    "Select * from curr_enrollment", con = e))
enrollment_count

16384

A faster way of computing this number is to request it within the original SQL query via COUNT(*). The following line, which demonstrates this approach, took only 6 milliseconds to run on my computer--one tenth the duration of the previous line (which took 62 milliseconds). If we were dealing with millions of rows instead of thousands, this performance difference would probably be even greater.

In [8]:
enrollment_count = pd.read_sql(
    "Select COUNT(*) from curr_enrollment", con = e).iloc[0]['COUNT(*)']
enrollment_count

16384
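The two counting approaches above can also be compared in a self-contained way. The sketch below assumes a throwaway in-memory SQLite database with a five-row stand-in for the `curr_enrollment` table (the real table lives in nvcu_db.db); the `enrollment_count` column alias is my own addition, and simply makes the result easier to retrieve than the default 'COUNT(*)' column name.

```python
import sqlite3
import pandas as pd

# Creating a temporary in-memory database with a small stand-in table:
conn = sqlite3.connect(':memory:')
pd.DataFrame(
    {'student_id': [f'2020-{i}' for i in range(1, 6)]}).to_sql(
    'curr_enrollment', conn, index=False)

# Approach 1: retrieve every row, then count the rows in Python
count_via_len = len(pd.read_sql(
    "SELECT * FROM curr_enrollment", con=conn))

# Approach 2: have the database count the rows, so that only a single
# value needs to be transferred
count_via_sql = int(pd.read_sql(
    "SELECT COUNT(*) AS enrollment_count FROM curr_enrollment",
    con=conn)['enrollment_count'].iloc[0])

conn.close()
print(count_via_len, count_via_sql)
```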

Counting the number of survey results by season:

*Note: When calculating row counts, make sure that the column you pass to the `values` argument doesn't contain null values; otherwise, your row counts will be incorrect (as null values will get excluded from your counts). To prevent this issue, I often like to create a column that stores a value of 1 for every row. Using this column (titled `responses` in the following cell) ensures that my pivot table will show the correct row counts for each group.*

In [9]:
df_survey_results['responses'] = 1
df_response_rates = df_survey_results.pivot_table(
    index = ['starting_year', 'season_order', 'season'], 
    values = 'responses', aggfunc = 'count').reset_index() # Because all 
# 'responses' values are 1, we could have made 'sum' the aggfunc rather 
# than 'count'
df_response_rates

Unnamed: 0,starting_year,season_order,season,responses
0,2023,0,Fall,16384
1,2023,1,Winter,13926
2,2023,2,Spring,16384
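To illustrate the earlier note about null values, here is a minimal sketch that uses hypothetical toy data in which one fall score is missing. Counting the `score` column undercounts the fall rows, whereas counting a column of 1s does not:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing fall score:
df_toy = pd.DataFrame({
    'season': ['Fall', 'Fall', 'Winter'],
    'score': [80, np.nan, 70]})
df_toy['responses'] = 1

# 'count' skips null values, so only one fall row gets counted here:
counts_from_score = df_toy.pivot_table(
    index='season', values='score', aggfunc='count')

# The 'responses' column has no nulls, so both fall rows get counted:
counts_from_responses = df_toy.pivot_table(
    index='season', values='responses', aggfunc='count')

print(counts_from_score)
print(counts_from_responses)
```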


Calculating response rates as the quotient of survey counts and NVCU's current enrollment:

In [10]:
df_response_rates['response_rate'] = 100*(
    df_response_rates['responses'] / enrollment_count)
df_response_rates

Unnamed: 0,starting_year,season_order,season,responses,response_rate
0,2023,0,Fall,16384,100.0
1,2023,1,Winter,13926,84.997559
2,2023,2,Spring,16384,100.0


This table shows that our survey response rates were 100% during the fall and spring and around 85% during the winter.

## Using the 'columns' `pivot_table()` argument to show seasons side by side

Currently, the DataFrame is in 'long' format: each row shows data for one specific season. However, in order to more easily calculate the change in results from one season to another, we can also use the `columns` argument within pivot_table() in order to show scores for each season side by side. (This will prove especially useful when we add additional index variables to our pivot_table() call.)

The following function call is similar to our earlier pivot_table() calls except that the `season_order` and `season` values have been moved from the `index` argument to the argument for `columns`. This change makes the seasons appear horizontally rather than vertically.

In [11]:
df_results_by_season_wide = df_survey_results.pivot_table(
    index = 'starting_year', columns = ['season_order', 'season'],
    values = 'score', aggfunc = 'mean').reset_index()
df_results_by_season_wide

season_order,starting_year,0,1,2
season,Unnamed: 1_level_1,Fall,Winter,Spring
0,2023,69.682251,64.199483,72.049622


Note that, because we passed two values to the `columns` parameter, two levels of headers are now visible. However, I'd like to show just one level of column names: 'starting_year' (which currently sits in the top header row) followed by the season names (which currently sit in the bottom row). We can accomplish this by first calling `to_flat_index()` to 'flatten' the columns into tuples:

In [12]:
df_results_by_season_wide.columns = (
    df_results_by_season_wide.columns.to_flat_index())
df_results_by_season_wide

Unnamed: 0,"(starting_year, )","(0, Fall)","(1, Winter)","(2, Spring)"
0,2023,69.682251,64.199483,72.049622


Next, I'll use a list comprehension to replace our tuple-based columns with string-based ones. Note that I want to keep the first entry ('starting_year') in the first tuple and the second entries (`Fall`, `Winter`, and `Spring`) in the others; this can be accomplished by adding an if/else statement to our list comprehension.

In [13]:
df_results_by_season_wide.columns = [
    column_tuple[0] if column_tuple[1] not in ['Fall', 'Winter', 'Spring'] 
    else column_tuple[1] for column_tuple in 
    df_results_by_season_wide.columns]
df_results_by_season_wide

Unnamed: 0,starting_year,Fall,Winter,Spring
0,2023,69.682251,64.199483,72.049622
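As an alternative to the flatten-and-rename approach shown above, the `season_order` column level can be dropped *before* the index gets reset. This is only a sketch on hypothetical toy data, not the method used in the rest of this notebook; it relies on `DataFrame.droplevel()` and `DataFrame.rename_axis()`.

```python
import pandas as pd

# Hypothetical toy data (two responses per season):
df_toy = pd.DataFrame({
    'starting_year': [2023] * 6,
    'season_order': [0, 0, 1, 1, 2, 2],
    'season': ['Fall', 'Fall', 'Winter', 'Winter', 'Spring', 'Spring'],
    'score': [70, 80, 60, 70, 75, 85]})

df_toy_wide = df_toy.pivot_table(
    index='starting_year', columns=['season_order', 'season'],
    values='score', aggfunc='mean')

# season_order has already put the columns in chronological order, so
# we can now drop that level, clear the leftover 'season' label from
# the column axis, and move starting_year into a regular column:
df_toy_wide = df_toy_wide.droplevel(
    'season_order', axis=1).rename_axis(columns=None).reset_index()

print(df_toy_wide)
```

One tradeoff is that this version needs no hard-coded list of season names, but it only works if the level being dropped isn't needed to tell columns apart.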


Now that we have our seasons next to one another, we can easily calculate changes in average scores between seasons:

In [14]:
df_results_by_season_wide['Fall-Winter Change'] = (
    df_results_by_season_wide['Winter'] - df_results_by_season_wide['Fall'])
df_results_by_season_wide['Winter-Spring Change'] = (
    df_results_by_season_wide['Spring'] - df_results_by_season_wide['Winter'])
df_results_by_season_wide['Fall-Spring Change'] = (
    df_results_by_season_wide['Spring'] - df_results_by_season_wide['Fall'])
df_results_by_season_wide

Unnamed: 0,starting_year,Fall,Winter,Spring,Fall-Winter Change,Winter-Spring Change,Fall-Spring Change
0,2023,69.682251,64.199483,72.049622,-5.482768,7.850139,2.367371


## Adding additional pivot index values

We now know that our average NVCU student survey scores declined from the fall to the winter and then rose from the winter to the spring. Was this trend the same across colleges and levels? We can answer this question by adding our college and level fields to the `index` argument of our pivot table function.

In order to make this section more efficient, we can create a function that performs the pivot table, column renaming, and growth calculations shown for df_results_by_season_wide. This will greatly reduce the amount of code that we need to write to perform these additional analyses.

In [15]:
def create_wide_table(index_values):
    '''This function creates a wide pivot table of df_survey_results, then
    performs additional column renaming steps and growth calculations.
    
    index_values: a list of values to pass to the index argument of 
    pivot_table().'''
    
    df_wide = df_survey_results.pivot_table(
        index = index_values, columns = ['season_order', 'season'],
        values = 'score', aggfunc = 'mean').reset_index()
    # Flattening the two-level column index into tuples:
    df_wide.columns = df_wide.columns.to_flat_index()
    # Keeping season names for the season columns and the first tuple
    # entry (e.g. 'starting_year') for all other columns:
    df_wide.columns = [
        column_tuple[0] if column_tuple[1] not in ['Fall', 'Winter', 'Spring'] 
        else column_tuple[1] for column_tuple in 
        df_wide.columns]
    # Calculating changes in average scores between seasons:
    df_wide['Fall-Spring Change'] = (
        df_wide['Spring'] - df_wide['Fall'])
    df_wide['Fall-Winter Change'] = (
        df_wide['Winter'] - df_wide['Fall'])
    df_wide['Winter-Spring Change'] = (
        df_wide['Spring'] - df_wide['Winter'])
    return df_wide

### Evaluating changes in survey scores by season and college:

In [16]:
df_results_by_season_and_college_wide = create_wide_table(
    index_values = ['starting_year', 'college'])
df_results_by_season_and_college_wide

Unnamed: 0,starting_year,college,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
0,2023,STB,69.797119,64.472207,67.077551,-2.719568,-5.324912,2.605344
1,2023,STC,69.568665,64.081522,66.911444,-2.657221,-5.487143,2.829922
2,2023,STL,69.596675,64.028346,76.727809,7.131134,-5.568328,12.699463
3,2023,STM,69.735685,64.18493,76.639004,6.90332,-5.550755,12.454074


Although university-wide survey results grew from the fall to the spring, this table shows that results for two colleges (STB and STC) actually *dropped* over that time period. (Their average spring survey scores were also markedly lower than STL's and STM's.) It also demonstrates that fall-to-winter scores dropped for all colleges and that every college saw an increase in scores during the winter-to-spring period. 

### Evaluating changes in survey scores by season and level:

We'll pivot the data by `level_for_sorting` and *then* `level` so as to order the rows from youngest to oldest.

In [17]:
df_results_by_season_and_level_wide = create_wide_table(
    index_values = ['starting_year', 'level_for_sorting', 'level'])
df_results_by_season_and_level_wide

Unnamed: 0,starting_year,level_for_sorting,level,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
0,2023,0,Fr,69.698085,64.24084,73.546671,3.848586,-5.457245,9.305831
1,2023,1,So,69.768957,64.267583,69.350671,-0.418286,-5.501374,5.083088
2,2023,2,Ju,69.688672,64.169098,69.495124,-0.193548,-5.519574,5.326026
3,2023,3,Se,69.609774,64.150389,74.83318,5.223406,-5.459385,10.682791


This table shows that sophomores and juniors had similar fall and spring average scores, whereas scores for freshmen and seniors increased over that period. All levels showed a fall-to-winter drop followed by a winter-to-spring rise.

### Evaluating changes in survey scores by season, college, *and* level:

(I originally named the following DataFrame `df_results_by_season_level_and_college_wide`, but since that's a rather long name and we'll use this DataFrame quite a bit within this section, I abbreviated the index values as 'slc'.)

In [18]:
df_results_slc = create_wide_table(
    index_values = ['starting_year', 'college', 'level_for_sorting', 'level'])
df_results_slc

Unnamed: 0,starting_year,college,level_for_sorting,level,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
0,2023,STB,0,Fr,69.154329,63.899705,68.543287,-0.611041,-5.254624,4.643582
1,2023,STB,1,So,69.950769,64.64881,64.950769,-5.0,-5.30196,0.30196
2,2023,STB,2,Ju,70.377306,65.20221,65.377306,-5.0,-5.175096,0.175096
3,2023,STB,3,Se,69.593583,64.102722,69.177235,-0.416348,-5.490861,5.074513
4,2023,STC,0,Fr,70.097292,64.864387,69.58676,-0.510532,-5.232905,4.722373
5,2023,STC,1,So,69.180932,63.615385,64.180932,-5.0,-5.565547,0.565547
6,2023,STC,2,Ju,69.331325,63.757143,64.331325,-5.0,-5.574182,0.574182
7,2023,STC,3,Se,69.598913,63.996188,69.079348,-0.519565,-5.602725,5.08316
8,2023,STL,0,Fr,69.026639,63.384259,77.915984,8.889344,-5.64238,14.531724
9,2023,STL,1,So,69.878706,64.394612,74.634771,4.756065,-5.484094,10.240159


## Comparing rows via sort_values() and rank()

Which college/level pairs had the highest and lowest spring survey results? We could examine `df_results_slc` line by line to answer this question; however, two Pandas functions--sort_values() and rank()--can make it easier to compare survey outcomes by college and level.

First, here are the five college/level pairs with the highest average spring results:

In [19]:
df_results_slc.sort_values(
    'Spring', ascending = False).head()

Unnamed: 0,starting_year,college,level_for_sorting,level,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
12,2023,STM,0,Fr,70.05859,64.325909,79.309831,9.251241,-5.732681,14.983923
11,2023,STL,3,Se,69.58503,64.110254,79.111622,9.526592,-5.474775,15.001367
15,2023,STM,3,Se,69.650503,64.312456,78.488468,8.837966,-5.338047,14.176013
8,2023,STL,0,Fr,69.026639,63.384259,77.915984,8.889344,-5.64238,14.531724
9,2023,STL,1,So,69.878706,64.394612,74.634771,4.756065,-5.484094,10.240159


And here are the five pairs with the *lowest* spring results:

In [20]:
df_results_slc.sort_values(
    'Spring', ascending = False).tail()

Unnamed: 0,starting_year,college,level_for_sorting,level,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
0,2023,STB,0,Fr,69.154329,63.899705,68.543287,-0.611041,-5.254624,4.643582
2,2023,STB,2,Ju,70.377306,65.20221,65.377306,-5.0,-5.175096,0.175096
1,2023,STB,1,So,69.950769,64.64881,64.950769,-5.0,-5.30196,0.30196
6,2023,STC,2,Ju,69.331325,63.757143,64.331325,-5.0,-5.574182,0.574182
5,2023,STC,1,So,69.180932,63.615385,64.180932,-5.0,-5.565547,0.565547


Note that calling sort_values() here did not change the order of `df_results_slc` itself. By default, sort_values() returns a new, sorted copy of the DataFrame (which is what gets displayed) and leaves the original untouched, so subsequent lines of code will still see the original row order. The following cell demonstrates this:

In [21]:
df_results_slc.head() # Note that the DataFrame
# is once again sorted by college and level

Unnamed: 0,starting_year,college,level_for_sorting,level,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
0,2023,STB,0,Fr,69.154329,63.899705,68.543287,-0.611041,-5.254624,4.643582
1,2023,STB,1,So,69.950769,64.64881,64.950769,-5.0,-5.30196,0.30196
2,2023,STB,2,Ju,70.377306,65.20221,65.377306,-5.0,-5.175096,0.175096
3,2023,STB,3,Se,69.593583,64.102722,69.177235,-0.416348,-5.490861,5.074513
4,2023,STC,0,Fr,70.097292,64.864387,69.58676,-0.510532,-5.232905,4.722373


This behavior, which is seen in many other Pandas functions, is actually quite helpful: it allows you to test out changes and modifications without making them permanent (which, if you make a mistake, could force you to restart your script).
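Here is a minimal sketch of this behavior that uses a hypothetical three-row DataFrame. The sorted result is stored under a new name, and the original DataFrame keeps its initial row order:

```python
import pandas as pd

# Hypothetical toy data:
df_toy = pd.DataFrame({
    'pair': ['A', 'B', 'C'],
    'Spring': [65, 79, 70]})

# sort_values() returns a new DataFrame by default:
df_toy_sorted = df_toy.sort_values('Spring', ascending=False)

original_order = list(df_toy['pair'])
sorted_order = list(df_toy_sorted['pair'])
print(original_order, sorted_order)
```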

It's also worth mentioning that none of these changes are affecting the underlying .csv file from which we retrieved our data. That file will only get modified if we use to_csv() to save our table to that same filename.

To make a sort persistent, you can use one of the following two lines:

In [22]:
# First option (my preference because it often requires fewer characters):
df_results_slc.sort_values(
    'Spring', inplace=True)

# An alternative option (which can come in handy when making multiple
# changes to a dataset at once):
df_results_slc = (
    df_results_slc.sort_values('Spring'))


# Make sure NOT to add 'inplace = True' as an argument when using the second
# method, as your DataFrame will be replaced with None! 
# For an explanation of None, see: 
# https://docs.python.org/3/library/constants.html#None

df_results_slc.head()

Unnamed: 0,starting_year,college,level_for_sorting,level,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
5,2023,STC,1,So,69.180932,63.615385,64.180932,-5.0,-5.565547,0.565547
6,2023,STC,2,Ju,69.331325,63.757143,64.331325,-5.0,-5.574182,0.574182
1,2023,STB,1,So,69.950769,64.64881,64.950769,-5.0,-5.30196,0.30196
2,2023,STB,2,Ju,70.377306,65.20221,65.377306,-5.0,-5.175096,0.175096
0,2023,STB,0,Fr,69.154329,63.899705,68.543287,-0.611041,-5.254624,4.643582
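The warning in the cell above about combining `inplace=True` with an assignment can be demonstrated on a small scale. In this sketch (which uses hypothetical toy data), the in-place sort modifies `df_toy` directly and returns None, so assigning its result to a variable would leave that variable empty:

```python
import pandas as pd

# Hypothetical toy data:
df_toy = pd.DataFrame({'Spring': [70, 65, 79]})

# With inplace=True, sort_values() sorts df_toy itself and returns None:
returned_value = df_toy.sort_values('Spring', inplace=True)

sorted_scores = list(df_toy['Spring'])
print(returned_value, sorted_scores)
```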


## Calculating percentiles and ranks

Ranks and percentiles are alternative ways to evaluate values relative to their peers. Let's say that the NVCU administration would like you to calculate both the rank *and* the percentile of each college/level pair's average spring score. However, they'd also like you to convert the spring results to integers before making these calculations so that pairs with similar scores will get treated equally. (The code below does so via `astype('int')`, which drops the decimal portion of each score rather than rounding it to the nearest integer.)

First, we'll create a new condensed DataFrame that can store these integer-based results, ranks, and percentiles. We'll then assign ranks to each integer.
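One detail to keep in mind when converting floats to integers is that `astype('int')` drops the decimal portion of each value, whereas `round()` moves each value to the nearest integer. The sketch below compares the two approaches using two of the average spring scores shown earlier in this notebook (the variable names are my own):

```python
import pandas as pd

# Two average spring scores from the earlier output:
spring_means = pd.Series([79.309831, 77.915984])

# astype('int') truncates toward zero:
truncated = spring_means.astype('int')

# round() goes to the nearest integer first:
rounded = spring_means.round().astype('int')

print(list(truncated), list(rounded))
```

The second score becomes 77 when truncated but 78 when rounded, so the choice of method can affect which pairs end up tied.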

In [23]:
# Creating a condensed DataFrame:
df_spring_ranks = df_results_slc.copy()[[
    'starting_year', 'college', 
    'level_for_sorting', 'level', 'Spring']].sort_values(
    'Spring', ascending = False)
# Converting average spring results to integers (note that astype('int')
# drops the decimal portion rather than rounding to the nearest integer):
df_spring_ranks['Spring'] = df_spring_ranks['Spring'].astype('int')

# Calculating our ranks:
# Note: the inclusion of "method = 'min'" ensures that, in the case of ties,
# each tied row will show the lowest (i.e. best) possible rank. This is the ranking
# convention that I'm more familiar with, but Pandas allows for other methods also.
# ascending = False assigns the best ranks to the highest results.
# See the df.rank() documentation for more details:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rank.html

df_spring_ranks['Spring_Rank'] = df_spring_ranks[
    'Spring'].rank(ascending = False, method = 'min')
df_spring_ranks

Unnamed: 0,starting_year,college,level_for_sorting,level,Spring,Spring_Rank
12,2023,STM,0,Fr,79,1.0
11,2023,STL,3,Se,79,1.0
15,2023,STM,3,Se,78,3.0
8,2023,STL,0,Fr,77,4.0
9,2023,STL,1,So,74,5.0
13,2023,STM,1,So,74,5.0
10,2023,STL,2,Ju,74,5.0
14,2023,STM,2,Ju,73,8.0
4,2023,STC,0,Fr,69,9.0
3,2023,STB,3,Se,69,9.0
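To show how the choice of `method` affects tied values, the following sketch ranks a hypothetical four-value Series (with one tie) using several of the methods that `rank()` supports:

```python
import pandas as pd

# Hypothetical scores; the first two values are tied
scores = pd.Series([79, 79, 78, 77])

# Ranking the same scores with four different tie-handling methods:
df_rank_methods = pd.DataFrame({
    'score': scores,
    'min': scores.rank(ascending=False, method='min'),
    'max': scores.rank(ascending=False, method='max'),
    'average': scores.rank(ascending=False, method='average'),
    'dense': scores.rank(ascending=False, method='dense')})

print(df_rank_methods)
```

With 'min', both tied rows receive a rank of 1 and the next row receives a rank of 3; with 'dense', the next row receives a rank of 2 instead.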


Our code for calculating percentiles will also use `df.rank()`; we can instruct that function to display its output as percentiles by adding the argument `pct=True`. We'll also add (1) `ascending=True` so that the highest scores will get the highest percentiles and (2) `method='max'` so that, in the case of ties, the highest possible percentile will get displayed.

Note that, while the highest percentile in the following output is 100, the lowest percentile is not 0. This is because, when `pct=True`, Pandas divides each row's rank by the number of non-null values. Since the lowest possible rank is 1 rather than 0, even the lowest row will receive a percentile above 0. (With `method='max'`, each percentile can thus be read as the percentage of results *equal to or lower than* the current result.)

In [24]:
df_spring_ranks['Spring_Percentile'] = 100 * df_spring_ranks['Spring'].rank(
    ascending=True, pct=True, method='max')
df_spring_ranks

Unnamed: 0,starting_year,college,level_for_sorting,level,Spring,Spring_Rank,Spring_Percentile
12,2023,STM,0,Fr,79,1.0,100.0
11,2023,STL,3,Se,79,1.0,100.0
15,2023,STM,3,Se,78,3.0,87.5
8,2023,STL,0,Fr,77,4.0,81.25
9,2023,STL,1,So,74,5.0,75.0
13,2023,STM,1,So,74,5.0,75.0
10,2023,STL,2,Ju,74,5.0,75.0
14,2023,STM,2,Ju,73,8.0,56.25
4,2023,STC,0,Fr,69,9.0,50.0
3,2023,STB,3,Se,69,9.0,50.0
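As a quick check on how these percentiles are derived, the sketch below (which uses a hypothetical four-value Series) compares the output of `rank(pct=True)` with a manual calculation that divides each rank by the number of values:

```python
import pandas as pd

# Hypothetical scores; the middle two values are tied
scores = pd.Series([64, 65, 65, 79])

# Percentiles via pct=True:
percentiles = 100 * scores.rank(
    ascending=True, pct=True, method='max')

# The same percentiles calculated manually as rank / count:
manual_percentiles = 100 * scores.rank(
    ascending=True, method='max') / scores.count()

print(list(percentiles), list(manual_percentiles))
```

The lowest score receives a percentile of 25 rather than 0, since one of the four values (itself) is equal to or lower than it.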


Although our DataFrame was sorted by Spring results, df.rank() would still have successfully calculated ranks and percentiles regardless of how the DataFrame happened to be sorted.

## More to come!