# Descriptive Stats

By Kenneth Burchfiel

Released under the MIT license

Now that we've learned how to retrieve, reformat, and clean data, we can finally begin analyzing it! This notebook demonstrates how to calculate descriptive statistics in Python using Pandas. 

Suppose leaders at NVCU would like to know, on a daily basis, how spring survey results differ from fall and winter ones. (These figures could change each day as new spring survey data gets released.) One way to accomplish this task would be to retrieve survey data from your database each day, paste it into Excel or Google Sheets, pivot the data, and then share the output. However, you could also accomplish these same steps in Python. While this would likely take you longer the first time around, you could then create updated analyses of your data in mere seconds. This notebook will show you how!

In [1]:
import sys
sys.path.insert(1, '../Appendix')
from helper_funcs import config_notebook
display_type = config_notebook(display_max_columns = 8,
                              display_max_rows = 16) 

import pandas as pd
import numpy as np
from sqlalchemy import create_engine    

We'll first import our combined set of fall, spring, and winter student survey results; these results were created within data_prep.ipynb. (Note that these results also include college and level data that we merged in from NVCU's curr_enrollment SQL table; that way, we can evaluate average results by level and college.)

In [2]:
df_survey_results = pd.read_csv('../Data_Prep/2023_survey_results.csv')
df_survey_results.head()

Unnamed: 0,student_id,starting_year,season,score,season_order,college,level,level_for_sorting
0,2020-1,2023,Fall,88,0,STC,Se,3
1,2020-2,2023,Fall,37,0,STM,Se,3
2,2020-3,2023,Fall,54,0,STC,Se,3
3,2020-4,2023,Fall,56,0,STC,Se,3
4,2020-5,2023,Fall,77,0,STM,Se,3


## Evaluating changes in average university-wide results during the school year

Our dataset contains survey results from the fall, winter, and spring. In order to determine how the mean survey score has changed over the course of the year, we can use Pandas' [`pivot_table()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html)--which I consider to be one of the most useful functions in the Pandas library.

The pivot_table() call below introduces three key arguments:

`index`: the list of values by which to group results. Although our dataset only contains data for one year, we'll still include `starting_year` in our results in order to (1) allow the function to accommodate other school years and (2) demonstrate to the viewer that all of this data comes from 2023. We'll also add both `season_order` and `season` to our list (in that order) so as to display results by season in chronological order. (Without the `season_order` entry, our results would be sorted alphabetically: by Fall, Spring, and then Winter.)

`values`: the metric to assess. We're interested in analyzing changes in average score by season, so we'll pass `score` as our argument.

`aggfunc`: the aggregate function to apply to our list of values. We'll use `mean` here, but we could also have chosen `median` as a measure of the average.

I generally like to add `reset_index()` to the result of `pivot_table()` in order to convert the index levels back into regular columns, which removes the blank header cells that would otherwise appear.

In [3]:
df_results_by_season = df_survey_results.pivot_table(
    index=['starting_year', 'season_order', 'season'], 
    values='score', aggfunc='mean').reset_index()
df_results_by_season

Unnamed: 0,starting_year,season_order,season,score
0,2023,0,Fall,69.682251
1,2023,1,Winter,64.16401
2,2023,2,Spring,72.049622


These results show that the average score fell around 5.5 points from the fall to the winter, then increased nearly 8 points from the winter to the spring.

Here's what the output looks like without the trailing reset_index() call:

In [4]:
df_survey_results.pivot_table(
    index=['starting_year', 'season_order', 'season'], 
    values='score', aggfunc='mean')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,score
starting_year,season_order,season,Unnamed: 3_level_1
2023,0,Fall,69.682251
2023,1,Winter,64.16401
2023,2,Spring,72.049622
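
To see why `season_order` is worth including in `index`, here is a minimal sketch on made-up data. (The scores below are invented for illustration; only the column names mirror our survey table.)

```python
import pandas as pd

# Made-up scores; only the column names mirror the survey table.
df_toy = pd.DataFrame({
    'season': ['Fall', 'Winter', 'Spring', 'Fall', 'Winter', 'Spring'],
    'season_order': [0, 1, 2, 0, 1, 2],
    'score': [70, 60, 80, 72, 64, 76]})

# Grouping by season alone sorts the rows alphabetically:
alphabetical = df_toy.pivot_table(
    index='season', values='score', aggfunc='mean').reset_index()

# Placing season_order ahead of season restores chronological order:
chronological = df_toy.pivot_table(
    index=['season_order', 'season'], values='score',
    aggfunc='mean').reset_index()

print(alphabetical['season'].tolist())
print(chronological['season'].tolist())
```

The first list should come back as Fall, Spring, Winter, and the second as Fall, Winter, Spring.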


We can also find the average score across seasons by setting `margins` to True. The `margins_name` argument lets us assign a name to this row; if we leave it out, the row will be titled 'All'.

In [5]:
df_survey_results.pivot_table(
    index = ['starting_year', 'season_order', 'season'], 
    values = 'score', aggfunc = 'mean', margins=True, 
    margins_name='2023 Average').reset_index()

Unnamed: 0,starting_year,season_order,season,score
0,2023,0.0,Fall,69.682251
1,2023,1.0,Winter,64.16401
2,2023,2.0,Spring,72.049622
3,2023 Average,,,68.867156
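
It's worth noting that the margin row is calculated from all of the underlying rows rather than from the group averages. That is why the 2023 average above (about 68.87) differs from the simple average of the three seasonal means (about 68.63): the winter survey had fewer responses, so it carries less weight. The following sketch illustrates this on a tiny made-up dataset whose group names and scores are purely hypothetical.

```python
import pandas as pd

# Hypothetical data: group A has two rows, group B has one.
df_toy = pd.DataFrame({
    'group': ['A', 'A', 'B'],
    'score': [10, 20, 60]})

table = df_toy.pivot_table(
    index='group', values='score', aggfunc='mean', margins=True)

# The margin row ('All' by default) is the mean of all three rows:
margin_value = table.loc['All', 'score']
# The mean of the two group means gives a different number:
mean_of_group_means = table.loc[['A', 'B'], 'score'].mean()
print(margin_value, mean_of_group_means)
```

Here, the margin value is the mean of 10, 20, and 60, whereas the mean of the group means averages 15 and 60.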


## Calculating response rates

We can also use the pivot_table() function to determine survey response rates as a percentage of our current enrollment. We can import this current enrollment value from our NVCU database:

In [6]:
e = create_engine('sqlite:///../Appendix/nvcu_db.db')

Calculating our current enrollment by counting the number of rows in our curr_enrollment table:

In [7]:
enrollment_count = len(pd.read_sql(
    "Select * from curr_enrollment", con = e))
enrollment_count

16384

A faster way of computing this number is to request it within the original SQL query via COUNT(*). The following line, which demonstrates this approach, took only 6 milliseconds to run on my computer--one tenth the duration of the previous line (which took 62 milliseconds). If we were dealing with millions of rows instead of thousands, this performance difference would probably be even greater.

In [8]:
enrollment_count = pd.read_sql(
    "Select COUNT(*) from curr_enrollment", con = e).iloc[0]['COUNT(*)']
enrollment_count

np.int64(16384)
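
Here is a self-contained sketch of both counting approaches. It assumes an in-memory SQLite database (accessed through Python's built-in `sqlite3` module rather than SQLAlchemy) with a three-row stand-in for `curr_enrollment`; the `enrollment` alias is my own addition and simply gives the COUNT(*) column a friendlier name.

```python
import sqlite3
import pandas as pd

# An in-memory stand-in for the NVCU database (hypothetical contents):
con = sqlite3.connect(':memory:')
pd.DataFrame({'student_id': ['2020-1', '2020-2', '2020-3']}).to_sql(
    'curr_enrollment', con, index=False)

# Approach 1: pull every row into pandas, then count the rows.
count_via_len = len(pd.read_sql("SELECT * FROM curr_enrollment", con))

# Approach 2: let the database do the counting and return one value.
count_via_sql = pd.read_sql(
    "SELECT COUNT(*) AS enrollment FROM curr_enrollment", con
    ).iloc[0]['enrollment']
con.close()

print(count_via_len, count_via_sql)
```

Both approaches should report the same count; the second one just transfers far less data.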

Counting the number of survey results by season: 

(For teaching purposes, this cell also shows how to display results for two different value/aggfunc pairs.)

*Note: When calculating row counts, make sure that the column you pass to the `values` argument doesn't contain null values; otherwise, your row counts will be incorrect (as null values will get excluded from your counts). To prevent this issue, I often like to create a column that stores a value of 1 for every row. Using this column (titled `responses` in the following cell) ensures that my pivot table will show the correct row counts for each group.*

In [9]:
df_survey_results['responses'] = 1
df_response_rates = df_survey_results.pivot_table(
    index = ['starting_year', 'season_order', 'season'], 
    values = ['score', 'responses'], 
    aggfunc = {'score':'mean','responses':'count'}).reset_index() 
    # Because all 'responses' values are 1, we could have made 'sum' 
    # the aggfunc for 'responses' rather than 'count'
df_response_rates

Unnamed: 0,starting_year,season_order,season,responses,score
0,2023,0,Fall,16384,69.682251
1,2023,1,Winter,13926,64.16401
2,2023,2,Spring,16384,72.049622
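
The null-value caveat mentioned earlier can be illustrated with a small made-up example. In this sketch (whose data is hypothetical), one fall score is missing, so counting `score` undercounts the fall rows while counting the all-ones `responses` column does not.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second fall row has a missing score.
df_toy = pd.DataFrame({
    'season': ['Fall', 'Fall', 'Winter'],
    'score': [80, np.nan, 70]})
df_toy['responses'] = 1

counts = df_toy.pivot_table(
    index='season', values=['score', 'responses'],
    aggfunc={'score': 'count', 'responses': 'count'})
print(counts)
```

The `responses` column should report two fall rows, whereas the `score` column reports only one.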


Calculating response rates as the quotient of survey counts and NVCU's current enrollment:

In [10]:
df_response_rates['response_rate'] = 100*(
    df_response_rates['responses'] / enrollment_count)
df_response_rates

Unnamed: 0,starting_year,season_order,season,responses,score,response_rate
0,2023,0,Fall,16384,69.682251,100.0
1,2023,1,Winter,13926,64.16401,84.997559
2,2023,2,Spring,16384,72.049622,100.0


This table shows that our survey response rates were 100% during the fall and spring and around 85% during the winter.

## Using the `columns` argument within `pivot_table()` to show seasons side by side

Currently, the DataFrame is in 'long' format: each row shows data for one specific season. However, in order to more easily calculate the change in results from one season to another, we can also use the `columns` argument when creating a pivot table in order to show scores for each season side by side. (This will prove especially useful when we add additional index variables to our pivot_table() call.)

The following call is similar to our earlier pivot_table() calls except that the `season_order` and `season` values have been moved from the `index` argument to the `columns` argument. This change makes the seasons appear horizontally rather than vertically.

In [11]:
df_results_by_season_wide = df_survey_results.pivot_table(
    index = 'starting_year', columns = ['season_order', 'season'],
    values = 'score', aggfunc = 'mean').reset_index()
df_results_by_season_wide

season_order,starting_year,0,1,2
season,Unnamed: 1_level_1,Fall,Winter,Spring
0,2023,69.682251,64.16401,72.049622


Note that, because we passed two values to the `columns` parameter, two levels of column headers are now visible. However, I'd like to show just one level of headers that keeps 'starting_year' (from the top level) along with the season names (from the bottom level). We can accomplish this by first calling `to_flat_index()` to 'flatten' the columns into tuples:

In [12]:
df_results_by_season_wide.columns = (
    df_results_by_season_wide.columns.to_flat_index())
df_results_by_season_wide

Unnamed: 0,"(starting_year, )","(0, Fall)","(1, Winter)","(2, Spring)"
0,2023,69.682251,64.16401,72.049622


Next, I'll use a list comprehension to replace our tuple-based columns with string-based ones. Note that I want to keep the first entry ('starting_year') in the first tuple and the second entries (`Fall`, `Winter`, and `Spring`) in the others; this can be accomplished by adding an if/else statement to our list comprehension.

In [13]:
df_results_by_season_wide.columns = [
    column_tuple[0] if column_tuple[1] not in ['Fall', 'Winter', 'Spring'] 
    else column_tuple[1] for column_tuple in 
    df_results_by_season_wide.columns]
df_results_by_season_wide

Unnamed: 0,starting_year,Fall,Winter,Spring
0,2023,69.682251,64.16401,72.049622
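
As a possible variation on this renaming step, the list comprehension can check whether a tuple's second entry is empty instead of checking it against a list of season names. This is only a sketch on made-up data, and it assumes that the index columns produced by reset_index() have an empty string as their second entry (as in the flattened output shown earlier).

```python
import pandas as pd

# Made-up averages; only the column names mirror the survey table.
df_toy = pd.DataFrame({
    'starting_year': [2023, 2023, 2023],
    'season_order': [0, 1, 2],
    'season': ['Fall', 'Winter', 'Spring'],
    'score': [70.0, 64.0, 72.0]})

df_wide = df_toy.pivot_table(
    index='starting_year', columns=['season_order', 'season'],
    values='score', aggfunc='mean').reset_index()

# Keep the second tuple entry when it's non-empty; otherwise, fall
# back to the first entry (e.g. 'starting_year').
df_wide.columns = [
    column_tuple[1] if column_tuple[1] != '' else column_tuple[0]
    for column_tuple in df_wide.columns.to_flat_index()]
print(list(df_wide.columns))
```

This version doesn't need to know the season names in advance, though the explicit list used in this notebook makes the intent a bit more obvious.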


Now that we have our seasons next to one another, we can easily calculate changes in average scores between seasons:

In [14]:
df_results_by_season_wide['Fall-Winter Change'] = (
    df_results_by_season_wide['Winter'] 
    - df_results_by_season_wide['Fall'])

df_results_by_season_wide['Winter-Spring Change'] = (
    df_results_by_season_wide['Spring'] 
    - df_results_by_season_wide['Winter'])

df_results_by_season_wide['Fall-Spring Change'] = (
    df_results_by_season_wide['Spring'] 
    - df_results_by_season_wide['Fall'])
df_results_by_season_wide

Unnamed: 0,starting_year,Fall,Winter,Spring,Fall-Winter Change,Winter-Spring Change,Fall-Spring Change
0,2023,69.682251,64.16401,72.049622,-5.518241,7.885612,2.367371


## Adding additional pivot index values

We now know that our average NVCU student survey scores declined from the fall to the winter and then rose from the winter to the spring. Was this trend the same across colleges and levels? We can answer this question by adding our college and level fields to the `index` argument of our pivot table function.

In order to make this section more efficient, we can create a function that performs the pivot table, column renaming, and growth calculations shown for df_results_by_season_wide. This will greatly reduce the amount of code that we need to write to perform these additional analyses.

In [15]:
def create_wide_table(index_values):
    '''This function creates a wide pivot table of df_survey_results, then
    performs additional column renaming steps and growth calculations.
    
    index_values: a list of values to pass to the index argument of 
    pivot_table().'''
    
    # Pivoting the data so that each season gets its own column:
    df_wide = df_survey_results.pivot_table(
        index = index_values, columns = ['season_order', 'season'],
        values = 'score', aggfunc = 'mean').reset_index()
    # Flattening the two-level column headers into a single level:
    df_wide.columns = (
        df_wide.columns.to_flat_index())
    df_wide.columns = [
        column_tuple[0] if column_tuple[1] not in 
        ['Fall', 'Winter', 'Spring'] 
        else column_tuple[1] for column_tuple in 
        df_wide.columns]
    # Calculating changes in average scores between seasons:
    df_wide['Fall-Spring Change'] = (
        df_wide['Spring'] - df_wide['Fall'])
    df_wide['Fall-Winter Change'] = (
        df_wide['Winter'] - df_wide['Fall'])
    df_wide['Winter-Spring Change'] = (
        df_wide['Spring'] - df_wide['Winter'])
    return df_wide

### Evaluating changes in survey scores by season and college:

In [16]:
df_results_by_season_and_college_wide = create_wide_table(
    index_values = ['starting_year', 'college'])
# Saving these results to a .csv file so that they can be used within
# other parts of Python for Nonprofits:
df_results_by_season_and_college_wide.to_csv(
    'survey_results_by_college_wide.csv', index = False)
df_results_by_season_and_college_wide

Unnamed: 0,starting_year,college,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
0,2023,STB,69.797119,64.38455,67.077551,-2.719568,-5.412569,2.693001
1,2023,STC,69.568665,64.015579,66.911444,-2.657221,-5.553085,2.895865
2,2023,STL,69.596675,64.06675,76.727809,7.131134,-5.529924,12.661059
3,2023,STM,69.735685,64.160458,76.639004,6.90332,-5.575227,12.478546


The following cell creates a 'long' version of this table that displays only one season per row. We'll utilize a .csv copy of this table within PFN's graphing section. (We could also use the wide-formatted table shown above within our graphing script, but I wanted to make sure to demonstrate how to graph long-formatted data.)

Note that `season_order` is added before `season` within the `index` list in order to get Winter results to precede Spring ones; however, it's then dropped in order to keep the table more streamlined.

In [17]:
df_results_by_season_and_college_long = df_survey_results.pivot_table(
    index = ['starting_year', 'college', 'season_order', 'season'], 
    values = 'score', aggfunc = 'mean').reset_index().drop(
    'season_order', axis = 1)
df_results_by_season_and_college_long.to_csv(
    'survey_results_by_college_long.csv', index = False)
df_results_by_season_and_college_long

Unnamed: 0,starting_year,college,season,score
0,2023,STB,Fall,69.797119
1,2023,STB,Winter,64.38455
2,2023,STB,Spring,67.077551
3,2023,STC,Fall,69.568665
4,2023,STC,Winter,64.015579
5,2023,STC,Spring,66.911444
6,2023,STL,Fall,69.596675
7,2023,STL,Winter,64.06675
8,2023,STL,Spring,76.727809
9,2023,STM,Fall,69.735685


Although university-wide survey results grew from the fall to the spring, this table shows that results for two colleges (STB and STC) actually *dropped* over that time period. (Their average spring survey scores were also markedly lower than STL's and STM's.) It also demonstrates that fall-to-winter scores dropped for all colleges and that every college saw an increase in scores during the winter-to-spring period. 

### Evaluating changes in survey scores by season and level:

We'll pivot the data by `level_for_sorting` and *then* `level` so as to order the rows from youngest to oldest.

In [18]:
df_results_by_season_and_level_wide = create_wide_table(
    index_values = ['starting_year', 'level_for_sorting', 'level'])
df_results_by_season_and_level_wide.drop([
    'starting_year', 'level_for_sorting'], axis = 1) # I chose to drop
# these columns so that none of the more important ones would get 
# cut off by my notebook's 8-column limit

Unnamed: 0,level,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
0,Fr,69.609774,64.050715,74.83318,5.223406,-5.559059,10.782465
1,So,69.688672,64.160635,69.495124,-0.193548,-5.528037,5.334488
2,Ju,69.768957,64.212553,69.350671,-0.418286,-5.556404,5.138118
3,Se,69.698085,64.300142,73.546671,3.848586,-5.397943,9.246529


This table shows that scores for sophomores and juniors did not change much from the fall to the spring, but scores for freshmen and seniors increased. All levels showed a fall-to-winter drop followed by a winter-to-spring rise.

### Evaluating changes in survey scores by season, college, *and* level:

(I originally named the following DataFrame `df_results_by_season_level_and_college_wide`, but since that's a rather long name and we'll use this DataFrame quite a bit within this section, I abbreviated the index values as 'slc'.)

In [19]:
df_results_slc = create_wide_table(
    index_values = ['starting_year', 'college', 
                    'level_for_sorting', 'level'])
df_results_slc.to_csv('survey_results_slc_wide.csv', index = False)
df_results_slc.head()

Unnamed: 0,starting_year,college,level_for_sorting,level,...,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
0,2023,STB,0,Fr,...,69.177235,-0.416348,-5.711283,5.294934
1,2023,STB,1,So,...,65.377306,-5.0,-5.163252,0.163252
2,2023,STB,2,Ju,...,64.950769,-5.0,-5.493452,0.493452
3,2023,STB,3,Se,...,68.543287,-0.611041,-5.158683,4.547641
4,2023,STC,0,Fr,...,69.079348,-0.519565,-5.62741,5.107845


The following cell creates a 'long' version of this table that can be incorporated into PFN's graphing section.

In [20]:
df_survey_results_slc_long = df_survey_results.pivot_table(
    index = ['starting_year', 'college', 
             'level_for_sorting', 'level', 'season'],
    values = 'score', aggfunc = 'mean').reset_index()
df_survey_results_slc_long.to_csv(
    'survey_results_slc_long.csv', index = False)
df_survey_results_slc_long.head()

Unnamed: 0,starting_year,college,level_for_sorting,level,season,score
0,2023,STB,0,Fr,Fall,69.593583
1,2023,STB,0,Fr,Spring,69.177235
2,2023,STB,0,Fr,Winter,63.8823
3,2023,STB,1,So,Fall,70.377306
4,2023,STB,1,So,Spring,65.377306


## Comparing rows via sort_values() and rank()

Which college/level pairs had the highest and lowest spring survey results? We could examine `df_results_slc` line by line to answer this question; however, two Pandas functions--sort_values() and rank()--can make it easier to compare survey outcomes by college and level.

First, here are the five college/level pairs with the highest average spring results:

In [21]:
df_results_slc.sort_values(
    'Spring', ascending = False).drop(
    ['starting_year', 'level_for_sorting'], axis = 1).head()

Unnamed: 0,college,level,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
15,STM,Se,70.05859,64.376009,79.309831,9.251241,-5.682581,14.933822
8,STL,Fr,69.58503,64.100231,79.111622,9.526592,-5.484798,15.01139
12,STM,Fr,69.650503,64.179469,78.488468,8.837966,-5.471033,14.308999
11,STL,Se,69.026639,63.744731,77.915984,8.889344,-5.281909,14.171253
10,STL,Ju,69.878706,64.56651,74.634771,4.756065,-5.312196,10.068261


And here are the five pairs with the *lowest* spring results:

In [22]:
df_results_slc.sort_values(
    'Spring').drop(
    ['starting_year', 'level_for_sorting'], axis = 1).head()

Unnamed: 0,college,level,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
6,STC,Ju,69.180932,63.448187,64.180932,-5.0,-5.732745,0.732745
5,STC,So,69.331325,63.803725,64.331325,-5.0,-5.5276,0.5276
2,STB,Ju,69.950769,64.457317,64.950769,-5.0,-5.493452,0.493452
1,STB,So,70.377306,65.214054,65.377306,-5.0,-5.163252,0.163252
3,STB,Se,69.154329,63.995646,68.543287,-0.611041,-5.158683,4.547641


Note that calling sort_values() here did not change the order of `df_results_slc` itself. By default, sort_values() returns a new, sorted DataFrame and leaves the original untouched; because we didn't assign that result to a variable, it was displayed and then discarded. The following cell demonstrates this:

In [23]:
df_results_slc.head() # Note that the DataFrame
# is once again sorted by college and level

Unnamed: 0,starting_year,college,level_for_sorting,level,...,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
0,2023,STB,0,Fr,...,69.177235,-0.416348,-5.711283,5.294934
1,2023,STB,1,So,...,65.377306,-5.0,-5.163252,0.163252
2,2023,STB,2,Ju,...,64.950769,-5.0,-5.493452,0.493452
3,2023,STB,3,Se,...,68.543287,-0.611041,-5.158683,4.547641
4,2023,STC,0,Fr,...,69.079348,-0.519565,-5.62741,5.107845


This behavior, which is seen in many other Pandas functions, is actually quite helpful: it allows you to test out changes and modifications without making them permanent (which, if you make a mistake, could force you to restart your script).

It's also worth mentioning that none of these changes are affecting the underlying .csv file from which we retrieved our data. That file will only get modified if we use to_csv() to save our table to that same filename.

To make a sort persistent, you can use one of the following two lines:

In [24]:
# First option (my preference because it often requires fewer characters):
df_results_slc.sort_values(
    'Spring', inplace=True)

# An alternative option (which can come in handy when making multiple
# changes to a dataset at once):
df_results_slc = (
    df_results_slc.sort_values('Spring')).copy()


# Make sure NOT to add 'inplace = True' as an argument when using the 
# second method, as your DataFrame will then get replaced with None! 
# For an explanation of None, see: 
# https://docs.python.org/3/library/constants.html#None

df_results_slc.head()

Unnamed: 0,starting_year,college,level_for_sorting,level,...,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
6,2023,STC,2,Ju,...,64.180932,-5.0,-5.732745,0.732745
5,2023,STC,1,So,...,64.331325,-5.0,-5.5276,0.5276
2,2023,STB,2,Ju,...,64.950769,-5.0,-5.493452,0.493452
1,2023,STB,1,So,...,65.377306,-5.0,-5.163252,0.163252
3,2023,STB,3,Se,...,68.543287,-0.611041,-5.158683,4.547641
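
The warning about combining `inplace=True` with an assignment can be checked on a tiny made-up DataFrame. In this sketch (whose values are hypothetical), `result` ends up as None even though `df_toy` itself gets sorted.

```python
import pandas as pd

# Hypothetical spring averages:
df_toy = pd.DataFrame({'Spring': [72.0, 64.0, 79.0]})

# Without inplace=True, sort_values() returns a sorted copy and
# leaves df_toy in its original order:
sorted_copy = df_toy.sort_values('Spring')
order_after_plain_sort = df_toy['Spring'].tolist()

# With inplace=True, df_toy itself gets sorted, and the call
# returns None:
result = df_toy.sort_values('Spring', inplace=True)
print(result)
```

If we had written `df_toy = df_toy.sort_values('Spring', inplace=True)`, `df_toy` would therefore have been overwritten with None.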


## Calculating percentiles and ranks

Ranks and percentiles are alternative ways to evaluate values relative to their peers. Let's say that the NVCU administration would like you to calculate both the rank *and* the percentile of each college/level pair's average spring score. However, they'd also like you to round the spring results to integers before making these calculations so that pairs with similar scores will get treated equally.

First, we'll create a new condensed DataFrame that can store these integer-based results, ranks, and percentiles. We'll then assign ranks to each integer.

In [25]:
# Creating a condensed DataFrame:
df_spring_ranks = df_results_slc.copy()[[
    'starting_year', 'college', 
    'level_for_sorting', 'level', 'Spring']].sort_values(
    'Spring', ascending = False)
# Converting average spring results to integers. (Note that 
# astype('int') truncates decimals rather than rounding them.)
df_spring_ranks['Spring'] = df_spring_ranks['Spring'].astype('int')

# Calculating our ranks:
# Note: the inclusion of "method = 'min'" ensures that, in the case of 
# ties, each tied row will show the lowest (i.e. best) possible rank. 
# This is the ranking convention that I'm more familiar with, but Pandas 
# allows for other methods also. ascending = False assigns the best ranks 
# to the highest results.
# See the df.rank() documentation for more details:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.
# DataFrame.rank.html

df_spring_ranks['Spring_Rank'] = df_spring_ranks[
    'Spring'].rank(ascending = False, method = 'min')
df_spring_ranks

Unnamed: 0,starting_year,college,level_for_sorting,level,Spring,Spring_Rank
15,2023,STM,3,Se,79,1.0
8,2023,STL,0,Fr,79,1.0
12,2023,STM,0,Fr,78,3.0
11,2023,STL,3,Se,77,4.0
10,2023,STL,2,Ju,74,5.0
14,2023,STM,2,Ju,74,5.0
9,2023,STL,1,So,74,5.0
13,2023,STM,1,So,73,8.0
7,2023,STC,3,Se,69,9.0
0,2023,STB,0,Fr,69,9.0


Our code for calculating percentiles will also use `df.rank()`; we can instruct that function to display its output as percentiles by adding the argument `pct=True`. We'll also add (1) `ascending=True` so that the highest scores will get the highest percentiles and (2) `method='max'` so that, in the case of ties, the highest possible percentile will get displayed.

Note that, while the highest percentile in the following output is 100, the lowest percentile is not 0. I believe this is because Pandas calculates percentiles as the percentage of results *equal to or lower than* the current result. Therefore, even the lowest row won't get a percentile of 0 during percentile calculations, as it will at least be equal to itself. 

In [26]:
df_spring_ranks['Spring_Percentile'] = (100 
* df_spring_ranks['Spring'].rank(
    ascending=True, pct=True, method='max'))
df_spring_ranks.head()

Unnamed: 0,starting_year,college,level_for_sorting,level,Spring,Spring_Rank,Spring_Percentile
15,2023,STM,3,Se,79,1.0,100.0
8,2023,STL,0,Fr,79,1.0,100.0
12,2023,STM,0,Fr,78,3.0,87.5
11,2023,STL,3,Se,77,4.0,81.25
10,2023,STL,2,Ju,74,5.0,75.0


Although our DataFrame was sorted by Spring results, df.rank() would still have successfully calculated ranks and percentiles regardless of how the DataFrame happened to be sorted.
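
As a rough check on the percentile explanation above, this sketch compares the output of `rank(pct=True, method='max')` with a manual 'share of values equal to or lower than this one' calculation. The four scores are made up, and the `manual` calculation is my own cross-check rather than anything Pandas runs internally.

```python
import pandas as pd

# Hypothetical scores with one tie:
scores = pd.Series([50, 60, 60, 90])

percentiles = 100 * scores.rank(
    ascending=True, pct=True, method='max')

# Manual cross-check: the percentage of values <= each score.
manual = 100 * scores.apply(
    lambda score: (scores <= score).sum()) / len(scores)
print(percentiles.tolist(), manual.tolist())
```

Note that the lowest score receives a percentile of 25 (one out of four values) rather than 0.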

## Calculating average results by student (and dealing with missing values)

Suppose that the NVCU administration wishes to see what percentage of students had a weighted average annual survey score below 60. Because they are more interested in students' most recent survey results, they would like you to assign a weight of 0.2 to the fall results; 0.3 to the winter results; and 0.5 to the spring results. (Thus, students' weighted survey averages will equal 0.2\*F + 0.3\*W + 0.5\*S, with F, W, and S referring to students' fall, winter, and spring results, respectively.)

We'll begin this analysis by calling create_wide_table() to produce a table that shows fall, winter, and spring scores side by side for each student:

In [27]:
df_student_results_wide = create_wide_table(
    index_values=['starting_year', 'student_id'])
df_student_results_wide.head()

Unnamed: 0,starting_year,student_id,Fall,Winter,Spring,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change
0,2023,2020-1,88.0,81.0,86.0,-2.0,-7.0,5.0
1,2023,2020-10,69.0,63.0,73.0,4.0,-6.0,10.0
2,2023,2020-100,68.0,,88.0,20.0,,
3,2023,2020-1000,58.0,,65.0,7.0,,
4,2023,2020-1001,88.0,84.0,100.0,12.0,-4.0,16.0


Notice that some winter results have NaN (not a number) values. We can count the number of NaN results for each column via `df.isna().sum()`:

In [28]:
df_student_results_wide.isna().sum()

starting_year              0
student_id                 0
Fall                       0
Winter                  2458
Spring                     0
Fall-Spring Change         0
Fall-Winter Change      2458
Winter-Spring Change    2458
dtype: int64

These missing results will make our weighted average calculations a bit more complicated. For instance, suppose we tried to create our weighted averages using the following code:

In [29]:
df_student_results_wide['weighted_avg_score'] = (
    df_student_results_wide['Fall'] * 0.2 
    + df_student_results_wide['Winter'] * 0.3 
    + df_student_results_wide['Spring'] * 0.5)
df_student_results_wide.head()

Unnamed: 0,starting_year,student_id,Fall,Winter,...,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change,weighted_avg_score
0,2023,2020-1,88.0,81.0,...,-2.0,-7.0,5.0,84.9
1,2023,2020-10,69.0,63.0,...,4.0,-6.0,10.0,69.2
2,2023,2020-100,68.0,,...,20.0,,,
3,2023,2020-1000,58.0,,...,7.0,,,
4,2023,2020-1001,88.0,84.0,...,12.0,-4.0,16.0,92.8


This code works fine for students with valid scores for all 3 seasons, but those with a NaN winter value will also end up with a NaN average score.

We can get rid of the NaN results by calling `fillna()` to replace all NaN values with 0, as shown below. However, **note that this approach will result in inaccurately low average score values for students with missing winter results.** This is because our valid score weights for these students (0.2 for Fall and 0.5 for Spring) add up to only 0.7; in other words, these scores are 30% lower than they should be.

In [30]:
df_student_results_wide['weighted_avg_score'] = (
    df_student_results_wide['Fall'].fillna(0) * 0.2 
    + df_student_results_wide['Winter'].fillna(0) * 0.3 
    + df_student_results_wide['Spring'].fillna(0) * 0.5)
df_student_results_wide.query("Winter.isna()").head()

Unnamed: 0,starting_year,student_id,Fall,Winter,...,Fall-Spring Change,Fall-Winter Change,Winter-Spring Change,weighted_avg_score
2,2023,2020-100,68.0,,...,20.0,,,57.6
3,2023,2020-1000,58.0,,...,7.0,,,44.1
8,2023,2020-1005,70.0,,...,0.0,,,49.0
9,2023,2020-1006,63.0,,...,17.0,,,52.6
15,2023,2020-1011,85.0,,...,15.0,,,67.0


Here's a better approach that, while a bit more complex, successfully adjusts for missing values. First, we'll create a weight column for each season that displays either our predetermined weight (if a student has a valid score for that season) or 0 (if the student does not). We'll also create a column that adds all of these weights together.

In [31]:
season_weight_dict = {'Fall':0.2,'Winter':0.3,'Spring':0.5}
# Using a for loop to create these columns makes our code a bit more 
# concise.
for season in ['Fall', 'Winter', 'Spring']:
    df_student_results_wide[season+'_weight'] = np.where(
        df_student_results_wide[season].isna(), 0, 
        season_weight_dict[season])

# adding axis=1 as an argument to df.sum() ensures that the calculations
# will be made row-wise rather than column-wise.
df_student_results_wide['weight_sum'] = df_student_results_wide[[
    'Fall_weight', 'Winter_weight', 'Spring_weight']].sum(axis=1)

df_student_results_wide.head()

Unnamed: 0,starting_year,student_id,Fall,Winter,...,Fall_weight,Winter_weight,Spring_weight,weight_sum
0,2023,2020-1,88.0,81.0,...,0.2,0.3,0.5,1.0
1,2023,2020-10,69.0,63.0,...,0.2,0.3,0.5,1.0
2,2023,2020-100,68.0,,...,0.2,0.0,0.5,0.7
3,2023,2020-1000,58.0,,...,0.2,0.0,0.5,0.7
4,2023,2020-1001,88.0,84.0,...,0.2,0.3,0.5,1.0


We can now accurately calculate average scores for all students by (1) multiplying each score by its corresponding weight value (which will be 0 in the case of missing scores); (2) adding these products together; and then (3) dividing the sum by the `weight_sum` column. If a student happens to have a missing value for a given season, the `weight_sum` value will be lower to compensate for this omission. 

In [32]:
df_student_results_wide['weighted_avg_score'] = (
    df_student_results_wide['Fall'].fillna(0) 
    * df_student_results_wide['Fall_weight']
    + df_student_results_wide['Winter'].fillna(0) 
    * df_student_results_wide['Winter_weight']
    + df_student_results_wide['Spring'].fillna(0)
    * df_student_results_wide['Spring_weight']) / (
        df_student_results_wide['weight_sum'])

df_student_results_wide.query("Winter.isna()").head()

Unnamed: 0,starting_year,student_id,Fall,Winter,...,Fall_weight,Winter_weight,Spring_weight,weight_sum
2,2023,2020-100,68.0,,...,0.2,0.0,0.5,0.7
3,2023,2020-1000,58.0,,...,0.2,0.0,0.5,0.7
8,2023,2020-1005,70.0,,...,0.2,0.0,0.5,0.7
9,2023,2020-1006,63.0,,...,0.2,0.0,0.5,0.7
15,2023,2020-1011,85.0,,...,0.2,0.0,0.5,0.7


A few notes:

1. This approach would also successfully compensate for students with missing fall or spring scores. It would only fail to work for students who had no survey results at all--but such students should be excluded from these calculations to begin with.

2. For students with missing winter scores, this code effectively uses a fall score weight of 0.2/0.7 (28.6%) and a spring score weight of 0.5/0.7 (71.4%). These weights are higher in order to compensate for the missing winter results.
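
For reference, the same missing-value adjustment can also be sketched in a more compact, vectorized form. This is an alternative to the column-by-column approach shown in this notebook rather than a replacement for it; the two rows below borrow the scores of students 2020-1 and 2020-100 from the earlier output, and the `weights` Series is my own way of storing the 0.2/0.3/0.5 weights.

```python
import numpy as np
import pandas as pd

# Scores for two students; the second has no winter result.
df_toy = pd.DataFrame({
    'Fall': [88.0, 68.0],
    'Winter': [81.0, np.nan],
    'Spring': [86.0, 88.0]})
weights = pd.Series({'Fall': 0.2, 'Winter': 0.3, 'Spring': 0.5})

scores = df_toy[['Fall', 'Winter', 'Spring']]
# Multiplying each column by its weight; sum() skips NaN products
# by default, so missing scores contribute 0 to the total.
weighted_sum = scores.mul(weights, axis=1).sum(axis=1)
# Adding up only the weights that correspond to non-missing scores:
weight_sum = scores.notna().astype(float).mul(
    weights, axis=1).sum(axis=1)
df_toy['weighted_avg_score'] = weighted_sum / weight_sum
print(df_toy)
```

The first student's result should match the 84.9 shown earlier, while the second student's result should equal 57.6 / 0.7 (roughly 82.3) rather than the deflated 57.6 produced by the fillna(0)-only approach.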

Now that we've calculated weighted average results for all students, we can answer the administrators' original question (what percentage of students had a weighted average survey score below 60):

In [33]:
# Using np.where() to flag students with a weighted average below 60:
# (See https://numpy.org/doc/stable/reference/generated/numpy.where.html 
# for more information about np.where().)
df_student_results_wide['weighted_avg_below_60'] = np.where(
    df_student_results_wide['weighted_avg_score'] < 60, 1, 0)

# Calling value_counts(normalize=True), then multiplying the results 
# by 100, allows us to calculate the percentage of students with weighted 
# averages below 60.
100*df_student_results_wide['weighted_avg_below_60'].value_counts(
    normalize=True)

weighted_avg_below_60
0    81.304932
1    18.695068
Name: proportion, dtype: float64

It turns out that around 18.7% of students had a weighted average survey score below 60.
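
A shorter way to compute this kind of percentage, sketched here on four made-up averages, is to take the mean of a Boolean comparison: because True counts as 1 and False as 0, the mean equals the share of True values.

```python
import pandas as pd

# Hypothetical weighted averages for four students:
weighted_avgs = pd.Series([50.0, 70.0, 80.0, 55.0])

# Two of the four values are below 60, so this should equal 50.0:
pct_below_60 = 100 * (weighted_avgs < 60).mean()
print(pct_below_60)
```

The value_counts() approach used above has the advantage of showing both categories at once, so which one to use is largely a matter of preference.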

This notebook has provided a brief introduction to descriptive statistics calculations within Python. PFN's graphing section will teach you how to convert some of the pivot tables created here into line and bar charts, thus making this data easier to interpret.