# Descriptive Stats (Work in progress)

By Kenneth Burchfiel

Released under the MIT license

This notebook demonstrates how to calculate descriptive statistics in Python using the Pandas library. One benefit of performing these calculations in Python (rather than Excel, Google Sheets, or another spreadsheet program) is that, once you have scripted these tasks, you can quickly rerun them whenever the original data gets updated.\* You can even have your computer run the script on a daily or hourly basis, freeing up time you would otherwise spend on busywork for more interesting tasks.

For example, suppose leaders at NVCU would like to know, on a daily basis, how spring survey results differ from fall and winter ones. (These differences could change each day as new spring survey data gets released.) One way to accomplish this task would be to retrieve survey data from your database each day, paste it into Excel or Google Sheets, pivot the data, and then share the output. However, you could also accomplish these same steps in Python. While this would likely take you longer the first time around, you could then create updated analyses of your data in mere seconds. This notebook will show you how!

\* There are certainly ways to automate Excel tasks as well (e.g. using Visual Basic). I don't have any experience with Visual Basic, so I'm not the best person to compare these two tools; however, I have no doubt that learning it would take some time, and given Python's versatility and power, I would recommend applying that time to learning Python instead. (You can get an estimate of the world's interest in Python versus Visual Basic by checking out the [TIOBE index](https://www.tiobe.com/tiobe-index/).)

In [1]:
import pandas as pd
import numpy as np

We'll first import our combined set of fall, winter, and spring student survey results; these results were created within `data_prep.ipynb`. (Note that these results also include college and level data from our `curr_enrollment` SQL table; that way, we can evaluate average results by level and college.)

In [3]:
df_survey_results = pd.read_csv('../Data_Prep/2023_survey_results.csv')
df_survey_results.head()

Unnamed: 0,student_id,starting_year,season,score,season_order,college,level,level_for_sorting
0,2020-1,2023,Fall,88,0,STC,Fr,0
1,2020-2,2023,Fall,37,0,STM,Fr,0
2,2020-3,2023,Fall,54,0,STC,Fr,0
3,2020-4,2023,Fall,56,0,STC,Fr,0
4,2020-5,2023,Fall,77,0,STM,Fr,0
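
Because the results include college and level columns, averages by those fields can be sketched with the same `pivot_table()` approach used later in this notebook. The example below is only an illustration: it retypes the five preview rows shown above into a small DataFrame (the names `df_preview` and `df_means` are placeholders introduced here, not part of the original notebook), so its averages describe those five rows rather than the full dataset.

```python
import pandas as pd

# The five preview rows shown above, retyped by hand for illustration only
df_preview = pd.DataFrame({
    'student_id': ['2020-1', '2020-2', '2020-3', '2020-4', '2020-5'],
    'season': ['Fall'] * 5,
    'score': [88, 37, 54, 56, 77],
    'college': ['STC', 'STM', 'STC', 'STC', 'STM'],
    'level': ['Fr'] * 5})

# Average score for each college/level pair in the preview rows
df_means = df_preview.pivot_table(
    index=['college', 'level'],
    values='score', aggfunc='mean').reset_index()
print(df_means)
```

With the full `df_survey_results` table, the same call should work unchanged, assuming the column names match those in the preview.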


## Evaluating changes in university-wide survey results over time

In [4]:
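# Including season_order in the index sorts the seasons chronologically
# (Fall, Winter, Spring) rather than alphabetically.
df_survey_results.pivot_table(
    index = ['starting_year', 'season_order', 'season'],
    values = 'score', aggfunc = 'mean').reset_index()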

Unnamed: 0,starting_year,season_order,season,score
0,2023,0,Fall,69.682251
1,2023,1,Winter,64.199483
2,2023,2,Spring,72.049622
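
To answer the question posed in the introduction (how spring results differ from fall and winter ones), the seasonal means can be subtracted from one another. The following is a minimal sketch, not necessarily how the rest of this notebook handles the comparison: it copies the three means from the output above into a standalone DataFrame, and the names `df_season_means`, `scores`, and `spring_vs_earlier` are placeholders introduced for this example.

```python
import pandas as pd

# Seasonal means copied from the pivot table output above
df_season_means = pd.DataFrame({
    'season': ['Fall', 'Winter', 'Spring'],
    'score': [69.682251, 64.199483, 72.049622]})

# Index the scores by season so that they can be looked up by name
scores = df_season_means.set_index('season')['score']

# Spring's mean minus each earlier season's mean, rounded to 2 decimals
spring_vs_earlier = (scores['Spring'] - scores[['Fall', 'Winter']]).round(2)
print(spring_vs_earlier)
```

In this sketch, spring's average comes out about 2.37 points above fall's and about 7.85 points above winter's. In practice, you would likely build `scores` from the pivot table itself so that the differences update whenever new survey data arrives.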
