# Exam Data Tutorial
The goal of this tutorial is to introduce some useful functions and show how to do typical tasks when working with quantitative educational data. It assumes you already have basic knowledge of Python and pandas, and that you are familiar with method chaining (e.g. `myobject.function1().function2()`) as a way to reduce intermediate steps and make your code more readable.

In this lesson, you will learn the following:
* How to sort and view data
* How to compute values from multiple columns
* How to compute values across a row
* How to break results down by some grouping variable
* How to extract a column from a data frame for use elsewhere

## Setting up
This tutorial uses the `exam_data.csv` file. This file contains simulated exam scores for a course with three exams and is similar to what you might see if you were using the eLC gradebook for a project. Scores are between 0 and 100, but you'll notice most students did well on these exams. There is also a generic binary `group` variable that could represent something like female, underrepresented minority, or first-generation status.

`pandas` is the package that handles data frames in Python, so we are going to load that first. We can then use the `read_csv` function to import the `exam_data.csv` file as a data frame in Python.

In [2]:
import pandas as pd

In [4]:
exam_data = pd.read_csv("exam_data.csv")

## Familiarizing ourselves with the data
Before doing any analysis, it's useful to get a sense of what the data looks like. There are a variety of different things we can check. Let's walk through them.

First, let's look at the first few rows of the data frame to see what our data looks like. We can do this either with slicing or with `.head()`. Let's look at the first six rows. Remember that Python indexing starts at 0 and not 1, and that a slice excludes its end point, so `0:6` returns rows 0 through 5.

You'll notice that both examples return the same rows. You can change the numbers to view more or fewer rows.

In [75]:
exam_data[0:6]

Unnamed: 0,id,group,exam1,exam2,exam3,avg_grade
0,1001,group1,90,95,97,94.0
1,1002,group2,77,80,81,79.333333
2,1003,group2,80,84,86,83.333333
3,1004,group2,89,93,95,92.333333
4,1005,group1,89,94,98,93.666667
5,1006,group1,84,88,91,87.666667


In [77]:
exam_data.head(6)

Unnamed: 0,id,group,exam1,exam2,exam3,avg_grade
0,1001,group1,90,95,97,94.0
1,1002,group2,77,80,81,79.333333
2,1003,group2,80,84,86,83.333333
3,1004,group2,89,93,95,92.333333
4,1005,group1,89,94,98,93.666667
5,1006,group1,84,88,91,87.666667
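Beyond looking at the rows themselves, a few other quick checks can help describe the structure of a data frame. The sketch below is one possible set of checks, not an exhaustive list. It uses a small made-up data frame called `toy_data` (a hypothetical name with illustrative values, not the contents of `exam_data.csv`) so that it runs on its own.

```python
import pandas as pd

# Small made-up data frame with the same column layout as exam_data.csv
# (the name and values are illustrative only)
toy_data = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004],
    "group": ["group1", "group2", "group2", "group1"],
    "exam1": [90, 77, 80, 89],
    "exam2": [95, 80, 84, 93],
    "exam3": [97, 81, 86, 95],
})

# Number of rows and columns, returned as a (rows, columns) tuple
n_rows, n_cols = toy_data.shape

# Column names as a plain Python list
column_names = toy_data.columns.tolist()

# Number of missing values in each column
missing_counts = toy_data.isna().sum()

# Number of students in each group
group_counts = toy_data["group"].value_counts()

print(n_rows, n_cols)
print(column_names)
print(missing_counts)
print(group_counts)
```

The same calls should work on `exam_data` by swapping it in for `toy_data`.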


Another useful thing to do is sort the data frame by some column, which we can do with `.sort_values()`. Let's sort by exam 1 scores first, and then by group. Remember that `.sort_values()` returns a new, sorted data frame and leaves the original unchanged, so you need to assign the result to a variable if you want to keep it. Alternatively, you can set the `inplace` parameter of `.sort_values()` to `True` to modify the original data frame, though this is generally not recommended.

In [86]:
# sorting by exam1 scores
exam_data.sort_values(by = 'exam1')

Unnamed: 0,id,group,exam1,exam2,exam3,avg_grade
191,1192,group2,69,74,75,72.666667
98,1099,group1,70,74,76,73.333333
187,1188,group1,72,80,82,78.000000
150,1151,group2,72,77,78,75.666667
17,1018,group2,72,75,78,75.000000
...,...,...,...,...,...,...
58,1059,group2,94,98,100,97.333333
42,1043,group1,94,99,100,97.666667
161,1162,group1,95,100,100,98.333333
170,1171,group1,96,100,100,98.666667


In [84]:
# sorting by group and then saving the sorted data frame as a new variable
exam_data_sorted = exam_data.sort_values(by='group')
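Two common variations on sorting are reversing the order and sorting by more than one column. The sketch below illustrates both with the `ascending` parameter and a list passed to `by`. It also confirms that the original data frame keeps its row order. It uses a small made-up data frame (`toy_data`, a hypothetical name with illustrative values) rather than the real exam data.

```python
import pandas as pd

# Made-up data for illustration only
toy_data = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004],
    "group": ["group2", "group1", "group2", "group1"],
    "exam1": [80, 77, 90, 89],
})

# Highest exam1 scores first
highest_first = toy_data.sort_values(by="exam1", ascending=False)

# Sort by group, then by exam1 score within each group
by_group_then_score = toy_data.sort_values(by=["group", "exam1"])

# The original data frame keeps its original row order
print(toy_data["id"].tolist())             # [1001, 1002, 1003, 1004]
print(highest_first["id"].tolist())        # [1003, 1004, 1001, 1002]
print(by_group_then_score["id"].tolist())  # [1002, 1004, 1001, 1003]
```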

## Calculating values from a column
For numeric data, we are often interested in a summary statistic like the mean, median, or standard deviation to describe some type of behavior. For example, what was the average score on a test?

This is pretty straightforward to do in Python. Let's do an example with the mean.

The basic idea is to select the columns we want to use and then chain the type of statistic we want to calculate, such as `.mean()`, `.median()`, or `.std()`. If we want to calculate the mean score of each exam, we can do the following.

In [9]:
exam_data_summary = exam_data[['exam1', 'exam2', 'exam3']].mean()
exam_data_summary

exam1    83.620
exam2    88.035
exam3    90.270
dtype: float64

Notice here I save the means as a separate variable so I can easily get them back. I can also directly print to console if I don't need to save them.

In [100]:
exam_data[['exam1', 'exam2', 'exam3']].mean()

exam1    83.620
exam2    88.035
exam3    90.270
dtype: float64

I can do the same thing with other functions too, like finding the standard deviation (the spread) of the scores.

In [103]:
exam_data[['exam1', 'exam2', 'exam3']].std()

exam1    5.088720
exam2    5.301349
exam3    5.324363
dtype: float64
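The median works the same way as the mean and standard deviation. If I only want a specific handful of statistics side by side, one option is to pass a list of function names to `.agg()`. The sketch below assumes a small made-up data frame (`toy_data`, a hypothetical name with illustrative values) so it can run without the CSV file.

```python
import pandas as pd

# Made-up scores for illustration only
toy_data = pd.DataFrame({
    "exam1": [90, 77, 80, 89],
    "exam2": [95, 80, 84, 93],
})

# Median of each exam column
exam_medians = toy_data[["exam1", "exam2"]].median()

# A chosen set of statistics, one row per statistic and one column per exam
chosen_stats = toy_data[["exam1", "exam2"]].agg(["mean", "median", "std"])

print(exam_medians)
print(chosen_stats)
```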

If I want to get multiple common statistics at the same time, I can use `.describe()`.

In [108]:
exam_data[['exam1', 'exam2', 'exam3']].describe()

Unnamed: 0,exam1,exam2,exam3
count,200.0,200.0,200.0
mean,83.62,88.035,90.27
std,5.08872,5.301349,5.324363
min,69.0,74.0,75.0
25%,80.0,84.0,87.0
50%,83.5,88.0,90.0
75%,87.0,92.0,94.0
max,96.0,100.0,100.0


You'll notice I get the same values as before but also much more information!

You can select the columns you want, or you can use the entire data frame, and `.describe()` will skip the non-numeric columns by default.

In [111]:
exam_data.describe()

Unnamed: 0,id,exam1,exam2,exam3,avg_grade
count,200.0,200.0,200.0,200.0,200.0
mean,1100.5,83.62,88.035,90.27,87.308333
std,57.879185,5.08872,5.301349,5.324363,5.192115
min,1001.0,69.0,74.0,75.0,72.666667
25%,1050.75,80.0,84.0,87.0,84.25
50%,1100.5,83.5,88.0,90.0,87.333333
75%,1150.25,87.0,92.0,94.0,91.0
max,1200.0,96.0,100.0,100.0,98.666667
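If the non-numeric columns are of interest too, `.describe()` accepts an `include` parameter. The sketch below compares the default behavior with `include="all"` on a small made-up data frame (`toy_data`, a hypothetical name with illustrative values). For a text column like `group`, pandas reports statistics such as the number of unique values rather than a mean.

```python
import pandas as pd

# Made-up data for illustration only
toy_data = pd.DataFrame({
    "group": ["group1", "group2", "group2", "group1"],
    "exam1": [90, 77, 80, 89],
})

# Default: only numeric columns are summarized
numeric_summary = toy_data.describe()

# include="all": every column is summarized; statistics that do not
# apply to a column (e.g. the mean of a text column) show up as NaN
full_summary = toy_data.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
print(full_summary.loc["unique", "group"])
```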


In [13]:
# Average each student's three exam scores; axis=1 takes the mean across each row
exam_data['avg_grade'] = exam_data[['exam1', 'exam2', 'exam3']].mean(axis=1)
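The `axis` argument controls the direction of the calculation: `axis=0` (the default) gives one value per column, while `axis=1` gives one value per row. The sketch below contrasts the two on a two-student made-up data frame (`toy_data` and `exam_columns` are hypothetical names used for illustration).

```python
import pandas as pd

# Made-up scores for two students, for illustration only
toy_data = pd.DataFrame({
    "id": [1001, 1002],
    "exam1": [90, 77],
    "exam2": [95, 80],
    "exam3": [97, 81],
})
exam_columns = ["exam1", "exam2", "exam3"]

# axis=0 (the default): one mean per column, i.e. per exam
mean_per_exam = toy_data[exam_columns].mean(axis=0)

# axis=1: one mean per row, i.e. per student
toy_data["avg_grade"] = toy_data[exam_columns].mean(axis=1)

print(mean_per_exam)
print(toy_data)
```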

In [23]:
# Mean of each exam, broken down by group (one output column per name=(column, function) pair)
exam_data_group_summary = exam_data.groupby('group').agg(
    exam1_mean=('exam1', 'mean'),
    exam2_mean=('exam2', 'mean'),
    exam3_mean=('exam3', 'mean')
).reset_index()

exam_data_group_summary

Unnamed: 0,group,exam1_mean,exam2_mean,exam3_mean
0,group1,85.15,90.09,92.54
1,group2,82.09,85.98,88.0


In [25]:
# Group the data by 'group'
grouped_data = exam_data.groupby('group')

# Calculate the mean for each exam column
exam1_mean = grouped_data['exam1'].mean()
exam2_mean = grouped_data['exam2'].mean()
exam3_mean = grouped_data['exam3'].mean()

# Create a new DataFrame to store the results
exam_data_group_summary_2 = pd.DataFrame({
    'group': exam1_mean.index,
    'exam1_mean': exam1_mean.values,
    'exam2_mean': exam2_mean.values,
    'exam3_mean': exam3_mean.values
})

exam_data_group_summary_2

Unnamed: 0,group,exam1_mean,exam2_mean,exam3_mean
0,group1,85.15,90.09,92.54
1,group2,82.09,85.98,88.0
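Both approaches above produce the same table. When every column gets the same statistic, a third and shorter option is to select the columns after grouping and call `.mean()` once. The sketch below is an alternative, not one of the two methods shown above. It checks that the result matches the named-aggregation version on a small made-up data frame (`toy_data`, a hypothetical name with illustrative values).

```python
import pandas as pd

# Made-up data for illustration only
toy_data = pd.DataFrame({
    "group": ["group1", "group2", "group2", "group1"],
    "exam1": [90, 77, 80, 89],
    "exam2": [95, 80, 84, 93],
})

# Named aggregation, as in the chained example
summary_agg = toy_data.groupby("group").agg(
    exam1_mean=("exam1", "mean"),
    exam2_mean=("exam2", "mean"),
).reset_index()

# Shorter alternative: select the exam columns, take the mean once,
# then add the "_mean" suffix to the resulting column names
summary_short = (
    toy_data
    .groupby("group")[["exam1", "exam2"]]
    .mean()
    .add_suffix("_mean")
    .reset_index()
)

# Raises an AssertionError if the two tables differ
pd.testing.assert_frame_equal(summary_agg, summary_short)
print(summary_short)
```

Named aggregation is still the more flexible choice when different columns need different statistics or custom output names.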


In [31]:
# Calculate the average grade by row
exam_data['avg_grade'] = exam_data[['exam1', 'exam2', 'exam3']].mean(axis=1)

# Filter the DataFrame for the specified group
filtered_data = exam_data[exam_data['group'] == 'group1']

# Select the 'avg_grade' column
avg_grade_data = filtered_data[['avg_grade']]


In [35]:
avg_grade_data_2 = (
    exam_data
    .assign(avg_grade=exam_data[['exam1', 'exam2', 'exam3']].mean(axis=1))  # Calculate avg_grade
    .query('group == "group1"')  # Filter for group1
    .loc[:, ['avg_grade']]  # Select the avg_grade column
)


In [None]:
# Calculate the average grade by row
exam_data['avg_grade'] = exam_data[['exam1', 'exam2', 'exam3']].mean(axis=1)

# Filter the DataFrame for the specified group
filtered_data = exam_data[exam_data['group'] == 'group1']

# Pull the 'avg_grade' column as a Series
avg_grade_data = filtered_data['avg_grade']


In [None]:
avg_grade_data = (
    exam_data
    .assign(avg_grade=exam_data[['exam1', 'exam2', 'exam3']].mean(axis=1))  # Calculate avg_grade
    .query('group == "group1"')  # Filter for group1
    .avg_grade  # Pull the avg_grade column
)
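The difference between the two pairs of examples above is the type of the result: selecting with double brackets (`[['avg_grade']]`) keeps a one-column data frame, while single brackets or attribute access returns a Series. The sketch below shows the difference and two ways to hand the values to code outside pandas. It uses a small made-up data frame (`toy_data`, a hypothetical name with illustrative values).

```python
import pandas as pd

# Made-up data for illustration only
toy_data = pd.DataFrame({
    "group": ["group1", "group2", "group1"],
    "avg_grade": [94.0, 79.5, 88.0],
})

group1_rows = toy_data[toy_data["group"] == "group1"]

# Double brackets: the result is still a DataFrame with one column
as_frame = group1_rows[["avg_grade"]]

# Single brackets: the result is a Series
as_series = group1_rows["avg_grade"]

# Convert the Series for use outside pandas
as_list = as_series.tolist()     # plain Python list
as_array = as_series.to_numpy()  # NumPy array

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
print(as_list)                   # [94.0, 88.0]
```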
