# MSDS 631 - Lecture 7 (March 6, 2019)

## Pandas Aggregations, Analytical Methods, and Combining Data

### Aggregations

Much of analyzing raw data comes down to summarizing it for further analysis. So far, we've been writing for-loops and storing data in dictionaries so that we can run other analyses (think of the percentage of students on probation). To do this, you defined the attribute you wanted to "group by" (major, in this case). Pandas does this grouping for you and applies a function to all of the data associated with each value of that attribute.

If we wanted to use base Python to find the average GPA amongst students in each major, we would do the following:

In [None]:
#Open data
import json
with open('students.json', 'r') as f:
    students_list_of_dicts = json.load(f)

#Create a dictionary with an empty list for each major so we can collect the students' GPAs
major_gpas = {}
possible_majors = set([i['major'] for i in students_list_of_dicts])
for major in possible_majors:
    major_gpas[major] = []

#Collect each student's GPA under their major
for student in students_list_of_dicts:
    student_major = student['major']
    major_gpas[student_major].append(student['gpa'])

#Compute the average
average_gpas = {}
for major in major_gpas:
    avg_gpa = sum(major_gpas[major]) / len(major_gpas[major])
    rounded_gpa = round(avg_gpa, 3)
    average_gpas[major] = rounded_gpa
average_gpas

That's **three** separate for-loops with two separate dictionaries that we had to use in order to move data into their appropriate locations so that we could make computations. That's a lot! Imagine what we'd have to do if we wanted to add gender, or worse yet, gender AND class.

With Pandas aggregations we can tell the DataFrame what we want to do with a LOT less code.
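As a rough preview of what that looks like, here is a minimal sketch on a small made-up table. The `toy_students` DataFrame and its values are invented for illustration, and the column names `major`, `gender`, and `gpa` are assumptions about how the real students data is laid out.

```python
import pandas as pd

# Hypothetical stand-in for the students data (values are made up)
toy_students = pd.DataFrame({
    'major':  ['Math', 'Math', 'Math', 'History', 'History'],
    'gender': ['F', 'M', 'F', 'M', 'F'],
    'gpa':    [3.5, 3.0, 4.0, 2.5, 3.5],
})

# One grouping column: mean GPA for each major
mean_by_major = toy_students.groupby('major')['gpa'].mean()

# Several grouping columns: pass a list of column names
mean_by_major_gender = toy_students.groupby(['major', 'gender'])['gpa'].mean()

print(mean_by_major)
print(mean_by_major_gender)
```

Grouping by more attributes only means adding names to the list, which is the part that would have required extra loops and dictionaries in base Python.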

Let's start by loading the data into a DataFrame.

In [None]:
import pandas as pd
students_df = pd.read_csv('students.csv')

In [None]:
#Now let's compute the mean GPA by major


In [None]:
#Now let's compute the mean GPA by major AND gender


In [None]:
#Now let's compute the mean GPA by major, class, and gender


There are many types of computations you can do with aggregations (too many to list here). The most common methods you will call include:
- .mean()
- .max()
- .min()
- .median()
- .size()
  - Counts the rows in each group, i.e. how many times each value of the grouping attribute(s) appears (nulls included)
- .count()
  - Counts how many non-null values each column has within each group
- .rank()
  - Ranks each value relative to the other values in its group
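
The difference between `.size()` and `.count()` only shows up when data is missing. The sketch below uses a made-up table with one missing GPA; the table and its column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing GPA in the Math group
toy = pd.DataFrame({
    'major': ['Math', 'Math', 'Math', 'History', 'History'],
    'gpa':   [3.5, np.nan, 4.0, 2.5, 3.5],
})

# size() counts rows per group, whether or not values are missing
group_sizes = toy.groupby('major').size()

# count() counts only the non-null values per group
gpa_counts = toy.groupby('major')['gpa'].count()

print(group_sizes)
print(gpa_counts)
```

Here Math has three rows but only two recorded GPAs, so the two methods disagree for that group and agree for History.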
 
Let's use the methods above to see what each one does.

In [None]:
#Max GPA by major


In [None]:
#Min GPA by major


In [None]:
#Median GPA by major


In [None]:
#How many students are in each major


In [None]:
#How many non-null values are there for each column grouped by major


In [None]:
#Compute the rank of the students' GPAs, by major
#Ties are assigned the "best" rank


In [None]:
gpa_ranks.head()

In [None]:
students_df['gpa_rank'] = gpa_ranks
students_df.head()

In [None]:
students_df = students_df.sort_values(['major', 'gpa_rank'])

In [None]:
students_df.head(20)
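
One way to rank GPAs within each major so that ties receive the "best" rank is sketched below. It runs on a made-up table, and it assumes that "best" means the highest GPA gets rank 1; the `method='min'` and `ascending=False` arguments are one reasonable choice, not necessarily the one used in lecture.

```python
import pandas as pd

# Hypothetical data with a tie inside the Math group
toy = pd.DataFrame({
    'major': ['Math', 'Math', 'Math', 'History', 'History'],
    'gpa':   [3.5, 4.0, 3.5, 2.5, 3.5],
})

# ascending=False gives the highest GPA rank 1;
# method='min' gives tied values the lowest (best) rank among them
toy['gpa_rank'] = toy.groupby('major')['gpa'].rank(method='min', ascending=False)

# Sort so that each major's students appear from best to worst rank
toy_sorted = toy.sort_values(['major', 'gpa_rank'])
print(toy_sorted)
```

The two Math students tied at 3.5 both receive rank 2, directly behind the student with a 4.0.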

### Merging Data

Merging data is one of the most powerful tools in Pandas. If you've learned SQL before, then you'll be familiar with a lot of these concepts. Merging allows us to match data from different DataFrames.
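
For a first feel of how matching works, here is a minimal sketch with two made-up tables that share a `major` column. The table contents are invented; `how='inner'` and `how='left'` behave like the SQL joins of the same names.

```python
import pandas as pd

# Hypothetical tables that share the 'major' column
students = pd.DataFrame({
    'name':  ['Ana', 'Ben', 'Cy'],
    'major': ['Math', 'History', 'Art'],
})
departments = pd.DataFrame({
    'major':    ['Math', 'History'],
    'building': ['Hall A', 'Hall B'],
})

# Inner merge: keep only rows whose major appears in both tables
inner = students.merge(departments, on='major', how='inner')

# Left merge: keep every student; unmatched majors get NaN for 'building'
left = students.merge(departments, on='major', how='left')

print(inner)
print(left)
```

The Art student has no matching department row, so that student is dropped by the inner merge and kept with a missing `building` by the left merge.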

Using the students data, imagine we are trying to "standardize" the students' GPAs. For those of you unfamiliar with standardization, a standardized value measures how many standard deviations a value lies from the mean.
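
Written as a formula, with symbols introduced here for illustration ($x$ for a student's GPA, $\bar{x}_g$ and $s_g$ for the mean and standard deviation of the group $g$ the student is compared against):

$$
z = \frac{x - \bar{x}_g}{s_g}
$$

For example, a GPA of 3.6 in a group with mean 3.2 and standard deviation 0.4 gives $z = (3.6 - 3.2) / 0.4 = 1$, i.e. one standard deviation above the group mean.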

Since each major has a slightly different level of difficulty and each class has a different composition of student talent, we want to compare each student's GPA against the values for their major and class. Let's do that now.
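
The overall plan (aggregate, rename, merge, then compute) can be sketched end to end on a made-up table. This is only an illustration under assumptions: the column names `major`, `class`, and `gpa`, and the new names `mean_gpa`, `std_gpa`, and `diff_from_mean`, are chosen here and may not match the real data or the names used in the cells that follow.

```python
import pandas as pd

# Hypothetical data: one major, two classes (values are made up)
toy = pd.DataFrame({
    'major': ['Math', 'Math', 'Math', 'Math'],
    'class': ['Junior', 'Junior', 'Senior', 'Senior'],
    'gpa':   [3.0, 4.0, 2.0, 3.0],
})
keys = ['major', 'class']

# Aggregate, then reset_index so the grouping keys become ordinary columns
group_means = toy.groupby(keys)['gpa'].mean().reset_index()
group_stds = toy.groupby(keys)['gpa'].std().reset_index()  # sample std (ddof=1)

# Both results have a column called 'gpa', so rename before merging
group_means = group_means.rename(columns={'gpa': 'mean_gpa'})
group_stds = group_stds.rename(columns={'gpa': 'std_gpa'})

# Attach each student's group mean and std by matching on both keys
merged = toy.merge(group_means, on=keys, how='left')
merged = merged.merge(group_stds, on=keys, how='left')

# Distance from the group mean, then that distance in units of std
merged['diff_from_mean'] = merged['gpa'] - merged['mean_gpa']
merged['standardized_gpa'] = merged['diff_from_mean'] / merged['std_gpa']
print(merged)
```

Note that pandas' `.std()` computes the sample standard deviation by default, so a group with a single student would get a missing standard deviation and a missing standardized GPA.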

In [None]:
mean_gpa_by_major_and_class = 
std_gpa_by_major_and_class = 

In [None]:
mean_gpa_by_major_and_class.head()

In [None]:
std_gpa_by_major_and_class.head()

In [None]:
#Need to rename column since they share the same name - method 1


In [None]:
#Method 2 for renaming columns
#Note this method RETURNS the new dataframe
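
Two common ways to rename a column are sketched below on a made-up table. They may or may not be the exact two methods intended for these cells; the table and the new name `mean_gpa` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical aggregated table whose value column is still called 'gpa'
means = pd.DataFrame({'major': ['Math', 'History'], 'gpa': [3.5, 3.0]})

# Option A: overwrite the column labels directly (changes the DataFrame itself)
means_a = means.copy()
means_a.columns = ['major', 'mean_gpa']

# Option B: rename() returns a new DataFrame and leaves the original untouched,
# so the result has to be assigned to a variable
means_b = means.rename(columns={'gpa': 'mean_gpa'})

print(means_a.columns.tolist())
print(means_b.columns.tolist())
print(means.columns.tolist())
```

Option A requires listing every column in order, while option B only names the columns that change.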


In [None]:
#Merge means to students_df


In [None]:
#What does the DataFrame look like now?
students_df_w_mean.head()

In [None]:
#Merge standard deviations to new students_df


In [None]:
#What does the DataFrame look like now?
students_df_w_mean_std.head()

In [None]:
#Compute how far from the mean the student's GPA is
students_df_w_mean_std['std_from_mean'] = 

In [None]:
#Compute how many standard deviations away from the mean the student's GPA is
students_df_w_mean_std['standardized_gpa'] = 

In [None]:
#Look at the new data
students_df_w_mean_std.head()

In [None]:
#Look at the distribution of original GPAs
from matplotlib import pyplot as plt
students_df_w_mean_std['gpa'].hist(bins=20)
plt.show()

In [None]:
#Look at the distribution of standardized GPAs
students_df_w_mean_std['standardized_gpa'].hist(bins=20)
plt.show()

Next, let's take data that is spread across several different sources and join it together.

Let's use Pandas to get all of the data from Quiz 2 into the same DataFrame.

In [None]:
with open('department_enrollment.json', 'r') as f:
    dept_enrollment = json.load(f)

In [None]:
dept_enrollment.keys()

In [None]:
#Let's try creating a DataFrame from this dictionary of lists


Creating a DataFrame from a dictionary of lists requires every list to be the same length. Since the lists in our dictionary have different lengths, we're going to have to build the DataFrame manually.
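
A minimal sketch of that manual approach follows. The `toy_enrollment` dictionary, the helper `make_major_df`, and the column name `student_id` are all hypothetical; the real `department_enrollment.json` may hold something other than student IDs.

```python
import pandas as pd

# Hypothetical stand-in: each major maps to a list of a different length
toy_enrollment = {
    'Math': [101, 102, 103],
    'History': [201, 202],
}

# Passing unequal-length lists straight to DataFrame raises a ValueError
try:
    pd.DataFrame(toy_enrollment)
    failed = False
except ValueError as err:
    failed = True
    print('Could not build the DataFrame directly:', err)

def make_major_df(major, student_ids):
    """Build a two-column DataFrame for one major (hypothetical helper)."""
    # The scalar major is repeated for every student id in the list
    return pd.DataFrame({'major': major, 'student_id': student_ids})

# One small DataFrame per major, stacked into a single long DataFrame
frames = [make_major_df(major, ids) for major, ids in toy_enrollment.items()]
enrollment_df = pd.concat(frames, ignore_index=True)
print(enrollment_df)
```

The result is in "long" form, with one row per (major, student) pair, which is a convenient shape for merging with other student-level tables.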

In [None]:
#Write a function that makes a single DataFrame for a major


In [None]:
#Let's create a list of DataFrames and concatenate them together


In [None]:
#Take a look at the new big DataFrame


In [None]:
#Let's load the student_gpas.json file


In [None]:
#Getting the data into a DataFrame isn't easy


In [None]:
#Load student_directory data


In [None]:
#Now let's combine all of the data together
