# Exploring & Visualizing Distributions: Class Sizes @ Whitman

This notebook uses a dataset of historic class sizes at Whitman College for a subset of majors/departments to explore and visualize distributions.

_October 10, 2023
CS / Math 215_

## Part 1: Exploring the data

In [1]:
# import the packages we've been using all semester long
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# First, import the data file as a data frame
df_courses = pd.read_excel("whitman-course-sizes-2000-2023.xlsx")

### Your task:

Examine the dataset. Ask yourself: What does each row represent? Then, figure out: 

In [3]:
# How many different majors are there in this data set and which are they?

# YOUR CODE HERE

In [4]:
# What time period does this data frame cover?

# YOUR CODE HERE

In [4]:
# How many *unique* course names are there in this dataset, based on the "Short Title"?

# YOUR CODE HERE

In [6]:
# How many course offerings (i.e., instances when a course was taught) are there for each major?

# YOUR CODE HERE

Ok, now that we have that down, let's answer some more complex questions:

In [13]:
# Last semester (Spring 2023), how many courses were taught in each Subject/Major?

# YOUR CODE HERE

In [15]:
# Last semester (Spring 2023), what was the largest course?

# YOUR CODE HERE

In [17]:
# Last semester (Spring 2023), which of the majors in this dataset had the
# highest and the lowest average class size?

# YOUR CODE HERE

In [19]:
# Last semester (Spring 2023), pick a Subject/Major and determine 
# the percent of classes with fewer than 20 students

# YOUR CODE HERE
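
The questions above mostly combine the same few pandas moves: a Boolean mask to pick out rows, then `value_counts()` or `groupby()` to summarize them. Here is a minimal sketch of that pattern on a made-up toy table. The rows and the `"2023SP"` term label are assumptions for illustration only, so check how terms are actually written in `df_courses` before adapting it.

```python
import pandas as pd

# Toy stand-in for df_courses; the rows and the "2023SP" label are made up
toy = pd.DataFrame({
    "Term": ["2023SP", "2023SP", "2023SP", "2023SP", "2022FA"],
    "Subject": ["CS", "CS", "ECON", "ECON", "CS"],
    "Active Students Count": [30, 10, 24, 16, 18],
})

# Boolean mask: keep only the rows from one term
spring = toy[toy["Term"] == "2023SP"]

# Number of offerings and average class size per subject
offerings_per_subject = spring["Subject"].value_counts()
mean_size_per_subject = spring.groupby("Subject")["Active Students Count"].mean()

# Percent of one subject's classes with fewer than 20 students:
# the mean of a Boolean Series is the fraction of True values
cs = spring[spring["Subject"] == "CS"]
pct_small_cs = (cs["Active Students Count"] < 20).mean() * 100

print(offerings_per_subject)
print(mean_size_per_subject)
print(pct_small_cs)
```

The trick in the last step (taking the mean of a Boolean Series) is often shorter than counting rows twice and dividing.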

### CHALLENGE
Write a function that takes two inputs: a list of courses identified by course "Name" (e.g., ["CS-270-A", "CS-310-A", "ECON-107-A"]) and a dataframe of courses. It should return the **mean experienced class size** for a student in those classes.


In [22]:
# YOUR CODE HERE
def mean_exp_class_size(course_list, dataframe):
    pass  # placeholder so the cell runs; replace with your implementation

In [None]:
# Sample test case
#course_list = ["CS-270-B", "CS-310-A", "ECON-107-A"]
#mean_exp_class_size(course_list, df_SP2023)
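
If you get stuck, here is one possible sketch. It assumes one particular reading of "mean experienced class size": the average class size as seen by the students enrolled in the listed classes, so a class of size n counts n times. The function name `experienced_mean_sketch` and the toy data are hypothetical. If the intended meaning is instead the average over a single student's schedule, the plain mean of the selected sizes would be the answer.

```python
import pandas as pd

def experienced_mean_sketch(course_list, dataframe):
    """Size-weighted mean class size over the listed courses (one interpretation)."""
    # Keep only the rows whose "Name" appears in the list
    selected = dataframe[dataframe["Name"].isin(course_list)]
    sizes = selected["Active Students Count"]
    # A class of size n is experienced by n students, so weight each size by itself
    return (sizes ** 2).sum() / sizes.sum()

# Made-up toy data, not the real course sizes
toy = pd.DataFrame({
    "Name": ["CS-270-B", "CS-310-A", "ECON-107-A", "MATH-125-A"],
    "Active Students Count": [10, 20, 30, 40],
})

result = experienced_mean_sketch(["CS-270-B", "CS-310-A", "ECON-107-A"], toy)
plain_mean = toy["Active Students Count"].iloc[:3].mean()
print(result, plain_mean)
```

Note that the weighted value comes out larger than the plain mean, because bigger classes contain more of the students. Also note that a course "Name" may appear in several terms, so pass in a dataframe that has already been filtered to a single term.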

## Part 2: Visualizing the data

Now, this is fun... but we can also use _visualizations_ to explore the data.

First, let's consider the *distributions* of the course sizes.

How might we use visualizations to examine each Subject/Major?

We could start by making a [histogram](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) for each one...

In [69]:
# Make a histogram of the course sizes (for all subjects, all years together)
# What do you notice?

# YOUR CODE HERE


In [30]:
# Drop any course that only has 0 or 1 students in it
# You can either make a new data frame or modify the existing one

# YOUR CODE HERE


In [71]:
# Now redraw the histogram -- how is it different?

# YOUR CODE HERE 


In [32]:
# Pick any Subject/Major you like
# Make a histogram of the class sizes for that Major

# YOUR CODE HERE

Or, we could put multiple majors on one histogram.

Let's try that out...

Other things to try:
* How is it different if we use `density=True` or `density=False`?
* Can you add a legend?

In [36]:
# Pick three Majors and plot histograms of their course sizes on the same plot...

# YOUR CODE HERE
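
To see what `density=True` changes, here is a small sketch with made-up data (three hypothetical groups labeled "A", "B", and "C", not real majors). With `density=True`, each histogram is rescaled so that its bars have a total area of 1, which makes groups with very different numbers of courses easier to compare on one plot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Made-up class sizes for three hypothetical groups with different numbers of courses
groups = {
    "A": rng.integers(5, 30, size=200),
    "B": rng.integers(10, 40, size=50),
    "C": rng.integers(15, 60, size=20),
}

bins = np.arange(0, 65, 5)  # shared bin edges so the histograms line up

fig, ax = plt.subplots()
heights = {}
for label, sizes in groups.items():
    # alpha makes the overlapping bars see-through; label feeds the legend
    counts, edges, _ = ax.hist(sizes, bins=bins, density=True, alpha=0.5, label=label)
    heights[label] = counts

ax.set_xlabel("Class size")
ax.set_ylabel("Density")
ax.legend()
```

Try switching to `density=False`: group "A" will tower over the others simply because it has more courses.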

## Part 3: Introducing Seaborn

Ok, we are getting closer... But is there a better way? Let's try some of the plots we can make with Seaborn.

__What is Seaborn?__

Seaborn is a visualization package that runs on top of Matplotlib. It is specially designed for statistical analysis and for exploring distributions in data. You can find out more at: https://seaborn.pydata.org/. A great place to start is the [Seaborn Tutorial](https://seaborn.pydata.org/tutorial/introduction.html).

__Installation__

Depending on your Anaconda installation, you may need to install the package. 

First, try running the import code block below. If it works, great! 

If it doesn't work, open up a terminal and run this to install:

    conda install seaborn



In [5]:
# import Seaborn
import seaborn as sns

The next part is new, so follow along as best you can.

You don't need to know everything (I certainly don't!). Just get comfortable trying to figure out how these plots work by consulting the [Seaborn documentation](https://seaborn.pydata.org/tutorial.html) and looking for examples that help you understand.

In [72]:
# We can use a seaborn .displot()
# It's basically the same as a matplotlib histogram
# We can put all of the Subjects/Majors on one plot:

# NOTE: You can uncomment the code below to create the plots
# But if you named your dataframes something different, you'll need to update that part

#sns.displot(df_courses, x="Active Students Count")

In [73]:
# Or we could show just one major

#sns.displot(df_courses[df_courses["Subject"] == "CS"], x="Active Students Count", hue="Subject")

In [74]:
# Or we could show them all...
# But this is messy (although note how easy it is to make!)

#sns.displot(df_courses, x="Active Students Count", hue="Subject")

In [75]:
# And here we are normalizing:

#sns.displot(df_courses, x="Active Students Count", hue="Subject", stat="probability")

In [76]:
# We can also make a continuous plot
# This is using "kernel density estimation" to smooth out the histogram
# More at: https://en.wikipedia.org/wiki/Kernel_density_estimation

#sns.displot(df_courses, x="Active Students Count", kind="kde")
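
What is kernel density estimation actually doing? Roughly: it places a small smooth "bump" on every data point and averages the bumps. Below is a hand-rolled sketch using Gaussian bumps. The function name `gaussian_kde_sketch`, the data, and the bandwidth of 3.0 are all made up for illustration; Seaborn picks the bandwidth for you and its internals differ from this.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Distance from every grid point to every sample, in units of the bandwidth
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale so the curve has total area 1
    return kernels.mean(axis=1) / bandwidth

sizes = np.array([8, 12, 12, 15, 20, 25, 25, 25, 30])  # made-up class sizes
grid = np.linspace(-20, 60, 801)
density = gaussian_kde_sketch(sizes, grid, bandwidth=3.0)

# Approximate the area under the curve with a simple Riemann sum
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve; a smaller one follows the individual points more closely. Notice too that the curve extends below zero students, which is one reason to read KDE plots of counts with some care.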

In [77]:
# Maybe it will be easier to see all of the courses if we use our smooth kernel density (kde) plot

#sns.displot(df_courses, x="Active Students Count", kind="kde", hue="Subject")

To up our game, now let's visualize the Subjects/Majors separately.

To do this, let's play with a **catplot** -- which stands for categorical plot. This is useful for dealing with categorical data, like the different Subjects and Courses that we have in this data set. For more on catplot, check out: https://seaborn.pydata.org/generated/seaborn.catplot.html

In [78]:
# A basic catplot showing the distributions...
# Notice how every course (row) becomes a dot

#sns.catplot(data=df_courses, x="Subject", y="Active Students Count", hue="Subject")

In [83]:
# Ok, that is a good start... let's try adding a jitter and making it transparent

#sns.catplot(data=df_courses, x="Subject", y="Active Students Count", alpha=0.25, jitter=True, hue="Subject")

In [84]:
# This is a little messy, so let's try it with just the courses last semester

#sns.catplot(data=df_SP2023, x="Subject", y="Active Students Count", alpha=0.5, hue="Subject")

In [86]:
# With this smaller dataset, we can also try a swarmplot
# Notice the difference?

#sns.swarmplot(data=df_SP2023, x="Subject", y="Active Students Count", hue="Subject")

How else might we visualize distributions? We could do a box plot, also known as a box and whiskers plot. ([This post](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review) has a nice refresher on how box plots work.)
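
As a quick refresher in code, here is a sketch of the numbers behind a box plot, computed by hand on made-up class sizes. It assumes the common convention that whiskers reach to the most extreme points within 1.5 times the interquartile range (IQR) of the box, and that anything beyond that is drawn as an outlier.

```python
import numpy as np

sizes = np.array([4, 8, 10, 12, 14, 16, 18, 20, 45])  # made-up class sizes

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(sizes, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = sizes[(sizes >= lower_fence) & (sizes <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are plotted individually
outliers = sizes[(sizes < lower_fence) | (sizes > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```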

In [87]:
# We can change kind to "box"
#sns.catplot(data=df_courses, x="Subject", y="Active Students Count", kind="box")

In [88]:
# Or we could do a bar plot with error bars
#sns.catplot(data=df_courses, x="Subject", y="Active Students Count", kind="bar")

In [89]:
# Or a violin plot (which is kind of like a box plot, but continuous)
#sns.catplot(data=df_courses, x="Subject", y="Active Students Count", kind="violin")

Notice how the different plots emphasize different features of the distributions. 

For example, the violin plot emphasizes that many of the distributions seem to have "lumps", which might reflect common course caps (12, 25, etc.).

And the bar chart really emphasizes the means.

What do the other visualizations emphasize?

In [50]:
# What are some other ways we can explore distributions?

In [1]:
# We could also make a grid of histograms

#sns.displot(df_courses, x="Active Students Count", col="Subject", col_wrap=2, height=2, aspect=2)

In [2]:
# We can also look at a BIVARIATE DISTRIBUTION...
# That is, distribution according to two different variables at once

# Let's look at the distribution of class size by term (semester) 

#sns.displot(df_courses, y="Active Students Count", x="Term")

In [3]:
# Oof, that's really ugly! Let's set the figure size

#my_plot = sns.displot(df_courses, y="Active Students Count", x="Term", height=8, aspect=1.5)

# And rotate the x-axes labels
#plt.xticks(rotation=45)
#plt.show()

# I'm showing you this not because you need to know it, but because these are the things that
# will come up as you work on making plots :P

## Part 4: Time series
These plots aren't really great for looking at trends over time. Right now, "Term" is categorical. Can we turn it into a date-time object?

In [7]:
# Add three new columns into your data frame:
# Year
# Month
# Day

# For our purposes, you can assume that the Fall semester starts on August 1 
# and the Spring semester starts on January 1


# YOUR CODE HERE

Now, use `pd.to_datetime` to make a new column, "Date", that is a datetime object built from the "Day", "Month", and "Year" columns. Remember pd.to_datetime: https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

In [10]:
# YOUR CODE HERE
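
One way this can look, sketched on a tiny made-up table (the rows are invented; only the column names "Year", "Month", "Day", and "Date" come from this notebook): `pd.to_datetime` can assemble dates from separate year, month, and day components, and the resulting column can then be compared against a date string.

```python
import pandas as pd

# Made-up rows: Fall terms start August 1, Spring terms start January 1
toy = pd.DataFrame({
    "Year": [2022, 2023, 2023],
    "Month": [8, 1, 8],
    "Day": [1, 1, 1],
})

# Assemble a datetime column from the year/month/day components
toy["Date"] = pd.to_datetime(
    {"year": toy["Year"], "month": toy["Month"], "day": toy["Day"]}
)

# A datetime column supports comparisons, so Boolean selection by date works
recent = toy[toy["Date"] >= "2023-01-01"]

print(toy.dtypes)
print(recent)
```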

In [92]:
# Let's make a bar chart showing the average class size over time

#fig, axs = plt.subplots(figsize=(10, 4))

#df_courses.groupby(df_courses["Date"])["Active Students Count"].mean().plot(kind='bar', rot=90, ax=axs)


#plt.xlabel("Semester")  # custom x label using Matplotlib

#plt.ylabel("Average Class Size")

In [93]:
# Or we could do a line
#fig, axs = plt.subplots(figsize=(20, 4))

#df_courses.groupby(df_courses["Date"])["Active Students Count"].mean().plot(kind='line', rot=90, ax=axs)


#plt.xlabel("Semester")  # custom x label using Matplotlib

#plt.ylabel("Average Class Size")

In [94]:
# What about plotting the **distribution** over time?
# This is what Seaborn does well!

#sns.lineplot(x="Date", y="Active Students Count", data=df_courses)

In [95]:
# We can also look at the different Subjects/Majors

#sns.lineplot(x="Date", y="Active Students Count", hue="Subject", data=df_courses)

In [96]:
# Wow, that's messy! Let's clean it up
# We can put the plots onto a grid
# To do this, we use .relplot()

#g = sns.relplot(
#    data=df_courses,
#    x="Date", y="Active Students Count", col="Subject", hue="Subject",
#    kind="line", col_wrap=2, height=2, aspect=1.5, legend=False,)

In [97]:
# What if we just want to look at the past 5 years?
# Now that we have a datetime object, we can use a Boolean selector!

#recent_classes = df_courses[df_courses["Date"] >= "01-01-2018"]

#g = sns.relplot(
#    data=recent_classes,
#    x="Date", y="Active Students Count", col="Subject", hue="Subject",
#    kind="line", col_wrap=2, height=2, aspect=1.5, legend=False,)

### CHALLENGE: What else can you do with this data?

You might try...
* Analyzing the overall number of students taking each Subject over time
* Coming up with another interesting question and answering it
* Making a useful function (e.g., our mean experienced class size estimator)
* Making an exploratory visualization

What else might you want to do?