# COMM 187: Data Science in Communication Research
# Spring 2025

## Week #8: Data Visualization using seaborn
**Monday, May 19, 2025**

Welcome to the Week #8 Coding Lab for COMM 187: Data Science in Communication Research! 

Today's lesson plan:
 - Introduction to `seaborn`
 - Box and whisker plots
 - Joint plots 

Last week we learned the following types of plots:

 - **Histograms** -- for visualizing data distribution and description
 - **Scatterplots** -- for visualizing relationships between two variables

This week, we will learn the following types:
 - **Box plots** -- for visualizing differences in a variable's distribution (median and spread) between groups; a natural companion to t-tests and ANOVA
 - **Joint plots** -- for visualizing association between two variables; ideal for correlations and regression

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Introduction to `seaborn`

Seaborn is a data visualization Python package built on top of `matplotlib`. It provides a high-level interface for creating informative and good-looking statistical visualizations. 

As with any package, the first step is to import it! Typically, the alias used for seaborn is `sns`. 

In [None]:
import seaborn as sns

This week, we will work with the college majors and earnings dataset like last week. Let's load the dataset.

In [None]:
df = pd.read_csv('./data/recent-grads.csv')

As discussed in previous weeks, a good first step is always to look at the first few rows of the dataset using the `.head()` method to get an idea of what the data "looks" like. Let's do that here.

In [None]:
df.head()

### **Box Plots**

**Box plots**, or box-and-whisker plots, visually summarize the distribution of a dataset. They display the median, quartiles, and potential outliers. 

The **box** represents the interquartile range (IQR), with the **lower and upper edges** at the **first (Q1)** and **third quartiles (Q3)**. The line inside the box marks the median (Q2). **Whiskers** extend from the box to the smallest and largest values within **1.5 * IQR** from Q1 and Q3. Points outside this range are considered outliers. Box plots highlight central tendency, variability, and outliers effectively.

![](./imgs/boxplot_explanation.jpg)

[Image source](https://builtin.com/data-science/boxplot)
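
To make the parts of a box plot concrete, here is a minimal sketch that computes them by hand with `numpy`. The income values are made up for illustration (they are not from the course dataset), and the quartiles use `numpy`'s default interpolation, so other tools may report slightly different numbers on small samples.

```python
import numpy as np

# Hypothetical incomes in thousands of dollars (made-up values, not course data)
incomes = np.array([20, 25, 28, 30, 32, 35, 38, 40, 45, 90])

# Quartiles: the box spans Q1 to Q3, with a line at the median
q1, median, q3 = np.percentile(incomes, [25, 50, 75])
iqr = q3 - q1

# Fences: points more than 1.5 * IQR away from the box count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
inside = incomes[(incomes >= lower_fence) & (incomes <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = incomes[(incomes < lower_fence) | (incomes > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Whiskers: {whisker_low} to {whisker_high}; outliers: {outliers}")
```

Note that the whiskers end at actual data points (20 and 45 here), not at the fences themselves.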

Box plots are a great way to visually compare the distribution of a variable across two or more groups!

Let us make a boxplot of income across major categories, using the `boxplot` function from the `seaborn` library. For more details, read the documentation [here](https://seaborn.pydata.org/generated/seaborn.boxplot.html). Here are the basic parameters of the `boxplot` function you need to know:

| Parameter | Description |
|-----------|-------------|
| `data` | The dataset you want to plot (DataFrame, array, etc.) |
| `x` | The variable to show on the x-axis |
| `y` | The variable to show on the y-axis |
| `hue` | Optional third variable for grouping data with different colors |
| `ax` | The matplotlib axes to draw the plot on (usually leave this alone) |
| `color` | Single color for all elements in the plot |
| `palette` | Set of colors to use when `hue` is specified |
| `width` | How wide each box should be (0.8 = 80% of available space) |
| `orient` | Which way the boxes should face ("v"=vertical, "h"=horizontal) |
| `fill` | Whether to fill boxes with color (`True`) or just show outlines (`False`) |
| `dodge` | Whether to shift boxes sideways when using `hue` to avoid overlap |
| `gap` | Space to add between boxes (0 = no gap) |
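
As a quick illustration of how a few of these parameters combine, here is a minimal sketch on a small made-up DataFrame. The column names `field`, `degree`, and `income` and all values are hypothetical and are not part of the course dataset.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical toy data (made-up values, not the course dataset)
toy = pd.DataFrame({
    'field': ['Arts', 'Arts', 'Arts', 'Arts',
              'Science', 'Science', 'Science', 'Science'],
    'degree': ['BA', 'BA', 'MA', 'MA', 'BA', 'BA', 'MA', 'MA'],
    'income': [30, 34, 41, 45, 48, 52, 60, 66],
})

# x sets the groups on the x-axis, hue splits each group by a third variable,
# palette chooses the colors for the hue levels, width narrows the boxes
ax = sns.boxplot(data=toy, x='field', y='income', hue='degree',
                 palette='Set2', width=0.6)
ax.set_title('Toy example: income by field and degree')

plt.show()
```

`sns.boxplot` returns the matplotlib axes it drew on, which is why the title can be set through `ax` here.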


In [None]:
df.columns

In [None]:
# Set figure size (optional)
plt.figure(figsize=(12, 8))

# Create boxplot using seaborn (sns) boxplot function
sns.boxplot(data=df, x='Major_category', y='Income')

plt.show()

The x-axis labels are overlapping with each other! How do we fix that? 

We can use the `rotation` parameter of the `xticks` function from matplotlib to specify the angle by which the tick labels on the x-axis are rotated. The default rotation is 0 degrees, and 90 degrees means that the text is displayed vertically. We will go with somewhere in between: 45 degrees.

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(data=df, x='Major_category', y='Income')

# Rotate x-axis labels for readability
plt.xticks(rotation=45, ha='right')

plt.show()

Let's add an x-axis label, a y-axis label, and a title.

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(data=df, x='Major_category', y='Income')
plt.xticks(rotation=45, ha='right')

# Add labels and title
plt.xlabel('Major Category')
plt.ylabel('Median Income ($)')
plt.title('Income Distribution by Major Category')

plt.show()

#### Practice Question 1
Create a boxplot showing the difference in unemployment rates by major category, with red boxes and horizontal orientation. 

**Note:** If you were conducting an ANOVA to test the difference in unemployment rates across major categories, this is the visualization you would choose to represent it.

**Steps:**
1. Create a boxplot with `Major_category` on the y-axis and `Unemployment_rate` on the x-axis
2. Make the boxes red
3. Set the orientation to horizontal
4. Add appropriate labels and title

In [None]:
### Your code below this line


#### Practice Question 2
Create a boxplot showing the difference in income for the major categories "Health", "Arts", and "Communication & Journalism".
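
One possible approach to a question like this is to keep only the rows for the groups of interest before plotting. The sketch below shows that filtering step with the pandas `isin` method on a made-up DataFrame; the column names and values are hypothetical, so adapt them to the course dataset.

```python
import pandas as pd

# Hypothetical toy data (made-up values, not the course dataset)
toy = pd.DataFrame({
    'group': ['A', 'B', 'C', 'A', 'D', 'C'],
    'score': [10, 12, 9, 14, 20, 11],
})

# Keep only the rows whose group is in the list of wanted groups
wanted = ['A', 'C']
subset = toy[toy['group'].isin(wanted)]

print(subset)
```

The filtered DataFrame can then be passed to a plotting function through its `data` parameter.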

In [None]:
### Your code below this line


### **Joint Plots**

In the last lab, we learned how to use a scatterplot to visualize the relationship between two variables. However, a scatterplot on its own does not make the magnitude and direction of the relationship easy to judge. Remember the plot we discussed in class from this article? There, in addition to the scatterplot, a **line** helped us visualize the nature of the association between the two variables.

To accomplish this simply, we can use regression jointplots using seaborn's `jointplot` function.

**What is a Jointplot?** A jointplot combines:

 - A main scatter plot showing the relationship between two variables
 - Marginal distributions on the sides (histograms or KDE plots)

When using `kind="reg"`, it adds a regression line to show the linear relationship between variables.  

Let's create a jointplot showing the relationship between median income and unemployment rate. First, let's just plot the data without a line.

In [None]:
sns.jointplot(data=df, x='Unemployment_rate', y='Income')

plt.show()

We created a jointplot with `sns.jointplot()`, specifying:

 - `data`: Our dataframe
 - `x` and `y`: Our variables of interest

Great! What if we want the color to be different for different college majors? Let's try that with the `hue` parameter.

In [None]:
sns.jointplot(data=df, x='Unemployment_rate', y='Income', hue='Major_category')

plt.show()

Now, instead of different colors, we want all the points to have the same color. Let's try that with the `color` parameter and change the data points to the color purple. 

In [None]:
sns.jointplot(data=df, x='Unemployment_rate', y='Income', color='purple')

plt.show()

**IMPORTANT** Now, we want to add a line to this plot. We can do this by setting the parameter `kind` to `reg`. What does this mean? It adds a "regression line" to the scatter plot, which helps us see the magnitude (slope) and direction (positive or negative) of the relationship between the two variables.

In [None]:
sns.jointplot(data=df, x='Unemployment_rate', y='Income', kind='reg', color='purple')

plt.show()
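
By default, the line drawn with `kind='reg'` is a straight-line least-squares fit. As a rough sketch of what "magnitude" and "direction" mean in numbers, the slope of such a line and the Pearson correlation can be computed with `numpy`. The values below are made up for illustration and are not from the course dataset.

```python
import numpy as np

# Hypothetical values (made up, not the course dataset)
unemployment = np.array([0.02, 0.04, 0.06, 0.08, 0.10])
income = np.array([60000, 52000, 47000, 41000, 35000])

# Fit a straight line: income = slope * unemployment + intercept
slope, intercept = np.polyfit(unemployment, income, deg=1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(unemployment, income)[0, 1]

# The sign of the slope gives the direction; its size gives the magnitude
direction = 'negative' if slope < 0 else 'positive'
print(f"slope={slope:.0f}, intercept={intercept:.0f}, r={r:.3f} ({direction})")
```

In this toy example the slope is negative, so the fitted line runs downhill: higher unemployment goes with lower income.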

Great! Let's now add a title using the `plt.title` function.

In [None]:
sns.jointplot(data=df, x='Unemployment_rate', y='Income', kind='reg', color='purple')

# Add title
plt.title('Relationship Between Unemployment Rate and Income')

plt.show()

Given the special configuration of this plot type, the title appears at an inconvenient location. Let's locate it above the plot using the `suptitle` function from matplotlib, which puts a "super" title at the top center of the figure.

In [None]:
sns.jointplot(data=df, x='Unemployment_rate', y='Income', kind='reg', color='purple')

# Add title
plt.suptitle('Relationship Between Unemployment Rate and Income')

plt.show()

#### Practice Question 3

Create a jointplot showing the relationship between the percentage of women in a major (`ShareWomen`) and median income (`Income`).

In [None]:
### Your code below this line
