# COMM 187: Data Science in Communication Research
# Spring 2025

## Week #7: Data Visualization using matplotlib
**Monday, May 21, 2025**

Welcome to Coding Lab #7 for COMM 187: Data Science in Communication Research!

Today's lesson plan:
 - Grouping data in `pandas`
 - Data Visualization using `matplotlib`
 - Practice with histograms and scatterplots
 - How to save a visualization

Today's lessons are based on the following online resources (feel free to try them out yourselves too!):
 - https://wesmckinney.com/book/pandas-basics
 - https://wesmckinney.com/book/plotting-and-visualization

### Grouping data in `pandas` using `groupby`

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('./data/recent-grads.csv')

In [None]:
df['Major_category'].unique()

Suppose that, instead of calculating summary statistics for the whole dataset, we want to calculate them separately for each major category. We would have to subset the data by each major category and then compute the statistics for each subset. That is a very long process!

Instead, we can "group" the data based on the values of a column (e.g., the "Major_category" column) and compute the same statistic for every group at once. We do this using the `.groupby()` method.
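To make the idea concrete, here is a minimal sketch on a small made-up table. The column names mirror the ones used in this lab, but the values are invented for illustration and are not taken from `recent-grads.csv`.

```python
import pandas as pd

# Hypothetical toy data: the values are made up for illustration only.
toy = pd.DataFrame({
    "Major_category": ["Arts", "Arts", "Engineering", "Engineering", "Health"],
    "Income": [30000, 34000, 60000, 70000, 48000],
})

# Long way: subset one group at a time and compute the statistic.
arts_mean = toy[toy["Major_category"] == "Arts"]["Income"].mean()

# Short way: split the rows by "Major_category", then apply .mean()
# to the "Income" column of every group at once.
group_means = toy.groupby("Major_category")["Income"].mean()

print(arts_mean)
print(group_means)
```

The grouped result is a Series indexed by category, so `group_means["Arts"]` gives the same number as the manual subset.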

Learn more about how to use `.groupby()` here: https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/ 

Using the tutorial linked here, discuss with your table how to solve the following questions:

**Question 1.** Use `.groupby()` function to group the DataFrame based on the column “Major_category”. 

In [None]:
### Your code below this line
df.groupby('Major_category')

**Question 2.** Use `.groupby()` function to group the DataFrame based on the column “Major_category”, and select only the "Income" column. 

In [None]:
### Your code below this line
df.groupby('Major_category')['Income']

**Question 3.** Use `.groupby()` function and `.mean()` together to print the mean income for each “Major_category” group. 

In [None]:
### Your code below this line
df.groupby('Major_category')['Income'].mean()

**Question 4.** Use `.groupby()` function, `.mean()` function, and `.sort_values()` together to print the mean income for each “Major_category” group sorted in descending order of the mean incomes. 

In [None]:
### Your code below this line
df.groupby('Major_category')['Income'].mean().sort_values(ascending=False)

**Question 5.** Use `.groupby()` function, `.median()` function, and `.sort_values()` together to print the median income for each “Major_category” group sorted in descending order of the median incomes. 

In [None]:
### Your code below this line
df.groupby('Major_category')['Income'].median().sort_values(ascending=False)

### Practice

Identify the top 3 major categories with the highest average unemployment rates and provide the standard deviation for the unemployment rate within those categories. Use `.groupby()`, `.mean()`, `.std()`, and `.sort_values()`.
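If the pattern is unclear, the sketch below shows the same kind of chain on unrelated, made-up data. The `team` and `score` names are placeholders, not columns from the lab's dataset, and it keeps the top 2 groups rather than 3, so it illustrates the approach without answering the exercise.

```python
import pandas as pd

# Hypothetical data: "team" and "score" are placeholder names.
scores = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "score": [1.0, 3.0, 10.0, 14.0, 5.0, 5.0, 7.0, 9.0],
})

grouped = scores.groupby("team")["score"]

# Mean per group, sorted from highest to lowest; .head(2) keeps the top two.
top_means = grouped.mean().sort_values(ascending=False).head(2)

# Standard deviation per group, restricted to those same top groups.
top_stds = grouped.std().loc[top_means.index]

print(top_means)
print(top_stds)
```

Selecting with `.loc[top_means.index]` keeps the standard deviations in the same order as the sorted means.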

In [None]:
### Your code below this line


### Data Visualization using `matplotlib`

As practiced in Datacamp Module #3, we will use the `matplotlib` package to create visualizations of data and statistics in Python. Specifically, we will be using the `pyplot` submodule. Just as we have been using `numpy` as `np` and `pandas` as `pd`, we will use the alias `plt` for `matplotlib.pyplot`.

In [None]:
import matplotlib.pyplot as plt

We are going to practice the following types of plots today:

 - **Histograms** -- for visualizing data distribution and description
 - **Scatterplots** -- for visualizing relationships between two variables
 - **Box plots** -- for comparing the distribution of a variable (median and spread) between groups

### **Histograms**

A histogram shows the distribution of data. In this example, we’ll plot the distribution of `Major_category` values. The `hist()` function requires at least one argument: an array containing the data to be binned.

A bin represents a range of values into which the data is grouped. Each bin corresponds to a bar in the histogram, showing the count (or frequency) of data points that fall within that range.
The number of bins affects the level of detail in the histogram:

- More bins create narrower ranges, showing more detail about small variations in the data.
- Fewer bins create broader ranges, giving a more general overview of the data distribution.
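The effect of the bin count can be checked numerically before drawing anything. The sketch below uses `np.histogram()`, which computes bin counts and bin edges without plotting. The values are made up for illustration.

```python
import numpy as np

# Made-up numeric values, used only to illustrate binning.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Few bins: broad ranges, giving a coarse overview.
counts_2, edges_2 = np.histogram(values, bins=2)

# More bins: narrower ranges, showing finer detail.
counts_4, edges_4 = np.histogram(values, bins=4)

print(counts_2, edges_2)
print(counts_4, edges_4)
```

Each count corresponds to one bar, and the edges mark where each bar starts and ends. With either bin setting, the counts add up to the number of data points.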

**Step 1:** Create a simple histogram.

In [None]:
plt.hist(df["Major_category"])
plt.show()

**Step 2:** Customize Your Histogram

Just like line plots, `matplotlib` allows you to customize histograms. You can add a title and change colors.  

One additional color option is `edgecolor`. This adds a border color around each bar, making the bars more distinct.


In [None]:
plt.hist(df['Major_category'], color='purple', edgecolor='black')
plt.title("Distribution of Major categories")
plt.xlabel("Major Category")
plt.ylabel("Frequency")
plt.show()

You may also want to change the number of bins in a histogram to make it easier for your viewers to read.

`bins`: Sets the number of bars (bins) in the histogram. The default is 10, but this can be adjusted based on the data.

In [None]:
plt.hist(df['Major_category'], bins=5, color='purple', edgecolor='black')
plt.title("Distribution of Major categories")
plt.xlabel("Major Category")
plt.ylabel("Frequency")
plt.show()

### **Scatterplots**

A scatter plot is used to visualize the relationship between two variables where each point is one observation. This type of plot is made using the `scatter()` function. It needs two arrays of the same length, one for the values of the x-axis, and one for values on the y-axis. It’s helpful for identifying trends, clusters, or outliers.

In this example, we’ll plot the `Men` column on the x-axis and the `Women` column on the y-axis.

**Step 1:** Create a simple Scatterplot.



In [None]:
plt.scatter(df["Men"], df["Women"])
plt.show()

**Step 2**: Let's customize our figure so that it is easier to interpret.

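As with other plots, you can add labels and a title with `plt.xlabel()`, `plt.ylabel()`, and `plt.title()`. `plt.scatter()` also accepts the `color` and `marker` arguments to change the color and marker type.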


In [None]:
plt.scatter(df['Men'], df['Women'], color='green', marker='*')
plt.xlabel('Men')
plt.ylabel('Women')
plt.title('Men vs. Women in the dataset')
plt.show()

**Try it for yourself!** Make a scatterplot between Share_women and Unemployment_rate.

In [None]:
### Write your code below (in place of ...)
...

##### More Scatterplot Customization
**Adjusting Size of Points**: Unlike simple line plots, `plt.scatter()` does not have an `ms` argument to modify marker size. Instead, use the `s` argument, which sets the marker area; larger values produce larger points.


In [None]:
plt.scatter(df['Men'], df['Women'], color='green', marker='*', s=300)
plt.xlabel('Men')
plt.ylabel('Women')
plt.title('Men vs. Women in the dataset')
plt.show()

**Adjusting Transparency**: When you have a large dataset, points on a scatterplot can overlap, making it difficult to get a full understanding of your data. If this occurs, you may want to adjust transparency using `alpha`. The closer the value is to 1, the more opaque (less transparent) your points will be. The closer the value is to 0, the more transparent your points will be.


In [None]:
plt.scatter(df['Men'], df['Women'], color='green', marker='*', s=300, alpha=0.5)
plt.xlabel('Men')
plt.ylabel('Women')
plt.title('Men vs. Women in the dataset')
plt.show()