## COMM 187 (160DS): Data Science in Communication Research -- Spring 2024

## Coding Lab #8: Introduction to Data Visualization with Python
**Wednesday, May 22, 2024** \
*Hannah Overbye-Thompson*

## Data Visualization Introduction
One of the most important things we do with programming is visualizing data. 
Data visualization allows us to: 
- communicate what we have found with a larger audience
- inspect patterns and trends in our data 
- identify outliers in our data 
- compare variables 

In Python, we will use the `matplotlib` library. Learning to plot with `matplotlib` is a fundamental skill for anyone interested in data visualization.

Matplotlib is an integral part of the scientific Python ecosystem and is used for visualization. It is built on NumPy and provides a MATLAB-like interface for plotting. It was originally developed by John D. Hunter as an open-source alternative usable with Python (Pajankar, 2021). You can read more about Matplotlib here: https://matplotlib.org/

You can think of creating a graphic in Python as being much like painting. You start with an empty canvas. Every time you use a graphics function, it adds new elements to your canvas (another apt analogy might be the transparencies that were used with overhead projectors in the 90s). Later on, you can add more elements on top of your initial plot if you want.
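
To make the canvas analogy concrete, here is a minimal sketch that layers three elements onto one figure. The `hours` and `pages_read` values are made up for illustration and are not part of today's data.

```python
import matplotlib.pyplot as plt

# Made-up values, used only to illustrate layering
hours = [1, 2, 3, 4, 5]
pages_read = [10, 18, 24, 35, 41]

lines = plt.plot(hours, pages_read)                   # first layer: a line on the empty canvas
points = plt.scatter(hours, pages_read, color="red")  # second layer: points drawn on top of the line
plt.title("Pages Read Over Time")                     # third layer: a title
plt.show()
```

Each call adds to the same figure until `plt.show()` displays it.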

### Today's lesson plan: 
- Practice plotting with some sample data 
- Make your own plot of some real world password data

## Let's Practice Visualizing Some Data! 

### Line Plot

<b> Example: Coffee Consumption Over a Week </b>
<b>Objective:</b> To plot the number of cups of coffee consumed each day over a week.

<b>Step 1:</b> Import the `pyplot` module from `matplotlib` as `plt` and create some data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
cups = [3, 4, 2, 3, 5, 6, 4]  # cups of coffee
coffee_type = ["Espresso", "Latte", "Americano", "Cappuccino", "Mocha", "Flat White", "Cold Brew"]
coffee_prices = [2.5, 3.5, 3.0, 4.0, 4.5, 4.0, 4.0]  # in dollars
satisfaction_level = [4, 3, 5, 4, 5, 4, 3]  # satisfaction level out of 5

In [None]:
data = {
    'Day': days,
    'Cups': cups,
    'Coffee Type': coffee_type,
    'Price': coffee_prices,
    'Satisfaction Level': satisfaction_level
}

df = pd.DataFrame(data)

<b>Step 2:</b> Create a **Single Line Plot (or Line plot)**

When a figure contains only one line drawn with the `plot()` function, it is known as a single-line plot.

In [None]:
plt.plot(df["Day"], df["Cups"])
plt.show() # you need to run plt.show to actually see your plot

<b>Step 3:</b> Customize your plot! 
`matplotlib` allows you to customize your plot to make it more readable. This is important because the purpose of a plot is to be readable to others.

In [None]:
plt.plot(df["Day"], df["Cups"])
plt.title("Coffee Consumption Over a Week")
plt.xlabel("Day of the Week")
plt.ylabel("Cups of Coffee")
plt.xticks(rotation=45) 
plt.show()

A plot or figure typically includes a horizontal X axis and a vertical Y axis, a title, an X label, a Y label, and X and Y tick marks.

These axes provide a framework for plotting data points. The X axis typically represents the independent variable, while the Y axis represents the dependent variable. They allow viewers to interpret the relationship between the variables. The title offers a brief description of what the plot is about, giving viewers immediate context. It summarizes the main point or focus of the data being presented, while the labels on the X and Y axes identify the variables being measured and the units of measurement. 

`matplotlib` allows you to make different types of plots beyond a line plot. Let's try making a scatter plot, a boxplot and a histogram.

**Question:** Create a single line plot using `Price` as your dependent variable and `Coffee Type` as your independent variable. Rotate the xticks 90 degrees. Interpret your results.

In [None]:
### Write your code below (in place of ...)
...

### Scatter Plot
You can also visualize your data using scatter plots. Typically, scatter plots are used to visualize a pair of variables. One variable is assigned to the x-axis, and the other to the y-axis, with each x-y pair represented by a point on the plot. The x and y arrays must be the same size. You can also customize the color of the points by using `color = `.

In [None]:
plt.scatter(df["Day"], df["Cups"], color='brown')  # Use brown color for points, you can use whatever color you want
plt.title("Coffee Consumption Scatter Plot")
plt.xlabel("Day of the Week")
plt.ylabel("Cups of Coffee")
plt.show()
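
The requirement that x and y be the same size can be checked with a small sketch. The `x` and `y` lists below are made up, and `y` is deliberately one value short, so `plt.scatter` should raise a `ValueError` instead of drawing anything.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [10, 20]  # one value short on purpose

try:
    plt.scatter(x, y)
    size_error = None
except ValueError as err:
    size_error = err
    print("Could not plot:", err)

plt.close()  # discard the empty figure left behind by the failed call
```

If you see this kind of error with your own data, compare `len(x)` and `len(y)` before plotting.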

**Question:** Create a scatter plot with `Price` as your dependent variable and `Satisfaction Level` as your independent variable. Make the points a color of your choosing.

In [None]:
### Write your code below (in place of ...)
...

**Question:** Create a scatter plot with `Price` as your dependent variable and `Coffee Type` as your independent variable. Use this documentation to color the points according to `Satisfaction Level`: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

In [None]:
### Write your code below (in place of ...)
...

### Histograms

A histogram is a graphical representation of the distribution of a dataset. It divides the data into intervals, or bins, and displays the frequency or count of data points within each bin. The height of each bar in the histogram represents the number of data points that fall into that bin, providing a visual way to understand the underlying distribution, spread, and central tendency of the data.

In [None]:
plt.hist(df["Cups"], color='pink')  # Use pink bars; matplotlib picks the bins by default
plt.title("Histogram of Coffee Consumption")
plt.xlabel("Cups of Coffee")
plt.ylabel("Frequency")
plt.show()
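
To see how bins work, here is a sketch that sets the bin edges by hand using the `bins` argument of `plt.hist`. The values are the same as the `Cups` column above; the particular edges are one choice for illustration, not the only reasonable one.

```python
import matplotlib.pyplot as plt

cups = [3, 4, 2, 3, 5, 6, 4]  # same values as the Cups column

# Each bin covers [left edge, right edge); the last bin also includes its right edge
bin_edges = [2, 3, 4, 5, 6, 7]

# plt.hist returns the count in each bin, the bin edges, and the drawn bars
counts, edges, bars = plt.hist(cups, bins=bin_edges, color="pink", edgecolor="black")
plt.title("Histogram of Coffee Consumption (one bin per cup count)")
plt.xlabel("Cups of Coffee")
plt.ylabel("Frequency")
plt.show()

print(counts)  # number of days that fall into each bin
```

With one bin per whole number of cups, the height of each bar is simply the number of days with that many cups.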

**Question:** Create a histogram of `Price`. Make the bars a color of your choosing.

In [None]:
### Write your code below (in place of ...)
...

## Plotting Real Data
There are almost endless possibilities for the types of plots you can create with `matplotlib`. But what we really want to use plotting for is to get insights from real data. Today we will be using a dataset consisting of some commonly used passwords.

<img src="./images/XKCD_passwords.png" width=700px height=500px />

Whenever you decide to use real data, the first step after reading in your data is to get some more information about it. Let's see what information we have from this password data...

In [None]:
pass_data = pd.read_csv("./data/passwords.csv")
print(pass_data.head())

It looks like we have a lot of data about the strength of these passwords, the time to crack the password and the category of these passwords. 

<img src="./images/password_info.png" width=700px height=400px />
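
Beyond `head()`, a few other pandas calls are handy for a first look at a dataset. The sketch below rebuilds part of the coffee data from earlier so that it runs on its own; the same calls should work on `pass_data`.

```python
import pandas as pd

# Part of the coffee data from earlier in this lab
coffee = pd.DataFrame({
    "Day": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
    "Cups": [3, 4, 2, 3, 5, 6, 4],
    "Price": [2.5, 3.5, 3.0, 4.0, 4.5, 4.0, 4.0],
})

print(coffee.shape)       # (number of rows, number of columns)
print(coffee.dtypes)      # data type of each column
print(coffee.describe())  # summary statistics for the numeric columns
coffee.info()             # column names, non-null counts, and dtypes in one report
```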

Once you have some information about your data, you need to come up with an interesting research question. In this case, I want to know which categories passwords most commonly fall into.

To start to answer this question, let's start by counting the number of times each category appears. 

In [None]:
category_counts = pass_data['category'].value_counts() # first let's get a count of the different categories
print(category_counts)

Next, I want to plot this data using a bar plot. When I did this, I got the following figure. 

<img src="./images/password_category_distribution1.png" width=700px height=600px />

### ✨  Coding challenge ✨

Take the next few minutes and try to recreate the above plot. 

<b>Hint0:</b> Use this link to some useful documentation: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html

<b>Hint1:</b> Remember that when working with a Series, you can use `.index` and `.values` to get the index and the values of your data.
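
As a small illustration of this hint, here is a sketch that uses the satisfaction levels from the coffee data rather than the password data. The names `satisfaction` and `level_counts` are chosen just for this example.

```python
import pandas as pd

satisfaction = pd.Series([4, 3, 5, 4, 5, 4, 3])  # Satisfaction Level values from the coffee data
level_counts = satisfaction.value_counts()       # a Series: index = unique values, values = counts

print(level_counts.index)   # the unique satisfaction levels
print(level_counts.values)  # how many times each level appears
```

The index and the values line up one-to-one, which is the kind of paired input that plotting functions expect.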

In [None]:
### Write your code below (in place of ...)
...

<b> References </b>

Pajankar, A. (2021). *Hands-On Matplotlib: Learn Plotting and Visualizations with Python 3* (1st ed.). Apress L. P. https://doi.org/10.1007/978-1-4842-7410-1