# Introduction to Data Visualization with Python

## Lecture Notes and in-class exercises

Welcome to the final and **the coolest** part of this course! 😎 I'm not saying that the previous topics aren't cool. But visualizations are the bridges that can connect your data analytics skills with the outside world. You wouldn't print out a DataFrame to report your findings, would you? 

![table-form-vs-plotted](https://github.com/bdi475/images/blob/main/lecture-notes/dataviz-python/dataviz-tabular-vs-plotted-01.jpeg?raw=true)

👆 Image source: https://towardsdatascience.com/data-visualization-for-machine-learning-and-data-science-a45178970be7

### 📊 Dataviz libraries

The de facto standard library for data visualization in introductory Data Analytics courses is [matplotlib](https://matplotlib.org/), which is a low-level visualization library for Python. [seaborn](https://seaborn.pydata.org/) is another popular library built on top of [matplotlib](https://matplotlib.org/). Below are some of the most popular and battle-tested data visualization libraries. All of these libraries are free and open source.

- [matplotlib](https://matplotlib.org/): Low-level visualization library for Python
- [seaborn](https://seaborn.pydata.org/): High-level visualization library for Python **based on matplotlib**
- [bokeh](https://bokeh.org/): Interactive visualizations for modern web browsers
- [plotnine](https://plotnine.readthedocs.io/): Grammar-of-graphics plotting for Python, modeled on [ggplot2](https://ggplot2.tidyverse.org/)
- [plotly](https://plotly.com/python/): Interactive visualizations - supports Python, JavaScript, and R
- [altair](https://altair-viz.github.io/): Declarative visualization in Python based on [Vega-Lite](https://vega.github.io/vega-lite/)

### Plot.ly

We'll use [plotly](https://plotly.com/python/), which provides both low-level and high-level interfaces to create publication-ready graphs.

![plotly logo image](https://github.com/bdi475/images/blob/main/lecture-notes/dataviz-python/plotly-logo-640px.png?raw=true)

Import modules used by **🧭 Check Your Work** sections and the autograder.

In [None]:
import unittest
import base64
import plotly
tc = unittest.TestCase()
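
The **🧭 Check your work** cells call assertion methods on `tc`. As a minimal sketch of how those calls behave (the object name `tc_sketch`, the values, and the hint message are made up for illustration), a passing assertion is silent, while a failing one raises an `AssertionError` that carries the hint message.

```python
import unittest

# Separate object so the notebook's own `tc` is left untouched
tc_sketch = unittest.TestCase()

# A passing assertion returns None and prints nothing
tc_sketch.assertEqual(1 + 1, 2, 'This message is never shown')

# A failing assertion raises AssertionError with the custom message appended
try:
    tc_sketch.assertEqual(1 + 1, 3, 'Hypothetical hint shown to the student')
except AssertionError as err:
    failure_message = str(err)
    print(failure_message)
```

If a check cell raises an error, read the message at the end of the traceback first; it usually points at the task that still needs work.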

---

## ✨ Visualize your dataset

▶️ First, run the code below to ensure you're using the correct version of plotly.

---

### 🎯 Pre-exercise: Import Packages

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.
    3. `plotly.express`: Use alias `px`.
    4. `plotly.graph_objects`: Use alias `go`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

#### 🧭 Check your work

In [None]:
import sys
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported Pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')
tc.assertEqual(plotly.__version__[:2], '5.', 'Plotly version mismatch, expected 5.x.x')
tc.assertIsNotNone(go.Figure, 'Check whether you have correctly imported plotly.graph_objects with an alias go.')
tc.assertIsNotNone(px.scatter, 'Check whether you have correctly imported plotly.express with an alias px.')

---
### 📌 Import dataset

Today, we work with an HR Dataset to uncover insights about HR metrics, measurement, and analytics. The data has been downloaded from [https://rpubs.com/rhuebner/hr_codebook_v14](https://rpubs.com/rhuebner/hr_codebook_v14) without any modification.

▶️ Run the code below to import an HR Dataset. 🐷👧👨🏻‍🦰👩🏼‍🦳👳🏽‍♂️👩🏾‍🦲🐼.

In [None]:
# Show up to 50 columns so that no column is hidden in the output
pd.set_option('display.max_columns', 50)

df_hr = pd.read_csv('https://github.com/bdi475/datasets/raw/main/HR-dataset-v14.csv')

display(df_hr)

--- 

## 📦 Box plot: summary of distribution

Box plots divide the data into four sections that each contain about 25% of the data. They are useful for quickly reading the distribution of the data from Q1, Q2 (median), and Q3.

![box plot explanation](https://github.com/bdi475/images/blob/main/lecture-notes/dataviz-python/box-plot-explanation-01.png?raw=true)
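
The quartiles that define a box plot can also be computed by hand. The sketch below uses a made-up array named `values` and NumPy's default quantile interpolation; plotly may compute quartiles with a slightly different method, so the numbers on a plot can differ a little for small samples. The 1.5 × IQR fences are a common convention for whiskers, shown here as an assumption rather than a description of any particular figure.

```python
import numpy as np

# Hypothetical sample of nine values, sorted for readability
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = np.quantile(values, 0.25)      # about 25% of the data lies at or below Q1
median = np.quantile(values, 0.50)  # Q2
q3 = np.quantile(values, 0.75)      # about 75% of the data lies at or below Q3
iqr = q3 - q1                       # width of the box

# Common convention: points beyond these fences are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"Q1: {q1}, Median: {median}, Q3: {q3}, IQR: {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
```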

---
### 📌 GPA example

Before using the HR dataset, let's create a simple box plot of 13 GPAs. NumPy is used here to calculate summary statistics.

In [None]:
gpa = np.array([3.33, 2.67, 3.0, 3.67, 3.67, 2.33, 3.0, 3.0, 2.67, 4.0, 3.33, 2.67, 4.0])
gpa

In [None]:
fig = px.box(x=gpa, title='GPA Distribution (Horizontal Box Plot)')
fig.show()

In [None]:
print(f"Mean: {np.mean(gpa)}")
print(f"Median: {np.median(gpa)}")
print(f"Q1: {np.quantile(gpa, 0.25)}")
print(f"Q3: {np.quantile(gpa, 0.75)}")
print(f"IQR: {np.quantile(gpa, 0.75) - np.quantile(gpa, 0.25)}")

### 🦄 Findings

- Median is `3`.
- Minimum is `2.33`.
- Maximum is `4`.
- Interquartile range is `1`.
    - You can calculate this value by subtracting Q1 from Q3: `3.67 - 2.67`.
- There is a positive (right) skew.
    - This is also suggested by the mean (about `3.18`) being greater than the median (`3`).
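
As a rough sketch, the mean-versus-median comparison can be written out in code. The names `gpa_sketch` and `skew_hint` are illustrative only, and the comparison is a rule of thumb rather than a formal skewness test.

```python
import numpy as np

# Same 13 GPA values as above, stored under a separate name
gpa_sketch = np.array([3.33, 2.67, 3.0, 3.67, 3.67, 2.33, 3.0, 3.0, 2.67, 4.0, 3.33, 2.67, 4.0])

mean_gpa = np.mean(gpa_sketch)
median_gpa = np.median(gpa_sketch)

# Rule of thumb: a mean above the median hints at a longer right tail
if mean_gpa > median_gpa:
    skew_hint = 'positive'
elif mean_gpa < median_gpa:
    skew_hint = 'negative'
else:
    skew_hint = 'none'

print(f"Mean: {mean_gpa:.2f}, Median: {median_gpa:.2f}, Skew hint: {skew_hint}")
```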

---

### 🎯 Exercise 1: Salary box plot (vertical)

#### 👇 Tasks

- ✔️ Draw a vertical box plot of `Salary` in `df_hr`.
- ✔️ Store your figure in a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

#### 🔑 Sample output

![Salary distribution vertical](https://github.com/bdi475/images/blob/main/exercises/intro-to-dataviz/salary_dispersion_box_plot_vertical.png?raw=true)

#### 🧭 Check your work

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'box', 'Not a box plot')
tc.assertEqual(fig.data[0].orientation, 'v', 'Your plot should have a vertical orientation')
np.testing.assert_array_equal(fig.data[0].y, df_hr['Salary'], 'Incorrect data')

---

### 🎯 Exercise 2: Salary box plot (horizontal)

#### 👇 Tasks

- ✔️ Draw a horizontal box plot of `Salary`.
- ✔️ Store your figure in a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

#### 🔑 Sample output

![Salary distribution horizontal](https://github.com/bdi475/images/blob/main/exercises/intro-to-dataviz/salary_dispersion_box_plot_horizontal.png?raw=true)

#### 🧭 Check your work

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'box', 'Not a box plot')
tc.assertEqual(fig.data[0].orientation, 'h', 'Your plot should have a horizontal orientation')
np.testing.assert_array_equal(fig.data[0].x, df_hr['Salary'], 'Incorrect data')

---

### 🎯 Exercise 3: Salary distribution by citizenship status

#### 👇 Tasks

- ✔️ Draw horizontal box plots of `Salary` by `CitizenDesc`.
- ✔️ Store your figure in a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

#### 🔑 Sample output

![Salary distribution by citizenship status box plots](https://github.com/bdi475/images/blob/main/exercises/intro-to-dataviz/salary_dispersion_by_citizenship_status_box_plots_horizontal.png?raw=true)

#### 🧭 Check your work

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'box', 'Not a box plot')
tc.assertEqual(fig.data[0].orientation, 'h', 'Your plot should have a horizontal orientation')
np.testing.assert_array_equal(fig.data[0].x, df_hr['Salary'], 'Incorrect x-axis data')
np.testing.assert_array_equal(fig.data[0].y, df_hr['CitizenDesc'], 'Incorrect y-axis data')

---

### 🎯 Exercise 4: Salary distribution by performance

#### 👇 Tasks

- ✔️ Draw horizontal box plots of `Salary` by `PerformanceScore`.
- ✔️ Store your figure in a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

#### 🔑 Sample output

![Salary distribution by performance score box plots](https://github.com/bdi475/images/blob/main/exercises/intro-to-dataviz/salary_dispersion_by_performance_score_box_plots_horizontal.png?raw=true)

#### 🧭 Check your work

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'box', 'Not a box plot')
tc.assertEqual(fig.data[0].orientation, 'h', 'Your plot should have a horizontal orientation')
np.testing.assert_array_equal(fig.data[0].x, df_hr['Salary'], 'Incorrect x-axis data')
np.testing.assert_array_equal(fig.data[0].y, df_hr['PerformanceScore'], 'Incorrect y-axis data')

---

### 🎯 Exercise 5: Salary distribution by department

#### 👇 Tasks

- ✔️ Draw horizontal box plots of `Salary` by `Department`.
- ✔️ Store your figure in a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Set the height of your figure to `600`.
- ✔️ Display the figure using `fig.show()`.

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

#### 🔑 Sample output

![Salary distribution by department box plots](https://github.com/bdi475/images/blob/main/exercises/intro-to-dataviz/salary_dispersion_by_department_box_plots_horizontal.png?raw=true)

#### 🧭 Check your work

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'box', 'Not a box plot')
tc.assertEqual(fig.data[0].orientation, 'h', 'Your plot should have a horizontal orientation')
tc.assertEqual(fig.layout.height, 600, 'Incorrect height')
np.testing.assert_array_equal(fig.data[0].x, df_hr['Salary'], 'Incorrect x-axis data')
np.testing.assert_array_equal(fig.data[0].y, df_hr['Department'], 'Incorrect y-axis data')

--- 

## 🟪 Histogram: frequency distribution

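Histograms display frequency distributions: numeric values are grouped into bins, and each bin is drawn as a bar whose height is the number of values that fall into it. The example below plots 500 random samples drawn from a standard normal distribution.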

In [None]:
fig = px.histogram(x=np.random.randn(500))
fig.show()
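
To see what a histogram counts, here is a minimal sketch that bins a small made-up array with `np.histogram`. The array `values` and the choice of 4 bins are assumptions for illustration; plotly picks its own bins automatically, so a plotted histogram of the same data may use different bin edges.

```python
import numpy as np

# Hypothetical sample of nine values
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Split the range [1, 5] into 4 equal-width bins and count the values in each
counts, bin_edges = np.histogram(values, bins=4)

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} value(s)")
```

The counts are the bar heights that a histogram with these exact bins would draw.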

---

### 🎯 Exercise 6: Salary histogram

#### 👇 Tasks

- ✔️ Draw a histogram of `Salary` in `df_hr`.
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

#### 🔑 Sample output

![Salary distribution histogram](https://github.com/bdi475/images/blob/main/exercises/intro-to-dataviz/salary_histogram.png?raw=true)

#### 🧭 Check your work

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'histogram', 'Not a histogram')
tc.assertEqual(fig.data[0].orientation, 'v', 'Your plot should have a vertical orientation')
np.testing.assert_array_equal(fig.data[0].x, df_hr['Salary'], 'Incorrect data')

---

### 🎯 Exercise 7: Number of absences histogram

#### 👇 Tasks

- ✔️ Draw a histogram of `Absences` in `df_hr`.
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

#### 🔑 Sample output

![Number of absences histogram](https://github.com/bdi475/images/blob/main/exercises/intro-to-dataviz/number_of_absences_histogram.png?raw=true)

#### 🧭 Check your work

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'histogram', 'Not a histogram')
tc.assertEqual(fig.data[0].orientation, 'v', 'Your plot should have a vertical orientation')
np.testing.assert_array_equal(fig.data[0].x, df_hr['Absences'], 'Incorrect data')

---

### 🎯 Exercise 8: Salary histograms by gender

#### 👇 Tasks

- ✔️ Draw overlaid histograms of `Salary` in `df_hr` by `GenderID`.
- ✔️ Full code is provided below.

```python
# BEGIN SOLUTION
fig = go.Figure()
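fig.add_trace(
    go.Histogram(
        # GenderID 0 pairs with Sex 'F' in this dataset; confirm with
        # pd.crosstab(df_hr['GenderID'], df_hr['Sex'])
        x=df_hr[df_hr['GenderID'] == 0]['Salary'],
        name='Female'
    )
)

fig.add_trace(
    go.Histogram(
        x=df_hr[df_hr['GenderID'] == 1]['Salary'],
        name='Male'
    )
)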

# Overlay both histograms
fig.update_layout(barmode='overlay')

# Reduce opacity to see both histograms
fig.update_traces(opacity=0.6)
fig.show()
# END SOLUTION
```

In [None]:
# YOUR CODE BEGINS

# YOUR CODE ENDS

#### 🧭 Check your work

In [None]:
tc.assertEqual(len(fig.data), 2, 'There must be two plots in your figure')
tc.assertEqual(fig.data[0].type, 'histogram', 'Not a histogram')
tc.assertEqual(fig.data[1].type, 'histogram', 'Not a histogram')
np.testing.assert_array_equal(fig.data[0].x, df_hr[df_hr['GenderID'] == 0]['Salary'], 'Incorrect data')
np.testing.assert_array_equal(fig.data[1].x, df_hr[df_hr['GenderID'] == 1]['Salary'], 'Incorrect data')