<a target="_blank" href="https://colab.research.google.com/github/m-yuhas/data-visualization-workshop/blob/main/Data_Visualization_Workbook_(Student).ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# NTU Data Visualization Workshop Workbook (Student Version)
Welcome to the NTU Data Visualization Workshop.  This notebook contains examples of several commonly used plots.  The order of the notebook follows the order of the accompanying presentation.  Each section ends with an exercise that you will need to complete.  Example solutions will be uploaded to GitHub after the session, but remember: data visualization is also an art, and the solutions you come up with in class may be better.

## Getting Started
This notebook focuses on data visualization in Python.  We will be using the following libraries:
- **Seaborn** - A library which provides many plot types and lots of options for customization.
- **Matplotlib** - Provides much of the same plotting functionality as MATLAB.
- **NumPy** - Provides array and matrix datatypes for Python.
- **Pandas** - Allows easy loading of datasets in CSV or Excel format.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

import seaborn.objects as so


## Visualizing Relationships


### Scatter Plots
One of the easiest ways to visualize the relationship between two variables is a scatter plot.  Although you cannot make any statistical arguments by looking at a scatter plot, it is a useful tool both for exploring data and for demonstrating relationships.

Suppose we believe there is a relationship between vehicle age and mpg (miles-per-gallon: a measure of fuel efficiency in the US).  First, we need a dataset with a list of vehicles, the year they were built, and their respective mpg.

In [None]:
mpg_dataset = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/mpg.csv")
mpg_dataset

This dataset is a list of vehicles and their respective stats.  With Seaborn, we can use the objects API to easily plot model year (year of manufacture) versus mpg.

1. Start by instantiating a ```Plot``` object and specifying the ```data``` to plot along with the features represented by the ```x``` and ```y``` axes.  If we want to further segregate the data, we can specify another feature with the ```color``` argument.
2. Add a ```Dots()``` plot.
3. Overwrite labels with ```.label()``` to make them look more professional.
4. Use the ```show()``` method to display the plot.  (This is not required in Google Colab, but will be necessary if you run this script independently with Python.)

**Reflection:** *Can you see any interesting trends in the plotted data?*




In [None]:
(
  so.Plot(data=mpg_dataset, x="model_year", y="mpg", color="origin")
    .add(so.Dots())
    .label(x="Model Year", y="Miles per Gallon", color=str.capitalize)
    .show()
)

### Aggregating Data
Interestingly, the colors suggest that while fuel efficiency has increased over the years, cars made in Europe or Japan are generally more efficient than those made in the US.  However, the data is very noisy.  Seaborn gives us the ```Agg()``` object to aggregate data along the dimensions we wish to plot.  Here, this means taking all the data points for each model year and origin and returning their average.

**Reflection:** *What is the difference between aggregating by ```mean``` and ```median```?*

In [None]:
(
  so.Plot(data=mpg_dataset, x="model_year", y="mpg", color="origin")
    .add(so.Dots(), so.Agg("mean"))
    .label(x="Model Year", y="Miles per Gallon", color=str.capitalize)
    .show()
)
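
To make the aggregation step concrete, here is a minimal sketch of roughly what an aggregation such as ```so.Agg("mean")``` computes, written as a plain Pandas ```groupby```.  The ```toy``` table and its values are made up for illustration (they are not taken from the mpg dataset), and Seaborn's internal implementation may differ in detail.

```python
import pandas as pd

# Hypothetical toy data (NOT the real mpg dataset): 1970 contains one outlier.
toy = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 71],
    "mpg": [10.0, 12.0, 38.0, 20.0, 22.0, 24.0],
})

# Collapse all rows that share a model year into a single number per year.
summary = toy.groupby("model_year")["mpg"].agg(["mean", "median"])
print(summary)
```

In this toy table the single outlier pulls the 1970 mean well above the 1970 median, while the two statistics agree for 1971.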

### Line Charts
Above, we see the average fuel efficiency for all cars from a specific region for each year.  However, it is difficult to visualize the trend over time.  For this we will turn to a line chart.  With Seaborn objects, this is easy to accomplish with the ```Line()``` object.

**Reflection:** *Is it better to use linear interpolation to generate the lines between data points or use smooth curves like a spline?*

In [None]:
(
  so.Plot(data=mpg_dataset, x="model_year", y="mpg", color="origin")
    .add(so.Line(), so.Agg("mean"))
    .label(x="Model Year", y="Miles per Gallon", color=str.capitalize)
    .show()
)
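
Connecting neighboring points with straight segments amounts to linear interpolation.  The sketch below uses NumPy's ```np.interp``` to show what a straight-line segment implies about the years between two data points.  The years and averages are hypothetical values chosen for illustration only.

```python
import numpy as np

# Hypothetical yearly averages (NOT taken from the mpg dataset).
years = np.array([70, 72, 74])
avg_mpg = np.array([20.0, 24.0, 23.0])

# Linear interpolation reads values off the straight segments between points.
query_years = np.array([71, 73])
estimates = np.interp(query_years, years, avg_mpg)
print(estimates)
```

A smooth curve such as a spline would give different in-between values, so the choice of interpolation is itself a claim about data you did not measure.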

### Confidence Intervals
The chart above may be misleading.  For instance, not every car manufactured in Europe was more efficient than every car manufactured in the United States in 1978.  We can use confidence intervals to show how tight the distribution of fuel efficiency is for each region of origin for each year.

We use the ```Band()``` object to add a shaded region around each line.  The ```Est()``` object functions similarly to the ```Agg()``` object we used earlier, except that instead of returning a single number, it returns an upper and a lower bound on the confidence interval (in this case 95%).

**Reflection:** *What does the shaded region (95% confidence interval) represent?*

In [None]:
(
  so.Plot(data=mpg_dataset, x="model_year", y="mpg", color="origin")
    .add(so.Line(), so.Agg("mean"))
    .add(so.Band(), so.Est(errorbar=("ci", 95)))
    .label(x="Model Year", y="Miles per Gallon", color=str.capitalize)
    .show()
)
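
Seaborn estimates this kind of interval by bootstrapping: resampling the data with replacement many times and looking at the spread of the resampled means.  The sketch below illustrates the idea with NumPy only.  The ```sample``` values, the seed, and the number of resamples are assumptions made for illustration, and Seaborn's own implementation may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical mpg values for one region in one year (NOT real data).
sample = np.array([18.0, 21.0, 24.5, 26.0, 19.5, 30.0, 22.0, 27.5])

# Resample with replacement many times and record the mean of each resample.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

# The middle 95% of the resampled means gives the interval bounds.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={sample.mean():.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Note that the interval describes the uncertainty of the mean, not the range that individual cars fall into.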

### Exercise 1: Visualizing the Stock Market
The code below loads a dataset of the Dow Jones Industrial Average (a stock market index) between 1914 and 1968.  Can you plot the index price over time?

In [None]:
dowjones = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/dowjones.csv", parse_dates=["Date"])
dowjones

In [None]:
# Your code here
# HINT: this data is already a time series, so you don't need to use an aggregator.

## Visualizing Amounts and Proportions

For this series of exercises we will use the Titanic dataset.  The dataset is available in CSV form and contains information about all the passengers on the infamous steam ship, Titanic. Run the code below to load the dataset and see what features (columns) are available.

In [None]:
titanic = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/titanic.csv")
titanic

### Bar Charts
One of the simplest ways to visualize amounts is to use a bar chart.  Bar charts are great at comparing raw values from different categories and give a better feeling about the difference between amounts than tabular data.  In Seaborn, we can use the ```Count()``` object to get a count of all data with a particular feature, in this case passenger class, which we will plot along the ```y``` axis.  The bar chart itself can be drawn using the ```Bar()``` object.

**Reflection:** *Which is better, horizontal or vertical bars?*

In [None]:
(
  so.Plot(titanic, y="class")
    .add(so.Bar(), so.Count())
    .scale(y=so.Nominal(order=["First", "Second", "Third"]))
    .label(y="Class", x="Total Pax.")
    .show()
)
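
The length of each bar is simply the number of rows in each category.  As a rough sketch of that counting step, the same numbers can be obtained with Pandas' ```value_counts()```.  The ```passengers``` table below is a made-up miniature, not the real Titanic data.

```python
import pandas as pd

# Hypothetical mini passenger list (NOT the real Titanic data).
passengers = pd.DataFrame({
    "class": ["Third", "First", "Third", "Second", "Third", "First"],
})

# One count per category: this is the length of each bar.
counts = passengers["class"].value_counts()

# value_counts() sorts by frequency, so restore a meaningful category order.
counts = counts.reindex(["First", "Second", "Third"])
print(counts)
```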

### Bar Charts Divided by Category
We can use the ```color``` argument in ```Plot()``` to segregate our data by another feature, for example, sex.  Note that in Seaborn, the bars will be drawn on top of each other unless we use ```Dodge()``` to make them appear side by side.

**Reflection:** *Is it ever possible for bars to extend in the negative direction?*

In [None]:
(
  so.Plot(titanic, y="class", color="sex")
    .add(so.Bar(), so.Count(), so.Dodge())
    .scale(y=so.Nominal(order=["First", "Second", "Third"]))
    .label(y="Class", x="Total Pax.")
    .show()
)

### Stacked Bars
Placing the bars side by side gives us a good feeling for the proportion of males to females in each class, but it makes it difficult to visualize the total class-to-class difference across both sexes.  We have the option of stacking bars with the ```Stack()``` object to emphasize this.

**Reflection:** *How can you better illustrate that a ratio between two classes is changing, not just the total amount?*

In [None]:
(
  so.Plot(titanic, y="class", color="sex")
    .add(so.Bar(), so.Count(), so.Stack())
    .scale(y=so.Nominal(order=["First", "Second", "Third"]))
    .label(y="Class", x="Total Pax.")
    .show()
)
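
The numbers behind dodged and stacked bars can be tabulated with Pandas' ```crosstab```.  This is a minimal sketch on a made-up ```passengers``` table (not the real Titanic data): each cell is one bar segment, each row sum is the full height of a stacked bar, and ```normalize="index"``` turns the counts into within-class shares.

```python
import pandas as pd

# Hypothetical mini passenger list (NOT the real Titanic data).
passengers = pd.DataFrame({
    "class": ["First", "First", "Second", "Third", "Third", "Third"],
    "sex": ["female", "male", "male", "female", "male", "male"],
})

# One cell per (class, sex) pair: the length of each bar segment.
counts = pd.crosstab(passengers["class"], passengers["sex"])

# Row sums: the total length of each stacked bar.
totals = counts.sum(axis=1)

# Shares within each class: every row sums to 1.
shares = pd.crosstab(passengers["class"], passengers["sex"], normalize="index")
print(counts, totals, shares, sep="\n\n")
```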

### Histograms
In the example above there are three distinct classes: First, Second, and Third.  However, for some features like ```age``` there are an infinite number of possible values.  We can still get an idea of the number of people in each age range by creating artificial *bins*, each of which represents an age range, and assigning each sample to one of these bins.  Luckily, we can do this automatically with the Seaborn ```Hist()``` object.

**Reflection:** *What is the best number of bins to use in a histogram?*

In [None]:
(
  so.Plot(titanic, x="age", color="class")
    .add(so.Bar(), so.Hist(), so.Stack())
    .scale(color=so.Nominal(order=["First", "Second", "Third"]))
    .label(x="Age", y="Total Pax.")
    .show()
)
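
To see what binning does, here is a small sketch using NumPy's ```np.histogram``` on a handful of made-up ages (not the Titanic data).  The bin counts and range are arbitrary choices for illustration; the same values are binned twice to show how the number of bins changes the picture.

```python
import numpy as np

# Hypothetical ages (NOT the real Titanic data).
ages = np.array([2, 8, 19, 22, 25, 31, 38, 45, 52, 71])

# Four equal-width bins between 0 and 80: [0, 20), [20, 40), [40, 60), [60, 80].
counts, edges = np.histogram(ages, bins=4, range=(0, 80))
print(edges)   # bin boundaries
print(counts)  # number of samples that fall into each bin

# The same data with only two bins hides most of the shape.
coarse_counts, coarse_edges = np.histogram(ages, bins=2, range=(0, 80))
print(coarse_counts)
```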

### Exercise 2: Do People Tip More on the Weekend?
The code below loads a dataset of tip amounts with several features.  Can you plot a histogram of tip amount over the days of the week?

In [None]:
tips = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/tips.csv")
tips

In [None]:
# Your code here
# HINT: You will need to use a histogram or an aggregator depending on how you interpret "tip amount."

## Visualizing Distributions

### Box Plots
One of the easiest ways to visualize a distribution of data is with a box plot.  The box plot is a graphical representation of the five-number summary: minimum, first quartile, median, third quartile, and maximum.  This allows us to see the skewness and variance of a distribution in addition to its median value.

For this exercise, we will use the iris dataset, which contains measurements from various species of irises.

In [None]:
iris = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
iris

For distribution plots we can no longer use Seaborn's objects API.  Instead, we will use Seaborn's plotting functions, which draw directly onto Matplotlib figures.  The ```catplot``` function allows us to plot the distribution of a feature, ```x```, for each category in ```y```.  The function returns a ```FacetGrid``` object, a handle to the figure's axes that we can use to set the axis labels.

If you run this script outside of Google Colab, you will need to use the ```plt.show()``` command to display the figure.

**Reflection:** *What do the whiskers on the box plot represent?  What about the edges of the boxes?*

In [None]:
ax = sns.catplot(data=iris, x="petal_length", y="species", kind="box")
ax.set(xlabel="Petal Length", ylabel="Species")
ax.set_yticklabels(["Setosa", "Versicolor", "Virginica"])
plt.show()
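
The numbers a box plot is built from can be computed directly.  The sketch below uses NumPy on made-up values (not the iris data) and assumes the common convention, which as far as we know is also the default in Seaborn and Matplotlib, that whiskers extend to the most extreme points within 1.5 times the interquartile range (IQR) of the box, with anything beyond drawn as an outlier.

```python
import numpy as np

# Hypothetical measurements with one extreme value (NOT the iris data).
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Assumed whisker rule: reach at most 1.5 * IQR beyond the box edges.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the limits are drawn individually as outliers.
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Under this rule the whiskers do not always reach the true minimum and maximum of the data.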

### Violin Plots

Box plots can help us visualize the skewness of a distribution.  However, not all distributions are unimodal: some are multimodal, i.e., the probability density function (PDF) has more than one peak, and a box plot cannot show this.  To visualize it, we can use a violin plot, which shows an estimate of the PDF of each plotted distribution.  To create a violin plot with Seaborn, use the ```catplot``` function with parameter ```kind="violin"```.

**Reflection:** *How are discrete data points transformed into a continuous distribution?  Is this always accurate?*

In [None]:
# Keep the returned grid so the label calls below apply to this plot.
ax = sns.catplot(
    data=iris, x="petal_length", y="species",
    kind="violin"
)
ax.set(xlabel="Petal Length", ylabel="Species")
ax.set_yticklabels(["Setosa", "Versicolor", "Virginica"])
plt.show()

### Bivariate Plots
Sometimes we want to visualize a distribution over two numerical features.  An easy way to do this is with a bivariate plot.  We will use Seaborn's ```displot``` function to accomplish this.

**Reflection:** *How could we visualize distributions across more than three features?*

In [None]:
ax = sns.displot(data=iris, x="petal_length", y="petal_width", hue="species")
ax.set(xlabel="Petal Length", ylabel="Petal Width")
ax.legend.set_title("Species")
plt.show()

### Bivariate Contour Plots
Sometimes we want a smoother representation of the data.  A type of model called a kernel density estimate (KDE) models a discrete distribution as a continuous one by summing multiple normal distributions, each with a mean located at one of the data points.  In Seaborn we can use the ```kind="kde"``` argument to the ```displot``` function to create a contour plot based on such a KDE.

**Reflection:** *Does changing the kernel bandwidth impact how the data is presented?*

In [None]:
# Keep the returned grid so the label and legend calls below apply to this plot.
ax = sns.displot(data=iris, x="petal_length", y="petal_width", hue="species", kind="kde")
ax.set(xlabel="Petal Length", ylabel="Petal Width")
ax.legend.set_title("Species")
plt.show()
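
The idea behind a KDE is easiest to see in one dimension.  The sketch below places a normal curve on each data point and averages them, using NumPy only.  The helper ```gaussian_kde_1d```, the data points, and both bandwidths are hypothetical choices for illustration; Seaborn's own estimator chooses its bandwidth automatically and may differ in detail.

```python
import numpy as np

def gaussian_kde_1d(points, grid, bandwidth):
    """Average of normal PDFs, one centered on each data point."""
    z = (grid[:, None] - points[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical measurements forming two clusters (NOT the iris data).
points = np.array([1.0, 1.2, 1.5, 4.0, 4.3])
grid = np.linspace(-5, 10, 1501)

# A narrow kernel keeps the two clusters apart; a wide one smears them together.
narrow = gaussian_kde_1d(points, grid, bandwidth=0.2)
wide = gaussian_kde_1d(points, grid, bandwidth=1.5)

# Both curves are densities: the area under each is approximately 1.
dx = grid[1] - grid[0]
print(narrow.sum() * dx, wide.sum() * dx)
```

Plotting ```narrow``` and ```wide``` against ```grid``` shows how strongly the bandwidth shapes the resulting curve.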

### Exercise 3: Classifying Penguins
The code below loads a dataset of the bill and flipper length of several species of penguins.  Can you plot a kernel density estimate (KDE) of the distribution of flipper length and bill length for each species?

In [None]:
penguins = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/penguins.csv")
penguins

In [None]:
# Your code here
# HINT: Use displot with kind="kde", as in the Bivariate Contour Plots example.
# You can get a list of the available columns with list(penguins.columns).

## Visualizing More Than Two Dimensions


### Pivot Tables
Sometimes data in its raw tabular form does not lend itself well to visualization.  Consider the dataset of number of passengers per month per year below.  While we could easily plot this as a line graph, we may want to see if certain months are more popular for travel every year and if there are any year-to-year trends for the same month.

In [None]:
flights = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/flights.csv")
flights

We can use a pivot table to transform the data so that years are rows and months are columns.  To do this we will use Pandas' built-in ```pivot()``` method with ```index``` as year and ```columns``` as month.  The ```values``` argument names the column used to fill each cell of the new table.

In [None]:
flights = flights.pivot(index="year", columns="month", values="passengers")
flights
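
The reshaping is easier to follow on a tiny table.  The sketch below pivots a four-row ```long_table``` whose values are for illustration only.  It also shows one thing to watch for: when the column labels are plain strings, ```pivot()``` sorts them alphabetically, so calendar order has to be restored by selecting the columns explicitly.

```python
import pandas as pd

# Illustrative long-format table: one row per (year, month) pair.
long_table = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["January", "February", "January", "February"],
    "passengers": [112, 118, 115, 126],
})

# Years become rows, months become columns, passengers fill the cells.
wide_table = long_table.pivot(index="year", columns="month", values="passengers")
print(list(wide_table.columns))  # string labels come back sorted alphabetically

# Restore calendar order by listing the columns in the order we want.
ordered = wide_table[["January", "February"]]
print(ordered)
```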

### Heatmaps
While this table makes it easier to identify year-on-year trends for each month, it's still difficult to parse.  We can use a heatmap, which assigns brighter colors to higher values, just like brighter/darker areas of a flame. We use Seaborn's ```heatmap``` function below.

**Reflection:** *Which is better, lighter or darker colors representing higher values?*

In [None]:
ax = sns.heatmap(flights)
ax.set(xlabel="Month", ylabel="Year")
plt.show()

### Annotated Heatmap
While the visualization above gives us a good intuitive feel, sometimes we still want to visualize the actual numbers.  
1. We use ```annot=True``` to accomplish this.  The ```fmt=".0f"``` tells seaborn not to display any decimal point.  This can be used for rounding decimal numbers.
2. We can also remove the colorbar with ```cbar=False```.
3. You can change the color palette with the ```cmap``` argument.  Additional color palettes are available in the [Seaborn documentation](https://seaborn.pydata.org/tutorial/color_palettes.html).

**Reflection:** *How would you round the annotation data to the nearest tenth (e.g. 0.3333 becomes 0.3)?*

In [None]:
# Keep the returned axes so the labels below apply to this heatmap.
ax = sns.heatmap(flights, annot=True, fmt='.0f', cbar=False, cmap="viridis")
ax.set(xlabel="Month", ylabel="Year")
plt.show()
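
The ```fmt``` string uses the same notation as Python's built-in string formatting, so you can try format codes on plain numbers before passing them to ```heatmap```.  This is a small sketch with arbitrary example values.

```python
# ".0f": fixed-point notation with zero digits after the decimal point.
whole = format(112.0, ".0f")
print(whole)

# ".2f": fixed-point notation with two digits after the decimal point.
two_places = format(3.14159, ".2f")
print(two_places)

# ",.0f": as ".0f", with a comma as the thousands separator.
with_commas = format(1234.5678, ",.0f")
print(with_commas)
```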

### Cartograms
A specific type of heatmap called a choropleth projects heatmap values onto a map.  For example, suppose we wanted to visualize world population by country.

First, we will need a special library called Geopandas.

In [None]:
import geopandas

After importing Geopandas, we can load a dataset.  Notice the columns in the dataset, specifically the ```geometry``` column, which contains the points that outline each country.

In [None]:
url = "https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip"
world = geopandas.read_file(url)
world

To plot a choropleth, all we have to do is call ```plot()``` on the dataset and provide the ```column``` corresponding to the number we want to plot.  We can set the colormap the same way we did with Seaborn's heatmap.

In [None]:
world.plot(column="POP_EST", cmap="viridis")

### Exercise 4: World GDP
Can you visualize the GDP of each country?

In [None]:
# Your code here
# HINT: You can get a list of the available columns with the following command:
#list(world.columns)