# Telling Stories using Graphs


## Introduction

There is an old saying that a picture is worth a thousand words. People have been using pictures to tell stories and sell ideas for over a century. As data analysts or data scientists today, we are not limited to conducting analysis on data; we also want to use data to help us tell stories. Few people have both the ability and the patience to understand complex data. Therefore, we need to tell our stories the "picture" way so that people can "see" complex data in the way we intend.

In this lesson we will first review two main principles of good storytelling with data visualization. Then we will examine the different types of relationships between data and the most appropriate choice of chart to visually represent each type of relationship.

## Some Real Life and Current Examples

Previously, we discussed interactive data visualisation and saw how it can be a powerful tool to convey a message. Let us have a look at the following examples:


1.   https://ourworldindata.org/coronavirus
2.   https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions
3.   https://graphics.reuters.com/ENVIRONMENT-PLASTIC/0100B275155/index.html
4.   https://informationisbeautiful.net/beautifulnews/






## Principles of Good Story Telling

Good storytelling via data visualization has two main principles:

1. To make the information we want to convey **salient** and **relevant** to the audience.
2. To allow the audience to **accurately interpret the information we want to convey.**

Data visualization is good at extracting relevant information from the real world and presenting a sustained **"snapshot"** that the audience can examine in depth over time. However, even in data visualization, the information contained in the charts can still be overwhelming and noisy (i.e. containing a lot of irrelevant information that distracts the audience from the main point). **Techniques must be employed to highlight the main information we want to convey so that the audience can capture it easily.**

One technique for presenting salient and relevant information is to use an appropriate visual mode (e.g. color, layout, size, perspective, etc.). Consider the following scenario, where the audience is expected to "see" how frequently the word *data* appears in an article. Which of the two presentations allows the audience to capture the information better?

![title](image2.png)

As seen in the example above, using the color mode can help the audience more easily capture the information you want to convey.
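The same idea carries over to charts. The following is a minimal sketch, using made-up word counts (they are not taken from the article in the image), of how a single contrasting color can draw the eye to the one category we care about:

```python
import matplotlib.pyplot as plt

# Hypothetical word counts for an article (made-up numbers for illustration)
word_counts = {"model": 12, "data": 31, "chart": 9, "story": 14, "audience": 11}

words = list(word_counts)
counts = list(word_counts.values())

# Draw every bar in a muted grey except the word we want the audience to notice
highlight = "data"
colors = ["tab:red" if word == highlight else "lightgrey" for word in words]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(words, counts, color=colors)
ax.set_ylabel("Occurrences")
ax.set_title('How often does "data" appear?')
```

Muting everything except the highlighted bar keeps the context visible while making the main point salient.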

Sometimes we want the audience to capture the information we present more accurately, such as the relationships between data. In this case, we should choose the appropriate type of graph. Consider the following graphs, which show the investment structure of the big data IT application industry in China (according to data from the China National Bureau of Statistics):



![title](image1.png)

![title](image3.png)

As seen in the example above, the pie graph allows the audience to better perceive the proportion of each data category, while the bar graph allows the audience to rank the data categories by value. Therefore, it is important to understand which graph type is appropriate for which kind of story. In the rest of this lesson, we will take a deeper dive into the purposes of different types of graphs.
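As a rough sketch of the two views, the snippet below draws a pie chart and a bar chart from the same series. The category names and figures are invented for illustration and are not the statistics shown in the images above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up investment figures (illustrative only, arbitrary units)
investment = pd.Series(
    {"Hardware": 40, "Software": 25, "Services": 20, "Other": 15},
    name="Investment",
)

# Pie view: each category as a share of the whole
shares = investment / investment.sum()

# Bar view: categories ordered by value, so the ranking is easy to read
ranked = investment.sort_values(ascending=False)

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))
shares.plot.pie(ax=ax_pie, autopct="%.0f%%")
ax_pie.set_ylabel("")
ranked.plot.bar(ax=ax_bar, rot=0)
ax_bar.set_ylabel("Investment (arbitrary units)")
```

The pie emphasizes part-to-whole proportions, whereas the sorted bars emphasize rank and the size of the gaps between categories.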

## Using the Appropriate Type of Graph

Next we will go over the graph types we have learned so far and explain when to use each type to present what kind of data relationships. We will use our vehicles dataset to demonstrate these visualizations.

## Types of Variables

The type of data determines to a large degree the plot that you will use. For instance, if you have a discrete data column, you will likely use a bar plot rather than a scatter plot. When dealing with continuous data, however, a scatter plot is often appropriate.


There are roughly two types of variables (i.e. columns) in the data set:

1. **categorical variables** (blood type: A, B, AB or O; colour: 'Red', 'Blue', 'Yellow');
    - binary variables ('Yes/No', 'Male/Female', 1/0);
2. **continuous variables** (numerical ranges, growth, age).
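One way to get a first impression of which columns fall into which group is to look at their dtypes. The sketch below uses a small hypothetical table (the column names and values are made up) and assumes that non-numeric columns are categorical and numeric columns are continuous, which is only a rough heuristic:

```python
import pandas as pd

# Small hypothetical table mixing both kinds of variables
df = pd.DataFrame({
    "blood_type": ["A", "B", "AB", "O"],
    "smoker": ["Yes", "No", "No", "Yes"],
    "age": [34.0, 51.5, 28.0, 45.2],
    "growth": [1.2, 0.4, 2.1, 0.9],
})

# Heuristic split: non-numeric columns are treated as categorical,
# numeric columns as continuous
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
continuous_cols = df.select_dtypes(include="number").columns.tolist()

# Binary variables are categorical variables with exactly two distinct values
binary_cols = [col for col in categorical_cols if df[col].nunique() == 2]

print(categorical_cols, continuous_cols, binary_cols)
```

A numeric dtype does not guarantee a continuous variable (a column of 0/1 flags or integer category codes is numeric but categorical), so the result should always be checked against what the columns actually mean.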

Some useful resources:

1. https://chartio.com/learn/charts/essential-chart-types-for-data-visualization/
2. https://learn.g2.com/discrete-vs-continuous-data

![title](image4.png)

### Scatter Plot

A scatter plot is a clear and concise visualization for determining whether a relationship exists between two variables and what type of relationship it is. If we would like to plot three variables, we can add a dimension to the chart using color or size. However, we should be careful not to overload the chart with information, or it will no longer be clear and concise. For more than two variables that we would like to compare pairwise, we may opt to use a scatter matrix instead.

A scatter plot can help us detect whether there is a linear relationship between two variables or whether there is a different type of relationship (for example, an exponential or logarithmic relationship). In some cases, we may observe a random scatter of points, which suggests there is no relationship between the two variables. Scatter plots are also useful for identifying outliers and other features of the distribution of our data.
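To make these relationship shapes concrete, here is a small sketch on synthetic data (generated with NumPy, not taken from the vehicles dataset). It draws a linear, an exponential, and an unrelated pair of variables side by side and prints the Pearson correlation in each title:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 200)

# Three synthetic relationships between x and y
relationships = {
    "linear": 2 * x + rng.normal(0, 0.5, x.size),
    "exponential": np.exp(x) + rng.normal(0, 2, x.size),
    "no relationship": rng.normal(0, 1, x.size),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, (name, y) in zip(axes, relationships.items()):
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
    ax.scatter(x, y, alpha=0.4)
    ax.set_title(f"{name} (r = {r:.2f})")
```

Note that the correlation coefficient only measures linear association, so the plot itself remains the better tool for spotting curved relationships.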

The following scenarios are presented as examples for you to decide which type of scatter plot should be used in different situations.

#### Scenario 1: Two Variables
We have two variables, City MPG and Highway MPG, which are linearly correlated. In this scenario, a single-color two-dimensional scatter plot is adequate to represent the linear relationship between the variables:

In [None]:
import pandas as pd
%matplotlib inline

In [None]:
vehicles = pd.read_csv('vehicles.csv')

In [None]:
vehicles.head()

In [None]:
vehicles.plot.scatter(x="City MPG", y="Highway MPG", alpha=0.2, grid=True)

#### Scenario 2: Three Variables
In addition to City MPG and Highway MPG, we now have a third variable: CO2 Emission Grams/Mile. How do we visualize the third dimension in a 2-dimensional scatter plot? The answer is to use point color or size.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# create colour map
cmap = sns.cubehelix_palette(as_cmap=True)

# plot
vehicles.plot.scatter(x="City MPG", y="Highway MPG", c="CO2 Emission Grams/Mile", cmap=cmap, alpha=0.2, grid=True)
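The cell above encodes the third variable as color. As a sketch of the other option, marker size, the snippet below builds a small synthetic stand-in for the vehicles data (the values are randomly generated, and only the column names are borrowed from the lesson) and maps the third column to marker area:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the vehicles data (random values, for illustration only)
rng = np.random.default_rng(1)
city = rng.uniform(10, 40, 100)
demo = pd.DataFrame({
    "City MPG": city,
    "Highway MPG": city * 1.3 + rng.normal(0, 2, 100),
    "CO2 Emission Grams/Mile": 9000 / city + rng.normal(0, 20, 100),
})

# Rescale the third variable to a readable range of marker areas (10 to 200)
co2 = demo["CO2 Emission Grams/Mile"]
sizes = 10 + 190 * (co2 - co2.min()) / (co2.max() - co2.min())

ax = demo.plot.scatter(
    x="City MPG", y="Highway MPG", s=sizes.to_numpy(), alpha=0.3, grid=True
)
```

Size differences are generally harder to judge precisely than positions, so this encoding works best for conveying a rough ordering rather than exact values.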

#### Scenario 3: More Than Three Variables
But the vehicles dataset has 9 numeric variables. How do we visualize 9 variables in a 2-dimensional scatter plot? Well, a single scatter plot obviously does not meet our needs. We can use a scatter plot matrix to visualize the pair-wise relationships of those variables.


In [None]:
pd.plotting.scatter_matrix(vehicles, figsize=(20,10))
plt.show()

In the matrix above, the sub-plots on the diagonal do not show a relationship between two variables: by default, `scatter_matrix` draws a histogram of each variable there. In the other sub-plots, you can roughly tell that some variables have linear relationships (e.g. Fuel Barrels/Year vs CO2 Emission Grams/Mile), some have curvilinear relationships (e.g. Fuel Barrels/Year vs Combined MPG), and some have no apparent relationship (e.g. Year vs all others).

### Line Charts
Line charts are similar to scatter plots. There are different ways to connect the points in a line chart. While we can simply connect the points without looking for a trend, we can also fit a linear trend or use other methods like spline interpolation.
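The difference between simply connecting points and fitting a trend can be sketched as follows. The series is synthetic (a made-up yearly measurement with noise), and the trend is an ordinary least-squares line obtained with `np.polyfit`:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic yearly series: a slow upward drift plus noise (illustrative only)
rng = np.random.default_rng(2)
years = np.arange(2000, 2020)
years_since = years - 2000  # centre the x values to keep the fit well conditioned
values = 20 + 0.4 * years_since + rng.normal(0, 1, years.size)

# Least-squares straight line through the points
slope, intercept = np.polyfit(years_since, values, deg=1)
trend = slope * years_since + intercept

fig, ax = plt.subplots()
ax.plot(years, values, marker="o", label="observations connected in order")
ax.plot(years, trend, linestyle="--", label="linear trend (least squares)")
ax.set_xlabel("Year")
ax.legend()
```

The connected line shows every fluctuation, while the fitted line summarizes the overall direction.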

Using a line chart makes it easier to plot more than one variable at once on the same x and y axes. To do this, we must first reshape the data using the melt function.

The melt function will create a dataset that has all MPG values in one column and another column will indicate what type of MPG this is. We then use this indicator column for plotting the color.

In [None]:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df.head()

In [None]:
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])

In [None]:
line_df = pd.melt(vehicles[['City MPG', 'Highway MPG', 'CO2 Emission Grams/Mile']], id_vars='CO2 Emission Grams/Mile', var_name='MPG')
line_df

In [None]:
sns.lmplot(x='value',y='CO2 Emission Grams/Mile', hue='MPG', data=line_df, fit_reg=True) 

### Histograms
Histograms are typically used to illustrate the distribution of the data. Similar to scatter plots, histograms can also help us to identify outliers. Histograms can also show us whether there is a general trend of skewness or symmetry in the data. **Histograms are a good choice for examining the distribution of one variable at a time.** With a careful selection of bins, we can also identify whether the data has a single mode or whether it is bimodal or multimodal.

Recall that the default number of bins is 10.

In [None]:
vehicles['City MPG'].hist()

In [None]:
vehicles['City MPG'].hist(bins=20)
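To illustrate why the choice of bins matters for spotting more than one mode, here is a sketch on synthetic data: a mixture of two normal distributions (made-up values, not from the vehicles dataset) drawn with a coarse and a fine binning.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic bimodal sample: two groups centred at 20 and 24
rng = np.random.default_rng(3)
values = pd.Series(
    np.concatenate([rng.normal(20, 1, 500), rng.normal(24, 1, 500)])
)

# Same data, two different bin counts
fig, axes = plt.subplots(1, 2, figsize=(10, 3.5))
for ax, bins in zip(axes, (3, 30)):
    values.hist(bins=bins, ax=ax)
    ax.set_title(f"bins={bins}")

# The underlying counts, in case we want to inspect them numerically
coarse_counts, _ = np.histogram(values, bins=3)
fine_counts, _ = np.histogram(values, bins=30)
```

With very few bins the two peaks may be merged into one broad block, while a finer binning tends to reveal them; too many bins, on the other hand, make the plot noisy.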

### Bar Charts
Bar charts can be used to compare values across the categories of a categorical variable. We can use these charts for comparison between time periods (for example, this year vs. last year) or between two subgroups in the population (like males and females). In this example we are comparing two variables that describe a vehicle's MPG across drivetrain types.

In [None]:
vehicles_mean = vehicles[["Highway MPG", "City MPG", "Drivetrain"]].groupby(["Drivetrain"]).agg("mean")

In [None]:
vehicles_mean

In [None]:
vehicles_mean.plot.barh()
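The time-period comparison mentioned above follows the same pattern. In this sketch the table is entirely hypothetical (invented regions and unit counts); it is pivoted so that each period becomes a column, which pandas then draws as grouped bars:

```python
import pandas as pd

# Hypothetical sales figures in long format (made-up numbers)
sales = pd.DataFrame({
    "region": ["North", "South", "East", "West"] * 2,
    "year": ["last year"] * 4 + ["this year"] * 4,
    "units": [120, 95, 80, 110, 135, 90, 100, 115],
})

# One row per region, one column per period
wide = sales.pivot(index="region", columns="year", values="units")

# Each region gets a pair of bars, one per period
ax = wide.plot.bar(rot=0)
ax.set_ylabel("Units sold")
```

Placing the two periods next to each other within each category makes the change per category easy to read.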

### Heatmaps

We've already seen heatmaps. However, these were quite rudimentary. Seaborn also lets you plot very detailed and clear heatmaps that immediately highlight the important relationships without discarding important information.

In [None]:
import numpy as np

In [None]:
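# Correlations are only defined for numeric columns; recent pandas versions
# raise an error if text columns are passed to corr(), so select numeric ones first
corr = vehicles.select_dtypes(include="number").corr()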
corr

In [None]:
zeros = np.zeros((4,4))
# np.bool was removed in NumPy 1.24; the built-in bool is the replacement
np.triu(np.ones_like(zeros, dtype=bool))

In [None]:
plt.figure(figsize=(12, 12))
# Mask the upper triangle (including the diagonal), which only mirrors the lower one
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(250, 300, as_cmap=True)

sns.heatmap(corr, cmap=cmap, center=0, mask=mask)
plt.show()
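To see what the mask does without drawing anything, the sketch below applies the same upper-triangle mask to a small correlation matrix computed from random data (three made-up columns `a`, `b`, `c`):

```python
import numpy as np
import pandas as pd

# Random data with three hypothetical columns
rng = np.random.default_rng(4)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
demo_corr = demo.corr()

# True on and above the diagonal: these are the cells the heatmap will hide
demo_mask = np.triu(np.ones_like(demo_corr, dtype=bool))

# Blank out the masked cells to see which correlations remain visible
visible = demo_corr.mask(demo_mask)
print(visible)
```

Only the cells below the diagonal survive, which is enough because a correlation matrix is symmetric and its diagonal is always 1.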

### Pie Charts
**Many experts in the field of data visualization recommend not using pie charts.** Pie charts are intended to demonstrate the proportionate difference between the different groups in a categorical variable. However, looking at the different groups as a fraction of a whole means a loss of information about the original data. There are always better suited visualizations than pie charts. The most obvious alternative is a bar chart.

## Summary
In this lesson we learned about the main types of visualizations and what they can be used for. We also learned which visualizations not to use in order to communicate our findings effectively.