<img src=images/gdd-logo.png align=right width=300px>

          
# Place: Making Charts in Python

For this notebook we use matplotlib (together with pandas and seaborn) for the examples, but feel free to pick another library as long as you can demonstrate the techniques discussed.

In particular we will be covering:

- [Responsibility as Storytellers](#responsibility)
- [Choosing the right graph](#choosing)
    - [<mark>Ex. How does chart type affect the reader?</mark>](#ex1)
- [A note on... confidence intervals](#note-ci)
- [A note on... boxplots](#note-box)
- [A note on... pie charts](#note-pie)
- [<mark>Exercise: Case studies</mark>](#ex-case)
- [Time Series graphs](#ts)
- [<mark>Exercise: Choosing the right chart type</mark>](#ex-choosing)

<a id='responsibility'></a>

## Responsibility as Data Storytellers

<img src=images/data-stories.png align=center width=300px>

As storytellers of data, we actually share a lot in common with the likes of Stephen King, Margaret Atwood or Dr. Seuss, and therefore need to understand the underlying elements needed to tell any story.

Storytelling is often described through three Ps that drive a plot. We are going to match each of them in our own data storytelling:

- Place
- People
- Purpose

However, unlike many fiction writers, we are portraying facts, which are shown through the insights we gain from our data analysis. These facts are what will lead people to draw conclusions and, in a lot of cases, make future business decisions.

Our role as the data storyteller is to **bridge the gap** between clinical, analytical data and purpose-driven, interesting and thought-provoking stories.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<a id='choosing'></a>

## Choosing the right graph

---

<font size=5 color='blue' align=center>P is for Place</font>

**The 3 Ps of storytelling:** Place, People, Purpose.

---

The place of our story is where we are going to start. Besides where your chart will be consumed (PowerPoint, notebooks, Python, Streamlit, dashboard), we need to determine which chart we are going to use and what structure our charts will need to have.

If you have data you want to visualize, make sure you use the right charts. While your data might work with multiple chart types, it’s up to you to select the one that ensures your message is clear and accurate. Remember, **data is only valuable if you know how to visualize it and give context**.

- Line 
- Bar 
- Stacked bar
- Area 
- Pie 
- Scatter
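
To make the comparison concrete, here is a minimal sketch that draws one small dataset with several of the chart types listed above. The `sales` values and the helper name `plot_same_data` are made up for illustration and are not part of the course data. A scatter chart is left out because it needs two numeric columns rather than a single series.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly values, purely for illustration
sales = pd.Series(
    [120, 135, 150, 110, 160, 175],
    index=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    name='sales',
)


def plot_same_data(series, kinds=('line', 'bar', 'area', 'pie')):
    """Draw the same series once per chart kind, side by side (hypothetical helper)."""
    fig, axes = plt.subplots(1, len(kinds), figsize=(4 * len(kinds), 4), squeeze=False)
    for ax, kind in zip(axes[0], kinds):
        # pandas picks the chart type from the `kind` argument
        series.plot(kind=kind, ax=ax, title=kind)
    fig.tight_layout()
    return fig, axes[0]


fig, axes = plot_same_data(sales)
```

Seeing the four panels next to each other makes it easier to ask which one answers a given question fastest: the trend over the months, the best month, or each month's share of the total.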

## <mark>Exercise: Choose a different graph style.</mark>

For each chart, try a different type of chart. 
Why were the specific charts chosen for these examples?

In [None]:
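# import only the function we need from the helper file plot_chars.py
from plot_chars import plot_charts

# change the chart type per dataset and re-run to compare
plot_charts(fifa='barh',
            cheese='line',
            lego='area',
            marvel='pie',
            cars='scatter',
            books='bar')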

**Note**: If the above code does not work due to import issues, go to the file `plot_chars.py` and copy/paste the function `plot_charts` into the cell above, before the function call.

### In what situation are the following charts used?
Bar charts

Line Charts

Area Charts

Pie Charts

Scatter Charts

Multiple Bar Charts (stacked or side-by-side)

Picking the correct chart is a really important step in the art of data storytelling. The wrong choice can leave the reader uncomfortable, be misleading or be impossible to decipher.

We have explored the 6 main charts, but there are many more out there! Check out [this PDF](../ChartGuide.pdf) to see more examples.

<a id='note-ci'></a>

---
## A note on...

### Bar charts with error bars

Seaborn can be really useful if we want to do some kind of aggregation (sum, mean etc.) within our plot. 

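By default, Seaborn's `barplot` shows the mean of each group, and its error bars show a 95% confidence interval around that mean, which gives a general idea of how precise the estimate is: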

In [None]:
import seaborn as sns

In [None]:
import numpy as np
marvel = pd.read_csv('data/marvel.csv')
marvel.head()

In [None]:
sns.barplot(data=marvel, x='Alignment', y='Height', capsize=.2)

The error bar for our `neutral` category is very wide.

In [None]:
marvel.groupby('Alignment')['Height'].describe()

It looks like we have only a small number of `neutral` heroes, with some outliers in the upper quartile:

In [None]:
(
    marvel
    .loc[marvel['Alignment']=='neutral', ['Name', 'Height']]
    .sort_values("Height", ascending=False)
)

Galactus is obviously making our neutral heroes look like they are considerably taller...

<img src=images/galactus.jpeg width=200px>

The confidence interval was a clear demonstration of how a low number of observations can lead to a misleading representation of aggregate values. However, for a non-technical audience, confidence intervals are harder to decipher.
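
To see why one extreme value in a small group stretches the interval so much, the sketch below computes a percentile bootstrap confidence interval for the mean by hand. This is only an approximation of the idea behind the error bars, not Seaborn's actual implementation. The function name `bootstrap_mean_ci` and the height values are made up for illustration.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot=2000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    samples = rng.choice(values, size=(n_boot, len(values)), replace=True)
    means = samples.mean(axis=1)
    # Cut off the same share of resampled means on both tails
    tail = (100 - level) / 2
    low, high = np.percentile(means, [tail, 100 - tail])
    return low, high


# Made-up heights in cm: a small group, without and with one extreme value
typical = [175, 180, 183, 178, 185, 190]
with_outlier = typical + [900]

ci_typical = bootstrap_mean_ci(typical)
ci_outlier = bootstrap_mean_ci(with_outlier)
print(ci_typical)
print(ci_outlier)
```

With so few observations, whether the extreme value lands in a resample (and how often) moves the mean a lot, so the second interval comes out far wider than the first.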

---
<a id='note-box'></a>

## A note on...

### Boxplots

The same can be said for boxplots. The chart below tells us a lot about the attack score for each Pokémon type: we can read off the median attack score and how consistent the scores are within each group. However, boxplots are less common than simpler chart types, so they should be used with audiences who know the chart and are comfortable interpreting it.

In [None]:
pokemon = pd.read_csv('data/pokemon.csv')

fig, ax = plt.subplots(figsize=(12,7))

sns.boxplot(data=pokemon, y='attack', x='type', ax=ax)
plt.show()
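
For readers less familiar with the chart, the sketch below computes the numbers a standard boxplot is drawn from: the quartiles, the whiskers (the furthest points within 1.5 times the interquartile range of the box) and the points beyond them, which are drawn as outliers. The helper name `box_stats` and the `attack` values are made up for illustration.

```python
import pandas as pd


def box_stats(values):
    """Return the summary numbers behind a standard boxplot (illustrative sketch)."""
    s = pd.Series(values, dtype=float)
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Points outside these fences are drawn individually as outliers
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = s[(s >= low_fence) & (s <= high_fence)]
    return {
        'q1': q1,
        'median': median,
        'q3': q3,
        'whisker_low': inside.min(),
        'whisker_high': inside.max(),
        'outliers': s[(s < low_fence) | (s > high_fence)].tolist(),
    }


# Made-up attack scores with one unusually strong value
attack = [45, 50, 55, 60, 65, 70, 75, 80, 165]
stats = box_stats(attack)
print(stats)
```

Every one of these numbers has to be decoded by the reader, which is why the chart asks more of its audience than a simple bar chart does.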

Often we want to keep the amount of interpretation needed to a minimum, so that everyone can digest the key information together.

The **whole** audience should be able to understand the graph. If only 80% can, the 20% who can't will feel left out.

Use boxplots and confidence intervals with...
- Technical audiences
- Your own analysis to get to the root cause
- Places where consumers can drill down

---
<a id='note-pie'></a>

## A note on...

### Pie Charts

Pie charts are one of the most overused graphs in data storytelling. They provide an immediate visual that people are familiar with. However, in most cases they are not the best way to present data.

They often distort the information - this makes it more difficult for decision-makers to understand the messages they contain.

If you’re interested in making better presentations, reports and dashboards, one simple way is to eliminate pie charts from your repertoire.

**Too many categories**

The hardest thing about reading pie charts is judging the angles. Most of us are not naturally good at it, and it gets harder with every slice that is added.

In [None]:
marvel['EyeColor'].value_counts().plot(kind='pie')

**Omitting categories**

Pie charts should show percentages and should always add up to 100%. Raw values in a pie chart are confusing and percentages that don't describe the proportion of the whole are misleading.

In [None]:
election = pd.read_csv('data/plots/election_alaska.csv')

election.plot(kind='pie',
              y='total_votes',
              labels=election['percent_label'],
              legend=False,
              ylabel='',
              title='Results of the 2020 US Election in Alaska')
plt.show()

Omitting categories is a classic media tactic: it makes the remaining categories appear more dominant and misrepresents the data. For example:

<img src=images/covid.png width=300px align=left>

**In time series data**

Often the aim is to show how categories change over time. However, a row of pie charts doesn't give us enough information to do that:

In [None]:
programming = (
    pd.read_csv('data/programming-trends.csv', index_col='Month', parse_dates=['Month'])
    .loc['2016':'2020']
    .resample('Y').mean()
)

fig, ax = plt.subplots(1,programming.shape[0], figsize=(12,6))
for i, p in enumerate(programming.index):
    programming.loc[p][::-1].plot(kind='pie', ax=ax[i], ylabel=p.year)
plt.tight_layout()

Whereas with a simple line plot we can see exactly where one group surpassed another:

In [None]:
programming.plot(title='Google Trends for Different Programming Languages');

**Fear the 3D Pie Chart:**

This is not (easily) done in Python, but it is good to be aware of for other visualisation tools. Notice how, in the 3D image below, Item C looks bigger than Item A when in fact it is less than half the size:

<img src=images/3d-pie.png width=500px align=center>

The pie chart is one of the most used and often most hated chart types of all time. However it does have its place!

**Best practices for creating pie charts:**

 - Make sure your sectors add up to the whole (100%) - Sounds obvious, but this is a common mistake
 
 - Compare just a few (2-5) categories to get your point across and if the slices are roughly the same size, consider a bar or column chart

 - Include all data, with no overlapping sectors
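
The practices above can be sketched as a small helper that keeps the largest categories, folds the rest into a single `Other` slice so no data is dropped, and converts the counts to percentages of the whole. The helper name `pie_ready_shares` and the counts are made up for illustration.

```python
import pandas as pd


def pie_ready_shares(counts, max_slices=5):
    """Keep the largest categories, fold the rest into 'Other', return % shares."""
    counts = pd.Series(counts).sort_values(ascending=False)
    if len(counts) > max_slices:
        # Reserve one slice for everything that does not make the cut
        top = counts.iloc[:max_slices - 1]
        other = pd.Series({'Other': counts.iloc[max_slices - 1:].sum()})
        counts = pd.concat([top, other])
    # Shares of the whole, so the slices always add up to 100%
    return counts / counts.sum() * 100


# Made-up category counts
eye_colours = {'blue': 40, 'brown': 30, 'green': 12, 'red': 8,
               'yellow': 5, 'white': 3, 'grey': 2}
shares = pie_ready_shares(eye_colours)
print(shares)
```

The result can then be drawn with, for example, `shares.plot(kind='pie', autopct='%.0f%%', ylabel='')`, which prints each slice's percentage on the chart.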

---
<a id='ex-case'></a>

## <mark> Exercise: Case Studies</mark>

Answer/complete the following...

**You should use confidence intervals when...**

**A good example of when to use a pie chart would be...**

**Think about the following use cases. What type of chart would you use and why?**

1. Training data where the total amount of training has increased since the introduction of online learning

2. Show the status of tasks completed in JIRA (Not started, On-going, Completed)

3. Data on this year's sales compared to the budget month-on-month. The CEO would like to take these values and use some of them in his Quarterly report.

---
<a id='ts'></a>

## Let's get Serious with Time Series

Time series are a great way to show the movement of data over time. The most common time series chart is a line chart, as it suggests movement from each point to the next. However, it's not uncommon to see bar charts, as long as the data is grouped in a sensible way: monthly data for one year, for example.

The most common mistake with time series data is not converting the date variable to a datetime datatype (for example with `pd.to_datetime`). This can cause issues with our axes. Take a look at the chart below: what issues can you see?

In [None]:
schiphol = pd.read_csv('data/schiphol-passengers.csv')

In [None]:
# fig
fig = plt.figure(figsize=(10,5))

# axes
axes = fig.add_axes([0,0,1,1])

schiphol.plot(x='date', y='total_passengers', ax=axes, title='Passenger Volume at Schiphol Airport');

If we plot this using seaborn, this issue is seen more clearly:

In [None]:
# fig
fig = plt.figure(figsize=(10,5))

# axes
axes = fig.add_axes([0,0,1,1])

(
    # pass ax explicitly so the plot lands on the axes we just created
    sns.lineplot(data=schiphol, x='date', y='total_passengers', ax=axes)
    .set(title='Passenger Volume at Schiphol Airport')
);

Seaborn is trying to show every single label on the x-axis because pandas currently treats this column as strings. We need to convert it to a datetime type (`datetime64`) to improve this. Also, if we set it as our index, pandas will automatically put it on the x-axis when plotting.

In [None]:
schiphol_clean = pd.read_csv('data/schiphol-passengers.csv', 
                             parse_dates=['date'], 
                             index_col='date')
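
If the data has already been loaded without `parse_dates`, the same conversion can be done afterwards with `pd.to_datetime`. The rows below are made up and only stand in for the real CSV.

```python
import pandas as pd

# Made-up rows standing in for a CSV that was read without parse_dates
raw = pd.DataFrame({
    'date': ['2020-01-01', '2020-02-01', '2020-03-01'],
    'total_passengers': [5_000_000, 4_800_000, 2_100_000],
})
print(raw['date'].dtype)    # plain strings, not dates

clean = (
    raw
    .assign(date=lambda df: pd.to_datetime(df['date']))  # strings -> datetimes
    .set_index('date')                                   # index becomes the x-axis
)
print(clean.index.dtype)    # now a datetime type
```

Checking the dtype before plotting is a quick way to catch this problem early.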

In [None]:
# fig
fig = plt.figure(figsize=(10,5))

# axes
axes = fig.add_axes([0,0,1,1])

(
    schiphol_clean
    .div(1_000_000)  # express passenger counts in millions
    .plot(ax=axes, ylabel='Num Passengers (M)', 
          title='Passenger Volume at Schiphol Airport')
)

# rotate the date labels; this only has an effect once the axes exist
fig.autofmt_xdate();

## How to compare different things over time

Time series line charts are also great at comparing one year to the next. Let's look at bike rentals from a US rental company in the years 2011 and 2012:

In [None]:
bikes = pd.read_csv('data/plots/bikes-by-month.csv', index_col='Month')
ax = (
    bikes
    .plot(title='Total Yearly Bike Rentals')
)

A bar chart of the same monthly figures can work better if, say, we want to check whether we hit each monthly target:

In [None]:
ax = (
    bikes
    .plot(kind='bar', title='Total Yearly Bike Rentals')
)
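
The charts above assume a table with one row per month and one column per year. The layout of `bikes-by-month.csv` is not shown here, so the sketch below uses made-up daily data to illustrate one way such a table could be built from long-format records with `pivot_table`.

```python
import pandas as pd

# Made-up daily rentals covering two years, purely for illustration
days = pd.date_range('2011-01-01', '2012-12-31', freq='D')
rentals = pd.DataFrame({'date': days, 'rentals': range(len(days))})

by_month = (
    rentals
    .assign(year=rentals['date'].dt.year, month=rentals['date'].dt.month)
    # one row per month, one column per year, summing the daily values
    .pivot_table(index='month', columns='year', values='rentals', aggfunc='sum')
)
print(by_month.head())
```

Calling `by_month.plot()` on the result draws one line per year, and `by_month.plot(kind='bar')` places the years side by side within each month.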

---
<img src=images/conclusion.png align=right>

# Conclusion

We have looked at the main types of chart one can use when plotting information. It is important to be extremely comfortable with these chart types before moving on to more complex ones. With that foundation in place, we can start adding the remaining storytelling elements (People and Purpose) to really control and aid the story that's being told.

For a more in-depth guide to choosing the correct chart, there is a PDF in the `Extras/` folder with a breakdown of over 80 different chart types.

## Next Steps

We are going to explore:
- Who are the **P: People** in our story? What characters drive our data story?