# Simple plots with matplotlib/pyplot



# The Canada immigration dataset

We'll use a dataset describing immigration to Canada in the 1980-2013 period. For each of 196 nations, the data list the number of immigrants per year. There are also a few metadata columns describing each nation.

Load the data with Pandas using the following code:

In [None]:
import pandas as pd

#doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
df_canada = pd.read_excel(
    'https://github.com/ne1s0n/dataviz_python/raw/main/resources/Canada.xlsx',
    sheet_name = 'Canada by Citizenship',  #the file contains three sheets
    skiprows = range(20), #skip the first twenty rows
    skipfooter = 2        #skip the last two rows
)

In [None]:
df_canada.head()

In [None]:
df_canada.info()

In [None]:
df_canada.describe()

Let's rename the "OdName" column to "Country", and set it to be the index.

In [None]:
#renaming a column
#doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
df_canada.rename(columns = {'OdName':'Country'}, inplace = True)

#using Country as index
#doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
df_canada.set_index('Country', inplace = True)

df_canada.head()

In [None]:
#we can now directly refer to country names
#can you guess the role of the mycols variable?
mycols = df_canada.columns[0:15]
df_canada.loc['Albania', mycols]

We want to add a "Total" column, at 

---

# ASSIGNMENT! Add a "Total" column

It's important that everybody does this assignment, since we are going to change the `df_canada` dataframe and then use this updated version in some examples and exercises.

Remember to use "Runtime/Run before", so that you actually have a `df_canada` dataframe :-)

We want to add a new column to the dataframe named "Total" (keep an eye on the case) which contains the total number of immigrants for each country.

To do so we need to add, for each row, the contents of *some* columns, i.e. those named after a year.

---

In [None]:
# your solution here

# Plotting setup

Nothing too fancy here.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Line plot

The very basics. Let's plot the number of immigrants from Italy (or any country of your choice) for each year.

In [None]:
#this is handy to select only the correct columns
years = df_canada.columns[8:42]
y = df_canada.loc['Italy', years]

plt.plot(y)

In [None]:
type(y)

A bit of aesthetics:

In [None]:
y = df_canada.loc['Italy', years]

plt.figure(figsize = (12, 8))
plt.plot(y, marker='.', ms=10)
plt.ylabel('Immigrants')
plt.title('Immigrants from Italy to Canada')
plt.grid()

Let's examine the structure of the above code:

```
y = df_canada.loc['Italy', years]
```
We already know this instruction: it returns a pandas `Series`, i.e. a 1D set of values ([see doc for more](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)).

```
plt.figure(figsize = (12, 8))
```
Here we ask pyplot (not matplotlib) to instantiate a new figure ([see doc for more](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html)). Internally, several Artist-layer objects are instantiated.

```
plt.plot(y, marker='.', ms=10)
```
We add an actual plot. Its default behaviour is to create a blue line plot, with the y axis limited to the smallest range able to show the data. We add dot markers on each data point and increase the marker size (`ms`) a bit.

```
plt.ylabel('Immigrants')
plt.title('Immigrants from Italy to Canada')
plt.grid()
```
These are all functions that operate on the currently active plot, adding labels and decorations.


---

# ASSIGNMENT! Custom line plot

You are required to update the plot above, changing the appearance of the line so as to obtain:

* thicker line
* red color
* dashed instead of continuous

Please refer to the documentation of [plot()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot). You may notice that it refers to [Line2D()](https://matplotlib.org/stable/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D), which lists all the supported graphical options.

---

In [None]:
# Your solution here

# LineS plot

Since we have so many countries, let's add a couple more lines! It should be easy, let's just select two more countries.

In [None]:
y = df_canada.loc[['Italy', 'France', 'Germany'], years]

plt.plot(y)

What happened?
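A hint: when `plot()` receives a 2D table, it draws one line per *column*. The sketch below uses a tiny made-up dataframe (the names `toy`, `A` and `B` are invented for illustration) to show how the number of lines changes with the orientation of the data.

```python
import pandas as pd
import matplotlib.pyplot as plt

#toy data (made up): 2 countries on the rows, 4 years on the columns
toy = pd.DataFrame(
    [[10, 12, 15, 11], [3, 4, 6, 8]],
    index = ['A', 'B'],
    columns = [2000, 2001, 2002, 2003]
)

#one line per column: 4 lines (one per year), each with 2 points
lines_by_year = plt.plot(toy)

#after transposing, countries are on the columns:
#2 lines (one per country), each with 4 points
plt.figure()
lines_by_country = plt.plot(toy.transpose())

print(len(lines_by_year), len(lines_by_country))
```

The second version is usually what we want for a time series: years on the x axis, one line per country.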

In [None]:
countries = ['Italy', 'France', 'Germany']

y = df_canada.loc[countries, years].transpose()

plt.plot(y, label = countries)

This works, but if we want more control over the appearance of each line we need slightly more complex code:

In [None]:
countries = ['Italy', 'France', 'Germany']

for country in countries:
  y = df_canada.loc[country, years]
  plt.plot(y, label = country)

#since we are at it, let's add a legend
plt.legend(loc='upper left')

# Scatter plot

Similar to a line plot, a scatter plot is obtained when each data point contains an x and y value. Let's plot the immigration from Italy vs. France to see if there's any kind of correlation.

In [None]:
x = df_canada.loc['France', years]
y = df_canada.loc['Italy', years]

plt.scatter(x, y)

#computing and printing pearson's correlation, notice the dtype cast
#from "object" to "float"
np.corrcoef(x.to_numpy().astype(float), y.to_numpy().astype(float))

In [None]:
x = df_canada.loc['France', years]
y = df_canada.loc['China', years]

plt.scatter(x, y)

#computing and printing pearson's correlation, notice the dtype cast
#from "object" to "float"
np.corrcoef(x.to_numpy().astype(float), y.to_numpy().astype(float))
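Note that `np.corrcoef()` returns a full correlation matrix, not a single number. As a minimal sketch on made-up arrays (`a` and `b` are invented for illustration), the Pearson coefficient between the two inputs sits in the off-diagonal cells:

```python
import numpy as np

#made-up data: b is a perfect linear function of a
a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2 * a + 1

#2x2 matrix: ones on the diagonal (each variable with itself),
#the a-vs-b correlation in the off-diagonal cells
r_matrix = np.corrcoef(a, b)

#extracting the single coefficient we care about
r = r_matrix[0, 1]
print(r_matrix.shape, r)
```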

--- 

# ASSIGNMENT! Double scatter

Put together the two plots above (or any other pairing of nations, to your liking). Let's say:

* Italy-France in green
* France-China in red

---

In [None]:
#your solution here

# Bubble plot

Bubble plots are a special type of scatter plot where the size (and optionally the color) of the drawn data points vary, according to another variable.

They may be hard to read, since the human eye is not very well suited for comparing differences in area (or shade). Still, they may be interesting.

To obtain a bubble plot using pyplot, just make a regular call to `scatter()` and pass an array with the sizes to the parameter `s`. Since the size is expressed in points squared (which is not super intuitive), some adjustment may be required.

As an example, we plot the number of immigrants from France over the years, but assign a random size to each dot.

In [None]:
#normally a scatter plot needs an x and a y
x = years
y = df_canada.loc['France', years]

#we create a random array of the correct size,
#multiplied by 100 to get sensible results
size = np.random.random_sample(len(x)) * 100

#the bubble plot is obtained as if it were a
#normal scatter plot
plt.scatter(x=x, y=y, s=size)
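In practice the sizes usually come from a third variable rather than from random numbers. The sketch below uses made-up arrays and an arbitrary maximum area (`max_area`, an assumption chosen only for readability) to show one possible way to rescale a variable into sensible bubble sizes.

```python
import numpy as np
import matplotlib.pyplot as plt

#made-up data: x, y and a third variable driving the bubble size
x = np.arange(2000, 2006)
y = np.array([5, 7, 6, 9, 12, 10])
third = np.array([1.0, 4.0, 2.0, 8.0, 16.0, 4.0])

#rescaling so that the largest bubble has an area of max_area
#points squared (300 is an arbitrary choice)
max_area = 300
size = third / third.max() * max_area

points = plt.scatter(x=x, y=y, s=size)
```

Since `s` is an area, a value twice as large gives a bubble with twice the area, not twice the diameter.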

In [None]:
#your solution here

# Bar plot

Similar to a line plot, a bar plot is most suited when x contains categorical data. With the [bar() documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html) at hand, creating a bar plot poses no problem at this point. Just remember that we need to explicitly pass two arguments, `x` and `height`.

Let's create a bar plot with two countries, China and India.

In [None]:
country = 'China'
y = df_canada.loc[country, years]
plt.bar(x=years, height=y, label = country)

country = 'India'
y = df_canada.loc[country, years]
plt.bar(x=years, height=y, label = country)

#since we are at it, let's add a legend
plt.legend(loc='upper left')

The plot is done, but the result is not completely satisfying. The bars from India are in the foreground and can completely cover those from China. There are two workarounds:

* make India bars partially transparent, so that it's possible to see through
* put the two countries side by side

---

# ASSIGNMENT! Bar chart with two series

Implement one (or both!) of the solutions described above.

Hints

* the first solution (using transparency) is easier. Keep in mind that the transparency parameter is conventionally called "alpha" (even outside python)
* the second solution (side by side) requires you to manipulate two different arguments of the `bar()` function

---

In [None]:
#your solution here

Neither solution is super handy, and both extend with some difficulty to more than two series. We'll see easier approaches with Seaborn.

# Pie chart

A pie chart is rarely used in scientific publications, since it's generally considered a poor visualization tool.

Consider, for example, the following charts representing the election results for five parties in three different elections:

![pie1](https://github.com/ne1s0n/dataviz_python/raw/main/resources/wikipedia-pie-charts-1.png)

It's difficult to notice whether there's a clear trend, and even if one notices that the black slice is shrinking it's hard to quantify by how much.

Compare with the following bar plot:

![pie2](https://github.com/ne1s0n/dataviz_python/raw/main/resources/wikipedia-pie-charts-2.png)

Now it's clear what's going on. For a more in-depth analysis of the pitfalls of pie charts, see the ataccama.com article ["Why pie chart are evil"](https://www.ataccama.com/blog/why-pie-charts-are-evil).

Anyway, you may sometimes be required to do a pie chart, so let's do one and then move along.



In [None]:
#let's group our immigration data by continent
#(rows are grouped by default, and numeric_only avoids summing text columns)
df_continent = df_canada.groupby(by='AreaName').sum(numeric_only=True)

#Did it work? What happened to non-numeric columns?
df_continent.head()
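What happens to text columns in a grouped sum depends on the pandas version: they may be dropped or concatenated. Passing `numeric_only=True` makes the behaviour explicit. A minimal sketch on a made-up dataframe (`toy` and its values are invented for illustration):

```python
import pandas as pd

#made-up data mimicking the structure of df_canada
toy = pd.DataFrame(
    {
        'AreaName': ['Europe', 'Europe', 'Asia'],
        'RegName': ['Southern Europe', 'Western Europe', 'Eastern Asia'],
        'Total': [10, 20, 5]
    },
    index = ['Italy', 'France', 'China']
)

#only numeric columns are summed, the text column RegName is left out
by_area = toy.groupby('AreaName').sum(numeric_only = True)
print(by_area)
```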

In [None]:
plt.pie(df_continent['Total'])

Meh. Let's clean up a bit, keeping in mind our options from the [pie() documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.pie.html)

In [None]:
#one color per continent (six in total, matching the explode list)
colors = ['red', 'green', 'blue', 'yellow', 'pink', 'orange']
labels = df_continent.index
expl = [0, 0.1, 0, 0.1, 0, 0]

plt.pie(df_continent['Total'], colors=colors, labels=labels, explode=expl)
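Another commonly used option is `autopct`, which prints the percentage on each slice. A minimal sketch with made-up values (the labels `A`, `B`, `C` are placeholders):

```python
import matplotlib.pyplot as plt

#made-up data: three slices summing to 100
values = [50, 30, 20]
labels = ['A', 'B', 'C']

plt.figure(figsize = (6, 6))
#autopct takes a format string, here one decimal digit followed by %
plt.pie(values, labels = labels, autopct = '%1.1f%%', startangle = 90)

#collecting the text drawn on the plot, just to inspect it
drawn_text = [t.get_text() for t in plt.gca().texts]
print(drawn_text)
```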


# Histograms

Histograms are an excellent way to explore the data distribution, i.e. to understand which values are the most common and which are the rarest.

A histogram represents data using bars of various heights. Each bar groups numbers into specific ranges. Taller bars show that more data falls within that specific range.

To do so, data are grouped in "bins". So there's a bin counting e.g. the number of values between 1 and 10, another for values between 11 and 20, and so forth.

In pyplot histograms are obtained via the [hist()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) function. Internally, the function uses numpy, and in particular [np.histogram()](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html#numpy.histogram), to compute the bins and the counts, and then draws the bars. If your data are already binned and counted, pyplot's [plt.stairs()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.stairs.html) can draw them directly.

Let's see a simple example: the distribution of immigrants for year 2013.

In [None]:
plt.hist(df_canada.loc[:, 2013])

Taking a look at the bins:

In [None]:
counts, bins = np.histogram(df_canada.loc[:, 2013])
print(bins)
print(counts)
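By default both `np.histogram()` and `plt.hist()` use ten equal-width bins, and the `bins` argument changes that. The sketch below uses a made-up array to show the relation between bin edges and counts: there is always one more edge than there are bins.

```python
import numpy as np

#made-up data
values = np.array([1, 2, 2, 3, 5, 8, 13, 21, 34, 55])

#default: 10 bins, hence 11 edges
counts, edges = np.histogram(values)

#asking for 5 bins: 6 edges, equally spaced between min and max
counts5, edges5 = np.histogram(values, bins = 5)

print(len(counts), len(edges))
print(counts5, edges5)
```

The same `bins` argument can be passed to `plt.hist()`.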

Doing histograms with more than one distribution is technically possible, but easily becomes messy. It runs into the same limitations as bar plots, and it's better done using libraries built upon matplotlib/pyplot.

# Box plots

Box plots are a handy way to compare distributions of more than one variable. In fact, there's little use in doing a box plot with a single variable. In this respect, they are somewhat the opposite of histograms.

Unsurprisingly, the pyplot function for creating a boxplot is called [boxplot()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html).

A boxplot works with medians and quartiles, like this:

```
     Q1-1.5IQR   Q1   median  Q3   Q3+1.5IQR
                  |-----:-----|
  o      |--------|     :     |--------|    o  o
                  |-----:-----|
flier             <----------->            fliers
                       IQR
```

(ASCII art courtesy of pyplot documentation)
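To make the diagram concrete, the quantities involved can be computed by hand with numpy. This is a minimal sketch on made-up values, not what `boxplot()` does line by line. It finds the quartiles, the interquartile range (IQR) and the points that fall outside the 1.5 IQR limits and would be drawn as fliers.

```python
import numpy as np

#made-up data with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

#quartiles and interquartile range
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

#limits beyond which a point is considered a flier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
fliers = values[(values < lower) | (values > upper)]

print(q1, median, q3, iqr)
print(fliers)
```

The whiskers are drawn up to the most extreme data points that still fall inside the two limits, so they may be shorter than 1.5 IQR.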

Let's compare the distribution of immigrants coming from the first 10 countries in the dataframe.


In [None]:
countries = df_canada.index[0:10]
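#boxplot() draws one box per column, so we transpose
#to have the countries on the columns
df_10 = df_canada.loc[countries, years].transpose()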

plt.boxplot(df_10)

---

# ASSIGNMENT! Boxplot

Improve the appearance of the above box plot by:

* making it horizontal instead of vertical
* adding the names of the countries
* changing the symbol for the outliers (named "fliers" in pyplot lingo) to something else of your liking

---

In [None]:
# your solution here
plt.figure(figsize=(15, 8))
a = plt.boxplot(df_10, labels = countries, vert=False, sym='+')

# Subplots

It is often necessary to place different plots side by side. That's where subplots come into play. The [subplot()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot.html) function
splits the existing figure into a rectangular grid, and selects one cell of the grid as the target for the subsequent plotting calls.

You will need to specify the number of rows, columns, and active cell (counting by rows).

In [None]:
#create a plot
plt.figure(figsize = (18, 8))

#first subplot
plt.subplot(1, 2, 1) #rows, columns, current subplot
plt.plot(df_canada.loc['Italy', years])
plt.title('Italy')

#second subplot
plt.subplot(1, 2, 2) #rows, columns, current subplot
plt.plot(df_canada.loc['France', years])
plt.title('France')
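The active cell is counted row by row, starting from 1. As a minimal sketch with made-up data, a loop can fill a whole 2x3 grid, and the titles show where each cell index ends up:

```python
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure(figsize = (12, 6))

#made-up data, just to have something to draw
x = np.arange(10)

n_rows, n_cols = 2, 3
for cell in range(1, n_rows * n_cols + 1):
    #cells 1, 2, 3 fill the first row, cells 4, 5, 6 the second
    plt.subplot(n_rows, n_cols, cell)
    plt.plot(x, x * cell)
    plt.title(f'cell {cell}')
```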

It's possible to mix and match different grids. Just be sure to avoid overlapping subplots.

In [None]:
#create a plot
plt.figure(figsize = (18, 8))

#first subplot
#  +---+---+
#  | x |   |
#  +---+---+
plt.subplot(1, 2, 1) #rows, columns, current subplot
plt.plot(df_canada.loc['China', years])
plt.title('China')

#second subplot
#  +---+---+
#  |   | x |
#  +---+---+
#  |   |   |
#  +---+---+
plt.subplot(2, 2, 2) #rows, columns, current subplot
plt.plot(df_canada.loc['Italy', years])
plt.title('Italy')

#third subplot
#  +---+---+
#  |   |   |
#  +---+---+
#  |   | x |
#  +---+---+
plt.subplot(2, 2, 4) #rows, columns, current subplot
plt.plot(df_canada.loc['France', years])
plt.title('France')

# Missing

Stuff that you may want to investigate:

* color definition (RGB...)
* axis ticks (major and minor, symbol, position...)
* axis labels (font size, formatting...)
* axis transformation (flipping x and y, log scale...)
* spines (the border of the plot) and background
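As a starting point for that investigation, here is a small sketch on made-up data that touches a few of the items above. Some of these options live on the Axes object, which `plt.gca()` returns. The specific choices (hex color, font sizes, which spines to hide) are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

#made-up data spanning several orders of magnitude
x = np.arange(1, 11)
y = 2.0 ** x

plt.figure(figsize = (8, 5))
plt.plot(x, y, color = '#1f77b4')    #color defined as an RGB hex string
plt.yscale('log')                    #axis transformation: log scale
plt.xticks(x[::2], fontsize = 12)    #major ticks on every other value
plt.xlabel('x', fontsize = 14)       #axis label with custom font size

#spines are handled through the current Axes object
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
```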