**Note:** This is a fairly advanced notebook that shows the power of Python for animating data. Strictly speaking, you don't need to complete it, but it can be fun to see how Python can accomplish a relatively complex task.

# Day 2-Part 3: Animating data

Every year, BP publishes the Statistical Review of World Energy, an analysis of the world energy markets through time. The data are available at the following [link](https://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy.html).

In this notebook, we will use the data from 2022 and the libraries `numpy`, `pandas`, `matplotlib`, and `celluloid` to make a movie of primary energy consumption per capita versus CO2 emissions per capita from 1965 to 2021.

**Note:** I added a sheet to the original BP Excel file with the codes and region numbers of the countries in the BP dataset. This allows us to color the countries by region and label them by code. See the Excel file in the data directory.

We start by importing the required libraries:

In [None]:
# import libraries
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now, let's read the primary energy consumption from the Excel file. Notice that we need to do some cleaning (remove some rows), but that is easy with `pandas`. At the end, we should have a `DataFrame` with 92 rows (countries):

In [None]:
# primary energy consumption
# build a platform-independent path to the file
path = os.path.join("..", "data", "bp-stats-review-2022-all-data.xlsx") 
# read from third row and drop last 14 rows, make first column the index
pec = pd.read_excel(path, sheet_name = "Primary Energy Consumption", 
                    header=2, skipfooter=14, index_col="Exajoules") 
# remove the last three columns
pec.drop(columns=pec.columns[-3:], inplace=True)
# remove empty rows
pec.dropna(inplace=True)
# remove rows containing "Total"
pec = pec[pec.index.str.contains("Total") == False]
# number of rows should be 92
print("Number of rows =", len(pec.index))
# list the first 5 countries
pec.head()
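
The cleaning steps above can be tried on a small table without the Excel file. The cell below is a minimal sketch, assuming a made-up table with a similar layout (country rows, "Total" rows, and one empty row); the names `raw` and `clean` and all the numbers are hypothetical:

```python
import numpy as np
import pandas as pd

# toy table mimicking the layout of a BP sheet (all values are made up)
raw = pd.DataFrame(
    {1965: [1.0, 2.0, 3.0, np.nan, 4.0, 7.0],
     1966: [1.5, 2.5, 4.0, np.nan, 4.5, 8.5]},
    index=["Canada", "US", "Total North America", "Notes", "Brazil", "Total World"],
)

# remove rows with missing values
clean = raw.dropna()
# remove rows whose label contains "Total"
clean = clean[~clean.index.str.contains("Total")]

# only the country rows remain
print(list(clean.index))
```

Here `~` negates the boolean array returned by `str.contains`, so only the rows that do not contain "Total" are kept.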

Then, we read the primary energy consumption per capita. Again, we need to remove some rows and end up with 92 countries:

In [None]:
# primary energy consumption per capita
# read from third row and drop last 13 rows, make first column the index
pec_cap = pd.read_excel(path, sheet_name = "Primary Energy - Cons capita", 
                        header=2, skipfooter=13, index_col="Gigajoule per capita")
# remove the last two columns
pec_cap.drop(columns=pec_cap.columns[-2:], inplace=True)
# remove empty rows
pec_cap.dropna(inplace=True)
# remove rows containing "Total"
pec_cap = pec_cap[pec_cap.index.str.contains("Total") == False]
# Number of rows should be 92
print("Number of rows =", len(pec_cap.index))
# list the first 5 countries
pec_cap.head()

Then, we read the CO2 emissions. Again, after cleaning the data, we end up with 92 countries:

In [None]:
# co2 emissions from energy
# read from third row and drop last 16 rows, make first column the index
co2 = pd.read_excel(path, sheet_name = "CO2 Emissions from Energy", 
                    header=2, skipfooter=16, index_col="Million tonnes of carbon dioxide")
# remove the last three columns
co2.drop(columns=co2.columns[-3:], inplace=True)
# remove empty rows
co2.dropna(inplace=True)
# remove rows containing "Total"
co2 = co2[co2.index.str.contains("Total") == False]
# Number of rows should be 92
print("Number of rows =", len(co2.index))
# list the first 5 countries
co2.head()

Finally, we read the country codes and regions. Notice that here we keep only the countries that appear in the other DataFrames:

In [None]:
# codes and regions
cod_reg = pd.read_excel(path, sheet_name = "Codes and regions") 
# make first column the index of the DataFrame
cod_reg.set_index("Country", inplace=True)
# use only the indexes/countries in the other DataFrames
cod_reg = cod_reg.loc[pec.index]
# Number of rows should be 92
print("Number of rows =", len(cod_reg.index))
# list the first 5 countries
cod_reg.head()
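
Selecting with `.loc` and a list of labels fails with a `KeyError` if any country is missing from the codes sheet. The cell below is a minimal sketch of how one could check for missing labels first; the names `data_index` and `codes`, and all the values in them, are made up for illustration:

```python
import pandas as pd

# hypothetical stand-ins for pec.index and the codes sheet (made-up values)
data_index = pd.Index(["Canada", "US", "Brazil"], name="Country")
codes = pd.DataFrame(
    {"code": ["BRA", "CAN", "FRA", "USA"], "region": [2, 1, 3, 1]},
    index=pd.Index(["Brazil", "Canada", "France", "US"], name="Country"),
)

# labels present in the data but absent from the codes table
missing = data_index.difference(codes.index)
if len(missing) > 0:
    raise KeyError(f"No code/region for: {list(missing)}")

# safe to select; the result follows the order of data_index
subset = codes.loc[data_index]
print(subset)
```

Because the result follows the order of `data_index`, the selected table lines up row by row with the other DataFrames.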

To be on the safe side, let's verify the indexes of the DataFrames are equal:

In [None]:
# check the indexes of the DataFrames are equal
print(pec.index.equals(pec_cap.index))
print(pec.index.equals(co2.index))
print(pec.index.equals(cod_reg.index))

Now we can compute the population by dividing the primary energy consumption by the primary energy consumption per capita. The population will be in millions:

In [None]:
# compute population by dividing primary energy consumption (pec)
# by primary energy consumption per capita (pec_cap)
# notice that pec is in exajoules, while pec_cap is in gigajoules per capita
# therefore the population in millions is
population = (pec*1000)/pec_cap
# set the name of the axis for the index to Millions
population.rename_axis("Millions", inplace=True)
# list the first 5 countries
population.head()
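
To see where the factor of 1000 comes from, the unit conversion can be worked through with plain numbers. This is a small sketch with made-up round values (`pec_ej` and `pec_cap_gj` are hypothetical, not taken from the BP data):

```python
# unit conversion behind population = (pec * 1000) / pec_cap
EJ = 1e18  # joules in one exajoule
GJ = 1e9   # joules in one gigajoule

# made-up round numbers, for illustration only
pec_ej = 10.0       # total consumption [EJ]
pec_cap_gj = 200.0  # consumption per person [GJ per capita]

# full conversion: joules divided by joules per person gives people
people = (pec_ej * EJ) / (pec_cap_gj * GJ)
millions = people / 1e6

# shortcut used in the notebook: 1e18 / 1e9 / 1e6 = 1000
shortcut = (pec_ej * 1000) / pec_cap_gj

print(millions, shortcut)
```

Both expressions give the same population in millions, since one exajoule is 10^9 gigajoules and one million is 10^6.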

We then divide the CO2 emissions by the population to get the CO2 emissions per capita. The result will be in Tonnes:

In [None]:
# compute CO2 emissions per capita by dividing the CO2 emissions co2 by population
# notice that co2 is in million tonnes and the population is in millions
# therefore CO2 emissions per capita in tonnes is
co2_cap = co2 / population
# set the name of the axis for the index to Tonnes 
co2_cap.rename_axis("Tonnes", inplace=True)
# list the first 5 countries
co2_cap.head()
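
Dividing two DataFrames matches rows and columns by label, not by position, which is why it is worth checking that the indexes are equal. The cell below is a minimal sketch with made-up tables (`co2_toy` and `pop_toy` are hypothetical names) whose rows are deliberately in a different order:

```python
import pandas as pd

# made-up values; the rows of pop_toy are in a different order
# and include a label "C" that co2_toy does not have
co2_toy = pd.DataFrame({2020: [500.0, 40.0], 2021: [520.0, 44.0]},
                       index=["A", "B"])
pop_toy = pd.DataFrame({2020: [10.0, 100.0, 7.0], 2021: [11.0, 104.0, 8.0]},
                       index=["B", "A", "C"])

# division aligns on row and column labels
per_cap = co2_toy / pop_toy

print(per_cap)
```

Row "A" is divided by row "A" regardless of its position, and the unmatched label "C" produces a row of `NaN` instead of an error, so a mismatch between the tables could otherwise go unnoticed.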

Let's plot the last year of these data. Notice that we define lists of regions and region colors, and iterate over them to plot each region separately. Also, the size of each country's marker is controlled by its population. To better visualize these data, we make both axes logarithmic:

In [None]:
# graph as scatter the energy consumption per capita versus CO2 emissions per capita for the year 2021
# color the points by region and make their size proportional to population

# regions:
# 1 = North America
# 2 = South and Central America
# 3 = Europe
# 4 = CIS
# 5 = Middle East
# 6 = Africa
# 7 = Asia Pacific
regions = [1, 2, 3, 4, 5, 6, 7]
regions = regions[::-1] # reverse list of regions

# colors for regions
colors = ["palegreen", "darkgreen", "blue", "magenta", "orange", "red", "yellow"]
colors = colors[::-1] # reverse list of colors

# year
year = 2021

# make figure
fig, ax = plt.subplots(figsize=(15,7.5))

# for each region
for (region, color) in zip(regions, colors):
    # extract region data
    my_pec_cap = pec_cap[cod_reg["region"] == region]
    my_co2_cap = co2_cap[cod_reg["region"] == region]
    my_population = population[cod_reg["region"] == region]
    # plot data
    ax.scatter(my_pec_cap[year], my_co2_cap[year], s=my_population[year]*2, 
               c=color, edgecolor="0", alpha=0.75, zorder=2)
    # plot labels
    for index in my_pec_cap.index:
        if my_co2_cap.loc[index,year] >= 0.1:
            ax.text(x=my_pec_cap.loc[index,year], y=my_co2_cap.loc[index,year], 
                    s=cod_reg.loc[index,"code"], size=8, zorder=3)

# plot year
ax.text(x = 2.5, y = 0.85, s=str(year), 
        fontdict=dict(fontfamily="Courier New", color="lightgray", size=250), zorder=1)    

# set axes
ax.set_xlim([1, 2000])
ax.set_ylim([0.1, 100])
ax.set_xscale("log") # x axis is log
ax.set_yscale("log") # y axis is log
ax.set_xlabel("Primary energy consumption per capita [Gigajoules]")
ax.set_ylabel("CO2 emissions per capita [Tonnes]")
ax.grid(True)
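
The selection of countries by region in the loop above relies on a boolean mask built from one DataFrame and applied to another with the same index. The cell below is a minimal sketch of that idea, using made-up tables (`values` and `meta` are hypothetical names) and computing the mask once per region:

```python
import pandas as pd

# made-up values, for illustration only
values = pd.DataFrame({2021: [300.0, 280.0, 60.0]},
                      index=["Canada", "US", "Brazil"])
meta = pd.DataFrame({"region": [1, 1, 2]},
                    index=["Canada", "US", "Brazil"])

regions = [1, 2]
colors = ["palegreen", "darkgreen"]

# countries selected for each color
selected = {}
for region, color in zip(regions, colors):
    # boolean Series indexed by country, True for this region
    mask = meta["region"] == region
    # keep only the rows where the mask is True
    rows = values[mask]
    selected[color] = list(rows.index)

print(selected)
```

In the plotting cell, the same mask could be stored in a variable and reused for `pec_cap`, `co2_cap`, and `population`, which would avoid building it three times.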

That looks pretty cool, but now we would like to animate these data over time. To do that we use [celluloid](https://github.com/jwkvam/celluloid). This external module makes it easy to create an animation. If you don't have `celluloid`, please install it by running the cell below:

In [None]:
# run this cell to install celluloid
import sys
!{sys.executable} -m pip install celluloid

We can now plot every year and snap it using celluloid. The plot below does not look nice, but the animation will. Be patient:

In [None]:
# Create animation of energy consumption per capita versus CO2 emissions per capita over time

# import celluloid Camera
from celluloid import Camera

# create figure
fig, ax = plt.subplots(figsize=(15,7.5))
# set axes
ax.set_xlim([1, 2000])
ax.set_ylim([0.1, 100])
ax.set_xscale("log") # x axis is log
ax.set_yscale("log") # y axis is log
ax.set_xlabel("Primary energy consumption per capita [Gigajoules]")
ax.set_ylabel("CO2 emissions per capita [Tonnes]")
ax.grid(True)
# create camera
camera = Camera(fig)

# for each year
for year in pec_cap.columns:
    # for each region
    for (region, color) in zip(regions, colors):
        # extract region data
        my_pec_cap = pec_cap[cod_reg["region"] == region]
        my_co2_cap = co2_cap[cod_reg["region"] == region]
        my_population = population[cod_reg["region"] == region]
        # plot data
        ax.scatter(my_pec_cap[year], my_co2_cap[year], s=my_population[year]*2, 
                   c=color, edgecolor="0", alpha=0.75, zorder=2)
        # plot labels
        for index in my_pec_cap.index:
            if my_co2_cap.loc[index,year] >= 0.1:
                ax.text(x=my_pec_cap.loc[index,year], y=my_co2_cap.loc[index,year], 
                        s=cod_reg.loc[index,"code"], size=8, zorder=3)
    # plot year
    ax.text(x = 2.5, y = 0.85, s=str(year), 
            fontdict=dict(fontfamily="Courier New", color="silver", size=250), zorder=1)
    # snap current plot
    camera.snap()

Now we can create the animation:

In [None]:
# create animation
animation = camera.animate(interval = 500, repeat = True, repeat_delay = 500)
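
For readers curious about what happens behind `camera.snap()` and `camera.animate()`: celluloid builds on matplotlib's `ArtistAnimation`, which takes one list of artists per frame. The cell below is a minimal sketch of the same idea using matplotlib only; it is not celluloid's actual code, and the names `frames` and `anim` and the toy data are made up:

```python
import matplotlib.pyplot as plt
from matplotlib.animation import ArtistAnimation

# create figure with fixed axes
fig, ax = plt.subplots()
ax.set_xlim(0, 5)
ax.set_ylim(0, 5)

# one list of artists per frame
frames = []
for step in range(5):
    # draw the artists of this frame and keep references to them
    dot = ax.scatter([step + 0.5], [step + 0.5], c="red")
    label = ax.text(0.2, 4.5, f"step {step}")
    frames.append([dot, label])

# each frame shows only its own artists
anim = ArtistAnimation(fig, frames, interval=500, repeat=True, repeat_delay=500)
```

The keyword arguments `interval`, `repeat`, and `repeat_delay` have the same meaning as in the `camera.animate` call above: times are given in milliseconds.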

We can then play it in the notebook. To do this, you may need to install [ffmpeg](https://www.ffmpeg.org/download.html). If this is too much of a hassle, you can skip the cell below.

In [None]:
# import HTML to display animation in notebook
from IPython.display import HTML
# play animation. This takes some time, be patient
HTML(animation.to_html5_video())

Finally, we can save the animation as a movie 🙂:

In [None]:
# save animation. This takes some time, be patient
animation.save("PrimEnergyConsVsCO2PerCapita.mp4", dpi=300)
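
If ffmpeg is not installed, saving to `.mp4` will fail. One possible workaround, sketched below, is to check for ffmpeg and fall back to an animated GIF written with matplotlib's `pillow` writer. The helper `choose_output` is a hypothetical name introduced here for illustration:

```python
import shutil

def choose_output(basename, ffmpeg_path):
    """Return (filename, writer name) depending on whether ffmpeg was found."""
    if ffmpeg_path:
        # ffmpeg available: save as a movie
        return basename + ".mp4", "ffmpeg"
    # no ffmpeg: fall back to an animated GIF
    return basename + ".gif", "pillow"

# shutil.which returns the path to ffmpeg, or None if it is not on the PATH
filename, writer = choose_output("PrimEnergyConsVsCO2PerCapita",
                                 shutil.which("ffmpeg"))
print(filename, writer)

# then, in the notebook:
# animation.save(filename, writer=writer)
```

A GIF at `dpi=300` can become very large, so a lower `dpi` is probably a better choice for the fallback.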