# Preparing the data: Carbon dioxide emissions

We're walking through our four datasets, processing them one at a time:

1. **Carbon dioxide emissions by country**
2. Income (as measured by GDP per capita) by country
3. Population by country (so we can convert CO2 emissions into per capita emissions)
4. List of territories by continent (since we want to be able to group the countries by region of the world)

Let's start by loading our data into a pandas DataFrame and looking at the first 10 rows:

In [10]:
import pandas as pd

co2_all = pd.read_csv("data/co2_emissions_tonnes_per_person.csv")
co2_all.head(10)

,country,1799,1800,1801,1802,1803,1804,1805,1806,1807,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Afghanistan,,,,,,,,,,...,0.238,0.29,0.406,0.345,0.28,0.253,0.262,0.245,0.247,0.254
1,Angola,,,,,,,,,,...,1.23,1.24,1.25,1.35,1.28,1.64,1.22,1.18,1.14,1.12
2,Albania,,,,,,,,,,...,1.47,1.56,1.79,1.69,1.69,1.9,1.6,1.57,1.61,1.59
3,Andorra,,,,,,,,,,...,6.12,6.12,5.87,5.92,5.9,5.83,5.97,6.07,6.27,6.12
4,United Arab Emirates,,,,,,,,,,...,20.9,18.3,18.9,23.8,23.7,24.2,20.7,21.7,21.1,21.4
5,Argentina,,,,,,,,,,...,4.42,4.57,4.61,4.6,4.56,4.56,4.64,4.6,4.55,4.41
6,Armenia,,,,,,,,,,...,1.51,1.48,1.73,1.99,1.91,1.91,1.65,1.76,1.7,1.89
7,Antigua and Barbuda,,,,,,,,,,...,5.88,5.96,5.75,5.8,5.73,5.7,5.84,5.9,5.89,5.88
8,Australia,,,,,,,,,,...,18.8,18.4,18.0,17.8,17.1,16.7,16.8,17.0,17.0,16.9
9,Austria,,,,,,,,0.054,,...,8.1,8.6,8.3,7.95,7.97,7.49,7.7,7.7,7.94,7.75


Let's also use the `pandas` `describe()` method to summarize the contents:

In [11]:
co2_all.describe()

,1799,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
count,5.0,5.0,7.0,5.0,6.0,5.0,5.0,6.0,5.0,5.0,...,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0
mean,0.523786,0.517832,0.659911,0.570422,0.489042,0.590578,0.611746,0.52638,0.599746,0.593316,...,4.623215,4.725098,4.668183,4.722525,4.638652,4.598566,4.496805,4.466514,4.490955,4.499112
std,1.093672,1.085826,1.06514,1.201775,1.083367,1.224113,1.279321,1.148889,1.252499,1.23932,...,6.10235,6.137543,6.049405,6.201106,5.915767,6.024297,5.802658,5.63684,5.647153,5.604449
min,0.00733,0.00716,0.00698,0.00681,0.00665,0.00649,0.00633,0.00618,0.00603,0.00588,...,0.0227,0.0304,0.0366,0.0342,0.0419,0.0383,0.0367,0.0254,0.0244,0.0243
25%,0.0422,0.0293,0.03815,0.0283,0.050025,0.0517,0.0473,0.043425,0.0438,0.0447,...,0.5435,0.609,0.6225,0.635,0.67025,0.677,0.6725,0.69425,0.68675,0.68875
50%,0.0442,0.0438,0.0494,0.0468,0.05205,0.0534,0.0497,0.0546,0.0526,0.0492,...,2.38,2.42,2.375,2.49,2.46,2.51,2.525,2.5,2.535,2.54
75%,0.0452,0.0489,1.01835,0.0502,0.068625,0.0613,0.0554,0.11355,0.0563,0.0568,...,6.1425,6.4375,6.635,6.575,6.235,6.1825,5.895,6.0625,6.0325,6.015
max,2.48,2.46,2.45,2.72,2.7,2.78,2.9,2.87,2.84,2.81,...,41.5,38.8,39.2,42.5,36.0,43.1,41.3,38.5,39.8,38.0


Looking at the data, we learn a few things. First, the dataset contains CO2 data starting in 1799 and running through 2017. The plot we will be creating only needs one year of data, ideally the most recent one. Since this dataset ends in 2017, we'll use that as the common year and, going forward, only concern ourselves with data from 2017.
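If we'd rather not read the last year off the table by eye, the most recent year column can also be found programmatically. Here is a minimal sketch, assuming (as in the table above) that the year columns are strings of digits; the `toy` DataFrame and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for co2_all; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "country": ["A", "B"],
        "2015": [1.0, 2.0],
        "2016": [1.1, 2.1],
        "2017": [1.2, 2.2],
    }
)

# Keep only the column labels that look like years (all digits).
year_columns = [c for c in toy.columns if c.isdigit()]

# Compare the labels numerically rather than alphabetically.
latest_year = max(year_columns, key=int)
print(latest_year)
```

The same two lines should work on `co2_all` as long as every non-year column (such as `country`) has a non-numeric label.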

Second, we see that many of the early years have `NaN` values, which indicate that no data is present. The "count" row of the description above shows that only 5 entries are non-empty in 1799 and 1800, while each of the displayed years from 2008 onward has 194. Let's see how many countries we have in total by using the `shape` attribute of `co2_all`:

In [12]:
co2_all.shape

(194, 220)
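The comparison between the "count" row and the number of rows can also be made directly in code. The sketch below uses a small made-up DataFrame (`toy` is a stand-in, not the real data) to show how `count()` tallies non-missing values per column and how that tally compares to the row count:

```python
import numpy as np
import pandas as pd

# Toy stand-in for co2_all; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "1799": [np.nan, 0.05, np.nan],
        "2017": [0.25, 1.12, 1.59],
    }
)

# count() tallies the non-missing entries in each column,
# which is the same number describe() reports in its "count" row.
non_missing = toy[["1799", "2017"]].count()
print(non_missing)

# A column is complete when its non-missing count equals the number of rows.
is_complete = non_missing == len(toy)
print(is_complete)
```

On the real data, the equivalent check would be whether `co2_all["2017"].count()` equals `co2_all.shape[0]`.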

We have 194 rows, and each of the later years shown has 194 non-empty values, so there is one value for every country on this list in 2017. We want the latest data available to make the plot as relevant as possible, so the last thing to do is extract the 2017 data. We'll need the 2017 column as well as the corresponding country/territory so we can later combine the data with income and population:

In [13]:
co2 = co2_all[["country", "2017"]]
co2

,country,2017
0,Afghanistan,0.254
1,Angola,1.120
2,Albania,1.590
3,Andorra,6.120
4,United Arab Emirates,21.400
...,...,...
189,Samoa,1.320
190,Yemen,0.356
191,South Africa,8.100
192,Zambia,0.302


Since we'll eventually be merging this data with our other datasets, let's use the `rename()` method to give the '2017' column a more self-explanatory name, 'co2':

In [14]:
co2 = co2.rename(columns={"2017": "co2"})
co2

,country,co2
0,Afghanistan,0.254
1,Angola,1.120
2,Albania,1.590
3,Andorra,6.120
4,United Arab Emirates,21.400
...,...,...
189,Samoa,1.320
190,Yemen,0.356
191,South Africa,8.100
192,Zambia,0.302
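To see why the rename pays off at merge time, here is a small sketch with made-up data. The `income_toy` DataFrame is a hypothetical stand-in for a later dataset that also has a column labeled '2017'; its name and values are assumptions for illustration only:

```python
import pandas as pd

# Toy frames with invented values; income_toy is a hypothetical stand-in.
co2_toy = pd.DataFrame({"country": ["A", "B"], "2017": [0.25, 1.5]})
income_toy = pd.DataFrame({"country": ["A", "B"], "2017": [1800, 5900]})

# Without renaming, the overlapping "2017" columns get pandas' default
# "_x" and "_y" suffixes, which say nothing about what they contain.
clash = co2_toy.merge(income_toy, on="country")
print(list(clash.columns))

# rename() returns a new DataFrame and leaves the original untouched,
# so each frame can be given a descriptive column name before merging.
merged = co2_toy.rename(columns={"2017": "co2"}).merge(
    income_toy.rename(columns={"2017": "income"}), on="country"
)
print(list(merged.columns))
```

Because `rename()` does not modify the DataFrame in place by default, the result has to be assigned back to a variable, as we did with `co2` above.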


Lastly, let's save our data to a file.

In [15]:
co2.to_csv('data/intermediate/co2.csv', index=False)
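Two details of this call are worth a quick illustration: `to_csv()` does not create missing directories, and `index=False` keeps the row index out of the file. The sketch below uses a temporary directory and made-up data rather than our real paths, so the directory and file names in it are assumptions for illustration:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the co2 DataFrame; the values are invented.
toy = pd.DataFrame({"country": ["A", "B"], "co2": [0.25, 1.5]})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output directory first; to_csv() will not create it for us.
    out_dir = os.path.join(tmp, "intermediate")
    os.makedirs(out_dir, exist_ok=True)

    # With index=False, only the data columns are written.
    no_index_path = os.path.join(out_dir, "co2.csv")
    toy.to_csv(no_index_path, index=False)
    reloaded = pd.read_csv(no_index_path)

    # With the default index=True, the index is written as an unnamed first
    # column, which read_csv() then loads back as "Unnamed: 0".
    with_index_path = os.path.join(out_dir, "co2_with_index.csv")
    toy.to_csv(with_index_path)
    reloaded_with_index = pd.read_csv(with_index_path)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

If `data/intermediate` does not exist yet, a call like `os.makedirs("data/intermediate", exist_ok=True)` before saving would avoid an error.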

With all of this complete, let's create a function that includes each of the steps above to reproduce what we've done:

In [None]:
import pandas as pd

# Function to process CO2 data.
def process_co2(infile, outfile):
    # Read the raw data
    co2_all = pd.read_csv(infile)

    # Select the columns that we need and rename them for ease of reference
    co2 = co2_all[["country", "2017"]]
    co2 = co2.rename(columns={"2017": "co2"})

    # Output the resulting data
    co2.to_csv(outfile, index=False)

# Rerun the processing end to end; the output should match the file saved above
raw_co2_file = "data/co2_emissions_tonnes_per_person.csv"
co2_file = "data/intermediate/co2.csv"
process_co2(raw_co2_file, co2_file)
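Running the function confirms that it executes, but it doesn't compare the output against anything. One way to check it is to feed it a tiny input file where we know the expected answer. This is a minimal sketch under that assumption: the function is repeated so the block is self-contained, and the file names and values are made up for illustration:

```python
import os
import tempfile

import pandas as pd


def process_co2(infile, outfile):
    # Same steps as above: read, keep country and 2017, rename, save.
    co2_all = pd.read_csv(infile)
    co2 = co2_all[["country", "2017"]].rename(columns={"2017": "co2"})
    co2.to_csv(outfile, index=False)


with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.csv")
    out_path = os.path.join(tmp, "co2.csv")

    # Write a small raw file with an extra year column that should be dropped.
    pd.DataFrame(
        {"country": ["A", "B"], "2016": [0.5, 1.0], "2017": [0.25, 1.5]}
    ).to_csv(raw_path, index=False)

    process_co2(raw_path, out_path)
    result = pd.read_csv(out_path)

# Only the country and renamed 2017 columns should survive.
assert list(result.columns) == ["country", "co2"]
assert result["country"].tolist() == ["A", "B"]
assert result["co2"].tolist() == [0.25, 1.5]
print("process_co2 produced the expected output")
```

The same idea applies to the real data: reading `co2_file` back in and comparing it to the `co2` DataFrame from the earlier cells would confirm that the function reproduces our manual steps.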

The CO2 emissions dataset is ready for use! Let's move on to exploring income per person.