# Project: gathering, inspecting, and cleaning the data

In this step of the project, we'll gather the data that we need:

1. Carbon dioxide emissions by country
2. Income (as measured by GDP per capita) by country
3. Population by country (so we can convert CO2 emissions into per capita emissions)
4. List of territories by continent (since we want to be able to group the countries by region of the world)

As we load each dataset, we will explore it and perform some pre-processing steps to prepare the data for merging and plotting. We'll be on the lookout for inconsistencies in the data (such as numbers and text mixed together) and for missing data, which may dictate which years we can show in our final plot.

## Data Sources

Let's start with our four data sources. The first three are provided on the Gapminder website, though each draws on a different original source; the fourth comes from the United Nations.

1. [**Carbon dioxide emissions**](https://www.gapminder.org/tools/#$model$markers$spreadsheet$encoding$number$data$concept=co2_emissions_tonnes_per_person&space@=country&=time;;&scale$domain:null&type:null&zoomed:null;;&label$data$concept=name;;&frame$value=2018&data$concept=time;;;;;;&chart-type=spreadsheet&url=v1) (tonnes per person). These data are carbon dioxide emissions from the burning of fossil fuels, in units of metric tonnes of CO2 per person. These data are sourced from the [Carbon Dioxide Information Analysis Center](https://cdiac.ess-dive.lbl.gov/) at Lawrence Berkeley National Laboratory. If you download these data as a CSV you will get the file 'co2_emissions_tonnes_per_person.csv', which we have included in the `data/` folder.
2. [**Income per person**](https://www.gapminder.org/tools/#$model$markers$spreadsheet$encoding$number$data$concept=income_per_person_gdppercapita_ppp_inflation_adjusted&space@=country&=time;;&scale$domain:null&type:null&zoomed:null;;&label$data$concept=name;;&frame$value=2018&data$concept=time;;;;;;&chart-type=spreadsheet&url=v1) (gross domestic product per capita, in international dollars, inflation-adjusted to 2011 prices). These data are sourced from Gapminder, based on World Bank, A. Maddison, M. Lindgren, International Monetary Fund, and others: [link to more information on the data source](https://www.gapminder.org/data/documentation/gd001/). If you download these data as a CSV you will get the file 'income_per_person_gdppercapita_ppp_inflation_adjusted.csv', which we have included in the `data/` folder.
3. [**Population**](https://www.gapminder.org/tools/#$model$markers$spreadsheet$encoding$number$data$concept=population_total&space@=country&=time;;&scale$domain:null&type:null&zoomed:null;;&label$data$concept=name;;&frame$value=2018&data$concept=time;;;;;;&chart-type=spreadsheet&url=v1). These data are sourced from Gapminder based on Maddison and the United Nations: [link to more information on the data source](http://gapm.io/dpop). If you download these data as a CSV you will get the file 'population_total.csv', which we have included in the `data/` folder.
4. [**Territories by continent**](https://unstats.un.org/unsd/methodology/m49/overview#). These data are sourced from the United Nations. These data are provided in the file 'united_nations_continents.csv' which we have included in the `data/` folder.

At each of the above links you could click the "CSV" button to download the data. We have already done that and placed all of the files in the `data/` folder so you can start exploring right away; we have not changed the content of any dataset.

## Data Exploration

Before you begin using any dataset, you should get to know it a bit to understand its content, check for missing or anomalous values, and determine whether any additional processing is needed before you analyze it further. We'll walk through each of these four datasets, starting with carbon dioxide emissions.

### Dataset 1 of 4: Carbon dioxide emissions data

Let's start by loading our data into a `pandas` `DataFrame` and taking a look at the first 10 rows of content:

In [1]:
import pandas as pd
co2_all = pd.read_csv('data/co2_emissions_tonnes_per_person.csv')
co2_all.head(10)

Unnamed: 0,country,1799,1800,1801,1802,1803,1804,1805,1806,1807,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Afghanistan,,,,,,,,,,...,0.238,0.29,0.406,0.345,0.28,0.253,0.262,0.245,0.247,0.254
1,Angola,,,,,,,,,,...,1.23,1.24,1.25,1.35,1.28,1.64,1.22,1.18,1.14,1.12
2,Albania,,,,,,,,,,...,1.47,1.56,1.79,1.69,1.69,1.9,1.6,1.57,1.61,1.59
3,Andorra,,,,,,,,,,...,6.12,6.12,5.87,5.92,5.9,5.83,5.97,6.07,6.27,6.12
4,United Arab Emirates,,,,,,,,,,...,20.9,18.3,18.9,23.8,23.7,24.2,20.7,21.7,21.1,21.4
5,Argentina,,,,,,,,,,...,4.42,4.57,4.61,4.6,4.56,4.56,4.64,4.6,4.55,4.41
6,Armenia,,,,,,,,,,...,1.51,1.48,1.73,1.99,1.91,1.91,1.65,1.76,1.7,1.89
7,Antigua and Barbuda,,,,,,,,,,...,5.88,5.96,5.75,5.8,5.73,5.7,5.84,5.9,5.89,5.88
8,Australia,,,,,,,,,,...,18.8,18.4,18.0,17.8,17.1,16.7,16.8,17.0,17.0,16.9
9,Austria,,,,,,,,0.054,,...,8.1,8.6,8.3,7.95,7.97,7.49,7.7,7.7,7.94,7.75


Let's also use the `pandas` `describe()` method to summarize the contents:

In [2]:
co2_all.describe()

Unnamed: 0,1799,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
count,5.0,5.0,7.0,5.0,6.0,5.0,5.0,6.0,5.0,5.0,...,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0
mean,0.523786,0.517832,0.659911,0.570422,0.489042,0.590578,0.611746,0.52638,0.599746,0.593316,...,4.623215,4.725098,4.668183,4.722525,4.638652,4.598566,4.496805,4.466514,4.490955,4.499112
std,1.093672,1.085826,1.06514,1.201775,1.083367,1.224113,1.279321,1.148889,1.252499,1.23932,...,6.10235,6.137543,6.049405,6.201106,5.915767,6.024297,5.802658,5.63684,5.647153,5.604449
min,0.00733,0.00716,0.00698,0.00681,0.00665,0.00649,0.00633,0.00618,0.00603,0.00588,...,0.0227,0.0304,0.0366,0.0342,0.0419,0.0383,0.0367,0.0254,0.0244,0.0243
25%,0.0422,0.0293,0.03815,0.0283,0.050025,0.0517,0.0473,0.043425,0.0438,0.0447,...,0.5435,0.609,0.6225,0.635,0.67025,0.677,0.6725,0.69425,0.68675,0.68875
50%,0.0442,0.0438,0.0494,0.0468,0.05205,0.0534,0.0497,0.0546,0.0526,0.0492,...,2.38,2.42,2.375,2.49,2.46,2.51,2.525,2.5,2.535,2.54
75%,0.0452,0.0489,1.01835,0.0502,0.068625,0.0613,0.0554,0.11355,0.0563,0.0568,...,6.1425,6.4375,6.635,6.575,6.235,6.1825,5.895,6.0625,6.0325,6.015
max,2.48,2.46,2.45,2.72,2.7,2.78,2.9,2.87,2.84,2.81,...,41.5,38.8,39.2,42.5,36.0,43.1,41.3,38.5,39.8,38.0


Looking at the data we learn a few things. First, the dataset contains data on CO2 starting in 1799 and going through 2017. For the plot we will be creating, we only need to plot one year of data, likely the most recent year. Since we only have data up to 2017, we'll need to use that as the common year. Going forward, we'll only concern ourselves with data from 2017.

Second, we see that many of those early years have `NaN` values, which indicates there is no data present. We can see in the "count" row of the description above that in 1799 and 1800, only 5 entries are non-empty, and this number is 194 from 2008 forward. Let's see how many countries we have in total, by using the `shape` attribute of `co2_all`:

In [3]:
co2_all.shape

(194, 220)

We have 194 rows and 194 non-empty values in the later years that are included, so we have one value for each country on this list in 2017. We'll want to use the latest data that we have to make the plot as relevant as possible, so the last thing to do is to just extract the 2017 data. We'll need the 2017 column for the data as well as the corresponding country/territory so we can combine the data later with income and population:

In [4]:
co2 = co2_all[['country','2017']]
co2

Unnamed: 0,country,2017
0,Afghanistan,0.254
1,Angola,1.120
2,Albania,1.590
3,Andorra,6.120
4,United Arab Emirates,21.400
...,...,...
189,Samoa,1.320
190,Yemen,0.356
191,South Africa,8.100
192,Zambia,0.302
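
The completeness check described above (comparing the number of rows with the "count" row from `describe()`) can also be done directly in code. The sketch below uses a small made-up table in place of `co2_all`, so the country names and values are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for co2_all; the names and values are invented for illustration
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    '1799': [np.nan, 0.05, np.nan],
    '2017': [0.25, 1.12, 1.59],
})

# Number of non-NaN entries in each column
non_empty = toy.notna().sum()

# Columns where every row has a value
n_rows = toy.shape[0]
complete_columns = non_empty[non_empty == n_rows].index.tolist()
print(complete_columns)
```

Applied to the real data, a year appearing in a list like `complete_columns` would indicate that every country has a value for that year.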


One last step: since we'll eventually be merging our columns together, let's rename the '2017' column so that it is more self-explanatory and call it 'co2' using the `rename()` method:

In [5]:
co2 = co2.rename(columns={'2017':'co2'})
co2

The CO2 emissions data is ready for use! Let's move on to exploring income per person.

### Dataset 2 of 4: Income per person

Again, let's start by loading and viewing the first 10 rows of our data:

In [6]:
income_all = pd.read_csv('data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv')
income_all.head(10)

Unnamed: 0,country,1799,1800,1801,1802,1803,1804,1805,1806,1807,...,2040,2041,2042,2043,2044,2045,2046,2047,2048,2049
0,Afghanistan,683,683,683,683,683,683,683,683,683,...,2690,2750,2810,2870,2930,2990,3060,3120,3190,3260
1,Angola,700,702,705,709,711,714,718,721,725,...,8000,8170,8350,8530,8710,8900,9090,9280,9480,9690
2,Albania,755,755,755,755,755,756,756,756,756,...,25.1k,25.6k,26.2k,26.7k,27.3k,27.9k,28.5k,29.1k,29.7k,30.4k
3,Andorra,1360,1360,1360,1360,1370,1370,1370,1370,1380,...,68.9k,70.4k,71.9k,73.4k,75k,76.6k,78.3k,80k,81.7k,83.4k
4,United Arab Emirates,1130,1130,1140,1140,1150,1150,1160,1160,1160,...,101k,103k,105k,107k,110k,112k,114k,117k,119k,122k
5,Argentina,1730,1730,1740,1740,1750,1760,1760,1770,1770,...,30.5k,31.1k,31.8k,32.5k,33.2k,33.9k,34.7k,35.4k,36.2k,36.9k
6,Armenia,582,582,582,582,582,582,582,582,582,...,24.7k,25.2k,25.8k,26.3k,26.9k,27.5k,28.1k,28.7k,29.3k,29.9k
7,Antigua and Barbuda,857,857,857,857,857,857,857,858,858,...,26.4k,27k,27.6k,28.1k,28.8k,29.4k,30k,30.7k,31.3k,32k
8,Australia,925,930,936,941,947,952,956,962,968,...,72k,73.5k,75.1k,76.7k,78.4k,80.1k,81.8k,83.5k,85.3k,87.2k
9,Austria,2090,2100,2110,2120,2130,2130,2140,2150,2160,...,77.8k,79.5k,81.2k,83k,84.7k,86.6k,88.4k,90.3k,92.3k,94.3k


As we look over the dataset here, we can see we have data that goes from 1799 through 2049. So clearly the later years are projections rather than measured data (unless the data providers have a time traveler on staff!).

We know that we're only concerned with the year 2017 from our discussion of CO2 data, so let's extract the year 2017 and the country:

In [7]:
income_2017 = income_all[['country','2017']]
income_2017

Unnamed: 0,country,2017
0,Afghanistan,2030
1,Angola,6930
2,Albania,13.3k
3,Andorra,58.3k
4,United Arab Emirates,67k
...,...,...
190,Samoa,6390
191,Yemen,2660
192,South Africa,13.9k
193,Zambia,3520



But notice something odd here: some of the values have the letter 'k' at the end of the number. 'k' is the SI prefix for a thousand units (short for "kilo"), but what is it doing in our dataset? Let's take a look at the data types in our data. We'd expect those to be numbers: floats or integers or something similar.

In [8]:
income_2017.dtypes

country    object
2017       object
dtype: object

OK, the data from 2017 are objects, not floats or integers, just like the column for 'country'. Let's take a look at one of the entries in 2017 and see what's going on with the data there.

In [9]:
first_entry = income_2017['2017'][0]
first_entry

'2030'

In [10]:
type(first_entry)

str

It's a string! It's not a number. The authors of the data represented the number 12,000 as the string '12k' instead. We need to correct this since all data that we plot will need to be numerical.
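
To see how widespread the issue is, one option is `pd.to_numeric()` with `errors='coerce'`, which turns anything it cannot parse into `NaN`. The sketch below runs on a few sample strings rather than the real column, and `toy_income` is a made-up name:

```python
import pandas as pd

# A few sample values in the same style as the income column
toy_income = pd.Series(['2030', '6930', '13.3k', '58.3k'])

# Entries that cannot be parsed as numbers become NaN
as_numbers = pd.to_numeric(toy_income, errors='coerce')

# Keep only the original strings that failed to parse
non_numeric = toy_income[as_numbers.isna()]
print(non_numeric.tolist())
```

Listing the entries that fail to parse is a quick way to confirm which values need special handling before conversion.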

For each value, we'll need to read in the string - if there are no letters in the entry, we just need to convert it to a number. If there is a letter k in it, we need to remove that letter and THEN convert the rest into a number and multiply it by 1,000. Let's create a function that does that:

**EXERCISE: Write a function that takes a string as input and outputs the correct numerical representation of the number as a float. That is, if the input is '20' then the output should be the number 20.0; if the input is '20k' then the output should be the number 20000.0**

In [1]:
# BREAK HERE AND DIRECT STUDENTS TO GRADED LAB

In [11]:
def string2num(string):
    if 'k' in string:
        number = float(string[:-1])*1000
    else:
        number = float(string)
    return number
        
print(string2num('2'))
print(string2num('2k'))

2.0
2000.0


Now let's apply our function to our data for each entry in 2017. But before we do, let's make a copy of our data so we don't edit the original (`deep = True` ensures all the data are a copy rather than a view):

In [12]:
income = income_2017.copy(deep=True)
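
As a small illustration of why the copy matters, the sketch below changes a copy and confirms that the original is untouched. The table is made up for the example. Working on an explicit copy may also avoid the `SettingWithCopyWarning` that some `pandas` versions raise when assigning into a selection of another `DataFrame`.

```python
import pandas as pd

# Toy table standing in for income_2017
original = pd.DataFrame({'country': ['A', 'B'], '2017': ['1k', '2']})

# A deep copy has its own data, independent of the original
duplicate = original.copy(deep=True)
duplicate['2017'] = [1000.0, 2.0]

print(original['2017'].tolist())
print(duplicate['2017'].tolist())
```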

We can use the `pandas` method `apply()` to apply our `string2num` function to *each entry* in our '2017' column:

In [13]:
income['2017'] = income_2017['2017'].apply(string2num)
income

Unnamed: 0,country,2017
0,Afghanistan,2030.0
1,Angola,6930.0
2,Albania,13300.0
3,Andorra,58300.0
4,United Arab Emirates,67000.0
...,...,...
190,Samoa,6390.0
191,Yemen,2660.0
192,South Africa,13900.0
193,Zambia,3520.0


While it looks like we have successfully converted the strings to floats, let's look at our data types again to ensure we fixed the issue:

In [14]:
income.dtypes

country     object
2017       float64
dtype: object

Great! We have a column of floats now, so it looks like our fix worked. Let's make one more check to see whether there are any `NaN` values introduced - we want to avoid introducing anything like that into the mix:

In [15]:
income['2017'].isna().values.any()

False

No `NaN` values - so we're all set! 

And as our last step, let's also rename the column header from '2017' to 'income', similar to what we did with `co2` above, so that its content is clear and it will be ready to be merged together after we've loaded each of our four datasets.

In [16]:
income = income.rename(columns={'2017':'income'})
income

Unnamed: 0,country,income
0,Afghanistan,2030.0
1,Angola,6930.0
2,Albania,13300.0
3,Andorra,58300.0
4,United Arab Emirates,67000.0
...,...,...
190,Samoa,6390.0
191,Yemen,2660.0
192,South Africa,13900.0
193,Zambia,3520.0



With both CO2 and income data prepared, population is next!

### Dataset 3 of 4: Population

As is our standard practice at this point, let's load in and view the first 10 entries in our dataset:

In [17]:
pop_all = pd.read_csv('data/population_total.csv')
pop_all.head(10)

Unnamed: 0,country,1799,1800,1801,1802,1803,1804,1805,1806,1807,...,2090,2091,2092,2093,2094,2095,2096,2097,2098,2099
0,Afghanistan,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,...,76.6M,76.4M,76.3M,76.1M,76M,75.8M,75.6M,75.4M,75.2M,74.9M
1,Angola,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,...,168M,170M,172M,175M,177M,179M,182M,184M,186M,188M
2,Albania,400k,402k,404k,405k,407k,409k,411k,413k,414k,...,1.33M,1.3M,1.27M,1.25M,1.22M,1.19M,1.17M,1.14M,1.11M,1.09M
3,Andorra,2650,2650,2650,2650,2650,2650,2650,2650,2650,...,63k,62.9k,62.9k,62.8k,62.7k,62.7k,62.6k,62.5k,62.5k,62.4k
4,United Arab Emirates,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,...,12.3M,12.4M,12.5M,12.5M,12.6M,12.7M,12.7M,12.8M,12.8M,12.9M
5,Argentina,534k,520k,506k,492k,479k,466k,453k,441k,429k,...,57.5M,57.5M,57.4M,57.3M,57.2M,57.2M,57.1M,57M,56.9M,56.8M
6,Armenia,413k,413k,413k,413k,413k,413k,413k,413k,413k,...,2.18M,2.16M,2.15M,2.13M,2.12M,2.1M,2.09M,2.07M,2.05M,2.04M
7,Antigua and Barbuda,37k,37k,37k,37k,37k,37k,37k,37k,37k,...,105k,104k,104k,104k,104k,103k,103k,103k,102k,102k
8,Australia,200k,205k,211k,216k,222k,227k,233k,239k,246k,...,41.1M,41.3M,41.5M,41.7M,41.9M,42.1M,42.3M,42.5M,42.7M,42.9M
9,Austria,3M,3.02M,3.04M,3.05M,3.07M,3.09M,3.11M,3.12M,3.14M,...,8.65M,8.65M,8.65M,8.65M,8.65M,8.66M,8.66M,8.67M,8.67M,8.68M


These data go well into the future, all the way out to 2099. Let's begin by grabbing the 2017 data and dive a bit deeper.

In [18]:
pop_2017 = pop_all[['country','2017']]
pop_2017

Unnamed: 0,country,2017
0,Afghanistan,37.2M
1,Angola,30.8M
2,Albania,2.88M
3,Andorra,77k
4,United Arab Emirates,9.63M
...,...,...
192,Samoa,196k
193,Yemen,28.5M
194,South Africa,57.8M
195,Zambia,17.4M


It looks like we have suffixes of 'k' and 'M' (representing millions). We will have to apply a similar function to what we used for income here to prepare these data. But we have to ask the question: are 'k' and 'M' the *only* suffixes included in the data or are there others? Let's write a function to figure that out.

**EXERCISE: Create a function that takes in a `numpy` array (into which we will pass the 2017 population values of `pop_2017['2017'].values`) and searches through them to find any alphabetic letters that are present.** Hint: we recommend looking into the string method `isalpha()` and the `numpy` method `unique()` to help with this exercise.

For example, if your array of strings contained the following:
|string list|
|-|
| '13k' |
| '546' |
| '9M' |
| '12M' |
| '900k' |

Desired output: ['k', 'M'] (in any order)

In [None]:
# BREAK HERE AND DIRECT STUDENTS TO GRADED LAB

In [19]:
# SOLUTION
import numpy as np

def getcharacters(string_list):
    characters = []

    for value in string_list:
        for character in value:
            if character.isalpha():
                characters.append(character)
    return np.unique(characters)

Let's apply the function we created to our dataset to see what string values are present:

In [None]:
unique_characters = getcharacters(pop_2017['2017'].values)
unique_characters

array(['B', 'M', 'k'], dtype='<U1')

Great! We now know that we need to handle thousands ('k'), millions ('M'), and billions ('B').
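
For comparison, here is a more compact variant of the same search, written with a set comprehension instead of nested loops and `numpy`. This is only an alternative sketch; `get_characters_set` is a name made up for this example.

```python
def get_characters_set(strings):
    """Return the sorted list of distinct alphabetic characters found in strings."""
    # A set keeps one copy of each letter; sorted() gives a stable order
    return sorted({ch for value in strings for ch in value if ch.isalpha()})

print(get_characters_set(['13k', '546', '9M', '12M', '900k']))
```

Note that sorting places uppercase letters before lowercase ones, so 'M' comes before 'k'.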

**EXERCISE: adjust your `string2num()` function that we created earlier to accommodate replacing the 'k', 'M' and 'B' characters**

In [None]:
# BREAK HERE AND DIRECT STUDENTS TO GRADED LAB

In [20]:
def string2num(string):
    if 'k' in string:
        number = float(string[:-1])*1000
    elif 'M' in string:
        number = float(string[:-1])*1e6
    elif 'B' in string:
        number = float(string[:-1])*1e9
    else:
        number = float(string)
    return number
        
print(string2num('2'))
print(string2num('2k'))
print(string2num('2M'))
print(string2num('2B'))

2.0
2000.0
2000000.0
2000000000.0
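
As the number of suffixes grows, the chain of `if`/`elif` branches can be replaced by a lookup table. The sketch below is one possible variant, not a replacement for the solution above. The names `MULTIPLIERS` and `string2num_table` are made up for this example, and it assumes the suffix is always the last character of the string.

```python
# Multiplier for each suffix found in the data
MULTIPLIERS = {'k': 1e3, 'M': 1e6, 'B': 1e9}

def string2num_table(string):
    """Convert strings such as '20', '20k', or '2.5M' to a float."""
    text = string.strip()          # tolerate stray whitespace
    suffix = text[-1:]             # last character ('' for an empty string)
    if suffix in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[suffix]
    return float(text)

print(string2num_table('2'))
print(string2num_table('2k'))
print(string2num_table('2.5M'))
print(string2num_table('2B'))
```

With this layout, supporting another suffix only requires adding one entry to the dictionary.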


Just like we did for the case of income, let's apply our updated `string2num()` function to our population dataset. Since we're modifying our data, let's be sure to make a copy of it first.

In [21]:
pop = pop_2017.copy(deep=True)

pop['2017'] = pop_2017['2017'].apply(string2num)
pop

Unnamed: 0,country,2017
0,Afghanistan,37200000.0
1,Angola,30800000.0
2,Albania,2880000.0
3,Andorra,77000.0
4,United Arab Emirates,9630000.0
...,...,...
192,Samoa,196000.0
193,Yemen,28500000.0
194,South Africa,57800000.0
195,Zambia,17400000.0


This looks good. Let's double-check our data types again to make sure everything was converted to floats, and make sure no `NaN` values were introduced during processing:

In [22]:
pop.dtypes

country     object
2017       float64
dtype: object

In [23]:
pop.isna().values.any()

False

Lastly, let's rename the column header from '2017' to 'population' so that its column label is clear and descriptive for when we merge these data frames later.

In [24]:
pop = pop.rename(columns={"2017": "population"})
pop

Unnamed: 0,country,population
0,Afghanistan,37200000.0
1,Angola,30800000.0
2,Albania,2880000.0
3,Andorra,77000.0
4,United Arab Emirates,9630000.0
...,...,...
192,Samoa,196000.0
193,Yemen,28500000.0
194,South Africa,57800000.0
195,Zambia,17400000.0


With that done, we now have prepared our data for `co2`, `income`, and `pop`. All we need now is a way to identify the corresponding continent / region of the world for each country, and for that, we can look at our UN data.

### Dataset 4 of 4: Continents / global regions

Once again, let's start by loading in and inspecting the next and final dataset we'll be incorporating: the UN data on continents.

In [25]:
undata = pd.read_csv('data/united_nations_continents.csv')
undata.head(10)

ParserError: Error tokenizing data. C error: Expected 1 fields in line 67, saw 2


Well that was unexpected! Whenever we see an error like this, it's good to look directly at our data to see if we missed something about the data that prevented it from loading. Let's take a look into the first few lines of the 'united_nations_continents.csv' file and see if anything odd is going on there:

```text
Global Code;Global Name;Region Code;Region Name;Sub-region Code;Sub-region Name;Intermediate Region Code;Intermediate Region Name;Country or Area;M49 Code;ISO-alpha2 Code;ISO-alpha3 Code;Least Developed Countries (LDC);Land Locked Developing Countries (LLDC);Small Island Developing States (SIDS)
001;World;002;Africa;015;Northern Africa;;;Algeria;012;DZ;DZA;;;
001;World;002;Africa;015;Northern Africa;;;Egypt;818;EG;EGY;;;
001;World;002;Africa;015;Northern Africa;;;Libya;434;LY;LBY;;;
001;World;002;Africa;015;Northern Africa;;;Morocco;504;MA;MAR;;;
001;World;002;Africa;015;Northern Africa;;;Sudan;729;SD;SDN;x;;
```

There is definitely something odd going on here! CSV stands for "comma-separated values"; however, this file is not comma separated, it's semicolon separated. To address this, we need to adjust our data loading code. We can do this with the `sep` keyword of the `read_csv()` function, which tells it that the separator is a semicolon; otherwise it proceeds just as it would with commas.

In [26]:
undata = pd.read_csv('data/united_nations_continents.csv', sep=';')
undata.head(10)

Unnamed: 0,Global Code,Global Name,Region Code,Region Name,Sub-region Code,Sub-region Name,Intermediate Region Code,Intermediate Region Name,Country or Area,M49 Code,ISO-alpha2 Code,ISO-alpha3 Code,Least Developed Countries (LDC),Land Locked Developing Countries (LLDC),Small Island Developing States (SIDS)
0,1,World,2.0,Africa,15.0,Northern Africa,,,Algeria,12,DZ,DZA,,,
1,1,World,2.0,Africa,15.0,Northern Africa,,,Egypt,818,EG,EGY,,,
2,1,World,2.0,Africa,15.0,Northern Africa,,,Libya,434,LY,LBY,,,
3,1,World,2.0,Africa,15.0,Northern Africa,,,Morocco,504,MA,MAR,,,
4,1,World,2.0,Africa,15.0,Northern Africa,,,Sudan,729,SD,SDN,x,,
5,1,World,2.0,Africa,15.0,Northern Africa,,,Tunisia,788,TN,TUN,,,
6,1,World,2.0,Africa,15.0,Northern Africa,,,Western Sahara,732,EH,ESH,,,
7,1,World,2.0,Africa,202.0,Sub-Saharan Africa,14.0,Eastern Africa,British Indian Ocean Territory,86,IO,IOT,,,
8,1,World,2.0,Africa,202.0,Sub-Saharan Africa,14.0,Eastern Africa,Burundi,108,BI,BDI,x,x,
9,1,World,2.0,Africa,202.0,Sub-Saharan Africa,14.0,Eastern Africa,Comoros,174,KM,COM,x,,x
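
The separator can also be checked from Python before loading, by counting candidate characters in the header line. The sketch below uses a shortened, in-memory sample in the same layout as the UN file rather than the file itself. It also reads the 'M49 Code' column as text, which is one way to keep the leading zeros that were dropped in the output above (for example, '012' became 12).

```python
import io
import pandas as pd

# A few lines in the same layout as the UN file (shortened to three columns)
sample = (
    "Region Name;Country or Area;M49 Code\n"
    "Africa;Algeria;012\n"
    "Africa;Egypt;818\n"
)

# Count how often each candidate separator appears in the header line
header = sample.splitlines()[0]
counts = {sep: header.count(sep) for sep in [',', ';', '\t']}
likely_sep = max(counts, key=counts.get)
print(counts, repr(likely_sep))

# Reading the code column as text keeps leading zeros such as '012'
frame = pd.read_csv(io.StringIO(sample), sep=likely_sep, dtype={'M49 Code': str})
print(frame)
```

Whether the leading zeros matter depends on how the codes are used later; for this project only the region and country names are needed.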


While there's a lot of information here, what we need are two columns: 'Region Name' and 'Country or Area'. Let's extract just those from the data:

In [27]:
continent = undata[['Region Name','Country or Area']]
continent.head(10)

Unnamed: 0,Region Name,Country or Area
0,Africa,Algeria
1,Africa,Egypt
2,Africa,Libya
3,Africa,Morocco
4,Africa,Sudan
5,Africa,Tunisia
6,Africa,Western Sahara
7,Africa,British Indian Ocean Territory
8,Africa,Burundi
9,Africa,Comoros


For ease of reference, let's rename the columns 'Region Name' and 'Country or Area' to 'continent' and 'country' respectively:

In [28]:
continent = continent.rename(columns={"Region Name": "continent", "Country or Area": "country"})
continent

Unnamed: 0,continent,country
0,Africa,Algeria
1,Africa,Egypt
2,Africa,Libya
3,Africa,Morocco
4,Africa,Sudan
...,...,...
244,Oceania,Samoa
245,Oceania,Tokelau
246,Oceania,Tonga
247,Oceania,Tuvalu


Let's also sort `continent` by 'country':

In [29]:
continent = continent.sort_values(by=['country'])
continent

Unnamed: 0,continent,country
141,Asia,Afghanistan
195,Europe,Albania
0,Africa,Algeria
239,Oceania,American Samoa
196,Europe,Andorra
...,...,...
6,Africa,Western Sahara
167,Asia,Yemen
27,Africa,Zambia
28,Africa,Zimbabwe


Now we have each of the pieces of the puzzle that we need: `co2`, `income`, `pop` and `continent`. Getting the data ready for use requires significant effort, but it's best if it can be done in a way that is easily reproducible. Imagine that we received an update to the data for 2023 and the first time through we had done all of the steps above manually in the original CSV files. We would then have to go through that process all over again. However, because we automated each process, we can easily rerun the above steps on new data, so our investment in time would pay off going forward.
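
To make the point about reproducibility concrete, the select, convert, and rename steps used above could be wrapped in a single helper so that switching years is a one-argument change. This is only a sketch: `prepare_year` is a made-up name, the toy table and its 2023 column are invented for illustration, and `string2num` is repeated here so the example runs on its own.

```python
import pandas as pd

def string2num(string):
    """Convert strings such as '20', '20k', '2M', or '1B' to a float."""
    if 'k' in string:
        number = float(string[:-1]) * 1000
    elif 'M' in string:
        number = float(string[:-1]) * 1e6
    elif 'B' in string:
        number = float(string[:-1]) * 1e9
    else:
        number = float(string)
    return number

def prepare_year(frame, year, new_name):
    """Keep 'country' and one year column, convert it to floats, and rename it."""
    out = frame[['country', year]].copy()
    # astype(str) lets the same helper handle columns that are already numeric
    out[year] = out[year].astype(str).apply(string2num)
    return out.rename(columns={year: new_name})

# Toy table with invented values, including a hypothetical 2023 column
toy = pd.DataFrame({
    'country': ['A', 'B'],
    '2017': ['13k', '2030'],
    '2023': ['14k', '2.5k'],
})

income_2023 = prepare_year(toy, '2023', 'income')
print(income_2023)
```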

Now that we have each of the four data sources separately, in the next lesson we will merge these into one dataset that we can then plot.

In [31]:
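### Not for inclusion - just to save contents from this notebook
import os

pickle_path = 'data/pickle/'
os.makedirs(pickle_path, exist_ok=True)  # make sure the output folder exists

# Map each output file name to the DataFrame it should hold
to_pickle = {'co2': co2, 'income': income, 'pop': pop, 'continent': continent}
for name, frame in to_pickle.items():
    frame.to_pickle(f'{pickle_path}{name}.pkl')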
