# Preparing the data: Continents

We're walking through our four datasets, processing one at a time:

1. *Carbon dioxide emissions by country*
2. *Income (as measured by GDP per capita) by country*
3. *Population by country (so we can convert CO2 emissions into per capita emissions)*
4. **List of territories by continent (since we want to be able to group the countries by region of the world)**

Once again, let's start by loading and inspecting the data. This is the last dataset we'll incorporate: the UN data on continents.

```python
import pandas as pd

continents_all = pd.read_csv("data/united_nations_continents.csv")
continents_all.head(10)
```

```
File parsers.pyx:2058, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 67, saw 2
```

Uh oh. Running this raises an error! What's going on? When a file fails to load like this, it's worth looking directly at the raw data to see what we missed. Let's look at the first few lines of the `united_nations_continents.csv` file and see if anything odd is going on there:

```text
Global Code;Global Name;Region Code;Region Name;Sub-region Code;Sub-region Name;Intermediate Region Code;Intermediate Region Name;Country or Area;M49 Code;ISO-alpha2 Code;ISO-alpha3 Code;Least Developed Countries (LDC);Land Locked Developing Countries (LLDC);Small Island Developing States (SIDS)
001;World;002;Africa;015;Northern Africa;;;Algeria;012;DZ;DZA;;;
001;World;002;Africa;015;Northern Africa;;;Egypt;818;EG;EGY;;;
001;World;002;Africa;015;Northern Africa;;;Libya;434;LY;LBY;;;
001;World;002;Africa;015;Northern Africa;;;Morocco;504;MA;MAR;;;
001;World;002;Africa;015;Northern Africa;;;Sudan;729;SD;SDN;x;;
```

There is definitely something odd going on here! CSV stands for "comma-separated values"; however, this file is not comma separated, it's semicolon separated. That also explains the error message: splitting on commas, pandas read the header as a single field, then reached a line (line 67) that contains a comma and so appeared to have two fields. To address this, we need to adjust our data loading code. We can pass the `sep` keyword to `read_csv()` to tell it that the separator is a semicolon; otherwise it proceeds just as it would with commas.
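
If we'd rather not rely on eyeballing the file, the same check can be done from Python. The sketch below is only an illustration: `peek` and `guess_delimiter` are hypothetical helper names, and the sample file is a trimmed stand-in built from the rows shown above, not the full UN file. It reads the first few lines and asks the standard library's `csv.Sniffer` which candidate character separates the fields.

```python
import csv
import tempfile
from itertools import islice
from pathlib import Path


def peek(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def guess_delimiter(lines, candidates=",;\t|"):
    """Ask csv.Sniffer which of the candidate characters separates the fields."""
    return csv.Sniffer().sniff("\n".join(lines), delimiters=candidates).delimiter


# Stand-in data (assumption): three columns taken from the rows shown above.
sample = (
    "Region Name;Country or Area;M49 Code\n"
    "Africa;Algeria;012\n"
    "Africa;Egypt;818\n"
    "Africa;Libya;434\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.csv"
    path.write_text(sample, encoding="utf-8")
    first_lines = peek(path, n=3)
    delimiter = guess_delimiter(first_lines)

print(first_lines[0])
print(repr(delimiter))
```

Sniffing is a heuristic, so it's best treated as a hint to confirm by looking at the lines themselves, which is why the sketch prints both.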

```python
import pandas as pd

continents_all = pd.read_csv("data/united_nations_continents.csv", sep=";")
continents_all.head(10)
```

| | Global Code | Global Name | Region Code | Region Name | Sub-region Code | Sub-region Name | Intermediate Region Code | Intermediate Region Name | Country or Area | M49 Code | ISO-alpha2 Code | ISO-alpha3 Code | Least Developed Countries (LDC) | Land Locked Developing Countries (LLDC) | Small Island Developing States (SIDS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | | | Algeria | 12 | DZ | DZA | | | |
| 1 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | | | Egypt | 818 | EG | EGY | | | |
| 2 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | | | Libya | 434 | LY | LBY | | | |
| 3 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | | | Morocco | 504 | MA | MAR | | | |
| 4 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | | | Sudan | 729 | SD | SDN | x | | |
| 5 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | | | Tunisia | 788 | TN | TUN | | | |
| 6 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | | | Western Sahara | 732 | EH | ESH | | | |
| 7 | 1 | World | 2.0 | Africa | 202.0 | Sub-Saharan Africa | 14.0 | Eastern Africa | British Indian Ocean Territory | 86 | IO | IOT | | | |
| 8 | 1 | World | 2.0 | Africa | 202.0 | Sub-Saharan Africa | 14.0 | Eastern Africa | Burundi | 108 | BI | BDI | x | x | |
| 9 | 1 | World | 2.0 | Africa | 202.0 | Sub-Saharan Africa | 14.0 | Eastern Africa | Comoros | 174 | KM | COM | x | | x |
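
One detail in this output is easy to miss: the raw file has codes such as `001` and `012`, but the loaded table shows `1` and `12`, because pandas inferred those columns as numbers. We don't use the code columns in this lesson, so it doesn't affect us. If we did need them, one option would be to read them as text with the `dtype` keyword. The sketch below uses a trimmed in-memory stand-in for the file (an assumption for illustration) to show the difference.

```python
import io

import pandas as pd

# Stand-in data (assumption): three columns from the rows shown above.
raw = (
    "Global Code;Country or Area;M49 Code\n"
    "001;Algeria;012\n"
    "001;Egypt;818\n"
)

# Default type inference parses the codes as integers, dropping leading zeros.
inferred = pd.read_csv(io.StringIO(raw), sep=";")

# Reading the code columns as strings keeps them exactly as written in the file.
as_text = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    dtype={"Global Code": str, "M49 Code": str},
)

print(inferred["M49 Code"].tolist())
print(as_text["M49 Code"].tolist())
```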


While there's a lot of information here, what we need are two columns: 'Region Name' and 'Country or Area'. Let's extract just those from the data:

```python
continent = continents_all[["Region Name", "Country or Area"]]
continent.head(10)
```

| | Region Name | Country or Area |
|---|---|---|
| 0 | Africa | Algeria |
| 1 | Africa | Egypt |
| 2 | Africa | Libya |
| 3 | Africa | Morocco |
| 4 | Africa | Sudan |
| 5 | Africa | Tunisia |
| 6 | Africa | Western Sahara |
| 7 | Africa | British Indian Ocean Territory |
| 8 | Africa | Burundi |
| 9 | Africa | Comoros |
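
Selecting the columns after loading works well for a file of this size. As an alternative (a sketch only, not what this lesson does), `read_csv()` accepts a `usecols` keyword that loads just the named columns in the first place. The in-memory sample below is a trimmed stand-in for the real file.

```python
import io

import pandas as pd

# Stand-in data (assumption): four columns from the rows shown above.
raw = (
    "Global Name;Region Name;Country or Area;M49 Code\n"
    "World;Africa;Algeria;012\n"
    "World;Africa;Egypt;818\n"
)

# Only the two named columns are parsed; the others are skipped while reading.
subset = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    usecols=["Region Name", "Country or Area"],
)

print(subset.columns.tolist())
```

Loading everything first, as we did above, has the advantage that we get to inspect all of the columns before deciding which ones matter.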


For ease of reference, let's rename the columns 'Region Name' and 'Country or Area' to 'continent' and 'country' respectively:

```python
continent = continent.rename(
    columns={"Region Name": "continent", "Country or Area": "country"}
)
continent
```

| | continent | country |
|---|---|---|
| 0 | Africa | Algeria |
| 1 | Africa | Egypt |
| 2 | Africa | Libya |
| 3 | Africa | Morocco |
| 4 | Africa | Sudan |
| ... | ... | ... |
| 244 | Oceania | Samoa |
| 245 | Oceania | Tokelau |
| 246 | Oceania | Tonga |
| 247 | Oceania | Tuvalu |
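
A caution about `rename()`: by default it silently ignores any key that doesn't match an existing column, so a typo in a column name goes unnoticed. Passing `errors="raise"` makes the mismatch visible. The sketch below uses a one-row stand-in frame and a deliberately misspelled key (both are assumptions for illustration).

```python
import pandas as pd

# One-row stand-in frame with the same column names as our data.
df = pd.DataFrame({"Region Name": ["Africa"], "Country or Area": ["Algeria"]})

# Deliberate typo: lowercase "name" does not match the real column.
typo = {"Region name": "continent"}

# By default the unmatched key is ignored and the columns stay as they were.
unchanged = df.rename(columns=typo)

# With errors="raise", the same typo raises a KeyError instead.
try:
    df.rename(columns=typo, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

# The correct mapping passes the stricter check.
renamed = df.rename(
    columns={"Region Name": "continent", "Country or Area": "country"},
    errors="raise",
)

print(unchanged.columns.tolist())
print(typo_caught)
print(renamed.columns.tolist())
```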


Let's also sort `continent` by 'country':

```python
continent = continent.sort_values(by=["country"])
continent
```

| | continent | country |
|---|---|---|
| 141 | Asia | Afghanistan |
| 195 | Europe | Albania |
| 0 | Africa | Algeria |
| 239 | Oceania | American Samoa |
| 196 | Europe | Andorra |
| ... | ... | ... |
| 6 | Africa | Western Sahara |
| 167 | Asia | Yemen |
| 27 | Africa | Zambia |
| 28 | Africa | Zimbabwe |
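
Notice that the row labels on the left (141, 195, 0, ...) still reflect each row's position before sorting. That's harmless here, but if we wanted the labels to count up from zero again, one option is `reset_index(drop=True)`. A small sketch with a three-row stand-in frame (an assumption for illustration):

```python
import pandas as pd

# Three-row stand-in for our renamed data.
continent = pd.DataFrame(
    {
        "continent": ["Africa", "Africa", "Asia"],
        "country": ["Algeria", "Egypt", "Afghanistan"],
    }
)

# Sorting reorders the rows but keeps each row's original label.
by_country = continent.sort_values(by=["country"])

# drop=True discards the old labels instead of keeping them as a new column.
relabeled = by_country.reset_index(drop=True)

print(by_country.index.tolist())
print(relabeled.index.tolist())
```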


Great! This looks ready to save to file.

```python
continent.to_csv("data/intermediate/continent.csv", index=False)
```
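
Two things are worth checking when saving: `to_csv()` does not create missing folders, and it's reassuring to confirm that the file reads back the way we wrote it. The sketch below assumes a temporary directory in place of our project folder and a small stand-in frame, so it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the prepared data.
continent = pd.DataFrame(
    {
        "continent": ["Asia", "Africa", "Africa"],
        "country": ["Afghanistan", "Algeria", "Egypt"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "intermediate" / "continent.csv"

    # Create the output folder (and any parents) if it isn't there yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False leaves the row labels out of the saved file.
    continent.to_csv(out_path, index=False)

    # Read the file back and compare it with what we saved.
    reloaded = pd.read_csv(out_path)

round_trip_ok = reloaded.to_dict("list") == continent.to_dict("list")
print(reloaded.columns.tolist())
print(round_trip_ok)
```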

Now we have each of the pieces of the puzzle that we need: `co2`, `income`, `pop`, and `continent`. Getting data ready for use takes significant effort, so it's best done in a way that is easily reproducible. Imagine that we had made all of the changes above by hand, directly in the original CSV files, and then received an update to the data for 2023: we would have to go through the whole process again. Because we automated each step instead, we can simply rerun the code on the new data, so the time we invested pays off going forward.
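
To make the "rerun it on new data" idea concrete, the steps from this lesson could be collected into a single function. This is a sketch rather than part of the lesson's code: `prepare_continents` is a hypothetical name, and the in-memory sample stands in for an updated file.

```python
import io

import pandas as pd


def prepare_continents(source):
    """Load a semicolon-separated UN file and return sorted continent/country pairs."""
    return (
        pd.read_csv(source, sep=";")[["Region Name", "Country or Area"]]
        .rename(columns={"Region Name": "continent", "Country or Area": "country"})
        .sort_values(by=["country"])
    )


# Stand-in for an updated file (assumption): three rows from the data shown above.
raw = (
    "Region Name;Country or Area;M49 Code\n"
    "Africa;Sudan;729\n"
    "Africa;Egypt;818\n"
    "Africa;Algeria;012\n"
)

continent = prepare_continents(io.StringIO(raw))
print(continent)
```

With the real data, the call would take the file path instead, for example `prepare_continents("data/united_nations_continents.csv")`, and the same line would work unchanged on next year's file.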

Now that we have each of the four data sources separately, in the next lesson we will merge these into one dataset that we can then plot.