# Preparing the data: Population

We're walking through our four datasets, processing one at a time:

1. *Carbon dioxide emissions by country*
2. *Income (as measured by GDP per capita) by country*
3. **Population by country (so we can convert CO2 emissions into per capita emissions)**
4. List of territories by continent (since we want to be able to group the countries by region of the world)

As is our standard practice at this point, let's load in and view the first 10 entries in our dataset:

In [2]:
import pandas as pd
pop_all = pd.read_csv("data/raw/population_total.csv")
pop_all.head(10)

Unnamed: 0,country,1799,1800,1801,1802,1803,1804,1805,1806,1807,...,2090,2091,2092,2093,2094,2095,2096,2097,2098,2099
0,Afghanistan,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,3.28M,...,76.6M,76.4M,76.3M,76.1M,76M,75.8M,75.6M,75.4M,75.2M,74.9M
1,Angola,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,1.57M,...,168M,170M,172M,175M,177M,179M,182M,184M,186M,188M
2,Albania,400k,402k,404k,405k,407k,409k,411k,413k,414k,...,1.33M,1.3M,1.27M,1.25M,1.22M,1.19M,1.17M,1.14M,1.11M,1.09M
3,Andorra,2650,2650,2650,2650,2650,2650,2650,2650,2650,...,63k,62.9k,62.9k,62.8k,62.7k,62.7k,62.6k,62.5k,62.5k,62.4k
4,United Arab Emirates,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,40.2k,...,12.3M,12.4M,12.5M,12.5M,12.6M,12.7M,12.7M,12.8M,12.8M,12.9M
5,Argentina,534k,520k,506k,492k,479k,466k,453k,441k,429k,...,57.5M,57.5M,57.4M,57.3M,57.2M,57.2M,57.1M,57M,56.9M,56.8M
6,Armenia,413k,413k,413k,413k,413k,413k,413k,413k,413k,...,2.18M,2.16M,2.15M,2.13M,2.12M,2.1M,2.09M,2.07M,2.05M,2.04M
7,Antigua and Barbuda,37k,37k,37k,37k,37k,37k,37k,37k,37k,...,105k,104k,104k,104k,104k,103k,103k,103k,102k,102k
8,Australia,200k,205k,211k,216k,222k,227k,233k,239k,246k,...,41.1M,41.3M,41.5M,41.7M,41.9M,42.1M,42.3M,42.5M,42.7M,42.9M
9,Austria,3M,3.02M,3.04M,3.05M,3.07M,3.09M,3.11M,3.12M,3.14M,...,8.65M,8.65M,8.65M,8.65M,8.65M,8.66M,8.66M,8.67M,8.67M,8.68M


These data extend well into the future, all the way out to 2099, so again, the later years are projections rather than measurements. Let's begin by grabbing the 2017 data and diving a bit deeper.

In [3]:
pop_2017 = pop_all[["country", "2017"]]
pop_2017

Unnamed: 0,country,2017
0,Afghanistan,37.2M
1,Angola,30.8M
2,Albania,2.88M
3,Andorra,77k
4,United Arab Emirates,9.63M
...,...,...
192,Samoa,196k
193,Yemen,28.5M
194,South Africa,57.8M
195,Zambia,17.4M


It looks like we have suffixes of 'k' (representing thousands) and 'M' (representing millions). We will have to apply a function similar to the one we used for income to prepare these data. But first we have to ask: are 'k' and 'M' the *only* suffixes in the data, or are there others? Let's write a function to figure that out.

We'll create a function that takes in a `numpy` array (we will pass it the 2017 population values, `pop_2017['2017'].values`) and searches through each string to find any alphabetic characters that are present. We'll use the string method `isalpha()` to test each character and the `numpy` function `unique()` to remove duplicates from the result.

For example, if our array of strings contained the following:

|string list|
|-|
| '13k' |
| '546' |
| '9M' |
| '12M' |
| '900k' |

Our desired output would be: ['M', 'k'], since `np.unique()` sorts its results.

In [4]:
import numpy as np

def getcharacters(string_list):
    characters = []

    for value in string_list:
        for character in value:
            if character.isalpha():
                characters.append(character)
    return np.unique(characters)
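As a quick sanity check before touching the real data, the function can be run on the small example list from the table above. This is only a minimal sketch: `sample` and `sample_characters` are made-up names used for illustration, and the function is repeated so the snippet runs on its own.

```python
import numpy as np

def getcharacters(string_list):
    # Collect every alphabetic character found in any of the strings
    characters = []
    for value in string_list:
        for character in value:
            if character.isalpha():
                characters.append(character)
    # Drop duplicates (np.unique also sorts the result)
    return np.unique(characters)

# Hypothetical sample values, matching the example table above
sample = np.array(["13k", "546", "9M", "12M", "900k"])
sample_characters = getcharacters(sample)
print(sample_characters)
```

The letters come back in sorted order, with capitals first.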

Let's apply the function we created to our dataset to see what string values are present:

In [5]:
unique_characters = getcharacters(pop_2017["2017"].values)
unique_characters

array(['B', 'M', 'k'], dtype='<U1')

Great! We now know that we need to handle thousands ('k'), millions ('M'), and billions ('B').

So we need to adjust the `string2num()` function we created earlier to handle the 'k', 'M', and 'B' suffixes.

In [6]:
def string2num(string):
    if "k" in string:
        number = float(string[:-1]) * 1000
    elif "M" in string:
        number = float(string[:-1]) * 1e6
    elif "B" in string:
        number = float(string[:-1]) * 1e9
    else:
        number = float(string)
    return number

# Test our output
print(string2num("2"))
print(string2num("2k"))
print(string2num("2M"))
print(string2num("2B"))

2.0
2000.0
2000000.0
2000000000.0
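If more suffixes turn up later, the chain of `if`/`elif` branches grows with each one. One possible alternative, shown here only as a sketch, keeps the suffixes in a dictionary. The names `SUFFIX_MULTIPLIERS` and `string2num_lookup` are hypothetical and are not used elsewhere in this walkthrough. Unlike the version above, this variant only looks at the last character of the string.

```python
# Hypothetical lookup table: suffix -> multiplier
SUFFIX_MULTIPLIERS = {"k": 1e3, "M": 1e6, "B": 1e9}

def string2num_lookup(string):
    # Hypothetical variant of string2num() driven by the lookup table
    string = str(string).strip()
    suffix = string[-1:]  # last character, or "" for an empty string
    if suffix in SUFFIX_MULTIPLIERS:
        return float(string[:-1]) * SUFFIX_MULTIPLIERS[suffix]
    # No recognized suffix: treat the whole string as a plain number
    return float(string)

# Same test values as above
for text in ["2", "2k", "2M", "2B"]:
    print(text, string2num_lookup(text))
```

Adding a new suffix then means adding one dictionary entry rather than another branch.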


Just like we did for the case of income, let's apply our updated `string2num()` function to our population dataset. Since we're modifying our data, let's be sure to make a copy of it first.

In [7]:
pop = pop_2017.copy(deep=True)

pop["2017"] = pop_2017["2017"].apply(string2num)
pop

Unnamed: 0,country,2017
0,Afghanistan,37200000.0
1,Angola,30800000.0
2,Albania,2880000.0
3,Andorra,77000.0
4,United Arab Emirates,9630000.0
...,...,...
192,Samoa,196000.0
193,Yemen,28500000.0
194,South Africa,57800000.0
195,Zambia,17400000.0


This looks good. Let's double-check our data types to make sure everything was converted to floats, and confirm that no `NaN` values were introduced during processing:

In [8]:
pop.dtypes

country     object
2017       float64
dtype: object

In [9]:
pop.isna().values.any()

False
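Had that check returned `True`, the next question would be where the missing values are. The sketch below shows one way to find out, using a small made-up data frame (`toy`) with one deliberately missing value; it is not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value
toy = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "2017": [77000.0, np.nan, 2500000.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```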

Lastly, let's rename the column from '2017' to 'population' so that its label is clear and descriptive when we merge these data frames later.

In [10]:
pop = pop.rename(columns={"2017": "population"})
pop

Unnamed: 0,country,population
0,Afghanistan,37200000.0
1,Angola,30800000.0
2,Albania,2880000.0
3,Andorra,77000.0
4,United Arab Emirates,9630000.0
...,...,...
192,Samoa,196000.0
193,Yemen,28500000.0
194,South Africa,57800000.0
195,Zambia,17400000.0


Excellent! This is ready to save to file.

In [11]:
pop.to_csv('data/intermediate/pop.csv', index=False)

Great - let's collect these steps into a single function that we can use to reproduce this process later, as we've done with emissions and income:

In [None]:
import pandas as pd

def process_population(infile, outfile):
    # Helper function to convert string data within the file into numerical data
    def string2num(string):
        if "k" in string:
            number = float(string[:-1]) * 1000
        elif "M" in string:
            number = float(string[:-1]) * 1e6
        elif "B" in string:
            number = float(string[:-1]) * 1e9
        else:
            number = float(string)
        return number
    
    # Read the file, select the columns of interest
    pop_all = pd.read_csv(infile)
    pop_2017 = pop_all[["country", "2017"]]
    
    # Convert the textual data in the dataset into numerical data
    pop = pop_2017.copy(deep=True)
    pop["2017"] = pop_2017["2017"].apply(string2num)
    
    # Rename the columns to make them more readable and save the output to file
    pop = pop.rename(columns={"2017": "population"})
    pop.to_csv(outfile, index=False)

# Run the function on the raw file; it should produce the same output as the original process:
raw_population_file = "data/raw/population_total.csv"
population_file = "data/intermediate/pop.csv"
process_population(raw_population_file, population_file)
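The call above regenerates the file but does not compare it against anything. One way to make the comparison explicit is sketched below. It uses a tiny made-up dataset in a temporary directory so that it does not touch the real files. The file names and the `Country A`/`B`/`C` rows are hypothetical, and `process_population()` is condensed here (with `string2num()` moved to module level) to keep the snippet self-contained.

```python
import os
import tempfile

import pandas as pd


def string2num(string):
    # Same suffix handling as in the walkthrough above
    if "k" in string:
        return float(string[:-1]) * 1000
    elif "M" in string:
        return float(string[:-1]) * 1e6
    elif "B" in string:
        return float(string[:-1]) * 1e9
    return float(string)


def process_population(infile, outfile):
    # Condensed version of the function defined above
    pop = pd.read_csv(infile)[["country", "2017"]].copy(deep=True)
    pop["2017"] = pop["2017"].apply(string2num)
    pop = pop.rename(columns={"2017": "population"})
    pop.to_csv(outfile, index=False)


with tempfile.TemporaryDirectory() as tmpdir:
    raw_file = os.path.join(tmpdir, "raw.csv")
    reference_file = os.path.join(tmpdir, "reference.csv")
    function_file = os.path.join(tmpdir, "function.csv")

    # Made-up miniature stand-in for the raw population file
    raw = pd.DataFrame({
        "country": ["Country A", "Country B", "Country C"],
        "2016": ["76k", "2.5M", "1.4B"],
        "2017": ["77k", "2.5M", "1.5B"],
    })
    raw.to_csv(raw_file, index=False)

    # Step-by-step process, as in the individual cells
    step = pd.read_csv(raw_file)[["country", "2017"]].copy(deep=True)
    step["2017"] = step["2017"].apply(string2num)
    step = step.rename(columns={"2017": "population"})
    step.to_csv(reference_file, index=False)

    # Same processing through the single function
    process_population(raw_file, function_file)

    reference = pd.read_csv(reference_file)
    result = pd.read_csv(function_file)

# Raises an AssertionError if the two outputs differ
pd.testing.assert_frame_equal(result, reference)
print("Outputs match")
```

With the real data, the same idea would apply: keep a copy of the file written by the step-by-step cells, run the function to a second path, and compare the two.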

With that done, we have now prepared our data for `co2`, `income`, and `pop`. All we need now is a way to identify the corresponding continent / region of the world for each country, and for that, we can look at our UN data.