# Preparing the data: Income

We're walking through our four datasets, processing one at a time:

1. *Carbon dioxide emissions by country*
2. **Income (as measured by GDP per capita) by country**
3. Population by country (so we can convert CO2 emissions into per capita emissions)
4. List of territories by continent (since we want to be able to group the countries by region of the world)

Again, let's start by loading and viewing the first 10 rows of our data:

In [2]:
import pandas as pd

income_all = pd.read_csv("data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
income_all.head(10)

,country,1799,1800,1801,1802,1803,1804,1805,1806,1807,...,2040,2041,2042,2043,2044,2045,2046,2047,2048,2049
0,Afghanistan,683,683,683,683,683,683,683,683,683,...,2690,2750,2810,2870,2930,2990,3060,3120,3190,3260
1,Angola,700,702,705,709,711,714,718,721,725,...,8000,8170,8350,8530,8710,8900,9090,9280,9480,9690
2,Albania,755,755,755,755,755,756,756,756,756,...,25.1k,25.6k,26.2k,26.7k,27.3k,27.9k,28.5k,29.1k,29.7k,30.4k
3,Andorra,1360,1360,1360,1360,1370,1370,1370,1370,1380,...,68.9k,70.4k,71.9k,73.4k,75k,76.6k,78.3k,80k,81.7k,83.4k
4,United Arab Emirates,1130,1130,1140,1140,1150,1150,1160,1160,1160,...,101k,103k,105k,107k,110k,112k,114k,117k,119k,122k
5,Argentina,1730,1730,1740,1740,1750,1760,1760,1770,1770,...,30.5k,31.1k,31.8k,32.5k,33.2k,33.9k,34.7k,35.4k,36.2k,36.9k
6,Armenia,582,582,582,582,582,582,582,582,582,...,24.7k,25.2k,25.8k,26.3k,26.9k,27.5k,28.1k,28.7k,29.3k,29.9k
7,Antigua and Barbuda,857,857,857,857,857,857,857,858,858,...,26.4k,27k,27.6k,28.1k,28.8k,29.4k,30k,30.7k,31.3k,32k
8,Australia,925,930,936,941,947,952,956,962,968,...,72k,73.5k,75.1k,76.7k,78.4k,80.1k,81.8k,83.5k,85.3k,87.2k
9,Austria,2090,2100,2110,2120,2130,2130,2140,2150,2160,...,77.8k,79.5k,81.2k,83k,84.7k,86.6k,88.4k,90.3k,92.3k,94.3k


As we look over the dataset here, we can see we have data that goes from 1799 through 2049. So clearly the later years are projections rather than measured data (unless the data providers have a time traveler on staff!).

We know that we're only concerned with the year 2017 from our discussion of CO2 data, so let's extract the year 2017 and the country:

In [3]:
income_2017 = income_all[["country", "2017"]]
income_2017

,country,2017
0,Afghanistan,2030
1,Angola,6930
2,Albania,13.3k
3,Andorra,58.3k
4,United Arab Emirates,67k
...,...,...
190,Samoa,6390
191,Yemen,2660
192,South Africa,13.9k
193,Zambia,3520



But notice something odd here... Some of the values have the letter 'k' at the end of a number. 'k' is the SI shorthand for a thousand units (representing "kilo"), but what is it doing in our dataset? Let's take a look at the data types in our data. We'd expect those to be numbers: floats or integers or something similar.

In [4]:
income_2017.dtypes

country    object
2017       object
dtype: object

OK, the data from 2017 are objects, not floats or integers, just like the column for 'country'. Let's take a look at one of the entries in 2017 and see what's going on with the data there.

In [5]:
first_entry = income_2017["2017"][0]
first_entry

'2030'

In [6]:
type(first_entry)

str

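It's a string, not a number! The authors of the data wrote some values in shorthand (the number 12,000 would appear as the string '12k', for example), and because of those entries the whole column was read in as text. We need to correct this, since all data that we plot will need to be numerical.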

For each value, we'll need to read in the string. If there are no letters in the entry, we just need to convert it to a number. If there is a letter k in it, we need to remove that letter and THEN convert the rest into a number and multiply it by 1,000.

We'll write a function that takes a string as input and outputs the correct numerical representation of the number as a float. That is, if the input is '20' then the output should be the number 20.0; if the input is '20k' then the output should be the number 20000.0.

In [7]:
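def string2num(string):
    # Entries such as "13.3k" use "k" to mean thousands
    if "k" in string:
        # Drop the trailing "k", convert the rest, and scale by 1,000
        number = float(string[:-1]) * 1000
    else:
        # Plain entries such as "2030" convert directly
        number = float(string)
    return number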


print(string2num("2"))
print(string2num("2k"))

2.0
2000.0
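
The function above assumes every entry is a string and that 'k' only ever appears as the final character. If that might not hold, a slightly more defensive variant could look like the sketch below. The name `string2num_safe` and the decision to pass missing values through as `NaN` are assumptions made for illustration; they are not part of the workflow used in the rest of this section.

```python
import math


def string2num_safe(value):
    """Hypothetical, more defensive variant of string2num."""
    # Pass missing values through as NaN instead of raising an error
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return float("nan")
    # Accept numbers as well as strings, and ignore surrounding whitespace
    text = str(value).strip()
    # Only treat a trailing "k" as the thousands suffix
    if text.endswith("k"):
        return float(text[:-1]) * 1000
    return float(text)


print(string2num_safe("2"))
print(string2num_safe("2k"))
print(string2num_safe(float("nan")))
```

Our income column turns out not to need this extra care, so we'll stick with the simpler `string2num`.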


Now let's apply our function to our data for each entry in 2017. But before we do, let's make a copy of our data so we don't edit the original (`deep=True` ensures all the data are a copy rather than a view):

In [8]:
income = income_2017.copy(deep=True)
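
To see what the copy buys us, here is a small sketch on a made-up two-row table (the names `original`, `subset`, and `working` are hypothetical). Edits made to the deep copy should leave the frame it was copied from untouched:

```python
import pandas as pd

# Made-up stand-in for the real income table
original = pd.DataFrame(
    {"country": ["Angola", "Albania"], "2017": ["6930", "13.3k"]}
)
subset = original[["country", "2017"]]

# Work on an independent copy so that `subset` keeps its raw strings
working = subset.copy(deep=True)
working["2017"] = [6930.0, 13300.0]

print(subset["2017"].tolist())   # still the raw strings
print(working["2017"].tolist())  # the converted numbers
```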

We can use the `pandas` method `apply()` to apply our `string2num` function to *each entry* in our '2017' column:

In [9]:
income["2017"] = income_2017["2017"].apply(string2num)
income

,country,2017
0,Afghanistan,2030.0
1,Angola,6930.0
2,Albania,13300.0
3,Andorra,58300.0
4,United Arab Emirates,67000.0
...,...,...
190,Samoa,6390.0
191,Yemen,2660.0
192,South Africa,13900.0
193,Zambia,3520.0
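
As an aside, the same conversion can also be written with vectorized pandas string methods rather than `apply()`. The sketch below is an alternative, not the approach used in this section, and it runs on a made-up sample column (`raw`) rather than the real data:

```python
import pandas as pd

# Made-up sample mimicking the mix of plain and "k"-suffixed strings
raw = pd.Series(["2030", "6930", "13.3k", "58.3k", "67k"])

# Boolean mask marking the entries that end with the "k" suffix
has_k = raw.str.endswith("k")

# Strip a trailing "k" from every entry, then parse the text as floats
values = raw.str.replace(r"k$", "", regex=True).astype(float)

# Keep unsuffixed values as they are; scale the suffixed ones by 1,000
values = values.where(~has_k, values * 1000)

print(values.tolist())
```

For a column of under 200 rows either approach is fast enough; the `apply()` version has the advantage of keeping the conversion logic in one small, testable function.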


While it looks like we have successfully converted the strings to floats, let's look at our data types again to ensure we fixed the issue:

In [10]:
income.dtypes

country     object
2017       float64
dtype: object

Great! We have a column of floats now, so it looks like our fix worked. Let's make one more check to see whether there are any `NaN` values introduced - we want to avoid introducing anything like that into the mix:

In [11]:
income["2017"].isna().values.any()

False
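
Had the check come back `True`, the natural next step would be to find out how many values are missing and which countries they belong to. A minimal sketch of that follow-up, using a made-up table (`toy`) with one deliberately missing value:

```python
import numpy as np
import pandas as pd

# Made-up table with one deliberately missing income value
toy = pd.DataFrame(
    {"country": ["A", "B", "C"], "2017": [2030.0, np.nan, 13300.0]}
)

# Count the missing entries in the column
n_missing = int(toy["2017"].isna().sum())

# Select the rows whose income is missing so they can be inspected
missing_rows = toy[toy["2017"].isna()]

print(n_missing)
print(missing_rows["country"].tolist())
```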

No `NaN` values - so we're all set! 

And as our last step, let's also rename the column header from '2017' to 'income', similar to what we did with `co2` above, so that its content is clear and it will be ready to be merged together after we've loaded each of our four datasets.

In [12]:
income = income.rename(columns={"2017": "income"})
income

,country,income
0,Afghanistan,2030.0
1,Angola,6930.0
2,Albania,13300.0
3,Andorra,58300.0
4,United Arab Emirates,67000.0
...,...,...
190,Samoa,6390.0
191,Yemen,2660.0
192,South Africa,13900.0
193,Zambia,3520.0


Great - this looks ready to go! Let's save this to file.

In [13]:
income.to_csv('data/intermediate/income.csv', index=False)
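
Note that `to_csv` writes into an existing folder; it will raise an error if `data/intermediate/` has not been created yet. One way to guard against that is sketched below, using a temporary directory and a made-up table so the example can run anywhere (the names `toy` and `outfile` are for illustration only):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the processed income table
toy = pd.DataFrame({"country": ["Angola", "Albania"], "income": [6930.0, 13300.0]})

with tempfile.TemporaryDirectory() as tmp:
    outfile = Path(tmp) / "data" / "intermediate" / "income.csv"

    # Create the output folder (and any missing parents) before writing
    outfile.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(outfile, index=False)

    written = outfile.exists()

print(written)
```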

Wonderful! With that complete, let's again collect our steps and various functions into a single function that we can run for reproducibility:

In [None]:
import pandas as pd

# Function to process income data
def process_income(infile, outfile):
    # Helper function to convert string data within the file into numerical data
    def string2num(string):
        if "k" in string:
            number = float(string[:-1]) * 1000
        else:
            number = float(string)
        return number
    
    # Read the file, select the columns of interest
    income_all = pd.read_csv(infile)
    income_2017 = income_all[["country", "2017"]]
    
    # Convert the textual data in the dataset into numerical data
    income = income_2017.copy(deep=True)
    income["2017"] = income_2017["2017"].apply(string2num)
    
    # Rename the columns to make them more readable and save the output to file
    income = income.rename(columns={"2017": "income"})
    income.to_csv(outfile, index=False)

# Run the function; its output file should be identical to the one from the step-by-step process:
raw_income_file = "data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv"
income_file = "data/intermediate/income.csv"
process_income(raw_income_file, income_file)
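
Running the function only regenerates the file; it does not by itself confirm that the result matches the step-by-step version. One way to check is to compare two CSV files with `pd.testing.assert_frame_equal`. The sketch below assumes the two outputs were saved under different filenames, and the helper name `csv_outputs_match` is hypothetical. It is demonstrated on made-up data in a temporary directory:

```python
import tempfile
from pathlib import Path

import pandas as pd


def csv_outputs_match(path_a, path_b):
    """Hypothetical helper: True if two CSV files hold the same table."""
    frame_a = pd.read_csv(path_a)
    frame_b = pd.read_csv(path_b)
    try:
        # Raises AssertionError on any difference in shape, dtype, or values
        pd.testing.assert_frame_equal(frame_a, frame_b)
    except AssertionError:
        return False
    return True


# Demo: write the same made-up table twice and compare the two files
toy = pd.DataFrame({"country": ["Angola", "Albania"], "income": [6930.0, 13300.0]})
with tempfile.TemporaryDirectory() as tmp:
    step_by_step = Path(tmp) / "income_manual.csv"
    from_function = Path(tmp) / "income_function.csv"
    toy.to_csv(step_by_step, index=False)
    toy.to_csv(from_function, index=False)
    same = csv_outputs_match(step_by_step, from_function)

print(same)
```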

With both CO2 and income data prepared, population is next!