In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
from requests import get
from bs4 import BeautifulSoup
from collections import defaultdict

plt.rcParams["figure.figsize"] = (15,8) #set size of plot

# Start working

## Finding calories needed for each country
The first thing we need for our analysis is the total caloric demand of each country.

### Working on calories demand  
We load the calorie demand datasets we scraped from the webpage [Calories](https://health.gov/dietaryguidelines/2015/guidelines/appendix-2/). These are the datasets we will later match with the population data.

In [229]:
male_calory_demand = pd.read_excel("data/calories_demand.xlsx",header =None, sheet_name=0, names=['age', 'sedentary', 'moderate', 'active'])

In [230]:
females_calory_demand =  pd.read_excel("data/calories_demand.xlsx",header =None, sheet_name=1, names=['age', 'sedentary', 'moderate', 'active'])

To make the collected information easier to work with, we apply two simplifications to the data:
- in the calorie demand datasets, average the three activity levels (sedentary, moderate, active) into a single caloric input per age
- group the ages into ranges that match the ranges provided in the World Population Database

In [231]:
def input_average(data_frame):
    activity_cols = ['sedentary', 'moderate', 'active']
    result = data_frame.copy()
    # average only the activity columns: 'age' holds mixed ints and strings
    result['input (KCal)'] = result[activity_cols].mean(axis=1)
    result = result.drop(columns=activity_cols) #we keep only the mean
    return result

In [232]:
male_calories_avg = input_average(male_calory_demand)
females_calories_avg = input_average(females_calory_demand)
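
To make the averaging step concrete, here is a small sketch on a made-up two-row table (the values are illustrative and not taken from the dataset). It restricts the mean to the three activity columns, so the `age` column, which mixes integers and strings, never enters the computation.

```python
import pandas as pd

# Toy data: values are illustrative only, not read from calories_demand.xlsx
toy = pd.DataFrame({
    "age": [4, "19-20"],
    "sedentary": [1200, 2600],
    "moderate": [1400, 2800],
    "active": [1600, 3000],
})

activity_cols = ["sedentary", "moderate", "active"]

# Keep the age column and add the row-wise mean of the activity levels
toy_avg = toy[["age"]].copy()
toy_avg["input (KCal)"] = toy[activity_cols].mean(axis=1)
print(toy_avg)
```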

We have now obtained an average caloric demand per age, stored in the two dataframes above, which will simplify later calculations.  
Next, we need a way to match the age groups in these dataframes to the ones in the population database. To that end, let's look at how ages are represented in our calorie demand dataframes.

In [11]:
male_calories_avg['age'].unique()

array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       '19-20', '21-25', '26-30', '31-35', '36-40', '41-45', '46-50',
       '51-55', '56-60', '61-65', '66-70', '71-75', '76 and up', nan],
      dtype=object)

We can see that the age ranges have different sizes (which makes sense, because different age groups have different caloric needs). The following function expands each range into its individual ages, so that we can create one row per age.

In [12]:
def single_age(age_range):
    if isinstance(age_range, float): # nans are the only floats in the age column
        return -1
    elif isinstance(age_range, int): # already a single age
        return age_range
    elif re.search(r'\d-\d', age_range): # a range such as '19-20'
        group = age_range.split('-')
        return list(range(int(group[0]), int(group[1])+1))
    elif age_range == "76 and up": # open-ended range, capped at 101
        return list(range(76, 101+1))

In [13]:
def explode_age(data_frame):
    accum = []
    for i in data_frame.index:
        row = data_frame.loc[i]
        single = single_age(row['age'])
        if single == -1: # we ignore the nan values, as their rows are empty
            continue
        if type(single) == int:
            accum.append((single, row['input (KCal)']))
        elif type(single) == list:
            accum.extend([(x, row['input (KCal)']) for x in single]) 
    return pd.DataFrame(accum, columns=data_frame.columns)

We apply the function to our two dataframes:

In [14]:
male_explode = explode_age(male_calories_avg)
female_explode = explode_age(females_calories_avg)
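
As an alternative sketch (not the approach used above), the same expansion can be written with pandas' built-in `DataFrame.explode`, available from pandas 0.25. The helper `age_to_list` and the toy values are hypothetical, introduced only for illustration. Rows with a missing age would have to be dropped first, for example with `dropna(subset=["age"])`.

```python
import pandas as pd

def age_to_list(age, max_age=101):
    """Return the list of single ages covered by one 'age' cell (hypothetical helper)."""
    if isinstance(age, str):
        if age == "76 and up":
            return list(range(76, max_age + 1))
        low, high = age.split("-")
        return list(range(int(low), int(high) + 1))
    return [int(age)]

# Toy data: values are illustrative only
toy = pd.DataFrame({
    "age": [18, "19-20", "76 and up"],
    "input (KCal)": [2800.0, 2800.0, 2200.0],
})

exploded = (
    toy.assign(age=toy["age"].map(age_to_list))  # each cell becomes a list of ages
       .explode("age")                           # one row per element of the list
       .astype({"age": int})
       .reset_index(drop=True)
)
print(exploded.head())
```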

In [16]:
male_explode['age'].unique()

array([  2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,
        15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
        28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,
        41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,
        54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,
        67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
        80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
        93,  94,  95,  96,  97,  98,  99, 100, 101], dtype=int64)

Ages are now unique in each dataframe (`male_explode` and `female_explode`) and there's a caloric input value for each of them.


The last step to allow the match with the population database is to build the **same age groups** we have in that set.  
We do this in the next two functions:

In [17]:
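def group(age):
    # lower bound of the 5-year bin containing the age (e.g. 7 -> 5)
    i = int(5*(age//5))
    # label in the same format as the population dataset (e.g. "5-9")
    return "{}-{}".format(i, i+4)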

In [18]:
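def compress_ages(data_frame):
    # collect the caloric inputs of all single ages falling in each 5-year group
    accum = defaultdict(list)
    for i in data_frame.index:
        row = data_frame.loc[i]
        g_id = group(row['age'])
        if g_id == "100-104": # the population dataset labels the last group "100+"
            g_id = "100+"
        accum[g_id].append(row['input (KCal)'])
    # replace each list with its mean: the average need of the age group
    for i in accum:
        accum[i] = sum(accum[i]) / len(accum[i])
    return pd.DataFrame.from_dict(accum, orient='index')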

Finally, we apply the functions to the dataframes:

In [19]:
new_male_need = compress_ages(male_explode)
new_female_need = compress_ages(female_explode)
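
For comparison, the same binning and averaging can be sketched with `groupby`. This is an alternative formulation under stated assumptions, not the code used above: the helper `age_group_label` is hypothetical, the toy values are made up, and the helper maps every bin starting at 100 or above to `"100+"`, which matches `compress_ages` only because ages stop at 101.

```python
import pandas as pd

def age_group_label(age):
    """Label of the 5-year bin containing `age` (hypothetical helper)."""
    start = 5 * (age // 5)
    return "100+" if start >= 100 else "{}-{}".format(start, start + 4)

# Toy data: values are illustrative only
toy = pd.DataFrame({
    "age": [2, 3, 4, 5, 9, 100, 101],
    "input (KCal)": [1000.0, 1200.0, 1400.0, 1500.0, 1700.0, 2000.0, 2200.0],
})

# Group rows by their bin label and average the caloric input within each bin
grouped = (
    toy.groupby(toy["age"].map(age_group_label), sort=False)["input (KCal)"]
       .mean()
)
print(grouped)
```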

We also use the age group as new index and rename the columns:

In [20]:
new_male_need.index.name = 'age_group'
new_male_need = new_male_need.rename(columns={0: 'input (KCal)'})
new_female_need.index.name = 'age_group'
new_female_need = new_female_need.rename(columns={0: 'input (KCal)'})

Let's look at the result, collected in our matchable dataframes `new_male_need` and `new_female_need`:

In [233]:
new_male_need.head()

| age_group | input (KCal) |
|-----------|--------------|
| 0-4       | 1222.222222  |
| 5-9       | 1613.333333  |
| 10-14     | 2133.333333  |
| 15-19     | 2760.0       |
| 20-24     | 2746.666667  |


## Computing total calories by matching 

**We now match these needs with the *World Population Database* (United Nations) to compute the total calories needed in each country from 1950 to 2020.**  
Matching with an age-resolved dataset like this one is essential to account for the demographic structure of each country.

First, we load the list of African countries (to filter the database) and the two population datasets (one for males, the other for females).  
*Note*: the values in the population dataframes are reported in **thousands**.

In [219]:
with open ("data/african_countries.txt") as af_c:
    af_countries = [line.rstrip() for line in af_c] #loading list

We load and clean the datasets to prepare them for the matching step (they contain several columns we do not need, and some column names are too long).

In [220]:
#loading datasets
pop_male = pd.read_excel("data/POPULATION_BY_AGE_MALE.xlsx", sheet_name="ESTIMATES")
pop_female = pd.read_excel("data/POPULATION_BY_AGE_FEMALE.xlsx", sheet_name="ESTIMATES")

In [221]:
#cleaning male population dataset
pop_male.drop(columns=["Index", "Variant", "Notes", "Country code", "Type", "Parent code"], inplace=True)
pop_male.rename(columns={"Reference date (as of 1 July)": "year", "Region, subregion, country or area *": "country"}, inplace=True)
#taking only african countries
pop_male = pop_male[pop_male['country'].isin(af_countries)]

#cleaning female dataset
pop_female.drop(columns=["Index", "Variant", "Notes", "Country code", "Type", "Parent code"], inplace=True)
pop_female.rename(columns={"Reference date (as of 1 July)": "year", "Region, subregion, country or area *": "country"}, inplace=True)
#only african
pop_female = pop_female[pop_female['country'].isin(af_countries)]

Now we multiply each population column (in thousands, as noted above) by the caloric need of the matching `age_group` in the calories table. Calling `squeeze()` turns the single-column calories dataframe into a Series indexed by age group, so that `multiply` can align it with the population columns by label.  
We obtain two datasets, `total_cal_male` and `total_cal_female`, reporting the total calories needed for **each country in each year per age group per gender**.

In [222]:
#total calories male
pop_mal_mult = pop_male.drop(columns=["country", "year"])
male_mult_res = pop_mal_mult.multiply(new_male_need.squeeze()) # squeeze adapts the dimension of the dataframe
#rejoin with old dataframe and delete old column (just population)
total_cal_male = pop_male.join(male_mult_res, lsuffix="_old")
total_cal_male = total_cal_male[total_cal_male.columns[~total_cal_male.columns.str.endswith('_old')]]

In [223]:
#total calories female
pop_fem_mult = pop_female.drop(columns=["country", "year"])
female_mult_res = pop_fem_mult.multiply(new_female_need.squeeze())
total_cal_female = pop_female.join(female_mult_res, lsuffix="_old")
total_cal_female = total_cal_female[total_cal_female.columns[~total_cal_female.columns.str.endswith('_old')]]
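
The label alignment that this multiplication relies on can be checked on a toy example. The numbers below are made up, and the Series is deliberately given in a different order than the columns to show that `DataFrame.multiply` matches on labels rather than on position. An age-group label present on only one side would yield a column of `NaN`.

```python
import pandas as pd

# Toy population table (thousands of people): made-up numbers
pop = pd.DataFrame({"0-4": [10.0, 20.0], "5-9": [30.0, 40.0]})

# Toy per-person need, indexed by age group like new_male_need,
# deliberately listed in a different order than the columns of `pop`
need = pd.DataFrame(
    {"input (KCal)": [1000.0, 1500.0]},
    index=pd.Index(["5-9", "0-4"], name="age_group"),
)

need_series = need.squeeze()        # single-column DataFrame -> Series
result = pop.multiply(need_series)  # Series index is matched to column labels
print(result)
```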

Once we have the calories needed for both genders, we can simply add them together to obtain the total calories needed for **each country in each year per age group**, which we collect in the dataframe `total_cal_ages`.

In [224]:
#copy the male dataframe into the total (to maintain the country and year columns) and sum with the female one
total_cal_ages = total_cal_male.copy()
sum_ind = total_cal_ages.columns[2:]
total_cal_ages[sum_ind] = total_cal_ages[sum_ind] + total_cal_female[sum_ind]
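
Because `+` between two dataframes aligns rows by index, this sum is only meaningful if the male and female tables describe the same country and year in each row. A minimal sanity check, sketched on made-up data (the frames `male` and `female` below are hypothetical stand-ins for `total_cal_male` and `total_cal_female`), could look like this:

```python
import pandas as pd

# Toy stand-ins for total_cal_male and total_cal_female: made-up numbers
male = pd.DataFrame(
    {"country": ["A", "B"], "year": [1950, 1950], "0-4": [1.0, 2.0]},
    index=[390, 391],
)
female = pd.DataFrame(
    {"country": ["A", "B"], "year": [1950, 1950], "0-4": [3.0, 4.0]},
    index=[390, 391],
)

# Check that both tables have the same rows before adding them
key_cols = ["country", "year"]
assert male.index.equals(female.index)
assert male[key_cols].equals(female[key_cols])

age_cols = male.columns[2:]
total = male.copy()
total[age_cols] = male[age_cols] + female[age_cols]
print(total)
```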

To find the total calories needed for **each country in each year**, we then sum over all the age groups and collect the result in `total_cal`.

In [225]:
total_cal = total_cal_ages.copy()
sum_ind = total_cal.columns[2:]

#computing sum of cal over ages, removing ages
total_cal['Calories'] = total_cal[sum_ind].sum(axis=1)
total_cal.drop(columns=sum_ind, inplace=True)

The values computed so far **need to be scaled appropriately**: since the population dataset reports thousands of people, the totals are in thousands of kcal and would have to be multiplied by 1000 to be expressed in kcal.  
Rather than dealing with such large numbers (up to the order of $10^9$), we divide by 1000 instead.  
From now on, all calorie values are therefore reported in millions of kcal, that is, in **Gcal** ($10^6$ kcal $= 10^9$ cal).

In [227]:
change_col = total_cal_ages.columns[2:] # the age-group columns are the same in the first 3 dataframes
# each dataframe is rescaled from its own values
total_cal_male[change_col] = total_cal_male[change_col]/1000
total_cal_female[change_col] = total_cal_female[change_col]/1000
total_cal_ages[change_col] = total_cal_ages[change_col]/1000
total_cal['Calories'] = total_cal['Calories']/1000
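
The unit bookkeeping can be verified with a short worked example. The population and per-person need below are hypothetical round numbers chosen only to make the arithmetic easy to follow.

```python
# Hypothetical inputs, chosen only for illustration
population_thousands = 2300.0   # 2.3 million people, stored in thousands
kcal_per_person = 2000.0        # average need per person, in kcal

product = population_thousands * kcal_per_person  # thousands of kcal
scaled = product / 1000                           # millions of kcal

# 1 million kcal = 10**6 * 10**3 cal = 10**9 cal = 1 Gcal,
# so `scaled` is the total need expressed in Gcal
total_kcal = scaled * 10**6
assert total_kcal == 2_300_000 * kcal_per_person
print(scaled)
```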

Here are the first rows of the final dataframe:

In [218]:
total_cal.head()

|     | country | year | Calories    |
|-----|---------|------|-------------|
| 390 | Burundi | 1950 | 4603.361707 |
| 391 | Burundi | 1955 | 5005.139209 |
| 392 | Burundi | 1960 | 5489.419702 |
| 393 | Burundi | 1965 | 6057.955751 |
| 394 | Burundi | 1970 | 6815.674227 |


#### Summary so far
* We defined a reasonable amount of calories needed for each gender and each age group by averaging over the activity levels
* We collected these values in `new_male_need` and `new_female_need`
* We loaded the population of the African countries from the United Nations dataset
* We matched the population with the kcal needed by each age group. From this matching we built 4 different datasets, with different granularity levels:  
`total_cal_male`, `total_cal_female`, `total_cal_ages`, `total_cal`
* We rescaled our final dataframes (values in Gcal) to avoid working with very large numbers