# BigMac Index

This notebook will clean + combine the following 2 datasets, and save a cleaned version that is useable for analysis

Datasets: 
- BigMacPrice: https://www.kaggle.com/datasets/vittoriogiatti/bigmacprice?resource=download
- CPI: https://data.oecd.org/price/inflation-cpi.htm

The analyses can be found in the notebook "BigMacPrices - Analyses"

# Load packages + Datasets

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import pycountry
import requests
import json
from datetime import datetime, timedelta
from urllib3.exceptions import InsecureRequestWarning

# Suppress only the single warning from urllib3 needed.
requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)

%matplotlib inline

df_bigmac = pd.read_csv('BigMacPrice.csv')
df_bigmac.head()

Unnamed: 0,date,currency_code,name,local_price,dollar_ex,dollar_price
0,2000-04-01,ARS,Argentina,2.5,1,2.5
1,2000-04-01,AUD,Australia,2.59,1,2.59
2,2000-04-01,BRL,Brazil,2.95,1,2.95
3,2000-04-01,GBP,Britain,1.9,1,1.9
4,2000-04-01,CAD,Canada,2.85,1,2.85


In [2]:
df_cpi = pd.read_csv('CPI_07122022080621228.csv')
df_cpi.head()

Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,AUS,CPI,ENRG,AGRWTH,A,1972,4.91007,
1,AUS,CPI,ENRG,AGRWTH,A,1973,3.762801,
2,AUS,CPI,ENRG,AGRWTH,A,1974,13.17354,
3,AUS,CPI,ENRG,AGRWTH,A,1975,19.42247,
4,AUS,CPI,ENRG,AGRWTH,A,1976,8.833195,


# Clean / Map Data

Few key take-aways:
-  The names in the BigMac Prices dataset and the CPI dataset did not always match, therefore a function has been created that uses the pycountry package (https://pypi.org/project/pycountry/)
- The initial Big Mac price dataset already contained data regarding the exchange rate (local currency to dollar). However, after some investigation it has been found that these columns contained some invalid datapoints, for example, there were a few currencies that had a stable exchange rate of 1 for the entire period. To adjust for this the exchangerate.host api (https://api.exchangerate.host) has been used to redefine the exchange rate.
- Exchange rate api gave errors for some combinations of time periods + currencies, if a KeyError has been detected, there will be one more try 10 days later, if there's still an error None will be returned
- For allmost all periods the Big Mac prices were on the first of january or july, however for 2019 and 2020 this didnt hold, as there were dates like 2019-07-09 and 2020-01-14, these values have been deemed to be errors and have been reset to 2019-07-01 and 2020-01-01 respectively
- CPI can be calculated in a few different ways, here the IDX2015 measure has been selected, with monthly datapoints


In [20]:
# Cleaning functions

def map_country(country_str):
    '''
    Function that maps a string with a country name, to the standardized formated in the pycountry package
    
    Args:
        country_str: string representing country name
        
    Output:
        String in standardized format of the country_str    
    '''
    try:
        return pycountry.countries.lookup(country_str).name
    except LookupError:
        return None
    except Exception as e:
        raise ValueError


def convert_exchange_rate(base, out_curr, date):
    """
    Function that gathers the exchange rate between to currencies for a given date
    Using the exchangeratehost api (https://api.exchangerate.host) 
    
    Args:
        dase: Base currency to compare for
        out_curr: The output currency to compare the base with
        date: The date to compare the currencies for
    
    Output: 
        float with the exchange rate-value
    """
    try:
        # Some days give errors, if so, try 10 days later
        date_plus10 = datetime.strptime(date, '%Y-%m-%d')
        date_plus10 = datetime.strftime(date_plus10 + timedelta(days=10), '%Y-%m-%d')
        # api url for request
        url = 'https://api.exchangerate.host/timeseries?base={0}&start_date={1}&end_date={2}&symbols={3}'.format(base,
                                                                                                                 date,
                                                                                                                 date_plus10,
                                                                                                                 out_curr)
        response = requests.get(url, verify=False)
        # retrive response in json format
        data = response.json()

        return data['rates'][date][out_curr]
    except KeyError:
        try:
            return data['rates'][date_plus10][out_curr]
        except Exception:
            return None
            
def clean_cpi(df_cpi, measure):
    # Filter CPI
    f1 = df_cpi.TIME.str.contains("-")
    f2 = df_cpi.TIME.str.contains('Q')
    f3 = df_cpi.SUBJECT == "TOT"
    f4 = df_cpi.MEASURE == measure
    mask = f1 & ~f2 &  f3 & f4

    df_cpi = df_cpi[mask]

    df_cpi['TIME'] = pd.to_datetime(df_cpi['TIME'])

    df_cpi['country'] = df_cpi['LOCATION'].apply(map_country)
    df_cpi = df_cpi[~df_cpi['country'].isnull()]
    
    df_cpi = df_cpi.rename(columns={'Value': f"CPI_{measure}"})
    return df_cpi


def clean_df_bigmac(df_bigmac):
    
    # Use same mapping as for df_ppi
    df_bigmac['country'] = df_bigmac.name.apply(map_country)
    
    # Few countries unable to map in pycountry - so do manually
    remap = {'Britain': 'United Kingdom',
             'Russia': 'Russian Federation',
             'UAE': 'United Arab Emirates'}
    df_bigmac['country'] = df_bigmac['name'].replace(remap)
    
    # Seems to be an error in consistency of the date column for the given dates
    df_bigmac.loc[df_bigmac.date == "2019-07-09", "date"] = "2019-07-01"
    df_bigmac.loc[df_bigmac.date == "2020-01-14", "date"] = "2020-01-01"
    
    # Map Exchange Rates
    df_bigmac['dollar_ex_adjusted'] = df_bigmac.apply(lambda x: convert_exchange_rate(base="USD",
                                                                                      out_curr=x.currency_code,
                                                                                      date=x.date),
                                                         axis=1)
    
    # Set col date col to datetime for joining purposes
    df_bigmac['date'] = pd.to_datetime(df_bigmac['date'])
    
    return df_bigmac

In [21]:
# Clean Data Frames
df_cpi_cleaned = clean_cpi(df_cpi, "IDX2015")
df_bigmac_cleaned = clean_df_bigmac(df_bigmac=df_bigmac)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


 # Join datasets + create price comparisons between USD and local currency

The datasets are joined on the date + country column.
Furthermore, some columns related to price comparisons are created to compare the expected price bases on the exchange rate to the actual price (after conversion).

In [26]:
# Put together in 1 DataFrame, and put all rows in perspective to the US ($)
df = pd.merge(df_bigmac, df_cpi_cleaned, left_on=['date', 'country'], right_on=['TIME', 'country'], how='left')
df = df.set_index('date')

# Create a seperate dataframe with only USA prices, to join
df_usa = df[df['country'] == "United States"]
df = df.join(df_usa[['local_price']], how='left', rsuffix="_usa")

# Set expected prices and difference based on exchange rates in local currency
df['expected_price'] = round(df['local_price_usa'] * df['dollar_ex_adjusted'], 2)
df['diff_local'] = round(df['local_price'] - df['expected_price'], 2)

# Set expected prices and difference based on exchange rates in local currency
df['dollar_price_adjusted'] = round(df['local_price'] / df['dollar_ex_adjusted'], 2)
df['diff_dollar'] = round(df['dollar_price_adjusted'] - df['local_price_usa'], 2)

# Also determing percentual difference
df['perc_diff'] = round((df['diff_dollar'] / df['local_price_usa'])*100, 2)

df = df.reset_index()

df.head()

Unnamed: 0,date,currency_code,name,local_price,dollar_ex,dollar_price,country,dollar_ex_adjusted,LOCATION,INDICATOR,...,FREQUENCY,TIME,CPI_IDX2015,Flag Codes,local_price_usa,expected_price,diff_local,dollar_price_adjusted,diff_dollar,perc_diff
0,2000-04-01,ARS,Argentina,2.5,1,2.5,Argentina,,,,...,,NaT,,,2.24,,,,,
1,2000-04-01,AUD,Australia,2.59,1,2.59,Australia,1.655082,,,...,,NaT,,,2.24,3.71,-1.12,1.56,-0.68,-30.36
2,2000-04-01,BRL,Brazil,2.95,1,2.95,Brazil,,BRA,CPI,...,M,2000-04-01,37.37136,,2.24,,,,,
3,2000-04-01,GBP,Britain,1.9,1,1.9,United Kingdom,0.626505,GBR,CPI,...,M,2000-04-01,73.3,,2.24,1.4,0.5,3.03,0.79,35.27
4,2000-04-01,CAD,Canada,2.85,1,2.85,Canada,1.452842,CAN,CPI,...,M,2000-04-01,74.66421,,2.24,3.25,-0.4,1.96,-0.28,-12.5


# Extra cleaning

The remaining dataset still had a few invalid datapoint, so they have been manually cleaned.

Extra cleaning consisting of:
- Lithuania switched to the euro in 2015, but already had Big Mac prices in Euro before 2015, all observations of Lithuania before 2015 have been dropped (2 rows)
- For several currencies the API was not able to get an exchange rate, they have been dropped for convenience purposes (95 rows)


Rows remaining: 1851


In [27]:
# Extra cleaning
# Lithuania switched to the euro since 2015, but has prices in euro before that date
f1 = df.name == "Lithuania"
f2 = df.date < datetime(2015, 1, 1)

df = df.drop(df[f1&f2].index)

# Several countries without an exchange rate to the USD for the given period - drop for convenience purpose
df = df[~df.dollar_ex_adjusted.isnull()]

# Select relevant columns and pickle dataset for analyses

In [28]:
# Select only columns of interest
cols = ['date', 'currency_code', 'name', 'local_price', 'dollar_ex_adjusted',
        'CPI_IDX2015', 'local_price_usa', 'expected_price', 'diff_local',
        'dollar_price_adjusted', 'diff_dollar', 'perc_diff']
df = df[cols]

# Pickle Dataset, because the creation takes too long
df.to_pickle('BigMacPrices_cleaned')

In [29]:
df.head()

Unnamed: 0,date,currency_code,name,local_price,dollar_ex_adjusted,CPI_IDX2015,local_price_usa,expected_price,diff_local,dollar_price_adjusted,diff_dollar,perc_diff
1,2000-04-01,AUD,Australia,2.59,1.655082,,2.24,3.71,-1.12,1.56,-0.68,-30.36
3,2000-04-01,GBP,Britain,1.9,0.626505,73.3,2.24,1.4,0.5,3.03,0.79,35.27
4,2000-04-01,CAD,Canada,2.85,1.452842,74.66421,2.24,3.25,-0.4,1.96,-0.28,-12.5
7,2000-04-01,CZK,Czech Republic,54.37,37.917932,,2.24,84.94,-30.57,1.43,-0.81,-36.16
8,2000-04-01,DKK,Denmark,24.75,7.79441,76.01382,2.24,17.46,7.29,3.18,0.94,41.96
