In [1]:
import pandas as pd
from pandas import Series, DataFrame



In [2]:
tourism_filename = '../data/oecd_tourism.csv'
tourism_df = pd.read_csv(tourism_filename,
                         usecols=['LOCATION', 'SUBJECT', 'TIME', 'Value'])

locations_filename = '../data/oecd_locations.csv'
locations_df = pd.read_csv(locations_filename,
                           header=None,
                           names=['LOCATION', 'NAME'],
                           index_col='LOCATION')

# Beyond 1

What happens if we perform the join in the other direction?  That is, if we invoke `join` on `tourism_df`, passing it an argument of `locations_df`?  Do we get the same result?

In [3]:
# No, the result is different. We're again performing a left join, meaning that the left side
# (i.e., the data frame on which we're running the join) determines which rows will be included.
# Every row of tourism_df is kept; if there is no match on the right (as with ZAF in the output),
# then we get a null value in NAME.

tourism_df.set_index('LOCATION').join(locations_df)

LOCATION,SUBJECT,TIME,Value,NAME
AUS,INT_REC,2008,31159.800,Australia
AUS,INT_REC,2009,29980.700,Australia
AUS,INT_REC,2010,35165.500,Australia
AUS,INT_REC,2011,38710.100,Australia
AUS,INT_REC,2012,38003.700,Australia
...,...,...,...,...
ZAF,INT-EXP,2015,5734.731,
ZAF,INT-EXP,2016,5354.391,
ZAF,INT-EXP,2017,6067.963,
ZAF,INT-EXP,2018,6347.762,
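The effect of join direction is easier to see on a tiny example. The sketch below uses two hand-made data frames (`names` and `values` are hypothetical stand-ins for `locations_df` and `tourism_df`, with made-up numbers) and joins them in both directions.

```python
import pandas as pd

# Hypothetical stand-in for locations_df: indexed by LOCATION, no entry for ZAF
names = pd.DataFrame({'NAME': ['Australia', 'Israel']},
                     index=pd.Index(['AUS', 'ISR'], name='LOCATION'))

# Hypothetical stand-in for tourism_df: includes a ZAF row (values are made up)
values = pd.DataFrame({'LOCATION': ['AUS', 'ISR', 'ZAF'],
                       'Value': [1.0, 2.0, 3.0]}).set_index('LOCATION')

# names is on the left: only locations present in names appear in the result
names_left = names.join(values)

# values is on the left: ZAF is kept, and its NAME is null
values_left = values.join(names)

print(names_left)
print(values_left)
```

In both cases `join` defaults to a left join, so the caller decides which rows survive.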


# Beyond 2

Get the mean tourism income per year, rather than by country. Do we see any evidence of less tourism income during the Great Recession, which started in 2008?

In [4]:
# Yes, we definitely see that 2008, 2009, and 2010 are at the bottom of the list.

fullname_df = locations_df.join(tourism_df.set_index('LOCATION'))

(
    fullname_df
    .loc[fullname_df['SUBJECT'] == 'INT_REC']
    .groupby('TIME')['Value']
    .mean()
    .sort_values(ascending=False)
)


TIME
2019    62786.617333
2018    43185.853875
2017    40326.702250
2014    40043.334563
2016    39483.592062
2015    38912.695437
2013    37996.198750
2012    35628.632063
2011    34299.966375
2008    31757.065750
2010    30949.524125
2009    28505.886562
Name: Value, dtype: float64
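The same filter, group, and sort chain can be checked by hand on a small frame. This is a minimal sketch with made-up numbers (`toy` is a hypothetical data frame, not the OECD data), chosen so that the means are easy to verify.

```python
import pandas as pd

# Hypothetical data: two years of INT_REC values plus one INT-EXP row to be filtered out
toy = pd.DataFrame({
    'SUBJECT': ['INT_REC', 'INT_REC', 'INT-EXP', 'INT_REC', 'INT_REC'],
    'TIME':    [2008, 2008, 2008, 2009, 2009],
    'Value':   [10.0, 30.0, 99.0, 5.0, 15.0],
})

yearly_mean = (
    toy
    .loc[toy['SUBJECT'] == 'INT_REC']   # keep only the income rows
    .groupby('TIME')['Value']           # one group per year
    .mean()                             # mean across countries in that year
    .sort_values(ascending=False)       # highest mean first
)

print(yearly_mean)
```

Here 2008 averages to 20.0 and 2009 to 10.0, so 2008 is listed first; the INT-EXP row does not affect either mean.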

# Beyond 3

Reset the index on `locations_df`, such that it has a (default) numeric index, and two columns (`LOCATION` and `NAME`). Now run `join` on `locations_df`, specifying that you want to use the `LOCATION` column on the caller, rather than the index. (The argument data frame will always be joined on its index.)

In [5]:
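# Note: this cell does not reset the index on locations_df, so LOCATION is still its index.
# The `on` argument accepts an index level name as well as a column name, so the join still works.
tourism_df = tourism_df.set_index('LOCATION')
locations_df.join(tourism_df, on='LOCATION')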

LOCATION,NAME,SUBJECT,TIME,Value
AUS,Australia,INT_REC,2008,31159.8
AUS,Australia,INT_REC,2009,29980.7
AUS,Australia,INT_REC,2010,35165.5
AUS,Australia,INT_REC,2011,38710.1
AUS,Australia,INT_REC,2012,38003.7
...,...,...,...,...
ISR,Israel,INT-EXP,2015,7507.0
ISR,Israel,INT-EXP,2016,8210.3
ISR,Israel,INT-EXP,2017,8986.0
ISR,Israel,INT-EXP,2018,9974.7
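For comparison, here is a sketch of the variant the exercise describes, in which the caller's index is reset first so that `LOCATION` is an ordinary column. It uses small hypothetical frames (`names` and `values`, with made-up numbers) rather than the real files.

```python
import pandas as pd

# Hypothetical stand-in for locations_df, indexed by LOCATION
names = pd.DataFrame({'NAME': ['Australia', 'Israel']},
                     index=pd.Index(['AUS', 'ISR'], name='LOCATION'))

# Hypothetical stand-in for tourism_df, already indexed by LOCATION
values = pd.DataFrame({'LOCATION': ['AUS', 'AUS', 'ISR'],
                       'TIME': [2008, 2009, 2008],
                       'Value': [1.0, 2.0, 3.0]}).set_index('LOCATION')

# Move LOCATION out of the index into a regular column
names_flat = names.reset_index()

# Match the caller's LOCATION column against the index of the argument
joined = names_flat.join(values, on='LOCATION')

print(joined)
```

With this approach `LOCATION` appears in the result as a regular column alongside `NAME`, `TIME`, and `Value`.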
