In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Beyond 1

Read in the three data frames, but without setting an index. Ensure that the column names in `oecd_tourism_df` are `abbrev`, `TIME`, and `Value`, and that the `dtype` of the `Value` column is `np.int64`.

In [2]:
oecd_df = pd.read_csv('../data/oecd_locations.csv', header=None,
                      names=['abbrev', 'country'])

oecd_tourism_df = pd.read_csv('../data/oecd_tourism.csv',
                              usecols=[0, 5, 6],
                              header=0,
                              names=['abbrev', 'TIME', 'Value'])

wine_df = pd.read_csv('../data/winemag-150k-reviews.csv', 
                      usecols=['country', 'points'])
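
The exercise asks for `Value` to have a `dtype` of `np.int64`, which the cell above leaves to pandas' type inference. One possible way to make that explicit is the `dtype` parameter of `read_csv`. The sketch below uses a small in-memory CSV with made-up column headers and values (not the real OECD file), and assumes the column holds only whole numbers; if the real data contains fractional or missing values, the cast would raise an error instead.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for oecd_tourism.csv: seven columns, of which
# columns 0, 5 and 6 are the ones we keep. Headers and values are made up.
sample_csv = io.StringIO(
    "c0,c1,c2,c3,c4,c5,c6\n"
    "AAA,x,x,x,x,2008,100\n"
    "AAA,x,x,x,x,2009,200\n"
)

sample_df = pd.read_csv(sample_csv,
                        usecols=[0, 5, 6],
                        header=0,
                        names=['abbrev', 'TIME', 'Value'],
                        dtype={'Value': np.int64})  # request int64 explicitly

# Confirm the column names and the dtype of Value
print(sample_df.dtypes)
```

Checking `sample_df.dtypes` (or `oecd_tourism_df.dtypes` on the real data) is a quick way to confirm what was actually loaded.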

# Beyond 2

Perform the same joins as before, but using `merge`, rather than `join`.

In [7]:
tourism_spending = (oecd_df
                    .merge(oecd_tourism_df, on='abbrev')
                    .groupby('country')['Value'].mean()
                   )
tourism_spending

country
Australia          37634.433333
Austria            16673.886364
Belgium            16525.237545
Brazil             13942.913958
Canada             32593.612500
Denmark            10362.563636
Finland             5288.658591
France             58228.804000
Germany            75011.823091
Hungary             5108.871591
Israel              6634.454042
Italy              39539.560000
Japan              28606.891667
Korea              21677.131818
United Kingdom     63507.159091
United States     171847.083333
Name: Value, dtype: float64

In [9]:
country_points = (
    wine_df
    .groupby('country')['points'].mean()
)

country_points

country
Albania                   88.000000
Argentina                 85.996093
Australia                 87.892475
Austria                   89.276742
Bosnia and Herzegovina    84.750000
Brazil                    83.240000
Bulgaria                  85.467532
Canada                    88.239796
Chile                     86.296768
China                     82.000000
Croatia                   86.280899
Cyprus                    85.870968
Czech Republic            85.833333
Egypt                     83.666667
England                   92.888889
France                    88.925870
Georgia                   85.511628
Germany                   88.626427
Greece                    86.117647
Hungary                   87.329004
India                     87.625000
Israel                    87.176190
Italy                     88.413664
Japan                     85.000000
Lebanon                   85.702703
Lithuania                 84.250000
Luxembourg                87.000000
Macedonia           

In [10]:
(
    country_points.to_frame()
    .merge(tourism_spending, on='country')
)

              points         Value
country
Australia  87.892475  37634.433333
Austria    89.276742  16673.886364
Brazil     83.240000  13942.913958
Canada     88.239796  32593.612500
France     88.925870  58228.804000
Germany    88.626427  75011.823091
Hungary    87.329004   5108.871591
Israel     87.176190   6634.454042
Italy      88.413664  39539.560000
Japan      85.000000  28606.891667

In [11]:
(
    country_points.to_frame()
    .merge(tourism_spending, on='country', how='outer')
)

                           points         Value
country
Albania                 88.000000           NaN
Argentina               85.996093           NaN
Australia               87.892475  37634.433333
Austria                 89.276742  16673.886364
Bosnia and Herzegovina  84.750000           NaN
Brazil                  83.240000  13942.913958
Bulgaria                85.467532           NaN
Canada                  88.239796  32593.612500
Chile                   86.296768           NaN
China                   82.000000           NaN

# Beyond 3

How is the default `merge` different from the default `join`, when it comes to `NaN` values?

By default, `join` performs a "left join," so every row from the left frame is kept, and the column(s) from the right side contain `NaN` wherever the right frame has no matching index value. `merge`, however, performs an "inner join" by default, meaning that it keeps only the keys that appear in both the left and right frames. As a result, the join itself won't introduce `NaN` values (although they can still appear if the input frames already contain them).
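
To see the difference side by side, here is a minimal sketch using two tiny, made-up frames (the country names and numbers are illustrative only, not taken from the exercise data). Both frames are indexed by `country`, and only two of the countries appear in both.

```python
import pandas as pd

# Hypothetical toy data: 'Albania' appears only on the left,
# 'Belgium' only on the right.
left = pd.DataFrame(
    {'points': [88.0, 87.9, 89.3]},
    index=pd.Index(['Albania', 'Australia', 'Austria'], name='country'))
right = pd.DataFrame(
    {'Value': [37634.4, 16673.9, 16525.2]},
    index=pd.Index(['Australia', 'Austria', 'Belgium'], name='country'))

# join defaults to how='left': all three left rows survive,
# and Albania gets NaN in the Value column.
joined = left.join(right)

# merge defaults to how='inner': only Australia and Austria survive,
# so the join adds no NaN values.
merged = left.merge(right, on='country')

# Passing how='left' to merge keeps the unmatched left row, as join does.
merged_left = left.merge(right, on='country', how='left')

print(joined)
print(merged)
print(merged_left)
```

Passing `how='left'` to `merge`, or `how='inner'` to `join`, is how you would get one method to mimic the other's default.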