# Your Info
__Name:__

__PDX Email:__

__Collaborators:__

# Workout 7: Merging Datasets

This workout provides exercises to develop your skills in merging and combining data.

# About the Datasets

The datasets you'll be working with today are based on The OECD (Organization for Economic Co-Operation and Development).

* file 1 - `data/oecd_locations.csv` - contains a subset of OECD countries
* file 2 - `data/oecd_tourism.csv` - contains...
* file 3 - `data/winemag-150k-reviews.csv` - contains...


# 0. Import pandas

Import the pandas library using the alias `pd`.

In [1]:
## Begin Solution
import pandas as pd
## End Solution

# 1. Load OECD Location Data

Load the data from `data/oecd_locations.csv` into a pandas DataFrame named `oecd_df`. Note that the CSV file does not contain a header. 

Ensure the resulting DataFrame meets the following specifications:

* Column name: `country`
* Index: Set to the country's abbreviation.

In [2]:
## Begin Solution
file = "data/oecd_locations.csv"

oecd_df = pd.read_csv(file,
                      header=None,
                      names=["ABBREVIATION","COUNTRY"],
                      index_col="ABBREVIATION")

oecd_df.head()
## End Solution

Unnamed: 0_level_0,COUNTRY
ABBREVIATION,Unnamed: 1_level_1
AUS,Australia
AUT,Austria
BEL,Belgium
CAN,Canada
DNK,Denmark


# 2. Load OECD Tourism Data

Perform the following steps to create the `oecd_tourism_df` DataFrame:

1. Load data from `data/oecd_tourism.csv`, selecting only the `LOCATION`, `TIME`, `Value`, and `SUBJECT` columns.
2. Set `LOCATION` as the index.
3. Filter the DataFrame to keep only the rows where the `SUBJECT` column's value is `INT-EXP`.
4. Drop the `SUBJECT` column from the filtered DataFrame.

In [13]:
## Begin Solution
file = "data/oecd_tourism.csv"

oecd_tourism_df = (
    pd
    .read_csv(file,
              usecols=['LOCATION', 'TIME', 'Value', 'SUBJECT'],
              index_col='LOCATION')
    .loc[lambda df_: df_['SUBJECT'] == 'INT-EXP']
    .drop('SUBJECT', axis='columns')
)
oecd_tourism_df.head()                                
## End Solution

Unnamed: 0_level_0,TIME,Value
LOCATION,Unnamed: 1_level_1,Unnamed: 2_level_1
AUS,2008,27620.0
AUS,2009,25629.6
AUS,2010,31916.5
AUS,2011,39381.5
AUS,2012,41632.8


# 3. Create Tourism Series

Create a pandas Series named `tourism_spending` with country names as the index and average tourism spending as values.  

1. Perform an inner join of the `oecd_df` and `oecd_tourism_df` DataFrames, joining on the index.
2. Group the joined DataFrame by the `COUNTRY` column.
3. Compute the mean of the `Value` column for each group to get the average tourism spending per country.

In [14]:
## Begin Solution
tourism_spending = oecd_df.merge(oecd_tourism_df,
                                 how="inner",
                                left_index=True,
                                right_index=True)

tourism_spending.groupby("COUNTRY")["Value"].mean()
## End Solution

COUNTRY
Australia          36727.966667
Austria            11934.563636
Belgium            20859.883455
Brazil             21564.351833
Canada             40984.633333
Denmark            11326.169636
Finland             5877.080909
France             51394.272273
Germany            96615.075545
Hungary             2918.390182
Israel              6726.524833
Italy              34148.908455
Japan              32197.925000
Korea              25573.509091
United Kingdom     75262.227273
United States     142080.666667
Name: Value, dtype: float64

# 4. Load Wine Review Data

Create a third data frame, `wine_df`:

1. Load the data from the CSV file located at `data/winemag-150k-reviews.csv`.
2. Extract the country and points columns.

In [15]:
## Begin Solution
file = "data/winemag-150k-reviews.csv"

wine_df = pd.read_csv(file,
                     usecols=["country", 
                              "points"]
                     )

wine_df.head()
## End Solution

Unnamed: 0,country,points
0,US,96
1,Spain,96
2,US,96
3,US,96
4,France,95


# 5. Calculate Average Wine Scores by Country

Get the average wine score for each country, across all wine reviews, sorted in descending order.

1. Group `wine_df` by country.
2. Calculate the mean of points.
3. Sort in descending order.

In [11]:
## Begin Solution
country_points = wine_df.groupby("country")["points"].mean()
country_points.sort_values(ascending=False)
## End Solution

country
England                   92.888889
Austria                   89.276742
France                    88.925870
Germany                   88.626427
Italy                     88.413664
Canada                    88.239796
Slovenia                  88.234043
Morocco                   88.166667
Turkey                    88.096154
Portugal                  88.057685
Albania                   88.000000
US-France                 88.000000
Australia                 87.892475
US                        87.818789
Serbia                    87.714286
India                     87.625000
New Zealand               87.554217
Hungary                   87.329004
Switzerland               87.250000
South Africa              87.225421
Israel                    87.176190
Luxembourg                87.000000
Spain                     86.646589
Chile                     86.296768
Croatia                   86.280899
Greece                    86.117647
Tunisia                   86.000000
Argentina           

# 6. Merge Data (Inner)



In [19]:
## Begin Solution
country_points.to_frame().join(tourism_spending).sort_values("Value")
## End Solution

Unnamed: 0_level_0,points,COUNTRY,TIME,Value
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Albania,88.0,,,
Argentina,85.996093,,,
Australia,87.892475,,,
Austria,89.276742,,,
Bosnia and Herzegovina,84.75,,,
Brazil,83.24,,,
Bulgaria,85.467532,,,
Canada,88.239796,,,
Chile,86.296768,,,
China,82.0,,,


# 7. Merge Data (Outer)



In [27]:
## Begin Solution
country_points.to_frame().join(tourism_spending, how="outer")
## End Solution 

Unnamed: 0_level_0,points,COUNTRY,SUBJECT,TIME,Value
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AUS,,Australia,INT_REC,2008.0,31159.8
AUS,,Australia,INT_REC,2009.0,29980.7
AUS,,Australia,INT_REC,2010.0,35165.5
AUS,,Australia,INT_REC,2011.0,38710.1
AUS,,Australia,INT_REC,2012.0,38003.7
...,...,...,...,...,...
USA,,United States,INT-EXP,2017.0,158331.0
USA,,United States,INT-EXP,2018.0,172548.0
USA,,United States,INT-EXP,2019.0,182365.0
Ukraine,84.600000,,,,
