## 2021: Week 12 - Maldives Tourism

One of the best things about being a Dr Prepper is that people are always bringing interesting datasets to your attention. A little while ago, Tableau Zen Master Lorna Brown showed me a dataset with all kinds of information on tourism in the Maldives. This database has a lot of data on different Key Economic Indicators, but as you can imagine, it also has a bit of a quirky structure! For inspiration as to why we might want to clean this data up, check out Lorna's viz below:

![img](https://1.bp.blogspot.com/-U5TB6lC03pE/YCp2Mt1c7UI/AAAAAAAAAv8/_dEl3ETnJ5Exn6pagL2X6IuMbvp1Og21wCLcBGAsYHQ/w640-h512/%2523IronQuest%2BMaldives%2BTourism.png)

### Input

Our input is very wide this week, with 136 fields and only 28 rows. It covers tourism in the Maldives from 2010 to 2020. The source of this data is here but you can download it in the usual way from here.

![img](https://1.bp.blogspot.com/-toKHiHeJINY/YCp3avdHi7I/AAAAAAAAAwI/9Npu2oRJK844Uva0b6u5qnb1PHx98322wCLcBGAsYHQ/w640-h216/2021W12.png)

### Requirements

- Input the data
- Pivot all of the month fields into a single column 
- Rename the fields and ensure that each field has the correct data type
- Filter out the nulls 
- Filter our dataset so our Values are referring to Number of Tourists
- Our goal now is to remove all totals and subtotals from our dataset so that only the lowest level of granularity remains. Currently we have Total > Continents > Countries, but we don't have data for all countries in a continent, so it's not as simple as just filtering out the totals and subtotals. Plus in our Continents level of detail, we also have The Middle East and UN passport holders as categories. If you feel confident in your prep skills, this (plus the output) should be enough information to go on, but otherwise read on for a breakdown of the steps we need to take:
    - Filter out Total tourist arrivals
    - Split our workflow into 2 streams: Continents and Countries
        - Hint: the hierarchy field will be useful here
    - Split out the Continent and Country names from the relevant fields 
    - Aggregate our Country stream to the Continent level 
    - Join the two streams together and work out how many tourist arrivals there are that we don't know the country of
    - Add in a Country field with the value "Unknown"
    - Union this back to where we had our Country breakdown
- Output the data

### Output

![img](https://1.bp.blogspot.com/-hTwC2KmKM7E/YCp-LMvoaiI/AAAAAAAAAwU/9vsg8JoArKAMA0GXEu6DWHsit3Kjc6Y8wCLcBGAsYHQ/w353-h400/2021W12%2BOut.png)

4 fields
- Month
- Breakdown
- Country
- Number of Tourists

1,826 rows (1,827 including headers)

In [447]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Input the data

In [448]:
df = pd.read_csv("./data/Tourism Input.csv")

In [449]:
df.head()

| | id | Series-Measure | Hierarchy-Breakdown | Unit-Detail | Jan-10 | Feb-10 | Mar-10 | Apr-10 | May-10 | Jun-10 | ... | Mar-20 | Apr-20 | May-20 | Jun-20 | Jul-20 | Aug-20 | Sep-20 | Oct-20 | Nov-20 | Dec-20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1103 | Total tourist arrivals | Real Sector / Tourism | Tourists | 67478.0 | 77063.0 | 74975.0 | 60742.0 | 58324.0 | 44050.0 | ... | 59630.0 | 13.0 | 41.0 | 1.0 | 1752.0 | 7636.0 | 9605.0 | 21515.0 | 35757.0 | 96412.0 |
| 1 | 1104 | Tourist bednights | Real Sector / Tourism | Bednights | 552287.0 | 578472.0 | 581848.0 | 503007.0 | 443824.0 | 327385.0 | ... | 562302.2051 | 8844.0203 | 4776.6212 | 2325.8012 | 24673.4247 | 71370.6948 | 75367.8621 | 169709.0807 | 279030.282 | 623284.397 |
| 2 | 1105 | Average stay | Real Sector / Tourism | Days | 8.184697 | 7.506481 | 7.76056 | 8.281041 | 7.609628 | 7.432122 | ... | 9.4298541854713 | 9.428593030082 | 86.847657368888 | 42.287293761914 | 14.083004941515 | 9.3485538100132 | 9.4824196160074 | 9.6159959503923 | 8.877098540146 | 9.1055876952922 |
| 3 | 1106 | Operational bed capacity | Real Sector / Tourism | Beds | 22825.0 | 23472.0 | 23934.0 | 24124.0 | 23885.0 | 23585.0 | ... | 51001.0 | 7690.0 | 2978.0 | 3078.0 | 9821.0 | 19263.0 | 25328.0 | 32600.0 | 37378.0 | 42194.0 |
| 4 | 1107 | Bednight capacity | Real Sector / Tourism | Beds | 707575.0 | 657216.0 | 741954.0 | 723720.0 | 740435.0 | 707550.0 | ... | 1581031.0 | 230700.0 | 92318.0 | 92340.0 | 304451.0 | 597153.0 | 759840.0 | 1010600.0 | 1121340.0 | 1308014.0 |


### Pivot all of the month fields into a single column
- Rename the fields and ensure that each field has the correct data type
- Filter out the nulls

In [450]:
df = df.melt(id_vars=["id", "Series-Measure", "Hierarchy-Breakdown", "Unit-Detail"],
             var_name="Time",
             value_name="Tourists")
null_rows = df.loc[df["Tourists"] == "na"].index
df = df.drop(null_rows, axis=0)
df.shape

(3325, 6)
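To make the pivot step easier to follow, here is a minimal sketch of the same reshape on a toy frame. The two-row input and its values are made up for illustration. Using `pd.to_numeric(..., errors="coerce")` followed by `dropna` is an alternative to comparing against the string "na", not what the cell above does.

```python
import pandas as pd

# Toy stand-in for the wide input; the values and the "na" marker are illustrative.
wide = pd.DataFrame({
    "id": [1, 2],
    "Series-Measure": ["Tourist arrivals from Europe", "Tourist arrivals from Italy"],
    "Jan-10": ["100", "40"],
    "Feb-10": ["na", "55"],
})

# Every column not listed in id_vars becomes a (Time, Tourists) pair.
long = wide.melt(id_vars=["id", "Series-Measure"], var_name="Time", value_name="Tourists")

# Coerce non-numeric markers such as "na" to NaN, then drop those rows.
long["Tourists"] = pd.to_numeric(long["Tourists"], errors="coerce")
long = long.dropna(subset=["Tourists"]).reset_index(drop=True)
long["Tourists"] = long["Tourists"].astype(int)
```

One possible benefit of coercing first is that any other non-numeric marker in the month columns would be dropped by the same rule.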

### Filter our dataset so our Values are referring to Number of Tourists
- Our goal now is to remove all totals and subtotals from our dataset so that only the lowest level of granularity remains. 
- Currently we have Total > Continents > Countries, but we don't have data for all countries in a continent, so it's not as simple as just filtering out the totals and subtotals. 
- Plus in our Continents level of detail, we also have The Middle East and UN passport holders as categories. 

In [451]:
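# Keep only the tourist-count rows; .copy() avoids chained-assignment warnings on the next line
df = df.loc[df["Unit-Detail"] == "Tourists"].copy()
# Take the text after "from" (e.g. " Europe"); labels without "from" are kept whole
df["Area"] = df["Series-Measure"].map(lambda x: x.split("from")[-1])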
df = df.reset_index(drop=True)
df.shape

(1958, 7)
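Splitting on the bare substring "from" works for the labels in this dataset, but it leaves a leading space and does not handle the "UN passport holders" label. A hedged alternative is a single regular expression that covers both separators. The sample labels below are assumed to follow the shapes seen in the notebook's own output.

```python
import pandas as pd

# Sample labels in the shapes seen in this notebook (assumed, not read from the file).
measures = pd.Series([
    "Total tourist arrivals",
    "Tourist arrivals from Europe",
    "Tourist arrivals from the United Kingdom",
    "Tourist arrivals - UN passport holders and others",
])

# Capture whatever follows " from " or " - "; labels with neither separator give NaN.
area = measures.str.extract(r"(?: from | - )(.+)$")[0].str.strip()

# Fall back to the full label where nothing was captured.
area = area.fillna(measures)
```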

In [452]:
df["Tourists"].value_counts()

0        39
1         8
23        7
32        5
17        5
         ..
2392      1
49179     1
35574     1
92298     1
3005      1
Name: Tourists, Length: 1765, dtype: int64

In [453]:
df["Tourists"] = df["Tourists"].astype(int)

### Split our workflow into 2 streams: Continents and Countries
- Split out the Continent and Country names from the relevant fields

In [454]:
list_of_area = pd.Series(df.groupby(["Area"])["Tourists"].sum().index)
list_of_area = list_of_area.str.strip()
list_of_area.values

array(['Africa', 'Americas', 'Asia', 'Australia', 'China', 'Europe',
       'France', 'Germany', 'India', 'Italy', 'Oceania', 'Russia',
       'United States', 'the Middle East', 'the United Kingdom',
       'Total tourist arrivals',
       'Tourist arrivals - UN passport holders and others'], dtype=object)

In [455]:
grouped = df.groupby(["Area","Time"])["Tourists"].sum().reset_index()
grouped["Area"] = grouped["Area"].str.strip()
grouped.head()

| | Area | Time | Tourists |
|---|---|---|---|
| 0 | Africa | Apr-10 | 550 |
| 1 | Africa | Apr-11 | 773 |
| 2 | Africa | Apr-12 | 740 |
| 3 | Africa | Apr-13 | 752 |
| 4 | Africa | Apr-14 | 820 |


In [456]:
# Drop the "Tourist arrivals - " prefix so the label reads "UN passport holders and others"
grouped["Area"] = grouped["Area"].str.replace("Tourist arrivals - ", "", regex=False).str.strip()

In [457]:
list_of_continents = ['Africa', 'Americas', 'Asia', 'Europe','Oceania','the Middle East',
                      "UN passport holders and others"]
continents = grouped.loc[grouped["Area"].isin(list_of_continents), :]
continents

| | Area | Time | Tourists |
|---|---|---|---|
| 0 | Africa | Apr-10 | 550 |
| 1 | Africa | Apr-11 | 773 |
| 2 | Africa | Apr-12 | 740 |
| 3 | Africa | Apr-13 | 752 |
| 4 | Africa | Apr-14 | 820 |
| ... | ... | ... | ... |
| 1953 | UN passport holders and others | Sep-16 | 34 |
| 1954 | UN passport holders and others | Sep-17 | 30 |
| 1955 | UN passport holders and others | Sep-18 | 12 |
| 1956 | UN passport holders and others | Sep-19 | 32 |
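The requirements hint that the hierarchy field can drive the Continent/Country split, rather than a hand-written list of names. The sketch below shows one way that might look. The hierarchy strings are hypothetical: they assume country rows carry one extra path segment naming their parent continent, which should be checked against the real `Hierarchy-Breakdown` values before relying on it.

```python
import pandas as pd

# Hypothetical rows: the hierarchy strings below are assumed, not copied from the file.
rows = pd.DataFrame({
    "Series-Measure": [
        "Tourist arrivals from Europe",
        "Tourist arrivals from Italy",
        "Tourist arrivals from Asia",
        "Tourist arrivals from China",
    ],
    "Hierarchy-Breakdown": [
        "Real Sector / Tourism / Tourist arrivals",
        "Real Sector / Tourism / Tourist arrivals / Tourist arrivals from Europe",
        "Real Sector / Tourism / Tourist arrivals",
        "Real Sector / Tourism / Tourist arrivals / Tourist arrivals from Asia",
    ],
})

# Count the path segments; under the assumption above, countries sit one level deeper.
depth = rows["Hierarchy-Breakdown"].str.split(" / ").str.len()
continent_rows = rows[depth == depth.min()].copy()
country_rows = rows[depth > depth.min()].copy()

# The last path segment of a country row names its parent continent.
country_rows["Continent"] = (
    country_rows["Hierarchy-Breakdown"].str.split(" / ").str[-1].str.split(" from ").str[-1]
)
```

If the real field behaves this way, the continent for each country comes from the data itself and no manual country-to-continent mapping is needed.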


### Aggregate our Country stream to the Continent level

In [458]:
list_of_countries = ['Australia', 'China','France', 'Germany', 'India', 'Italy', 'Russia',
                     'United States','the United Kingdom']
countries = grouped.loc[grouped["Area"].isin(list_of_countries), :]
countries

| | Area | Time | Tourists |
|---|---|---|---|
| 396 | Australia | Apr-20 | 0 |
| 397 | Australia | Aug-20 | 49 |
| 398 | Australia | Dec-20 | 607 |
| 399 | Australia | Feb-20 | 2040 |
| 400 | Australia | Jul-20 | 17 |
| ... | ... | ... | ... |
| 1733 | the United Kingdom | Sep-16 | 7797 |
| 1734 | the United Kingdom | Sep-17 | 7499 |
| 1735 | the United Kingdom | Sep-18 | 8106 |
| 1736 | the United Kingdom | Sep-19 | 7876 |


In [459]:
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
countries = countries.copy()
countries.loc[countries["Area"] == "Australia", "Continent"] = "Oceania"
countries.loc[countries["Area"].isin(["China", "India"]), "Continent"] = "Asia"
countries.loc[countries["Area"].isin(["Italy", "Russia", "France", "Germany", "the United Kingdom"]), "Continent"] = "Europe"
# The United States belongs to the Americas, not Oceania
countries.loc[countries["Area"] == "United States", "Continent"] = "Americas"
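A dictionary lookup is a compact alternative to one `.loc` assignment per continent, and it keeps the whole mapping visible in one place. The mapping below is written by hand for the nine countries in this dataset, with the United States under Americas. The `sample` frame is illustrative only.

```python
import pandas as pd

# Hand-written mapping for the nine countries that appear in this dataset.
country_to_continent = {
    "Australia": "Oceania",
    "China": "Asia",
    "India": "Asia",
    "France": "Europe",
    "Germany": "Europe",
    "Italy": "Europe",
    "Russia": "Europe",
    "the United Kingdom": "Europe",
    "United States": "Americas",
}

# Illustrative rows; the tourist counts are made up.
sample = pd.DataFrame({"Area": ["China", "United States", "Italy"], "Tourists": [10, 20, 30]})
sample["Continent"] = sample["Area"].map(country_to_continent)

# map() leaves NaN for anything missing from the dictionary, which makes gaps easy to spot.
unmapped = sample.loc[sample["Continent"].isna(), "Area"].tolist()
```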


In [460]:
countries = countries.rename(columns={"Area": "Country", "Time": "Month", "Tourists": "Number of Tourists"})

In [461]:
# Copy first so adding the Country column does not raise SettingWithCopyWarning
continents = continents.copy()
continents["Country"] = "Unknown"
continents = continents.rename(columns={"Area": "Continent", "Time": "Month", 
                                        "Tourists": "Number of Tourists"})


In [462]:
countries = countries.loc[:, ["Continent", "Month", "Number of Tourists", "Country"]]

### Join the two streams together and work out how many tourist arrivals there are that we don't know the country of
- Add in a Country field with the value "Unknown"
- Union this back to where we had our Country breakdown

In [463]:
final_output = pd.concat([continents, countries], axis=0)
final_output = final_output.loc[:, ["Month", "Continent", "Country", "Number of Tourists"]]
final_output = final_output.rename(columns={"Continent": "Breakdown"})
final_output.loc[final_output["Country"] == "the United Kingdom", "Country"] = "United Kingdom"
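Note that the requirement asks for the remainder, that is, each continent total minus the arrivals already attributed to a known country. The cells here appear to label the full continent figure as "Unknown" instead. A minimal sketch of the subtraction follows, using made-up numbers and hypothetical frame names (`continent_totals`, `country_rows`).

```python
import pandas as pd

# Illustrative numbers only.
continent_totals = pd.DataFrame({
    "Month": ["Jan-10", "Jan-10"],
    "Breakdown": ["Europe", "Africa"],
    "Number of Tourists": [100, 20],
})
country_rows = pd.DataFrame({
    "Month": ["Jan-10", "Jan-10"],
    "Breakdown": ["Europe", "Europe"],
    "Country": ["Italy", "France"],
    "Number of Tourists": [30, 25],
})

# Sum the known countries up to the continent level.
known = (
    country_rows.groupby(["Month", "Breakdown"], as_index=False)["Number of Tourists"]
    .sum()
    .rename(columns={"Number of Tourists": "Known"})
)

# A left join keeps continents with no country rows; their known total is 0.
unknown = continent_totals.merge(known, on=["Month", "Breakdown"], how="left")
unknown["Known"] = unknown["Known"].fillna(0).astype(int)
unknown["Number of Tourists"] = unknown["Number of Tourists"] - unknown["Known"]
unknown["Country"] = "Unknown"
unknown = unknown[["Month", "Breakdown", "Country", "Number of Tourists"]]

# Union the remainder rows with the country rows.
result = pd.concat([unknown, country_rows[unknown.columns]], ignore_index=True)
```

In this toy case Europe keeps 45 "Unknown" arrivals (100 minus 30 and 25), Africa keeps all 20, and the grand total is unchanged.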

In [465]:
final_output["Month"] = pd.to_datetime(final_output["Month"], format="%b-%y")
final_output["Month"] = final_output["Month"].map(lambda x: x.strftime("%d/%m/%Y"))
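As a small aside, the same date formatting can be done with the vectorised `.dt.strftime` accessor instead of a per-row lambda. The two month labels below are illustrative.

```python
import pandas as pd

# Month labels in the "Mon-YY" shape used by the input columns.
months = pd.Series(["Apr-10", "Dec-20"])

# Parse to real datetimes first; the day defaults to the 1st of the month.
parsed = pd.to_datetime(months, format="%b-%y")

# Format as day/month/year text for the output file.
formatted = parsed.dt.strftime("%d/%m/%Y")
```

Any sorting by month is best done on the parsed datetimes, since the formatted strings sort by day first.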

In [466]:
final_output.sample(10)

| | Month | Breakdown | Country | Number of Tourists |
|---|---|---|---|---|
| 0 | 01/04/2010 | Africa | Unknown | 550 |
| 1487 | 01/08/2012 | the Middle East | Unknown | 2712 |
| 1606 | 01/04/2010 | Europe | United Kingdom | 9529 |
| 1089 | 01/12/2010 | Europe | Italy | 9845 |
| 2 | 01/04/2012 | Africa | Unknown | 740 |
| 1532 | 01/07/2013 | the Middle East | Unknown | 1220 |
| 1094 | 01/12/2015 | Europe | Italy | 11280 |
| 528 | 01/09/2010 | Asia | China | 13028 |
| 824 | 01/08/2020 | Europe | Germany | 306 |
| 1141 | 01/06/2018 | Europe | Italy | 3236 |


In [467]:
final_output.shape

(1826, 4)

In [468]:
# index=False keeps the file to the four required fields (no extra index column)
final_output.to_csv("./output/Week12_output.csv", index=False)