In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("lab06.ipynb")

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from IPython.display import display, Latex, Markdown
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
%matplotlib inline

import geopandas
import pycountry
import geopy

import re

# Lab 6: Geospatial Visualizations

In this lab, you will generate a 3D map visualizing data from [this paper](https://gabriel-zucman.eu/who-owns-offshore-real-estate/). The paper looks at the ownership of offshore real estate in Dubai (where Rohan grew up).  We would like to thank Professor Zucman for making his data freely available and accessible. Professor Zucman is one of the foremost experts in economic inequality; take [Econ 133](https://gabriel-zucman.eu/econ133/) to learn about it from him!

In order to generate the map, we will first import a cleaned version of the dataset from the paper. Then, we will do some essential data cleaning steps so the data can be interpreted by plotting packages. Then, we will generate a sample plot. Finally, we will use widgets to easily toggle between multiple plots.

### Learning Objectives:
- Revisit some data cleaning techniques
- Generate geospatial visualizations in 2D and 3D

First, let's load in the dataset.

In [3]:
data = pd.read_csv('APZO2022Data-cleaned.csv')
data.head()

Unnamed: 0,Country,Unique Owners,Unique Properties,Total Property Values,"GDP (current USD, 2018 or latest available - for coverage)",Total Property Value / GDP,Mean Property value,Median Property Value,Number of firms,Mean Age,...,Properties in Dubai Marina,Properties in Palm Jumeirah,Number of villas,Number of buildings,Share of total values owned by top 10% owners,Share of total values owned by top 10% persons,Share of total values owned by top 10% firms,Share of total values owned by top 1% owners,Share of total values owned by top 1% persons,Share of total values owned by top 1% firms
0,World,273871.0,883268.0,"USD 532,564,964,318",86300000000000.0,0.62%,"USD 603,015","USD 212,081",8017.0,49.7,...,41685.0,17186.0,58913.0,4114.0,80%,26%,51%,63%,13%,38%
1,Afghanistan,1134.0,3220.0,"USD 1,410,029,024",18100000000.0,7.81%,"USD 437,897","USD 252,006",0.0,45.4,...,98.0,199.0,352.0,16.0,66%,66%,,38%,38%,
2,Albania,17.0,19.0,"USD 6,762,555",15200000000.0,0.04%,"USD 355,924","USD 325,745",0.0,40.4,...,4.9998,4.9998,4.9998,0.0,79%,79%,,56%,56%,
3,Algeria,790.0,1539.0,"USD 450,011,530",175000000000.0,0.26%,"USD 292,405","USD 196,108",0.0,48.8,...,173.0,27.0,131.0,0.0,54%,54%,,25%,25%,
4,American Samoa,4.9998,9.0,"USD 10,082,724",639000000.0,1.58%,"USD 1,120,303","USD 844,572",0.0,57.5,...,0.0,0.0,0.0,0.0,93%,93%,,93%,93%,


---
## Part 1: Cleaning Data

In this part, you will follow the steps for cleaning the data as described in each individual subpart. If you do not follow the steps exactly, the plots will not be generated in subsequent parts.

**Question 1.1:** Set the `Country` column as the table index and delete the first row (the one for the entire World) from the data.

In [4]:
data = data.set_index('Country') # set index
data = data.iloc[1:] # delete the first row
data.head()

Country,Unique Owners,Unique Properties,Total Property Values,"GDP (current USD, 2018 or latest available - for coverage)",Total Property Value / GDP,Mean Property value,Median Property Value,Number of firms,Mean Age,Number with age variable,...,Properties in Dubai Marina,Properties in Palm Jumeirah,Number of villas,Number of buildings,Share of total values owned by top 10% owners,Share of total values owned by top 10% persons,Share of total values owned by top 10% firms,Share of total values owned by top 1% owners,Share of total values owned by top 1% persons,Share of total values owned by top 1% firms
Afghanistan,1134.0,3220.0,"USD 1,410,029,024",18100000000.0,7.81%,"USD 437,897","USD 252,006",0.0,45.4,1123.0,...,98.0,199.0,352.0,16.0,66%,66%,,38%,38%,
Albania,17.0,19.0,"USD 6,762,555",15200000000.0,0.04%,"USD 355,924","USD 325,745",0.0,40.4,16.0,...,4.9998,4.9998,4.9998,0.0,79%,79%,,56%,56%,
Algeria,790.0,1539.0,"USD 450,011,530",175000000000.0,0.26%,"USD 292,405","USD 196,108",0.0,48.8,787.0,...,173.0,27.0,131.0,0.0,54%,54%,,25%,25%,
American Samoa,4.9998,9.0,"USD 10,082,724",639000000.0,1.58%,"USD 1,120,303","USD 844,572",0.0,57.5,4.9998,...,0.0,0.0,0.0,0.0,93%,93%,,93%,93%,
Andorra,4.9998,4.9998,"USD 243,222",3220000000.0,0.01%,,,0.0,54.0,4.9998,...,4.9998,0.0,0.0,0.0,100%,100%,,100%,100%,


In [5]:
grader.check("q1_1")

Now, let's take a look at the following columns:

In [6]:
data[['Total Property Value / GDP', 'Female share', 'Share of total values owned by top 10% owners', 
            'Share of total values owned by top 10% persons','Share of total values owned by top 10% firms',
           'Share of total values owned by top 1% owners', 'Share of total values owned by top 1% persons',
           'Share of total values owned by top 1% firms']]

Country,Total Property Value / GDP,Female share,Share of total values owned by top 10% owners,Share of total values owned by top 10% persons,Share of total values owned by top 10% firms,Share of total values owned by top 1% owners,Share of total values owned by top 1% persons,Share of total values owned by top 1% firms
Afghanistan,7.81%,17%,66%,66%,,38%,38%,
Albania,0.04%,56%,79%,79%,,56%,56%,
Algeria,0.26%,29%,54%,54%,,25%,25%,
American Samoa,1.58%,,93%,93%,,93%,93%,
Andorra,0.01%,,100%,100%,,100%,100%,
...,...,...,...,...,...,...,...,...
Venezuela,0.01%,29%,45%,45%,,9%,9%,
Vietnam,0.00%,68%,31%,31%,,13%,13%,
Yemen,4.83%,25%,67%,67%,,34%,34%,
Zambia,0.08%,29%,39%,39%,,12%,12%,


As you can see, the data is written as percentages. Since the plotting software can only plot numbers, each percentage will need to be converted to a number out of 100. A similar problem can also be seen below:

In [7]:
data[['Total Property Values', 'Mean Property value', 'Median Property Value']]

Country,Total Property Values,Mean Property value,Median Property Value
Afghanistan,"USD 1,410,029,024","USD 437,897","USD 252,006"
Albania,"USD 6,762,555","USD 355,924","USD 325,745"
Algeria,"USD 450,011,530","USD 292,405","USD 196,108"
American Samoa,"USD 10,082,724","USD 1,120,303","USD 844,572"
Andorra,"USD 243,222",,
...,...,...,...
Venezuela,"USD 45,244,832","USD 435,046","USD 267,746"
Vietnam,"USD 7,910,712","USD 316,429","USD 320,063"
Yemen,"USD 1,043,509,028","USD 287,468","USD 162,415"
Zambia,"USD 19,746,787","USD 207,861","USD 168,948"


In this case, the letters 'USD', commas and spaces will need to be removed from the above columns so the data can be read as numbers.

**Question 1.2:** Fix the issues described above by converting the given columns to numbers. Once you have converted the columns to numbers, change the datatype of all the columns to be `float64`.

For example, we want to convert "7.81%" to "7.81", "USD 1,410,029,024" to "1410029024". 

*Hint 1:* Consider using string methods like we did in project 1.   
*Hint 2:* You can get part of a string by slicing a string like we did in Lab 5. We can do this on a column in a dataframe using the string method. This [tutorial](https://note.nkmk.me/en/python-pandas-str-slice/) may be helpful. 
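The two hints can be tried out on a small made-up table before touching the real data. The sketch below is only an illustration: the frame `toy` and its column names `share` and `value` are hypothetical, and `pd.to_numeric` is used here as one possible way to do the final conversion.

```python
import pandas as pd

# Toy values in the same formats as the lab's columns (not the real dataset).
toy = pd.DataFrame({
    "share": ["7.81%", "0.04%", float("nan")],
    "value": ["USD 1,410,029,024", "USD 6,762,555", float("nan")],
})

# Slice off the trailing '%' with the .str accessor, then convert to numbers.
toy["share"] = pd.to_numeric(toy["share"].str[:-1])

# Drop the leading 'USD ' (4 characters) and the thousands separators.
cleaned = toy["value"].str[4:].str.replace(",", "", regex=False)
toy["value"] = pd.to_numeric(cleaned)

print(toy)
print(toy.dtypes)
```

Missing entries stay missing (`NaN`) throughout, so they do not need special handling before the conversion.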

In [8]:
for col in ['Total Property Values', 'Mean Property value', 'Median Property Value']:
    # get rid of 'USD' and commas
    data[col] = data[col].str.replace('USD', '').str.replace(',', '')

for col in ['Total Property Value / GDP', 'Female share', 'Share of total values owned by top 10% owners', 
            'Share of total values owned by top 10% persons','Share of total values owned by top 10% firms',
           'Share of total values owned by top 1% owners', 'Share of total values owned by top 1% persons',
           'Share of total values owned by top 1% firms']:
    data[col] = data[col].str.replace('%','') # get rid of '%'

# convert all the columns to float64
for col in data.columns:
    data[col] = data[col].astype('float64')
data

Country,Unique Owners,Unique Properties,Total Property Values,"GDP (current USD, 2018 or latest available - for coverage)",Total Property Value / GDP,Mean Property value,Median Property Value,Number of firms,Mean Age,Number with age variable,...,Properties in Dubai Marina,Properties in Palm Jumeirah,Number of villas,Number of buildings,Share of total values owned by top 10% owners,Share of total values owned by top 10% persons,Share of total values owned by top 10% firms,Share of total values owned by top 1% owners,Share of total values owned by top 1% persons,Share of total values owned by top 1% firms
Afghanistan,1134.0000,3220.0000,1.410029e+09,1.810000e+10,7.81,437897.0,252006.0,0.0,45.4,1123.0000,...,98.0000,199.0000,352.0000,16.0,66.0,66.0,,38.0,38.0,
Albania,17.0000,19.0000,6.762555e+06,1.520000e+10,0.04,355924.0,325745.0,0.0,40.4,16.0000,...,4.9998,4.9998,4.9998,0.0,79.0,79.0,,56.0,56.0,
Algeria,790.0000,1539.0000,4.500115e+08,1.750000e+11,0.26,292405.0,196108.0,0.0,48.8,787.0000,...,173.0000,27.0000,131.0000,0.0,54.0,54.0,,25.0,25.0,
American Samoa,4.9998,9.0000,1.008272e+07,6.390000e+08,1.58,1120303.0,844572.0,0.0,57.5,4.9998,...,0.0000,0.0000,0.0000,0.0,93.0,93.0,,93.0,93.0,
Andorra,4.9998,4.9998,2.432220e+05,3.220000e+09,0.01,,,0.0,54.0,4.9998,...,4.9998,0.0000,0.0000,0.0,100.0,100.0,,100.0,100.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,77.0000,104.0000,4.524483e+07,4.820000e+11,0.01,435046.0,267746.0,0.0,45.4,77.0000,...,9.0000,0.0000,14.0000,0.0,45.0,45.0,,9.0,9.0,
Vietnam,22.0000,25.0000,7.910712e+06,2.450000e+11,0.00,316429.0,320063.0,0.0,42.7,22.0000,...,7.0000,4.9998,4.9998,0.0,31.0,31.0,,13.0,13.0,
Yemen,965.0000,3630.0000,1.043509e+09,2.160000e+10,4.83,287468.0,162415.0,0.0,49.4,962.0000,...,133.0000,45.0000,259.0000,12.0,67.0,67.0,,34.0,34.0,
Zambia,65.0000,95.0000,1.974679e+07,2.630000e+10,0.08,207861.0,168948.0,0.0,52.4,65.0000,...,4.9998,0.0000,5.0000,0.0,39.0,39.0,,12.0,12.0,


In [9]:
grader.check("q1_2")

Now that all of our data is stored as floats, we must deal with ambiguity in country names. For example, United States, United States of America and USA all refer to the same country. It's hard for a package to keep track of all the different names for a country, so instead packages like to refer to the standardized, 3-letter [country codes](https://www.iban.com/country-codes). The following call takes in a country name and attempts to find the 3-letter country code associated with the country.

In [10]:
import pycountry
pycountry.countries.get(name='Albania').alpha_3
                                    # .alpha_3 refers to the 3-letter country code 
                                    # .alpha_2 refers to the 2-letter country code

'ALB'

**Question 1.3:** Use the provided function to try and find the associated country code for all the countries in your data. Write a function `get_alpha3code` that gets the 3-letter country code given the country name, and then apply this function to the index of our dataframe. There will be cases where the function fails as it cannot find the associated country code - consider using a try-except block to deal with these cases.
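As a reminder of how a try-except fallback behaves, here is a minimal sketch that uses a plain dictionary instead of `pycountry`. The names `TOY_CODES` and `lookup_code` are hypothetical, and the exception raised by a failed `pycountry` lookup may differ from the `KeyError` shown here.

```python
# Hypothetical stand-in for a country-code lookup; the lab itself uses pycountry.
TOY_CODES = {"Albania": "ALB", "Algeria": "DZA"}

def lookup_code(country_name):
    """Return the 3-letter code, or the string 'None' when the name is unknown."""
    try:
        return TOY_CODES[country_name]
    except KeyError:
        # The name is not in the table, so fall back to a placeholder.
        return "None"

names = ["Albania", "Ivory Coast", "Algeria"]
codes = [lookup_code(name) for name in names]
print(codes)  # ['ALB', 'None', 'DZA']
```

Returning a placeholder instead of raising lets the lookup run over every name in one pass, and the failures can then be filtered out and inspected afterwards.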

In [11]:
def get_alpha3code(country_name):
    try:
        code = pycountry.countries.get(name=country_name).alpha_3
    except (AttributeError, KeyError): # if it cannot find the associated country code
        code = 'None'
    return code
data['Code'] = data.index.map(get_alpha3code)
data.head()

Country,Unique Owners,Unique Properties,Total Property Values,"GDP (current USD, 2018 or latest available - for coverage)",Total Property Value / GDP,Mean Property value,Median Property Value,Number of firms,Mean Age,Number with age variable,...,Properties in Palm Jumeirah,Number of villas,Number of buildings,Share of total values owned by top 10% owners,Share of total values owned by top 10% persons,Share of total values owned by top 10% firms,Share of total values owned by top 1% owners,Share of total values owned by top 1% persons,Share of total values owned by top 1% firms,Code
Afghanistan,1134.0,3220.0,1410029000.0,18100000000.0,7.81,437897.0,252006.0,0.0,45.4,1123.0,...,199.0,352.0,16.0,66.0,66.0,,38.0,38.0,,AFG
Albania,17.0,19.0,6762555.0,15200000000.0,0.04,355924.0,325745.0,0.0,40.4,16.0,...,4.9998,4.9998,0.0,79.0,79.0,,56.0,56.0,,ALB
Algeria,790.0,1539.0,450011500.0,175000000000.0,0.26,292405.0,196108.0,0.0,48.8,787.0,...,27.0,131.0,0.0,54.0,54.0,,25.0,25.0,,DZA
American Samoa,4.9998,9.0,10082720.0,639000000.0,1.58,1120303.0,844572.0,0.0,57.5,4.9998,...,0.0,0.0,0.0,93.0,93.0,,93.0,93.0,,ASM
Andorra,4.9998,4.9998,243222.0,3220000000.0,0.01,,,0.0,54.0,4.9998,...,0.0,0.0,0.0,100.0,100.0,,100.0,100.0,,AND


In [12]:
grader.check("q1_3")

Let us quickly see the cases where the function fails.

In [13]:
data[data["Code"] == "None"]

Country,Unique Owners,Unique Properties,Total Property Values,"GDP (current USD, 2018 or latest available - for coverage)",Total Property Value / GDP,Mean Property value,Median Property Value,Number of firms,Mean Age,Number with age variable,...,Properties in Palm Jumeirah,Number of villas,Number of buildings,Share of total values owned by top 10% owners,Share of total values owned by top 10% persons,Share of total values owned by top 10% firms,Share of total values owned by top 1% owners,Share of total values owned by top 1% persons,Share of total values owned by top 1% firms,Code
Antigua & Barbuda,14.0,35.0,125462400.0,1610000000.0,7.81,3584639.0,573702.0,0.0,52.1,13.0,...,0.0,9.0,0.0,86.0,86.0,,80.0,80.0,,
Bolivia,5.0,5.0,1033388.0,40300000000.0,0.0,206678.0,98122.0,0.0,47.8,5.0,...,0.0,4.9998,0.0,70.0,70.0,,70.0,70.0,,
British Virgin Islands,45.0,912.0,383562700.0,1000000000.0,38.36,420573.0,246430.0,44.0,47.0,4.9998,...,7.0,5.0,5.0,80.0,0.0,80.0,49.0,0.0,49.0,
Brunei,13.0,17.0,3756692.0,13600000000.0,0.03,220982.0,211154.0,0.0,47.7,13.0,...,0.0,0.0,0.0,39.0,39.0,,29.0,29.0,,
Comoros Islands,166.0,253.0,96003740.0,1190000000.0,8.07,379461.0,199552.0,0.0,42.6,166.0,...,4.9998,31.0,0.0,55.0,55.0,,20.0,20.0,,
"Congo, Republic Of",17.0,30.0,12957190.0,13700000000.0,0.09,431906.0,221947.0,0.0,50.3,17.0,...,0.0,6.0,0.0,35.0,35.0,,18.0,18.0,,
Czech Republic,158.0,224.0,89288840.0,249000000000.0,0.04,398611.0,279865.0,5.0,46.5,146.0,...,37.0,22.0,0.0,44.0,44.0,1.0,17.0,17.0,1.0,
"Democratic Rep, Of Congo",4.9998,4.9998,2906772.0,47200000000.0,0.01,,,0.0,45.0,4.9998,...,0.0,4.9998,0.0,96.0,96.0,,96.0,96.0,,
Foreign Governmental Organisation,8.0,16.0,34918260.0,,,2182392.0,1395278.0,4.9998,,0.0,...,0.0,4.9998,0.0,32.0,32.0,4.0,32.0,32.0,4.0,
Iran,8669.0,16703.0,7028738000.0,294000000000.0,2.39,420807.0,222273.0,7.0,52.7,8201.0,...,360.0,1766.0,25.0,64.0,62.0,1.0,31.0,29.0,1.0,


**Question 1.4:** We can see that the function fails in a small portion of cases. We have provided a list of all the cases where the function fails; you have to correct these cases manually by referencing the [website](https://www.iban.com/country-codes). This might seem tedious, but that is the point: data cleaning must be done with careful attention to detail.
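One way to keep manual corrections organized is to collect them in a dictionary and apply them in a loop. The sketch below runs on a made-up three-row table; `toy`, `manual_codes` and `still_missing` are hypothetical names, and the membership check is an optional safeguard rather than something the lab requires.

```python
import pandas as pd

# Toy table with the same 'Code' placeholder convention as the lab.
toy = pd.DataFrame(
    {"Code": ["None", "ALB", "None"]},
    index=["Ivory Coast", "Albania", "Kosovo"],
)

# Manual corrections looked up by hand: name in the data -> 3-letter code.
manual_codes = {"Ivory Coast": "CIV"}

for name, code in manual_codes.items():
    # Assigning with .loc to a label that does not exist would add a new row,
    # so a misspelled name is skipped here instead of silently growing the table.
    if name in toy.index:
        toy.loc[name, "Code"] = code

# Names that still have no code after the corrections.
still_missing = toy.index[toy["Code"] == "None"].tolist()
print(still_missing)  # ['Kosovo']
```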

In [14]:
data.loc['Antigua & Barbuda','Code'] = 'ATG'
data.loc['Brunei','Code'] = 'BRN'
data.loc['Congo, Republic Of','Code'] = 'COG'
data.loc['Czech Republic','Code'] = 'CZE'
data.loc['Iran','Code'] = 'IRN'
data.loc['Ivory Coast','Code'] = 'CIV'
data.loc['Kyrgistan','Code'] = 'KGZ'
data.loc['Macedonia','Code'] = 'MKD'
data.loc['Moldova','Code'] = 'MDA'
data.loc['North Korea','Code'] = 'PRK'
data.loc['Palestine','Code'] = 'PSE'
data.loc['Russia','Code'] = 'RUS'
data.loc['Saint Vincent & The Grenadines','Code'] = 'VCT'
data.loc['South Korea','Code'] = 'KOR'
data.loc['Southern Sudan','Code'] = 'SSD'
data.loc['Syria','Code'] = 'SYR'
data.loc['Taiwan','Code'] = 'TWN'
data.loc['Tanzania','Code'] = 'TZA'
data.loc['Trinidad & Tobago','Code'] = 'TTO'
data.loc['USA','Code'] = 'USA'
data.loc['Venezuela','Code'] = 'VEN'
data.loc['Vietnam','Code'] = 'VNM'
data.loc['Bolivia','Code'] = 'BOL'
data.loc['Democratic Rep, Of Congo','Code'] = 'COD'
data.loc['Turkey','Code'] = 'TUR'
data.loc['Comoros Islands','Code'] = 'COM'
data.loc['British Virgin Islands','Code'] = 'VGB'

In [15]:
grader.check("q1_4")

Let us now look at the cases where the `Code` column is still 'None'.

In [16]:
data[data["Code"] == "None"]

Country,Unique Owners,Unique Properties,Total Property Values,"GDP (current USD, 2018 or latest available - for coverage)",Total Property Value / GDP,Mean Property value,Median Property Value,Number of firms,Mean Age,Number with age variable,...,Properties in Palm Jumeirah,Number of villas,Number of buildings,Share of total values owned by top 10% owners,Share of total values owned by top 10% persons,Share of total values owned by top 10% firms,Share of total values owned by top 1% owners,Share of total values owned by top 1% persons,Share of total values owned by top 1% firms,Code
Foreign Governmental Organisation,8.0,16.0,34918260.0,,,2182392.0,1395278.0,4.9998,,0.0,...,0.0,4.9998,0.0,32.0,32.0,4.0,32.0,32.0,4.0,
Kosovo,4.9998,4.9998,1066098.0,7880000000.0,0.01,,,0.0,53.3,4.9998,...,0.0,0.0,0.0,59.0,59.0,,59.0,59.0,,
Unknown Background (Firm),5438.0,61289.0,92315580000.0,,,1506234.0,285122.0,5335.0,,0.0,...,1175.0,3321.0,894.0,94.0,,94.0,73.0,,73.0,
Unknown Background (Person),14403.0,19965.0,6829583000.0,,,342078.0,190274.0,0.0,53.2,4631.0,...,439.0,867.0,27.0,59.0,59.0,,32.0,32.0,,


We can see none of these countries/organizations have a 3-letter country code associated with them, so we can drop these rows.

In [17]:
data = data[data["Code"] != "None"]
data.head()

Country,Unique Owners,Unique Properties,Total Property Values,"GDP (current USD, 2018 or latest available - for coverage)",Total Property Value / GDP,Mean Property value,Median Property Value,Number of firms,Mean Age,Number with age variable,...,Properties in Palm Jumeirah,Number of villas,Number of buildings,Share of total values owned by top 10% owners,Share of total values owned by top 10% persons,Share of total values owned by top 10% firms,Share of total values owned by top 1% owners,Share of total values owned by top 1% persons,Share of total values owned by top 1% firms,Code
Afghanistan,1134.0,3220.0,1410029000.0,18100000000.0,7.81,437897.0,252006.0,0.0,45.4,1123.0,...,199.0,352.0,16.0,66.0,66.0,,38.0,38.0,,AFG
Albania,17.0,19.0,6762555.0,15200000000.0,0.04,355924.0,325745.0,0.0,40.4,16.0,...,4.9998,4.9998,0.0,79.0,79.0,,56.0,56.0,,ALB
Algeria,790.0,1539.0,450011500.0,175000000000.0,0.26,292405.0,196108.0,0.0,48.8,787.0,...,27.0,131.0,0.0,54.0,54.0,,25.0,25.0,,DZA
American Samoa,4.9998,9.0,10082720.0,639000000.0,1.58,1120303.0,844572.0,0.0,57.5,4.9998,...,0.0,0.0,0.0,93.0,93.0,,93.0,93.0,,ASM
Andorra,4.9998,4.9998,243222.0,3220000000.0,0.01,,,0.0,54.0,4.9998,...,0.0,0.0,0.0,100.0,100.0,,100.0,100.0,,AND


## Part 2: Generating a Sample Map

In this part, you will use your data to generate a sample plot visualizing the `Total Property Value / GDP` column.

When we hover over a country in the generated plot, we would like to be able to see its name, the total property value owned and how it compares to the amount of property owned by other countries. In order to do this, we must first rank all the countries by total property value owned.

**Question 2.1:** First, sort all the values in the table by `Total Property Value / GDP` in ascending order (this sorting is important for when we generate the colors in the plot later). Then, rank all the countries by `Total Property Value / GDP`, in descending order. Store all the ranks in a column in the data named 'Rank'. 

*Hint:* [`pandas.Series.rank`](https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html) may be useful.
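To see how `rank` treats direction and ties, here is a small sketch on made-up numbers (the series `toy` and its labels are hypothetical). By default ranks are assigned in ascending order and tied values share the average of their ranks, which is why fractional ranks such as 7.5 can appear.

```python
import pandas as pd

# Toy ratios with a tie at the bottom (not the real dataset).
toy = pd.Series([0.0, 0.0, 0.26, 7.81], index=["A", "B", "C", "D"])

# Default: smallest value gets rank 1, ties share the average rank.
asc = toy.rank()
print(asc.tolist())   # [1.5, 1.5, 3.0, 4.0]

# ascending=False: largest value gets rank 1 instead.
desc = toy.rank(ascending=False)
print(desc.tolist())  # [3.5, 3.5, 2.0, 1.0]

# method='min' gives every tied value the lowest rank in its group.
desc_min = toy.rank(ascending=False, method="min")
print(desc_min.tolist())  # [3.0, 3.0, 2.0, 1.0]
```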

In [18]:
data = data.sort_values('Total Property Value / GDP')
data['Rank'] = data['Total Property Value / GDP'].rank()
data

Country,Unique Owners,Unique Properties,Total Property Values,"GDP (current USD, 2018 or latest available - for coverage)",Total Property Value / GDP,Mean Property value,Median Property Value,Number of firms,Mean Age,Number with age variable,...,Number of villas,Number of buildings,Share of total values owned by top 10% owners,Share of total values owned by top 10% persons,Share of total values owned by top 10% firms,Share of total values owned by top 1% owners,Share of total values owned by top 1% persons,Share of total values owned by top 1% firms,Code,Rank
Cuba,4.9998,4.9998,6.135230e+05,1.000000e+11,0.00,,,0.0,40.0,4.9998,...,0.0000,0.0000,88.0,88.0,,88.0,88.0,,CUB,7.5
Guatemala,4.9998,4.9998,7.172760e+05,7.320000e+10,0.00,,,0.0,35.5,4.9998,...,4.9998,0.0000,77.0,77.0,,77.0,77.0,,GTM,7.5
Bolivia,5.0000,5.0000,1.033388e+06,4.030000e+10,0.00,206678.0,98122.0,0.0,47.8,5.0000,...,4.9998,0.0000,70.0,70.0,,70.0,70.0,,BOL,7.5
Vietnam,22.0000,25.0000,7.910712e+06,2.450000e+11,0.00,316429.0,320063.0,0.0,42.7,22.0000,...,4.9998,0.0000,31.0,31.0,,13.0,13.0,,VNM,7.5
Faroe Islands,4.9998,4.9998,1.472440e+05,3.050000e+09,0.00,,,0.0,53.0,4.9998,...,0.0000,0.0000,100.0,100.0,,100.0,100.0,,FRO,7.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Dominica,58.0000,196.0000,1.008556e+08,5.510000e+08,18.32,514569.0,320047.0,0.0,48.5,54.0000,...,27.0000,0.0000,54.0,54.0,,18.0,18.0,,DMA,189.0
Seychelles,66.0000,353.0000,2.932545e+08,1.550000e+09,18.94,830749.0,618288.0,17.0,48.4,49.0000,...,118.0000,0.0000,83.0,42.0,39.0,37.0,37.0,36.0,SYC,190.0
British Virgin Islands,45.0000,912.0000,3.835627e+08,1.000000e+09,38.36,420573.0,246430.0,44.0,47.0,4.9998,...,5.0000,5.0000,80.0,0.0,80.0,49.0,0.0,49.0,VGB,191.0
United Arab Emirates,63438.0000,427538.0000,2.872459e+11,4.220000e+11,68.04,671861.0,211542.0,1966.0,49.9,58069.0000,...,14301.0000,2495.0000,85.0,19.0,62.0,73.0,10.0,37.0,ARE,192.0


In [19]:
grader.check("q2_1")

Now, we must think about how we want the colors in the plot to look. For the sake of simplicity, let's say we want to bin the colors. So, we will need to group the countries into bins depending on their value of `Total Property Value / GDP`, and then assign a color to each bin. Let's take a look at the values in the column.


In [20]:
data['Total Property Value / GDP'].describe()

count    193.000000
mean       2.203264
std       11.726847
min        0.000000
25%        0.030000
50%        0.110000
75%        0.460000
max      140.800000
Name: Total Property Value / GDP, dtype: float64

As we can see, there are some clear outliers in the data. If we were to only consider the minimum and the maximum of the data, we could assign bins like $[0,30),\ [30,60),\ [60,90),\ [90,120),\ [120,150)$. This would leave most countries in the bottommost bin and would not provide an accurate color representation of the data. Ultimately, the bins you choose are a personal choice, but it is important to consider how those bins affect the final plot. We have provided sample bins for this part, but please feel free to mess around with these bins if you like.
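The effect of the bin choice can be checked on a handful of made-up values shaped like the summary above. This is only a sketch: `toy`, `equal_width` and `custom` are hypothetical names, and the custom edges are illustrative rather than recommended. Note that `pd.cut` excludes the left edge of the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

# Toy values that mimic a heavily skewed column (not the real dataset).
toy = pd.Series([0.0, 0.01, 0.03, 0.11, 0.46, 7.81, 140.8])

# Equal-width bins driven by the min and max: almost everything lands in bin 0.
equal_width = pd.cut(toy, bins=[0, 30, 60, 90, 120, 150], include_lowest=True)

# Hand-picked bins that are narrow where the data is dense.
custom = pd.cut(toy, bins=[0, 0.015, 0.05, 0.1, 0.2, 1, 150], include_lowest=True)

# .cat.codes gives the bin index of each value (0 is the lowest bin).
print(equal_width.cat.codes.tolist())  # [0, 0, 0, 0, 0, 0, 4]
print(custom.cat.codes.tolist())       # [0, 0, 1, 3, 4, 5, 5]

# Counts per bin, in bin order.
print(equal_width.value_counts(sort=False))
print(custom.value_counts(sort=False))
```

With the equal-width bins, six of the seven values share one color, while the custom bins spread them across five.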

<!-- BEGIN QUESTION -->

**Question 2.2:** Fill in the provided code below to generate your sample plot for Total Property Value / GDP!

In [21]:
fig = px.choropleth(data, # This is the name of your dataset
                    locations='Code', # Which column are the country codes stored in?
                    color=pd.cut(data['Total Property Value / GDP'], 
                                bins=[0, 0.015,0.05,0.1,0.2,1,140]).astype(str).fillna('No Data'),
                                #These are our sample bins, feel free to mess around with them
                    hover_name = data.index, # Which column are the country names stored in?
                    hover_data={"Total Property Value / GDP":":.2f", "Rank":":"},
                    # Change the above line so we can see the ratio of property value to GDP to 2 decimal places
                    color_discrete_sequence=px.colors.sequential.BuPu,
                    #Feel free to mess around with colors if you're interested
                    title = 'Total Property Value/GDP by Country', # Write an appropriate title
                    height = 900
                   )
fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='mercator'
    ),
    margin=dict(l=50, r=50, t=50, b=50),
)
fig.show()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.3:** Using the code above, generate a similar plot for `Total Property Values`.  Make sure the color bins and title are appropriate. However, make this plot 3D.

*Hint:* which line of code above references a 2D projection of the Earth? Here's a [list of supported projections](https://plotly.com/python/map-configuration/#map-projections). 

In [22]:
data['Rank'] = data['Total Property Values'].rank() # rank the data
fig = px.choropleth(data, #This is the name of your dataset
                    locations='Code', # Which column are the country codes stored in?
                    color=pd.cut(data['Total Property Values'], 
                               bins=[0, 0.001*1000000000,0.01*1000000000,0.1*1000000000,0.5*1000000000,3*1000000000,300*1000000000]).astype(str).fillna('No Data'),
                    #These are our sample bins, feel free to mess around with them
                    hover_name = data.index, # Which column are the country names stored in?
                    hover_data={"Total Property Values":":.2f", "Rank":":"},
                    # Change the above line so we can see the total property value to 2 decimal places
                    color_discrete_sequence=px.colors.sequential.BuPu,
                    #Feel free to mess around with colors if you're interested
                    title = 'Total Property Values by Country', # Write an appropriate title
                    height = 900
                   )
fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='orthographic'
    ),
    margin=dict(l=50, r=50, t=50, b=50),
)
fig.show()

<!-- END QUESTION -->

---
## Part 3: Using Widgets

Congratulations on making the first map! In this part, we will generate a map that can easily toggle between different columns to visualize different data. In order to do this, we must first introduce [widgets](https://ipywidgets.readthedocs.io). Widgets are interactive browser controls that allow you to choose between different values. An example is included below.

In [23]:
from ipywidgets import Dropdown
Dropdown(
    options=['1', '2', '3'],
    value='2',
    description='Number:',
    disabled=False,
)

Dropdown(description='Number:', index=1, options=('1', '2', '3'), value='2')

The `interact` function in the widgets module takes in a function and a list of values for its parameters, and determines the appropriate widget to let you visualize the function. Two examples are included below.

In [24]:
def say_my_name(name):
    """
    Print the current widget value in a short sentence
    """
    print(f'My name is {name}')
     
interact(say_my_name, name=["James", "Bond", "James Bond"]);

interactive(children=(Dropdown(description='name', options=('James', 'Bond', 'James Bond'), value='James'), Ou…

In [25]:
def f(x):
    return x + 1
lst = [1,2,3]
interact(f, x=lst);

interactive(children=(Dropdown(description='x', options=(1, 2, 3), value=1), Output()), _dom_classes=('widget-…

We will be using the `interact` function to generate a widget that lets us choose between and visualize the different column values easily. In order to do this, we must first write a function that lets us generate a 3D plot for any column name. Thankfully, this isn't too hard. If you remember, other than changing the projection, making the plot for Question 2.3 wasn't too bad once your code for Question 2.2 was working.

The main thing we must consider for different columns is how to automatically determine the different color bins. After all, we won't be able to make judgement calls for every single bin. One way to automate this process would be to look at the data quintiles - assign the bottom 20% of the data to one bin, the next 20% to another bin, and so on. 

**Question 3.1:** Computationally determine the quintiles of `data['Total Property Values']` and return the information as a list. The list must start at the minimum value of `data['Total Property Values']` and end at the maximum value.

Hint: The solution does not need to be longer than one line. This [method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html) may be helpful. 
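The idea of quintile-based bins can be sketched on ten made-up numbers. The names `toy`, `edges` and `binned` are hypothetical, and passing `include_lowest=True` is an assumption made here so that the minimum value falls inside the first bin instead of being left out.

```python
import pandas as pd

# Ten evenly spaced toy values (not the real dataset).
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype="float64")

# Quintile edges: minimum, 20%, 40%, 60%, 80%, maximum.
edges = toy.quantile(q=[0, 0.2, 0.4, 0.6, 0.8, 1]).tolist()
print(edges)  # six edges, from the minimum 1.0 to the maximum 10.0

# Use the edges as bins; duplicates='drop' guards against repeated edges,
# which can happen when many values in a column are identical.
binned = pd.cut(toy, bins=edges, include_lowest=True, duplicates="drop")

# Each of the five bins receives two of the ten values.
print(binned.cat.codes.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

Because the edges come from the data itself, the same recipe works for any numeric column without hand-tuning the bins.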

In [26]:
quintiles = data['Total Property Values'].quantile(q=[0,0.2,0.4,0.6,0.8,1])
quintiles

0.0    3.750600e+04
0.2    4.678943e+06
0.4    2.153267e+07
0.6    1.225616e+08
0.8    5.523684e+08
1.0    2.872459e+11
Name: Total Property Values, dtype: float64

In [27]:
grader.check("q3_1")

<!-- BEGIN QUESTION -->

**Question 3.2:** Write a function that takes in a column name and generates a 3D plot visualizing that column data. Name the function `plot_generator`. Feel free to assign the column name as the plot title.

In [28]:
def plot_generator(col):
    data_new = data.copy()
    data_new['Rank'] = data_new[col].rank()  # rank countries by the chosen column
    # Quintile edges for the column; include_lowest keeps the minimum in the first bin
    edges = data_new[col].quantile(q=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
    binned = pd.cut(data_new[col], bins=edges, include_lowest=True, duplicates='drop')
    # Use the bin labels as strings and mark countries without a value as 'No Data'
    colors = binned.astype(str).where(binned.notna(), 'No Data')
    fig = px.choropleth(data_new,
                        locations=data_new['Code'],
                        color=colors,
                        hover_name=data_new.index,
                        hover_data={col: ":.2f", "Rank": ":"},
                        color_discrete_sequence=px.colors.sequential.BuPu,
                        title=f'{col} by country',
                        height=900
                       )
    fig.update_layout(
        geo=dict(
            showframe=False,
            showcoastlines=False,
            projection_type='orthographic'  # globe-style 3D projection
        ),
        margin=dict(l=50, r=50, t=50, b=50),
    )
    fig.show()
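
When bin labels are passed to Plotly as strings, rows with no value need an explicit label so they get their own legend entry. The sketch below uses made-up data (the `values` series and the `edges` list are assumptions for illustration) and shows one way to label missing bins that does not depend on how missing values are rendered as strings.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data)
values = pd.Series([10.0, 20.0, np.nan, 40.0], index=["A", "B", "C", "D"])
edges = [10.0, 20.0, 40.0]

# Missing inputs stay missing after binning
binned = pd.cut(values, bins=edges, include_lowest=True)

# Convert the interval labels to strings, then overwrite the rows
# that had no bin with an explicit 'No Data' label
labels = binned.astype(str).where(binned.notna(), "No Data")

print(labels)
```

Masking on `binned.notna()` keeps the labeling step tied to the original missing values rather than to their string form.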

<!-- END QUESTION -->

Let's make sure the function works:

In [29]:
plot_generator("Total Property Values")

Now, view the beautiful visualizations with `interact`!

In [30]:
interact(plot_generator, col=data);

interactive(children=(Dropdown(description='col', options=('Unique Owners', 'Unique Properties', 'Total Proper…

Let's throw in another toggle as a bonus!

In [31]:
display(widgets.interactive(plot_generator, col=widgets.ToggleButtons(options=[
    "Total Property Values", "Total Property Value / GDP", "Mean Property value", "Median Property Value"])));


interactive(children=(ToggleButtons(description='col', options=('Total Property Values', 'Total Property Value…

<!-- BEGIN QUESTION -->

**Question 3.3:** Using the widget that you generated above, name one country that has both a high total property value invested in Dubai and a high total property value / GDP. What does that potentially imply about income inequality in that country? This is an open-ended question.

Although the graphs are visually appealing, scrolling back and forth between multiple graphs is tedious. I therefore start by looking at the countries in the top decile of both `'Total Property Values'` and `'Total Property Value / GDP'` in a regular table.

In [32]:
# Countries above the 90th percentile of total property value
tpv_cutoff = data['Total Property Values'].quantile(0.9)
TPV_upper_decile = data[data['Total Property Values'] > tpv_cutoff].reset_index()
TPV_upper_decile = TPV_upper_decile[['Country', 'Total Property Values', 'GDP (current USD, 2018 or latest available - for coverage)', 'Total Property Value / GDP']]

# Countries above the 90th percentile of property value relative to GDP
ratio_cutoff = data['Total Property Value / GDP'].quantile(0.9)
TPVGDP_upper_decile = data[data['Total Property Value / GDP'] > ratio_cutoff].reset_index()
TPVGDP_upper_decile = TPVGDP_upper_decile['Country']

# Keep only the countries that appear in both groups
merge = TPV_upper_decile.merge(TPVGDP_upper_decile, how='inner', on='Country').set_index('Country')
display(merge)

Country,Total Property Values,"GDP (current USD, 2018 or latest available - for coverage)",Total Property Value / GDP
Kuwait,3795782000.0,138000000000.0,2.75
Pakistan,10631640000.0,315000000000.0,3.38
Lebanon,3398557000.0,55300000000.0,6.15
Syria,2976197000.0,40400000000.0,7.37
Jordan,5197686000.0,42900000000.0,12.11
United Arab Emirates,287245900000.0,422000000000.0,68.04


As the table above shows, six countries appear in the upper decile of both `'Total Property Values'` and `'Total Property Value / GDP'`. Unsurprisingly, one of them is the UAE itself; four of the others (Kuwait, Lebanon, Syria, and Jordan) are in the Middle East, and Pakistan is in South Asia. One plausible explanation is relatively high inequality in these countries: to appear on this list, a country must have some very wealthy residents who can afford expensive property in Dubai even though the country's GDP is small relative to those holdings.
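
The same "top decile in both columns" filter can also be written as a single boolean mask instead of a merge. The sketch below is only an illustration: the helper name `top_in_all`, the toy data, and the column names `total` and `ratio` are assumptions, and the example uses the median (`q=0.5`) so the small table has more than one match.

```python
import pandas as pd

def top_in_all(df, cols, q=0.9):
    """Return the rows of df that exceed the q-th quantile in every column in cols."""
    thresholds = df[cols].quantile(q)           # one cutoff per column
    mask = (df[cols] > thresholds).all(axis=1)  # True where every column beats its cutoff
    return df[mask]

# Toy data (hypothetical values, indexed by a made-up country label)
toy = pd.DataFrame(
    {"total": [1, 2, 3, 4, 5, 6], "ratio": [6, 1, 2, 5, 4, 3]},
    index=["A", "B", "C", "D", "E", "F"],
)

both = top_in_all(toy, ["total", "ratio"], q=0.5)
print(both)
```

Only rows `D` and `E` exceed the median in both columns. Applied to the lab data with `q=0.9`, this approach should select the same countries as the inner merge, since both keep rows that are strictly above each column's 90th percentile.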

<!-- END QUESTION -->

**Congratulations!!** You are done with the lab. Hopefully you enjoyed producing these geospatial visualizations!

---
## Feedback

**Question 4:** Please fill out this short [feedback form](https://forms.gle/zfy4e7NH8gvcYmB37) to let us know your thoughts about this lab! We really appreciate your opinions and feedback! At the end of the Google form, you should see a codeword. Assign the codeword to the variable `codeword` below. 

In [33]:
codeword = 'Dubai'

In [34]:
grader.check("q4")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [35]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)


You are using an unsupported version of pandoc (3.1.11.1).
Your version must be at least (1.12.1) but less than (3.0.0).
Refer to https://pandoc.org/installing.html.
Continuing with doubts...


Your element with mimetype(s) dict_keys(['application/vnd.plotly.v1+json']) is not able to be represented.


Your element with mimetype(s) dict_keys(['application/vnd.plotly.v1+json']) is not able to be represented.



Running your submission against local test cases...


Your submission received the following results when run against available test cases:

    q1_1 results: All test cases passed!

    q1_2 results: All test cases passed!

    q1_3 results: All test cases passed!

    q1_4 results: All test cases passed!

    q2_1 results: All test cases passed!

    q3_1 results: All test cases passed!

    q4 results: All test cases passed!
