# Advanced Queries and Messy Data Lab

### Introduction

In this lesson, we'll work with some economic data derived from Wikipedia's dataset on [per capita GDP](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita).  Let's get started.

### Loading our data

In [148]:
import pandas as pd
countries_df = pd.read_csv('countries.csv')

In [130]:
countries = countries_df.to_dict('records')

Each dictionary in `countries` represents a separate country, with the keys `Country`, `Region`, `Gdp`, and `Year`.  The `Gdp` value is GDP per capita, so Liechtenstein's entry, for example, lists a per capita GDP of `169,049`.  Let's just admire that for a second.

### Exploring our Text Data 

Ok, so before doing initial analysis, it's a good idea to get an overview of our data.  Let's start by creating a unique collection of all of the regions in our dataset.

In [136]:
regions = []
for country in countries:
    regions.append(country['Region'])
    
set(regions)

# {'Africa', 'Americas', 'Asia', 'Europe', 'Oceania', 'World'}

{'Africa', 'Americas', 'Asia', 'Europe', 'Oceania', 'World'}
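
The same unique collection can also be built in one step with a set comprehension, which skips the intermediate list. Here is a minimal sketch, assuming a small `sample_countries` list (a hypothetical name) holding a few rows copied from this lesson's outputs rather than the full `countries.csv` data.

```python
# Sample rows copied from outputs shown in this lesson (not the full dataset).
sample_countries = [
    {'Country': 'Liechtenstein', 'Region': 'Europe', 'Gdp': 169049.0, 'Year': 2019.0},
    {'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0},
    {'Country': 'Bermuda', 'Region': 'Americas', 'Gdp': 110870.0, 'Year': 2021.0},
]

# A set comprehension keeps each region only once.
sample_regions = {country['Region'] for country in sample_countries}

# Sets are unordered, so sort before printing to get a stable display order.
print(sorted(sample_regions))
```

On the full dataset, replacing `sample_countries` with `countries` should give the same set of regions as the loop above.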

Ok, so we can see that the Americas (both North and South) are considered one region.  And we have the mysterious region of `World`.

Let's get a list of the countries that are listed under `World`, so we can get a better sense of what that is about.

In [140]:
world_countries = []
for country in countries:
    if country['Region'] == 'World':
        world_countries.append(country)
        
world_countries

[{'Country': 'Worl', 'Region': 'World', 'Gdp': 12263.0, 'Year': 2021.0}]
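
The loop-and-append filter can also be written as a list comprehension with an `if` clause. This is a sketch on a hypothetical `sample_countries` list made of two rows copied from this lesson's outputs, not the full dataset.

```python
# Two rows copied from outputs shown in this lesson.
sample_countries = [
    {'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0},
    {'Country': 'Worl', 'Region': 'World', 'Gdp': 12263.0, 'Year': 2021.0},
]

# Keep only the dictionaries whose Region is 'World'.
sample_world = [country for country in sample_countries
                if country['Region'] == 'World']

print(sample_world)
```

Both forms do the same work. The comprehension is simply more compact once the pattern is familiar.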

Ok, so this looks like it is the average world GDP for the year 2021.  Eventually, we can try to find the average GDP in our dataset to make sure it lines up.  Or we can just ask Google the world per capita GDP.

<img src="./per-capita.png" width="60%">

Let's also just see the size of our dataset -- so we can see if we really do have a lot of countries listed.

In [164]:
len(countries)

222

Wikipedia says there are only 195 countries.  So this is something to keep in mind.  It may mean that we have some duplicate entries, or that some items (like World) do not represent countries at all.  It could also just be that the number of countries differs based on the list of data -- although the difference here appears pretty large.
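
One way to test the duplicate-entry possibility is to count how often each country name appears. The sketch below uses `collections.Counter` from the standard library on a hypothetical `sample_countries` list. The Monaco and World rows come from this lesson's outputs, while the repeated Monaco row is an invented duplicate added only to show the check working.

```python
from collections import Counter

# Monaco and World rows are copied from this lesson's outputs.
# The second Monaco row is a HYPOTHETICAL duplicate, added for demonstration.
sample_countries = [
    {'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0},
    {'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0},
    {'Country': 'Worl', 'Region': 'World', 'Gdp': 12263.0, 'Year': 2021.0},
]

# Count how many times each country name appears.
name_counts = Counter(country['Country'] for country in sample_countries)

# Names that appear more than once are candidate duplicates.
duplicate_names = [name for name, count in name_counts.items() if count > 1]

print(duplicate_names)
```

Running the same check on the full `countries` list would show whether duplicates explain part of the gap between 222 and 195.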

### Exploring Numeric Data

Ok, so we've looked to see that various regions are included.  Now let's confirm GDP data makes sense.  This is slightly more difficult because there can be a range of different GDPs.  So it probably does not make sense to get a unique list of all GDPs.  One thing we can do is look at the upper and lower range of our data.

For example, we can find the minimum and maximum GDP values to check that our data makes sense.  We can't simply ask for the minimum of a list of dictionaries, though.

In [155]:
countries[:2]

[{'Country': 'Liechtenstein',
  'Region': 'Europe',
  'Gdp': 169049.0,
  'Year': 2019.0},
 {'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0}]

This is because Python wouldn't know which of these attributes it is trying to find the minimum of.  Instead, we need to first create a list of GDPs.
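
To see this concretely, here is a small sketch that calls `min` directly on a list of two dictionaries copied from the output above. In Python 3, dictionaries do not support the `<` comparison, so the call raises a `TypeError`. The names `sample_countries` and `comparison_failed` are hypothetical.

```python
# The two rows shown in the output above.
sample_countries = [
    {'Country': 'Liechtenstein', 'Region': 'Europe', 'Gdp': 169049.0, 'Year': 2019.0},
    {'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0},
]

comparison_failed = False
try:
    # min() has to compare the dictionaries with <, which dicts do not support.
    min(sample_countries)
except TypeError as error:
    comparison_failed = True
    print(f"TypeError: {error}")
```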

In [156]:
gdps = []
for country in countries:
    gdps.append(country['Gdp'])
    
gdps[:5]

[169049.0, 173688.0, 135683.0, 110870.0, 99152.0]

And now that we have just one list of numbers, we can find the minimum in that list.

In [157]:
min(gdps)

237.0

And the maximum.

In [158]:
max(gdps)

173688.0
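
The list of GDPs tells us the extreme values but not which countries they belong to. As an alternative sketch, the built-in `min` and `max` accept a `key` function that tells Python which attribute to compare, and they return the whole dictionary. The `sample_countries` list here is a hypothetical three-row sample copied from this lesson's outputs.

```python
# Three rows copied from outputs shown in this lesson (not the full dataset).
sample_countries = [
    {'Country': 'Liechtenstein', 'Region': 'Europe', 'Gdp': 169049.0, 'Year': 2019.0},
    {'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0},
    {'Country': 'Ireland', 'Region': 'Europe', 'Gdp': 99152.0, 'Year': 2021.0},
]

# key= tells min/max to compare the dictionaries by their 'Gdp' value.
highest = max(sample_countries, key=lambda country: country['Gdp'])
lowest = min(sample_countries, key=lambda country: country['Gdp'])

print(highest['Country'], highest['Gdp'])
print(lowest['Country'], lowest['Gdp'])
```

Within this sample, the highest entry is Monaco and the lowest is Ireland. On the full dataset, the lowest entry would be a different country.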

Now do the same thing for year.  Find the minimum year in the dataset using the same pattern.

In [160]:
years = []
for country in countries:
    years.append(country['Year'])


In [162]:
min(years)

# 2007.0

2007.0

In [163]:
max(years) # 2021.0

2021.0

Ok, so that's surprising.  Comparing data from such different years could lead to incorrect conclusions.

### Searching through data

Ok, so now that we've gotten a sense of our data, it's time to ask questions of it.  We know that our data is from various years, so let's reduce our data to only those dictionaries that have a year after 2019 (that is, 2020 or later).

In [182]:
recent_countries = []

for country in countries:
    if country['Year'] > 2019:
        recent_countries.append(country)

In [185]:
print(recent_countries[:4])

# [{'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0},
# {'Country': 'Luxembourg', 'Region': 'Europe', 'Gdp': 135683.0, 'Year': 2021.0}, 
# {'Country': 'Bermuda', 'Region': 'Americas', 'Gdp': 110870.0, 'Year': 2021.0}, 
# {'Country': 'Ireland', 'Region': 'Europe', 'Gdp': 99152.0, 'Year': 2021.0}]

[{'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0}, {'Country': 'Luxembourg', 'Region': 'Europe', 'Gdp': 135683.0, 'Year': 2021.0}, {'Country': 'Bermuda', 'Region': 'Americas', 'Gdp': 110870.0, 'Year': 2021.0}, {'Country': 'Ireland', 'Region': 'Europe', 'Gdp': 99152.0, 'Year': 2021.0}]


We can see that we still have a majority of our countries.

In [181]:
len(recent_countries)

203
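
Whether the boundary year itself is kept depends on the comparison operator, which is easy to get wrong. The sketch below compares `>` with `>=` on a hypothetical `sample_countries` list of three rows copied from this lesson's outputs, one for each of the years 2019, 2020, and 2021.

```python
# Three rows copied from outputs shown in this lesson, one per year.
sample_countries = [
    {'Country': 'Liechtenstein', 'Region': 'Europe', 'Gdp': 169049.0, 'Year': 2019.0},
    {'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0},
    {'Country': 'Luxembourg', 'Region': 'Europe', 'Gdp': 135683.0, 'Year': 2021.0},
]

# Strictly greater than 2019: the 2019 row is excluded.
after_2019 = [c['Country'] for c in sample_countries if c['Year'] > 2019]

# Greater than or equal to 2019: the 2019 row is included.
from_2019 = [c['Country'] for c in sample_countries if c['Year'] >= 2019]

print(after_2019)
print(from_2019)
```

Liechtenstein, whose entry is from 2019, appears only in the second list.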

Then let's find all of the countries in Europe that have a per capita GDP under `20,000`.

In [186]:
country_names = []
for country in countries:
    if country['Gdp'] < 20_000 and country['Region'] == 'Europe':
        country_names.append(country['Country'])

In [187]:
print(country_names)

['Poland', 'Hungary', 'Croatia', 'Romania', 'Russia', 'Bulgaria', 'Montenegro', 'Serbia', 'Belarus', 'Bosnia and Herzegovina', 'North Macedonia', 'Albania', 'Moldova', 'Kosovo', 'Ukraine']


From here, let's find a unique collection of regions that have a country with GDP over 80,000.

In [188]:
wealthy_regions = []
for country in countries:
    if country['Gdp'] > 80_000:
        wealthy_regions.append(country['Region'])

In [189]:
set(wealthy_regions)

{'Americas', 'Europe'}

So we can see that both the Americas and Europe meet that criterion.  From there, let's see how many wealthy countries there are in the Americas.

In [193]:
wealthy_american_countries = []
for country in countries:
    if country['Gdp'] > 80_000 and country['Region'] == 'Americas':
        wealthy_american_countries.append(country['Country'])

In [195]:
len(wealthy_american_countries)

# 2

2

And now find the number of countries in Europe with a GDP over `80_000`.

In [196]:
wealthy_european_countries = []
for country in countries:
    if country['Gdp'] > 80_000 and country['Region'] == 'Europe':
        wealthy_european_countries.append(country['Country'])

In [198]:
len(wealthy_european_countries)

7

And we can see that there are seven `wealthy_european_countries`.
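
The two loops above differ only in the region name, so the pattern could be wrapped in a small function. This is a sketch, not part of the lab's answers: `count_wealthy` and `sample_countries` are hypothetical names, and the sample holds only five rows copied from this lesson's outputs, so its counts differ from the full-dataset results of 2 and 7.

```python
def count_wealthy(countries, region, threshold=80_000):
    """Count entries in `region` whose 'Gdp' is above `threshold`."""
    return sum(
        1
        for country in countries
        if country['Region'] == region and country['Gdp'] > threshold
    )

# Five rows copied from outputs shown in this lesson (not the full dataset).
sample_countries = [
    {'Country': 'Liechtenstein', 'Region': 'Europe', 'Gdp': 169049.0, 'Year': 2019.0},
    {'Country': 'Monaco', 'Region': 'Europe', 'Gdp': 173688.0, 'Year': 2020.0},
    {'Country': 'Luxembourg', 'Region': 'Europe', 'Gdp': 135683.0, 'Year': 2021.0},
    {'Country': 'Bermuda', 'Region': 'Americas', 'Gdp': 110870.0, 'Year': 2021.0},
    {'Country': 'Ireland', 'Region': 'Europe', 'Gdp': 99152.0, 'Year': 2021.0},
]

europe_count = count_wealthy(sample_countries, 'Europe')
americas_count = count_wealthy(sample_countries, 'Americas')

print(europe_count, americas_count)
```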

From here, let's find the average income of countries in Europe.

In [212]:
european_country_incomes = []
for country in countries:
    if country['Region'] == 'Europe':
        european_country_incomes.append(country['Gdp'])

In [213]:
sum(european_country_incomes)/len(european_country_incomes)

# 44181.87234042553

44181.87234042553
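
The standard library's `statistics.mean` computes the same average as `sum(...) / len(...)`. A minimal sketch, using a hypothetical `sample_incomes` list of four European GDP values copied from this lesson's outputs (Liechtenstein, Monaco, Luxembourg, and Ireland):

```python
from statistics import mean

# Four European per capita GDP values copied from this lesson's outputs.
sample_incomes = [169049.0, 173688.0, 135683.0, 99152.0]

# The manual calculation used in this lesson.
manual_average = sum(sample_incomes) / len(sample_incomes)

# The standard-library equivalent.
library_average = mean(sample_incomes)

print(manual_average, library_average)
```

One practical difference is the behavior on an empty list: the manual version raises `ZeroDivisionError`, while `mean` raises `statistics.StatisticsError`.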

And note that we can see the difference between the highest and lowest per capita GDP in Europe just by subtracting the min from the max.

In [216]:
max(european_country_incomes) - min(european_country_incomes)

168852.0

Quite a difference.

### Summary

In this lesson, we practiced working with advanced queries and performing calculations in Python.  We did so by just selecting a subset of our data, converting that data into lists when necessary, and then performing calculations on those lists. 

### Answers

In [219]:
regions = []
for country in countries:
    regions.append(country['Region'])
    
set(regions)

{'Africa', 'Americas', 'Asia', 'Europe', 'Oceania', 'World'}

In [220]:
world_countries = []
for country in countries:
    if country['Region'] == 'World':
        world_countries.append(country)
        
world_countries

[{'Country': 'Worl', 'Region': 'World', 'Gdp': 12263.0, 'Year': 2021.0}]

In [221]:
len(countries)

222

In [222]:
years = []
for country in countries:
    years.append(country['Year'])


In [224]:
min(years)

2007.0

In [223]:
max(years)

2021.0

In [218]:
wealthy_american_countries = []
for country in countries:
    if country['Gdp'] > 80_000 and country['Region'] == 'Americas':
        wealthy_american_countries.append(country['Country'])
len(wealthy_american_countries)

2

In [225]:
wealthy_european_countries = []
for country in countries:
    if country['Gdp'] > 80_000 and country['Region'] == 'Europe':
        wealthy_european_countries.append(country['Country'])

len(wealthy_european_countries)

7

In [227]:
european_country_incomes = []
for country in countries:
    if country['Region'] == 'Europe':
        european_country_incomes.append(country['Gdp'])
        
sum(european_country_incomes)/len(european_country_incomes)

44181.87234042553

In [228]:
max(european_country_incomes) - min(european_country_incomes)

168852.0