Lecture: AI I - Basics 

Previous:
[**Chapter 3.2: Pandas**](../02_pandas.ipynb)

---

# Solution 3.2: Pandas

- [Task 1 - First Steps in Data Preparation](#task-1---first-steps-in-data-preparation)

> Hint: When doing the exercises, put your solution in the designated "Solution" section:
> ```python
> # Solution (put your code here)
> ```

## Task 1 - First Steps in Data Preparation

This is the first task where we use a real dataset, hooray! The dataset `countries.csv` is loaded from `../data/pandas/countries.csv` in the cells below. It contains a list of (probably) all countries in the world (as of about 2008), together with some information about each country. The underlying data is freely available at http://gsociology.icaap.org/dataupload.html, with some information from http://www.iucnredlist.org/technical-documents/data-organization/countries-by-regions. Note that the information in this dataset does not reflect the opinion of the authors, may be outdated, and may be incorrect.

a) **Get familiar with the dataset (without automatic evaluation):** This is the first step for any dataset you work with. Load the dataset with pandas. Look at its `head()` and its `columns`, and let pandas `describe()` the dataset for you. To learn more about an individual column, you can also call `value_counts()` on it.

In [1]:
# prerequisites (don't edit this block)
import pandas as pd
df = pd.read_csv("../data/pandas/countries.csv")

In [2]:
# Solution (put your code here)
df.describe(include="all")

,Unnamed: 0,Country,Subcontinent,Region,In EU,Population,Area,Pop. Density,Coastline,Net migration,...,Phones,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
count,227.0,227,227,227,227,227.0,227.0,227.0,227.0,224.0,...,223.0,225.0,225.0,225.0,205.0,224.0,223.0,212.0,211.0,212.0
unique,,227,12,11,2,,,,,,...,,,,,,,,,,
top,,Afghanistan,Sub-Saharan Africa,SUB-SAHARAN AFRICA,False,,,,,,...,,,,,,,,,,
freq,,1,51,51,199,,,,,,...,,,,,,,,,,
mean,113.0,,,,,28740280.0,1549402.0,379.042511,21.16533,0.038125,...,236.057668,13.797111,4.564222,81.638311,2.139024,22.114732,9.241345,0.150844,0.282711,0.565283
std,65.673435,,,,,117891300.0,4636813.0,1660.187541,72.286863,4.889269,...,227.992091,13.040402,8.36147,16.140835,0.699397,11.176716,4.990026,0.146798,0.138272,0.165841
min,0.0,,,,,7026.0,5.17998,0.03,0.0,-20.99,...,0.17,0.0,0.0,33.33,1.0,7.29,2.29,0.0,0.02,0.062
25%,56.5,,,,,437624.0,12036.98,29.155,0.1,-0.9275,...,37.81,3.22,0.19,71.65,2.0,12.6725,5.91,0.03775,0.193,0.42925
50%,113.0,,,,,4786994.0,224293.1,78.77,0.73,0.0,...,176.15,10.42,1.03,85.7,2.0,18.79,7.84,0.099,0.272,0.571
75%,169.5,,,,,17497770.0,1144286.0,190.11,10.345,0.9975,...,389.63,20.0,4.44,95.44,3.0,29.82,10.605,0.221,0.341,0.6785


In [3]:
# Test case (don't edit this block)
# No automatic test for this part - explore the data yourself!
print("Explore the dataset using head(), columns, describe(), and value_counts() on individual columns.")

Explore the dataset using head(), columns, describe(), and value_counts() on individual columns.
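
The exploration calls named in part a) can be sketched on a tiny frame. The three rows below are made-up illustrative values, not taken from `countries.csv`, and the name `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for countries.csv (values are made up).
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["NORTH", "NORTH", "SOUTH"],
    "Population": [100, 300, 200],
})

print(toy.head(2))                  # first two rows
print(list(toy.columns))            # column names
print(toy.describe())               # numeric columns only by default
print(toy.describe(include="all"))  # also summarizes text columns (unique, top, freq)

# Number of rows per region, most frequent region first.
region_counts = toy["Region"].value_counts()
print(region_counts)
```

The `Unnamed: 0` column in the `describe()` output above appears because the prerequisites cell loads the file without `index_col=0`; the test cells pass `index_col=0`, which turns that column into the index instead.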


b) **Determine the mean population density by region:** The given dataset contains a column `Region` as well as a column `Pop. Density`. Write a function `get_mean_popdens_by_region()` that takes the dataframe with all countries as an argument and returns a `Series` that maps regions to the average population density of their countries, sorted in descending order by population density.

In [4]:
# Solution (put your code here)
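def get_mean_popdens_by_region(countries):
    # Unweighted mean of the per-country densities within each region (a Series)
    mean_popdens_by_region = countries.groupby("Region")["Pop. Density"].mean()
    # Highest mean density first
    mean_popdens_by_region_sorted = mean_popdens_by_region.sort_values(ascending=False)

    return mean_popdens_by_region_sorted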

In [None]:
# Test case (don't edit this block)
import pandas as pd
import numpy as np

df = pd.read_csv('../data/pandas/countries.csv', index_col=0)
res = get_mean_popdens_by_region(df)
assert isinstance(res, pd.Series)
assert list(res.index) == ['ASIA (EX. NEAR EAST)', 'WESTERN EUROPE', 'NEAR EAST', 'NORTHERN AMERICA', 'LATIN AMER. & CARIB', 'OCEANIA', 'EASTERN EUROPE', 'SUB-SAHARAN AFRICA', 'C.W. OF IND. STATES', 'BALTICS', 'NORTHERN AFRICA']
_, densities = zip(*res.items())
densities = np.array(densities)
demanded_densities = np.array([1264.81928571, 952.04285714, 427.07875, 260.872, 136.19177778, 131.18285714, 100.89083333, 92.25901961, 56.70083333, 39.83333333, 38.935])
assert np.allclose(densities, demanded_densities, atol=1e-8)
print("Population density by region calculated correctly!")

Population density by region calculated correctly!
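
Part b) asks for the average of the per-country densities, which is not the same as the density of a region taken as a whole (total population divided by total area). The following sketch contrasts the two on made-up numbers; the frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data (made-up values), reusing column names from the dataset.
toy = pd.DataFrame({
    "Region": ["X", "X", "Y"],
    "Population": [1000, 100, 500],
    "Area": [10.0, 100.0, 10.0],
    "Pop. Density": [100.0, 1.0, 50.0],
})

# Unweighted mean of the per-country densities (what the task asks for).
mean_of_densities = (
    toy.groupby("Region")["Pop. Density"].mean().sort_values(ascending=False)
)

# Density of each region as a whole: total population divided by total area.
totals = toy.groupby("Region")[["Population", "Area"]].sum()
overall_density = (totals["Population"] / totals["Area"]).sort_values(ascending=False)

print(mean_of_densities)
print(overall_density)
```

In this toy example the two rankings differ: one small, dense country pulls the unweighted mean of region `X` up, while the region's overall density stays low.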


c) **Find the Non-G20 country with the highest GDP:** The countries that are part of the G20 are given as the list `gtwenty` in the prerequisites block below. Additionally, any country that is in the EU counts as a G20 country for our purposes (the column `In EU` indicates this). Use masking to find the country with the highest GDP that is neither part of the EU nor explicitly a G20 country. Your function `get_highest_nongtwenty` should return a `Series` containing the name of this country and its GDP.

In [6]:
# prerequisites (don't edit this block)
gtwenty = [
    'Argentina', 
    'Australia', 
    'Brazil', 
    'Canada', 
    'China', 
    'France', 
    'Germany', 
    'India', 
    'Indonesia', 
    'Italy', 
    'Japan',  
    'Korea, South', 
    'Mexico', 
    'Russia', 
    'Saudi Arabia', 
    'South Africa', 
    'Turkey', 
    'United Kingdom', 
    'United States'
]

In [7]:
# Solution (put your code here)
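def get_highest_nongtwenty(countries, gtwenty):
    # Mask 1: drop EU members (the "In EU" column is boolean)
    not_in_EU = countries[~countries["In EU"]]

    # Mask 2: drop countries that are listed explicitly in gtwenty
    gtwenty_mask = ~not_in_EU["Country"].isin(gtwenty)
    not_G20 = not_in_EU[gtwenty_mask]

    # Sort by GDP (largest first) and take name and GDP of the top row as a Series
    max_not_G20 = not_G20.sort_values(by="GDP", ascending=False)
    first_max_not_G20 = max_not_G20[["Country", "GDP"]].iloc[0, :]

    return first_max_not_G20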

In [8]:
# Test case (don't edit this block)
df = pd.read_csv('../data/pandas/countries.csv', index_col=0)
res = get_highest_nongtwenty(df, gtwenty)
assert isinstance(res, pd.Series)
assert res.to_dict() == {'Country': 'Norway', 'GDP': 37800.0}
print("Highest Non-G20 country found correctly!")

Highest Non-G20 country found correctly!
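
The two masks in part c) can also be combined into a single boolean mask with `&`, and `idxmax()` can stand in for sorting. This is only a sketch of that alternative on made-up data; `toy` and `toy_gtwenty` are hypothetical names and values.

```python
import pandas as pd

# Hypothetical toy data (made-up values), reusing column names from the task.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "In EU": [True, False, False, False],
    "GDP": [900.0, 800.0, 700.0, 600.0],
})
toy_gtwenty = ["B"]  # hypothetical stand-in for the gtwenty list

# Keep rows that are neither in the EU nor in the list.
mask = ~toy["In EU"] & ~toy["Country"].isin(toy_gtwenty)

# idxmax() returns the index label of the largest GDP among the remaining rows.
best = toy.loc[toy.loc[mask, "GDP"].idxmax(), ["Country", "GDP"]]
print(best)
```

Selecting a single row label with a list of columns yields a `Series`, which matches the return type the task asks for.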


d) **Determine information about continents:** One column of the dataset represents subcontinents. The file `continents.csv` contains a mapping from subcontinents to continents. Your task is to write a function `create_continent_dataframe(countries, continents)` that takes two dataframes as arguments, containing the loaded `countries.csv` and `continents.csv` (in this order), and returns a pandas dataframe of continents that looks like this:

> <pre>		        Population	Countries  
> continent  
> Asia		        4163132161	56  
> Africa		        910844133	57  
> Europe		        523496050	44  
> North & Central America	517799370	36  
> South America		375641175	13  
> Oceania			33131662	21 </pre>

Note that the resulting DataFrame is sorted by the continent's population.

**Hint**: The `apply()` function can be used in combination with a dictionary to create a new `Continent` column in the main dataset. Then you should use `groupby` to get information about a continent's population when you have information about the population of its countries.
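
As a sketch of the hint: `apply()` needs a function (for example the dictionary's `get` method), whereas `Series.map()` accepts the dictionary directly. The dictionary `sub_to_continent`, the frame `toy`, and all values below are made up for illustration.

```python
import pandas as pd

# Hypothetical lookup table and data (made-up names and values).
sub_to_continent = {"North": "Alpha", "South": "Alpha", "East": "Beta"}
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Subcontinent": ["North", "South", "East", "West"],
    "Population": [10, 20, 30, 40],
})

# apply() calls a function per value; dict.get returns None for unknown keys.
via_apply = toy["Subcontinent"].apply(sub_to_continent.get)

# map() accepts the dictionary directly; unknown keys become missing values.
via_map = toy["Subcontinent"].map(sub_to_continent)

# groupby drops rows whose key is missing, so country "D" is not counted.
summary = toy.assign(Continent=via_map).groupby("Continent").agg(
    {"Population": "sum", "Country": "count"}
)
print(summary)
```

Indexing the dictionary inside a lambda (`lambda x: d[x]`) raises a `KeyError` for an unknown subcontinent instead, which can be preferable when a silent gap would hide a data problem.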

In [9]:
# Solution (put your code here)
def create_continent_dataframe(countries, continents):
    # One continent per subcontinent: Series mapping subcontinent -> continent
    subcontinent_to_continent = continents.groupby("Subcontinent")["continent"].first()
    # assign() returns a new dataframe, so the caller's dataframe is left unchanged
    with_continent = countries.assign(continent=countries["Subcontinent"].map(subcontinent_to_continent))

    # Total population and number of countries per continent
    continent_dataframe = with_continent.groupby("continent").aggregate({'Population': 'sum', 'Country': 'count'})
    continent_dataframe_renamed = continent_dataframe.rename(columns={'Country': 'Countries'})
    continent_dataframe_sorted = continent_dataframe_renamed.sort_values(by="Population", ascending=False)

    return continent_dataframe_sorted

In [10]:
# Test case (don't edit this block)
df = pd.read_csv('../data/pandas/countries.csv', index_col=0)
df2 = pd.read_csv('../data/pandas/continents.csv', delimiter=",", index_col=0)
res = create_continent_dataframe(df, df2)
assert isinstance(res, pd.DataFrame)
assert list(res.index) == ['Asia', 'Africa', 'Europe', 'North & Central America', 'South America', 'Oceania']
assert res["Population"].to_dict() == {'Asia': 4163132161, 'Africa': 910844133, 'Europe': 523496050, 'North & Central America': 517799370, 'South America': 375641175, 'Oceania': 33131662}
assert res["Countries"].to_dict() == {'Asia': 56, 'Africa': 57, 'Europe': 44, 'North & Central America': 36, 'South America': 13, 'Oceania': 21}
print("Continent dataframe created correctly!")

Continent dataframe created correctly!
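
An alternative to building a lookup is to join the two tables with `merge()` and then aggregate. The sketch below assumes the mapping table has the columns `Subcontinent` and `continent`, as the solution code suggests; the toy frames and their values are made up.

```python
import pandas as pd

# Hypothetical toy inputs (made-up values); column names follow the solution code.
toy_countries = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Subcontinent": ["North", "South", "East"],
    "Population": [10, 20, 50],
})
toy_continents = pd.DataFrame({
    "Subcontinent": ["North", "South", "East"],
    "continent": ["Alpha", "Alpha", "Beta"],
})

# A left merge attaches the continent to every country row; the inputs stay unmodified.
merged = toy_countries.merge(toy_continents, on="Subcontinent", how="left")

# Named aggregation sets the output column names directly.
result = (
    merged.groupby("continent")
    .agg(Population=("Population", "sum"), Countries=("Country", "count"))
    .sort_values(by="Population", ascending=False)
)
print(result)
```

Whether a merge or a mapping reads better is largely a matter of taste here; the merge tends to scale more naturally when the lookup table carries more than one column.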


---

Next: [**Chapter 3.3: Visualisation with Matplotlib**](../03_data/03_matplotlib.ipynb)