# VA Group Project / 2020W. 
## Group N+1

### Dataset Overview 

* Our goal is to explore the trends and correlations between different datasets available on [Gapminder](https://www.gapminder.org/).
* Every dataset contains a different subset of countries and years.
* The original data has one row per country, with the values for each year as columns. The year columns are transformed into rows to make the data processing more convenient.
* All datasets of interest are merged to make the data exploration easier.
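
The reshaping and merging steps listed above can be sketched on toy data. The two small frames and all their values are invented for illustration; only `pd.melt` and an outer `merge` on `country` and `year` are taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for two Gapminder files (values are invented for illustration)
wide_a = pd.DataFrame({"country": ["A", "B"], "2000": [1.0, 2.0], "2001": [1.5, 2.5]})
wide_b = pd.DataFrame({"country": ["B", "C"], "2001": [10, 20]})

# Wide to long: one row per (country, year)
long_a = pd.melt(wide_a, id_vars=["country"], var_name="year", value_name="a")
long_b = pd.melt(wide_b, id_vars=["country"], var_name="year", value_name="b")

# The outer join keeps every (country, year) pair that occurs in either dataset;
# measures that a dataset does not cover are filled with missing values
merged = long_a.merge(long_b, how="outer", on=["country", "year"])
print(merged.sort_values(["country", "year"], ignore_index=True))
```

Because the datasets cover different countries and years, the outer join is what produces the many missing values seen later in the merged data.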

In [1]:
#disable some annoying warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

#plots the figures in place instead of a new window
%matplotlib inline

import pandas as pd
import numpy as np

import altair as alt
import altair_viewer

import ipywidgets as widgets

from sklearn import decomposition
from sklearn.manifold import TSNE
from sklearn.manifold import Isomap
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# pip install umap-learn
import umap

# Load Data

## Helper functions for data loading 

In [2]:

def getYearsOfInterest(fromYear, toYear):
    """
    Generates a range of years [fromYear, toYear]
    """
    return [str(x) for x in range(fromYear, toYear+1)]

def filterData(valueColumns, metaDataColumns, data):
    """
    Filter valueColumns + metaDataColumns from the data frame
    All missing valueColumns are added to the resulting data frame (with value = None)
    """
    missingColumns = list(set(valueColumns) - set(data.columns))
    for c in missingColumns:
        data[c] = None 
    return data[list(set(metaDataColumns) | set(valueColumns))]

def unpivot(data, key_columns, data_column, value_column):
    """
    Transforms all non key_columns into rows
    """
    return pd.melt(data, id_vars=key_columns, var_name=data_column, value_name=value_column)

def loadSingleDataset(path, from_year, to_year, key_columns, data_column, value_column):
    """
    Loads a single data set from csv
    """
    data = pd.read_csv(path) 
    data = filterData(getYearsOfInterest(from_year, to_year), key_columns, data)
    return unpivot(data, key_columns, data_column, value_column)

def mergeDatasets(datasets, keys):
    """
    Merge datasets using keys as key columns
    The merge operation is outer join of all data sets
    """
    data = datasets[0]
    
    for i in range(1, len(datasets)):
        data = data.merge(datasets[i], how='outer', left_on=keys, right_on=keys)
        
    return data

In [3]:
# global report params 
FROM_YEAR = 1900
TO_YEAR   = 2020

In [4]:
gdp_growth = loadSingleDataset('data/gdp_total_yearly_growth.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'gdp_growth')
gdp_growth.head()


Unnamed: 0,country,year,gdp_growth
0,Afghanistan,1922,1.42
1,Albania,1922,1.38
2,Algeria,1922,1.7
3,Andorra,1922,4.47
4,Angola,1922,3.71


In [5]:
children_per_woman_total_fertility = loadSingleDataset('data/children_per_woman_total_fertility.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'children_per_woman_total_fertility')
children_per_woman_total_fertility.head()

Unnamed: 0,country,year,children_per_woman_total_fertility
0,Afghanistan,1922,7.0
1,Albania,1922,4.6
2,Algeria,1922,6.99
3,Angola,1922,7.02
4,Antigua and Barbuda,1922,4.55


In [6]:
co2_emissions_tonnes_per_person = loadSingleDataset('data/co2_emissions_tonnes_per_person.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'co2_emissions_tonnes_per_person')
co2_emissions_tonnes_per_person.head()


Unnamed: 0,country,year,co2_emissions_tonnes_per_person
0,Afghanistan,1922,
1,Albania,1922,
2,Algeria,1922,0.00438
3,Andorra,1922,
4,Angola,1922,


In [7]:
mean_years_in_school_women_percent_men_25_to_34_years = loadSingleDataset('data/mean_years_in_school_women_percent_men_25_to_34_years.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'mean_years_in_school_women_percent_men_25_to_34_years')
mean_years_in_school_women_percent_men_25_to_34_years.head()

Unnamed: 0,country,year,mean_years_in_school_women_percent_men_25_to_34_years
0,Afghanistan,1922,
1,Albania,1922,
2,Algeria,1922,
3,Andorra,1922,
4,Angola,1922,


In [8]:
average_age_of_dollar_billionaires_years = loadSingleDataset('data/average_age_of_dollar_billionaires_years.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'average_age_of_dollar_billionaires_years')
average_age_of_dollar_billionaires_years.head()

Unnamed: 0,country,year,average_age_of_dollar_billionaires_years
0,Afghanistan,1922,
1,Albania,1922,
2,Algeria,1922,
3,Andorra,1922,
4,Angola,1922,


In [9]:
food_supply= loadSingleDataset('data/food_supply.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'food_supply')
food_supply.head()

Unnamed: 0,country,year,food_supply
0,Afghanistan,1922,
1,Albania,1922,
2,Algeria,1922,
3,Angola,1922,
4,Antigua and Barbuda,1922,


In [10]:
hourly_compensation = loadSingleDataset('data/hourly_compensation.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'hourly_compensation')
hourly_compensation.head()

Unnamed: 0,country,year,hourly_compensation
0,Argentina,1922,
1,Armenia,1922,
2,Australia,1922,
3,Austria,1922,
4,Azerbaijan,1922,


In [11]:
income_per_person= loadSingleDataset('data/income_per_person.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'income_per_person')
income_per_person.head()

Unnamed: 0,country,year,income_per_person
0,Afghanistan,1922,1540
1,Albania,1922,1560
2,Algeria,1922,2480
3,Andorra,1922,4330
4,Angola,1922,1330


In [12]:
suicide_per_100000_people = loadSingleDataset('data/suicide_per_100000_people.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'suicide_per_100000_people')
suicide_per_100000_people.head()

Unnamed: 0,country,year,suicide_per_100000_people
0,Albania,1922,
1,Antigua and Barbuda,1922,
2,Argentina,1922,
3,Armenia,1922,
4,Australia,1922,


In [13]:
total_number_of_dollar_billionaires = loadSingleDataset('data/total_number_of_dollar_billionaires.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'total_number_of_dollar_billionaires')
total_number_of_dollar_billionaires.head()

Unnamed: 0,country,year,total_number_of_dollar_billionaires
0,Afghanistan,1922,
1,Albania,1922,
2,Algeria,1922,
3,Andorra,1922,
4,Angola,1922,


In [14]:
working_hours_per_week = loadSingleDataset('data/working_hours_per_week.csv', 
                               FROM_YEAR, TO_YEAR, 
                               ['country'], 
                               'year', 
                               'working_hours_per_week')
working_hours_per_week.head()

Unnamed: 0,country,year,working_hours_per_week
0,Albania,1922,
1,Algeria,1922,
2,Argentina,1922,
3,Armenia,1922,
4,Australia,1922,


## The final merged dataset

* Call mergeDatasets function to form the final dataset
* Augment data with additional attributes (i.e. continent and region data for _'country'_ and decade for _'year'_)
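
As a small illustration of the decade attribute, the sketch below derives it in two ways on a made-up series of years: with the string slicing used in this notebook, and with integer arithmetic as a possible alternative when the year is stored as a number.

```python
import pandas as pd

# Hypothetical year values, stored as strings as they are after unpivoting
years = pd.Series(["1900", "1957", "2020"])

# String variant: keep the first three digits and append '0'
decade_str = years.str.slice(0, 3) + "0"

# Integer variant: floor the year to a multiple of 10
decade_int = (years.astype(int) // 10) * 10

print(decade_str.tolist())
print(decade_int.tolist())
```

The string variant only works while `year` is a four-digit string; the integer variant is the safer choice once the column has been cast to `int`.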

In [15]:
# merge the datasets in one that contains all the data
data = mergeDatasets([
    gdp_growth, 
    children_per_woman_total_fertility,
    co2_emissions_tonnes_per_person,
    mean_years_in_school_women_percent_men_25_to_34_years,
    average_age_of_dollar_billionaires_years,
    food_supply,
    hourly_compensation,
    income_per_person,
    suicide_per_100000_people,
    total_number_of_dollar_billionaires,
    working_hours_per_week
], ['country', 'year'])

data.sort_values(by=['country', 'year'], inplace=True, ignore_index=True)


countries = pd.read_csv('data/countryContinent.csv')

data = data.merge(countries, how='left', left_on=['country'], right_on=['country'])
data = data.convert_dtypes()

#add 'decade' computed column 
data['decade'] = data['year'].str.slice(0, 3)  + '0'

#check for missing countries (they have to be corrected in countryContinent.csv)
missing_countries = data[data["region_code"].isnull()]['country'].unique()

if (len(missing_countries) == 0):
    print("Country mapping is OK")
else:
    print(missing_countries)
    
data.to_csv('data/data.csv')

Country mapping is OK


In [16]:
# basic statistics of the loaded data 
print(data.count())
data.head(50)

country                                                  23595
year                                                     23595
gdp_growth                                               22094
children_per_woman_total_fertility                       22264
co2_emissions_tonnes_per_person                          15722
mean_years_in_school_women_percent_men_25_to_34_years     8602
average_age_of_dollar_billionaires_years                   776
food_supply                                               8022
hourly_compensation                                        483
income_per_person                                        23353
suicide_per_100000_people                                 2992
total_number_of_dollar_billionaires                        776
working_hours_per_week                                    1643
code_2                                                   23474
code_3                                                   23595
country_code                                           

Unnamed: 0,country,year,gdp_growth,children_per_woman_total_fertility,co2_emissions_tonnes_per_person,mean_years_in_school_women_percent_men_25_to_34_years,average_age_of_dollar_billionaires_years,food_supply,hourly_compensation,income_per_person,...,working_hours_per_week,code_2,code_3,country_code,iso_3166_2,continent,sub_region,region_code,sub_region_code,decade
0,Afghanistan,1900,1.05,7.0,,,,,,1090,...,,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34,1900
1,Afghanistan,1901,1.05,7.0,,,,,,1110,...,,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34,1900
2,Afghanistan,1902,1.05,7.0,,,,,,1120,...,,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34,1900
3,Afghanistan,1903,1.05,7.0,,,,,,1140,...,,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34,1900
4,Afghanistan,1904,1.05,7.0,,,,,,1160,...,,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34,1900
5,Afghanistan,1905,1.05,7.0,,,,,,1180,...,,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34,1900
6,Afghanistan,1906,1.05,7.0,,,,,,1200,...,,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34,1900
7,Afghanistan,1907,1.05,7.0,,,,,,1220,...,,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34,1900
8,Afghanistan,1908,1.05,7.0,,,,,,1240,...,,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34,1900
9,Afghanistan,1909,1.05,7.0,,,,,,1260,...,,AF,AFG,4,ISO 3166-2:AF,Asia,Southern Asia,142,34,1900


## Helper variables for different set of columns in the dataset

In [17]:
# replace every _ with a line break so that long column names are easier to display in the plots
mapping = {}
for col in data:
    mapping[col] = col.replace('_', "\n")
    
data = data.rename(columns=mapping)

key_columns = ['country', 'year']

measure_columns = [
        "gdp\ngrowth",
        "children\nper\nwoman\ntotal\nfertility",
        "co2\nemissions\ntonnes\nper\nperson",
        "mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears",
        "average\nage\nof\ndollar\nbillionaires\nyears",
        "food\nsupply",
        "hourly\ncompensation",
        "income\nper\nperson",
        "suicide\nper\n100000\npeople",
        "total\nnumber\nof\ndollar\nbillionaires",
        "working\nhours\nper\nweek"
    ]

all_columns = key_columns + measure_columns

# Show Data

## Data Completeness
 > In the data quality framework, data completeness refers to the degree to which all data in a data set is available. A measure of data completeness is the percentage of missing data entries [[1]](https://dataladder.com/missing-data-and-data-completeness/)
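
The definition quoted above can be turned into a number per measure. The following is a minimal sketch on an invented frame; the column names `m1` and `m2` are placeholders and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Invented example: two measures with some missing entries
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "m1": [1.0, np.nan, 3.0, 4.0],
    "m2": [np.nan, np.nan, np.nan, 1.0],
})

# Share of missing entries per measure, in percent
missing_pct = toy[["m1", "m2"]].isnull().mean() * 100

# Completeness is the complement of the missing share
completeness_pct = 100 - missing_pct
print(completeness_pct)
```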

In [18]:

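# one row per (country, year, measure)
t1 = pd.melt(data[all_columns], id_vars=['country', 'year'], var_name=['measure'], value_name='val')
# flag missing values: after summing, 'Countries Count' is the number of countries WITHOUT a value
t1['Countries Count'] = t1['val'].isnull() 

# number of countries with a missing value per year and measure
t1 = t1.groupby(['year', 'measure'])['Countries Count'].sum().reset_index()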

alt.Chart(t1).mark_rect().encode(
    x='year:O',
    y='measure:O',
    color='Countries Count:Q'
).properties(
    width=800,
    height=300,
    title='Data Completeness'
)


## Selection of Attributes.

The aim is to provide a short description and an overall visualization of the attributes/columns that have been chosen and are used within this template.

* gdp_growth (gdp_growth): yearly growth of GDP
* children_per_woman_total_fertility (fertility): number of children per woman
* co2_emissions_tonnes_per_person (co2_emissions): carbon dioxide emissions from the burning of fossil fuels in tonnes per person
* mean_years_in_school_women_percent_men_25_to_34_years (school_years): mean years in school of women aged 25 to 34 as a percentage of the mean years in school of men of the same age (including primary, secondary and tertiary education)
* average_age_of_dollar_billionaires_years (age_billionaires): average age of dollar billionaires in the country of their citizenship
* food_supply (calories): kilocalorie intake per person per day (normally 1,500-3,000 kcal/day)
* hourly_compensation (compensation): average hourly labor cost per employee
* income_per_person (income): GDP per person PPP and inflation adjusted
* suicide_per_100000_people (suicide): mortality due to self-inflicted injury per 100,000 people
* total_number_of_dollar_billionaires (billionaires): total number of dollar billionaires in the country of their citizenship
* working_hours_per_week (working_hours): total amount of yearly working hours divided by 52 weeks


In [19]:
df_data = pd.read_csv('data/data.csv')
mapping = {}
for col in df_data:
    mapping[col] = col.replace('_', "\n")
    
df_data = df_data.rename(columns=mapping)

In [20]:
print('Please select the year you are interested in.')
print('If you want to cluster the data by continent, then tick the box next to \'cluster_by_continent\' and select the wanted continent from the dropdown menu.')
print('You can also untick boxes that you are not interested in.')

@widgets.interact(year = (1970,2013), cluster_by_continent=False, continent=['Asia','Europe','Africa', 'Oceania', 'Americas'],
          children_per_woman_total_fertility=True, gdp_growth=True, co2_emissions_tonnes_per_person=True,
          income_per_person=True, food_supply=True, mean_years_in_school_women_percent_men_25_to_34_years=True)
def plot_education_gender_ratio(year,cluster_by_continent, continent,
                                children_per_woman_total_fertility,gdp_growth,co2_emissions_tonnes_per_person,
                                income_per_person,food_supply,
                                mean_years_in_school_women_percent_men_25_to_34_years):
    checked_data = list()
    if children_per_woman_total_fertility:
        checked_data.append('children\nper\nwoman\ntotal\nfertility')
    if gdp_growth:
        checked_data.append('gdp\ngrowth')
    if co2_emissions_tonnes_per_person:
        checked_data.append('co2\nemissions\ntonnes\nper\nperson')
    if income_per_person:
        checked_data.append('income\nper\nperson')
    if food_supply:
        checked_data.append('food\nsupply')
    if mean_years_in_school_women_percent_men_25_to_34_years:
        checked_data.append('mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears')
    if cluster_by_continent:
        new_data_condensed = df_data[checked_data + ['country', 'year', 'continent']]
        data_condensed_of_year = new_data_condensed[new_data_condensed.year == year]
        data_condensed_of_year_continent = data_condensed_of_year[data_condensed_of_year.continent == continent]
        print(f'current year is {year}')
        pd.plotting.scatter_matrix(data_condensed_of_year_continent[checked_data], figsize=(15,10))
    else:
        new_data_condensed = df_data[checked_data + ['country', 'year']]
        data_condensed_of_year = new_data_condensed[new_data_condensed.year == year]
        print(f'Scatter plot matrix of {year}:')
        pd.plotting.scatter_matrix(data_condensed_of_year[checked_data], figsize=(15,10))

Please select the year you are interested in.
If you want to cluster the data by continent, then tick the box next to 'cluster_by_continent' and select the wanted continent from the dropdown menu.
You can also untick boxes that you are not interested in.


interactive(children=(IntSlider(value=1991, description='year', max=2013, min=1970), Checkbox(value=False, des…

### Interpretation
Here we want to give a short explanation on the 3 essential questions when dealing with data:

##### What data do we have?

This is described above where we give an overview of all the attributes of our data. 

##### Why do we want to visualize the data?

We want to get greater insights into the data by observing correlations between attributes and looking at how the data is distributed.

##### How do we want to do that?

Since we want to look at correlations between attributes and how the data is distributed, a scatterplot matrix is well suited for that task. This way we get many different scatterplots and can hopefully make some interesting observations that may be helpful to us later on. Furthermore, we see how single attributes are distributed, because the scatterplot matrix also provides histograms.

# DESCRIPTIVE STATISTICS.

Analyzing our dataset using descriptive statistics on the level of individual attributes.
This includes simple plots of distributions and statistics.
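
As a numerical counterpart to the plots in this section, the sketch below computes per-group summary statistics with pandas. The frame and its values are invented for illustration; the quartiles it prints are the quantities that a box plot draws.

```python
import pandas as pd

# Invented fertility values for two continents
toy = pd.DataFrame({
    "continent": ["Europe"] * 3 + ["Africa"] * 3,
    "fertility": [1.4, 1.6, 1.8, 4.0, 5.0, 6.0],
})

# count, mean, std, min, quartiles and max per continent
summary = toy.groupby("continent")["fertility"].describe()

# The five-number summary is what a box plot visualizes
print(summary[["min", "25%", "50%", "75%", "max"]])
```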


# Overview Data.




In [21]:
import geopandas as gpd

data['year'] = data['year'].astype(int)
gdf = gpd.read_file('data/CNTR_RG_60M_2020_4326.shp')
# exclude Antarctica from the map
gdf3=gdf[gdf.NAME_ENGL!='Antarctica']
toggle = widgets.RadioButtons(options=['alphabetically','descending','ascending'], description="Bar sorting")
print('Please choose desired sorting, data, year and region')
print('For comparing individual countries: shift-button + mouse-click on desired countries')
@widgets.interact(Continent=['World','Europe','Asia','Americas','Africa','Oceania'],Year=(1900,2020),Attribute=['gdp\ngrowth','children\nper\nwoman\ntotal\nfertility','co2\nemissions\ntonnes\nper\nperson','mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears','average\nage\nof\ndollar\nbillionaires\nyears','food\nsupply','hourly\ncompensation','income\nper\nperson','suicide\nper\n100000\npeople','total\nnumber\nof\ndollar\nbillionaires','working\nhours\nper\nweek'],sorting=toggle)
def react(sorting,Attribute,Year,Continent):
        gdf4=gdf3
        ee =data.loc[data['year']==Year]
        dd = ee.rename(columns = {"code\n3": "ISO3_CODE"})
        multi = alt.selection_multi(fields=['ISO3_CODE'])
        color = alt.condition(multi,
                          alt.Color(Attribute+':Q', 
                          scale=alt.Scale(scheme='blues')),
                          alt.value('lightgray'))
        brush = alt.selection_interval()
        hover = alt.selection_single( on='mouseover',fields=['ISO3_CODE'])
        if sorting=='alphabetically':
            rank='country'
        if sorting=='descending':
            rank=alt.Y('country', sort='-x')
        if sorting=='ascending':
            rank=alt.Y('country', sort='x')       
        if Continent!='World':
            continent=dd.loc[dd['continent']==Continent]
            europa=continent.loc[:,'ISO3_CODE'].values
            mapContinent=gdf3[gdf3.ISO3_CODE.isin(europa)]
            gdf4=mapContinent
            dd=dd.loc[dd['ISO3_CODE'].isin(europa)]
            gg=dd
            region=dd['sub\nregion'].unique()
            l=list(region)
            l.insert(0,'Continent')
            
            @widgets.interact(Region=l)
            def back(Region):
                if Region!='Continent':
                    continent2=gg.loc[gg['sub\nregion']==Region]
                    europa=continent2.loc[:,'ISO3_CODE'].values
                    mapContinent=gdf3[gdf3.ISO3_CODE.isin(europa)]
                    gdf4=mapContinent
                    hh=gg.loc[dd['ISO3_CODE'].isin(europa)]
                    map = alt.Chart(gdf4).mark_geoshape(stroke='lightgray').encode(color=color,tooltip=['NAME_ENGL',Attribute+':Q']
                        ).transform_lookup(lookup='ISO3_CODE',from_=alt.LookupData(dd, 'ISO3_CODE', [Attribute])).add_selection(multi
                        ).properties(width=650,height=400
                        ).properties(title='Overview')
                    bars = alt.Chart(hh).mark_bar(size=10).encode( y=rank, x=Attribute, tooltip=[Attribute+':Q'],color=alt.Color(Attribute+':Q',scale=alt.Scale(scheme='blues'))
                                    ).add_selection(multi).transform_filter(multi).properties(title='Countries')
                    text = alt.Chart(hh).mark_text(size=10, align='left', baseline='middle', dx=3  
                            ).encode( y=rank, x=Attribute,text=Attribute+':Q'
                            ).transform_filter(multi)
                    return  map&(bars+text)
                if Region=='Continent':
                    continent=gg.loc[gg['continent']==Continent]
                    europa=continent.loc[:,'ISO3_CODE'].values
                    mapContinent=gdf3[gdf3.ISO3_CODE.isin(europa)]
                    gdf4=mapContinent
                    jj=gg.loc[dd['ISO3_CODE'].isin(europa)]
                    region=dd['sub\nregion'].unique()
                    l=list(region)
                    l.insert(0,'Continent')
                    map = alt.Chart(gdf4).mark_geoshape(stroke='white'
                    ).encode(color=color,tooltip=['NAME_ENGL',Attribute+':Q']
                    ).transform_lookup(lookup='ISO3_CODE',from_=alt.LookupData(dd, 'ISO3_CODE', [Attribute])
                    ).add_selection( multi
                        ).properties( width=650, height=400
                    ).properties(title='Overview')
                    bars = alt.Chart(jj).mark_bar(size=10).encode( y=rank, x=Attribute, tooltip=[Attribute+':Q'],color=alt.Color(Attribute+':Q',scale=alt.Scale(scheme='blues'))
                                    ).add_selection( multi
                                    ).transform_filter(multi
                                    ).properties(title='Countries')
                    text = alt.Chart(jj).mark_text(size=10, align='left', baseline='middle',dx=3  
                            ).encode( y=rank, x=Attribute,text=Attribute+':Q'
                            ).transform_filter(multi )
                    return  map&(bars+text)
        if Continent=='World':
            map = alt.Chart(gdf4).mark_geoshape(stroke='white'
                        ).encode(color=color, tooltip=['NAME_ENGL',Attribute+':Q']
                        ).transform_lookup(lookup='ISO3_CODE',from_=alt.LookupData(dd, 'ISO3_CODE', [Attribute])
                        ).add_selection( multi
                            ).properties( width=650,height=400
                        ).properties(title='Overview')
            bars = alt.Chart(dd).mark_bar(size=10).encode( y=rank, x=Attribute, tooltip=[Attribute+':Q'],color=alt.Color(Attribute+':Q',scale=alt.Scale(scheme='blues'))
                            ).add_selection( multi
                            ).transform_filter(multi
                        ).properties(title='Countries')
            text = alt.Chart(dd).mark_text(size=10,align='left',baseline='middle', dx=3  
                    ).encode(y=rank, x=Attribute,text=Attribute+':Q'
                    ).transform_filter(multi)
            return  map&(bars+text)
       
            

Please choose desired sorting, data, year and region
For comparing individual countries: shift-button + mouse-click on desired countries


interactive(children=(RadioButtons(description='Bar sorting', options=('alphabetically', 'descending', 'ascend…

### Interpretation

The world map gives an overview of a data series. Color intensity encodes the value of the selected attribute, so that differences between countries become visible. The data series is selected with the dropdown menu and the year with the slider. Individual continents or regions can be selected with additional dropdown menus. The exact values can be read from the bar plot below the map. The user can select several countries on the map (shift + click) to compare them in the bar plot. For example, after selecting GDP growth and clicking Austria, Spain and South Africa on the map, the bar plot shows the growth rates of these three countries side by side, which makes the differences between them easy to see.

In [25]:
import matplotlib.pyplot as plt
import seaborn as sns

print('Please choose desired data, year and granularity')
@widgets.interact(Year=(1900,2020),Attribute=['gdp\ngrowth','children\nper\nwoman\ntotal\nfertility','co2\nemissions\ntonnes\nper\nperson','mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears','average\nage\nof\ndollar\nbillionaires\nyears','hourly\ncompensation','suicide\nper\n100000\npeople','working\nhours\nper\nweek'], Resolution=(5, 25, 5))
def react(Attribute,Year,Resolution):
    ind = Attribute
    # rows of the selected year that have a value for the chosen attribute
    con2 = data.loc[data['year'] == Year].dropna(subset=[ind])
    if con2.empty:
        print('no data')
        return

    plt.figure(figsize=(20,10))
    # left: one box plot per continent
    plt.subplot(121)
    sns.boxplot(x='continent', y=ind, data=con2)
    plt.title("Box plots per Continent", size=24)
    # right: histogram with density estimate over all countries
    plt.subplot(122)
    sns.distplot(con2[ind], bins=Resolution)
    plt.title("Density plot world", size=24)
    

Please choose desired data, year and granularity


interactive(children=(Dropdown(description='Attribute', options=('gdp\ngrowth', 'children\nper\nwoman\ntotal\n…

### Interpretation


The box plots give an overview of the distribution of a specific data series. The continents are shown separately so that they can be compared. The density plot on the right shows the distribution of the data series worldwide. The data series is selected with a dropdown menu, the year with a slider, and the resolution (number of bins) of the density plot with a second slider. For example, if 'children per woman total fertility' is selected for 2009, the box plots show a clear difference between Europe and Africa: fertility is much lower in Europe than in Africa. This suggests that, leaving migration aside, the population tends to shrink in Europe and to grow in Africa. These values can be related to the density plot, where Europe corresponds to the first peak and Africa to the second. Moving the time slider from 1900 to the present shows that the high birth rates keep declining. If this trend continues, the reproductive rate may reach an equilibrium within a few decades.

### CORRELATIONS.

Analyzing our dataset by looking at correlations between attributes (dimensions) and interpreting why and in which way specific attributes are correlated.
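
Besides the visual inspection in this section, correlations can also be quantified. The sketch below computes Pearson and Spearman correlation matrices with `DataFrame.corr` on invented values; the column names are shortened placeholders, not the real column names of the merged dataset.

```python
import pandas as pd

# Invented values for five hypothetical countries in one year
toy = pd.DataFrame({
    "income": [1000, 2000, 4000, 8000, 16000],
    "co2": [0.1, 0.3, 0.9, 2.0, 4.5],
    "fertility": [6.5, 5.0, 3.5, 2.2, 1.6],
})

# Pearson measures linear association
pearson = toy.corr(method="pearson")

# Spearman works on ranks and captures any monotonic relationship
spearman = toy.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

Spearman is often the more robust choice for attributes such as income, whose distribution is strongly skewed.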

In [23]:
print('Please choose desired attributes and a country')
print('Mouse: panning and zooming')
@widgets.interact(Year=(1900,2020),Country=data['country'].unique(), Attribute1=['gdp\ngrowth','children\nper\nwoman\ntotal\nfertility','co2\nemissions\ntonnes\nper\nperson','mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears','average\nage\nof\ndollar\nbillionaires\nyears','food\nsupply','hourly\ncompensation','income\nper\nperson','suicide\nper\n100000\npeople','total\nnumber\nof\ndollar\nbillionaires','working\nhours\nper\nweek'], Attribute2=['co2\nemissions\ntonnes\nper\nperson','gdp\ngrowth','children\nper\nwoman\ntotal\nfertility','mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears','average\nage\nof\ndollar\nbillionaires\nyears','food\nsupply','hourly\ncompensation','income\nper\nperson','suicide\nper\n100000\npeople','total\nnumber\nof\ndollar\nbillionaires','working\nhours\nper\nweek'], Attribute3=['income\nper\nperson','gdp\ngrowth','children\nper\nwoman\ntotal\nfertility','co2\nemissions\ntonnes\nper\nperson','mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears','average\nage\nof\ndollar\nbillionaires\nyears','food\nsupply','hourly\ncompensation','suicide\nper\n100000\npeople','total\nnumber\nof\ndollar\nbillionaires','working\nhours\nper\nweek'])
def react(Country,Attribute1,Attribute2,Attribute3, Year):
    slider=Year
    country=Country
    con11=data.loc[data['country']==country]
    con22=con11.loc[data['year'].between(slider-10, slider+10)]
    con=data
    con2=con.loc[data['year']==slider]
    ind=Attribute1
    ind2=Attribute2
    ind3=Attribute3
    bubble= alt.Chart(con2).mark_circle().encode(
                x=alt.X(ind),
                y=ind2,
                color='continent',
                size=ind3,
                tooltip=['country',ind+':Q',ind2+':Q', ind3+':Q']
            ).properties(title='Attributes 1-3 \n(1:x-Axis 2:y-Axis 3: Bubble)').interactive() 
    text= ( alt.Chart(con11.loc[data['year']==slider])
        .mark_text(dy=-5)
        .encode(x=alt.X(ind), y=ind2, text=alt.Text("country:N")))          
    return (bubble+text)

Please choose desired attributes and a country
Mouse: panning and zooming


interactive(children=(Dropdown(description='Country', options=('Afghanistan', 'Albania', 'Algeria', 'Andorra',…

### Interpretation

We opted for a bubble chart, i.e. a scatterplot with an additional size channel, because it lets us focus on up to three attributes at a time and analyze whether they are correlated and whether there are trends, outliers or other features. We use points as marks and color, size and position as channels. Every point represents a country and each color a continent. The size encodes the magnitude of the third attribute. The position on the x- and y-axis encodes the first two attributes and shows how the countries relate to each other.
With interactive widgets (dropdowns and a slider) the user can choose the attributes to analyze and the year to look at or slide through. Zooming and panning add further interactivity.

Looking at income per person, food supply and CO2 emissions, we found that several Asian countries that receive little media attention (Brunei, Kuwait, Saudi Arabia) already had a high income per person in the early 1960s and kept this status into the 2000s. The United Arab Emirates caught up at the end of the 1960s (with the formation of the UAE and the use of its natural resources) and overtook them within a couple of years. As expected, calorie intake and CO2 emissions keep increasing in those countries, which is plausible because a higher income leaves more money to spend on consumption.
African countries, except for some nations in the north (Egypt, Tunisia, Morocco, which have promoted their tourism sector from the 1970s onward), increase their income per person, food supply and CO2 emissions rather slowly, and more slowly than the other continents.

Contrary to the hypothesis that income per person and working hours would have an impact on the suicide rate, the scatterplot matrix shows no correlation; otherwise the number of suicides should increase with rising working hours or falling income per person. This suggests that money is not a strong enough reason for people to end their lives.
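
A visual impression of "no correlation" can be backed up with a number. The following is a minimal sketch, assuming the two attributes are available as plain numeric sequences; the helper `pearson_r` and the toy values are hypothetical and are not taken from the Gapminder data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences, skipping pairs with NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))  # keep only complete pairs
    return float(np.corrcoef(x[keep], y[keep])[0, 1])

# Toy values, invented for illustration only (NOT Gapminder data).
working_hours = [30.0, 35.0, 40.0, 45.0, np.nan, 50.0]
suicide_rate = [12.0, 9.0, 14.0, 10.0, 11.0, 12.0]

r = pearson_r(working_hours, suicide_rate)
print(f"Pearson r = {r:.2f}")
```

A value of r close to 0 would be consistent with the reading above, while values near -1 or 1 would contradict it. Note that r only measures linear association.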

Interestingly, calorie intake is also connected to the mean years in school, which can be explained by the implicit relationship between calorie intake and income per person. If the basic needs in Maslow's hierarchy of needs (food, shelter, clothing) are met, money can be spent on higher needs like education.


### Clustering

Cluster similar items and show the clustering results.
1. The user can interactively select the clustering algorithm and/or its parameters.
2. The user can select one or more clusters from the resulting visualization.
3. The user can see the selected data with its cluster affiliation in a second interactive visualization (simple overview+detail visualization setup).



In [24]:
print('Short description of the different buttons:')
print('  Choose the year you are interested in on the slider.')
print('  The ticked boxes are the attributes that are pulled into consideration when using clustering algorithms.')
print('  Feel free to untick whatever box you want and look at how the plots are changing')
print('  You can choose the desired clustering algorithm in the dropdown menu at \'clustering_algorithm\'.')
print('  Below that you can choose 2 interesting attributes that you want to get greater insight.')

@widgets.interact(year = (1970,2013), children_per_woman_total_fertility=True,
          gdp_growth=True, co2_emissions_tonnes_per_person=True,
          income_per_person=True, food_supply=True, mean_years_in_school_women_percent_men_25_to_34_years=True,
          clustering_algorithm=['pca','mds','isomap','tsne','umap'], 
          first_interesting_component=['children_per_woman_total_fertility',
                                       'gdp_growth','co2_emissions_tonnes_per_person',
                                       'income_per_person', 'food_supply', 
                                       'mean_years_in_school_women_percent_men_25_to_34_years'],
          second_interesting_component=['mean_years_in_school_women_percent_men_25_to_34_years',
                                        'gdp_growth', 'children_per_woman_total_fertility',
                                        'co2_emissions_tonnes_per_person',
                                        'income_per_person', 'food_supply'])

def plot_education_gender_ratio(year, children_per_woman_total_fertility,
                                gdp_growth,co2_emissions_tonnes_per_person,
                                income_per_person,food_supply,
                                mean_years_in_school_women_percent_men_25_to_34_years, clustering_algorithm,
                                first_interesting_component, second_interesting_component):
    checked_data = ['children\nper\nwoman\ntotal\nfertility', 'gdp\ngrowth',
                    'co2\nemissions\ntonnes\nper\nperson',
                    'income\nper\nperson', 'food\nsupply',
                    'mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears']
    
    new_data_condensed = df_data[checked_data + ['year']]
    ind = {'recognice_by_number': list(range(len(df_data['year'])))}
    ind = pd.DataFrame(data=ind)
    
    new_data_condensed = new_data_condensed.join(ind) #add index at end
    new_data_condensed = new_data_condensed[
        new_data_condensed.replace([np.inf, -np.inf], np.nan).notnull().all(axis=1)]
    
    #create df which contains continents and a row index to match rows when merging
    df_continents_of_year = df_data[['year','continent','country']]
    df_continents_of_year = df_continents_of_year.join(ind)
    
    #drop rows that belong to other years
    new_data_condensed = new_data_condensed[new_data_condensed.year == year]
    df_continents_of_year = df_continents_of_year[df_continents_of_year.year == year]
    
    #create colors for plotting and use 'year' column as placeholder
    colors = ['yellow','red','black','blue','green']
    for index, cont in enumerate(['Asia','Europe','Africa', 'Oceania', 'Americas']):
        df_continents_of_year.loc[df_continents_of_year.continent == cont, 'year'] = colors[index]
    #change 'year' column as placeholder to 'color'
    df_continents_of_year = df_continents_of_year.rename(columns={'year': 'color'})
    
    #create merged df
    df_merged = new_data_condensed.merge(df_continents_of_year,
                                         on='recognice_by_number', how='left')
    #replace any remaining missing values with 0 (fillna returns a new frame, so assign it)
    new_data_condensed = new_data_condensed.fillna(0)
    
    #drop unwanted data
    if not children_per_woman_total_fertility:
        new_data_condensed = new_data_condensed.drop(columns=['children\nper\nwoman\ntotal\nfertility'])
    if not gdp_growth:
        new_data_condensed = new_data_condensed.drop(columns=['gdp\ngrowth'])
    if not co2_emissions_tonnes_per_person:
        new_data_condensed = new_data_condensed.drop(columns=['co2\nemissions\ntonnes\nper\nperson'])
    if not income_per_person:
        new_data_condensed = new_data_condensed.drop(columns=['income\nper\nperson'])
    if not food_supply:
        new_data_condensed = new_data_condensed.drop(columns=['food\nsupply'])
    if not mean_years_in_school_women_percent_men_25_to_34_years:
        new_data_condensed = new_data_condensed.drop(columns=[
            'mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears'])
        
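    #scale data; the helper columns 'year' and 'recognice_by_number' are not attributes,
    #so they are excluded from the projection
    features = new_data_condensed.drop(columns=['year', 'recognice_by_number'])
    scaled_data = StandardScaler().fit_transform(features)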
    
    if clustering_algorithm == 'pca':
        pca = PCA(n_components=2)
        pca.fit(scaled_data)
        downprojected_data = pca.transform(scaled_data)
        
    elif clustering_algorithm == 'mds':
        #use MDS here (not PCA); MDS has no separate transform step, so fit and project in one call
        downprojected_data = MDS(n_components=2).fit_transform(scaled_data)
    
    elif clustering_algorithm == 'tsne':
        downprojected_data = TSNE(n_components=2).fit_transform(scaled_data)
        
    elif clustering_algorithm == 'isomap':
        downprojected_data = Isomap(n_components=2).fit_transform(scaled_data)
    
    elif clustering_algorithm == 'umap':
        #scaled_data is already standardized, so it is passed to UMAP directly
        downprojected_data = umap.UMAP().fit_transform(scaled_data)
    
    #add the downprojected data to the merged data
    df_downprojected = pd.DataFrame(downprojected_data, columns=["first_component", "second_component"])
    df_merged['first_component'] = df_downprojected['first_component']
    df_merged['second_component'] = df_downprojected['second_component']
    
    first_interesting_component = first_interesting_component.replace('_', "\n")
    second_interesting_component = second_interesting_component.replace('_', "\n")
    
    brush = alt.selection_interval()
    chart_downprojected = alt.Chart(df_merged).mark_point().encode(
        x='first_component',
        y='second_component',
        color=alt.condition(brush, 'continent:N', alt.value('lightgray')),
        tooltip=['country', 'children\nper\nwoman\ntotal\nfertility', 'gdp\ngrowth',
                'co2\nemissions\ntonnes\nper\nperson', 'income\nper\nperson', 'food\nsupply',
                'mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears']
    ).add_selection(
        brush
    )
    chart_scatter = alt.Chart(df_merged).mark_point().encode(
        x=first_interesting_component,
        y=second_interesting_component,
        color=alt.condition(brush, 'continent:N', alt.value('lightgray')),
        tooltip=['country', 'children\nper\nwoman\ntotal\nfertility', 'gdp\ngrowth',
                'co2\nemissions\ntonnes\nper\nperson', 'income\nper\nperson', 'food\nsupply',
                'mean\nyears\nin\nschool\nwomen\npercent\nmen\n25\nto\n34\nyears']
    ).add_selection(
        brush
    )
    bars = alt.Chart(df_merged).mark_bar().encode(
        #df_merged has no 'Origin' column; count the selected points per continent instead
        y='continent:N',
        color='continent:N',
        x='count():Q'
    ).transform_filter(
        brush
    )
    histogram1 = alt.Chart(df_merged).mark_bar().encode(
        x = alt.X(second_interesting_component, bin=True),
        y='count()',
        color='continent'
    ).transform_filter(
        brush
    )
    histogram2 = alt.Chart(df_merged).mark_bar().encode(
        x = alt.X(first_interesting_component, bin=True),
        y='count()',
        color='continent'
    ).transform_filter(
        brush
    )
    
    altair_viewer.display((chart_downprojected|chart_scatter)&bars&(histogram1|histogram2))
    print('\n\n  A new window is open now and you can inspect the plots there.\n\n\n')

Short description of the different buttons:
  Choose the year you are interested in on the slider.
  The ticked boxes are the attributes that are pulled into consideration when using clustering algorithms.
  Feel free to untick whatever box you want and look at how the plots are changing
  You can choose the desired clustering algorithm in the dropdown menu at 'clustering_algorithm'.
  Below that you can choose 2 interesting attributes that you want to get greater insight.


interactive(children=(IntSlider(value=1991, description='year', max=2013, min=1970), Checkbox(value=True, desc…

### Interpretation
Here we address three crucial questions we asked ourselves as we embarked on the final part of our project.
##### What benefit can we provide to the user by using clustering algorithms?
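The high-dimensional data contains a lot of information that can be projected down to two dimensions. We can then visualize the projected data and see which clusters occur. Strictly speaking, the algorithms offered here (PCA, MDS, Isomap, t-SNE, UMAP) are dimensionality reduction methods, and the clusters are identified visually in the projection. Different algorithms give different results.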
##### Why use clustering algorithms at all?
A 2D projection of the data contains more information than a scatterplot of just two attributes, because every selected attribute contributes to the position of a point. This gives us the opportunity to observe clusters. Adding interactivity to the plots can make exploring them a better user experience.
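
One way to make the "more information" claim concrete is to compare captured variance. The following is a minimal sketch, assuming variance is an acceptable proxy for information and using synthetic random data (not the Gapminder attributes). It runs PCA through an SVD of the standardized data, which is equivalent to the `StandardScaler` plus `PCA` steps used in this notebook. The argument applies to PCA only; t-SNE, Isomap and UMAP do not come with such a variance guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for six attributes of 200 "countries".
# The values are random and NOT taken from the Gapminder data.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 6))

# Standardize each column: zero mean, unit variance (as StandardScaler does).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: squared singular values are proportional to the variance per component.
_, s, _ = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)

share_two_components = float(explained[:2].sum())
# After standardization every attribute carries the same share of the total variance.
share_two_attributes = 2 / Z.shape[1]

print(f"first two principal components: {share_two_components:.2f}")
print(f"any two raw attributes:         {share_two_attributes:.2f}")
```

For standardized data the first two principal components always capture at least as much variance as any two original attributes, which is the sense in which the projection is more informative.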
##### How can we provide the best user experience?
- By letting the user select attributes of interest themselves.
- By offering a combination of downprojected data and data that the user understands more intuitively, such as scatterplots and histograms.
- By allowing the user to select data in plots and then explore how that selection changes other plots.
- By making the process intuitive.
- By keeping it short and simple (KISS).
- By remembering that sometimes less is more.

##### Explanation of plots:
- 5 plots: The top left plot shows the result of the dimensionality reduction algorithm. The top right plot is a scatterplot of the two attributes selected in the dropdown menus. Below them is a bar chart that shows how many data points lie in the selected range, by continent. The bottom two charts are histograms of the two selected attributes.
- How it works: The user can select an area in either of the two scatterplots, and the other charts update to the selected range. The selection can also be moved and resized. Hovering over a data point shows the country and its six attribute values. This works in both scatterplots.