# Assignment 4

Before working on this assignment, please read these instructions fully. In the submission area, you will notice that you can click the link to **Preview the Grading** for each step of the assignment. These are the criteria that will be used for peer grading. Please familiarize yourself with the criteria before beginning the assignment.

This assignment requires you to find **at least** two related datasets on the web, and to visualize these datasets to answer a question within the broad topic of **sports or athletics** (see below) for the region of **Warsaw, Mazovia, Poland**, or **Poland** more broadly.

You can merge these datasets with data from different regions if you like! For instance, you might want to compare **Warsaw, Mazovia, Poland** to Ann Arbor, USA. In that case at least one source file must be about **Warsaw, Mazovia, Poland**.

You are welcome to choose datasets at your discretion, but keep in mind **they will be shared with your peers**, so choose appropriate datasets. Sensitive, confidential, illicit, and proprietary materials are not good choices for datasets for this assignment. You are welcome to upload datasets of your own as well, and link to them using a third-party repository such as GitHub, Bitbucket, Pastebin, etc. Please be aware of the Coursera terms of service with respect to intellectual property.

Also, you are welcome to preserve data in its original language, but for the purposes of grading you should provide English translations. You are welcome to provide multiple visuals in different languages if you would like!

As this assignment is for the whole course, you must incorporate principles discussed in the first week, such as having a high data-ink ratio (Tufte) and aligning with Cairo’s principles of truth, beauty, function, and insight.

Here are the assignment instructions:

 * State the region and the domain category that your data sets are about (e.g., **Warsaw, Mazovia, Poland** and **sports or athletics**).
 * You must state a question about the domain category and region that you identified as being interesting.
 * You must provide at least two links to available datasets. These could be links to files such as CSV or Excel files, or links to websites which might have data in tabular form, such as Wikipedia pages.
 * You must upload an image which addresses the research question you stated. In addition to addressing the question, this visual should follow Cairo's principles of truthfulness, functionality, beauty, and insightfulness.
 * You must contribute a short (1-2 paragraph) written justification of how your visualization addresses your stated research question.

What do we mean by **sports or athletics**? For this category we are interested in sporting events or athletics broadly. Please feel free to creatively interpret the category when building your research question!

## Tips
* Wikipedia is an excellent source of data, and I strongly encourage you to explore it for new data sources.
* Many governments run open data initiatives at the city, region, and country levels, and these are wonderful resources for localized data sources.
* Several international agencies, such as the [United Nations](http://data.un.org/), the [World Bank](http://data.worldbank.org/), and the [Global Open Data Index](http://index.okfn.org/place/), are other great places to look for data.
* This assignment requires you to convert and clean data files. Check out the discussion forums for tips on how to do this from various sources, and share your successes with your fellow students!

## Example
Looking for an example? Here's what our course assistant put together for the **Ann Arbor, MI, USA** area using **sports and athletics** as the topic. [Example Solution File](./readonly/Assignment4_example.pdf)

<hr>

# Becoming an independent Data Scientist - final assignment

### Identifying the problem - Sports activities in Poland - which regions are more sporty than others?

When thinking of an assignment that would cover the **sports and athletics** topic, I browsed through what the Polish Stat Office had at its disposal. I found that they had statistics on people officially registered as practising sports - in sports clubs, communities, league teams and school clubs.<br><br>
More importantly, the data could be broken down to the level of a town or commune. That was tempting, but it would probably be hard to visualize and to grasp, so I decided to stop at the poviat level. Poviats make up larger structures, the voivodships, and a poviat is either a group of communes or a single city (usually a large one) acting as a poviat on its own.<br><br>
There are 380 poviats in Poland altogether, including the cities. I found that since 2002 the Stat Office has published these statistics every second year, so we have eight periods of observation (2002-2016). Now the question is - **are there some particularly sporty areas in Poland?**<br><br>
*Be my guest and check that for yourself! :)*

### Combining datasets - Sports trends in Poland

I pull information from three different datasets:
  1. The list of Polish regions (voivodships and poviats) along with their population - Wikipedia
  2. The data on people practising sports by voivodships and poviats in 2002-2016 - Główny Urząd Statystyczny (Polish Stat Office)
  3. Geolocation data - Google API

What the code does:
  1. It retrieves a list of Polish poviats and their population counts and saves it.
  2. It then iteratively queries the Google geolocation API for the longitude/latitude of each poviat and stores those coordinates.
  3. Then, it loads the data on people practising sports per poviat.
  4. Finally, it displays a scatter plot with the longitude and latitude of poviats as x and y and the population count as the bubble size.
  5. The shade of a bubble depends on the share of people practising sports in the total population of the poviat.
  6. It then allows the user to change the year of the scatter plot with a slider.

### 1. Get the data on sports activities from Statistics Poland

The Polish statistical office - Statistics Poland (www.stat.gov.pl) - publishes its data on its platform Bank Danych Lokalnych (Local Data Bank): https://bdl.stat.gov.pl/BDL/start
The JavaScript-based platform lets users build datasets tailored to their needs. However, for my assignment the export still needed some minor cleaning afterwards.
I downloaded it to the local file KULT_2155_XTAB_20180331131428.xlsx and left it intact.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

In [2]:
# Load the data procured from Statistics Poland (www.stat.gov.pl); a copy of the downloaded file is on my GitHub
yrs = [str(2002+i*2) for i in range(8)] # '2002'-'2016'

import io
import urllib.request

url = 'https://github.com/kuba-siekierzynski/Applied-Data-Science-Coursera/files/1866674/KULT_2155_XTAB_20180331131428.xlsx'

with urllib.request.urlopen(url) as f:
    # read the bytes fetched from the URL (not a local path); newer pandas uses sheet_name/usecols
    poviats1 = pd.read_excel(io.BytesIO(f.read()), sheet_name='TABLICA',
                             header=None, skiprows=4, usecols='A,B,F,G,H,I,J,K,L,M',
                             names=['TERYT', 'Name'] + yrs, dtype='str')

# Adding voivodship information for future disambiguation
voivodships = poviats1[poviats1['TERYT'].astype(str).str.endswith('00000')][['TERYT', 'Name']] # retrieve the voivodships
poviats1['TERYT'] = poviats1['TERYT'].str[:2]
voivodships['TERYT'] = voivodships['TERYT'].str[:2]
voivodships = voivodships.rename(columns={'Name': 'Voivodship'})
poviats1 = poviats1.merge(voivodships, how='left', on='TERYT')
poviats1['Voivodship'] = poviats1['Voivodship'].str.lower() # poviats1 now stores the voivodship in an additional column

# now some tidying up...
poviats1 = poviats1[poviats1['Name'].astype('str').str.startswith('Powiat')] # filter for regional data only
poviats1[yrs] = poviats1[yrs].replace('-', 0).astype(np.int32)

# Wałbrzych was a separate poviat until 2002 and then again since 2014, but we need it only once
poviats1.loc[310, '2002'] = poviats1.loc[307, '2002']
poviats1.drop(307, inplace=True) # obsolete - Wałbrzych until 2002
poviats1.loc[310, 'Name'] = 'Wałbrzych' # rename to the proper city name
poviats1.loc[1979, 'Name'] = 'Warszawa' # drop the 'capital city of' part
poviats1['Name'] = poviats1['Name'].str.replace(r'Powiat m\.', '', regex=True) # for cities as municipality centers - drop the redundant part (regex=True is required in newer pandas)
poviats1.drop('TERYT', axis=1, inplace=True)
poviats1

Unnamed: 0,Name,2002,2004,2006,2008,2010,2012,2014,2016,Voivodship
1,Powiat bolesławiecki,1650,1842,2129,1785,2113,1900,1847,1524,dolnośląskie
10,Powiat dzierżoniowski,1242,2262,2670,2353,2511,2570,3144,2975,dolnośląskie
23,Powiat głogowski,1369,1279,1628,1336,2098,1820,2426,2737,dolnośląskie
30,Powiat górowski,481,648,652,642,315,393,531,742,dolnośląskie
38,Powiat jaworski,669,820,929,1051,1316,1332,1328,1745,dolnośląskie
47,Powiat jeleniogórski,1027,1240,1624,1481,1135,1285,1256,1488,dolnośląskie
57,Powiat kamiennogórski,581,873,1181,1073,1082,1125,1044,1326,dolnośląskie
64,Powiat kłodzki,2827,2759,3325,3255,3613,3505,4196,4187,dolnośląskie
91,Powiat legnicki,759,615,1193,1120,1026,1348,1587,1529,dolnośląskie
102,Powiat lubański,550,659,904,838,1261,1378,1489,1405,dolnośląskie
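The voivodship lookup in the cell above treats the first two characters of the code as the voivodship prefix, and rows whose code ends in `00000` as the voivodship rows. Below is a minimal pure-Python sketch of the same idea. The sample codes and the helper name `attach_voivodship` are made up for illustration and are not taken from the downloaded file.

```python
# Illustrative rows only: (code, name) pairs shaped like the TERYT/Name columns.
rows = [
    ("0200000", "DOLNOŚLĄSKIE"),
    ("0201000", "Powiat bolesławiecki"),
    ("0202000", "Powiat dzierżoniowski"),
    ("0600000", "LUBELSKIE"),
    ("0601000", "Powiat bialski"),
]


def attach_voivodship(rows):
    """Return (name, voivodship) pairs for every non-voivodship row."""
    # Rows whose code ends in '00000' are voivodship-level rows; key them by 2-char prefix.
    voivodships = {
        code[:2]: name.lower() for code, name in rows if code.endswith("00000")
    }
    # Every other row inherits the voivodship that shares its prefix.
    return [
        (name, voivodships.get(code[:2]))
        for code, name in rows
        if not code.endswith("00000")
    ]


labelled = attach_voivodship(rows)
print(labelled)
```

Keeping the voivodship next to each poviat name is what later lets identically named poviats be told apart.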


<hr>

### 2. Get the population numbers for poviats from Wikipedia

The most accessible option is to parse the Wikipedia page on Polish poviat statistics. This should also give us the latest information, as the administrative divisions might have changed between years. The page we are looking for is: https://pl.wikipedia.org/wiki/Powiaty_w_Polsce

In [3]:
# let's get ready for some parsing
from bs4 import BeautifulSoup

In [4]:
# accessing the url and parsing it to the beautifulsoup object 'soup'
url = 'https://pl.wikipedia.org/wiki/Powiaty_w_Polsce'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
table_rows = soup.find_all('table')[2].find_all('tr') # the data is in the third table on the page

In [5]:
# collecting the data into a dictionary whose keys are poviat names and whose values are (voivodship, population) pairs
poviats2 = {}
for row in table_rows:
    pov_data = []
    pov_key = row.a['title'] # find the poviat name
    pov_voi = row.find_all('a')[1].text # find the voivodship
    pov_number = row.find_next('td').find_next_sibling('td', attrs={'class': 'tabela-liczba'}).find_next_sibling('td', attrs={'class': 'tabela-liczba'}).text
    poviats2[pov_key] = pov_voi, int(str(pov_number).replace('\xa0', ''))

# convert to DataFrame and clean
poviats2 = pd.DataFrame(poviats2)
poviats2 = poviats2.T.reset_index()
poviats2.columns = ['Name', 'Voivodship', 'Population']
poviats2.drop(35, inplace=True)
poviats2

Unnamed: 0,Name,Voivodship,Population
0,Biała Podlaska,lubelskie,57303
1,Białystok,podlaskie,296628
2,Bielsko-Biała,śląskie,172030
3,Bydgoszcz,kujawsko-pomorskie,353938
4,Bytom,śląskie,169617
5,Chełm,lubelskie,63734
6,Chorzów,śląskie,109398
7,Częstochowa,śląskie,226225
8,Dąbrowa Górnicza,śląskie,121802
9,Elbląg,warmińsko-mazurskie,121191


Some names in poviats2 DataFrame still contain the parentheses with the name of the voivodship in which the poviat is located. This is because some city names (and thus poviat names) occur multiple times throughout Poland. It is reasonable to leave them for now, as they will be used by the Google API to perfectly disambiguate the input.

In [6]:
print(poviats2[poviats2['Name'].str.endswith(')')])

                                              Name     Voivodship Population
44          Powiat bielski (województwo podlaskie)      podlaskie      56075
45            Powiat bielski (województwo śląskie)        śląskie     162926
53        Powiat brzeski (województwo małopolskie)    małopolskie      93001
54           Powiat brzeski (województwo opolskie)       opolskie      90771
92      Powiat grodziski (województwo mazowieckie)    mazowieckie      91647
93    Powiat grodziski (województwo wielkopolskie)  wielkopolskie      51423
134      Powiat krośnieński (województwo lubuskie)       lubuskie      55759
135  Powiat krośnieński (województwo podkarpackie)   podkarpackie     112399
180   Powiat nowodworski (województwo mazowieckie)    mazowieckie      79024
181     Powiat nowodworski (województwo pomorskie)      pomorskie      36018
196         Powiat opolski (województwo lubelskie)      lubelskie      60586
197          Powiat opolski (województwo opolskie)       opolskie     123839

<hr>

### 3. Getting geolocation data from Google API

Since we already have the poviat names listed, and the names that occur multiple times still carry their voivodship, let's look for geolocation coordinates. The Google API allows for 2,500 free queries per day, so we have to be careful :)

In [7]:
# even more parsing...
from urllib.parse import urlencode as ENCODE
from xml.etree import ElementTree as XML

# geo_pov will store geolocation data for poviats
geo_pov = {}

In [8]:
def localize(poviat):
    """
    Takes a poviat name, pushes it to Google API and assigns latitude and longitude to the geo_pov item
    """
    global geo_pov
    api_url = 'http://maps.googleapis.com/maps/api/geocode/xml?'
    # the location of Google's geolocation API
    url = api_url + ENCODE({'sensor': 'false', 'address': poviat + ', Poland'})
    # putting the parts together in UTF-8 format
    f = False
    while not f:
        print('Retrieving the data on:', poviat)
        data = urllib.request.urlopen(url).read()
        # getting that data
        print('Retrieved', len(data), 'characters. Parsing...')
        tree = XML.fromstring(data)
        # digging into the XML tree
        res = tree.findall('result')
        # let's see the results now
        if len(res) > 0:
            f = True
        else:
            print('Something went wrong, trying again.') # this happens a lot, as too hasty requests may cause Google to block
    lat = res[0].find('geometry').find('location').find('lat').text
    # dig into the XML tree to find 'latitude'
    lng = res[0].find('geometry').find('location').find('lng').text
    # and longitude
    lat = float(lat)
    lng = float(lng)
    geo_pov[poviat] = lat, lng # add the coordinates to the geo_pov dictionary - latitude first, longitude second
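The loop in `localize` retries immediately and without limit, which can burn through the daily quota when requests get rejected. Here is a hedged sketch of a gentler alternative: a bounded retry with exponentially growing pauses. The helper `retry_with_backoff` and its parameters are hypothetical, and the geocoding request is replaced by a stand-in so the example runs offline.

```python
import time


def retry_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty result or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between attempts.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result:  # non-empty result: done
            return result
        if attempt < max_attempts - 1:  # no point in sleeping after the last attempt
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("no result after {} attempts".format(max_attempts))


# Stand-in for the geocoding request: fails twice, then returns coordinates.
responses = iter([[], [], [(52.0387126, 23.1445026)]])
delays = []  # record the waits instead of really sleeping
result = retry_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

In the notebook, `fetch` would wrap the request and the `tree.findall('result')` call; passing `sleep` as a parameter only serves to keep the example fast.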

<b>The below might take a while, depending on your internet connection!</b>

In [9]:
# let's go!
for pov in list(poviats2['Name']):
    localize(pov)

Retrieving the data on: Biała Podlaska
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Biała Podlaska
Retrieved 1747 characters. Parsing...
Retrieving the data on: Białystok
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Białystok
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Białystok
Retrieved 1725 characters. Parsing...
Retrieving the data on: Bielsko-Biała
Retrieved 1732 characters. Parsing...
Retrieving the data on: Bydgoszcz
Retrieved 1766 characters. Parsing...
Retrieving the data on: Bytom
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Bytom
Retrieved 1710 characters. Parsing...
Retrieving the data on: Chełm
Retrieved 1702 characters. Parsing...
Retrieving the data on: Chorzów
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Chorzów
...
Retrieved 1604 characters. Parsing...
Retrieving the data on: Powiat bialski
Retrieved 1790 characters. Parsing...
Retrieving the data on: Powiat białobrzeski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat białobrzeski
Retrieved 1609 characters. Parsing...
Retrieving the data on: Powiat białogardzki
Retrieved 1617 characters. Parsing...
Retrieving the data on: Powiat białostocki
Retrieved 1398 characters. Parsing...
Retrieving the data on: Powiat bielski (województwo podlaskie)
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat bielski (województwo podlaskie)
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat bielski (województwo podlaskie)
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat bielski (województwo podlaskie)
Retrieved 334 characters. Parsing...
...
Retrieved 1434 characters. Parsing...
Retrieving the data on: Powiat inowrocławski
Retrieved 1628 characters. Parsing...
Retrieving the data on: Powiat iławski
Retrieved 1607 characters. Parsing...
Retrieving the data on: Powiat janowski
Retrieved 1614 characters. Parsing...
Retrieving the data on: Powiat jarociński
Retrieved 1606 characters. Parsing...
Retrieving the data on: Powiat jarosławski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat jarosławski
Retrieved 1608 characters. Parsing...
Retrieving the data on: Powiat jasielski
Retrieved 1599 characters. Parsing...
Retrieving the data on: Powiat jaworski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat jaworski
Retrieved 1432 characters. Parsing...
Retrieving the data on: Powiat jeleniogórski
Retrieved 1817 characters. Parsing...
Retrieving the data on: Powiat jędrzejowski
Retrieved 1426 characters. Parsing...
...
Retrieved 191 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat krapkowicki
Retrieved 191 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat krapkowicki
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat krapkowicki
Retrieved 1429 characters. Parsing...
Retrieving the data on: Powiat krasnostawski
Retrieved 1599 characters. Parsing...
Retrieving the data on: Powiat kraśnicki
Retrieved 1425 characters. Parsing...
Retrieving the data on: Powiat krotoszyński
Retrieved 1612 characters. Parsing...
Retrieving the data on: Powiat krośnieński (województwo lubuskie)
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat krośnieński (województwo lubuskie)
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat krośnieński (województwo lubuskie)
Retrieved 1623 characters. Parsing...
...
Retrieved 1607 characters. Parsing...
Retrieving the data on: Powiat oleski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat oleski
Retrieved 1417 characters. Parsing...
Retrieving the data on: Powiat oleśnicki
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat oleśnicki
Retrieved 1444 characters. Parsing...
Retrieving the data on: Powiat olkuski
Retrieved 1433 characters. Parsing...
Retrieving the data on: Powiat olsztyński
Retrieved 1765 characters. Parsing...
Retrieving the data on: Powiat opatowski
Retrieved 1582 characters. Parsing...
Retrieving the data on: Powiat opoczyński
Retrieved 1424 characters. Parsing...
Retrieving the data on: Powiat opolski (województwo lubelskie)
Retrieved 1446 characters. Parsing...
Retrieving the data on: Powiat opolski (województwo opolskie)
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
...
Retrieved 1435 characters. Parsing...
Retrieving the data on: Powiat rawski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat rawski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat rawski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat rawski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat rawski
Retrieved 1415 characters. Parsing...
Retrieving the data on: Powiat ropczycko-sędziszowski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat ropczycko-sędziszowski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat ropczycko-sędziszowski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat ropczycko-sędziszowski
...
Retrieved 1795 characters. Parsing...
Retrieving the data on: Powiat tarnogórski
Retrieved 1453 characters. Parsing...
Retrieving the data on: Powiat tarnowski
Retrieved 1756 characters. Parsing...
Retrieving the data on: Powiat tatrzański
Retrieved 1430 characters. Parsing...
Retrieving the data on: Powiat tczewski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat tczewski
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat tczewski
Retrieved 1424 characters. Parsing...
Retrieving the data on: Powiat tomaszowski (województwo lubelskie)
Retrieved 1455 characters. Parsing...
Retrieving the data on: Powiat tomaszowski (województwo łódzkie)
Retrieved 1631 characters. Parsing...
Retrieving the data on: Powiat toruński
Retrieved 1782 characters. Parsing...
Retrieving the data on: Powiat trzebnicki
Retrieved 1612 characters. Parsing...
Retrieving the data on: Powiat tucholski
...
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat średzki (województwo dolnośląskie)
Retrieved 1462 characters. Parsing...
Retrieving the data on: Powiat średzki (województwo wielkopolskie)
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat średzki (województwo wielkopolskie)
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat średzki (województwo wielkopolskie)
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat średzki (województwo wielkopolskie)
Retrieved 1642 characters. Parsing...
Retrieving the data on: Powiat śremski
Retrieved 1600 characters. Parsing...
Retrieving the data on: Powiat świdnicki (województwo dolnośląskie)
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Powiat świdnicki (województwo dolnośląskie)
...
Retrieved 1744 characters. Parsing...
Retrieving the data on: Siedlce
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Siedlce
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Siedlce
Retrieved 1734 characters. Parsing...
Retrieving the data on: Siemianowice Śląskie
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Siemianowice Śląskie
Retrieved 1789 characters. Parsing...
Retrieving the data on: Skierniewice
Retrieved 1737 characters. Parsing...
Retrieving the data on: Sopot
Retrieved 1714 characters. Parsing...
Retrieving the data on: Sosnowiec
Retrieved 1730 characters. Parsing...
Retrieving the data on: Suwałki
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Suwałki
Retrieved 334 characters. Parsing...
Something went wrong, trying again.
Retrieving the data on: Suwałki
Retrieved 1738 characters. Parsing...
...

In [10]:
geo_pov # a dictionary of poviats and their geolocation coordinates (polygon-approximated)

{'Biała Podlaska': (52.0387126, 23.1445026),
 'Białystok': (53.1324886, 23.1688403),
 'Bielsko-Biała': (49.8223768, 19.0583845),
 'Bydgoszcz': (53.1234804, 18.0084378),
 'Bytom': (50.3483816, 18.9157176),
 'Chełm': (51.1431232, 23.4711986),
 'Chorzów': (50.2974884, 18.9545728),
 'Częstochowa': (50.8118195, 19.1203094),
 'Dąbrowa Górnicza': (50.3216897, 19.1949126),
 'Elbląg': (54.1560613, 19.4044897),
 'Gdańsk': (54.3520252, 18.6466384),
 'Gdynia': (54.5188898, 18.5305409),
 'Gliwice': (50.2944923, 18.6713802),
 'Gorzów Wielkopolski': (52.7325285, 15.2369305),
 'Grudziądz': (53.4837486, 18.7535649),
 'Jastrzębie-Zdrój': (49.9454207, 18.6101103),
 'Jaworzno': (50.204987, 19.2739314),
 'Jelenia Góra': (50.9044171, 15.7193616),
 'Kalisz': (51.7672799, 18.0853462),
 'Katowice': (50.2648919, 19.0237815),
 'Kielce': (50.8660773, 20.6285676),
 'Konin': (52.2230334, 18.251073),
 'Koszalin': (54.1943219, 16.1714908),
 'Kraków': (50.0646501, 19.9449799),
 'Krosno': (49.6824761, 21.7660531),
 ...}

In [11]:
# we need the third DataFrame - poviat names and geolocation
poviats3 = pd.DataFrame(geo_pov).T
poviats3 = poviats3.reset_index()
poviats3.columns = ['Name', 'Latitude', 'Longitude']

<hr>

### 4. Putting it all together

We now have three datasets from three different sources:<br>
* <b>poviats1</b> with the data on sports activity (loaded from a local file procured earlier at Statistics Poland),<br>
* <b>poviats2</b> with the data on population and<br>
* <b>poviats3</b> with geolocation coordinates of the poviats. We will merge them in two steps, as they carry different poviat naming (see above).

In [12]:
# combining three datasets into one and preparing the datatypes for plotting
poviats = poviats2.merge(poviats3, how='inner', on='Name')
poviats['Name'] = poviats['Name'].str.replace(r' \(.*\)', '', regex=True) # drop the '(województwo ...)' suffix; regex=True is required in newer pandas
poviats = poviats.merge(poviats1, how='inner', on=['Name', 'Voivodship'])
poviats['Population'] = poviats['Population'].astype('int') # we need it as a number now
for yr in yrs:
    poviats[yr] /= poviats['Population'] # we actually need the share of the population (a fraction), not the raw count
# poviats
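An inner merge silently drops every row that has no partner in the other table, so it is worth checking what falls out. Below is a small sketch of such a check with the `indicator` option of `DataFrame.merge`. The two toy tables only imitate the shape of the real ones, and the sports counts in them are invented.

```python
import pandas as pd

# Toy stand-ins for the population table and the sports table (sports counts are invented).
population = pd.DataFrame({
    "Name": ["Powiat bielski", "Powiat bielski", "Białystok"],
    "Voivodship": ["podlaskie", "śląskie", "podlaskie"],
    "Population": [56075, 162926, 296628],
})
sports = pd.DataFrame({
    "Name": ["Powiat bielski", "Powiat bielski", "Wałbrzych"],
    "Voivodship": ["podlaskie", "śląskie", "dolnośląskie"],
    "2016": [1200, 5400, 2100],
})

# An outer merge with indicator=True labels each row 'both', 'left_only' or 'right_only'.
check = population.merge(sports, how="outer", on=["Name", "Voivodship"], indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["Name", "Voivodship", "_merge"]]
print(unmatched)

# Only the matched rows are kept for plotting; the share is count / population.
matched = check[check["_merge"] == "both"].copy()
matched["share_2016"] = matched["2016"] / matched["Population"]
print(matched[["Name", "Voivodship", "share_2016"]])
```

Merging on both `Name` and `Voivodship` keeps the two `Powiat bielski` rows apart; merging on `Name` alone would pair each of them with both counterparts.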

### Visualization

OK, we have everything ready. Now, let's amaze the world :)

A brief description of the chart:
   1. It plots a scatterplot with the blobs representing poviats, laid out geographically
   2. The size (area) of the blob represents the poviat population
   3. The blob's shade represents the share of people practising sports: blue shades mark the low end and red shades the high end of the values observed in the selected year
   4. This points to the regions that are particularly sporty throughout the years
   5. The slider allows you to dynamically change the year of observation
    

In [13]:
# making the colors and color map representative, adding a slider for the user to play with
cmap = plt.cm.bwr
from matplotlib.widgets import Slider

yr = 2016 # default year

fig, ax = plt.subplots()
plt.subplots_adjust(left=0.25, bottom=0.25)

plt.title('POLAND: Share of people active in sports per poviat in {:d}'.format(yr), fontdict={'fontsize': 12})
blobplot = plt.scatter(poviats['Longitude'], poviats['Latitude'], s=poviats['Population']/(500), cmap=cmap, c=poviats[str(int(yr))], alpha=0.5)
_ = plt.scatter([], [], s=0)
blobs = blobplot.get_children()

plt.axis('off')
plt.legend(handles=[blobplot, _],
           labels=['Bubble area shows poviat population', 'Color marks sports activity\nBlue is low, red is high'],
           markerscale=0.1, frameon=None, loc=0, bbox_to_anchor=(0.5, 0.08), prop={'size': 6})

# building the slider
year_ax = plt.axes([0.28, 0.1, 0.6, 0.02], frameon=False)
year = Slider(year_ax, 'Year', 2002, 2016, valfmt='%d', valinit=2016, valstep=2, color='red', alpha=0.5)

# the update function will redraw the plot each time the slider changes position
def update(val):
    yr = int(year.val)
    plt.sca(ax)
    plt.cla()
    plt.axis('off')
    plt.gca().set_title('POLAND: Share of people active in sports per poviat in {:d}'.format(yr)) # update the title with the selected year
    blobplot = plt.scatter(poviats['Longitude'], poviats['Latitude'], s=poviats['Population']/(500), cmap=cmap, c=poviats[str(int(yr))], alpha=0.5)
    plt.legend(handles=[blobplot, _],
               labels=['Bubble area shows poviat population', 'Color marks sports activity\nBlue is low, red is high'],
               markerscale=0.1, frameon=None, loc=0, bbox_to_anchor=(0.5, 0.08), prop={'size': 6})
    fig.canvas.draw_idle()

# that sets the trigger
year.on_changed(update)
plt.show()

[Interactive figure: bubble map "POLAND: Share of people active in sports per poviat in 2016" with a Year slider (2002-2016)]
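One note on the colors: with default settings, `scatter` scales the values linearly between the smallest and the largest one, so the white midpoint of the `bwr` colormap sits halfway between them, which need not be the mean. Here is a minimal pure-Python sketch of a normalization that pins the mean to the midpoint instead. The function names and the sample shares are made up for illustration. Matplotlib ships a comparable piecewise-linear normalization as `matplotlib.colors.TwoSlopeNorm`, which can be passed to `scatter` through its `norm` argument.

```python
def linear_norm(values):
    """Default behaviour: scale linearly between min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def centered_norm(values):
    """Map values to [0, 1] so that their mean lands exactly on 0.5.

    Values up to the mean are scaled over [min, mean] -> [0, 0.5],
    values above it over [mean, max] -> [0.5, 1].
    """
    lo, hi = min(values), max(values)
    mean = sum(values) / len(values)
    if not lo < mean < hi:
        raise ValueError("need at least two distinct values")
    return [
        0.5 * (v - lo) / (mean - lo) if v <= mean
        else 0.5 + 0.5 * (v - mean) / (hi - mean)
        for v in values
    ]


# Made-up, right-skewed shares: most poviats low, one clearly higher.
shares = [0.01, 0.02, 0.02, 0.05, 0.10]
print(linear_norm(shares))
print(centered_norm(shares))
```

In this toy sample the value 0.05 lies above the mean, yet the linear scaling puts it on the blue side of the midpoint, while the centered version places it on the red side.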

It seems clear that the southeastern regions of Poland have been really sporty throughout the observed period (2002-2016)! Only in the last two to four years has the West joined the race :)<br>
Looking at a physical map, one can see that the southern parts of Poland are mostly mountainous, or at least hilly. That perhaps gives easy access to natural outdoor activity and makes it more feasible to practise sports, especially winter disciplines like skiing, ski jumping and such.<br><br>
That was a helluva task and I enjoyed it. Thank you, Coursera and University of Michigan for this assignment!<br><br>
Kuba Siekierzyński