# Collecting Board Game Cafés Within Europe

The first part of this project requires us to collect the board game cafés located in Europe. Luckily for us, Andy Matthews at Meeple Mountain has [collected every board game café on the planet](https://www.meeplemountain.com/articles/the-ultimate-guide-to-board-game-cafes/), and placed them on a handy [Google Map](https://www.google.com/maps/d/u/0/viewer?mid=1UEkafkpKjbEJQzkcwgdtmnYPn4jXQKrw). So, all we need to do is extract the entries on this list that are located within Europe, and then acquire their latitude and longitude using `geopy`.

Note that the list of cafés is contained in the map as a `kml` file, which we have saved locally as `Worldwide_Board_Game_Cafe_List.kml`, and so we will deal with it using `pykml`.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup # For web scraping
from pykml import parser # For reading the kml file
from geopy.geocoders import Nominatim

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

The data in the `kml` file does not contain continent information, and so we will obtain a list of countries within Europe from [Wikipedia](https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Europe).

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Europe'
response = requests.get(url)
response

<Response [200]>

The Wikipedia table contains multiple columns, and we are only interested in the third; furthermore, each row in the column contains potentially multiple lines, and we only care for the first. However, each country name acts as a hyperlink to the page for that country, and so we can specifically look for the first `a` tag in the third column (`td` in html) of every row (noting that there are 7 columns in the table), adding the result to the list `eu_countries`. 

In [3]:
soup = BeautifulSoup(response.content, 'lxml')
table = soup.find_all('table', attrs={'class':'wikitable'})[1]
eu_countries = [td.a.text for td in table.find_all('td')[2::7]]
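
The `[2::7]` slice works because `find_all('td')` returns every cell of the table as one flat list, row after row. A minimal sketch with a made-up two-row table (the cell values are placeholders, not the Wikipedia contents) shows how the stride picks out the third cell of each row:

```python
# Hypothetical flat list of cells, in the order find_all('td') would return them:
# 7 cells per row, rows laid end to end.
N_COLUMNS = 7
cells = [
    'flag-A', 'arms-A', 'Country A', 'capital-A', 'pop-A', 'area-A', 'note-A',
    'flag-B', 'arms-B', 'Country B', 'capital-B', 'pop-B', 'area-B', 'note-B',
]

# Start at index 2 (the third column) and step by the row width.
third_column = cells[2::N_COLUMNS]
print(third_column)
```

The slice silently misaligns if any row has a different number of `td` cells, so it is worth checking that the number of cells is a multiple of 7 before trusting the result.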

The Wikipedia table includes 'United Kingdom', whereas the Meeple Mountain list specifies the countries England, Scotland, Wales and Northern Ireland. Hence, to avoid accidentally omitting any of the venues, we replace these values in `eu_countries` (and sort alphabetically again for ease of reading).

In [4]:
eu_countries.remove('United Kingdom')
eu_countries.extend(['England', 'Scotland', 'Wales', 'Northern Ireland'])
eu_countries.sort()

In [5]:
eu_countries

['Albania',
 'Andorra',
 'Armenia',
 'Austria',
 'Azerbaijan',
 'Belarus',
 'Belgium',
 'Bosnia and Herzegovina',
 'Bulgaria',
 'Croatia',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'England',
 'Estonia',
 'Finland',
 'France',
 'Georgia',
 'Germany',
 'Greece',
 'Hungary',
 'Iceland',
 'Ireland',
 'Italy',
 'Kazakhstan',
 'Latvia',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Malta',
 'Moldova',
 'Monaco',
 'Montenegro',
 'Netherlands',
 'North Macedonia',
 'Northern Ireland',
 'Norway',
 'Poland',
 'Portugal',
 'Romania',
 'Russia',
 'San Marino',
 'Scotland',
 'Serbia',
 'Slovakia',
 'Slovenia',
 'Spain',
 'Sweden',
 'Switzerland',
 'Turkey',
 'Ukraine',
 'Vatican City',
 'Wales']

Now it is time to open the `kml` file before collecting its info.

In [6]:
with open('Worldwide_Board_Game_Cafe_List.kml','r') as f:
    doc = parser.parse(f).getroot().Document.Folder

If we inspect the `kml` file we see that the locations are contained within the `Placemark` features. Furthermore, each placemark contains an `address` field and an `ExtendedData` field, the latter of which contains both the website and the address broken up into separate parts. As such, we can obtain the parts of the address we want either by splitting the value in the `address` field, or by selecting the appropriate fields within `ExtendedData`; we will use the latter method for no particular reason other than it was the first one I implemented.

We will then place these values into a dictionary for each café, and append it to a list before we convert it into a DataFrame.

In [7]:
# Initialise a list for the cafes
cafe_list = []

In [8]:
for e in doc.Placemark:
    # Data[5] holds the country; compare its text against the European list
    if e.ExtendedData.Data[5].value.text in eu_countries:
        cafe_dict = {
            'Name': e.name.text,
            'Address': e.ExtendedData.Data[1].value.text,
            'City': e.ExtendedData.Data[2].value.text,
            'PostalCode': e.ExtendedData.Data[4].value.text,
            'Country': e.ExtendedData.Data[5].value.text,
        }
        cafe_list.append(cafe_dict)
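
The cell above picks the `ExtendedData` fields by position. As a hedged alternative, the fields can also be looked up by the `name` attribute of each `Data` element. The sketch below uses the standard library's `xml.etree.ElementTree` instead of `pykml`, and a made-up placemark; the `Data` names (`Address`, `City`, `Country`) are assumptions for illustration and may not match the names in `Worldwide_Board_Game_Cafe_List.kml`.

```python
import xml.etree.ElementTree as ET

# Made-up placemark; the Data names are assumptions, not taken from the real file.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Folder>
      <Placemark>
        <name>Example Cafe</name>
        <ExtendedData>
          <Data name="Address"><value>1 Example Street</value></Data>
          <Data name="City"><value>Vienna</value></Data>
          <Data name="Country"><value>Austria</value></Data>
        </ExtendedData>
      </Placemark>
    </Folder>
  </Document>
</kml>"""

KML_NS = 'http://www.opengis.net/kml/2.2'
NS = {'kml': KML_NS}

def placemark_fields(placemark):
    """Map each Data element's name attribute to the text of its value."""
    fields = {}
    for data in placemark.findall('kml:ExtendedData/kml:Data', NS):
        fields[data.get('name')] = data.findtext('kml:value', default='', namespaces=NS)
    return fields

root = ET.fromstring(KML)
records = []
for pm in root.iter(f'{{{KML_NS}}}Placemark'):
    record = placemark_fields(pm)
    record['Name'] = pm.findtext('kml:name', namespaces=NS)
    records.append(record)

print(records)
```

Looking fields up by name is more robust if the order of the `Data` elements ever changes, at the cost of having to know the exact names used in the file.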

In [9]:
# Relabel 'Czech Republic' as 'Czechia' in the collected records
for item in cafe_list:
    if item['Country'] == 'Czech Republic':
        item['Country'] = 'Czechia'

In [10]:
cafe_df = pd.DataFrame(cafe_list)
cafe_df.head(10)

| | Name | Address | City | PostalCode | Country |
|---|---|---|---|---|---|
| 0 | Brot & Spiele | Mariahilferstraße 17 | Graz | 8020 | Austria |
| 1 | Brot und Spiele | Laudongasse 22 | Vienna | 1080 | Austria |
| 2 | Café Benno | Alser Str. 67 | Vienna | 1080 | Austria |
| 3 | Café Sperlhof | Große Sperlgasse 41 | Vienna | 1020 | Austria |
| 4 | SpielBar | Lederergasse 26 | Vienna | 1080 | Austria |
| 5 | The Playground (Hoofdkerk) | Hoofdkerkstraat 7 | Antwerp | 2000 | Belgium |
| 6 | The Playground (Station) | Pelikaanstraat 3/1270 | Antwerp | 2018 | Belgium |
| 7 | Outpost Antwerpen | Beggaardenstraat 6 | Antwerpen | 2000 | Belgium |
| 8 | The Playground | Hoofdkerkstraat 7 | Antwerpen | 2000 | Belgium |
| 9 | La Luck Brussels | 74 rue Washington | Brussels | 1050 | Belgium |


There is no latitude and longitude data contained within the `kml` file, and so we need to collect it ourselves. We will do this with `geopy`, since checking the [list of supported countries](https://pgeocode.readthedocs.io/en/latest/overview.html#supported-countries) for `pgeocode` revealed that a number of the countries of interest are not covered.

We will adopt a similar method to the one above: we build a list of dictionaries, create a DataFrame from it, and ultimately merge that DataFrame with `cafe_df`.

In [11]:
# Initialise the geolocator and list of geodata dictionaries
geolocator = Nominatim(user_agent='bgc_finder')
ll_list = []

We now define a quick function that will perform a Nominatim query in the format `[Address], [PostalCode] [City], [Country]`, and will return a list containing the latitude and longitude. If our query does not produce a result (due to issues in the initial data, or some disagreement between Nominatim and Google Maps) then the function returns a list of `NaN` values.

In [12]:
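def get_ll(row):
    # row is positional: [Name, Address, City, PostalCode, Country]
    query = f"{row[1]}, {row[3]} {row[2]}, {row[4]}"
    location = geolocator.geocode(query)
    # geocode returns None when Nominatim finds no match
    if location is None:
        return [np.nan, np.nan]
    return [location.latitude, location.longitude]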
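
To see the query format and the `NaN` fallback in isolation, here is a minimal sketch that swaps the real geocoder for a stand-in. The names `build_query`, `lookup` and `fake_geocode` are hypothetical helpers, and the coordinates are illustrative values rather than Nominatim output.

```python
import math
from collections import namedtuple

# Stand-in for the object a geocoder returns; only the two attributes used here.
Location = namedtuple('Location', ['latitude', 'longitude'])

def build_query(row):
    """Format a row [Name, Address, City, PostalCode, Country] as a query string."""
    return f"{row[1]}, {row[3]} {row[2]}, {row[4]}"

def lookup(row, geocode):
    """Return [lat, lon], or two NaNs when the geocoder finds nothing (None)."""
    location = geocode(build_query(row))
    if location is None:
        return [math.nan, math.nan]
    return [location.latitude, location.longitude]

def fake_geocode(query):
    """Hypothetical geocoder that only knows a single address."""
    known = {'Laudongasse 22, 1080 Vienna, Austria': Location(48.2134, 16.3498)}
    return known.get(query)

hit = lookup(['Brot und Spiele', 'Laudongasse 22', 'Vienna', '1080', 'Austria'], fake_geocode)
miss = lookup(['Nowhere Cafe', 'No Street 0', 'Nowhere', '0000', 'Austria'], fake_geocode)
print(hit, miss)
```

Passing the geocoder in as an argument makes the formatting logic testable without any network access.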

In [13]:
# Build the list of geodata, one dictionary per café
for row in cafe_df.values:
    latitude, longitude = get_ll(row)
    ll_list.append({'Latitude': latitude, 'Longitude': longitude})

# Build the DataFrame
ll_df = pd.DataFrame(ll_list)
ll_df.head()

| | Latitude | Longitude |
|---|---|---|
| 0 | 47.073272 | 15.433036 |
| 1 | 48.213407 | 16.349799 |
| 2 | 48.21505 | 16.342587 |
| 3 | 48.219658 | 16.37838 |
| 4 | 48.213688 | 16.348476 |
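
The loop above sends one request per café, and the public Nominatim service asks clients to stay at or below one request per second. A minimal sketch of a throttling wrapper, using only the standard library, is given below; `throttled` is a hypothetical helper, and the demonstration uses a stand-in function with a short delay rather than a real geocoder.

```python
import time

def throttled(func, min_delay_seconds):
    """Wrap func so that consecutive calls are at least min_delay_seconds apart."""
    last_call = [None]  # mutable cell holding the time of the previous call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# Demonstration with a stand-in function and a short delay.
slow_upper = throttled(str.upper, min_delay_seconds=0.05)
start = time.monotonic()
results = [slow_upper(word) for word in ['graz', 'vienna', 'antwerp']]
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))
```

In the notebook this could be applied as `throttled(geolocator.geocode, 1)`. `geopy` also provides `geopy.extra.rate_limiter.RateLimiter`, which serves the same purpose and is probably the better choice in practice.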


To determine which rows, if any, are missing data, we now search the geodata for null values. This tells us which entries need cleaning. There is still the potential issue of duplicate entries, but we will resolve that afterwards.

In [14]:
null_idx = ll_df.index[ll_df['Latitude'].isnull()].tolist()
print(f'There are {len(null_idx)} rows missing data.')

cafe_df.loc[null_idx,:]

There are 52 rows missing data.


| | Name | Address | City | PostalCode | Country |
|---|---|---|---|---|---|
| 16 | Yam-toto | En Hors-Chateau 43 | Liège | 4000 | Belgium |
| 17 | Aux 3D Board Game Cafe | Place Abbe Joseph, 11 | Namur | 5000 | Belgium |
| 38 | Ready Steady Roll | Ivy Lodge | Bedford | MK44 1ND | England |
| 42 | The Games Table | 86 Magdalen St, Norwich | Norwich | NR311JF | England |
| 43 | Ready Steady Roll | Ivy Lodge, A6 Rushden Road | Sharnbrook | MK44 1ND | England |
| 44 | Ready, Steady Roll | Studio 5, Ivy Lodge Farm, A6 Rushden Road | Sharnbrook | MK44 1ND | England |
| 48 | Red Panda Gaming Cafe | 247 High Street, First floor | Lincoln | LN2 1HW | England |
| 72 | c:\ Side Quest | 11 Lower Promenade, Madiera Drive | Brighton | BN2 1ET | England |
| 73 | Dice Saloon | Unit 6, Longley Industrial Estate, New England... | Brighton | BN14GY | England |
| 74 | Dice Saloon | First floor, Vantage Point, New England Road | Brighton | BN1 4GY | England |
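
The lookup in the cell above relies on `ll_df` and `cafe_df` sharing the same index, so that labels found in one frame select the matching rows in the other. A small sketch with made-up stand-ins for the two frames:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for cafe_df and ll_df, both with the default index 0, 1, 2.
cafes = pd.DataFrame({'Name': ['Cafe A', 'Cafe B', 'Cafe C']})
coords = pd.DataFrame({'Latitude': [47.07, np.nan, 48.21],
                       'Longitude': [15.43, np.nan, 16.35]})

# Index labels of the rows where geocoding produced NaN.
missing = coords.index[coords['Latitude'].isnull()].tolist()
print(missing)

# Because both frames share the same index, the labels select the matching cafés.
missing_names = cafes.loc[missing, 'Name'].tolist()
print(missing_names)
```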


So there are a number of issues within the data, including a peculiar issue with #220 (Pontypridd is in Wales, not Italy!). This also highlights that we do have some duplicates within our data (see e.g. #43 and #44). So let's clean the data. (Unfortunately, since we are dealing with a number of countries, and a number of different address standards, this cleaning will mostly need to be done by hand and checked against the [Nominatim search](https://nominatim.openstreetmap.org/).)

In [15]:
cafe_df.loc[7, 'City'] = 'Antwerp'
cafe_df.drop(8, axis=0, inplace=True)
cafe_df.loc[12, 'City'] = 'Brussels'
cafe_df.loc[16, 'Address'] = 'Rue Hors-Château 43'
cafe_df.loc[17, 'Address'] = 'Place Abbé Joseph André 11'
cafe_df.drop(33, axis=0, inplace=True)
cafe_df.loc[38, 'Address'] = 'Rushden Rd' # This is as accurate as Nominatim can get
cafe_df.loc[42, 'PostalCode'] = 'NR2 1EL'
cafe_df.drop(43, axis=0, inplace=True)
cafe_df.drop(44, axis=0, inplace=True)
cafe_df.loc[48, 'Address'] = '247 High Street'
cafe_df.drop(50, axis=0, inplace=True)
cafe_df.drop(72, axis=0, inplace=True)
cafe_df.loc[73, 'Address'] = '88 London Rd'
cafe_df.loc[73, 'PostalCode'] = 'BN1 4JF'
cafe_df.drop(74, axis=0, inplace=True)
cafe_df.loc[75, 'Address'] = '' # Needed in order for geopy to obtain geographical data
cafe_df.drop(79, axis=0, inplace=True)
cafe_df.loc[82, 'Address'] = 'Abinger place'
cafe_df.loc[84, 'Address'] = '207 Queensway'
cafe_df.loc[84, 'PostalCode'] = 'MK2 2EB'
cafe_df.loc[86, 'Address'] = '149 Albert Rd'
cafe_df.loc[89, 'Address'] = 'The Brooks Centre'
cafe_df.loc[103, 'Address'] = '19a Pepper Street'
cafe_df.loc[103, 'PostalCode'] = 'ST5 1PR'
cafe_df.loc[104, 'Name'] = 'Nerdy Coffee Co.'
cafe_df.loc[124, 'PostalCode'] = '03000'
cafe_df.loc[185, 'Address'] = 'Lehener Straße 15'
cafe_df.loc[194, 'Address'] = 'Λογοθετίδη Βασίλη 14'
cafe_df.loc[194, 'City'] = 'Athens'
cafe_df.loc[196, 'PostalCode'] = '65302'
cafe_df.loc[197, 'PostalCode'] = '41221'
cafe_df.loc[198, 'Address'] = 'Δημητριου Ράλλη 4'
cafe_df.loc[200, 'Address'] = 'Γεωργίου Παπανδρέου 27'
cafe_df.loc[200, 'PostalCode'] = '54645'
cafe_df.loc[202, 'Address'] = 'Βασιλίσσης Σοφίας'
cafe_df.loc[205, 'Address'] = 'Ferenc körút 17'
cafe_df.drop(206, axis=0, inplace=True)
cafe_df.loc[207, 'Name'] = 'Pub Game Up!'
cafe_df.loc[211, 'Address'] = '9 High Street'
cafe_df.loc[211, 'PostalCode'] = 'P75 XW35'
cafe_df.drop(212, axis=0, inplace=True)
cafe_df.loc[213, 'Address'] = '51 Wellington Quay'
cafe_df.loc[214, 'PostalCode'] = 'D02 FP40'
cafe_df.loc[215, 'PostalCode'] = 'H91 Y90F'
cafe_df.loc[216, 'Address'] = 'Via Giuseppe Toniolo 12'
cafe_df.drop(220, axis=0, inplace=True) # This is in fact related to an event called Counters in Pontypridd, Wales, not Italy
cafe_df.loc[223, 'Address'] = 'Strada Alexandr Pușkin 52'
cafe_df.loc[223, 'PostalCode'] = 'MD-2012'
cafe_df.loc[224, 'PostalCode'] = '1432KA'
cafe_df.loc[225, 'PostalCode'] = '1052NP'
cafe_df.loc[227, 'PostalCode'] = '4811GK'
cafe_df.loc[228, 'PostalCode'] = '2611HR'
cafe_df.loc[229, 'PostalCode'] = '2801LV'
cafe_df.loc[230, 'PostalCode'] = '9712NP'
cafe_df.loc[231, 'PostalCode'] = '2011LE'
cafe_df.loc[232, 'PostalCode'] = '2513BW'
cafe_df.loc[233, 'City'] = 'Skopje'
cafe_df.loc[234, 'Address'] = 'Rosepark, Upper Newtownards Road' # This is as accurate as Nominatim allows
cafe_df.loc[234, 'City'] = 'Dundonald'
cafe_df.loc[234, 'PostalCode'] = 'BT4 3SB'
cafe_df.loc[235, 'Address'] = 'Holywood Road'
cafe_df.loc[235, 'City'] = 'Sydenham'
cafe_df.loc[235, 'PostalCode'] = 'BT4 1NT'
cafe_df.loc[238, 'Address'] = 'Dmowskiego 15'
cafe_df.loc[239, 'Address'] = 'Kamienna 7'
cafe_df.loc[248, 'Name'] = 'Ludoclube'
cafe_df.loc[248, 'PostalCode'] = '2720-046'
cafe_df.loc[249, 'Name'] = 'Pow Wow'
cafe_df.loc[249, 'Address'] = 'Rua Professor Fernando da Fonseca 19'
cafe_df.loc[249, 'PostalCode'] = '1600-235'
cafe_df.loc[250, 'Name'] = 'A Jogar é que a gente se entende'
cafe_df.loc[250, 'Address'] = 'Rua Doutor Elias de Aguiar 244'
cafe_df.loc[250, 'PostalCode'] = '4480-789'
cafe_df.loc[252, 'Name'] = 'Snakes & Wizards'
cafe_df.loc[252, 'Address'] = 'Strada Ilarie Chendi 5'
cafe_df.loc[253, 'Address'] = 'Strada Samuil Micu 4'
cafe_df.loc[256, 'Name'] = 'FatCats Board Game Cafe'
cafe_df.loc[256, 'PostalCode'] = '100337'
cafe_df.drop(264, axis=0, inplace=True)
cafe_df.drop(272, axis=0, inplace=True)
cafe_df.loc[280, 'Address'] = "Carrer de l'Alandir 1" # The actual address "Carrer Hospitalers de Sant Joan n.2" is a footpath so doesn't show
cafe_df.loc[282, 'Address'] = 'Av. Manuel Torres 5' # In Nominatim the 'de' given in Google Maps returns an error
cafe_df.loc[284, 'Address'] = 'Carrer de Rosselló i Cazador, 7'
cafe_df.loc[286, 'PostalCode'] = '411 19'
cafe_df.loc[291, 'PostalCode'] = '06490'
cafe_df.loc[292, 'Address'] = 'Nail Bey Sk. No:48/2' # In Nominatim the street name is Nail Bey, not Nailbey as with Google Maps
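
The manual fixes above could also be kept as data and applied in a loop, which makes them easier to review and extend. The sketch below is one possible layout, not the approach used in this notebook; the DataFrame contents, the `corrections` dictionary and the `duplicates` list are all made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Cafe A', 'Cafe B', 'Cafe B (duplicate)'],
    'Address': ['1 Old Rd, First floor', '2 High St', '2 High Street'],
    'PostalCode': ['AB11CD', 'EF2 2GH', 'EF2 2GH'],
})

# Hypothetical fixes keyed by (row label, column), plus the row labels to drop.
corrections = {
    (0, 'Address'): '1 Old Rd',
    (0, 'PostalCode'): 'AB1 1CD',
}
duplicates = [2]

for (label, column), value in corrections.items():
    df.loc[label, column] = value
df = df.drop(index=duplicates)
print(df)
```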

With the values cleaned up and duplicates removed, let's reset the index. Before doing so, let's also sort the cafés by country and city, as they were not previously in this order (possibly due to some cafés being assigned a region that affected the results).

In [16]:
cafe_df.sort_values(['Country', 'City'], inplace=True)
cafe_df.reset_index(drop=True, inplace=True)
cafe_df.head()

| | Name | Address | City | PostalCode | Country |
|---|---|---|---|---|---|
| 0 | Brot & Spiele | Mariahilferstraße 17 | Graz | 8020 | Austria |
| 1 | Brot und Spiele | Laudongasse 22 | Vienna | 1080 | Austria |
| 2 | Café Benno | Alser Str. 67 | Vienna | 1080 | Austria |
| 3 | Café Sperlhof | Große Sperlgasse 41 | Vienna | 1020 | Austria |
| 4 | SpielBar | Lederergasse 26 | Vienna | 1080 | Austria |


So let's run the geolocator function again, and see if we now have geodata for each café:

In [17]:
ll_list = []
for row in cafe_df.values:
    latitude, longitude = get_ll(row)
    ll_list.append({'Latitude': latitude, 'Longitude': longitude})

ll_df = pd.DataFrame(ll_list)
null_idx = ll_df.index[ll_df['Latitude'].isnull()].tolist()
print(f'There are {len(null_idx)} rows missing data.')

# A bare expression inside an if block is not displayed, so print explicitly
if null_idx:
    print(cafe_df.loc[null_idx, :])

There are 0 rows missing data.


Now that no data is missing, let's merge the two DataFrames and save the result as a `csv` file for future use.

In [18]:
cafe_df = cafe_df.merge(ll_df, left_index=True, right_index=True)
cafe_df.to_csv('cafe_addresses.csv', index=False)
cafe_df.head()

| | Name | Address | City | PostalCode | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | Brot & Spiele | Mariahilferstraße 17 | Graz | 8020 | Austria | 47.073272 | 15.433036 |
| 1 | Brot und Spiele | Laudongasse 22 | Vienna | 1080 | Austria | 48.213407 | 16.349799 |
| 2 | Café Benno | Alser Str. 67 | Vienna | 1080 | Austria | 48.21505 | 16.342587 |
| 3 | Café Sperlhof | Große Sperlgasse 41 | Vienna | 1020 | Austria | 48.219658 | 16.37838 |
| 4 | SpielBar | Lederergasse 26 | Vienna | 1080 | Austria | 48.213688 | 16.348476 |
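
Merging on the index pairs rows by label, not by position, which is why the index was reset before geocoding again. A small sketch with made-up frames shows what would happen if the café frame still had gaps in its index from dropped rows:

```python
import pandas as pd

# Café frame with a gap in its index (as left behind by a drop), and a
# coordinate frame with the default index 0, 1, 2. All values are made up.
cafes = pd.DataFrame({'Name': ['Cafe A', 'Cafe B', 'Cafe C']}, index=[0, 1, 3])
coords = pd.DataFrame({'Latitude': [47.07, 48.21, 51.22],
                       'Longitude': [15.43, 16.35, 4.40]})

# Default inner join on index labels: only labels present in both frames survive.
misaligned = cafes.merge(coords, left_index=True, right_index=True)
print(len(misaligned))

# Resetting the index first restores the row-by-row pairing.
aligned = cafes.reset_index(drop=True).merge(coords, left_index=True, right_index=True)
print(aligned)
```

With the gap in place the third café is silently lost from the merged result, so it is worth comparing the row counts before and after a merge like this.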
