## Coursera Capstone Final Project: Where Should a Board Game Café Be Located in Turku? 
#### Tom Bullock

### Introduction

The city of Turku in the southwest of Finland has a large fanbase for board games: for a city with a [population of 191,331](https://www.turku.fi/en/statistical-data-about-turku-2019#Population,%20housing%20and%20education) it possesses three brick-and-mortar board game shops (with a dedicated board game section in most toy and book shops), and even the public library provides [board game loans (in Finnish)](https://www.turku.fi/lainattavat-lautapelit). To put this in perspective, this is the same number of shops as the capital Helsinki, a city with a [population of 648,042](https://www.hel.fi/hel2/tietokeskus/julkaisut/pdf/20_01_09_tilastollinen_vuosikirja2019.pdf). And yet, despite this, Turku does not possess a board game café, a place where people can socialise, eat, drink and play games or run tabletop campaigns together. Given that Turku also has [over a dozen escape rooms](https://www.tripadvisor.com/Attractions-g189949-Activities-c56-t208-Turku_Southwest_Finland.html), whose customer base overlaps with that of board game cafés, there appears to be no lack of demand for such opportunities to play games together. 

Such a café would likely prove very lucrative within Turku as the closest board game cafés are located in Helsinki and Tampere, both about two hours away, and so the main question for a prospective café owner to ask is *in which neighbourhood of Turku would such a café be most likely to succeed?* This project intends to answer that by analysing the locations of board game cafés around Europe, as well as the centres of major cities that _don't_ have any board game cafés, and determining via a logistic regression model which neighbourhoods in Turku would be best suited for housing such a café. This research will be performed with the use of the FourSquare API in order to perform venue queries based on geographical data.

### Data Collection

The data that we will be using consists of venues surrounding board game cafés, and venues in the centres of cities that do not contain board game cafés, within the continent of Europe. This is built on the hypothesis that these cafés tend to be located in the vicinity of certain types of amenities, such as public transport hubs, universities or student accommodation, and away from others, such as factories or emergency services buildings. 

The data collection is broken down into separate categories:
* [Collecting board game cafés within Europe](#Collecting-Board-Game-Cafés-Within-Europe)
* [Collecting cities within Europe without board game cafés](#Collecting-Cities-Within-Europe-Without-Board-Game-Cafés)
* [Visualising the locations collected](#Visualisation-of-Locations)
* [Obtaining venues close to our collected locations](#Obtaining-Nearby-Venues)

#### Collecting Board Game Cafés Within Europe

The first part of this project requires us to collect the board game cafés located in Europe. Luckily for us, [Andy Matthews at Meeple Mountain](https://www.meeplemountain.com/authors/andy-matthews/) has [collected every board game café on the planet](https://www.meeplemountain.com/articles/the-ultimate-guide-to-board-game-cafes/), and placed them on a handy [Google Map](https://www.google.com/maps/d/u/0/viewer?mid=1UEkafkpKjbEJQzkcwgdtmnYPn4jXQKrw). So all we need to do is extract the entries on this list that are located within Europe, and then acquire their latitude and longitude using `geopy`.

Note that the list of cafés is contained in the map as a `kml` file, which we have locally converted to a `csv` file called `Worldwide_Board_Game_Cafe_List.csv`.
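
The conversion itself is not shown in this notebook. As a rough sketch of how it could be done with the standard library, assuming the exported file follows the usual KML layout of `Placemark` elements with `name` and `description` children (the sample document and the `placemarks_to_rows` helper below are hypothetical, not the actual map data):

```python
import csv
import io
import xml.etree.ElementTree as ET

KML_NS = {'kml': 'http://www.opengis.net/kml/2.2'}

# Hypothetical two-entry sample standing in for the exported map file
SAMPLE_KML = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>Cafe A</name><description>1 Main Street</description></Placemark>
    <Placemark><name>Cafe B</name><description>2 Side Street</description></Placemark>
  </Document>
</kml>"""

def placemarks_to_rows(kml_text):
    """Return one (name, description) tuple per Placemark in the KML text."""
    root = ET.fromstring(kml_text.encode('utf-8'))
    rows = []
    for placemark in root.iterfind('.//kml:Placemark', KML_NS):
        name = placemark.findtext('kml:name', default='', namespaces=KML_NS)
        desc = placemark.findtext('kml:description', default='', namespaces=KML_NS)
        rows.append((name.strip(), desc.strip()))
    return rows

rows = placemarks_to_rows(SAMPLE_KML)

# Write the rows out as CSV text (a file handle would work the same way)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Name', 'Description'])
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Splitting the description into the separate address, city, postal code and country columns seen in the `csv` file would be an additional step that depends on how the map entries are written.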

In [1]:
import pandas as pd
import numpy as np
import requests
import json
from bs4 import BeautifulSoup # For web scraping
from geopy.geocoders import Nominatim
!pip install folium
import folium


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1


In [2]:
world_df = pd.read_csv('Worldwide_Board_Game_Cafe_List.csv')
world_df.head()

Unnamed: 0,Name,Address,City,PostalCode,Country
0,Geek Out! Argentina,Darregueyra 2484,Buenos Aires,C1425,Argentina
1,The Board Game Cafe,Holmberg 2000,Buenos Aires,1430,Argentina
2,Magic Lair,Avenida Juan Bautista Alberdi 1170,Ciudad de Buenos Aires,1406,Argentina
3,Invictvs Café y Salon de Juegos,Italia 101,Paraná,E3100,Argentina
4,BrainHackr,"208 Prospect Rd, Prospect, South Australia",Adelaide,5082,Australia


The data in the `csv` file does not contain continent information, and so we need to pick out the cafés that are within Europe ourselves. We will do this by first obtaining a list of countries within Europe from [Wikipedia](https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Europe), which we will then test against the `Country` column of `world_df`.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Europe'
response = requests.get(url)
response

<Response [200]>

The Wikipedia table contains multiple columns, and we are only interested in the third; furthermore, each row in the column contains potentially multiple lines, and we only care for the first. However, each country name acts as a hyperlink to the page for that country, and so we can specifically look for the first `a` tag in the third column (`td` in html) of every row (noting that there are 7 columns in the table), adding the result to the list `eu_countries`. 

In [4]:
soup = BeautifulSoup(response.content, 'lxml')
table = soup.find_all('table', attrs={'class':'wikitable'})[1]
eu_countries = [td.a.text for td in table.find_all('td')[2::7]]
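
The slice `[2::7]` works because `find_all('td')` returns the cells of the table flattened row by row. A small illustration of the same stride trick on a made-up three-column table (the cell values are placeholders, not the Wikipedia data):

```python
# Cells of a table with 3 columns, flattened row by row as find_all('td') would return them
n_columns = 3
cells = ['flag-AL', 'r1-extra', 'Albania',
         'flag-AD', 'r2-extra', 'Andorra',
         'flag-AT', 'r3-extra', 'Austria']

# Start at index 2 (the third column) and step by the number of columns
third_column = cells[2::n_columns]
print(third_column)
```

This only holds while every row has the same number of `td` cells; a row with merged or missing cells would shift the stride for all the rows after it.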

The Wikipedia table includes 'United Kingdom', whereas the café list specifies the countries England, Scotland, Wales and Northern Ireland. Hence, to avoid accidentally omitting any of the venues, we replace these values in `eu_countries` (and sort alphabetically again for ease of reading).

In [5]:
eu_countries.remove('United Kingdom')
eu_countries.extend(['England', 'Scotland', 'Wales', 'Northern Ireland'])
eu_countries.sort()

In [6]:
eu_countries[:5]

['Albania', 'Andorra', 'Armenia', 'Austria', 'Azerbaijan']

In [7]:
cafe_df = world_df[world_df['Country'].isin(eu_countries)]
cafe_df.reset_index(drop=True, inplace=True)
cafe_df.head()

Unnamed: 0,Name,Address,City,PostalCode,Country
0,Brot & Spiele,Mariahilferstraße 17,Graz,8020,Austria
1,Brot und Spiele,Laudongasse 22,Vienna,1080,Austria
2,Café Benno,Alser Str. 67,Vienna,1080,Austria
3,Café Sperlhof,Große Sperlgasse 41,Vienna,1020,Austria
4,SpielBar,Lederergasse 26,Vienna,1080,Austria


Later data cleaning shows that the café Chil Angart in Krasnodar is missing from the original list, and rather than have it cause problems later we will add it to the data now.

In [8]:
chil_angart = {
    'Name': 'Чил Ангарт',
    'Address': 'Krasnaya Street 109',
    'City': 'Krasnodar',
    'PostalCode': '350000',
    'Country': 'Russia'
              }
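# DataFrame.append was removed in pandas 2.0, so we concatenate a one-row DataFrame instead
cafe_df = pd.concat([cafe_df, pd.DataFrame([chil_angart])], ignore_index=True)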
cafe_df.tail()

Unnamed: 0,Name,Address,City,PostalCode,Country
291,Da Vinci Board Game Cafe,Kazakistan Cd. 69-71 D:A,Ankara,6490,Turkey
292,Goblin Oyun Cafe,General Asim Gunduz,Kadikoy/Istanbul,34714,Turkey
293,Chance & Counters,23 High Street,Cardiff,CF10 1PT,Wales
294,"Board, Isle of Wight",St James Street,Newport,PO30 1LQ,Wales
295,Чил Ангарт,Krasnaya Street 109,Krasnodar,350000,Russia


There is no latitude and longitude data contained within the original `kml` file, and so we need to collect it ourselves. We will do this with `geopy`, since the [list of supported countries](https://pgeocode.readthedocs.io/en/latest/overview.html#supported-countries) for `pgeocode` shows that a number of countries of interest are not covered.

We will build a list of dictionaries and turn it into a DataFrame, which will ultimately be merged with `cafe_df`.

In [9]:
# Initialise the geolocator
geolocator = Nominatim(user_agent='bgc_finder')

FourSquare utilises latitude and longitude values, which can be obtained from a given address by making use of OpenStreetMap's Nominatim search system, so we now define a quick function that will perform a Nominatim query. Depending on the given data (essentially, whether we consider cafés or cities without them, as we will do later) the query will be in the format `[Address], [PostalCode] [City], [Country]` or simply `[City]`, and will return a list containing the latitude and longitude. (In the latter case Nominatim will provide the latitude and longitude for a point in the centre of the city.) 

If our query does not produce a result (due to issues in the initial data, or some disagreement between Nominatim and Google Maps) then the function returns a list of `NaN` values.

In [10]:
def get_address(row):
    # This case deals with the cafe data
    if len(row) > 2: 
        address = f"{row[1]}, {row[3]} {row[2]}, {row[4]}"
    # This deals with cities without cafes
    else:
        address = row[1]

    return address

In [11]:
geocode_cache = {} # For storing already found addresses

def get_latlong(row):
    address = get_address(row)
    
    # Return the cached result if this address has already been looked up
    if address in geocode_cache:
        return geocode_cache[address]
    
    location = geolocator.geocode(address)
    
    # geocode returns None when Nominatim finds no match for the query
    if location is None:
        geocode_cache[address] = [np.nan, np.nan]
    else:
        geocode_cache[address] = [location.latitude, location.longitude]
    return geocode_cache[address]
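
The caching and the `NaN` fallback can be illustrated without contacting Nominatim at all. The sketch below wraps an arbitrary lookup function; `fake_geocoder`, `make_cached_lookup` and the rounded coordinates are hypothetical stand-ins rather than part of `geopy`:

```python
import math

def make_cached_lookup(lookup):
    """Wrap a lookup function so each address is only resolved once."""
    cache = {}
    calls = []  # Records the addresses actually sent to the lookup

    def cached(address):
        if address not in cache:
            calls.append(address)
            result = lookup(address)
            # A missing result is stored as a pair of NaNs
            cache[address] = [math.nan, math.nan] if result is None else list(result)
        return cache[address]

    return cached, calls

# Hypothetical stand-in for a geocoder: knows one address, returns None otherwise
def fake_geocoder(address):
    known = {'Turku': (60.45, 22.27)}
    return known.get(address)

cached_lookup, calls = make_cached_lookup(fake_geocoder)
first = cached_lookup('Turku')
second = cached_lookup('Turku')      # Served from the cache
missing = cached_lookup('Nowhere')   # Not found, so NaN values
```

Besides saving time, caching keeps the number of requests down, which matters because the public Nominatim service asks clients to limit how often they query it.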

In [12]:
def build_latlong_dataframe(source_df):
    latlong_list = []
    city_idx = source_df.columns.get_loc('City')
    # Building the list of geodata
    latlong_list = [{'Name': row[0], 
                     'City': row[city_idx], 
                     'Latitude': get_latlong(row)[0], 
                     'Longitude': get_latlong(row)[1]}
                    for row in source_df.to_numpy()]

    # Build the DataFrame
    latlong_df = pd.DataFrame(latlong_list)
    return latlong_df

In [13]:
latlong_df = build_latlong_dataframe(cafe_df)
latlong_df.head()

Unnamed: 0,Name,City,Latitude,Longitude
0,Brot & Spiele,Graz,47.073272,15.433036
1,Brot und Spiele,Vienna,48.213407,16.349799
2,Café Benno,Vienna,48.21505,16.342587
3,Café Sperlhof,Vienna,48.219658,16.37838
4,SpielBar,Vienna,48.213688,16.348476


In order to determine which rows, if any, are missing data, we now search the DataFrame for null values. This then allows us to clean the data. There is still the potential issue of duplicate values, but we will resolve that afterwards.

In [14]:
def find_missing_values(source_df):
    null_idx = source_df.index[source_df['Latitude'].isnull()].tolist()
    print(f'There are {len(null_idx)} rows missing data.')

    if len(null_idx) > 0:
        return cafe_df.loc[null_idx,:]

In [15]:
missing_df = find_missing_values(latlong_df)
missing_df

There are 52 rows missing data.


Unnamed: 0,Name,Address,City,PostalCode,Country
16,Yam-toto,En Hors-Chateau 43,Liège,4000,Belgium
17,Aux 3D Board Game Cafe,"Place Abbe Joseph, 11",Namur,5000,Belgium
38,Ready Steady Roll,Ivy Lodge,Bedford,MK44 1ND,England
42,The Games Table,"86 Magdalen St, Norwich",Norwich,NR311JF,England
43,Ready Steady Roll,"Ivy Lodge, A6 Rushden Road",Sharnbrook,MK44 1ND,England
44,"Ready, Steady Roll","Studio 5, Ivy Lodge Farm, A6 Rushden Road",Sharnbrook,MK44 1ND,England
48,Red Panda Gaming Cafe,"247 High Street, First floor",Lincoln,LN2 1HW,England
72,c:\ Side Quest,"11 Lower Promenade, Madiera Drive",Brighton,BN2 1ET,England
73,Dice Saloon,"Unit 6, Longley Industrial Estate, New England...",Brighton,BN14GY,England
74,Dice Saloon,"First floor, Vantage Point, New England Road",Brighton,BN1 4GY,England


So there are a number of issues within the data, including a peculiar issue with #220 (Pontypridd is in Wales, not Italy!). This also highlights that we do have some duplicates within our data (see e.g. #43 and #44), and a few instances of cities containing unnecessary whitespace at the end, which will be problematic later. So let's clean the data. 

(Unfortunately, since we are dealing with a number of countries, and a number of different address standards, this cleaning will mostly need to be done by hand and checked against the [Nominatim search](https://nominatim.openstreetmap.org/).)

In [16]:
cafe_df['City'] = cafe_df.apply(lambda x: x.City.rstrip(), axis=1)
cafe_df.loc[7, 'City'] = 'Antwerp'
cafe_df.drop(8, axis=0, inplace=True)
cafe_df.loc[12, 'City'] = 'Brussels'
cafe_df.loc[16, 'Address'] = 'Rue Hors-Château 43'
cafe_df.loc[17, 'Address'] = 'Place Abbé Joseph André 11'
cafe_df.drop(33, axis=0, inplace=True)
cafe_df.loc[38, 'Address'] = 'Rushden Rd' # This is as accurate as Nominatim can get
cafe_df.loc[42, 'PostalCode'] = 'NR2 1EL'
cafe_df.drop(43, axis=0, inplace=True)
cafe_df.drop(44, axis=0, inplace=True)
cafe_df.loc[48, 'Address'] = '247 High Street'
cafe_df.drop(50, axis=0, inplace=True)
cafe_df.loc[61, 'City'] = 'Newcastle-upon-Tyne'
cafe_df.drop(72, axis=0, inplace=True)
cafe_df.loc[73, 'Address'] = '88 London Rd'
cafe_df.loc[73, 'PostalCode'] = 'BN1 4JF'
cafe_df.drop(74, axis=0, inplace=True)
cafe_df.loc[75, 'Address'] = '' # Needed in order for geopy to obtain geographical data
cafe_df.drop(79, axis=0, inplace=True)
cafe_df.loc[82, 'Address'] = 'Abinger place'
cafe_df.loc[84, 'Address'] = '207 Queensway'
cafe_df.loc[84, 'PostalCode'] = 'MK2 2EB'
cafe_df.loc[86, 'Address'] = '149 Albert Rd'
cafe_df.loc[89, 'Address'] = 'The Brooks Centre'
cafe_df.loc[103, 'Address'] = '19a Pepper Street'
cafe_df.loc[103, 'PostalCode'] = 'ST5 1PR'
cafe_df.loc[104, 'Name'] = 'Nerdy Coffee Co.'
cafe_df.loc[124, 'PostalCode'] = '03000'
cafe_df.loc[185, 'Address'] = 'Lehener Straße 15'
cafe_df.loc[194, 'Address'] = 'Λογοθετίδη Βασίλη 14'
cafe_df.loc[194, 'City'] = 'Athens'
cafe_df.loc[196, 'PostalCode'] = '65302'
cafe_df.loc[197, 'PostalCode'] = '41221'
cafe_df.loc[198, 'Address'] = 'Δημητριου Ράλλη 4'
cafe_df.loc[200, 'Address'] = 'Γεωργίου Παπανδρέου 27'
cafe_df.loc[200, 'PostalCode'] = '54645'
cafe_df.loc[202, 'Address'] = 'Βασιλίσσης Σοφίας'
cafe_df.loc[205, 'Address'] = 'Ferenc körút 17'
cafe_df.drop(206, axis=0, inplace=True)
cafe_df.loc[207, 'Name'] = 'Pub Game Up!'
cafe_df.loc[211, 'Address'] = '9 High Street'
cafe_df.loc[211, 'PostalCode'] = 'P75 XW35'
cafe_df.drop(212, axis=0, inplace=True)
cafe_df.loc[213, 'Address'] = '51 Wellington Quay'
cafe_df.loc[214, 'PostalCode'] = 'D02 FP40'
cafe_df.loc[215, 'PostalCode'] = 'H91 Y90F'
cafe_df.loc[216, 'Address'] = 'Via Giuseppe Toniolo 12'
cafe_df.drop(220, axis=0, inplace=True) # This is in fact related to an event called Counters in Pontypridd, Wales, not Italy
cafe_df.loc[223, 'Address'] = 'Strada Alexandr Pușkin 52'
cafe_df.loc[223, 'PostalCode'] = 'MD-2012'
cafe_df.loc[224, 'PostalCode'] = '1432KA'
cafe_df.loc[225, 'PostalCode'] = '1052NP'
cafe_df.loc[227, 'PostalCode'] = '4811GK'
cafe_df.loc[228, 'PostalCode'] = '2611HR'
cafe_df.loc[229, 'PostalCode'] = '2801LV'
cafe_df.loc[230, 'PostalCode'] = '9712NP'
cafe_df.loc[231, 'PostalCode'] = '2011LE'
cafe_df.loc[232, 'PostalCode'] = '2513BW'
cafe_df.loc[233, 'City'] = 'Skopje'
cafe_df.loc[234, 'Address'] = 'Rosepark, Upper Newtownards Road' # This is as accurate as Nominatim allows
cafe_df.loc[234, 'City'] = 'Dundonald'
cafe_df.loc[234, 'PostalCode'] = 'BT4 3SB'
cafe_df.loc[235, 'Address'] = 'Holywood Road'
cafe_df.loc[235, 'City'] = 'Sydenham'
cafe_df.loc[235, 'PostalCode'] = 'BT4 1NT'
cafe_df.loc[238, 'Address'] = 'Dmowskiego 15'
cafe_df.loc[239, 'Address'] = 'Kamienna 7'
cafe_df.loc[248, 'Name'] = 'Ludoclube'
cafe_df.loc[248, 'PostalCode'] = '2720-046'
cafe_df.loc[249, 'Name'] = 'Pow Wow'
cafe_df.loc[249, 'Address'] = 'Rua Professor Fernando da Fonseca 19'
cafe_df.loc[249, 'PostalCode'] = '1600-235'
cafe_df.loc[250, 'Name'] = 'A Jogar é que a gente se entende'
cafe_df.loc[250, 'Address'] = 'Rua Doutor Elias de Aguiar 244'
cafe_df.loc[250, 'PostalCode'] = '4480-789'
cafe_df.loc[252, 'Name'] = 'Snakes & Wizards'
cafe_df.loc[252, 'Address'] = 'Strada Ilarie Chendi 5'
cafe_df.loc[253, 'Address'] = 'Strada Samuil Micu 4'
cafe_df.loc[256, 'Name'] = 'FatCats Board Game Cafe'
cafe_df.loc[256, 'PostalCode'] = '100337'
cafe_df.drop(264, axis=0, inplace=True)
cafe_df.drop(272, axis=0, inplace=True)
cafe_df.loc[280, 'Address'] = "Carrer de l'Alandir 1" # The actual address "Carrer Hospitalers de Sant Joan n.2" is a footpath so doesn't show
cafe_df.loc[282, 'Address'] = 'Av. Manuel Torres 5' # In Nominatim the 'de' given in Google Map returns an error
cafe_df.loc[284, 'Address'] = 'Carrer de Rosselló i Cazador, 7'
cafe_df.loc[286, 'PostalCode'] = '411 19'
cafe_df.loc[291, 'PostalCode'] = '06490'
cafe_df.loc[292, 'Address'] = 'Nail Bey Sk. No:48/2' # In Nominatim the street name is Nail Bey, not Nailbey as with Google Maps
cafe_df.loc[292, 'City'] = 'Istanbul'

With the values cleaned up and duplicates removed, let's reset the index. Also, let's group the cafés in terms of countries and cities, as this was previously not the case (possibly due to some cafés being assigned a region that affected the results).

In [17]:
cafe_df.sort_values(['Country', 'City'], inplace=True)
cafe_df.reset_index(drop=True, inplace=True)
cafe_df.head()

Unnamed: 0,Name,Address,City,PostalCode,Country
0,Brot & Spiele,Mariahilferstraße 17,Graz,8020,Austria
1,Brot und Spiele,Laudongasse 22,Vienna,1080,Austria
2,Café Benno,Alser Str. 67,Vienna,1080,Austria
3,Café Sperlhof,Große Sperlgasse 41,Vienna,1020,Austria
4,SpielBar,Lederergasse 26,Vienna,1080,Austria


So let's run the geolocator function again, and see if we now have geodata for each café:

In [18]:
latlong_df = build_latlong_dataframe(cafe_df)
missing_df = find_missing_values(latlong_df)
missing_df

There are 0 rows missing data.


Now that we have no missing data, let's combine the two DataFrames for future use.

In [19]:
cafe_df = cafe_df.merge(latlong_df, on=['Name', 'City'])
cafe_df.head()

Unnamed: 0,Name,Address,City,PostalCode,Country,Latitude,Longitude
0,Brot & Spiele,Mariahilferstraße 17,Graz,8020,Austria,47.073272,15.433036
1,Brot und Spiele,Laudongasse 22,Vienna,1080,Austria,48.213407,16.349799
2,Café Benno,Alser Str. 67,Vienna,1080,Austria,48.21505,16.342587
3,Café Sperlhof,Große Sperlgasse 41,Vienna,1020,Austria,48.219658,16.37838
4,SpielBar,Lederergasse 26,Vienna,1080,Austria,48.213688,16.348476


#### Collecting Cities Within Europe Without Board Game Cafés

In order for us to have negative data, we need to collect regions within Europe that do not contain board game cafés. To this end we utilise the list of [the 500 largest European cities given by City Mayors](http://www.citymayors.com/features/euro_cities.html), and from this list remove any instances of cities that are also contained within `cafe_df`. 

With this list of cities we then utilise `geopy` to find the latitude and longitude of central points within each city, that will then be later used with the FourSquare API.

In [20]:
city_list = []
for n in range(1, 6):
    # The City Mayors list runs over 5 pages, hence the loop
    url = f'http://www.citymayors.com/features/euro_cities{n}.html'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find_all('table')[1]
    city_table = [td.text.title().rstrip() for td in table.find_all('td', attrs={'width':'140'})[1:]]
    city_list.extend(city_table)

Unfortunately, the data on City Mayors omits Turkey, and so we need to collect this information from [Wikipedia](https://en.wikipedia.org/wiki/List_of_largest_cities_and_towns_in_Turkey). The smallest population value given in the City Mayors table is 149,000 people, and so we will include all cities whose population exceeds this value.

In addition, a number of English counties, rather than cities, are within the data, so we shall remove those, and we need to use 'Rome' instead of 'Roma' to ensure it is properly treated.

In [21]:
# Replacing 'Roma' with 'Rome'
city_list = list(map(lambda x: x.replace('Roma', 'Rome'), city_list))

In [22]:
# Collecting and adding Turkish cities to the list
turkey_url = 'https://en.wikipedia.org/wiki/List_of_largest_cities_and_towns_in_Turkey'
turkey_resp = requests.get(turkey_url)
turkey_soup = BeautifulSoup(turkey_resp.content, 'lxml')
turkey_table = turkey_soup.find('table', attrs={'class': 'sortable'})
city_pop = [td.text.rstrip('\n').replace(',', '') for td in turkey_table.find_all('td')[6::8]]
turkey_cities = [a.text for a in turkey_table.find_all('a')[::2]]

for city, pop in zip(turkey_cities, city_pop):
    # At least one population value given by Wikipedia is '-', and so we pass over any such cities
    try:
        if int(pop) > 149000:
            city_list.append(city)
    except ValueError:
        pass
print(f'city_list now contains {len(city_list)} cities')

city_list now contains 543 cities


In [23]:
# Removing English counties
# Iterate over a copy so that removing items does not skip the following element
for city in city_list[:]:
    if 'shire' in city:
        city_list.remove(city)
print(f'city_list now contains {len(city_list)} cities')

city_list now contains 534 cities
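
One detail worth knowing when filtering a list in a loop: removing items from the list that is being iterated over can skip the element that follows each removal. A small illustration with made-up names (not taken from `city_list`):

```python
names = ['Bedfordshire', 'Berkshire', 'Bristol', 'Cheshire']

# Removing inside the loop: after 'Bedfordshire' is removed, 'Berkshire'
# slides into its slot and the loop moves past it without testing it
in_place = list(names)
for name in in_place:
    if 'shire' in name:
        in_place.remove(name)

# Building a new list tests every element exactly once
filtered = [name for name in names if 'shire' not in name]

print(in_place)
print(filtered)
```

Two matching entries in a row are therefore the case to watch for; iterating over a copy, or building a new list as in `filtered`, avoids the problem.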


We now take the cities listed in `cafe_df` and, rather than directly remove them from `city_list`, we create a new list `city_without_cafe` that only contains cities without a board game café.

In [24]:
cafe_city_list = cafe_df['City'].unique().tolist()
cafe_city_list.sort()
city_list.sort()
city_without_cafe = []

In [25]:
def has_a_cafe(city):
    # A quick function to determine if a city appears in cafe_df
    inclusion = [(cafe.lower() in city.lower()) for cafe in cafe_city_list]
    return any(inclusion)
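
Note that the test is a substring match rather than an equality check, so spelling variants of a café city are caught, at the cost of occasionally excluding an unrelated city whose name happens to contain a café city. A minimal sketch of that behaviour, using hypothetical city names and a helper that takes the list as an argument:

```python
# Hypothetical café cities and candidate cities, for illustration only
cafe_cities = ['Newcastle', 'Bath']

def matches_cafe_city(city, cafe_cities):
    """True if any café city name occurs inside the candidate city name."""
    city = city.lower()
    return any(cafe.lower() in city for cafe in cafe_cities)

wanted = matches_cafe_city('Newcastle-Upon-Tyne', cafe_cities)   # Variant spelling is caught
unwanted = matches_cafe_city('Bathgate', cafe_cities)            # A different town is caught too
no_match = matches_cafe_city('Turku', cafe_cities)
```

For building negative examples this errs on the safe side: a wrongly excluded city costs one data point, whereas a wrongly included one would mislabel a city that does have a café.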

In [26]:
for city in city_list:
    if not has_a_cafe(city):
        city_without_cafe.append(city)
print(f'\ncity_without_cafe has {len(city_without_cafe)} cities')


city_without_cafe has 438 cities


In [27]:
# Test that we are getting a suitable output
city_without_cafe[:5]

['Aachen', 'Abakan', 'Aberdeen', 'Adana', 'Adapazarı']

In [28]:
column_order = ['Name', 'City']

cwc_df = pd.DataFrame(city_without_cafe, columns=['City'])
cwc_df['Name'] = np.nan
cwc_df = cwc_df.reindex(columns=column_order)
cwc_df.head()

Unnamed: 0,Name,City
0,,Aachen
1,,Abakan
2,,Aberdeen
3,,Adana
4,,Adapazarı


There are some data points that either do not give a result with `geopy`, or point to the wrong location (e.g. Van, in Turkey, gets mistaken for Vietnam), so let's fix those.

In [29]:
cwc_df.loc[34] = 'Bila Tserkva'
cwc_df.loc[80] = 'Chernivtsi'
cwc_df.loc[90] = 'Kamianske'
cwc_df.loc[100] = 'Yekaterinburg'
cwc_df.loc[135] = 'Yoshkar-Ola'
cwc_df.loc[176] = 'Kremenchuk'
cwc_df.loc[208] = 'Makiivka'
cwc_df.loc[242] = 'Nizhnevartovsk'
cwc_df.loc[258] = 'Oldham, Greater Manchester'
cwc_df.loc[279] = 'Piraeus'
cwc_df.drop(293, axis=0, inplace=True) # Rhondda Cynon Taf is a Welsh county whose largest town has less than 31000 people
cwc_df.loc[344] = 'Stary Oskol'
cwc_df.loc[355] = 'Syktyvkar'
cwc_df.drop(380, axis=0, inplace=True) # We perhaps don't want Turku within our training data
cwc_df.loc[389] = 'Yuzhno-Sakhalinsk'
cwc_df.loc[392] = 'Van, İpekyolu'
cwc_df.loc[400] = 'Vinnytsia'
cwc_df.loc[420] = 'Yaroslavl'

In [30]:
cwc_df = build_latlong_dataframe(cwc_df)
print(cwc_df.shape)
cwc_df.head()

(436, 4)


Unnamed: 0,Name,City,Latitude,Longitude
0,,Aachen,50.776351,6.083862
1,,Abakan,53.720661,91.440369
2,,Aberdeen,57.148243,-2.092809
3,,Adana,36.993617,35.325835
4,,Adapazarı,40.784799,30.399683


In [31]:
missing_df = find_missing_values(cwc_df)
missing_df

There are 0 rows missing data.


Because of the size of Russia, a number of its cities lie within Asia. As such, we will drop any cities that are east of the Ural Mountains (whose longitude according to [Wikipedia](https://en.wikipedia.org/wiki/Ural_Mountains) is roughly 60°E).

In [32]:
asia_idx = cwc_df[cwc_df['Longitude'] > 60].index
cwc_df.drop(asia_idx, inplace=True)
cwc_df.reset_index(drop=True, inplace=True)
print(cwc_df.shape)
cwc_df.head()

(402, 4)


Unnamed: 0,Name,City,Latitude,Longitude
0,,Aachen,50.776351,6.083862
1,,Aberdeen,57.148243,-2.092809
2,,Adana,36.993617,35.325835
3,,Adapazarı,40.784799,30.399683
4,,Adıyaman,37.78936,38.31411


#### Visualisation of Locations

Below is a plot of the points that we have collected, with green points denoting board game cafés within Europe, and orange points the centres of cities that do not have a board game café.

In [33]:
# Approximate centre point of Europe for the initial map view
latitude, longitude = 50, 20

map_eu = folium.Map(location=[latitude, longitude], zoom_start=4)

cafe_label = folium.FeatureGroup(name='Cafés')
cwc_label = folium.FeatureGroup(name='Cities without cafés')

map_eu.add_child(cafe_label)
map_eu.add_child(cwc_label)

for lat, lng, cafe, city in zip(cafe_df['Latitude'], cafe_df['Longitude'], cafe_df['Name'], cafe_df['City']):
    label = f'{cafe}, {city}'
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng], 
        radius=5,
        popup=label,
        color='#80cb34',
        weight=2,
        fill=True,
        fill_color='#CBCB34',
        fill_opacity=0.8,
        parse_html=False).add_to(cafe_label)
    
for lat, lng, city in zip(cwc_df['Latitude'], cwc_df['Longitude'], cwc_df['City']):
    label = city
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng], 
        radius=5,
        popup=label,
        color='#f96706',
        weight=2,
        fill=True,
        fill_color='#F9E106',
        fill_opacity=0.8,
        parse_html=False).add_to(cwc_label)

folium.LayerControl(collapsed=False).add_to(map_eu)
    
map_eu

#### Obtaining Nearby Venues

Now that we have each location, we need to obtain the venues within their immediate vicinity by making use of the FourSquare API. We shall limit ourselves to at most 40 venues within a radius of 200 metres of each point in our DataFrames.

In addition, to prevent adding some tautological bias into our subsequent model, we will remove the instances of our board game cafés whenever they appear in our data to the best of our ability, especially if they are specifically labelled with the category 'Gaming cafe' (with [FourSquare code](https://developer.foursquare.com/docs/build-with-foursquare/categories/) `4bf58dd8d48988d18d941735`).

In [34]:
CLIENT_ID = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' 
CLIENT_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' 

In [35]:
VERSION = '20180605'
RADIUS = 200
LIMIT = 40
CAT_ID = '4bf58dd8d48988d18d941735'  # Foursquare category ID for gaming cafés

In [36]:
def get_local_venues(df):
    venue_list = []
    
    for row in df[['Name', 'City', 'Latitude', 'Longitude']].to_numpy():
        lat, lng = row[2:]
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            RADIUS,
            LIMIT)
        results = requests.get(url).json()
        for result in results['response']['venues']:
            if len(result['categories']) > 0:
                venue = {'Café_name': row[0], 
                         'Café_city': row[1], 
                         'Venue_name': result['name'], 
                         'Venue_cat': result['categories'][0]['name'], 
                         'Venue_catID': result['categories'][0]['id']
                        }
                venue_list.append(venue)
    return pd.DataFrame(venue_list)
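
As an alternative to formatting the query string by hand, the parameters can be passed as a dictionary so that they are encoded automatically. A minimal sketch using only the standard library, with the endpoint and parameter names taken from the cell above; the `build_search_url` helper and the credentials are placeholders, and no request is sent:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://api.foursquare.com/v2/venues/search'

def build_search_url(lat, lng, client_id, client_secret,
                     version='20180605', radius=200, limit=40):
    """Assemble the venue search URL with properly encoded parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lng}',
        'radius': radius,
        'limit': limit,
    }
    return f'{BASE_URL}?{urlencode(params)}'

# Placeholder credentials; the URL is only built and inspected here
url = build_search_url(47.073272, 15.433036, 'MY_ID', 'MY_SECRET')
query = parse_qs(urlsplit(url).query)
```

With `requests`, the same effect is obtained by calling `requests.get(BASE_URL, params=params)`, which performs the encoding itself.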

In [37]:
venue_df = get_local_venues(cafe_df)
print(venue_df.shape)
venue_df.head()

(10947, 5)


Unnamed: 0,Café_name,Café_city,Venue_name,Venue_cat,Venue_catID
0,Brot & Spiele,Graz,Brot & Spiele,Bar,4bf58dd8d48988d116941735
1,Brot & Spiele,Graz,Die Scherbe,Bar,4bf58dd8d48988d116941735
2,Brot & Spiele,Graz,Mursteg,Bridge,4bf58dd8d48988d1df941735
3,Brot & Spiele,Graz,Noël,Bar,4bf58dd8d48988d116941735
4,Brot & Spiele,Graz,Paul & Bohne,Coffee Shop,4bf58dd8d48988d1e0931735


In [38]:
reduced_venue_df = venue_df.copy() # To preserve venue_df

# Create a list of unique café names in lower case, with instances where 
# 'café' and 'cafe' being the only difference removed
cafe_list = {cafe.lower() for cafe in cafe_df['Name'].unique()}
cafe_list = {cafe.replace('café', 'cafe') for cafe in cafe_list}

# Ensuring that venues that only differ in the spelling of 'café' are treated the same
venue_df['Venue_name'] = venue_df['Venue_name'].apply(lambda x: x.replace('Café', 'Cafe'))
venue_df['Café_name'] = venue_df['Café_name'].apply(lambda x: x.replace('Café', 'Cafe'))

# Removing a similar issue with some German cafés that differ in using '&' or 'und'
venue_df['Venue_name'] = venue_df['Venue_name'].apply(lambda x: x.replace('und', '&'))
venue_df['Café_name'] = venue_df['Café_name'].apply(lambda x: x.replace('und', '&'))

# Conditional on a venue either being classed as a gaming café or containing a name from cafe_list
# (the names in cafe_list are already lower case, and the substring test also covers exact matches)
to_remove = venue_df.apply(lambda x: x['Venue_catID'] == CAT_ID or
                           any(cafe in x['Venue_name'].lower() for cafe in cafe_list), 
                           axis=1
                          )

print(f"We shall remove {to_remove.sum()} elements")

# Get the indices of venues to remove
idx_to_remove = to_remove[to_remove].index

# Drop the venues that were True in to_remove
reduced_venue_df.drop(idx_to_remove, inplace=True)
reduced_venue_df.reset_index(drop=True, inplace=True)
print(f"We now have {reduced_venue_df.shape[0]} data points")

reduced_venue_df.head(10)

We shall remove 152 elements
We now have 10795 data points


Unnamed: 0,Café_name,Café_city,Venue_name,Venue_cat,Venue_catID
0,Brot & Spiele,Graz,Die Scherbe,Bar,4bf58dd8d48988d116941735
1,Brot & Spiele,Graz,Mursteg,Bridge,4bf58dd8d48988d1df941735
2,Brot & Spiele,Graz,Noël,Bar,4bf58dd8d48988d116941735
3,Brot & Spiele,Graz,Paul & Bohne,Coffee Shop,4bf58dd8d48988d1e0931735
4,Brot & Spiele,Graz,The Hungry Heart - American Street Food,Hot Dog Joint,4bf58dd8d48988d16f941735
5,Brot & Spiele,Graz,Offline Retail,Thrift / Vintage Store,4bf58dd8d48988d101951735
6,Brot & Spiele,Graz,Erich-Edegger-Steg,Bridge,4bf58dd8d48988d1df941735
7,Brot & Spiele,Graz,SCHRANZER / Möbel / Innenarchitektur / Handwerk,Construction & Landscaping,5454144b498ec1f095bff2f2
8,Brot & Spiele,Graz,Hotel Feichtinger Graz,Hotel,4bf58dd8d48988d1fa931735
9,Brot & Spiele,Graz,Lotte,Café,4bf58dd8d48988d16d941735
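
The name matching used to remove the cafés themselves can be sketched on a handful of strings. This is a variation on the cell above rather than a copy of it: the hypothetical `normalise` helper replaces 'und' only as a whole word, and the venue names 'Cafe Benno Wien' and 'Hund und Katz' are invented for the example:

```python
def normalise(name):
    """Lower-case a venue name and unify spellings that commonly differ."""
    name = name.lower().replace('café', 'cafe')
    # Replace 'und' only when it is a whole word, so names such as 'Hund' are untouched
    words = ['&' if word == 'und' else word for word in name.split()]
    return ' '.join(words)

# Listed cafés and nearby venue names (partly hypothetical)
listed_cafes = {normalise(n) for n in ['Brot und Spiele', 'Café Benno']}
nearby = ['Brot & Spiele', 'Cafe Benno Wien', 'Die Scherbe', 'Hund und Katz']

# Keep only the venues whose normalised name contains none of the listed cafés
kept = [v for v in nearby
        if not any(cafe in normalise(v) for cafe in listed_cafes)]
print(kept)
```

Normalising both sides in the same way is what allows 'Brot & Spiele' to be recognised as the listed 'Brot und Spiele'.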


In [39]:
cwc_venue_df = get_local_venues(cwc_df)
cwc_venue_df.head()

Unnamed: 0,Café_name,Café_city,Venue_name,Venue_cat,Venue_catID
0,,Aachen,Rathaus,City Hall,4bf58dd8d48988d129941735
1,,Aachen,Krönungssaal,History Museum,4bf58dd8d48988d190941735
2,,Aachen,Markt,Plaza,4bf58dd8d48988d164941735
3,,Aachen,Starbucks,Coffee Shop,4bf58dd8d48988d1e0931735
4,,Aachen,Brauerei Goldener Schwan,German Restaurant,4bf58dd8d48988d10d941735


Just to make sure, we check that there are no board game cafés that were not contained within the original `csv` file.

In [40]:
cwc_venue_df[cwc_venue_df['Venue_catID'] == CAT_ID]

Unnamed: 0,Café_name,Café_city,Venue_name,Venue_cat,Venue_catID
8986,,Oberhausen,Cardgalaxy,Gaming Cafe,4bf58dd8d48988d18d941735
13550,,Ulyanovsk,Остров Развлечений,Gaming Cafe,4bf58dd8d48988d18d941735
14878,,Zaporozhye,Druzi Cafe&Bar,Gaming Cafe,4bf58dd8d48988d18d941735


Of those listed, Cardgalaxy is a card and board game shop, not a café, Остров Развлечений ("Island Fun") is a permanently closed family entertainment centre, and Druzi is a bar with video games. As such, we shall delete Остров Развлечений, and change the venue categories for Cardgalaxy and Druzi.

In [41]:
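# Rows are selected by venue name and category, since the row indices can change
# between API runs, and .loc[rows, column] is used so the assignment modifies
# cwc_venue_df itself rather than a temporary copy
is_gaming = cwc_venue_df['Venue_catID'] == CAT_ID
# Changing Cardgalaxy
cwc_venue_df.loc[is_gaming & (cwc_venue_df['Venue_name'] == 'Cardgalaxy'), 'Venue_cat'] = 'Toy / Game Store'
# Changing Druzi
cwc_venue_df.loc[is_gaming & (cwc_venue_df['Venue_name'] == 'Druzi Cafe&Bar'), 'Venue_cat'] = 'Bar'
# Dropping Остров Развлечений
cwc_venue_df = cwc_venue_df[~(is_gaming & (cwc_venue_df['Venue_name'] == 'Остров Развлечений'))]
cwc_venue_df = cwc_venue_df.reset_index(drop=True)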

Finally, we merge the two DataFrames and drop the `Venue_catID` column as it will no longer be needed, as well as add in the label column `Has_board_game_café`.

In [42]:
reduced_venue_df['Has_board_game_café'] = 1
cwc_venue_df['Has_board_game_café'] = 0
total_venues = pd.concat([reduced_venue_df, cwc_venue_df], ignore_index=True)
total_venues.drop('Venue_catID', axis=1, inplace=True)

In [43]:
total_venues.head()

Unnamed: 0,Café_name,Café_city,Venue_name,Venue_cat,Has_board_game_café
0,Brot & Spiele,Graz,Die Scherbe,Bar,1
1,Brot & Spiele,Graz,Mursteg,Bridge,1
2,Brot & Spiele,Graz,Noël,Bar,1
3,Brot & Spiele,Graz,Paul & Bohne,Coffee Shop,1
4,Brot & Spiele,Graz,The Hungry Heart - American Street Food,Hot Dog Joint,1


In [44]:
total_venues.tail()

Unnamed: 0,Café_name,Café_city,Venue_name,Venue_cat,Has_board_game_café
26228,,Şanlıurfa,A.Koymat Çok Amaçlı Tesis,Social Club,0
26229,,Şanlıurfa,Bağ,Garden,0
26230,,Şanlıurfa,Live Cafe,Café,0
26231,,Şanlıurfa,Yeşilyurt Mehmetogulları Tarım Arazisi,Tree,0
26232,,Şanlıurfa,Çemilcik Çiftik,Farm,0


With `total_venues`, our next step will be to one-hot encode the venue categories and then group the encodings in terms of each café name and city (or just city for those without cafés) and then take their means. This will leave us with labelled normalised data that we can then use to train a logistic regression model, which will calculate the probability of suitability of the districts in Turku for a board game café.
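
To make the planned feature construction concrete: taking the mean of one-hot encoded categories within a group gives the share of each category among that group's venues. A minimal sketch in plain Python on hypothetical data (the location and category values are invented):

```python
from collections import Counter, defaultdict

# Hypothetical (location, venue category) pairs
venues = [
    ('Cafe A', 'Bar'), ('Cafe A', 'Bar'), ('Cafe A', 'Bridge'), ('Cafe A', 'Hotel'),
    ('City B', 'Plaza'), ('City B', 'Bar'),
]

# Count how often each category occurs around each location
counts = defaultdict(Counter)
for location, category in venues:
    counts[location][category] += 1

categories = sorted({category for _, category in venues})

# The mean of a one-hot column over a group is the share of that category in the group
features = {
    location: [counter[cat] / sum(counter.values()) for cat in categories]
    for location, counter in counts.items()
}
print(categories)
print(features)
```

In pandas the same result would typically come from `pd.get_dummies` followed by a `groupby(...).mean()`; each row of features then sums to 1, which is the normalisation referred to above.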