# **Capstone Project - Compatible Neighborhoods for Indian Restaurants**
## **Prakirth Govardhanam**
## **Applied Data Science Capstone by IBM/Coursera**

## Introduction/Business-Problem
In this project, I look for potentially beneficial locations within the Neighborhoods of Helsinki, Finland, for establishing a chain of Indian restaurants. The conditions to fulfill, in order of priority, are:

* CONDITION 1 - Distance from the Popularity Centre of the Neighborhood (see Project Assumption) - for popularity
* CONDITION 2 - Absence of other Indian restaurants in the Neighborhood - to limit competition
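
Condition 1 needs a distance between a candidate location and the popularity centre. The notebook does not state which distance measure is used, so the following is only a minimal sketch that assumes the great-circle (haversine) distance is acceptable. The function name `haversine_km` and the Earth-radius constant are illustrative choices, not part of the original analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation, assumed here)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Coordinates of Eira and Alppiharju as geocoded later in this notebook
d = haversine_km(60.156191, 24.938375, 60.189728, 24.94412)
print(f"Eira to Alppiharju: {d:.2f} km")
```

Within a single city, a planar approximation would likely give nearly the same ranking; the haversine form is used here only because it needs no map projection.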


## Data
Data sources used to determine the Neighborhoods within the city of Helsinki are provided by:

* Wikipedia (https://en.wikipedia.org/wiki/Names_of_places_in_Finland_in_Finnish_and_in_Swedish#Municipalities) - for listing the Neighborhoods of Helsinki
* The City of Helsinki (https://kartta.hel.fi/avoindata) - for geospatial data
* Foursquare API - for popular venues, restaurants and their respective geospatial data

## Project Assumption
**_Popularity Centre_** = the centroid of the top-10 venues (ranked by rating) in a Neighborhood. This point is treated as the "popularity centre" of that Neighborhood.
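
The assumption can be sketched as follows. This is only an illustration, assuming that each venue is a record with a rating and a coordinate pair and that the centroid is the plain arithmetic mean of latitudes and longitudes (reasonable at neighborhood scale). The function name `popularity_centre`, the dictionary keys, and the sample venues are all hypothetical.

```python
def popularity_centre(venues, top_n=10):
    """Return the (lat, lon) centroid of the top_n highest-rated venues."""
    if not venues:
        raise ValueError("no venues given")
    # Sort by rating, best first, and keep at most top_n venues
    top = sorted(venues, key=lambda v: v["rating"], reverse=True)[:top_n]
    lat = sum(v["lat"] for v in top) / len(top)
    lon = sum(v["lon"] for v in top) / len(top)
    return lat, lon

# Hypothetical venues; ratings and coordinates are made up for illustration
sample_venues = [
    {"name": "A", "rating": 9.1, "lat": 60.10, "lon": 24.90},
    {"name": "B", "rating": 8.4, "lat": 60.20, "lon": 25.00},
    {"name": "C", "rating": 5.0, "lat": 61.00, "lon": 26.00},
]
centre = popularity_centre(sample_venues, top_n=2)
print(centre)
```

With `top_n=2` the lowest-rated venue "C" is ignored, so the centre falls midway between "A" and "B".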

# PART 1 - Data Preparation

## Part 1.1 - Data Extraction

### Import necessary libraries

In [2]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import geocoder
from geopy.geocoders import Nominatim

### Clarification:
* Places in Finland generally have names in 2 languages, Finnish & Swedish
* Hence, the labels of Neighborhoods & Districts follow the same pattern: Postal-Code Finnish-name (Swedish-name)
### Assumption #1
* In the extracted labels, a Finnish name is available for every place, whereas a Swedish name is not.
* Hence, we will extract and work only with the Finnish names of the Neighborhoods & Districts
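
A label in the pattern described above can be split with a regular expression. The sketch below assumes a 5-digit postal code and an optional Swedish name in parentheses; the pattern `LABEL_RE`, the helper `parse_label`, and the sample labels (including their postal codes) are illustrative and not taken from the extracted data.

```python
import re

# Postal code, Finnish name, then an optional Swedish name in parentheses
LABEL_RE = re.compile(
    r"^\s*(?P<postal>\d{5})\s+(?P<finnish>[^()]+?)\s*(?:\((?P<swedish>[^()]*)\))?\s*$"
)

def parse_label(label):
    """Return (postal_code, finnish_name, swedish_name_or_None), or None if no match."""
    m = LABEL_RE.match(label)
    if m is None:
        return None
    return m.group("postal"), m.group("finnish"), m.group("swedish")

# Illustrative labels only
print(parse_label("00250 Töölö (Tölö)"))
print(parse_label("00520 Pasila"))
```

Labels without a Swedish name still parse, with `None` in the third position, which matches the decision to rely on Finnish names only.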


In [28]:
#url with Helsinki District names
url = 'https://en.wikipedia.org/wiki/Names_of_places_in_Finland_in_Finnish_and_in_Swedish#Municipalities'

#parsing the webpage for html content
html = requests.get(url).text
soup = BeautifulSoup(html, features='html.parser')

#extract <a href> tags
atags = soup.select('a[href]')

#extract titles of <a href> tags
titles = [atag.get('title') for atag in atags]

#slice the labels of Helsinki Districts
districts = titles[titles.index('Ala-Malmi'): titles.index('Ylä-Malmi')+1]
print(f"Total Districts listed: {len(districts)}")

Total Districts listed: 110
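
The slice above relies on both boundary titles being present, and `atag.get('title')` returns `None` for links without a title. A slightly more defensive variant might look like the sketch below. The helper `slice_between` and the toy list are hypothetical, and dropping `None` entries is an added assumption that could change the count if the real slice contains untitled links.

```python
def slice_between(items, first, last):
    """Return items from `first` through `last` inclusive, skipping None entries."""
    try:
        start = items.index(first)
        stop = items.index(last, start)  # search for `last` only after `first`
    except ValueError as err:
        raise ValueError(f"boundary entry not found: {err}") from err
    return [item for item in items[start:stop + 1] if item is not None]

# Toy list standing in for the scraped titles
sample_titles = [None, "Helsinki", "Ala-Malmi", None, "Eira", "Ylä-Malmi", "Espoo"]
sample_districts = slice_between(sample_titles, "Ala-Malmi", "Ylä-Malmi")
print(sample_districts)
```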


In [29]:
#extract coordinates from District/Neighborhood names using geopy.geocoders.Nominatim
geolocator = Nominatim(user_agent='Helsinki_districts')

#empty lists for latitude & longitude values
lats = []
longs = []

#looping through district names for coordinates
for name in districts:
    location = geolocator.geocode(name)
    try:
        lats.append(location.latitude)
        longs.append(location.longitude)
    except AttributeError:
        #geocode() returned None: the name is skipped, so lats/longs can end up shorter than districts
        pass

In [30]:
print(f"Total values identified \n(Latitude, Longitude): {len(lats), len(longs)}")

Total values identified 
(Latitude, Longitude): (109, 109)
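
The count of 109 against 110 districts shows that one name was skipped, which also means the coordinate lists no longer line up index-for-index with `districts`. One way to avoid this, sketched here under the assumption that a missing result can simply be recorded as `None`, is to keep one record per name. The helper `geocode_all` and the offline stand-in `fake_geocode` are hypothetical; in the notebook, `geolocator.geocode` would be passed instead.

```python
from types import SimpleNamespace

def geocode_all(names, geocode):
    """Geocode every name, keeping one record per name so positions never shift."""
    records = []
    for name in names:
        location = geocode(name)
        if location is None:
            records.append((name, None, None))  # keep the slot, mark it missing
        else:
            records.append((name, location.latitude, location.longitude))
    return records

# Offline stand-in for geolocator.geocode so the sketch runs without network access
def fake_geocode(name):
    known = {"Eira": SimpleNamespace(latitude=60.156191, longitude=24.938375)}
    return known.get(name)

records = geocode_all(["Eira", "Kampinmalmi"], fake_geocode)
missing = [name for name, lat, lon in records if lat is None]
print(missing)
```

With this layout the unresolved names can be read off directly, without a second pass through the geocoder.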


### Investigating NoneType & improper coordinates

In [31]:
# Investigating None value in districts list, if Any
trial = []
for name in districts:
    location = geolocator.geocode(name)
    try:
        trial.append(location.latitude)
    except AttributeError as err:
        print('None value detected!')
        raise

None value detected!


AttributeError: 'NoneType' object has no attribute 'latitude'

In [32]:
#Identify District with NoneType coordinate
print(f"District with NoneType coordinate:\n{districts[len(trial)]}")

District with NoneType coordinate:
Kampinmalmi


In [33]:
#Direct verification 
geolocator.geocode('Kampinmalmi').latitude

AttributeError: 'NoneType' object has no attribute 'latitude'

In [34]:
#Identify Districts with improper coordinates (valid latitudes are expected to lie within 60 - 63)
#lats has no entry for 'Kampinmalmi', which comes before both names in districts, hence the +1 offset
print(f"District with improper coordinate:\n{districts[lats.index(-10.3333333)+1], districts[lats.index(13.744717)+1]}")

District with improper coordinate:
('Pasila', 'Töölö')
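
The lookup above depends on knowing the exact improper latitude values in advance. A range check can find such outliers without hard-coding them. The sketch below assumes a rough bounding box around Helsinki; the limits in `LAT_RANGE` and `LON_RANGE` are my own approximation rather than values from the data sources, and `in_helsinki` is a hypothetical helper.

```python
# Rough bounding box around Helsinki; these limits are an assumption
LAT_RANGE = (59.9, 60.5)
LON_RANGE = (24.5, 25.5)

def in_helsinki(lat, lon):
    """True if the point lies inside the assumed bounding box."""
    return LAT_RANGE[0] <= lat <= LAT_RANGE[1] and LON_RANGE[0] <= lon <= LON_RANGE[1]

# (district, latitude, longitude) records using values seen in this notebook
sample_records = [
    ("Eira", 60.156191, 24.938375),
    ("Pasila", -10.3333333, -53.2),
    ("Töölö", 13.744717, -1.9645989),
]
outliers = [name for name, lat, lon in sample_records if not in_helsinki(lat, lon)]
print(outliers)
```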


In [35]:
# Direct verification of the locations returned for the Districts with improper coordinates
print(f"Locations as identified by geopy.geocoders API:\n{geolocator.geocode('Pasila'), geolocator.geocode('Töölö')}")

Locations as identified by geopy.geocoders API:
(Location(Brasil, (-10.3333333, -53.2, 0.0)), Location(Toolo, Loroum, Nord, Burkina Faso, (13.744717, -1.9645989, 0.0)))
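
Both names resolved to places outside Finland, which suggests the bare district name is too ambiguous a query. A possible remedy, not tested in this notebook, is to qualify each query with the city and country. geopy's `Nominatim.geocode` also documents a `country_codes` argument that may serve the same purpose. The helpers `qualified_query` and `geocode_in_city` and the stand-in `echo_geocode` are hypothetical.

```python
def qualified_query(name, city="Helsinki", country="Finland"):
    """Build a geocoder query that names the city and country explicitly."""
    return f"{name}, {city}, {country}"

def geocode_in_city(name, geocode):
    """Call `geocode` (for example geolocator.geocode) with the qualified query."""
    return geocode(qualified_query(name))

# Offline stand-in that only records the query it receives
seen = []
def echo_geocode(query):
    seen.append(query)
    return None

result = geocode_in_city("Pasila", echo_geocode)
print(seen)
```

Whether the qualified query returns the intended district still has to be checked against the expected coordinate range.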


In [36]:
#Districts, Latitudes & Longitudes with NoneType & Improper coordinates - to be removed from Lists

print(f"BEFORE Cleaning:\nTotal Districts:{len(districts)}\nTotal Latitude values:{len(lats)}\nTotal Longitude values:{len(longs)}")

loc_to_pop = ['Kampinmalmi','Pasila', 'Töölö']
lat_to_pop = [-10.3333333, 13.744717]
long_to_pop = [-53.2, -1.9645989]

#Remove districts without coordinates and with improper coordinates
for loc in loc_to_pop:
    districts.remove(loc)

#Remove improper coordinates
for lat, long in zip(lat_to_pop, long_to_pop):
    lats.remove(lat)
    longs.remove(long)
    
print(f"\nAFTER Cleaning:\nTotal Districts:{len(districts)}\nTotal Latitude values:{len(lats)}\nTotal Longitude values:{len(longs)}")

BEFORE Cleaning:
Total Districts:110
Total Latitude values:109
Total Longitude values:109

AFTER Cleaning:
Total Districts:107
Total Latitude values:107
Total Longitude values:107


In [41]:
#Frame all extracted values in a Dataframe
districts_df = pd.DataFrame(data= zip(districts, lats, longs), columns=['District', 'Latitude', 'Longitude'])
districts_df.head()

|   | District | Latitude | Longitude |
|---|---|---|---|
| 0 | Ala-Malmi | 60.249474 | 25.014539 |
| 1 | Alppiharju | 60.189728 | 24.94412 |
| 2 | Aurinkolahti | 60.201507 | 25.155669 |
| 3 | Eira | 60.156191 | 24.938375 |
| 4 | Etelä-Haaga | 60.211615 | 25.155669 |
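
Because `zip` stops at the shortest input, a leftover length mismatch between the three lists would be silently truncated when the DataFrame is built. A small guard, sketched here with the hypothetical helper `build_rows`, makes such a mismatch fail loudly instead.

```python
def build_rows(districts, lats, longs):
    """Combine three parallel lists into row tuples, refusing mismatched lengths."""
    if not (len(districts) == len(lats) == len(longs)):
        raise ValueError(
            f"length mismatch: {len(districts)} districts, {len(lats)} lats, {len(longs)} longs"
        )
    return list(zip(districts, lats, longs))

# Values taken from the first rows shown above
rows = build_rows(["Ala-Malmi", "Eira"], [60.249474, 60.156191], [25.014539, 24.938375])
print(rows)
```

The resulting list of tuples can be passed to `pd.DataFrame` with the same `columns` argument as above.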


In [43]:
districts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107 entries, 0 to 106
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   District   107 non-null    object 
 1   Latitude   107 non-null    float64
 2   Longitude  107 non-null    float64
dtypes: float64(2), object(1)
memory usage: 2.6+ KB
