# Sister City

Data science project to find a sister/twin city for my home town using clustering.

## Introduction/Business Problem

With Britain leaving the EU, there is an increased need for individuals in the EU to learn one of the major European languages other than English. To achieve this, my plan is to move temporarily from my home city (Vilnius, Lithuania) to another EU city and learn another language there.

I decided to use machine learning to find a “sister/twin” city for my home city, to ensure that I won’t miss out on recreation facilities and landmarks if I move to the prospective city. The main goal is to learn the language without changing the city environment and available facilities too much.

Criteria for a “sister” city:
The countries considered are Germany, France, Italy and Spain. The chosen city must be similar to my home city in the Outdoors & Recreation facilities/landmarks it offers. The city also has to be big enough (population over 400,000). The plan is to live in the central area of the city, so only facilities/landmarks located near the city center need to be considered.

## Data

1. The list of European cities with population statistics published at http://worldpopulationreview.com/continents/cities-in-europe/ will be used to find prospective cities.

2. Foursquare data for each city's location will be used to get all Outdoors & Recreation facilities/landmarks.






## Finding Biggest Cities in Europe

We use data available online to find the biggest cities in Europe. The information provided at http://worldpopulationreview.com/continents/cities-in-europe/ will be used to build a pandas DataFrame with the relevant information on cities.


In [1]:
from bs4 import BeautifulSoup #web scraping lib
import requests

#conda install -c conda-forge xlrd --yes
import lxml #HTML parsing lib
import pandas as pd


In [2]:
#scrape data from web
cities_info=requests.get('http://worldpopulationreview.com/continents/cities-in-europe/').text
soup= BeautifulSoup(cities_info,'lxml')

    
table=soup.find('table', class_="table table-striped") #find table with data on cities
city_data=[]
#create a flat list with the text of every table cell
for td in table.find_all('td'):
    city_data.append(td.text)
city_data=city_data[3:] #remove header data

#group the flat list into rows of three cells (City, Country, Population); only complete rows are kept
data1=[city_data[i:i+3] for i in range(0, len(city_data)-2, 3)]

#create DataFrame

df = pd.DataFrame(data1, columns = ['City', 'Country', 'Population'])
print()
df.head()




| | City | Country | Population |
|---|---|---|---|
| 0 | Moscow | Russia | 10381222 |
| 1 | London | United Kingdom | 7556900 |
| 2 | Saint Petersburg | Russia | 5028000 |
| 3 | Berlin | Germany | 3426354 |
| 4 | Madrid | Spain | 3255944 |
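
The scraping cell above collects every cell into one flat list and then regroups it in threes. As an alternative, the table can be read row by row, so a missing cell cannot shift later rows. The sketch below is only an illustration under stated assumptions: it swaps BeautifulSoup for the standard-library `html.parser` so that it runs on its own, the class name `TableRows` is hypothetical, and `SAMPLE_HTML` is hand-written sample markup, not the real page.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by its <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows made only of <th> cells stay empty and are skipped
                self.rows.append(self._row)
            self._row = None

# Hand-written sample markup (an assumption about the page layout)
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Country</th><th>Population</th></tr>
  <tr><td>Berlin</td><td>Germany</td><td>3,426,354</td></tr>
  <tr><td>Madrid</td><td>Spain</td><td>3,255,944</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

Each entry of `rows` can be passed straight to `pd.DataFrame(rows, columns=['City', 'Country', 'Population'])`.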


## Cleaning City data


We need to keep only cities from Germany, France, Italy, Spain. Also, the population must be over 400000 (personal choice).

In [3]:
country_filter=['Germany','Spain','France','Italy']
df=df[df.Country.isin(country_filter)].copy() #copy so the assignment below does not raise SettingWithCopyWarning

#function to remove ',' from Population data and turn it into a numeric value
def n_con(p):
    n=p.replace(',','')
    return int(n)

df['Population']=df['Population'].apply(n_con) #convert population strings to int
df=df[df.Population>400000]

Manually add Vilnius data to the table as all the cities will be compared with Vilnius.

In [4]:
vilnius_data = [{'City':'Vilnius', 'Country':'Lithuania','Population':542366}]
df_vilnius=pd.DataFrame(vilnius_data)
df=pd.concat([df, df_vilnius], ignore_index=True) #DataFrame.append was removed in pandas 2.0

## Finding Latitude, Longitude data for the cities

In [6]:
#conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim
import numpy as np

#use geopy Nominatim to convert location name to coordinates
def geoloc(city,country):
    try:
        address = '{}, {} '.format(city, country)
        geolocator = Nominatim(user_agent="ny_explorer")
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
    except Exception: #geocoding failed or returned no result
        latitude = np.nan
        longitude = np.nan
    
    location_entry=[latitude, longitude]
        
    return location_entry

#Add location data to city dataframe

for i in df.index:
    city=df.loc[i,'City']
    country=df.loc[i,'Country']
    location=(geoloc(city,country))
    lat=location[0]
    lng=location[1]
    df.loc[i,'Latitude']=lat
    df.loc[i,'Longitude']=lng
 


In [7]:
# create map of the cities using latitude and longitude values

#conda install -c conda-forge folium=0.5.0 --yes 
import folium 
plot_df=df
latitude=48.137108
longitude=11.575382
city_map = folium.Map(location=[latitude, longitude], zoom_start=4) #named city_map to avoid shadowing the built-in map()

# add markers to map
for lat, lng, city in zip(plot_df['Latitude'], plot_df['Longitude'], plot_df['City']):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(city_map)

city_map

In [8]:
#fix the Zaragoza coordinates, which look off on the map. Values were taken from Google.

df.loc[df.City=='Zaragoza','Latitude']=41.6488226
df.loc[df.City=='Zaragoza','Longitude']=-0.8890853


## Foursquare venues search
The Foursquare Places API will be used to search for facilities/landmarks. To discover venues in the "Outdoors & Recreation" category, a GET request will be sent to the "search" endpoint with category ID 4d4b7105d754a06377d81259.

Unfortunately, Foursquare returns at most 50 venues per request. For this reason, the area around each city center will be divided into four squares of approximately 5 km × 5 km each. The total discovery area will be 10 km × 10 km, with the city center in the middle.
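
To make the size of the squares concrete: with an Earth radius of about 6378 km, a 5 km shift northwards is roughly 0.045° of latitude, while the same shift eastwards needs more degrees of longitude the further the city is from the equator. The sketch below works this out for one point and checks the result with the haversine formula. The helper names `offset_point` and `haversine_km` are hypothetical, and the coordinates are simply the map center used elsewhere in this notebook.

```python
import math

EARTH_RADIUS_KM = 6378  # same approximation as in the notebook

def offset_point(lat, lng, north_km, east_km):
    """Shift a (lat, lng) point by the given distances and return degrees."""
    dlat = math.degrees(north_km / EARTH_RADIUS_KM)
    # a degree of longitude shrinks with cos(latitude), hence the division
    dlng = math.degrees(east_km / EARTH_RADIUS_KM) / math.cos(math.radians(lat))
    return lat + dlat, lng + dlng

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, used here only as a check."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

center = (48.137108, 11.575382)      # map center used in the notebook
north = offset_point(*center, 5, 0)  # 5 km to the north
east = offset_point(*center, 0, 5)   # 5 km to the east

lat_step_deg = north[0] - center[0]
lng_step_deg = east[1] - center[1]
north_km = haversine_km(*center, *north)
east_km = haversine_km(*center, *east)

print(lat_step_deg, lng_step_deg)
print(north_km, east_km)
```

Both checked distances come out at about 5 km, which suggests the flat-Earth approximation is adequate at this scale.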

In [9]:
#The Foursquare “search” request lets you pass the south-west (SW) and north-east (NE) corner points to define a box in which the venues of interest must be located.
#We create a list of SW and NE corner coordinates for each of the four squares.
    
def cord_SW_NE(cordinates):
    import math
    d=5 # side of each square, km
    r=6378 # approx. radius of the Earth, km
    location_box=[]
    delta_distance_matrix=[(0,-d,d,0),(0,0,d,d),(-d,-d,0,0),(-d,0,0,d)] #for each square: (SW y shift, SW x shift, NE y shift, NE x shift) from the city center, km
    for distance in delta_distance_matrix:
        dist_y_sw=distance[0]
        dist_x_sw=distance[1]
        dist_y_ne=distance[2]
        dist_x_ne=distance[3]
        lat=cordinates[0]
        lng=cordinates[1]
        new_lat_sw=lat+(dist_y_sw/r)*(180/math.pi)
        new_lng_sw=lng+(dist_x_sw/r)*(180/math.pi)/math.cos(lat*math.pi/180)
        new_lat_ne=lat+(dist_y_ne/r)*(180/math.pi)
        new_lng_ne=lng+(dist_x_ne/r)*(180/math.pi)/math.cos(lat*math.pi/180)
        p_cord=('{},{}'.format(new_lat_sw,new_lng_sw),'{},{}'.format(new_lat_ne,new_lng_ne))
        location_box.append(p_cord)
    return location_box

In [10]:
#function to make GET request to Foursquare for every four boxes within a city
def getNearbyVenues(location_b, city, lat, lng):
    import os
    #credentials are read from the environment instead of being hard-coded; the variable names are an assumption
    CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
    CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret
    VERSION = '20180605' # Foursquare API version
    category='4d4b7105d754a06377d81259' #Outdoors & Recreation
    venues_list=[]
    for box in location_b:
        SW=box[0]
        NE=box[1]
        #sw and ne are 'lat,lng' strings; the stray comma after the sw placeholder has been removed
        url='https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&sw={}&ne={}&categoryId={}&intent=browse'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, SW,NE, category)
        
        try:
            results = requests.get(url).json()
            ven=results['response']['venues'] 
            venues_list.append([( 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            v['categories'][0]['name'], city) for v in ven])
        except Exception: #network error or unexpected response layout
            print('{} city data request failed'.format(city))

    return(venues_list)
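
Building the query string by hand makes small typos easy to miss. A possible alternative, sketched here with the standard library only, is to keep the parameters in a dictionary and let `urlencode` assemble the URL. The parameter names are the ones used in the request above; the function name `build_search_url`, the credentials and the corner coordinates are placeholders, not real values.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_search_url(client_id, client_secret, lat, lng, sw, ne, category_id,
                     version="20180605"):
    """Assemble the venue search URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "sw": sw,  # 'lat,lng' string of the south-west corner
        "ne": ne,  # 'lat,lng' string of the north-east corner
        "categoryId": category_id,
        "intent": "browse",
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

# Placeholder credentials and coordinates, for illustration only
url = build_search_url("YOUR_ID", "YOUR_SECRET", 54.66, 25.24,
                       "54.64,25.20", "54.69,25.28",
                       "4d4b7105d754a06377d81259")

# Parse the URL back to confirm every parameter survived intact
query = parse_qs(urlparse(url).query)
print(query["sw"], query["ne"])
```

The same dictionary could also be passed to `requests.get(base_url, params=params)`, which performs the encoding itself.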

In [11]:
#send request for every city, then create DataFrame with details on the attractions in the city
venues_all=[]

for lat, lng, city in zip(df['Latitude'], df['Longitude'], df['City']):
    cord=[lat,lng]
    location_box= cord_SW_NE(cord)
    venues_rep=getNearbyVenues(location_box, city, lat, lng )
    venues_all=venues_all+venues_rep
    
a_df = pd.DataFrame([item for venue_all in venues_all for item in venue_all])
a_df.columns = [
'Attraction Name', 
'Latitude', 
'Longitude', 
'Attraction Category',
'City']



## Vilnius data analysis.

Creating a list of preferred attractions to use as criteria for city comparison.
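
The ranking step counts the distinct venue names in each category and sorts the categories by that count. The sketch below shows the same idea without pandas. The venue rows are made-up sample data, including one duplicate of the kind that two adjacent search boxes might return.

```python
from collections import defaultdict

# Made-up sample rows: (attraction name, attraction category)
venues = [
    ("Park A", "Park"),
    ("Park B", "Park"),
    ("Park A", "Park"),   # duplicate, counted once
    ("Plaza A", "Plaza"),
    ("Gym A", "Gym"),
    ("Gym B", "Gym"),
    ("Gym C", "Gym"),
]

# Collect the distinct names seen in each category
names_by_category = defaultdict(set)
for name, category in venues:
    names_by_category[category].add(name)

# Sort by number of distinct names (descending), then alphabetically for ties
ranking = sorted(names_by_category,
                 key=lambda c: (-len(names_by_category[c]), c))
counts = {c: len(names_by_category[c]) for c in ranking}
print(ranking)
```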


In [12]:
a_vilnius=a_df[a_df.City=='Vilnius']
vilnius_attractions=a_vilnius.groupby('Attraction Category').nunique().sort_values(by=['Attraction Name'], ascending=False).index.tolist() #list of attractions in Vilnius, sorted by total number
vilnius_attractions

['Park',
 'Gym / Fitness Center',
 'Gym',
 'Bridge',
 'Neighborhood',
 'Plaza',
 'Other Great Outdoors',
 'Soccer Field',
 'Scenic Lookout',
 'Tennis Court',
 'Trail',
 'Beach',
 'Sculpture Garden',
 'Cemetery',
 'Mountain',
 'Historic Site',
 'Sports Club',
 'Field',
 'Sporting Goods Shop',
 'TV Station',
 'Ski Area',
 'River',
 'Track',
 'Rock Climbing Spot',
 'Road',
 'Athletics & Sports',
 'Recreation Center',
 'Pool',
 'Playground',
 'Basketball Court',
 'Museum',
 'Monument / Landmark',
 'Hotel',
 'Fountain',
 'Dog Run',
 'Climbing Gym',
 'City',
 'Bike Trail',
 'Volleyball Court']

In [13]:
#from the list of attractions, we select the ones most relevant to the user (personal preference). These attractions will be used as the criteria for finding a twin/sister city.

attractions_filter=['Park','Plaza', 'Other Great Outdoors','Trail', 'Scenic Lookout', 'Beach',
                    'Historic Site', 'Pool', 'River','Athletics & Sports', 'Basketball Court', 'Playground','Fountain']

## Data clustering

In [14]:
# one hot encoding
import numpy as np
a_onehot = pd.get_dummies(a_df[['Attraction Category']], prefix="", prefix_sep="")

# add the city column back to the dataframe (this overwrites the dummy column of the 'City' venue category, which is not used below)
a_onehot['City'] = a_df['City'] 
#group by City and sort columns as per Vilnius attractions_filter
all_attractions = a_onehot.groupby('City').sum().reset_index()

city_a=all_attractions[['City']+attractions_filter]

All cities were limited to an area of 10 km × 10 km (100 sq. km). We want to compare how this area is occupied by the attractions of interest.

We consider every attraction type to have the same importance/weight. For this reason we normalize the data within every column.
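
Min-max scaling maps each column to the range 0 to 1 using (value - min) / (max - min), so a category with many venues, such as parks, does not dominate one with few. The sketch below applies the formula to the Park and Plaza counts of three cities from the results table of this notebook. Because only three cities are used here, the scaled values differ from those obtained when scaling across all cities; the function name `min_max_scale` is hypothetical.

```python
def min_max_scale(column):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # constant column: map everything to 0.0 to avoid dividing by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

cities = ["Nuernberg", "Toulouse", "Vilnius"]
park = [14, 7, 21]    # Park counts for the three cities
plaza = [17, 26, 7]   # Plaza counts for the three cities

park_scaled = min_max_scale(park)
plaza_scaled = min_max_scale(plaza)

for city, p, z in zip(cities, park_scaled, plaza_scaled):
    print(city, round(p, 3), round(z, 3))
```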


In [15]:
#normalise values

from sklearn import preprocessing

a_values = city_a[city_a.columns[1:]].values #create numpy array from DataFrame excluding 'City' column
min_max_scaler = preprocessing.MinMaxScaler()
a_values_scaled = min_max_scaler.fit_transform(a_values)
city_a_nor = pd.DataFrame(a_values_scaled, columns = city_a.columns[1:])
city_a_nor.insert(0, 'City', city_a.City) # return 'City' back to DataFrame

city_a=city_a_nor




Clustering the data using the K-means method.
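
For readers unfamiliar with K-means, the sketch below shows the basic loop on a toy data set: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignment stops changing. It is a bare-bones illustration with hand-picked starting centroids and made-up 2-D points, not the scikit-learn implementation used in this notebook, which adds smarter initialization among other things.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain K-means loop; `centroids` holds the starting positions."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # assignment step: index of the nearest centroid for every point
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # nothing changed, the clustering has converged
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Made-up 2-D points forming two obvious groups
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, [points[0], points[3]])
print(labels)
```

In the notebook each point is a city and each coordinate is one scaled attraction count, so the clusters group cities with a similar mix of attractions.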

In [16]:
from sklearn.cluster import KMeans

kclusters=6 # cities will be separated into 6 clusters
clustering_df = city_a.drop(columns='City') # the positional axis argument was removed in pandas 2.0
kmeans = KMeans(n_clusters=kclusters, random_state=1).fit(clustering_df)

# add clustering labels
city_a.insert(0, 'Cluster', kmeans.labels_)
a_merged = df #DataFrame with city latitude and Longitude data

# merge the attraction DataFrame and the location DataFrame for cities
city_a = a_merged.join(city_a.set_index('City'), on='City')


# create map

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
latitude=48.137108
longitude=11.575382
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=5)


# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, city, cluster in zip(city_a['Latitude'], city_a['Longitude'], city_a['City'], city_a['Cluster']):
    label = folium.Popup(str(city) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


List of cities most similar to Vilnius by the selected attractions/features.

In [17]:
v_cl=int(city_a.loc[city_a.City=='Vilnius','Cluster'].iloc[0]) # Vilnius cluster number
sister_cities=city_a[city_a['Cluster'] == v_cl]['City'] # display cities with the same cluster as Vilnius
all_attractions[all_attractions.City.isin(sister_cities)][['City']+attractions_filter]


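| | City | Park | Plaza | Other Great Outdoors | Trail | Scenic Lookout | Beach | Historic Site | Pool | River | Athletics & Sports | Basketball Court | Playground | Fountain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | Nuernberg | 14 | 17 | 0 | 2 | 0 | 2 | 0 | 6 | 0 | 0 | 0 | 1 | 1 |
| 29 | Toulouse | 7 | 26 | 3 | 3 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 |
| 32 | Vilnius | 21 | 7 | 5 | 3 | 4 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |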
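
Cluster membership says which cities are grouped with Vilnius, but it does not rank them. One possible follow-up, shown here only as a sketch, is to compute the Euclidean distance between each candidate and Vilnius. This is an added technique, not part of the analysis above, and it uses the raw counts from the table rather than the scaled values that were clustered, so it should be read as a rough indication.

```python
import math

# Raw attraction counts copied from the table above, in column order
counts = {
    "Nuernberg": [14, 17, 0, 2, 0, 2, 0, 6, 0, 0, 0, 1, 1],
    "Toulouse":  [7, 26, 3, 3, 1, 0, 0, 2, 1, 1, 0, 0, 0],
    "Vilnius":   [21, 7, 5, 3, 4, 2, 2, 1, 1, 1, 1, 1, 1],
}

home = counts["Vilnius"]

# Squared Euclidean distance of every other city to Vilnius
squared = {
    city: sum((a - b) ** 2 for a, b in zip(row, home))
    for city, row in counts.items()
    if city != "Vilnius"
}

# Closest city first
ranking = sorted(squared, key=squared.get)
for city in ranking:
    print(city, round(math.sqrt(squared[city]), 2))
```

On these raw counts Nuernberg comes out closer to Vilnius than Toulouse, mainly because of the park and plaza numbers.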
