# City Similarity Based on Venues

- Part of Coursera's Data Science Certification Capstone
- December 2020

## 1. Introduction 

### 1.1 City Comparison - Why do it?

There are many reasons to compare cities. For example:
- Planning a vacation and choosing cities to visit based on cities you have already visited. You may want cities similar to ones you enjoyed, cities dissimilar to ones you disliked, or (if you are feeling adventurous) cities dissimilar to ones you liked.
- Narrowing down cities to move to when you have job offers in several cities, or before you concentrate your job hunt on a specific city.
- A corporation looking to build a second headquarters may want to compare cities to narrow the candidates down to a few options.

### 1.2 Focus of this project
In this project, we focus on city comparison based on venues within the city, in order to recommend cities that users may want to visit based on cities that they tell us they have already visited.

The project uses a combination of Foursquare data and geocoding data to cluster cities by their similarity or dissimilarity to the provided cities, giving users options to suit their mood. Since the cities analyzed are picked at random, the results may surface new, unfamiliar destinations for users to explore.

This is helpful for potential travelers, and it could also be offered as a service by travel agents or airlines to funnel customers into their sales process.

However, this implementation is easily adapted to the other needs described in Section 1.1.

## 2. Data

### 2.1 Data Sources

To compare cities, each city is viewed as a _bag of venues_. Venue information is retrieved from Foursquare. City information comes from a freely available list on the internet, and geocode information comes from a Python library.

The following data and data sources have been used to implement this project:

|Sr.| Data | Source | Notes |
|---|------|--------|-------|
|1. | **List of reference cities** | Expected to be user selections | Standard list of two cities (New York and Amsterdam) used for demonstration |
|2. | **List of "sample" cities** | From https://simplemaps.com/data/world-cities | 100 cities are randomly chosen from the list for analysis. See Note 2.1.1 |
|3. | **Venue Information** | Foursquare Places API | See Note 2.1.2 |
|4. | **Geocode Information** | Python's `geopy` library | To get latitude and longitude information for cities as needed |


#### 2.1.1 - City List provides the following data of interest
- City name
- City country (country name, 2-character ISO code and 3-character ISO code)
- Latitude and Longitude information for the city

#### 2.1.2 - The following Venue information is used
- Venues within 10km (10000m) radius of city centre (defined by latitude and longitude)
- For each venue:
    - Venue name
    - Venue Latitude
    - Venue Longitude
    - Venue Category (e.g. Bar, Italian Restaurant, etc)
- These are used by the recommendation engine to represent cities as _"bags of venues"_ in order to calculate similarity between user-input cities and sample / recommendation city candidates.

### 2.2 Data Retrieval

#### 2.2.1: Sample City Data
- All cities available for comparison are read from a CSV file provided by https://simplemaps.com/data/world-cities. A total of 26,000 cities are available for us to compare. The following information is available to us:

![List of all cities](images/sample_cities_csv.jpg)

- 100 cities are chosen at random from this list using the `pandas` function `DataFrame.sample(100)`. Here are the first 5 of the list:

![Selected 100 cities](images/selected_cities.jpg)
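
The sampling step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the in-memory table stands in for the simplemaps CSV, and the column names (`city`, `country`, `iso2`, `iso3`, `lat`, `lng`), the file name in the comment, and the `random_state` value are all assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the simplemaps file. With the real data this
# would be something like: all_cities = pd.read_csv("worldcities.csv")
# (file name and column names are assumptions).
all_cities = pd.DataFrame({
    "city": [f"City {i}" for i in range(500)],
    "country": ["Country"] * 500,
    "iso2": ["XX"] * 500,
    "iso3": ["XXX"] * 500,
    "lat": [-90 + i * 0.36 for i in range(500)],
    "lng": [-180 + i * 0.72 for i in range(500)],
})

# Draw 100 distinct rows at random. random_state is an added assumption
# that makes the draw reproducible between runs.
sample_cities = all_cities.sample(n=100, random_state=42).reset_index(drop=True)

print(sample_cities.head())
```

`DataFrame.sample` draws without replacement by default, so no city appears twice in the sample.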

#### 2.2.2 Venue Data for all Cities

- Venue data is retrieved for EACH user-provided city (New York, Amsterdam) as well as EACH of the randomly selected 100 cities for comparison using the Foursquare Places API
- Latitude and longitude for the API calls are available in the provided CSV. Where needed, latitude and longitude are retrieved using Python's `geopy` library
- For each venue the following information is retrieved:
    - Name of the Venue
    - Latitude and Longitude of the Venue
    - Venue Category (e.g. Gym, Bar, American Restaurant etc.)
- Venues are restricted to within 10km (10,000m) radius of the city center
- All venues are used **without restriction** on type of venues.

- Venues for User Provided Cities are as below:

![User Provided City Venues](images/venues_user_city.jpg)


- Similarly, venues for Analysis sample cities are as below (9003 rows in total):

![Sample Cities Venues](images/venues_sample_cities.jpg)
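
The venue tables above can be produced by flattening each API response into one row per venue. The sketch below assumes the response shape of Foursquare's legacy v2 `venues/explore` endpoint; the report does not state which endpoint was used, so the endpoint, the helper names `explore_params` and `venues_from_explore_response`, the `limit` value, and the sample payload are all assumptions. No network call is made here.

```python
import pandas as pd


def explore_params(lat, lng, client_id, client_secret, radius_m=10000, limit=100):
    """Build query parameters for a legacy v2 `venues/explore` request (assumed endpoint)."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": "20201201",       # API version date; this value is illustrative
        "ll": f"{lat},{lng}",  # city centre as "lat,lng"
        "radius": radius_m,    # 10 km, as described above
        "limit": limit,        # assumed page size
    }


def venues_from_explore_response(city, payload):
    """Flatten an explore-style JSON payload into one row per venue."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            rows.append({
                "City": city,
                "Venue": venue["name"],
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
                "Venue Category": categories[0].get("name"),
            })
    return pd.DataFrame(rows)


# Hand-written sample payload in the assumed shape (not real API output).
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe One", "location": {"lat": 52.37, "lng": 4.89},
               "categories": [{"name": "Bar"}]}},
    {"venue": {"name": "Trattoria Two", "location": {"lat": 52.36, "lng": 4.90},
               "categories": [{"name": "Italian Restaurant"}]}},
]}]}}

venues = venues_from_explore_response("Amsterdam", sample_payload)
print(venues)
```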


### 2.3 Data Preparation

#### 2.3.1 - Removing Cities with limited venues
- Foursquare is a user-driven social location network, so some cities may have few venues because of low app usage. To allow a meaningful comparison with cities like Amsterdam and New York, the sample cities were filtered to those with more than 20 venues. In this run, that left 70 of the 100 cities.
- Example below:

![Filtered Sample Cities with Venue Count > 20](images/filtered_sample_cities.jpg)
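
A minimal sketch of this filter, assuming a venue table with one row per venue and hypothetical column names `City` and `Venue Category` (the toy data is invented for illustration):

```python
import pandas as pd

# Toy venue table: city A has 25 venues, B has 5, C has 21.
venues = pd.DataFrame({
    "City": ["A"] * 25 + ["B"] * 5 + ["C"] * 21,
    "Venue Category": ["Bar"] * 51,
})

MIN_VENUES = 20  # threshold from the text: keep cities with MORE than 20 venues

# Attach each row's per-city venue count, then keep rows above the threshold.
venue_counts = venues.groupby("City")["Venue Category"].transform("size")
filtered_venues = venues[venue_counts > MIN_VENUES]

print(sorted(filtered_venues["City"].unique()))
```

With this toy data only cities A and C survive, since B has just 5 venues.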

#### 2.3.2 - One Hot Encoding to Regularize data
- For comparison purposes, the venue categories of the user-provided cities and of the analysis sample cities are one-hot encoded, so that every city is described by the same kind of numeric columns.

- Example for user-provided cities (the filtered sample cities are treated the same way):

![One Hot encoded User Cities](images/onehot_user_cities.jpg)
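
One way to sketch the encoding is with `pandas.get_dummies`, followed by a per-city mean that turns the one-hot rows into category frequencies. The column names and toy data are assumptions, and averaging per city is inferred from the report's later mention of venue frequencies rather than taken from the original code.

```python
import pandas as pd

venues = pd.DataFrame({
    "City": ["Amsterdam", "Amsterdam", "Amsterdam", "New York", "New York"],
    "Venue Category": ["Bar", "Coffee Shop", "Bar", "Park", "Bar"],
})

# One column per venue category, 1 where the venue belongs to that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Average the one-hot rows per city: each cell becomes the share of the
# city's venues that fall into that category.
city_profiles = onehot.groupby(venues["City"]).mean()

print(city_profiles)
```

Each row of `city_profiles` is one city's "bag of venues" vector, and each row sums to 1.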


## 3. Methodology

### 3.1 Cities as a collection of venues

The kind of cities a given user might like is best found by analyzing intent: what kind of places the user likes, their demographics, etc. However, that data is not easily available.

A suitable proxy is to find cities similar to the cities the user has provided. One way to characterize a city is by the amenities available. The types and distribution of venues within a city correlate with how city dwellers interact with it, and serve as a proxy for other demographics of the city. Thus, city similarity can be calculated by finding the venues within a city and identifying how this collection compares to that of any other city the user might be interested in.

This approach also simplifies data collection and similarity calculation.

### 3.2 Evaluating City Similarities

Utilizing the _city as a bag of venues_ approach also allows us to plot cities in a vector space. Thus, similarities between cities can be calculated using **Cosine Similarity** in the same way as it is used to calculate text similarity. 

Two cities are similar if they point "in the same direction" with respect to distribution (frequency) of the venues in their total venue-space. This means that if two cities have a similar frequency of different _types_ or _categories_ of venues, they are similar.

Since city similarity is based only on the venue categories present in the user-input cities, the venue counts of the "sample cities" are reduced to only those categories that also occur in the user-input cities.

Cosine Similarity scores are calculated using the functions from the Python library `scikit-learn`. The output is an array that provides the similarity between every candidate city AND every user-provided city.
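
The alignment and scoring steps can be sketched as below. The frequency tables are invented toy data and the variable names are hypothetical; `cosine_similarity` is the scikit-learn function from `sklearn.metrics.pairwise`.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy category-frequency vectors (rows: cities, columns: venue categories).
user_profiles = pd.DataFrame(
    {"Bar": [0.5, 0.2], "Museum": [0.5, 0.0], "Park": [0.0, 0.8]},
    index=["Amsterdam", "New York"],
)
sample_profiles = pd.DataFrame(
    {"Bar": [0.5, 0.0], "Museum": [0.5, 0.0], "Park": [0.0, 0.5], "Beach": [0.0, 0.5]},
    index=["City A", "City B"],
)

# Keep only the categories seen in the user cities ("Beach" is dropped);
# categories a sample city lacks would be filled with 0.
aligned = sample_profiles.reindex(columns=user_profiles.columns, fill_value=0.0)

# Rows: candidate cities. Columns: user-provided cities.
scores = pd.DataFrame(
    cosine_similarity(aligned.to_numpy(), user_profiles.to_numpy()),
    index=aligned.index,
    columns=user_profiles.index,
)

print(scores.round(3))
```

In this toy example City A has the same Bar/Museum mix as Amsterdam, so its score against Amsterdam is 1, while City B shares no categories with Amsterdam and scores 0.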

#### City Similarity Scores:
![City similarity scores with Amsterdam and New york](images/city_similarity_scores.jpg)


### 3.3 Clustering Cities
Calculating cosine similarity between user-provided cities and candidate cities may be enough if a single city is provided as input. However, as the number of input cities increases, it is helpful to visualize how cities relate to one another as well as to **all** user-provided cities.

For example, one city might be very similar to New York but very different from Amsterdam, while another might be similar to both New York and Amsterdam. Clustering helps us visualize those relationships.

The clustering approach used is K-Means clustering. This clustering is done on calculated similarities in Section 3.2, since we are clustering cities by similarities.

#### 3.3.1 Calculating the Optimal Number of Clusters

To cluster the cities well, a suitable number of clusters needs to be chosen. This is done by fitting the clustering model for different values of "k" and identifying the one with the best scores.

Multiple cluster-quality scores (Silhouette, Calinski-Harabasz Index and Davies-Bouldin Index) are calculated for different values of "k", and the value that scores best on most of them is used as "k". Higher is better for the Silhouette and Calinski-Harabasz scores. All three scores are available in scikit-learn.

In this example, a "k" value of 10 has the highest Calinski-Harabasz score and the lowest Davies-Bouldin index, and is therefore used.

![Similarity Scores](images/similarity_scores.jpg)

(Note that a lower Davies-Bouldin index is better. By negating the scores, a single `max` function can be used to find the best "k" value)
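
A sketch of this selection procedure, under stated assumptions: the random matrix stands in for the real similarity scores (70 cities by 2 user cities), the candidate range of 2 to 12 is illustrative, and the majority vote is one possible reading of "best on most scores". The three metric functions are from `sklearn.metrics`.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Hypothetical stand-in for the similarity matrix from Section 3.2.
rng = np.random.default_rng(0)
similarities = rng.random((70, 2))

scores_by_k = {}
for k in range(2, 13):  # candidate range is an assumption
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(similarities)
    scores_by_k[k] = {
        "silhouette": silhouette_score(similarities, labels),
        "calinski_harabasz": calinski_harabasz_score(similarities, labels),
        # Negated so that "higher is better" holds for all three scores.
        "neg_davies_bouldin": -davies_bouldin_score(similarities, labels),
    }

# Best k per metric, then the k that wins on the most metrics.
metrics = ["silhouette", "calinski_harabasz", "neg_davies_bouldin"]
winners = [max(scores_by_k, key=lambda k, m=m: scores_by_k[k][m]) for m in metrics]
best_k, votes = Counter(winners).most_common(1)[0]

print(dict(zip(metrics, winners)), "-> chosen k:", best_k)
```

If all three metrics disagree, `most_common` simply returns the first winner, so a real implementation may want an explicit tie-breaking rule.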

#### 3.3.2 K-Means Clustering

The potential destination cities are clustered with k = 10 based on their Cosine Similarity scores, which were calculated from the frequencies of the available venue categories.

![Sample City Clusters](images/sample_city_clusters.jpg)

The K-Means Clustering algorithm is run using the built-in function from the Python library Scikit-Learn.
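
As a rough illustration, clustering the similarity scores with scikit-learn's `KMeans` might look like this. The score table is random placeholder data, and the city names, `n_init`, and `random_state` values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical similarity scores: one row per candidate city,
# one column per user-provided city.
rng = np.random.default_rng(1)
scores = pd.DataFrame(
    rng.random((70, 2)),
    columns=["Amsterdam", "New York"],
    index=[f"City {i}" for i in range(70)],
)

# Cluster the cities on their similarity scores with k = 10.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(scores.to_numpy())

# Attach the cluster label to each city.
clustered = scores.assign(Cluster=kmeans.labels_)

print(clustered["Cluster"].value_counts().sort_index())
```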

## 4. Results

K-Means clustering with 10 clusters gives the following distribution of the potential destination cities, plotted on the world map with a different color for each cluster.

![Cluster City Visualization](images/cluster_maps.jpg)


The average similarity score of each cluster is calculated to get a sense of how the clusters differ. The averages are shown below:

![Average similarity of clusters with Amsterdam and New york](images/cluster_similarities.jpg)
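
The per-cluster averages can be computed with a `groupby`, sketched here on a small invented table (the city names, scores, and cluster labels are hypothetical):

```python
import pandas as pd

# Toy table: similarity of each candidate city to the two user cities,
# plus the cluster label assigned by K-Means.
clustered = pd.DataFrame(
    {
        "Amsterdam": [0.9, 0.8, 0.1, 0.2],
        "New York": [0.1, 0.2, 0.9, 0.7],
        "Cluster": [0, 0, 1, 1],
    },
    index=["City A", "City B", "City C", "City D"],
)

# Mean similarity to each user city, per cluster.
cluster_means = clustered.groupby("Cluster")[["Amsterdam", "New York"]].mean()

print(cluster_means)
```

Here cluster 0 averages high on Amsterdam and low on New York, and cluster 1 the reverse, which is the kind of contrast the table above is meant to expose.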


## 5. Discussion

Based on the average similarity scores for each cluster shown above, the following clusters may be worth looking at when a user is planning to visit similar cities:

Cluster 2: 6 Cities most similar to Amsterdam but very dissimilar to New York

![Cluster 2 cities](images/cluster2.jpg)


Cluster 7: 4 Cities most similar to New York

![Cluster 7 cities](images/cluster7.jpg)


Cluster 5: 3 Cities most similar to Amsterdam and New York 

![Cluster 5 cities](images/cluster5.jpg)


Cluster 1: 8 Cities that are very different from both Amsterdam and New York. These may either be ignored or, for the adventurous, be very intriguing

![Cluster 1 cities](images/cluster1.jpg)


## 6. Conclusion

Using data sources like the Foursquare Places API, libraries like Scikit-Learn, and machine learning algorithms, it was possible to build a simple city similarity scorer that can serve multiple purposes.

The following areas can be further improved to achieve even better and more accurate recommendations:

### 6.1 Feature Engineering
- Currently, only venue categories are considered for the comparison. Additional features could be considered, such as population, area size, etc.
- Venue categories are considered without distinction. This may cause two cities to appear similar simply because both have a high concentration of "residential areas" or "ATMs". For certain applications like tourism, this may not be of interest. Thus, the set of venue categories may be restricted. The current implementation's function calls can accommodate this, but it is not implemented.
- At the moment, all venues within 10km of the city center are considered. This radius may be broadened or reduced depending on the city in question.
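
As an illustration of the category restriction mentioned above, one simple option is to drop unwanted categories from the venue table before encoding. This is a hypothetical post-processing sketch, not the mechanism built into the project's function calls; the exclusion list, column names, and toy data are assumptions.

```python
import pandas as pd

venues = pd.DataFrame({
    "City": ["Amsterdam", "Amsterdam", "New York", "New York"],
    "Venue Category": ["Museum", "ATM", "Bar", "Residential Building"],
})

# Hypothetical list of categories that are not relevant for tourism.
EXCLUDED_CATEGORIES = {"ATM", "Residential Building"}

# Keep only the venues whose category is not excluded.
tourism_venues = venues[~venues["Venue Category"].isin(EXCLUDED_CATEGORIES)]

print(tourism_venues)
```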

### 6.2 Machine Learning / Hyperparameter Tuning
- The K-Means clustering algorithm is run with default parameters. Results may improve by trying other clustering algorithms, or by changing the number of iterations and initializations.
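
A small sketch of what such tuning could look like with scikit-learn's `KMeans`, comparing a single initialization against more restarts and a higher iteration cap. The data is a random placeholder and the specific parameter values are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the city similarity matrix.
rng = np.random.default_rng(2)
similarities = rng.random((70, 2))

# Baseline: a single initialization.
single_init = KMeans(n_clusters=10, n_init=1, random_state=0).fit(similarities)

# Variant: more restarts and a higher iteration cap (values are illustrative).
tuned = KMeans(
    n_clusters=10, init="k-means++", n_init=25, max_iter=500, random_state=0
).fit(similarities)

# inertia_ is the within-cluster sum of squared distances (lower is tighter).
print("single init:", single_init.inertia_)
print("tuned:      ", tuned.inertia_)
```

Comparing `inertia_` (or the scores from Section 3.3.1) across settings gives a quick check on whether the extra restarts change the result.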

### 6.3 Platform limitations
- Since the free version of the Foursquare API is limited, only 100 potential cities are used for the similarity comparison. This can be broadened.
- For smaller cities, Foursquare may not have enough information because there are fewer check-ins. It might help to use another platform, such as Google, to get more accurate results.
