# City Similarity Based on Venues

- Part of Coursera's Data Science Certification Capstone
- December 2020

## 1. Introduction 

### 1.1 City Comparison - Why do it?

There are many reasons to compare cities. For example:
- Planning a vacation and choosing cities to visit based on cities you have already visited. You might look for cities similar to ones you enjoyed, cities dissimilar to ones you disliked, or even cities dissimilar to ones you liked if you are feeling adventurous!
- Narrowing down cities to move to, either when you have job offers in several cities or before concentrating your job hunt on a specific city.
- A corporation looking to build a second headquarters may want to compare cities to narrow the candidates down to a few options.

### 1.2 Focus of this project
In this project, we focus on city comparison based on venues within the city, in order to recommend cities that users may want to visit based on cities that they tell us they have already visited.

The project uses a combination of Foursquare data and geocoding data to cluster cities by their similarity / dissimilarity to the provided cities, in order to give users options based on their mood. Because the cities analyzed are picked at random, the results may uncover new, exotic locations for users to explore.

This is helpful for potential travelers, and it could also be offered as a service by travel agents or airlines to funnel customers into their sales process.

However, the implementation can easily be adapted to the other needs listed in section 1.1.

## 2. Data

### 2.1 Data Sources

In order to compare cities, each city is viewed as a _bag of venues_. Venue information is retrieved from Foursquare. The city list comes from a freely available dataset on the internet, and geocode information is retrieved with a Python library.

The following data and data sources are used in this project:

|Sr.| Data | Source | Notes |
|---|------|--------|-------|
|1. | **List of reference cities** | Expected to be user selections | A standard list of two cities (New York and Amsterdam) is used for demonstration |
|2. | **List of "sample" cities**  | From https://simplemaps.com/data/world-cities | 100 cities are randomly chosen from the list for analysis. See Note 2.1.1 |
|3. | **Venue Information** | Foursquare Places API | See Note 2.1.2 |
|4. | **Geocode Information** | Python's `geopy` library | Used to get latitude and longitude for cities as needed |
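
The geocoding fallback mentioned for `geopy` could look roughly like the sketch below. It assumes the `Nominatim` geocoder (the project only names `geopy`, not a specific geocoder), and the helper names `make_nominatim_geocode` and `resolve_coordinates` as well as the `user_agent` string are hypothetical.

```python
def make_nominatim_geocode(user_agent="city-similarity-demo"):
    """Return a geocode callable backed by geopy's Nominatim service.

    Assumption: Nominatim is used; the project text only mentions geopy.
    The import is local so the rest of the sketch works without geopy.
    """
    from geopy.geocoders import Nominatim

    return Nominatim(user_agent=user_agent).geocode


def resolve_coordinates(city, country, lat, lng, geocode):
    """Use known coordinates when present, otherwise geocode the city name."""
    if lat is not None and lng is not None:
        return float(lat), float(lng)
    location = geocode(f"{city}, {country}")  # None when nothing is found
    if location is None:
        return None
    return location.latitude, location.longitude
```

With network access and `geopy` installed, a call such as `resolve_coordinates("New York", "United States", None, None, make_nominatim_geocode())` would look the city up, while rows that already carry coordinates skip the lookup entirely.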


### 2.1.1 - The city list provides the following data of interest
- City name
- Country (country name, 2-character ISO code and 3-character ISO code)
- Latitude and Longitude information for the city

### 2.1.2 - The following venue information is used
- Venues within a 10 km (10,000 m) radius of the city center (defined by latitude and longitude)
- For each venue:
    - Venue name
    - Venue Latitude
    - Venue Longitude
    - Venue Category (e.g. Bar, Italian Restaurant, etc.)
- These are used by the recommendation engine to represent cities as _"bags of venues"_ in order to calculate similarity between user-input cities and sample / recommendation city candidates.
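
To make the _bag of venues_ idea concrete, here is a minimal sketch that counts venue categories per city and compares two cities. Cosine similarity is an assumption here, since the text does not name the similarity measure, and the category lists are made up for illustration.

```python
import math
from collections import Counter


def bag_of_venues(categories):
    """Count how often each venue category occurs in a city."""
    return Counter(categories)


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two category-count bags (0.0 = nothing shared)."""
    # Counter returns 0 for categories that are missing from bag_b.
    dot = sum(count * bag_b[cat] for cat, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up category lists, for illustration only.
reference = bag_of_venues(["Bar", "Bar", "Italian Restaurant", "Museum"])
candidate = bag_of_venues(["Bar", "Museum", "Park"])
score = cosine_similarity(reference, candidate)
```

A score near 1 means the two cities have a similar mix of venue categories, and a score of 0 means they share none.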

## 2.2 Data Retrieval

### 2.2.1: Sample City Data
- All cities available for comparison are read from a CSV file provided by https://simplemaps.com/data/world-cities. A total of 26,000 cities are available for us to compare. The following information is available to us:

![List of all cities](images/sample_cities_csv.jpg)

- 100 cities are chosen at random from this list using the `pandas` function `DataFrame.sample(100)`. Here are the first 5 rows of the selection:

![Selected 100 cities](images/selected_cities.jpg)
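
The sampling step can be sketched as follows. The small in-memory table stands in for the world-cities CSV, its column names and values are illustrative assumptions, and `random_state` is added here only to make the sketch reproducible (the project calls `DataFrame.sample(100)`).

```python
import pandas as pd

# Tiny stand-in for the world-cities CSV; column names are assumptions
# and the coordinates are approximate, for illustration only.
all_cities = pd.DataFrame(
    {
        "city": ["Tokyo", "Delhi", "Lagos", "Lima", "Oslo", "Perth"],
        "country": ["Japan", "India", "Nigeria", "Peru", "Norway", "Australia"],
        "iso2": ["JP", "IN", "NG", "PE", "NO", "AU"],
        "lat": [35.68, 28.61, 6.52, -12.05, 59.91, -31.95],
        "lng": [139.69, 77.21, 3.38, -77.04, 10.75, 115.86],
    }
)

# Draw rows without replacement; the project would use n=100 on the full list.
sample_cities = all_cities.sample(n=3, random_state=42).reset_index(drop=True)
```

In the real workflow the table would come from something like `pd.read_csv(...)` on the downloaded file rather than from a literal dictionary.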

### 2.2.2 Venue Data for all Cities

- Venue data is retrieved for EACH user-provided city (New York, Amsterdam) as well as EACH of the randomly selected 100 cities for comparison using the Foursquare Places API
- The latitude and longitude passed to the API are available in the provided CSV. Where needed, latitude and longitude are retrieved using Python's `geopy` library
- For each venue the following information is retrieved:
    - Name of the Venue
    - Latitude and Longitude of the Venue
    - Venue Category (e.g. Gym, Bar, American Restaurant etc.)
- Venues are restricted to those within a 10 km (10,000 m) radius of the city center
- All venues are used **without restriction** on venue type.
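
The retrieval steps above can be sketched as below. This assumes the legacy Foursquare v2 `venues/explore` endpoint and its response layout, which the text does not specify (it only says "Foursquare Places API"). The helper names, the output column names, the `limit` of 100 and the version date are all hypothetical.

```python
def build_explore_request(lat, lng, client_id, client_secret,
                          radius_m=10000, limit=100, version="20201201"):
    """Build the URL and query parameters for a venue search around a point."""
    # Assumption: legacy v2 explore endpoint; credentials come from the caller.
    url = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius_m,  # 10 km radius around the city center
        "limit": limit,
    }
    return url, params


def parse_venues(city, payload):
    """Flatten an explore-style response into one record per venue."""
    records = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            records.append({
                "City": city,
                "Venue": venue["name"],
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
                "Venue Category": categories[0].get("name"),
            })
    return records
```

The request itself could then be sent with an HTTP client of your choice and the decoded JSON passed to `parse_venues`. Keeping the parsing separate from the network call makes it easy to check against a saved response.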

- Venues for the user-provided cities are shown below:

![User Provided City Venues](images/venues_user_city.jpg)


- Similarly, venues for the analysis sample cities are shown below (9003 rows in total):

![Sample Cities Venues](images/venues_sample_cities.jpg)


## 2.3 Data Preparation

### 2.3.1 - Removing Cities with limited venues
- Foursquare is a user-driven social location network. Some cities may not have enough venues due to low app usage. To provide a meaningful comparison with cities like Amsterdam and New York, the sample cities were filtered to keep only those with more than 20 venues. In this particular run, that left 70 of the 100 cities.
- An example is shown below:

![Filtered Sample Cities with Venue Count > 20](images/filtered_sample_cities.jpg)
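
The venue-count filter can be sketched with a `groupby` on the venue table. The column names `City` and `Venue Category` and the toy data are assumptions for illustration. Only the threshold of more than 20 venues comes from the text.

```python
import pandas as pd

MIN_VENUES = 20  # keep cities with MORE than 20 venues, as described above

# Toy venue table: one row per venue (city names and counts are made up).
venues = pd.DataFrame(
    {
        "City": ["A"] * 25 + ["B"] * 5 + ["C"] * 21 + ["D"] * 20,
        "Venue Category": ["Bar"] * 71,
    }
)

# Count venues per city, then keep only rows of cities above the threshold.
venue_counts = venues.groupby("City").size()
kept_cities = venue_counts[venue_counts > MIN_VENUES].index
filtered_venues = venues[venues["City"].isin(kept_cities)]
```

Note that a city with exactly 20 venues (city `D` in the toy data) is dropped, because the comparison is strict.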

### 2.3.2 - One Hot Encoding to Regularize data
- For comparison purposes, the venue categories are one-hot encoded for the user-provided cities as well as the filtered sample cities, in order to regularize the data into a common numeric form.

- The example below shows the user-provided cities. The filtered sample cities receive the same treatment.

![One Hot encoded User Cities](images/onehot_user_cities.jpg)
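
One-hot encoding of this kind can be sketched with `pandas.get_dummies`. The column names and toy rows are assumptions. The final per-city averaging is also an assumption about how the encoded rows might be turned into one comparable profile per city, since the text does not describe that step.

```python
import pandas as pd

# Toy venue table: one row per venue (made-up data).
venues = pd.DataFrame(
    {
        "City": ["New York", "New York", "New York", "Amsterdam", "Amsterdam"],
        "Venue Category": ["Bar", "Bar", "Gym", "Bar", "Museum"],
    }
)

# One indicator column per venue category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "City", venues["City"])

# Assumption: averaging the indicators per city gives each category's share
# of that city's venues, so cities with different venue counts are comparable.
city_profiles = onehot.groupby("City").mean()
```

Each row of `city_profiles` sums to 1, so it can be read as the share of each venue category within the city.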
