# Data Science Cities Report

## Content

- Introduction: Uncovering the data behind the cities
- Data Description: Where to start?
- Methodology
- Results
- Discussion
- Conclusion

## Introduction: Uncovering the data behind the cities

Most of us reach a point where we need to take the next step, in Data Science terms, and decide to pursue a graduate degree. We spend several weeks, maybe months, trying to figure out which universities around the world to apply to, knowing that we will live in that city for at least the next one or two years. After narrowing down our options to four or five universities, we need to take into consideration not only the cost of living, but also the similarity between where we live now and where we will be living for the next two years.

In this Data Science project I will try to tackle that problem: **Given the city where I live (Lima, Perú), which of the other three cities, home to my top university options, should I focus most of my effort on?** For this, I will be considering three top Master of Science in Data Science programs:

- The University of Edinburgh
- University of Toronto
- ETH Zurich

These three universities charge tuition fees of around USD 30,000 per year for non-EU citizens (except for ETH Zurich, at around USD 3,000). Although tuition and the duration of the Master's may be important factors to consider, we will focus only on the choice of city, measuring how similar the cities are by their venue categories.

*It is important to note that these four cities have a similar average temperature across the year (from 0°C to 19°C) and an average precipitation of 80 mm (Zurich reaching 120 mm per year), so we will ignore those weather factors.*

---

## Data Description: Where to start?

We identify two main types of data sources:

1. Cities' boroughs and/or neighbourhoods
1. Venue categories around each city's main neighbourhoods

#### For the first type, we will obtain information from the postal codes of each city.

### Edinburgh
The data needed is in a table on the following Wikipedia page: https://en.wikipedia.org/wiki/EH_postcode_area
I will take only the central Edinburgh postal codes, which range from EH1 to EH17, taking into account that The University of Edinburgh is in the EH8 postal code area.
Table columns are: Postcode district and Post town.
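
The EH1 to EH17 filter can be sketched as follows. This is only an illustration: the list of districts is a hypothetical sample, and the helper name `is_central_edinburgh` is an assumption, not code from the project notebook.

```python
import re

# Hypothetical sample of "Postcode district" values as they might appear in the table.
districts = ["EH1", "EH8", "EH17", "EH18", "EH30", "EH99"]


def is_central_edinburgh(district: str) -> bool:
    """Return True for districts EH1 through EH17 (inclusive)."""
    match = re.fullmatch(r"EH(\d+)", district.strip().upper())
    return match is not None and 1 <= int(match.group(1)) <= 17


# Keep only the districts considered central in this report.
central = [d for d in districts if is_central_edinburgh(d)]
print(central)
```

Comparing the numeric part as an integer avoids the pitfall of string comparison, where "EH2" would sort after "EH17".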

### Toronto
The data needed is in a table on the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

To filter the nearest boroughs and neighbourhoods, I will map the boroughs using the Folium library and choose the four boroughs nearest to Scarborough, which is the location of the University of Toronto.
Table columns are: Postcode, Borough, and Neighbourhood.

### Zurich
The data needed is in a table on the following site: http://www.geonames.org/postal-codes/CH/ZH/zurich.html
After some research I found that the central Zurich area, where ETH Zurich is located, has postal codes that start with 80, so I will focus only on those.
Table columns are: Place, Code, Country, Admin1, Admin2, Admin3. *The last three columns represent the district and region.*

### Lima
The data needed is in a .csv file with the postal codes of the districts that surround the area where I live and mostly commute during the week.
Table columns are: Postal code, District, and Neighbourhood.

#### For the second type of data, I will obtain venue categories from Foursquare

**Foursquare** is one of the biggest hubs for venue information from all over the world. Beyond each venue's category and location, users can review venues and give each one a rating.

I will use the Foursquare API to retrieve the venues within roughly 700 meters of each postal code district or neighbourhood, with a limit of 100 venues per location. The data is returned in JSON format, and I will append it to the district and neighbourhood dataframes of each city.
I will call the `explore` endpoint of the venues API group. This returns the following data: venue name, venue category, and venue latitude and longitude.
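
As a rough sketch of such a request, the snippet below only builds the request URL and does not call the network. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `client_id`, `client_secret`, `v`, `ll`, `radius` and `limit` query parameters; the helper name `build_explore_url`, the version date and the placeholder credentials are illustrative assumptions.

```python
from urllib.parse import urlencode

# Legacy Foursquare v2 endpoint (assumed here; newer API versions differ).
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=700, limit=100):
    """Build an explore URL for venues around one (lat, lng) point."""
    params = {
        "client_id": client_id,          # placeholder credentials
        "client_secret": client_secret,
        "v": version,                    # API version date (assumed value)
        "ll": f"{lat},{lng}",            # latitude,longitude of the district
        "radius": radius,                # search radius in meters
        "limit": limit,                  # maximum venues per location
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


url = build_explore_url(55.9445, -3.1892, "MY_ID", "MY_SECRET")
print(url)
```

The resulting URL could then be fetched once per district with any HTTP client, and the JSON body parsed into the dataframe.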

Here is an example of the JSON dictionary that is retrieved:

```
{
  "meta": {
    "code": 200,
    "requestId": "59a45921351e3d43b07028b5"
  },
  "response": {
    "venue": {
      "id": "412d2800f964a520df0c1fe3",
      "name": "Central Park",
      "contact": {
        "phone": "2123106600",
        "formattedPhone": "(212) 310-6600",
        "twitter": "centralparknyc",
        "instagram": "centralparknyc",
        "facebook": "37965424481",
        "facebookUsername": "centralparknyc",
        "facebookName": "Central Park"
      },
      "location": {
        "address": "59th St to 110th St",
        "crossStreet": "5th Ave to Central Park West",
        "lat": 40.78408342593807,
        "lng": -73.96485328674316,
        "postalCode": "10028",
        "cc": "US",
        "city": "New York",
        "state": "NY",
        "country": "United States",
        "formattedAddress": [
          "59th St to 110th St (5th Ave to Central Park West)",
          "New York, NY 10028",
          "United States"
        ]
      },
      "canonicalUrl": "https://foursquare.com/v/central-park/412d2800f964a520df0c1fe3",
      "categories": [
        {
          "id": "4bf58dd8d48988d163941735",
          "name": "Park",
          "pluralName": "Parks",
          "shortName": "Park",
          "icon": {
            "prefix": "https://ss3.4sqi.net/img/categories_v2/parks_outdoors/park_",
            "suffix": ".png"
          },
          "primary": true
        }
```
I access the `response` item and then the `venue` item to retrieve data such as the venue name. Inside that item, the `location` item gives the latitude and longitude. Finally, the `categories` item gives the venue category.
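
The access path described above can be sketched as a small helper. The function name `extract_venue_fields` is hypothetical, and the sample dictionary is a trimmed copy of the response shown earlier, so this assumes the payload has exactly that shape.

```python
def extract_venue_fields(payload: dict) -> dict:
    """Pull name, coordinates and category out of one venue payload."""
    venue = payload["response"]["venue"]
    location = venue.get("location", {})
    categories = venue.get("categories", [])
    # Prefer the category flagged as primary; otherwise fall back to the first one.
    primary = next((c for c in categories if c.get("primary")),
                   categories[0] if categories else None)
    return {
        "name": venue.get("name"),
        "lat": location.get("lat"),
        "lng": location.get("lng"),
        "category": primary["name"] if primary else None,
    }


# Trimmed version of the example response above.
sample = {
    "meta": {"code": 200},
    "response": {
        "venue": {
            "name": "Central Park",
            "location": {"lat": 40.78408342593807, "lng": -73.96485328674316},
            "categories": [{"name": "Park", "primary": True}],
        }
    },
}

fields = extract_venue_fields(sample)
print(fields)
```

Using `.get` with defaults keeps the helper from failing on venues that lack a location or a category.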

***

*For the cost of living, we will obtain the corresponding figure for each of the three cities from www.numbeo.com.*

***



## Methodology

We started off by gathering the data into a dataframe for each city. This data lacked latitude and longitude, which we obtained using a geocoder library and a geolocation service. After this, we merged the dataframes into a single one and performed some exploratory analysis.
First, we graphed the standard deviation of latitude and longitude, converted to kilometers. We learned that Edinburgh had a larger longitudinal deviation, while the latitude deviation was almost the same in all cities. This indicates that the sampled districts are similarly localized across the cities.
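
One way to convert a deviation in degrees to kilometers is sketched below. It assumes roughly 111.32 km per degree of latitude and scales longitude by the cosine of the mean latitude; the report does not state which conversion or which flavour of standard deviation was used, so both are assumptions, and the coordinates are hypothetical.

```python
import math
import statistics

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude


def spread_in_km(latitudes, longitudes):
    """Sample standard deviation of latitude and longitude, in kilometers."""
    mean_lat = statistics.mean(latitudes)
    lat_km = statistics.stdev(latitudes) * KM_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    lng_km = (statistics.stdev(longitudes) * KM_PER_DEGREE
              * math.cos(math.radians(mean_lat)))
    return lat_km, lng_km


# Hypothetical district coordinates with the same spread in degrees on both axes.
lats = [55.90, 55.95, 56.00]
lngs = [-3.25, -3.20, -3.15]
lat_km, lng_km = spread_in_km(lats, lngs)
print(round(lat_km, 2), round(lng_km, 2))
```

At Edinburgh's latitude the same spread in degrees covers a shorter east-west distance than north-south, which is why the conversion matters when comparing cities at different latitudes.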

The next analysis looked at the spatial distribution of each city using a Kernel Density Estimate (KDE) plot from the seaborn library. Since we needed a multivariate graph that takes both latitude and longitude into account, this plot helped us achieve our goal.
A KDE is an estimate of the probability density function of a given variable. It places a small probability function (picture it as a 'bump') on each data point and adds up all the 'bumps' to form the distribution curve.
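
The 'sum of bumps' idea can be illustrated with a minimal one-dimensional Gaussian KDE. This is a sketch only: the report used seaborn's bivariate KDE plot, while here the latitudes, the bandwidth and the helper name `gaussian_kde_1d` are assumptions chosen for illustration.

```python
import math


def gaussian_kde_1d(samples, bandwidth):
    """Return a density function built from one Gaussian bump per sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Add up the contribution of every bump at position x.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical district latitudes.
latitudes = [-12.12, -12.10, -12.09, -12.08, -12.05]
density = gaussian_kde_1d(latitudes, bandwidth=0.02)

# Numerically integrate over a grid; a valid density integrates to about 1.
step = 0.001
grid = [-12.30 + step * i for i in range(401)]
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

The bandwidth controls how wide each bump is: a larger value gives a smoother curve, a smaller one follows individual points more closely.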

We could easily see that the most normally distributed cities were Zurich and Toronto, the latter with a longer tail towards the north, corresponding to a less dense area. Lima, on the other hand, showed an elongated peak and a low tail towards the south. Focusing on the peak of the distribution, we can associate Lima with Zurich and Toronto. This graph can be explored in the Notebook.

Moving on, we started making calls to the Foursquare API to get the surrounding venues for each district in each city. We applied one-hot encoding to the venue categories and grouped the dataframe by district, averaging the binary columns. This gave us a normalized proportion of each category per district to use in our model.
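
The encoding and averaging step can be sketched with pandas as follows. The venue rows and the column names `district` and `category` are hypothetical, not taken from the project notebook.

```python
import pandas as pd

# Hypothetical venues, one row per venue returned for a district.
venues = pd.DataFrame({
    "district": ["EH1", "EH1", "EH1", "EH8", "EH8"],
    "category": ["Bar", "Coffee Shop", "Bar", "Park", "Coffee Shop"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["category"], dtype=float)
onehot.insert(0, "district", venues["district"])

# Averaging the binary columns gives each category's share per district.
proportions = onehot.groupby("district").mean()
print(proportions)
```

Because every venue has exactly one category here, each district row sums to 1, so districts with many venues and districts with few are on the same scale.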

Before modelling the data, we ranked the 10 most common venue categories for each district.
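
A possible way to build that ranking from a table of per-district category proportions is sketched below. The helper name `top_categories`, the column labels and the numbers are assumptions for illustration.

```python
import pandas as pd


def top_categories(proportions: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Rank the n largest category proportions in each district (row)."""
    n = min(n, proportions.shape[1])  # cannot rank more categories than exist
    ranked = {
        district: row.sort_values(ascending=False).head(n).index.tolist()
        for district, row in proportions.iterrows()
    }
    columns = [f"#{i + 1} most common venue" for i in range(n)]
    return pd.DataFrame.from_dict(ranked, orient="index", columns=columns)


# Hypothetical proportions for two districts and three categories.
proportions = pd.DataFrame(
    {"Bar": [0.6, 0.1], "Coffee Shop": [0.3, 0.2], "Park": [0.1, 0.7]},
    index=["EH1", "EH8"],
)
top2 = top_categories(proportions, n=2)
print(top2)
```

This table is only for reading and labelling the clusters later; the model itself is trained on the numeric proportions.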

We now move on to the Machine Learning section. The normalized venues dataframe was used to train a KMeans clustering model. We chose KMeans because it has few parameters to decide on and because the 'elbow' method makes it easy to pick an appropriate K value. We also expected the venue data to form roughly hyperspherical clusters, since each feature is the proportion of a given venue category and we assumed those proportions to be approximately normally distributed, denser at the center.
When applying the 'elbow' method, we found a very diffuse, but nevertheless present, elbow point. One reason may be the high number of dimensions, which makes it difficult to find clearly defined hyperspherical clusters. The elbow point indicated a K value of 6.
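
The 'elbow' search can be sketched with scikit-learn as below. The feature matrix is synthetic (three well-separated blobs in two dimensions) and only stands in for the real district-by-category proportions, so the elbow it produces says nothing about the K value of 6 found in the report.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the proportions matrix: three blobs of 30 points each.
centers = np.array([[0.1, 0.8], [0.8, 0.1], [0.5, 0.5]])
features = np.vstack(
    [c + 0.03 * rng.standard_normal((30, 2)) for c in centers]
)

# Fit one model per candidate K and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

Plotting inertia against K and looking for the point where the curve flattens gives the elbow. On real, high-dimensional venue data the bend is typically much less sharp than on this toy example.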

After this, we assigned a cluster label to each district and displayed a separate dataframe for each cluster, which let us characterize each one.

These are the 6 cluster categories:

- Cluster 1: Restaurants (Indian, Swiss and Italian) & Services | Only Zurich and Edinburgh
- Cluster 2: Schools and Gardens | Only Toronto
- Cluster 3: Fitness Center, Zoo, Convenience Store | Only Edinburgh - mostly low urban area
- Cluster 4: Coffee Shop, Nightlife and Fast Food | All cities
- Cluster 5: Stores, shops, studios | Toronto and Lima
- Cluster 6: Bar, cafés and zoo | Only Edinburgh

As we can see, we should focus only on Clusters 4 and 5, since those are the only ones that include Lima, our original city. We need to consider that these cities differ a lot in their cultural, political and socioeconomic realities. Lima remains a developing city compared with Zurich, Edinburgh and Toronto, which are fully developed cities, so we did not expect to find an evident and direct relation between venues across those cities.

We then mapped clusters 4 and 5 for each city and graphed the cost of living. The result was the following: Zurich is by far more expensive than any other city, so it can be ruled out. Edinburgh and Toronto are much closer to each other, so we can narrow our options down to these two.

These steps give us the tools needed to proceed with the analysis of the results and the final conclusion.


## Results

Our statistical analysis showed that, geographically, the most similar cities to Lima are Toronto and Zurich. This can be seen in the KDE graphs in shades of green at the start of this report. Lima, although it has a more horizontal distribution, shows a single dense area at the center, as do Toronto and Zurich, whereas Edinburgh has two dense areas.

Moving on, our decision to develop a 6-cluster KMeans model (as indicated by the 'elbow' method) turned out to have interesting results. First of all, only clusters 4 and 5 were present in Lima, so we narrowed down our analysis to these two clusters. Additionally, the first cluster shows that Italian and Indian restaurants are rather common in both Zurich and Edinburgh, and the second cluster shows how common green areas are in Toronto and how uncommon they are in Zurich and Edinburgh. The remaining clusters are more specific, dominated by venues such as zoos and stations. We notice that Edinburgh includes a rural area alongside its metropolitan center.

#### When mapping all four cities, we find that Lima and Toronto are the only cities that share both clusters 4 and 5, and in a similar proportion. This will be decisive in our conclusion.

Finally, the Cost of Living graph shows the large gap between Zurich and the remaining cities. Edinburgh and Toronto show a similar index.



## Discussion

The most important observation is the similarity between Lima and Toronto, both in the Americas, and how different they are from Edinburgh and Zurich, both in Europe. The two European cities show not only a different distribution but also venues that Lima barely has. This is reinforced by the fact that clusters 4 and 5 are the only ones in which Lima appears together with any other city.

As a recommendation, it is important to decide on the sample extent in each city: how many boroughs, postal codes, or districts to cover. In this case Toronto had a much larger number of districts, although the area covered was similar. This could skew the results towards Toronto venues. It is also important to choose the right neighbourhood radius when making the calls to the Foursquare API. With a radius that is too large, the areas of different districts may overlap, resulting in duplicated data. With a radius that is too small, the sample may be too small to be significant.



## Conclusion

Bringing together the results discussed, we conclude that **Toronto** is the city to go to for an MSc in Data Science, for me and for anybody living in Lima! Evidently, this decision should be weighed against the university you want to attend and its cost. This result became evident when mapping clusters 4 and 5, and the decision became easier when analyzing the cost of living data.
