## Coursera - IBM Data Science Professional Certificate

### Capstone Assignment - The Battle of Neighborhoods - New York and Paris

#### 1. Introduction

**Problem statement:** A travel booking website wants to help its potential customers understand the similarities and dissimilarities between New York and Paris, so that they can make an informed decision when choosing a holiday destination.

There are many famous tourist attractions in Paris, such as the Eiffel Tower, the Arc de Triomphe (Arch of Triumph), and the courtyard of the Louvre Museum with its pyramid, along with art galleries, theaters, and antique stores. Beyond these, the city offers parks, cinemas, and museums that are popular with children, as well as shopping malls, street markets, cafes, and other shops.

Similarly, New York has many famous tourist attractions, such as Midtown Manhattan, Times Square, the Unisphere, the Brooklyn Bridge, Lower Manhattan with One World Trade Center, Central Park, the headquarters of the United Nations, and the Statue of Liberty.

**Approach:** The information above is publicly available, and anyone can search for it or read it on Wikipedia. However, working out the similarities and dissimilarities between the two cities that way takes a lot of manual research.

To understand the similarities between New York and Paris, we analyze the data from Foursquare, a popular local search-and-discovery application which provides search results for its users. The application provides personalized recommendations of places to go near a certain location.  Foursquare enables users to share their current location with friends, rate and comment on venues they visit and read reviews of venues that other users have provided on the application.

To compare New York and Paris, we will use geographical datasets of the two cities. We will consider the neighborhoods within 500-1000 meters of the center of each city location. We then analyze the venue recommendations returned by the Foursquare API. We create 5-10 clusters for each city, compare them against each other, and draw conclusions about how similar or dissimilar the neighborhoods are.

#### 2. Paris and New York Geographical Data

**Paris geographical data**


The city of Paris is divided into twenty arrondissements municipaux, administrative districts, more simply referred to as arrondissements. These are not to be confused with departmental arrondissements, which subdivide the 100 French départements. The word "arrondissement", when applied to Paris, refers almost always to the municipal arrondissements listed below. The number of the arrondissement is indicated by the last two digits in most Parisian postal codes (75001 up to 75020).

The twenty arrondissements are arranged in the form of a clockwise spiral (often likened to a snail shell), starting from the middle of the city, with the first on the Right Bank (north bank) of the Seine. Lyon and Marseille have, more recently, also been subdivided into arrondissements.

https://en.wikipedia.org/wiki/Arrondissements_of_Paris

In French, notably on street signs, the number is often given in Roman numerals. For example, the Eiffel Tower belongs to the VIIe arrondissement while Gare de l'Est is in the Xe arrondissement. In daily speech, people use only the ordinal number corresponding to the arrondissement, e.g. "Elle habite dans le sixième", "She lives in the 6th (arrondissement)".
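The numbering conventions described above can be sketched in a few lines of Python. This is only an illustration: the helper names `arrondissement_from_postcode` and `to_roman` are hypothetical and not part of the original notebook, and the postal-code rule covers only the regular 750xx and 751xx codes.

```python
def arrondissement_from_postcode(postcode: str) -> int:
    """Return the arrondissement number encoded in a Paris postal code."""
    if len(postcode) != 5 or not postcode.isdigit() or not postcode.startswith("75"):
        raise ValueError(f"not a Paris postal code: {postcode!r}")
    number = int(postcode[-2:])  # the last two digits carry the number
    if not 1 <= number <= 20:
        raise ValueError(f"no arrondissement encoded in {postcode!r}")
    return number


def to_roman(number: int) -> str:
    """Convert a small positive integer (1-20 is enough here) to Roman numerals."""
    numerals = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    result = ""
    for value, symbol in numerals:
        while number >= value:
            result += symbol
            number -= value
    return result


number = arrondissement_from_postcode("75007")
print(number, to_roman(number) + "e")
```

For the postal code 75007 this gives the 7th arrondissement, written `VIIe` as in the Eiffel Tower example.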


We will extract the Paris municipal borough data from the open datasets published at https://opendata.paris.fr/page/home/. The URL below can be used to download the JSON data.

https://opendata.paris.fr/explore/dataset/arrondissements/download?format=json&timezone=Europe/Berlin&use_labels_for_header=true


In [6]:
import pandas as pd
import numpy as np
import urllib.request 
import json
from pandas import json_normalize  # top-level since pandas 1.0; the pandas.io.json path is deprecated

In [7]:
# Get Paris geography data from https://opendata.paris.fr

with urllib.request.urlopen("https://opendata.paris.fr/explore/dataset/arrondissements/download?format=json&timezone=Europe/Berlin&use_labels_for_header=true") as paris_url:
   paris_data = json.loads(paris_url.read().decode('utf-8'))

[{'datasetid': 'arrondissements',
  'fields': {'c_ar': 2,
   'c_arinsee': 75102,
   'geom': {'coordinates': [[[2.351518483670821, 48.8644258050741],
      [2.350949105218923, 48.86340592861751],
      [2.346676032763327, 48.864430925901665],
      [2.346675453051013, 48.86443106483368],
      [2.345101655171463, 48.864809197959836],
      [2.341271025930368, 48.86572767724484],
      [2.34126849090564, 48.86572828653819],
      [2.341204510696185, 48.865743681005995],
      [2.341178272058699, 48.86574963323163],
      [2.341083555178273, 48.86577201721946],
      [2.337371969067098, 48.86664907439458],
      [2.335869691238243, 48.86699647535598],
      [2.335869054057415, 48.86699662650754],
      [2.333675321300195, 48.867516125009374],
      [2.33172601351949, 48.867954816599685],
      [2.331725629348361, 48.86795490259037],
      [2.330656733960091, 48.86819218066118],
      [2.330306795320876, 48.86835619167468],
      [2.329965588686572, 48.86851416917429],
      [2.32800732903

**Below is a sample of the Paris dataset.**

In [8]:
paris_df = json_normalize(paris_data)
paris_df.head()

| | datasetid | fields.c_ar | fields.c_arinsee | fields.geom.coordinates | fields.geom.type | fields.geom_x_y | fields.l_ar | fields.l_aroff | fields.longueur | fields.n_sq_ar | fields.n_sq_co | fields.objectid | fields.perimetre | fields.surface | geometry.coordinates | geometry.type | record_timestamp | recordid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | arrondissements | 2 | 75102 | [[[2.351518483670821, 48.8644258050741], [2.35... | Polygon | [48.86827922252252, 2.3428025468913636] | 2ème Ardt | Bourse | 4553.938764 | 750000002 | 750001537 | 2 | 4554.10436 | 991153.7 | [2.3428025468913636, 48.86827922252252] | Point | 2019-03-01T00:00:31+01:00 | fdcdd162efd8d445fdecb7b95ed7df1ff4c59f26 |
| 1 | arrondissements | 3 | 75103 | [[[2.363828096062925, 48.86750443060333], [2.3... | Polygon | [48.86287238001689, 2.3600009858976927] | 3ème Ardt | Temple | 4519.071982 | 750000003 | 750001537 | 3 | 4519.263648 | 1170883.0 | [2.3600009858976927, 48.86287238001689] | Point | 2019-03-01T00:00:31+01:00 | 469806e90b8b4676461b1845f113b25397cd5241 |
| 2 | arrondissements | 12 | 75112 | [[[2.413879624300607, 48.83357143972265], [2.4... | Polygon | [48.83497438148051, 2.421324900784681] | 12ème Ardt | Reuilly | 24088.038922 | 750000012 | 750001537 | 12 | 24089.666298 | 16314780.0 | [2.421324900784681, 48.83497438148051] | Point | 2019-03-01T00:00:31+01:00 | e8ec3494fa75e33f9cc5308108db755f2bafbd7c |
| 3 | arrondissements | 1 | 75101 | [[[2.328007329038849, 48.86991742140715], [2.3... | Polygon | [48.86256270183605, 2.3364433620533847] | 1er Ardt | Louvre | 6054.680862 | 750000001 | 750001537 | 1 | 6054.936862 | 1824613.0 | [2.3364433620533847, 48.86256270183605] | Point | 2019-03-01T00:00:31+01:00 | fd746ffccedf5bb7893b6ec2d7c8daf24a6f1fb5 |
| 4 | arrondissements | 4 | 75104 | [[[2.368512371393433, 48.85573412813671], [2.3... | Polygon | [48.854341426272896, 2.357629620324993] | 4ème Ardt | Hôtel-de-Ville | 5420.636779 | 750000004 | 750001537 | 4 | 5420.908434 | 1600586.0 | [2.357629620324993, 48.854341426272896] | Point | 2019-03-01T00:00:31+01:00 | 437ce5d06deeb12a187baea9fbd3e15c2ae87852 |


For this exercise, we will use only the arrondissement number (`fields.c_ar`), its name (`fields.l_aroff`), and its center coordinates (`fields.geom_x_y`, latitude and longitude) from the table above.
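A minimal sketch of that selection is shown below. It assumes `fields.geom_x_y` is ordered `[latitude, longitude]`, as the sample rows suggest. The output column names (`Arrondissement`, `Neighborhood`, `Latitude`, `Longitude`) and the helper `tidy_paris` are illustrative choices, and the two-row frame stands in for the full `paris_df`.

```python
import pandas as pd

# Two rows copied from the sample table, standing in for the full paris_df.
sample = pd.DataFrame({
    "fields.c_ar": [2, 3],
    "fields.l_aroff": ["Bourse", "Temple"],
    "fields.geom_x_y": [
        [48.86827922252252, 2.3428025468913636],
        [48.86287238001689, 2.3600009858976927],
    ],
})


def tidy_paris(df: pd.DataFrame) -> pd.DataFrame:
    """Keep number, name, and center point; split the point into two columns."""
    out = pd.DataFrame({
        "Arrondissement": df["fields.c_ar"],
        "Neighborhood": df["fields.l_aroff"],
        # Assumption: geom_x_y is [latitude, longitude].
        "Latitude": df["fields.geom_x_y"].apply(lambda point: point[0]),
        "Longitude": df["fields.geom_x_y"].apply(lambda point: point[1]),
    })
    return out.sort_values("Arrondissement").reset_index(drop=True)


paris_tidy = tidy_paris(sample)
print(paris_tidy)
```

Calling `tidy_paris(paris_df)` on the full frame should produce one row per arrondissement in the same shape.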

**New York geographical data**

New York City has 5 boroughs and 306 neighborhoods. To segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

This dataset is freely available on the web at https://geo.nyu.edu/catalog/nyu_2451_34572

For convenience, we will use the copy of the file that is already hosted on the IBM server, which can be fetched with a single wget command.

In [9]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
    
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

**Below is a sample of the New York dataset.**

In [10]:
ny_df = json_normalize(newyork_data)
ny_df.head()

| | bbox | crs.properties.name | crs.type | features | totalFeatures | type |
|---|---|---|---|---|---|---|
| 0 | [-74.2492599487305, 40.5033187866211, -73.7061... | urn:ogc:def:crs:EPSG::4326 | name | [{'geometry_name': 'geom', 'id': 'nyu_2451_345... | 306 | FeatureCollection |
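Normalizing the top-level object yields a single row because the 306 neighborhoods are nested inside the `features` list. A minimal sketch of flattening that list instead is shown below. The miniature `sample` dictionary is hypothetical and only mimics a GeoJSON FeatureCollection: the property names `borough` and `name`, the placeholder values, and the `[longitude, latitude]` coordinate order are assumptions to verify against the real file.

```python
import pandas as pd

# Hypothetical stand-in for newyork_data; names and coordinates are made up.
sample = {
    "type": "FeatureCollection",
    "totalFeatures": 2,
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-73.85, 40.89]},
            "properties": {"name": "Neighborhood A", "borough": "Borough X"},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-73.95, 40.70]},
            "properties": {"name": "Neighborhood B", "borough": "Borough Y"},
        },
    ],
}

# Normalize the list of features, not the top-level object:
# each feature becomes one row, nested keys become dotted column names.
features_df = pd.json_normalize(sample["features"])

neighborhoods = pd.DataFrame({
    "Borough": features_df["properties.borough"],
    "Neighborhood": features_df["properties.name"],
    # Assumption: GeoJSON order, i.e. [longitude, latitude].
    "Latitude": features_df["geometry.coordinates"].apply(lambda c: c[1]),
    "Longitude": features_df["geometry.coordinates"].apply(lambda c: c[0]),
})
print(neighborhoods)
```

With the real file, replacing `sample` by `newyork_data` should give one row per neighborhood, provided the assumed property names match.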



### Foursquare Local search and recommendations API:

Foursquare lets users search for restaurants, nightlife spots, shops and other places of interest in their surrounding area. It is also possible to search other areas by entering the name of a remote location. The app displays personalized recommendations based on the time of day, displaying breakfast places in the morning, dinner places in the evening etc. Recommendations are personalized based on factors that include a user's check-in history, their "Tastes" and their venue ratings.

In this assignment, we will use the Foursquare API to explore the top recommended venues near a particular neighborhood location. We will combine the Paris and New York geographical data with the Foursquare venues, then use the combined data to cluster the neighborhoods and look for basic similarities and dissimilarities between the neighborhoods of Paris and New York.
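As a rough sketch of what such an "explore" request could look like, the snippet below only builds the request URL and does not call the network. It assumes the legacy Foursquare v2 `venues/explore` endpoint with its `client_id`, `client_secret`, `v`, `ll`, `radius`, and `limit` query parameters. The function name `build_explore_url` is hypothetical, the credentials are placeholders, and the default version date, radius, and limit are illustrative values. Newer versions of the Foursquare API may use different endpoints and authentication, so check the current documentation before relying on this.

```python
from urllib.parse import urlencode

# Assumption: legacy Foursquare v2 "explore" endpoint.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      radius=500, limit=100, version="20180605"):
    """Build a venue-recommendation URL for one neighborhood center."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (illustrative)
        "ll": f"{lat},{lng}",            # "latitude,longitude"
        "radius": radius,                # search radius in meters
        "limit": limit,                  # maximum number of venues
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Center of the 2nd arrondissement (Bourse), taken from the sample table.
url = build_explore_url(48.86827922252252, 2.3428025468913636,
                        "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The same function can be called once per neighborhood in either city, with the radius set anywhere in the 500-1000 meter range mentioned in the introduction.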