# Coursera - Final Project


For the final project, I chose the following business problem: choosing the best borough of New York City in which to open a Chinese mall.

This seems like a simple task, because Chinese goods have a significant price advantage. However, I assume that demand for inexpensive Chinese goods will be highest in places with a greater concentration of people from China and India, since they are more accustomed to such goods. In addition, I assume that it is profitable to locate such a store near other shopping and leisure destinations, where people already gather. I therefore use the locations of Chinese and Indian restaurants, movie theaters and shopping malls as indicators.

#### To do this, I will follow 3 basic steps:
1. Analysing the dataset of the 5 boroughs
2. Getting the top 100 popular venues within a given radius of each neighbourhood using the Foursquare API
3. Exploring and analyzing the data of each borough

In [15]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge termcolor=1.1.0 --yes 
from termcolor import colored

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # skip this line if folium is already installed
import folium # map rendering library

!conda install -c conda-forge geopy --yes # skip this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

print('Libraries imported.')


# SECTION 1: <br>Build the dataset of all 5 Boroughs and their Neighbourhoods as a dataframe

In order to segment the neighborhoods of the boroughs of New York City and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

This dataset is available for free from the NYU Spatial Data Repository, and can be downloaded from: https://geo.nyu.edu/catalog/nyu_2451_34572

This project uses a copy of the file that is already hosted on a server. Its location is: https://ibm.box.com/shared/static/fbpwbovar7lf8p5sgddm06cgipa2rxpe.json

In [16]:
!wget -q -O 'newyork_data.json' https://ibm.box.com/shared/static/fbpwbovar7lf8p5sgddm06cgipa2rxpe.json
    
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data) 

In [17]:
#All the relevant data is in the features key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.
neighborhoods_data = newyork_data['features']

In [18]:
#Let's take a look at the first item in this list.
neighborhoods_data[0]

{'geometry': {'coordinates': [-73.84720052054902, 40.89470517661],
  'type': 'Point'},
 'geometry_name': 'geom',
 'id': 'nyu_2451_34572.1',
 'properties': {'annoangle': 0.0,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661],
  'borough': 'Bronx',
  'name': 'Wakefield',
  'stacked': 1},
 'type': 'Feature'}

#### Transform the data into a *pandas* dataframe
The next task is to transform the data from the JSON format of nested Python dictionaries into a *pandas* dataframe.
An empty dataframe with the target columns is defined first.
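
As a small illustration of the record layout, the sketch below pulls the four fields of interest out of a single feature. It reuses the `Wakefield` record shown above, trimmed to the keys that matter; the helper name `feature_to_row` is hypothetical and not part of the notebook. Note that GeoJSON stores coordinates as `[longitude, latitude]`, so the order has to be swapped.

```python
# One feature from the dataset, trimmed to the keys used here.
sample_feature = {
    'geometry': {'coordinates': [-73.84720052054902, 40.89470517661],
                 'type': 'Point'},
    'properties': {'borough': 'Bronx', 'name': 'Wakefield'},
}


def feature_to_row(feature):
    """Flatten one GeoJSON feature into a dict with the four dataframe columns."""
    lon, lat = feature['geometry']['coordinates']  # GeoJSON order is [lon, lat]
    return {
        'Borough': feature['properties']['borough'],
        'Neighborhood': feature['properties']['name'],
        'Latitude': lat,
        'Longitude': lon,
    }


row = feature_to_row(sample_feature)
print(row)
```

A list of such dictionaries can be passed directly to `pd.DataFrame` to build the whole table in one step.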

In [19]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
all_boroughs_neighborhoods = pd.DataFrame(columns=column_names)

In [20]:
# loop through the data and collect one row per neighborhood
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
all_boroughs_neighborhoods = pd.DataFrame(rows, columns=column_names)

print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(all_boroughs_neighborhoods['Borough'].unique()),
        all_boroughs_neighborhoods.shape[0]
    )
)
all_boroughs_neighborhoods = all_boroughs_neighborhoods.sort_values("Borough").reset_index(drop=True)

The dataframe has 5 boroughs and 306 neighborhoods.


In [21]:
all_boroughs_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Country Club,40.844246,-73.824099
2,Bronx,Parkchester,40.837938,-73.856003
3,Bronx,Westchester Square,40.840619,-73.842194
4,Bronx,Van Nest,40.843608,-73.866299


#### Use geopy library to get the latitude and longitude values of New York City.

In [22]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent='random')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7308619, -73.9871558.


#### Create a map of New York with neighborhoods superimposed on top.

In [23]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(all_boroughs_neighborhoods['Latitude'], all_boroughs_neighborhoods['Longitude'], all_boroughs_neighborhoods['Borough'], all_boroughs_neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

# SECTION 2: <BR> Working with Foursquare APIs #

In this section, we use the Foursquare API to get the top 100 popular venues within a defined radius of each neighbourhood. To do this, the latitude and longitude of the neighbourhood are passed to the API.
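
To make the request format concrete, here is a minimal sketch of how such a URL could be assembled with the standard library, so that every parameter is percent-encoded automatically. The function name `build_explore_url` and the credentials `'MY_ID'` and `'MY_SECRET'` are placeholders invented for this example; the endpoint and the parameter names are the ones used in this notebook.

```python
from urllib.parse import urlencode


def build_explore_url(lat, lng, client_id, client_secret,
                      version='20180604', radius=2000, limit=100):
    """Build a Foursquare v2 venues/explore URL with encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # "latitude,longitude" of the neighbourhood
        'radius': radius,                # search radius in meters
        'limit': limit,                  # maximum number of venues to return
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Placeholder credentials; the coordinates are those of Wakefield in the Bronx.
url = build_explore_url(40.894705, -73.847201, 'MY_ID', 'MY_SECRET')
print(url)
```

The comma in `ll` is encoded as `%2C`, which is equivalent to a literal comma in a query string.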

In [55]:
# parameters to be used for building the URL as specified by FourSquare
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare client ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your own secret; do not publish real credentials
VERSION = '20180604'

LIMIT = 100 # limit of number of venues returned by Foursquare API
RADIUS = 2000 # search radius around each neighbourhood, in meters

In [25]:
# returns the neighbourhoods of a borough and their geo coordinates, given the borough name
def get_neighbourhoods_df(borough_name):
    # each row holds one neighbourhood with its latitude and longitude
    neighbourhood_data = all_boroughs_neighborhoods[all_boroughs_neighborhoods['Borough'] == borough_name]
    return neighbourhood_data

In [56]:
# Function to return a URL
def get_venue_explore_url(lat,long):
    # create URL
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        lat, 
        long, 
        RADIUS, 
        LIMIT)
    return url 

In [27]:
# Function from the Foursquare lab that extracts the category of a venue.
# In the API response, the venue data sits under the *items* key.
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [28]:
# function which will be called to retrieve popular venues for a given lat & long of a neighbourhood
def getNearbyVenues(names, latitudes, longitudes):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
         
        # create the API request URL
        url = get_venue_explore_url(lat,lng)
       
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
       
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)
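
The nesting that `getNearbyVenues` walks through (`response`, then `groups[0]`, then `items`, then `venue`) is easier to see on a small example. The payload below is hand-written and hypothetical: the venue names and coordinates are invented, and only the key structure mirrors what the function above reads. The helper `flatten_items` is also hypothetical; unlike the function above, it tolerates a venue with an empty `categories` list.

```python
# Hypothetical payload that mimics the nesting used above:
# response -> groups[0] -> items -> venue.
sample_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Example Noodle House',
                   'location': {'lat': 40.71, 'lng': -73.99},
                   'categories': [{'name': 'Chinese Restaurant'}]}},
        {'venue': {'name': 'Example Uncategorized Place',
                   'location': {'lat': 40.72, 'lng': -73.98},
                   'categories': []}},
    ]}]}
}


def flatten_items(payload):
    """Return (name, lat, lng, category) tuples; category is None when missing."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0]['items'] if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


venue_rows = flatten_items(sample_payload)
for venue_row in venue_rows:
    print(venue_row)
```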

In [29]:
# We are only interested in venues in the following categories; this pattern is used to select the matching records
venue_filter="Movie Theater|Indian Restaurant|Chinese Restaurant|Mall"
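
By default, pandas `Series.str.contains` treats its argument as a regular expression and matches it anywhere in the string, so the `|` characters act as alternatives. The sketch below reproduces that behaviour with the standard `re` module on an illustrative list of category names (the list is made up for this example). It shows that a category such as `North Indian Restaurant` passes the filter, while the per-category counting further down compares names exactly.

```python
import re

venue_filter = "Movie Theater|Indian Restaurant|Chinese Restaurant|Mall"
pattern = re.compile(venue_filter)

# Illustrative category names; not taken from an actual API response.
categories = ['Chinese Restaurant', 'North Indian Restaurant',
              'Shopping Mall', 'Movie Theater', 'Pizza Place']

# re.search matches the pattern anywhere in the string, like str.contains.
matched = [c for c in categories if pattern.search(c)]
print(matched)
```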

In [30]:
# Create an empty dataframe, which will be populated with counts of venues for each borough and used to create a bar chart
bar_columns = ['Borough','Movie Theater','Indian Restaurant','Chinese Restaurant','Shopping Malls']
df_bar_chart = pd.DataFrame(columns=bar_columns)

In [31]:
# Copies the per-category counts from a grouped dataframe into a new row of df_bar_chart
def populate_df_for_barchart (borough, df):
    list_as_row = [borough,0,0,0,0]
    for index, row in df.iterrows():
        if row["Venue Category"]=='Movie Theater': list_as_row[1]= row["counts"]
        if row["Venue Category"]=='Indian Restaurant': list_as_row[2]= row["counts"]
        if row["Venue Category"]=='Chinese Restaurant': list_as_row[3]= row["counts"]
        if row["Venue Category"]=='Shopping Mall': list_as_row[4]= row["counts"]  
    df_bar_chart.loc[len(df_bar_chart)] = list_as_row        

# SECTION 3: <BR> Exploring the data of boroughs, neighbourhoods and venues #

In this section, we will explore the data collected for boroughs, their neighbourhoods and the venues of our interest.
The objective is to understand how the venues of interest are distributed across the boroughs.

#### Staten Island

In [32]:
#Build the dataframe with venues in all neighbourhoods
staten_data = get_neighbourhoods_df("Staten Island")
staten_neighbor_venues = getNearbyVenues(names=staten_data['Neighborhood'],
                                   latitudes=staten_data['Latitude'],
                                   longitudes=staten_data['Longitude'],
                                  )

In [33]:
chinese_mall_locations_staten = staten_neighbor_venues[staten_neighbor_venues['Venue Category'].str.contains(venue_filter)]
chinese_mall_locations_staten = chinese_mall_locations_staten.drop(chinese_mall_locations_staten[chinese_mall_locations_staten["Venue Category"] == "Chinese Mall"].index)
chinese_mall_locations_staten.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
11,Elm Park,40.630147,-74.141817,Crown Palace,40.626593,-74.13193,Chinese Restaurant
30,Elm Park,40.630147,-74.141817,Fortune Cookie,40.626244,-74.134688,Chinese Restaurant
106,Howland Hook,40.638433,-74.186223,United Artists Staten Island 16 & RPX,40.626312,-74.172686,Movie Theater
149,Randall Manor,40.63563,-74.098051,Dosa Garden,40.634231,-74.085696,Indian Restaurant
233,Randall Manor,40.63563,-74.098051,Diamond Forest Chinese Food Restaurant,40.630977,-74.100951,Chinese Restaurant


In [34]:
df_bar_stat = (chinese_mall_locations_staten.groupby(['Venue Category']).size().reset_index(name='counts'))
df_bar_stat

Unnamed: 0,Venue Category,counts
0,Chinese Restaurant,65
1,Indian Restaurant,17
2,Movie Theater,4
3,Shopping Mall,7


In [35]:
populate_df_for_barchart("Staten Island",df_bar_stat)
df_bar_chart

Unnamed: 0,Borough,Movie Theater,Indian Restaurant,Chinese Restaurant,Shopping Malls
0,Staten Island,4,17,65,7


#### Manhattan

In [36]:
#Build the dataframe with venues in all neighbourhoods
manhattan_data = get_neighbourhoods_df("Manhattan")
manhattan_neighbor_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude'],
                                  )

In [37]:
chinese_mall_locations_man = manhattan_neighbor_venues[manhattan_neighbor_venues['Venue Category'].str.contains(venue_filter)]
chinese_mall_locations_man = chinese_mall_locations_man.drop(chinese_mall_locations_man[chinese_mall_locations_man["Venue Category"] == "Chinese Mall"].index)
chinese_mall_locations_man.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
11,Chinatown,40.715618,-73.994279,Shanghai Asian Manor,40.714423,-73.998904,Chinese Restaurant
236,Carnegie Hill,40.782683,-73.953256,Drunken Munkey,40.781106,-73.947549,Indian Restaurant
419,Flatiron,40.739673,-73.990947,Xi'an Famous Foods,40.740632,-73.987346,Chinese Restaurant
530,Stuyvesant Town,40.731,-73.974052,Xi'an Famous Foods,40.727731,-73.985652,Chinese Restaurant
544,Stuyvesant Town,40.731,-73.974052,Han Dynasty,40.73213,-73.98809,Chinese Restaurant


In [38]:
df_bar_man = (chinese_mall_locations_man.groupby(['Venue Category']).size().reset_index(name='counts'))
df_bar_man

Unnamed: 0,Venue Category,counts
0,Chinese Restaurant,41
1,Indian Restaurant,17
2,Movie Theater,3
3,North Indian Restaurant,2
4,Shopping Mall,4


In [39]:
populate_df_for_barchart("Manhattan",df_bar_man)
df_bar_chart

Unnamed: 0,Borough,Movie Theater,Indian Restaurant,Chinese Restaurant,Shopping Malls
0,Staten Island,4,17,65,7
1,Manhattan,3,17,41,4


#### Brooklyn

In [40]:
#Build the dataframe with venues in all neighbourhoods
brooklyn_data = get_neighbourhoods_df("Brooklyn")
brooklyn_neighbor_venues = getNearbyVenues(names=brooklyn_data['Neighborhood'],
                                   latitudes=brooklyn_data['Latitude'],
                                   longitudes=brooklyn_data['Longitude'],
                                  )


In [41]:
chinese_mall_locations_brook = brooklyn_neighbor_venues[brooklyn_neighbor_venues['Venue Category'].str.contains(venue_filter)]
chinese_mall_locations_brook = chinese_mall_locations_brook.drop(chinese_mall_locations_brook[chinese_mall_locations_brook["Venue Category"] == "Chinese Mall"].index)
chinese_mall_locations_brook.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Downtown,40.690844,-73.983463,Alamo Drafthouse Cinema - Brooklyn,40.691016,-73.983686,Movie Theater
112,Erasmus,40.646926,-73.948177,Gandhi Fine Indian Cuisine,40.655168,-73.956325,Indian Restaurant
139,Erasmus,40.646926,-73.948177,China Hao Restaurant,40.650186,-73.930144,Chinese Restaurant
148,Erasmus,40.646926,-73.948177,Green Lake Chinese Restaurant,40.65311,-73.959343,Chinese Restaurant
195,Erasmus,40.646926,-73.948177,Silver Krust,40.642391,-73.92662,Indian Restaurant


In [42]:
df_bar_brook = (chinese_mall_locations_brook.groupby(['Venue Category']).size().reset_index(name='counts'))
df_bar_brook

Unnamed: 0,Venue Category,counts
0,Chinese Restaurant,97
1,Indian Restaurant,42
2,Movie Theater,14
3,Shopping Mall,4


In [43]:
populate_df_for_barchart("Brooklyn",df_bar_brook)
df_bar_chart

Unnamed: 0,Borough,Movie Theater,Indian Restaurant,Chinese Restaurant,Shopping Malls
0,Staten Island,4,17,65,7
1,Manhattan,3,17,41,4
2,Brooklyn,14,42,97,4


#### Bronx

In [57]:
#Build the dataframe with venues in all neighbourhoods
bronx_data = get_neighbourhoods_df("Bronx")
bronx_data_neighbor_venues = getNearbyVenues(names=bronx_data['Neighborhood'],
                                   latitudes=bronx_data['Latitude'],
                                   longitudes=bronx_data['Longitude'],
                                  )

In [58]:
chinese_mall_locations_bronx = bronx_data_neighbor_venues[bronx_data_neighbor_venues['Venue Category'].str.contains(venue_filter)]
chinese_mall_locations_bronx = chinese_mall_locations_bronx.drop(chinese_mall_locations_bronx[chinese_mall_locations_bronx["Venue Category"] == "Chinese Mall"].index)
chinese_mall_locations_bronx.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
27,Wakefield,40.894705,-73.847201,Curry Spot,40.897625,-73.867147,Indian Restaurant
63,Wakefield,40.894705,-73.847201,Shopwell Plaza,40.884144,-73.832001,Shopping Mall
456,Van Nest,40.843608,-73.866299,Peking Kitchen,40.854462,-73.866608,Chinese Restaurant
555,Morris Park,40.847549,-73.850402,Mr. Q's Chinese Restaurant,40.85579,-73.855455,Chinese Restaurant
578,Morris Park,40.847549,-73.850402,Peking Kitchen,40.854462,-73.866608,Chinese Restaurant
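
With a 2000 m radius, the search circles of adjacent neighbourhoods overlap, so the same venue can be returned for more than one neighbourhood: Peking Kitchen appears under both Van Nest and Morris Park above. The counts in this notebook keep such repeats. If distinct venues were wanted instead, one possible approach, shown here only as a sketch and not as what the notebook does, is to deduplicate on venue name and coordinates.

```python
# (neighborhood, venue, venue_lat, venue_lng, category) rows copied from the output above
rows = [
    ('Van Nest', 'Peking Kitchen', 40.854462, -73.866608, 'Chinese Restaurant'),
    ('Morris Park', "Mr. Q's Chinese Restaurant", 40.85579, -73.855455, 'Chinese Restaurant'),
    ('Morris Park', 'Peking Kitchen', 40.854462, -73.866608, 'Chinese Restaurant'),
]

seen = set()
unique_venues = []
for neighborhood, venue, lat, lng, category in rows:
    key = (venue, lat, lng)  # same name at the same coordinates is the same venue
    if key not in seen:
        seen.add(key)
        unique_venues.append((venue, category))

print(len(rows), 'rows ->', len(unique_venues), 'distinct venues')
```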


In [59]:
df_bar_bronx = (chinese_mall_locations_bronx.groupby(['Venue Category']).size().reset_index(name='counts'))
df_bar_bronx

Unnamed: 0,Venue Category,counts
0,Chinese Restaurant,41
1,Indian Restaurant,11
2,Movie Theater,2
3,Shopping Mall,26


In [60]:
populate_df_for_barchart("Bronx",df_bar_bronx)
df_bar_chart

Unnamed: 0,Borough,Movie Theater,Indian Restaurant,Chinese Restaurant,Shopping Malls
0,Staten Island,4,17,65,7
1,Manhattan,3,17,41,4
2,Brooklyn,14,42,97,4
3,Bronx,2,11,41,26


#### Queens

In [61]:
#Build the dataframe with venues in all neighbourhoods
queens_data = get_neighbourhoods_df("Queens")
queens_data_neighbor_venues = getNearbyVenues(names=queens_data['Neighborhood'],
                                   latitudes=queens_data['Latitude'],
                                   longitudes=queens_data['Longitude'],
                                  )


In [62]:
# Keep only the venues that belong to the categories of interest
chinese_mall_locations_queens = queens_data_neighbor_venues[queens_data_neighbor_venues['Venue Category'].str.contains(venue_filter)]
chinese_mall_locations_queens = chinese_mall_locations_queens.drop(chinese_mall_locations_queens[chinese_mall_locations_queens["Venue Category"] == "Chinese Mall"].index)
chinese_mall_locations_queens.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
60,Laurelton,40.667884,-73.740256,Green Acres Commons,40.663338,-73.726654,Shopping Mall
107,Lefrak City,40.736075,-73.862525,Rego Center,40.732896,-73.863135,Shopping Mall
142,Lefrak City,40.736075,-73.862525,Green Zenphony,40.730528,-73.863641,Chinese Restaurant
171,Lefrak City,40.736075,-73.862525,Queens Center,40.734723,-73.870041,Shopping Mall
236,Belle Harbor,40.576156,-73.854018,East Meets West,40.578435,-73.849474,Chinese Restaurant


In [63]:
df_bar_queens = (chinese_mall_locations_queens.groupby(['Venue Category']).size().reset_index(name='counts'))
df_bar_queens

Unnamed: 0,Venue Category,counts
0,Chinese Restaurant,163
1,Indian Restaurant,109
2,Movie Theater,16
3,Shopping Mall,23


In [64]:
populate_df_for_barchart("Queens",df_bar_queens)
df_bar_chart

Unnamed: 0,Borough,Movie Theater,Indian Restaurant,Chinese Restaurant,Shopping Malls
0,Staten Island,4,17,65,7
1,Manhattan,3,17,41,4
2,Brooklyn,14,42,97,4
3,Bronx,2,11,41,26
4,Queens,16,109,163,23


### As we can see, the best borough for the Chinese mall is Queens: it has the most Chinese restaurants (163), Indian restaurants (109) and movie theaters (16), and is second only to the Bronx in shopping malls (23 versus 26).
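
As a quick check of this conclusion, the sketch below adds up the four counts for each borough, using the numbers from the final summary table above. Treating all four categories as equally important is an assumption made for this example; the notebook itself does not define a weighting.

```python
# Counts copied from the final summary table above.
counts = {
    'Staten Island': {'Movie Theater': 4, 'Indian Restaurant': 17,
                      'Chinese Restaurant': 65, 'Shopping Malls': 7},
    'Manhattan': {'Movie Theater': 3, 'Indian Restaurant': 17,
                  'Chinese Restaurant': 41, 'Shopping Malls': 4},
    'Brooklyn': {'Movie Theater': 14, 'Indian Restaurant': 42,
                 'Chinese Restaurant': 97, 'Shopping Malls': 4},
    'Bronx': {'Movie Theater': 2, 'Indian Restaurant': 11,
              'Chinese Restaurant': 41, 'Shopping Malls': 26},
    'Queens': {'Movie Theater': 16, 'Indian Restaurant': 109,
               'Chinese Restaurant': 163, 'Shopping Malls': 23},
}

# Unweighted total of relevant venues per borough (an assumed scoring rule).
totals = {borough: sum(by_category.values())
          for borough, by_category in counts.items()}

# Sort boroughs from the highest total to the lowest.
ranking = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
for borough, total in ranking:
    print('{:<14}{}'.format(borough, total))
```

Under this simple scoring, Queens comes first with 311 venues, roughly twice the total of Brooklyn (157), followed by Staten Island (93), the Bronx (80) and Manhattan (65).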