# Capstone Project - The Battle of the Neighborhoods Week 1

<h1><center><i>New York Restaurant Data Project</i></center></h1>
 <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQnPzjRkhYvaXAP3eXKUrmGaw5xCKEWJYnWGS-JrVR_q1GmGcd1&usqp=CAU" alt="Restaurant" width="500" height="600"/> 
<h3><center><i>for</i></center></h3>
    <h2><center><em>ABC Multicuisine Inc</em></center></h2>

## Table of contents
  
1. [Introduction: Business Problem](#introduction)
2. [Problem Background](#problembackground)
3. [Problem Description](#problemdescription)
4. [Target Audience / Stakeholders](#targetaudience)
5. [Success / Exit Criteria](#success)
6. [Dataset / Data Provider](#dataset)
    1. [Zipcode Definition Data](#zipcode)
    2. [Population Density by Boroughs Data](#density)
    3. [Population By Zipcodes for all Boroughs](#population)
    4. [Foursquare Venues Data By Restaurant Category](#foursquare)
    5. [Known Assumptions](#assumptions)

## Introduction: Business Problem <a name="introduction"></a>
### Introduction

**ABC Multicuisine Inc** (hereafter referred to as the Company) is a successful restaurant company that specializes in **Indian**, **Chinese**, **American** and **Italian** cuisine. The Company is exploring an opportunity to open a new restaurant in the **New York** area by the end of Q3 2020.

## Problem Background <a name="problembackground"></a>

The Company has been running its restaurant business successfully in the Asia and Australia regions and would like to enter the United States market by opening its first restaurant in the **New York** region, then expanding to other parts of New York and other cities in the USA. As the Company is a new entrant in this part of the world, it has engaged the data science team to research, study and recommend which area of New York would be best suited for its first restaurant, specializing in one of its core strengths: **Indian**, **Chinese**, **American** or **Italian** cuisine.

**New York** City is the financial capital of the USA and has a diverse population. It is one of the most populous cities in the USA, with industries ranging from finance, software, retail and consumer goods to tourism. The Company would like to make its decision by Q3 2020 and looks to the data science team to do a thorough analysis and recommend the best location and the best cuisine for the new restaurant, so that it can gain market share, establish its brand values in New York and achieve the best return on investment.

## Problem Description <a name="problemdescription"></a>

New York City offers its customers a wide variety of international cuisines. As our Company specializes in, and is interested only in, **Indian**, **Chinese**, **American** and **Italian** cuisine, we will focus only on these four cuisines in our data analysis. New York City is divided into five [Boroughs](https://en.wikipedia.org/wiki/Borough), namely:
* Bronx
* Brooklyn
* Manhattan
* Queens
* Staten Island

To help the Company compete with existing players, gain market share and grow organically, this data science project will analyze the following factors for each of the Boroughs listed above:

* List of zip codes mapped to Boroughs
* Land Area of Boroughs
* Per Capita Income of People in Each Borough
* Persons Per Square Mile
* Total Population and Population of Different Ethnic Groups
* Existing Players per Cuisine in the Market Segment of Each Borough
* Similarities and Dissimilarities between All Five Boroughs

In short, as this will be the Company's first project in this part of the world, it is very important that we arrive at the right recommendation: the best location within the five Boroughs of New York and the best cuisine among the four categories the Company specializes in, so that the Company gains market share and earns a better return on investment.

## Target Audience / Stakeholders<a name="targetaudience"></a>

**ABC Multicuisine Inc** has chosen our data science team to understand, study and analyze the problem of finding the right location within New York for its first restaurant in the USA region. Our objective is to arrive at the best possible recommendation based on the available data and our research, and to submit the report to the Board of Directors, the Business Head of the USA region and the Executive Leadership team.

## Success / Exit Criteria<a name="success"></a>

The success of this data science project will be judged by the team's recommendation of the best location and the best cuisine category: one that caters to the needs of the local population within the selected Borough and meets the demands of the Company's future customer segment.

## Dataset / Data Provider<a name="dataset"></a>

The following datasets will be used for this project:

* [New York Neighborhood Data](https://data.beta.nyc/dataset/pediacities-nyc-neighborhoods/resource/7caac650-d082-4aea-9f9b-3681d568e8a5)
* [Land Area / Population Density by Boroughs](https://en.wikipedia.org/wiki/Boroughs_of_New_York_City)
* [Population By Zipcodes for all Boroughs](https://data.beta.nyc/dataset/pediacities-nyc-neighborhoods/resource/7caac650-d082-4aea-9f9b-3681d568e8a5)
* [FourSquare Restaurant Categories Data](https://developer.foursquare.com/docs/api-reference/venues/categories/)

### Zipcode Definition Data<a name="zipcode"></a>

The mapping of New York zip codes to their corresponding Boroughs can be obtained [here](https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm).
New York City is divided into five Boroughs, namely:
* Bronx
* Brooklyn
* Manhattan
* Queens
* Staten Island

We will retrieve all zip codes mapped to their corresponding Borough and Neighborhood, along with their latitude and longitude coordinates.

In [7]:
!pip install folium
!pip install geopy
!pip install bs4



In [8]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim 
import folium 
import seaborn as sns
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib as mpl
import matplotlib.pyplot as plt
from itertools import cycle, islice
from re import sub
from config import credentials

%matplotlib inline 
mpl.style.use(['ggplot'])

#### Build New York Neighborhood dataframe from [New York Neighborhood Data](https://data.beta.nyc/dataset/pediacities-nyc-neighborhoods/resource/7caac650-d082-4aea-9f9b-3681d568e8a5)

In [9]:
def get_coordinates_for_zipcode(zipcode):
    """Geocode a New York zip code and return (latitude, longitude)."""
    address = 'New York, NY {}'.format(zipcode)
    geolocator = Nominatim(user_agent="newyork_explorer")
    location = geolocator.geocode(address)
    if location is None:
        # Nominatim found no match; return NaNs so the row can be inspected later
        return (np.nan, np.nan)
    return (location.latitude, location.longitude)

In [10]:
new_df = pd.read_csv("nyc_zip_borough_neighborhoods_pop.csv")
new_df['Borough'] = new_df['Borough'].astype(str)
new_df['Neighborhood'] = new_df['Neighborhood'].astype(str)
new_df[['Latitude', 'Longitude']] = new_df.apply(lambda row: get_coordinates_for_zipcode(row.ZipCode), axis=1).apply(pd.Series)
new_df.head()

| | ZipCode | Borough | Neighborhood | Population | Density | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 10001 | Manhattan | Chelsea and Clinton | 21102 | 33959 | 40.741236 | -73.356691 |
| 1 | 10002 | Manhattan | Lower East Side | 81410 | 92573 | 40.712728 | -74.006015 |
| 2 | 10003 | Manhattan | Lower East Side | 56024 | 97188 | 40.712728 | -74.006015 |
| 3 | 10004 | Manhattan | Lower Manhattan | 3089 | 5519 | 40.712728 | -74.006015 |
| 4 | 10005 | Manhattan | Lower Manhattan | 7135 | 97048 | 40.712728 | -74.006015 |
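
In the output above, zip codes 10002 through 10005 all receive the coordinates (40.712728, -74.006015), which may indicate that the geocoder returned a generic New York match for those queries. The following is a minimal sketch of a check for such shared coordinates. It uses plain Python on rows copied from the output, and the helper name `find_shared_coordinates` is hypothetical, not part of the original notebook.

```python
from collections import defaultdict

# Rows copied from the new_df.head() output: (ZipCode, Latitude, Longitude)
sample_rows = [
    (10001, 40.741236, -73.356691),
    (10002, 40.712728, -74.006015),
    (10003, 40.712728, -74.006015),
    (10004, 40.712728, -74.006015),
    (10005, 40.712728, -74.006015),
]


def find_shared_coordinates(rows):
    """Group zip codes by coordinate pair and keep pairs used more than once."""
    by_point = defaultdict(list)
    for zipcode, lat, lon in rows:
        by_point[(lat, lon)].append(zipcode)
    return {point: zips for point, zips in by_point.items() if len(zips) > 1}


shared = find_shared_coordinates(sample_rows)
for (lat, lon), zips in shared.items():
    print(f"({lat}, {lon}) is shared by zip codes {zips}")
```

Zip codes flagged this way are candidates for re-geocoding or manual review before any distance-based analysis.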


### Land Area / Population Density by Boroughs<a name="density"></a>

The following key data for each Borough can be obtained [here](https://en.wikipedia.org/wiki/Boroughs_of_New_York_City):
* Per Capita Income
* Land Area
* Persons Per Square Mile

[Per Capita Income](https://en.wikipedia.org/wiki/Per_capita_income) data measures the **average income earned per person** in a given area. It is calculated by dividing the area's total income by its total population. [Population density](https://en.wikipedia.org/wiki/Population_density) is a measurement of population per unit area, or exceptionally unit volume; it is a quantity of type number density.

In [11]:
url = 'https://en.wikipedia.org/wiki/Boroughs_of_New_York_City'
source = requests.get(url).text
soup = BeautifulSoup(source, 'html.parser')  # name the parser explicitly for reproducible parsing
table=soup.find('table')

column_names=['Borough', 'PerCapitaIncome', 'LandArea', 'PersonsPerSqM']
borough_df = pd.DataFrame(columns=column_names)

for tr in table.find_all('tr'):
    row_data=[]
    for td in tr.find_all('td'):
        row_data.append(td.text.strip())
        if len(row_data) == 9:
            #print(row_data)
            borough_df.loc[len(borough_df)] = [row_data[0], 
                                               int(sub(r'[^\d.]', '', row_data[4])), 
                                               float(sub(r'[^\d.]', '', row_data[5])), 
                                               int(sub(r'[^\d.]', '', row_data[7]))]

borough_df.loc[borough_df['Borough'] == 'The Bronx', 'Borough'] = 'Bronx'
borough_df['PerCapitaIncome'] = borough_df['PerCapitaIncome'].astype(int)
borough_df['PersonsPerSqM'] = borough_df['PersonsPerSqM'].astype(int)
borough_df

| | Borough | PerCapitaIncome | LandArea | PersonsPerSqM |
|---|---|---|---|---|
| 0 | Bronx | 30100 | 42.1 | 33867 |
| 1 | Brooklyn | 35800 | 70.82 | 36147 |
| 2 | Manhattan | 368500 | 22.83 | 71341 |
| 3 | Queens | 41400 | 108.53 | 20767 |
| 4 | Staten Island | 30500 | 58.37 | 8157 |
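
Because population density is population divided by land area, the two scraped columns can be multiplied to get a rough population estimate per Borough. The sketch below does this with the values copied from the output above. It assumes that `LandArea` is in square miles and `PersonsPerSqM` is persons per square mile, and the helper name `estimate_population` is hypothetical.

```python
# Values copied from the borough_df output:
# borough -> (land area in square miles, persons per square mile)
boroughs = {
    "Bronx": (42.1, 33867),
    "Brooklyn": (70.82, 36147),
    "Manhattan": (22.83, 71341),
    "Queens": (108.53, 20767),
    "Staten Island": (58.37, 8157),
}


def estimate_population(land_area_sq_mi, persons_per_sq_mi):
    """Invert density = population / area to estimate the population."""
    return land_area_sq_mi * persons_per_sq_mi


estimates = {
    name: estimate_population(area, density)
    for name, (area, density) in boroughs.items()
}
# Borough with the highest persons-per-square-mile value
densest = max(boroughs, key=lambda name: boroughs[name][1])

for name, population in estimates.items():
    print(f"{name}: ~{population:,.0f} residents")
print("Densest borough:", densest)
```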


### Population data By Zipcodes for All Boroughs<a name="population"></a>

We will collect the following data from [here](https://data.beta.nyc/dataset/pediacities-nyc-neighborhoods/resource/7caac650-d082-4aea-9f9b-3681d568e8a5):
* Population  (Number of people living in a given zip code area)
* Density (Number of people living per square mile in a given zip code area)

Together, population and density will give us a clear picture of how densely each zip code area is populated.
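
As a small illustration of why both columns matter, the sketch below ranks the five sample zip codes shown earlier in this notebook by density and derives the land area each pair of values implies. The rows are copied from that sample output, and the helper name `implied_area_sq_mi` is hypothetical.

```python
# Rows copied from the new_df.head() output: (ZipCode, Population, Density)
zip_rows = [
    (10001, 21102, 33959),
    (10002, 81410, 92573),
    (10003, 56024, 97188),
    (10004, 3089, 5519),
    (10005, 7135, 97048),
]


def implied_area_sq_mi(population, density):
    """Area implied by density = population / area."""
    return population / density


# Sort from most to least densely populated
ranked = sorted(zip_rows, key=lambda row: row[2], reverse=True)
for zipcode, population, density in ranked:
    area = implied_area_sq_mi(population, density)
    print(f"{zipcode}: {density:>6} persons/sq mi, {population:>6} residents, ~{area:.2f} sq mi")
```

Zip code 10005, for example, has few residents but a very high density, so ranking by population alone would understate how concentrated its customer base is.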

### Foursquare Venues Data By Restaurant Category<a name="foursquare"></a>

[Foursquare.com](https://foursquare.com/) provides access to firmographic data and rich community-sourced content for more than 60 million commercial places around the world, via flat file or API. We will use their [Places API](https://developer.foursquare.com/docs/build-with-foursquare/categories/), which returns the list of restaurant venues for a given restaurant category and Borough in [JSON](https://en.wikipedia.org/wiki/JSON) format. Since ABC Multicuisine Inc specializes only in certain cuisines, we will collect restaurant data for the following four categories:
* **American**
* **Italian**
* **Chinese**
* **Indian**
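
A venue search for one category in one Borough comes down to a single GET request with a handful of query parameters. The sketch below only builds the URL, so it runs without network access or real credentials. The endpoint, parameter names and category ID are the ones used in this notebook, while `build_search_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"


def build_search_url(client_id, client_secret, category_id, borough, version="20190425"):
    """Return the venues/search URL for one restaurant category in one borough."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "intent": "browse",
        "near": f"{borough}, NY",
        "categoryId": category_id,
    }
    # urlencode escapes spaces and commas, e.g. in "Staten Island, NY"
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials; the category ID is the "Indian" entry used in this notebook
url = build_search_url(
    "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "4bf58dd8d48988d10f941735", "Staten Island"
)
print(url)
```

Encoding the parameters instead of formatting them into the string by hand keeps Borough names that contain spaces from producing a malformed URL.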

In [12]:
restaurant_categories = {
    "Indian": "4bf58dd8d48988d10f941735",
    "Chinese": "4bf58dd8d48988d145941735",
    "Italian": "4bf58dd8d48988d110941735",
    "American": "4bf58dd8d48988d14e941735"
}

def get_restaurants_by_category_id(category_id, borough, version='20190425'):
    """Return the Foursquare venues for one category near the given Borough, or None on error."""
    params = {
        "client_id": credentials["CLIENT_ID"],
        "client_secret": credentials["CLIENT_SECRET"],
        "v": version,
        "intent": "browse",
        "near": "{}, NY".format(borough),
        "categoryId": category_id,
    }
    response = requests.get("https://api.foursquare.com/v2/venues/search", params=params).json()
    if response["meta"]["code"] == 200:
        return response["response"]["venues"]
    print("{}: return code: {}".format(borough, response["meta"]["code"]))
    return None

# Create an empty DataFrame
restaurant_columns = ['ZipCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude', 'Name', 'Category']
restaurants_df = pd.DataFrame(columns=restaurant_columns)

# Get the restaurant details for each category and Neighborhood: Indian, Chinese, American and Italian
for category, category_id in restaurant_categories.items():
    for neighborhood in new_df['Neighborhood'].unique():
        neighborhood_rows = new_df.loc[new_df['Neighborhood'] == neighborhood]
        borough = neighborhood_rows['Borough'].iloc[0]
        # The search is issued per Borough, so neighborhoods in the same Borough return the same venues
        venues = get_restaurants_by_category_id(category_id, borough)

        if venues is not None:
            for venue in venues:
                if not venue['categories']:
                    continue  # skip venues that carry no category
                if "postalCode" in venue["location"]:
                    postal_code = venue['location']['postalCode']
                else:
                    # Fall back to the first zip code listed for this neighborhood
                    postal_code = neighborhood_rows['ZipCode'].iloc[0]
                restaurants_df.loc[len(restaurants_df)] = [postal_code, borough, neighborhood,
                                                           venue['location']['lat'], venue['location']['lng'],
                                                           venue['name'], venue['categories'][0]['shortName']]


# Data Cleaning: map Foursquare sub-categories onto the four target cuisines
restaurants_df.replace(to_replace=['South Indian', 'Deli / Bodega', 'Chaat'], value='Indian', inplace=True)
restaurants_df.replace(to_replace=['Dim Sum', 'Cantonese', 'Shanghai', 'Taiwanese', 'Asian'], value='Chinese', inplace=True)
restaurants_df.replace(to_replace=['New American', 'Beer Garden', 'Cocktail', 'Burgers', 'Bar', 'Sandwiches',
                                   'Wine Bar', 'Diner', 'Bagels'],
                       value='American', inplace=True)
restaurants_df.replace(to_replace=['Gourmet', 'Pizza', 'Seafood'], value='Italian', inplace=True)


# Drop every venue whose category is not one of the four target cuisines
target_cuisines = ['Indian', 'Chinese', 'American', 'Italian']
drop_indexes = restaurants_df[~restaurants_df.Category.isin(target_cuisines)].index
restaurants_df.drop(drop_indexes, inplace=True)
restaurants_df['ZipCode'] = restaurants_df['ZipCode'].astype(int)


restaurants_df.drop_duplicates(subset='Name', inplace=True)
restaurants_df.to_csv("restaurants_data.csv", index=False)
print('shape:', restaurants_df.shape)
restaurants_df.head()

shape: (491, 7)


| | ZipCode | Borough | Neighborhood | Latitude | Longitude | Name | Category |
|---|---|---|---|---|---|---|---|
| 0 | 10009 | Manhattan | Chelsea and Clinton | 40.72751 | -73.979324 | Khiladi NYC | Indian |
| 1 | 10024 | Manhattan | Chelsea and Clinton | 40.786166 | -73.976414 | Alachi Masala | Indian |
| 2 | 10022 | Manhattan | Chelsea and Clinton | 40.75562 | -73.968666 | Amma | Indian |
| 3 | 10009 | Manhattan | Chelsea and Clinton | 40.727285 | -73.979602 | Desi Galli - Avenue B | Indian |
| 4 | 10016 | Manhattan | Chelsea and Clinton | 40.741393 | -73.983367 | Saravanaa Bhavan | Indian |


In [13]:
restaurants_df.tail()

| | ZipCode | Borough | Neighborhood | Latitude | Longitude | Name | Category |
|---|---|---|---|---|---|---|---|
| 4776 | 11238 | Brooklyn | Northwest Brooklyn | 40.681505 | -73.95577 | Golda | American |
| 4777 | 11211 | Brooklyn | Northwest Brooklyn | 40.710783 | -73.953704 | Lighthouse | American |
| 4778 | 11238 | Brooklyn | Northwest Brooklyn | 40.682846 | -73.963835 | Otway | American |
| 4779 | 11222 | Brooklyn | Northwest Brooklyn | 40.733427 | -73.958201 | Alameda | American |
| 4780 | 11238 | Brooklyn | Northwest Brooklyn | 40.68147 | -73.9558 | Hart's | American |
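
One of the factors listed in the Problem Description is the number of existing players per cuisine in each Borough. A minimal sketch of that tally follows. It uses only the ten rows visible in the `head()` and `tail()` outputs above, so the counts are illustrative rather than results for the full dataset.

```python
from collections import Counter

# (Borough, Category) pairs copied from the head() and tail() outputs
sample_restaurants = [
    ("Manhattan", "Indian"),    # Khiladi NYC
    ("Manhattan", "Indian"),    # Alachi Masala
    ("Manhattan", "Indian"),    # Amma
    ("Manhattan", "Indian"),    # Desi Galli - Avenue B
    ("Manhattan", "Indian"),    # Saravanaa Bhavan
    ("Brooklyn", "American"),   # Golda
    ("Brooklyn", "American"),   # Lighthouse
    ("Brooklyn", "American"),   # Otway
    ("Brooklyn", "American"),   # Alameda
    ("Brooklyn", "American"),   # Hart's
]

# Count how many venues fall under each (Borough, Category) combination
counts = Counter(sample_restaurants)
for (borough, category), n in sorted(counts.items()):
    print(f"{borough:<10} {category:<9} {n}")
```

On the full `restaurants_df`, grouping by `Borough` and `Category` with pandas (for example `restaurants_df.groupby(['Borough', 'Category']).size()`) would produce the same kind of tally.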


### Known Assumptions<a name="assumptions"></a>
This project operates within the known API rate limits imposed by Foursquare.
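
One common way to work within a rate limit is to retry a throttled request after an increasing delay. The sketch below illustrates that idea and is not the approach used in this notebook. `call_with_backoff` and `fetch` are hypothetical names, and treating HTTP 429 ("Too Many Requests") as the throttling signal is an assumption; the status code Foursquare actually returns should be checked against its documentation.

```python
import time


def call_with_backoff(fetch, max_attempts=4, base_delay=1.0,
                      retry_statuses=(429,), sleep=time.sleep):
    """Call fetch() until it is not throttled or the attempts run out.

    fetch is a hypothetical callable returning (status_code, payload).
    retry_statuses holds the codes treated as "rate limited" (an assumption).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, payload = fetch()
        if status not in retry_statuses:
            return status, payload
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return status, payload


# Simulated responses: throttled twice, then successful
responses = iter([(429, None), (429, None), (200, {"venues": []})])
delays = []  # record the requested delays instead of actually sleeping
status, payload = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(status, delays)
```

Passing `sleep` as a parameter keeps the example fast to run; real use would leave the default `time.sleep` in place.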