<h1 align='center'>Capstone Project - Introduction and Data Collection </h1>

## Business Problem

The project aims to find the **optimal type of restaurant to establish in a particular location**. The model should be applicable to a variety of restaurants, coffee shops and bars, and it should also find locations where each type of business has the potential to be launched. It uses **demographic data and the presence of other similar businesses** in the vicinity to make a suggestion.

The model is built using data from **New York City**, but it should be applicable to other major cities, or at least to other cities in the US, provided the required data is available. A variety of data will be used to train a machine learning model that could possibly find patterns which are not identifiable manually.

## Data

The model takes into consideration the **five boroughs of New York City** and uses neighborhoods based on the Neighborhood Tabulation Areas (NTAs), which are aggregations of the census tracts of New York.

For our model, the following data will be used as prediction parameters. All parameters are collected on a per-neighborhood basis:

To find the size of customer base:
* **Area Population density**
* **Daytime Population** - The commuter-adjusted population of an area during working hours

To understand the characteristics of the customers:
* **Median Income** of the resident population
* **Median Age** of the Population

To find the quality of neighborhood and thoroughfare:
* **Median Rent**
* **Median Property Value**
* **Landmarks and attractions** within the Area
* **Average Daily Traffic** in the Area

All these data will be used to predict the following:
* The type of **Restaurant** to establish in a location

### Data Collection Sources

The neighborhood data is taken from the New York City open datasets. The shapefiles provided give the name, area code, location and shape of the NTAs. [Neighborhood Data](https://data.cityofnewyork.us/City-Government/Neighborhood-Tabulation-Areas/cpf4-rkhq)
 
The demographic data is obtained from the publicly available American Community Survey data published by the US Census Bureau. [Demographic Data](https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml)
 
The nearby location and venue data are obtained using the [Foursquare API](https://developer.foursquare.com/).

#### Note: Since the datasets are large, they have been downloaded and are accessed from the local system. The data can also be found in the GitHub repository.

## Demographic Data Collection

This file covers the collection and wrangling of the demographic data. The venue data collection using Foursquare is carried out in the next file.

In [113]:
import pandas as pd
import numpy as np
import geopandas as gp
import folium
import re

### Neighborhood Geo Data

In [114]:
gNeigh=gp.read_file("Data/Zones/NTA map.geojson")

In [115]:
gNeigh=gNeigh[['ntacode','ntaname','boroname','geometry']].sort_values('ntacode').reset_index(drop=True)
gNeigh['lat']=gNeigh.centroid.y
gNeigh['long']=gNeigh.centroid.x
gNeigh.rename({'ntacode':'code','ntaname':'name','boroname':'borough'},axis=1,inplace=True)
print(gNeigh.shape)
gNeigh.head()

(195, 6)


Unnamed: 0,code,name,borough,geometry,lat,long
0,BK09,Brooklyn Heights-Cobble Hill,Brooklyn,"(POLYGON ((-73.99236367043254 40.689690123777,...",40.695469,-73.994871
1,BK17,Sheepshead Bay-Gerritsen Beach-Manhattan Beach,Brooklyn,(POLYGON ((-73.91809256480843 40.5865703350047...,40.5883,-73.941511
2,BK19,Brighton Beach,Brooklyn,(POLYGON ((-73.96034953585246 40.5873062855713...,40.580922,-73.961217
3,BK21,Seagate-Coney Island,Brooklyn,(POLYGON ((-73.97459000582634 40.5831388207588...,40.57648,-73.991231
4,BK23,West Brighton,Brooklyn,(POLYGON ((-73.9688899587795 40.57526123899416...,40.579088,-73.973391


### Area of Neighborhoods calculation

In [116]:
import pyproj
from shapely.ops import transform
from functools import partial

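# Project each polygon from WGS84 (EPSG:4326) to Web Mercator (EPSG:3857)
# and record its area in square kilometres.
# Caveat: Web Mercator does not preserve area, so these values are inflated
# (by roughly 1.7x at New York's latitude).
proj = partial(pyproj.transform, pyproj.Proj(init='epsg:4326'),
               pyproj.Proj(init='epsg:3857'))
areas=[]
for s in gNeigh.geometry:
    areas.append(transform(proj, s).area/1000000)
gNeigh['area']=areas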

In [117]:
gNeigh.head()

Unnamed: 0,code,name,borough,geometry,lat,long,area
0,BK09,Brooklyn Heights-Cobble Hill,Brooklyn,"(POLYGON ((-73.99236367043254 40.689690123777,...",40.695469,-73.994871,1.615327
1,BK17,Sheepshead Bay-Gerritsen Beach-Manhattan Beach,Brooklyn,(POLYGON ((-73.91809256480843 40.5865703350047...,40.5883,-73.941511,10.214922
2,BK19,Brighton Beach,Brooklyn,(POLYGON ((-73.96034953585246 40.5873062855713...,40.580922,-73.961217,2.770061
3,BK21,Seagate-Coney Island,Brooklyn,(POLYGON ((-73.97459000582634 40.5831388207588...,40.57648,-73.991231,6.242851
4,BK23,West Brighton,Brooklyn,(POLYGON ((-73.9688899587795 40.57526123899416...,40.579088,-73.973391,1.409979
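
The areas above are measured in Web Mercator (EPSG:3857), a projection that stretches lengths by about 1/cos(latitude) and therefore areas by about 1/cos²(latitude). The following is a rough sketch of how large that effect is for New York, using only the standard library. The helper name `mercator_area_factor` is illustrative and not part of the notebook, and the formula is a spherical approximation that should be adequate for areas as small as a neighborhood.

```python
import math

def mercator_area_factor(lat_deg):
    """Approximate factor by which Web Mercator inflates areas at a latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2

# Centroid latitude and Web Mercator area (sq. km) of BK09, from the table above
lat, mercator_area = 40.695469, 1.615327

factor = mercator_area_factor(lat)
approx_ground_area = mercator_area / factor

print(f"inflation factor: {factor:.2f}")
print(f"approximate ground area: {approx_ground_area:.2f} sq. km")
```

Because all NTAs lie within a narrow band of latitude, the inflation is nearly uniform, so the relative ordering of neighborhoods by area or density is largely unaffected. If absolute values matter, one option would be to reproject to a local projected CRS before measuring, for example `gNeigh.to_crs(epsg=32618).area` (UTM zone 18N).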


### Population Data

In [118]:
pop=pd.read_excel("Data/Demographic/NYPopNTA.xlsx",usecols=range(0,121))
cols=list(range(1,5))+[119]
pop=pop.iloc[:,cols].sort_values('GeoID').reset_index(drop=True)
pop.rename({'Pop_1E':'population','MdAgeE':'medAge','GeoID':'code','GeoName':'name','Borough':'borough'},axis=1,inplace=True)

In [119]:
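# Population density (people per sq. km).
# Row i of pop is matched with row i of gNeigh, which relies on both frames
# being sorted by code and having the same number of rows.
pop['popDensity']=[int(row['population']/gNeigh.loc[i,'area']) for i,row in pop.iterrows()]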

print(pop.shape)
pop.head()

(195, 6)


Unnamed: 0,name,code,borough,population,medAge,popDensity
0,Brooklyn Heights-Cobble Hill,BK09,Brooklyn,24212,37.1,14988
1,Sheepshead Bay-Gerritsen Beach-Manhattan Beach,BK17,Brooklyn,67681,43.9,6625
2,Brighton Beach,BK19,Brooklyn,35811,44.3,12927
3,Seagate-Coney Island,BK21,Brooklyn,31132,39.0,4986
4,West Brighton,BK23,Brooklyn,16436,58.0,11656
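
The density calculation above pairs rows by position. A possible alternative, sketched here on toy stand-ins for `gNeigh` and `pop`, is to join on the neighborhood code so that the result does not depend on row order. The frame names `areas` and `people` are hypothetical; the figures are copied from the tables above.

```python
import pandas as pd

# Toy stand-ins for gNeigh and pop, deliberately stored in different row orders
areas = pd.DataFrame({"code": ["BK17", "BK09"], "area": [10.214922, 1.615327]})
people = pd.DataFrame({"code": ["BK09", "BK17"], "population": [24212, 67681]})

# Join on the neighborhood code so each population meets its own area;
# validate makes pandas raise an error if a code appears more than once
merged = people.merge(areas, on="code", how="left", validate="one_to_one")

# Truncate to whole people per sq. km, as int() does in the notebook
merged["popDensity"] = (merged["population"] / merged["area"]).astype(int)

print(merged[["code", "popDensity"]])
```

With these inputs the densities agree with the `popDensity` column shown above for BK09 and BK17.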


### Commuter Adjusted Daytime Population 

In [120]:
dayPop=pd.read_excel("Data/Demographic/daytimePop.xls",usecols="G,H,R",names=['borough','total','dayChange'],skiprows=6).dropna()
dayPop=dayPop.iloc[[1552,1676,1590,1448,1687],[0,2]].reset_index(drop=True)
dayPop['borough']=dayPop['borough'].apply(lambda x:str.split(x)[0])
dayPop['borough'].replace('Staten','Staten Island',inplace=True)
dayPop

Unnamed: 0,borough,dayChange
0,Brooklyn,-12.0
1,Queens,-16.1
2,Manhattan,94.7
3,Bronx,-11.7
4,Staten Island,-17.7


In [121]:
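# Scale each neighborhood's resident population by the daytime change (in percent)
# of its borough. The same borough-wide rate is applied to every neighborhood.
dayPopulation=[]
for i,row in pop.iterrows():
    change=dayPop.loc[dayPop['borough']==row['borough'],'dayChange'].values[0]
    dayPopulation.append(row['population']+round(row['population']*(change/100),0))
pop['dayPop']=dayPopulation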

In [122]:
pop.head()

Unnamed: 0,name,code,borough,population,medAge,popDensity,dayPop
0,Brooklyn Heights-Cobble Hill,BK09,Brooklyn,24212,37.1,14988,21307.0
1,Sheepshead Bay-Gerritsen Beach-Manhattan Beach,BK17,Brooklyn,67681,43.9,6625,59559.0
2,Brighton Beach,BK19,Brooklyn,35811,44.3,12927,31514.0
3,Seagate-Coney Island,BK21,Brooklyn,31132,39.0,4986,27396.0
4,West Brighton,BK23,Brooklyn,16436,58.0,11656,14464.0
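
The same adjustment can be written without an explicit loop. This is a minimal sketch on toy data, assuming the borough-level percentages are held in a Series keyed by borough name. `day_change` and `pop_demo` are hypothetical names; the values come from the tables above.

```python
import pandas as pd

# Toy stand-ins for dayPop and pop
day_change = pd.Series({"Brooklyn": -12.0, "Manhattan": 94.7})
pop_demo = pd.DataFrame({
    "code": ["BK09", "BK17"],
    "borough": ["Brooklyn", "Brooklyn"],
    "population": [24212, 67681],
})

# Look up each row's borough-level percentage change, then apply it
pct = pop_demo["borough"].map(day_change)
change = (pop_demo["population"] * pct / 100).round()
pop_demo["dayPop"] = pop_demo["population"] + change

print(pop_demo)
```

Since the commuter adjustment is only available per borough here, every neighborhood in a borough is scaled by the same rate, so `dayPop` should be read as an approximation rather than a measured neighborhood-level figure.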


### Economic Data

In [123]:
cols="B:D,IX:LC"
income=pd.read_excel("Data/Demographic/NYEconNTA.xlsx",usecols=cols)
income.drop([x for x in income.columns.values if re.match(".*[MPCZ]$",x)],axis=1,inplace=True)

In [124]:
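# Collapse the ten income-bracket columns (positions 3 to 12) into five bands
income['veryLow']=income.iloc[:,3:6].sum(axis=1)
income['low']=income.iloc[:,6:8].sum(axis=1)
income['middle']=income.iloc[:,8:10].sum(axis=1)
income['high']=income.iloc[:,10:12].sum(axis=1)
income['veryHigh']=income.iloc[:,12]
# Drop the original bracket columns now that they have been aggregated
income.drop(income.columns[3:13],axis=1,inplace=True)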
income.sort_values('GeoID',inplace=True)
income.rename({'MdHHIncE':'medIncome','MnHHIncE':'meanIncome','GeoID':'code','GeoName':'name','Borough':'borough'},
              axis=1,inplace=True)
income.reset_index(drop=True,inplace=True)

In [125]:
print(income.shape)
income.head()

(195, 10)


Unnamed: 0,name,code,borough,medIncome,meanIncome,veryLow,low,middle,high,veryHigh
0,Brooklyn Heights-Cobble Hill,BK09,Brooklyn,125817.0,205275.0,1279,1201,2008,3355,3272
1,Sheepshead Bay-Gerritsen Beach-Manhattan Beach,BK17,Brooklyn,57150.0,79613.0,6637,5298,6797,5785,1633
2,Brighton Beach,BK19,Brooklyn,36802.0,63703.0,5762,2697,3301,2006,791
3,Seagate-Coney Island,BK21,Brooklyn,27345.0,49358.0,5381,2285,2158,1115,297
4,West Brighton,BK23,Brooklyn,40316.0,58752.0,3169,1790,2212,955,275
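
The column filter used when loading the economic data keeps only columns whose names do not end in `M`, `P`, `C` or `Z`. The short sketch below shows the effect of that pattern on a handful of column names. All names except `GeoName`, `GeoID`, `Borough` and `MdHHIncE` are hypothetical. Reading the suffixes as estimate (`E`), margin of error (`M`), coefficient of variation (`C`), percent (`P`) and percent margin of error (`Z`) is an assumption about the file layout, not something stated in this notebook.

```python
import re

# Column names following the assumed E/M/C/P/Z suffix convention
columns = ["GeoName", "GeoID", "Borough",
           "MdHHIncE", "MdHHIncM", "MdHHIncC", "MdHHIncP", "MdHHIncZ"]

# Same pattern as in the notebook: drop anything ending in M, P, C or Z
kept = [c for c in columns if not re.match(".*[MPCZ]$", c)]

print(kept)
```

The pattern is case-sensitive, which is why identifier columns such as `GeoName` survive the filter.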


### Housing Data

In [126]:
cols="B:D,LR,RB"
rent=pd.read_excel("Data/Demographic/NYHousNTA.xlsx",usecols=cols)
rent.sort_values('GeoID',inplace=True)
rent.rename({'MdVlE':'medValue','MdGRE':'medRent','GeoID':'code','GeoName':'name','Borough':'borough'},
              axis=1,inplace=True)
rent.reset_index(drop=True,inplace=True)

In [127]:
print(rent.shape)
rent.head()

(195, 5)


Unnamed: 0,name,code,borough,medValue,medRent
0,Brooklyn Heights-Cobble Hill,BK09,Brooklyn,856535.0,2278.0
1,Sheepshead Bay-Gerritsen Beach-Manhattan Beach,BK17,Brooklyn,476965.0,1180.0
2,Brighton Beach,BK19,Brooklyn,561046.0,1194.0
3,Seagate-Coney Island,BK21,Brooklyn,457834.0,676.0
4,West Brighton,BK23,Brooklyn,311186.0,905.0


### Traffic Geo Data

In [128]:
gTraffic=gp.read_file("Data/Traffic/AADT.shp")
gTraffic.crs={'init': 'epsg:4326'}

In [129]:
gTraffic['AADT']=gTraffic['AADT'].replace(0,np.NaN)
gTraffic.dropna(inplace=True)
gTraffic.reset_index(drop=True,inplace=True)
gTraffic.rename({'TDV_ROUTE':'road','Type':'type'},axis=1,inplace=True)
gTraffic.head()

Unnamed: 0,road,AADT,type,geometry
0,I87 NB to Grand Concourse,9098.0,Ramp,LINESTRING (-73.93120499291497 40.811187005174...
1,BARTOW AVE,18317.0,Road,LINESTRING (-73.83174595168754 40.868683967375...
2,E TREMONT AVE,15891.0,Road,LINESTRING (-73.87297299845201 40.839735862949...
3,E TREMONT AVE,21551.0,Road,LINESTRING (-73.86477339345717 40.840984407520...
4,"I87, MAJOR DEEGAN EXP",102213.0,Route,LINESTRING (-73.91907386627308 40.804683954394...


### Consolidate Traffic GeoData from Roads into Areas

In [130]:
nTraffic=gp.tools.sjoin(gNeigh,gTraffic,op='intersects')
nTraffic.drop('index_right',axis=1,inplace=True)
nTraffic.sort_values('code',inplace=True)
nTraffic=nTraffic[nTraffic['type']!="Route"]
nTraffic.reset_index(drop=True,inplace=True)
nTraffic.head()

Unnamed: 0,code,name,borough,geometry,lat,long,area,road,AADT,type
0,BK09,Brooklyn Heights-Cobble Hill,Brooklyn,"(POLYGON ((-73.99236367043254 40.689690123777,...",40.695469,-73.994871,1.615327,COURT ST,11136.0,Road
1,BK09,Brooklyn Heights-Cobble Hill,Brooklyn,"(POLYGON ((-73.99236367043254 40.689690123777,...",40.695469,-73.994871,1.615327,CADMAN PLAZA W,10408.0,Road
2,BK09,Brooklyn Heights-Cobble Hill,Brooklyn,"(POLYGON ((-73.99236367043254 40.689690123777,...",40.695469,-73.994871,1.615327,FURMAN ST,6131.0,Road
3,BK09,Brooklyn Heights-Cobble Hill,Brooklyn,"(POLYGON ((-73.99236367043254 40.689690123777,...",40.695469,-73.994871,1.615327,TILLARY ST,14123.0,Road
4,BK09,Brooklyn Heights-Cobble Hill,Brooklyn,"(POLYGON ((-73.99236367043254 40.689690123777,...",40.695469,-73.994871,1.615327,HENRY ST,5126.0,Road


In [131]:
traffic=nTraffic.groupby('name').agg({"code":lambda x:np.unique(x),
                                    'AADT':lambda x:np.mean(x),
                                    'borough':lambda x:np.unique(x),
                                    'lat':lambda x:np.unique(x),
                                    'long':lambda x:np.unique(x)
                                    }).sort_values('code')
traffic=gp.GeoDataFrame(traffic,crs={'init': 'epsg:4326'},geometry=list(gNeigh.geometry)).reset_index(drop=True)
traffic.loc[194,'AADT']=np.NaN
print(traffic.shape)
traffic.head()

(195, 6)


Unnamed: 0,code,AADT,borough,lat,long,geometry
0,BK09,11051.190476,Brooklyn,40.695469,-73.994871,"(POLYGON ((-73.99236367043254 40.689690123777,..."
1,BK17,12013.8,Brooklyn,40.5883,-73.941511,(POLYGON ((-73.91809256480843 40.5865703350047...
2,BK19,12252.733333,Brooklyn,40.580922,-73.961217,(POLYGON ((-73.96034953585246 40.5873062855713...
3,BK21,15428.3,Brooklyn,40.57648,-73.991231,(POLYGON ((-73.97459000582634 40.5831388207588...
4,BK23,8256.6,Brooklyn,40.579088,-73.973391,(POLYGON ((-73.9688899587795 40.57526123899416...
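
The aggregation step can also be sketched with a plain group-by followed by a left join. That keeps every neighborhood in the result even when no traffic segment intersects it, without relying on row counts matching. The frames below are toy stand-ins with made-up counts for BK17, and `neigh`, `segments` and `traffic_demo` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins: three neighborhoods, traffic counts for only two of them
neigh = pd.DataFrame({"code": ["BK09", "BK17", "BK99"]})
segments = pd.DataFrame({
    "code": ["BK09", "BK09", "BK17", "BK17"],
    "AADT": [11136.0, 10408.0, 9000.0, 15000.0],
    "type": ["Road", "Road", "Road", "Route"],
})

# Exclude segments of type "Route", as the notebook does, then average per code
local = segments[segments["type"] != "Route"]
mean_aadt = local.groupby("code", as_index=False)["AADT"].mean()

# A left join keeps every neighborhood; those without counts get NaN
traffic_demo = neigh.merge(mean_aadt, on="code", how="left")

print(traffic_demo)
```

Grouping by `code` rather than `name` avoids depending on neighborhood names being unique.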


### Merge all data into a single Dataframe

In [141]:
data=gNeigh[['code','name','lat','long','area','borough']]
data=data.merge(pop).merge(rent).merge(income).merge(traffic)

In [142]:
df=gp.GeoDataFrame(data,crs={'init': 'epsg:4326'},geometry=list(gNeigh.geometry))
# Drop the special-purpose areas whose codes contain 98 or 99 (e.g. parks, cemeteries, airports)
toDrop=list(df[df['code'].str.contains("99")|df['code'].str.contains("98")].index.values)
df=df.drop(toDrop).reset_index(drop=True)
print(df.shape)
df.head()

(188, 21)


Unnamed: 0,code,name,lat,long,area,borough,population,medAge,popDensity,dayPop,...,medRent,medIncome,meanIncome,veryLow,low,middle,high,veryHigh,AADT,geometry
0,BK09,Brooklyn Heights-Cobble Hill,40.695469,-73.994871,1.615327,Brooklyn,24212,37.1,14988,21307.0,...,2278.0,125817.0,205275.0,1279,1201,2008,3355,3272,11051.190476,"(POLYGON ((-73.99236367043254 40.689690123777,..."
1,BK17,Sheepshead Bay-Gerritsen Beach-Manhattan Beach,40.5883,-73.941511,10.214922,Brooklyn,67681,43.9,6625,59559.0,...,1180.0,57150.0,79613.0,6637,5298,6797,5785,1633,12013.8,(POLYGON ((-73.91809256480843 40.5865703350047...
2,BK19,Brighton Beach,40.580922,-73.961217,2.770061,Brooklyn,35811,44.3,12927,31514.0,...,1194.0,36802.0,63703.0,5762,2697,3301,2006,791,12252.733333,(POLYGON ((-73.96034953585246 40.5873062855713...
3,BK21,Seagate-Coney Island,40.57648,-73.991231,6.242851,Brooklyn,31132,39.0,4986,27396.0,...,676.0,27345.0,49358.0,5381,2285,2158,1115,297,15428.3,(POLYGON ((-73.97459000582634 40.5831388207588...
4,BK23,West Brighton,40.579088,-73.973391,1.409979,Brooklyn,16436,58.0,11656,14464.0,...,905.0,40316.0,58752.0,3169,1790,2212,955,275,8256.6,(POLYGON ((-73.9688899587795 40.57526123899416...
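
When `merge` is called without `on`, pandas joins on every column the two frames share. In this notebook that includes floating-point columns such as `lat` and `long`. A more defensive variant, sketched on toy data, names the key explicitly and asks pandas to check that it is unique. The frames `left` and `right` are hypothetical, and the `BK99` row holds placeholder values.

```python
import pandas as pd

left = pd.DataFrame({"code": ["BK09", "BK17", "BK99"],
                     "population": [24212, 67681, 0]})
right = pd.DataFrame({"code": ["BK09", "BK17", "BK99"],
                      "medRent": [2278.0, 1180.0, None]})

# Explicit key plus validate: pandas raises MergeError if a code is duplicated
combined = left.merge(right, on="code", how="inner", validate="one_to_one")

# Drop codes ending in 98 or 99, a slightly stricter form of the notebook's filter
special = combined["code"].str[-2:].isin(["98", "99"])
combined = combined[~special].reset_index(drop=True)

print(combined)
```

For four-character codes such as `BK09`, matching on the last two characters should select the same rows as the `str.contains` filter used in the notebook.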


### Persist the Data locally

In [143]:
df.to_file("Data/Cleaned/data.shp")

The data analysis is continued in the next [file](https://github.com/gokulmuthiah/Coursera_Capstone/blob/master/Capstone-Data-Preparation.ipynb).