# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by Joseph Bekhit

## Table of contents
* [Introduction: Business Problem](#introduction1)
* [Data](#data1)
* [Methodology](#methodology1)
* [Results and Discussion](#results1)
* [Conclusion](#conclusion1)



## Introduction: Business Problem <a name="introduction1"></a>

New York is the financial capital of the USA, which makes it attractive to many investors who want to open a new business. In this project I try to find an optimal location for a new business, and the report is aimed at stakeholders interested in opening one in New York. I list the neighborhoods of New York, the categories of business that exist there, and the number of existing businesses of each category grouped by neighborhood. From these counts I build two lists: one of all New York neighborhoods with the business recommended for each, and one of all business categories with the neighborhood recommended for each. The second list is the more realistic of the two, because in practice a stakeholder usually already has a particular business in mind and is looking for the best place to open it.


## Data <a name="data1"></a>

I used https://cocl.us/new_york_dataset to get a JSON file of the New York neighborhoods, with the latitude and longitude of each one. I then used the Foursquare API to get the venues within a 500-meter radius of each neighborhood's coordinates (up to 100 venues per neighborhood). The data returned by Foursquare includes the name of each venue and its category.
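
To make the structure of the JSON file concrete, here is a minimal sketch that uses a hand-written entry in place of the real download. The sample values and the helper name `features_to_frame` are assumptions for illustration; only the keys (`properties.borough`, `properties.name`, `geometry.coordinates`) are taken from the loading code in this notebook.

```python
import pandas as pd

# Hand-written stand-in for one entry of the file's "features" list.
# The values are made up; the real file holds one entry per neighborhood.
sample_features = [
    {
        "properties": {"borough": "Bronx", "name": "Example Neighborhood"},
        "geometry": {"coordinates": [-73.85, 40.89]},  # [longitude, latitude]
    }
]


def features_to_frame(features):
    """Flatten GeoJSON-style features into one row per neighborhood."""
    rows = [
        {
            "Borough": feature["properties"]["borough"],
            "Neighborhood": feature["properties"]["name"],
            "Latitude": feature["geometry"]["coordinates"][1],
            "Longitude": feature["geometry"]["coordinates"][0],
        }
        for feature in features
    ]
    return pd.DataFrame(rows, columns=["Borough", "Neighborhood", "Latitude", "Longitude"])


sample_frame = features_to_frame(sample_features)
print(sample_frame)
```

Note that the coordinates are stored as longitude first, latitude second, which is why the indices are swapped when the columns are filled.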


In [None]:
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes 

In [None]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
from geopy.geocoders import Nominatim #convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library
print('Libraries imported.')

In [None]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

In [None]:
with open('newyork_data.json') as json_data:
    nigh = json.load(json_data)
neigh = nigh['features']
# define the dataframe columns
nighcolumn = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
# collect one row per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neigh:
    neighborhood_latlon = data['geometry']['coordinates']  # [longitude, latitude]
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': neighborhood_latlon[1],
                 'Longitude': neighborhood_latlon[0]})
neighborhoods = pd.DataFrame(rows, columns=nighcolumn)
print(neighborhoods.head())
print(neighborhoods.shape)

In [None]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]))

In [None]:
address = 'New York City, NY'
geolocator = Nominatim(user_agent="ny_explore")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

In [None]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)    
map_newyork

In [None]:
import os
# read the Foursquare credentials from the environment instead of hard-coding them
# (the environment variable names are an assumption; set them before running)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # my Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # my Foursquare Secret
VERSION = '20180605' # Foursquare API version
print('Foursquare credentials loaded.')

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    """Return a dataframe with one row per venue found near each neighborhood."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # build the Foursquare "explore" request for this neighborhood
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant fields of each venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    # flatten the per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']
    return nearby_venues
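
The function above reads `categories[0]` for every venue, which would raise an `IndexError` if a venue came back with an empty category list. The sketch below shows one defensive way to parse a response. It runs on a hand-written payload that only mimics the nesting the function relies on (`response` → `groups` → `items` → `venue`); the payload values and the helper name `parse_explore_items` are assumptions, not part of the Foursquare API.

```python
def parse_explore_items(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # fall back to None when a venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Made-up payload with one categorized and one uncategorized venue.
mock_payload = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Corner Bakery",
                               "location": {"lat": 40.7, "lng": -74.0},
                               "categories": [{"name": "Bakery"}]}},
                    {"venue": {"name": "Unnamed Spot",
                               "location": {"lat": 40.8, "lng": -73.9},
                               "categories": []}},
                ]
            }
        ]
    }
}

parsed = parse_explore_items(mock_payload)
print(parsed)
```

Venues without a category could then be dropped or labeled before counting, depending on how they should be treated in the analysis.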

## Methodology <a name="methodology1"></a>

In this project we focus on identifying which businesses exist in each area of New York, grouped by business category.
In the first step we collect the required data: the location and type (category) of every business in every neighborhood of New York, following the Foursquare categorization.
The second step is to calculate and explore the count of each business category across the neighborhoods of New York. The stakeholder can then inspect the results and either choose, for a given neighborhood, a business category whose count is zero or as low as possible, or choose the neighborhood with the lowest count of the business category he or she is interested in.
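
As a small illustration of the counting step, the sketch below builds a neighborhood-by-category count table with `pd.crosstab` on made-up data. The toy venues and the name `toy_venues` are assumptions; the column names match the ones produced by `getNearbyVenues`.

```python
import pandas as pd

# Made-up venues: three neighborhoods, three categories.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "Venue Category": ["Pizza Place", "Pizza Place", "Gym", "Gym", "Bakery", "Pizza Place"],
})

# Rows are neighborhoods, columns are categories, cells are venue counts.
counts = pd.crosstab(toy_venues["Neighborhood"], toy_venues["Venue Category"])
print(counts)

# Number of pizza places in neighborhood A.
pizza_in_a = counts.loc["A", "Pizza Place"]
```

A zero in this table marks a category that is absent from a neighborhood, which is the signal the rest of the analysis looks for.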

In [None]:
newyork_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude'])

In [None]:
print(newyork_venues.shape)
newyork_venues.head()

In [None]:
newyork_venues.groupby('Neighborhood').count()

In [None]:
print('There are {} unique categories.'.format(len(newyork_venues['Venue Category'].unique())))

In [None]:
newyork_onehot = pd.get_dummies(newyork_venues[['Venue Category']], prefix="", prefix_sep="")
newyork_onehot['Neighborhood'] = newyork_venues['Neighborhood'] 
fixed_columns = [newyork_onehot.columns[-1]] + list(newyork_onehot.columns[:-1])
newyork_onehot = newyork_onehot[fixed_columns]
newyork_onehot.head()

In [None]:
newyork_onehot.shape

In [None]:
newyork_grouped = newyork_onehot.groupby('Neighborhood').sum().reset_index()
newyork_grouped.head()

In [None]:
newyork_grouped.shape

In [None]:
num_top_venues =10 #len(newyork_venues['Venue Category'].unique())
for hood in newyork_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp =newyork_grouped[newyork_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
newyork_onehot2 = pd.get_dummies(newyork_venues[['Neighborhood']], prefix="", prefix_sep="")
newyork_onehot2['Venue Category'] = newyork_venues['Venue Category'] 
fixed_columns = [newyork_onehot2.columns[-1]] + list(newyork_onehot2.columns[:-1])
newyork_onehot2 = newyork_onehot2[fixed_columns]
newyork_onehot2.head()

In [None]:
newyork_onehot2.shape

In [None]:
newyork_grouped2 = newyork_onehot2.groupby('Venue Category').sum().reset_index()
newyork_grouped2.head()

In [None]:
newyork_grouped2.shape

In [None]:
num_bottom_venues =10 #len(newyork_venues['Venue Category'].unique())
for category in newyork_grouped2['Venue Category']:
    print("----"+category+"----")
    temp =newyork_grouped2[newyork_grouped2['Venue Category'] == category].T.reset_index()
    temp.columns = ['Neighborhood','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    print(temp.sort_values('freq', ascending=True).reset_index(drop=True).head(num_bottom_venues))
    print('\n')

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
def return_least_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=True)    
    return row_categories_sorted.index.values[0:num_top_venues]
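
In any single neighborhood many categories are likely to have a count of zero, so the ascending sort in `return_least_common_venues` picks its entries from a larger set of ties. A hedged alternative, sketched on a made-up row, is to list every category whose count is zero. The helper name `absent_categories` and the sample row are assumptions for illustration.

```python
import pandas as pd

# Made-up row in the same layout as newyork_grouped:
# the neighborhood name first, then one count per category.
toy_row = pd.Series({"Neighborhood": "A", "Bakery": 0, "Gym": 1, "Pizza Place": 2, "Zoo": 0})


def absent_categories(row):
    """List every category whose count in this row is zero."""
    category_counts = row.iloc[1:].astype(int)
    return category_counts[category_counts == 0].index.tolist()


absent = absent_categories(toy_row)
print(absent)
```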

In [None]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = newyork_grouped['Neighborhood']
for ind in np.arange(newyork_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(newyork_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted.head()

In [None]:
num_bottom_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of bottom venues
columns = ['Neighborhood']
for ind in np.arange(num_bottom_venues):
    try:
        columns.append('{}{} Least Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Least Common Venue'.format(ind+1))
# create a new dataframe (this replaces the most-common table built in the previous cell)
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = newyork_grouped['Neighborhood']
for ind in np.arange(newyork_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_least_common_venues(newyork_grouped.iloc[ind, :], num_bottom_venues)
neighborhoods_venues_sorted.head()

In [None]:
kclusters = 5
newyork_grouped_clustering = newyork_grouped.drop('Neighborhood', axis=1)
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(newyork_grouped_clustering)
kmeans.labels_[0:10] 
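
The cell above fits k-means on raw venue counts. With Euclidean distance, two neighborhoods with a similar mix of businesses can still end up far apart if one simply has more venues. One common alternative, sketched here on made-up counts, is to convert each row to category proportions before clustering. The toy data and the names `toy_counts` and `toy_freq` are assumptions for illustration and are not part of this notebook's analysis.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up counts: A and B are restaurant-heavy, C and D are park-heavy,
# but the total number of venues differs a lot within each pair.
toy_counts = pd.DataFrame(
    {"Restaurant": [40, 4, 1, 10], "Park": [2, 0, 5, 50]},
    index=["A", "B", "C", "D"],
)

# Divide each row by its total so every neighborhood sums to 1.
# Every row here has at least one venue, so the totals are positive.
toy_freq = toy_counts.div(toy_counts.sum(axis=1), axis=0)

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_freq)
toy_labels = toy_kmeans.labels_
print(dict(zip(toy_counts.index, toy_labels)))
```

On the proportions, A groups with B and C groups with D, regardless of how many venues each neighborhood has.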

In [None]:
neighborhoods_venues_sorted.insert(0, 'Cluster', kmeans.labels_)
newyork_merged = neighborhoods
newyork_merged = newyork_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [None]:
newyork_merged.head()
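
Because `newyork_merged` starts from the full list of neighborhoods, any neighborhood for which no venues were returned has no cluster label after the join and shows up as `NaN`. The next cell folds such rows into cluster 0. An alternative, sketched here on made-up data, is to keep them apart with a sentinel value. The toy frames and the choice of `-1` are assumptions; the map cell would need to handle the sentinel separately if this approach were adopted.

```python
import pandas as pd

# Made-up data: neighborhood C has no cluster label.
toy_neighborhoods = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
toy_clusters = pd.DataFrame({"Neighborhood": ["A", "B"], "Cluster": [1, 0]})

# Same join pattern as above: rows without a match get NaN.
toy_merged = toy_neighborhoods.join(toy_clusters.set_index("Neighborhood"), on="Neighborhood")
unlabeled = toy_merged.loc[toy_merged["Cluster"].isna(), "Neighborhood"].tolist()

# Mark unlabeled neighborhoods with -1 instead of merging them into a real cluster.
toy_merged["Cluster"] = toy_merged["Cluster"].fillna(-1).astype(int)
print(toy_merged)
```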

In [None]:
# neighborhoods without any returned venues have no cluster label (NaN) after the join;
# they are assigned to cluster 0 here
newyork_merged['Cluster'] = newyork_merged['Cluster'].fillna(0).astype(int)

In [None]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# one color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]
for lat, lon, poi, cluster in zip(newyork_merged['Latitude'], newyork_merged['Longitude'], newyork_merged['Neighborhood'], newyork_merged['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters

In [None]:
# print the neighborhoods of each cluster in turn (cluster label 0 is reported as cluster 1, and so on)
cluster_columns = newyork_merged.columns[[1] + list(range(5, newyork_merged.shape[1]))]
for cluster in range(kclusters):
    print(newyork_merged.loc[newyork_merged['Cluster'] == cluster, cluster_columns])
    print("#######################END OF CLUSTER {}##################################".format(cluster + 1))

## Results and Discussion <a name="results1"></a>

We found that many business categories are absent from many neighborhoods while being well represented in others. These are the categories we recommend opening in the neighborhoods that lack them. If you are a new investor without a specific business category in mind, you can choose a category that does not yet exist in your preferred neighborhood. If you already have a specific business category, you can choose a neighborhood in which that category does not yet exist.
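
The two ways of reading the results can be sketched on a small, made-up count table. The table and the names `best_neighborhoods` and `missing_categories` are assumptions for illustration; in the notebook the counts would come from `newyork_grouped`.

```python
import pandas as pd

# Made-up counts: rows are neighborhoods, columns are business categories.
toy_counts = pd.DataFrame(
    {"Bakery": [0, 1, 0], "Gym": [1, 1, 0], "Pizza Place": [2, 0, 1]},
    index=pd.Index(["A", "B", "C"], name="Neighborhood"),
)

# Investor with a fixed category: every neighborhood tied at the lowest count.
best_neighborhoods = {}
for category in toy_counts.columns:
    column = toy_counts[category]
    best_neighborhoods[category] = column[column == column.min()].index.tolist()

# Investor with a fixed neighborhood: every category that is absent there.
missing_categories = {}
for hood in toy_counts.index:
    row = toy_counts.loc[hood]
    missing_categories[hood] = row[row == 0].index.tolist()

print(best_neighborhoods)
print(missing_categories)
```

Keeping all tied neighborhoods, rather than only the first one, leaves the final choice to the stakeholder.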


## Conclusion <a name="conclusion1"></a>

Although New York is an old and crowded city, it is still attractive to investors, because many business categories that exist in some neighborhoods are entirely absent from others.
