## Introduction
### 1.1 Background
New York City is a very diverse and highly populated city in the United States. In this crowded city, some boroughs and neighborhoods have high crime rates, while others are considered relatively safe. In addition, across the city's 5 boroughs and more than 30 neighborhoods, restaurants and coffee shops are the top 2 venue types. Each year, thousands of people look to open a new restaurant in the city. A high crime rate affects a restaurant business negatively, and so does a high concentration of similar restaurants nearby. It is therefore advantageous to know the crime rate and the existing venues when picking a location for a new restaurant.
### 1.2 Problem
Data that might help determine future restaurant profit include location, neighborhood crime rate, and the types of existing restaurants nearby. This project aims to find the best neighborhood in New York City in which to open a restaurant.
### 1.3 Interest
People who want to open a new restaurant in New York City would be interested in this analysis.

## Data
### 2.1 Data sources
This project uses two data sources. The first is the New York City crime data provided by NYC OpenData (https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i), named NYPD Complaint Data Historic. This dataset includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 to the end of 2017. In total, the dataset contains 35 columns and 7.31M rows, where each row represents an individual complaint. The second data source is the Foursquare API, which is used to explore neighborhoods in New York City.
### 2.2 Data cleaning
Because of the file size, the NYPD crime data is pre-cleaned before being imported into the notebook: only data from 2016-2017 is included and 7 relevant columns are kept. The data returned by the Foursquare API is used unchanged.
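
The pre-cleaning step itself is not shown in this notebook. A minimal sketch of what it could look like follows, assuming the seven kept columns are the ones used later in Part 2 and that complaint dates are formatted as MM/DD/YYYY. The toy rows and the extra `PREM_TYP_DESC` column are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the raw NYPD export (the real file has 35 columns).
raw = pd.DataFrame({
    "CMPLNT_FR_DT": ["12/31/2015", "01/01/2016", "06/15/2017", "01/01/2018"],
    "OFNS_DESC": ["ROBBERY", "PETIT LARCENY", "HARASSMENT", "BURGLARY"],
    "CRM_ATPT_CPTD_CD": ["COMPLETED"] * 4,
    "LAW_CAT_CD": ["FELONY", "MISDEMEANOR", "VIOLATION", "FELONY"],
    "BORO_NM": ["BRONX", "QUEENS", "BROOKLYN", "MANHATTAN"],
    "Latitude": [40.85, 40.73, 40.65, 40.78],
    "Longitude": [-73.88, -73.80, -73.95, -73.97],
    "PREM_TYP_DESC": ["STREET"] * 4,  # example of a column that gets dropped
})

# Assumed list of the seven columns that survive the cleaning step.
keep = ["CMPLNT_FR_DT", "OFNS_DESC", "CRM_ATPT_CPTD_CD", "LAW_CAT_CD",
        "BORO_NM", "Latitude", "Longitude"]

# Parse the complaint date; unparseable values become NaT and are filtered out.
dates = pd.to_datetime(raw["CMPLNT_FR_DT"], format="%m/%d/%Y", errors="coerce")

# Keep complaints from 2016 and 2017 only, restricted to the seven columns.
cleaned = raw.loc[dates.dt.year.isin([2016, 2017]), keep].reset_index(drop=True)
print(cleaned)
```

The cleaned frame could then be written out with `cleaned.to_csv('NYPD.csv', index=False)` so that the Part 2 cells can load it.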

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

Part1 Explore Neighborhoods in New York City
    
1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in New York City</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>    
 
Part2 Explore New York City Crime
    
1. <a href="#item6">Download and Explore Dataset</a>

2. <a href="#item7">Explore Neighborhoods in New York City</a>

3. <a href="#item8">Analyze Each Neighborhood</a>

4. <a href="#item9">Cluster Neighborhoods</a>

5. <a href="#item10">Examine Clusters</a>    
    
</font>
</div>

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (requires pandas >= 1.0)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
print('Libraries imported.')

Libraries imported.


In [None]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed yet
import folium # map rendering library
!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
print('Libraries imported.')


## 1. Download and Explore Dataset


In [None]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
neighborhoods_data = newyork_data['features']
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})
neighborhoods = pd.DataFrame(rows, columns=column_names)
neighborhoods.head()
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)
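
The same table can also be built without an explicit loop. The sketch below uses `pd.json_normalize` on two hand-written features that mimic the shape of `newyork_data['features']`. The coordinates are illustrative values, not taken from the dataset, and `neighborhoods_alt` is a hypothetical name.

```python
import pandas as pd

# Two made-up features in the same shape as newyork_data['features'].
features = [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.847, 40.895]}},
    {"properties": {"borough": "Staten Island", "name": "St. George"},
     "geometry": {"coordinates": [-74.079, 40.645]}},
]

# Flatten nested dicts into dotted column names such as 'properties.borough'.
flat = pd.json_normalize(features)

# The coordinate lists stay as lists; GeoJSON order is [longitude, latitude].
coords = pd.DataFrame(flat["geometry.coordinates"].tolist(),
                      columns=["Longitude", "Latitude"])

neighborhoods_alt = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighborhood": flat["properties.name"],
    "Latitude": coords["Latitude"],
    "Longitude": coords["Longitude"],
})
print(neighborhoods_alt)
```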

#### Use geopy library to get the latitude and longitude values of New York City.

In [None]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

#### Create a map of New York with neighborhoods superimposed on top.

In [None]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [None]:

sl_data = neighborhoods[neighborhoods['Borough'] == 'Staten Island'].reset_index(drop=True)
sl_data.head()
address = 'STATEN ISLAND, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of STATEN ISLAND are {}, {}.'.format(latitude, longitude))

#### As we did with all of New York City, let's visualize Staten Island and the neighborhoods in it.

In [None]:
# create map of STATEN ISLAND using latitude and longitude values
map_sl = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(sl_data['Latitude'], sl_data['Longitude'], sl_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sl)  
    
map_sl

## Foursquare API

In [None]:
CLIENT_ID = '2ABDNZN3FAKGEPTSPADC5IADNA01WNVNDPSZWCWDB2ZKRFUQ' # your Foursquare ID
CLIENT_SECRET = 'SCYPFJ2QGW54BDUEGEEAD5SDUM3HO3G4I3H15QB4RUMYZKYP' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
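
Keeping credentials in the notebook makes them visible to anyone who sees the file. One possible alternative is to read them from environment variables. This is only a sketch: the helper `load_foursquare_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical and not part of the Foursquare API.

```python
import os

def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    if env is None:
        env = os.environ
    # Hypothetical variable names chosen for this sketch.
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET before running the notebook"
        )
    return client_id, client_secret
```

With the variables set in the shell, the cell above could then use `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`.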

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

## 2. Explore Neighborhoods in Staten Island


In [None]:
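LIMIT = 100 # maximum number of venues requested per call (assumed value; LIMIT is not defined elsewhere in this notebook)
def getNearbyVenues(names, latitudes, longitudes, radius=500):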
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)      
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

sl_venues = getNearbyVenues(names=sl_data['Neighborhood'],
                                   latitudes=sl_data['Latitude'],
                                   longitudes=sl_data['Longitude']
                                  )
print(sl_venues.shape)
sl_venues.head()
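
`getNearbyVenues` assumes that every response contains at least one group and that every venue has at least one category. The sketch below shows one way the parsing step could be made more tolerant. The helper `extract_venue_rows` is hypothetical, and the mock response only imitates the fields that the function above reads.

```python
def extract_venue_rows(name, lat, lng, response_json):
    """Turn one explore-style response into row tuples, tolerating missing parts."""
    groups = response_json.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Mock response with one categorized venue and one venue without categories.
mock = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 40.6, "lng": -74.1},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Place B", "location": {"lat": 40.7, "lng": -74.2},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Toy Neighborhood", 40.65, -74.15, mock)
empty_rows = extract_venue_rows("Toy Neighborhood", 40.65, -74.15, {"response": {}})
print(rows)
```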

## 3. Analyze Each Neighborhood

In [None]:
# one hot encoding
sl_onehot = pd.get_dummies(sl_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sl_onehot['Neighborhood'] = sl_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [sl_onehot.columns[-1]] + list(sl_onehot.columns[:-1])
sl_onehot = sl_onehot[fixed_columns]

sl_onehot.head()
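
To see what the one-hot encoding and the later `groupby(...).mean()` produce, here is a toy example with two made-up neighborhoods, `A` and `B`. After grouping, each value is the share of a neighborhood's venues that falls in that category.

```python
import pandas as pd

# Made-up venues: neighborhood A has 3 venues, neighborhood B has 2.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Pizza Place", "Pizza Place", "Coffee Shop",
                       "Coffee Shop", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the indicator columns is the category frequency per neighborhood.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```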

In [None]:
sl_grouped = sl_onehot.groupby('Neighborhood').mean().reset_index()
sl_grouped

num_top_venues = 5

for hood in sl_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = sl_grouped[sl_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
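
A quick check of `return_most_common_venues` on a single hand-made row. The function is repeated so the example runs on its own, and the frequencies are invented for illustration.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighborhood name) and rank the rest.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# One row in the same layout as sl_grouped: name first, then frequencies.
row = pd.Series({"Neighborhood": "A", "Coffee Shop": 0.2,
                 "Park": 0.1, "Pizza Place": 0.7})

top2 = return_most_common_venues(row, 2)
print(list(top2))
```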

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = sl_grouped['Neighborhood']

for ind in np.arange(sl_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sl_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.shape

## 4. Cluster Neighborhoods

In [None]:
# set number of clusters
kclusters = 5

sl_grouped_clustering = sl_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sl_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
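
The clustering step groups neighborhoods whose venue-frequency rows look alike. A minimal sketch on a made-up frequency matrix with two obvious groups is shown below. The values and the explicit `n_init=10` are choices made for this example, not settings from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four made-up neighborhoods described by three category frequencies.
freq = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# Two clusters; n_init is set explicitly so the result does not depend on the sklearn default.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)
labels = toy_kmeans.labels_
print(labels)
```

The label numbers themselves are arbitrary. Only the grouping matters: the first two rows share one label and the last two share the other.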

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

sl_merged = sl_data

# join the cluster labels and top venues onto sl_data, which holds latitude/longitude for each neighborhood
sl_merged = sl_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

sl_merged.head() # check the last columns!

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sl_merged['Latitude'], sl_merged['Longitude'], sl_merged['Neighborhood'], sl_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Examine Clusters

In [None]:
#### Cluster 1
sl_merged.loc[sl_merged['Cluster Labels'] 
                     == 0, sl_merged.columns[[1] + list(range(5, sl_merged.shape[1]))]]

In [None]:
#### Cluster 2
sl_merged.loc[sl_merged['Cluster Labels'] 
                     == 1, sl_merged.columns[[1] + list(range(5, sl_merged.shape[1]))]]

In [None]:
#### Cluster 3
sl_merged.loc[sl_merged['Cluster Labels'] 
                     == 2, sl_merged.columns[[1] + list(range(5, sl_merged.shape[1]))]]

In [None]:
#### Cluster 4
sl_merged.loc[sl_merged['Cluster Labels'] 
                     == 3, sl_merged.columns[[1] + list(range(5, sl_merged.shape[1]))]]

In [None]:
#### Cluster 5
sl_merged.loc[sl_merged['Cluster Labels'] 
                     == 4, sl_merged.columns[[1] + list(range(5, sl_merged.shape[1]))]]
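
The five cells above repeat the same selection with a different label. A small helper can express that selection once. The sketch below uses a hypothetical function `cluster_view` and a made-up frame with the same column layout as `sl_merged`.

```python
import pandas as pd

# Made-up frame: columns 0-4 match sl_merged, venue columns start at position 5.
merged = pd.DataFrame({
    "Borough": ["Staten Island"] * 3,
    "Neighborhood": ["X", "Y", "Z"],
    "Latitude": [40.60, 40.61, 40.62],
    "Longitude": [-74.10, -74.11, -74.12],
    "Cluster Labels": [0, 1, 0],
    "1st Most Common Venue": ["Pizza Place", "Park", "Deli"],
})

def cluster_view(df, label):
    """Return the neighborhood name and venue columns for one cluster label."""
    cols = df.columns[[1] + list(range(5, df.shape[1]))]
    return df.loc[df["Cluster Labels"] == label, cols]

for label in sorted(merged["Cluster Labels"].unique()):
    print("Cluster", label + 1)
    print(cluster_view(merged, label))
```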


# Part 2 New York Crime Data Visualization

In [None]:
case=pd.read_csv('NYPD.csv')
case.head()
case=case.rename(columns={'CMPLNT_FR_DT':"Date",'OFNS_DESC':"Type",'CRM_ATPT_CPTD_CD':"Degree",
                          'LAW_CAT_CD':"Category",'BORO_NM':"Borough"})
#borough count 
case_count1=case.groupby(["Borough"]).size().reset_index(name="Counts")
case_count1


In [None]:
#category/type count
case_count2=case.groupby(["Type","Borough"]).size().reset_index(name="Counts")
case_count21=case_count2[case_count2["Borough"]=="BRONX"].reset_index(drop=True)
case_count21=case_count21.sort_values(by="Counts",ascending=False)
case_count21.head()

In [None]:
case_count22=case_count2[case_count2["Borough"]=="BROOKLYN"].reset_index(drop=True)
case_count22=case_count22.sort_values(by="Counts",ascending=False)
case_count22.head()

In [None]:
case_count23=case_count2[case_count2["Borough"]=="MANHATTAN"].reset_index(drop=True)
case_count23=case_count23.sort_values(by="Counts",ascending=False)
case_count23.head()

In [None]:
case_count24=case_count2[case_count2["Borough"]=="QUEENS"].reset_index(drop=True)
case_count24=case_count24.sort_values(by="Counts",ascending=False)
case_count24.head()

In [None]:
case_count25=case_count2[case_count2["Borough"]=="STATEN ISLAND"].reset_index(drop=True)
case_count25=case_count25.sort_values(by="Counts",ascending=False)
case_count25.head()

In [None]:
case_count3=case.groupby(["Category"]).size().reset_index(name="Counts")
case_count3

In [None]:
case_new=case
case_new=case_new.rename(columns={'CMPLNT_FR_DT':"Date",'OFNS_DESC':"Type",'CRM_ATPT_CPTD_CD':"Degree",
                          'LAW_CAT_CD':"Category",'BORO_NM':"Borough"})
case_new=case_new.drop(columns=["Date","Type","Degree","Latitude","Longitude"])
case2=case_new.groupby(["Borough","Category"]).size().reset_index()
case2
borough = ["BRONX","BROOKLYN","MANHATTAN","QUEENS","STATEN ISLAND"]
category=["FELONY","MISDEMEANOR","VIOLATION"]
case2_new=pd.DataFrame({"Borough":borough,"FELONY":[5826,9041,7140,6196,926],
                      "MISDEMEANOR":[10879,13439,11223,8943,2370],
                       "VIOLATION":[2562,3669,2392,2301,799]})
case2_new.set_index('Borough')
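
The borough-by-category table above is typed in by hand. A table of the same shape could also be derived directly from the complaint rows, for example with `pd.crosstab`. The sketch below uses a few made-up rows, so the counts do not correspond to the real data.

```python
import pandas as pd

# Made-up complaints using the renamed columns from this notebook.
toy_case = pd.DataFrame({
    "Borough": ["BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS"],
    "Category": ["FELONY", "VIOLATION", "FELONY", "FELONY", "MISDEMEANOR"],
})

# Count complaints per borough (rows) and category (columns).
counts = pd.crosstab(toy_case["Borough"], toy_case["Category"]).reset_index()
counts.columns.name = None  # drop the leftover 'Category' axis label
print(counts)
```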

In [None]:
# grouped bar chart of crime counts by borough and category
import matplotlib.pyplot as plt
width = 0.8
fig, ax = plt.subplots(figsize=(20, 8))
ind = np.arange(0, 15, 3)  # one group of three bars per borough
p1 = ax.bar(ind - width, case2_new['FELONY'], width, label='felony', color='#5cb85c')
p2 = ax.bar(ind, case2_new['MISDEMEANOR'], width, label='misdemeanor', color='#5bc0de')
p3 = ax.bar(ind + width, case2_new['VIOLATION'], width, label='violation', color='#d9534f')
ax.set_title("Number of Crimes by Borough", fontsize=16)
ax.set_xticks(ind)
ax.set_xticklabels(borough, rotation=90, fontsize=14)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# label each bar with its count
for p in ax.patches:
    bar_width, height = p.get_width(), p.get_height()
    ax.annotate(str(int(height)), (p.get_x() + 0.15 * bar_width, p.get_y() + height + 50))
ax.legend(fontsize=14)

In [None]:
# weighted crime score: felonies weigh 0.5, misdemeanors 0.2, violations 0.3
case2_new['Sum']=case2_new['FELONY']*0.5+case2_new['MISDEMEANOR']*0.2+case2_new['VIOLATION']*0.3
case2_new=case2_new.sort_values(by="Sum",ascending=True)
case2_new
# Staten Island and Queens have the lowest weighted scores, so they are chosen
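
As a worked check of the weighted score, the sketch below recomputes it from the counts and weights used in this notebook and ranks the boroughs. The names `weights`, `table` and `ranked` are introduced only for this example.

```python
import pandas as pd

# Weights and counts copied from the cells above.
weights = {"FELONY": 0.5, "MISDEMEANOR": 0.2, "VIOLATION": 0.3}
table = pd.DataFrame({
    "Borough": ["BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"],
    "FELONY": [5826, 9041, 7140, 6196, 926],
    "MISDEMEANOR": [10879, 13439, 11223, 8943, 2370],
    "VIOLATION": [2562, 3669, 2392, 2301, 799],
})

# Weighted score per borough, e.g. Staten Island: 926*0.5 + 2370*0.2 + 799*0.3.
table["Sum"] = sum(table[col] * w for col, w in weights.items())

# Lowest score first.
ranked = table.sort_values("Sum").reset_index(drop=True)
print(ranked[["Borough", "Sum"]])
```

Note that the score is based on raw counts, so it does not account for differences in borough population.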

In [None]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
NY_geo = newyork_data
NY_map = folium.Map(location=[40.7127281, -74.0060152], zoom_start=12)
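# the crime table uses upper-case borough names ("STATEN ISLAND"), while the
# GeoJSON properties use title case ("Staten Island"), so convert before matching
NY_map.choropleth(
    geo_data=NY_geo,
    data=case2_new.assign(Borough=case2_new['Borough'].str.title()),
    columns=['Borough', 'Sum'],
    key_on='feature.properties.borough',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Crime Rate in New York City'
)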
# display map
NY_map 