# Capstone Project - The Battle of the Neighborhoods (Week 2)

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

**Problem description**: Finding a neighborhood in Chicago for opening a restaurant, where both competition and the crime rate are low.

**Target audience**: people who would like to open a **restaurant in Chicago** in a *safe* neighborhood with *relatively low competition*. 

**Why it is important**:
* Low competition is better.
* Customers prefer safer areas.

*NOTE: During my analysis I will concentrate on the more serious, more violent crimes, because in my opinion these matter the most.*

## Data <a name="data"></a>

#### 1. Chicago crime data

I will use the [City of Chicago's official data portal](https://data.cityofchicago.org/api/views/3i3m-jwuy/rows.csv?accessType=DOWNLOAD) for crime data. The latest data available is from 2018, so I will use that. After downloading the CSV I will load it into a pandas DataFrame. As mentioned above, I will concentrate on the more serious/violent crimes.

I will drop the rows for the less serious/violent crime types:
* burglary
* concealed carry license violation
* criminal trespass
* deceptive practice
* gambling
* interference with public officer
* liquor law violation
* motor vehicle theft
* non-criminal
* non-criminal (subject specified)
* obscenity
* other narcotic violation
* other offense
* prostitution
* public indecency
* public peace violation
* theft

**I will concentrate on the following type of crimes:**
* arson
* assault
* crim sexual assault
* criminal damage
* criminal sexual assault
* homicide
* human trafficking
* intimidation
* kidnapping
* narcotics
* offense involving children
* robbery
* sex offense
* stalking
* weapons violation

Chicago has 77 neighborhoods. For each neighborhood I will compute (total number of crimes)/(neighborhood population), and I will continue with the 20 safest neighborhoods.
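The ratio described above can be sketched with the standard library alone. This is only an illustration of the arithmetic, not the notebook's actual pandas code: the function name `safest_areas` and its arguments are hypothetical, and the three sample figures are the population and violent-crime counts that appear in the tables further down.

```python
from collections import Counter


def safest_areas(crime_area_numbers, population_by_area, n=20):
    """Rank areas by (number of crimes) / (population); return the n lowest."""
    counts = Counter(crime_area_numbers)
    rates = {
        area: counts.get(area, 0) / population
        for area, population in population_by_area.items()
        if population > 0  # skip areas without a usable population figure
    }
    # Sort ascending by rate, so the safest areas come first.
    return sorted(rates.items(), key=lambda item: item[1])[:n]


# Area number -> population (Edison Park, North Center, Rogers Park).
population = {9: 11605, 5: 35789, 1: 55062}
# One entry per recorded crime, holding the area number it occurred in.
crimes = [9] * 53 + [5] * 262 + [1] * 1090

ranking = safest_areas(crimes, population, n=2)
print(ranking)
```

Dividing by population matters here: a large area can have more crimes in total and still be safer per resident than a small one.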

In [1]:
#Downloading the data from the link mentioned above (the URL is quoted so the shell does not interpret the "?"):
!wget -q -O 'chicago_crime.csv' 'https://data.cityofchicago.org/api/views/3i3m-jwuy/rows.csv?accessType=DOWNLOAD'

In [2]:
#Importing libraries:
import numpy as np
import pandas as pd
pd.set_option('display.precision',15)
import json
from geopy.geocoders import Nominatim 
import requests 
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium 

In [3]:
df_crime=pd.read_csv('chicago_crime.csv')

In [4]:
df_crime.head()

| | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11317463 | JB266425 | 05/16/2018 07:10:00 PM | 060XX N LINCOLN AVE | 0326 | ROBBERY | AGGRAVATED VEHICULAR HIJACKING | PARKING LOT / GARAGE (NON RESIDENTIAL) | False | False | ... | 50.0 | 13 | 03 | 1153623.0 | 1940355.0 | 2018 | 06/11/2020 03:46:39 PM | 41.992175988 | -87.710287127 | (41.992175988, -87.710287127) |
| 1 | 11346980 | JB304586 | 06/13/2018 12:27:00 AM | 027XX W 63RD ST | 0610 | BURGLARY | FORCIBLE ENTRY | OTHER (SPECIFY) | False | False | ... | 17.0 | 66 | 05 | 1159376.0 | 1862766.0 | 2018 | 06/05/2020 03:46:23 PM | 41.779147989 | -87.69126110399998 | (41.779147989, -87.691261104) |
| 2 | 11540961 | JB560480 | 12/05/2018 12:00:00 AM | 002XX S HOYNE AVE | 1562 | SEX OFFENSE | AGGRAVATED CRIMINAL SEXUAL ABUSE | RESIDENCE | False | False | ... | 27.0 | 28 | 17 | 1162441.0 | 1898955.0 | 2018 | 06/04/2020 03:45:42 PM | 41.878391287 | -87.67901396399998 | (41.878391287, -87.679013964) |
| 3 | 11433411 | JB418440 | 09/01/2018 08:32:00 PM | 009XX N HARDING AVE | 041A | BATTERY | AGGRAVATED - HANDGUN | ALLEY | False | False | ... | 37.0 | 23 | 04B | 1149884.0 | 1905974.0 | 2018 | 06/03/2020 03:45:09 PM | 41.897905616 | -87.72493808 | (41.897905616, -87.72493808) |
| 4 | 11394797 | JB368524 | 07/28/2018 02:27:00 AM | 110XX S HALSTED ST | 0263 | CRIMINAL SEXUAL ASSAULT | AGGRAVATED - KNIFE / CUTTING INSTRUMENT | ALLEY | True | False | ... | 34.0 | 49 | 02 | 1172953.0 | 1831551.0 | 2018 | 06/03/2020 03:45:09 PM | 41.693200858 | -87.64240521899998 | (41.693200858, -87.642405219) |


In [5]:
#Dropping unnecessary columns. 
df_crime.drop(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'FBI Code', 'X Coordinate','Y Coordinate', 'Year', 'Updated On', 'Location'], inplace=True, axis = 1)

In [6]:
#Checking which columns are left
df_crime.columns

Index(['Primary Type', 'Community Area', 'Latitude', 'Longitude'], dtype='object')

In [7]:
df_crime.head()

| | Primary Type | Community Area | Latitude | Longitude |
|---|---|---|---|---|
| 0 | ROBBERY | 13 | 41.992175988 | -87.710287127 |
| 1 | BURGLARY | 66 | 41.779147989 | -87.69126110399998 |
| 2 | SEX OFFENSE | 28 | 41.878391287 | -87.67901396399998 |
| 3 | BATTERY | 23 | 41.897905616 | -87.72493808 |
| 4 | CRIMINAL SEXUAL ASSAULT | 49 | 41.693200858 | -87.64240521899998 |


In [8]:
#Selecting the violent crimes. (The list of violent crimes is above.)
violent_crimes = ['ARSON', 'ASSAULT', 'CRIM SEXUAL ASSAULT', 'CRIMINAL DAMAGE', 'CRIMINAL SEXUAL ASSAULT', 'HOMICIDE', 'HUMAN TRAFFICKING', 'INTIMIDATION', 'KIDNAPPING', 'NARCOTICS', 'OFFENSE INVOLVING CHILDREN', 'ROBBERY', 'SEX OFFENSE', 'STALKING', 'WEAPONS VIOLATION']
#isin keeps only the rows whose Primary Type is one of the listed values
df_crime = df_crime[df_crime['Primary Type'].isin(violent_crimes)]

In [9]:
#I have to drop rows with no latitude or longitude information
df_crime.dropna(subset=["Latitude", "Longitude"], axis=0, inplace=True)
df_crime.reset_index(drop=True, inplace=True)

In [10]:
print(df_crime['Primary Type'].value_counts())

CRIMINAL DAMAGE               27700
ASSAULT                       20342
NARCOTICS                     12797
ROBBERY                        9676
WEAPONS VIOLATION              5444
OFFENSE INVOLVING CHILDREN     2144
CRIM SEXUAL ASSAULT            1458
SEX OFFENSE                    1053
HOMICIDE                        600
ARSON                           373
STALKING                        199
KIDNAPPING                      172
INTIMIDATION                    167
CRIMINAL SEXUAL ASSAULT         105
HUMAN TRAFFICKING                12
Name: Primary Type, dtype: int64


In [11]:
#Checking the final Crime dataframe.
df_crime.head()

| | Primary Type | Community Area | Latitude | Longitude |
|---|---|---|---|---|
| 0 | ROBBERY | 13 | 41.992175988 | -87.710287127 |
| 1 | SEX OFFENSE | 28 | 41.878391287 | -87.67901396399998 |
| 2 | CRIMINAL SEXUAL ASSAULT | 49 | 41.693200858 | -87.64240521899998 |
| 3 | OFFENSE INVOLVING CHILDREN | 23 | 41.901788359 | -87.723835955 |
| 4 | CRIMINAL SEXUAL ASSAULT | 51 | 41.70218374 | -87.565137272 |


As you can see above, the Chicago crime dataset is now clean: it contains only the more serious crimes, and rows with no latitude or longitude information have been dropped. 

#### 2. Community areas in Chicago

The names of the community areas can be found on [Wikipedia](https://en.wikipedia.org/wiki/Community_areas_in_Chicago). I will get the names of the areas and their populations from this source. Unfortunately it does not contain the coordinates, so **I will use a geolocator to get the latitude and longitude information.**

In [12]:
#Loading information from Wikipedia:
areas = pd.read_html("https://en.wikipedia.org/wiki/Community_areas_in_Chicago")

In [13]:
#pd.read_html returns a list of DataFrames, but I need a single DataFrame (the first table):
df_areas = areas[0]

In [14]:
df_areas.head()

| | Number[8] | Name[8] | 2017[9] | Area (sq mi.)[10] | Area (km2) | 2017density (/sq mi.) | 2017density (/km2) |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Rogers Park | 55062 | 1.84 | 4.77 | 29925.0 | 11554.11 |
| 1 | 2 | West Ridge | 76215 | 3.53 | 9.14 | 21590.65 | 8336.2 |
| 2 | 3 | Uptown | 57973 | 2.32 | 6.01 | 24988.36 | 9648.06 |
| 3 | 4 | Lincoln Square | 41715 | 2.56 | 6.63 | 16294.92 | 6291.5 |
| 4 | 5 | North Center | 35789 | 2.05 | 5.31 | 17458.05 | 6740.59 |


In [15]:
#Dropping the last line
df_areas.drop(df_areas.tail(1).index,inplace=True)

In [16]:
df_areas.drop(['Area (sq mi.)[10]', 'Area (km2)', '2017density (/sq mi.)', '2017density (/km2)'], inplace=True, axis = 1)

In [17]:
df_areas.columns

Index(['Number[8]', 'Name[8]', '2017[9]'], dtype='object')

In [18]:
#I need to rename the columns to give them more meaningful names
df_areas.rename(columns={'Number[8]':'Area Number', 'Name[8]': 'Area Name', '2017[9]': 'Population'}, inplace=True)

In [19]:
#There is a badly formatted area name:
df_areas.iat[31, 1] = 'The Loop'

In [20]:
df_areas.head()

| | Area Number | Area Name | Population |
|---|---|---|---|
| 0 | 1 | Rogers Park | 55062 |
| 1 | 2 | West Ridge | 76215 |
| 2 | 3 | Uptown | 57973 |
| 3 | 4 | Lincoln Square | 41715 |
| 4 | 5 | North Center | 35789 |


As you can see above, we now have the names and the populations of the areas. (The Chicago crime dataset contained only an area number.)

In [21]:
#I will use the function below to get the latitude and longitude information for the Area Names
def locate_area(address):
    raw_string = r"{}".format(address)
    raw_string = raw_string + ', Chicago'
    geolocator = Nominatim(user_agent="to_explorer")
    location = geolocator.geocode(raw_string)
    latitude = location.latitude
    longitude = location.longitude
    print('The geograpical coordinates of {} are {}, {}.'.format(raw_string,latitude, longitude))
    return latitude, longitude
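The function above assumes that the geocoder always finds a match. geopy's `geocode` returns `None` when nothing is found, in which case `location.latitude` raises an `AttributeError`. A minimal sketch of a more defensive variant follows; the name `locate_area_safe` and the injected `geocode` argument are hypothetical, introduced here so the logic can be tried without network access.

```python
from types import SimpleNamespace


def locate_area_safe(name, geocode, city="Chicago"):
    """Return (latitude, longitude) for an area, or (None, None) if not found.

    `geocode` is any callable that takes a query string and returns an object
    with `.latitude` and `.longitude` attributes, or None when there is no
    match, for example `Nominatim(user_agent="to_explorer").geocode`.
    """
    query = "{}, {}".format(name, city)
    location = geocode(query)
    if location is None:
        # Signal the miss to the caller instead of raising AttributeError.
        return None, None
    return location.latitude, location.longitude


def fake_geocode(query):
    """Stand-in for a real geocoder; it knows a single area (illustration only)."""
    known = {
        "Uptown, Chicago": SimpleNamespace(latitude=41.9666299, longitude=-87.6555458),
    }
    return known.get(query)


print(locate_area_safe("Uptown", fake_geocode))
print(locate_area_safe("No Such Area", fake_geocode))
```

Rows that come back as `(None, None)` can then be inspected or dropped before the coordinates are used.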

In [22]:
#Getting the latitude and longitude information for each area.
lat_long=df_areas['Area Name'].apply(locate_area)

The geograpical coordinates of Rogers Park, Chicago are 42.01053135, -87.67074819664808.
The geograpical coordinates of West Ridge, Chicago are 42.0035482, -87.6962426.
The geograpical coordinates of Uptown, Chicago are 41.9666299, -87.6555458.
The geograpical coordinates of Lincoln Square, Chicago are 41.975989850000005, -87.6896163305115.
The geograpical coordinates of North Center, Chicago are 41.9561073, -87.6791596.
The geograpical coordinates of Lake View, Chicago are 41.947050000000004, -87.65542878290054.
The geograpical coordinates of Lincoln Park, Chicago are 41.940297650000005, -87.63811710541756.
The geograpical coordinates of Near North Side, Chicago are 41.9000327, -87.6344975.
The geograpical coordinates of Edison Park, Chicago are 42.0057335, -87.81401633833357.
The geograpical coordinates of Norwood Park, Chicago are 41.9855895, -87.80058173001102.
The geograpical coordinates of Jefferson Park, Chicago are 41.9697375, -87.7631179.
The geograpical coordinates of Forest 

In [23]:
#Converting lat_long to a DataFrame and merging it with the area info DataFrame
lat_long = lat_long.to_frame()
lat_long = pd.DataFrame(lat_long['Area Name'].tolist(), index=lat_long.index)
lat_long.columns = ['Latitude', 'Longitude']
df_areas = pd.concat([df_areas, lat_long], axis=1)
df_areas.head(15)

| | Area Number | Area Name | Population | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | 1 | Rogers Park | 55062 | 42.01053135 | -87.67074819664806 |
| 1 | 2 | West Ridge | 76215 | 42.0035482 | -87.6962426 |
| 2 | 3 | Uptown | 57973 | 41.9666299 | -87.6555458 |
| 3 | 4 | Lincoln Square | 41715 | 41.975989850000005 | -87.68961633051148 |
| 4 | 5 | North Center | 35789 | 41.9561073 | -87.6791596 |
| 5 | 6 | Lake View | 100470 | 41.94705 | -87.65542878290054 |
| 6 | 7 | Lincoln Park | 67710 | 41.940297650000005 | -87.63811710541756 |
| 7 | 8 | Near North Side | 88893 | 41.9000327 | -87.63449749999998 |
| 8 | 9 | Edison Park | 11605 | 42.0057335 | -87.81401633833357 |
| 9 | 10 | Norwood Park | 37089 | 41.9855895 | -87.800581730011 |


Now we have the necessary data:
* The Area Number
* The Area Name
* The population of the given areas
* Latitude and Longitude information

#### 3. Foursquare

I will use Foursquare to check the competition in the 20 safest neighborhoods. I will then select 5 of them as candidate locations for opening the restaurant.

**Important note: I will be able to do this once the analysis is done,** not in this section.

## Methodology <a name="methodology"></a>

1. I downloaded the Chicago crime information in CSV format and loaded the CSV into a pandas DataFrame. I dropped all unnecessary columns and the rows which did not have latitude and longitude information. I also dropped the rows for less serious crimes. 
In the **analysis section** I will select the 20 most peaceful neighborhoods. 

2. The crime dataset did not include area information such as the name and population of the area.
Fortunately, this information is on Wikipedia. I used pandas again to get it. However, I was still missing the latitude and longitude information, so I used the **geolocator** to get it.


In the analysis section I will perform the following:
* I will select the 20 safest neighborhoods, ranked by (total crime in the given area)/(total population of the given area).
* I will use Foursquare to check the competition.

In the conclusion section I will make the recommendations regarding where to open a restaurant.

## Analysis <a name="analysis"></a>

Let's perform our analysis. Which areas are the safest?

Calculating the total crime per neighborhood:

In [24]:
df_areas.set_index('Area Number', inplace = True)
df_sumcrime = df_crime['Community Area'].value_counts()
df_sumcrime = df_sumcrime.to_frame()
df_sumcrime.index.name = 'Area Number'
df_sumcrime = df_sumcrime.sort_values(['Area Number'], ascending=True, axis=0)
df_sumcrime.columns = ['Number of violent crimes']
df_areas.index = df_areas.index.map(int)
df_sumcrime.index = df_sumcrime.index.map(int)
df_sumcrime.head(20)

| Area Number | Number of violent crimes |
|---|---|
| 1 | 1090 |
| 2 | 1004 |
| 3 | 855 |
| 4 | 453 |
| 5 | 262 |
| 6 | 1031 |
| 7 | 918 |
| 8 | 1900 |
| 9 | 53 |
| 10 | 295 |


Creating the DataFrame which contains the information about the neighborhoods:

In [25]:
df_chicago=pd.concat([df_areas, df_sumcrime],axis=1)
df_chicago.head(10)

| Area Number | Area Name | Population | Latitude | Longitude | Number of violent crimes |
|---|---|---|---|---|---|
| 1 | Rogers Park | 55062 | 42.01053135 | -87.67074819664806 | 1090 |
| 2 | West Ridge | 76215 | 42.0035482 | -87.6962426 | 1004 |
| 3 | Uptown | 57973 | 41.9666299 | -87.6555458 | 855 |
| 4 | Lincoln Square | 41715 | 41.975989850000005 | -87.68961633051148 | 453 |
| 5 | North Center | 35789 | 41.9561073 | -87.6791596 | 262 |
| 6 | Lake View | 100470 | 41.94705 | -87.65542878290054 | 1031 |
| 7 | Lincoln Park | 67710 | 41.940297650000005 | -87.63811710541756 | 918 |
| 8 | Near North Side | 88893 | 41.9000327 | -87.63449749999998 | 1900 |
| 9 | Edison Park | 11605 | 42.0057335 | -87.81401633833357 | 53 |
| 10 | Norwood Park | 37089 | 41.9855895 | -87.800581730011 | 295 |


In [26]:
df_chicago['Crime per person'] = df_chicago['Number of violent crimes']/df_chicago['Population']
df_chicago.head()

| Area Number | Area Name | Population | Latitude | Longitude | Number of violent crimes | Crime per person |
|---|---|---|---|---|---|---|
| 1 | Rogers Park | 55062 | 42.01053135 | -87.67074819664806 | 1090 | 0.019795866477789 |
| 2 | West Ridge | 76215 | 42.0035482 | -87.6962426 | 1004 | 0.013173259856984 |
| 3 | Uptown | 57973 | 41.9666299 | -87.6555458 | 855 | 0.014748244872613 |
| 4 | Lincoln Square | 41715 | 41.975989850000005 | -87.68961633051148 | 453 | 0.010859403092413 |
| 5 | North Center | 35789 | 41.9561073 | -87.6791596 | 262 | 0.007320685126715 |


The following table contains the **20 safest neighborhoods** of Chicago:

In [27]:
df_chicago = df_chicago.sort_values(['Crime per person'], ascending=True, axis=0).head(20)
df_chicago.head(20)

| Area Number | Area Name | Population | Latitude | Longitude | Number of violent crimes | Crime per person |
|---|---|---|---|---|---|---|
| 9 | Edison Park | 11605 | 42.0057335 | -87.81401633833357 | 53 | 0.004566996984059 |
| 12 | Forest Glen | 19019 | 41.99175155 | -87.75167396842738 | 91 | 0.004784688995215 |
| 5 | North Center | 35789 | 41.9561073 | -87.6791596 | 262 | 0.007320685126715 |
| 74 | Mount Greenwood | 19277 | 41.6980891 | -87.7086616 | 142 | 0.007366291435389 |
| 10 | Norwood Park | 37089 | 41.9855895 | -87.800581730011 | 295 | 0.007953840761412 |
| 17 | Dunning | 43689 | 41.952809 | -87.7964493 | 390 | 0.008926732129369 |
| 64 | Clearing | 25891 | 41.780588 | -87.7733881 | 260 | 0.01004209957128 |
| 6 | Lake View | 100470 | 41.94705 | -87.65542878290054 | 1031 | 0.010261769682492 |
| 11 | Jefferson Park | 26808 | 41.9697375 | -87.76311789999998 | 283 | 0.010556550283497 |
| 72 | Beverly | 20822 | 41.7181532 | -87.67176739999998 | 226 | 0.010853904524061 |


Now I have to check the competition in the selected neighborhoods.

In [28]:
CLIENT_ID = 'JOKAQAZNVJ2RPSL4DCQCNEH04L3QTUKIZE3HLCMWRYFV1LO4' 
CLIENT_SECRET = 'CN1B2WNDQ2YCJBG5OPULS5AYYNIZO3AM5ZDVJ1Q5EK4ESYOG'
VERSION = '20180616'
RADIUS = 500              # I check the 500 meter radius from the center of the neighborhood.
print('My credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

My credentials:
CLIENT_ID: JOKAQAZNVJ2RPSL4DCQCNEH04L3QTUKIZE3HLCMWRYFV1LO4
CLIENT_SECRET:CN1B2WNDQ2YCJBG5OPULS5AYYNIZO3AM5ZDVJ1Q5EK4ESYOG
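An alternative to writing the credentials into the notebook is to read them from environment variables and print only a masked form. The sketch below is one possible approach, not part of the original analysis; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical.

```python
import os


def load_foursquare_credentials(environ=os.environ):
    """Read the client id and secret from environment variables (names assumed)."""
    client_id = environ.get("FOURSQUARE_CLIENT_ID")
    client_secret = environ.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first."
        )
    return client_id, client_secret


def mask(secret, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# A plain dict stands in for os.environ so the example runs anywhere.
example_env = {
    "FOURSQUARE_CLIENT_ID": "example-id",
    "FOURSQUARE_CLIENT_SECRET": "example-secret",
}
client_id, client_secret = load_foursquare_credentials(example_env)
print("CLIENT_ID: " + mask(client_id))
print("CLIENT_SECRET: " + mask(client_secret))
```

With this pattern the notebook can be shared without the saved output containing usable keys.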


In [29]:
#I will use the function below to get the number of restaurants for a selected neighborhood:
def number_of_restaurants(latitude, longitude):
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&section=food'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, RADIUS)
    results = requests.get(url).json()
    number_of_results = len(results['response']['groups'][0]['items'])
    return number_of_results
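The function above indexes straight into `results['response']['groups'][0]['items']`, which raises an exception if the response has a different shape (for example when the request fails). A hedged sketch of a more defensive version follows: the helper names are hypothetical, the query parameters are the ones used in the URL above, and the sample payloads are made up purely to exercise the parsing.

```python
def build_explore_params(client_id, client_secret, latitude, longitude,
                         version, radius):
    """Query parameters for the explore endpoint, mirroring the URL above."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(latitude, longitude),
        "v": version,
        "radius": radius,
        "section": "food",
    }


def count_items(payload):
    """Count the items in the first group, or return 0 if the shape is unexpected."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return 0
    return len(groups[0].get("items", []))


# The dict can be passed to requests as
#   requests.get(url, params=params, timeout=10).json()
# which lets requests do the URL encoding. No request is sent in this sketch.
params = build_explore_params("ID", "SECRET", 41.9855895, -87.800581730011,
                              "20180616", 500)
print(params["ll"])

# Made-up payloads, used only to exercise the parsing logic.
print(count_items({"response": {"groups": [{"items": [{}, {}, {}]}]}}))
print(count_items({"meta": {"code": 400}}))
```

Returning 0 for a malformed response is a simplification; in a real run a failed request probably deserves to be reported separately, so that it is not mistaken for an area with no restaurants.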

In [30]:
df_chicago['Number of restaurants'] =  np.vectorize(number_of_restaurants)(df_chicago['Latitude'], df_chicago['Longitude'])

In [31]:
df_chicago = df_chicago.sort_values(['Number of restaurants'], ascending=True, axis=0)

From the safe neighborhoods I select those which have 10 or fewer restaurants within the search radius.

In [32]:
df_top = df_chicago[df_chicago['Number of restaurants'] <= 10]
df_top.head(20)

| Area Number | Area Name | Population | Latitude | Longitude | Number of violent crimes | Crime per person | Number of restaurants |
|---|---|---|---|---|---|---|---|
| 10 | Norwood Park | 37089 | 41.9855895 | -87.800581730011 | 295 | 0.007953840761412 | 0 |
| 74 | Mount Greenwood | 19277 | 41.6980891 | -87.7086616 | 142 | 0.007366291435389 | 1 |
| 7 | Lincoln Park | 67710 | 41.940297650000005 | -87.63811710541756 | 918 | 0.013557820115197 | 1 |
| 17 | Dunning | 43689 | 41.952809 | -87.7964493 | 390 | 0.008926732129369 | 2 |
| 2 | West Ridge | 76215 | 42.0035482 | -87.6962426 | 1004 | 0.013173259856984 | 2 |
| 13 | North Park | 18842 | 41.9805872 | -87.72089169999998 | 243 | 0.012896720093408 | 4 |
| 72 | Beverly | 20822 | 41.7181532 | -87.67176739999998 | 226 | 0.010853904524061 | 5 |
| 12 | Forest Glen | 19019 | 41.99175155 | -87.75167396842738 | 91 | 0.004784688995215 | 5 |
| 64 | Clearing | 25891 | 41.780588 | -87.7733881 | 260 | 0.01004209957128 | 8 |
| 15 | Portage Park | 64307 | 41.9578093 | -87.7650594 | 883 | 0.013731009065887 | 8 |


## Results and Discussion <a name="results"></a>

<span style="color:blue">**According to my analysis the following neighborhoods have a low crime rate and low competition:**</span>

In [33]:
df_top['Area Name'].to_frame()

| Area Number | Area Name |
|---|---|
| 10 | Norwood Park |
| 74 | Mount Greenwood |
| 7 | Lincoln Park |
| 17 | Dunning |
| 2 | West Ridge |
| 13 | North Park |
| 72 | Beverly |
| 12 | Forest Glen |
| 64 | Clearing |
| 15 | Portage Park |


In [34]:
geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode('Chicago')
latitude = location.latitude
longitude = location.longitude
chicago_map = folium.Map(location=[latitude, longitude], zoom_start=10)

title_html = '''
             <h3 align="center" style="font-size:20px"><b>Chicago good neighborhoods for restaurants.</b></h3>
             '''
chicago_map.get_root().html.add_child(folium.Element(title_html))

neighborhoods = folium.map.FeatureGroup()
for lat, lng in zip(df_top.Latitude, df_top.Longitude):
    neighborhoods.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5, 
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )

latitudes = list(df_top.Latitude)
longitudes = list(df_top.Longitude)
labels = list(df_top['Area Name'])

for lat, lng, label in zip(latitudes, longitudes, labels):
    folium.Marker([lat, lng], popup=label).add_to(chicago_map)    
    

chicago_map.add_child(neighborhoods)
chicago_map

In the north there are some good candidates very close to each other.
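One way to put a number on "close to each other" is to compute great-circle distances between the candidate coordinates. The sketch below uses the haversine formula with a mean Earth radius of 6371 km. `haversine_km` is a hypothetical helper, and the coordinates are the geocoded points from the table above, which are not necessarily the geometric centers of the areas.

```python
import math
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius (assumed value)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Three of the northern candidates, with the coordinates listed above.
candidates = {
    "Norwood Park": (41.9855895, -87.800581730011),
    "Dunning": (41.952809, -87.7964493),
    "Forest Glen": (41.99175155, -87.75167396842738),
}

distances = {
    (a, b): haversine_km(*candidates[a], *candidates[b])
    for a, b in combinations(candidates, 2)
}
for (a, b), km in distances.items():
    print("{} - {}: {:.1f} km".format(a, b, km))
```

The same loop can be run over all rows of `df_top` to see which candidates cluster together.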

## Conclusion <a name="conclusion"></a>

The goal of the project was to identify safe neighborhoods in Chicago with relatively low competition in food services.
During my analysis I concentrated on the more serious crimes. 

I was able to identify 13 neighborhoods. I would concentrate on the neighborhoods located in the north, since there are a number of good candidates very close to each other.