# Introduction: Business Problem

**1.1 Background** <br>
The average American moves about eleven times in their lifetime. This brings us to the
question: **Do people move until they find a place to settle down where they truly feel happy,
or do our wants and needs change over time, prompting us to eventually leave a town we
once called home for a new area that will bring us satisfaction? Or, do we too often move to
a new area without knowing exactly what we’re getting into, forcing us to turn tail and run at
the first sign of discomfort?**
To reduce the chances of a regrettable move, we should always do proper research when
planning where to live next, so that we don’t waste valuable time and money. Many factors
matter when picking a new place to live, and safety is a top concern: if you don’t feel safe
in your own home, you’re not going to be able to enjoy living there.

**1.2 Problem** <br>
The London crime statistics dataset found on Kaggle records crimes in each borough of
London from 2008 to 2016. Because 2016 is the latest year available, we consider only the
data for that year. This information is now dated, and the crime rates in each borough
may have changed since then.
This project aims to select the safest borough in London based on total crimes, explore
the neighborhoods of that borough to find the 10 most common venues in each
neighborhood, and finally cluster the neighborhoods using k-means clustering.

**1.3 Interest** <br>
Expats who are considering relocating to London will be interested in identifying the safest
borough in London and exploring its neighborhoods and the common venues around each
neighborhood.

# Data Acquisition and Cleaning

**2.1 Data Acquisition** <br>
The data for this project comes from three sources. The first
source is a London crime dataset that records the crime counts per borough in
London. The dataset contains the following columns:<br>
● lsoa_code : code for Lower Super Output Area in Greater London. <br>
● borough : Common name for London borough.<br>
● major_category : High level categorization of crime <br>
● minor_category : Low level categorization of crime within major category. <br>
● value : monthly reported count of categorical crime in given borough <br>
● year : Year of reported counts, 2008-2016 <br>
● month : Month of reported counts, 1-12 <br> <br>

Data set URL: https://www.kaggle.com/jboysen/london-crime


The second source of data is scraped from a Wikipedia page that contains the list of London
boroughs. This page gives additional information about the boroughs in the following
columns:<br>
● Borough : The names of the 33 London boroughs.<br>
● Inner : Categorizing the borough as an Inner London borough or an Outer London
Borough.<br>
● Status : Categorizing the borough as Royal, City or other borough.<br>
● Local authority : The local authority assigned to the borough.<br>
● Political control : The political party that controls the borough.<br>
● Headquarters: Headquarters of the Boroughs.<br>
● Area (sq mi) : Area of the borough in square miles.<br>
● Population (2013 est)[1] : The population in the borough recorded during the year
2013.<br>
● Co-ordinates : The latitude and longitude of the boroughs.<br>
● Nr. in map : The number assigned to each borough to represent visually on a map.<br><br>
The third data source is the list of neighborhoods in the Royal Borough of Kingston upon
Thames, as found on a Wikipedia page. This dataset is created from scratch using the list of
neighborhoods available on that page, with the following columns:<br>
● Neighborhood: Name of the neighborhood in the Borough.<br>
● Borough: Name of the Borough. <br>
● Latitude: Latitude of the neighborhood. <br>
● Longitude: Longitude of the neighborhood.<br> <br>
**2.2 Data Cleaning** <br>
The data preparation for each of the three sources is done separately. From the
London crime data, only the crimes during the most recent year (2016) are selected. The
major categories of crime are then pivoted to get the total crimes per borough for each
major category.
<br><br>
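The filtering and pivoting described above could be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the rows are made-up toy values that only reuse the column names of the Kaggle file, and the `Total` column name is an assumption.

```python
import pandas as pd

# Toy rows with the same columns as the Kaggle file; the counts are made up.
crime = pd.DataFrame(
    {
        "borough": ["Camden", "Camden", "Camden", "Sutton", "Sutton", "Sutton"],
        "major_category": ["Burglary", "Robbery", "Burglary",
                           "Burglary", "Robbery", "Robbery"],
        "value": [3, 1, 5, 2, 0, 4],
        "year": [2016, 2016, 2015, 2016, 2016, 2016],
        "month": [1, 2, 12, 3, 4, 5],
    }
)

# Keep only the most recent year.
crime_2016 = crime[crime["year"] == 2016]

# One row per borough, one column per major category, summing the monthly counts.
crime_by_borough = crime_2016.pivot_table(
    index="borough",
    columns="major_category",
    values="value",
    aggfunc="sum",
    fill_value=0,
)

# Row-wise total across all major categories (column name is an assumption).
crime_by_borough["Total"] = crime_by_borough.sum(axis=1)
crime_by_borough = crime_by_borough.reset_index()
print(crime_by_borough)
```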
The second **dataset is scraped from a Wikipedia page using the Beautiful Soup library** in
Python. Using this library we can extract the data in the same tabular format as shown on the
website. After the web scraping, string manipulation is required to get the names of the
boroughs into the correct form. This is important because we will be merging the
two datasets together using the borough names.<br>
<br>
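As a rough sketch of the scraping and name clean-up, the snippet below parses a small inline HTML table instead of the live page. The markup is a simplified stand-in (the real page's structure may differ), and the `clean` helper and the assumption that the unwanted text is bracketed footnote markers are illustrative only.

```python
import re

from bs4 import BeautifulSoup

# Simplified stand-in for the Wikipedia table; the real markup may differ.
html = """
<table class="wikitable">
  <tr><th>Borough</th><th>Status</th></tr>
  <tr><td>Barking and Dagenham [note 1]</td><td></td></tr>
  <tr><td>Greenwich [note 2]</td><td>Royal</td></tr>
  <tr><td>Kingston upon Thames</td><td>Royal</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")


def clean(text):
    """Drop bracketed footnote markers such as '[note 1]' and trim whitespace."""
    return re.sub(r"\[[^\]]*\]", "", text).strip()


rows = table.find_all("tr")
header = [clean(th.get_text()) for th in rows[0].find_all("th")]

# One dict per data row, keyed by the header cells.
records = [
    dict(zip(header, (clean(td.get_text()) for td in row.find_all("td"))))
    for row in rows[1:]
]
print(records)
```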
The two datasets are merged on the borough names to form a single dataset that holds
all the necessary information. The purpose of this dataset is to
visualize the crime rates in each borough and identify the borough with the fewest crimes
recorded during the year 2016.
<br><br>
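One way the merge and the search for the lowest total might look in pandas is sketched below. The borough names and figures are hypothetical placeholders, and the `Total` column and the rename from `borough` to `Borough` are assumptions about how the two tables were prepared.

```python
import pandas as pd

# Hypothetical placeholder values, not real crime or population figures.
crime_totals = pd.DataFrame(
    {"borough": ["Borough A", "Borough B", "Borough C"], "Total": [120, 45, 300]}
)
borough_info = pd.DataFrame(
    {
        "Borough": ["Borough A", "Borough B", "Borough C"],
        "Population (2013 est)": [1000, 2000, 3000],
    }
)

# Align the key column name, then merge on it.
# validate= raises an error if a borough name is duplicated on either side.
merged = crime_totals.rename(columns={"borough": "Borough"}).merge(
    borough_info, on="Borough", how="inner", validate="one_to_one"
)

# An inner join silently drops names that do not match, so check nothing was lost.
unmatched = set(crime_totals["borough"]) - set(merged["Borough"])

# Borough with the fewest recorded crimes.
safest = merged.sort_values("Total").iloc[0]["Borough"]
print(unmatched, safest)
```

Checking `unmatched` is worthwhile here because any leftover difference in spelling between the two sources would otherwise remove a borough from the comparison without warning.
<br><br>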
After visualizing the crime in each borough, we can find the borough with the lowest crime
rate and tag it as the safest borough. The third source of data is acquired
from the list of neighborhoods in the safest borough on Wikipedia. This dataset is created
from scratch: a pandas data frame is built with the names of the neighborhoods and
the name of the borough, with the latitude and longitude left blank.
<br><br>
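A data frame of this shape could be set up as in the sketch below. Only a few neighborhood names are listed for illustration; the full list comes from the Wikipedia page, and using `NaN` for the blank coordinates is an assumption.

```python
import pandas as pd

# A few entries only, for illustration; the full list is on the Wikipedia page.
neighborhoods = ["Berrylands", "Surbiton", "Tolworth"]

kingston = pd.DataFrame(
    {
        "Neighborhood": neighborhoods,
        "Borough": "Kingston upon Thames",  # scalar is repeated for every row
        "Latitude": float("nan"),           # left blank until geocoding
        "Longitude": float("nan"),
    }
)
print(kingston)
```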
The coordinates of the neighborhoods are obtained using **Google Maps API geocoding**
to get the final dataset.
<br><br>
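The geocoding step might be sketched with the standard library alone, as below. The helper names `parse_location` and `geocode` are hypothetical, the sample payload uses placeholder numbers, and the live call is defined but not executed because it needs network access and a valid API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def parse_location(payload):
    """Return (lat, lng) from a Geocoding API JSON payload, or None if no match."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(address, api_key):
    """Query the API for one address. Needs network access and a valid key."""
    query = urlencode({"address": address, "key": api_key})
    with urlopen(f"{GEOCODE_URL}?{query}", timeout=10) as response:
        return parse_location(json.load(response))


# Offline check of the parsing step with a hand-written payload (placeholder numbers).
sample = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 1.5, "lng": -2.5}}}],
}
print(parse_location(sample))
print(parse_location({"status": "ZERO_RESULTS", "results": []}))
```

Keeping the parsing separate from the network call makes it possible to test how missing results are handled without spending API quota.
<br><br>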
The new dataset is used to generate the 10 most common venues for each neighborhood
using the Foursquare API. Finally, the **k-means clustering algorithm** is used to cluster similar
neighborhoods together.
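<br><br>
The ranking and clustering steps could look roughly like the sketch below. It starts from a made-up venue table rather than real Foursquare results, uses scikit-learn's `KMeans` as an assumed implementation of k-means, and sets `TOP_N` and the number of clusters to 2 purely to keep the toy example small.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up venue rows standing in for Foursquare results.
venues = pd.DataFrame(
    {
        "Neighborhood": ["N1", "N1", "N1", "N2", "N2", "N2",
                         "N3", "N3", "N3", "N4", "N4", "N4"],
        "Category": ["Cafe", "Cafe", "Pub", "Cafe", "Pub", "Pub",
                     "Park", "Park", "Gym", "Park", "Gym", "Gym"],
    }
)

# One-hot encode the categories, then average per neighborhood to get
# the relative frequency of each venue category.
onehot = pd.get_dummies(venues["Category"]).astype(float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
freq = onehot.groupby("Neighborhood").mean()

TOP_N = 2  # the project uses 10; 2 keeps the toy output readable

# Most frequent categories per neighborhood, highest frequency first.
top_venues = {
    name: row.sort_values(ascending=False).index[:TOP_N].tolist()
    for name, row in freq.iterrows()
}

# Cluster neighborhoods on their category-frequency profiles.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)
clusters = pd.Series(kmeans.labels_, index=freq.index, name="Cluster")

print(top_venues)
print(clusters)
```

Clustering on frequencies rather than raw counts keeps neighborhoods with many venues from dominating those with few.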