# 1. Introduction
## 1.1. Business Problem

According to an [article](https://www.thebalance.com/what-percentage-of-your-income-should-go-to-rent-4688840) in **The Balance**, the 30% rule recommends that your monthly housing costs not exceed 30% of your gross monthly income. Suppose you live in a Los Angeles neighborhood with great venues and attractions, such as international cuisine, entertainment, and shopping. You receive an offer you cannot refuse to work in San Francisco, and you would like to move if you can find a place to live with similar venues and a rent no higher than the 30% rule recommends.
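The 30% rule is simple arithmetic, and a short sketch makes the two directions explicit: the highest rent a given income allows, and the income a given rent requires. The function names and the example figures below are illustrative assumptions, not part of the original analysis.

```python
def max_affordable_rent(gross_monthly_income, percent=30):
    """Return the highest monthly rent allowed by the percent-of-income rule."""
    return gross_monthly_income * percent / 100


def required_income(monthly_rent, percent=30):
    """Return the gross monthly income needed to afford a rent under the rule."""
    return monthly_rent * 100 / percent


budget = max_affordable_rent(10_000)  # hypothetical gross monthly income
needed = required_income(3_000)       # hypothetical monthly rent
print(budget, needed)
```

With these hypothetical inputs, a gross income of 10,000 per month caps rent at 3,000, and a rent of 3,000 calls for an income of 10,000.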

## 1.2. Target Audience
The target audience of this report is anyone who wants to move to a similar neighborhood in another city.
 
 
# 2. Data
## 2.1.1 Los Angeles Crime Data

In [308]:
#importing libraries
import pandas as pd
from geopy.geocoders import Nominatim

The Los Angeles Times neighborhood list provides violent crime counts from Dec. 30, 2019 to June 28, 2020.

In [309]:
la_crimes_df = pd.read_html('http://maps.latimes.com/neighborhoods/violent-crime/neighborhood/list/')

In [310]:
la_crimes_df = la_crimes_df[3]

In [311]:
la_crimes_df.rename(columns={'Total' : 'TotalCrime'}, inplace=True)

In [312]:
la_crimes_df = la_crimes_df[['Neighborhood', 'TotalCrime']]
la_crimes_df

Unnamed: 0,Neighborhood,TotalCrime
0,Chesterfield Square,81
1,Vermont Vista,306
2,Vermont Knolls,238
3,Harvard Park,119
4,Broadway-Manchester,272
...,...,...
204,Bradbury,0
205,West San Dimas,0
206,Avalon,0
207,Agua Dulce,0


## 2.1.2 Los Angeles Average Rent By Neighborhood

In [313]:
la_neighborhoods = pd.read_csv('la_neighborhoods_average_rent.txt', sep=':')

In [314]:
la_neighborhoods.head(5)

Unnamed: 0,Neighborhood,RentPrice
0,Westwood,3915
1,Venice,3356
2,Westchester,3338
3,Financial District,3206
4,South Park,3090


In [315]:
# map crime-data neighborhood names to the names used in the Los Angeles average rent data frame
change_name_dict = {
    'South Los Angeles': ['Willowbrook', 'University Park', 'Baldwin Hills/Crenshaw', 'Leimert Park', 'Westmont', 'Central-Alameda'],
    'Lakeview Terrace' : ['Lake View Terrace'],
    'Central City':['Paramount'],
    'Crenshaw':['Baldwin Hills/Crenshaw'],
    'Mid City':['Mid-City'],
    'Playa Del Rey':['Playa del Rey']
}
def change_neighborhood_name(name, change_name_dict):
    for key in change_name_dict.keys():
        if name in change_name_dict[key]:
            return key   
    return name

la_crimes_df['Neighborhood'] = la_crimes_df['Neighborhood'].apply(change_neighborhood_name, args=(change_name_dict,))

la_crimes_df = la_crimes_df.groupby('Neighborhood').sum()
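Because `change_neighborhood_name` returns the first key whose list contains the name, a crime-data name listed under two keys always goes to the earlier one. In the mapping above, 'Baldwin Hills/Crenshaw' appears under both 'South Los Angeles' and 'Crenshaw', so the 'Crenshaw' entry can never match. The sketch below is one possible check for such overlaps; it uses an excerpt of the mapping, and the helper names are hypothetical.

```python
from collections import defaultdict

# Excerpt of the mapping: rent-table name -> list of crime-data names.
mapping_excerpt = {
    'South Los Angeles': ['Willowbrook', 'University Park', 'Baldwin Hills/Crenshaw',
                          'Leimert Park', 'Westmont', 'Central-Alameda'],
    'Crenshaw': ['Baldwin Hills/Crenshaw'],
    'Mid City': ['Mid-City'],
}

# Invert the mapping so each crime-data name lists every key that claims it.
targets_by_source = defaultdict(list)
for target, sources in mapping_excerpt.items():
    for source in sources:
        targets_by_source[source].append(target)

# Keep only the crime-data names claimed by more than one key.
ambiguous = {s: t for s, t in targets_by_source.items() if len(t) > 1}
print(ambiguous)
```

Any name reported here needs a decision about which rent-table neighborhood should receive its crime count.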

In [316]:
# merging crime data with average rent data
la_neighborhoods = la_neighborhoods.merge(la_crimes_df, on='Neighborhood', how='left')

In [317]:
la_neighborhoods.fillna(0, inplace=True)
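A left merge followed by `fillna(0)` turns every rent-table neighborhood without a matching crime row into a zero-crime neighborhood, which can hide a simple spelling mismatch. One way to surface those rows before filling is the `indicator` option of `merge`. The frames below are toy data with made-up values, used only to illustrate the check.

```python
import pandas as pd

# Toy frames; the names and numbers are illustrative, not the real data.
rent = pd.DataFrame({'Neighborhood': ['Westwood', 'Mid City', 'Playa Del Rey'],
                     'RentPrice': [3915, 2500, 2700]})
crime = pd.DataFrame({'Neighborhood': ['Westwood', 'Mid City', 'Playa Del Ray'],
                      'TotalCrime': [44, 10, 5]})

# indicator=True adds a '_merge' column telling where each row came from.
checked = rent.merge(crime, on='Neighborhood', how='left', indicator=True)

# 'left_only' rows exist in the rent table but found no crime row.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Neighborhood'].tolist()
print(unmatched)
```

Names that show up in `unmatched` are candidates for an extra entry in the renaming dictionary rather than a silent zero.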

Adding geo coordinates to each neighborhood

In [318]:
geolocator = Nominatim(user_agent="foursquare_agent")

In [319]:
import time
import numpy as np

cordinatets = []
for neigborhood in la_neighborhoods['Neighborhood']:
    address = neigborhood + ', Los Angeles, CA'
    try:
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        cordinatets.append((neigborhood, latitude, longitude))
    except Exception:
        # geocode() returned None or the request failed: record missing coordinates
        cordinatets.append((neigborhood, np.nan, np.nan))

    # pause between requests to avoid overloading the geocoding service
    time.sleep(0.5)

In [320]:
cordinatets_df = pd.DataFrame(cordinatets, columns=['Neighborhood', 'Latitude', 'Longitude'])

In [321]:
la_neighborhoods = la_neighborhoods.merge(cordinatets_df, on='Neighborhood', how='left')

In [323]:
la_neighborhoods.head(10)

Unnamed: 0,Neighborhood,RentPrice,TotalCrime,Latitude,Longitude
0,Westwood,3915,44.0,34.066895,-118.439945
1,Venice,3356,140.0,33.995044,-118.466887
2,Westchester,3338,77.0,33.954098,-118.400047
3,Financial District,3206,0.0,34.045762,-118.259305
4,South Park,3090,196.0,-17.542618,-62.874624
5,Palms,3047,51.0,34.024733,-118.411615
6,Sawtelle,2962,50.0,34.036111,-118.450356
7,Playa Vista,2876,8.0,33.97601,-118.418165
8,Brentwood,2845,19.0,34.05214,-118.47407
9,West Los Angeles,2782,13.0,34.046399,-118.448135
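Row 4 of the table above (South Park) has a latitude of -17.54 and a longitude of -62.87, which is nowhere near Los Angeles. A bounding-box check is one simple way to flag such geocoding misses before they reach the venue search. The box limits below are rough assumed values for the Los Angeles area, not figures from the original analysis.

```python
import pandas as pd

# Two rows copied from the table above.
sample = pd.DataFrame({
    'Neighborhood': ['Westwood', 'South Park'],
    'Latitude': [34.066895, -17.542618],
    'Longitude': [-118.439945, -62.874624],
})

# Rough bounding box for the Los Angeles area (assumed values, adjust as needed).
LAT_MIN, LAT_MAX = 33.5, 34.5
LON_MIN, LON_MAX = -119.0, -117.5

# between() is False for NaN, so missing coordinates are flagged as well.
inside = (sample['Latitude'].between(LAT_MIN, LAT_MAX)
          & sample['Longitude'].between(LON_MIN, LON_MAX))
suspect = sample.loc[~inside, 'Neighborhood'].tolist()
print(suspect)
```

Flagged neighborhoods can then be corrected by hand or geocoded again with a more specific address.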


In [325]:
# saving dataframe
la_neighborhoods.to_csv('la_neighborhood.csv', index=False, sep=':')

## 2.1.3 Los Angeles Foursquare Data

In [331]:
# Define Foursquare Credentials and Version
import config
import requests
CLIENT_ID = config.CLIENT_ID # your Foursquare ID
CLIENT_SECRET = config.CLIENT_SECRET # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

Explore Neighborhoods

In [332]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL (config is already imported in the credentials cell)
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            config.CLIENT_ID, 
            config.CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
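The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, so a venue returned with an empty category list would raise an `IndexError`. The sketch below shows a more tolerant way to flatten the items. The helper name `flatten_items`, the `'Unknown'` placeholder, and the hand-made items are assumptions for illustration; only the nested keys mirror the response shape used above.

```python
def flatten_items(name, lat, lng, items):
    """Turn Foursquare 'items' into flat rows, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to a placeholder when no category is present.
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Hand-made items shaped like the response used above (not real API output).
fake_items = [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 34.06, 'lng': -118.44},
               'categories': [{'name': 'Cafe'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 34.07, 'lng': -118.45},
               'categories': []}},
]
rows = flatten_items('Westwood', 34.066895, -118.439945, fake_items)
print(rows)
```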

In [333]:
la_venues = getNearbyVenues(names=la_neighborhoods['Neighborhood'],
                                   latitudes=la_neighborhoods['Latitude'],
                                   longitudes=la_neighborhoods['Longitude']
                                  )

Westwood
Venice
Westchester
Financial District
South Park
Palms
Sawtelle
Playa Vista
Brentwood
West Los Angeles
Hollywood
North Hollywood
Encino
Downtown
Mid-Wilshire
Playa Del Rey
Pico-Robertson
Woodland Hills
Valley Village
Hollywood Hills
Silver Lake
Mar Vista
Studio City
Los Feliz
Echo Park
Chinatown
Chatsworth
Canoga Park
Granada Hills
Northridge
Sylmar
Winnetka
Westlake
Mid City
Eagle Rock
Glassell Park
Van Nuys
Tarzana
Reseda
Atwater Village
Highland Park
El Sereno
Crenshaw
Hyde Park
South Los Angeles
Central City
Lakeview Terrace
Boyle Heights
North Hills
Montecito Heights


In [334]:
la_venues.head(10)

Unnamed: 0,Neighbourhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Westwood,34.066895,-118.439945,UCLA Mildred E. Mathias Botanical Garden,34.064753,-118.440427,Garden
1,Westwood,34.066895,-118.439945,W Los Angeles - West Beverly Hills,34.063275,-118.44095,Hotel
2,Westwood,34.066895,-118.439945,UCLA Court of Sciences,34.068818,-118.442241,Plaza
3,Westwood,34.066895,-118.439945,STK Los Angeles,34.063381,-118.441256,Steakhouse
4,Westwood,34.066895,-118.439945,Blaze Pizza,34.068379,-118.44236,Pizza Place
5,Westwood,34.066895,-118.439945,Urban Outfitters,34.0637,-118.4408,Clothing Store
6,Westwood,34.066895,-118.439945,UCLA Inverted Fountain,34.070051,-118.440762,Fountain
7,Westwood,34.066895,-118.439945,UCLA Cafe Med,34.066094,-118.443508,Café
8,Westwood,34.066895,-118.439945,UCLA Tiverton House,34.063081,-118.442145,Hotel
9,Westwood,34.066895,-118.439945,Coffee Bean & Tea Leaf (UCLA Hillel),34.070221,-118.438355,Coffee Shop


## 2.2.1 San Francisco Crime Data

In [336]:
sf_crime = pd.read_csv('https://data.sfgov.org/api/views/wg3w-h783/rows.csv?accessType=DOWNLOAD')
sf_crime.head()

Unnamed: 0,Incident Datetime,Incident Date,Incident Time,Incident Year,Incident Day of Week,Report Datetime,Row ID,Incident ID,Incident Number,CAD Number,...,SF Find Neighborhoods,Current Police Districts,Current Supervisor Districts,Analysis Neighborhoods,HSOC Zones as of 2018-06-05,OWED Public Spaces,Central Market/Tenderloin Boundary Polygon - Updated,Parks Alliance CPSI (27+TL sites),ESNCAG - Boundary File,"Areas of Vulnerability, 2016"
0,2020/08/15 12:43:00 PM,2020/08/15,12:43,2020,Saturday,2020/08/15 12:58:00 PM,95308704134,953087,200490354,202281583.0,...,58.0,9.0,1.0,7.0,,,,,,2.0
1,2018/01/18 07:00:00 PM,2018/01/18,19:00,2018,Thursday,2018/01/22 04:59:00 PM,64999771000,649997,186068683,,...,,,,,,,,,,
2,2020/08/16 03:13:00 AM,2020/08/16,03:13,2020,Sunday,2020/08/16 03:14:00 AM,95319604083,953196,200491669,202290313.0,...,54.0,2.0,9.0,26.0,,,,,,2.0
3,2020/08/16 03:38:00 AM,2020/08/16,03:38,2020,Sunday,2020/08/16 04:56:00 AM,95326228100,953262,200491738,202290404.0,...,53.0,3.0,2.0,20.0,3.0,,,,,2.0
4,2020/08/15 09:40:00 AM,2020/08/15,09:40,2020,Saturday,2020/08/15 06:21:00 PM,95322706244,953227,206121692,,...,,,,,,,,,,


In [339]:
# take just the columns we need
sf_crime1 = sf_crime[['Analysis Neighborhood','Incident Category', 'Incident Date']]

# drop all null values; .copy() gives an independent frame, so the assignment below does not trigger SettingWithCopyWarning
sf_crime2 = sf_crime1.dropna().copy()

# convert incident date to datetime
sf_crime2['Incident Date'] = pd.to_datetime(sf_crime2['Incident Date'])

# filter to crimes between Dec. 30, 2019 and June 28, 2020 (strict inequalities, so both boundary dates are excluded)
sf_crime3 = sf_crime2.loc[(sf_crime2['Incident Date'] > '2019-12-30') & (sf_crime2['Incident Date'] < '2020-06-28')]

# exclude non-criminal police reports so they do not inflate the crime counts
sf_crime4 = sf_crime3.loc[sf_crime3['Incident Category'] != 'Non-Criminal']

# change the name of 'Analysis Neighborhood' to 'Neighborhood'
sf_crime4 = sf_crime4.rename(columns={'Analysis Neighborhood': 'Neighborhood'})

sf_crime4.head(20)


Unnamed: 0,Neighborhood,Incident Category,Incident Date
66,North Beach,Case Closure,2020-02-27
77,Excelsior,Larceny Theft,2020-03-19
147,Outer Richmond,Larceny Theft,2020-01-30
170,North Beach,Larceny Theft,2020-02-26
180,North Beach,Case Closure,2020-02-28
247,North Beach,Case Closure,2020-02-26
266,North Beach,Larceny Theft,2020-02-28
267,North Beach,Larceny Theft,2020-02-27
307,Financial District/South Beach,Fraud,2020-06-18
383,Financial District/South Beach,Lost Property,2020-05-31


In [360]:
#count the number of crimes in each neighborhood
sf_crime5 = sf_crime4.groupby('Neighborhood', as_index=False).count()
#get rid of Incident Date
sf_crime5.drop(columns = 'Incident Date',inplace = True)
#rename our column to reflect the counts of incidents
sf_crime5.rename(columns={'Incident Category': 'Incidents'}, inplace = True)
#and sort our values
sf_crimes_df = sf_crime5.sort_values(by= ['Incidents'], ascending = False)
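# the count column keeps the name 'Incidents'; DataFrame.replace() has no `columns` argument,
# so use sf_crimes_df.rename(columns={'Incidents': 'TotalCrime'}) if it should match the LA column name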
sf_crimes_df

In [374]:
# map crime-data neighborhood names to the names used in the San Francisco average rent data frame
change_name_dict = {
  'West Of Twin Peaks' : ['West of Twin Peaks'],
  'Downtown' : ['Tenderloin', 'Hayes Valley'],
  'Haight-Ashbury': ['Haight Ashbury'],
  'South Of Market':['South of Market'],
  'Bayview':['Bayview Hunters Point'],
  'Castro-Upper Market': ['Castro/Upper Market'],
  'Financial District': ['Financial District/South Beach']
    
}

def change_neighborhood_name(name, change_name_dict):
    for key in change_name_dict.keys():
        if name in change_name_dict[key]:
            return key   
    return name

sf_crimes_df['Neighborhood'] = sf_crimes_df['Neighborhood'].apply(change_neighborhood_name, args=(change_name_dict,))

sf_crimes_df = sf_crimes_df.groupby('Neighborhood').sum()

## 2.2.2 San Francisco Average Rent By Neighborhood

In [404]:
sf_neighborhoods = pd.read_csv('sf_neighborhoods_average_rent.txt', sep=':')

In [405]:
sf_neighborhoods

Unnamed: 0,Neighborhood,RentPrice
0,Russian Hill,4018
1,Pacific Heights,3560
2,Mission,3528
3,Potrero Hill,3454
4,North Beach,3436
5,Financial District,3400
6,Inner Richmond,3386
7,Castro-Upper Market,3386
8,Bayview,3362
9,Lakeshore,3329


In [406]:
sf_neighborhoods = sf_neighborhoods.merge(sf_crimes_df, on='Neighborhood', how='left')

In [407]:
sf_neighborhoods

Unnamed: 0,Neighborhood,RentPrice,Incidents
0,Russian Hill,4018,1046
1,Pacific Heights,3560,1123
2,Mission,3528,5774
3,Potrero Hill,3454,1011
4,North Beach,3436,1067
5,Financial District,3400,3912
6,Inner Richmond,3386,680
7,Castro-Upper Market,3386,1626
8,Bayview,3362,3678
9,Lakeshore,3329,580


Adding geo coordinates to each neighborhood

In [408]:
sf_cordinatets = []
for neigborhood in sf_neighborhoods['Neighborhood']:
    address = neigborhood + ' San Francisco, California'
    try:
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        sf_cordinatets.append((neigborhood, latitude, longitude))
    except Exception:
        sf_cordinatets.append((neigborhood, np.nan, np.nan))
        
    time.sleep(0.5)

In [409]:
sf_cordinatets_df = pd.DataFrame(sf_cordinatets, columns=['Neighborhood', 'Latitude', 'Longitude'])

In [410]:
sf_neighborhoods = sf_neighborhoods.merge(sf_cordinatets_df, on='Neighborhood', how='left')

In [411]:
sf_neighborhoods

Unnamed: 0,Neighborhood,RentPrice,Incidents,Latitude,Longitude
0,Russian Hill,4018,1046,37.800073,-122.417094
1,Pacific Heights,3560,1123,37.792717,-122.435644
2,Mission,3528,5774,34.519436,-118.24091
3,Potrero Hill,3454,1011,37.759652,-122.398026
4,North Beach,3436,1067,37.801175,-122.409002
5,Financial District,3400,3912,37.793647,-122.398938
6,Inner Richmond,3386,680,37.769825,-122.466087
7,Castro-Upper Market,3386,1626,,
8,Bayview,3362,3678,40.772627,-124.18395
9,Lakeshore,3329,580,37.25215,-119.17374


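It looks like the geolocator fetched wrong coordinates for several neighborhoods: Mission, Bayview, and Lakeshore landed outside San Francisco, and Castro-Upper Market has no coordinates at all.
Let's fix them manually.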

In [428]:
sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'Mission', 'Latitude'] = 37.7599
sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'Mission', 'Longitude'] = -122.4148

sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'Castro-Upper Market', 'Latitude'] = 37.76171
sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'Castro-Upper Market', 'Longitude'] = -122.43512

sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'Bayview', 'Latitude'] = 37.72687
sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'Bayview', 'Longitude'] = -122.38873

sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'Marina', 'Latitude'] = 37.803 
sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'Marina', 'Longitude'] = -122.436

sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'West Of Twin Peaks', 'Latitude'] = 37.751586275
sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'West Of Twin Peaks', 'Longitude'] =  -122.447721511

sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'Lakeshore', 'Latitude'] = 37.7208 
sf_neighborhoods.loc[sf_neighborhoods['Neighborhood'] == 'Lakeshore', 'Longitude'] = -122.4958
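The manual corrections above repeat the same two assignments for every neighborhood. A dictionary of overrides applied in a loop is one possible way to keep the corrected values in a single place. The sketch uses a toy frame with two rows from the table above and an excerpt of the same override values; the names `sf` and `overrides` are hypothetical.

```python
import pandas as pd

# Toy frame with two of the rows shown above (Mission has the wrong coordinates).
sf = pd.DataFrame({
    'Neighborhood': ['Russian Hill', 'Mission'],
    'Latitude': [37.800073, 34.519436],
    'Longitude': [-122.417094, -118.24091],
})

# Manual overrides: neighborhood -> (latitude, longitude), same values as the cell above.
overrides = {
    'Mission': (37.7599, -122.4148),
    'Bayview': (37.72687, -122.38873),  # not in the toy frame, so it changes nothing here
}

for name, (lat, lon) in overrides.items():
    mask = sf['Neighborhood'] == name
    sf.loc[mask, 'Latitude'] = lat
    sf.loc[mask, 'Longitude'] = lon

print(sf)
```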

In [429]:
sf_neighborhoods

Unnamed: 0,Neighborhood,RentPrice,Incidents,Latitude,Longitude
0,Russian Hill,4018,1046,37.800073,-122.417094
1,Pacific Heights,3560,1123,37.792717,-122.435644
2,Mission,3528,5774,37.7599,-122.4148
3,Potrero Hill,3454,1011,37.759652,-122.398026
4,North Beach,3436,1067,37.801175,-122.409002
5,Financial District,3400,3912,37.793647,-122.398938
6,Inner Richmond,3386,680,37.769825,-122.466087
7,Castro-Upper Market,3386,1626,37.76171,-122.43512
8,Bayview,3362,3678,37.72687,-122.38873
9,Lakeshore,3329,580,37.7208,-122.4958


In [430]:
# saving dataframe
sf_neighborhoods.to_csv('sf_neighborhood.csv', index=False, sep=':')

## 2.2.3 San Francisco Foursquare Data

In [431]:
sf_venues = getNearbyVenues(names=sf_neighborhoods['Neighborhood'],
                                   latitudes=sf_neighborhoods['Latitude'],
                                   longitudes=sf_neighborhoods['Longitude']
                                  )

Russian Hill
Pacific Heights
Mission
Potrero Hill
North Beach
Financial District
Inner Richmond
Castro-Upper Market
Bayview
Lakeshore
Western Addition
Noe Valley
South Of Market
Haight-Ashbury
Presidio Heights
Marina
Chinatown
Inner Sunset
Nob Hill
Outer Richmond
West Of Twin Peaks
Downtown


In [432]:
sf_venues.head(10)

Unnamed: 0,Neighbourhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Russian Hill,37.800073,-122.417094,Union Larder,37.798904,-122.419049,Wine Bar
1,Russian Hill,37.800073,-122.417094,Swensen's Ice Cream,37.799168,-122.41916,Ice Cream Shop
2,Russian Hill,37.800073,-122.417094,Za Pizza,37.798571,-122.418955,Pizza Place
3,Russian Hill,37.800073,-122.417094,Okoze Sushi,37.799191,-122.419266,Sushi Restaurant
4,Russian Hill,37.800073,-122.417094,Lombard Street,37.802121,-122.41879,Monument / Landmark
5,Russian Hill,37.800073,-122.417094,Elephant Sushi,37.798623,-122.418939,Sushi Restaurant
6,Russian Hill,37.800073,-122.417094,Alice Marble Tennis Courts,37.801274,-122.419891,Tennis Court
7,Russian Hill,37.800073,-122.417094,Frascati,37.798279,-122.418974,Italian Restaurant
8,Russian Hill,37.800073,-122.417094,Macondray Lane,37.799243,-122.414847,Garden
9,Russian Hill,37.800073,-122.417094,Michelangelo Playground & Community Garden,37.801199,-122.416976,Playground


## 2.3 Methodology

Based on the given data, we will cluster the Los Angeles neighborhoods and then predict which cluster each San Francisco neighborhood belongs to.
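As a rough illustration of this fit-on-one-city, predict-on-the-other workflow, the sketch below clusters a few Los Angeles rows and assigns San Francisco rows to the learned clusters. The source does not name the algorithm or the features, so k-means from scikit-learn, the choice of two clusters, and the use of rent and crime as the only features are all assumptions. The values are copied from the tables earlier in this section, with the San Francisco `Incidents` column renamed to `TotalCrime` for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# A few Los Angeles rows (rent and crime values from the LA table).
la = pd.DataFrame({
    'Neighborhood': ['Westwood', 'Venice', 'Westchester', 'South Park', 'Palms', 'Brentwood'],
    'RentPrice': [3915, 3356, 3338, 3090, 3047, 2845],
    'TotalCrime': [44, 140, 77, 196, 51, 19],
})
# Two San Francisco rows (values from the SF table).
sf = pd.DataFrame({
    'Neighborhood': ['Russian Hill', 'Inner Richmond'],
    'RentPrice': [4018, 3386],
    'TotalCrime': [1046, 680],
})
features = ['RentPrice', 'TotalCrime']

# Scale with statistics learned from LA only, then fit the clusters on LA.
scaler = StandardScaler().fit(la[features])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(la[features]))

la['Cluster'] = model.labels_
# Assign each SF neighborhood to the nearest LA cluster centre.
sf['Cluster'] = model.predict(scaler.transform(sf[features]))
print(sf[['Neighborhood', 'Cluster']])
```

Note that the two crime columns come from different sources with different definitions, so in a real run the features would need to be made comparable before the predicted clusters mean much.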
