## Introduction: Rental Housing in San Francisco

<div class="alert alert-block alert-info" style="margin-top: 20px">
    
1. [Problem and Discussion](#0)<br>
2. [Description of Data and How it will be Used](#1)<br>
</div>
<hr>

### 1. Problem and Discussion<a id="0"></a>

Relocating is often time-consuming. Finding affordable housing in a safe neighborhood with preferred venues nearby is a big challenge. Data science can shorten the search by providing interactive visual tools in a Jupyter notebook. The goal of this project is to provide such a tool for San Francisco, using the **Folium** library to segment and cluster data visually on a map. The notebook lets users tweak a few parameters and shows neighborhood crime counts, rental price ranges, and venues on interactive maps.<br><br>
This project can help tenants who are considering a move to San Francisco, and landlords deciding on a reasonable rent, because interactive maps let users take in the information quickly and intuitively. Foursquare data and mapping techniques, combined with data analysis, provide clustered venues along with rents and crime counts on a single map. Lastly, this project is a good practical case for developing data science skills.<br>

### 2. Description of Data and How it will be Used<a id="1"></a>
<br>

In this Jupyter notebook, the focus area is San Francisco. The notebook uses GeoJSON data from DataSF (https://data.sfgov.org/api/geospatial/pty2-tcw4?method=export&format=GeoJSON) for geographical information, the police department incident reports from DataSF (https://data.sfgov.org/api/views/wg3w-h783/rows.csv?accessType=DOWNLOAD) for crime statistics, and python-craigslist to retrieve the most recent rental posts for the interactive map. The raw Craigslist data is scraped programmatically. The data is used to generate statistics and interactive visual aids for users.<br><br>
Foursquare and geopy data are used to map the top 10 venue categories for every San Francisco neighborhood and to cluster the neighborhoods into groups (as per the course lab). The locations of available rental housing and the crime counts are then mapped, separately and on top of the clustered map, so that nearby venues can be identified. The rental markers display the rent and the URL of the post in their popups. In addition, a box plot and choropleth maps show rent statistics and average rents, respectively, to give a general price trend across the neighborhoods.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

1. [Exploring Datasets with *pandas*](#3)<br>
2. [Downloading and Prepping Data](#4)<br>
3. [Introduction to Folium](#5) <br>
4. [Maps with Markers](#6) <br>
5. [Choropleth Maps](#8) <br>
</div>
<hr>

# Exploring Datasets with *pandas* and Matplotlib<a id="3"></a>

Datasets: 

1. San Francisco Police Department Incidents from 2018 to the present - [Police Department Incidents](https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783) from the San Francisco public data portal. Incidents are derived from the San Francisco Police Department (SFPD) Crime Incident Reporting system. The dataset is updated daily and covers 2018 to the present.

2. San Francisco Neighborhoods - [San Francisco Neighborhoods](https://data.sfgov.org/Geographic-Locations-and-Boundaries/SF-Find-Neighborhoods/pty2-tcw4) from the San Francisco public data portal. Neighborhood boundaries were defined in 2006 by the Mayor's Office of Neighborhood Services for use with the SF Find tool: [SF Planning](http://propertymap.sfplanning.org/?name=sffind). All boundaries are for the purpose of defining general locations of neighborhoods for the SF FIND application only, and as such they are not "hard" lines of demarcation.

# Downloading and Prepping Data <a id="4"></a>

Import Primary Modules:

In [4]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# !conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# !conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install folium
import folium # map rendering library

# !conda install -c conda-forge lxml --yes
!pip install lxml
!pip install python-craigslist --upgrade

print('Libraries imported.')

Requirement already up-to-date: python-craigslist in /opt/conda/envs/Python36/lib/python3.6/site-packages (1.0.8)
Libraries imported.


# Introduction to Folium <a id="5"></a>

Folium is a powerful Python library that helps you create several types of Leaflet maps. The fact that the Folium results are interactive makes this library very useful for dashboard building.

# Maps with Markers <a id="6"></a>


Let's download and import the data on police department incidents using *pandas* `read_csv()` method.

Download the dataset and read it into a *pandas* dataframe, then drop the rows that are missing essential data.

In [5]:
!pip install wget
# quote the URLs so the shell does not treat '&' as a command separator
!wget -q -O 'SF_Find_Neighborhoods.geojson' 'https://data.sfgov.org/api/geospatial/pty2-tcw4?method=export&format=GeoJSON'
!wget -q -O 'Police_Department_Incident_Reports__2018_to_Present.csv' 'https://data.sfgov.org/api/views/wg3w-h783/rows.csv?accessType=DOWNLOAD'

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [6]:
df_incidents = pd.read_csv('Police_Department_Incident_Reports__2018_to_Present.csv')
print('Dataset downloaded and read into a pandas dataframe!')

df_incidents.rename(columns={'Incident Category':'Category','Analysis Neighborhood':'Neighborhood'},inplace=True)

rows = df_incidents[( pd.isna(df_incidents.Neighborhood))].index
df_incidents.drop(rows, inplace=True)
print(df_incidents.shape)
df_incidents.head()



Dataset downloaded and read into a pandas dataframe!
(234195, 36)


Unnamed: 0,Incident Datetime,Incident Date,Incident Time,Incident Year,Incident Day of Week,Report Datetime,Row ID,Incident ID,Incident Number,CAD Number,Report Type Code,Report Type Description,Filed Online,Incident Code,Category,Incident Subcategory,Incident Description,Resolution,Intersection,CNN,Police District,Neighborhood,Supervisor District,Latitude,Longitude,point,SF Find Neighborhoods,Current Police Districts,Current Supervisor Districts,Analysis Neighborhoods,HSOC Zones as of 2018-06-05,OWED Public Spaces,Central Market/Tenderloin Boundary Polygon - Updated,Parks Alliance CPSI (27+TL sites),ESNCAG - Boundary File,"Areas of Vulnerability, 2016"
0,2020/02/03 02:45:00 PM,2020/02/03,14:45,2020.0,Monday,2020/02/03 05:50:00 PM,89881680000.0,898816.0,200085557.0,200342870.0,II,Initial,,75000.0,Missing Person,Missing Person,Found Person,Open or Active,20TH AVE \ WINSTON DR,33719000.0,Taraval,Lakeshore,7.0,37.7269,-122.476039,POINT (-122.47603947349434 37.72694991292525),41.0,10.0,8.0,16.0,,,,,,2.0
1,2020/02/03 03:45:00 AM,2020/02/03,03:45,2020.0,Monday,2020/02/03 03:45:00 AM,89860710000.0,898607.0,200083749.0,200340316.0,II,Initial,,11012.0,Stolen Property,Stolen Property,"Stolen Property, Possession with Knowledge, Re...",Cite or Arrest Adult,24TH ST \ SHOTWELL ST,24064000.0,Mission,Mission,9.0,37.7524,-122.415172,POINT (-122.41517229045435 37.752439644389675),53.0,3.0,2.0,20.0,3.0,,,,,2.0
2,2020/02/03 10:00:00 AM,2020/02/03,10:00,2020.0,Monday,2020/02/03 10:06:00 AM,89867260000.0,898672.0,200084060.0,200340808.0,II,Initial,,64015.0,Non-Criminal,Other,"Aided Case, Injured or Sick Person",Open or Active,MARKET ST \ POWELL ST,34016000.0,Tenderloin,Financial District/South Beach,3.0,37.7846,-122.407337,POINT (-122.40733704162238 37.784560141211806),19.0,5.0,3.0,8.0,,35.0,,,,2.0
4,2020/01/05 12:00:00 AM,2020/01/05,00:00,2020.0,Sunday,2020/02/03 04:09:00 PM,89877370000.0,898773.0,200085193.0,200342341.0,II,Initial,,68020.0,Miscellaneous Investigation,Miscellaneous Investigation,Miscellaneous Investigation,Open or Active,PINE ST \ DIVISADERO ST,26643000.0,Richmond,Pacific Heights,2.0,37.7871,-122.44025,POINT (-122.44024995765258 37.78711245591735),103.0,4.0,6.0,30.0,,,,,,1.0
5,2020/02/03 08:36:00 AM,2020/02/03,08:36,2020.0,Monday,2020/02/03 08:36:00 AM,89876270000.0,898762.0,200083909.0,200340826.0,II,Initial,,68020.0,Miscellaneous Investigation,Miscellaneous Investigation,Miscellaneous Investigation,Open or Active,FRONT ST \ JACKSON ST,24697000.0,Central,Financial District/South Beach,3.0,37.7969,-122.399508,POINT (-122.39950750040278 37.796926429317054),77.0,6.0,3.0,8.0,,,,,,1.0


Let's find out how many entries there are in our dataset.

In [7]:
df_incidents.shape

(234195, 36)

Find the coordinates of San Francisco.

In [None]:
address = 'San Francisco, USA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Francisco are {}, {}.'.format(latitude, longitude))

Incidents are counted for each neighborhood so that the crime counts can be displayed on the map later. Load the latest neighborhood geometry information into a dataframe to map the crime counts.

In [None]:
# print(df_incidents.head())
df_incidents['Count'] = 1 # one row per incident, so summing this column counts incidents
dfc = df_incidents[['Neighborhood','Count']]
df_pd = dfc.groupby(['Neighborhood'],as_index= False).sum()
print(df_pd)

import json # library to handle JSON files

with open('SF_Find_Neighborhoods.geojson') as json_data:
    sf_data = json.load(json_data)
    
neighborhoods_data = sf_data['features']
neighborhoods_data[0]

In [None]:
# Find the centroid of polygons in geojson
def centroid(vertexes):
     _x_list = [vertex [0] for vertex in vertexes]
     _y_list = [vertex [1] for vertex in vertexes]
     _len = len(vertexes)
     _x = sum(_x_list) / _len
     _y = sum(_y_list) / _len
     return(_x, _y)
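As a quick sanity check, the vertex-average approach used by `centroid` can be tried on a small polygon. The sketch below repeats the function so that it runs on its own; the square is made-up data, not taken from the GeoJSON file. Note that this is the mean of the vertices rather than an area-weighted centroid, and a closed GeoJSON ring repeats its first vertex at the end, so that vertex is counted twice. For placing one marker per neighborhood the approximation is probably good enough.

```python
def centroid(vertexes):
    """Average the x and y coordinates of a polygon's vertices."""
    xs = [vertex[0] for vertex in vertexes]
    ys = [vertex[1] for vertex in vertexes]
    return (sum(xs) / len(vertexes), sum(ys) / len(vertexes))

# Hypothetical 2 x 2 square given as [x, y] pairs (illustrative only).
square = [[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]]
center = centroid(square)
print(center)
```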

The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe.

In [None]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [None]:
neighborhood_rows = []
for data in neighborhoods_data:
    # the GeoJSON has no borough field, so the neighborhood name fills both columns
    borough = data['properties']['name']
    neighborhood_name = data['properties']['name']

    # first ring of the first polygon, as [longitude, latitude] pairs
    neighborhood_latlon = data['geometry']['coordinates'][0][0]
    p = centroid(neighborhood_latlon)
    neighborhood_lat = p[1]
    neighborhood_lon = p[0]

    neighborhood_rows.append({'Borough': borough,
                              'Neighborhood': neighborhood_name,
                              'Latitude': neighborhood_lat,
                              'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
neighborhoods = pd.DataFrame(neighborhood_rows, columns=column_names)
neighborhoods.head()

Set up the credentials to retrieve venues from Foursquare.

In [None]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them.
# The variable names are placeholders chosen for this notebook; set them before running.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Credentials loaded for CLIENT_ID: ' + CLIENT_ID)

In [None]:
neighborhood_latitude = neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

First, let's create the GET request URL. Name your URL **url**.

In [None]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL
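The same request URL can also be assembled with the standard library, which takes care of escaping the query parameters. This is only a sketch: `build_explore_url` is a hypothetical helper, and `DEMO_ID` / `DEMO_SECRET` are placeholders rather than real credentials. The endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL from its query parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude
        'radius': radius,                # search radius in meters
        'limit': limit,                  # maximum number of venues returned
    }
    return BASE_URL + '?' + urlencode(params)

# Placeholder credentials and coordinates, for illustration only.
example_url = build_explore_url('DEMO_ID', 'DEMO_SECRET', '20180605', 37.77, -122.42)
print(example_url)
```

`urlencode` percent-encodes the comma in the `ll` value, which the server decodes back to `37.77,-122.42`.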

In [None]:
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
results = requests.get(url).json()
# results

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.


In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # flattened responses use the 'venue.' prefix
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
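To see what `get_category_type` returns, it can be run on a few hand-written rows. The sketch below repeats the function and uses plain dictionaries in place of dataframe rows; the venue categories are invented for illustration and do not come from Foursquare.

```python
def get_category_type(row):
    """Return the name of the first category in a venue row, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows: flattened key, plain key, and a venue without categories.
rows = [
    {'venue.categories': [{'name': 'Coffee Shop'}, {'name': 'Bakery'}]},
    {'categories': [{'name': 'Park'}]},
    {'venue.categories': []},
]
category_names = [get_category_type(row) for row in rows]
print(category_names)
```

Only the first listed category is kept, so a venue tagged as both a coffee shop and a bakery is treated as a coffee shop.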

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

#### Let's create a function to repeat the same process to all the neighborhoods in San Francisco

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [None]:
sf_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

sf_venues.head()

In [None]:
print(sf_venues.shape)
sf_venues.head()

Let's check how many venues were returned for each neighborhood

In [None]:
sf_venues.groupby('Neighborhood').count()

#### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(sf_venues['Venue Category'].unique())))

### Analyze Each Neighborhood

In [None]:
# one hot encoding
sf_onehot = pd.get_dummies(sf_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sf_onehot['Neighborhood'] = sf_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [sf_onehot.columns[-1]] + list(sf_onehot.columns[:-1])
sf_onehot = sf_onehot[fixed_columns]

sf_onehot.head()

And let's examine the new dataframe size.

In [None]:
sf_onehot.shape

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [None]:
sf_grouped = sf_onehot.groupby('Neighborhood').mean().reset_index()
sf_grouped

In [None]:
sf_grouped.shape
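Taking the mean of one-hot columns per neighborhood gives the share of that neighborhood's venues that fall in each category. The pandas-free sketch below computes the same shares with `collections.Counter` on a handful of made-up venues, which may make the meaning of the values in `sf_grouped` easier to see. The neighborhood and category names are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, venue category) pairs.
venues = [
    ('Mission', 'Cafe'), ('Mission', 'Cafe'), ('Mission', 'Bar'), ('Mission', 'Park'),
    ('Marina', 'Gym'), ('Marina', 'Cafe'),
]

# Count how often each category occurs in each neighborhood.
counts = defaultdict(Counter)
for hood, category in venues:
    counts[hood][category] += 1

# Divide by the number of venues in the neighborhood to get relative frequencies.
frequencies = {
    hood: {category: n / sum(counter.values()) for category, n in counter.items()}
    for hood, counter in counts.items()
}
print(frequencies)
```

Each neighborhood's frequencies sum to 1, so neighborhoods with very different venue counts can still be compared.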

#### Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in sf_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = sf_grouped[sf_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # ranks beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = sf_grouped['Neighborhood']

for ind in np.arange(sf_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sf_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
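The column names above rely on the list lookup failing for ranks past the third. An alternative, shown here only as a sketch, is a small helper that computes the ordinal suffix directly. `ordinal` is a hypothetical name, not part of the notebook; it also handles ranks such as 11th or 22nd in case `num_top_venues` is ever raised.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st' (hypothetical helper)."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(rank)) for rank in range(1, num_top_venues + 1)
]
print(columns)
```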

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

sf_grouped_clustering = sf_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sf_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

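sf_merged = neighborhoods

# merge the neighborhood coordinates with the cluster labels and top venues
sf_merged = sf_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# neighborhoods with no returned venues get no cluster label; drop them so the labels stay integers
sf_merged = sf_merged.dropna(subset=['Cluster Labels'])
sf_merged['Cluster Labels'] = sf_merged['Cluster Labels'].astype(int)

sf_merged.head() # check the last columns!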

In [None]:
address = 'San Francisco, USA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Francisco are {}, {}.'.format(latitude, longitude))

Finally, let's visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sf_merged['Latitude'], sf_merged['Longitude'], sf_merged['Neighborhood'], sf_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.

#### Cluster 1

In [None]:
sf_merged.loc[sf_merged['Cluster Labels'] == 0, sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]

#### Cluster 2

In [None]:
sf_merged.loc[sf_merged['Cluster Labels'] == 1, sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]

#### Cluster 3

In [None]:
sf_merged.loc[sf_merged['Cluster Labels'] == 2, sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]

#### Cluster 4

In [None]:
sf_merged.loc[sf_merged['Cluster Labels'] == 3, sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]

#### Cluster 5

In [None]:
sf_merged.loc[sf_merged['Cluster Labels'] == 4, sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]

# Choropleth Maps <a id="8"></a>

Create `Choropleth` maps for crime rate and display clustered venues markers on top. 

In [None]:
#!pip install folium
import folium

sf_geo = r'SF_Find_Neighborhoods.geojson' # geojson file

sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)

threshold_scale = np.linspace(df_pd['Count'].min(),
                              df_pd['Count'].max(),
                              6, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1 # make sure that the last value of the list is greater than the maximum incident count

# Map.choropleth() was deprecated and later removed from folium;
# folium.Choropleth takes the same arguments, with the bin edges passed as `bins`
folium.Choropleth(
    geo_data=sf_geo,
    data=df_pd,
    columns=['Neighborhood', 'Count'],
    key_on='feature.properties.name',
    bins=threshold_scale,
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Crime Rate in San Francisco'
).add_to(sanfran_map)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sf_merged['Latitude'], sf_merged['Longitude'], sf_merged['Neighborhood'], sf_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(sanfran_map)
      
sanfran_map
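The threshold scale used for the choropleth is simply six evenly spaced bin edges between the smallest and largest count, with the last edge bumped by one so that the maximum falls inside the final bin. The sketch below reproduces that arithmetic in plain Python instead of `np.linspace`. `bin_edges` is a hypothetical helper and the counts are invented.

```python
def bin_edges(lowest, highest, n_edges=6):
    """Evenly spaced integer bin edges from lowest to highest (hypothetical helper)."""
    step = (highest - lowest) / (n_edges - 1)
    edges = [int(lowest + i * step) for i in range(n_edges)]  # truncate to integers
    edges[-1] = edges[-1] + 1  # keep the maximum strictly inside the last bin
    return edges

# Made-up minimum and maximum incident counts.
edges = bin_edges(10, 510)
print(edges)
```

Because the edges are evenly spaced, a single neighborhood with a very high count can push most other neighborhoods into the lowest bins. Quantile-based edges would be one alternative to consider.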

In [None]:
import re

def findNeighbors(keys, neighbor):
    for key in keys:
        # print('%s in (%s)' % (key.lower(), neighbor.lower()))
        if key.lower() in neighbor.lower():
            return key
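`findNeighbors` returns the first neighborhood name that appears, ignoring case, inside the free-text location of a post, and `None` when nothing matches. A short run on made-up inputs shows the behavior. The function is repeated so that the sketch runs on its own, and the key list is illustrative rather than the full list from the GeoJSON file.

```python
def findNeighbors(keys, neighbor):
    """Return the first key found (case-insensitively) inside neighbor, else None."""
    for key in keys:
        if key.lower() in neighbor.lower():
            return key
    return None

# Hypothetical neighborhood names and post locations.
keys = ['Marina', 'Cow Hollow', 'Mission']
matches = [
    findNeighbors(keys, 'marina / cow hollow'),
    findNeighbors(keys, 'inner mission'),
    findNeighbors(keys, 'oakland'),
]
print(matches)
```

Because this is substring matching, the order of `keys` matters, and a short name such as `Mission` would also match a location like `mission bay`.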

Collect Craigslist rental posts and scrape the data needed for the later analysis.
The popup label for each marker is constructed here.

In [None]:
!pip install python-craigslist --upgrade

In [None]:
from craigslist import CraigslistHousing
cl_h = CraigslistHousing(site='sfbay', area='sfc', category='sfc/apa',
                         filters={'max_price': 5000, 'min_price': 1000, 
                                  # 'min_bedrooms':1,  'min_bathrooms':1, 'min_ft2': 600, 'private_bath': True, 
                                  'private_room': True})

results = cl_h.get_results(sort_by='newest', geotagged=True, limit=3000)

keys = neighborhoods['Neighborhood'].tolist() 

# define the dataframe columns
column_names = ['Labels', 'Datetime', 'Price', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one dict per post, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
house_rows = []

for result in results:
    house_price = result['price']
    house_date = result['datetime']
    labels = '(%s) %s: %s' % (result['price'], result['datetime'], result['url'])
    house_latlon = result['geotag']
    if house_latlon is None: # skip posts without location data
        continue
    house_lat = house_latlon[0]
    house_lon = house_latlon[1]

    tmp_name = result['where']
    if tmp_name is None: # skip posts without a neighborhood label
        continue
    house_name = findNeighbors(keys, tmp_name)
    if house_name is None: # skip posts whose label matches no SF neighborhood
        continue

    house_rows.append({'Labels': labels,
                       'Datetime': house_date,
                       'Price': house_price[1:], # drop the leading '$'
                       'Neighborhood': house_name,
                       'Latitude': house_lat,
                       'Longitude': house_lon})

houses = pd.DataFrame(house_rows, columns=column_names)
houses.head()

# {
#     'id': '7087525506', 
#     'repost_of': None, 
#     'name': '2Bd 1 Ba in 4Bd 2Ba Apartment', 
#     'url': 'https://sfbay.craigslist.org/sfc/roo/d/san-francisco-2bd-1-ba-in-4bd-2ba/7087525506.html', 
#     'datetime': '2020-03-04 20:46', 
#     'last_updated': '2020-03-04 20:46', 
#     'price': '$4000', 
#     'where': 'marina / cow hollow', 
#     'has_image': True, 
#     'geotag': (37.80558, -122.420139)
# }

In [None]:
houses.tail()

In [None]:
# Save into csv
houses.to_csv('rent.csv')

In [None]:
houses.dtypes
houses["Price"] = houses["Price"].astype(int) # change data type to int
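The conversion above works when every price is a plain digit string once the leading `$` has been removed. If a post ever carried a thousands separator or an empty price (an assumption, since the sample result shown earlier is simply `'$4000'`), `astype(int)` would raise. A more forgiving parser could look like the sketch below. `parse_price` is a hypothetical helper, not part of python-craigslist.

```python
def parse_price(text):
    """Extract the digits from a price string such as '$4,000' (hypothetical helper)."""
    digits = ''.join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None

# Made-up price strings, including the format shown in the sample result.
prices = [parse_price(p) for p in ['$4000', '$4,000', '2500', '']]
print(prices)
```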

In [None]:
# houses = pd.read_csv('rent.csv')
houses.tail()

Count the number of posts in each neighborhood.

In [None]:
# mean of the numeric columns per neighborhood; numeric_only skips text columns such as Labels
dfhm = houses.groupby('Neighborhood', as_index=False).mean(numeric_only=True)
houses['Count'] = 1 # one row per post, so summing this column counts posts
dfhm.tail()

In [None]:
dfh = houses[['Neighborhood','Count']]
dfhs= dfh.groupby('Neighborhood',as_index= False).sum()
dfhn = pd.merge(dfhm, dfhs, on='Neighborhood')
dfhn.tail()

Remove all posts outside San Francisco neighborhoods

In [None]:
keys = neighborhoods['Neighborhood'].tolist() 
# print(keys)
# Pick up only SF neighborhoods
dfhf = houses[houses.Neighborhood.str.contains('|'.join(keys), case=False, regex=True)].reset_index(drop=True)
dfhf.tail()

Rent price statistics

In [None]:
import seaborn as sns
sns.distplot(dfhf['Price'],bins=20)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20,6))
plt.xticks(rotation='vertical')
sns.boxplot(x='Neighborhood', y='Price', data=dfhf)

In [None]:
dfha = dfhn[dfhn.Neighborhood.str.contains('|'.join(keys), case=False, regex=True)].reset_index(drop=True)
dfha#.tail()

Display the mean rent in a choropleth map, with the clustered venue markers superimposed.

In [None]:
sf_geo = r'SF_Find_Neighborhoods.geojson' # geojson file

sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)

threshold_scale = np.linspace(dfha['Price'].min(),
                              dfha['Price'].max(),
                              6, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1 # make sure that the last value of the list is greater than the maximum mean rent

# Map.choropleth() was deprecated and later removed from folium;
# folium.Choropleth takes the same arguments, with the bin edges passed as `bins`
folium.Choropleth(
    geo_data=sf_geo,
    data=dfha,
    columns=['Neighborhood', 'Price'],
    key_on='feature.properties.name',
    bins=threshold_scale,
    fill_color='GnBu', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Mean Value for Rent in San Francisco'
).add_to(sanfran_map)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sf_merged['Latitude'], sf_merged['Longitude'], sf_merged['Neighborhood'], sf_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(sanfran_map)
       
sanfran_map

Limit the number of posts shown on the map to 100.

In [None]:
limit = 100 # display only the newest posts on the map
newRents = houses.iloc[0:limit, :]

Display the newest 100 Craigslist posts and the venue clusters on top of the crime choropleth map.

In [None]:
sf_geo = r'SF_Find_Neighborhoods.geojson' # geojson file

sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)

threshold_scale = np.linspace(df_pd['Count'].min(),
                              df_pd['Count'].max(),
                              6, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1 # make sure that the last value of the list is greater than the maximum incident count

# Map.choropleth() was deprecated and later removed from folium;
# folium.Choropleth takes the same arguments, with the bin edges passed as `bins`
folium.Choropleth(
    geo_data=sf_geo,
    data=df_pd,
    columns=['Neighborhood', 'Count'],
    key_on='feature.properties.name',
    bins=threshold_scale,
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Crime Rate in San Francisco'
).add_to(sanfran_map)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sf_merged['Latitude'], sf_merged['Longitude'], sf_merged['Neighborhood'], sf_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(sanfran_map)
      
from folium import plugins
# instantiate a marker cluster object for the rental posts in the dataframe
hs = plugins.MarkerCluster().add_to(sanfran_map)

# loop through the dataframe and add each rental post to the marker cluster
for lat, lng, label in zip(newRents.Latitude, newRents.Longitude, newRents.Labels):
    folium.Marker(
        location=[lat, lng],
        icon=folium.Icon(color='green', icon='info-sign'),
        popup=label,
    ).add_to(hs)

# display map
sanfran_map

<hr>

Copyright &copy; 2020. This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).