# Capstone Project - Toronto House Hunting
### Applied Data Science Capstone by John Byabazaire

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

A house, or simply a home is a place everyone longs for at the end of hectic working day. Today with the ravages of the COVID-19 still fresh in our minds, the home has been turned into the new office of the twenty first century. Toady we work and stay at home. It therefore becomes very important to carefully choose the best place that might work as both as a home and a work place, and for those working outside of their homes, it’s also paramount to choose a safe neighborhood with good commuting distance.

Toronto like many developed cities in the world , house hunting is a full time job. Even with agents, at least you will have to go in for viewing. Also many listing websites do not provide enough details for one to make an informed decision. For example which neighborhood is safe. Though provide a map view of the properties, these luck comparisons amongst neighborhoods. As a new young professional

As a young professional with a limited budget, it becomes hard to find a good house that make a good home in a convenient neighborhood without having to arrange hundreds of viewings. This assignment, I will cluster house/apartment give a commute distance, crime rate of the neighborhoods, and the amenities around neighborhoods to filter house and reduce of the number of viewing.

## Dataset <a name="data"></a>

The first step will be to write a scraper to collect some basic information:
<ol>
    <li> Property address</li>
    <li> Property price</li>
    <li> Number of rooms</li>
    <li> Number of bathrooms</li>
    <li> Property type</li>
</ol>
    To obtain this data, I will scrap <a href='https://www.point2homes.com/CA/Real-Estate-Listings/ON/Toronto.html' target='_BLANK'>LINK</a> that list hundreds pf properties across Canada.

Enhancing the Data
<ol>
    <li> Use geopy library, I will retrieve the latitude and longitude coordinates using the property address.</li>
    <li>Using the foursquare API, I will retrieve the amenities around each property</li>
    <li> I will also use the crimes data to see which neighborhood has the highest crime rates.</li>
</ol>

Now I will use beautifulsoup to properties listed

In [1]:
# Import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np

In [3]:
# Initialize an array to hold the entries
properties = []
for page_num in range(1, 31): # we run through the first 30 pages 
    link = "https://www.point2homes.com/CA/Real-Estate-Listings/ON/Toronto.html?page="+str(page_num)
    html_data = requests.get(link)
    beautiful_soup = BeautifulSoup(html_data.content, 'html.parser')
    
    # within the body of each hyml doc, we take the listings
    for row in beautiful_soup.find("body").find_all('div', class_ = "listings"):
        # In each listing, we take the article tag
        each_article = row.find_all("article")
        for details_div in each_article:
            # the content we are interested in are in item-right-details-cnt div
            div_contents = details_div.find_all('div', class_= "item-right-details-cnt")
            for result_set in div_contents:
                price = 0
                price_div = details_div.find('div', class_= "item-right-utils-cnt").find_all('span')
                if len(price_div ) > 0:
                    price_div = price_div[1]
                    price = (price_div.text.split('\n')[1]).strip()
                    price = price.replace('$', '').replace(',', '').replace(' CAD', '')
                temp_array = result_set.text.split('\n')
                # clean up the entries
                cleaned_data = []
                for val in temp_array:
                    if val == '':
                        next
                    else:
                        cleaned_data.append(val.strip())
            properties.append({
                "Address":cleaned_data[1], 
                "number_of_bed":cleaned_data[3], 
                "number_of_bath":cleaned_data[4], 
                "property_kind":cleaned_data[-2],
                "price":price
            })
#properties

In [122]:
# create a dataframe with the return entries
properties_df = pd.DataFrame(properties)
properties_df.head()

Unnamed: 0,Address,number_of_bed,number_of_bath,property_kind,price
0,"186 Gooch Ave, Toronto, Ontario",3 BedsBds,2 BathsBa,Residential,0
1,"7 Grenville St, Toronto, Ontario",2 BedsBds,2 BathsBa,Residential,838000
2,"51 East Liberty St E, Toronto, Ontario",1 BedBd,2 BathsBa,Residential,659000
3,"35 Mariner Terr, Toronto, Ontario",1 BedBd,1 BathBa,Residential,749900
4,"4968 Yonge St, Toronto, Ontario",2 BedsBds,1 BathBa,Residential,598000


Let's clean up our data abit, we will perform the following operations
* Remove the text in the <b>number_of_bed</b> and <b>number_of_bath</b> columns to make sure they numeric
* Convert the <b>property_kind</b> column to numeric categories
* Drop rows where the price is not listed or is zero

In [123]:
# some of the entries are blank and some have the wrong units, will drop those
def process_num_bedrooms(val):
    # check if the entry is empty
    if len(val) != 0:
        # slipt the val by space
        row_val = val.split(' ')
        # we take only those that have have the last past as either BedsBds or BedBd
        if row_val[1] == 'BedsBds' or row_val[1] == 'BedsBd':
            return(row_val[0])
        else:
            return(np.nan)
    else:
        return(np.nan)

In [124]:
# some of the entries are blank and some have the wrong units, will drop those
def process_num_bath(val):
    # check if the entry is empty
    if len(val) != 0:
        # slipt the val by space
        row_val = val.split(' ')
        # Igonore text only entries
        if len(row_val) > 1:
            # we take only those that have have the last past as either BathsBa or BathBa
            if row_val[1] == 'BathBa' or row_val[1] == 'BathsBa':
                return(row_val[0])
            else:
                return(np.nan)
        else:
            return(np.nan)
    else:
        return(np.nan)

In [125]:
properties_df['number_of_bed'] = properties_df['number_of_bed'].apply(process_num_bedrooms)
properties_df['number_of_bath'] = properties_df['number_of_bath'].apply(process_num_bath)

In [128]:
properties_df.head()

Unnamed: 0,Address,number_of_bed,number_of_bath,property_kind,price
0,"186 Gooch Ave, Toronto, Ontario",3.0,2,Residential,0
1,"7 Grenville St, Toronto, Ontario",2.0,2,Residential,838000
2,"51 East Liberty St E, Toronto, Ontario",,2,Residential,659000
3,"35 Mariner Terr, Toronto, Ontario",,1,Residential,749900
4,"4968 Yonge St, Toronto, Ontario",2.0,1,Residential,598000


Check how many categoris we have for the property_kind

In [126]:
properties_df['property_kind'].value_counts()

Residential      374
Condominium      313
Single Family     31
Multi-family       2
Name: property_kind, dtype: int64

We will perform the following operations on this column;
* Drop the <b>Multi-family</b> entry since its only one and we are only look for small family accomodation
* Convert the rest to numeric categories as folows;
<ol>
<li>Residential => 1</li>
<li>Condominium => 2</li>
<li>Single Family => 3</li>

In [10]:
def process_property_kind(val):
    if val == 'Residential':
        return 1
    elif val == 'Condominium':
        return 2
    elif val == 'Single Family':
        return 3
    else:
        return np.nan

In [11]:
properties_df['property_kind'] = properties_df['property_kind'].apply(process_property_kind)

Drop all rows where the price is zero

In [130]:
# create a filter
price_fliter = properties_df['price'] == 0

In [131]:
properties_df = properties_df.loc[~price_fliter]

In [132]:
properties_df.head()

Unnamed: 0,Address,number_of_bed,number_of_bath,property_kind,price
1,"7 Grenville St, Toronto, Ontario",2.0,2,Residential,838000
2,"51 East Liberty St E, Toronto, Ontario",,2,Residential,659000
3,"35 Mariner Terr, Toronto, Ontario",,1,Residential,749900
4,"4968 Yonge St, Toronto, Ontario",2.0,1,Residential,598000
5,"131 Markham St, Toronto, Ontario",2.0,3,Residential,1089000


In [133]:
# drop all entries with NaN
properties_df.dropna(axis=0, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  properties_df.dropna(axis=0, inplace=True)


In [134]:
properties_df.shape

(349, 5)

In [135]:
#Our clean dataset has 348 entries
properties_df.head()

Unnamed: 0,Address,number_of_bed,number_of_bath,property_kind,price
1,"7 Grenville St, Toronto, Ontario",2,2,Residential,838000
4,"4968 Yonge St, Toronto, Ontario",2,1,Residential,598000
5,"131 Markham St, Toronto, Ontario",2,3,Residential,1089000
6,"115 Long Branch Ave, Toronto, Ontario M8W0A9",2,3,Condominium,799000
7,"523B ROYAL YORK RD, Toronto, Ontario M8Y2S5",3,4,Single Family,1269900


#### Use geopy library to get the latitude and longitude for each address.
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.


In [30]:
#!pip install geocoder
import geocoder # import geocoder

In the cell below, I will loop the the dataframe I create in the above section to get the cordinates for each postcode
and create a dataframe 

In [39]:
address_cor = pd.DataFrame(columns=['Address','Latitude', 'Longitude'])

# loop through the dataframe to get each postcode in turn
for index, row in properties_df.iterrows():
    address = row['Address']
    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    if location is not None:
        latitude = location.latitude
        longitude = location.longitude
    else:
        latitude = np.nan
        longitude = np.nan
        
    address_cor = address_cor.append({
        "Address":address,
        "Latitude":latitude,
        "Longitude":longitude
    }, ignore_index=True)

In [111]:
# use pandas merge to join the dataframe on the properties df
properties_cor_df = pd.merge(properties_df, address_cor, on="Address")
properties_cor_df.head()

Unnamed: 0,Address,number_of_bed,number_of_bath,property_kind,price,Latitude,Longitude
0,"7 Grenville St, Toronto, Ontario",2,2,1.0,838000,43.661925,-79.383919
1,"4968 Yonge St, Toronto, Ontario",2,1,1.0,598000,43.765582,-79.412148
2,"131 Markham St, Toronto, Ontario",2,3,1.0,1089000,43.651292,-79.406717
3,"115 Long Branch Ave, Toronto, Ontario M8W0A9",2,3,2.0,799000,,
4,"523B ROYAL YORK RD, Toronto, Ontario M8Y2S5",3,4,3.0,1269900,,


In [112]:
#Drop all places without latitude and longtitude
# drop all entries with NaN
properties_cor_df.dropna(axis=0, inplace=True)

In [113]:
properties_cor_df.shape

(138, 7)

In [73]:
#We remain with 138 entries with no missing values
properties_cor_df.head()

Unnamed: 0,Address,number_of_bed,number_of_bath,property_kind,price,Latitude,Longitude
0,"7 Grenville St, Toronto, Ontario",2,2,1.0,838000,43.661925,-79.383919
1,"4968 Yonge St, Toronto, Ontario",2,1,1.0,598000,43.765582,-79.412148
2,"131 Markham St, Toronto, Ontario",2,3,1.0,1089000,43.651292,-79.406717
6,"100 Upper Madison Ave, Toronto, Ontario",2,2,1.0,799009,43.764215,-79.412723
7,"248 Seaton St, Toronto, Ontario",4,4,1.0,1299000,43.660806,-79.370509


In [77]:
# convert price column to float
properties_cor_df['price'] = properties_cor_df['price'].astype(float)

In [104]:
properties_cor_df['price_cat'] = pd.cut(properties_cor_df['price'], 3, labels=[1, 2,3])

In [101]:
properties_cor_df['price_cat'].value_counts()

1    112
3     13
2     13
Name: price_cat, dtype: int64

#### Create a map of Toronto with properties superimposed on top.

In [96]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [93]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.6534817, -79.3839347.


In [94]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, adress in zip(properties_cor_df['Latitude'], properties_cor_df['Longitude'], properties_cor_df['Address']):
    label = '{}'.format(adress)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Visualize the properties according to price range

In [105]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(3)
ys = [i + x + (i*x)**2 for i in range(3)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lng, adress, price_cat in zip(properties_cor_df['Latitude'], properties_cor_df['Longitude'], properties_cor_df['Address'], properties_cor_df['price_cat']):
    label = '{}'.format(adress)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[price_cat-1],
        fill=True,
        fill_color=rainbow[price_cat-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Visualize the properties according to property_kind

In [110]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(2)
ys = [i + x + (i*x)**2 for i in range(2)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lng, adress, property_kind in zip(properties_cor_df['Latitude'], properties_cor_df['Longitude'], properties_cor_df['Address'], properties_cor_df['property_kind']):
    label = '{}'.format(adress)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[property_kind-1],
        fill=True,
        fill_color=rainbow[property_kind-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [109]:
 properties_cor_df['property_kind'] =  properties_cor_df['property_kind'].astype(int)