# Final Project

An application to decide where to move your family and live based not only on the cost of housing, but also on
* opportunities for yourself
* opportunities for your immediate family (spouse, parents)
* fueling your childrens' ambitions and creating opportunities for them

Let us start with importing all the libraries we need.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium 
from geopy.geocoders import Nominatim
from pandas.io.json import json_normalize

And the credentials needed for working with Foursquare API.

In [163]:
CLIENT_ID = 'your Foursquare ID'
CLIENT_SECRET = 'your Foursquare Secret'
ACCESS_TOKEN = 'your FourSquare Access Token'
VERSION = '20180604'
LIMIT = 50

## Part 1 - Using Foursquare API to identify interesting locations based on specific categories of interest

Now we will write a function to create a Pandas dataframe from a list of addresses, with a chosen name, latitude and longitude as columns. Imagine these are some of the addresses that fit your budget or some of the available houses for sale or rent and recommended by your friends. You do not want to use just one criteria (cost of the house or rent) to decide on the location to stay, rather you would want to see the "fit" with your other interests (weekend activities) or necessities (grocery shop to find work in, or a special school to send your kid to, etc.).

We will just use the following addresses (completely searched in Google, random but existing) as samples:

* 927 Jackson St, Mountain View, CA 94043
* 2200 Monroe St, Santa Clara, CA
* 668 Madrone Ave, Sunnyvale, CA
* 10755 Tressler Ct, Cupertino, CA
* 328 Hill St, San Francisco, CA 94114
* 2424 Palm Ave, Redwood City, CA 94061
* 3159 Kenland Dr, San Jose, CA 95111
* 916 Cameron Cir, Milpitas, CA 95035
* 4041 Crestwood St, Fremont, CA 94538
* 4400 Canterbury Way, Union City, CA 94587
* 2219 Curtis St, Oakland, CA 94607
* 1801 University Ave, Berkeley, CA 94703
* 2421 Lincoln Ave, Richmond, CA 94804
* 1415 Speers Ave, San Mateo, CA 94403
* 855 Standish Rd, Pacifica, CA 94044
* 1170 Paula Dr, Campbell, CA 95008
* 222 Johnson Ave Los Gatos, CA 95030

We will use geopy's geolocator function to fetch the geographical coordinates (latitude, longitude) corresponding to each address in the input address list, and create a Pandas dataframe that would contain the address, latitude and longitude.

In [3]:
my_addr_lst = ['927 Jackson St, Mountain View, CA 94043'
    , '2200 Monroe St, Santa Clara, CA'
    , '668 Madrone Ave, Sunnyvale, CA'
    , '10755 Tressler Ct, Cupertino, CA'
    , '328 Hill St, San Francisco, CA 94114'
    , '2424 Palm Ave, Redwood City, CA 94061'
    , '3159 Kenland Dr, San Jose, CA 95111'
    , '916 Cameron Cir, Milpitas, CA 95035'
    , '4041 Crestwood St, Fremont, CA 94538'
    , '4400 Canterbury Way, Union City, CA 94587'
    , '2219 Curtis St, Oakland, CA 94607'
    , '1801 University Ave, Berkeley, CA 94703'
    , '2421 Lincoln Ave, Richmond, CA 94804'
    , '1415 Speers Ave, San Mateo, CA 94403'
    , '855 Standish Rd, Pacifica, CA 94044'
    , '1170 Paula Dr, Campbell, CA 95008'
    , '222 Johnson Ave Los Gatos, CA 95030']

In [5]:
def get_addr_df(addr_lst):
    lst = []
    geolocator = Nominatim(user_agent='my_explorer')
    for i in addr_lst:
        location = geolocator.geocode(i)
        lst.append((i, location.latitude, location.longitude))
    df = pd.DataFrame(data=lst, columns=['Address', 'Latitude', 'Longitude'])
    return df


In [6]:
my_addr_df = get_addr_df(my_addr_lst)

In [7]:
my_addr_df.head()

Unnamed: 0,Address,Latitude,Longitude
0,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975
1,"2200 Monroe St, Santa Clara, CA",37.335526,-121.943685
2,"668 Madrone Ave, Sunnyvale, CA",37.393987,-122.025583
3,"10755 Tressler Ct, Cupertino, CA",37.312231,-122.060524
4,"328 Hill St, San Francisco, CA 94114",37.755851,-122.428753


In [8]:
type(my_addr_df['Address'])

pandas.core.series.Series

Now, we would write a function to use the list of addresses we have short-listed to gather the list of venues belonging to specific categories of interest to us. The categories could be, for example, 'Grocery store' or 'Swimming Pool', etc. Note that we would be using Foursquare API in order to gather a list of venues around our short-listed addresses. 

**Some Terminology Alert:** Since we are using a list of addresses to find another list of addresses, it might get confusing. In order to avoid the confusion, we shall refer to our short-listed list of addresses as "anchor" locations or addresses (as we are going to be anchored in an address and the list of venues Foursquare API would return us would be called "target" locations or addresses, since it is the target given our categories of interest. So, we only have our "anchor" addresses in the form of a Pandas dataframe so far (created from a list of anchor address strings).

Our function would be called *get_cool_places* and would have the following signature:
<br>
**Inputs:** 
* anchor addresses (Pandas Series)
* anchor latitudes (Pandas Series)
* anchor longitudes (Pandas Series)
* radius (in m, large or small depending on whether you want it to be bikeable, drivable or walkable)
* number of venues (on experimentation, it seems that anything above 300 doesn't make any difference, may be some Foursquare free API limit)

Our function would form the request string to make a request using Foursquare API credentials and would populate a Pandas dataframe using the response from Foursquare API and return the dataframe.

**Output:**
* Pandas dataframe with both anchor addresses, target addresses and venue (target) category

In [9]:
def get_cool_places(neighborhoods, latitudes, longitudes,
                              radius=200, 
                              venue_num=300):
    # placeholder for collecting interesting venues
    venues_list=[]
    
    # loop around the anchored locations (neighbourhood, lat, lng)
    for name, lat, lng in zip(neighborhoods, latitudes, longitudes):
        # count venue_id from 0 .. venue_num
        # 50 at a time
        venue_id = 0
        while (venue_id < venue_num + 50):
            # construct request url
            request_url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&offset={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            venue_id,
            LIMIT)
        # send the request
            results = requests.get(request_url).json()['response']['groups'][0]['items']
        # filter the results to capture fields of interest
            for v in results:
                venues_list.append([(
                        name, 
                        lat, 
                        lng, 
                        v['venue']['name'], 
                        v['venue']['location']['lat'], 
                        v['venue']['location']['lng'],
                        v['venue']['location']['formattedAddress'][0],
                        v['venue']['categories'][0]['name'])])
            venue_id = venue_id + 50
    venues_list = list(venues_list)
    interesting_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    interesting_venues.columns = ['Anchor_Address', 
                      'Anchor_Latitude', 
                      'Anchor_Longitude', 
                      'Venue', 
                      'Venue_Latitude', 
                      'Venue_Longitude',
                      'Venue_Address',
                      'Venue_Category']
    print('Finished collecting all cool places. Check out the dataframe')
    return(interesting_venues)

In [10]:
cool_places = get_cool_places(my_addr_df['Address'], 
                              my_addr_df['Latitude'], 
                              my_addr_df['Longitude'], 
                              radius=60000, 
                              venue_num=300)

Finished collecting all cool places. Check out the dataframe


Let us take a look at the dataframe our function generated for us.

In [11]:
cool_places.shape

(4096, 8)

As we see above, we have both anchor addresses and venue (target) addresses (regular format and geographical coordinates)

Now, you may have a different set of categories to filter for and I might have a different set of categories. We can write down our categories and filter the dataframe to show us the interesting places that meet those categories. We can write a small function `get_interesting_places()` that takes the dataframe generated by `get_cool_places()` and a python list of categories and returns a filtered dataframe.

In [12]:
def get_interesting_places(places_df, my_categories):
    pat = ''
    n = len(my_categories)
    for i in range(n-1):
        pat = pat + my_categories[i] + '|'
    pat = pat + my_categories[n-1]
    return places_df[places_df['Venue_Category'].str.contains(pat, case=False, regex=True)]

Let us check how it works out. Suppose I want to look for 'Restaurant' and 'market' as I would like to look for a part-time job in these places, I would form a python list mentioning these categories. Let us call this list `my_work_categories`.

In [13]:
my_work_categories = ['Restaurant', 'market']

Now, we can call our function `get_interseting_places()` and provide the above list of categories to filter the dataframe.

In [14]:
my_work_places = get_interesting_places(cool_places, my_work_categories)

In [15]:
my_work_places.head(15)

Unnamed: 0,Anchor_Address,Anchor_Latitude,Anchor_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Address,Venue_Category
5,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Zareen's,37.426834,-122.144179,365 California Ave,Indian Restaurant
14,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Luna Mexican Restaurant,37.333935,-121.91518,1495 The Alameda,Mexican Restaurant
16,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Ping's Bistro 留湘,37.575651,-122.044134,34145 Fremont Blvd,Hunan Restaurant
30,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,The Barn,37.493752,-122.453976,3068 Cabrillo Hwy N (Mirada),American Restaurant
56,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Grand Lake Farmers Market,37.810449,-122.24786,Splash Pad Park (at Grand Ave & Lake Park),Farmers Market
97,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Señor Sisig,37.75715,-122.421304,990 Valencia St,Filipino Restaurant
115,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Cholita Linda,37.836174,-122.262782,4923 Telegraph Ave (at 49th St),Latin American Restaurant
132,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,L'Ardoise,37.766675,-122.433261,151 Noe St (at Henry St),French Restaurant
134,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Ferry Plaza Farmers Market,37.795015,-122.392936,1 Ferry Building (Market Street & The Embarcad...,Farmers Market
141,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Ferry Building (Ferry Building Marketplace),37.795538,-122.393473,1 Ferry Building (at The Embarcadero),Market


We see that it has pulled in several kinds of restaurants that had the string 'Restaurant' in it. More interestingly, it pulled out 'Supermarket' and 'Market' as both contain the sub-string 'market' that we were looking for.

In [16]:
my_leisure_categories = ['gym', 'pool', 'trail']

In [17]:
my_leisure_places = get_interesting_places(cool_places, my_leisure_categories)

In [18]:
my_leisure_places.head(15)

Unnamed: 0,Anchor_Address,Anchor_Latitude,Anchor_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Address,Venue_Category
0,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Stevens Creek Trail,37.396048,-122.070745,El Camino Real,Trail
7,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Stanford Dish Trail,37.410775,-122.163193,2390 Stanford Ave (at Junipero Serra Blvd),Trail
19,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Purisima Creek Redwoods Open Space Preserve,37.450102,-122.338743,3690 Higgins Canyon Rd (Purisima Creek Rd),Trail
21,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,iLoveKickboxing,37.24931,-121.83106,5681 Snell Avenue,Boxing Gym
22,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Sawyer Camp Trail,37.531434,-122.365101,Crystal Springs Rd (at Skyline Blvd),Trail
27,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Half Moon Bay Coastal Trail,37.465415,-122.444421,from Pillar Point Harbor Blvd (to Miramontes P...,Trail
49,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Roberts Regional Recreation Area,37.812258,-122.174253,10570 Skyline Blvd,Trail
50,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Redwood Regional Park,37.8167,-122.166553,7867 Redwood Rd,Trail
60,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Dogpatch Boulders,37.756625,-122.38797,2573 3rd St (btw 22nd and 23rd),Climbing Gym
74,"927 Jackson St, Mountain View, CA 94043",37.398051,-122.07975,Redwood Park - Skyline Gate,37.831743,-122.185511,8500 Skyline Blvd (Pine Hills),Trail


Suppose I have short-listed an address to live in but I forgot to check if there are Vegan restaurants around, as I am a new Vegan (new year resolution, you see).

In [19]:
my_new_addr = my_addr_lst[6]
my_restrictions = ['Vegan', 'Vegetarian']

In [20]:
def get_interesting_places_around_addr(addr, places_df, my_categories):
    pat = ''
    n = len(my_categories)
    for i in range(n-1):
        pat = pat + my_categories[i] + '|'
    pat = pat + my_categories[n-1]
    return places_df[(places_df['Anchor_Address']==addr) & places_df['Venue_Category'].str.contains(pat, case=False, regex=True)]

In [21]:
my_food_places = get_interesting_places_around_addr(my_new_addr, cool_places, my_restrictions)

In [22]:
test = get_interesting_places_around_addr(my_new_addr, cool_places, ['Bar', 'Restaurant', 'Cafe'])
test.head()

Unnamed: 0,Anchor_Address,Anchor_Latitude,Anchor_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Address,Venue_Category
1451,"3159 Kenland Dr, San Jose, CA 95111",37.28712,-121.842567,Souvlaki Greek Skewers,37.309358,-121.887078,577 W Alma Ave,Greek Restaurant
1452,"3159 Kenland Dr, San Jose, CA 95111",37.28712,-121.842567,Gio Cha Duc Huong Sandwich,37.331458,-121.854906,1020 Story Rd,Vietnamese Restaurant
1454,"3159 Kenland Dr, San Jose, CA 95111",37.28712,-121.842567,Pho Hà Nôi,37.332473,-121.858145,"San Jose, CA 95122",Vietnamese Restaurant
1455,"3159 Kenland Dr, San Jose, CA 95111",37.28712,-121.842567,Luna Mexican Restaurant,37.333935,-121.91518,1495 The Alameda,Mexican Restaurant
1467,"3159 Kenland Dr, San Jose, CA 95111",37.28712,-121.842567,In-N-Out Burger,37.35024,-121.921872,550 Newhall Dr (at Coleman Ave),Fast Food Restaurant


In [23]:
my_food_places

Unnamed: 0,Anchor_Address,Anchor_Latitude,Anchor_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Address,Venue_Category
1632,"3159 Kenland Dr, San Jose, CA 95111",37.28712,-121.842567,Blossom Vegan Restaurant,37.69981,-121.869719,4000 Pimlico Dr #112,Vegetarian / Vegan Restaurant


We can also plot the interesting places in a map for better visual clarity. Again, let us write a small function `show_places_on_map()` that usues the wonderful Folium library to mark places on the map. For clarity, we shall mark the anchor address in 'red' and target locations in blue circles with yellow borders

In [58]:
def show_places_on_map(addr, df):
    geolocator = Nominatim(user_agent='map_explorer')
    location = geolocator.geocode(addr)
    addr_lat = location.latitude
    addr_lng = location.longitude
    smap = folium.Map(location=[addr_lat, addr_lng], zoom_start=10)
    
    folium.CircleMarker(
    [addr_lat, addr_lng],
    radius=3,
    color='red',
    popup='My Home',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
    ).add_to(smap)
    for lat, lng, v, label in zip(df.Venue_Latitude, 
                           df.Venue_Longitude,
                           df.Venue,
                           df.Venue_Category):
        folium.CircleMarker(
                    [lat, lng],
                    radius=5,
                    color='yellow',
                    popup=v,
                    fill = True,
                    fill_color='blue',
                    fill_opacity=0.6
                ).add_to(smap)
    return smap

For example if we want to see work places near address `my_new_addr`

In [59]:
r_map = show_places_on_map(my_new_addr, my_work_places)

In [60]:
r_map

## Part 2 - Estimating house prices around interesting locations

In this exercise, we would use Redfin dataset to estimate current house prices around interesting locations. Redfin website allows one to download the current listings as a csv file ('fin_df.csv'). The data for this exercise was downloaded in that manner. Let us start by importing all the necessary modules.

In [90]:
# import all the required libraries

# Pandas, numpy for working with 
# dataframes and arrays
import numpy as np

# scikit-learn for preprocessing, regressions
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion
from sklearn.metrics import mean_squared_error
from pandas.plotting import scatter_matrix

# utility libraries
import hashlib
import random

Let us read the data into a data-frame and capture only interesting features (in particular we leave out the house listing number, city, URL for the listing, etc.).

In [62]:
df = pd.read_csv('fin_df.csv')
columns = ['property_type', 'zipcode', 'beds', 'baths',
          'sq_ft', 'year_built', 'lot_size','latitude', 'longitude', 'price']
housing = df[columns]

Find out the columns that have missing (NaN) values, so that we can impute those values

In [63]:
housing.columns[housing.isna().any()].tolist()

['zipcode', 'beds', 'baths', 'sq_ft', 'year_built', 'lot_size']

In [64]:
housing.columns[1:7]

Index(['zipcode', 'beds', 'baths', 'sq_ft', 'year_built', 'lot_size'], dtype='object')

Some zipcodes are hyphenated. Here, we ignore the portion after the hyphen of those zipcodes, for simplicity

In [66]:
housing['zipcode'] = int(str(housing['zipcode'])[0:5])
housing.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  housing['zipcode'] = int(str(housing['zipcode'])[0:5])


(6544, 10)

Let us remove the listings which have the value '0.0' for number of bedrooms

In [67]:
housing = housing[housing.beds != 0.0]

Since we have missing values, we use scikit-learn's SimpleImputer to impute the missing values. We chose the "most_frequent" strategy, rather than 'mean'

In [68]:
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp = imp.fit(housing.iloc[:,1:7])
housing.iloc[:,1:7] = imp.transform(housing.iloc[:, 1:7])

Let us write a couple of utility functions to split our dataset into training and test sets

In [69]:
# use the id value, hash it and use this for splitting
# the dataset into training and test sets, so that the 
# results are reproducible
def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

# utility function to split the dataset by id
def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

Adding an index column for using as "id"

In [70]:
housing = housing.reset_index()

Let us split the dataset into two sets - 80% for training data and 20% for test data

In [71]:
train_set, test_set = split_train_test_by_id(housing, 0.2, 'index')
print('train_set has {} entries and test_set {} entries'.format(len(train_set), len(test_set)))

train_set has 5070 entries and test_set 1281 entries


In [72]:
# utility class/methods to select numerical and categorical
# data (by passing numerical and categorical columns as attributes)
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

In [73]:
train_num = train_set.drop(['property_type'], axis=1)
test_num = test_set.drop(['property_type'], axis=1)

In [74]:
train_cat_1 = train_set['property_type']
train_cat1_enc, train_cat1 = train_cat_1.factorize()
print('train_cat1: ', len(train_cat1))

test_cat_1 = test_set['property_type']
test_cat1_enc, test_cat1 = test_cat_1.factorize()
print('test_cat1: ', len(test_cat1))


train_cat1:  6
test_cat1:  6


Let us encode the categorical variables using `Onehot encoding`

In [76]:
enc = OneHotEncoder()
train_cat1_1hot = enc.fit_transform(train_cat1_enc.reshape(-1, 1))
test_cat1_1hot = enc.fit_transform(test_cat1_enc.reshape(-1, 1))

In [77]:
num_attribs = list(train_num)
cat_attribs = ['property_type']
print('numerical attributes: ', num_attribs)
print('categorical attributes: ', cat_attribs)

numerical attributes:  ['index', 'zipcode', 'beds', 'baths', 'sq_ft', 'year_built', 'lot_size', 'latitude', 'longitude', 'price']
categorical attributes:  ['property_type']


Let's build a pipeline to do these in a sequence

In [78]:
# pipeline for numerical data
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('std_scaler', StandardScaler()),
])

# pipeline for categorical data
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('cat_encoder', OneHotEncoder(sparse=False)),
])

# merged pipeline
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])

Let's prepare the training and test data, passing it through the pipeline

In [79]:
training_prepared = full_pipeline.fit_transform(train_set)
test_prepared = full_pipeline.fit_transform(test_set)

In [80]:
print('training_prepared shape : ', training_prepared.shape)
print('test_prepared shape : ', test_prepared.shape)

training_prepared shape :  (5070, 16)
test_prepared shape :  (1281, 16)


In [81]:
training_labels = train_set['price'].copy()
test_labels = test_set['price'].copy()

Let us try to fit a line using Linear Regression

In [83]:
lr = LinearRegression()
lr.fit(training_prepared, training_labels)


LinearRegression()

we can use the fit parameters to make predictions on new data, for example our test data

In [84]:
print('Predictions: ', lr.predict(test_prepared))

Predictions:  [1491145.79056324  489943.89463039 1995437.62638817 ... 1399067.85135638
 1022203.91884626  918271.43060966]


Now, we would be interested in finding out how far are these from our actual prices in the dataset? Let us take a small subset of the test data and check how good/bad our predictions are instead of working on the entire dataset

In [85]:
some_data = test_set.iloc[27:33]
some_data_prepared = full_pipeline.transform(some_data)

In [86]:
print('Predictions: ', lr.predict(some_data_prepared))

Predictions:  [ 903711.84929414 2034180.29346711  748306.0447256   257709.34542827
  935764.2964864   228667.88569953]


Now, looking at the actual prices (labels) from the dataset

In [87]:
some_labels = test_labels.iloc[27:33]
print('actual labels: ', list(some_labels))

actual labels:  [925000, 2088888, 765000, 259900, 958000, 230000]


Our predictions always seem to be larger than the actual prices of homes

We can also compute the root mean squared error of our predictions over the entire test dataset

In [88]:
test_predictions = lr.predict(test_prepared)
lr_mse = mean_squared_error(test_labels, test_predictions)
lr_rmse = np.sqrt(lr_mse)
lr_rmse

70804.96688095151

The idea behind this exercise is just that, from Part 1, we could collect interesting anchor locations (address, latitude, longitude) and create our own dataframe with different types of houses and check how much different types of houses ('Single family home', 'condo', 'townhome') would cost at our locations of interest. We can write a small function to do the same.

In [135]:
def create_fake_df(budget, beds, baths, addr, property_types):
    geolocator = Nominatim(user_agent='addr_finder')
    location = geolocator.geocode(addr)
    lat = location.latitude
    lng = location.longitude
    zipcode = location.raw['display_name'].split()[-3].split('-')[0]
    df_lst = []
    columns = ['property_type', 'zipcode', 'beds', 'baths', 'sq_ft',
       'year_built', 'lot_size', 'latitude', 'longitude', 'price']
    for i in range(0, len(property_types)):
        for j in beds:
            for k in baths:
                rand_sqft = random.randint(1000, 4000)
                rand_lot = rand_sqft + random.randint(100, 2000)
                rand_year = random.randint(1950, 2000)
                df_lst.append([property_types[i], zipcode, j, k, rand_sqft, 
                           rand_year, rand_lot, lat, lng, budget])
    df = pd.DataFrame(data=df_lst, columns=columns)
    return df

Let us use a sample address, our choice of number of bedrooms and bathrooms, and property type as a python list.

In [143]:
my_addr = 'Monroe St, Santa Clara, CA'
beds = [2.0, 3.0, 4.0]
baths = [2.0, 2.5, 2.0]
property_types = ['Single Family Residential', 'Condo/Co-op', 'Multi-Family (2-4 Unit)',
       'Townhouse', 'Mobile/Manufactured Home', 'Multi-Family (5+ Unit)']

In [144]:
my_df = create_fake_df(1000000, beds, baths, my_addr, property_types)

In [160]:
my_df.head()

Unnamed: 0,index,property_type,zipcode,beds,baths,sq_ft,year_built,lot_size,latitude,longitude,price
0,0,Single Family Residential,95051,2.0,2.0,2490,1963,4347,37.364014,-121.968601,1000000
1,1,Single Family Residential,95051,2.0,2.5,3698,1969,3817,37.364014,-121.968601,1000000
2,2,Single Family Residential,95051,2.0,2.0,1885,1982,2565,37.364014,-121.968601,1000000
3,3,Single Family Residential,95051,3.0,2.0,2194,1966,2759,37.364014,-121.968601,1000000
4,4,Single Family Residential,95051,3.0,2.5,1510,1996,2822,37.364014,-121.968601,1000000


In [161]:
my_df.tail()

Unnamed: 0,index,property_type,zipcode,beds,baths,sq_ft,year_built,lot_size,latitude,longitude,price
49,49,Multi-Family (5+ Unit),95051,3.0,2.5,2416,1986,3813,37.364014,-121.968601,1000000
50,50,Multi-Family (5+ Unit),95051,3.0,2.0,3605,1988,4445,37.364014,-121.968601,1000000
51,51,Multi-Family (5+ Unit),95051,4.0,2.0,3906,1971,5213,37.364014,-121.968601,1000000
52,52,Multi-Family (5+ Unit),95051,4.0,2.5,3501,1992,4472,37.364014,-121.968601,1000000
53,53,Multi-Family (5+ Unit),95051,4.0,2.0,2492,1978,4220,37.364014,-121.968601,1000000


We can pick a sample from this "all possibilities" list and estimate how much we would need to shell out to buy one such property

In [146]:
my_df = my_df.reset_index()

In [147]:
rand_train_set, rand_test_set = split_train_test_by_id(my_df, 0.2, 'index')

In [154]:
rand_train_prepared = full_pipeline.fit_transform(rand_train_set)
rand_test_prepared = full_pipeline.fit_transform(rand_test_set)

In [156]:
rand_predictions = lr.predict(rand_test_prepared[2:3])

In [157]:
print(rand_predictions)

[1460456.41893491]
