# IBM Capstone - Final

## 1) Introduction



In this capstone I build a supervised machine learning model to predict real estate prices in Mexico City. The question the model seeks to answer is: what is the 'fair' or 'expected' price of a property given features such as its area, number of rooms, etc.? Such a model can serve several purposes: it can help real estate agencies price new listings, give buyers an idea of the price of the kind of property they are searching for, and flag listings that are cheap relative to their features.

In particular, the Foursquare API can be hugely beneficial for building the model because it can identify the commercial venues around each property. The venues around a property likely influence its price, so Foursquare data should help build a better predictive model.

## 2) Data 

### 2.1) Real Estate Listings in Mexico City

To build the model mentioned above I will mainly use two data sets. The first contains listings from a real estate agency's web page and was built by scraping the following URL: https://www.remax.com.mx/propiedades/ciudad+de+mexico_ciudad+de+mexico/venta

The process of scraping the data was long, so I converted the data to a CSV and loaded it into this repository. The data includes area, address, price, rooms, bathrooms, parking and type of real estate.

### 2.2) Foursquare Venue Data

The second data set, which will be merged with the first, is Foursquare venue data: the commercial venues within a defined radius of each listing. To get this data I will query the Foursquare API.

We start by loading the necessary libraries and information.

In [3]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
from pandas import json_normalize # flattens nested JSON; exposed at the top level since pandas 1.0
import folium # plotting library

In [4]:
CLIENT_ID = 'KJGACPXW4NNNK04P4GTB2HEY40ZYYKUUHPAQGY2PNQD4QJE5' # your Foursquare ID
CLIENT_SECRET = 'TICHAGERBZI2UB3CYPSWNLMWMWMIVW5FCTE2LGLCRPR44C1Q' # your Foursquare Secret
VERSION = '20180604'

### Real Estate Listings

This is the first data set containing the real estate listings to be used.

In [85]:
#Real Estate (Mexico City) Data

mx_re_DF = pd.read_csv('mx_city_RE')
mx_re_DF.drop(['Unnamed: 0'], axis=1, inplace=True)
mx_re_DF.head()

Unnamed: 0,room,bathroom,m2,parking,type,price,geocode_address,geocode_address_2
0,2,1.0,200.0,4,Casa,8000000.0,"Privada de Andes 13 Los Alpes, 01010 Ciudad de...","Privada de Andes 13, CDMX"
1,2,2.0,71.0,1,Departamento,2800000.0,"Lago Hielmar 44 Ahuehuetes Anáhuac, 11450 Ciud...","Lago Hielmar 44, CDMX"
2,3,3.0,500.0,4,Casa,19500000.0,"Cerrada Presa Escolta 185 San Jerónimo Lídice,...","Cerrada Presa Escolta 185, CDMX"
3,2,2.0,69.0,2,Departamento,3450000.0,"Luz Saviñon 814 Del Valle Norte, 03103 Ciudad ...","Luz Saviñon 814, CDMX"
4,2,2.0,81.0,2,Departamento,9125800.0,"Rio Amazonas 74 Cuauhtémoc, 06500 Ciudad de M...","Rio Amazonas 74, CDMX"


The next step is to attach a latitude and longitude to each listing based on its address. Unfortunately, some addresses were written improperly, so the geocoder returned `None` for them. For the purposes of this capstone I won't try to correct every address; 172 out of 250 listings returned a proper latitude and longitude.

In [21]:
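#Latitude and longitude 

geolocator = Nominatim(user_agent="foursquare_agent")
lat, lon = [], []

for address, address2 in zip(mx_re_DF['geocode_address'],mx_re_DF['geocode_address_2']):
    print(address)
    location = None
    #try the full address first, then fall back to the short "street number, CDMX" form
    for candidate in (address, address2):
        try:
            location = geolocator.geocode(candidate)
        except Exception:
            #network/service errors are treated like "address not found"
            location = None
        if location is not None:
            break
    #geocode returns None for addresses it cannot resolve
    lat.append(location.latitude if location is not None else None)
    lon.append(location.longitude if location is not None else None)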

Privada de Andes 13 Los Alpes, 01010 Ciudad de México
Lago Hielmar 44 Ahuehuetes Anáhuac, 11450 Ciudad de México
Cerrada Presa Escolta 185 San Jerónimo Lídice, 10200 Ciudad de México
Luz Saviñon 814 Del Valle Norte, 03103 Ciudad de México
Rio Amazonas  74 Cuauhtémoc, 06500 Ciudad de México
Rio Amazonas 74 Cuauhtémoc, 06500 Ciudad de México
Lomas de chapultepec Sierra Tezonco 150 Lomas de Chapultepec I Sección, 11000 Ciudad de México
Av de las Torres 805 Olivar de los Padres, 01780 Ciudad de México
Prolongación Peten 876 Santa Cruz Atoyac, 03310 Ciudad de México
Cerro del Agua 179 Romero de Terreros, 04310 Ciudad de México
Membrillo 0 Santiago Acahualtepeca. Ampliación 09609, 2 Ciudad de México
Prolongacion Emperadores 237 Del Valle Sur, 03104 Ciudad de México
Escuela Naval Militar 0 San Francisco Culhuacán Barrio de La Magdalena, 04260 Ciudad de México
Altavista  ( Vista del Campo ) Santa Fe Prados de la Montaña 61 Santa Fe Cuajimalpa, 05348 Ciudad de México
Prolongación 5 de mayo  696

JOSÉ MARÍA MATA  14 Constitución de La República, 07460 Ciudad de México
Altadena 0 Nápoles, 03810 Ciudad de México
ASTURIAS,COL.ALAMOS   260 Álamos, 03400 Ciudad de México
MAGDALENA 307 Del Valle Norte, 03103 Ciudad de México
PROVIDENCIA 1003 Del Valle Centro, 03100 Ciudad de México
RIO EBRO  12 Cuauhtémoc, 06500 Ciudad de México
RIO EBRO 12 Cuauhtémoc, 06500 Ciudad de México
Cerro San Pedro 0 Pedregal de San Francisco, 04320 Ciudad de México
RIO EBRO 12 Cuauhtémoc, 06500 Ciudad de México
RIO EBRO  12 Cuauhtémoc, 06500 Ciudad de México
RIO EBRO  12 Cuauhtémoc, 06500 Ciudad de México
Av. San Antonio  7 Carola, 01180 Ciudad de México
RIO EBRO  12 Cuauhtémoc, 06500 Ciudad de México
Rio Becerra 150 San Pedro de los Pinos, 03800 Ciudad de México
Concepción Beistegui 835 Del Valle Centro, 03100 Ciudad de México
Alhelí  306 Nueva Santa María, 02800 Ciudad de México
Concepción Beistegui 835 Del Valle Centro, 03100 Ciudad de México
AQUILES SERDAN 0 Miguel Hidalgo, 02450 Ciudad de México
Nicola

In [86]:
mx_re_DF['lat'] = lat
mx_re_DF['lon'] = lon
mx_re_DF.dropna(subset = ['lat','lon'], axis = 0, inplace=True)
mx_re_DF.head()

Unnamed: 0,room,bathroom,m2,parking,type,price,geocode_address,geocode_address_2,lat,lon
2,3,3.0,500.0,4,Casa,19500000.0,"Cerrada Presa Escolta 185 San Jerónimo Lídice,...","Cerrada Presa Escolta 185, CDMX",19.330326,-99.221529
3,2,2.0,69.0,2,Departamento,3450000.0,"Luz Saviñon 814 Del Valle Norte, 03103 Ciudad ...","Luz Saviñon 814, CDMX",19.286915,-99.218644
4,2,2.0,81.0,2,Departamento,9125800.0,"Rio Amazonas 74 Cuauhtémoc, 06500 Ciudad de M...","Rio Amazonas 74, CDMX",19.289072,-99.324852
5,0,1.0,61.0,0,Local - Comercial,5038760.0,"Rio Amazonas 74 Cuauhtémoc, 06500 Ciudad de Mé...","Rio Amazonas 74, CDMX",19.289072,-99.324852
7,3,3.5,207.0,4,Departamento,9500000.0,"Av de las Torres 805 Olivar de los Padres, 017...","Av de las Torres 805, CDMX",19.355892,-99.021285


In [61]:
mx_re_DF.shape

(172, 10)

### Foursquare Commercial Venues Data

I then use the Foursquare API to get the commercial venues within a 500 meter radius of each listing. The function below queries the API with each listing's latitude and longitude and collects the venues it returns.

In [64]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request; a failed request or malformed response counts as no venues
        # (without resetting results, the previous listing's venues would be reused)
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
        except Exception:
            print(name, 'NO RESULT, Interpreted as zero')
            results = []
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # build the data frame once, after every listing has been queried
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list],
                                 columns=['Neighborhood', 
                                          'Neighborhood Latitude', 
                                          'Neighborhood Longitude', 
                                          'Venue', 
                                          'Venue Latitude', 
                                          'Venue Longitude', 
                                          'Venue Category'])
    return nearby_venues

In [65]:
mxcity_venues=getNearbyVenues(mx_re_DF['geocode_address_2'], mx_re_DF['lat'], mx_re_DF['lon'])

Cerrada Presa Escolta 185, CDMX
Luz Saviñon 814, CDMX
Rio Amazonas  74, CDMX
Rio Amazonas 74, CDMX
Av de las Torres 805, CDMX
Prolongación Peten 876, CDMX
Cerro del Agua 179, CDMX
Membrillo 0, CDMX
Prolongacion Emperadores 237, CDMX
Prolongación 5 de mayo  696, CDMX
Altadena 0, CDMX
Tezozomoc 93, CDMX
Calle 12 0, CDMX
Calle 12 0, CDMX
MONTE BLANCO 0, CDMX
Angel Urraza 904, CDMX
Guanajuato 44, CDMX
Av de los Poetas 100, CDMX
Jose Vasconcelos 300, CDMX
Minería 44, CDMX
Blvd Adolfo López Mateos 1940, CDMX
Av. Panamericana  240, CDMX
Hortensia 139, CDMX
Tamagno 114, CDMX
CANAL DEL NORTE  23, CDMX
CANAL DEL NORTE  23, CDMX
Calle 12  0, CDMX
calle 12  0, CDMX
Calle 12  0, CDMX
calle 12  0, CDMX
Av. Polanco 75, CDMX
Newton 0, CDMX
Secretaria de marina  0, CDMX
Tlacotalpan 89, CDMX
Tlacotalpan  89, CDMX
Tlacotalpan 89, CDMX
Baja California 119, CDMX
EJÉRCITO NACIONAL 830, CDMX
Plutarco Elías Calles 166, CDMX
Av Cuauhtémoc 55, CDMX
Piñón 188, CDMX
Prolongacion Uxmal 1112, CDMX
PROL PASEO DE LA 

In [66]:
mxcity_venues.iloc[0:20,:]

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Cerrada Presa Escolta 185, CDMX",19.330326,-99.221529,San Jerónimo,19.332817,-99.219986,Sports Bar
1,"Cerrada Presa Escolta 185, CDMX",19.330326,-99.221529,Anahata yoga,19.327747,-99.222291,Yoga Studio
2,"Cerrada Presa Escolta 185, CDMX",19.330326,-99.221529,Master Of War,19.329097,-99.218855,Toy / Game Store
3,"Cerrada Presa Escolta 185, CDMX",19.330326,-99.221529,Kids Corner,19.328551,-99.217987,Garden
4,"Cerrada Presa Escolta 185, CDMX",19.330326,-99.221529,El secreto,19.32823,-99.218605,Candy Store
5,"Cerrada Presa Escolta 185, CDMX",19.330326,-99.221529,9 Round San Jerónimo,19.328203,-99.218514,Boxing Gym
6,"Cerrada Presa Escolta 185, CDMX",19.330326,-99.221529,Prints land,19.326433,-99.219598,Playground
7,"Cerrada Presa Escolta 185, CDMX",19.330326,-99.221529,Hamburguesas Licho's,19.327007,-99.223503,Burger Joint
8,"Luz Saviñon 814, CDMX",19.286915,-99.218644,El Rincón de Periban,19.287917,-99.219377,Mexican Restaurant
9,"Luz Saviñon 814, CDMX",19.286915,-99.218644,Tamales del Sur,19.286674,-99.218545,Mexican Restaurant


In [67]:
mxcity_venues.shape

(7951, 7)

Because of the variety of commercial venues, and given the scope of the capstone, I will simply aggregate them without differentiating by category. I add a new column to the first data set: the number of commercial venues around each address.

In [102]:
mxcity_venues['commercial_count'] = 1
count_venues = mxcity_venues[['Neighborhood','commercial_count']].groupby('Neighborhood').count().reset_index()
c_lat_lon = mxcity_venues[['Neighborhood','Neighborhood Latitude', 'Neighborhood Longitude']].drop_duplicates()
count_venues = count_venues.merge(c_lat_lon, how='left', on='Neighborhood')
count_venues.columns = ['Neighborhood','commercial_count','lat', 'lon']
count_venues.head()

Unnamed: 0,Neighborhood,commercial_count,lat,lon
0,"75 0, CDMX",64,19.420328,-99.080099
1,"AQUILES SERDAN 0, CDMX",31,19.490556,-99.195229
2,"AV COYOACAN 0, CDMX",148,19.386268,-99.167364
3,"AV DE LAS PALMAS 0, CDMX",4,19.236942,-98.997317
4,"AV REVOLUCION 400, CDMX",31,19.392368,-99.18575


### Merging and Cleaning Data Sets

Finally, I merge the two data sets and do some cleaning.

In [159]:
full_data = mx_re_DF.merge(count_venues, how='left', left_on=['lat','lon'], right_on=['lat','lon'])
full_data.drop_duplicates(subset=['price','m2','room','bathroom','parking','type','lat','lon'],inplace=True)
full_data.reset_index(inplace=True)
full_data.head()

Unnamed: 0,index,room,bathroom,m2,parking,type,price,geocode_address,geocode_address_2,lat,lon,Neighborhood,commercial_count
0,0,4,4.0,150.0,3,Casa,3000000.0,Membrillo 0 Santiago Acahualtepeca. Ampliación...,"Membrillo 0, CDMX",19.213692,-98.955765,"Membrillo 0, CDMX",1.0
1,1,4,4.0,2000.0,0,Casa,2600000.0,AV DE LAS PALMAS 0 Lomas de Chapultepec I Sec...,"AV DE LAS PALMAS 0, CDMX",19.236942,-98.997317,"AV DE LAS PALMAS 0, CDMX",4.0
2,2,0,2.0,260.0,0,Bodega - Comercial,4100000.0,"Oriente 180 396 Moctezumaa Sección 15500, 1 Ci...","Oriente 180 396, CDMX",19.238952,-99.104924,"Oriente 180 396, CDMX",3.0
3,3,3,2.0,135.0,2,Departamento,5690000.0,"MAGDALENA 307 Del Valle Norte, 03103 Ciudad de...","MAGDALENA 307, CDMX",19.261208,-98.994604,"MAGDALENA 307, CDMX",2.0
4,4,2,1.0,50.0,0,Departamento,1550000.0,Prolongación 5 de mayo 696 Colinas de Tarango...,"Prolongación 5 de mayo 696, CDMX",19.263749,-99.177061,"Prolongación 5 de mayo 696, CDMX",21.0


In [131]:
full_data.shape

(169, 12)
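
One detail of the merge is worth spelling out. A listing with no venues inside the radius contributes no rows to the venue table, so after a left merge its `commercial_count` is `NaN` rather than `0`, and a later `dropna` removes the listing entirely. The sketch below uses small made-up tables (the addresses, coordinates and venue names are hypothetical, and it merges on the address instead of on `lat`/`lon` for brevity) to show one way of keeping such listings with a count of zero.

```python
import pandas as pd

# Toy listings (hypothetical values, not taken from the scraped data)
listings = pd.DataFrame({
    'address': ['A 1, CDMX', 'B 2, CDMX', 'C 3, CDMX'],
    'lat': [19.40, 19.41, 19.42],
    'lon': [-99.10, -99.11, -99.12],
})

# Toy venue table: listing 'C 3, CDMX' has no venues at all
venues = pd.DataFrame({
    'Neighborhood': ['A 1, CDMX', 'A 1, CDMX', 'B 2, CDMX'],
    'Venue': ['Cafe', 'Gym', 'Bakery'],
})

# size() counts rows per listing without needing a helper column of ones
counts = (venues.groupby('Neighborhood').size()
                .rename('commercial_count').reset_index())

merged = listings.merge(counts, how='left',
                        left_on='address', right_on='Neighborhood')

# a listing absent from the venue table has no venues, so record 0 instead of NaN
merged['commercial_count'] = merged['commercial_count'].fillna(0).astype(int)
print(merged[['address', 'commercial_count']])
```

Whether a zero is the right value depends on why the venue list was empty: a failed API request is not the same as a street with no shops, so in practice the two cases may deserve separate handling.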

## 3) Methodology

I will use a linear regression model with the listing price as the dependent variable and the listing's features as independent (predictor) variables.

### 3.1) Data Processing

Because location is clearly a substantial predictor of price, it would be bad practice to exclude it. However, there are so many distinct neighborhoods that the model may be prone to overfitting if every neighborhood is included as a dummy variable. For this reason I will cluster the listings with the k-means unsupervised algorithm, using their geographical location (latitude and longitude), since it seems reasonable that nearby listings should be similarly priced (all else being equal). Then I will standardize the features to ease the interpretation of the coefficients and possibly get a better predictive model. I will also include polynomial features to try to improve the model's predictive capability.
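
As an illustration of the processing chain described above (standardize, then polynomial features, then linear regression), here is a minimal sketch on synthetic data. The arrays `X_toy` and `y_toy` and the step names are made up for the example and merely stand in for columns such as `m2` and `commercial_count`; this is not the pipeline fitted on the listings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the listing features (NOT the real data):
# two columns playing the role of 'm2' and 'commercial_count'.
rng = np.random.default_rng(0)
X_toy = rng.uniform(low=[40.0, 0.0], high=[500.0, 100.0], size=(60, 2))
y_toy = 20000.0 * X_toy[:, 0] + 5000.0 * X_toy[:, 1] + rng.normal(0.0, 1e5, size=60)

# standardize --> polynomial features --> linear regression
toy_pipe = Pipeline([
    ('scale', StandardScaler()),
    # include_bias=False: LinearRegression already fits an intercept
    ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
    ('model', LinearRegression()),
])
toy_pipe.fit(X_toy, y_toy)

# degree 2 on 2 features gives x1, x2, x1^2, x1*x2, x2^2
n_terms = toy_pipe.named_steps['model'].coef_.shape[0]
print(n_terms)
```

The number of terms grows quickly with the number of features: with 15 input columns (the case with seven clusters), degree 2 already produces 135 terms, which is more than the fewer than 120 listings available for training.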

### 3.2) Goodness of fit

Lastly, I will split the data into a training and testing set and calculate the $R^2$ on the testing set as a measure of the goodness of fit. 
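
For reference, this is the standard definition of $R^2$ (the one `r2_score` in scikit-learn computes), written with $y_i$ for the observed test prices, $\hat{y}_i$ for the model's predictions and $\bar{y}$ for the mean of the observed test prices:

$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

A value of $1$ means perfect predictions, $0$ means the model does no better than always predicting $\bar{y}$, and negative values are possible on a testing set.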

In [125]:
#DATA PROCESSING

#1. Clustering Neighborhoods (on latitude and longitude)
from sklearn.cluster import KMeans 

k_means_models = {}
X =  full_data[['lat','lon']]

for i in range(2,10):
    k_means = KMeans(init = "k-means++", n_clusters = i, n_init = 12)
    k_means.fit(X)
    k_means_models[i]=k_means.labels_

In [132]:
k_means_models[3]

array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 0,
       0, 2, 0, 0, 2, 1, 2, 0, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
       1, 2, 2, 2, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 0,
       0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [168]:
#Clustering types of listing
for obs in range(len(full_data['type'])):
    if 'Bodega' in full_data.loc[obs,'type'] : full_data.at[obs,'type'] = 'Bodega'
    if 'Casa' in full_data.loc[obs,'type'] : full_data.at[obs,'type'] = 'Casa'
    if 'Comercial' in full_data.loc[obs,'type'] : full_data.at[obs,'type'] = 'Comercial'
    if 'Desarrollo' in full_data.loc[obs,'type'] : full_data.at[obs,'type'] = 'Terreno'

In [171]:
full_data.head()

Unnamed: 0,index,room,bathroom,m2,parking,type,price,geocode_address,geocode_address_2,lat,lon,Neighborhood,commercial_count
0,0,4,4.0,150.0,3,Casa,3000000.0,Membrillo 0 Santiago Acahualtepeca. Ampliación...,"Membrillo 0, CDMX",19.213692,-98.955765,"Membrillo 0, CDMX",1.0
1,1,4,4.0,2000.0,0,Casa,2600000.0,AV DE LAS PALMAS 0 Lomas de Chapultepec I Sec...,"AV DE LAS PALMAS 0, CDMX",19.236942,-98.997317,"AV DE LAS PALMAS 0, CDMX",4.0
2,2,0,2.0,260.0,0,Bodega,4100000.0,"Oriente 180 396 Moctezumaa Sección 15500, 1 Ci...","Oriente 180 396, CDMX",19.238952,-99.104924,"Oriente 180 396, CDMX",3.0
3,3,3,2.0,135.0,2,Departamento,5690000.0,"MAGDALENA 307 Del Valle Norte, 03103 Ciudad de...","MAGDALENA 307, CDMX",19.261208,-98.994604,"MAGDALENA 307, CDMX",2.0
4,4,2,1.0,50.0,0,Departamento,1550000.0,Prolongación 5 de mayo 696 Colinas de Tarango...,"Prolongación 5 de mayo 696, CDMX",19.263749,-99.177061,"Prolongación 5 de mayo 696, CDMX",21.0
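
The row-by-row loop above can also be written with vectorized pandas operations. The sketch below is one possible equivalent on a toy column: `Bodega - Comercial` and `Local - Comercial` appear in the listings, while the bare `Desarrollo` label is a hypothetical stand-in for whatever development-type label the data contains.

```python
import pandas as pd

# Toy 'type' column with labels of the kind seen in the listings
types = pd.Series(['Casa', 'Bodega - Comercial', 'Local - Comercial',
                   'Departamento', 'Desarrollo'])

# (keyword, new label) pairs, applied in the same order as the if-statements above
rules = [('Bodega', 'Bodega'),
         ('Casa', 'Casa'),
         ('Comercial', 'Comercial'),
         ('Desarrollo', 'Terreno')]

simplified = types.copy()
for keyword, label in rules:
    # each rule looks at the already-updated labels, exactly like the loop,
    # so 'Bodega - Comercial' becomes 'Bodega' and is not relabeled afterwards
    simplified = simplified.mask(simplified.str.contains(keyword, regex=False), label)

print(simplified.tolist())
```

Labels that match no keyword, such as `Departamento`, are left unchanged.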


In [176]:
#Setting neighborhood clusters from 2 to 9, whichever predicts price best I will keep
final_data = full_data[['price','room','bathroom','m2','parking','type','commercial_count']].copy(deep=True)

for key, k_clus in k_means_models.items():
    final_data['N_cluster_%d' %key] = k_clus

In [192]:
#Getting dummies
final_data = pd.get_dummies(final_data,'type')
final_data.dropna(axis=0,inplace=True)
final_data.head()

Unnamed: 0,price,room,bathroom,m2,parking,commercial_count,N_cluster_2,N_cluster_3,N_cluster_4,N_cluster_5,N_cluster_6,N_cluster_7,N_cluster_8,N_cluster_9,type_Bodega,type_Casa,type_Comercial,type_Departamento,type_Terreno
0,3000000.0,4,4.0,150.0,3,1.0,1,1,0,1,3,6,7,7,0,1,0,0,0
1,2600000.0,4,4.0,2000.0,0,4.0,1,1,0,1,3,6,7,7,0,1,0,0,0
2,4100000.0,0,2.0,260.0,0,3.0,1,1,0,1,3,3,0,5,1,0,0,0,0
3,5690000.0,3,2.0,135.0,2,2.0,1,1,0,1,3,6,7,7,0,0,0,1,0
4,1550000.0,2,1.0,50.0,0,21.0,0,2,3,2,4,1,4,5,0,0,0,1,0


Once the data set has been cleaned and is ready to analyze, I proceed with the linear regression learning algorithm.

In [280]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings("ignore")

R2_test = {}
R2_train = {}
coefs = {}
columns = {}

for i in range(2,9):
    #Features
    X = final_data[['room',
                   'bathroom',
                   'm2',
                   'parking',
                   'commercial_count',
                   'N_cluster_%d' % i,
                   'type_Bodega',
                   'type_Casa',
                  'type_Comercial',
                  'type_Departamento',
                  'type_Terreno']].copy(deep=True)
    
    X['N_cluster_%d' % i] = X['N_cluster_%d' % i].astype(str)
    X = pd.get_dummies(X,'N_cluster_%d' % i)
    
    #To avoid falling in the dummy variable trap I discard the base categories
    features = ['room', 'bathroom', 'm2', 'parking', 'commercial_count', 'type_Bodega',
       'type_Casa', 'type_Comercial', 'type_Departamento'] + ['N_cluster_%d_%d' % (i,j) for j in range(1,i)]
    X = X[features]
    
    #Dependent Variable
    y = final_data['price']
    
    #Training and testing set
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=.3,random_state=1234)

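    #Pipe: polynomial features (degree=1, i.e. linear terms only) --> linear regression
    #No standardization step is included; PolynomialFeatures still prepends a constant
    #column by default, so coef_ has one more entry than X has columns
    Input=[('polynomial',PolynomialFeatures(degree=1)),('mode',LinearRegression())]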
    pipe=Pipeline(Input)
    pipe.fit(xtrain,ytrain)
    yhat_test=pipe.predict(xtest)
    yhat_train=pipe.predict(xtrain)
    R2_test[i] = round(r2_score(ytest,yhat_test),2)
    R2_train[i] = round(r2_score(ytrain,yhat_train),2)
    coefs[i] = pipe.named_steps['mode'].coef_
    columns[i] = X.columns

In [281]:
R2_test

{2: 0.3, 3: 0.28, 4: 0.28, 5: 0.33, 6: 0.34, 7: 0.34, 8: 0.32}

In [282]:
R2_train

{2: 0.49, 3: 0.49, 4: 0.5, 5: 0.5, 6: 0.51, 7: 0.53, 8: 0.54}

In [283]:
for name,coef in zip(columns[7],coefs[7]):
    print(name, " : ", coef) 

room  :  0.0
bathroom  :  -841581.9408972998
m2  :  331197.41806299594
parking  :  19633.384672813205
commercial_count  :  2974303.566878244
type_Bodega  :  -30713.33450850338
type_Casa  :  -8791145.54218618
type_Comercial  :  -11438220.615298972
type_Departamento  :  -10165014.689177813
N_cluster_7_1  :  -6870873.174492464
N_cluster_7_2  :  -226706.43973787906
N_cluster_7_3  :  -2145533.375526429
N_cluster_7_4  :  -3373129.767487435
N_cluster_7_5  :  -1672112.9087422749
N_cluster_7_6  :  -2902659.7777165156
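
The leading `0.0` in the printout above suggests that names and coefficients are offset by one: `PolynomialFeatures` adds a constant column by default, and `zip` silently stops at the shorter sequence. The sketch below reproduces the effect on synthetic data with known slopes (the names `a` and `b` and all values are made up for the illustration) and shows two ways of aligning names with coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with known slopes (NOT the listing data): y = 3*a - 2*b + 5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0
names = ['a', 'b']

# Same kind of pipeline as above: the default include_bias=True adds a column of ones
biased = Pipeline([('polynomial', PolynomialFeatures(degree=1)),
                   ('mode', LinearRegression())]).fit(X_demo, y_demo)
coef_biased = biased.named_steps['mode'].coef_      # 3 entries: constant, a, b

# zip stops at the shorter sequence, so 'a' is paired with the constant's coefficient
misaligned = dict(zip(names, coef_biased))

# Option 1: skip the leading constant when pairing names and coefficients
aligned = dict(zip(names, coef_biased[1:]))

# Option 2: do not create the constant column in the first place
unbiased = Pipeline([('polynomial', PolynomialFeatures(degree=1, include_bias=False)),
                     ('mode', LinearRegression())]).fit(X_demo, y_demo)
coef_unbiased = unbiased.named_steps['mode'].coef_  # 2 entries: a, b

print(misaligned)
print(aligned)
```

Either option leaves the predictions and the $R^2$ unchanged, because the intercept fitted by `LinearRegression` already plays the role of the constant column.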


In [277]:
columns[7]

Index(['room', 'bathroom', 'm2', 'parking', 'commercial_count', 'type_Bodega',
       'type_Casa', 'type_Comercial', 'type_Departamento', 'type_Terreno',
       'N_cluster_7_0', 'N_cluster_7_1', 'N_cluster_7_2', 'N_cluster_7_3',
       'N_cluster_7_4', 'N_cluster_7_5', 'N_cluster_7_6'],
      dtype='object')

## 4) Results

The best out-of-sample $R^2$ I obtained was $0.34$, using seven neighborhood clusters together with the rest of the features (six clusters gave the same test $R^2$, with a slightly lower training $R^2$). I used 30% of the data for testing and the rest for training the model.

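The coefficients for the model can be seen in the cells above. The reference category for "type" is "Terreno" and the reference category for "N_cluster_i" is cluster $0$. One caveat when reading the printout: the `PolynomialFeatures` step prepends a constant column, so the coefficient vector has one more entry than the list of feature names. The `0.0` printed next to `room` is the coefficient of that constant, each following value belongs to the feature listed one line above it, and the coefficient of `N_cluster_7_6` is cut off.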

## 5) Discussion

At first the model was prone to overfitting when I standardized the data and included polynomial features. This stands to reason: there are only 168 observations, so adding polynomial terms quickly gives the model more flexibility than the data can support. In the training data the $R^2$ was considerably higher than in the testing data.

I found that the simple linear model with seven distinct neighborhoods (based on geographical proximity) and the rest of the features was the best out-of-sample predictive model, with $R^2=0.34$.

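The main insight from the data is that cluster zero, when using seven clusters, appears to be the most expensive area, since the other cluster coefficients shown are negative relative to it. The effect of the number of commercial venues is less clear. Because the constant column added by `PolynomialFeatures` shifts the printed coefficients by one position, the large positive value printed next to `commercial_count` belongs to `parking`, while the coefficient of `commercial_count` itself is the small negative value printed next to `type_Bodega`. No significance test was run either, so the results do not support the claim that more commercial venues imply a higher price.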

## 6) Conclusion

In this capstone I set out to build a model that predicts the price of a real estate listing given its features. To build the data set I scraped an agency's web page for its real estate listings and then used the Foursquare API to get the nearby commercial venues. Next, I searched through different linear regression models and found many of them to overfit the data; I ended up using a normal linea