# The Battle of Neighborhoods 
## (Maps are not displayed in Github, please find them in my report/ppt)

<br/>

## INTRODUCTION

When people move to a new city, there are a lot a of obstacles and difficulties in their ways. One of the main challenge is to find a suitable neighborhood. Each neighborhood is suitable for some groups of people according to their preferences and income. In this project, I suggest a neighborhood to a user by asking them five multiple choice questions and how much they can afford to pay for housing. Therefore, this project aiming at providing guidiance for people who are looking for house in Melbourne, to find the right neighborhood for them according to his/her financial situation and preference of neighborhood atmosphere.

</br>

## DATA

Two main sources of data are used for this project.

• Melbourne Housing Snapshot provided by Kaggle: 
This dataset will be used for two purpose, find the average price per neighborhood and find the latitude and longitude of each neighborhood.

• Foursquare Location Data: 
Foursquare is a local search-and-discovery service. it
features a developer API that lets third-party applications make use of Foursquare’s
location data. I use this API to search and find out about different venues and their
categories of each neighborhood.

</br>

## Exploratory Data Analysis

### Preparing Housing Price Data

In [6]:
!pip install folium

Collecting folium
[?25l  Downloading https://files.pythonhosted.org/packages/72/ff/004bfe344150a064e558cb2aedeaa02ecbf75e60e148a55a9198f0c41765/folium-0.10.0-py2.py3-none-any.whl (91kB)
[K     |████████████████████████████████| 92kB 273kB/s eta 0:00:01
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.10.0


In [7]:
import pandas as pd
import numpy as np
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

import requests
import re
import random

In [2]:
url = 'https://raw.githubusercontent.com/mushroomJC/Capstone-Project/master/melb_data.csv'
df_price = pd.read_csv(url, error_bad_lines=False)
df_price.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


I only investigate metropolitan area.

In [8]:
df_price = df_price[df_price.Regionname.str.contains('Metropolitan$')]

Dataset is grouped by region and council area to find out what is the location of each neighborhood.

In [9]:
df_neighborhood = df_price.groupby(['Regionname','CouncilArea']).mean()[['Longtitude','Lattitude']]
df_neighborhood

Unnamed: 0_level_0,Unnamed: 1_level_0,Longtitude,Lattitude
Regionname,CouncilArea,Unnamed: 2_level_1,Unnamed: 3_level_1
Eastern Metropolitan,Banyule,145.059723,-37.748094
Eastern Metropolitan,Boroondara,145.10236,-37.81269
Eastern Metropolitan,Knox,145.258353,-37.868303
Eastern Metropolitan,Manningham,145.120826,-37.774919
Eastern Metropolitan,Maroondah,145.260455,-37.803952
Eastern Metropolitan,Monash,145.141044,-37.879245
Eastern Metropolitan,Nillumbik,145.152771,-37.709345
Eastern Metropolitan,Whitehorse,145.150467,-37.825884
Northern Metropolitan,Banyule,145.086983,-37.711545
Northern Metropolitan,Darebin,145.007302,-37.741646


Each region name in the above data frame represent with a color.

In [10]:
# latitude and longitude of Melbourne
la = -37.8136
lo = 144.9631
map_mel = folium.Map([la, lo], zoom_start = 10)
colors_array = cm.rainbow(np.linspace(0, 1, 8))
rainbow = [colors.rgb2hex(i) for i in colors_array]
regions = df_neighborhood.index.labels[0]

for lat, lng, label, cl in zip(df_neighborhood['Lattitude'], df_neighborhood['Longtitude'], df_neighborhood.index, regions):
    label = folium.Popup(label[1], parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cl],
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_mel)  
    
map_mel



Find the mean price of each neighborhood.

In [11]:
neighborhood_price = df_price.groupby('CouncilArea').mean()['Price']
neighborhood_price

CouncilArea
Banyule              9.444280e+05
Bayside              1.652168e+06
Boroondara           1.647217e+06
Brimbank             6.472007e+05
Casey                6.262037e+05
Darebin              9.158000e+05
Frankston            6.740155e+05
Glen Eira            1.069279e+06
Greater Dandenong    6.970673e+05
Hobsons Bay          1.000933e+06
Hume                 5.614067e+05
Kingston             9.776128e+05
Knox                 8.948961e+05
Manningham           1.236242e+06
Maribyrnong          8.116988e+05
Maroondah            8.510250e+05
Melbourne            9.255224e+05
Melton               6.094348e+05
Monash               1.168289e+06
Moonee Valley        9.873404e+05
Moreland             8.265576e+05
Nillumbik            8.691094e+05
Port Phillip         1.144346e+06
Stonnington          1.293382e+06
Unavailable          1.325000e+06
Whitehorse           1.234218e+06
Whittlesea           6.329681e+05
Wyndham              5.318134e+05
Yarra                1.127605e+06
Na

### Explore neighborhoods of Melbourne

In this section we will bulid a dataframe that contains information about venues of each neighborhood in Melbourne.

In [14]:
VERSION = '20180605' # Foursquare API version
CLIENT_ID = 'H51ZIYOCM1F0L4ZFYZVEB4W2XWYTWQ1I0153QNKW3JIF5S20' 
CLIENT_SECRET = 'NV5SJWP0KDWLJDG1XDUVBHZNODR4DRSNNM4YG30YKO3M4MBZ' 

This funcion finds venues of near each neighborhood.

In [15]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=200):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name[1])
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name[1], 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [16]:
melbourne_venues = getNearbyVenues(names=df_neighborhood.index,
                                   latitudes=df_neighborhood['Lattitude'],
                                   longitudes=df_neighborhood['Longtitude']
                                  )

Banyule
Boroondara
Knox
Manningham
Maroondah
Monash
Nillumbik
Whitehorse
Banyule
Darebin
Hume
Melbourne
Moonee Valley
Moreland
Nillumbik
Whittlesea
Yarra
Casey
Frankston
Greater Dandenong
Kingston
Knox
Monash
Bayside
Boroondara
Glen Eira
Kingston
Melbourne
Monash
Port Phillip
Stonnington
Unavailable
Whitehorse
Brimbank
Hobsons Bay
Hume
Maribyrnong
Melton
Moonee Valley
Moreland
Wyndham


In [17]:
melbourne_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Banyule,-37.748094,145.059723,heidelberg fish & chips,-37.750985,145.057295,Fish & Chips Shop
1,Banyule,-37.748094,145.059723,Sunnyside Cafe,-37.749651,145.055258,Café
2,Banyule,-37.748094,145.059723,Espresso 3081 Cafe,-37.7495,145.055,Café
3,Boroondara,-37.81269,145.10236,Purvis Wine Cellars,-37.814978,145.09922,Wine Shop
4,Boroondara,-37.81269,145.10236,ToWoo,-37.81475,145.09879,Korean Restaurant


Build a recommender system and for that purpose it's better to perform one hot encoding.

In [18]:
# one hot encoding
melbourne_onehot = pd.get_dummies(melbourne_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
melbourne_onehot['Neighborhood'] = melbourne_venues['Neighborhood'] 
# move neighborhood column to the first column
clm = list(melbourne_onehot.columns)
clm.remove('Neighborhood')
fixed_columns = ['Neighborhood'] + clm
melbourne_onehot = melbourne_onehot[fixed_columns]
melbourne_onehot.head()

Unnamed: 0,Neighborhood,Adult Boutique,Art Gallery,Asian Restaurant,Australian Restaurant,BBQ Joint,Bakery,Bar,Basketball Court,Beach,...,Sushi Restaurant,Szechuan Restaurant,Taiwanese Restaurant,Thai Restaurant,Thrift / Vintage Store,Train,Train Station,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Shop
0,Banyule,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Banyule,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Banyule,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Boroondara,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,Boroondara,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
melbourne_grouped = melbourne_onehot.groupby('Neighborhood').sum().reset_index()
melbourne_grouped.head()

Unnamed: 0,Neighborhood,Adult Boutique,Art Gallery,Asian Restaurant,Australian Restaurant,BBQ Joint,Bakery,Bar,Basketball Court,Beach,...,Sushi Restaurant,Szechuan Restaurant,Taiwanese Restaurant,Thai Restaurant,Thrift / Vintage Store,Train,Train Station,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Shop
0,Banyule,0,0,0,1,0,1,0,0,0,...,0,0,0,0,1,0,1,0,0,0
1,Boroondara,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,1
2,Brimbank,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Casey,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,Darebin,0,0,0,1,0,1,0,0,0,...,0,1,0,1,0,0,0,0,2,0


### Question Design

Users are given five multiple choices questions about his/her venue preference. According to his/her answers, a neighborhood would be suggested. To accomplish this task, informative questions should be designed in order to gain as much information as possible.

However, what is an informative question and how can I find them?

Five question with four choices of venues contains 20 venues.

Two techniques of feature selection are applied to choose which features,category of venue in this project, are most important. Because more important feature means more informative. These features can be used to design informative questions.

The two techniques are Low Variance Filter and High Correlation Filter.

#### Low Variance Filter

Consider a variable in our dataset where all the observations have the same value, say 1. If we use this variable, do you think it can improve the model we will build? The answer is no, because this variable will have zero variance.

I choose 60 venue with high variance in this step.

In [20]:
number_of_venue = 60
high_variance_venues = melbourne_grouped.var().sort_values(ascending=False)[:number_of_venue]
columns = high_variance_venues.index.tolist()
columns

['Café',
 'Pub',
 'Light Rail Station',
 'Chinese Restaurant',
 'Coffee Shop',
 'Japanese Restaurant',
 'Grocery Store',
 'Fast Food Restaurant',
 'Thai Restaurant',
 'Pizza Place',
 'Malay Restaurant',
 'Vietnamese Restaurant',
 'Park',
 'Bakery',
 'Convenience Store',
 'Playground',
 'Middle Eastern Restaurant',
 'Burger Joint',
 'Bar',
 'Department Store',
 'Fish & Chips Shop',
 'Gym',
 'Pet Store',
 'Bus Station',
 'Shopping Mall',
 'Sandwich Place',
 'Electronics Store',
 'Indian Restaurant',
 'Gym / Fitness Center',
 'Train Station',
 'Australian Restaurant',
 'Farmers Market',
 'Pharmacy',
 'Dumpling Restaurant',
 'Spa',
 'Wine Shop',
 'Board Shop',
 'Business Service',
 'Breakfast Spot',
 'Restaurant',
 'Art Gallery',
 'Clothing Store',
 'Gastropub',
 'Greek Restaurant',
 'Liquor Store',
 'Asian Restaurant',
 'Thrift / Vintage Store',
 'Train',
 'Food Court',
 'Portuguese Restaurant',
 'Dessert Shop',
 'Gay Bar',
 'Brewery',
 'Frozen Yogurt Shop',
 'Discount Store',
 'Beer Gard

I split these list of venues to 3 list of restaurant, store, and general. I use restaurant list for one question, store list for one question and general list for three questions.

In [21]:
r_r = re.compile(".*Restaurant$")
restaurant = list(filter(r_r.match, columns))

r_s = re.compile(".*Store$")
store = list(filter(r_s.match, columns))

general = list(set(columns).difference(set(restaurant)).difference(set(store)))

#### High Correlation Filter

High correlation between two variables means they have similar trends and are likely to carry similar information. I choose 0.6 as treshhold of high correlation.

In [22]:
TRESHHOLD = 0.6
corr_matrix_restaurant = melbourne_grouped[restaurant].corr()
corr_matrix_restaurant = corr_matrix_restaurant.applymap(lambda x:1 if abs(x) > TRESHHOLD else 0)
corr_matrix_restaurant

Unnamed: 0,Chinese Restaurant,Japanese Restaurant,Fast Food Restaurant,Thai Restaurant,Malay Restaurant,Vietnamese Restaurant,Middle Eastern Restaurant,Indian Restaurant,Australian Restaurant,Dumpling Restaurant,Restaurant,Greek Restaurant,Asian Restaurant,Portuguese Restaurant
Chinese Restaurant,1,0,0,0,0,1,0,0,0,0,0,0,0,0
Japanese Restaurant,0,1,0,0,1,0,1,1,0,1,0,0,1,0
Fast Food Restaurant,0,0,1,0,0,0,0,0,0,0,0,0,0,0
Thai Restaurant,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Malay Restaurant,0,1,0,0,1,0,1,0,0,1,0,0,1,0
Vietnamese Restaurant,1,0,0,0,0,1,0,0,0,0,0,0,0,0
Middle Eastern Restaurant,0,1,0,0,1,0,1,0,0,1,0,0,1,0
Indian Restaurant,0,1,0,0,0,0,0,1,0,0,0,0,0,0
Australian Restaurant,0,0,0,0,0,0,0,0,1,0,0,0,0,0
Dumpling Restaurant,0,1,0,0,1,0,1,0,0,1,1,0,1,0


I want to choose 4 sub-category in category(store, restaurant and generla) in order to do that I can omit category with high correlation.

In [23]:
venues = corr_matrix_restaurant.sum()

In [24]:
def question_maker(melbourne_data, category, no_question):
    question = []
    corr_matrix = melbourne_data[category].corr()
    TRESHHOLD = 0.6
    corr_matrix = corr_matrix.applymap(lambda x:1 if x > TRESHHOLD else 0)
    venues = corr_matrix.sum()
    for i in range(no_question):
        question.append([])
        for j in range(4):
            c = random.sample(list(venues.index), 1)[0]
            question[i].append(c)
            venues = venues[corr_matrix[c] == 0]
    return question

In [25]:
q_1 = question_maker(melbourne_grouped, general, 3)
q_2 = question_maker(melbourne_grouped, store, 1)
q_3 = question_maker(melbourne_grouped, restaurant, 1)

Final list of questions will be union of these 3 lists. This final list contains 5 questions and I can recommend a neighborhood to user when she choose one category in every question.

In [26]:
questions = q_1 + q_2 + q_3
questions

[['Park', 'Adult Boutique', 'Café', 'Bus Station'],
 ['Frozen Yogurt Shop', 'Wine Shop', 'Pharmacy', 'Spa'],
 ['Business Service', 'Shopping Mall', 'Burger Joint', 'Board Shop'],
 ['Pet Store', 'Liquor Store', 'Big Box Store', 'Convenience Store'],
 ['Restaurant',
  'Australian Restaurant',
  'Vietnamese Restaurant',
  'Asian Restaurant']]

In [27]:
pd.DataFrame(questions, columns=['choice 1','choice 2','choice 3','choice 4'])

Unnamed: 0,choice 1,choice 2,choice 3,choice 4
0,Park,Adult Boutique,Café,Bus Station
1,Frozen Yogurt Shop,Wine Shop,Pharmacy,Spa
2,Business Service,Shopping Mall,Burger Joint,Board Shop
3,Pet Store,Liquor Store,Big Box Store,Convenience Store
4,Restaurant,Australian Restaurant,Vietnamese Restaurant,Asian Restaurant


### 3.4 Recommend a Neighborhood

Finally a neighborhood can be recommended to user. I'll use content-based recommendation system to determine best neighborhooh for a specific user.

Since there is no user to ask, a random choice would be picked for each question, and user affordiblity for house would be randomly pick between min and max of house prices as well.

In [28]:
user_choices = []
for lst in questions:
    user_choices.append(random.sample(lst, 1)[0])
affordibilty = random.randint(int(neighborhood_price.min()), int(neighborhood_price.max()))
print(user_choices)
print('User can afford: ', affordibilty, 'dollar for housing.')

['Adult Boutique', 'Spa', 'Shopping Mall', 'Liquor Store', 'Restaurant']
User can afford:  618237 dollar for housing.


In [29]:
pd.DataFrame(user_choices+[affordibilty], columns=['User Profile'])

Unnamed: 0,User Profile
0,Adult Boutique
1,Spa
2,Shopping Mall
3,Liquor Store
4,Restaurant
5,618237


In [30]:
user_profile_lst = [1 if v in user_choices else 0 for v in melbourne_grouped.columns]
user_profile = pd.Series(user_profile_lst, index=melbourne_grouped.columns.tolist())
user_profile

Neighborhood                     0
Adult Boutique                   1
Art Gallery                      0
Asian Restaurant                 0
Australian Restaurant            0
BBQ Joint                        0
Bakery                           0
Bar                              0
Basketball Court                 0
Beach                            0
Beer Garden                      0
Big Box Store                    0
Board Shop                       0
Bowling Alley                    0
Breakfast Spot                   0
Brewery                          0
Burger Joint                     0
Bus Station                      0
Business Service                 0
Café                             0
Chinese Restaurant               0
Clothing Store                   0
Coffee Shop                      0
College Gym                      0
Comedy Club                      0
Convenience Store                0
Department Store                 0
Dessert Shop                     0
Discount Store      

In [31]:
neighborhood_matrix = melbourne_grouped.drop('Neighborhood',1)
recommendation_matrix = ((neighborhood_matrix*user_profile).sum(axis=1)/(user_profile.sum())).sort_values(ascending=False)
for i, rate in recommendation_matrix.iteritems():
    if neighborhood_price.iloc[i] <= affordibilty:
        chosen_neighborhood = neighborhood_price.index[i]
        break

In [32]:
print('Best neighborhood for you to live is: ',chosen_neighborhood)

Best neighborhood for you to live is:  Melton


In [33]:
la = -37.8136
lo = 144.9631
map_mel = folium.Map([la, lo], zoom_start = 11)
for lat, lng, label in zip(df_neighborhood['Lattitude'], df_neighborhood['Longtitude'], df_neighborhood.index):
    if label[1] == chosen_neighborhood:
        color = '#FF0000'
    else:
        color = '#3186cc'
    label = folium.Popup(label[1], parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=color,
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_mel)  
    
map_mel