# Capstone Project - Real Estate Development (Week 1)
### Applied Data Science Capstone by IBM/Coursera
### Marcos Geraldo

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

This project will provide insights into capital gains on real estate investments.

It will be targeted at landlords who are evaluating the impact of home improvement projects on the selling price of their properties.

It will also provide a model to estimate the listing price that fits the market valuation of a particular house.

It will use current data published for the city of interest to establish the relative weights of the key elements that drive the price of a house.

It will use Foursquare data to measure the distance to relevant venues and to evaluate the weight of those elements in the listing price of a property.



## Data <a name="data"></a>

According to the problem definition, the data relevant to understanding price valuation are the following:
* selling price 
* listing price (as a proxy for selling price, that might not be public) 
* number and distance of venues 

To avoid market variations, the data will reflect current market conditions. The candidate sources are real estate websites that freely publish and share properties and listing prices:

* [Realtor](#Realtor)
* [FourSquare](#Foursquare)

The values to collect for each property are:
* Year of construction
* Constructed surface
* Bedrooms
* Bathrooms
* Garages
* Stories
* School ratings 
* Number of venues by category
* Distance to venues
* others to be found



### Realtor.com <a name="Realtor"></a>

After trying several APIs, I will use Realtor as the main source for data collection because of its reliability and simplicity.

Realtor offers multiple APIs:
* [list-for-sale](#list-for-sale)
* [detail](#detail)
* [list-sold](#list-sold)
* list-similar-homes
* list-for-rent
* list-by-mls
* list-similar-rental-homes


In [944]:
import requests
import pandas as pd
import numpy as np
from pandas import json_normalize  # pandas >= 1.0; pandas.io.json.json_normalize is deprecated


In [945]:
city_nm = 'Pleasanton'
address_st = '942 Clinton Pl'
state_cd = 'CA'
rapid_key = pd.read_csv('Realtor API rapid key').set_index('API_provider').at['Realtor','key']

### list-for-sale API<a name="list-for-sale"></a>

This API returns properties for sale in groups of 200.
Here is an example of how I read two pages using the `offset` parameter:


In [950]:
limit = 200
page = 0 
url = "https://realtor.p.rapidapi.com/properties/v2/list-for-sale"
querystring = {"sort":"relevance","city":city_nm,"limit":limit,"offset":page*limit,"state_code":state_cd}
headers = {
    'x-rapidapi-host': "realtor.p.rapidapi.com",
    'x-rapidapi-key': rapid_key
    }
response = requests.request("GET", url, headers=headers, params=querystring)

In [951]:
df = json_normalize(response.json()['properties'])

In [952]:
df.shape

(200, 92)

In [953]:
df.head()

Unnamed: 0,address.city,address.county,address.fips_code,address.lat,address.line,address.lon,address.neighborhood_name,address.postal_code,address.state,address.state_code,...,plan_id,price,prop_status,prop_sub_type,prop_type,property_id,rank,rdc_web_url,thumbnail,virtual_tour.href
0,Pleasanton,Alameda,6001,37.658552,865 Concord St,-121.857258,Vintage Hills,94566,California,CA,...,,1200000,for_sale,,single_family,M1761910363,1,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/41c482b715b5596ad090233c...,https://www.tourfactory.com/idxr2774748
1,Pleasanton,Alameda,6001,37.67119,6024 Corte Montanas,-121.899756,Ponderosa,94566,California,CA,...,,1049888,for_sale,,single_family,M2075056078,2,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/473a5c376ee506194e2e3349...,
2,Pleasanton,Alameda,6001,37.647288,5702 San Carlos Way,-121.877752,Mission Park,94566,California,CA,...,,1389000,for_sale,,single_family,M2421852939,3,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/f19b087c7e60a575c1ca1930...,
3,Pleasanton,Alameda,6001,37.677962,2264 Raven Rd,-121.883963,Birdland,94566,California,CA,...,,1299000,for_sale,,single_family,M1796828551,4,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/ef5544db2fa41584b668bb79...,https://www.tourfactory.com/2771557
4,Pleasanton,Alameda,6001,37.663592,274 Birch Creek Dr,-121.864646,West Vineyard Avenue,94566,California,CA,...,,725000,for_sale,townhomes,condo,M2733911643,5,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/c49c23e97a0d43f63371adc9...,https://virtualtourcafe.com/tour/5958931


In [954]:
limit = 200
page = 1 
url = "https://realtor.p.rapidapi.com/properties/v2/list-for-sale"
querystring = {"sort":"relevance","city":city_nm,"limit":limit,"offset":page*limit,"state_code":state_cd}
headers = {
    'x-rapidapi-host': "realtor.p.rapidapi.com",
    'x-rapidapi-key': rapid_key
    }
response = requests.request("GET", url, headers=headers, params=querystring)
df1 = json_normalize(response.json()['properties'])
df1.head()

Unnamed: 0,address.city,address.county,address.fips_code,address.is_approximate,address.lat,address.line,address.lon,address.neighborhood_name,address.postal_code,address.state,...,page_no,photo_count,price,prop_status,prop_type,property_id,rank,rdc_web_url,thumbnail,virtual_tour.href
0,Pleasanton,Alameda,6001.0,,37.665685,3263 Vineyard Ave Spc 104,-121.852213,,94566,California,...,5,28,215000,for_sale,mobile,M2269810031,29,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/e0640ece3fbb1b77731ed41e...,
1,Pleasanton,Alameda,6001.0,,37.636975,622 Happy Valley Rd,-121.87669,Happy Valley,94566,California,...,5,5,1399999,for_sale,land,M2459752402,30,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/5947f6dd6116b03b8dffbb3a...,
2,Pleasanton,Alameda,6001.0,,37.667336,3820 Stanley Blvd,-121.865044,Asco - Radum,94566,California,...,5,38,1189950,for_sale,single_family,M1948920642,31,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/1bf516e6c32b5ba399cecc10...,https://www.tourfactory.com/idxr2683548
3,Pleasanton,,,True,37.651433,114 Wallace Cir,-121.858132,Moraga Country Club,94566,California,...,5,1,1170688,for_sale,single_family,M9495719674,32,https://www.realtor.com/realestateandhomes-det...,https://an.rdcpix.com/2105872714/26a9a2844c3b4...,
4,Pleasanton,Alameda,6001.0,,37.668765,2660 Camino Segura,-121.900173,Del Prado,94566,California,...,5,53,1450000,for_sale,single_family,M1990126995,33,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/de97d74ea8d10988757dbcf6...,http://https://my.matterport.com/show/?m=bdeNA...


### Function to read each page of the query 

In [955]:
def read_realtor(city_nm, state_cd, limit, page_num):
    url = "https://realtor.p.rapidapi.com/properties/v2/list-for-sale"
    querystring = {"sort":"relevance","city":city_nm,"limit":limit,"offset":page_num*limit,"state_code":state_cd}
    headers = {
        'x-rapidapi-host': "realtor.p.rapidapi.com",
        'x-rapidapi-key': rapid_key
        }
    response = requests.request("GET", url, headers=headers, params=querystring)
    df = json_normalize(response.json()['properties'])
    return df

In [956]:
def list_for_sale(city_nm, state_cd, limit, page_num):
    df = read_realtor(city_nm, state_cd, limit, page_num)
    # a full page suggests there are more results, so read the next page as well
    if df.shape[0] == limit:
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        df = pd.concat([df, read_realtor(city_nm, state_cd, limit, page_num + 1)], sort=True)
    df = df.reset_index(drop=True)
    return df
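
The two-page read above generalises to any number of pages: keep requesting while the page that comes back is full. The following is a minimal sketch, assuming a hypothetical `fetch_page(offset, limit)` callable that stands in for the Realtor request and returns a list of records. The stub data and the `max_pages` safety cap are made up so the loop can run without an API key.

```python
def read_all_pages(fetch_page, limit, max_pages=50):
    """Collect records page by page until a short (or empty) page is returned.

    fetch_page is a hypothetical callable (offset, limit) -> list of records;
    max_pages is an assumed safety cap to avoid an endless loop.
    """
    records = []
    for page_num in range(max_pages):
        page = fetch_page(page_num * limit, limit)
        records.extend(page)
        if len(page) < limit:  # a short page means there is nothing left
            break
    return records


# Stand-in for the API: 240 fake listings, served in slices
fake_listings = [{"property_id": i} for i in range(240)]

def fake_fetch(offset, limit):
    return fake_listings[offset:offset + limit]

all_rows = read_all_pages(fake_fetch, limit=200)
print(len(all_rows))
```

In the notebook, `fetch_page` could wrap `read_realtor` and return the rows of each page.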


In [957]:
resp= list_for_sale('San Ramon','CA',200,0)

In [958]:
resp.shape

(240, 93)

In [959]:
resp.columns

Index(['address.city', 'address.county', 'address.fips_code',
       'address.is_approximate', 'address.lat', 'address.line', 'address.lon',
       'address.neighborhood_name', 'address.postal_code', 'address.state',
       'address.state_code', 'address.time_zone', 'agents', 'baths',
       'baths_full', 'baths_half', 'beds',
       'branding.listing_office.list_item.accent_color',
       'branding.listing_office.list_item.link',
       'branding.listing_office.list_item.name',
       'branding.listing_office.list_item.phone',
       'branding.listing_office.list_item.photo',
       'branding.listing_office.list_item.show_realtor_logo',
       'branding.listing_office.list_item.slogan', 'building_size.size',
       'building_size.units', 'client_display_flags.advantage_pro_flag',
       'client_display_flags.has_matterport',
       'client_display_flags.has_open_house',
       'client_display_flags.has_specials',
       'client_display_flags.is_co_broke_email',
       'client_display_

Here is a visual representation of the properties currently listed in the selected city:

In [961]:
import folium 
import matplotlib.cm as cm
import matplotlib.colors as colors

#setting colors
number_types = len(resp['prop_type'].unique())
colors_array = cm.rainbow(np.linspace(0, 1,number_types))
rainbow = pd.DataFrame([colors.rgb2hex(i) for i in colors_array])
rainbow.index = resp['prop_type'].unique()

# centring the screen:
latitude = (resp['address.lat'].max()+resp['address.lat'].min())/2
longitude = (resp['address.lon'].max()+resp['address.lon'].min())/2

# create a map of the selected city using latitude and longitude values
map_tto = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, address, neighborhood, prop_type in zip(resp['address.lat'], resp['address.lon'], resp['address.line'], resp['address.county'],resp['prop_type']):
    label = '{}, {} ({})'.format(address, neighborhood,prop_type)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='black',
        fill=True,
        fill_color=rainbow.loc[prop_type][0],
        fill_opacity=0.7,
        parse_html=False).add_to(map_tto)  
    
map_tto


### Detail API <a name="detail"></a>

I will have to use the detail API to get details such as
* School ratings
* Year Built
* Number of Stories

In [982]:
url = "https://realtor.p.rapidapi.com/properties/v2/detail"

property_id = 'M2695052375'
querystring = {"property_id":property_id}
headers = {
    'x-rapidapi-host': "realtor.p.rapidapi.com",
    'x-rapidapi-key': rapid_key
    }

response = requests.request("GET", url, headers=headers, params=querystring)

In [983]:
json_normalize(response.json()['properties'][0]['schools'])

Unnamed: 0,distance_in_miles,education_levels,funding_type,grades.range.high,grades.range.low,greatschools_id,id,lat,location.city,location.city_slug_id,...,location.street,lon,name,nces_id,phone,ratings.great_schools_rating,ratings.parent_rating,relevance,student_count,student_teacher_ratio
0,0.6,[elementary],public,5,K,600548,78579241,37.755837,San Ramon,San-Ramon_CA,...,13000 Broadmoor Drive,-121.951006,Montevideo Elementary School,063513005953,(925) 479-6100,9.0,4.0,assigned,658.0,25.9
1,0.8,[middle],public,8,6,600549,78579251,37.738203,San Ramon,San-Ramon_CA,...,3000 Pine Valley Road,-121.942177,Pine Valley Middle School,063513005955,(925) 479-7700,9.0,4.0,assigned,1049.0,25.7
2,0.4,[high],public,12,9,600537,78579101,37.746233,San Ramon,San-Ramon_CA,...,9870 Broadmoor Drive,-121.946474,California High School,063513005943,(925) 803-3200,9.0,4.0,assigned,2777.0,23.7
3,0.7,[elementary],public,5,K,600534,78579041,37.740982,San Ramon,San-Ramon_CA,...,2849 Calais Drive,-121.948432,Neil A. Armstrong Elementary School,063513005954,(925) 479-1600,9.0,4.0,nearby,544.0,26.1
4,1.6,[middle],public,8,6,600544,78579201,37.770021,San Ramon,San-Ramon_CA,...,12601 Alcosta Boulevard,-121.957406,Iron Horse Middle School,063513005729,(925) 824-2820,9.0,4.0,nearby,1069.0,25.5
5,2.5,[high],public,12,9,617434,78820461,37.768475,San Ramon,San-Ramon_CA,...,10550 Albion Road,-121.903099,Dougherty Valley High School,063513011990,(925) 479-6400,10.0,3.0,nearby,3331.0,23.8
6,1.0,[elementary],private,6,K,631666,79023801,37.749416,San Ramon,San-Ramon_CA,...,19001 San Ramon Valley Blvd,-121.960388,Heritage Academy - San Ramon,2ccc7444a98d6982e06b115607f16b24,(925) 558-5577,,5.0,nearby,,
7,1.8,[elementary],private,5,PK,610419,78720251,37.755501,San Ramon,San-Ramon_CA,...,2762 Derby Dr,-121.973801,CA Christian Academy,A9700302,(510) 381-7695,,,nearby,,


Each property has a list of schools.
I will use the average of the ratings as the index of school quality.

In [984]:
school_list = json_normalize(response.json()['properties'][0]['schools'])

In [985]:
school_list['ratings.great_schools_rating'].mean()

9.166666666666666
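
It is worth noting how that average treats schools without a rating. The sketch below uses the GreatSchools ratings copied from the table above and shows that `mean()` skips missing values, so the two unrated private schools do not pull the index down.

```python
import numpy as np
import pandas as pd

# GreatSchools ratings copied from the table above; the two private schools have none
ratings = pd.Series([9.0, 9.0, 9.0, 9.0, 9.0, 10.0, np.nan, np.nan])

mean_rating = ratings.mean()     # NaN values are skipped by default
rated_schools = ratings.count()  # number of schools that actually have a rating
print(mean_rating, rated_schools)
```

The average is therefore taken over the six rated schools, not over all eight.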

This is a function that gets those three details: 
* School rating
* Stories
* Year Built

Different kinds of properties have different JSON structures, so this function needs to react correctly when the data is not found.

In [971]:
def get_details(property_id):
    # endpoint defined locally so the function does not depend on a global `url`
    url = "https://realtor.p.rapidapi.com/properties/v2/detail"
    querystring = {"property_id":property_id}
    headers = {
        'x-rapidapi-host': "realtor.p.rapidapi.com",
        'x-rapidapi-key': rapid_key
        }
    response = requests.request("GET", url, headers=headers, params=querystring)

    # parse the JSON once; fall back to an empty record if the structure is unexpected
    try:
        prop = response.json()['properties'][0]
    except (ValueError, KeyError, IndexError, TypeError):
        prop = {}

    # the mean school rating (NaN when the property has no school list or no ratings)
    try:
        school_list = json_normalize(prop['schools'])
        school_rating = school_list['ratings.great_schools_rating'].mean()
    except Exception:
        school_rating = np.nan

    # stories or levels in the house
    stories = prop.get('stories', np.nan)

    # construction year
    year_built = prop.get('year_built', np.nan)

    return pd.DataFrame({'school_rating':[school_rating],'stories':[stories],'year_built':[year_built]})

In [972]:
resp_detail = get_details('M2269810031')

In [973]:
resp_detail

Unnamed: 0,school_rating,stories,year_built
0,8.0,,1974
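
The requirement to react correctly when data is missing can also be handled in one place with a small helper that walks a path of keys and indexes and returns a default as soon as a step fails. This is only a sketch: `safe_get` and the sample records are hypothetical and are not part of the Realtor API.

```python
def safe_get(data, path, default=None):
    """Follow a sequence of dict keys / list indexes; return default if any step fails."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Made-up responses: one complete, one without 'stories', one with no properties
full = {"properties": [{"stories": 2, "year_built": 1978}]}
partial = {"properties": [{"year_built": 1974}]}
empty = {"properties": []}

print(safe_get(full, ["properties", 0, "stories"]))
print(safe_get(partial, ["properties", 0, "stories"], default=float("nan")))
print(safe_get(empty, ["properties", 0, "year_built"]))
```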


### Connecting resp with the detailed API for features

In [974]:
train_data = resp[pd.notna(resp['listing_id'])][['property_id', 'listing_id',
                                                 'address.city','address.county',
                                                 'address.lat', 'address.lon',
                                                 'address.neighborhood_name','address.postal_code',
                                                 'baths_full','baths_half','beds',
                                                 'building_size.size','lot_size.size',
                                                 'prop_type','prop_status','price'
     ]]

In [975]:
train_data.shape

(222, 16)

In [976]:
# work on an explicit copy so the assignments below do not target a slice of resp
train_data = train_data.copy()
for prop_id in train_data['property_id']:
    det_temp = get_details(prop_id)
    # build the row mask from train_data itself rather than from resp
    mask = train_data['property_id'] == prop_id
    train_data.loc[mask, 'school_rating'] = det_temp.loc[0, 'school_rating']
    train_data.loc[mask, 'stories'] = det_temp.loc[0, 'stories']
    train_data.loc[mask, 'year_built'] = det_temp.loc[0, 'year_built']
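
An alternative to filling the columns row by row is to collect one detail record per property and join once on `property_id`. The sketch below works under stated assumptions: `fake_details` is a made-up stand-in for the detail call, and the small `listings` frame is illustrative data, not API output.

```python
import numpy as np
import pandas as pd

listings = pd.DataFrame({
    "property_id": ["A1", "B2", "C3"],
    "price": [1350000, 835000, 1095000],
})

def fake_details(property_id):
    # Made-up lookup standing in for a call to the detail endpoint
    table = {
        "A1": {"school_rating": 9.0, "stories": 1, "year_built": 1978},
        "B2": {"school_rating": 8.0, "stories": 2, "year_built": 2004},
    }
    missing = {"school_rating": np.nan, "stories": np.nan, "year_built": np.nan}
    return {"property_id": property_id, **table.get(property_id, missing)}

details = pd.DataFrame([fake_details(pid) for pid in listings["property_id"]])

# A left join keeps every listing, even when its details are missing
enriched = listings.merge(details, on="property_id", how="left")
print(enriched)
```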
                

Finally, here is an initial data set to work with.


In [979]:
train_data.head()

Unnamed: 0,property_id,listing_id,address.city,address.county,address.lat,address.lon,address.neighborhood_name,address.postal_code,baths_full,baths_half,beds,building_size.size,lot_size.size,prop_type,prop_status,price,school_rating,stories,year_built
0,M2695052375,2919325942,San Ramon,Contra Costa,37.750341,-121.942339,Southern San Ramon,94583,2,,4,2200,7000.0,single_family,for_sale,1350000,9.166667,1.0,1978.0
1,M2731769508,2919317542,San Ramon,Contra Costa,37.766689,-121.908181,Windemere,94582,2,1.0,3,1605,,condo,for_sale,835000,7.833333,2.0,2004.0
2,M2358503649,2919315162,San Ramon,Contra Costa,37.775299,-121.99359,Crow Canyon,94583,2,1.0,4,2148,4900.0,single_family,for_sale,1095000,9.166667,2.0,1997.0
3,M2407396566,2919304415,San Ramon,Contra Costa,37.766336,-121.985032,Twin Creeks,94583,2,1.0,4,1731,5000.0,condo,for_sale,889500,9.0,2.0,1975.0
5,M2271574371,2919275150,San Ramon,Contra Costa,37.762036,-121.916275,Gale Ranch,94582,3,1.0,4,2545,3360.0,single_family,for_sale,1279800,7.833333,2.0,2014.0


### FourSquare API <a name="Foursquare"></a>

Documentation [here](https://developer.foursquare.com/docs/api-reference/venues/explore/)

The first step will be to get the venues close to a property.

In [671]:
CLIENT_ID = pd.read_csv('Realtor API rapid key').set_index('API_provider').at['CLIENT_ID','key']
CLIENT_SECRET = pd.read_csv('Realtor API rapid key').set_index('API_provider').at['CLIENT_SECRET','key']

VERSION = '20180604'
LIMIT = 30
RADIUS = 1000

This function finds the venues that are within a defined radius (RADIUS) of the location (latitude and longitude) of the property. I am using the coordinates provided by the Realtor API.

In [713]:
def getNearbyVenues(names, latitudes, longitudes):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            RADIUS, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['property_id', 
                  'address.lat', 
                  'address.lon', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
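
Formatting the query string by hand makes it easy to miss escaping. As a sketch, the same request URL can be assembled with the standard library, which encodes each value. The function name `explore_url` and the credential values are placeholders, and only the parameters already used above appear.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def explore_url(lat, lng, client_id, client_secret,
                version="20180604", radius=1000, limit=30):
    """Assemble the venues/explore URL; urlencode escapes each value."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, for illustration only
example_url = explore_url(37.762575, -121.971991, "MY_ID", "MY_SECRET")
print(example_url)
```

Passing the same dictionary through the `params` argument of `requests.get` has the same effect and avoids building the string at all.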


This function computes the distance in kilometres between two sets of coordinates:

In [714]:
import math

def dist_coord(lat_1, lon_1,lat_2, lon_2):
    R = 40000/np.pi/2 # Radius of the earth in km, from a circumference of about 40,000 km (Eratosthenes)
    dis_list = []
    for lat1, lon1, lat2, lon2 in zip(lat_1, lon_1,lat_2, lon_2):
        phi1, phi2 = math.radians(lat1), math.radians(lat2) 
        dphi       = math.radians(lat2 - lat1)
        dlambda    = math.radians(lon2 - lon1)
        a = math.sin(dphi/2)**2 + math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2
        dis_list.append(2*R*math.atan2(math.sqrt(a), math.sqrt(1 - a)))
    return dis_list
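
The calculation above is the haversine formula for great-circle distance. A scalar sketch of the same formula, with two sanity checks that follow directly from the assumed 40,000 km circumference, can help confirm that the units and argument order are right. The name `haversine_km` is introduced here for illustration only.

```python
import math

# Same approximation as above: radius derived from a 40,000 km circumference
EARTH_RADIUS_KM = 40000 / (2 * math.pi)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# Sanity checks: a point is 0 km from itself, and a quarter of the equator
# is a quarter of the assumed 40,000 km circumference
print(haversine_km(37.76, -121.97, 37.76, -121.97))
print(haversine_km(0, 0, 0, 90))
```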

Test of the function with the first 3 properties in the list:

In [715]:

prop_id = train_data.loc[0:2,'property_id'].values
lat = train_data.loc[0:2,'address.lat'].values
lon = train_data.loc[0:2,'address.lon'].values


In [716]:
prop_id

array(['M2450662324', 'M2539057140', 'M1580231402'], dtype=object)

In [717]:
fs_resp = getNearbyVenues(prop_id,lat,lon)

M2450662324
M2539057140
M1580231402


Adding the distance as an additional column to the data set

In [720]:
fs_resp['distance'] = dist_coord(fs_resp['Venue Latitude'],fs_resp['Venue Longitude'],
          fs_resp['address.lat'],fs_resp['address.lon'])

In [721]:
fs_resp

Unnamed: 0,property_id,address.lat,address.lon,Venue,Venue Latitude,Venue Longitude,Venue Category,distance
0,M2450662324,37.762575,-121.971991,Bishop Ranch Veterinary Center & Urgent Care,37.771295,-121.971122,Veterinarian,0.971856
1,M2450662324,37.762575,-121.971991,San Ramon Marriott,37.762877,-121.965234,Hotel,0.594445
2,M2450662324,37.762575,-121.971991,Peet's Coffee & Tea,37.762719,-121.961317,Coffee Shop,0.937734
3,M2450662324,37.762575,-121.971991,Whole Foods Market,37.761901,-121.961333,Grocery Store,0.939219
4,M2450662324,37.762575,-121.971991,MOD Pizza,37.762675,-121.962513,Pizza Place,0.832646
5,M2450662324,37.762575,-121.971991,Target,37.762214,-121.96378,Big Box Store,0.722339
6,M2450662324,37.762575,-121.971991,The Shops at Bishop Ranch,37.761989,-121.961422,Shopping Mall,0.930619
7,M2450662324,37.762575,-121.971991,San Ramon Memorial Park,37.7569,-121.966372,Park,0.800782
8,M2450662324,37.762575,-121.971991,Clementine's,37.758119,-121.966249,Cajun / Creole Restaurant,0.706782
9,M2450662324,37.762575,-121.971991,Muscle Maker Grill San Ramon,37.762901,-121.961354,American Restaurant,0.935051


The next step will be to convert the categorical variable "Venue Category" into dummy variables.

In [730]:
fs_dummies = pd.get_dummies(fs_resp['Venue Category'])

In [731]:
fs_dummies['property_id']=fs_resp['property_id']

In [732]:
fs_dummies.groupby(by=['property_id']).sum()

property_id,American Restaurant,Arts & Crafts Store,Bagel Shop,Bank,Big Box Store,Business Service,Cajun / Creole Restaurant,Clothing Store,Coffee Shop,Cosmetics Shop,...,Performing Arts Venue,Pharmacy,Pizza Place,Pool,Rental Car Location,Salon / Barbershop,Shopping Mall,Sushi Restaurant,Trail,Veterinarian
M1580231402,0,0,0,0,0,0,0,1,1,0,...,1,1,0,1,0,0,0,0,1,0
M2450662324,2,1,1,1,1,0,1,0,2,1,...,0,0,1,0,1,1,1,1,0,1
M2539057140,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
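
To make the dummy-variable step easier to follow, here is a small sketch on made-up venue rows with the same two columns as `fs_resp`. It also shows `pd.crosstab`, which gives the same table of counts in one call. The property ids and categories below are illustrative only.

```python
import pandas as pd

# Made-up venue rows in the same shape as fs_resp
venues = pd.DataFrame({
    "property_id": ["P1", "P1", "P1", "P2"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Park"],
})

# One indicator column per category, then one row of counts per property
dummies = pd.get_dummies(venues["Venue Category"], dtype=int)
dummies["property_id"] = venues["property_id"]
counts = dummies.groupby("property_id").sum()

# pd.crosstab produces the same table in a single call
same_counts = pd.crosstab(venues["property_id"], venues["Venue Category"])
print(counts)
```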


There are too many distinct categories. They need to be simplified by aggregating them along the category hierarchy.
We can get the **hierarchy of categories from Foursquare**:

In [733]:
url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
    CLIENT_ID,
    CLIENT_SECRET, 
    VERSION 
    )
# make the GET request
categories = requests.get(url).json()["response"]


In [734]:
categories

{'categories': [{'id': '4d4b7104d754a06370d81259',
   'name': 'Arts & Entertainment',
   'pluralName': 'Arts & Entertainment',
   'shortName': 'Arts & Entertainment',
   'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/default_',
    'suffix': '.png'},
   'categories': [{'id': '56aa371be4b08b9a8d5734db',
     'name': 'Amphitheater',
     'pluralName': 'Amphitheaters',
     'shortName': 'Amphitheater',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/default_',
      'suffix': '.png'},
     'categories': []},
    {'id': '4fceea171983d5d06c3e9823',
     'name': 'Aquarium',
     'pluralName': 'Aquariums',
     'shortName': 'Aquarium',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/aquarium_',
      'suffix': '.png'},
     'categories': []},
    {'id': '4bf58dd8d48988d1e1931735',
     'name': 'Arcade',
     'pluralName': 'Arcades',
     'shortName': 'Arcade',
     'icon': {'prefix': 'https://

In [913]:
json_normalize(categories['categories'])[['name']]

Unnamed: 0,name
0,Arts & Entertainment
1,College & University
2,Event
3,Food
4,Nightlife Spot
5,Outdoors & Recreation
6,Professional & Other Places
7,Residence
8,Shop & Service
9,Travel & Transport


The next steps will be to navigate the JSON to group the categories and then run a regression model to see how well these variables predict prices.
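
One possible way to navigate that JSON is a recursive walk that maps every category name to the name of its top-level ancestor. The sketch below assumes each node has a `name` and a nested `categories` list, as in the output printed above. The function name `map_to_top_level` is hypothetical, and the sample tree is trimmed, with the `Food` branch added for illustration.

```python
def map_to_top_level(categories):
    """Return {category name: top-level category name} for a nested category tree."""
    mapping = {}

    def walk(node, top_name):
        mapping[node["name"]] = top_name
        for child in node.get("categories", []):
            walk(child, top_name)

    for top in categories:
        walk(top, top["name"])
    return mapping


# Trimmed sample following the structure printed above; the 'Food' branch is illustrative
sample = [
    {"name": "Arts & Entertainment", "categories": [
        {"name": "Amphitheater", "categories": []},
        {"name": "Aquarium", "categories": []},
        {"name": "Arcade", "categories": []},
    ]},
    {"name": "Food", "categories": [
        {"name": "Pizza Place", "categories": []},
    ]},
]

top_level = map_to_top_level(sample)
print(top_level["Aquarium"])
```

With such a mapping, the "Venue Category" column could be replaced by its top-level group before creating the dummy variables, which would reduce the number of columns to the ten groups listed above.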

# Logistic Regression

# Decision Tree