# Zillow Web Scraping Using Python

## Introduction
Zillow is the leading real estate and rental marketplace dedicated to empowering consumers with data, inspiration and knowledge around the place they call home, and connecting them with the best local professionals who can help. As the most-visited real estate website in the United States, Zillow and its affiliates offer customers an on-demand experience for selling, buying, renting and financing with transparency and nearly seamless end-to-end service. 

Scraping Zillow using Python can be useful for a variety of reasons. It can help real estate agents and investors keep up with current market trends, identify potential properties quickly, analyze data from listings, and stay informed about local listing activity. Additionally, scraping is an efficient way to find discounted or undervalued properties, which can be great opportunities for buyers, whether for personal use or for a business.

This paper covers the basics of getting data from Zillow, either by scraping the site with packages like Selenium or simply by using the Zillow API. The result is a dataset of homes for sale in Portland, OR.

## I. Get datasets using Zillow API <a class="anchor" id="sub_section_1_1_1"></a>
Zillow Group’s collection of brands provides a wide range of APIs and datasets. Close to 20 APIs are available, offering data and functionality related to various aspects of real estate.

In [32]:
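from zillow_api import ZillowAPI

# Create the client with the account's API key.
zillow_api_client = ZillowAPI("d5a1b6c9-06cd-4b8c-ad9d-cf33825614f1")

# Request the first page of for-sale listings matching "portland, or".
search_result = zillow_api_client.search(
    params={
        "keyword": "portland, or",
        "type": "forSale",
        "page": 1,
    }
)

print(search_result)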

{'requestMetadata': {'id': 'f9678b70-b5a4-4aee-963b-ac045f5e8a09', 'zillowUrl': 'https://www.zillow.com/homes/portland%2C%20or/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22sort%22%3A%7B%22value%22%3A%22days%22%7D%7D%2C%22isListVisible%22%3Atrue%7D', 'status': 'ok'}, 'searchInformation': {'totalResults': 2405}, 'properties': [{'id': '48249486', 'url': 'https://www.zillow.com/homedetails/6314-SE-Jennings-Ave-Milwaukie-OR-97267/48249486_zpid/', 'image': 'https://photos.zillowstatic.com/fp/9c864e56cb0ef2049715e2b03ba2759b-p_e.jpg', 'status': 'FOR_SALE', 'currency': '$', 'price': 475000, 'addressRaw': '6314 SE Jennings Ave, Milwaukie, OR 97267', 'address': {'street': '6314 SE Jennings Ave', 'city': 'Milwaukie', 'state': 'OR', 'zipcode': '97267'}, 'beds': 2, 'baths': 2, 'area': 1279, 'brokerName': 'eXp Realty, LLC', 'brokerNameRaw': 'eXp Realty, LLC', 'latitude': 45.39881, 'longitude': -122.59761,

In [35]:
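import pandas as pd
from pandas import json_normalize

# Flatten the list of property dicts into a table. Nested keys such as
# address -> street become columns like 'address.street'.
df_api = json_normalize(search_result['properties'])
df_api.head()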

| | id | url | image | status | currency | price | addressRaw | beds | baths | area | brokerName | brokerNameRaw | latitude | longitude | photos | address.street | address.city | address.state | address.zipcode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 48249486 | https://www.zillow.com/homedetails/6314-SE-Jen... | https://photos.zillowstatic.com/fp/9c864e56cb0... | FOR_SALE | $ | 475000 | 6314 SE Jennings Ave, Milwaukie, OR 97267 | 2 | 2 | 1279 | eXp Realty, LLC | eXp Realty, LLC | 45.39881 | -122.59761 | [https://photos.zillowstatic.com/fp/9c864e56cb... | 6314 SE Jennings Ave | Milwaukie | OR | 97267 |
| 1 | 53816326 | https://www.zillow.com/homedetails/7600-SW-60t... | https://photos.zillowstatic.com/fp/9c7586dc537... | FOR_SALE | $ | 539900 | 7600 SW 60th Ave, Portland, OR 97219 | 3 | 2 | 2015 | Redfin | Redfin | 45.46967 | -122.73805 | [https://photos.zillowstatic.com/fp/9c7586dc53... | 7600 SW 60th Ave | Portland | OR | 97219 |
| 2 | 53819252 | https://www.zillow.com/homedetails/15105-E-Bur... | https://photos.zillowstatic.com/fp/0bccd964f84... | FOR_SALE | $ | 369900 | 15105 E Burnside St, Portland, OR 97233 | 3 | 1 | 2064 | Windermere Realty Trust | Windermere Realty Trust | 45.522507 | -122.50768 | [https://photos.zillowstatic.com/fp/0bccd964f8... | 15105 E Burnside St | Portland | OR | 97233 |
| 3 | 53835615 | https://www.zillow.com/homedetails/1442-SW-Vis... | https://photos.zillowstatic.com/fp/646f9095ae1... | FOR_SALE | $ | 695000 | 1442 SW Vista Ave, Portland, OR 97201 | 4 | 3 | 4131 | Windermere Realty Trust | Windermere Realty Trust | 45.517944 | -122.69676 | [https://photos.zillowstatic.com/fp/646f9095ae... | 1442 SW Vista Ave | Portland | OR | 97201 |
| 4 | 53868321 | https://www.zillow.com/homedetails/7535-S-Hood... | https://photos.zillowstatic.com/fp/46ec309362f... | FOR_SALE | $ | 675000 | 7535 S Hood Ave, Portland, OR 97219 | 2 | 1 | 1941 | The Agency, Inc | The Agency, Inc | 45.469494 | -122.67473 | [https://photos.zillowstatic.com/fp/46ec309362... | 7535 S Hood Ave | Portland | OR | 97219 |
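As a minimal sketch of what `json_normalize` does with the nested `address` field, the example below flattens a single trimmed-down record. The `sample_properties` list is a hypothetical stand-in shaped like one entry of `search_result['properties']`; it is not the full API response.

```python
import pandas as pd

# Hypothetical, trimmed-down record shaped like one entry of
# search_result['properties'] (assumption: only a few fields are kept).
sample_properties = [
    {
        "id": "48249486",
        "price": 475000,
        "address": {
            "street": "6314 SE Jennings Ave",
            "city": "Milwaukie",
            "state": "OR",
            "zipcode": "97267",
        },
    }
]

# Nested dictionaries are flattened into dotted column names,
# for example 'address.city'.
flat = pd.json_normalize(sample_properties)
print(flat.columns.tolist())
print(flat.loc[0, "address.city"])
```

This is why the preview shows columns such as `address.street` and `address.zipcode` instead of a single `address` column.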


Next, we fetch the remaining pages so that the dataset covers the first 20 pages of search results:

In [None]:
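# Page 1 is already in df_api; fetch pages 2 through 20 and append each one.
for x in range(2, 21):
    search_result = zillow_api_client.search(
        params={
            "keyword": "portland, or",
            "type": "forSale",
            "page": x,
        }
    )
    df1 = json_normalize(search_result['properties'])
    # Keep the existing rows first so the pages stay in order.
    df_api = pd.concat([df_api, df1], ignore_index=True)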
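The paging step can also be sketched as a small helper that collects every page in a list and concatenates once at the end. The names `collect_pages`, `fetch_page`, and `fake_fetch` are hypothetical and introduced only for illustration; `fake_fetch` stands in for the real API call so the sketch runs without an account.

```python
import pandas as pd

def collect_pages(fetch_page, first_page=1, last_page=20):
    """Fetch each page once and concatenate the results in page order."""
    frames = []
    for page in range(first_page, last_page + 1):
        result = fetch_page(page)
        properties = result.get("properties", [])
        if not properties:
            break  # stop early once a page comes back empty
        frames.append(pd.json_normalize(properties))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

# Assumption: a stand-in for the real client that returns two toy
# listings per page for pages 1 to 3, and nothing afterwards.
def fake_fetch(page):
    if page > 3:
        return {"properties": []}
    return {
        "properties": [
            {"id": str(page * 10 + i), "price": 100000 * page}
            for i in range(2)
        ]
    }

df_pages = collect_pages(fake_fetch)
print(df_pages)
```

With the real client, `fetch_page` could be a function that calls `zillow_api_client.search` with the same `params` as above and the given page number.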

Now let's preview the data we have collected.

In [37]:
df_api.head()

| | id | url | image | status | currency | price | addressRaw | beds | baths | area | brokerName | brokerNameRaw | latitude | longitude | photos | address.street | address.city | address.state | address.zipcode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 48249486 | https://www.zillow.com/homedetails/6314-SE-Jen... | https://photos.zillowstatic.com/fp/9c864e56cb0... | FOR_SALE | $ | 475000 | 6314 SE Jennings Ave, Milwaukie, OR 97267 | 2.0 | 2.0 | 1279.0 | eXp Realty, LLC | eXp Realty, LLC | 45.39881 | -122.59761 | [https://photos.zillowstatic.com/fp/9c864e56cb... | 6314 SE Jennings Ave | Milwaukie | OR | 97267 |
| 1 | 53816326 | https://www.zillow.com/homedetails/7600-SW-60t... | https://photos.zillowstatic.com/fp/9c7586dc537... | FOR_SALE | $ | 539900 | 7600 SW 60th Ave, Portland, OR 97219 | 3.0 | 2.0 | 2015.0 | Redfin | Redfin | 45.46967 | -122.73805 | [https://photos.zillowstatic.com/fp/9c7586dc53... | 7600 SW 60th Ave | Portland | OR | 97219 |
| 2 | 53819252 | https://www.zillow.com/homedetails/15105-E-Bur... | https://photos.zillowstatic.com/fp/0bccd964f84... | FOR_SALE | $ | 369900 | 15105 E Burnside St, Portland, OR 97233 | 3.0 | 1.0 | 2064.0 | Windermere Realty Trust | Windermere Realty Trust | 45.522507 | -122.50768 | [https://photos.zillowstatic.com/fp/0bccd964f8... | 15105 E Burnside St | Portland | OR | 97233 |
| 3 | 53835615 | https://www.zillow.com/homedetails/1442-SW-Vis... | https://photos.zillowstatic.com/fp/646f9095ae1... | FOR_SALE | $ | 695000 | 1442 SW Vista Ave, Portland, OR 97201 | 4.0 | 3.0 | 4131.0 | Windermere Realty Trust | Windermere Realty Trust | 45.517944 | -122.69676 | [https://photos.zillowstatic.com/fp/646f9095ae... | 1442 SW Vista Ave | Portland | OR | 97201 |
| 4 | 53868321 | https://www.zillow.com/homedetails/7535-S-Hood... | https://photos.zillowstatic.com/fp/46ec309362f... | FOR_SALE | $ | 675000 | 7535 S Hood Ave, Portland, OR 97219 | 2.0 | 1.0 | 1941.0 | The Agency, Inc | The Agency, Inc | 45.469494 | -122.67473 | [https://photos.zillowstatic.com/fp/46ec309362... | 7535 S Hood Ave | Portland | OR | 97219 |


In [38]:
df_api.shape

(820, 19)

In [39]:
df_api.dtypes

id                  object
url                 object
image               object
status              object
currency            object
price                int64
addressRaw          object
beds               float64
baths              float64
area               float64
brokerName          object
brokerNameRaw       object
latitude           float64
longitude          float64
photos              object
address.street      object
address.city        object
address.state       object
address.zipcode     object
dtype: object

In [40]:
df_api.isnull().sum()

id                  0
url                 0
image               0
status              0
currency            0
price               0
addressRaw          0
beds               36
baths              34
area               35
brokerName         15
brokerNameRaw      15
latitude            0
longitude           0
photos              0
address.street      0
address.city        0
address.state       0
address.zipcode     0
dtype: int64

However, the API route is limited: how much data we can retrieve depends on the account. That brings us to the second option, scraping the Zillow website directly.

## II. Web scraping using Python <a class="anchor" id="sub_section_1_1_1"></a>
Import the necessary packages:

In [51]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
import numpy as np

Let's create a list to hold all the data we scrape from the website:

In [45]:
homelist = []

Define a function that collects the basic information for each home for sale on one search results page.

In [46]:
def gettable(page):
    """Scrape one search results page and append each listing to homelist."""
    url = f'https://www.zillow.com/portland-or/{page}_p/'
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        driver.implicitly_wait(0.5)
        lists = driver.find_elements(By.CSS_SELECTOR, '[class="ListItem-c11n-8-84-3__sc-10e22w8-0 StyledListCardWrapper-srp__sc-wtsrtn-0 iCyebE gTOWtl"]')
        for i in lists:
            # The card's <script> tag carries the home type and coordinates.
            script = i.find_element(By.TAG_NAME,'script').get_attribute("innerHTML")
            # The <b> tags hold bedrooms, bathrooms and area, in that order.
            stats = i.find_elements(By.TAG_NAME,'b')
            info = {
                'ID': i.find_element(By.CSS_SELECTOR,'[role="presentation"]').get_attribute('id')[5:],
                'Details': i.find_element(By.TAG_NAME,'a').get_attribute('href'),
                'Address': i.find_element(By.TAG_NAME,'address').text,
                'Images': [x.get_attribute('src') for x in i.find_elements(By.TAG_NAME,'img')[1:]],
                'Status': i.find_element(By.CSS_SELECTOR,'[class="StyledPropertyCardDataArea-c11n-8-84-3__sc-yipmu-0 dbDWjx"]').text.split(' - ')[1],
                'Price($)': i.find_element(By.CSS_SELECTOR,'[data-test="property-card-price"]').text[1:],
                'HomeType': script.split(',')[0][10:-1],
                'Bedrooms': stats[0].text,
                'Bathrooms': stats[1].text,
                'Structure area': stats[2].text,
                'Broker': i.find_element(By.CSS_SELECTOR,'[class="StyledPropertyCardDataArea-c11n-8-84-3__sc-yipmu-0 jretvB"]').text,
                'Latitude': script.split(',')[-3].split(':')[1],
                'Longitude': script.split(',')[-2].split(':')[1][:-1],
            }
            homelist.append(info)
    finally:
        # Always close the browser, even if a lookup fails.
        driver.quit()
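The function above reads the home type and coordinates by slicing the text of each card's `<script>` tag at fixed positions. If that tag holds JSON, a less position-dependent alternative is to parse it with `json.loads`. The sketch below rests on assumptions: `sample_script` is a hypothetical string shaped to be consistent with the slicing above (the `"geo"` key and the overall layout are guesses, not confirmed Zillow markup), and `find_key` is an illustrative helper.

```python
import json

def find_key(obj, key):
    """Return the first value stored under `key` anywhere in nested JSON data."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None

# Assumption: hypothetical script content; the real markup may differ.
sample_script = (
    '{"@type":"SingleFamilyResidence","@context":"http://schema.org",'
    '"name":"6314 SE Jennings Ave, Milwaukie, OR 97267",'
    '"geo":{"@type":"GeoCoordinates","latitude":45.39881,"longitude":-122.59761},'
    '"url":"https://www.zillow.com/homedetails/6314-SE-Jennings-Ave-Milwaukie-OR-97267/48249486_zpid/"}'
)

data = json.loads(sample_script)
home_type = data.get("@type")
latitude = find_key(data, "latitude")
longitude = find_key(data, "longitude")
print(home_type, latitude, longitude)
```

Parsing the JSON yields numbers rather than strings for the coordinates, and it keeps working if extra fields are added before or after them.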

Below is the code to scrape the detailed information of one home from its listing page:

In [47]:
def get_detailed_info(url):
    """Scrape the detail page of a single listing and return it as a dict."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        driver.implicitly_wait(0.5)
        info = driver.find_element(By.CSS_SELECTOR, ".summary-container")
        images = driver.find_element(By.CSS_SELECTOR,'[class="hdp__sc-1wi9vqt-0 dDzspE ds-media-col media-stream"]').find_elements(By.TAG_NAME,'img')
        img = [i.get_attribute('src') for i in images]
        # The items are expected in the order bedrooms, bathrooms, area.
        bed_bath = info.find_elements(By.CSS_SELECTOR,'[data-testid="bed-bath-item"]')
        listing = {
            'ID': driver.title.split(' | ')[1][5:],
            'Address': driver.title.split(' | ')[0],
            'Price($)': info.find_element(By.CSS_SELECTOR,'[data-testid="price"]').text[1:],
            'Status': info.find_element(By.CSS_SELECTOR,'[class = "hdp__sc-13r9t6h-0 ds-chip-removable-content"]').text,
            'HomeType': driver.find_element(By.CSS_SELECTOR,'[class="Text-c11n-8-84-3__sc-aiai24-0 dpf__sc-2arhs5-3 hrfydd kOlNqB"]').text,
            'Bedrooms': bed_bath[0].text.split()[0],
            'Bathrooms': bed_bath[1].text.split()[0],
            'Structure area': bed_bath[2].text.split()[0],
            'Photos': img,
            'AgentInfo': driver.find_element(By.CSS_SELECTOR,'[data-testid="attribution-LISTING_AGENT"]').text,
            'Broker': driver.find_element(By.CSS_SELECTOR,'[data-testid="attribution-BROKER"]').text,
        }
    finally:
        # Always close the browser, even if a lookup fails.
        driver.quit()
    return listing
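A function like `get_detailed_info` is typically called once per listing URL, and a single page with a missing element would otherwise stop the whole run. The sketch below shows one possible wrapper; `collect_details`, `fetch`, `delay_seconds`, and `fake_fetch` are hypothetical names, and `fake_fetch` replaces the browser so the example runs on its own.

```python
import time

def collect_details(urls, fetch, delay_seconds=0.0):
    """Call fetch(url) for each URL, recording failures instead of stopping."""
    rows, failed = [], []
    for url in urls:
        try:
            rows.append(fetch(url))
        except Exception as exc:  # for example, an element missing on the page
            failed.append((url, repr(exc)))
        time.sleep(delay_seconds)  # optional pause between requests
    return rows, failed

# Assumption: a stand-in for the real scraper that fails on one URL.
def fake_fetch(url):
    if url.endswith("broken"):
        raise ValueError("element not found")
    return {"url": url}

rows, failed = collect_details(
    ["https://example.com/a", "https://example.com/broken"], fake_fetch
)
print(rows)
print(failed)
```

With the real scraper, `fetch` could be `get_detailed_info` and `urls` the `Details` links gathered from the search pages, with a non-zero `delay_seconds` between requests.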


We will get the first 20 pages of home listings in Portland, OR, across all home types and prices.

In [None]:
for x in range(1,21):
    gettable(x)

In [49]:
df = pd.DataFrame(homelist)
df.head()

| | ID | Details | Address | Images | Status | Price($) | HomeType | Bedrooms | Bathrooms | Structure area | Broker | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 48249486 | https://www.zillow.com/homedetails/6314-SE-Jen... | 6314 SE Jennings Ave, Milwaukie, OR 97267 | [https://photos.zillowstatic.com/fp/9c864e56cb... | Active | 475000 | SingleFamilyResidence | 2 | 2 | 1279 | EXP REALTY, LLC | 45.39881 | -122.59761 |
| 1 | 2056499323 | https://www.zillow.com/homedetails/7310-NE-11t... | 7310 NE 11th Ave, Portland, OR 97211 | [https://photos.zillowstatic.com/fp/f34422efee... | Active | 329900 | SingleFamilyResidence | 2 | 2 | 950 | RE/MAX EQUITY GROUP | 45.576153 | -122.653465 |
| 2 | 53918806 | https://www.zillow.com/homedetails/1609-SE-148... | 1609 SE 148th Ave, Portland, OR 97233 | [https://photos.zillowstatic.com/fp/5c395b3e12... | Active | 369900 | SingleFamilyResidence | 3 | 2 | 1380 | ISTAR REALTY LLC | 45.511578 | -122.51084 |
| 3 | 53888026 | https://www.zillow.com/homedetails/5909-NE-25t... | 5909 NE 25th Ave, Portland, OR 97211 | [https://photos.zillowstatic.com/fp/87d34335fc... | Active | 745000 | SingleFamilyResidence | 3 | 3 | 2493 | RE/MAX EQUITY GROUP | 45.56548 | -122.640594 |
| 4 | 72258885 | https://www.zillow.com/homedetails/2109-NW-Irv... | 2109 NW Irving St UNIT 103, Portland, OR 97210 | [https://photos.zillowstatic.com/fp/84ca5942da... | Active | 234500 | SingleFamilyResidence | 1 | 1 | 454 | CENTURY 21 CASCADE | 45.527985 | -122.69485 |


### Data Cleaning
The DataFrame may not be in the format we want. To clean it up, we check the column types and null values.

In [53]:
df.dtypes

ID                object
Details           object
Address           object
Images            object
Status            object
Price($)          object
HomeType          object
Bedrooms          object
Bathrooms         object
Structure area    object
Broker            object
Latitude          object
Longitude         object
dtype: object

We convert some columns to numeric types to make data analysis easier.

In [54]:
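# Strip thousands separators so the values can be parsed as numbers.
# Assigning the result back avoids relying on inplace changes to a column selection.
df['Price($)'] = df['Price($)'].str.replace(',', '', regex=False)
df['Structure area'] = df['Structure area'].str.replace(',', '', regex=False)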

In [55]:
numeric_cols = ['ID','Price($)','Bedrooms','Bathrooms','Structure area','Latitude','Longitude']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

In [56]:
df.dtypes

ID                  int64
Details            object
Address            object
Images             object
Status             object
Price($)            int64
HomeType           object
Bedrooms            int64
Bathrooms           int64
Structure area      int64
Broker             object
Latitude          float64
Longitude         float64
dtype: object
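The conversion above succeeds only when every scraped value is a clean number. As a hedged sketch of a more tolerant variant, `pd.to_numeric(..., errors="coerce")` turns unparseable entries into `NaN` instead of raising. The `raw` table and its `"--"` placeholder are made-up examples, not values taken from the scraped data.

```python
import pandas as pd

# Assumption: toy values, including a "--" placeholder that is not a number.
raw = pd.DataFrame({
    "Price($)": ["475,000", "329,900", "--"],
    "Structure area": ["1,279", "950", "--"],
})

cleaned = raw.copy()
for col in ["Price($)", "Structure area"]:
    # Remove thousands separators, then parse; bad values become NaN.
    no_commas = cleaned[col].str.replace(",", "", regex=False)
    cleaned[col] = pd.to_numeric(no_commas, errors="coerce")

print(cleaned)
print(cleaned.isna().sum())
```

Columns that contain `NaN` come out as `float64` rather than `int64`, which matches how `beds`, `baths`, and `area` appear in the API dataset.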

## III. Export and Data summary <a class="anchor" id="sub_section_1_1_1"></a>

This is the dataset retrieved through the Zillow API:
- Dataset structure: 820 observations (rows), 19 features (variables)
- Missing data: 135 missing values in total across all columns
- Data types: three dtypes in this dataset: object, int64, and float64
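Summary figures of this kind can be computed directly from a DataFrame. The sketch below uses a small made-up table named `sample` (not the real dataset) to show the calls involved; on the real data the same expressions would be applied to `df_api`.

```python
import pandas as pd

# Assumption: toy data with two missing values, for illustration only.
sample = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "price": [100000, 200000, 300000],
    "beds": [2.0, None, 3.0],
    "brokerName": ["Broker A", None, "Broker B"],
})

n_rows, n_cols = sample.shape                      # observations, features
total_missing = int(sample.isnull().sum().sum())   # missing cells in the whole table
dtype_counts = sample.dtypes.astype(str).value_counts().to_dict()

print(n_rows, n_cols)
print(total_missing)
print(dtype_counts)
```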

| Column | Description | 
| :---: | :--- |
| id | ID of each home for sale. |
| url | Link to the detailed information of the house. |
| image | Link to the main image of the house. |
| status | Listing systems have four different “Active” statuses for a property: Active Option Contract (AOC), Active Contingent, Active Kick Out, and Active. Zillow simply shows all four as “For Sale”. |
| currency | Currency of the price. |
| price | Price of the house. |
| addressRaw | Full address of the house, including street number, street, city, state, and ZIP code. |
| beds | Number of bedrooms in the house. |
| baths | Number of full baths in the house. |
| area | Total structure area of the house. |
| brokerName | Name of the broker. |
| brokerNameRaw | Raw form of the broker name. |
| latitude | Latitude of the house's location. |
| longitude | Longitude of the house's location. |
| photos | The full list of photos of the house. |
| address.street | Street address (number and street name). |
| address.city | City name. |
| address.state | State abbreviation. |
| address.zipcode | ZIP code. |

In [44]:
# index=False keeps the row index out of the file.
df_api.to_csv('Home for Sale in Portland, OR on Zillow.csv', index=False)
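A small sketch of how the `index` argument of `to_csv` affects the exported file: by default pandas also writes the row index, and reading the file back then produces an extra `Unnamed: 0` column. The `toy` table is a made-up example, and an in-memory buffer replaces the file on disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({"id": ["1", "2"], "price": [475000, 539900]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)      # ['Unnamed: 0', 'id', 'price']

# With index=False only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)   # ['id', 'price']
```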