# Browser Automation Homework
Due 7-25<br>
Completed by: **TK YOUR NAME**

We're going to visit the real estate site Zillow.com and search "For sale" listings in a town of your choosing.

We'll collect the listings in the first 5 pages (or all if you like), and get a feel for the price range in that town.

Ultimately I want to know the median price of that town.

Note: if you get asked if you're a bot, just complete the challenges manually.

In [1]:
import os
import time
import random
import pandas as pd

from seleniumwire import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

### 1) Open the browser, hide automation signs, visit Zillow.com

In [2]:
def open_browser():

    options = webdriver.ChromeOptions()

    # The browser should open at its maximum size
    options.add_argument('start-maximized')

    # Remove all the signs that reveal this is an automated browser
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)

    return webdriver.Chrome(options=options)

In [3]:
driver = open_browser()
driver.get('https://www.zillow.com')

### 2) Find Search Box

Use selenium's `find_element` [function](https://selenium-python.readthedocs.io/locating-elements.html?highlight=find_element#locating-elements) to find the search box on the Zillow site.
You can use whichever `By` [option](https://selenium-python.readthedocs.io/api.html?highlight=By#locate-elements-by) you feel most comfortable with.

In [4]:
search_box = driver.find_element(
    By.XPATH,
    '//input[@placeholder="Enter an address, neighborhood, city, or ZIP code"]'
)
search_box

<selenium.webdriver.remote.webelement.WebElement (session="ceb1a0c78fc98dd8b77b28b1ba1e0f5c", element="D2C920E9D0C2830AD2951E7193EED0AC_element_23")>

### 3) Input a geography into search bar

After you've found `search_box`, find a way to input or send your search term (stored in `city` below) into the input field.

Feel free to change `city` to wherever you like.

In [5]:
city = 'Leavenworth, WA'
search_box.send_keys(city)

### 4) Make the search

Originally, I thought we could get away with just pressing "ENTER". If you try that, you'll see that the listings are not from the geography you searched for.

Instead, you'll see a list of suggestions. Click the first suggestion.

You can do that by first finding that suggestion, then [clicking](https://saucelabs.com/resources/blog/the-selenium-click-command) on it.

In [6]:
search_button = driver.find_element(
    By.XPATH, 
    '//button[@type="submit"]')
search_button

<selenium.webdriver.remote.webelement.WebElement (session="ceb1a0c78fc98dd8b77b28b1ba1e0f5c", element="D2C920E9D0C2830AD2951E7193EED0AC_element_24")>

In [7]:
search_button.click()

### 5) Pick "For sale," if asked
You might be prompted to check for rentals or sales. This doesn't always show up, but be prepared to click "For sale" if you need to.

In [8]:
sale_button = driver.find_element(
    By.XPATH, 
    '//button[normalize-space()="For sale"]'
)
sale_button

<selenium.webdriver.remote.webelement.WebElement (session="ceb1a0c78fc98dd8b77b28b1ba1e0f5c", element="D2C920E9D0C2830AD2951E7193EED0AC_element_28")>

In [9]:
sale_button.click() 
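
Because the "For sale" prompt is optional, `find_element` raises `NoSuchElementException` on runs where it never appears. The sketch below is one hedged way to handle that: it polls with `find_elements` (plural), which returns an empty list when nothing matches, and gives up quietly after a short timeout. The helper name `click_if_present` and its `timeout` and `poll` defaults are assumptions for illustration, not part of the assignment. The string `"xpath"` is the value behind Selenium's `By.XPATH` constant.

```python
import time


def click_if_present(driver, xpath, timeout=3.0, poll=0.25):
    """Click the first element matching `xpath` if it shows up in time.

    Returns True if something was clicked, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns [] instead of raising when nothing matches
        matches = driver.find_elements("xpath", xpath)
        if matches:
            matches[0].click()
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

With the notebook's `driver`, a call might look like `click_if_present(driver, '//button[normalize-space()="For sale"]')`. The cell can then be re-run safely whether or not the prompt appeared.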

### 6) What are the prices of the houses on the first page?
Print the text of each listing's property price below:

In [10]:
cards = driver.find_elements(
    By.XPATH,
    '//span[@data-test="property-card-price"]',
)
prices = [price.text for price in cards]

print(len(prices))
for price in prices:
    print(price)

9
$429,000
$199,000
$495,000
$550,000
$59,900
$475,000
$90,000
$375,000
$1,999,000


Note: you _should_ see more than nine listings.

You'll need to find a way to scroll down the page to load each new card. From my tests, each page holds about 40 listings.

This is not a simple task! I found one way to do this below. Can you find a better one?

In [21]:
N = 0
while True:
    # get all the listings, and scroll to the last one, then wait two seconds.
    prices = driver.find_elements(By.XPATH, './/span[@data-test="property-card-price"]')
    last_listing = prices[-1]
    
    # you can use selenium to issue JavaScript commands:
    driver.execute_script("arguments[0].scrollIntoView();", last_listing)
    N_prices = len(prices)
    if N_prices == N:
        break
    N = N_prices
    time.sleep(2)

In [22]:
print(len(prices))

prices = [price.text for price in prices]
prices

41


['$429,000',
 '$199,000',
 '$495,000',
 '$550,000',
 '$59,900',
 '$475,000',
 '$90,000',
 '$375,000',
 '$1,999,000',
 '$2,944,000',
 '$3,800,000',
 '$589,000',
 '$522,000',
 '$799,000',
 '$1,100,000',
 '$2,625,000',
 '$795,000',
 '$405,000',
 '$415,000',
 '$1,275,000',
 '$1,200,000',
 '$289,000',
 '$1,299,000',
 '$699,000',
 '$799,900',
 '$449,000',
 '$649,950',
 '$2,500,000',
 '$1,900,000',
 '$850,000',
 '$2,950,000',
 '$799,000',
 '$329,000',
 '$1,900,000',
 '$1,175,000',
 '$269,000',
 '$559,950',
 '$990,000',
 '$599,000',
 '$764,000',
 '$1,875,000']

Is there a better way to do this? Feel free to experiment, but it's not necessary for the assignment.
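
One possible refinement, offered as a sketch rather than the definitive answer: wrap the same scroll-until-the-count-stops-growing idea in a function that also copes with a page that has no cards (the loop above would raise an `IndexError` on `prices[-1]`) and that cannot loop forever. The name `load_all_cards` and the `max_rounds` cap are assumptions for illustration. The string `"xpath"` is the value behind `By.XPATH`.

```python
import time

PRICE_XPATH = '//span[@data-test="property-card-price"]'


def load_all_cards(driver, xpath=PRICE_XPATH, pause=2.0, max_rounds=20):
    """Scroll to the last matching element until no new elements appear."""
    seen = 0
    elements = []
    for _ in range(max_rounds):
        elements = driver.find_elements("xpath", xpath)
        if not elements or len(elements) == seen:
            break  # nothing found, or no new cards since the last scroll
        seen = len(elements)
        # same JavaScript call as above: bring the last card into view
        driver.execute_script("arguments[0].scrollIntoView();", elements[-1])
        time.sleep(pause)  # give the page time to load the next batch
    return elements
```

Calling `load_all_cards(driver)` returns the list of price elements once the count stabilizes, so `[el.text for el in load_all_cards(driver)]` gives the price strings.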

### 7) Save the results as HTML
Save the page source to `html_out` as an HTML file

In [23]:
# save the page source as an HTML file, creating the `data/` folder if needed
os.makedirs('data', exist_ok=True)
with open('data/zillow_source.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)

### 8) Go to the next page
After collecting the first page, go to the next one by clicking the "Next page" button.

In [24]:
next_page = driver.find_element(
    By.XPATH,
    '//a[@title="Next page"]'
)
next_page

<selenium.webdriver.remote.webelement.WebElement (session="ceb1a0c78fc98dd8b77b28b1ba1e0f5c", element="1634859133D49332C1FE1EAF7B94B17F_element_167")>

In [25]:
# click it (yes, this is repetitive)
next_page.click()

### 9) Cycle through each page of results
Above we outlined each step; now put it all together and collect as many results as you can. Add a `time.sleep(2)` (or some other reasonable pause) between steps.
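
A fixed two-second pause works. One hedged variation is to add a little random jitter so the pauses are not perfectly regular. The helper below is only a sketch: the name `polite_sleep` and its default values are assumptions, and the `sleep` parameter exists so the function can be exercised without actually waiting.

```python
import random
import time


def polite_sleep(base=2.0, jitter=1.5, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds."""
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay  # returned so the caller can log how long it waited
```

In the loop, `polite_sleep()` would stand in wherever `time.sleep(2)` appears.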

You can stop after the 5th page to save time.

Note: you can parse prices from the listings directly with Selenium here, or save each page as HTML and parse the files after you collect them. I recommend the latter, but for the sake of the homework feel free to take the shortcut.
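
To illustrate the save-then-parse route without a browser, here is a minimal sketch that pulls the price text out of a saved page using only the standard library's `html.parser`. This is a swapped-in technique for illustration, not necessarily the parser the solution uses. It assumes each price sits in a `<span data-test="property-card-price">` that contains plain text with no nested spans. The names `PriceSpanParser` and `prices_from_html` are hypothetical.

```python
from html.parser import HTMLParser


class PriceSpanParser(HTMLParser):
    """Collect the text of every <span data-test="property-card-price">."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("data-test") == "property-card-price":
            self._in_price = True
            self.prices.append("")  # start a new price string

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices[-1] += data


def prices_from_html(html):
    """Return the stripped price strings found in an HTML document."""
    parser = PriceSpanParser()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.prices]
```

Given the text of a saved page, `prices_from_html(page_text)` returns strings such as `'$429,000'`, which still need converting to numbers.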

In [41]:
# first quit the browser (this ends the whole WebDriver session) to start anew
driver.quit()

In [42]:
driver = open_browser()
driver.get('https://www.zillow.com')

In [43]:
search_box = driver.find_element(
    By.XPATH,
    '//input[@placeholder="Enter an address, neighborhood, city, or ZIP code"]',
)

city = 'Leavenworth, WA'
search_box.send_keys(city)

In [44]:
send_button = driver.find_element(
    By.XPATH,
    '//button[@type="submit"]',
)
send_button.click()

In [45]:
for_sale = driver.find_element(
    By.XPATH,
    '//button[normalize-space()="For sale"]', 
)

for_sale.click()

In [None]:
num_pages = 3
for i in range(0,num_pages):
    N = 0
    while True:
        # get all the listings, and scroll to the last one, then wait two seconds.
        cards = driver.find_elements(
            By.XPATH, 
            # './/span[@data-test="property-card-price"]'
            '//div[contains(@class, "StyledPropertyCardDataWrapper")]'
            
        )
    
        last_listing = cards[-1]
        
        # you can use selenium to issue JavaScript commands:
        driver.execute_script("arguments[0].scrollIntoView();", last_listing)
        N_cards = len(cards)
        print(N_cards)
        if N_cards == N:
            break
        N = N_cards
        time.sleep(2)
    
    with open(f'data/zillow_source_{i}.html', 'w') as f:
        f.write(driver.page_source)

    # only click "Next page" if another page remains to be collected
    if i < num_pages - 1:
        next_page = driver.find_element(
            By.XPATH,
            '//a[@title="Next page"]'
            )
        next_page.click()
        time.sleep(2)

In [47]:
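# Caution: an XPath that starts with '//' searches the whole document, even when
# called on `card`, so every card returns the first match on the page (see the
# repeated output below). Start the path with './/' to search within each card.
xpath_expr = '//div[contains(@class, "StyledPropertyCardDataWrapper")]'
prices = [card.find_element(By.XPATH, xpath_expr).text for card in cards]
prices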

['25701 Camp 12 Road, Leavenworth, WA 98826\nListing provided by NWMLS\n$189,000\n0.29 acres lot - Active',
 '25701 Camp 12 Road, Leavenworth, WA 98826\nListing provided by NWMLS\n$189,000\n0.29 acres lot - Active',
 '25701 Camp 12 Road, Leavenworth, WA 98826\nListing provided by NWMLS\n$189,000\n0.29 acres lot - Active',
 '25701 Camp 12 Road, Leavenworth, WA 98826\nListing provided by NWMLS\n$189,000\n0.29 acres lot - Active',
 '25701 Camp 12 Road, Leavenworth, WA 98826\nListing provided by NWMLS\n$189,000\n0.29 acres lot - Active',
 '25701 Camp 12 Road, Leavenworth, WA 98826\nListing provided by NWMLS\n$189,000\n0.29 acres lot - Active',
 '25701 Camp 12 Road, Leavenworth, WA 98826\nListing provided by NWMLS\n$189,000\n0.29 acres lot - Active',
 '25701 Camp 12 Road, Leavenworth, WA 98826\nListing provided by NWMLS\n$189,000\n0.29 acres lot - Active',
 '25701 Camp 12 Road, Leavenworth, WA 98826\nListing provided by NWMLS\n$189,000\n0.29 acres lot - Active',
 '25701 Camp 12 Road, Leaven

In [85]:
driver.quit()

### 10) Parse the prices

Parse the prices into a list or a Pandas Series, and list the median price.

In [76]:
from lxml import etree
import re

In [73]:
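def extract_cards(num_pages, xpath_expr):
    """Load each saved page and collect every element matching `xpath_expr`."""
    cards = []
    for i in range(num_pages):
        # use a context manager so each file is closed after reading
        with open(f'data/zillow_source_{i}.html') as f:
            dom = etree.HTML(f.read())
        cards.extend(dom.xpath(xpath_expr))
        print(len(cards))
    return cards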

In [78]:
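# select the price spans explicitly instead of relying on whatever `xpath_expr` last held
prices = extract_cards(
    3,
    '//span[@data-test="property-card-price"]'
)
prices = [price.text for price in prices]
# raw string avoids the invalid-escape warning that '\$' triggers
prices = [int(re.sub(r'[$,]', '', price)) for price in prices]
prices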

41
82
106


[429000,
 199000,
 495000,
 550000,
 59900,
 475000,
 90000,
 375000,
 1999000,
 2944000,
 3800000,
 589000,
 522000,
 799000,
 1100000,
 2625000,
 795000,
 405000,
 415000,
 1275000,
 1200000,
 289000,
 1299000,
 699000,
 799900,
 449000,
 649950,
 2500000,
 1900000,
 850000,
 2950000,
 799000,
 329000,
 1900000,
 1175000,
 269000,
 559950,
 990000,
 599000,
 764000,
 1875000,
 1099000,
 599000,
 295000,
 154000,
 405000,
 264900,
 299000,
 960000,
 539000,
 1125000,
 79000,
 1100000,
 1590000,
 825000,
 450000,
 485000,
 415000,
 349000,
 450000,
 525000,
 200000,
 950000,
 379000,
 198000,
 1750000,
 80000,
 505000,
 359000,
 225000,
 250000,
 425000,
 310000,
 348000,
 90000,
 295000,
 350000,
 259000,
 325000,
 189000,
 560000,
 625000,
 189000,
 239000,
 217000,
 399700,
 249900,
 1600000,
 425000,
 249000,
 475000,
 149000,
 699000,
 217000,
 499900,
 350000,
 418000,
 149000,
 349000,
 499900,
 125000,
 375000,
 295000,
 690000,
 250000,
 420000]

In [80]:
import statistics

In [81]:
statistics.mean(prices)

688641.5094339623

In [82]:
statistics.median(prices)

449500.0

In [83]:
statistics.stdev(prices)

675607.5610135312

In [84]:
pd.Series(prices).describe()

count    1.060000e+02
mean     6.886415e+05
std      6.756076e+05
min      5.990000e+04
25%      2.950000e+05
50%      4.495000e+05
75%      7.990000e+05
max      3.800000e+06
dtype: float64
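
The conversion above assumes every price string is a clean dollar amount. `int()` raises an error if a card's text is empty or reads something like "Contact agent". A more defensive sketch, under the assumption that valid prices look like `$1,275,000` (optionally with a trailing `+`), is shown below. `parse_price` and the sample list `raw` are hypothetical names and data for illustration only.

```python
import re
import statistics

# optional "$", digits with optional thousands separators, optional trailing "+"
_PRICE = re.compile(r"\$?(\d[\d,]*)\+?")


def parse_price(text):
    """Turn '$1,275,000' into 1275000; return None for anything else."""
    match = _PRICE.fullmatch((text or "").strip())
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


raw = ["$429,000", "$59,900", "", "$1,999,000"]
clean = [p for p in (parse_price(t) for t in raw) if p is not None]
print(statistics.median(clean))  # 429000
```

Dropping unparseable entries keeps the median from failing on one bad card. It is worth printing how many were dropped so a selector change does not go unnoticed.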

## Extra credit
- What is the median price per square foot?
- Which realtor has the most listings?
- Can you highlight listings over $1M in red and take a full-page screenshot?