# MTH 601 Capstone Project 
## Introduction
### By Kurt Muller, Rodrigo Henriquez, Kaitlyn Torres, and Christopher Benson

This notebook tracks all work related to our capstone project.

> **Capstone Project Prompt**:
>
> Parse the Zillow website to find houses in your zip code.
> https://www.zillow.com/long-island-ny/
>
> Get useful features such as:
>
> - Listing price
> - Sales price
> - Days on market
> - Num beds
> - Num baths
> - Age
> - Tax
> - School district
> - Attractions - school, worship, waterfront, etc.
> - Zillow valuation score
>
> You might be predicting/improving the Zillow valuation score as the y-variable, or predicting if it is going to sell using days on market or sale completion as the y-variable.

## Technologies Used

**Pandas** is a Python package built on NumPy that provides the DataFrame, a tabular data structure. It is widely used in data science to organize, display, modify, and inspect data.
> **More Information**: Official Website: https://pandas.pydata.org
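
As a rough illustration of how scraped listings could be organized, the sketch below builds a small DataFrame and derives one extra feature. The rows and the column names (`listing_price`, `beds`, `baths`, `sqft`, `price_per_sqft`) are hypothetical placeholders, not data collected by this project.

```python
import pandas as pd

# Hypothetical listings; the values are made up for illustration only.
listings = [
    {"listing_price": 500000, "beds": 3, "baths": 2, "sqft": 2000},
    {"listing_price": 750000, "beds": 4, "baths": 3, "sqft": 2500},
]

# Each dictionary becomes one row; the keys become the column names.
df = pd.DataFrame(listings)

# Derive a new feature from two existing columns.
df["price_per_sqft"] = df["listing_price"] / df["sqft"]

print(df)
```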

**BeautifulSoup** is a Python package for web scraping. It parses HTML so that we can extract information from web pages and manipulate it as needed.

> **More Information**: Official Website: https://www.crummy.com/software/BeautifulSoup/
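
The following is a minimal sketch of how BeautifulSoup can pull values out of HTML. The `sample_html` string is a shortened fragment of the Zillow page source printed later in this notebook; the variable names are our own, and a real page contains far more markup.

```python
from bs4 import BeautifulSoup

# Shortened fragment of a listing page's <head>, used here as sample input.
sample_html = """
<html><head>
<title>374 Stewart Avenue, Garden City, NY 11530 | MLS #3418570 | Zillow</title>
<meta name="description" content="Zillow has 33 photos of this $3,298,000 4 beds, 5 baths, 4,610 Square Feet single family home located at 374 Stewart Avenue, Garden City, NY 11530 built in 1954. MLS #3418570.">
</head><body></body></html>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(sample_html, "html.parser")

# The text inside the <title> tag.
page_title = soup.title.string

# The <meta name="description"> tag; attrs= is used because "name" is also
# the name of find()'s first parameter.
description_tag = soup.find("meta", attrs={"name": "description"})
description = description_tag["content"] if description_tag else None

print(page_title)
print(description)
```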

**Selenium** automates a web browser, driving it natively as a user would, either locally or on a remote machine through the Selenium server.

> **More Information**: Selenium requires a web driver to launch. For this program, we chose the Chrome web driver. Use the link (https://sites.google.com/chromium.org/driver/) to download the appropriate driver.<br /><br />Once the driver is downloaded, click the executable to run it. You must also have the Chrome browser installed on the device. Check the Chrome browser version and download the matching web driver version. For example, for Chrome version 106.0.5249.103, you would choose the Chrome 106 web driver. In our code, we avoid this hassle by using webdriver_manager.

**webdriver_manager** is a Python package that simplifies the installation and management of binary drivers for different browsers.

> **More Information**: Official Website: https://pypi.org/project/webdriver-manager/

## Importing Packages
Here we import the packages that this program uses. Run this cell before the rest of the notebook.
Before running the imports, make sure you have installed the packages with the following commands (`time` is part of the Python standard library and does not need to be installed):
> ```pip install webdriver_manager ```<br />
> ```pip install bs4 ```<br />
> ```pip install requests ```<br />
> ```pip install selenium ```<br />
> ```pip install selenium_stealth ```<br />
> ```pip install playwright ```<br />


In [97]:
from bs4 import BeautifulSoup
import requests
from webdriver_manager.chrome import ChromeDriverManager
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium_stealth import stealth
import time
import playwright

## Testing Selenium
This code shows the basics of having Selenium load a page from a given URL.

In [96]:
# An alternative way to set up chromedriver is to point Selenium at a local copy.
# Since we are using webdriver_manager, we do not need to do it this way.
# You must change this path to the path of the chromedriver on your machine:
#PATH = "/Users/km/Applications/chromedriver"
#driver = webdriver.Chrome(service=Service(PATH))

# Instead, webdriver_manager downloads a matching chromedriver for us.
# Selenium 4 expects the driver path to be wrapped in a Service object.
from selenium.webdriver.chrome.service import Service

# Setting up the driver and testing
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

url = "https://www.zillow.com/homedetails/374-Stewart-Ave-Garden-City-NY-11530/31195014_zpid/"
driver.get(url)
html = driver.page_source
print(html)

time.sleep(5)
driver.quit()



<html itemscope="" itemtype="http://schema.org/Organization" class=" zsg-theme-modernized null" lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:product="http://ogp.me/ns/product#"><head>
<title>374 Stewart Avenue, Garden City, NY 11530 | MLS #3418570 | Zillow</title><meta name="description" content="Zillow has 33 photos of this $3,298,000 4 beds, 5 baths, 4,610 Square Feet single family home located at 374 Stewart Avenue, Garden City, NY 11530 built in 1954. MLS #3418570."><meta name="author" content="Zillow, Inc."><meta name="Copyright" content="Copyright (c) 2006-2022 Zillow, Inc."><script async="" src="/HYx10rg3/init.js"></script><script defer="" src="https://e.zg-api.com/a/z/js/v1/analytics.js?v=bcf290c"></script><script async="" src="https://www.google-analytics.com/analytics.js"></script><script>
        if (typeof Object.assign === 'function') {
            window.appInfo = Object.assign(
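
The meta description in the page source above already contains several of the features listed in the project prompt. The following is a minimal sketch of pulling them out with regular expressions. The function name `parse_description` and the patterns are our own assumptions based on this one listing; other listings may be worded differently (for example, half baths), in which case the missing field is returned as `None`.

```python
import re

def parse_description(description):
    """Extract price, beds, baths, square feet, and year built from a description string."""
    # One pattern per feature; each captures the number we want in group 1.
    patterns = {
        "price": r"\$([\d,]+)",
        "beds": r"(\d+) beds?",
        "baths": r"(\d+) baths?",
        "sqft": r"([\d,]+) Square Feet",
        "year_built": r"built in (\d{4})",
    }
    features = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, description)
        # Strip thousands separators before converting to an integer.
        features[name] = int(match.group(1).replace(",", "")) if match else None
    return features

# Description text copied from the <meta name="description"> tag in the output above.
description = (
    "Zillow has 33 photos of this $3,298,000 4 beds, 5 baths, 4,610 Square Feet "
    "single family home located at 374 Stewart Avenue, Garden City, NY 11530 "
    "built in 1954. MLS #3418570."
)

features = parse_description(description)
print(features)
```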
            

## Zillow's Captcha Issue
One issue with using Selenium is that Zillow flags it as non-human traffic and responds with a captcha, which we cannot get past without paying a service to complete the captcha for us. As a result, we decided to use housing information from another site, https://www.realtor.com/, which contains very similar information that we can scrape and use to attempt to create our own version of the Zillow Zestimate®.

## Using Selenium to Search a Zip Code on Realtor.com
In this example, we take a given zip code, in this case '11530', and have Selenium load the matching Realtor.com search page for us.

In [None]:
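# Selenium 4 expects the chromedriver path to be wrapped in a Service object
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

zipCode = "11530"
# The search page for a zip code is the base search URL followed by the zip code
url = "https://www.realtor.com/realestateandhomes-search/" + zipCode
print("\n**URL**:" + url + "\n")
driver.get(url)
html = driver.page_source

# Keep the browser open briefly so the loaded page can be inspected
time.sleep(2)

driver.quit()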
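
The search URL above is built by appending the zip code to a fixed base URL. As a small sketch of how this could be wrapped for reuse, the helper below validates the zip code first. The name `build_search_url` and the 5-digit check are our own assumptions; zip codes should be passed as strings so that leading zeros are kept.

```python
import re

# Base search URL used in the cell above.
BASE_URL = "https://www.realtor.com/realestateandhomes-search/"

def build_search_url(zip_code):
    """Return the Realtor.com search URL for a 5-digit zip code."""
    zip_code = str(zip_code).strip()
    # Reject anything that is not exactly five digits.
    if not re.fullmatch(r"\d{5}", zip_code):
        raise ValueError(f"Expected a 5-digit zip code, got {zip_code!r}")
    return BASE_URL + zip_code

# 11530 is the zip code used in this notebook; more codes could be added to the list.
for code in ["11530"]:
    print(build_search_url(code))
```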

## Scraping Property Information from Realtor.com
Now let's begin to scrape the properties from Realtor.com.

In [92]:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service

# Selenium 4 expects the chromedriver path to be wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

zipCode = "11530"
url = "https://www.realtor.com/realestateandhomes-search/" + zipCode
driver.get(url)
html = driver.page_source
print(html)

#Getting total number of search results (stays None if the footer is missing)
numOfResults = None
try:
    footer_text = driver.find_element(By.ID, "srp-footer-found-listing").text
    numOfResults = footer_text.split(' ')[1]
except (NoSuchElementException, IndexError):
    pass

#By.CLASS_NAME accepts a single class name only, so we use a CSS selector for the property cards
print(driver.find_elements(By.CSS_SELECTOR, ".component_property-card"))

#Since properties are under the HTML tag <li>, we grab all tags with <li> to parse later on
property_cards = driver.find_elements(By.TAG_NAME, "li")

#Collect the text of every <li> element, printing a running count
info = []
for count, card in enumerate(property_cards, start=1):
    info.append(card.text)
    print(count)
#Example of a property detail URL:
#https://www.realtor.com/realestateandhomes-detail/100-Hilton-Ave-Unit-134_Garden-City_NY_11530_M34622-78396

print(numOfResults)
#Print the text of the second element, if there is one
if len(info) > 1:
    print(info[1])

driver.quit()



<html lang="en"><head>
  <title>Pardon Our Interruption</title>

  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <meta name="viewport" content="width=1000">
  <meta name="robots" content="noindex, nofollow">
  <meta http-equiv="cache-control" content="max-age=0">
  <meta http-equiv="cache-control" content="no-cache">
  <meta http-equiv="expires" content="0">
  <meta http-equiv="pragma" content="no-cache">

  <link href="/miscellaneous/challenge.css" rel="stylesheet">

  <script>
      function showBlockPage() {
          document.getElementsByClassName("container")[0].style.display = "";
      }
      setTimeout(showBlockPage, 10000);
  </script>
<script src="/41V9jz72/captcha/captcha.js?a=c&amp;u=fa3fe421-4680-11ed-b813-4f4c644c774e&amp;v=&amp;m=0"></script><style type="text/css">.px-loader-wrapper {    display: flex;}@keyframes loadingEffect {    0% {        background-position: 0;    }    100% {        background-position: 60vw;    }}.px-inner-loading-area { 
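
The output above is a "Pardon Our Interruption" challenge page rather than search results, so the scraping loop has no listings to parse. A minimal sketch of detecting this case is shown below. The function name `looks_blocked` and the two markers are our own assumptions taken from this one response; other block pages may look different, and `normal_html` is a made-up example.

```python
# Markers taken from the challenge page shown above; other block pages may differ.
BLOCK_MARKERS = ("Pardon Our Interruption", "/captcha/captcha.js")

def looks_blocked(page_source):
    """Return True if the HTML appears to be a bot-challenge page."""
    lowered = page_source.lower()
    # Case-insensitive substring check against each known marker.
    return any(marker.lower() in lowered for marker in BLOCK_MARKERS)

blocked_html = "<html><head><title>Pardon Our Interruption</title></head></html>"
normal_html = "<html><head><title>Garden City, NY Real Estate</title></head></html>"

print(looks_blocked(blocked_html))
print(looks_blocked(normal_html))
```

In the scraping cell, `looks_blocked(driver.page_source)` could be checked right after `driver.get(url)` so that parsing is skipped when the challenge page is returned.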

## Testing Playwright for Web Scraping
