<a href="https://colab.research.google.com/github/groovymarty/gracieslist/blob/main/scrape_aptdotcom.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Here is the apartments.com scraper!
You can use this Colab notebook to run the scraper as often as you want and accumulate the results in a pandas DataFrame.  The DataFrame lives only in memory, so it goes away when you close this notebook and you need a way to save it.  One way that I've provided is to write it to a CSV file, which you can then download.  See the instructions below.  (Note that the saved CSV file lives on the Colab virtual machine, so it also goes away when you close the notebook.  To really save it, download the CSV file to your computer.)

TO DO:
* Provide a way to save results to a Google sheet.
* Provide a way to append results to a Google sheet or CSV file that you already have.

## Execute these code blocks once to set things up.

In [5]:
# Here are the imports we need
import requests
import time
import random
from bs4 import BeautifulSoup
import pandas

In [10]:
# Functions to build and send requests
def build_url(where, page):
  if page == 1:
    return f"https://www.apartments.com/{where}/"
  else:
    return f"https://www.apartments.com/{where}/{page}/"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0"}

def send_request(url):
  # the timeout keeps the notebook from hanging forever if the site never answers
  return requests.get(url, headers=headers, timeout=30).text
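As a quick sanity check, the sketch below previews the URLs the scraper would ask for, without sending anything over the network. `build_url` is repeated from the cell above so the block runs on its own; `preview_urls` is a hypothetical helper for illustration, not part of the scraper.

```python
def build_url(where, page):
  # Same logic as the notebook's build_url: page 1 has no page suffix
  if page == 1:
    return f"https://www.apartments.com/{where}/"
  return f"https://www.apartments.com/{where}/{page}/"

def preview_urls(where, last_page):
  """Hypothetical helper: list the URLs for pages 1..last_page."""
  return [build_url(where, page) for page in range(1, last_page + 1)]

for url in preview_urls("rochester-mn", 3):
  print(url)
```

This is handy for checking a new place name in your browser before you scrape it.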

In [25]:
# Functions to process the HTML that comes back from a request
# Return a list of result rows (one dict per listing) for the given place
def process_result(soup, where):
  rows = []
  placards = soup.find(id="placards")
  if placards is None:
    # no listings container on this page (for example, an error page)
    return rows
  properties = placards.find_all("div", class_="property-info")
  for prop in properties:
    address_div = prop.find("div", class_="property-address")
    beds_div = prop.find("div", class_="bed-range")
    price_div = prop.find("div", class_="price-range")
    if address_div and beds_div and price_div:
      address = address_div.get_text()
      beds = beds_div.get_text()
      price = price_div.get_text()
      # keep the row even if the link is missing
      link_a = prop.find("a", class_="property-link")
      link = link_a.get("href") if link_a else None
      print(f'found: "{where}","{address}","{beds}","{price}"')
      rows.append({
          "Where": where,
          "Address": address,
          "Beds": beds,
          "Price": price,
          "Link": link
      })
  return rows

In [31]:
# Top-level functions to drive the scraping process
def random_delay():
  # wait 3 to 8 seconds between requests
  time.sleep(3+random.random()*5)

def scrape_all_pages(where):
  rows = []
  page = 1
  while True:
    print("delaying...")
    random_delay()
    print(f"getting {where} page {page}")
    # send request to site and get result
    html_text = send_request(build_url(where, page))
    # parse and process result
    soup = BeautifulSoup(html_text, "html.parser")
    rows.extend(process_result(soup, where))
    # pagination logic: the last word of the page range text is the last page number
    page_range = soup.find("span", class_="pageRange")
    if page_range:
      last_page = int(page_range.get_text().split()[-1])
    else:
      last_page = 1
    if page >= last_page:
      break
    else:
      page += 1
  return rows
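The pagination step reads the last whitespace-separated word of the `pageRange` text as the final page number. The sketch below isolates that step in a hypothetical `parse_last_page` helper. It assumes the text looks something like "Page 1 of 5" (the exact wording on the site is an assumption), and it falls back to 1 when the text is missing or doesn't end in a number.

```python
def parse_last_page(page_range_text):
  """Hypothetical helper: return the last page number, or 1 if unknown."""
  if not page_range_text:
    return 1
  tokens = page_range_text.split()
  # The scraper treats the final word as the last page number
  if tokens and tokens[-1].isdigit():
    return int(tokens[-1])
  return 1

print(parse_last_page("Page 1 of 5"))
print(parse_last_page(None))
print(parse_last_page("no pages here"))
```

Unlike the inline `int(...)` call in the scraper, this version won't raise an error if the site changes the wording. It just stops after the first page.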

## Below are the parameters.  Edit them as you wish.
Each place is written the way it appears in the apartments.com URL, for example `rochester-mn`.  You must execute this code block at least once, and again whenever you change any of the parameter values.

In [44]:
places = []
places.append("wabasha-county-mn")
places.append("rochester-mn")

## This code block creates an empty dataframe to accumulate the results.
You must execute this code block at least once.  Run it again if you want to clear the results and start over.

In [45]:
df = pandas.DataFrame(columns=["Where", "Address", "Beds", "Price", "Link"])

## Execute the following code block to scrape the site.
It will fully scrape the site according to your parameters, logging messages to show its progress, and adding result rows to the dataframe.  Run as often as you like.

In [None]:
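for where in places:
  rows = scrape_all_pages(where)
  if rows:
    # DataFrame.append was removed in pandas 2.0, so use concat instead
    df = pandas.concat([df, pandas.DataFrame(rows)], ignore_index=True)
df.drop_duplicates(inplace=True)
print("Done!")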
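Running the scrape more than once is safe because `drop_duplicates` removes any row that matches an earlier row in every column. The sketch below shows the effect with made-up rows (not real listings). Note that a listing whose price changed between runs is kept as a second row, since it no longer matches exactly.

```python
import pandas

# Made-up rows standing in for two scraping runs
listing = {"Where": "rochester-mn", "Address": "1 Example St",
           "Beds": "2 Beds", "Price": "$1,000", "Link": "https://example.com/a"}
first_run = [listing]
second_run = [
  dict(listing),                    # identical row: dropped
  dict(listing, Price="$1,100"),    # same listing, new price: kept
]

combined = pandas.concat(
  [pandas.DataFrame(first_run), pandas.DataFrame(second_run)],
  ignore_index=True)
deduped = combined.drop_duplicates(ignore_index=True)
print(deduped[["Address", "Price"]])
```

If you would rather keep only one row per listing, passing `subset=["Link"]` to `drop_duplicates` is one option, assuming the link identifies a listing uniquely.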

## Execute the following block to view the result dataframe.

In [None]:
df

## Execute the following block to save the result dataframe to a CSV file.
Edit the file name as you wish.  Use the file explorer to the left to find the CSV file.  (Look in the "content" folder.)  Then you can download the file if desired.  Note that the files in the "content" folder will go away when you close the Colab notebook.

In [48]:
# index=False leaves out the row-number column, which is not part of the results
df.to_csv("results.csv", index=False)
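One item on the TO DO list is appending results to a CSV file you already have. The sketch below is one possible way to do that, not a finished feature: `append_to_csv` is a hypothetical helper that reads the existing file if there is one, adds the new rows, drops exact duplicates, and writes the file back. The demo uses made-up rows and a temporary folder.

```python
import os
import tempfile
import pandas

def append_to_csv(new_df, path):
  """Hypothetical helper: merge new_df into the CSV at path."""
  if os.path.exists(path):
    # dtype=str keeps values such as "$1,000" as text when reloading
    old_df = pandas.read_csv(path, dtype=str)
    combined = pandas.concat([old_df, new_df], ignore_index=True)
  else:
    combined = new_df
  combined = combined.drop_duplicates(ignore_index=True)
  combined.to_csv(path, index=False)
  return combined

# Demo with made-up rows (not real listings)
demo_path = os.path.join(tempfile.mkdtemp(), "results.csv")
row_a = {"Where": "rochester-mn", "Address": "1 Example St",
         "Beds": "2 Beds", "Price": "$1,000", "Link": "https://example.com/a"}
row_b = {"Where": "rochester-mn", "Address": "2 Example St",
         "Beds": "1 Bed", "Price": "$800", "Link": "https://example.com/b"}

append_to_csv(pandas.DataFrame([row_a]), demo_path)
result = append_to_csv(pandas.DataFrame([row_a, row_b]), demo_path)
print(len(result))
```

In Colab you would first upload your existing CSV file to the "content" folder, then call the helper with `df` and that file's name.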