<a href="https://colab.research.google.com/github/groovymarty/gracieslist/blob/main/scrape_aptdotcom.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Here is the apartments.com scraper!
Purpose: Scrape apartments.com for single listings (that is, specific addresses available for rent). Ignore listings for multiple units (having a range of sizes and/or prices).

You can use this Colab notebook to execute the scraper as often as you want and accumulate results in a thing called a Pandas dataframe.  Colab has tools to view a dataframe and copy data to the clipboard so you can save it somewhere.  (I also provide some code that will write the dataframe to a CSV file which you can then download.)

The easiest way to run the notebook: from the Runtime menu above, select Run all.  You can also run each code block individually by clicking the "play" icon for that code block (black circle with white triangle inside).

The notebook runs in a thing called a runtime.  Colab creates a runtime for each user when they open the notebook.  Variables, like the results dataframe, are part of the runtime.  Files that you write, like the CSV file, are also part of the runtime.

Your runtime persists if you close and reopen the notebook, but you should not count on the runtime lasting forever.  So it's important to save any valuable data elsewhere, like on your computer or in a Google sheet.

TO DO:
* Provide a way to save results to a Google sheet.
* Provide a way to append results to a Google sheet or CSV file that you already have.
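
As a rough idea for the second item, here is a minimal sketch that appends result rows to a CSV file you already have, writing the header line only when the file is new or empty. The helper name `append_rows_to_csv` and the `COLUMNS` list are made up for this example and are not part of the notebook; the sketch also assumes the existing file uses the same five columns as the results dataframe.

```python
import csv
import os

# Assumed to match the columns of the results dataframe in this notebook.
COLUMNS = ["Where", "Address", "Beds", "Price", "Link"]

def append_rows_to_csv(rows, path):
  """Append row dicts to a CSV file, writing the header only for a new or empty file."""
  is_new = not os.path.exists(path) or os.path.getsize(path) == 0
  with open(path, "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    if is_new:
      writer.writeheader()
    writer.writerows(rows)
```

The function takes a list of row dicts like the ones `process_result` returns. If you only have the dataframe, `df.to_dict("records")` should give you rows in that shape. Note that this does not remove duplicates already present in the file.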

## Execute these code blocks once to set things up.

In [10]:
# Here are the imports we need
import requests
import time
import random
from bs4 import BeautifulSoup
import pandas

In [11]:
# Functions to build and send requests
def build_url(where, page):
  if "location" in where:
    # request using apartments.com location name, like wabasha-county-mn
    if page == 1:
      return f"https://www.apartments.com/{where['location']}/"
    else:
      return f"https://www.apartments.com/{where['location']}/{page}/"
  elif "query" in where:
    # request using a query like travel time
    if page == 1:
      return f"https://www.apartments.com/{where['query']}"
    else:
      return f"https://www.apartments.com/{page}/{where['query']}"
  else:
    raise ValueError("Help! You didn't give me a location or query string!")

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0"}

def send_request(url):
  # the timeout keeps a stalled connection from hanging the notebook forever
  return requests.get(url, headers=headers, timeout=30).text

In [12]:
# Functions to process the HTML that comes back from a request
# Return array of result rows
def process_result(soup, where):
  rows = []
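  placards = soup.find(id="placards")
  if placards is None:
    # no listing container on this page (for example a blocked or empty response)
    print("found no placards element")
    return rows
  articles = placards.find_all("article")
  print(f"found {len(articles)} listings")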
  for article in articles:
    title_div = article.find(["div", "p"], class_="property-title")
    address_divs = article.find_all(["div", "p"], class_="property-address")
    if title_div and len(address_divs) > 0:
      address = ", ".join([div.get_text() for div in [title_div] + address_divs])
      print(f"found address: {address}")
      beds_div = article.find("div", class_="bed-range") or \
        article.find(["span", "p"], class_="property-beds")
      price_div = article.find("div", class_="price-range") or \
        article.find("span", class_="property-rents") or \
        article.find("p", class_="property-pricing")
      if beds_div and price_div:
        beds = beds_div.get_text()
        price = price_div.get_text()
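        # some placards may lack a property link, so don't assume the tag exists
        link_tag = article.find("a", class_="property-link")
        link = link_tag.get("href") if link_tag else ""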
        if "-" not in beds and "-" not in price:
          print(f'found single listing: "{address}","{beds}","{price}"')
          rows.append({
              "Where": where["Where"],
              "Address": address,
              "Beds": beds,
              "Price": price,
              "Link": link
          })
  return rows
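
The "single listing" test used in `process_result` comes down to checking that neither the beds text nor the price text contains a hyphen. Here is that check pulled out into a tiny function so you can try it on its own. The name `is_single_listing` is made up for this example, and the range strings are my guess at what a multi-unit listing looks like.

```python
def is_single_listing(beds, price):
  """Return True when neither string contains a hyphen (so neither is a range)."""
  return "-" not in beds and "-" not in price

print(is_single_listing("2 Beds", "$3,930"))             # True
print(is_single_listing("1-3 Beds", "$2,100 - $4,500"))  # False
```

The check only looks for the plain hyphen character, so a range written with a different dash character would slip through as a single listing.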


In [13]:
# Top-level functions to drive the scraping process
def random_delay():
  time.sleep(3+random.random()*5)

def scrape_all_pages(where):
  rows = []
  page = 1
  while True:
    print("delaying...")
    random_delay()
    print(f"getting {where} page {page}")
    # send request to site and get result
    html_text = send_request(build_url(where, page))
    # parse and process result
    soup = BeautifulSoup(html_text, "html.parser")
    rows.extend(process_result(soup, where))
    # pagination logic
    page_range = soup.find("span", class_="pageRange")
    if page_range:
        last_page = int(page_range.get_text().split()[-1])
    else:
        last_page = 1
    if page >= last_page:
        break
    else:
        page += 1
  return rows
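
The pagination logic in `scrape_all_pages` takes the last word of the `pageRange` text and treats it as the last page number. The sketch below isolates that step so it can be tried without hitting the site. The function name `parse_last_page` is made up, and the sample text "Page 1 of 12" is only an assumption about what the site shows. Unlike the code above, this version falls back to 1 instead of raising an error when the last word is not a number.

```python
def parse_last_page(page_range_text):
  """Return the last whitespace-separated word as an int, or 1 if that fails."""
  tokens = page_range_text.split()
  try:
    return int(tokens[-1])
  except (IndexError, ValueError):
    # empty text, or the last word is not a number
    return 1

print(parse_last_page("Page 1 of 12"))  # 12
print(parse_last_page(""))              # 1
```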

## Below are the parameters.  Edit them as you wish.
You must execute this code block at least once, and again whenever you change any of the parameter values.

### places
The places array is a list of places to search for.  There are two kinds:
* Places with a specific "location" name that apartments.com understands (such as wabasha-county-mn)
* Places with a query string like driving distance

So what you need to do is go to apartments.com and search for the thing you want, then look at the URL in your browser.  If it looks like this:

* `https://www.apartments.com/southbury-ct/`

then it's the "location" kind, and you copy the string between the last two slashes (here, `southbury-ct`).

If it looks like this:

* `https://www.apartments.com/?sk=18d3ceb5aa67a739f7157de2988b81e9&bb=w6nhqxlnxH6ogm9zQ`

then it's the "query" kind, and you copy the string after the slash, starting with the question mark.

For both kinds you supply a "Where" string which will show up in the Where column of the CSV output file.  It is a free-format string so you can put anything you want.  For the "query" kind, you can explain your query string here (like 30 min drive from Hooterville).
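
If you would rather not copy the pieces by hand, the rules above can be sketched as a small helper that takes the URL from your browser and builds the entry for you. The name `where_from_url` is made up for this example and is not part of the notebook. It assumes any URL with a query string is the "query" kind and anything else is the "location" kind.

```python
from urllib.parse import urlsplit

def where_from_url(url, label):
  """Build a places entry from a browser URL and a free-format Where label."""
  parts = urlsplit(url)
  if parts.query:
    # "query" kind: everything after the first slash, including the question mark
    return {"Where": label, "query": parts.path.lstrip("/") + "?" + parts.query}
  location = parts.path.strip("/")
  if not location:
    raise ValueError("This URL has no location name or query string")
  # "location" kind: the name between the slashes
  return {"Where": label, "location": location}

example_places = []
example_places.append(where_from_url("https://www.apartments.com/southbury-ct/", "Southbury CT"))
print(example_places[0])  # {'Where': 'Southbury CT', 'location': 'southbury-ct'}
```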

In [14]:
places = []
places.append({"Where": "Somerville MA", "location": "somerville-ma"})
places.append({"Where": "Cambridge MA", "location": "cambridge-ma"})
#places.append({"Where": "Wabasha MN, 1 hr drive", "query": "?sk=d982a06fda68e2506507f9978fe33b78&bb=yrizu0lp_K-pw0rltN"})
#places.append({"Where": "Southbury CT", "location": "southbury-ct"})
#places.append({"Where": "Heritage Village, 30 min drive", "query": "?sk=18d3ceb5aa67a739f7157de2988b81e9&bb=w6nhqxlnxH6ogm9zQ"})

## This code block creates an empty dataframe to accumulate the results.
You must execute this code block at least once.  Run it again if you want to clear the results and start over.

In [15]:
df = pandas.DataFrame(columns=["Where", "Address", "Beds", "Price", "Link"])

## Execute the following code block to scrape the site.
It will fully scrape the site according to your parameters, logging messages to show its progress, and adding result rows to the dataframe.  Run as often as you like.

In [16]:
for where in places:
  rows = scrape_all_pages(where)
  # ignore_index renumbers the rows so index values don't repeat for each place
  df = pandas.concat([df, pandas.DataFrame(rows)], ignore_index=True)
df.drop_duplicates(inplace=True)
print("Done!")

delaying...
getting {'Where': 'Somerville MA', 'location': 'somerville-ma'} page 1
found 25 listings
found address: Flats on First, 21 Charles St, Cambridge, MA 02141
found address: Mezzo Design Lofts, 30 Caldwell St, Charlestown, MA 02129
found address: Park 151, 151 N First St, Cambridge, MA 02141
found address: Alta Revolution, 290 Revolution Dr, Somerville, MA 02145
found address: Union 346, 346 Somerville Ave, Somerville, MA 02143
found address: Twenty20, 20 Child St, Cambridge, MA 02141
found address: CHR Cambridge - Harvard Square Communities, 1 Langdon St, Cambridge, MA 02138
found address: Station Landing Apartments, 50-55 Station Lndg, Medford, MA 02155
found address: Kendall Crossing Apartments, 157 Sixth St, Cambridge, MA 02142
found address: Lofts at Kendall Square, 195 Binney St, Cambridge, MA 02142
found address: Elevate, One Leighton St, Cambridge, MA 02141
found address: 929 Mass, 929 Massachusetts Ave, Cambridge, MA 02139
found address: Miscela, 485 Foley St, Somervil

## Execute the following block to view the result dataframe.

In [19]:
df

| | Where | Address | Beds | Price | Link |
|---|---|---|---|---|---|
| 0 | Somerville MA | The Wyeth, 120-124 Rindge Ave, Cambridge, MA 0... | 2 Beds | $3,930 | https://www.apartments.com/the-wyeth-cambridge... |
| 1 | Somerville MA | 17 Forest St Unit FL2-ID64, Cambridge, MA 02140 | 1 Bed | $2,700 | https://www.apartments.com/17-forest-st-cambri... |
| 2 | Somerville MA | 333 Great River Rd Unit FL3-ID841, Somerville,... | 1 Bed | $2,570 | https://www.apartments.com/333-great-river-rd-... |
| 3 | Somerville MA | 445 Artisan Way Unit FL5-ID837, Somerville, MA... | 2 Beds | $2,600 | https://www.apartments.com/445-artisan-way-som... |
| 4 | Somerville MA | 1 Whittemore Ave Unit FL2-ID777, Cambridge, MA... | 2 Beds | $2,900 | https://www.apartments.com/1-whittemore-ave-ca... |
| ... | ... | ... | ... | ... | ... |
| 604 | Cambridge MA | Apartment for Rent, 218 Western Ave Cambridge,... | Studio | $1,245 | https://www.apartments.com/private-room-in-3-b... |
| 605 | Cambridge MA | Apartment for Rent, 700 Huron Ave Cambridge, M... | 2 Beds | $3,416 | https://www.apartments.com/700-huron-ave-cambr... |
| 606 | Cambridge MA | Condo for Rent, 55 Magazine St Cambridge, MA 0... | 2 Beds | $3,950 | https://www.apartments.com/55-magazine-st-camb... |
| 607 | Cambridge MA | Apartment for Rent, 117 Pleasant St Cambridge,... | Studio | $1,145 | https://www.apartments.com/private-room-in-4-b... |


## Execute the following block to save the result dataframe to a CSV file
Edit the file name as you wish.  Use the file explorer to the left to find the CSV file.  (Look in the "content" folder.)  Then you can download the file if desired.  Note the files in the "content" folder are part of your runtime so don't assume they will last forever.

In [18]:
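# index=False leaves out the row-number column, which otherwise shows up as "Unnamed: 0" when the file is read back
df.to_csv("results.csv", index=False)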