<a href="https://colab.research.google.com/github/groovymarty/gracieslist/blob/main/scrape_aptdotcom.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Here is the apartments.com scraper!
You can use this Colab notebook to execute the scraper as often as you want and accumulate results in a thing called a Pandas dataframe.  Colab has tools to view a dataframe and copy data to the clipboard so you can save it somewhere.  (I also provide some code that will write the dataframe to a CSV file which you can then download.)

The easiest way to run the notebook:  From the Runtime menu above, select Run all.  You can also run each code block individually by clicking the "play" icon for that code block (black circle with white triangle inside).

The notebook runs in a thing called a runtime.  Colab creates a runtime for each user when they open the notebook.  Variables, like the results dataframe, are part of the runtime.  Files that you write, like the CSV file, are also part of the runtime.

Your runtime persists if you close and reopen the notebook, but you should not count on the runtime lasting forever.  So it's important to save any valuable data elsewhere, like on your computer or in a Google sheet.

TO DO:
* Provide a way to save results to a Google sheet.
* Provide a way to append results to a Google sheet or CSV file that you already have.

## Execute these code blocks once to set things up.

In [44]:
# Here are the imports we need
import requests
import time
import random
from bs4 import BeautifulSoup
import pandas

In [45]:
# Functions to build and send requests
def build_url(where, page):
  if "location" in where:
    # request using apartments.com location name, like wabasha-county-mn
    if page == 1:
      return f"https://www.apartments.com/{where['location']}/"
    else:
      return f"https://www.apartments.com/{where['location']}/{page}/"
  elif "query" in where:
    # request using a query like travel time
    if page == 1:
      return f"https://www.apartments.com/{where['query']}"
    else:
      return f"https://www.apartments.com/{page}/{where['query']}"
  else:
    raise ValueError("Help! You didn't give me a location or query string!")

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0"}

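def send_request(url):
  # time out instead of hanging forever, and fail loudly on HTTP errors
  response = requests.get(url, headers=headers, timeout=30)
  response.raise_for_status()
  return response.text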

In [46]:
# Functions to process the HTML that comes back from a request
# Return array of result rows
def process_result(soup, where):
  rows = []
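  placards = soup.find(id="placards")
  if placards is None:
    # no results container on this page, so there is nothing to collect
    return rows
  properties = placards.find_all("div", class_="property-info")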
  for prop in properties:
    address_div = prop.find("div", class_="property-address")
    beds_div = prop.find("div", class_="bed-range")
    price_div = prop.find("div", class_="price-range")
    if address_div and beds_div and price_div:
      address = address_div.get_text()
      beds = beds_div.get_text()
      price = price_div.get_text()
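      # the link tag may be missing, so don't assume find() returned something
      link_tag = prop.find("a", class_="property-link")
      link = link_tag.get("href") if link_tag else None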
      print(f'found: "{where["Where"]}","{address}","{beds}","{price}"')
      rows.append({
          "Where": where["Where"],
          "Address": address,
          "Beds": beds,
          "Price": price,
          "Link": link
      })
  return rows


In [47]:
# Top-level functions to drive the scraping process
def random_delay():
  # pause 3 to 8 seconds between requests so we don't hammer the site
  time.sleep(3+random.random()*5)

def scrape_all_pages(where):
  rows = []
  page = 1
  while True:
    print("delaying...")
    random_delay()
    print(f"getting {where} page {page}")
    # send request to site and get result
    html_text = send_request(build_url(where, page))
    # parse and process result
    soup = BeautifulSoup(html_text, "html.parser")
    rows.extend(process_result(soup, where))
    # pagination logic
    page_range = soup.find("span", class_="pageRange")
    if page_range:
      last_page = int(page_range.get_text().split()[-1])
    else:
      last_page = 1
    if page >= last_page:
      break
    else:
      page += 1
  return rows
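
The pagination step above takes the last whitespace-separated word of the `pageRange` text as the page count. Here is a slightly more defensive sketch of that idea. It assumes the text looks something like "Page 1 of 3" (that format is my assumption), and the helper name `last_page_from_text` is made up for illustration. It falls back to 1 when no number is found instead of raising an error.

```python
import re

def last_page_from_text(text):
  """Return the last integer found in text, or 1 if there is none."""
  numbers = re.findall(r"\d+", text or "")
  return int(numbers[-1]) if numbers else 1

print(last_page_from_text("Page 1 of 3"))  # 3
print(last_page_from_text(""))             # 1
```

Falling back to 1 means a page with an unexpected layout is treated as a single page, which matches what `scrape_all_pages` already does when the `pageRange` span is missing.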

## Below are the parameters.  Edit them as you wish.
You must execute this code block at least once, and again when you change any of the parameter values.

### places
The places array is a list of places to search for.  There are two kinds:
* Places with a specific "location" name that apartments.com understands (such as wabasha-county-mn)
* Places with a query string like driving distance

So what you need to do is go to apartments.com and search for the thing you want, then look at the URL in your browser.  If it looks like this:

* `https://www.apartments.com/southbury-ct/`

then it's the "location" kind, and you copy the string between the last two slashes (here, `southbury-ct`).

If it looks like this:

* `https://www.apartments.com/?sk=18d3ceb5aa67a739f7157de2988b81e9&bb=w6nhqxlnxH6ogm9zQ`

then it's the "query" kind, and you copy the string after the slash, starting with the question mark.

For both kinds you supply a "Where" string which will show up in the Where column of the CSV output file.  It is a free-format string so you can put anything you want.  For the "query" kind, you can explain your query string here (like 30 min drive from Hooterville).
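
If you'd rather not pick the URL apart by hand, the two rules above can be sketched as a small helper. This is only an illustration: the function name `place_from_url` is made up, and it assumes you paste the URL of the first page of results (a URL with a page number in it would confuse it).

```python
from urllib.parse import urlsplit

def place_from_url(url, label):
  """Build a places entry from an apartments.com URL copied from the browser."""
  parts = urlsplit(url)
  if parts.query:
    # "query" kind: keep everything from the question mark onward
    return {"Where": label, "query": "?" + parts.query}
  # "location" kind: the name sits between the slashes of the path
  location = parts.path.strip("/")
  if not location:
    raise ValueError(f"No location or query string found in {url}")
  return {"Where": label, "location": location}

print(place_from_url("https://www.apartments.com/southbury-ct/", "Southbury CT"))
print(place_from_url(
    "https://www.apartments.com/?sk=18d3ceb5aa67a739f7157de2988b81e9&bb=w6nhqxlnxH6ogm9zQ",
    "Heritage Village, 30 min drive"))
```

Each returned dictionary has the same shape as the entries in the places array, so you could pass the result straight to `places.append(...)`.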

In [48]:
places = []
#places.append({"Where": "Wabasha Co MN", "location": "wabasha-county-mn"})
#places.append({"Where": "Rochester MN", "location": "rochester-mn"})
places.append({"Where": "Wabasha MN, 1 hr drive", "query": "?sk=d982a06fda68e2506507f9978fe33b78&bb=yrizu0lp_K-pw0rltN"})
#places.append({"Where": "Southbury CT", "location": "southbury-ct"})
#places.append({"Where": "Heritage Village, 30 min drive", "query": "?sk=18d3ceb5aa67a739f7157de2988b81e9&bb=w6nhqxlnxH6ogm9zQ"})

## This code block creates an empty dataframe to accumulate the results.
You must execute this code block at least once.  Run it again if you want to clear the results and start over.

In [49]:
df = pandas.DataFrame(columns=["Where", "Address", "Beds", "Price", "Link"])

## Execute the following code block to scrape the site.
It will fully scrape the site according to your parameters, logging messages to show its progress, and adding result rows to the dataframe.  Run as often as you like.

In [50]:
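for where in places:
  rows = scrape_all_pages(where)
  # DataFrame.append was removed in pandas 2.0, so use concat instead
  df = pandas.concat([df, pandas.DataFrame(rows, columns=df.columns)], ignore_index=True)
df.drop_duplicates(inplace=True)
print("Done!")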

delaying...
getting {'Where': 'Wabasha MN, 1 hr drive', 'query': '?sk=d982a06fda68e2506507f9978fe33b78&bb=yrizu0lp_K-pw0rltN'} page 1
delaying...
getting {'Where': 'Wabasha MN, 1 hr drive', 'query': '?sk=d982a06fda68e2506507f9978fe33b78&bb=yrizu0lp_K-pw0rltN'} page 2
found: "Wabasha MN, 1 hr drive","310 1/2 Plum St Red Wing, MN 55066","1 Bed","$725"
found: "Wabasha MN, 1 hr drive","421 W Broadway St Winona, MN 55987","5 Beds","$1,750"
found: "Wabasha MN, 1 hr drive","353 W 8th St Winona, MN 55987","4 Beds","$1,400"
found: "Wabasha MN, 1 hr drive","421 W Broadway St Winona, MN 55987","5 Beds","$1,750"
found: "Wabasha MN, 1 hr drive","101 E 6th St Winona, MN 55987","2 Beds","$850"
found: "Wabasha MN, 1 hr drive","79 W Broadway St Winona, MN 55987","3 Beds","$410"
found: "Wabasha MN, 1 hr drive","79 W Broadway St Winona, MN 55987","5 Beds","$410"
found: "Wabasha MN, 1 hr drive","79 W Broadway Winona, MN 55987","4 Beds","$410"
found: "Wabasha MN, 1 hr drive","373 Main St Winona, MN 55987",

## Execute the following block to view the result dataframe.

In [51]:
df

| | Where | Address | Beds | Price | Link |
|---|---|---|---|---|---|
| 0 | Wabasha MN, 1 hr drive | 310 1/2 Plum St Red Wing, MN 55066 | 1 Bed | $725 | https://www.apartments.com/310-1-2-plum-st-red... |
| 1 | Wabasha MN, 1 hr drive | 421 W Broadway St Winona, MN 55987 | 5 Beds | $1,750 | https://www.apartments.com/421-w-broadway-st-w... |
| 2 | Wabasha MN, 1 hr drive | 353 W 8th St Winona, MN 55987 | 4 Beds | $1,400 | https://www.apartments.com/353-w-8th-st-winona... |
| 3 | Wabasha MN, 1 hr drive | 421 W Broadway St Winona, MN 55987 | 5 Beds | $1,750 | https://www.apartments.com/421-w-broadway-st-w... |
| 4 | Wabasha MN, 1 hr drive | 101 E 6th St Winona, MN 55987 | 2 Beds | $850 | https://www.apartments.com/101-e-6th-st-winona... |
| 5 | Wabasha MN, 1 hr drive | 79 W Broadway St Winona, MN 55987 | 3 Beds | $410 | https://www.apartments.com/79-w-broadway-st-wi... |
| 6 | Wabasha MN, 1 hr drive | 79 W Broadway St Winona, MN 55987 | 5 Beds | $410 | https://www.apartments.com/79-w-broadway-st-wi... |
| 7 | Wabasha MN, 1 hr drive | 79 W Broadway Winona, MN 55987 | 4 Beds | $410 | https://www.apartments.com/79-w-broadway-winon... |
| 8 | Wabasha MN, 1 hr drive | 373 Main St Winona, MN 55987 | 6 Beds | $410 | https://www.apartments.com/373-main-st-winona-... |
| 9 | Wabasha MN, 1 hr drive | 373 Main St Winona, MN 55987 | 5 Beds | $410 | https://www.apartments.com/373-main-st-winona-... |
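
The Beds and Price columns hold text like "5 Beds" and "$1,750", so sorting on them goes alphabetically rather than by number. Here is a minimal sketch of pulling the number out of such text. The helper name `first_number` is made up for illustration, and if a value turns out to be a range (I'm assuming that can happen) it only returns the first number.

```python
import re

def first_number(text):
  """Return the first integer in text (ignoring commas), or None if there is none."""
  match = re.search(r"\d[\d,]*", text or "")
  if match is None:
    return None
  return int(match.group().replace(",", ""))

print(first_number("$1,750"))  # 1750
print(first_number("5 Beds"))  # 5
print(first_number("Studio"))  # None
```

With the dataframe from this notebook you could try something like `df["PriceNum"] = df["Price"].map(first_number)` and then sort on the new column (the column name `PriceNum` is just a suggestion).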


## Execute the following block to save the result dataframe to a CSV file
Edit the file name as you wish.  Use the file explorer to the left to find the CSV file.  (Look in the "content" folder.)  Then you can download the file if desired.  Note that files in the "content" folder are part of your runtime, so don't assume they will last forever.

In [52]:
# index=False leaves the row-number column out of the file
df.to_csv("results.csv", index=False)
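
One of the TO DO items is appending results to a CSV file you already have. Below is a rough sketch of one way to do it using Python's built-in `csv` module rather than pandas. The function name `append_rows_to_csv` is made up, and it assumes the existing file has exactly the five columns used in this notebook with no extra row-number column.

```python
import csv
import os

FIELDS = ["Where", "Address", "Beds", "Price", "Link"]

def row_key(row):
  # Compare rows as tuples of strings; treat missing or None values as ""
  return tuple("" if row.get(name) is None else str(row[name]) for name in FIELDS)

def append_rows_to_csv(path, rows):
  """Append rows to path, skipping rows already in the file. Returns how many were added."""
  seen = set()
  has_data = os.path.exists(path) and os.path.getsize(path) > 0
  if has_data:
    with open(path, newline="", encoding="utf-8") as f:
      for old in csv.DictReader(f):
        seen.add(row_key(old))
  added = 0
  with open(path, "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
    if not has_data:
      # brand new (or empty) file, so it needs a header row first
      writer.writeheader()
    for row in rows:
      key = row_key(row)
      if key not in seen:
        writer.writerow(row)
        seen.add(key)
        added += 1
  return added
```

To feed it the results from this notebook, something like `append_rows_to_csv("results.csv", df.to_dict("records"))` should work. Running it twice with the same rows adds nothing the second time.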