# Today's theme is Failure


# How To Debug Scrapers: Browser Automation

## 1. Spot check the results

Manually inspect the data you just collected. Does it look like what you expect?

Let's look at the first and last page of Zillow that we collected.
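A spot check can be as simple as opening the first and last files you saved. The sketch below assumes the pages were saved as `.html` files with zero-padded names (for example `page_001.html`) so that sorting by filename matches page order; the function names and the directory layout are hypothetical.

```python
from pathlib import Path


def first_and_last_pages(directory, pattern="*.html"):
    """Return the first and last saved pages, sorted by filename."""
    pages = sorted(Path(directory).glob(pattern))
    if not pages:
        raise FileNotFoundError(f"No files matching {pattern} in {directory}")
    return pages[0], pages[-1]


def preview(path, n_chars=300):
    """Return the length (in characters) and the first few characters of a saved page."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return len(text), text[:n_chars]
```

Calling `preview` on both paths and reading the output by eye is usually enough to notice a page that is suspiciously short or that contains an error message instead of listings.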

## 2. Can't find an element

Maybe something hasn't loaded yet. If that is the case, you can wait for it to show up.

See the example in the [Inspect Element tutorial](https://inspectelement.org/browser_automation.html#step-3-finding-elements-on-page-and-interacting-with-them).

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is assumed to be an existing Selenium WebDriver (e.g. created with webdriver.Chrome()).
# Wait up to 20 seconds for the element to become visible before we proceed to `find_element`.
X_seconds = 20
selector = '[data-e2e="modal-close-inner-button"]'
wait = WebDriverWait(driver, timeout=X_seconds)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, selector)))

# This line only runs once the element is visible.
# If it never shows up, `wait.until` raises a TimeoutException after 20 seconds instead.
close_button = driver.find_element(By.CSS_SELECTOR, selector)
close_button

## 3. Look for known issues

For example, a CAPTCHA or an empty result.
- Wait to see if these signs show up.
- Intervene as necessary.
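One way to watch for these signs is to classify each page before parsing it. This is a minimal sketch: the marker strings and the `diagnose_page` name are assumptions, and a real site will need markers found by inspecting its own CAPTCHA and empty-result pages.

```python
def diagnose_page(html, result_marker, captcha_markers=("captcha",)):
    """Classify a page as 'captcha', 'empty', or 'ok' using simple substring checks.

    result_marker is a string that only appears when results are present,
    for example a class name used on each listing (hypothetical).
    """
    lowered = html.lower()
    if any(marker in lowered for marker in captcha_markers):
        return "captcha"
    if result_marker.lower() not in lowered:
        return "empty"
    return "ok"
```

A scraper could pause and alert you on `"captcha"`, and log and skip the page on `"empty"`.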

# Debugging APIs

## 1. Listen to status codes
The status code tells you whether your API calls succeeded, and whether you are overloading (or have crashed) the server.

Intervene as necessary. Also add periodic sleeps between requests.
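The idea can be sketched as a small retry loop. Everything here is an assumption for illustration: `fetch` stands in for whatever function makes your request and returns a `(status_code, body)` pair, and the set of status codes worth retrying is a common choice, not a rule.

```python
import time

# Status codes that usually mean "slow down and try again later" (an assumption).
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_retries(fetch, url, max_tries=3, pause=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status_code, body), sleeping longer after each failed try."""
    for attempt in range(1, max_tries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE:
            # For example a 404: retrying will not help, so stop here.
            raise RuntimeError(f"Giving up on {url}: status {status}")
        if attempt < max_tries:
            sleep(pause * attempt)
    raise RuntimeError(f"Giving up on {url} after {max_tries} tries")
```

Passing `sleep` in as a parameter keeps the function easy to test without actually waiting.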

## 2. Spot check

Open the JSON and make sure it looks like what you expect.

## 3. Check for known keys

Programmatically check if the `key` you're expecting is present.
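A minimal sketch of such a check, assuming the response has already been parsed into a dictionary; the key names in the usage note are made up.

```python
def missing_keys(record, expected_keys):
    """Return the expected keys that are absent from a parsed JSON record."""
    return [key for key in expected_keys if key not in record]
```

For example, `missing_keys(record, ["price", "address"])` returns an empty list when both keys are present, so a scraper can log or raise as soon as the list is non-empty.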

# General notes

## Summarize the data
Check the number of rows per day. This is similar to a dashboard: a sudden drop or spike tells you something changed.
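As a rough sketch, the daily counts can be computed from the collection timestamps alone. This assumes each row carries an ISO 8601 timestamp string such as `2023-01-05T10:00:00`; the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime


def rows_per_day(timestamps):
    """Count rows per calendar day from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).date().isoformat() for ts in timestamps)
```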

## Catch and handle exceptions

Monitor the scraper for known issues. Determine automated responses to those issues.

Have you used `try` and `except` statements in Python? Read more about that [here](https://pythonbasics.org/try-except/).

In [10]:
try:
    assert 2 == 3
except Exception as e:
    print(f"Wrong {e}")

Wrong 
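In a scraper, the same pattern lets one bad page fail without stopping the whole run. The sketch below is an illustration under assumptions: `scrape_one` is a hypothetical function you supply, and the exception types to catch depend on what your own code raises.

```python
def scrape_all(urls, scrape_one):
    """Run scrape_one on every URL, recording failures instead of crashing."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = scrape_one(url)
        except (TimeoutError, ValueError) as e:
            # Known issues: remember what went wrong and move on to the next URL.
            failures[url] = repr(e)
    return results, failures
```

Catching only the exceptions you expect means an unknown problem still stops the scraper loudly, which is usually what you want while debugging.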


## Keep a log

Get familiar with a [log file](https://realpython.com/python-logging/). This is basically a place to store the messages you would otherwise `print`.

For a quick version: check the last time a directory was modified.
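Both ideas fit in a few lines with the standard library. This is a minimal sketch: the logger name, the log format, and the function names are arbitrary choices, not a recommended setup.

```python
import logging
import os
from datetime import datetime


def get_logger(log_path):
    """Return a logger that appends timestamped messages to log_path."""
    logger = logging.getLogger("scraper")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler if called twice
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def last_modified(path):
    """Return when a file or directory was last modified, as a local datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path))
```

If `last_modified` on the output directory is older than the scraper's schedule, the scraper has probably stopped writing data.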

# How to productionalize web scrapers

Based on this [presentation](https://docs.google.com/presentation/d/1K5ttTgP1f6ghL06kj6QqyqsGccU_Ttxh1otdx5wWYGo/edit#slide=id.p) with Jeff Kao (ProPublica) and Ilica Mahajan (TMP)

## 1. Don't repeat work
The scraper output will be massive. Name it so you understand what you scraped and when.\
Use a structured naming system for outputs that includes a descriptive name and the date the output was produced.\
Make the scraper check whether the information has already been scraped before scraping it again.
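A sketch of both habits, with hypothetical names throughout: `output_path` builds a filename from a descriptive name and a date, and `scrape_once` skips the work when that file already exists. `scrape` is any function you supply that returns the page text.

```python
from datetime import date
from pathlib import Path


def output_path(out_dir, name, day=None, ext="html"):
    """Build a path like <out_dir>/<name>_<YYYY-MM-DD>.<ext>."""
    day = day or date.today()
    return Path(out_dir) / f"{name}_{day.isoformat()}.{ext}"


def scrape_once(path, scrape):
    """Call scrape() and save its text to path, unless path already exists.

    Returns True if new data was collected and False if the work was skipped.
    """
    path = Path(path)
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(scrape(), encoding="utf-8")
    return True
```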

## 2. Keep receipts
Save the timestamp (when data was collected) and the raw data.\
Every time I collect data I include a timestamp somewhere.\
If JSON, create a key with the timestamp.\
If HTML, you can get fancy, injecting the timestamp as an element’s attribute in the HTML.
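For the JSON case, a minimal sketch looks like this. The key name `collected_at` and the choice of UTC are assumptions; use whatever convention you will recognize later.

```python
from datetime import datetime, timezone


def stamp(record, now=None):
    """Return a copy of record with a 'collected_at' ISO 8601 timestamp (UTC by default)."""
    now = now or datetime.now(timezone.utc)
    return {**record, "collected_at": now.isoformat()}
```

Returning a copy leaves the raw record untouched, so you can still save it exactly as it was received.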

I view data collection as getting FOIAs.\
You organize FOIAs by reference number, timestamps…\
You might publish or share the data. You will need to show when you got it.

## 3. Break up the work. Make the scraper as simple as possible.
This makes it easy to find and handle errors. For example, one scraper handles one city in Zillow.\
It paginates and saves the results. That's all...\
Another scraper takes the saved HTML, parses it, and inserts it into a database.
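The second stage might look something like the sketch below, which only works on HTML that has already been saved. It is an illustration under assumptions: it extracts link targets with the standard library's `HTMLParser` and stores them in SQLite, whereas a real parser would pull out whatever fields you care about, and the table and function names are made up.

```python
import sqlite3
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_saved_page(html):
    """Stage 2a: turn saved HTML into a list of link targets."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def save_links(conn, links):
    """Stage 2b: insert links into a table, ignoring ones already stored."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT PRIMARY KEY)")
    conn.executemany("INSERT OR IGNORE INTO links VALUES (?)", [(link,) for link in links])
    conn.commit()
```

Because this stage never touches the network, you can rerun it as often as you like while fixing the parser.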

## 4. Keep a schedule.
Use `cron` to schedule jobs locally. For example, cron allows an hourly job or one that runs every day at 4:30pm.\
You can download it to your machine.\
Other tools exist to do this on the cloud.

When you run things on cron, you must specify the environment.

Cron is best on local machines that you control. He has a PC at home that is on all the time, running cron jobs. You can run cron in the cloud, but there you pay for the time the machine is up, and a cron job only runs at certain times of the day. Keeping a cloud machine running just to wait for the next job is a waste of money; you only want the machine up when you are using it.

## 5. Keep tabs on inputs with a TODO list.
Use a CSV if you know what you want.\
Use AWS SQS (Amazon Simple Queue Service), which works like a commercial kitchen's ticket system. It's very simple and cheap. My fave! He uses it for both local and cloud scrapers.
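The CSV version of a TODO list can be sketched as follows. The column names `city` and `status` and the function names are assumptions for illustration; the point is that the scraper reads the list, works on what is pending, and writes back what it finished.

```python
import csv

FIELDNAMES = ["city", "status"]  # hypothetical columns for the TODO list


def load_todo(path):
    """Read the TODO list as a list of dictionaries, one per row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def pending(rows):
    """Return the rows that have not been marked as done."""
    return [row for row in rows if row["status"] != "done"]


def save_todo(path, rows):
    """Write the TODO list back to disk, including the header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

After scraping a city, set its `status` to `"done"` and call `save_todo`, so a restarted scraper picks up where it left off.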

## 6. Can you scale up?
If scrapers are simple, it's easy to parallelize them.\
If local: use async computing or `multiprocessing`. (More on this below)

## Miscellaneous
When you are using Selenium or Playwright and you are ready to go into production, switch to headless mode to save resources.
A non-headless browser is useful for developing and debugging, but not for processing.

He still has not used much Playwright. He said he thinks it is asynchronous by default. That should speed things up.

And as we have seen, always look for APIs first, before resorting to browser automation (BA).

Selenium is a framework (a car). ChromeDriver is the engine that makes Selenium move.
Selenium can also drive Firefox through geckodriver. I used to use Gecko, but switched to Chrome and Chromium as engines because they are a bit easier to install. But you can use others.

Chromium is the basis of many browsers: Chrome, Opera & Brave, among others.


# Tools he uses every day
- `cron`: schedule scripts and scrapers on a local machine.
- [`htop`](https://htop.dev/): A command line package to view your computer's resources. For example, how many CPUs are being used and how much memory is used. Good to know if your computer may crash. If you are multiprocessing, you can see how many processes you can handle.\
It needs to be installed, for example with Homebrew on macOS.
- multiprocessing

### Multiprocessing
Check this [gist](https://gist.github.com/yinleon/8b7555afbbeed47e439dbd2364b8d404). It has snippets for when you need to read many files into one dataframe, and for when you have a large dataframe and you need to perform an apply function.

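The following multiprocessing snippet uses 8 CPUs. Note that the pool itself is created with 100 worker processes, more than one per CPU, which works here because each task only sleeps. He ran it with `htop` open in a terminal, so we could see how many resources each CPU was using.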

import time
from multiprocessing import Pool

In [31]:
def example_function(n):
    """Sleeps for 5 seconds with an arbitrary input"""
    time.sleep(5)
#     print(n)
    return 1

In [32]:
ex_inputs = list(range(3000))

In [34]:
data = []
with Pool(processes=100) as pool:
    for record in tqdm(pool.imap_unordered(example_function, ex_inputs)):
        data.append(record)

3000it [02:30, 19.99it/s]


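Notice that order doesn't matter here: `imap_unordered` yields each result as soon as a worker finishes it, not in the order of the inputs.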

## TQDM
A useful progress bar for loops.

In [25]:
from tqdm import tqdm

In [30]:
ex_inputs = list(range(30))

for i in tqdm(ex_inputs):
    example_function(i)

  3%|▎         | 1/30 [00:05<02:25,  5.00s/it]

0

  7%|▋         | 2/30 [00:10<02:20,  5.00s/it]

1

...

 40%|████      | 12/30 [01:00<01:30,  5.00s/it]

11

 40%|████      | 12/30 [01:03<01:34,  5.28s/it]

KeyboardInterrupt: 

In [None]:
for i in ex_inputs:
    example_function(i)