# Best Practices and Guidelines

## API Access

When a website's UI or layout changes, our scrapers will break. So first check whether the website publishes an API. An API not only protects our code from breaking on simple layout changes, it is also much friendlier to work with: we don't have to learn the page's HTML nesting to pull out our data.

## Data Dumps

Many sites publish data dumps regularly (Reddit, Wikipedia, ...). See if what you need is already cleaned and uploaded. You could use that for your project and avoid the scraping woes.

## Respect robots.txt

`robots.txt` specifies the rules for web crawlers and scrapers. Most websites have one. Check it first. See Wikipedia's for an example: https://en.wikipedia.org/robots.txt
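
A minimal sketch of checking these rules from Python with the standard library's `urllib.robotparser`. The rules, the bot name `my-side-project-bot` and the `example.com` URLs are made up for illustration, and the file content is inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical bot name and URLs, used only to exercise the rules above.
bot_name = "my-side-project-bot"
allowed = parser.can_fetch(bot_name, "https://example.com/public/page.html")
blocked = parser.can_fetch(bot_name, "https://example.com/private/data.html")
delay = parser.crawl_delay(bot_name)  # None when no Crawl-delay is given

print(allowed, blocked, delay)
```

To check a live site instead, call `parser.set_url(...)` with the site's `robots.txt` address followed by `parser.read()`, then use `can_fetch` the same way.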

## Scrape only what you need

Scrape just enough information for your side project.

## Do not hit the servers too frequently

Add delays to your web scraping code even if `robots.txt` doesn't explicitly specify them.
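
One way to build the delay in, as a rough sketch. The helper name `fetch_politely` and its `fetch` argument are assumptions made for this example; `fetch` stands for whatever function actually downloads a page, and the demo uses a stand-in so nothing is requested over the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Stand-in for a real download function, so the demo stays offline.
def fake_fetch(url):
    return "page for " + url


pages = fetch_politely(["a", "b", "c"], fake_fetch, delay_seconds=0.05)
print(pages)
```

With `requests`, the `fetch` argument could be something like `lambda u: requests.get(u, timeout=10).text`, with a delay of a second or more.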

## Don't share scraped data

If it's not your data, don't share it publicly on GitHub or elsewhere.

## In general, be considerate

* Choose an off-peak time to scrape
* Build in wait times between requests with `time.sleep(seconds)`
* Save as you go. Don't repeatedly hit the same page
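
The "save as you go" advice can be sketched as a small on-disk cache: each page is written to a file the first time it is fetched and read back from disk afterwards. The helper name `cached_fetch`, the hashed file naming and the `fetch` argument are assumptions for this example, not part of any library.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_fetch(url, fetch, cache_dir):
    """Return the page for url, calling fetch(url) only if no saved copy exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a filesystem-safe file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(url)
    path.write_text(text, encoding="utf-8")
    return text


# Stand-in for a real download function; it records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>stand-in page</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/page", fake_fetch, tmp)
    second = cached_fetch("https://example.com/page", fake_fetch, tmp)

print(len(calls))  # the second call is served from disk
```

In a real project the cache directory would be a permanent folder rather than a temporary one, so that re-running the scraper does not hit the site again.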

## Using classes is generally more reliable than using style / font
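
A small illustration of why: in the made-up HTML below, the two cells we want are styled differently (one bold, one with a font size) but share the class `gross`, so selecting on the class finds both. The sketch uses only the standard library's `html.parser` to stay self-contained; the class `ClassTextExtractor`, the sample markup and its values are invented for this example, and a dedicated HTML parsing library would usually make this shorter.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1  # nested tag inside a matching element
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


# Hypothetical markup: same class, different inline styles.
sample_html = """
<table>
  <tr><td class="label">Domestic</td><td class="gross" style="font-weight: bold">$100</td></tr>
  <tr><td class="label">Foreign</td><td class="gross" style="font-size: 12px"><b>$250</b></td></tr>
</table>
"""

extractor = ClassTextExtractor("gross")
extractor.feed(sample_html)
gross_values = extractor.results
print(gross_values)
```

A selector based on `font-weight: bold` would have missed the second cell, and any restyling of the page would break it, while the class name tends to stay put.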

## Using headers and params with your GET requests

In [6]:
import requests

# sometimes things work more smoothly if we change the User-Agent in the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5)'}

In [3]:
# in the previous example, we were constructing URLs manually
url = 'http://www.boxofficemojo.com/movies/?id=soundofmusic.htm&adjust_yr=2017'
requests.get(url).text  # .text is the response body as a string

'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<html lang="en">\n<head>\n<meta http-equiv="Content-type" content="text/html;charset=iso-8859-1">\n<title>The Sound of Music (1965) - Box Office Mojo</title>\n\n<style type="text/css">\ntable.chart-wide { width: 100%; }\n</style>\n<META name="keywords" content="the sound of music, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, fox, theatrical summary, theatrical, box office mojo">\n<META name="description" content="The Sound of Music summary of box office results, charts and release information and related links.">\n\n<link rel="stylesheet" href="/css/mojo.css?1" type="text/css" media="screen" title="no title" charset="utf-8">\n<link rel="stylesheet" href="

In [10]:
# we could rewrite the same request with params
base_url = 'http://www.boxofficemojo.com/movies/'
params = {
    'id': 'soundofmusic.htm',  # same id value as in the manual URL above
    'adjust_yr': 2017
}

In [11]:
response = requests.get(url=base_url, headers=headers, params=params)
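
To see what the `params` dictionary turns into, here is a small offline sketch using the standard library's `urllib.parse.urlencode`. It is only meant to show roughly how a dictionary becomes a query string, not to reproduce exactly what `requests` does internally; the variable names `built_url` and `manual_url` are introduced just for this example.

```python
from urllib.parse import urlencode

base_url = 'http://www.boxofficemojo.com/movies/'
params = {
    'id': 'soundofmusic.htm',
    'adjust_yr': 2017,
}

# Encode the dictionary as a query string and append it to the base URL.
built_url = base_url + '?' + urlencode(params)

# The URL we wrote by hand earlier.
manual_url = 'http://www.boxofficemojo.com/movies/?id=soundofmusic.htm&adjust_yr=2017'

print(built_url == manual_url)
```

The benefit of passing a dictionary is that values with spaces or special characters get escaped for us instead of being pasted into the URL by hand.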

You can get additional fake User-Agent headers via the `fake_useragent` library  
`#!pip install fake_useragent`

## Catch errors with try-except; webpages are not always consistent

In [None]:
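import requests

try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as e:
    # Python 3 exceptions have no .message attribute; print the exception itself
    print(e)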
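
Building on the try-except idea, a failed request can also be retried a few times before giving up. This is only a sketch: the helper name `fetch_with_retries`, its parameters and the flaky stand-in function are assumptions for illustration, and the demo raises the built-in `ConnectionError` so it runs without a network.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, wait_seconds=1.0,
                       retry_on=(Exception,)):
    """Call fetch(url), retrying on the given exception types."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except retry_on as error:
            last_error = error
            print(f"attempt {attempt} failed: {error}")
            if attempt < attempts:
                time.sleep(wait_seconds)  # pause before trying again
    raise last_error


# Stand-in fetch function that fails twice and then succeeds.
attempts_seen = []


def flaky_fetch(url):
    attempts_seen.append(url)
    if len(attempts_seen) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


page = fetch_with_retries(flaky_fetch, "https://example.com/page",
                          attempts=3, wait_seconds=0.01)
print(page)
```

With `requests`, it would make sense to pass `retry_on=(requests.exceptions.RequestException,)` so that only network and HTTP errors are retried.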