# Pyscrape

_A simple web crawler in Python_

### Prerequisites

This workshop assumes you are familiar with the concepts of:

* variables,
* strings,
* conditions,
* loops,
* functions.

### Topics in this workshop

You will learn how to:

* import a Python library and use its documentation,
* use regular expressions to search for patterns in data,
* use exceptions for error handling,
* recognise a tree data structure,
* use recursion to traverse a tree structure,
* use stacks to traverse a tree structure,
* tell the difference between depth-first search and breadth-first search.

### Task 0: Library import

The greatest strength of Python comes from its large collection of libraries. For this workshop, we will be using the `urllib2` library for loading web pages, and the `re` library for matching patterns in strings.

Libraries in Python are loaded using the `import` keyword:

```Python
import libraryname
```

The documentation for the two libraries can be found at:

https://docs.python.org/2/library/urllib2.html

https://docs.python.org/2/library/re.html

To access a function `f` from library `l`, use the syntax:

```Python
l.f()
```
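
As a small illustration of this syntax, the sketch below imports the standard `math` library and calls its `sqrt` function. The `math` library is chosen here only as an example and is not otherwise used in this workshop.

```Python
import math

# Call the function `sqrt` from the library `math`:
# the library name comes first, then a dot, then the function name.
root = math.sqrt(16)
print(root)
```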

Use the cell below to import the two libraries.

In [11]:
import urllib2
import re

## Part 1: Find all URLs in a webpage

### Task 1.1: Opening a webpage

The `urllib2` library contains a function called `urlopen`. Click on the following link to see the documentation for this function:

https://docs.python.org/2/library/urllib2.html

We will only use the first argument of this function: `url`, passed as a string. The function sends a request to the web server and returns a file-like object holding the response. To get the contents of the page as a string, call the `read` function with no arguments on the returned object. You can use the `print` keyword to print the contents of a string, such as:

```Python
print "Hello World"
```

Webpages are usually large documents, so printing one can produce a lot of text. If you want to try loading a webpage and printing its contents, use the following URL:

http://www.lorem-ipsum-text.com

Use the cell below to load a website into a variable. Call `read` on this variable to get the contents of the website, and use the `print` keyword to print its contents:

In [None]:
url = "http://www.lorem-ipsum-text.com"
data = urllib2.urlopen(url)
webpage = data.read()
print webpage

If you want to clear the printed text, select the cell above and go to

Cell > Current Outputs > Clear

### Task 1.2: Regular expressions

Regular expressions are a compact way of representing patterns in strings. We will use pattern matching to look for the pattern of a URL on a website. All URLs we want to find are of the form:

`http://webpage.domain/` or

`https://webpage.domain/`.

The pattern we want to search for is a string starting with `http` or `https`, followed by `://` and followed by a string of uppercase or lowercase letters and dots, and ending with `/`.

The function `findall` of the `re` library searches for patterns in a string. Click the following link to see its documentation:

https://docs.python.org/2/library/re.html

The function accepts a pattern, specified as a regular expression, and a string in which to search. It returns a list of strings which matched the pattern.
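
A minimal sketch of how `findall` behaves. The sample string and the patterns `'cat'` and `'fish'` are made up for illustration.

```Python
import re

text = "cat, dog, cat, bird"  # made-up sample string

# findall returns every non-overlapping match, in order of appearance.
matches = re.findall('cat', text)
print(matches)

# If nothing matches, the result is an empty list rather than an error.
no_matches = re.findall('fish', text)
print(no_matches)
```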

#### A brief introduction to regular expression patterns

You will need to construct a pattern which matches the strings described above. The following special characters may be helpful:

| Special character | Description | Example | Return value |
|-------------------|-------------|---------|--------------|
| Normal text       | Matches only the text itself. | `re.findall('abc', 'abcdef')` | `['abc']` |
| `?`               | Makes the preceding character optional. | `re.findall('ab?', 'abcdefa')` | `['ab', 'a']` |
| `*`               | Matches zero or more occurrences of the preceding character. | `re.findall('aa*', 'abcdaabcaaa')` | `['a', 'aa', 'aaa']` |
| `[]`              | Matches any single character listed within the brackets. | `re.findall('[123]', '42f1A')` | `['2', '1']` |
| `-`               | Use within brackets to match a range of characters. | `re.findall('[a-z1]', '42f1A')` | `['f', '1']` |

Since some special characters are reserved for describing patterns, they cannot be matched directly. To match such a character literally, type `\` before it, so that it is not interpreted as a pattern description. This is called __escaping__. Characters that need escaping include `.` (which otherwise matches almost any character), `?` and `*`. Escaping `/` is harmless but not required in Python.
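
The effect of escaping can be sketched on a made-up string. The patterns are written as raw strings (`r'...'`), which keep backslashes exactly as typed; the sample text is an assumption chosen for illustration.

```Python
import re

text = "a.b axb a*b"  # made-up sample string

# Unescaped, `.` matches any single character, so all three items match.
any_char = re.findall(r'a.b', text)
print(any_char)

# Escaped, `\.` matches only a literal dot.
literal_dot = re.findall(r'a\.b', text)
print(literal_dot)

# Escaped, `\*` matches only a literal asterisk.
literal_star = re.findall(r'a\*b', text)
print(literal_star)
```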

Use the cell below to construct a regular expression which matches websites from the string in `urls`.

In [None]:
urls = """
Hello world, visit https://www.gatescambridge.org/ to learn about Gates Cambridge.
Information about University of Cambridge can be found at http://www.cam.ac.uk/.
Go to https://docs.python.org/2/library/re.html for Python documentation.
"""

# Construct your regular expression here:
regex = 'https?:\/\/[\.a-zA-Z]*\/'

# Print all websites from the string in urls
print re.findall(regex, urls)

### Task 1.3: Search for URLs in a webpage

Use the script for loading a webpage and the regular expression for finding URLs to find all URLs in a webpage of your choosing. For example, you may use http://google.com

In [None]:
url = "http://google.com"
data = urllib2.urlopen(url)
webpage = data.read()
m = re.findall('https?:\/\/[^\/]*',webpage)
for address in m:
    print address
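
The cell above uses a slightly different pattern from the one in Task 1.2. The sketch below compares the two on a made-up HTML fragment, so it runs without network access. The `html` string and its URLs are illustrative assumptions, and the patterns are written as raw strings without the optional `\/` escapes.

```Python
import re

# Made-up HTML fragment, used here instead of a downloaded page.
html = ('<a href="https://example.com/page">one</a> '
        '<a href="http://my-site2.org/">two</a>')

# Pattern from Task 1.2: letters and dots only, closing slash required.
strict = re.findall(r'https?://[.a-zA-Z]*/', html)

# Pattern from Task 1.3: any characters up to (not including) the next slash.
loose = re.findall(r'https?://[^/]*', html)

print(strict)
print(loose)
```

The stricter pattern skips hosts that contain digits or hyphens, while the looser pattern accepts them but keeps matching until it reaches a slash, even if the URL has already ended.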

## Part 2: Create a reusable function

### Task 2.1: Exceptions

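Try running the program from Task 1.3 with a broken URL (for example, a website that does not exist) and see what happens: Python raises an exception and the program stops.

The `try` and `except` keywords let the program handle such errors. Python runs the code inside the `try` block, and if an exception is raised there, it skips the rest of that block and runs the `except` block instead.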
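
A minimal offline sketch of `try` and `except`. The function `load_page` and the URLs are hypothetical stand-ins for `urllib2.urlopen` and real websites, so the example runs without a network connection.

```Python
def load_page(url):
    # Hypothetical stand-in for urllib2.urlopen: fails for one known-bad URL.
    if url == "http://broken.invalid/":
        raise IOError("could not reach " + url)
    return "<html>contents of " + url + "</html>"

results = []
for url in ["http://example.com/", "http://broken.invalid/"]:
    try:
        page = load_page(url)
        results.append("OK " + url)
    except IOError:
        # Runs only if an IOError was raised inside the try block.
        results.append("Error " + url)

for line in results:
    print(line)
```

Naming a specific exception type such as `IOError`, rather than writing a bare `except:`, lets unrelated errors still surface.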

In [None]:
url = "http://google.com"
try:
    data = urllib2.urlopen(url)
    webpage = data.read()
    m = re.findall('https?:\/\/[^\/]*',webpage)
    for address in m:
        print address
except:
    print "Error " + url

### Task 2.2: Pyscrape function

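A function packages code under a name so that it can be reused with different inputs. The `pyscrape` function below takes a URL and returns the list of URLs found on that page, or an empty list if the page cannot be loaded.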

In [15]:
def pyscrape(url):
    try:
        data = urllib2.urlopen(url)
        webpage = data.read()
    except:
        print "Error: " + url
        return []
    return re.findall('https?:\/\/[\.a-zA-Z]*\/', webpage)

Try calling the `pyscrape` function below:

In [16]:
pyscrape("http://google.com")

['http://schema.org/',
 'http://www.google.com/',
 'http://www.google.co.uk/',
 'http://maps.google.co.uk/',
 'https://play.google.com/',
 'http://www.youtube.com/',
 'http://news.google.co.uk/',
 'https://mail.google.com/',
 'https://drive.google.com/',
 'https://www.google.co.uk/',
 'http://www.google.co.uk/',
 'https://accounts.google.com/',
 'http://www.google.co.uk/',
 'https://plus.google.com/',
 'http://www.google.co.uk/']

## Part 3: Recursion

### Task 3.1: Write a recursive URL scraper

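Recursion means that a function calls itself. The links between webpages form a tree-like structure: each page links to further pages, which in turn link to more pages. The function below prints a URL and then calls itself on every URL found on that page.

An optional argument has a default value that is used when the caller does not supply one. Here `depth` defaults to `0` for the first call and grows by one with each level of recursion, so the function stops once `depth` exceeds `DEPTH_LIMIT`.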
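
Recursion with an optional depth argument can be sketched offline on a small tree. The `LINKS` dictionary, the page names and the function `walk` are hypothetical and stand in for real webpages and `pyscrape`.

```Python
# Hypothetical site map: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "blog"],
    "about": [],
    "blog": ["post1", "post2"],
    "post1": [],
    "post2": [],
}

def walk(page, depth=0, limit=2):
    # Base case: stop once we are deeper than the limit.
    if depth > limit:
        return []
    lines = ["    " * depth + page]
    for child in LINKS.get(page, []):
        # Recursive case: the function calls itself one level deeper.
        lines.extend(walk(child, depth + 1, limit))
    return lines

# depth and limit are optional, so the first call only needs the page.
for line in walk("home"):
    print(line)
```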

In [26]:
DEPTH_LIMIT = 2

In [None]:
def pyscrape_recurse(url, depth = 0):
    if depth > DEPTH_LIMIT:
        return
    print '    ' * depth + url
    urlsInWebpage = pyscrape(url)
    for u in urlsInWebpage:
        pyscrape_recurse(u, depth + 1)

In [None]:
pyscrape_recurse("http://google.com")

### Task 3.2: Remove duplicate URLs

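To avoid visiting the same page more than once, remember the URLs we have already seen and do not visit them again.

A set is a collection that stores each value at most once and can quickly check whether it contains a given value, which makes it a good fit for keeping track of visited URLs.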
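
A minimal sketch of using a set to drop repeats, using a few URLs taken from the sample output of Task 2.2. The variable names are illustrative.

```Python
# A short list of URLs containing a repeat.
found = [
    "http://www.google.co.uk/",
    "https://mail.google.com/",
    "http://www.google.co.uk/",
]

visited = set()
unique_in_order = []
for u in found:
    if u not in visited:   # membership test on the set
        visited.add(u)     # remember the URL for later checks
        unique_in_order.append(u)

print(unique_in_order)
print(len(visited))
```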

In [30]:
visited = set()

def pyscrape_unique_recurse(url, depth = 0):
    if depth > DEPTH_LIMIT:
        return
    # Record this page so that links back to it are not followed again.
    visited.add(url)
    print '    ' * depth + url
    urlsInWebpage = pyscrape(url)
    for u in urlsInWebpage:
        if u not in visited:
            visited.add(u)
            # Recurse into the linked page u, not the current page url.
            pyscrape_unique_recurse(u, depth + 1)

In [31]:
pyscrape_unique_recurse("http://google.com")


## Part 4: Breadth-first and depth-first search

In [None]:
# This program uses a search to keep following links on webpages. It also
# skips links to websites whose domain has already been seen.
def printscrape(breadth, top_url):
    # If breadth == True, this performs a breadth-first search.
    # If breadth == False, this performs a depth-first search.
    # New URLs are appended to the end of the list, so taking the oldest item
    # (index 0) gives breadth-first order and taking the newest item
    # (index -1) gives depth-first order.
    pop_index = 0 if breadth else -1
    to_visit = [top_url]
    seen = []
    count = 0
    while len(to_visit) > 0:
        count += 1
        if count == 100: # Stops it from running forever.
            print seen
            return
        current = to_visit[pop_index]
        del to_visit[pop_index]
        print current
        for d in pyscrape(current):
            # This regular expression matches the last two parts of the
            # domain, such as "google.com/" in "https://www.google.com/".
            # It is used to make sure that we visit a different website each
            # time instead of different parts of the same website (so, after
            # "https://www.google.com/", the search will ignore
            # "https://news.google.com/").
            match = re.search('[a-zA-Z]+\.[a-zA-Z]+\/$', d)
            if match is None:
                # The URL does not end in "name.name/", so just report it.
                print d
                continue
            domain = match.group(0)
            if not any(domain in s for s in seen):
                seen.append(d)
                to_visit.append(d)

# Depth-first search starting from the Google home page.
printscrape(False, "https://google.com/")

http://schema.org
http://www.google.com
http://www.google.co.uk
http://maps.google.co.uk
https://play.google.com
http://www.youtube.com
http://news.google.co.uk
https://mail.google.com
https://drive.google.com
https://www.google.co.uk
http://www.google.co.uk
https://accounts.google.com
http://www.google.co.uk
https://plus.google.com
http://www.google.co.uk
https://google.com/
https://www.youtube.com/
https://s.ytimg.com/
Error: https://s.ytimg.com/
https://www.google.co.uk/
http://www.google.com/
http://schema.org/
http://github.com/
https://collector.githubapp.com/
http://ogp.me/
http://openwebfoundation.org/
http://www.gstatic.com/
Error: http://www.gstatic.com/
http://microformats.org/
http://mediatemple.net/
Error: http://mediatemple.net/
http://www.readwriteweb.com/
http://www.criticalvc.com/
https://stats.wp.com/
https://connect.facebook.net/
Error: https://connect.facebook.net/
https://snap.licdn.com/
https://pages.awscloud.com/
https://youtu.be/
https://quicksight.aws/
https://
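
The difference between the two search orders can be sketched offline on a small hard-coded link graph. The `LINKS` dictionary, the page names and the function `traverse` are hypothetical, and `collections.deque` is used here because it can remove items efficiently from either end. This is an illustrative sketch rather than a replacement for `printscrape`.

```Python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def traverse(start, breadth):
    # Return pages in visiting order: breadth-first if breadth is True,
    # depth-first otherwise.
    to_visit = deque([start])
    seen = set([start])
    order = []
    while to_visit:
        # Queue (oldest first) for BFS, stack (newest first) for DFS.
        page = to_visit.popleft() if breadth else to_visit.pop()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return order

bfs_order = traverse("A", breadth=True)
dfs_order = traverse("A", breadth=False)
print(bfs_order)
print(dfs_order)
```

Breadth-first search visits every page one link away from the start before any page two links away, while depth-first search follows one chain of links as far as it goes before backing up.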