# Pyscrape

_A simple web crawler in Python_

### Prerequisites

This workshop assumes you are familiar with the concepts of:

* variables,
* strings,
* conditions,
* loops,
* functions.

### Topics in this workshop

You will learn how to:

* import a Python library and use its documentation,
* use regular expressions to search for patterns in data,
* use exceptions for error handling,
* recognise a tree data structure,
* use recursion to traverse a tree structure,
* use stacks to traverse a tree structure,
* tell the difference between depth-first search and breadth-first search.

### Task 0: Library import

The greatest strength of Python comes from its large collection of libraries. For this workshop, we will be using the `urllib2` library for loading web pages, and the `re` library for matching patterns in strings.

Libraries in Python are loaded using the `import` keyword:

```Python
import libraryname
```

The documentation for the two libraries can be found at:

https://docs.python.org/2/library/urllib2.html

https://docs.python.org/2/library/re.html

To access a function `f` from library `l`, use the syntax:

```Python
l.f()
```
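
As a concrete illustration, the short sketch below imports the `re` library and calls its `findall` function, writing the library name first and the function name second. The variable name `matches` and the sample string are only for illustration, and `print` is written with parentheses so that the snippet runs under both Python 2 and Python 3.

```python
import re

# Library name first, then a dot, then the function name.
matches = re.findall('o', 'Hello World')

# 'Hello World' contains the letter 'o' twice.
print(matches)
```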

Use the cell below to import the two libraries.

## Part 1: Find all URLs in a webpage

### Task 1.1: Opening a webpage

The `urllib2` library contains a function called `urlopen`. Click on the following link to see the documentation for this function:

https://docs.python.org/2/library/urllib2.html

We will only use the first argument of this function: `url`, passed as a string. The function sends a request to the webpage and returns a file-like response object. To get the contents of the page as a string, call the `read` function with no arguments on the returned object. You can use the `print` keyword to print the contents of a string, such as:

In [None]:
print "Hello World"
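
The open-then-read sequence can also be tried without a network connection. The sketch below is only an illustration under a few assumptions: it writes a tiny page to a temporary file and opens it through a `file:` URL, and it imports `urlopen` from `urllib2` when available (Python 2) or from `urllib.request` otherwise (Python 3). The names `handle`, `path`, `response` and `contents` are hypothetical.

```python
import os
import tempfile

try:
    # Python 2, as used in this workshop.
    from urllib2 import urlopen
    from urllib import pathname2url
except ImportError:
    # Python 3 keeps both functions in urllib.request.
    from urllib.request import urlopen, pathname2url

# Write a tiny page to a temporary file so that no network is needed.
handle, path = tempfile.mkstemp(suffix=".html")
os.write(handle, b"<html>Hello World</html>")
os.close(handle)

# urlopen returns a response object; read() returns its contents.
response = urlopen("file:" + pathname2url(path))
contents = response.read()
response.close()

# Clean up the temporary file.
os.remove(path)

print(contents)
```

With a real website, only the string passed to `urlopen` changes; the call to `read` stays the same.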

Websites are usually large documents. If you want to try loading a website and printing its contents, use the following url:

http://www.lorem-ipsum-text.com

Use the cell below to load a website into a variable. Call `read` on this variable to get the contents of the website, and use the `print` keyword to print its contents:

In [None]:
url = "http://www.lorem-ipsum-text.com"

# Use urllib2.urlopen to open the URL.

# Read the contents of the website.

# Print it.

If you want to clear the printed text, select the cell above and go to

Cell > Current Outputs > Clear

### Task 1.2: Regular expressions

Regular expressions are a compact way of representing patterns in strings. We will use pattern matching to look for the pattern of a URL on a website. All URLs we want to find are of the form:

`http://webpage.domain/` or

`https://webpage.domain/`.

The pattern we want to search for is a string starting with `http` or `https`, followed by `://` and followed by a string of uppercase or lowercase letters and dots, and ending with `/`.

The function `findall` of the `re` library searches for patterns in a string. Click the following link to see its documentation:

https://docs.python.org/2/library/re.html

The function accepts a pattern, specified as a regular expression, and a string in which to search. It returns a list of strings which matched the pattern.

#### A brief introduction to regular expression patterns

You will need to construct a pattern which matches the strings described above. The following special characters may be helpful:

| Special character | Description | Example | Return value |
|-------------------|-------------|---------|--------------|
| Normal text       | Matches only the text itself. | `re.findall('abc', 'abcdef')` | `['abc']` |
| `?`               | Makes the preceding character optional. | `re.findall('ab?', 'abcdefa')` | `['ab', 'a']` |
| `*`               | Matches zero or more occurrences of the preceding character. | `re.findall('aa*', 'abcdaabcaaa')` | `['a', 'aa', 'aaa']` |
| `[]`              | Matches any one of the characters within the brackets. | `re.findall('[123]', '42f1A')` | `['2', '1']` |
| `-`               | Use in a group to match a range of characters. | `re.findall('[a-z1]', '42f1A')` | `['f', '1']` |

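Since some special characters are reserved for describing patterns, they cannot be matched directly. To match such a character literally, type `\` before it, and it will be treated as ordinary text rather than as part of the pattern description. This is called __escaping__. Characters that need escaping include `.`, `*` and `?`. The `/` character has no special meaning in Python regular expressions, so it can be written with or without a backslash.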
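
The effect of escaping can be seen in a small sketch. An unescaped `.` is a special character that matches any single character, while `\.` matches only a literal dot. The variable names are illustrative, and the `r` prefix (a raw string) is used so that Python passes the backslash to `re` unchanged.

```python
import re

text = 'abc a.c'

# Unescaped: '.' stands for any character, so both words match.
unescaped = re.findall('a.c', text)

# Escaped: '\.' stands for a literal dot, so only 'a.c' matches.
escaped = re.findall(r'a\.c', text)

print(unescaped)
print(escaped)
```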

Use the cell below to construct a regular expression which matches websites from the string in `urls`.

In [None]:
urls = """
Hello world, visit https://www.gatescambridge.org/ to learn about Gates Cambridge.
Information about University of Cambridge can be found at http://www.cam.ac.uk/.
Go to https://docs.python.org/2/library/re.html for Python documentation.
"""

# Construct your regular expression here:
regex = 

# Print all websites from the string in urls
print re.findall(regex, urls)

### Task 1.3: Search for URLs in a webpage

Use the script for loading a webpage and the regular expression for finding URLs to find all URLs in a webpage of your choosing. For example, you may use http://gatescambridge.org

In [None]:
url = "http://gatescambridge.org"

# Read the contents of the webpage

# Find all URLs using the regular expression

# Print out the URLs


## Part 2: Create a reusable function

### Task 2.1: Exceptions

Some library functions in Python may fail with an error. For example, if you run a function with an incorrect argument, the function will complain about it. In the cell below, try running the `urlopen` function with a website which does not exist, for example `http://google.c`.

In [None]:
url = "http://google.c"
urllib2.urlopen(url)

The function complains about the incorrect input, and it shows you exactly where in your code the error occurred. We say that the function has __thrown an exception__. The program will not progress beyond the point where an exception was thrown.

If you run the `urlopen` function with an argument that you know is correct, you do not have to worry about exceptions. This is rarely the case in the real world. In our `pyscrape` program, we will open URLs found on websites, which we cannot guarantee are correct. We do not want the program to stop in this case.

Python provides a mechanism to _try out_ a piece of code. If it throws an exception, you can provide an alternative that is executed instead. In some languages, this is called __catching an exception__. In Python, the following syntax is used to catch and handle an exception:

In [None]:
try:
    a = 2 / 0
    print a
except:
    print "You shall not divide by zero"
    
print "I will be printed whether or not you divided by zero"

Indentation in Python is important. The lines following `try` must all be indented (press TAB once), `except` must have the same indentation as `try`, and all indented code after `except` is a part of the exception handling.
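
An `except` line can also name the kind of exception it handles, which leaves other, unexpected errors visible. The sketch below is a hedged illustration: it relies on the fact that `urlopen` raises a `ValueError` for a string that is not a URL at all, which happens before any network request is made. The helper name `try_open` and its return strings are hypothetical, and the import is written so that the snippet runs under both Python 2 and Python 3.

```python
try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3

def try_open(url):
    try:
        urlopen(url)
        return "opened"
    except ValueError:
        # Raised when the string does not look like a URL.
        return "not a valid URL"

message = try_open("not a url")
print(message)
```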

In the cell below, copy the code from Task 1.3 and wrap the call to `urlopen` in a `try`-`except` block to properly handle any errors.

In [None]:
url = "http://google.c"

try:
    # Load the website and print out all the URLs.
except:
    # Print an error message.
    

### Task 2.2: Pyscrape function

Functions in Python are defined using the `def` keyword, followed by the function name and arguments in brackets. As always, indentation is important: the whole function body must be indented. You can return a value from the function using the `return` keyword:

In [None]:
def function_with_no_arguments():
    return "This function was called with no arguments"

def function_with_no_return_value(argument):
    print "This function was given the argument: " + argument + ". It does not return a value."
    
# You can print the result of the first function
print function_with_no_arguments()

# The second function does not return anything, and so printing its result will output None.
# The function itself prints some text.
print function_with_no_return_value("Hello World")

Place the code from Task 2.1 in a function `pyscrape` which accepts a URL as an argument and returns the list of URLs found on the webpage.

In [8]:
def pyscrape(# Define arguments here):
    # Load the website and find all its URLs
    
    return # Return the list of URLs

Try calling the `pyscrape` function below:

In [None]:
pyscrape("http://gatescambridge.org")

## Part 3: Recursion

Once a function is defined, you can call it from anywhere - even from its own body. Defining something in terms of itself is called __recursion__.

Think of the Fibonacci numbers. Each number in the sequence is defined as the sum of the previous two numbers:

$$fibonacci(n) = fibonacci(n-1) + fibonacci(n-2)$$

We can translate the definition above to Python directly:

```Python
def fibonacci(n):
    return fibonacci(n-1) + fibonacci(n-2)
```

However, if we ran the program above with any number, it would compute forever (or until our computer crashed). With any recursive function, we need to constrain the execution. Think of this as the __base case__ in mathematical induction: for any number higher than the base case, the proof is written in terms of the previous number, but we have a special proof for the base case. For recursive programs, we need a different code path for the base case, to stop our function from executing indefinitely.
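
To see a base case at work in a different recursive function, consider the factorial of a number, where `factorial(n)` is `n` times `factorial(n - 1)` and the factorial of zero is defined as 1. This is a minimal sketch that assumes `n` is a non-negative whole number; it is not part of the workshop code.

```python
def factorial(n):
    # Base case: stop the recursion here.
    if n == 0:
        return 1
    # Recursive case: defined in terms of a smaller number.
    return n * factorial(n - 1)

print(factorial(5))
```

Without the `if n == 0` check, the function would keep calling itself with ever smaller numbers and never return.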

In case of Fibonacci numbers, there are two base cases: the values of the zero-th Fibonacci number (0) and the first number (1). You may have seen a definition where the first two numbers are 1. Either definition is fine for this example.

In the cell below, try writing a `fibonacci` function which, given a number greater than or equal to zero, returns the Fibonacci number at this index in the sequence. Make sure you check for elements zero and one and return the correct value without calling the function again. Your function should complain if you start it with a negative number: print out an error and do not call the function again.

__IMPORTANT__: The implementation of the Fibonacci function we are using here is very inefficient, and running it with very large numbers will take a long time. Use only small values (`n`<20) when testing your function.

In [None]:
def fibonacci(n):
    # Check for negative numbers.
    
    # Check for base cases.
    
    return # Return the computed Fibonacci number.

# These functions will test your implementation. Compare the output with the comments.
print fibonacci(-1)    # Should print an error and None
print fibonacci(0)     # 0
print fibonacci(1)     # 1
print fibonacci(8)     # 21
print fibonacci(14)    # 377
print fibonacci(11)    # 89

### Task 3.1: Write a recursive URL scraper

In this section, we will write a function which uses `pyscrape` to find all URLs in a given website, and which then visits all these URLs. For each of the URLs, it will again get all the websites and visit every one of them, and so on.

We will call this function `pyscrape_recurse`. Think about what the function has to do:

* print out the URL we are currently visiting
* call `pyscrape` to get all URLs from the website
* for each of these URLs:
  * print out the URL we are currently visiting
  * call `pyscrape` to get all URLs from the website
  * for each of these URLs:
    * print out the URL we are currently visiting
    * ...
    
Hopefully you can see where the recursion is coming in. We need to perform the same sequence of operations for each of the URLs that we scrape from the original website. We can therefore just call the `pyscrape_recurse` function from within itself on each URL we scrape.

The internet is not infinite. It is however pretty big, and most websites contain a large number of links. Just like we did for Fibonacci, we need to constrain the execution, otherwise our function will take a very long time to complete. We will use a recursion depth limit as our constraint.

The __depth__ of recursion can be defined as the number of times the function called itself. For example, imagine we called `pyscrape_recurse` on the website http://abc.d. This website only contains a single link, to http://efg.h, which only contains a single link to http://ijk.l, which does not contain any links. When we call `pyscrape_recurse` on http://abc.d, the depth of recursion is 0: the function did not call itself. However, when the function is called for http://efg.h, the depth of recursion is 1, since the function called itself once. For http://ijk.l, the depth is 2, since at that point the function called itself twice.

In order to constrain the execution, we will define a depth limit. This variable will be reused later, and so we will define it as a __global constant__. As a convention, the names of constants are UPPERCASE_AND_SEPARATED_WITH_UNDERSCORES. While we are playing with the code, keep the depth limit at two, otherwise you will have to wait a long time for your code to finish.

In [48]:
DEPTH_LIMIT = 2

Python does not automatically tell us the depth of the current call to a recursive function. We therefore need to add the current depth as an argument to the function. Whenever `pyscrape_recurse` calls itself, it passes its own depth increased by one. This way, we can compare the current depth with the depth limit and return early once the limit has been reached.
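
Passing the depth along can be sketched on the three-page example from above, without loading any real website. The `LINKS` dictionary stands in for the links found on each page, and the function name `visit` and its `limit` argument are hypothetical. The function records each page together with the depth at which it was reached and stops following links once the limit is reached.

```python
# Hard-coded stand-in for the links found on each page.
LINKS = {
    "http://abc.d": ["http://efg.h"],
    "http://efg.h": ["http://ijk.l"],
    "http://ijk.l": [],
}

def visit(page, depth, limit):
    reached = [(page, depth)]
    # Return early once the depth limit has been reached.
    if depth >= limit:
        return reached
    for link in LINKS[page]:
        # The recursive call receives a depth one higher than ours.
        reached += visit(link, depth + 1, limit)
    return reached

print(visit("http://abc.d", 0, 2))
```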

However, now whenever we call `pyscrape_recurse`, we need to pass an initial depth of 0. This is inconvenient, ugly, and makes it harder for anyone reading our code afterwards to understand what the 0 means. We will never want to start the function with recursion depth other than 0. To make our code simpler and more readable, Python provides us with a way of defining __optional__ arguments. These arguments must always be defined after any compulsory arguments for a function, and we can provide a default value which will be used if the argument is not specified when we call the function. To see optional arguments in action, try running the example in the cell below.

In [None]:
def hello(optional = "World"):
    print "Hello " + optional
    
# This call prints "Hello World"
hello()

# This call prints "Hello Python"
hello("Python")

In the cell below, write the `pyscrape_recurse` function. Make sure to return early (use `return` on its own, with no value) if the current depth has reached the `DEPTH_LIMIT`. Don't forget to call the function recursively with a depth one higher than the current depth.

In [49]:
def pyscrape_recurse(url, # Define an optional depth argument here):
    # Print the current URL
    
    # Return early if we reached the depth limit

    # Use pyscrape to find all URLs on the website
    
    # Call pyscrape_recurse for each URL found. Don't forget to increase the depth.
    

Try running `pyscrape_recurse` on a website of your choice in the cell below.

In [None]:
pyscrape_recurse("http://gatescambridge.org")

All of our URLs are printed out in a flat list. It would be nice to display visually which website led to which URL. Since we know the recursion depth at which we visited each URL, we can just add indentation as large as the depth.

The following snippet of code prints out a few spaces `depth` number of times followed by the URL string. Modify the `pyscrape_recurse` function to use this for printing the URLs and try running it again to see the formatted output.

```Python
print '    ' * depth + url
```
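
The effect of multiplying a string by a number can be checked on a few made-up pairs of URL and depth. The list `pages` below is purely illustrative.

```python
# Each pair holds a URL and the depth at which it was visited.
pages = [("http://abc.d", 0), ("http://efg.h", 1), ("http://ijk.l", 2)]

# Four spaces repeated `depth` times, followed by the URL.
lines = ['    ' * depth + url for url, depth in pages]

for line in lines:
    print(line)
```

A depth of 0 gives an empty prefix, so the root URL is printed without indentation.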

### Task 3.2: Remove duplicate URLs

You may notice that in the output above, many URLs repeat. This is because we are only printing out the name and domain of the website, but not its individual page addresses. Different parts of a website will have URLs with the same name and domain, but differ in what follows after. For example, a blog may have pages http://blog.com/home, http://blog.com/about, and http://blog.com/archive. These pages will likely be linked from other pages of http://blog.com, and for each one of them we only print out the http://blog.com/ part. Our recursive function then ends up visiting the same page over and over.

Instead of extracting just the website name and domain using our regular expression, we could instead consider each page on a website as a separate URL. However, to keep our exercises small, we will leave our regular expression as is and instead ignore URLs which we visited before.

To keep track of the visited websites, we will use Python's handy `set` data structure. It does exactly what you would expect from a set: you can add elements to it and check if elements are in the set. Try out the example below to learn about the syntax and behaviour of sets.

This is how to create a new set:

In [None]:
cambridge_colleges = set()

This is how to add elements to a set:

In [None]:
cambridge_colleges.add("Christ's")
cambridge_colleges.add("Churchill")
cambridge_colleges.add("Clare")

# Try adding another college below:


This is how to print out the number of elements in the set:

In [None]:
print "Number of colleges: " + str(len(cambridge_colleges))

This is how to print out the elements of a set:

In [None]:
print cambridge_colleges

This is how to test whether an element is in the set:

In [None]:
if "Clare" in cambridge_colleges:
    print "Yep, that's Cambridge"

This is how to test whether a college is __not__ in the set:

In [None]:
if "Ravenclaw" not in cambridge_colleges:
    print "Nope, not in Cambridge"

In the cell below, complete the `pyscrape_unique_recurse` function. Similarly to our `depth` argument, the set of visited URLs is an optional argument which is empty by default.
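
One Python detail is worth knowing when a set is used as a default value: the default is created once, when the function is defined, and the same set is then reused by every call that does not pass its own. The sketch below shows this behaviour and a common alternative that uses `None` as the default. The function names `remember` and `remember_safely` are hypothetical and not part of the workshop code.

```python
def remember(item, seen=set()):
    # The same default set is shared between calls.
    seen.add(item)
    return len(seen)

first = remember("a")    # the shared set now holds "a"
second = remember("b")   # the shared set now holds "a" and "b"

def remember_safely(item, seen=None):
    # Create a fresh set whenever the caller did not supply one.
    if seen is None:
        seen = set()
    seen.add(item)
    return len(seen)

print(first)
print(second)
print(remember_safely("a"))
print(remember_safely("b"))
```

For a single run of the scraper the shared default is harmless, but calling the function a second time would start with the URLs remembered from the first run.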

In [11]:
def pyscrape_unique_recurse(url, depth = 0, # Define an optional visited argument here):
    # Print the current URL
                            
    # Check for the depth limit and return early if we have reached it.
                            
    # Find all the URLs in the webpage.              
    
    # Call pyscrape_unique_recurse on each new URL.
    # Don't forget to add the new websites to the visited set.


In [None]:
pyscrape_unique_recurse("http://gatescambridge.org")

## Part 4: Breadth-first and depth-first search

The websites and links between them form a data structure called a __graph__. A graph in the computer-science sense is a structure which has __nodes__ connected by __edges__. In the case of the Internet, websites are nodes and links on websites which lead to other websites are the edges between the nodes. The image below shows an example of a graph of websites. For example, `gh.i` has links to `pq.r` and `st.u`, both `ab.c` and `pq.r` contain a link to themselves, and `de.f`, `jk.l` and `st.u` all have a link to `mn.o`.

![A graph of websites](https://github.com/mleming/gates_python_workshop/blob/pyscrape/images/website-graph.png?raw=true "A graph of websites")

Think about what `pyscrape_unique_recurse` does when we pass `http://ab.c` to it:

1. It retrieves the links from `ab.c`. The link back to itself is ignored.
2. The function is called on the link to `de.f` with depth 1:
   1. The links to `jk.l` and `mn.o` are found.
   2. The function is called on `jk.l` with depth 2:
      1. Here we only print out `jk.l` and do not follow the link to `mn.o` because of our depth limit.
   3. The function is called on `mn.o`:
      1. It just prints the URL and returns because of the depth limit.
   4. Now that we have called the function on all links from `de.f`, the function returns.
3. The function is called on the link to `gh.i`:
   1. The links to `pq.r` and `st.u` are found.
   2. The function is called on `pq.r`.
      1. The URL is printed out and links are ignored because we have reached our depth limit.
   3. The function is called on `st.u`.
      1. The URL is printed out and links are ignored because we have reached our depth limit.
   4. We have called the function on all links from `gh.i` and the function returns.
4. We have called the function on all links from `ab.c` and the function returns. 
   
The websites are visited in the following order. Blue arrows indicate the `pyscrape_unique_recurse` function being called and orange arrows represent returns from the function.

![Order in which websites are visited](https://github.com/mleming/gates_python_workshop/blob/pyscrape/images/website-graph-path.png?raw=true "Order in which websites are visited")

Since we are ignoring duplicate websites, we can just redraw the tree without the duplicate links. This results in the structure shown in the next diagram:

![A tree of websites](https://github.com/mleming/gates_python_workshop/blob/pyscrape/images/website-tree.png?raw=true "A tree of websites")

Such a data structure is called a __tree__: there are no edges which go back to an element that was already visited. The node labelled `http://ab.c` in the diagram is called the __root__ of the tree. In our `pyscrape_unique_recurse` function, the root of the tree is the original website from which we start scraping URLs.

Our function walks down one __branch__ of the tree until there are no more links (or until we reach our depth limit). It then returns to the level above and proceeds down the next branch. The following GIF shows it quite nicely.

![Depth-first search](https://upload.wikimedia.org/wikipedia/commons/7/7f/Depth-First-Search.gif "Depth-first search")

### Task 4.1: Depth-first search using stacks

Recursion is not the only way to implement depth-first search. Imagine a function which, given the website graph from above, walks the graph like this:

To keep track of things which the function needs to do, it has a pile of URLs. All URLs which have yet to be visited are placed on this pile and the function always processes the URL that sits on top of this pile first. Initially, the pile contains our root URL, `ab.c`. To process it, the function removes the URL from the pile. It then places all of the links from the website onto the pile. It then removes the next URL from the pile, `de.f`, finds all of its links and places them onto the pile. The next website to process is `jk.l`, followed by `mn.o`. After `mn.o` is processed, the pile contains no more links from `de.f` and so the function can start processing `gh.i`. This gives us the order shown in the GIF above.

![Stack example](https://github.com/mleming/gates_python_workshop/blob/pyscrape/images/website-stack.png?raw=true "Stack example")

Such a pile is called a __stack__ in computer science, and it is one of the most basic and important data structures. A stack is simply a list of elements for which we are only allowed to:

* Check if the stack is empty,
* Add elements to the top of the stack,
* Remove elements from the top of the stack.

Play around with the example below to see a stack in action.

In [None]:
stack = []
print stack

# This is how to add elements to the stack
stack.append("A")
print "Add A  " + str(stack)

stack.append("B")
print "Add B  " + str(stack)

# This is how to check if the stack is empty:
if len(stack) == 0:
    print "The stack is empty"
else:
    print "The stack is not empty"

# This is how to remove elements from the stack
stack.pop()
print "Pop    " + str(stack)

stack.append("C")
print "Add C  " + str(stack)

# Remove elements from the stack until it is empty:
while len(stack) > 0:
    stack.pop()
    print "Pop    " + str(stack)

# Uncomment to see what happens if you try removing an element from an empty stack
# stack.pop()
# print "Pop    " + str(stack)
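
The walk over the example graph described before the stack diagram can be sketched with a list used as a stack. This is only an illustration on a hard-coded graph: the `GRAPH` dictionary is transcribed from the description of the example in Part 4, `mn.o` is assumed to contain no links, and the function name `dfs_order` is hypothetical. Links are pushed in reverse so that the first link of a page ends up on top of the stack.

```python
# Hard-coded copy of the example graph (mn.o is assumed to have no links).
GRAPH = {
    "ab.c": ["ab.c", "de.f", "gh.i"],
    "de.f": ["jk.l", "mn.o"],
    "gh.i": ["pq.r", "st.u"],
    "jk.l": ["mn.o"],
    "mn.o": [],
    "pq.r": ["pq.r"],
    "st.u": ["mn.o"],
}

def dfs_order(root):
    visited = set([root])
    stack = [root]
    order = []
    while len(stack) > 0:
        # Always process the page on top of the stack.
        page = stack.pop()
        order.append(page)
        # Push in reverse so the first link is processed first.
        for link in reversed(GRAPH[page]):
            if link not in visited:
                visited.add(link)
                stack.append(link)
    return order

print(dfs_order("ab.c"))
```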

We will now write a function `pyscrape_dfs` which uses a stack to perform depth-first search on our graph of websites. For simplicity, we will be using a page limit instead of a depth limit to stop walking down the tree. For testing, we will use a small limit of 10 pages:

In [None]:
PAGE_LIMIT = 10

For `pyscrape_dfs`, we need to define our stack data structure, which initially contains only the root URL. You will also need to create a counter which keeps track of the number of pages that were visited, and increment it every time a URL is popped off the stack. Do not forget about the visited set.

In the cell below, write the `pyscrape_dfs` function. The function will need to contain a loop which runs until either the stack is empty or the counter has exceeded the page limit. In each iteration of the loop, you will need to get the next URL to process, find the URLs on the website and add the URLs that have not yet been visited to the stack.

In [54]:
def pyscrape_dfs(top_url):
    # Create the visited set and initialise a counter for keeping track of how many websites we visited.
    
    # Create a new stack and put the top_url in it.
    
    # Repeat until either there are no more pages to visit or we have exceeded the page limit:
    # 1. Get a new URL to visit from the stack,
    # 2. Print it out and find all the links on the webpage.
    # 3. Add each new link to the stack.
    # Don't forget to update the visited set or the counter.
    

Test the function in the cell below.

In [None]:
pyscrape_dfs("http://gatescambridge.org")

### Task 4.2: Breadth-first search using queues

Instead of walking down one branch of the tree, we could instead choose to visit websites in levels: we first visit our root URL, then visit its children, then children of its children and so on. The GIF below again illustrates it nicely.

![Breadth-first search](https://upload.wikimedia.org/wikipedia/commons/5/5d/Breadth-First-Search-Algorithm.gif "Breadth-first search")

Here we want to process the pages in the order in which we discovered them. Imagine this as a queue in a shop. Ideally, the person who arrived first should be served first. Returning to our simple example of a tree of webpages, the function which uses a queue would walk the tree like this:

At first, only the root URL, `ab.c`, is waiting in the queue. The function removes it from the queue, processes it and places the URLs from this website at the end of the queue. The first URL, `de.f`, is then removed from the queue and all of its URLs are placed into the queue. The queue now holds `gh.i`, `jk.l` and `mn.o`. The URL `gh.i` was enqueued first and so it will be processed before `jk.l` or `mn.o`.

!["Queue of websites"](https://github.com/mleming/gates_python_workshop/blob/pyscrape/images/website-queue.png?raw=true "Queue of websites")

The associated data structure is simply called a __queue__. A queue is again just a list of elements where we restrict the set of allowed operations to:

* Checking if the queue is empty,
* Adding an element to the __end__ of the queue,
* Removing an element from the __front__ of the queue.

Play around with the example below to see queues in action.

In [None]:
queue = []
print queue

# This is how to add elements to the queue.
queue.append("A")
print "Add A  " + str(queue)

queue.append("B")
print "Add B  " + str(queue)

# This is how to check if the queue is empty.
if len(queue) == 0:
    print "The queue is empty"
else:
    print "The queue is not empty"
    
# This is how to remove elements from the queue.
# Note the argument 0: this means we are removing the first element (at index 0).
queue.pop(0)
print "Pop    " + str(queue)

queue.append("C")
print "Add C  " + str(queue)

# Remove elements from the queue until it is empty:
while len(queue) > 0:
    queue.pop(0)
    print "Pop    " + str(queue)

# Uncomment to see what happens if you try removing an element from an empty queue
# queue.pop(0)
# print "Pop    " + str(queue)
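
A plain list works well for the small queues in this workshop, but `pop(0)` has to shift every remaining element one place forward. For longer queues, Python's standard `collections` library offers `deque`, which removes elements from the front efficiently with `popleft`. The sketch below is an optional alternative, not something the exercises require.

```python
from collections import deque

queue = deque()

# Elements are added to the end, exactly as with a list.
queue.append("A")
queue.append("B")
queue.append("C")

# popleft() removes the element at the front of the queue.
first = queue.popleft()
remaining = list(queue)

print(first)
print(remaining)
```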

We will now write a function `pyscrape_bfs` which uses a queue to perform breadth-first search on our graph of websites. We will again be using a page limit. If you previously changed the limit, go back now and set it back to 10, or define the limit again just before you write the function in the following cell.

For `pyscrape_bfs`, we first need to create a queue data structure which initially contains only our root URL. Once again you will need to create a counter which keeps track of the number of pages that were visited. Remember to increment it every time a URL is removed from the queue. Do not forget about the visited set.

In the cell below, write the `pyscrape_bfs` function. It will need to contain a loop which runs until either the queue is empty or the counter has exceeded the page limit. In each iteration of the loop, you will need to get the next URL to process, find the URLs on the website and add the URLs that have not yet been visited to the queue.

In [56]:
PAGE_LIMIT = 10

def pyscrape_bfs(top_url):
    # Create the visited set, a counter and a queue which contains the top_url.
    
    # Repeat until there are no more URLs to process or we reached our page limit:
    # 1. Get the next URL from the queue.
    # 2. Print it and find all the URLs on the webpage.
    # 3. Add each new URL to the queue.
    # Don't forget to update the visited set and the counter.
    

Test the function in the cell below.

In [None]:
pyscrape_bfs("http://gatescambridge.org")

### Task 4.3: Combining the functions

Hopefully you have noticed that `pyscrape_dfs` and `pyscrape_bfs` are doing the same thing. They both:

1. Take the next URL from the given data structure,
2. Visit the website and find all the links,
3. For each of the links check if it was already visited and if not, update the visited set and add the link to the data structure.

The only difference is the order in which the URLs are processed, and this is given by the data structure. We have implemented both stacks and queues as lists in Python. Checking if the data structure is empty and adding elements is done in the same way for both stacks and queues. It is only removing elements that is different: for stacks we use `pop()` and for queues we use `pop(0)`.
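
The difference between the two ways of removing elements can be seen on two copies of the same small list. The variable names below are illustrative only.

```python
items = ["A", "B", "C"]

# Two independent copies of the same list.
stack = list(items)
queue = list(items)

stack_order = []
while len(stack) > 0:
    # pop() removes the last element: the most recently added one.
    stack_order.append(stack.pop())

queue_order = []
while len(queue) > 0:
    # pop(0) removes the first element: the one added earliest.
    queue_order.append(queue.pop(0))

print(stack_order)
print(queue_order)
```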

As always with programming, it is wasteful to have two nearly-identical functions if we can merge them into a single one. We will now create a function `pyscrape_search`, which performs breadth- or depth-first search based on an argument we provide. You may use an optional argument and, for example, do depth-first search by default.

You may find the following way of writing simple condition statements useful when getting the next URL from the list:

In [None]:
condition = False
a = "Condition was True" if condition else "Condition was False"
print a

In the cell below, define the `pyscrape_search` function. Given a URL and an optional argument specifying the search method, it should visit `PAGE_LIMIT` websites using breadth-first search or depth-first search based on the argument. Remember to reset the page limit to 10 while you are testing the function.

In [9]:
PAGE_LIMIT = 10

def pyscrape_search(top_url, breadth = False):
    # Your code here.
    

Try running `pyscrape_search` twice with the same URL but with a different search method. The lists of URLs you get should be different.

In [None]:
print "Breadth-first search:"
pyscrape_search("http://gatescambridge.org/", True)
print # Empty line
print "Depth-first search:"
pyscrape_search("http://gatescambridge.org/")

## That's it!

This is the end of the guided workshop. The next part contains a few exercises for you to practice what you have learnt. Feel free to get in touch if you have any questions or comments.

## Part 5: Exercises for the keen

### Exercise 1: Indentation for search

Currently you cannot see which link led to which website in the search version of the program. Write a function `pyscrape_search_indent` in which printed-out URLs are properly indented.

Hints:
* Python contains a data type called a __tuple__, denoted by `()`. Tuples can hold several values of different types and you can access the first or second value using `t[0]` and `t[1]`.
* A list can hold values of any data type, including tuples. Code is usually easier to follow when every value in a single list has the same type.
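
As a small illustration of these hints, the sketch below stores pairs of URL and depth as tuples in a list and reads the two values back by index. The list name `pending` and the URLs are made up for the example.

```python
# Each tuple holds a URL and the depth at which it was found.
pending = [("http://ab.c", 0), ("http://de.f", 1)]

# Remove the last tuple and read its two values by index.
entry = pending.pop()
url = entry[0]
depth = entry[1]

print('    ' * depth + url)
```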

### Exercise 2: Search up to a given depth

The `pyscrape_search` function only finds the first `PAGE_LIMIT` pages. Modify it to instead search to a given depth, similarly to the `pyscrape_recurse` functions.

Hints:
* You will not need the counter of visited pages.
* Completing Exercise 1 will help with this exercise.

### Exercise 3: Do not visit different parts of the same website

Pyscrape currently visits different parts of the same website. For example, it thinks news.google.com and google.com are different websites. Modify it to only visit one of them but ignore the other. Pay attention to domains which have two parts, for example .co.uk or .ac.uk - you don't want to ignore example.co.uk if you've visited somepage.co.uk before!

Hints:
* The documentation for regular expressions in Python has many examples.
* When working with regular expressions, it is useful to create a few example strings and try your expression on them before running the whole program - it is faster and you have control over the examples you choose.

### Exercise 4: Indent the error messages

Error messages from Pyscrape are not indented like the printed-out websites. Modify the `pyscrape` function to indent them.

Hints:
* You may need to add another argument to `pyscrape`.