# Creating functions in Python (for scraping)

In previous notebooks we covered:

* How to create variables in Python (to store things like URLs for scraping, and the data from pages that we scrape);
* How to loop through lists (in order to scrape or store each item in that list, for example); and
* How to create data frames using `pandas` (to store the scraped data).

Now we're going to bring those together into a final multi-page scraper by creating our own **functions**.

We've used functions already such as `range()` and `len()`. These are **built-in functions** that come with Python. We've also used functions from libraries, like `requests.get()` and `pd.DataFrame()`.

You can create your own function - a **user-defined function** - with the `def` command like so:

In [None]:
def sayhello():
  print("hello")

The `def` command is followed by:

* The name of the function
* Parentheses (which can contain any ingredients that you want to use but in this example don't)
* A colon, and
* Indented lines of code underneath which will run when the function is used

The name of the function is entirely up to you, but try to make it distinctive and meaningful.

We will explain the other parts as we begin to create a function below, but note for now that when you create the function nothing appears to happen.

Of course something *has* happened when you run the code above: a function has been created and can now be used.

## 'Calling' a function

Using a function is referred to as 'calling' it.

This is how you **call** a function:

In [None]:
sayhello()

hello


Basically it's like any other function: you type the name of the function, followed by parentheses containing any ingredients it needs. Even if the function doesn't need any ingredients, you still use (empty) parentheses.

## Creating a function with ingredients

Our example so far didn't have any ingredients, so let's create one that does.


In [None]:
def print_this_word(thisword):
  print(thisword)

This time we've put a word inside the parentheses: `thisword`

In a way, we've created a variable to store whatever ingredient is used when someone calls this function. (This is called a **parameter**.)

That variable is then used in the code below: `print(thisword)`

To see what happens, let's use that function:

In [None]:
print_this_word("pumpkin")

pumpkin


Now let's break down what happens when that line of code is run:

1. First, it **calls** the function `print_this_word`
2. Then it gives it an ingredient: the string "pumpkin". This is called **passing** an **argument**
3. When the function was written, it called that ingredient `thisword`, so the string "pumpkin" is stored in a variable called `thisword`
4. As the function code runs, it accesses that variable inside a `print()` command, so the contents of that variable ("pumpkin" this time) are printed
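The steps above can be tried with a slightly different call. This is a minimal sketch that repeats the `print_this_word` function from earlier and passes the argument in two ways: by position, and by naming the parameter. The word "squash" is just an illustrative value.

```python
def print_this_word(thisword):
  # print whatever argument was passed in
  print(thisword)

# pass the argument by position
print_this_word("pumpkin")
# pass the argument by naming the parameter it should be stored in
print_this_word(thisword="squash")
```

Either way, the string ends up stored in `thisword` while the function runs.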

## Creating a function that **returns** something

Our function above only prints something, but often you want a function to do something (such as perform a calculation, or scrape a page) and then **return** the results.

Here's an example:

In [None]:
#define a new function called 'addtwonumbers' which has 2 parameters (ingredients)
def addtwonumbers(numone, numtwo):
  #the two parameters are added together and stored in 'total'
  total = numone+numtwo
  #the function returns that value
  return(total)

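You can see that this function has *two* ingredients (parameters): `numone` and `numtwo`, separated by a comma. The function adds those two together, and stores the result in a new variable called `total`. Finally, the `return` line passes the contents of that variable back to whatever called the function. (`return` is a keyword rather than a function, so the parentheses around `total` are optional.)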

Here's that function being used:

In [None]:
whatisit = addtwonumbers(3,8)
print(whatisit)

11


You can see that the first line runs the `addtwonumbers()` function and **passes** it two ingredients: two numbers, 3 and 8.

The function runs, adds those two numbers together, and *returns* the result to the variable that it was being used to create: `whatisit`.

We then print that.

Returning information from a function can be incredibly powerful: in the example above it was just a number that was returned, but you can return lists, dictionaries, multiple items, and, among other things, data frames - which is what happens next...
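As a small illustration of returning more than one item, here is a sketch of a function that returns two values at once. The function name `describe_numbers` and the variable names are made up for this example.

```python
def describe_numbers(numone, numtwo):
  # calculate two results from the same pair of numbers
  total = numone + numtwo
  difference = numone - numtwo
  # return both results together (Python packs them into a tuple)
  return total, difference

# unpack the two returned values into two variables
thetotal, thedifference = describe_numbers(8, 3)
print(thetotal)
print(thedifference)
```

The same pattern works for any kind of object, including lists and dataframes.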

## Why you might need a function for a scraper

It's all very nice adding numbers but what's that got to do with scraping?

Well, functions are useful if you need to run code more than once.

If your scraper needs to run on more than one page then that's what you're going to have to do: run the same code for every page you scrape.

In that situation you might want to do something like the following:

* Loop through a list of URLs
* 'Pass' the URL to a scraper function, which runs all the scraping code
* The function then 'returns' some results of that (for example, a dataframe of data)
* You store that result in a variable, and perhaps add it to a larger dataframe which grows as the loop goes through each URL in the list
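The steps above can be sketched without fetching any pages. In this sketch `pretend_scraper` is a made-up stand-in for a real scraper function, the `example.com` URLs are placeholders, and a plain list is used in place of a dataframe to keep the pattern easy to see.

```python
def pretend_scraper(theurl):
  # a stand-in for a real scraper: no page is fetched here,
  # it just returns one made-up 'result' for the URL
  return ["result from " + theurl]

# a list of placeholder URLs
urls = ["https://example.com/page1", "https://example.com/page2"]

# an empty list to collect everything
allresults = []

for url in urls:
  # pass the URL to the function and store what it returns
  theseresults = pretend_scraper(url)
  # add those results to the growing collection
  allresults.extend(theseresults)

print(allresults)
```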

## Creating a scraper function

The good news is that if you've written code that works scraping one page, it's relatively straightforward to put that into a function.

Here's the code that we used in the previous notebook to scrape [employment tribunal decisions](https://www.gov.uk/employment-tribunal-decisions).

We will put the importing of libraries in a separate code block because it's not a good idea to import libraries inside a function - they should only be imported once.

In [1]:
#importing requests to fetch URLs
import requests
#importing beautiful soup's library for drilling into webpages
from bs4 import BeautifulSoup
#and pandas for storing data
import pandas as pd

In [None]:
#fetch URL
page = requests.get("https://www.gov.uk/employment-tribunal-decisions")

#command beautiful soup to parse the page
soup = BeautifulSoup(page.content,'html.parser')

#grab all the <div> tags with class="gem-c-document-list__item-title"
divswewant = soup.select('div[class="gem-c-document-list__item-title"]')
#this grabs the <time> tags
times = soup.select('time')

#create an empty list
casetitles = []

#loop through the divswewant list
for i in divswewant:
  #extract the text
  casename = i.get_text()
  #add the text to the previously empty list
  casetitles.append(casename)

#create an empty list
datelist = []

#loop through the times list
for i in times:
  #extract the text
  timetext = i.get_text()
  #add the text to the previously empty list
  datelist.append(timetext)

#create a new dataframe which uses those two lists as its two columns
casedataframe = pd.DataFrame({"case name" : casetitles,
                              "date" : datelist})

If we want to run that scraper on other pages, all the code would be exactly the same apart from the URL.

That URL, then, should become the only ingredient that the function needs to be given.

Here's that code as a function, then.

In [2]:
#define a function - it takes one ingredient and calls it 'theurl'
def scrapepage(theurl):
  #fetch URL from that 'theurl'
  page = requests.get(theurl)
  #command beautiful soup to parse the page
  soup = BeautifulSoup(page.content,'html.parser')
  #grab all the <div> tags with class="gem-c-document-list__item-title"
  divswewant = soup.select('div[class="gem-c-document-list__item-title"]')
  #this grabs the <time> tags
  times = soup.select('time')
  #create an empty list
  casetitles = []
  #loop through the divswewant list
  for i in divswewant:
    #extract the text
    casename = i.get_text()
    #add the text to the previously empty list
    casetitles.append(casename)
  #create an empty list
  datelist = []
  #loop through the times list
  for i in times:
    #extract the text
    timetext = i.get_text()
    #add the text to the previously empty list
    datelist.append(timetext)
  #create a new dataframe which uses those two lists as its two columns
  casedataframe = pd.DataFrame({"case name" : casetitles,
                                "date" : datelist})
  #return that dataframe to whatever called the function
  return(casedataframe)



To create this function we've essentially taken all the important code from before and indented it under the line `def scrapepage(theurl):`

That first line transforms our previous code into something **reusable** by doing two things: giving a name to it (`scrapepage`); and giving a name to the URL we want to scrape (`theurl`).

There's one other extra line too, right at the end: `return(casedataframe)` ensures that the results of the scraper are passed back to whatever calls this function.

Something else to highlight: the function contains some `for` loops as well, which means there are two levels of indents in the code: all the code inside the function is indented, and then the `for` loop code inside *that* is indented one more time.

## Testing the scraper function

We can test the function on the same page to see if it works.

In [3]:
#run the function on the page
#the function 'returns' a dataframe, so we need to store that in a variable
storetheresults = scrapepage("https://www.gov.uk/employment-tribunal-decisions")
#print the variable containing the results
print(storetheresults)

                                            case name              date
0    Mr A Male v Airbus Operations Ltd: 1601482/20...     9 August 2023
1    Mrs E Petrelli v Tapas Twist Ltd: 1600374/2023\n     9 August 2023
2    Mr J Nash v Purple Dog Company.com Ltd: 16001...    11 August 2023
3    Mrs A Abhyankar v Cardiff and Vale University...    14 August 2023
4    Mr W Moyo v Co-Operative Group Ltd: 1311317/2...    11 August 2023
5    G Tinubu and others v Warwick Independent Sch...    17 August 2023
6    Mr D Mpehla v Central Heating Hub Ltd (in Vol...    16 August 2023
7    Mr R Brown v Royal Mail Group Ltd: 1302178/20...    15 August 2023
8    Mr Omole v Bannatyne Fitness Ltd: 1303948/2023\n    15 August 2023
9     Mr C Coleman v Tesco Stores Ltd: 1304434/2021\n    15 August 2023
10    Mr I Girling v DHL Services Ltd: 1303069/2023\n    15 August 2023
11   Mr A Guice v 24/7 Plumbing & Gas (UK) Ltd: 13...     13 March 2023
12   Mr E Allen and  Mr J Dunphy v Birmingham City...    15 Augu

## Calling the scraper function on multiple pages

Now let's call that function on a bunch of pages.

We now need a list of URLs. Sometimes that list is one we've already compiled - but in this situation we are going to have to generate it ourselves.

A good tip here is to move to the second page that you want to scrape, and look at the URL. In this case, it looks like this:

`https://www.gov.uk/employment-tribunal-decisions?page=2`

We can see the page number is at the end of the URL. If we replace that `2` with `1` we are taken back to the first page of results, and so on.

What we need, then, is a list of numbers to add to the end of that URL which otherwise stays the same.

### Generating a list of page numbers

We can generate a list of numbers using Python's `range()` function. Here we give it two ingredients - the start number and a number it will end before.

It won't include that end number so, for example `range(1,10)` will go from 1 to 9, stopping short of 10.
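A quick way to check this is to turn the range into a list and look at it. This is a small sketch; the variable name `pagenumbers` is just for illustration.

```python
# convert the range to a list so we can see every number in it
pagenumbers = list(range(1, 10))
print(pagenumbers)
# count how many numbers there are
print(len(pagenumbers))
```

The list holds nine numbers, 1 to 9, and does not include 10.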

We know that our start number is `1`, but where should we end?

In this case, the page actually tells us what the last page is: the 'Next page' link goes to '2 of 2075', so there are 2,075 pages.

What about when that isn't the case? Sometimes websites give you the option to go to the last page of results, allowing you to see (in the URL for example) what the last page number is.

If you can't do that, but the page tells you how many results there are, and you know how many results are on each page, you can calculate how many pages there should be by dividing one by the other.

For example, if there are 2,015 results and 50 results per page, the calculation would be `2015/50` - which is 40.3: 40 pages of 50 results, plus a 41st page containing the last few. In other words, pages 1 to 41.
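That calculation can be sketched in code using the `2015/50` figures from the example. This sketch assumes Python's built-in `math` library and its `ceil()` function, which rounds up to the next whole number; the variable names are made up for illustration.

```python
import math

# figures from the worked example
totalresults = 2015
resultsperpage = 50

# divide one by the other, then round up so the partly-filled last page is counted
lastpage = math.ceil(totalresults / resultsperpage)
print(lastpage)

# range() stops short of its end number, so add 1 to include the last page
pagenumbers = range(1, lastpage + 1)
print(len(pagenumbers))
```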

### Adding page numbers to the URL

The code to generate a list of page numbers from 1 to 2,075, then (remembering that it will stop short of the end number), would be:

`range(1,2076)`

Here's a first attempt at looping through those numbers and adding them to the URL `"https://www.gov.uk/employment-tribunal-decisions?page="` - it will create an error, which is important.

In [10]:
#loop through the numbers 1 to 2075
for i in range(1,2076):
  #add it to the end of a URL, and print it
  print("https://www.gov.uk/employment-tribunal-decisions?page="+i)

TypeError: ignored

### Turning numbers into strings

The error we get here is:

`TypeError: can only concatenate str (not "int") to str`

This is trying to tell us that you can't combine (concatenate) integers and strings. You can only combine strings with strings.

So the problem is that we are trying to add a number to a string.

To fix that, we simply need to convert the number to a string first - and there's a basic function to do that: `str()`

In our code, then, we just need to tweak the line so that it adds the basic URL to `str(i)` - that is, our number (`i`) converted to a string.

In [None]:
#loop through the numbers 1 to 2075
for i in range(1,2076):
  #add it to the end of a URL, and print it
  print("https://www.gov.uk/employment-tribunal-decisions?page="+str(i))
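An alternative to `str()` is an f-string, which converts the number to text for you when it is placed inside curly brackets. This is only a sketch of another option, not a change to the approach used in this notebook; the variable names `baseurl` and `pageurls` are made up, and the loop covers just three pages to keep it short.

```python
# the part of the URL that stays the same
baseurl = "https://www.gov.uk/employment-tribunal-decisions?page="

# an empty list to store the full URLs
pageurls = []

# loop through the numbers 1 to 3
for i in range(1, 4):
  # the f-string inserts the number after the base URL
  pageurls.append(f"{baseurl}{i}")

print(pageurls)
```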

### Putting the function into the loop

Now that we've solved that problem, we can start to apply the function to each URL as it loops through them.

As there are so many URLs, it's a good idea not to do them all at first. While testing we might just loop through the first couple of pages.

In [12]:
#loop through the numbers 1 to 2
for i in range(1,3):
  #add it to the end of a URL,
  fullurl = "https://www.gov.uk/employment-tribunal-decisions?page="+str(i)
  #and print it
  print(fullurl)
  #run the scraper function, and store what's returned
  theseresults = scrapepage(fullurl)
  #print what was returned
  print(theseresults)

https://www.gov.uk/employment-tribunal-decisions?page=1
                                            case name              date
0    Mr A Male v Airbus Operations Ltd: 1601482/20...     9 August 2023
1    Mrs E Petrelli v Tapas Twist Ltd: 1600374/2023\n     9 August 2023
2    Mr J Nash v Purple Dog Company.com Ltd: 16001...    11 August 2023
3    Mrs A Abhyankar v Cardiff and Vale University...    14 August 2023
4    Mr W Moyo v Co-Operative Group Ltd: 1311317/2...    11 August 2023
5    G Tinubu and others v Warwick Independent Sch...    17 August 2023
6    Mr D Mpehla v Central Heating Hub Ltd (in Vol...    16 August 2023
7    Mr R Brown v Royal Mail Group Ltd: 1302178/20...    15 August 2023
8    Mr Omole v Bannatyne Fitness Ltd: 1303948/2023\n    15 August 2023
9     Mr C Coleman v Tesco Stores Ltd: 1304434/2021\n    15 August 2023
10    Mr I Girling v DHL Services Ltd: 1303069/2023\n    15 August 2023
11   Mr A Guice v 24/7 Plumbing & Gas (UK) Ltd: 13...     13 March 2023
12   Mr 

### Storing multiple results in a single dataframe

That seems to work on two pages - but each time the loop runs the results of the previous iteration are overwritten.

We need a way to save the results as we go along - much as we did with `.append()` when we added items to a list.

The equivalent with a dataframe is the function `concat()`. This is a `pandas` function, so it's prefixed with `pandas` or `pd`, depending on whether the library was given a shorter name when it was imported. We imported it as `pd`, so you would write: `pd.concat()`

The `concat()` function can take a number of ingredients, but in this situation we only need to give it one thing: a list of the dataframes it is going to combine.

In this case we want to create an empty dataframe before the loop begins, and then each time the loop runs, combine that dataframe with the dataframe that gets returned by the function as it scrapes each URL, and overwrite the empty dataframe with the results.

This can be a little difficult to get your head around, so let's walk through what happens here.

* First we have an empty dataframe - let's call it `fillme`
* The first time the loop runs, we grab the dataframe for page 1, combine it with `fillme`, and then *overwrite* `fillme` with the results
* The second time the loop runs, we grab the dataframe for page 2, combine it with `fillme` (which now holds page 1), and overwrite `fillme` with the results
* The third time the loop runs, we grab the dataframe for page 3, combine it with `fillme` (which now holds page 1 *and* page 2), and overwrite `fillme` with the results

...and so on.

Here's that incorporated into our code.


In [14]:
#create an empty dataframe
fillme = pd.DataFrame()

#loop through the numbers 1 to 2
for i in range(1,3):
  #add it to the end of a URL,
  fullurl = "https://www.gov.uk/employment-tribunal-decisions?page="+str(i)
  #and print it
  print(fullurl)
  #run the scraper function, and store what's returned
  theseresults = scrapepage(fullurl)
  #print what was returned
  #print(theseresults)
  #combine fillme with the new results, and overwrite fillme with the combination
  fillme = pd.concat([fillme,theseresults])

fillme

https://www.gov.uk/employment-tribunal-decisions?page=1
https://www.gov.uk/employment-tribunal-decisions?page=2


Unnamed: 0,case name,date
0,Mr A Male v Airbus Operations Ltd: 1601482/20...,9 August 2023
1,Mrs E Petrelli v Tapas Twist Ltd: 1600374/2023\n,9 August 2023
2,Mr J Nash v Purple Dog Company.com Ltd: 16001...,11 August 2023
3,Mrs A Abhyankar v Cardiff and Vale University...,14 August 2023
4,Mr W Moyo v Co-Operative Group Ltd: 1311317/2...,11 August 2023
...,...,...
45,Ms N Cross-Padden v Kitrinos Healthcare (Char...,4 August 2023
46,Mr R Cardiff and others v London Stock Photog...,25 July 2023
47,Y Savchuk and others v R.M.I. Property Mainte...,2 August 2023
48,Mr O Mohammed v Hendrie Legal Ltd: 4101383/20...,8 August 2023


### The code explained

Creating an empty dataframe is done with this line of code:

`fillme = pd.DataFrame()`

Here we use the `pandas` function `DataFrame()` but we leave the parentheses empty (no ingredients) which means it will create an empty dataframe.

That empty dataframe is stored in a variable called `fillme`.

Inside the loop, this line calls the function and stores the results in another dataframe, `theseresults`:

`theseresults = scrapepage(fullurl)`

So at this point we have two dataframes:

* `fillme`, which we created before the loop started; and
* `theseresults`, which will only exist momentarily until the loop runs again.

The final line before the loop runs again, then, is this:

`fillme = pd.concat([fillme,theseresults])`

Start to the right of the equals sign. Here we can see the `concat()` function being used - and the ingredient it is given is a list:

`[fillme,theseresults]`

That's a list of the two dataframes we want to concatenate, or combine.

Once combined, the resulting dataframe is assigned to the variable before the equals sign, `fillme`.

So, what was previously empty is overwritten with those results.

When the loop is run on all the pages it will overwrite that dataframe once for every page, each time combining the data from all the previous pages with the latest page's data, until it finishes looping.
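A common variation on this pattern, shown here only as a sketch, is to collect each page's dataframe in a list and call `pd.concat()` once after the loop. The two small dataframes below are made-up stand-ins for scraped pages, and the `ignore_index=True` ingredient asks `pandas` to renumber the rows from 0 rather than repeating each page's own row numbers.

```python
import pandas as pd

# two small made-up dataframes standing in for two scraped pages
pageone = pd.DataFrame({"case name": ["Case A", "Case B"],
                        "date": ["1 August 2023", "2 August 2023"]})
pagetwo = pd.DataFrame({"case name": ["Case C"],
                        "date": ["3 August 2023"]})

# an empty list to collect the dataframes
frames = []

# loop through the 'pages' and add each dataframe to the list
for df in [pageone, pagetwo]:
  frames.append(df)

# combine them all in one go, renumbering the rows from 0
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With thousands of pages this avoids rebuilding the growing dataframe on every loop, though the version above is easier to follow while learning.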


## Export the results as a CSV

Once you have data you want to look at in a spreadsheet, you can export the dataframe as a CSV using the `.to_csv()` method.

This needs to be added to the name of the dataframe variable, with the name of the CSV in parentheses (don't forget to put it in quotation marks or inverted commas).

That file should then be available in the Files area on the left in Colab (the folder icon).

In [17]:
#export results as a CSV
fillme.to_csv('tribunaldata.csv')
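By default `.to_csv()` also writes the dataframe's row numbers (its index) as an extra first column. If you don't want that column, you can add the ingredient `index=False`. The sketch below uses a made-up one-row dataframe and leaves out the filename, which makes `.to_csv()` return the CSV as text instead of saving a file, so the difference is easy to see.

```python
import pandas as pd

# a small made-up dataframe standing in for the scraped results
exampledf = pd.DataFrame({"case name": ["Case A"],
                          "date": ["1 August 2023"]})

# with no filename, to_csv() returns the CSV as a string
withindex = exampledf.to_csv()
noindex = exampledf.to_csv(index=False)

# compare the two versions
print(withindex)
print(noindex)
```

To save a file without the index column you would write something like `fillme.to_csv('tribunaldata.csv', index=False)`.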

## Applying this to 'detail' links

The principles outlined above could be used to grab information from 'detail' links that provide more information on each case.

In that scenario, you would:

* Scrape the links to each tribunal detail page and store in a list (or column in a dataframe)
* Write a function to scrape the detail page
* Loop through the list/column of detail links, and run the function on each, returning data which can be stored in a larger dataframe
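One practical detail when following those steps: links scraped from a page are often *relative* (they start with `/` and leave out the domain), so they need to be turned into full URLs before they can be fetched. The sketch below assumes that is the case here, which this notebook has not checked, and the two links are made up for illustration. It uses `urljoin()` from Python's built-in `urllib.parse` library.

```python
from urllib.parse import urljoin

# the page the links were scraped from
baseurl = "https://www.gov.uk/employment-tribunal-decisions"

# made-up relative links of the kind a results page might contain
relativelinks = ["/employment-tribunal-decisions/example-case-one",
                 "/employment-tribunal-decisions/example-case-two"]

# an empty list to store the full URLs
detailurls = []

# loop through the relative links
for link in relativelinks:
  # combine the base URL and the relative link into a full URL
  detailurls.append(urljoin(baseurl, link))

print(detailurls)
```

Each full URL could then be passed to a function written to scrape a detail page.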

## Improvement 1: adding a delay (throttling)

We can change the scraper so that it pauses between each page. To do this we need the `time` library.

In [15]:
#Import the time library to use its sleep() function
import time

We can then use the `sleep()` function from that library, which [stops the code running for a specified number of seconds](https://www.programiz.com/python-programming/time/sleep). So to pause for three seconds it might be written like so:

`time.sleep(3)`

That can be inserted into the loop that calls the scraping function (or, if the scraping function scrapes more than one page, you can insert it there to pause between each page):

In [16]:
#create an empty dataframe
fillme = pd.DataFrame()

#loop through the numbers 1 to 2,075 (range stops short of 2076)
for i in range(1,2076):
  #add it to the end of a URL,
  fullurl = "https://www.gov.uk/employment-tribunal-decisions?page="+str(i)
  #and print it
  print(fullurl)
  #run the scraper function, and store what's returned
  theseresults = scrapepage(fullurl)
  #print what was returned
  #print(theseresults)
  #combine fillme with the new results, and overwrite fillme with the combination
  fillme = pd.concat([fillme,theseresults])
  print("waiting 3 seconds before next scrape")
  #Sleep for 3 seconds before looping again
  time.sleep(3)

https://www.gov.uk/employment-tribunal-decisions?page=1
waiting 3 seconds before next scrape
https://www.gov.uk/employment-tribunal-decisions?page=2
waiting 3 seconds before next scrape
https://www.gov.uk/employment-tribunal-decisions?page=3
waiting 3 seconds before next scrape
https://www.gov.uk/employment-tribunal-decisions?page=4
waiting 3 seconds before next scrape
https://www.gov.uk/employment-tribunal-decisions?page=5
waiting 3 seconds before next scrape
https://www.gov.uk/employment-tribunal-decisions?page=6
waiting 3 seconds before next scrape
https://www.gov.uk/employment-tribunal-decisions?page=7
waiting 3 seconds before next scrape
https://www.gov.uk/employment-tribunal-decisions?page=8
waiting 3 seconds before next scrape
https://www.gov.uk/employment-tribunal-decisions?page=9
waiting 3 seconds before next scrape
https://www.gov.uk/employment-tribunal-decisions?page=10
waiting 3 seconds before next scrape
https://www.gov.uk/employment-tribunal-decisions?page=11
waiting 3 se

KeyboardInterrupt: ignored
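Because the full loop takes a long time, it can help to check the pause on a small scale first. This sketch uses a made-up stand-in function, `pretend_scraper`, so no pages are fetched, and a much shorter delay of 0.1 seconds purely to keep the test quick. It times the loop with `time.perf_counter()` to confirm that the pauses are happening.

```python
import time

def pretend_scraper(theurl):
  # a stand-in for the real scraper: it returns the URL without fetching it
  return theurl

# an empty list to record which pages were 'scraped'
pagesdone = []

# note the time before the loop starts
starttime = time.perf_counter()

# loop through the numbers 1 to 3
for i in range(1, 4):
  fullurl = "https://www.gov.uk/employment-tribunal-decisions?page=" + str(i)
  pagesdone.append(pretend_scraper(fullurl))
  # pause briefly before looping again (3 seconds in the real scraper)
  time.sleep(0.1)

# work out how long the loop took
elapsed = time.perf_counter() - starttime
print(len(pagesdone), "pages in", round(elapsed, 2), "seconds")
```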