# The 8-line scraper

This is the simplest scraper: eight lines, six of which can stay the same every time.

*Note: open the table of contents on the left to make this notebook easier to navigate.*

In [None]:
#import the 3 libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

#fetch the HTML from the URL
response = requests.get("https://theferret.scot/articles/")
#convert that to a BeautifulSoup 'object' so it can be drilled into further
soup = BeautifulSoup(response.content, 'html.parser')

#drill down into HTML tags described by the 'selector' in quotation marks, and store matches
alistofmatches = soup.select("h2 a")

#create a data frame where those matches fill one column, which is named in quotation marks
df = pd.DataFrame( { "column 1" : alistofmatches } )
#export that as a CSV file
df.to_csv("scrapeddata.csv", index=False)

In [None]:
#show the scraped data frame
print(df)

                                            column 1
0  [Sheku Bayoh: The inquiry podcast, ‘An unholy ...
1  [Mark Fortune faces 89 criminal charges includ...
2                         [Yemen’s war: photo essay]
3  [Police watchdog and Crown failed to assess ra...
4  [Tory politician urged to return £10k donation...
5  [‘Huge red flag’: Alba politician to speak at ...
6  [Salmon firm accused of misleading locals over...
7  [Claim Labour and Conservatives formed allianc...
8  [Scottish Government agency still giving grant...


## Expanding the scraper: use a loop to clean or parse your list into multiple lists

Those eight lines succeed in scraping the targeted tags along with the text inside them.

If you want to extract the text or the tags, or both, you can do that in a spreadsheet using [Text to Columns](https://www.youtube.com/watch?v=QyZ6IMkln2U).

But you can also do it in the scraper, by looping through the scraped list of tags-and-text, extracting just the text, and adding it to an empty list you created before the loop.

*Note: because you are now storing your data in a different list you will need to update the line where this is used to create a data frame.*

Here are a few more lines that do that ([tips on variations on this approach here](https://stackoverflow.com/questions/16121001/suggestions-on-get-text-in-beautifulsoup)):

In [None]:
#create an empty list to store the data we are about to extract
justtext = []

#loop through the fetched list of matching tags and text
for i in alistofmatches:
  #extract the text from each item and append it to the previously empty list
  justtext.append(i.get_text())

#change the code which creates the data frame, so that it uses the new list
df = pd.DataFrame( { "column 1" : justtext } )

#show the results
print(justtext)

['Far right ‘terrorist’ and ‘racist’ posts broke rules, election watchdog staff claimed', 'Edinburgh University denies surveillance claims by student protestors', 'A place to heal: saving lives in Toronto’s toxic drug crisis', 'Alarm over nuclear safety lapses on the Clyde', '‘A foundation to build on’: Can Scotland’s first safer injection site tackle the drugs crisis?', 'What has The Ferret been up to in April?', 'Podcast teaser: What impact can a safer drug consumption facility have?', 'Revealed: UK government gave oil licences to IDF-linked firm', 'Scottish Government climate partner challenged over dirty investments']
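
One common variation on `.get_text()` is to add `strip=True`, which removes whitespace from the start and end of the text. The sketch below uses a made-up snippet of HTML (an assumption for illustration, not taken from the page scraped above) so that it can run on its own:

```python
from bs4 import BeautifulSoup

#hypothetical HTML with whitespace around the link text
html = "<h2><a href='#'>\n  A headline  \n</a></h2>"
soup = BeautifulSoup(html, "html.parser")

#create an empty list to store the cleaned text
cleantext = []

#loop through the matching tags
for i in soup.select("h2 a"):
  #strip=True removes whitespace from the start and end of the text
  cleantext.append(i.get_text(strip=True))

print(cleantext)
```

Whether this is needed depends on how the page's HTML is written: the headlines above came out clean without it.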


### Extracting just the tag attributes (e.g. links)

The process for extracting an attribute from the tags is similar, but instead of `.get_text()` we add square brackets after the tag-and-text item, containing the name of the attribute in quotation marks. For example, `i['href']` will get the value of the `href=` attribute inside the tag, and `i['class']` will get the value of the `class=` attribute (as a list, because a tag can have more than one class).

Here's the code:

In [None]:
#create an empty list to store the data we are about to extract
justhref = []

#loop through the fetched list of matching tags and text
for i in alistofmatches:
  #extract the href= value from each item and append it to the previously empty list
  justhref.append(i['href'])

#change the code which creates the data frame, so that it uses the new list
df = pd.DataFrame( { "column 1" : justhref } )

print(justhref)

['https://theferret.scot/sheku-bayoh-the-inquiry-podcast-unholy-trinity/', 'https://theferret.scot/mark-fortune-criminal-charges-threatening-tenants/', 'https://theferret.scot/yemens-war-photo-essay/', 'https://theferret.scot/police-watchdog-crown-racist-sheku-bayoh-death/', 'https://theferret.scot/tory-mp-david-mundell-10k-donation-animal-cruelty/', 'https://theferret.scot/alba-neale-hanvey-event-conspiracy-theorists/', 'https://theferret.scot/salmon-fish-farm-scottish-highlands-misleading/', 'https://theferret.scot/labour-conservatives-alliances-councils-half-true/', 'https://theferret.scot/scottish-government-grants-arms-firms-israel/']
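
If some of the matching tags might not have the attribute at all, square brackets will stop the scraper with a `KeyError`. A possible alternative is `.get()`, which returns `None` instead. This is a minimal sketch using made-up HTML (the `example.com` link and the list name `safehrefs` are assumptions for illustration):

```python
from bs4 import BeautifulSoup

#hypothetical HTML: the second <a> tag has no href= attribute
html = '<h2><a href="https://example.com/one/">One</a></h2><h2><a>Two</a></h2>'
soup = BeautifulSoup(html, "html.parser")

#create an empty list to store the href= values
safehrefs = []

#loop through the matching tags
for i in soup.select("h2 a"):
  #.get() returns None when the attribute is missing, where i['href'] would raise a KeyError
  safehrefs.append(i.get("href"))

print(safehrefs)
```

The resulting list stays the same length as the list of matches, which matters when the lists are later combined into a data frame.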


## Saving multiple lists into a data frame

As soon as we start grabbing more than one list of data to store, we need to change the line of code which stores those in a data frame.

Extra columns can be added by putting a comma after the name of the list variable used for the first column, then naming the next column (in quotation marks), and after a colon naming the list variable to be used to populate that column.

So, where before the code specified the column name and contents as

`{ "column 1" : justtext }`

now it will read:

`{ "column 1" : justtext, "column 2" : justhref }`

To make this easier to read, we can split the two over separate lines, and it will still work:

```
{ "column 1" : justtext,
"column 2" : justhref }
```

Here's the code:

In [None]:
#create a data frame where those matches fill two columns,
#each of which is named in quotation marks
df = pd.DataFrame( { "column 1" : justtext,
                    "column 2" : justhref} )

#show it - not using print() makes it easier to read
df

Unnamed: 0,column 1,column 2
0,"Sheku Bayoh: The inquiry podcast, ‘An unholy t...",https://theferret.scot/sheku-bayoh-the-inquiry...
1,Mark Fortune faces 89 criminal charges includi...,https://theferret.scot/mark-fortune-criminal-c...
2,Yemen’s war: photo essay,https://theferret.scot/yemens-war-photo-essay/
3,Police watchdog and Crown failed to assess rac...,https://theferret.scot/police-watchdog-crown-r...
4,Tory politician urged to return £10k donation ...,https://theferret.scot/tory-mp-david-mundell-1...
5,‘Huge red flag’: Alba politician to speak at e...,https://theferret.scot/alba-neale-hanvey-event...
6,Salmon firm accused of misleading locals over ...,https://theferret.scot/salmon-fish-farm-scotti...
7,Claim Labour and Conservatives formed alliance...,https://theferret.scot/labour-conservatives-al...
8,Scottish Government agency still giving grants...,https://theferret.scot/scottish-government-gra...


This is a good moment to point out that `"column 1"` is just an intentionally non-specific name I've chosen for the first column, and you can make that more meaningful if you want.

Here's that code updated to be more specific:

In [None]:
#create a data frame where those matches fill two columns,
#each of which is named in quotation marks
df = pd.DataFrame( { "text" : justtext,
                    "link" : justhref} )

#show it - not using print() makes it easier to read
df

Unnamed: 0,text,link
0,"Sheku Bayoh: The inquiry podcast, ‘An unholy t...",https://theferret.scot/sheku-bayoh-the-inquiry...
1,Mark Fortune faces 89 criminal charges includi...,https://theferret.scot/mark-fortune-criminal-c...
2,Yemen’s war: photo essay,https://theferret.scot/yemens-war-photo-essay/
3,Police watchdog and Crown failed to assess rac...,https://theferret.scot/police-watchdog-crown-r...
4,Tory politician urged to return £10k donation ...,https://theferret.scot/tory-mp-david-mundell-1...
5,‘Huge red flag’: Alba politician to speak at e...,https://theferret.scot/alba-neale-hanvey-event...
6,Salmon firm accused of misleading locals over ...,https://theferret.scot/salmon-fish-farm-scotti...
7,Claim Labour and Conservatives formed alliance...,https://theferret.scot/labour-conservatives-al...
8,Scottish Government agency still giving grants...,https://theferret.scot/scottish-government-gra...


## Expanding the scraper: fetching more than one HTML tag

To fetch another HTML tag you just need to copy the line which used `soup.select()` to repeat the process of selecting a specified tag.

In this copied line you will need to change the name of the variable being created to store matches (otherwise it will overwrite your existing variable), and change the selector being used.

You will end up with two lines that look similar apart from those two key differences, like this:

In [None]:
#select all HTML tags matching the selector and save in one variable
alistofmatches = soup.select("h2 a")
#select all HTML tags matching another selector, and save in another variable
alistofcategories = soup.select('ul[class="post-categories"]')

#show the results of the new line
print(alistofcategories)

[<ul class="post-categories">
<li><a href="https://theferret.scot/podcast/" rel="category tag">Podcast</a></li>
<li><a href="https://theferret.scot/society/" rel="category tag">Society</a></li></ul>, <ul class="post-categories">
<li><a href="https://theferret.scot/crime-and-justice/" rel="category tag">Crime and justice</a></li>
<li><a href="https://theferret.scot/society/housing/" rel="category tag">Housing</a></li></ul>, <ul class="post-categories">
<li><a href="https://theferret.scot/international/" rel="category tag">International</a></li>
<li><a href="https://theferret.scot/human-rights/" rel="category tag">Human rights</a></li></ul>, <ul class="post-categories">
<li><a href="https://theferret.scot/society/" rel="category tag">Society</a></li></ul>, <ul class="post-categories">
<li><a href="https://theferret.scot/environment/animal-welfare/" rel="category tag">Animal welfare</a></li>
<li><a href="https://theferret.scot/politics/" rel="category tag">Politics</a></li></ul>, <ul clas

### Checking the length of the resulting lists

Note: these two lists will need to be the same length before you create a data frame, otherwise you will get an error.

You can check the length of your lists with the `len()` function like this:

In [None]:
#show the length of the list
print(len(alistofmatches))
#show the length of the other list
print(len(alistofcategories))

9
9


As above, you now just need to update the code that creates the data frame so that it uses all the lists you want to store, giving each its own column name.

If your lists are not the same length you will get an error, and will need to either refine your selector to be more accurate, or use [slicing](https://www.geeksforgeeks.org/python-list-slicing/) to remove irrelevant items from the list.
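
As a rough sketch of the slicing option, assume a scrape where the second list picked up one irrelevant item at the start (both lists here are invented for illustration):

```python
#hypothetical lists: the second has an extra, irrelevant first item
headlines = ["Headline A", "Headline B", "Headline C"]
categories = ["Site navigation", "Politics", "Society", "Housing"]

#slice from index 1 onwards to drop the first item
categories = categories[1:]

#check the lists now match before creating a data frame
print(len(headlines), len(categories))
```

Slicing only helps when the unwanted items sit at the start or end of the list. If they are scattered through it, refining the selector is likely to be the better route.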

## Expanding the scraper to multiple pages

Once you’ve fetched all the data you want from a single page, you may want to extend your scraper to other pages. There are two common scenarios here:

1. Extending it to linked pages which provide more results with similar data (typically ‘next page’ links)
2. Extending it to linked pages which provide more detailed data on the results on your initial page

You may also want to combine the two: extending it to ‘next page’ links, and then using the resulting data to extend it to ‘more detail’ pages. But let’s take it a stage at a time.

## Expanding the scraper by fetching sequential pages

In all these scenarios, the first thing you need to compile is a list of links to all the pages you want to scrape. That list will sometimes need to be scraped from the first page (especially if they are links to detail pages), but can also sometimes be generated based on a logic that all the links follow.

For example, if your page is just the first in a series, chances are that when you navigate to page two of those results, the URL will include something like `page=2`.

This is the case for the articles listed at [theferret.scot/articles](https://theferret.scot/articles): if you click on the link to the next page of articles it becomes `https://theferret.scot/articles/page/2/` (and changing it to `page/1` takes you back to that first page).

If your targeted pages follow that pattern then you can generate a list of those URLs.

And if the site specifies how many pages of results there are, or the total results (which allows you to calculate the number of pages based on results per page), or allows you to navigate to the last page, then you know how long that list needs to be (what the last page number is).
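
When a site only gives the total number of results, the last page number can be worked out by dividing and rounding up. The figures below are invented for illustration, not taken from The Ferret:

```python
import math

#hypothetical figures: a site reporting 250 results, shown 9 per page
totalresults = 250
resultsperpage = 9

#divide and round UP, because a partly filled last page still counts as a page
lastpage = math.ceil(totalresults / resultsperpage)

print(lastpage)
```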

A useful function here is `range()` - this generates a range of numbers that has a start and end based on two numbers you supply. Slightly annoyingly, the range will end *before* the second number you specify, so, to generate a range of numbers from 1 to 100 you would use `range(1,101)`.
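
A quick way to check what a range contains is to convert it to a list and print it (the variable name `pagenumbers` is just for illustration):

```python
#convert the range to a list so that the numbers in it can be printed
pagenumbers = list(range(1, 4))

#the range stops before the second number, so this prints [1, 2, 3]
print(pagenumbers)
```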

Once we have that range we can loop through it, adding each number to the end of a URL, and then append that URL to a list, giving us a list of URLs for different pages.

Note: you will also need to convert the number to a string first, because only strings can be combined with other strings.

Here's the code to do this (we are only going to generate a range of numbers from 1 to 3 for now so that our scraper doesn't take too long to run - this is useful to do to test the scraper works before unleashing it on all pages):

In [None]:
#create an empty list to store the URLs we are about to generate
url_list = []

#loop through the numbers 1 to 3 (the range stops before 4)
for i in range(1,4):
  #convert to a string, add to the end of a base URL
  pagedurl = "https://theferret.scot/articles/page/"+str(i)
  #append to the list
  url_list.append(pagedurl)

### Reusing our previous code

Once that list of URLs exists we can reuse the same scraping code we previously used on just one URL - as long as the pages at the other URLs are structured the same (they have the same sort of info in the same tags).

Because that code will have to run for *each* URL in the list, it will need to be **indented inside the loop** (if it wasn't in a loop, it would only run once, and we want it to run for each URL in a list).

But this presents a new problem: the data frame variable `df` which stores the data from each page will be overwritten each time the loop runs.

So we need to add some way of making a copy of that data frame before it gets overwritten.

One solution is to use `append()` again to populate a new list as the loop progresses: this time, however, we are going to be making a *list of data frames*.

As before, we create an empty list before the loop begins, and then each time the loop runs, append each scraped data frame to that.

This requires adding three new lines to our code. First, before the loop, the line that creates an empty list:

`listofdfs = []`

Then, inside the loop the line which adds each data frame to that list:

`listofdfs.append(df)`

And finally, after the loop has finished:

`combineddf = pd.concat(listofdfs)`

This last line uses the function `pd.concat()` to create a *new* data frame by combining ('concatenating') all the data frames stored in that list.

Here's the code in full:

In [None]:
#create an empty list
listofdfs = []

#loop through the list of URLs
for url in url_list:
  #fetch the HTML from the URL
  response = requests.get(url)
  #convert that to a BeautifulSoup 'object' so it can be drilled into further
  soup = BeautifulSoup(response.content, 'html.parser')
  #drill down into HTML tags described by the 'selector' in quotation marks, and store matches
  alistofmatches = soup.select('h2 a')
  #create a data frame where those matches fill one column, which is named in quotation marks
  df = pd.DataFrame( { "column 1" : alistofmatches } )
  #add it to the previously empty list
  listofdfs.append(df)

#now the loop has finished, concatenate that list of data frames into one
combineddf = pd.concat(listofdfs)
#export that as a CSV file
combineddf.to_csv("scrapeddata.csv", index=False)

#show it
combineddf

Unnamed: 0,column 1
0,"[Sheku Bayoh: The inquiry podcast, ‘An unholy ..."
1,[Mark Fortune faces 89 criminal charges includ...
2,[Yemen’s war: photo essay]
3,[Police watchdog and Crown failed to assess ra...
4,[Tory politician urged to return £10k donation...
5,[‘Huge red flag’: Alba politician to speak at ...
6,[Salmon firm accused of misleading locals over...
7,[Claim Labour and Conservatives formed allianc...
8,[Scottish Government agency still giving grant...
0,[Racist group founder fronts new far right par...


You can also expand the indented code to grab extra columns just as you would with the single page scraper.
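
In the output above, the row numbers restart at 0 where the second page's results begin. That is because `pd.concat()` keeps each data frame's own row numbers by default. If you would prefer one continuous numbering, one option is the `ignore_index=True` argument. A minimal sketch, using two tiny made-up data frames in place of scraped pages:

```python
import pandas as pd

#two small hypothetical data frames standing in for two scraped pages
page1 = pd.DataFrame( { "column 1" : ["a", "b"] } )
page2 = pd.DataFrame( { "column 1" : ["c", "d"] } )

#ignore_index=True discards each data frame's own row numbers and renumbers from 0
combined = pd.concat([page1, page2], ignore_index=True)

print(list(combined.index))
```

This makes no difference to a CSV exported with `index=False`, because the row numbers are left out of that file anyway.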

## Expanding the scraper by fetching other linked pages

Sequential URLs can be relatively easy to generate - but if you need to fetch pages which are linked from an initial page (such as pages with more detail on each case), then you will need to generate your list of URLs from an initial scraper.

Earlier in this notebook we did that, by first targeting a particular HTML tag and then extracting the link (the `href=` attribute) from it. Here's a reminder of the code that did that:

In [None]:
#create an empty list to store the data we are about to extract
justhref = []

#loop through the fetched list of matching tags and text
for i in alistofmatches:
  #extract the href= value from each item and append it to the previously empty list
  justhref.append(i['href'])


### Writing a second scraper for the linked pages

The resulting list, `justhref`, contains the URLs of the pages we want to scrape - but unlike sequential URLs that just end in a different page number, the pages at those URLs are almost certainly different to the ones we scraped with our original scraper.

This means we can't simply reuse our previous scraping code by indenting it within a loop - we need to write a *second* scraper with code that's targeted at these different pages.

This scraper will use the same principles and template code as your first scraper - although it may well take longer to write and test, because chances are it will have more information that you want to fetch, and therefore more HTML tags that you want to target, and possibly extract from.

Let's assume that after that process, you end up with the following code for scraping one page:

In [None]:
#fetch the HTML from the URL
response = requests.get("https://theferret.scot/claim-100000-homes-empty-scotland-mostly-true/")
#convert that to a BeautifulSoup 'object' so it can be drilled into further
soup = BeautifulSoup(response.content, 'html.parser')

#drill down into HTML tags described by the 'selector' in quotation marks, and store matches
alistofmatches = soup.select("h1 span")

#create a data frame where those matches fill one column, which is named in quotation marks
df = pd.DataFrame( { "column 1" : alistofmatches } )

#show the result
df

Unnamed: 0,column 1
0,"[Claim around 100,000 homes are empty in Scotl..."


As before, the key lines are the one which specifies the URL, and the one(s) which specify a selector to grab matching HTML tags.

Now, let's see that code incorporated into a loop that loops through *more than one* of these pages (the list of full URLs we scraped earlier).

In this code the specific URL is replaced with a URL from the list of links `justhref` - which we call `url` as we loop through it.

But the selector stays the same.

In [None]:
#create an empty list
listofdfs = []

#loop through the list of full URLs
for url in justhref:
  #fetch the HTML from the URL
  response = requests.get(url)
  #convert that to a BeautifulSoup 'object' so it can be drilled into further
  soup = BeautifulSoup(response.content, 'html.parser')
  #drill down into HTML tags described by the 'selector' in quotation marks, and store matches
  alistofmatches = soup.select("h1 span")
  #create a data frame where those matches fill one column, which is named in quotation marks
  df = pd.DataFrame( { "column 1" : alistofmatches } )
  #append it to the previously empty list
  listofdfs.append(df)

#now the loop has finished, concatenate that list of data frames into one
combineddf_details = pd.concat(listofdfs)

#export that as a CSV file
combineddf_details.to_csv("scrapeddata.csv", index=False)

#show the results
combineddf_details

Unnamed: 0,column 1
0,[Far right ‘terrorist’ and ‘racist’ posts brok...
0,[Edinburgh University denies surveillance clai...
0,[A place to heal: saving lives in Toronto’s to...
0,[Alarm over nuclear safety lapses on the Clyde]
0,[‘A foundation to build on’: Can Scotland’s fi...
0,[What has The Ferret been up to in April?]
0,[Podcast teaser: What impact can a safer drug ...
0,[Revealed: UK government gave oil licences to ...
0,[Scottish Government climate partner challenge...


In this case the information targeted (a headline) happens to be the same information that was on the results page, so it wasn't worth writing a 'details' scraper (always check first!), but the technical process is still useful to outline.

Note that the principles are the same as in the sequential pages example:

* Create an empty list
* Loop through a list of URLs
* For each of those, apply the same scraping code as worked with one URL (get the page, convert to a BeautifulSoup object, fetch a specified HTML tag, store in a data frame)
* Once the loop has finished, join the resulting list of data frames so that we have a single data frame with *all* pages' data
* Export that data frame as a CSV

### If you only have partial URLs

Luckily, the URLs in our list are complete, but often you will get partial URLs that are missing the first part. These are called [*relative* URLs](https://kb.blackbaud.co.uk/articles/Article/67574): that is, they are *relative* to the website they are on (a full URL is known as an *absolute* URL).

If you get relative URLs, just add the 'base URL' of the website to the front of each one: the base URL is everything in the address bar up to and including the first single forward slash (the one after the domain name), e.g. `https://theferret.scot/`

You can add the base URL to the relative part as you loop through each relative URL, by using the `+` operator, like so (within the loop):

`url = "https://theferret.scot/"+url`

That full URL can then either be added to a new list that you *then* loop through, or you can skip that stage entirely by inserting a line like this within the loop that scrapes each URL, as long as you do so before the line with `requests.get()` tries to fetch it.
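
One thing to watch with the `+` approach is that a relative URL which already starts with a slash will produce a double slash, and a URL that is already complete will end up with the base URL added twice. A possible alternative is the `urljoin()` function from Python's built-in `urllib.parse` module, which handles both cases. The list of links below is invented for illustration:

```python
from urllib.parse import urljoin

#the base URL of the site
baseurl = "https://theferret.scot/"
#hypothetical scraped links: one relative, one already absolute
scrapedlinks = ["/yemens-war-photo-essay/", "https://theferret.scot/articles/"]

#create an empty list to store the full URLs
fullurls = []

for url in scrapedlinks:
  #urljoin() resolves a relative URL against the base, and leaves an absolute URL unchanged
  fullurls.append(urljoin(baseurl, url))

print(fullurls)
```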

## Expanding the scraper by doing both

You might want to do both the types of expansion detailed above. To do this you combine the two processes above:

* Generate a list of URLs for multiple 'results' pages (the ones that link to the 'details' pages)
* Write code that scrapes the links from each 'results' page
* Indent that scraping code inside a `for` loop that loops through each URL in the list of 'results' URLs, and scrapes all the URLs for the 'details' pages, adding each list to a list of detail page URLs using `extend`
* Write the code to scrape a 'details' page
* Indent this code inside a `for` loop that loops through each details page URL you scraped, appending the scraped 'details' data frame to a list
* Concatenate the resulting list of data frames into a single data frame
* Export that as a CSV

There's something new here: the use of `extend`. This is used when you want to add multiple items to a list, instead of just one (i.e. you want to extend a list of 9 items with another list of 9 items).
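
To see the difference, compare what happens when the same two lists are added with `append` and with `extend` (the lists here are placeholders, not real scraped links):

```python
#two hypothetical lists of links, as if scraped from two results pages
page1links = ["link1", "link2"]
page2links = ["link3", "link4"]

#append() adds each list as ONE item, giving a list of lists
appended = []
appended.append(page1links)
appended.append(page2links)

#extend() adds each item from the list separately, giving one flat list
extended = []
extended.extend(page1links)
extended.extend(page2links)

#show the length of each result
print(len(appended), len(extended))
```

The appended version contains 2 items (each one a list), while the extended version contains all 4 links, which is what a loop through URLs needs.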

Here's an example of that in practice - first, scraping all the 'detail' links from a number of sequential 'result' pages:

In [None]:
#create an empty list to store the URLs we are about to generate
url_list = []

#loop through the numbers 1 to 3 (the range stops before 4)
for i in range(1,4):
  #convert to a string, add to the end of a base URL
  pagedurl = "https://theferret.scot/articles/page/"+str(i)
  #append to the list
  url_list.append(pagedurl)

#create an empty list to store the links we are about to scrape
listoflinks = []

#loop through the list of result page URLs
for url in url_list:
  #fetch the HTML from the URL
  response = requests.get(url)
  #convert that to a BeautifulSoup 'object' so it can be drilled into further
  soup = BeautifulSoup(response.content, 'html.parser')
  #drill down into HTML tags described by the 'selector' in quotation marks, and store matches
  alistofmatches = soup.select('h2 a')
  #create a list for the links
  hreflist = []
  #loop through the list of tag-and-text
  for a in alistofmatches:
    #store the href= value in the list
    hreflist.append(a['href'])
  #EXTEND the previously empty list with this list
  listoflinks.extend(hreflist)

#show how many links to detail pages we've scraped
len(listoflinks)

27

Next, scraping the detail from the pages at those links.

In [None]:
#create an empty list to store the data frames we are about to scrape
listofdfs = []

#loop through the list of scraped URLs
for url in listoflinks:
  #fetch the HTML from the URL
  response = requests.get(url)
  #convert that to a BeautifulSoup 'object' so it can be drilled into further
  soup = BeautifulSoup(response.content, 'html.parser')
  #drill down into HTML tags described by the 'selector' in quotation marks, and store matches
  alistofmatches = soup.select("h1 span")
  #create a data frame where those matches fill one column, which is named in quotation marks
  df = pd.DataFrame( { "column 1" : alistofmatches } )
  #append it to the previously empty list
  listofdfs.append(df)

#now the loop has finished, concatenate that list of data frames into one
combineddf_details = pd.concat(listofdfs)

#export that as a CSV file
combineddf_details.to_csv("scrapeddata.csv", index=False)

#show the results
combineddf_details

Unnamed: 0,column 1
0,"[Sheku Bayoh: The inquiry podcast, ‘An unholy ..."
0,[Mark Fortune faces 89 criminal charges includ...
0,[Yemen’s war: photo essay]
0,[Police watchdog and Crown failed to assess ra...
0,[Tory politician urged to return £10k donation...
0,[‘Huge red flag’: Alba politician to speak at ...
0,[Salmon firm accused of misleading locals over...
0,[Claim Labour and Conservatives formed allianc...
0,[Scottish Government agency still giving grant...
0,[Racist group founder fronts new far right par...


In [None]:
#check the number of rows in the data frame
len(combineddf_details)

27

## Other tips and considerations

The techniques above should cover most pages.

If the URL doesn't change when you navigate to the second page of results, it is likely that the site is using cookies to store your search, and the site will be harder to scrape. That's beyond the scope of this notebook but you can learn more by watching [this video](https://www.youtube.com/watch?v=xPkxdHYV1wg).

Some other things to consider:

* [Adding headers to `requests.get()`](https://stackoverflow.com/questions/8685790/adding-headers-to-requests-module)
* [Scraping 'next' page links](https://medium.com/quick-code/how-to-get-the-next-page-on-beautiful-soup-85b743750df4)
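
As a rough idea of what scraping a 'next' page link can look like, the sketch below uses a made-up snippet of HTML. The `a.next` selector and the `example.com` link are assumptions: the right selector depends entirely on how the site you are scraping marks up its 'next' link.

```python
from bs4 import BeautifulSoup

#hypothetical HTML for a results page with a 'next' link
html = '<div class="nav"><a class="next" href="https://example.com/articles/page/2/">Next</a></div>'
soup = BeautifulSoup(html, "html.parser")

#select_one() returns the first matching tag, or None if nothing matches
nextlink = soup.select_one("a.next")

if nextlink is not None:
  #store the URL of the next page so it can be scraped in turn
  nexturl = nextlink["href"]
else:
  #no 'next' link suggests this is the last page
  nexturl = None

print(nexturl)
```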
