# An example scraper showing how to use selectors in Beautiful Soup

This notebook explains how to scrape an example webpage as a way of demonstrating how to use selectors on an 'object' scraped with the `BeautifulSoup` function.

First, we import the libraries we will need.

In [1]:
#import the libraries
#requests is a library for fetching URLs
import requests
#bs4 is a library for scraping webpages - BeautifulSoup is a class from that library
from bs4 import BeautifulSoup
#the pandas library which is used to work with data - we rename it pd
import pandas as pd

And the first lines of our scraper.

In [2]:
#store the url we want to scrape
theurl = "https://www.nhs.uk/service-search/other-health-services/eating-disorders-inpatient/results?location=nottingham&latitude=52.95619730589719&longitude=-1.1512037581844483"
#scrape the webpage at that url and store in 'html'
html = requests.get(theurl)
#convert 'html' into a Beautiful Soup object so we can drill into it
#naming the parser ('html.parser' is built into Python) avoids a warning
soup = BeautifulSoup(html.content, "html.parser")
#show it
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<meta content="Eating disorders (inpatient) search results for Nottingham" name="description"/>
<meta content="" name="keywords"/>
<meta content="noindex" name="robots"/>
<title>Eating disorders (inpatient) services near Nottingham - NHS</title>
<link crossorigin="" href="https://assets.nhs.uk/" rel="preconnect"/>
<link href="/service-search/dist/css/app.min.css?v=ANp60okG3cdK8gCYdcg3mZvLP2iP-MUtxYhks7vvUCA" rel="stylesheet"/>
<link as="font" crossorigin="" href="https://assets.nhs.uk/fonts/FrutigerLTW01-55Roman.woff2" rel="preload" type="font/woff2"/>
<link as="font" crossorigin="" href="https://assets.nhs.uk/fonts/FrutigerLTW01-65Bold.woff2" rel="preload" type="font/woff2"/>
<script src="https://www.nhsapp.service.nhs.uk/js/v1/nhsapp.js"></script>
<script data-cookieconsent="neces

## Drilling down into the HTML

Now we're ready to use `select` to drill down further.

We need to know which HTML tags we are targeting, so spend some time looking at the webpage, either using *View source* to find the tags surrounding the types of data you want or [using the Inspector to see the tags](https://zapier.com/blog/inspect-element-tutorial/).

For example, the name of each service seems to be inside `<h2>` tags, so let's try that.

In [3]:
#grab the contents of every <h2> tag
servicenames = soup.select('h2')
#print the results
print(servicenames)

[<h2 class="results__name nhsuk-u-padding-top-0" id="orgname_0">
<a href="https://www.nhs.uk/services/clinic/the-becton-centre-for-children-and-young-people/X96854">The Becton Centre For Children &amp; Young People</a>
</h2>, <h2 class="results__name nhsuk-u-padding-top-0" id="orgname_1">
<a href="https://www.nhs.uk/services/hospital/sheffield-childrens-hospital/RCUEF">Sheffield Children's Hospital</a>
</h2>, <h2 class="results__name nhsuk-u-padding-top-0" id="orgname_2">
<a href="https://www.nhs.uk/services/clinic/highfields/X30483">Highfields</a>
</h2>, <h2 class="results__name nhsuk-u-padding-top-0" id="orgname_3">
<a href="https://www.nhs.uk/services/clinic/schoen-clinic-newbridge/X113416">Schoen Clinic Newbridge</a>
</h2>, <h2 class="results__name nhsuk-u-padding-top-0" id="orgname_4">
<a href="https://www.nhs.uk/services/hospital/st-georges/RRE11">St George's</a>
</h2>, <h2 class="results__name nhsuk-u-padding-top-0" id="orgname_5">
<a href="https://www.nhs.uk/services/hospital/w
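
The `'h2'` selector above matches on the tag name alone. As a rough sketch of some other kinds of selector that `select` accepts, here is a tiny made-up page (the HTML, the `sample_soup` name and the link addresses are all invented for illustration; only the class and id names are borrowed from the output above):

```python
from bs4 import BeautifulSoup

# a tiny, made-up page: two result headings plus one unrelated <h2>
sample_html = """
<h2 class="results__name" id="orgname_0"><a href="/services/a">Service A</a></h2>
<h2 class="results__name" id="orgname_1"><a href="/services/b">Service B</a></h2>
<h2>Some other heading</h2>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# tag selector: every <h2>, whatever its class
all_h2 = sample_soup.select("h2")
# class selector: only <h2> tags that have the class 'results__name'
result_h2 = sample_soup.select("h2.results__name")
# descendant selector: <a> tags anywhere inside those headings
result_links = sample_soup.select("h2.results__name a")
# id selector: the tag whose id is 'orgname_1'
second_heading = sample_soup.select("#orgname_1")

print(len(all_h2), len(result_h2), len(result_links), len(second_heading))
```

A more specific selector is one way to leave out headings you don't want, but whether it works on a real page depends on how that page's HTML is written, so always check the results.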

### How many matches did we get?

That seems to have worked quite well. We can find out how many results by using the basic Python function `len()` which, when used on a list, tells you how many items there are in that list (if you use it with a string, it will tell you how many characters are in the string, including spaces).

In [4]:
#check how many results - there should be 40
len(servicenames)

40
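
To see the difference between using `len()` on a list and on a string, here is a small stand-alone example (the `names` list is typed in by hand rather than scraped):

```python
# len() on a list counts the items in it
names = ["Highfields", "Barberry", "Chalkhill"]
print(len(names))

# len() on a string counts the characters, including the space and apostrophe
print(len("St George's"))
```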

## Storing the results in a dataframe

There's lots more to do here, including checking the data and grabbing more data, but for now let's store this in a dataframe so we can look at it and export it.

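To do that, we need to create an empty dataframe, then loop through the list of results, extract the text from each one, and add it to the dataframe.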

In [5]:
#grab the contents of every <h2> tag
servicenames = soup.select('h2')

#Create a dataframe to store the data we are about to scrape
#It has one column called 'service'
#We call this dataframe 'df'
df = pd.DataFrame(columns=["service"])

#loop through the list
for i in servicenames:
  #extract the text from that item
  servicename = i.get_text()
  #and print it
  print(servicename)
  #convert to a dataframe
  tempdf = pd.DataFrame({"service" : servicename}, index=[0])
  #then add to the df
  df = pd.concat([df, tempdf], ignore_index=True)

print(df)


The Becton Centre For Children & Young People


Sheffield Children's Hospital


Highfields


Schoen Clinic Newbridge


St George's


Woodbourne Priory Hospital


Barberry 


The Huntercombe Hospital Stafford


Trafalgar House


Priory Hospital Cheadle Royal


Priory Hospital Altrincham


Rharian Fields Eating Disorder Unit


Newsam Centre


The Retreat Hospital York


Schoen Clinic York


Redesmere


Littlemore Mental Health Centre


Priory Hospital Preston


 Hertfordshire Partnership Wellbeing Service - North East Hertfordshire


Harrow Talking Therapies (IAPT)


Priory Hospital North London


The Huntercombe Hospital Maidenhead


Cygnet Hospital Ealing


South Kensington & Chelsea Mental Health Centre


Priory Hospital Roehampton


Springfield University Hospital


Priory Hospital Woking


Life Works Woking


Priory Hospital Bristol


Priory Hospital Hayes Grove


Tatchbury Mount


Royal Victoria Hospital 


Chalkhill


Priory Hospital Southampton


Chanctonbury


Priory Wellbeing 
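
Adding to a dataframe inside the loop works, but one possible alternative is to collect the values in a plain Python list and only create the dataframe once the loop has finished. This is just a sketch: `scraped_names`, `rows` and `names_df` are made-up names, and the values stand in for the text extracted from each `<h2>` tag.

```python
import pandas as pd

# stand-in for the text pulled out of each <h2> tag (made-up values)
scraped_names = ["\nService A\n", "\nService B\n", "\nService C\n"]

# build a plain Python list first...
rows = []
for name in scraped_names:
    rows.append(name)

# ...then create the dataframe once, after the loop has finished
names_df = pd.DataFrame({"service": rows})
print(names_df.shape)
```

This avoids creating a temporary dataframe for every item, which can matter when a list is long.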

## Exporting the results

Once you have a dataframe, you can export that as a CSV file.

To do that, add `.to_csv()` to the name of the dataframe variable, and inside the brackets put the name of the CSV file you want to create - it should end in .csv so that your spreadsheet software will know what to do with it.



In [None]:
#export the dataframe df to a CSV called thedata.csv
df.to_csv('thedata.csv')
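
By default `.to_csv()` also writes the dataframe's index (the row numbers) as an extra first column with no name. If you don't want that column, pandas has an `index=False` option. A small sketch with a made-up dataframe (`demo_df` is not part of the scraper):

```python
import pandas as pd

demo_df = pd.DataFrame({"service": ["Service A", "Service B"]})

# with no filename, to_csv() returns the CSV text instead of writing a file
with_index = demo_df.to_csv()
without_index = demo_df.to_csv(index=False)

# compare the two versions
print(with_index)
print(without_index)
```

In the scraper itself that would look like `df.to_csv('thedata.csv', index=False)`.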

## Checking the results

If you look at the resulting data you can check it has what you expected.

In particular, look at the first and last results in the table. You should notice that we have one piece of data we don't want - the very last row.

You can look at the end of a dataframe by adding `.tail()` to the name of the dataframe variable.

In [6]:
#show the last few rows
df.tail()

Unnamed: 0,service
35,\nPriory Wellbeing Centre Canterbury\n
36,\nKimmeridge Court\n
37,\nThe Haldon Eating Disorder Service\n
38,\nTruro Health Park\n
39,Support links


We will fix that later.

## Adding more data - and fixing some problems

Let's repeat the process for the other details. Helpfully, the telephone number always seems to be inside the tag `<p class="nhsuk-list nhsuk-u-margin-bottom-2">`.

How many of those can we grab?


In [7]:
#grab the contents of each <p class="nhsuk-list nhsuk-u-margin-bottom-2"> tag
tels = soup.select('p[class="nhsuk-list nhsuk-u-margin-bottom-2"]')
#count how many matches are in that list
len(tels)

78

That's more than we expected.

Let's check the text of the first one:

In [None]:
tels[0].get_text()

'\r\n                                        Seven Airs Road, Beighton, Sheffield, South Yorkshire, S20 1NZ\r\n                                    '

OK, that's an address, not a phone number. What about the second?

In [None]:
tels[1].get_text()

'0114 3053106'

Let's try a few:

In [None]:
#loop through the first 6 items
for i in tels[:6]:
  #print the text of that item
  print(i.get_text())


                                        Seven Airs Road, Beighton, Sheffield, South Yorkshire, S20 1NZ
                                    
0114 3053106

                                        Western Bank, Sheffield, South Yorkshire, S10 2TH
                                    
0114 2717000

                                        9 &11 Highfields Road, Chasetown, Burntwood, Staffordshire, WS7 4RQ
                                    
01543 684 948


We can see that the tags contain two types of data: addresses, and phone numbers, and that they alternate. We can also check this with the HTML on the page itself.

How do we deal with this? Well, there are a few ideas that spring to mind (the more coding you've done, the more ideas you will have):

* We could check if the index is odd (1, 3, 5) and only store the data if it is.
* We could generate a list of odd numbers and loop through those to grab the corresponding item at that position
* We could check if the text only contains numbers, and store it if it does
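
As a rough sketch of the third idea in that list, here is one way to test whether a piece of text contains only numbers. The sample list and the `looks_like_phone` function are made up for illustration, with invented addresses and phone numbers in the same shape as the scraped text:

```python
# made-up sample in the same shape as the scraped text:
# address, phone, address, (empty) phone
sample_texts = [
    "\r\n   1 Example Road, Sheffield, S1 1AA\r\n   ",
    "0114 0000000",
    "\r\n   2 Example Street, Sheffield, S2 2BB\r\n   ",
    "",
]

def looks_like_phone(text):
    # remove the spaces, then check that only digits are left
    return text.replace(" ", "").isdigit()

# keep only the items that pass the test
phones_only = [t for t in sample_texts if looks_like_phone(t)]
print(phones_only)
```

Note that an empty string does not count as "only digits", so a service with no phone number would be dropped altogether and the phone numbers would no longer line up with the service names. That is one reason to prefer an approach based on position.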

If you're interested, here is how to generate a range of odd numbers, using the `range()` function in basic Python...

In [None]:
#loop through the odd numbers from 1 to 77, in increments of two (range() stops before 78)
for i in range(1,78,2):
  #print the number
  print(i)

1
3
5
7
9
11
13
15
17
19
21
23
25
27
29
31
33
35
37
39
41
43
45
47
49
51
53
55
57
59
61
63
65
67
69
71
73
75
77


And here is how to extend that code to use each number as an index to grab something from the corresponding list.

In [8]:
#loop through the odd numbers from 1 to 77, in increments of two (range() stops before 78)
for i in range(1,78,2):
  #print the number
  print(i)
  #get the text at that index in the list 'tels'
  print(tels[i].get_text())

1
0114 3053106
3
0114 2717000
5
01543 684 948
7
0121 580 8362
9
0300 790 7000
11
0121 434 4343
13
0121 301 2002
15
01785 840000
17

19
0161 428 9511
21
0161 904 0050
23
01472 808450
25
0113 855 6300
27

29
01904 404400
31
01244 397397
33
01865 901000
35
01772 691 122
37
0800 6444 101
39
020 8515 5015
41
020 8882 8191
43
01628 667881
45
020 8991 6699
47

49
020 8876 8261
51
0203 513 5000
53
01483 489 211
55
01483 757572
57
0117 952 5255
59
020 8462 7722
61

63
0191 233 6161
65
01444 472670
67
023 8084 0044
69

71
01227 452 171
73
03000191771
75
01392 208263
77



Next we need to find a way to store those items as they're extracted. This can be done by creating an empty list and then appending each item to it using the Python list method `.append()`.

(This time we've commented out the two `print` commands.)

In [9]:
#create an empty list
phonenums = []

#loop through the odd numbers from 1 to 77, in increments of two (range() stops before 78)
for i in range(1,78,2):
  #print the number
  #print(i)
  #get the text at that index in the list 'tels'
  #print(tels[i].get_text())
  #append it to the list
  phonenums.append(tels[i].get_text())

#show the list after the loop has finished
phonenums

['0114 3053106',
 '0114 2717000',
 '01543 684 948',
 '0121 580 8362',
 '0300 790 7000',
 '0121 434 4343',
 '0121 301 2002',
 '01785 840000',
 '',
 '0161 428 9511',
 '0161 904 0050',
 '01472 808450',
 '0113 855 6300',
 '',
 '01904 404400',
 '01244 397397',
 '01865 901000',
 '01772 691 122',
 '0800 6444 101',
 '020 8515 5015',
 '020 8882 8191',
 '01628 667881',
 '020 8991 6699',
 '',
 '020 8876 8261',
 '0203 513 5000',
 '01483 489 211',
 '01483 757572',
 '0117 952 5255',
 '020 8462 7722',
 '',
 '0191 233 6161',
 '01444 472670',
 '023 8084 0044',
 '',
 '01227 452 171',
 '03000191771',
 '01392 208263',
 '']
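
Another way to pick out every second item, shown here only as an alternative sketch, is a slice with a *step*. The `items` list below is a made-up stand-in for the text of each item in `tels`:

```python
# stand-in for the text of each item in 'tels': address, phone, address, phone...
items = ["address 0", "phone 0", "address 1", "phone 1", "address 2", "phone 2"]

# start at index 1, run to the end, and take every second item
phones = items[1::2]
# start at index 0 instead to get the addresses
addresses = items[0::2]

print(phones)
print(addresses)
```

With the real list that could be written as `[t.get_text() for t in tels[1::2]]`, which has the advantage of not needing to know the length of the list in advance.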

## Fixing the lengths

Now we've used trial and error to find the data we need, let's bring them together.

First we need to check they're the same length.

In [10]:
#how many items in phonenums
len(phonenums)

39

In [11]:
#and in our other list of services
len(servicenames)

40

We know that our original list had one item too many at the end - now we need to remove it so that it matches up with the list of phone numbers.

We can do that by selecting the first 39 items in the list using indices - specifically something called a **slice**: `[0:39]`

This will select all items from position zero up to *but not including* position 39. Position 39 is the 40th object, so the end result is we get 39 items.

In [12]:
#replace servicenames with the first 39 items in servicenames
servicenames = servicenames[0:39]
#check it worked
len(servicenames)

39
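
If you don't want to count the items yourself, a slice can also count back from the end of the list. A small sketch with a made-up list:

```python
names = ["Service A", "Service B", "Service C", "Support links"]

# [0:3] keeps positions 0, 1 and 2
first_three = names[0:3]

# [:-1] means 'everything except the last item', whatever the length of the list
all_but_last = names[:-1]

print(first_three)
print(all_but_last)
```

Both slices give the same result here, but `[:-1]` would still drop just the last item if the page returned a different number of results.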

## Capturing both 'columns' of data

Now we can start to combine those in a dataframe.

In [13]:
#Create a dataframe to store the data
#We give it a dictionary with two keys: 'service' and 'phone'
#Against each of those keys we have the two lists we scraped
servicsedf = pd.DataFrame({"service":servicenames,"phone":phonenums})
#check it
servicsedf

Unnamed: 0,service,phone
0,"[\n, [The Becton Centre For Children & Young P...",0114 3053106
1,"[\n, [Sheffield Children's Hospital], \n]",0114 2717000
2,"[\n, [Highfields], \n]",01543 684 948
3,"[\n, [Schoen Clinic Newbridge], \n]",0121 580 8362
4,"[\n, [St George's], \n]",0300 790 7000
5,"[\n, [Woodbourne Priory Hospital], \n]",0121 434 4343
6,"[\n, [Barberry ], \n]",0121 301 2002
7,"[\n, [The Huntercombe Hospital Stafford], \n]",01785 840000
8,"[\n, [Trafalgar House], \n]",
9,"[\n, [Priory Hospital Cheadle Royal], \n]",0161 428 9511


## Improving the scraper

Now that we've succeeded in scraping those two pieces of information on 39 organisations from one webpage, we can start to think about improving the scraper. For example:

* We could split the telephone and address within the scraper
* We could grab the link to each organisation's 'detail' page and add that
* We could get the scraper to run on subsequent pages of results


## Exporting the results

First, let's export the results so we have a copy of those.

In [14]:
#And we can export it
servicsedf.to_csv("scrapeddata.csv")

## Improvement 1: Cleaning/splitting the data

Our services column still has a whole bunch of HTML which we don't necessarily want.

In [19]:
servicenames[0]

<h2 class="results__name nhsuk-u-padding-top-0" id="orgname_0">
<a href="https://www.nhs.uk/services/clinic/the-becton-centre-for-children-and-young-people/X96854">The Becton Centre For Children &amp; Young People</a>
</h2>

We can clean that by using BeautifulSoup's `.get_text()` method - this strips out the HTML leaving whatever text was inside the tag (including any child tags).

In [20]:
servicenames[0].get_text()

'\nThe Becton Centre For Children & Young People\n'

There's another basic Python method we can use here, called `.strip()`, which removes any leading or trailing white space - in this case the newline characters shown as `\n`.

To apply this we add it to the end again. This process of adding one method after another is called **chaining**.

In [22]:
servicenames[0].get_text().strip()

'The Becton Centre For Children & Young People'
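
To see what `.strip()` does and doesn't remove, here is a small example using plain strings. The first is a shortened version of the address text we saw earlier, with its carriage returns, newlines and spaces:

```python
# text with carriage returns, newlines and spaces at both ends
raw_address = "\r\n      Seven Airs Road, Beighton, Sheffield  \r\n   "
clean_address = raw_address.strip()
print(repr(clean_address))

# text with a newline at each end
raw_name = "\nHighfields\n"
clean_name = raw_name.strip()
print(repr(clean_name))
```

Only the white space at the start and end is removed: the spaces between the words are left alone.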

To apply this to all the entries, we need to loop through the old list, apply the cleaning process above, and create a new list from the results.

In [23]:
#create an empty list
clean_names = []

#loop through the 'unclean' list
for i in servicenames:
  #extract the text from each item, and strip out whitespace
  cleanversion = i.get_text().strip()
  #append that new clean version to the empty list
  clean_names.append(cleanversion)

#show the list, which now has all the clean versions
clean_names

['The Becton Centre For Children & Young People',
 "Sheffield Children's Hospital",
 'Highfields',
 'Schoen Clinic Newbridge',
 "St George's",
 'Woodbourne Priory Hospital',
 'Barberry',
 'The Huntercombe Hospital Stafford',
 'Trafalgar House',
 'Priory Hospital Cheadle Royal',
 'Priory Hospital Altrincham',
 'Rharian Fields Eating Disorder Unit',
 'Newsam Centre',
 'The Retreat Hospital York',
 'Schoen Clinic York',
 'Redesmere',
 'Littlemore Mental Health Centre',
 'Priory Hospital Preston',
 'Hertfordshire Partnership Wellbeing Service - North East Hertfordshire',
 'Harrow Talking Therapies (IAPT)',
 'Priory Hospital North London',
 'The Huntercombe Hospital Maidenhead',
 'Cygnet Hospital Ealing',
 'South Kensington & Chelsea Mental Health Centre',
 'Priory Hospital Roehampton',
 'Springfield University Hospital',
 'Priory Hospital Woking',
 'Life Works Woking',
 'Priory Hospital Bristol',
 'Priory Hospital Hayes Grove',
 'Tatchbury Mount',
 'Royal Victoria Hospital',
 'Chalkhill',


And add it to the dataframe like so:

In [24]:
#Create a dataframe to store the data
#We give it a dictionary with two keys: 'service' and 'phone'
#Against each of those keys we have the two lists we scraped
servicsedf = pd.DataFrame({"service":clean_names,"phone":phonenums})
#check it
servicsedf

Unnamed: 0,service,phone
0,The Becton Centre For Children & Young People,0114 3053106
1,Sheffield Children's Hospital,0114 2717000
2,Highfields,01543 684 948
3,Schoen Clinic Newbridge,0121 580 8362
4,St George's,0300 790 7000
5,Woodbourne Priory Hospital,0121 434 4343
6,Barberry,0121 301 2002
7,The Huntercombe Hospital Stafford,01785 840000
8,Trafalgar House,
9,Priory Hospital Cheadle Royal,0161 428 9511


## Improvement 2: Grabbing the links to detail pages

So far we have used `.get_text()` in our code to indicate that we want to grab the text contents between the opening and closing tags that we targeted (e.g. `<h2>` and `</h2>`).

But what if we want to grab the links? Links aren't text - they're part of the HTML tag itself.

Specifically, a link is created using an `<a>` tag, and using the `href=` attribute of that to specify the link to go to. In full that looks something like this:

`<a href="http://bbc.co.uk">`

More succinctly, what we want to do is grab the **value** of the `href` attribute of the `a` tag.

There are two things we need to change in our code to do this:

* First, using `select()` we need to target the `a` tags
* Secondly, we need to use `['href']` rather than `.get_text()` to specify that we want to grab the href attribute's value, not the text.

The full code is below but key lines are these, which run on each item in our list of HTML objects:

```
  justtheatags = i.select('a')
  
  thefirstatag = justtheatags[0]
  
  servicelink = thefirstatag['href']
```

This progressively drills down further and further into what we had: a piece of HTML with an `<h2>` tag containing an `<a>` tag.

We could do this all in one line:

`servicelink = i.select('a')[0]['href']`

But it's often easier to break down each stage so you can check each one works and see the whole thing happening stage by stage. Ultimately, it's a personal preference.

In [25]:
#create an empty list
just_links = []

#loop through the list of full HTML entries
for i in servicenames:
  #select the <a> tags in each item
  justtheatags = i.select('a')
  #then just the first one
  thefirstatag = justtheatags[0]
  #then grab the 'href=' attribute
  servicelink = thefirstatag['href']
  #append that to the empty list
  just_links.append(servicelink)

#show the list, which now has all the links
just_links

['https://www.nhs.uk/services/clinic/the-becton-centre-for-children-and-young-people/X96854',
 'https://www.nhs.uk/services/hospital/sheffield-childrens-hospital/RCUEF',
 'https://www.nhs.uk/services/clinic/highfields/X30483',
 'https://www.nhs.uk/services/clinic/schoen-clinic-newbridge/X113416',
 'https://www.nhs.uk/services/hospital/st-georges/RRE11',
 'https://www.nhs.uk/services/hospital/woodbourne-priory-hospital/NTN08',
 'https://www.nhs.uk/services/hospital/barberry/RXTD3',
 'https://www.nhs.uk/services/hospital/the-huntercombe-hospital-stafford/NV203',
 'https://www.nhs.uk/services/clinic/trafalgar-house/X171570',
 'https://www.nhs.uk/services/hospital/priory-hospital-cheadle-royal/NTN23',
 'https://www.nhs.uk/services/hospital/priory-hospital-altrincham/NTN13',
 'https://www.nhs.uk/services/clinic/rharian-fields-eating-disorder-unit/X151582',
 'https://www.nhs.uk/services/hospital/newsam-centre/RGDAB',
 'https://www.nhs.uk/services/hospital/the-retreat-hospital-york/NPE01',
 '
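
The loop above assumes every item contains at least one `<a>` tag: if one didn't, `justtheatags[0]` would cause an error. As a sketch of a more cautious version, Beautiful Soup's `select_one()` returns the first match or `None`, and `.get('href')` returns `None` if the attribute is missing. The HTML here is made up, and includes a heading with no link:

```python
from bs4 import BeautifulSoup

# made-up headings: the second one has no link inside it
sample_html = """
<h2><a href="https://example.com/a">Service A</a></h2>
<h2>Support links</h2>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

links = []
for heading in sample_soup.select("h2"):
    # select_one() returns the first match, or None if there is no match
    first_a = heading.select_one("a")
    if first_a is None:
        # no link in this heading, so store an empty value
        links.append(None)
    else:
        # .get('href') returns None rather than an error if there is no href
        links.append(first_a.get("href"))

print(links)
```

Storing `None` for a missing link keeps the list the same length as the others, so it can still be added to the dataframe.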

We can add that to the dataframe just as we did the others.

In [26]:
#Create a dataframe to store the data
#We give it a dictionary with three keys: 'service', 'phone' and 'link'
#Against each of those keys we have the three lists we created
servicsedf = pd.DataFrame({"service":clean_names,
                           "phone":phonenums,
                           "link": just_links})
#check it
servicsedf

Unnamed: 0,service,phone,link
0,The Becton Centre For Children & Young People,0114 3053106,https://www.nhs.uk/services/clinic/the-becton-...
1,Sheffield Children's Hospital,0114 2717000,https://www.nhs.uk/services/hospital/sheffield...
2,Highfields,01543 684 948,https://www.nhs.uk/services/clinic/highfields/...
3,Schoen Clinic Newbridge,0121 580 8362,https://www.nhs.uk/services/clinic/schoen-clin...
4,St George's,0300 790 7000,https://www.nhs.uk/services/hospital/st-george...
5,Woodbourne Priory Hospital,0121 434 4343,https://www.nhs.uk/services/hospital/woodbourn...
6,Barberry,0121 301 2002,https://www.nhs.uk/services/hospital/barberry/...
7,The Huntercombe Hospital Stafford,01785 840000,https://www.nhs.uk/services/hospital/the-hunte...
8,Trafalgar House,,https://www.nhs.uk/services/clinic/trafalgar-h...
9,Priory Hospital Cheadle Royal,0161 428 9511,https://www.nhs.uk/services/hospital/priory-ho...


## Potential problems: adding a user agent

If you hit problems, such as a 403 error (which means "Forbidden"), the website may be blocking the scraper.

In this situation one of the first things to do is try adding a user agent to your scraper. As [this webpage](https://brightdata.com/blog/how-tos/user-agents-for-web-scraping-101) describes it:

> "The user agent string helps the destination server identify which browser, type of device, and operating system is being used. For example, the string tells the server you are using Chrome browser and Windows 10 on your computer. The server can then use this information to adjust the response for the type of device, OS, and browser."

To add a user agent to your scraper edit the `requests.get()` function below to add a `headers=` parameter which includes the `User-Agent`.

That user agent can be anything (I set it to "Paul" and that was enough to stop one scraper being blocked) but it's best to pick a user agent which matches the browser and operating system you're using.
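
If you are going to fetch more than one page, one option (a sketch, not the only way to do it) is to set the user agent once on a `requests.Session` so that every request made through it carries the same header. The user agent string below is a placeholder, and nothing is actually fetched here:

```python
import requests

# a session remembers settings, such as headers, between requests
session = requests.Session()

# placeholder user agent string - swap in your own browser's
session.headers.update({"User-Agent": "Mozilla/5.0 (example placeholder)"})

# nothing has been sent yet; this only shows what would be sent
print(session.headers["User-Agent"])
```

A later call such as `html = session.get(theurl)` would then include that header, and `html.status_code` would show whether it worked (200) or was still blocked (403).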

You can find out your own browser's user agent by [using this webpage](https://www.whatismybrowser.com/detect/what-is-my-user-agent) - then copy the results into the string like I have done below:

In [None]:
#store the url we want to scrape
theurl = "https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100"
#scrape the webpage at that url and store in 'html'
#without a user agent we get a 403 error on this webpage
headers={'User-Agent':"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
html = requests.get(theurl, headers=headers)
#convert 'html' into a Beautiful Soup object so we can drill into it
soup = BeautifulSoup(html.content, "html.parser")
soup

<!DOCTYPE html>
<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]--><!--[if IE 9]><html class="ie9" lang="en"><![endif]--><!--[if gt IE 9]><!--><html lang="en">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<title>Search Results - NHS</title>
<meta content="noindex" name="robots"/>
<!-- start dynamic content. Remove this and end comment when variables put in place -->
<meta content="" name="description"/>
<meta content="" name="keywords"/>
<meta content="" name="DC.title"/>
<meta content="" name="DC.description"/>
<meta content="" name="DC.subject" scheme="eGMS.IPSV"/>
<meta content="" name="DC.Subject" scheme="NHSC.Ontology"/>
<meta content="" name="DC.Subject" scheme="NHSC.Ontology"/>
<meta content="" name="DC.date.issued" scheme="W3CDTF"/>
<!-- end dynamic content.... -->
<meta content="England" name="DC.coverage"/>
<meta content="