# An example scraper showing how to use `cssselect`

This notebook explains how to scrape an example webpage as a way of demonstrating how to apply the `cssselect` library.

First, we import the libraries we will need. 

In [1]:
#install the libraries 
#scraperwiki is a library for scraping webpages
!pip install scraperwiki
import scraperwiki
#lxml.html parses the HTML into a structured tree that we can query
import lxml.html
#cssselect is used to drill down into that and find data in tags
!pip install cssselect
import cssselect
#the pandas library which is used to work with data 
import pandas 

Collecting scraperwiki
  Downloading scraperwiki-0.5.1.tar.gz (7.7 kB)
Collecting alembic
  Downloading alembic-1.7.1-py3-none-any.whl (208 kB)
Collecting Mako
  Downloading Mako-1.1.5-py2.py3-none-any.whl (75 kB)
Building wheels for collected packages: scraperwiki
  Building wheel for scraperwiki (setup.py) ... done
  Created wheel for scraperwiki: filename=scraperwiki-0.5.1-py3-none-any.whl size=6545 sha256=b6c9896917b8785cc0838b232c743758684447caaef4a57623db11e9b8dc536f
  Stored in directory: /root/.cache/pip/wheels/3c/57/8d/41e15f7e5cc9eb0067539416abd445f210c0d04f39975d5ca5
Successfully built scraperwiki
Installing collected packages: Mako, alembic, scraperwiki
Successfully installed Mako-1.1.5 alembic-1.7.1 scraperwiki-0.5.1
Collecting cssselect
  Downloading cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Installing collected packages: cssselect
Successf

And here are the first lines of our scraper.

In [2]:
#store the url we want to scrape
theurl = "https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100"
#scrape the webpage at that url and store in 'html'
#without a user agent we get a 403 error on this webpage
#see https://github.com/sensiblecodeio/scraperwiki-python for documentation
html = scraperwiki.scrape(theurl)
#convert 'html' into an lxml object so we can drill into it
root = lxml.html.fromstring(html)

HTTPError: ignored

## Adding a user agent

The code above generates a 403 error, which means "Forbidden": the server understood the request but refused it. That suggests our scraper is being blocked.

In this situation one of the first things to do is try adding a user agent to your scraper. As [this webpage](https://brightdata.com/blog/how-tos/user-agents-for-web-scraping-101) describes it:

> "The user agent string helps the destination server identify which browser, type of device, and operating system is being used. For example, the string tells the server you are using Chrome browser and Windows 10 on your computer. The server can then use this information to adjust the response for the type of device, OS, and browser."

To add a user agent to your scraper, edit the call to `scraperwiki.scrape()` to add a `user_agent=` argument (note the underscore, not a hyphen).

That user agent can be anything (I set it to "Paul" and that was enough to stop the scraper being blocked) but it's best to pick a user agent which matches the browser and operating system you're using. 

You can find out your own browser's user agent by [using this webpage](https://www.whatismybrowser.com/detect/what-is-my-user-agent) - then copy the results into the string like I have done below:

In [3]:
#store the url we want to scrape
theurl = "https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100"
#scrape the webpage at that url and store in 'html'
#without a user agent we get a 403 error on this webpage
#see https://github.com/sensiblecodeio/scraperwiki-python for documentation
html = scraperwiki.scrape(theurl, user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36")
#convert 'html' into an lxml object so we can drill into it
root = lxml.html.fromstring(html)
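
To see what a user agent looks like from the scraper's side, here is a minimal sketch using Python's standard `urllib.request` rather than `scraperwiki`. It only builds the request and does not send it. The `example.com` URL and the shortened user agent string are placeholders chosen for illustration.

```python
import urllib.request

# placeholder URL and user agent string, for illustration only
exampleurl = "https://example.com/"
useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"

# attach the user agent as a request header (nothing is sent yet)
request = urllib.request.Request(exampleurl, headers={"User-Agent": useragent})

# urllib stores header names capitalised as 'User-agent'
print(request.get_header("User-agent"))
```

The user agent is simply one of the headers sent along with the request, which is why changing one string can be enough to stop a scraper being blocked.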

## Drilling down into the HTML

Now we're ready to use `cssselect` to drill down further. 

We need to know what HTML tags we are targeting, so spend some time [looking at the webpage](https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100) and using *View source* to find the tags surrounding the types of data you want. 

For example, the name of each service seems to be inside `<th>` tags, so let's try that.

In [4]:
#grab the contents of every <th> tag
servicenames = root.cssselect('th')
#check how many results - there should be 100
len(servicenames)

103

This is promising - that's around the number we expected, but with perhaps 3 extra results. Let's look in more detail.

In [None]:
#Loop through the results
for i in servicenames:
  #print the text inside the tag
  print(i.text_content())

Address & contact details
Information supplied by
Description of service

                    Addictive Eaters Anonymous - Nottingham
            

                    Nottinghamshire Adult Eating Disorder Team
            

                    Nottinghamshire - Parents Support Group
            

                    Nottinghamshire Camhs Eating Disorder Team - Nottingham
            

                    Child And Adolescent Mental Health Services (Camhs) Eating Disorder Team
            

                    Mrs Katie Smith  Peaceful Horizons Ltd
            

                    Sharon Baker
            

                    Addictive Eaters Anonymous - Derbyshire
            

                    Helen Clare
            

                    Nottinghamshire Camhs Eating Disorder Team - Mansfield
            

                    Addictive Eaters Anonymous - Derby
            

                    Bespoke Therapy With Dr Rachel Evans (Phd)
            

    

That looks promising. The first three results are the overall table headings but after that we get the 100 titles we need. 

We can fix that by using a range of indices (a slice) instead - for example `[3:103]` would grab the items from index 3 to index 102 (it stops before 103).

Alternatively we could use a negative index, which counts from the end: `[-100:]` would start from the item 100 places from the end, onwards. Or, put another way, it would grab the last 100 items.
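
Here is a small sketch of both kinds of slice, using a made-up list of 103 strings as a stand-in for the scraped tags (3 headings followed by 100 service names).

```python
# a stand-in list: 3 table headings followed by 100 service names
items = ["heading"] * 3 + ["service " + str(n) for n in range(1, 101)]

# slice from index 3 up to (but not including) index 103
from_index_3 = items[3:103]
# slice the last 100 items using a negative index
last_100 = items[-100:]

print(len(items), len(from_index_3), len(last_100))
# both slices pick out the same 100 items
print(from_index_3 == last_100)
# an end index beyond the end of the list is simply clipped
print(items[3:104] == last_100)
```

The negative index has one advantage here: it keeps working even if the number of headings at the start changes, as long as the last 100 items are the ones we want.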

In [None]:
#Loop through the results
for i in servicenames[-100:]:
  #print the text inside the tag
  print(i.text_content())


                    Addictive Eaters Anonymous - Nottingham
            

                    Nottinghamshire Adult Eating Disorder Team
            

                    Nottinghamshire - Parents Support Group
            

                    Nottinghamshire Camhs Eating Disorder Team - Nottingham
            

                    Child And Adolescent Mental Health Services (Camhs) Eating Disorder Team
            

                    Mrs Katie Smith  Peaceful Horizons Ltd
            

                    Sharon Baker
            

                    Addictive Eaters Anonymous - Derbyshire
            

                    Helen Clare
            

                    Nottinghamshire Camhs Eating Disorder Team - Mansfield
            

                    Addictive Eaters Anonymous - Derby
            

                    Bespoke Therapy With Dr Rachel Evans (Phd)
            

                    Derby Camhs Eating Disorder Service
            

     

Let's repeat the process for the other details. Helpfully, the telephone number always seems to be inside the tag `<p class="fctel">`.

How many of those can we grab?

In [None]:
#grab the contents of each <p class="fctel"> tag
tels = root.cssselect('p.fctel')
#count how many matches are in that list
len(tels)

92

Let's just check the text of the first one:

In [None]:
tels[0].text_content()

'Tel: 03301333615'

This time, instead of having a few too many, we are 8 short. Could this be because some entries don't include telephone numbers? Let's try a different tag - the `<div class="fcdetailsleft">` tag that contains the paragraph tag.

In [None]:
#grab each <div class="fcdetailsleft"> tag (the container around the phone number and address)
tels = root.cssselect('div.fcdetailsleft')
#count how many matches are in that list
len(tels)

100

That's better. But this tag contains both the paragraph tag for the phone number and the paragraph tag for the address.

In [None]:
tels[0].text_content()

'\r\n        Tel: 03301333615\r\n        \r\nStation Street\r\n        Nottingham\r\n             NG2 3NG    \r\n'

That's not a big problem. We could clean this up to split the two out - for example on those `\n` (new line breaks) and store them separately.

Why don't we want 92 perfect phone numbers instead, though? Well, if we're going to create a data frame from this data we need the data to line up - 100 cells of headings with 100 cells of phone numbers and 100 cells of addresses. If there are only 92 phone numbers then for the other 8 we will need to add 'no phone listed' to make it up to 100. More on this later.

## Capturing both 'columns' of data

Now we've used trial and error to find the data we need, let's bring them together.

In [None]:
#grab the contents of every <th> tag
servicenames = root.cssselect('th')
#limit to the last 100
servicenames = servicenames[-100:]
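#grab each <div class="fcdetailsleft"> tag (it holds the phone number and address)
tels = root.cssselect('div.fcdetailsleft')

#Create a dataframe to store the data we are about to scrape
#It starts with two columns called 'service' and 'details'
#Note: the rows added below use the keys 'servicename' and 'tel' instead,
#so pandas adds those as extra columns and 'service' and 'details' stay empty
#We call this dataframe 'df'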
df = pandas.DataFrame(columns=["service","details"])

#Because we need to loop through two lists of the same length, we can instead 
#loop through a range of indices, generated using the range function
for i in range(0,100):
  #extract the text from that index in servicenames
  servicename = servicenames[i].text_content()
  #and print it
  print(servicename)
  #repeat for the item at that index in tels
  tel = tels[i].text_content()
  print(tel)
  #then add to the df
  #(note: df.append was removed in pandas 2.0, so this needs an older pandas)
  df = df.append({
      "servicename" : servicename,
      "tel" : tel
  }, ignore_index=True)

print(df)


                    Addictive Eaters Anonymous - Nottingham
            

        Tel: 03301333615
        
Station Street
        Nottingham
             NG2 3NG    


                    Nottinghamshire Adult Eating Disorder Team
            

        Tel: 0115 876 0162
        
Mandala Centre
    Gregory Boulevard
    Nottingham
             NG7 6LB    


                    Nottinghamshire - Parents Support Group
            

        Tel: 0115 956 0866
        
Thorneywood
    Child And Adolescent Mental Health Services , Porchester Road
    Nottingham
             NG3 6LF    


                    Nottinghamshire Camhs Eating Disorder Team - Nottingham
            

        Tel: 0115 841 5812
        
Thorneywood
    Child And Adolescent Mental Health Services , Porchester Road
    Nottingham
             NG3 6LF    


                    Child And Adolescent Mental Health Services (Camhs) Eating Disorder Team
            

        Tel: 0115

## Improving the scraper

Now we've succeeded in scraping those two pieces of information on 100 organisations, on one webpage, we can start to think about improving the scraper. For example:

* We could split the telephone and address within the scraper
* We could grab the link to each organisation's 'detail' page and add that
* We could get the scraper to run on subsequent pages of results

First, let's export the results so we have a copy of those.

In [None]:
#And we can export it
df.to_csv("scrapeddata.csv")

## Improvement 1: Cleaning/splitting the data

Here's how we could split the telephone and address, and clean it a little as well:

In [None]:
tels[0].text_content()

'\r\n        Tel: 03301333615\r\n        \r\nStation Street\r\n        Nottingham\r\n             NG2 3NG    \r\n'

The method for splitting strings of text in Python is, well, `.split()`. It needs to be attached by a period to the string you want to split, and inside the parentheses you need to specify what you want to split it on.

The result of `.split()` will always be a list, even if it can't split the string. Below we create a variable to store the first string of text containing the telephone and address, and then split that variable on `"\r\n        "`: a carriage return and new line (`\r\n`) followed by a run of spaces.
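
A quick sketch of that behaviour, using two short made-up strings:

```python
# splitting on a comma returns a list with one item per piece
print("a,b,c".split(","))
# if the separator isn't found we still get a list - with one item
print("abc".split(","))
```

In the second case the list contains the whole original string as its only item.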

In [None]:
#store the text contents of the first item in tels in a variable called 'firsttel'
firsttel = tels[0].text_content()
#split it on the "\r\n        " between each item of info 
firsttel.split("\r\n        ")

['',
 'Tel: 03301333615',
 '\r\nStation Street',
 'Nottingham',
 '     NG2 3NG    \r\n']

You can tell this is a list because of the square brackets - the telephone is in the second item in that list (the first item is empty) and the postcode is the last item.

Notice that the string that we're splitting on (`"\r\n        "`) is removed when it splits. Two `"\r\n"` sequences remain (one before 'Station Street' and one at the very end) because they are not followed by the same number of spaces as we specified.

Knowing this we can access items in that list like so:

In [None]:
#store the text contents of the first item in tels in a variable called 'firsttel'
firsttel = tels[0].text_content()
#split it on the "\r\n        " between each item of info - and store in another variable
splittel = firsttel.split("\r\n        ")
#show the second item (index 1)
print(splittel[1])
#show the last item 
print(splittel[-1])

Tel: 03301333615
     NG2 3NG    



Now let's incorporate that knowledge into our scraper code:

In [None]:
#grab the contents of every <th> tag
servicenames = root.cssselect('th')
#limit to the last 100
servicenames = servicenames[-100:]
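#grab each <div class="fcdetailsleft"> tag (it holds the phone number and address)
tels = root.cssselect('div.fcdetailsleft')

#Create a dataframe to store the data we are about to scrape
#It starts with two columns called 'service' and 'details'
#Note: the rows added below use different keys ('servicename', 'tel' and so on),
#so pandas adds those as extra columns and 'service' and 'details' stay empty (NaN)
#We call this dataframe 'df'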
df = pandas.DataFrame(columns=["service","details"])

#Because we need to loop through two lists of the same length, we can instead 
#loop through a range of indices, generated using the range function
for i in range(0,100):
  #extract the text from that index in servicenames
  servicename = servicenames[i].text_content()
  #repeat for the item at that index in tels
  tel = tels[i].text_content()
  #split it on the "\r\n        " between each item of info - and store in another variable
  splittel = tel.split("\r\n        ")
  #store the second item (index 1)
  telno = splittel[1]
  #store the last item - we also strip out white space using the strip() function
  postcode = splittel[-1].strip()
  #then add to the df
  #(note: df.append was removed in pandas 2.0, so this needs an older pandas)
  df = df.append({
      "servicename" : servicename,
      "tel" : telno,
      "postcode" : postcode,
      "tel_and_address" : tel
  }, ignore_index=True)

print(df)

   service  ...                                    tel_and_address
0      NaN  ...  \r\n        Tel: 03301333615\r\n        \r\nSt...
1      NaN  ...  \r\n        Tel: 0115 876 0162\r\n        \r\n...
2      NaN  ...  \r\n        Tel: 0115 956 0866\r\n        \r\n...
3      NaN  ...  \r\n        Tel: 0115 841 5812\r\n        \r\n...
4      NaN  ...  \r\n        Tel: 0115 844 0524\r\n        \r\n...
..     ...  ...                                                ...
95     NaN  ...  \r\n        Tel: 01733 391537\r\n        \r\nP...
96     NaN  ...  \r\n        Tel:  01244 397 397\r\n        \r\...
97     NaN  ...  \r\n        Tel: 07881 776562\r\n        \r\n ...
98     NaN  ...  \r\n        Tel: 03301333615\r\n        \r\nBr...
99     NaN  ...  \r\n        Tel: 03301333615\r\n        \r\nMa...

[100 rows x 6 columns]
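
In the output above the `service` column is all `NaN` because the data frame was created with the columns `service` and `details`, while each row added uses different keys. The sketch below shows one possible alternative, not the notebook's own method: it collects each row as a dictionary in a list and builds the data frame once at the end, which also avoids `df.append` (removed in pandas 2.0). The two short lists are made-up stand-ins for the scraped data.

```python
import pandas

# made-up stand-ins for the text scraped from the page
samplenames = ["Example Service A", "Example Service B"]
sampledetails = ["Tel: 00000 000000 Example Street", "Example Road"]

# collect one dictionary per row
rows = []
# zip pairs up the items at the same position in each list
for name, detail in zip(samplenames, sampledetails):
    rows.append({"servicename": name, "tel_and_address": detail})

# build the data frame in one go, with column names that match the keys
sampledf = pandas.DataFrame(rows, columns=["servicename", "tel_and_address"])
print(sampledf.shape)
# count the empty cells - there should be none
print(sampledf.isna().sum().sum())
```

Building the data frame once from a list also tends to be faster than adding rows one at a time, because each append creates a whole new data frame.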


But what about those 8 entries where there wasn't a telephone number? Well, in those cases, after splitting, that second item won't be the telephone number. So we'd need to clean those, either afterwards in a spreadsheet such as Excel or here in the scraper.

Here's one way of seeing which ones they are:

In [None]:
#loop through the column of telephone numbers
for i in df['tel']:
  #if the first 3 characters are not "Tel"
  if i[:3] != "Tel":
    #then print the entry
    print(i)


Beechwood Park Drive

Nedcash, The Annexe
    Holywell Health Centre, Holywell Street
    Chesterfield

Ben, Lg88
    Bennett Building , University Road
    Leicester

Parkway
    39 Park Street
    Worksop

Oak House
    Moorhead Way , Bramley
    Rotherham

Main Office

Hales, Red Lane
    Burton Green
    Nr Kenilworth
    Warwickshire

Beech House
    20 Buxton Rd
    Cheshire
    Cheshire
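
Here is one hedged sketch of how that cleaning could be done inside the scraper. The `parse_details` function is a hypothetical helper, not part of any library, and it assumes that the phone number (when present) is the first non-empty line and that the postcode is the last one. The second sample string is invented to imitate an entry with no phone number.

```python
def parse_details(text):
    """Split one block of details into phone number, address and postcode."""
    # break into lines, trim the white space, and drop the empty ones
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    # only treat the first line as a phone number if it starts with 'Tel'
    if lines and lines[0].startswith("Tel"):
        tel = lines[0].replace("Tel:", "", 1).strip()
        addresslines = lines[1:]
    else:
        tel = "no phone listed"
        addresslines = lines
    # assumption: the last remaining line is the postcode
    postcode = addresslines[-1] if addresslines else ""
    address = ", ".join(addresslines[:-1])
    return {"tel": tel, "address": address, "postcode": postcode}

# the first entry from the page, as shown earlier
withphone = '\r\n        Tel: 03301333615\r\n        \r\nStation Street\r\n        Nottingham\r\n             NG2 3NG    \r\n'
# an invented entry with no phone number
nophone = '\r\n        \r\nExample House\r\n    Example Town\r\n             ZZ1 1ZZ    \r\n'

print(parse_details(withphone))
print(parse_details(nophone))
```

Checking for `"Tel"` before storing the phone number means the address can no longer end up in the telephone column by mistake.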


## Improvement 2: Grabbing the links to detail pages

So far we have used `.text_content()` in our code to indicate that we want to grab the text contents between the opening and closing tags that we targeted (e.g. `<th>` and `</th>`).

But what if we want to grab the links? Links aren't text - they're part of the HTML tag itself. 

Specifically, a link is created using an `<a>` tag, with the `href=` attribute of that tag specifying the address to go to. In full that looks something like this:

`<a href="http://bbc.co.uk">`

More succinctly, what we want to do is grab the **value** of the `href` attribute of the `a` tag.

There are two things we need to change in our code to do this:

* First, using `cssselect` we need to target the `a` tags
* Secondly, we need to use `.attrib['href']` rather than `.text_content()` to specify that we want to grab the href attribute's value, not the text.

The full code is below but the two key lines are these:

`servicenames = root.cssselect('th a')`

And, later on:

`serviceurl = servicenames[i].attrib['href']`

Also, there's an extra line to store it in the data frame in the `df.append` section.

In [None]:
#grab the contents of every <a> within a <th> tag
servicenames = root.cssselect('th a')
#limit to the last 100
servicenames = servicenames[-100:]
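#grab each <div class="fcdetailsleft"> tag (it holds the phone number and address)
tels = root.cssselect('div.fcdetailsleft')

#Create a dataframe to store the data we are about to scrape
#It starts with two columns called 'service' and 'details'
#Note: the rows added below use different keys ('serviceurl', 'servicename' and so on),
#so pandas adds those as extra columns and 'service' and 'details' stay empty (NaN)
#We call this dataframe 'df'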
df = pandas.DataFrame(columns=["service","details"])

#Because we need to loop through two lists of the same length, we can instead 
#loop through a range of indices, generated using the range function
for i in range(0,100):
  #extract the text from that index in servicenames
  servicename = servicenames[i].text_content()
  serviceurl = servicenames[i].attrib['href']
  #repeat for the item at that index in tels
  tel = tels[i].text_content()
  #split it on the "\r\n        " between each item of info - and store in another variable
  splittel = tel.split("\r\n        ")
  #store the second item (index 1)
  telno = splittel[1]
  #store the last item - we also strip out white space using the strip() function
  postcode = splittel[-1].strip()
  #then add to the df
  #(note: df.append was removed in pandas 2.0, so this needs an older pandas)
  df = df.append({
      "serviceurl" : serviceurl,
      "servicename" : servicename,
      "tel" : telno,
      "postcode" : postcode,
      "tel_and_address" : tel
  }, ignore_index=True)

print(df)

   service  ...                                    tel_and_address
0      NaN  ...  \r\n        Tel: 03301333615\r\n        \r\nSt...
1      NaN  ...  \r\n        Tel: 0115 876 0162\r\n        \r\n...
2      NaN  ...  \r\n        Tel: 0115 956 0866\r\n        \r\n...
3      NaN  ...  \r\n        Tel: 0115 841 5812\r\n        \r\n...
4      NaN  ...  \r\n        Tel: 0115 844 0524\r\n        \r\n...
..     ...  ...                                                ...
95     NaN  ...  \r\n        Tel: 01733 391537\r\n        \r\nP...
96     NaN  ...  \r\n        Tel:  01244 397 397\r\n        \r\...
97     NaN  ...  \r\n        Tel: 07881 776562\r\n        \r\n ...
98     NaN  ...  \r\n        Tel: 03301333615\r\n        \r\nBr...
99     NaN  ...  \r\n        Tel: 03301333615\r\n        \r\nMa...

[100 rows x 7 columns]


To check the results, let's print just that column:

In [None]:
print(df['serviceurl'])

0     /ServiceDirectories/Pages/GenericServiceDetail...
1     /ServiceDirectories/Pages/GenericServiceDetail...
2     /ServiceDirectories/Pages/GenericServiceDetail...
3     /ServiceDirectories/Pages/GenericServiceDetail...
4     /ServiceDirectories/Pages/GenericServiceDetail...
                            ...                        
95    /ServiceDirectories/Pages/GenericServiceDetail...
96    /ServiceDirectories/Pages/GenericServiceDetail...
97    /ServiceDirectories/Pages/GenericServiceDetail...
98    /ServiceDirectories/Pages/GenericServiceDetail...
99    /ServiceDirectories/Pages/GenericServiceDetail...
Name: serviceurl, Length: 100, dtype: object


Note that these are **partial** URLs, which is quite common - we will need to add the **base URL** (`https://www.nhs.uk`) to make them work.

That could be done by amending this line of code:

`serviceurl = servicenames[i].attrib['href']`

To:

`serviceurl = "https://www.nhs.uk"+servicenames[i].attrib['href']`
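
As an alternative to joining the strings with `+`, Python's standard library has a `urljoin()` function for this job. In the sketch below the partial URL is a made-up example path, not one of the real links from the page.

```python
from urllib.parse import urljoin

baseurl = "https://www.nhs.uk"
# a made-up partial URL, for illustration only
partialurl = "/example/detail-page?id=1"

# combine the base URL and the partial URL
print(urljoin(baseurl, partialurl))
# a URL that is already complete is left as it is
print(urljoin(baseurl, "https://example.com/other-page"))
```

This can be useful when a page mixes partial and full URLs, because `urljoin()` handles both.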

## Improvement 3: Scraping multiple pages

To scrape beyond the first page of results we need to take another look at that first URL:

`https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100`

Now, going to that page, we click on 'next' or the link for the second page of results, and then copy that URL for comparison:

`https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=2`

You'll notice the URL changes in some key ways - especially towards the end. Here are the new bits:

* `&isNational=0`
* `&totalItems=805`
* `&currentPage=2`

We can work out that `totalItems=805` refers to the number of results, which we can see displayed on the page itself ("Showing 101-200 of 805 results"). 

But the useful bit is `&currentPage=2` - if we change that number to `1` and try this URL then we get the first 100 results again:

`https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=1`

Knowing this, we can loop through a list of page numbers to generate the URLs, like this:

In [7]:
#first, store the URL up to the page number
firsturlpart = "https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage="
#next create a list of page numbers
pagelist = [1,2,3]
#then loop through them and add to the URL
for i in pagelist:
  #convert number to string so it can be combined with URL
  pagenumberasstring = str(i)
  #print the page number
  print(pagenumberasstring)
  #combine that with URL
  pageurl = firsturlpart+pagenumberasstring
  #print the resulting combination of strings
  print(pageurl)

1
https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=1
2
https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=2
3
https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=3


Note that we have to convert the page number to a string, using the `str()` function, because combining a string and a number results in an error, like this:

In [11]:
print("page"+1)

TypeError: ignored
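
Here is a short sketch of two ways around that error, along with a way to catch the error itself and see the message:

```python
page = 1

# option 1: convert the number to a string first
print("page" + str(page))
# option 2: use an f-string, which does the conversion for us
print(f"page{page}")

# trying to combine them directly raises a TypeError
try:
    print("page" + page)
except TypeError as error:
    print("TypeError:", error)
```

Both options produce the same string, so which one to use is largely a matter of taste.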

Rather than manually create a list of numbers, you can google for "create range of numbers with Python" which will [lead](https://www.w3schools.com/python/ref_func_range.asp) you to the `range()` function:

In [16]:
for i in range(1,10):
  print(i)

1
2
3
4
5
6
7
8
9


Note that the 'end' number in the `range()` function (in the above example it's 10) is not included in the range. It ends *before* that number.

Now we can amend our code:

In [17]:
#first, store the URL up to the page number
firsturlpart = "https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage="
#next create a range of page numbers from 1 to 5
pagelist = range(1,6)
#then loop through them and add to the URL
for i in pagelist:
  #convert number to string so it can be combined with URL
  pagenumberasstring = str(i)
  #combine that with URL
  pageurl = firsturlpart+pagenumberasstring
  #print the resulting combination of strings
  print(pageurl)

https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=1
https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=2
https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=3
https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=4
https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=5
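
Rather than picking the last page number by hand, we could work it out from the number of results. The sketch below assumes that the 805 results are split into pages of 100 with a shorter final page, which would give 9 pages in total. That is an inference from the URL, not something checked against the site.

```python
import math

# figures taken from the URL: 805 results, 100 per page
totalitems = 805
perpage = 100
# round up, because the final page is only partly full
lastpage = math.ceil(totalitems / perpage)

firsturlpart = "https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage="

# build a list of URLs, one for each page
pageurls = []
# add 1 to the end of the range because range() stops before it
for i in range(1, lastpage + 1):
    pageurls.append(firsturlpart + str(i))

print(lastpage)
print(len(pageurls))
print(pageurls[-1])
```

Storing the URLs in a list, rather than just printing them, means they are ready to be looped through when we come to scrape each page.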


The next step is to run the scraping code on each page. Because this involves running the same block of code over and over again (the scraper part), we are best storing that code in a **function**. This is covered in the next notebook...