# Libraries in Python

A library is a collection of recipes (functions) and other stuff that someone has created for a particular type of problem. It means you don't have to write all the code yourself - you just need to find out how to use theirs!

For scraping there are a few useful libraries that I'm going to show here:

* The `requests` library is a library for fetching files (including webpages) from URLs
* The `bs4` (Beautiful Soup 4) library has a collection of tools for solving scraping problems
* `lxml.html` is a library for converting HTML into a tree structure (similar to XML) that code can navigate
* `cssselect` is a library for drilling into those converted objects using CSS selectors
* `pandas` is a library for data analysis
* The `trafilatura` library is a simple but effective library for grabbing text content from webpages

Some libraries come pre-installed on Colab, while others need installing first. All libraries need importing.

How do you know whether you need to install a library in Colab? Well, as is often the case in coding, you simply need to use trial and error.

In the two code blocks below I've tried to use the `import` command to import two libraries: `requests` and `trafilatura`.

The first block works without any problems. The second block, however, generates an error - because the `trafilatura` library hasn't been installed on Colab.

In [None]:
#import the requests library for fetching URLs
import requests

In [None]:
#try to import the trafilatura library for scraping text from webpages
import trafilatura

ModuleNotFoundError: ignored

## Error messages when importing libraries

The error message gives some clues on what we can do to fix this problem:

> "If your import is failing due to a missing package, you can manually install dependencies using either !pip or !apt."

More specifically, to install a library you need to use `!pip install` followed by the name of the library. Like in the code block below:

In [None]:
!pip install trafilatura
import trafilatura

Collecting trafilatura
  Downloading trafilatura-1.6.1-py3-none-any.whl (1.0 MB)
Collecting courlan>=0.9.3 (from trafilatura)
  Downloading courlan-0.9.3-py3-none-any.whl (41 kB)
Collecting htmldate>=1.4.3 (from trafilatura)
  Downloading htmldate-1.5.0-py3-none-any.whl (40 kB)
Collecting justext>=3.0.0 (from trafilatura)
  Downloading jusText-3.0.0-py2.py3-none-any.whl (837 kB)
Collecting tld>=0.13 (from courlan>=0.9.3->trafilatura)
  Downloading tld-0.13-py2.py3-none-any.whl (263 kB)

Another reason for an error might be that we name the library incorrectly. In the code block below we try to import a library called `BeautifulSoup` - again we get an error saying that there is no module of this name.

In [None]:
#attempt to import a library called 'BeautifulSoup'
import BeautifulSoup

ModuleNotFoundError: ignored

In this case it's because `BeautifulSoup` isn't the name you import - the Beautiful Soup library is imported under the name `bs4`.

In [None]:
#import the bs4 library
import bs4
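
If you'd rather check than wait for an error, here is a minimal sketch that uses Python's built-in `importlib` module to test whether a library can be imported. The helper name `is_installed` is made up for this example, and it only handles simple names like `bs4` (not dotted ones like `lxml.html`).

```python
import importlib.util

def is_installed(module_name):
    """Return True if a module with this name can be imported."""
    # find_spec returns None when no module of that name is found
    return importlib.util.find_spec(module_name) is not None

# 'json' ships with Python, so this prints True
print(is_installed("json"))
# 'BeautifulSoup' is not the importable name (that is 'bs4'); shown for comparison
print(is_installed("BeautifulSoup"))
```

If the result is `False`, either the library needs installing with `!pip install` or the name is not the one Python expects.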

### Importing just one part of a library

Sometimes you want just one part of a library - and this is often the case with Beautiful Soup. When you come across tutorials or code examples, or ask ChatGPT or Gemini to suggest code, they will often not simply import `bs4`, but import *from* that library with the code:

`from bs4 import BeautifulSoup`

If you see this code all you need to know is that it means it is importing a specific part of a library, rather than the whole thing.

In [None]:
#import the BeautifulSoup function from the bs4 library
from bs4 import BeautifulSoup

### Renaming a library while importing

Another variation you'll often see when importing libraries is code like this:

`import pandas as pd`

This not only imports the `pandas` library (which is pre-installed on Colab and so doesn't need installing first), but it also *renames* it - as the shorter `pd`.

Why? Well, it *does* save time to only have to type `pd` rather than `pandas` each time you want to use it in your code, but ultimately it's up to you.

It's also a widely used convention, so it does make things easier when following tutorials that follow that convention too.

We are going to follow that convention here.

In [None]:
import pandas as pd
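
To see the three import styles side by side, here is a small sketch using `math`, a library that comes with Python (chosen only because it needs no installing). The short name `m` is an arbitrary choice for illustration, not a convention.

```python
# 1. Import the whole library, then use library.function()
import math
print(math.sqrt(16))

# 2. Import just one part, then use it without the library name
from math import sqrt
print(sqrt(16))

# 3. Import the whole library under a shorter name
import math as m
print(m.sqrt(16))
```

All three lines run the same function; only the way you refer to it changes.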

## Using a library - and functions

Once imported, we can use that library - or more specifically: we can use the **functions** in that library.

A function is a name for a **recipe** in coding - you may have used them in Excel already, e.g. `SUM`, `AVERAGE`, or `VLOOKUP`.

A function is always followed by parentheses to ‘pass’ any ingredients, e.g. `=SUM(A1:A10)`

In Python, you tend to use a function by also including the name of the library it is from, and a period, e.g. the `requests` function `get` is used by writing `requests.get("http://bbc.co.uk")` (in this case it's being asked to scrape the BBC webpage).

Below are some examples of functions from our libraries being used, with the results being printed immediately after:

In [None]:
#scrape the specified URL and store in a variable called 'scrapedpage'
scrapedpage = requests.get("https://www.bbc.co.uk/news")
#print that variable - this shows a summary of the response, not the HTML
#to see the HTML we need the 'content' property of that variable - more on this later
print(scrapedpage)

<Response [200]>


So `requests.get()` is the `get()` function from the `requests` library. The *ingredient* we give to that function is the URL string `"https://www.bbc.co.uk/news"`.

The `get()` function basically fetches the whole webpage at a given address (the ingredient it's given).

The results of running that function are stored in a new variable called `scrapedpage` (you could call this anything).

If we try to print that we get something unexpected: `<Response [200]>`.

This is worth some further explanation.
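
Part of the explanation is the number itself: `200` is the HTTP status code, and it means the request succeeded (a code such as `404` would mean the page was not found). As a rough sketch, the hypothetical helper `fetch_html` below (not part of `requests`) only hands back the HTML when the status code is 200. The 10-second `timeout` is an arbitrary choice, and the `.text` property it returns is explained shortly.

```python
import requests

def fetch_html(url):
    """Fetch a URL and return its HTML as text, or None if the request failed."""
    # timeout stops the code waiting forever if the server never answers
    response = requests.get(url, timeout=10)
    # 200 means 'OK'; anything else suggests something went wrong
    if response.status_code != 200:
        print("Request failed with status code", response.status_code)
        return None
    return response.text
```

You could then write `html = fetch_html("https://www.bbc.co.uk/news")` and check whether `html` is `None` before going any further.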

### Objects and their properties

When using library functions to create variables, those variables are often referred to as an 'object'.

In this case, we have created something that might be called a 'requests object' (because it was created by the `requests` library).

Objects often have particular properties that can be called with other parts of the library.

When a webpage is fetched using `requests`, the resulting 'object' has a number of special properties created by the `get()` function.

These properties can be accessed by adding their name to the end of the variable, after a period. (Objects also have **methods**, which are attached in the same way but end in parentheses.) You'll find a [list of properties and associated methods here](https://www.w3schools.com/python/ref_requests_response.asp).

The most useful property is `.content`: this will show the content of that object - the HTML.

To do this, add `.content` to the end of the name of the requests object (the variable).

In [None]:
print(scrapedpage.content)

b'<!DOCTYPE html>\n<html lang="en-GB" class="b-pw-1280 b-reith-sans-font no-touch" id="responsive-news">\n<head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=1">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n    <meta name="google-site-verification" content="Tk6bx1127nACXoqt94L4-D-Of1fdr5gxrZ7u2Vtj9YI">\n    <link href="//static.bbc.co.uk" rel="preconnect" crossorigin>\n    <link href="//m.files.bbci.co.uk" rel="preconnect" crossorigin>\n    <link href="//nav.files.bbci.co.uk" rel="preconnect" crossorigin>\n    <link href="//ichef.bbci.co.uk" rel="preconnect" crossorigin>\n    <link rel="dns-prefetch" href="//mybbc.files.bbci.co.uk">\n    <link rel="dns-prefetch" href="//ssl.bbc.co.uk/">\n    <link rel="dns-prefetch" href="//sa.bbc.co.uk/">\n    <link rel="dns-prefetch" href="//ichef.bbci.co.uk">\n\n\n    <link rel="preload" as="style" href="//m.files.bbci.co.uk/modules/bbc-morph-news-page-styl

The `.text` property is also quite useful: this returns the content as text (a Unicode string) rather than bytes (which can be useful in some situations).

In [None]:
scrapedpage.text

'<!DOCTYPE html>\n<html lang="en-GB" class="b-pw-1280 b-reith-sans-font no-touch" id="responsive-news">\n<head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=1">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n    <meta name="google-site-verification" content="Tk6bx1127nACXoqt94L4-D-Of1fdr5gxrZ7u2Vtj9YI">\n    <link href="//static.bbc.co.uk" rel="preconnect" crossorigin>\n    <link href="//m.files.bbci.co.uk" rel="preconnect" crossorigin>\n    <link href="//nav.files.bbci.co.uk" rel="preconnect" crossorigin>\n    <link href="//ichef.bbci.co.uk" rel="preconnect" crossorigin>\n    <link rel="dns-prefetch" href="//mybbc.files.bbci.co.uk">\n    <link rel="dns-prefetch" href="//ssl.bbc.co.uk/">\n    <link rel="dns-prefetch" href="//sa.bbc.co.uk/">\n    <link rel="dns-prefetch" href="//ichef.bbci.co.uk">\n\n\n    <link rel="preload" as="style" href="//m.files.bbci.co.uk/modules/bbc-morph-news-page-style
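
The difference between the two outputs is the `b` at the start of the first one, which marks it as bytes. Here is a minimal sketch of what that means, using a made-up snippet of HTML rather than a live page. It assumes the page is encoded as UTF-8; `.text` has to work out the encoding for itself, which this sketch skips.

```python
# bytes: the kind of thing .content gives you (note the b before the quotes)
raw = b"<p>Caf\xc3\xa9 prices rise</p>"
# decode the bytes into ordinary text, which is roughly what .text does for you
decoded = raw.decode("utf-8")

print(raw)
print(decoded)
print(type(raw), type(decoded))
```

In the bytes version the accented letter appears as the codes `\xc3\xa9`; once decoded it is a normal character.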

## Using CSS selectors to grab information from a webpage with BeautifulSoup

The `requests` library is used very briefly: its `get()` function is used to fetch a webpage from a URL, and that's it.

Once fetched, we can switch to our second library, `bs4`, and its `BeautifulSoup` function.

This will allow us to convert that long string of HTML into another 'object' with a structure that we can drill down into.

Let's put all the code from above into just one block, with the `print` commands removed, and start to add some lines using `BeautifulSoup()`...

In [None]:
#scrape the specified URL and store in a variable called 'scrapedpage'
scrapedpage = requests.get("https://www.bbc.co.uk/news")
#convert the string of HTML text into a special type of variable that we can drill into
#we've called it 'soup'
soup = BeautifulSoup(scrapedpage.content)

#That 'object' can be drilled into now, though, using .select()
#We specify the CSS selectors that describe the HTML tags we want from that page we scraped
h2s = soup.select('h2')
#print the variable we just created
print(h2s)
#and print how many items there are in that list variable, using another function - len()
print(len(h2s))

[<h2>Accessibility links</h2>, <h2 class="gs-u-vh">News Navigation</h2>, <h2 class="nw-c-breaking-news-banner__h2 gel-paragon-bold"><span aria-hidden="true">Breaking</span><span class="gs-u-vh">Breaking news</span></h2>, <h2 class="gs-u-vh">Top Stories</h2>, <h2 class="gel-double-pica-bold" id="nw-c-cluster2-heading__title">Hurricane Idalia</h2>, <h2 class="gel-double-pica-bold" id="nw-c-must-see-heading__title">Must see</h2>, <h2 class="gel-double-pica-bold" id="nw-c-most-watched-heading__title">Most watched</h2>, <h2 class="gel-double-pica-bold" id="nw-c-full-story-heading__title">Full Story</h2>, <h2 class="gel-double-pica-bold" id="nw-c-most-read-heading__title" tabindex="-1">Most read</h2>, <h2 class="gel-double-pica-bold" id="nw-c-around-the-bbc-heading__title">Around the BBC</h2>, <h2 class="gel-double-pica-bold" id="nw-c-sport-heading__title">Sport</h2>, <h2 class="gel-double-pica-bold" id="social-slice__title">Find us here</h2>, <h2 class="gs-u-vh">News Navigation</h2>, <h2 cl

### What's this 'soup'?

The first time we use BeautifulSoup is in this line:

`soup = BeautifulSoup(scrapedpage.content)`

This uses the `BeautifulSoup()` function that we imported from the `bs4` library.

The *ingredient* we give to that function is `scrapedpage.content` - that's the `scrapedpage` variable created in the line above, with `.content` added to specify that we want to grab the HTML content of that object.

The results of all this are stored in another new variable, `soup`.

This is now a BeautifulSoup object, which means we can start to use BeautifulSoup functions with it.
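
One optional extra: `BeautifulSoup()` accepts a second ingredient naming the parser it should use. If you leave it out, Beautiful Soup picks one for you and may show a warning. Here is a minimal sketch, using a short made-up string of HTML instead of a live page, and Python's built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# a small made-up piece of HTML standing in for scrapedpage.content
html = "<html><body><h2>Top Stories</h2><h2>Most read</h2></body></html>"

# the second ingredient names the parser; "html.parser" comes with Python
soup = BeautifulSoup(html, "html.parser")

h2s = soup.select("h2")
print(len(h2s))
```

Naming the parser means the code behaves the same way whichever other parsers happen to be installed.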

### Working with the 'soup'

We now start to work with that `soup` with the next line:

`h2s = soup.select('h2')`

Let's break that down.

1. `h2s =` - creates a new variable called `h2s` which is going to be used to store whatever comes after that equals sign.
2. Then we come to `soup` - that's the variable that was created in a previous line, when we converted our HTML webpage into a special object. This special object allows us to use...
3. ...the `select()` function, which is joined to `soup` with a period: `soup.select()`
4. Finally, in parentheses after the `select` function is the **selector(s)** that we want to target: `('h2')`

There's a lot going on there, but the key thing to remember is that you don't need to change any of this code apart from the selector in parentheses at the end.

At the moment this selector is `'h2'` which means it will select any `<h2>` tags in the webpage that was scraped (in this case the BBC News front page). But we'll go more into selectors in a moment.

Oh, and also in the output from the code we can see the results of the two `print` commands:

* first, printing the variable `h2s` we can see that it's a bunch of HTML tags - because the output here starts with square brackets we can guess it's a list. (You might notice that the list consists of a series of items starting and ending with `<h2> ... </h2>` and separated by commas. They look like strings, but each one is a Beautiful Soup object)
* second, printing the results of using the `len` function on that variable - or `len(h2s)` - tells us how many items are in that list: 14.

## Using CSS selectors

**CSS selectors** are used to target different elements in a HTML page. A basic selector can target just one type of HTML tag, like `<h2>` or `<p>`, but you can also target a combination of tags (such as any `<strong>` tags within `<p>` tags).

More complicated selectors can also be used to target tags based on their attributes (e.g. not just `<p>` but specifically `<p class="summary">`).

You can find lots of resources to help you with CSS selectors, such as [this one](https://www.w3schools.com/cssref/css_selectors.asp). Many will relate to styling webpages (which is how CSS selectors are most often used - selectors are used to target the HTML elements that you want to style), but the principles are the same.
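
To make those selector types concrete, here is a small sketch run against made-up HTML (not the BBC page). It tries a single tag, a tag inside another tag, a tag with a class, and a tag inside an element with a particular id.

```python
from bs4 import BeautifulSoup

# made-up HTML for illustration
html = """
<div id="main">
  <h2 class="headline">First story</h2>
  <p class="summary">A <strong>bold</strong> claim.</p>
  <p>Another paragraph with <a href="/more">a link</a>.</p>
</div>
<strong>Not in a paragraph</strong>
"""
soup = BeautifulSoup(html, "html.parser")

# one type of tag
all_strong = soup.select("strong")
# a tag inside another tag: <strong> within <p>
strong_in_p = soup.select("p strong")
# a tag with a particular class: <p class="summary">
summaries = soup.select("p.summary")
# a tag inside the element with id="main"
links_in_main = soup.select("#main a")

print(len(all_strong), len(strong_in_p), len(summaries), len(links_in_main))
```

The plain `strong` selector finds both `<strong>` tags, while `p strong` finds only the one that sits inside a paragraph.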


## `select` always produces a list - and the items always need decoding

It's worth pointing out that `select` always generates a *list* - even if it finds one match, or even none.

To work with those items we will need to either loop through them (as is often the case with lists), or access them by position (i.e. first, second, last, etc.)

In [None]:
#start looping through our list
for thing in h2s:
  #print each item
  print(thing)

<h2>Accessibility links</h2>
<h2 class="gs-u-vh">News Navigation</h2>
<h2 class="nw-c-breaking-news-banner__h2 gel-paragon-bold"><span aria-hidden="true">Breaking</span><span class="gs-u-vh">Breaking news</span></h2>
<h2 class="gs-u-vh">Top Stories</h2>
<h2 class="gel-double-pica-bold" id="nw-c-cluster2-heading__title">Hurricane Idalia</h2>
<h2 class="gel-double-pica-bold" id="nw-c-must-see-heading__title">Must see</h2>
<h2 class="gel-double-pica-bold" id="nw-c-most-watched-heading__title">Most watched</h2>
<h2 class="gel-double-pica-bold" id="nw-c-full-story-heading__title">Full Story</h2>
<h2 class="gel-double-pica-bold" id="nw-c-most-read-heading__title" tabindex="-1">Most read</h2>
<h2 class="gel-double-pica-bold" id="nw-c-around-the-bbc-heading__title">Around the BBC</h2>
<h2 class="gel-double-pica-bold" id="nw-c-sport-heading__title">Sport</h2>
<h2 class="gel-double-pica-bold" id="social-slice__title">Find us here</h2>
<h2 class="gs-u-vh">News Navigation</h2>
<h2 class="orb-foote

As you can see, each item is now printed on its own line, although it is still raw HTML rather than readable text.

We can also access them like this:

In [None]:
#print the last item in the list
print(h2s[-1])

<h2 class="orb-footer-lead">Explore the BBC</h2>
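
Because `select` gives you a list even when nothing matches, asking for an item by position can fail: an empty list has no first or last item, so `[0]` or `[-1]` raises an `IndexError`. Here is a minimal sketch of checking first, using made-up HTML with one `<h2>` and no `<h3>` tags:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2>Only heading</h2>", "html.parser")

one_match = soup.select("h2")
no_match = soup.select("h3")

# both results are lists, so len() works on each
print(len(one_match))
print(len(no_match))

# check the list is not empty before asking for an item by position
if no_match:
    first_h3 = no_match[0]
else:
    first_h3 = None
print(first_h3)
```

An empty list counts as 'false' in an `if` test, which is why `if no_match:` works as the check.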


## Extracting text or properties from HTML

Once we loop through the items, or access them individually, it's likely we will want to extract a particular part of that - for example, just the text inside the HTML tags.

To do this, we can apply an extra method. This is connected to the item you want to apply it to, with a period, like so:

* `.get_text()`

Here's what that code would look like (don't forget the parentheses):


In [None]:
#fetch the first item in the list h2s - then get the text content of that HTML
h2text1 = h2s[0].get_text()
#print it
print(h2text1)

Accessibility links


And here's that incorporated into a loop:


In [None]:
#loop through the h2s list and call each item 'h2'
for h2 in h2s:
  #extract the text from that item, save in a variable called 'h2text'
  h2text = h2.get_text()
  #and print it
  print(h2text)

Accessibility links
News Navigation
BreakingBreaking news
Top Stories
Hurricane Idalia
Must see
Most watched
Full Story
Most read
Around the BBC
Sport
Find us here
News Navigation
Explore the BBC
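
Notice `BreakingBreaking news` in the output: that heading contains two `<span>` tags, and `get_text()` joins their text with nothing in between. One possible fix, sketched here on a cut-down copy of that one heading, is to give `get_text()` a `separator` ingredient; `strip=True` also trims stray whitespace.

```python
from bs4 import BeautifulSoup

# a simplified copy of the 'Breaking' heading from the output above
html = '<h2><span aria-hidden="true">Breaking</span><span>Breaking news</span></h2>'
soup = BeautifulSoup(html, "html.parser")
h2 = soup.select("h2")[0]

# default: the two pieces of text are joined with nothing between them
plain = h2.get_text()
# with a separator: a space is placed between the pieces
spaced = h2.get_text(separator=" ", strip=True)

print(plain)
print(spaced)
```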


### Extracting the properties of the HTML itself

In some cases it's not the text you want to extract, but some quality of the HTML tag itself.

The most common example of this is wanting to extract the link from an `<a>` tag (the `href=""` attribute).

Another common example is where images include an `alt=` description. And on other occasions data might be stored in other attributes of a tag (in fact, many pages use attributes whose names literally begin with `data-`)

To grab these properties, you add the attribute inside square brackets after the item. So, if you wanted to grab the `href=` attribute, you would add `['href']`.

Here's some code showing that in practice:

In [None]:
#show the fifth item in full (index 4, because counting starts at 0)
print(h2s[4])
#now show just the text of that item
print(h2s[4].get_text())
#now show the class= value
print(h2s[4]['class'])
#now show the id= value
print(h2s[4]['id'])

<h2 class="gel-double-pica-bold" id="nw-c-cluster2-heading__title">Hurricane Idalia</h2>
Hurricane Idalia
['gel-double-pica-bold']
nw-c-cluster2-heading__title


Note that in our list, not every HTML tag has a `class` attribute, and some don't have an `id` attribute either, so if you're going to do this you may want to only select those tags that have what you want.
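
Here is a sketch of two ways to cope with that, using made-up HTML in which only the second heading has an `id`. The `.get()` method returns `None` rather than an error when an attribute is missing, and the selector `h2[id]` only matches tags that have the attribute in the first place.

```python
from bs4 import BeautifulSoup

# made-up HTML: only the second heading has an id
html = """
<h2>Accessibility links</h2>
<h2 id="top-stories"><a href="/news/top">Top Stories</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

# .get() returns None instead of an error when the attribute is missing
for h2 in soup.select("h2"):
    print(h2.get_text(), h2.get("id"))

# the selector h2[id] only matches <h2> tags that have an id attribute
with_id = soup.select("h2[id]")
print(with_id[0]["id"])

# the same square-bracket approach grabs the link from an <a> tag
link = soup.select("h2 a")[0]["href"]
print(link)
```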

## Saving the information we've grabbed: `pandas`

Now we've grabbed some information we can extend the code further to save it.

At this point we need to use functions from another library: `pandas`. This is a library for data storage and analysis.

*Note: Remember that when we imported `pandas` we renamed it, as `pd`, so wherever you see `pd` below, that just means `pandas`.*

First, we use the function `DataFrame()` which creates a pandas dataframe. As ingredients it needs to know the names of any columns.

You will see below that we add a line *before* the loop which uses that to create an empty dataframe to store the data in.

Then, inside the loop, the data we extract is added to the dataframe.

Here's the code first - then I'll explain the new bits after.


In [None]:
#Create a dataframe to store the data we are about to scrape
#It has one column called 'heading'
#We call this dataframe 'df'
df = pd.DataFrame(columns=["heading"])

#scrape the specified URL and store in a variable called 'scrapedpage'
scrapedpage = requests.get("https://www.bbc.co.uk/news")
#convert the string of HTML text into a special type of variable that we can drill into
#we've called it 'soup'
soup = BeautifulSoup(scrapedpage.content)

#That 'object' can be drilled into now, though, using .select()
#We specify the CSS selectors that describe the HTML tags we want from that page we scraped
h2s = soup.select('h2')
#loop through the h2s list and call each item 'h2'
for h2 in h2s:
  #extract the text from that item, save in a variable called 'h2text'
  h2text = h2.get_text()
  #and print it
  print(h2text)
  #now add it to the dataframe, under the column 'heading'
  df = df.append({
  "heading" : h2text
  }, ignore_index=True)

print(df)

Accessibility links
News Navigation
BreakingBreaking news
Top Stories
Hurricane Idalia
Must see
Most watched
Full Story
Most read
Around the BBC
Sport
Find us here
News Navigation
Explore the BBC
                  heading
0     Accessibility links
1         News Navigation
2   BreakingBreaking news
3             Top Stories
4        Hurricane Idalia
5                Must see
6            Most watched
7              Full Story
8               Most read
9          Around the BBC
10                  Sport
11           Find us here
12        News Navigation
13        Explore the BBC


  df = df.append({


## The new code

The first line of new code is this:

`df = pd.DataFrame(columns=["heading"])`

We are creating a new variable here, called `df`, and assigning to it the results of using a function: `pd.DataFrame()` (the `pandas` function `DataFrame`).

That takes an ingredient which specifies the columns as being a list (note the square brackets) of one string: `"heading"`.

The second line of new code is this:

```
df = df.append({
      "heading" : h2text
      }, ignore_index=True)
```

This takes the `df` variable and updates it.

On the right of the equals sign is `df.append()` - this means it is using a function called `append` to append (add) new data to the `df` variable it's attached to.

The `append` function [can include various ingredients](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html): firstly the data that you want to append to the dataframe; but also settings, such as whether you want something called `ignore_index` to be `True` or `False`. Setting this to `True` tells pandas to ignore any existing row labels and simply number the rows 0, 1, 2 and so on.

What about the data that you are appending? Well, this has to be in the form of a **dictionary**. A dictionary is like a list, but with two key differences: firstly that it uses curly brackets instead of square ones: `{}`, and secondly it's a list of *pairs*: a 'key', and a 'value', separated by a colon.

Here's the dictionary in our code:

`{"heading" : h2text}`

The first part, `"heading"` is the **key**. This matches the column heading in the empty data frame. Note that it's a **string**: a label, basically.

The second part, `h2text`, is the **value**. This isn't in quotes so it's not a string - it's a variable. A few lines earlier we created this variable with `h2text = h2.get_text()`

So having extracted that information and stored it in `h2text`, the line of code is storing it in a dataframe with the label (key) "heading":

```
df = df.append({
      "heading" : h2text
      }, ignore_index=True)
```

We can print the dataframe to see what's in there:


In [None]:
#Once the loop has finished we can take a look at the data
print(df)

                  heading
0     Accessibility links
1         News Navigation
2   BreakingBreaking news
3             Top Stories
4        Hurricane Idalia
5                Must see
6            Most watched
7              Full Story
8               Most read
9          Around the BBC
10                  Sport
11           Find us here
12        News Navigation
13        Explore the BBC


## Exporting the data

The `pandas` library has another function for exporting data: `to_csv()`.

It needs to be attached to the name of the dataframe variable with a period, then, in the brackets, you specify the name of the file you want to export it as. Make sure this ends in '.csv' so it can be used in a spreadsheet.

In [None]:
#And we can export it
df.to_csv("scrapeddata.csv")
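
By default `to_csv()` also saves the row numbers (the index) as an extra, unnamed first column. If you don't want that in your spreadsheet, you can add `index=False`. Here is a minimal sketch with a small made-up dataframe; leaving out the filename makes `to_csv()` return the CSV as a string, which is used here only so the result can be printed.

```python
import pandas as pd

df = pd.DataFrame({"heading": ["Top Stories", "Most read"]})

# with no filename, to_csv() returns the CSV as a string so we can inspect it
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)
print(without_index)
```

To apply the same setting when saving a file you would write `df.to_csv("scrapeddata.csv", index=False)`.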

## Downloading the data

Once exported, it should appear in the file explorer in Google Colab on the left hand side. Click on the folder icon to open this up and you should see the file you just created (there's a refresh button above if you can't).

Hover over the file name to see three dots, then click on those to select **Download** and download to your computer.

### A note about deprecated code

You might have noticed a string of messages in the output earlier when creating the dataframe - they looked like this:

`<ipython-input-32-e452587bf6b7>:22: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.`

You will sometimes come across warnings like this when you run code from tutorials - it basically means that things have changed since the tutorial was written, and a function that used to work has been superseded by something else.

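Don't worry: in the version of pandas that produced this warning, the function still works. The message is there to warn you that it might not *always work in future*, because future versions of pandas (which will be imported with the `import` command) may not include it. That has since happened: `append` was removed in pandas 2.0, so on newer versions the code above stops with an error and you need the `concat` approach instead.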

Normally the warning includes information about what you should use instead of the old function. In this case it advises you to use `pandas.concat` instead.

This does unfortunately mean going off and either finding a different tutorial that uses the newer function, or learning how to achieve the same results with that newer function instead. But if your code works and you only want to use it once, you don't have to do that - it's only if you intend to use that code in future (or want to update your coding knowledge).

If you're interested, below is a version of the code which uses `concat` instead. It involves an extra line of code where a new dataframe is created just for our new information:

`  asarow = pd.DataFrame({"heading" : h2text}, index=[0])`

That dataframe is then added to the end of our existing dataframe `df` with this code (note the two dataframes being combined are placed inside square brackets):

`  df = pd.concat([df, asarow], ignore_index=True)`

In [None]:
#Create a dataframe to store the data we are about to scrape
#It has one column called 'heading'
#We call this dataframe 'df'
df = pd.DataFrame(columns=["heading"])

#scrape the specified URL and store in a variable called 'scrapedpage'
scrapedpage = requests.get("https://www.bbc.co.uk/news")
#convert the string of HTML text into a special type of variable that we can drill into
#we've called it 'soup'
soup = BeautifulSoup(scrapedpage.content)

#That 'object' can be drilled into now, though, using .select()
#We specify the CSS selectors that describe the HTML tags we want from that page we scraped
h2s = soup.select('h2')
#loop through the h2s list and call each item 'h2'
for h2 in h2s:
  #extract the text from that item, save in a variable called 'h2text'
  h2text = h2.get_text()
  #and print it
  print(h2text)
  #create a one-row dataframe with 'heading' as the column and the text as the value
  asarow = pd.DataFrame({"heading" : h2text}, index=[0])
  #now add it to the dataframe
  df = pd.concat([df, asarow], ignore_index=True)

print(df)

Accessibility links
News Navigation
BreakingBreaking news
Top Stories
Must see
Most watched
Full Story
Most read
Around the BBC
Sport
Find us here
News Navigation
Explore the BBC
                  heading
0     Accessibility links
1         News Navigation
2   BreakingBreaking news
3             Top Stories
4                Must see
5            Most watched
6              Full Story
7               Most read
8          Around the BBC
9                   Sport
10           Find us here
11        News Navigation
12        Explore the BBC
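
Another common pattern, offered here as an alternative sketch rather than the approach used in this notebook, is to collect the rows in an ordinary list while looping and only create the dataframe once at the end. The list `scraped_texts` is a made-up stand-in for the text extracted from each `<h2>`.

```python
import pandas as pd

# stand-in for the text extracted from each <h2> in the loop above
scraped_texts = ["Accessibility links", "News Navigation", "Top Stories"]

# collect one dictionary per row in an ordinary list
rows = []
for h2text in scraped_texts:
    rows.append({"heading": h2text})

# build the dataframe once, after the loop has finished
df = pd.DataFrame(rows)
print(df)
```

Here `rows.append()` is the ordinary Python list function, not the deprecated pandas one, so it doesn't trigger the warning.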


## How to adapt it

You can use most of this code without having to change it. All you *need* to change is the line specifying the URL you want to scrape.

And this line, which specifies what you want to scrape from that page:

`h2s = soup.select('h2')`

If you're scraping one type of information from one page, that will be enough.

For the CSS selector you will need to identify the HTML in the page you are scraping, and the combination of tags that is being used.

Some [reading around CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp) will help you here, but a couple of useful things to know include:

* A period `.` means `class="`
* A hash `#` means `id="`

So `'div.title a'` means `<div class="title"><a ...>` - or, in other words, anything on the page inside an `<a>` tag (a link) within a `<div class="title">` tag.

The words used for variables (like "soup" and "h2s" above) may not be relevant to what you are scraping - but that doesn't matter, because those words are arbitrary. If you do decide to change them, make sure you change them *throughout* the code, or it will create an error.
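
One way to make the 'only change the selector' idea explicit is to wrap the fixed parts in a function. This is a sketch under assumptions: the function name `grab_text` and the HTML are made up for illustration, and it works on a string of HTML rather than fetching a live page.

```python
from bs4 import BeautifulSoup

def grab_text(html, selector):
    """Return the text of every element in the HTML that matches the selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [item.get_text() for item in soup.select(selector)]

# made-up HTML for illustration
html = """
<div class="title"><a href="/story-1">First story</a></div>
<div class="title"><a href="/story-2">Second story</a></div>
<div class="footer"><a href="/about">About us</a></div>
"""

# only the links inside <div class="title"> are matched
print(grab_text(html, "div.title a"))
```

With a real page you would pass in `scrapedpage.content` instead of the made-up `html` string.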


## Generating URLs for a scraper to loop through

If the information you want is spread across several pages, you might *generate* the URLs: for example, if they end in a number that goes up by 1 each time you can use `range` to generate that list of numbers and add them to the URL using `+`.

However, you cannot mix numbers and strings, so you need to convert the numbers to a string as you do this. Here's an example:

In [None]:
#Create the basic URL that appears before the number
baseurl = "http://mypage.com?page="
#Create a list of numbers to put on the end
pagenums = range(1,11)
#Now generate the URLs by looping through the list and adding it to the URL
for i in pagenums:
  #Combine the two -
  #this will generate an error because we are trying to combine a string and a number
  fullurl = baseurl+i

TypeError: ignored

## Tip: converting numbers into strings

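The error is a `TypeError`. In current versions of Python the message reads `can only concatenate str (not "int") to str` (older versions said `must be str, not int`) - in other words the second part must be a string not an integer.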

To fix that you can use the `str()` function, which will convert a number into a string.

In [None]:
#Create the basic URL that appears before the number
baseurl = "http://mypage.com?page="
#Create a list of numbers to put on the end
pagenums = range(1,11)
#Now generate the URLs by looping through the list and adding it to the URL
for i in pagenums:
  #Convert i to a string
  i = str(i)
  #Combine the two
  fullurl = baseurl+i
  #print it
  print(fullurl)

http://mypage.com?page=1
http://mypage.com?page=2
http://mypage.com?page=3
http://mypage.com?page=4
http://mypage.com?page=5
http://mypage.com?page=6
http://mypage.com?page=7
http://mypage.com?page=8
http://mypage.com?page=9
http://mypage.com?page=10
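
As an alternative sketch, an f-string can do the conversion for you, and a list comprehension can store all the URLs in a list ready to loop through later. The URL is the same placeholder as above, not a real site.

```python
baseurl = "http://mypage.com?page="

# an f-string converts the number to text for us; no str() needed
urls = [f"{baseurl}{i}" for i in range(1, 11)]

print(urls[0])
print(urls[-1])
print(len(urls))
```

Each item in `urls` could then be passed to `requests.get()` inside a loop.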
