# An example scraper using a list

The code below can be copied and adapted to create your own scraper.

The first part imports all the libraries. I've kept this separate from the other parts so that you don't have to import them every time you want to run the scraper itself.

In [1]:
#import the libraries
#requests is a library for fetching webpages from a URL
import requests
#BeautifulSoup is a library for scraping webpages
from bs4 import BeautifulSoup
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd

## Using a list

Below we write some code to create a list of counties that can be used to generate URLs on a karting site.

We also store the 'base URL' that we will add to each item in the list to create a full URL.

In [2]:
#create a list of counties that we will need to generate URLs
counties = ["avon","bedfordshire","berkshire","birmingham"]
#store the base URL we will add those to
baseurl = "http://www.uk-go-karting.com/tracks/"

## Using a loop

Next we loop through each item in the list and add it to that base URL using the `+` operator.

We add a `print` function inside the loop to check that it works each time. You can then copy those links into a browser to check that they are the right ones.

In [3]:
#start looping through our list
for county in counties:
  fullurl = baseurl+county
  print(fullurl)

http://www.uk-go-karting.com/tracks/avon
http://www.uk-go-karting.com/tracks/bedfordshire
http://www.uk-go-karting.com/tracks/berkshire
http://www.uk-go-karting.com/tracks/birmingham


## Scraping each URL as we loop

Now that we know the loop works in generating the right URLs, we can extend the code inside the loop so that it *scrapes* each URL.

At this point we are using some of the libraries we imported at the start. `requests.get()`, for example, is the `get()` function from the `requests` library. 

Let's look at the code first, and then explain it...

In [4]:
#start looping through our list
for county in counties:
  fullurl = baseurl+county
  print(fullurl)
  #Scrape the html at that url
  html = requests.get(fullurl)
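  # turn our HTML into a BeautifulSoup object
  # "html.parser" names Python's built-in parser, so BeautifulSoup doesn't have to guess which to use
  soup = BeautifulSoup(html.content, "html.parser")
  #The names are all in <a> tags inside <h2> tags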
  #This targets the contents of those html tags
  addresses = soup.select('h2 a')
  #the results are always a list so we have to loop through it using a 'for' loop
  for i in addresses:
    #each item in the list is called i as it loops
    print(i)
    #on its own it includes tags, but we can attach .get_text() to translate it into text
    address = i.get_text()
    print(address)


http://www.uk-go-karting.com/tracks/avon
<a href="http://www.uk-go-karting.com/tracks/avon/the-raceway">Bristol Go Karting</a>
Bristol Go Karting
http://www.uk-go-karting.com/tracks/bedfordshire
<a href="http://www.uk-go-karting.com/tracks/bedfordshire/dunstable-go-karting">Dunstable Go Karting</a>
Dunstable Go Karting
http://www.uk-go-karting.com/tracks/berkshire
<a href="http://www.uk-go-karting.com/tracks/berkshire/reading-go-karting">Reading Go Karting</a>
Reading Go Karting
http://www.uk-go-karting.com/tracks/birmingham
<a href="http://www.uk-go-karting.com/tracks/birmingham/teamworks-karting-birmingham">Teamworks Karting Birmingham (Central)</a>
Teamworks Karting Birmingham (Central)
<a href="http://www.uk-go-karting.com/tracks/birmingham/grand-prix-karting">Grand Prix Karting</a>
Grand Prix Karting
<a href="http://www.uk-go-karting.com/tracks/birmingham/birmingham-go-karting">Birmingham Go Karting</a>
Birmingham Go Karting
<a href="http://www.uk-go-karting.com/tracks/birmingham/

## The functions we are using

Let's break some of this down.

So `requests.get()` is the `get()` function from the `requests` library. The *ingredient* we give to that function is the URL we stored in the `fullurl` variable.

The `get()` function basically fetches the whole webpage at a given address (the ingredient it's given).

The results of running that function are stored in a new variable called `html`.
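If the page can't be fetched (a typo in the URL, say, or the site being down), `requests.get()` will either raise an error or hand back an error page. Here's a minimal sketch of one way to handle that. The function name `fetch_page` and the 10-second timeout are my own choices for illustration, not part of the scraper above.

```python
import requests

def fetch_page(url):
    """Hypothetical helper: return the page content, or None if the fetch fails."""
    try:
        # timeout stops the scraper hanging forever on a slow site
        response = requests.get(url, timeout=10)
        # raise an error for responses like 404 (not found) or 500 (server error)
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print("Could not fetch", url, "-", error)
        return None
    # .content is the raw page, ready to hand to BeautifulSoup
    return response.content

# A badly formed address fails straight away, so the helper returns None
result = fetch_page("not-a-url")
print(result)
```

In the scraper in this notebook, the `html` variable holds the whole response from `requests.get()`, not just the page content.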

This isn't in a form we can easily work with, yet, so we need another function to convert it to something we can drill down into. 

That function is the `BeautifulSoup()` function.

*(This is actually from the `bs4` library but we don't need to name it because we imported the function specifically earlier when we wrote `from bs4 import BeautifulSoup`)*

The *ingredient* we give to that `BeautifulSoup` function is the `html` variable we just created - but we need to add `.content` to specify we want the content of that page.

The results are stored in another new variable, `soup`.

This variable is a particular type of object (a "BeautifulSoup object" if you need to know) that can be drilled down into using the `.select()` function that BeautifulSoup objects possess. 

That `.select()` function will grab elements that match the *CSS selectors* that you give it as an ingredient.

In this case we specify `'h2 a'`, which means "any a tag within a h2 tag" - so it will grab the contents of any links inside h2 tags in the page.
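To see what `'h2 a'` does without fetching a real page, here is a small sketch that runs the same selector on a made-up snippet of HTML. The snippet and the variable names `sample_html`, `links` and `names` are invented for illustration.

```python
from bs4 import BeautifulSoup

# A made-up piece of HTML standing in for a downloaded page
sample_html = """
<h2><a href="/tracks/one">Track One</a></h2>
<p><a href="/about">About us</a></p>
<h2><a href="/tracks/two">Track Two</a></h2>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Only links inside <h2> tags match - the link inside <p> is ignored
links = soup.select("h2 a")
# Pull the text out of each matching link
names = [link.get_text() for link in links]
print(names)
```

The link inside the `<p>` tag is left out because it isn't within a `<h2>` tag.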

Don't worry about memorising any of the code above: this is code that you can re-use time and time again. The only bit you will need to change is the selector, in order to specify the particular HTML you're after. 

To work out the selector you need, you'll often need to Google around, learning as you go, but selectors are pretty easy to get the hang of, and I'll talk about them more below.

## Using CSS selectors

**CSS selectors** are used to target different elements in a HTML page. A basic selector can target just one type of HTML tag, like `<h2>` or `<p>`, but you can also target a combination of tags (such as any `<strong>` tags within `<p>` tags). 

More complicated selectors can also be used to target tags based on their attributes (e.g. not just `<p>` but specifically `<p class="summary">`).

You can find lots of resources to help you with CSS selectors, such as [this one](https://www.w3schools.com/cssref/css_selectors.asp). Many will relate to styling webpages (which is how CSS selectors are most often used - selectors are used to target the HTML elements that you want to style), but the principles are the same.


## Saving the information we've grabbed

Now we've grabbed some information we can extend the code further to save it.

At this point we need to use functions from another library: `pandas`. This is a library for data storage and analysis. When we imported `pandas` we called it `pd` for short. This is quite common. Any reference to `pd` in the code, then, means `pandas`.

We will use the function `DataFrame()`, which creates a pandas dataframe. As ingredients it needs to know the names of any columns, and the data that goes in them.

You will see below that we add a line *before* the loop which creates an empty list to store the data in.

Then, inside the loop, the data we extract is added to that list. Once the loop has finished, the list is stored in a dataframe.

Here's the code first - then I'll explain the new bits after.


In [6]:
#create an empty list to store our addresses
addresslist = []

#start looping through our list
for county in counties:
  fullurl = baseurl+county
  print(fullurl)
  #Scrape the html at that url
  html = requests.get(fullurl)
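  # turn our HTML into a BeautifulSoup object
  # "html.parser" names Python's built-in parser, so BeautifulSoup doesn't have to guess which to use
  soup = BeautifulSoup(html.content, "html.parser")
  #The addresses are all in <h3> - a change from our previous code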
  #This targets the contents of those html tags
  addresses = soup.select('h3')
  #the results are always a list so we have to loop through it using a 'for' loop
  for i in addresses:
    #each item in the list is called i as it loops
    print(i)
    #on its own it includes tags, but we can attach .get_text() to translate it into text
    address = i.get_text()
    print(address)
    #add to the previously empty list
    addresslist.append(address)

#Create a dataframe to store the data we scraped
#It has one column called 'location'
#We store the list 'addresslist' in that column
#We call this dataframe 'df'
df = pd.DataFrame({"location": addresslist})


http://www.uk-go-karting.com/tracks/avon
<h3>Avonmouth Way, Bristol, Avon BS11 9YA</h3>
Avonmouth Way, Bristol, Avon BS11 9YA
http://www.uk-go-karting.com/tracks/bedfordshire
<h3>Unit 27, Verey Road, Woodside Industrial Estate, Dunstable, Bedfordshire LU5 4TT</h3>
Unit 27, Verey Road, Woodside Industrial Estate, Dunstable, Bedfordshire LU5 4TT
http://www.uk-go-karting.com/tracks/berkshire
<h3>Cradock Road, Reading, Berkshire RG2 0EE</h3>
Cradock Road, Reading, Berkshire RG2 0EE
http://www.uk-go-karting.com/tracks/birmingham
<h3>Fazeley Street, Birmingham B5 5SE</h3>
Fazeley Street, Birmingham B5 5SE
<h3>Adderley Road South, Birmingham B8 1AD</h3>
Adderley Road South, Birmingham B8 1AD
<h3>Park Lane, Oldbury, Birmingham B69 4JX</h3>
Park Lane, Oldbury, Birmingham B69 4JX
<h3>Robeys Lane, Tamworth,  B78 1AR</h3>
Robeys Lane, Tamworth,  B78 1AR


## The new code

The first line of new code is this:

```
addresslist = []
```

We are creating a new variable here, called `addresslist`, and assigning to it a list, indicated by the square brackets. 

The square brackets are empty, however, which means this is an *empty list*.

The second line of new code sits inside the loop, so it will run each time the loop runs. It is this:

```
addresslist.append(address)
```

Here we see our empty list - but this time it is attached to the `.append()` function which will add something to that list. Here it adds `address`. 

The first time this loop runs, then, it will add whatever is inside the `address` variable to the list `addresslist`. It will then have one item instead of being an empty list.

But that loop will run a number of times, and each time that list will grow by one item. 
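You can watch a list grow like this with a tiny sketch that doesn't involve any scraping. The list name `examplelist` and the words being added are made up for illustration.

```python
# Start with an empty list
examplelist = []

# Each time the loop runs, one more item is added to the end of the list
for word in ["first", "second", "third"]:
    examplelist.append(word)
    # Show how many items the list holds now, and what they are
    print(len(examplelist), examplelist)
```

Each printed line shows the list one item longer than the line before.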

Once the loop has finished we get to a third and final extra line of code:

```
df = pd.DataFrame({"location": addresslist})
```

This creates another new variable called `df` - and assigns to it the results of using a function: `pd.DataFrame()`.

This is a function called `DataFrame` from the `pandas` library (remember we called it `pd`).

That takes an ingredient: a **dictionary**. 

A dictionary is like a list, but with two key differences: firstly that it uses curly brackets instead of square ones: `{}`, and secondly it's a list of *pairs*: a 'key', and a 'value', separated by a colon.

Here's the dictionary in our code:

`{"location": addresslist}`

The first part, `"location"` is the **key**. This is the column heading. Note that it's a **string**: a label, basically.

The second part, `addresslist`, is the **value**. This isn't in quotes so it's not a string - it's a variable: the list we created earlier, and then added to within our loop.
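Here is a small sketch of a dictionary on its own, so you can see how the key is used to get at the value. The names `examplelist` and `exampledict` are made up for illustration; the addresses are copied from the output above.

```python
# A list with two items in it
examplelist = ["Fazeley Street, Birmingham B5 5SE", "Adderley Road South, Birmingham B8 1AD"]

# A dictionary with one pair: the key "location" and the list as its value
exampledict = {"location": examplelist}

# Ask for the key, and you get back the value stored against it
stored = exampledict["location"]
print(stored)

# You can also list the keys on their own
keys = list(exampledict.keys())
print(keys)
```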

So having extracted that information and stored it in `addresslist`, the line of code is storing it in a dataframe with the label (key) "location".

We can print the dataframe to see what's in there:


In [7]:
#Once the loop has finished we can take a look at the data
print(df)

                                            location
0              Avonmouth Way, Bristol, Avon BS11 9YA
1  Unit 27, Verey Road, Woodside Industrial Estat...
2           Cradock Road, Reading, Berkshire RG2 0EE
3                  Fazeley Street, Birmingham B5 5SE
4             Adderley Road South, Birmingham B8 1AD
5             Park Lane, Oldbury, Birmingham B69 4JX
6                    Robeys Lane, Tamworth,  B78 1AR
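A dataframe can hold more than one column: the dictionary just needs one key per column, and each list needs to be the same length. Here's a minimal sketch. The `county` column and the names `countylist`, `locationlist` and `df2` are made up for illustration, and the addresses are copied from the output above.

```python
import pandas as pd

# Two lists of the same length: one item in each for every row
countylist = ["avon", "berkshire"]
locationlist = ["Avonmouth Way, Bristol, Avon BS11 9YA", "Cradock Road, Reading, Berkshire RG2 0EE"]

# Each key becomes a column heading, each list becomes that column's contents
df2 = pd.DataFrame({"county": countylist, "location": locationlist})
print(df2)
```

In a scraper you would fill both lists inside the loop, appending to each one at the same point so that they stay in step.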


## Exporting the data

The `pandas` library has another function for exporting data: `to_csv()`.

It needs to be attached to the name of the dataframe variable with a period, then, in the brackets, you specify the name of the file you want to export it as. Make sure this ends in '.csv' so it can be used in a spreadsheet.

In [8]:
#And we can export it
df.to_csv("scrapeddata.csv")
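By default `to_csv()` also writes the dataframe's index (the numbers 0, 1, 2 and so on down the left) as an extra first column. If you don't want that, you can add `index=False`. Here's a small sketch using a made-up one-row dataframe called `exampledf`. Leaving out the filename makes `to_csv()` hand back the text instead of saving a file, which makes the difference easy to see.

```python
import pandas as pd

# A made-up dataframe with one row, using an address from the output above
exampledf = pd.DataFrame({"location": ["Fazeley Street, Birmingham B5 5SE"]})

# With no filename, to_csv() returns the CSV as text rather than saving it
with_index = exampledf.to_csv()
without_index = exampledf.to_csv(index=False)

# The first version has an extra, unnamed column holding the index
print(with_index)
# The second version only has the 'location' column
print(without_index)
```

To save a file without the index you would write `df.to_csv("scrapeddata.csv", index=False)`.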

## Downloading the data

Once exported, it should appear in the file explorer in Google Colab on the left hand side. Click on the folder icon to open this up and you should see the file you just created (there's a refresh button above if you can't).

Hover over the file name to see three dots, then click on those to select **Download** and download to your computer.

## How to adapt it

You can use most of this code without having to change it. All you *need* to change are the lines specifying the base URL and the list of words to add to it.

And this line, which specifies what you want to scrape from that page:

`addresses = soup.select('h3')`

If you're scraping one type of information from one page, that will be enough. 

For the CSS selector you will need to identify the HTML in the page you are scraping, and the combination of tags that is being used. 

Some [reading around CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp) will help you here, but a couple of useful things to know include:

* A period `.` means `class="`
* A hash `#` means `id="`

So `'div.title a'` means `<div class="title"><a ...>` - or, in other words, anything on the page inside an `<a>` tag (a link) within a `<div class="title">` tag.

The words used for variables (like "baseurl" and "addresses" above) may not be relevant to what you are scraping - but that doesn't matter, because those words are arbitrary. If you do decide to change them, make sure you change them *throughout* the code, or it will create an error.
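If you want to try out the period and hash selectors before pointing them at a real page, you can test them on a made-up snippet of HTML. In this sketch the snippet and the names `sample_html`, `title_links` and `intro_text` are invented for illustration.

```python
from bs4 import BeautifulSoup

# A made-up piece of HTML with a class and an id to target
sample_html = """
<div class="title"><a href="/a">First link</a></div>
<div class="other"><a href="/b">Second link</a></div>
<p id="intro">Welcome</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# 'div.title a' matches links inside <div class="title"> only
title_links = [a.get_text() for a in soup.select("div.title a")]
print(title_links)

# '#intro' matches whichever tag has id="intro"
intro_text = [tag.get_text() for tag in soup.select("#intro")]
print(intro_text)
```

The second link is left out because its `<div>` has a different class.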


## Generating URLs for a scraper to loop through

Alternatively you might *generate* the URLs: for example, if they end in a number that goes up by 1 each time you can use `range` to generate that list of numbers and add them to the URL using `+`.

However, you cannot mix numbers and strings, so you need to convert the numbers to a string as you do this. Here's an example:

In [9]:
#Create the basic URL that appears before the number
baseurl = "http://mypage.com?page="
#Create a list of numbers to put on the end
pagenums = range(1,11)
#Now generate the URLs by looping through the list and adding it to the URL
for i in pagenums:
  #Combine the two - 
  #this will generate an error because we are trying to combine a string and a number
  fullurl = baseurl+i

TypeError: ignored

## Tip: converting numbers into strings

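You can see the error is a `TypeError`. The full message reads `must be str, not int` or `can only concatenate str (not "int") to str`, depending on your version of Python - in other words the second part must be a string, not an integer.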

To fix that you can use the `str()` function, which will convert a number into a string.

In [None]:
#Create the basic URL that appears before the number
baseurl = "http://mypage.com?page="
#Create a list of numbers to put on the end
pagenums = range(1,11)
#Now generate the URLs by looping through the list and adding it to the URL
for i in pagenums:
  #Convert i to a string
  i = str(i)
  #Combine the two
  fullurl = baseurl+i
  #print it
  print(fullurl)

http://mypage.com?page=1
http://mypage.com?page=2
http://mypage.com?page=3
http://mypage.com?page=4
http://mypage.com?page=5
http://mypage.com?page=6
http://mypage.com?page=7
http://mypage.com?page=8
http://mypage.com?page=9
http://mypage.com?page=10
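Another way to avoid the error, if you're on Python 3.6 or later, is an f-string, which converts the number for you. This sketch also stores each URL in a list (called `urllist` here, a name I've made up) so that a scraper could loop through them later.

```python
# Create the basic URL that appears before the number
baseurl = "http://mypage.com?page="
# Create an empty list to store the finished URLs
urllist = []

# range(1,11) generates the numbers 1 to 10 - it stops before 11
for i in range(1, 11):
    # The f-string drops baseurl and i into one string, converting i as it goes
    fullurl = f"{baseurl}{i}"
    urllist.append(fullurl)

# Check how many URLs we have, and what the first and last look like
print(len(urllist))
print(urllist[0])
print(urllist[-1])
```

You could then write `for fullurl in urllist:` and put the scraping code inside that loop.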
