# An example scraper - this time grabbing more than one piece of info

The code below can be copied and adapted to create your own scraper. This builds on [a previous scraper which introduced the use of lists in scraping](https://github.com/paulbradshaw/MED7369-Specialist-Investigative-Journalism/blob/master/python/anExampleScraperList.ipynb).

The first part imports the libraries we need:

In [1]:
#import the libraries
#requests is a library for fetching webpages from a URL
import requests
#BeautifulSoup is a library for scraping webpages
from bs4 import BeautifulSoup
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd

## The previous code

Previously we looped through each item in a list and added it to a base url using the `+` operator, then scraped something from that URL.

We also stored the results in a dataframe. 

Here's the code that we got to:

In [3]:
#create a list of counties that we will need to generate URLs
counties = ["avon","bedfordshire","berkshire","birmingham"]
#store the base URL we will add those to
baseurl = "http://www.uk-go-karting.com/tracks/"

#create an empty list to store our addresses
addresslist = []

#start looping through our list
for county in counties:
  fullurl = baseurl+county
  print(fullurl)
  #Scrape the html at that url
  html = requests.get(fullurl)
  # turn our HTML into a BeautifulSoup object
  soup = BeautifulSoup(html.content, "html.parser")
  #The addresses are all in <h3> - a change from our previous code
  #This targets the contents of those html tags
  addresses = soup.select('h3')
  #the results are always a list so we have to loop through it using a 'for' loop
  for i in addresses:
    #each item in the list is called i as it loops
    print(i)
    #on its own it includes tags, but we can attach .get_text() to translate it into text
    address = i.get_text()
    print(address)
    #add to the previously empty list
    addresslist.append(address)

#Create a dataframe to store the data we scraped
#It has one column called 'location'
#We store the list 'addresslist' in that column
#We call this dataframe 'df'
df = pd.DataFrame({"location": addresslist})


http://www.uk-go-karting.com/tracks/avon
<h3>Avonmouth Way, Bristol, Avon BS11 9YA</h3>
Avonmouth Way, Bristol, Avon BS11 9YA
http://www.uk-go-karting.com/tracks/bedfordshire
<h3>Unit 27, Verey Road, Woodside Industrial Estate, Dunstable, Bedfordshire LU5 4TT</h3>
Unit 27, Verey Road, Woodside Industrial Estate, Dunstable, Bedfordshire LU5 4TT
http://www.uk-go-karting.com/tracks/berkshire
<h3>Cradock Road, Reading, Berkshire RG2 0EE</h3>
Cradock Road, Reading, Berkshire RG2 0EE
http://www.uk-go-karting.com/tracks/birmingham
<h3>Fazeley Street, Birmingham B5 5SE</h3>
Fazeley Street, Birmingham B5 5SE
<h3>Adderley Road South, Birmingham B8 1AD</h3>
Adderley Road South, Birmingham B8 1AD
<h3>Park Lane, Oldbury, Birmingham B69 4JX</h3>
Park Lane, Oldbury, Birmingham B69 4JX
<h3>Robeys Lane, Tamworth,  B78 1AR</h3>
Robeys Lane, Tamworth,  B78 1AR


In [4]:
#Once the loop has finished we can take a look at the data
print(df)

                                            location
0              Avonmouth Way, Bristol, Avon BS11 9YA
1  Unit 27, Verey Road, Woodside Industrial Estat...
2           Cradock Road, Reading, Berkshire RG2 0EE
3                  Fazeley Street, Birmingham B5 5SE
4             Adderley Road South, Birmingham B8 1AD
5             Park Lane, Oldbury, Birmingham B69 4JX
6                    Robeys Lane, Tamworth,  B78 1AR


## Storing more than one piece of information

If you want to store more than just one piece of information there are different ways to do that, often depending on the nature of the page you are scraping. 

One is to identify the tag containing *all* the elements of info you want. In this case, although the *addresses* are inside `<h3>` tags, the entry for *all* of the info about each track is contained within `<div class="trackintro">` like so:

```{html}
<div class="trackintro">
        <a href="http://www.uk-go-karting.com/tracks/birmingham/grand-prix-karting"><img src="http://www.uk-go-karting.com/images/tracks/thumbs/3-1.jpg" alt="Grand Prix Karting"></a>
        <h2><a href="http://www.uk-go-karting.com/tracks/birmingham/grand-prix-karting">Grand Prix Karting</a></h2>
        <h3>Adderley Road South, Birmingham B8 1AD</h3>
        <p>Track length: 970m | <a href="http://www.uk-go-karting.com/tracks/birmingham/grand-prix-karting">track details and activities</a> | <a onclick="_gaq.push(['_trackEvent(Karting Enquiry, County Page, Birmingham'])" href="http://www.uk-go-karting.com/calculate?track_id=3">book karting here</a></p>
```

So instead of targeting `<h3>` and looping through the matches we can *first* target that `<div class="trackintro">`, loop through those, and *then* within the matches, extract the *first* (and only) `<h3>` and other items. 
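To see the idea without fetching anything, here is a minimal sketch that runs the same two-step selection on a trimmed copy of the HTML above. The `sample_html` string is hand-copied for illustration (with a closing `</div>` added so it parses cleanly), and the variable names are just examples.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the track entry shown above, closed with </div>
sample_html = """
<div class="trackintro">
  <a href="http://www.uk-go-karting.com/tracks/birmingham/grand-prix-karting"><img src="http://www.uk-go-karting.com/images/tracks/thumbs/3-1.jpg" alt="Grand Prix Karting"></a>
  <h2><a href="http://www.uk-go-karting.com/tracks/birmingham/grand-prix-karting">Grand Prix Karting</a></h2>
  <h3>Adderley Road South, Birmingham B8 1AD</h3>
  <p>Track length: 970m</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Step 1: grab every container div (only one in this sample)
tracks = soup.select("div.trackintro")
first_track = tracks[0]

# Step 2: search *inside* that one div, not the whole page
track_name = first_track.select("h2 a")[0].get_text()
address = first_track.select("h3")[0].get_text()
image_src = first_track.select("img")[0]["src"]

print(track_name)
print(address)
print(image_src)
```

Because each `.select()` is called on `first_track` and not on `soup`, the name, address and image are guaranteed to belong to the same track.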

Here's the code changed to do that:

In [None]:
#create a list of counties that we will need to generate URLs
counties = ["avon","bedfordshire","berkshire","birmingham"]
#store the base URL we will add those to
baseurl = "http://www.uk-go-karting.com/tracks/"

#create an empty dataframe to store the data
df = pd.DataFrame()

#start looping through our list
for county in counties:
  fullurl = baseurl+county
  print(fullurl)
  #Scrape the html at that url
  html = requests.get(fullurl)
  # turn our HTML into a BeautifulSoup object
  soup = BeautifulSoup(html.content, "html.parser")
  #Each track's details sit inside <div class="trackintro">
  #This grabs a list of those div tags
  tracks = soup.select('div.trackintro')
  #the results are always a list so we have to loop through it using a 'for' loop
  for i in tracks:
    #each item in the list is called i as it loops
    #we could grab all the text in the div and split it up later
    #print("WHOLE THING",i.get_text())
    #grab the image location
    imgs = i.select('img')
    firstimg = imgs[0]
    imgsrc = firstimg['src']
    print(imgsrc)
    #but here we drill down deeper to grab any <a href> tags inside a <h2>
    links = i.select('h2 a')
    print(links[0]['href'])
    #and any <h3> tags
    h3s = i.select('h3')
    #and any <p> tags
    ps = i.select('p')
    #knowing that there's only one - or at least we're only interested in the first
    #we can then store just that one, rather than having to loop
    firstlink = links[0]
    #and extract the text
    trackname = firstlink.get_text()
    #the same for h3
    firsth3 = h3s[0]
    address = firsth3.get_text()
    print(address)
    #and for the p
    firstp = ps[0]
    length = firstp.get_text()
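    #Now we can add a row to that variable called 'df'
    #(df.append() was removed in pandas 2.0, so we use pd.concat() instead)
    newrow = pd.DataFrame([{
      "location" : address,
      "track" : trackname,
      "length" : length
      }])
    df = pd.concat([df, newrow], ignore_index=True)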







http://www.uk-go-karting.com/tracks/avon
http://www.uk-go-karting.com/images/tracks/thumbs/24-1.jpg
http://www.uk-go-karting.com/tracks/avon/the-raceway
Avonmouth Way, Bristol, Avon BS11 9YA
http://www.uk-go-karting.com/tracks/bedfordshire
http://www.uk-go-karting.com/images/tracks/thumbs/372-1.jpg
http://www.uk-go-karting.com/tracks/bedfordshire/dunstable-go-karting
Unit 27, Verey Road, Woodside Industrial Estate, Dunstable, Bedfordshire LU5 4TT
http://www.uk-go-karting.com/tracks/berkshire
http://www.uk-go-karting.com/images/tracks/thumbs/345-1.jpg
http://www.uk-go-karting.com/tracks/berkshire/reading-go-karting
Cradock Road, Reading, Berkshire RG2 0EE
http://www.uk-go-karting.com/tracks/birmingham


There are other ways of achieving similar results, but this is the simplest strategy.
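One such alternative, sketched here under some assumptions: use BeautifulSoup's `select_one()`, which returns the first match or `None`, so a track with a missing tag does not stop the scraper with an `IndexError`. The rows are collected in a plain list and turned into a dataframe once at the end. The `sample_html` string is hand-written for illustration (its second entry is invented to show a missing address), and `text_or_none` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written sample: the second entry is made up and deliberately incomplete
sample_html = """
<div class="trackintro">
  <h2><a href="http://www.uk-go-karting.com/tracks/birmingham/grand-prix-karting">Grand Prix Karting</a></h2>
  <h3>Adderley Road South, Birmingham B8 1AD</h3>
  <p>Track length: 970m</p>
</div>
<div class="trackintro">
  <h2><a href="http://www.uk-go-karting.com/tracks/example">Example track with no address</a></h2>
</div>
"""

def text_or_none(parent, selector):
    """Return the text of the first match inside parent, or None if nothing matches."""
    match = parent.select_one(selector)
    if match is None:
        return None
    return match.get_text(strip=True)

soup = BeautifulSoup(sample_html, "html.parser")

# Collect one dictionary per track, then build the dataframe in one go
rows = []
for track in soup.select("div.trackintro"):
    rows.append({
        "location": text_or_none(track, "h3"),
        "track": text_or_none(track, "h2 a"),
        "length": text_or_none(track, "p"),
    })

example_df = pd.DataFrame(rows)
print(example_df)
```

Building the dataframe once from a list of dictionaries also avoids growing it row by row inside the loop.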

In [None]:
#Once the loop has finished we can take a look at the data
print(df)

                                            location  ...                                             length
0              Avonmouth Way, Bristol, Avon BS11 9YA  ...  Track length: 500m | track details and activit...
1  Unit 27, Verey Road, Woodside Industrial Estat...  ...  Track length: 500m | track details and activit...
2           Cradock Road, Reading, Berkshire RG2 0EE  ...  Track length: 450m | track details and activit...
3                  Fazeley Street, Birmingham B5 5SE  ...  Track length: 450m | track details and activit...
4             Adderley Road South, Birmingham B8 1AD  ...  Track length: 970m | track details and activit...
5             Park Lane, Oldbury, Birmingham B69 4JX  ...  Track length: 1000m | track details and activi...
6                    Robeys Lane, Tamworth,  B78 1AR  ...  Track length: 1000m | track details and activi...

[7 rows x 3 columns]
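The `length` column still holds the whole paragraph, link text included. If you only want the number, one possible clean-up is sketched below using pandas' `.str.extract()` with a regular expression. The two sample strings mimic the pattern seen in the output above, and the `length_m` column name is an assumption made for this example.

```python
import pandas as pd

# Sample values in the same pattern as the scraped 'length' column
example_df = pd.DataFrame({
    "length": [
        "Track length: 970m | track details and activities | book karting here",
        "Track length: 500m | track details and activities | book karting here",
    ]
})

# Capture the digits that sit between 'Track length: ' and 'm'
digits = example_df["length"].str.extract(r"Track length:\s*(\d+)m", expand=False)

# Convert the captured text to numbers (rows with no match would become NaN)
example_df["length_m"] = pd.to_numeric(digits)

print(example_df["length_m"])
```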


## Exporting the data

Dataframes in the `pandas` library have a method for exporting data: `to_csv()`.

It needs to be attached to the name of the dataframe variable with a period, then, in the brackets, you specify the name of the file you want to export it as. Make sure this ends in '.csv' so it can be used in a spreadsheet.

In [None]:
#And we can export it
df.to_csv("scrapeddata.csv")
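By default `to_csv()` also writes the dataframe's row numbers as an extra first column. If you don't want that in your spreadsheet, a small variation is to pass `index=False`. The sketch below uses a one-row sample dataframe and a temporary folder, both assumptions made so the example runs on its own, and reads the file back in to check the result.

```python
import os
import tempfile
import pandas as pd

# A one-row sample dataframe standing in for the scraped data
example_df = pd.DataFrame({
    "location": ["Adderley Road South, Birmingham B8 1AD"],
    "track": ["Grand Prix Karting"],
})

# Write to a temporary folder so we don't overwrite any real file
csv_path = os.path.join(tempfile.mkdtemp(), "scrapeddata_example.csv")

# index=False leaves out the row numbers
example_df.to_csv(csv_path, index=False)

# Read the file back in to confirm what was saved
check_df = pd.read_csv(csv_path)
print(check_df)
```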

## Downloading the data

Once exported, the file should appear in the file explorer on the left-hand side of Google Colab. Click on the folder icon to open this up and you should see the file you just created (there's a refresh button above the file list if you can't see it).

Hover over the file name to see three dots, then click on those to select **Download** and download to your computer.