# An example scraper - this time grabbing more than one piece of info

The code below can be copied and adapted to create your own scraper. This builds on [a previous scraper which introduced the use of lists in scraping](https://github.com/paulbradshaw/MED7369-Specialist-Investigative-Journalism/blob/master/python/anExampleScraperList.ipynb).

The first part installs and imports the libraries we need.

In [None]:
#install the libraries 
#scraperwiki is a library for scraping webpages
!pip install scraperwiki
import scraperwiki
#lxml.html is used to parse the HTML into a structured object that we can search
import lxml.html
#cssselect is used to drill down into that and find data in tags
!pip install cssselect
import cssselect
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd

## The previous code

Previously we looped through each item in a list and added it to a base url using the `+` operator, then scraped something from that URL.

We also stored the results in a dataframe. 

Here's the code that we got to:

In [2]:
#create a list of counties that we will need to generate URLs
counties = ["avon","bedfordshire","berkshire","birmingham"]
#store the base URL we will add those to
baseurl = "http://www.uk-go-karting.com/tracks/"

#Create a dataframe to store the data we are about to scrape
#It has one column called 'location'
#We call this dataframe 'df'
df = pd.DataFrame(columns=["location"])

#start looping through our list
for county in counties:
  fullurl = baseurl+county
  print(fullurl)
  #Scrape the html at that url
  html = scraperwiki.scrape(fullurl)
  # turn our HTML into an lxml object
  root = lxml.html.fromstring(html) 
  #The addresses on the page are all in <h3> tags
  #This targets those html tags
  addresses = root.cssselect('h3')
  #the results are always a list so we have to loop through it using a 'for' loop
  for i in addresses:
    #each item in the list is called i as it loops
    print(i)
    #on its own it looks odd, but we can attach .text_content() to translate it into text
    address = i.text_content()
    print(address)
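    #Now we need to store it in that variable called 'df' 
    #df.append() was removed in pandas 2.0, so we add the new row using pd.concat() instead
    df = pd.concat([df, pd.DataFrame([{
      "location" : address
      }])], ignore_index=True)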


http://www.uk-go-karting.com/tracks/avon
<Element h3 at 0x7fa797350cb0>
Avonmouth Way, Bristol, Avon BS11 9YA
http://www.uk-go-karting.com/tracks/bedfordshire
<Element h3 at 0x7fa79736cbf0>
Unit 27, Verey Road, Woodside Industrial Estate, Dunstable, Bedfordshire LU5 4TT
http://www.uk-go-karting.com/tracks/berkshire
<Element h3 at 0x7fa79736cef0>
Cradock Road, Reading, Berkshire RG2 0EE
http://www.uk-go-karting.com/tracks/birmingham
<Element h3 at 0x7fa79736cd70>
Fazeley Street, Birmingham B5 5SE
<Element h3 at 0x7fa79737d050>
Adderley Road South, Birmingham B8 1AD
<Element h3 at 0x7fa79737d170>
Park Lane, Oldbury, Birmingham B69 4JX
<Element h3 at 0x7fa79737d2f0>
Robeys Lane, Tamworth,  B78 1AR


In [3]:
#Once the loop has finished we can take a look at the data
print(df)

                                            location
0              Avonmouth Way, Bristol, Avon BS11 9YA
1  Unit 27, Verey Road, Woodside Industrial Estat...
2           Cradock Road, Reading, Berkshire RG2 0EE
3                  Fazeley Street, Birmingham B5 5SE
4             Adderley Road South, Birmingham B8 1AD
5             Park Lane, Oldbury, Birmingham B69 4JX
6                    Robeys Lane, Tamworth,  B78 1AR


## Storing more than one piece of information

If you want to store more than just one piece of information there are different ways to do that, often depending on the nature of the page you are scraping. 

One is to identify the tag containing *all* the elements of info you want. In this case, although the *addresses* are inside `<h3>` tags, the entry for *all* of the info about each track is contained within `<div class="trackintro">` like so:

```{html}
<div class="trackintro">
        <a href="http://www.uk-go-karting.com/tracks/birmingham/grand-prix-karting"><img src="http://www.uk-go-karting.com/images/tracks/thumbs/3-1.jpg" alt="Grand Prix Karting"></a>
        <h2><a href="http://www.uk-go-karting.com/tracks/birmingham/grand-prix-karting">Grand Prix Karting</a></h2>
        <h3>Adderley Road South, Birmingham B8 1AD</h3>
        <p>Track length: 970m | <a href="http://www.uk-go-karting.com/tracks/birmingham/grand-prix-karting">track details and activities</a> | <a onclick="_gaq.push(['_trackEvent(Karting Enquiry, County Page, Birmingham'])" href="http://www.uk-go-karting.com/calculate?track_id=3">book karting here</a></p>
```

So instead of targeting `<h3>` and looping through the matches we can *first* target that `<div class="trackintro">`, loop through those, and *then* within the matches, extract the *first* (and only) `<h3>` and other items. 

Here's the code changed to do that:

In [None]:
#create a list of counties that we will need to generate URLs
counties = ["avon","bedfordshire","berkshire","birmingham"]
#store the base URL we will add those to
baseurl = "http://www.uk-go-karting.com/tracks/"

#Create a dataframe to store the data we are about to scrape
#It has three columns: 'location', 'length' and 'track'
#We call this dataframe 'df'
df = pd.DataFrame(columns=["location","length","track"])

#start looping through our list
for county in counties:
  fullurl = baseurl+county
  print(fullurl)
  #Scrape the html at that url
  html = scraperwiki.scrape(fullurl)
  # turn our HTML into an lxml object
  root = lxml.html.fromstring(html) 
  #All of the info about each track is inside <div class="trackintro">
  #This targets those html tags
  tracks = root.cssselect('div.trackintro')
  #the results are always a list so we have to loop through it using a 'for' loop
  for i in tracks:
    #each item in the list is called i as it loops
    #we could grab all the text inside the div and store that, then split it later
    print("WHOLE THING",i.text_content())
    #but here we drill down deeper to grab any <a href> tags inside a <h2>
    links = i.cssselect('h2 a')
    #and any <h3> tags
    h3s = i.cssselect('h3')
    #and any <p> tags
    ps = i.cssselect('p')
    #knowing that there's only one - or at least we're only interested in the first
    #we can then store just that one, rather than having to loop
    firstlink = links[0]
    #and extract the text
    trackname = firstlink.text_content()
    #the same for h3
    firsth3 = h3s[0]
    address = firsth3.text_content()
    print(address)
    #and for the p (this stores the whole paragraph text, which begins with the track length)
    firstp = ps[0]
    length = firstp.text_content()
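    #Now we can store all in that variable called 'df' 
    #df.append() was removed in pandas 2.0, so we add the new row using pd.concat() instead
    df = pd.concat([df, pd.DataFrame([{
      "location" : address,
      "length" : length,
      "track" : trackname
      }])], ignore_index=True)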



There are other ways of achieving similar results. But this is the simplest strategy.
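
One caveat with taking `[0]` from each list of matches: if a track entry has no `<h3>` or no `<p>`, the list is empty and Python raises an `IndexError`, which stops the whole scraper. A minimal sketch of one way to guard against that is below. The helper name `first_text` is made up for this example and is not part of `lxml` or `scraperwiki`; note that it also strips surrounding whitespace, which the code above does not do.

```python
def first_text(matches, default=""):
    """Return the text of the first matched element, or `default` if nothing matched."""
    # cssselect() returns an empty list when nothing matches, and an empty list is falsy
    if not matches:
        return default
    # otherwise take the first match and extract its text, trimming stray spaces
    return matches[0].text_content().strip()
```

Inside the loop it could be used as `address = first_text(i.cssselect('h3'))`, so that a missing tag produces an empty cell in the dataframe instead of an error.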

In [11]:
#Once the loop has finished we can take a look at the data
print(df)

                                            location  ...                                   track
0              Avonmouth Way, Bristol, Avon BS11 9YA  ...                      Bristol Go Karting
1  Unit 27, Verey Road, Woodside Industrial Estat...  ...                    Dunstable Go Karting
2           Cradock Road, Reading, Berkshire RG2 0EE  ...                      Reading Go Karting
3                  Fazeley Street, Birmingham B5 5SE  ...  Teamworks Karting Birmingham (Central)
4             Adderley Road South, Birmingham B8 1AD  ...                      Grand Prix Karting
5             Park Lane, Oldbury, Birmingham B69 4JX  ...                   Birmingham Go Karting
6                    Robeys Lane, Tamworth,  B78 1AR  ...                Daytona Tamworth Karting

[7 rows x 3 columns]
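
The `length` column holds the text of the whole `<p>` tag, which in the HTML excerpt above reads "Track length: 970m | track details and activities | book karting here". If only the measurement is wanted, the text can be split after scraping. The sketch below assumes every entry follows the same "Track length: ... |" pattern as that excerpt, which has not been checked against every page; `extract_track_length` is a hypothetical helper name.

```python
def extract_track_length(paragraph_text):
    """Pull the measurement (e.g. '970m') out of the scraped paragraph text."""
    # keep only the part before the first pipe: "Track length: 970m "
    first_part = paragraph_text.split("|")[0]
    # split that on the first colon into a label, the colon itself, and the value
    label, colon, value = first_part.partition(":")
    # if there was no colon, fall back to the whole first part rather than losing data
    if not colon:
        return first_part.strip()
    return value.strip()


sample = "Track length: 970m | track details and activities | book karting here"
print(extract_track_length(sample))
```

To apply it to the whole column, one option is `df["length"] = df["length"].apply(extract_track_length)`.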


## Exporting the data

Dataframes in the `pandas` library have a method for exporting data: `to_csv()`.

It needs to be attached to the name of the dataframe variable with a period, then, in the brackets, you specify the name of the file you want to export it as. Make sure this ends in '.csv' so it can be used in a spreadsheet.

In [12]:
#And we can export it
df.to_csv("scrapeddata.csv")
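
By default `to_csv()` also writes the dataframe's index (the 0, 1, 2... row numbers) as an unnamed first column. If that column isn't wanted in the spreadsheet, passing `index=False` leaves it out. The small sketch below uses a stand-in dataframe, `df_example`, built from one row of the results above; calling `to_csv()` without a filename returns the CSV as text, which makes it easy to inspect.

```python
import pandas as pd

# A stand-in dataframe holding one row from the results above
df_example = pd.DataFrame({
    "location": ["Adderley Road South, Birmingham B8 1AD"],
    "track": ["Grand Prix Karting"],
})

# With no filename, to_csv() returns the CSV text instead of writing a file
# index=False leaves out the row numbers
csv_text = df_example.to_csv(index=False)
print(csv_text)
```

For the scraped data the equivalent would be `df.to_csv("scrapeddata.csv", index=False)`.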

## Downloading the data

Once exported, the file should appear in the file explorer on the left-hand side of Google Colab. Click on the folder icon to open this up and you should see the file you just created (if you can't, use the refresh button above the file list).

Hover over the file name to see three dots, then click on those to select **Download** and download to your computer.