# An example scraper

The code below can be copied and adapted to create your own scraper.

The first part installs and imports all the libraries. I've kept this separate from the other parts so that you don't have to reinstall them every time you want to run the scraper itself.

In [0]:
#install the libraries 
#scraperwiki is a library for scraping webpages
!pip install scraperwiki
import scraperwiki
#lxml.html is used to parse the HTML into a structured tree that we can search
import lxml.html
#cssselect is used to drill down into that and find data in tags
!pip install cssselect
import cssselect
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd

In [6]:
#This is a full URL for testing
testurl = "https://www.officialcharts.com/charts/singles-chart/20171222/7501/"

#Create a dataframe to store the data we are about to scrape
#It has one column called 'title'
#We call this dataframe 'df'
df = pd.DataFrame(columns=["title"])

#Scrape the html at that url
html = scraperwiki.scrape(testurl)
# turn our HTML into an lxml object
root = lxml.html.fromstring(html) 
#There are 100 recordings on the page
#The titles are all in <div class="title"> and then <a ...>
#This targets the contents of those html tags
titles = root.cssselect('div.title a')
#the results are always a list so we have to loop through it using a 'for' loop
for i in titles:
  #each item in the list is called i as it loops
  print(i)
  #on its own it looks odd, but we can attach .text_content() to translate it into text
  title = i.text_content()
  print(title)
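  #Now we need to add it as a new row in that dataframe called 'df'
  #(df.append was removed in pandas 2.0, so we use pd.concat instead)
  df = pd.concat([df, pd.DataFrame([{"title" : title}])], ignore_index=True)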


<Element a at 0x7fc60ac5c728>
PERFECT
<Element a at 0x7fc60aca7c28>
RIVER
<Element a at 0x7fc60aca7138>
LAST CHRISTMAS
<Element a at 0x7fc60aca7a98>
ALL I WANT FOR CHRISTMAS IS YOU
<Element a at 0x7fc60aca7638>
ANYWHERE
<Element a at 0x7fc60aca7ea8>
MAN'S NOT HOT
<Element a at 0x7fc60aca74a8>
FAIRYTALE OF NEW YORK
<Element a at 0x7fc60a67c868>
I MISS YOU
<Element a at 0x7fc60a67c6d8>
DIMELO
<Element a at 0x7fc60a67c908>
LET YOU DOWN
<Element a at 0x7fc60a67c4a8>
17
<Element a at 0x7fc60a67c7c8>
DO THEY KNOW IT'S CHRISTMAS
<Element a at 0x7fc60a67c728>
HAVANA
<Element a at 0x7fc60a67c318>
ROCKIN' AROUND THE CHRISTMAS TREE
<Element a at 0x7fc60a67c458>
MERRY CHRISTMAS EVERYONE
<Element a at 0x7fc60a67cae8>
WOLVES
<Element a at 0x7fc60a67c958>
BARKING
<Element a at 0x7fc60a67cb38>
STEP INTO CHRISTMAS
<Element a at 0x7fc60a67cb88>
IN YOUR HEAD
<Element a at 0x7fc60a67cbd8>
DRIVING HOME FOR CHRISTMAS
<Element a at 0x7fc60a67cc28>
WALK ON WATER
<Element a at 0x7fc60a67cc78>
IT'S BEGINNING TO

In [4]:
#Once the loop has finished we can take a look at the data
print(df)

                              title
0                           PERFECT
1                             RIVER
2                    LAST CHRISTMAS
3   ALL I WANT FOR CHRISTMAS IS YOU
4                          ANYWHERE
..                              ...
95                         MI GENTE
96            LONELY THIS CHRISTMAS
97                        ALL NIGHT
98                   1-800-273-8255
99  YOU MAKE IT FEEL LIKE CHRISTMAS

[100 rows x 1 columns]


In [0]:
#And we can export it
df.to_csv("scrapeddata.csv")
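
By default `to_csv` also writes the row numbers (0, 1, 2...) as an unnamed first column. If you would rather not have that column, pandas lets you pass `index=False`. The sketch below is only an illustration: it uses a small stand-in dataframe with two of the titles from above, and writes to an in-memory buffer rather than a file so you can see the difference straight away.

```python
import io
import pandas as pd

# A small stand-in for the scraped dataframe (two titles from the output above)
df_sample = pd.DataFrame({"title": ["PERFECT", "RIVER"]})

# Default behaviour: the row numbers are written as an extra first column
with_index = io.StringIO()
df_sample.to_csv(with_index)

# index=False leaves the row numbers out
without_index = io.StringIO()
df_sample.to_csv(without_index, index=False)

print(with_index.getvalue())
print(without_index.getvalue())
```

To apply the same idea to the real export, the line would become `df.to_csv("scrapeddata.csv", index=False)`.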

## How to adapt it

You can use most of this code without having to change it. All you *need* to change is this line, which specifies what webpage you want to scrape:

`testurl = "https://www.officialcharts.com/charts/singles-chart/20171222/7501/"`

And this line, which specifies what you want to scrape from that page:

`titles = root.cssselect('div.title a')`

If you're scraping one type of information from one page, changing those two lines will be enough: replace the URL, and replace the CSS selector `'div.title a'`, keeping the quotation marks around each.

To write the CSS selector, look at the HTML of the page you are scraping (for example by using your browser's 'View source' or 'Inspect' option) and identify the combination of tags that surrounds the data you want.

Some [reading around CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp) will help you here, but a couple of useful things to know include:

* A period `.` means `class="`
* A hash `#` means `id="`

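So `'div.title a'` means `<div class="title"><a ...>` - or, in other words, anything on the page inside an `<a>` tag (a link) within a `<div class="title">` tag. The space between the two parts means "somewhere inside", so the link does not have to sit directly beneath the `<div>`.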

The words used for variables (like "testurl" and "titles" above) may not be relevant to what you are scraping - but that doesn't matter, because those words are arbitrary. If you do decide to change them, make sure you change them *throughout* the code, or it will create an error.
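
As a rough illustration of that error, the sketch below assumes `testurl` was renamed to `page_address` (a hypothetical name picked for this example) on the line that creates it, but not on a later line that uses it. The `try` and `except` are only there so the example can show the message instead of stopping.

```python
# 'page_address' is a hypothetical new name chosen for this illustration
page_address = "https://www.officialcharts.com/charts/singles-chart/20171222/7501/"

try:
    # This line still uses the old name, which no longer exists
    print(testurl)
except NameError as error:
    # Python reports which name it could not find
    message = str(error)
    print(message)
```

The message names the variable Python could not find, which tells you which line still needs updating.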
