# An example scraper

The code below can be copied and adapted to create your own scraper.

The first part imports all the libraries. I've kept this separate to the other parts so that you don't have to import them every time you want to run the scraper itself.

In [1]:
#import the libraries
#requests is a library for fetching webpages from a URL
import requests
#BeautifulSoup is a library for scraping webpages
from bs4 import BeautifulSoup
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd

In [4]:
#This is a full URL for testing
testurl = "https://www.officialcharts.com/charts/singles-chart/20171222/7501/"

#Create a dataframe to store the data we are about to scrape
#It has one column called 'title'
#We call this dataframe 'df'
df = pd.DataFrame(columns=["title"])

#Scrape the html at that url
html = requests.get(testurl)
# turn our HTML *content* into a BeautifulSoup object
# naming the parser ("html.parser" is built into Python) avoids a warning
soup = BeautifulSoup(html.content, "html.parser")
#There are 100 recordings on the page
#The titles are all in <div class="title"> and then an <a> tag inside that
#This targets the contents of those html tags
titles = soup.select('div.title a')
#the results are always a list - we'll deal with that in a second

#before we grab the *text* of those titles, we need an empty list to store those
titletext = []

#now back to the list of title tags - we have to loop through it using a 'for' loop
for i in titles:
  #each item in the list is called i as it loops - let's print each item
  print(i)
  #this includes the tags, but we can attach .get_text() to extract just the text
  title = i.get_text()
  print(title)
  #Now we need to store it in that list using .append()
  titletext.append(title)

#once the loop has finished we've appended 100 items to the previously empty list
#we can now use that to create a column in a dataframe variable called 'df' 
df = pd.DataFrame({"titles": titletext})
df.head()

<a href="/search/singles/perfect/">PERFECT</a>
PERFECT
<a href="/search/singles/river/">RIVER</a>
RIVER
<a href="/search/singles/last-christmas/">LAST CHRISTMAS</a>
LAST CHRISTMAS
<a href="/search/singles/all-i-want-for-christmas-is-you/">ALL I WANT FOR CHRISTMAS IS YOU</a>
ALL I WANT FOR CHRISTMAS IS YOU
<a href="/search/singles/anywhere/">ANYWHERE</a>
ANYWHERE
<a href="/search/singles/man's-not-hot/">MAN'S NOT HOT</a>
MAN'S NOT HOT
<a href="/search/singles/fairytale-of-new-york/">FAIRYTALE OF NEW YORK</a>
FAIRYTALE OF NEW YORK
<a href="/search/singles/i-miss-you/">I MISS YOU</a>
I MISS YOU
<a href="/search/singles/dimelo/">DIMELO</a>
DIMELO
<a href="/search/singles/let-you-down/">LET YOU DOWN</a>
LET YOU DOWN
<a href="/search/singles/17/">17</a>
17
<a href="/search/singles/do-they-know-it's-christmas/">DO THEY KNOW IT'S CHRISTMAS</a>
DO THEY KNOW IT'S CHRISTMAS
<a href="/search/singles/havana/">HAVANA</a>
HAVANA
<a href="/search/singles/rockin'-around-the-christmas-tree/">ROCKIN' ARO

                            titles
0                          PERFECT
1                            RIVER
2                   LAST CHRISTMAS
3  ALL I WANT FOR CHRISTMAS IS YOU
4                         ANYWHERE


In [5]:
#Once the loop has finished we can take a look at the data
print(df)

                             titles
0                           PERFECT
1                             RIVER
2                    LAST CHRISTMAS
3   ALL I WANT FOR CHRISTMAS IS YOU
4                          ANYWHERE
..                              ...
95                         MI GENTE
96            LONELY THIS CHRISTMAS
97                        ALL NIGHT
98                   1-800-273-8255
99  YOU MAKE IT FEEL LIKE CHRISTMAS

[100 rows x 1 columns]


In [6]:
#And we can export it
df.to_csv("scrapeddata.csv")
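
By default `to_csv` also writes the dataframe's index (the numbering from 0 to 99) as an extra first column with no name. If you don't want that column, one common option is to add `index=False`. Here's a small sketch of the difference, using a made-up two-row dataframe called `sampledf` rather than the scraped data:

```python
import pandas as pd

#A small, made-up dataframe standing in for the scraped one
sampledf = pd.DataFrame({"titles": ["PERFECT", "RIVER"]})

#With no filename, to_csv() returns the CSV as text instead of saving a file
withindex = sampledf.to_csv()
noindex = sampledf.to_csv(index=False)

#The first version starts with an empty column name for the index
print(withindex)
#The second version only has the 'titles' column
print(noindex)
```

In the scraper itself that would be `df.to_csv("scrapeddata.csv", index=False)`.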

## How to adapt it

You can use most of this code without having to change it. All you *need* to change is this line, which specifies what webpage you want to scrape:

`testurl = "https://www.officialcharts.com/charts/singles-chart/20171222/7501/"`

And this line, which specifies what you want to scrape from that page:

`titles = soup.select('div.title a')`

If you're scraping one type of information from one page, that will be enough. 

Specifically, you will need to change the URL (keeping the quotation marks), and the CSS selector `'div.title a'` (again, keeping the quotation marks).

For the CSS selector you will need to identify the HTML in the page you are scraping, and the combination of tags that is being used. 

Some [reading around CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp) will help you here, but a couple of useful things to know include:

* A period `.` means `class="`
* A hash `#` means `id="`

So `'div.title a'` means `<div class="title"><a ...>` - or, in other words, anything on the page inside an `<a>` tag (a link) within a `<div class="title">` tag.
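
To see how those selectors behave, here is a minimal sketch that runs them against a tiny made-up snippet of HTML rather than the real chart page. The HTML, the `id="chart"` value and the variable names are all invented for illustration; only `BeautifulSoup`, `.select()` and `.get_text()` come from the scraper above.

```python
from bs4 import BeautifulSoup

#A tiny, made-up piece of HTML (not copied from the real page)
samplehtml = """
<div id="chart">
  <div class="title"><a href="/one/">SONG ONE</a></div>
  <div class="artist"><a href="/a/">ARTIST A</a></div>
  <div class="title"><a href="/two/">SONG TWO</a></div>
</div>
"""

samplesoup = BeautifulSoup(samplehtml, "html.parser")

#A period targets a class: <a> tags inside <div class="title">
titlelinks = samplesoup.select('div.title a')
titletexts = []
for i in titlelinks:
  titletexts.append(i.get_text())
print(titletexts)

#A hash targets an id: every <a> tag inside the tag with id="chart"
alllinks = samplesoup.select('#chart a')
print(len(alllinks))
```

The first selector skips the artist link because it sits in a `<div>` with a different class; the second picks up all three links because they are all inside the tag with that id.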

The words used for variables (like "testurl" and "titles" above) may not be relevant to what you are scraping - but that doesn't matter, because those words are arbitrary. If you do decide to change them, make sure you change them *throughout* the code, or it will create an error.


## Expanding from one column to multiple columns

In the example scraper we scraped just one type of information (the titles), and created a dataframe with one column for 100 instances of that.

We did that by extracting the text of each of the 100 instances, and adding that text to a list. Once the list was finished we created a dataframe using that list as the column.

But what if we wanted more than one type of info?

To do that we repeat what we did so that instead of creating a dataframe with one list as the column we use two or three or more lists as the columns. 

This means we will need to:

* Grab the matches for the other two tags and store in two other lists
* Create another two empty lists - one for each column we want to add
* Loop through each list of tags, and extract the text contents, storing it in that empty list
* Once we have those 3 lists of text, create a dataframe using those as columns

Here's the code that does that.

In [9]:

#Create 3 lists that each selects the contents of different tags
titles = soup.select('div.title a')
artists = soup.select('div.artist a')
labels = soup.select('div.label-cat span')
#How long is each list?
listlength = len(titles)
print(listlength)

#create 3 empty lists for our columns - which we can fill in the loop
titlecol = []
artistcol = []
labelcol = []

#we loop through the titles grabbed
for i in titles:
  #each item in the list generated is called i as it loops
  print(i)
  #we access the item at that position in the list, and grab its text 
  title = i.get_text()
  #and add it to the relevant list
  titlecol.append(title)
#once this loop finishes we have 100 items

#now we loop through the artists grabbed
for i in artists:
  #each item in the list generated is called i as it loops
  print(i)
  #we access the item at that position in the list, and grab its text 
  artist = i.get_text()
  #and add it to the relevant list
  artistcol.append(artist)
#once this loop finishes we have a second list of 100 items

#finally we loop through the labels grabbed
for i in labels:
  #each item in the list generated is called i as it loops
  print(i)
  #we access the item at that position in the list, and grab its text 
  label = i.get_text()
  #and add it to the relevant list
  labelcol.append(label)
#now we should have 3 lists of 100 items each, which we can use to create a dataframe

# we name the 3 columns as strings, then name the list variable we want under that string
df3cols = pd.DataFrame({"title":titlecol, "artist": artistcol, "label": labelcol})
print(df3cols)

100
<a href="/search/singles/perfect/">PERFECT</a>
<a href="/search/singles/river/">RIVER</a>
<a href="/search/singles/last-christmas/">LAST CHRISTMAS</a>
<a href="/search/singles/all-i-want-for-christmas-is-you/">ALL I WANT FOR CHRISTMAS IS YOU</a>
<a href="/search/singles/anywhere/">ANYWHERE</a>
<a href="/search/singles/man's-not-hot/">MAN'S NOT HOT</a>
<a href="/search/singles/fairytale-of-new-york/">FAIRYTALE OF NEW YORK</a>
<a href="/search/singles/i-miss-you/">I MISS YOU</a>
<a href="/search/singles/dimelo/">DIMELO</a>
<a href="/search/singles/let-you-down/">LET YOU DOWN</a>
<a href="/search/singles/17/">17</a>
<a href="/search/singles/do-they-know-it's-christmas/">DO THEY KNOW IT'S CHRISTMAS</a>
<a href="/search/singles/havana/">HAVANA</a>
<a href="/search/singles/rockin'-around-the-christmas-tree/">ROCKIN' AROUND THE CHRISTMAS TREE</a>
<a href="/search/singles/merry-christmas-everyone/">MERRY CHRISTMAS EVERYONE</a>
<a href="/search/singles/wolves/">WOLVES</a>
<a href="/search/s

## Bonus tip: scraping multiple pieces of information with `range` and indexes

Another way to scrape multiple selectors is to use indexes to loop through them. I'll explain what I mean...

In the chart scraped above, there are 100 songs listed. When we grab different pieces of information with selectors we end up with 100 titles, 100 artists, 100 positions, and so on.

At the moment it loops through the titles like this:

```python
for i in titles:
  title = i.get_text()
```

However, we could instead loop through a range of numbers:

```python
for i in range(0,100):
  title = titles[i].get_text()
```

This means that each time we are accessing a different index position in that list `titles`: the first time the loop runs it grabs `titles[0]`; the second time `titles[1]` and so on.

`range(0,100)` stops at `99` because the second number is not included. That is fine: indexes start at 0, so index `99` is the 100th position in the list.
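
If the way `range` counts is unfamiliar, this small sketch shows it using a short made-up list (`sampletitles`) in place of the 100 titles:

```python
#A short, made-up list standing in for the list of titles
sampletitles = ["PERFECT", "RIVER", "LAST CHRISTMAS"]

#range(0,3) generates 0, 1 and 2 - it stops before the second number
positions = list(range(0, len(sampletitles)))
print(positions)

#The last index is always one less than the length of the list
lastindex = len(sampletitles) - 1
print(sampletitles[lastindex])
```

Using `len()` instead of typing the number yourself means the loop still works if the page has more or fewer than 100 items.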

Using this approach means we don't need 3 different loops - but can loop through all 3 lists at once in a single loop. The difference is that we will be looping through *numbers*, and using each number to access the item at that *position* in all 3 lists. 

This is how it works:

We loop through numbers from 0 to 99. The first time we loop the number is 0, and we use this to say "get the item at position 0 in the list 'titles', and position 0 in the list 'artists' and position 0 in the list 'labels'"

We've now grabbed 3 items from 3 lists!

Here's the code:

In [7]:
#Change our dataframe to have 3 columns
#We call this dataframe 'df'
df = pd.DataFrame(columns=["title","artist","label"])

#Create 3 lists that each selects the contents of different tags
titles = soup.select('div.title a')
artists = soup.select('div.artist a')
labels = soup.select('div.label-cat span')
#How long is each list?
listlength = len(titles)
print(listlength)
#we loop through a range of numbers rather than the lists themselves
for i in range(0,listlength):
  #each number in the list generated is called i as it loops
  print(i)
  #we access the item at that position in the list, and grab its text contents
  title = titles[i].get_text()
  artist = artists[i].get_text()
  label = labels[i].get_text()
  print(title, artist, label)
  #Now we need to store it as a new row in that variable called 'df'
  #(df.append() was removed in pandas 2.0, so we add the row with .loc instead)
  df.loc[len(df)] = [title, artist, label]

print(df)

100
0
PERFECT ED SHEERAN ASYLUM
1
RIVER EMINEM FT ED SHEERAN INTERSCOPE
2
LAST CHRISTMAS WHAM RCA
3
ALL I WANT FOR CHRISTMAS IS YOU MARIAH CAREY COLUMBIA
4
ANYWHERE RITA ORA ATLANTIC
5
MAN'S NOT HOT BIG SHAQ ISLAND
6
FAIRYTALE OF NEW YORK POGUES FT KIRSTY MACCOLL WARNER BROS
7
I MISS YOU CLEAN BANDIT FT JULIA MICHAELS ATLANTIC/POLYDOR/REPUBLIC
8
DIMELO RAK-SU FT WYCLEF/NAUGHTY BOY SYCO MUSIC
9
LET YOU DOWN NF EMI
10
17 MK COLUMBIA
11
DO THEY KNOW IT'S CHRISTMAS BAND AID MERCURY
12
HAVANA CAMILA CABELLO FT YOUNG THUG EPIC/SYCO MUSIC
13
ROCKIN' AROUND THE CHRISTMAS TREE BRENDA LEE MCA
14
MERRY CHRISTMAS EVERYONE SHAKIN' STEVENS RCA
15
WOLVES SELENA GOMEZ & MARSHMELLO INTERSCOPE
16
BARKING RAMZ POLYDOR
17
STEP INTO CHRISTMAS ELTON JOHN MERCURY
18
IN YOUR HEAD EMINEM INTERSCOPE
19
DRIVING HOME FOR CHRISTMAS CHRIS REA WARNER BROS
20
WALK ON WATER EMINEM FT BEYONCE INTERSCOPE
21
IT'S BEGINNING TO LOOK A LOT LIKE MICHAEL BUBLE REPRISE
22
I WISH IT COULD BE CHRISTMAS EVERYDAY WIZZARD EMI
23
NA
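
The indexed loop above assumes all three lists are the same length: if `artists` were shorter than `titles`, then `artists[i]` would eventually raise an `IndexError`. One alternative (not used in the original scraper) is Python's built-in `zip()`, which pairs up items from several lists and stops at the shortest one. The sketch below uses short made-up lists of text in place of the scraped tags:

```python
#Made-up lists standing in for the text scraped from each selector
sampletitles = ["PERFECT", "RIVER", "LAST CHRISTMAS"]
sampleartists = ["ED SHEERAN", "EMINEM FT ED SHEERAN", "WHAM"]
samplelabels = ["ASYLUM", "INTERSCOPE", "RCA"]

rows = []
#zip() hands us one item from each list every time the loop runs
for title, artist, label in zip(sampletitles, sampleartists, samplelabels):
  rows.append({"title": title, "artist": artist, "label": label})

print(rows[0])
print(len(rows))
```

A list of dictionaries like `rows` can be turned into a dataframe in one go with `pd.DataFrame(rows)`.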

## Scraping multiple pages

So far we've only scraped one page. To scrape more than one page we just extend the process using another loop.

To do this we need a list of URLs to repeat the scraper on. The structure will look something like this:

* Create dataframe
* Create list of URLs
* For each URL, run scraping code and append results to dataframe

Creating the list of URLs might be done in one of the following ways: 

* Scraping another page: for example you might scrape a search results page to get that list of URLs, and then loop through those.
* Following 'next' links from page to page: the charts page, for example, has a 'next' link which can be identified using the same technique (CSS selectors) and then used as the next URL to scrape
* Generating URLs that follow a pattern: this might be sequential numbering, or it might be using codes that you can find elsewhere and loop through. 
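
Whichever way you build the list, the loop itself could look something like the sketch below. This is only one way to arrange it, under some assumptions: the function names (`scrapetitles`, `fetchhtml`, `scrapepages`) are made up for this example, every page is assumed to use the same `'div.title a'` selector as the chart page, and the extra `url` column is my own addition so you can tell which page each row came from.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrapetitles(html):
  #Turn one page's HTML into a list of title text, as in the example scraper
  soup = BeautifulSoup(html, "html.parser")
  titletext = []
  for i in soup.select('div.title a'):
    titletext.append(i.get_text())
  return titletext

def fetchhtml(url):
  #Fetch the raw HTML for one URL
  return requests.get(url).content

def scrapepages(urllist, fetch=fetchhtml):
  #Collect one dictionary per title, across all the pages
  rows = []
  for url in urllist:
    for title in scrapetitles(fetch(url)):
      rows.append({"url": url, "title": title})
  #Create the dataframe once, after the loops have finished
  return pd.DataFrame(rows, columns=["url", "title"])
```

To use it you would pass in your list of URLs, for example `df = scrapepages([testurl])`. The `fetch` setting is only there so the function can be tried out on saved HTML without going online.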



## Generating URLs for a scraper to loop through

Alternatively you might *generate* the URLs: for example, if they end in a number that goes up by 1 each time you can use `range` to generate that list of numbers and add them to the URL using `+`.

However, you cannot mix numbers and strings, so you need to convert the numbers to a string as you do this. Here's an example:

In [None]:
#Create the basic URL that appears before the number
baseurl = "http://mypage.com?page="
#Create a list of numbers to put on the end
pagenums = range(1,11)
#Now generate the URLs by looping through the list and adding it to the URL
for i in pagenums:
  #Combine the two - 
  #this will generate an error because we are trying to combine a string and a number
  fullurl = baseurl+i

TypeError: ignored

## Tip: converting numbers into strings

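Running that code produces a `TypeError`. In older versions of Python the message reads `must be str, not int`; newer versions say `can only concatenate str (not "int") to str`. Either way, the second part must be a string, not an integer.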

To fix that you can use the `str()` function, which will convert a number into a string.

In [None]:
#Create the basic URL that appears before the number
baseurl = "http://mypage.com?page="
#Create a list of numbers to put on the end
pagenums = range(1,11)
#Now generate the URLs by looping through the list and adding it to the URL
for i in pagenums:
  #Convert i to a string
  i = str(i)
  #Combine the two
  fullurl = baseurl+i
  #print it
  print(fullurl)

http://mypage.com?page=1
http://mypage.com?page=2
http://mypage.com?page=3
http://mypage.com?page=4
http://mypage.com?page=5
http://mypage.com?page=6
http://mypage.com?page=7
http://mypage.com?page=8
http://mypage.com?page=9
http://mypage.com?page=10
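
To feed these URLs into a scraping loop you would normally store them in a list rather than just printing them. Here's a small sketch of that, reusing the placeholder `baseurl` from above; `urllist` is a name I've made up for this example.

```python
#Create the basic URL that appears before the number (a placeholder address)
baseurl = "http://mypage.com?page="
#Create an empty list to store the finished URLs
urllist = []
for i in range(1,11):
  #Convert the number to a string, add it to the URL, and store the result
  urllist.append(baseurl+str(i))

#How many URLs do we have, and what are the first and last?
print(len(urllist))
print(urllist[0])
print(urllist[-1])
```

That list can then be looped through with `for url in urllist:`, running the scraping code on each one.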
