# An example scraper covering multiple 'next' pages

The code below can be copied and adapted to create your own scraper.

The first code block imports all the libraries. I've kept this separate from the other parts so that you don't have to import them every time you want to run the scraper itself.

The second code block uses those libraries to scrape a number of pages linked by 'next' buttons. To do this it creates some custom functions.

## Defining your own custom functions 

Often in scraping we will need to run the same block of code over and over again. In this situation it is useful to store that code in a function of its own that we can call when we want to re-use it.

To create a custom function in Python you use the command `def` followed by the name you want to give your function. You then need some parentheses containing the names you want to give its ingredients (the parameters). 

Finally, you need to add a **colon**.

For example: `def dosomething(withthis):`

When you press enter the next line will be indented and any indented lines that follow will be stored within the function.

However, the code inside the function will not run *until the function is called*.

You call the function as you do any other. For example, to call the `dosomething` function used as an example above, you would simply type something like:

`dosomething(putyourspecificthinghere)`

This will take the variable `putyourspecificthinghere` and **pass** it to that function (as its one ingredient). Within the function this is assigned to a new **local variable** (called `withthis` in the example above) which only exists while that function is running.
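
As a minimal sketch of the steps above, the example below defines a function, calls it, and then checks that its local variable is gone once it has finished running. The names `greet`, `name`, `message` and `exists_outside` are made up for illustration.

```python
# Define the function: nothing inside it runs yet
def greet(name):
    # 'name' and 'message' are local variables:
    # they only exist while the function is running
    message = "Hello, " + name
    print(message)

# Call the function, passing it one ingredient (an argument)
greet("Ada")

# Outside the function, the local variable 'message' does not exist,
# so trying to use it raises a NameError
try:
    message
    exists_outside = True
except NameError:
    exists_outside = False

print(exists_outside)
```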

## Functions that `return` information

Because information within a function only exists while it is running, some functions pass information *back* so that it can be stored.

This is normally done with a `return` statement (the parentheses around the returned value are optional). Here's an example:

```python
def addup(num1, num2):
  total = num1+num2
  return(total)

gettingaresult = addup(5,10)
```

In the example above a function is defined which **returns** the total of the two numbers it is supplied with ("passed"). 

The function is called with the numbers 5 and 10 as part of a line that creates a new variable, so the result it returns (15) is stored in that variable.
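
To illustrate why `return` matters, here is a small sketch comparing a function that returns its result with one that only prints it. The function names `addup_and_return` and `addup_and_print`, and the variables `stored` and `lost`, are hypothetical and chosen just for this example.

```python
def addup_and_return(num1, num2):
    total = num1 + num2
    # Pass the total back to whoever called the function
    return total

def addup_and_print(num1, num2):
    total = num1 + num2
    # Show the total on screen, but do not pass it back
    print(total)

stored = addup_and_return(5, 10)
lost = addup_and_print(5, 10)

print(stored)  # 15
print(lost)    # None, because the function had no return statement
```

A function without a `return` statement still hands something back, but it is always the empty value `None`, which is why `lost` holds nothing useful.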

In [2]:
#import the libraries
#requests is a library for fetching webpages from a URL
import requests
#BeautifulSoup is a library for scraping webpages
from bs4 import BeautifulSoup
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd

In [3]:
#This is a full URL for testing
testurl = "https://www.officialcharts.com/charts/singles-chart/20171222/7501/"

#Create a dataframe to store the data we are about to scrape
#It has 5 columns, supplied as a list (a sixth, 'position', is added when the first row is stored)
#We call this dataframe 'df'
df = pd.DataFrame(columns=["title","artist","label","url","date"])
print(df)

#Function to find the 'next' link
def findnext(url):
  #Scrape the html at that url
  html = requests.get(url)
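  # turn our HTML *content* into a BeautifulSoup object
  # naming the parser ("html.parser" is built into Python) avoids a warning from BeautifulSoup
  soup = BeautifulSoup(html.content, "html.parser")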
  #Grab any elements within <a ... class="next">
  nexts = soup.select('a.next')
  #There are two identical 'next' links on the page, so we grab the href="" attribute of the first
  nextlink = nexts[0]['href']
  #Print it - note it's a relative link so it has to be combined with the base URL
  print(nextlink)
  #Return the full link
  return("https://www.officialcharts.com"+nextlink)

#Function to scrape the info from the page itself
def scrapepage(url):
  #This specifies that we want it to treat df as a global variable
  #Otherwise it will treat it as a local variable that only exists within this function
  global df
  #Scrape the html at that url
  html = requests.get(url)
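  # turn our HTML *content* into a BeautifulSoup object
  # naming the parser ("html.parser" is built into Python) avoids a warning from BeautifulSoup
  soup = BeautifulSoup(html.content, "html.parser")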
  #Create 4 lists that each select the contents of different tags
  titles = soup.select('div.title a')
  artists = soup.select('div.artist a')
  labels = soup.select('div.label-cat span')
  positions = soup.select('span.position')
  listlength = len(titles)
  for i in range(0,listlength):
    #each number in the list generated is called i as it loops
    print(i)
    #we access the item at that position in the list, and grab its text contents
    title = titles[i].get_text()
    artist = artists[i].get_text()
    label = labels[i].get_text()
    position = positions[i].get_text()
    #print them all
    print(title, artist, label, position)
    #Now we need to store it in that variable called 'df' 
    #We also store the url that we've been using throughout
    #Let's extract the date from that too
    date = url.replace("https://www.officialcharts.com/charts/singles-chart/","") #replace the first part of the url with nothing
    date = date.replace("/7501","") #replace the end of the url with nothing
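    #DataFrame.append() was removed in pandas 2.0
    #so we build a one-row dataframe and add it to df with pd.concat()
    newrow = pd.DataFrame([{
      "title" : title,
      "artist" : artist,
      "label" : label,
      "position" : position,
      "url" : url,
      "date" : date
      }])
    df = pd.concat([df, newrow], ignore_index=True)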

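#Run both functions 5 times to get 5 links - change the range to get more pages
#Note that the page at the starting testurl is not itself scraped: scraping begins with its first 'next' page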
for i in range(0,5):
  #Run the first function we defined, with testurl
  #We want the URL to change each time it runs, so we overwrite it each time
  testurl = findnext(testurl)
  print(testurl)
  #Then we run the second function we defined, to scrape that url
  scrapepage(testurl)


#This scraper is a little inefficient as it scrapes each page twice
# - first to get the data and then again to get the next link
# A more efficient approach would do both at the same time but this would make more complex code and I wanted to keep it simpler in this example

Empty DataFrame
Columns: [title, artist, label, url, date]
Index: []
/charts/singles-chart/20180104/7501
https://www.officialcharts.com/charts/singles-chart/20180104/7501
0
PERFECT ED SHEERAN ASYLUM 1
1
LAST CHRISTMAS WHAM RCA 2
2
RIVER EMINEM FT ED SHEERAN INTERSCOPE 3
3
ALL I WANT FOR CHRISTMAS IS YOU MARIAH CAREY COLUMBIA 4
4
FAIRYTALE OF NEW YORK POGUES FT KIRSTY MACCOLL WARNER BROS 5
5
MAN'S NOT HOT BIG SHAQ ISLAND 6
6
DO THEY KNOW IT'S CHRISTMAS BAND AID MERCURY 7
7
ANYWHERE RITA ORA ATLANTIC 8
8
ROCKIN' AROUND THE CHRISTMAS TREE BRENDA LEE MCA 9
9
MERRY CHRISTMAS EVERYONE SHAKIN' STEVENS RCA 10
10
STEP INTO CHRISTMAS ELTON JOHN MERCURY 11
11
HAVANA CAMILA CABELLO FT YOUNG THUG EPIC/SYCO MUSIC 12
12
IT'S BEGINNING TO LOOK A LOT LIKE MICHAEL BUBLE REPRISE 13
13
DRIVING HOME FOR CHRISTMAS CHRIS REA WARNER BROS 14
14
I WISH IT COULD BE CHRISTMAS EVERYDAY WIZZARD EMI 15
15
MERRY XMAS EVERYBODY SLADE BMG 16
16
IT'S THE MOST WONDERFUL TIME OF THE YEAR ANDY WILLIAMS SONY MUSIC 17
17
I M
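
`findnext` builds the full link by joining the base URL and the relative link as strings, and `scrapepage` pulls the date out with two `replace` calls. As an alternative sketch (not the approach used in this notebook), Python's standard library function `urllib.parse.urljoin` can resolve relative links, and a regular expression can pick out the eight-digit date. The helper name `extractdate` and the variables `base`, `relative` and `fullurl` are hypothetical.

```python
import re
from urllib.parse import urljoin

base = "https://www.officialcharts.com"
relative = "/charts/singles-chart/20180104/7501"

# urljoin resolves a relative link against a base URL
fullurl = urljoin(base, relative)
print(fullurl)

def extractdate(url):
    # Look for exactly 8 digits between two slashes, e.g. /20180104/
    match = re.search(r"/(\d{8})/", url)
    # Return the digits if found, otherwise None
    return match.group(1) if match else None

print(extractdate(fullurl))
# The same pattern also copes with a trailing slash on the URL
print(extractdate(fullurl + "/"))
```

The regular expression does not depend on the exact text before or after the date, so it should keep working if the URL gains or loses a trailing slash.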

In [4]:
#Once the loop has finished we can take a look at the data
print(df)

                               title                          artist  \
0                            PERFECT                      ED SHEERAN   
1                     LAST CHRISTMAS                            WHAM   
2                              RIVER            EMINEM FT ED SHEERAN   
3    ALL I WANT FOR CHRISTMAS IS YOU                    MARIAH CAREY   
4              FAIRYTALE OF NEW YORK        POGUES FT KIRSTY MACCOLL   
..                               ...                             ...   
495                              YRF                       GRM DAILY   
496                   ALL FALLS DOWN  ALAN WALKER/N CYRUS/DIGITAL FA   
497                             KEKE  6IX9INE/FETTY WAP/A BOOGIE WIT   
498           HOLD ME TIGHT OR DON'T                    FALL OUT BOY   
499                     BODAK YELLOW                         CARDI B   

           label                                                url      date  \
0         ASYLUM  https://www.officialcharts.com/chart
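
`scrapepage` adds rows to the global `df` one at a time. A common alternative, sketched here with two sample rows copied from the output above rather than freshly scraped data, is to collect each row as a dictionary in a list and build the dataframe once at the end. The names `rows` and `df_example` are hypothetical.

```python
import pandas as pd

# Collect each row as a dictionary in an ordinary Python list
rows = []
rows.append({"title": "PERFECT", "artist": "ED SHEERAN",
             "label": "ASYLUM", "position": "1", "date": "20180104"})
rows.append({"title": "LAST CHRISTMAS", "artist": "WHAM",
             "label": "RCA", "position": "2", "date": "20180104"})

# Build the dataframe in one go once all the rows are collected
df_example = pd.DataFrame(rows)
print(df_example)
```

Because the list is an ordinary variable that a function can return, this approach would also remove the need for the `global` statement.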

In [None]:
#And we can export it
df.to_csv("scrapeddata.csv")
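
The comments at the end of the scraping cell note that each page is fetched twice. One way to avoid that, sketched here under assumptions, is a single function that parses a page once and returns both its rows and its 'next' link. To keep the sketch runnable without a network connection, it parses a small hand-written HTML sample that imitates the selectors used above. The sample markup and the names `samplehtml` and `parsepage` are hypothetical and do not reproduce the real page.

```python
from bs4 import BeautifulSoup

# Hand-written sample imitating the structure the scraper expects
samplehtml = """
<a class="next" href="/charts/singles-chart/20180104/7501">next</a>
<div class="title"><a>PERFECT</a></div>
<div class="artist"><a>ED SHEERAN</a></div>
<div class="title"><a>LAST CHRISTMAS</a></div>
<div class="artist"><a>WHAM</a></div>
"""

def parsepage(html):
    # Parse the HTML once and use the same soup for both jobs
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.select("div.title a")
    artists = soup.select("div.artist a")
    rows = []
    # zip pairs up the first title with the first artist, and so on
    for title, artist in zip(titles, artists):
        rows.append({"title": title.get_text(), "artist": artist.get_text()})
    # Grab the 'next' link if there is one
    nexts = soup.select("a.next")
    nextlink = nexts[0]["href"] if nexts else None
    return rows, nextlink

rows, nextlink = parsepage(samplehtml)
print(rows)
print(nextlink)
```

In a real scraper the content fetched by `requests.get(url)` would be passed to the function in place of the sample, so each page would only need to be requested once.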