# Web scraping with Python: getting started

## Introduction

This tutorial is a quick introduction to web scraping with Python. It covers:

- basics: we use requests and beautifulsoup4 (pip install...)
- scraping from simple html webpages
- scraping from paginated results

Obligatory disclaimer: the website's owner might not want you to scrape their contents.

Always take a look at the robots.txt file, found in the "root" directory of the majority of websites.

For example, it.wikipedia.org/robots.txt looks like this:
![title](img/robotstxt.png)

## Scraping a simple webpage

Assuming we're not violating the website's rules, we can download pretty much everything we want.

Therefore, let's download the list of every trap artist, according to Italian Wikipedia. It can be found at the url:

https://it.wikipedia.org/wiki/Categoria:Cantanti_trap

it looks like this:

![title](img/trappers.png)

With such a simple page, we could just copy-paste the wanted text then clean it. The advantage of scraping the text from the html is that 1) we learn the basics of web scraping and 2) the result is (moslty) already clean and for example it can be immediately saved as a textfile or manipulated with pandas.

First of all, we need to import requests and beautifulsoup4:

In [1]:
import requests # to make http call and donwload html sources; use the following settings:
headers = requests.utils.default_headers()
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'

from bs4 import BeautifulSoup as bs # to navigate the downloaded html and extract relevant information

import re # regular expressions in python

import random # generate random integers

import time # used to wait some time between http requests

Next, we make the http call to the url, which results in a request object:

In [2]:
url = "https://it.wikipedia.org/wiki/Categoria:Cantanti_trap" # urls are just strings

r = requests.get(url, headers = headers) # using the headers specified above

print(r)

<Response [200]>


Response 200 means everything's OK. Other kinds of responses (less OK) include 404 "Not Found" and 503 "Service Unavailable".

The response object has a text method, which shows the text downloaded with the request:

In [3]:
print(r.text[0:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="it" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Categoria:Cantanti trap - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":[",\t."," \t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","gennaio","febbraio","marzo","aprile","maggio","giugno","luglio","agosto","settembre","ottobre","novembre","dicembre"],"wgMonthNamesShort":["","gen","feb","mar","apr","mag","giu","lug","ago","set","ott","nov","dic"],"wgRequestId":"XigKXApAMFcAAJb-@84AAAAA","wgCSPNonce":!1,"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":14,"wgPageName":"Categoria:Cantanti_trap","wgTitle":"Cantanti trap","wgCurRevisionId":107757064,"wgRevisionId":107757064,"wgArticleId":5439423,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Cantanti hip hop","Musicisti trap"],"wg

Next, we need to parse this text with bs4, to make the html tags easy to navitage and their contents easy to access:

In [4]:
soup = bs(r.text, "lxml") # soupify i.e. make tag soup easy to navigate, using the lxml parser

print(soup.text[0:1000])




Categoria:Cantanti trap - Wikipedia
document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":[",\t."," \t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","gennaio","febbraio","marzo","aprile","maggio","giugno","luglio","agosto","settembre","ottobre","novembre","dicembre"],"wgMonthNamesShort":["","gen","feb","mar","apr","mag","giu","lug","ago","set","ott","nov","dic"],"wgRequestId":"XigKXApAMFcAAJb-@84AAAAA","wgCSPNonce":!1,"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":14,"wgPageName":"Categoria:Cantanti_trap","wgTitle":"Cantanti trap","wgCurRevisionId":107757064,"wgRevisionId":107757064,"wgArticleId":5439423,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Cantanti hip hop","Musicisti trap"],"wgPageContentLanguage":"it","wgPageContentModel":"wikitext","wgRelevantPageName":"Categoria:Cantanti_trap","wgReleva

It looks exactly the same!

What's the difference? The first is pure text. It can be manipulated with regular expressions, e.g. to remove the html tags and extract the desired information, but good luck with that.

The second is a  more complex object, with a series of useful associated methods to navigate the html tags.

For example, suppose we want to extract the title of the page:

In [5]:
soup.find('title') # find method yield the first found occurrence

<title>Categoria:Cantanti trap - Wikipedia</title>

Next, how do we know which tags contain the desired information?

We can inspect the html source:

(on google chrome, ctrl+shift+I opens the source inspector)

![title](img/inspect.png)

The names of the artists appear as text inside a-tags.

The method findAll will yield all the a-tags found in the soup.

In [6]:
a_tags = soup.findAll('a')

print(a_tags)

[<a id="top"></a>, <a class="mw-helplink" href="/wiki/Aiuto:Categorie" target="_blank">Aiuto</a>, <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>, <a class="mw-jump-link" href="#p-search">Jump to search</a>, <a href="/wiki/Cantante" title="Cantante">cantanti</a>, <a href="/wiki/Trap_(genere_musicale)" title="Trap (genere musicale)">trap</a>, <a href="/wiki/Categoria:Cantanti_per_nazionalit%C3%A0" title="Categoria:Cantanti per nazionalità">Cantanti per nazionalità</a>, <a href="/wiki/Categoria:Cantanti_per_genere" title="Categoria:Cantanti per genere">Cantanti per genere (tutti)</a>, <a href="/wiki/Categoria:Gruppi_musicali_trap" title="Categoria:Gruppi musicali trap">Gruppi musicali trap</a>, <a href="/wiki/Categoria:Musicisti_trap" title="Categoria:Musicisti trap">Musicisti trap</a>, <a href="/wiki/Categoria:Disc_jockey_trap" title="Categoria:Disc jockey trap">Disc jockey trap</a>, <a href="/wiki/Categoria:Album_trap" title="Categoria:Album trap">Album trap</a>, <a href

We can see observe that a_tags 1) is a list (default outputs of findAll) and more importantly 2) it contains a number of a-tags which we are not interested in.

To solve this problem we can restrict the search of the a-tags to some container above the wanted ones in the html structure, for example a div which is high enough to contain all the names but low enough to not contain useless stuff.

If we inspect the source again:
![title](img/inspect0.png)

So if we first capture the div with class=mw-category-generated and then we extract the a-tags:

In [7]:
div = soup.find('div', {'class' : 'mw-category-generated'}) 
# notice: the second argument is a dictionary, so it can potentially have more entries, for more refined definitions

print(div)

<div class="mw-category-generated" dir="ltr" lang="it"><div id="mw-pages">
<h2>Pagine nella categoria "Cantanti trap"</h2>
<p>Questa categoria contiene le 107 pagine indicate di seguito, su un totale di 107.
</p><div class="mw-content-ltr" dir="ltr" lang="it"><div class="mw-category"><div class="mw-category-group"><h3>0–9</h3>
<ul><li><a href="/wiki/6ix9ine" title="6ix9ine">6ix9ine</a></li>
<li><a href="/wiki/21_Savage" title="21 Savage">21 Savage</a></li></ul></div><div class="mw-category-group"><h3>A</h3>
<ul><li><a href="/wiki/Anuel_AA" title="Anuel AA">Anuel AA</a></li>
<li><a href="/wiki/ASAP_Ferg" title="ASAP Ferg">ASAP Ferg</a></li></ul></div><div class="mw-category-group"><h3>B</h3>
<ul><li><a href="/wiki/Bad_Bunny" title="Bad Bunny">Bad Bunny</a></li>
<li><a href="/wiki/Belly_(rapper)" title="Belly (rapper)">Belly (rapper)</a></li>
<li><a href="/wiki/Bhad_Bhabie" title="Bhad Bhabie">Bhad Bhabie</a></li>
<li><a href="/wiki/Birdman_(rapper)" title="Birdman (rapper)">Birdman (rap

In [8]:
a_tags = div.findAll('a')

print(a_tags[0:5])

[<a href="/wiki/6ix9ine" title="6ix9ine">6ix9ine</a>, <a href="/wiki/21_Savage" title="21 Savage">21 Savage</a>, <a href="/wiki/Anuel_AA" title="Anuel AA">Anuel AA</a>, <a href="/wiki/ASAP_Ferg" title="ASAP Ferg">ASAP Ferg</a>, <a href="/wiki/Bad_Bunny" title="Bad Bunny">Bad Bunny</a>]


We can build a list of strings containing the names of the artists by extracting the text attribute from each a-tag:

In [9]:
names = [a.text for a in a_tags]

print(names)

['6ix9ine', '21 Savage', 'Anuel AA', 'ASAP Ferg', 'Bad Bunny', 'Belly (rapper)', 'Bhad Bhabie', 'Birdman (rapper)', 'Lele Blade', 'Booba (rapper)', 'Quando Rondo', 'Maikel Delacalle', 'Capo Plaza', 'Cardi B', 'Chanmina', 'Chief Keef', 'CL (cantante)', 'Comethazine', 'DaBaby', 'Dej Loaf', 'DJ Paul', 'Doja Cat', 'Famous Dex', 'Fetty Wap', 'Frauenarzt', 'Future (rapper)', 'Ghali (rapper)', 'GionnyScandal', 'Gradur', 'Gucci Mane', 'Haftbefehl', 'Jhay Cortez', 'Jon Z', 'Juice Wrld', 'Juicy J', 'Kaaris', 'Kalash (rapper)', 'Keith Ape', 'Ketama126', 'Wiz Khalifa', 'Kid Kaze', 'Kodak Black', 'Lacrim', 'Laïoung', 'Elettra Lamborghini', 'Lazza', 'Lil B', 'Lil Baby', 'Lil Nas X', 'Lil Pump', 'Lil Reese', 'Lil Skies', 'Lil Tecca', 'Lil Uzi Vert', 'Lil Wayne', 'Lil Yachty', 'Lola Índigo', 'Lord Infamous', 'Lunay (cantante)', 'MadeinTYO', 'MC Mack', 'Meek Mill', 'Stella Mwangi', 'Niska', 'NLE Choppa', 'O.T. Genasis', 'Offset (rapper)', 'Ozuna', 'PartyNextDoor', 'Project Pat', 'Quavo', 'Alemán', 'Ric

Finally, the list can be easily saved as text document:

In [10]:
with open('trappers.txt', 'w+') as output_file:
    for name in names:
        output_file.write(name+'\n')        

Text is not the only attribute. For example we might want to extract the associate url:

In [11]:
urls = [a.attrs['href'] for a in a_tags]

print(urls)

['/wiki/6ix9ine', '/wiki/21_Savage', '/wiki/Anuel_AA', '/wiki/ASAP_Ferg', '/wiki/Bad_Bunny', '/wiki/Belly_(rapper)', '/wiki/Bhad_Bhabie', '/wiki/Birdman_(rapper)', '/wiki/Lele_Blade', '/wiki/Booba_(rapper)', '/wiki/Quando_Rondo', '/wiki/Maikel_Delacalle', '/wiki/Capo_Plaza', '/wiki/Cardi_B', '/wiki/Chanmina', '/wiki/Chief_Keef', '/wiki/CL_(cantante)', '/wiki/Comethazine', '/wiki/DaBaby', '/wiki/Dej_Loaf', '/wiki/DJ_Paul', '/wiki/Doja_Cat', '/wiki/Famous_Dex', '/wiki/Fetty_Wap', '/wiki/Frauenarzt', '/wiki/Future_(rapper)', '/wiki/Ghali_(rapper)', '/wiki/GionnyScandal', '/wiki/Gradur', '/wiki/Gucci_Mane', '/wiki/Haftbefehl', '/wiki/Jhay_Cortez', '/wiki/Jon_Z', '/wiki/Juice_Wrld', '/wiki/Juicy_J', '/wiki/Kaaris', '/wiki/Kalash_(rapper)', '/wiki/Keith_Ape', '/wiki/Ketama126', '/wiki/Wiz_Khalifa', '/wiki/Kid_Kaze', '/wiki/Kodak_Black', '/wiki/Lacrim', '/wiki/La%C3%AFoung', '/wiki/Elettra_Lamborghini', '/wiki/Lazza', '/wiki/Lil_B', '/wiki/Lil_Baby', '/wiki/Lil_Nas_X', '/wiki/Lil_Pump', '/wik

## Scraping paginated results

Automated scraping becomes especially useful when we want to download hundreds or thousands of entries and we can't just copy-paste dozens of pages.

In particular we take a look at paginated results, such as, for example, the list of Italian hotels according to the website www.elenco-alberghi.it

The front page looks like this:
![title](img/hotels.png)

One approach is to visit the page corresponding to each region, and download the list of hotels from there. However, the results in each region page are paginated:

![title](img/abruzzo0.png)

The idea is to cycle through every page and download the list fo hotels in each page, in each region.

First of all, we define a url template, with a placeholder for the region and the list of regions, looking at the homepage of the website:

In [12]:
# curly braces will be replaced with region name using format
region_template_url = "http://www.elenco-alberghi.it/{}/alberghi-hotels.asp"

# manually taken from homepage of www.elenco-alberghi.it
regions = ["abruzzo", "basilicata", "calabria", "campania", "emilia-romagna", "friuli-venezia-giulia",
"lazio", "liguria", "lombardia", "marche", "molise", "piemonte", "puglia", "sardegna", "sicilia", "toscana",
"trentino-alto-adige", "umbria", "valle-d-aosta", "veneto"]

For each region, we need to know how many pages there are in the paginated list.

Luckily, this information can be found in the html source:

![title](img/pagination.png)

Let's zoom in a little:

![title](img/pag_zoom.png)

We can observe that each li-tag inside the div with id "paginazione" corresponds to a page number and containes the urls to that page number, together with a text element with that number (e.g. "10").

This is true, except for the last li-tag, corresponding to the last page: the text element is just "..."  as displayed in the webpage. The number, however can be recovered from the associated url (e.g. "81").

So let's extract that page number.

First, define the url:

In [13]:
region = "abruzzo" # for example
reg_url = region_template_url.format(region) # insert region into template url

print(reg_url)

http://www.elenco-alberghi.it/abruzzo/alberghi-hotels.asp


Next, http call and parse of response:

In [14]:
reg_r = requests.get(reg_url, headers=headers) # http request to page
print(reg_r)

reg_soup = bs(reg_r.text, "lxml")  # soupify

<Response [200]>


Next, we extract the wanted div "paginazione" and the li-tags inside it:

In [15]:
div = reg_soup.find('div', {'id' : 'paginazione'}) # pages of numbers are in this div
    
li_tags = [li for li in div.findAll('li')] # pages numbers are in these li elements

print(li_tags)

[<li id="inactive">Pagine:</li>, <li id="activelink"><a href="javascript:void(0)">1</a></li>, <li><a href="/abruzzo/alberghi-hotels_2.asp" title="Vai alla pagina n. 2">2</a></li>, <li><a href="/abruzzo/alberghi-hotels_3.asp" title="Vai alla pagina n. 3">3</a></li>, <li><a href="/abruzzo/alberghi-hotels_4.asp" title="Vai alla pagina n. 4">4</a></li>, <li><a href="/abruzzo/alberghi-hotels_5.asp" title="Vai alla pagina n. 5">5</a></li>, <li><a href="/abruzzo/alberghi-hotels_6.asp" title="Vai alla pagina n. 6">6</a></li>, <li><a href="/abruzzo/alberghi-hotels_7.asp" title="Vai alla pagina n. 7">7</a></li>, <li><a href="/abruzzo/alberghi-hotels_8.asp" title="Vai alla pagina n. 8">8</a></li>, <li><a href="/abruzzo/alberghi-hotels_9.asp" title="Vai alla pagina n. 9">9</a></li>, <li><a href="/abruzzo/alberghi-hotels_10.asp" title="Vai alla pagina n. 10">10</a></li>, <li><a href="/abruzzo/alberghi-hotels_81.asp" title="Vai all'ultima pagina">...</a></li>]


As we said, we want the number found in the url of the last li-tag:

In [16]:
last_li = li_tags[-1] # we want the last li

print(last_li)

<li><a href="/abruzzo/alberghi-hotels_81.asp" title="Vai all'ultima pagina">...</a></li>


In [17]:
last_li_url = last_li.find('a').attrs['href'] # we extract the number from the url of the last li (which corresponds to the button for the last page) 
print(last_li_url)

/abruzzo/alberghi-hotels_81.asp


In [18]:
N = int(re.findall(r'\d+', last_li_url)[0]) # convert to integer the only digit found
print(N)

81


In order to make this code re-usable, let's bring it together in a function:

In [19]:
def howmany_pages(region): # define a function with region as input
    reg_url = region_template_url.format(region) # insert region into template url
    reg_r = requests.get(reg_url, headers=headers) # http request to page
    reg_soup = bs(reg_r.text, "lxml")  # soupify    
    div = reg_soup.find('div', {'id' : 'paginazione'}) # pages of numbers are in this div    
    li_tags = [li for li in div.findAll('li')] # pages numbers are in these li elements    
    last_li = li_tags[-1].find('a').attrs['href'] # we extract the url of last li    
    N = int(re.findall(r'\d+', last_li)[0]) # convert to integer the only digit found
    
    return(N)

In [20]:
howmany_pages("abruzzo")

81

In [21]:
howmany_pages("piemonte")

149

It works!

Using the function let's convert the list of regions into a dictionary where each region is associated with its number of pages:

In [22]:
# collect in a dictionary the name of each region with its max number of pages
regions_dic = {region : howmany_pages(region) for region in regions}

In [23]:
print(regions_dic)

{'liguria': 123, 'sardegna': 97, 'sicilia': 183, 'veneto': 288, 'puglia': 108, 'emilia-romagna': 379, 'piemonte': 149, 'molise': 8, 'abruzzo': 81, 'trentino-alto-adige': 300, 'lombardia': 256, 'friuli-venezia-giulia': 65, 'umbria': 74, 'marche': 103, 'valle-d-aosta': 36, 'lazio': 184, 'campania': 174, 'basilicata': 17, 'calabria': 73, 'toscana': 406}


With this dictionary it's relatively easy to cycle through regions, and for each region cycle through pages and download the names of the hotels found in each page. 

Let's look at one example, then we can generalize:
7th page of "Marche" http://www.elenco-alberghi.it/marche/alberghi-hotels_7.asp it looks like this:

![title](img/marche7.png)

Zooming in:

![title](img/marche7_zoom.png)

We need to extract the text element from a-tags in the span with class "titololista":

In [24]:
# we start from a url template
page_template_url = "http://www.elenco-alberghi.it/{}/alberghi-hotels_{}.asp"
# second curly braces will be the page number

In [25]:
region = "marche" # set region manually
i = 7 # set page number manually

In [26]:
reg_url = page_template_url.format(region, i) # fill the details of the url
print(reg_url)

http://www.elenco-alberghi.it/marche/alberghi-hotels_7.asp


In [27]:
reg_soup = bs(reg_r.text, "lxml")  # soupify

In [28]:
span = reg_soup.findAll('span', {'class' : 'titololista'}) # get the span
print(span)

[<span class="titololista"><a href="http://www.elenco-alberghi.it/abruzzo/l-aquila/opi/12293.asp">HOTEL DU PARK - FABER GESTIONI TURISTICHE SRL</a></span>, <span class="titololista"><a href="http://www.elenco-alberghi.it/abruzzo/teramo/controguerra/hotel-31968.asp">BED AND BREAKFAST GIARDINO AGRITOURIST </a></span>, <span class="titololista"><a href="http://www.elenco-alberghi.it/abruzzo/teramo/tortoreto/hotel-32269.asp">HOTEL CLARA</a></span>, <span class="titololista"><a href="http://www.elenco-alberghi.it/abruzzo/l-aquila/barrea/hotel-32423.asp">LA CASA NEL BORGO</a></span>, <span class="titololista"><a href="http://www.elenco-alberghi.it/abruzzo/teramo/tortoreto/hotel-33600.asp">RESIDENCE MARGHERITA</a></span>, <span class="titololista"><a href="http://www.elenco-alberghi.it/abruzzo/l-aquila/scanno/hotel-33944.asp">HOTEL GARNI MILLE PINI </a></span>, <span class="titololista"><a href="http://www.elenco-alberghi.it/abruzzo/chieti/arielli/hotel-34208.asp">CASA DELL'ARCIPRETE B&amp;B<

In [29]:
names = [name.text.title() for name in span] # extract the text from each element, change case
print(names)

['Hotel Du Park - Faber Gestioni Turistiche Srl', 'Bed And Breakfast Giardino Agritourist ', 'Hotel Clara', 'La Casa Nel Borgo', 'Residence Margherita', 'Hotel Garni Mille Pini ', "Casa Dell'Arciprete B&B", 'B&B Villa Angela', 'Colle Della Selva', 'Alisma Hotel']


Some cleaning will be needed, but the basic approach works.

Let's convert the code into a function which can be applied more generally:

In [30]:
def get_names(region, N): # second argument will be provided with our dictionary above
    
    print()
    print("Working on "+region+"...")
    print()
    
    region_template_url = "http://www.elenco-alberghi.it/{}/alberghi-hotels_{}.asp"
    
    names = [] # initialize list  
    
    for i in range(1, N + 1):
    
        reg_url = region_template_url.format(region, i) # fill the details of the url
        reg_r = requests.get(reg_url, headers=headers) # http request to page
        reg_soup = bs(reg_r.text, "lxml")  # soupify
        
        # hotel names are in these span elements
        tmp_names = [name.text.title() for name in reg_soup.findAll('span', {'class' : 'titololista'})]
        
        names.extend(tmp_names) # append names found in this iteration
      
        # sometimes it's a good idea to randomize the wait between http calls, to avoid ip-banning
        timeDelay = 0.1 * random.randrange(0, 3) + 0.5
        time.sleep(timeDelay) # wait some random time
        
    print("Done!")
        
    return(names)

In [31]:
names_molise = get_names("molise", regions_dic["molise"])


Working on molise...

Done!


In [32]:
print(names_molise)

['Dimora Del Prete Di Belmonte', 'B&B Villa Ada', 'Albergo Ristorante Lo Smeraldo', 'Azienda Agrituristica La Ginestra', "Hotel La Fonte Dell'Astore", 'Albergo Le Dune', 'Cascina Garden Hotel', 'Grand Hotel Aljope', 'Grand Hotel Rinascimento', 'Artemide', 'Santo Stefano Dei Cavalli', "Pleiadi'S Hotel ", 'Bar Albergo Hotel La Rondine ', 'Hostel Palazzo Della Citta', 'Hotel Il Duca Del Sannio', 'Masseria Santa Lucia', 'Borgo San Pietro', 'Hotel Majestic Molise', 'Hotel Residence Ristorante L\x92Airone', 'Albergo Ristorante Miralago', 'Albergo Santoianni', 'Hotel Santa Lucia', 'Masseria Acquasalsa', 'Villaggio Le Meridiane', 'La Romanella', 'Hotel Capodivandra', 'Hotel Di Nardo', 'Masseria Monte Pizzi ', 'Domus Hotel', 'Residence Polena', 'Dimora Spina', 'Hotel Ribo Le Villette', 'B&B La Grotta Delle Fate', 'Agriturismo La Guardata', 'Albergo Campitelli 2', 'San Giorgio Hotel ', 'Hotel Il Cacciatore', 'Aloha Park Hotel ', 'Hotel Kristall', 'Hotel Lo Sciatore', 'Hotel Miletto', 'Albergo Ri

Finally, we can apply the function get_names to each entry in our region dictionary, and save the results as text files:

In [33]:
# this will save one file for each region

#for region in regions_dic.keys(): # cycle through region names in our dictionary
    
#    with open(region, "w+") as output_file: # write here, one file per region, then we can cat them all together
        
#        names = get_names(region, regions_dic[region]) # get results
        
#        for name in names: # cycle through names in get_names results
            
#            output_file.write(name+"\n") # write to file together with EOL