# Web Data Extraction - Part II

__WEB SCRAPING:__ extracting data from the human-readable output of a website, i.e. the HTML that a web browser renders.

__HTML/XML parsing library for Python:__ [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) ==> _a Python package for parsing HTML and XML documents_

---

In [1]:
# Import libraries
import pandas as pd
import requests
import bs4   # !pip install beautifulsoup4
import re

---

In [69]:
# Document Object Model (DOM) -> https://en.wikipedia.org/wiki/Document_Object_Model
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent

url = 'https://www.marca.com/'
response = requests.get(url, headers=headers, timeout=10)  # timeout in seconds
html = response.content
print(f'Status code is: {response.status_code}')
print(type(html))

Status code is: 200
<class 'bytes'>


In [9]:
html[:1000]

b' <!DOCTYPE html><html lang="es"><head><script>/\\/radio(\\/parrilla)?.html/gmi.test(location.href||"")&&/MSIE|Trident/gm.test(navigator.userAgent||"")&&!!window.MSInputMethodContext&&!!document.documentMode&&function(){var a=document.createElement("script");a.src="//e00-elmundo.uecdn.es/js/ue-polyfills.min.js",a.type="text/javascript";var b=document.getElementsByTagName("script")[0];b.parentNode.insertBefore(a,b)}();</script>\n<script type="text/javascript" language="javascript" src="https://e00-ue.uecdn.es/cookies/js/policy_v4.js"></script>\n<script>window.googlefc=window.googlefc||{},window.googlefc.ccpa=window.googlefc.ccpa||{},window.googlefc.callbackQueue=window.googlefc.callbackQueue||[],googlefc.callbackQueue.push({AD_BLOCK_DATA_READY:()=>{var o;switch(googlefc.getAdBlockerStatus()){case googlefc.AdBlockerStatusEnum.EXTENSION_LEVEL_AD_BLOCKER:case googlefc.AdBlockerStatusEnum.NETWORK_LEVEL_AD_BLOCKER:o="bloqueada";break;case googlefc.AdBlockerStatusEnum.NO_AD_BLOCKER:o="no blo

#### Parsing with BeautifulSoup

In [11]:
parsed_html = bs4.BeautifulSoup(html[:1000], "html.parser") 
print(parsed_html.prettify())

<!DOCTYPE html>
<html lang="es">
 <head>
  <script>
   /\/radio(\/parrilla)?.html/gmi.test(location.href||"")&&/MSIE|Trident/gm.test(navigator.userAgent||"")&&!!window.MSInputMethodContext&&!!document.documentMode&&function(){var a=document.createElement("script");a.src="//e00-elmundo.uecdn.es/js/ue-polyfills.min.js",a.type="text/javascript";var b=document.getElementsByTagName("script")[0];b.parentNode.insertBefore(a,b)}();
  </script>
  <script language="javascript" src="https://e00-ue.uecdn.es/cookies/js/policy_v4.js" type="text/javascript">
  </script>
  <script>
  </script>
 </head>
</html>


---

__Let's make some broth...__

![Image](./img/web_data_01.png)

In [12]:
html_sample = '<a href="url.com" title="Web Scraping" itemprop="url" id="example" class="intro">Hello World</a>'
broth = bs4.BeautifulSoup(html_sample, "html.parser") 
print(type(broth))
broth

<class 'bs4.BeautifulSoup'>


<a class="intro" href="url.com" id="example" itemprop="url" title="Web Scraping">Hello World</a>

In [15]:
tag = broth.a
print(type(tag))
tag

<class 'bs4.element.Tag'>


<a class="intro" href="url.com" id="example" itemprop="url" title="Web Scraping">Hello World</a>

In [16]:
tag.name
#tag.name = 'b'
#tag.name

'a'

In [18]:
tag.attrs

{'href': 'url.com',
 'title': 'Web Scraping',
 'itemprop': 'url',
 'id': 'example',
 'class': ['intro']}

In [20]:
tag['href']

'url.com'
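
Square-bracket access assumes the attribute exists. The following is a minimal sketch (the sample HTML is made up for illustration) of what happens when it does not, and of `Tag.get()` as a more forgiving alternative:

```python
import bs4

# Hypothetical sample: a link with two classes and no title attribute
html_sample = '<a href="url.com" class="intro primary">Hello World</a>'
tag = bs4.BeautifulSoup(html_sample, "html.parser").a

# Square brackets raise KeyError when the attribute is absent
try:
    tag["title"]
    missing_raises = False
except KeyError:
    missing_raises = True

# .get() returns None (or a chosen default) instead of raising
title = tag.get("title")
title_or_default = tag.get("title", "untitled")

# class is a multi-valued attribute, so it comes back as a list
classes = tag["class"]

print(missing_raises, title, title_or_default, classes)
```

When scraping real pages, where some links lack a `title` or `href`, `.get()` tends to be the safer choice inside loops.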

In [21]:
tag.string

'Hello World'

In [22]:
# Be careful: the BeautifulSoup object itself is not a tag, so its name is the special value '[document]'
broth.name

'[document]'

---

__Now, let's make some soup...__

In [23]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = bs4.BeautifulSoup(html_doc, "html.parser") 
print(type(soup))
soup

<class 'bs4.BeautifulSoup'>



<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [25]:
print(type(soup.html))
soup.html 
soup.find('html')

<class 'bs4.element.Tag'>


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [32]:
#print(type(soup.find_all("a")))
#soup.find_all("a")
soup.find_all(["a", "b", "html", "body"])

[<html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 </body></html>,
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 </body>,
 <b>The Dormouse's story</b>,
 <a

In [33]:
all_tags = [tag.name for tag in soup.find_all(True)]
print(type(all_tags))
all_tags

<class 'list'>


['html', 'head', 'title', 'body', 'p', 'b', 'p', 'a', 'a', 'a', 'p']

In [34]:
some_tags = [tag.name for tag in soup.find_all(re.compile("^b"))]
some_tags

['body', 'b']

In [39]:
tag = soup.a
tag.has_attr('class')
tag['class']

['sister']

In [44]:
# Using a function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

In [48]:
soup.find_all(has_class_but_no_id)

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [49]:
# Be careful: the BeautifulSoup object itself is not a tag, so its name is the special value '[document]'
soup.name

'[document]'

---

__Finally, let's make a stew...__

In [54]:
stew = soup.find_all("a", {"class": "sister"})
stew

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [55]:
all_strings = [tag.string for tag in stew]
all_strings

['Elsie', 'Lacie', 'Tillie']
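
The `find_all("a", {"class": "sister"})` call used for the stew has a few equivalent spellings. This sketch, run on a shortened copy of the same sample document, compares the attribute dictionary, the `class_` keyword and a CSS selector via `select()`:

```python
import bs4

# Shortened copy of the sample document used above
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = bs4.BeautifulSoup(html_doc, "html.parser")

by_dict = soup.find_all("a", {"class": "sister"})   # attribute dictionary
by_keyword = soup.find_all("a", class_="sister")    # class_ avoids the reserved word
by_css = soup.select("a.sister")                    # CSS selector syntax
one_by_id = soup.find(id="link2")                   # find() returns the first match only

ids_dict = [t["id"] for t in by_dict]
ids_keyword = [t["id"] for t in by_keyword]
ids_css = [t["id"] for t in by_css]
print(ids_dict, ids_keyword, ids_css, one_by_id.string)
```

All three searches return the same three links; which one to use is mostly a matter of taste, although CSS selectors are convenient for nested conditions such as `p.story a.sister`.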

In [56]:
# By contrast, each element of the result list is a regular Tag, so its name is the tag name
stew[0].name

'a'

---

__Back to the original HTML content...__

In [70]:
parsed_html = bs4.BeautifulSoup(html, "html.parser")
parsed_tags = {tag.name for tag in parsed_html.find_all(True)}  # unique tag names
parsed_tags

{'a',
 'address',
 'article',
 'aside',
 'body',
 'button',
 'circle',
 'div',
 'fieldset',
 'figure',
 'footer',
 'form',
 'g',
 'h1',
 'h2',
 'h3',
 'head',
 'header',
 'html',
 'i',
 'iframe',
 'img',
 'input',
 'label',
 'legend',
 'li',
 'link',
 'main',
 'meta',
 'nav',
 'noscript',
 'p',
 'path',
 'picture',
 'polygon',
 'polyline',
 'rect',
 'script',
 'section',
 'source',
 'span',
 'strong',
 'style',
 'svg',
 'time',
 'title',
 'ul'}

In [71]:
element = parsed_html.find_all("a", {"itemprop": "url"})
len(element)
print(element[0])
element[0]['href']
links = [tag.attrs for tag in element]
links

<a href="https://www.marca.com/futbol/barcelona/2022/07/02/62c0093546163f91898b458a.html" itemprop="url" title="Al Barça le salen las cuentas"> Al Barça le salen las cuentas
</a>


[{'href': 'https://www.marca.com/futbol/barcelona/2022/07/02/62c0093546163f91898b458a.html',
  'title': 'Al Barça le salen las cuentas',
  'itemprop': 'url'},
 {'href': 'https://www.marca.com/futbol/premier-league/2022/07/01/62bee78746163f79648b457e.html',
  'title': 'Conte rompe el mercado para fabricar un equipo temible',
  'itemprop': 'url'},
 {'href': 'https://www.marca.com/futbol/mercado-fichajes/2022/07/02/62bfe033ae3f580020e300a3-directo.html',
  'title': 'El Chelsea paga 52 millones al City',
  'itemprop': 'url'},
 {'href': 'https://www.marca.com/futbol/primera-division/opinion/2022/07/02/62befa7de2704e8e028b459d.html',
  'title': '"Salvo el Madrid, nadie puede hacer un fichaje de 50 millones"',
  'itemprop': 'url'},
 {'href': 'https://www.marca.com/ciclismo/tour-francia/especial.html',
  'title': "El Tour de 'todos contra Pogacar'",
  'itemprop': 'url'},
 {'href': 'https://videos.marca.com/v/0_2sazyini-la-deportividad-de-alcaraz-que-asombra-a-wimbledon-aviso-que-el-punto-del-r

In [72]:
# Of course, Pandas!!!

df = pd.DataFrame(links, columns=['title', 'href', 'itemprop'])
df

|   | title | href | itemprop |
|---|-------|------|----------|
| 0 | Al Barça le salen las cuentas | https://www.marca.com/futbol/barcelona/2022/07... | url |
| 1 | Conte rompe el mercado para fabricar un equipo... | https://www.marca.com/futbol/premier-league/20... | url |
| 2 | El Chelsea paga 52 millones al City | https://www.marca.com/futbol/mercado-fichajes/... | url |
| 3 | "Salvo el Madrid, nadie puede hacer un fichaje... | https://www.marca.com/futbol/primera-division/... | url |
| 4 | El Tour de 'todos contra Pogacar' | https://www.marca.com/ciclismo/tour-francia/es... | url |
| 5 | La deportividad de Alcaraz que asombra a Wimbl... | https://videos.marca.com/v/0_2sazyini-la-depor... | url |
| 6 |  | https://www.marca.com/uestudio/2022/06/29/62bc... | url |
| 7 | Contrato de megaestrella | https://www.marca.com/futbol/real-madrid/2022/... | url |
| 8 | Reina cumplirá los 40 en el Villarreal... cuan... | https://www.marca.com/futbol/villarreal/2022/0... | url |
| 9 | Desarmando a Gabriel Jesús: el 'Big Data' que ... | https://videos.marca.com/v/0_9po8yxm6-desarman... | url |


In [116]:
url = 'https://toogoodtogo.es/es/blog'

In [117]:
response = requests.get(url)
html = response.content

In [118]:
parsed_html = bs4.BeautifulSoup(html, "html.parser")

In [119]:
parsed_tags = []
for tag in parsed_html.find_all(True):
    parsed_tags.append(tag.name)
list(set(parsed_tags))

['head',
 'span',
 'script',
 'img',
 'style',
 'li',
 'div',
 'title',
 'base',
 'section',
 'a',
 'fieldset',
 'link',
 'label',
 'ul',
 'main',
 'option',
 'p',
 'h3',
 'h1',
 'input',
 'form',
 'h6',
 'select',
 'html',
 'footer',
 'h2',
 'iframe',
 'meta',
 'h5',
 'button',
 'header',
 'body',
 'noscript',
 'small']

In [122]:
elements = parsed_html.find_all("a", {'class':'blog-post-link'})
len(elements)

10

In [123]:
elements[0].attrs

{'href': '/es/blog/ww-junio', 'class': ['blog-post-link']}

In [124]:
url_list = []
for e in elements:
    url_list.append(e['href'])
url_list

['/es/blog/ww-junio',
 '/es/blog/fruta-temporada',
 '/es/blog/sorteo-restaurante-coque',
 '/es/blog/chefs-contra-el-desperdicio',
 '/es/blog/pepa-munoz',
 '/es/blog/joanna-artieda',
 '/es/blog/mario-sandoval',
 '/es/blog/victor-y-mar',
 '/es/blog/tonino-valiente',
 '/es/blog/verse']
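
The `href` values above are relative to the site root, so they need a base URL before they can be requested. A minimal sketch using the standard library's `urljoin` (the third entry in `hrefs` is a made-up absolute link, included only to show that it passes through unchanged):

```python
from urllib.parse import urljoin

base_url = "https://toogoodtogo.es/es/blog"
hrefs = [
    "/es/blog/ww-junio",
    "/es/blog/fruta-temporada",
    "https://example.com/other",  # hypothetical absolute link
]

# urljoin resolves relative paths against the base and leaves absolute URLs alone
absolute_urls = [urljoin(base_url, h) for h in hrefs]
print(absolute_urls)
```

Compared with plain string concatenation, `urljoin` also copes with hrefs that are already absolute or that lack a leading slash.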

In [125]:
url_post = 'https://toogoodtogo.es/es/blog/ww-junio'
response = requests.get(url_post)
html = response.content
parsed_html = bs4.BeautifulSoup(html, "html.parser")

In [127]:
elements = parsed_html.find_all("h1")
elements[0].string

'#WasteWarriorDelMes de Junio 🍕'

In [143]:
elements = parsed_html.find_all("span")  # every <span>; the next cell keeps only those without attributes

In [144]:
text_list = []
for element in elements:
    if element.attrs == {}:
        text_list.append(element.string)
text_list

['Queremos premiar tu esfuerzo salvando comida con nosotros con este sorteo delicioso mensual😋 Cada vez más waste warriors participáis salvando comida, por eso, ¡hemos cambiado las reglas del #WasteWarriorDelMes!',
 'La persona ganadora del sorteo será elegida de manera aleatoria de entre todos los comentarios del post que anuncia el sorteo.\xa0',
 None]
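
The trailing `None` in the list above is what `.string` returns when a tag does not contain exactly one piece of text, for example when it is empty or wraps other tags. Which case applies to the real page is not shown here, so the sketch below uses made-up spans to illustrate both, along with `get_text()` as an alternative that always returns a string:

```python
import bs4

# Hypothetical spans: plain text, nested markup, and an empty tag
html_sample = (
    "<span>Plain text</span>"
    "<span>Text with <b>nested</b> markup</span>"
    "<span></span>"
)
spans = bs4.BeautifulSoup(html_sample, "html.parser").find_all("span")

# .string is None unless the tag has a single text child
strings = [s.string for s in spans]

# get_text() concatenates all nested text; the separator keeps words apart
texts = [s.get_text(" ", strip=True) for s in spans]

print(strings)
print(texts)
```

Filtering out empty strings afterwards (`[t for t in texts if t]`) gives a clean list of paragraphs.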

In [146]:
blog_content = []
for u in url_list:
    url_post = 'https://toogoodtogo.es'+u
    html = requests.get(url_post).content
    blog_content.append(html)
    
len(blog_content)

10
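
The loop above fires all requests back to back and stops at the first error. A possible refinement, sketched under assumptions, is to pause between requests and record failures instead of aborting. The helper `fetch_posts`, its `fetch` and `delay` parameters, and the `fake_fetch` stand-in are all hypothetical names introduced here so the example runs without network access:

```python
import time

def fetch_posts(paths, fetch, base="https://toogoodtogo.es", delay=1.0):
    """Fetch each post with a pause between requests; failed fetches are stored as None."""
    pages = {}
    for i, path in enumerate(paths):
        if i > 0:
            time.sleep(delay)  # be polite: wait before every request after the first
        try:
            pages[path] = fetch(base + path)
        except Exception as exc:
            print(f"Skipping {path}: {exc}")
            pages[path] = None
    return pages

def fake_fetch(url):
    """Stand-in for a real HTTP call, used only to demonstrate the helper offline."""
    if url.endswith("broken"):
        raise RuntimeError("simulated failure")
    return f"<html>{url}</html>".encode()

pages = fetch_posts(["/es/blog/ww-junio", "/es/blog/broken"], fake_fetch, delay=0)
print(pages)
```

In the notebook, the stand-in could be swapped for a real call such as `lambda u: requests.get(u, timeout=10).content`.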

---

__More info:__

- An example for creating a pipeline where the Acquisition part involves REST API and Web Scraping [link](https://towardsdatascience.com/data-engineering-create-your-own-dataset-9c4d267eb838)

- If you have dynamic content, you should consider using [Selenium](https://selenium-python.readthedocs.io/)

- [What would happen if you tried to scrape Idealista?](https://www.idealista.com/ayuda/articulos/legal-statement/?lang=en#:~:text=Specifically%2C%20it%20is,prior%20written%20permission.)