# Web Data Extraction - Part II

__WEB SCRAPING:__ extracting data from the human-readable output (HTML) that a website serves to a web browser.

__HTML parsing library for Python:__ [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) ==> _a Python package for parsing HTML and XML documents_

---

In [None]:
# Import libraries
import pandas as pd
import requests
import bs4   # !pip install beautifulsoup4
import re

---

In [None]:
# Request the page and keep its raw HTML content (bytes)
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent

url = 'https://www.marca.com/'
response = requests.get(url, headers=headers, timeout=10)  # timeout in seconds; 0.05 s is too short for most servers
html = response.content
print(f'Status code is: {response.status_code}')
print(type(html))

In [None]:
html[:1000]
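
The request above can fail for reasons unrelated to parsing, such as timeouts, connection errors or HTTP error codes. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and its default timeout are assumptions made for illustration and are not part of the original notebook.

```python
import requests

# Hypothetical helper (not in the original notebook):
# returns the raw HTML as bytes, or None if the request fails.
def fetch_html(url, timeout=10.0):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.RequestException as exc:
        # Covers timeouts, connection errors, invalid URLs and HTTP errors
        print(f'Request failed: {exc}')
        return None
    return response.content
```

Calling `fetch_html('https://www.marca.com/')` would then give either the page bytes or `None`, so the parsing cells can check the result before continuing.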

---

__Let's make some broth...__

![Image](./img/web_data_01.png)

In [None]:
html_sample = '<a href="url.com" title="Web Scraping" itemprop="url" id="example" class="intro">Hello World</a>'
broth = bs4.BeautifulSoup(html_sample, "html.parser") 
print(type(broth))
broth

In [None]:
tag = broth.a
print(type(tag))
tag

In [None]:
tag.name
#tag.name = 'b'
#tag.name

In [None]:
tag.attrs

In [None]:
tag['class']

In [None]:
tag.string

In [None]:
broth.name
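
Attribute access with square brackets raises a `KeyError` when the attribute is missing, and `class` comes back as a list because it can hold several values. The sketch below illustrates both points on a sample tag invented for this example (two class values, no `title` attribute).

```python
import bs4

# Sample markup invented for this sketch: two class values and no title attribute
snippet = '<a href="url.com" id="example" class="intro lead">Hello World</a>'
link = bs4.BeautifulSoup(snippet, 'html.parser').a

classes = link['class']                   # multi-valued attribute: a list of strings
link_id = link['id']                      # single-valued attribute: a plain string
missing = link.get('title')               # None instead of a KeyError
fallback = link.get('title', 'untitled')  # default used when the attribute is absent

print(classes, link_id, missing, fallback)
```

Using `.get()` is usually the safer choice on real pages, where not every tag carries every attribute.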

---

__Now, let's make some soup...__

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = bs4.BeautifulSoup(html_doc, "html.parser") 
print(type(soup))
soup

In [None]:
print(type(soup.html))
soup.html # soup.find('html')

In [None]:
print(type(soup.find_all('a')))
soup.find_all('a')
#soup.find_all(["a", "b"])

In [None]:
all_tags = [tag.name for tag in soup.find_all(True)]
print(type(all_tags))
all_tags

In [None]:
some_tags = [tag.name for tag in soup.find_all(re.compile("^b"))]
some_tags

In [None]:
# Using a function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

In [None]:
soup.find_all(has_class_but_no_id)
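
The filters shown so far (tag name, list, regular expression, function) can also target attribute values. CSS selectors via `select()` and `select_one()` are an alternative that the cells above do not use. The sketch below works on a shortened copy of the sample document so that it runs on its own.

```python
import re
import bs4

# Shortened copy of the sample document, so the sketch is self-contained
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup = bs4.BeautifulSoup(html_doc, 'html.parser')

# Regular expression applied to an attribute value instead of the tag name:
# keep links whose last path segment starts with "l" or "t"
l_or_t = soup.find_all('a', href=re.compile(r'/[lt]\w+$'))
ids = [tag['id'] for tag in l_or_t]

# CSS selectors: <a class="sister"> tags that are direct children of <p class="story">
sisters = soup.select('p.story > a.sister')
second = soup.select_one('#link2')

print(ids, len(sisters), second.string)
```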

---

__Finally, let's make a stew...__

In [None]:
stew = soup.find_all("a", {"class": "sister"})
stew

In [None]:
all_strings = [tag.string for tag in stew]
all_strings
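
`.string` only returns a value when a tag has a single child. For tags with mixed content it returns `None`, and `get_text()` is the usual fallback. The sketch below shows the difference on a shortened copy of the sample document and collects each link as a small dictionary; the key names `name` and `href` are chosen for illustration.

```python
import bs4

# Shortened copy of the sample document, so the sketch is self-contained
html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup = bs4.BeautifulSoup(html_doc, 'html.parser')

story = soup.find('p', class_='story')
print(story.string)                     # None: the paragraph has several children
text = story.get_text(' ', strip=True)  # all nested text, joined with spaces

# One dictionary per link, ready to be loaded into a table later on
rows = [
    {'name': a.get_text(strip=True), 'href': a.get('href')}
    for a in soup.find_all('a', class_='sister')
]
print(rows)
```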

---

__Back to the original HTML content...__

In [None]:
parsed_html = bs4.BeautifulSoup(html, "html.parser")
parsed_tags = {tag.name for tag in parsed_html.find_all(True)}  # unique tag names in the page
parsed_tags

In [None]:
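element = parsed_html.find_all("a", {"itemprop": "url"})
if element:  # the live page may not contain any matching tag
    print(element[0])
    print(element[0].get('title'))  # .get() returns None if the attribute is missing
links = [tag.attrs for tag in element]
links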

In [None]:
# Load the link attributes into a pandas DataFrame (missing attributes become NaN)

df = pd.DataFrame(links, columns=['title', 'href', 'itemprop', 'rel'])
df
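
Scraped `href` values are often relative to the site root, which makes them hard to use outside the page. A minimal sketch of resolving them against the base URL with the standard library follows; the sample dictionaries are hypothetical stand-ins shaped like the items of `links`, not real scraped data.

```python
from urllib.parse import urljoin

base_url = 'https://www.marca.com/'

# Hypothetical attribute dictionaries, shaped like the items of `links`
sample_links = [
    {'title': 'Football', 'href': '/futbol.html', 'itemprop': 'url'},
    {'title': 'External', 'href': 'https://example.com/page', 'itemprop': 'url'},
    {'title': 'No link', 'itemprop': 'url'},
]

# Resolve relative hrefs; absolute URLs and entries without href stay unchanged
resolved = [
    {**attrs, 'href': urljoin(base_url, attrs['href'])} if 'href' in attrs else attrs
    for attrs in sample_links
]
print(resolved)
```

The `resolved` list can be passed to `pd.DataFrame` in the same way as `links`.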

---

__More info:__

- An example of building a data pipeline whose acquisition step combines a REST API and web scraping: [link](https://towardsdatascience.com/data-engineering-create-your-own-dataset-9c4d267eb838)

- If the page renders its content dynamically with JavaScript, consider using [Selenium](https://selenium-python.readthedocs.io/)

- [What would happen if you tried to scrape Idealista?](https://www.idealista.com/ayuda/articulos/legal-statement/?lang=en#:~:text=Specifically%2C%20it%20is,prior%20written%20permission.)