# Web scraping. Chapter 1
## Requests and Beautiful Soup 4

### Installing the dependencies

We will use requests and beautifulsoup4.

In [None]:
!conda install -y requests beautifulsoup4

- Distinguish a base URL from a URL with parameters
- How do parameters work? (dynamic vs. static websites)
- Inspecting the HTML and the DOM tree with the browser's developer tools. Telling apart tags with a class from tags with an id.
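
To make the first two points concrete, here is a minimal sketch that splits a URL into its base and its query parameters using only the standard library. The URL and its parameter names (q, page) are hypothetical and chosen only for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Hypothetical URL: the host, path and parameter names are made up
url = "https://example.com/jobs/search?q=python&page=2"

parts = urlsplit(url)

# The base URL is everything before the "?"
base_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# The query string becomes a dict that maps each name to a list of values
params = parse_qs(parts.query)

print(base_url)
print(params)

# Going the other way: build a URL from a base and a dict of parameters
rebuilt = f"{base_url}?{urlencode({'q': 'python', 'page': 2})}"
print(rebuilt)
```

On a dynamic site the server reads these parameters to decide what to return, while a static site serves the same file whatever they are. With requests the encoding can be left to the library by passing the dict through the params argument of requests.get.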

### Visiting the example website

We will use this example site: <https://realpython.github.io/fake-jobs/>

## Fetching a URL with requests

This is the simplest way to download a web page.

In [1]:
import requests

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

# The page source is available in .text
page.text

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Fake Python</title>\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">\n  </head>\n  <body>\n  <section class="section">\n    <div class="container mb-5">\n      <h1 class="title is-1">\n        Fake Python\n      </h1>\n      <p class="subtitle is-3">\n        Fake Jobs for Your Web Scraping Journey\n      </p>\n    </div>\n    <div class="container">\n    <div id="ResultsContainer" class="columns is-multiline">\n    <div class="column is-half">\n<div class="card">\n  <div class="card-content">\n    <div class="media">\n      <div class="media-left">\n        <figure class="image is-48x48">\n          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">\n        </figure>\n      </div>\n      <div class="media-content"
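
The cell above assumes that the download succeeded. A slightly more defensive sketch follows. Note that fetch_html is a hypothetical helper, not part of requests, and the 10 second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of url as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # Covers connection errors, timeouts, bad URLs and HTTP errors
        print(f"Could not download {url}: {error}")
        return None
    return response.text
```

Calling fetch_html("https://realpython.github.io/fake-jobs/") should return the same text as page.text when the site is reachable, and None otherwise.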

## Adding the parser

In [2]:
from bs4 import BeautifulSoup

# Parse the downloaded HTML
soup = BeautifulSoup(page.content, "html.parser")

## Finding elements by ID

In [3]:
results = soup.find(id="ResultsContainer")

print(results.prettify())

<div class="columns is-multiline" id="ResultsContainer">
 <div class="column is-half">
  <div class="card">
   <div class="card-content">
    <div class="media">
     <div class="media-left">
      <figure class="image is-48x48">
       <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
      </figure>
     </div>
     <div class="media-content">
      <h2 class="title is-5">
       Senior Python Developer
      </h2>
      <h3 class="subtitle is-6 company">
       Payne, Roberts and Davis
      </h3>
     </div>
    </div>
    <div class="content">
     <p class="location">
      Stewartbury, AA
     </p>
     <p class="is-small has-text-grey">
      <time datetime="2021-04-08">
       2021-04-08
      </time>
     </p>
    </div>
    <footer class="card-footer">
     <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
      Learn
     </a>
     <a class="card-footer-item" href=
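
One detail worth knowing: find() returns None when nothing matches, so a typo in the id only surfaces later as an AttributeError. The sketch below shows this on a hand-written fragment that mimics the structure of the example page. The HTML string is an assumption made for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Cut-down, hand-written HTML that mimics the structure of the example page
html = """
<div id="ResultsContainer" class="columns is-multiline">
  <div class="card-content"><h2 class="title is-5">Senior Python Developer</h2></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

found = soup.find(id="ResultsContainer")
missing = soup.find(id="resultscontainer")  # the id is compared case-sensitively

print(found is not None)
print(missing is None)

if missing is None:
    print("No element with that id; check the spelling in the developer tools")
```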

## Finding elements by class

In [4]:
job_elements = results.find_all("div", class_="card-content")

# Iterating over the elements

In [5]:
for job_element in job_elements:
    print(job_element.prettify(), end="\n"*2)

<div class="card-content">
 <div class="media">
  <div class="media-left">
   <figure class="image is-48x48">
    <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
   </figure>
  </div>
  <div class="media-content">
   <h2 class="title is-5">
    Senior Python Developer
   </h2>
   <h3 class="subtitle is-6 company">
    Payne, Roberts and Davis
   </h3>
  </div>
 </div>
 <div class="content">
  <p class="location">
   Stewartbury, AA
  </p>
  <p class="is-small has-text-grey">
   <time datetime="2021-04-08">
    2021-04-08
   </time>
  </p>
 </div>
 <footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
   Learn
  </a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">
   Apply
  </a>
 </footer>
</div>


<div class="card-content">
 <div class="media">
  <div class="media-lef

The loop above prints far too much HTML. It is better to extract only the parts we need: job title, company and location.

In [6]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element)
    print(company_element)
    print(location_element)
    print()

<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
<p class="location">
        Stewartbury, AA
      </p>

<h2 class="title is-5">Energy engineer</h2>
<h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
<p class="location">
        Christopherville, AA
      </p>

<h2 class="title is-5">Legal executive</h2>
<h3 class="subtitle is-6 company">Jackson, Chambers and Levy</h3>
<p class="location">
        Port Ericaburgh, AA
      </p>

<h2 class="title is-5">Fitness centre manager</h2>
<h3 class="subtitle is-6 company">Savage-Bradley</h3>
<p class="location">
        East Seanview, AP
      </p>

<h2 class="title is-5">Product manager</h2>
<h3 class="subtitle is-6 company">Ramirez Inc</h3>
<p class="location">
        North Jamieview, AP
      </p>

<h2 class="title is-5">Medical technical officer</h2>
<h3 class="subtitle is-6 company">Rogers-Yates</h3>
<p class="location">
        Davidville, AP
      </p>

<h2 class="t
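
Note that class_="company" finds the h3 even though its class attribute is "subtitle is-6 company": class_ matches when the element carries that class among others. The sketch below checks this on a hand-written card (an assumption that mirrors the markup above) and shows the same lookup written as a CSS selector.

```python
from bs4 import BeautifulSoup

# Hand-written card that mirrors the markup printed above
html = """
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">Christopherville, AA</p>
</div>
"""
card = BeautifulSoup(html, "html.parser")

# class_ matches when the element has that class among its classes
by_class = card.find("h3", class_="company")

# The same lookup written as a CSS selector
by_selector = card.select_one("h3.company")

print(by_class["class"])
print(by_class == by_selector)
```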

Keep in mind that each job_element is itself a Beautiful Soup object (a Tag), so we can use its .text attribute to drop the surrounding HTML.

We also call the .strip() method to remove leading and trailing whitespace: <https://www.w3schools.com/python/ref_string_strip.asp>

In [7]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA

Energy engineer
Vasquez-Davidson
Christopherville, AA

Legal executive
Jackson, Chambers and Levy
Port Ericaburgh, AA

Fitness centre manager
Savage-Bradley
East Seanview, AP

Product manager
Ramirez Inc
North Jamieview, AP

Medical technical officer
Rogers-Yates
Davidville, AP

Physiological scientist
Kramer-Klein
South Christopher, AE

Textile designer
Meyers-Johnson
Port Jonathan, AE

Television floor manager
Hughes-Williams
Osbornetown, AE

Waste management officer
Jones, Williams and Villa
Scotttown, AP

Software Engineer (Python)
Garcia PLC
Ericberg, AE

Interpreter
Gregory and Sons
Ramireztown, AE

Architect
Clark, Garcia and Sosa
Figueroaview, AA

Meteorologist
Bush PLC
Kelseystad, AA

Audiological scientist
Salazar-Meyers
Williamsburgh, AE

English as a second language teacher
Parker, Murphy and Brooks
Mitchellburgh, AE

Surgeon
Cruz-Brown
West Jessicabury, AA

Equities trader
Macdonald-Ferguson
Maloneshire, AE
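
To see what .strip() does to the location text, here is a small sketch that needs no network. The raw string is copied from the location element shown earlier, and the messy title is a made-up value.

```python
# Text of the location element shown earlier, including its whitespace
raw_location = "\n        Stewartbury, AA\n      "

print(repr(raw_location.strip()))   # removes whitespace at both ends
print(repr(raw_location.lstrip()))  # removes it on the left only
print(repr(raw_location.rstrip()))  # removes it on the right only

# strip() leaves inner whitespace alone; split() plus join() collapses it
messy = "Senior   Python \n Developer"
print(" ".join(messy.split()))
```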


**We can also find elements by the text they contain**

In [8]:
python_jobs = results.find_all("h2", string="Senior Python Developer")
for i in python_jobs:
    print(i)

print ("---")
python_jobs = results.find_all("h2", string="Python")
for i in python_jobs:
    print(i)

<h2 class="title is-5">Senior Python Developer</h2>
---


The second search shows no results because string= only matches text that is exactly equal. Differences in whitespace, upper or lower case, hyphens and other variations will prevent the results we want from being found.

In [9]:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)
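
As an alternative to the lambda, string= also accepts a compiled regular expression. The sketch below compares both on a few hand-written headings (sample data, not the real page). The extra None check in the lambda is a precaution for tags whose .string is None, which happens when a tag has several children.

```python
import re

from bs4 import BeautifulSoup

# Hand-written sample headings
html = """
<h2 class="title is-5">Senior Python Developer</h2>
<h2 class="title is-5">Energy engineer</h2>
<h2 class="title is-5">Software Engineer (python)</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Substring match that ignores case
pattern = re.compile("python", re.IGNORECASE)
by_regex = soup.find_all("h2", string=pattern)

# Equivalent lambda, guarded against a missing string
by_lambda = soup.find_all(
    "h2", string=lambda text: text is not None and "python" in text.lower()
)

print([h2.text for h2 in by_regex])
```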

#### Printing the results

In [10]:
print (python_jobs)

[<h2 class="title is-5">Senior Python Developer</h2>, <h2 class="title is-5">Software Engineer (Python)</h2>, <h2 class="title is-5">Python Programmer (Entry-Level)</h2>, <h2 class="title is-5">Python Programmer (Entry-Level)</h2>, <h2 class="title is-5">Software Developer (Python)</h2>, <h2 class="title is-5">Python Developer</h2>, <h2 class="title is-5">Back-End Web Developer (Python, Django)</h2>, <h2 class="title is-5">Back-End Web Developer (Python, Django)</h2>, <h2 class="title is-5">Python Programmer (Entry-Level)</h2>, <h2 class="title is-5">Software Developer (Python)</h2>]


#### Better one element at a time

In [11]:
for job in python_jobs:
    print (job)
    print()

<h2 class="title is-5">Senior Python Developer</h2>

<h2 class="title is-5">Software Engineer (Python)</h2>

<h2 class="title is-5">Python Programmer (Entry-Level)</h2>

<h2 class="title is-5">Python Programmer (Entry-Level)</h2>

<h2 class="title is-5">Software Developer (Python)</h2>

<h2 class="title is-5">Python Developer</h2>

<h2 class="title is-5">Back-End Web Developer (Python, Django)</h2>

<h2 class="title is-5">Back-End Web Developer (Python, Django)</h2>

<h2 class="title is-5">Python Programmer (Entry-Level)</h2>

<h2 class="title is-5">Software Developer (Python)</h2>



#### And even better if we strip the HTML

In [12]:
for job in python_jobs:
    print (job.text.strip())

Senior Python Developer
Software Engineer (Python)
Python Programmer (Entry-Level)
Python Programmer (Entry-Level)
Software Developer (Python)
Python Developer
Back-End Web Developer (Python, Django)
Back-End Web Developer (Python, Django)
Python Programmer (Entry-Level)
Software Developer (Python)


#### And if we put it all together...

In [24]:

python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

for job_element in python_jobs:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

AttributeError: 'NoneType' object has no attribute 'text'

#### Oops! It failed! Why?
**Hint**:

In [26]:
for job_element in python_jobs:
    title_element = job_element.find("h2", class_="title")
    print (title_element)

None
None
None
None
None
None
None
None
None
None


Do the h2 elements with the class title contain the information we are looking for?

No. Each one holds only the job title, so searching inside it for the title, company or location finds nothing and returns None:

    <h2 class="title is-5">Senior Python Developer</h2>

We have to go up to the parent elements and, from there, get an object that gives us access to the rest of the job card.

In [18]:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

python_job_elements = [
    h2_element.parent.parent.parent for h2_element in python_jobs
]

for job_element in python_job_elements:
    links = job_element.find_all("a")
    for link in links:
        link_url = link["href"]
        print(f"Apply here: {link_url}\n")

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-20.html

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-30.html

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/software-developer-python-40.html

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/python-developer-50.html

Apply here: https://www.realpython.com

Apply here: https://realpython.github.io/fake-jobs/jobs/back-end-web-developer-python-django-60.html

Apply here: https://www.realpython.com

Apply here: https://realpython.github
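
Chaining .parent three times depends on the exact nesting of the page. A possible alternative is find_parent(), which walks up until it finds a matching ancestor. The sketch below compares both on a hand-written card. The company name and the link are made up.

```python
from bs4 import BeautifulSoup

# Hand-written card with the same nesting as the example page
html = """
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-content">
        <h2 class="title is-5">Python Developer</h2>
        <h3 class="subtitle is-6 company">Example Corp</h3>
      </div>
    </div>
    <footer class="card-footer">
      <a class="card-footer-item" href="https://example.com/apply">Apply</a>
    </footer>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h2_element = soup.find("h2")

# Three hops up: media-content -> media -> card-content
by_hops = h2_element.parent.parent.parent

# Ask for the enclosing card by tag name and class instead
by_name = h2_element.find_parent("div", class_="card-content")

print(by_hops is by_name)
print(by_name.find("h3", class_="company").text)
```

The find_parent() version keeps working if the site adds or removes a wrapper div between the heading and the card.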

Two links? No problem if we only want the second one...

In [27]:
for job_element in python_job_elements:
    link_url = job_element.find_all("a")[1]["href"]
    print(f"Apply here: {link_url}\n")

Apply here: https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html

Apply here: https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html

Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-20.html

Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-30.html

Apply here: https://realpython.github.io/fake-jobs/jobs/software-developer-python-40.html

Apply here: https://realpython.github.io/fake-jobs/jobs/python-developer-50.html

Apply here: https://realpython.github.io/fake-jobs/jobs/back-end-web-developer-python-django-60.html

Apply here: https://realpython.github.io/fake-jobs/jobs/back-end-web-developer-python-django-70.html

Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-80.html

Apply here: https://realpython.github.io/fake-jobs/jobs/software-developer-python-90.html
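
Picking the link by position ([1]) breaks if the order of the links changes. A sketch of a lookup by link text follows. The footer is hand-written after the markup shown earlier, and the comparison against "Apply" assumes that the link text stays the same.

```python
from bs4 import BeautifulSoup

# Hand-written footer modelled on the card footer shown earlier
html = """
<footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com">Learn</a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html">Apply</a>
</footer>
"""
footer = BeautifulSoup(html, "html.parser")

# Match the link whose text, once stripped, is exactly "Apply"
apply_link = footer.find(
    "a", string=lambda text: text is not None and text.strip() == "Apply"
)

if apply_link is not None:
    print(f"Apply here: {apply_link['href']}")
```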



**Exercise**: Store the results in a table in a MySQL database.

    CREATE TABLE fakejob(
        position VARCHAR(200),
        company VARCHAR(200),
        address VARCHAR(200),
        pubDate VARCHAR(200),
        url VARCHAR(250)
    );

Scraping two more sites

In [30]:
import pymysql  # MySQL driver used by the mysql+pymysql:// URL below
from sqlalchemy import create_engine, text

db_host = "localhost"
db_port = 3306
db_user = "dbeaver"
db_passwd = "abc123."
db_name = "employees"

# Build the connection string from the parameters above
connectionString = f'mysql+pymysql://{db_user}:{db_passwd}@{db_host}:{db_port}/{db_name}'

engine = create_engine(connectionString)

job_elements = results.find_all("div", class_="card-content")

# Bound parameters (:name) keep quotes in the scraped text from breaking the SQL
cadeaSQL = text(
    "INSERT INTO fakejob(position, company, address, pubDate, url) "
    "VALUES (:position, :company, :address, :pub_date, :url)"
)

# engine.begin() opens a transaction and commits it when the block ends
with engine.begin() as connection:
    for job_element in job_elements:
        title_element = job_element.find("h2", class_="title")
        company_element = job_element.find("h3", class_="company")
        location_element = job_element.find("p", class_="location")
        publication_date = job_element.find("time")
        link_url = job_element.find_all("a")[1]["href"]
        connection.execute(
            cadeaSQL,
            {
                "position": title_element.text.strip(),
                "company": company_element.text.strip(),
                "address": location_element.text.strip(),
                "pub_date": publication_date.text.strip(),
                "url": link_url,
            },
        )


OperationalError: (pymysql.err.OperationalError) (1045, "Access denied for user 'dbeaver'@'172.30.0.2' (using password: YES)")
(Background on this error at: https://sqlalche.me/e/20/e3q8)
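
Building SQL by pasting values into a string breaks as soon as a value contains a single quote, and it opens the door to SQL injection. Placeholders avoid both problems. The sketch below shows the idea with an in-memory sqlite3 database, used here only as a stand-in for MySQL so that it runs without a server. The inserted row is made up.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL
connection = sqlite3.connect(":memory:")
connection.execute(
    """CREATE TABLE fakejob(
        position VARCHAR(200),
        company VARCHAR(200),
        address VARCHAR(200),
        pubDate VARCHAR(200),
        url VARCHAR(250)
    )"""
)

# Made-up row; the apostrophe would break a query built with an f-string
job = (
    "Python Developer",
    "O'Neil and Sons",
    "Stewartbury, AA",
    "2021-04-08",
    "https://example.com/jobs/python-developer.html",
)

# The ? placeholders are filled in by the driver, which handles the quoting
connection.execute(
    "INSERT INTO fakejob(position, company, address, pubDate, url) "
    "VALUES (?, ?, ?, ?, ?)",
    job,
)
connection.commit()

rows = connection.execute("SELECT company, url FROM fakejob").fetchall()
print(rows)
```

The placeholder syntax depends on the driver: sqlite3 uses ?, pymysql uses %s, and SQLAlchemy's text() uses :name.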

# Other examples

In [None]:
# Get the content of the H1 heading
soup.h1.text

# Show the path of the image displayed on the page
soup.img.get('src')

# Show the alternative text of the image
soup.img.get('alt')

# Show all strongly emphasized (<strong>) text on the page
soup.find_all('strong')

# Show all the links on the page
for i in soup.find_all('a'):
    print(i.get('href'))

# Show the text of each link
for i in soup.find_all('a'):
    print(i.text)

# Count the number of paragraphs on the page
contador = 0
for i in soup.find_all('p'):
    contador = contador + 1
contador
# or simply use len
# len(soup.find_all('p'))

# Show the content of the last paragraph
soup.find_all('p')[-1].text



# NEW PAGE: https://bigdatawirtz.github.io/exemplo-web/08.html
# Take a look at the page source
url = 'https://bigdatawirtz.github.io/exemplo-web/08.html'
paxina = requests.get(url)

print(paxina.text)

# Parse the page content
soup = BeautifulSoup(paxina.content, 'html.parser')

# Show the page title
soup.title.text

# Show the page charset, found in "meta"
soup.meta.get('charset')

# Count the number of paragraphs on the page
len(soup.find_all('p'))

# Show the text in the article footer
soup.footer.p.text

# Show the text in the site footer
soup.find_all('footer')[-1].p.text

# Show the id of the section
soup.section.get('id')
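
The href values printed by the link loops above are not always full URLs, since a page may use relative links. A minimal sketch of turning them into absolute URLs with the standard library follows. The href values in the list are hypothetical examples of the forms a link can take, not values taken from the page.

```python
from urllib.parse import urljoin

# Page the links were found on
page_url = "https://bigdatawirtz.github.io/exemplo-web/08.html"

# Hypothetical href values of the kinds a page may contain
hrefs = ["09.html", "/exemplo-web/", "https://realpython.com/", "#top"]

# urljoin resolves each href against the URL of the page
absolute = [urljoin(page_url, href) for href in hrefs]

for link in absolute:
    print(link)
```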



**Source** (adapted from): https://realpython.com/beautiful-soup-web-scraper-python/