### 1. From HTML

*Using only Beautiful Soup, no regex*

Use web scraping to save the following information in a DataFrame. Each row must hold, in separate columns:

- The title of the page
- The id of the div that contains the scraped value. If the div has no id, the value must be numpy.nan
- The name of the tag that contains the scraped value
- The following scraped values, one per row:
    - The value "Este es el segundo párrafo" --> Row 1
    - The url https://pagina1.xyz/ --> Row 2
    - The url https://pagina4.xyz/ --> Row 3
    - The url https://pagina5.xyz/ --> Row 4
    - The value "links footer-links" --> Row 5
    - The value "Este párrafo está en el footer" --> Row 6

In [1]:
import requests
import pandas as pd 
from bs4 import BeautifulSoup

In [2]:
html = """<html lang="es">
<head>
    <meta charset="UTF-8">
    <title>Página de prueba</title>
</head>
<body>
<div id="main" class="full-width">
    <h1>El título de la página</h1>
    <p>Este es el primer párrafo</p>
    <p>Este es el segundo párrafo</p>
    <div id="innerDiv">
        <div class="links">
            <a href="https://pagina1.xyz/">Enlace 1</a>
            <a href="https://pagina2.xyz/">Enlace 2</a>
        </div>
        <div class="right">
            <div class="links">
                <a href="https://pagina3.xyz/">Enlace 3</a>
                <a href="https://pagina4.xyz/">Enlace 4</a>
            </div>
        </div>
    </div>
    <div id="footer">
        <!-- El footer -->
        <p>Este párrafo está en el footer</p>
        <div class="links footer-links">
            <a href="https://pagina5.xyz/">Enlace 5</a>
        </div>
    </div>
</div>
</body>
</html>"""

In [None]:
def show_html(html_str):
    print(BeautifulSoup(str(html_str), 'html.parser').prettify())

def get_page_contents(url):
    page = requests.get(url, headers={"Accept-Language": "en-US"})
    return BeautifulSoup(page.text, "html.parser")

In [None]:
# To scrape a page correctly, it is vital to work from the general to the specific

In [54]:
## In this case, the HTML is given to us as a string
def show_html(html_str):
    print(BeautifulSoup(str(html_str), 'html.parser').prettify())

#show_html(html)

In [55]:
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly so results do not depend on which parsers are installed
#soup

In [118]:
soup.title

<title>Página de prueba</title>

In [51]:
div_main = soup.div
div_main.attrs # This is how to access a tag's attribute keys and values in Beautiful Soup



{'id': 'main', 'class': ['full-width']}

In [59]:
print(soup.p) # Access the first p (paragraph) tag
print('·········')
print(soup.p.string) # Get the string of that tag, in this case its text
print('·········')
print(type(soup.p.string)) # Note that it is a NavigableString

<p>Este es el primer párrafo</p>
·········
Este es el primer párrafo
·········
<class 'bs4.element.NavigableString'>


In [None]:
# The idea is to navigate through the Beautiful Soup elements and represent them in a DataFrame



In [116]:
inner_div = soup.div.div
print(type(inner_div.contents))
hijos = inner_div.contents

hijos

<class 'list'>


['\n',
 <div class="links">
 <a href="https://pagina1.xyz/">Enlace 1</a>
 <a href="https://pagina2.xyz/">Enlace 2</a>
 </div>,
 '\n',
 <div class="right">
 <div class="links">
 <a href="https://pagina3.xyz/">Enlace 3</a>
 <a href="https://pagina4.xyz/">Enlace 4</a>
 </div>
 </div>,
 '\n']
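The `'\n'` entries in the output above are whitespace-only text nodes between the tags. A minimal sketch of two ways to keep only the child tags follows. The `snippet` string is a shortened stand-in for the `innerDiv` fragment, and the names `only_tags` and `direct_children` are illustrative, not part of the original notebook.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the innerDiv fragment of the page above.
snippet = """<div id="innerDiv">
    <div class="links"><a href="https://pagina1.xyz/">Enlace 1</a></div>
    <div class="right"><a href="https://pagina4.xyz/">Enlace 4</a></div>
</div>"""

inner_div = BeautifulSoup(snippet, "html.parser").div

# .contents mixes Tag objects with whitespace-only NavigableString nodes,
# so keep only the children that are real tags.
only_tags = [child for child in inner_div.contents if isinstance(child, Tag)]

# Alternative: ask find_all() for direct child tags only.
direct_children = inner_div.find_all(recursive=False)

print([child["class"] for child in only_tags])
```

Either form avoids having to skip the newline strings by position.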

In [111]:
enlaces = soup.find_all('a')
for pos, enlace in enumerate(enlaces):
    print(f'Iteration {pos}:')
    print(enlace)

Iteration 0:
<a href="https://pagina1.xyz/">Enlace 1</a>
Iteration 1:
<a href="https://pagina2.xyz/">Enlace 2</a>
Iteration 2:
<a href="https://pagina3.xyz/">Enlace 3</a>
Iteration 3:
<a href="https://pagina4.xyz/">Enlace 4</a>
Iteration 4:
<a href="https://pagina5.xyz/">Enlace 5</a>
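Each link found this way still has to be turned into the three pieces the exercise asks for: the value, the tag name, and the id of the enclosing div. The sketch below shows one way to do that for a single link. It assumes that "the div where the value is scraped" means the nearest enclosing `<div>`, which is an interpretation of the task, and it uses a shortened stand-in for the footer fragment.

```python
import numpy as np
from bs4 import BeautifulSoup

# Shortened stand-in for the footer fragment of the page above.
snippet = """<div id="footer">
    <div class="links footer-links">
        <a href="https://pagina5.xyz/">Enlace 5</a>
    </div>
</div>"""

soup_demo = BeautifulSoup(snippet, "html.parser")

# Match the link by its exact href value (no regex involved).
link = soup_demo.find("a", href="https://pagina5.xyz/")

url = link.get("href")                 # attribute value, or None if the attribute is missing
parent_div = link.find_parent("div")   # nearest enclosing <div> (assumed meaning of "the div")
div_id = parent_div.get("id", np.nan)  # this div has no id, so the default NaN is returned
tag_name = link.name                   # name of the tag holding the value

print(url, div_id, tag_name)
```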


In [119]:
footer = soup.find_all(id='footer')  # find_all() returns a ResultSet (a list of tags); use soup.find(id='footer') to get a single Tag


AttributeError: ResultSet object has no attribute 'contents'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
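The error above comes from treating the list returned by `find_all()` as a single element; `find()` returns one `Tag`, on which `.contents` and further searches work. Building on that, here is a hedged sketch of how the six requested rows could be assembled into a DataFrame. It is one possible reading of the exercise, not a reference solution: the helper `describe`, the column names, and the rule "use the element itself when it is a div, otherwise its nearest enclosing div" are all assumptions.

```python
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

html_doc = """<html lang="es">
<head><title>Página de prueba</title></head>
<body>
<div id="main" class="full-width">
    <h1>El título de la página</h1>
    <p>Este es el primer párrafo</p>
    <p>Este es el segundo párrafo</p>
    <div id="innerDiv">
        <div class="links">
            <a href="https://pagina1.xyz/">Enlace 1</a>
            <a href="https://pagina2.xyz/">Enlace 2</a>
        </div>
        <div class="right">
            <div class="links">
                <a href="https://pagina3.xyz/">Enlace 3</a>
                <a href="https://pagina4.xyz/">Enlace 4</a>
            </div>
        </div>
    </div>
    <div id="footer">
        <!-- El footer -->
        <p>Este párrafo está en el footer</p>
        <div class="links footer-links">
            <a href="https://pagina5.xyz/">Enlace 5</a>
        </div>
    </div>
</div>
</body>
</html>"""

page_soup = BeautifulSoup(html_doc, "html.parser")


def describe(tag, value):
    """Build one row: page title, id of the relevant div (NaN if none), tag name, value."""
    # Assumption: the relevant div is the tag itself if it is a div,
    # otherwise its nearest enclosing div.
    div = tag if tag.name == "div" else tag.find_parent("div")
    return {
        "title": str(page_soup.title.string),
        "div_id": div.get("id", np.nan),
        "tag": tag.name,
        "value": str(value),
    }


second_p = page_soup.find_all("p")[1]        # second paragraph in document order
footer = page_soup.find(id="footer")         # find() returns a single Tag
footer_links = footer.find("div", class_="footer-links")
footer_p = footer.find("p")

rows = [describe(second_p, second_p.string)]
for url in ("https://pagina1.xyz/", "https://pagina4.xyz/", "https://pagina5.xyz/"):
    link = page_soup.find("a", href=url)
    rows.append(describe(link, link["href"]))
rows.append(describe(footer_links, " ".join(footer_links["class"])))
rows.append(describe(footer_p, footer_p.string))

# Index the rows 1..6 to match the row numbers in the exercise statement.
df = pd.DataFrame(rows, index=range(1, len(rows) + 1))
print(df)
```

Under this reading, rows 2 to 5 get `NaN` in `div_id` because the `links` divs carry a class but no id. If the exercise instead intends the nearest ancestor that does have an id, only the `div` lookup inside `describe` would change.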

### 2. From Amazon

*Using Beautiful Soup (regex optional)*

Use web scraping to save the following information in a DataFrame. Using product pages from Amazon, do the following:

- Get the product name from the page and save it in a column called "item_name"
- Get the price from the page and save it in a column called "item_price"

Document each step as you work through the exercise. Try to make the program work for generic product pages. If you cannot make it generic, explain why.

*We recommend getting the source code, saving it to a local file and working from there. Amazon may detect that you are scraping and change the source code it serves to prevent possible attacks.*
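The recommended workflow of saving the source once and parsing the local copy can be sketched as follows. To keep the example runnable offline, `page_source` is a made-up stand-in for the text a real request would return, and the file name `product_page.html` is hypothetical.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Stand-in for the downloaded source (e.g. the .text of a requests response).
# No network call is made in this sketch.
page_source = "<html><head><title>Demo product page</title></head><body></body></html>"

# Save the source once; the file name is a hypothetical choice.
local_file = Path(tempfile.gettempdir()) / "product_page.html"
local_file.write_text(page_source, encoding="utf-8")

# From here on, work only from the local copy.
soup_local = BeautifulSoup(local_file.read_text(encoding="utf-8"), "html.parser")
print(soup_local.title.string)
```

Working from the saved file also keeps the markup stable between runs, which makes the selectors easier to debug.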

-------------------------------

**Example:** 

url = https://www.amazon.es/Tommy-Hilfiger-UM0UM00054-Camiseta-Hombre/dp/B01MYD0T1F/ref=sr_1_1?dchild=1&pf_rd_p=58224bec-cac9-4dd2-a42a-61b1db609c2d&pf_rd_r=VZQ1JTQXFVRZ9E9VSKX4&qid=1595364419&s=apparel&sr=1-1

*item_name* --> "Tommy Hilfiger Logo Camiseta de Cuello Redondo,Perfecta para El Tiempo Libre para Hombre"

*item_price* --> [[18,99 € - 46,59 €]] or one of the options.
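A possible shape for the extraction step is sketched below. The markup in `saved_page` is invented for illustration, and the element ids (`productTitle`, `priceblock_ourprice`, `priceblock_dealprice`) are assumptions that must be checked against the saved source of each real page. Trying several candidate ids in order is one way to approach the "generic pages" requirement; the helper names `first_text` and `parse_product` are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented markup standing in for a saved product page.
# The ids used here are assumptions, not a confirmed page structure.
saved_page = """<html><body>
<span id="productTitle">
    Tommy Hilfiger Logo Camiseta de Cuello Redondo,Perfecta para El Tiempo Libre para Hombre
</span>
<span id="priceblock_ourprice">18,99 € - 46,59 €</span>
</body></html>"""


def first_text(soup, candidate_ids):
    """Return the stripped text of the first element matching one of the ids, else None."""
    for element_id in candidate_ids:
        element = soup.find(id=element_id)
        if element is not None:
            return element.get_text(strip=True)
    return None


def parse_product(html_text):
    """Extract item_name and item_price from one page's HTML text."""
    soup = BeautifulSoup(html_text, "html.parser")
    return {
        "item_name": first_text(soup, ["productTitle"]),
        "item_price": first_text(soup, ["priceblock_ourprice", "priceblock_dealprice"]),
    }


df_products = pd.DataFrame([parse_product(saved_page)])
print(df_products)
```

A field comes back as `None` when none of the candidate ids is present, which makes it easy to spot pages whose layout differs and to document why a fully generic program may not be achievable.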


