# **Examples of data extraction with web scraping**


## Objectives


<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li>
            <p>Using the Beautiful Soup library for web scraping</p>
            <ul>
                <li>Tags</li>
                <li>Children, parents, and siblings</li>
                <li>HTML attributes</li>
                <li>Navigable strings</li>
            </ul>
        </li>
    </ul>
    <ul>
        <li>
            <p>Filters</p>
            <ul>
                <li>find_all</li>
                <li>find</li>
                <li>HTML attributes</li>
                <li>Navigable string</li>
            </ul>
        </li>
    </ul>
    <ul>
        <li>
            <p>Downloading and scraping the contents of a web page</p>
        </li>
    </ul>
</div>

<hr>


For this lab we install two libraries, plus the parser that the examples pass to Beautiful Soup:

bs4 --> Beautiful Soup is a library whose methods make it easier to extract information from web pages. https://pypi.org/project/beautifulsoup4/

requests --> A library that simplifies sending HTTP requests; for example, you do not have to build query strings into the URL by hand. https://pypi.org/project/requests/

html5lib --> The HTML parser selected by the 'html5lib' argument used throughout this lab. https://pypi.org/project/html5lib/

In [1]:
!pip install bs4
!pip install requests
!pip install html5lib



We import the libraries, and everything is ready to start the exercise.

In [4]:
from bs4 import BeautifulSoup  # this module helps us with web scraping
import requests  # this module helps us download web content

<h2 id="BSO">Using the Beautiful Soup library for web scraping</h2>


As mentioned above, Beautiful Soup is a Python library for extracting data from HTML and XML files; here we focus on HTML. It represents the HTML as a tree that we can navigate and filter to find what we are looking for.

Consider the following HTML example:


In [5]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Salarios de empleados</title>
</head>
<body>
<h3><b id='boldest'>Antonio García</b></h3>
<p> Salario: 25,000 € </p>
<h3> Francisco Ramirez</h3>
<p> Salario: 20,000 €  </p>
<h3> Rosa Durán </h3>
<p> Salario: 28,200 €</p>
</body>
</html>

We can store this HTML as a string in a variable, which we will call <code>html</code>:


In [6]:
html="<!DOCTYPE html><html><head><title>Salarios de empleados</title></head><body><h3><b id='boldest'>Antonio García</b></h3><p> Salario: 25,000 € </p><h3> Francisco Ramirez</h3><p> Salario: 20,000 €  </p><h3> Rosa Durán </h3><p> Salario: 28,200 €</p></body></html>"
html

"<!DOCTYPE html><html><head><title>Salarios de empleados</title></head><body><h3><b id='boldest'>Antonio García</b></h3><p> Salario: 25,000 € </p><h3> Francisco Ramirez</h3><p> Salario: 20,000 €  </p><h3> Rosa Durán </h3><p> Salario: 28,200 €</p></body></html>"

To parse a document, we pass it to the <code>BeautifulSoup</code> constructor. The resulting <code>BeautifulSoup</code> object represents the document as a nested data structure:

In [7]:
soup = BeautifulSoup(html, 'html5lib')

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters. Beautiful Soup then transforms the HTML document into a <b>complex tree</b> of Python objects. In this lab we work with the <code>BeautifulSoup</code> and <code>Tag</code> objects, which for the purposes of this lab behave identically, and with <code>NavigableString</code> objects.
With the <code>prettify()</code> method we can display the HTML in its nested structure:


In [8]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Salarios de empleados
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Antonio García
   </b>
  </h3>
  <p>
   Salario: 25,000 €
  </p>
  <h3>
   Francisco Ramirez
  </h3>
  <p>
   Salario: 20,000 €
  </p>
  <h3>
   Rosa Durán
  </h3>
  <p>
   Salario: 28,200 €
  </p>
 </body>
</html>


## Tags


We can extract information by HTML tag name. For example, suppose we want the page title and the name of the first employee; we use the <code>title</code> and <code>h3</code> tags respectively.

In [9]:
objeto_etiqueta=soup.title  # extract the page title through the 'title' tag
print("Contents of objeto_etiqueta:",objeto_etiqueta)

Contents of objeto_etiqueta: <title>Salarios de empleados</title>


We can see that the object is of type <code>bs4.element.Tag</code>:


In [10]:
print("Type of the tag object:",type(objeto_etiqueta))

Type of the tag object: <class 'bs4.element.Tag'>


If more than one HTML tag has the same name, the first element with that name is returned. This happens with the <code>h3</code> tag:

In [11]:
objeto_etiqueta=soup.h3
objeto_etiqueta

<h3><b id="boldest">Antonio García</b></h3>
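
As a small, hedged illustration of tag access: the snippet below is a shortened stand-in for the employee page, and the built-in `html.parser` backend is an assumption chosen so the sketch needs no extra install. Attribute-style access such as `demo_soup.h3` returns the same element as `demo_soup.find("h3")`, and a tag that does not exist yields `None` rather than an error.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the employee page used above (illustrative only).
snippet = (
    "<html><head><title>Salarios de empleados</title></head>"
    "<body><h3>Antonio García</h3><h3>Rosa Durán</h3></body></html>"
)
demo_soup = BeautifulSoup(snippet, "html.parser")

first_h3 = demo_soup.h3         # attribute access returns the first <h3>
same_h3 = demo_soup.find("h3")  # find() returns that same element
missing = demo_soup.table       # there is no <table>, so this is None

print(first_h3.name, first_h3.string)
print(first_h3 is same_h3, missing)
```

Checking for `None` before chaining further attribute access avoids an `AttributeError` on pages that lack the expected tag.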

### Children, parents, and siblings


As noted above, <code>objeto_etiqueta</code> is a tree whose branches we can navigate: children are the branches that hang from a parent, and siblings are the branches at the same level of the hierarchy. To access a child branch of <code>objeto_etiqueta</code>, we use the child's tag name:


In [12]:
etiqueta_hijo=objeto_etiqueta.b  # extract the <b></b> tag nested inside objeto_etiqueta
etiqueta_hijo

<b id="boldest">Antonio García</b>

We can access the parent tag from the child tag:

In [13]:
etiqueta_padre=etiqueta_hijo.parent
etiqueta_padre

<h3><b id="boldest">Antonio García</b></h3>

This is the same as:


In [14]:
objeto_etiqueta  # the whole object

<h3><b id="boldest">Antonio García</b></h3>

The parent of <code>objeto_etiqueta</code> is the <code>body</code> element of the HTML:


In [15]:
objeto_etiqueta.parent

<body><h3><b id="boldest">Antonio García</b></h3><p> Salario: 25,000 € </p><h3> Francisco Ramirez</h3><p> Salario: 20,000 €  </p><h3> Rosa Durán </h3><p> Salario: 28,200 €</p></body>

The sibling of <code>objeto_etiqueta</code> is the paragraph element <code>p</code>, which we obtain with the <code>next_sibling</code> attribute:


In [16]:
hermano_1=objeto_etiqueta.next_sibling
hermano_1

<p> Salario: 25,000 € </p>

`hermano_2` is the next `h3` heading element, which is a sibling of both `hermano_1` and `objeto_etiqueta`:


In [17]:
hermano_2=hermano_1.next_sibling
hermano_2

<h3> Francisco Ramirez</h3>
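
The sibling navigation above works cleanly because the HTML string contains no whitespace between tags. The following sketch, which assumes a hypothetical pretty-printed fragment and the built-in `html.parser`, shows that `next_sibling` can return a whitespace string instead of a tag, and that `find_next_sibling()` skips over such strings:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with line breaks between the tags.
pretty_html = """<body>
<h3>Rosa Durán</h3>
<p>Salario: 28,200 €</p>
</body>"""
pretty_soup = BeautifulSoup(pretty_html, "html.parser")

heading = pretty_soup.h3
raw_sibling = heading.next_sibling         # the newline between </h3> and <p>
tag_sibling = heading.find_next_sibling()  # the next sibling that is a tag

print(repr(raw_sibling))
print(tag_sibling)
```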

### HTML attributes


A tag can have attributes. For example, the tag <code>&lt;b id="boldest"&gt;</code> has an attribute <code>id</code> whose value is <code>boldest</code>. We can access a tag's attributes by treating the tag like a dictionary:

In [18]:
etiqueta_hijo['id']

'boldest'

We can access that dictionary directly through the <code>attrs</code> attribute:

In [19]:
etiqueta_hijo.attrs

{'id': 'boldest'}

We can also work with multi-valued attributes; see the Beautiful Soup documentation <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">[1]</a>.

We can also obtain the value of an attribute with the tag's <code>get()</code> method, which works like the <code>get()</code> method of a Python dictionary:

In [20]:
etiqueta_hijo.get('id')

'boldest'
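
To make the difference between the two access styles concrete, here is a minimal sketch. The extra `class` attribute is an assumption added for illustration, and `html.parser` is used so no extra install is needed. Dictionary-style access raises `KeyError` for a missing attribute, `get()` returns `None`, and a multi-valued attribute such as `class` comes back as a list:

```python
from bs4 import BeautifulSoup

# Illustrative tag; the class attribute is not part of the lab's HTML.
markup = '<b id="boldest" class="name highlighted">Antonio García</b>'
tag = BeautifulSoup(markup, "html.parser").b

tag_id = tag.get("id")       # 'boldest'
classes = tag["class"]       # multi-valued attribute, returned as a list
missing = tag.get("href")    # None, because the attribute is absent

try:
    tag["href"]              # dictionary-style access on a missing attribute
    raised = False
except KeyError:
    raised = True

print(tag_id, classes, missing, raised)
```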

### Navigable string


A string corresponds to a fragment of text or content within a tag. Beautiful Soup uses the <code>NavigableString</code> class to contain this text. In our HTML we can obtain the name of the first employee by extracting the string of the <code>Tag</code> object <code>etiqueta_hijo</code> as follows:


In [27]:
etiqueta_string=etiqueta_hijo.string  # the text inside <b id="boldest">Antonio García</b>
etiqueta_string


'Antonio García'

Here we can see that the type is <code>NavigableString</code>:


In [30]:
type(etiqueta_string)

bs4.element.NavigableString

A <code>NavigableString</code> is like a Python string (a Unicode string, to be precise). The main difference is that it also supports some <code>BeautifulSoup</code> features. We can convert it to a Python string object:


In [33]:
unicode_string = str(etiqueta_string)
unicode_string

'Antonio García'
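
One caveat worth sketching: `.string` only returns text when a tag has a single child. In the hypothetical paragraph below (not part of the lab's HTML, parsed with `html.parser`), the `<p>` tag mixes text and a nested `<b>` tag, so `.string` is `None` and `get_text()` is the way to collect all the text:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph mixing plain text with a nested tag.
mixed = BeautifulSoup("<p>Salario: <b>25,000</b> €</p>", "html.parser").p

only_child = mixed.b.string   # <b> has exactly one string child
ambiguous = mixed.string      # <p> has several children, so this is None
full_text = mixed.get_text()  # concatenates every string inside <p>

print(only_child, ambiguous, full_text)
```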

## Filters


Filters let us find complex patterns; the simplest filter is a string. In this section we pass a string to different filter methods, and Beautiful Soup matches against that exact string. Consider the following HTML of rocket launches:

In [35]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
    <td>80 kg</td>
  </tr>
</table>

| Flight No | Launch site | Payload mass |
|-----------|-------------|--------------|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |


We can store the HTML as a string in the variable <code>tabla</code>:


In [41]:
tabla="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"
tabla

"<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"

In [43]:
tabla_bs = BeautifulSoup(tabla, 'html5lib')  # parse the HTML string
tabla_bs

<html><head></head><body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body></html>
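
The output above shows that the `html5lib` parser wrapped the fragment in `html`, `head`, and `body` elements and added a `tbody`. As a hedged comparison, this sketch parses a tiny, made-up table fragment with Python's built-in `html.parser`, which leaves the fragment as written. Which behaviour is preferable depends on the page, so treat this as an illustration rather than a recommendation.

```python
from bs4 import BeautifulSoup

# Tiny illustrative fragment (not the lab's table).
fragment = "<table><tr><td>1</td></tr></table>"
builtin_soup = BeautifulSoup(fragment, "html.parser")

has_tbody = builtin_soup.find("tbody") is not None   # no <tbody> is inserted
has_html_wrapper = builtin_soup.find("html") is not None

print(builtin_soup)
print(has_tbody, has_html_wrapper)
```

This matters later when code relies on `.tbody`: that element may be missing when the source omits it and the parser does not add it.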

## find_all


The <code>find_all()</code> method looks through a tag's descendants and retrieves all those that match the given filters.
Its signature is <code>find_all(name, attrs, recursive, string, limit, **kwargs)</code>.
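
Two of those arguments, `limit` and `recursive`, are not used elsewhere in this lab. The sketch below uses a made-up one-row table and the built-in `html.parser` to show what they do: `limit` stops after a number of matches, and `recursive=False` restricts the search to direct children.

```python
from bs4 import BeautifulSoup

# Made-up one-row table for illustration.
row_html = "<table><tr><td>1</td><td>2</td><td>3</td></tr></table>"
row_soup = BeautifulSoup(row_html, "html.parser")

first_two = row_soup.find_all("td", limit=2)                    # stop after two matches
direct_cells = row_soup.table.find_all("td", recursive=False)   # <td> are grandchildren of <table>
direct_rows = row_soup.table.find_all("tr", recursive=False)    # <tr> is a direct child here

print(first_two, direct_cells, direct_rows)
```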

### Name


When we set the <code>name</code> parameter to a tag name, the method returns all tags with that name, together with their children.

In [44]:
tabla_rows=tabla_bs.find_all('tr')
tabla_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>]

The result is an iterable that behaves like a Python list, and each element is a <code>Tag</code> object:

In [45]:
first_row =tabla_rows[0]
first_row

<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>

The type is <code>Tag</code>:


In [43]:
print(type(first_row))

<class 'bs4.element.Tag'>


We can obtain the child:


In [46]:
first_row.td

<td id="flight">Flight No</td>

If we iterate through the list, each element corresponds to a row in the table:


In [48]:
for i,row in enumerate(tabla_rows):
    print("row",i,"is",row)
    

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>


As <code>row</code> is a <code>Tag</code> object, we can apply the <code>find_all</code> method to it and extract the table cells into the object <code>cells</code> using the tag name <code>td</code>; that is, all the descendants named <code>td</code>. The result is a list in which each element corresponds to a cell and is a <code>Tag</code> object, so we can iterate through this list as well. We can extract the content using the <code>string</code> attribute.


In [46]:
for i,row in enumerate(tabla_rows):
    print("row",i)
    cells=row.find_all('td')
    for j,cell in enumerate(cells):
        print('column',j,"cell",cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>
column 2 cell <td>80 kg</td>
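
The nested loop above prints whole `Tag` objects. As a sketch of going one step further, the following code turns each row into a list of cell texts with `get_text(strip=True)`. It uses a cleaned-up, two-flight copy of the launch table (an assumption for illustration) and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Cleaned-up, shortened copy of the launch table (illustrative).
flights_html = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr>"
    "</table>"
)
flights_soup = BeautifulSoup(flights_html, "html.parser")

# One inner list per <tr>; get_text() also reaches text nested inside <a>.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in flights_soup.find_all("tr")
]
header, *records = rows

print(header)
print(records)
```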


If we use a list we can match against any item in that list.


In [47]:
list_input=tabla_bs.find_all(name=["tr", "td"])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>,
 <td>80 kg</td>]

## Attributes


If a keyword argument is not recognized, it is turned into a filter on the tags' attributes. For example, with the <code>id</code> argument, Beautiful Soup filters against each tag's <code>id</code> attribute. The first <code>td</code> element has an <code>id</code> value of <code>flight</code>, so we can filter on that <code>id</code> value.


In [48]:
tabla_bs.find_all(id="flight")

[<td id="flight">Flight No</td>]

We can find all the elements that have links to the Florida Wikipedia page:


In [49]:
list_input=tabla_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

If we set the <code>href</code> argument to <code>True</code>, the method finds all tags that have an <code>href</code> attribute, regardless of its value:


In [50]:
tabla_bs.find_all(href=True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

There are other methods for working with attributes, as well as related methods such as CSS selectors; see the following <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors'>link</a>.
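
Attribute filters also accept a compiled regular expression or an explicit `attrs` dictionary. The sketch below uses three made-up links and the built-in `html.parser`; the regular expression keeps only the links that point into Wikipedia.

```python
import re
from bs4 import BeautifulSoup

# Made-up links for illustration.
links_html = (
    "<a href='https://en.wikipedia.org/wiki/Florida'>Florida</a>"
    "<a href='https://en.wikipedia.org/wiki/Texas'>Texas</a>"
    "<a href='https://www.ibm.com'>IBM</a>"
)
links_soup = BeautifulSoup(links_html, "html.parser")

# A compiled pattern is matched against each tag's href value.
wiki_links = links_soup.find_all("a", href=re.compile(r"wikipedia\.org/wiki/"))
wiki_targets = [a["href"].rsplit("/", 1)[-1] for a in wiki_links]

# The attrs dictionary is an equivalent way to express an attribute filter.
ibm_links = links_soup.find_all(attrs={"href": "https://www.ibm.com"})

print(wiki_targets, ibm_links)
```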


<h3 id="exer_type">Exercise: <code>find_all</code></h3>


Using the logic above, find all the elements without an <code>href</code> attribute:


In [51]:
tabla_bs.find_all(href=False)

[<html><head></head><body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body></html>,
 <head></head>,
 <body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body>,
 <table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <t

<details><summary>Click here for the solution</summary>

```
tabla_bs.find_all(href=False)

```

</details>


Using the soup object <code>soup</code>, find the element with the <code>id</code> attribute content set to <code>"boldest"</code>. 


In [52]:
soup.find_all(id="boldest")

[<b id="boldest">Antonio García</b>]

<details><summary>Click here for the solution</summary>

```
soup.find_all(id="boldest")

```

</details>


### string


With <code>string</code> you can search for strings instead of tags. Here we find all the strings that are exactly <code>Florida</code>:


In [53]:
tabla_bs.find_all(string="Florida")

['Florida', 'Florida']
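
Because a plain string filter matches the exact text, a cell such as `Florida ` with a trailing space would be missed. This sketch, using made-up cells and the built-in `html.parser`, shows a regular expression as a looser alternative:

```python
import re
from bs4 import BeautifulSoup

# Made-up cells; note the trailing space in the second one.
cells_html = "<table><tr><td>Florida</td><td>Florida </td><td>Texas</td></tr></table>"
cells_soup = BeautifulSoup(cells_html, "html.parser")

exact = cells_soup.find_all(string="Florida")              # exact match only
loose = cells_soup.find_all(string=re.compile("Florida"))  # any string containing 'Florida'
cleaned = [text.strip() for text in loose]

print(exact, loose, cleaned)
```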

## find


The <code>find_all()</code> method scans the entire document looking for results. If you are looking for only one element, you can use the <code>find()</code> method, which returns the first matching element in the document. Consider the following two tables:


In [54]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>


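| Flight No | Launch site | Payload mass |
|-----------|-------------|--------------|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |

| Pizza Place | Orders | Slices |
|-------------|--------|--------|
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |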


We store the HTML as a Python string and assign it to the variable <code>two_tables</code>:


In [55]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"

We create a <code>BeautifulSoup</code> object  <code>two_tables_bs</code>


In [56]:
two_tables_bs= BeautifulSoup(two_tables, 'html.parser')

We can find the first table using the tag name table


In [57]:
two_tables_bs.find("table")

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

We can filter on the class attribute to find the second table, but because class is a keyword in Python, we add an underscore.


In [58]:
two_tables_bs.find("table",class_='pizza')

<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
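
A point the examples above do not show is that `find()` returns `None` when nothing matches, so chained access on the result can fail. The following sketch uses a shortened, made-up version of the two tables with the built-in `html.parser`; the `burger` class is a hypothetical name that matches nothing.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the two tables above.
tables_html = (
    "<table class='rocket'><tr><td>Flight No</td></tr></table>"
    "<table class='pizza'><tr><td>Pizza Place</td></tr></table>"
)
tables_soup = BeautifulSoup(tables_html, "html.parser")

pizza = tables_soup.find("table", class_="pizza")
same_pizza = tables_soup.find("table", attrs={"class": "pizza"})  # equivalent filter
burger = tables_soup.find("table", class_="burger")               # no such table

# Guard against None before navigating further.
first_cell = pizza.td.string if pizza is not None else None
burger_cell = burger.td.string if burger is not None else None

print(first_cell, burger_cell)
```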

## Downloading and scraping the contents of a web page


We download the contents of the web page:


In [59]:
url = "http://www.ibm.com"

We use <code>get</code> to download the contents of the webpage in text format and store in a variable called <code>data</code>:


In [60]:
data  = requests.get(url).text 
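
The one-liner above does not check whether the request succeeded. A slightly more defensive variant might look like the sketch below. The helper name `fetch_html`, the 10-second timeout, and the injectable `get` parameter are all assumptions made for illustration, not part of the lab.

```python
import requests

def fetch_html(url, timeout=10.0, get=requests.get):
    """Download a page and return its text.

    Raises requests.HTTPError for 4xx/5xx responses instead of
    silently returning an error page. The `get` parameter is a
    hypothetical hook that lets the function be exercised offline.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()
    return response.text

# Example call (needs network access):
# data = fetch_html("http://www.ibm.com")
```

Setting a timeout keeps the notebook from hanging indefinitely if the server does not respond.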

We create a <code>BeautifulSoup</code> object using the <code>BeautifulSoup</code> constructor 


In [61]:
soup = BeautifulSoup(data,"html5lib")  # create a soup object using the variable 'data'

Scrape all links


In [62]:
for link in soup.find_all('a',href=True):  # in html anchor/link is represented by the tag <a>

    print(link.get('href'))


https://www.ibm.com/es/es
https://www.ibm.com/sitemap/es/es
https://developer.ibm.com/callforcode/?lnk=eshpv18l1
https://newsroom.ibm.com/2021-02-16-IBM-Commits-To-Net-Zero-Greenhouse-Gas-Emissions-By-2030?lnk=eshpv18nf1
https://www.ibm.com/es-es/events?lnk=eshpv18l1
/es-es/marketing/partner-ecosystem/?lnk=eshpv18f2
/events/think/es/?lnk=eshpv18f3
https://www.ibm.com/thought-leadership/institute-business-value/c-suite-study/ceo?lnk=eshpv18l4
/it-infrastructure/storage/flash/offers/es-es?lnk=eshpv18f5
/es-es/financing/pre-owned/ibm-certified-used-equipment?lnk=eshpv18f6
https://www.ibm.com/training/cloud?lnk=eshpv18f7
/es-es/products/offers-and-discounts?lnk=hpv18t5
https://www.ibm.com/es-es/cloud/free?lnk%5B0%5D=eshpv18t1&lnk%5B1%5D=STW_ES_HPT_T1_BLK&psrc=NONE&pexp=DEF&lnk2=trial_Cloud
https://www.ibm.com/es-es/products/cloud-pak-for-data?lnk%5B0%5D=eshpv18t2&lnk%5B1%5D=STW_ES_HPT_T2_BLK&psrc=NONE&pexp=DEF&lnk2=trial_CloudPakData
https://www.ibm.com/es-es/products/hosted-security-intel
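
Several of the links printed above are relative (they start with `/`). A common way to turn them into absolute URLs is `urllib.parse.urljoin`. The sketch below uses two made-up anchors and assumes the page was served from `https://www.ibm.com/es-es`; in practice the base would be the final URL of the response.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed base URL and made-up anchors for illustration.
base_url = "https://www.ibm.com/es-es"
anchors_html = (
    "<a href='/events/think/es/'>Think</a>"
    "<a href='https://developer.ibm.com/callforcode/'>Call for Code</a>"
)
anchors_soup = BeautifulSoup(anchors_html, "html.parser")

# Relative hrefs are resolved against the base; absolute ones pass through.
absolute_links = [
    urljoin(base_url, a["href"]) for a in anchors_soup.find_all("a", href=True)
]

print(absolute_links)
```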

## Scrape all image tags


In [63]:
for link in soup.find_all('img'):# in html image is represented by the tag <img>
    print(link)
    #print(link.get('src'))

<img alt="Eventos" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-04-07/20210406-eses-eventspage-444x254.jpg"/>
<img alt="Ecosistema y partners" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-04-26/20210426-eses-ecosystem-444x254.jpg"/>
<img alt="Think 2021" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-04-23/20210423-eses-think2021-444x254.jpg"/>
<img alt="icono verde azulado de líneas con la forma de la letra C" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-02-23/ceo-study.png"/>
<img alt="FlashSystem" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-04-16/20210209-flash-system-5200-storage-25719-444x252.jpg"/>
<img alt="" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-04-16/ICPO-Environmental-HP_0.jpg"/>
<img alt="" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-04-23/20210423-eses-cloudtraining-444x254.jpg"/>
<img al
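
The `src` values in the output above are protocol-relative (they start with `//`). As a sketch of collecting just the image addresses, the code below reads `src` and `alt` from each tag and resolves them against an assumed page URL. The three `<img>` tags and their file paths are made up for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed page URL and made-up image tags.
page_url = "https://www.ibm.com/es-es"
images_html = (
    '<img alt="Eventos" src="//1.cms.s81c.com/sites/default/files/events.jpg"/>'
    '<img alt="" src="/images/banner.png"/>'
    '<img alt="no source"/>'
)
images_soup = BeautifulSoup(images_html, "html.parser")

# src=True skips <img> tags that have no src attribute at all.
images = [
    {"alt": img.get("alt", ""), "src": urljoin(page_url, img["src"])}
    for img in images_soup.find_all("img", src=True)
]

for image in images:
    print(image)
```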

## Extract data from HTML tables


In [64]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before proceeding to scrape a web site, you need to examine the contents, and the way data is organized on the website. Open the above url in your browser and check how many rows and columns are there in the color table.


In [65]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [66]:
soup = BeautifulSoup(data,"html5lib")

In [67]:
#find a html table in the web page
table = soup.find('table') # in html table is represented by the tag <table>

In [68]:
#Get all rows from the table
for row in table.find_all('tr'): # in html table row is represented by the tag <tr>
    # Get all columns in each row.
    cols = row.find_all('td') # in html a column is represented by the tag <td>
    color_name = cols[2].string # store the value in column 3 as color_name
    color_code = cols[3].string # store the value in column 4 as color_code
    print("{}--->{}".format(color_name,color_code))

Color Name--->None
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
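
The first line of the output above shows `None` for the header row, which is what `.string` returns when a cell has more than one child. A somewhat more defensive version of the loop is sketched below on a small, made-up table (its structure is an assumption, not a copy of the real page): rows without enough `td` cells are skipped and `get_text(strip=True)` replaces `.string`.

```python
from bs4 import BeautifulSoup

# Made-up table that only resembles the color table.
colors_html = """<table>
<tr><th>Number</th><th>Color</th><th>Color Name</th><th>Hex Code</th></tr>
<tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
<tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>"""
colors_table = BeautifulSoup(colors_html, "html.parser").find("table")

color_codes = {}
for row in colors_table.find_all("tr"):
    cols = row.find_all("td")
    if len(cols) < 4:  # header rows built from <th> have no <td> cells
        continue
    color_codes[cols[2].get_text(strip=True)] = cols[3].get_text(strip=True)

print(color_codes)
```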


## Extract data from HTML tables into a DataFrame using BeautifulSoup and pandas


In [69]:
import pandas as pd

In [70]:
#The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

Before scraping a website, you need to examine its contents and the way the data is organized. Open the above URL in your browser and look at the tables on the page.


In [71]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [72]:
soup = BeautifulSoup(data,"html5lib")

In [73]:
#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>

In [74]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

26

Assume that we are looking for the `10 most densely populated countries` table. We can look through the tables list and pick the right one based on the data in each table, or we can search for the table name if it appears in the table, although this option might not always work.


In [75]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

5
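
The loop above matches the text anywhere in the table's HTML and leaves `table_index` undefined if nothing matches. A narrower alternative, sketched here under the assumption that the target table has a `<caption>` element, looks only at captions and returns `None` when no table qualifies. The helper name `find_table_by_caption` and the two small tables are hypothetical.

```python
from bs4 import BeautifulSoup

def find_table_by_caption(soup, caption_text):
    """Return the first <table> whose <caption> contains caption_text, else None."""
    for table in soup.find_all("table"):
        caption = table.find("caption")
        if caption is not None and caption_text in caption.get_text():
            return table
    return None

# Two made-up tables for illustration.
captions_html = (
    "<table id='regions'><caption>Population by region</caption></table>"
    "<table id='density'><caption>10 most densely populated countries "
    "<small>(with population above 5 million)</small></caption></table>"
)
captions_soup = BeautifulSoup(captions_html, "html.parser")

found = find_table_by_caption(captions_soup, "10 most densely populated countries")
not_found = find_table_by_caption(captions_soup, "Pizza")

print(found["id"], not_found)
```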


See if you can locate the table name, `10 most densely populated countries`, in the output below.


In [76]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
 </caption>
 <tbody>
  <tr>
   <th>
    Rank
   </th>
   <th>
    Country
   </th>
   <th>
    Population
   </th>
   <th>
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th>
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="2880" data-file-width="4320" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/35px-Flag_of_Singapore.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singa

In [77]:
rows = []  # collect one dict per table row; DataFrame.append was removed in pandas 2.0

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if (col != []):  # header rows only contain <th> cells, so they are skipped
        rank = col[0].text
        country = col[1].text
        population = col[2].text.strip()
        area = col[3].text.strip()
        density = col[4].text.strip()
        rows.append({"Rank":rank, "Country":country, "Population":population, "Area":area, "Density":density})

population_data = pd.DataFrame(rows, columns=["Rank", "Country", "Population", "Area", "Density"])
population_data

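|   | Rank | Country | Population | Area | Density |
|---|------|---------|------------|------|---------|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 170620000 | 143998 | 1185 |
| 2 | 3 | Lebanon | 6856000 | 10452 | 656 |
| 3 | 4 | Taiwan | 23604000 | 36193 | 652 |
| 4 | 5 | South Korea | 51781000 | 99538 | 520 |
| 5 | 6 | Rwanda | 12374000 | 26338 | 470 |
| 6 | 7 | Haiti | 11578000 | 27065 | 428 |
| 7 | 8 | Netherlands | 17590000 | 41526 | 424 |
| 8 | 9 | Israel | 9340000 | 22072 | 423 |
| 9 | 10 | India | 1376640000 | 3287240 | 419 |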
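
The loop stores every cell as text. If the scraped cells carry thousands separators (an assumption; the raw values in the example row below are made up to show the idea), a small helper can convert them to integers before further analysis. The name `to_int` is hypothetical.

```python
def to_int(cell_text):
    """Convert text such as '5,704,000' or ' 710 ' to an int."""
    return int(cell_text.strip().replace(",", ""))

# Hypothetical raw row, as it might come out of the scraping loop.
raw_row = {
    "Rank": "1",
    "Country": "Singapore",
    "Population": "5,704,000",
    "Area": " 710 ",
    "Density": "8,033",
}

# Keep the country name as text and convert every other field.
clean_row = {
    key: value if key == "Country" else to_int(value)
    for key, value in raw_row.items()
}

print(clean_row)
```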


## Extract data from HTML tables into a DataFrame using BeautifulSoup and read_html


Using the same `url`, `data`, `soup`, and `tables` object as in the last section we can use the `read_html` function to create a DataFrame.

Remember the table we need is located in `tables[table_index]`

We can now use the `pandas` function `read_html` and give it the string version of the table as well as the `flavor` which is the parsing engine `bs4`.


In [78]:
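from io import StringIO  # recent pandas versions expect a file-like object rather than a raw HTML string
pd.read_html(StringIO(str(tables[table_index])), flavor='bs4')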

[   Rank      Country  Population  Area(km2)  Density(pop/km2)
 0     1    Singapore     5704000        710              8033
 1     2   Bangladesh   170620000     143998              1185
 2     3      Lebanon     6856000      10452               656
 3     4       Taiwan    23604000      36193               652
 4     5  South Korea    51781000      99538               520
 5     6       Rwanda    12374000      26338               470
 6     7        Haiti    11578000      27065               428
 7     8  Netherlands    17590000      41526               424
 8     9       Israel     9340000      22072               423
 9    10        India  1376640000    3287240               419]

The function `read_html` always returns a list of DataFrames so we must pick the one we want out of the list.


In [79]:
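from io import StringIO  # wrap the HTML string so that recent pandas versions accept it
population_data_read_html = pd.read_html(StringIO(str(tables[table_index])), flavor='bs4')[0]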

population_data_read_html

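|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|------|---------|------------|-----------|------------------|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 170620000 | 143998 | 1185 |
| 2 | 3 | Lebanon | 6856000 | 10452 | 656 |
| 3 | 4 | Taiwan | 23604000 | 36193 | 652 |
| 4 | 5 | South Korea | 51781000 | 99538 | 520 |
| 5 | 6 | Rwanda | 12374000 | 26338 | 470 |
| 6 | 7 | Haiti | 11578000 | 27065 | 428 |
| 7 | 8 | Netherlands | 17590000 | 41526 | 424 |
| 8 | 9 | Israel | 9340000 | 22072 | 423 |
| 9 | 10 | India | 1376640000 | 3287240 | 419 |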


## Extract data from HTML tables into a DataFrame using read_html


We can also use the `read_html` function to directly get DataFrames from a `url`.


In [84]:
dataframe_list = pd.read_html(url, flavor='bs4')

We can see there are 26 DataFrames, the same number of tables we found when we used `find_all` on the `soup` object.


In [85]:
len(dataframe_list)

26

Finally we can pick the DataFrame we need out of the list.


In [82]:
dataframe_list[5]

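|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|------|---------|------------|-----------|------------------|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 170620000 | 143998 | 1185 |
| 2 | 3 | Lebanon | 6856000 | 10452 | 656 |
| 3 | 4 | Taiwan | 23604000 | 36193 | 652 |
| 4 | 5 | South Korea | 51781000 | 99538 | 520 |
| 5 | 6 | Rwanda | 12374000 | 26338 | 470 |
| 6 | 7 | Haiti | 11578000 | 27065 | 428 |
| 7 | 8 | Netherlands | 17590000 | 41526 | 424 |
| 8 | 9 | Israel | 9340000 | 22072 | 423 |
| 9 | 10 | India | 1376640000 | 3287240 | 419 |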


We can also use the `match` parameter to select the specific table we want. If the table contains a string matching the text it will be read.


In [83]:
pd.read_html(url, match="10 most densely populated countries", flavor='bs4')[0]

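|   | Rank | Country | Population | Area(km2) | Density(pop/km2) |
|---|------|---------|------------|-----------|------------------|
| 0 | 1 | Singapore | 5704000 | 710 | 8033 |
| 1 | 2 | Bangladesh | 170620000 | 143998 | 1185 |
| 2 | 3 | Lebanon | 6856000 | 10452 | 656 |
| 3 | 4 | Taiwan | 23604000 | 36193 | 652 |
| 4 | 5 | South Korea | 51781000 | 99538 | 520 |
| 5 | 6 | Rwanda | 12374000 | 26338 | 470 |
| 6 | 7 | Haiti | 11578000 | 27065 | 428 |
| 7 | 8 | Netherlands | 17590000 | 41526 | 424 |
| 8 | 9 | Israel | 9340000 | 22072 | 423 |
| 9 | 10 | India | 1376640000 | 3287240 | 419 |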
