## Web Scraping with Python

Web scraping, also known as web data extraction, is the process of automatically collecting information from websites, web pages, or online documents. In other words, it means extracting data from a web page without interacting with it manually.

In Python, web scraping is commonly done with libraries such as requests and BeautifulSoup.

The requests library sends HTTP requests to a website, while BeautifulSoup parses the HTML or XML of the response.

### How it works

* Send an HTTP request: An HTTP request (such as GET or POST) is sent to a website using the requests library.

* Receive the response: The website responds with HTML or XML that contains the requested information.

* Parse the response: BeautifulSoup parses the HTML or XML of the response and converts it into an object that is easy to navigate and manipulate.

* Extract the data: The desired data is extracted from the parsed object using the methods and functions that BeautifulSoup provides.
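The four steps above can be sketched as two small functions. This is only a minimal sketch, not a definitive implementation: the names fetch_html and extract_quotes, the 10-second timeout, and the sample HTML are assumptions made for illustration. The class "text" on the span matches the markup of quotes.toscrape.com used in the rest of this notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Steps 1 and 2: send a GET request and return the HTML of the response."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_quotes(html):
    """Steps 3 and 4: parse the HTML and extract the text of each quote."""
    soup = BeautifulSoup(html, 'html.parser')
    return [span.get_text() for span in soup.find_all('span', class_='text')]


# Hypothetical sample standing in for a downloaded page, so this runs offline
sample_html = '<div class="quote"><span class="text">Hello, scraping.</span></div>'
quotes = extract_quotes(sample_html)
print(quotes)
```

With network access, extract_quotes(fetch_html('https://quotes.toscrape.com/')) would run the same steps against the live site.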

https://quotes.toscrape.com/

https://es.wikipedia.org/wiki/Modelo_OSI

Selenium

Gradio


#### What is an HTTP request, and what is it for?

An HTTP (Hypertext Transfer Protocol) request is a message sent from a client (usually a web browser) to a server to ask for a resource or to perform a specific action on the server.

HTTP requests are used to interact with web servers and obtain responses that contain the requested information or resources. Some examples of actions that can be performed with HTTP requests are:

* Get a file or resource from the server (GET)
* Send data to the server to create a new resource (POST)
* Update an existing resource on the server (PUT)
* Delete a resource from the server (DELETE)

(Each method tells the server which kind of operation the request is asking for.)



An HTTP request consists of the following elements:

* HTTP method (GET, POST, PUT, DELETE, etc.)
* URL of the requested resource
* Headers with additional information such as content type, authentication, etc.
* Body with additional data for the server (optional)

The server processes the request and returns an HTTP response that contains the requested resource, or an error message if the request cannot be fulfilled.
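To make the elements listed above concrete, the sketch below builds a request with the requests library without sending it, then prints its method, URL, headers, and body. The /login path appears in the page used in this notebook, but the form field 'username' and its value are hypothetical, chosen only for illustration.

```python
import requests

# Build a request without sending it, so its parts can be inspected offline.
# The form field 'username' and its value are hypothetical.
request = requests.Request(
    method='POST',
    url='https://quotes.toscrape.com/login',
    headers={'Accept': 'text/html'},
    data={'username': 'demo'},
)
prepared = request.prepare()

print(prepared.method)         # HTTP method
print(prepared.url)            # URL of the requested resource
print(dict(prepared.headers))  # headers, including those added for the body
print(prepared.body)           # form-encoded body
```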

In [14]:
import requests

url_web = 'https://quotes.toscrape.com/'

response = requests.get(url_web)

print(response.status_code)  # Status code of the response
print(response.text)  # Body of the response, decoded as text

200
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" it

In the ***requests*** library, **get** is a module-level convenience function: requests.get(url) creates a temporary ***Session*** internally and sends the request through it.
The Session class, which also exposes its own **get** method, sends HTTP requests and manages the connections to the server; creating one explicitly is useful for reusing connections and cookies across several requests.

***get*** takes a URL as a parameter and returns a "Response" object that contains the server's response.

* The Response object has several attributes and methods for accessing the server's response, such as the status code (status_code), the headers (headers), and the content (text, content).
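As a rough illustration of those attributes, the sketch below gathers a few of them into a dictionary and shows one way to fail early on an error status. The function names summarize_response and fetch and the 10-second timeout are assumptions for this example, not part of the requests API.

```python
import requests


def summarize_response(response):
    """Collect a few commonly used attributes of a requests Response."""
    return {
        'status_code': response.status_code,  # e.g. 200 or 404
        'ok': response.ok,                    # True when the status is below 400
        'content_type': response.headers.get('Content-Type'),
        'length': len(response.text),         # size of the decoded body
    }


def fetch(url, timeout=10):
    """Send a GET request and raise requests.HTTPError on a 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response
```

For the response obtained above, summarize_response(response) would report the 200 status code together with the content type sent by the server.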



#### Parsing the response
This step consists of parsing the HTML code.
Here I use **BeautifulSoup**, a parsing library that takes the HTML received and converts it into an object that is easy to work with.

In [15]:
from bs4 import BeautifulSoup

soup=BeautifulSoup(response.text, 'html.parser')


'BeautifulSoup' is a class.

A **BeautifulSoup** object is created and stored in the variable 'soup'.

The call above passes 2 arguments to the BeautifulSoup constructor:
* 'response.text': the HTML content of the web page, obtained through the GET request made with **requests**.

* 'html.parser': the parser used to analyze the HTML content. In this case it is **html.parser**, which is included with Python.
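The same constructor call can be tried offline on a small HTML string. The fragment below is a hypothetical, shortened version of the page downloaded earlier, and the variable name soup_demo is chosen so it does not overwrite 'soup'.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the page downloaded above
html = """
<html>
  <head><title>Quotes to Scrape</title></head>
  <body>
    <h1><a href="/">Quotes to Scrape</a></h1>
    <p><a href="/login">Login</a></p>
  </body>
</html>
"""

soup_demo = BeautifulSoup(html, 'html.parser')

print(type(soup_demo).__name__)     # the parsed document is a BeautifulSoup object
print(soup_demo.title.string)       # text inside <title>
print(soup_demo.find('a')['href'])  # href of the first <a> tag
print([a.get_text() for a in soup_demo.find_all('a')])  # text of every link
```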

Get all the text

In [16]:
phrases=soup.find_all('span')

for i in phrases:
    print(i.text)
    

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)

“It is our choices, Harry, that show what we truly are, far more than our abilities.”
by J.K. Rowling
(about)

“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
by Albert Einstein
(about)

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
by Jane Austen
(about)

“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
by Marilyn Monroe
(about)

“Try not to become a man of success. Rather become a man of value.”
by Albert Einstein
(about)

“It is better to be hated for what you are than to be loved for what you are not.”
by André Gide
(about)

“I have not failed. I've just found 10,000 ways that won't work.”
by Thomas A. Edison
(about)

“A woman is like a t

Get only the quote

In [17]:
frases_1=soup.find_all('span',class_="text")

# Passing the tag name together with class_, as in find_all('span', class_='text'), restricts the search to <span> elements that have the class text.

for i in frases_1:
    print(i.text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
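The difference between the two searches above can be reproduced on a small fragment. The HTML below is a hypothetical sample that imitates the structure of the quote blocks on the page; the select call shows the same filter written as a CSS selector, which is an alternative and not what the notebook uses.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the same structure as the quote blocks above
html = """
<div class="quote">
  <span class="text">“First quote.”</span>
  <span>by <small class="author">Author One</small> <a href="/author/Author-One">(about)</a></span>
</div>
<div class="quote">
  <span class="text">“Second quote.”</span>
  <span>by <small class="author">Author Two</small> <a href="/author/Author-Two">(about)</a></span>
</div>
"""
demo = BeautifulSoup(html, 'html.parser')

all_spans = demo.find_all('span')                  # every <span>, with or without a class
text_spans = demo.find_all('span', class_='text')  # only <span class="text">
css_spans = demo.select('span.text')               # same filter as a CSS selector

print(len(all_spans), len(text_spans), len(css_spans))
print([s.get_text() for s in text_spans])
```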


Create a function that does this

In [18]:
# Passing the tag name together with class_, as in find_all('span', class_='text'),
# restricts the search to <span> elements that have the class text.

# First draft, kept for reference:
# frases_1 = soup.find_all('span', class_="text")
# list_phrases = []
# for i in frases_1:
#     list_phrases.append(i.text)
# print(list_phrases)

def listing_phrases(soup):
    """Return the text of every quote on the page, without the curly quotes."""
    list_phrases = []
    frases_1 = soup.find_all('span', class_="text")

    for i in frases_1:
        # Strip the curly quotes “ ” (U+201C and U+201D; Alt+0147 and Alt+0148 on Windows)
        list_phrases.append(i.text.strip('“”'))

    return list_phrases


print(listing_phrases(soup))

['The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', 'It is our choices, Harry, that show what we truly are, far more than our abilities.', 'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.', 'The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.', "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.", 'Try not to become a man of success. Rather become a man of value.', 'It is better to be hated for what you are than to be loved for what you are not.', "I have not failed. I've just found 10,000 ways that won't work.", "A woman is like a tea bag; you never know how strong it is until it's in hot water.", 'A day without sunshine is like, you know, night.']
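The behavior of strip with a set of characters is easy to check in isolation. The strings below are hypothetical examples; the last lines show the same cleaning written as a list comprehension, an equivalent alternative to the loop in listing_phrases.

```python
raw = '“It is our choices, Harry, that show what we truly are.”'

# strip() treats its argument as a set of characters and removes them
# from both ends only; characters inside the string are left alone.
print(raw.strip('“”'))

nested = '“She said “hello” and left.”'
print(nested.strip('“”'))  # the inner quotes around hello survive

# 'texts' is a hypothetical stand-in for the .text values of the spans
texts = ['“First.”', '“Second.”']
cleaned = [t.strip('“”') for t in texts]
print(cleaned)
```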


# listing_phrases already strips the curly quotes, so each phrase can be printed directly
for phrase in listing_phrases(soup):
    print(phrase)

Get only the author

In [20]:
def listing_authors(soup):
    authors=soup.find_all('small',class_="author")
    list_authors=[]
    for i in authors:
        list_authors.append(i.text)
        
    return list_authors
print(listing_authors(soup))

['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']


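Get only the links attached to '(about)'

The link is not complete, though: each href is relative and lacks the site root https://quotes.toscrape.com/, so I concatenate the two.
Because the root already ends in '/' and every href starts with '/', plain concatenation produces a double slash ('//author/...'), as the output shows.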

In [21]:
def listing_links(soup):
    # Find every <a> tag whose text contains '(about)' and collect its href.
    # The deprecated 'text=' argument was replaced by 'string='.
    links = [a['href'] for a in soup.find_all('a', string=lambda t: t and '(about)' in t)]

    list_links = []
    for i in links:
        list_links.append(url_web + i)  # concatenate with the site root

    return list_links
print(listing_links(soup))


['https://quotes.toscrape.com//author/Albert-Einstein', 'https://quotes.toscrape.com//author/J-K-Rowling', 'https://quotes.toscrape.com//author/Albert-Einstein', 'https://quotes.toscrape.com//author/Jane-Austen', 'https://quotes.toscrape.com//author/Marilyn-Monroe', 'https://quotes.toscrape.com//author/Albert-Einstein', 'https://quotes.toscrape.com//author/Andre-Gide', 'https://quotes.toscrape.com//author/Thomas-A-Edison', 'https://quotes.toscrape.com//author/Eleanor-Roosevelt', 'https://quotes.toscrape.com//author/Steve-Martin']
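The doubled slash in these URLs comes from joining a base that ends in '/' with hrefs that start with '/'. One way to avoid it, shown here as a sketch and not as the method used in this notebook, is urljoin from the standard library, which resolves a relative href against a base URL. The two hrefs are taken from the output above.

```python
from urllib.parse import urljoin

url_web = 'https://quotes.toscrape.com/'
hrefs = ['/author/Albert-Einstein', '/author/J-K-Rowling']

# Plain concatenation keeps both slashes
print([url_web + h for h in hrefs])

# urljoin resolves each href against the base URL, as a browser would
full_links = [urljoin(url_web, h) for h in hrefs]
print(full_links)
```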


In [22]:
# Create empty lists to fill later in a loop
authors=[]
phrase=[]
more=[]

for i in phrases:  # Iterate over every <span> found earlier
    all_text=i.text.strip()  # .text holds the full text of the span, including 'by' and '(about)'
    #print(all_text)
    if 'by' in all_text:  # check whether the text contains 'by'
        author=all_text.split('by')[1].strip()
        # split('by') returns a list of 2 elements --> ['', ' Albert Einstein\n(about)']
        # the author is the element at index [1],
        # but it still includes the line break and '(about)'
        print(author)
            
"""         one_phrase = all_text[0].strip()
        print(one_phrase) """

Albert Einstein
(about)
J.K. Rowling
(about)
Albert Einstein
(about)
Jane Austen
(about)
Marilyn Monroe
(about)
Albert Einstein
(about)
André Gide
(about)
Thomas A. Edison
(about)
Eleanor Roosevelt
(about)
Steve Martin
(about)


'         one_phrase = all_text[0].strip()\n        print(one_phrase) '
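The output above still carries '(about)' after each name, and the test 'by' in all_text would also match any quote that happens to contain those two letters. A possible refinement, sketched under the assumption that the author span always starts with 'by ' and puts the name on its first line, is shown below; extract_author is a hypothetical helper name.

```python
def extract_author(span_text):
    """Return the author name from the text of a 'by ... (about)' span.

    Returns None when the text does not start with 'by ', which avoids
    matching quotes that merely contain the letters 'by'.
    """
    text = span_text.strip()
    if not text.startswith('by '):
        return None
    # Drop the leading 'by ' and keep only the first line, discarding '(about)'
    return text[len('by '):].splitlines()[0].strip()


print(extract_author('by Albert Einstein\n(about)'))
print(extract_author('“Stand by me.”'))
```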

A function builds a DataFrame from those lists

It might be worth applying a mask to it

In [23]:
import pandas as pd
def create_df(list_phrases, list_authors, list_links):
    # Build a dictionary from the lists; each key becomes a column
    dic_lists = {'phrases': list_phrases, 'author': list_authors, 'about': list_links}
    df = pd.DataFrame(dic_lists)

    return df
print(create_df(listing_phrases(soup), listing_authors(soup), listing_links(soup)))

                                             phrases             author  \
0  The world as we have created it is a process o...    Albert Einstein   
1  It is our choices, Harry, that show what we tr...       J.K. Rowling   
2  There are only two ways to live your life. One...    Albert Einstein   
3  The person, be it gentleman or lady, who has n...        Jane Austen   
4  Imperfection is beauty, madness is genius and ...     Marilyn Monroe   
5  Try not to become a man of success. Rather bec...    Albert Einstein   
6  It is better to be hated for what you are than...         André Gide   
7  I have not failed. I've just found 10,000 ways...   Thomas A. Edison   
8  A woman is like a tea bag; you never know how ...  Eleanor Roosevelt   
9   A day without sunshine is like, you know, night.       Steve Martin   

                                               about  
0  https://quotes.toscrape.com//author/Albert-Ein...  
1    https://quotes.toscrape.com//author/J-K-Rowling  
2  https:
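Building a DataFrame from a dictionary of lists only works when all the lists have the same length. The sketch below adds an explicit check and then filters rows with a boolean mask. The function name build_quotes_df and the sample data are hypothetical, and the boolean mask is only one possible reading of the mask mentioned above.

```python
import pandas as pd


def build_quotes_df(phrases, authors, links):
    """Build a DataFrame from three parallel lists, checking their lengths first."""
    if not (len(phrases) == len(authors) == len(links)):
        raise ValueError('phrases, authors and links must have the same length')
    return pd.DataFrame({'phrases': phrases, 'author': authors, 'about': links})


# Hypothetical sample data standing in for the scraped lists
df_demo = build_quotes_df(
    ['First quote.', 'Second quote.'],
    ['Author One', 'Author Two'],
    ['https://quotes.toscrape.com/author/Author-One',
     'https://quotes.toscrape.com/author/Author-Two'],
)

# A boolean mask keeps only the rows that satisfy a condition
only_author_one = df_demo[df_demo['author'] == 'Author One']
print(df_demo.shape)
print(only_author_one['phrases'].tolist())
```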

Save the DataFrame to a CSV file with headers

In [24]:
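from pathlib import Path

path = Path('C:/4_F5/08_webs')
# header=True writes the column names; sep=';' uses a semicolon as the delimiter
create_df(listing_phrases(soup), listing_authors(soup), listing_links(soup)).to_csv(path / 'dataframe.csv', index=True, header=True, sep=';')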
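To check what ends up in the file, the CSV can be written and read back. This is a sketch with hypothetical sample data and a temporary directory in place of the path above; the explicit utf-8 encoding is an assumption chosen here so that names such as André Gide round-trip cleanly.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data standing in for the scraped DataFrame
df_demo = pd.DataFrame({
    'phrases': ['First quote; with a semicolon.', 'Second quote.'],
    'author': ['André Gide', 'Jane Austen'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'dataframe.csv'
    # index=True writes the row index as the first, unnamed column
    df_demo.to_csv(csv_path, index=True, header=True, sep=';', encoding='utf-8')
    # index_col=0 turns that first column back into the index
    df_back = pd.read_csv(csv_path, sep=';', index_col=0, encoding='utf-8')

print(df_back)
```

A field that contains the separator, like the first quote here, is quoted automatically when written, so it survives the round trip.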

----------------------------- Up to here, the data has been fetched ------------------------------------

In [25]:
#print(df.to_string(index=False))

print(create_df(listing_phrases(soup), listing_authors(soup),listing_links(soup)).to_string(index=False, header=True,justify='left', max_colwidth=100,col_space=20, formatters={'phrases': lambda x: x.replace('"', '')}))

phrases                                                                                              author               about                                                
The world as we have created it is a process of our thinking. It cannot be changed without changi...   Albert Einstein      https://quotes.toscrape.com//author/Albert-Einstein
                 It is our choices, Harry, that show what we truly are, far more than our abilities.      J.K. Rowling          https://quotes.toscrape.com//author/J-K-Rowling
There are only two ways to live your life. One is as though nothing is a miracle. The other is as...   Albert Einstein      https://quotes.toscrape.com//author/Albert-Einstein
The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably st...       Jane Austen          https://quotes.toscrape.com//author/Jane-Austen
Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolu...    Marilyn Monroe  

In [26]:
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({
    'phrases': [
        "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.",
        "It is our choices, Harry, that show what we truly are, far more than our abilities.",
        "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.",
        "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.",
        "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.",
        "Try not to become a man of success. Rather become a man of value.",
        "It is better to be hated for what you are than to be loved for what you are not.",
        "I have not failed. I've just found 10,000 ways that won't work.",
        "A woman is like a tea bag; you never know how strong it is until it's in hot water.",
        "A day without sunshine is like, you know, night."
    ],
    'author': [
        "Albert Einstein",
        "J.K. Rowling",
        "Albert Einstein",
        "Jane Austen",
        "Marilyn Monroe",
        "Albert Einstein",
        "André Gide",
        "Thomas A. Edison",
        "Eleanor Roosevelt",
        "Steve Martin"
    ],
    'about': [
        "https://quotes.toscrape.com//author/Albert-Einstein",
        "https://quotes.toscrape.com//author/J-K-Rowling",
        "https://quotes.toscrape.com//author/Albert-Einstein",
        "https://quotes.toscrape.com//author/Jane-Austen",
        "https://quotes.toscrape.com//author/Marilyn-Monroe",
        "https://quotes.toscrape.com//author/Albert-Einstein",
        "https://quotes.toscrape.com//author/Andre-Gide",
        "https://quotes.toscrape.com//author/Thomas-A-Edison",
        "https://quotes.toscrape.com//author/Eleanor-Roosevelt",
        "https://quotes.toscrape.com//author/Steve-Martin"
    ]
})

# Print the DataFrame record by record, so long text is not truncated
for index, row in df.iterrows():
    print(f"Frases: {row['phrases']}\nAutor: {row['author']}\nAcerca de: {row['about']}\n")

Frases: The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.
Autor: Albert Einstein
Acerca de: https://quotes.toscrape.com//author/Albert-Einstein

Frases: It is our choices, Harry, that show what we truly are, far more than our abilities.
Autor: J.K. Rowling
Acerca de: https://quotes.toscrape.com//author/J-K-Rowling

Frases: There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.
Autor: Albert Einstein
Acerca de: https://quotes.toscrape.com//author/Albert-Einstein

Frases: The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.
Autor: Jane Austen
Acerca de: https://quotes.toscrape.com//author/Jane-Austen

Frases: Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.
Autor: Marilyn Monroe
Acerca de: https://quotes.toscrape.com//author/Marilyn-Monroe

F
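If long quotes should also be wrapped to a fixed width when printed, the standard library's textwrap module is one option. This is a minimal sketch, assuming a 60-column width and a hypothetical record dictionary in place of a DataFrame row:

```python
import textwrap

# Hypothetical record standing in for one row of the DataFrame
record = {
    'phrases': 'There are only two ways to live your life. One is as though '
               'nothing is a miracle. The other is as though everything is a miracle.',
    'author': 'Albert Einstein',
}

# Wrap the quote at 60 columns, indenting continuation lines by four spaces
wrapped = textwrap.fill(record['phrases'], width=60, subsequent_indent='    ')
print(wrapped)
print(f"Author: {record['author']}")
```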

Next, I create the databases with SQLAlchemy