# Exercise 1

Scrape two of the following three websites:
a) http://quotes.toscrape.com ; b) https://www.bolsamadrid.es ; c) www.wikipedia.es (search for some content first)

Do the first scrape with BeautifulSoup and the second with Selenium.

Selected websites:

http://quotes.toscrape.com/  (chosen for its simplicity; the site exists specifically for scraping practice)

https://es.wikipedia.org/wiki/Carl_Weathers  (chosen for its topicality, given the famous actor's recent death)

In [20]:
import pandas as pd
from time import sleep
import time
import random
import os
# import sys

# For "http://quotes.toscrape.com/" with BeautifulSoup
from bs4 import BeautifulSoup
import requests

# For "https://es.wikipedia.org/wiki/Carl_Weathers" and "https://ground.news" with Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
# webdriver-manager finds and installs the driver version that matches the browser available on the OS:
from webdriver_manager.firefox import GeckoDriverManager  # https://pypi.org/project/webdriver-manager/
# The notebook drives Firefox, so the Firefox (not Chrome) Options and Service classes are imported:
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# For https://ground.news with Scrapy (done outside the notebook)
# import scrapy

### Beautiful Soup

We start by applying BeautifulSoup to the "Quotes to Scrape" website.

The URL we will work with first:

In [2]:
url = 'http://quotes.toscrape.com/'

Fetching the HTML content of the URL with requests:

In [4]:
page = requests.get(url)   # requests downloads the content of the URL.

In [9]:
soup = BeautifulSoup(page.text, 'html.parser')   # Build a BeautifulSoup object from the text of the page loaded in "page", parsed as HTML.
print(soup.prettify()) # prettify() adds indentation to make the HTML more readable.

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

We can search for HTML tags and their attributes (such as "class", "href", etc.). On this particular site, inspecting the source shows that each quote lives in a "div" whose class is "quote", so that is the element to look for.

With that, the markup of every quote on the page can be extracted like this:

In [13]:
# Use the find_all() method of BeautifulSoup objects:
citas = soup.find_all('div', class_ = "quote")  # Returns a list of results. find() instead of find_all() would return only the first match.

print(len(citas),"quotes in the 'soup' object.\n")  # There are ten results in the list

print("This is what the first \"quote\" element of the extracted list contains:")
citas[0]   # find_all() returns a list, so we index one of its elements to show its structure.

10 quotes in the 'soup' object.

This is what the first "quote" element of the extracted list contains:


<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>

Depending on the intended use, keep in mind that the site has ten pages to scrape, so here we repeat the process in a loop to collect all the data.

In [14]:
filas = []  # List of rows to load into the dataframe that will hold the data.


for i in range(1,11):  # For each of the ten pages, from page 1 to page 10...
    url="https://quotes.toscrape.com/page/" + str(i) + "/"
    pagina = requests.get(url)
    soup = BeautifulSoup(pagina.content, "html.parser")  # As before, parse the response as HTML.
    citas = soup.find_all('div', class_ = "quote")

    for cita in citas:
        registro = []
        # Author of the quote.
        autor = cita.find("small", class_ = "author").text.strip()
        registro.append(autor)

        # Text of the quote
        texto_cita = cita.find("span", class_ = "text", itemprop="text").text.strip()
        registro.append(texto_cita)

        # The tags are stored as a list, because there are several per quote.
        tags = [tag.text for tag in cita.find_all("a", class_="tag")]
        registro.append(tags)

        filas.append(registro)
    sleep(random.randint(1, 3))  # Pause 1 to 3 seconds between pages to avoid hammering the server.

filas

[['Albert Einstein',
  '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
  ['change', 'deep-thoughts', 'thinking', 'world']],
 ['J.K. Rowling',
  '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
  ['abilities', 'choices']],
 ['Albert Einstein',
  '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
  ['inspirational', 'life', 'live', 'miracle', 'miracles']],
 ['Jane Austen',
  '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
  ['aliteracy', 'books', 'classic', 'humor']],
 ['Marilyn Monroe',
  "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
  ['be-yourself', 'inspirational']],
 ['Albert Einstein',
  '“Try not to become a man of success. Rather become a man of value.”',
  ['ad
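The loop above hard-codes the number of pages. A hedged alternative is to follow the site's "Next" link until it disappears. The sketch below uses only the standard library (`html.parser`, not BeautifulSoup) and runs on inline HTML samples; it assumes the pager is marked up as `<li class="next"><a href="...">`, which should be checked against the live page. `NextLinkParser` and `next_page_url` are names made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Collect the href of the first <a> found inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self._inside_next = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "next" in (attrs.get("class") or "").split():
            self._inside_next = True
        elif tag == "a" and self._inside_next and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside_next = False


def next_page_url(current_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    return urljoin(current_url, parser.href) if parser.href else None


# Inline samples standing in for downloaded pages (assumed markup).
SAMPLE_MIDDLE = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
SAMPLE_LAST = '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'

print(next_page_url("https://quotes.toscrape.com/", SAMPLE_MIDDLE))
print(next_page_url("https://quotes.toscrape.com/page/10/", SAMPLE_LAST))
```

In a real run, the scraping loop would keep requesting `next_page_url(...)` until it returns `None`, so the code keeps working if the site gains or loses pages.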

In [15]:
columnas_citas = ['Autor', 'Cita', 'Tags']
df_citas = pd.DataFrame(filas, columns=columnas_citas)
df_citas

| | Autor | Cita | Tags |
|---|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... | [change, deep-thoughts, thinking, world] |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... | [abilities, choices] |
| 2 | Albert Einstein | “There are only two ways to live your life. On... | [inspirational, life, live, miracle, miracles] |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... | [aliteracy, books, classic, humor] |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... | [be-yourself, inspirational] |
| ... | ... | ... | ... |
| 95 | Harper Lee | “You never really understand a person until yo... | [better-life-empathy] |
| 96 | Madeleine L'Engle | “You have to write the book that wants to be w... | [books, children, difficult, grown-ups, write,... |
| 97 | Mark Twain | “Never tell the truth to people who are not wo... | [truth] |
| 98 | Dr. Seuss | “A person's a person, no matter how small.” | [inspirational] |
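Because each row stores its tags as a list, simple summaries need the lists flattened first. The following is a minimal sketch using `collections.Counter` on three rows copied from the scraped output; with the full data one would pass `filas` instead of `sample_rows` (a name introduced here only for illustration). In pandas, `df_citas.explode('Tags')` would be one way to get a similar one-tag-per-row view.

```python
from collections import Counter

# Three rows copied from the scraped output: [author, quote, tags].
sample_rows = [
    ["Albert Einstein",
     "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
     ["change", "deep-thoughts", "thinking", "world"]],
    ["Albert Einstein",
     "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”",
     ["inspirational", "life", "live", "miracle", "miracles"]],
    ["Marilyn Monroe",
     "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
     ["be-yourself", "inspirational"]],
]

# Count quotes per author (first column of each row).
quotes_per_author = Counter(author for author, _quote, _tags in sample_rows)

# Flatten the per-quote tag lists and count every tag.
tag_counts = Counter(tag for _author, _quote, tags in sample_rows for tag in tags)

print(quotes_per_author.most_common(1))
print(tag_counts["inspirational"])
```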


Done!

### Selenium

We now turn to scraping the Wikipedia page that lists the films featuring Carl Weathers, famous for his roles as Apollo Creed in the "Rocky" saga and as Dillon in "Predator" (the well-known [Dillon, you son of a bitch!](https://www.youtube.com/watch?v=wLgsLDIFyA4) scene with "Dutch", played by Arnold Schwarzenegger).

First, we install the correct webdriver with webdriver-manager. Without it, many problems come up, and a browser or driver update can break the code.

In [21]:
# May be needed so that GitHub does not rate-limit the requests made by webdriver-manager
os.environ['GH_TOKEN'] = "<YOUR_GITHUB_TOKEN>"   # Token redacted here; the original one was set to expire on March 1.

# Initialize the Firefox driver using webdriver-manager to handle the geckodriver
# driver = webdriver.Firefox(executable_path=GeckoDriverManager().install()) # webdriver-manager automatically installs the version compatible with the local Firefox browser

In [15]:
# Used to find out where webdriver-manager installed the driver (so it is not reinstalled every time):

'''
# install() returns the path of the geckodriver binary; webdriver-manager keeps a
# local cache, so a driver that was already downloaded is normally reused.
driver_path = GeckoDriverManager().install()

# Initialize the Firefox driver using the stored driver path
driver = webdriver.Firefox(executable_path=driver_path)
'''

In [16]:
#driver_path  

# Retrieved after installing once with webdriver-manager, so the driver is not installed every time.
# The driver_path is: "'C:\\Users\\karel\\.wdm\\drivers\\geckodriver\\win64\\v0.34.0\\geckodriver.exe'"

'C:\\Users\\karel\\.wdm\\drivers\\geckodriver\\win64\\v0.34.0\\geckodriver.exe'

In [18]:
# So, to avoid reinstalling the driver, it is enough to note this path and pass it in.
# Selenium 4 takes the driver path through a Service object (the executable_path argument was deprecated in Selenium 4 and later removed):
from selenium.webdriver.firefox.service import Service as FirefoxService
driver = webdriver.Firefox(service=FirefoxService(executable_path='C:\\Users\\karel\\.wdm\\drivers\\geckodriver\\win64\\v0.34.0\\geckodriver.exe'))

driver.get('https://es.wikipedia.org/wiki/Carl_Weathers')

# Correcting the file path for Windows (if using local file)
#file_path = r'C:\Users\karel\IT Academy Data Science Notebooks\estructures_Dataframe\test_page_3.html'
# Convert to an appropriate URL scheme for local files
#file_url = 'file:///' + file_path.replace('\\', '/')
#print(file_url)
sleep(3)  # Give the driver time to open completely.


#driver.get(file_url)

In [24]:
rows_data = []

# Look for a table in the page, specifically one whose style attribute contains "margin: 1em 1em 1em 0;".
table = driver.find_element(By.CSS_SELECTOR, 'table[style*="margin: 1em 1em 1em 0;"]')

# Locate the elements with the HTML tag "tr" (table row):
rows = table.find_elements(By.TAG_NAME, 'tr')

# Extract the information from the rows:
for row in rows:
    # Cells of the row, i.e. "table data" (<td> tag):
    cells = row.find_elements(By.TAG_NAME, 'td')
    # Extract the text of each cell into row_info:
    row_info = [cell.text for cell in cells]
    if row_info:  # Skip rows that contain no <td> cells.
        rows_data.append(row_info)

# Convert the list of rows into a pandas DataFrame
df = pd.DataFrame(rows_data, columns=['Año', 'Título', 'Papel'])

# Clean up by closing the WebDriver
driver.quit()


# NOTE: running the cell a second time raised the error below (the data had already been downloaded by then).

MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=62508): Max retries exceeded with url: /session/343e7b3d-268c-46a2-af83-1dd8c062d672/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001ED7AF48B20>: Failed to establish a new connection: [WinError 10061] No se puede establecer una conexión ya que el equipo de destino denegó expresamente dicha conexión'))

The address in the error is 127.0.0.1, so the refused connection is to the local geckodriver session rather than a block by Wikipedia. The most likely cause is that driver.quit() had already closed that session, so re-running the cell without creating a new driver tries to talk to a process that is gone. In any case, the previous run of the cell had already collected the data:

In [25]:
df

| | Año | Título | Papel |
|---|---|---|---|
| 0 | Año | Título | Papel |
| 1 | 1975 | Kung Fu | Sam el Malo |
| 2 | 1975 | Friday Foster | Yarbro |
| 3 | 1976 | Rocky | Apollo Creed |
| 4 | 1977 | Encuentros cercanos del tercer tipo (Latinoamé... | Policía Militar |
| 5 | 1976 | Fuerza 10 de Navarone | Weaver |
| 6 | 1978 | Los abismos de las Bermudas | Eric |
| 7 | 1979 | Rocky II | Apollo Creed |
| 8 | 1981 | Death Hunt | |
| 9 | 1982 | Rocky III | Apollo Creed |
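The `MaxRetryError` shown earlier targets 127.0.0.1, which suggests a cell was re-run against a driver session that had already been quit. One way to make that harder to hit is to tie the driver's lifetime to a `with` block, so every run creates a fresh session and always closes it. The sketch below is an assumption-laden illustration: `managed_driver` and `FakeDriver` are hypothetical names, and the fake class only stands in for a Selenium driver so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as driver:
    inside = driver.closed   # False: the session is alive inside the block

print(inside, driver.closed)
```

With Selenium, the factory would be something like `lambda: webdriver.Firefox(...)`, and all `find_element` calls would sit inside the `with` block.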


Having done the exercise with two different tools, I can say that BeautifulSoup is easier to get working "out of the box", since it involves less technical setup and configuration. Selenium can cause more trouble in that respect (until one discovers webdriver-manager), because the webdriver has to match the version of the browser in use.

One more remark. The Wikipedia example chosen, the Carl Weathers page, was relatively simple, but I had found out first-hand beforehand that this is not always the case. "Wiki"-style sites, built from the contributions of a diverse community, do not always follow the same formatting and HTML structure conventions, which makes it hard to scrape all the information of interest consistently (this is why I discarded another page and finally settled on the Carl Weathers one, whose elements were free of "chaos").
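When table markup is inconsistent, a small normalization step before building the DataFrame can absorb some of the irregularities. The sketch below is one possible approach, not part of the original notebook: `normalize_rows` is a hypothetical helper that drops rows identical to the header (such as row 0 in the table above) and pads or trims each row to the expected width. The short row in the sample is invented to show the padding.

```python
def normalize_rows(rows, header):
    """Drop copies of the header row and pad or trim every row to len(header) cells."""
    width = len(header)
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if cells == list(header):
            continue  # the header was captured as a data row
        cells = (cells + [""] * width)[:width]  # pad short rows, trim long ones
        cleaned.append(cells)
    return cleaned


header = ["Año", "Título", "Papel"]
raw_rows = [
    ["Año", "Título", "Papel"],          # header captured as data
    ["1976", "Rocky", "Apollo Creed"],
    ["1981", "Death Hunt"],              # hypothetical short row: role cell missing
]

print(normalize_rows(raw_rows, header))
```

The cleaned list can then be passed to `pd.DataFrame(..., columns=header)` without the column-count mismatch that a short row would otherwise cause.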

# Exercise 2

Document the dataset generated above in a Word file, with information in the style of Kaggle data files (for example this Kaggle dataset: https://www.kaggle.com/datasets/vivovinco/20212022-football-team-stats).

In [27]:
# See the exercise repository.

# Exercise 3

Choose a website and scrape it first with Selenium and then with Scrapy.

### Selected website: Ground News

I will work with the Ground News website, which collects and classifies news from a variety of media outlets, and which I plan to use for the final project:

In [29]:
from selenium.webdriver.firefox.service import Service as FirefoxService  # Selenium 4 passes the driver path through a Service object
driver = webdriver.Firefox(service=FirefoxService(executable_path='C:\\Users\\karel\\.wdm\\drivers\\geckodriver\\win64\\v0.34.0\\geckodriver.exe'))
#driver.get(url)

url = "https://ground.news/"

# What follows is a way to run tests against a local file and avoid being blocked while learning how to use the tool.
# Only used for tests, but noted here because I will reuse this procedure.
# file_path = r'C:\Users\karel\Desktop\Webs\Breaking News Headlines and Media Bias Ground News.htm'
# file_url = 'file:///' + file_path.replace('\\', '/')
# print(file_url)
# driver.get(file_url)


sleep(3)  # Give the driver time to open completely.


driver.get(url)
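The commented-out local-file trick builds the `file:///` URL by swapping backslashes for slashes. File names with spaces or other special characters (like the saved page above) are safer if they are also percent-encoded. A minimal sketch of that extension follows; `local_file_url` is a hypothetical helper and the path is invented for illustration.

```python
from urllib.parse import quote


def local_file_url(windows_path):
    """Turn an absolute Windows path into a file:/// URL with special characters escaped."""
    forward = windows_path.replace("\\", "/")
    # Keep "/" and the drive-letter ":" readable; escape spaces and the rest.
    return "file:///" + quote(forward, safe="/:")


# Hypothetical path, used only for illustration.
print(local_file_url(r"C:\Users\someone\Desktop\saved page.htm"))
```

The resulting string can be passed to `driver.get(...)` in the same way as the hand-built `file_url`.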

In [30]:
# The "Manage cookies" button has to be clicked and (preferably) answered.
# We rely on the button text ("Manage cookies"), which should be unique, i.e. it should not appear elsewhere on the page.

# Wait until the button with the text "Manage cookies" appears and is clickable.
manage_cookies_button = WebDriverWait(driver, 10).until(   # 10 s maximum wait.
    EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Manage cookies')]"))
)
manage_cookies_button.click()

In [31]:
# Click the "Save & reload" button (this keeps only the technical cookies)
save_cookies_button = WebDriverWait(driver, 10).until(   # 10 s maximum wait.
    EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Save & reload')]"))  # Wait until the button with the text "Save & reload" is clickable.
)
save_cookies_button.click()
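The two cells above rely on explicit waits: the condition is checked repeatedly until it holds or the timeout expires. The idea can be sketched in plain Python as follows. This is an illustration of the polling pattern, not Selenium's actual implementation; `wait_until` and `button_ready` are hypothetical names, and the toy condition simply becomes true on its third call.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Toy condition: becomes true on the third call, like a button that renders late.
calls = {"n": 0}


def button_ready():
    calls["n"] += 1
    return "button" if calls["n"] >= 3 else None


print(wait_until(button_ready, timeout=2.0, poll_interval=0.01))
print(calls["n"])
```

Compared with a fixed `sleep(3)`, this pattern returns as soon as the element is ready and fails loudly when it never appears.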


In [15]:
def cerrar_popup():  # Closes the pop-up that appears roughly every third click on the "More stories" button.
    try:
        wait_time = random.uniform(2.5, 4.5)
        close_button = WebDriverWait(driver, wait_time).until(
            EC.element_to_be_clickable((By.CLASS_NAME, "react-responsive-modal-closeButton"))
        )
        close_button.click()
        print("Pop-up closed.")
    except TimeoutException:
        print("No pop-up appeared.")

def click_more_stories():   # Keeps clicking the "More stories" button so that more news items are shown.
    while True:
        try:
            more_stories_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'More stories')]"))
            )
            wait_time = random.uniform(0.9, 3.4)
            time.sleep(wait_time)
            more_stories_button.click()
            cerrar_popup()   # Close the pop-up if it shows up after this click.
        except TimeoutException:
            print("No more 'More stories' button found or page took too long to load.")
            break  # Exiting the loop if "More stories" button is no longer found


# Example usage
click_more_stories()


Pop-up closed.
No pop-up appeared.
Pop-up closed.
No pop-up appeared.
Pop-up closed.
No pop-up appeared.
No more 'More stories' button found or page took too long to load.


In [32]:
# Main headline
highlighted_title_element = driver.find_elements(By.CSS_SELECTOR, 'a.relative.flex.flex-col.cursor-pointer h2')

# Other news items
general_titles_elements = driver.find_elements(By.CSS_SELECTOR, 'h4[class*="font-bold"]')

# Combine both
all_titles_elements = highlighted_title_element + general_titles_elements

# Converting to a set and back to a list removes duplicates (order is not preserved); strip() removes surrounding whitespace.
news_titles = list(set([title.text.strip() for title in all_titles_elements if title.text.strip()]))

# Results
for title in news_titles:
    print(title)

# Close the driver
driver.quit()

Trump does not have presidential immunity in January 6 case, federal appeals court rules
Georgia says it seized Russia-bound cargo of explosives sent from Ukraine
Orbán boycotts parliament session called to ratify Swedish Nato bid
Spanish league to denounce fan who touched player’s backside during game
Journalists say Ukrainian security service spied on them
Taylor Swift threatens legal action against student who tracks her jet
Home Secretary ‘to look at’ claims forty Bibby Stockholm migrants are converting to Christianity
How climate change contributes to wildfires like Chile's
Cannabis use linked to anxiety diagnoses, worsened anxiety disorders: Ontario study
King Charles III diagnosed with cancer, Buckingham Palace says
NASA discovers 'super-Earth' 137-light years away in a habitable zone that could sustain life
US would redirect aid from UNRWA to other agencies under Senate bill - State Dept
Arab American leaders demand apology, retraction after Wall Street Journal 'jihad' column
S

In [33]:
len(news_titles)  # We collected the titles of 33 news items, with no repeats.

33
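Deduplicating through `set` discards the order in which the headlines appear on the page. If that order matters (for instance, to keep the main headline first), `dict.fromkeys` removes repeats while preserving first-seen order. A minimal sketch, where `unique_in_order` is a hypothetical helper and the sample reuses two titles from the output above:

```python
def unique_in_order(titles):
    """Strip whitespace, drop empty strings, and remove repeats keeping first-seen order."""
    cleaned = (title.strip() for title in titles)
    return list(dict.fromkeys(title for title in cleaned if title))


sample = [
    "King Charles III diagnosed with cancer, Buckingham Palace says",
    "  Journalists say Ukrainian security service spied on them ",
    "",
    "King Charles III diagnosed with cancer, Buckingham Palace says",
]

print(unique_in_order(sample))
```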

In [43]:
news_titles

['Trump does not have presidential immunity in January 6 case, federal appeals court rules',
 'Georgia says it seized Russia-bound cargo of explosives sent from Ukraine',
 'Orbán boycotts parliament session called to ratify Swedish Nato bid',
 'Spanish league to denounce fan who touched player’s backside during game',
 'Journalists say Ukrainian security service spied on them',
 'Taylor Swift threatens legal action against student who tracks her jet',
 'Home Secretary ‘to look at’ claims forty Bibby Stockholm migrants are converting to Christianity',
 "How climate change contributes to wildfires like Chile's",
 'Cannabis use linked to anxiety diagnoses, worsened anxiety disorders: Ontario study',
 'King Charles III diagnosed with cancer, Buckingham Palace says',
 "NASA discovers 'super-Earth' 137-light years away in a habitable zone that could sustain life",
 'US would redirect aid from UNRWA to other agencies under Senate bill - State Dept',
 "Arab American leaders demand apology, ret

In [47]:
# Create a DataFrame with the 'titles' column
df = pd.DataFrame({'title': news_titles})
df.head()

| | title |
|---|---|
| 0 | Trump does not have presidential immunity in J... |
| 1 | Georgia says it seized Russia-bound cargo of e... |
| 2 | Orbán boycotts parliament session called to ra... |
| 3 | Spanish league to denounce fan who touched pla... |
| 4 | Journalists say Ukrainian security service spi... |


This time the scraping was slightly more complex, in that we told the driver to interact with the site and perform some actions (clicking several buttons and waiting for elements to appear on the page), which let us download some extra data. The project will add complexity, for example extracting data on the political "alignment" of the outlets (left, center or right) and on which kinds of news they do or do not cover, that is, the "blind spots" of outlets on either side of the spectrum.

Now for the scraping with Scrapy. It is done outside Jupyter, but I describe the process here. After installing Scrapy, we open a terminal and run the following command:

In [None]:
# (py3.8.13) C:\Users\karel\IT Academy Data Science Notebooks\estructures_Dataframe>scrapy startproject ground_news_scrapy
'''
New Scrapy project 'ground_news_scrapy', using template directory 'C:\Users\karel\anaconda3\envs\py3.8.13\Lib\site-packages\scrapy\templates\project', created in:
    C:\Users\karel\IT Academy Data Science Notebooks\estructures_Dataframe\ground_news_scrapy

You can start your first spider with:
    cd ground_news_scrapy
    scrapy genspider example example.com
'''

In the location where the command is run, this creates a set of folders with the following hierarchy, along with the configuration and support files Scrapy needs to run:

ground_news_scrapy > ground_news_scrapy > spiders

Inside the "spiders" folder we have to create a Python script that will be the "spider" used for scraping. This is its content:

In [18]:
# Content of the spider "ground_news.py", located in the "spiders" folder of the "ground_news_scrapy" project:

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'ground_news'
    allowed_domains = ['ground.news']
    start_urls = ['https://ground.news/']

    def parse(self, response):
        # Selecting general news titles
        for title in response.css('h4[class*="font-bold"]::text').getall():
            yield {'title': title.strip()}

        # Selecting the highlighted news title
        for title in response.css('a.relative.flex.flex-col.cursor-pointer h2::text').getall():
            yield {'title': title.strip()}


Among the files in the project folders, some have settings worth reviewing. In particular, we want the spider to behave gently so that it is not blocked. To that end, we edit the "settings.py" file in the intermediate "ground_news_scrapy" folder.

The "settings.py" content we settled on is the following:

In [None]:
# Scrapy settings.py content for ground_news_scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "ground_news_scrapy"  # Name of the bot (the Scrapy project); the spider itself is called "ground_news"

SPIDER_MODULES = ["ground_news_scrapy.spiders"]
NEWSPIDER_MODULE = "ground_news_scrapy.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "ground_news_scrapy (+http://www.yourdomain.com)"

# Obey robots.txt rules
# Following the site's "robots.txt" directives is important, appropriate and good practice. If a site does not want bots, comply.
ROBOTSTXT_OBEY = True   

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "ground_news_scrapy.middlewares.GroundNewsScrapySpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "ground_news_scrapy.middlewares.GroundNewsScrapyDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "ground_news_scrapy.pipelines.GroundNewsScrapyPipeline": 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
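To make concrete what obeying robots.txt means, the sketch below checks two URLs against a set of rules using the standard library's `urllib.robotparser`. It only illustrates the idea: the rules in `SAMPLE_ROBOTS` are hypothetical (the real file at https://ground.news/robots.txt may differ), and Scrapy uses its own robots.txt handling, which may differ in details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())  # parse rules from text, no network access

# The user-agent string here is just the bot name from settings.py.
print(parser.can_fetch("ground_news_scrapy", "https://ground.news/"))
print(parser.can_fetch("ground_news_scrapy", "https://ground.news/private/page"))
```

With `ROBOTSTXT_OBEY = True`, Scrapy performs an equivalent check before each request and skips URLs the site disallows, which is why the crawl log starts by fetching `/robots.txt`.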

Finally, to run Scrapy we have to move into the top-level ground_news_scrapy folder (the outer one). The command that produces a .json data file as output is the following:

scrapy crawl ground_news -o output.json

In [None]:
(py3.8.13) C:\Users\karel\IT Academy Data Science Notebooks\estructures_Dataframe\ground_news_scrapy>scrapy crawl ground_news -o output06_02_2024.json
2024-02-06 18:40:27 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: ground_news_scrapy)
2024-02-06 18:40:27 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.8.13 (default, Oct 19 2022, 22:38:03) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 1.1.1w  11 Sep 2023), cryptography 41.0.3, Platform Windows-10-10.0.19045-SP0
2024-02-06 18:40:27 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'BOT_NAME': 'ground_news_scrapy',
 'CONCURRENT_REQUESTS_PER_DOMAIN': 16,
 'DOWNLOAD_DELAY': 3,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'ground_news_scrapy.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['ground_news_scrapy.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-02-06 18:40:27 [asyncio] DEBUG: Using selector: SelectSelector
2024-02-06 18:40:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-02-06 18:40:27 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-02-06 18:40:27 [scrapy.extensions.telnet] INFO: Telnet Password: fb9552ba259c0318
2024-02-06 18:40:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2024-02-06 18:40:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-02-06 18:40:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-02-06 18:40:27 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-02-06 18:40:27 [scrapy.core.engine] INFO: Spider opened
2024-02-06 18:40:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-02-06 18:40:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-02-06 18:40:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ground.news/robots.txt> (referer: None)
2024-02-06 18:40:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ground.news/> (referer: None)
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'EU calls for 90 percent emissions cut by 2040'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'EU scraps pesticide proposals in another concession to protesting farmers'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Georgia says it seized Russia-bound cargo of explosives sent from Ukraine'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'King Charles III diagnosed with cancer, Buckingham Palace says'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Orbán boycotts parliament session called to ratify Swedish Nato bid'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Cannabis use linked to anxiety diagnoses, worsened anxiety disorders: Ontario study'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': "Spanish farmers blockade roads, joining EU peers' protests"}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'AI helps scholars read scroll buried when Vesuvius erupted in AD79'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Taylor Swift threatens legal action against student who tracks her jet'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Iranian agents suspected of targeting Jews arrested in Sweden, deported'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Spanish league to denounce fan who touched player’s backside during game'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Facebook and Instagram to label all images on its platforms created by AI, Meta says'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': "UN nuclear chief says security is still fragile at Ukraine's Russian-occupied nuclear power plant"}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Journalists say Ukrainian security service spied on them'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'EU calls for 90 percent emissions cut by 2040'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'EU scraps pesticide proposals in another concession to protesting farmers'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Georgia says it seized Russia-bound cargo of explosives sent from Ukraine'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'King Charles III diagnosed with cancer, Buckingham Palace says'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Orbán boycotts parliament session called to ratify Swedish Nato bid'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Argentina president Javier Milei says plans to move embassy to Jerusalem'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'California could legalize psychedelic therapy after rejecting ‘magic mushroom’ decriminalization'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'AI helps scholars read scroll buried when Vesuvius erupted in AD79'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Taylor Swift threatens legal action against student who tracks her jet'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Iranian agents suspected of targeting Jews arrested in Sweden, deported'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Spanish league to denounce fan who touched player’s backside during game'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Spanish league to denounce fan who touched player’s backside during game'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Facebook and Instagram to label all images on its platforms created by AI, Meta says'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': "UN nuclear chief says security is still fragile at Ukraine's Russian-occupied nuclear power plant"}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Journalists say Ukrainian security service spied on them'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Russian court arrests fiction writer in absentia on charges of incitement to terrorism'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Facebook and Instagram to label all images on its platforms created by AI, Meta says'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': "UN nuclear chief says security is still fragile at Ukraine's Russian-occupied nuclear power plant"}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Journalists say Ukrainian security service spied on them'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Russian court arrests fiction writer in absentia on charges of incitement to terrorism'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'China economy overtaking U.S. is increasingly unlikely: ex-IMF official'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Russian court arrests fiction writer in absentia on charges of incitement to terrorism'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'China economy overtaking U.S. is increasingly unlikely: ex-IMF official'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Turkey mourns tens of thousands dead, surrounded by the ruins of last year’s earthquake'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'UBS to Restart Buybacks This Year as It Integrates Credit Suisse'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': "NASA discovers 'super-Earth' 137-light years away in a habitable zone that could sustain life"}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'China economy overtaking U.S. is increasingly unlikely: ex-IMF official'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Turkey mourns tens of thousands dead, surrounded by the ruins of last year’s earthquake'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'UBS to Restart Buybacks This Year as It Integrates Credit Suisse'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': "NASA discovers 'super-Earth' 137-light years away in a habitable zone that could sustain life"}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Germany doubles its commitment of troops to the NATO-led peacekeepers in Kosovo'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Turkey mourns tens of thousands dead, surrounded by the ruins of last year’s earthquake'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'UBS to Restart Buybacks This Year as It Integrates Credit Suisse'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': "NASA discovers 'super-Earth' 137-light years away in a habitable zone that could sustain life"}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Germany doubles its commitment of troops to the NATO-led peacekeepers in Kosovo'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Germany doubles its commitment of troops to the NATO-led peacekeepers in Kosovo'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': "How climate change contributes to wildfires like Chile's"}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Putin will visit Turkey soon to discuss new Black Sea grain export ideas for Ukraine, minister says'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Scientists propose a Category 6 as hurricanes gain in intensity with climate change'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Palantir stock jumps 12% on revenue beat'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Trump does not have presidential immunity in January 6 case, federal appeals court rules'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Trump does not have presidential immunity in January 6 case, federal appeals court rules'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Trump does not have presidential immunity in January 6 case, federal appeals court rules'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Cannabis use linked to anxiety diagnoses, worsened anxiety disorders: Ontario study'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Cannabis use linked to anxiety diagnoses, worsened anxiety disorders: Ontario study'}
2024-02-06 18:40:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ground.news/>
{'title': 'Cannabis use linked to anxiety diagnoses, worsened anxiety disorders: Ontario study'}
2024-02-06 18:40:35 [scrapy.core.engine] INFO: Closing spider (finished)
2024-02-06 18:40:35 [scrapy.extensions.feedexport] INFO: Stored json feed (60 items) in: output06_02_2024.json
2024-02-06 18:40:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 452,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 80805,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 7.434112,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 2, 6, 17, 40, 35, 715840),
 'httpcompression/response_bytes': 942904,
 'httpcompression/response_count': 1,
 'item_scraped_count': 60,
 'log_count/DEBUG': 65,
 'log_count/INFO': 11,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 2, 6, 17, 40, 28, 281728)}
2024-02-06 18:40:35 [scrapy.core.engine] INFO: Spider closed (finished)

In [38]:
df2 = pd.read_json('./ground_news_scrapy/output06_02_2024.json').drop_duplicates()

In [39]:
df2

Unnamed: 0,title
0,EU calls for 90 percent emissions cut by 2040
1,EU scraps pesticide proposals in another conce...
2,Georgia says it seized Russia-bound cargo of e...
3,"King Charles III diagnosed with cancer, Buckin..."
4,Orbán boycotts parliament session called to ra...
5,"Cannabis use linked to anxiety diagnoses, wors..."
6,"Spanish farmers blockade roads, joining EU pee..."
7,AI helps scholars read scroll buried when Vesu...
8,Taylor Swift threatens legal action against st...
9,Iranian agents suspected of targeting Jews arr...


In [40]:
len(df2)

27
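
The Scrapy log reports 60 stored items, while `len(df2)` is 27 after `drop_duplicates()`, so more than half of the scraped rows were repeats. A minimal sketch of how such repeats could be inspected, using invented toy titles rather than the real feed (the `raw` frame and its contents are assumptions for illustration only):

```python
import pandas as pd

# Toy stand-in for the Scrapy JSON feed; these titles are invented.
raw = pd.DataFrame({"title": ["A", "B", "A", "C", "A", "B"]})

# How many times the spider emitted each title.
repeats = raw["title"].value_counts()

# Rows that drop_duplicates() would remove = total rows - distinct titles.
n_total = len(raw)
n_unique = raw["title"].nunique()
n_dropped = n_total - n_unique

print(repeats)
print(f"{n_total} rows, {n_unique} unique, {n_dropped} dropped")
```

Applied to the unfiltered output of `pd.read_json(...)`, the same calls would show which headlines the selector matches more than once, which may hint at a selector that is broader than intended.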

In [49]:
# Find common elements
common = pd.merge(df, df2, on='title')
number_common = len(common)

# Find unique elements
merged_df = pd.merge(df, df2, on='title', how='outer', indicator=True)
unique_to_df = merged_df[merged_df['_merge'] == 'left_only']
unique_to_df2 = merged_df[merged_df['_merge'] == 'right_only']
number_unique_to_df = len(unique_to_df)
number_unique_to_df2 = len(unique_to_df2)

print(f'Number of common elements: {number_common}')
print(f'Number of elements unique to df: {number_unique_to_df}')
print(f'Number of elements unique to df2: {number_unique_to_df2}')

Number of common elements: 26
Number of elements unique to df: 7
Number of elements unique to df2: 1


Scrapy returned fewer results (df2, 27 unique titles) than the Selenium run (df). Scrapy has no native way to click buttons that reveal additional content; could that explain the difference? Presumably, depending on how the site works (for example, dynamically loaded content), one approach may work better than the other, and in practice it is worth checking which one best fits our needs (and our means).

Apparently, for some websites, one way to emulate the effect of the click (in the sense of releasing new content) is to look at the requests listed under the "XHR" filter of the browser's developer tools (Network panel) when the "More stories" button is clicked. In this case, however, that approach was not applicable: the new content seemed to be loaded by some other mechanism, not via XHR/AJAX.

I also do not know what could cause Scrapy to find a record that the webdriver does not (there is one such case).
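
One way to investigate that case is to list the titles on each side of the comparison after normalizing them, since small differences such as a typographic apostrophe or extra whitespace would make an exact `merge` on `title` treat the same headline as two different records. The sketch below is only a hypothesis check, not a diagnosis: `normalize_title`, the two toy frames and their titles are invented for illustration.

```python
import unicodedata

import pandas as pd


def normalize_title(title: str) -> str:
    """Hypothetical helper: make titles comparable despite cosmetic differences."""
    text = unicodedata.normalize("NFKC", title)
    # NFKC leaves curly quotes untouched, so map them to a plain apostrophe.
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    # Collapse runs of whitespace and ignore case.
    return " ".join(text.split()).casefold()


# Toy stand-ins for df (Selenium) and df2 (Scrapy); titles are invented.
selenium_df = pd.DataFrame(
    {"title": ["Player\u2019s goal  stuns fans", "Markets rally"]}
)
scrapy_df = pd.DataFrame(
    {"title": ["Player's goal stuns fans", "New study published"]}
)

for frame in (selenium_df, scrapy_df):
    frame["key"] = frame["title"].map(normalize_title)

# Outer merge on the normalized key; _merge tells which side each row came from.
merged = pd.merge(
    selenium_df,
    scrapy_df,
    on="key",
    how="outer",
    suffixes=("_selenium", "_scrapy"),
    indicator=True,
)

only_scrapy = merged.loc[merged["_merge"] == "right_only", "title_scrapy"].tolist()
only_selenium = merged.loc[merged["_merge"] == "left_only", "title_selenium"].tolist()

print("Only in Scrapy:", only_scrapy)
print("Only in Selenium:", only_selenium)
```

If the record unique to Scrapy disappears after normalization, the mismatch was cosmetic; if it remains, the headline was genuinely absent from the page state that the webdriver captured.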

In conclusion, in this particular case Selenium WebDriver was more effective, in a relatively simple and controlled way, whereas with Scrapy, at least with a basic approach and without relying on additional external tools, we lose some information.