# Component scraper: Processors

This notebook scrapes [TechPowerUp](https://www.techpowerup.com/cpu-specs/) to extract information about processors (CPUs).

### 👨‍💻 Project authors

* [Alejandro Barrionuevo Rosado](https://github.com/Alejandro-BR)
* [Alvaro López Guerrero](https://github.com/Alvalogue72)
* [Andrei Munteanu Popa](https://github.com/andu8705)

Vocational Training Master's (Máster de FP) in Artificial Intelligence and Big Data - CPIFP Alan Turing - `2025/2026 academic year`

## Imports

The cell below imports the libraries used throughout the notebook. The URLs and base parameters for the CPU pages are defined in the cells that use them.

In [None]:
# Standard library
import random
import time

# Third-party
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

## Obtaining CPU links

This section loads the main CPU listing page with Selenium and collects the link to each processor's detail page.


In [3]:
url = "https://www.techpowerup.com/cpu-specs/"

# Selenium drives a real Chrome instance to load the listing page.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
try:
    driver.get(url)
    time.sleep(3)  # fixed pause to give the page time to load
    html = driver.page_source
finally:
    driver.quit()  # close the browser even if loading fails

soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table tbody tr")

# Take the first link in each table row and make it absolute.
cpu_links = []
for row in rows:
    a = row.select_one("td a[href]")
    if a:
        cpu_links.append("https://www.techpowerup.com" + a["href"])

print(f"CPUs encontradas: {len(cpu_links)}")
print(cpu_links[:10])


CPUs encontradas: 130
['https://www.techpowerup.com/cpu-specs/core-ultra-9-386h.c4305', 'https://www.techpowerup.com/cpu-specs/core-ultra-7-366h.c4306', 'https://www.techpowerup.com/cpu-specs/core-ultra-7-356h.c4307', 'https://www.techpowerup.com/cpu-specs/core-ultra-5-336h.c4309', 'https://www.techpowerup.com/cpu-specs/ryzen-ai-max-388.c4312', 'https://www.techpowerup.com/cpu-specs/ryzen-ai-max-392.c4311', 'https://www.techpowerup.com/cpu-specs/ryzen-ai-max-392.c4311', 'https://www.techpowerup.com/cpu-specs/core-ultra-5-336h.c4309', 'https://www.techpowerup.com/cpu-specs/core-ultra-5-336h.c4309', 'https://www.techpowerup.com/cpu-specs/core-ultra-5-338h.c4308']
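
The first ten links printed above already contain repeated URLs, so some pages would be requested more than once. A minimal sketch of one way to resolve the links and drop duplicates while keeping their original order follows. The helper name `normalize_links` is hypothetical and not part of the notebook, and the sample hrefs are the path portions of URLs taken from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.techpowerup.com"

def normalize_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict keys preserve insertion order, so this deduplicates without reordering.
    return list(dict.fromkeys(absolute))

# Paths copied from the printed output above (hypothetical input list).
sample_hrefs = [
    "/cpu-specs/core-ultra-5-336h.c4309",
    "/cpu-specs/ryzen-ai-max-392.c4311",
    "/cpu-specs/ryzen-ai-max-392.c4311",
    "/cpu-specs/core-ultra-5-336h.c4309",
]
unique_links = normalize_links(sample_hrefs)
print(len(unique_links))
```

Applied to `cpu_links`, a step like this would likely reduce the count below the 130 reported above, since that figure includes the repeats.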


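## Extracting data for each CPU

`get_cpu_details` requests each CPU page and extracts its name, labelled specification boxes, and chip image. It waits a random 3 to 7 seconds before every request, pauses 20 to 40 seconds on an HTTP 429 response, retries up to 8 times, and returns `None` if every attempt fails.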


In [None]:
BASE_URL = "https://www.techpowerup.com"
HEADERS = {"User-Agent": "Mozilla/5.0"}

def get_cpu_details(url, retries=8):
    """Fetch one CPU page and return its data as a dict, or None if every attempt fails."""
    for attempt in range(retries):
        try:
            # Random pause before each request to avoid hammering the server.
            delay = random.uniform(3.0, 7.0)
            print(f"   Esperando {delay:.2f}s antes de pedir {url}")
            time.sleep(delay)

            response = requests.get(url, headers=HEADERS, timeout=15)

            # Rate limited: wait longer, then use up one attempt and retry.
            if response.status_code == 429:
                wait = random.uniform(20, 40)
                print(f"   ⚠️ 429 Too Many Requests. Esperando {wait:.1f}s…")
                time.sleep(wait)
                continue

            response.raise_for_status()

            soup = BeautifulSoup(response.text, "html.parser")

            # A page without an <h1> is treated as empty or blocked.
            title = soup.find("h1")
            if not title:
                print("   ⚠️ Página vacía o bloqueada. Reintentando…")
                time.sleep(5)
                continue

            data = {"URL": url, "Name": title.text.strip()}

            # Each spec box is stored under the label in its title attribute.
            for box in soup.select(".specs-box .box"):
                label = box.get("title", "").strip()
                value = box.text.strip()
                if label:
                    data[label] = value

            chip_img = soup.select_one("a img.chip-image--img")
            if chip_img:
                src = chip_img.get("src")
                # Skip images without a src instead of raising on None.
                if src:
                    if src.startswith("/"):
                        src = BASE_URL + src
                    data["Chip Image"] = src

            return data

        except Exception as e:
            # Any other failure: log it and wait a little longer each time.
            print(f"   Reintento {attempt+1}/{retries} por error: {e}")
            time.sleep(5 + attempt * 3)

    print(f"   ❌ Falló definitivamente: {url}")
    return None
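
The function above waits a random 20 to 40 seconds after a 429 response. As an alternative, the sketch below computes the wait from the standard HTTP `Retry-After` header when it holds a number of seconds, and otherwise falls back to exponential backoff with jitter. This is not the notebook's method: `backoff_delay` and its `base` and `cap` parameters are hypothetical, and whether TechPowerUp actually sends `Retry-After` is an assumption that has not been checked.

```python
import random

def backoff_delay(attempt, retry_after=None, base=5.0, cap=120.0):
    """Return the number of seconds to wait before retry `attempt` (0-based)."""
    # Prefer the server's Retry-After value when it is a plain number of seconds.
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall back to backoff.
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

print(backoff_delay(0, "30"))  # header given in seconds
print(backoff_delay(3))        # no header: random wait of at most 40 seconds
```

Inside the 429 branch this could be called as `backoff_delay(attempt, response.headers.get("Retry-After"))` in place of the fixed 20 to 40 second range.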


## Main scraping loop
This cell iterates over the list of CPU links and calls the scraping function for each one. The results are accumulated in a single list.

In [5]:
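cpu_data = []

for i, url in enumerate(cpu_links):
    try:
        details = get_cpu_details(url)
        if details is None:
            # get_cpu_details returns None once every retry has failed.
            print(f"[{i+1}] Error en {url}: sin datos")
            continue
        cpu_data.append(details)
        print(f"[{i+1}] OK: {details['Name']}")
        time.sleep(0.5)
    except Exception as e:
        print(f"[{i+1}] Error en {url}: {e}")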


   Esperando 6.59s antes de pedir https://www.techpowerup.com/cpu-specs/core-ultra-9-386h.c4305
[1] OK: Intel Core Ultra 9 386H
   Esperando 4.69s antes de pedir https://www.techpowerup.com/cpu-specs/core-ultra-7-366h.c4306
[2] OK: Intel Core Ultra 7 366H
   Esperando 6.22s antes de pedir https://www.techpowerup.com/cpu-specs/core-ultra-7-356h.c4307
   ⚠️ 429 Too Many Requests. Esperando 38.5s…
   Esperando 5.48s antes de pedir https://www.techpowerup.com/cpu-specs/core-ultra-7-356h.c4307
[3] OK: Intel Core Ultra 7 356H
   Esperando 3.38s antes de pedir https://www.techpowerup.com/cpu-specs/core-ultra-5-336h.c4309
   ⚠️ 429 Too Many Requests. Esperando 28.1s…
   Esperando 6.11s antes de pedir https://www.techpowerup.com/cpu-specs/core-ultra-5-336h.c4309
[4] OK: Intel Core Ultra 5 336H
   Esperando 6.71s antes de pedir https://www.techpowerup.com/cpu-specs/ryzen-ai-max-388.c4312
[5] OK: AMD Ryzen AI Max+ 388
   Esperando 3.12s antes de pedir https://www.techpowerup.com/cpu-s
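
Because a run like this takes several seconds per page, it can help to keep track of which URLs failed so that only those are retried later. The sketch below shows one possible shape for that. `scrape_all` and `fake_fetch` are hypothetical names: `fake_fetch` stands in for `get_cpu_details` so the example runs without network access, and the `example.com` URLs are placeholders.

```python
def scrape_all(urls, fetch):
    """Call fetch(url) for every URL and return (records, failed_urls)."""
    records, failed = [], []
    for url in urls:
        try:
            details = fetch(url)
        except Exception:
            details = None  # treat an unexpected error like a failed fetch
        if details is None:
            failed.append(url)
        else:
            records.append(details)
    return records, failed

def fake_fetch(url):
    """Stand-in fetcher: fails for URLs ending in 'bad', succeeds otherwise."""
    if url.endswith("bad"):
        return None
    return {"URL": url, "Name": url.rsplit("/", 1)[-1]}

records, failed = scrape_all(
    ["https://example.com/a", "https://example.com/bad"], fake_fetch
)
print(len(records), "ok,", len(failed), "failed")
```

A second pass such as `scrape_all(failed, get_cpu_details)` could then revisit only the pages that did not come back the first time.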

## Saving the extracted data
This cell builds a DataFrame from the scraped records and saves it to CSV and JSON files for later analysis or use.

In [None]:
# Drop entries for CPUs that could not be scraped (None)
cpu_data_clean = [item for item in cpu_data if item is not None]

df_components = pd.DataFrame(cpu_data_clean)
df_components.info()  # info() prints its summary itself and returns None

df_components.to_csv('productos_cpu.csv', index=False, encoding='utf-8-sig')
df_components.to_json('productos_cpu.json', orient='records', force_ascii=False, indent=4)


<class 'pandas.DataFrame'>
RangeIndex: 127 entries, 0 to 126
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   URL         127 non-null    str  
 1   Name        127 non-null    str  
 2   Chip Image  127 non-null    str  
dtypes: str(3)
memory usage: 3.1 KB
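
The summary above lists only three columns (`URL`, `Name`, and `Chip Image`), so no labelled specification fields ended up in the records. A quick way to see which fields were captured, and in how many records, is to count keys before building the DataFrame. This is a minimal sketch: `field_coverage` is a hypothetical helper and the sample records are made up for illustration.

```python
from collections import Counter

def field_coverage(records):
    """Count how many records contain each key."""
    return Counter(key for record in records for key in record)

# Made-up records shaped like the dicts returned by the scraping function.
sample = [
    {"URL": "u1", "Name": "CPU A", "Chip Image": "img1"},
    {"URL": "u2", "Name": "CPU B"},
]
coverage = field_coverage(sample)
for field, count in coverage.items():
    print(f"{field}: {count}/{len(sample)}")
```

Running `field_coverage(cpu_data_clean)` right after the loop would show early whether the specification selector matched anything.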
