# Web Scraping: wine project

Created: 2024/04/15

Last updated: 2024/04/15, by Soraya Alvarez Codesal

This Jupyter Notebook focuses on `web scraping wine data` from the website https://www.bodeboca.com/. The goal is to build a DataFrame that is as clean as possible and covers as many Spanish wines as possible, for use in the final Bootcamp project.

- Note that this web scraping code was written around March/April 2024. If it fails when run at a later date, the cause may be changes to the website.

The data extracted from bodeboca.com for each wine is listed below (fields not found on the page are filled with a missing-value marker: the string `"NA"` in the scraping code, written as `NaN` in the final CSV):
- titulo: the `titulo` column contains the name of the wine as shown on bodeboca.com
- link: the `link` column contains the bodeboca.com link for that wine
- precio: the `precio` column contains the price of the wine in euros on bodeboca.com as of April 2024
- rating: the `rating` column contains the wine's rating, given either by bodeboca.com users or by users on Google or other wine sites
- volumen_botella: the `volumen_botella` column contains the volume of each bottle of the wine, normally in centilitres (cl)
- bodega: the `bodega` column contains the name of the winery that produces the wine
- tipo: the `tipo` column contains the category the wine belongs to. The possible values are: 'tinto', 'red vermouth', 'blanco', 'espumoso', 'amontillado', 'oloroso', 'tinto reserva', 'blanco fermentado en barrica', 'white vermouth', 'manzanilla', 'dulce px', 'palo cortado', 'palo cortado vors', 'fino', 'rosado', 'otro(s)', 'tinto joven', 'tinto crianza', 'amontillado vors', 'oloroso vors', 'aromatised wine', 'blanco naturalmente dulce', 'tinto dulce', 'blanco dulce', 'orange wine', 'tinto gran reserva', 'cava', 'sweet moscatel', 'oloroso dulce', 'dulce px vors', 'frizzante', 'rueda dorado', 'vermut dorado', 'vermouth', 'dulce', 'rancio'
- grado: the `grado` column contains the alcohol content of the wine
- anada: the `anada` column contains the year of the wine, i.e. its vintage
- produccion: the `produccion` column contains information about how many bottles of this wine are produced
- subzona: the `subzona` column contains the region, more specific than the wine's `origen`. Some wines do not have this information.
- variedad: the `variedad` column contains the grape varieties and their percentages for each wine
- origen: the `origen` column contains the place the wine comes from
- vista: the `vista` column describes the visual characteristics of the wine
- nariz: the `nariz` column describes the aromatic characteristics of the wine
- boca: the `boca` column describes the taste characteristics of the wine
- temp_servir: the `temp_servir` column contains the recommended serving temperature for the wine
- maridaje: the `maridaje` column contains food pairing recommendations for the wine
- nom_vinedo: the `nom_vinedo` column contains the name of the vineyard for this wine
- descripcion: the `descripcion` column contains information about the vineyard and the production of this wine
- edad_vinedo: the `edad_vinedo` column contains the age of the vineyard for this wine, counted from when it was planted
- clima: the `clima` column describes the climate of the place where the vines are grown
- suelo: the `suelo` column describes the soil of the place where the vines are grown
- rendimiento: the `rendimiento` column contains the yield of the vineyard. Its format varies: one example is "Entre 3.000 y 6.000 kilos por hectárea" (between 3,000 and 6,000 kg per hectare), while other wines give the yield per vine.
- cosecha: the `cosecha` column describes how the grapes are harvested. Example: "La vendimia manual en cajas." (manual harvest in crates).
- vinificacion: the `vinificacion` column describes how the wine is made. Example: "20% de la uva es despalillada a su llegada a la bodega. Fermentación con sus propias levaduras durante 15 días y maceración 3 semanas. Después se prensó para pasar a barricas" (20% of the grapes are destemmed on arrival at the winery; fermentation with native yeasts for 15 days and maceration for 3 weeks; the wine was then pressed and moved to barrels).
- envejecimiento: the `envejecimiento` column describes the ageing of the wine. Example: "8 meses de crianza en barricas de roble francés." (8 months of ageing in French oak barrels).
- embotellado: the `embotellado` column describes the bottling of the wine. Its content varies considerably. Example: "Embotellado sin clarificar ni filtrar." (bottled without fining or filtering).


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import re
import requests
from time import sleep, strftime
import random
from random import randint
from bs4 import BeautifulSoup

# pip install selenium
import selenium 
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

import shutil
import pickle
import os

In [None]:
## LIST OF URLS for all Bodeboca.com listing pages with Spanish wines
# SET 1: pages 1-10 (URL parameter page=0 to page=9)
"https://www.bodeboca.com/vino?page=0&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=1&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=2&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=3&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=4&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=5&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=6&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=7&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=8&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=9&sort=rating-desc&origin=317",

# SET 2: pages 11-20
"https://www.bodeboca.com/vino?page=10&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=11&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=12&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=13&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=14&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=15&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=16&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=17&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=18&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=19&sort=rating-desc&origin=317"

# SET 3: pages 21-30
"https://www.bodeboca.com/vino?page=20&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=21&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=22&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=23&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=24&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=25&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=26&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=27&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=28&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=29&sort=rating-desc&origin=317"

# SET 4: pages 31-40
"https://www.bodeboca.com/vino?page=30&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=31&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=32&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=33&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=34&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=35&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=36&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=37&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=38&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=39&sort=rating-desc&origin=317"

# SET 5: pages 41-50
"https://www.bodeboca.com/vino?page=40&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=41&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=42&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=43&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=44&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=45&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=46&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=47&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=48&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=49&sort=rating-desc&origin=317"

# SET 6: pages 51-60
"https://www.bodeboca.com/vino?page=50&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=51&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=52&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=53&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=54&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=55&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=56&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=57&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=58&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=59&sort=rating-desc&origin=317"

# SET 7: pages 61-70
"https://www.bodeboca.com/vino?page=60&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=61&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=62&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=63&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=64&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=65&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=66&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=67&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=68&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=69&sort=rating-desc&origin=317"

# SET 8: pages 71-80
"https://www.bodeboca.com/vino?page=70&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=71&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=72&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=73&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=74&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=75&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=76&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=77&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=78&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=79&sort=rating-desc&origin=317"


# SET 9: pages 81-90
"https://www.bodeboca.com/vino?page=80&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=81&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=82&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=83&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=84&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=85&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=86&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=87&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=88&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=89&sort=rating-desc&origin=317"


# SET 10: pages 91-100
"https://www.bodeboca.com/vino?page=90&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=91&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=92&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=93&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=94&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=95&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=96&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=97&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=98&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=99&sort=rating-desc&origin=317"

# SET 11: pages 101-110
"https://www.bodeboca.com/vino?page=100&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=101&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=102&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=103&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=104&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=105&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=106&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=107&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=108&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=109&sort=rating-desc&origin=317"


# SET 12: pages 111-120
"https://www.bodeboca.com/vino?page=110&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=111&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=112&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=113&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=114&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=115&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=116&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=117&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=118&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=119&sort=rating-desc&origin=317"

# SET 13: pages 121-130
"https://www.bodeboca.com/vino?page=120&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=121&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=122&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=123&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=124&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=125&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=126&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=127&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=128&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=129&sort=rating-desc&origin=317"

# SET 14: pages 131-140
"https://www.bodeboca.com/vino?page=130&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=131&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=132&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=133&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=134&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=135&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=136&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=137&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=138&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=139&sort=rating-desc&origin=317"

# SET 15: pages 141-150
"https://www.bodeboca.com/vino?page=140&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=141&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=142&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=143&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=144&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=145&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=146&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=147&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=148&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=149&sort=rating-desc&origin=317"

# SET 16: pages 151-160
"https://www.bodeboca.com/vino?page=150&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=151&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=152&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=153&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=154&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=155&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=156&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=157&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=158&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=159&sort=rating-desc&origin=317"

# SET 17: pages 161-170
"https://www.bodeboca.com/vino?page=160&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=161&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=162&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=163&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=164&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=165&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=166&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=167&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=168&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=169&sort=rating-desc&origin=317"
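The seventeen sets listed above follow a single pattern, so they could also be generated rather than typed out. The following is a minimal sketch, assuming the `page` parameter keeps running from 0 to 169. The names `BASE_URL`, `TOTAL_PAGES`, `SET_SIZE` and `build_url_sets` are illustrative and do not appear in the original notebook.

```python
# Illustrative names (not from the original notebook)
BASE_URL = "https://www.bodeboca.com/vino?page={page}&sort=rating-desc&origin=317"
TOTAL_PAGES = 170  # URL parameter page=0 .. page=169, as in the list above
SET_SIZE = 10      # pages per set


def build_url_sets(total_pages=TOTAL_PAGES, set_size=SET_SIZE):
    """Return a list of URL lists, one list per set of `set_size` pages."""
    all_urls = [BASE_URL.format(page=p) for p in range(total_pages)]
    return [all_urls[i:i + set_size] for i in range(0, total_pages, set_size)]


url_sets = build_url_sets()

# url_sets[0] is SET 1 (pages 1-10), url_sets[16] is SET 17 (pages 161-170)
urls = url_sets[16]
print(len(url_sets), "sets;", len(urls), "URLs in the selected set")
```

With this approach, moving to the next set only requires changing the index used to select `urls`.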

## 1. Get the links from each listing page
First we extract the name of each wine, which we call `titulo`, and the link to each wine's page.

### 1.0 Run the extraction in sets of 10 pages
We decided to run the extraction 10 pages at a time. This makes it easy to spot errors and to catch gift boxes, cases and similar products. These products cause problems during extraction because they repeat wines already in the list and, being assortments of several wines, generally have no tasting or pairing information.

Each run of 10 pages, with about 33 wines per page, took roughly 10-15 minutes. The net extraction time for all the wines on the 170 bodeboca.com pages for Spanish wines was about four and a half hours.

Each page has about 33 wines --> 33 x 170 = 5,610, i.e. roughly 5,600 wines

In [None]:
import sys
import time

def barra_de_progreso_ascii(actual, total, longitud=50):
    """Draw an ASCII progress bar for step `actual` out of `total`."""
    porcentaje = actual / total
    completado = int(porcentaje * longitud)
    barra = '=' * completado + ' ' * (longitud - completado)
    sys.stdout.write(f'\r[{barra}] {int(porcentaje * 100)}%')
    sys.stdout.flush()


In [None]:
# Iterate over one set of 10 listing pages, with about 33 wines each,
# to collect the links of roughly 330 wines

urls = ["https://www.bodeboca.com/vino?page=160&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=161&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=162&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=163&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=164&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=165&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=166&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=167&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=168&sort=rating-desc&origin=317",
"https://www.bodeboca.com/vino?page=169&sort=rating-desc&origin=317"
]


data = []

for n, url in enumerate(urls, start=1):

    response = requests.get(url)
    response.raise_for_status()  # stop if the listing page could not be fetched
    soup = BeautifulSoup(response.text, "html.parser")

    vinos = soup.find_all("div", class_="wineblock-info-wrapper")

    for vino in vinos:
        titulo = vino.find("h3").text.strip()

        # Link to the wine's page
        link_desc_vino = "https://bodeboca.com" + vino.find("a")["href"]

        data.append({"titulo": titulo, "link_vino": link_desc_vino})

    # Update the progress bar once this page is done, then pause briefly between requests
    barra_de_progreso_ascii(n, len(urls))
    time.sleep(1)

print("\nProcess completed!")

df_links_vinos = pd.DataFrame(data)
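Because a run over 10 pages can fail partway through (for example after a network error), one option is to wrap each request in a small retry helper. This is a hedged sketch and not part of the original notebook: `fetch_html`, its parameters and the waiting times are assumptions chosen for illustration. Only `Session.get`, `raise_for_status` and `requests.RequestException` come from the `requests` library.

```python
import random
import time

import requests


def fetch_html(url, session=None, retries=3, min_wait=1.0, max_wait=3.0):
    """Return the HTML text of `url`, or None if every attempt fails.

    Hypothetical helper: waits a random time between failed attempts.
    """
    if session is None:
        session = requests.Session()
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()  # treat HTTP error codes as failures
            return response.text
        except requests.RequestException as error:
            print(f"Attempt {attempt}/{retries} failed for {url}: {error}")
            if attempt < retries:
                time.sleep(random.uniform(min_wait, max_wait))
    return None
```

In the loop above, `html = fetch_html(url)` could stand in for the `requests.get(url)` call, skipping the page when `html` is `None`.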

To keep packs out of the list of links, we drop every link that contains the word pack, estuche or minibar. These products are sets of wines that already appear in the list on their own, and they provide no tasting or vineyard information.

In [None]:
# Remove the entries we are not interested in (packs, gift boxes, minibars)
lista_links_vinos = df_links_vinos["link_vino"].tolist()
palabras_excluidas = ("pack", "estuche", "minibar")
lista_vinos_filtrados = [link for link in lista_links_vinos
                         if not any(palabra in link.lower() for palabra in palabras_excluidas)]

In [None]:
# Quickly check the links by eye
for i, link in enumerate(lista_vinos_filtrados):
    print(i, link)

In [None]:
# Once checked, remove any remaining unwanted entries by position:

# For example, to remove the elements at positions 56, 184 and 268 of the list
posiciones_a_eliminar = [56, 184, 268]

# Delete from the highest position down so that earlier deletions do not shift later indices
for pos in sorted(posiciones_a_eliminar, reverse=True):
    del lista_vinos_filtrados[pos]
    
len(lista_vinos_filtrados)

### 1.1 Extract the data from each wine page --> WEB SCRAPING OF BODEBOCA.COM

In [None]:
urls = lista_vinos_filtrados#[:50]

data = []

for url in urls:
    response = requests.get(url)
    bool(response)
    soup = BeautifulSoup(response.text, "html.parser")

    caja1 = soup.find("div", class_="product-main-information")
    if caja1 is not None:
        titulo = caja1.find("h1").text
        link = url
        precio = caja1.find("span", class_="price brandon")["data-price"]
        rat_obj = caja1.find("div", class_="rating-numeric")
        if rat_obj is not None:
            rating = rat_obj.text
        else:
            rating = "NA"
        volumen_botella = caja1.find("span", class_="formato").text.strip().split("\n")[0]
    else:
        # No main information block: keep the link and mark the rest as missing,
        # so that values from the previous wine are not reused
        titulo = "NA"
        link = url
        precio = "NA"
        rating = "NA"
        volumen_botella = "NA"

    desc_obj = soup.find("div", class_= "clearfix text-formatted field field--name-body field--type-text-with-summary field--label-hidden field__item")
    if desc_obj is not None:
        descripcion = soup.find("div", class_= "clearfix text-formatted field field--name-body field--type-text-with-summary field--label-hidden field__item").text.strip()
    else:
        descripcion = "NA"
        
    ############# TECHNICAL SHEET
    bodega_tipo = soup.find("div", class_="field field_bodega-name")
    if bodega_tipo is not None:
        bodega = soup.find("div", class_="field field_bodega-name").find("a")['href'].replace('/', ' ')
    else:
        bodega = "NA"
    
    tipo_obj = soup.find("div", class_="field field--name-field-wine-type field--type-entity-reference field--label-above")
    if tipo_obj is not None:
        tipo = soup.find("div", class_="field field--name-field-wine-type field--type-entity-reference field--label-above").find("div", class_= "field__item").text.strip()
    else:
        tipo = "NA"
    
    anada_obj = soup.find("div", class_="field field--name-field-vintage field--type-entity-reference field--label-above")
    if anada_obj is not None:
        anada = soup.find("div", class_="field field--name-field-vintage field--type-entity-reference field--label-above").find("div", class_= "field__item").text.strip()
    else:
        anada = "NA"
    
    grado_obj = soup.find("div", class_="field field--name-field-alcohol field--type-decimal field--label-above")
    if grado_obj is not None:
        grado = soup.find("div", class_="field field--name-field-alcohol field--type-decimal field--label-above").find("div", class_= "field__item").text.strip()
    else:
        grado = "NA"
        
    produccion_obj = soup.find("div", class_="field field--name-field-production field--type-string field--label-above")
    if produccion_obj is not None:
        produccion = soup.find("div", class_="field field--name-field-production field--type-string field--label-above").find("div", class_= "field__item").text.strip()
    else: 
        produccion = "NA"
    
    subzona_obj = soup.find("div", class_="field field--name-field-subzone field--type-string-long field--label-above")
    if subzona_obj is not None:
        subzona = soup.find("div", class_="field field--name-field-subzone field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else:
        subzona = "NA"
    
    var_obj = soup.find("div", class_="field field--name-field-variety field--type-string field--label-above")
    if var_obj is not None:
        variedad =  soup.find("div", class_="field field--name-field-variety field--type-string field--label-above").find("div", class_= "field__item").text.strip()
    else:
        variedad = "NA"  
    
    origen_obj = soup.find("div", class_="field field--name-field-appellation field--type-entity-reference field--label-above")
    if origen_obj is not None:
        origen = soup.find("div", class_="field field--name-field-appellation field--type-entity-reference field--label-above").find("div", class_= "field__item").text.strip()
    else:
        origen = "NA"
        
    ################ TASTING NOTES
    vista_obj = soup.find("div", class_="field field--name-field-visual field--type-string field--label-above")
    if vista_obj is not None:
        vista = soup.find("div", class_="field field--name-field-visual field--type-string field--label-above").find("div", class_= "field__item").text.strip()
    else:
        vista = "NA"
    
    nariz_obj = soup.find("div", class_="field field--name-field-nose field--type-string-long field--label-above")
    if nariz_obj is not None:
        nariz = soup.find("div", class_="field field--name-field-nose field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else:
        nariz = "NA"
    
    boca_obj = soup.find("div", class_="field field--name-field-mouth field--type-string-long field--label-above")
    if boca_obj is not None:
        boca = soup.find("div", class_="field field--name-field-mouth field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else:
        boca = "NA"
    
    temp_obj = soup.find("div", class_="field field--name-field-service-temperature field--type-string field--label-above")
    if temp_obj is not None:
        temp_servir = soup.find("div", class_="field field--name-field-service-temperature field--type-string field--label-above").find("div", class_= "field__item").text.strip()
    else:
        temp_servir = "NA"
    
    maridaje_obj = soup.find("div", class_="field field--name-field-pairing field--type-string-long field--label-above")
    if maridaje_obj is not None:
        maridaje = soup.find("div", class_="field field--name-field-pairing field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else:
        maridaje = "NA"

    ################ VINEYARD AND WINEMAKING
    nom_vinedo_obj = soup.find("div", class_="field field--name-field-vineyard-name field--type-string field--label-above")
    if nom_vinedo_obj is not None:
        nom_vinedo = soup.find("div", class_="field field--name-field-vineyard-name field--type-string field--label-above").find("div", class_= "field__item").text.strip()
    else:
        nom_vinedo = "NA"
    descripcion_obj = soup.find("div", class_="field field--name-field-vineyard-description field--type-string-long field--label-above")
    if descripcion_obj is not None:
        descripcion = soup.find("div", class_="field field--name-field-vineyard-description field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else:
        descripcion = "NA"
    edad_vinedo_obj = soup.find("div", class_="field field--name-field-vineyard-age field--type-string field--label-above")
    if edad_vinedo_obj is not None:
        edad_vinedo = soup.find("div", class_="field field--name-field-vineyard-age field--type-string field--label-above").find("div", class_= "field__item").text.strip()
    else:
        edad_vinedo = "NA"
    clima_obj = soup.find("div", class_="field field--name-field-vineyard-climate field--type-string-long field--label-above")
    if clima_obj is not None:
        clima = soup.find("div", class_="field field--name-field-vineyard-climate field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else: 
        clima = "NA"
    suelo_obj = soup.find("div", class_="field field--name-field-vineyard-soil field--type-string-long field--label-above")
    if suelo_obj is not None: 
        suelo = soup.find("div", class_="field field--name-field-vineyard-soil field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else:
        suelo = "NA"
    rendimiento_obj = soup.find("div", class_="field field--name-field-vineyard-yield field--type-string field--label-above")
    if rendimiento_obj is not None:
        rendimiento = soup.find("div", class_="field field--name-field-vineyard-yield field--type-string field--label-above").find("div", class_= "field__item").text.strip()
    else:
        rendimiento = "NA"
    cosecha_obj = soup.find("div", class_="field field--name-field-harvest field--type-string-long field--label-above")
    if cosecha_obj is not None:
        cosecha = soup.find("div", class_="field field--name-field-harvest field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else:
        cosecha = "NA"
    vinificacion_obj = soup.find("div", class_="field field--name-field-vinification field--type-string-long field--label-above")
    if vinificacion_obj is not None:
        vinificacion = soup.find("div", class_="field field--name-field-vinification field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else:
        vinificacion = "NA"
    envejecimiento_obj = soup.find("div", class_="field field--name-field-ageing field--type-string-long field--label-above")
    if envejecimiento_obj is not None:
        envejecimiento = soup.find("div", class_="field field--name-field-ageing field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else:
        envejecimiento = "NA"
    embotellado_obj = soup.find("div", class_="field field--name-field-bottling field--type-string-long field--label-above")
    if embotellado_obj is not None:
        embotellado = soup.find("div", class_="field field--name-field-bottling field--type-string-long field--label-above").find("div", class_= "field__item").text.strip()
    else:
        embotellado = "NA"
        

    dic_ind = {}

    dic_ind["titulo"] = titulo
    dic_ind["link"] = link
    dic_ind["precio"] = precio
    dic_ind["rating"] = rating
    dic_ind["volumen_botella"] = volumen_botella
    dic_ind["bodega"] = bodega    
    dic_ind["tipo"] = tipo
    dic_ind["grado"] = grado
    dic_ind["anada"] = anada
    dic_ind["produccion"] = produccion
    dic_ind["subzona"] = subzona
    dic_ind["variedad"] = variedad
    dic_ind["origen"] = origen
    dic_ind["vista"] = vista
    dic_ind["nariz"] = nariz
    dic_ind["boca"] = boca
    dic_ind["temp_servir"] = temp_servir
    dic_ind["maridaje"] = maridaje
    dic_ind["nom_vinedo"] = nom_vinedo
    dic_ind["descripcion"] = descripcion
    dic_ind["edad_vinedo"] = edad_vinedo
    dic_ind["clima"] = clima
    dic_ind["suelo"] = suelo
    dic_ind["rendimiento"] = rendimiento
    dic_ind["cosecha"] =cosecha
    dic_ind["vinificacion"] = vinificacion
    dic_ind["envejecimiento"] = envejecimiento
    dic_ind["embotellado"] = embotellado


    data.append(dic_ind)

df_vinos = pd.DataFrame(data)
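Most fields in the loop above follow the same pattern: find a container `div` by its full class string, read the nested `field__item`, and fall back to `"NA"` when the container is missing. That pattern could be factored into one helper. This is a sketch under assumptions, not the notebook's actual code: `extract_field` and `FIELD_CLASSES` are hypothetical names, and the HTML fragment is hand-written for illustration (including its `field__label` element) rather than copied from bodeboca.com.

```python
from bs4 import BeautifulSoup


def extract_field(soup, container_class, default="NA"):
    """Return the stripped text of the `field__item` inside the container, or `default`."""
    container = soup.find("div", class_=container_class)
    if container is None:
        return default
    item = container.find("div", class_="field__item")
    return item.text.strip() if item is not None else default


# Column name -> class string of its container div (two entries copied from the loop above)
FIELD_CLASSES = {
    "grado": "field field--name-field-alcohol field--type-decimal field--label-above",
    "suelo": "field field--name-field-vineyard-soil field--type-string-long field--label-above",
}

# Hand-written fragment used only to exercise the helper (not real site markup)
html = """
<div class="field field--name-field-alcohol field--type-decimal field--label-above">
  <div class="field__label">Grado</div>
  <div class="field__item"> 14.5% </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# One dictionary comprehension replaces one if/else block per column
row = {column: extract_field(soup, classes) for column, classes in FIELD_CLASSES.items()}
print(row)
```

Adding a column then means adding one entry to the dictionary, which avoids copy-and-paste slips such as a misspelled variable name in one of the `else` branches.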

In [None]:
# Check the current directory before saving this dataset
directorio_actual = os.getcwd()
print("The current directory is:", directorio_actual)

# Path of the new directory
nuevo_directorio = 'C:/Users/Soraya/Documents/ID Bootcamps Data Science/08. Proyecto final'

# Change the current working directory
os.chdir(nuevo_directorio)

# Check that the change worked
print("The new working directory is:", os.getcwd())

# Save the DataFrame, naming the file after the pages it covers, always in the same directory
df_vinos.to_csv("df_vinos_pag_161-170.csv", index=False)
# Repeat for each set until there are 17 DataFrames

## 2. Merge the DataFrames into one

In [None]:
import os

# Get the current working directory
working_directory = os.getcwd()

print("The current working directory is:", working_directory)

# Path of the new directory (placeholder: replace it with the directory of your choice)
nuevo_directorio = 'C:/Users/directorio elegido'

# Change the current working directory
os.chdir(nuevo_directorio)

# Check that the change worked
print("The new working directory is:", os.getcwd())

In [None]:
import pandas as pd

# Load the DataFrames from the CSV files
df_1 = pd.read_csv('df_vinos_pag_1-10.csv')
df_2 = pd.read_csv('df_vinos_pag_11-20.csv')
df_3 = pd.read_csv('df_vinos_pag_21-30.csv')
df_4 = pd.read_csv('df_vinos_pag_31-40.csv')
df_5 = pd.read_csv('df_vinos_pag_41-50.csv')
df_6 = pd.read_csv('df_vinos_pag_51-60.csv')
df_7 = pd.read_csv('df_vinos_pag_61-70.csv')
df_8 = pd.read_csv('df_vinos_pag_71-80.csv')
df_9 = pd.read_csv('df_vinos_pag_81-90.csv')
df_10 = pd.read_csv('df_vinos_pag_91-100.csv')
df_11 = pd.read_csv('df_vinos_pag_101-110.csv')
df_12 = pd.read_csv('df_vinos_pag_111-120.csv')
df_13 = pd.read_csv('df_vinos_pag_121-130.csv')
df_14 = pd.read_csv('df_vinos_pag_131-140.csv')
df_15 = pd.read_csv('df_vinos_pag_141-150.csv')
df_16 = pd.read_csv('df_vinos_pag_151-160.csv')
df_17 = pd.read_csv('df_vinos_pag_161-170.csv')

In [None]:
### MERGE THE DATASETS INTO ONE
# List of DataFrames except the first
dataframes_resto = [df_2, df_3, df_4, df_5, df_6, 
                    df_7, df_8, df_9, df_10, df_11,
                   df_12, df_13, df_14, df_15, df_16, df_17]

# Concatenate the DataFrames
df_VINOS_final = pd.concat([df_1] + dataframes_resto, ignore_index=True)

# Inspect the resulting DataFrame
df_VINOS_final
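The loading and concatenation steps above repeat the same call seventeen times, so they could be condensed into a loop. A minimal sketch, assuming the files keep the `df_vinos_pag_<first>-<last>.csv` naming used in the `pd.read_csv` calls; `csv_names` and `load_and_concat` are illustrative names that are not in the original notebook.

```python
import pandas as pd


def csv_names(total_pages=170, set_size=10):
    """Build the file names df_vinos_pag_1-10.csv, df_vinos_pag_11-20.csv, ..."""
    return [f"df_vinos_pag_{start}-{start + set_size - 1}.csv"
            for start in range(1, total_pages + 1, set_size)]


def load_and_concat(paths):
    """Read every CSV in `paths` and stack them into one DataFrame with a fresh index."""
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Only the file names are generated here; reading them requires the files to exist
print(csv_names()[0], "...", csv_names()[-1])
```

With the seventeen files present in the working directory, `load_and_concat(csv_names())` should give the same result as the cell above.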

In [None]:
# Identify duplicated rows
filas_duplicadas = df_VINOS_final.duplicated()
print("Duplicated rows:", filas_duplicadas.sum())

# Drop duplicated rows, keeping the first occurrence
df_vinos = df_VINOS_final.drop_duplicates()

In [None]:
# Save the DataFrame to a CSV file, with missing values written as 'NaN'
df_vinos.to_csv('df_vinos_raw.csv', na_rep='NaN')