<p style="text-align: center; font-size: 30px;">
    <strong>
          Scraper for Amazon
    </strong>
</p>

----

<b style="text-align: left; font-size: 18px;">
    This project scrapes publicly available user reviews of products listed on amazon.es.
</b>

----

<b style="font-size: 20px;">Project Roadmap</b>

-  <p style="font-size: 18px;">Create a list of target URLs</p>
-  <p style="font-size: 18px;">Develop a program that automatically visits each URL and extracts the review data</p>
-  <p style="font-size: 18px;">Save the acquired data in a DataFrame for future applications</p>
---

<b style="font-size: 20px;">Required tools</b>

<blockquote style="background-color: #f0f0f0; padding: 10px; border-left: 10px solid #3498db; font-size: 18px;">
    <a href="https://www.python.org/downloads/" target="_blank" rel="noopener noreferrer">Latest version of Python</a>
</blockquote>

<blockquote style="background-color: #f0f0f0; padding: 10px; border-left: 10px solid #3498db; font-size: 18px;">
    <a href="https://googlechromelabs.github.io/chrome-for-testing/" target="_blank" rel="noopener noreferrer">Latest version of ChromeDriver (the WebDriver for Chrome)</a>
</blockquote>

<blockquote style="background-color: #f0f0f0; padding: 10px; border-left: 10px solid #3498db; font-size: 18px;">
    Selenium library for scraping purposes
</blockquote>

In [None]:
!pip install selenium

<blockquote style="background-color: #f0f0f0; padding: 10px; border-left: 10px solid #3498db; font-size: 18px;">
    Pandas library for data storage and manipulation
</blockquote>

In [None]:
!pip install pandas

<blockquote style="background-color: #f0f0f0; padding: 10px; border-left: 10px solid #3498db; font-size: 18px;">
    tqdm library for progress bars
</blockquote>

In [None]:
!pip install tqdm

----
<b style="font-size: 20px;">Specification of the target URLs</b>

<p style="font-size: 18px;">The same method works for other products on amazon.es: only the base URL changes.</p>

```python
base_url = 'https://www.amazon.es/CREATE-THERA-Cafetera-monodosis-semiautom%C3%A1tica/product-reviews/B0BSH8CYN8/ref=cm_cr_getr_d_paging_btm_next_2?ie=UTF8&pageNumber=1&reviewerType=avp_only_reviews'

# The page-number digit sits at index 155 of base_url;
# slice around it so another page number can be substituted
prefix = base_url[:155]
suffix = base_url[156:]

# Build one URL per review page (pages 1 to 318)
urls = []
for i in range(1, 319):
    updated_url = f"{prefix}{i}{suffix}"
    urls.append(updated_url)
```
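The fixed slice indices only hold for this exact URL. A more tolerant variant, sketched here with the standard library's `urllib.parse`, rewrites the `pageNumber` query parameter wherever it appears; `with_page_number` is a hypothetical helper name and not part of the original code:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page_number(url: str, page: int) -> str:
    """Return `url` with its pageNumber query parameter set to `page`."""
    parts = urlsplit(url)
    # Keep every other query parameter in its original order,
    # then append the new pageNumber at the end.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "pageNumber"
    ]
    params.append(("pageNumber", str(page)))
    return urlunsplit(parts._replace(query=urlencode(params)))


base_url = (
    "https://www.amazon.es/CREATE-THERA-Cafetera-monodosis-semiautom%C3%A1tica"
    "/product-reviews/B0BSH8CYN8/ref=cm_cr_getr_d_paging_btm_next_2"
    "?ie=UTF8&pageNumber=1&reviewerType=avp_only_reviews"
)

# Same page range as the snippet above: pages 1 to 318
urls = [with_page_number(base_url, page) for page in range(1, 319)]
```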
----
<b style="font-size: 20px;">Automatic interaction with web services</b>

<p style="font-size: 18px;">WebDriver provides a programmatic interface for controlling a web browser. Selenium uses it here to drive Chrome.</p>

```python
# DRIVER_PATH is the path to the chromedriver binary; options is a Selenium Options instance
service = Service(DRIVER_PATH)
driver = webdriver.Chrome(service=service, options=options)
```
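If scraping raises an exception midway, the browser process stays open unless `quit()` is called. A minimal sketch of a cleanup wrapper follows; `managed_driver` is a hypothetical helper (not part of Selenium), and with Selenium the factory would be something like `lambda: webdriver.Chrome(service=service, options=options)`. A stand-in class is used so the sketch runs without a browser:

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory()` and always call quit() afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions, so the browser is closed
        driver.quit()


# Stand-in object for illustration only; it mimics the quit() method
class FakeDriver:
    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as driver:
    pass  # with Selenium: driver.get(url), driver.find_elements(...), and so on
```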
----

<b style="font-size: 20px;">Necessary adjustments for the bot</b>

<p style="font-size: 18px;">To avoid being blocked, the bot should emulate the behaviour of a human user.</p>

<p style="font-size: 18px;">This option makes the bot send a user-agent string with browser specs, just as a regular browser would.</p>

```python
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
```

<p style="font-size: 18px;">Pausing for a random interval between requests also helps to avoid detection.</p>

```python
time.sleep(random.randint(1, 3))
```
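`random.randint(1, 3)` only yields whole seconds (1, 2, or 3). A possible variant, assuming fractional delays are acceptable, draws from a continuous range instead. `polite_pause` and its parameters are hypothetical names; the `sleep` argument exists only so the function can be exercised without actually waiting:

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random, possibly fractional, duration and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```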
----
<p style="text-align: center; font-size: 25px;">
    <strong>
          The final code
    </strong>
</p>

---

In [60]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import pandas as pd

import time
import random
from tqdm import tqdm

def get_urls():
    base_url = 'https://www.amazon.es/CREATE-THERA-Cafetera-monodosis-semiautom%C3%A1tica/product-reviews/B0BSH8CYN8/ref=cm_cr_getr_d_paging_btm_next_2?ie=UTF8&pageNumber=1&reviewerType=avp_only_reviews'
    list_urls = []

    # Pages 0 through 10 (11 URLs)
    for i in range(0, 11):

        # Slice around the page-number digit at index 155 of base_url
        prefix = base_url[:155]
        suffix = base_url[156:]

        # Substitute the new page number
        updated_url = f"{prefix}{i}{suffix}"

        list_urls.append(updated_url)
    
    return list_urls

def scrape_reviews(urls):
    
    options = Options()
    options.add_argument('--headless') 
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")


    # Path to the local chromedriver binary; adjust for your machine
    DRIVER_PATH = '/Users/apple/Downloads/Project_Gnomi_Huekradi/chromedriver'

    service = Service(DRIVER_PATH)
    driver = webdriver.Chrome(service=service, options=options)
    
    reviews = []
    
    for url in tqdm(urls, desc='Processing URLs', unit='URL'): 
        
        driver.get(url)
        
        time.sleep(random.randint(1, 3))


        # Getting reviews
        review_blocks = driver.find_elements(By.CSS_SELECTOR, '.a-section.review')
        for review_block in review_blocks:
            title = review_block.find_element(By.CSS_SELECTOR, '.review-title-content').text.strip()
            rating = review_block.find_element(By.CSS_SELECTOR, '.a-icon-alt').get_attribute('textContent').strip()
            body = review_block.find_element(By.CSS_SELECTOR, '[data-hook="review-body"]').text.strip()
            author = review_block.find_element(By.CSS_SELECTOR, '.a-profile-name').text.strip()
            date = review_block.find_element(By.CSS_SELECTOR, '.review-date').text.strip()

            reviews.append({
                'title': title,
                'rating': rating,
                'body': body,
                'author': author,
                'date': date,
            })
    
    print("Done.")
    driver.quit()
    return reviews

# Saving data
def save_to_csv(reviews, filename):
    df = pd.DataFrame(reviews)
    df.to_csv(filename, index=False, encoding='utf-8')

if __name__ == "__main__":

    urls = get_urls()
    reviews = scrape_reviews(urls)
    save_to_csv(reviews, 'testing.csv')
    print(f'Saved {len(reviews)} reviews in testing.csv')

Processing URLs: 100%|██████████| 11/11 [00:32<00:00,  2.92s/URL]

Done.
Saved 100 reviews in testing.csv





---
<b style="font-size: 20px;">Loading the data into a DataFrame</b>


In [61]:
# Loading the saved CSV into a DataFrame
data = pd.read_csv('testing.csv')

data.head()

|   | title | rating | body | author | date |
|---|-------|--------|------|--------|------|
| 0 | Perfecta! | 5,0 de 5 estrellas | Me encanta! Soy adicta al café y me gusta toma... | Perfecta! | Revisado en España el 22 de febrero de 2024 |
| 1 | Un quiero y no puedo | 4,0 de 5 estrellas | Realmente el producto en si funciona bien pero... | Ro | Revisado en España el 17 de abril de 2024 |
| 2 | Precioso diseño y buen funcionamiento a buen p... | 5,0 de 5 estrellas | Si quieres una cafetera sin complicaciones, bo... | Hank Solo | Revisado en España el 22 de enero de 2024 |
| 3 | Buena Relación Precio/Calidad | 5,0 de 5 estrellas | Qué Pasada!!\n\nla Compramos para Reyes cansad... | LHO | Revisado en España el 26 de febrero de 2021 |
| 4 | Decepcionante, tanto la cafetera como el servi... | 1,0 de 5 estrellas | EDITO: El café lo hace bueno, no sabe para nad... | Alicia TM | Revisado en España el 12 de diciembre de 2020 |


In [55]:
data.shape

(100, 5)

In [64]:
data.iloc[79]

title                                          A Smart move
rating                                   5,0 de 5 estrellas
body      Fartos de cápsulas que só geram lixo desnecess...
author                                            Francisco
date              Revisado en España el 17 de enero de 2023
Name: 79, dtype: object
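
The `rating` column holds strings such as `5,0 de 5 estrellas`, with a decimal comma. For numeric analysis it may help to convert them to floats. A minimal sketch, where `parse_rating` is a hypothetical helper and the pattern assumes the Spanish-locale format seen in the output above:

```python
import re

# Leading number with a decimal comma or point, e.g. "5,0" in "5,0 de 5 estrellas"
_RATING_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)")


def parse_rating(text):
    """Return the star rating as a float, or None if no leading number is found."""
    match = _RATING_RE.match(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))


print(parse_rating("5,0 de 5 estrellas"))  # 5.0
print(parse_rating("1,0 de 5 estrellas"))  # 1.0
```

With pandas, the helper could be applied to the whole column via `data['rating'].map(parse_rating)`.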

----
<p style="text-align: center; font-size: 25px;">
    <strong>
          Another version: sort orders and star filters
    </strong>
</p>

---

In [65]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time
import random
from tqdm import tqdm

def get_urls():
    base_url = 'https://www.amazon.es/CREATE-THERA-Cafetera-monodosis-semiautom%C3%A1tica/product-reviews/B0BSH8CYN8?ie=UTF8&reviewerType=avp_only_reviews'
    
    filters = [
        'sortBy=recent',
        'sortBy=helpful',
        'sortBy=rating',
        'filterByStar=one_star',
        'filterByStar=two_star',
        'filterByStar=three_star',
        'filterByStar=four_star',
        'filterByStar=five_star'
    ]
    
    list_urls = []
    for filter_option in filters:
        for page in range(1, 11):
            
            # Append the filter option and page number
            updated_url = f"{base_url}&{filter_option}&pageNumber={page}"
            list_urls.append(updated_url)
    
    return list_urls

def scrape_reviews(urls):
    
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

    # Path to the local chromedriver binary; adjust for your machine
    DRIVER_PATH = '/Users/apple/Downloads/Project_Gnomi_Huekradi/chromedriver'

    service = Service(DRIVER_PATH)
    driver = webdriver.Chrome(service=service, options=options)
    
    reviews = []
    
    for url in tqdm(urls, desc='Processing URLs', unit='URL'): 
        driver.get(url)
        
        time.sleep(random.randint(1, 3))

        # Getting reviews
        review_blocks = driver.find_elements(By.CSS_SELECTOR, '.a-section.review')
        for review_block in review_blocks:
            title = review_block.find_element(By.CSS_SELECTOR, '.review-title-content').text.strip()
            rating = review_block.find_element(By.CSS_SELECTOR, '.a-icon-alt').get_attribute('textContent').strip()
            body = review_block.find_element(By.CSS_SELECTOR, '[data-hook="review-body"]').text.strip()
            author = review_block.find_element(By.CSS_SELECTOR, '.a-profile-name').text.strip()
            date = review_block.find_element(By.CSS_SELECTOR, '.review-date').text.strip()

            reviews.append({
                'title': title,
                'rating': rating,
                'body': body,
                'author': author,
                'date': date,
            })
    
    print("Done.")
    driver.quit()
    return reviews

# Removing duplicates and saving data
def save_to_csv(reviews, filename):
    df = pd.DataFrame(reviews)
    df.drop_duplicates(subset=['title', 'body', 'author'], inplace=True)
    df.to_csv(filename, index=False, encoding='utf-8')

if __name__ == "__main__":
    urls = get_urls()
    reviews = scrape_reviews(urls)
    save_to_csv(reviews, 'reviews_filtered.csv')
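    # Note: len(reviews) is the count before duplicates are dropped,
    # so it can exceed the number of rows written to the CSV
    print(f'Saved {len(reviews)} reviews in reviews_filtered.csv')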

Processing URLs: 100%|██████████| 80/80 [03:52<00:00,  2.91s/URL]


Done.
Saved 769 reviews in reviews_filtered.csv
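
The count printed above (769) is the number of reviews scraped, taken before `drop_duplicates` runs, so the CSV can hold fewer rows. Overlap is expected because the same review can show up under several sort orders and star filters. A plain-Python sketch of the same de-duplication idea, keyed on title, body, and author, is given below; `dedupe_reviews` is a hypothetical helper and the sample rows are made up for illustration:

```python
def dedupe_reviews(reviews, keys=("title", "body", "author")):
    """Keep the first occurrence of each review, comparing only `keys`."""
    seen = set()
    unique = []
    for review in reviews:
        fingerprint = tuple(review.get(key) for key in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(review)
    return unique


# Made-up sample rows: the first two differ only in a field outside `keys`
sample = [
    {"title": "t1", "body": "b1", "author": "a1", "date": "first"},
    {"title": "t1", "body": "b1", "author": "a1", "date": "second"},
    {"title": "t2", "body": "b2", "author": "a2", "date": "third"},
]
unique = dedupe_reviews(sample)
print(f"Scraped {len(sample)} reviews, {len(unique)} unique")  # Scraped 3 reviews, 2 unique
```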


In [67]:
# Loading the saved CSV into a DataFrame
data = pd.read_csv('reviews_filtered.csv')

data.head()

|   | title | rating | body | author | date |
|---|-------|--------|------|--------|------|
| 0 | Está bien,sin más. | 4,0 de 5 estrellas | A ver… está chula, para ponerte unos expresos,... | Ramón Torres | Revisado en España el 13 de julio de 2024 |
| 1 | Buena máquina para los cafeteros | 4,0 de 5 estrellas | Me encanta hacer café con esta máquina. Queda ... | Manel | Revisado en España el 24 de junio de 2024 |
| 2 | preciosa | 5,0 de 5 estrellas | a mi marido le encanta como sale el cafe | Noelia | Revisado en España el 12 de junio de 2024 |
| 3 | Garantía no responde | 1,0 de 5 estrellas | Compramos la cafetera hace un par de meses y h... | Mare | Revisado en España el 7 de junio de 2024 |
| 4 | Cumpre a função, mas é barulhenta | 4,0 de 5 estrellas | Estou satisfeito com a compra e não me arrepen... | Renan P. | Revisado en España el 20 de mayo de 2024 |


In [70]:
data.shape

(480, 5)

In [74]:
data.iloc[469]

title                                              Bom café
rating                                   5,0 de 5 estrellas
body      Prática e usar e limpar. Faz bom café e capuccino
author                                     sandra guerreiro
date              Revisado en España el 24 de abril de 2022
Name: 469, dtype: object
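
The `date` column stores a Spanish sentence such as `Revisado en España el 24 de abril de 2022` rather than a date value. A minimal sketch for extracting the date, assuming every entry ends with the pattern `<day> de <month> de <year>` as in the rows shown; `parse_review_date` and `SPANISH_MONTHS` are hypothetical names:

```python
import re
from datetime import date

SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}

# Matches a trailing "<day> de <month> de <year>"
_DATE_RE = re.compile(r"(\d{1,2}) de ([a-záéíóú]+) de (\d{4})\s*$", re.IGNORECASE)


def parse_review_date(text):
    """Return a datetime.date, or None if the text does not match."""
    match = _DATE_RE.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = SPANISH_MONTHS.get(month_name.lower())
    if month is None:
        return None
    return date(int(year), month, int(day))


print(parse_review_date("Revisado en España el 24 de abril de 2022"))  # 2022-04-24
```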