# Product Price Web Scraper



## 1. Challenge

An e-commerce store needed to check its product prices against competitors' prices to boost sales. A list of web pages containing competitor prices was compiled, and the goal was to fetch these prices and compile them into a report that refreshes every 24 hours.


## 2. Idea

Simpler solutions were considered first. Google Sheets offers the `IMPORTXML` function, but it does not work well with Single Page Applications, whose content is rendered client-side by JavaScript. The `requests` library was tried next, but most of the websites, protected by Cloudflare, detected these requests as bot activity even after the request headers were changed.

The next step was Selenium, which drives a real browser. The script worked locally, but problems appeared once it was deployed on PythonAnywhere: with the headless option enabled, no browser window is emulated, and Cloudflare again classified the activity as bot-like. Switching to the undetected-chromedriver library did not solve the problem. The report itself was to be published in HTML format.
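
When probing a site with `requests`, it helps to tell a blocked response apart from a page that simply lacks the price. The sketch below is one possible check, not part of the original script. It assumes that Cloudflare marks challenge pages with a `cf-mitigated: challenge` response header; the function name `is_cloudflare_challenge` is hypothetical.

```python
def is_cloudflare_challenge(headers):
    """Return True if the response headers look like a Cloudflare challenge.

    Assumption: challenge pages carry a `cf-mitigated: challenge` header.
    HTTP header names are case-insensitive, so they are lower-cased first.
    """
    normalised = {name.lower(): value.lower() for name, value in headers.items()}
    return normalised.get("cf-mitigated") == "challenge"
```

With `requests`, this could be called as `is_cloudflare_challenge(response.headers)` so that the report can mark the URL as blocked instead of silently omitting it.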


## 3. Conclusion

The project was closed because an alternative solution was implemented.


## 4. Possible Improvements

The first priority is to stop Cloudflare from classifying the requests as bot-like, since that classification blocks the content. The list of URLs containing the required products also needs to be updated automatically. Finally, the CSS selectors are fragile: whenever a site changes its markup, the corresponding selector has to be updated.
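
A changed selector could be caught early by checking that the extracted `innerHTML` still parses as a price. The following is a minimal sketch under stated assumptions: the helper `parse_price` is hypothetical, prices use `.` or `,` as the decimal separator, and thousands separators are not handled.

```python
import re
from decimal import Decimal

TAG = re.compile(r"<[^>]+>")                 # crude HTML tag matcher
NUMBER = re.compile(r"\d+(?:[.,]\d{1,2})?")  # e.g. 129, 129.9, 129,99


def parse_price(inner_html):
    """Return the first price-like number as a Decimal, or None if absent."""
    # Strip tags first so digits inside attributes are not mistaken for a price.
    text = TAG.sub(" ", inner_html)
    match = NUMBER.search(text)
    if match is None:
        return None
    # Normalise a decimal comma to a decimal point before conversion.
    return Decimal(match.group().replace(",", "."))
```

A `None` result for a site that previously returned prices suggests that its selector now points at the wrong element and needs review.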

```python
import csv
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def getUrls(file):
    """Return the URLs listed in a CSV file.

    Assumption: each row holds the URL in its first column.
    """
    data = []
    with open(file, newline='') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            if row:  # skip blank lines
                data.append(row[0].strip())
    return data


def writeData(data, fileName):
    """Write rows of cells to an HTML file as a single table."""
    cellOpen = '<td align="center" width="400" style="border: 1px solid #000000; padding-top:10px; padding-bottom:10px">'

    with open(fileName, mode='w', encoding='utf-8') as writer:
        writer.write('<!DOCTYPE html><html lang="en"><head><title>Report</title></head><body><table cellspacing="0" cellpadding="0">\n')

        for row in data:
            codeS = "<tr>"
            for cell in row:
                codeS = codeS + cellOpen + cell + '</td>'
            codeS = codeS + "</tr>\n"
            writer.write(codeS)

        writer.write('</table></body></html>')


def main():
    urls = getUrls('urlist.csv')
    data = []

    # Maps a fragment of the site URL to the CSS selector of its price element.
    selectors = {
        'site_x': 'css_selector',
        'site_y': 'css_selector',
        'site_z': 'css_selector',
    }

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-blink-features=AutomationControlled')

    # Start one browser and reuse it for every URL.
    driver = webdriver.Chrome(options=chrome_options)
    try:
        for url in urls:
            driver.get(url)
            time.sleep(2)  # give client-side rendering time to finish

            for site in selectors:
                if site in url:
                    try:
                        price = driver.find_element(By.CSS_SELECTOR, selectors[site]).get_attribute("innerHTML")
                        # One report row per URL: the page and its price markup.
                        data.append([url, price])
                    except NoSuchElementException as e:
                        print(e)
    finally:
        # quit() ends the whole browser session, not just the current window.
        driver.quit()

    writeData(data, 'index.html')


if __name__ == '__main__':
    main()
```
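
The loop in `main` picks a selector with the substring test `site in url`, which can also match a key that appears only in the path or query string. A stricter alternative is to compare against the URL's hostname. The sketch below assumes the dictionary keys are domain names; `selector_for` and the `shop-x.example` domain are hypothetical.

```python
from urllib.parse import urlparse


def selector_for(url, selectors):
    """Return the selector whose domain key matches the URL's hostname, else None."""
    host = urlparse(url).hostname or ""
    for domain, selector in selectors.items():
        # Accept the domain itself and any of its subdomains (e.g. "www.").
        if host == domain or host.endswith("." + domain):
            return selector
    return None


selectors = {"shop-x.example": ".price"}
print(selector_for("https://www.shop-x.example/item/1", selectors))
```

A return value of `None` means that no selector is configured for the URL, which the report could list explicitly.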
