A python script that scrapes e-commerce websites for new products and sends updates about them through Telegram.
PTSW has the ability to scrape multiple websites and update the xpath selectors for each of them, with a config.toml file. See below.
I needed a scraper that was specific enough for the need of finding new items on e-commerce sites, yet general enough so that I can scrape multiple sites with it.
Make sure you have Python 3 installed. I recommend Miniconda, I find it easier to manage your environments.
conda create -n webscrape python=3.9
conda activate webscrape
pip3 install -r requirements.txt
playwright install-deps
playwright install
python3 main.py -c <config_file> -e <env_chat_id>
You will need to create 2 files:
- A .env file, in the root of the repo, containing
TOKEN='<your-telegram-bot-token>'
CHAT_ID='<your-telegram-account-chat-id>'
Here's a guide on creating telegram bots and here's a guide showing how to find your chat id.
- A config.toml file, with the following structure
name = 'name'
link = '<website>'
root = '//div[@class="css-1apmciz"]'
next-page = '<next-page-xpath>'
type= 'product-listing'
price-limit = 2200
threshold = 10
[fields]
title = 'div/h6/text()'
price = 'div[1]/p/text()'
stoc = 'div[2]/p/text()[1]'
date = 'div[2]/p/text()[3]'
href = '../../../../a/@href'
The root-xpath needs to be a common xpath for all the other fields (something like an item container). We can place any number of xpaths in the fields dict. to the product via telegram (some products show the href in the html source relative to the current address, so we just concatenate the relative-href to the link field in order to get the full address).
We receive updates only when there are new products listed and their price is lower than the price-limit
, or when a product updates its price and the difference between the old price and the new price is greater than the threshold
.
I used the browser developer console and scrapy shell in order to find the correct xpaths for each website. Here is more information on scrapy and selectors. It's a painstaking process to get the exact xpaths for each field, but this is the strong suit of the project: to be able to iterate quickly on multiple sites.
Running the script will create a new.json
file in the data/<config_file_stem> folder and a old.json
file, which contains a copy of new.json. The old.json
file contains the items scraped on latest run, and the new.json
file contains items scraped on current run. The app will compare these to files to check if new products have appeared on the site. If that happens, it will send a telegram message with the new products.
I suggest setting up a cron job that runs this script periodically. Here's an example that runs it every 15 minutes and logs the output to log.txt:
*/15 * * * * cd ~/projects/PTSW/ && python3 main.py -c <config_file> -e <env_chat_id> >> log.txt
Beware that some sites might ban your ip if there is too much traffic.
You can also setup the environment using Dev containers. Just install the extension, open command menu -> Dev Containers: Rebuild and Reopen in Container