This is a Scrapy project with a spider that scrapes DevBG for all available Python job offers.
All data is stored in an SQLite3 database for consistency. Each record consists of position, company, location, posting date, and a link to the offer.
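For reference, a minimal sketch of what an SQLite item pipeline for these fields could look like (the database file, table, and column names here are illustrative assumptions, not necessarily the ones used in this project):

```python
# pipelines.py -- illustrative sketch, not necessarily the project's actual pipeline
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("jobs.db")  # hypothetical database file name
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS offers (
                   position TEXT, company TEXT, location TEXT,
                   posting_date TEXT, link TEXT UNIQUE)"""
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # INSERT OR IGNORE keeps the table consistent if the same link shows up again
        self.conn.execute(
            "INSERT OR IGNORE INTO offers VALUES (?, ?, ?, ?, ?)",
            (item["position"], item["company"], item["location"],
             item["posting_date"], item["link"]),
        )
        return item
```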
DeltaFetch is used to skip items that were already scraped in previous runs.
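It is enabled through the Scrapy settings; a typical scrapy-deltafetch configuration looks roughly like this (the exact values in this project's settings.py may differ):

```python
# settings.py -- typical scrapy-deltafetch configuration (illustrative)
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
# DELTAFETCH_RESET = True  # uncomment to forget previously seen requests
```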
After each scrape, an email with all unsent offers from the last 2 days is sent.
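A minimal sketch of how such an email could be sent with the standard library, assuming the credentials come from the config.py created in the setup steps below and that Gmail's SMTP server is used (both are assumptions, not necessarily what this project does):

```python
# mailer sketch -- assumes the config package from the setup steps and Gmail SMTP
import smtplib
from email.message import EmailMessage

from config.config import EMAIL_USER, EMAIL_PASS  # hypothetical import path

def send_offers(offers):
    """Send one email listing the given offers (a list of dicts)."""
    msg = EmailMessage()
    msg["Subject"] = "New Python job offers"
    msg["From"] = EMAIL_USER
    msg["To"] = EMAIL_USER
    body = "\n".join(
        f"{o['position']} at {o['company']} - {o['link']}" for o in offers
    )
    msg.set_content(body or "No new offers.")

    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(EMAIL_USER, EMAIL_PASS)
        smtp.send_message(msg)
```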
Scraper runs are scheduled with cron.
The scraper is meant to be hosted on a server; I run it on an AWS EC2 instance.
- Clone the repo

      git clone https://github.com/ivo-bass/scrapers.git
- Create a virtual environment

      cd scrapers
      python3 -m venv venv
- Activate the virtual environment

      source venv/bin/activate
- Install the requirements

      pip install -r requirements.txt
- Create a config file for the email sender credentials

      cd devBG/devBG
      mkdir config && cd config && touch __init__.py
      echo "
      EMAIL_USER = 'change_this_to_your_email_address'
      EMAIL_PASS = 'change_this_to_your_email_password'
      " > config.py
      cd ../../..
- Execute the scraping (results will be available in the database)

      cd devBG/devBG
      scrapy crawl job
- Extract results to CSV or JSON (if needed). An email will be sent on each execution.

      scrapy crawl job -O results.csv

  or

      scrapy crawl job -O results.json
- Install cron if it is not present
- Set the cron task
- Open the cron task scheduler

      crontab -e
- Set the runtime. You can generate the time code HERE. (A sketch of what autorun.sh might contain is shown after this list.)

      * * * * * sh /path/to/script/autorun.sh >> autorun.log
- Check if the task is set

      crontab -l
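For reference, autorun.sh (the script invoked by cron above) might look roughly like this; the clone location and virtual environment path are assumptions to adjust for your own setup:

```bash
#!/bin/sh
# autorun.sh -- illustrative sketch; adjust the paths to your own checkout
cd /home/ubuntu/scrapers || exit 1   # assumed clone location
. venv/bin/activate                  # activate the virtualenv created during setup
cd devBG/devBG
scrapy crawl job                     # scrapes, stores to the DB, and sends the email
```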
Enjoy!