
[screenshot: selenium-py.png]

email-jobs-response-apply

💻 (11-Aug-2023) See your Python code do web browsing on your screen with GUI.

Important note:

Before you try to scrape any website, review its robots.txt file, which you can access at domainname/robots.txt. It lists the paths that are allowed and disallowed for crawling. Do not violate the terms of service of any website you scrape.
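
A quick way to check a path before crawling it is Python's built-in urllib.robotparser; a minimal sketch (the domain below is a placeholder):

from urllib.robotparser import RobotFileParser

# Placeholder domain; substitute the site you intend to crawl.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

# True if the rules allow a generic user agent to fetch this path.
print(robots.can_fetch("*", "https://example.com/jobs"))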

With Selenium we're limited to a maximum of 10 concurrent sessions (reference).

I've successfully tested 1,000 site crawls in a single process, which took 3 hours, 44 minutes, and 47 seconds.

At roughly 4 hours per 1,000 sites, a single process covers about 2,000 sites in an 8-hour night.

With 10 parallel sessions, that scales to 2,000 × 10 = 20,000 sites per night per machine.
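
As a back-of-envelope check of that figure (the timing is the measured run above; the 8-hour night is an assumption):

sites_per_run = 1000       # measured: 1,000 crawls in ~3h 45m, call it 4 hours
runs_per_night = 8 // 4    # two 4-hour runs fit in an 8-hour night
parallel_sessions = 10     # Selenium's cap on concurrent sessions
print(sites_per_run * runs_per_night * parallel_sessions)  # 20000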

Installation

cp .env.example .env
pip3 install virtualenv && \
  virtualenv env && \
  source env/bin/activate
# chromedriver_mac64 (macOS)
# chromedriver_win32 (Windows)
# See https://chromedriver.storage.googleapis.com
# for the full list of drivers.
wget https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/chromedriver
chromedriver --version

Update config.json with your real credentials.
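
To confirm the browser driver is wired up, a minimal sketch like the one below (assuming Selenium 4's API and the chromedriver binary installed to /usr/bin/chromedriver above; the URL is a placeholder) should open a visible Chrome window:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at the chromedriver binary installed above.
driver = webdriver.Chrome(service=Service("/usr/bin/chromedriver"))
driver.get("https://example.com")  # placeholder page
print(driver.title)
driver.quit()

If a Chrome window appears and the page title prints, the installation works. Older Selenium releases take executable_path directly instead of a Service object.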

Usage

Refer to an XPath cheat sheet when writing element selectors.

Update the command at ./management/commands/crawl.py

alias py3="python3"
py3 manage.py crawl
# The app runs at `http://localhost:3000`.
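
For orientation, a crawl command has roughly the shape sketched below. This is illustrative only, not the contents of the repository's crawl.py; the target URL and XPath are placeholders.

from django.core.management.base import BaseCommand
from selenium import webdriver
from selenium.webdriver.common.by import By

class Command(BaseCommand):
    help = "Crawl a page and print elements matched by an XPath."

    def handle(self, *args, **options):
        driver = webdriver.Chrome()  # assumes chromedriver is on PATH
        try:
            driver.get("https://example.com")  # placeholder URL
            # Collect every link on the page via an XPath locator.
            for link in driver.find_elements(By.XPATH, "//a[@href]"):
                self.stdout.write(link.get_attribute("href"))
        finally:
            driver.quit()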

If you still need help installing and running the app, check out the README at https://github.com/kkamara/python-react-boilerplate, which is the base system for this Python-Selenium app.

Using Docker?

alias compose='docker-compose -f local.yml'
compose build
compose up
# Automated runs with Docker:
# compose up --build -d && python3 manage.py crawl

IPython Django Shell

python3 manage.py shell -i ipython
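
Inside the shell you can work with the ORM interactively, for example (a generic Django snippet, not tied to this project's models):

from django.contrib.auth import get_user_model

User = get_user_model()
print(User.objects.count())  # number of registered users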

API

python manage.py show_urls

View the API collection here.

Admin

Admin credentials are set in ./compose/local/django/start.

export DJANGO_SUPERUSER_PASSWORD=secret

python manage.py createsuperuser \
  --username admin_user \
  --email admin@django-app.com \
  --no-input \
  --first_name Admin \
  --last_name User

Cache React app & view templates

py3 manage.py collectstatic

Mail Server

[screenshot: docker-mailhog.png]

Mail environment credentials are at .env.

The MailHog Docker image serves its web UI at http://localhost:8025.
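
MailHog accepts SMTP on port 1025 by default, so the Django mail settings read from .env will look roughly like the sketch below (values and variable names are illustrative; check .env for the real ones):

# Sketch of Django SMTP settings pointing at the MailHog container.
EMAIL_BACKEND = "django.core.mail.backends.smtp.EmailBackend"
EMAIL_HOST = "mailhog"   # "localhost" when running outside Docker
EMAIL_PORT = 1025        # MailHog's SMTP port; its web UI is on 8025
EMAIL_USE_TLS = False

Messages sent through this backend appear in the MailHog UI at http://localhost:8025.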

Misc

See the Python Amazon scraper.

See the Python React boilerplate.

See the Amazon scraper (proven in a production environment).

See the PHP scraper.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

BSD
