
Integrating Scrapy Cloud + crawlera-headless-proxy

Sample Scrapy project demonstrating the integration of crawlera-headless-proxy with Scrapy Cloud through a custom Docker image. To demonstrate it, we use Selenium with Firefox through its geckodriver.

Please do not assume this is the best way to integrate Selenium into your spider. The goal here is to showcase the deployment of crawlera-headless-proxy.

Based on this KB
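
For context, the demo spider drives Firefox through Selenium and points the browser at crawlera-headless-proxy instead of using Scrapy's own proxy settings. A minimal sketch of that wiring, assuming the proxy's default listening address of localhost:3128; the spider name demo matches the commands in this README, but the exact options in this repo's spider may differ:

import scrapy
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        options = Options()
        options.add_argument("-headless")  # no display inside the container
        # The proxy re-signs TLS traffic with its own CA, so accept its certs.
        options.accept_insecure_certs = True
        # Route all browser traffic through crawlera-headless-proxy.
        options.set_preference("network.proxy.type", 1)  # manual proxy config
        options.set_preference("network.proxy.http", "localhost")
        options.set_preference("network.proxy.http_port", 3128)
        options.set_preference("network.proxy.ssl", "localhost")
        options.set_preference("network.proxy.ssl_port", 3128)
        self.driver = webdriver.Firefox(options=options)

    def parse(self, response):
        # Re-fetch the page with the Selenium-driven browser so the request
        # actually goes out through the proxy.
        self.driver.get(response.url)
        yield {"url": response.url, "title": self.driver.title}

    def closed(self, reason):
        self.driver.quit()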

Deploying on Scrapy Cloud

Install shub

pip install shub

Modify the scrapinghub.yml file and replace <YOUR PROJECT ID> with your actual project ID

project: <YOUR PROJECT ID>
requirements_file: ./requirements.txt
image: true

Deploy your project to Scrapy Cloud

$ shub login

Enter your API key from https://app.scrapinghub.com/account/apikey
API key: ********************************
Validating API key...
API key is OK, you are logged in now.

$ shub deploy

Building images.scrapinghub.com/project/<YOUR PROJECT ID>:1.0.
Steps: 100%|█████████████| 12/12
The image images.scrapinghub.com/project/<YOUR PROJECT ID>:1.0 build is completed.
Login to images.scrapinghub.com succeeded.
Pushing images.scrapinghub.com/project/<YOUR PROJECT ID>:1.0 to the registry.
b58632e02b0f: 100%|█████████████| 53.8k/53.8k [2.55kB/s]
9cf43d5c0161: 100%|█████████████| 33.8k/33.8k [1.61kB/s]
The image images.scrapinghub.com/project/<YOUR PROJECT ID>:1.0 pushed successfully.
Deploying images.scrapinghub.com/project/<YOUR PROJECT ID>:1.0
You can check deploy results later with 'shub image check --id 1'.
Progress: 100%|█████████████| 100/100
Deploy results:
{'status': 'ok', 'project': <YOUR PROJECT ID>, 'version': '1.0', 'spiders': 1}

Run the job on Scrapy Cloud, passing in your Crawlera API key using either an environment variable or a spider argument

$ shub schedule -e CRAWLERA_APIKEY=<API KEY> <YOUR PROJECT ID>/demo
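
Inside the spider, the key can be picked up from either source. A minimal sketch of that lookup, assuming the hypothetical argument name crawlera_apikey (this repo's spider may resolve it differently):

import os
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"

    def __init__(self, crawlera_apikey=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Prefer the spider argument, then fall back to the environment
        # variable set with shub schedule -e.
        self.crawlera_apikey = crawlera_apikey or os.environ.get("CRAWLERA_APIKEY")
        if not self.crawlera_apikey:
            raise ValueError("No Crawlera API key provided")

With the spider-argument form, the equivalent invocation would be:

$ shub schedule -a crawlera_apikey=<API KEY> <YOUR PROJECT ID>/demo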

Watch the log on the command line:
    shub log -f <YOUR PROJECT ID>/1/1
or print items as they are being scraped:
    shub items -f <YOUR PROJECT ID>/1/1
or watch it running in Scrapinghub's web interface:
    https://app.scrapinghub.com/p/<YOUR PROJECT ID>/1/1

Running Locally

Create a virtualenv

$ virtualenv .venv && source ./.venv/bin/activate

Install scrapy and the project requirements

(.venv) $ pip install -r requirements.txt
...

Follow the installation instructions for crawlera-headless-proxy for your platform

Run crawlera-headless-proxy in a dedicated terminal/shell. It needs to be running for our demo spider to connect to it. (Hit Ctrl+C to kill it and release the terminal.)

$ crawlera-headless-proxy -d -a <CRAWLERA API KEY>
# OR
$ docker run -p 3128:3128 scrapinghub/crawlera-headless-proxy -d -a <CRAWLERA API KEY>
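
Before starting the crawl, you can sanity-check that the proxy is reachable by sending a request through it. A quick sketch using requests; verify=False is needed because the proxy re-signs TLS traffic with its own certificate:

import requests

# crawlera-headless-proxy listens on localhost:3128 by default.
proxies = {
    "http": "http://localhost:3128",
    "https": "http://localhost:3128",
}
response = requests.get("https://example.com", proxies=proxies, verify=False)
print(response.status_code)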

Run the project

(.venv) $ scrapy crawl demo -o out.json
