# pa-scraper

`pa-scraper` is a Python wrapper for the Prompt API scraper api, with a little extra cream and sugar.
- You need to sign up for Prompt API
- You need to subscribe to the scraper api; the test drive is free!
- You need to set the `PROMPTAPI_TOKEN` environment variable after subscription.
Then:

```bash
$ pip install pa-scraper
```
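Since the wrapper reads the token from the environment, it can be handy to sanity-check that `PROMPTAPI_TOKEN` is actually visible to your process before fetching anything. A minimal sketch (this check is illustrative, not part of `pa-scraper`):

```python
import os

# Illustrative check, not part of pa-scraper: fail fast if the
# PROMPTAPI_TOKEN environment variable was never exported.
if not os.environ.get('PROMPTAPI_TOKEN'):
    raise SystemExit('PROMPTAPI_TOKEN is not set; export it before using pa-scraper.')
```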
Examples can be found under the `examples/` directory.
```python
# examples/fetch.py

from scraper import Scraper

url = 'https://pypi.org/classifiers/'
scraper = Scraper(url)

response = scraper.get()
if response.get('error', None):
    # response['error'] returns the error message
    # response['status'] returns the http status code
    # Example: {'error': 'Not Found', 'status': 404}
    print(response)  # noqa: T001
else:
    data = response['result']['data']
    headers = response['result']['headers']
    url = response['result']['url']
    status = response['status']

    # print(data)  # print fetched html, will be long :)

    print(headers)  # noqa: T001
    # {'Content-Length': '321322', 'Content-Type': 'text/html; charset=UTF-8', ... }

    print(status)  # noqa: T001
    # 200

    save_result = scraper.save('/tmp/my-data.html')  # noqa: S108
    if save_result.get('error', None):
        # save error occurred...
        # add your code here...
        pass
    print(save_result)  # noqa: T001
    # {'file': '/tmp/my-data.html', 'size': 321322}
```
You can add url parameters for extra operations. Valid parameters are:

- `auth_password`: for HTTP Realm auth password
- `auth_username`: for HTTP Realm auth username
- `cookie`: URL encoded cookie header.
- `country`: 2-character country code, if you wish to scrape from an IP address of a specific country.
- `referer`: HTTP referer header
- `selector`: CSS style selector path such as `a.btn div li`. If `selector` is enabled, the returned result will be a collection of data and the saved file will be in `.json` format.
Here is an example using url parameters and `selector`:
```python
# examples/fetch_with_params.py

from scraper import Scraper

url = 'https://pypi.org/classifiers/'
scraper = Scraper(url)

fetch_params = dict(country='EE', selector='ul li button[data-clipboard-text]')
response = scraper.get(params=fetch_params)
if response.get('error', None):
    # response['error'] returns the error message
    # response['status'] returns the http status code
    # Example: {'error': 'Not Found', 'status': 404}
    print(response)  # noqa: T001
else:
    data = response['result']['data']
    headers = response['result']['headers']
    url = response['result']['url']
    status = response['status']

    # print(data)  # noqa: T001
    # ['<button class="button button--small margin-top margin-bottom copy-tooltip copy-tooltip-w" ...\n', ]

    print(len(data))  # noqa: T001
    # 734
    # we have an array...

    print(headers)  # noqa: T001
    # {'Content-Length': '321322', 'Content-Type': 'text/html; charset=UTF-8', ... }

    print(status)  # noqa: T001
    # 200

    save_result = scraper.save('/tmp/my-data.json')  # noqa: S108
    if save_result.get('error', None):
        # save error occurred...
        # add your code here...
        pass
    print(save_result)  # noqa: T001
    # {'file': '/tmp/my-data.json', 'size': 174449}
```
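Since `selector` returns a list of HTML fragments rather than a single page, you will usually want to post-process it. A minimal sketch using the standard-library `html.parser`, assuming the fragments are `<button>` tags carrying a `data-clipboard-text` attribute as in the example output above (this helper is illustrative, not part of `pa-scraper`):

```python
from html.parser import HTMLParser


class ClipboardTextParser(HTMLParser):
    # Collects data-clipboard-text attribute values from <button> tags.
    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == 'button' and 'data-clipboard-text' in attr_map:
            self.values.append(attr_map['data-clipboard-text'])


parser = ClipboardTextParser()
for fragment in data:  # `data` is the list from response['result']['data']
    parser.feed(fragment)
print(parser.values[:3])  # first few classifier strings
```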
The default timeout value is set to `10` seconds. You can change this while
initializing the instance:

```python
scraper = Scraper(url, timeout=50)  # 50 seconds timeout...
```
You can also add custom headers, prefixed with `X-`. The example below shows
how to add extra request headers and set the timeout:
```python
# pylint: disable=C0103

from scraper import Scraper

if __name__ == '__main__':
    url = 'https://pypi.org/classifiers/'
    scraper = Scraper(url)

    fetch_params = dict(country='EE', selector='ul li button[data-clipboard-text]')
    custom_headers = {
        'X-Referer': 'https://www.google.com',
        'X-User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
    }
    timeout = 30
    response = scraper.get(params=fetch_params, headers=custom_headers, timeout=timeout)

    if response.get('error', None):
        # response['error'] returns the error message
        # response['status'] returns the http status code
        # Example: {'error': 'Not Found', 'status': 404}
        print(response)  # noqa: T001
    else:
        data = response['result']['data']
        headers = response['result']['headers']
        url = response['result']['url']
        status = response['status']

        # print(data)  # noqa: T001
        # ['<button class="button button--small margin-top margin-bottom copy-tooltip copy-tooltip-w" ...\n', ]

        print(len(data))  # noqa: T001
        # 734

        print(headers)  # noqa: T001
        # {'Content-Length': '321322', 'Content-Type': 'text/html; charset=UTF-8', ... }

        print(status)  # noqa: T001
        # 200

        save_result = scraper.save('/tmp/my-data.json')  # noqa: S108
        if save_result.get('error', None):
            # save error occurred...
            # add your code here...
            pass
        print(save_result)  # noqa: T001
        # {'file': '/tmp/my-data.json', 'size': 174449}
```
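Since every failure surfaces as an `error` key in the response dict, transient network problems can be handled with a small retry loop. A minimal sketch (the `fetch_with_retries` helper below is a hypothetical convenience, not part of `pa-scraper`):

```python
import time

from scraper import Scraper


def fetch_with_retries(url, retries=3, delay=2):
    # Hypothetical helper: retry scraper.get() while the response
    # carries an 'error' key, waiting `delay` seconds between tries.
    scraper = Scraper(url, timeout=30)
    response = scraper.get()
    for _ in range(retries - 1):
        if not response.get('error', None):
            break
        time.sleep(delay)
        response = scraper.get()
    return response


response = fetch_with_retries('https://pypi.org/classifiers/')
```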
## License

This project is licensed under MIT
## Contributors

- Prompt API - Creator, maintainer
## Contribution

All PRs are welcome!

1. `fork` (https://github.com/promptapi/scraper-py/fork)
2. Create your `branch` (`git checkout -b my-feature`)
3. `commit` yours (`git commit -am 'Add awesome features...'`)
4. `push` your `branch` (`git push origin my-feature`)
5. Then create a new Pull Request!
This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.