Provides Python access to Google's parser for robots.txt files, as used by their GoogleBot web scraper.
Websites may provide an optional robots.txt file in their domain's root to govern the access and behavior of web scrapers. GoogleBot, one of the best-known web scrapers, is responsible for promoting this standard, and sites interested in SEO tend to conform closely to GoogleBot behavior.
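Because the file lives at the root of the domain, its location can be derived from any page URL on the same host. The following is a minimal sketch using only the standard library; the helper name robots_txt_url is hypothetical and is not part of this package.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL at the root of page_url's host (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?id=1"))
# prints: https://example.com/robots.txt
```

The contents fetched from that URL are what a robots.txt parser operates on.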
All credit for the parser goes to the Google team who created, open sourced and promoted it.
SEO (Search Engine Optimization): The process of modifying a website's content or metadata to boost its ranking in search engines' page indexes. Higher rankings lead to higher positions in user searches, which leads to more visitors. For further details, see the SEO Wikipedia page.
Basic usage with the RobotsMatcher class provided by Google.
import jwm.robotstxt.googlebot
robotstxt = """
user-agent: GoodBot
allow: /path
"""
matcher = jwm.robotstxt.googlebot.RobotsMatcher()
assert matcher.AllowedByRobots(robotstxt, ("GoodBot",), "/path")
Check out the documentation for further details. For more use cases, see the test cases for jwm.robotstxt and robotstxt.
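When several paths need checking against the same robots.txt, the single call shown above can be wrapped in a small helper. This is a sketch under stated assumptions: filter_allowed and the Matcher protocol are hypothetical names, not part of the package, and the code assumes one matcher instance can be reused across calls. It relies only on the AllowedByRobots call shape from the basic usage example.

```python
from typing import Iterable, List, Protocol, Tuple


class Matcher(Protocol):
    """Any object exposing the AllowedByRobots call used in the basic usage example."""

    def AllowedByRobots(self, robotstxt: str, user_agents: Tuple[str, ...], url: str) -> bool:
        ...


def filter_allowed(
    matcher: Matcher, robotstxt: str, user_agent: str, paths: Iterable[str]
) -> List[str]:
    """Return the paths that the matcher reports as allowed for user_agent."""
    # Assumption: the matcher instance may be reused for successive checks.
    return [
        path
        for path in paths
        if matcher.AllowedByRobots(robotstxt, (user_agent,), path)
    ]
```

With the package installed, passing jwm.robotstxt.googlebot.RobotsMatcher() as the matcher argument should work, since it is called exactly as in the example above.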
Install from PyPI under the jwm.robotstxt distribution.
pip install jwm.robotstxt
Import into your program through the jwm.robotstxt.googlebot package.
import jwm.robotstxt.googlebot
It is highly recommended to install Python projects into a virtual environment. See PEP 405 for the motivation.
Create a virtual environment in the .venv directory.
python3 -m venv ./.venv
Activate with the correct command for your system.
# Linux/macOS
. ./.venv/bin/activate
# Windows
.\.venv\Scripts\activate
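To confirm that activation worked, the interpreter itself can report whether it is running inside a virtual environment. A minimal sketch, assuming an environment created with the standard venv module as above; the function name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```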
Make sure you have cloned the repository and its submodules.
git clone --recurse-submodules https://github.com/jwmorley73/jwm.robotstxt.git
Install the project using pip. This builds the required robotstxt static library files and links them into the resulting Python package.
pip install .
If you want to include the developer tooling, add the dev optional dependencies.
pip install ".[dev]"
- 32-bit Windows is not supported.