BetParser Crawler is a Python application designed to parse and extract betting odds from websites. It leverages the Scrapy framework, Firebase, custom Selenium integrations, and ML for efficient data extraction and processing.
This software is provided for educational and research purposes only. The authors of this project do not condone or encourage any illegal activities, including but not limited to unauthorized data scraping or infringement of intellectual property rights.
All trademarks, logos, and brand names mentioned in this project (e.g., Bwin, Bet365, William Hill, Sisal, Eurobet, etc.) are the property of their respective owners. The use of these names is for identification purposes only and does not imply endorsement or affiliation.
By using this software, you agree that the authors are not liable for any misuse or legal consequences arising from its use. It is your responsibility to ensure compliance with all applicable laws and regulations in your jurisdiction.
BetParser Crawler simplifies the process of extracting betting odds from web pages. It supports parsing complex JavaScript-powered pages using Selenium and includes machine learning algorithms to standardize team names. The project is highly configurable and integrates with Firebase for real-time database updates.
- Web Scraping: Extract betting odds from multiple websites using Scrapy spiders.
- JavaScript Rendering: Handle JavaScript-heavy pages with Selenium and Splash integrations.
- Machine Learning: Standardize team names using machine learning algorithms for word similarity.
- Firebase: Integrate with Firebase for real-time database updates.
- Configurable Spiders: Pre-configured spiders for well-known brokers like Bwin, Bet365, William Hill, Sisal, and Eurobet.
- Proxy Support: Enable proxy rotation and Tor to avoid bans during scraping.
Clone this repository to your local machine using:
```
git clone https://github.com/mtmarco87/betparser_crawler.git
```

- Download and install Anaconda3 with Python 3 from the Anaconda Download Page.
- Open the Anaconda prompt.
- Create a new environment with Python 3.9:
conda create -n <env_name> python=3.9
- Activate the environment:
conda activate <env_name>
Tip: Manage Conda environments with the following commands:
- `conda activate <env_name>`: Activate an environment.
- `conda deactivate`: Deactivate the current environment.
- `conda env list`: List all environments.
- `conda env remove -n <env_name>`: Remove an environment.
- Install dependencies:
pip install -r requirements.txt
- Alternatively, install libraries individually:
pip install Scrapy Scrapy-UserAgents scrapy-splash selenium firebase-admin numpy nltk Unidecode googletrans stem torrequest urllib3 requests pytz
- If issues arise, install specific versions:
pip install Scrapy==2.12.0 Scrapy-UserAgents==0.0.1 scrapy-splash==0.11.1 selenium==4.31.0 firebase-admin==6.7.0 numpy==2.0.2 nltk==3.9.1 Unidecode==1.3.8 googletrans==2.4.0 stem==1.7.1 torrequest==0.1.0 urllib3==2.4.0 requests==2.32.3 pytz==2025.2
Selenium is a powerful tool for interacting with JavaScript-heavy pages. It allows automated web testing and renders pages as they would appear in a browser. BetParser includes a custom Scrapy-Selenium middleware for handling complex, JS/Angular-powered pages.
1. Install Chrome or Firefox:
   - Chrome (recommended):
     - Install Chrome.
     - (Optional) Download the ChromeDriver and place it in `bet_parser/libs/selenium_drivers/chromedriver`. If you skip this step, the middleware will automatically download the driver when needed.
   - Firefox:
     - Install Firefox.
     - (Optional) Download the GeckoDriver and place it in `bet_parser/libs/selenium_drivers/geckodriver`. If you skip this step, the middleware will automatically download the driver when needed.
2. Create a Chrome Browser Profile:
   - Open Chrome and create a new user profile.
   - Locate the profile folder on your system (search online for instructions specific to your OS).
   - Copy the profile folder to `bet_parser/libs/selenium_drivers/chrome_profiles`.
3. Update Settings:
   - Edit `bet_parser/settings.py` in the "Selenium config" section and update the following (a sketch of this section is shown after these steps):
     - `SELENIUM_CHROME_USER_DATA_DIR`: Path to the Chrome profile folder.
     - `SELENIUM_CHROME_DRIVER` (optional): Path to the ChromeDriver binary. Set to `None` for automatic driver management, or specify the path if you want to use a custom driver.
     - `SELENIUM_FIREFOX_DRIVER` (optional): Path to the GeckoDriver binary. Set to `None` for automatic driver management, or specify the path if you want to use a custom driver.
   - You only need to configure the settings for the browser you plan to use (Chrome or Firefox).
4. Handle Protected Pages:
   - Some websites display pages only after user interaction. Use the Chrome profile to manually visit these pages and accept any banners or prompts, so that valid cookies are generated.
   - Selenium will reuse this profile to access these pages during scraping.
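For reference, the relevant part of the "Selenium config" section might look like the following. This is a minimal sketch with illustrative values; the exact paths and defaults in your `bet_parser/settings.py` may differ, and the `my_profile` folder name is a hypothetical placeholder.

```python
# bet_parser/settings.py -- "Selenium config" section (illustrative values)
import os

# BOT_PATH is already defined in the project's settings; shown here for completeness.
BOT_PATH = os.path.dirname(os.path.abspath(__file__))

# Chrome profile created and copied as described in the steps above
# ("my_profile" is a hypothetical folder name).
SELENIUM_CHROME_USER_DATA_DIR = BOT_PATH + "/libs/selenium_drivers/chrome_profiles/my_profile"

# Set to None to let the middleware download a matching driver automatically,
# or point to a custom driver binary.
SELENIUM_CHROME_DRIVER = None   # e.g. BOT_PATH + "/libs/selenium_drivers/chromedriver"
SELENIUM_FIREFOX_DRIVER = None  # e.g. BOT_PATH + "/libs/selenium_drivers/geckodriver"
```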
- Create a Firebase account and database named `parsed_bets`.
- Enable a Firebase app and download the service account key JSON file.
- Copy the downloaded service account key JSON file to the `libs/firebase` directory in your project and rename it to `credentials.json` if necessary.
- Update the Firebase configuration in `bet_parser/settings.py`:
  - Update `authDomain` and `databaseURL` with your Firebase project details.
  - Example:

    ```python
    FIREBASE_CONFIG = {
        "serviceAccountKeyPath": BOT_PATH + "/libs/firebase/credentials.json",
        "authDomain": "your-project-id.firebaseapp.com",
        "databaseURL": "https://your-project-id.firebaseio.com",
        "storageBucket": ""
    }
    ```
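To verify the connection, you can initialize `firebase-admin` with the same credentials and write a test record. This is a sketch of standard `firebase-admin` usage rather than the project's own pipeline code; the `test_match` payload is hypothetical.

```python
import firebase_admin
from firebase_admin import credentials, db

# Initialize the app with the service account key and database URL
# (same values as FIREBASE_CONFIG in bet_parser/settings.py).
cred = credentials.Certificate("libs/firebase/credentials.json")
firebase_admin.initialize_app(cred, {
    "databaseURL": "https://your-project-id.firebaseio.com",
})

# Write a test record under the parsed_bets node.
ref = db.reference("parsed_bets")
ref.child("test_match").set({"home": "Team A", "away": "Team B", "odd_1": 2.10})
```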
Run a spider using the following command in the terminal from the project directory:

```
scrapy crawl <spider_name>
```

Replace `<spider_name>` with the name of the spider you want to run. All available spiders can be found in the `bet_parser/spiders` and `bet_parser/spiders_api` directories by checking the `name` field in each spider file.
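Spiders can also be launched programmatically with Scrapy's standard `CrawlerProcess` API, which can be convenient for scripting or scheduling. A minimal sketch, assuming a spider named `bwin` exists in the project:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load bet_parser/settings.py the same way the CLI does
# (run from the project directory so scrapy.cfg is found).
process = CrawlerProcess(get_project_settings())
process.crawl("bwin")  # hypothetical spider name; use any name from the spiders directories
process.start()        # blocks until the crawl finishes
```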
After extracting betting odds, team names often appear in different formats or languages, making it difficult to identify unique matches. To address this, BetParser includes a machine learning-based mapper that standardizes team names using word similarity algorithms.
1. Team Name Standardization:
   - The mapper checks each team name against a pre-defined dataset (`team_names.csv`).
   - If a match is found, the standardized name is used.
2. Handling Unknown Names:
   - If no match is found, the name is logged in `to_validate.txt` for manual review.
   - This ensures new names are added to the dataset for future use.
3. Manual Validation:
   - Open `to_validate.txt` and compare each name with entries in `team_names.csv`.
   - If a name exists in another form or language, add the new form to `team_names.csv` and map it to the standardized name.
   - For completely new names, add them to `team_names.csv` with a standardized English version and any known variations.
4. Improving Accuracy:
   - Regularly update `team_names.csv` to reduce the size of `to_validate.txt`.
   - Add as many variations of team names as possible to avoid repeated manual validation.

- The mapper's behavior can be fine-tuned in the "Machine Learning config" section of `bet_parser/settings.py`.
- The current configuration is optimized for most scenarios but can be adjusted as needed.
This process ensures accurate and consistent team name mapping, which is critical for the crawler's functionality.
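To illustrate the idea of word-similarity matching, the sketch below normalizes a scraped name and looks for the closest entry in `team_names.csv` using `nltk`'s edit distance. This is an illustration of the technique, not the project's exact mapper; the CSV layout (standardized name first, known variations after it) and the `0.3` threshold are assumptions.

```python
import csv
from nltk import edit_distance
from unidecode import unidecode

def load_team_names(path="team_names.csv"):
    """Assumed layout: standardized name first, known variations after it."""
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            for variant in row:
                mapping[unidecode(variant).lower().strip()] = row[0]
    return mapping

def standardize(name, mapping, max_rel_distance=0.3):
    key = unidecode(name).lower().strip()
    if key in mapping:  # exact hit against a known variant
        return mapping[key]
    # Fall back to the closest known variant by relative edit distance.
    best, best_rel = None, 1.0
    for variant, std in mapping.items():
        rel = edit_distance(key, variant) / max(len(key), len(variant))
        if rel < best_rel:
            best, best_rel = std, rel
    if best is not None and best_rel <= max_rel_distance:
        return best
    # Unknown name: queue it for manual review.
    with open("to_validate.txt", "a", encoding="utf-8") as f:
        f.write(name + "\n")
    return name
```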
BetParser leverages the Scrapy framework for efficient web scraping. Scrapy allows you to define spiders to crawl websites and extract structured data, such as betting odds.
Key steps for working with Scrapy in this project:
1. Add a new spider:

   ```
   scrapy genspider <spider_name> <domain>
   ```

2. Customize spiders:
   - Modify the generated spider files in the `spiders/` directory to define the crawling logic and data extraction rules.
3. Extend functionality:
   - Use Scrapy middlewares, pipelines, and settings to customize the scraping process (a minimal spider sketch follows this list).
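As an orientation point, a minimal spider has the shape below. This is a generic Scrapy sketch, not one of the project's spiders; the name, URL, and CSS selectors are hypothetical placeholders.

```python
import scrapy

class ExampleOddsSpider(scrapy.Spider):
    name = "example_odds"                      # hypothetical spider name
    start_urls = ["https://example.com/odds"]  # hypothetical URL

    def parse(self, response):
        # Selectors are placeholders; adapt them to the target page structure.
        for row in response.css("div.match-row"):
            yield {
                "match": row.css("span.teams::text").get(),
                "odd_1": row.css("span.odd-1::text").get(),
                "odd_x": row.css("span.odd-x::text").get(),
                "odd_2": row.css("span.odd-2::text").get(),
            }
```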
Tip: BetParser's Selenium Middleware Features
- Extracts data from JavaScript-heavy web pages.
- Automatically creates a temporary copy of the Chrome profile to prevent bloating the folder.
`SeleniumRequest` parameters include:
- `driver`: Can be `'chrome'` or `'firefox'`.
- `render_js`: Set to `true` to extract the fully rendered DOM using JavaScript execution; set to `false` for standard HTML extraction with Selenium.
- `wait_time` and `wait_until`: Define wait conditions for page rendering.
- `headless`: Run in headless mode (no browser window).
- `script`: Execute custom JavaScript before extraction.
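A request through the middleware might look like the following sketch. The parameter names come from the list above, but the import path, spider context, and URL are assumptions.

```python
import scrapy

# Import path is an assumption; adjust to wherever the project defines SeleniumRequest.
from bet_parser.middlewares.SeleniumMiddleware import SeleniumRequest

class JsOddsSpider(scrapy.Spider):
    name = "js_odds_example"  # hypothetical spider name

    def start_requests(self):
        # Render a JS-heavy page with headless Chrome, waiting up to 10 seconds.
        yield SeleniumRequest(
            url="https://example.com/live-odds",  # hypothetical URL
            callback=self.parse,
            driver="chrome",
            render_js=True,
            wait_time=10,
            headless=True,
        )

    def parse(self, response):
        self.logger.info("Rendered page length: %d", len(response.text))
```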
For more details, refer to the Scrapy documentation.
The repository includes a pre-configured launch.json for debugging Scrapy spiders. To use it:
- Open the Run and Debug panel in VS Code (`Cmd+Shift+D` or `Ctrl+Shift+D`).
- Select the `Scrapy Spider Debug` configuration.
- Press the green "Start Debugging" button or hit `F5`.
To debug a different spider, edit the "args" field in .vscode/launch.json:
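For example, the field would become the following (only the `"args"` entry of the existing configuration needs to change; `<spider_name>` is the placeholder for the spider you want to debug):

```json
"args": ["crawl", "<spider_name>"]
```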
Ensure the correct Python interpreter (e.g., Conda environment) is selected via Python: Select Interpreter in the Command Palette.
- Open PyCharm and configure the project interpreter to use the environment created earlier.
- Add a Python Run/Debug Configuration for each spider:
  - Script path: `<conda_env_path>/Lib/site-packages/scrapy/cmdline.py`
  - Parameters: `crawl <spider_name>`
  - Working directory: `<project_directory>`
  - Under Execution, check "Run with Python Console" (otherwise debugging will work, but plain Run will be broken).
- Use the `scrapy check` command to validate spiders.
Use Splash as an alternative to Selenium for rendering and extracting data from complex JavaScript-powered pages:
1. Install Docker and run Splash:

   ```
   docker pull scrapinghub/splash
   docker run -p 8050:8050 scrapinghub/splash
   ```

2. Update `bet_parser/settings.py` to configure Splash (see the sketch below).
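The standard `scrapy-splash` wiring looks like the following. This sketch is based on the scrapy-splash documentation; the project's own settings file may already contain an equivalent (possibly commented-out) block.

```python
# bet_parser/settings.py -- Splash configuration (standard scrapy-splash setup)

SPLASH_URL = "http://localhost:8050"  # the Docker container started above

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```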
Enable Tor or proxy rotation to avoid bans when requesting and parsing pages at high frequency. These features are experimental and may require additional refinement. To enable them, update the relevant settings in bet_parser/settings.py.
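For reference, routing a request through Tor with the `torrequest` dependency works roughly as follows. This is an illustrative standalone sketch (it requires a local Tor installation), not the project's middleware:

```python
from torrequest import TorRequest

# Requires a local Tor installation; defaults are port 9050 (proxy) and 9051 (control).
with TorRequest() as tr:
    resp = tr.get("http://ipecho.net/plain")  # shows the Tor exit node's IP
    print(resp.text)
    tr.reset_identity()  # request a new Tor circuit (and hence a new exit IP)
```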
A Google Translator-based mapper is available but less effective. To enable it, update the relevant settings in bet_parser/settings.py.
If you find this project useful, consider supporting its development:
- Star the repository to show your appreciation.
- Share feedback or suggestions by opening an issue.
- Buy me a coffee to support future updates and improvements.
  - BTC Address: `bc1qzy6e99pkeq00rsx8jptx93jv56s9ak2lz32e2d`
  - ETH Address: `0x38cf74ED056fF994342941372F8ffC5C45E6cF21`
This project is licensed under the MIT License. See the LICENSE file for details.




