X (formerly Twitter) web scraper, written in Go
This application collects tweets based on defined search criteria and saves them in a database.
- First, download the Chrome WebDriver that matches the installed version of Google Chrome (the browser used for testing this project). You can download it from here, or you can use @puppeteer/browsers with this installation guide. After that, copy it into the internal/webdriver folder.
- Create a .env file at the root of the project (or rename the provided .env.example) and add the following environment variables:
```
# Scraper settings
EMAIL=<Twitter account email>
USERNAME=<Twitter username>
PASSWORD=<Twitter password>
LOGIN_PAGE_TIMEOUT=<Time limit (in seconds) the scraper will wait for the login page to load completely before timing out>
LOGIN_ELEMENTS_TIMEOUT=<Time limit (in seconds) the scraper will wait for the necessary elements (e.g., username and email fields, login button) to appear on the login page before timing out>
LOGIN_PASSWORD_TIMEOUT=<Time limit (in seconds) the scraper will wait for the password element to appear on the login page before timing out>
WAIT_TIME_AFTER_LOGIN=<Wait time (in seconds) after the login button is clicked> --> Required to ensure the login process completes smoothly
SEARCH_PAGE_TIMEOUT=<Time limit (in seconds) the scraper will wait for the search page to load completely before timing out>
ARTICLES_TIMEOUT=<Time limit (in seconds) the scraper will wait for the article elements to appear on the search page before timing out>
RATE_LIMITER_PERIOD=<Period of the rate limiter (in seconds)> --> 15 minutes = 900 seconds (October 2024)
RATE_LIMITER_REQUESTS=<Number of requests allowed during the rate limiter period> --> 50 requests (October 2024)

# External APIs URLs
CORPUS_CREATOR_API_URL=<Domain of the corpus creator application with all the endpoints defined in the corpuscreator pkg> --> Example: the URL to the AHBCC API
```
In the root folder, run:

```
go run cmd/api/main.go --local
```
- Create a .env file at the root of the project (or rename the provided .env.example) and add the following environment variables:
```
# App settings
SCRAPPER_EXPOSED_PORT=<GoXCrap Host Port>
SCRAPPER_INTERNAL_PORT=<GoXCrap Container Port>

# Scraper settings
SCRAPPER_EMAIL=<Twitter account email>
SCRAPPER_USERNAME=<Twitter username>
SCRAPPER_PASSWORD=<Twitter password>
BROKER_CONCURRENT_MESSAGES=<Number of concurrent messages that will be processed>
SCRAPPER_LOGIN_PAGE_TIMEOUT=<Time limit (in seconds) the scraper will wait for the login page to load completely before timing out>
SCRAPPER_LOGIN_ELEMENTS_TIMEOUT=<Time limit (in seconds) the scraper will wait for the necessary elements (e.g., username and email fields, login button) to appear on the login page before timing out>
SCRAPPER_LOGIN_PASSWORD_TIMEOUT=<Time limit (in seconds) the scraper will wait for the password element to appear on the login page before timing out>
SCRAPPER_WAIT_TIME_AFTER_LOGIN=<Wait time (in seconds) after the login button is clicked> --> Required to ensure the login process completes smoothly
SCRAPPER_SEARCH_PAGE_TIMEOUT=<Time limit (in seconds) the scraper will wait for the search page to load completely before timing out>
SCRAPPER_ARTICLES_TIMEOUT=<Time limit (in seconds) the scraper will wait for the article elements to appear on the search page before timing out>
SCRAPPER_RATE_LIMITER_PERIOD=<Period of the rate limiter (in seconds)> --> 15 minutes = 900 seconds (October 2024)
SCRAPPER_RATE_LIMITER_REQUESTS=<Number of requests allowed during the rate limiter period> --> 50 requests (October 2024)

# Selenium Chrome driver paths
SELENIUM_DRIVER_PATH=<The path to the Chrome driver> --> Example: /usr/bin/chromedriver
SELENIUM_BROWSER_PATH=<The path to the Chrome browser> --> Example: /usr/bin/chromium

# RabbitMQ settings
RABBITMQ_USER=<The RabbitMQ user>
RABBITMQ_PASS=<The RabbitMQ password>
RABBITMQ_PORT=<The RabbitMQ port> --> Usually 5672

# External APIs URLs
CORPUS_CREATOR_API_URL=<Domain of the corpus creator application with all the endpoints defined in the corpuscreator pkg> --> Example: the URL to the AHBCC API
```
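The two port variables typically drive a Compose port mapping along these lines; the service name `goxcrap` is illustrative, so check the repository's actual docker-compose.yml for the real definition:

```yaml
services:
  goxcrap:
    build: .
    env_file:
      - .env
    ports:
      # host port : container port, both interpolated from the .env file
      - "${SCRAPPER_EXPOSED_PORT}:${SCRAPPER_INTERNAL_PORT}"
```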
Then run:

```
docker compose up --build
```
As of October 2024, X has a rate limit of 50 requests every 15 minutes.
To avoid encountering a 'Timeout retrieving elements' error, this app spreads the requests evenly throughout the 15-minute period.
That is why the following env variables exist:

```
SCRAPPER_RATE_LIMITER_PERIOD=<Period of the rate limiter (in seconds)>
SCRAPPER_RATE_LIMITER_REQUESTS=<Number of requests allowed during the rate limiter period>
```
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
The logo was obtained from https://github.com/ashleymcnamara/gophers and slightly modified to be representative of this repository.