X (formerly Twitter) web scraper, written in Go
This application collects tweets based on defined search criteria and saves them in a database.
- First, download the Chrome WebDriver that matches the installed version of Google Chrome (the browser used for testing this project). You can download it from here, or you can use @puppeteer/browsers with this installation guide. After that, copy it into the internal/webdriver folder.
- Create a .env file at the root of the project (or rename the provided .env.example) and add the following environment variables:
```
# Scraper settings
EMAIL=<Twitter account email>
USERNAME=<Twitter username>
PASSWORD=<Twitter password>
LOGIN_PAGE_TIMEOUT=<Time limit (in seconds) the scraper will wait for the login page to load completely before timing out>
LOGIN_ELEMENTS_TIMEOUT=<Time limit (in seconds) the scraper will wait for the necessary elements (e.g., username and email fields, login button) to appear on the login page before timing out>
LOGIN_PASSWORD_TIMEOUT=<Time limit (in seconds) the scraper will wait for the password element to appear on the login page before timing out>
WAIT_TIME_AFTER_LOGIN=<Wait time (in seconds) after the login button is clicked> --> Required to ensure the login process completes smoothly
SEARCH_PAGE_TIMEOUT=<Time limit (in seconds) the scraper will wait for the search page to load completely before timing out>
ARTICLES_TIMEOUT=<Time limit (in seconds) the scraper will wait for the article elements to appear on the search page before timing out>
RATE_LIMITER_PERIOD=<Period of the rate limiter (in seconds)> --> 15 minutes = 900 seconds (October 2024)
RATE_LIMITER_REQUESTS=<Number of requests allowed during the rate limiter period> --> 50 requests (October 2024)

# External APIs URLs
CORPUS_CREATOR_API_URL=<Domain of the corpus creator application with all the endpoints defined in the corpuscreator pkg> --> Example: the URL to the AHBCC API
```
In the root folder, run:

```
go run cmd/api/main.go --local
```
- Create a .env file at the root of the project (or rename the provided .env.example) and add the following environment variables:
```
# App settings
SCRAPPER_EXPOSED_PORT=<GoXCrap Host Port>
SCRAPPER_INTERNAL_PORT=<GoXCrap Container Port>

# Scraper settings
SCRAPPER_EMAIL=<Twitter account email>
SCRAPPER_USERNAME=<Twitter username>
SCRAPPER_PASSWORD=<Twitter password>
BROKER_CONCURRENT_MESSAGES=<Number of concurrent messages that will be processed>
SCRAPPER_LOGIN_PAGE_TIMEOUT=<Time limit (in seconds) the scraper will wait for the login page to load completely before timing out>
SCRAPPER_LOGIN_ELEMENTS_TIMEOUT=<Time limit (in seconds) the scraper will wait for the necessary elements (e.g., username and email fields, login button) to appear on the login page before timing out>
SCRAPPER_LOGIN_PASSWORD_TIMEOUT=<Time limit (in seconds) the scraper will wait for the password element to appear on the login page before timing out>
SCRAPPER_WAIT_TIME_AFTER_LOGIN=<Wait time (in seconds) after the login button is clicked> --> Required to ensure the login process completes smoothly
SCRAPPER_SEARCH_PAGE_TIMEOUT=<Time limit (in seconds) the scraper will wait for the search page to load completely before timing out>
SCRAPPER_ARTICLES_TIMEOUT=<Time limit (in seconds) the scraper will wait for the article elements to appear on the search page before timing out>
SCRAPPER_RATE_LIMITER_PERIOD=<Period of the rate limiter (in seconds)> --> 15 minutes = 900 seconds (October 2024)
SCRAPPER_RATE_LIMITER_REQUESTS=<Number of requests allowed during the rate limiter period> --> 50 requests (October 2024)

# Selenium Chrome driver paths
SELENIUM_DRIVER_PATH=<The path to the Chrome driver> --> Example: /usr/bin/chromedriver
SELENIUM_BROWSER_PATH=<The path to the Chrome browser> --> Example: /usr/bin/chromium

# RabbitMQ settings
RABBITMQ_USER=<The RabbitMQ user>
RABBITMQ_PASS=<The RabbitMQ password>
RABBITMQ_PORT=<The RabbitMQ port> --> Usually 5672

# External APIs URLs
CORPUS_CREATOR_API_URL=<Domain of the corpus creator application with all the endpoints defined in the corpuscreator pkg> --> Example: the URL to the AHBCC API
```
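The two port variables typically drive a Compose port mapping along these lines; the service name `goxcrap` is illustrative, so check the repository's actual docker-compose.yml for the real definition:

```yaml
services:
  goxcrap:
    build: .
    env_file:
      - .env
    ports:
      # host port : container port, both interpolated from the .env file
      - "${SCRAPPER_EXPOSED_PORT}:${SCRAPPER_INTERNAL_PORT}"
```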
Then run:

```
docker compose up --build
```
As of October 2024, X has a rate limit of 50 requests every 15 minutes.
To avoid encountering a 'Timeout retrieving elements' error, this app spreads the requests evenly throughout the 15-minute period.
That is why the following env variables exist:

```
SCRAPPER_RATE_LIMITER_PERIOD=<Period of the rate limiter (in seconds)>
SCRAPPER_RATE_LIMITER_REQUESTS=<Number of requests allowed during the rate limiter period>
```
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
The logo was obtained from https://github.com/ashleymcnamara/gophers and slightly modified to be representative of this repository.