This project collects tweets from Twitter using keywords specified in a YAML config file, processes them with PySpark, and stores them with Delta Lake in three data layers. The first layer stores the data as it is collected, in batches, as GZIP-compressed JSON files. The second layer prepares it in Parquet with the Delta format, partitioned by the execution date (from the DAG run). Finally, the third layer also stores the data in Delta format, but partitioned by the tweet creation timestamp. Task orchestration runs on Apache Airflow, with the PySpark jobs implemented in PythonOperators.
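The two Delta layers differ only in their partition key, so the layout can be illustrated with a small sketch. The helper names and the partition column names (extracted_date, created_date) below are hypothetical, chosen just to show how the execution date vs. the tweet creation timestamp would shape the directory layout:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helpers illustrating the partitioning scheme described above:
# the raw layer is partitioned by the DAG execution date, while the trusted
# layer is partitioned by each tweet's own creation timestamp.
def raw_partition(base: str, execution_date: datetime) -> str:
    # One partition per DAG run date.
    return f"{base}/extracted_date={execution_date.date().isoformat()}"

def trusted_partition(base: str, created_at: datetime) -> str:
    # One partition per tweet creation date.
    return f"{base}/created_date={created_at.date().isoformat()}"

tz = timezone(timedelta(hours=-3))
run = datetime(2022, 1, 10, 12, 0, tzinfo=tz)       # DAG run time
tweet = datetime(2022, 1, 9, 23, 45, tzinfo=tz)     # tweet created earlier

print(raw_partition("/data/raw/covid19", run))       # /data/raw/covid19/extracted_date=2022-01-10
print(trusted_partition("/data/trusted/covid19", tweet))  # /data/trusted/covid19/created_date=2022-01-09
```

In the actual jobs this would be handled by Delta/Spark's partitionBy when writing, rather than by building paths manually; the sketch only shows why the same tweet can land in different partitions in the two layers.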
In the config file, the following fields can be specified:
topics:
  covid19:
    start-date: 2022-01-10T00:00:00-03
    schedule-interval: "*/15 * * * *"
    max-results: 50
landing-path: /data/landing
raw-path: /data/raw
trusted-path: /data/trusted
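A minimal sketch of how the jobs could read this file with PyYAML. The nesting shown (per-topic settings under topics, shared paths at the top level) is an assumption based on the field names, not a confirmed schema:

```python
import yaml  # PyYAML

# Inline copy of the README example, assuming per-topic settings are
# nested under `topics` and the layer paths are top-level and shared.
CONFIG = """
topics:
  covid19:
    start-date: 2022-01-10T00:00:00-03
    schedule-interval: "*/15 * * * *"
    max-results: 50
landing-path: /data/landing
raw-path: /data/raw
trusted-path: /data/trusted
"""

config = yaml.safe_load(CONFIG)
topic = config["topics"]["covid19"]
print(topic["max-results"])     # 50
print(config["landing-path"])   # /data/landing
```

Each topic would then drive one Airflow DAG, using schedule-interval as its cron schedule and start-date as its start date.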
The Makefile wraps some Docker commands to start the project. For example, to start the Apache Airflow environment, run:
make start
To run the unit tests, use the following make target:
make test