Extraction and data wrangling from Twitter API with Apache Airflow, PySpark and DeltaLake

himewel/trending-explorer

Trending topics tweet extraction

Docker Apache Airflow Apache Spark Twitter DeltaLake

This project collects tweets from Twitter using keywords specified in a YAML config file, processes them with PySpark, and stores them with DeltaLake in three data layers. The first layer stores the data as it is collected, in batches in JSON format with GZIP compression. The second prepares it as Parquet in Delta format, partitioned by the execution date (from the DAG run). Finally, the third layer also stores the data in Delta format, but partitioned by the tweet creation timestamp. Task orchestration runs on Apache Airflow, with the PySpark jobs implemented in PythonOperators.
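
As a rough illustration of the raw and trusted writes, the following PySpark sketch assumes hypothetical paths, a covid19 topic, and column names such as created_at and execution_date; it is not the project's actual code:

from pyspark.sql import SparkSession, functions as F

# Illustrative only: paths, topic name, and column names are assumptions.
spark = (
    SparkSession.builder.appName("trending-explorer-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Landing layer: gzipped JSON batches as collected from the Twitter API.
landing = spark.read.json("/data/landing/covid19/*.json.gz")

# Raw layer: Delta (Parquet) partitioned by the DAG execution date.
(
    landing.withColumn("execution_date", F.lit("2022-01-10"))
    .write.format("delta")
    .mode("append")
    .partitionBy("execution_date")
    .save("/data/raw/covid19")
)

# Trusted layer: Delta partitioned by the tweet creation timestamp.
raw = spark.read.format("delta").load("/data/raw/covid19")
(
    raw.withColumn("created_date", F.to_date("created_at"))
    .write.format("delta")
    .mode("append")
    .partitionBy("created_date")
    .save("/data/trusted/covid19")
)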

In the config file, the following fields can be specified (see the DAG sketch after the example):

topics:
    covid19:
        start-date: 2022-01-10T00:00:00-03
        schedule-interval: "*/15 * * * *"
        max-results: 50
        landing-path: /data/landing
        raw-path: /data/raw
        trusted-path: /data/trusted
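
A config entry like this could be turned into one Airflow DAG per topic. The sketch below assumes a hypothetical config path and uses placeholder callables; it only illustrates the PythonOperator layout, not the project's actual implementation:

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical config location; PyYAML parses the ISO timestamps into datetimes.
with open("config/topics.yaml") as stream:
    config = yaml.safe_load(stream)

for topic, params in config["topics"].items():
    with DAG(
        dag_id=f"trending_{topic}",
        start_date=params["start-date"],
        schedule_interval=params["schedule-interval"],
        catchup=False,
    ) as dag:
        # Each task would wrap a PySpark job; placeholders keep the sketch short.
        extract = PythonOperator(
            task_id="extract_to_landing",
            python_callable=lambda **_: None,
        )
        to_raw = PythonOperator(
            task_id="landing_to_raw",
            python_callable=lambda **_: None,
        )
        to_trusted = PythonOperator(
            task_id="raw_to_trusted",
            python_callable=lambda **_: None,
        )
        extract >> to_raw >> to_trusted

    # Register each generated DAG so Airflow's DagBag can discover it.
    globals()[dag.dag_id] = dag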

How to start

The Makefile wraps some docker commands to start the project. For example, to start the Apache Airflow environment, run the following:

make start

To run the unit tests, run the following make target:

make test
