twitter-etl-with-airflow

Project Introduction

This is a ETL project where I read the data from Kaggle Twitter dataset,analysed or transformed using python and stored the analysed data in to Amazon S3 bucket.This whole ETL process ochestrated using Ochestration tool called Apache Airflow.Here I documented each step that I followed in this process.

Project Architecture

Apache Airflow Installation steps

Installation in windows without using Docker,using WSL(Ubuntu)

Open a command prompt and update wsl with following commands
```
wsl --update
wsl --install -d ubuntu
```

Now open WSL2 or Ubuntu terminal and run the following commands

sudo apt-get update
sudo apt-get install python-pip
sudo apt-get install python3-venv
python3 -m venv airflow-env
source airflow-env/bin/activate
nano ~/.bashrc
#Type the following
AIRFLOW_HOME=/c/Users/vjadhav/airflow
#Press Ctrl+S and Ctrl+X to exit the editor
pip install apache-airflow
airflow db init
airflow users create \
    --username admin \
    --password admin  \
    --firstname <YourFirstName> \
    --lastname <YourLastName> \
    --role Admin \
    --email admin@example.com
airflow webserver --port 8080
#open one more terminal and run scheduler
airflow scheduler

The Airflow web server will be accessible at http://localhost:8080 in your web browser and log in using the above-created User.

ETL(Extract,Transform,and Load)

Extract: Read the Tweets.CSV file
Transform:Transform the twitter data
- Dropped the un-necessary columns from the dataset
- Handled the Dataset schema
- Handled the Date Time columns effectively
- Handled the null values and filled with default values
Load:Load the transformed data into Amazon S3 bucket using S3FS python package

Copy the files from local file system to WSL(Ubuntu) filesystem

cp -r /mnt/c/Users/<username>/learning/twitter/tweets.csv /home/test/airflow/twitter_dags
cp -r /mnt/c/Users/<username>/learning/twitter/twitter_etl.py /home/test/airflow/twitter_dags
cp -r /mnt/c/Users/<username>/learning/twitter/twitter_dag.py /home/test/airflow/twitter_dags

Create dag folder

mkdir twitter_dag
nano airflow.cfg

edit the AIRFLOW_DAGS variable to twitter_dags folder Run the twitter dag from the airflow UI and this will generate the output file into S3 bucket

Issues Faced

From Amazon S3 side ,handle the bucket policies to put the object into the bucket
Tracing the errors in the Apache Airflow
Airflow Setup

Improvement

Need to write the code with PYSPARK and deployment onto Airflow
Write a code that can interact with some database

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

twitter-etl-with-airflow

Project Introduction

Project Architecture

Apache Airflow Installation steps

Installation in windows without using Docker,using WSL(Ubuntu)

ETL(Extract,Transform,and Load)

Copy the files from local file system to WSL(Ubuntu) filesystem

Create dag folder

Issues Faced

Improvement

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
README.md		README.md
tweets.csv		tweets.csv
twitter_dag.py		twitter_dag.py
twitter_etl.py		twitter_etl.py

kalyani33/twitter-etl-with-airflow

Folders and files

Latest commit

History

Repository files navigation

twitter-etl-with-airflow

Project Introduction

Project Architecture

Apache Airflow Installation steps

Installation in windows without using Docker,using WSL(Ubuntu)

ETL(Extract,Transform,and Load)

Copy the files from local file system to WSL(Ubuntu) filesystem

Create dag folder

Issues Faced

Improvement

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages