- Make sure you have all Python requirements from
requirements.txt
installed- Otherwise run
pip install requirements.txt
from the this folder using your favorite terminal to install requirements
- Otherwise run
- Make sure you have docker with docker-compose installed locally
- Run
docker-compose up -d
from the airflow folder using your favorite terminal to start up airflow for this project
- Run
- Fill
aws.cfg
- Your AWS credentials into the AWS header
- Your Redshift cluter information into the Redshift header, for launching a Redshift cluster
- Download the datasets to the /data/ folder:
- Make sure you have the requirements installed
- In order to download this via the script, you'll need a Kaggle user and have downloaded your kaggle credentials and token to your ~/.kaggle/ folder on your local computer.
- Run
python download_datasets.py
from the this folder using your favorite terminal
- Load data from /data/ to S3 - this will create a bucket called arxiv-etl is none exists:
- Run
python load_to_s3.py
from the this folder using your favorite terminal
- Run
- Create and launch a Redshift cluster:
- Run
python create_redshift_cluster.py
from the this folder using your favorite terminal
- Run
- Fill remaining info in
aws.cfg
- Fill your running Redshift info and credentials in the CLUSTER header
- Load connections into Airflow using Powershell:
- Run
.\add_airflow_connections.ps1
from the this folder using Powershell - Alternatively, if you are on OSX/Linux, it should be trivial to convert the powershell script to a bash script
- Run
Everything should now be setup, and ready to run the ETL process.