This repository contains the code for an ETL pipeline that processes the NYC Airbnb Open Data dataset. The project leverages Metaflow for orchestration and pandas for data manipulation.
.
├── AB_NYC_2019.csv # The dataset containing NYC Airbnb data
├── load_data.py # Script to load and preprocess the dataset
├── metaflow_ETL.py # Metaflow pipeline for ETL
├── requirements.txt # Required Python packages
└── README.md # Project documentation
The dataset used in this project is AB_NYC_2019.csv, which contains detailed information about Airbnb listings in New York City. The data can be downloaded from Kaggle's New York City Airbnb Open Data.
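Once downloaded, the file can be inspected with pandas. A minimal sketch (a couple of inline rows mimicking the dataset's schema stand in for the real CSV so the snippet is self-contained; with the file in the repository root you would call `pd.read_csv("AB_NYC_2019.csv")` instead):

```python
import io
import pandas as pd

# Inline sample rows stand in for the real AB_NYC_2019.csv so this
# snippet runs anywhere; the column names follow the public dataset.
sample = io.StringIO(
    "id,name,host_name,neighbourhood_group,room_type,price\n"
    "2539,Clean & quiet apt,John,Brooklyn,Private room,149\n"
    "2595,Skylit Midtown Castle,Jennifer,Manhattan,Entire home/apt,225\n"
)
df = pd.read_csv(sample)

print(df.shape)          # (2, 6)
print(list(df.columns))  # column names as read from the header
```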
- A working demo of this project is available here: Demo Video
Ensure you have Python 3.7+ and PostgreSQL installed on your system.
You can create a virtual environment to manage dependencies:
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
Install PostgreSQL:
- On Ubuntu:
sudo apt update
sudo apt install postgresql postgresql-contrib
- On macOS using Homebrew:
brew install postgresql
Start PostgreSQL Service:
- On Ubuntu:
sudo service postgresql start
- On macOS:
brew services start postgresql
Create a Database and User:
- Switch to the PostgreSQL user:
  sudo -i -u postgres
- Open the PostgreSQL prompt:
  psql
- Create a database:
  CREATE DATABASE airbnb_nyc;
- Set a password for your user:
  ALTER USER your_user WITH PASSWORD 'test';
- Exit the PostgreSQL prompt:
  \q
- Return to your normal shell user:
  exit
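The scripts connect to this database via a SQLAlchemy/psycopg2 connection URL. A small sketch of assembling one from the values created above (`your_user` and the password `'test'` mirror the commands; the host and port are standard PostgreSQL defaults — adjust for your environment):

```python
# Hypothetical helper: build a SQLAlchemy/psycopg2 connection URL from the
# database, user, and password created in the setup steps above.
def pg_url(user, password, db, host="localhost", port=5432):
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}"

url = pg_url("your_user", "test", "airbnb_nyc")
print(url)  # postgresql+psycopg2://your_user:test@localhost:5432/airbnb_nyc
```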
Install the required Python packages using the provided requirements.txt file:
pip install -r requirements.txt

To run the ETL pipeline, execute the metaflow_ETL.py script:

python3 metaflow_ETL.py run

To load and preprocess the data, you can run the load_data.py script:

python3 load_data.py

This script is responsible for loading and preprocessing the Airbnb dataset. It ensures the data is cleaned and ready for analysis.
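A minimal sketch of the kind of preprocessing involved (the cleaning rules below are illustrative assumptions, not the script's exact logic; inline sample rows stand in for the real CSV so the snippet is self-contained):

```python
import io
import pandas as pd

# A tiny sample mimicking the AB_NYC_2019.csv schema; in the project you
# would read the real file with pd.read_csv("AB_NYC_2019.csv").
sample = io.StringIO(
    "id,name,host_name,neighbourhood_group,price,reviews_per_month\n"
    "1,Cozy loft,Ann,Manhattan,150,1.2\n"
    "2,,Bob,Brooklyn,80,\n"           # missing name -> dropped below
    "3,Sunny room,Cara,Queens,60,\n"  # no reviews -> NaN -> filled with 0
)
df = pd.read_csv(sample)

# Listings with no reviews have NaN in reviews_per_month; treat them as 0.
df["reviews_per_month"] = df["reviews_per_month"].fillna(0)

# Drop rows missing essential identifying fields.
df = df.dropna(subset=["name", "host_name"])

print(len(df))  # 2 rows survive the cleaning
```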
This script defines a Metaflow pipeline for the ETL process. Metaflow is used to orchestrate the data flow and manage the various steps involved in the ETL process.
This file lists all the dependencies required for the project. Use this file to install the necessary packages.
- astroid==3.2.2
- boto3==1.34.134
- botocore==1.34.134
- certifi==2024.6.2
- charset-normalizer==3.3.2
- dill==0.3.8
- greenlet==3.0.3
- idna==3.7
- isort==5.13.2
- jmespath==1.0.1
- mccabe==0.7.0
- metaflow==2.12.5
- numpy==2.0.0
- pandas==2.2.2
- platformdirs==4.2.2
- psycopg2-binary==2.9.9
- pylint==3.2.4
- python-dateutil==2.9.0.post0
- pytz==2024.1
- requests==2.32.3
- s3transfer==0.10.2
- six==1.16.0
- SQLAlchemy==2.0.31
- tomli==2.0.1
- tomlkit==0.12.5
- typing_extensions==4.12.2
- tzdata==2024.1
- urllib3==2.2.2
- Muneeb Mushtaq Bhat
This project is licensed under the MIT License.
- Metaflow team for providing an excellent orchestration tool.
- Airbnb & Kaggle for providing the dataset.