This repository contains the code for an ETL pipeline that processes the NYC Airbnb Open Data dataset. The project leverages Metaflow for orchestration and pandas for data manipulation.
.
├── AB_NYC_2019.csv # The dataset containing NYC Airbnb data
├── load_data.py # Script to load and preprocess the dataset
├── metaflow_ETL.py # Metaflow pipeline for ETL
├── requirements.txt # Required Python packages
└── README.md # Project documentation
The dataset used in this project is AB_NYC_2019.csv, which contains detailed information about Airbnb listings in New York City. The data can be downloaded from Kaggle's New York City Airbnb Open Data.
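Once downloaded, the file can be inspected with pandas. A minimal sketch (a couple of inline rows mimicking the dataset's schema stand in for the real CSV so the snippet is self-contained; with the file in the repository root you would call `pd.read_csv("AB_NYC_2019.csv")` instead):

```python
import io
import pandas as pd

# Inline sample rows stand in for the real AB_NYC_2019.csv so this
# snippet runs anywhere; the column names follow the public dataset.
sample = io.StringIO(
    "id,name,host_name,neighbourhood_group,room_type,price\n"
    "2539,Clean & quiet apt,John,Brooklyn,Private room,149\n"
    "2595,Skylit Midtown Castle,Jennifer,Manhattan,Entire home/apt,225\n"
)
df = pd.read_csv(sample)

print(df.shape)          # (2, 6)
print(list(df.columns))  # column names as read from the header
```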
- A working demo of this project is available here: Demo Video
Ensure you have Python 3.7+ and PostgreSQL installed on your system.
You can create a virtual environment to manage dependencies:
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
Install PostgreSQL:
- On Ubuntu:
sudo apt update
sudo apt install postgresql postgresql-contrib
- On macOS using Homebrew:
brew install postgresql
Start PostgreSQL Service:
- On Ubuntu:
sudo service postgresql start
- On macOS:
brew services start postgresql
Create a Database and User:
- Switch to the PostgreSQL user:
  sudo -i -u postgres
- Open the PostgreSQL prompt:
  psql
- Create a database:
  CREATE DATABASE airbnb_nyc;
- Set a password for your user:
  ALTER USER your_user WITH PASSWORD 'test';
- Exit the PostgreSQL prompt:
  \q
- Return to your normal shell user:
  exit
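The scripts connect to this database via a SQLAlchemy/psycopg2 connection URL. A small sketch of assembling one from the values created above (`your_user` and the password `'test'` mirror the commands; the host and port are standard PostgreSQL defaults — adjust for your environment):

```python
# Hypothetical helper: build a SQLAlchemy/psycopg2 connection URL from the
# database, user, and password created in the setup steps above.
def pg_url(user, password, db, host="localhost", port=5432):
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}"

url = pg_url("your_user", "test", "airbnb_nyc")
print(url)  # postgresql+psycopg2://your_user:test@localhost:5432/airbnb_nyc
```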
Install the required Python packages using the provided requirements.txt file:
pip install -r requirements.txt

To run the ETL pipeline, execute the metaflow_ETL.py script:

python3 metaflow_ETL.py run

To load and preprocess the data, you can run the load_data.py script:

python3 load_data.py

This script is responsible for loading and preprocessing the Airbnb dataset. It ensures the data is cleaned and ready for analysis.
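A minimal sketch of the kind of preprocessing involved (the cleaning rules below are illustrative assumptions, not the script's exact logic; inline sample rows stand in for the real CSV so the snippet is self-contained):

```python
import io
import pandas as pd

# A tiny sample mimicking the AB_NYC_2019.csv schema; in the project you
# would read the real file with pd.read_csv("AB_NYC_2019.csv").
sample = io.StringIO(
    "id,name,host_name,neighbourhood_group,price,reviews_per_month\n"
    "1,Cozy loft,Ann,Manhattan,150,1.2\n"
    "2,,Bob,Brooklyn,80,\n"           # missing name -> dropped below
    "3,Sunny room,Cara,Queens,60,\n"  # no reviews -> NaN -> filled with 0
)
df = pd.read_csv(sample)

# Listings with no reviews have NaN in reviews_per_month; treat them as 0.
df["reviews_per_month"] = df["reviews_per_month"].fillna(0)

# Drop rows missing essential identifying fields.
df = df.dropna(subset=["name", "host_name"])

print(len(df))  # 2 rows survive the cleaning
```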
This script defines a Metaflow pipeline for the ETL process. Metaflow is used to orchestrate the data flow and manage the various steps involved in the ETL process.
This file lists all the dependencies required for the project. Use this file to install the necessary packages.
- astroid==3.2.2
- boto3==1.34.134
- botocore==1.34.134
- certifi==2024.6.2
- charset-normalizer==3.3.2
- dill==0.3.8
- greenlet==3.0.3
- idna==3.7
- isort==5.13.2
- jmespath==1.0.1
- mccabe==0.7.0
- metaflow==2.12.5
- numpy==2.0.0
- pandas==2.2.2
- platformdirs==4.2.2
- psycopg2-binary==2.9.9
- pylint==3.2.4
- python-dateutil==2.9.0.post0
- pytz==2024.1
- requests==2.32.3
- s3transfer==0.10.2
- six==1.16.0
- SQLAlchemy==2.0.31
- tomli==2.0.1
- tomlkit==0.12.5
- typing_extensions==4.12.2
- tzdata==2024.1
- urllib3==2.2.2
- Muneeb Mushtaq Bhat
This project is licensed under the MIT License.
- Metaflow team for providing an excellent orchestration tool.
- Airbnb & Kaggle for providing the dataset.