Data Science in Production: Final Project - Kindle Book Reviews Application

Welcome to the "Data Science in Production" project by Team “ML Legends” - EPITA Master in Data Science

Collaborators

Viet Thai Nguyen
Stephanie Arthaud
Christian Davison Dirisu
Olanrewaju Adegoke
Abubakar Bashir Kankia

Introduction

Our project focuses on the Sentiment Analysis of Kindle Book Reviews, aiming to classify them as Positive or Negative by predicting Rating Score. We are utilizing the Kindle Book Review Dataset, a rich collection of over 2 million reviews and associated metadata for a diverse range of Kindle books.

We have built a Streamlit web app for users to interact with the Machine Learning model through FastAPI and PostgreSQL. Raw data is ingested and predicted by Airflow jobs, after being validated by Great Expectations. The pipeline is then monitored by a Grafana dashboard.

File Descriptions

├── airflow            # Airflow DAGs validated by Great Expectations
│   ├── dags
│   ├── logs
│   └── gx
├── api-db             # Connect to PostgreSQL by FastAPI
│   ├── main.py
│   └── functions.py
├── app                # 2 pages of Streamlit app
│   ├── Predict.py
│   ├── History.py
│   └── utils.py
├── model              # Store training model
│   ├── DSP_NLP_Review.ipynb
│   ├── dsp_project_model.pkl
│   └── dsp_project_tfidf_model.pkl
├── images             # Store images for README
├── README.md
├── requirements.txt   # Modules version
├── .gitignore

Main components

Web app

There are 2 pages of the app:

Predict: predicting the Rating by the Review by 3 ways
- Enter your own review
- Generate random review
- Upload a CSV
History: showing all rows in database that can be filtered by time and other types.

API

We implemented 2 endpoints by FastAPI:

predict: POST request - inference prediction & save data to database
get-predict: GET request - retrieve data from database

Database

We used PostgreSQL with table including 4 columns:

id: number of predictions
review: review text of users
rating: the score given by prediction
time: time that user makes the prediction
type: the prediction is made by the App or Prediction Job

Jobs scheduling

We created 2 DAGs in Airflow for 2 jobs:

ingest_data: Ingest new data, validate the data by using Great Expectations module, running each 1 min.
predict_data: Predict a batch of new coming data, running each 2 min.

Data Validation

We used Great Expectations to validate raw data by 4 requirements:

The review cannot be null.
The review cannot be too long.
Spam review will not be accepted.
Do not allow the review having the direct URL.

Data Monitoring

By a Grafana dashboard, we can monitor all the data from Prediction and Ingestion jobs which is stored in PostgreSQL tables.

Installation & Setup

Initial Installation

1. Install project dependencies

pip install -r requirements.txt

2. Install Docker and Docker Compose (Docker Desktop is additional)

3. Build Docker image and start services

cd airflow
docker build -f Dockerfile -t {name_of_the_image}:latest .
docker-compose -f "docker-compose.yml" up -d --build

Running Steps

1. Run FastAPI server

cd api-db
uvicorn main:app --reload

Access localhost:8000

2. Run Streamlit webapp

cd app
streamlit run Predict.py

Access localhost:8501

3. Run Airflow webserver

Access localhost:8080
Login by using username admin, and retrieve password from the standalone_admin_password.txt file.

Contributing

We welcome contributions to this project! Here's how you can contribute:

Fork the Repository
Clone the Repository
Create a New Branch
Make Your Changes
Commit Your Changes
Push Your Changes
Submit a Pull Request

Remember, contributing to open source projects is about more than just code. You can also contribute by reporting bugs, suggesting new features, improving documentation, and more.

Thank you for considering contributing to this project! 😊

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Science in Production: Final Project - Kindle Book Reviews Application

Collaborators

Introduction

File Descriptions

Main components

Web app

API

Database

Jobs scheduling

Data Validation

Data Monitoring

Installation & Setup

Initial Installation

Running Steps

Contributing

About

Releases

Packages

Contributors 5

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
airflow		airflow
api-db		api-db
app		app
images		images
model		model
.DS_Store		.DS_Store
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
requirements.txt		requirements.txt

nuvita97/dsp-final-project

Folders and files

Latest commit

History

Repository files navigation

Data Science in Production: Final Project - Kindle Book Reviews Application

Collaborators

Introduction

File Descriptions

Main components

Web app

API

Database

Jobs scheduling

Data Validation

Data Monitoring

Installation & Setup

Initial Installation

Running Steps

Contributing

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages