Welcome to the "Data Science in Production" project by Team “ML Legends” - EPITA Master in Data Science
- Viet Thai Nguyen
- Stephanie Arthaud
- Christian Davison Dirisu
- Olanrewaju Adegoke
- Abubakar Bashir Kankia
Our project focuses on sentiment analysis of Kindle book reviews: we classify each review as Positive or Negative by predicting its rating score. We use the Kindle Book Review Dataset, a collection of over 2 million reviews and associated metadata for a diverse range of Kindle books.
We have built a Streamlit web app that lets users interact with the machine learning model through FastAPI and PostgreSQL. Airflow jobs ingest raw data, validate it with Great Expectations, and run predictions on it. The pipeline is then monitored through a Grafana dashboard.
├── airflow # Airflow DAGs validated by Great Expectations
│ ├── dags
│ ├── logs
│ └── gx
├── api-db # Connect to PostgreSQL by FastAPI
│ ├── main.py
│ └── functions.py
├── app # 2 pages of Streamlit app
│ ├── Predict.py
│ ├── History.py
│ └── utils.py
├── model # Store training model
│ ├── DSP_NLP_Review.ipynb
│ ├── dsp_project_model.pkl
│ └── dsp_project_tfidf_model.pkl
├── images # Store images for README
├── README.md
├── requirements.txt # Modules version
└── .gitignore
The app has 2 pages:
- Predict: predicts the rating from a review in 3 ways:
  - Enter your own review
  - Generate a random review
  - Upload a CSV
- History: shows all rows in the database, which can be filtered by time and type.
We implemented 2 endpoints with FastAPI:
- predict: POST request that runs inference and saves the data to the database
- get-predict: GET request that retrieves data from the database
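As a rough illustration of how a client might call the two endpoints above, here is a minimal sketch using only the Python standard library. The URL paths (`/predict`, `/get-predict`) and the JSON field names (`review`, `type`) are assumptions made for this example; the actual request schema is defined in api-db/main.py and may differ.

```python
import json
from urllib import request

# Assumption: the API runs locally on uvicorn's default port.
API_URL = "http://localhost:8000"


def build_predict_request(review: str, source: str = "App") -> request.Request:
    """Build (but do not send) a POST request for the predict endpoint."""
    # Assumption: the endpoint accepts a JSON body with "review" and "type" keys.
    payload = json.dumps({"review": review, "type": source}).encode("utf-8")
    return request.Request(
        url=f"{API_URL}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_history_request() -> request.Request:
    """Build (but do not send) a GET request for the get-predict endpoint."""
    return request.Request(url=f"{API_URL}/get-predict", method="GET")


if __name__ == "__main__":
    req = build_predict_request("Loved this book, could not put it down.")
    print(req.get_method(), req.full_url)
    # With the FastAPI server running, the request could be sent with:
    #   with request.urlopen(req) as resp:
    #       print(json.load(resp))
```

The requests are only constructed here, so the sketch runs without the server; sending them requires the FastAPI service to be up.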
We used PostgreSQL with a table of 5 columns:
- id: number of the prediction
- review: review text from the user
- rating: the score given by the prediction
- time: time at which the prediction was made
- type: whether the prediction was made by the App or by the Prediction Job
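To make the table layout concrete, the sketch below recreates the columns listed above with Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table name `predictions`, the column types, and the helper functions are hypothetical and chosen only for illustration; the project's real schema lives in the api-db code.

```python
import sqlite3
from datetime import datetime, timezone

# Assumption: column types are guesses. In PostgreSQL, id would typically be
# SERIAL and time a TIMESTAMP; SQLite is used here only so the sketch runs anywhere.
SCHEMA = """
CREATE TABLE IF NOT EXISTS predictions (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    review   TEXT NOT NULL,
    rating   INTEGER NOT NULL,
    "time"   TEXT NOT NULL,
    "type"   TEXT NOT NULL
)
"""

INSERT_SQL = """
INSERT INTO predictions (review, rating, "time", "type") VALUES (?, ?, ?, ?)
"""

SELECT_SQL = """
SELECT id, review, rating, "time", "type" FROM predictions
"""


def save_prediction(conn, review, rating, source):
    """Insert one prediction row and return its generated id."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(INSERT_SQL, (review, rating, now, source))
    conn.commit()
    return cur.lastrowid


def fetch_predictions(conn, source=None):
    """Return all rows, optionally filtered by the type column."""
    if source is None:
        return conn.execute(SELECT_SQL + " ORDER BY id").fetchall()
    return conn.execute(
        SELECT_SQL + ' WHERE "type" = ? ORDER BY id', (source,)
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    save_prediction(conn, "Great read", 5, "App")
    save_prediction(conn, "Too slow for me", 2, "Prediction Job")
    print(fetch_predictions(conn, "App"))
```

Filtering on the type column is what would let a history view separate predictions made in the app from those made by the scheduled job.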
We created 2 DAGs in Airflow for 2 jobs:
- ingest_data: ingests new data and validates it with Great Expectations; runs every 1 minute.
- predict_data: predicts a batch of newly arrived data; runs every 2 minutes.
We used Great Expectations to validate raw data against 4 requirements:
- The review cannot be null.
- The review cannot be too long.
- Spam reviews are not accepted.
- The review must not contain a direct URL.
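The four requirements above can be sketched in plain Python to show the kind of check each one performs. This is not the Great Expectations suite used in the project (that lives under airflow/gx); the length limit, the spam heuristic, the URL pattern, and the treatment of blank strings as null are all assumptions made for this example.

```python
import re

# Assumption: the real length limit is not stated in this README.
MAX_REVIEW_LENGTH = 5000
# Assumption: a "direct URL" starts with http://, https:// or www.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
# Assumption: spam is approximated as one character repeated 10 or more times.
SPAM_PATTERN = re.compile(r"(.)\1{9,}")


def validate_review(review):
    """Return the names of the rules a review violates (empty list if valid)."""
    # Rule 1: the review cannot be null (blank strings are treated as null here).
    if review is None or not review.strip():
        return ["not_null"]
    failures = []
    # Rule 2: the review cannot be too long.
    if len(review) > MAX_REVIEW_LENGTH:
        failures.append("max_length")
    # Rule 3: spam reviews are not accepted.
    if SPAM_PATTERN.search(review):
        failures.append("no_spam")
    # Rule 4: the review must not contain a direct URL.
    if URL_PATTERN.search(review):
        failures.append("no_url")
    return failures


if __name__ == "__main__":
    samples = [
        "A touching story with great characters.",
        "Buy cheap books at http://example.com",
        None,
    ]
    for sample in samples:
        print(repr(sample), "->", validate_review(sample))
```

Returning the list of failed rules, rather than a single boolean, makes it easier to report why a row was rejected during ingestion.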
With a Grafana dashboard, we can monitor all the data from the Prediction and Ingestion jobs, which is stored in PostgreSQL tables.
1. Install project dependencies
pip install -r requirements.txt
2. Install Docker and Docker Compose (Docker Desktop is optional)
3. Build Docker image and start services
cd airflow
docker build -f Dockerfile -t {name_of_the_image}:latest .
docker-compose -f "docker-compose.yml" up -d --build
1. Run FastAPI server
cd api-db
uvicorn main:app --reload
- Access localhost:8000
2. Run Streamlit webapp
cd app
streamlit run Predict.py
- Access localhost:8501
3. Run Airflow webserver
- Access localhost:8080
- Log in with the username admin, and retrieve the password from the standalone_admin_password.txt file.
We welcome contributions to this project! Here's how you can contribute:
- Fork the Repository
- Clone the Repository
- Create a New Branch
- Make Your Changes
- Commit Your Changes
- Push Your Changes
- Submit a Pull Request
Remember, contributing to open source projects is about more than just code. You can also contribute by reporting bugs, suggesting new features, improving documentation, and more.
Thank you for considering contributing to this project! 😊