This project aims to develop a sentiment analysis model that classifies movie reviews into positive or negative sentiments. The model is built using a Bidirectional Long Short-Term Memory (BiLSTM) neural network with pre-trained GloVe embeddings. The project includes data preprocessing, model training, evaluation, and deployment using FastAPI.
imdb_sentiment/
│
├── app/
│ ├── __init__.py
│ ├── main.py
│ ├── models/
│ │ ├── model.h5
│ ├── tokenizer/
│ │ ├── tokenizer.pickle
│ ├── routes/
│ │ ├── __init__.py
│ │ ├── sentiment.py
│ ├── utils/
│ │ ├── __init__.py
│ │ ├── preprocessing.py
│
├── data/
│ ├── raw/
│ ├── processed/
│ │ ├── train_padded.npy
│ │ ├── train_labels.npy
│ │ ├── test_padded.npy
│ │ ├── test_labels.npy
│
├── glove.6B/
│
├── notebooks/
│ ├── data_preprocessing.ipynb
│ ├── model_training.ipynb
│
├── requirements.txt
├── README.md
└── .gitignore
- Develop a sentiment analysis model to classify movie reviews.
- Ensure the model is trained on a balanced dataset to mitigate bias.
- Deploy the model using FastAPI to provide a REST API for sentiment prediction.
- **Data Loading**: The IMDb movie reviews dataset is used, loaded via TensorFlow Datasets.
- **Cleaning**: Text data is cleaned by removing special characters and converting to lowercase.
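A minimal cleaning helper along these lines (the exact regexes used in `data_preprocessing.ipynb` may differ) could be:

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip HTML tags and special characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML remnants such as <br />
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
```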
- **Tokenization**: Text is tokenized into sequences of integers using the Keras `Tokenizer`.
- **Padding**: Sequences are padded to a uniform length, suitable for batch processing.
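Together, tokenization and padding look roughly like this (the vocabulary size and sequence length here are illustrative, not the notebook's actual values):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["i loved this movie", "terrible acting and a dull plot"]

# num_words and maxlen are example values; tune them to your corpus.
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

sequences = tokenizer.texts_to_sequences(texts)            # words -> integer ids
padded = pad_sequences(sequences, maxlen=8,
                       padding="post", truncating="post")  # uniform length
```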
- **Balancing**: The dataset is balanced by undersampling the majority class, so positive and negative samples are equally represented.
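One way to undersample the majority class, as a sketch (the notebook may implement this differently):

```python
import numpy as np

def undersample(texts, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n = min(len(pos), len(neg))  # size of the minority class
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    rng.shuffle(keep)
    return [texts[i] for i in keep], labels[keep]
```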
- **Data Augmentation**: To further balance the dataset, data augmentation techniques can be used to generate more samples for the minority class.
- **Using GloVe Embeddings**: Pre-trained GloVe embeddings are used to convert words into dense vectors of fixed size, capturing semantic meanings.
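Loading the embeddings usually means building a weight matrix indexed by the tokenizer's `word_index`; a sketch (the 100-dimension assumption matches `glove.6B.100d.txt`, but any of the `glove.6B` files works with the matching dimension):

```python
import numpy as np

def load_glove_matrix(glove_path, word_index, embedding_dim=100, num_words=10000):
    """Build a (num_words, embedding_dim) matrix; words missing from GloVe stay zero."""
    matrix = np.zeros((num_words, embedding_dim))
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")  # format: word v1 v2 ... vN
            word, vec = parts[0], np.asarray(parts[1:], dtype="float32")
            idx = word_index.get(word)
            if idx is not None and idx < num_words:
                matrix[idx] = vec
    return matrix
```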
- **Saving**: The tokenizer and processed data (padded sequences and labels) are saved for future use.
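Saving can be done with `pickle` for the tokenizer and `np.save` for the arrays; a sketch with an illustrative helper (`save_artifacts` is not a function from the notebooks):

```python
import os
import pickle
import numpy as np

def save_artifacts(tokenizer, padded, labels, out_dir):
    """Persist the fitted tokenizer and the processed arrays for training and serving."""
    os.makedirs(out_dir, exist_ok=True)
    # Tokenizer -> app/tokenizer/tokenizer.pickle in the project layout above.
    with open(os.path.join(out_dir, "tokenizer.pickle"), "wb") as f:
        pickle.dump(tokenizer, f, protocol=pickle.HIGHEST_PROTOCOL)
    # Arrays -> data/processed/*.npy in the project layout above.
    np.save(os.path.join(out_dir, "train_padded.npy"), padded)
    np.save(os.path.join(out_dir, "train_labels.npy"), labels)
```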
The model is a Bidirectional LSTM (BiLSTM) neural network with the following layers:
- **Embedding Layer**: Uses pre-trained GloVe embeddings to convert words into dense vectors of fixed size, capturing semantic meanings.
- **Bidirectional LSTM Layers**:
  - First Bidirectional LSTM layer with 128 units and `return_sequences=True` to output full sequences.
  - Dropout layer with a 0.5 rate to prevent overfitting.
  - Second Bidirectional LSTM layer with 64 units to further capture context from both directions.
- **Dense Layers**:
  - A dense layer with 64 units and ReLU activation.
  - Output dense layer with 1 unit and sigmoid activation to predict the sentiment probability.
- Loss Function: Binary Crossentropy, suitable for binary classification tasks.
- Optimizer: Adam, which adapts per-parameter learning rates during training.
- Metrics: Accuracy, to measure the performance of the model.
- Training: The model is trained on the balanced dataset for 10 epochs with a batch size of 32.
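Putting the layers together in Keras might look like this (the zero matrix stands in for the real GloVe weight matrix, and the sizes are illustrative):

```python
import numpy as np
from tensorflow.keras import Sequential, initializers
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10000, 100, 200
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))  # replace with the GloVe matrix

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM,
              embeddings_initializer=initializers.Constant(embedding_matrix),
              trainable=False),                       # frozen GloVe embeddings
    Bidirectional(LSTM(128, return_sequences=True)),  # first BiLSTM, outputs sequences
    Dropout(0.5),                                     # regularization
    Bidirectional(LSTM(64)),                          # second BiLSTM
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),                   # sentiment probability
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# model.fit(train_padded, train_labels, epochs=10, batch_size=32)
```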
The model’s performance is evaluated on the test set using accuracy and loss metrics. Additional evaluation metrics such as precision, recall, and F1-score are used to gain insights into model performance. Confidence levels for predictions are calculated to understand the model's certainty in its predictions.
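Using scikit-learn, the extra metrics and a per-prediction confidence can be computed like this (the probability values are made-up examples, not model output):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0])
probs  = np.array([0.92, 0.10, 0.40, 0.81, 0.64])  # example sigmoid outputs
y_pred = (probs >= 0.5).astype(int)                # threshold at 0.5

precision = precision_score(y_true, y_pred)
recall    = recall_score(y_true, y_pred)
f1        = f1_score(y_true, y_pred)

# Confidence: probability assigned to the predicted class.
confidence = np.where(y_pred == 1, probs, 1 - probs)
```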
The trained model is deployed using FastAPI, allowing it to serve predictions via a REST API. An endpoint `/predict` is created to accept text input and return the predicted sentiment and confidence level.
Endpoint: `/predict`
- Input: JSON with a `text` field, e.g., `{"text": "I love this movie!"}`
- Output: JSON with the predicted sentiment and confidence level, e.g., `{"sentiment": "positive", "confidence": 0.9677}`
- **Download the Zip file from GitHub**:
  - GloVe GitHub repository: https://github.com/stanfordnlp/GloVe
  - GloVe embeddings download: https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip
- **Move the .txt files to the glove.6B folder**:
  - Extract the `.txt` files from the downloaded Zip file and move them into the project's `glove.6B` folder.
- Create a virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install Dependencies:
pip install -r requirements.txt
- Navigate to the project directory:
  cd imdb_sentiment
- Run the preprocessing notebook:
  - Open and run `data_preprocessing.ipynb` to preprocess the data and save the processed data files.
- Run the training notebook:
  - Open and run `model_training.ipynb` to train the model and save the trained model file.
To tailor the model to your specific needs, you can change various hyperparameters in the `model_training.ipynb` notebook. Some of the key hyperparameters you might want to adjust include:
- Number of LSTM Units: Adjust the number of units in the LSTM layers.
- Dropout Rate: Change the dropout rate to prevent overfitting.
- Embedding Dimension: Modify the size of the word embeddings.
- Batch Size: Experiment with different batch sizes during training.
- Number of Epochs: Increase or decrease the number of epochs to control the training duration.
- Navigate to the project directory:
cd imdb_sentiment
- Run the FastAPI application using Uvicorn:
uvicorn app.main:app --reload
- Access the API documentation:
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
- **Advanced Models**:
  - The `model_training.ipynb` notebook includes steps for training more advanced models such as BERT, but these models are not used in the current API implementation.