# Airline Tweets Sentiment Analysis using Machine Learning  
### Predicting the sentiment as Positive/Negative with LinearSVC & FastAPI | Dockerized & Deployed on Render

## Problem Description
After a flight, passengers often tweet about the journey once they land or get home. These tweets describe either a positive or a negative experience with the carrier. Airlines have to scan through them, identify the negative ones, and respond appropriately, for example with an apology, an assurance of better service, or a correction.
This machine learning project helps airlines identify negative reviews and take appropriate steps for improvement.


The objective of this project is to develop a machine-learning-based Airline Tweets Sentiment Analysis system that predicts whether a tweet is positive or negative.

The project includes:

- Data cleaning and preprocessing
- Exploratory Data Analysis (EDA)  
- Feature engineering and selection
- Model training and comparison (Logistic Regression, LinearSVC, Complement Naive Bayes)  
- Hyperparameter tuning
- Training final model
- Deploying the model using FastAPI and Docker

The final solution is built on the LinearSVC model, which achieved the best F1 score among the models compared.

The model is deployed as a REST API using FastAPI, containerized using Docker, and hosted on Render for real-time inference.


## How the Solution Is Used

### 1. Tweet Sentiment Prediction

When a tweet is received, its details are sent to the `/predict` endpoint. The API responds with a numeric sentiment label and the corresponding review classification.

#### Example request:
```json
[
  {
    "text": "It was not at all a good flight. We were stranded at Frankfurt which was a stopover",
    "airline": "american",
    "retweet_count": 0
  }
]
```


#### Example Response
```json

{
  "sentiment": 0,
  "review": "Negative Review"
}

```
Based on the model response:

| Prediction      | Recommended Action                  |
|-----------------|-------------------------------------|
| Negative Review | Take action for improvement/apology |
| Positive Review | Thank customer                      |
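
The table above can be turned into a small lookup on the client side. The sketch below is illustrative only: `recommended_action` and the `ACTIONS` mapping are hypothetical names that are not part of the project, and the code assumes the response carries the `review` field shown in the example response.

```python
# Hypothetical helper: maps the API's "review" field to the follow-up action
# listed in the table above. Not part of the project's code.
ACTIONS = {
    "Negative Review": "Take action for improvement/apology",
    "Positive Review": "Thank customer",
}


def recommended_action(response: dict) -> str:
    """Return the follow-up action for a /predict response."""
    review = response.get("review")
    if review not in ACTIONS:
        raise ValueError(f"Unexpected review label: {review!r}")
    return ACTIONS[review]


print(recommended_action({"sentiment": 0, "review": "Negative Review"}))
```

Raising on an unknown label, rather than defaulting to one of the two actions, keeps a changed API response from being silently misrouted.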

### Summary of integration

- Supports real-time sentiment prediction
- Returns a binary sentiment label (`1` = positive, `0` = negative) and the corresponding review classification
- Suitable for integration with Twitter Corpus of airline reviews
- Helps airlines improve their services

## Exploratory Data Analysis (EDA)

### 1. Dataset Overview

- **Total tweets:** `4008`
- **Positive Reviews:** `1781` (~44.44%)
- **Negative Reviews:** `2227` (~55.56%)
- **Missing values:** `user_timezone`: `1280`

**Data source:** [Airline Sentiment Prediction Dataset (Kaggle)](https://www.kaggle.com/competitions/bootcamp-the-basics-of-mla/data)


---

### 2. Feature Overview

| Feature Type | Features |
|--------------|----------|
| **Identifier** | `Id`|
| **Numerical** | `retweet_count` |
| **Non Numeric** | `airline`, `text`, `user_timezone` |
| **Target variable** | `airline_sentiment` (positive = Positive Review, negative = Negative Review) |

---

### 3. Class Distribution (Class Imbalance)

The target variable is slightly imbalanced:
- **Positive Reviews:** `1781`
- **Negative Reviews:** `2227`

**Image in repository at:** `images/class_distribution.png`

![Sentiment Prediction Counts](images/class_distribution.png)



**Key insight:**
- Accuracy is not sufficient to judge the models → we focused on **F1-score**
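
To make the metric concrete: F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The function name and the example counts are illustrative and are not results from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        # Without true positives, F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts, for illustration only.
print(round(f1_from_counts(tp=80, fp=20, fn=20), 3))
```

Unlike accuracy, this score ignores true negatives, so a model cannot look good simply by predicting the majority class.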

## Feature Engineering

We encoded the categorical `airline_sentiment` feature into a numeric feature called `sentiment` as follows:<br>
`df["sentiment"] = (df["airline_sentiment"] == "positive").astype(int)`

The most important feature, **text**, also had to be converted into numeric features that capture the general characteristics of negative tweets.
Some features were dropped because they did not influence the sentiment variable.
Feature engineering for EDA was done within the notebook. <br>
The `FeatureEngineering` transformer is implemented in **src/feature_engineering.py**. This transformation is applied as the first step of the ML pipeline
to ensure consistency during both training and inference.

### Key Transformations

| Feature | Derived Feature / Action | Description / Motivation |
|--------|-------------|------------|
| `airline_sentiment` | `sentiment` | Binary numeric encoding of the sentiment (0/1) |
| `text` | `text_length` | Number of characters in the tweet |
| `text` | `word_count` | Number of words in the tweet |
| `text` | `neg_word_count` | Number of negative words in the tweet |
| `text` | `all_caps_count` | Number of words written in all caps in the tweet |
| `text` | `exclamations` | Number of exclamation marks in the tweet |
| `text` | `has_negation` | Binary flag indicating the presence of a negation such as no/not/never |
| **Dropped:** `Id` | Removed unique identifier | Prevents data leakage and overfitting |
| **Dropped:** `user_timezone` | Removed | Negligible impact on the target variable and many missing values |
| **Dropped:** `airline` | Removed | May have a slight impact on the target variable; removed to avoid introducing bias toward any airline |
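
The text-derived features in the table can be sketched in plain Python as follows. This is a minimal illustration, not the code in the project's `FeatureEngineering` class: the function name `derive_text_features`, the tokenization rule, and both word lists are assumptions, since the README does not show the actual lexicon or tokenizer.

```python
import re

# Hypothetical word lists; the project's actual lexicon is not shown in this README.
NEGATIVE_WORDS = {"bad", "worst", "delayed", "cancelled", "lost", "rude", "terrible"}
NEGATIONS = {"no", "not", "never", "none", "cannot"}


def derive_text_features(text: str) -> dict:
    """Derive the numeric features listed in the table from one tweet."""
    # Assumed tokenization: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    return {
        "text_length": len(text),
        "word_count": len(words),
        "neg_word_count": sum(w in NEGATIVE_WORDS for w in lowered),
        # Count fully upper-case words of at least two characters.
        "all_caps_count": sum(w.isupper() and len(w) > 1 for w in words),
        "exclamations": text.count("!"),
        "has_negation": int(any(w in NEGATIONS or w.endswith("n't") for w in lowered)),
    }


print(derive_text_features("WORST flight ever! Bags were lost and not found!!"))
```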

### Why This Matters

- Most machine-learning algorithms operate on numbers, not labels or strings.
- We used the `text` feature to derive several numeric features, such as `text_length` and `neg_word_count`, so that the model learns the characteristic features of negative and positive tweets.
- Excluding unique identifiers prevents **data leakage**.
- Columns such as `airline` and `user_timezone` were dropped so that the model is built on the important features.


### EDA after Feature Engineering 

Because the dataset has few features, we ran EDA after feature engineering as follows:


#### Average Text Length by Sentiment
**Image in repository at:** `images/Average Text Length (characters) by Sentiment.png`

![Average Text Length by Sentiment](<images/Average Text Length (characters) by Sentiment.png>)

#### Average Word Count by Sentiment
**Image in repository at:** `images/Average Text Length (characters) by Sentiment.png`

![Average Word Count by Sentiment](<images/Average Text Length (characters) by Sentiment.png>)

#### Average Exclamation Count by Sentiment
**Image in repository at:** `images/Average Text Length (characters) by Sentiment.png`

![Average Exclamation Count by Sentiment](<images/Average Text Length (characters) by Sentiment.png>)

#### Average Negative Word Count by Sentiment
**Image in repository at:** `images/Average Text Length (characters) by Sentiment.png`

![Average Negative Word Count by Sentiment](<images/Average Text Length (characters) by Sentiment.png>)

#### Average Retweet Count by Sentiment
**Image in repository at:** `images/Average Text Length (characters) by Sentiment.png`

![Average Retweet Count by Sentiment](<images/Average Text Length (characters) by Sentiment.png>)

#### Average All CAPS Count by Sentiment
**Image in repository at:** `images/Average Text Length (characters) by Sentiment.png`

![Average All CAPS Count by Sentiment](<images/Average Text Length (characters) by Sentiment.png>)


#### Text Distribution 


#### Word Count Distribution


### 4. Mutual Information (Categorical/Binary Features)

| Feature | MI Score |
|---------|----------|
| `employment_status` | 0.175941 |
| `grade_subgrade` | 0.026769 |
| `education_level` | 0.0048 |
| `loan_purpose` | 0.000325 |
| `gender` | 0.000028 |
| `marital_status` | 0.000003 |


**Insight:**
- `employment_status` is a strong feature in the dataset; it indicates whether the customer will pay back the loan. `grade_subgrade` is also a good feature.
- Features such as `loan_purpose` and `education_level` have less influence but can prove useful after encoding.
- `gender` and `marital_status` have little influence on loan payback and are therefore dropped.
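
For reference, a mutual information score between a discrete feature and the target can be computed as sketched below. This is the plain plug-in estimator in nats, written with the standard library only. The README does not say which estimator the notebook used (scikit-learn's `mutual_info_score` would be one option), so treat the function as an illustration, not the project's code.

```python
import math
from collections import Counter


def mutual_information(xs: list, ys: list) -> float:
    """Mutual information (in nats) between two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Toy data: a feature identical to the target vs. one independent of it.
target = [0, 0, 1, 1]
print(mutual_information([0, 0, 1, 1], target))
print(mutual_information([0, 1, 0, 1], target))
```

A score near zero, as for `gender` and `marital_status` in the table, means that knowing the feature says almost nothing about the target.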

### 5. Correlation of Numeric Features

**Image in repository:** `images/correlation_matrix.png`

![Correlation Matrix of Numeric Features](images/correlation_matrix.png)

**Important correlations with `loan_paid_back`:**
- `debt_to_income_ratio` → 0.335758
- `credit_score` → 0.234319
- `interest_rate` → -0.130789

#### Interpretation:
* **High Credit Score** → Loan Paid Back probability is higher
* **Low Debt to Income ratio** → Loan Paid Back probability is higher
* **Lower interest rates** → Loan Paid Back probability is higher
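
The coefficients above can be reproduced from raw columns with a few lines of code. The sketch below assumes the matrix holds Pearson correlation coefficients, which the README does not state. The function name and the toy data are illustrative.

```python
import math


def pearson_correlation(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data: y rises with x, so the coefficient is positive.
print(pearson_correlation([1, 2, 3, 4], [2, 4, 5, 9]))
```

A positive coefficient means the two columns rise together and a negative one means they move in opposite directions, which is the reading to apply when checking each interpretation against its sign.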

## Feature Engineering

Certain categorical features had to be encoded into numeric form to improve model performance. Some features were dropped because they did not influence the loan payback variable.
The `FeatureEngineering` transformer is implemented in **src/features.py**. This transformation is applied as the first step of the ML pipeline to ensure consistency during both training and inference.

### Key Transformations

| Feature | Derived Feature / Action | Description / Motivation |
|--------|-------------|------------|
| `education_level` | `education_encoded` | Encoded using an ordinal encoder |
| `grade_subgrade` | `grade_code` | Encoded using an ordinal encoder |
| `loan_purpose` | `loan_purpose_te` | Encoded using a target encoder |
| **Dropped:** `id` | Removed unique identifier | Prevents data leakage and overfitting |
| **Dropped:** `education_level`, `grade_subgrade`, `loan_purpose` | Replaced by encoded features | Avoids redundancy |
| **Dropped:** `marital_status`, `gender` | Negligible impact on the target variable `loan_paid_back` | Feature reduction |

### Why This Matters

- Most machine-learning algorithms operate on numbers, not labels or strings.
- Encoding `education_level` with an ordinal encoder helps the model capture the inherent ordering of the education levels.
- Encoding `grade_subgrade` with an ordinal encoder helps the model capture the ordering of the grades.
- `loan_purpose` was target-encoded because some loan types, such as education and medical loans, are riskier to give.
- Excluding unique identifiers prevents **data leakage**.
- Columns such as `gender` and `marital_status` were dropped so that the model is built on the important features.
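
The two encodings mentioned above can be sketched with the standard library alone. The category order in `EDUCATION_ORDER`, the function names, and the toy data are hypothetical; the project's real encoders and category values are not shown in this README.

```python
from collections import defaultdict

# Hypothetical category order; the project's actual ordering is not shown here.
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "PhD"]


def ordinal_encode(values: list, order: list) -> list:
    """Map each category to its position in an explicit ordering."""
    index = {category: i for i, category in enumerate(order)}
    return [index[v] for v in values]


def target_encode(categories: list, targets: list):
    """Replace each category with the mean target value observed for it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    means = {c: totals[c] / counts[c] for c in counts}
    return [means[c] for c in categories], means


print(ordinal_encode(["PhD", "High School"], EDUCATION_ORDER))
encoded, means = target_encode(["car", "car", "medical", "medical"], [1, 1, 0, 1])
print(encoded, means)
```

Target-encoding means should be learned from the training split only and then reused for validation, test, and inference data. Computing them on the full dataset would leak target information.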

### Pipeline Integration

The transformer is used as part of the final ML pipeline:

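```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

# FeatureEngineering (custom transformer) and final_model are defined elsewhere in the project.
pipeline = Pipeline(
    steps=[
        ("featureengineering", FeatureEngineering()),  # custom feature engineering step
        ("vectorizer", DictVectorizer(sparse=False)),  # dicts -> dense numeric matrix
        ("model", final_model),  # XGBClassifier
    ]
)
```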

## Model Training & Selection

The dataset was split into:
* 60% Training
* 20% Validation
* 20% Testing
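
A 60/20/20 split like the one above can be sketched with the standard library as follows. The README does not show how the project actually performs the split (scikit-learn's `train_test_split` is a common choice), so the function name, the shuffling approach, and the seed are assumptions.

```python
import random


def split_60_20_20(rows, seed=42):
    """Shuffle rows and split them into train/validation/test (60/20/20)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 60 // 100
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_60_20_20(range(100))
print(len(train), len(val), len(test))
```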

Multiple models were trained on the training set and evaluated against the validation set. Hyperparameter tuning and decision-threshold optimization were performed to maximize predictive performance, with particular focus on F1-score and recall.
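
Threshold optimization of this kind can be sketched as a scan over candidate thresholds that keeps the one with the best validation F1. The function names, the candidate grid, and the toy probabilities are illustrative and are not taken from the project's notebook.

```python
def f1(labels, preds):
    """F1 score for binary labels and predictions (positive class = 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def select_threshold(probas, labels, candidates):
    """Return the (threshold, F1) pair with the highest F1 on the given data."""
    best = None
    for t in candidates:
        preds = [int(p >= t) for p in probas]
        score = f1(labels, preds)
        if best is None or score > best[1]:
            best = (t, score)
    return best


# Toy validation data, for illustration only.
print(select_threshold([0.1, 0.4, 0.6, 0.9], [0, 1, 1, 1], [0.3, 0.5, 0.7]))
```

The scan should run on the validation set only, so that the test set stays untouched until the final evaluation.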

### Models Evaluated

| Model | Tuned Parameters | Decision Threshold | ROC-AUC | Accuracy | F1 Score |
|-------|------------------|-------------------|---------|-----------|----------|
| Logistic Regression | `solver = lbfgs, C=1, max_iter=1000` | 0.4 | 0.909741 | 0.9016 | 0.941174 | 
| Decision Tree | `max_depth=10, min_samples_leaf=100, random_state=42` | 0.4 | 0.91415   | 0.901918  | 0.941701  |
| Random Forest | `max_depth=10, min_samples_leaf=20, n_estimators=300, n_jobs=-1, max_features=sqrt` | 0.55 | 0.912805 | 0.902701 | 0.941708 |
| XGBoost (Final) | `objective='binary:logistic', eval_metric='auc', subsample=1.0, n_estimators=450, min_child_weight=30, max_depth=6, learning_rate=0.1, n_jobs=-1, random_state=42` | 0.45 | 0.921519 | 0.904688 | 0.943001 |


### Final Model Selection

After comparing performance across models, **XGBClassifier** was selected as the final production model based on the following:

* Highest ROC-AUC on validation  
* Best F1-score, indicating a strong balance of precision and recall when predicting loan payback

### Final Model Evaluation (on Test Set)

After selecting XGBClassifier, the model was retrained on the combined training + validation datasets, and the final evaluation was performed on the held-out test set.

| Metric | Value |
|--------|-------|
| ROC-AUC | 0.9781 |
| F1 Score | 0.8025 |
| Decision Threshold | 0.80 |

## Exporting Notebook to Script

To comply with project requirements and ensure reproducibility, all essential machine learning steps developed in the notebook (`notebooks/notebook.ipynb`) were fully converted into Python scripts.

### Scripts Created

| Script | Purpose |
|--------|---------|
| `src/train.py` | Contains final model training pipeline and saves the trained model. |
| `src/predict.py` | Loads the trained model and serves predictions via a FastAPI REST endpoint. |
| `src/featureproc.py` | Implements the custom feature engineering logic. |

### What Was Exported from Notebook
The following core logic developed and validated in `notebooks/notebook.ipynb` was migrated into standalone scripts for production readiness:

| Exported Component | Implemented In | Description |
|-------------------|----------------|-------------|
| Data loading | `train.py` | Reads loan dataset from `data/train.csv`. |
| Feature engineering logic | `featureproc.py` | Custom transformer class `FeatureEngineering`. |
| Model training & hyperparameter tuning | `train.py` | Uses tuned XGBoost model parameters finalized from notebook experiments. |
| Decision threshold selection (`0.45`) | `train.py` | Threshold locked based on best F1 score from validation results. |
| Final model training | `train.py` | Trains full XGBoost model on entire training dataset. |
| Model serialization (pipeline + threshold) | `train.py` | Saved using `pickle` as `models/loanpayback_pipeline.bin`. |
| API-based prediction logic | `predict.py` | Loads trained pipeline and serves predictions via FastAPI. |

### Example: Model Saving in `train.py`
```python
import pickle

model_path = "models/loanpayback_pipeline.bin"

# Persist the fitted pipeline together with the tuned decision threshold
with open(model_path, "wb") as f_out:
    pickle.dump({"pipeline": pipeline, "threshold": best_threshold}, f_out)
```

### Example: Model Loading in `predict.py`
```python
with open(MODEL_PATH, "rb") as f_in:
    model_data = pickle.load(f_in)

pipeline = model_data["pipeline"]
threshold = model_data["threshold"]
```
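
Once loaded, the pipeline and threshold could be combined as sketched below. This is an assumption about how `predict.py` applies the threshold: the helper name `predict_label`, the `>=` comparison, and passing the record as a single-item list of dicts are all illustrative choices, not code from the project.

```python
def predict_label(pipeline, record: dict, threshold: float) -> int:
    """Return 1 if the positive-class probability meets the threshold, else 0."""
    # predict_proba returns one row per input; for a binary classifier with
    # classes [0, 1], column 1 holds the positive-class probability.
    proba = pipeline.predict_proba([record])[0][1]
    return int(proba >= threshold)
```

Storing the threshold next to the pipeline, as the pickle above does, keeps training and inference on the same cut-off.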

## Reproducibility

This project is fully reproducible. The dataset, notebook, and training scripts are included in the repository, allowing seamless re-execution.

- Dataset available in `data/train.csv`
- Full analysis in `notebooks/LoanPaybackNB.ipynb`
- Feature engineering transformer in `src/featureproc.py`
- Final model training located in `src/train.py`
- Inference logic exposed via `src/predict.py`
- Trained pipeline saved at `models/loanpayback_pipeline.bin`



### How to Reproduce
```bash
# Install dependencies and set up environment
uv sync

# Run training script
uv run python -m src.train

# Start inference API
uv run uvicorn src.predict:app --reload --port 8000
```

---

## Model Deployment (Local)

The trained machine learning model is deployed locally using **FastAPI** and served via **Uvicorn**.

### Start API Locally
```bash
uv run uvicorn src.predict:app --reload --port 8000
```

Once the application is running:

- **Swagger UI (API documentation):** `http://localhost:8000/docs`
- **Root endpoint:** `http://localhost:8000/`

### Supported Features
- **Single loan prediction API (POST):** `/predict`
- **HTML-based UI for interactive prediction:** `/ui`
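
A request to the `/predict` endpoint can be prepared with the standard library as sketched below. The payload reuses the shape of the example request earlier in this README. The helper name `build_predict_request` is hypothetical, and the request is only built, not sent, so the snippet runs without a live server.

```python
import json
import urllib.request


def build_predict_request(payload, base_url="http://localhost:8000"):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = [
    {
        "text": "It was not at all a good flight. We were stranded at Frankfurt which was a stopover",
        "airline": "american",
        "retweet_count": 0,
    }
]
request = build_predict_request(payload)

# With the API running locally, the request could be sent like this:
#   with urllib.request.urlopen(request) as resp:
#       print(json.load(resp))
```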
  
---

## Dependency & Environment Management

The project uses **uv** to manage dependencies and execution. All required packages are defined in `pyproject.toml` and `requirements.txt`.

### Install Dependencies
```bash
uv sync
```
### Example Execution Commands
```bash
uv run python -m src.train      # Train model
uv run uvicorn src.predict:app --reload --port 8000   # Launch API
```
---

## Dependency Files

### `requirements.txt`
```txt
fastapi==0.128.0
joblib==1.5.3
numpy==2.4.0
pandas==2.3.3
pydantic==2.12.5
pydantic_core==2.41.5
scikit-learn==1.8.0
scipy==1.16.3
seaborn==0.13.2
uv==0.9.21
uvicorn==0.40.0
xgboost==3.1.2

```

### `pyproject.toml`
```toml
[project]
name = "loanpayback"
version = "0.1.0"
description = "Loan Payback Predicion Project"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "fastapi>=0.128.0",
    "pandas>=2.3.3",
    "scikit-learn>=1.8.0",
    "uvicorn>=0.40.0",
    "xgboost>=3.1.2",
    "pydantic>=2.12.5",
    "pydantic_core>=2.41.5",
    "numpy>=2.4.0",
    "pandas>=2.3.3",
    "scipy==1.16.3",
    "notebook>=7.5.1",
]

[dependency-groups]
dev = [
    "requests>=2.32.5",
]

```
---

## Containerization (Docker)

The project is fully containerized using **Docker**, allowing consistent deployment across environments.
### Dockerfile Used
```dockerfile
# Use lightweight Python
FROM python:3.11-slim

# ==============================
# Environment settings
# ==============================
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# ==============================
# Working directory
# ==============================
WORKDIR /app

# ==============================
# System dependencies (XGBoost)
# ==============================
RUN apt-get update && apt-get install -y \
    build-essential \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# ==============================
# Python dependencies
# ==============================
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ==============================
# Copy only required files
# ==============================
COPY src ./src
COPY models ./models

# ==============================
# Expose API port
# ==============================
EXPOSE 8000

# ==============================
# Run FastAPI
# ==============================
CMD ["uvicorn", "src.predict:app", "--host", "0.0.0.0", "--port", "8000"]

```
---



### Build the Docker Image

Run the following command inside the project folder:
```bash
docker build --no-cache -t loanpayback-api .
```
---

### Run the Docker Container
```bash
docker run -p 8000:8000 loanpayback-api
```

Once started, the API will be available at:
- **Root endpoint** → `http://localhost:8000/`
- **Swagger UI** → `http://localhost:8000/docs`
---


## Cloud Deployment

The Loan Payback Prediction API is deployed on Render using FastAPI and Docker, enabling real-time loan repayment inference through RESTful endpoints.

### Deployment Steps (Docker + Render)

#### 1. Push complete project to GitHub
[GitHub repo link](https://github.com/codevalhalla/ml-ecommerce-fraud-detection)


#### 2. On Render Dashboard → “New Web Service”

#### 3. Select Deployment Settings

| Setting | Value |
|---------|-------|
| Environment | Docker |
| Repository | `codevalhalla/ml-ecommerce-fraud-detection` |
| Branch | main |
| Root Directory | `(leave empty)` |
| Environment Variables | `PORT=8000` |
| Instance Type | Free Tier |
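
The Dockerfile's `CMD` hard-codes port 8000, while the settings above also define a `PORT` variable. If the service were launched from Python instead, the port could be read from that variable, as in the sketch below. `resolve_port` is a hypothetical helper and not part of the project.

```python
import os


def resolve_port(env=None, default=8000):
    """Read the listening port from the PORT variable, falling back to a default."""
    env = os.environ if env is None else env
    raw = env.get("PORT", "")
    port = int(raw) if raw.strip() else default
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    return port


print(resolve_port({"PORT": "8000"}))

# Possible use when starting the server programmatically (assumes uvicorn is installed):
#   import uvicorn
#   uvicorn.run("src.predict:app", host="0.0.0.0", port=resolve_port())
```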

#### 4. Click "Deploy Web Service"

Render automatically:

* Pulls the repo
* Builds the Docker image
* Runs the FastAPI service using the command from the Dockerfile

```dockerfile
CMD ["uvicorn", "src.predict:app", "--host", "0.0.0.0", "--port", "8000"]
```
---