In [None]:
!pip install datasets duckdb pyarrow
from google.colab import drive
drive.mount('/content/drive')

shared_folder = '/content/drive/My Drive/unified_dataset'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Binary Sentiment Predictor for Product Ratings

## Overview
A scalable sentiment classifier trained incrementally on Amazon review text. It predicts binary sentiment (negative/positive), with labels derived from star ratings (≤3 = negative, >3 = positive).
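
As a minimal sketch of the labeling rule, the snippet below maps star ratings to binary labels. The helper name `rating_to_label` is hypothetical and used only for illustration; the notebook itself derives labels with a vectorized pandas comparison.

```python
def rating_to_label(rating: int) -> int:
    """Map a 1-5 star rating to a binary label: 1 = positive, 0 = negative."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    # Ratings of 4 and 5 are positive; 1, 2 and 3 are negative.
    return int(rating > 3)


labels = [rating_to_label(r) for r in (1, 2, 3, 4, 5)]
print(labels)  # [0, 0, 0, 1, 1]
```

Note that 3-star reviews land in the negative class under this rule.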

## Workflow
1. Load data in chunks
2. Convert ratings to binary labels
3. Train/test split
4. TF-IDF vectorization
5. Incremental SGD training
6. Batch evaluation
7. Aggregate metrics?
    * Yes (no chunks left): go to step 8
    * No (more chunks to read): return to step 1
8. Final evaluation

## Components
1. Preprocessing
   * TF-IDF vectorization
   * Terms that appear in fewer than 5 documents or in more than 80% of documents are ignored
2. Model
   * Logistic regression trained with stochastic gradient descent (`SGDClassifier` with `loss='log_loss'`)
3. Evaluation Metrics
   * Accuracy
   * F1 score
   * Confusion matrix
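
To illustrate the document-frequency pruning described under Preprocessing, here is a rough standard-library sketch. The function `prune_vocabulary` and the toy documents are assumptions made for this example and are not part of the notebook; it only mimics the effect of the `min_df` and `max_df` settings and does not compute TF-IDF weights.

```python
import re
from collections import Counter


def prune_vocabulary(docs, min_df=5, max_df=0.8):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    token_re = re.compile(r"\b\w+\b")
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, not once per occurrence.
        doc_freq.update(set(token_re.findall(doc.lower())))
    upper = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= upper)


toy_docs = [
    "the battery died",
    "the screen is great",
    "the battery is great",
    "great value",
    "the case broke",
]
# Smaller thresholds than the notebook's, so the effect shows on five documents.
vocab = prune_vocabulary(toy_docs, min_df=2, max_df=0.7)
print(vocab)  # ['battery', 'great', 'is']
```

Here "the" is dropped for being too common (4 of 5 documents), and words that occur in a single document are dropped for being too rare.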

In [2]:
import duckdb
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

DATASET_PATH = "/content/drive/My Drive/unified_dataset/**/*.parquet"
CHUNK_SIZE = 25000000

vectorizer = TfidfVectorizer(
    lowercase=True,
    token_pattern=r'\b\w+\b',
    min_df=5,
    max_df=0.8
)
model = SGDClassifier(loss='log_loss', random_state=42)
classes = [0, 1]

offset = 0
batch_num = 1
con = duckdb.connect()

all_preds = []
all_truths = []

while True:
    query = f"""
        SELECT text, rating
        FROM read_parquet('{DATASET_PATH}')
        WHERE rating BETWEEN 1 AND 5
          AND text IS NOT NULL
          AND LENGTH(TRIM(text)) > 0
        LIMIT {CHUNK_SIZE} OFFSET {offset}
    """
    df = con.execute(query).fetch_df()
    if df.empty:
        break

    df["label"] = (df["rating"] > 3).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42
    )

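    # The TF-IDF vocabulary is fitted on the first chunk only; later chunks
    # reuse it, so terms that first appear after chunk 1 are ignored.
    if batch_num == 1:
        X_train_vec = vectorizer.fit_transform(X_train)
        # partial_fit needs the full list of classes on its first call.
        model.partial_fit(X_train_vec, y_train, classes=classes)
    else:
        X_train_vec = vectorizer.transform(X_train)
        model.partial_fit(X_train_vec, y_train)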

    X_test_vec = vectorizer.transform(X_test)
    y_pred = model.predict(X_test_vec)

    all_preds.extend(y_pred)
    all_truths.extend(y_test)

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    print(f" Chunk {batch_num} — Accuracy: {acc:.4f}, F1: {f1:.4f}")
    print(cm, "\n")

    offset += CHUNK_SIZE
    batch_num += 1

 Chunk 1 — Accuracy: 0.8749, F1: 0.9234
[[ 605004  509874]
 [ 115616 3769506]] 



 Chunk 2 — Accuracy: 0.8662, F1: 0.9179
[[ 588877  553370]
 [ 115646 3742107]] 



 Chunk 3 — Accuracy: 0.8624, F1: 0.9192
[[ 399232  604223]
 [  83922 3912623]] 



 Chunk 4 — Accuracy: 0.8713, F1: 0.9282
[[ 196375  598626]
 [  44700 4160299]] 



 Chunk 5 — Accuracy: 0.8601, F1: 0.9107
[[ 732864  549965]
 [ 149518 3567653]] 



 Chunk 6 — Accuracy: 0.8669, F1: 0.9183
[[ 595796  534926]
 [ 130374 3738904]] 



 Chunk 7 — Accuracy: 0.8700, F1: 0.9200
[[ 610819  521051]
 [ 129160 3738970]] 



 Chunk 8 — Accuracy: 0.8631, F1: 0.9156
[[ 599698  572348]
 [ 112242 3715712]] 



 Chunk 9 — Accuracy: 0.8650, F1: 0.9163
[[ 630472  564974]
 [ 110123 3694431]] 



 Chunk 10 — Accuracy: 0.8578, F1: 0.9135
[[ 535143  608085]
 [ 103105 3753667]] 



 Chunk 11 — Accuracy: 0.8662, F1: 0.9190
[[ 534038  568131]
 [ 100985 3796846]] 



 Chunk 12 — Accuracy: 0.8761, F1: 0.9240
[[ 617325  506295]
 [ 113184 3763196]] 



 Chunk 13 — Accuracy: 0.8771, F1: 0.9245
[[ 621946  502265]
 [ 112287 3763502]] 



 Chunk 14 — Accuracy: 0.8718, F1: 0.9260
[[ 349676  577358]
 [  63846 4009120]] 



 Chunk 15 — Accuracy: 0.8651, F1: 0.9248
[[ 176673  635321]
 [  39399 4148607]] 



 Chunk 16 — Accuracy: 0.8653, F1: 0.9195
[[ 480336  577580]
 [  96025 3846059]] 



 Chunk 17 — Accuracy: 0.8513, F1: 0.9064
[[ 653432  616048]
 [ 127697 3602823]] 



 Chunk 18 — Accuracy: 0.8544, F1: 0.9111
[[ 539027  617094]
 [ 110946 3732933]] 



 Chunk 19 — Accuracy: 0.8700, F1: 0.9210
[[ 562760  546034]
 [ 104099 3787107]] 



 Chunk 20 — Accuracy: 0.8756, F1: 0.9240
[[ 595319  514558]
 [ 107357 3782766]] 



 Chunk 21 — Accuracy: 0.8423, F1: 0.9014
[[ 72367  80097]
 [ 14074 430452]] 



In [3]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

print("Final Evaluation")
overall_acc = accuracy_score(all_truths, all_preds)
overall_f1 = f1_score(all_truths, all_preds)
overall_cm = confusion_matrix(all_truths, all_preds)

print(f"Overall Accuracy: {overall_acc:.4f}")
print(f"Overall F1 Score: {overall_f1:.4f}")
print("Confusion Matrix:")
print(overall_cm)

Final Evaluation
Overall Accuracy: 0.8664
Overall F1 Score: 0.9192
Confusion Matrix:
[[10697179 11358223]
 [ 2084305 76457283]]


# Final Model Evaluation - Binary Sentiment Analysis

## Overall Performance Metrics
| Metric              | Score    | Interpretation                                          |
|---------------------|----------|---------------------------------------------------------|
| **Accuracy**        | 0.8664   | 86.64% of predictions correct                           |
| **F1 Score**        | 0.9192   | F1 for the positive class only; negative-class F1 is 0.6141 |

---

## Confusion Matrix (Counts)
|                   | Predicted Negative (0) | Predicted Positive (1) |
|-------------------|-------------------------|-------------------------|
| **Actual Negative (0)** | 10,697,179            | 11,358,223             |
| **Actual Positive (1)** | 2,084,305             | 76,457,283             |

---

## Detailed Class Metrics
### Negative Class (0 - Ratings 1-3 Stars)
| Metric     | Calculation                   | Value   |
|------------|-------------------------------|---------|
| Precision  | TN/(TN+FN) = 10.7M/12.8M      | 83.69%  |
| Recall     | TN/(TN+FP) = 10.7M/22.1M      | 48.50%  |
| F1 Score   | 2*(Prec*Recall)/(Prec+Recall) | 0.6141  |

### Positive Class (1 - Ratings 4-5 Stars)
| Metric     | Calculation                   | Value   |
|------------|-------------------------------|---------|
| Precision  | TP/(TP+FP) = 76.5M/87.8M      | 87.07%  |
| Recall     | TP/(TP+FN) = 76.5M/78.5M      | 97.35%  |
| F1 Score   | 2*(Prec*Recall)/(Prec+Recall) | 0.9192  |
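
The per-class figures can be recomputed directly from the four confusion-matrix counts. The sketch below is one way to do that with plain Python; the helper `class_metrics` is a hypothetical name introduced here, and the counts are copied from the overall confusion matrix.

```python
def class_metrics(tn, fp, fn, tp):
    """Return (precision, recall, f1) for the negative and the positive class."""

    def prf(hits, wrongly_predicted, missed):
        precision = hits / (hits + wrongly_predicted)
        recall = hits / (hits + missed)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    return {
        # Negative class: precision = TN/(TN+FN), recall = TN/(TN+FP)
        "negative": prf(tn, fn, fp),
        # Positive class: precision = TP/(TP+FP), recall = TP/(TP+FN)
        "positive": prf(tp, fp, fn),
    }


metrics = class_metrics(tn=10_697_179, fp=11_358_223, fn=2_084_305, tp=76_457_283)
for name, (precision, recall, f1) in metrics.items():
    print(f"{name}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Working from the raw counts rather than the rounded "10.7M/12.8M" style figures avoids small rounding differences in the last digit.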

---

## Observations
1. **Class Imbalance**:  
   - The positive class dominates the evaluation data (78.5M positives vs 22.1M negatives)  
   - The model is biased toward the majority class: recall is 97.35% for positives but only 48.50% for negatives  

2. **Errors**:  
   - Many false positives: 11.4M negative reviews (about 51.5% of all negatives) were misclassified as positive  
   - Few false negatives: only 2.1M positive reviews were missed (good at catching positive sentiment)  

3. **Practical Implications**:  
   - The model is excellent at confirming positive experiences (97% recall)  
   - It struggles to identify negative reviews (48.5% recall)  
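
To put the imbalance in perspective, the sketch below compares the reported accuracy with two reference numbers computed from the same confusion matrix: the accuracy of always predicting "positive", and balanced accuracy (the mean of the two class recalls). Neither number appears in the notebook output; they are derived here only from the counts reported above.

```python
# Counts copied from the overall confusion matrix.
tn, fp, fn, tp = 10_697_179, 11_358_223, 2_084_305, 76_457_283
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
# Accuracy of a trivial classifier that labels every review as positive.
majority_baseline = (fn + tp) / total
negative_recall = tn / (tn + fp)
positive_recall = tp / (tp + fn)
# Balanced accuracy weights both classes equally, regardless of their size.
balanced_accuracy = (negative_recall + positive_recall) / 2

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

With these counts the always-positive baseline already reaches roughly 78% accuracy, so the 86.64% figure is a smaller gain than it first appears, and balanced accuracy comes out near 73%.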