# PriceSense: Advanced Naive Bayes for Persian Opinion Mining
This project trains a Multinomial Naive Bayes classifier on TF-IDF features to detect price mentions in Persian comments from DigiKala. Comments are normalized and tokenized with hazm, and hyperparameters are tuned by grid search.

## Dataset Overview
- **Train**: 40,000 labeled comments (`train.csv`)
- **Test**: 8,000 unlabeled comments (`test.csv`)
- **Columns**: `comment` (input text), `price_value` (label; 0: no price mention, 1: price mentioned)
- **Objective**: Binary classification with Multinomial Naive Bayes.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from hazm import Normalizer, word_tokenize
import string
from tqdm import tqdm
import re

## Data Loading
Load train and test datasets from CSV files.

In [2]:
data_dir = Path('../data')
train_data = pd.read_csv(data_dir / 'train.csv')
test_data = pd.read_csv(data_dir / 'test.csv')
train_data.head()

Unnamed: 0,comment,price_value
0,قیمت مناسب وکیفیت خوب پیشنهادمیکنم حتما خرید کنید,1
1,به اندازه یک میلیمتر دورتادور گوشی خالی میماند...,0
2,از همه نظر عالی و یک خرید خوب در قیمت حدود۴۰ ...,1
3,فقط یک بار هر یک ربع ساعت 1 درصد شارژ کرد بعدش...,0
4,قیمت این کالا خیلی تغییر میکنه . من خریدم چندر...,1


## Preprocessing
### Custom Stopwords
Build a custom stopword list from hazm's defaults, but keep the prepositions 'با', 'در', 'به', and 'از' in the text, since they can carry price context (for example 'در قیمت').

In [3]:
from hazm import stopwords_list

base_stopwords = set(stopwords_list())
price_related = {'با', 'در', 'به', 'از'}
custom_stopwords = base_stopwords - price_related

normalizer = Normalizer()

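def preprocess_text(text):
    """Normalize a comment and return its non-stopword tokens joined by spaces."""
    # Replace each digit run with a 'NUMBER' token (\d also matches Persian digits)
    text = re.sub(r'\d+', 'NUMBER', text)
    # Remove ASCII punctuation (string.punctuation does not cover Persian marks such as '،')
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Normalize Persian characters and spacing with hazm
    text = normalizer.normalize(text)
    # Replace line breaks with spaces and trim leading/trailing whitespace
    text = text.replace('\n', ' ').replace('\r', ' ').strip()
    # Tokenize
    tokens = word_tokenize(text)
    # Drop stopwords, keeping the prepositions excluded from custom_stopwords
    filtered = [token for token in tokens if token not in custom_stopwords]
    return ' '.join(filtered)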

# Preprocess all data once
tqdm.pandas()
train_data['processed'] = train_data['comment'].progress_apply(preprocess_text)
test_data['processed'] = test_data['comment'].progress_apply(preprocess_text)

100%|██████████| 40000/40000 [00:16<00:00, 2435.45it/s]
100%|██████████| 8000/8000 [00:02<00:00, 2857.59it/s]
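
The recorded run above uses `preprocess_text` as written. As a hedged sketch of two possible refinements (not what produced the results in this notebook): padding the placeholder with spaces keeps it a separate token when digits touch a word, as in 'حدود۴۰', and extending the punctuation table covers Persian marks that `string.punctuation` lacks. The names `PERSIAN_PUNCT`, `replace_numbers`, `strip_punct`, and `drop_stopwords` are illustrative, and whitespace splitting stands in for hazm's `word_tokenize` so the example runs without hazm.

```python
import re
import string

# ASCII punctuation plus a few common Persian/Arabic marks (assumed, not exhaustive).
PERSIAN_PUNCT = '،؛؟«»٪'
PUNCT_TABLE = str.maketrans('', '', string.punctuation + PERSIAN_PUNCT)

def replace_numbers(text):
    """Replace each digit run with a space-padded NUMBER placeholder."""
    # On str patterns, \d matches any Unicode decimal digit, including Persian ones.
    padded = re.sub(r'\d+', ' NUMBER ', text)
    # Collapse the extra spaces introduced by the padding.
    return ' '.join(padded.split())

def strip_punct(text):
    """Delete ASCII and Persian punctuation characters."""
    return text.translate(PUNCT_TABLE)

def drop_stopwords(text, stopwords):
    """Whitespace-tokenize and remove stopwords (a stand-in for hazm's word_tokenize)."""
    return ' '.join(tok for tok in text.split() if tok not in stopwords)

sample = 'قیمت حدود۴۰ تومان، عالی!'
cleaned = strip_punct(replace_numbers(sample))
print(cleaned)
```

Any change of this kind alters the token stream, so the grid search and validation scores would need to be rerun before comparing against the numbers reported in this notebook.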


## Feature Engineering
- Use TF-IDF with unigrams and bigrams for richer features.
- Cap the vocabulary at 5,000 or 10,000 terms (chosen by grid search) to balance noise and information.

## Model Training
- Split train data into train/validation (80/20).
- Use Multinomial Naive Bayes with TF-IDF in a Pipeline.
- Tune `alpha` and the TF-IDF vocabulary size (`max_features`) with GridSearchCV (5-fold cross-validation, scored by accuracy).
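
The `alpha` hyperparameter is the additive (Laplace/Lidstone) smoothing term of Multinomial Naive Bayes: each per-class feature likelihood is estimated as (N_ci + alpha) / (N_c + alpha * n), where N_ci is the total weight of feature i in class c, N_c is the class total, and n is the number of features. A minimal pure-Python sketch with made-up counts (the function name `smoothed_likelihoods` is illustrative):

```python
def smoothed_likelihoods(feature_counts, alpha):
    """Return smoothed P(feature | class) for one class.

    feature_counts: total count (or TF-IDF weight) of each feature in the class.
    alpha: additive smoothing strength; larger values flatten the distribution.
    """
    n_features = len(feature_counts)
    class_total = sum(feature_counts)
    denom = class_total + alpha * n_features
    return [(count + alpha) / denom for count in feature_counts]

# Made-up counts for three features in one class; the second never occurs.
counts = [3, 0, 1]
for alpha in (0.1, 0.5, 1.0):
    probs = smoothed_likelihoods(counts, alpha)
    print(alpha, [round(p, 3) for p in probs])
```

With a small `alpha`, a term never seen in a class gets a likelihood close to zero; larger values give unseen terms more probability mass.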

In [4]:
# Split data
X = train_data['processed']
y = train_data['price_value']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
    ('nb', MultinomialNB())
])

# Hyperparameter tuning
param_grid = {
    'tfidf__max_features': [5000, 10000],
    'nb__alpha': [0.1, 0.5, 1.0]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best CV Accuracy: {grid_search.best_score_:.4f}')

# Evaluate on validation set
y_val_pred = best_model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f'Validation Accuracy: {val_accuracy:.4f}')
print(classification_report(y_val, y_val_pred))

Best Parameters: {'nb__alpha': 1.0, 'tfidf__max_features': 5000}
Best CV Accuracy: 0.8473
Validation Accuracy: 0.8460
              precision    recall  f1-score   support

           0       0.83      0.89      0.86      4160
           1       0.87      0.80      0.83      3840

    accuracy                           0.85      8000
   macro avg       0.85      0.84      0.85      8000
weighted avg       0.85      0.85      0.85      8000
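
For reference, the per-class columns of the report follow directly from prediction counts. The sketch below recomputes precision, recall, and F1 for one class on a small made-up label vector (not the validation data); `prf_for_class` is an illustrative helper, not part of scikit-learn.

```python
def prf_for_class(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
precision, recall, f1 = prf_for_class(y_true, y_pred, positive=1)
print(precision, recall, f1)
```

Read this way, the recorded report shows class 1 with higher precision (0.87) than recall (0.80), which suggests the model leans toward missing price mentions rather than flagging them incorrectly.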



## Prediction and Submission
- Predict on test data.
- Save the predicted labels to `submission.csv` (a single `price_value` column, in test-set row order).

In [None]:
# Predict on test data
X_test = test_data['processed']
test_pred = best_model.predict(X_test)
submission = pd.DataFrame({'price_value': test_pred})
submission.to_csv('submission.csv', index=False)
submission.head()

## Save Outputs
- Save the fitted model with joblib, then bundle the model, the submission, and the notebook into `outputs/result.zip`.

In [None]:
import joblib
import zipfile

# Save model
joblib.dump(best_model, 'pricesense_model.pkl')

# Compress files
output_dir = Path('outputs')
output_dir.mkdir(exist_ok=True)
files = ['pricesense_model.pkl', 'submission.csv', 'pricesense.ipynb']
with zipfile.ZipFile(output_dir / 'result.zip', 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    for file in files:
        zf.write(file, file)
print('Outputs saved in result.zip')
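
`zf.write` raises `FileNotFoundError` if any listed file is absent (for example, if the notebook is saved under a different name). A hedged variant that skips missing files and then verifies the archive contents is sketched below; `zip_existing` is a hypothetical helper, and the file names in the demo are placeholders written to a temporary directory.

```python
import tempfile
import zipfile
from pathlib import Path

def zip_existing(files, archive_path):
    """Write the files that exist into a zip archive; return the paths that were skipped."""
    skipped = []
    with zipfile.ZipFile(archive_path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        for file in map(Path, files):
            if file.is_file():
                # Store under the bare file name so the archive has no directory prefix.
                zf.write(file, arcname=file.name)
            else:
                skipped.append(str(file))
    return skipped

# Demo in a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / 'submission.csv').write_text('price_value\n1\n0\n')
    archive = tmp / 'result.zip'
    skipped = zip_existing([tmp / 'submission.csv', tmp / 'missing.pkl'], archive)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
    print(names, skipped)
```

Returning the skipped paths lets the caller decide whether a missing file should be a warning or a hard failure.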