# Sentiment Analysis of Product Reviews using Naive Bayes

**Author:** Pranjal Yadav

This notebook demonstrates a full pipeline for sentiment analysis on product reviews using Python: data loading, preprocessing, TF-IDF vectorization, training a Naive Bayes classifier, evaluation, and saving the model. Replace `reviews.csv` with your dataset (CSV with columns: `review`, `sentiment`).

## 1. Install required packages

If you are running locally, install required packages with:

```bash
pip install -r requirements.txt
```

(Requirements file is included.)

In [None]:
# 2. Imports
import pandas as pd
import numpy as np
import re
import joblib
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import nltk

# Download NLTK data if needed
nltk.download('stopwords')


## 3. Load dataset

Place a CSV file named `reviews.csv` in the same folder with two columns: `review` and `sentiment`.
`sentiment` should be categorical: `positive`, `negative`, or `neutral` (or similar).

In [None]:
# Load reviews.csv.
# If the file does not exist, a small example dataset is created and saved first.
import os

if not os.path.exists('reviews.csv'):
    # Create a small example dataset
    data = {
        'review': [
            'Great phone with long battery life',
            'Very poor build quality, stopped working in a week',
            'Average product, does the job',
            'Excellent! Highly recommended',
            'Not worth the money'
        ],
        'sentiment': ['positive', 'negative', 'neutral', 'positive', 'negative']
    }
    df = pd.DataFrame(data)
    df.to_csv('reviews.csv', index=False)
    print('Example reviews.csv file created.')

# Read the file once, whether it already existed or was just created
df = pd.read_csv('reviews.csv')
print('reviews.csv loaded, shape:', df.shape)
df.head()

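Before training, it can help to confirm that the file has the expected columns and to look at the label balance. The following is a minimal sketch using only the standard library; the helper name `check_reviews_csv` is hypothetical, and the rows are an in-memory stand-in for `reviews.csv` so the example runs without a file on disk.

```python
import csv
import io
from collections import Counter

def check_reviews_csv(handle, required=("review", "sentiment")):
    """Verify the required columns exist and return the label counts."""
    reader = csv.DictReader(handle)
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Normalize whitespace and case so 'Positive' and 'positive' count together
    return Counter(row["sentiment"].strip().lower() for row in reader)

# In-memory stand-in for reviews.csv (illustrative rows only)
sample = io.StringIO(
    "review,sentiment\n"
    "Great phone with long battery life,positive\n"
    "Not worth the money,negative\n"
    "Excellent! Highly recommended,Positive\n"
)
label_counts = check_reviews_csv(sample)
print(label_counts)
```

For a real file, the same helper could be called with `open('reviews.csv', newline='', encoding='utf-8')`. A heavily skewed label count is a hint that accuracy alone will be a misleading metric later on.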

## 4. Text preprocessing functions

In [None]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    if not isinstance(text, str):
        return ''
    text = text.lower()
    # remove URLs, HTML tags, non-alphabetic
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^a-z\s]', ' ', text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [stemmer.stem(t) for t in tokens]
    return ' '.join(tokens)

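# Apply preprocessing. preprocess_text already maps missing or non-string values to '',
# so no astype(str) is needed (it would turn NaN into the literal string 'nan').
df['clean_review'] = df['review'].apply(preprocess_text)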
df[['review', 'clean_review', 'sentiment']].head()

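One side effect of the stop-word step is worth checking. NLTK's English stop-word list includes negations such as `not` and `no`, so a review like "Not worth the money" loses its negation before vectorization. The sketch below illustrates the effect under simplifying assumptions: it uses a tiny hand-written stop-word set in place of the NLTK list (so it runs without downloads), skips stemming, and the names `clean` and `NEGATIONS` are hypothetical.

```python
import re

# Tiny stand-in for the NLTK list (assumption: the real list is much longer)
STOP_WORDS = {"the", "a", "is", "it", "this", "not", "no", "nor"}
NEGATIONS = {"not", "no", "nor"}

def clean(text, stop_words):
    """Lowercase, keep letters only, and drop the given stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(t for t in text.split() if t not in stop_words)

review = "Not worth the money"
default_clean = clean(review, STOP_WORDS)               # negation removed
negation_clean = clean(review, STOP_WORDS - NEGATIONS)  # negation kept
print(default_clean)
print(negation_clean)
```

In the notebook, the equivalent change would be to subtract the negation words from `stop_words` before calling `preprocess_text`. Whether this improves results depends on the dataset, so it is best compared on held-out data.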

## 5. Train/Test split and label encoding

In [None]:
X = df['clean_review']
y = df['sentiment']

le = LabelEncoder()
y_enc = le.fit_transform(y)

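# Stratify only if every class has >= 2 samples and the test set can hold one sample per class;
# otherwise train_test_split raises ValueError (as it does with the five-row example file).
counts = np.bincount(y_enc)
can_stratify = counts.min() >= 2 and np.ceil(0.2 * len(y_enc)) >= len(counts)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_enc, test_size=0.2, random_state=42, stratify=y_enc if can_stratify else None)
print('Train size:', len(X_train), 'Test size:', len(X_test))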

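To read the encoded labels later (for example, the rows of the confusion matrix), it helps to know how `LabelEncoder` assigns integers: it sorts the unique labels and uses each label's position in that sorted list. The plain-Python sketch below mimics that mapping for illustration; it is not the library's implementation, and the variable names are hypothetical.

```python
labels = ["positive", "negative", "neutral", "positive", "negative"]

# Sorted unique labels play the role of LabelEncoder's `classes_` attribute
classes = sorted(set(labels))
to_int = {name: idx for idx, name in enumerate(classes)}

encoded = [to_int[name] for name in labels]   # analogous to fit_transform
decoded = [classes[idx] for idx in encoded]   # analogous to inverse_transform

print(classes)
print(encoded)
```

With these three labels the order is alphabetical, so `negative` maps to 0, `neutral` to 1, and `positive` to 2.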

## 6. Build pipeline (TF-IDF + Multinomial Naive Bayes) and train
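The first pipeline step turns each cleaned review into a TF-IDF vector: terms that are frequent in a review but rare across the corpus get the largest weights. The sketch below computes these weights by hand, assuming scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). It handles unigrams only and ignores `max_features`, and the function name `tfidf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

docs = ["great phone", "poor phone", "great value"]

def tfidf(docs):
    """Smoothed TF-IDF with L2-normalized rows (unigrams only)."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, dict(zip(vocab, (round(v, 3) for v in row))))
```

In the second document, `poor` receives a higher weight than `phone` because `phone` also appears in another document.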

In [None]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1,2))),
    ('nb', MultinomialNB())
])

pipeline.fit(X_train, y_train)

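For intuition about what the `nb` step learns, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing. It is an illustration under simplifying assumptions, not scikit-learn's implementation: it works on raw token counts rather than the TF-IDF weights the pipeline passes in, and the function names `fit_nb` and `predict_nb` and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

train = [
    ("great phone great value", "positive"),
    ("excellent value", "positive"),
    ("poor quality", "negative"),
    ("poor phone broke", "negative"),
]

def fit_nb(samples, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token counts."""
    class_docs = Counter(label for _, label in samples)
    token_counts = defaultdict(Counter)
    for text, label in samples:
        token_counts[label].update(text.split())
    vocab = {t for counts in token_counts.values() for t in counts}
    log_prior = {c: math.log(n / len(samples)) for c, n in class_docs.items()}
    log_like = {}
    for c, counts in token_counts.items():
        # Add alpha to every count so unseen (class, token) pairs keep nonzero probability
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return log_prior, log_like

def predict_nb(text, log_prior, log_like):
    """Pick the class with the highest log posterior; out-of-vocabulary tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_like[c][t] for t in text.split() if t in log_like[c]
        )
    return max(scores, key=scores.get)

log_prior, log_like = fit_nb(train)
pred_a = predict_nb("great value phone", log_prior, log_like)
pred_b = predict_nb("poor quality phone", log_prior, log_like)
print(pred_a, pred_b)
```

The `alpha` parameter here plays the same role as `MultinomialNB`'s `alpha`, which defaults to 1.0.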

## 7. Evaluation

In [None]:
y_pred = pipeline.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
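# Pass every encoded label explicitly so the report still works when a class is missing from y_test or y_pred
labels = np.arange(len(le.classes_))
print('\nClassification Report:\n', classification_report(y_test, y_pred, labels=labels, target_names=le.classes_, zero_division=0))
print('\nConfusion Matrix (rows = true, columns = predicted):\n', confusion_matrix(y_test, y_pred, labels=labels))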

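The numbers in the classification report follow directly from counts of true positives, false positives, and false negatives for each class. The sketch below recomputes accuracy and per-class precision, recall, and F1 by hand. The labels are hypothetical and are not results from this notebook, and `per_class_metrics` is an illustrative helper rather than a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    # Report 0.0 instead of dividing by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for illustration only
y_true = ["pos", "pos", "neg", "neg", "neu", "pos"]
y_pred = ["pos", "neg", "neg", "neg", "pos", "pos"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
pos_metrics = per_class_metrics(y_true, y_pred, "pos")
neu_metrics = per_class_metrics(y_true, y_pred, "neu")
print("accuracy:", round(accuracy, 3))
print("pos:", [round(m, 3) for m in pos_metrics])
print("neu:", neu_metrics)
```

Here the `neu` class is never predicted, so its precision, recall, and F1 are all zero even though overall accuracy looks reasonable. This is why the per-class report is worth reading alongside accuracy.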

## 8. Save trained model and encoder

In [None]:
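# The pipeline was trained on cleaned text, so apply preprocess_text to new reviews
# before passing them to the reloaded model.
joblib.dump(pipeline, 'sentiment_pipeline.pkl')
joblib.dump(le, 'label_encoder.pkl')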
print('Saved: sentiment_pipeline.pkl and label_encoder.pkl')


## 9. Make predictions on new sentences

In [None]:
def predict_sentiment(text):
    clean = preprocess_text(text)
    pred = pipeline.predict([clean])[0]
    return le.inverse_transform([pred])[0]

samples = [
    'I love this product, fantastic value.',
    'Terrible experience, broke after one use.',
    'Okay product, nothing special.'
]

for s in samples:
    print(s, '->', predict_sentiment(s))


## Notes & Next steps

- Replace `reviews.csv` with your real dataset (columns `review`, `sentiment`).
- You can expand preprocessing (lemmatization, more cleaning) or try other classifiers (Logistic Regression, SVM).
- To run on GitHub: push this notebook and `reviews.csv` (or instructions for obtaining it) to a repository. GitHub displays notebooks but does not execute them, so use GitHub Codespaces or Binder if you want a runnable cloud notebook.