# Sentiment Analysis of Customer Reviews

**Capstone Project: Sentiment Analysis of Customer Reviews**

**Project Goal:** Build a model that can classify customer reviews as positive, negative, or neutral. This is a common Natural Language Processing (NLP) task with real-world applications in business analytics, customer service, and market research.

**Dataset:** To keep the project manageable, we create a small, hand-labeled dataset of ten reviews directly in code. In a real-world scenario, you would collect reviews from a website, API, or database. You can later swap in a larger dataset from sources such as Kaggle or the UCI Machine Learning Repository (search for "sentiment analysis datasets").
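As a rough sketch of what swapping in a larger dataset might look like, the snippet below reads labeled reviews from CSV text with pandas and checks the class balance. The column names `review` and `sentiment` match the ones used later in this project; the inline CSV text is only a stand-in for a real file, and the file name mentioned in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for a real file; in practice you might pass a path such as
# "reviews.csv" (hypothetical name) to pd.read_csv instead.
csv_text = io.StringIO(
    "review,sentiment\n"
    '"This product is amazing! I love it.",positive\n'
    '"Absolutely awful.  Waste of money.",negative\n'
    '"It\'s okay, nothing special.",neutral\n'
    '"Pretty good for the price.",positive\n'
)

reviews_df = pd.read_csv(csv_text)

# Drop rows with missing text or labels before training
reviews_df = reviews_df.dropna(subset=["review", "sentiment"])

# Count the examples per class; heavy imbalance affects evaluation later
class_counts = reviews_df["sentiment"].value_counts()
print(class_counts)
```

Checking the class counts early is a cheap way to spot imbalance before it shows up as misleading accuracy numbers.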

**Tools:**

*   **Python:** The programming language.
*   **Pandas:** For data manipulation.
*   **Scikit-learn:** For machine learning (feature extraction and model building).
*   **NLTK (Natural Language Toolkit):** For text pre-processing (optional, but recommended for more advanced projects). To keep the project manageable, the code below only lowercases the text and does not call NLTK.

**Steps and Code Examples:**

1.  **Data Creation and Preparation:**

In [2]:

    
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer  # Import TF-IDF
from sklearn.naive_bayes import MultinomialNB  # Import Naive Bayes
from sklearn.metrics import accuracy_score, classification_report

# Create a small sample dataset (replace with your larger dataset later)
data = {
    'review': [
        "This product is amazing! I love it.",
        "The quality is terrible, I'm very disappointed.",
        "It's okay, nothing special.",
        "I would definitely recommend this to a friend.",
        "Absolutely awful.  Waste of money.",
        "Pretty good for the price.",
        "The service was excellent and fast.",
        "I had a terrible experience with their customer support.",
        "It's mediocre, could be better.",
        "Best purchase I've made all year!"
    ],
    'sentiment': [
        'positive',
        'negative',
        'neutral',
        'positive',
        'negative',
        'positive',
        'positive',
        'negative',
        'neutral',
        'positive'
    ]
}

df = pd.DataFrame(data)

# Simple text pre-processing (lowercasing)
df['review'] = df['review'].str.lower()

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42
)

    
2.  **Feature Extraction (TF-IDF):**

In [3]:

# Use TF-IDF to convert text into numerical feature vectors
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Limit features for simplicity
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Show the shape of the TF-IDF matrix (rows = documents, columns = features)
print("Shape of TF-IDF training matrix:", X_train_tfidf.shape)

Shape of TF-IDF training matrix: (8, 42)



    *   **Explanation:**
        *   **TF-IDF (Term Frequency-Inverse Document Frequency):** A common technique to convert text into numerical vectors. It gives higher weight to words that are frequent in a document but infrequent in the entire corpus (making them more important for distinguishing documents).
        *   `fit_transform` on the training data learns the vocabulary (unique words) and their IDF values.
        *   `transform` on the test data uses the *same* vocabulary and IDF values learned from the training data. This is *crucially important* to avoid data leakage.
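To make the fit/transform distinction concrete, here is a minimal sketch on a three-document toy corpus (the documents are illustrative, not the project's actual split). It shows that the test matrix reuses the training vocabulary, that words unseen during training are dropped, and that a rarer word receives a larger IDF weight. The manual check assumes scikit-learn's default smoothed formula, `ln((1 + n) / (1 + df)) + 1`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "pretty good for the price",
    "the service was excellent and fast",
    "absolutely awful waste of money",
]
test_docs = ["the quality is terrible"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
X_test = vectorizer.transform(test_docs)        # reuses them unchanged

# Both matrices have one column per word in the *training* vocabulary
print(X_train.shape, X_test.shape)

# Only "the" was seen during training, so the test row has one non-zero entry
print("Non-zero test features:", X_test.nnz)

# "the" occurs in 2 of 3 training documents, "price" in only 1,
# so "price" gets the larger IDF weight.
vocab = vectorizer.vocabulary_
idf = vectorizer.idf_
print("idf(the)   =", idf[vocab["the"]])
print("idf(price) =", idf[vocab["price"]])

# Manual check of the smoothed IDF for "the" (document frequency 2)
n_docs = len(train_docs)
manual_idf_the = np.log((1 + n_docs) / (1 + 2)) + 1
print("manual idf(the) =", manual_idf_the)
```

Calling `fit_transform` on the test data instead would build a different vocabulary and different IDF values, so the columns would no longer line up with what the model was trained on.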

3.  **Model Training (Naive Bayes):**

In [4]:
# Train a Multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)


    *   **Explanation:**
        *   **Multinomial Naive Bayes:** A simple but often effective probabilistic classifier, commonly used for text classification. It is based on Bayes' theorem and assumes that features are conditionally independent given the class (an assumption that rarely holds exactly for text, but the classifier often works well in practice anyway).
        *   `model.fit` trains the model using the TF-IDF feature vectors and the corresponding sentiment labels.
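The following sketch illustrates one consequence of the Bayes' theorem formulation: the classifier combines per-class word likelihoods with a class prior estimated from label frequencies. The toy documents and labels are illustrative assumptions, chosen so that three of four examples are positive. When a review contains only words the vectorizer has never seen, its feature vector is all zeros and the prior alone decides the prediction.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "amazing product love it",
    "excellent service and fast",
    "best purchase all year",
    "terrible experience with support",
]
labels = ["positive", "positive", "positive", "negative"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

nb = MultinomialNB()  # alpha=1.0 (Laplace smoothing) by default
nb.fit(X, labels)

# Class priors come from label frequencies: 1/4 negative, 3/4 positive
priors = dict(zip(nb.classes_, np.exp(nb.class_log_prior_)))
print(priors)

# Neither word below is in the training vocabulary, so the vector is all
# zeros, the likelihood term is identical for every class, and the prior wins.
unseen = vec.transform(["mediocre quality"])
unseen_prediction = nb.predict(unseen)[0]
unseen_proba = nb.predict_proba(unseen)[0]
print(unseen_prediction, unseen_proba)
```

On a small, imbalanced training set this effect can dominate, which is one reason a majority class may be predicted for almost every input.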

4.  **Model Prediction and Evaluation:**


In [5]:

# Make predictions on the test set
y_pred = model.predict(X_test_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# More detailed evaluation (precision, recall, F1-score)
print(classification_report(y_test, y_pred))

Accuracy: 0.0
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       1.0
     neutral       0.00      0.00      0.00       1.0
    positive       0.00      0.00      0.00       0.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0






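    *   **Explanation:**
        *   `model.predict` uses the trained model to predict the sentiment of the test reviews (represented as TF-IDF vectors).
        *   `accuracy_score` calculates the overall accuracy.
        *   `classification_report` provides a more detailed breakdown of performance, including precision, recall, and F1-score for each sentiment class (positive, negative, neutral). This is particularly useful when you have imbalanced classes (e.g., many more positive reviews than negative ones).
        *   The accuracy of 0.0 reflects the tiny dataset rather than a flaw in the method. The test set holds only two reviews (one negative, one neutral), while five of the eight training reviews are positive, so the class prior pulls predictions toward `positive`. Every precision and recall value is therefore zero, and scikit-learn emits an `UndefinedMetricWarning` for classes that have no predicted or no true samples.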
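A confusion matrix can make a report like the one above easier to read. The sketch below uses hand-written labels rather than the model's actual output: it assumes both test reviews were predicted as `positive`, which is consistent with the report but not shown directly in it. It also passes `zero_division=0`, a `classification_report` parameter that reports undefined metrics as 0 without emitting warnings.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Assumed labels: two test reviews, both predicted "positive"
y_true = ["neutral", "negative"]
y_hat = ["positive", "positive"]
class_names = ["negative", "neutral", "positive"]

# Rows are true classes, columns are predicted classes (in class_names order)
cm = confusion_matrix(y_true, y_hat, labels=class_names)
print(cm)

# zero_division=0 fills undefined precision/recall with 0 and skips the warning
report = classification_report(
    y_true, y_hat, labels=class_names, zero_division=0
)
print(report)
```

All counts land in the last column, which shows at a glance that every review was assigned to the same class.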

5.  **Prediction on New Reviews (Deployment - Simplified):**

In [None]:

# Function to predict sentiment of a new review
def predict_sentiment(new_review):
    new_review = new_review.lower()  # Apply the same pre-processing
    new_review_tfidf = tfidf_vectorizer.transform([new_review])  # Transform using the fitted vectorizer
    prediction = model.predict(new_review_tfidf)[0]
    return prediction

# Test with some new reviews
new_reviews = [
    "This is a fantastic product!",
    "I am extremely dissatisfied with this.",
    "It's an average product."
]

for review in new_reviews:
    sentiment = predict_sentiment(review)
    print(f"Review: '{review}'  Sentiment: {sentiment}")


    * **Explanation:** This simulates a basic deployment scenario where you can input new text and get a sentiment prediction. It's crucial to apply the *same* preprocessing (lowercasing, in this simplified case) and use the *same* trained `tfidf_vectorizer` to transform the new text into a feature vector.
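One way to guarantee that new text goes through exactly the same steps as the training data is to bundle the vectorizer and the classifier into a scikit-learn `Pipeline`. This is a sketch of an alternative to the `predict_sentiment` function, not a replacement the project requires; the four training reviews are a subset of the sample dataset, and the name `sentiment_pipeline` is chosen here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_reviews = [
    "This product is amazing! I love it.",
    "Absolutely awful.  Waste of money.",
    "The service was excellent and fast.",
    "I had a terrible experience with their customer support.",
]
train_labels = ["positive", "negative", "positive", "negative"]

# TfidfVectorizer lowercases text by default (lowercase=True), so the
# pipeline applies the same pre-processing at training and prediction time.
sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("nb", MultinomialNB()),
])
sentiment_pipeline.fit(train_reviews, train_labels)

# Raw text goes in; vectorizing and classifying happen inside the pipeline
predicted = sentiment_pipeline.predict(["Terrible support, awful experience."])
print(predicted[0])
```

Because the fitted pipeline is a single object, it can also be saved and reloaded as one unit, which avoids pairing a model with the wrong vectorizer.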

**Complete, Runnable Code (All Steps Combined):**

In [1]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# 1. Data Creation and Preparation
data = {
    'review': [
        "This product is amazing! I love it.",
        "The quality is terrible, I'm very disappointed.",
        "It's okay, nothing special.",
        "I would definitely recommend this to a friend.",
        "Absolutely awful.  Waste of money.",
        "Pretty good for the price.",
        "The service was excellent and fast.",
        "I had a terrible experience with their customer support.",
        "It's mediocre, could be better.",
        "Best purchase I've made all year!"
    ],
    'sentiment': [
        'positive',
        'negative',
        'neutral',
        'positive',
        'negative',
        'positive',
        'positive',
        'negative',
        'neutral',
        'positive'
    ]
}

df = pd.DataFrame(data)
df['review'] = df['review'].str.lower()
X_train, X_test, y_train, y_test = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42
)

# 2. Feature Extraction (TF-IDF)
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
print("Shape of TF-IDF training matrix:", X_train_tfidf.shape)

# 3. Model Training (Naive Bayes)
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# 4. Model Prediction and Evaluation
y_pred = model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

# 5. Prediction on New Reviews
def predict_sentiment(new_review):
    new_review = new_review.lower()
    new_review_tfidf = tfidf_vectorizer.transform([new_review])
    prediction = model.predict(new_review_tfidf)[0]
    return prediction

new_reviews = [
    "This is a fantastic product!",
    "I am extremely dissatisfied with this.",
    "It's an average product."
]

for review in new_reviews:
    sentiment = predict_sentiment(review)
    print(f"Review: '{review}'  Sentiment: {sentiment}")

Shape of TF-IDF training matrix: (8, 42)
Accuracy: 0.0
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       1.0
     neutral       0.00      0.00      0.00       1.0
    positive       0.00      0.00      0.00       0.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0

Review: 'This is a fantastic product!'  Sentiment: positive
Review: 'I am extremely dissatisfied with this.'  Sentiment: positive
Review: 'It's an average product.'  Sentiment: positive





**Further Improvements and Extensions:**

*   **Larger Dataset:**  Use a larger, more realistic dataset.
*   **More Advanced Preprocessing:**  Use NLTK for stemming, lemmatization, stop word removal, and handling punctuation.
*   **Different Models:** Experiment with other classification models (e.g., Logistic Regression, Support Vector Machines, Random Forest).
*   **Hyperparameter Tuning:** Use techniques like GridSearchCV or RandomizedSearchCV to optimize model parameters (e.g., the `alpha` parameter in MultinomialNB, or the `C` parameter in SVM).
*   **Cross-Validation:**  Use cross-validation to get a more robust estimate of model performance.
*   **Deep Learning (Optional):** Explore using Recurrent Neural Networks (RNNs) or Transformers (like BERT) for sentiment analysis, which can capture more complex language patterns (but require more data and computational resources).  This would involve using TensorFlow or PyTorch.
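As a starting point for the hyperparameter tuning and cross-validation suggestions, here is a minimal sketch that searches over the `alpha` parameter of `MultinomialNB` with `GridSearchCV`. The toy reviews, the candidate `alpha` values, and the two-fold split are illustrative assumptions; with so little data the scores themselves carry no meaning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data for illustration only; far too small for meaningful scores
reviews = [
    "amazing product, love it", "excellent and fast service",
    "best purchase all year", "would recommend to a friend",
    "terrible quality, very disappointed", "absolutely awful waste of money",
    "terrible customer support", "awful experience, never again",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so no vocabulary or IDF statistics leak from held-out folds.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Candidate values for the smoothing parameter alpha (illustrative choices)
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}

# Stratified folds keep the class ratio the same in every split
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(reviews, labels)

print("Best alpha:", search.best_params_["nb__alpha"])
print("Mean cross-validated accuracy:", search.best_score_)
```

The same pattern extends to other models from the list: swap the `"nb"` step for, say, `LogisticRegression` and adjust the parameter grid accordingly.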

This capstone project provides a complete, runnable example of a practical AI application using Python. It covers data preparation, feature extraction, model training, evaluation, and a simplified deployment scenario. With only ten reviews, the reported scores are not meaningful on their own, so treat the code as a template to rerun on a larger dataset. By working through this project and exploring the suggested extensions, you'll gain valuable hands-on experience with Python's role in building AI systems. Remember to focus on understanding *why* each step is necessary and how the different components work together.