<a href="https://colab.research.google.com/github/kecitclub/moyeMoye/blob/main/comment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# prompt: Develop a Python script that processes a list of user data, where each entry is a dictionary containing 'username' and 'comments'. The script should analyze each comment using a language model (LLM) other than OpenAI's models, determining the sentiment as positive, negative, or neutral. Additionally, the script should categorize comments based on their content, identifying mentions of 'price' and 'quality' without explicit keyword checks. For both 'price' and 'quality', comments should be further classified into positive, negative, and neutral subcategories. The script should also categorize any remaining comments into a general 'other information' category with the same sentiment sub-classifications. Output the percentage of overall positive and negative comments, and count comments in each category and subcategory. Include the LLM analysis code.

# Install necessary libraries
!pip install transformers

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import pipeline

# Sample user data (replace with your actual data)
user_data = [
    {'username': 'user1', 'comments': 'The price is great!'},
    {'username': 'user2', 'comments': 'Quality is amazing, but a bit pricey.'},
    {'username': 'user3', 'comments': 'Terrible product, very low quality.'},
    {'username': 'user4', 'comments': 'I love this!'},
    {'username': 'user5', 'comments': 'The price is too high for this.'},
    {'username': 'user6', 'comments': 'It works as expected.'}
]


# Initialize sentiment analyzer (VADER in this example)
analyzer = SentimentIntensityAnalyzer()


# Function to classify sentiment using VADER
def analyze_sentiment(text):
    scores = analyzer.polarity_scores(text)
    if scores['compound'] >= 0.05:
        return 'positive'
    elif scores['compound'] <= -0.05:
        return 'negative'
    else:
        return 'neutral'


# Process user data
price_comments = {'positive': 0, 'negative': 0, 'neutral': 0}
quality_comments = {'positive': 0, 'negative': 0, 'neutral': 0}
other_comments = {'positive': 0, 'negative': 0, 'neutral': 0}
positive_count = 0
negative_count = 0


for user in user_data:
    comment = user['comments']
    sentiment = analyze_sentiment(comment)

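    # Keyword-based baseline: bucket the comment by the first matching term
    if 'price' in comment.lower():
        price_comments[sentiment] += 1
    elif 'quality' in comment.lower():
        quality_comments[sentiment] += 1
    else:
        other_comments[sentiment] += 1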

    if sentiment == 'positive':
        positive_count += 1
    elif sentiment == 'negative':
        negative_count += 1

total_comments = len(user_data)
positive_percentage = (positive_count / total_comments) * 100
negative_percentage = (negative_count / total_comments) * 100


# Output the results
print(f"Overall Positive Comments: {positive_percentage:.2f}%")
print(f"Overall Negative Comments: {negative_percentage:.2f}%")
print("\nCategory-wise Comment Counts:")
print(f"Price: {price_comments}")
print(f"Quality: {quality_comments}")
print(f"Other Information: {other_comments}")



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Overall Positive Comments: 50.00%
Overall Negative Comments: 16.67%

Category-wise Comment Counts:
Price: {'positive': 2, 'negative': 0, 'neutral': 1}
Quality: {'positive': 0, 'negative': 1, 'neutral': 0}
Other Information: {'positive': 1, 'negative': 0, 'neutral': 1}
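
The hand-maintained counters in the cell above could also be tallied from a list of (category, sentiment) pairs. The following is a minimal sketch, not part of the original notebook: `tally` and `pairs` are hypothetical names, and the pairs simply restate the counts printed above.

```python
from collections import Counter

def tally(pairs):
    """Count (category, sentiment) pairs and compute overall sentiment percentages."""
    by_pair = Counter(pairs)
    by_sentiment = Counter(sentiment for _, sentiment in pairs)
    total = len(pairs)
    percentages = {
        s: (by_sentiment[s] / total * 100) if total else 0.0
        for s in ("positive", "negative", "neutral")
    }
    return by_pair, percentages

# Pairs restating the category counts printed above (illustrative input)
pairs = [
    ("price", "positive"), ("price", "positive"), ("price", "neutral"),
    ("quality", "negative"),
    ("other information", "positive"), ("other information", "neutral"),
]

by_pair, percentages = tally(pairs)
print(by_pair[("price", "positive")])
print({s: round(p, 2) for s, p in percentages.items()})
```

Keeping the neutral share alongside the positive and negative ones makes it easier to see that the three percentages account for every comment.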


In [6]:
# prompt: Please rewrite the provided Python code to enhance its functionality for comment analysis. Instead of explicitly checking for the presence of 'price' or 'quality' within the comments using keyword searches, integrate a language model (LLM) that can determine whether the comment pertains to these categories or other information. The rewritten code should pass each comment through the LLM for sentiment analysis and categorical classification. It should categorize comments into 'price', 'quality', and 'other information', each with subcategories for 'positive', 'negative', and 'neutral' sentiments. Adjust the code to accumulate counts for each category and sentiment type based on the LLM's analysis, without relying on direct keyword detection.    the code is "for user in user_data:
#     comment = user['comments']
#     sentiment = analyze_sentiment(comment)
#     if 'price' in comment.lower():
#         price_comments[sentiment] += 1
#     elif 'quality' in comment.lower():
#         quality_comments[sentiment] += 1
#     else:
#         other_comments[sentiment] += 1
#     if sentiment == 'positive':
#         positive_count += 1
#     elif sentiment == 'negative':
#         negative_count += 1"

# Install necessary libraries (must run before the imports on a fresh runtime)
!pip install transformers

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import pipeline

nltk.download('vader_lexicon')

# Sample user data (replace with your actual data)
user_data = [
    {'username': 'user1', 'comments': 'The price is great!'},
    {'username': 'user2', 'comments': 'Quality is amazing, but a bit pricey.'},
    {'username': 'user3', 'comments': 'Terrible product, very low quality.'},
    {'username': 'user4', 'comments': 'I love this!'},
    {'username': 'user5', 'comments': 'The price is too high for this.'},
    {'username': 'user6', 'comments': 'It works as expected.'}
]

# Initialize sentiment analyzer (VADER in this example)
analyzer = SentimentIntensityAnalyzer()

# Function to classify sentiment using VADER
def analyze_sentiment(text):
    scores = analyzer.polarity_scores(text)
    if scores['compound'] >= 0.05:
        return 'positive'
    elif scores['compound'] <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Initialize the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define the candidate labels
candidate_labels = ["price", "quality", "other information"]

# Process user data
comment_categories = {
    "price": {"positive": 0, "negative": 0, "neutral": 0},
    "quality": {"positive": 0, "negative": 0, "neutral": 0},
    "other information": {"positive": 0, "negative": 0, "neutral": 0},
}
positive_count = 0
negative_count = 0

for user in user_data:
    comment = user['comments']
    sentiment = analyze_sentiment(comment)

    # Use zero-shot classification to determine the category
    classification_result = classifier(comment, candidate_labels)
    category = classification_result["labels"][0]  # Get the most likely category

    comment_categories[category][sentiment] += 1

    if sentiment == 'positive':
        positive_count += 1
    elif sentiment == 'negative':
        negative_count += 1

total_comments = len(user_data)
positive_percentage = (positive_count / total_comments) * 100
negative_percentage = (negative_count / total_comments) * 100

# Output the results
print(f"Overall Positive Comments: {positive_percentage:.2f}%")
print(f"Overall Negative Comments: {negative_percentage:.2f}%")
print("\nCategory-wise Comment Counts:")
for category, sentiments in comment_categories.items():
    print(f"{category}: {sentiments}")



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
Device set to use cpu
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Analysis of 6 comments:


Price Comments (2):
  - Positive: 2 (100.0%)
  - Negative: 0 (0.0%)
  - Neutral: 0 (0.0%)

Quality Comments (4):
  - Positive: 4 (100.0%)
  - Negative: 0 (0.0%)
  - Neutral: 0 (0.0%)

Other Comments (0):
  - Positive: 0 (0.0%)
  - Negative: 0 (0.0%)
  - Neutral: 0 (0.0%)
Overall Positive Comments: 0.00%
Overall Negative Comments: 0.00%

Category-wise Comment Counts:
price: {'positive': 0, 'negative': 0, 'neutral': 0}
quality: {'positive': 0, 'negative': 0, 'neutral': 0}
other information: {'positive': 0, 'negative': 0, 'neutral': 0}
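
The cell above always takes `classification_result["labels"][0]`, even when the top score is low. One possible refinement, sketched here under assumptions, is to fall back to "other information" when the classifier is not confident. `pick_category` and the `min_score` value of 0.5 are hypothetical, and the two result dicts are hand-written in the shape the zero-shot pipeline returns (`labels` and `scores` sorted from most to least likely), with made-up scores.

```python
def pick_category(result, fallback="other information", min_score=0.5):
    """Return the top label, or `fallback` when the top score is below `min_score`."""
    top_label = result["labels"][0]
    top_score = result["scores"][0]
    return top_label if top_score >= min_score else fallback

# Hand-written results in the pipeline's output shape; the scores are invented
confident = {
    "sequence": "The price is great!",
    "labels": ["price", "quality", "other information"],
    "scores": [0.90, 0.06, 0.04],
}
uncertain = {
    "sequence": "It works as expected.",
    "labels": ["quality", "other information", "price"],
    "scores": [0.40, 0.35, 0.25],
}

print(pick_category(confident))
print(pick_category(uncertain))
```

The threshold would need tuning against real comments; 0.5 is only a placeholder.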


In [5]:
from typing import Dict, List
from dataclasses import dataclass
from transformers import pipeline

@dataclass
class CommentAnalysis:
    category: str
    sentiment: str
    confidence: float

class CommentAnalyzer:
    def __init__(self):
        # Initialize sentiment analysis pipeline
        self.sentiment_classifier = pipeline("sentiment-analysis")
        # Initialize zero-shot classification pipeline for categories
        self.category_classifier = pipeline("zero-shot-classification")

        # Define possible categories and sentiments
        self.categories = ["price", "quality", "other"]
        self.sentiments = ["positive", "negative", "neutral"]

        # Initialize counters
        self.category_sentiment_counts = {
            category: {sentiment: 0 for sentiment in self.sentiments}
            for category in self.categories
        }

    def analyze_comment(self, comment: str) -> CommentAnalysis:
        """Analyze a single comment using LLM for both category and sentiment."""
        # Determine category
        category_result = self.category_classifier(
            comment,
            candidate_labels=["price related", "quality related", "other information"],
            multi_label=False
        )

        # Map the category result to our standard categories
        category_map = {
            "price related": "price",
            "quality related": "quality",
            "other information": "other"
        }
        category = category_map[category_result["labels"][0]]

        # Analyze sentiment
        sentiment_result = self.sentiment_classifier(comment)[0]
        sentiment_score = sentiment_result["score"]

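        # Map sentiment score to category.
        # NOTE: "score" is the confidence of the predicted label, not a
        # positivity score, and the label itself is ignored here, so a
        # confident NEGATIVE prediction still lands in "positive".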
        if sentiment_score >= 0.6:
            sentiment = "positive"
        elif sentiment_score <= 0.4:
            sentiment = "negative"
        else:
            sentiment = "neutral"

        return CommentAnalysis(
            category=category,
            sentiment=sentiment,
            confidence=max(sentiment_score, 1 - sentiment_score)
        )

    def process_comments(self, user_data: List[Dict]) -> Dict:
        """Process all comments and return aggregated statistics."""
        for user in user_data:
            comment = user['comments']
            analysis = self.analyze_comment(comment)

            # Update counters
            self.category_sentiment_counts[analysis.category][analysis.sentiment] += 1

        return {
            'category_sentiment_counts': self.category_sentiment_counts,
            'total_counts': {
                'by_category': {
                    category: sum(counts.values())
                    for category, counts in self.category_sentiment_counts.items()
                },
                'by_sentiment': {
                    sentiment: sum(
                        counts[sentiment]
                        for counts in self.category_sentiment_counts.values()
                    )
                    for sentiment in self.sentiments
                }
            }
        }

    def get_summary_report(self) -> str:
        """Generate a human-readable summary report."""
        total_comments = sum(sum(counts.values())
                           for counts in self.category_sentiment_counts.values())

        report = [f"Analysis of {total_comments} comments:\n"]

        for category in self.categories:
            category_total = sum(self.category_sentiment_counts[category].values())
            report.append(f"\n{category.title()} Comments ({category_total}):")
            for sentiment in self.sentiments:
                count = self.category_sentiment_counts[category][sentiment]
                percentage = (count / category_total * 100) if category_total > 0 else 0
                report.append(f"  - {sentiment.title()}: {count} ({percentage:.1f}%)")

        return "\n".join(report)

# Example usage
if __name__ == "__main__":
    # Sample data
    user_data = [
        {"comments": "Great value for the money, really worth every penny!"},
        {"comments": "The quality is terrible, it broke after a week"},
        {"comments": "Shipping was fast but the product is overpriced"},
    ]

    # Initialize and run analysis
    analyzer = CommentAnalyzer()
    results = analyzer.process_comments(user_data)

    # Print summary report
    print(analyzer.get_summary_report())

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu
No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Analysis of 3 comments:


Price Comments (2):
  - Positive: 2 (100.0%)
  - Negative: 0 (0.0%)
  - Neutral: 0 (0.0%)

Quality Comments (1):
  - Positive: 1 (100.0%)
  - Negative: 0 (0.0%)
  - Neutral: 0 (0.0%)

Other Comments (0):
  - Positive: 0 (0.0%)
  - Negative: 0 (0.0%)
  - Neutral: 0 (0.0%)
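
Every comment above is reported as Positive, including "The quality is terrible, it broke after a week". A likely cause is that the sentiment pipeline's `score` is the confidence of the predicted label, which for a two-class model is at least 0.5 whichever label wins. A minimal sketch of a label-aware mapping follows; `positive_probability` and `to_sentiment` are hypothetical helpers, the 0.6/0.4 thresholds are reused from the cell above, and the sample dicts are hand-written in the pipeline's output shape rather than real model output.

```python
def positive_probability(result):
    """Convert a binary {label, score} prediction into P(positive)."""
    score = result["score"]
    return score if result["label"].upper() == "POSITIVE" else 1.0 - score

def to_sentiment(result, upper=0.6, lower=0.4):
    """Map P(positive) onto positive / negative / neutral."""
    p = positive_probability(result)
    if p >= upper:
        return "positive"
    if p <= lower:
        return "negative"
    return "neutral"

# Hand-written predictions; the scores are invented for illustration
print(to_sentiment({"label": "POSITIVE", "score": 0.99}))
print(to_sentiment({"label": "NEGATIVE", "score": 0.99}))
print(to_sentiment({"label": "NEGATIVE", "score": 0.55}))
```

With this mapping a confident NEGATIVE prediction yields a low P(positive) and is labeled negative instead of positive.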


In [7]:
from typing import Dict, List
from dataclasses import dataclass
from transformers import pipeline

@dataclass
class CommentAnalysis:
    category: str
    sentiment: str
    confidence: float

class CommentAnalyzer:
    def __init__(self):
        # Initialize sentiment analysis pipeline
        self.sentiment_classifier = pipeline("sentiment-analysis")
        # Initialize zero-shot classification pipeline for categories
        self.category_classifier = pipeline("zero-shot-classification")

        # Define possible categories and sentiments
        self.categories = ["price", "quality", "other"]
        self.sentiments = ["positive", "negative", "neutral"]

        # Initialize counters
        self.category_sentiment_counts = {
            category: {sentiment: 0 for sentiment in self.sentiments}
            for category in self.categories
        }

        # Initialize total sentiment counters
        self.total_sentiment_counts = {
            "positive": 0,
            "negative": 0,
            "neutral": 0
        }

    def analyze_comment(self, comment: str) -> CommentAnalysis:
        """Analyze a single comment using LLM for both category and sentiment."""
        # Determine category
        category_result = self.category_classifier(
            comment,
            candidate_labels=["price related", "quality related", "other information"],
            multi_label=False
        )

        # Map the category result to our standard categories
        category_map = {
            "price related": "price",
            "quality related": "quality",
            "other information": "other"
        }
        category = category_map[category_result["labels"][0]]

        # Analyze sentiment
        sentiment_result = self.sentiment_classifier(comment)[0]
        sentiment_score = sentiment_result["score"]

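        # Map sentiment score to category.
        # NOTE: "score" is the confidence of the predicted label, not a
        # positivity score, and the label itself is ignored here, so a
        # confident NEGATIVE prediction still lands in "positive".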
        if sentiment_score >= 0.6:
            sentiment = "positive"
        elif sentiment_score <= 0.4:
            sentiment = "negative"
        else:
            sentiment = "neutral"

        return CommentAnalysis(
            category=category,
            sentiment=sentiment,
            confidence=max(sentiment_score, 1 - sentiment_score)
        )

    def process_comments(self, user_data: List[Dict]) -> Dict:
        """Process all comments and return aggregated statistics."""
        for user in user_data:
            comment = user['comments']
            analysis = self.analyze_comment(comment)

            # Update category-sentiment counters
            self.category_sentiment_counts[analysis.category][analysis.sentiment] += 1

            # Update total sentiment counters
            self.total_sentiment_counts[analysis.sentiment] += 1

        return {
            'category_sentiment_counts': self.category_sentiment_counts,
            'total_sentiment_counts': self.total_sentiment_counts,
            'total_counts': {
                'by_category': {
                    category: sum(counts.values())
                    for category, counts in self.category_sentiment_counts.items()
                },
                'by_sentiment': self.total_sentiment_counts
            }
        }

    def get_summary_report(self) -> str:
        """Generate a human-readable summary report."""
        total_comments = sum(self.total_sentiment_counts.values())

        report = [f"Analysis of {total_comments} comments:\n"]

        # Add total sentiment statistics
        report.append("Overall Sentiment Analysis:")
        for sentiment, count in self.total_sentiment_counts.items():
            percentage = (count / total_comments * 100) if total_comments > 0 else 0
            report.append(f"  - Total {sentiment.title()}: {count} ({percentage:.1f}%)")

        # Add category-wise breakdown
        report.append("\nBreakdown by Category:")
        for category in self.categories:
            category_total = sum(self.category_sentiment_counts[category].values())
            report.append(f"\n{category.title()} Comments ({category_total}):")
            for sentiment in self.sentiments:
                count = self.category_sentiment_counts[category][sentiment]
                percentage = (count / category_total * 100) if category_total > 0 else 0
                report.append(f"  - {sentiment.title()}: {count} ({percentage:.1f}%)")

        return "\n".join(report)

# Example usage
if __name__ == "__main__":
    # Sample data
    user_data = [
        {"comments": "Great value for the money, really worth every penny!"},
        {"comments": "The quality is terrible, it broke after a week"},
        {"comments": "Shipping was fast but the product is overpriced"},
        {"comments": "Excellent product quality, highly recommended"},
        {"comments": "Too expensive for what you get"}
    ]

    # Initialize and run analysis
    analyzer = CommentAnalyzer()
    results = analyzer.process_comments(user_data)

    # Print summary report
    print(analyzer.get_summary_report())

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Analysis of 5 comments:

Overall Sentiment Analysis:
  - Total Positive: 5 (100.0%)
  - Total Negative: 0 (0.0%)
  - Total Neutral: 0 (0.0%)

Breakdown by Category:

Price Comments (3):
  - Positive: 3 (100.0%)
  - Negative: 0 (0.0%)
  - Neutral: 0 (0.0%)

Quality Comments (2):
  - Positive: 2 (100.0%)
  - Negative: 0 (0.0%)
  - Neutral: 0 (0.0%)

Other Comments (0):
  - Positive: 0 (0.0%)
  - Negative: 0 (0.0%)
  - Neutral: 0 (0.0%)


In [8]:
from typing import Dict, List, Tuple
from dataclasses import dataclass
from transformers import pipeline

@dataclass
class CommentAnalysis:
    category: str
    sentiment: str
    confidence: float

class CommentAnalyzer:
    def __init__(self):
        # Initialize sentiment analysis pipeline with the model pinned explicitly
        self.sentiment_classifier = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english"
        )
        # Initialize zero-shot classification pipeline for categories
        self.category_classifier = pipeline(
            "zero-shot-classification",
            model="facebook/bart-large-mnli"
        )

        # Define possible categories and sentiments
        self.categories = ["price", "quality", "other"]
        self.sentiments = ["positive", "negative", "neutral"]

        # Initialize counters
        self.category_sentiment_counts = {
            category: {sentiment: 0 for sentiment in self.sentiments}
            for category in self.categories
        }

        # Initialize total sentiment counters
        self.total_sentiment_counts = {
            "positive": 0,
            "negative": 0,
            "neutral": 0
        }

    def determine_sentiment(self, text: str) -> Tuple[str, float]:
        """
        Map the classifier's POSITIVE/NEGATIVE label to positive, negative, or
        neutral, treating low-confidence predictions (score <= 0.75) as neutral.
        """
        result = self.sentiment_classifier(text)[0]
        label = result['label'].lower()
        score = result['score']

        # Low-confidence predictions in either direction count as neutral
        if score <= 0.75:
            return 'neutral', score

        # Otherwise keep the direction given by the predicted label
        return ('positive' if label == 'positive' else 'negative'), score

    def analyze_comment(self, comment: str) -> CommentAnalysis:
        """Analyze a single comment using LLM for both category and sentiment."""
        # Determine category with more specific prompts
        category_result = self.category_classifier(
            comment,
            candidate_labels=[
                "discussion about price or cost",
                "discussion about product quality or performance",
                "other general information"
            ],
            multi_label=False
        )

        # Map the category result to our standard categories
        category_map = {
            "discussion about price or cost": "price",
            "discussion about product quality or performance": "quality",
            "other general information": "other"
        }

        # Get the highest confidence category
        predicted_category = category_result["labels"][0]
        category = category_map[predicted_category]

        # Get sentiment with improved logic
        sentiment, confidence = self.determine_sentiment(comment)

        return CommentAnalysis(
            category=category,
            sentiment=sentiment,
            confidence=confidence
        )

    def process_comments(self, user_data: List[Dict]) -> Dict:
        """Process all comments and return aggregated statistics."""
        # Reset counters before processing
        self.category_sentiment_counts = {
            category: {sentiment: 0 for sentiment in self.sentiments}
            for category in self.categories
        }
        self.total_sentiment_counts = {sentiment: 0 for sentiment in self.sentiments}

        for user in user_data:
            comment = user['comments']
            analysis = self.analyze_comment(comment)

            # Update category-sentiment counters
            self.category_sentiment_counts[analysis.category][analysis.sentiment] += 1

            # Update total sentiment counters
            self.total_sentiment_counts[analysis.sentiment] += 1

        return {
            'category_sentiment_counts': self.category_sentiment_counts,
            'total_sentiment_counts': self.total_sentiment_counts,
            'total_counts': {
                'by_category': {
                    category: sum(counts.values())
                    for category, counts in self.category_sentiment_counts.items()
                },
                'by_sentiment': self.total_sentiment_counts
            }
        }

    def get_summary_report(self) -> str:
        """Generate a human-readable summary report."""
        total_comments = sum(self.total_sentiment_counts.values())

        report = [f"Analysis of {total_comments} comments:\n"]

        # Add total sentiment statistics
        report.append("Overall Sentiment Analysis:")
        for sentiment in ['positive', 'negative', 'neutral']:  # Fixed order
            count = self.total_sentiment_counts[sentiment]
            percentage = (count / total_comments * 100) if total_comments > 0 else 0
            report.append(f"  - Total {sentiment.title()}: {count} ({percentage:.1f}%)")

        # Add category-wise breakdown
        report.append("\nBreakdown by Category:")
        for category in self.categories:
            category_total = sum(self.category_sentiment_counts[category].values())
            report.append(f"\n{category.title()} Comments ({category_total}):")
            for sentiment in ['positive', 'negative', 'neutral']:  # Fixed order
                count = self.category_sentiment_counts[category][sentiment]
                percentage = (count / category_total * 100) if category_total > 0 else 0
                report.append(f"  - {sentiment.title()}: {count} ({percentage:.1f}%)")

        return "\n".join(report)

# Example usage
if __name__ == "__main__":
    # Sample data with clear sentiment indicators
    user_data = [
        {"comments": "Great value for the money, really worth every penny!"},  # Positive price
        {"comments": "The quality is terrible, it broke after a week"},        # Negative quality
        {"comments": "Shipping was fast but the product is overpriced"},       # Neutral/Negative price
        {"comments": "Excellent product quality, highly recommended"},          # Positive quality
        {"comments": "Too expensive for what you get"}                         # Negative price
    ]

    # Initialize and run analysis
    analyzer = CommentAnalyzer()
    results = analyzer.process_comments(user_data)

    # Print summary report
    print(analyzer.get_summary_report())

Device set to use cpu
Device set to use cpu


Analysis of 5 comments:

Overall Sentiment Analysis:
  - Total Positive: 2 (40.0%)
  - Total Negative: 3 (60.0%)
  - Total Neutral: 0 (0.0%)

Breakdown by Category:

Price Comments (3):
  - Positive: 1 (33.3%)
  - Negative: 2 (66.7%)
  - Neutral: 0 (0.0%)

Quality Comments (2):
  - Positive: 1 (50.0%)
  - Negative: 1 (50.0%)
  - Neutral: 0 (0.0%)

Other Comments (0):
  - Positive: 0 (0.0%)
  - Negative: 0 (0.0%)
  - Neutral: 0 (0.0%)
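
The nested counts that `process_comments` returns under `'category_sentiment_counts'` can be post-processed without rerunning the models. The sketch below is an illustration only: `sentiment_shares` is a hypothetical helper, and `counts` is typed in by hand to restate the numbers printed above.

```python
def sentiment_shares(counts):
    """Return each category's sentiment shares (0 to 1), skipping empty categories."""
    shares = {}
    for category, by_sentiment in counts.items():
        total = sum(by_sentiment.values())
        if total == 0:
            continue  # avoid dividing by zero for categories with no comments
        shares[category] = {s: n / total for s, n in by_sentiment.items()}
    return shares

# Counts restating the report printed above (illustrative input)
counts = {
    "price": {"positive": 1, "negative": 2, "neutral": 0},
    "quality": {"positive": 1, "negative": 1, "neutral": 0},
    "other": {"positive": 0, "negative": 0, "neutral": 0},
}

shares = sentiment_shares(counts)
print(sorted(shares))
print(round(shares["price"]["negative"], 3))
```

Dropping empty categories keeps downstream comparisons from treating "no comments" as "0% negative".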
