# CommentSense: AI-Powered Comment Analysis System

**Problem Statement**: Measuring content effectiveness through Share of Engagement (SoE) metrics such as likes, shares, saves, and comments is essential. But how do we analyze the quality and relevance of comments at scale?

**Solution Features**:
- Quality comment ratio analysis
- Sentiment breakdown per video
- Comment categorization (skincare, fragrance, makeup)
- Spam detection
- Relevance analysis using cosine similarity between comments and video metadata

By: **Noog Troupers**

## 1. Import Libraries and Setup

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import torch
from transformers import pipeline, AutoTokenizer, AutoModel
import nltk
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('punkt_tab', quiet=True)

True

## 2. Data Loading Class

In [3]:
class Dataset:
    comment_links = [
        "https://storage.googleapis.com/dataset_hosting/comments1.csv",
        "https://storage.googleapis.com/dataset_hosting/comments2.csv",
        "https://storage.googleapis.com/dataset_hosting/comments3.csv",
        "https://storage.googleapis.com/dataset_hosting/comments4.csv",
        "https://storage.googleapis.com/dataset_hosting/comments5.csv",
    ]
    
    video_link = "https://storage.googleapis.com/dataset_hosting/videos.csv"
    
    @staticmethod
    def getAllComments():
        list_of_dfs = []
        for csv_file in Dataset.comment_links:
            df = pd.read_csv(csv_file)
            list_of_dfs.append(df)
        return pd.concat(list_of_dfs, ignore_index=True)
    
    @staticmethod
    def getComments(dataset_id=1, sample_frac=0.1):
        if dataset_id not in range(1, len(Dataset.comment_links) + 1):
            raise ValueError(f"dataset_id must be between 1 and {len(Dataset.comment_links)}")
        
        df = pd.read_csv(Dataset.comment_links[dataset_id - 1])
        if sample_frac < 1.0:
            df = df.sample(frac=sample_frac, random_state=42)
        return df
    
    @staticmethod
    def getVideos():
        return pd.read_csv(Dataset.video_link)

# Initialize dataset
dataset = Dataset()
print("Dataset class initialized successfully!")

Dataset class initialized successfully!


## 3. Advanced Text Preprocessing and Analysis Classes

In [4]:
class AdvancedTextPreprocessor:
    def __init__(self):
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english'))
        
        # Beauty category keywords
        self.category_keywords = {
            'skincare': ['skincare', 'skin', 'moisturizer', 'cleanser', 'serum', 'cream', 'lotion', 
                        'acne', 'pores', 'wrinkles', 'anti-aging', 'hydrating', 'dry skin', 'oily skin',
                        'sensitive skin', 'sunscreen', 'spf', 'retinol', 'vitamin c', 'hyaluronic',
                        'exfoliate', 'toner', 'mask', 'facial', 'dermatologist'],
            
            'makeup': ['makeup', 'foundation', 'concealer', 'lipstick', 'eyeshadow', 'mascara',
                      'eyeliner', 'blush', 'bronzer', 'highlighter', 'primer', 'setting spray',
                      'powder', 'contour', 'brow', 'eyebrow', 'lip gloss', 'lip liner', 'palette',
                      'brush', 'beauty blender', 'sponge', 'coverage', 'matte', 'dewy', 'shimmer'],
            
            'fragrance': ['perfume', 'fragrance', 'cologne', 'scent', 'smell', 'aroma', 'notes',
                         'floral', 'woody', 'citrus', 'vanilla', 'musk', 'fresh', 'sweet', 'spicy',
                         'eau de toilette', 'eau de parfum', 'body spray', 'long lasting',
                         'signature scent', 'top notes', 'base notes', 'middle notes']
        }
        
        # Spam indicators
        self.spam_keywords = [
            'buy now', 'click here', 'subscribe', 'free', 'visit', 'winner', 'win', 'cash', 'prize',
            'limited time', 'act now', 'urgent', 'amazing deal', 'check out my', 'follow me',
            'dm me', 'link in bio', 'promo code', 'discount', '50% off', 'sale'
        ]
    
    def clean_text(self, text):
        """Clean and preprocess text"""
        if pd.isna(text) or not isinstance(text, str):
            return ""
            
        # Convert to lowercase and strip
        text = str(text).lower().strip()
        
        # Remove URLs, mentions, hashtags
        text = re.sub(r'http\S+|www\S+|@\w+|#\w+', '', text)
        
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text)
        
        # Tokenize
        tokens = word_tokenize(text)
        
        # Remove stopwords and punctuation
        tokens = [word for word in tokens if word not in self.stop_words and word not in string.punctuation]
        
        # Stem tokens
        tokens = [self.stemmer.stem(word) for word in tokens if len(word) > 2]
        
        return ' '.join(tokens)
    
    def detect_spam(self, text):
        """Detect spam comments with improved logic"""
        if pd.isna(text) or not isinstance(text, str):
            return 1
            
        text = str(text).lower()
        
        # Remove emojis for length check
        emoji_pattern = re.compile("["
            u"\U0001F600-\U0001F64F"  # emoticons
            u"\U0001F300-\U0001F5FF"  # symbols & pictographs
            u"\U0001F680-\U0001F6FF"  # transport & map symbols
            u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
            "]+", flags=re.UNICODE)
        
        text_no_emoji = emoji_pattern.sub('', text)
        
        # Check for very short comments (likely spam/low quality)
        if len(text_no_emoji.strip()) < 3:
            return 1
        
        # Check for excessive repetition
        words = text_no_emoji.split()
        if len(words) > 1 and len(set(words)) / len(words) < 0.5:
            return 1
        
        # Check for spam keywords
        spam_score = sum(1 for keyword in self.spam_keywords if keyword in text)
        if spam_score >= 2:
            return 1
        
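        # Check for excessive caps
        # NOTE: `text` was lowercased above, so isupper() is never true here and this
        # rule cannot fire; it would need the original string to take effect.
        if len(text) > 10 and sum(1 for c in text if c.isupper()) / len(text) > 0.7:
            return 1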
            
        return 0
    
    def categorize_comment(self, text):
        """Categorize comments into beauty categories"""
        if pd.isna(text) or not isinstance(text, str):
            return 'other'
            
        text = str(text).lower()
        category_scores = {}
        
        for category, keywords in self.category_keywords.items():
            score = sum(1 for keyword in keywords if keyword in text)
            category_scores[category] = score
        
        if max(category_scores.values()) == 0:
            return 'other'
        
        return max(category_scores.keys(), key=category_scores.get)
    
    def assess_quality(self, text, sentiment=None):
        """Assess comment quality based on multiple factors"""
        if pd.isna(text) or not isinstance(text, str):
            return 0
            
        text = str(text).lower()
        quality_score = 0
        
        # Length factor (reasonable length comments are better)
        word_count = len(text.split())
        if 5 <= word_count <= 50:
            quality_score += 2
        elif 3 <= word_count < 5 or 50 < word_count <= 100:
            quality_score += 1
        
        # Product relevance
        for keywords in self.category_keywords.values():
            if any(keyword in text for keyword in keywords):
                quality_score += 2
                break
        
        # Sentiment consideration
        if sentiment and sentiment != 'neutral':
            quality_score += 1
        
        # Engagement indicators
        engagement_words = ['love', 'amazing', 'recommend', 'favorite', 'best', 'great', 'good', 'bad', 'disappointed']
        if any(word in text for word in engagement_words):
            quality_score += 1
        
        # Quality threshold
        return 1 if quality_score >= 3 else 0

print("AdvancedTextPreprocessor class created successfully!")

AdvancedTextPreprocessor class created successfully!
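
`categorize_comment` and `detect_spam` test keywords with `keyword in text`, which is a substring match: "cream" also matches inside "scream", and "win" inside "window". The sample output in section 12 shows a likely instance, where a comment containing "scream" is labelled skincare. One possible refinement is to compile each category's keywords into a regular expression with word boundaries. The sketch below is an assumption, not the notebook's method: it uses a shortened keyword table and the hypothetical helpers `compile_keyword_patterns` and `categorize_with_boundaries`, and it counts keyword occurrences rather than distinct keywords.

```python
import re

# Shortened keyword table for illustration only; the notebook's lists are longer.
CATEGORY_KEYWORDS = {
    "skincare": ["skin", "cream", "serum", "dry skin"],
    "makeup": ["makeup", "lipstick", "brow"],
    "fragrance": ["perfume", "scent", "notes"],
}

def compile_keyword_patterns(category_keywords):
    """Build one word-boundary regex per category (hypothetical helper)."""
    patterns = {}
    for category, keywords in category_keywords.items():
        # Longest keywords first so multi-word phrases are tried before their parts.
        ordered = sorted(keywords, key=len, reverse=True)
        alternation = "|".join(re.escape(k) for k in ordered)
        patterns[category] = re.compile(rf"\b(?:{alternation})\b")
    return patterns

PATTERNS = compile_keyword_patterns(CATEGORY_KEYWORDS)

def categorize_with_boundaries(text):
    """Return the category with the most whole-word keyword hits, or 'other'."""
    if not isinstance(text, str):
        return "other"
    lowered = text.lower()
    scores = {cat: len(pat.findall(lowered)) for cat, pat in PATTERNS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

print(categorize_with_boundaries("always scream BROOK MONK!!!!!!"))    # other
print(categorize_with_boundaries("this cream is great for dry skin"))  # skincare
```

The trade-off is that whole-word matching no longer catches plurals or compounds such as "brows" or "eyebrow" unless they are listed explicitly, and adopting it would change the category counts reported later in the notebook.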


## 4. Relevance Analysis Class

In [5]:
class RelevanceAnalyzer:
    def __init__(self):
        self.vectorizer = TfidfVectorizer(max_features=1000, stop_words='english', ngram_range=(1, 2))
        
    def calculate_relevance_score(self, comment_text, video_title, video_description="", video_tags=""):
        """Calculate relevance score using cosine similarity"""
        try:
            # Combine video content
            video_content = f"{video_title} {video_description} {video_tags}".strip()
            
            if not comment_text or not video_content:
                return 0.0
            
            # Create TF-IDF vectors
            texts = [str(comment_text), str(video_content)]
            tfidf_matrix = self.vectorizer.fit_transform(texts)
            
            # Calculate cosine similarity
            similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
            
            return float(similarity)
        except Exception:
            # e.g. TfidfVectorizer raises ValueError when both texts reduce to an empty vocabulary
            return 0.0
    
    def batch_relevance_analysis(self, comments_df, videos_df):
        """Perform batch relevance analysis"""
        # Merge comments with video data
        merged_df = comments_df.merge(videos_df[['videoId', 'title', 'description', 'tags']], 
                                     on='videoId', how='left')
        
        # Calculate relevance scores
        relevance_scores = []
        for _, row in merged_df.iterrows():
            score = self.calculate_relevance_score(
                row.get('textOriginal', ''),
                row.get('title', ''),
                row.get('description', ''),
                row.get('tags', '')
            )
            relevance_scores.append(score)
        
        return relevance_scores

print("RelevanceAnalyzer class created successfully!")

RelevanceAnalyzer class created successfully!
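
The relevance score is the cosine of the angle between the comment vector and the video-metadata vector. As a rough illustration of what that number measures, the sketch below computes cosine similarity over raw term counts using only the standard library. It leaves out the TF-IDF weighting, stop-word removal, and bigrams that `TfidfVectorizer` applies, so its values will differ from the notebook's. The function names and the example title are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into runs of letters/digits (simplified tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity_counts(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    if not counts_a or not counts_b:
        return 0.0
    # Only shared terms contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

title = "Trying every hairstyle from the 1920s to today"  # made-up title
print(cosine_similarity_counts("1920 was my favourite hairstyle", title))
print(cosine_similarity_counts("Handsome", title))  # 0.0 (no shared terms)
```

A comment that shares no terms with the video metadata scores exactly 0, and short comments rarely share many terms, which may help explain low average relevance on short-form comment data.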


## 5. Visualization Dashboard Class

In [13]:
class CommentAnalyticsDashboard:
    def __init__(self):
        self.colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7', '#DDA0DD']
    
    def create_quality_ratio_chart(self, df):
        """Create quality ratio visualization"""
        quality_counts = df['quality_score'].value_counts()
        
        fig = go.Figure(data=[
            go.Pie(labels=['Low Quality', 'High Quality'], 
                   values=[quality_counts.get(0, 0), quality_counts.get(1, 0)],
                   hole=0.4,
                   marker_colors=['#FF6B6B', '#4ECDC4'])
        ])
        
        fig.update_layout(
            title="Comment Quality Ratio",
            annotations=[dict(text='Quality<br>Ratio', x=0.5, y=0.5, font_size=20, showarrow=False)]
        )
        
        return fig
    
    def create_sentiment_breakdown(self, df):
        """Create sentiment breakdown visualization"""
        sentiment_counts = df['sentiment'].value_counts()
        
        fig = px.bar(x=sentiment_counts.index, y=sentiment_counts.values,
                     title="Sentiment Distribution",
                     labels={'x': 'Sentiment', 'y': 'Count'},
                     color=sentiment_counts.index,
                     color_discrete_sequence=self.colors)
        
        fig.update_layout(showlegend=False)
        return fig
    
    def create_category_breakdown(self, df):
        """Create category breakdown visualization"""
        category_counts = df['category'].value_counts()
        
        fig = px.pie(values=category_counts.values, names=category_counts.index,
                     title="Comment Categories",
                     color_discrete_sequence=self.colors)
        
        return fig
    
    def create_spam_detection_chart(self, df):
        """Create spam detection visualization"""
        spam_counts = df['isSpam'].value_counts()
        
        fig = go.Figure(data=[
            go.Bar(x=['Legitimate', 'Spam'], 
                   y=[spam_counts.get(0, 0), spam_counts.get(1, 0)],
                   marker_color=['#4ECDC4', '#FF6B6B'])
        ])
        
        fig.update_layout(title="Spam Detection Results")
        return fig
    
    def create_relevance_distribution(self, df):
        """Create relevance score distribution"""
        fig = px.histogram(df, x='relevance_score',
                          title="Comment Relevance Score Distribution",
                          labels={'x': 'Relevance Score', 'y': 'Count'})
        
        fig.add_vline(x=df['relevance_score'].mean(), line_dash="dash", 
                     annotation_text=f"Mean: {df['relevance_score'].mean():.3f}")
        
        return fig
    
    def create_video_analysis_summary(self, df, video_id):
        """Create per-video analysis summary"""
        video_data = df[df['videoId'] == video_id]
        
        if len(video_data) == 0:
            return None
        
        # Create subplot figure
        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=('Quality Ratio', 'Sentiment Distribution', 
                           'Category Breakdown', 'Relevance Scores'),
            specs=[[{'type': 'domain'}, {'type': 'xy'}],
                   [{'type': 'domain'}, {'type': 'xy'}]]
        )
        
        # Quality ratio pie chart
        quality_counts = video_data['quality_score'].value_counts()
        fig.add_trace(go.Pie(labels=['Low Quality', 'High Quality'],
                            values=[quality_counts.get(0, 0), quality_counts.get(1, 0)],
                            name="Quality"), row=1, col=1)
        
        # Sentiment bar chart
        sentiment_counts = video_data['sentiment'].value_counts()
        fig.add_trace(go.Bar(x=sentiment_counts.index, y=sentiment_counts.values,
                            name="Sentiment"), row=1, col=2)
        
        # Category pie chart
        category_counts = video_data['category'].value_counts()
        fig.add_trace(go.Pie(labels=category_counts.index, values=category_counts.values,
                            name="Category"), row=2, col=1)
        
        # Relevance histogram
        fig.add_trace(go.Histogram(x=video_data['relevance_score'], name="Relevance"),
                     row=2, col=2)
        
        fig.update_layout(height=800, title_text=f"Video Analysis Summary - {video_id}")
        
        return fig

print("CommentAnalyticsDashboard class created successfully!")

CommentAnalyticsDashboard class created successfully!


## 6. Load and Preprocess Data

In [7]:
# Load datasets
print("Loading video dataset...")
videos = dataset.getVideos()

print("Loading comments dataset (10% sample for demo)...")
comments = dataset.getComments(dataset_id=1, sample_frac=0.1)

print(f"Loaded {len(videos)} videos and {len(comments)} comments")

# Remove duplicates
comments = comments.drop_duplicates(subset=["commentId"])
videos = videos.drop_duplicates(subset=["videoId"])

# Drop rows with missing comment text
comments = comments.dropna(subset=["textOriginal"])

print(f"After cleaning: {len(videos)} videos and {len(comments)} comments")

Loading video dataset...
Loading comments dataset (10% sample for demo)...
Loaded 92759 videos and 100000 comments
After cleaning: 92759 videos and 99997 comments


## 7. Advanced Text Analysis

In [8]:
# Initialize text preprocessor
preprocessor = AdvancedTextPreprocessor()

print("Performing text preprocessing and analysis...")

# Clean text
comments["textCleaned"] = comments["textOriginal"].apply(preprocessor.clean_text)

# Detect spam
comments["isSpam"] = comments["textOriginal"].apply(preprocessor.detect_spam)

# Categorize comments
comments["category"] = comments["textOriginal"].apply(preprocessor.categorize_comment)

print("Text preprocessing completed!")
print(f"Spam comments detected: {comments['isSpam'].sum()} ({comments['isSpam'].mean()*100:.1f}%)")
print("\nCategory distribution:")
print(comments['category'].value_counts())

Performing text preprocessing and analysis...
Text preprocessing completed!
Spam comments detected: 6151 (6.2%)

Category distribution:
category
other        89276
makeup        7437
skincare      2765
fragrance      519
Name: count, dtype: int64
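
One detail worth checking in `detect_spam`: the text is lowercased before the excessive-caps test, so `c.isupper()` is never true there and that rule cannot contribute to the spam count. A possible fix, sketched here as a standalone hypothetical helper `is_shouting`, measures the uppercase share on the original string. It keeps the notebook's thresholds (length above 10, ratio above 0.7); applying it would change the spam figures above.

```python
def is_shouting(text, min_length=10, threshold=0.7):
    """Return True when more than `threshold` of all characters are uppercase.

    Expects the original comment text, not a lowercased copy.
    """
    if not isinstance(text, str) or len(text) <= min_length:
        return False
    uppercase = sum(1 for c in text if c.isupper())
    return uppercase / len(text) > threshold

print(is_shouting("CLICK HERE NOW"))          # True  (12 of 14 characters)
print(is_shouting("CLICK HERE NOW".lower()))  # False (what the lowercased check sees)
```

The denominator includes spaces and punctuation, as in the notebook; dividing by the number of letters instead would make the rule less sensitive to trailing exclamation marks.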


## 8. Sentiment Analysis

In [9]:
# Setup sentiment analysis pipeline
device_index = 0 if torch.cuda.is_available() else -1
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"

print(f"Initializing sentiment analysis model: {model_name}")
print(f"Using device: {'GPU' if device_index == 0 else 'CPU'}")

analyzer = pipeline(
    "sentiment-analysis",
    model=model_name,
    truncation=True,
    device=device_index,
)

# Perform batch sentiment analysis
print("Performing sentiment analysis...")

# Prepare texts for analysis
texts_series = comments["textCleaned"].fillna("").astype(str)
unique_texts = texts_series.unique().tolist()

# Batch processing to avoid memory issues
batch_size = 32
label_map = {}
score_map = {}

for i in range(0, len(unique_texts), batch_size):
    batch = unique_texts[i:i + batch_size]
    try:
        results = analyzer(batch, truncation=True, max_length=512)
        for text, result in zip(batch, results):
            label_map[text] = result["label"]
            score_map[text] = result["score"]
    except Exception as e:
        print(f"Error processing batch {i//batch_size + 1}: {e}")
        # Handle failed batch by assigning neutral sentiment
        for text in batch:
            label_map[text] = "neutral"
            score_map[text] = 0.5

# Map results back to dataframe
comments["sentiment"] = texts_series.map(label_map)
comments["sentiment_score"] = texts_series.map(score_map)

print("Sentiment analysis completed!")
print("\nSentiment distribution:")
print(comments['sentiment'].value_counts())

Initializing sentiment analysis model: cardiffnlp/twitter-roberta-base-sentiment-latest
Using device: GPU


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


Performing sentiment analysis...


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Sentiment analysis completed!

Sentiment distribution:
sentiment
neutral     52694
positive    39042
negative     8261
Name: count, dtype: int64
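
The cell above saves model calls by classifying each distinct text once and mapping the labels back to every row. The pattern does not depend on the model, so the sketch below shows it with a stand-in classifier: `fake_classifier` is a hypothetical stub, not the Hugging Face pipeline, which lets the example run without a GPU or a model download.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_classifier(batch):
    """Stand-in for the sentiment pipeline: labels by a trivial keyword rule."""
    return [
        {"label": "positive" if "love" in text else "neutral", "score": 1.0}
        for text in batch
    ]

def label_texts(texts, classify, batch_size=32):
    """Classify each distinct text once, then map labels back to every input."""
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate, keep first-seen order
    label_map = {}
    for batch in batched(unique_texts, batch_size):
        for text, result in zip(batch, classify(batch)):
            label_map[text] = result["label"]
    return [label_map[text] for text in texts]

sample = ["love it", "first", "love it", "first", "so pretty"]
print(label_texts(sample, fake_classifier, batch_size=2))
# ['positive', 'neutral', 'positive', 'neutral', 'neutral']
```

Five inputs need only three classifications here. One thing that may be worth verifying separately: the notebook feeds `textCleaned` (stemmed, stop words removed) to a model trained on raw tweets, so running it on `textOriginal` could yield different labels.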


## 9. Quality Assessment and Relevance Analysis

In [10]:
# Quality assessment
print("Assessing comment quality...")
comments["quality_score"] = comments.apply(
    lambda row: preprocessor.assess_quality(row["textOriginal"], row["sentiment"]), axis=1
)

# Relevance analysis
print("Performing relevance analysis...")
relevance_analyzer = RelevanceAnalyzer()
comments["relevance_score"] = relevance_analyzer.batch_relevance_analysis(comments, videos)

print("Quality assessment and relevance analysis completed!")
print(f"High quality comments: {comments['quality_score'].sum()} ({comments['quality_score'].mean()*100:.1f}%)")
print(f"Average relevance score: {comments['relevance_score'].mean():.3f}")

Assessing comment quality...
Performing relevance analysis...
Quality assessment and relevance analysis completed!
High quality comments: 32376 (32.4%)
Average relevance score: 0.018
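
`assess_quality` adds points for length, product keywords, non-neutral sentiment, and engagement words, then labels a comment high quality at 3 points or more. The sketch below reimplements that rubric as a hypothetical `quality_breakdown` function that returns each component, which may help when tuning the threshold. The product keyword list is shortened for illustration, so product points can differ from the notebook's for other inputs.

```python
PRODUCT_KEYWORDS = ["skin", "serum", "lipstick", "mascara", "perfume", "scent"]  # shortened
ENGAGEMENT_WORDS = ["love", "amazing", "recommend", "favorite", "best",
                    "great", "good", "bad", "disappointed"]

def quality_breakdown(text, sentiment=None, threshold=3):
    """Return the per-rule points and the final 0/1 label for one comment."""
    lowered = text.lower()
    word_count = len(lowered.split())

    if 5 <= word_count <= 50:
        length_points = 2
    elif 3 <= word_count < 5 or 50 < word_count <= 100:
        length_points = 1
    else:
        length_points = 0

    points = {
        "length": length_points,
        "product": 2 if any(k in lowered for k in PRODUCT_KEYWORDS) else 0,
        "sentiment": 1 if sentiment and sentiment != "neutral" else 0,
        "engagement": 1 if any(w in lowered for w in ENGAGEMENT_WORDS) else 0,
    }
    total = sum(points.values())
    return {**points, "total": total, "label": 1 if total >= threshold else 0}

print(quality_breakdown("Handsome", sentiment="positive"))
# {'length': 0, 'product': 0, 'sentiment': 1, 'engagement': 0, 'total': 1, 'label': 0}
print(quality_breakdown("looking cute in all hairstyle", sentiment="positive"))
# {'length': 2, 'product': 0, 'sentiment': 1, 'engagement': 0, 'total': 3, 'label': 1}
```

The second example shows that any comment of 5 to 50 words with a non-neutral sentiment reaches the threshold without mentioning a product, which is worth keeping in mind when reading the high-quality ratio.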


## 10. Key Performance Indicators (KPIs)

In [11]:
def calculate_kpis(df):
    """Calculate key performance indicators"""
    total_comments = len(df)
    
    kpis = {
        'Total Comments': total_comments,
        'Quality Comment Ratio': df['quality_score'].mean(),
        'Spam Rate': df['isSpam'].mean(),
        'Average Relevance Score': df['relevance_score'].mean(),
        'Positive Sentiment %': (df['sentiment'] == 'positive').mean() * 100,
        'Negative Sentiment %': (df['sentiment'] == 'negative').mean() * 100,
        'Neutral Sentiment %': (df['sentiment'] == 'neutral').mean() * 100,
        'Skincare Comments %': (df['category'] == 'skincare').mean() * 100,
        'Makeup Comments %': (df['category'] == 'makeup').mean() * 100,
        'Fragrance Comments %': (df['category'] == 'fragrance').mean() * 100,
        'Other Comments %': (df['category'] == 'other').mean() * 100
    }
    
    return kpis

# Calculate overall KPIs
overall_kpis = calculate_kpis(comments)

print("=== COMMENT ANALYSIS KPIs ===")
for kpi, value in overall_kpis.items():
    if '%' in kpi or 'Ratio' in kpi or 'Rate' in kpi or 'Score' in kpi:
        print(f"{kpi}: {value:.2f}%" if '%' in kpi else f"{kpi}: {value:.3f}")
    else:
        print(f"{kpi}: {value:,}")

=== COMMENT ANALYSIS KPIs ===
Total Comments: 99,997
Quality Comment Ratio: 0.324
Spam Rate: 0.062
Average Relevance Score: 0.018
Positive Sentiment %: 39.04%
Negative Sentiment %: 8.26%
Neutral Sentiment %: 52.70%
Skincare Comments %: 2.77%
Makeup Comments %: 7.44%
Fragrance Comments %: 0.52%
Other Comments %: 89.28%
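
The percentages above can be cross-checked against the raw counts printed in sections 7 to 9. The short script below, a sanity check rather than part of the pipeline, recomputes them from those counts and confirms that the sentiment and category counts each sum to the 99,997 comments analyzed.

```python
total = 99_997
sentiment_counts = {"neutral": 52_694, "positive": 39_042, "negative": 8_261}
category_counts = {"other": 89_276, "makeup": 7_437, "skincare": 2_765, "fragrance": 519}

# Each comment receives exactly one sentiment label and one category.
assert sum(sentiment_counts.values()) == total
assert sum(category_counts.values()) == total

for name, count in {**sentiment_counts, **category_counts}.items():
    print(f"{name}: {count / total * 100:.2f}%")

print(f"Spam rate: {6_151 / total:.3f}")       # 0.062
print(f"Quality ratio: {32_376 / total:.3f}")  # 0.324
```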


## 11. Create Interactive Dashboard

In [14]:
# Initialize dashboard
dashboard = CommentAnalyticsDashboard()

# Create visualizations
print("Creating interactive dashboard...")

# Overall quality ratio
quality_fig = dashboard.create_quality_ratio_chart(comments)
quality_fig.show()

# Sentiment breakdown
sentiment_fig = dashboard.create_sentiment_breakdown(comments)
sentiment_fig.show()

# Category breakdown
category_fig = dashboard.create_category_breakdown(comments)
category_fig.show()

# Spam detection
spam_fig = dashboard.create_spam_detection_chart(comments)
spam_fig.show()

# Relevance distribution
relevance_fig = dashboard.create_relevance_distribution(comments)
relevance_fig.show()

Creating interactive dashboard...


## 12. Per-Video Analysis

In [15]:
# Get video-level analytics
def get_video_analytics(df):
    """Generate per-video analytics"""
    video_stats = df.groupby('videoId').agg({
        'commentId': 'count',
        'quality_score': 'mean',
        'isSpam': 'mean',
        'relevance_score': 'mean',
        'sentiment_score': 'mean'
    }).round(3)
    
    video_stats.columns = ['Total_Comments', 'Quality_Ratio', 'Spam_Rate', 'Avg_Relevance', 'Avg_Sentiment_Score']
    video_stats = video_stats.sort_values('Total_Comments', ascending=False)
    
    return video_stats

video_analytics = get_video_analytics(comments)
print("=== TOP 10 VIDEOS BY COMMENT COUNT ===")
print(video_analytics.head(10))

# Analyze a specific video
top_video_id = video_analytics.index[0]
print(f"\n=== DETAILED ANALYSIS FOR TOP VIDEO: {top_video_id} ===")

video_summary_fig = dashboard.create_video_analysis_summary(comments, top_video_id)
if video_summary_fig:
    video_summary_fig.show()

# Show sample high-quality comments for the top video
top_video_comments = comments[comments['videoId'] == top_video_id]
high_quality_comments = top_video_comments[
    (top_video_comments['quality_score'] == 1) & 
    (top_video_comments['isSpam'] == 0)
].sort_values('relevance_score', ascending=False)

print(f"\n=== SAMPLE HIGH-QUALITY COMMENTS FROM {top_video_id} ===")
for i, (_, comment) in enumerate(high_quality_comments.head(5).iterrows()):
    print(f"{i+1}. [{comment['sentiment'].upper()}] (Relevance: {comment['relevance_score']:.3f})")
    print(f"   \"{comment['textOriginal'][:100]}...\"")
    print(f"   Category: {comment['category']}\n")

=== TOP 10 VIDEOS BY COMMENT COUNT ===
         Total_Comments  Quality_Ratio  Spam_Rate  Avg_Relevance  \
videoId                                                            
32656              1473          0.299      0.012          0.000   
58551              1028          0.452      0.003          0.081   
76480               915          0.165      0.077          0.051   
69445               859          0.671      0.010          0.052   
51351               819          0.230      0.020          0.124   
55433               721          0.251      0.039          0.011   
18248               705          0.329      0.021          0.043   
16282               693          0.159      0.030          0.084   
40498               680          0.469      0.043          0.010   
18615               598          0.261      0.043          0.020   

         Avg_Sentiment_Score  
videoId                       
32656                  0.721  
58551                  0.737  
76480               


=== SAMPLE HIGH-QUALITY COMMENTS FROM 32656 ===
1. [POSITIVE] (Relevance: 0.022)
   "Brooke is the most beautiful girl I’ve ever seen..."
   Category: other

2. [POSITIVE] (Relevance: 0.020)
   "My brother and his friend love you so much and always scream BROOK MONK!!!!!!..."
   Category: skincare

3. [POSITIVE] (Relevance: 0.020)
   "The 1950’s was literally so pretty and looked so adorable on brooke<3..."
   Category: other

4. [POSITIVE] (Relevance: 0.015)
   "looking cute in all hairstyle..."
   Category: other

5. [POSITIVE] (Relevance: 0.015)
   "1920 was my favourite hairstyle 😘..."
   Category: other
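
The preview loop appends `...` to every comment, including ones shorter than 100 characters (as in "looking cute in all hairstyle..."). A small helper can add the ellipsis only when text is actually cut. `preview` is a hypothetical name and not part of the notebook.

```python
def preview(text, limit=100):
    """Return `text` unchanged if it fits within `limit` characters.

    Otherwise return the first `limit` characters, with trailing whitespace
    trimmed, followed by '...'.
    """
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."

print(preview("looking cute in all hairstyle"))  # looking cute in all hairstyle
print(preview("a" * 120, limit=10))              # aaaaaaaaaa...
```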



## 13. Advanced Analytics and Insights

In [None]:
# Create comprehensive insights
def generate_insights(df):
    """Generate actionable insights from the data"""
    insights = []
    
    # Quality insights
    quality_ratio = df['quality_score'].mean()
    if quality_ratio < 0.3:
        insights.append(f"Low quality comment ratio ({quality_ratio:.1%}). Consider content strategy review.")
    elif quality_ratio > 0.6:
        insights.append(f"High quality comment ratio ({quality_ratio:.1%}). Great audience engagement!")
    
    # Spam insights
    spam_rate = df['isSpam'].mean()
    if spam_rate > 0.2:
        insights.append(f"High spam rate ({spam_rate:.1%}). Implement stricter comment moderation.")
    
    # Sentiment insights
    positive_ratio = (df['sentiment'] == 'positive').mean()
    negative_ratio = (df['sentiment'] == 'negative').mean()
    
    if positive_ratio > 0.5:
        insights.append(f"Positive sentiment dominates ({positive_ratio:.1%}). Audience responds well to content.")
    elif negative_ratio > 0.3:
        insights.append(f"High negative sentiment ({negative_ratio:.1%}). Review content strategy.")
    
    # Category insights
    top_category = df['category'].value_counts().index[0]
    top_category_pct = df['category'].value_counts(normalize=True).iloc[0]
    insights.append(f"📊 '{top_category}' is the dominant category ({top_category_pct:.1%} of comments).")
    
    # Relevance insights
    avg_relevance = df['relevance_score'].mean()
    if avg_relevance < 0.1:
        insights.append(f"Low content relevance ({avg_relevance:.3f}). Comments may be off-topic.")
    elif avg_relevance > 0.3:
        insights.append(f"High content relevance ({avg_relevance:.3f}). Comments align well with video content.")
    
    return insights

insights = generate_insights(comments)

print("=== KEY INSIGHTS AND RECOMMENDATIONS ===")
for insight in insights:
    print(insight)
    
# Category-specific analysis
print("\n=== CATEGORY-SPECIFIC QUALITY ANALYSIS ===")
category_quality = comments.groupby('category').agg({
    'quality_score': ['mean', 'count'],
    'sentiment': lambda x: (x == 'positive').mean(),
    'relevance_score': 'mean'
}).round(3)

category_quality.columns = ['Quality_Ratio', 'Comment_Count', 'Positive_Sentiment_Ratio', 'Avg_Relevance']
print(category_quality)
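
The cell above has no recorded output. As a rough guide to what its threshold rules would report, the sketch below applies the same cut-offs to the rounded KPI values printed in section 10. `insights_from_kpis` is a hypothetical helper that takes plain numbers instead of the DataFrame; the messages are shortened and the unconditional category line is left out.

```python
def insights_from_kpis(quality_ratio, spam_rate, positive_ratio, negative_ratio, avg_relevance):
    """Apply the notebook's insight thresholds to already-computed ratios."""
    insights = []
    if quality_ratio < 0.3:
        insights.append(f"Low quality comment ratio ({quality_ratio:.1%}).")
    elif quality_ratio > 0.6:
        insights.append(f"High quality comment ratio ({quality_ratio:.1%}).")
    if spam_rate > 0.2:
        insights.append(f"High spam rate ({spam_rate:.1%}).")
    if positive_ratio > 0.5:
        insights.append(f"Positive sentiment dominates ({positive_ratio:.1%}).")
    elif negative_ratio > 0.3:
        insights.append(f"High negative sentiment ({negative_ratio:.1%}).")
    if avg_relevance < 0.1:
        insights.append(f"Low content relevance ({avg_relevance:.3f}).")
    elif avg_relevance > 0.3:
        insights.append(f"High content relevance ({avg_relevance:.3f}).")
    return insights

# Rounded values copied from the KPI printout in section 10.
reported = insights_from_kpis(
    quality_ratio=0.324, spam_rate=0.062,
    positive_ratio=0.3904, negative_ratio=0.0826,
    avg_relevance=0.018,
)
print(reported)  # ['Low content relevance (0.018).']
```

With these values only the low-relevance rule fires: the quality ratio of 0.324 falls between 0.3 and 0.6, where no message is produced, and neither sentiment threshold is reached.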

## 14. Export Results and Summary

In [17]:
# Create final summary dataframe
summary_df = comments[[
    'videoId', 'commentId', 'textOriginal', 'textCleaned',
    'sentiment', 'sentiment_score', 'category', 
    'quality_score', 'isSpam', 'relevance_score'
]].copy()

# Add quality labels
summary_df['quality_label'] = summary_df['quality_score'].map({0: 'Low Quality', 1: 'High Quality'})
summary_df['spam_label'] = summary_df['isSpam'].map({0: 'Legitimate', 1: 'Spam'})

print("=== FINAL SUMMARY ===")
print(f"Dataset processed: {len(summary_df):,} comments")
print("Analysis completed successfully!")

# Display sample of processed data
print("\n=== SAMPLE PROCESSED DATA ===")
display_cols = ['textOriginal', 'sentiment', 'category', 'quality_label', 'spam_label', 'relevance_score']
sample_data = summary_df[display_cols].head(10)
print(sample_data.to_string(max_colwidth=50))

# Save results (uncomment to save)
# summary_df.to_csv('comment_analysis_results.csv', index=False)
# video_analytics.to_csv('video_analytics_summary.csv')
# print("\nResults saved to CSV files!")

print("\nCommentSense AI Analysis Complete!")
print("\nThe system has successfully analyzed comment quality, sentiment, categories, spam detection, and relevance at scale.")

=== FINAL SUMMARY ===
Dataset processed: 99,997 comments
Analysis completed successfully!

=== SAMPLE PROCESSED DATA ===
                                             textOriginal sentiment category quality_label  spam_label  relevance_score
987231              Thank you very much 🥰 Please share 🙏💞  positive    other  High Quality  Legitimate         0.000000
79954   She looks pretty on both sides. Only big diffe...   neutral    other   Low Quality  Legitimate         0.000000
567130  I hate straight hair & love it. Glad you like it❤  positive    other  High Quality  Legitimate         0.032922
500891  The texture makes you look more beautiful and ...  positive    other  High Quality  Legitimate         0.036588
55399                                            Handsome  positive    other   Low Quality  Legitimate         0.157122
135049  Check out the prices & order here - https://1h...   neutral    other   Low Quality  Legitimate         0.299956
733378                                 

## 15. Model Performance Summary

### Features Implemented:
1. **Quality Comment Ratio Analysis** - Identifies high vs low quality comments based on multiple factors
2. **Sentiment Breakdown** - Positive, negative, neutral sentiment analysis per video
3. **Comment Categorization** - Skincare, makeup, fragrance, and other categories
4. **Spam Detection** - Rule-based spam detection using length, repetition, and keyword checks
5. **Relevance Analysis** - Measures comment relevance to video content using cosine similarity
6. **Interactive Dashboard** - Visual analytics for easy interpretation
7. **Per-Video Analytics** - Detailed breakdown for each video
8. **KPI Tracking** - Key performance indicators for content effectiveness

### Key Metrics:
- **Share of Engagement (SoE)** analysis through comment quality metrics
- **Scalable processing** with batch analysis for large datasets
- **Threshold-based insights** with actionable recommendations
- **Category-specific analysis** for targeted content strategy

This prototype demonstrates a comprehensive AI-powered solution for analyzing comment quality and relevance at scale, enabling data-driven decisions for content strategy optimization.