# Chapter 5: Classified Sentiment Analysis for Social Media

Dear user, the DSMs from Chapter 4 let us identify 15 key components. \
We first classify the social media comments by these 15 key components, \
then run sentiment analysis on each group to find design opportunities for those components.

### REQUIREMENTS

For this notebook, you need to have:
- 2 pickle files of scraped data from your social media sources (from Chap2.ipynb)

### TO DO SECTION

In [13]:
'''
Dear user, enter your Product here!
'''

product = "Boeing 787 Dreamliner Commercial Plane"

In [14]:
'''
Dear user, enter your directories to the 2 Pickle files of Scraped data from Social Media!
'''
youtube = f"support/{product}/youtube/comment_list.pkl"
reddit = f"support/{product}/reddit/comment_list.pkl"
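
The two paths above are only opened later, in the dataset cell, so a wrong path surfaces late. A minimal sketch of an early check follows. It assumes each pickle holds a plain Python list of comment strings, which matches how the later cells use the data; `load_comments` is a hypothetical helper, not part of this notebook.

```python
import pickle
from pathlib import Path


def load_comments(path):
    """Load a pickled list of comment strings, failing early with a clear message."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Missing scraped data: {path} (run Chap2.ipynb first)")
    with path.open("rb") as fh:
        comments = pickle.load(fh)  # only unpickle files you produced yourself
    # Assumption: the scraper saved a plain list of strings.
    if not isinstance(comments, list) or not all(isinstance(c, str) for c in comments):
        raise TypeError(f"{path} should contain a list of strings")
    return comments
```

Calling `load_comments(youtube)` and `load_comments(reddit)` right after setting the paths would report a missing or malformed file before any model is downloaded.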

In [15]:
'''
Dear user, enter the 15 key components identified from DSM here!
'''
components_to_classify = ['Overhead Bin', 'Cabin Window', 'Seat', 'Door', 'Forward Fuselage', \
                        'Centre Fuselage', 'Rear Fuselage', 'Wing', 'Wingtip', 'Tail Fin', \
                        'Horizontal Stabilizer', 'Engine', 'Engine Nacelles', 'Landing Gear', 'Battery']

### RUN AS INTENDED (DO NOT CHANGE ANYTHING.)

In [16]:
! pip install accelerate





In [17]:
""" Create Classify and Sentiment folder """
search_terms = product

import os
import shutil

# Create "classify" folder
try:
    os.makedirs(f"support/{search_terms}/classify")
except FileExistsError:
    shutil.rmtree(f"support/{search_terms}/classify")
    os.makedirs(f"support/{search_terms}/classify")

# Create "sentiment" folder
try:
    os.makedirs(f"support/{search_terms}/sentiment")
except FileExistsError:
    shutil.rmtree(f"support/{search_terms}/sentiment")
    os.makedirs(f"support/{search_terms}/sentiment")
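
The cell above repeats the same "delete, then recreate" pattern for both folders. As a hedged sketch, the pattern could be wrapped in one small function; `reset_dir` is a hypothetical name and this is one possible shape, not the notebook's own code.

```python
import shutil
from pathlib import Path


def reset_dir(path):
    """Delete `path` if it already exists, then recreate it as an empty folder."""
    path = Path(path)
    if path.exists():
        shutil.rmtree(path)  # removes results from any previous run
    path.mkdir(parents=True)
    return path
```

With this helper, each folder becomes a single call, for example `reset_dir(f"support/{search_terms}/classify")`. Note that, like the original cell, it discards any results from an earlier run.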

In [18]:
""" Initialise and Establish Dataset """
import pandas as pd

youtube = pd.read_pickle(youtube)
reddit = pd.read_pickle(reddit)

In [19]:
import random

combined = youtube + reddit
print("Number of comments:", len(combined))

# Randomly select 50 comments from each dataset
selected_youtube_comments = random.sample(youtube, 50)
selected_reddit_comments = random.sample(reddit, 50)

# Combine the selected comments from both datasets
combined = selected_youtube_comments + selected_reddit_comments
print("Number of comments:", len(combined))

Number of comments: 8271
Number of comments: 100
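
The sampling above draws a different subset on every run, and `random.sample` raises `ValueError` when a source has fewer than 50 comments. A minimal sketch of a reproducible, size-safe variant follows; the helper name `sample_comments` and the seed value are assumptions made for illustration.

```python
import random


def sample_comments(comments, k, seed=0):
    """Return up to k comments drawn without replacement, reproducibly."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    return rng.sample(comments, min(k, len(comments)))


# Tiny made-up dataset, used only to show the behaviour.
demo = [f"comment {i}" for i in range(10)]
print(len(sample_comments(demo, 50)))  # 10, because only 10 comments are available
print(sample_comments(demo, 3) == sample_comments(demo, 3))  # True, same seed gives the same subset
```

Fixing the seed makes the component counts and sentiment summaries comparable between runs, which helps when tuning the list of components.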


In [20]:
"""
Classify comments by components
"""
import csv
from transformers import pipeline

comment_list = combined

candidates = components_to_classify + ['Other']

# Initialize counters
candidate_counts = {candidate: 0 for candidate in candidates}

model = "facebook/bart-large-mnli"  # Default model
# model = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"

# Initialize CSV files for each category
csv_files = {candidate: open(f"support/{product}/classify/{candidate}_comments.csv", "w", newline="", encoding="utf-8") for candidate in candidates}
writers = {candidate: csv.writer(csv_files[candidate]) for candidate in candidates}

# Write header row for each CSV file
for writer in writers.values():
    writer.writerow(["Sequence", "Label", "Source"])

# Initialize the zero-shot classification pipeline; multi_label=True scores each label independently
classifier = pipeline(
    "zero-shot-classification",
    model=model,
    framework="pt",
    multi_label=True
)

labeled_comments_count = 0  # Initialize counter for labeled comments

# Write results to CSV for each comment
for comment in comment_list:
    # Classify comment only if it's not empty
    if comment.strip():
        result = classifier(comment, candidate_labels=candidates)
        sequence = result['sequence'] if result['labels'] else None
        label = result['labels'][0] if result['labels'] else None

        # Update candidate counters and labeled comments count
        if label:
            candidate_counts[label] += 1
            labeled_comments_count += 1

            # Determine the source of the comment (YouTube or Reddit)
            source = "YouTube" if comment in youtube else "Reddit"

            # Write the comment to the respective CSV file based on its category and source
            writers[label].writerow([sequence, label, source])
    else:
        # Write empty comment to each CSV file
        for writer in writers.values():
            writer.writerow([None, None, None])

# Close all CSV files
for file in csv_files.values():
    file.close()

# Calculate the number of empty comments
empty_comments_count = len(comment_list) - labeled_comments_count

# Print summary
print("Candidate Counts:")
for candidate, count in candidate_counts.items():
    print(f"{candidate}: {count}")
print(f"Empty Comments: {empty_comments_count}")
print("Detailed results written to respective CSV files.")

Candidate Counts:
Overhead Bin: 1
Cabin Window: 0
Seat: 33
Door: 1
Forward Fuselage: 8
Centre Fuselage: 3
Rear Fuselage: 6
Wing: 8
Wingtip: 6
Tail Fin: 2
Horizontal Stabilizer: 12
Engine: 4
Engine Nacelles: 0
Landing Gear: 0
Battery: 5
Other: 8
Empty Comments: 3
Detailed results written to respective CSV files.
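
The classification cell always takes the first label returned by the pipeline. With `multi_label=True` each label is scored independently, so the top label can still carry a low score. The sketch below shows one way to route such weak matches to `Other`. The `0.5` threshold and the name `top_label` are assumptions, and the example dictionary is hand-written in the shape the zero-shot pipeline returns (`sequence`, `labels`, `scores`); it is not real model output.

```python
def top_label(result, threshold=0.5, fallback="Other"):
    """Return the best-scoring label, or `fallback` when the best score is below `threshold`."""
    if not result["labels"]:
        return fallback
    best_label, best_score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return best_label if best_score >= threshold else fallback


# Hand-written example in the pipeline's output shape (illustrative values only).
example = {
    "sequence": "The seat pitch is far too tight.",
    "labels": ["Seat", "Door", "Other"],
    "scores": [0.91, 0.12, 0.05],
}
print(top_label(example))  # Seat
```

Whether a threshold helps depends on the data; comparing the counts with and without it against a few manually labelled comments is a reasonable check.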


In [25]:
import csv
from transformers import pipeline, AutoTokenizer

candidates = components_to_classify + ['Other']

for candidate in candidates:
    # Initialize sentiment counts for each candidate
    positive_count = 0
    negative_count = 0
    neutral_count = 0
    empty_comments_count = 0

    comment_list = []
    with open(f"support/{product}/classify/{candidate}_comments.csv", "r", encoding="utf-8") as csvfile:
        csv_reader = csv.reader(csvfile)
        next(csv_reader)  # Skip header row
        for row in csv_reader:
            comment_list.append(row[0])

    if not comment_list:
        continue

    # Define the maximum sequence length
    max_seq_length = 512  # Adjust Truncated Length

    # Model for sentiment analysis
    model = "cardiffnlp/twitter-roberta-base-sentiment"  # negative, neutral, positive

    # Initialise the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)

    # Truncate over-long comments by character count as a rough guard.
    # Note: this limits characters, not tokens, so it only approximates the model's 512-token limit.
    filtered_comments = [comment[:max_seq_length - 2] for comment in comment_list]

    # Initialise the pipeline with padding
    classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, padding=True, device=-1)

    results = classifier(filtered_comments)

    # Accumulate sentiment counts
    for i in range(len(results)):
        result = results[i]
        sentiment = result['label']
        if comment_list[i].strip():  # Check if the comment is not empty
            if sentiment == "LABEL_2" or sentiment == "POSITIVE":
                positive_count += 1
            elif sentiment == "LABEL_0" or sentiment == "NEGATIVE":
                negative_count += 1
            elif sentiment == "LABEL_1":
                neutral_count += 1
        else:  # If the comment is empty, count it
            empty_comments_count += 1

    # Output CSV
    with open(f"support/{product}/sentiment/{candidate}_analysis.csv", "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Comment", "Sentiment"])
        for i in range(len(results)):
            writer.writerow([comment_list[i], results[i]['label']])

    # Calculate overall sentiment for each candidate
    overall_sentiment = "Positive" if positive_count > negative_count else "Negative" if negative_count > positive_count else "Neutral"

    # Print summary for each candidate
    print(f"Candidate: {candidate}")
    print("Number of comments with Positive sentiment:", positive_count)
    print("Number of comments with Negative sentiment:", negative_count)
    print("Number of comments with Neutral sentiment:", neutral_count)
    print("Overall Sentiment:", overall_sentiment)
    print()


Candidate: Overhead Bin
Number of comments with Positive sentiment: 0
Number of comments with Negative sentiment: 0
Number of comments with Neutral sentiment: 1
Overall Sentiment: Neutral

Candidate: Cabin Window
Number of comments with Positive sentiment: 0
Number of comments with Negative sentiment: 0
Number of comments with Neutral sentiment: 0
Overall Sentiment: Neutral

Candidate: Seat
Number of comments with Positive sentiment: 7
Number of comments with Negative sentiment: 7
Number of comments with Neutral sentiment: 19
Overall Sentiment: Neutral

Candidate: Door
Number of comments with Positive sentiment: 0
Number of comments with Negative sentiment: 0
Number of comments with Neutral sentiment: 1
Overall Sentiment: Neutral

Candidate: Forward Fuselage
Number of comments with Positive sentiment: 4
Number of comments with Negative sentiment: 0
Number of comments with Neutral sentiment: 4
Overall Sentiment: Positive

Candidate: Centre Fuselage
Number of comments with Positive senti
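
The counting and the overall verdict in the sentiment cell can be isolated into two small functions, which makes the rule easier to inspect. This is a sketch under the label mapping used in that cell (`LABEL_0` negative, `LABEL_1` neutral, `LABEL_2` positive); the function names are hypothetical.

```python
from collections import Counter

# Mapping taken from the label checks in the sentiment cell.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def tally_sentiment(results):
    """Count sentiment names from a list of {'label': ...} dicts."""
    counts = Counter(LABEL_NAMES.get(r["label"], r["label"].lower()) for r in results)
    return {name: counts.get(name, 0) for name in ("positive", "negative", "neutral")}


def overall_sentiment(counts):
    """Same rule as the notebook: compare positive against negative only; a tie is Neutral."""
    if counts["positive"] > counts["negative"]:
        return "Positive"
    if counts["negative"] > counts["positive"]:
        return "Negative"
    return "Neutral"
```

Under this rule the neutral count never changes the verdict: the Seat result above (7 positive, 7 negative, 19 neutral) is Neutral because of the tie, while Forward Fuselage (4 positive, 0 negative, 4 neutral) is Positive.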

### Confusion Matrix

### TO DO SECTION

In [None]:
# '''
# Dear user, please manually annotate the classification for a selected number of comments in a post-classified csv file!
# Copy the csv file to the others folder and name it Confusion_Table.csv!
# The file needs a 'Human' column (your labels) and an 'AI' column (the model's labels).
# '''
# confusion_table_path = 'others/Confusion_Table.csv'

# """ Confusion Matrix """
# from sklearn.metrics import confusion_matrix
# from sklearn.metrics import precision_recall_fscore_support
# import pandas as pd

# '''load data'''
# labelled_data = pd.read_csv(confusion_table_path)

# print(labelled_data)

# y_true = list(labelled_data['Human'])
# y_pred = list(labelled_data["AI"])

# '''Compute'''
# print("\nConfusion Matrix summary:")
# print("Number of comments:", len(labelled_data))
# print("\nConfusion Table --- Labels: 0, 1, 2  |  Rows = Human (i.e. True)  |  Columns = AI (i.e. Predicted)")

# print(confusion_matrix(y_true, y_pred))
# print("\n(Precision, Recall, F1 Score)")
# print(precision_recall_fscore_support(y_true, y_pred, average='macro')[0:3])
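
To make the numbers reported by the cell above easier to interpret, here is a dependency-free sketch of a confusion table and macro-averaged precision, recall and F1. It is an illustration of the calculation, not the scikit-learn implementation; the function names and the four example labels are made up.

```python
def confusion_table(y_true, y_pred, labels):
    """Rows are true (Human) labels, columns are predicted (AI) labels."""
    index = {label: i for i, label in enumerate(labels)}
    table = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        table[index[t]][index[p]] += 1
    return table


def macro_scores(table):
    """Macro-averaged precision, recall and F1; undefined ratios count as 0."""
    n = len(table)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = table[i][i]
        predicted = sum(table[r][i] for r in range(n))  # column total
        actual = sum(table[i])                          # row total
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up annotations: Human says [0, 1, 2, 2], AI says [0, 2, 2, 2].
table = confusion_table([0, 1, 2, 2], [0, 2, 2, 2], labels=[0, 1, 2])
print(table)  # [[1, 0, 0], [0, 0, 1], [0, 0, 2]]
print(macro_scores(table))
```

In this example label 1 is never predicted, so its precision, recall and F1 all count as 0 and pull the macro averages down, which is the behaviour to expect when a component is rarely assigned by the classifier.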