# 0. Introduction
As per the coursework specifications, we were supposed to choose one company and analyze its reputation, but we thought it would be more interesting to compare two competing companies: *Samsung* and *Apple*.

By the end of this project, we intend to know which of these two companies' customers tend to give more negative feedback about the products they use. For this, we needed to gather feedback from places where customers are likely to express their feelings about the matter freely.
After some research (and guidance from the coursework specifications), it was clear that social media was the place to focus on.
We chose Reddit because of the 'online disinhibition effect', which suggests that people are more likely to be honest when anonymous [[1]](https://academic.oup.com/jcmc/article/18/3/283/4067498?login=false).

We also decided that YouTube would be a good second data source, as we expected opinions on the platform to be either heavily biased or extremely neutral, and we thought this would be interesting to analyze.

Both of our data collection pipelines are automated and are described below:
## YouTube
Using YouTube's API, we built a [project](https://github.com/AdiKsOnDev/YouTubeParser) that collects popular YouTube videos along with their transcripts for later labelling and analysis.
Note that we did not stop at English YouTube videos: we also collected videos in various other languages and machine-translated them. The script can also collect the top comments and the like/dislike counts for those videos.

## Reddit
We built a [similar project](https://github.com/johanjohnthomas/CompanyReputationAnalysis) for Reddit.
We did not want our data collection to be as simple as querying for the company names and storing the results directly, so we followed the process below.
Using asyncpraw, we first query Reddit for the company name. We store the resulting posts, along with a unique list of the subreddits they were posted in.
Then, for each of these subreddits, we gather 100 posts in each of the following Reddit-specific categories: hot, controversial, top, new, and rising. We also add our own category to this list, which we call 'queried'.
For every post, we fetch the top 10 comments.

This gave us a total of 2 million entries. After cleaning, filtering for relevancy, removing columns with errors, removing rows with NaN values, and removing duplicates, we ended up with around 8000 posts each for Samsung and Apple.
We separated the posts from the comments because we thought the distribution difference would affect model quality.
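
As a rough illustration of the cleaning steps named above, the sketch below drops rows without body text, removes duplicates, and splits posts from comments. The toy rows and the choice of `text`, `title`, and `type` as the duplicate key are assumptions for illustration, not necessarily the exact steps used in the project.

```python
import pandas as pd

# Toy rows in the same column layout as the collected data (values are made up).
raw = pd.DataFrame({
    "text": ["Battery died after a week", None, "Battery died after a week", "Love this phone"],
    "title": ["S24 battery", "Image post", "S24 battery", None],
    "type": ["post", "post", "post", "comment"],
    "category": ["queried", "hot", "top", "queried"],
})

# Drop rows with no body text, then drop repeats of the same text/title/type
# combination (the same submission can be collected under several categories).
cleaned = (
    raw.dropna(subset=["text"])
       .drop_duplicates(subset=["text", "title", "type"])
       .reset_index(drop=True)
)

# Keep posts and comments in separate frames.
posts = cleaned[cleaned["type"] == "post"]
comments = cleaned[cleaned["type"] == "comment"]

print(len(cleaned), len(posts), len(comments))
```

Leaving `category` out of the duplicate key matters here: a submission that appears under both 'queried' and 'top' would otherwise be kept twice.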

## Labelling
For labelling, we had a detailed pipeline.
For YouTube videos, we manually labelled the whole corpus, as it was small enough to do so. The Reddit corpus was too large for that, so it was labelled by the [Qwen2.5-7B LLM from HuggingFace](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF).
We then manually validated *~800 rows* and agreed with *~90%* of the LLM's classifications.
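
An agreement rate like the one quoted above can be computed as the share of rows where the manual label matches the LLM label. The sketch below uses made-up labels, and the column names `llm_label` and `manual_label` are hypothetical; it only illustrates the calculation.

```python
import pandas as pd

# Hypothetical validation sample: column names and label values are assumptions.
sample = pd.DataFrame({
    "llm_label": ["negative", "neutral", "positive", "negative", "neutral"],
    "manual_label": ["negative", "neutral", "negative", "negative", "neutral"],
})

# Fraction of rows where the reviewer agreed with the LLM.
agreement = (sample["llm_label"] == sample["manual_label"]).mean()
print(f"Agreement: {agreement:.0%}")
```

A per-class breakdown (for example with `pd.crosstab`) would show whether the disagreements cluster in one label, which a single percentage hides.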

# 1. Imports

In [None]:
import os
import asyncio
import re
import logging
from datetime import datetime
import pytz
import pandas as pd
from dotenv import load_dotenv
#import tf-keras package
import tensorflow as tf
from tensorflow import keras
import spacy
from spacy.lang.en import English
import asyncpraw
import spacy_transformers
import spacy_curated_transformers
import nest_asyncio
import matplotlib.pyplot as plt
from langdetect import detect, LangDetectException
from datasets import Dataset
import textwrap
import ipywidgets as widgets
from IPython.display import display, clear_output
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from transformers import pipeline
from asyncprawcore.exceptions import TooManyRequests, Forbidden, NotFound

# 2. Data Collection

In [80]:
#Configure logger
logging.basicConfig(
    level=logging.INFO,
    format='[%(levelname)s] %(asctime)s - %(name)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)
data_collection_logger = logging.getLogger(__name__)

nest_asyncio.apply()
#Load environment variables (This is for Reddit account details)
load_dotenv(override=True)

#Set user timezone to America/New_York, this is because a majority of reddit users are from the US (https://explodingtopics.com/blog/reddit-users)
USER_TIMEZONE = pytz.timezone("America/New_York")

In [3]:
#Create a reddit client to make API calls
def create_reddit_client():
    """
    Validates and creates the asyncpraw Reddit client using environment variables.
    Raises ValueError if a required variable is missing.
    """
    
    required_vars = {
        "client_id":os.getenv("REDDIT_CLIENT_ID"),
        "client_secret":os.getenv("REDDIT_CLIENT_SECRET"),
        "user_agent":os.getenv("REDDIT_USER_AGENT"),
        "username":os.getenv("REDDIT_USERNAME"),
        "password":os.getenv("REDDIT_PASSWORD"),
    }
    for var_name, var_value in required_vars.items():
        if not var_value:
            raise ValueError(f"Missing environment variable: {var_name}")
    return asyncpraw.Reddit(**required_vars)

# Instantiate the Reddit client
reddit = create_reddit_client()

In [4]:
#Helper function: Cleans text by removing any URLs and surrounding whitespace
def clean_text(text: str) -> str:
    """
    Removes URLs from the text and strips surrounding whitespace.
    """
    text = re.sub(r'http\S+|www\S+', '', text)
    return text.strip()
#Boolean function for filtering out posts that are not in english
def is_english(text: str) -> bool:
    """
    Returns True if the language of the text is English.
    Skips detection for texts shorter than 10 characters to avoid errors.
    """
    if len(text.strip()) < 10:
        return False
    try:
        return detect(text) == 'en'
    except LangDetectException:
        return False
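
To make the behaviour of `clean_text` concrete, here is a small standalone check. The function body is repeated so the snippet runs on its own; the sample string and the `collapse_whitespace` helper are illustrative additions, not part of the original pipeline.

```python
import re

def clean_text(text: str) -> str:
    """Same logic as the helper above, repeated so this snippet is self-contained."""
    text = re.sub(r'http\S+|www\S+', '', text)
    return text.strip()

def collapse_whitespace(text: str) -> str:
    """Hypothetical extra step: squeeze runs of whitespace left behind by removed URLs."""
    return re.sub(r'\s+', ' ', text).strip()

sample = "  Fixed it with this guide: https://example.com/fix and www.example.org  "

cleaned = clean_text(sample)
print(repr(cleaned))                       # URL removal leaves a double space behind
print(repr(collapse_whitespace(cleaned)))  # optional normalisation
```

Note that the pattern removes any token starting with `http` or `www`, so ordinary words with those prefixes would also be dropped.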


In [5]:
# We often deal with rate limits by asyncpraw, since we are collecting so much data, so we use this to wait until the rate limit timeout is reset.
async def handle_rate_limit(generator, max_retries=3):
    """
    Wraps an async generator to handle Reddit rate limits by retrying.
    Explicitly catches TooManyRequests (429) and sleeps before retrying.
    """
    retries = 0
    while True:
        try:
            async for item in generator:
                yield item
            break
        except TooManyRequests as e:
            if retries < max_retries:
                retries += 1
                wait = 5 * retries
                data_collection_logger.warning(
                    "[handle_rate_limit] TooManyRequests. Retrying in %s seconds... (Attempt %s/%s)",
                    wait, retries, max_retries
                )
                await asyncio.sleep(wait)
            else:
                data_collection_logger.error("[handle_rate_limit] Exceeded max_retries. Raising exception.")
                raise
        except Exception as e:
            data_collection_logger.error("[handle_rate_limit] Unknown exception: %s", e)
            raise

# Function to fetch, retries if there is an error.
async def fetch_with_retry(coro_func, item_desc: str = "unspecified", max_tries=3):
    """
    Calls a single async function (coro_func) and retries if an exception is raised.
    Specifically checks for TooManyRequests to respect Retry-After.
    Raises the last exception after all retries fail.
    """
    tries = 0
    last_exception = None

    while tries < max_tries:
        try:
            return await coro_func()  # Call the coroutine function to get a new coroutine each time
        except TooManyRequests as e:
            # If Reddit says "rate limit," honor the Retry-After header if possible
            if e.response and "Retry-After" in e.response.headers:
                wait_time = int(e.response.headers["Retry-After"])
            else:
                wait_time = 30

            tries += 1
            last_exception = e
            data_collection_logger.warning(
                "[fetch_with_retry] 429 TooManyRequests while %s. Retry after %s seconds "
                "(Attempt %s/%s)",
                item_desc, wait_time, tries, max_tries
            )

            # If we haven’t used up our attempts, wait and try again
            if tries < max_tries:
                await asyncio.sleep(wait_time)

        except Exception as ex:
            # For any other error, we also want to do multiple attempts,
            # so we do not immediately raise. We track the exception and move on.
            tries += 1
            last_exception = ex
            data_collection_logger.exception(
                "[fetch_with_retry] Error while %s (attempt %s/%s):",
                item_desc, tries, max_tries
            )

            # Optionally add a small sleep before retrying (to avoid immediate repeated failures)
            if tries < max_tries:
                await asyncio.sleep(5)

    # If we get here, we tried max_tries times and failed every time.
    data_collection_logger.error(
        "[fetch_with_retry] Gave up on %s after %s attempts. Last error was: %s",
        item_desc, max_tries, last_exception
    )
    raise last_exception  # Raise the last exception to propagate the error


In [6]:
def create_data_container():
    """
    Returns a dictionary for storing post/comment data in a structured way (this is for converting it into a DataFrame after).
    """
    return {
        'text': [],
        'title': [],
        'upvotes': [],
        'type': [],
        'date': [],
        'post_flair': [],
        'user_flair': [],
        'parent_text': [],
        'subreddit': [],
        'category': [],
    }
def localize_timestamp(utc_timestamp: float) -> str:
    """
    Converts a UTC timestamp (float) to a local time string in ISO format (see USER_TIMEZONE above for the reason).
    """
    # Build a timezone-aware UTC datetime directly; datetime.utcfromtimestamp is deprecated since Python 3.12.
    utc_dt = datetime.fromtimestamp(utc_timestamp, tz=pytz.utc)
    local_dt = utc_dt.astimezone(USER_TIMEZONE)
    return local_dt.isoformat()


In [7]:
async def process_submission(submission, category, data):
    """
    Loads a submission, checks if it's English, and stores its post information.
    Then calls process_comments to handle all related comments.
    """
    try:
        await fetch_with_retry(
            submission.load,  # Pass the method without calling it
            f"loading submission {submission.id}"
        )

        post_date = localize_timestamp(submission.created_utc)
        post_title = submission.title or ""
        post_body = submission.selftext or ""
        combined_text = f"{post_title}\n{post_body}".strip()
        cleaned_combined = clean_text(combined_text)

        if not is_english(cleaned_combined):
            return

        data['title'].append(clean_text(post_title))
        data['text'].append(clean_text(post_body))
        data['upvotes'].append(submission.score)
        data['type'].append("post")
        data['date'].append(post_date)
        data['post_flair'].append(submission.link_flair_text or None)
        data['user_flair'].append(submission.author_flair_text or None)
        data['parent_text'].append(None)
        data['subreddit'].append(submission.subreddit.display_name)
        data['category'].append(category)

        await process_comments(submission, category, data)

    except Exception as e:
        data_collection_logger.error("[process_submission] Error processing submission %s: %s", submission.id, e)
        
async def process_comments(submission, category, data):
    """
    Retrieves and processes only the top-level comments for a submission (in English).
    Since top-level comments are returned directly by submission.comments,
    we remove the "MoreComments" objects without expanding them.
    """
    # Remove any "more" objects so that submission.comments only contains actual comments.
    await fetch_with_retry(
        lambda: submission.comments.replace_more(limit=0),
        f"removing more objects for submission {submission.id}"
    )

    # Iterate only over top-level comments (do not flatten the entire comment tree)
    for comment in submission.comments:
        # Skip if the object does not have a 'body' (e.g., if still a MoreComments object)
        if not hasattr(comment, "body"):
            continue

        comment_text = clean_text(comment.body)
        if not is_english(comment_text):
            continue

        comment_date = localize_timestamp(comment.created_utc)

        # For top-level comments, the parent is the submission itself.
        # You could leave parent_text as None or set it to the post's title/body if desired.
        parent_text = None

        data['title'].append(None)
        data['text'].append(comment_text)
        data['upvotes'].append(comment.score)
        data['type'].append("comment")
        data['date'].append(comment_date)
        data['post_flair'].append(None)  # Only posts have link flair
        data['user_flair'].append(comment.author_flair_text or None)
        data['parent_text'].append(parent_text)
        data['subreddit'].append(submission.subreddit.display_name)
        data['category'].append(category)



In [8]:
async def search_subreddits(search_term="Samsung", limit=50):
    """
    Searches for subreddits related to the given search_term using Reddit's search.
    """
    found = set()
    data_collection_logger.info("[search_subreddits] Searching subreddits for '%s' (limit=%s)...", search_term, limit)
    async for sub in handle_rate_limit(reddit.subreddits.search(search_term, limit=limit)):
        found.add(sub.display_name)
    return found

async def search_posts(search_term="Samsung", limit=100):
    """
    Searches across r/all for posts mentioning the search_term,
    returning additional subreddits where it’s discussed.
    """
    found = set()
    data_collection_logger.info("[search_posts] Searching r/all for '%s' (limit=%s)...", search_term, limit)
    subreddit_all = await reddit.subreddit("all")
    async for post in handle_rate_limit(subreddit_all.search(search_term, limit=limit, sort='date')):
        found.add(post.subreddit.display_name)
    return found

async def fetch_subreddit_content(subreddits, search_term="Samsung", posts_per_sub=100):
    """
    Iterates over a list of subreddits and fetches posts/comments for each in multiple categories.
    Retries if a TooManyRequests (429) is encountered, up to max_tries times.
    """
    data = create_data_container()

    for sub_name in subreddits:
        data_collection_logger.info("\n[fetch_subreddit_content] Processing subreddit: %s", sub_name)
        tries = 0
        max_tries = 3

        while tries < max_tries:
            try:
                subreddit = await reddit.subreddit(sub_name)

                # 1) 'queried' search for the term
                data_collection_logger.info("[fetch_subreddit_content]  -> Stage: queried for '%s'", search_term)
                async for submission in handle_rate_limit(
                    subreddit.search(search_term, limit=posts_per_sub)
                ):
                    await process_submission(submission, "queried", data)

                # 2) Known categories (hot, controversial, top, new, rising)
                for category in ['hot', 'controversial', 'top', 'new', 'rising']:
                    data_collection_logger.info("[fetch_subreddit_content]  -> Stage: %s", category)
                    method = getattr(subreddit, category)
                    async for submission in handle_rate_limit(method(limit=posts_per_sub)):
                        await process_submission(submission, category, data)

                data_collection_logger.info("[fetch_subreddit_content] Finished processing subreddit: %s", sub_name)
                break  # Successfully processed, exit retry loop

            except TooManyRequests as e:
                # Check Retry-After header or wait 30 seconds
                if e.response and "Retry-After" in e.response.headers:
                    wait_time = int(e.response.headers["Retry-After"])
                else:
                    wait_time = 30

                tries += 1
                data_collection_logger.warning(
                    "[fetch_subreddit_content] 429 TooManyRequests for %s. "
                    "Retry-After: %s seconds. Attempt %s/%s",
                    sub_name, wait_time, tries, max_tries
                )
                await asyncio.sleep(wait_time)

            except Forbidden:
                data_collection_logger.warning("[fetch_subreddit_content] Subreddit '%s' is private or banned. Skipping.", sub_name)
                break

            except NotFound:
                data_collection_logger.warning("[fetch_subreddit_content] Subreddit '%s' does not exist (NotFound). Skipping.", sub_name)
                break

            except Exception as e:
                data_collection_logger.error("[fetch_subreddit_content] Error processing '%s': %s", sub_name, e)
                break  # Stop retrying for this subreddit

    return data


In [14]:
async def DataCollectionPipeline():
    try:
        # Define the companies you wish to search for
        companies = ["Samsung", "Apple"]

        for company in companies:
            data_collection_logger.info("[main] Starting process for %s...", company)

            # 1) Gather subreddits via subreddit search
            data_collection_logger.info("[main] Gathering subreddits from search for '%s'...", company)
            subs_from_search = await search_subreddits(search_term=company)

            # 2) Gather subreddits via posts search in r/all
            data_collection_logger.info("[main] Gathering subreddits from posts for '%s'...", company)
            subs_from_posts = await search_posts(search_term=company)

            # Combine all unique subreddits
            all_subs = subs_from_search.union(subs_from_posts)
            all_subs = list(all_subs)
            data_collection_logger.info("[main] Found %d unique subreddits for '%s'.", len(all_subs), company)

            # If no subreddits found, skip to next company
            if not all_subs:
                data_collection_logger.info("[main] No subreddits found for '%s'. Skipping.", company)
                continue

            # 3) Fetch posts and comments from all subreddits
            data_collection_logger.info("[main] Starting data fetch for '%s'...", company)
            dataset = await fetch_subreddit_content(all_subs, search_term=company)

            # 4) Build DataFrame and export CSV
            data_collection_logger.info("[main] Data fetch complete for '%s'. Building DataFrame...", company)
            df = pd.DataFrame(dataset)
            output_file = f'company_reputation_data_{company.lower()}.csv'
            data_collection_logger.info("[main] Saving DataFrame to %s...", output_file)
            df.to_csv(output_file, index=False)
            data_collection_logger.info("[main] Dataset for '%s' saved with %d entries. Done!", company, len(df))

        await reddit.close()

    except KeyboardInterrupt:
        await reddit.close()
        data_collection_logger.info("[main] Keyboard interrupt detected. Exiting.")


In [15]:
"""The below line will run the data collection pipeline, 
we initially ran it as a python script, so the logs aren't present,
but we included a short snippet of the process to run"""
asyncio.run(DataCollectionPipeline())

[INFO] 2025-02-27 23:54:16 - __main__ - [main] Starting process for Samsung...
[INFO] 2025-02-27 23:54:16 - __main__ - [main] Gathering subreddits from search for 'Samsung'...
[INFO] 2025-02-27 23:54:16 - __main__ - [search_subreddits] Searching subreddits for 'Samsung' (limit=50)...
[INFO] 2025-02-27 23:54:18 - __main__ - [main] Gathering subreddits from posts for 'Samsung'...
[INFO] 2025-02-27 23:54:18 - __main__ - [search_posts] Searching r/all for 'Samsung' (limit=100)...
[INFO] 2025-02-27 23:54:18 - __main__ - [main] Found 50 unique subreddits for 'Samsung'.
[INFO] 2025-02-27 23:54:18 - __main__ - [main] Starting data fetch for 'Samsung'...
[INFO] 2025-02-27 23:54:18 - __main__ - 
[fetch_subreddit_content] Processing subreddit: SamanthaSamsungR34
[INFO] 2025-02-27 23:54:18 - __main__ - [fetch_subreddit_content]  -> Stage: queried for 'Samsung'
[INFO] 2025-02-27 23:54:18 - __main__ - [fetch_subreddit_content]  -> Stage: hot
[INFO] 2025-02-27 23:54:22 - __main__ - [fetch_subreddit_c

KeyboardInterrupt: 

# 3. Data analysis, selection, and labeling

## 3.1 Data analysis (Posts)

### 3.1.1 Samsung Posts

In [111]:
# Load post data.
samsung_posts = pd.read_csv(os.getenv("INITIAL_SAMSUNG_POSTS_INPUT_PATH"))

In [112]:
samsung_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39992 entries, 0 to 39991
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   text         20595 non-null  object 
 1   title        39991 non-null  object 
 2   upvotes      39992 non-null  int64  
 3   type         39992 non-null  object 
 4   date         39992 non-null  object 
 5   post_flair   19258 non-null  object 
 6   user_flair   3530 non-null   object 
 7   parent_text  0 non-null      float64
 8   subreddit    39992 non-null  object 
 9   category     39992 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 3.1+ MB


Based on this summary, only about 20,000 of the roughly 40,000 posts have body text. The rest are probably image or link posts, posts where the user did not add a body, or posts in subreddits whose rules do not permit body text.
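
The share of posts with a body can also be measured directly instead of being read off `info()`. The sketch below runs on a toy frame with made-up values; treating empty strings as missing is an assumption for illustration. For the Samsung frame above, 20595 non-null bodies out of 39992 rows is about 51%.

```python
import pandas as pd

# Toy frame: None marks a missing body, "" an empty one (values are made up).
toy_posts = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "text": ["body", None, "", None],
})

# A post "has a body" if the text is present and not just whitespace.
has_body = toy_posts["text"].notna() & (toy_posts["text"].str.strip() != "")
print(f"{has_body.mean():.0%} of posts have body text")
```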

In [113]:
samsung_posts["subreddit"].unique()

array(['SamsungTV', 'GalaxyTab', 'pcmasterrace', 'NoStupidQuestions',
       'SamsungSupport', 'Samsung_GoodLock', 'SamsungGirlSFW',
       'Wellthatsucks', 'shitposting', 'SamsungSam69', 'funny',
       'LifeProTips', 'blender', 'interestingasfuck', 'videos',
       'PickAnAndroidForMe', 'YouShouldKnow', 'TeslaLounge', 'technology',
       'NintendoSwitch', 'simpsonsshitposting', 'GalaxyWatch',
       'todayilearned', 'news', 'SamanthaSamsungR34', 'SamsungThemes',
       'SamsungWallpapers', 'gaming', 'Monitors', 'samsunggalaxy',
       'hardwareswap', 'IndiaTech', 'BeAmazed', 'wallstreetbets',
       'SamsungS21Ultra', 'ShittyDesign', 'SamsungHelp', 'samsung',
       'SamsungS23', 'buildapcsales', 'GalaxyS24Ultra', 'GalaxyA55',
       'SamsungZFold6', 'apple', 'chromeos', 'GalaxyS25', 'Soundbars',
       'CyberStuck', 'GalaxyFold', '4kTV', 'Android', 'oneui',
       'bapcsalescanada', 'SamsungDex', 'assholedesign', 'memes',
       'CrappyDesign', 'SamsungNote9', 'SamsungPay', 'therew

In this list of unique Samsung subreddits, we notice a few that are potentially inappropriate:
1) 'SamsungGirlSFW', 'SamsungSam69', 'SamanthaSamsungR34' \
We will inspect the posts from these subreddits and decide whether to remove them.

In [114]:
samsung_posts[samsung_posts["subreddit"] == "SamanthaSamsungR34"]

Unnamed: 0,text,title,upvotes,type,date,post_flair,user_flair,parent_text,subreddit,category
10884,,Super hot r34 of Samsung girl!!!111 (real you ...,3,post,2023-11-12T01:18:09-05:00,,,,SamanthaSamsungR34,queried
10885,I’ve searched everywhere but can’t find the sp...,Bespoke dishwasher door size?,1,post,2025-01-24T12:07:02-05:00,,,,SamanthaSamsungR34,queried
10886,I’ve searched everywhere but can’t find the sp...,Bespoke dishwasher door size?,1,post,2025-01-24T12:07:02-05:00,,,,SamanthaSamsungR34,hot
10887,,Super hot r34 of Samsung girl!!!111 (real you ...,3,post,2023-11-12T01:18:09-05:00,,,,SamanthaSamsungR34,hot
10888,,Super hot r34 of Samsung girl!!!111 (real you ...,3,post,2023-11-12T01:18:09-05:00,,,,SamanthaSamsungR34,controversial
10889,I’ve searched everywhere but can’t find the sp...,Bespoke dishwasher door size?,1,post,2025-01-24T12:07:02-05:00,,,,SamanthaSamsungR34,controversial
10890,,Super hot r34 of Samsung girl!!!111 (real you ...,2,post,2023-11-12T01:18:09-05:00,,,,SamanthaSamsungR34,top
10891,I’ve searched everywhere but can’t find the sp...,Bespoke dishwasher door size?,1,post,2025-01-24T12:07:02-05:00,,,,SamanthaSamsungR34,top
10892,I’ve searched everywhere but can’t find the sp...,Bespoke dishwasher door size?,1,post,2025-01-24T12:07:02-05:00,,,,SamanthaSamsungR34,new
10893,,Super hot r34 of Samsung girl!!!111 (real you ...,3,post,2023-11-12T01:18:09-05:00,,,,SamanthaSamsungR34,new


From the above example, it is clear that we do not want subreddits like these to affect our company reputation analysis, so we will remove the posts from subreddits: 'SamsungGirlSFW', 'SamsungSam69', 'SamanthaSamsungR34'.

In [115]:
samsung_posts = samsung_posts.query('subreddit != ["SamsungGirlSFW","SamsungSam69", "SamanthaSamsungR34"]')
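
The same kind of exclusion can be written with a boolean mask and `isin`, which some readers find easier to scan than a query string. This is an alternative sketch on a toy frame, not the code that was run; the `excluded` list simply repeats the three subreddit names above.

```python
import pandas as pd

# Toy frame standing in for samsung_posts (values are made up).
toy_posts = pd.DataFrame({
    "subreddit": ["samsung", "SamsungSam69", "GalaxyS25", "SamanthaSamsungR34"],
    "title": ["t1", "t2", "t3", "t4"],
})

excluded = ["SamsungGirlSFW", "SamsungSam69", "SamanthaSamsungR34"]

# Keep every row whose subreddit is not in the exclusion list.
kept = toy_posts[~toy_posts["subreddit"].isin(excluded)]
print(kept["subreddit"].tolist())
```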

In [116]:
samsung_posts["subreddit"].unique()

array(['SamsungTV', 'GalaxyTab', 'pcmasterrace', 'NoStupidQuestions',
       'SamsungSupport', 'Samsung_GoodLock', 'Wellthatsucks',
       'shitposting', 'funny', 'LifeProTips', 'blender',
       'interestingasfuck', 'videos', 'PickAnAndroidForMe',
       'YouShouldKnow', 'TeslaLounge', 'technology', 'NintendoSwitch',
       'simpsonsshitposting', 'GalaxyWatch', 'todayilearned', 'news',
       'SamsungThemes', 'SamsungWallpapers', 'gaming', 'Monitors',
       'samsunggalaxy', 'hardwareswap', 'IndiaTech', 'BeAmazed',
       'wallstreetbets', 'SamsungS21Ultra', 'ShittyDesign', 'SamsungHelp',
       'samsung', 'SamsungS23', 'buildapcsales', 'GalaxyS24Ultra',
       'GalaxyA55', 'SamsungZFold6', 'apple', 'chromeos', 'GalaxyS25',
       'Soundbars', 'CyberStuck', 'GalaxyFold', '4kTV', 'Android',
       'oneui', 'bapcsalescanada', 'SamsungDex', 'assholedesign', 'memes',
       'CrappyDesign', 'SamsungNote9', 'SamsungPay', 'therewasanattempt',
       'ProgrammerHumor', 'mildlyinteresting', 'A

In [117]:
samsung_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39668 entries, 0 to 39991
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   text         20490 non-null  object 
 1   title        39667 non-null  object 
 2   upvotes      39668 non-null  int64  
 3   type         39668 non-null  object 
 4   date         39668 non-null  object 
 5   post_flair   19258 non-null  object 
 6   user_flair   3530 non-null   object 
 7   parent_text  0 non-null      float64
 8   subreddit    39668 non-null  object 
 9   category     39668 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 3.3+ MB


### 3.1.2 Apple Posts

In [118]:
#Load data
apple_posts = pd.read_csv(os.getenv("INITIAL_APPLE_POSTS_INPUT_PATH"))

In [119]:
apple_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43533 entries, 0 to 43532
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   text         18704 non-null  object 
 1   title        43533 non-null  object 
 2   upvotes      43533 non-null  int64  
 3   type         43533 non-null  object 
 4   date         43533 non-null  object 
 5   post_flair   21415 non-null  object 
 6   user_flair   4704 non-null   object 
 7   parent_text  0 non-null      float64
 8   subreddit    43533 non-null  object 
 9   category     43533 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 3.3+ MB


In [120]:
apple_posts["subreddit"].unique()

array(['pcmasterrace', 'AppleWatchFitness', 'nextfuckinglevel',
       'NoStupidQuestions', 'AmITheBadApple', 'Wellthatsucks', 'Baking',
       'funny', 'notinteresting', 'oddlyspecific', 'applehelp',
       'television', 'interestingasfuck', 'interesting',
       'clevercomebacks', 'technology', 'AppleVisionPro', 'mylittlepony',
       'popculturechat', 'DarkMatteronAppleTV', 'todayilearned',
       'indiasocial', 'fruit', 'SipsTea', 'gaming', 'Apple_Internal',
       'crochet', 'bonehurtingjuice', 'FuckImOld', 'RandomThoughts',
       'BeAmazed', 'Showerthoughts', 'wallstreetbets', 'Anticonsumption',
       'jailbreak', 'AppleIntelligenceFail', 'mac', 'oddlysatisfying',
       'FuckApple', 'apple', 'NonPoliticalTwitter', 'AppleEnthusiasts',
       'MurderedByWords', 'torties', 'applewatchfaces', 'AppleCard',
       'privacy', 'macbook', 'Garmin', 'AppleArcade', 'youseeingthisshit',
       'memes', 'ios', 'europe', 'ipad', 'ApplePhotos',
       'BlackPeopleTwitter', 'FoodPorn', 'there

In this list of unique Apple subreddits, the first thing we notice is that many more hate subreddits are present:\
'applesucks', 'FuckApple', 'AppleIntelligenceFail'.

We also notice subreddits that are probably about the fruit, or about something else called "apple", rather than the company:
- 'crochet' (crocheted apples)
- 'Baking' (apple pies)
- 'torties' (a subreddit about tortoiseshell cats)
- 'crafts'
- 'pics' (probably pictures of apple pies, etc.)
- 'BoneAppleTea' (a subreddit based on wordplay)
- 'FionaApple' (a singer)
- 'food' (apple as a fruit)
- 'FoodPorn' (probably apple pie)
- 'fruit' (definitely about apple as a fruit)
- 'mylittlepony' (the cartoon has a character called AppleJack)
- 'AmITheBadApple' (a subreddit where people ask others to judge the morality of an action or decision; almost every post asks "Am I the bad apple?")
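
One way to act on this observation is to combine a subreddit blocklist with a keyword check on the post text. The sketch below is only an illustration under stated assumptions: `OFF_TOPIC_SUBREDDITS`, `PRODUCT_PATTERN`, and the toy rows are hypothetical, and the source does not say how relevancy filtering was actually done.

```python
import pandas as pd

# Hypothetical blocklist drawn from the subreddits discussed above.
OFF_TOPIC_SUBREDDITS = {"fruit", "Baking", "crochet", "torties", "FionaApple", "AmITheBadApple"}

# Hypothetical product keywords; word boundaries stop "ios" matching inside "curiosity".
PRODUCT_PATTERN = r"\b(?:iphone|ipad|macbook|ios|airpods|apple watch)\b"

# Toy rows (values are made up).
toy_posts = pd.DataFrame({
    "subreddit": ["apple", "fruit", "funny", "Baking"],
    "title": [
        "iPhone battery drains fast",
        "Found this inside my apple",
        "My curiosity got me",
        "Apple pie with my iPad recipe app",
    ],
    "text": [None, "Can't find anything like it", None, None],
})

# Combine title and body, treating missing values as empty strings.
combined = (toy_posts["title"].fillna("") + " " + toy_posts["text"].fillna("")).str.lower()

mentions_product = combined.str.contains(PRODUCT_PATTERN, regex=True)
on_topic_subreddit = ~toy_posts["subreddit"].isin(OFF_TOPIC_SUBREDDITS)

relevant = toy_posts[on_topic_subreddit & mentions_product]
print(relevant["title"].tolist())
```

A blocklist alone would still let through fruit-related posts in general subreddits such as 'pics', which is why the keyword check is applied as well. Hate subreddits such as 'FuckApple' are deliberately not on the blocklist, since they are about the company.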

#### Apple Hate Posts

In [121]:
apple_posts[apple_posts['subreddit'] == "FuckApple"]

Unnamed: 0,text,title,upvotes,type,date,post_flair,user_flair,parent_text,subreddit,category
18209,,Later: apple will just sell you the idea of ow...,155,post,2021-05-06T17:10:35-04:00,,,,FuckApple,queried
18210,So I’ve used Apple products ever since the iPh...,Apple is the single most useless company I hav...,79,post,2023-07-25T19:41:32-04:00,,,,FuckApple,queried
18211,"Basically that, my apple id got deactivated an...",Apple ID got deactivated and I lost control of...,28,post,2023-07-29T23:54:46-04:00,,,,FuckApple,queried
18212,,I'm almost 40 years old. I got my first iPhone...,36,post,2023-06-23T18:58:58-04:00,,,,FuckApple,queried
18213,,Does apple still bug your phone so you get a n...,15,post,2023-07-05T14:03:23-04:00,,,,FuckApple,queried
...,...,...,...,...,...,...,...,...,...,...
18716,I broke my screen (potentially other component...,Have to wait 20 days to repair my phone!!?,10,post,2023-06-13T21:48:47-04:00,,,,FuckApple,rising
18717,,oh god… i officially hate you apple,25,post,2023-06-12T12:26:19-04:00,,,,FuckApple,rising
18718,I accidentally clicked on a Duolingo subscript...,Duolingo subscription,6,post,2023-06-12T18:32:58-04:00,,,,FuckApple,rising
18719,Got a 100 dollar apple gift card and the GCA n...,Absolutely unbearable,8,post,2023-06-11T16:14:45-04:00,,,,FuckApple,rising


#### Misrepresented Apple Posts (posts which don't mention the company)

In [122]:
apple_posts[apple_posts['subreddit'] == "fruit"]

Unnamed: 0,text,title,upvotes,type,date,post_flair,user_flair,parent_text,subreddit,category
10753,Can’t find anything like it,Found this inside my apple idk what it is and ...,17362,post,2025-01-02T13:13:23-05:00,Discussion,,,fruit,queried
10754,,What is eating my apples in the kitchen overni...,1988,post,2024-11-23T21:44:53-05:00,Discussion,,,fruit,queried
10755,It’s also red aswell,Holy shit I opened another apple and found the...,5693,post,2025-01-02T14:57:32-05:00,Discussion,,,fruit,queried
10756,,Was eating a apple and felt like it tasted too...,1654,post,2024-12-14T04:45:05-05:00,Edibility / Problem,,,fruit,queried
10757,,Cut open an apple... What is this?,1468,post,2024-10-15T20:31:01-04:00,Discussion,,,fruit,queried
...,...,...,...,...,...,...,...,...,...,...
11238,One of my favorite fruits all time,If you haven’t tried a tangelo yet this season...,18,post,2025-02-01T08:15:18-05:00,Discussion,,,fruit,rising
11239,So uhm fun fact I learned today (not even 5 mi...,Lemons aren't real,99,post,2025-01-31T19:28:12-05:00,Discussion,,,fruit,rising
11240,They ask $13.99 per pound. They are as hard as...,I don't get why my Asian supermarket bothers s...,7,post,2025-02-01T08:36:13-05:00,Discussion,,,fruit,rising
11241,,Watermelon is music,3,post,2025-02-01T11:11:41-05:00,Discussion,,,fruit,rising


In [123]:
apple_posts[apple_posts['subreddit'] == "AmITheBadApple"]

Unnamed: 0,text,title,upvotes,type,date,post_flair,user_flair,parent_text,subreddit,category
2010,(Posting for a friend)My daughter Brynn is 3 y...,Am I the bad apple for bringing my daughter he...,7348,post,2024-08-01T20:55:40-04:00,,,,AmITheBadApple,queried
2011,I(22f) have a neice(5f) we’ll call Lilly. Lill...,Am I the bad apple for getting my neice a doll...,2509,post,2024-04-23T21:52:41-04:00,,,,AmITheBadApple,queried
2012,I (28m) have been with my girlfriend (28f) for...,Am I the bad apple for punching my girlfriend ...,2403,post,2024-02-18T10:40:25-05:00,,,,AmITheBadApple,queried
2013,Am I the bad apple for breastfeeding my baby? ...,Am I the bad apple for breastfeeding?,3367,post,2023-10-27T20:16:25-04:00,,,,AmITheBadApple,queried
2014,"I, 35F, was invited to my good friend's weddin...",Am I the bad apple for not coming to my friend...,1461,post,2024-07-15T03:11:29-04:00,,,,AmITheBadApple,queried
...,...,...,...,...,...,...,...,...,...,...
2517,hello reddit! i just remembered a couple sets ...,AITBA for no longer being able to face my brot...,3,post,2025-01-24T01:37:33-05:00,,,,AmITheBadApple,rising
2518,So I (18M) met Andy (21M) at our College's end...,AITBA forr blowing up at my 'friend' and calli...,5,post,2025-01-23T19:47:31-05:00,,,,AmITheBadApple,rising
2519,I (13f) am in 8th grade. And I'm still not sur...,Am I the bad apple for leaving the classroom w...,738,post,2025-01-21T12:16:34-05:00,,,,AmITheBadApple,rising
2520,Am I the bad apple? I (15F) live with my mom a...,Am I the bad apple for teaming up with my gran...,2,post,2025-01-22T10:10:54-05:00,,,,AmITheBadApple,rising


In [124]:
apple_posts[apple_posts['subreddit'] == "FionaApple"]

Unnamed: 0,text,title,upvotes,type,date,post_flair,user_flair,parent_text,subreddit,category
34967,"A piece i did a few weeks ago, thought you guy...",“Fiona Apple’s When The Pawn…” Cross-stitch by...,1358,post,2024-12-01T21:40:47-05:00,When the Pawn,,,FionaApple,queried
34968,Mine is probably when she says “so much” in Ra...,What part of a Fiona Apple song is this for you?,285,post,2024-12-01T19:21:40-05:00,Fiona Apple Rocks,,,FionaApple,queried
34969,"Mine is this, from Sullen Girl. I relate to a ...",What's the most relatable fiona apple line fro...,294,post,2025-01-14T04:30:41-05:00,Tidal,,,FionaApple,queried
34970,I am glad she took off her music on tiktok bec...,This has been said before but FUCK NEW FIONA A...,494,post,2024-07-13T03:07:59-04:00,Fiona Apple Rocks,,,FionaApple,queried
34971,,Happy 47th birthday to the amazingly gifted Fi...,688,post,2024-09-13T09:26:01-04:00,Fiona Apple Rocks,Tulip in a Cup,,FionaApple,queried
...,...,...,...,...,...,...,...,...,...,...
35440,,Who still has this original Fiona Apple shirt?...,117,post,2025-01-20T20:47:42-05:00,Fiona Apple Rocks,,,FionaApple,rising
35441,someone posted this poster here the other day ...,“when the pawn…” poster,381,post,2025-01-19T17:49:17-05:00,When the Pawn,,,FionaApple,rising
35442,Does anybody have any sheet music for it? I've...,Fast As You Can,9,post,2025-01-20T14:09:49-05:00,When the Pawn,,,FionaApple,rising
35443,,This is crazy,1373,post,2025-01-19T04:49:16-05:00,When the Pawn,,,FionaApple,rising


In [125]:
"""Removing the posts from the subreddits that are NOT related to the company Apple"""
apple_posts = apple_posts.query("subreddit != ['pics','Baking','torties', 'crafts','crochet','BoneAppleTea','FionaApple', 'food', 'FoodPorn','fruit','mylittlepony','AmITheBadApple']")

In [126]:
apple_posts["subreddit"].unique()

array(['pcmasterrace', 'AppleWatchFitness', 'nextfuckinglevel',
       'NoStupidQuestions', 'Wellthatsucks', 'funny', 'notinteresting',
       'oddlyspecific', 'applehelp', 'television', 'interestingasfuck',
       'interesting', 'clevercomebacks', 'technology', 'AppleVisionPro',
       'popculturechat', 'DarkMatteronAppleTV', 'todayilearned',
       'indiasocial', 'SipsTea', 'gaming', 'Apple_Internal',
       'bonehurtingjuice', 'FuckImOld', 'RandomThoughts', 'BeAmazed',
       'Showerthoughts', 'wallstreetbets', 'Anticonsumption', 'jailbreak',
       'AppleIntelligenceFail', 'mac', 'oddlysatisfying', 'FuckApple',
       'apple', 'NonPoliticalTwitter', 'AppleEnthusiasts',
       'MurderedByWords', 'applewatchfaces', 'AppleCard', 'privacy',
       'macbook', 'Garmin', 'AppleArcade', 'youseeingthisshit', 'memes',
       'ios', 'europe', 'ipad', 'ApplePhotos', 'BlackPeopleTwitter',
       'therewasanattempt', 'mildlyinteresting', 'AppleWatch', 'Advice',
       'applewatchultra', 'tvPlus'

In [127]:
#We will remove any rows where the subreddit or text is missing
samsung_posts.dropna(subset=["subreddit", "text" ], inplace=True)
apple_posts.dropna(subset=["subreddit", "text" ], inplace=True)

We will remove duplicate posts based on the text column. We expect many duplicates, since the same post can be collected under both the 'queried' category and the other categories.
We also expect a lot of reposts.
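Before dropping anything, it can help to check how many rows are actually duplicated. The following is a minimal sketch on a made-up frame; `duplicate_report` and `toy_posts` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd


def duplicate_report(df: pd.DataFrame, column: str = "text") -> dict:
    """Count the rows that drop_duplicates(subset=[column]) would remove."""
    # duplicated() marks every repeat after the first occurrence as True
    dup_mask = df.duplicated(subset=[column])
    return {
        "rows": len(df),
        "duplicates": int(dup_mask.sum()),
        "unique": int((~dup_mask).sum()),
    }


# Made-up rows standing in for samsung_posts / apple_posts
toy_posts = pd.DataFrame({
    "text": ["screen broke", "screen broke", "battery drains fast", "screen broke"],
    "category": ["queried", "hot", "queried", "rising"],
})

report = duplicate_report(toy_posts)
print(report)
```

Running the same report on the real frames would show whether the assumption about reposts holds before any rows are removed.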

In [128]:
samsung_posts.drop_duplicates(subset=["text"], inplace=True)
apple_posts.drop_duplicates(subset=["text"], inplace=True)

In [129]:
samsung_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12787 entries, 0 to 39814
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   text         12787 non-null  object 
 1   title        12786 non-null  object 
 2   upvotes      12787 non-null  int64  
 3   type         12787 non-null  object 
 4   date         12787 non-null  object 
 5   post_flair   7070 non-null   object 
 6   user_flair   1153 non-null   object 
 7   parent_text  0 non-null      float64
 8   subreddit    12787 non-null  object 
 9   category     12787 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 1.1+ MB


In [131]:
apple_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10635 entries, 0 to 43400
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   text         10635 non-null  object 
 1   title        10635 non-null  object 
 2   upvotes      10635 non-null  int64  
 3   type         10635 non-null  object 
 4   date         10635 non-null  object 
 5   post_flair   5884 non-null   object 
 6   user_flair   1141 non-null   object 
 7   parent_text  0 non-null      float64
 8   subreddit    10635 non-null  object 
 9   category     10635 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 913.9+ KB


Now we face a challenge: we cannot assume that all posts from all of these subreddits are about Samsung/Apple.
However, we can assume that posts from Samsung/Apple-specific subreddits are about Samsung/Apple.

So, our plan is:
1) From the Samsung/Apple-specific subreddits, we take all the posts.
2) From the non-specific subreddits, we only take posts from the 'queried' category, i.e. posts that specifically mention the keyword Samsung/Apple.
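The two rules can also be written as a single boolean mask. The notebook's own cells express the plan as an outer `pd.merge` of two filtered frames; the sketch below is only an alternative formulation on a made-up frame, and `keep_relevant` is a hypothetical helper name.

```python
import pandas as pd


def keep_relevant(df: pd.DataFrame, company_subreddits, queried_label: str = "queried") -> pd.DataFrame:
    """Keep rows from company-specific subreddits, plus 'queried' rows from anywhere."""
    in_company_sub = df["subreddit"].isin(company_subreddits)  # rule 1
    is_queried = df["category"] == queried_label               # rule 2
    return df[in_company_sub | is_queried]


# Made-up rows: two from a Samsung subreddit, two from a general one
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "subreddit": ["samsung", "samsung", "funny", "funny"],
    "category": ["hot", "queried", "queried", "rising"],
})

relevant = keep_relevant(toy, ["samsung", "GalaxyTab"])
print(relevant["text"].tolist())
```

Only row "d" (general subreddit, not queried) is dropped. Because each row is tested once, a row that satisfies both rules is not repeated in the result.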

In [132]:
samsung_posts["subreddit"].unique()

array(['SamsungTV', 'GalaxyTab', 'pcmasterrace', 'NoStupidQuestions',
       'SamsungSupport', 'Samsung_GoodLock', 'Wellthatsucks',
       'shitposting', 'funny', 'LifeProTips', 'blender',
       'interestingasfuck', 'videos', 'PickAnAndroidForMe',
       'YouShouldKnow', 'TeslaLounge', 'technology', 'NintendoSwitch',
       'simpsonsshitposting', 'GalaxyWatch', 'todayilearned', 'news',
       'SamsungThemes', 'gaming', 'Monitors', 'samsunggalaxy',
       'hardwareswap', 'IndiaTech', 'BeAmazed', 'wallstreetbets',
       'SamsungS21Ultra', 'ShittyDesign', 'SamsungHelp', 'samsung',
       'SamsungS23', 'buildapcsales', 'GalaxyS24Ultra', 'GalaxyA55',
       'SamsungZFold6', 'apple', 'chromeos', 'GalaxyS25', 'Soundbars',
       'CyberStuck', 'GalaxyFold', '4kTV', 'Android', 'oneui',
       'bapcsalescanada', 'SamsungDex', 'assholedesign', 'memes',
       'CrappyDesign', 'SamsungNote9', 'SamsungPay', 'therewasanattempt',
       'ProgrammerHumor', 'mildlyinteresting', 'AndroidQuestions',
   

In [133]:
apple_posts["subreddit"].unique()

array(['pcmasterrace', 'AppleWatchFitness', 'nextfuckinglevel',
       'NoStupidQuestions', 'Wellthatsucks', 'funny', 'notinteresting',
       'oddlyspecific', 'applehelp', 'television', 'interestingasfuck',
       'interesting', 'clevercomebacks', 'technology', 'AppleVisionPro',
       'popculturechat', 'DarkMatteronAppleTV', 'todayilearned',
       'indiasocial', 'SipsTea', 'gaming', 'Apple_Internal',
       'bonehurtingjuice', 'FuckImOld', 'RandomThoughts', 'BeAmazed',
       'Showerthoughts', 'wallstreetbets', 'Anticonsumption', 'jailbreak',
       'AppleIntelligenceFail', 'mac', 'oddlysatisfying', 'FuckApple',
       'apple', 'NonPoliticalTwitter', 'AppleEnthusiasts',
       'MurderedByWords', 'applewatchfaces', 'AppleCard', 'privacy',
       'macbook', 'Garmin', 'AppleArcade', 'youseeingthisshit', 'memes',
       'ios', 'europe', 'ipad', 'ApplePhotos', 'BlackPeopleTwitter',
       'therewasanattempt', 'mildlyinteresting', 'AppleWatch', 'Advice',
       'applewatchultra', 'tvPlus'

In [134]:
samsung_posts_copy = samsung_posts.copy()
samsung_subreddits = ['SamsungTV', 'GalaxyTab', 'SamsungSupport', 'Samsung_GoodLock', 'GalaxyWatch', 'SamsungThemes', 'samsunggalaxy', 'SamsungS21Ultra',  'SamsungHelp', 'samsung',  'SamsungS23',  'GalaxyS24Ultra', 'GalaxyA55', 'SamsungZFold6','GalaxyS25', 'GalaxyFold',  'oneui',  'SamsungDex','SamsungNote9', 'SamsungPay' , 'SamsungElite', 'GalaxyS9', 'SamsungAssistant', 'GalaxyS8', 'samsungnotes', 'GalaxyS23Ultra','SamsungWatchFace','SamsungNote10','galaxys10', 'SamsungS24', 'S24Ultra', 'SamsungS8', 'galaxyzflip']
samsung_posts = pd.merge(samsung_posts[samsung_posts["subreddit"].isin(samsung_subreddits)], samsung_posts[samsung_posts["category"] == 'queried'], how='outer')

In [135]:
apple_posts_copy = apple_posts.copy()
apple_subreddits =[ 'AppleWatchFitness',  'applehelp', 'AppleVisionPro', 'DarkMatteronAppleTV',  'Apple_Internal',  'AppleIntelligenceFail', 'mac', 'FuckApple', 'apple', 'AppleEnthusiasts',  'applewatchfaces', 'AppleCard', 'macbook', 'AppleArcade', 'ios', 'ipad', 'ApplePhotos',    'AppleWatch', 'applewatchultra', 'appletv', 'AppleFitnessPlus', 'Apple_Employees', 'AppleWallet', 'SeveranceAppleTVPlus', 'ApplePencil', 'iphone', 'VintageApple', 'appleswap',  'applesucks', 'appleJournal',  'AppleMusic', 'iOSBeta']
apple_posts = pd.merge(apple_posts[apple_posts["subreddit"].isin(apple_subreddits)], apple_posts[apple_posts["category"] == 'queried'], how='outer')

In [136]:
samsung_posts.drop_duplicates(subset=["text"], inplace=True)
apple_posts.drop_duplicates(subset=["text"], inplace=True)

Now we end up with two 'relevant' post datasets, each trimmed by around 32,000 rows.

In [137]:
samsung_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8237 entries, 0 to 8236
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   text         8237 non-null   object 
 1   title        8236 non-null   object 
 2   upvotes      8237 non-null   int64  
 3   type         8237 non-null   object 
 4   date         8237 non-null   object 
 5   post_flair   4018 non-null   object 
 6   user_flair   693 non-null    object 
 7   parent_text  0 non-null      float64
 8   subreddit    8237 non-null   object 
 9   category     8237 non-null   object 
dtypes: float64(1), int64(1), object(8)
memory usage: 707.9+ KB


In [138]:
apple_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7262 entries, 0 to 7261
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   text         7262 non-null   object 
 1   title        7262 non-null   object 
 2   upvotes      7262 non-null   int64  
 3   type         7262 non-null   object 
 4   date         7262 non-null   object 
 5   post_flair   3939 non-null   object 
 6   user_flair   776 non-null    object 
 7   parent_text  0 non-null      float64
 8   subreddit    7262 non-null   object 
 9   category     7262 non-null   object 
dtypes: float64(1), int64(1), object(8)
memory usage: 624.1+ KB


## 3.2 Data analysis (Comments)

In [232]:
samsung_comments = pd.read_csv(os.getenv("INITIAL_SAMSUNG_COMMENTS_INPUT_PATH"))
apple_comments = pd.read_csv(os.getenv("INITIAL_APPLE_COMMENTS_INPUT_PATH"))

In [233]:
samsung_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093704 entries, 0 to 1093703
Data columns (total 7 columns):
 #   Column      Non-Null Count    Dtype  
---  ------      --------------    -----  
 0   text        1093704 non-null  object 
 1   upvotes     1093694 non-null  float64
 2   type        1093694 non-null  object 
 3   date        1093694 non-null  object 
 4   user_flair  72438 non-null    object 
 5   subreddit   1093694 non-null  object 
 6   category    1093694 non-null  object 
dtypes: float64(1), object(6)
memory usage: 58.4+ MB


In [234]:
apple_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1696967 entries, 0 to 1696966
Data columns (total 7 columns):
 #   Column      Dtype  
---  ------      -----  
 0   text        object 
 1   upvotes     float64
 2   type        object 
 3   date        object 
 4   user_flair  object 
 5   subreddit   object 
 6   category    object 
dtypes: float64(1), object(6)
memory usage: 90.6+ MB


Wow! Almost 3 million comments! Let's trim this down to only relevant ones!

In [235]:
samsung_comments["subreddit"].unique()

array(['SamsungTV', 'GalaxyTab', 'pcmasterrace', nan, 'NoStupidQuestions',
       'SamsungSupport', 'Samsung_GoodLock', 'SamsungGirlSFW',
       'Wellthatsucks', 'shitposting', 'SamsungSam69', 'funny',
       'LifeProTips', 'blender', 'interestingasfuck', 'videos',
       'PickAnAndroidForMe', 'YouShouldKnow', 'TeslaLounge', 'technology',
       'NintendoSwitch', 'simpsonsshitposting', 'GalaxyWatch',
       'todayilearned', 'news', 'SamsungThemes', 'SamsungWallpapers',
       'gaming', 'Monitors', 'samsunggalaxy', 'hardwareswap', 'IndiaTech',
       'BeAmazed', 'wallstreetbets', 'SamsungS21Ultra', 'ShittyDesign',
       'SamsungHelp', 'samsung', 'SamsungS23', 'buildapcsales',
       'GalaxyS24Ultra', 'GalaxyA55', 'SamsungZFold6', 'apple',
       'chromeos', 'GalaxyS25', 'Soundbars', 'CyberStuck', 'GalaxyFold',
       '4kTV', 'Android', 'oneui', 'bapcsalescanada', 'SamsungDex',
       'assholedesign', 'memes', 'CrappyDesign', 'SamsungNote9',
       'SamsungPay', 'therewasanattempt', 'Pr

In [236]:
apple_comments["subreddit"].unique()

array(['pcmasterrace', nan, 'AppleWatchFitness', 'nextfuckinglevel',
       'NoStupidQuestions', 'AmITheBadApple', 'Wellthatsucks', 'Baking',
       'funny', 'notinteresting', 'oddlyspecific', 'applehelp',
       'television', 'interestingasfuck', 'interesting',
       'clevercomebacks', 'technology', 'AppleVisionPro', 'mylittlepony',
       'popculturechat', 'DarkMatteronAppleTV', 'todayilearned',
       'indiasocial', 'fruit', 'SipsTea', 'gaming', 'Apple_Internal',
       'crochet', 'bonehurtingjuice', 'FuckImOld', 'RandomThoughts',
       'BeAmazed', 'Showerthoughts', 'wallstreetbets', 'Anticonsumption',
       'jailbreak', 'AppleIntelligenceFail', 'mac', 'oddlysatisfying',
       'FuckApple', 'apple', 'NonPoliticalTwitter', 'AppleEnthusiasts',
       'MurderedByWords', 'torties', 'applewatchfaces', 'AppleCard',
       'privacy', 'macbook', 'Garmin', 'AppleArcade', 'youseeingthisshit',
       'memes', 'ios', 'europe', 'ipad', 'ApplePhotos',
       'BlackPeopleTwitter', 'FoodPorn', '

We notice that most subreddits are the same as the ones in the posts, except that we now also see a new 'nan' value.

In [237]:
samsung_comments[samsung_comments["subreddit"].isna()]

Unnamed: 0,text,upvotes,type,date,user_flair,subreddit,category
32121,Is that at least 7$ dollars,,,,,,
106875,A) Who gives a shit?,,,,,,
108346,Fun fact: Apparently I look like McLovin'.,,,,,,
268404,Was hoping for lights sabers,,,,,,
327865,(add to title of OP),,,,,,
333127,This thread isn't about dicks?,,,,,,
383263,It's not like the games will get better as tec...,,,,,,
383750,What if I told you,,,,,,
383753,Lol zelda's wearing links clothes,,,,,,
383754,- If u know what I mean,,,,,,


None of these comments seem very relevant, so we can remove them directly.
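To see how much is lost by dropping these rows, the missing values can be counted per column first. This is a small sketch on made-up data; `toy_comments` is a hypothetical stand-in for the real comment frames.

```python
import pandas as pd

# Made-up comments: one row without a subreddit, one without text
toy_comments = pd.DataFrame({
    "text": ["Is that at least 7$ dollars", "Great TV", None],
    "upvotes": [None, 12.0, 3.0],
    "subreddit": [None, "SamsungTV", "samsung"],
})

# Number of missing values in each column
missing_per_column = toy_comments.isna().sum()

# Drop a row if either 'subreddit' or 'text' is missing
cleaned = toy_comments.dropna(subset=["subreddit", "text"])

print(missing_per_column.to_dict())
print(len(cleaned))
```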

In [238]:
samsung_comments.dropna(subset=["subreddit", "text"], inplace=True)
apple_comments.dropna(subset=["subreddit", "text"], inplace=True)

Now we remove comments from irrelevant subreddits, the same way we did for the posts.

In [239]:
"""Removing the comments from the subreddits that are NOT related to the company Apple"""
apple_comments = apple_comments.query("subreddit != ['pics','Baking','torties', 'crafts','crochet','BoneAppleTea','FionaApple', 'food', 'FoodPorn','fruit','mylittlepony','AmITheBadApple']")

In [240]:
samsung_comments = samsung_comments.query('subreddit != ["SamsungGirlSFW","SamsungSam69", "SamanthaSamsungR34"]')

We again expect lots of duplicates, so we remove these too

In [241]:
samsung_comments.drop_duplicates(subset=["text"], inplace=True)
apple_comments.drop_duplicates(subset=["text"], inplace=True)

We now follow the same strategy: we keep all comments from subreddits that are directly about our company, and only the 'queried' ones from subreddits that are not.

In [242]:
samsung_subreddits = ['SamsungTV', 'GalaxyTab', 'SamsungSupport', 'Samsung_GoodLock', 'GalaxyWatch', 'SamsungThemes', 'samsunggalaxy', 'SamsungS21Ultra',  'SamsungHelp', 'samsung',  'SamsungS23',  'GalaxyS24Ultra', 'GalaxyA55', 'SamsungZFold6','GalaxyS25', 'GalaxyFold',  'oneui',  'SamsungDex','SamsungNote9', 'SamsungPay' , 'SamsungElite', 'GalaxyS9', 'SamsungAssistant', 'GalaxyS8', 'samsungnotes', 'GalaxyS23Ultra','SamsungWatchFace','SamsungNote10','galaxys10', 'SamsungS24', 'S24Ultra', 'SamsungS8', 'galaxyzflip']
samsung_comments = pd.merge(samsung_comments[samsung_comments["subreddit"].isin(samsung_subreddits)], samsung_comments[samsung_comments["category"] == 'queried'], how='outer')

In [243]:
apple_subreddits =[ 'AppleWatchFitness',  'applehelp', 'AppleVisionPro', 'DarkMatteronAppleTV',  'Apple_Internal',  'AppleIntelligenceFail', 'mac', 'FuckApple', 'apple', 'AppleEnthusiasts',  'applewatchfaces', 'AppleCard', 'macbook', 'AppleArcade', 'ios', 'ipad', 'ApplePhotos',    'AppleWatch', 'applewatchultra', 'appletv', 'AppleFitnessPlus', 'Apple_Employees', 'AppleWallet', 'SeveranceAppleTVPlus', 'ApplePencil', 'iphone', 'VintageApple', 'appleswap',  'applesucks', 'appleJournal',  'AppleMusic', 'iOSBeta']
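# Filter on the list of Apple-specific subreddits (apple_subreddits), not on the DataFrame itself
apple_comments = pd.merge(apple_comments[apple_comments["subreddit"].isin(apple_subreddits)], apple_comments[apple_comments["category"] == 'queried'], how='outer')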

In [244]:
apple_comments.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354777 entries, 0 to 354776
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   text        354777 non-null  object 
 1   upvotes     354777 non-null  float64
 2   type        354777 non-null  object 
 3   date        354777 non-null  object 
 4   user_flair  29767 non-null   object 
 5   subreddit   354777 non-null  object 
 6   category    354777 non-null  object 
dtypes: float64(1), object(6)
memory usage: 21.7+ MB


In [245]:
samsung_comments.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 289507 entries, 0 to 289506
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   text        289507 non-null  object 
 1   upvotes     289507 non-null  float64
 2   type        289507 non-null  object 
 3   date        289507 non-null  object 
 4   user_flair  30141 non-null   object 
 5   subreddit   289507 non-null  object 
 6   category    289507 non-null  object 
dtypes: float64(1), object(6)
memory usage: 17.7+ MB


Now we are sitting at around 650k entries, which still seems like a lot, so let's try some additional filtering methods.
We could filter out comments with 1 upvote or fewer. These would make up a majority of the comments, and could include comments made by bots or comments that people do not agree with, which could influence our models.
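To compare candidate cut-offs side by side, one could count how many comments survive each threshold. A minimal sketch follows; `rows_above` and the toy upvote values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def rows_above(df: pd.DataFrame, thresholds, column: str = "upvotes") -> dict:
    """Map each threshold to the number of rows with strictly more upvotes."""
    return {t: int((df[column] > t).sum()) for t in thresholds}


# Made-up upvote counts
toy = pd.DataFrame({"upvotes": [-5.0, 0.0, 1.0, 1.0, 2.0, 60.0, 300.0]})

counts = rows_above(toy, [1, 10, 50])
print(counts)
```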

In [246]:
samsung_comments[samsung_comments["upvotes"]>1].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 135005 entries, 17 to 289452
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   text        135005 non-null  object 
 1   upvotes     135005 non-null  float64
 2   type        135005 non-null  object 
 3   date        135005 non-null  object 
 4   user_flair  15572 non-null   object 
 5   subreddit   135005 non-null  object 
 6   category    135005 non-null  object 
dtypes: float64(1), object(6)
memory usage: 8.2+ MB


In [247]:
apple_comments[apple_comments["upvotes"]>1].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 173099 entries, 0 to 354767
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   text        173099 non-null  object 
 1   upvotes     173099 non-null  float64
 2   type        173099 non-null  object 
 3   date        173099 non-null  object 
 4   user_flair  16301 non-null   object 
 5   subreddit   173099 non-null  object 
 6   category    173099 non-null  object 
dtypes: float64(1), object(6)
memory usage: 10.6+ MB


Since there are still too many comments, we will first keep only the comments that mention Samsung or Apple directly. Later, after the analysis, we will filter out comments with few upvotes.
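A plain substring match on "apple" also matches words such as "pineapple". If that turns out to be a problem, a word-boundary regular expression is one possible refinement. The sketch below compares the two on made-up text; the notebook's own cell uses the plain substring match.

```python
import pandas as pd

# Made-up comment texts
toy = pd.DataFrame({"text": [
    "Apple raised prices again",
    "I love pineapple on pizza",
    "my APPLE watch died",
    "Snapple is a drink",
]})

# Plain case-insensitive substring match
substring_hits = toy["text"].str.contains("apple", case=False, regex=False)

# Case-insensitive match on 'apple' as a whole word only
word_hits = toy["text"].str.contains(r"\bapple\b", case=False, regex=True)

print(substring_hits.tolist())
print(word_hits.tolist())
```

The trade-off is that the whole-word pattern no longer matches the plural "apples" or compound names written without a space, so it would need extending if those should count.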

In [248]:
# Keep only comments whose text mentions Samsung or Apple (case-insensitive)
samsung_comments = samsung_comments[samsung_comments["text"].str.contains("Samsung", case=False)]
apple_comments = apple_comments[apple_comments["text"].str.contains("Apple", case=False)]

In [249]:
samsung_comments.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41971 entries, 6 to 289477
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   text        41971 non-null  object 
 1   upvotes     41971 non-null  float64
 2   type        41971 non-null  object 
 3   date        41971 non-null  object 
 4   user_flair  4265 non-null   object 
 5   subreddit   41971 non-null  object 
 6   category    41971 non-null  object 
dtypes: float64(1), object(6)
memory usage: 2.6+ MB


In [250]:
apple_comments.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 65675 entries, 0 to 354775
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   text        65675 non-null  object 
 1   upvotes     65675 non-null  float64
 2   type        65675 non-null  object 
 3   date        65675 non-null  object 
 4   user_flair  5157 non-null   object 
 5   subreddit   65675 non-null  object 
 6   category    65675 non-null  object 
dtypes: float64(1), object(6)
memory usage: 4.0+ MB


Now it's time for some ANALYSIS!!! WOOOOOHOOOOO

In [251]:
#We will now analyze the most popular posts
samsungs_most_upvoted_post = samsung_posts[samsung_posts["upvotes"] == samsung_posts["upvotes"].max()]
print(samsungs_most_upvoted_post.iloc[0]["text"])
print("Upvotes: ", samsungs_most_upvoted_post.iloc[0]["upvotes"])
print("Subreddit: ", samsungs_most_upvoted_post.iloc[0]["subreddit"])

Have a Samsung smart TVs with ads that were annoying as hell. Found out they can be blocked and tried it. It worked!


Edit: WOW! This blew up way more than I expected. I had no idea so many people hated their “Smart TVs”. I’m glad this information was useful to everyone!

Also thank you for all the upvotes, awards and comments. Hopefully this becomes common knowledge and people can take back control of their TVs!

Edit 2: another link you can add to your block list is samsungads.com. Combined with the above link you *should* be entirely ad free.

Edit 3: So A TON of people are asking how to block ads on other TV’s/Devices. Ive compiled a few “How To’s” for LG, ROKU and Fire Stick. Hope this helps everyone struggling with these damn ads!

**LG**: To disable LG ads that appear in "My Content" tab, LG store etc. blacklist/block the following domains on your router: 

ngfts.lge.com

us.ad.lgsmartad.com

lgad.cjpowercast.com

edgesuite.net

us.info.lgsmartad.com


**Roku:** If you go into 

In [252]:
samsungs_most_downvoted_post = samsung_posts[samsung_posts["upvotes"] == samsung_posts["upvotes"].min()]
print(samsungs_most_downvoted_post.iloc[0]["text"])
print("Upvotes: ", samsungs_most_downvoted_post.iloc[0]["upvotes"])
print("Subreddit: ", samsungs_most_downvoted_post.iloc[0]["subreddit"])

UN65TU690TFXZA - Samsung TV continues to have network issues. Has anyone had luck with getting it repaired or getting it actually replaced under warranty?
Upvotes:  0
Subreddit:  SamsungTV


In [253]:
apples_most_upvoted_post = apple_posts[apple_posts["upvotes"] == apple_posts["upvotes"].max()]
print(apples_most_upvoted_post.iloc[0]["text"])
print("Upvotes: ", apples_most_upvoted_post.iloc[0]["upvotes"])
print("Subreddit: ", apples_most_upvoted_post.iloc[0]["subreddit"])

this is a legit apple customer support message exchange
Upvotes:  110348
Subreddit:  mildlyinfuriating


In [254]:
apples_most_downvoted_post = apple_posts[apple_posts["upvotes"] == apple_posts["upvotes"].min()]
print(apples_most_downvoted_post.iloc[0]["text"])
print("Upvotes: ", apples_most_downvoted_post.iloc[0]["upvotes"])
print("Subreddit: ", apples_most_downvoted_post.iloc[0]["subreddit"])

So this was a day I simply went to work and came home. My job is a night shift security officer. I sit most of my 8 hour shift and might get up a couple times and check people in or out. Then I go home and sleep, wake up to eat and then go back to work. In NO world is my activity more than sedentary. 

Here it says I’m burning 3000 calories. I put my information into a tdee calculator and it says my sedentary is 2400. I’ve been eating too much for years apparently just listening to my Apple Watch 

Just for information sake: I’m a 5’8 male at 40 years old and 253lbs. I went into health app and the resting calories says 2500. That seems like what my tdee would be! But the active calories just randomly added on top of everything? Is insane
Upvotes:  0
Subreddit:  AppleWatchFitness


In [255]:
samsung_most_upvoted_comment = samsung_comments[samsung_comments["upvotes"] == samsung_comments["upvotes"].max()]
print(samsung_most_upvoted_comment.iloc[0]["text"])
print("Upvotes:", samsung_most_upvoted_comment.iloc[0]["upvotes"])
print("Subreddit:", samsung_most_upvoted_comment.iloc[0]["subreddit"])

Apple and Samsung *[EDIT: and Google and Xiaomi]* have ~~both~~ *all* stopped including the charger *[EDIT: with their top-end devices]*.

Their excuse is "everyone already has a charger so it's environmentally better to stop including one that'll just get thrown away". 

The real reason is they want to sell you a charger separately.
Upvotes: 15166.0
Subreddit: NoStupidQuestions


In [256]:
samsung_most_downvoted_comment = samsung_comments[samsung_comments["upvotes"] == samsung_comments["upvotes"].min()]
print(samsung_most_downvoted_comment.iloc[0]["text"])
print("Upvotes:", samsung_most_downvoted_comment.iloc[0]["upvotes"])
print("Subreddit:", samsung_most_downvoted_comment.iloc[0]["subreddit"])

Samsung-- the king of cheap capacitors that bring their electronics to an early failure.
Upvotes: -94.0
Subreddit: Android


In [257]:
apple_most_upvoted_comment = apple_comments[apple_comments["upvotes"] == apple_comments["upvotes"].max()]
print(apple_most_upvoted_comment.iloc[0]["text"])
print("Upvotes:", apple_most_upvoted_comment.iloc[0]["upvotes"])
print("Subreddit:", apple_most_upvoted_comment.iloc[0]["subreddit"])

It says Apple is very concerned about bad PR and this sort of thing being revealed. I would be very surprised if they didn't have employees trying to downplay the bad PR here. They have the means and enough data on human behavior to be really good at it too.

Remember that anyone can lie, especially in a place like this. You don't know what opinions are being paid for.

Edit: Beware of people trying to derail the conversation. This is also a tactic to divert your attention. **If people are spamming random stuff to clog the thread just keep scrolling down.**
Upvotes: 24312.0
Subreddit: worldnews


In [258]:
apple_most_downvoted_comment = apple_comments[apple_comments["upvotes"] == apple_comments["upvotes"].min()]
print(apple_most_downvoted_comment.iloc[0]["text"])
print("Upvotes:", apple_most_downvoted_comment.iloc[0]["upvotes"])
print("Subreddit:", apple_most_downvoted_comment.iloc[0]["subreddit"])

I am not sure why people consider this to be a good news Apple will probably raise prices in Europe to recoup all that money back. So effectively EU fined EU citizens who are Apple customers.
Upvotes: -238.0
Subreddit: europe
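
The cells above repeat the same lookup eight times. A small helper could do the same job; the sketch below uses `idxmax`/`idxmin` on a made-up frame, and `extreme_row` is a hypothetical name. It assumes the index labels are unique.

```python
import pandas as pd


def extreme_row(df: pd.DataFrame, column: str = "upvotes", highest: bool = True) -> pd.Series:
    """Return the row with the highest (or lowest) value in `column`."""
    # idxmax/idxmin return the index label of the first extreme value
    label = df[column].idxmax() if highest else df[column].idxmin()
    return df.loc[label]


# Made-up comments
toy = pd.DataFrame({
    "text": ["meh", "great phone", "terrible take"],
    "upvotes": [3.0, 150.0, -20.0],
    "subreddit": ["samsung", "Android", "europe"],
})

top = extreme_row(toy)
bottom = extreme_row(toy, highest=False)
for row in (top, bottom):
    print(row["text"])
    print("Upvotes:", row["upvotes"])
    print("Subreddit:", row["subreddit"])
```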


For model training, it does not make sense to include comments that many people disagree with, so we keep only comments with more than 50 upvotes.

In [259]:
apple_comments = apple_comments[apple_comments["upvotes"]>50]

In [260]:
samsung_comments = samsung_comments[samsung_comments["upvotes"]>50]

One would think that the above indicates that comments mentioning Apple generally have more upvotes, but we also have to take into account that Apple had roughly 600,000 more comments when we started (about 1.70 million versus 1.09 million), which makes this conclusion less valid.
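
One way to make the comparison less sensitive to the starting sizes is to compare the share of comments above the threshold rather than the raw count. This would have to be computed on the frames before the upvote filter is applied. The sketch uses made-up numbers, and `share_above` is a hypothetical helper.

```python
import pandas as pd


def share_above(upvotes: pd.Series, threshold: float = 50) -> float:
    """Fraction of comments with strictly more than `threshold` upvotes."""
    return float((upvotes > threshold).mean())


# Made-up upvote counts: the 'Apple' sample is twice as large
toy_apple = pd.Series([1.0, 2.0, 60.0, 80.0, 5.0, 3.0, 120.0, 0.0])
toy_samsung = pd.Series([1.0, 70.0, 4.0, 90.0])

print(int((toy_apple > 50).sum()), share_above(toy_apple))
print(int((toy_samsung > 50).sum()), share_above(toy_samsung))
```

In this toy case the larger sample has more comments above the threshold (3 versus 2) but a smaller share (0.375 versus 0.5), which is the effect the paragraph above warns about.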

#### NER FILTERING

We can additionally evaluate whether a post or comment directly mentions Apple (or Samsung) as an organization, using spaCy's named entity recognition.

In [10]:
cleaned_apple_posts = pd.read_csv(os.getenv("CLEANED_APPLE_POSTS_INPUT_PATH"))
cleaned_samsung_posts = pd.read_csv(os.getenv("CLEANED_SAMSUNG_POSTS_INPUT_PATH"))
cleaned_apple_comments = pd.read_csv(os.getenv("CLEANED_APPLE_COMMENTS_INPUT_PATH"))
cleaned_samsung_comments = pd.read_csv(os.getenv("CLEANED_SAMSUNG_COMMENTS_INPUT_PATH"))

In [31]:
ner = spacy.load("en_core_web_trf")


In [32]:
def is_apple_org(text):
    doc = ner(text)
    for ent in doc.ents:
        # Check both text match (case-insensitive) and label == "ORG"
        if ent.text.lower() == "apple" and ent.label_ == "ORG":
            return True
    return False

def is_samsung_org(text):
    doc = ner(text)
    for ent in doc.ents:
        # Check both text match (case-insensitive) and label == "ORG"
        if ent.text.lower() == "samsung" and ent.label_ == "ORG":
            return True
    return False

In [33]:
cleaned_apple_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7262 entries, 0 to 7261
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text        7262 non-null   object
 1   title       7262 non-null   object
 2   upvotes     7262 non-null   int64 
 3   type        7262 non-null   object
 4   date        7262 non-null   object
 5   post_flair  3939 non-null   object
 6   user_flair  776 non-null    object
 7   subreddit   7262 non-null   object
 8   category    7262 non-null   object
dtypes: int64(1), object(8)
memory usage: 510.7+ KB


In [34]:
cleaned_samsung_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8237 entries, 0 to 8236
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text        8237 non-null   object
 1   title       8236 non-null   object
 2   upvotes     8237 non-null   int64 
 3   type        8237 non-null   object
 4   date        8237 non-null   object
 5   post_flair  4018 non-null   object
 6   user_flair  693 non-null    object
 7   subreddit   8237 non-null   object
 8   category    8237 non-null   object
dtypes: int64(1), object(8)
memory usage: 579.3+ KB


In [35]:
cleaned_apple_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1962 entries, 0 to 1961
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   text        1962 non-null   object 
 1   upvotes     1962 non-null   float64
 2   type        1962 non-null   object 
 3   date        1962 non-null   object 
 4   user_flair  233 non-null    object 
 5   subreddit   1962 non-null   object 
 6   category    1962 non-null   object 
dtypes: float64(1), object(6)
memory usage: 107.4+ KB


In [36]:
cleaned_samsung_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1962 entries, 0 to 1961
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   text        1962 non-null   object 
 1   upvotes     1962 non-null   float64
 2   type        1962 non-null   object 
 3   date        1962 non-null   object 
 4   user_flair  297 non-null    object 
 5   subreddit   1962 non-null   object 
 6   category    1962 non-null   object 
dtypes: float64(1), object(6)
memory usage: 107.4+ KB


In [37]:
# Flag rows in which the NER model tags the company name as an organisation
cleaned_apple_posts["is_apple_org"] = cleaned_apple_posts["text"].apply(is_apple_org)
cleaned_apple_comments["is_apple_org"] = cleaned_apple_comments["text"].apply(is_apple_org)
cleaned_samsung_posts["is_samsung_org"] = cleaned_samsung_posts["text"].apply(is_samsung_org)
cleaned_samsung_comments["is_samsung_org"] = cleaned_samsung_comments["text"].apply(is_samsung_org)
# Keep only the rows where the flag is True (filter the same frames the flag was added to)
apple_org_posts = cleaned_apple_posts[cleaned_apple_posts["is_apple_org"]]
apple_org_comments = cleaned_apple_comments[cleaned_apple_comments["is_apple_org"]]
samsung_org_posts = cleaned_samsung_posts[cleaned_samsung_posts["is_samsung_org"]]
samsung_org_comments = cleaned_samsung_comments[cleaned_samsung_comments["is_samsung_org"]]

apple_org_posts.info()


Token indices sequence length is longer than the specified maximum sequence length for this model (778 > 512). Running this sequence through the model will result in indexing errors


KeyboardInterrupt: 

In [None]:
apple_org_comments.info()

In [None]:
samsung_org_posts.info()

In [None]:
samsung_org_comments.info()

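To balance the datasets, we trim the Apple data by drawing a uniform random sample with `sample()` (with a fixed `random_state` for reproducibility). A uniform sample should keep roughly the same upvote distribution, which we check by plotting it before and after trimming.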
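The idea behind trimming with a random sample can be sketched without pandas. The snippet below is only an illustration under stated assumptions: the upvote counts are synthetic, `downsample` is a hypothetical helper, and the standard-library `random.Random.sample` stands in for the notebook's `DataFrame.sample(n=..., random_state=...)`.

```python
import random
import statistics

def downsample(values, n, seed=42):
    """Draw n items uniformly at random without replacement (reproducible via seed)."""
    rng = random.Random(seed)
    return rng.sample(values, n)

# Synthetic, right-skewed "upvote" counts standing in for the real data.
rng = random.Random(0)
upvotes = [int(rng.expovariate(1 / 50)) for _ in range(5000)]

trimmed = downsample(upvotes, 2000)

# Compare simple summary statistics before and after trimming.
for name, data in [("before", upvotes), ("after", trimmed)]:
    print(name, len(data), statistics.median(data), round(statistics.mean(data), 1))
```

If the sample is uniform, the median and mean of the trimmed data should stay close to those of the full data, which is what the histograms in the notebook check visually.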

In [None]:
#trim apple dataset randomly to match samsung comments, maintain the same distribution in upvotes
apple_comments_copy = cleaned_apple_comments.sample(n=samsung_comments.shape[0], random_state=42)
#display plot distribution of upvotes before and after trimming

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.hist(apple_comments["upvotes"], bins=50)
plt.title("Apple Comments Upvotes Distribution Before Trimming")
plt.xlabel("Upvotes")
plt.ylabel("Frequency")
plt.subplot(1,2,2)
plt.hist(apple_comments_copy["upvotes"], bins=50)
plt.title("Apple Comments Upvotes Distribution After Trimming")
plt.xlabel("Upvotes")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
#cleaned_apple_comments = apple_comments_copy

In [None]:
# Trim the Apple posts with the same random sampling; n is kept as in the original cell
apple_posts_copy = cleaned_apple_posts.sample(n=samsung_comments.shape[0], random_state=42)

# Plot the upvote distribution of the posts before and after trimming
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.hist(cleaned_apple_posts["upvotes"], bins=50)
plt.title("Apple Posts Upvotes Distribution Before Trimming")
plt.xlabel("Upvotes")
plt.ylabel("Frequency")
plt.subplot(1,2,2)
plt.hist(apple_posts_copy["upvotes"], bins=50)
plt.title("Apple Posts Upvotes Distribution After Trimming")
plt.xlabel("Upvotes")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
#cleaned_apple_posts = apple_posts_copy

In [267]:
samsung_comments.reset_index(drop=True, inplace=True)
apple_comments.reset_index(drop=True, inplace=True)

In [268]:
samsung_comments.to_csv(os.getenv("CLEANED_SAMSUNG_COMMENTS_INPUT_PATH"), index=False)
apple_comments.to_csv(os.getenv("CLEANED_APPLE_COMMENTS_INPUT_PATH"), index=False)

### LLM LABELING OF DATA

In [280]:
import os
import pandas as pd
import requests
from dotenv import load_dotenv

# Load the CSV file paths and the API address from the .env file
load_dotenv(override=True)
input_csv = os.getenv("CLEANED_APPLE_COMMENTS_INPUT_PATH")
print(input_csv)
output_csv = 'data/Apple/qwen2.5-7b-instruct_cleaned_company_reputation_data_apple_comments.csv'

# LMStudio/QWEN API endpoint (based on the curl example)
api_url = os.getenv("LM_STUDIO_ADDRESS")
print(api_url)
def classify_text(subreddit, text, temperature=0):
    """Classify text for sentiment, relevance, and reasoning."""
    prompt_with_examples = """
You are Apple's social media analysis assistant. Analyze comments about Apple. 
For each comment, determine:
1. The sentiment towards Apple ('positive', 'negative', 'neutral', or 'mixed')
2. Whether the comment is relevant to Apple ('relevant' or 'irrelevant')
3. Provide a one-sentence reasoning for your classification

You should rate the sentiment and relevance specifically towards Apple;
if the public sees this comment will their opinion of Apple improve? then it is Positive and Relevant.
If the comment is about Apple but does not express any opinion or sentiment, then it is Neutral.
If the comment is about Apple but expresses a negative opinion or sentiment that may leave a negative light on Apple, then it is Negative.
If the comment contains both positive and negative opinions, it is Mixed.
Is the comment's central focus about Apple? then it is Relevant, otherwise it is Irrelevant.

Your response MUST follow this exact format:
sentiment: [positive/negative/neutral/mixed]
isRelevant: [relevant/irrelevant]
reasoning: [One brief sentence explaining your classifications]

Here are some examples:

Example 1:
subreddit: AppleHelp
text:
As the title suggests, tonight I received an unexpected "Find My Mobile" notification. I have never signed up for, or logged into, such a service. Should I be concerned?

Response:
sentiment: neutral
isRelevant: irrelevant
reasoning: The user is asking about a notification that doesn't have anything inherently to do with Apple and doesn't express positive or negative opinions.

Example 2:
subreddit: Android
text:
I've used Android phones all my life, mostly Samsung devices. Seven months ago, I decided to try the iPhone 15 Pro Max. Right off the bat, I can say there's only one thing I truly loved about it: FaceID... and that's about it. I could go on for an hour listing more reasons why for me, Android is better than iOS. Can't wait to switch back - I'll probably grab the Galaxy S25 when it drops.

Response:
sentiment: negative
isRelevant: relevant
reasoning: The user found little to like about the iPhone beyond FaceID, prefers Android to iOS, and plans to switch back to a Galaxy phone.

Example 3:
subreddit: photography
text:
The moon pictures from Apple are fake. Apple's marketing is deceptive. It is adding detail where there is none (in this experiment, it was intentionally removed).

Response:
sentiment: negative
isRelevant: relevant
reasoning: The comment criticizes Apple's camera technology and marketing as deceptive regarding moon photography features.

Example 4:
subreddit: Apple
text:
The new iPhone has an amazing camera, but the battery life is disappointing.

Response:
sentiment: mixed
isRelevant: relevant
reasoning: The comment expresses both positive (camera) and negative (battery life) sentiments towards Apple.

Example 5:
subreddit: hardware
text:
Rate my setup, I have a Apple monitor, a Logitech keyboard, and a Razer mouse.

Response:
sentiment: neutral
isRelevant: irrelevant
reasoning: The comment simply mentions Apple as part of a list of hardware items, but does not express any sentiment or opinion about Apple, i.e. Apple is not a focus point of this comment. thus it is irrelevant.

Remember that a comment that just mentions Apple is not necessarily relevant to Apple.
"""
    user_content = f"subreddit: {subreddit}\ntext:\n{text}"
    payload = {
        "model": "qwen2.5-7b-instruct",
        "messages": [
            {"role": "system", "content": prompt_with_examples},
            {"role": "user", "content": user_content}
        ],
        "temperature": temperature,
        "max_tokens": -1,
        "stream": False
    }
    failure_count = 0
    try:
        while True:
            # Send the current temperature with every attempt
            payload["temperature"] = temperature
            response = requests.post(api_url, json=payload)
            response.raise_for_status()
            json_data = response.json()
            response_text = json_data['choices'][0]['message']['content'].strip()

            sentiment, isRelevant, reasoning = None, None, None

            # Pull the three expected fields out of the reply, line by line
            for line in response_text.split('\n'):
                line = line.strip()
                if line.startswith('sentiment:'):
                    sentiment = line.split(':', 1)[1].strip().lower()
                elif line.startswith('isRelevant:'):
                    isRelevant = line.split(':', 1)[1].strip().lower()
                elif line.startswith('reasoning:'):
                    reasoning = line.split(':', 1)[1].strip()

            if sentiment in ['positive', 'negative', 'neutral', 'mixed'] and isRelevant in ['relevant', 'irrelevant'] and reasoning:
                return {
                    'sentiment': sentiment,
                    'isRelevant': isRelevant,
                    'reasoning': reasoning
                }

            failure_count += 1
            print("Invalid response format received, retrying...")
            if failure_count > 2:
                # Raise the temperature so that the next attempts can produce a different reply
                temperature = round(temperature + 0.1, 1)
                print(f"Failure count increased, temperature: {temperature}")
                failure_count = 0
    except requests.exceptions.RequestException as e:
        print(f"Error with the API: {e}")
        return None

def process_csv(input_csv, output_csv, checkpoint_every=50):
    """Classify unlabeled rows, resuming from output_csv if it exists and saving periodically."""
    df = pd.read_csv(output_csv) if os.path.exists(output_csv) else pd.read_csv(input_csv)
    print(f"Processing dataset: {input_csv}")

    sentiment_counts = {'positive': 0, 'negative': 0, 'neutral': 0, 'mixed': 0, 'relevant': 0, 'irrelevant': 0}

    # Make sure the label columns exist
    for column in ('sentiment', 'isRelevant', 'reasoning'):
        if column not in df.columns:
            df[column] = None

    temperature = 0
    failure_count = 0

    for index, row in df.iterrows():
        # Skip rows that were already labeled in a previous run
        if pd.notna(row['sentiment']):
            continue
        subreddit = row['subreddit'] if 'subreddit' in row else "Unknown"
        text = row['text']
        result = classify_text(subreddit, text, temperature)

        if result:
            df.at[index, 'sentiment'] = result['sentiment']
            df.at[index, 'isRelevant'] = result['isRelevant']
            df.at[index, 'reasoning'] = result['reasoning']
            sentiment_counts[result['sentiment']] += 1
            sentiment_counts[result['isRelevant']] += 1

            remaining = len(df[df['sentiment'].isnull()])
            print(f"Classified: {index} | positive: {sentiment_counts['positive']} | negative: {sentiment_counts['negative']} | neutral: {sentiment_counts['neutral']} | mixed: {sentiment_counts['mixed']} | Relevant: {sentiment_counts['relevant']} | Irrelevant: {sentiment_counts['irrelevant']} | Remaining: {remaining} | Temperature: {temperature}")
            temperature = 0 # reset temperature after successful classification
            failure_count = 0
            # Checkpoint so that an interrupted run can resume (the interval is an arbitrary choice)
            if index % checkpoint_every == 0:
                df.to_csv(output_csv, index=False)
        else:
            failure_count += 1
            if failure_count > 2:
                temperature += 0.1
                print(f"Failure count increased, temperature: {temperature}")
                failure_count = 0 # reset failure count after increasing temperature

    print(f"Classification Summary: {sentiment_counts}")
    df.to_csv(output_csv, index=False)
    print(f"Updated dataset saved to {output_csv}")


process_csv(input_csv, output_csv)

data/Apple/OriginalCommentsData/cleaned_company_reputation_data_apple_comments.csv
http://192.168.1.112:1234/v1/chat/completions
Processing dataset: data/Apple/OriginalCommentsData/cleaned_company_reputation_data_apple_comments.csv
Classified: 0 | positive: 0 | negative: 0 | neutral: 1 | mixed: 0 | Relevant: 0 | Irrelevant: 1 | Remaining: 1961 | Temperature: 0
Classified: 1 | positive: 0 | negative: 1 | neutral: 1 | mixed: 0 | Relevant: 1 | Irrelevant: 1 | Remaining: 1960 | Temperature: 0
Classified: 2 | positive: 1 | negative: 1 | neutral: 1 | mixed: 0 | Relevant: 2 | Irrelevant: 1 | Remaining: 1959 | Temperature: 0
Classified: 3 | positive: 1 | negative: 2 | neutral: 1 | mixed: 0 | Relevant: 3 | Irrelevant: 1 | Remaining: 1958 | Temperature: 0
Classified: 4 | positive: 1 | negative: 3 | neutral: 1 | mixed: 0 | Relevant: 4 | Irrelevant: 1 | Remaining: 1957 | Temperature: 0
Classified: 5 | positive: 1 | negative: 3 | neutral: 1 | mixed: 1 | Relevant: 5 | Irrelevant: 1 | Remaining: 1956

We use LLMs to classify both sentiment and relevancy, i.e. whether the post or comment is actually about the company in question (Apple or Samsung).
This gave us a further way of filtering out posts and comments.
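The labeling step depends on the model replying in a strict three-line format and on retrying when it does not. The following is a minimal sketch of that parse-and-retry logic, separated from the HTTP call so it can be tried without a running LLM server. The names `parse_response`, `classify_with_retries` and `fake_model`, as well as the `max_attempts` cap, are assumptions made for illustration and are not part of the notebook.

```python
VALID_SENTIMENTS = {"positive", "negative", "neutral", "mixed"}
VALID_RELEVANCE = {"relevant", "irrelevant"}

def parse_response(response_text):
    """Parse 'sentiment: / isRelevant: / reasoning:' lines; return a dict, or None if invalid."""
    fields = {}
    for line in response_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in ("sentiment", "isRelevant", "reasoning"):
            fields[key] = value.strip()
    sentiment = fields.get("sentiment", "").lower()
    relevance = fields.get("isRelevant", "").lower()
    reasoning = fields.get("reasoning", "")
    if sentiment in VALID_SENTIMENTS and relevance in VALID_RELEVANCE and reasoning:
        return {"sentiment": sentiment, "isRelevant": relevance, "reasoning": reasoning}
    return None

def classify_with_retries(ask_model, max_attempts=6, step=0.1, retries_per_step=3):
    """Call ask_model(temperature) until the reply parses, raising the temperature every few failures."""
    temperature = 0.0
    for attempt in range(1, max_attempts + 1):
        parsed = parse_response(ask_model(temperature))
        if parsed is not None:
            return parsed
        if attempt % retries_per_step == 0:
            temperature = round(temperature + step, 2)
    return None  # give up instead of looping forever

# Stand-in for the HTTP call: replies in the wrong format until the temperature is raised.
def fake_model(temperature):
    if temperature < 0.1:
        return "I think this is positive."
    return "sentiment: Negative\nisRelevant: relevant\nreasoning: Criticises battery life."

print(classify_with_retries(fake_model))
```

Capping the number of attempts is a design choice of this sketch: a row that never parses is returned as `None` and can be revisited later rather than blocking the whole run.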

In [314]:
LLM_labelled_apple_posts = pd.read_csv(os.getenv("LLM_LABELED_APPLE_POSTS_INPUT_PATH"))
LLM_labelled_samsung_posts = pd.read_csv(os.getenv("LLM_LABELED_SAMSUNG_POSTS_INPUT_PATH"))
LLM_labelled_apple_comments = pd.read_csv(os.getenv("LLM_LABELED_APPLE_COMMENTS_INPUT_PATH"))
LLM_labelled_samsung_comments = pd.read_csv(os.getenv("LLM_LABELED_SAMSUNG_COMMENTS_INPUT_PATH"))

### ANALYSIS AFTER LLM LABELING

In [315]:
LLM_labelled_data = [LLM_labelled_apple_posts, LLM_labelled_samsung_posts, LLM_labelled_apple_comments, LLM_labelled_samsung_comments]

In [316]:
LLM_labelled_samsung_posts.head()

Unnamed: 0,text,upvotes,type,date,user_flair,subreddit,category,sentiment,is_relevant,reasoning
0,Soon.\n\nSamsung said the feature should be ro...,93.0,comment,2024-11-24T14:20:06-05:00,,GalaxyTab,queried,neutral,relevant,The comment mentions an expected update from S...
1,"Like others have said, this is the most extrem...",56.0,comment,2023-10-31T20:05:44-04:00,,GalaxyTab,queried,negative,relevant,The comment expresses dissatisfaction with Sam...
2,"I think Samsung gonna say, thank you apple for...",154.0,comment,2024-05-07T11:20:29-04:00,,GalaxyTab,controversial,negative,relevant,The comment expresses a negative sentiment tow...
3,THIS is the flagship Samsung experience i thou...,84.0,comment,2025-01-12T19:38:14-05:00,,GalaxyTab,top,negative,relevant,The comment expresses disappointment with the ...
4,I love that on Samsung they give you the optio...,106.0,comment,2024-05-17T19:49:17-04:00,Galaxy Tab S9+,GalaxyTab,top,positive,relevant,The user expresses satisfaction with a feature...


In [324]:
# Display the first 5 rows marked as irrelevant, showing text, sentiment, relevance and reasoning
LLM_labelled_samsung_posts[LLM_labelled_samsung_posts["is_relevant"] == "irrelevant"][["text", "sentiment", "is_relevant","reasoning"]].head()

Unnamed: 0,text,sentiment,is_relevant,reasoning
90,Nestle took ovrer samsung a while ago,neutral,irrelevant,The comment mentions Nestle taking over Samsun...
208,Are you sending this from internet Explorer? W...,neutral,irrelevant,The comment does not express any sentiment tow...
213,"If I plant it, will it grow samsung products?",neutral,irrelevant,The comment is a play on words and does not ex...
470,*looking at my 10years old 256GB Samsung SSD u...,neutral,irrelevant,The comment mentions an old Samsung SSD but do...
478,"I've heard of Samsung, even have some of their...",neutral,irrelevant,The comment does not express any sentiment tow...


This illustrates one way in which LLMs are not well suited to this task: the model has marked as irrelevant some entries that clearly concern Samsung. We consider marking relevancy an 'easier' task for LLMs than sentiment analysis, so these errors give us good reason to avoid relying on this measure.
We will give a proper justification for avoiding the LLM-labeled values in a later section.

### AUTOMATED LABELING PIPELINE
We now introduce a different approach to labeling these datasets: an ensemble of VADER, TextBlob and a BERT model fine-tuned for sentiment classification.

#### Fine Tuned BERT
We fine-tune BERT on the sentiment subset of the TweetEval dataset.

In [None]:
from datasets import load_dataset

# Load the sentiment subset of TweetEval
tweet_eval = load_dataset("tweet_eval", "sentiment")
print(tweet_eval)

In [None]:
from transformers import AutoTokenizer

# Tokenize
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

tokenized_tweet_eval = tweet_eval.map(tokenize_function, batched=True)
tokenized_tweet_eval = tokenized_tweet_eval.rename_column("label", "labels")
tokenized_tweet_eval.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
# Use the TweetEval splits produced above
train_dataset = tokenized_tweet_eval["train"]
test_dataset = tokenized_tweet_eval["test"]

In [None]:
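from transformers import (
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Load the pretrained model; TweetEval sentiment has three classes (negative, neutral, positive)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Training arguments
training_args = TrainingArguments(
    output_dir="./models/tweet_eval-bert",
    evaluation_strategy="epoch",
    num_train_epochs=1,   # just for demonstration
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    # Pad each batch dynamically, since the tokenizer call above does not pad
    data_collator=DataCollatorWithPadding(tokenizer)
)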

In [None]:
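trainer.train()

# Save the fine-tuned model and tokenizer to the path that is loaded below
trainer.save_model("models/tweet_eval_bert")
tokenizer.save_pretrained("models/tweet_eval_bert")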

In [None]:
metrics = trainer.evaluate()
print("Evaluation metrics:", metrics)

In [None]:
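from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from disk
model_path = "models/tweet_eval_bert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)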

model.config.id2label = {0: "negative", 1: "neutral", 2: "positive"}
model.config.label2id = {"negative": 0, "neutral": 1, "positive": 2}

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, 
                      return_all_scores=True)

In [None]:
def bert_label(text):
    """
    Classify a single string with the fine-tuned BERT pipeline,
    returning 'negative', 'neutral', or 'positive'.
    """
    outputs = classifier(text, truncation=True, padding=True) 
    # 'outputs' is a list. If return_all_scores=True, outputs[0] is a list of label dicts.
    # If return_all_scores=False, outputs[0] is just one dict with the best label.

    # Case 1: If return_all_scores=True, find the label with the max score:
    if isinstance(outputs[0], list):
        # e.g.: outputs = [[{'label': 'negative', 'score': 0.05}, ... ] ]
        predictions = outputs[0]
        best_pred = max(predictions, key=lambda x: x["score"])
        label = best_pred["label"]
    
    # Case 2: If return_all_scores=False, pipeline already returns the best label
    else:
        # e.g.: outputs = [{'label': 'positive', 'score': 0.85}]
        label = outputs[0]["label"]
    
    # This label should be a string like "negative", "neutral", or "positive" 
    # because we set model.config.id2label above. If you didn't, it might say "LABEL_0", etc.

    return label


#### VADER

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def vader_label(text, neg_threshold=-0.05, pos_threshold=0.05):
    """
    Returns discrete sentiment label from VADER compound score.
    You can adjust thresholds if you like.
    - If compound >= pos_threshold, label 'positive'
    - If compound <= neg_threshold, label 'negative'
    - Otherwise, 'neutral'
    """
    score = sia.polarity_scores(text)['compound']
    if score >= pos_threshold:
        return "positive"
    elif score <= neg_threshold:
        return "negative"
    else:
        return "neutral"


#### TextBlob

In [None]:
from textblob import TextBlob

def textblob_label(text):
    """
    Returns discrete sentiment label based on TextBlob polarity.
    polarity > 0 => 'positive'
    polarity < 0 => 'negative'
    else => 'neutral'
    """
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "positive"
    elif polarity < 0:
        return "negative"
    else:
        return "neutral"


#### Making the ensemble

In [None]:
from collections import Counter

def majority_vote(labels_list):
    """
    labels_list: e.g. ["positive", "positive", "neutral"]
    Returns the most common label.
    On a tie, Counter.most_common returns the tied label that appears first in labels_list.
    """
    counter = Counter(labels_list)
    return counter.most_common(1)[0][0]  # label with highest frequency
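With three voters and three possible labels, all three models can disagree. In that case `Counter.most_common` returns the label it encountered first, so the ensemble would silently follow whichever model is listed first. The following is a minimal sketch of an explicit tie-break. The function name and the choice of `"neutral"` as the fallback are assumptions for illustration, not the notebook's method.

```python
from collections import Counter

def majority_vote_with_tiebreak(labels, fallback="neutral"):
    """Return the most common label; on a tie for first place, return `fallback`."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]

print(majority_vote_with_tiebreak(["positive", "positive", "neutral"]))  # positive
print(majority_vote_with_tiebreak(["positive", "negative", "neutral"]))  # neutral (three-way tie)
```

Other tie-breaks are equally defensible, for example trusting the fine-tuned model's label. The point is only that the rule is stated rather than left to list order.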


In [None]:
# TODO: replace df with our own DataFrames
ensemble_labels = []
for i, row in df.iterrows():
    text = row['text']
    
    v_label = vader_label(text)
    t_label = textblob_label(text)
    b_label = bert_label(text)  # uses the pipeline from above
    
    final_label = majority_vote([v_label, t_label, b_label])
    ensemble_labels.append(final_label)

df['ensemble_label'] = ensemble_labels
print(df)


### Human Labelled Data
To validate these labels further, we manually label a sample of the data and compare it with the ensemble-labelled data.
We will introduce the Gwet's AC1 calculation later for this purpose.

To make the manual labelling easier, we created a script with a UI that automatically sorts posts by number of upvotes and cycles through the subreddits, which keeps the labelled sample balanced across all of the subreddits.
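The ordering described above (highest-upvoted first, with the subreddits taking turns) can be sketched as a simple round-robin. This is a simplified illustration under stated assumptions: the rows are made-up dictionaries and `rotation_order` is a hypothetical helper, whereas the actual widget picks the subreddit with the fewest labelled rows at each step.

```python
from collections import defaultdict, deque

def rotation_order(rows):
    """Order rows so subreddits take turns, highest-upvoted row of each subreddit first."""
    queues = defaultdict(deque)
    for row in sorted(rows, key=lambda r: r["upvotes"], reverse=True):
        queues[row["subreddit"]].append(row)

    ordered = []
    while queues:
        # One pass over the subreddits that still have rows left.
        for sub in list(queues):
            ordered.append(queues[sub].popleft())
            if not queues[sub]:
                del queues[sub]
    return ordered

# Made-up rows for illustration.
rows = [
    {"subreddit": "apple", "upvotes": 900, "text": "a1"},
    {"subreddit": "apple", "upvotes": 500, "text": "a2"},
    {"subreddit": "iphone", "upvotes": 700, "text": "i1"},
    {"subreddit": "mac", "upvotes": 100, "text": "m1"},
]
print([r["text"] for r in rotation_order(rows)])  # ['a1', 'i1', 'm1', 'a2']
```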

*Note: We have already finished the labelling and initially ran this code in a separate file. We rerun it here with sample output purely for illustration; please see Readme.md for pictures of the labeller's GUI.*

In [5]:
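import os
import textwrap

import ipywidgets as widgets
import pandas as pd
from IPython.display import clear_output, display

def create_labeling_widget(df, csv_name):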
    """
    Creates an interactive Jupyter Notebook widget for labeling Reddit data.

    Args:
        df: pandas DataFrame with Reddit data.
        csv_name: Original CSV file name.
    """

    # --- Data Prep & Initialization ---
    # Sort by upvotes (descending) and then by subreddit
    df = df.sort_values(by=['upvotes', 'subreddit'], ascending=[False, True]).reset_index(drop=True)

    if 'sentiment' not in df.columns:
        df['sentiment'] = None
    if 'isRelevant' not in df.columns:
        df['isRelevant'] = None

    start_index = 0
    if df['sentiment'].notna().any() or df['isRelevant'].notna().any():
            try:
                # Use .index[-1] safely with reset index
                last_labeled_index = df[df['sentiment'].notna() | df['isRelevant'].notna()].index[-1]
                start_index = last_labeled_index + 1
            except IndexError:
                start_index = 0

    if start_index >= len(df):
        print("All data has already been labeled.")
        return

    current_index = start_index
    # Initialize subreddit counts.
    subreddit_counts = {sub: 0 for sub in df['subreddit'].unique()}
    # Initialize subreddit counts from the start index, VERY IMPORTANT for resuming
    for sub in subreddit_counts:
            subreddit_counts[sub] = df.loc[:start_index-1, 'subreddit'].eq(sub).sum()


    # --- UI Elements ---
    post_display = widgets.Output()  # Create the Output widget *once*
    main_container = widgets.VBox(layout = {'border': '1px solid black'})
    positive_button = widgets.Button(description="Positive")
    negative_button = widgets.Button(description="Negative")
    neutral_button = widgets.Button(description="Neutral")
    irrelevant_button = widgets.Button(description="Irrelevant")
    skip_button = widgets.Button(description="Skip") # Add skip button
    save_button = widgets.Button(description="Save Progress")

    button_box = widgets.HBox([positive_button, negative_button, neutral_button, irrelevant_button, skip_button]) # Include skip button

    total_labeled_count = widgets.Label(value=f"Labeled: {start_index}/{len(df)}")
    sentiment_counts = {
        'positive': widgets.Label(value="Positive: 0"),
        'negative': widgets.Label(value="Negative: 0"),
        'neutral':  widgets.Label(value="Neutral: 0"),
    }
    relevance_counts = {
        'relevant': widgets.Label(value="Relevant: 0"),
        'irrelevant': widgets.Label(value="Irrelevant: 0"),
    }
    subreddit_rotation_label = widgets.Label(value="Subreddit Rotation: ")

    sentiment_box = widgets.HBox(list(sentiment_counts.values()))
    relevance_box = widgets.HBox(list(relevance_counts.values()))

    # --- Helper Functions ---
    def get_next_index(current_index, subreddit_counts):
        """Gets the next index, prioritizing subreddit rotation."""

        available_subreddits = [sub for sub in df['subreddit'].unique()
                                if subreddit_counts[sub] < df['subreddit'].value_counts()[sub]]

        if not available_subreddits:
            return None

        min_sub = min(available_subreddits, key=subreddit_counts.get)
        # Find next unlabeled in min_sub using .loc for boolean indexing
        mask = (df['subreddit'] == min_sub) & (df['sentiment'].isna()) & (df['isRelevant'].isna())
        unlabeled_in_min_sub = df[mask]

        if not unlabeled_in_min_sub.empty:
            return unlabeled_in_min_sub.index[0]  # index labels match row positions because the index was reset
        else:
            subreddit_counts[min_sub] = df['subreddit'].eq(min_sub).sum()
            return get_next_index(current_index, subreddit_counts)


    count_update_display_calls = 0 # DEBUG Counter

    def update_display():
        nonlocal current_index, count_update_display_calls
        count_update_display_calls += 1 # DEBUG Counter
        #print(f"DEBUG: update_display called #{count_update_display_calls}, current_index: {current_index}") # DEBUG PRINT
        post_display.clear_output(wait=True)  # Clear BEFORE printing
        if current_index >= len(df):
            with post_display:
                print("All data has been labeled!")
            return

        row = df.loc[current_index] # Changed from df.iloc to df.loc
        with post_display:
            # Build output as single string
            output = []
            output.append("---- Post Details ----")
            output.append(f"Title: {row['title']}")
            output.append(f"Subreddit: {row['subreddit']} | Category: {row['category']}")
            output.append(f"Upvotes: {row['upvotes']} | Date: {row['date']}")
            if not pd.isna(row['post_flair']):
                output.append(f"Post Flair: {row['post_flair']}")
            if not pd.isna(row['user_flair']):
                output.append(f"User Flair: {row['user_flair']}")
            output.append("\n---- Content ----")
            # if not pd.isna(row['parent_text']):
            #     wrapped_parent_text = textwrap.fill(str(row['parent_text']), width=170)
            #     output.append(f"Parent Comment: {wrapped_parent_text}")
            wrapped_text = textwrap.fill(str(row['text']), width=170)
            output.append(wrapped_text)

            # Corrected print statement: use display instead of print inside Output widget context
            # display(widgets.Label(value='\n'.join(output))) # replaced with append_stdout
            for line in output:
                post_display.append_stdout(line + '\n')


    def update_counters():
        nonlocal current_index
        total_labeled_count.value = f"Labeled: {df['sentiment'].count() + df['isRelevant'].eq('irrelevant').sum()}/{len(df)}"
        sentiment_counts['positive'].value = f"Positive: {df['sentiment'].eq('positive').sum()}"
        sentiment_counts['negative'].value = f"Negative: {df['sentiment'].eq('negative').sum()}"
        sentiment_counts['neutral'].value = f"Neutral: {df['sentiment'].eq('neutral').sum()}"
        relevance_counts['relevant'].value = f"Relevant: {df['isRelevant'].eq('relevant').sum()}"
        relevance_counts['irrelevant'].value = f"Irrelevant: {df['isRelevant'].eq('irrelevant').sum()}"
        subreddit_list = [f'{key}:{subreddit_counts[key]}' for key in subreddit_counts]
        wrapped_subreddit_rotation = textwrap.fill(", ".join(subreddit_list), width=170)
        subreddit_rotation_label.value = f"Subreddit Rotation: {wrapped_subreddit_rotation}"

    def on_button_clicked(button):
        nonlocal current_index
        if current_index >= len(df):
            return

        if button.description == "Positive":
            df.loc[current_index, 'sentiment'] = 'positive'
            df.loc[current_index, 'isRelevant'] = 'relevant'
        elif button.description == "Negative":
            df.loc[current_index, 'sentiment'] = 'negative'
            df.loc[current_index, 'isRelevant'] = 'relevant'
        elif button.description == "Neutral":
            df.loc[current_index, 'sentiment'] = 'neutral'
            df.loc[current_index, 'isRelevant'] = 'relevant'
        elif button.description == "Irrelevant":
            df.loc[current_index, 'isRelevant'] = 'irrelevant'
            df.loc[current_index, 'sentiment'] = None  # Clear sentiment if irrelevant

        subreddit_counts[df.loc[current_index]['subreddit']] += 1
        update_counters()

        next_index = get_next_index(current_index, subreddit_counts)

        if next_index is not None:
            current_index = next_index
            update_display()
        else:
            with post_display:
                clear_output(wait=True)
                print("All data has been labeled!")
            return

    def on_skip_button_clicked(button): # New skip button functionality
        nonlocal current_index
        nonlocal df # Explicitly declare df as nonlocal

        if current_index >= len(df):
            return

        skipped_row = df.loc[current_index].copy() # Get a copy of the current row
        df = df.drop(current_index) # Drop the current row - Changed to avoid inplace and re-assign
        df = pd.concat([df, pd.DataFrame([skipped_row])], ignore_index=True) # Append to the end
        df.reset_index(drop=True, inplace=True) # Reset index

        next_index = get_next_index(current_index, subreddit_counts) # Recalculate next index

        if next_index is not None:
            current_index = next_index
            update_display()
        else:
            with post_display:
                clear_output(wait=True)
                print("All data has been labeled!")
            return


    def on_save_button_clicked(button):
        if not os.path.exists('data'):
            os.makedirs('data')
        base_filename = os.path.basename(csv_name) # Extract base filename
        if base_filename.startswith("labeled_"):
            output_filename = csv_name # Save to original filename if it starts with "labeled_"
        else:
            output_filename = f"data/labeled_{base_filename}" # Otherwise create labeled file in data dir
        df.to_csv(output_filename, index=False)
        print(f"Data saved to {output_filename}")

    # --- Event Handling ---
    positive_button.on_click(on_button_clicked)
    negative_button.on_click(on_button_clicked)
    neutral_button.on_click(on_button_clicked)
    irrelevant_button.on_click(on_button_clicked)
    skip_button.on_click(on_skip_button_clicked) # Add event handler for skip button
    save_button.on_click(on_save_button_clicked)

    # --- Layout and Display ---
    main_container.children = [
            post_display,
            button_box,
            total_labeled_count,
            save_button,
            sentiment_box,
            relevance_box,
            subreddit_rotation_label
        ]

    display(main_container)

    # --- Initial Display and counter setup ---
    update_counters()
    update_display()

In [6]:
df = pd.read_csv(os.getenv("CLEANED_APPLE_POSTS_INPUT_PATH"))
create_labeling_widget(df, os.getenv("CLEANED_APPLE_POSTS_INPUT_PATH"))

VBox(children=(Output(), HBox(children=(Button(description='Positive', style=ButtonStyle()), Button(descriptio…

### Gwet's AC1 Calculation [1]

Now we calculate **Gwet's AC1** to measure the agreement between two sentiment labellings (ours and the LLM's). We use Gwet's AC1 instead of plain accuracy because it corrects for chance agreement, which makes it a more reliable measure. **Cohen's Kappa** was not used because it can be biased under **class imbalance** (our data contains many more *neutral* values) [1]. We randomly selected 800 rows of the dataset, labelled them manually, and compared them with the automatic labels.

Sentiment analysis performance is highly context-dependent, but we consider a score above 0.8 to be satisfactory [2].
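For reference, the standard two-rater form of Gwet's AC1 can be written as follows. The symbols are introduced here for illustration: $N$ is the number of rows, $K$ the number of categories, $a_i$ and $b_i$ the two labels of row $i$, and $p_{1k}$, $p_{2k}$ the share of category $k$ in each labelling.

```latex
P_o = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[a_i = b_i],
\qquad
\pi_k = \frac{p_{1k} + p_{2k}}{2},
\qquad
P_e = \frac{1}{K - 1} \sum_{k=1}^{K} \pi_k \, (1 - \pi_k),
\qquad
\mathrm{AC1} = \frac{P_o - P_e}{1 - P_e}
```

Cohen's Kappa uses the same final ratio but with $P_e = \sum_k p_{1k} \, p_{2k}$, which approaches 1 when a single category dominates and is the source of the bias mentioned above.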

In [None]:
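def calculate_gwet_ac1(df, col1, col2):
    """Gwet's AC1 agreement between two label columns, using the first 800 rows."""
    # Take the first 800 rows (the manually labelled sample)
    df = df.iloc[:800]

    # Observed agreement (P_o)
    total = len(df)
    observed_agreement = (df[col1] == df[col2]).sum() / total

    # All categories used by either labelling
    categories = set(df[col1].dropna()).union(set(df[col2].dropna()))
    if len(categories) < 2:
        return 1.0

    # Per-category proportions for each labelling
    prob_col1 = df[col1].value_counts(normalize=True)
    prob_col2 = df[col2].value_counts(normalize=True)

    # Chance agreement (P_e) for AC1: average the two marginals per category,
    # then P_e = sum(pi_k * (1 - pi_k)) / (K - 1).
    # (The product of the marginals, sum(p1_k * p2_k), would give Cohen's Kappa instead.)
    pi = {c: (prob_col1.get(c, 0) + prob_col2.get(c, 0)) / 2 for c in categories}
    expected_agreement = sum(p * (1 - p) for p in pi.values()) / (len(categories) - 1)

    # Gwet's AC1 = (P_o - P_e) / (1 - P_e)
    gwet_ac1 = (observed_agreement - expected_agreement) / (1 - expected_agreement)

    return gwet_ac1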

col1 = "sentiment"
col2 = "sentimenth"

csv_path = "v2_qwen2.5-7b-instruct_cleaned_company_reputation_data_apple_posts.csv"  # Change to your actual file path
df = pd.read_csv(csv_path)

# Calculating Gwet's AC1
gwet_ac1 = calculate_gwet_ac1(df, col1, col2)
print(f"Gwet's AC1 for the first 800 rows: {gwet_ac1:.4f}")
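To see why the choice between Cohen's Kappa and Gwet's AC1 matters under class imbalance, the following sketch computes both on a small made-up example in which almost every label is *neutral*. The label lists and the helper `agreement_stats` are assumptions for illustration only, and the helper assumes at least two categories are present.

```python
from collections import Counter

def agreement_stats(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa, Gwet's AC1) for two equal-length label lists."""
    n = len(labels_a)
    categories = sorted(set(labels_a) | set(labels_b))
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Cohen's kappa: chance agreement from the product of each rater's marginals.
    pe_kappa = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    # Gwet's AC1: chance agreement from the averaged marginals.
    pi = {c: (freq_a[c] + freq_b[c]) / (2 * n) for c in categories}
    pe_ac1 = sum(p * (1 - p) for p in pi.values()) / (len(categories) - 1)

    kappa = (p_o - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)
    return p_o, kappa, ac1

# Made-up, heavily imbalanced labels: 90 agreed 'neutral' rows and 10 disagreements.
human = ["neutral"] * 90 + ["positive"] * 5 + ["neutral"] * 5
model = ["neutral"] * 90 + ["neutral"] * 5 + ["positive"] * 5
p_o, kappa, ac1 = agreement_stats(human, model)
print(f"P_o={p_o:.2f}  kappa={kappa:.2f}  AC1={ac1:.2f}")
```

In this example the two labellings agree on 90% of the rows, yet Kappa comes out slightly below zero while AC1 stays close to the raw agreement.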