# Sentiment Analysis Notebook

This notebook processes a collection of Reddit posts stored in Parquet files, evaluates the sentiment of each post, and aggregates the results by subreddit.

The sentiment analysis uses a pre-trained model, `"mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis"`. Each batch of tokenized text is fed into the model, which outputs one logit per sentiment class. The class with the highest logit is then converted into a categorical sentiment value (-1 for negative, 0 for neutral, and 1 for positive).
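The conversion from logits to sentiment values can be sketched without any framework. This is a minimal illustration, not the notebook's actual code: the helper name `logits_to_sentiments` and the example logits are hypothetical, and the class order (index 0 positive, 1 negative, 2 neutral) is an assumption that mirrors the remapping in the processing loop of this notebook. It is worth confirming that order against the loaded model's `config.id2label`.

```python
# Assumed class order, mirroring the remapping in this notebook's loop:
# index 0 -> positive (1), index 1 -> negative (-1), index 2 -> neutral (0).
# Verify this against model.config.id2label for the model actually loaded.
CLASS_TO_SENTIMENT = {0: 1, 1: -1, 2: 0}


def logits_to_sentiments(batch_logits):
    """Return one sentiment in {-1, 0, 1} per row of logits (hypothetical helper)."""
    sentiments = []
    for row in batch_logits:
        # argmax: index of the largest logit in this row
        class_id = max(range(len(row)), key=row.__getitem__)
        sentiments.append(CLASS_TO_SENTIMENT[class_id])
    return sentiments


# Made-up logits for three posts, one row per post
example_logits = [[2.0, 0.1, -1.0], [0.0, 3.0, 0.5], [-0.5, 0.2, 1.7]]
example_sentiments = logits_to_sentiments(example_logits)
```

Keeping the mapping in a single dictionary makes the assumed label order explicit and easy to change if the model's labels turn out to be ordered differently.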

The `add_summed_sentiments` function adds a column that sums the sentiments of the replies to each post, giving an aggregate view of the sentiment towards each post.
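The idea behind summing reply sentiments can be illustrated in plain Python. This is a sketch under stated assumptions and not the implementation of `add_summed_sentiments`: the field names `id`, `replies`, and `summed_reply_sentiment`, the helper `sum_reply_sentiments`, and the sample posts are all hypothetical, and only direct replies are counted here, which may differ from what the real function does.

```python
def sum_reply_sentiments(posts):
    """Attach the summed sentiment of each post's direct replies (hypothetical helper)."""
    # Look up every post's own sentiment by its id
    sentiment_by_id = {post["id"]: post["text_sentiment"] for post in posts}
    result = []
    for post in posts:
        # Replies missing from the data contribute 0 to the sum
        total = sum(sentiment_by_id.get(reply_id, 0) for reply_id in post["replies"])
        result.append({**post, "summed_reply_sentiment": total})
    return result


# Made-up thread: "a" has two replies, "c" has one, the others have none
posts = [
    {"id": "a", "replies": ["b", "c"], "text_sentiment": 0},
    {"id": "b", "replies": [], "text_sentiment": 1},
    {"id": "c", "replies": ["d"], "text_sentiment": 1},
    {"id": "d", "replies": [], "text_sentiment": -1},
]
summed = sum_reply_sentiments(posts)
```

In this example, post `a` receives a summed sentiment of 2 from its two positive replies, while post `c` receives -1 from its single negative reply.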

The sentiments only need to be computed once; the results can then be copied to any future configuration.

In [None]:
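import gc
import os
import sys
from pathlib import Path

import numpy as np
import polars as pl
import torch
from torch.utils.data import DataLoader

sys.path.append("../src/")

from adding_metadata.replies import add_reply_list
# The star import is expected to provide get_all_files, TextLoader, tokenizer,
# model, device_staging, and add_summed_sentiments.
from adding_metadata.reply_sentiments import *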

In [None]:
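# Set the paths to the data
# Location of reddit.parquet
base_dir = Path("../")

# Directory that stores the data splits
data_dir = base_dir / "raw"
# Directory for the files with sentiment columns; create it if it is missing
results_dir = base_dir / "raw_sentiment"
results_dir.mkdir(parents=True, exist_ok=True)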


In [None]:
all_files = get_all_files(data_dir)
all_files 

In [None]:
# Run sentiment analysis on each file and save the results
for file in all_files:
    dataset = TextLoader(file=file, tokenizer=tokenizer)
    dataloader = DataLoader(dataset, batch_size=8, shuffle=False)
    all_sentiments = []

    # Inference only, so skip gradient tracking to save memory
    with torch.no_grad():
        for batch in dataloader:
            logits = model(batch.to(device_staging))["logits"]
            class_ids = logits.argmax(dim=1).cpu().numpy()
            # Remap class indices to sentiments: 0 -> 1, 1 -> -1, anything else -> 0.
            # This assumes the label order is (positive, negative, neutral);
            # check model.config.id2label before relying on it.
            sentiments = np.where(class_ids == 0, 1, np.where(class_ids == 1, -1, 0))
            all_sentiments.extend(sentiments.tolist())

    df = pl.read_parquet(file)
    df = df.with_columns(pl.Series("text_sentiment", all_sentiments))
    df = add_summed_sentiments(df)
    # The subreddit name is taken from the second "_"-separated part of the file name
    subreddit = Path(file).stem.split("_")[1].lower()
    output_path = os.path.join(results_dir, f"{subreddit}_data.parquet")
    df.write_parquet(output_path, compression="zstd")

    # Release memory before moving on to the next file
    del dataset, dataloader, df
    gc.collect()
    torch.cuda.empty_cache()