In [7]:
import pandas as pd
import praw # Python Reddit API Wrapper
import re
import numpy as np

In [4]:
# Configuration
CLIENT_ID = 'b3CUxOmuvB9QeRasUu3Hew'
CLIENT_SECRET = 'CqqnRJucVB500qgPSWTH9wHHx2acQQ'
USER_AGENT = 'script:my_reddit_script:v1.0 (by /u/py_dev684)'

# Initialize the Reddit client
reddit = praw.Reddit(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    user_agent=USER_AGENT
)

In [5]:
def fetch_comments_with_phrases(subreddits, patterns, comment_target=20):
    """Fetch comments that contain a target phrase, sourced from 'new' posts from the specified subreddits. Comments are sourced from new posts in each subreddit. The new sorting method gives posts in the order they were posted."""

    data = []  # List to store data before converting to DataFrame

    for subreddit_name in subreddits:
        print(f"Fetching from r/{subreddit_name}...")
        subreddit = reddit.subreddit(subreddit_name)

        post_counter = 0
        comment_counter = 0
        while len(data) < comment_target:
            for post in subreddit.new(limit=100):  # We'll retrieve posts in chunks of 100
                post.comments.replace_more(limit=None)  # Replace "MoreComments" with actual comments
                
                # Filter comments based on presence of target phrases
                for comment in post.comments:
                    comment_counter += 1
                    matched_pattern = next((pattern for pattern in patterns if re.search(pattern, comment.body, re.IGNORECASE)), None)
                    if matched_pattern:
                        data.append({
                            'Subreddit': subreddit_name,
                            'Post Title': post.title,
                            'Comment Author': str(comment.author),
                            'Comment': comment.body,
                            'Matched Phrase': matched_pattern, 
                            'Upvotes': comment.score
                        })

                    if len(data) >= comment_target:  # If we've hit our comment target, break out
                        break

                print(f"Comments checked: {comment_counter}")
                print(f"Relevant comments: {len(data)}")
                
                if len(data) >= comment_target:  # If we've hit our comment target, stop processing
                    break
    print(f"Total comments checked: {comment_counter}")          
    print(f"{len(data)} comments collected.")

    # Convert the data list into a DataFrame
    df = pd.DataFrame(data)

    return df


In [6]:
subreddits = ['finance']
patterns = [r'\bHSBC\b', r'\bCiti\b', r'\bNatWest\b', r'\bCoutts\b', r'\bLloyds\b', r'\bBarclays\b', r'\bStandard\s+Chartered\b', r'\bSantander\b', r'\bBank\s+of\s+England\b', r'\bBoE\b', r'\bGoldman\s+Sachs\b', r'\bMorgan\s+Stanley\b', r'\bSilicon\s+Valley\s+Bank\b', r'\bSVB\b', r'\bCredit\s+Suisse\b']

df = fetch_comments_with_phrases(subreddits, patterns)

Fetching from r/finance...
Comments checked: 2
Relevant comments: 0
Comments checked: 6
Relevant comments: 0
Comments checked: 46
Relevant comments: 0
Comments checked: 55
Relevant comments: 0
Comments checked: 58
Relevant comments: 0
Comments checked: 75
Relevant comments: 0
Comments checked: 77
Relevant comments: 0
Comments checked: 79
Relevant comments: 0
Comments checked: 83
Relevant comments: 0
Comments checked: 96
Relevant comments: 0
Comments checked: 102
Relevant comments: 0
Comments checked: 162
Relevant comments: 0
Comments checked: 167
Relevant comments: 0
Comments checked: 178
Relevant comments: 0
Comments checked: 186
Relevant comments: 0
Comments checked: 191
Relevant comments: 0
Comments checked: 207
Relevant comments: 0
Comments checked: 208
Relevant comments: 0
Comments checked: 214
Relevant comments: 0
Comments checked: 216
Relevant comments: 1
Comments checked: 227
Relevant comments: 1
Comments checked: 230
Relevant comments: 1
Comments checked: 235
Relevant comments

# Sentiment Analysis

## NLTK Vader

In [56]:
# Load sentiment analyser
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

# Initialise VADER
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jackwalker/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [57]:
# Apply VADER analysis on text column
df['sentiment_scores'] = df['Comment'].apply(lambda x: sia.polarity_scores(x))
df['compound'] = df['sentiment_scores'].apply(lambda score_dict: score_dict['compound'])
df['sentiment'] = df['compound'].apply(lambda c: 'POSITIVE' if c >= 0.05 else ('NEGATIVE' if c <= -0.05 else 'NEUTRAL'))
df.head(20)

Unnamed: 0,Subreddit,Post Title,Comment Author,Comment,Matched Phrase,Upvotes,sentiment_scores,compound,sentiment
0,finance,‘Almost All Loans Are Bad’—Why Banks Aren’t Le...,hcbaron,Extracted article:\n\nBanks would love to lend...,\bBarclays\b,13,"{'neg': 0.049, 'neu': 0.799, 'pos': 0.152, 'co...",0.9977,POSITIVE
1,finance,"Moronic Monday - September 05, 2023 - Your Wee...",14446368,Could we un-pin the SVB thing now?,\bSVB\b,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,NEUTRAL
2,finance,Bloomberg overhauls management team with Mark ...,marketrent,Per an internal memo sent by founder Mike Bloo...,\bBank\s+of\s+England\b,2,"{'neg': 0.021, 'neu': 0.941, 'pos': 0.037, 'co...",0.186,POSITIVE
3,finance,[Bloomberg] New York and California Each Lost ...,mzachi,here's some quotes from the article for those ...,\bGoldman\s+Sachs\b,34,"{'neg': 0.032, 'neu': 0.916, 'pos': 0.051, 'co...",0.6953,POSITIVE
4,finance,Is David Solomon Too Big a Jerk to Run Goldman...,MartianActual,Counterpoint: What other kind of person would ...,\bGoldman\s+Sachs\b,18,"{'neg': 0.179, 'neu': 0.821, 'pos': 0.0, 'comp...",-0.34,NEGATIVE
5,finance,"Moody's downgrades US banks, warns of possible...",bellayang1216,Based on the information provided:\r \n\r \n...,\bSilicon\s+Valley\s+Bank\b,2,"{'neg': 0.129, 'neu': 0.781, 'pos': 0.09, 'com...",-0.9558,NEGATIVE
6,finance,Why the US is interested in audits of Chinese ...,asuka_rice,"Plenty of lemons everywhere.\n\nCS, SVB and FT...",\bSVB\b,2,"{'neg': 0.129, 'neu': 0.72, 'pos': 0.151, 'com...",0.2249,POSITIVE
7,finance,‘Success fees’ and thirsty emails: inside a $9...,FishFar4370,Musk's desire to have an emotional tantrum and...,\bGoldman\s+Sachs\b,6,"{'neg': 0.038, 'neu': 0.829, 'pos': 0.134, 'co...",0.8981,POSITIVE
8,finance,‘Almost All Loans Are Bad’—Why Banks Aren’t Le...,hcbaron,Extracted article:\n\nBanks would love to lend...,\bBarclays\b,13,"{'neg': 0.049, 'neu': 0.799, 'pos': 0.152, 'co...",0.9977,POSITIVE
9,finance,"Moronic Monday - September 05, 2023 - Your Wee...",14446368,Could we un-pin the SVB thing now?,\bSVB\b,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,NEUTRAL


In [58]:
df.to_csv('reddit_comments_sentiment.csv', index=False)

## FinBERT

In [2]:
from transformers import BertTokenizer, BertForSequenceClassification

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

(…)kust/finbert-tone/resolve/main/vocab.txt: 100%|██████████| 226k/226k [00:00<00:00, 8.84MB/s]


In [16]:
import torch.nn.functional as F

sent_val = []
sent_scores = []
compound_scores = []
X = df['Comment']

for x in X:
    # The tokenizer pads sequences shorter than 512 tokens and truncates longer ones.
    inputs = tokenizer(x, return_tensors='pt', padding=True, truncation=True, max_length=512)
    
    # Getting the model's prediction. The 'outputs' tensor contains logits for each sentiment class.
    outputs = finbert(**inputs)[0]
    
    # Deriving the sentiment label from the prediction by taking the class with the highest logit.
    val = labels[np.argmax(outputs.detach().numpy())]
    
    # Converting the logits to probabilities using the softmax function.
    probabilities = F.softmax(outputs, dim=1).detach().numpy()[0]
    
    # Constructing a dictionary with probabilities for each sentiment class.
    score = {labels[i]: prob for i, prob in enumerate(probabilities)}
    
    # Calculating a compound score as the difference between positive and negative probabilities.
    # This score provides a single metric between -1 and 1.
    compound = probabilities[2] - probabilities[0]

    # Appending the obtained values to their respective lists.
    sent_val.append(val)
    sent_scores.append(score)
    compound_scores.append(compound)

# Adding the sentiment labels, scores, and compound scores to the dataframe as new columns.
df['Sentiment'] = sent_val
df['Sentiment_Scores'] = sent_scores
df['Compound_Score'] = compound_scores

In [17]:
df.head(10)

Unnamed: 0,Subreddit,Post Title,Comment Author,Comment,Matched Phrase,Upvotes,Sentiment,Sentiment_Scores,Compound_Score
0,finance,‘Almost All Loans Are Bad’—Why Banks Aren’t Le...,hcbaron,Extracted article:\n\nBanks would love to lend...,\bBarclays\b,12,neutral,"{'neutral': 0.8657679, 'positive': 0.030311747...",-0.761848
1,finance,"Moronic Monday - September 05, 2023 - Your Wee...",14446368,Could we un-pin the SVB thing now?,\bSVB\b,1,neutral,"{'neutral': 0.965751, 'positive': 0.0004432395...",-0.931945
2,finance,Bloomberg overhauls management team with Mark ...,marketrent,Per an internal memo sent by founder Mike Bloo...,\bBank\s+of\s+England\b,2,neutral,"{'neutral': 0.9999981, 'positive': 1.5444566e-...",-0.999998
3,finance,[Bloomberg] New York and California Each Lost ...,mzachi,here's some quotes from the article for those ...,\bGoldman\s+Sachs\b,30,negative,"{'neutral': 0.011110356, 'positive': 1.7894672...",0.977761
4,finance,Is David Solomon Too Big a Jerk to Run Goldman...,MartianActual,Counterpoint: What other kind of person would ...,\bGoldman\s+Sachs\b,19,neutral,"{'neutral': 0.9999924, 'positive': 8.0087466e-...",-0.999986
5,finance,"Moody's downgrades US banks, warns of possible...",bellayang1216,Based on the information provided:\r \n\r \n...,\bSilicon\s+Valley\s+Bank\b,2,negative,"{'neutral': 0.0008381561, 'positive': 0.000177...",0.998146
6,finance,Why the US is interested in audits of Chinese ...,asuka_rice,"Plenty of lemons everywhere.\n\nCS, SVB and FT...",\bSVB\b,2,neutral,"{'neutral': 0.7490163, 'positive': 0.052578915...",-0.550612
7,finance,‘Success fees’ and thirsty emails: inside a $9...,FishFar4370,Musk's desire to have an emotional tantrum and...,\bGoldman\s+Sachs\b,6,neutral,"{'neutral': 0.95011884, 'positive': 0.02610915...",-0.926347
8,finance,‘Almost All Loans Are Bad’—Why Banks Aren’t Le...,hcbaron,Extracted article:\n\nBanks would love to lend...,\bBarclays\b,12,neutral,"{'neutral': 0.8657679, 'positive': 0.030311747...",-0.761848
9,finance,"Moronic Monday - September 05, 2023 - Your Wee...",14446368,Could we un-pin the SVB thing now?,\bSVB\b,1,neutral,"{'neutral': 0.965751, 'positive': 0.0004432395...",-0.931945


How to handle differences between 'Sentiment' and 'Compound_Score'?

How to Handle:
* Inspect the Data: Look at a few examples where this inconsistency occurs. This manual inspection can provide insights into why this might be happening.
* Consider Using a Threshold: Instead of relying solely on the compound score's raw value, consider setting a threshold. For instance, only interpret scores below -0.5 as indicative of strong negative sentiment. This can help reduce sensitivity to small variations in score.
* Combine Metrics for Decision Making: If you're making decisions based on the sentiment, consider using both the sent_val and the compound score. For example, only consider a comment as strongly negative if both the sent_val is "negative" and the compound score is below a certain threshold.
* Model Calibration: If the inconsistencies are frequent and problematic, you might want to consider calibrating the model on your specific data or even fine-tuning it.

In conclusion, when interpreting sentiment analysis results, it's crucial to understand that metrics can sometimes provide different perspectives, and each has its strengths and weaknesses. Considering them in tandem and in the context of your specific application will help in achieving more accurate and meaningful interpretations.