This notebook analyzes Reddit posts that mention five selected stock tickers (TSLA, AMZN, AAPL, GOOGL, MSFT) and classifies their sentiment as Positive, Negative, or Neutral using FinGPT on a GPU. It aggregates posts at the minute level, assigns a confidence score to each classification, and saves the results, including timestamps and sentiment labels, as Parquet files. To handle large datasets, the workflow reads the CSV files in chunks, runs inference in batches, and frees GPU memory between batches.

1.  Filter Reddit Submissions and Comments for Stock Mentions:

    Identifies and filters Reddit posts mentioning specific stock tickers or company names.
    Processes data from a CSV file, extracting relevant entries, and saves the cleaned data in a structured format.

2.  Aggregate Reddit Mentions by Minute-Level Timestamps:

    Aggregates the filtered data at the minute level for each stock ticker.
    Groups and consolidates mentions, scores, and text for time-series analysis.

3.  Sentiment Analysis on Aggregated Reddit Data:

    Applies sentiment analysis to the aggregated data.
    Classifies the sentiment of each minute-level text (Positive, Negative, or Neutral) and assigns a confidence score.

In [None]:
import pandas as pd
import numpy as np
from google.colab import drive
import re
drive.mount('/content/drive')

Mounted at /content/drive


### Filter Reddit Submissions and Comments for Stock Mentions - 2022
This script processes a CSV file of Reddit submissions and comments to identify posts mentioning specific stock tickers or company names. It filters relevant entries and saves the cleaned data for further analysis.

In [None]:
import pandas as pd
import swifter
import re

# File path
file_path = '/content/drive/My Drive/historical_reddit/submissions_with_comments_2022.csv'
output_file = '/content/drive/My Drive/historical_reddit/filtered_submissions_2022.parquet'

# Stock tickers and company names
stocks = {
    'TSLA': 'Tesla',
    'AMZN': 'Amazon',
    'AAPL': 'Apple',
    'GOOGL': 'Google',
    'MSFT': 'Microsoft'
}

# Compile regex patterns for tickers and company names
patterns = {ticker: (re.compile(fr"\b{re.escape(ticker.lower())}\b"),
                     re.compile(fr"\b{re.escape(company.lower())}\b"))
            for ticker, company in stocks.items()}

# Chunk size
chunk_size = 10**6

# Process data in chunks
filtered_chunks = []
for chunk in pd.read_csv(file_path, chunksize=chunk_size, usecols=['comment_body', 'submission_title', 'submission_selftext', 'submission_score', 'comment_date', 'comment_time']):
    # Combine date and time into a single timestamp
    chunk['timestamp'] = pd.to_datetime(chunk['comment_date'] + ' ' + chunk['comment_time'])

    # Combine text columns
    chunk['combined_text'] = (chunk['comment_body'].fillna('') + ' ' +
                              chunk['submission_title'].fillna('') + ' ' +
                              chunk['submission_selftext'].fillna('')).str.lower()

    # Extract stock tickers mentioned in the text
    def extract_stocks(text):
        mentioned = [ticker for ticker, (ticker_pattern, company_pattern) in patterns.items()
                     if re.search(ticker_pattern, text) or re.search(company_pattern, text)]
        return ','.join(mentioned) if mentioned else None

    chunk['mentioned_tickers'] = chunk['combined_text'].swifter.apply(extract_stocks)

    # Filter rows where at least one stock is mentioned
    filtered_chunk = chunk[chunk['mentioned_tickers'].notna()]

    # Append filtered chunk to the list
    filtered_chunks.append(filtered_chunk)

# Concatenate filtered chunks
final_filtered_data = pd.concat(filtered_chunks)

# Keep only required columns
final_filtered_data = final_filtered_data[['timestamp', 'submission_score', 'mentioned_tickers', 'combined_text']]

# Save to parquet
final_filtered_data.to_parquet(output_file, index=False)

print(f"Filtered data saved to {output_file}")


Filtered data saved to /content/drive/My Drive/historical_reddit/filtered_submissions_2022.parquet
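
To make the matching rule concrete, here is a minimal, standard-library sketch of the same word-boundary logic applied to a few made-up strings. The sample sentences are illustrative only and are not taken from the dataset. Note that company names such as "apple" also match when used in their everyday sense, so some filtered rows may not be about the stock at all.

```python
import re

stocks = {'TSLA': 'Tesla', 'AMZN': 'Amazon', 'AAPL': 'Apple',
          'GOOGL': 'Google', 'MSFT': 'Microsoft'}

# One (ticker, company) pattern pair per stock, matched against lowercased text
patterns = {ticker: (re.compile(fr"\b{re.escape(ticker.lower())}\b"),
                     re.compile(fr"\b{re.escape(company.lower())}\b"))
            for ticker, company in stocks.items()}

def extract_stocks(text):
    """Return a comma-separated string of mentioned tickers, or None."""
    mentioned = [ticker for ticker, (ticker_pat, company_pat) in patterns.items()
                 if ticker_pat.search(text) or company_pat.search(text)]
    return ','.join(mentioned) if mentioned else None

# Made-up sample inputs (already lowercased, as in the pipeline)
samples = [
    "loaded up on tsla and google calls",  # ticker and company name both match
    "pineapple on pizza is fine",          # 'apple' inside a longer word: no match
    "i ate an apple today",                # everyday word still matches AAPL
    "goog is not in the list",             # only 'googl' and 'google' are patterns
]
for s in samples:
    print(repr(s), '->', extract_stocks(s))
```

The word boundaries prevent partial-word hits, but they cannot tell a company from a common noun; a stricter filter would need extra context, which this notebook does not attempt.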


### Filter Reddit Submissions and Comments for Stock Mentions - 2023

In [None]:
# File path
file_path = '/content/drive/My Drive/historical_reddit/submissions_with_comments_2023.csv'
output_file = '/content/drive/My Drive/historical_reddit/filtered_submissions_2023.parquet'

# Stock tickers and company names
stocks = {
    'TSLA': 'Tesla',
    'AMZN': 'Amazon',
    'AAPL': 'Apple',
    'GOOGL': 'Google',
    'MSFT': 'Microsoft'
}

# Compile regex patterns for tickers and company names
patterns = {ticker: (re.compile(fr"\b{re.escape(ticker.lower())}\b"),
                     re.compile(fr"\b{re.escape(company.lower())}\b"))
            for ticker, company in stocks.items()}

# Chunk size
chunk_size = 10**6

# Process data in chunks
filtered_chunks = []
for chunk in pd.read_csv(file_path, chunksize=chunk_size, usecols=['comment_body', 'submission_title', 'submission_selftext', 'submission_score', 'comment_date', 'comment_time']):
    # Combine date and time into a single timestamp
    chunk['timestamp'] = pd.to_datetime(chunk['comment_date'] + ' ' + chunk['comment_time'])

    # Combine text columns
    chunk['combined_text'] = (chunk['comment_body'].fillna('') + ' ' +
                              chunk['submission_title'].fillna('') + ' ' +
                              chunk['submission_selftext'].fillna('')).str.lower()

    # Extract stock tickers mentioned in the text
    def extract_stocks(text):
        mentioned = [ticker for ticker, (ticker_pattern, company_pattern) in patterns.items()
                     if re.search(ticker_pattern, text) or re.search(company_pattern, text)]
        return ','.join(mentioned) if mentioned else None

    chunk['mentioned_tickers'] = chunk['combined_text'].swifter.apply(extract_stocks)

    # Filter rows where at least one stock is mentioned
    filtered_chunk = chunk[chunk['mentioned_tickers'].notna()]

    # Append filtered chunk to the list
    filtered_chunks.append(filtered_chunk)

# Concatenate filtered chunks
final_filtered_data = pd.concat(filtered_chunks)

# Keep only required columns
final_filtered_data = final_filtered_data[['timestamp', 'submission_score', 'mentioned_tickers', 'combined_text']]

# Save to parquet
final_filtered_data.to_parquet(output_file, index=False)

print(f"Filtered data saved to {output_file}")

Filtered data saved to /content/drive/My Drive/historical_reddit/filtered_submissions_2023.parquet


### Filter Reddit Submissions and Comments for Stock Mentions - 2021

In [None]:
import pandas as pd
import swifter
import re

# File path
file_path = '/content/drive/My Drive/historical_reddit/submissions_with_comments_2021.csv'
output_file = '/content/drive/My Drive/historical_reddit/filtered_submissions_2021.parquet'

# Stock tickers and company names
stocks = {
    'TSLA': 'Tesla',
    'AMZN': 'Amazon',
    'AAPL': 'Apple',
    'GOOGL': 'Google',
    'MSFT': 'Microsoft'
}

# Compile regex patterns for tickers and company names
patterns = {ticker: (re.compile(fr"\b{re.escape(ticker.lower())}\b"),
                     re.compile(fr"\b{re.escape(company.lower())}\b"))
            for ticker, company in stocks.items()}

# Chunk size
chunk_size = 10**6

# Process data in chunks
filtered_chunks = []
for chunk in pd.read_csv(file_path, chunksize=chunk_size, usecols=['comment_body', 'submission_title', 'submission_selftext', 'submission_score', 'comment_date', 'comment_time']):
    # Combine date and time into a single timestamp
    chunk['timestamp'] = pd.to_datetime(chunk['comment_date'] + ' ' + chunk['comment_time'])

    # Combine text columns
    chunk['combined_text'] = (chunk['comment_body'].fillna('') + ' ' +
                              chunk['submission_title'].fillna('') + ' ' +
                              chunk['submission_selftext'].fillna('')).str.lower()

    # Extract stock tickers mentioned in the text
    def extract_stocks(text):
        mentioned = [ticker for ticker, (ticker_pattern, company_pattern) in patterns.items()
                     if re.search(ticker_pattern, text) or re.search(company_pattern, text)]
        return ','.join(mentioned) if mentioned else None

    chunk['mentioned_tickers'] = chunk['combined_text'].swifter.apply(extract_stocks)

    # Filter rows where at least one stock is mentioned
    filtered_chunk = chunk[chunk['mentioned_tickers'].notna()]

    # Append filtered chunk to the list
    filtered_chunks.append(filtered_chunk)

# Concatenate filtered chunks
final_filtered_data = pd.concat(filtered_chunks)

# Keep only required columns
final_filtered_data = final_filtered_data[['timestamp', 'submission_score', 'mentioned_tickers', 'combined_text']]

# Save to parquet
final_filtered_data.to_parquet(output_file, index=False)

print(f"Filtered data saved to {output_file}")

Filtered data saved to /content/drive/My Drive/historical_reddit/filtered_submissions_2021.parquet


### Example

In [None]:
df = pd.read_parquet('/content/drive/My Drive/historical_reddit/filtered_submissions_2021.parquet')

In [None]:
df

| | timestamp | submission_score | mentioned_tickers | combined_text |
|---|---|---|---|---|
| 0 | 2021-01-01 00:44:31 | 19.0 | TSLA | the 0.000526 tesla share got me how my last da... |
| 1 | 2021-01-01 01:17:01 | 54.0 | TSLA | happy new year retards happy new year fuckhead... |
| 2 | 2021-01-01 01:17:30 | 54.0 | TSLA | good riddance 2020 🚮🦠 happy new year retards 💰... |
| 3 | 2021-01-01 01:20:33 | 54.0 | TSLA | [deleted] happy new year fuckheads happy new y... |
| 4 | 2021-01-01 01:23:45 | 54.0 | TSLA | i listend in spirit. im poor as fuck oil patch... |
| ... | ... | ... | ... | ... |
| 624040 | 2021-12-31 23:42:18 | 7.0 | TSLA,AMZN | [removed] baba, the next amzn killer w/600pt b... |
| 624041 | 2021-12-31 23:44:34 | 432.0 | TSLA | don't need no tesla to impress her, my girl is... |
| 624042 | 2021-12-31 23:51:57 | 7692.0 | GOOGL | google is an odd choice, disney makes sense wi... |
| 624043 | 2021-12-31 23:52:56 | 93.0 | AMZN,GOOGL | crsr wayyyyyyyyyyyyyyy better $wish - long pos... |


### Aggregate Reddit Mentions by Minute-Level Timestamps
This script aggregates Reddit submissions and comments mentioning specific stock tickers (TSLA, AMZN, AAPL, GOOGL, MSFT) at a minute-level timestamp. It organizes the data for time-series analysis by consolidating content and scores for each ticker.

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# File paths for 2021, 2022, and 2023 parquet files
file_paths = [
    '/content/drive/My Drive/historical_reddit/filtered_submissions_2021.parquet',
    '/content/drive/My Drive/historical_reddit/filtered_submissions_2022.parquet',
    '/content/drive/My Drive/historical_reddit/filtered_submissions_2023.parquet'
]

# List of tickers
tickers = ['TSLA', 'AMZN', 'AAPL', 'GOOGL', 'MSFT']

# Aggregation function
def aggregate_minute_level(file_path, tickers):
    # Load data
    df = pd.read_parquet(file_path)

    # Ensure timestamp is at the minute level
    df['timestamp'] = pd.to_datetime(df['timestamp']).dt.floor('min')  # Truncate to minute-level ('T' is a deprecated alias)

    # Expand rows with multiple tickers into one row per ticker.
    # split + explode is a vectorized equivalent of copying each row per ticker.
    df['mentioned_tickers'] = df['mentioned_tickers'].str.split(',')
    df = df.explode('mentioned_tickers', ignore_index=True)

    # Initialize an empty DataFrame for aggregation
    aggregated_data = pd.DataFrame({'timestamp': df['timestamp'].unique()})
    aggregated_data.sort_values(by='timestamp', inplace=True)  # Sort timestamps

    # Process each ticker
    for ticker in tickers:
        # Filter rows mentioning the current ticker
        ticker_data = df[df['mentioned_tickers'] == ticker]

        # Group by minute-level timestamp
        grouped = ticker_data.groupby('timestamp').agg({
            'combined_text': lambda x: ' '.join(x),  # Concatenate Reddit content
            'submission_score': 'mean'  # Average submission scores
        }).reset_index()

        # Merge with the aggregated DataFrame
        aggregated_data = pd.merge(
            aggregated_data,
            grouped,
            on='timestamp',
            how='left',
            suffixes=('', f'_{ticker}')
        )

        # Rename columns for the ticker
        aggregated_data.rename(columns={
            'combined_text': f'{ticker}_reddit',
            'submission_score': f'{ticker}_score'
        }, inplace=True)

    # Fill missing values
    # Text columns should have empty strings, and numeric columns should have 0
    for ticker in tickers:
        aggregated_data[f'{ticker}_reddit'] = aggregated_data[f'{ticker}_reddit'].fillna('')
        aggregated_data[f'{ticker}_score'] = aggregated_data[f'{ticker}_score'].fillna(0)

    return aggregated_data

# Process files for 2021, 2022, and 2023
aggregated_2021 = aggregate_minute_level(file_paths[0], tickers)
aggregated_2022 = aggregate_minute_level(file_paths[1], tickers)
aggregated_2023 = aggregate_minute_level(file_paths[2], tickers)

# Save aggregated data to new parquet files
aggregated_2021.to_parquet('/content/drive/My Drive/historical_reddit/aggregated_2021.parquet', index=False)
aggregated_2022.to_parquet('/content/drive/My Drive/historical_reddit/aggregated_2022.parquet', index=False)
aggregated_2023.to_parquet('/content/drive/My Drive/historical_reddit/aggregated_2023.parquet', index=False)

print("Aggregated data saved for 2021, 2022, and 2023.")

Aggregated data saved for 2021, 2022, and 2023.
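
The aggregation logic can be illustrated without pandas. The following is a minimal, dependency-free sketch under the assumption of three made-up rows (the texts and scores are hypothetical): it floors each timestamp to the minute, emits one entry per mentioned ticker, then joins the texts and averages the scores per (minute, ticker) pair, mirroring what the `groupby` above does.

```python
from collections import defaultdict
from datetime import datetime

# Made-up rows: (timestamp, submission_score, mentioned_tickers, combined_text)
rows = [
    (datetime(2021, 1, 1, 1, 23, 45), 54.0, "TSLA", "first post"),
    (datetime(2021, 1, 1, 1, 23, 59), 9.0, "TSLA", "second post"),
    (datetime(2021, 1, 1, 1, 24, 10), 7.0, "TSLA,AMZN", "both tickers"),
]

# (minute, ticker) -> collected texts and scores
texts = defaultdict(list)
scores = defaultdict(list)
for ts, score, tickers, text in rows:
    minute = ts.replace(second=0, microsecond=0)  # floor to the minute
    for ticker in tickers.split(','):             # one entry per mentioned ticker
        texts[(minute, ticker)].append(text)
        scores[(minute, ticker)].append(score)

# Join texts with a space and average the scores for each (minute, ticker)
aggregated = {
    key: (' '.join(texts[key]), sum(scores[key]) / len(scores[key]))
    for key in texts
}
for (minute, ticker), (text, mean_score) in sorted(aggregated.items()):
    print(minute, ticker, repr(text), mean_score)
```

A post that mentions two tickers contributes its full text and score to both, which is why the same text can appear under two ticker columns for the same minute in the aggregated output.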


### Example

In [None]:
df = pd.read_parquet('/content/drive/My Drive/historical_reddit/aggregated_2021.parquet')
df

| | timestamp | TSLA_reddit | TSLA_score | AMZN_reddit | AMZN_score | AAPL_reddit | AAPL_score | GOOGL_reddit | GOOGL_score | MSFT_reddit | MSFT_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 00:44:00 | the 0.000526 tesla share got me how my last da... | 19.0 | | 0.0 | | 0.0 | | 0.0 | | 0.0 |
| 1 | 2021-01-01 01:17:00 | happy new year retards happy new year fuckhead... | 54.0 | | 0.0 | | 0.0 | | 0.0 | | 0.0 |
| 2 | 2021-01-01 01:20:00 | [deleted] happy new year fuckheads happy new y... | 54.0 | | 0.0 | | 0.0 | | 0.0 | | 0.0 |
| 3 | 2021-01-01 01:23:00 | i listend in spirit. im poor as fuck oil patch... | 31.5 | | 0.0 | | 0.0 | | 0.0 | | 0.0 |
| 4 | 2021-01-01 01:34:00 | snow puts turned my year around. love being a ... | 54.0 | | 0.0 | | 0.0 | | 0.0 | | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 274675 | 2021-12-31 23:42:00 | [removed] baba, the next amzn killer w/600pt b... | 7.0 | [removed] baba, the next amzn killer w/600pt b... | 7.0 | | 0.0 | | 0.0 | | 0.0 |
| 274676 | 2021-12-31 23:44:00 | don't need no tesla to impress her, my girl is... | 432.0 | | 0.0 | | 0.0 | | 0.0 | | 0.0 |
| 274677 | 2021-12-31 23:51:00 | | 0.0 | | 0.0 | | 0.0 | google is an odd choice, disney makes sense wi... | 7692.0 | | 0.0 |
| 274678 | 2021-12-31 23:52:00 | | 0.0 | crsr wayyyyyyyyyyyyyyy better $wish - long pos... | 93.0 | | 0.0 | crsr wayyyyyyyyyyyyyyy better $wish - long pos... | 93.0 | | 0.0 |


### Sentiment Analysis 

In [None]:
!pip install transformers==4.40.1 peft==0.5.0
!pip install sentencepiece
!pip install accelerate
!pip install torch
!pip install datasets
!pip install bitsandbytes

In [None]:
from huggingface_hub import login

# Log in to Hugging Face with your token
login("YOUR-HF-TOKEN")

### Model preparation

In [None]:
from transformers import LlamaForCausalLM, LlamaTokenizerFast
from peft import PeftModel
import torch

# Base model and PEFT (LoRA) model
base_model = "meta-llama/Meta-Llama-3-8B"
peft_model = "FinGPT/fingpt-mt_llama3-8b_lora"

# Load tokenizer
tokenizer = LlamaTokenizerFast.from_pretrained(base_model, trust_remote_code=True, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token

# Load base model with 16-bit precision
model = LlamaForCausalLM.from_pretrained(base_model,
                    trust_remote_code=True,
                    device_map="auto",
                    torch_dtype=torch.float16)  # Enable 16-bit precision

# Apply LoRA-based PEFT model
model = PeftModel.from_pretrained(model, peft_model, torch_dtype=torch.float16)
model = model.eval()

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.



In [None]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

### Sentiment Analysis on Aggregated Reddit Data
This script performs sentiment analysis on Reddit data aggregated at the minute level for stock tickers. It uses FinGPT on an A100 GPU to classify sentiment (Positive, Negative, Neutral) and assigns a probability to each classification.

Because GPU resources are limited, this step runs one ticker and one year at a time. It also regularly clears the GPU cache and garbage-collects unused variables so that large datasets can be processed without running out of memory.

In [None]:
import gc
import torch
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Define function for sentiment analysis
def get_sentiment(text):
    if not text:  # If the text is empty, return neutral sentiment
        return "Neutral", 1.0  # Neutral with probability 1.0
    text = text[:2000]  # Cap the input at 2,000 characters
    # Define the prompt for the model
    prompt = f'''Instruction: What is the sentiment of this news? Please choose an answer from [Positive, Negative, Neutral].\nInput: {text}\nAnswer: '''

    # Tokenize on the CPU, then move the tensors to the model's device.
    # No truncation is applied: max_length has no effect without truncation=True,
    # and truncating on the right would cut off the trailing "Answer: " cue.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Forward pass on GPU
    with torch.no_grad():
        outputs = model(**inputs)

    # Get logits for the last token and move them back to CPU
    logits = outputs.logits[:, -1, :].to("cpu")
    probs = torch.softmax(logits, dim=-1)

    # Next-token probability of each class label, using the first token of each label.
    # These are probabilities over the full vocabulary, so they need not sum to 1.
    labels = ["Positive", "Negative", "Neutral"]
    class_tokens = tokenizer(labels, add_special_tokens=False)["input_ids"]
    class_probs = {label: probs[0, ids[0]].item() for label, ids in zip(labels, class_tokens)}

    # Get the most probable sentiment
    sentiment = max(class_probs, key=class_probs.get)

    # Clear intermediate variables
    del inputs, outputs, logits, probs
    torch.cuda.empty_cache()
    return sentiment, class_probs[sentiment]

# Parameters
batch_size = 500  # Number of rows per batch
output_dir = '/content/drive/My Drive/historical_reddit/'  # Directory for individual ticker files

# Process the parquet files
file_paths = [
    '/content/drive/My Drive/historical_reddit/aggregated_2023.parquet'
]

# Warm-up pass
print("Running warm-up pass to stabilize GPU memory...")
dummy_input = tokenizer("Warm-up", return_tensors="pt", padding=True, max_length=128).to(device)
with torch.no_grad():
    _ = model(**dummy_input)
torch.cuda.empty_cache()

# Process each file
for input_path in file_paths:
    year = input_path.split('_')[-1].split('.')[0]  # Extract year from filename
    print(f"\nProcessing file: {input_path} (Year: {year})")

    # Load the parquet file
    df = pd.read_parquet(input_path)

    for ticker in ['TSLA']:
        print(f"\nProcessing sentiment for ticker: {ticker}")

        reddit_col = f"{ticker}_reddit"
        sentiment_col = f"{ticker}_sentiment"
        logit_col = f"{ticker}_logit"

        # Initialize lists for sentiments and logits
        sentiments = []
        logits = []

        # Process DataFrame in batches
        num_batches = (len(df) + batch_size - 1) // batch_size  # Calculate total number of batches
        for batch_num in tqdm(range(num_batches), desc=f"Ticker: {ticker} Batches"):
            # Extract batch
            start_idx = batch_num * batch_size
            end_idx = min((batch_num + 1) * batch_size, len(df))
            batch_df = df.iloc[start_idx:end_idx]

            # Process each row in the batch
            batch_sentiments = []
            batch_logits = []
            for text in batch_df[reddit_col].fillna("").tolist():
                sentiment, logit = get_sentiment(text)  # GPU inference
                batch_sentiments.append(sentiment)
                batch_logits.append(logit)

            # Append batch results to the main lists
            sentiments.extend(batch_sentiments)
            logits.extend(batch_logits)

            # Clear GPU memory periodically
            del batch_df, batch_sentiments, batch_logits
            torch.cuda.empty_cache()
            gc.collect()

        # Create and save a DataFrame for this ticker
        ticker_df = pd.DataFrame({
            'timestamp': df['timestamp'],  # Assuming there's a timestamp column
            reddit_col: df[reddit_col],
            sentiment_col: sentiments,
            logit_col: logits
        })

        # Save the ticker DataFrame to a separate parquet file
        ticker_output_path = f"{output_dir}/{ticker}_{year}.parquet"
        ticker_df.to_parquet(ticker_output_path, index=False)
        print(f"Saved results for {ticker} to {ticker_output_path}")

        # Clear memory for the ticker
        del ticker_df, sentiments, logits
        torch.cuda.empty_cache()
        gc.collect()

    # Clear memory for the file
    del df
    torch.cuda.empty_cache()
    gc.collect()

    print(f"Cleared GPU cache and unnecessary variables for: {input_path}")

Running warm-up pass to stabilize GPU memory...

Processing file: /content/drive/My Drive/historical_reddit/aggregated_2023.parquet (Year: 2023)

Processing sentiment for ticker: TSLA


Ticker: TSLA Batches: 100%|██████████| 371/371 [2:07:26<00:00, 20.61s/it]


Saved results for TSLA to /content/drive/My Drive/historical_reddit//TSLA_2023.parquet
Cleared GPU cache and unnecessary variables for: /content/drive/My Drive/historical_reddit/aggregated_2023.parquet
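
The confidence value returned by `get_sentiment` is the softmax probability of the winning label's token at the next-token position. The sketch below reproduces that computation on a made-up, four-entry vocabulary using only the standard library; the logit values and the extra `"the"` token are hypothetical. It also shows an optional renormalization over the three labels, which the notebook itself does not apply.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a dict of token -> logit."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical next-token logits; a real vocabulary has far more entries
logits = {"Positive": 2.0, "Negative": 0.5, "Neutral": 1.0, "the": 1.5}
labels = ["Positive", "Negative", "Neutral"]

probs = softmax(logits)                            # probabilities over the full vocabulary
class_probs = {lab: probs[lab] for lab in labels}  # keep only the three labels
sentiment = max(class_probs, key=class_probs.get)
confidence = class_probs[sentiment]                # the value stored in the "<ticker>_logit" column

# Optional: renormalize so the three label probabilities sum to 1
label_total = sum(class_probs.values())
renormalized = {lab: p / label_total for lab, p in class_probs.items()}

print(sentiment, round(confidence, 3), round(renormalized[sentiment], 3))
```

Because the stored value is a probability over the whole vocabulary, it is a probability rather than a logit despite the column name, and a low value can simply mean that the model put weight on non-label tokens.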


In [None]:
import pandas as pd
df = pd.read_parquet('/content/drive/My Drive/historical_reddit/TSLA_2023.parquet')

### Example

In [None]:
df['TSLA_sentiment'].value_counts()

| TSLA_sentiment | count |
|---|---|
| Neutral | 143873 |
| Positive | 22284 |
| Negative | 19068 |
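
As a quick consistency check, the three counts above can be compared with the progress bar from the sentiment run, which reported 371 batches at a batch size of 500. The sketch below redoes that arithmetic with the same ceiling-division formula used in the batching loop; all numbers are copied from the outputs shown in this notebook.

```python
counts = {"Neutral": 143873, "Positive": 22284, "Negative": 19068}
batch_size = 500

total_rows = sum(counts.values())
num_batches = (total_rows + batch_size - 1) // batch_size  # ceiling division

print(total_rows)   # 185225
print(num_batches)  # 371, matching the tqdm progress bar
```

Keep in mind that rows with no TSLA text are labeled Neutral with probability 1.0 without calling the model, so the Neutral count includes minutes in which TSLA was not mentioned.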
