### **Prerequisites for Running the Models**

Before running this notebook, ensure the following steps are completed:

1. **Preprocess the Data**:  
   - The raw datasets must be cleaned, formatted, and preprocessed.  
   - Use the `data_preprocessing.ipynb` notebook to perform preprocessing.  
   - This notebook requires the cleaned datasets for both the tweets and the historical stock data.
   - Link to the preprocessing notebook: [data_preprocessing.ipynb](./data_preprocessing.ipynb).

2. **Python Environment**:  
   - Ensure Python 3.8 or higher is installed.

3. **Install Required Dependencies**:  
   - Run the following command to install all necessary dependencies:  
     ```bash
     pip install -r requirements.txt
     ```
    
### **Importing Required Modules**

The following code block imports all necessary Python libraries used for:
- Data handling, preprocessing, and visualization.
- Time-series and machine learning models (ARIMA, XGBoost, LSTM).
- Sentiment analysis and evaluation metrics.

In [11]:
# Modules used for general utilities
import unicodedata  # used later to normalize tweet text before VADER scoring
import warnings
from datetime import datetime, timedelta
from pickle import load, dump

# Data manipulation
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.dates import DateFormatter

# Time Series Modelling
from statsmodels.tsa.api import VAR
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
import statsmodels.api as sm

# Machine learning and preprocessing
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import (
    mean_squared_error, 
    mean_absolute_error, 
)
import xgboost as xgb
from xgboost import XGBRegressor

# Deep learning models
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    LSTM, Dense, Dropout
)
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

# Sentiment analysis
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import pipeline
nltk.download('vader_lexicon')

from tqdm import tqdm
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
warnings.filterwarnings("ignore")

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


### **Extracting Tweets for Specific Stocks**
The following code selects a specific stock (e.g., TSLA, AAPL, F) from the dataset and extracts the associated tweets. The sentiment of these tweets is analyzed and later integrated into the hybrid stock forecasting models.  
From this code block onward, every code block must be rerun once for each stock being analyzed.

#### Process Overview:

1. **Select a Stock Name**:  
   - The stock name (ticker symbol) must be specified in the code.
   - This process needs to be repeated for each stock under analysis:  
     - **TSLA** (high public discourse).  
     - **AAPL** (moderate public discourse).  
     - **F** (low public discourse).  

2. **Fill in the Company Name:** In addition to the ticker symbol, the company name must be specified in the code. It is used to build the candidate labels so that the LLM predicts the sentiment of each tweet toward the correct company. For instance, for `TSLA` as the `stock_name`, the `company_name` would be `Tesla`.
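
As a minimal sketch of how the company name feeds into the labels, the snippet below builds the three candidate labels with f-strings. The helper name `build_candidate_labels` is hypothetical and is not part of the notebook; the label wording follows the pattern used in the zero-shot cell.

```python
def build_candidate_labels(company_name):
    """Return the three zero-shot labels for a company (hypothetical helper)."""
    sentiments = ("positive", "negative", "neutral")
    # Every label is formatted with the company name, including the neutral one
    return [f"{sentiment} about {company_name}" for sentiment in sentiments]


labels = build_candidate_labels("Tesla")
print(labels)
```

Building all three labels from the same template avoids the case where one label silently keeps a literal `{company_name}` placeholder.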


#### Instructions for Repetition:
- To repeat this process for another stock, simply change the value of the `stock_name` and `company_name` variables to the desired ticker symbol (e.g., "AAPL" or "F") and company name (e.g., "Apple" or "Ford") and rerun the code block.
- Ensure the extracted tweet data is saved for each stock separately to avoid overwriting.
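
One way to keep per-stock outputs from overwriting each other is to include the ticker in the file name. This is only a sketch: the helper `output_path_for` and the `tweets_<ticker>.csv` naming scheme are assumptions, not something the notebook defines.

```python
from pathlib import Path


def output_path_for(stock_name, out_dir="../data"):
    """Build a per-ticker CSV path (hypothetical naming scheme)."""
    return Path(out_dir) / f"tweets_{stock_name}.csv"


# Each ticker maps to its own file, so rerunning for another stock
# does not overwrite earlier results
paths = [output_path_for(ticker) for ticker in ("TSLA", "AAPL", "F")]
for path in paths:
    print(path)
```

Under this assumption, the extracted tweets could be saved with `df_tweets_stock.to_csv(output_path_for(stock_name), index=False)`.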

#### Importance:
- This step is critical for isolating the public sentiment related to individual stocks.  
- Sentiment analysis results may vary significantly depending on the volume and context of tweets, making it essential to repeat this process for each stock individually.

In [12]:
stock_name = 'TSLA'
company_name = 'Tesla'
all_tweets = pd.read_csv(r'../data/preprocessed_tweets.csv')
df_tweets_stock = all_tweets[all_tweets['Stock Name'] == stock_name]
print(df_tweets_stock.shape)
df_tweets_stock.head()

(37422, 4)


Unnamed: 0,Date,Tweet,Stock Name,Company Name
0,2022-09-29 23:41:16+00:00,"Mainstream media has done an amazing job at brainwashing people. Today at work, we were asked what companies we believe in &amp; I said @Tesla because they make the safest cars &amp; EVERYONE disagreed with me because they heard“they catch on fire &amp; the batteries cost 20k to replace”",TSLA,"Tesla, Inc."
1,2022-09-29 23:24:43+00:00,Tesla delivery estimates are at around 364k from the analysts. $tsla,TSLA,"Tesla, Inc."
2,2022-09-29 23:18:08+00:00,"3/ Even if I include 63.0M unvested RSUs as of 6/30, additional equity needed for the RSUs is 63.0M x $54.20 = $3.4B. If the deal closed tomorrow at $54.20, Elon would need $2.0B for existing shares plus $3.4B for RSUs, so $5.4B new equity. $twtr $tsla",TSLA,"Tesla, Inc."
3,2022-09-29 22:40:07+00:00,"@RealDanODowd @WholeMarsBlog @Tesla Hahaha why are you still trying to stop Tesla FSD bro! Get your shit together and make something better? Thats how companies work, they competed. Crying big old ass fart clown!",TSLA,"Tesla, Inc."
4,2022-09-29 22:27:05+00:00,"@RealDanODowd @Tesla Stop trying to kill kids, you sad deranged old man",TSLA,"Tesla, Inc."


### **Sampling Tweets**
Running sentiment analysis on thousands of tweets is time-consuming, so a sample of 100 tweets is used for this step. Sampling keeps the dataset consistent and manageable, and a uniform random sample is intended to remain representative of the full set.

#### **Process Overview**
1. **Random Sampling:**
   - If the total number of tweets in the dataset exceeds 100, a random sample of 100 tweets is selected.
   - If there are 100 or fewer tweets, all of them are used.

2. **Reproducibility:**
   - A fixed `random_state` (set to 42) ensures that the random sampling produces the same subset of tweets on every run.

In [13]:
if df_tweets_stock.shape[0] > 100:
    df_tweets_stock_test = df_tweets_stock.sample(n=100, random_state=42)
else:
    df_tweets_stock_test = df_tweets_stock.sample(n=df_tweets_stock.shape[0], random_state=42)

### **Sentiment Analysis using VADER & Zero Shot Text Classification**
⚠️ **Warning:** Zero-shot text classification can take hours to run. To see the results without running it, view the following images:

* [Results for TSLA](../results/RMSE%20Table/Results%20TSLA.png)
* [Results for AAPL](../results/RMSE%20Table/Results%20AAPL.png)
* [Results for F](../results/RMSE%20Table/Results%20F.png)

In [None]:
%%time
# Initialize VADER sentiment analyzer
sentiment_analyzer = SentimentIntensityAnalyzer()

# Iterate over the rows of the DataFrame
for indx, row in df_tweets_stock_test.iterrows():
    try:
        # Normalize the tweet text
        sentence_i = unicodedata.normalize('NFKD', row['Tweet'])

        # Get sentiment scores
        sentence_sentiment = sentiment_analyzer.polarity_scores(sentence_i)

        # Assign sentiment scores to the DataFrame
        # (column name matches 'Sentiment Score VADER' used in the later cells)
        df_tweets_stock_test.at[indx, 'Sentiment Score VADER'] = sentence_sentiment['compound']
        df_tweets_stock_test.at[indx, 'Negative VADER'] = sentence_sentiment['neg']
        df_tweets_stock_test.at[indx, 'Neutral VADER'] = sentence_sentiment['neu']
        df_tweets_stock_test.at[indx, 'Positive VADER'] = sentence_sentiment['pos']

    except TypeError:
        print(df_tweets_stock_test.loc[indx, 'Tweet'])
        print(indx)
        break
    
# Define the thresholds
threshold_positive = 1/3
threshold_negative = -1/3

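# Add the Predicted Label column based on the Sentiment Score.
# Note: 'Sentiment Score LLM' is created by the zero-shot classification cell,
# so that cell must be run first; otherwise this lookup raises a KeyError.
df_tweets_stock_test['Predicted Label LLM'] = df_tweets_stock_test['Sentiment Score LLM'].apply(
    lambda x: 1 if x > threshold_positive else (-1 if x < threshold_negative else 0)
)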

def predict_label(row):
    if row['Positive VADER'] > row['Negative VADER'] and row['Positive VADER'] > row['Neutral VADER']:
        return 1
    elif row['Negative VADER'] > row['Positive VADER'] and row['Negative VADER'] > row['Neutral VADER']:
        return -1
    else:
        return 0

# Apply the function to each row
df_tweets_stock_test['Predicted Label VADER'] = df_tweets_stock_test.apply(predict_label, axis=1)

In [14]:
pipe = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")

# Define the candidate labels globally. Each label is an f-string so that the
# company name is substituted into all three labels.
positive_label = f"positive about {company_name}"
negative_label = f"negative about {company_name}"
neutral_label = f"neutral about {company_name}"
candidate_labels = [positive_label, negative_label, neutral_label]

def classify_sentiment_about_company(tweet):
    # Perform zero-shot classification
    result = pipe(tweet, candidate_labels=candidate_labels)

    # Extract the score for each sentiment, scaled by 2/3
    scores = dict(zip(result['labels'], result['scores']))
    positive_score = scores.get(positive_label, 0) * 2 / 3
    negative_score = scores.get(negative_label, 0) * 2 / 3
    neutral_score = scores.get(neutral_label, 0) * 2 / 3

    # Determine the overall sentiment based on the highest score
    if positive_score >= negative_score and positive_score >= neutral_score:
        sentiment_score = 1 / 3 + positive_score
    elif negative_score >= positive_score and negative_score >= neutral_score:
        sentiment_score = -1 / 3 - negative_score
    else:
        sentiment_score = positive_score - negative_score

    return sentiment_score, negative_score, neutral_score, positive_score

%%time
df_tweets_stock_test[['Sentiment Score LLM', 'Negative LLM', 'Neutral LLM', 'Positive LLM']] = df_tweets_stock_test['Tweet'].apply(
    lambda tweet: pd.Series(classify_sentiment_about_company(tweet))
)

RuntimeError: Failed to import transformers.models.bart.modeling_tf_bart because of the following error (look up to see its traceback):
Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.

In [None]:
%%time

# Initialize RoBERTa sentiment pipeline and VADER sentiment analyzer
sentiment_pipe = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")
sentiment_analyzer = SentimentIntensityAnalyzer()

def classify_sentiment_roberta(tweet):
    # Apply RoBERTa-based sentiment analysis
    result = sentiment_pipe(tweet)[0]
    label = result['label']
    score = result['score']

    # Set scores based on the label
    positive_score_llm = score if label == "LABEL_2" else 0
    negative_score_llm = score if label == "LABEL_0" else 0
    neutral_score_llm = score if label == "LABEL_1" else 0

    # Compound sentiment score for LLM (-1 to 1)
    sentiment_score_llm = positive_score_llm - negative_score_llm

    return sentiment_score_llm, negative_score_llm, neutral_score_llm, positive_score_llm

def classify_sentiment_vader(tweet):
    # Normalize text and apply VADER sentiment analysis
    normalized_tweet = unicodedata.normalize('NFKD', tweet)
    vader_result = sentiment_analyzer.polarity_scores(normalized_tweet)

    # Extract VADER scores
    sentiment_score_vader = vader_result['compound']
    negative_score_vader = vader_result['neg']
    neutral_score_vader = vader_result['neu']
    positive_score_vader = vader_result['pos']

    return sentiment_score_vader, negative_score_vader, neutral_score_vader, positive_score_vader

def analyze_tweet_sentiments(df):
    # Sample 10 tweets per day and calculate sentiment scores for both LLM and VADER
    sample = df.sample(n=min(10, len(df)), random_state=1)  # random_state for reproducibility

    # Apply RoBERTa-based sentiment analysis and VADER
    llm_scores = sample['Tweet'].apply(classify_sentiment_roberta)
    vader_scores = sample['Tweet'].apply(classify_sentiment_vader)

    # Convert results to DataFrame and compute averages for each score
    llm_scores_df = pd.DataFrame(llm_scores.tolist(), columns=['Sentiment Score LLM', 'Negative LLM', 'Neutral LLM', 'Positive LLM'])
    vader_scores_df = pd.DataFrame(vader_scores.tolist(), columns=['Sentiment Score VADER', 'Negative VADER', 'Neutral VADER', 'Positive VADER'])

    # Concatenate LLM and VADER score DataFrames and return the mean of each column
    combined_df = pd.concat([llm_scores_df, vader_scores_df], axis=1)
    return combined_df.mean()

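# The grouping below needs a 'Date_day' column; derive it here if it does not exist yet
if 'Date_day' not in df_tweets_stock.columns:
    df_tweets_stock = df_tweets_stock.copy()
    df_tweets_stock['Date'] = pd.to_datetime(df_tweets_stock['Date'])
    df_tweets_stock['Date_day'] = df_tweets_stock['Date'].dt.date

# Apply sampling and sentiment analysis, then group by date and company
df_tweets_stock_daily = df_tweets_stock.groupby(['Date_day', 'Company Name', 'Stock Name']).apply(analyze_tweet_sentiments).reset_index()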

### **Merging Tweet Sentiment Data with Stock Price Data**
This step prepares the final dataset by merging the processed tweet sentiment data with the preprocessed stock price data on the date. The result is a unified dataset that combines sentiment and stock price movements, enabling further analysis and the development of the time series models.

#### **Resulting Dataset**

- **Merged Dataset**: The resulting dataset, `final_df`, contains stock price data (e.g., closing price, volume) alongside tweet sentiment scores and labels.
- **Preview of Data**: The `head()` function is used to display the first few rows of the final dataset.
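
The inner merge on date can be illustrated with a small sketch. The two toy frames below use made-up values purely for illustration; only the column names `Date_day`, `Stock Name`, `Adj Close`, and `Sentiment Score VADER` come from the notebook.

```python
import pandas as pd

# Toy price data (values are invented for illustration)
prices = pd.DataFrame({
    "Date_day": ["2022-09-28", "2022-09-29", "2022-09-30"],
    "Stock Name": ["TSLA", "TSLA", "TSLA"],
    "Adj Close": [100.0, 101.5, 99.0],
})

# Toy daily sentiment data; note there is no row for 2022-09-30
sentiment = pd.DataFrame({
    "Date_day": ["2022-09-28", "2022-09-29"],
    "Stock Name": ["TSLA", "TSLA"],
    "Sentiment Score VADER": [0.25, -0.10],
})

# An inner join keeps only the days present in both frames
merged = prices.merge(sentiment, how="inner", on=["Date_day", "Stock Name"])
print(merged)
```

Because the join is an inner join, trading days without any tweets are dropped from the merged dataset rather than filled with missing values.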



In [None]:
# Work on copies so the new columns are not written to a slice of another DataFrame
df_tweets_stock = df_tweets_stock.copy()
df_tweets_stock['Date'] = pd.to_datetime(df_tweets_stock['Date'])
df_tweets_stock['Date_day'] = df_tweets_stock['Date'].dt.date

# Load the preprocessed prices and keep only the selected stock
df_prices = pd.read_csv('../data/preprocessed_stock_data.csv')
df_price_stock_daily = df_prices[df_prices['Stock Name'] == stock_name].copy()
df_price_stock_daily['Date'] = pd.to_datetime(df_price_stock_daily['Date'])
df_price_stock_daily['Date_day'] = df_price_stock_daily['Date'].dt.date

# Inner join keeps only the days that have both price data and tweet sentiment
final_df = df_price_stock_daily.merge(df_tweets_stock_daily, how="inner", on=["Date_day", "Stock Name"])
final_df = final_df.drop(columns=['Stock Name', 'Date'])
final_df.head()

### **Evaluation of feature sets**

This section evaluates the performance of three models (**ARIMA/VAR**, **XGBoost**, and **LSTM**) across multiple feature sets. Each feature set combines stock price data with a different sentiment score to show its impact on model performance.

---

#### **Feature Sets**
The feature sets used for evaluation include:
1. Control set: `["Adj Close"]` (only stock prices).
2. Sentiment scores from various methods:
   - **VADER**: Positive, Neutral, Negative, and Compound Sentiment Scores.
   - **ZeroShot**: Positive, Neutral, Negative, and Compound Sentiment Scores.
   - **RoBERTa**: Compound Sentiment Score.

Each feature set is iteratively passed through all three models for evaluation.

---

#### **Model Implementations**

1. **ARIMA/VAR Model**:
   - For single-feature sets (e.g., `["Adj Close"]`), an ARIMA model is applied to predict future stock prices.
   - For multi-feature sets, a VAR (Vector AutoRegression) model is used to account for interactions between features.
   - Forecasting is done for the test set.

2. **XGBoost with Lagged Variables**:
   - Lagged variables are created for each feature in the set, representing prior values over a window of 4 time steps.
   - Features and target values (`Adj Close`) are split into training and testing sets.
   - The **XGBoost Regressor** is trained on the lagged data and used for predictions.

3. **LSTM Model**:
   - Data is normalized using a MinMaxScaler to scale values between 0 and 1.
   - A sequence generator creates rolling sequences of the past 10 time steps for each feature.
   - A multi-layer **LSTM network** with a dropout layer and dense output layer is defined and trained on the sequences.
   - Predictions are generated for the test set and scaled back to the original range.
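
The lagged-variable step used for XGBoost can be sketched on a toy series. The helper `add_lags` is hypothetical and the prices are invented; only the window of 4 time steps and the `<feature>_lag<k>` naming follow the notebook.

```python
import pandas as pd


def add_lags(df, features, n_lags=4):
    """Append <feature>_lag1..<feature>_lagN columns (hypothetical helper)."""
    out = df.copy()
    for lag in range(1, n_lags + 1):
        for feature in features:
            # shift(lag) moves values down, so row t holds the value from t - lag
            out[f"{feature}_lag{lag}"] = out[feature].shift(lag)
    # The first n_lags rows have no complete history, so drop them
    return out.dropna()


# Toy prices, invented for illustration
toy = pd.DataFrame({"Adj Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
lagged = add_lags(toy, ["Adj Close"])
print(lagged)
```

With six rows and four lags, only the last two rows have a full history, which is why the training set shrinks by the window length.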

---

#### **Model Evaluation**
A custom evaluation function computes the following metrics for each model and feature set:
- **RMSE (Root Mean Squared Error)**: Measures the average prediction error magnitude.
- **MAE (Mean Absolute Error)**: Measures the average absolute prediction error.

These metrics are appended to a `results` list for all models and feature sets.
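
To make the difference between the two metrics concrete, here is a plain-Python sketch. The notebook itself uses scikit-learn's `mean_squared_error` and `mean_absolute_error`; the functions `rmse` and `mae` below are illustrative equivalents, and the numbers are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


# Three perfect predictions and one that is off by 4
y_true = [100.0, 102.0, 104.0, 106.0]
y_pred = [100.0, 102.0, 104.0, 110.0]

print("RMSE:", rmse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
```

In this example a single large error pushes RMSE to twice the MAE, which illustrates that RMSE penalizes large deviations more heavily.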

---

#### **Results Overview**
The results, together with an extensive analysis and discussion that justifies them, are presented in the original research paper titled _"Using LLM to Predict Stock Price: A Hybrid Model Combining Social Media Sentiment and Market Data"_.

Images displaying previous results:

* [Results for TSLA](../results/RMSE%20Table/Results%20TSLA.png)
* [Results for AAPL](../results/RMSE%20Table/Results%20AAPL.png)
* [Results for F](../results/RMSE%20Table/Results%20F.png)

In [None]:
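# Every column named below must exist in `final_df`. The daily aggregation cell
# names its RoBERTa-based columns '... LLM', so column names may need to be
# aligned with these entries before running.
feature_sets = [
    ["Adj Close"], # control set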
    ["Adj Close", "Sentiment Score Roberta"],
    ["Adj Close", "Sentiment Score ZeroShot"],
    ["Adj Close", "Sentiment Score VADER"],
    ["Adj Close", "Negative ZeroShot"],
    ["Adj Close", "Neutral ZeroShot"],
    ["Adj Close", "Positive ZeroShot"],
    ["Adj Close", "Negative VADER"],
    ["Adj Close", "Neutral VADER"],
    ["Adj Close", "Positive VADER"],
]

def evaluate_model(y_true, y_pred, model_name, feature_set):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    return {"Model": model_name, "Feature Set": feature_set, "RMSE": rmse, "MAE": mae}

results = []
def arima_var_model(train, test, feature_set):
    if len(feature_set) == 1:
        # Single feature: univariate ARIMA(5, 2, 0) on the price series
        model = ARIMA(train[feature_set[0]], order=(5, 2, 0))
        model_fit = model.fit()
        y_pred = model_fit.forecast(steps=len(test))
        # Use the raw values so the forecast is aligned to test.index by position
        return pd.Series(np.asarray(y_pred), index=test.index)
    else:
        # Multiple features: VAR with 5 lags, seeded with the last 5 training rows
        model = VAR(train[feature_set])
        model_fit = model.fit(5)
        y_pred = model_fit.forecast(train[feature_set].values[-5:], steps=len(test))
        y_pred = pd.DataFrame(y_pred, index=test.index, columns=feature_set)
        return y_pred['Adj Close']

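# XGBoost with Lagged Variables
def xgboost_model(train, test, feature_set):
    # Keep only the columns of the current feature set. Using every numeric column
    # of final_df would feed other price and sentiment columns into the model.
    train = train[feature_set].copy()
    test = test[feature_set].copy()

    # Create lagged copies of each feature over a window of 4 time steps
    for lag in range(1, 5):
        for feature in feature_set:
            train[f"{feature}_lag{lag}"] = train[feature].shift(lag)
            test[f"{feature}_lag{lag}"] = test[feature].shift(lag)
    # Drop training rows without a full lag history. Test rows are kept so that the
    # predictions line up with test['Adj Close']; XGBoost accepts the missing lags.
    train.dropna(inplace=True)

    X_train, y_train = train.drop('Adj Close', axis=1), train['Adj Close']
    X_test, y_test = test.drop('Adj Close', axis=1), test['Adj Close']

    model = XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1, enable_categorical=False)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_pred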

# LSTM Model with Parameter Tuning
def lstm_model(train, test, feature_set):
    scaler = MinMaxScaler()
    train_scaled = scaler.fit_transform(train[feature_set + ['Adj Close']])
    test_scaled = scaler.transform(test[feature_set + ['Adj Close']])
    sequence_length = 10  
    generator = TimeseriesGenerator(train_scaled[:, :-1], train_scaled[:, -1], length=sequence_length, batch_size= sequence_length)

    # Define LSTM model
    model = Sequential()
    model.add(LSTM(32, activation='relu', return_sequences=True, input_shape=(sequence_length, len(feature_set))))
    model.add(Dropout(0.1))
    model.add(LSTM(16, activation='relu'))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mse')

    model.fit(generator, epochs=50, verbose=0)
    test_generator = TimeseriesGenerator(test_scaled[:, :-1], test_scaled[:, -1], length=sequence_length, batch_size=1)
    y_pred_scaled = model.predict(test_generator)
    y_pred = scaler.inverse_transform(np.concatenate([test_scaled[sequence_length:, :-1], y_pred_scaled], axis=1))[:, -1]
    return pd.Series(y_pred, index=test.index[sequence_length:]), sequence_length


# Loop through each feature set 
for feature_set in feature_sets:
    train = final_df.iloc[:int(0.8 * len(final_df))].copy()
    test = final_df.iloc[int(0.8 * len(final_df)):].copy()

    # ARIMA Model for Linear Auto Regression
    y_pred = arima_var_model(train, test, feature_set)
    results.append(evaluate_model(test['Adj Close'], y_pred, "ARIMA/VAR", feature_set))

    # XGBoost with Lagged Variables
    y_pred= xgboost_model(train, test, feature_set)
    results.append(evaluate_model(test['Adj Close'], y_pred, "XGBoost", feature_set))

    # LSTM 
    y_pred, sequence_len  = lstm_model(train, test, feature_set)
    results.append(evaluate_model(test.iloc[sequence_len:]['Adj Close'], y_pred, "LSTM", feature_set))

results_df = pd.DataFrame(results)
results_df['Feature Set'] = results_df['Feature Set'].apply(lambda x: ', '.join(x))
pivot_df = results_df.pivot(index='Feature Set', columns='Model', values=['RMSE', 'MAE']).reset_index().sort_values('Feature Set')
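# Flatten the (metric, model) column pairs; the index column has an empty model part
pivot_df.columns = [f"{metric}_{model}" if model else metric for metric, model in pivot_df.columns]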
print(pivot_df)


### **Correlation Heatmaps**
A correlation heatmap is a visualization displaying the correlation coefficients (r) that variables have with each other. These values range from -1 to 1:
* **1** means a perfect positive correlation: as one variable increases, the other increases in a perfectly linear fashion.
* **0** means no correlation: changes in one variable do not predict changes in the other variable.
* **-1** means a perfect negative correlation: as one variable increases, the other decreases in a perfectly linear fashion.

This is particularly useful for examining how the accuracies of the different models relate to one another across the sentiment feature sets, and for assessing the hypothesis of the research paper.
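
The three reference values of r can be reproduced with a small sketch using `DataFrame.corr()`, the same call the notebook uses. The toy columns `x`, `up`, and `down` are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "up": [2.0, 4.0, 6.0, 8.0],    # rises linearly with x
    "down": [8.0, 6.0, 4.0, 2.0],  # falls linearly with x
})

# Pearson correlation is the default method of DataFrame.corr()
corr = toy.corr()
print(corr.round(2))
```

Here `x` and `up` are perfectly positively correlated, while `x` and `down` are perfectly negatively correlated; real model scores will typically fall somewhere in between.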

#### **Results Overview**
Images displaying previous results:
* [Results for TSLA](../results/Correlation%20Heatmaps/Correlation%20Heatmap%20TSLA.png)
* [Results for AAPL](../results/Correlation%20Heatmaps/Correlation%20Heatmap%20AAPL.png)
* [Results for F](../results/Correlation%20Heatmaps/Correlation%20Heatmap%20F.png)


In [None]:
rmse_columns = [col for col in pivot_df.columns if col.startswith('RMSE_')]
rmse_df = pivot_df[rmse_columns]
accuracy_df = 100 - rmse_df
accuracy_df.columns = [col.replace('RMSE_', '') for col in accuracy_df.columns]
correlation_matrix = accuracy_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of Models with Sentiment')
plt.show()