### **Prerequisites for Comparing the Models**

Before running this notebook, ensure the following steps are completed:

1. **Preprocess the Data**:  
   - The raw datasets must be cleaned, formatted, and preprocessed.  
   - Use the `data_preprocessing.ipynb` notebook to perform preprocessing.  
   - This notebook requires the cleaned datasets for both the tweets and the historical stock data.
   - Link to the preprocessing notebook: [data_preprocessing.ipynb](./data_preprocessing.ipynb).

2. **Python Environment**:  
   - Ensure Python 3.8 or higher is installed.

3. **Install Required Dependencies**:  
   - Run the following command to install all necessary dependencies:  
     ```bash
     pip install -r requirements.txt
     ```
    

In [2]:
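import unicodedata  # used to normalize tweet text before VADER scoring

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    confusion_matrix,
    classification_report
)
from transformers import pipeline
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# VADER needs its lexicon; if it is missing, run nltk.download('vader_lexicon') once.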


### **Sampling Tweets from a Selected Stock**
To compare the predictive abilities of VADER and zero-shot text classification, 100 tweets are sampled for a stock of your choice. If fewer than 100 tweets exist for that stock, all of them are used.

#### Process Overview:

1. **Select a Stock Name**:  
   - The stock name (ticker symbol) must be specified in the code.
   - This process needs to be repeated for each stock under analysis:  
     - **TSLA** (high public discourse).  
     - **AAPL** (moderate public discourse).  
     - **F** (low public discourse).  

2. **Fill in the Company Name**: In addition to the ticker symbol, the company name must be specified in the code. The company name is inserted into the candidate labels (for example, "positive about Tesla") so that the LLM judges the sentiment of each tweet toward the correct company. For instance, when `stock_name` is `TSLA`, `company_name` should be `Tesla`.


#### Instructions for Repetition:
- To repeat this process for another stock, simply change the value of the `stock_name` and `company_name` variables to the desired ticker symbol (e.g., "AAPL" or "F") and company name (e.g., "Apple" or "Ford") and rerun the code block.
- Ensure the extracted tweet data is saved for each stock separately to avoid overwriting.
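
One way to keep the per-stock results apart is to derive the output file name from the ticker. The sketch below is an assumption rather than part of the original workflow: the `STOCKS` mapping, the `labelled_sample_path` helper, and the `../data/labelled` directory are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Ticker -> company name for the three stocks under analysis.
STOCKS = {"TSLA": "Tesla", "AAPL": "Apple", "F": "Ford"}

def labelled_sample_path(stock_name, out_dir="../data/labelled"):
    """Return a per-stock CSV path (hypothetical layout) so runs do not overwrite each other."""
    if stock_name not in STOCKS:
        raise KeyError(f"Unknown ticker: {stock_name}")
    return Path(out_dir) / f"sampled_tweets_{stock_name}.csv"

stock_name = "TSLA"
company_name = STOCKS[stock_name]
output_path = labelled_sample_path(stock_name)
print(company_name, output_path.name)  # Tesla sampled_tweets_TSLA.csv

# After labelling, the sample could be written with:
# df_tweets_stock_test.to_csv(output_path)
```

Looking up `company_name` from the ticker also keeps the two variables from drifting apart when switching stocks.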


In [10]:
stock_name = 'TSLA'
company_name = 'Tesla'
all_tweets = pd.read_csv(r'../data/preprocessed_tweets.csv')
df_tweets_stock = all_tweets[all_tweets['Stock Name'] == stock_name]
# Sample up to 100 tweets; if fewer exist for this stock, use all of them
n_samples = min(100, df_tweets_stock.shape[0])
df_tweets_stock_test = df_tweets_stock.sample(n=n_samples, random_state=42)

print(df_tweets_stock_test.shape)
df_tweets_stock_test.head()

(100, 4)


Unnamed: 0,Date,Tweet,Stock Name,Company Name
26114,2022-01-03 03:52:56+00:00,my 2022 tsla production and delivery estimate ...,TSLA,"Tesla, Inc."
25357,2022-01-07 15:54:09+00:00,bought more at 1032 in both accounts this is t...,TSLA,"Tesla, Inc."
15159,2022-04-18 02:52:48+00:00,and are philanthropy if you say philanthropy i...,TSLA,"Tesla, Inc."
32460,2021-11-06 22:26:19+00:00,pure speculation dumps 10 shares leaves him wi...,TSLA,"Tesla, Inc."
4285,2022-08-08 15:49:45+00:00,youre welcome spx qqq amzn amd nvda tsla iwm,TSLA,"Tesla, Inc."
...,...,...,...,...
32498,2021-11-06 19:51:59+00:00,lol he has to sell tsla 10 anyway due to his o...,TSLA,"Tesla, Inc."
23234,2022-01-26 04:30:00+00:00,elon musk will deliver tsla product roadmap up...,TSLA,"Tesla, Inc."
3388,2022-08-17 02:25:42+00:00,since aug 2nd tweet below tsla stock is up 10 ...,TSLA,"Tesla, Inc."
15027,2022-04-19 14:34:26+00:00,market sampp500 has headwinds which are not im...,TSLA,"Tesla, Inc."


In [None]:
%%time
# Initialize VADER sentiment analyzer
sentiment_analyzer = SentimentIntensityAnalyzer()

# Iterate over the rows of the DataFrame
for indx, row in df_tweets_stock_test.iterrows():
    try:
        # Normalize the tweet text
        sentence_i = unicodedata.normalize('NFKD', row['Tweet'])

        # Get sentiment scores
        sentence_sentiment = sentiment_analyzer.polarity_scores(sentence_i)

        # Assign sentiment scores to the DataFrame
        df_tweets_stock_test.at[indx, 'Sentiment_score VADER'] = sentence_sentiment['compound']
        df_tweets_stock_test.at[indx, 'Negative VADER'] = sentence_sentiment['neg']
        df_tweets_stock_test.at[indx, 'Neutral VADER'] = sentence_sentiment['neu']
        df_tweets_stock_test.at[indx, 'Positive VADER'] = sentence_sentiment['pos']

    except TypeError:
        print(df_tweets_stock_test.loc[indx, 'Tweet'])
        print(indx)
        break

In [None]:
pipe = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")

# Define the candidate labels globally; they depend on company_name,
# so rerun this cell after switching to another stock
positive_label = f"positive about {company_name}"
negative_label = f"negative about {company_name}"
neutral_label = f"neutral about {company_name}"
candidate_labels = [positive_label, negative_label, neutral_label]

def classify_sentiment(tweet):
    # Perform zero-shot classification
    result = pipe(tweet, candidate_labels=candidate_labels)

    # Extract the scores for each sentiment, scaled to the range [0, 2/3]
    positive_score = 0
    negative_score = 0
    neutral_score = 0

    for label, score in zip(result['labels'], result['scores']):
        if label == positive_label:
            positive_score = score * 2 / 3
        elif label == negative_label:
            negative_score = score * 2 / 3
        elif label == neutral_label:
            neutral_score = score * 2 / 3

    # Determine the overall sentiment based on the highest score
    if positive_score >= negative_score and positive_score >= neutral_score:
        sentiment_score = 1 / 3 + positive_score
    elif negative_score >= positive_score and negative_score >= neutral_score:
        sentiment_score = -1 / 3 - negative_score
    else:
        sentiment_score = positive_score - negative_score

    return sentiment_score, negative_score, neutral_score, positive_score

%%time
df_tweets_stock_test[['Sentiment Score LLM', 'Negative LLM', 'Neutral LLM', 'Positive LLM']] = df_tweets_stock_test['Tweet'].apply(
    lambda tweet: pd.Series(classify_sentiment(tweet))
)
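
To make the score mapping easier to follow, the sketch below applies the same arithmetic to hand-picked probabilities instead of model output. The helper name `combine_scores` and the example probabilities are illustrative assumptions, not values produced by the model.

```python
def combine_scores(positive, negative, neutral):
    """Map three class probabilities (summing to 1) to a single sentiment score."""
    # Scale each probability to [0, 2/3], as in the classification function above.
    pos, neg, neu = (p * 2 / 3 for p in (positive, negative, neutral))
    if pos >= neg and pos >= neu:
        return 1 / 3 + pos   # positive wins: score above 1/3
    if neg >= pos and neg >= neu:
        return -1 / 3 - neg  # negative wins: score below -1/3
    return pos - neg         # neutral wins: score strictly between -1/3 and 1/3

print(round(combine_scores(0.7, 0.2, 0.1), 3))  # 0.8
print(round(combine_scores(0.1, 0.6, 0.3), 3))  # -0.733
print(round(combine_scores(0.3, 0.2, 0.5), 3))  # 0.067
```

Because each winning class lands in its own band, cutoffs at 1/3 and -1/3 are enough to recover the predicted class from the combined score.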

### **Manual Sentiment Analysis**
For each of the sampled tweets, you must now enter the true sentiment. These manual labels serve as the ground truth for the classification reports and confusion matrices of the two sentiment analysis models.

In [None]:
df_tweets_stock_test['Manual Label'] = ''

# Loop through each row in the sampled DataFrame
total = len(df_tweets_stock_test)
for position, (index, row) in enumerate(df_tweets_stock_test.iterrows(), start=1):
    # Display the tweet with its position in the sample
    # (the DataFrame index comes from the full dataset and is not sequential)
    print(f"Tweet {position}/{total}: {row['Tweet']}")

    # Prompt the user to input the label (-1, 0, or 1)
    while True:
        try:
            label = int(input("Enter label (-1 for Negative, 0 for Neutral, 1 for Positive): "))
            if label in [-1, 0, 1]:
                break
            else:
                print("Please enter a valid label (-1, 0, or 1).")
        except ValueError:
            print("Invalid input. Please enter a number (-1, 0, or 1).")

    # Assign the label to the 'Manual Label' column
    df_tweets_stock_test.at[index, 'Manual Label'] = label
    print("\n")

# Convert the labels to integers, then display the updated DataFrame
df_tweets_stock_test['Manual Label'] = df_tweets_stock_test['Manual Label'].astype(int)
df_tweets_stock_test.head()

In [None]:
threshold_positive = 1/3
threshold_negative = -1/3

df_tweets_stock_test['Predicted Label LLM'] = df_tweets_stock_test['Sentiment Score LLM'].apply(
    lambda x: 1 if x > threshold_positive else (-1 if x < threshold_negative else 0)
)

def predict_label(row):
    if row['Positive VADER'] > row['Negative VADER'] and row['Positive VADER'] > row['Neutral VADER']:
        return 1
    elif row['Negative VADER'] > row['Positive VADER'] and row['Negative VADER'] > row['Neutral VADER']:
        return -1
    else:
        return 0

# Apply the function to each row
df_tweets_stock_test['Predicted Label VADER'] = df_tweets_stock_test.apply(predict_label, axis=1)
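
The rule above compares VADER's `pos`, `neg`, and `neu` proportions directly. A commonly used alternative, suggested in VADER's own documentation, derives the label from the `compound` score with cutoffs at ±0.05. The sketch below shows that alternative for comparison only; the function name `label_from_compound` is hypothetical, and this is not the rule this notebook applies.

```python
def label_from_compound(compound, cutoff=0.05):
    """Label a VADER compound score as 1 (positive), -1 (negative), or 0 (neutral)."""
    if compound >= cutoff:
        return 1
    if compound <= -cutoff:
        return -1
    return 0

print([label_from_compound(c) for c in (0.62, 0.0, -0.31, 0.04)])  # [1, 0, -1, 0]
```

If desired, it could be applied to the existing column with `df_tweets_stock_test['Sentiment_score VADER'].apply(label_from_compound)` and evaluated alongside the other predictions.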

In [None]:
conf_matrix_llm = confusion_matrix(df_tweets_stock_test['Manual Label'], df_tweets_stock_test['Predicted Label LLM'], labels=[1, 0, -1])
conf_matrix_vader = confusion_matrix(df_tweets_stock_test['Manual Label'], df_tweets_stock_test['Predicted Label VADER'], labels=[1, 0, -1])
def plot_confusion_matrix(cm, labels, title):
    df_cm = pd.DataFrame(cm, index=labels, columns=labels)
    plt.figure(figsize=(8, 6))
    sns.heatmap(df_cm, annot=True, fmt='d', cmap='Blues', cbar=False,
                xticklabels=labels, yticklabels=labels)
    plt.title(title)
    plt.xlabel('Predicted Label')
    plt.ylabel('Actual Label')
    plt.show()

# Plot confusion matrices
plot_confusion_matrix(conf_matrix_llm, labels=[1, 0, -1], title='Confusion Matrix for Predicted Label LLM')
plot_confusion_matrix(conf_matrix_vader, labels=[1, 0, -1], title='Confusion Matrix for Predicted Label VADER')

# Generate classification reports
report_llm = classification_report(df_tweets_stock_test['Manual Label'], df_tweets_stock_test['Predicted Label LLM'], labels=[1, 0, -1], target_names=['Class 1', 'Class 0', 'Class -1'], output_dict=True)
report_vader = classification_report(df_tweets_stock_test['Manual Label'], df_tweets_stock_test['Predicted Label VADER'], labels=[1, 0, -1], target_names=['Class 1', 'Class 0', 'Class -1'], output_dict=True)

# Print classification reports
print("Classification Report for Predicted Label LLM:")
print(classification_report(df_tweets_stock_test['Manual Label'], df_tweets_stock_test['Predicted Label LLM'], labels=[1, 0, -1], target_names=['Class 1', 'Class 0', 'Class -1']))

print("\nClassification Report for Predicted Label VADER:")
print(classification_report(df_tweets_stock_test['Manual Label'], df_tweets_stock_test['Predicted Label VADER'], labels=[1, 0, -1], target_names=['Class 1', 'Class 0', 'Class -1']))