# Sentiment Analysis

Goal: Scrape data from a website such as Bloomberg (all the bolded text), separate the news into categories, then assign each item a sentiment value.

spaCy was used for preprocessing.

The following libraries are to be explored:
1. VADER
2. TextBlob
3. Flair
4. Models - RoBERTa (HuggingFace), DistilBERT (HuggingFace)
5. LLM
6. Self-built model (self-sourced dataset)

Plan:
1. Compare models 1-5
2. Use the best available model to assign values to the data
3. Train our own model on the labeled dataset
4. Compare and conclude
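
For step 1, one possible way to compare the models is to wrap each of them in a function that returns a label, then score them all on the same labeled headlines. The sketch below is only an illustration: `compare_models`, the two toy predictors, and the hand-written sample labels are hypothetical and not part of the original notebook.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical helper: score several label-returning predictors on the same data.
def compare_models(
    predictors: Dict[str, Callable[[str], str]],
    labeled: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return the accuracy of each predictor on (headline, gold_label) pairs."""
    results = {}
    for name, predict in predictors.items():
        correct = sum(1 for text, gold in labeled if predict(text) == gold)
        results[name] = correct / len(labeled) if labeled else 0.0
    return results


# Toy stand-ins for real models (assumptions for illustration only).
def always_neutral(text: str) -> str:
    return "neutral"


def keyword_model(text: str) -> str:
    lowered = text.lower()
    if "crisis" in lowered:
        return "negative"
    if "gain" in lowered:
        return "positive"
    return "neutral"


# Illustrative labels written by hand, not taken from any dataset.
sample = [
    ("Trump to Leave G-7 Tonight Due to Middle East Crisis", "negative"),
    ("Stocks gain after earnings beat", "positive"),
    ("Central bank holds rates steady", "neutral"),
]

accuracy = compare_models(
    {"always_neutral": always_neutral, "keyword": keyword_model}, sample
)
print(accuracy)
```

In practice each real library (VADER, TextBlob, Flair, the transformers pipelines) would need a thin wrapper that converts its own output into one of the three labels before being passed in.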

## Exploring the Python libraries

In [None]:
import nltk

# Download the NLTK resources that preprocess() depends on
nltk.download('wordnet')      # dictionary used by the lemmatizer
nltk.download('stopwords')    # English stop-word list
nltk.download('punkt_tab')    # tokenizer models used by word_tokenize

In [None]:
# setup

sentence = "Trump to Leave G-7 Tonight Due to Middle East Crisis"

# Preprocessing

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize NLTK tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocess text
def preprocess(text):
    # Non-string input (None, NaN cells, numbers) yields an empty string
    if not isinstance(text, str):
        return ""
    tokens = word_tokenize(text.lower())  # Tokenize and lowercase
    # Keep alphabetic, non-stop-word tokens and lemmatize them
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]
    return " ".join(cleaned_tokens)

test_sentence = preprocess(sentence)

### 1. VADER

In [None]:
# Prebuilt VADER sentiment package

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores(test_sentence)
print(scores)
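
`polarity_scores` returns a dict with `neg`, `neu`, `pos` and `compound` keys. To turn the compound value into a label, the VADER README suggests cut-offs of 0.05 and -0.05. A minimal sketch of that rule follows; `label_from_compound` is a hypothetical helper and the `example` dict is hand-written, not real VADER output.

```python
def label_from_compound(scores: dict, threshold: float = 0.05) -> str:
    """Map a VADER score dict to a label using the compound cut-offs."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hand-written dict in the same shape as polarity_scores() output.
example = {"neg": 0.3, "neu": 0.7, "pos": 0.0, "compound": -0.6249}
print(label_from_compound(example))
```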

### 2. TextBlob

In [None]:
# Prebuilt TextBlob sentiment package

from textblob import TextBlob
text = TextBlob(test_sentence)
score = text.sentiment
print(score)

### 3. Flair

Flair is optimized for sequence labeling but also ships a prebuilt sentiment classifier.

In [None]:
# Prebuilt Flair sentiment package/Model

from flair.data import Sentence
from flair.nn import Classifier

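# Use a separate name so the raw `sentence` string from the setup cell is not overwritten
flair_sentence = Sentence(test_sentence)
tagger = Classifier.load('sentiment')
tagger.predict(flair_sentence)
print(flair_sentence)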

### 4. HuggingFace Transformers

In [54]:
from transformers import pipeline, set_seed

AttributeError: partially initialized module 'torch' from 'C:\Users\Jay Tai\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.13_qbz5n2kfra8p0\LocalCache\local-packages\Python313\site-packages\torch\__init__.py' has no attribute 'fx' (most likely due to a circular import)

#### - RoBERTa

In [None]:
classifier = pipeline('sentiment-analysis', model='cardiffnlp/twitter-roberta-base-sentiment')
result = classifier(test_sentence)
print(result)
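
This RoBERTa checkpoint is a three-class model, and according to its model card the classes are 0 = negative, 1 = neutral, 2 = positive. The pipeline may therefore print generic names such as `LABEL_0` instead of readable ones. The sketch below renames them; `ROBERTA_LABELS`, `readable` and the `mock` result are assumptions for illustration, and the mapping is worth checking against the model card before relying on it.

```python
# Assumed mapping, taken from the model card's label description.
ROBERTA_LABELS = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}


def readable(result: list) -> list:
    """Replace generic LABEL_n names with sentiment names, leaving others as-is."""
    return [
        {"label": ROBERTA_LABELS.get(r["label"], r["label"].lower()), "score": r["score"]}
        for r in result
    ]


# Hand-written stand-in shaped like a sentiment-analysis pipeline result.
mock = [{"label": "LABEL_0", "score": 0.71}]
print(readable(mock))
```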

#### - DistilBERT

In [53]:
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
result = classifier(test_sentence)
print(result)

NameError: name 'pipeline' is not defined

#### - Google FLAN-T5 base LLM (open source via HuggingFace)

In [None]:
classifier = pipeline("text2text-generation", model="google/flan-t5-base")
prompt = f"Classify the sentiment of '{test_sentence}' as positive, negative, or neutral, and give a sentiment score from -1 to 1."
result = classifier(prompt)
print(result)
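
A `text2text-generation` pipeline returns a list of dicts with a `generated_text` key, so the label has to be pulled out of free text, and a small model may answer with only part of what the prompt asks for. The sketch below shows one way to extract the label; `extract_label` and the `mock` output are hypothetical.

```python
import re
from typing import Optional

# Match the first sentiment word anywhere in the generated text.
LABEL_PATTERN = re.compile(r"\b(positive|negative|neutral)\b", re.IGNORECASE)


def extract_label(generated: list) -> Optional[str]:
    """Return the first sentiment label found in the generated text, else None."""
    text = generated[0]["generated_text"]
    match = LABEL_PATTERN.search(text)
    return match.group(1).lower() if match else None


# Hand-written stand-in shaped like a text2text-generation result.
mock = [{"generated_text": "Negative"}]
print(extract_label(mock))
```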

### 5. Sending API requests to an LLM via OpenRouter

In [None]:
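# LLM (DeepSeek R1 / Qwen3 8B) called through OpenRouter

import os
from openai import OpenAI

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  # Read the key from the environment instead of hard-coding it in the notebook;
  # the variable name OPENROUTER_API_KEY is an assumption.
  api_key=os.environ["OPENROUTER_API_KEY"],
)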

completion = client.chat.completions.create(
  extra_body={},
  model="deepseek/deepseek-r1-0528-qwen3-8b:free",
  messages=[
    {
      "role": "user",
      "content": f"Perform sentiment analysis on the following statement(s) and give me a polarity and a score. {test_sentence}"
    }
  ]
)
print(completion.choices[0].message.content)
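
The reply to the prompt above is free text, which is hard to compare with the other models. One option is to also ask the model to answer with a JSON object and then parse the reply defensively. The sketch below assumes the keys `polarity` and `score`; those keys, `parse_reply` and the `mock_reply` string are hypothetical and not something the API guarantees.

```python
import json
import re
from typing import Optional


def parse_reply(reply: str) -> Optional[dict]:
    """Extract {"polarity": ..., "score": ...} from a free-form reply, or None."""
    match = re.search(r"\{.*?\}", reply, re.DOTALL)  # first flat JSON object
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    polarity = str(data.get("polarity", "")).lower()
    score = data.get("score")
    if polarity not in {"positive", "negative", "neutral"}:
        return None
    # Reject missing, non-numeric, boolean, or out-of-range scores
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not -1 <= score <= 1:
        return None
    return {"polarity": polarity, "score": float(score)}


# Hand-written stand-in for completion.choices[0].message.content
mock_reply = 'Here is the result: {"polarity": "Negative", "score": -0.4}'
print(parse_reply(mock_reply))
```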

### 6. Building our own model from data taken from Kaggle

cnbc: (3080, 3) -- reserved for testing and backtesting

guardian: (17800, 2) + reuters: (32770, 3)

I passed the data to an AI to generate the sentiment labels, rather than labeling every headline by hand.

The data is split 80-20 into training and test sets, and the model is trained on the training portion.
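
Since training needs each headline to stay attached to its label, the 80-20 split can be done on (headline, label) pairs rather than on the headlines alone. A minimal sketch, assuming the labels are available as a list aligned with the headlines; `split_labeled` and the toy data are hypothetical.

```python
import random


def split_labeled(texts, labels, train_fraction=0.8, seed=999):
    """Shuffle (text, label) pairs together and split them into train and test."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    pairs = list(zip(texts, labels))
    rng = random.Random(seed)  # local generator, so global random state is untouched
    rng.shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


# Toy data for illustration only.
headlines = [f"headline {i}" for i in range(10)]
labels = ["positive" if i % 2 == 0 else "negative" for i in range(10)]

train_pairs, test_pairs = split_labeled(headlines, labels)
print(len(train_pairs), len(test_pairs))
```

Using a seeded `random.Random` instance keeps the split reproducible across runs.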

In [None]:
import pandas as pd
cnbc = pd.read_csv("data/cnbc_headlines.csv") # held out as test data for every model
guardian = pd.read_csv("data/guardian_headlines.csv")
reuters = pd.read_csv("data/reuters_headlines.csv")

In [None]:
# Preparing training and testing data

import random
random.seed(999)
# cnbc is left out of this pool because it is reserved for testing and backtesting
data = list(map(preprocess, (list(guardian['Headlines']) + list(reuters['Headlines']))))
random.shuffle(data)
t = int(len(data)*.8)  # index of the 80-20 cut
train_set = data[:t]
test_set = data[t:]

In [None]:
len(train_set), len(test_set)

In [None]:
# Create a preliminary model with the NLTK library, trained on our dataset

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

a = SentimentAnalyzer()
