<ins>__Stock News Sentiment Analysis Project__</ins>

TLDR: For a chosen stock, this function searches Yahoo Finance for the latest news and gets the latest sentiment (individual + overall) using a trained model.

The sentiment analysis project can be split into 3 sub-processes.

<ins>1) Web Scraping (WebScraper class)</ins>

For a given stock, the scraper collects news from Yahoo Finance using Python's BeautifulSoup4 package. It first scrapes the stock's news page for article URLs (the code keeps at most the first 12) and stores them in a list. It then visits each URL, extracts the headline and body text, and stores them in separate lists.
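The first stage (collecting article URLs from a listing page) can be sketched offline. This is a minimal sketch that uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without network access. The sample HTML, the `titles` class check, and the names `LinkCollector` and `collect_article_links` are assumptions for illustration, not Yahoo Finance's actual markup or the notebook's code.

```python
from html.parser import HTMLParser

# Hypothetical listing page; real Yahoo Finance markup differs and changes over time.
SAMPLE_LISTING = """
<div>
  <a class="subtle-link titles" href="https://example.com/news/a.html">Story A</a>
  <a class="nav" href="https://example.com/about">About</a>
  <a class="subtle-link titles" href="https://example.com/news/b.html">Story B</a>
</div>
"""


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags whose class list contains 'titles'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "titles" in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def collect_article_links(html, limit=12):
    """Return up to `limit` article URLs found in the listing HTML."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links[:limit]


article_links = collect_article_links(SAMPLE_LISTING)
print(article_links)
```

The second stage would repeat the same idea per URL, pulling out the headline element and the `<p>` tags instead of links.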

<ins>2) Training the sentiment analysis model (SentimentAnalyzer class)</ins>

The sentiment analysis model is a Maximum Entropy classifier trained with Python's NLTK and Gensim packages. To train it, we curated 100 news articles from Yahoo Finance and manually labelled each one by its overall sentiment towards a particular stock: 34 optimistic, 33 neutral, and 33 pessimistic. We then trained the classifier on these 100 labelled articles so that it can classify new, unseen news articles into the same 3 categories. After training, the model was saved using Python's pickle package.
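As a rough illustration of the training input, the sketch below turns labelled, already-tokenized articles into (feature dictionary, label) pairs, which is the shape NLTK classifiers are trained on. It is a simplified assumption rather than the project's training code: the toy articles and the `extract_features` helper are hypothetical, and feature keys here are the terms themselves rather than dictionary ids.

```python
def extract_features(tokens):
    """Map a token list to binary presence features over unigrams and bigrams."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return {term: 1 for term in tokens + bigrams}


# Hypothetical toy examples; the real training corpus is 100 full articles.
labelled_articles = [
    (["profit", "beats", "forecast"], "Optimistic"),
    (["shares", "unchanged"], "Neutral"),
    (["revenue", "misses", "forecast"], "Pessimistic"),
]

# One (features, label) pair per article.
train_set = [(extract_features(tokens), label) for tokens, label in labelled_articles]

print(train_set[1])
```

Using presence (1) rather than counts means a term repeated many times in one article weighs the same as a term that appears once.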

<ins>3) Combining the scraped articles and sentiment analysis model into a sentiment analysis application (StockSentimentApp class)</ins>

The scraped articles and the trained sentiment analysis model are combined into a sentiment analysis application. The application takes a ticker symbol as its input and outputs the sentiment of each scraped article by applying the trained classifier to every article in the scraped corpus. The headline and hyperlink of each article are also shown in the output so that users can click through and read any article that interests them. Finally, the application summarises and outputs the overall sentiment towards the stock based on the count of each predicted sentiment.
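The summarising step can be sketched in isolation. This minimal sketch mirrors the tie rule used in the `StockSentimentApp` code (equal optimistic and pessimistic counts give a neutral result); the function name `overall_sentiment` and the sample label lists are made up for illustration.

```python
from collections import Counter


def overall_sentiment(predictions):
    """Summarise per-article labels into one overall label."""
    counts = Counter(label.lower() for label in predictions)
    # Equal optimistic and pessimistic counts cancel out to neutral.
    if counts["optimistic"] == counts["pessimistic"]:
        return "neutral"
    # Otherwise the most frequent label wins.
    return max(("optimistic", "pessimistic", "neutral"), key=lambda label: counts[label])


print(overall_sentiment(["Neutral", "Neutral", "Pessimistic", "Optimistic"]))
print(overall_sentiment(["Optimistic", "Optimistic", "Neutral"]))
```

The first call prints `neutral` because the single optimistic and single pessimistic articles cancel out; the second prints `optimistic`.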

<ins>Use Case</ins>

With this application, users can gauge the overall sentiment towards a particular stock in a relatively short amount of time, based on the textual sentiment of financial experts. Users can then make investment decisions informed by sound financial advice without spending hours skimming through financial news. Note, however, that the application is meant to assist, not replace, the human element of investing.

<ins>How to use</ins>

Run all the cells below. When prompted, key in a stock name or ticker symbol (e.g. "Amazon" or "AMZN"; the input is case-insensitive). The application will then output a list of articles and their corresponding predicted sentiments.
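To show how a name or a ticker in any letter case can resolve to the same symbol, here is a minimal offline sketch. The `NAME_TO_TICKER` table and the `resolve_ticker` function are hypothetical: the notebook loads its own table from `name2ticker_dict.pkl` and checks a ticker by requesting its Yahoo Finance page rather than by looking it up in a table.

```python
# Hypothetical name-to-ticker table with upper-case keys (an assumption).
NAME_TO_TICKER = {"AMAZON.COM, INC.": "AMZN", "INTEL CORPORATION": "INTC"}


def resolve_ticker(user_input, name_to_ticker=NAME_TO_TICKER):
    """Return a ticker for a name or ticker typed in any letter case, or None."""
    query = user_input.strip().upper()
    if not query:
        return None
    if query in name_to_ticker.values():  # already a known ticker
        return query
    for name, ticker in name_to_ticker.items():
        if query in name:  # substring match on the company name
            return ticker
    return None


print(resolve_ticker(" amzn "))
print(resolve_ticker("Amazon"))
```

Both calls resolve to `AMZN`. A substring match returns the first company name that contains the query, so short inputs can match an unintended company.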

<ins>DISCLAIMER</ins>

The sentiment analysis model is a Maximum Entropy classifier: a probabilistic model that, among all models consistent with the training data, chooses the one with the highest entropy, following the principle of maximum entropy from information theory. The classifier assigns a single label to each article as a whole, based on the words and word pairs it contains, so articles that discuss multiple stocks or companies may receive inaccurate sentiment predictions for the stock of interest.
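To make the principle of maximum entropy concrete: with no constraints from data, the uniform distribution over the three sentiment labels has the highest entropy, and any distribution that commits more strongly to one label has less. The small illustration below uses made-up probabilities, not model output.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = [1 / 3, 1 / 3, 1 / 3]  # no label favoured
skewed = [0.8, 0.1, 0.1]         # one label strongly favoured

print(round(entropy(uniform), 3))
print(round(entropy(skewed), 3))
```

A Maximum Entropy classifier applies the same idea under constraints: it stays as close to uniform as the feature statistics of the training data allow.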

Installations

In [1]:
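# Web Scraping Installations
# !pip install beautifulsoup4
# !pip install requests
# !pip install lxml  # parser backend used by BeautifulSoup(..., "lxml")

# Sentiment Analysis Installations
# !pip install nltk
# !pip install gensim
# !pip install contractions

# NLTK data used below (resource names can vary by NLTK version, e.g. newer releases use 'punkt_tab')
# import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')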

Imports

In [1]:
# Web Scraping Imports
from bs4 import BeautifulSoup
import requests
import pickle

# Sentiment Analysis Imports
import nltk
from nltk.util import ngrams
import gensim
import contractions

Web Scraper

In [2]:
class WebScraper:
    def __init__(self, stock):
        self.stock = stock.strip().upper()
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'}
        self.url = "https://finance.yahoo.com/quote/"
        with open("name2ticker_dict.pkl", "rb") as f:
            self.name2ticker_dict = pickle.load(f)
        self.stock_name = None
        self.hyperlink_list = []
        self.headline_list = []
        self.article_list = []

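    def get_request(self, url_extension):
        # fetch the stock's news page; the timeout stops the request from hanging indefinitely
        response = requests.get(self.url + url_extension + '/news/', headers=self.headers, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")
        # select_one returns None when no matching <h1> is found (e.g. an unknown ticker), so .text raises AttributeError
        self.stock_name = soup.select_one('h1[class*="yf-"]').text.strip()
        # keep at most the first 12 article links
        self.hyperlink_list = [a["href"] for a in soup.select('a[class*="subtle-link fin-size-small titles noUnderline"]')][:12]

    def search_stock(self):
        try:
            # first treat the input as a ticker symbol
            self.get_request(self.stock)
        except Exception:
            # otherwise treat it as a (partial) company name and look up its ticker
            for name in self.name2ticker_dict:
                if self.stock in name:
                    self.get_request(self.name2ticker_dict[name])
                    break
        if self.stock_name is None:
            raise ValueError("Stock does not exist!")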

    def scrape_articles(self):
        for hyperlink in self.hyperlink_list.copy():
            try:
                if 'm' in hyperlink.split("/"): # skip links with an 'm' path component because they indicate a redirected article
                    raise ValueError
                article = [] # each element in this list is a different paragraph
                response = requests.get(hyperlink, headers=self.headers, timeout=10)
                soup = BeautifulSoup(response.text, "lxml")
                headline = soup.select_one('div[class*="cover-title"]').text.strip() # raises AttributeError if no headline element is found
                article.append(headline if headline[-1] in ".?!" else headline + ".")
                for paragraph in soup.find_all("p"):
                    paragraph = paragraph.text.strip()
                    if paragraph:
                        article.append(paragraph if paragraph[-1] in ".?!" else paragraph + ".")
                if len(article) <= 2: # two or fewer elements likely indicates a premium or redirected article, so skip it
                    raise ValueError
                self.headline_list.append(headline)
                self.article_list.append(" ".join(article)) # join the headline and paragraphs into one long string
            except Exception: # drop articles that are paywalled, redirect to another news website, or otherwise fail to parse
                self.hyperlink_list.remove(hyperlink)

Sentiment Analyzer

In [3]:
class SentimentAnalyzer:
    def __init__(self):
        with open("dictionary.pkl", "rb") as f:
            self.dictionary = pickle.load(f)
        with open("maxent_sentiment_classifier.pkl", "rb") as f:
            self.classifier = pickle.load(f)
        self.stop_list = nltk.corpus.stopwords.words('english')
        self.lemmatizer = nltk.stem.WordNetLemmatizer()

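    def preprocess_text(self, text):
        article = nltk.word_tokenize(text)
        # keep alphanumeric tokens and lowercase them; the stop-word check runs before lowercasing,
        # so capitalised stop words such as "The" are kept
        article = [w.lower() for w in article if w.isalnum() and w not in self.stop_list]
        # expand contractions token by token, then lemmatize
        article = [self.lemmatizer.lemmatize(contractions.fix(w)) for w in article]
        # add bigrams (adjacent word pairs) as extra features
        bigrams = [' '.join(w) for w in ngrams(article, 2)]
        article.extend(bigrams)
        return article

    def predict_sentiment(self, text):
        article = self.preprocess_text(text)
        # map known tokens to (token id, term frequency) pairs
        vector = self.dictionary.doc2bow(article)
        # binary presence features: the term frequency is ignored
        article_as_dict = {token_id: 1 for (token_id, _) in vector}
        return self.classifier.classify(article_as_dict)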

Stock Sentiment Application

In [4]:
class StockSentimentApp:
    def __init__(self, stock):
        self.stock = stock
        self.scraper = WebScraper(stock)
        self.analyzer = SentimentAnalyzer()
        self.sentiment_count = {'optimistic': 0, 'pessimistic': 0, 'neutral': 0}

    def run(self):
        self.scraper.search_stock()
        self.scraper.scrape_articles()

        for (i, (headline, article, hyperlink)) in enumerate(zip(self.scraper.headline_list, self.scraper.article_list, self.scraper.hyperlink_list)):
            sentiment = self.analyzer.predict_sentiment(article)
            print(f"{i+1})\tHeadline: {headline}")
            print(f"\tSentiment: {sentiment}")
            print(f"\tArticle Link: {hyperlink}\n")
            self.sentiment_count[sentiment.lower()] += 1

        if sum(self.sentiment_count.values()) == 0:
            print(f"There are no available news articles for {self.scraper.stock_name} on Yahoo Finance. Please try again later.")
        else:
            print("Summary:")
            for (sentiment, count) in self.sentiment_count.items():
                print(f"{sentiment.capitalize()}: {count}\t", end="")

            def get_overall_sentiment(sentiment_count):
                if sentiment_count['optimistic'] == sentiment_count['pessimistic']:
                    return 'neutral'
                return max(sentiment_count, key=sentiment_count.get)

            print(f"\nThe overall sentiment for {self.scraper.stock_name} is {get_overall_sentiment(self.sentiment_count)}.")


if __name__ == "__main__":
    stock = input("Enter a Stock: ")
    app = StockSentimentApp(stock)
    app.run()

1)	Headline: Intel reports Q4 beats on top and bottom line, stock rises on foundry revenue outlook
	Sentiment: Neutral
	Article Link: https://finance.yahoo.com/news/intel-reports-q4-beats-on-top-and-bottom-line-stock-rises-on-foundry-revenue-outlook-181745845.html

2)	Headline: Meta, Microsoft downplay DeepSeek threat
	Sentiment: Neutral
	Article Link: https://finance.yahoo.com/news/meta-microsoft-downplay-deepseek-threat-190639250.html

3)	Headline: DeepSeek is AI's historic 'sub-four-minute mile' moment
	Sentiment: Pessimistic
	Article Link: https://finance.yahoo.com/video/deepseek-ais-historic-sub-four-154934344.html

4)	Headline: Oracle debuts new AI agents as artificial intelligence war enters next battle
	Sentiment: Optimistic
	Article Link: https://finance.yahoo.com/news/oracle-debuts-new-ai-agents-as-artificial-intelligence-war-enters-next-battle-140057385.html

5)	Headline: Over 25% of Warren Buffett's $300 Billion Portfolio Is Invested in These 4 Tech Stocks. Here's the Best 