


# Notebook 3: Semantic Approach Analysis and Topic Modeling



The semantic approach to text processing involves analyzing text at a deeper level to understand its meaning and context. Unlike syntactic or statistical approaches, which focus on the structure or frequency of words, the semantic approach aims to capture the underlying semantics of language.

Some key aspects of the semantic approach to text processing are:

*   Topic modeling
*   Word sense disambiguation
*   Semantic similarity



## Part 1: Load Dataset

In [1]:
cd /content/drive/MyDrive/CIND820

/content/drive/MyDrive/CIND820


In [2]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import wordnet as wn
import spacy
import gensim
from gensim import corpora
from gensim.models import LdaModel

# Initialize NLTK
nltk.download('wordnet')

# Initialize spaCy
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# load dataset
news_text_df = pd.read_csv('generated_dfs/filtered_data.csv')

In [4]:
news_text_df.head()

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content,article,title_sentiment,all_text
0,94333,business-insider,Business Insider,Sponsor Post,Unlocking the future of the web,"From Chipotle to Roblox, web3 is enabling busi...",https://www.businessinsider.com/sc/web3-helps-...,https://i.insider.com/654374023cc84b4dfafa98cb...,2023-11-02 16:25:49,Adobe Stock\nHarnessing customer engagement an...,Stock,,,,Unlocking the future of the web From Chipotle ...
1,94368,,The Indian Express,Reuters,"LinkedIn hits 1 billion members, adds AI featu...",LinkedIn also introduced on Wednesday a button...,https://indianexpress.com/article/technology/s...,https://images.indianexpress.com/2023/11/Linke...,2023-11-02 03:48:44,"LinkedIn, the business-focused social network ...",Stock,"LinkedIn, the business-focused social network ...",,,"LinkedIn hits 1 billion members, adds AI featu..."
2,94370,,Investor's Business Daily,Investor's Business Daily,"Moderna Beats Sales Forecasts, But Light Guida...",The company issued below-consensus sales views...,https://www.investors.com/news/technology/mode...,https://www.investors.com/wp-content/uploads/2...,2023-11-02 10:30:25,Moderna (MRNA) stock could take a hit Thursday...,Stock,,,,"Moderna Beats Sales Forecasts, But Light Guida..."
3,94331,abc-news,ABC News,ABC News,WATCH: Man rescued from crashed plane in the E...,A man was hoisted to safety after a small plan...,https://abcnews.go.com/US/video/man-rescued-cr...,https://i.abcnewsfe.com/a/dee71d57-ad91-4eec-8...,2023-11-02 10:19:28,<ul><li>Whats next for Russia? \n</li><li>What...,Stock,,,,WATCH: Man rescued from crashed plane in the E...
4,94332,abc-news,ABC News,ABC News,WATCH: Teen solves Rubik’s cube while skydiving,"Sam Sieracki, 17, broke the world record by so...",https://abcnews.go.com/GMA/Living/video/teen-s...,https://i.abcnewsfe.com/a/a4f53cb3-cbfe-4f01-a...,2023-11-02 12:16:02,<ul><li>Whats next for Russia? \n</li><li>What...,Stock,,,,WATCH: Teen solves Rubik’s cube while skydivin...


In [5]:
news_text_df.shape

(33029, 15)

In [6]:
news_text_df.columns

Index(['article_id', 'source_id', 'source_name', 'author', 'title',
       'description', 'url', 'url_to_image', 'published_at', 'content',
       'category', 'full_content', 'article', 'title_sentiment', 'all_text'],
      dtype='object')

We will use `news_text_df`, specifically its `all_text` column, to analyze the underlying semantics of the text.

## Part 2: Topic Modeling using Latent Dirichlet Allocation (LDA)

**Topic modeling** algorithms such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) are used to discover underlying topics or themes in a collection of documents. Each topic is a group of words that frequently co-occur, which gives insight into the content of the corpus. This part uses LDA.
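
To make "words that frequently co-occur" concrete, here is a minimal sketch, using only the standard library, that counts how many documents each pair of words shares. The three toy documents and the helper name `pair_document_counts` are invented for illustration. LDA fits a probabilistic model rather than counting pairs directly, so this only shows the kind of signal a topic model picks up.

```python
from collections import Counter
from itertools import combinations

# Toy documents (hypothetical, not taken from the dataset)
toy_docs = [
    "stock price target raised",
    "analyst raised price target",
    "vaccine maker reported losses",
]

def pair_document_counts(docs):
    """For every unordered word pair, count the documents that contain both words."""
    counts = Counter()
    for doc in docs:
        vocab = sorted(set(doc.split()))  # each word counted once per document
        counts.update(combinations(vocab, 2))
    return counts

pair_counts = pair_document_counts(toy_docs)
print(pair_counts.most_common(3))
```

Pairs such as ("price", "target") appear in two documents, while words from the finance and vaccine documents never appear together.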

In [7]:
import gensim
from gensim import corpora
from pprint import pprint

from nltk.tokenize import word_tokenize
from nltk.util import ngrams
import string

nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from gensim.parsing.preprocessing import remove_stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
# Get NLTK English stopwords
nltk_stopwords = set(stopwords.words('english'))

# Function to remove punctuation from text
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

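# Lowercase, strip punctuation, drop stopwords (gensim and NLTK lists),
# and keep alphabetic tokens only.
# The condition must test the word itself: `word.isalpha() not in nltk_stopwords`
# compares a boolean against the stopword set and is always True.
# Note: non-alphabetic tokens such as "100" or "–" in the recorded output
# below predate this filter.
texts = [[word for word in
            remove_stopwords(remove_punctuation(document.lower())).split()
              if word.isalpha() and word not in nltk_stopwords] for document in news_text_df['all_text']]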

# Create dictionary and corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Build LDA model
lda_model = gensim.models.LdaModel(corpus, num_topics=100, id2word=dictionary)
topics_data = lda_model.print_topics()

In [9]:
# Print the topics
pprint(lda_model.print_topics())

[(90,
  '0.156*"group" + 0.031*"opened" + 0.025*"gap" + 0.025*"river" + 0.021*"l…" + '
  '0.018*"gapped" + 0.015*"peak" + 0.015*"prior" + 0.013*"ma" + '
  '0.012*"guidance"'),
 (77,
  '0.017*"chars" + 0.014*"time" + 0.013*"like" + 0.011*"best" + 0.009*"good" + '
  '0.009*"need" + 0.008*"start" + 0.008*"season" + 0.008*"money" + '
  '0.008*"market"'),
 (72,
  '0.049*"health" + 0.043*"union" + 0.023*"hunt" + 0.020*"battery" + '
  '0.016*"popular" + 0.012*"promised" + 0.011*"lagos" + 0.011*"kit" + '
  '0.011*"governor" + 0.010*"away"'),
 (65,
  '0.052*"rating" + 0.049*"report" + 0.041*"target" + 0.039*"price" + '
  '0.032*"research" + 0.028*"stock" + 0.026*"free" + 0.025*"buy" + 0.025*"–" + '
  '0.024*"reports"'),
 (21,
  '0.141*"des" + 0.036*"der" + 0.036*"die" + 0.031*"chars" + 0.025*"web3" + '
  '0.023*"cloud" + 0.022*"ag" + 0.017*"hong" + 0.015*"100" + 0.013*"kong"'),
 (71,
  '0.043*"–" + 0.042*"quarter" + 0.042*"free" + 0.041*"according" + '
  '0.040*"report" + 0.040*"recent" + 0.038

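Several of the topics above surface financial terms (for example "rating", "target", "price", and "stock"), but the topics are noisy: artifacts such as "chars" and "–" and German function words ("des", "der", "die") also rank highly. We evaluate the topics with coherence scores: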

In [10]:
# evaluate topics
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary

def extract_tokens(topic_input):
    all_tokens = []
    for topic_tuple in topic_input:
        topic_string = topic_tuple[1]
        topic_tokens = []
        for part in topic_string.split('+'):
            part = part.strip()
            if '*' in part:
                _, token = part.split('*')
                token = token.strip().strip('"')
                if token and token.isalpha():
                    topic_tokens.append(token)
        all_tokens.append(topic_tokens)
    return all_tokens

# Tokenize the topics
topics = extract_tokens(topics_data)

# Create a Dictionary from the tokenized topics
# (separate names so the LDA `dictionary` and `corpus` are not overwritten)
topic_dictionary = Dictionary(topics)

# Convert tokenized topics into a Bag of Words (BoW) format
topic_corpus = [topic_dictionary.doc2bow(topic) for topic in topics]

# Compute Coherence Score using UMass coherence measure.
# Caveat: the reference corpus here is the topic word lists themselves,
# not the news articles, so the scores are computed against the topics only.
coherence_model_umass = CoherenceModel(topics=topics, corpus=topic_corpus, dictionary=topic_dictionary, coherence='u_mass')
coherence_umass = coherence_model_umass.get_coherence()
print("UMass Coherence Score:", coherence_umass)

# Compute Coherence Score using c_v coherence measure (same caveat applies)
coherence_model_cv = CoherenceModel(topics=topics, texts=topics, dictionary=topic_dictionary, coherence='c_v')
coherence_cv = coherence_model_cv.get_coherence()
print("C_v Coherence Score:", coherence_cv)


UMass Coherence Score: -0.348748208742317
C_v Coherence Score: 0.9593852958594754
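
Note that the cell above builds its dictionary and corpus from the topic words themselves, so each topic is scored against the topic lists rather than against the articles. Coherence is usually measured against the documents the model was trained on. The following is a minimal pure-Python sketch of a UMass-style score computed against a reference corpus. It assumes add-one smoothing and averages over word pairs; the function name, the `smoothing` parameter, and the toy documents are hypothetical, and gensim's own implementation may differ in smoothing and aggregation, so values would not be expected to match.

```python
import math

def umass_coherence(topic_words, documents, smoothing=1.0):
    """UMass-style coherence of one topic against tokenized reference documents.

    Averages log((D(w_i, w_j) + smoothing) / D(w_j)) over all pairs where
    w_j is ranked before w_i in the topic. D(...) counts documents.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(topic_words)):
        for j in range(i):
            denom = doc_count(topic_words[j])
            if denom == 0:
                continue  # skip words that never occur in the reference documents
            joint = doc_count(topic_words[i], topic_words[j])
            scores.append(math.log((joint + smoothing) / denom))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy reference corpus (hypothetical); in the notebook this role would be
# played by the tokenized articles in `texts`
reference_docs = [
    ["stock", "price", "target", "rating"],
    ["stock", "price", "report"],
    ["vaccine", "health", "report"],
]

coherent = umass_coherence(["stock", "price"], reference_docs)
incoherent = umass_coherence(["stock", "vaccine"], reference_docs)
print(coherent, incoherent)
```

Words that share documents ("stock", "price") score higher than words that never appear together ("stock", "vaccine").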


## Part 3: Latent Semantic Analysis

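Latent Semantic Analysis (LSA) applies a truncated singular value decomposition (SVD) to a term-document matrix (here, a TF-IDF matrix) so that each document is represented by a small number of latent dimensions. Documents can then be compared in that reduced space.


* Semantic Similarity: Semantic similarity measures how similar two pieces of text are in meaning. It takes into account the semantics of words and phrases, rather than just their surface-level representations. Semantic similarity can be used in various applications such as information retrieval, question answering, and recommendation systems.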
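
The similarity measure used on the LSA vectors later in this part is cosine similarity, the cosine of the angle between two document vectors. As a minimal sketch, it can be written in plain Python as follows; the three toy vectors are hypothetical and stand in for rows of an LSA matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy document vectors (hypothetical values)
doc_a = [0.57, 0.44, 0.0]
doc_b = [0.40, 0.33, 0.0]
doc_c = [0.0, 0.0, 1.0]

print(cosine(doc_a, doc_b))  # close to 1: nearly the same direction
print(cosine(doc_a, doc_c))  # 0: no shared dimensions
```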

In [11]:
print(news_text_df.head())

   article_id         source_id                source_name  \
0       94333  business-insider           Business Insider   
1       94368               NaN         The Indian Express   
2       94370               NaN  Investor's Business Daily   
3       94331          abc-news                   ABC News   
4       94332          abc-news                   ABC News   

                      author  \
0               Sponsor Post   
1                    Reuters   
2  Investor's Business Daily   
3                   ABC News   
4                   ABC News   

                                               title  \
0                    Unlocking the future of the web   
1  LinkedIn hits 1 billion members, adds AI featu...   
2  Moderna Beats Sales Forecasts, But Light Guida...   
3  WATCH: Man rescued from crashed plane in the E...   
4    WATCH: Teen solves Rubik’s cube while skydiving   

                                         description  \
0  From Chipotle to Roblox, web3 is enabl

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

# Build these once; re-creating them for every article is wasteful
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Function to preprocess text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize tokens
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    # Join tokens back into text
    return ' '.join(lemmatized_tokens)

# Preprocess the text data in the DataFrame
news_text_df['preprocessed_text'] = news_text_df['all_text'].apply(preprocess_text)

# Create a TF-IDF matrix
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=2)
tfidf_matrix = tfidf_vectorizer.fit_transform(news_text_df['preprocessed_text'])

# Apply Singular Value Decomposition (SVD)
lsa_model = TruncatedSVD(n_components=10, random_state=42)
lsa_matrix = lsa_model.fit_transform(tfidf_matrix)
lsa_matrix = Normalizer(copy=False).fit_transform(lsa_matrix)

# Analyze the latent semantic dimensions
terms = tfidf_vectorizer.get_feature_names_out()
print("Terms:", terms)
print("LSA Matrix:", lsa_matrix)


Terms: ['00' '000' '0000' ... '딥브레인ai' 'ｐｏｉｎｔ株式会社' 'ｓｔｏｃｋ']
LSA Matrix: [[ 5.70278071e-01  4.38536797e-01  1.81315096e-03 ...  5.26109429e-01
   4.55782413e-02 -3.40515535e-01]
 [ 3.97650354e-01  3.25930322e-01  2.91875106e-03 ...  6.96682871e-01
   1.09193974e-01 -3.08910055e-01]
 [ 6.80599283e-01  2.75654307e-01  1.55537404e-03 ...  2.28934688e-01
  -1.07239610e-01 -3.91214515e-01]
 ...
 [ 3.39249149e-01  1.77825012e-01  4.79131301e-03 ...  6.68427922e-01
   4.15634719e-02 -4.93134302e-01]
 [ 2.79680239e-01  3.81584504e-01  1.66657951e-03 ...  7.83010650e-01
  -1.40583060e-01 -1.12821077e-01]
 [ 2.51062265e-02  4.79032152e-02  3.60771877e-04 ... -3.10568743e-01
  -6.91588441e-04  2.60653697e-02]]


In [21]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity
cos_sim_matrix = cosine_similarity(lsa_matrix, lsa_matrix)

# Print cosine similarity matrix
print("Cosine Similarity Matrix:")
print(cos_sim_matrix)

Cosine Similarity Matrix:
[[1.         0.9534782  0.85704471 ... 0.90671046 0.85991229 0.04027889]
 [0.9534782  1.         0.72648767 ... 0.96738391 0.93004733 0.02452458]
 [0.85704471 0.72648767 1.         ... 0.72463004 0.58136539 0.02854006]
 ...
 [0.90671046 0.96738391 0.72463004 ... 1.         0.87365836 0.03436503]
 [0.85991229 0.93004733 0.58136539 ... 0.87365836 1.         0.02245589]
 [0.04027889 0.02452458 0.02854006 ... 0.03436503 0.02245589 1.        ]]
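
One way to use such a matrix is to look up the articles most similar to a given one. The sketch below is an illustration only: the helper `top_k_similar` is hypothetical, and the three-by-three matrix reuses the first three rows and columns printed above. For the full matrix, a NumPy-based approach such as `numpy.argsort` would probably be preferable.

```python
def top_k_similar(sim_matrix, index, k=2):
    """Return (row, score) pairs for the k rows most similar to `index`, excluding itself."""
    scores = [(j, s) for j, s in enumerate(sim_matrix[index]) if j != index]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Top-left corner of the cosine similarity matrix shown above
sim = [
    [1.0, 0.9534782, 0.85704471],
    [0.9534782, 1.0, 0.72648767],
    [0.85704471, 0.72648767, 1.0],
]

print(top_k_similar(sim, 0))  # article 1 is the closest match to article 0
```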


## Part 4: SPEED Algorithm

In [13]:
# define simple domain ontology with finance terms
finance_terms = [
    "Asset Allocation",
    "Beta",
    "Capital Asset Pricing Model (CAPM)",
    "Derivative",
    "Dividend",
    "Exchange-Traded Fund (ETF)",
    "Hedge Fund",
    "Initial Public Offering (IPO)",
    "Leverage",
    "Market Capitalization",
    "Net Present Value (NPV)",
    "Options",
    "Portfolio",
    "Return on Investment (ROI)",
    "Stock Market Index",
    "Taxable Income",
    "Underwriting",
    "Volatility",
    "Yield",
    "Zero-Coupon Bond",
    "Arbitrage",
    "Bear Market",
    "Blue Chip Stocks",
    "Bonds",
    "Bull Market",
    "Call Option",
    "Cash Flow",
    "Collateral",
    "Commercial Paper",
    "Compound Interest",
    "Credit Rating",
    "Debenture",
    "Default",
    "Diversification",
    "Dividend Yield",
    "Earnings Per Share (EPS)",
    "Equity",
    "Federal Reserve",
    "Futures",
    "Growth Stock",
    "Income Statement",
    "Inflation",
    "Insolvency",
    "Interest Rate",
    "Investment Bank",
    "Liquidity",
    "Margin Call",
    "Maturity Date",
    "Mergers and Acquisitions (M&A)",
    "Mutual Fund",
    "Option Premium",
    "Par Value",
    "Penny Stocks",
    "Preferred Stock",
    "Price-Earnings Ratio (P/E Ratio)",
    "Prime Rate",
    "Private Equity",
    "Profit Margin",
    "Quantitative Easing",
    "Real Estate Investment Trust (REIT)",
    "Residual Value",
    "Revenue",
    "Risk Management",
    "Securities",
    "Short Selling",
    "Solvency",
    "Stock Split",
    "Subprime Mortgage",
    "Supply and Demand",
    "Technical Analysis",
    "Time Value of Money",
    "Treasury Bills (T-Bills)",
    "Unsystematic Risk",
    "Venture Capital",
    "Volatility Index (VIX)",
    "Warrant",
    "Working Capital",
    "Yield Curve",
    "Accounts Receivable",
    "Amortization",
    "Annuity",
    "Balance Sheet",
    "Bankruptcy",
    "Blue Sky Laws",
    "Capital Gains",
    "Cash Equivalent",
    "Cost of Capital",
    "Coupon Rate",
    "Depreciation",
    "Discount Rate",
    "Due Diligence",
    "Economic Indicator",
    "Financial Statement",
    "Goodwill",
    "Hurdle Rate",
    "Intrinsic Value",
    "Lien",
    "Market Order",
    "Maturity",
    "Monetary Policy",
    "Option Chain",
    "Pip",
    "Proxy Statement",
    "Put Option",
    "Rate of Return",
    "S&P 500",
    "Savings Account",
    "Secured Loan",
    "Stop-Loss Order",
    "Time Horizon"
]

In [14]:
# sample text
sample_text = news_text_df['all_text'][2]

In [15]:
sample_text

'Moderna Beats Sales Forecasts, But Light Guidance Could Rattle Shares The company issued below-consensus sales views for 2023 and 2024. Moderna (MRNA) stock could take a hit Thursday after the Covid vaccine maker handily beat third-quarter sales forecasts, but reported bigger-than-expected losses and issued light guidance for this ye… [+4041 chars]  '

In [16]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.corpus import wordnet as wn

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

class SPEED_Pipeline:
    def __init__(self, finance_terms):
        '''Initialize pipeline with domain ontology'''
        # Initialize WordNet lemmatizer
        self.lemmatizer = WordNetLemmatizer()
        # Store finance terms
        self.finance_terms = finance_terms
        # Initialize ontology dictionary
        self.ontology = {term.lower(): term for term in finance_terms}
        # Attributes to store processed results
        self.text = None
        self.tokens = None
        self.sentences = None
        self.linked_tokens = None
        self.pos_tags = None
        self.lemmatized_tokens = None
        self.word_groups = None
        self.disambiguated_groups = None
        self.event_phrases = None
        self.event_patterns = None

    # Step 1: English Tokenizer
    def english_tokenizer(self, text):
        '''tokenize text into tokens'''
        return word_tokenize(text)

    # Step 2: Ontology Gazetteer
    def ontology_gazetteer(self, tokens):
        '''link concepts in the text to concepts found in the domain ontology'''
        linked_tokens = []
        for token in tokens:
            # Ontology keys are lowercase; fall back to the original token
            # when there is no match (single-token lookup only)
            linked_tokens.append(self.ontology.get(token.lower(), token))
        return linked_tokens

    # Step 3: Sentence Splitter
    def sentence_splitter(self, text):
        '''group tokens in the text into sentences'''
        return sent_tokenize(text)

    # Step 4: Part-of-Speech Tagger
    def part_of_speech_tagger(self, tokens):
        '''each word is tagged with its part of speech'''
        return pos_tag(tokens)

    # Step 5: Morphological Analyzer
    def morphological_analyzer(self, tokens):
        '''tokens are lemmatized or decomposed to root form'''
        lemmatized_tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        return lemmatized_tokens

    # Step 6: Word Group Look-Up
    def word_group_look_up(self, tokens):
        '''use the WordNet semantic lexicon to identify word groups'''
        word_groups = {}
        for token in tokens:
            synsets = wn.synsets(token)
            word_groups[token] = [synset.name() for synset in synsets]
        return word_groups

    # Step 7: Word Sense Disambiguator
    def word_sense_disambiguator(self, word_groups):
        '''determine word sense of each word group'''
        disambiguated_groups = {}
        for token, synsets in word_groups.items():
            if synsets:
                disambiguated_groups[token] = synsets[0]
            else:
                disambiguated_groups[token] = None
        return disambiguated_groups

    # Step 8: Event Phrase Gazetteer
    def event_phrase_gazetteer(self, disambiguated_groups):
        '''link word groups to concepts in the domain ontology'''
        linked_phrases = []
        # Iterating a dict yields its keys, i.e. the tokens
        for token in disambiguated_groups:
            if token.lower() in self.ontology:
                linked_phrases.append(self.ontology[token.lower()])
        return linked_phrases

    # Step 9: Event Pattern Recognition
    def event_pattern_recognition(self, tokens):
        '''
        combine event with surrounding action words to add meaning
        and extract event phrases
        '''
        event_phrases = []
        for i, token in enumerate(tokens):
            # Placeholder trigger: only the literal word "event" is matched
            if token.lower() == "event":
                context = " ".join(tokens[max(0, i-3):min(len(tokens), i+4)])
                event_phrases.append(context)
        return event_phrases

    # Step 10: Ontology Instantiator
    def ontology_instantiator(self, knowledge):
        '''update domain ontology with knowledge gained'''
        self.ontology.update(knowledge)
        return self.ontology

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
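
Many ontology keys are multi-word entries such as "asset allocation" or "cash flow", which a token-by-token lookup like the one in `ontology_gazetteer` cannot match. A minimal sketch of greedy longest-match phrase lookup is shown below. The function `match_phrases`, its `max_len` parameter, and the toy ontology are hypothetical and are not part of the SPEED pipeline as defined above.

```python
def match_phrases(tokens, ontology, max_len=4):
    """Greedy longest-match lookup of token n-grams in a lowercase-keyed ontology."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in ontology:
                matches.append((i, ontology[candidate]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts at this position
    return matches

# Toy ontology built the same way as in SPEED_Pipeline.__init__
toy_ontology = {t.lower(): t for t in ["Cash Flow", "Dividend", "Dividend Yield"]}
toy_tokens = "The dividend yield rose while cash flow fell".split()

found = match_phrases(toy_tokens, toy_ontology)
print(found)
```

Because longer candidates are tried first, "dividend yield" is linked as one concept instead of matching "dividend" alone.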


In [17]:
# Create pipeline instance
pipeline = SPEED_Pipeline(finance_terms)
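
The `word_sense_disambiguator` step above always takes the first WordNet synset, so the context of a word plays no role. One common alternative is a Lesk-style heuristic, which picks the sense whose definition shares the most words with the surrounding text. The sketch below is a simplified illustration: the function `simplified_lesk` and the two glosses are hypothetical and are not WordNet's definitions. NLTK also provides `nltk.wsd.lesk`, which works on WordNet synsets.

```python
def simplified_lesk(context_tokens, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = {t.lower() for t in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense

# Toy sense inventory (hypothetical glosses, not WordNet's)
stock_senses = {
    "stock.inventory": "goods a shop keeps on hand for sale",
    "stock.equity": "shares of ownership in a company traded on a market",
}

context = "Moderna stock could take a hit after the company issued light guidance".split()
chosen = simplified_lesk(context, stock_senses)
print(chosen)
```

A fuller version would remove stopwords before counting overlap, since common words such as "a" otherwise contribute to the score.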

In [18]:
# Step 1: English Tokenizer
tokens = pipeline.english_tokenizer(sample_text)

# Step 2: Ontology Gazetteer
linked_tokens = pipeline.ontology_gazetteer(tokens)

# Step 3: Sentence Splitter
sentences = pipeline.sentence_splitter(sample_text)

# Step 4: Part-of-Speech Tagger
pos_tags = pipeline.part_of_speech_tagger(tokens)

# Step 5: Morphological Analyzer
lemmatized_tokens = pipeline.morphological_analyzer(tokens)

# Step 6: Word Group Look-Up
word_groups = pipeline.word_group_look_up(tokens)

# Step 7: Word Sense Disambiguator
disambiguated_groups = pipeline.word_sense_disambiguator(word_groups)

# Step 8: Event Phrase Gazetteer (takes the disambiguated word groups from Step 7)
event_phrases = pipeline.event_phrase_gazetteer(disambiguated_groups)

# Step 9: Event Pattern Recognition
event_patterns = pipeline.event_pattern_recognition(tokens)

# Step 10: Ontology Instantiator
knowledge = {"Asset Allocation": "Investment strategy focused on diversifying assets."}
pipeline.ontology_instantiator(knowledge)

# Access processed results
print("Tokens:", tokens)
print("Linked Tokens:", linked_tokens)
print("Sentences:", sentences)
print("POS Tags:", pos_tags)
print("Lemmatized Tokens:", lemmatized_tokens)
print("Word Groups:", word_groups)
print("Disambiguated Groups:", disambiguated_groups)
print("Event Phrases:", event_phrases)
print("Event Patterns:", event_patterns)


Tokens: ['Moderna', 'Beats', 'Sales', 'Forecasts', ',', 'But', 'Light', 'Guidance', 'Could', 'Rattle', 'Shares', 'The', 'company', 'issued', 'below-consensus', 'sales', 'views', 'for', '2023', 'and', '2024', '.', 'Moderna', '(', 'MRNA', ')', 'stock', 'could', 'take', 'a', 'hit', 'Thursday', 'after', 'the', 'Covid', 'vaccine', 'maker', 'handily', 'beat', 'third-quarter', 'sales', 'forecasts', ',', 'but', 'reported', 'bigger-than-expected', 'losses', 'and', 'issued', 'light', 'guidance', 'for', 'this', 'ye…', '[', '+4041', 'chars', ']']
Linked Tokens: ['Moderna', 'Beats', 'Sales', 'Forecasts', ',', 'But', 'Light', 'Guidance', 'Could', 'Rattle', 'Shares', 'The', 'company', 'issued', 'below-consensus', 'sales', 'views', 'for', '2023', 'and', '2024', '.', 'Moderna', '(', 'MRNA', ')', 'stock', 'could', 'take', 'a', 'hit', 'Thursday', 'after', 'the', 'Covid', 'vaccine', 'maker', 'handily', 'beat', 'third-quarter', 'sales', 'forecasts', ',', 'but', 'reported', 'bigger-than-expected', 'losses',