TF-IDF Model

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [2]:
# Load data
# Forward slash avoids the invalid "\w" escape sequence and works on Windows too
data = pd.read_csv("data/war-news.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,Headlines,Summary,Press,Date,Keyword
0,0,I served in Iraq and Afghanistan but the horro...,A WAR hero traumatised by the horrors of comba...,The Sun,1 day ago,Afghanistan
1,1,The forever war in Afghanistan is nowhere near...,Islamic State is seeking to overthrow the Tali...,ThePrint,2 weeks ago,Afghanistan
2,2,"Hell at Abbey Gate: Chaos, Confusion and Death...","In firsthand accounts, Afghan civilians and U....",ProPublica,1 month ago,Afghanistan
3,3,‘A second Afghanistan’: Doubts over Russia’s w...,Russia's lack of progress in its war against U...,Al Jazeera,5 days ago,Afghanistan
4,4,Afghanistan: Former army general vows new war ...,Lt Gen Sami Sadat tells the BBC of planned ope...,BBC,1 week ago,Afghanistan


In [3]:
# Drop rows with missing values in the 'Headlines' and 'Summary' columns
data = data.dropna(subset=['Headlines', 'Summary'])

# Extract titles and summaries from the DataFrame
titles = data['Headlines'].tolist()
summaries = data['Summary'].tolist()

In [4]:
def preprocess_text(text):
    if isinstance(text, str):  # Check if the input is a string
        text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
        text = text.lower()
        tokens = word_tokenize(text)
        stop_words = set(stopwords.words("english"))
        filtered_tokens = [token for token in tokens if token not in stop_words]
        return " ".join(filtered_tokens)
    else:
        return ""

In [5]:
preprocessed_titles = [preprocess_text(title) for title in titles]
preprocessed_summaries = [preprocess_text(summary) for summary in summaries]

<b>TF-IDF</b>
<ol>
<li>Representation: TF-IDF represents each document as a vector in which each dimension corresponds to a unique term in the corpus.</li>
<li>Term Importance: Each term is weighted by its frequency in the document relative to the number of documents that contain it. A term receives a high weight when it is frequent in a document but rare across the corpus.</li>
<li>Document Comparison: TF-IDF vectors are used to calculate the similarity between documents. The similarity depends on which terms the documents share and how heavily those terms are weighted.</li>
</ol>
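The three points above can be sketched without any library. This is a minimal illustration, not the notebook's pipeline: the documents in `toy_docs` are made up, and the weighting uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which is assumed here to mirror the defaults of scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        vectors.append([x / norm for x in row])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Made-up documents, already lower-cased and stripped of stop words
toy_docs = [
    "yemen war rations",
    "yemen war oil sector",
    "pulitzer prizes photography",
]
vocab, vecs = tfidf_vectors(toy_docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing "yemen" and "war"
sim_02 = cosine(vecs[0], vecs[2])  # documents with no shared term
```

In the first document, "rations" gets a larger weight than "war" because it appears in only one document, and `sim_02` is zero because TF-IDF similarity requires shared terms.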

In [6]:
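# Feature representation
# Note: the vocabulary and IDF weights are learned from the titles only,
# so summary terms that never occur in a title are ignored by transform().
vectorizer = TfidfVectorizer()
title_vectors = vectorizer.fit_transform(preprocessed_titles)
summary_vectors = vectorizer.transform(preprocessed_summaries)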

In [7]:
# Similarity calculation
similarity_scores = cosine_similarity(title_vectors, summary_vectors)

In [8]:
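# For each of the last 5 titles, find the best-matching summary (index j) and print its score.
# Note: summaries[i] is the title's own summary, while the score belongs to summaries[j];
# print summaries[j] instead to show the summary that produced the score.
most_similar = [(i, j, similarity_scores[i][j]) for i, j in enumerate(similarity_scores.argsort(axis=1)[:, -1])]
for i, j, score in most_similar[-5:]:
    print(f"Title: {titles[i]}\nSummary: {summaries[i]}\nSimilarity score: {score}\n")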

Title: The Shattering of Yemen
Summary: U.S. President Joe Biden has made ending Yemen's civil war a central pillar 
of his Middle East policy. In his maiden foreign policy address...
Similarity score: 0.20013309095640294

Title: We first cut rations in half. Now, we'll take food from the hungry to save 
the starving.
Summary: We first cut rations in half. Now, we'll take food from the hungry to save 
the starving. · As war rages in Yemen, we will be forced to decide...
Similarity score: 0.8552915458695548

Title: Yemen war: Two foreign Doctors Without Borders workers ...
Summary: The conflict began in 2014 after Houthi rebels seized the capital Sanaa and 
continued advancing. A Saudi-led military coalition intervened to...
Similarity score: 0.26489028866002146

Title: Devastated by war, Yemen's still surviving oil and gas sector ...
Summary: Yemen's oil and gas industry could be at a crossroads after six years of 
brutal civil war, with the US attempting to broker a peace deal...
Simi
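The cell above takes `most_similar[-5:]`, which is the last five titles in the file rather than the five highest-scoring pairs. A minimal sketch of selecting the top pairs across the whole matrix follows; `top_k_pairs` and `toy_scores` are hypothetical names and values introduced only for illustration.

```python
import heapq

def top_k_pairs(scores, k=5):
    """Return the k highest-scoring (row, col, score) triples, best first."""
    triples = (
        (i, j, float(s))
        for i, row in enumerate(scores)
        for j, s in enumerate(row)
    )
    return heapq.nlargest(k, triples, key=lambda t: t[2])

# Made-up 3x3 similarity matrix (rows: titles, columns: summaries)
toy_scores = [
    [0.20, 0.05, 0.10],
    [0.02, 0.86, 0.30],
    [0.15, 0.26, 0.07],
]
best = top_k_pairs(toy_scores, k=2)
```

The function only iterates over rows and values, so it should also accept the NumPy array returned by `cosine_similarity`; each returned `(i, j, score)` can then be printed with `titles[i]` and `summaries[j]`.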

<b>Word2Vec</b>
<ol>
<li>Representation: Word2Vec represents each word as a dense vector in a continuous vector space. It captures the semantic relationships between words.</li>
<li>Term Similarity: Word2Vec is trained on large corpora to learn word embeddings such that semantically similar words have similar vector representations.</li>
<li>Document Representation: Document vectors can be obtained by averaging or otherwise combining the vectors of the words in the document.</li>
<li>Document Comparison: Similarity between documents is calculated from their document vectors, typically with cosine similarity, so documents can match on meaning even when they share no exact terms.</li>
</ol>
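As a rough sketch of the document-representation and comparison steps, the example below averages word vectors into one vector per document and compares documents by cosine similarity. The three-dimensional `toy_embeddings` are invented for illustration and stand in for a real pre-trained model; `document_vector` is a hypothetical helper.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_embeddings = {
    "war":      [0.9, 0.1, 0.0],
    "conflict": [0.8, 0.2, 0.1],
    "oil":      [0.1, 0.9, 0.2],
    "gas":      [0.0, 0.8, 0.3],
}

def document_vector(text, embeddings, dim=3):
    """Average the vectors of in-vocabulary words; zero vector if none match."""
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

title_vec = document_vector("yemen war", toy_embeddings)            # "yemen" is out of vocabulary
related_vec = document_vector("conflict escalates", toy_embeddings)  # shares no token with the title
unrelated_vec = document_vector("oil gas", toy_embeddings)

related_score = cosine(title_vec, related_vec)
unrelated_score = cosine(title_vec, unrelated_vec)
```

Unlike TF-IDF, the title and the related document score highly despite having no token in common. With a gensim `KeyedVectors` model the same pattern would be `[model[w] for w in text.split() if w in model]` followed by `np.mean(..., axis=0)`, applied to titles and summaries with a single model so that both sides live in the same vector space.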

In [2]:
from gensim.models import KeyedVectors
import numpy as np


In [10]:

# Load pre-trained word embeddings
word2vec_model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)


In [3]:
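# Load GloVe vectors from a text file. This call expects a word2vec-style header line;
# the raw Stanford GloVe files have none, so gensim >= 4.0 needs no_header=True for them.
glove_model = KeyedVectors.load_word2vec_format("glove.6B/glove.6B.300d.txt", binary=False)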

In [16]:
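# Create word embedding matrices
# Caution: each element of preprocessed_titles / preprocessed_summaries is a whole document string,
# so `word in model` only holds for single-token documents, and rows no longer align with titles/summaries.
# Titles use word2vec and summaries use GloVe: two different spaces whose vectors are not comparable.
title_embeddings = np.array([word2vec_model[word] for word in preprocessed_titles if word in word2vec_model])
summary_embeddings = np.array([glove_model[word] for word in preprocessed_summaries if word in glove_model])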


In [17]:

# Calculate cosine similarity
similarity_scores = cosine_similarity(title_embeddings, summary_embeddings)


In [18]:

# For the last 5 rows of the similarity matrix, print the best-matching column and its score
most_similar = [(i, j, similarity_scores[i][j]) for i, j in enumerate(similarity_scores.argsort(axis=1)[:, -1])]
for i, j, score in most_similar[-5:]:
    print(f"Title: {titles[i]}\nSummary: {summaries[j]}\nSimilarity score: {score}\n")



Title: Afghanistan: Nearly 20 million going hungry | | UN News
Summary: Russia-Ukraine war: Meet the Afghan refugees fighting Moscow's latest 
invasion. Afghans, many veterans of battles against the Taliban, have...
Similarity score: -0.06837846338748932

Title: Russia-Ukraine war: Meet the Afghan refugees fighting ...
Summary: Russia-Ukraine war: Meet the Afghan refugees fighting Moscow's latest 
invasion. Afghans, many veterans of battles against the Taliban, have...
Similarity score: -0.06837846338748932

Title: New York Times wins 3 Pulitzer Prizes, Reuters wins for feature photography
Summary: Russia-Ukraine war: Meet the Afghan refugees fighting Moscow's latest 
invasion. Afghans, many veterans of battles against the Taliban, have...
Similarity score: -0.06837846338748932

Title: Why Imran Khan's coup theory is so popular in Pakistan
Summary: Russia-Ukraine war: Meet the Afghan refugees fighting Moscow's latest 
invasion. Afghans, many veterans of battles against the Taliban, hav

In [19]:
# Print preprocessed titles and summaries
for title, summary in zip(preprocessed_titles, preprocessed_summaries):
    print(f"Preprocessed Title: {title}")
    print(f"Preprocessed Summary: {summary}")


Preprocessed Title: served iraq afghanistan horrors war turned 120 hour sex worker
Preprocessed Summary: war hero traumatised horrors combat working 120anhour escortgrace parker 35 served iraq afghanistan
Preprocessed Title: forever war afghanistan nowhere near end indulging ethnic warfare
Preprocessed Summary: islamic state seeking overthrow talibanquietly helped along discontent ranks economic crisis disputes
Preprocessed Title: hell abbey gate chaos confusion death final days war afghanistan
Preprocessed Summary: firsthand accounts afghan civilians us marines describe desperate struggle flee kabul airports last open
Preprocessed Title: second afghanistan doubts russias war prosecution
Preprocessed Summary: russias lack progress war ukraine noted analysts since launched second phase
Preprocessed Title: afghanistan former army general vows new war taliban
Preprocessed Summary: lt gen sami sadat tells bbc planned operations many afghans weary conflict
Preprocessed Title: putins afghani

In [20]:
# Testing the Word2Vec model
example_index = 1
title_word_vectors = [word2vec_model.get_vector(word) for word in preprocessed_titles[example_index].split() if word in word2vec_model.key_to_index]
summary_word_vectors = [word2vec_model.get_vector(word) for word in preprocessed_summaries[example_index].split() if word in word2vec_model.key_to_index]

print(f"Word Vectors for Example {example_index} Title: {title_word_vectors}")

print(f"Word Vectors for Example {example_index} Summary: {summary_word_vectors}")

Word Vectors for Example 1 Title: [array([ 3.32031250e-02,  1.42578125e-01, -4.34570312e-02,  4.58984375e-01,
       -1.60156250e-01, -1.77001953e-02,  3.26171875e-01, -2.25585938e-01,
        2.92968750e-01,  1.88476562e-01, -8.54492188e-02, -6.29882812e-02,
       -3.02734375e-01, -2.38037109e-02, -1.78710938e-01,  1.15722656e-01,
        3.55468750e-01,  1.67968750e-01,  2.16796875e-01, -1.55273438e-01,
       -5.07812500e-02, -1.60156250e-01,  2.57812500e-01,  2.91748047e-02,
        1.31835938e-01,  6.68945312e-02,  7.32421875e-02, -5.12695312e-02,
        1.25000000e-01, -2.31445312e-01, -8.34960938e-02,  2.01416016e-02,
        2.67333984e-02,  1.15722656e-01,  1.31835938e-01,  1.02050781e-01,
        1.31835938e-01, -1.56250000e-01, -8.78906250e-02,  5.90820312e-02,
        6.64062500e-02,  1.81640625e-01,  3.32031250e-01, -3.63281250e-01,
        1.98242188e-01, -8.25195312e-02,  2.59765625e-01,  1.85546875e-01,
        6.34765625e-02,  6.78710938e-02, -2.39257812e-01,  1.8847

Universal Sentence Encoder (USE)

In [21]:
from tensorflow_hub import load

use_model = load("https://tfhub.dev/google/universal-sentence-encoder/4")

title_embeddings = use_model(preprocessed_titles)
summary_embeddings = use_model(preprocessed_summaries)

similarity_scores = cosine_similarity(title_embeddings, summary_embeddings)


In [22]:
most_similar = [(i, j, similarity_scores[i][j]) for i, j in enumerate(similarity_scores.argsort(axis=1)[:, -1])]
for i, j, score in most_similar[-5:]:
    print(f"Title: {titles[i]}\nSummary: {summaries[j]}\nSimilarity score: {score}\n")

Title: The Shattering of Yemen
Summary: The Houthis have won the war in Yemen, defeating their opponents in the 
civil war, the Saudis who intervened in 2015 against them,...
Similarity score: 0.5473129153251648

Title: We first cut rations in half. Now, we'll take food from the hungry to save 
the starving.
Summary: We first cut rations in half. Now, we'll take food from the hungry to save 
the starving. · As war rages in Yemen, we will be forced to decide...
Similarity score: 0.8253199458122253

Title: Yemen war: Two foreign Doctors Without Borders workers ...
Summary: ... two countries had discussed the Yemen conflict "despite the existence 
... to wage war against the Iran-aligned Houthi movement in Yemen.
Similarity score: 0.5536632537841797

Title: Devastated by war, Yemen's still surviving oil and gas sector ...
Summary: But its brutal war on Ukraine has pushed the U.S. to look elsewhere for 
that oil, that includes the home of the largest crude oil reserves in...
Similarity sco