In [1]:
# Installing the necessary libraries
!pip install datasets==2.14.0
!pip install torch
!pip install sentence-transformers==2.2.2



### 📚 Step 1: Import Libraries

First, let's import the libraries we need to load the dataset, compute sentence embeddings, and work with tensors.

In [2]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
import torch
import os

### 🗂️ Step 2: Load the Dataset

We load only the `test` split of the multi_news dataset (5,622 rows), which keeps the amount of data we work with small.

In [3]:
dataset = load_dataset("multi_news", split="test")
dataset



Dataset({
    features: ['document', 'summary'],
    num_rows: 5622
})

### 📊 Step 3: Data Preparation

To keep the analysis manageable, we draw a random sample of 2,000 rows from the dataset, fixing `random_state=42` so the same rows are selected on every run.

In [4]:
df = dataset.to_pandas().sample(2000, random_state=42)
df

| | document | summary |
|---|---|---|
| 4830 | Tweet with a location \n \n You can add locati... | – Denis Finley has taken to Twitter to call Po... |
| 1255 | CNN host Piers Morgan just called to discuss h... | – CNN's Piers Morgan thinks gun-rights propone... |
| 80 | White House communications director Anthony Sc... | – New White House communications director Anth... |
| 3044 | CLOSE Scientists say they've found archaeologi... | – Scientists say they have the first physical ... |
| 4486 | Click image above to view graphic \n \n Althou... | – Scientists are calling it a breakthrough and... |
| ... | ... | ... |
| 2157 | On Thursday afternoon, President-elect Donald ... | – He who pays the piper calls the tune, and it... |
| 3615 | Donald Trump said Sunday that in the wake of t... | – In the wake of the Orlando shooting one week... |
| 2751 | Nashua police believe body found is that of mi... | – Sad news out of Nashua, NH, after police say... |
| 622 | The public school systems in New York and Los ... | – Some 640,000 kids in the nation's second-lar... |
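
Fixing the seed is what makes the sample repeatable. The idea can be sketched with the standard library alone. The helper name `sample_indices` is hypothetical, and the rows it picks will differ from the ones pandas selects, since the two use different random number generators.

```python
import random

def sample_indices(n_rows, n_samples, seed):
    """Draw n_samples distinct row indices from range(n_rows), reproducibly."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(range(n_rows), n_samples)

first = sample_indices(5622, 2000, seed=42)
second = sample_indices(5622, 2000, seed=42)

print(first == second)   # True: same seed, same subset
print(len(set(first)))   # 2000: sampling is without replacement
```

Changing the seed gives a different but equally repeatable subset.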


### 🧠 Step 4: Load the Model

Next, we load a pre-trained Sentence Transformer model. This model converts text into dense vectors (embeddings) that capture its meaning, so that semantically similar texts end up close together in vector space.

In [5]:
model = SentenceTransformer("all-MiniLM-L6-v2")
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
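
The printed architecture lists three stages: a Transformer that produces one vector per token, a Pooling layer with `pooling_mode_mean_tokens` enabled that averages those vectors, and a Normalize layer that scales the result to unit length. Here is a simplified sketch of the last two stages in plain Python. The function names are hypothetical, the 3-dimensional token vectors are made up (the real model uses 384 dimensions), and padding tokens, which the real pooling layer skips, are ignored.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors element-wise into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def l2_normalize(vector):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy token vectors; the values are made up for illustration.
tokens = [[1.0, 2.0, 2.0], [3.0, 2.0, 4.0]]

pooled = mean_pool(tokens)
sentence_vector = l2_normalize(pooled)
length = math.sqrt(sum(x * x for x in sentence_vector))

print(pooled)   # [2.0, 2.0, 3.0]
print(length)   # very close to 1.0
```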

### 🔍 Step 5: Generate Embeddings

Here, we encode the article summaries into embeddings: each summary becomes one 384-dimensional vector that we can compare numerically.

In [6]:
# encode returns a NumPy array with one 384-dimensional row per summary
passage_embeddings = model.encode(df['summary'].to_list(), show_progress_bar=True)
passage_embeddings[0].shape

Batches:   0%|          | 0/63 [00:00<?, ?it/s]

(384,)
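
The progress bar reports 63 batches. That matches encoding 2,000 summaries in batches of 32, which we assume here is the default `batch_size` of `model.encode`. The arithmetic is a quick check:

```python
import math

n_texts = 2000     # summaries sampled in Step 3
batch_size = 32    # assumed default batch_size of model.encode

n_batches = math.ceil(n_texts / batch_size)
print(n_batches)   # 63
```

The last batch is only half full (16 summaries), which is why the division is rounded up.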

### 🎯 Step 6: Define a Query

Let's define the query we want to match against the article summaries.

In [7]:
query = "Find me some articles about technology and artificial intelligence"

### 📏 Step 7: Find Relevant Articles

To find articles that match our query, we compute the cosine similarity between the query embedding and every summary embedding, then keep the 3 highest-scoring summaries.

In [8]:
query_embedding = model.encode(query)
similarities = util.cos_sim(query_embedding, passage_embeddings)

top_indices = torch.topk(similarities.flatten(), 3).indices
top_relevant_passages = [df.iloc[x.item()]['summary'][:200] + "..." for x in top_indices]
top_relevant_passages



['– Are you a "digital native" or a "digital immigrant," and does it make a difference? Research recently published in the Teaching and Teacher Education journal indicates the concept of so-called digit...',
 "– Using methods borrowed from Google, a group of researchers has analyzed all Wikipedia pages and determined that, at least on the English language version of the site, Frank Sinatra is the world's mo...",
 '– The "tech surge" to fix HealthCare.gov includes some names from the industry\'s biggest players. Among them, per a Health department blog post, is Michael Dickerson, on leave from his job as a site r...']
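
Cosine similarity is the dot product of two vectors divided by the product of their lengths, and `torch.topk` then picks the highest scores. Here is a dependency-free sketch of those two steps. The helper names are hypothetical and the 2-dimensional vectors are made up for illustration.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_indices(query_vec, passage_vecs, k):
    """Return the indices of the k passages most similar to the query, best first."""
    scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

query_vec = [1.0, 0.0]
passage_vecs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]

print(top_k_indices(query_vec, passage_vecs, k=2))   # [2, 1]
```

Note that `[2.0, 0.0]` scores a perfect 1.0 even though it is twice as long as the query: cosine similarity compares direction only. Since the model ends with a `Normalize()` layer, its embeddings already have unit length, so the cosine similarity equals the plain dot product.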

### 🛠️ Utility Function: Find Relevant News

To simplify the process of finding relevant articles for any query, we encapsulate our code in a function. This function takes a query string, embeds it, and returns the 3 most similar summaries, each truncated to 160 characters.

In [9]:
def find_relevant_news(query):
    # Encode the query using the same model
    query_embedding = model.encode(query)

    # Calculate the cosine similarity between the query and passage embeddings
    similarities = util.cos_sim(query_embedding, passage_embeddings)

    # Get the indices of the top 3 most similar passages
    top_indices = torch.topk(similarities.flatten(), 3).indices

    # Retrieve the summaries of the top 3 passages and truncate them to 160 characters
    top_relevant_passages = [df.iloc[x.item()]["summary"][:160] + "..." for x in top_indices]

    return top_relevant_passages
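
One detail of the function above: the `...` suffix is appended even when a summary is shorter than 160 characters. If that matters for your use case, a small helper could add the ellipsis only when text is actually cut. `truncate` is a hypothetical name and not part of the notebook.

```python
def truncate(text, limit=160, suffix="..."):
    """Return text unchanged if it fits, else its first `limit` characters plus suffix."""
    if len(text) <= limit:
        return text
    return text[:limit] + suffix

print(truncate("Short summary."))   # Short summary.
print(len(truncate("x" * 500)))     # 163
```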

### 📈 Conclusion & Next Steps

This notebook shows how sentence embeddings and cosine similarity can retrieve relevant summaries from a large text dataset. To explore further, try different queries or adapt the function to your needs, for example by changing the number of results it returns. Happy coding! 🌟

In [10]:
# Example queries to explore
print(find_relevant_news("Natural disasters"))
print(find_relevant_news("Law enforcement and police"))
print(find_relevant_news("Politics, diplomacy and nationalism"))

['– The tsunami that killed hundreds, possibly thousands of people after an earthquake in Indonesia on Friday was much bigger and more devastating than would norm...', '– A sad milestone out of Japan: Two weeks after the quake struck, its official death toll has broken the 10,000 mark—and that number is still on the rise, with ...', '– When you live near a major dam, the last thing you want to hear is that the integrity of it has been "compromised" by landslides. But that\'s exactly what resi...']
['– The war of words between Chicago and the federal government over "sanctuary cities" policy is heating up. Attorney General Jeff Sessions slammed the city\'s le...', '– Greg Barnes was in a hurry to get home on Friday, so when he saw police lights behind him on State Road 332 in Muncie, Indiana, "immediately I knew I was in t...', '– "We are not thugs. We are professionals," says the leader of a black policing group, addressing a speech in which President Trump urged officers to not be "to

In [11]:
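def clear_screen():
    # Clear the terminal: "cls" on Windows, "clear" elsewhere.
    # This affects a terminal only; it does not clear notebook cell output.
    os.system("cls" if os.name == "nt" else "clear")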

In [12]:
def interactive_search():
    print("Welcome to the Semantic News Search!\n")
    while True:
        print("Type in a topic you'd like to find articles about, and I'll do the searching! (Type 'exit' to quit)\n> ", end="")

        query = input().strip()

        if query.lower() == "exit":
            print("\nThanks for using the Semantic News Search! Have a great day!")
            break

        print("\n\tHere are 3 articles I found based on your query: \n")

        passages = find_relevant_news(query)
        for passage in passages:
            print("\n\t" + passage)

        input("\nPress Enter to continue searching...")
        clear_screen()

In [13]:
# Start the interactive search
interactive_search()

Welcome to the Semantic News Search!

Type in a topic you'd like to find articles about, and I'll do the searching! (Type 'exit' to quit)
> Science

	Here are 3 articles I found based on your query: 


	– Ever wondered how tiny a bumble bee's brain is? Imagine a sesame seed clinging to a burger bun, reports the Washington Post—in other words, it's about 0.0002%...

	– Three scientists from the United States, France, and Canada have been awarded the Nobel Prize in physics for advances in laser physics, including the first wo...

	– Scientists have made a long-sought—and controversial—breakthrough: They created stem cells from cloned human embryos for the first time, reports AP. In theory...

Press Enter to continue searching...
Type in a topic you'd like to find articles about, and I'll do the searching! (Type 'exit' to quit)
> exit

Thanks for using the Semantic News Search! Have a great day!
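
The loop above reads from `input()` and prints directly, which makes it awkward to test. One possible variation, sketched here with hypothetical names (`run_search_loop`, `search_fn`, `read_fn`, `write_fn`), passes the search, input, and output functions in as arguments so the loop can be driven by scripted input.

```python
def run_search_loop(search_fn, read_fn=input, write_fn=print):
    """Prompt for queries until 'exit'; return the number of searches performed."""
    searches = 0
    while True:
        query = read_fn("> ").strip()
        if query.lower() == "exit":
            write_fn("Thanks for using the Semantic News Search! Have a great day!")
            return searches
        if not query:
            continue  # ignore empty input instead of searching for ""
        for passage in search_fn(query):
            write_fn("\t" + passage)
        searches += 1

# Scripted run with a stand-in search function, so no model is needed.
scripted = iter(["Science", "", "exit"])
printed = []
count = run_search_loop(
    search_fn=lambda q: [f"result for {q}"],
    read_fn=lambda prompt: next(scripted),
    write_fn=printed.append,
)
print(count)   # 1
```

In the notebook itself, the same loop could be started with `run_search_loop(find_relevant_news)`.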
