# ðŸ“˜ Mini Project: Automatic Keyword Extraction using Sentence Transformers

##  Objective
In this mini-project, weâ€™ll build a simple **semantic keyword extractor** using the **SentenceTransformer** model.  
Given any input text, the model will find the **most relevant keywords** (topics) based on **cosine similarity** between the text and a predefined list of keywords.


In [9]:
# !pip install torch
# !pip install sentence_transformers
# !pip install tf-keras


In [None]:
import numpy as np
import torch
from sentence_transformers import SentenceTransformer, util

- all-MiniLM-L6-v2 is a pre-trained Sentence Transformer model used to convert text into meaningful numerical vectors (embeddings).
- It understands the meaning of sentences and converts them into numbers.

In [41]:
model = SentenceTransformer("all-MiniLM-L6-v2")

### ðŸ”¹Cosine Similarity

Cosine similarity measures **how close two embeddings are**.  
Itâ€™s widely used to check **semantic similarity** between two pieces of text.


In [42]:
sentences = ["I love playing cricket", "i am fine."]
# Generate embeddings
embeddings = model.encode(sentences)

print("Sentence Embeddings Shape:", embeddings.shape)
# embeddings[0]

Sentence Embeddings Shape: (2, 384)


In [35]:
# embeddings[1]

In [43]:
# Calculate cosine similarity between the two sentences
similarity_score = util.cos_sim(embeddings[0], embeddings[1])

print(f"Cosine Similarity: {similarity_score.item():.4f}")

Cosine Similarity: 0.1219


## Define Domain Keywords
Weâ€™ll create a list of keywords covering multiple domains such as **AI, sports, science, politics, health**, etc.


In [2]:
# Expanded Keywords
domain_keywords = {

    "Technology": [
        "artificial intelligence", "machine learning", "deep learning",
        "neural network", "nlp", "computer vision", "chatgpt",
        "algorithm", "data science", "big data", "blockchain",
        "cybersecurity", "cloud computing", "software engineering",
        "programming", "coding", "python", "java", "api",
        "database", "sql", "devops", "automation", "internet of things",
        "iot", "web development", "mobile app"
    ],

    "Sports": [
        "football", "soccer", "cricket", "basketball", "tennis",
        "badminton", "athletics", "olympics", "world cup",
        "tournament", "match", "player", "coach", "team",
        "goal", "score", "stadium", "league", "fitness",
        "training", "workout", "medal", "championship"
    ],

    "Politics": [
        "election", "vote", "voting", "government", "parliament",
        "president", "prime minister", "minister", "policy",
        "legislation", "law", "constitution", "democracy",
        "political party", "campaign", "diplomacy",
        "foreign relations", "international affairs", "governance"
    ],

    "Science": [
        "biology", "physics", "chemistry", "genetics",
        "astronomy", "astrophysics", "space", "nasa",
        "experiment", "research", "laboratory", "scientist",
        "theory", "discovery", "quantum", "evolution",
        "climate science", "scientific study"
    ],

    "Health": [
        "health", "healthcare", "medicine", "medical",
        "doctor", "hospital", "patient", "disease",
        "treatment", "therapy", "vaccine", "diagnosis",
        "mental health", "psychology", "nutrition",
        "diet", "fitness", "exercise", "wellness",
        "public health"
    ],

    "Business & Economics": [
        "business", "startup", "entrepreneurship",
        "market", "marketing", "finance", "economics",
        "investment", "stock", "share", "trading",
        "revenue", "profit", "loss", "budget",
        "inflation", "economic growth", "supply chain",
        "demand", "industry"
    ],

    "Environment": [
        "environment", "climate change", "global warming",
        "pollution", "air pollution", "water pollution",
        "sustainability", "renewable energy", "solar energy",
        "wind energy", "conservation", "ecosystem",
        "biodiversity", "deforestation", "carbon emissions",
        "greenhouse gases"
    ],

    "Entertainment": [
        "movie", "film", "cinema", "actor", "actress",
        "music", "song", "album", "concert",
        "bollywood", "hollywood", "web series",
        "television", "tv show", "director",
        "trailer", "box office", "streaming"
    ],

    "Education": [
        "education", "school", "college", "university",
        "student", "teacher", "professor", "lecture",
        "exam", "test", "assignment", "syllabus",
        "curriculum", "degree", "online learning",
        "e-learning", "training", "skill development"
    ],

    "Psychology & Social": [
        "psychology", "behavior", "mental health",
        "emotion", "stress", "anxiety", "depression",
        "cognitive", "personality", "therapy",
        "social behavior", "society", "culture",
        "human behavior"
    ]
}


##  Building the Keyword Extraction Function
Weâ€™ll define a function that encodes the input text and compares it with each keyword using **cosine similarity**.  
The top few matches (highest similarity scores) will be selected as the most relevant keywords.


In [20]:
# Function to extract top N matching keywords

def extract_top_keywords(content,keywords,keyword_embeddings,top_n=5):
    keyword_embeddings = model.encode(keywords)
    # Encode the content
    content_embeddings =model.encode(content)
    
    # Compute cosine similarities
    similarites = util.cos_sim(content_embeddings,keyword_embeddings).flatten()
    
    # Get indices of top N similarities
    top_indices = torch.topk(similarites,top_n).indices
    
    # Retrieve top N keywords based on the indices
    top_keywords = [keywords[i] for i in top_indices]
    
    return top_keywords
    


##  Running the Interactive Keyword Extractor
Letâ€™s take user input and display the top-matching keywords in real time.  
To make the output neat, weâ€™ll show only the first and last few words of the entered text.


In [38]:
# # Extract top keywords for each content sample
# while True:
#     content = input("Input text to get keywords or press (exit) :")
#     print("\n--------------------------------------------------------------------------------------------------------")
#     if content=="exit" or content=="Exit" :
#         break
#     else:
#         top_keywords = extract_top_keywords(content,keywords,keyword_embeddings,top_n=5)
#         print("Your Content----> ", content[:40], "*********************", content[-40:],"\n")
#         print("Top Keywords---->",top_keywords,"\n")

In [22]:
def extract_top_keywords(content, keywords, model, top_n=5):
    # Encode keywords & content
    keyword_embeddings = model.encode(keywords, convert_to_tensor=True)
    content_embedding = model.encode(content, convert_to_tensor=True)

    # Cosine similarity
    similarities = util.cos_sim(content_embedding, keyword_embeddings).squeeze()

    # Top N
    top_indices = torch.topk(similarities, top_n).indices

    # Convert tensor index â†’ int
    top_keywords = [keywords[i.item()] for i in top_indices]

    return top_keywords


In [23]:
while True:
    content = input("Input text to get keywords or press (exit): ")
    print("\n" + "-"*100)

    if content.lower() == "exit":
        break
    else:
        top_keywords = extract_top_keywords(
            content,
            keywords,
            model,        # âœ… pass model, NOT keyword_embeddings
            top_n=5
        )

        print("Your Content ----> ",
              content[:40], "*********************", content[-40:], "\n")
        print("Top Keywords ---->", top_keywords, "\n")


Input text to get keywords or press (exit):  Sports are crucial for holistic development, offering physical fitness, mental sharpness, and essential life skills like teamwork and discipline; whether individual pursuits like running or team games like football, they provide healthy competition, stress relief, and build character, uniting communities and fostering a sense of national pride, making them vital for a balanced, active lifestyle for all ages. 



----------------------------------------------------------------------------------------------------
Your Content ---->  Sports are crucial for holistic developm ********************* alanced, active lifestyle for all ages.  

Top Keywords ----> ['athletics', 'fitness', 'football', 'basketball', 'Olympics'] 



Input text to get keywords or press (exit):  exit



----------------------------------------------------------------------------------------------------
