## NLP + LLM Metadata Generation: SQuAD

This notebook demonstrates how to generate metadata for the SQuAD dataset using a combination of traditional NLP techniques and Large Language Models (LLMs). The workflow includes:

- Reading and exploring the SQuAD dataset
- Extracting keywords and keyphrases using BERT-based models
- Generating story titles using advanced LLMs (LLAMA 3.1 70B Instruct, Mistral)

**Note:**
- Ensure you have the required credentials and access to IBM Watson Machine Learning for LLM inference.
- The notebook assumes familiarity with Python, pandas, and basic NLP concepts.

In [1]:
import os
from datetime import date
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT
import numpy as np



In [2]:
from dotenv import load_dotenv

# Path to your .env file
env_path = "../.env"  # Change path if needed

# Load environment variables from .env
load_dotenv(dotenv_path=env_path)
# Access the environment variables
watsonx_url = os.getenv("watsonx_url")
watsonx_apikey = os.getenv("watsonx_apikey")
watsonx_projectID = os.getenv("watsonx_projectID")

print(watsonx_url)

https://us-south.ml.cloud.ibm.com


In [2]:
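def get_all_files(folder_name):
    """Return the paths of all entries in folder_name, sorted by name."""
    file_path_list = []
    # List the folder directly instead of calling os.chdir(), so the returned
    # relative paths stay valid from the notebook's working directory.
    for file in sorted(os.listdir(folder_name)):
        print(file)
        file_path_list.append(os.path.join(folder_name, file))
    return file_path_list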

### 1. Read SQuAD Data

In [None]:
files = get_all_files('./data/Squad')

.DS_Store
dev-v1.1.json
train-v1.1.json


In [6]:
# Indices follow the listing above: files[1] is the dev split, files[2] the train split
df_docs_train = pd.read_json(files[2])
df_docs_dev = pd.read_json(files[1])

In [7]:
def squad_count(df_squad, source):
    """Print and return the total number of paragraphs in a SQuAD DataFrame."""
    # `source` is unused here; it is kept so the signature matches index_full_content_squad.
    total = 0
    for ind in df_squad.index:
        data = df_squad['data'][ind]
        total += len(data['paragraphs'])
    print(total)
    return total
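
The nesting that `squad_count` walks is easier to see on a tiny hand-made sample. The following is a minimal sketch, assuming the SQuAD v1.1 layout (`data` → `paragraphs` → `context`). The sample record and the helper names `count_paragraphs` and `iter_contexts` are made up for illustration and are not part of the notebook.

```python
# Hypothetical two-article sample that mimics the SQuAD v1.1 JSON layout.
sample = {
    "version": "1.1",
    "data": [
        {
            "title": "Article_A",
            "paragraphs": [
                {"context": "First paragraph.", "qas": []},
                {"context": "Second paragraph.", "qas": []},
            ],
        },
        {
            "title": "Article_B",
            "paragraphs": [{"context": "Only paragraph.", "qas": []}],
        },
    ],
}


def count_paragraphs(squad):
    """Return the total number of context paragraphs across all articles."""
    return sum(len(article["paragraphs"]) for article in squad["data"])


def iter_contexts(squad):
    """Yield (title, paragraph_index, context) for every paragraph."""
    for article in squad["data"]:
        for i, paragraph in enumerate(article["paragraphs"]):
            yield article["title"], i, paragraph["context"]


total = count_paragraphs(sample)
contexts = list(iter_contexts(sample))
print(total)
for title, i, context in contexts:
    print(title, i, context)
```

With `pd.read_json`, each DataFrame row holds one of these article dictionaries in its `data` column, which is why the notebook indexes `df_squad['data'][ind]` before reading `paragraphs`.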

### 2. Metadata Generation 

### 2.1 NLP-Based Keyword Extraction Using BERT

In [8]:
kw_model = KeyBERT()

def extract_keyphrases(text, top_n=10, ngram_range=(1, 3)):
    # Candidate n-grams come from the text itself, minus English stop words
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words='english').fit([text])
    candidate_phrases = vectorizer.get_feature_names_out()
    keywords = kw_model.extract_keywords(
        text, keyphrase_ngram_range=ngram_range, stop_words='english',
        use_mmr=True, candidates=candidate_phrases, top_n=top_n, diversity=0.9,
    )
    # Keep only the phrases and drop the similarity scores
    return [key for key, _ in keywords]
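
The `use_mmr=True, diversity=0.9` setting trades relevance for variety through Maximal Marginal Relevance (MMR). The sketch below shows the general idea on toy 2-D vectors. It is a plain-Python illustration, not KeyBERT's actual code. The vectors, the candidate names, and the helper names `cosine` and `mmr` are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, cand_vecs, cand_names, top_n=2, diversity=0.9):
    """Pick top_n candidates, balancing relevance against redundancy."""
    doc_sim = [cosine(doc_vec, c) for c in cand_vecs]
    # Start with the candidate most similar to the document.
    selected = [max(range(len(cand_vecs)), key=lambda i: doc_sim[i])]
    remaining = [i for i in range(len(cand_vecs)) if i not in selected]
    while remaining and len(selected) < top_n:
        def score(i):
            # Penalise similarity to anything that was already picked.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * doc_sim[i] - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [cand_names[i] for i in selected]


# Toy embeddings: "a" and "b" point almost the same way, "c" is different.
doc = [1.0, 0.2]
vecs = [[1.0, 0.2], [1.0, 0.3], [0.2, 1.0]]
names = ["a", "b", "c"]

diverse = mmr(doc, vecs, names, top_n=2, diversity=0.9)
relevant = mmr(doc, vecs, names, top_n=2, diversity=0.0)
print(diverse, relevant)
```

With a high diversity value the second pick is the dissimilar candidate `"c"`. With diversity set to 0 the near-duplicate `"b"` wins instead, which is the behaviour the notebook avoids by using 0.9.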

### 2.2 Title Extraction Using LLAMA 3.1 70B Instruct

In [None]:
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
# Project ID is loaded from the .env file above rather than hardcoded
project_id = watsonx_projectID
# Credentials for Watson Machine Learning
wml_credentials = {
    "url": watsonx_url,
    "apikey": watsonx_apikey,
    "project_id": watsonx_projectID
}

def build_prompt(context,model_id="MIXTRAL"):
    
    formatted_prompt=""

    SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
                        If you don't know the answer to a question, don't hallucinate and share false information."""

    USER_PROMPT = """A user needs to extract key topics, important phrases, synonyms, and acronyms from the given text: '{context}'.
    Don't include any information that is not available in the context. Provide the output only in valid JSON format with the keys topics, phrases, synonyms, and acronyms.
    """

    LLAMA3_PROMPT= """
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    {system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
    {user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    Answer Based on the Provided Context: 
    """

    MIXTRAL_PROMPT = """[INST]
    [ROLE]
    {system_prompt}
    [/ROLE]
    [USER_INSTRUCTIONS]
    {user_prompt}
    [/USER_INSTRUCTIONS]

    Answer Based on the Provided Context:
    [/INST]"""

    user_prompt = USER_PROMPT.format(context=context)
    if  model_id == "MIXTRAL":
        formatted_prompt = MIXTRAL_PROMPT.format(system_prompt=SYSTEM_PROMPT,user_prompt=user_prompt)
    elif model_id == "LLAMA3":
        formatted_prompt = LLAMA3_PROMPT.format(system_prompt=SYSTEM_PROMPT,user_prompt=user_prompt)
    
    return formatted_prompt

def build_prompt_topic(context,model_id="MIXTRAL"):
    
    formatted_prompt=""

    SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
                        If you don't know the answer to a question, don't hallucinate and share false information."""

    USER_PROMPT = """A user needs a title for the story. Use the information available in '{context}' to provide the story title only. Don't add any other information.
    """

    LLAMA3_PROMPT= """
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    {system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
    {user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    Answer Based on the Provided Context: 
    """

    MIXTRAL_PROMPT = """[INST]
    [ROLE]
    {system_prompt}
    [/ROLE]
    [USER_INSTRUCTIONS]
    {user_prompt}
    [/USER_INSTRUCTIONS]

    Answer Based on the Provided Context:
    [/INST]"""

    user_prompt = USER_PROMPT.format(context=context)
    if  model_id == "MIXTRAL":
        formatted_prompt = MIXTRAL_PROMPT.format(system_prompt=SYSTEM_PROMPT,user_prompt=user_prompt)
    elif model_id == "LLAMA3":
        formatted_prompt = LLAMA3_PROMPT.format(system_prompt=SYSTEM_PROMPT,user_prompt=user_prompt)
    
    return formatted_prompt
      

def send_to_watsonxai(prompts,
                    model_id="MIXTRAL",
                    decoding_method="greedy",
                    max_new_tokens=2000,
                    min_new_tokens=2,
                    temperature=1.0,
                    repetition_penalty=1.0
                    ):
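    # Map the short alias to the watsonx.ai model ID
    if model_id == "MIXTRAL":
        model_name = "mistralai/mistral-large"
    elif model_id == "LLAMA3":
        model_name = "meta-llama/llama-3-1-70b-instruct"
    else:
        raise ValueError(f"Unsupported model_id: {model_id}")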
    # Instantiate parameters for text generation
    model_params = {
        GenParams.DECODING_METHOD: decoding_method,
        GenParams.MIN_NEW_TOKENS: min_new_tokens,
        GenParams.MAX_NEW_TOKENS: max_new_tokens,
        #GenParams.RANDOM_SEED: 42,
        GenParams.TEMPERATURE: temperature,
        GenParams.REPETITION_PENALTY: repetition_penalty,
    }
    # Instantiate a model proxy object to send your requests
    model = Model(
        model_id=model_name,
        params=model_params,
        credentials=wml_credentials,
        project_id=watsonx_projectID)

    print("Model used ---",model.model_id)
    response=model.generate_text(prompts)
    # print(response)
    return response

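def get_title(story):
    model_name = "LLAMA3"
    # Use the title prompt (build_prompt_topic); build_prompt asks for keyword JSON instead
    llm_input = build_prompt_topic(story, model_name)
    llm_response = send_to_watsonxai(llm_input, model_name)
    return llm_response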
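
`build_prompt` asks the model to answer only in JSON, but generated text often arrives with extra words around the object. The following is a minimal sketch of a defensive parser, assuming the response is a plain string. The helper name `parse_metadata` and the sample response are hypothetical and not part of the notebook.

```python
import json


def parse_metadata(response_text):
    """Parse the outermost {...} span of an LLM response; return {} on failure."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return {}
    # Only accept a JSON object, since the prompt asks for named keys.
    return parsed if isinstance(parsed, dict) else {}


# Made-up response text with chatter before and after the JSON object.
raw = 'Sure! {"topics": ["Super Bowl 50"], "acronyms": ["NFL"]} Hope this helps.'
metadata = parse_metadata(raw)
empty = parse_metadata("no json here")
print(metadata.get("topics"), metadata.get("acronyms"), empty)
```

Returning an empty dictionary on failure keeps a long indexing loop running when one response is malformed. Whether to skip, retry, or log such cases is a design choice left open here.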

## Testing Story Title

In [None]:
# Smoke test on the first dev article: print the SQuAD title, the generated title, and the keywords
for ind in df_docs_dev[0:1].index:
    print("Processing ------", ind)
    data = df_docs_dev['data'][ind]
    title = data['title']
    for paragraph in data['paragraphs']:
        story = paragraph['context']
        keywords = extract_keyphrases(story)
        print(title)
        print(get_title(story))
        print(keywords)

In [None]:
def index_full_content_squad(df_squad, source):
    """Build BM25 and kNN documents for every paragraph in a SQuAD DataFrame."""
    # NOTE: `model` is not defined in this notebook. It is assumed to be a
    # sentence-embedding model with an .encode() method (for example a
    # sentence_transformers.SentenceTransformer) created in an earlier cell.
    doc_bm25 = []
    doc_knn = []
    for ind in df_squad.index:
        print("Processing ------", ind)
        data = df_squad['data'][ind]
        title = data['title']
        for paragraph in data['paragraphs']:
            story = paragraph['context']
            content_embedding = model.encode(story)
            keywords = extract_keyphrases(story)
            title_gen = get_title(story)
            # Keyword-search document: text plus generated metadata
            doc_bm25.append({
                "id": title,
                "source": source,
                "story": story,
                "title": title,
                "title_gen": title_gen,
                "keywords": keywords,
            })
            # Vector-search document: text plus its embedding
            doc_knn.append({
                "id": title,
                "source": source,
                "story": story,
                "story_embedding": content_embedding,
            })
    return doc_bm25, doc_knn

In [36]:
doc_bm25 ,doc_knn =index_full_content_squad(df_docs_dev,'squad')

Processing ------ 0


Model used --- meta-llama/llama-3-1-70b-instruct

In [26]:
len(df_docs_train)

442

## Train

In [49]:
doc_bm25 ,doc_knn =index_full_content_squad(df_docs_train[10:20],'squad')

Processing ------ 10


Model used --- meta-llama/llama-3-1-70b-instruct