# 1. Dataset Sourcing

## Dataset Selection

The dataset chosen for this project is the **News Category Dataset**, a publicly available dataset containing over 200,000 news headlines with accompanying metadata. Each entry in the dataset includes:
- **Headline:** The title of the news article.
- **Short Description:** A summary or abstract of the article's content.
- **Category:** The classification of the article into one of 42 predefined categories, such as business, politics, technology, sports, and more.
- **Date:** The publication date of the article.

Initially, I considered two potential datasets for this project: the **CNN/Daily Mail Dataset** and the **News Category Dataset**. While both are well-suited for tasks involving news articles, I ultimately selected the News Category Dataset for the following reasons:

1. **Concise and Pre-Processed Content:** The News Category Dataset provides short, well-structured summaries and headlines, eliminating the need for extensive pre-processing. This allows for a more efficient pipeline.
2. **Lightweight and Faster Processing:** Compared to the larger and more complex CNN/Daily Mail Dataset, the News Category Dataset is significantly smaller, making it more practical to process on a local machine without a GPU.
3. **Accessibility:** Since my local setup lacks a GPU, I relied on Google Colab for computations. The smaller size of the News Category Dataset ensured smoother and faster processing in this environment.
4. **Diversity of Categories:** It spans a wide range of topics, from politics to lifestyle, which is ideal for testing the flexibility of semantic search models across varied domains.

## Relevance to the Task

This dataset fits the project goals for several reasons:
- **Diversity of Topics:** The dataset includes articles from a broad range of categories such as business, politics, entertainment, and sports. This diversity provides a comprehensive base for building a semantic search system capable of handling varied user queries.
- **Textual Richness:** Both headlines and short descriptions are included for each article, enabling us to explore multiple text representations. These fields are essential for creating robust embeddings that capture semantic meaning.
- **Categorical Information:** The predefined categories in the dataset are crucial for training and validating the semantic search model. They also provide a natural extension for tasks like classification and topic modeling.
- **Scalability:** With roughly 210,000 records (209,527 in the version loaded below), the dataset is substantial enough to train and evaluate advanced NLP models while being manageable within the constraints of the project timeline.

## Why This Dataset Was Selected

The News Category Dataset was chosen over other alternatives because of its combination of:
- **Real-World Relevance:** The dataset closely resembles the type of data managed by Dow Jones, whose flagship publication, The Wall Street Journal, also features categorized news articles. Using this dataset allows us to simulate real-world scenarios for semantic search and other NLP tasks.
- **Alignment with NLP Goals:** An NLP system requires textual data with meaningful patterns and relationships. The dataset's headlines and descriptions offer ideal input for generating embeddings and testing semantic similarity.
- **Potential for Generative AI:** The dataset can be integrated with GPT to generate summaries, answer questions, and provide detailed responses based on the retrieved articles. This showcases the practical utility of combining semantic search with Generative AI.
- **Practicality:** Compared to larger datasets (e.g., CNN/Daily Mail), the size of the News Category Dataset makes it computationally efficient to process while still being large enough to demonstrate robust results.

## Relevance to Dow Jones & Company

Dow Jones specializes in providing financial news and business information, often categorized for specific user needs. The News Category Dataset mimics these real-world data structures, enabling us to design and test AI systems with potential applications for:
- **Personalized Content Delivery:** Recommending relevant news articles to users based on their interests.
- **Efficient Search Systems:** Allowing users to retrieve articles using semantic queries, improving the speed and relevance of search results.
- **Enhanced User Experience:** By leveraging embeddings and Generative AI, the system can offer summaries or answer questions, adding value to content delivery platforms.

This dataset not only supports the goals of this project but also aligns with Dow Jones's core business model, making it an ideal choice for this assignment.

# 2. Problem Definition

## Problem Statement

The goal of this project is to design and implement a **Semantic Search system enhanced with Generative AI capabilities** using the **News Category Dataset**. The system aims to allow users to retrieve and interact with news articles through natural language queries. By leveraging **semantic embeddings** for better contextual understanding and integrating LLMs, the system can provide features such as question answering and tagging documents with class labels.

## Significance of the Problem

1. **Information overload:** In the digital age, users often struggle to find relevant information quickly due to the overwhelming volume of content. Semantic search addresses this issue by retrieving documents based on meaning rather than simple keyword matching, making search results more relevant and intuitive.

2. **Demand for advanced search features:** As users expect richer interactions with search systems, basic retrieval models are no longer sufficient. The integration of **Generative AI** allows the system to go beyond retrieval, providing users with dynamic insights, summaries, and answers tailored to their queries.

3. **Alignment with real-world applications:** For companies like Dow Jones, which rely on organizing, categorizing, and delivering business and financial news, an advanced search and recommendation system can significantly enhance user engagement and satisfaction.

4. **Scalable framework for NLP tasks:** A semantic search system powered by embeddings and LLMs (Large Language Models) serves as the foundation for several downstream tasks such as summarization, classification, and recommendations. This modular approach demonstrates technical expertise while ensuring extensibility.

## Impact of Solving the Problem

1. **Enhanced user experience:** By implementing a robust semantic search system, users can find content faster and with greater accuracy, improving overall satisfaction. Features like dynamic summaries and Q&A further enhance the interaction, offering value beyond basic search functionality.

2. **Business relevance:** For companies like Dow Jones, semantic search can revolutionize how users interact with their content platforms. It can lead to increased user retention, better personalization, and ultimately, higher revenue through improved content delivery.

3. **Demonstration of core skills:** Building this system showcases expertise in several critical areas relevant to the Senior Data Scientist role, such as:
    - NLP Techniques: Text preprocessing, embedding generation, and tokenization.
    - Information Retrieval (IR): Efficient document filtering and ranking based on semantic relevance.
    - Generative AI: Integration of LLMs for producing dynamic, human-like outputs.
    - System Scalability: Design of a modular architecture that can be expanded to include classification, topic modeling, and more.

## Proposed Tasks and Extensions

The project begins with **Semantic Search** as the core functionality and expands into other related NLP tasks if time permits:

1. **Core functionality: Semantic Search**
    - Use sentence embeddings to encode article headlines and descriptions.
    - Implement a nearest-neighbor search to find articles semantically related to user queries.

2. **Extensions to demonstrate flexibility:**
    - Question Answering (RAG): Filter relevant articles using embeddings and generate answers with GenAI.
    - Summarization: Use GenAI to summarize search results dynamically.
    - Classification: Group documents into predefined or new categories based on semantic embeddings.
    - Topic Modeling: Apply techniques like clustering to identify latent topics within the dataset.
    - Tagging: Classify news into different labels such as sentiment, language, style and covered topics.
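
The core search step can be illustrated with a minimal, dependency-free sketch. It assumes documents and queries have already been converted to vectors by some embedding model; the three-dimensional vectors and the `nearest_neighbors` helper are hypothetical and only stand in for real sentence embeddings and a vector index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy embeddings, made up for illustration (not produced by a real model)
doc_vecs = {
    "politics_article": [0.9, 0.1, 0.0],
    "sports_article": [0.0, 0.2, 0.9],
    "election_article": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

top_ids = nearest_neighbors(query_vec, doc_vecs, k=2)
print(top_ids)
```

This brute-force version compares the query against every document, which is fine for a toy example; a dedicated index is what makes the same idea practical at the scale of this dataset.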


# 3. Approach and Pipeline

## Approach

The project aims to develop a robust **Semantic Search system** enhanced with **Generative AI**. The Semantic Search will leverage vector-based retrieval to find relevant documents based on the semantic meaning of queries. By incorporating GenAI, the system will enhance user interaction with capabilities like query-specific summaries and natural language responses.

The pipeline will prioritize modularity and scalability, ensuring compatibility with future tasks like classification, summarization, and recommendations. While the core Semantic Search system will focus on generating embeddings and performing nearest-neighbor searches, tools like **LangChain** will help simplify processes like text splitting, metadata handling, and integration with generative models.



## Tools, Libraries, and Frameworks

1. **Core Python Libraries:**
    - NumPy and Pandas: For efficient data handling and manipulation.
    - Matplotlib and Seaborn: For creating meaningful visualizations during EDA and analysis.

2. **Natural Language Processing Tools:**
    - LangChain: Utilized for managing document chunking, metadata assignment, and integration with vectorstores and LLMs.
    - Hugging Face Transformers: To access state-of-the-art models for embedding generation (sentence-transformers for dense vector embeddings).

3. **Semantic Search Engine:**
    - FAISS (Facebook AI Similarity Search): To implement a fast and efficient vector search index. We will rely on the CPU implementation to suit the current hardware, ensuring scalability without requiring a GPU.

4. **Generative AI Integration:**
    - MistralAI's open-mistral-nemo: Selected for its free availability and lightweight requirements, as it operates efficiently on hardware with up to 16GB of GPU RAM, which aligns with the resources available in Google Colab.

5. **Development and Environment:**
    - Local development in Visual Studio Code with Python, focusing on CPU-based implementations.
    - Google Colab as a fallback option for GPU-based experimentation or additional scalability testing.

6. **Version Control and Documentation:**
    - Git/GitHub: For version control and collaborative documentation. The notebook will contain markdown annotations for explanations and visualizations.

## Pipeline Details

1. **Data Preprocessing and Splitting:**
    - Minimal text cleaning:
        - Remove duplicates and non-text artifacts, ensuring a consistent dataset.
        - Maintain as much original text as possible to preserve semantic richness.
    - Long documents will be split into smaller chunks using LangChain's RecursiveCharacterTextSplitter. This ensures compatibility with model input limitations and improves retrieval granularity.

2. **Feature Engineering:**
    - Generate embeddings for each text chunk using sentence-transformers.
    - Attach metadata (e.g., category, headline, publication date) to embeddings for better filtering and query-specific results.
    - Store embeddings and metadata in a vectorstore (FAISS / Chroma), chosen for its performance and seamless integration with LangChain.

3. **Model Building:**
    - **Semantic Search:**
        - User queries are encoded as embeddings using the same model as the documents.
        - Perform nearest-neighbor searches using the vectorstore, retrieving the most semantically relevant chunks.
    - **Generative AI:**
        - Integrate MistralAI to:
            - Generate query-specific summaries of retrieved results.
            - Provide enhanced responses by synthesizing information across retrieved documents.
            - Perform topic modeling by applying clustering and GenAI capabilities.
            - Tag each text with classes.

4. **Evaluation:**
    - Quantitative metrics: Precision@K, Recall@K, and Mean Average Precision (MAP) to measure retrieval quality.
    - Qualitative analysis: Review the coherence and relevance of LLM-generated summaries and responses.
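
The retrieval metrics named in the evaluation step can be sketched in plain Python. This is an illustrative implementation under one common convention (average precision is normalized by the total number of relevant documents); the document ids and relevance judgments are made up.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k results."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(relevant)

def average_precision(retrieved, relevant):
    """Sum of precision@rank at each relevant hit, divided by len(relevant)."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average_precision over (retrieved, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical ranking and relevance judgments for a single query
retrieved = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}

p_at_2 = precision_at_k(retrieved, relevant, k=2)
r_at_4 = recall_at_k(retrieved, relevant, k=4)
ap = average_precision(retrieved, relevant)
print(p_at_2, r_at_4, ap)
```

Computing these metrics requires relevance judgments per query, which the dataset does not provide directly; using the article category as a proxy for relevance would be one possible assumption.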

# 4. Implementation

## a) Google Drive

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Set the directory where the project files will be saved
import os
base_path = '/content/drive/My Drive/dow_jones'
os.makedirs(base_path, exist_ok=True)
print(f"Base path: {base_path}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Base path: /content/drive/My Drive/dow_jones


## b) Installations

## c) Imports

In [None]:
import json
import os
import pickle
import re
import string
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import homogeneity_score, precision_score, recall_score, silhouette_score

import tiktoken
from langchain.chains import RetrievalQA, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.document_loaders import JSONLoader
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mistralai import ChatMistralAI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

## d) MistralAI API Key

In [None]:
# Configure the MistralAI API key without hardcoding the secret in the notebook
from getpass import getpass

if not os.getenv("MISTRAL_API_KEY"):
    os.environ["MISTRAL_API_KEY"] = getpass("MistralAI API key: ")

# Verify if the key was set correctly
api_key = os.getenv("MISTRAL_API_KEY")

if api_key:
    print("MistralAI API Key has been loaded successfully.")
else:
    print("Error: MistralAI API Key is not set in the environment.")

MistralAI API Key has been loaded successfully.


## 4.1. Load the dataset

In [None]:
# Verify current directory
print(os.getcwd())

/content


In [None]:
# Loading the dataset
file_path = "/content/drive/MyDrive/Colab Notebooks/dow_jones/data/news_category/News_Category_Dataset_v3.json"

# Reading the JSON file into a DataFrame
with open(file_path, 'r') as file:
    data = [json.loads(line) for line in file]

df = pd.DataFrame(data)
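
The file is read line by line because it stores one JSON object per line (the JSON Lines layout) rather than a single JSON array. A minimal sketch of that parsing step, using an in-memory sample with made-up records and a hypothetical `read_json_lines` helper:

```python
import io
import json

def read_json_lines(handle):
    """Parse one JSON object per non-empty line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at the end
            records.append(json.loads(line))
    return records

# Two made-up records in the same one-object-per-line layout
sample = io.StringIO(
    '{"headline": "Example headline A", "category": "POLITICS"}\n'
    '{"headline": "Example headline B", "category": "TECH"}\n'
    "\n"
)
records = read_json_lines(sample)
print(len(records), records[0]["category"])
```

With pandas available, `pd.read_json(file_path, lines=True)` should give an equivalent DataFrame in a single call.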

In [None]:
# Displaying the first few rows
print("Dataset preview:")
df.head()

Dataset preview:


| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |


In [None]:
# Basic structure and overview
print("\nDataset Info:")
df.info()


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209527 entries, 0 to 209526
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   link               209527 non-null  object
 1   headline           209527 non-null  object
 2   category           209527 non-null  object
 3   short_description  209527 non-null  object
 4   authors            209527 non-null  object
 5   date               209527 non-null  object
dtypes: object(6)
memory usage: 9.6+ MB


In [None]:
# Number of unique categories
unique_categories = df['category'].nunique()
print(f"\nNumber of unique categories: {unique_categories}")

# Displaying unique categories
print("\nCategories:")
print(df['category'].unique())

# Number of records per category
category_distribution = df['category'].value_counts()
print("\nCategory distribution:")
print(category_distribution)


Number of unique categories: 42

Categories:
['U.S. NEWS' 'COMEDY' 'PARENTING' 'WORLD NEWS' 'CULTURE & ARTS' 'TECH'
 'SPORTS' 'ENTERTAINMENT' 'POLITICS' 'WEIRD NEWS' 'ENVIRONMENT'
 'EDUCATION' 'CRIME' 'SCIENCE' 'WELLNESS' 'BUSINESS' 'STYLE & BEAUTY'
 'FOOD & DRINK' 'MEDIA' 'QUEER VOICES' 'HOME & LIVING' 'WOMEN'
 'BLACK VOICES' 'TRAVEL' 'MONEY' 'RELIGION' 'LATINO VOICES' 'IMPACT'
 'WEDDINGS' 'COLLEGE' 'PARENTS' 'ARTS & CULTURE' 'STYLE' 'GREEN' 'TASTE'
 'HEALTHY LIVING' 'THE WORLDPOST' 'GOOD NEWS' 'WORLDPOST' 'FIFTY' 'ARTS'
 'DIVORCE']

Category distribution:
category
POLITICS          35602
WELLNESS          17945
ENTERTAINMENT     17362
TRAVEL             9900
STYLE & BEAUTY     9814
PARENTING          8791
HEALTHY LIVING     6694
QUEER VOICES       6347
FOOD & DRINK       6340
BUSINESS           5992
COMEDY             5400
SPORTS             5077
BLACK VOICES       4583
HOME & LIVING      4320
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3653
WOMEN             

In [None]:
# Example of text and category
print("\nExample record:")
print(df[['headline', 'short_description', 'category']].iloc[0])


Example record:
headline             Over 4 Million Americans Roll Up Sleeves For O...
short_description    Health experts said it is too early to predict...
category                                                     U.S. NEWS
Name: 0, dtype: object


## 4.2. Preprocessing and Splitting

### Model token limits

**gpt-4o**
- Context window: 128,000 tokens
- Max output tokens: 16,384 tokens
- Training data: Up to Oct 2023

**gpt-4o-mini**
- Context window: 128,000 tokens
- Max output tokens: 16,384 tokens
- Training data: Up to Oct 2023

In [None]:
# Cleaning function

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.lower()

#df["text"] = df["text"].apply(clean_text)
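
To see what this cleaning function would do if it were enabled, the sketch below applies the same two substitutions to a made-up sentence. It suggests one reason for leaving the call commented out: stripping punctuation also removes apostrophes and sentence boundaries, which works against the goal of preserving the original text for embedding.

```python
import re

def clean_text(text):
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation, apostrophes included
    return text.lower()

sample = "Hello,   World!  It's 2022."
cleaned = clean_text(sample)
print(cleaned)  # hello world its 2022
```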

In [None]:
# Function to count tokens using OpenAI's tokenization scheme
def count_tokens(text: str, model: str = "text-embedding-ada-002"):
    """
    Counts the number of tokens in a given text using the tokenization scheme of OpenAI models.

    Args:
        text (str): The input text to be tokenized.
        model (str): The OpenAI model whose tokenization scheme will be used.

    Returns:
        int: The number of tokens in the text.
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
        token_count = len(encoding.encode(text))
        return token_count
    except KeyError:
        raise ValueError(f"Model '{model}' is not supported for tokenization.")

In [None]:
# Define a text splitter for long documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Maximum characters per chunk (the default length function counts characters, not tokens)
    chunk_overlap=100,  # Overlap between chunks to maintain context
)
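
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified character-based sliding window. This is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and spaces; the hypothetical `sliding_window_chunks` helper only shows how consecutive chunks share text.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last four characters of the previous one, so a sentence cut at a boundary still appears whole in at least one neighbor when the overlap is large enough.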

In [None]:
# Function to process documents: check token count and split if necessary
def process_documents(documents, chunk_size=500, model="text-embedding-ada-002"):
    """
    Processes a list of documents: splits long documents into smaller chunks if needed.

    Args:
        documents (list of langchain_core.documents.Document): The input documents.
        chunk_size (int): The maximum token count per chunk.
        model (str): The OpenAI model to use for token counting.

    Returns:
        list: Processed documents, split as necessary.
    """
    processed_documents = []
    for doc in documents:
        token_count = count_tokens(doc.page_content, model=model)
        #print(token_count)
        if token_count > chunk_size:
            # Split the document if token count exceeds chunk_size
            chunks = text_splitter.split_documents([doc])
            processed_documents.extend(chunks)
        else:
            # Keep the document as is
            processed_documents.append(doc)
    return processed_documents

In [None]:
# Create documents with metadata and assign unique IDs
documents = [
    Document(
        page_content=f"{row['headline']}. {row['short_description']}",
        metadata={
            "id": str(idx),  # Unique ID for each document
            "category": row["category"],
            "date": row["date"],
            "headline": row["headline"],
        },
    )
    for idx, row in df.iterrows()
]

In [None]:
# Process the documents: split if necessary
processed_documents = process_documents(documents)

In [None]:
# Map document IDs to their original text for later retrieval
id_to_text = {doc.metadata["id"]: doc.page_content for doc in processed_documents}
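
One caveat with this mapping: chunks produced by the splitter inherit the metadata of their parent document, so two chunks of the same article would share an `id` and the later chunk would overwrite the earlier one. In this run the document count stays at 209,527, so no article appears to have been split and nothing is lost. The sketch below, with plain dicts standing in for `Document` objects (an assumption made to keep the example dependency-free), shows the overwrite and one way to keep every chunk.

```python
from collections import defaultdict

# Two chunks share the parent id "7"; all texts are made up
chunks = [
    {"id": "7", "text": "First half of a long article."},
    {"id": "7", "text": "Second half of the same article."},
    {"id": "8", "text": "A short article."},
]

# A dict comprehension keeps only the last chunk seen for each id
last_wins = {c["id"]: c["text"] for c in chunks}

# Grouping keeps every chunk, in order, under its parent id
id_to_chunks = defaultdict(list)
for c in chunks:
    id_to_chunks[c["id"]].append(c["text"])

print(last_wins["7"])
print(id_to_chunks["7"])
```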

## 4.3. Embedding Generation

In [None]:
# Initialize the embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # Sentence Transformers on Hugging Face

## 4.4. Vectorstore Creation

In [None]:
"""
# Create a FAISS VectorStore
vectorstore = FAISS.from_documents(documents=processed_documents, embedding=embedding_model)
"""



In [None]:
"""
# Verify the VectorStore
print(f"Number of documents in VectorStore: {len(processed_documents)}")
for doc in processed_documents[:5]:
    print(f"Content: {doc.page_content[:100]}... | Metadata: {doc.metadata}")
"""


In [None]:
"""# Save the vectorstore to disk for future use
vectorstore.save_local("/content/drive/MyDrive/Colab Notebooks/dow_jones/results/vectorstore_news_category")
# vectorstore.persist(directory="../results/vectorstore_news_category") # To maintain Chroma's format
"""


In [None]:
# Load the vectorstore from the local directory
vectorstore = FAISS.load_local("/content/drive/MyDrive/Colab Notebooks/dow_jones/results/vectorstore_news_category", embeddings=embedding_model,  allow_dangerous_deserialization=True)

In [None]:
# Verify the loaded documents (the count is taken from processed_documents, not from the FAISS index itself)
print(f"Number of documents in VectorStore: {len(processed_documents)}")
for doc in processed_documents[:5]:
    print(f"Content: {doc.page_content[:100]}... | Metadata: {doc.metadata}")

Number of documents in VectorStore: 209527
Content: Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters. Health experts said it... | Metadata: {'id': '0', 'category': 'U.S. NEWS', 'date': '2022-09-23', 'headline': 'Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters'}
Content: American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video. He was su... | Metadata: {'id': '1', 'category': 'U.S. NEWS', 'date': '2022-09-23', 'headline': 'American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video'}
Content: 23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23). "Until you have a dog you don... | Metadata: {'id': '2', 'category': 'COMEDY', 'date': '2022-09-23', 'headline': '23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)'}
Content: The Funniest Tweets From Parents This Week (Sept. 17-23). "Accidentally put grown-up toothpaste on m... | Metadata: {

In [None]:
# Verify documents in the docstore
n = 0
for doc_id, doc in vectorstore.docstore._dict.items():
    n+=1
    print(f"Doc ID: {doc_id}, Content: {doc.page_content[:100]}, Metadata: {doc.metadata}")
    if n==10:
      break


Doc ID: 0a95b053-2bdb-4830-8f0f-a5b495e1d060, Content: Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters. Health experts said it, Metadata: {'id': '0', 'category': 'U.S. NEWS', 'date': '2022-09-23', 'headline': 'Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters'}
Doc ID: 2cd16f28-3b28-451a-87fa-931d642bb5bb, Content: American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video. He was su, Metadata: {'id': '1', 'category': 'U.S. NEWS', 'date': '2022-09-23', 'headline': 'American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video'}
Doc ID: fa043dc4-d127-4423-a860-a47bd3f560d7, Content: 23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23). "Until you have a dog you don, Metadata: {'id': '2', 'category': 'COMEDY', 'date': '2022-09-23', 'headline': '23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)'}
Doc ID: dba94f4e-c73e-4614-89e9-737a4de6aaa

## 4.5. Retrieval Process

In [None]:
# Create the document retriever (VectorStoreRetriever instance)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 10,
        #"filter": {"category": "POLITICS"}  # Example: filter by category (values are upper-case in this dataset)
    }
)
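
The commented-out `filter` entry restricts results to documents whose metadata matches the given values. A dependency-free sketch of that idea follows, with plain dicts in place of `Document` objects and a hypothetical `filter_by_metadata` helper. It also shows why casing matters: category labels in this dataset are upper-case, so an exact-match filter on a lower-case value returns nothing.

```python
def filter_by_metadata(results, **criteria):
    """Keep only results whose metadata matches every key/value pair."""
    return [
        doc
        for doc in results
        if all(doc["metadata"].get(key) == value for key, value in criteria.items())
    ]

# Made-up results; only the category labels mirror the dataset
results = [
    {"text": "Article one", "metadata": {"category": "POLITICS"}},
    {"text": "Article two", "metadata": {"category": "TECH"}},
]

upper_matches = filter_by_metadata(results, category="POLITICS")
lower_matches = filter_by_metadata(results, category="politics")
print(len(upper_matches), len(lower_matches))
```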

In [None]:
# Example query
query = "What are the latest political developments?"
results = retriever.invoke(query)  # invoke() replaces the deprecated get_relevant_documents()

# Display retrieved documents
for idx, doc in enumerate(results):
    print(f"\nDocument {idx + 1}:")
    print(f"Content: {doc.page_content}:")
    print(f"Metadata: {doc.metadata}")


Document 1:
Content: HUFFPOLLSTER: Republicans And Democrats Are Growing Even Further Apart. The country's changing demographics have affected both parties, but in drastically different ways.:
Metadata: {'id': '56761', 'category': 'POLITICS', 'date': '2016-09-13', 'headline': 'HUFFPOLLSTER: Republicans And Democrats Are Growing Even Further Apart'}

Document 2:
Content: See The Latest Updates On Elections From Around The Nation. :
Metadata: {'id': '116223', 'category': 'POLITICS', 'date': '2014-11-04', 'headline': 'See The Latest Updates On Elections From Around The Nation'}

Document 3:
Content: Top 10 Tech Trends Transforming Humanity. Even amid a year of disheartening political news, 2016 brought a number of advancements that are changing the global tech terrain:
Metadata: {'id': '47046', 'category': 'TECH', 'date': '2017-01-02', 'headline': 'Top 10 Tech Trends Transforming Humanity'}

Document 4:
Content: Congress Sets Politics Aside for Critical Development and Humanitarian Issue

In [None]:
# Create the document retriever with maximal marginal relevance (VectorStoreRetriever instance)
retriever_mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 10,
        #"filter": {"category": "POLITICS"}  # Example: filter by category (values are upper-case in this dataset)
    }
)
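
Maximal marginal relevance (MMR) trades query relevance against redundancy among the results already chosen, which is why its output can differ from plain similarity search. The greedy selection can be sketched as below. The vectors are made up, `mmr_select` is a hypothetical helper, and the parameter name `lambda_mult` is borrowed from LangChain for readability; this is an illustration of the idea, not the library's implementation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k ids, penalizing similarity to already-picked docs."""
    selected = []
    candidates = list(doc_vecs)
    while candidates and len(selected) < k:
        best_id, best_score = None, -math.inf
        for doc_id in candidates:
            relevance = cosine(query_vec, doc_vecs[doc_id])
            # Redundancy: similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_id, best_score = doc_id, score
        selected.append(best_id)
        candidates.remove(best_id)
    return selected

doc_vecs = {
    "politics_1": [0.9, 0.4, 0.0],
    "politics_2": [0.9, 0.45, 0.0],  # near-duplicate of politics_1
    "tech_1": [0.7, 0.0, 0.7],
}
query_vec = [1.0, 0.0, 0.0]

diverse = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to ordinary similarity ranking; lower values push near-duplicates down the list.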

In [None]:
# Example query
query = "What are the latest political developments?"
results = retriever_mmr.invoke(query)  # invoke() replaces the deprecated get_relevant_documents()

# Display retrieved documents
for idx, doc in enumerate(results):
    print(f"\nDocument {idx + 1}:")
    print(f"Content: {doc.page_content}:")
    print(f"Metadata: {doc.metadata}")


Document 1:
Content: HUFFPOLLSTER: Republicans And Democrats Are Growing Even Further Apart. The country's changing demographics have affected both parties, but in drastically different ways.:
Metadata: {'id': '56761', 'category': 'POLITICS', 'date': '2016-09-13', 'headline': 'HUFFPOLLSTER: Republicans And Democrats Are Growing Even Further Apart'}

Document 2:
Content: Top 10 Tech Trends Transforming Humanity. Even amid a year of disheartening political news, 2016 brought a number of advancements that are changing the global tech terrain:
Metadata: {'id': '47046', 'category': 'TECH', 'date': '2017-01-02', 'headline': 'Top 10 Tech Trends Transforming Humanity'}

Document 3:
Content: Division Over Riders Forcing Congress To Edge Of Government-Funding Deadline. But what's new?:
Metadata: {'id': '81167', 'category': 'POLITICS', 'date': '2015-12-09', 'headline': 'Division Over Riders Forcing Congress To Edge Of Government-Funding Deadline'}

Document 4:
Content: Gun Safes and Politics. Th

## 4.6. Expansion Opportunities - QA with Generative AI

In [None]:
# Initialize OpenAI LLM
llm_openai = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=500
    )

In [None]:
# llm_openai.invoke("Hello, world!")

In [None]:
# Initialize MistralAI LLM
llm_mistralai = ChatMistralAI(
    model="open-mistral-nemo",
    temperature=0,
    #max_tokens=500
    )

In [None]:
llm_mistralai.invoke("Hello, world!")

AIMessage(content="Hello! How can I assist you today? Let's chat about anything you'd like. 😊", additional_kwargs={}, response_metadata={'token_usage': {'prompt_tokens': 7, 'total_tokens': 28, 'completion_tokens': 21}, 'model': 'open-mistral-nemo', 'finish_reason': 'stop'}, id='run-6b52eff3-8843-46bc-9f98-8e2229988544-0', usage_metadata={'input_tokens': 7, 'output_tokens': 21, 'total_tokens': 28})

In [None]:
# Create the prompt

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [None]:
# Stuff the retrieved documents into the prompt as-is (retrieved context without any summarization)
question_answer_chain = create_stuff_documents_chain(llm_mistralai, prompt)

# Create the retrieval chain that retrieves documents and then passes them on.
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
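
Conceptually, the "stuff" strategy joins the text of every retrieved document and substitutes the result for the `{context}` placeholder before the LLM is called. A simplified sketch of that step is shown below; `stuff_prompt` is a hypothetical helper, the template and document texts are made up, and the blank-line separator is an assumption about the default.

```python
def stuff_prompt(system_template, question, doc_texts, separator="\n\n"):
    """Fill {context} with all retrieved texts and pair it with the question."""
    context = separator.join(doc_texts)
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]

system_template = "Answer using only this context:\n\n{context}"
doc_texts = ["Headline A. Description A.", "Headline B. Description B."]

messages = stuff_prompt(system_template, "What happened?", doc_texts)
print(messages[0][1])
```

Because every retrieved document is passed through unchanged, the prompt grows with `k` and with document length; the short headline-plus-description texts in this dataset keep that growth modest.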

In [None]:
# Example query
response = rag_chain.invoke({"input": "What are the latest political developments?"})
print(response["answer"])

The latest political developments include growing partisan divisions between Republicans and Democrats in the U.S., with the country's changing demographics affecting each party differently. Additionally, the 2020 U.S. election cycle is underway, with the Iowa caucuses having taken place recently.
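
This answer points to a limitation of pure similarity search: the word "latest" in the query is matched semantically rather than temporally, and the first retrieved articles shown earlier date from 2014 to 2017. One possible mitigation, sketched here under the assumption that every result carries an ISO-formatted `date` in its metadata (as the documents above do), is to re-order the results by date before they reach the LLM. Plain dicts stand in for `Document` objects and `sort_by_recency` is a hypothetical helper.

```python
from datetime import date

def sort_by_recency(results):
    """Return results ordered from newest to oldest by metadata date."""
    return sorted(
        results,
        key=lambda doc: date.fromisoformat(doc["metadata"]["date"]),
        reverse=True,
    )

# ids and dates copied from the retrieval output shown earlier
results = [
    {"metadata": {"id": "56761", "date": "2016-09-13"}},
    {"metadata": {"id": "116223", "date": "2014-11-04"}},
    {"metadata": {"id": "47046", "date": "2017-01-02"}},
]

newest_first = [doc["metadata"]["id"] for doc in sort_by_recency(results)]
print(newest_first)
```

Sorting only re-orders what was already retrieved; restricting the search itself to a date range would need a metadata filter or a post-filtering step.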


In [None]:
# Return the sources that were used to generate the answer
for document in response["context"]:
    print(document)
    print()

page_content='HUFFPOLLSTER: Republicans And Democrats Are Growing Even Further Apart. The country's changing demographics have affected both parties, but in drastically different ways.' metadata={'id': '56761', 'category': 'POLITICS', 'date': '2016-09-13', 'headline': 'HUFFPOLLSTER: Republicans And Democrats Are Growing Even Further Apart'}

page_content='See The Latest Updates On Elections From Around The Nation. ' metadata={'id': '116223', 'category': 'POLITICS', 'date': '2014-11-04', 'headline': 'See The Latest Updates On Elections From Around The Nation'}

page_content='Top 10 Tech Trends Transforming Humanity. Even amid a year of disheartening political news, 2016 brought a number of advancements that are changing the global tech terrain' metadata={'id': '47046', 'category': 'TECH', 'date': '2017-01-02', 'headline': 'Top 10 Tech Trends Transforming Humanity'}

page_content='Congress Sets Politics Aside for Critical Development and Humanitarian Issues. 2014 saw an unprecedented num

## 4.7. Expansion Opportunities - Topic extraction

In [None]:
"""
# Extracting embeddings and metadata from the vectorstore
embeddings = []
metadata = []

for i in range(len(vectorstore.docstore._dict)):
    embedding_vector = vectorstore.index.reconstruct(i)  # Rebuild embedding by index
    embeddings.append(embedding_vector)
    doc_id = vectorstore.index_to_docstore_id[i]  # Map the FAISS index position to its document ID
    metadata.append(vectorstore.docstore._dict[doc_id].metadata)

embeddings = np.array(embeddings)
"""

In [None]:
"""
# Save embeddings and metadata

embeddings_path = "/content/drive/MyDrive/Colab Notebooks/dow_jones/results/embeddings.npy"
np.save(embeddings_path, embeddings)

metadata_path = "/content/drive/MyDrive/Colab Notebooks/dow_jones/results/metadata.pkl"
with open(metadata_path, "wb") as f:
    pickle.dump(metadata, f)
"""

In [None]:
# Load embeddings and metadata
embeddings_path = "/content/drive/MyDrive/Colab Notebooks/dow_jones/results/embeddings.npy"
embeddings = np.load(embeddings_path)
metadata_path = "/content/drive/MyDrive/Colab Notebooks/dow_jones/results/metadata.pkl"
with open(metadata_path, "rb") as f:
    metadata = pickle.load(f)

In [None]:
# Impossible to execute because of the time required
"""
# Determining the optimal number of clusters using the elbow method

def find_optimal_k(embeddings, max_k=10):
    inertia = []
    silhouette_scores = []
    k_range = range(2, max_k + 1)

    for k in k_range:
        kmeans = MiniBatchKMeans(n_clusters=k, random_state=42)
        kmeans.fit(embeddings)
        inertia.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(embeddings, kmeans.labels_))

    return k_range, inertia, silhouette_scores
"""

In [None]:
# k_range, inertia, silhouette_scores = find_optimal_k(embeddings)
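
The silhouette computation is the likely bottleneck in the disabled cell above, because it relies on pairwise distances whose cost grows quadratically with the number of documents. A minimal sketch of one possible workaround follows, assuming that a random subsample gives a good enough estimate for choosing k. The function name `find_optimal_k_sampled`, the `sample_size` value, and the synthetic data are illustrative assumptions, not part of the original notebook; `sample_size` and `random_state` are existing parameters of scikit-learn's `silhouette_score`.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score


def find_optimal_k_sampled(embeddings, max_k=10, sample_size=1000, seed=42):
    """Collect inertia and a subsampled silhouette score for k = 2..max_k."""
    inertia = []
    silhouette_scores = []
    k_range = range(2, max_k + 1)

    for k in k_range:
        kmeans = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3)
        labels = kmeans.fit_predict(embeddings)
        inertia.append(kmeans.inertia_)
        # sample_size restricts the pairwise-distance computation to a random subset
        silhouette_scores.append(
            silhouette_score(embeddings, labels, sample_size=sample_size, random_state=seed)
        )

    return k_range, inertia, silhouette_scores


# Synthetic stand-in for the real embeddings (hypothetical data).
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(500, 16))
k_range, inertia, silhouette_scores = find_optimal_k_sampled(
    demo_embeddings, max_k=5, sample_size=200
)
```

The returned values have the same shape as in the original function, so the two plotting cells would work unchanged if they were re-enabled.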

In [None]:
"""
# Visualization of the elbow method
plt.figure(figsize=(10, 5))
plt.plot(k_range, inertia, marker='o', label='Inertia')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.legend()
plt.show()
"""

In [None]:
"""
# Display of the silhouette score
plt.figure(figsize=(10, 5))
plt.plot(k_range, silhouette_scores, marker='o', label='Silhouette Score', color='orange')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis for Optimal k')
plt.legend()
plt.show()
"""

In [None]:
# Choose the optimal number of clusters
optimal_k = 21 # Half of the possible categories

# Create K-Means model with optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
labels = kmeans.fit_predict(embeddings)

In [None]:
# Associate each item with its cluster
clustered_documents = {}
for idx, label in enumerate(labels):
    if label not in clustered_documents:
        clustered_documents[label] = []
    clustered_documents[label].append(metadata[idx])

# Show document distribution by cluster
for cluster_id, docs in clustered_documents.items():
    print(f"Cluster {cluster_id}: {len(docs)} documents")
    for doc in docs[:5]:
        print(f"  - {doc['headline']} (Category: {doc['category']})")
    print()

Cluster 15: 8576 documents
  - Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters (Category: U.S. NEWS)
  - CDC Director To Overhaul Agency With COVID Shortcomings In Mind (Category: POLITICS)
  - General Motors Recalls Over 484,000 Vehicles Over 'Improperly'-Formed Seat Belts (Category: U.S. NEWS)
  - Planned Parenthood To Spend Record $50 Million For Midterms (Category: POLITICS)
  - Dog Catches Monkeypox In France In First Suspected Human-To-Pet Transmission (Category: U.S. NEWS)

Cluster 0: 8723 documents
  - American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video (Category: U.S. NEWS)
  - Cleaner Was Dead In Belk Bathroom For 4 Days Before Body Found: Police (Category: U.S. NEWS)
  - Man Sets Himself On Fire In Apparent Protest Of Funeral For Japan's Abe (Category: WORLD NEWS)
  - Russian Cosmonaut Valery Polyakov Who Broke Record With 437-Day Stay In Space Dies At 80 (Category: WORLD NEWS)
  - Maury Wills, Base-Stealing 

### Naming and Explaining Clusters

In [None]:
# Initialize the LLM model with higher temperature
llm_mistralai_2 = ChatMistralAI(
    model="open-mistral-nemo",
    temperature=0.5,
    #max_tokens=500
    )

In [None]:
# Define prompt for naming and explaining clusters
prompt_template_2 = """
You are an expert in summarizing and analyzing content clusters.
For each group of documents, you will:
1. Provide a **name** that summarizes the main theme of the group.
2. Provide a **short explanation** (2-3 sentences) describing the common traits or topics covered in the documents.

Here is an example cluster of documents:
{examples}

Respond in this format:
Cluster Name: <Name>
Explanation: <Explanation>
"""

In [None]:
prompt_2 = ChatPromptTemplate.from_template(prompt_template_2)

In [None]:
# Create the pipeline using RunnableSequence
pipeline = prompt_2 | llm_mistralai_2 | StrOutputParser()

In [None]:
TIME_BETWEEN_REQUESTS = 2

# Function for generating cluster names and explanations
def generate_cluster_descriptions(clustered_documents, num_examples=5, delay=TIME_BETWEEN_REQUESTS):
    """
    Generates cluster names and explanations using an LLM.

    Args:
        clustered_documents (dict): Dictionary with cluster IDs as keys and lists of document metadata as values.
        num_examples (int): Number of examples from each cluster to pass to the LLM.
        delay (float): Seconds to wait between consecutive LLM requests.

    Returns:
        dict: A dictionary with cluster IDs as keys and cluster descriptions as values.
    """
    cluster_descriptions = {}
    for cluster_id, docs in clustered_documents.items():
        # Create a sample of representative examples of the cluster
        examples = "\n".join(
            [f"- {doc['headline']} (Category: {doc['category']})" for doc in docs[:num_examples]]
        )

        try:
            # Generate cluster name and explanation
            response = pipeline.invoke({"examples": examples})
            cluster_descriptions[cluster_id] = response

            print(f"Processed Cluster {cluster_id}:\n{response}\n")

        except Exception as e:
            print(f"Error processing Cluster {cluster_id}: {e}")
            cluster_descriptions[cluster_id] = f"Error: {e}"

        # Wait
        time.sleep(delay)

    return cluster_descriptions
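
The function above passes the first `num_examples` documents of each cluster to the LLM, which may not be the most typical members. A minimal sketch of one alternative is shown here, assuming that the documents closest to a cluster's centroid are reasonable representatives. The helper name `closest_to_centroid` and the toy vectors are hypothetical; in the notebook the centroids could come from `kmeans.cluster_centers_` and the returned indices could be used to look up `metadata[idx]`. The sketch uses only the standard library for clarity; a vectorized NumPy version would be preferable at dataset scale.

```python
import math


def closest_to_centroid(embeddings, labels, centroids, n=5):
    """Return, per cluster, the indices of the n points nearest its centroid."""
    by_cluster = {}
    for idx, (vector, label) in enumerate(zip(embeddings, labels)):
        distance = math.dist(vector, centroids[label])
        by_cluster.setdefault(label, []).append((distance, idx))
    # Sort each cluster's (distance, index) pairs and keep the n nearest indices
    return {
        label: [idx for _, idx in sorted(pairs)[:n]]
        for label, pairs in by_cluster.items()
    }


# Toy 2-D vectors standing in for real embeddings (hypothetical data).
toy_embeddings = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0), (5.0, 5.0), (5.2, 5.0)]
toy_labels = [0, 0, 0, 1, 1]
toy_centroids = {0: (0.0, 0.0), 1: (5.0, 5.0)}
representatives = closest_to_centroid(toy_embeddings, toy_labels, toy_centroids, n=2)
print(representatives)
```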

In [None]:
# Clusters description
cluster_descriptions = generate_cluster_descriptions(clustered_documents)

for cluster_id, description in cluster_descriptions.items():
    print(f"Cluster {cluster_id}:\n{description}\n")

Processed Cluster 15:
Cluster Name: **COVID-19 Vaccine and Public Health Measures**
Explanation: This cluster revolves around the ongoing COVID-19 pandemic, focusing on the rollout of Omicron-targeted boosters, the CDC's plans to overhaul its structure based on COVID-19 shortcomings, and the first suspected human-to-pet transmission of monkeypox.

Processed Cluster 0:
**Cluster Name:** Violent Acts, Unusual Deaths, and Celebrity Passings
**Explanation:** This cluster of documents shares common themes of violent incidents, unusual deaths, and the passing of notable figures. The first two articles report on violent acts against flight attendants and cleaners, while the third article describes an unusual form of protest. The last two articles detail the deaths of notable figures in the world of space exploration and sports.

Processed Cluster 8:
**Cluster Name:** Humor, Food, and Bizarre News

**Explanation:** This cluster revolves around a mix of light-hearted and peculiar content. It in
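
The responses above do not follow the requested format consistently: some wrap the field labels in Markdown bold, others wrap only the name. A minimal sketch of a tolerant parser follows, assuming the labels `Cluster Name:` and `Explanation:` always appear in the response. The function name `parse_cluster_description` is hypothetical and the sample strings are shortened versions of the outputs shown above.

```python
import re

# Tolerates optional Markdown bold markers around the field labels.
NAME_RE = re.compile(r"\*{0,2}Cluster Name:\*{0,2}\s*(.+)")
EXPLANATION_RE = re.compile(r"\*{0,2}Explanation:\*{0,2}\s*(.+)", re.DOTALL)


def parse_cluster_description(text):
    """Split a raw LLM response into a (name, explanation) pair."""
    name_match = NAME_RE.search(text)
    explanation_match = EXPLANATION_RE.search(text)
    # Remove bold markers that wrap the name itself
    name = name_match.group(1).strip().strip("*").strip() if name_match else None
    explanation = explanation_match.group(1).strip() if explanation_match else None
    return name, explanation


plain = "Cluster Name: **COVID-19 Vaccine and Public Health Measures**\nExplanation: This cluster revolves around the pandemic."
bold = "**Cluster Name:** Humor, Food, and Bizarre News\n\n**Explanation:** This cluster revolves around a mix of content."
parsed_plain = parse_cluster_description(plain)
parsed_bold = parse_cluster_description(bold)
```

Storing the parsed pair instead of the raw string would make the cluster names easier to reuse, for example as labels in a plot.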

## 4.8. Expansion Opportunities - Tagging

In [None]:
# Define the classification model
class ArticleClassification(BaseModel):
    sentiment: str = Field(description="The sentiment of the text (positive, neutral, negative)")
    language: str = Field(description="The language the text is written in (e.g., English, Spanish)")
    style: str = Field(description="The style of the text (e.g., formal, informal)")
    topics: str = Field(description="The main topics covered in the text (comma-separated)")
    political_tendency: str = Field(
        description="The political tendency of the text (e.g., left-wing, right-wing, centrist, unknown)"
    )

In [None]:
# Create LLM model with structured output
llm_mistralai_3 = ChatMistralAI(
    model="open-mistral-nemo",
    temperature=0,
    #max_tokens=500
    ).with_structured_output(ArticleClassification)

In [None]:
# Define the prompt for labeling text
tagging_prompt = ChatPromptTemplate.from_template(
    """
Extract the desired information from the following article.

Only extract the properties mentioned in the 'ArticleClassification' schema.

Article:
{input}
"""
)

In [None]:
# Create the labeling pipeline
tagging_chain = tagging_prompt | llm_mistralai_3

In [None]:
# Article labeling function

TIME_BETWEEN_REQUESTS = 2

def classify_selected_articles(retriever, query=None, category=None, num_articles=5, delay=TIME_BETWEEN_REQUESTS):
    """
    Retrieves and classifies articles from the dataset based on query or category.

    Args:
        retriever: The VectorStoreRetriever object.
        query (str, optional): A query string to search for relevant documents.
        category (str, optional): Filter by category.
        num_articles (int, optional): Number of articles to retrieve and classify. Defaults to 5.
        delay (float, optional): Seconds to wait between consecutive classification requests.

    Returns:
        list of dict: List of classifications for each article.
    """
    # Retrieve articles
    search_kwargs = {"k": num_articles}
    if category:
        search_kwargs["filter"] = {"category": category}

    # Query documents
    if query:
        results = retriever.get_relevant_documents(query, **search_kwargs)
    else:
        results = retriever.get_relevant_documents("", **search_kwargs)  # Empty query for category-only filter

    # Classify retrieved documents
    classifications = []
    for idx, doc in enumerate(results):
        try:
            print(f"Classifying Article {idx + 1}: {doc.metadata['headline']}")
            result = tagging_chain.invoke({"input": doc.page_content})
            classifications.append({"metadata": doc.metadata, "classification": result.dict()})
        except Exception as e:
            print(f"Error classifying article {idx + 1}: {e}")
            classifications.append({"metadata": doc.metadata, "classification": None})

        # Wait
        time.sleep(delay)

    return classifications

In [None]:
# Example of use: classify articles related to politics

query = "What are the latest political developments?"
results = classify_selected_articles(retriever, query=query, num_articles=10)

print("\nClassification Results:")
for idx, result in enumerate(results):
    print(f"Article {idx + 1} Metadata: {result['metadata']}")
    print(f"Classification: {result['classification']}\n")

Classifying Article 1: HUFFPOLLSTER: Republicans And Democrats Are Growing Even Further Apart
Classifying Article 2: See The Latest Updates On Elections From Around The Nation
Error classifying article 2: Error response 429 while fetching https://api.mistral.ai/v1/chat/completions: {"message":"Requests rate limit exceeded"}
Classifying Article 3: Top 10 Tech Trends Transforming Humanity
Error classifying article 3: Error response 429 while fetching https://api.mistral.ai/v1/chat/completions: {"message":"Requests rate limit exceeded"}
Classifying Article 4: Congress Sets Politics Aside for Critical Development and Humanitarian Issues
Classifying Article 5: Leadership and Transparency 2015: The Social Media Imperative
Error classifying article 5: Error response 429 while fetching https://api.mistral.ai/v1/chat/completions: {"message":"Requests rate limit exceeded"}
Classifying Article 6: Five Emerging Trends for the U.S. Elections
Error classifying article 6: Error response 429 while fet
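
The output above shows that a fixed 2-second pause was not enough to stay under the API rate limit. A minimal sketch of a retry wrapper with exponential backoff follows. The helper `invoke_with_backoff`, its default values, and the `flaky_call` stand-in are hypothetical; catching every `Exception` is a simplification, and a real implementation would ideally retry only on the rate-limit error. In the notebook, `call` could be something like `lambda: tagging_chain.invoke({"input": doc.page_content})`.

```python
import time


def invoke_with_backoff(call, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call `call()` and retry with exponentially growing waits on failure."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # Give up after the final attempt
            # Wait base_delay, then twice that, then four times that, ...
            sleep(base_delay * (2 ** attempt))


# Hypothetical stand-in for an API call that is rate limited twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429: Requests rate limit exceeded")
    return "ok"


waits = []
# Record the waits instead of actually sleeping, to keep the demo fast
result = invoke_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```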

# 5. Evaluation

## 5.1. Information Retrieval System

In [None]:
def evaluate_retrieval(retriever, query, ground_truth_category, k=5):
    """
    Evaluates document retrieval based on precision and recall.

    Args:
        retriever: The document retrieval model.
        query (str): The search query.
        ground_truth_category (str or list): Category considered as relevant.
        k (int): Number of retrieved documents to evaluate.

    Returns:
        dict: Evaluation metrics (precision@k, recall@k, mrr).
    """
    # Retrieve relevant documents
    retrieved_docs = retriever.get_relevant_documents(query)[:k]

    # Determine whether the retrieved documents are relevant
    relevant_flags = [
        #1 if doc.metadata["category"] in ground_truth_category else 0
        1 if doc.metadata["category"] == ground_truth_category else 0
        for doc in retrieved_docs
    ]

    # Precision@K
    precision_at_k = sum(relevant_flags) / k

    # Recall@K
    total_relevant = sum(
        1 for doc in retriever.vectorstore.docstore._dict.values()
        #if doc.metadata["category"] in ground_truth_category
        if doc.metadata["category"] == ground_truth_category
    )
    recall_at_k = sum(relevant_flags) / total_relevant if total_relevant > 0 else 0
    print(total_relevant)

    # MRR
    mrr = 0
    for rank, flag in enumerate(relevant_flags, start=1):
        if flag:
            mrr = 1 / rank
            break

    return {
        "precision_at_k": precision_at_k,
        "recall_at_k": recall_at_k,
        "mrr": mrr,
    }

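The "category" label will be used as an approximation of relevance: a retrieved document counts as relevant when its category matches the target category.

The FOOD & DRINK category contains 6340 articles.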

In [None]:
# Create the document retriever object (VectorStoreRetriever instance)
retriever_eval = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 6340,
        #"filter": {"category": "politics"}  # Example: filter by category
    }
)

In [None]:
# Create the retrieval chain that retrieves documents and then passes them on.
rag_chain_eval = create_retrieval_chain(retriever_eval, question_answer_chain)

In [None]:
# RAG
response = rag_chain_eval.invoke({"input": "What are the latest food and drink developments?"})
print(response["answer"])

In 2021, expect to see more breakfast foods, better plant-based products, and power bowls with ancient grains and spiralized veggies.


In [None]:
# Return the sources that were used to generate the answer
for document in response["context"]:
    print(document)
    print()

page_content='All The New Food And Drink You Can Look Forward To In 2021. Breakfast foods (because you're probably at home in the mornings now), better plant-based products and more.' metadata={'id': '3482', 'category': 'FOOD & DRINK', 'date': '2020-12-28', 'headline': 'All The New Food And Drink You Can Look Forward To In 2021'}

page_content='The Biggest Food Trends Of 2015. ' metadata={'id': '113882', 'category': 'TASTE', 'date': '2014-12-01', 'headline': 'The Biggest Food Trends Of 2015'}

page_content='6 Food Trends To Help You Eat Better In 2016. Get ready for power bowls, spiralized veggies, ancient grains and more.' metadata={'id': '79653', 'category': 'HEALTHY LIVING', 'date': '2015-12-27', 'headline': '6 Food Trends To Help You Eat Better In 2016'}

page_content='10 Food Trends to Watch. We're going to live in a world with Coke robots, apparently.' metadata={'id': '146112', 'category': 'FOOD & DRINK', 'date': '2013-12-04', 'headline': '10 Food Trends to Watch'}

page_content=

In [None]:
# Evaluation
query = "What are the latest food developments?"
#ground_truth_category = ["FOOD & DRINK"]
ground_truth_category = "FOOD & DRINK"
metrics = evaluate_retrieval(retriever_eval, query, ground_truth_category, k=6340)

print("Evaluation Metrics for Retrieval:")
for metric, value in metrics.items():
    print(f"{metric}: {value:.2f}")

6340
Evaluation Metrics for Retrieval:
precision_at_k: 0.39
recall_at_k: 0.39
mrr: 0.33


### Evaluation of Retrieval Performance
We conducted an evaluation of the document retrieval system using the query "What are the latest food developments?" within the FOOD & DRINK category, which contains 6340 documents in the VectorStore. The following retrieval metrics were obtained:

- **Precision@k:** 0.39
- **Recall@k:** 0.39
- **MRR (Mean Reciprocal Rank):** 0.33

#### Interpretation of Results

**Precision@k (0.39):**

This metric indicates that 39% of the top 6340 retrieved documents carry the FOOD & DRINK label. In other words, about 4 out of every 10 retrieved documents count as relevant under the category-based approximation. A precision of 0.39 means the retrieved set is fairly mixed, with a majority of documents coming from other categories, which points to room for refining the ranking mechanism.

**Recall@k (0.39):**
This metric reflects the proportion of relevant documents that were retrieved, out of all available relevant documents in the FOOD & DRINK category. A recall of 0.39 means that the system retrieves roughly 39% of all documents in that category, so a substantial portion of them is missed. Because k was set to the number of relevant documents (6340), precision@k and recall@k share the same numerator and denominator and are therefore necessarily equal; the two figures are not independent evidence. Improving recall could involve enhancing the search index or the way relevance is determined within the vector store.

**MRR (0.33):**
The reciprocal rank is the inverse of the rank position at which the first relevant document appears in the retrieval results, and MRR averages this value over a set of queries. Since only one query was evaluated here, the reported 0.33 is the reciprocal rank for that single query: the first FOOD & DRINK document appeared at rank 3. The system therefore surfaces a relevant document early, although the top two results came from other categories. The metric says nothing about where the remaining relevant documents are located. Improving the relevance ranking and fine-tuning the retrieval algorithms could help bring relevant documents closer to the top of the results.
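
To make the relationship between the three metrics concrete, here is a minimal sketch that computes them from a list of 0/1 relevance flags. The function name `retrieval_metrics` and the five-document ranking are hypothetical and are not the actual evaluation data. The example is set up so that k equals the number of relevant documents, in which case precision@k and recall@k coincide.

```python
def retrieval_metrics(relevant_flags, total_relevant):
    """Compute precision@k, recall@k and reciprocal rank from 0/1 relevance flags."""
    k = len(relevant_flags)
    hits = sum(relevant_flags)
    precision_at_k = hits / k if k else 0.0
    recall_at_k = hits / total_relevant if total_relevant else 0.0
    reciprocal_rank = 0.0
    for rank, flag in enumerate(relevant_flags, start=1):
        if flag:
            reciprocal_rank = 1 / rank  # Only the first relevant hit matters
            break
    return precision_at_k, recall_at_k, reciprocal_rank


# Hypothetical ranking: 5 retrieved documents, 5 relevant documents in the corpus,
# first relevant hit at rank 3.
flags = [0, 0, 1, 1, 0]
precision, recall, rr = retrieval_metrics(flags, total_relevant=5)
print(precision, recall, round(rr, 2))  # prints: 0.4 0.4 0.33
```

Evaluating at a smaller k, such as the number of documents actually passed to the LLM, would let precision and recall diverge and would describe the user-facing results more directly.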

#### Discussion of Strengths and Limitations

**Strengths**
- Solid Initial Performance: A precision and recall of 0.39 at k = 6340 indicate that the retrieval system identifies a meaningful share of the FOOD & DRINK documents even at a very deep cutoff. The reciprocal rank of 0.33 shows that, for the evaluated query, a relevant document appears within the top three results. Since only a single query was evaluated, these figures are indicative rather than conclusive.
- Scalability: The system is capable of handling large categories (with thousands of documents) and retrieves documents effectively, despite the large volume of data in the vector store.

**Limitations**
- Room for Improvement in Precision and Recall: With both precision and recall at 0.39, there is significant room for improvement. The system could be returning a large number of irrelevant documents, which indicates that the ranking of documents could be further optimized. This might involve refining the embeddings, using advanced retrieval models, or adjusting the parameters to focus more on relevance.
- Potential Label Confusion: A key limitation in the retrieval system is the possibility of label confusion, where documents from categories with overlapping themes may be retrieved along with the intended food-related content. The FOOD & DRINK category could be confused with several other categories due to the shared topics or similar content. For example:
  - HEALTHY LIVING (6694 documents) may include health-focused articles on food or diets.
  - WELLNESS (17945 documents) could feature food-related content, but from a health and lifestyle perspective.
  - STYLE & BEAUTY (9814 documents) might contain content related to beauty, where food is discussed in relation to diet.
  - ENTERTAINMENT (17362 documents) might overlap with food-related entertainment such as food shows or celebrity chef content.
  - TRAVEL (9900 documents) could feature culinary tourism or local food culture.
  - TASTE (2096 documents) often addresses food trends and taste preferences, which may overlap with food development.
  - GREEN (2622 documents) includes content on sustainable food practices, which may be relevant but also overlap with topics like food sustainability and agricultural innovations.
  - SCIENCE (2206 documents) could involve articles on food science, nutritional research, or agricultural advancements.

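These overlapping categories could lead the retrieval system to rank documents from neighboring categories above FOOD & DRINK documents, which lowers the measured precision even when the retrieved content is topically related. For instance, the system might retrieve documents from WELLNESS that discuss the health benefits of food, but not actual food development innovations. Conversely, some of these documents may be genuinely relevant to the query, in which case the single-category ground truth underestimates the system's performance.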
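
One way to quantify label confusion is to count the categories of the retrieved documents and to compare a strict ground truth with a relaxed one that accepts several categories, as in the commented-out `in ground_truth_category` variant of `evaluate_retrieval`. The following is a minimal sketch; the function name `category_breakdown` and the list of retrieved categories are hypothetical and do not reflect actual results.

```python
from collections import Counter


def category_breakdown(retrieved_categories, accepted_categories):
    """Count retrieved categories and the share falling inside an accepted set."""
    counts = Counter(retrieved_categories)
    accepted = sum(counts[c] for c in accepted_categories)
    share = accepted / len(retrieved_categories) if retrieved_categories else 0.0
    return counts, share


# Hypothetical categories of ten retrieved documents (not actual results).
retrieved = [
    "FOOD & DRINK", "TASTE", "HEALTHY LIVING", "FOOD & DRINK", "WELLNESS",
    "FOOD & DRINK", "TASTE", "TRAVEL", "FOOD & DRINK", "WELLNESS",
]
counts, strict_share = category_breakdown(retrieved, {"FOOD & DRINK"})
_, relaxed_share = category_breakdown(retrieved, {"FOOD & DRINK", "TASTE"})
print(counts.most_common(3))
print(strict_share, relaxed_share)
```

In the notebook, the input list could be built from `doc.metadata["category"]` for each retrieved document. A large gap between the strict and relaxed shares would suggest that the low precision is driven by overlapping labels rather than by poor ranking.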

#### Conclusion
In summary, the retrieval system demonstrates decent performance, but it faces significant challenges in improving precision, recall, and the handling of overlapping categories. Label confusion, especially with categories related to wellness, travel, and lifestyle, may contribute to the retrieval of irrelevant documents, affecting the system’s ability to provide highly relevant results. Future improvements could include fine-tuning the search algorithm, refining embeddings, and adding more sophisticated techniques for distinguishing between related but distinct topics to enhance the accuracy and relevance of the retrieved documents.

## 5.2. Proposed Evaluation for Extended Systems

Due to time constraints, I wasn't able to fully implement evaluations for the QA with Generative AI, Topic Extraction, and Tagging systems. However, the following outlines how these systems could be effectively evaluated in the future. These methods would provide a strong framework to measure their performance and refine their functionality if further development is pursued.

#### 5.2.1. QA with Generative AI
**Objective:** Assess the system's ability to generate accurate and relevant answers to user queries, using information extracted from the corpus.

**Evaluation Approach:**

- **Quantitative Metrics:**
  - Exact Match (EM): Measures the percentage of generated answers that exactly match the expected answers.
  - F1-Score: Evaluates the overlap of words between the generated and expected answers, balancing precision and recall.
  - BLEU/ROUGE: These metrics compare n-gram similarity between the generated answers and the expected ones. BLEU is often more suited for shorter answers, while ROUGE works well for longer, summary-style responses.
  - BERTScore: Leverages semantic embeddings to evaluate conceptual similarity between the answers.

- **Manual Review:**  
Using a Likert scale (e.g., 1-5), human evaluators can provide qualitative insights that complement the automated metrics, based on:
  - Relevance: Does the answer directly address the question?
  - Accuracy: Is the information correct?
  - Fluency: Is the response well-written and natural?

- **Reference Dataset:**  
Create a test set of representative questions paired with expected answers (ground truth). If such a dataset isn't available, it can be manually curated from the corpus.
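
As an illustration of the first two quantitative metrics, here is a minimal sketch of Exact Match and token-level F1. It assumes a simple normalization (lowercasing, punctuation removal, whitespace collapsing), which is one common choice rather than a fixed standard. The function names and the question-answer pair are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction, reference):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction and reference, for illustration only.
reference = "The Iowa caucuses took place recently."
prediction = "The Iowa caucuses recently took place in the state."
em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
print(em, round(f1, 2))  # prints: 0 0.8
```

Exact Match is strict and tends to be near zero for free-form generated answers, which is why token F1 and the semantic metrics listed above are usually reported alongside it.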

#### 5.2.2. Topic Extraction
**Objective:** Assess how well the system identifies the main themes within a set of documents.

**Evaluation Approach:**

- **Cohesion and Coherence Metrics:**
  - Coherence Score (e.g., UMass, PMI): Measures how semantically consistent the words grouped under each topic are, based on how often they co-occur in the same documents.

- **Expert Review:** Subject matter experts can review the generated topics and assess:
  - Do the topics make sense?
  - Are they useful for organizing or summarizing the content?

**Visualization:**  
Techniques like t-SNE or UMAP can project documents into a lower-dimensional space, helping visualize how well they cluster based on assigned topics.
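
As a rough illustration of a coherence score, the following sketch computes a UMass-style coherence for a single topic: for every pair of top words, it adds the logarithm of the smoothed co-document frequency divided by the document frequency of the higher-ranked word. It assumes that documents are already tokenized into sets of words and that `top_words` is ordered from most to least prominent. The function name and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic; `documents` is a list of token sets."""
    def doc_freq(*words):
        # Number of documents containing all the given words
        return sum(1 for doc in documents if all(w in doc for w in words))

    score = 0.0
    # combinations() preserves order, so `earlier` always ranks above `later`
    for earlier, later in combinations(top_words, 2):
        denominator = doc_freq(earlier)
        if denominator == 0:
            continue  # Skip words that never occur in the corpus
        score += math.log((doc_freq(earlier, later) + 1) / denominator)
    return score


# Toy corpus of tokenized headlines (hypothetical data).
docs = [
    {"covid", "vaccine", "booster"},
    {"covid", "vaccine"},
    {"covid", "cdc"},
    {"recipe", "dinner"},
]
coherent = umass_coherence(["covid", "vaccine", "booster"], docs)
mixed = umass_coherence(["covid", "recipe", "dinner"], docs)
print(coherent > mixed)  # prints: True
```

Scores closer to zero indicate words that co-occur more often. For the clusters in section 4.7, the top words would first have to be derived per cluster, for example from term frequencies of the headlines.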

#### 5.2.3. Tagging
**Objective:** Evaluate the system's ability to assign relevant tags to documents based on their content.

**Evaluation Approach:**

- **Multilabel Classification Metrics:**
  - Accuracy: Measures the proportion of correctly assigned tags.
  - Precision, Recall, F1-Score: These can be calculated at both the micro level (pooling all individual tag decisions, so frequent tags weigh more) and the macro level (averaging per-tag metrics, so every tag counts equally).
  - Subset Accuracy: Checks whether the predicted set of tags matches the expected set exactly.
  - Hamming Loss: Evaluates the proportion of incorrect or missing tags.
  - Jaccard Similarity: Measures the overlap between the predicted and actual tag sets.

- **Reference Dataset:**  
Use a collection of documents with pre-assigned, validated tags as ground truth. If such a dataset doesn't exist, one can be manually curated by tagging a representative sample of the corpus.

**Error Analysis:**  
Build a confusion matrix to identify which tags are most commonly misclassified or missed, and use this to refine the model.

**User Feedback:**  
Gather input from end-users about the relevance and usefulness of the assigned tags, particularly if tagging impacts document organization or search.
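
The multilabel metrics listed above can be sketched in a few lines when tags are represented as Python sets. The following is a minimal illustration, not a replacement for a tested library implementation; the function name `multilabel_metrics` and the three example documents are hypothetical.

```python
def multilabel_metrics(true_sets, predicted_sets, all_tags):
    """Subset accuracy, Hamming loss, mean Jaccard, micro- and macro-F1."""
    pairs = list(zip(true_sets, predicted_sets))
    n = len(pairs)
    subset_accuracy = sum(t == p for t, p in pairs) / n
    # Symmetric difference = tags that are wrongly added or missing
    hamming_loss = sum(len(t ^ p) for t, p in pairs) / (n * len(all_tags))
    jaccard = sum(len(t & p) / len(t | p) if (t | p) else 1.0 for t, p in pairs) / n

    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    per_tag = []
    for tag in all_tags:
        tp = sum(tag in t and tag in p for t, p in pairs)
        fp = sum(tag not in t and tag in p for t, p in pairs)
        fn = sum(tag in t and tag not in p for t, p in pairs)
        per_tag.append((tp, fp, fn))

    micro_f1 = f1(*map(sum, zip(*per_tag)))  # Pool counts over all tags
    macro_f1 = sum(f1(*c) for c in per_tag) / len(per_tag)  # Average per-tag F1
    return {
        "subset_accuracy": subset_accuracy,
        "hamming_loss": hamming_loss,
        "jaccard": jaccard,
        "micro_f1": micro_f1,
        "macro_f1": macro_f1,
    }


# Hypothetical gold and predicted tag sets for three documents.
all_tags = ["politics", "health", "tech"]
true_sets = [{"politics"}, {"health", "politics"}, {"tech"}]
predicted_sets = [{"politics"}, {"health"}, {"tech", "health"}]
scores = multilabel_metrics(true_sets, predicted_sets, all_tags)
print({name: round(value, 3) for name, value in scores.items()})
```

For the `topics` field of `ArticleClassification`, the comma-separated string would first need to be split into a set of normalized tags before these metrics apply.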

# 6. Recommendations

1. **Enhancing Data Quality and Diversity**
  - Expand the Dataset: Explore additional datasets or combine multiple datasets to increase the diversity of text samples. This can lead to a more robust model capable of handling a wider variety of input scenarios.
  - Address Class Imbalance: If applicable, consider techniques such as oversampling, undersampling, or synthetic data generation (e.g., SMOTE) to mitigate class imbalances observed in the dataset.
2. **Advanced Preprocessing Techniques**
  - Domain-Specific Text Cleaning: Develop text preprocessing steps to better handle domain-specific nuances such as acronyms, slang, or context-specific phrases.
3. **Model Optimization and Experimentation**
  - Explore Alternative Models: Experiment with advanced language models like RoBERTa or GPT to potentially improve task-specific performance.
  - Transfer Learning: Leverage pre-trained models or fine-tune existing Hugging Face large language models (LLMs) to capitalize on their linguistic understanding.
4. **Evaluation and Metric Enhancement**
  - Use Additional Metrics: Complement existing evaluation metrics with domain-specific or task-relevant measures such as BLEU, ROUGE, or METEOR for text generation tasks or Mean Reciprocal Rank (MRR) for retrieval tasks.
  - Error Analysis: Perform a detailed error analysis to identify common failure cases or patterns where the model struggles and refine the pipeline accordingly.
5. **Expanding Use Cases**
  - Additional Tasks: Consider extending the model's capabilities to support additional tasks such as recommendations.
6. **Ethical Considerations**
  - Bias Detection and Mitigation: Evaluate the model for potential biases in its outputs and take steps to address them to ensure fair and ethical usage.

These recommendations aim to extend the current implementation, enhance its performance, and ensure its adaptability for practical and impactful real-world applications.