# **1. Introduction**

This project aims to build a RAG System on the BBC News dataset.

</br>

## **RAG Systems:**
Retrieval-Augmented Generation (RAG) optimizes the output of a large language model by having it consult a knowledge base outside its training data before generating a response. Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to generate output for tasks like answering questions, translating languages, and completing sentences. RAG extends these capabilities to specific domains or an organization's internal knowledge base, without the need to retrain the model.
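
The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline built later in this notebook: a simple word-overlap scorer stands in for a real embedding model, the two knowledge-base sentences are made up for the example, and `tokenize`, `retrieve` and `build_prompt` are hypothetical helper names.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query (toy scorer)."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Augment the query with retrieved context before it is sent to an LLM."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Toy knowledge base (illustrative sentences, not taken from the dataset)
knowledge_base = [
    "The film festival opened in Berlin this week.",
    "Crude oil prices rose above 50 dollars a barrel.",
]

prompt = build_prompt("What happened to crude oil prices?", knowledge_base)
print(prompt)
```

The LLM never needs retraining in this flow: only the text placed in the prompt changes from query to query.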

</br>

## **Dataset and Use Case:**
News reaches people through many channels: radio, newspapers, news channels, smartphone apps and so on. But suppose a student, an enthusiast or a casual user wants one specific piece of information, such as 'What happened on June 15 in national stock exchange' or 'How long was the King's speech at the 80th D-Day gathering?'. Going through all of these media to find it is time-consuming. A RAG System built on a news dataset answers the user's query directly rather than returning general information.

The dataset is a collection of 2,225 news articles published by BBC News, spanning various categories including Sport, Business, Politics, Tech, and Entertainment. Our language of focus is English.

</br>

## **Other architecture:**


*   This project uses the Llama3 LLM, a recent state-of-the-art open model that is fine-tuned and optimized for dialogue/chatbot use cases.
*   It also uses 'BAAI/bge-small-en-v1.5' as the sentence embedder. This model is trained on English text, so extending the project to other languages would require switching to a multilingual embedding model. Its embedding size is 384, which is compact but sufficient here because the context in news articles is fairly easy to capture: sentences within a paragraph rarely vary much in topic.





# **2. Installing and importing necessary packages**

In [1]:
%%capture
# Install the necessary packages and modules (%%capture must be the first line of the cell)
!pip install chromadb
!pip install accelerate -U
!pip install -U sentence-transformers
!pip install faiss-gpu
!pip install arxiv

# Install the required llama_index packages
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-ollama
!pip install llama-index ipywidgets
!pip install llama-index-llms-huggingface
!pip install llama-index-readers-web
!pip install llama-index-vector-stores-chroma
!pip install llama-index-vector-stores-faiss

# Install Ollama v0.1.30
!curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.30#' | sh

In [2]:
%%capture
# Import the necessary libraries (%%capture must be the first line of the cell)
import os
import subprocess
import time
import faiss
import arxiv
import pandas as pd
import numpy as np
import torch
import zipfile
import math
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

# Import required modules from the llama_index library
from llama_index.core import VectorStoreIndex, SummaryIndex, SimpleDirectoryReader, StorageContext, load_index_from_storage, Settings, ChatPromptTemplate
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser
from llama_index.llms.ollama import Ollama
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core.schema import Document

from llama_index.readers.file import FlatReader
from pathlib import Path


# Import FaissVectorStore
from llama_index.vector_stores.faiss import FaissVectorStore

# Import ChromaVectorStore and chromadb module
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Import BeautifulSoupWebReader module to scrap data from web page
from llama_index.readers.web import BeautifulSoupWebReader

# **3. Data Collection**

The dataset is sourced from https://www.kaggle.com. As mentioned earlier, it is a collection of news articles published by the BBC. The file is a CSV in which each row corresponds to one published article. Each article is read and written to a separate document, creating 2225 documents overall.

</br>

The idea here is to use a wide range of news articles. In principle, the articles could be in any file format (PDF, .txt, .docx etc.); we use the .txt format. Since this is a prototype of the RAG system, we limit the knowledge base to these 2225 medium-sized news articles.


## **3.1 Kaggle datasets**

In [3]:
# Set Kaggle API credentials (replace the placeholders with your own; never publish a real API key)
os.environ['KAGGLE_USERNAME'] = '<your-kaggle-username>'
os.environ['KAGGLE_KEY'] = '<your-kaggle-api-key>'

# Download the dataset
!kaggle datasets download -d moazeldsokyx/bbc-news

# Define the path to the downloaded zip file
zip_file_path = 'bbc-news.zip'

# Define the directory to extract the files to
extract_dir = 'bbc-news'

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

print("Dataset extracted to:", extract_dir)

csv_file_path = os.path.join(extract_dir, 'bbc-text.csv')
df = pd.read_csv(csv_file_path)

# Display the first few rows of the DataFrame
print(df.head())

Dataset URL: https://www.kaggle.com/datasets/moazeldsokyx/bbc-news
License(s): unknown
Downloading bbc-news.zip to /content
  0% 0.00/1.83M [00:00<?, ?B/s]
100% 1.83M/1.83M [00:00<00:00, 117MB/s]
Dataset extracted to: bbc-news
        category                                               text
0           tech  tv future in the hands of viewers with home th...
1       business  worldcom boss  left books alone  former worldc...
2          sport  tigers wary of farrell  gamble  leicester say ...
3          sport  yeading face newcastle in fa cup premiership s...
4  entertainment  ocean s twelve raids box office ocean s twelve...


In [4]:
print('Number of news articles present: {0}, with average length of each article being {1} characters'.format(len(df), math.floor(df.text.str.len().mean())))

Number of news articles present: 2225, with average length of each article being 2262 characters


The cell below writes each news article to its own text file, producing one document per article.

In [5]:
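folder_path = '/content/news/articles/'

# Create the target folder if it does not exist yet
os.makedirs(folder_path, exist_ok=True)

# Write each article to its own text file: article_0.txt, article_1.txt, ...
for i, data in enumerate(df.text.to_list()):
  fname = os.path.join(folder_path, f"article_{i}.txt")
  with open(fname, "w", encoding="utf-8") as text_file:
    text_file.write(data)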


# Count the number of files in the folder
num_files = len([f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))])

print(f"Number of news articles in '{folder_path[:-1]}': {num_files}")

Number of news articles in '/content/news/articles': 2225


# **4. Unit Test**

The following tests define the response the RAG System should produce for sample input queries. They cover both positive and negative scenarios.

</br>

## **Test case 1**

**Query:** 'What does Stacey Jolna mention about Televisions?'

**Expected Response:** Stacey Jolna, senior vice president of tv guide tv group  is that the way people find the content they want to watch has to be simplified for tv viewers. it means that networks  in us terms  or channels could take a ...

**Scenario Type:** Positive

</br>

## **Test case 2**

**Query:** 'What happened at the berlin film festival?'

**Expected Response:** Berlin cheers for anti-nazi film a german movie about an anti-nazi resistance heroine has drawn loud applause at berlin film festival.  sophie scholl - the final days portrays the final days of the member of the white rose movement. scholl  21  was arrested and beheaded with her ...

**Scenario Type:** Positive

</br>

## **Test case 3**

**Query:** 'By how much did Virgin blue profits fall?'

**Expected Response:** Virgin blue reported a 22% fall in first quarter profits in august 2004 due to tough competition. in november  first half profits were down due to ...

**Scenario Type:** Positive

</br>

## **Test case 4**

**Query:** 'What happened to Crude oil prices?'

**Expected Response:** crude oil prices back above USD 50 cold weather across parts of the united states and much of europe has pushed us crude oil prices above $50 a barrel for the first time in almost three months.  freezing temperatures ...

**Scenario Type:** Positive

</br>

## **Test case 5**

**Query:** 'Who won India elections in 2024?'

**Expected Response:** Text loosely related to India, elections, winning or 2024, but not a concrete answer.

**Scenario Type:** Negative

</br>

Note: The test queries are very specific to the context. A positive scenario means that the answer can be found in the context, while a negative scenario means that no answer can be found in the context.
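
The test cases above could also be encoded as data and checked automatically. The following is a minimal sketch, assuming a keyword-based pass criterion that is not part of the original notebook: a positive case passes when some retrieved chunk contains all of its keywords, and a negative case passes when no chunk does. The names `TEST_CASES`, `run_tests` and `fake_retriever` are hypothetical, and `fake_retriever` returns canned text in place of the real pipeline.

```python
# Each case pairs a query with keywords taken from the expected responses above.
TEST_CASES = [
    {"query": "What happened at the berlin film festival?",
     "keywords": ["sophie scholl", "berlin"], "scenario": "positive"},
    {"query": "What happened to Crude oil prices?",
     "keywords": ["crude oil", "$50"], "scenario": "positive"},
    {"query": "Who won India elections in 2024?",
     "keywords": ["india", "election", "2024"], "scenario": "negative"},
]

def contains_all(text, keywords):
    """True when every keyword occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return all(keyword.lower() in lowered for keyword in keywords)

def run_tests(retrieve_fn, cases=TEST_CASES):
    """Return {query: passed}. retrieve_fn(query) must return a list of chunk texts."""
    results = {}
    for case in cases:
        hit = any(contains_all(chunk, case["keywords"]) for chunk in retrieve_fn(case["query"]))
        # Positive cases need a matching chunk; negative cases must not have one.
        results[case["query"]] = hit if case["scenario"] == "positive" else not hit
    return results

def fake_retriever(query):
    """Hypothetical stand-in for the real pipeline, returning canned chunks."""
    return [
        "sophie scholl - the final days has drawn loud applause at berlin film festival.",
        "cold weather has pushed us crude oil prices above $50 a barrel.",
    ]

results = run_tests(fake_retriever)
print(results)
```

Swapping `fake_retriever` for a function that wraps the actual retriever would turn these scenarios into a repeatable regression check.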

# **5. Chunking and embedding**

Chunking reduces the number of sentences represented by each embedding, so that the semantic meaning of the text is not diluted.
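
To make the idea concrete, here is a simplified sketch of percentile-based semantic chunking. It is an illustration under stated assumptions rather than the library's implementation: a bag-of-words count vector stands in for the embedding model, sentences are compared one by one, and the helper names (`embed`, `cosine_distance`, `percentile`, `semantic_chunks`) are hypothetical. A new chunk starts wherever the distance between adjacent sentences exceeds the chosen percentile of all adjacent distances.

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def semantic_chunks(sentences, breakpoint_percentile=60):
    """Start a new chunk wherever adjacent sentences are unusually dissimilar."""
    if len(sentences) < 2:
        return [list(sentences)]
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, breakpoint_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks

sentences = [
    "oil prices rose sharply",
    "oil prices rose again today",
    "the film festival opened in berlin",
    "the film festival drew crowds in berlin",
]
print(semantic_chunks(sentences))
```

With these four toy sentences the only above-threshold jump is between the oil sentences and the festival sentences, so the text splits into two chunks of two sentences each. A lower percentile produces more, smaller chunks; a higher one produces fewer, larger chunks.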

## **5.1. Chunking**

In [8]:
n_articles = 15  # only a subset of the 2225 articles is processed in this prototype
folder_path = '/content/news/articles/'

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Build the semantic splitter once and reuse it for every article
parser = SemanticSplitterNodeParser(
  buffer_size=1, breakpoint_percentile_threshold=60, embed_model=embed_model
)

news_doc_ls = []

def SemanticSplit(doc):
  """Split the loaded document(s) into semantically coherent nodes."""
  return parser.get_nodes_from_documents(doc)

# Read each article file and collect its semantic chunks
for i in range(n_articles):
  news_doc = FlatReader().load_data(Path(folder_path + "article_" + str(i)+ ".txt"))
  semantic_nodes = SemanticSplit(news_doc)
  news_doc_ls.extend(semantic_nodes)



In [9]:
# Convert each TextNode in news_doc_ls into a Document (news_doc_ls_prc) suitable for loading into ChromaVectorStore.
# news_ls keeps the raw chunk text for the Faiss pipeline.
news_doc_ls_prc = []
news_ls = []

for t in news_doc_ls:
  news_ls.append(t.text)
  news_doc_ls_prc.append(Document(id_ = t.id_,
                                  embedding = t.embedding,
                                  metadata = t.metadata,
                                  excluded_embed_metadata_keys = t.excluded_embed_metadata_keys,
                                  excluded_llm_metadata_keys = t.excluded_llm_metadata_keys,
                                  relationships = t.relationships,
                                  text = t.text
                                  ))

## **5.2. Ollama setup**

The Llama3 LLM model is run locally using Ollama.

In [10]:
# Set up Ollama to run LLMs locally
OLLAMA_MODEL='llama3'

os.environ['OLLAMA_MODEL'] = OLLAMA_MODEL
!echo $OLLAMA_MODEL

command = "nohup ollama serve&"

process = subprocess.Popen(command,
                            shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)

print("Process ID:", process.pid)
time.sleep(5)

!ollama -v


# Use the model named in OLLAMA_MODEL as our LLM
# Allow up to 8 minutes per request, since generation is slow when running on CPU
llm = Ollama(model=OLLAMA_MODEL, request_timeout=480.0)

# Specify the LLM and embedding model into LlamaIndex's settings
Settings.llm = llm
Settings.embed_model = embed_model

llama3
Process ID: 3096
ollama version is 0.1.41
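
The fixed `time.sleep(5)` above assumes the server is ready within five seconds. A more robust alternative is to poll the server until it responds. The sketch below uses only the standard library; `wait_for_server` is a hypothetical helper name, and the URL assumes Ollama's default address of `http://localhost:11434`, which should be adjusted if the server is configured differently.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://localhost:11434", timeout=30.0, interval=0.5):
    """Poll `url` until it answers an HTTP request or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful response means the server is accepting requests
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet: wait briefly and try again
            time.sleep(interval)
    return False
```

A possible usage is `if not wait_for_server(): raise RuntimeError("Ollama server did not start")`, placed right after the `subprocess.Popen` call in place of the fixed sleep.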


In [12]:
%%capture
# Download the Llama3 model so that the local Ollama server can serve it
# (`ollama run llama3` would also work, but it opens an interactive session)
!ollama pull llama3

## **5.3. Embedding and Vector Storage using ChromaVectorStore**

In [13]:
db = chromadb.PersistentClient(path="./news_db")

# Create the collection/table ("news_articles") in the db, or reuse it if the cell is re-run
chroma_collection = db.get_or_create_collection("news_articles")

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Specify Chroma as our vector db
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the vector index
vector_index = VectorStoreIndex.from_documents(
    news_doc_ls_prc, # the documents created earlier
    storage_context = storage_context,
    embed_model = embed_model
)

# Print the metadata
print(chroma_collection)

# Print the name of the collection (table)
print(f'Collection name is: {chroma_collection.name}')

name='news_articles' id=UUID('c23b7658-f8f1-43aa-bfad-e46e4813eb02') metadata=None tenant='default_tenant' database='default_database'
Collection name is: news_articles


## **5.4. Embedding and vector storage using Faiss**

### **5.4.1 Embedding the chunked nodes**

In [19]:
# Alternative embedding models that could be tried:
#   sentence-transformers/all-mpnet-base-v2
#   distilbert-base-nli-stsb-mean-tokens
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Convert the news chunks to unit-length vectors, so that an inner product equals a cosine similarity
embeddings = model.encode(news_ls, show_progress_bar=True, normalize_embeddings=True)

print(f'Shape of the vectorised news articles: {embeddings.shape}')

Shape of the vectorised news articles: (103, 384)


### **5.4.2. Storing in Faiss**

In [22]:
# Convert data type to float32, as required by Faiss
embeddings = np.asarray(embeddings, dtype="float32")

# Create a flat (exact-search) inner-product index sized to the embedding dimension
faiss_index = faiss.IndexFlatIP(embeddings.shape[1])

# Wrap the index in IndexIDMap so that each vector can be stored under a custom id
faiss_index = faiss.IndexIDMap(faiss_index)

# Create an array of int64 ids, one per embedding
n_id = np.arange(embeddings.shape[0], dtype="int64")

# Add vectors and their IDs
faiss_index.add_with_ids(embeddings, n_id)

print(f"Number of vectors in the Faiss index: {faiss_index.ntotal}")

Number of vectors in the Faiss index: 103
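
`IndexFlatIP` ranks vectors by inner product, which equals cosine similarity only when the vectors have unit length. The plain-Python sketch below illustrates that relationship on two toy vectors. It makes no claim about whether a given embedding model already returns normalized vectors, and the helper names are hypothetical.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def inner_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector lengths."""
    return inner_product(a, b) / (math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
raw_ip = inner_product(a, b)                               # grows with vector length
unit_ip = inner_product(l2_normalize(a), l2_normalize(b))  # independent of vector length
print(raw_ip, unit_ip, cosine_similarity(a, b))
```

Here the raw inner product is 24, while both the normalized inner product and the cosine similarity are approximately 0.96. This is why scores from an inner-product index can be read as cosine similarities only for unit-length embeddings.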


# **6. Prompt**

In [14]:
qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "Always answer the question, even if the context isn't helpful."
        ),
    ),
    ChatMessage(role=MessageRole.USER, content=qa_prompt_str),
]

text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)

In [15]:
print(vector_index.as_query_engine(
    text_qa_template=text_qa_template,
    llm=llm,).query("Importance of Television"))
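
To see what the LLM receives, the template can be filled by hand with ordinary string formatting. This is only an illustration: LlamaIndex performs the substitution internally, `fill_prompt` is a hypothetical helper, and joining the chunks with blank lines is an assumption. The two example chunks are short phrases taken from articles shown elsewhere in this notebook.

```python
QA_PROMPT = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

def fill_prompt(chunks, query):
    """Join retrieved chunks and substitute them, with the query, into the template."""
    return QA_PROMPT.format(context_str="\n\n".join(chunks), query_str=query)

example = fill_prompt(
    ["blair prepares to name poll date", "crude oil prices back above $50"],
    "What happened to crude oil prices?",
)
print(example)
```

Printing the filled prompt is a quick way to check which context the model is actually asked to rely on.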

# **7. Query Pipeline**

Sample query engines using Chroma vector storage, one per response mode:


1.   query_engine = vector_index.as_query_engine(response_mode="compact")
2.   query_engine = vector_index.as_query_engine(response_mode="refine")
3.   query_engine = vector_index.as_query_engine(response_mode="tree_summarize")

response = query_engine.query("What happened at Berlin film festival?")

</br>

The pipeline below uses Faiss vector storage. The function query_proc_pipe takes three arguments: query, k_ (the number of similar results required) and model_ (the model used to embed the query). It embeds the query, searches the vector DB for similar news chunks and prints up to k_ of them, skipping any whose similarity falls below 0.5.

Other queries to try:

* What did tony blair ask the queen?
* Does sluggish demand continue?



In [20]:
def query_proc_pipe(query, k_ = 2, model_ = model):
  # Embed the query as a unit-length vector, so that the inner-product score is a cosine similarity
  query_embedding = model_.encode(query, show_progress_bar=True, normalize_embeddings=True)

  # Search the index for the k_ nearest chunks; Faiss expects a 2-D float32 array
  cosine_similarity, similar = faiss_index.search(np.array([query_embedding], dtype="float32"), k = k_)
  cosine_similarity = cosine_similarity.flatten().tolist()
  similar = similar.flatten().tolist()
  print(f'Cosine similarity: {cosine_similarity}')
  print(f'Top matches: {similar}')

  n_sim = len(similar)

  # Results are sorted by decreasing score, so stop at the first one below the 0.5 threshold
  for i, (cos_sim, sim) in enumerate(zip(cosine_similarity, similar)):
    if cos_sim < 0.5:
      break
    print('\n ********** Found result {0} of {1} with {2} % match **********'.format(i+1, n_sim, round((cos_sim * 100),2)))
    print(news_ls[sim])



In [24]:
query = "queen"

query_proc_pipe(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Cosine similarity: [0.5356149077415466, 0.5159969329833984]
Top matches: [75, 49]

 ********** Found result 1 of 2 with 53.56 % match **********
sluggish demand reported previously for november and now december 2004 continues   said virgin blue chief executive brett godfrey. virgin blue  which is 25% owned by richard branson  has been struggling to fend off pressure from rival jetstar. it cut its full year passenger number forecast by  approximately 2.5% . 

 ********** Found result 2 of 2 with 51.6 % match **********
blair prepares to name poll date tony blair is likely to name 5 may as election day when parliament returns from its easter break  the bbc s political editor has learned.  andrew marr says mr blair will ask the queen on 4 or 5 april to dissolve parliament at the end of that week. mr blair has so far resisted calls for him to name the day but all parties have stepped up campaigning recently. 
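
In the run above, the one-word query "queen" ranks an unrelated Virgin Blue chunk first. One thing that may help with such short queries: the model card for the BGE English models suggests prefixing the query (but not the passages) with a retrieval instruction. The sketch below assumes that instruction string and uses a hypothetical helper name, `with_instruction`; whether it improves the ranking on this dataset has not been tested here.

```python
# Instruction string suggested by the BGE model card for short-query retrieval (assumed here)
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def with_instruction(query, instruction=BGE_QUERY_INSTRUCTION):
    """Prefix a search query with the retrieval instruction; passages stay unprefixed."""
    return instruction + query.strip()

print(with_instruction("queen"))
```

A possible usage is `query_proc_pipe(with_instruction("queen"))`, leaving the indexed chunks unchanged.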


# **8. Future Improvements:**

The RAG system is developed with the aim of having excellent knowledge of world affairs. The dataset used here consists of medium-sized articles and is limited in quantity; a more extensive collection would bring the system closer to that aim. For this prototype, only a subset of the data is used to showcase the demo, since processing a large volume of documents needs dedicated processing time. A stronger embedding model could be chosen to increase the context window and embedding size. Hyperparameters could also be tuned, for example the breakpoint threshold used for semantic chunking.

Instead of keeping the knowledge source general, we could make it more specific, covering for example only cricket-related news, finance or economics, so that the retrieved content is more focused.



# **9. References:**



1.   https://www.kaggle.com/datasets/moazeldsokyx/bbc-news
2.   https://huggingface.co/
3.   https://docs.llamaindex.ai/
4.   https://ollama.com/library/llama3
5.   https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
6.   https://github.com/RDGopal/IB9CW0-Text-Analytics/blob/main/

