<a href="https://colab.research.google.com/github/porlorps/Python_RAG_News_Retreival_System/blob/main/Python_RAG_news_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News Retrieval Augmented Generation System

In a fast-changing digital world, the volume and velocity of news make it difficult for people to stay informed and make quick decisions. A Generative Pre-trained Transformer (GPT) model can answer questions directly, but its knowledge is limited to the data it was trained on. To address this, we develop a news Retrieval-Augmented Generation (RAG) system. By supplying current news articles to the generative model at query time, the RAG system lets users combine the model's language ability with an up-to-date knowledge base. Such a system helps users stay competitive in a world of abundant, fast-moving information.

In [None]:
# Install Prerequisites

!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-ollama
!pip install llama-index ipywidgets
!pip install llama-index-llms-huggingface
!pip install llama_index.readers.web
!pip install llama-index-vector-stores-chroma
!pip install chromadb

!pip install llama_index.core



# Datasets
The dataset of CNN news articles from 2011 to 2022, published on Kaggle by Hadas Unger, was used as the knowledge base for our RAG system.

(https://www.kaggle.com/datasets/hadasu92/cnn-articles-after-basic-cleaning)

Due to the GPU resource limits of Colab, the RAG system is built using news data from 2022 only. This dataset comprises 299 news records.

To minimise build time, the dataset was first reduced to 5 articles for the initial build.


In [None]:
import pandas as pd

# Load the dataset
file_path = 'CNN_Articles_5samples.csv'
df = pd.read_csv(file_path)

# Installing Generative Model

Ollama and the Phi-3 generative model, a small language model (SLM), were used for generation, following the code from Text Analytics Workshops 7 and 8. Phi-3 was selected because it is an open-source small language model that demands fewer resources than larger language models, which makes it better suited to running this RAG system on Google Colab.
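
The setup cell in this section starts `ollama serve` in the background and then waits a fixed 15 seconds. As a hedged alternative, the sketch below polls a TCP port until something is listening. The helper name `wait_for_port` is hypothetical, and the usage comment assumes Ollama is listening on its default port 11434.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a server is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not ready yet (refused or timed out): wait briefly and retry.
            time.sleep(interval)
    return False


# Hypothetical usage after launching `ollama serve` in the background
# (assumes the default Ollama port has not been changed):
# ready = wait_for_port("127.0.0.1", 11434)
```

Polling returns as soon as the server is ready and reports a clear failure if it never starts, whereas a fixed sleep can be either too long or too short.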

In [None]:
%%capture
# Install Ollama v0.1.30
!curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.30#' | sh

In [None]:
%%capture
# Setup the model as a global variable
OLLAMA_MODEL='phi:latest'

# Add the model to the environment of the operating system
import os
os.environ['OLLAMA_MODEL'] = OLLAMA_MODEL
!echo $OLLAMA_MODEL # print the global variable to check it saved

import subprocess
import time

# Start ollama on the server ("serve")
command = "nohup ollama serve&" # "nohup" and "&" means run in the background

# Use subprocess.Popen to run the command
process = subprocess.Popen(command,
                            shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)

time.sleep(15)

# Import required modules from the llama_index library
from llama_index.core import VectorStoreIndex, SummaryIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.core import StorageContext
from llama_index.core.postprocessor import LLMRerank

# Import ChromaVectorStore and chromadb module
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Import the Ollama class
from llama_index.llms.ollama import Ollama

# Use the global variable (OLLAMA_MODEL) as our LLM
# Set a timeout of 8 minutes in case of CPU
llm = Ollama(model=OLLAMA_MODEL, request_timeout=480.0)

In [None]:
# Query the model via the command line; the first run also pulls the model
!ollama run $OLLAMA_MODEL "What does CNN news do"

I'm sorry, but as an AI language model, I don't have access to specific information about what CNN News does as it's not related to my programming. Could you please provide more context or a particular question regarding CNN News so that I can assist you better?

# Queries

During the model-building stage, 5 articles were used to generate a total of 15 questions with ChatGPT. For each article, 3 questions and their corresponding answers were constructed (Appendix B). Next, 5 questions, one per article, that specifically pertain to the article content were chosen as test queries for both Phi-3 and our RAG system.

The chosen questions are a combination of general and specific questions. Questions 1 and 2 relate to information mentioned in the articles, whereas questions 3, 4, and 5 address broader subjects that can be answered in general terms but also require specific context from the articles.
The test questions are:

Article 1
*   How does TuSimple's approach to autonomous trucking address the driver shortage problem highlighted by the recent e-commerce boom?


Article 2
*   How does the partnership with General Motors contribute to the development and use of the Ironhand?


Article 3
*   What is the primary role of the QR codes in the functioning of these robots?


Article 4
*   Why have investors started to exclude Russia from emerging market funds?


Article 5
*   Which sectors in Russia are specifically targeted by these sanctions?

The following answers are provided by Phi-3 alone.

In [None]:
# We test the result generated from Ollama for these 5 questions

# Question 1
# Query the model via the command line
!ollama run $OLLAMA_MODEL "How does TuSimple's approach to autonomous trucking address the driver shortage problem highlighted by the recent e-commerce boom?"

As an AI language model, I don't have access to the latest news or statistics, but based on what is available, TuSimple's approach aims to automate the transportation industry by developing fully autonomous trucks. By doing so, it could potentially reduce the need for human drivers, addressing one...

In [None]:
# Question 2

# Query the model via the command line
!ollama run $OLLAMA_MODEL "How does the partnership with General Motors contribute to the development and use of the Ironhand?"

As an AI language model, I don't have access to information about ongoing projects or partnerships of companies like General Motors. However, based on my understanding, partnerships between companies can facilitate the sharing of knowledge, resources, and expertise, which can lead to the creation and improvement...

In [None]:
# Question 3

# Query the model via the command line
!ollama run $OLLAMA_MODEL "What is the primary role of the QR codes in the functioning of these robots?"

The primary role of the QR codes is to serve as input for the artificial intelligence system that controls the robots. By scanning a QR code, users can give specific commands or provide important information such as location and task requirements to the robot. This allows the AI to create an accu...

In [None]:
# Question 4

# Query the model via the command line
!ollama run $OLLAMA_MODEL "Why have investors started to exclude Russia from emerging market funds?"

Investor confidence in Russia has been affected by several factors, including political tensions with Western countries, economic sanctions imposed on Russian companies for their actions in Ukraine, and a decline in the value of the ruble. As a result, many investors have chosen to exclude Russia from their...

In [None]:
# Question 5

# Query the model via the command line
!ollama run $OLLAMA_MODEL "Which sectors in Russia are specifically targeted by these sanctions?"

As an AI, I don't have access to real-time information or data on Russia's current state of affairs. However, as of October 2021, several sectors in Russia were specifically targeted by economic and political sanctions, including energy, finance, technology, and defense. The U.S....

Phi-3 alone was unable to answer questions 1, 2, and 5 specifically because it lacks access to up-to-date data, although it could answer them in a broader context.
For question 3, the response shows that the model understands the subject, but the answer contains some inaccuracies. In contrast, the model answered question 4 accurately.

Question 5 will be used to test the RAG system with different response modes, because the relevant articles are complex and a properly built RAG system should still be able to answer it.

The answer to question 5 should mention the energy, cars, luxury goods, and financial markets sectors, as well as increased tariffs on vodka, steel, art, and fur.
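
One rough way to check a generated answer against the expected content for question 5 is to count which expected terms it mentions. The sketch below is only an illustration: the helper `keyword_coverage` and the shortened term list are assumptions made for this example, and whole-word matching is a crude proxy for correctness, not the evaluation method used in this notebook.

```python
import re


def keyword_coverage(answer: str, expected_terms: list) -> dict:
    """Report which expected terms appear in the answer as whole words (case-insensitive)."""
    found, missing = [], []
    for term in expected_terms:
        # \b anchors avoid false hits such as "art" inside "article".
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, answer, flags=re.IGNORECASE):
            found.append(term)
        else:
            missing.append(term)
    return {"found": found, "missing": missing, "coverage": len(found) / len(expected_terms)}


# Terms derived from the expected answer to question 5 (shortened to single words).
EXPECTED_Q5 = ["energy", "cars", "luxury", "financial", "vodka", "steel", "art", "fur"]

sample = "The sanctions ban investment in the energy sector and imports of steel."
report = keyword_coverage(sample, EXPECTED_Q5)
print(report["found"], report["coverage"])
```

A low coverage score flags answers worth reading closely; a high score still needs a human check, since the terms could appear in a wrong statement.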

# RAG Model building

## Chunking and Embedding Model

We applied semantic chunking to the input data for the RAG system to preserve the semantic coherence of each chunk. Retrieving the chunks closest in meaning to the query is essential for the RAG system to give a reliable response.

Our domain data consists of English articles with word counts ranging from 60 to 6000. Over 97% of these articles have fewer than 3340 words. The BGE small embedding model was chosen for this dataset because of its compact size, while still offering a maximum sequence length of 512 tokens and 384-dimensional embeddings to cover our chunked text. According to the HuggingFace MTEB Leaderboard, it has a lower memory footprint than other embedding models of the same capacity. (https://huggingface.co/spaces/mteb/leaderboard)
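
To make the idea of semantic chunking concrete, the sketch below splits a sequence of sentences wherever the distance between neighbouring sentence vectors exceeds a chosen percentile of all neighbouring distances. It is a simplified illustration, not LlamaIndex's implementation: the two-dimensional vectors are hand-made stand-ins for real embeddings, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, vectors, threshold_pct=90):
    """Group consecutive sentences, starting a new chunk where the distance
    to the previous sentence exceeds the percentile cutoff."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > cutoff:
            # Large semantic jump: close the current chunk.
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


# Toy data: two sentences about trucking, two about sanctions.
sentences = ["Trucks drive themselves.", "Drivers are scarce.",
             "Sanctions hit energy.", "Steel imports are banned."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
print(semantic_chunks(sentences, vectors))
```

With a high percentile such as 90, only the largest jumps in meaning become chunk boundaries, so related sentences tend to stay together.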

In [None]:
# Write each news article from the pandas DataFrame to its own text file

!mkdir -p '/content/testdata/rawtext/'

articles = list(df['Article_text'])

# enumerate() supplies the file index, so no manual counter is needed
for count, article in enumerate(articles):
  fname = "/content/testdata/rawtext/Output" + str(count) + ".txt"
  with open(fname, "w") as text_file:
    text_file.write(article)

In [None]:
# Check the result of the first article
with open('/content/testdata/rawtext/Output0.txt', 'r') as first_file:
  file_contents = first_file.read()
print(file_contents)

 (CNN)Right now, there's a shortage of truck drivers in the US and worldwide, exacerbated by the e-commerce boom brought on by the pandemic. One solution to the problem is autonomous trucks, and several companies are in a race to be the first to launch one. Among them is San Diego-based TuSimple.Founded in 2015, TuSimple has completed about 2 million miles of road tests with its 70 prototype trucks across the US, China and Europe. Although these are simply commercially available trucks retrofitted with its technology, TuSimple has deals in place with two of the world's largest truck manufacturers -- Navistar in the US and Traton, Volkswagen's trucking business, in Europe -- to design and build fully autonomous models, which it hopes to launch by 2024. Photos: The Yara Birkeland is what its builders call the world's first zero-emission, autonomous cargo ship. The ship is scheduled to make its first journey between two Norwegian towns before the end of the year. Click through to see more

In [None]:
# Load the embedding model for semantic chunking

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

# Initialize a HuggingFace Embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Register the LLM and the embedding model in LlamaIndex's settings
Settings.llm = llm
Settings.embed_model = embed_model


In [None]:
# Load documents
reader = SimpleDirectoryReader("/content/testdata/rawtext") # load documents from the rawtext folder
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Loaded 5 docs


In [None]:
# Chunking process
from llama_index.core.node_parser import SemanticSplitterNodeParser

parser = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=90, embed_model=embed_model
)

semantic_CNN_test_nodes = parser.get_nodes_from_documents(docs)


In [None]:
!mkdir -p '/content/testdata/chunkdata/' # create the "chunkdata" directory if it does not exist

# Save each chunk to its own text file
for count, chunk in enumerate(semantic_CNN_test_nodes):
  fname = "/content/testdata/chunkdata/Chunk_Output" + str(count) + ".txt"
  with open(fname, "w") as text_file:
    text_file.write(chunk.text)

## Import chunk data into vector database

ChromaDB was chosen as the vector database for this RAG system because it is open source, flexible in its applications, and performs well in the cloud (https://thenewstack.io/top-5-vector-database-solutions-for-your-ai-project/). Its open-source nature also brings community support for debugging (https://medium.com/@woyera/pinecone-vs-chroma-the-pros-and-cons-2b0b7628f48f).
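
Whichever vector database is used, its core job in a RAG pipeline is to return the stored chunks whose embeddings are most similar to the query embedding. The sketch below shows that idea with an in-memory dictionary and cosine similarity. It is a toy illustration and not how ChromaDB is implemented; the chunk names and three-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, store, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query vector."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical 3-dimensional "embeddings"; real bge-small vectors have 384 dimensions.
toy_store = {
    "chunk_trucks": [0.9, 0.1, 0.0],
    "chunk_sanctions": [0.1, 0.9, 0.2],
    "chunk_tariffs": [0.0, 0.8, 0.5],
}
query = [0.0, 1.0, 0.3]  # stand-in for the embedding of a sanctions question

for chunk_id, score in top_k(query, toy_store, k=2):
    print(chunk_id, round(score, 3))
```

The `k` argument plays the same role as `similarity_top_k=2` in the query engine cells: only the two closest chunks are handed to the generator.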

In [None]:
# Import ChromaVectorStore and chromadb module
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
#from llama_index.core import get_response_synthesizer
import chromadb

# Load documents
reader = SimpleDirectoryReader("/content/testdata/chunkdata") # load chunks from the chunk data folder
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

# Create client ("db") and a database ("chroma_db")
db = chromadb.PersistentClient(path="./chroma_db")

# Create a collection/table ("test_CNN_news") in the db
chroma_collection = db.create_collection("test_CNN_news")

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Specify Chroma as our vector db
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the vector index
vector_index = VectorStoreIndex.from_documents(
    docs, # the chunk documents loaded above
    storage_context = storage_context,
    embed_model = embed_model
)

# Print the metadata
print(chroma_collection)

# Print the name of the collection (table)
print(f'Collection name is: {chroma_collection.name}')

Loaded 23 docs
name='test_CNN_news' id=UUID('0c24e98e-531e-45c2-919c-4510346d32a1') metadata=None tenant='default_tenant' database='default_database'
Collection name is: test_CNN_news


## Query Engine

A set of queries was tested on the RAG system using five different response modes to compare the model's performance.
(https://www.bluelabellabs.com/blog/llamaindex-response-modes-explained/)
1. Compact
2. Refine
3. Tree Summarize
4. Accumulate
5. Simple Summarize
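
The five modes differ mainly in how many LLM calls are made and how the retrieved chunks are combined. The sketch below imitates those call patterns with a stand-in `fake_llm` function so the structure is visible. It is a conceptual simplification under stated assumptions, not LlamaIndex's code: the real modes also pack and truncate text to fit the context window, and the pairwise merging shown for tree summarize is an assumption made for illustration. All function names are hypothetical, and `chunks` is assumed to be non-empty.

```python
def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: wraps the prompt so the call pattern is visible."""
    return f"LLM({prompt})"


def compact(question, chunks):
    # One call: retrieved chunks are packed together into a single prompt.
    return fake_llm(f"{question} | {' + '.join(chunks)}")


def refine(question, chunks):
    # One call per chunk: each call revises the previous answer.
    answer = fake_llm(f"{question} | {chunks[0]}")
    for chunk in chunks[1:]:
        answer = fake_llm(f"{question} | refine {answer} with {chunk}")
    return answer


def tree_summarize(question, chunks):
    # Answer per chunk, then merge answers pairwise until one remains.
    level = [fake_llm(f"{question} | {chunk}") for chunk in chunks]
    while len(level) > 1:
        level = [fake_llm(f"{question} | {' + '.join(level[i:i + 2])}")
                 for i in range(0, len(level), 2)]
    return level[0]


def accumulate(question, chunks):
    # One independent call per chunk; the answers are listed, not merged.
    return [fake_llm(f"{question} | {chunk}") for chunk in chunks]


def simple_summarize(question, chunks, max_chars=200):
    # One call: chunks are truncated so that everything fits a single prompt.
    clipped = [chunk[:max_chars] for chunk in chunks]
    return fake_llm(f"{question} | {' + '.join(clipped)}")


print(tree_summarize("Q", ["c1", "c2"]))
print(accumulate("Q", ["c1", "c2"]))
```

Seen this way, accumulate is the only mode that never combines chunks into one answer.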

In [None]:
# Generate an answer to the query
# Response mode: Compact

query_engine = vector_index.as_query_engine(response_mode="compact", similarity_top_k=2, llm=llm)

response = query_engine.query("Which sectors in Russia are specifically targeted by these sanctions?")

response.response

' The sectors that are specifically targeted by these sanctions are investments in the Russian energy sector, luxury goods exports, steel products imports from Russia, assets of more business leaders who support the Russian state, and the European Commission\'s statement says that the measure includes "a far-reaching ban on new investment across the Russian energy sector with limited exceptions for civil nuclear energy and the transport of certain energy products back to the EU."\n'

In [None]:
# Generate an answer to the query
# Response mode: Refine

query_engine = vector_index.as_query_engine(response_mode="refine", similarity_top_k=2, llm=llm)

response = query_engine.query("Which sectors in Russia are specifically targeted by these sanctions?")

response.response

" The European Union' new sanctions against Russia target investments in the Russian energy sector, luxury goods exports, imports of steel products from Russia, and transactions with some Russian state-owned enterprises linked to the Kremlin's military-industrial complex. The ban on Russian steel imports is estimated to affect 3.3 billion euros ($3.6 billion) worth of products, while EU companies will also be no longer allowed to export any luxury goods worth more than \n"

In [None]:
# Generate an answer to the query
# Response mode: Tree Summarize

query_engine = vector_index.as_query_engine(response_mode="tree_summarize", similarity_top_k=2, llm=llm)

response = query_engine.query("Which sectors in Russia are specifically targeted by these sanctions?")

response.response

' The European Union formally approved on Tuesday a new barrage of sanctions against Russia for its invasion of Ukraine, which include bans on investments in the Russian energy sector, luxury goods exports and imports of steel products from Russia. The sanctions, which come into effect after publication in the EU official journal later on Tuesday, also freeze the assets of more business leaders who support the Russian state, including Chelsea football club owner Roman Abramovich. The European Commission said in a statement on Tuesday that the sanctions included \n"a far-reaching ban on new investment across the Russian energy sector with limited exceptions for civil nuclear energy and the transport of certain energy products back to the EU."The measure will hit Russia\'\n'

In [None]:
# Generate an answer to the query
# Response mode: Accumulate

query_engine = vector_index.as_query_engine(response_mode="accumulate", similarity_top_k=2, llm=llm)

response = query_engine.query("Which sectors in Russia are specifically targeted by these sanctions?")

response.response

'Response 1:  The European Union formally approved new sanctions against Russia for its invasion of Ukraine, which include bans on investments in the Russian energy sector, luxury goods exports and imports of steel products from Russia. The sanctions also freeze the assets of more business leaders who support the Russian state, including Chelsea football club owner Roman Abramovich. The measures specifically target the sectors of:\n- Energy: a far-reaching ban on new investment across the Russian energy sector with limited exceptions for civil nuclear energy and the transport of certain energy products back to the EU\n- Luxury goods: exports of cars costing more than \n```\n\n---------------------\nResponse 2:  According to the given context information, the additional 35% tariff will be applied to imports including vodka, steel, works of art and fur. The United Kingdom will also deny Russia and Belarus access to its most favoured nation trading tariff for hundreds of their exports, ef

In [None]:
# Generate an answer to the query
# Response mode: Simple Summarize

query_engine = vector_index.as_query_engine(response_mode="simple_summarize", similarity_top_k=2, llm=llm)

response = query_engine.query("Which sectors in Russia are specifically targeted by these sanctions?")

response.response

" The new sanctions against Russia target several sectors including the Russian energy sector (energy investments), luxury goods exports and imports, steel products from Russia, investments in certain business leaders who support the Russian state, civil nuclear energy, transport of certain energy products back to the EU, Russian state-owned enterprises linked to the Kremlin's military-industrial complex, luxury jewelry, cars costing more than 50,000 euros, credit rating agencies forRussia and Russian companies.\n"

All results show that the RAG system's responses are more relevant and accurate than those of the Phi-3 generator alone.

Comparing the response modes, compact, refine, tree summarize, and simple summarize generate nearly the same answer, covering the measures on energy, luxury goods, steel, and Russian assets. However, the tree summarize answer is more coherent and better structured than the others.

The accumulate mode works well for capturing details from every chunk related to the question, which is useful for retrieving related content. However, it may not suit question answering, because it answers from each chunk separately rather than combining the chunks into an overall context.

Tree Summarize was chosen for our RAG model overall because of its coherence and the elaboration in its answers. It is interesting, however, that the four responses other than accumulate differ very little from one another.




## Query Testing

In [None]:
# Run the query for question 1
# Response mode: Tree Summarize

query_engine = vector_index.as_query_engine(response_mode="tree_summarize", similarity_top_k=2, llm=llm)

response = query_engine.query("How does TuSimple's approach to autonomous trucking address the driver shortage problem highlighted by the recent e-commerce boom?")

response.response



" According to the context information provided, TuSimple's approach to autonomous trucking addresses the driver shortage problem by removing the human element on long trucking routes. The company says its trucks react 15 times faster than human drivers, and can see farther down the road, even at night. This means that there will be less need for human drivers in the industry, which is a solution to the problem of the shortage of truck drivers in the US and worldwide.\n"

In [None]:
# Run the query for question 2
# Response mode: Tree Summarize

query_engine = vector_index.as_query_engine(response_mode="tree_summarize", similarity_top_k=2, llm=llm)

response = query_engine.query("How does the partnership with General Motors contribute to the development and use of the Ironhand?")

response.response


" The partnership between Bioservo and General Motors (GM) helps to test and improve the Ironhand glove for its users. GM's ergonomics team has also helped to develop the technology, which takes into account potential concerns or physical demands that may cause medical issues for operators in the factory floor setting. \n"

In [None]:
# Run the query for question 3
# Response mode: Tree Summarize

query_engine = vector_index.as_query_engine(response_mode="tree_summarize", similarity_top_k=2, llm=llm)

response = query_engine.query("What is the primary role of the QR codes in the functioning of these robots?")

response.response


' The primary role of the QR codes in the functioning of these robots is to guide them as they move around the warehouse. They provide directions for the robots to follow and help them navigate through the space efficiently.\n'

In [None]:
# Run the query for question 4
# Response mode: Tree Summarize

query_engine = vector_index.as_query_engine(response_mode="tree_summarize", similarity_top_k=2, llm=llm)

response = query_engine.query("Why have investors started to exclude Russia from emerging market funds?")

response.response



' Investors are starting to exclude Russia from emerging market funds because of its invasion of Ukraine, which has led to the country being removed from several major US index providers at a price of zero or effectively zero. This means that trading shares of leading US-listed Russian companies such as Yandex and MTS have been halted, and the Moscow Stock Exchange has been closed since February 25th, the day after the invasion. Russia could default on its debt within days. It is still unclear what comes next for investors looking to get exposure to emerging markets in Russia. However, it seems likely that they will start looking at other emerging markets like Taiwan and South Korea to replace Russia.\n'

The generated results are more precise and correspond more closely to the original articles than those of Phi-3 alone. They also show less hallucination, indicating that our RAG system is operational and capable of incorporating domain data into the generator. Next, we further enhance the RAG system by implementing a query prompt template.

## Prompt Template


The prompt template was explicitly defined to furnish the RAG system with context, instructions, input, and output formats.

Context: The RAG system was told about the CNN news setting to establish the scenario and background for the model.

Instruction: The RAG system was instructed to answer the question, so that the model knows what it is expected to do.

Input: {query_str}. The model receives the question as input.

Output: The RAG system was given output constraints: answer only in relation to the context and question, and avoid giving false answers or making up knowledge.
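
The four parts above can be pictured as slots in a single string template. The sketch below assembles such a prompt with plain `str.format`. It is an illustration only: the wording of each part is paraphrased, the `{context_str}` slot for the retrieved text is an assumption of this sketch, and `build_prompt` is a hypothetical helper rather than part of LlamaIndex.

```python
# Fixed parts of the prompt (paraphrased, illustrative wording).
PROMPT_PARTS = {
    "context": "You are a news reader with knowledge of CNN news from 2022.",
    "instruction": "Answer the question using only the context information below.",
    "output": "If the context does not contain the answer, say you don't know.",
}

# Slots: three fixed parts, the retrieved text, and the user's question.
TEMPLATE = (
    "{context}\n"
    "{instruction}\n"
    "{output}\n"
    "Context information:\n{context_str}\n"
    "Question: {query_str}\n"
)


def build_prompt(context_str: str, query_str: str) -> str:
    """Fill the template with the fixed parts, the retrieved text, and the question."""
    return TEMPLATE.format(context_str=context_str, query_str=query_str, **PROMPT_PARTS)


print(build_prompt("EU bans steel imports.", "Which sectors are targeted?"))
```

Keeping the fixed parts separate from the per-query slots makes it easy to adjust the instruction or output constraints without touching the retrieval step.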


In [None]:
# Improve the query with a prompt template that gives
# context, instructions, and an output format to the RAG system


from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate

qa_prompt_str = (
    "You are a news reader with knowledge from the CNN news channel during 2022.\n"
    "Considering the role of a newsreader, you are assisting the user in giving a correct answer about what's happened based on the news articles you already know.\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

# Text QA Prompt
chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "Please provide only answers that relate to the context and question, do not explain or question anything beyond the context and question.\n"
            "If you don't know the answer, please don't make it up or give any false explanation.\n"
        ),
    ),
    ChatMessage(role=MessageRole.USER, content=qa_prompt_str),
]

text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)

In [None]:
# Run query with prompt template

print(
    vector_index.as_query_engine(response_mode="tree_summarize",
                                 similarity_top_k=2,
                                 text_qa_template=text_qa_template,
                                 llm=llm,
    ).query("Which sectors in Russia are specifically targeted by these sanctions?")
)



 According to the provided context information, the European Union has approved a new barrage of sanctions against Russia for its invasion of Ukraine. The sanctions include a ban on investments in the Russian energy sector, luxury goods exports and imports of steel products from Russia. Additionally, there will be a total ban on transactions with some Russian state-owned enterprises linked to the Kremlin's military-industrial complex. 

The sectors targeted by these sanctions are:
1. The Russian energy sector - specifically investments in this sector will be banned.
2. Luxury goods exports and imports of steel products from Russia.
3. Transactions with some Russian state-owned enterprises linked to the Kremlin's military-industrial complex.



With the prompt template, the RAG system generates responses that are directly relevant to the question and does not add explanations or information beyond what was asked.

# Building Larger Model
Next, a new model is built using a larger CNN dataset from 2022 that contains 299 articles.

In [None]:
# Load the dataset
file_path = 'CNN_Articles_2022.csv'
df = pd.read_csv(file_path)

In [None]:
# Write each news article to its own .txt file for chunking

!mkdir -p '/content/fulldata/rawtext/' # create the "rawtext" directory if it does not exist

articles = list(df['Article_text'])

for count, article in enumerate(articles): # iterate through the articles
  fname = "/content/fulldata/rawtext/Output" + str(count) + ".txt"
  with open(fname, "w") as text_file:
    text_file.write(article) # save the article text

In [None]:
# Load documents
reader = SimpleDirectoryReader("/content/fulldata/rawtext") # load documents from the rawtext folder
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Loaded 299 docs
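The export step assumes every row has article text. With a different CSV, an empty cell would be read by pandas as NaN (a float), and writing it to a file would raise a TypeError. The following is a minimal sketch of a more defensive export using only the standard library; `export_articles`, the sample list, and the temporary directory are hypothetical stand-ins for `df['Article_text']` and `/content/fulldata/rawtext/`.

```python
import tempfile
from pathlib import Path


def export_articles(articles, out_dir):
    """Write each non-empty string to Output<index>.txt; return files written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for index, article in enumerate(articles):
        # Skip missing values (e.g. NaN floats) and blank strings.
        if not isinstance(article, str) or not article.strip():
            continue
        (out_dir / f"Output{index}.txt").write_text(article, encoding="utf-8")
        written += 1
    return written


# Invented sample data: the middle entry mimics a missing cell.
sample = ["First article text.", float("nan"), "Third article text."]
with tempfile.TemporaryDirectory() as tmp:
    count = export_articles(sample, tmp)
    names = sorted(p.name for p in Path(tmp).iterdir())
print(count, names)
```

Keeping the original row index in the file name makes it possible to trace a chunk back to its row in the CSV even when some rows are skipped.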


In [None]:
# Chunking process (Processing time 7.5 min on GPU)

parser = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=90, embed_model=embed_model
)

semantic_CNN_full_nodes = parser.get_nodes_from_documents(docs)


In [None]:
!mkdir -p '/content/fulldata/chunkdata/' # create the "chunkdata" directory if it does not exist

for count, chunk in enumerate(semantic_CNN_full_nodes): # iterate through the chunks
  fname = "/content/fulldata/chunkdata/Chunk_Output" + str(count) + ".txt"
  with open(fname, "w") as text_file:
    text_file.write(chunk.text) # save the chunk text

In [None]:
# See number of chunks generated
reader = SimpleDirectoryReader("/content/fulldata/chunkdata") # load chunks from the chunkdata folder
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

Loaded 1335 docs
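The 299 articles produce 1335 chunks, or roughly 4.5 chunks per article. The general idea behind a percentile breakpoint can be sketched as follows: measure the distance between neighbouring sentence embeddings and start a new chunk wherever the distance exceeds the chosen percentile of all distances. This is a simplified illustration with invented 2-D vectors, not the SemanticSplitterNodeParser implementation; `cosine_distance`, `percentile`, and `split_by_breakpoints` are hypothetical helpers.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def split_by_breakpoints(sentences, vectors, pct=90):
    """Start a new chunk wherever the neighbour distance exceeds the percentile."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Invented toy data: three sentences on one topic, then three on another.
sentences = ["A.", "B.", "C.", "D.", "E.", "F."]
vectors = [[1, 0], [0.99, 0.1], [0.98, 0.15], [0, 1], [0.1, 0.99], [0.12, 0.98]]
chunks = split_by_breakpoints(sentences, vectors, pct=90)
print(chunks)
```

In this sketch a lower percentile produces more, smaller chunks, and a higher one produces fewer, larger chunks.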


In [None]:
# Create Vector Database (Processing time 4 min)

# Load documents
reader = SimpleDirectoryReader("/content/fulldata/chunkdata") # load chunks from the chunk data folder
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

# Create client ("db") and a database ("chroma_db")
db = chromadb.PersistentClient(path="./chroma_db")

# Create a collection/table ("full_CNN_news") in the db
chroma_collection = db.create_collection("full_CNN_news")

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Specify Chroma as our vector db
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the vector index
vector_index = VectorStoreIndex.from_documents(
    docs, # the chunk documents loaded above
    storage_context = storage_context,
    embed_model = embed_model
)

# Print the metadata
print(chroma_collection)

# Print the name of the collection (table)
print(f'Collection name is: {chroma_collection.name}')

Loaded 1335 docs
name='full_CNN_news' id=UUID('fba843ca-2458-43e5-9d81-824ba7b0da8f') metadata=None tenant='default_tenant' database='default_database'
Collection name is: full_CNN_news


A reranking step was also applied to improve the model's performance, using the BAAI/bge-reranker-base model from the same BAAI BGE family as the embedding model.
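The two-stage idea can be sketched with toy scorers: a cheap first pass keeps a handful of candidates (the role of similarity_top_k), and a second scorer reorders them and keeps only the best few (the role of top_n). Everything below is an assumption made for illustration: word overlap stands in for embedding similarity, bigram overlap stands in for the reranker, and the chunk texts are invented.

```python
def retrieve(query, chunks, top_k):
    """Stage 1: rank chunks by shared words and keep the top_k."""
    query_terms = set(query.lower().split())
    scored = [(len(query_terms & set(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def bigrams(text):
    """Set of adjacent word pairs in the text."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def rerank(query, candidates, top_n):
    """Stage 2: rescore the candidates by shared word pairs and keep the top_n."""
    ranked = sorted(
        candidates,
        key=lambda c: len(bigrams(query) & bigrams(c)),
        reverse=True,
    )
    return ranked[:top_n]


# Invented toy chunks, loosely modelled on the topics queried in this notebook.
chunks = [
    "EU sanctions target the Russian energy sector",
    "Russian steel imports are banned under EU sanctions",
    "Investors exclude Russia from emerging market funds",
    "TuSimple tests autonomous trucks in Arizona",
    "Hong Kong warehouse robots read QR codes",
    "Luxury goods exports to Russia are banned",
]
query = "which sectors do EU sanctions target"

candidates = retrieve(query, chunks, top_k=5)
reranked = rerank(query, candidates, top_n=2)
print(reranked)
```

The second stage only sees what the first stage passed on, so top_n can never recover a relevant chunk that similarity_top_k already dropped.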

In [None]:
# Load the reranking model

from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=2)





In [None]:
# Run query with Reranking k = 2

query_engine = vector_index.as_query_engine(response_mode="tree_summarize",
                                            similarity_top_k=5,
                                            text_qa_template=text_qa_template,
                                            llm=llm,
                                            node_postprocessors=[rerank]
                                            )

response = query_engine.query("Which sectors in Russia are specifically targeted by these sanctions?")

response.response

" The European Union has imposed a number of sanctions on Russia's energy sector, luxury goods exports, imports of steel products from Russia, as well as investments in Russian state-owned enterprises linked to the Kremlin's military-industrial complex. In addition, credit rating agencies for Russia and Russian companies have been barred from issuing ratings for them, which will further limit their access to European financial markets. The ban on luxury goods exports and imports of steel products is estimated to affect around  3.3 billion euros worth of products. Additionally, the ban on new investments in the Russian energy sector includes civil nuclear energy and the transport of certain energy products back to Europe.\n"

In [None]:
# Run query with Reranking k = 3

rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=3)

query_engine = vector_index.as_query_engine(response_mode="tree_summarize",
                                            similarity_top_k=5,
                                            text_qa_template=text_qa_template,
                                            llm=llm,
                                            node_postprocessors=[rerank]
                                            )

response = query_engine.query("Which sectors in Russia are specifically targeted by these sanctions?")

response.response

" The European Union has imposed a set of sanctions on Russia, which include bans on investments in the Russian energy sector, luxury goods exports, imports of steel products from Russia, and the assets of business leaders who support the Russian state. Additionally, there will be restrictions on transactions with some Russian state-owned enterprises linked to the Kremlin's military-industrial complex. The ban on Russian steel imports is estimated to affect 3.3 billion euros worth of products. EU countries are still able to buy oil and gas from Russia's oil majors such as Rosneft, Transneft, and Gazprom Neft, but there will be limited exceptions for civil nuclear energy and the transport of certain energy products back to the EU. The ban on luxury goods exports includes jewelry, cars costing more than 50,000 euros, and other luxury items. Credit rating agencies are also prohibited from issuing ratings for Russia or Russian companies, which could restrict their access to European financ

In [None]:
# Run query with Reranking k = 4

rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=4)

query_engine = vector_index.as_query_engine(response_mode="tree_summarize",
                                            similarity_top_k=5,
                                            text_qa_template=text_qa_template,
                                            llm=llm,
                                            node_postprocessors=[rerank]
                                            )

response = query_engine.query("Which sectors in Russia are specifically targeted by these sanctions?")

response.response

" Based on the context information provided, it is unclear which specific sectors in Russia are targeted by these sanctions. The European Commission has included a ban on investments in the Russian energy sector and luxury goods exports from Russia as well as imports of steel products from Russia. There is also a total ban on transactions with some Russian state-owned enterprises linked to the Kremlin's military-industrial complex, which includes a ban on new investment across the Russian energy sector with limited exceptions for civil nuclear energy and the transport of certain energy products back to the EU.\n"

## Query Pipeline

The pipeline was set up using the QueryPipeline class from the LlamaIndex library. It chains the prompt template, the generator (LLM), and the query engine in sequence, so each module's output becomes the next module's input.
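A sequential chain is function composition: the first stage receives the raw query and every later stage receives whatever the previous stage returned. The sketch below uses stub functions in place of the real modules to make the data flow visible; `run_chain`, `traced`, and the three stage functions are hypothetical stand-ins, not LlamaIndex APIs.

```python
from functools import reduce

trace = []  # records (stage name, input) pairs in execution order


def traced(name, fn):
    """Wrap a stage so that its input is recorded before it runs."""
    def wrapper(value):
        trace.append((name, value))
        return fn(value)
    return wrapper


def run_chain(stages, value):
    """Feed the value through each stage in order."""
    return reduce(lambda acc, stage: stage(acc), stages, value)


# Hypothetical stubs for the prompt template, the LLM, and the query engine.
def prompt_stage(query_str):
    return f"answer the question: {query_str}"


def llm_stage(prompt):
    return f"assistant: draft answer to [{prompt}]"


def engine_stage(text):
    return f"summary of chunks retrieved for [{text}]"


chain = [
    traced("prompt", prompt_stage),
    traced("llm", llm_stage),
    traced("query_engine", engine_stage),
]
result = run_chain(chain, "Which sectors are targeted?")

for name, value in trace:
    print(f"{name} <- {value}")
```

With this ordering the last stage receives the LLM's text rather than the original question, which matches the verbose log in this notebook where the final module's input begins with "assistant:".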

In [None]:
# Define reranking for pipeline
rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=2)

# Define query engine for the pipeline without specifying the LLM or prompt template
query_engine = vector_index.as_query_engine(response_mode="tree_summarize",
                                            similarity_top_k=5,
                                            node_postprocessors=[rerank]
                                            )

# Define prompt template

qa_prompt_str = (
    "You are a news reader with knowledge from the CNN news channel during 2022.\n"
    "Considering the role of a newsreader, you are assisting the user in giving a correct answer about what's happened based on the news articles you already know.\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

# Text QA Prompt
chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "Please provide only answers that relate to the context and question, do not explain or question anything beyond the context and question.\n"
            "If you don't know the answer, please don't make it up or give any false explanation.\n"
        ),
    ),
    ChatMessage(role=MessageRole.USER, content=qa_prompt_str),
]

text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)

The pipeline was tested by running five queries through it.

In [None]:
# Create query pipeline using chain

from llama_index.core.query_pipeline import QueryPipeline

# Sequential chain: prompt template -> LLM -> query engine
p = QueryPipeline(chain=[text_qa_template, llm, query_engine], verbose=True)

# Give multiple queries in the pipeline
queries = [
    "How does TuSimple's approach to autonomous trucking address the driver shortage problem highlighted by the recent e-commerce boom?",
    "How does the partnership with General Motors contribute to the development and use of the Ironhand?",
    "What is the primary role of the QR codes in the functioning of robots in Hong Kong warehouse?",
    "Why have investors started to exclude Russia from emerging market funds?",
    "Which sectors in Russia are specifically targeted by these sanctions?"
]

# run pipeline
for query in queries:
    result = p.run(query_str=query)
    print(result)  # Print or process the result as needed

[1;3;38;2;155;135;227m> Running module 3f6a7950-a9f6-43df-bf25-676acdf8b4ee with input: 
query_str: How does TuSimple's approach to autonomous trucking address the driver shortage problem highlighted by the recent e-commerce boom?

[0m[1;3;38;2;155;135;227m> Running module 96340f6f-bb2f-4792-8fed-364f54e3420d with input: 
messages: [ChatMessage(role=<MessageRole.SYSTEM: 'system'>, content="Please give only answer that relate to the context and question, do not explain and question anything beyond the giving context and question....

[0m[1;3;38;2;155;135;227m> Running module 9219625a-b1e4-4253-b658-5e7acb58dab1 with input: 
input: assistant:  According to CNN's report, TuSimple is leveraging artificial intelligence (AI) to solve the issue of a significant labor shortage in the trucking industry. As the demand for online shoppin...

[0m According to CNN's report, TuSimple, an artificial intelligence (AI) company, is leveraging AI to solve the issue of a significant labor shortage i

This query pipeline can process multiple queries, but compared with the earlier system built on only five articles, it returns some incorrect information. For example, for questions 2, 3, and 4 it mentions processes or entities that do not appear in the original articles. A likely cause is that the larger dataset raises the chance of retrieving a semantically similar chunk from a different, unrelated article.

Next, the reranking hyperparameter is varied to observe how changing the number of chunks kept (K) affects the results.
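Checking answers against the source articles by hand does not scale well. A rough heuristic, sketched below, is to flag capitalized terms in an answer that never appear in the retrieved chunks. This is an assumption-laden illustration rather than the evaluation used in this notebook: capitalized words are only a crude proxy for named entities (sentence-initial words are flagged too), `unsupported_terms` is a hypothetical helper, and the sample text, including the name "Acme", is invented.

```python
import re


def capitalized_terms(text):
    """Capitalized words, a rough stand-in for named entities."""
    return set(re.findall(r"\b[A-Z][a-zA-Z]+\b", text))


def unsupported_terms(answer, source_chunks):
    """Capitalized terms in the answer that appear in none of the source chunks."""
    source_terms = set()
    for chunk in source_chunks:
        source_terms |= capitalized_terms(chunk)
    return capitalized_terms(answer) - source_terms


# Invented example: "Acme" is not mentioned in the source chunk.
source_chunks = ["TuSimple is testing self-driving trucks in Arizona."]
answer = "TuSimple partnered with Acme to test trucks in Arizona."

flagged = unsupported_terms(answer, source_chunks)
print(flagged)
```

A flagged term is only a prompt for manual review, since paraphrases and common capitalized words can produce false positives.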


### Reranking hyperparameter exploration

In [None]:
# Try rerank top_n = 3 and see the results

rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=3)

# Define query engine for the pipeline without specifying the LLM or prompt template
query_engine = vector_index.as_query_engine(response_mode="tree_summarize",
                                            similarity_top_k=5,
                                            node_postprocessors=[rerank]
                                            )


In [None]:
# Sequential chain
p = QueryPipeline(chain=[text_qa_template, llm, query_engine], verbose=True)

# Give multiple queries in the pipeline
queries = [
    "How does TuSimple's approach to autonomous trucking address the driver shortage problem highlighted by the recent e-commerce boom?",
    "How does the partnership with General Motors contribute to the development and use of the Ironhand?",
    "What is the primary role of the QR codes in the functioning of robots in Hong Kong warehouse?",
    "Why have investors started to exclude Russia from emerging market funds?",
    "Which sectors in Russia are specifically targeted by these sanctions?"
]

# run pipeline
for query in queries:
    result = p.run(query_str=query)
    print(result)  # Print or process the result as needed

> Running module 8e065e94-f6ef-4a5d-b1e7-43e12cf5c6ff with input: 
query_str: How does TuSimple's approach to autonomous trucking address the driver shortage problem highlighted by the recent e-commerce boom?

> Running module f679971b-f934-43f6-a0ad-9927e90dd6df with input: 
messages: [ChatMessage(role=<MessageRole.SYSTEM: 'system'>, content="Please give only answer that relate to the context and question, do not explain and question anything beyond the giving context and question....

> Running module 1f1fe57d-18a4-4ea0-84df-d4a4952e5e58 with input: 
input: assistant: 
As a newscaster with knowledge from CNN news channel during 2021 to 2022, I can help you understand how TuSimple's approach to autonomous trucking addresses the driver shortage problem.

T...

 TuSimple'S approach to autonomous trucking addresses the driver shortage problem by replacing human drivers with self-driving trucks. With the rise

In [None]:
# Try rerank top_n = 4 and see the results

rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=4)

# Define query engine for the pipeline without specifying the LLM or prompt template
query_engine = vector_index.as_query_engine(response_mode="tree_summarize",
                                            similarity_top_k=5,
                                            node_postprocessors=[rerank]
                                            )


In [None]:
# Sequential chain
p = QueryPipeline(chain=[text_qa_template, llm, query_engine], verbose=True)

# Give multiple queries in the pipeline
queries = [
    "How does TuSimple's approach to autonomous trucking address the driver shortage problem highlighted by the recent e-commerce boom?",
    "How does the partnership with General Motors contribute to the development and use of the Ironhand?",
    "What is the primary role of the QR codes in the functioning of robots in Hong Kong warehouse?",
    "Why have investors started to exclude Russia from emerging market funds?",
    "Which sectors in Russia are specifically targeted by these sanctions?"
]

# run pipeline
for query in queries:
    result = p.run(query_str=query)
    print(result)  # Print or process the result as needed

> Running module b346488c-4fe7-4e4b-b506-d5d7af9fc3e4 with input: 
query_str: How does TuSimple's approach to autonomous trucking address the driver shortage problem highlighted by the recent e-commerce boom?

> Running module d82768dd-fa29-42f7-aae6-0cdc864878fb with input: 
messages: [ChatMessage(role=<MessageRole.SYSTEM: 'system'>, content="Please give only answer that relate to the context and question, do not explain and question anything beyond the giving context and question....

> Running module 30763114-4736-4260-bb6f-c4911305c1fe with input: 
input: assistant:  "TuSimple's approach to autonomous trucking addresses the driver shortage problem highlighted by the recent e-commerce boom by replacing human drivers with self-driving trucks. With the in...

 According to the given context, TuSimple is addressing the driver shortage problem by utilizing self-driving trucks. These trucks eliminate the nee

The results of varying K during reranking indicate that as K increases, the model tends to hallucinate more. For example, for questions 3, 4, and 5, the model gives responses that over-explain and extend beyond the articles. With more data indexed and a larger K, more unrelated chunks can be retrieved, which leads to more unrelated answers. The choice of K is therefore a crucial factor to consider when improving this model in the future.

# Conclusions

In conclusion, the RAG model successfully combines Phi-3's generative power with current news. The same approach can be used whenever users want to feed their own up-to-date knowledge into the model. However, because of the computational limits of Google Colab, several directions were not investigated in depth in this report. Further improvement is needed to reduce incorrect answers and hallucinations, including:

*   Expand the news dataset to cover a longer period than 2022.
*   Explore hyperparameters suited to the larger dataset, such as the semantic chunking threshold and the number of top K results in the retrieval and reranking stages.
*   Study the effect of different response modes on different types of queries, such as Q&A or summarization queries.