## Setup

Load environment variables, such as the `OPENAI_API_KEY` that the LLM step below relies on, from your local `.env` file.

In [9]:
from dotenv import load_dotenv
load_dotenv()

True
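It can help to fail early if a required key is missing, rather than at the first LLM call. The following is a minimal sketch, assuming `OPENAI_API_KEY` is the only variable this notebook needs; `missing_env_vars` is a hypothetical helper, not part of `python-dotenv`.

```python
import os

def missing_env_vars(names):
    """Return the names that are unset or empty in the current environment."""
    return [name for name in names if not os.environ.get(name)]

# Assumption: OPENAI_API_KEY is the only variable this notebook needs.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```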

## Download Data Set

Download the English portion of the [XL-Sum BBC news article dataset from Hugging Face](https://huggingface.co/datasets/csebuetnlp/xlsum) and preview one record.

In [1]:
from datasets import load_dataset
dataset = load_dataset("csebuetnlp/xlsum", "english", split="train")
dataset[15]

{'id': 'uk-scotland-highlands-islands-51206457',
 'url': 'https://www.bbc.com/news/uk-scotland-highlands-islands-51206457',
 'title': 'New virtual reality experience of Scottish waters',
 'summary': "Scotland's opportunities for sailing and boating on rivers, lochs and seas are being promoted in a new campaign.",
 'text': "A series of 360 degree virtual reality videos have been produced as part of #MustSeaScotland. St Kilda, Islay, Skye and Inverness Marina are among the locations featured. Sail Scotland has created the campaign with other organisations, including the National Trust for Scotland and VisitScotland. The campaign comes during Scotland's Year of Coasts and Waters 2020. All images are the copyright of Airborne Lens."}

## Store Data in Vector Database

Instantiate an in-memory [ChromaDB](https://www.trychroma.com/) vector database client and create a new, empty collection.

In [3]:
import chromadb
chroma_client = chromadb.Client()
# chroma_client.delete_collection(name="bbc_news_articles")
collection = chroma_client.create_collection(name="bbc_news_articles")

Add news articles to the vector database. Since we aren't providing our own custom embeddings, ChromaDB uses the [Sentence Transformers](https://www.sbert.net/) `all-MiniLM-L6-v2` model to create the embeddings automatically.

In [4]:
import time

number_of_articles = 50

# Select the subset first so metadata is built for 50 rows, not the whole dataset
subset = dataset.select(range(number_of_articles))

metadata = [
    {
        "title": x["title"],
        "summary": x["summary"],
    } for x in subset
]

# Time how long it takes to embed and store the articles
t0 = time.time()
collection.add(
    ids=list(subset["id"]),
    documents=list(subset["text"]),
    metadatas=metadata,
)
time.time() - t0

1.8107349872589111
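Embedding 50 articles took about 1.8 seconds here. When loading many more, it may be worth adding them in batches, since a single very large `add` call can be slow and the client may cap how many records one call accepts. The following is a minimal sketch of the batching step only; `batched` is a hypothetical helper and the lists are stand-ins for the real ids and texts.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in lists; in the notebook these would be the article ids and texts.
ids = [f"id-{i}" for i in range(7)]
texts = [f"text {i}" for i in range(7)]

for id_batch, text_batch in zip(batched(ids, 3), batched(texts, 3)):
    # In the notebook, this is where collection.add(ids=id_batch,
    # documents=text_batch, ...) would be called for each batch.
    print(len(id_batch), id_batch[0])
```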

## Query Large Language Model

In [17]:
from langchain.llms import OpenAI
from langchain import PromptTemplate, LLMChain

llm = OpenAI()

Create the prompt template.

In [20]:
template = """Question: {question}

Here are some relevant news articles that you can use to help you answer the question. Articles are separated by the text <article separator>:
{relevant_data}"""

prompt = PromptTemplate(template=template, input_variables=["question", "relevant_data"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
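To see what the LLM will actually receive, it can be useful to render a template by hand. The sketch below uses plain `str.format` on a shortened stand-in template, on the assumption that the default `PromptTemplate` formatting substitutes `{placeholders}` in the same way; the question and article strings are made up for illustration.

```python
# Shortened stand-in for the notebook's template, with the same two placeholders.
template = "Question: {question}\n\nArticles:\n{relevant_data}"

# Made-up articles, joined with the same separator the notebook uses.
articles = ["First article text.", "Second article text."]
relevant_data = "\n<article separator>\n".join(articles)

filled = template.format(question="What happened?", relevant_data=relevant_data)
print(filled)
```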

Take the user's question, look up relevant news articles in the vector database, and insert both into the prompt sent to the LLM.

In [28]:
question = "What is a recent natural disaster that had a very high death toll? I'm looking for a natural disaster."

# Query vector DB
vector_db_results = collection.query(
    query_texts=[question],
    n_results=3
)
relevant_data = "\n<article separator>\n".join(vector_db_results["documents"][0])
print(relevant_data)

It was July 1990, and rebel fighters were advancing on the capital, Monrovia. President Samuel Doe was holed up in his vast, gloomy Executive Mansion. After dark bands of soldiers roamed the streets, looting shops and warehouses and seeking out people from Nimba County, the area where the rebellion had started. They dragged the men from their homes, beating and often killing them. Hundreds of terrified families, looking for a safer place to sleep, took refuge in St Peter's Lutheran Church - a spacious building in a walled compound. Huge Red Cross flags flew at every corner. But on the night of 29 July, government soldiers came over the wall and started killing those inside. An estimated 600 people - men, women, children, even babies - were shot or hacked to death with machetes before the order was given to stop. A Guinean woman doctor, who was one of the first to reach the church the next day, described to me the scene of utter horror. Dead bodies were everywhere. The only sign of life
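Conceptually, the query embeds the question and ranks the stored article embeddings by how close they are to it. The toy sketch below illustrates that ranking with exact cosine similarity on hand-written 2-D vectors. It is not how ChromaDB is implemented: the real embeddings have many more dimensions, the index is approximate, and the distance metric depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to query_vec."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Made-up "embeddings" for three documents and one query.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [0.9, 0.1]
print(top_k(query_vec, doc_vecs, k=2))
```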

In [30]:
llm_chain.run(question=question, relevant_data=relevant_data)

'\n\nOne of the most recent natural disasters with a very high death toll was the 2020 Beirut explosion. On August 4, 2020, a massive explosion occurred in Beirut, Lebanon, killing over 200 people and injuring thousands more. The blast was caused by the detonation of more than 2,750 tons of ammonium nitrate, stored unsafely in a warehouse in the port of Beirut. The blast caused widespread destruction and devastation throughout the city, causing an estimated $15 billion in damage.'
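The query returns the `n_results` closest articles among the 50 stored, whether or not any of them is really on topic. One possible guard is to drop matches whose distance exceeds a cutoff before building the prompt. The sketch below assumes the query result is a dictionary with parallel `"documents"` and `"distances"` lists per query; `filter_by_distance` is a hypothetical helper, and the data and the cutoff of 1.5 are made up, so a real threshold would need tuning.

```python
def filter_by_distance(results, max_distance):
    """Keep documents from the first query whose distance is within max_distance."""
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]

# Hand-written stand-in shaped like a query result; the numbers are made up.
fake_results = {
    "documents": [["article A", "article B", "article C"]],
    "distances": [[0.8, 1.3, 1.9]],
}

kept = filter_by_distance(fake_results, max_distance=1.5)
print(kept)
```

If nothing survives the filter, the prompt could state that no relevant articles were found instead of passing unrelated text to the LLM.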