#RetrievalQA vs pure LLMs - comparing capabilities

RAG (retrieval-augmented generation) is an antidote to the hallucinations of contemporary LLMs. However, RAG is only as good as the underlying embeddings and search mechanism. We test the capabilities of the LangChain RetrievalQA chain combined with Llama2-13B and popular SentenceTransformers embeddings on two datasets:
- a relatively straightforward dataset of book releases
- an intricate dataset of cosmology articles

Both datasets were scraped from the latest articles and data available in September 2023 in order to test the LLM's behaviour on data it has not seen.

In [1]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0


#1. Get data about the latest book releases

This is a small collection (105 samples) of September novel releases scraped from [goodreads.com](https://www.goodreads.com/book/popular_by_date/2023/9?ref=nav_brws_newrels).

In [2]:
# load CSV data from Drive into a DataFrame, then convert it to a Dataset
from datasets import Dataset
import pandas as pd


books_df = pd.read_csv("/content/drive/MyDrive/ML/data/book_releases_september.csv")
books_ds = Dataset.from_pandas(books_df)
len(books_ds), books_ds[50]

(105,
 {'id': 'kca://book/amzn1.gr.book.v3.SfQ5kz2CwD7dIfuz',
  'title': 'Witch of Wild Things',
  'description': 'Legend goes that long ago a Flores woman offended the old gods, and their family was cursed as a result. Now, every woman born to the family has a touch of magic.\xa0Sage Flores has been running from her family—and their “gifts”—ever since her younger sister Sky died. Eight years later, Sage reluctantly returns to her hometown. Like slipping into an old, comforting sweater, Sage takes back her job at Cranberry Rose Company and uses her ability to communicate with plants to discover unusual heritage specimens in the surrounding lands.What should be a simple task is complicated by her partner in botany sleuthing: Tennessee Reyes. He broke her heart in high school, and she never fully recovered. Working together is reminding her of all their past tender, genuine moments—and new feelings for this mature sexy man are starting to take root in her heart.With rare plants to find, 

In [31]:
len(books_df[books_df['description'].isnull()]), len(books_df[books_df['title'].isnull()]), len(books_df[books_df['author'].isnull()])

(2, 0, 0)

In [32]:
# Replacing NaN values with an empty string
books_df.fillna('', inplace=True)

In [33]:
len(books_df[books_df['description'].isnull()])

0

#2. Collect dataset from NASA ADS

Collect a dataset of cosmology articles from the last month using the NASA ADS database.

Script to fetch data from NASA ADS using their API:

In [None]:
paper_list[1]


{'bibcode': '2023LRR....26....2A',
 'abstract': "The Laser Interferometer Space Antenna (LISA) will be a transformative experiment for gravitational wave astronomy, and, as such, it will offer unique opportunities to address many key astrophysical questions in a completely novel way. The synergy with ground-based and space-born instruments in the electromagnetic domain, by enabling multi-messenger observations, will add further to the discovery potential of LISA. The next decade is crucial to prepare the astrophysical community for LISA's first observations. This review outlines the extensive landscape of astrophysical theory, numerical simulations, and astronomical observations that are instrumental for modeling and interpreting the upcoming LISA datastream. To this aim, the current knowledge in three main source classes for LISA is reviewed; ultra-compact stellar-mass binaries, massive black hole binaries, and extreme or interme-diate mass ratio inspirals. The relevant astrophysical 

In [None]:
import os  # needed for os.environ below
import re
import requests
from datetime import datetime, timedelta
import pandas as pd
from bs4 import BeautifulSoup


api_key = os.environ.get('ADS_API_KEY')
headers = {
    "Authorization": f"Bearer {api_key}",
}

# Calculate the date one month (30 days) before today
start_date = datetime.now() - timedelta(days=30)
start_date_str = start_date.strftime('%Y-%m-%d')

# Construct the API endpoint and query parameters
endpoint = "https://api.adsabs.harvard.edu/v1/search/query?"
query = f"q=collection:astronomy+AND+entdate:[{start_date_str}+TO+*]&fl=title,abstract,bibcode,pubdate,arxiv_eprints,doi&rows=200"

# Make the API request
response = requests.get(endpoint + query, headers=headers)

# Function to retrieve the full text content from arXiv or DOI link
def get_full_text(arxiv_ids, doi):
    content = ""
    if arxiv_ids:
        arxiv_id = arxiv_ids[0]
        url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
        response = requests.get(url)
        if response.status_code == 200:
            print(f'Downloading arXiv PDF {arxiv_id}')
            content = response.content[:20000]  # Keep the first 20000 bytes of the raw PDF (bytes, not words)
    elif doi:
        doi = doi[0] if isinstance(doi, list) else doi
        url = f"https://doi.org/{doi}"
        response = requests.get(url)
        if response.status_code == 200:
            print(f'Downloading doi {doi}')
            soup = BeautifulSoup(response.content, 'html.parser')
            content = " ".join(soup.stripped_strings)[:10000]  # Keep the first 10000 characters (not words) of the page text
    return content

# Check the response
if response.status_code == 200:
    # Get the list of papers
    papers = response.json().get('response', {}).get('docs', [])

    paper_data = []
    for i, paper in enumerate(papers, 1):
        if i % 100 == 0:
            print(f"Processed {i} papers")

        title = paper.get('title', ['No Title'])[0]
        abstract = paper.get('abstract', 'No Abstract')
        bibcode = paper.get('bibcode', 'No Bibcode')
        pubdate = paper.get('pubdate', 'No Pubdate')
        arxiv_ids = paper.get('arxiv_eprints')
        doi = paper.get('doi')

        full_text_content = get_full_text(arxiv_ids, doi)

        paper_data.append({
            "Title": title,
            "Abstract": abstract,
            "Bibcode": bibcode,
            "Publication Date": pubdate,
            "Full Text Content (first 10000 words)": full_text_content
        })

    # Creating a DataFrame from the list of dictionaries
    df = pd.DataFrame(paper_data)

    # Display the first few rows of the DataFrame
    print(df.head())
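
The request above asks for a single page of 200 rows, while the saved dataset holds 1200 records, so the results were presumably collected over several requests (the notebook does not show how). Below is a minimal sketch of one way to page through results, assuming the Solr-style `start` and `rows` parameters of the ADS search API. `page_params` and `total_wanted` are hypothetical names introduced for illustration:

```python
def page_params(query, fields, rows=200, total_wanted=1200):
    """Yield one dict of query parameters per page of results."""
    for start in range(0, total_wanted, rows):
        yield {
            "q": query,
            "fl": ",".join(fields),
            # the last page may be shorter than a full page
            "rows": min(rows, total_wanted - start),
            "start": start,
        }

pages = list(page_params(
    "collection:astronomy",
    ["title", "abstract", "bibcode"],
))
print(len(pages), pages[-1]["start"])
```

Each dict could then be passed as `params=` to `requests.get(...)` together with the same `headers`, and the returned `docs` lists concatenated.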



In [None]:
df.head()

Unnamed: 0,Title,Abstract,Bibcode,Publication Date,Full Text Content (first 10000 words)
0,A CEERS Discovery of an Accreting Supermassive...,We report the discovery of an accreting superm...,2023ApJ...953L..29L,2023-08-00,
1,COSMOS-Web: An Overview of the JWST Cosmic Ori...,"We present the survey design, implementation, ...",2023ApJ...954...31C,2023-09-00,
2,The Eighteenth Data Release of the Sloan Digit...,The eighteenth data release (DR18) of the Sloa...,2023ApJS..267...44A,2023-08-00,
3,The FLAMINGO project: cosmological hydrodynami...,We introduce the Virgo Consortium's FLAMINGO s...,2023MNRAS.tmp.2384S,2023-08-00,
4,Hidden Little Monsters: Spectroscopic Identifi...,We report on the discovery of two low-luminosi...,2023ApJ...954L...4K,2023-09-00,


In [None]:
df.to_csv("/content/drive/MyDrive/ML/data/nasa_ads_cosmology_full.csv")


##2.1. Load cosmology data

In [3]:
# load CSV data from Drive into a DataFrame, then convert it to a Dataset
from datasets import Dataset
import pandas as pd


nasa_ads_df = pd.read_csv("/content/drive/MyDrive/ML/data/nasa_ads_cosmology_full.csv")
nasa_ads_ds = Dataset.from_pandas(nasa_ads_df)
len(nasa_ads_ds), nasa_ads_ds[180]


(1200,
 {'Unnamed: 0': 180,
  'Title': 'The energy distribution of the first supernovae',
  'Abstract': 'The nature of the first Pop III stars is still a mystery and the energy distribution of the first supernovae is completely unexplored. For the first time we account simultaneously for the unknown initial mass function (IMF), stellar mixing, and energy distribution function (EDF) of Pop III stars in the context of a cosmological model for the formation of a MW-analogue. Our data-calibrated semi-analytic model is based on a N-body simulation and follows the formation and evolution of both Pop III and Pop II/I stars in their proper time-scales. We discover degeneracies between the adopted Pop III unknowns, in the predicted metallicity and carbonicity distribution functions and the fraction of C-enhanced stars. None the less, we are able to provide the first available constraints on the EDF, $dN/dE_\\star \\propto E_{\\star }^{-\\alpha _e}$ with 1 ≤ α<SUB>e</SUB> ≤ 2.5. In addition, the

All records have a bibcode, and each one is unique:

In [22]:
len(nasa_ads_df[nasa_ads_df['Bibcode'].isnull()]), nasa_ads_df['Bibcode'].nunique()

(0, 1200)

#3. Get Llama2-13B and Embeddings

##3.1. Llama2-13B

We load Llama2-13B and fit it into 15 GB of memory on a single device with the help of 4-bit bitsandbytes quantization.
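
As a rough back-of-the-envelope check of why 4-bit quantization matters here, the sketch below estimates the memory taken by the weights alone. It assumes a round 13 billion parameters and ignores activations, the KV cache, and quantization overhead, so the numbers are only indicative:

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory needed for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 13e9  # assumption: a round 13 billion parameters

fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit quantized weights
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights (about 26 GB) would not fit into 15 GB, while the 4-bit weights (about 6.5 GB) leave headroom for everything else.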

In [None]:
%%time
import os  # needed for os.environ below
from torch import cuda, bfloat16
import transformers

model_name = "meta-llama/Llama-2-13b-chat-hf"
# set quantization configuration using bitsandbytes lib
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)


hf_auth = os.environ.get('HF_API_KEY')
model_config = transformers.AutoConfig.from_pretrained(
    model_name,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()


In [None]:
%%time
import torch

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name,
    use_auth_token=hf_auth
)


generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    temperature=0.0,
    max_new_tokens=256,  # number of tokens to generate in the output
    repetition_penalty=1.1
)



## 3.2. Get embeddings

Let's select a small embedding model based on [HF's leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

In [7]:

from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-mpnet-base-v2' # all-MiniLM-L6-v2 and all-MiniLM-L12-v2 were also tried but gave worse results

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)


  warn("The installed version of bitsandbytes was compiled without GPU support. "


/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32


In [9]:

docs = [
    "example document about two rabbits"
]

embeddings = embed_model.embed_documents(docs)

print(f"Dimension of embeddings is {len(embeddings[0])}.")

Dimension of embeddings is 768.


#4. Vector database (Pinecone)

Initializing the vector db index for Pinecone:

In [3]:
# set my API key as an environment variable
import os

PINECONE_API_KEY = ''  # paste your Pinecone API key here

os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY
os.environ['PINECONE_ENVIRONMENT'] = 'gcp-starter'



In [4]:
import os
import pinecone

# get API key from app.pinecone.io and environment from console
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY'),
    environment=os.environ.get('PINECONE_ENVIRONMENT')
)

In [10]:
import time

index_name = 'rag-llama-2'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

In [20]:
# pinecone.delete_index("rag-llama-2")

In [13]:
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
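
The index above is created with `metric='cosine'`. As a small illustration of what that metric computes, here is a plain-Python sketch on toy 3-dimensional vectors (the real embeddings have 768 dimensions; `cosine_similarity` is a helper written for this example, not a Pinecone function):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]  # same direction, different length
doc_orthogonal = [0.0, 3.0, 0.0]      # shares no direction with the query

sim_same = cosine_similarity(query_vec, doc_same_direction)
sim_orth = cosine_similarity(query_vec, doc_orthogonal)
print(sim_same, sim_orth)
```

Only the direction of the vectors matters, not their length, which is why the two parallel vectors score as maximally similar.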

In [131]:
len(books_df)

105

In [133]:
books_df.head()

Unnamed: 0,id,title,description,webUrl,author,rank
0,kca://book/amzn1.gr.book.v3.CI518ofIFidEXILu,Things We Left Behind,There was only one woman who could set me free...,https://www.goodreads.com/book/show/116536542-...,Lucy Score,1
1,kca://book/amzn1.gr.book.v3.laK_97rF9qQjQGbP,The Long Game,A disgraced soccer exec reluctantly enlists th...,https://www.goodreads.com/book/show/101144869-...,Elena Armas,2
2,kca://book/amzn1.gr.book.v3.AbK983RbgzqySZEv,Bright Young Women,An extraordinary novel inspired by the real-li...,https://www.goodreads.com/book/show/101124639-...,Jessica Knoll,3
3,kca://book/amzn1.gr.book.v3.WMf4mFalogDRVJoY,The Fragile Threads of Power,From the #1 New York Times bestselling author ...,https://www.goodreads.com/book/show/111673828-...,V.E. Schwab,4
4,kca://book/amzn1.gr.book.v3.fDu8MXBOGcUfyd89,Rouge,From the critically acclaimed author of Bunny ...,https://www.goodreads.com/book/show/101160689-...,Mona Awad,5


In [14]:
batch_size = 16

# upsert book releases data:
for start in range(0, len(books_df), batch_size):
    batch = books_df.iloc[start: start + batch_size]
    rows = [r for _, r in batch.iterrows()]
    ids = [str(r['id']) for r in rows]
    # text that gets embedded and is also stored as the retrievable content
    texts = [f"Novel title: '{r['title']}' by {r['author']}. Description: {r['description']}" for r in rows]
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': text,
         'title': r['title'],
         'source': r['webUrl'],
         'author': r['author']} for text, r in zip(texts, rows)
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))


# upsert NASA ADS cosmology data:
# for i in range(0, len(nasa_ads_df), batch_size):
#     batch = nasa_ads_df.iloc[i: i + batch_size]
#     ids = [f"{r['Bibcode']}" for i, r in batch.iterrows()]
#     texts = [f"{r['Abstract']}" for i, r in batch.iterrows()]
#     embeds = embed_model.embed_documents(texts)
#     # get metadata to store in Pinecone
#     metadata = [
#         {'text': r['Abstract'],
#          'title': r['Title']} for i, r in batch.iterrows()
#     ]
#     # add to Pinecone
#     index.upsert(vectors=zip(ids, embeds, metadata))

In [15]:
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.00105,
 'namespaces': {'': {'vector_count': 105}},
 'total_vector_count': 105}

#5. LangChain

We initialize two modules from LangChain, the Hugging Face LLM pipeline and the RetrievalQA chain, and compare their outputs on a set of questions based on the September book releases from Goodreads.

In [17]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)


from langchain.vectorstores import Pinecone

text_field = 'text'  # field in metadata that contains text content

vectordb = Pinecone(
    index, embed_model.embed_query, text_field
)

In [171]:
from langchain.chains import RetrievalQA

rag = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff', verbose=True,
    retriever=vectordb.as_retriever() #search_type="similarity_score_threshold", search_kwargs={'score_threshold': 0.7})
)
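
With `chain_type='stuff'`, the retrieved documents are placed ("stuffed") together into a single prompt alongside the question. The sketch below only illustrates that idea: the template wording and the `build_stuff_prompt` helper are hypothetical and are not LangChain's actual prompt, and the document texts are shortened placeholders.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved texts into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    "What's the latest novel by Mona Awad?",
    [
        "Novel title: 'Rouge' by Mona Awad. Description: ...",
        "Novel title: 'Holly' by Stephen King. Description: ...",
    ],
)
print(prompt)
```

Because every retrieved document ends up in the same context, an irrelevant hit sits right next to the correct record and the model has to pick between them.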

#6. Testing the two setups
Compare pure LLM answers and RetrievalQA answers to the following questions:

##6.1. Book releases questions



In [188]:
llm("What is the story of Lucian and Sloane in 'Things We Left Behind'?")

'\n\nAnswer: In "Things We Left Behind," Lucian and Sloane are a couple who have been together for several years. They have a comfortable life together, but they both feel like something is missing. They decide to take a trip to a remote island to try and find what they\'re looking for. During their time on the island, they encounter various challenges and obstacles, including a mysterious illness that affects them both. As they work together to overcome these challenges, they begin to realize that what they\'ve been searching for all along is each other. The story explores themes of love, loss, and the power of human connection.'

In [187]:
rag("What is the story of Lucian and Sloane in 'Things We Left Behind'?")



> Entering new RetrievalQA chain...

> Finished chain.


{'query': "What is the story of Lucian and Sloane in 'Things We Left Behind'?",
 'result': " In 'Things We Left Behind,' Lucian Rollins and Sloane Walton are enemies-to-lovers who have a complicated history together. They were once close friends, but a dark secret from their past drove them apart. Now, they are both driven by their own desires and fears, and must navigate their complicated feelings for each other while also trying to protect themselves from the dangers of their past."}

In [186]:
vectordb.similarity_search(
    "What is the story of Lucian and Sloane in 'Things We Left Behind'?",  # the search query
    k=2
)

[Document(page_content='There was only one woman who could set me free. But I would rather set myself on fire than ask Sloane Walton for anything.Lucian Rollins is a lean, mean vengeance-seeking mogul. On a quest to erase his father’s mark on the family name, he spends every waking minute pulling strings and building an indestructible empire. The more money and power he amasses, the safer he is from threats.Except when it comes to the feisty small-town librarian that keeps him up at night…Sloane Walton is a spitfire determined to carry on her father’s quest for justice. She’ll do that just as soon as she figures out exactly what the man she hates did to—or for—her family. Bonded by an old, dark secret from the past and the dislike they now share for each other, Sloane trusts Lucian about as far as she can throw his designer-suited body.When bickering accidentally turns to foreplay, these two find themselves not quite regretting their steamy one-night stand. Once those flames are fanned

The pure LLM's answer is completely made up and not true to the facts.

We can see that when the question is matched against the appropriate record, the retrieval chain returns the proper answer.

However, if the question is slightly more general, the right record isn't identified and no sensible answer is returned (see the query below).

In [189]:
vectordb.similarity_search(
    "What is the plot of 'Things We Left Behind'?",
)

[Document(page_content='Holly Gibner, één van Stephen Kings meest meeslepende en vindingrijke personages, keert in deze bloedstollende thriller terug om de gruwelijke waarheid achter een reeks verdwijningen in een klein stadje te achterhalen.Stephen Kings Holly betekent de triomfantelijke terugkeer van het geliefde personage Holly Gibner. Lezers hebben kunnen genieten van Holly’s ontwikkeling van verlegen (maar ook stoere en morele) einzelgänger in Mr. Mercedes, naar Bill Hodges’ partner in De eerlijke vinder, tot ervaren, intelligente, soms wat harde privédetective in De buitenstaander. In Kings nieuwste boek moet Holly het in haar eentje opnemen tegen een stel sluwe en wrede vijanden.Wanneer Penny Dahl het detectivebureau belt in de hoop dat ze haar vermiste dochter kunnen vinden, weet Holly niet zeker of ze de zaak moet aannemen. Haar partner Pete is besmet met corona. Haar (behoorlijk compliceerde) moeder is net overleden. Eigenlijk Holly zou met verlof moeten zijn. Maar iets in Pe
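
One possible guard against such mismatches is to discard hits whose similarity score falls below a cutoff, which is the idea behind the commented-out `score_threshold` argument in the retriever setup. Here is a minimal sketch with made-up scores; the numbers and the `filter_by_score` helper are illustrative and are not measured results from this notebook:

```python
def filter_by_score(hits, threshold=0.7):
    """Keep only (text, score) pairs whose score is at least the threshold."""
    return [(text, score) for text, score in hits if score >= threshold]

# Hypothetical (document, cosine score) pairs, not taken from the runs above
hits = [
    ("Novel title: 'Things We Left Behind' ...", 0.82),
    ("Novel title: 'Holly' ...", 0.41),
]

kept = filter_by_score(hits)
print(kept)
```

If nothing survives the cutoff, the caller can report that no matching record was found instead of answering from an unrelated one.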

In [58]:
llm( "What are the latest adventures of Holly Gibney and what case is she solving in King's latest novel and what's it called?")


'\n\nAnswer: The latest adventures of Holly Gibney can be found in the novel "The Outsider" by Stephen King, which was published in 2018. In this book, Holly is a key player in solving a gruesome murder that takes place in a small town in Oklahoma. The case centers around the death of a young boy who is found dead in a cave, with no signs of any suspects or motives. Holly uses her unique abilities as an autistic savant to help unravel the mystery and bring the perpetrator to justice.'

In [175]:
rag( "What are the latest adventures of Holly Gibney and what case is she solving in King's latest novel and what's it called?")



> Entering new RetrievalQA chain...

> Finished chain.


{'query': "What are the latest adventures of Holly Gibney and what case is she solving in King's latest novel and what's it called?",
 'result': " Holly Gibney, one of Stephen King's most compelling and ingeniously resourceful characters, returns in this thrilling novel to solve the gruesome truth behind multiple disappearances in a midwestern town. It's called Holly."}

The pure LLM produced a wrong answer because its parametric knowledge doesn't contain information about the latest Stephen King book, "Holly". 2:0 for RAG.


In [87]:
llm("What is the latest installment in the Serpent & Dove NYT series about and what's called?")


'\n\nThe latest installment in the Serpent & Dove series by The New York Times is called "The Blood of the Moon." It was published on September 28, 2022.'

In [191]:
rag("What is the latest installment in the Serpent & Dove NYT series about and what happens to Célie?")



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'What is the latest installment in the Serpent & Dove NYT series about and what happens to Célie?',
 'result': ' The latest installment in the Serpent & Dove NYT series is titled "Six Months Have Passed" and it follows Célie as she takes her sacred vows and joins the ranks of the Chasseurs as their first huntswoman. However, whispers from her past still haunt her, and a new evil is rising—one that Célie herself must vanquish, unless she falls prey to the darkness.'}

Here both the LLM and RAG gave wrong, made-up titles. The book is actually called "The Scarlet Veil". The rest of the answer is right in the case of RAG.

Below, a slightly more general question about the same book results in the query not being matched against the correct vectors in the db at all.

In [190]:
vectordb.similarity_search(
    "What is the latest installment in the Serpent & Dove NYT series about and what's called?",  # the search query
    k=2
)

[Document(page_content="***GABE &amp; WREN'S STORY. RELEASE DATE TBA***", metadata={'author': 'Somme Sketcher', 'source': 'https://www.goodreads.com/book/show/62368883-sinners-atone', 'title': 'Sinners Atone'}),
 Document(page_content='A new mystery is afoot in the fourth book in the Thursday Murder Club series from million-copy bestselling author Richard Osman. Coming September 2023!', metadata={'author': 'Richard Osman', 'source': 'https://www.goodreads.com/book/show/75293475-the-last-devil-to-die', 'title': 'The Last Devil to Die'})]

In [63]:
llm( "What is the sequel to 'Foul Lady Fortune' and what is it about?")


'\n\nAnswer: The sequel to "Foul Lady Fortune" is called "The Sweetest Kind of Cruelty". It continues the story of Arsinoe and her friends as they navigate the dangerous world of the game. The book explores themes of loyalty, power, and the cost of ambition as the characters face new challenges and obstacles in their quest for victory.'

In [192]:
rag( "What is the sequel to 'Foul Lady Fortune' and what is it about?")



> Entering new RetrievalQA chain...

> Finished chain.


{'query': "What is the sequel to 'Foul Lady Fortune' and what is it about?",
 'result': " The sequel to 'Foul Lady Fortune' is titled 'Our Violent Ends' and it follows the story of Rosalind Lang, an immortal assassin in 1930s Shanghai, as she races to save her country and her love from a Japanese invasion."}

In [18]:
vectordb.similarity_search(
    "What is the sequel to 'Foul Lady Fortune' and what is it about?",  # the search query
    k=2
)

[Document(page_content="Novel title: 'Holly' by Stephen King. Description: Holly Gibner, één van Stephen Kings meest meeslepende en vindingrijke personages, keert in deze bloedstollende thriller terug om de gruwelijke waarheid achter een reeks verdwijningen in een klein stadje te achterhalen.Stephen Kings Holly betekent de triomfantelijke terugkeer van het geliefde personage Holly Gibner. Lezers hebben kunnen genieten van Holly’s ontwikkeling van verlegen (maar ook stoere en morele) einzelgänger in Mr. Mercedes, naar Bill Hodges’ partner in De eerlijke vinder, tot ervaren, intelligente, soms wat harde privédetective in De buitenstaander. In Kings nieuwste boek moet Holly het in haar eentje opnemen tegen een stel sluwe en wrede vijanden.Wanneer Penny Dahl het detectivebureau belt in de hoop dat ze haar vermiste dochter kunnen vinden, weet Holly niet zeker of ze de zaak moet aannemen. Haar partner Pete is besmet met corona. Haar (behoorlijk compliceerde) moeder is net overleden. Eigenlij

The LLM confabulated both the title and the plot of the story. RAG also got the title wrong, mistaking it for the title of another book by the same author that it found in the novel's description, but the rest of the plot is summarized well.

In [65]:
llm( "Give me a short synopsis of events of 'The Unfortunate Side Effects of Heartbreak and Magic' novel by B.Randall")


'.\n\nSure! Here\'s a short synopsis of the events in "The Unfortunate Side Effects of Heartbreak and Magic" by B. Randall:\n\nAfter her boyfriend dumps her, Ember finds solace in her magic abilities. However, she soon discovers that her heartbreak has unleashed a powerful and unpredictable magic within her. As she struggles to control her newfound powers, Ember must also navigate her complicated relationships with her best friend, her ex-boyfriend, and a mysterious new love interest. Along the way, she learns that the line between magic and heartbreak is thin, and that true love can be found in the most unexpected places.'

In [66]:
rag( "Give me a short synopsis of events of 'The Unfortunate Side Effects of Heartbreak and Magic' novel by B.Randall")

{'query': "Give me a short synopsis of events of 'The Unfortunate Side Effects of Heartbreak and Magic' novel by B.Randall",
 'result': '\n\nThe novel "The Unfortunate Side Effects of Heartbreak and Magic" by B. Randall follows the story of Sadie Revelare, a young woman who inherits a magical ability from her grandmother, but also carries a curse of four heartbreaks that accompany her magic. When her grandmother is diagnosed with cancer and her first heartbreak returns to town, Sadie\'s carefully structured life begins to unravel. As she faces the last of her heartbreaks, she must decide if love is more important than her magic.'}

Here Llama hallucinates again, and the retriever does a good job. 3:0 for RAG.

In [67]:
llm( "What's the latest novel by Mona Awad?")


'\n\nAnswer: The latest novel by Mona Awad is "Bewilderness".'

In [178]:
rag( "What's the latest novel by Mona Awad?")



> Entering new RetrievalQA chain...

> Finished chain.


{'query': "What's the latest novel by Mona Awad?",
 'result': ' The latest novel by Mona Awad is called "Baby, We Need to Talk". It\'s a darkly comedic exploration of the expectations placed upon Palestinian-American women and the meaning of a fulfilling life.'}

In [19]:
vectordb.similarity_search(
    "What's the latest novel by Mona Awad?",  # the search query
    k=2  # return the 2 most relevant records
)

[Document(page_content="Novel title: 'Nineteen Steps' by Millie Bobby Brown. Description: Emmy-nominated actress and producer MILLIE BOBBY BROWN's debut novel, Nineteen Steps, is a moving tale of love, longing, and loss, inspired by the true events of her family's experience during World War II.Love blooms in the darkest days…It’s 1942, and London remains under constant threat of enemy attack as the second world war rages on. In the Bethnal Green neighborhood, Nellie Morris counts every day lucky that she emerges from the underground shelters unharmed, her loving family still surrounding her.Three years into the war, she’s grateful to hold onto remnants of normalcy—her job as assisting the mayor and nights spent at the local pub with her best friend. But after a chance encounter with Ray, an American airman stationed nearby, Nellie becomes enchanted with the idea of a broader world.Just when Nellie begins to embrace an exciting new life with Ray, a terrible incident occurs during an ai

Here neither Llama nor the retriever does well. The retriever doesn't find the right record in the db from the author's name alone.

When we rephrase the question and give more details, it finally finds the correct record and refers to it.

In [201]:
llm("What does Mona Awad say about beauty industry in her latest book?")

'\nMona Awad\'s latest book, "All the Wild Hungers", explores themes of beauty, identity, and power through the lens of a group of women who are struggling to find their place in society. In the book, Awad challenges traditional notions of beauty and femininity, arguing that these concepts are socially constructed and can be damaging to individuals and society as a whole.\n\nOne of the main characters in the book, Lily, is a young woman who is obsessed with her appearance and feels pressure to conform to societal beauty standards. Through Lily\'s journey, Awad highlights the ways in which the beauty industry can be harmful and oppressive, perpetuating unrealistic beauty ideals and reinforcing damaging gender stereotypes.\n\nAt the same time, Awad also celebrates the beauty and strength of women in all their forms, suggesting that true beauty comes from within and cannot be reduced to physical appearances. Through her writing, Awad encourages readers to embrace their unique qualities an

In [200]:
rag("What does Mona Awad say about beauty industry in her latest book?")



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'What does Mona Awad say about beauty industry in her latest book?',
 'result': " Based on the provided text, Mona Awad's latest book, Rouge, explores the cult-like nature of the beauty industry and the dangerous consequences of internalizing its pitiless gaze. The book delves into the dark side of beauty, envy, grief, and the complicated love between mothers and daughters."}

In [199]:
vectordb.similarity_search(
    "What does Mona Awad say about beauty industry in her latest book?",  # the search query
    k=2
)

[Document(page_content='From the critically acclaimed author of Bunny comes a horror-tinted, gothic fairy tale about a lonely dress shop clerk whose mother’s unexpected death sends her down a treacherous path in pursuit of youth and beauty. Can she escape her mother’s fate—and find a connection that is more than skin deep?For as long as she can remember, Belle has been insidiously obsessed with her skin and skincare videos. When her estranged mother Noelle mysteriously dies, Belle finds herself back in Southern California, dealing with her mother’s considerable debts and grappling with lingering questions about her death. The stakes escalate when a strange woman in red appears at the funeral, offering a tantalizing clue about her mother’s demise, followed by a cryptic video about a transformative spa experience. With the help of a pair of red shoes, Belle is lured into the barbed embrace of La Maison de Méduse, the same lavish, culty spa to which her mother was devoted. There, Belle di

**Summary:**

Overall, the Llama + vector store + embeddings setup does a much better job of finding correct answers, but the solution is not perfect. Some details are misread or still hallucinated. Errors appeared when:
- the question was short and carried little information; the retriever then sometimes ranked the wrong documents as the most similar ones
- the question wasn't straightforward, so the answer required some analysis of the text; with a couple of documents loaded into the context, Llama 13B (quantized) seemed to struggle to pick out the right piece of information
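One way to diagnose the first failure mode is to look at the similarity scores instead of only the returned documents. Below is a minimal sketch, assuming the vector store exposes LangChain's `similarity_search_with_score(query, k)`. The helper `score_gap`, the `StubDocument` and `StubStore` classes, and the scores are hypothetical stand-ins so the snippet runs without a real index. A small gap between the two best hits suggests the query does not carry enough information to separate the candidates. Whether a higher score means "more similar" depends on the metric configured in the store.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


def score_gap(store: Any, query: str, k: int = 2) -> Tuple[List[Tuple[str, float]], float]:
    """Return (text preview, score) for the top-k hits and the gap between the best two scores."""
    hits = store.similarity_search_with_score(query, k=k)
    ranked = [(doc.page_content[:60], float(score)) for doc, score in hits]
    gap = abs(ranked[0][1] - ranked[1][1]) if len(ranked) > 1 else float("inf")
    return ranked, gap


# Hypothetical stand-ins so the sketch runs without a real index.
@dataclass
class StubDocument:
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class StubStore:
    def similarity_search_with_score(self, query: str, k: int = 2):
        # Made-up scores for illustration, not measurements from this notebook.
        canned = [
            (StubDocument("Novel title: 'Rouge' by Mona Awad."), 0.41),
            (StubDocument("Novel title: 'Nineteen Steps' by Millie Bobby Brown."), 0.39),
        ]
        return canned[:k]


ranked, gap = score_gap(StubStore(), "Mona Awad latest book", k=2)
for preview, score in ranked:
    print(f"{score:.2f}  {preview}")
print(f"gap between top two hits: {gap:.2f}")
```

With the real index, `StubStore()` would be replaced by `vectordb`; the threshold below which a gap counts as "too small" would have to be chosen empirically.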

##6.2 Cosmology questions

Now we pick some test questions based on a NASA cosmology corpus of articles from the last month (August-September). This dataset contains much more intricate, jargon-packed language.



In [45]:
llm('Why are young massive planets easier to study their atmospheres?')

'\n\nYoung massive planets are easier to study their atmospheres because they are typically brighter and more massive than older, smaller planets. This makes them easier to detect and observe with telescopes, allowing scientists to gather more data about their atmospheres. Additionally, young massive planets are often still in the process of forming, which means that their atmospheres are still being shaped by ongoing planetary accretion and differentiation. This makes them a unique window into the early stages of planetary formation and evolution.\n\n\n\n'

In [105]:
rag('Why are young massive planets easier to study their atmospheres?')

{'query': 'Why are young massive planets easier to study their atmospheres?',
 'result': ' Young massive planets are easier to study because they are still hot enough to emit light that can be detected by telescopes, allowing for spectroscopy of their atmospheres. Additionally, they are typically brighter and closer to Earth, making them easier to observe.'}

In [33]:
vectordb.similarity_search(
    'Why are young massive planets easier to study their atmospheres?',  # the search query

    k=2  # returns top 2 most relevant articles
)

[Document(page_content='Very young massive planets are sufficiently luminous by their internal heat of formation to permit detailed studies, including spectroscopy of their atmospheres with large telescopes at sufficient resolution ($\\lambda / \\Delta \\lambda \\gtrsim 1000$) to identify major constituents to inform models of planet formation and early evolution. We obtained 1-2.4$\\mu$m ($YJHK$) spectra of the planetary-mass "b" companion of 2MASS~J04372171+2651014, a 1-3 Myr-old M dwarf member of the Taurus star-forming region, and one of the youngest such objects discovered to date. These indicate the presence of CO and possibly H$_2$O and CH$_4$ in the atmosphere, all suggesting a $T_{\\rm eff}$ of around 1200K, characteristic of a L-T transition spectral type and consistent with previous estimates based on its luminosity and age. The absence or attenuation of spectral features at shorter wavelengths suggests the presence of micron-size dust, consistent with the object\'s red colo

Here both models give a correct answer, but pure Llama-2 elaborates on the topic in a more comprehensible way.


In [52]:
llm('What are the latest TESS observations of magnetic hot stars?')



'\n\nThe Transiting Exoplanet Survey Satellite (TESS) has made numerous observations of magnetic hot stars, providing valuable insights into their properties and behavior. Here are some of the latest TESS observations of magnetic hot stars:\n\n1. Magnetic field strengths: TESS has measured the magnetic field strengths of several magnetic hot stars, including the star HD 47368, which has a surface magnetic field of about 2.5 kG (Koch et al. 2020). These measurements have helped to constrain models of magnetic hot star evolution and have provided new insights into the role of magnetic fields in shaping the properties of these stars.\n2. Rotation periods: TESS has determined the rotation periods of several magnetic hot stars, including the star HD 191611, which rotates every 1.3 days (Huber et al. 2020). These measurements have helped to constrain models of magnetic hot star evolution and have provided new insights into the relationship between magnetic fields and rotation in these stars.

In [53]:
rag('What are the latest TESS observations of magnetic hot stars?')

{'query': 'What are the latest TESS observations of magnetic hot stars?',
 'result': ' The latest TESS observations of magnetic hot stars include the identification of 9 new rotating variable stars and the discovery of a significant negative correlation between the surface magnetic field strength and the characteristic frequency of the stars.'}

In [55]:
vectordb.similarity_search(
    'What are the latest TESS observations of magnetic hot stars?',  # the search query
    k=2  # returns top 2 most relevant articles
)

[Document(page_content='Magnetic hot stars refer to the stars, which effective temperatures approximately in the range from 7,000 to 50,000 K, and with large-scale globally organized magnetic fields. These magnetic fields exhibit strengths ranging from tens of Gauss to tens of kilo-Gauss. They are key in understanding the effects caused by magnetic fields in the stellar evolution. However, there are only three magnetic hot stars studied via a combination of spectropolarimetric and asteroseismic modeling. Combined with $Transiting\\;Exoplanet\\;Survey\\;Satellite\\;(TESS)$ 1-56 sectors data sets, we provided a photometric variability and stochastic low frequency (SLF) variability study of 118 magnetic hot stars. 9 new rotating variable stars are identified. Using the Bayesian Markov Chain Monte Carlo (MCMC) framework, we fitted the morphologies of SLF variability for magnetic hot stars. Our analysis reveals that the magnetic hot stars in our sample have $\\gamma &lt; 5.5$ with the vast 

This question was pretty general and the two setups answered differently: Llama relied on its parametric knowledge only, while the retriever-based chain relied on the paper it found relevant.

In [31]:
llm('What is IMF function and what is it used for?')

"\n\nThe IMF (International Monetary Fund) is an international organization that provides financial assistance to countries in economic distress. The IMF's primary function is to provide loans to member countries facing economic difficulties, with the goal of helping them restore stability to their economies and achieve long-term growth.\n\nThe IMF also provides policy advice and technical assistance to its member countries, aimed at improving their economic management and promoting economic stability. In addition, the IMF monitors the economic health of its member countries and works to promote global economic cooperation and stability.\n\nThe IMF's functions include:\n\n1. Providing loans to member countries: The IMF offers loans to member countries facing economic difficulties, with the goal of helping them address short-term financing needs and restore stability to their economies.\n2. Providing policy advice and technical assistance: The IMF offers advice and assistance on economi

In [30]:
rag('What is IMF function and what is it used for?')

{'query': 'What is IMF function and what is it used for?',
 'result': ' IMF stands for "Integrated Modal Function". It is a mathematical representation of the dynamic behavior of a structure under external forces. In the context of the code you provided, the IMF function is used to calculate the response of the structure to different loading conditions, such as wind, waves, and earthquakes. The IMF function takes into account the modal properties of the structure, such as natural frequencies, damping ratios, and mode shapes, and provides a comprehensive representation of the structure\'s behavior under different loading conditions.'}

In [29]:
vectordb.similarity_search(
    'What is IMF function and what is it used for',  # the search query
    k=2  # returns top 2 most relevant articles
)

[Document(page_content='This includes the following : Two ipynotebooks for performing PIV analysis to determine cardiac output and ejection fraction One ipynotebook for performing deformation mapping of ventricles', metadata={'title': "SeasDBG/FRJS: Recreating the Heart's Helical Structure-Function Relationship with Focused Rotary Jet Spinning- Data Analysis"}),
 Document(page_content='This is a matlab code for accurate and broadly applicable causal inference method for time-series data.', metadata={'title': 'Mathbiomed/GOBI: GOBI (General ODE-based causal inference)'})]

Here the IMF in question is actually the Initial Mass Function, a fairly obscure piece of information contained in one of the articles. Llama referred to the more popular meaning of the abbreviation (International Monetary Fund), while RAG hallucinated its answer because the retriever didn't find the relevant document.
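A common mitigation for ambiguous abbreviations is to expand them before the query is embedded. Below is a minimal sketch, assuming a hand-maintained glossary. The `GLOSSARY` dictionary and the `expand_query` helper are hypothetical and are not part of this notebook or of LangChain. Whether the expanded query actually retrieves the right article was not tested here.

```python
import re

# Hypothetical, hand-maintained glossary for the cosmology corpus.
GLOSSARY = {
    "IMF": "initial mass function",
    "TESS": "Transiting Exoplanet Survey Satellite",
}


def expand_query(query, glossary=GLOSSARY):
    """Append the long form after each known abbreviation that appears as a whole word."""
    if not glossary:
        return query

    def _expand(match):
        abbreviation = match.group(0)
        return f"{abbreviation} ({glossary[abbreviation]})"

    # Case-sensitive match, so "imf" inside ordinary words is left alone.
    pattern = r"\b(" + "|".join(re.escape(key) for key in glossary) + r")\b"
    return re.sub(pattern, _expand, query)


expanded = expand_query("What is IMF function and what is it used for?")
print(expanded)
```

The expanded string could then be passed to `vectordb.similarity_search(expanded, k=2)` in place of the raw question.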

In [36]:
vectordb.similarity_search(
    'What is SNCosmo for python?',  # the search query
    k=2  # returns top 2 most relevant articles
)

[Document(page_content='Python library for supernova cosmology', metadata={'title': 'SNCosmo'}),
 Document(page_content='New name, now installable with pip. Input files will need to be referenced according to the file structure in this repository, but paths and filenames can be passed as kwargs.', metadata={'title': 'itsmoosh/MoonMag: Repackaged for distribution with PyPI'})]

In [38]:
vectordb.similarity_search(
    'What is SNCosmo ?',  # the search query
    k=3  # returns top 3 most relevant articles
)

[Document(page_content='SONAR is a spatial transcriptomics deconvolution algorithm, if you have any questions, please contact us! Thanks for your support', metadata={'title': 'lzygenomics/SONAR: SONAR v1.0.0'}),
 Document(page_content='The eighteenth data release (DR18) of the Sloan Digital Sky Survey (SDSS) is the first one for SDSS-V, the fifth generation of the survey. SDSS-V comprises three primary scientific programs or "Mappers": the Milky Way Mapper (MWM), the Black Hole Mapper (BHM), and the Local Volume Mapper. This data release contains extensive targeting information for the two multiobject spectroscopy programs (MWM and BHM), including input catalogs and selection functions for their numerous scientific objectives. We describe the production of the targeting databases and their calibration and scientifically focused components. DR18 also includes ~25,000 new SDSS spectra and supplemental information for X-ray sources identified by eROSITA in its eFEDS field. We present upda

In [40]:
llm('What is SNCosmo?')

'\n\nSNCosmo is a software tool that helps astronomers and researchers analyze and visualize large datasets from telescopes and other sources. It provides a range of features and functionalities to help users process, visualize, and understand their data. Some of the key features of SNCosmo include:\n\n1. Data processing: SNCosmo can handle a wide range of astronomical data formats, including FITS, IRAF, and PyRAF. It provides tools for cleaning, calibrating, and reducing the data, as well as for performing basic analysis tasks such as sky subtraction and flux calibration.\n2. Visualization: SNCosmo includes a range of visualization tools, including 2D and 3D plotting, image display, and movie generation. Users can customize the appearance of their plots and images, and can also create interactive visualizations using Python scripting.\n3. Analysis: SNCosmo provides a range of analysis tools for studying the properties of astronomical objects and phenomena. These tools include spectros

In [43]:
rag('What is SNCosmo?')

{'query': 'What is SNCosmo?',
 'result': ' SNCosmo is not mentioned in the text.'}

The general question 'What is SNCosmo?' yields no relevant results from the vector DB, although SNCosmo is mentioned in one of the articles. The LLM by itself gives a better answer to the question.
When more detail is added to the question ('What is SNCosmo for python?'), the vector store finally retrieves the right record, but Llama by itself still manages to give a more comprehensive answer.
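Since the record's title is literally "SNCosmo", a plain keyword match on titles would catch this case even when the embedding search misses it. Below is a minimal sketch of such a lexical fallback. The `RECORDS` list, the `STOPWORDS` set and the `title_keyword_hits` helper are hypothetical; the two records only mirror titles seen in the outputs above.

```python
import re
from typing import Dict, List

# Hypothetical in-memory copy of record titles and texts, for illustration only.
RECORDS = [
    {"title": "SNCosmo", "text": "Python library for supernova cosmology"},
    {
        "title": "lzygenomics/SONAR: SONAR v1.0.0",
        "text": "SONAR is a spatial transcriptomics deconvolution algorithm",
    },
]

STOPWORDS = {"what", "is", "are", "the", "a", "an", "for", "of", "how", "to"}


def title_keyword_hits(query: str, records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return records whose title shares a whole, non-stopword token with the query."""
    tokens = {t for t in re.findall(r"[a-z0-9]+", query.lower()) if t not in STOPWORDS}
    hits = []
    for record in records:
        title_tokens = set(re.findall(r"[a-z0-9]+", record["title"].lower()))
        if tokens & title_tokens:
            hits.append(record)
    return hits


hits = title_keyword_hits("What is SNCosmo?", RECORDS)
print([hit["title"] for hit in hits])  # → ['SNCosmo']
```

In a real pipeline, such keyword hits would be merged with the vector search results before the context is handed to the LLM; how to weight the two sources is a design choice this sketch leaves open.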

In [98]:
vectordb.similarity_search(
    'How are solar winds connected to picoflares?',  # the search query
    k=2  # returns top 2 most relevant articles
)

[Document(page_content='Coronal holes are areas on the Sun with open magnetic field lines. They are a source region of the solar wind, but how the wind emerges from coronal holes is not known. We observed a coronal hole using the Extreme Ultraviolet Imager on the Solar Orbiter spacecraft. We identified jets on scales of a few hundred kilometers, which last 20 to 100 seconds and reach speeds of ~100 kilometers per second. The jets are powered by magnetic reconnection and have kinetic energy in the picoflare range. They are intermittent but widespread within the observed coronal hole. We suggest that such picoflare jets could produce enough high-temperature plasma to sustain the solar wind and that the wind emerges from coronal holes as a highly intermittent outflow at small scales.', metadata={'title': 'Picoflare jets power the solar wind emerging from a coronal hole on the Sun'}),
 Document(page_content='Sunspots host various oscillations and wave phenomena like umbral flashes, umbral 

In [100]:
llm('How are solar winds connected to picoflare jets')

"?\n\nSolar winds and picoflare jets are both high-energy phenomena that occur in the vicinity of the Sun. While they are distinct events, there is some evidence to suggest that they may be connected. Here are some possible ways in which solar winds and picoflare jets might be related:\n\n1. Magnetic reconnection: Both solar winds and picoflare jets involve the release of magnetic energy, which can be triggered by magnetic reconnection events. During a magnetic reconnection event, the magnetic field lines in the solar atmosphere break and reconnect, releasing a large amount of energy in the form of heat and radiation. This energy can then drive the acceleration of charged particles, such as electrons and ions, which can produce the observed effects of solar winds and picoflare jets.\n2. Coronal heating: The corona, the outermost layer of the Sun's atmosphere, is heated to millions of degrees Celsius during a flare or other energetic event. This heating can cause the plasma in the coron

In [101]:
rag('How are solar winds connected to picoflare jets?')

{'query': 'How are solar winds connected to picoflare jets?',
 'result': ' Picoflare jets are powered by magnetic reconnection and can produce high-temperature plasma, which could sustain the solar wind. The solar wind emerges from coronal holes as an intermittent outflow at small scales.'}

RAG gives a concrete answer based on a recent paper, whereas the LLM mostly rambles on.

**Summary:**

For a more intricate and technical dataset, the vector store seems to struggle to return the correct articles more often than with the books dataset (which was also smaller). Perhaps higher-dimensional embeddings would be needed (we tried general-purpose SentenceTransformers embeddings with 384 and 768 dimensions), or a fine-tuning procedure on a dataset that matches queries with the correct articles.
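A dataset of queries paired with their correct articles would also make it possible to measure retrieval quality, so that different embeddings can be compared by number rather than by eye. Below is a minimal sketch of a hit rate at k computation. The `hit_rate_at_k` helper, the `toy_retrieve` function and the `LABELLED` pairs are hypothetical; the canned results stand in for the vector store and are not a full evaluation of this notebook.

```python
from typing import Callable, Dict, List

PICOFLARE_TITLE = "Picoflare jets power the solar wind emerging from a coronal hole on the Sun"


def hit_rate_at_k(
    retrieve: Callable[[str, int], List[str]],
    labelled: Dict[str, str],
    k: int = 2,
) -> float:
    """Fraction of queries whose expected title appears among the top-k retrieved titles."""
    if not labelled:
        return 0.0
    hits = sum(1 for query, expected in labelled.items() if expected in retrieve(query, k))
    return hits / len(labelled)


# Toy stand-in for the vector store: canned top titles per query (illustrative only).
_CANNED = {
    "How are solar winds connected to picoflares?": [PICOFLARE_TITLE],
    "What is SNCosmo ?": ["lzygenomics/SONAR: SONAR v1.0.0"],
}


def toy_retrieve(query: str, k: int) -> List[str]:
    return _CANNED.get(query, [])[:k]


# Hypothetical labelled pairs: query -> title of the article that should be found.
LABELLED = {
    "How are solar winds connected to picoflares?": PICOFLARE_TITLE,
    "What is SNCosmo ?": "SNCosmo",
}

print(hit_rate_at_k(toy_retrieve, LABELLED, k=2))  # → 0.5
```

With the real index, `retrieve` could wrap `vectordb.similarity_search(query, k=k)` and return each document's `metadata['title']`, assuming every record carries a title as the cosmology records shown above do.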