# RAG Using a Pre-trained Model

In this notebook, **Retrieval-Augmented Generation (RAG)** will be implemented using a pre-trained model, leveraging **cosine similarity** for document retrieval. The goal is to improve the model's responses by augmenting it with relevant, external information retrieved from a document corpus.

### Overview of the Steps:

1. **Set Up Corpus**: This involves preparing a collection of documents that the retrieval system will search through to find relevant information when a query is made.
1. **Index Documents**: The corpus of documents will be indexed and embedded to facilitate efficient retrieval.
1. **Set Up Retrieval System**: A retrieval mechanism will be established, using cosine similarity to identify relevant documents in the corpus.
1. **Load Pre-trained Model**: A generative model will be loaded for response generation.
1. **Retrieve Relevant Information**: Cosine similarity will be used to query the indexed documents and fetch the most relevant ones.
1. **Generate Response**: The generative model will produce a response, utilizing the retrieved documents as context.

The following sections will delve into the detailed implementation of each step.

## Setting Up Corpus

The corpus will be created using Wikipedia summaries for three articles:

- **[C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29)**
- **[Bouldering](https://en.wikipedia.org/wiki/Bouldering)**
- **[Island](https://en.wikipedia.org/wiki/Island)**

### Steps:

1. **Retrieve Summaries**: Use the Wikipedia API to fetch the summary text for the above articles.
1. **Create Corpus**: Organize the summaries into a list of documents.

Because the summaries are much shorter than the full articles, this setup reduces the token count and speeds up processing, though it may lower retrieval accuracy because less of each article is available to match against.

In [2]:
import wikipedia as wp

In [29]:
# auto_suggest=False fetches the exact title instead of a search suggestion.
titles = ["C (programming language)", "Bouldering", "Island"]
corpus = [wp.page(t, auto_suggest=False).summary for t in titles]

In [30]:
for i in corpus:
    print(i[:i.find(".")])

C (pronounced  – like the letter c) is a general-purpose programming language
Bouldering is a form of rock climbing that is performed on small rock formations or artificial rock walls without the use of ropes or harnesses
An island or isle is a piece of land, distinct from a continent, completely surrounded by water


## Embedding the Corpus

In this step, the corpus will be embedded using **Llama 2** via **Ollama**. The goal is to transform the Wikipedia summaries into vector representations that can later be used for querying.

### Steps:

1. **Load Llama 2 with Ollama**: Use Ollama to load the Llama 2 model for embedding generation.
1. **Generate Embeddings**: Use Llama 2 to embed each document in the corpus into a vector representation. Each document (summary) will have its own embedding.
1. **Store Embeddings**: Save the document embeddings, which will serve as the reference for future retrieval.
1. **Query Embedding**: When a query is made, the same model (Llama 2) will embed the query into a vector.
1. **Cosine Similarity**: Use cosine similarity to compare the query embedding with the document embeddings, retrieving the most relevant document based on the highest similarity score.

This approach retrieves the best-matching document for a query by comparing the query embedding against every document embedding. With only three documents, this exhaustive comparison is cheap.
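
The steps above can be sketched without Ollama by swapping in a toy embedding. The sketch below is an illustration under stated assumptions, not the notebook's method: `toy_embed` (a hypothetical word-count embedding), `cosine`, `retrieve`, and `toy_corpus` are all made-up names, and the word counts stand in for the Llama 2 vectors. Only the retrieval logic (embed the documents once, embed the query, score by cosine similarity, take the highest score) mirrors the approach described here.

```python
import math
from collections import Counter

def toy_embed(text, vocabulary):
    """Stand-in embedding (assumption): count how often each vocabulary word occurs."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve(query, documents):
    """Return (best_document, scores) for the query using the toy embedding."""
    vocabulary = sorted({w for d in documents for w in d.lower().split()})
    doc_vectors = [toy_embed(d, vocabulary) for d in documents]  # embed the corpus
    query_vector = toy_embed(query, vocabulary)                  # embed the query
    scores = [cosine(query_vector, v) for v in doc_vectors]
    best = max(range(len(documents)), key=scores.__getitem__)    # highest score wins
    return documents[best], scores

toy_corpus = [
    "c is a general-purpose programming language",
    "bouldering is a form of rock climbing",
    "an island is a piece of land surrounded by water",
]
best, scores = retrieve("land surrounded by water", toy_corpus)
print(best)
```

A word-count embedding only matches exact words, whereas a learned embedding is meant to capture meaning as well, so the scores here are not comparable to the ones produced by Llama 2.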

In [31]:
import ollama as ol

In [32]:
embeddings = list(map(lambda i: ol.embed(model="llama2", input=i), corpus))

In [40]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [108]:
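# Stack the per-document embeddings into one (n_docs, dim) matrix, once.
doc_embeddings = np.vstack([np.array(e.embeddings) for e in embeddings])

def get_relevant(q, verbose=False):
    """Return the corpus document whose embedding is most similar to the query."""
    # ol.embed returns a list of vectors, so this is already a 2-D (1, dim) input.
    query_embedding = np.array(ol.embed(model="llama2", input=q).embeddings)
    sim = cosine_similarity(query_embedding, doc_embeddings)
    if verbose:
        print(sim)
    return corpus[np.argmax(sim)]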

## Contextualizing Prompts with RAG

In this step, we apply **Retrieval-Augmented Generation (RAG)** by using the most relevant document from the corpus as context for the model's response. This allows the model to generate more accurate answers by incorporating external information into the prompt.

### Steps:

1. **Query the Most Relevant Document**: Based on the user's query, the system retrieves the most relevant document from the corpus using the previously embedded documents and cosine similarity.
   
1. **Add Document as Context**: The retrieved document is then added to the query as context in the model’s prompt. This gives the model additional information to generate a more informed response.

1. **Model Generates Response**: The pre-trained model (e.g., Llama 2) uses this augmented prompt to generate a response. This process does not alter the model's weights or any other internal state; it only **changes the input** the model conditions on, by providing richer context within the prompt.

By adding this external context, the model can generate answers that are more relevant and aligned with the retrieved information, enhancing its performance without needing to modify its internal parameters.

In [125]:
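def get_prompt(p, verbose=False):
    # Place the retrieved document between the question and the answer slot.
    context = get_relevant(p, verbose=verbose)
    prompt = f"Question: {p}\nAccording to valid sources: {context}\nAnswer: "
    if verbose:
        print(prompt)
    return prompt

def get_response(p, verbose=False):
    res = ol.chat(
        model='llama2',
        messages=[{'role': 'user', 'content': get_prompt(p, verbose=verbose)}])
    return res.message.content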
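
The prompt template can also be checked on its own, without Ollama running. The sketch below is one possible way to do that, not part of the notebook: `build_prompt` and `fake_retrieve` are hypothetical names, and the retriever is passed in as an argument so a fixed string can stand in for the cosine-similarity lookup. The template string is the same one used in `get_prompt`.

```python
def build_prompt(question, retrieve):
    """Assemble the augmented prompt; `retrieve` maps a question to a context string."""
    context = retrieve(question)
    return f"Question: {question}\nAccording to valid sources: {context}\nAnswer: "

# Hypothetical stand-in retriever that always returns the same document.
def fake_retrieve(question):
    return "An island is a piece of land completely surrounded by water."

example = build_prompt("What is an island?", fake_retrieve)
print(example)
```

Because the model only ever sees this string, inspecting it directly is a quick way to confirm what context a given question would be answered with.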

## Testing

In [132]:
prompt = input("Enter your prompt: ")
get_response(prompt, verbose=True)

Enter your prompt:  I want to live in a place sorrounded by water, which place should I live?


[[-0.06807114 -0.00496201  0.16959977]]
Question: I want to live in a place sorrounded by water, which place should I live?
According to valid sources: An island or isle is a piece of land, distinct from a continent, completely surrounded by water. There are continental islands, which were formed by being split from a continent by plate tectonics, and oceanic islands, which have never been part of a continent. Oceanic islands can be formed from volcanic activity, grow into atolls from coral reefs, and form from sediment along shorelines, creating barrier islands. River islands can also form from sediment and debris in rivers. Artificial islands are those made by humans, including small rocky outcroppings built out of lagoons and large-scale land reclamation projects used for development.
Islands are host to diverse plant and animal life. Oceanic islands have the sea as a natural barrier to the introduction of new species, causing the species that do reach the island to evolve in isolat

"Based on your interest in living in a place surrounded by water, here are some options to consider:\n\n1. Oceanic Islands: These are islands that have never been part of a continent and are formed through volcanic activity, coral reef growth, or sediment deposition along shorelines. Examples include Hawaii, Bali, and the Maldives.\n2. Continental Islands: These are islands that were once part of a continent but have since been separated by plate tectonics. Examples include Great Britain, Honshu (Japan), and the Australian mainland.\n3. River Islands: These are formed through sediment deposition in rivers and can be found in various locations around the world. Examples include the Thousand Islands in Canada and New York State, and the Sunderbans in India and Bangladesh.\n4. Artificial Islands: These are islands created by humans through land reclamation projects or small rocky outcroppings built out of lagoons. Examples include the Palm Jumeirah in Dubai and the man-made islands of Sin

In [138]:
prompt = input("Enter your prompt: ")
get_response(prompt, verbose=True)

Enter your prompt:  What language was created in the 1970s by Dennis Ritchie? I know It has found lasting use in operating systems code, device drivers, and protocol stacks.


[[0.04373466 0.04111786 0.02356716]]
Question: What language was created in the 1970s by Dennis Ritchie? I know It has found lasting use in operating systems code, device drivers, and protocol stacks.
According to valid sources: C (pronounced  – like the letter c) is a general-purpose programming language. It was created in the 1970s by Dennis Ritchie and remains very widely used and influential. By design, C's features cleanly reflect the capabilities of the targeted CPUs. It has found lasting use in operating systems code (especially in kernels), device drivers, and protocol stacks, but its use in application software has been decreasing. C is commonly used on computer architectures that range from the largest supercomputers to the smallest microcontrollers and embedded systems.
A successor to the programming language B, C was originally developed at Bell Labs by Ritchie between 1972 and 1973 to construct utilities running on Unix. It was applied to re-implementing the kernel of the 

'The language created by Dennis Ritchie in the 1970s is C.'
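
The similarity rows printed above can be read more easily once they are paired with the document titles. The sketch below is a small helper for inspecting them and is not part of the notebook: `rank_scores` and `top_margin` are hypothetical names, the titles are listed in the same order as the corpus, and the score lists are copied from the two verbose outputs.

```python
def rank_scores(scores, titles):
    """Pair each title with its score and sort from most to least similar."""
    return sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)

def top_margin(scores):
    """Gap between the best and second-best score (needs at least two scores)."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[1]

titles = ["C (programming language)", "Bouldering", "Island"]

# Similarity rows copied from the two verbose outputs above.
island_query = [-0.06807114, -0.00496201, 0.16959977]
c_query = [0.04373466, 0.04111786, 0.02356716]

print(rank_scores(island_query, titles)[0][0])
print(rank_scores(c_query, titles)[0][0])
print(top_margin(island_query), top_margin(c_query))
```

For the island question the top document leads by roughly 0.17, while for the C question the top two scores differ by less than 0.003. Both queries retrieved the expected document, but the second did so by a narrow margin.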