In [1]:
# Set the locale to UTF-8 to resolve the encoding issue
import os
os.environ["LC_ALL"] = "en_US.UTF-8"
os.environ["LANG"] = "en_US.UTF-8"

# Check if the locale is set correctly
!locale

LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8


In [7]:
!python --version

Python 3.11.11


In [8]:
!pip freeze > requirements.txt


In [10]:
from google.colab import files
files.download("requirements.txt")


In [11]:
# torch is a Python package, not a shell command, so query its version from Python
!python -c "import torch; print(torch.__version__)"


# Introduction

**Objective**

Use Llama 2, LangChain, and ChromaDB to build a Retrieval Augmented Generation (RAG) system. This lets us ask questions about our own documents (which were not part of the training data) without fine-tuning the Large Language Model (LLM). With RAG, each question first triggers a retrieval step that fetches relevant documents from a vector database in which those documents were indexed; the retrieved text is then passed to the LLM as context.

**Definitions**

- LLM - Large Language Model

- Llama 2 - LLM from Meta

- LangChain - a framework designed to simplify the creation of applications using LLMs

- Vector database - a database that stores data as high-dimensional vectors and retrieves it by vector similarity

- ChromaDB - a vector database

- RAG - Retrieval Augmented Generation (described in more detail below)

Llama 2 is a family of models pretrained on 2 trillion tokens and then fine-tuned, with sizes ranging from 7 to 70 billion parameters, which makes it one of the more powerful openly available models. It is a significant improvement over Llama 1.



**What is a Retrieval Augmented Generation (RAG) system?**

🔹 Think of RAG like a Smart AI Assistant with Internet Access! Instead of answering questions based only on what it "remembers" (like ChatGPT), RAG first retrieves relevant documents from a knowledge base and then uses that information to generate a response.

💡 Analogy: RAG is like an open-book exam.

- 🏫 Standard LLM (such as GPT-4) → answers only from memory (closed-book exam)

- 📖 RAG → first looks up useful information, then answers (open-book exam)

Implementing RAG: Retrieval Augmented Generation combines external resources with LLMs. A RAG system therefore has two main components, a retriever and a generator. The retriever finds the relevant passages in the external resources (it typically encodes the question and the documents as vectors so they can be compared), and the generator is the LLM, which produces the response using those passages as context.
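The retriever/generator split described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the retriever here ranks documents by word overlap instead of vector similarity, `fake_llm` is a hypothetical stand-in for the real model, and the prompt wording is invented for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, documents, k=2):
    """Toy retriever: rank documents by how many question words they share."""
    q_counts = Counter(tokenize(question))
    scored = []
    for doc in documents:
        d_counts = Counter(tokenize(doc))
        overlap = sum(min(q_counts[w], d_counts[w]) for w in q_counts)
        scored.append((overlap, doc))
    # Stable sort: documents with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_prompt(question, context_docs):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def fake_llm(prompt):
    """Hypothetical generator; a real system would call the LLM here."""
    return f"[LLM would answer using {len(prompt)} prompt characters]"

def rag_answer(question, documents, k=2):
    """Retrieve first, then generate from the augmented prompt."""
    return fake_llm(build_prompt(question, retrieve(question, documents, k)))

docs = [
    "ChromaDB stores document embeddings for similarity search.",
    "Llama 2 is a family of large language models from Meta.",
    "Bananas are rich in potassium.",
]
print(retrieve("Which database stores embeddings?", docs, k=1))
print(rag_answer("Which database stores embeddings?", docs))
```

The rest of the notebook replaces each toy piece with a real one: embeddings and ChromaDB for retrieval, and Llama 2 for generation.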

# Installations, imports, utils

Libraries we are going to use:

- transformers (4.33.0): Developed by Hugging Face, this library is one of the most popular for working with pre-trained models like GPT, BERT, T5, etc. It provides easy access to state-of-the-art natural language processing (NLP) models for tasks such as text generation, classification, translation, summarization, and more.

- accelerate (0.22.0): This library is also from Hugging Face and is focused on optimizing training and inference workflows, especially for large models. It simplifies the process of using GPUs or multiple GPUs and distributed computing, helping speed up computations when working with deep learning models.

- einops (0.6.1): A lightweight library for efficient tensor operations, particularly focusing on the transformation of tensors (multi-dimensional arrays). It provides simple, expressive, and high-performance ways to manipulate tensors, such as reshaping, permuting, and splitting them, making it useful in deep learning workflows.

- langchain (0.0.300): As mentioned earlier, LangChain is a framework for building applications that integrate language models with other components, like databases, APIs, and external data sources. It's useful for creating sophisticated NLP applications, including chains, agents, memory, and data processing.

- xformers (0.0.21): A library that provides implementations of various transformer-based architectures optimized for performance. Xformers aims to provide memory- and compute-efficient transformer layers, helping scale NLP models more efficiently, especially on large datasets.

- bitsandbytes (0.41.1): This library is designed for low-precision training (quantization) and optimization of deep learning models. It helps reduce the memory usage of large models and speeds up training by using smaller numerical precisions like 8-bit integers.

- sentence_transformers (2.2.2): This library is designed for sentence embeddings, allowing you to easily compute vector representations of sentences. It's widely used for tasks such as semantic textual similarity, clustering, and retrieval-based applications.

- chromadb (0.4.12): ChromaDB is a vector database specifically built for managing embeddings. It allows you to store, query, and search over large sets of embeddings generated by models like sentence transformers. It’s often used in applications involving similarity search, like document retrieval or recommendation systems.



In [2]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12 wandb==0.16.0 pydantic==1.10.8
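Because the install command pins exact versions, it can be useful to confirm what actually ended up in the environment. The helper below is a small sketch using only the standard library; `installed_versions` is a name introduced here for illustration, and packages that are not installed are reported as `None`.

```python
from importlib import metadata

def installed_versions(package_names):
    """Return {distribution name: installed version, or None if missing}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

pinned = ["transformers", "accelerate", "einops", "langchain",
          "bitsandbytes", "sentence_transformers", "chromadb"]
for name, version in installed_versions(pinned).items():
    print(f"{name}: {version}")
```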



In [25]:
!pip show torchaudio

Name: torchaudio
Version: 2.5.1+cu124
Summary: An audio package for PyTorch
Home-page: https://github.com/pytorch/audio
Author: Soumith Chintala, David Pollack, Sean Naren, Peter Goldsborough, Moto Hira, Caroline Chen, Jeff Hwang, Zhaoheng Ni, Xiaohui Zhang
Author-email: soumith@pytorch.org
License: 
Location: /usr/local/lib/python3.11/dist-packages
Requires: torch
Required-by: 


In [3]:
# torch.cuda and bfloat16: Enable GPU acceleration and low-precision (bfloat16) training for faster and more memory-efficient deep learning.
from torch import cuda, bfloat16

# torch: Provides core tensor operations, which are essential for deep learning model manipulation and training.
import torch

# transformers: Hugging Face library to load pre-trained transformer models for various NLP tasks.
import transformers

# AutoTokenizer: Automatically loads the appropriate tokenizer for a pre-trained transformer model, enabling tokenization of text.
from transformers import AutoTokenizer

# time: Used for measuring execution time, typically for performance benchmarking and timing operations.
from time import time

# langchain.llms.HuggingFacePipeline: Integrates Hugging Face transformer models into LangChain pipelines for advanced NLP applications.
from langchain.llms import HuggingFacePipeline

# langchain.document_loaders.TextLoader: Loads and processes documents from text files to feed into NLP models.
from langchain.document_loaders import TextLoader

# langchain.text_splitter.RecursiveCharacterTextSplitter: Splits long text documents into smaller chunks to make them manageable for processing.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# langchain.embeddings.HuggingFaceEmbeddings: Uses Hugging Face models to generate vector embeddings (numerical representations) for text.
from langchain.embeddings import HuggingFaceEmbeddings

# langchain.chains.RetrievalQA: Builds a question-answering system by retrieving relevant documents from a database and using a model to generate answers.
from langchain.chains import RetrievalQA

# langchain.vectorstores.Chroma: Stores and retrieves vector embeddings efficiently, enabling similarity search for document retrieval.
from langchain.vectorstores import Chroma

from langchain.embeddings.base import Embeddings  # Import base class for custom embeddings

from langchain.schema import Document  # Import Document class to structure text data

import spacy  # Import spaCy for NLP processing


# Initialize model, tokenizer, query pipeline

Define the model, the device, and the bitsandbytes configuration.

In [4]:
# model_id: Hugging Face Hub identifier of the pre-trained model to load.
# 'meta-llama/Llama-2-7b-chat-hf' is the 7B-parameter chat variant of Llama 2,
# a conversational model trained by Meta.
model_id = 'meta-llama/Llama-2-7b-chat-hf'


# device: Checks if a GPU is available using `cuda.is_available()` and selects the appropriate device.
# If a GPU is available, it assigns the current CUDA device (`cuda:<device_id>`), otherwise, it defaults to CPU.
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# bnb_config: Configures the settings for loading the model with reduced precision using quantization.
# This setup helps reduce the GPU memory footprint by loading the model in a lower-precision format (4-bit).
# It utilizes the `bitsandbytes` library, which allows efficient low-precision training and inference.
# The configuration options:
# - `load_in_4bit=True`: Load the model weights in 4-bit precision, reducing memory usage.
# - `bnb_4bit_quant_type='nf4'`: Uses the 4-bit NormalFloat (NF4) quantization type for compressing the model weights.
# - `bnb_4bit_use_double_quant=True`: Enables double quantization (the quantization constants are quantized too), which further improves memory efficiency.
# - `bnb_4bit_compute_dtype=bfloat16`: Specifies that computations run in the `bfloat16` data type (16-bit precision); the 4-bit weights are dequantized to it for matrix multiplications.
# This configuration is especially useful when working with large models that would otherwise be too large to fit into memory on a GPU.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
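To see why 4-bit loading matters, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a rough sketch: it assumes a nominal 7 billion parameters and ignores quantization constants, activations, and the key/value cache, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

nominal_params = 7e9  # assumption: nominal size of the 7B model
for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>18}: {weight_memory_gb(nominal_params, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB in 16-bit precision to about 3.5 GB in 4-bit, which is what lets the model fit on a single commodity GPU.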


In [5]:
device

'cuda:0'

Prepare the model and the tokenizer.

In [7]:
# time_1: Captures the current time at the start of the model preparation process.
# It is used for calculating the time taken to load the model and tokenizer.
time_1 = time()

# model_config: Loads the configuration settings for the pre-trained model from Hugging Face using the provided model_id.
# This configuration includes hyperparameters and architecture settings necessary for the model.
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)

# model: Loads the pre-trained model from Hugging Face using the provided model_id.
# This also applies the quantization settings (`bnb_config`) to reduce memory usage and enables device placement (GPU/CPU).
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # Allows loading of custom code from the remote repository if needed.
    config=model_config,  # Uses the configuration loaded earlier to correctly initialize the model.
    quantization_config=bnb_config,  # Applies the quantization configuration (e.g., 4-bit loading) to reduce memory usage.
    device_map='auto',  # Automatically places model layers across available devices (GPU/CPU).
)

# tokenizer: Loads the tokenizer associated with the pre-trained model.
# Tokenizer is responsible for converting text into tokens (inputs for the model) and decoding model outputs back to human-readable text.
tokenizer = AutoTokenizer.from_pretrained(model_id)

# time_2: Captures the current time after the model and tokenizer have been loaded.
# This helps in calculating the total time taken to prepare the model and tokenizer.
time_2 = time()

# Prints the time taken to prepare the model and tokenizer, rounded to 3 decimal places.
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Prepare model, tokenizer: 78.999 sec.
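The `time_1` / `time_2` pattern is repeated in several cells. One possible way to factor it out is a small context manager; `timed` is a hypothetical helper name, not something the notebook defines, and it uses `perf_counter`, which is intended for measuring elapsed intervals.

```python
from contextlib import contextmanager
from time import perf_counter, sleep

@contextmanager
def timed(label):
    """Print how long the enclosed block took; also expose it via a dict."""
    result = {"elapsed": None}
    start = perf_counter()
    try:
        yield result
    finally:
        result["elapsed"] = perf_counter() - start
        print(f"{label}: {result['elapsed']:.3f} sec.")

# Usage: wrap any step that should be timed.
with timed("Sleep example") as timing:
    sleep(0.01)
```

In the notebook, the model loading code would go inside a `with timed("Prepare model, tokenizer"):` block in place of the manual timestamps.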


Define the query pipeline.

In [8]:
# time_1: Captures the current time at the start of the pipeline preparation process.
# This helps in calculating how much time it takes to prepare the pipeline.
time_1 = time()

# query_pipeline: Creates a text generation pipeline using the Hugging Face Transformers library.
# This pipeline will generate text based on the input queries, such as a prompt or a question.
query_pipeline = transformers.pipeline(
    "text-generation",  # Specifies the task to be performed by the pipeline, which is text generation in this case.
    model=model,  # Passes the pre-trained model (in this case, a causal language model like Llama).
    tokenizer=tokenizer,  # Passes the tokenizer that corresponds to the model, responsible for tokenizing input text.
    torch_dtype=torch.float16,  # Specifies the datatype for the model's weights. Using float16 for reduced memory usage and faster computation.
    device_map="auto",  # Automatically assigns model layers to available devices (GPU/CPU). This helps in multi-GPU environments.
)

# time_2: Captures the current time after the pipeline has been prepared.
# This helps in calculating the total time taken to prepare the pipeline.
time_2 = time()

# Prints the time taken to prepare the pipeline, rounded to 3 decimal places.
# This gives you an estimate of how long it took to load the model and set up the pipeline.
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")


Prepare pipeline: 1.009 sec.


We define a function for testing the pipeline.

In [9]:
def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Perform a query and print the result.

    Args:
        tokenizer: the tokenizer used to process input text (convert to tokens for the model).
        pipeline: the pipeline that performs text generation (using a pre-trained model).
        prompt_to_test: the input text (prompt) that will be used to generate text.

    Returns:
        None
    """

    # adapted from https://huggingface.co/blog/llama2#using-transformers

    # time_1: Captures the current time at the start of the text generation process.
    # This will be used to calculate how long the inference (text generation) takes.
    time_1 = time()

    # sequences: Runs the pipeline to generate text based on the provided prompt.
    # The `pipeline()` function returns a list of generated sequences.
    sequences = pipeline(
        prompt_to_test,  # The input prompt (text) used to generate the response.

        # do_sample=True: Randomly samples from the distribution of possible tokens during generation,
        # rather than picking the token with the highest probability (this makes output more diverse).
        do_sample=True,

        # top_k=10: Limits the token sampling to the top 10 most probable next tokens at each step.
        # This adds randomness by restricting choices but still focusing on high probability tokens.
        top_k=10,

        # num_return_sequences=1: Specifies how many different sequences should be returned.
        # In this case, it generates one sequence of text for the given prompt.
        num_return_sequences=1,

        # eos_token_id: The token that signals the end of the generated sequence.
        # This is provided by the tokenizer, ensuring the model stops generating at an appropriate point.
        eos_token_id=tokenizer.eos_token_id,

        # max_length=200: Caps the total sequence length, prompt plus generated tokens, at 200 tokens.
        # This helps avoid excessively long generation.
        max_length=200,
    )

    # time_2: Captures the current time after the text generation is complete.
    # This allows you to measure how long the inference process took.
    time_2 = time()

    # Prints the time it took to run the inference (text generation) process.
    # The time difference is rounded to 3 decimal places for cleaner output.
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")

    # For each sequence (output text) in the generated sequences list,
    # print the resulting generated text.
    for seq in sequences:
        # seq['generated_text']: Access the generated text from the sequence (output).
        print(f"Result: {seq['generated_text']}")


**Test the query pipeline**

We test the pipeline with a query about the meaning of State of the Union (SOTU).

In [10]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

Test inference: 8.325 sec.
Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.
The State of the Union address is an annual speech delivered by the President of the United States to a joint session of Congress, in which the President reviews the current state of the union, outlines policy goals and proposals, and seeks to rally and inspire the American people.
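The `do_sample=True` and `top_k=10` settings used in `test_model` can be illustrated with a toy version of top-k sampling. This is a simplified sketch over a hand-written list of scores, not the Transformers implementation: it keeps the `k` highest-scoring tokens, applies a softmax to those only, and samples one of them.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring entries of `logits`."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (subtract the max for stability).
    largest = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - largest) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)                 # seeded so runs are repeatable
logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # made-up scores for a 5-token vocabulary
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print("Token ids seen with k=2:", sorted(set(samples)))
```

With `k=2` only the two best tokens can ever be drawn, and with `k=1` the procedure reduces to greedy decoding.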


# Retrieval Augmented Generation

**Check the model with a HuggingFace pipeline**

We check the model with a HF pipeline, using a query about the meaning of State of the Union (SOTU).

In [11]:
# HuggingFacePipeline wraps the transformer pipeline created earlier into a LangChain-friendly object
# 'query_pipeline' refers to a previously created pipeline with model and tokenizer set up for text generation.
llm = HuggingFacePipeline(pipeline=query_pipeline)

# Now we are testing the model to see if everything is working correctly
# We are passing a prompt for the model to process and generate a response
# The prompt here is asking for an explanation of the "State of the Union address" in 100 words.

llm(prompt="Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")


'\n\nThe State of the Union address is an annual speech given by the President of the United States to a joint session of Congress, in which the President reports on the current state of the union and outlines policy goals and legislative proposals for the upcoming year. The address is a key moment in the political calendar and is closely watched by lawmakers, the public, and the media.'

**Ingestion of data using TextLoader**

We will ingest the 2023 State of the Union address, delivered in February 2023.

In [12]:
# Import the necessary module for loading text files
from langchain.document_loaders import TextLoader

# Create an instance of the TextLoader class and specify the file path
# "/content/biden-sotu-2023-planned-official.txt" - This is the text file to be loaded
# encoding="utf8" - Ensures that the file is read using UTF-8 encoding to handle special characters
loader = TextLoader("/content/biden-sotu-2023-planned-official.txt", encoding="utf8")

# Load the document into memory
# This reads the file and stores its contents in a list of Document objects
documents = loader.load()


In [27]:
documents

[Document(page_content='Mr. Speaker. Madam Vice President. Our First Lady and Second Gentleman. Members of Congress and the Cabinet. Leaders of our military. Mr. Chief Justice, Associate Justices, and retired Justices of the Supreme Court. And you, my fellow Americans. I start tonight by congratulating the members of the 118th Congress and the new Speaker of the House, Kevin McCarthy. Mr. Speaker, I look forward to working together. I also want to congratulate the new leader of the House Democrats and the first Black House Minority Leader in history, Hakeem Jeffries. Congratulations to the longest serving Senate Leader in history, Mitch McConnell. And congratulations to Chuck Schumer for another term as Senate Majority Leader, this time with an even bigger majority. And I want to give special recognition to someone who I think will be considered the greatest Speaker in the history of this country, Nancy Pelosi. The story of America is a story of progress and resilience. Of always movin

**Split data in chunks**

We split data in chunks using a recursive character text splitter.

In [13]:
# Create an instance of RecursiveCharacterTextSplitter
# chunk_size=1000 specifies that each chunk should have a maximum of 1000 characters
# chunk_overlap=20 ensures that there is a 20-character overlap between consecutive chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

# Split the input documents into smaller chunks based on the specified chunk size and overlap
all_splits = text_splitter.split_documents(documents)
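The effect of `chunk_size` and `chunk_overlap` can be seen with a simplified fixed-window splitter. Note the assumption: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, whereas this sketch cuts at exact character offsets, so it only illustrates how overlapping windows work.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is the role the 20-character overlap plays in the notebook: text near a boundary appears in both neighbouring chunks.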


In [14]:
all_splits

[Document(page_content='Mr. Speaker. Madam Vice President. Our First Lady and Second Gentleman. Members of Congress and the Cabinet. Leaders of our military. Mr. Chief Justice, Associate Justices, and retired Justices of the Supreme Court. And you, my fellow Americans. I start tonight by congratulating the members of the 118th Congress and the new Speaker of the House, Kevin McCarthy. Mr. Speaker, I look forward to working together. I also want to congratulate the new leader of the House Democrats and the first Black House Minority Leader in history, Hakeem Jeffries. Congratulations to the longest serving Senate Leader in history, Mitch McConnell. And congratulations to Chuck Schumer for another term as Senate Majority Leader, this time with an even bigger majority. And I want to give special recognition to someone who I think will be considered the greatest Speaker in the history of this country, Nancy Pelosi. The story of America is a story of progress and resilience. Of always movin

In [15]:
type(all_splits)

list

**Creating Embeddings and Storing in Vector Store**

Create the embeddings using spaCy's `en_core_web_lg` word vectors, wrapped in a custom LangChain `Embeddings` class, and store them in Chroma.

In [16]:
import spacy

In [17]:
!python -m spacy download en_core_web_lg


Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     587.7/587.7 MB 3.3 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [18]:
len(all_splits)

43

In [19]:
# Load spaCy's large English model (pre-trained word vectors)
nlp = spacy.load("en_core_web_lg")

# Define a custom embeddings class based on spaCy
class SpacyEmbeddings(Embeddings):  # Inherit from LangChain's Embeddings base class
    def __init__(self, model):
        """Initialize the custom embeddings class with a spaCy model."""
        self.model = model  # Store the spaCy model instance

    def embed_documents(self, texts):
        """Embed a list of documents and return a list of lists (instead of NumPy arrays)."""
        return [self.model(text).vector.tolist() for text in texts]  # Convert numpy arrays to lists

    def embed_query(self, text):
        """Embed a single query."""
        return self.model(text).vector.tolist()  # Convert numpy array to list

# Instantiate the embeddings class with the spaCy model
spacy_embeddings = SpacyEmbeddings(nlp)

# 'all_splits' is already a list of LangChain `Document` objects.
# Copy the text and metadata of each chunk; calling str() on a Document would embed
# its printed representation (including the "page_content=" prefix) instead of the text.
document_objects = [Document(page_content=chunk.page_content, metadata=chunk.metadata) for chunk in all_splits]

# Import Chroma vector store
from langchain.vectorstores import Chroma

# Create the Chroma vector store using the custom spaCy embeddings class
vectordb = Chroma.from_documents(
    documents=document_objects,  # List of document objects to store
    embedding=spacy_embeddings,  # Use the custom spaCy-based embedding class
    persist_directory="chroma_db"  # Directory to persist the vector database
)
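Conceptually, the cell above does two things: spaCy turns each chunk into a single vector (by default the average of its token vectors), and the vector store later compares a query vector against the stored ones. The sketch below imitates both steps with made-up 2-dimensional word vectors. Cosine similarity is used for illustration only; the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k stored vectors most similar to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

# Hypothetical word vectors, invented for this example.
word_vectors = {"tax": [1.0, 0.0], "economy": [0.9, 0.1],
                "climate": [0.0, 1.0], "energy": [0.1, 0.9]}
doc_vecs = [mean_vector([word_vectors[w] for w in words])
            for words in (["tax", "economy"], ["climate", "energy"])]
query_vec = mean_vector([word_vectors["economy"]])
print(top_k(query_vec, doc_vecs, k=1))
```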


**Initialize chain**

In [20]:
# Create a retriever object from the vector database, which will be used to search for relevant documents
retriever = vectordb.as_retriever()

# Create a RetrievalQA object to handle question-answering tasks
qa = RetrievalQA.from_chain_type(
    llm=llm,  # Use the specified language model (LLM) for generating answers
    chain_type="stuff",  # The chain type "stuff" means all retrieved documents are stuffed into the LLM input
    retriever=retriever,  # Attach the retriever to fetch relevant documents
    verbose=True  # Enable verbose mode to display additional details during execution
)


**Test the Retrieval-Augmented Generation**

We define a test function that runs the query and times it.

In [21]:
def test_rag(qa, query):
    """
    Function to test a retrieval-augmented generation (RAG) model.

    Parameters:
    qa: The RAG model or pipeline that takes a query and returns a response.
    query: The input question or prompt for the RAG model.

    This function prints the query, measures the inference time, and displays the result.
    """

    # Print the query being tested
    print(f"Query: {query}\n")

    # Record the start time before inference
    time_1 = time()

    # Run the query through the RAG model (qa) and store the result
    result = qa.run(query)

    # Record the end time after inference
    time_2 = time()

    # Calculate and print the inference time (rounded to 3 decimal places)
    print(f"Inference time: {round(time_2 - time_1, 3)} sec.")

    # Print the retrieved/generated response
    print("\nResult: ", result)


Let's check a few queries.

In [22]:
query = "What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 50.514 sec.

Result:   The main topics in the 2023 State of the Union address by President Joe Biden were:

* Congratulating new members of Congress and leadership
* Recognizing achievements and progress in the country
* Addressing the deficit and national debt
* Calling for bipartisan cooperation and passing legislation
* Highlighting accomplishments of the previous two years
* Pushing for permanent expansion of Medicaid coverage
* Investing in clean energy and infrastructure
* Addressing climate change and natural disasters

Overall, President Biden emphasized the need for cooperation and progress in various areas, including healthcare, infrastructure, and the environment.


In [23]:
query = "What is the nation economic status? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What is the nation economic status? Summarize. Keep it under 200 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 19.948 sec.

Result:   The nation's economic status is mixed. While the stock market is at an all-time high, many Americans are still struggling to make ends meet. The middle class has been hollowed out, and too many good-paying manufacturing jobs have moved overseas. Factories at home have closed down, and once-thriving cities and towns have become shadows of what they used to be. The climate crisis is an existential threat, and the wealthiest and biggest corporations need to pay their fair share in taxes. The President believes that the country needs to rebuild the backbone of America, the middle class, and unite the country. The President also believes that the present tax system is unfair and needs to be changed.


**Document sources**

In [47]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")

Query: What is the nation economic status? Summarize. Keep it under 200 words.
Retrieved documents: 4
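Beyond counting the retrieved documents, it is often helpful to look at where each one came from. The sketch below is self-contained, so it uses a hypothetical `Doc` dataclass as a stand-in for LangChain's `Document`. It assumes each document has a `page_content` string and a `metadata` dict; `TextLoader` records the file path under the `source` key, but whether that entry is still present depends on how the documents were passed to the vector store.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def summarize_sources(docs, preview_chars=80):
    """Return one line per document: index, source, and a short text preview."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        # Collapse whitespace so the preview stays on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{i}] {source}: {preview}")
    return lines

example_docs = [
    Doc("The story of America is a story of progress and resilience.",
        {"source": "speech.txt"}),
    Doc("A chunk   without\nmetadata."),
]
for line in summarize_sources(example_docs):
    print(line)
```

The same function could be applied to the `docs` list returned by `vectordb.similarity_search(query)`, since it only reads the two attributes named above.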


# Export the libraries used to run this notebook

In [27]:
import sys
import pkgutil
from importlib import metadata as importlib_metadata  # in the standard library since Python 3.8

# Names of all top-level modules that can be imported in this environment
available_modules = {name for _, name, _ in pkgutil.iter_modules()}

# Keep only the modules that have actually been imported in this session
imported_modules = {module for module in sys.modules if module in available_modules}

# Map every installed distribution to its version
installed_packages = {dist.name: dist.version for dist in importlib_metadata.distributions()}

# Keep the installed packages whose distribution name matches an imported module name.
# Packages whose import name differs from their distribution name are not captured.
required_packages = {pkg: installed_packages[pkg] for pkg in imported_modules if pkg in installed_packages}

# Save to requirements.txt
with open("requirements.txt", "w") as f:
    for pkg, version in required_packages.items():
        f.write(f"{pkg}=={version}\n")

print("Filtered requirements.txt created successfully!")

# Download the file
from google.colab import files
files.download("requirements.txt")

Filtered requirements.txt created successfully!

