## Lesson 2: Your RAG Prototype


In this lesson, you will build a RAG prototype in this notebook; in the next lesson, you will learn how to automate it. You are provided with text files containing book descriptions. You will create an embedding from each book description and store the embeddings in a vector database. Here's what you will do:
- read book descriptions from the text files stored under `include/data`
- use `fastembed` to create the vector embedding for each book description
- store the embeddings and the book metadata in a local `weaviate` database

### Import libraries

In [1]:
!pip install weaviate-client==4.14.1 fastembed==0.6.1 ipython --quiet


In [2]:
import os
import json
from IPython.display import JSON

from fastembed import TextEmbedding

import weaviate
from weaviate.classes.data import DataObject

from helper import suppress_output



In [3]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

<p style="background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px"> 💻 &nbsp; <b>To access <code>requirements.txt</code> and <code>helper.py</code> files, and <code>include</code> folder:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook, 2) click on <em>"Open"</em> and then 3) click on <em>"L2"</em>. For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>

### Set variables

In [4]:
COLLECTION_NAME = "Books"  # capitalize the first letter of collection names
BOOK_DESCRIPTION_FOLDER = "include/data"
EMBEDDING_MODEL_NAME = "BAAI/bge-small-en-v1.5"

Note regarding the variable `COLLECTION_NAME`: Weaviate stores data in ["collections"](https://weaviate.io/developers/academy/py/starter_text_data/text_collections/create_collection). A collection is a set of objects that share the same data structure. In the Weaviate instance of this lesson, you will create a collection of books. Each book object will have a vector embedding and a set of properties.

### Instantiate Embedded Weaviate client

You will now create a local Weaviate instance: [Embedded Weaviate](https://weaviate.io/developers/weaviate/connections/connect-embedded), which is a way to run a Weaviate instance from your application code rather than from a stand-alone Weaviate server installation. 

In the next lessons, you will instead work with a stand-alone installation: you'll be provided with a Weaviate instance running in a [Docker](https://docs.docker.com/) container.

In [5]:
with suppress_output():
    client = weaviate.connect_to_embedded(
        persistence_data_path="tmp/weaviate",
    )
print("Started new embedded Weaviate instance.")
print(f"Client is ready: {client.is_ready()}")

Started new embedded Weaviate instance.
Client is ready: True


### Create the collection 

You will now create the Books collection inside the Weaviate instance.

In [6]:
existing_collections = client.collections.list_all()
existing_collection_names = existing_collections.keys()

if COLLECTION_NAME not in existing_collection_names:
    print(f"Collection {COLLECTION_NAME} does not exist yet. Creating it...")
    collection = client.collections.create(name=COLLECTION_NAME)
    print(f"Collection {COLLECTION_NAME} created successfully.")
else:
    print(f"Collection {COLLECTION_NAME} already exists. No action taken.")
    collection = client.collections.get(COLLECTION_NAME)

Collection Books does not exist yet. Creating it...
Collection Books created successfully.


### Extract text from local files

You are provided with the `BOOK_DESCRIPTION_FOLDER` (`include/data`) inside the L2 directory. It contains text files, each of which holds several book descriptions. You'll now list the text files to see how many you are provided with.

In [7]:
# list the book description files
book_description_files = [
    f for f in os.listdir(BOOK_DESCRIPTION_FOLDER)
    if f.endswith('.txt')
]

print(f"The following files with book descriptions were found: {book_description_files}")

The following files with book descriptions were found: ['book_descriptions_1.txt', 'book_descriptions_2.txt', 'my_book_descriptions.txt']


You'll add another file that contains some additional book descriptions. Feel free to add your own book description file. 

In [8]:
# Add your own book description file
# Format 
# [Integer Index] ::: [Book Title] ([Release year]) ::: [Author] ::: [Description]

my_book_description = """0 ::: The Idea of the World (2019) ::: Bernardo Kastrup ::: An ontological thesis arguing for the primacy of mind over matter.
1 ::: Exploring the World of Lucid Dreaming (1990) ::: Stephen LaBerge ::: A practical guide to learning and enjoying lucid dreams.
"""

# Write to file
with open(f"{BOOK_DESCRIPTION_FOLDER}/my_book_descriptions.txt", 'w') as f:
    f.write(my_book_description)

You'll now loop through each text file. For each text file, you will read each line, which corresponds to one book, and extract the title, author, and text description of that book. You will save the data as one list of Python dictionaries per file, where each dictionary corresponds to one book.
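To make the line format concrete, here is a minimal sketch of parsing a single line. The helper name `parse_book_line` and the constant `FIELD_SEPARATOR` are hypothetical and not part of the lesson code; the sketch assumes every valid line has exactly four `:::`-separated fields.

```python
from typing import Dict

FIELD_SEPARATOR = ":::"  # separator used in the book description files


def parse_book_line(line: str) -> Dict[str, str]:
    """Split one line into title, author, and description (hypothetical helper)."""
    parts = [part.strip() for part in line.split(FIELD_SEPARATOR)]
    if len(parts) != 4:
        # fail loudly on malformed lines instead of raising an IndexError later
        raise ValueError(f"Expected 4 fields, got {len(parts)}: {line!r}")
    _index, title, author, description = parts
    return {"title": title, "author": author, "description": description}


sample_line = (
    "0 ::: The Idea of the World (2019) ::: Bernardo Kastrup ::: "
    "An ontological thesis arguing for the primacy of mind over matter."
)
book = parse_book_line(sample_line)
print(book["title"])   # The Idea of the World (2019)
print(book["author"])  # Bernardo Kastrup
```

Note that the release year stays part of the title, which matches how the lesson stores it.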

In [9]:
book_description_files = [
    f for f in os.listdir(BOOK_DESCRIPTION_FOLDER)
    if f.endswith('.txt')
]

list_of_book_data = []

for book_description_file in book_description_files:
    with open(
        os.path.join(BOOK_DESCRIPTION_FOLDER, book_description_file), "r"
    ) as f:
        # skip blank lines, which would otherwise break the split below
        lines = [line for line in f.readlines() if line.strip()]

    # each line has the format:
    # index ::: title (year) ::: author ::: description
    split_lines = [
        [field.strip() for field in line.split(":::")] for line in lines
    ]

    book_descriptions = [
        {
            "title": fields[1],
            "author": fields[2],
            "description": fields[3],
        }
        for fields in split_lines
    ]

    # one list of book dictionaries per file
    list_of_book_data.append(book_descriptions)

In [10]:
JSON(json.dumps(list_of_book_data))

<IPython.core.display.JSON object>

### Create vector embeddings from descriptions

For each book in the list of book data you extracted, you will now create an embedding vector based on the text description. You will store the embedding vectors in the list `list_of_description_embeddings`, which holds one sub-list of vectors per file.

In [11]:
embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)  

list_of_description_embeddings = []

for book_data in list_of_book_data:
    book_descriptions = [book["description"] for book in book_data]
    # embed() returns a generator, so all descriptions of a file can be
    # passed in one call and collected into a list
    description_embeddings = list(embedding_model.embed(book_descriptions))

    list_of_description_embeddings.append(description_embeddings)


### Load embeddings to Weaviate

In the `Books` collection of Weaviate, you will create one item for each book. Each item has two attributes:
- `vector`: the vector embedding of the book's text description
- `properties`: a Python dictionary that contains the book metadata: title, author, and text description.
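The pairing of metadata and vectors can be sketched in plain Python, without Weaviate. The data below is made up for illustration; it only mirrors the nesting used in this notebook (one sub-list per file), and the dictionary keys stand in for the two attributes described above.

```python
# Stand-in data: two "files", nested the same way as in this notebook.
book_files = [
    [{"title": "A"}, {"title": "B"}],
    [{"title": "C"}],
]
embedding_files = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6]],
]

# zip() stops at the shorter input, so check the lengths first to avoid
# silently dropping books that have no matching vector.
for books, vectors in zip(book_files, embedding_files):
    assert len(books) == len(vectors), "each book needs exactly one vector"

# Pair every book with the vector at the same position.
paired = [
    {"properties": book, "vector": vector}
    for books, vectors in zip(book_files, embedding_files)
    for book, vector in zip(books, vectors)
]

print(len(paired))  # 3
print(paired[2])    # {'properties': {'title': 'C'}, 'vector': [0.5, 0.6]}
```

The alignment relies purely on position: the n-th vector of a file belongs to the n-th book of that file.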

In [12]:
for book_data_list, emb_list in zip(list_of_book_data, list_of_description_embeddings):
    items = []
    
    for book_data, emb in zip(book_data_list, emb_list):
        item = DataObject(
            properties={
                "title": book_data["title"],
                "author": book_data["author"],
                "description": book_data["description"],
            },
            vector=emb
        )
        items.append(item)
    
    collection.data.insert_many(items)

### Query for a book recommendation using semantic search

Now that the embeddings are stored in the Weaviate instance, you can query the vector database. You are provided with a query string, which you will first map to its embedding vector. You will then pass this vector to the `query.near_vector` method of the Weaviate `Books` collection.
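To build intuition for what a near-vector search does, here is a brute-force sketch in plain Python. It assumes cosine similarity as the comparison metric and uses made-up three-dimensional vectors; the function names are hypothetical. A real vector database uses an index rather than comparing the query against every stored vector, so treat this as an illustration of the idea, not of Weaviate's implementation.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_titles(
    query: Sequence[float],
    stored: Dict[str, Sequence[float]],
    limit: int = 1,
) -> List[str]:
    """Return the `limit` titles whose vectors are most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda title: cosine_similarity(query, stored[title]),
        reverse=True,
    )
    return ranked[:limit]


# Made-up vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "Philosophy book": [0.9, 0.1, 0.0],
    "Cookbook": [0.0, 0.2, 0.9],
    "Dream guide": [0.5, 0.5, 0.1],
}
toy_query = [1.0, 0.0, 0.0]

print(nearest_titles(toy_query, toy_vectors, limit=1))  # ['Philosophy book']
```

The `limit` argument plays the same role here as the `limit` parameter in the query cell: it caps how many of the closest matches are returned.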

In [13]:
query_str = "A philosophical book"

embedding_model = TextEmbedding(EMBEDDING_MODEL_NAME)  
collection = client.collections.get(COLLECTION_NAME)

query_emb = list(embedding_model.embed([query_str]))[0]

results = collection.query.near_vector(
    near_vector=query_emb,
    limit=1,
)
for result in results.objects:
    print(f"You should read: {result.properties['title']} by {result.properties['author']}")
    print("Description:")
    print(result.properties["description"])

You should read: The Idea of the World (2019) by Bernardo Kastrup
Description:
An ontological thesis arguing for the primacy of mind over matter.


### Optional Cleanup utilities

These optional cleanup utilities can be run locally to remove the custom book description file, a collection in Weaviate, or the entire Weaviate instance.

In [14]:
# ## Remove a book description file

# import os

# file_path = f"{BOOK_DESCRIPTION_FOLDER}/my_book_descriptions.txt"

# # Remove the file
# if os.path.exists(file_path):
#     os.remove(file_path)
# else:
#     print(f"File not found: {file_path}")

In [15]:
# ## Remove a collection from an existing Weaviate instance

# client.collections.delete(COLLECTION_NAME)

In [16]:
# ## Delete a Weaviate instance
# ## This cell can take a few seconds to run  

# import shutil

# client.close()

# EMBEDDED_WEAVIATE_PERSISTENCE_PATH = "tmp/weaviate"

# if os.path.exists(EMBEDDED_WEAVIATE_PERSISTENCE_PATH):
#     shutil.rmtree(EMBEDDED_WEAVIATE_PERSISTENCE_PATH)
#     if not os.path.exists(EMBEDDED_WEAVIATE_PERSISTENCE_PATH):
#         print(f"Verified: '{EMBEDDED_WEAVIATE_PERSISTENCE_PATH}' no longer exists.")
#         print(f"Weaviate embedded data at '{EMBEDDED_WEAVIATE_PERSISTENCE_PATH}' deleted.")

### Resources

- [Weaviate Docs](https://weaviate.io/developers/weaviate)
- [What is FastEmbed?](https://qdrant.github.io/fastembed/)
- [Weaviate Short Course - Vector Databases: from Embeddings to Applications](https://www.deeplearning.ai/short-courses/vector-databases-embeddings-applications/)
- [Weaviate Short Course - Building Multimodal Search and RAG](https://www.deeplearning.ai/short-courses/building-multimodal-search-and-rag/)

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">
<p> 💻 &nbsp; <b>To access <code>requirements.txt</code> and <code>helper.py</code> files, and <code>include</code> folder:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook, 2) click on <em>"Open"</em> and then 3) click on <em>"L2"</em>.</p>

<p> ⬇ &nbsp; <b>Download Notebooks:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Download as"</em> and select <em>"Notebook (.ipynb)"</em>.</p>

<p> 📒 &nbsp; For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>

</div>