
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>


# Lab: Adding Our Own Data to a Multi-Stage Reasoning System

### Working with external knowledge bases 
In this notebook we're going to augment the knowledge base of our LLM with additional data. We will split the notebook into two halves:
- First, we will walk through how to load in a relatively small, local text file using a `DocumentLoader`, split it into chunks, and store it in a vector database using `ChromaDB`.
- Second, you will get a chance to show what you've learned by building a larger system with the complete works of Shakespeare. 
----
### ![Dolly](https://files.training.databricks.com/images/llm/dolly_small.png) Learning Objectives

By the end of this notebook, you will be able to:
1. Add external local data to your LLM's knowledge base via a vector database.
2. Construct a Question-Answer(QA) LLMChain to "talk to your data."
3. Load external data sources from remote locations and store in a vector database.
4. Leverage different retrieval methods to search over your data. 


## Classroom Setup

In [0]:
%pip install chromadb==0.4.10 tiktoken==0.3.3 sqlalchemy==2.0.15 langchain==0.0.249

[43mNote: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.[0m
[43mNote: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.[0m


In [0]:
dbutils.library.restartPython()

In [0]:
%run ../Includes/Classroom-Setup

Resetting the learning environment:
| Enumerating serving endpoints...found 0...(0 seconds)
| No action taken

Skipping download of existing archive to "dbfs:/mnt/dbacademy-datasets/large-language-models/v03" 
| Validating local assets:
| | Listing local files...(0 seconds)
| | Validation completed...(0 seconds total)
|
| Skipping the unpacking of datasets to "dbfs:/mnt/dbacademy-users/johnlennyt@gmail.com/large-language-models/datasets" 
|
| Dataset installation completed (0 seconds)



Importing lab testing framework.



Using the "default" schema.

Predefined paths variables:
| DA.paths.working_dir: /dbfs/mnt/dbacademy-users/johnlennyt@gmail.com/large-language-models/working
| DA.paths.user_db:     /dbfs/mnt/dbacademy-users/johnlennyt@gmail.com/large-language-models/working/database.db
| DA.paths.datasets:    /dbfs/mnt/dbacademy-users/johnlennyt@gmail.com/large-language-models/datasets

Setup completed (6 seconds)

The models developed or used in this course are for demonstration and learning purposes only.
Models may occasionally output offensive, inaccurate, biased information, or harmful instructions.


Fill in your credentials.

In [0]:
# TODO
# For many of the services that we'll using in the notebook, we'll need a HuggingFace API key so this cell will ask for it:
# HuggingFace Hub: https://huggingface.co/inference-api

import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "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"

## Building a Personalized Document Oracle

In this notebook, we're going to build a special type of LLMChain that will enable us to ask questions of our data. We will be able to "speak to our data".

### Step 1 - Loading Documents into our Vector Store
For this system we'll leverage the [ChromaDB vector database](https://www.trychroma.com/) and load in some text we have on file. This file is of a hypothetical laptop being reviewed in both long form and with brief customer reviews. We'll use LangChain's `TextLoader` to load this data.

In [0]:
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

# We have some fake laptop reviews that we can load in
laptop_reviews = TextLoader(
    f"{DA.paths.datasets}/reviews/fake_laptop_reviews.txt", encoding="utf8"
)
document = laptop_reviews.load()
display(document)

metadata,page_content
Map(source -> /dbfs/mnt/dbacademy-users/johnlennyt@gmail.com/large-language-models/datasets/reviews/fake_laptop_reviews.txt),"Raytech Supernova Laptop Review: A Star in the Making Introduction The laptop market has become increasingly competitive in recent years, with countless manufacturers vying for consumer attention. Raytech, a relatively new player in the game, has recently released the Supernova laptop, a device that aims to establish itself among the giants of the industry. In this comprehensive review, we will delve into every aspect of the Raytech Supernova laptop, covering its design, performance, features, and value for money. Let's find out if this newcomer has what it takes to make an impact in the crowded market. Design and Build Quality The first thing you'll notice about the Raytech Supernova is its sleek, modern design. The laptop is encased in a premium, brushed aluminum chassis with a matte finish, lending it an air of sophistication. It's a lightweight device, weighing in at just 2.8 pounds, making it easy to carry around for those always on the go. The slim profile, measuring 0.6 inches in thickness, adds to its portability. The Supernova's build quality is impressive, with no flexing or creaking when handling the device. The hinge is sturdy and smooth, allowing for easy adjustment of the display while keeping it stable during use. The laptop's keyboard is well-spaced, offering a comfortable typing experience. The keys are backlit, with customizable lighting options, making it convenient for use in dimly lit environments. Display and Graphics The Raytech Supernova comes with a 15.6-inch 4K UHD (3840 x 2160) IPS display, offering crisp and vibrant visuals. The screen is capable of producing a wide color gamut, ensuring accurate color reproduction across different media types. The panel has a matte finish, which helps to reduce glare and reflections, making it ideal for use in various lighting conditions. The laptop is powered by an NVIDIA GeForce RTX 3070 GPU, which provides excellent graphics performance for gaming and other demanding tasks. With support for real-time ray tracing and DLSS, the Supernova is well-suited for graphic-intensive applications and games. The GPU performance ensures smooth and immersive gameplay, even at high settings. Performance and Battery Life Under the hood, the Raytech Supernova is powered by the latest 11th Gen Intel Core i7 processor, paired with 16GB of DDR4 RAM. This combination ensures snappy performance during everyday tasks, such as web browsing and productivity applications. The laptop also has a 1TB NVMe SSD, which offers fast read/write speeds, resulting in quick boot times and application launches. In our tests, the Supernova managed to handle intensive tasks, such as video editing and 3D rendering, with ease. Even when pushed to its limits, the laptop remained cool and quiet, thanks to its efficient cooling system. Battery life is an essential aspect of any laptop, and the Raytech Supernova does not disappoint. The device comes with a 97Wh battery, which, in our testing, lasted for around 10 hours of continuous web browsing and productivity tasks. When used for gaming or other demanding tasks, the battery life is reduced to approximately 5 hours, which is still impressive for a high-performance laptop. Connectivity and Ports The Raytech Supernova offers a wide range of connectivity options, ensuring compatibility with various peripherals and devices. On the left side, you'll find a USB 3.2 Gen 2 Type-A port, an HDMI 2.1 port, and a Gigabit Ethernet port. On the right side, there's a Thunderbolt 4 port, two USB 3.2 Gen 1 Type-A ports, a 3.5mm audio jack, and an SD card reader. The Thunderbolt 4 port supports Power Delivery, allowing you to charge the laptop and connect peripherals with a single cable. The Supernova also comes with Wi-Fi 6 and Bluetooth 5.1, ensuring fast and reliable wireless connections. These features make the laptop versatile, allowing you to connect multiple devices and peripherals simultaneously without any hassle. Audio and Webcam The audio quality on the Raytech Supernova is impressive, thanks to its built-in stereo speakers. The laptop features Dolby Atmos audio technology, which enhances the audio experience by providing immersive sound quality. The speakers deliver clear and crisp audio, with a reasonable amount of bass for a laptop. However, for a more immersive experience, external speakers or headphones are recommended. The Supernova is equipped with a 720p HD webcam, which is adequate for video calls and conferencing. The camera provides decent image quality under good lighting conditions but struggles in low light situations. The built-in dual-array microphones, however, offer clear and noise-free audio capture during calls. Software and Security The Raytech Supernova comes pre-installed with Windows 10 Home, offering a familiar and user-friendly operating system. A free upgrade to Windows 11 is available, which introduces new features and improvements to the overall user experience. In terms of security, the Supernova includes a fingerprint reader integrated into the power button. This feature allows for quick and secure logins using Windows Hello. The laptop also features a TPM 2.0 chip, which provides hardware-based encryption for sensitive data, further enhancing the device's security. Customer Support and Warranty Raytech offers a standard one-year limited warranty for the Supernova laptop, covering manufacturing defects and hardware issues. The company provides customer support through email, live chat, and phone, ensuring that users have access to assistance when needed. In our interactions with Raytech's customer support, we found the representatives to be knowledgeable, friendly, and responsive. The company also maintains an online support portal, which includes a comprehensive knowledge base, software downloads, and troubleshooting guides for common issues. Conclusion The Raytech Supernova is a compelling laptop that offers an impressive combination of performance, features, and design. With its sleek aluminum chassis, vibrant 4K display, and powerful hardware, the Supernova stands out in the crowded laptop market. The laptop's excellent battery life, wide range of connectivity options, and robust security features make it a versatile device, suitable for both professional and personal use. While the audio and webcam quality could be improved, these are minor drawbacks in an otherwise outstanding laptop. Overall, the Raytech Supernova offers excellent value for money, making it an ideal choice for those looking for a high-performance laptop that doesn't compromise on style or functionality. If you're in the market for a new laptop, the Raytech Supernova should definitely be on your shortlist. Customer Reviews: ""Sleek and powerful - 5 stars"" I recently purchased the Raytech Supernova and I couldn't be happier. It's lightweight, stylish, and powerful, making it perfect for both work and play. The 4K display is stunning, and the battery life is impressive. Highly recommended! ""Great performance but average webcam - 4 stars"" The Raytech Supernova has exceeded my expectations in terms of performance and design. However, the webcam quality is just average. It works fine for casual video calls, but for professional use, I'd recommend an external webcam. ""Perfect for content creators - 5 stars"" As a video editor, the Supernova has been a game-changer for me. The 4K display, powerful GPU, and fast SSD make editing large video files a breeze. The connectivity options are also a plus. Absolutely love this laptop! ""Impressive gaming laptop - 5 stars"" The Raytech Supernova handles all my favorite games with ease, even on high settings. The display is beautiful, and the cooling system keeps the laptop quiet during long gaming sessions. A fantastic choice for gamers! ""Good but not perfect - 4 stars"" I love the design, performance, and battery life of the Supernova. The only downside is the audio quality from the built-in speakers. It's decent, but for a better experience, I use headphones or external speakers. ""Excellent value for money - 5 stars"" The Supernova offers great performance and features at a reasonable price. It's sleek, lightweight, and powerful, making it suitable for both work and entertainment. Highly recommended for anyone in need of a new laptop! ""Great for on-the-go professionals - 4.5 stars"" As a traveling professional, the Supernova has been a reliable companion. Its lightweight design and long battery life make it ideal for working on the go. The only minor issue is the webcam quality, but overall, it's an excellent laptop. ""Stylish and versatile - 5 stars"" The Raytech Supernova is a perfect combination of style and functionality. The aluminum chassis gives it a premium feel, and the performance is top-notch. The variety of ports allows me to connect all my peripherals without a problem. Highly satisfied! ""Reliable and user-friendly - 4 stars"" The Supernova has been a dependable laptop for my daily tasks. The 4K display is a treat for the eyes, and the performance is reliable. The fingerprint reader for secure login is a nice touch. However, the audio quality could be better. ""A solid choice for students - 5 stars"" As a student, I needed a laptop that could handle multitasking, media consumption, and occasional gaming. The Supernova checks all those boxes while being lightweight and stylish. The battery life is also a huge plus. I'm extremely happy with my purchase!"


### Step 2 - Chunking and Embeddings

Now that we have the data in document format, we will split data into chunks using a `CharacterTextSplitter` and embed this data using Hugging Face's embedding LLM to embed this data for our vector store.

In [0]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
import tempfile

tmp_laptop_dir = tempfile.TemporaryDirectory()
tmp_shakespeare_dir = tempfile.TemporaryDirectory()

# First we split the data into manageable chunks to store as vectors. There isn't an exact way to do this, more chunks means more detailed context, but will increase the size of our vectorstore.
text_splitter = CharacterTextSplitter(chunk_size=250, chunk_overlap=10)
texts = text_splitter.split_documents(document)
# Now we'll create embeddings for our document so we can store it in a vector store and feed the data into an LLM. We'll use the sentence-transformers model for out embeddings. https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(
    model_name=model_name, cache_folder=DA.paths.datasets
)  # Use a pre-cached model
# Finally we make our Index using chromadb and the embeddings LLM
chromadb_index = Chroma.from_documents(
    texts, embeddings, persist_directory=tmp_laptop_dir.name
)

Created a chunk of size 6702, which is longer than the specified 250
Created a chunk of size 285, which is longer than the specified 250
Created a chunk of size 278, which is longer than the specified 250
Created a chunk of size 260, which is longer than the specified 250
Created a chunk of size 254, which is longer than the specified 250
Created a chunk of size 258, which is longer than the specified 250
Created a chunk of size 286, which is longer than the specified 250
Created a chunk of size 286, which is longer than the specified 250
Created a chunk of size 275, which is longer than the specified 250


### Step 3 - Creating our Document QA LLM Chain
With our data now in vector form we need an LLM and a chain to take our queries and create tasks for our LLM to perform. 

In [0]:
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

# We want to make this a retriever, so we need to convert our index.  This will create a wrapper around the functionality of our vector database so we can search for similar documents/chunks in the vectorstore and retrieve the results:
retriever = chromadb_index.as_retriever()

# This chain will be used to do QA on the document. We will need
# 1 - A LLM to do the language interpretation
# 2 - A vector database that can perform document retrieval
# 3 - Specification on how to deal with this data (more on this soon)

hf_llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    model_kwargs={
        "temperature": 0,
        "max_length": 128,
        "cache_dir": DA.paths.datasets,
    },
)

chain_type = "stuff"  # Options: stuff, map_reduce, refine, map_rerank
laptop_qa = RetrievalQA.from_chain_type(
    llm=hf_llm, chain_type="stuff", retriever=retriever
)

### Step 4 - Talking to Our Data
Now we are ready to send prompts to our LLM and have it use our prompt, the access to our data, and read the information, process, and return with a response.

In [0]:
# Let's ask the chain about the product we have.
laptop_name = laptop_qa.run("What is the full name of the laptop?")
display(laptop_name)

Token indices sequence length is longer than the specified maximum sequence length for this model (1666 > 512). Running this sequence through the model will result in indexing errors


'Raytech Supernova'

In [0]:
# Now we'll ask the chain about the product.
laptop_features = laptop_qa.run("What are some of the laptop's features?")
display(laptop_features)

'The 4K display, powerful GPU, and fast SSD'

In [0]:
# Finally let's ask the chain about the reviews.
laptop_reviews = laptop_qa.run("What is the general sentiment of the reviews?")
display(laptop_reviews)

'positive'

## Exercise: Working with larger documents
This document was relatively small. So let's see if we can work with something bigger. To show how well we can scale the vector database, let's load in a larger document. For this we'll get data from the [Gutenberg Project](https://www.gutenberg.org/) where thousands of free-to-access texts. We'll use the complete works of William Shakespeare.

Instead of a local text document, we'll download the complete works of Shakespeare using the `GutenbergLoader` that works with the Gutenberg project: https://www.gutenberg.org

In [0]:
from langchain.document_loaders import GutenbergLoader

loader = GutenbergLoader(
    "https://www.gutenberg.org/cache/epub/100/pg100.txt"
)  # Complete works of Shakespeare in a txt file

all_shakespeare_text = loader.load()

### Question 1

Now it's your turn! Based on what we did previously, fill in the missing parts below to build your own QA LLMChain.

In [0]:
# TODO
text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=256) #hint try chunk sizes of 1024 and an overlap of 256 (this will take approx. 10mins with this model to build our vector database index)
texts = text_splitter.split_documents(document)

model_name = "sentence-transformers/all-MiniLM-L6-v2" #hint, try "sentence-transformers/all-MiniLM-L6-v2" as your model
embeddings = HuggingFaceEmbeddings(
    model_name=model_name, cache_folder=DA.paths.datasets
)
chromadb_index = Chroma.from_documents(
    texts, embeddings, persist_directory=tmp_laptop_dir.name
)

Created a chunk of size 6702, which is longer than the specified 1024


In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion3_1(embeddings, chromadb_index)

[32mPASSED[0m: All tests passed for lesson3, question1
[32mRESULTS RECORDED[0m: Click `Submit` when all questions are completed to log the results.


### Question 2

Let's see if we can do what we did with the laptop reviews. 

Think about what is likely to happen now. Will this command succeed? 

(***Hint: think about the maximum sequence length of a model***)

In [0]:
# TODO
# Let's start with the simplest method: "Stuff" which puts all of the data into the prompt and asks a question of it:
qa = RetrievalQA.from_chain_type(
    llm=hf_llm, chain_type="stuff", retriever=retriever
)
query = "What happens in the play Hamlet?"
# Run the query
query_results_hamlet = qa.run(query)

query_results_hamlet

'not enough information'

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion3_2(qa, query_results_hamlet)

[32mPASSED[0m: All tests passed for lesson3, question2
[32mRESULTS RECORDED[0m: Click `Submit` when all questions are completed to log the results.


### Question 3

Now that we're working with larger documents, we should be mindful of the input sequence limitations that our LLM has. 

Chain Types for document loader:

- [`stuff`](https://docs.langchain.com/docs/components/chains/index_related_chains#stuffing) - Stuffing is the simplest method, whereby you simply stuff all the related data into the prompt as context to pass to the language model.
- [`map_reduce`](https://docs.langchain.com/docs/components/chains/index_related_chains#map-reduce) - This method involves running an initial prompt on each chunk of data (for summarization tasks, this could be a summary of that chunk; for question-answering tasks, it could be an answer based solely on that chunk).
- [`refine`](https://docs.langchain.com/docs/components/chains/index_related_chains#refine) - This method involves running an initial prompt on the first chunk of data, generating some output. For the remaining documents, that output is passed in, along with the next document, asking the LLM to refine the output based on the new document.
- [`map_rerank`](https://docs.langchain.com/docs/components/chains/index_related_chains#map-rerank) - This method involves running an initial prompt on each chunk of data, that not only tries to complete a task but also gives a score for how certain it is in its answer. The responses are then ranked according to this score, and the highest score is returned.
  * NOTE: For this exercise, `map_rerank` will [error](https://github.com/hwchase17/langchain/issues/3970).

In [0]:
# TODO
qa = RetrievalQA.from_chain_type(llm=hf_llm, chain_type="stuff", retriever=chromadb_index.as_retriever())
query = "Who is the main character in the Merchant of Venice?"
query_results_venice = qa.run(query)

query_results_venice

'Not enough information'

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion3_3(qa, query_results_venice)

[32mPASSED[0m: All tests passed for lesson3, question3
[32mRESULTS RECORDED[0m: Click `Submit` when all questions are completed to log the results.


### Question 4


In [0]:
# TODO
# That's much better! Let's try another type

qa = RetrievalQA.from_chain_type(llm=hf_llm, chain_type="map_reduce", retriever=chromadb_index.as_retriever())
query = "What happens to romeo and juliet?"
query_results_romeo = qa.run(query)

query_results_romeo

Token indices sequence length is longer than the specified maximum sequence length for this model (1525 > 1024). Running this sequence through the model will result in indexing errors


'they die'

In [0]:
# Test your answer. DO NOT MODIFY THIS CELL.

dbTestQuestion3_4(qa, query_results_romeo)

[32mPASSED[0m: All tests passed for lesson3, question4
[32mRESULTS RECORDED[0m: Click `Submit` when all questions are completed to log the results.


## Submit your Results (edX Verified Only)

To get credit for this lab, click the submit button in the top right to report the results. If you run into any issues, click `Run` -> `Clear state and run all`, and make sure all tests have passed before re-submitting. If you accidentally deleted any tests, take a look at the notebook's version history to recover them or reload the notebooks.

In [0]:
tmp_laptop_dir.cleanup()
tmp_shakespeare_dir.cleanup()

&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>