<a href="https://colab.research.google.com/github/gulabpatel/LLMs/blob/main/03_RAG_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Fetching API Key from Environment Variable

In [1]:
import os
from getpass import getpass

# Read the OpenAI API key from the environment; prompt for it if it is not set.
# Never hard-code the key in the notebook.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

### Pipeline for Converting Raw Unstructured Data into a QA Chain

1. **Loading**: First, load the data. Unstructured data can come from many sources, and the LangChain Integration Hub lists the full range of loaders. Each loader returns the data as LangChain Documents.

2. **Splitting**: Text splitters segment the Documents into specified sizes.

3. **Storage**: A storage solution, often a vector store, is used to house and sometimes embed the splits.

4. **Retrieval**: The application fetches splits from storage, usually those whose embeddings are most similar to the embedding of the input question.

5. **Generation**: A large language model (LLM) generates an answer from a prompt that contains both the question and the retrieved data.

6. **Conversation (Extension)**: To facilitate multi-turn conversations, Memory can be added to the QA chain.


![Q/A pipeline RAG](https://python.langchain.com/assets/images/qa_flow-9fbd91de9282eb806bda1c6db501ecec.jpeg)


## Step 1: Loading the Document

This step loads the document into the pipeline. The quality of the loaded data affects every later stage, so it is worth inspecting what the loader returns. The loaders in the LangChain Integration Hub import unstructured data as LangChain Documents.
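
To make the loader's output concrete, here is a minimal sketch of the shape of a Document as it appears in this notebook's outputs: a `page_content` string plus a `metadata` dictionary. `SimpleDocument` is a hypothetical stand-in written for illustration, not LangChain's own class, and the sample text is shortened from the page used in this notebook.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (not the library class)."""

    page_content: str                                        # the extracted text
    metadata: Dict[str, Any] = field(default_factory=dict)   # e.g. source URL, title


doc = SimpleDocument(
    page_content="Cavities are holes, or areas of tooth decay, that form in your teeth surfaces.",
    metadata={
        "source": "https://my.clevelandclinic.org/health/diseases/10946-cavities",
        "language": "en",
    },
)

# Later pipeline stages read the text from page_content and keep metadata for provenance
print(doc.page_content)
print(doc.metadata["source"])
```

Keeping the source URL in `metadata` is what lets a retrieved chunk be traced back to the page it came from.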

In [3]:
!pip -q install langchain openai

In [4]:
from langchain.document_loaders import WebBaseLoader

# Initialize the WebBaseLoader with the URL of the document to be loaded
loader = WebBaseLoader("https://my.clevelandclinic.org/health/diseases/10946-cavities")

# Load the document and store it in the 'data' variable
data = loader.load()

In [5]:
# Display the content of the loaded document
print(data)

[Document(page_content='Cavities (Tooth Decay): Symptoms, Causes & Treatment800.223.2273100 Years of Cleveland ClinicMyChartNeed Help?GivingCareersSearchClevelandClinic.orgFind A DoctorLocations & DirectionsPatients & VisitorsHealth LibraryInstitutes & DepartmentsAppointmentsHome/Health Library/Diseases & Conditions/CavitiesAdvertisementAdvertisementCavitiesCavities are holes, or areas of tooth decay, that form in your teeth surfaces. Causes include plaque buildup, eating lots of sugary snacks and poor oral hygiene. Treatments include dental fillings, root canal therapy and tooth extraction. The sooner you treat a cavity, the better your chance for a predictable outcome and optimal oral health.ContentsArrow DownOverviewSymptoms and CausesDiagnosis and TestsManagement and TreatmentPreventionOutlook / PrognosisLiving WithAdditional Common QuestionsOverviewCavities can form on the crown or root of your tooth. You might not feel a cavity until it reaches the dentin or pulp.What is a cavity

## Step 2: Splitting the Document into Chunks

This step divides the loaded document into smaller chunks, also called splits. Smaller chunks are easier to embed and retrieve, and each one fits comfortably in a prompt alongside the question.
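
The idea of fixed-size chunks that overlap can be sketched in plain Python. This is a simplified sliding-window version for illustration only, not the algorithm that `RecursiveCharacterTextSplitter` implements. The helper name `split_with_overlap` and the use of whitespace-separated words as "tokens" are assumptions.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace-separated words stand in for model tokens in this toy example
words = "Cavities are holes that form in your teeth when acids wear down enamel".split()
word_chunks = split_with_overlap(words, chunk_size=5, chunk_overlap=2)

for chunk in word_chunks:
    print(" ".join(chunk))
```

The overlap means the last words of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still partly visible in both chunks.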

In [6]:
!pip install tiktoken



In [7]:
import tiktoken

# Look up which token encoding the GPT-3.5 Turbo model uses
tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [8]:
tokenizer = tiktoken.get_encoding('cl100k_base')

# Define a function to calculate the token length of a given text
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("Dentin decay: Dentin is the layer just beneath your tooth enamel.")

15

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter; chunk_size and chunk_overlap are measured
# in tokens because length_function is tiktoken_len
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=tiktoken_len,
)

In [10]:
# Split the loaded document into smaller chunks
chunks = text_splitter.split_documents(data)

In [11]:
chunks

[Document(page_content='Cavities (Tooth Decay): Symptoms, Causes & Treatment800.223.2273100 Years of Cleveland ClinicMyChartNeed Help?GivingCareersSearchClevelandClinic.orgFind A DoctorLocations & DirectionsPatients & VisitorsHealth LibraryInstitutes & DepartmentsAppointmentsHome/Health Library/Diseases & Conditions/CavitiesAdvertisementAdvertisementCavitiesCavities are holes, or areas of tooth decay, that form in your teeth surfaces. Causes include plaque buildup, eating lots of', metadata={'source': 'https://my.clevelandclinic.org/health/diseases/10946-cavities', 'title': 'Cavities (Tooth Decay): Symptoms, Causes & Treatment', 'description': 'A cavity is a hole, or area of decay, in your tooth. Cavities form when acids in your mouth erode (wear down) your tooth enamel — your tooth’s hard, outer layer.', 'language': 'en'}),
 Document(page_content='areas of tooth decay, that form in your teeth surfaces. Causes include plaque buildup, eating lots of sugary snacks and poor oral hygiene. 

In [12]:
# Check the total number of chunks generated
len(chunks)

37

## Step 3: Storing the Vector Embeddings in a Vector Database

This step embeds each chunk and stores the resulting vectors in a vector database, so that the chunks most relevant to a question can be retrieved efficiently later.

1. **Database Storage**: To facilitate future retrieval of our document splits, it's essential to store them in a database.

2. **Embedding Model**: To convert our document splits into vector embeddings, we require an embedding model.

3. **Vector Store**: Finally, the vector embeddings and documents will be stored in a vector store. For this purpose, we will be using ChromaDB.
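
The three points above can be sketched with a toy in-memory store. This is only an illustration of the store-then-search idea: the bag-of-words `embed` function is a deliberately crude stand-in for a real embedding model, and `ToyVectorStore` is a hypothetical class, not ChromaDB or any LangChain API.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy 'embedding': lowercase word counts. Not a real embedding model."""
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store that keeps each text next to its vector."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Dict[str, float]]] = []

    def add(self, texts: List[str]) -> None:
        for text in texts:
            self._rows.append((text, embed(text)))

    def similarity_search(self, query: str, k: int = 3) -> List[str]:
        query_vec = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vec, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add([
    "plaque is a sticky film of bacteria on teeth",
    "fluoride toothpaste helps protect tooth enamel",
    "receding gums expose tooth roots to plaque",
])
top = store.similarity_search("what protects enamel", k=1)
print(top)
```

A real embedding model replaces word counts with dense vectors that capture meaning, so a query can match chunks that share no exact words with it.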


![img](https://python.langchain.com/assets/images/qa_data_load-70fac3ea6593b986613784dc056df21a.png)

In [None]:
!pip install -U sentence-transformers

In [14]:
from langchain.embeddings import HuggingFaceEmbeddings

# Specify the model name and additional arguments
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

# Initialize HuggingFace Embeddings
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

In [15]:
# Embed two sample strings as a quick sanity check
embed = hf.embed_documents(texts=['h', 'e'])

# Print the length of one of the embeddings to check its dimensions
print(len(embed[1]))

384


In [None]:
# Install ChromaDB package using pip
!pip install chromadb

In [17]:
from langchain.vectorstores import Chroma

# Initialize Chroma vector database with chunks and HuggingFace embeddings
vectordb = Chroma.from_documents(chunks, hf)

In [18]:
# Perform a similarity search on the vector database
vectordb.similarity_search('bleeding gums', k=3)

[Document(page_content='your bloodstream (sepsis).What causes cavities?Many factors play a role in the development of cavities.Here’s how it works:Bacteria in your mouth feed on sugary, starchy foods and drinks (fruit, candy, bread, cereal, sodas, juice and milk). The bacteria convert these carbohydrates into acids.Bacteria, acid, food and saliva mix to form dental plaque. This sticky substance coats your teeth.Without proper brushing and flossing, acids in plaque dissolve tooth', metadata={'description': 'A cavity is a hole, or area of decay, in your tooth. Cavities form when acids in your mouth erode (wear down) your tooth enamel — your tooth’s hard, outer layer.', 'language': 'en', 'source': 'https://my.clevelandclinic.org/health/diseases/10946-cavities', 'title': 'Cavities (Tooth Decay): Symptoms, Causes & Treatment'}),
 Document(page_content='treated in childhood. Adults are also more likely to have receding gums. This condition exposes your teeth roots to plaque, which can cause 

## Step 4: Retrieve and Generate

This step retrieves the chunks most relevant to a question from the vector database and passes them, together with the question, to a language model that generates the answer. It is the final step in the pipeline and ties the previous steps together into a QA chain.
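
One common way to combine retrieved chunks with a question is to place them all in a single prompt, an approach often called "stuffing". The sketch below shows that idea in plain Python. The function name `build_stuffed_prompt` and the prompt wording are assumptions for illustration, not the prompt that `RetrievalQA` uses internally.

```python
from typing import List


def build_stuffed_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Join the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunk.strip() for chunk in retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, shortened for the example
example_chunks = [
    "Bacteria in your mouth feed on sugary, starchy foods and drinks.",
    "Without proper brushing and flossing, acids in plaque dissolve tooth enamel.",
]
prompt = build_stuffed_prompt("What is enamel decay", example_chunks)
print(prompt)
```

Because every retrieved chunk goes into one prompt, the chunk size and the number of chunks retrieved together determine how much of the model's context window the prompt consumes.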

In [None]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Initialize a language model with ChatOpenAI
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.6)

# Initialize a RetrievalQA chain with the language model and vector database retriever
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectordb.as_retriever())

# Pass a query to the QA chain to generate an answer
qa_chain({'query': 'How can I prevent cavities in my teeth?'})

In [None]:
# Change the query to whatever you want to ask the LLM
query = 'What is enamel decay'

In [None]:
qa_chain({'query': query})

{'query': 'What is enamel decay',
 'result': 'Enamel decay refers to the breakdown or deterioration of the outermost layer of your tooth called enamel. This occurs due to the acids produced by bacteria in your mouth, which can be caused by poor oral hygiene, a diet high in sugary or acidic foods, or certain medical conditions. Enamel decay is the second stage of tooth decay and can lead to the formation of cavities or holes in the tooth. If left untreated, the decay can progress to affect the deeper layers of the tooth, leading to more severe dental problems.'}

##### Next Steps:

- Apply prompt engineering to control how the model uses the retrieved context

- Use a prompt template (`PromptTemplate`) from LangChain