<a href="https://colab.research.google.com/github/naveen9596/AI-Project/blob/main/RAGPipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install libraries

In [None]:
!pip install -q gradio
!pip install -q openai
!pip install -q langchain-community
!pip install -q langchain_openai
!pip install -q langchain_chroma
!pip install -q plotly


Import Libraries

In [None]:
import os
import glob
import gradio as gr
import numpy as np
# LangChain
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
# OpenAI
from openai import OpenAI
# Dimensionality reduction and plotting
from sklearn.manifold import TSNE
import plotly.graph_objects as go
# Colab secrets
from google.colab import userdata

Defining the Model and the DB Name

In [None]:
MODEL = "gpt-4o-mini"
db_name = "vector_db"

Getting the OpenAI Key

In [None]:
Key = userdata.get('OPENAI_API_KEY')
openai = OpenAI(api_key=Key)
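
`google.colab.userdata` is only importable inside Colab. A minimal sketch of a fallback for running the notebook elsewhere, assuming the key is also exported as the `OPENAI_API_KEY` environment variable. The helper name `get_openai_key` is hypothetical and not part of the notebook:

```python
import os


def get_openai_key(env=None):
    """Return the OpenAI key from Colab secrets, else from the environment."""
    env = os.environ if env is None else env
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    key = None
    if userdata is not None:
        try:
            key = userdata.get("OPENAI_API_KEY")
        except Exception:
            # Secret missing or notebook access not granted: fall through.
            key = None

    key = key or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY not found in Colab secrets or environment")
    return key
```

With this helper, `Key = get_openai_key()` could stand in for the `userdata.get(...)` call above.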

Read the documents with LangChain loaders

In [None]:
folders = glob.glob('/content/drive/MyDrive/Colab Notebooks/LLM/RAG/knowledge-base/*')

#Define Kwargs
text_loader_kwargs = {'encoding': 'utf-8'}

#Empty List
documents = []

#Document Iteration
for folder in folders:
  doc_type = os.path.basename(folder)
  loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
  folder_docs = loader.load()
  for doc in folder_docs:
    doc.metadata['doc_type'] = doc_type
    documents.append(doc)
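
To make the tagging step concrete, here is a library-free sketch of what the loop above does, assuming each subfolder of the knowledge base names one document type. The function `load_markdown_by_type` and the dict layout are illustrative stand-ins, not LangChain's `Document` class:

```python
from pathlib import Path


def load_markdown_by_type(root):
    """Read every .md file under each subfolder of root and tag it with the subfolder name."""
    docs = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # skip loose files at the top level
        for path in sorted(folder.rglob("*.md")):
            docs.append({
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path), "doc_type": folder.name},
            })
    return docs
```

The `doc_type` tag is what later lets the plots color each chunk by its source folder.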

Split text into chunks

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = text_splitter.split_documents(documents)
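
As a rough illustration of separator-based chunking, the sketch below splits on blank lines and greedily merges pieces until the next one would exceed `chunk_size`. It is a simplified stand-in written for this explanation, not `CharacterTextSplitter`'s actual implementation, and the name `split_text` is hypothetical. A single piece longer than `chunk_size` is kept whole here:

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Split on separator, then greedily merge pieces into chunks of at most chunk_size characters."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it is a single oversized piece kept whole.
            current = candidate
        else:
            # Close the current chunk and start a new one with this piece.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because the split points are paragraph boundaries, chunks tend to end at natural breaks rather than mid-sentence.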



In [None]:
len(chunks)

116

In [None]:
doc_types = set(doc.metadata['doc_type'] for doc in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: company, contracts, products, employees


Create a Vector Store

In [None]:
embeddings = OpenAIEmbeddings(api_key=userdata.get('OPENAI_API_KEY'))

# Delete if already exists

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

# Create vectorstore

vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

Vectorstore created with 116 documents


Get the dimensions of a single vector

In [None]:
collection = vectorstore._collection
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

The vectors have 1,536 dimensions
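
Retrieval works by comparing the query's embedding against these stored vectors. The toy sketch below ranks vectors by cosine similarity, one common choice. The metric a given store actually uses depends on its configuration, and the helper names `cosine_similarity` and `top_k` are illustrative only:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]
```

In the notebook the vectors have 1,536 components instead of two or three, but the ranking idea is the same.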


Visualizing the vectors

In [None]:
result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
doc_types = [metadata['doc_type'] for metadata in result['metadatas']]
colors = [['blue', 'green', 'red', 'orange'][['products', 'employees', 'contracts', 'company'].index(t)] for t in doc_types]
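
The list-index lookup above raises a `ValueError` if a chunk carries a `doc_type` outside the four listed. A small sketch of the same mapping as a dictionary with a fallback color; the names `COLOR_BY_TYPE` and `colors_for` and the `gray` default are assumptions added for illustration:

```python
# Same type-to-color pairing as the list-based lookup.
COLOR_BY_TYPE = {
    "products": "blue",
    "employees": "green",
    "contracts": "red",
    "company": "orange",
}


def colors_for(doc_types, default="gray"):
    """Map each document type to a color, using a default for unknown types."""
    return [COLOR_BY_TYPE.get(t, default) for t in doc_types]
```

Calling `colors_for(doc_types)` would then produce the list used by the plots, and a new knowledge-base folder would show up in gray instead of raising an error.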

2D Visualization

In [None]:
tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 2D scatter plot
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='2D Chroma Vector Store Visualization',
    xaxis_title='x',
    yaxis_title='y',
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

3D Visualization

In [None]:
tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='3D Chroma Vector Store Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
    width=900,
    height=700,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

Create a new chat with OpenAI

In [None]:
#LLM
llm = ChatOpenAI(api_key=Key, temperature=0.7, model_name=MODEL)

#Memory
memory = ConversationBufferMemory(memory_key='chat_history', return_messages = True)

#Retriever
retriever = vectorstore.as_retriever()

#Conversation Chain
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)

Using the conversation chain

In [None]:
query = "Can you describe Insurellm in a few sentences?"
result = conversation_chain.invoke({"question": query})
print(result['answer'])

Insurellm is an innovative insurance tech startup founded in 2015 by Avery Lancaster, with a mission to disrupt the insurance industry through technology. The company offers four software products, including Carllm for auto insurance, Homellm for home insurance, Rellm for the reinsurance sector, and Marketllm, a marketplace that connects consumers with insurance providers. With 200 employees and over 300 clients worldwide, Insurellm is rapidly expanding and aims to simplify and enhance the insurance experience.
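
Conceptually, each turn of a conversational retrieval chain rewrites a follow-up into a standalone question when there is history, retrieves relevant chunks, and generates an answer from them. The sketch below shows only that data flow with stand-in callables. `answer_with_retrieval`, `condense`, `retrieve`, and `generate` are hypothetical names, and this is not LangChain's implementation:

```python
def answer_with_retrieval(question, history, condense, retrieve, generate):
    """Toy data flow for one turn: condense, retrieve, generate, remember."""
    # With prior turns, rewrite the follow-up as a standalone question.
    standalone = condense(question, history) if history else question
    # Fetch the chunks most relevant to the standalone question.
    context = retrieve(standalone)
    # Produce an answer grounded in the retrieved context.
    answer = generate(standalone, context)
    # Record the turn so later follow-ups can refer back to it.
    history.append((question, answer))
    return answer
```

In the notebook, the LLM plays the role of `condense` and `generate`, the Chroma retriever plays `retrieve`, and `ConversationBufferMemory` holds the history.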


Set a new conversation history

In [None]:
memory = ConversationBufferMemory(memory_key='chat_history', return_messages = True)
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)

Create a Gradio interface

In [None]:
def chat(message, history):
  result = conversation_chain.invoke({"question": message})
  return result['answer']
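
The `chat` function ignores `history` because the chain's memory already tracks the conversation. With `type="messages"`, Gradio passes the history as a list of dicts with `role` and `content` keys. If that history were needed, a small sketch like the following could turn it into (user, assistant) pairs; the helper `to_pairs` is hypothetical:

```python
def to_pairs(history):
    """Convert a messages-style history into (user, assistant) tuples."""
    pairs, pending_user = [], None
    for message in history:
        if message["role"] == "user":
            pending_user = message["content"]
        elif message["role"] == "assistant" and pending_user is not None:
            pairs.append((pending_user, message["content"]))
            pending_user = None
    return pairs
```

A trailing user message without a reply is left out of the result.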

In [None]:
view = gr.ChatInterface(chat, type="messages")
view.launch(debug=True)

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://b5af26deee0b141646.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://b5af26deee0b141646.gradio.live


