# LangChain - Talk with documents

The goal of this project is to "talk" with documents using an LLM from Hugging Face. The document is a short story stored in a PDF file. We split that document into chunks, transform the chunks into embedding vectors using an embedding model from Hugging Face, and then store those embeddings in a vector store.

Then, given a question, we run a semantic similarity search and retrieve the chunks that are most similar to the question. The question is embedded with the same model as the chunks.
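To make the similarity search concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the scoring function, and the three-dimensional vectors are made-up numbers for illustration only; real embeddings come from the embedding model and have hundreds of dimensions. The helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, treat it as unrelated
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings" (invented values, not produced by any model)
toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]

print(top_k(toy_query, toy_chunks))
```

A vector store performs the same ranking, only with an index that avoids comparing the query against every stored vector.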

Finally, we take an LLM from Hugging Face, probably "mistral", pass it the retrieved chunks as context, and generate an AI-enhanced response.

We need the following API keys:
 - LangChain API key
 - Hugging Face API key
 - Vector store API key

Roughly, we need to do the following steps:
 - Read the API keys from the .env file
 - Search for a file (document corpus)
 - Split that document into chunks
 - Search for an embedding model and use it to vectorize the chunks
 - Upload the embeddings to the vector store (planned as Redis; the implementation uses Pinecone)
 - Ask a question and embed it
 - Retrieve the relevant documents
 - Search for an LLM (a text generation model) and give it the documents as context
 - The response to that question should be based on the documents we have, but enhanced by the LLM


I want to store the intermediate data in a pandas dataframe.

Possible vector stores:
 - Redis
 - Pinecone

## Implementation
Import the necessary libraries and do the initial setup.

#### Read the API keys

In [1]:
import os
import pandas as pd
from dotenv import load_dotenv
import uuid

load_dotenv()

True
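`load_dotenv()` returning `True` only means a .env file was found and loaded, not that every key is present. A small check like the following sketch can catch a missing key early. The variable names are the ones read with `os.getenv` later in this notebook; the helper `missing_keys` is hypothetical.

```python
import os

# Names taken from the os.getenv calls used later in this notebook
REQUIRED_KEYS = ["HUGGINGFACEHUB_API_TOKEN", "LANGCHAIN_API_KEY", "PINECONE_API_KEY"]

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```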

#### Read the document

Read the PDF that I generated previously. The helper function lives in a .py file, so I need to import it first.

In [2]:
import importlib.util

# Define the path to the module
module_path = './helpers/pdf_reader.py'

# Create a module spec from the path
spec = importlib.util.spec_from_file_location('functions', module_path)

# Load the module
functions = importlib.util.module_from_spec(spec)
spec.loader.exec_module(functions)
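The three loading steps above can be wrapped in one reusable function. This is only a sketch: `load_module_from_path` is a hypothetical helper name, and the error handling is one possible choice.

```python
import importlib.util
from pathlib import Path

def load_module_from_path(module_name, module_path):
    """Load a Python source file as a module object, given its path."""
    module_path = Path(module_path)
    if not module_path.is_file():
        raise FileNotFoundError(f"No module file at {module_path}")

    # Build a spec that describes where the module lives
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for {module_path}")

    # Create an empty module from the spec, then execute the file inside it
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

With this helper, the cell above would reduce to `functions = load_module_from_path('functions', './helpers/pdf_reader.py')`.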

Read the text

In [8]:
pdf_file_path = './data/ion-resume.pdf'
full_text = functions.read_pdf(pdf_file_path, 256)

#### Split the text into chunks

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_recursive_text_splitter(chunk_size, chunk_overlap):
    return RecursiveCharacterTextSplitter(
        chunk_size = chunk_size,
        chunk_overlap = chunk_overlap,
    )
    
def split_documents(docs, text_splitter):
    return  text_splitter.create_documents([docs])


text_splitter = get_recursive_text_splitter(chunk_size=50, chunk_overlap=10)
splitted_docs = split_documents(full_text, text_splitter)
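To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window chunker in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraphs, lines, and spaces); the function `chunk_with_overlap` is a hypothetical stand-in that only shows the sliding-window idea.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is what keeps a sentence that straddles a boundary partly visible in both chunks.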


#### Read the text into a pandas dataframe

I will iterate through the split docs and assign a unique ID to each of them

In [10]:
# Function to generate unique IDs
def generate_unique_id():
    return str(uuid.uuid4())


# Extract the page_content from each Document object into a separate list
page_contents = [doc.page_content for doc in splitted_docs]

df = pd.DataFrame(page_contents, columns=["chunk"])

# Generate a unique ID for each row and add it as a new column in the DataFrame
df['unique_id'] = df.apply(lambda row: generate_unique_id(), axis=1)


Get the chunk length

In [None]:
df['chunk_length'] = df['chunk'].apply(len)

Remove the chunks smaller than a specific margin

In [None]:
margin = 40
df = df[df['chunk_length'] >= margin]

# Drop the chunk_length column as it's no longer needed
df = df.drop(columns=['chunk_length'])

#### Transform the text into vector embeddings

First import an embedding model, then transform the text from the dataframe

In [16]:
from langchain_huggingface import HuggingFaceEmbeddings

small_embeddings_model = 'sentence-transformers/all-MiniLM-L6-v2'
normal_embeddings_model = 'sentence-transformers/all-mpnet-base-v2'

model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name=normal_embeddings_model,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
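With `normalize_embeddings` set to `False`, the vectors keep their raw lengths. The following plain-Python sketch shows what L2 normalization would do: it rescales a vector to length 1, after which a dot product equals the cosine similarity. The helper names `l2_normalize` and `dot` are hypothetical, and the two-dimensional vectors are invented for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(dot(a, b))                              # depends on the vector lengths
print(dot(l2_normalize(a), l2_normalize(b)))  # depends only on the direction
```

For a cosine metric the setting makes no difference to the ranking, because cosine similarity already ignores vector length. It would matter for a dot-product metric.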

Embed the documents

In [19]:
df['embeddings'] = df['chunk'].apply(lambda text: embeddings.embed_query(text))

Check the dataframe

In [20]:
df.head()

| | chunk | unique_id | embeddings |
|---|---|---|---|
| 0 | Ion by Liviu Rebreanu: An In-depth SummaryIon by | 1ea6c1f2-0158-47d1-b00c-f2eb153d5e26 | [0.011528627015650272, -0.07288378477096558, -... |
| 3 | 'Ion,' authored by Liviu Rebreanu and first | 1d5069a4-acdc-43de-a5f4-b3dfaf0da138 | [0.02417871728539467, -0.05120783671736717, -0... |
| 4 | and first published in 1920, stands as a | 90633971-ebf5-4783-a3b5-0393ac8e4915 | [0.030614905059337616, -0.01263571996241808, 0... |
| 6 | Romanian literature. Set against the backdrop of | 46963f0b-b643-4429-b06b-ad34f5c98c8b | [0.029097026214003563, 0.03540997579693794, -0... |
| 7 | of rural Transylvania, the novel intricately | 65ab013e-4ab9-4876-a3d6-1ad612ba5e5c | [0.04430192708969116, 0.0027177438605576754, -... |


#### Store the embeddings to Vector Store - Pinecone


Get the connection tokens

In [23]:
hugging_face_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
langchain_token = os.getenv("LANGCHAIN_API_KEY")
pinecone_api_key = os.getenv("PINECONE_API_KEY")
pinecone_env = os.getenv("PINECONE_ENV")
pinecone_index_host = os.getenv("PINECONE_INDEX_HOST")

Get the embedding dimension for the index

In [24]:
first_embedding = df['embeddings'].iloc[0]
embeddings_dimension = len(first_embedding)

print(embeddings_dimension)

768
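Reading the dimension from the first row assumes every embedding has the same length. A slightly more defensive variant, sketched here with a hypothetical helper `embedding_dimension`, checks all rows before the index is created.

```python
def embedding_dimension(vectors):
    """Return the shared length of all vectors, or raise if they disagree."""
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError(f"Expected exactly one embedding dimension, found: {sorted(dims)}")
    return dims.pop()

# Toy vectors (invented values) standing in for the dataframe column
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(embedding_dimension(toy_vectors))
```

In this notebook it could be called as `embedding_dimension(df['embeddings'])`, since iterating over the column yields one list per row.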


Connect to Pinecone and create the index

In [26]:
import time

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=pinecone_api_key)

index_name = "ion-index"

existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=embeddings_dimension,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)

Get the chunk texts into a list

In [28]:
texts = df['chunk'].tolist()

Push the embeddings to Pinecone

In [31]:
from langchain_pinecone import PineconeVectorStore

vstore = PineconeVectorStore.from_texts(texts, embeddings, index_name=index_name)
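`from_texts` embeds the texts again and generates its own IDs, so the `unique_id` and `embeddings` columns computed earlier are not used in the upload. If you wanted to reuse them, one option is to build the records yourself and upload them in batches. The sketch below covers only the plain-Python part; the record layout (`id`, `values`, `metadata`) and the `text` metadata key are assumptions about what the Pinecone client and the LangChain wrapper expect, so check them against the library versions you use. `build_records` and `batched` are hypothetical helpers.

```python
def build_records(ids, vectors, texts):
    """Pair each ID with its vector and store the chunk text as metadata."""
    if not (len(ids) == len(vectors) == len(texts)):
        raise ValueError("ids, vectors and texts must have the same length")
    return [
        {"id": record_id, "values": list(vector), "metadata": {"text": text}}
        for record_id, vector, text in zip(ids, vectors, texts)
    ]

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy data (invented values) standing in for the dataframe columns
toy_records = build_records(["a", "b", "c"], [[0.1], [0.2], [0.3]], ["x", "y", "z"])
for batch in batched(toy_records, batch_size=2):
    print(len(batch), [record["id"] for record in batch])
```

Each batch could then be sent with a call along the lines of `index.upsert(vectors=batch)`, assuming the client accepts records in this shape.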

Query the vector store

In [88]:
query = "Who is Ion's true love?"

# similarity_search expects the query text and embeds it internally
# with the embeddings object that was passed to the vector store
result = vstore.similarity_search(query, k=10)
print(result)

[Document(page_content="- Vanessa: Ion's true love, a beautiful but"), Document(page_content='unattainable love and happiness for Ion. His'), Document(page_content="friend of Ion, providing a contrast to Ion's"), Document(page_content="Ion's character embodies the complex interplay"), Document(page_content="Ion's relentless pursuit culminates in his"), Document(page_content="'Ion' a timeless work that continues to resonate"), Document(page_content='all-consuming obsession. Ion recognizes that'), Document(page_content="sacrifice. Ana's love for Ion and her subsequent"), Document(page_content='follows the life of Ion, a poor but ambitious'), Document(page_content='Ion. His relentless pursuit of wealth and social')]


#### Instantiate an LLM

In [84]:
from langchain_huggingface import HuggingFaceEndpoint
from langchain.chains import RetrievalQA


repo_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Initialize the HuggingFaceEndpoint
chat_llm = HuggingFaceEndpoint(repo_id=repo_id,
                          temperature=0.1,
                          huggingfacehub_api_token=hugging_face_token)


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\Hori\.cache\huggingface\token
Login successful


Create a chain that passes the documents retrieved by the similarity search to the LLM

In [85]:
text_chain = RetrievalQA.from_chain_type(llm=chat_llm, chain_type="stuff", retriever=vstore.as_retriever())

q = """In the context of the Romanian novel Ion, written by the Romanian writer Liviu Rebreanu, who is Ion's true love?
"""

# Chain.run is deprecated in recent LangChain releases; invoke returns a dict
result = text_chain.invoke({"query": q})["result"]
print(result)


 In the novel "Ion" by Liviu Rebreanu, the character Ion is deeply in love with a woman named Smaranda. Their love story is a central theme of the novel.
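The `stuff` chain type places all retrieved chunks into a single prompt. The sketch below shows that idea in plain Python. It is not the template LangChain uses; the wording of the prompt, the character budget, and the helper name `build_stuffed_prompt` are all assumptions made for illustration.

```python
def build_stuffed_prompt(question, chunks, max_context_chars=2000):
    """Concatenate chunks into one context block, stopping at a character budget."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = len(chunk) + 1  # +1 for the newline that joins chunks
        if used + cost > max_context_chars:
            break  # keep the prompt within the budget
        selected.append(chunk)
        used += cost

    context = "\n".join(selected)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )

# Two chunks copied from the similarity search output earlier in this notebook
toy_chunks_for_prompt = [
    "- Vanessa: Ion's true love, a beautiful but",
    "unattainable love and happiness for Ion. His",
]
print(build_stuffed_prompt("Who is Ion's true love?", toy_chunks_for_prompt))
```

Because the model only sees what fits in the prompt, the number and size of the retrieved chunks directly limits what it can ground its answer in.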


Create a prompt where I will ask the LLM a question related to my document

In [86]:
from langchain_core.messages import HumanMessage, SystemMessage
import textwrap


def perform_similarity_search(query, top_k=5):
    # Retrieve the top_k chunks most similar to the query text
    return vstore.similarity_search(query, k=top_k)

def answer_question_with_context(question, similar_docs):
    # Extract the text of similar documents and form the context
    context = " ".join(doc.page_content for doc in similar_docs)

    system_prompt = "You are an expert on Romanian literature. Please provide detailed and accurate answers based on the provided context."
    # The system prompt is sent as its own message, so it is not repeated here
    input_text = f"Context: {context}\n\nQuestion: {question}\nAnswer:"

    messages = [
        SystemMessage(content=system_prompt),
        HumanMessage(content=input_text),
    ]

    # HuggingFaceEndpoint is a text-completion LLM, so invoke returns a plain string
    response = chat_llm.invoke(messages)
    answer = textwrap.fill(response, width=100)

    return answer


Ask the question

In [87]:
q = """
    who is Ion's true love?
    """

similar_docs = perform_similarity_search(q)
result = answer_question_with_context(q, similar_docs)

print(result)

     In the context provided, Ion's true love is Vanessa. The text suggests that Ion has deep and
unattainable feelings for Vanessa, which he pursues relentlessly. Despite the challenges, Ion's love
for Vanessa remains a significant aspect of his character and drives much of the narrative.
