# Redis with LangChain

The goal of this project is to "talk" with a document using an LLM from Hugging Face. The document is a short story stored in a PDF file. We will split that document into chunks, transform the chunks into embedding vectors using an embedding model from Hugging Face, and then store those embeddings in Redis.

Then, given a question, we will embed the question using the same method, run a semantic similarity search, and retrieve the chunks that are most similar to the question.
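
To make the retrieval step concrete, here is a minimal sketch of a similarity search over a few toy vectors. It assumes cosine similarity as the metric, and the helper names `cosine_similarity` and `top_k` are illustrative only; in this project the search itself is meant to be done by Redis, not by hand-written code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, chunk_vectors, k=2):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(chunk_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, chunk_vectors, k=2)
```

Here `results` holds the indices of the two closest chunks together with their scores, ordered from most to least similar.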

Finally, we will take an LLM from Hugging Face, probably "mistral", give it the retrieved chunks as context, and generate an AI-enhanced response.
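
Giving the model "context" usually comes down to placing the retrieved chunks in the prompt ahead of the question. The sketch below shows one possible way to do that; the function name `build_prompt` and the prompt wording are assumptions made for illustration, not a template prescribed by LangChain or by any particular model.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunk.strip() for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# The chunk texts here are shortened stand-ins for real retrieved chunks.
prompt = build_prompt(
    "Who wrote the novel 'Ion'?",
    [
        "Ion by Liviu Rebreanu: An In-depth Summary",
        "socio-economic struggles and moral dilemmas",
    ],
)
```

The resulting `prompt` string is what would be sent to the text generation model.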

We need the following API keys:
 - LangChain API key
 - Hugging Face API key
 - Redis API key

Roughly, we need to do the following steps:
 - Read the API keys from the .env file
 - Search for a file (document corpus)
 - Split that document into chunks
 - Search for an embedding model and use it to vectorize the chunks
 - Upload the embeddings to Redis
 - Ask a question and embed it
 - Retrieve the relevant documents
 - Search for an LLM (a text generation model) and give it the documents as context
 - The response to that question should be based on the documents we have, but enhanced using the LLM


I want to store the intermediate data in a pandas DataFrame.

## Implementation
Import the necessary libraries and helpers.

#### Read the API keys

In [1]:
import os
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

True
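
The `True` output only reports that `load_dotenv()` loaded something; it does not confirm that each key is present. A small check like the following sketch can catch a missing key early. The variable names in `REQUIRED_KEYS` are hypothetical, so replace them with the names your own .env file defines.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Hypothetical variable names; use whatever names your .env file defines.
REQUIRED_KEYS = ["LANGCHAIN_API_KEY", "HUGGINGFACE_API_KEY", "REDIS_API_KEY"]

# Demonstration with a plain dict standing in for the real environment.
example_env = {"LANGCHAIN_API_KEY": "abc", "HUGGINGFACE_API_KEY": ""}
missing = missing_keys(REQUIRED_KEYS, example_env)
```

In the notebook, calling `missing_keys(REQUIRED_KEYS)` without a second argument checks the real process environment after `load_dotenv()` has run.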

#### Read the document

Read the PDF that I generated previously. The helper function lives in a separate .py file, so I need to import it first.

In [2]:
import importlib.util

# Define the path to the module
module_path = './helpers/pdf_reader.py'

# Create a module spec from the path
spec = importlib.util.spec_from_file_location('functions', module_path)

# Load the module
functions = importlib.util.module_from_spec(spec)
spec.loader.exec_module(functions)
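
The same three steps can be wrapped in a small reusable function. This is only a sketch: `load_module_from_path` is a name introduced here, and the demonstration uses a throwaway file because the contents of `./helpers/pdf_reader.py` are not shown in this notebook.

```python
import importlib.util
import tempfile
from pathlib import Path

def load_module_from_path(module_name, module_path):
    """Load a .py file as a module, raising ImportError if no loader is available."""
    spec = importlib.util.spec_from_file_location(module_name, str(module_path))
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module from {module_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# Demonstration with a temporary helper file standing in for the real one.
with tempfile.TemporaryDirectory() as tmp_dir:
    helper_path = Path(tmp_dir) / "demo_helper.py"
    helper_path.write_text("def greet(name):\n    return f'hello {name}'\n")
    demo = load_module_from_path("demo_helper", helper_path)
    greeting = demo.greet("redis")
```

If the notebook is started from the directory that contains `helpers/`, a plain `from helpers.pdf_reader import read_pdf` would likely work as well and is shorter.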

Read the text

In [None]:
pdf_file_path = './data/ion-resume.pdf'
full_text = functions.read_pdf(pdf_file_path, 128)

#### Split the text into chunks

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

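def get_recursive_text_splitter(chunk_size, chunk_overlap):
    # Build a splitter that targets chunks of at most chunk_size characters
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )


def split_documents(docs, text_splitter):
    # create_documents expects a list of texts, so wrap the single string
    return text_splitter.create_documents([docs])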


text_splitter = get_recursive_text_splitter(chunk_size=100, chunk_overlap=10)
splitted_docs = split_documents(full_text, text_splitter)
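
To show what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally: the LangChain splitter prefers to break on separators such as paragraphs, newlines, and spaces, so its chunks are often shorter than `chunk_size`. The function name `sliding_window_chunks` is an illustration only.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not just a repeat of the overlap.
    last_start_limit = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start_limit, step)]

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
```

Each chunk starts with the last two characters of the previous one, which is the role overlap plays: text near a boundary appears in both neighbouring chunks.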


Remove chunks that have no more than x characters.
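
This filtering step has no code in the notebook yet. A minimal sketch could look like the following, where the function name `drop_short_chunks` and the threshold of 20 characters are assumptions chosen for the demonstration, since the value of x is not fixed above.

```python
def drop_short_chunks(chunks, min_chars):
    """Keep only chunks whose stripped text is longer than min_chars characters."""
    return [chunk for chunk in chunks if len(chunk.strip()) > min_chars]

# min_chars=20 is an arbitrary value chosen for the demonstration.
sample_chunks = [
    "Ion by Liviu Rebreanu: An In-depth Summary",
    "weaves the",
    "socio-economic struggles and moral dilemmas",
]
kept = drop_short_chunks(sample_chunks, min_chars=20)
```

For the `Document` objects returned by the splitter, the same idea would be applied to `doc.page_content` instead of a plain string.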

#### Read the text into a pandas DataFrame

I will iterate through the split documents and assign a unique ID to each of them.

In [7]:
import pandas as pd
import uuid

# Function to generate unique IDs
def generate_unique_id():
    return str(uuid.uuid4())


# Extract the page_content from each Document object into a separate list
page_contents = [doc.page_content for doc in splitted_docs]

df = pd.DataFrame(page_contents, columns=["chunk"])

# Generate a unique ID for each row and add it as a new column in the DataFrame
df['unique_id'] = [generate_unique_id() for _ in range(len(df))]


In [8]:
df.head()

|   | chunk | unique_id |
|---|---|---|
| 0 | Ion by Liviu Rebreanu: An In-depth SummaryIon ... | 9935d7dc-53fb-41ca-b59d-6528f47916e5 |
| 1 | 'Ion,' authored by Liviu Rebreanu and first pu... | dde94abe-8c49-4818-9c60-e5afeff7aa4f |
| 2 | Romanian literature. Set against the backdrop ... | c24be74c-de07-408f-9209-27eea4dce0f4 |
| 3 | weaves the | 265f54a7-1ba8-40ed-940b-1808e1fd0ddb |
| 4 | socio-economic struggles and moral dilemmas of... | e68e0280-cc25-4094-93f3-326758fd38d0 |


#### Transform the text into vector embeddings

First import an embedding model, then transform the text from the DataFrame.

In [16]:
from langchain_huggingface import HuggingFaceEmbeddings

small_embeddings_model = 'sentence-transformers/all-MiniLM-L6-v2'
normal_embeddings_model = 'sentence-transformers/all-mpnet-base-v2'

model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name=normal_embeddings_model,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)


Embed the documents

In [17]:
# Embed all chunks in one batched call instead of one call per row
df['embedding'] = embeddings.embed_documents(df['chunk'].tolist())

In [18]:
df.head()

|   | chunk | unique_id | embedding |
|---|---|---|---|
| 0 | Ion by Liviu Rebreanu: An In-depth SummaryIon ... | 9935d7dc-53fb-41ca-b59d-6528f47916e5 | [0.01243958156555891, -0.0740518867969513, -0.... |
| 1 | 'Ion,' authored by Liviu Rebreanu and first pu... | dde94abe-8c49-4818-9c60-e5afeff7aa4f | [0.018369190394878387, -0.047273170202970505, ... |
| 2 | Romanian literature. Set against the backdrop ... | c24be74c-de07-408f-9209-27eea4dce0f4 | [0.013225809670984745, -0.02516471967101097, 0... |
| 3 | weaves the | 265f54a7-1ba8-40ed-940b-1808e1fd0ddb | [-0.011876586824655533, -0.06664972752332687, ... |
| 4 | socio-economic struggles and moral dilemmas of... | e68e0280-cc25-4094-93f3-326758fd38d0 | [-0.004901786334812641, 0.03645261377096176, 0... |


#### Store the embeddings to Redis