# Text Processing and Question-Answering Workflow

This notebook outlines the process of loading a PDF, splitting its content, generating embeddings, building a FAISS index, and configuring a local language model for question-answering.

---

## Steps

### 1. Load and Split Text Data
- A PDF document is loaded using the `PyPDFLoader` from LangChain.
- `loader.load()` returns the content as one document per page.

### 2. Process Text into Fragments
- The text is divided into smaller chunks using `RecursiveCharacterTextSplitter`.
- Each chunk is at most 1000 characters long, and consecutive chunks overlap by up to 200 characters so that context is preserved across chunk boundaries.
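
The effect of chunk size and overlap can be illustrated with a simplified fixed-window splitter. This is only a sketch under stated assumptions: it is not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries the separators in order), and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters (illustrative only).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role the 200-character overlap plays in the notebook.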

### 3. Convert Chunks into Document Objects
- Each text chunk is encapsulated as a `Document` object for further processing.

### 4. Generate Embeddings
- Embeddings for the text chunks are generated using the `sentence-transformers/all-MiniLM-L6-v2` model.
- These embeddings are vector representations of the text, useful for similarity search.
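
To make "similarity search" concrete, the sketch below compares toy two-dimensional vectors with cosine similarity. The vectors are made up for illustration (real embeddings have far more dimensions), and cosine similarity is only one common measure; the index built later may use a different metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]            # hypothetical query embedding
same_direction = [2.0, 0.0]   # points the same way as the query
orthogonal = [0.0, 3.0]       # unrelated to the query

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```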

### 5. Build a FAISS Index
- A FAISS index is created from the document embeddings, enabling efficient similarity-based retrieval.
- The index is saved locally for reuse.

### 6. Configure a Retriever
- The FAISS index is converted into a retriever.
- This retriever is configured to return the top 5 most similar results based on a query.
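
Conceptually, a top-k retriever scores every stored vector against the query and keeps the best k. The following brute-force sketch shows that idea with made-up vectors and a dot-product score; it is not how FAISS is implemented, and the helper name `top_k` is hypothetical.

```python
import heapq


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors with the largest dot product with the query."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    return heapq.nlargest(k, range(len(doc_vecs)), key=scores.__getitem__)


# Toy unit-length "embeddings" (illustrative values only)
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.6, 0.8]]
nearest = top_k([1.0, 0.0], doc_vecs, k=2)
print(nearest)  # [0, 2]
```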

### 7. Load and Configure a Local Language Model
- The `google/flan-t5-large` model is downloaded and loaded onto a GPU when one is available (otherwise onto the CPU).
- A `text2text-generation` pipeline is set up with a cap on the number of generated tokens.

### 8. Create a Question-Answering Chain
- A `RetrievalQA` chain is built, combining the retriever and the local language model.
- This chain retrieves relevant documents and generates answers based on the query.
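
The notebook builds the chain with the "stuff" strategy, which places all retrieved chunks into a single prompt. A rough sketch of that idea follows; the template wording and the function name `build_stuff_prompt` are assumptions for illustration, not the prompt LangChain actually uses.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for retriever results
retrieved = ["def starts a function definition.", "Functions can return values."]
prompt = build_stuff_prompt("What does def do?", retrieved)
print(prompt)
```

Because every chunk is included verbatim, the prompt grows with the number and size of the retrieved chunks.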

### 9. Test with a Query
- A sample question (e.g., "How can I create a function in Python?") is passed to the QA chain.
- The chain retrieves the relevant chunks and generates an answer, which is then cleaned up for readability.
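
The cleanup step used in the notebook's final cell can be tried on its own. The sketch below wraps the same chain of `str.replace` calls in a function; the function name `clean_answer` and the sample string `raw` are made up for illustration.

```python
def clean_answer(answer: str) -> str:
    """Apply the same replacements as the notebook's final cell."""
    return (
        answer.replace("/quotesingle.ts1", "'")  # PDF extraction residue for a single quote
        .replace(": ", ":\n\t")                   # break and indent after a colon
        .replace(") ", ")\n\t")                   # break and indent after a closing parenthesis
        .strip()
    )


raw = "def greet(name): print(/quotesingle.ts1Hello/quotesingle.ts1) return name"
print(clean_answer(raw))
```

These replacements are purely textual, so they will also split ordinary prose at every colon or closing parenthesis followed by a space.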

---

## Outputs
- Number of pages loaded from the PDF.
- Number of text fragments generated.
- Confirmation of successful index creation and retrieval configuration.
- A clean, formatted response to the test query.

---

This workflow demonstrates how to integrate LangChain components and Hugging Face models to create a functional text-processing and question-answering system.


# 1. Load and Split Text Data

In [1]:
from langchain_community.document_loaders import PyPDFLoader  # Same package the notebook uses for FAISS

# Path to the PDF file
pdf_path = "../data/pdf/pythonlearn.pdf"

# Load the PDF file
loader = PyPDFLoader(pdf_path)

# Load the PDF content and split it into individual pages
pages = loader.load()

# Print the number of pages successfully loaded from the PDF
print(f"Loaded {len(pages)} pages from the PDF.")


Loaded 241 pages from the PDF.


# 2. Process Text into Fragments

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configuring the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " "],  # Hierarchical separators
    chunk_size=1000,  # Maximum size of each text chunk
    chunk_overlap=200  # Overlap between chunks for context retention
)

# Splitting the content into fragments
documents = []
for page in pages:
    fragments = text_splitter.split_text(page.page_content)
    documents.extend(fragments)

print(f"Generated {len(documents)} fragments.")

Generated 611 fragments.


# 3. Convert Chunks into Document Objects

In [3]:
from langchain.docstore.document import Document

# Convert fragments into Document objects
doc_objects = [Document(page_content=fragment) for fragment in documents]

# Print the number of Document objects created
print(f"Created {len(doc_objects)} document objects.")

Created 611 document objects.


# 4. Generate Embeddings

In [4]:
from langchain_huggingface import HuggingFaceEmbeddings

# Load the embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Generate embeddings directly from the document contents
# (illustrative only: FAISS.from_documents in the next step embeds the documents again)
embeddings = embedding_model.embed_documents([doc.page_content for doc in doc_objects])

# Print a success message once embeddings are generated
print("Embeddings generated successfully.")




Embeddings generated successfully.


# 5. Build a FAISS Index

In [5]:
from langchain_community.vectorstores import FAISS

# Create a FAISS index by embedding the Document objects with embedding_model
vectorstore = FAISS.from_documents(doc_objects, embedding_model)

# Save the FAISS index to a local directory
vectorstore.save_local("../data/index/python_for_everybody")

# Print a success message once the index is created and saved
print("Index created and saved successfully.")

Index created and saved successfully.


# 6. Configure a Retriever

In [6]:
# Configure the retriever from the FAISS index
retriever = vectorstore.as_retriever(
    search_type="similarity",  # Use similarity-based search
    search_kwargs={"k": 5}  # Retrieve the top 5 most similar results
)

# Print a success message once the retriever is configured
print("Retriever configured successfully.")


Retriever configured successfully.


# 7. Load and Configure a Local Language Model

In [7]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Use the GPU when one is available; otherwise fall back to the CPU
model_device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the model and tokenizer
model_name = "google/flan-t5-large"  # Swap in "google/flan-t5-base" or "google/flan-t5-xl" for a different size
tokenizer = AutoTokenizer.from_pretrained(model_name)  # Load the tokenizer for the model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(model_device)  # Load the model and move it to the chosen device

# Print a success message once the model and tokenizer are loaded
print("Model and tokenizer loaded successfully.")


Model and tokenizer loaded successfully.


# 8. Create a Question-Answering Chain

In [8]:
import torch
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

# Configure the text2text-generation pipeline with an output length limit
llm_pipeline = pipeline(
    "text2text-generation",  # Specify the task type
    model=model,  # Use the preloaded model
    tokenizer=tokenizer,  # Use the preloaded tokenizer
    # max_length was dropped: it caps the generated sequence, not the input,
    # and is redundant once max_new_tokens is set
    max_new_tokens=200,  # Maximum number of generated tokens
    device=0 if torch.cuda.is_available() else -1  # First GPU if available, otherwise CPU
)

# Integrate the pipeline with LangChain
llm = HuggingFacePipeline(pipeline=llm_pipeline)

# Print a success message once the local LLM is configured
print("Local LLM configured successfully.")


Local LLM configured successfully.


# 9. Test with a Query

In [12]:
from langchain.chains import RetrievalQA
import pickle
import os

# Create the question-answering chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,  # Configured LLM for text generation
    retriever=retriever,  # Configured retriever for document retrieval
    chain_type="stuff"  # Strategy for combining retrieved documents
)

# Directory where the chain will be saved
save_path = "../data/model/"
os.makedirs(save_path, exist_ok=True)  # Create the folder if it does not exist

# Save the qa_chain object to disk
with open(os.path.join(save_path, "qa_chain.pkl"), "wb") as f:
    pickle.dump(qa_chain, f)

print("qa_chain saved successfully.")

# Test the chain with a query (disabled in this run; uncomment to execute)
# query = "How can I create a function in Python? Give me an example"
#
# # Generate the response using the QA chain
# response = qa_chain.invoke({"query": query})
# answer = response["result"]
#
# # Clean up the generated text for better readability
# clean_answer = answer.replace("/quotesingle.ts1", "'").replace(": ", ":\n\t").replace(") ", ")\n\t").strip()
#
# # Print the final, cleaned answer
# print(f"Answer:\n {clean_answer}")


qa_chain saved successfully.