# Tutorial 3: Document Processing with LangChain

This tutorial walks through document processing with LangChain: loading and parsing documents, splitting text into chunks, building a simple question-answering system, and implementing semantic search.

In [8]:
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS, Chroma
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load environment variables
load_dotenv()

# Initialize the Groq LLM (the API key is read from the GROQ_API_KEY environment variable)
llm = ChatGroq(
    model_name="llama-3.3-70b-versatile",
    temperature=0.7,
    model_kwargs={"top_p": 0.8, "seed": 1337},
)
# Alternative: embeddings served by Ollama
# embedding_model = OllamaEmbeddings(model="all-minilm", base_url=os.getenv("OLLAMA_EMBEDDING_URL"))

# Alternative: embeddings from the Hugging Face Inference API
# (requires importing HuggingFaceInferenceAPIEmbeddings from langchain_community.embeddings)
# embedding_model = HuggingFaceInferenceAPIEmbeddings(
#     api_key=os.getenv("HF_API_KEY"),
#     model_name="sentence-transformers/all-MiniLM-L6-v2"
# )

# Create the embedding model; this variant runs the model locally
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


test_text = "This is a test sentence."
embedding = embedding_model.embed_query(test_text)
print("Embedding length:", len(embedding))
print("Embedding sample:", embedding[:5])

Embedding length: 384
Embedding sample: [0.08429646492004395, 0.057953670620918274, 0.00449336739256978, 0.10582108050584793, 0.00708338338881731]
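
Embedding vectors like the 384-dimensional one above are usually compared by cosine similarity. The following is a minimal sketch in plain Python. The function name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions, not part of LangChain.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (hypothetical values)
v1 = [1.0, 0.0, 1.0]
v2 = [2.0, 0.0, 2.0]  # same direction as v1, different length
v3 = [0.0, 1.0, 0.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

With the real model, the same function could be applied to two results of `embedding_model.embed_query(...)` to gauge how close two sentences are.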


## 1. Loading and Parsing Documents

In [9]:
# Load a single document
loader = TextLoader("sample_documents/sample1.txt")
document = loader.load()

print(f"Content of sample1.txt:\n{document[0].page_content[:200]}...\n")

# Load multiple documents from a directory
dir_loader = DirectoryLoader("sample_documents/", glob="*.txt", loader_cls=TextLoader)
documents = dir_loader.load()

print(f"Number of documents loaded: {len(documents)}")
for i, doc in enumerate(documents):
    print(f"Document {i+1} preview: {doc.page_content[:50]}...")

Content of sample1.txt:
# Comprehensive Overview of Artificial Intelligence

## Table of Contents
1. [Introduction to Artificial Intelligence](#introduction-to-artificial-intelligence)
2. [History of AI](#history-of-ai)
3. [...

Number of documents loaded: 1
Document 1 preview: # Comprehensive Overview of Artificial Intelligenc...


In [10]:
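from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; PyPDFLoader returns one Document per page.
# Note: this reassigns `documents`, so the following cells operate on the PDF, not the text files.
loader = PyPDFLoader("sample_documents/sample2.pdf")
documents = loader.load()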


## 2. Text Splitting and Chunking

In [11]:
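# Create a text splitter: chunks of at most 1000 characters,
# with up to 200 characters shared between consecutive chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,  # measure chunk length in characters
)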

# Split the documents
splits = text_splitter.split_documents(documents)

print(f"Number of splits: {len(splits)}")
print(f"First split preview:\n{splits[0].page_content[:200]}...")

Number of splits: 110
First split preview:
Quiet-STaR: Language Models Can Teach Themselves to
Think Before Speaking
Eric Zelikman
Stanford University
Georges Harik
Notbad AI Inc
Yijia Shao
Stanford University
Varuna Jayasiri
Notbad AI Inc
Nic...
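
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraphs, lines, and spaces, so its chunks are generally not fixed-size. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.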


## 3. Building a Simple Question-Answering System

In [12]:
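# Embed each chunk and index the vectors in an in-memory FAISS vector store
vectorstore = FAISS.from_documents(splits, embedding_model)

# Create a retrieval-based QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # place all retrieved chunks into a single prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),  # fetch the 3 most similar chunks
    return_source_documents=True,  # include the retrieved chunks in the result
)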

# Ask a question
query = "What is the main topic of these documents?"
result = qa_chain.invoke({"query": query})

print(f"Question: {query}")
print(f"Answer: {result['result']}\n")
print("Sources:")
for i, doc in enumerate(result['source_documents']):
    print(f"Document {i+1}: {doc.page_content[:100]}...")

Question: What is the main topic of these documents?
Answer: The main topic of these documents appears to be the improvement of Language Models (LMs) through a technique called Quiet-STaR, which enables them to better reason and understand text, particularly in tasks that require commonsense reasoning and problem-solving.

Sources:
Document 1: improve the LM’s ability to directly answer difficult questions. In particular,
after continued pret...
Document 2: these tends to<|startthought|> in some sense - to be the more difficult<|
endthought|> trickiest for...
Document 3: 5.2 Improvement Distribution
As visualized in Appendix Figure 7, we find that on average there is li...
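
The `"stuff"` chain type used above places every retrieved chunk into one prompt. A rough sketch of that idea in plain Python follows. The prompt wording and the helper name `build_stuff_prompt` are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder strings standing in for retrieved chunk text (hypothetical)
retrieved = ["Chunk one text.", "Chunk two text.", "Chunk three text."]
prompt = build_stuff_prompt("What is the main topic?", retrieved)
print(prompt)
```

Because everything is sent in a single call, this approach only works while the combined chunks fit within the model's context window, which is one reason to keep `k` small.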


## 4. Implementing Semantic Search

In [13]:
# Perform a semantic search
query = "Discuss the importance of AI"
search_results = vectorstore.similarity_search(query, k=3)

print(f"Search query: {query}\n")
print("Top 3 relevant chunks:")
for i, doc in enumerate(search_results):
    print(f"Result {i+1}:\n{doc.page_content[:200]}...\n")

# Use the search results to answer a question
question = "What are some advantages of ai models?"
context = "\n".join([doc.page_content for doc in search_results])

prompt = f"Based on the following context, answer the question: {question}\n\nContext: {context}\n\nAnswer:"
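answer = llm.invoke(prompt)  # returns an AIMessage object, not a plain string

print(f"Question: {question}")
# Printing the message object shows its repr (content='...'); use answer.content for the text alone
print(f"Answer: {answer}")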

Search query: Discuss the importance of AI

Top 3 relevant chunks:
Result 1:
process-and outcome-based feedback. Neural Information Processing Systems (NeurIPS
2022) Workshop on MATH-AI, 2022.
Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah D
Goodm...

Result 2:
Quiet-STaR: Language Models Can Teach Themselves to
Think Before Speaking
Eric Zelikman
Stanford University
Georges Harik
Notbad AI Inc
Yijia Shao
Stanford University
Varuna Jayasiri
Notbad AI Inc
Nic...

Result 3:
Proving. CoRR, abs/2009.03393, 2020. URL https://arxiv.org/abs/2009.03393. eprint:
2009.03393.
Ben Prystawski, Michael Li, and Noah Goodman. Why think step by step? reasoning
emerges from the locality...

Question: What are some advantages of ai models?
Answer: content='Based on the provided context, some advantages of AI models include:\n\n1. **Ability to learn and infer unstated rationales**: AI models, such as language models, can learn to infer rationales from few-shot examples and ev
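
Conceptually, a similarity search ranks the stored chunk vectors by their distance to the query vector and returns the closest `k`. The brute-force sketch below shows this in plain Python. The choice of Euclidean distance, the helper name `top_k`, and the toy 2-dimensional vectors are assumptions for illustration; FAISS uses optimized index structures rather than a Python loop.

```python
import math

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, math.dist(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Toy vectors standing in for chunk embeddings (hypothetical values)
doc_vecs = [
    [0.0, 0.0],
    [1.0, 1.0],
    [0.9, 1.1],
    [5.0, 5.0],
]
query_vec = [1.0, 1.0]

results = top_k(query_vec, doc_vecs, k=2)
for index, distance in results:
    print(index, round(distance, 3))
```

The ranking depends only on vector proximity, so chunks that merely share vocabulary with the query (such as entries in a reference list) can outrank passages that actually discuss the topic.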

## Conclusion

In this tutorial, we loaded and parsed documents, split text into chunks, built a simple question-answering system, and implemented semantic search with LangChain. These techniques form the foundation for more advanced document analysis and information retrieval systems.