# Tutorial 3: Document Processing with LangChain

In this tutorial, we'll explore document processing techniques using LangChain. We'll cover loading and parsing documents, text splitting, building a simple question-answering system, and implementing semantic search.

In [3]:
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq
# Community integrations are imported from langchain_community
# (the equivalent langchain.* import paths are deprecated)
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Load environment variables
load_dotenv()

# Initialize the Groq LLM
llm = ChatGroq(
    model_name="llama-3.1-70b-versatile",
    temperature=0.7,
    model_kwargs={"top_p": 0.8, "seed": 1337},
)

# Embedding model served by Ollama; the base URL is read from the environment
embedding_model = OllamaEmbeddings(
    model="all-minilm",
    base_url=os.getenv("OLLAMA_EMBEDDINGS_URL"),
)
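
The setup cell depends on values loaded from a `.env` file, and a missing value tends to surface later as a confusing connection or authentication error. A minimal sketch of an early check is shown below. The helper `missing_env_vars` is hypothetical, and `GROQ_API_KEY` is assumed to be the variable that holds the Groq key; only `OLLAMA_EMBEDDINGS_URL` appears in the cell above.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Assumption: GROQ_API_KEY is the variable holding the Groq key.
# OLLAMA_EMBEDDINGS_URL is the variable read in the setup cell.
REQUIRED_VARS = ["GROQ_API_KEY", "OLLAMA_EMBEDDINGS_URL"]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running a check like this right after `load_dotenv()` makes configuration problems visible before any model client is created.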


## 1. Loading and Parsing Documents

In [7]:
# Load a single document
loader = TextLoader("sample_documents/sample1.txt")
document = loader.load()

print(f"Content of sample1.txt:\n{document[0].page_content[:200]}...\n")

# Load multiple documents from a directory
dir_loader = DirectoryLoader("sample_documents/", glob="*.txt", loader_cls=TextLoader)
documents = dir_loader.load()

print(f"Number of documents loaded: {len(documents)}")
for i, doc in enumerate(documents):
    print(f"Document {i+1} preview: {doc.page_content[:50]}...")

Content of sample1.txt:
**Introduction to Artificial Intelligence**

Artificial Intelligence (AI) is an interdisciplinary field within computer science that focuses on developing systems and machines capable of performing ta...

Number of documents loaded: 1
Document 1 preview: **Introduction to Artificial Intelligence**

Artif...
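
To make the directory-loading step less opaque, here is a rough sketch of the same idea in plain Python: match files with a glob pattern, read each one, and keep its path as metadata. The function `load_text_files` and the dictionary layout are illustrative assumptions, not how `DirectoryLoader` is implemented.

```python
import tempfile
from pathlib import Path


def load_text_files(directory, pattern="*.txt"):
    """Read every file matching `pattern` into a record with content and metadata."""
    records = []
    for path in sorted(Path(directory).glob(pattern)):
        records.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return records


# Small demonstration on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.md").write_text("not matched by *.txt", encoding="utf-8")
    docs = load_text_files(tmp)
    print(len(docs), docs[0]["page_content"])
```

Only files matching the pattern are read, which is why the `glob` argument determines how many documents end up loaded.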


In [45]:
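from langchain_community.document_loaders import PyPDFLoader

# Load the PDF (PyPDFLoader returns one Document per page)
# Note: this reassigns `documents`, replacing the text files loaded above
loader = PyPDFLoader("sample_documents/sample-2.pdf")
documents = loader.load()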


## 2. Text Splitting and Chunking

In [8]:
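# Create a text splitter: chunks of up to 1000 characters (length_function=len),
# with up to 200 characters of overlap between consecutive chunks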
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

# Split the documents
splits = text_splitter.split_documents(documents)

print(f"Number of splits: {len(splits)}")
print(f"First split preview:\n{splits[0].page_content[:200]}...")

Number of splits: 13
First split preview:
**Introduction to Artificial Intelligence**

Artificial Intelligence (AI) is an interdisciplinary field within computer science that focuses on developing systems and machines capable of performing ta...
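
The interaction between `chunk_size` and `chunk_overlap` is easier to see in a simplified form. The sketch below is a plain fixed-window splitter, not the recursive algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraph and line boundaries. The function name `split_fixed_window` is hypothetical.

```python
def split_fixed_window(text, chunk_size=1000, chunk_overlap=200):
    """Split `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_fixed_window(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

With a window of 10 and an overlap of 3, every chunk begins with the last three characters of the previous one. The overlap exists so that a sentence cut at a chunk boundary still appears intact in at least one chunk.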


## 3. Building a Simple Question-Answering System

In [9]:
# Create a vector store
vectorstore = FAISS.from_documents(splits, embedding_model)

# Create a retrieval-based QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Ask a question
query = "What is the main topic of these documents?"
result = qa_chain.invoke({"query": query})

print(f"Question: {query}")
print(f"Answer: {result['result']}\n")
print("Sources:")
for i, doc in enumerate(result['source_documents']):
    print(f"Document {i+1}: {doc.page_content[:100]}...")

Question: What is the main topic of these documents?
Answer: The main topic of these documents appears to be Artificial Intelligence (AI), specifically covering its definition, subfields, and applications across various industries.

Sources:
Document 1: ### 2. **Natural Language Processing (NLP)**
Natural Language Processing enables machines to underst...
Document 2: ### 5. **Deep Learning**
A subset of machine learning, deep learning utilizes neural networks with m...
Document 3: **Introduction to Artificial Intelligence**

Artificial Intelligence (AI) is an interdisciplinary fi...
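
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends that prompt to the model once. A minimal sketch of that idea follows. The function `build_stuffed_prompt` and the prompt wording are assumptions for illustration and do not reproduce LangChain's actual template.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "Natural Language Processing enables machines to understand text.",
    "Deep learning utilizes neural networks with many layers.",
]
print(build_stuffed_prompt("What is the main topic?", retrieved))
```

Because every chunk goes into one prompt, this approach only works while `k` chunks of `chunk_size` characters fit within the model's context window.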


## 4. Implementing Semantic Search

In [51]:
# Perform a semantic search
query = "Discuss the importance of AI"
search_results = vectorstore.similarity_search(query, k=3)

print(f"Search query: {query}\n")
print("Top 3 relevant chunks:")
for i, doc in enumerate(search_results):
    print(f"Result {i+1}:\n{doc.page_content[:200]}...\n")

# Use the search results to answer a question
question = "What are some advantages of ai models?"
context = "\n".join([doc.page_content for doc in search_results])

prompt = f"Based on the following context, answer the question: {question}\n\nContext: {context}\n\nAnswer:"
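answer = llm.invoke(prompt)

print(f"Question: {question}")
# llm.invoke returns an AIMessage, so this prints its repr (content='...').
# Use answer.content to print only the generated text.
print(f"Answer: {answer}")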

Search query: Discuss the importance of AI

Top 3 relevant chunks:
Result 1:
As AI continues to advance, it has the potential to revolutionize various industries, including healthcare, finance, transportation, and education. However, the development of AI also raises important...

Result 2:
Introduction to Artificial Intelligence
Artificial Intelligence (AI) is a rapidly growing field of computer science that focuses on creating intelligent machines that can perform tasks that typically ...

Question: What are some advantages of ai models?
Answer: content='Some advantages of AI models include:\n\n1. **Improved Efficiency**: AI models can process and analyze large amounts of data quickly and accurately, automating tasks and freeing up human resources for more complex and creative work.\n\n2. **Enhanced Decision-Making**: AI models can learn from data and make predictions or decisions based on that data, enabling businesses and organizations to make more informed decisions.\n\n3. **Incre
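
Conceptually, semantic search embeds the query, scores it against every stored chunk vector, and returns the best matches. The sketch below shows this with hand-written toy vectors and cosine similarity. Both the vectors and the helper names are made up for illustration, and the FAISS index used in this tutorial may rank by a different metric (L2 distance is a common default).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the `k` highest-scoring (score, text) pairs from `indexed`."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: (text, embedding) pairs with invented 3-dimensional vectors
toy_index = [
    ("AI in healthcare", [0.9, 0.1, 0.0]),
    ("History of chess", [0.0, 0.2, 0.9]),
    ("AI in finance", [0.8, 0.3, 0.1]),
]
for score, text in top_k([1.0, 0.0, 0.0], toy_index, k=2):
    print(f"{score:.3f}  {text}")
```

A real vector store does the same ranking over embeddings produced by the embedding model, using index structures that avoid scoring every vector one by one when the collection is large.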

## Conclusion

In this tutorial, we've covered the core steps of document processing with LangChain: loading and parsing documents, splitting text into chunks, building a simple question-answering system, and implementing semantic search. These techniques form the foundation for more advanced document analysis and information retrieval systems.