# **Business Context**

Innovative Research Labs focuses on advancing LLM-based technologies, where quick access to accurate and up-to-date information is essential for driving innovation. As the field evolves rapidly, researchers must efficiently understand complex concepts to maintain a competitive edge.

However, the growing volume of technical articles and publications makes it time-consuming to extract relevant insights, especially for foundational topics like the Transformer architecture. This case study demonstrates how a document question-answering system powered by Generative AI can efficiently retrieve and summarize key insights from Jay Alammar’s “The Illustrated Transformer”, streamlining research and improving knowledge discovery.

### Environment setup
Run the cell below first to load API keys. Put a `.env` file in the project root with `HUGGINGFACEHUB_API_TOKEN=your_token`.

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
from IPython.display import Markdown, display
from langchain_huggingface import HuggingFaceEndpoint, HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline


## **LangChain Pipeline Overview**

Turning unstructured data into a question-answering (QA) system follows these key steps:

1. **Data Loading:** Raw content is ingested from various sources using LangChain loaders, which convert the data into standardized Document objects.
2. **Text Splitting:** Documents are divided into smaller, manageable chunks to improve processing and retrieval efficiency.
3. **Storage:** These chunks are embedded and stored, typically in a vector database, so they can be searched semantically.
4. **Retrieval:** Relevant chunks are retrieved from storage based on their similarity to the user’s query.
5. **Response Generation:** An LLM generates an answer using the user’s question along with the retrieved context.
6. **Conversation (Optional):** Memory can be added to enable multi-turn interactions and contextual continuity.
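
The response-generation step above comes down to combining the retrieved chunks and the user's question into a single prompt for the LLM. The sketch below is a minimal illustration of that idea, not the prompt LangChain builds internally. The function name `build_prompt`, the prompt wording, and the two sample chunks are all assumptions made up for this example.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into one prompt string."""
    # Number each chunk so the individual pieces of context stay distinguishable.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, written by hand for illustration only.
retrieved = [
    "Self-attention lets each word look at other words in the input sentence.",
    "The encoder is a stack of identical layers.",
]

prompt = build_prompt("What does self-attention do?", retrieved)
print(prompt)
```

Keeping the context and the question in clearly separated sections makes it easier to see what the model was given when an answer looks wrong.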

In [3]:
model_repo = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = AutoModelForSeq2SeqLM.from_pretrained(model_repo)

## **Step 1: Loading**
Use a document loader to turn unstructured data into `Document` objects.
A `Document` holds the text (`page_content`) and its metadata.
`WebBaseLoader` fetches a web page, extracts the text from its HTML, and returns it as `Document` objects for NLP tasks.

In [5]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("http://jalammar.github.io/illustrated-transformer/")
data = loader.load()  # Loads the content of the webpage

## **Step 2: Splitting** 

Divide the document into smaller segments so they can be embedded and stored efficiently in a vector database.

Vector Store:
A vector store is a system designed to manage and search unstructured data using embeddings. The typical workflow involves converting documents into embedding vectors and storing them. When a user submits a query, the query is also converted into an embedding. The system then retrieves the stored vectors that are most similar to the query vector. The vector store handles both the storage of embeddings and the similarity search process.
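
The store-then-search workflow described above can be sketched with a tiny in-memory store. This is a toy under stated assumptions, not how a production vector database is implemented. The class `ToyVectorStore`, the three-dimensional vectors, and the chunk labels are all hypothetical, and the vectors are hand-picked rather than produced by an embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical minimal store: keeps (vector, text) pairs in a list."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._items.append((vector, text))

    def search(self, query_vector: list[float], k: int = 1) -> list[str]:
        # Rank every stored vector by similarity to the query, highest first.
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
# Hand-picked vectors standing in for real embeddings (assumption).
store.add([0.9, 0.1, 0.0], "chunk about attention")
store.add([0.1, 0.9, 0.0], "chunk about positional encoding")
store.add([0.0, 0.1, 0.9], "chunk about the decoder")

query_vector = [0.8, 0.2, 0.0]
top = store.search(query_vector, k=2)
print(top)
```

A real vector store avoids comparing the query against every stored vector by using an index, but the contract is the same: vectors go in, and the most similar entries come out.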

Text Embedding:
Text embedding is the process of transforming textual data into numerical form, usually as high-dimensional vectors. Each piece of text (a word, a token, or a whole chunk) is mapped to a vector in a way that preserves semantic meaning, so words or phrases with similar meanings produce similar vectors. These embeddings allow machine learning models to interpret and process text by capturing its contextual and semantic relationships in numerical space.
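
To make "similar texts produce similar vectors" concrete, here is a deliberately simple sketch that uses word counts over a fixed vocabulary as a stand-in for an embedding. This is an assumption made for illustration: a trained embedding model captures meaning, whereas this toy only captures word overlap. The `embed` function, the vocabulary, and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy embedding: count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical six-word vocabulary; each word is one dimension of the vector.
vocabulary = ["attention", "encoder", "decoder", "weights", "pizza", "cheese"]

v_a = embed("attention weights in the encoder", vocabulary)
v_b = embed("the decoder also uses attention", vocabulary)
v_c = embed("pizza with extra cheese", vocabulary)

sim_ab = cosine_similarity(v_a, v_b)  # two sentences about Transformers
sim_ac = cosine_similarity(v_a, v_c)  # unrelated topics
print(f"related: {sim_ab:.3f}, unrelated: {sim_ac:.3f}")
```

The two Transformer-related sentences share a dimension and score higher than the unrelated pair, which is the property a similarity search relies on.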

In [104]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Split the document into chunks of at most 1515 characters with a 100-character overlap.
# Earlier setting, kept for reference (Noha changed chunk_size and chunk_overlap):
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1515, chunk_overlap=100)
all_splits = text_splitter.split_documents(data)
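
To see what `chunk_size` and `chunk_overlap` mean in practice, the sketch below cuts a string at fixed offsets so that consecutive chunks share a few characters. This is a simplification and an assumption for illustration: `RecursiveCharacterTextSplitter` prefers to break on natural separators such as paragraphs, lines, and spaces, so its chunks will not line up with these fixed windows. The helper `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears, at least in part, in both neighbouring chunks, which helps retrieval when the answer sits near a boundary.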