In [2]:
# Data Ingestion
from langchain_community.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader

In [3]:
# Text Loading
loader = TextLoader("text.txt")
text_document = loader.load()
text_document

[Document(metadata={'source': 'text.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because w

In [21]:
# PDF Loading
loader = PyPDFLoader("attention.pdf")
pdf_document = loader.load()
pdf_document[:2]

[Document(metadata={'source': 'attention.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\n

In [6]:
# Web Loading
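# Optional: parse only elements with specific CSS classes (the class names below are placeholders).
# import bs4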

# loader = WebBaseLoader(
#     web_paths=("http://www.plaintxt.org/",),
#     bs_kwargs=dict(
#         parse_only=bs4.SoupStrainer(class_=("class-names1", "class-names2"))
#     ),
# )

loader = WebBaseLoader("http://www.plaintxt.org/")
document = loader.load()
document

[Document(metadata={'source': 'http://www.plaintxt.org/', 'title': 'plaintxt.org – Minimalism in blog design, an experiment', 'description': 'Plaintxt.org was the first Web site dedicated to minimalism in blog design and the original home of the Sandbox theme for WordPress.', 'language': 'No language found.'}, page_content='\n\n\nplaintxt.org – Minimalism in blog design, an experiment\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nplaintxt.org\nMinimalism in blog design, an experiment\n\n\n\nContents\n\nIntroduction\nThemes\nExperiments (Plugins)\nMiscellaneous\nLicense\nTerms of use\n\n\n\n\n\nIntroduction\nOnce upon a time, I was an actively developing themes and plugins for WordPress. No more. No, I have since moved on to other projects. And while this site is no longer updated, I want it to remain because when I was developing for WordPress, I found referencing other themes and plugins useful, even necessary.\nBut with the inclusion of the three functions that w

In [22]:
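# Transformation: split the PDF pages into overlapping chunks
# (newer LangChain releases ship the splitters in the langchain_text_splitters package)
from langchain_text_splitters import RecursiveCharacterTextSplitter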

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = text_splitter.split_documents(pdf_document)
documents[:2]

[Document(metadata={'source': 'attention.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\n
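
The `chunk_size=1000, chunk_overlap=100` settings above cap each chunk at roughly 1000 characters and let neighbouring chunks share about 100 characters of context. The sketch below illustrates only that size/overlap idea with a plain fixed-width character window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraph and line breaks), and `sliding_window_chunks` is a hypothetical helper, not a LangChain function.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: the real splitter also tries to cut
    on natural boundaries instead of at arbitrary character offsets.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is how overlap keeps a sentence that straddles a boundary retrievable from either side.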

In [15]:
# Creating Vector DB
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

from dotenv import load_dotenv
import os


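load_dotenv()
# Re-export the token only if it is set; assigning None to os.environ raises a TypeError.
hf_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
if hf_token:
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_token

# Embed the first 20 chunks and index them in a Chroma vector store.
db = Chroma.from_documents(documents[:20], HuggingFaceEmbeddings())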

In [16]:
# Querying the vector DB: print the chunk most similar to the query
query = "What is Model Architecture?"
retrieved_results = db.similarity_search(query)
print(retrieved_results[0].page_content)

Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N= 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [ 11] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm( x+ Sublayer( x)), where Sublayer( x)is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512 .


In [18]:
# Same chunks and embedding model, indexed with FAISS instead of Chroma
from langchain_community.vectorstores import FAISS

db2 = FAISS.from_documents(documents[:20], HuggingFaceEmbeddings())

In [20]:
result = db2.similarity_search("What is Model Architecture?")
print(result[0].page_content)

Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N= 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [ 11] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm( x+ Sublayer( x)), where Sublayer( x)is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512 .
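
Both stores return the same top chunk because they embed the same 20 chunks with the same model and then rank them by closeness to the query embedding. A minimal sketch of that ranking step follows, using hand-made 3-dimensional vectors and cosine similarity. The vectors, labels, and the `top_k` helper are made up for illustration. The metric and index structure actually used depend on the store and its configuration, so this is not a description of Chroma or FAISS internals.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 1) -> list[str]:
    """Brute-force search: score every stored vector, return the k best texts."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]


# Toy "embeddings" (assumed values, not produced by a real model).
toy_store = [
    ("model architecture", [0.9, 0.1, 0.0]),
    ("training data", [0.1, 0.9, 0.0]),
    ("results", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]

best = top_k(toy_query, toy_store, k=2)
print(best)
```

Real stores avoid scanning every vector when the collection is large, but for 20 chunks a brute-force ranking like this gives the same ordering idea.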
