In [None]:
!pip -q install langchain huggingface_hub openai chromadb tiktoken faiss-cpu
!pip -q install sentence_transformers pypdf
!pip -q install -U FlagEmbedding

In [17]:
import os

os.environ["OPENAI_API_KEY"] = "sk-x"

In [18]:
from langchain.vectorstores import FAISS

from langchain.schema import Document
from langchain.vectorstores import Chroma

## Text Splitting & Docloader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader

from langchain.embeddings import OpenAIEmbeddings

## Data preparation

In [4]:
!wget -q https://www.dropbox.com/s/zoj9rnm7oyeaivb/new_papers.zip
!unzip -q new_papers.zip -d new_papers

In [5]:
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

loader = DirectoryLoader('./new_papers/new_papers/', glob="./*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

In [6]:
len(documents)

142

In [7]:
documents = documents[:10]  # use only a subset

In [8]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

In [9]:
texts[0]

Document(page_content='FlashAttention : Fast and Memory-Eﬃcient Exact Attention\nwith IO-Awareness\nTri Daoy, Daniel Y. Fuy, Stefano Ermony, Atri Rudraz, and Christopher Réy\nyDepartment of Computer Science, Stanford University\nzDepartment of Computer Science and Engineering, University at Buﬀalo, SUNY\n{trid,danfu}@cs.stanford.edu ,ermon@stanford.edu ,atri@buffalo.edu ,\nchrismre@cs.stanford.edu\nJune 24, 2022\nAbstract\nTransformers are slow and memory-hungry on long sequences, since the time and memory complexity\nof self-attention are quadratic in sequence length. Approximate attention methods have attempted\nto address this problem by trading oﬀ model quality to reduce the compute complexity, but often do\nnot achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-\naware—accounting for reads and writes between levels of GPU memory. We propose FlashAttention ,\nan IO-aware exact attention algorithm that uses tiling to reduce the number of 

### BGE Embeddings

In [None]:
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en-v1.5"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},
    encode_kwargs=encode_kwargs
)
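
The `normalize_embeddings` comment above can be checked with a small sketch: once vectors are scaled to unit length, their plain dot product equals their cosine similarity. The vectors and helper names here are toy assumptions for illustration, not output of the BGE model.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (the effect normalize_embeddings=True asks for)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-dimensional "embeddings"; real embedding vectors have far more dimensions.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine_similarity(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_normalized)
```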

In [11]:
# Helper function for printing docs
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [13]:
retriever = FAISS.from_documents(texts,
                                 bge_embeddings
                                #  OpenAIEmbeddings()
                                 ).as_retriever()

docs = retriever.get_relevant_documents("What is AliBi?")
# let's look at the docs
pretty_print_docs(docs[2:4])

Document 1:

Our implementation uses Apex’s FMHA code ( https://github.com/NVIDIA/apex/tree/master/apex/
contrib/csrc/fmha ) as a starting point. We thank Young-Jun Ko for the in-depth explanation of his FMHA
implementation and for his thoughtful answers to our questions about CUDA. We thank Sabri Eyuboglu,
Megan Leszczynski, Laurel Orr, Yuhuai Wu, Beidi Chen, and Xun Huang for their constructive feedback and
suggestions on early drafts of the paper. We thank Markus Rabe and Charles Staats for helpful discussion of
their attention algorithm.
We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF under Nos.
CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ARL under
No. W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under No. N000141712266 (Unifying Weak
Supervision); ONR N00014-20-1-2480: Understanding and Applying Non-Euclidean Geometry in Machine
------------------------------------------------------------------------

## Adding contextual compression with an LLMChainExtractor

Now let's wrap our base retriever with a ContextualCompressionRetriever. We'll add an LLMChainExtractor, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.
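
The control flow described above (retrieve first, then extract per document) can be sketched without any LLM. This is only an illustration: `base_retrieve` and `extract_relevant` are hypothetical stand-ins for the FAISS retriever and the LLM call, and the keyword matching is a toy substitute for what the model does.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    page_content: str

def base_retrieve(query):
    """Hypothetical stand-in for the base retriever: returns fixed documents."""
    return [
        Doc("FlashAttention is an IO-aware exact attention algorithm. We thank our reviewers."),
        Doc("We gratefully acknowledge the support of our funders."),
    ]

def _words(text):
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip("?.,").lower() for w in text.split()}

def extract_relevant(query, doc):
    """Toy stand-in for the LLM: keep sentences sharing a long word with the query."""
    keywords = {w for w in _words(query) if len(w) > 4}
    sentences = [s.strip() for s in doc.page_content.split(". ") if s.strip()]
    kept = [s for s in sentences if keywords & _words(s)]
    return ". ".join(kept)

def compressed_retrieve(query):
    """Retrieve, shrink each document to its relevant part, drop empty results."""
    results = []
    for doc in base_retrieve(query):
        extracted = extract_relevant(query, doc)
        if extracted:
            results.append(Doc(extracted))
    return results

compressed = compressed_retrieve("What is FlashAttention?")
for d in compressed:
    print(d.page_content)
```

Note that a document can disappear entirely when nothing in it is judged relevant, which is the role the `NO_OUTPUT` instruction plays in the real extractor prompt.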

In [36]:
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# make the compressor
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

# it needs a base retriever (we're using FAISS Retriever) and a compressor (Made above)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                       base_retriever=retriever)

In [37]:
# compressor prompt
compressor.llm_chain.prompt

PromptTemplate(input_variables=['context', 'question'], output_parser=NoOutputParser(), template='Given the following question and context, extract any part of the context *AS IS* that is relevant to answer the question. If none of the context is relevant return NO_OUTPUT. \n\nRemember, *DO NOT* edit the extracted parts of the context.\n\n> Question: {question}\n> Context:\n>>>\n{context}\n>>>\nExtracted relevant parts:')

In [38]:
compressed_docs = compression_retriever.get_relevant_documents("What is FlashAttention?")
pretty_print_docs(compressed_docs)



Document 1:

- FlashAttention scales Transformers to longer sequences, which improves their quality and enables new capabilities.
- We observe a 0.7 improvement in perplexity on GPT-2 and 6.4 points of lift from modeling longer sequences on long-document classiﬁcation [13].
- FlashAttention enables the ﬁrst Transformer that can achieve better-than-chance performance on the Path-X [ 80] challenge, solely from using a longer sequence length (16K).
- Block-sparse FlashAttention enables a Transformer to scale to even longer sequences (64K), resulting in the ﬁrst model that can achieve better-than-chance performance on Path-256.
- FlashAttention is up to 3faster than the standard attention implemen- tation across common sequence lengths from 128 to 2K and scales up to 64K.
- Up to sequence length of 512, FlashAttention is both faster and more memory-eﬃcient than any existing attention method.
--------------------------------------------------------------------------------------------------

## More built-in compressors: filters

### LLMChainFilter

Uses an LLM chain to decide which of the initially retrieved documents to pass on to the final LLM. This yes/no decision could also be delegated to a model fine-tuned for the task.

For each document the chain answers "YES" (keep it) or "NO" (drop it).

In [39]:
from langchain.retrievers.document_compressors import LLMChainFilter

_filter = LLMChainFilter.from_llm(llm)
_filter.llm_chain.prompt

PromptTemplate(input_variables=['context', 'question'], output_parser=BooleanOutputParser(), template="Given the following question and context, return YES if the context is relevant to the question and NO if it isn't.\n\n> Question: {question}\n> Context:\n>>>\n{context}\n>>>\n> Relevant (YES / NO):")
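
A rough sketch of the filtering idea, assuming a hypothetical `judge_relevance` function in place of the LLM: the reply is parsed into a boolean and each document is kept or dropped whole, with its content left untouched.

```python
def judge_relevance(question, context):
    """Hypothetical stand-in for the LLM: replies 'YES' or 'NO' as text.

    Toy rule: the context is relevant if it mentions the last word of the question.
    """
    topic = question.rstrip("?").split()[-1].lower()
    return "YES" if topic in context.lower() else "NO"

def parse_yes_no(reply):
    """Turn the textual reply into a boolean; reject anything else."""
    answer = reply.strip().upper()
    if answer not in ("YES", "NO"):
        raise ValueError(f"expected YES or NO, got {reply!r}")
    return answer == "YES"

def filter_documents(question, documents):
    """Keep whole documents judged relevant; nothing is rewritten or shortened."""
    return [d for d in documents if parse_yes_no(judge_relevance(question, d))]

candidates = [
    "FlashAttention uses tiling to reduce memory reads and writes.",
    "We gratefully acknowledge the support of our funders.",
]
kept = filter_documents("What is FlashAttention?", candidates)
print(kept)
```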

In [40]:
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What is FlashAttention?")
pretty_print_docs(compressed_docs)



Document 1:

•Higher Quality Models. FlashAttention scales Transformers to longer sequences, which improves
their quality and enables new capabilities. We observe a 0.7 improvement in perplexity on GPT-2 and
6.4 points of lift from modeling longer sequences on long-document classiﬁcation [13]. FlashAttention
enables the ﬁrst Transformer that can achieve better-than-chance performance on the Path-X [ 80] challenge,
solely from using a longer sequence length (16K). Block-sparse FlashAttention enables a Transformer
to scale to even longer sequences (64K), resulting in the ﬁrst model that can achieve better-than-chance
performance on Path-256.
•Benchmarking Attention. FlashAttention is up to 3faster than the standard attention implemen-
tation across common sequence lengths from 128 to 2K and scales up to 64K. Up to sequence length of 512,
FlashAttention is both faster and more memory-eﬃcient than any existing attention method, whereas
-----------------------------------------------------



### EmbeddingsFilter
Uses an embedding model to keep only the results whose embeddings are sufficiently similar to the query, filtering out the rest.
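
A minimal sketch of the thresholding idea, using hand-made 2-dimensional vectors in place of real embeddings. The function name `filter_by_similarity` and all vectors are assumptions for illustration; no embedding model or API call is involved.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filter_by_similarity(query_vec, docs, doc_vecs, similarity_threshold):
    """Keep documents whose similarity to the query exceeds the threshold."""
    return [
        doc
        for doc, vec in zip(docs, doc_vecs)
        if cosine_similarity(query_vec, vec) > similarity_threshold
    ]

# Toy vectors: the first document points almost the same way as the query.
query_vec = [1.0, 0.0]
docs = ["close to the query", "somewhat related", "unrelated"]
doc_vecs = [[1.0, 0.1], [1.0, 1.0], [0.0, 1.0]]

kept_docs = filter_by_similarity(query_vec, docs, doc_vecs, similarity_threshold=0.76)
print(kept_docs)
```

No LLM call is needed per document, so this kind of filter is cheaper than the two LLM-based compressors, at the cost of a threshold that has to be tuned for the embedding model in use.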

In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter

embeddings = OpenAIEmbeddings() # a different embedding model from the BGE embeddings used by the base retriever
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What is FlashAttention?")
pretty_print_docs(compressed_docs)

## Pipelines


### Stringing compressors and document transformers together

DocumentCompressorPipeline lets us chain several compressors and document transformers in sequence.

BaseDocumentTransformer - transforms the documents, e.g. a text splitter that breaks them into smaller pieces

EmbeddingsRedundantFilter - drops documents that are near-duplicates of one another, based on the similarity of their embeddings
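
The chaining idea can be sketched as a list of steps, each taking the current documents and returning a new list. Everything here is a simplified assumption: documents are plain strings, and the redundancy and relevance steps use exact matching instead of embeddings.

```python
def split_sentences(docs, query):
    """Transformer step: break every document into sentence-sized pieces."""
    pieces = []
    for doc in docs:
        pieces.extend(s.strip() for s in doc.split(". ") if s.strip())
    return pieces

def drop_exact_duplicates(docs, query):
    """Toy redundancy step: exact matches only (an embedding-based filter is fuzzier)."""
    seen, unique = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

def keep_mentions(docs, query):
    """Toy relevance step: keep pieces that mention the query term."""
    return [d for d in docs if query.lower() in d.lower()]

def run_pipeline(steps, docs, query):
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        docs = step(docs, query)
    return docs

retrieved = [
    "FlashAttention is fast. It uses tiling",
    "FlashAttention is fast. We thank our reviewers",
]
result = run_pipeline(
    [split_sentences, drop_exact_duplicates, keep_mentions],
    retrieved,
    "FlashAttention",
)
print(result)
```

Order matters: splitting first means the later filters judge small pieces rather than whole chunks, so a relevant sentence can survive even when the rest of its chunk is dropped.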


In [41]:
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import CharacterTextSplitter

# Split into chunks of about 300 characters (using ". " as the separator so that sentences are not cut in the middle)
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=". ")

# Compute the pairwise cosine similarity among the N embedding vectors (N*N),
# find the pairs whose similarity is above the threshold,
# and for each (first, second) pair treat the second as a duplicate and drop it
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)

# Measure the cosine similarity between the user query embedding and each of the N embedding vectors
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)

## making the pipeline
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)
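
The redundancy step described in the comments above can be sketched as follows. This is a simplified illustration rather than the library's implementation: the function name `redundant_indices_to_keep`, the toy vectors, and the threshold value are all assumptions.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def redundant_indices_to_keep(vectors, threshold):
    """Compare every pair (first, second); if too similar, drop the second."""
    dropped = set()
    for first, second in combinations(range(len(vectors)), 2):
        if first in dropped or second in dropped:
            continue  # a pair with an already-dropped member is ignored
        if cosine_similarity(vectors[first], vectors[second]) > threshold:
            dropped.add(second)
    return [i for i in range(len(vectors)) if i not in dropped]

# Toy vectors: index 1 is nearly identical in direction to index 0.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
keep = redundant_indices_to_keep(vectors, threshold=0.95)
print(keep)
```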

In [42]:
compression_retriever = ContextualCompressionRetriever(base_compressor=pipeline_compressor,
                                                       base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What is FlashAttention?")
pretty_print_docs(compressed_docs)



Document 1:

FlashAttention is up to 3faster than the standard attention implemen-
tation across common sequence lengths from 128 to 2K and scales up to 64K. Up to sequence length of 512,
FlashAttention is both faster and more memory-eﬃcient than any existing attention method, whereas
----------------------------------------------------------------------------------------------------
Document 2:

We measure the runtime and memory performance of FlashAttention
----------------------------------------------------------------------------------------------------
Document 3:

Finally, FlashAttention yields the ﬁrst Transformer that can achieve
better-than-random performance on the challenging Path-X task (sequence length 16K), and block-sparse
FlashAttention yields the ﬁrst sequence model that we know of that can achieve better-than-random
performance on Path-256 (sequence length 64K).
•Benchmarking Attention
---------------------------------------------------------------------------------

In [43]:
### different pipeline

## making the pipeline
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, compressor, redundant_filter, relevant_filter]
)

compression_retriever = ContextualCompressionRetriever(base_compressor=pipeline_compressor,
                                                       base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What is FlashAttention?")
pretty_print_docs(compressed_docs)



Document 1:

FlashAttention
----------------------------------------------------------------------------------------------------
Document 2:

FlashAttention is up to 3faster than the standard attention implemen-
tation across common sequence lengths from 128 to 2K and scales up to 64K. Up to sequence length of 512,
FlashAttention is both faster and more memory-eﬃcient than any existing attention method
----------------------------------------------------------------------------------------------------
Document 3:

FlashAttention : Fast and Memory-Eﬃcient Exact Attention
with IO-Awareness
----------------------------------------------------------------------------------------------------
Document 4:

FlashAttention yields the ﬁrst Transformer that can achieve better-than-random performance on the challenging Path-X task (sequence length 16K), and block-sparse FlashAttention yields the ﬁrst sequence model that we know of that can achieve better-than-random performance on Path-256 (seque