In [1]:
!pip -q install -U langchain pydantic chromadb pypdf sentence_transformers xformers transformers accelerate bitsandbytes

In [2]:
!pip show langchain

Name: langchain
Version: 0.0.179
Summary: Building applications with LLMs through composability
Home-page: https://www.github.com/hwchase17/langchain
Author: 
Author-email: 
License: MIT
Location: /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages
Requires: aiohttp, async-timeout, dataclasses-json, numexpr, numpy, openapi-schema-pydantic, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: 


In [28]:
!wget -q https://www.dropbox.com/s/zoj9rnm7oyeaivb/new_papers.zip
# -o overwrites existing files, so re-running the cell does not stop at an interactive prompt
!unzip -q -o new_papers.zip -d new_papers


In [3]:
import re
import sys
import json
import torch
import logging
from random import choice
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline

In [4]:
logger = logging.getLogger('api')
logger.setLevel(logging.INFO)

logHandler = logging.StreamHandler(sys.stdout)
logger.addHandler(logHandler)

In [5]:
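from typing import Optional


def setup_model(model_name: str, cache_dir: Optional[str] = None):
    """Load the tokenizer and an 8-bit quantized causal LM from the Hugging Face Hub."""
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        cache_dir=cache_dir,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_8bit=True,          # quantize weights to int8 via bitsandbytes
        torch_dtype=torch.float16,  # keep non-quantized modules in fp16
        device_map='auto',          # let accelerate place layers on the available GPU(s)
        low_cpu_mem_usage=True,
        cache_dir=cache_dir,
    )
    return tokenizer, model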

In [6]:
model_name = 'TheBloke/wizardLM-7B-HF'
cache_dir = '/home/ec2-user/SageMaker/.cache'
tokenizer, model = setup_model(model_name, cache_dir)


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /home/ec2-user/anaconda3/envs/pytorch_p39/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...


Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [7]:
!nvidia-smi

Thu May 25 14:40:34 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   26C    P0    56W / 300W |   7931MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [38]:
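pipe = pipeline(
    task='text-generation',
    model=model,
    tokenizer=tokenizer,
    max_length=2048,          # upper bound on prompt plus generated tokens
    do_sample=False,          # greedy decoding; temperature and top_p only apply when sampling
    repetition_penalty=1.15,
)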

In [39]:
llm = HuggingFacePipeline(pipeline=pipe)
print(llm('What is the capital of England?'))


London is the capital city of England.


## Set up LangChain

In [40]:
import textwrap
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

from langchain.embeddings import HuggingFaceEmbeddings

In [41]:
loader = DirectoryLoader('./new_papers/new_papers/', glob="./*.pdf", loader_cls=PyPDFLoader)

documents = loader.load()

In [42]:
len(documents)

142

In [43]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
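
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of overlapping windows in plain Python. It is a simplification: `chunk_with_overlap` is a hypothetical helper that cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and newlines. Small numbers are used so the overlap is easy to see.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.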

## Embedding (Sentence-BERT)

In [44]:
model_name = 'sentence-transformers/all-mpnet-base-v2'
model_kwargs = {'device': 'cuda'}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

In [45]:
text = "This is a test document."
query_result = embeddings.embed_query(text)
query_result[:5]

[-0.04895173758268356,
 -0.039861973375082016,
 -0.021562790498137474,
 0.009908495470881462,
 -0.03810397535562515]

In [46]:
doc_result = embeddings.embed_documents([text, "This is not a test document."])
doc_result[0][:5]

[-0.04895175248384476,
 -0.03986193612217903,
 -0.02156277559697628,
 0.009908493608236313,
 -0.03810398280620575]
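
The two outputs above agree to about seven decimal places, which is what you would expect when the same text is embedded through `embed_query` and `embed_documents`. One common way to quantify how close two embeddings are is cosine similarity. The sketch below uses toy three-dimensional vectors rather than the real 768-dimensional ones, and `cosine_similarity` is a hypothetical helper; the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0, 1.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of length, and unrelated (orthogonal) vectors score 0.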

## Create Vector DB

In [47]:
vectordb = Chroma.from_documents(
    documents=texts, 
    embedding=embeddings,
    persist_directory='db',
)

## Set up the chain

In [48]:
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
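
With `k=3` the retriever returns the three stored chunks nearest to the query embedding. The following is a rough sketch of that idea over a toy in-memory store. `top_k`, `toy_store`, and the two-dimensional vectors are all made up for illustration, and squared Euclidean distance is an assumption, not a statement about Chroma's internals.

```python
import heapq


def top_k(query_vec, store, k=3):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    def sq_dist(vec):
        return sum((q - v) ** 2 for q, v in zip(query_vec, vec))
    return heapq.nsmallest(k, ((sq_dist(vec), text) for text, vec in store))


# Hypothetical chunks with hand-picked 2-D "embeddings".
toy_store = [
    ("chunk about attention", [0.9, 0.1]),
    ("chunk about tools", [0.1, 0.9]),
    ("chunk about retrieval", [0.5, 0.5]),
    ("chunk about memory IO", [0.8, 0.3]),
]

hits = top_k([1.0, 0.0], toy_store, k=3)
for dist, text in hits:
    print(f"{dist:.2f}  {text}")
```

Raising `k` gives the LLM more context at the cost of a longer prompt, which matters here because the pipeline caps prompt plus answer at `max_length=2048` tokens.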

In [49]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True,
)

In [55]:
qa_chain.retriever.search_type, qa_chain.retriever.vectorstore

('similarity', <langchain.vectorstores.chroma.Chroma at 0x7f2ed8fa3460>)

In [56]:
logger.info(qa_chain.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
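
The `"stuff"` chain type fills this template by putting all retrieved chunks into `{context}` and the user query into `{question}`. A minimal sketch of that step in plain Python follows. `build_stuff_prompt` is a hypothetical helper, and joining chunks with a blank line is an assumption about the default separator, not something shown in this notebook.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Concatenate the retrieved chunks and substitute them into the template."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is Flash attention?",
)
print(prompt)
```

Because every chunk lands in a single prompt, three chunks of up to 1000 characters each plus the instructions must fit inside the model's context window together with the generated answer.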


In [50]:
def wrap_text_preserve_newlines(text, width=110):
    """Wrap each line to `width` characters while keeping the original line breaks."""
    lines = text.split('\n')
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]
    return '\n'.join(wrapped_lines)


def process_llm_response(llm_response):
    """Log the answer, then the source file of each retrieved chunk."""
    logger.info(wrap_text_preserve_newlines(llm_response['result']))
    logger.info('\n\nSources:')
    for source in llm_response['source_documents']:
        logger.info(source.metadata['source'])

## Run

In [51]:
%%time

query = "What is Flash attention?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Flash attention is a type of self-attention mechanism used in neural networks for natural language processing
tasks such as machine translation and text classification. It was introduced in the paper "Fast and Accurate
Machine Translation by Predicting Neural Networks" by Vaswani et al. in 2017. Flash attention allows the model
to quickly focus on important parts of the input sequence while ignoring less relevant parts, which can
improve performance on long sequences.


Sources:
new_papers/new_papers/Flash-attention.pdf
new_papers/new_papers/Flash-attention.pdf
new_papers/new_papers/Flash-attention.pdf
CPU times: user 13.7 s, sys: 2.56 ms, total: 13.7 s
Wall time: 13.7 s


In [52]:
%%time

query = "What does IO-aware mean?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 IO-awareness refers to designing computer systems that are aware of their input/output (I/O) capabilities and
limitations. It involves optimizing software and hardware components to minimize I/O latency and maximize
throughput. In the context of deep learning, IO-awareness means designing algorithms and architectures that
take into account the I/O bottleneck caused by large amounts of data being transferred between memory and
storage devices.


Sources:
new_papers/new_papers/Flash-attention.pdf
new_papers/new_papers/Flash-attention.pdf
new_papers/new_papers/Flash-attention.pdf
CPU times: user 12.6 s, sys: 0 ns, total: 12.6 s
Wall time: 12.6 s


In [53]:
%%time

query = "What is toolformer?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Toolformer is a natural language processing (NLP) model developed by OpenAI that uses external tools such as
search engines, calculators, and calendar s to complete tasks such as answering questions and performing
calculations. It is designed to improve upon the limitations of today's language models by giving them the
ability to use external resources to better understand and respond to user input.


Sources:
new_papers/new_papers/Augmenting LLMs Survey.pdf
new_papers/new_papers/toolformer.pdf
new_papers/new_papers/toolformer.pdf
CPU times: user 10.2 s, sys: 0 ns, total: 10.2 s
Wall time: 10.2 s


In [31]:
%%time

query = "What tools can be used with toolformer?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

Toolformer can be used with any external tool that has a simple API and provides relevant information for the
given task. For example, it could be used with search engines like Google or Bing, calculators like Wolfram
Alpha or Mathway, or translation systems like Google Translate or Microsoft Translator.


Sources:
new_papers/new_papers/toolformer.pdf
new_papers/new_papers/toolformer.pdf
new_papers/new_papers/toolformer.pdf
CPU times: user 8.76 s, sys: 3.68 ms, total: 8.76 s
Wall time: 8.75 s


In [32]:
%%time

query = "What are the best retrieval augmentations for LLMs?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

There are different types of retrieval augmentations for LLMs, including dense and sparse retrievers. Dense
retrievers work with dense queries and dense document representations, while sparse retrievers use sparse bag-
of-words representations of the documents and queries. Both approaches have their advantages and
disadvantages, and the choice depends on the specific task and dataset. Additionally, grounding the
predictions through tools such as calculators can increase the truthfulness of the generated responses.
Estimating and reducing uncertainty is another direction to explore, as it can help LMMs learn what they know
and what they don't. Finally, allowing LMMs to leverage external tools can also improve their performance,
especially if the missing information is crucial for the task. Overall, the best retrieval augmentations
depend on the specific requirements of the task and should be evaluated based on their effectiveness and
efficiency.


Sources:
new_papers/new_papers/Augmenti

In [33]:
%%time

query = "What are the differences between REALM and RAG?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

REALM (Guu et al., 2020) and RAG (Lewis et al., 2020) are both methods that use retrieval-augmented language
models to improve the performance of question answering systems. However, there are some key differences
between them. REALM


Sources:
new_papers/new_papers/Augmenting LLMs Survey.pdf
new_papers/new_papers/Augmenting LLMs Survey.pdf
new_papers/new_papers/ReACT.pdf
CPU times: user 8.51 s, sys: 0 ns, total: 8.51 s
Wall time: 8.5 s
