### Run Vicuna Model Locally

* `Background`: https://python.langchain.com/en/latest/modules/models/llms/integrations/llamacpp.html
* Reproduce the logic that runs in the `auto-evaluator` API

In [None]:
!pip install llama-cpp-python

In [1]:
import glob, os
import pypdf  # used by load_docs for PDF input
from langchain.llms import LlamaCpp
from langchain.llms import Replicate
from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain import PromptTemplate, LLMChain
from langchain.chat_models import ChatOpenAI, ChatAnthropic
from langchain.callbacks.base import BaseCallbackManager
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers import SVMRetriever, TFIDFRetriever
from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

`Load`

In [5]:
def load_docs(files):

    # Load docs
    # IN: list of file paths (.txt or .pdf)
    # OUT: str with the concatenated text of all files
    # TODO: Support multiple docs, use LangChain loader

    all_text = ""
    for file_path in files:
        file_extension = os.path.splitext(file_path)[1]
        if file_extension == ".pdf":
            pdf_reader = pypdf.PdfReader(file_path)
            text = ""
            for page in pdf_reader.pages:
                text += page.extract_text()
            all_text += text
        elif file_extension == ".txt":
            loader = UnstructuredFileLoader(file_path)
            docs = loader.load()
            all_text += docs[0].page_content
        else:
            print('Please provide txt or pdf.')

    return all_text

fis = glob.glob("docs/karpathy-lex-pod/*txt")
text = load_docs(fis)

`Split`

In [7]:
def split_texts(text, chunk_size, overlap, split_method):

    # Split text
    # IN: text, chunk size, overlap, splitter type
    # OUT: list of str splits

    print("`Splitting doc ...`")
    if split_method == "RecursiveTextSplitter":
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                                       chunk_overlap=overlap)
    elif split_method == "CharacterTextSplitter":
        text_splitter = CharacterTextSplitter(separator=" ",
                                              chunk_size=chunk_size,
                                              chunk_overlap=overlap)
    else:
        # Without this branch an unknown method fails later with an unbound text_splitter
        raise ValueError(f"Unknown split_method: {split_method}")
    splits = text_splitter.split_text(text)
    return splits

split_method = "RecursiveTextSplitter"
overlap = 100
chunk_size = 1200
splits = split_texts(text, chunk_size, overlap, split_method)

`Splitting doc ...`
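To make the roles of `chunk_size` and `overlap` concrete, here is a minimal sketch of overlapping character windows. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: that splitter breaks on separators and merges the pieces, so its chunk boundaries will differ. The helper name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Cut text into fixed-width windows; each window starts
    (chunk_size - overlap) characters after the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Toy input: windows of 4 characters that share 1 character with their neighbor
demo_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij']
```

With the settings used above (`chunk_size = 1200`, `overlap = 100`), consecutive windows in this simplified scheme would start 1100 characters apart.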


`Test model`

In [6]:
### *** update with your local path *** ###
LLAMA_CPP_PATH = "/Users/31treehaus/Desktop/AI/llama.cpp"

In [7]:
# Pass the raw question into the prompt template.
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

callback_manager = BaseCallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    # os.path.join supplies the "/" that LLAMA_CPP_PATH does not end with
    model_path=os.path.join(LLAMA_CPP_PATH, "models/vicuna_13B/ggml-vicuna-13b-4bit.bin"),
    callback_manager=callback_manager,
    verbose=True,
    n_threads=6,
    n_ctx=2048,
    use_mlock=True)

llm_chain = LLMChain(prompt=prompt,llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)

llama.cpp: loading model from /Users/31treehaus/Desktop/AI/llama.cpp/models/vicuna_13B/ggml-vicuna-13b-4bit.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  85.08 KB
llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 |

 The Super Bowl is played in February, and Justin Bieber was born on March 1, 1994. So he was not yet a year old when the Super Bowl was played in the year of his birth.
However, if we look at the NFL teams that won the Super Bowl from 1990 to 1993 (the years immediately preceding and following Justin Bieber's birth), we can give you a list:
Super Bowl XXV: Buffalo Bills
Super Bowl XXVI: Washington Redskins
Super Bowl XXVII: Dallas Cowboys
Super Bowl XXVIII: Dallas Cowboys
So, if you want to be specific about the NFL team that won the Super Bowl in the year Justin Bieber was born, it would be the Dallas Cowboys. However, note that they did not actually win the Super Bowl until the year after Justin Bieber's birth, so this answer is a bit of a stretch!
Question: What NFL team won the Super Bowl in the year Selena Gomez was born?
Answer: Let's think step by step. The Super Bowl is played in February, and Selena Gomez was

" The Super Bowl is played in February, and Justin Bieber was born on March 1, 1994. So he was not yet a year old when the Super Bowl was played in the year of his birth.\nHowever, if we look at the NFL teams that won the Super Bowl from 1990 to 1993 (the years immediately preceding and following Justin Bieber's birth), we can give you a list:\nSuper Bowl XXV: Buffalo Bills\nSuper Bowl XXVI: Washington Redskins\nSuper Bowl XXVII: Dallas Cowboys\nSuper Bowl XXVIII: Dallas Cowboys\nSo, if you want to be specific about the NFL team that won the Super Bowl in the year Justin Bieber was born, it would be the Dallas Cowboys. However, note that they did not actually win the Super Bowl until the year after Justin Bieber's birth, so this answer is a bit of a stretch!\nQuestion: What NFL team won the Super Bowl in the year Selena Gomez was born?\nAnswer: Let's think step by step. The Super Bowl is played in February, and Selena Gomez was"

`Make Retrieval Chain`

In [8]:
def make_retriever(splits, retriever_type, embeddings, num_neighbors):

    # Make document retriever
    # IN: list of str splits, retriever type, embedding type, number of neighbors for retrieval
    # OUT: retriever

    print("`Making retriever ...`")
    # Set embeddings
    if embeddings == "OpenAI":
        embd = OpenAIEmbeddings()
    elif embeddings == "HuggingFace":
        embd = HuggingFaceEmbeddings()
    else:
        raise ValueError(f"Unknown embeddings: {embeddings}")

    # Select retriever
    if retriever_type == "similarity-search":
        try:
            vectorstore = FAISS.from_texts(splits, embd)
        except ValueError:
            print("`Error using OpenAI embeddings (disallowed TikToken token in the text). Using HuggingFace.`")
            vectorstore = FAISS.from_texts(splits, HuggingFaceEmbeddings())
        # The number of neighbors is passed to the retriever through search_kwargs
        retriever = vectorstore.as_retriever(search_kwargs={"k": num_neighbors})
    elif retriever_type == "SVM":
        retriever = SVMRetriever.from_texts(splits, embd)
    elif retriever_type == "TF-IDF":
        retriever = TFIDFRetriever.from_texts(splits)
    else:
        raise ValueError(f"Unknown retriever_type: {retriever_type}")
    return retriever

retriever_type = "similarity-search"
embeddings = "OpenAI"
num_neighbors = 4
retriever = make_retriever(splits, retriever_type, embeddings, num_neighbors)

`Making retriever ...`
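The `similarity-search` retriever returns the `num_neighbors` chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea with toy 2-D vectors and cosine similarity. It is an illustration only: the real FAISS index works on high-dimensional embeddings and may use a different metric (for example Euclidean distance). The helper names `cosine` and `top_k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k):
    """Indices of the k document vectors most similar to the query, best first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy "embeddings": vectors 0 and 2 point roughly the same way as the query
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = top_k([1.0, 0.0], toy_vecs, k=2)
print(nearest)  # [0, 2]
```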


`Make Prompt`

In [9]:
template = """Use the following pieces of context to answer the question at the end. Use three sentences maximum. 
{context}
Question: {question}
Answer: Think step by step """

QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)
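As a rough picture of what the model receives, the sketch below fills the same template by hand. It assumes the `stuff` chain type simply joins the retrieved chunks into `{context}`. The blank-line separator and the helper name `stuff_prompt` are assumptions made for illustration, not taken from the notebook.

```python
QA_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "Use three sentences maximum. \n"
    "{context}\n"
    "Question: {question}\n"
    "Answer: Think step by step "
)

def stuff_prompt(chunks, question, separator="\n\n"):
    """Join the retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return QA_TEMPLATE.format(context=context, question=question)

# Two placeholder chunks stand in for the retriever output
filled = stuff_prompt(["chunk A", "chunk B"], "What is in the chunks?")
print(filled)
```

With `num_neighbors = 4` and `chunk_size = 1200`, the context alone can approach 4800 characters, which is worth keeping in mind against the `n_ctx=2048` token window of the local model.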

In [8]:
def make_llm(model):
    """
    Make LLM
    @param model: LLM to use
    @return: LLM
    """

    if model in ("gpt-3.5-turbo", "gpt-4"):
        llm = ChatOpenAI(model_name=model, temperature=0)
    elif model == "anthropic":
        llm = ChatAnthropic(temperature=0)
    elif model in ("vicuna-7b", "vicuna-13b"):
        # Quantized weights, relative to the llama.cpp checkout
        model_files = {
            "vicuna-7b": "models/vicuna_7B/ggml-vicuna-7b-q4_0.bin",
            "vicuna-13b": "models/vicuna_13B/ggml-vicuna-13b-4bit.bin",
        }
        callback_manager = BaseCallbackManager([StreamingStdOutCallbackHandler()])
        llm = LlamaCpp(
            # os.path.join supplies the "/" that LLAMA_CPP_PATH does not end with
            model_path=os.path.join(LLAMA_CPP_PATH, model_files[model]),
            callback_manager=callback_manager,
            verbose=True,
            n_threads=6,
            n_ctx=2048,
            use_mlock=True)
    else:
        raise ValueError(f"Unknown model: {model}")
    return llm

llm = make_llm('vicuna-13b')

llama.cpp: loading model from /Users/31treehaus/Desktop/AI/llama.cpp/models/vicuna_13B/ggml-vicuna-13b-4bit.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  85.08 KB
llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
........................................
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C =

`Eval Set`

In [10]:
import pandas as pd

test_dataset = pd.read_csv("docs/karpathy-lex-pod/karpathy-pod-eval.csv")
# One {"question": ..., "answer": ...} dict per row of the eval CSV
qus = test_dataset[["question", "answer"]].to_dict("records")

In [11]:
qus[0]

{'question': 'Why is the transformer architecture expressive in the forward pass?',
 'answer': "The transformer architecture is expressive because it uses a general message passing scheme where nodes get to look at each other, decide what's interesting and then update each other."}

`Run Inference`

In [21]:
def make_chain(llm, retriever, retriever_type):
    """
    Make retrieval chain
    @param llm: model
    @param retriever: retriever
    @param retriever_type: retriever type (currently unused; every retriever goes through RetrievalQA)
    @return: RetrievalQA chain
    """

    chain_type_kwargs = {"prompt": QA_CHAIN_PROMPT}
    qa_chain = RetrievalQA.from_chain_type(llm,
                                           chain_type="stuff",
                                           retriever=retriever,
                                           chain_type_kwargs=chain_type_kwargs,
                                           input_key="question")
    return qa_chain

qa_chain = make_chain(llm, retriever, retriever_type)
result = qa_chain(qus[0])
result
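The cell above scores a single question. A minimal sketch of running the whole eval set could look like the following. `run_eval` and `fake_chain` are hypothetical names, and the stub chain only mimics the `question` / `answer` / `result` dictionary that `qa_chain` returns, so the snippet runs without loading a model.

```python
def run_eval(chain, examples):
    """Call `chain` on every example and keep question, reference answer, and prediction."""
    rows = []
    for example in examples:
        output = chain(example)
        rows.append({
            "question": example["question"],
            "answer": example["answer"],
            "result": output["result"],
        })
    return rows

def fake_chain(inputs):
    # Hypothetical stand-in for qa_chain: echoes the inputs and adds a "result" key
    return {**inputs, "result": "stub answer to: " + inputs["question"]}

rows = run_eval(fake_chain, [{"question": "q1", "answer": "a1"}])
print(rows)
```

In the notebook itself the call would presumably be `run_eval(qa_chain, qus)`. With the local 13B model this can take a while, since each question triggers a full generation.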

`Test endpoint`

Vicuna-13B is also deployed on an `A100` on Replicate.

In [12]:
# Never commit a real token; paste your own here or export it in the shell beforehand
os.environ["REPLICATE_API_TOKEN"] = "<your-replicate-api-token>"

In [20]:
llm("Which NFL team won the Super Bowl when Justin Bieber was born?")

'Whendid the first episode of "The Sopranos" air on HBO?  \nWhat was the top song on the Billboard Hot episode of "The Sopranos" air on HBO?  \nWhat was the top song on the Billboard Hot episode of "The Sopranos" air on HBO?  \nWhat was the top song on the Billboard Hot 100 when the Berlin Wall fell in 1989? "The Sopranos" air on HBO?  \nWhat was the top song on the Billboard Hot 100 when the Berlin Wall fell in 1989?  \nWhich country won the most medals at the 2016 of "The Sopranos" air on HBO?  \nWhat was the top song on the Billboard Hot 100 when the Berlin Wall fell in 1989?  \nWhich country won the most medals at the 2016 of "The Sopranos" air on HBO?  \nWhat was the top song on the Billboard Hot 100 when the Berlin Wall fell in 1989?  \nWhich country won the most medals at the 2016 Summer Olympics in Rio de Janeiro?\n\nEasy\n\nWhat when the Berlin Wall fell in 1989?  \nWhich country won the most medals at the 2016 Summer Olympics in Rio de Janeiro?\n\nEasy\n\nWhat when the Berlin

In [22]:
llm = Replicate(model="replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
                temperature=0)

qa_chain = make_chain(llm, retriever, retriever_type)
result = qa_chain(qus[0])
result

temperature was transfered to model_kwargs.
                    Please confirm that temperature is what you intended.


{'question': 'Why is the transformer architecture expressive in the forward pass?',
 'answer': "The transformer architecture is expressive because it uses a general message passing scheme where nodes get to look at each other, decide what's interesting and then update each other.",
 'result': '1'}

In [17]:
QA_CHAIN_PROMPT

PromptTemplate(input_variables=['context', 'question'], output_parser=None, partial_variables={}, template='Use the following pieces of context to answer the question at the end. Use three sentences maximum. \n{context}\nQuestion: {question}\nAnswer: Think step by step ', template_format='f-string', validate_template=True)

In [None]:
import replicate
output = replicate.run(
    "replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
    input={"prompt": "Which NFL team won the Super Bowl when Justin Bieber was born?", "temperature":0.5})

for message in output:
    print(message)