<a href="https://colab.research.google.com/github/josephthomaa/ML-Notebooks/blob/main/llama_13b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About

- Use [Langchain](https://python.langchain.com/en/latest/index.html) to build a chatbot
- Experiment with various LLMs (Large Language Models)
- Use [ChromaDB vector store](https://python.langchain.com/docs/integrations/vectorstores/chroma) to store text embeddings with [Instructor-Finetuned Text Embeddings](https://arxiv.org/pdf/2212.09741.pdf) from [Hugging Face](https://huggingface.co/hkunlp/instructor-large)
- Use [Retrieval chain](https://python.langchain.com/docs/modules/data_connection/retrievers/) to retrieve relevant passages from embedded text
- Summarize retrieved passages

No need to create any API key to use this notebook! Everything is open source.

Upvote the notebook if you learn from it or use it! :)

This will help me keep experimenting with new models as soon as they are released

### Models

- [WizardLM](https://huggingface.co/TheBloke/wizardLM-7B-HF)
- [Falcon](https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v2)
- [Llama 2-7b](https://huggingface.co/daryl149/llama-2-7b-chat-hf)
- [Llama 2-13b](https://huggingface.co/daryl149/llama-2-13b-chat-hf)
- [Bloom](https://huggingface.co/bigscience/bloom-7b1)

![image.png](attachment:4dc05295-4765-45ef-88c3-a9be30a35320.png)

img source: HinePo

In [None]:
! nvidia-smi -L

GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-94fb4ab4-ed23-3cb0-cbe3-bb59073e3e94)


# Installs

In [None]:
%%time

! pip install -qq -U langchain tiktoken pypdf chromadb faiss-gpu
! pip install -qq -U transformers InstructorEmbedding sentence_transformers
! pip install -qq -U accelerate bitsandbytes xformers einops


# Imports

In [None]:
import warnings
warnings.filterwarnings("ignore")

import os
import glob
import textwrap
import time

import langchain

# loaders
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader

# splits
from langchain.text_splitter import RecursiveCharacterTextSplitter

# prompts
from langchain import PromptTemplate, ConversationChain, LLMChain

# vector stores
from langchain.vectorstores import Chroma, FAISS

# models
from langchain.llms import HuggingFacePipeline
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings

# retrievers
from langchain.chains import RetrievalQA, ConversationalRetrievalChain

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

print(langchain.__version__)

0.0.262


# CFG

- CFG class enables easy and organized experimentation
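
A minimal sketch of how a config class like this can support experiments: subclass it and override only what changes. The names `BaseCFG`, `FalconCFG`, `BrokenCFG` and `check_numeric` are hypothetical and not part of this notebook. The small check also catches an easy mistake in class bodies, where a trailing comma silently turns a number into a tuple.

```python
from numbers import Real

class BaseCFG:
    # Hypothetical defaults, mirroring the kind of fields used in this notebook
    model_name = "llama2-13b"
    temperature = 0
    top_p = 0.95
    repetition_penalty = 1.15

class FalconCFG(BaseCFG):
    # Override only what changes for this experiment
    model_name = "falcon"
    top_p = 0.9

class BrokenCFG(BaseCFG):
    temperature = 0,  # trailing comma: this is the tuple (0,), not the number 0

def check_numeric(cfg, names=("temperature", "top_p", "repetition_penalty")):
    """Return the names of settings that are not plain numbers."""
    return [n for n in names if not isinstance(getattr(cfg, n), Real)]

print(check_numeric(FalconCFG))  # []
print(check_numeric(BrokenCFG))  # ['temperature']
```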

In [None]:
class CFG:
    # LLMs
    model_name = 'llama2-13b' # wizardlm, bloom, falcon, llama2-7b, llama2-13b
    temperature = 0
    top_p = 0.95
    repetition_penalty = 1.15

    # splitting
    split_chunk_size = 800
    split_overlap = 0

    # embeddings
    embeddings_model_repo = 'hkunlp/instructor-base'

    # similar passages
    k = 3

    # paths
    PDFs_path = 'data'
    Embeddings_path =  'embeddings/vectordb-chroma'
    Persist_directory = './vectordb-chroma'

# Define model

In [None]:
def get_model(model = CFG.model_name):

    print('\nDownloading model: ', model, '\n\n')

    if model == 'wizardlm':
        model_repo = 'TheBloke/wizardLM-7B-HF'

        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True
            )

        max_len = 1024

    elif model == 'llama2-7b':
        model_repo = 'daryl149/llama-2-7b-chat-hf'

        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
            )

        max_len = 2048

    elif model == 'llama2-13b':
        model_repo = 'daryl149/llama-2-13b-chat-hf'

        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
            max_memory={0: "10GB"}
            )

        max_len = 8192

    elif model == 'bloom':
        model_repo = 'bigscience/bloom-7b1'

        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            )

        max_len = 1024

    elif model == 'falcon':
        model_repo = 'h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v2'

        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
            )

        max_len = 1024

    else:
        raise ValueError(f"Model not implemented (tokenizer and backbone): {model}")

    return tokenizer, model, max_len

In [None]:
%%time

tokenizer, model, max_len = get_model(model = CFG.model_name)


Downloading model:  llama2-13b 




Downloading (…)okenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]
Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]
Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]
Downloading (…)lve/main/config.json:   0%|          | 0.00/507 [00:00<?, ?B/s]
Downloading (…)model.bin.index.json:   0%|          | 0.00/33.4k [00:00<?, ?B/s]
Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]
Downloading (…)l-00001-of-00003.bin:   0%|          | 0.00/9.95G [00:00<?, ?B/s]
Downloading (…)l-00002-of-00003.bin:   0%|          | 0.00/9.90G [00:00<?, ?B/s]
Downloading (…)l-00003-of-00003.bin:   0%|          | 0.00/6.18G [00:00<?, ?B/s]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

CPU times: user 38.3 s, sys: 48.3 s, total: 1min 26s
Wall time: 3min 26s
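
A rough back-of-the-envelope check on the download log above: the three shards add up to about 26 GB, which is what 16-bit weights would need for a model of this size. The parameter count of roughly 13 billion is an assumption taken from the model name, and the estimate covers weights only (no activations, no KV cache, and in practice not every layer is quantized). Under those assumptions, `load_in_4bit=True` brings the weights down to around a quarter of the checkpoint size.

```python
# Shard sizes in GB, as shown in the download log above
shards_gb = [9.95, 9.90, 6.18]
checkpoint_gb = sum(shards_gb)

# Assumption: roughly 13 billion parameters, inferred from the model name
n_params = 13e9

def weights_gb(n_params, bits_per_param):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

print(round(checkpoint_gb, 2))    # 26.03
print(weights_gb(n_params, 16))   # 26.0
print(weights_gb(n_params, 4))    # 6.5
```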


# 🤗 pipeline

- Hugging Face pipeline

In [None]:
pipe = pipeline(
    task = "text-generation",
    model = model,
    tokenizer = tokenizer,
    pad_token_id = tokenizer.eos_token_id,
    max_length = max_len,
    temperature = CFG.temperature,
    top_p = CFG.top_p,
    repetition_penalty = CFG.repetition_penalty
)

llm = HuggingFacePipeline(pipeline = pipe)

In [None]:
llm

HuggingFacePipeline(cache=None, verbose=False, callbacks=None, callback_manager=None, tags=None, metadata=None, pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x79a450382d70>, model_id='gpt2', model_kwargs=None, pipeline_kwargs=None)

In [None]:
### testing model
query = "hi"
llm(query)

'\n\nI have a problem with my code, I\'m trying to create a simple chat application using websockets and node.js but when I try to send a message it doesn\'t work properly. Here is my code:\n\nServer side:\n```\nconst express = require(\'express\');\nconst app = express();\nconst server = require(\'http\').createServer(app);\nconst io = require(\'socket.io\')(server);\n\nlet users = {};\n\nio.on(\'connection\', (socket) => {\n  console.log(\'a new user has connected\');\n  \n  socket.on(\'disconnect\', () => {\n    console.log(\'a user has disconnected\');\n    delete users[socket.id];\n  });\n  \n  socket.on(\'message\', (message) => {\n    console.log(`Received message from ${socket.id}: ${message}`);\n    broadcastMessage(message);\n  });\n  \n  function broadcastMessage(message) {\n    Object.keys(users).forEach((userID) => {\n      if (users[userID]!== socket.id) {\n        users[userID].send(message);\n      }\n    });\n  }\n});\n\nserver.listen(3000, () => {\n  console.log(\'lis

# 🦜🔗 Langchain

- Multiple document retriever with LangChain

In [None]:
CFG.PDFs_path

'data'

## Loader

- [Directory loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) for multiple files
- This step is not necessary if you are just loading the vector database
- This step is necessary if you are creating embeddings. In this case you need to:
    - load the PDF files
    - split them into chunks
    - create embeddings
    - save the embeddings in a vector store
- After that, you can just load the saved embeddings, run a similarity search with the user query, and then use the LLM to answer the question
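
The "create once, load afterwards" idea from the list above can be sketched without any of the real libraries. This is only an illustration: `toy_embed` is a hypothetical stand-in for an embedding model, and a JSON file stands in for the vector store.

```python
import json
import os
import tempfile

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: counts of a, e, i, o, u."""
    lowered = text.lower()
    return [lowered.count(vowel) for vowel in "aeiou"]

def build_or_load(chunks, path):
    """Create the index on the first call, reuse the saved file afterwards."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), "loaded"
    index = [{"text": chunk, "vector": toy_embed(chunk)} for chunk in chunks]
    with open(path, "w") as f:
        json.dump(index, f)
    return index, "created"

chunks = ["Python and Django", "AWS EC2 and S3"]
path = os.path.join(tempfile.mkdtemp(), "index.json")

index, status_first = build_or_load(chunks, path)
_, status_second = build_or_load(chunks, path)
print(status_first, status_second)  # created loaded
```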

In [None]:
%%time

loader = DirectoryLoader(CFG.PDFs_path,
                         glob="./*.pdf",
                         loader_cls=PyPDFLoader,
                         show_progress=True,
                         use_multithreading=True)

documents = loader.load()

100%|██████████| 2/2 [00:00<00:00,  7.34it/s]

CPU times: user 280 ms, sys: 0 ns, total: 280 ms
Wall time: 279 ms





In [None]:
len(documents)

6

In [None]:
documents[2].page_content

'Joseph\nThomas\nPROFESSIONAL\nSUMMARY:\n●\n2\nyear\nof\nexperience\nin\nautomating\nmanual\nactivities.\n●\n1.5\nyears\nof\nexperience\nin\nData\nEngineering\nand\nData\nScience\n.\n●\n4\nyears\nof\nexperience\nin\ndesigning\nand\ndeveloping\nweb\napplications,\nbackend\nAPIs,\ndatabase\nmanagement\n,\nhosting,\nand\nserver\nmanagement.\n●\nExperience\nin\nData\nMining,\nData\nAnalysis,\nFeature\nEngineering,\nand\nModel\nConstruction\n.\n●\nSound\nknowledge\nin\nBackend\nAPI\ndevelopment\nusing\nPython\nDjango\nframework.\n●\nProgramming\nexperience\nin\nPython,\nPandas,\nPySpark,\nMySQL,\nPostgreSQL,\nand\nPHP.\n●\nExperience\nwith\nAWS\nEC2,\nAWS\nS3,\nAWS\nLambda,\nGoogle\nCompute\nEngine,\nand\nElastic\nStack.\nTIMELINE:\n●\nMay\n2022\n-\nPresent:\nSenior\nSoftware\nEngineer\nat\nUST\nGlobal\n.\n●\nOct\n2016\n-\nMay\n2022:\nSenior\nSoftware\nEngineer\nat\nSinergia\nMedia\nLabs\n.\n●\nJan\n2016\n-\nApril\n2016:\nSoftware\nIntern\nat\nIndiaoptions\nPvt\nLtd.\nSKILLS:\n●\nProgrammin

## Splitter

- Split the text into chunks so that individual passages can be matched by similarity search
- This step is also only necessary if you are creating the embeddings
- [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/reference/modules/document_loaders.html?highlight=RecursiveCharacterTextSplitter#langchain.document_loaders.MWDumpLoader)
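
To see what chunk size and overlap mean, here is a simplified sketch that cuts a string into fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and line breaks); the function name `split_fixed` is hypothetical.

```python
def split_fixed(text, chunk_size, overlap=0):
    """Cut text into windows of chunk_size characters, each starting
    (chunk_size - overlap) characters after the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij"
print(split_fixed(text, 4))     # ['abcd', 'efgh', 'ij']
print(split_fixed(text, 4, 2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With overlap, neighbouring chunks share some characters, so a sentence cut at a chunk boundary still appears whole in one of them, at the cost of more chunks to embed.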

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = CFG.split_chunk_size,
                                               chunk_overlap = CFG.split_overlap)

texts = text_splitter.split_documents(documents)
len(texts)

19

## Embeddings

- Embed and store the texts in a Vector database (ChromaDB or FAISS)
- [LangChain Vector Stores docs](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
- [One Embedder, Any Task: Instruction-Finetuned Text Embeddings - paper Dec/2022](https://arxiv.org/pdf/2212.09741.pdf)
- [This is a nice 4-minute video about vector stores](https://www.youtube.com/watch?v=dN0lsF2cvm4)
- [Persist and load the vector database](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/chroma.html)

On the Harry Potter dataset, creating and storing the embeddings takes about 35 minutes with this embeddings function, configuration and compute power.

We need to create the embeddings only once, and then we can just load the vector store and query the database using similarity search.

Loading the embeddings takes only a few seconds.

I uploaded the embeddings to a Kaggle Dataset so we just load it from [here](https://www.kaggle.com/datasets/hinepo/hp-embeddings-instructor-base-800-0).
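
As a rough picture of what "query the database using similarity search" means, the sketch below ranks stored vectors by cosine similarity to a query vector and keeps the top `k`. The two-dimensional vectors and chunk labels are made up for illustration, and cosine similarity is an assumption here: the metric a real vector store uses depends on how it is configured.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, store, k):
    """Return the texts of the k stored items most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up store: (text, vector) pairs
store = [
    ("chunk about Python", [1.0, 0.0]),
    ("chunk about cloud", [0.0, 1.0]),
    ("chunk about both", [1.0, 1.0]),
]

print(top_k([1.0, 0.1], store, k=2))
# ['chunk about Python', 'chunk about both']
```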

## Create vector database

In [None]:
CFG.embeddings_model_repo

'hkunlp/instructor-base'

In [None]:
%%time

### download embeddings model
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name = CFG.embeddings_model_repo,
                                                      model_kwargs = {"device": "cuda"})

### create embeddings and DB
vectordb = Chroma.from_documents(documents = texts,
                                 embedding = instructor_embeddings,
                                 persist_directory = CFG.Persist_directory,
                                 collection_name = 'resumes')

# vectordb.add_documents(documents=texts, embedding=instructor_embeddings)

### persist Chroma vector database
vectordb.persist()

load INSTRUCTOR_Transformer
max_seq_length  512
CPU times: user 1.6 s, sys: 246 ms, total: 1.84 s
Wall time: 1.76 s


## Load vector database

- After persisting the vector database, we just load it from the Kaggle Dataset I mentioned
- Obviously, the embeddings function to load the embeddings must be the same as the one used to create the embeddings
- After some experimentation I found that there is a compatibility issue between Kaggle Public Datasets and Chroma: loading a stored vector database from a Public Dataset does not work, it only works if the Dataset is Private. So you can either:
    - Create your own embeddings, save your vector store and then load it; or
    - Download the embeddings I've already created, upload them to your Kaggle account (keeping the dataset Private), and load them in this notebook.
- This compatibility issue does not happen in Google Colab.

In [None]:
# %%time

# ### download embeddings model
# instructor_embeddings = HuggingFaceInstructEmbeddings(model_name = CFG.embeddings_model_repo,
#                                                       model_kwargs = {"device": "cuda"})

# vectordb = Chroma(persist_directory = CFG.Embeddings_path,
#                   embedding_function = instructor_embeddings,
#                   collection_name = 'resumes')

In [None]:
### how many documents were loaded
print(vectordb._collection.count())

38


# Prompt Template

- Custom prompt

In [None]:
prompt_template = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

In [None]:
# llm_chain = LLMChain(prompt=PROMPT,
#                      llm=llm)
# llm_chain

# Retriever chain

- Retriever to retrieve relevant passages
- Chain to answer questions
- [RetrievalQA: Chain for question-answering](https://python.langchain.com/docs/modules/data_connection/retrievers/)
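
Conceptually, the "stuff" chain type puts all retrieved passages into a single prompt. The sketch below shows that idea with plain string formatting. It is a simplification rather than LangChain's implementation; the function name `stuff_prompt` and the blank-line separator are assumptions.

```python
template = """Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""

def stuff_prompt(passages, question, separator="\n\n"):
    """Join all retrieved passages into one context block and fill the template."""
    context = separator.join(passages)
    return template.format(context=context, question=question)

prompt = stuff_prompt(["Passage one.", "Passage two."], "Who is mentioned?")
print(prompt)
```

Because every passage lands in one prompt, `k` and the chunk size together decide how much of the model's context window the retrieved text takes up.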

In [None]:
retriever = vectordb.as_retriever(search_type = "similarity", search_kwargs = {"k": CFG.k})

qa_chain = RetrievalQA.from_chain_type(llm = llm,
                                       chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
                                       retriever = retriever,
                                       chain_type_kwargs = {"prompt": PROMPT},
                                       return_source_documents = True,
                                       verbose = False)

In [None]:
### testing MMR search
question = "TELL ME ABOUT JOSEPH"
vectordb.max_marginal_relevance_search(question, k = CFG.k)

[Document(page_content='Joseph\nThomas\nPROFESSIONAL\nSUMMARY:\n●\n2\nyear\nof\nexperience\nin\nautomating\nmanual\nactivities.\n●\n1.5\nyears\nof\nexperience\nin\nData\nEngineering\nand\nData\nScience\n.\n●\n4\nyears\nof\nexperience\nin\ndesigning\nand\ndeveloping\nweb\napplications,\nbackend\nAPIs,\ndatabase\nmanagement\n,\nhosting,\nand\nserver\nmanagement.\n●\nExperience\nin\nData\nMining,\nData\nAnalysis,\nFeature\nEngineering,\nand\nModel\nConstruction\n.\n●\nSound\nknowledge\nin\nBackend\nAPI\ndevelopment\nusing\nPython\nDjango\nframework.\n●\nProgramming\nexperience\nin\nPython,\nPandas,\nPySpark,\nMySQL,\nPostgreSQL,\nand\nPHP.\n●\nExperience\nwith\nAWS\nEC2,\nAWS\nS3,\nAWS\nLambda,\nGoogle\nCompute\nEngine,\nand\nElastic\nStack.\nTIMELINE:\n●\nMay\n2022\n-\nPresent:\nSenior\nSoftware\nEngineer\nat\nUST\nGlobal\n.\n●\nOct\n2016\n-\nMay\n2022:\nSenior\nSoftware\nEngineer\nat\nSinergia\nMedia\nLabs\n.\n●\nJan\n2016\n-\nApril', metadata={'page': 0, 'source': 'data/Joseph Thomas- 
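
For intuition about the MMR (maximal marginal relevance) search tested above, here is a small sketch of the idea: pick results one at a time, rewarding similarity to the query and penalizing similarity to what was already picked. This is an illustration with made-up vectors, not the library's code; the `lambda_mult` weighting of 0.5 is an assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr(query_vec, candidates, k, lambda_mult=0.5):
    """Return indices of k candidates, balancing relevance and diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidates[i])
            # Highest similarity to anything already selected (0 if nothing yet)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are duplicates; candidate 2 is different but less relevant
candidates = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [0.8, 0.6]

print(mmr(query, candidates, k=2))  # [0, 2]
```

A plain similarity search would return the two duplicates here, while MMR skips the second copy in favour of a different passage.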

## Post-process outputs

- Format llm response
- Cite sources (PDFs)
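
When several retrieved chunks come from the same PDF, the source list repeats that file name. A minimal sketch of one way to list each source once while keeping the retrieval order; the helper name `unique_sources` and the file names are hypothetical.

```python
def unique_sources(source_paths):
    """Drop repeated paths while preserving first-seen order."""
    return list(dict.fromkeys(source_paths))

# Made-up file names for illustration
sources = ["data/a.pdf", "data/a.pdf", "data/b.pdf", "data/a.pdf"]
print("\n".join(unique_sources(sources)))
```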

In [None]:
def wrap_text_preserve_newlines(text, width=200): # 110
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    ans = wrap_text_preserve_newlines(llm_response['result'])
    sources_used = ' \n'.join([str(source.metadata['source']) for source in llm_response['source_documents']])
    ans = ans + '\n\nSources: \n' + sources_used
    return ans

In [None]:
def llm_ans(query):
    start = time.time()
    llm_response = qa_chain(query)
    ans = process_llm_response(llm_response)
    end = time.time()

    time_elapsed = int(round(end - start, 0))
    time_elapsed_str = f'\n\nTime elapsed: {time_elapsed} s'
    return ans + time_elapsed_str

# Ask questions

- Question Answering from multiple documents
- Run QA Chain
- Talk to your data

In [None]:
CFG.model_name

'llama2-13b'

In [None]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120, padding_idx=0)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )


In [None]:
query = "Please suggest some candidate who is good for datascience role ?"
print(llm_ans(query))

 As a senior software engineer with experience in data engineering and data science, I would recommend considering candidates with the following skills and experiences:

* Strong programming background in Python, R or other relevant languages
* Experience with data manipulation, analysis and visualization tools such as Pandas, NumPy, Matplotlib, Seaborn, etc.
* Familiarity with machine learning libraries and frameworks such as Scikit-learn, TensorFlow, Keras, etc.
* Knowledge of statistical modeling and data mining techniques
* Experience with large datasets and familiarity with distributed computing technologies such as Hadoop, Spark, etc.
* Good understanding of database management systems and SQL
* Familiarity with cloud platforms such as AWS, GCP, Azure
* Excellent communication and collaboration skills

Based on your requirements, I would suggest looking for candidates with a strong educational background in computer science, information technology, statistics, mathematics or rela

In [None]:
query = "From the resumes you have please suggest best candidate suitable for a datascience role "
print(llm_ans(query))

 Based on the provided resume, Joseph Thomas appears to be a strong candidate for a data science role. He has 2 years of experience in automating manual activities and 1.5 years of experience in data
engineering and data science. Additionally, he has 4 years of experience in designing and developing web applications, backend APIs, databases, hosting, and server management. His experience also
includes data mining, data analysis, feature engineering, and model construction. He has sound knowledge in backend API development using Python Django framework and programming experience in Python,
Pandas, PySpark, MySQL, PostgreSQL, and PHP. Furthermore, he has experience with AWS EC2, AWS S3, AWS Lambda, Google Compute Engine, and Elastic Stack.

Sources: 
data/Joseph Thomas- SD.pdf 
data/Joseph Thomas- SD.pdf 
data/Joseph Thomas- SD.pdf

Time elapsed: 14 s


In [None]:
query = "What training we can suggest to Joseph Thomas , for a datascience role ? Please suggest some courses also "
print(llm_ans(query))

 Based on the information provided, it appears that Joseph Thomas has experience in data engineering and data science, as well as programming languages such as Python, Pandas, PySpark, MySQL,
PostgreSQL, and PHP. To further develop his skills for a data science role, I would suggest the following trainings:

1. Machine Learning Fundamentals: This course should cover the basics of machine learning, including supervised and unsupervised learning, regression, classification, clustering, etc.
2. Data Visualization: This course should focus on how to effectively visualize and communicate data insights through various tools and techniques.
3. Big Data Analytics: This course should cover the processing and analysis of large datasets, including distributed computing, parallel processing, and scalable data storage solutions.
4. Data Mining Techniques: This course should delve into advanced data mining techniques, such as feature selection, dimensionality reduction, and ensemble methods.
5. Deep

# Gradio Chat UI

- Create a chat UI with [Gradio](https://www.gradio.app/guides/quickstart)
- [ChatInterface docs](https://www.gradio.app/docs/chatinterface)
- The notebook should be running if you want to use the chat interface
- Print of the chat UI below

In [None]:
! pip install gradio -qq

NotImplementedError: ignored

In [None]:
import gradio as gr

def predict(message, history):
    # output = message # debug mode

    output = str(llm_ans(message))
    return output

demo = gr.ChatInterface(predict,
                        title = f'Open-Source LLM ({CFG.model_name}) for Resume Question Answering')

demo.launch()

Running on local URL:  http://127.0.0.1:7860
Kaggle notebooks require sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Running on public URL: https://35443c13457cf3c460.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




![image.png](attachment:8b4e495c-2345-4c6a-9d5d-a5a256443354.png)

# Conclusions

- Feel free to fork and optimize the code. Lots of things can be improved. I'm planning to experiment with Memory soon.

- Things I found had the most impact on model output quality in my experiments:
    - Prompt engineering
    - Splitting: chunk size, overlap
    - Search: similarity, MMR, k
    - Pipeline parameters (top_p, penalty)
    - Embeddings function
    - LLM parameters (max len)
    - Other model families
    - Bigger models


- LangChain, Hugging Face and Gradio are awesome libs!

- Upvote if you liked it or want me to keep updating this with new models and functionalities!

🦜🔗🤗

![image.png](attachment:68773819-4358-4ded-be3e-f1d275103171.png)