# Models Detailed
- [LLMs vs Chat Models](#llms-vs-chat-models)
- [Open-Source GPT4All](#open-source-gpt4all)
- [Llama 2](#llama-2)
- [HuggingFace LLM](#huggingface-llm)

---

## LLMs vs Chat Models

### LLMs
LLMs, such as GPT-3, BLOOM, PaLM, and Aurora genAI, take a text string as input and return a text string as output. They are trained on language modeling tasks and can generate human-like text, perform complex reasoning, and even write code. LLMs are powerful and flexible, and they can generate text for a wide range of tasks. However, they sometimes produce incorrect or nonsensical answers, and their API is less structured than that of Chat Models. Use the classes in `langchain.llms` to interact with LLMs.

In [None]:
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm = OpenAI(model_name="text-davinci-003", temperature=0)

prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?",
)

chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run("wireless headphones"))
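
To see what text the chain sends to the model, the template substitution can be reproduced with plain Python string formatting. This is a minimal sketch that assumes `PromptTemplate` behaves like `str.format` for simple `{variable}` placeholders; `render_prompt` is a hypothetical helper and not part of LangChain.

```python
# Hypothetical helper: mimics simple {variable} substitution with str.format.
template = "What is a good name for a company that makes {product}?"


def render_prompt(template: str, **variables: str) -> str:
    """Fill the placeholders in `template` with the given variables."""
    return template.format(**variables)


rendered = render_prompt(template, product="wireless headphones")
print(rendered)
```

The rendered string is what the LLM receives as its single text input.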

### Chat Models
Chat Models are the most popular models in LangChain. ChatGPT, for example, can run on GPT-3.5 or GPT-4 at its core. These models have gained significant attention due to their ability to learn from human feedback and their user-friendly chat interface.

Chat Models, such as ChatGPT, take a list of messages as input and return an `AIMessage`. They typically use LLMs as their underlying technology, but their APIs are more structured. Because the input is a list of messages, earlier exchanges from a session can be passed back to the model, which uses that context to generate more relevant responses. They also benefit from reinforcement learning from human feedback, which helps improve their responses. Use the classes in `langchain.chat_models` to interact with Chat Models.

Chat Message Types: `SystemMessage`, `HumanMessage` and `AIMessage`.

In [16]:
from langchain.chat_models import AzureChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

chat = AzureChatOpenAI(deployment_name="gpt4", temperature=0)

messages = [
    SystemMessage(
        content="You are a helpful assistant that translates English to French."
    ),
    HumanMessage(content="Translate the following sentence: I love programming."),
]

chat(messages)

AIMessage(content="J'aime la programmation.", additional_kwargs={}, example=False)
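
The way a conversation carries context can be sketched without calling any API: each turn appends the model's reply and the next user message to the same list. The `Message` dataclass and `add_turn` helper below are hypothetical stand-ins for illustration, not LangChain classes; with LangChain you would append `AIMessage` and `HumanMessage` objects to `messages` in the same way.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "human", or "ai"
    content: str


history = [
    Message("system", "You are a helpful assistant that translates English to French."),
    Message("human", "Translate the following sentence: I love programming."),
]


def add_turn(history, ai_reply, next_user_input):
    """Return a new list extended with the model reply and the next user message."""
    return history + [Message("ai", ai_reply), Message("human", next_user_input)]


history = add_turn(history, "J'aime la programmation.", "Now translate: I love music.")
for message in history:
    print(f"{message.role}: {message.content}")
```

Sending the whole list on every call is what lets the model refer back to earlier turns.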

Using the `generate()` method, you can also generate completions for multiple sets of messages. Each set can have its own `SystemMessage` and is processed independently.

In [17]:
batch_messages = [
    [
        SystemMessage(
            content="You are a helpful assistant that translates English to French."
        ),
        HumanMessage(content="Translate the following sentence: I love programming."),
    ],
    [
        SystemMessage(
            content="You are a helpful assistant that translates French to English."
        ),
        HumanMessage(
            content="Translate the following sentence: J'aime la programmation."
        ),
    ],
]
print(chat.generate(batch_messages))

generations=[[ChatGeneration(text="J'aime la programmation.", generation_info=None, message=AIMessage(content="J'aime la programmation.", additional_kwargs={}, example=False))], [ChatGeneration(text='I love programming.', generation_info=None, message=AIMessage(content='I love programming.', additional_kwargs={}, example=False))]] llm_output={'token_usage': {'completion_tokens': 11, 'prompt_tokens': 65, 'total_tokens': 76}, 'model_name': 'gpt-3.5-turbo'} run=[RunInfo(run_id=UUID('0ec7f864-7474-4736-9009-7006d8712540')), RunInfo(run_id=UUID('e473b13b-84db-4486-86be-455800947fff'))]
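
The printed result is dense, so here is a small sketch of pulling out the generated texts and the token count. It uses `SimpleNamespace` stand-ins that copy the attribute names visible in the output above (`generations`, `text`, `llm_output`), so it runs without an API key; treating the real result object the same way is an assumption based on that printed structure.

```python
from types import SimpleNamespace

# Stand-in for the object printed above (attribute names taken from its repr).
result = SimpleNamespace(
    generations=[
        [SimpleNamespace(text="J'aime la programmation.")],
        [SimpleNamespace(text="I love programming.")],
    ],
    llm_output={
        "token_usage": {"completion_tokens": 11, "prompt_tokens": 65, "total_tokens": 76}
    },
)

# There is one inner list per input set; take the first candidate of each.
texts = [candidates[0].text for candidates in result.generations]
total_tokens = result.llm_output["token_usage"]["total_tokens"]

print(texts)
print(total_tokens)
```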


## Open-Source GPT4All

The GPT-family models are undoubtedly powerful. However, access to these models' weights and architecture is restricted, and even if one does have access, it requires significant resources to perform any task.

Furthermore, the available APIs are not free to build on. These limitations can restrict ongoing research on Large Language Models (LLMs). Open-source alternatives such as GPT4All aim to overcome these obstacles and make LLMs more accessible to everyone.

The main contribution of the GPT4All models is that they can run on a CPU. Testing these models is practically free because recent PCs have powerful Central Processing Units (CPUs). This approach does sacrifice some quality, but the choice is between having no access at all and using a slightly underpowered model.

### Convert the Model
The first step is to download the weights and use a script from the llama.cpp repository to convert them from the old format to the new one. This step is required; otherwise, the LangChain library will not recognize the checkpoint file.

> Note: The cell below will take a while since the file size is 4 GB.

In [1]:
import requests
from pathlib import Path
from tqdm import tqdm

local_path = "../../models/gpt4all-lora-quantized-ggml.bin"
Path(local_path).parent.mkdir(parents=True, exist_ok=True)

url = "https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized-ggml.bin"

# Send a GET request to the URL and stream the response body.
response = requests.get(url, stream=True)
response.raise_for_status()  # stop on HTTP errors instead of saving an error page

# Open the file in binary mode and write the response to it in chunks.
with open(local_path, "wb") as f:
    for chunk in tqdm(response.iter_content(chunk_size=8192)):
        if chunk:
            f.write(chunk)

385639it [07:02, 912.31it/s] 
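
The chunked write loop can be tried without a multi-gigabyte download. The sketch below applies the same pattern to a small in-memory stream; `write_in_chunks` and the 20,000-byte payload are hypothetical names and values chosen only for illustration.

```python
import io
import tempfile
from pathlib import Path


def write_in_chunks(source, destination: Path, chunk_size: int = 8192) -> int:
    """Copy a binary stream to `destination` chunk by chunk; return bytes written."""
    written = 0
    with open(destination, "wb") as f:
        # iter(callable, sentinel) keeps reading until read() returns b"".
        for chunk in iter(lambda: source.read(chunk_size), b""):
            f.write(chunk)
            written += len(chunk)
    return written


payload = b"x" * 20000  # small in-memory stand-in for the model file
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo.bin"
    bytes_written = write_in_chunks(io.BytesIO(payload), target)
    size_on_disk = target.stat().st_size

print(bytes_written, size_on_disk)
```

Writing in fixed-size chunks keeps memory use flat regardless of the file size, which matters for a file of several gigabytes.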


Next, convert the downloaded file to the latest format. Start by cloning the llama.cpp repository with the commands below, then run its conversion script. The script creates a new file named `ggml-model-q4_0.bin` in the models directory.

In [None]:
import subprocess

commands = [
    "git clone https://github.com/ggerganov/llama.cpp.git",
    "python ./llama.cpp/convert.py ../../models/gpt4all-lora-quantized-ggml.bin",
]
for command in commands:
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode == 0:
        print("Command executed successfully!")
        # print("Output:")
        # print(result.stdout)
    else:
        print("Command execution failed!")
        print("Error message:")
        print(result.stderr)
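
As an alternative to `shell=True`, each command can be passed as an argument list with `check=True`, so a failing step raises an exception instead of being reported and skipped. This is a minimal sketch; `run_step` is a hypothetical helper, and the command it runs here is a harmless stand-in so the example works on any machine.

```python
import subprocess
import sys


def run_step(args):
    """Run one command given as an argument list; raise CalledProcessError on failure."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout


# Harmless stand-in command so the sketch runs anywhere.
output = run_step([sys.executable, "-c", "print('ok')"])
print(output.strip())
```

For the conversion itself, the argument lists would be `["git", "clone", "https://github.com/ggerganov/llama.cpp.git"]` and `["python", "./llama.cpp/convert.py", "../../models/gpt4all-lora-quantized-ggml.bin"]`.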

### Load the Model and Generate
By default, the model's output is printed only after inference finishes. However, depending on your hardware, a single prompt can take more than an hour to answer because of the large number of parameters in the model. The `StreamingStdOutCallbackHandler()` callback prints each token as soon as it is generated, which confirms that generation is running and that the model behaves as expected.

In [20]:
from langchain.llms import GPT4All
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [21]:
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = GPT4All(
    model="../../models/ggml-model-q4_0.bin",
    callback_manager=callback_manager,
    verbose=True,
)

Found model file at  ../../models/ggml-model-q4_0.bin
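
To illustrate what a streaming callback does, the sketch below feeds tokens one at a time to a handler that prints them immediately. The method name `on_llm_new_token` mirrors the LangChain callback hook of the same name; the `PrintTokenHandler` class and the `fake_model` generator are hypothetical stand-ins so the example runs without a model file.

```python
import sys


class PrintTokenHandler:
    """Stand-in handler: prints each token as soon as it arrives."""

    def __init__(self):
        self.tokens = []

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.tokens.append(token)
        sys.stdout.write(token)
        sys.stdout.flush()  # show the token immediately instead of buffering


def fake_model(prompt):
    """Hypothetical token source standing in for a slow local model."""
    yield from ["When ", "rain ", "falls", "..."]


handler = PrintTokenHandler()
for token in fake_model("What happens when it rains somewhere?"):
    handler.on_llm_new_token(token)

streamed_text = "".join(handler.tokens)
```

Because each token is flushed as it arrives, a long generation shows visible progress well before the final string is returned.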


In [22]:
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

In [23]:
question = "What happens when it rains somewhere?"
llm_chain.run(question)

 When rain falls, the water droplets fall from clouds in the sky and hit different surfaces on Earth such as roads or trees. The amount of precipitation that can occur varies depending upon many factors including air temperature and humidity levels surrounding a particular area where it is raining. During heavy downpour conditions, rainwater accumulates at lower elevations forming puddles which ultimately lead to surface runoff as the water flows through gutters or other drains on its way towards nearby rivers/streams etc..

' When rain falls, the water droplets fall from clouds in the sky and hit different surfaces on Earth such as roads or trees. The amount of precipitation that can occur varies depending upon many factors including air temperature and humidity levels surrounding a particular area where it is raining. During heavy downpour conditions, rainwater accumulates at lower elevations forming puddles which ultimately lead to surface runoff as the water flows through gutters or other drains on its way towards nearby rivers/streams etc..'

Another prompt for the same question.

In [24]:
template = """Question: {question}

Answer: Let's answer in two sentences while being funny."""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

In [25]:
question = "What happens when it rains somewhere?"
llm_chain.run(question)

 When rain falls, some places turn into swimming pools and others become river beds as they collect the precipitation until a flood occurs which makes people scream for help due to rising water levels around them!

' When rain falls, some places turn into swimming pools and others become river beds as they collect the precipitation until a flood occurs which makes people scream for help due to rising water levels around them!'

## Llama 2

Download any Llama model you are interested in from [TheBloke's Hugging Face profile](https://huggingface.co/TheBloke). This example uses `llama-2-13b-chat.ggmlv3.q4_1.bin` from the [Llama-2-13B-chat-GGML repository](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main).

In [2]:
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain import PromptTemplate, LLMChain


callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path="../../models/llama-2-13b-chat.ggmlv3.q4_1.bin",
    temperature=0.8,
    # n_threads=8,
    # n_ctx=2048,
    # n_batch=256,
    callback_manager=callback_manager,
    verbose=True,
)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 


In [3]:
template = """SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. \
Answer has to be as short as possible without losing the meaning. If you don't know the answer to a question, please don't share false information.
USER: {question}
ASSISTANT: 
"""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

In [4]:
question = "What is the capital of India?"
response = llm_chain.run(question)
print(f"\nResponse from model: {response}")

Good day! The capital of India is New Delhi.
Response from model: Good day! The capital of India is New Delhi.
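
The template above ends its first line with a backslash, which joins that line to the next one inside the triple-quoted string. The sketch below uses a shortened, hypothetical version of the template to show the text the model actually receives after the question is filled in.

```python
# Shortened stand-in for the template above; the trailing backslash joins two lines.
template = """SYSTEM: You are a helpful assistant. \
Answer as briefly as possible.
USER: {question}
ASSISTANT:
"""

rendered = template.format(question="What is the capital of India?")
print(rendered)
```

The system instructions therefore reach the model as a single line, followed by the `USER:` and `ASSISTANT:` lines.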


## HuggingFace LLM

In [2]:
from langchain import PromptTemplate

template = """Question: {question}

Answer: """

prompt = PromptTemplate(template=template, input_variables=["question"])

In [3]:
from langchain import HuggingFaceHub, LLMChain

# initialize Hub LLM
hub_llm = HuggingFaceHub(
    repo_id="google/flan-t5-large", model_kwargs={"temperature": 0, "max_length": 128}
)

# create prompt template > LLM chain
llm_chain = LLMChain(prompt=prompt, llm=hub_llm)

# user question
question = "What is the capital city of France?"

# ask the user question about the capital of France
print(llm_chain.run(question))

paris
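
`HuggingFaceHub` needs a Hugging Face API token, which it reads from the `HUGGINGFACEHUB_API_TOKEN` environment variable. The sketch below checks for the token up front so a missing value fails with a clear message; `require_hf_token` is a hypothetical helper, and the demonstration uses a placeholder mapping rather than a real token.

```python
import os


def require_hf_token(env=os.environ) -> str:
    """Return the Hugging Face API token or raise a clear error if it is missing."""
    token = env.get("HUGGINGFACEHUB_API_TOKEN")
    if not token:
        raise RuntimeError("Set HUGGINGFACEHUB_API_TOKEN before creating HuggingFaceHub.")
    return token


# Demonstration with a placeholder mapping instead of the real environment.
demo_token = require_hf_token({"HUGGINGFACEHUB_API_TOKEN": "hf_placeholder"})
print("token found")
```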


### Asking multiple questions

In [4]:
# Approach 1: pass a list of inputs to generate(); each question is answered separately
qa = [
    {"question": "What is the capital city of France?"},
    {"question": "What is the largest mammal on Earth?"},
    {"question": "Which gas is most abundant in Earth's atmosphere?"},
    {"question": "What color is a ripe banana?"},
]
res = llm_chain.generate(qa)
print(res)

generations=[[Generation(text='paris', generation_info=None)], [Generation(text='giraffe', generation_info=None)], [Generation(text='nitrogen', generation_info=None)], [Generation(text='yellow', generation_info=None)]] llm_output=None run=[RunInfo(run_id=UUID('c4c1c931-b9a9-491e-a978-af64deea8780')), RunInfo(run_id=UUID('0f0ea5d1-7642-4e7f-9904-9efb199db309')), RunInfo(run_id=UUID('52a209ec-77f0-4bd4-afb8-b3fb4596b225')), RunInfo(run_id=UUID('011d3a71-8ce9-4068-83a3-d5feccb7f5ee'))]


In [15]:
# Approach 2: place all questions into a single prompt
multi_template = """Answer the following questions one at a time.

Questions:
{questions}

Answers:
"""
long_prompt = PromptTemplate(template=multi_template, input_variables=["questions"])

llm_chain = LLMChain(prompt=long_prompt, llm=hub_llm)

qs_str = (
    "1. What is the capital city of France?\n"
    + "2. What is the largest mammal on Earth?\n"
    + "3. Which gas is most abundant in Earth's atmosphere?\n"
    + "4. What color is a ripe banana?\n"
    + "5. Who was the first president of India?\n"
    + "6. What is the color of sky?\n"
)
response = llm_chain.run(questions=qs_str)
response

'1. Paris 2. giraffe 3. nitrogen 4. yellow 5. Jawaharlal Nehru 6. blue'
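
With the second approach, the numbered prompt block can be built from a list, and the single response string has to be split back into individual answers. The sketch below shows one way to do both; `split_numbered_answers` is a hypothetical helper, and it assumes the model keeps the `1. ... 2. ...` numbering and that answers do not themselves contain a number followed by a period and a space.

```python
import re

questions = [
    "What is the capital city of France?",
    "What is the largest mammal on Earth?",
    "Which gas is most abundant in Earth's atmosphere?",
]

# Build the numbered block that fills the {questions} placeholder.
qs_str = "".join(f"{i}. {q}\n" for i, q in enumerate(questions, start=1))


def split_numbered_answers(response: str) -> list[str]:
    """Split a string like '1. Paris 2. giraffe' into its individual answers."""
    parts = re.split(r"\s*\d+\.\s+", response)
    return [part.strip() for part in parts if part.strip()]


answers = split_numbered_answers("1. Paris 2. giraffe 3. nitrogen")
print(answers)
```

Pairing `questions` with `answers` by position then gives a per-question result comparable to the output of the first approach.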