# LLMs and ChatModels in LangChain

## Install OpenAI, HuggingFace and LangChain dependencies

In [1]:
!pip install langchain==0.1.19
!pip install langchain-openai==0.1.6
!pip install langchain-community==0.0.38
!pip install huggingface_hub==0.28.0
!pip install transformers==4.38.2



## Enter API Tokens

#### Enter your OpenAI API key here

You can get the key from [here](https://platform.openai.com/api-keys) after creating an account or signing in.

In [None]:
from getpass import getpass

OPENAI_KEY = getpass('Please enter your Open AI API Key here: ')

Please enter your Open AI API Key here: ··········


#### Enter your HuggingFace token here

You can get the token from [here](https://huggingface.co/settings/tokens) after creating an account or signing in. This is free.

In [2]:
from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass('Please enter your HuggingFace Token here: ')

Please enter your HuggingFace Token here: ··········


## Setup necessary system environment variables

In [3]:
import os

os.environ['HUGGINGFACEHUB_API_TOKEN'] = HUGGINGFACEHUB_API_TOKEN
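# Uncomment the next line if you entered an OpenAI key above; the OpenAI cells below read OPENAI_API_KEY
# os.environ['OPENAI_API_KEY'] = OPENAI_KEY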
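
One way to make this step more forgiving is to export a key only when one was actually entered, so an empty value never reaches `os.environ`. The sketch below is only an illustration under that assumption; `export_key` and the `EXAMPLE_API_TOKEN` name are hypothetical and not part of LangChain or this notebook.

```python
import os
from typing import Optional


def export_key(name: str, value: Optional[str]) -> bool:
    """Set os.environ[name] only when value is a non-empty string.

    Hypothetical helper for illustration; returns True if the variable was set.
    """
    if value and value.strip():
        os.environ[name] = value.strip()
        return True
    return False


# Example with a placeholder value (not a real token)
export_key("EXAMPLE_API_TOKEN", "placeholder-token")
print("EXAMPLE_API_TOKEN" in os.environ)
```

In the notebook, the same call could be made with `HUGGINGFACEHUB_API_TOKEN` or `OPENAI_KEY` as the value.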

# Model I/O

In LangChain, the language model is the central part of any application. This section covers the key tools for working with any language model, so that it integrates smoothly into your application and its inputs and outputs are handled correctly.

### Key Components of Model I/O

**LLMs and Chat Models:**
- **LLMs:**
  - **Definition:** Pure text completion models.
  - **Input/Output:** Receives a text string and returns a text string.
- **Chat Models:**
  - **Definition:** Based on a language model but with different input and output types.
  - **Input/Output:** Takes a list of chat messages as input and produces a chat message as output.
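
The difference between the two input/output contracts above can be sketched with plain Python. The functions below are toy stand-ins that only mimic the data shapes; they are not LangChain classes, they call no model, and the `role`/`content` dictionary layout is an assumption made for illustration.

```python
from typing import Dict, List

# Toy stand-ins that only mimic the input/output shapes; no real model is called.
Message = Dict[str, str]  # e.g. {"role": "user", "content": "Hi"}


def toy_llm(prompt: str) -> str:
    """LLM-style contract: text string in, text string out."""
    return f"(completion for: {prompt})"


def toy_chat_model(messages: List[Message]) -> Message:
    """Chat-model-style contract: list of messages in, one message out."""
    last_user_text = messages[-1]["content"]
    return {"role": "assistant", "content": f"(reply to: {last_user_text})"}


print(toy_llm("Tell me a joke"))
print(toy_chat_model([{"role": "user", "content": "Tell me a joke"}]))
```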


## Chat Models and LLMs

Large Language Models (LLMs) are a core component of LangChain. LangChain does not implement or build its own LLMs. Instead, it provides a standard API for interacting with almost every LLM available.

There are many LLM providers (OpenAI, Hugging Face, etc.), and the LLM class is designed to provide a standard interface for all of them.

## Accessing Commercial LLMs like ChatGPT



### Accessing ChatGPT as an LLM

Here we show how to access a basic ChatGPT Instruct LLM. However, the ChatModel interface, which we will see later, is the better choice: the LLM API does not support chat models like `gpt-3.5-turbo` and only supports the `instruct` models, which can respond to instructions but cannot hold a conversation with you.

In [None]:
from langchain_openai import OpenAI

chatgpt = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0)

In [None]:
prompt = """Explain what is Generative AI in 3 bullet points"""
print(prompt)

Explain what is Generative AI in 3 bullet points


In [None]:
response = chatgpt.invoke(prompt)
print(response)



1. Generative AI is a subset of artificial intelligence that focuses on creating new and original content, rather than just analyzing and processing existing data.

2. It uses algorithms and machine learning techniques to generate new ideas, designs, or solutions based on a set of input data or parameters.

3. Generative AI has a wide range of applications, including creating art, music, and text, as well as assisting in product design and optimization. It has the potential to revolutionize industries by automating creative tasks and providing innovative solutions.


### Accessing ChatGPT as a Chat Model LLM

Here we show how to access the more advanced ChatGPT Turbo chat-based LLM. The ChatModel interface is better because it supports chat models like `gpt-3.5-turbo`, which can respond to instructions as well as hold a conversation with you. We will look at the conversation aspect slightly later in the notebook.

In [None]:
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [None]:
prompt = """Explain what is Generative AI in 3 bullet points"""
print(prompt)

Explain what is Generative AI in 3 bullet points


In [None]:
response = chatgpt.invoke(prompt)
response

AIMessage(content='- Generative AI is a type of artificial intelligence that is capable of creating new content, such as images, text, or music, based on patterns and data it has been trained on.\n- It uses algorithms and neural networks to generate this content, often mimicking the style or characteristics of the data it has been exposed to.\n- Generative AI has a wide range of applications, from creating realistic images for video games to generating personalized recommendations for users based on their preferences.', response_metadata={'token_usage': {'completion_tokens': 95, 'prompt_tokens': 19, 'total_tokens': 114}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-cdcede76-2f2e-46db-8ef0-3f6b9c263cf1-0')

In [None]:
print(response.content)

- Generative AI is a type of artificial intelligence that is capable of creating new content, such as images, text, or music, based on patterns and data it has been trained on.
- It uses algorithms and neural networks to generate this content, often mimicking the style or characteristics of the data it has been exposed to.
- Generative AI has a wide range of applications, from creating realistic images for video games to generating personalized recommendations for users based on their preferences.
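
The `response_metadata` dictionary shown in the `AIMessage` output above also carries token counts, which can be handy for tracking usage. The sketch below works on a plain dictionary with the same keys as that output; `summarize_usage` is a hypothetical helper, not a LangChain function.

```python
from typing import Any, Dict


def summarize_usage(response_metadata: Dict[str, Any]) -> str:
    """Build a one-line token usage summary from a response_metadata-style dict."""
    usage = response_metadata.get("token_usage") or {}
    prompt_tokens = usage.get("prompt_tokens", 0)
    completion_tokens = usage.get("completion_tokens", 0)
    total_tokens = usage.get("total_tokens", prompt_tokens + completion_tokens)
    model_name = response_metadata.get("model_name", "unknown")
    return (
        f"{model_name}: {prompt_tokens} prompt + "
        f"{completion_tokens} completion = {total_tokens} tokens"
    )


# Values copied from the AIMessage output above
metadata = {
    "token_usage": {"completion_tokens": 95, "prompt_tokens": 19, "total_tokens": 114},
    "model_name": "gpt-3.5-turbo",
    "finish_reason": "stop",
}
print(summarize_usage(metadata))
```

In the notebook, the same helper could presumably be called as `summarize_usage(response.response_metadata)`.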


## Accessing Open Source LLMs with HuggingFace and LangChain

### Accessing Open LLMs with HuggingFace Serverless API

The free [serverless API](https://huggingface.co/inference-api/serverless) lets you prototype and iterate quickly, but it may be rate limited for heavy use cases, since the load is shared with other requests.

For enterprise workloads, you can use dedicated Inference Endpoints, which are hosted on a cloud instance of your choice and have a cost associated with them. Here we use the free serverless API, which works quite well in most cases.

The advantage is that you do not need to download the models or run them locally on GPU infrastructure, which takes time and can cost a fair amount.
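
Because the shared serverless API may be rate limited, it can help to retry a failed call after a short, growing pause. The sketch below shows one possible retry loop with exponential backoff. `invoke_with_retry` and `flaky_endpoint` are hypothetical names, and catching every exception is a simplification; a real notebook would likely catch only the errors that signal rate limiting.

```python
import time
from typing import Callable


def invoke_with_retry(
    call: Callable[[str], str],
    prompt: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `call(prompt)`, retrying with exponential backoff on any exception."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call(prompt)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))


attempts = {"count": 0}


def flaky_endpoint(prompt: str) -> str:
    """Stand-in for a rate-limited endpoint: fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return f"answer to: {prompt}"


# sleep is replaced with a no-op so the demo finishes instantly
print(invoke_with_retry(flaky_endpoint, "What is an SLM?", sleep=lambda s: None))
```

In the notebook, `call` could be the `invoke` method of an endpoint object, for example `invoke_with_retry(llm.invoke, prompt)`.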

#### Accessing Microsoft Phi-3 Mini Instruct

Phi-3-Mini-4K-Instruct is a lightweight, state-of-the-art open model with 3.8B parameters. It was trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense properties. See more details [here](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct).

In [4]:
from langchain_community.llms import HuggingFaceEndpoint

PHI3_MINI_API_URL = "https://api-inference.huggingface.co/models/microsoft/Phi-3-mini-4k-instruct"

phi3_params = {
    "wait_for_model": True,  # wait if the model is not yet available on the Hugging Face servers
    "do_sample": False,  # greedy decoding (equivalent to temperature = 0)
    "return_full_text": False,  # don't return the input prompt
    "max_new_tokens": 1000,  # maximum number of tokens in the answer
}
llm = HuggingFaceEndpoint(
    endpoint_url=PHI3_MINI_API_URL,
    task="text-generation",
    **phi3_params
)

                    wait_for_model was transferred to model_kwargs.
                    Please make sure that wait_for_model is what you intended.


In [5]:
prompt = """What do you mean by Small Language Models and how are they different from Large Language models?"""
prompt

'What do you mean by Small Language Models and how are they different from Large Language models?'

In [6]:
# Phi3 expects input prompt to be formatted in a specific way
# check more details here: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
phi3_prompt = """<|user|>What do you mean by Small Language Models and how are they different from Large Language models?<|end|>
<|assistant|>"""
print(phi3_prompt)

<|user|>What do you mean by Small Language Models and how are they different from Large Language models?<|end|>
<|assistant|>
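
Rather than typing the Phi-3 tags by hand for every question, the wrapping can be done by a small function. The sketch below reproduces the single-turn layout printed above; `format_phi3_prompt` is a hypothetical helper written for this notebook, and the model card linked in the cell remains the reference for the full chat format.

```python
def format_phi3_prompt(user_message: str) -> str:
    """Wrap one user message in the Phi-3 tags used in the cell above."""
    return f"<|user|>{user_message}<|end|>\n<|assistant|>"


question = (
    "What do you mean by Small Language Models and how are they "
    "different from Large Language models?"
)
print(format_phi3_prompt(question))
```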


In [8]:
response = llm.invoke(phi3_prompt)
print(response)

Small language models (SLMs) and large language models (LLMs) refer to the size and complexity of these models, which can significantly impact their capabilities and performance. Here's a brief comparison:

1. **Small Language Models (SLMs):**
   - **Size**: SLMs typically have fewer parameters, ranging from a few million to tens of millions.
   - **Training Data**: They are usually trained on smaller, more specific datasets.
   - **Capabilities**: SLMs excel at specific tasks and domains due to their targeted training. They can generate coherent text, translate languages, and perform simple reasoning tasks. However, they struggle with understanding context across long sequences of text and lack the ability to generalize to new tasks or domains without fine-tuning.
   - **Examples**: BERT-BASE (110M parameters), RoBERTa-BASE (125M parameters).

2. **Large Language Models (LLMs):**
   - **Size**: LLMs have billions of parameters, with some models having trillions of parameters in develo



#### Accessing Google Gemma 2B Instruct

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure. Check more details [here](https://huggingface.co/google/gemma-1.1-2b-it)

In [9]:
GEMMA_API_URL = "https://api-inference.huggingface.co/models/google/gemma-1.1-2b-it"

gemma_params = {
    # "wait_for_model": True,  # wait if the model is not yet available on the Hugging Face servers
    "do_sample": False,  # greedy decoding (equivalent to temperature = 0)
    "return_full_text": False,  # don't return the input prompt
    "max_new_tokens": 1000,  # maximum number of tokens in the answer
}

llm = HuggingFaceEndpoint(
    endpoint_url=GEMMA_API_URL,
    task="text-generation",
    **gemma_params
)

In [10]:
prompt

'What do you mean by Small Language Models and how are they different from Large Language models?'

In [11]:
response = llm.invoke(prompt)
print(response)



 Could you please provide some examples?

**Answer:**

Small Language Models (SLMs) and Large Language Models (LLMs) are terms used to categorize language models based on their size and complexity. Here's a breakdown of the two, along with examples:

1. **Small Language Models (SLMs):**

   - **Size:** SLMs typically have fewer parameters than LLMs. For instance, they might range from a few million to a few billion parameters.

   - **Training:** They are usually trained on smaller datasets and for specific tasks like sentiment analysis, named entity recognition, or machine translation.

   - **Examples:**
     - **BERT-BASE (2020-03-24)** - A transformer-based model with 24 layers, a hidden size of 1024, 16 self-attention heads, and 110M parameters. It's trained on the English Wikipedia, BookCorpus, and OpenWebText.
     - **RoBERTa-BASE (2020-02-05)** - A version of BERT with 24 layers, a hidden size of 768, 12 self-attention heads, and 125M parameters. It's trained on a larger datas

### Accessing Local LLMs with HuggingFacePipeline API

Hugging Face models can be run locally through the `HuggingFacePipeline` class. However, you need a good GPU to get fast inference.

The Hugging Face Model Hub hosts over 500k models, including 90K+ open LLMs.

These can be called from LangChain either through this local pipeline wrapper or by calling their hosted inference endpoints through the `HuggingFaceEndpoint` API we saw earlier.

To use it, you should have the `transformers` Python package installed, as well as PyTorch.

The advantages are that the model runs completely locally, which gives high privacy and security. The main disadvantage is the need for good compute infrastructure, preferably with a GPU.

#### Accessing Google Gemma 2B and running it locally

In [12]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

gemma_params = {
    "do_sample": False,  # greedy decoding (equivalent to temperature = 0)
    "return_full_text": False,  # don't return the input prompt
    "max_new_tokens": 1000,  # maximum number of tokens in the answer
}

local_llm = HuggingFacePipeline.from_model_id(
    model_id="google/gemma-1.1-2b-it",
    task="text-generation",
    pipeline_kwargs=gemma_params,
    device=0,  # selects the GPU when running on Colab; change this to match your own instance
)




In [13]:
local_llm

HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7c8a06392350>, model_id='google/gemma-1.1-2b-it', model_kwargs={}, pipeline_kwargs={'do_sample': False, 'return_full_text': False, 'max_new_tokens': 1000})

In [14]:
prompt

'What do you mean by Small Language Models and how are they different from Large Language models?'

In [15]:
# When run locally, Gemma 2B expects the input prompt to be formatted in a specific way
# check more details here: https://huggingface.co/google/gemma-1.1-2b-it#chat-template
gemma_prompt = """<bos><start_of_turn>user\n""" + prompt + """\n<end_of_turn>
<start_of_turn>model
"""
print(gemma_prompt)

<bos><start_of_turn>user
What do you mean by Small Language Models and how are they different from Large Language models?
<end_of_turn>
<start_of_turn>model



In [16]:
response = local_llm.invoke(gemma_prompt)
print(response)

**Small Language Models (SLMs)**

* Smaller in size and training data compared to Large Language Models (LLMs).
* Trained on a limited amount of data, typically less than a few gigabytes.
* Possess limited knowledge and reasoning abilities.
* Designed to perform specific tasks, such as language translation, question answering, or text summarization.
* Typically have a limited vocabulary and struggle with complex or nuanced language.


**Large Language Models (LLMs)**

* Vastly larger in size and training data, often petabytes or even terabytes.
* Trained on massive datasets of text and code.
* Possess extensive knowledge and reasoning abilities.
* Can generate creative and coherent text, translate languages, and perform complex tasks.
* Have a vast vocabulary and can understand nuances of language.


**Key Differences:**

**1. Size and Training Data:**
- SLMs are smaller and have limited training data.
- LLMs are much larger and trained on massive datasets.

**2. Knowledge and Reasonin

### Accessing Open LLMs in HuggingFace as a Chat Model LLM

Here we access an open LLM from HuggingFace, Google Gemma 2B, through the chat model interface. We will look at the conversation aspect slightly later in the notebook.

In [17]:
from langchain_community.chat_models import ChatHuggingFace

chat_gemma = ChatHuggingFace(llm=llm,
                             model_id='google/gemma-1.1-2b-it')



In [19]:
response = chat_gemma.invoke(prompt)
response



AIMessage(content="Small Language Models (SLMs) refer to models with a smaller number of parameters, typically less than 1 billion. They are simpler, faster, and require less computational resources to run compared to Large Language Models (LLMs). Here's a simple breakdown of the differences:\n\n1. **Size**: SLMs have fewer parameters than LLMs. This means they have less capacity to learn complex patterns and context.\n2. **Context Window**: SLMs usually have a smaller context window, which is the amount of text they can consider at once. This makes them less capable of understanding long-range dependencies in text.\n3. **Training Data**: SLMs are typically trained on smaller datasets. This can limit their knowledge cut-off and the diversity of topics they understand.\n4. **Performance**: While SLMs can be quite capable for specific tasks, they often lag behind LLMs in understanding context, generating coherent text, and handling complex tasks.\n5. **Applications**: SLMs are often used

In [20]:
print(response.content)

Small Language Models (SLMs) refer to models with a smaller number of parameters, typically less than 1 billion. They are simpler, faster, and require less computational resources to run compared to Large Language Models (LLMs). Here's a simple breakdown of the differences:

1. **Size**: SLMs have fewer parameters than LLMs. This means they have less capacity to learn complex patterns and context.
2. **Context Window**: SLMs usually have a smaller context window, which is the amount of text they can consider at once. This makes them less capable of understanding long-range dependencies in text.
3. **Training Data**: SLMs are typically trained on smaller datasets. This can limit their knowledge cut-off and the diversity of topics they understand.
4. **Performance**: While SLMs can be quite capable for specific tasks, they often lag behind LLMs in understanding context, generating coherent text, and handling complex tasks.
5. **Applications**: SLMs are often used in real-time application

## Message Types for ChatModels and Conversational Prompting

Conversational prompting means the user holds a full, multi-turn conversation with the LLM. The conversation history is typically represented as a list of messages.

ChatModels process a list of messages, receiving them as input and responding with a message. Messages are characterized by a few distinct types and properties:

- **Role:** Indicates who is speaking in the message. LangChain offers different message classes for various roles.
- **Content:** The substance of the message, which can vary:
  - A string (commonly handled by most models)
  - A list of dictionaries (for multi-modal inputs, where each dictionary details the type and location of the input)
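
The two content shapes listed above can be handled by one small function. The sketch below assumes an OpenAI-style layout for the list form, where each dictionary has a `type` key and text blocks carry a `text` key; `content_to_text` is a hypothetical helper and the image URL is a placeholder.

```python
from typing import Dict, List, Union

Content = Union[str, List[Dict[str, object]]]


def content_to_text(content: Content) -> str:
    """Return the text of a message content, whichever shape it has."""
    if isinstance(content, str):
        return content
    # List form: keep only the text blocks and join them with newlines
    parts = [
        str(block["text"])
        for block in content
        if block.get("type") == "text" and "text" in block
    ]
    return "\n".join(parts)


plain = "Describe this picture"
# Assumed OpenAI-style block layout; the URL is a placeholder
multimodal = [
    {"type": "text", "text": "Describe this picture"},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
]
print(content_to_text(plain))
print(content_to_text(multimodal))
```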

Additionally, messages have an `additional_kwargs` property, used to pass extra information that is specific to the model provider rather than common to all messages. A well-known example is `function_call` from OpenAI.

### Specific Message Types

- **HumanMessage:** A user-generated message, usually containing only content.
- **AIMessage:** A message from the model, potentially including `additional_kwargs`, like `tool_calls` for invoking OpenAI tools.
- **SystemMessage:** A message from the system instructing model behavior, typically containing only content. Not all models support this type.


## Conversational Prompting with ChatGPT

Here we use the `ChatModel` API in `ChatOpenAI` to have a full conversation with ChatGPT while keeping the complete conversation history.

In [None]:
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [None]:
from langchain_core.messages import HumanMessage, SystemMessage

prompt = """Can you explain what is Generative AI in 3 bullet points?"""
sys_prompt = """Act as a helpful assistant and give meaningful examples in your responses."""
messages = [
    SystemMessage(content=sys_prompt),
    HumanMessage(content=prompt),
]

messages

[SystemMessage(content='Act as a helpful assistant and give meaningful examples in your responses.'),
 HumanMessage(content='Can you explain what is Generative AI in 3 bullet points?')]

In [None]:
response = chatgpt.invoke(messages)
response

AIMessage(content="1. Generative AI is a type of artificial intelligence that is capable of creating new content, such as images, text, or music, based on patterns it has learned from existing data.\n2. It uses techniques like neural networks and deep learning to generate realistic and original outputs that mimic human creativity.\n3. Examples of generative AI include text generators like GPT-3, image generators like StyleGAN, and music generators like OpenAI's MuseNet.", response_metadata={'token_usage': {'completion_tokens': 92, 'prompt_tokens': 38, 'total_tokens': 130}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-5d7d306b-26fa-4d80-906c-939324415dc7-0')

In [None]:
print(response.content)

1. Generative AI is a type of artificial intelligence that is capable of creating new content, such as images, text, or music, based on patterns it has learned from existing data.
2. It uses techniques like neural networks and deep learning to generate realistic and original outputs that mimic human creativity.
3. Examples of generative AI include text generators like GPT-3, image generators like StyleGAN, and music generators like OpenAI's MuseNet.


In [None]:
# add the past conversation history into messages
messages.append(response)
# add the new prompt to the conversation history list
prompt = """What did we discuss so far?"""
messages.append(HumanMessage(content=prompt))
messages

[SystemMessage(content='Act as a helpful assistant and give meaningful examples in your responses.'),
 HumanMessage(content='Can you explain what is Generative AI in 3 bullet points?'),
 AIMessage(content="1. Generative AI is a type of artificial intelligence that is capable of creating new content, such as images, text, or music, based on patterns it has learned from existing data.\n2. It uses techniques like neural networks and deep learning to generate realistic and original outputs that mimic human creativity.\n3. Examples of generative AI include text generators like GPT-3, image generators like StyleGAN, and music generators like OpenAI's MuseNet.", response_metadata={'token_usage': {'completion_tokens': 92, 'prompt_tokens': 38, 'total_tokens': 130}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-5d7d306b-26fa-4d80-906c-939324415dc7-0'),
 HumanMessage(content='What did we discuss so far?')]

In [None]:
# send the conversation history along with the new prompt to chatgpt
response = chatgpt.invoke(messages)
response.content

'So far, we have discussed Generative AI and its key characteristics, such as its ability to create new content based on learned patterns, its use of neural networks and deep learning techniques, and examples of generative AI applications like text generators, image generators, and music generators.'
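
The append-then-invoke pattern used in the cells above can be wrapped in a small function so that every turn updates the history the same way. The sketch below uses `(role, content)` tuples as stand-ins for `SystemMessage`, `HumanMessage` and `AIMessage`, and a toy model instead of ChatGPT; `chat_turn` and `toy_model` are hypothetical names.

```python
from typing import Callable, List, Tuple

# (role, content) pairs stand in for SystemMessage / HumanMessage / AIMessage
Turn = Tuple[str, str]


def chat_turn(
    history: List[Turn],
    user_text: str,
    model: Callable[[List[Turn]], str],
) -> str:
    """Append the user turn, call the model on the full history, append its reply."""
    history.append(("human", user_text))
    reply = model(history)
    history.append(("ai", reply))
    return reply


def toy_model(history: List[Turn]) -> str:
    """Toy model: reports how many human turns it has seen so far."""
    human_turns = sum(1 for role, _ in history if role == "human")
    return f"I have seen {human_turns} human message(s)."


history: List[Turn] = [("system", "Act as a helpful assistant.")]
chat_turn(history, "Explain Generative AI.", toy_model)
print(chat_turn(history, "What did we discuss so far?", toy_model))
print(len(history))
```

The model only "remembers" earlier turns because the whole history is passed in on every call, which is the same reason the notebook appends each response to `messages` before invoking the model again.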

## Conversational Prompting with Open LLMs via HuggingFace

Here we use the `ChatModel` API in `ChatHuggingFace` to have a full conversation with an open LLM while keeping the complete conversation history. Here we use the Google Gemma 2B LLM.

In [21]:
llm

HuggingFaceEndpoint(endpoint_url='https://api-inference.huggingface.co/models/google/gemma-1.1-2b-it', max_new_tokens=1000, model='https://api-inference.huggingface.co/models/google/gemma-1.1-2b-it', client=<InferenceClient(model='https://api-inference.huggingface.co/models/google/gemma-1.1-2b-it', timeout=120)>, async_client=<InferenceClient(model='https://api-inference.huggingface.co/models/google/gemma-1.1-2b-it', timeout=120)>, task='text-generation')

In [22]:
# not needed if you are only running chatgpt
from langchain_community.chat_models import ChatHuggingFace
from langchain_core.messages import HumanMessage, SystemMessage

chat_gemma = ChatHuggingFace(llm=llm,
                             model_id='google/gemma-1.1-2b-it')



In [23]:
# this runs prompts using the open LLM - however, Gemma doesn't support a system prompt
prompt = """What is Retrieval Augmented Generation?"""

messages = [
    HumanMessage(content=prompt),
]

response = chat_gemma.invoke(messages) # doesn't support system prompts
messages.append(response)
print(response.content)



Retrieval Augmented Generation (RAG) is a method in artificial intelligence, particularly in natural language processing, that combines retrieval and generation to improve the performance of language models. Here's a simple breakdown:

1. **Retrieval**: In this step, a retriever model searches for relevant information from a large external knowledge base (like a document database or the internet) based on the user's query. The retrieved information is called 'context' or 'retrieved documents'.

2. **Augmentation**: The retrieved context is then used to 'augment' or enhance the input to the language model. This can be done in several ways, such as:
   - **Directly inserting the context into the input**: The language model generates its response based on the input query and the retrieved context.
   - **Using the context to fine-tune the model**: The language model is further trained on the retrieved context before generating the response.
   - **Using the context to guide the generation
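
Since Gemma does not support a system prompt, one common workaround is to fold the system instructions into the first human message before calling the model. The sketch below illustrates that idea on `(role, content)` tuples, which stand in for LangChain message objects; `fold_system_prompt` is a hypothetical helper and this is not something the notebook or `ChatHuggingFace` does for you.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (role, content); a stand-in for LangChain message objects


def fold_system_prompt(turns: List[Turn]) -> List[Turn]:
    """Return a new list where system text is merged into the first human turn."""
    system_text = "\n".join(content for role, content in turns if role == "system")
    folded: List[Turn] = []
    merged = False
    for role, content in turns:
        if role == "system":
            continue  # system turns are dropped after their text is collected
        if role == "human" and system_text and not merged:
            content = f"{system_text}\n\n{content}"
            merged = True
        folded.append((role, content))
    if system_text and not merged:
        # No human turn to merge into: send the instructions as a human turn
        folded.insert(0, ("human", system_text))
    return folded


turns = [
    ("system", "Give meaningful examples in your responses."),
    ("human", "What is Retrieval Augmented Generation?"),
]
print(fold_system_prompt(turns))
```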

In [24]:
messages

[HumanMessage(content='What is Retrieval Augmented Generation?'),
 AIMessage(content="Retrieval Augmented Generation (RAG) is a method in artificial intelligence, particularly in natural language processing, that combines retrieval and generation to improve the performance of language models. Here's a simple breakdown:\n\n1. **Retrieval**: In this step, a retriever model searches for relevant information from a large external knowledge base (like a document database or the internet) based on the user's query. The retrieved information is called 'context' or 'retrieved documents'.\n\n2. **Augmentation**: The retrieved context is then used to 'augment' or enhance the input to the language model. This can be done in several ways, such as:\n   - **Directly inserting the context into the input**: The language model generates its response based on the input query and the retrieved context.\n   - **Using the context to fine-tune the model**: The language model is further trained on the retrie

In [25]:
# the chat model formats the prompt automatically
# formats in this syntax: https://huggingface.co/google/gemma-1.1-2b-it#chat-template
print(chat_gemma._to_chat_prompt([messages[0]]))

<bos><start_of_turn>user
What is Retrieval Augmented Generation?<end_of_turn>
<start_of_turn>model
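
To make the layout printed above concrete, the sketch below rebuilds it by hand for one or more turns. It follows the single-turn output shown here and the chat template linked in the cell. The `(role, content)` tuples, the `human`/`ai` role names and `format_gemma_prompt` itself are assumptions made for illustration, not the code `ChatHuggingFace` runs internally.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (role, content) with role "human" or "ai"

# Role names used inside the Gemma prompt
GEMMA_ROLES = {"human": "user", "ai": "model"}


def format_gemma_prompt(turns: List[Turn]) -> str:
    """Render turns in the Gemma chat layout shown in the output above."""
    prompt = "<bos>"
    for role, content in turns:
        if role not in GEMMA_ROLES:
            # e.g. "system": Gemma has no system role
            raise ValueError(f"unsupported role for Gemma: {role!r}")
        prompt += f"<start_of_turn>{GEMMA_ROLES[role]}\n{content.strip()}<end_of_turn>\n"
    # Open the model's turn so generation continues from here
    return prompt + "<start_of_turn>model\n"


print(format_gemma_prompt([("human", "What is Retrieval Augmented Generation?")]))
```

With a longer history, each earlier answer becomes a `model` turn, which is how the follow-up question in the next cell can refer back to the first answer.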



In [26]:
prompt = """How is it different from fine-tuning?"""
messages.append(HumanMessage(content=prompt))

response = chat_gemma.invoke(messages) # doesn't support system prompts
print(response.content)



Retrieval Augmented Generation (RAG) and fine-tuning are both techniques used to improve the performance of language models, but they differ in several ways:

1. **Purpose**:
   - **Fine-tuning** is used to adapt a pre-trained language model to a specific task or domain by further training it on a labeled dataset relevant to that task. The goal is to improve the model's performance on the specific task.
   - **RAG**, on the other hand, is used to help the language model better understand the context of a user's query by providing it with relevant external information. The goal is to generate more accurate, relevant, and informative responses.

2. **Data Use**:
   - **Fine-tuning** involves training the model on a new dataset, which can be time-consuming and may require labeled data.
   - **RAG** uses external, often unlabeled, data that is retrieved based on the user's query. This data can be from a variety of sources like documents, websites, or even previous conversations.

3. **Mode