## Llama and LangChain

**LLaMA** (Large Language Model Meta AI) is a family of openly available LLMs developed by Meta, while **LangChain** is a framework for building applications around LLMs.

LangChain supports LLaMA models (including LLaMA 2 and LLaMA 3) through several integrations, so developers can use LLaMA much as they would a GPT-based model.

### 1. Install Packages

In [1]:
!pip install -q transformers einops accelerate langchain bitsandbytes

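### 2. Log in to your Hugging Face account

The Llama 2 checkpoints on the Hugging Face Hub are gated, so downloading them requires an access token from an account that has been granted access. The cells below first install `langchain_community`, which provides the `HuggingFacePipeline` wrapper, and then log in with the Hugging Face CLI.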

In [2]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.19-py3-none-any.whl.metadata (2.4 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain_community)
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB

In [3]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
The token `my_huggingface_token` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to 
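In a non-interactive environment the CLI prompt is inconvenient. A minimal sketch of a programmatic login, assuming the token is stored in an environment variable (the name `HF_TOKEN` and both helper functions are illustrative choices, not part of the original notebook); `huggingface_hub.login` is imported lazily so the token check can be used on its own:

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return token


def login_from_env(env_var: str = "HF_TOKEN") -> None:
    """Log in to the Hugging Face Hub with a token taken from the environment."""
    # Imported here so read_hf_token() works even without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_hf_token(env_var))
```

Calling `login_from_env()` in place of the `!huggingface-cli login` cell keeps the token out of the notebook itself.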

### 3. Import necessary libraries.

In [4]:
from langchain_community.llms import HuggingFacePipeline  # langchain.llms is a deprecated import path
from transformers import AutoTokenizer, pipeline, logging
import torch
import warnings

**Warnings filter in Python:** The warnings filter controls how warnings are handled: whether they are **ignored, displayed, or turned into exceptions**.

The filter maintains an **ordered list** of filter specifications. When a warning is raised, it is matched against each specification in turn until a match is found, and the matching entry determines the action taken.

Each entry is a **tuple** of the form `(action, message, category, module, lineno)`, where **action** is one of the strings listed below ([read more on docs.python.org](https://docs.python.org/3/library/warnings.html#warning-filter)).

**List of Warnings Filter Actions:**  

| **Action**  | **Behavior**  |
|------------|--------------|
| `"default"` | Prints the first occurrence of matching warnings for each **location** (module + line number).  |
| `"error"`   | Converts matching warnings into **exceptions**.  |
| `"ignore"`  | Suppresses matching warnings entirely.  |
| `"always"`  | Always prints matching warnings.  |
| `"module"`  | Prints the first occurrence of matching warnings for each **module** (ignoring line number).  |
| `"once"`    | Prints only the **first occurrence** of matching warnings, regardless of location.  |

This filtering mechanism allows **fine-grained control** over how warnings are handled in a Python program. 🚀
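
As a small illustration of three of these actions, the sketch below uses only the standard library. The function `noisy` is a made-up example; `warnings.catch_warnings()` is used so that each filter is undone when its block exits, unlike a global `warnings.filterwarnings(...)` call.

```python
import warnings


def noisy() -> str:
    """Hypothetical function that emits a DeprecationWarning each time it runs."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "done"


# "error": the matching warning is raised as an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    try:
        noisy()
        raised = False
    except DeprecationWarning:
        raised = True

# "ignore": the warning is suppressed, so nothing is recorded.
with warnings.catch_warnings(record=True) as caught_ignore:
    warnings.simplefilter("ignore", DeprecationWarning)
    result = noisy()

# "always": every occurrence is reported, even from the same location.
with warnings.catch_warnings(record=True) as caught_always:
    warnings.simplefilter("always", DeprecationWarning)
    noisy()
    noisy()

print(raised, len(caught_ignore), len(caught_always))  # True 0 2
```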


In [5]:
warnings.filterwarnings("ignore")

In [6]:
model="meta-llama/Llama-2-7b-chat-hf"

In [7]:
tockenizer=AutoTokenizer.from_pretrained(model)

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

### Two major options to create a pipeline:

**1. Using `HuggingFacePipeline.from_model_id`:**

`HuggingFacePipeline.from_model_id()` is a LangChain utility that loads a model directly from Hugging Face for text generation.
- Accepts `model_kwargs` to set parameters such as `temperature` and `max_length`; generation settings can also be passed through `pipeline_kwargs`.
- Easier to use if you are working within LangChain.
- Abstracts model loading: you do not need to load the model and tokenizer manually.

**2. Using `transformers.pipeline()`:**

`transformers.pipeline()` is the lower-level entry point from Hugging Face's transformers library.
- More fine-grained control, with additional arguments such as:
  - `torch_dtype=torch.float16` (for reduced memory usage)
  - `trust_remote_code=True` (for loading models with custom code)
  - `device_map="auto"` (to distribute across available devices)
- Explicit model and tokenizer handling: you pass the model (as a Hub ID or an already loaded object) and the tokenizer to `pipeline()` yourself.

| Feature               | **LangChain (`from_model_id`)** | **Hugging Face (`pipeline()`)** |
|-----------------------|--------------------------------|--------------------------------|
| **Abstraction**       | Higher (easier setup)         | Lower (manual setup)          |
| **Model Loading**     | Automatic                     | Manual                        |
| **Customization**     | Limited                       | Highly customizable           |
| **GPU/Memory Handling** | Implicit                    | Explicit (`torch_dtype`, `device_map`) |
| **Best Use Case**     | Quick prototyping in LangChain | More control over model execution |
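
To make the memory argument for `torch_dtype` concrete, here is a back-of-the-envelope estimate of the space taken by the weights alone. It is a rough sketch: the parameter count is rounded to 7 billion as an assumption, and activations, the KV cache, and framework overhead are ignored.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 7_000_000_000  # assumption: round figure for a "7B" model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(NUM_PARAMS, dtype):5.1f} GB")
```

Under these assumptions, 16-bit formats need about 14 GB for the weights, half of the 28 GB that `float32` would take.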


In [8]:
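# Create the pipeline with transformers.pipeline(),
# the lower-level approach from Hugging Face's transformers library.
# Note: this assignment rebinds the name `pipeline`, shadowing the imported
# function, so re-running this cell requires re-importing it first.
pipeline = pipeline(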
    "text-generation",
    model=model,
    tokenizer=tockenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=1000,
    # truncation=True,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tockenizer.eos_token_id
)

# ALTERNATIVE
# Create the HuggingFace pipeline
# LangChain utility that directly loads a model from Hugging Face for text generation
# pipeline=HuggingFacePipeline.from_model_id(
#     model_id=model,
#     task="text-generation",
#     model_kwargs={"temperature":0.5,"max_length":256}
# )


config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Device set to use cuda:0
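
The `do_sample=True` and `top_k=10` arguments in the cell above ask the model to sample each next token from only the 10 highest-scoring candidates. The following pure-Python sketch illustrates that idea on a toy vocabulary; it is not the transformers implementation, and the token scores are invented for the example.

```python
import math
import random


def top_k_probs(logits: dict, k: int) -> dict:
    """Keep the k highest-scoring tokens and renormalise them with a softmax.

    Assumes `logits` is non-empty and k >= 1.
    """
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    peak = top[0][1]  # subtract the maximum score for numerical stability
    weights = {token: math.exp(score - peak) for token, score in top}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def sample_top_k(logits: dict, k: int, rng: random.Random) -> str:
    """Draw one token from the renormalised top-k distribution."""
    probs = top_k_probs(logits, k)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Invented scores for four candidate next tokens.
logits = {"laptop": 2.0, "desktop": 1.5, "tablet": 0.2, "toaster": -3.0}
probs = top_k_probs(logits, k=2)
print(probs)
print(sample_top_k(logits, k=2, rng=random.Random(0)))
```

With `k=2`, only `laptop` and `desktop` can ever be drawn; a smaller `k` makes generation more conservative, a larger one more varied.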


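The next step is to wrap the pipeline in LangChain's `HuggingFacePipeline` class so that LLaMA can be used like any other LangChain LLM (running the model through `llama-cpp-python` is an alternative).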

In [9]:
llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={"temperature": 0})

If you are using an OpenAI GPT-based model instead, you can declare the corresponding LLM as `llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.5)`, with `ChatOpenAI` imported from the `langchain_openai` package.

In [10]:
prompt = "Which one would you rather buy, a laptop or a desktop computer?"

In [11]:
print(llm.invoke(prompt))  # calling llm(prompt) directly is deprecated in recent LangChain releases

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Which one would you rather buy, a laptop or a desktop computer?

Laptop
Desktop Computer

Laptop:
Pros:
- Portability: You can take it anywhere with you.
- Convenience: No need to set up a separate workspace.
- Versatility: Can be used for both work and entertainment.

Desktop Computer:
Pros:
- Power: More powerful than a laptop.
- Speed: Faster processing speeds.
- Customization: Can be customized to fit your specific needs.

What are the advantages and disadvantages of each option? 
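
The output above begins by repeating the prompt, because the text-generation pipeline returns the prompt together with the completion by default. One option is to pass `return_full_text=False` to the pipeline; another is to trim the echo afterwards. A minimal sketch of the second approach, where `strip_prompt` is a hypothetical helper and not part of LangChain or transformers:

```python
def strip_prompt(prompt: str, generated: str) -> str:
    """Remove a leading copy of `prompt` from `generated`, if present."""
    if generated.startswith(prompt):
        # Drop the echoed prompt and the whitespace that follows it.
        return generated[len(prompt):].lstrip()
    return generated


question = "Which one would you rather buy, a laptop or a desktop computer?"
raw_output = question + "\n\nLaptop\nDesktop Computer"
print(strip_prompt(question, raw_output))
```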


To give prompts a reusable structure, you can define a `PromptTemplate` with named placeholders that are filled in at call time.

In [12]:
from langchain.prompts import PromptTemplate

In [13]:
prompt_template = PromptTemplate(
    input_variables=["product"],
    template="What one would you rather buy, a {product} or a desktop computer?",
)
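
To see what such a template does before any model is involved, the sketch below mimics the substitution step with the standard library only. It is a simplified stand-in, not LangChain's implementation, and `template_variables` and `fill` are hypothetical helpers; with LangChain itself, `prompt_template.format(product="television")` returns the filled-in string.

```python
from string import Formatter


def template_variables(template: str) -> list:
    """List the placeholder names that appear in a format-style template."""
    return sorted({name for _, name, _, _ in Formatter().parse(template) if name})


def fill(template: str, **values: str) -> str:
    """Substitute values into the template, failing early if one is missing."""
    missing = set(template_variables(template)) - set(values)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")
    return template.format(**values)


template = "What one would you rather buy, a {product} or a desktop computer?"
print(template_variables(template))  # ['product']
print(fill(template, product="television"))
# What one would you rather buy, a television or a desktop computer?
```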

In [16]:
from langchain.chains import LLMChain
llm_chain = LLMChain(prompt=prompt_template, llm=llm)
response = llm_chain.run("television")
print(response)

What one would you rather buy, a television or a desktop computer?

Personally, I would rather buy a desktop computer because I think it offers more value for the money. Here are some reasons why:

1. Desktop computers are generally more powerful than televisions. They have more processing power, more memory, and faster storage, which means they can handle more complex tasks and multitasking better.
2. Desktop computers can be upgraded and customized to meet your specific needs. If you want to add more storage, upgrade your graphics card, or improve your sound system, you can do so without having to buy a whole new computer.
3. Desktop computers are more versatile than televisions. They can be used for a wide range of tasks, from browsing the internet and checking email to playing games, editing videos, and working on documents.
4. Desktop computers are generally cheaper than televisions, especially when you consider the cost of a high-quality television that can rival the performance 
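
Conceptually, the chain used above performs two steps: it fills the template, then passes the resulting prompt to the model. The sketch below shows that composition with a stub in place of the real model; `make_chain` and `fake_llm` are hypothetical names used only for illustration, not LangChain APIs.

```python
from typing import Callable


def make_chain(template: str, llm: Callable[[str], str]) -> Callable[..., str]:
    """Return a function that formats the template and sends the result to `llm`."""

    def run(**variables: str) -> str:
        prompt = template.format(**variables)  # step 1: build the prompt
        return llm(prompt)                     # step 2: call the model

    return run


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes the prompt it received."""
    return f"[model reply to: {prompt}]"


chain = make_chain(
    "What one would you rather buy, a {product} or a desktop computer?",
    fake_llm,
)
print(chain(product="television"))
```

Recent LangChain releases express the same composition as `prompt_template | llm`, invoked with `.invoke({"product": "television"})`.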