### Installation
We first install the libraries required for the different ways of using HuggingFace models with Langchain.

Definitions:
Hugging Face: A leading platform providing pre-trained models and libraries for natural language processing. Renowned for its Transformers library, Hugging Face hosts an extensive collection of pre-trained models that can be fine-tuned for specific NLP tasks.

Langchain: An open-source framework for building applications on top of large language models. Langchain provides abstractions such as prompt templates, chains, and LLM wrappers, so that models from different providers can be used through a common interface.

Advantages of Integration:

1. Unified Interface: Langchain's LLM wrappers expose Hugging Face models through the same interface as other providers, so application code does not depend on how a model is hosted.

2. Extended Functionalities: Hugging Face models can be combined with Langchain components such as prompt templates and chains to build multi-step workflows.

3. Optimized NLP Pipelines: By leveraging the strengths of both libraries, users can construct NLP pipelines that handle a wide array of tasks, from text classification to machine translation.

4. Flexibility in Model Deployment: The same application can run a model locally through a Transformers pipeline, call a hosted model through the Inference API, or load a quantized model through llama.cpp.

In [3]:
!pip install transformers
!pip install sentence-transformers
!pip install bitsandbytes accelerate
!pip install langchain
!pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.60.tar.gz (37.4 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.60-cp310-cp310-manylinux_2_31_x86_64.whl size=2948750 sha2

### Approach 1: HuggingFace Pipeline
Pipelines are an easy way to use models for inference. HuggingFace provides a pipeline wrapper that runs tasks such as text generation and summarization in a single line of code. That line calls pipeline() with the task name, the model, and the tokenizer.

To implement this, we must load the large language model and its tokenizer. Since not everyone has access to A100 or V100 GPUs, we use the free T4 GPU. To run inference with the pipeline, we use the 3-billion-parameter orca-mini LLM with a quantization configuration that reduces the model's memory footprint.
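To see why quantization matters on a small GPU, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone at different precisions. It is only a sketch: the parameter count is rounded to 3 billion, and the estimate ignores activations, the KV cache, and the small overhead of quantization constants.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption for illustration: "3 billion parameters", rounded.
N_PARAMS = 3e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, 4-bit weights take roughly one eighth of the fp32 footprint (about 1.5 GB instead of about 12 GB), which leaves headroom on a T4 for activations and generation.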

In [5]:
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline 
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)



In the code snippet below, we use AutoModelForCausalLM to load the model and AutoTokenizer to load the tokenizer. Once both are loaded, we pass them to the pipeline and set the task to text generation. The pipeline also lets us limit the length of the generated output through max_new_tokens.

In [8]:
model_id = "pankajmathur/orca_mini_3b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
                 model_id,
                 quantization_config=nf4_config
                 )
pipe = pipeline("text-generation", 
               model=model, 
               tokenizer=tokenizer, 
               max_new_tokens=512
               )


`low_cpu_mem_usage` was None, now set to True since model is quantized.



With the pipeline created, the HuggingFacePipeline wrapper class integrates the Transformers model with Langchain. The code snippet below defines the prompt template for the orca model and runs a query.

In [9]:
hf = HuggingFacePipeline(pipeline=pipe)

query = "Where is Atrani?"

prompt = f"""
### System:
You are an AI assistant that follows instruction extremely well. 
Help as much as you can. Please be truthful and give direct answers

### User:
{query}

### Response:
"""

response = hf.predict(prompt)
print(response)




### System:
You are an AI assistant that follows instruction extremely well. 
Help as much as you can. Please be truthful and give direct answers

### User:
Where is Atrani?

### Response:
 I'm sorry, but I couldn't find any location for Atrani. Could you please provide more information or clarify your question?
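As the output above shows, the generated text echoes the whole prompt before the answer. A minimal sketch of stripping the prompt is to keep only what follows the last `### Response:` marker. The helper name `extract_response` is hypothetical and not part of Langchain or Transformers.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(generated_text: str, marker: str = RESPONSE_MARKER) -> str:
    """Return only the text after the last occurrence of the response marker."""
    _, found, tail = generated_text.rpartition(marker)
    # If the marker is missing, fall back to the full text rather than dropping it.
    return tail.strip() if found else generated_text.strip()


sample = (
    "### User:\nWhere is Atrani?\n\n"
    "### Response:\n I'm sorry, but I couldn't find any location for Atrani."
)
print(extract_response(sample))
```

The Transformers text-generation pipeline also accepts `return_full_text=False`, which may make this post-processing unnecessary. How the Langchain wrapper treats the echoed prompt can differ between versions, so it is worth checking the output either way.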


### Gated Models
Some models on the HuggingFace Hub, such as google/gemma-2b-it, are gated: you must accept the model's license on its model page and authenticate with an access token before the weights can be downloaded.

In [None]:
from huggingface_hub import login
login()
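Calling `login()` without arguments prompts for the token interactively. For non-interactive runs, one possible approach is to read the token from an environment variable and prompt only as a fallback. In this sketch, the helper `get_hf_token` and the default variable name `HF_TOKEN` are assumptions chosen for illustration.

```python
import os
from getpass import getpass


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the access token from the environment, prompting only if it is unset."""
    token = os.environ.get(env_var)
    if not token:
        # Fallback for interactive sessions; the input is not echoed.
        token = getpass("HF Token: ")
    return token


# Usage (requires huggingface_hub; not executed here):
# from huggingface_hub import login
# login(token=get_hf_token())
```

Keeping the token out of the notebook source avoids committing it by accident when the notebook is shared.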

In [14]:
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", quantization_config=quantization_config)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

`low_cpu_mem_usage` was None, now set to True since model is quantized.



<bos>Write me a poem about Machine Learning.

Machines, they weave and they learn,
From
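The poem above stops after a few words. The likely cause is that `generate()` was called without a length limit, so it fell back to a short default. A minimal sketch of a helper that sets `max_new_tokens` explicitly follows. The function name `generate_text` and the default of 128 tokens are illustrative choices, not part of the original notebook.

```python
def generate_text(model, tokenizer, text: str, max_new_tokens: int = 128) -> str:
    """Generate a completion with an explicit cap on the number of new tokens."""
    # Move the encoded inputs to whichever device the model was loaded on.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # skip_special_tokens drops markers such as <bos> from the decoded string.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Usage with the model and tokenizer loaded in the cell above (not executed here):
# print(generate_text(model, tokenizer, "Write me a poem about Machine Learning."))
```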


### Approach 2: HuggingFace Hub using Inference API
In approach one, you might have noticed that the pipeline downloads the model and tokenizer files and loads the weights locally. This can be time-consuming when the model is large. This is where the HuggingFace Hub Inference API comes in handy: the model runs on HuggingFace's servers instead. To integrate HuggingFace Hub with Langchain, you need a HuggingFace Access Token.

Steps to get a HuggingFace Access Token:
1. Log in to huggingface.co.
2. Click on your profile icon at the top-right corner, then choose “Settings.”
3. In the left sidebar, navigate to “Access Tokens.”
4. Generate a new access token, assigning it the “write” role.

In [10]:
from langchain.llms import HuggingFaceHub
import os
from getpass import getpass

os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("HF Token:")

HF Token: ·····································


In [17]:
llm = HuggingFaceHub(
    repo_id="google/gemma-2b-it", 
    model_kwargs={"temperature": 0.5, "max_new_tokens": 512}
)

query = "Where is Atrani?"

prompt = f"""
 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 {query}
 </s>
 <|assistant|>
"""

response = llm.predict(prompt)
print(response)


 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 Where is Atrani?
 </s>
 <|assistant|>
 I do not have access to real-time information and cannot provide location details.
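Conceptually, a hosted call like the one above is an authenticated HTTP request carrying the prompt and the generation parameters. The sketch below only builds such a request with the standard library and does not send it. The endpoint pattern, the payload shape, and the helper name `build_inference_request` are assumptions for illustration and may not match what the Langchain wrapper does internally or what the current API expects.

```python
import json
import urllib.request

# Assumption: serverless Inference API endpoint pattern; verify against current docs.
API_URL_TEMPLATE = "https://api-inference.huggingface.co/models/{repo_id}"


def build_inference_request(repo_id: str, prompt: str, token: str, **parameters):
    """Build (but do not send) a POST request for a hosted text-generation call."""
    payload = {"inputs": prompt, "parameters": parameters}
    return urllib.request.Request(
        API_URL_TEMPLATE.format(repo_id=repo_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_inference_request(
    "google/gemma-2b-it",
    "Where is Atrani?",
    token="hf_xxx",  # placeholder, not a real token
    temperature=0.5,
    max_new_tokens=512,
)
print(req.full_url)
print(json.loads(req.data))
# Sending it would be urllib.request.urlopen(req), which needs a valid token and network access.
```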


### Approach 3: LlamaCPP
LlamaCPP allows the use of models packaged in the .gguf file format, which run efficiently in CPU-only and mixed CPU/GPU environments using the llama.cpp library.

To use LlamaCPP, we need a model file whose model_path ends with .gguf. This guide uses a 4-bit quantized zephyr-7b-beta GGUF file (zephyr-7b-beta.Q4.gguf), available from the TheBloke/zephyr-7B-beta-GGUF repository on the HuggingFace Hub. Once the file is downloaded, you can upload it to your drive or any other local storage.
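Since the loader expects a local .gguf file, checking the path before constructing the model can give a clearer error than a failed load. The following is a minimal sketch; `validate_gguf_path` is a hypothetical helper, not part of Langchain or llama-cpp-python.

```python
from pathlib import Path


def validate_gguf_path(model_path: str) -> Path:
    """Check that model_path points to an existing file with a .gguf extension."""
    path = Path(model_path).expanduser()
    if path.suffix.lower() != ".gguf":
        raise ValueError(f"Expected a .gguf file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    return path


# Usage (not executed here; the path is whatever location you saved the file to):
# model_path = str(validate_gguf_path("/path/to/model.gguf"))
```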

### Downloaded Models

In [20]:
#!ls ~/.cache/huggingface/hub

models--google--gemma-2b-it  models--pankajmathur--orca_mini_3b  version.txt


### Downloading HuggingFace Models

In [22]:
#!sudo apt-get install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 51 not upgraded.


In [23]:
!git lfs install

Error: Failed to call git rev-parse --git-dir: exit status 128 
Git LFS initialized.


In [24]:
!git clone https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF

Cloning into 'zephyr-7B-beta-GGUF'...
remote: Enumerating objects: 60, done.
remote: Total 60 (delta 0), reused 0 (delta 0), pack-reused 60
Unpacking objects: 100% (60/60), 19.09 KiB | 1.19 MiB/s, done.
Downloading zephyr-7b-beta.Q2_K.gguf (3.1 GB)
Error downloading object: zephyr-7b-beta.Q2_K.gguf (2b77579): Smudge error: Error downloading zephyr-7b-beta.Q2_K.gguf (2b77579c3145506bc8239390aaee138f7e2b764ab4081e6fa9bfe01a9b531149): cannot write data to tempfile "/kaggle/working/zephyr-7B-beta-GGUF/.git/lfs/incomplete/2b77579c3145506bc8239390aaee138f7e2b764ab4081e6fa9bfe01a9b531149969138544": write /kaggle/working/zephyr-7B-beta-GGUF/.git/lfs/incomplete/2b77579c3145506bc8239390aaee138f7e2b764ab4081e6fa9bfe01a9b531149969138544: no space left on device
Unable to log panic to /kaggle/working/zephyr-7B-beta-GGUF/.git/lfs/logs: mkdir /kaggle/working/zephyr-7B-beta-GGUF/.git/lfs/logs: no space left on device

git-lfs/2.9.2 (GitHub; linux amd64; go 1.13.5)
git version 2.25.1

$ git-lfs f
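The clone above ran out of disk space because `git clone` with Git LFS tries to fetch every file in the repository, and this repository holds several quantization variants of the same model. One alternative is to check the free space first and then download only the single file that is needed. In this sketch the helper names and the `required_gb` figure are assumptions, and `hf_hub_download` is imported lazily so the space check works even without `huggingface_hub` installed.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least required_gb GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


def download_single_gguf(repo_id: str, filename: str, local_dir: str, required_gb: float) -> str:
    """Download one file from a Hub repository after checking available disk space."""
    if not has_free_space(local_dir, required_gb):
        raise OSError(f"Less than {required_gb} GB free under {local_dir}")
    # Imported lazily: only needed when the download is actually performed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)


# Example call (not executed here; required_gb is a rough guess for a 4-bit 7B file):
# path = download_single_gguf(
#     "TheBloke/zephyr-7B-beta-GGUF",
#     "zephyr-7b-beta.Q4_K_M.gguf",
#     local_dir="/kaggle/working",
#     required_gb=5,
# )
```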

In [27]:
!ls ~/.cache/huggingface/transformers/

ls: cannot access '/root/.cache/huggingface/transformers/': No such file or directory


In [26]:
!ls /kaggle/lib/kaggle

gcp.py


In [None]:
from langchain.llms import LlamaCpp


llm_cpp = LlamaCpp(
    streaming=True,
    model_path="/content/drive/MyDrive/LLM_Model/zephyr-7b-beta.Q4_K_M.gguf",
    n_gpu_layers=2,
    n_batch=512,
    temperature=0.75,
    top_p=1,
    verbose=True,
    n_ctx=4096,
)

In [None]:
query = "Who is Elon Musk?"

prompt = f"""
 <|system|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
</s>
 <|user|>
 {query}
 </s>
 <|assistant|>
"""

response = llm_cpp.predict(prompt)
print(response)
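The same `<|system|>` / `<|user|>` / `<|assistant|>` layout is written by hand in more than one cell. A small helper keeps it consistent and avoids stray leading spaces. This is a sketch that mirrors the template used above; `build_zephyr_prompt` is a hypothetical name, and the exact whitespace the model prefers is an assumption.

```python
DEFAULT_SYSTEM = (
    "You are an AI assistant that follows instruction extremely well.\n"
    "Please be truthful and give direct answers"
)


def build_zephyr_prompt(query: str, system: str = DEFAULT_SYSTEM) -> str:
    """Format a system message and a user query in the Zephyr-style chat layout."""
    return (
        f"<|system|>\n{system}</s>\n"
        f"<|user|>\n{query}</s>\n"
        "<|assistant|>\n"
    )


print(build_zephyr_prompt("Who is Elon Musk?"))
```

The resulting string can be passed to `llm_cpp.predict(...)` in place of the handwritten f-string.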


### Conclusion
To conclude, we successfully used open-source HuggingFace models with Langchain. With these approaches, one can avoid paying for OpenAI API credits. This guide focused on using open-source LLMs, a major component of a RAG pipeline.

#### Key Takeaways

* Using HuggingFace’s Transformers pipeline, one can pick any top-performing large language model, such as Llama2 70B, Falcon 180B, or Mistral 7B. The inference script takes only a few lines of code.
* As not everyone can afford A100 or V100 GPUs, HuggingFace provides a free Inference API (authenticated with an Access Token) for running selected models from the HuggingFace Hub. The most commonly preferred models in this case are 7B models.
* LlamaCPP is used when you need to run large language models on the CPU or in mixed CPU/GPU environments. Currently, LlamaCPP supports only gguf model files.
* It is recommended to follow the model's prompt template when calling the predict() method on the user query.

#### Reference
* https://python.langchain.com/docs/integrations/llms/huggingface_hub
* https://python.langchain.com/docs/integrations/llms/huggingface_pipelines
* https://python.langchain.com/docs/integrations/llms/llamacpp

