# Running HF Models through the API and Locally in LangChain

Here we (1) test the Hugging Face Inference API, which returns output from a model running on the HF servers, and (2) learn how to download that model and run it locally on our machines. Both approaches are compatible with LangChain tools.

Inspiration taken from: https://www.youtube.com/watch?v=Kn7SX2Mx_Jk&list=PL8motc6AQftk1Bs42EW45kwYbyJ4jOdiZ&index=6

### Reading material

- [Inference in Large Language Models](https://medium.com/@andrew_johnson_4/understanding-inference-in-large-language-models-f4a4a4a736a5) - Insightful definition of text inference for LLMs (no need to pay for entire article, definition in first paragraphs)
- [Causal Language Modeling](https://huggingface.co/docs/transformers/tasks/language_modeling) - Used for text generation: the model predicts the next token in a sequence of tokens. The other main type of language modeling is masked language modeling.
- [Differences between encoder-only, decoder-only and encoder-decoder models](https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder) - Of these, the encoder-decoder models are the ones usually called sequence-to-sequence (i.e. <i>seq2seq</i>) models

In [11]:
import os
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFaceHub
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

### Let's try the Inference API first: HuggingFaceHub, that is, running a model non-locally

Recall we tested this first [here](https://github.com/jzamalloa1/langchain_learning/blob/main/hf_works_flan2B.ipynb)

This works well for many of the Huggingface hosted models, but doesn't support all models.

In [2]:
os.environ["HUGGINGFACEHUB_API_TOKEN"] = ""
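The cell above leaves the token blank. One way to avoid pasting a secret into the notebook is to read it from the environment. This is a minimal sketch, assuming the token was exported in the shell beforehand; the helper name `load_hf_token` is hypothetical and not part of LangChain.

```python
import os


def load_hf_token(env=None, var_name="HUGGINGFACEHUB_API_TOKEN"):
    """Return the token stored under var_name, or raise if it is missing or empty.

    load_hf_token is a hypothetical helper used only for illustration.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before calling the Inference API")
    return token


# Demo with a fake environment so the sketch runs without a real token.
fake_env = {"HUGGINGFACEHUB_API_TOKEN": "hf_dummy_token"}
print(load_hf_token(fake_env))
```

Calling `load_hf_token()` with no arguments reads from `os.environ`, so the rest of the notebook can stay unchanged.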

Build prompt as we have learned [before](https://github.com/jzamalloa1/langchain_learning/blob/main/agents.ipynb)

In [12]:
template = """
Question: {question}

Answer: Let's think through this step by step
"""

prompt = PromptTemplate(
    input_variables=["question"],
    template=template
)
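To see what the model will actually receive, it helps to fill in the template by hand. The sketch below uses plain `str.format` as a stand-in so it runs without LangChain; `prompt.format(question=...)` should produce the same text, though that equivalence is an assumption about the installed LangChain version.

```python
# Same template text as in the cell above.
template = """
Question: {question}

Answer: Let's think through this step by step
"""

# str.format replaces the {question} placeholder with the user's question.
filled = template.format(question="Where does the oldest cat in the world live?")
print(filled)
```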

Then instantiate the model through HuggingFaceHub and build the LLMChain

In [8]:
llm_model = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={
        "temperature":0.1,
        "max_new_tokens":256,
        "verbose":"True" # Not sure if this is needed
    }
)

llm_chain = LLMChain(
    llm=llm_model,
    prompt=prompt,
    verbose=True
)

In [10]:
llm_chain.run("Where does the oldest cat in the world live?")

> Entering new LLMChain chain...
Prompt after formatting:

Question: Where does the oldest cat in the world live?

Answer: Let's think through this step by step


> Finished chain.


'The oldest cat in the world is a female named Snowball. Snowball lives in the United States. The United States is a country in North America. So, the answer is the United States.'

You can sort of see that it is showing its train of thought. We can try something a bit more complex below:

In [11]:
print(llm_chain.run("What is the coldest month in the US?"))

> Entering new LLMChain chain...
Prompt after formatting:

Question: What is the coldest month in the US?

Answer: Let's think through this step by step


> Finished chain.
The coldest month in the US is January. The average temperature in January is -2 degrees celsius. So, the answer is January.


We saw above how easily the HuggingFaceHub class lets us call models hosted remotely. Let's try another model to see where limitations might exist.

In [13]:
blender_llm_model = HuggingFaceHub(
    repo_id="facebook/blenderbot-1B-distill",
    model_kwargs={
        "temperature":0.1,
        "max_new_tokens":256,
        "verbose":"True" # Not sure if this is needed
    }
)

blenderbot_chain = LLMChain(
    llm=blender_llm_model,
    prompt=prompt,
    verbose=True
)



ValidationError: 1 validation error for HuggingFaceHub
__root__
  Got invalid task conversational, currently only ('text2text-generation', 'text-generation', 'summarization') are supported (type=value_error)

Notice the error above: <u>currently only ('text2text-generation', 'text-generation', 'summarization') are supported</u>. BlenderBot is a [conversational AI chatbot model](https://huggingface.co/docs/transformers/model_doc/blenderbot), which aims to converse with the user. Its task, `conversational`, is not supported by HuggingFaceHub, unlike text2text generation and text generation.
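A quick check of the task before building the chain can surface this limitation earlier. The sketch below copies the supported task names from the error message, so it only reflects the LangChain version used here. The helper names are hypothetical, and `lookup_task` assumes the `huggingface_hub` package and network access.

```python
# Task names copied from the ValidationError message above.
SUPPORTED_TASKS = ("text2text-generation", "text-generation", "summarization")


def is_supported_task(task):
    """True if the task appears in the list reported by the error message."""
    return task in SUPPORTED_TASKS


def lookup_task(repo_id):
    """Fetch a model's task tag from the Hub (needs huggingface_hub and network)."""
    from huggingface_hub import model_info  # lazy import so the rest runs offline

    return model_info(repo_id).pipeline_tag


print(is_supported_task("text2text-generation"))
print(is_supported_task("conversational"))
```

With network access, `is_supported_task(lookup_task("facebook/blenderbot-1B-distill"))` would combine the two steps.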

### Running models locally

Some of the advantages of running models locally are: (1) fine-tuning models on your own data, (2) using your own GPU and (3) running models that cannot be run through the API (like the model above).
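Whether a local GPU is actually available depends on the machine. The sketch below is one way to check, assuming PyTorch is installed; the helper name `pick_device` is hypothetical, and the CPU fallback when PyTorch is missing is a choice made for this example.

```python
def pick_device():
    """Return "cuda", "mps" or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption: fall back to CPU when PyTorch is not installed
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon GPU
    return "cpu"


print(pick_device())
```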

The HuggingFacePipeline class will allow us to run LLMs locally.

In [1]:
from langchain.llms import HuggingFacePipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM



Let's load a model locally. We are going to use an encoder-decoder model: Flan-T5.

The AutoModelForSeq2SeqLM loader loads the seq2seq model, which in this example is the hybrid encoder-decoder model.

In [3]:
model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

Downloading model.safetensors: 100%|██████████| 3.13G/3.13G [01:53<00:00, 27.5MB/s]
Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at google/flan-t5-large and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Downloading (…)neration_config.json: 100%|██████████| 147/147 [00:00<00:00, 474kB/s]


Then we set up the <b>pipeline</b> to specify the <u>task</u> (see [pipeline documentation](https://huggingface.co/docs/transformers/pipeline_tutorial) for various tasks), the model and the tokenizer. The task here: <u>text2text-generation is for encoder-decoder usage</u>

In [4]:
pipe = pipeline(
    task = "text2text-generation", #explore tasks in link above
    model=model,
    tokenizer=tokenizer,
    max_length=100
)

We are ready to wrap it in HuggingFacePipeline and use it from LangChain as before

In [5]:
llm_model = HuggingFacePipeline(pipeline=pipe) # Ready to be integrated into LangChain

In [9]:
print(llm_model("What is the capital of Germany?"))

berlin


We can integrate it into LangChain as we have [previously](https://github.com/jzamalloa1/langchain_learning/blob/main/tools_and_chains.ipynb)

In [14]:
llm_chain = LLMChain(
    prompt=prompt,
    llm=llm_model,
    verbose=True
)

llm_chain.predict(question="What is the capital of Germany?")

> Entering new LLMChain chain...
Prompt after formatting:

Question: What is the capital of Germany?

Answer: Let's think through this step by step


> Finished chain.


'The capital of Germany is Berlin. Berlin is located in Germany. So, the answer is Berlin.'

Let's test a decoder-only model, like GPT-2. This needs to be set up slightly differently. Given that this is a <u>decoder model we'll use the "text-generation" task</u>

In [18]:
model_id = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    task = "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=500
)
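The two local setups differ only in the loader class and the pipeline task. The sketch below captures that pairing for the two architectures used in this notebook; it does not cover encoder-only models. The helper names are hypothetical, and `model_is_encoder_decoder` assumes `transformers` is installed and the model config can be downloaded.

```python
def choose_loader_and_task(is_encoder_decoder):
    """Map the architecture to the (loader class name, pipeline task) used above."""
    if is_encoder_decoder:
        # e.g. Flan-T5
        return "AutoModelForSeq2SeqLM", "text2text-generation"
    # e.g. GPT-2 (decoder-only)
    return "AutoModelForCausalLM", "text-generation"


def model_is_encoder_decoder(model_id):
    """Read the flag from the model config (needs transformers and network)."""
    from transformers import AutoConfig  # lazy import so the rest runs offline

    return bool(AutoConfig.from_pretrained(model_id).is_encoder_decoder)


print(choose_loader_and_task(True))
print(choose_loader_and_task(False))
```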

In [19]:
llm_model = HuggingFacePipeline(pipeline=pipe)

llm_chain = LLMChain(
    llm=llm_model,
    prompt=prompt,
    verbose=True
)

In [20]:
print(llm_chain.predict(question="What is the capital of Germany?"))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

> Entering new LLMChain chain...
Prompt after formatting:

Question: What is the capital of Germany?

Answer: Let's think through this step by step


> Finished chain.

First of all,

We must look at where this capital came from [1,2].

This capital came from France. And this was a real financial capital which was part of the French political economy…

It started out in a kind of financial crisis [revised version below]. A financial crisis of course meant a major bank failure. For the first time the French nationalised part of the banks. But they didn't try to sell on a huge debt pile – and that was the beginning of the financial markets collapsing.

…but France then didn't try to sell off the debt pile. This was because there really wasn't this huge commercial debt pile that they had [revised version below]: their debts were around 1.24% of GDP – and this meant that they were under debt-servicing.

[revised version below] – and this meant that they w

...not exactly a great model (it is several years old now), but it fulfills the purpose of being tested locally.
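Both the `pad_token_id` message and the long, rambling answer relate to generation settings. `max_length=500` counts the prompt tokens as well as the generated ones, while `max_new_tokens` limits only the continuation. The sketch below builds a small set of keyword arguments under the assumption that the `text-generation` pipeline forwards them to the model's generation call. The helper name and the default of 64 tokens are illustrative choices, not values from the original notebook.

```python
def generation_kwargs(eos_token_id, max_new_tokens=64):
    """Build generation settings for a decoder-only model such as GPT-2.

    Hypothetical helper: pad_token_id is set explicitly to the end-of-sequence
    id, and max_new_tokens caps only the newly generated tokens.
    """
    return {"max_new_tokens": max_new_tokens, "pad_token_id": eos_token_id}


# 50256 is the eos_token_id reported in the GPT-2 message above.
kwargs = generation_kwargs(eos_token_id=50256)
print(kwargs)

# Intended use (not executed here; requires the model and tokenizer loaded earlier):
# pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, **kwargs)
```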