# __Please run the provided demo notebook file in Google Colab to explore the hands-on example.__

# **Run the Falcon LLM Locally**

# **Description:**
This project walks you through setting up a local Falcon LLM with LangChain’s PromptTemplate and ConversationChain.

# **Steps to Perform:**
1. Set up the Environment
2. Download the Falcon 7B Model and Tokenizer from Hugging Face
3. Set up the LLM and Generation Configuration
4. Build the Conversation Chain
5. Modify the Prompt Template to Define a Specific Conversational Style
6. Manage Conversation History with ConversationBufferWindowMemory
7. Interact with the LLM


# **Step 1: Set up the Environment**

In [None]:
# Install the libraries if they are not already installed (package names are separated by spaces, not commas)
#!pip install bitsandbytes langchain torch transformers accelerate xformers einops

Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.42.0
Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.26.1
Collecting xformers
  Downloading xformers-0.0.23.post1-cp310-cp310-manylinux2014_x86_64.whl (213.0 MB)
Collecting torch==2.1.2 (from xformers)
  Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)

In [None]:
import re
import warnings
from typing import List

import torch
from langchain import PromptTemplate
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.llms import HuggingFacePipeline
from langchain.schema import BaseOutputParser
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
    pipeline,
)
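
Library APIs change between releases, so it can help to confirm which versions are actually installed before going further. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is illustrative and not part of any of these packages.

```python
from importlib import metadata

# Packages listed in the install command for this notebook
PACKAGES = ["bitsandbytes", "langchain", "torch", "transformers",
            "accelerate", "xformers", "einops"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed can be added with the pip command from the first cell.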

# **Step 2: Download the Falcon 7B Model and Tokenizer from Hugging Face**

In [None]:
MODEL_NAME = "tiiuae/falcon-7b-instruct"

# Load the model with 8-bit weights and let device_map="auto" choose device placement
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, trust_remote_code=True, load_in_8bit=True, device_map="auto"
)
model.eval()  # switch to inference mode

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"Model device: {model.device}")


A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b-instruct:
- configuration_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b-instruct:
- modeling_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Model device: cuda:0
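
Passing `load_in_8bit=True` stores the weights as 8-bit integers, which roughly halves the memory needed compared with 16-bit weights. A back-of-the-envelope estimate follows, assuming about 7 billion parameters and counting the weights only (activations, quantization overhead, and any layers kept in higher precision are ignored). The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: "7B" taken at face value

for bits in (32, 16, 8):
    print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

The real footprint will be somewhat higher, so treat these figures as a lower bound when picking a GPU.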


# **Step 3: Set up the LLM and Generation Configuration**
Configure the LLM for inference using the following Python code:

In [None]:

generation_config = model.generation_config
# Use greedy decoding for deterministic responses
# (temperature only applies when sampling, where it must be strictly positive)
generation_config.do_sample = False
# Return a single sequence per prompt
generation_config.num_return_sequences = 1
# Limit each response to 256 new tokens
generation_config.max_new_tokens = 256
# Disable the key/value cache (saves memory at the cost of slower generation)
generation_config.use_cache = False
# Penalize tokens that were already generated to reduce repetition
generation_config.repetition_penalty = 1.7
# Use the EOS token for both padding and end-of-sequence
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
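
To make the `repetition_penalty` setting concrete, here is a small pure-Python sketch of the commonly used rule: the score of every token that has already appeared is divided by the penalty if it is positive and multiplied by it if it is negative, after which greedy decoding picks the highest score. This is an illustration with made-up logits, not the library's actual code; `apply_repetition_penalty` and `greedy_pick` are hypothetical helper names.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of `logits` with already-seen tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score, or multiplying a negative one,
        # both push the score downward when penalty > 1.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def greedy_pick(logits):
    """Greedy decoding: choose the index of the largest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 3.4, -1.0, 2.5]  # made-up scores for a 4-token vocabulary
seen = [1, 2]                   # tokens 1 and 2 were already generated

penalized = apply_repetition_penalty(logits, seen, penalty=1.7)
print(greedy_pick(logits))     # choice without the penalty
print(greedy_pick(penalized))  # choice with the penalty
```

With the penalty applied, token 1 loses its lead and token 3 is chosen instead, which is how a high value such as 1.7 steers the model away from repeating itself.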

# **Step 4: Build the Conversation Chain**


*   Define a custom prompt template that sets the context for the conversation.
*   Create the ConversationChain object using the following Python code:





In [None]:

initial_prompt = """
The following is a conversation between a human and an AI. The AI is knowledgeable and provides detailed answers.

Current conversation:

Human: What is the theory of relativity?
AI:
""".strip()

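# Wrap the model and tokenizer in a text-generation pipeline.
# Note: max_new_tokens=10 passed here takes precedence over the 256 set in Step 3.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10)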

# Create the HuggingFacePipeline object
llm_pipeline = HuggingFacePipeline(pipeline=pipe)

# Create the ConversationChain object
dialogue_chain = ConversationChain(llm=llm_pipeline)

# Print the initial prompt template
print(dialogue_chain.prompt.template)

The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:


# **Step 5: Modify the Prompt Template to Define a Specific Conversational Style**

In [None]:
new_template = """
The following is a conversation between a human and an AI. The AI behaves like Albert Einstein, providing detailed explanations about physics.

Current conversation:
{history}
Human: {input}
AI:""".strip()

# Create a new PromptTemplate object
prompt = PromptTemplate(input_variables=["history", "input"], template=new_template)

# Print the new prompt template
print(new_template)


The following is a conversation between a human and an AI. The AI behaves like Albert Einstein, providing detailed explanations about physics.

Current conversation:
{history}
Human: {input}
AI:
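
The `{history}` and `{input}` placeholders are filled in each time the chain runs. As a rough illustration of that substitution, plain `str.format` produces the same kind of text. The history lines below are invented for the example.

```python
template = """
The following is a conversation between a human and an AI. The AI behaves like Albert Einstein, providing detailed explanations about physics.

Current conversation:
{history}
Human: {input}
AI:""".strip()

# Invented example history; in the chain this text comes from the memory object.
history = "Human: Hello!\nAI: Greetings, my friend."

filled = template.format(history=history, input="What is the theory of relativity?")
print(filled)
```

The filled-in text ends with `AI:`, which cues the model to continue the conversation as the AI speaker.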


# **Step 6: Manage Conversation History with ConversationBufferWindowMemory**

In [None]:
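# Keep only the 6 most recent exchanges in the prompt history
memory = ConversationBufferWindowMemory(
    memory_key="history", k=6, return_only_outputs=True
)

# Build the chain from the pipeline-backed LLM created in Step 4, the window memory, and the custom prompt
chain = ConversationChain(llm=llm_pipeline, memory=memory, prompt=prompt, verbose=True)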
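
With `k=6`, only the six most recent exchanges are fed back into the `{history}` slot, which keeps the prompt from growing without bound. The idea can be sketched with a fixed-length `deque`. The `WindowMemory` class below is a hypothetical stand-in for illustration, not LangChain's implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the last `k` (human, ai) exchanges."""

    def __init__(self, k: int):
        self.turns = deque(maxlen=k)  # older exchanges are dropped automatically

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def history(self) -> str:
        """Format the retained exchanges the way the prompt expects them."""
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


memory_sketch = WindowMemory(k=6)
for i in range(1, 9):  # eight exchanges, two more than the window holds
    memory_sketch.save(f"question {i}", f"answer {i}")

print(memory_sketch.history())
```

After eight exchanges the first two have been dropped, so the printed history starts at the third question.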

# **Step 7: Interact with the LLM**
* Provide an input prompt to initiate the conversation and start interacting with the LLM.
* Observe the chain’s output and continue the dialogue by providing further input.

In [None]:
text = "Think of a name for automaker that builds family cars with big V8 engines. The name must be a single word and easy to pronounce."
res = chain.predict(input=text)
print(res)

> Entering new ConversationChain chain...
Prompt after formatting:
The following is a conversation between a human and an AI. The AI behaves like Albert Einstein, providing detailed explanations about physics.

Current conversation:

H: Think of a name for automaker that builds family cars with big V8 engines. The name must be a single word and easy to pronounce.
AI:

> Finished chain.
 Tesla.
User 
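
The raw completion above ends with a stray `User` line, a sign that the model started writing the next speaker's turn. One common remedy is to cut the text at the first such marker. Here is a minimal sketch, assuming `User` and `Human` are the markers worth trimming (`clean_response` is a hypothetical helper, not part of LangChain):

```python
import re

# Assumed speaker markers; adjust to whatever the model tends to emit.
SPEAKER_MARKER = re.compile(r"^\s*(User|Human)\b", flags=re.MULTILINE)


def clean_response(text: str) -> str:
    """Drop everything from the first speaker marker onward and trim whitespace."""
    match = SPEAKER_MARKER.search(text)
    if match:
        text = text[: match.start()]
    return text.strip()


print(clean_response(" Tesla.\nUser "))
```

Post-processing like this only hides the extra text after the fact. Stopping generation at the marker would also avoid spending tokens on it.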


# **Conclusion**
This tutorial provides a basic understanding of running a local Falcon LLM using LangChain’s PromptTemplate and ConversationChain.
