In [1]:
!pip install langchain==0.2.5 langchain-community==0.2.5 langchain-core==0.2.9 langchain-openai==0.1.9 bitsandbytes accelerate xformers triton transformers

Collecting langchain==0.2.5
  Downloading langchain-0.2.5-py3-none-any.whl.metadata (7.0 kB)
Collecting langchain-community==0.2.5
  Downloading langchain_community-0.2.5-py3-none-any.whl.metadata (2.5 kB)
Collecting langchain-core==0.2.9
  Downloading langchain_core-0.2.9-py3-none-any.whl.metadata (6.0 kB)
Collecting langchain-openai==0.1.9
  Downloading langchain_openai-0.1.9-py3-none-any.whl.metadata (2.5 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting xformers
  Downloading xformers-0.0.28.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting triton
  Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain==0.2.5)
  Downloading langchain_text_splitters-0.2.4-py3-none-any.whl.metadata (2.3 kB)
Collecting tenacity<9.0.0,>=8.1.0 (from langchain==0.2.5)
  Downloading tenacity-8.5.

In [5]:
import torch
import transformers
from torch import cuda
from transformers import StoppingCriteria, StoppingCriteriaList
# langchain.llms is a deprecated re-export; the class lives in langchain_community
from langchain_community.llms import HuggingFacePipeline

### 1. Device Configuration

- Check whether a CUDA-enabled GPU is available. If so, set the device to the current CUDA device; otherwise, fall back to the CPU.

In [6]:
device = f"cuda:{cuda.current_device()}" if cuda.is_available() else 'cpu'

### 2. Model and Token IDs Initialization

- Initialize the model name and a variable that will hold the stop token IDs.

In [7]:
stop_token_ids = None
model_name = "meta-llama/Llama-2-13b-chat-hf"

### 3. Creating the Tokenizer

- Define the create_tokenizer function.
- Load the tokenizer using the Hugging Face AutoTokenizer.
- Define a stop list of strings that indicate the end of text generation.
- Convert these stop strings to their corresponding token IDs.
- Move these token IDs to the appropriate device (GPU or CPU).
- The credential_init helper in the next cell loads the Hugging Face token that the tokenizer download needs.


In [8]:
import os
import configparser


def credential_init():
    """
    Initializes and sets environment variables for API keys from a configuration file.

    This function reads a configuration file named 'credentials.ini' located in the current working directory.
    It extracts API keys for different services (OpenAI, SERPER, TAVILY, and Hugging Face) and sets them as environment variables.

    The configuration file should have the following structure:

    [openai]
    api_key = your_openai_api_key

    [SERPER_API_KEY]
    api_key = your_serper_api_key

    [TAVILY_API_KEY]
    api_key = your_tavily_api_key

    [HuggingFace_API_KEY]
    api_key = your_huggingface_api_key

    Raises:
        KeyError: If any of the required sections or keys are missing in the configuration file.
        FileNotFoundError: If the 'credentials.ini' file is not found in the current working directory.

    Example:
        To use this function, simply call it at the beginning of your script:

        credential_init()

        This will set the necessary environment variables for the APIs to be used later in your code.

    """

    credential_file = "credentials.ini"

    credentials = configparser.ConfigParser()
    # ConfigParser.read() silently skips missing files and returns the list of files it parsed,
    # so check the result in order to raise the documented FileNotFoundError.
    if not credentials.read(credential_file):
        raise FileNotFoundError(f"Could not find '{credential_file}'")

    # Index with ['api_key'] instead of .get() so that a missing key raises the documented KeyError.
    os.environ['OPENAI_API_KEY'] = credentials['openai']['api_key']
    os.environ['SERPER_API_KEY'] = credentials['SERPER_API_KEY']['api_key']
    os.environ['TAVILY_API_KEY'] = credentials['TAVILY_API_KEY']['api_key']
    os.environ['HuggingFace_API_KEY'] = credentials['HuggingFace_API_KEY']['api_key']

In [15]:
credential_init()

In [9]:
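def create_tokenizer():

    global stop_token_ids

    # `token` replaces the deprecated `use_auth_token` argument in recent transformers releases
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, token=os.environ['HuggingFace_API_KEY'])
    stop_list = ['\nHuman:', '\n```\n']

    # add_special_tokens=False keeps the BOS token (<s>) out of the stop sequences;
    # with it prepended, a stop sequence would practically never match the tail of the generated IDs
    stop_token_ids = [tokenizer(x, add_special_tokens=False)['input_ids'] for x in stop_list]
    stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]

    return tokenizer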

### 4. Stopping Criteria Class

- Define the StopOnTokens class.
- This class inherits from StoppingCriteria and overrides the __call__ method.
- It checks whether the most recently generated tokens match any of the stop sequences. If a match is found, it returns True to stop the generation; otherwise, it returns False.

In [10]:
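class StopOnTokens(StoppingCriteria):

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:

        global stop_token_ids

        # Debug output: prints the token IDs and decoded text at every generation step.
        # Relies on the global `tokenizer`, which must exist before the pipeline is called.
        print(f"input_ids: {input_ids}")
        print(f"content: {tokenizer.decode(input_ids[0])}")
        for stop_ids in stop_token_ids:
            # Compare the tail of the sequence with each stop sequence;
            # skip stop sequences longer than what has been generated so far.
            if len(input_ids[0]) >= len(stop_ids) and torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False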
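
The suffix comparison that StopOnTokens performs can be sketched without torch, using plain Python lists. The helper name ends_with_stop and the token IDs below are made up for illustration; they are not real Llama 2 token IDs.

```python
def ends_with_stop(generated_ids, stop_sequences):
    """Return True if generated_ids ends with any of the stop sequences."""
    for stop_ids in stop_sequences:
        n = len(stop_ids)
        # An empty stop sequence, or one longer than the generated text, cannot match.
        if n and len(generated_ids) >= n and generated_ids[-n:] == stop_ids:
            return True
    return False


# Hypothetical token IDs, chosen only for illustration.
stop_sequences = [[13, 500, 501], [13, 900]]
print(ends_with_stop([1, 42, 43, 13, 900], stop_sequences))  # tail matches [13, 900]
print(ends_with_stop([1, 42, 13], stop_sequences))           # no stop sequence at the tail
```

The length check matters in the tensor version too: comparing a short sequence against a longer stop sequence would otherwise fail on mismatched shapes.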

### What is 4-bit Quantization?

Quantization in the context of deep learning is the process of constraining the number of bits that represent the weights and biases of the model.

Weights and biases are the model's learned parameters, the numbers that backpropagation updates during training.

In 4-bit quantization, each weight or bias is represented using only 4 bits as opposed to the typical 32 bits used in single-precision floating-point format (float32).
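
To make the idea concrete, here is a toy sketch of 4-bit quantization in plain Python. It uses simple uniform "absmax" scaling, which is an assumption made for readability and is not the NF4 scheme that bitsandbytes applies; the function names and example weights are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to signed 4-bit integer codes in [-7, 7] using one shared scale."""
    largest = max(abs(v) for v in values)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]  # made-up example weights
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
print(codes)     # small integers that fit in 4 bits
print(restored)  # close to, but not exactly, the original weights
```

Each weight is stored as a small integer plus one shared scale, so the restored values are approximations: the rounding error per weight is at most half the scale.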

### Why does it use less GPU Memory?

The primary advantage of using 4-bit quantization is the reduction in model size and memory usage. Here's a simple explanation:

A float32 number takes up 32 bits of memory.

A 4-bit quantized number takes up only 4 bits of memory.

So, theoretically, you can fit 8 times more 4-bit quantized numbers into the same memory space as float32 numbers. This allows you to load larger models into the GPU memory or use smaller GPUs that might not have been able to handle the model otherwise.
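
As a rough worked example, the arithmetic above can be applied to a model of this size. The sketch assumes a round figure of 13 billion parameters and counts the weights only; activations, the key/value cache, and the quantization constants are ignored, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 13e9  # assumed round figure for a 13B-parameter model
for name, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(num_params, bits):.1f} GB")
# float32: 52.0 GB
# float16: 26.0 GB
# 4-bit: 6.5 GB
```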

### 6. Define Configuration Variables

- bnb_4bit is a flag indicating whether to use 4-bit quantization.

In [12]:
bnb_4bit = True

### 7. Bits and Bytes Configuration

- If using 4-bit quantization, configure the BitsAndBytesConfig with the following settings:
- load_in_4bit=True: Enables 4-bit loading.
- bnb_4bit_quant_type="nf4": Sets the quantization type to NF4 (4-bit NormalFloat).
- bnb_4bit_use_double_quant=True: Enables double quantization, which also quantizes the quantization constants to save additional memory.
- bnb_4bit_compute_dtype=torch.bfloat16: Sets the data type used for computation to bfloat16.

In [13]:
bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    )

### 8. Model Configuration

- Load the model configuration using the model name and authentication token:

In [16]:
model_config = transformers.AutoConfig.from_pretrained(
    model_name,
    token=os.environ['HuggingFace_API_KEY']  # `token` replaces the deprecated `use_auth_token`
)




### 9. Load the Model

- Depending on the bnb_4bit flag, load the model with or without quantization.
- If bnb_4bit is False, load the model normally with AutoModelForCausalLM.from_pretrained and set it to evaluation mode.
- If bnb_4bit is True, load the model with the quantization configuration and set it to evaluation mode.

In [18]:
if not bnb_4bit:
    # Load the model without quantization
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        config=model_config,
        device_map='auto',
        token=os.environ['HuggingFace_API_KEY']
    )
    model.eval()
    # device_map='auto' already places the weights, so no explicit model.to(device) call is needed

else:
    # Load the model with the 4-bit quantization configuration
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        config=model_config,
        quantization_config=bnb_config,
        device_map='auto',
        token=os.environ['HuggingFace_API_KEY']
    )
    model.eval()




In [22]:
stopping_criteria = StoppingCriteriaList([StopOnTokens()])

tokenizer = create_tokenizer()

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    stopping_criteria=stopping_criteria,  # stop when a stop sequence is generated
    temperature=0.2,  # 'randomness' of outputs; lower is more deterministic, must be > 0 when sampling
    do_sample=True,  # sample from the distribution instead of greedy decoding
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.2,  # without this output begins repeating
    top_p=0.5  # nucleus sampling threshold
)
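
The effect of the temperature setting can be illustrated with a small, self-contained sketch. It assumes the common formulation in which logits are divided by the temperature before the softmax; the function name and the logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A low temperature such as 0.2 concentrates almost all of the probability on the top-scoring token, which is why the outputs become less random.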




In [23]:
llm = HuggingFacePipeline(pipeline=generate_text)



In [24]:
generate_text("Hi, how are you today?")

input_ids: tensor([[    1,  6324, 29892,   920,   526,   366,  9826, 29973,    13]],
       device='cuda:0')
content: <s> Hi, how are you today?

input_ids: tensor([[    1,  6324, 29892,   920,   526,   366,  9826, 29973,    13, 29902]],
       device='cuda:0')
content: <s> Hi, how are you today?
I
input_ids: tensor([[    1,  6324, 29892,   920,   526,   366,  9826, 29973,    13, 29902,
         29915]], device='cuda:0')
content: <s> Hi, how are you today?
I'
input_ids: tensor([[    1,  6324, 29892,   920,   526,   366,  9826, 29973,    13, 29902,
         29915, 29885]], device='cuda:0')
content: <s> Hi, how are you today?
I'm
input_ids: tensor([[    1,  6324, 29892,   920,   526,   366,  9826, 29973,    13, 29902,
         29915, 29885,  2599]], device='cuda:0')
content: <s> Hi, how are you today?
I'm doing
input_ids: tensor([[    1,  6324, 29892,   920,   526,   366,  9826, 29973,    13, 29902,
         29915, 29885,  2599,  1532]], device='cuda:0')
content: <s> Hi, how are you toda

[{'generated_text': "Hi, how are you today?\nI'm doing well, thank you for asking! How about you?\nThat's great to hear! I was just wondering if you could help me with something.\nOf course, what do you need help with?\nWell, I've been trying to learn this new programming language and I'm having a bit of trouble understanding some of the concepts. Do you think we could talk about it sometime soon? Maybe over coffee or lunch?\nSure thing! I'd be happy to help you out and answer any questions you have. Let me know when works best for you and we can set up a time that suits both of us."}]

In [None]:
llm.invoke("Hi, how are you today?")

## Extra Knowledge of the transformer pipeline

If you are interested, feel free to explore these resources on your own.

- https://www.cnblogs.com/xiximayou/p/17353352.html
- https://transformers.run/c2/2021-12-08-transformers-note-1/

In [None]:
from IPython.display import IFrame

IFrame("https://transformers.run/c2/2021-12-08-transformers-note-1/", width=800, height=400)

In [None]:
IFrame("https://huggingface.co/docs/transformers/main_classes/text_generation", width=800, height=400)

## Parameter Documentation

https://huggingface.co/docs/transformers/main_classes/text_generation

## Key parameters

### top_p

- Definition: If set to a float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
- Edge case reported in https://github.com/huggingface/transformers/issues/18976: top-p sampling kept an extra token when the cumulative probability was exactly equal to top_p. For example, with input probabilities [0.3, 0.1, 0.1, 0.5] and top_p = 0.8, only the two tokens with probabilities 0.5 and 0.3 should be kept, because their sum is exactly 0.8. The author of the issue argues that this is the expected behavior under the definition above.
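
The definition can be sketched in a few lines of plain Python. This is an illustration of the definition only, not the transformers implementation; the function name top_p_keep is hypothetical, and exact fractions are used so that the boundary case from the issue is not blurred by floating-point rounding.

```python
from fractions import Fraction


def top_p_keep(probs, top_p):
    """Return indices of the smallest set of most probable tokens whose
    cumulative probability reaches top_p, most probable first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # stop as soon as the threshold is reached
            break
    return kept


# Probabilities [0.3, 0.1, 0.1, 0.5] written as exact fractions.
probs = [Fraction(3, 10), Fraction(1, 10), Fraction(1, 10), Fraction(5, 10)]
print(top_p_keep(probs, Fraction(8, 10)))  # → [3, 0]
print(top_p_keep(probs, Fraction(9, 10)))  # → [3, 0, 1]
```

With top_p = 0.8 the loop stops after the tokens with probabilities 0.5 and 0.3, matching the behavior the issue describes as expected.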

### do_sample

- (bool, optional, defaults to False): Whether or not to use sampling; use greedy decoding otherwise.
- Greedy decoding is the simplest strategy for choosing the next token in a sequence generated by a language model. At each step, it selects the token with the highest probability as predicted by the model.
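
The difference between the two modes can be shown with a toy next-token distribution. The helper names and the probabilities below are made up for illustration; the real library works on tensors of logits.

```python
import random


def greedy_pick(probs):
    """Greedy decoding: always choose the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample_pick(probs, rng):
    """Sampling: draw an index at random, weighted by the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.7, 0.2]  # made-up next-token distribution
rng = random.Random(0)   # seeded so the run is repeatable
print(greedy_pick(probs))                            # always index 1
print([sample_pick(probs, rng) for _ in range(10)])  # indices drawn in proportion to probs
```

Greedy decoding returns the same token every time, while sampling can return any token with non-zero probability.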

### repetition_penalty
- (float, optional, defaults to 1.0): The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
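
A minimal sketch of how such a penalty is commonly applied is shown below. It assumes the widely described rule of dividing positive scores and multiplying negative scores for tokens that have already appeared; the function name and the logits are hypothetical, and the library's tensor-based implementation may differ in detail.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty):
    """Discourage tokens that already appeared by lowering their scores."""
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Positive scores are divided and negative scores multiplied,
        # so the token becomes less likely in both cases when penalty > 1.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [2.0, -1.0, 0.5]  # made-up scores for a three-token vocabulary
print(apply_repetition_penalty(logits, previous_token_ids=[0, 1, 1], penalty=1.2))
```

With a penalty of 1.0 the scores are unchanged, which matches the "no penalty" default.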

### top_k
- (int, optional, defaults to 50): The number of highest probability vocabulary tokens to keep for top-k filtering.
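
Top-k filtering can be sketched in plain Python as follows. The function names are hypothetical and the probabilities are illustrative; this shows the idea rather than the library's implementation.

```python
def top_k_keep(probs, k):
    """Return the indices of the k most probable tokens, most probable first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def renormalize(probs, kept):
    """Zero out everything outside `kept` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


probs = [0.3, 0.1, 0.1, 0.5]  # illustrative next-token probabilities
kept = top_k_keep(probs, k=2)
print(kept)                      # → [3, 0]
print(renormalize(probs, kept))  # only the two kept tokens have non-zero probability
```

Unlike top_p, which keeps a variable number of tokens depending on how the probability mass is spread, top_k always keeps a fixed number.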