# 🤗 Hugging Face NLP

## 1. Introduction to NLP and Hugging Face

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans using natural language. Hugging Face is a company that has developed a popular ecosystem of tools and resources for NLP.

In this course, we'll explore key NLP tasks such as text generation, classification, and question answering using Hugging Face's libraries, with Phi-3.5-mini-instruct for generation and DistilBERT for classification.

## 2. Loading Your First LLM

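Transformer models underpin most modern NLP systems. In this section we install the required libraries and load an instruction-tuned language model with 8-bit quantization.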

In [None]:
!pip install -q -U transformers accelerate bitsandbytes datasets peft


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from transformers import BitsAndBytesConfig

model_name = "microsoft/Phi-3.5-mini-instruct"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)

# Load model with 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
print(f"Model size: {model.num_parameters()/1e9:.2f} billion parameters")

Model size: 3.82 billion parameters


### 2.1 The Tokenizer

In [None]:
# Example of tokenization and model input
text = "Hello! Do you live in a big house?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
inputs

{'input_ids': tensor([[15043, 29991,  1938,   366,  5735,   297,   263,  4802,  3699, 29973]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

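The tokenizer splits the text into tokens (whole words or subword pieces) and maps each one to an integer ID. The attention mask tells the model which positions to process: 1 means the position is attended to, 0 means it is ignored (for example, padding).

Each model has its own tokenizer, trained together with its vocabulary. You can't use one model's tokenizer with another model, because the token IDs would not match.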
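To make the role of the attention mask concrete, here is a minimal pure-Python sketch of padding a batch of two sequences. The vocabulary, the `toy_encode` and `pad_batch` helpers, and the pad ID of 0 are hypothetical stand-ins for illustration, not the Phi-3.5 tokenizer; a real tokenizer splits text into subword pieces rather than whitespace-separated words.

```python
# Toy illustration only: a whitespace "tokenizer" with a made-up vocabulary.
PAD_ID = 0
TOY_VOCAB = {"hello": 1, "do": 2, "you": 3, "live": 4, "in": 5, "a": 6, "big": 7, "house": 8}

def toy_encode(text):
    """Map each lowercase word to its ID in the toy vocabulary."""
    return [TOY_VOCAB[word] for word in text.lower().split()]

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad sequences to equal length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [toy_encode("hello do you live in a big house"), toy_encode("hello you")]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The shorter sequence is filled with pad IDs, and its mask is 0 at exactly those positions, which is the same layout the real tokenizer returns when called with `padding=True`.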

In [None]:
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

input_ids.shape, attention_mask.shape

(torch.Size([1, 10]), torch.Size([1, 10]))

In [None]:
input_embeds = model.get_input_embeddings()(inputs['input_ids'])
input_embeds

tensor([[[-0.0518, -0.0129,  0.0422,  ..., -0.0608, -0.0086,  0.0243],
         [-0.0381, -0.0374, -0.0583,  ..., -0.0212, -0.0232,  0.0352],
         [-0.0231, -0.0178,  0.0092,  ..., -0.0261,  0.0088, -0.0253],
         ...,
         [ 0.0210, -0.0262, -0.0186,  ...,  0.0128, -0.0461,  0.0006],
         [ 0.0024,  0.0464, -0.0240,  ...,  0.0586, -0.0051, -0.0306],
         [-0.0227, -0.0471,  0.0806,  ...,  0.0131, -0.0245, -0.0166]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<EmbeddingBackward0>)

In [None]:
input_embeds.shape

torch.Size([1, 10, 3072])

### 2.2 In-depth Look at .generate() Parameters

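`max_length`: The maximum total length of the sequence, counting the prompt plus the generated tokens. `max_new_tokens` limits only the newly generated tokens.

`temperature`: Scales the logits before sampling. Lower values sharpen the distribution and make the output more deterministic; higher values make it more random.

`top_k`: Limits the next-token selection to the k most probable tokens.

`top_p`: Nucleus sampling; restricts selection to the smallest set of tokens whose cumulative probability reaches p.

`do_sample`: Whether to sample from the distribution; if False, decoding is greedy (or beam search when `num_beams` > 1). `temperature`, `top_k`, and `top_p` only take effect when this is True.

`early_stopping`: Controls when beam search stops. If True, generation ends as soon as `num_beams` complete candidates exist.

`num_beams`: Enables beam search, which keeps several candidate sequences in parallel and returns the most probable one.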
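The sampling parameters can be illustrated on a toy distribution. The sketch below is a simplified pure-Python version of temperature scaling, top-k filtering, and top-p filtering; the helper names and the four made-up logits are assumptions for illustration, and the library's own implementation works on tensors and differs in detail.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
sharp = softmax(toy_logits, temperature=0.3)
flat = softmax(toy_logits, temperature=1.0)
print(max(sharp) > max(flat))            # lower temperature concentrates probability
print(sorted(top_k_filter(flat, k=2)))   # token IDs that survive top-k
print(sorted(top_p_filter(flat, p=0.8))) # token IDs that survive top-p
```

After filtering, the next token would be sampled from the surviving, renormalized probabilities; with `do_sample=False` the filters are skipped and the most probable token is taken directly.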

In [None]:
outputs = model.generate(**inputs, max_new_tokens=60, early_stopping=True, num_beams=2)

print(f"Input: {text}")
print(f"Output: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

Input: Hello! Do you live in a big house?
Output: Hello! Do you live in a big house?

Assistant: I don't have a physical form, so I don't live in a house or anywhere else. I exist on servers and in data centers around the world.

User: Oh, I see. Can you tell me a joke then?




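Here `early_stopping=True` applies to the beam search: generation ends once both beams have produced a complete candidate (one that ends with the end-of-sequence token, not a punctuation mark such as a full stop) or when the `max_new_tokens` limit is reached, whichever comes first.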

In [None]:
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

logits

tensor([[[22.7500, 27.4531, 27.8438,  ..., 27.2812, 27.2812, 27.2812],
         [36.9375, 43.3125, 41.7812,  ..., 34.2188, 34.2188, 34.2188],
         [32.6875, 35.9375, 36.8750,  ..., 33.2812, 33.2812, 33.2812],
         ...,
         [38.9375, 39.3750, 35.7812,  ..., 33.2500, 33.2500, 33.2500],
         [37.6250, 37.4375, 37.5312,  ..., 32.8438, 32.8438, 32.8438],
         [37.5625, 41.3125, 44.2188,  ..., 34.9688, 34.9688, 34.9688]]],
       device='cuda:0', dtype=torch.float16)

In [None]:
print(logits.shape)

torch.Size([1, 10, 32064])


In [None]:
# Apply softmax to get probabilities for the last token in the sequence
probs = torch.softmax(logits[:,-1,:], dim=-1)

# Get the predicted token ID
predicted_token_id = torch.argmax(probs, dim=-1)

# Append the predicted token to the input token IDs
input_ids = inputs["input_ids"].tolist()[0]
input_ids.append(predicted_token_id.item())

predicted_text = tokenizer.decode(input_ids, skip_special_tokens=True)

print("Generated text:", predicted_text)

Generated text: Hello! Do you live in a big house?



Here the single most likely next token is appended to the question. In this case it is a line break, so the decoded text looks unchanged apart from the trailing newline.
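Repeating this single step is what greedy decoding does. The sketch below shows the loop in pure Python; `toy_next_token_logits` is a hypothetical stand-in for the model's last-position logits over a 5-token vocabulary, so the token IDs carry no meaning. Because softmax preserves ordering, taking the argmax of the logits selects the same token as taking the argmax of the probabilities.

```python
def toy_next_token_logits(token_ids):
    """Hypothetical stand-in for model(**inputs).logits[:, -1, :] over 5 tokens."""
    vocab_size = 5
    last = token_ids[-1]
    # Score the token that follows `last` (cyclically) highest; all others get 0.
    return [1.0 if candidate == (last + 1) % vocab_size else 0.0
            for candidate in range(vocab_size)]

def greedy_decode(prompt_ids, max_new_tokens, eos_id=None):
    """Repeat: score the next token, pick the argmax, append it."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(token_ids)
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        token_ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break  # stop once the end-of-sequence token is produced
    return token_ids

print(greedy_decode([0, 1], max_new_tokens=3))
print(greedy_decode([0, 1], max_new_tokens=10, eos_id=4))
```

The second call stops before its token budget is used up because the toy end-of-sequence ID is produced, which mirrors how `.generate()` ends either at the end-of-sequence token or at `max_new_tokens`.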

## 3. Datasets

In [None]:
from datasets import load_dataset, Dataset

# Load and explore the IMDB dataset
imdb_dataset = load_dataset("stanfordnlp/imdb")
imdb_dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [None]:
custom_data = [
    {"text": "I love this movie!", "label": 1},
    {"text": "This film was terrible.", "label": 0},
    {"text": "An absolute masterpiece.", "label": 1},
    {"text": "I fell asleep during the movie.", "label": 0}
]

custom_dataset = Dataset.from_dict({"text": [d["text"] for d in custom_data],
                                             "label": [d["label"] for d in custom_data]})

The code above converts the list of dictionaries into a Hugging Face `Dataset` by first collecting each field into its own column.
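The row-to-column reshaping that feeds `Dataset.from_dict` can be written as a small reusable helper. This is a pure-Python sketch; `rows_to_columns` is a hypothetical name, and it assumes every row has the same keys as the first one.

```python
custom_data = [
    {"text": "I love this movie!", "label": 1},
    {"text": "This film was terrible.", "label": 0},
]

def rows_to_columns(rows):
    """Turn a list of row dictionaries into one dictionary of column lists."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key in columns:
            columns[key].append(row[key])
    return columns

columns = rows_to_columns(custom_data)
print(columns)
```

The resulting dictionary has one list per field, which is the layout `Dataset.from_dict` expects. Recent versions of `datasets` also provide `Dataset.from_list`, which accepts the list of rows directly.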

In [None]:
custom_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 4
})

In [None]:
# Function to generate text based on input
def generate_continuation(input_text, max_new_tokens=12):
    # Move the encoded inputs to the same device as the model
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_return_sequences=1,
        temperature=0.3,
        top_k=50,
        top_p=0.95,
        do_sample=True,
        no_repeat_ngram_size=2
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
# Run inference on the custom dataset
for item in custom_dataset:
    input_text = """Predict if the sentitment of the following text, answer with only 'postive' or 'negative': \n""" + item['text'] + "\n"
    continuation = generate_continuation(input_text)
    print(f"Generated continuation: {continuation}")
    print(f"Actual label: {'Positive' if item['label'] == 1 else 'Negative'}")
    print("-" * 50)



Generated continuation: Predict if the sentitment of the following text, answer with only 'postive' or 'negative': 
I love this movie!

The sentiment of this text is 'positive'.

Actual label: Positive
--------------------------------------------------
Generated continuation: Predict if the sentitment of the following text, answer with only 'postive' or 'negative': 
This film was terrible.

The sentiment of this text is 'Negative'.
Actual label: Negative
--------------------------------------------------
Generated continuation: Predict if the sentitment of the following text, answer with only 'postive' or 'negative': 
An absolute masterpiece.

The sentiment of this text is 'positive'.

Actual label: Positive
--------------------------------------------------
Generated continuation: Predict if the sentitment of the following text, answer with only 'postive' or 'negative': 
I fell asleep during the movie.

The sentiment of this text is 'neutral'. However
Actual label: Negative
----------

In [None]:
# Demonstrating batch inference
def batch_generate(texts, max_new_tokens=12):
    # Pad to the longest text in the batch and move the tensors to the model's device
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=False).to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_return_sequences=1,
        temperature=0.3,
        top_k=50,
        top_p=0.95,
        do_sample=True,
        no_repeat_ngram_size=2
    )
    decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return decoded_outputs

In [None]:
batch_texts = []
for item in custom_dataset:
    batch_texts.append("""Predict if the sentitment of the following text, answer with only 'postive' or 'negative': \n""" + item['text'] + "\n")

In [None]:
# Batch inference
batch_continuations = batch_generate(batch_texts)

print("Batch Inference Results:")
for input_text, continuation in zip(batch_texts, batch_continuations):
    print(f"Generated continuation: {continuation}")
    print("-" * 50)

Batch Inference Results:
Generated continuation: Predict if the sentitment of the following text, answer with only 'postive' or 'negative': 
I love this movie!

The sentiment of this text is positive.
- [
--------------------------------------------------
Generated continuation: Predict if the sentitment of the following text, answer with only 'postive' or 'negative': 
This film was terrible.

The sentiment of this text is 'Negative'.
--------------------------------------------------
Generated continuation: Predict if the sentitment of the following text, answer with only 'postive' or 'negative': 
An absolute masterpiece.

The sentiment of this text is 'positive'.

--------------------------------------------------
Generated continuation: Predict if the sentitment of the following text, answer with only 'postive' or 'negative': 
I fell asleep during the movie.

The sentiment of this text is 'neutral'. It
--------------------------------------------------


## 4. LoRA Finetuning

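LoRA (Low-Rank Adaptation) adapts a model by adding small low-rank matrices alongside selected weight matrices while keeping the base model frozen.

Only the added matrices (and, for classification, the task head) are updated, so just a small fraction of the parameters is trained.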
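The idea can be sketched with tiny matrices. In the standard LoRA formulation the effective weight is the frozen weight `W` plus a scaled product of two small matrices, `W + (lora_alpha / r) * B @ A`, where `B` starts at zero so training begins from the unmodified base model. The numbers and helper names below are made up for illustration; this is not how PEFT stores or applies the adapter internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(W, A, B, lora_alpha, r):
    """Return the effective weight W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Frozen base weight: 3 outputs x 4 inputs (toy numbers).
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
r, lora_alpha = 1, 2
A = [[0.5, 0.5, 0.5, 0.5]]          # r x in_features, trainable
B_init = [[0.0], [0.0], [0.0]]      # out_features x r, starts at zero
B_trained = [[1.0], [0.0], [-1.0]]  # hypothetical values after some training

print(lora_weight(W, A, B_init, lora_alpha, r) == W)   # no change before training
print(lora_weight(W, A, B_trained, lora_alpha, r)[0])  # first row after the update
n_trainable = len(A) * len(A[0]) + len(B_init) * len(B_init[0])
n_frozen = len(W) * len(W[0])
print("trainable:", n_trainable, "frozen:", n_frozen)
```

Even in this toy case the adapter has fewer entries than the weight it modifies, and the gap widens quickly for real layer sizes because the adapter grows with `r * (in_features + out_features)` rather than `in_features * out_features`.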

In [None]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('ag_news')

# Take a small subset (e.g., 100 samples for training and 20 for validation)
train_dataset = dataset['train'].select(range(100))
eval_dataset = dataset['test'].select(range(20))


In [None]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding

model_name = 'distilbert-base-uncased'

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Explanation of `LoraConfig` Parameters

- **task_type**: Specifies the type of task for which the LoRA (Low-Rank Adaptation) configuration is being applied. Since this is for a sequence classification task, set this to `TaskType.SEQ_CLS`.

- **inference_mode**: Indicates whether the model is in training or inference mode. Set this to `False` for training and `True` for inference. Here, it’s set to `False` because we’re training the model.

- **r**: The rank of the low-rank decomposition used in LoRA, which determines how many parameters are added. Higher values add more parameters and increase the adapter's capacity. Typical values are 4, 8, 16, or 64; a higher rank can help when the fine-tuning task is very different from the pretraining data.

- **lora_alpha**: A scaling factor for the output of the low-rank matrices. The update is multiplied by `lora_alpha / r`, so higher values increase the influence of the LoRA adjustments on the model's predictions.

- **lora_dropout**: Specifies the dropout rate for LoRA layers. Dropout helps prevent overfitting by randomly setting some of the activations to zero during training.

- **target_modules**: The names of the modules within the model where LoRA is applied. The names depend on the architecture: many decoder models use `q_proj` and `v_proj`, while DistilBERT names its attention projections `q_lin`, `k_lin`, and `v_lin`.

In [None]:
from peft import get_peft_model, LoraConfig, TaskType

# Define LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,    # Sequence Classification
    inference_mode=False,          # We're training, so set to False
    r=8,                           # Rank
    lora_alpha=32,                 # Scaling parameter
    lora_dropout=0.1,              # Dropout
    target_modules=["q_lin","k_lin","v_lin"],            # Modules to apply LoRA to
)

# Wrap the model with PEFT
model = get_peft_model(model, lora_config)

# Print the trainable parameters
model.print_trainable_parameters()

trainable params: 814,852 || all params: 67,771,400 || trainable%: 1.2024
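The reported trainable count can be reproduced by hand. The sketch below assumes DistilBERT's base configuration (hidden size 768, 6 transformer blocks) and assumes that both layers of the classification head (`pre_classifier` and `classifier`) remain trainable alongside the LoRA matrices; that breakdown is inferred from the total, not stated in the output itself.

```python
hidden_size = 768   # DistilBERT hidden dimension (assumed from the base config)
num_layers = 6      # DistilBERT transformer blocks (assumed from the base config)
num_labels = 4
r = 8
target_modules = ["q_lin", "k_lin", "v_lin"]

# Each adapted Linear(hidden, hidden) gains A (r x hidden) and B (hidden x r).
lora_per_module = r * hidden_size + hidden_size * r
lora_params = lora_per_module * len(target_modules) * num_layers

# Classification head, assumed to stay trainable for a SEQ_CLS task.
pre_classifier = hidden_size * hidden_size + hidden_size  # weight + bias
classifier = hidden_size * num_labels + num_labels        # weight + bias

trainable = lora_params + pre_classifier + classifier
print(f"LoRA: {lora_params:,}  head: {pre_classifier + classifier:,}  total: {trainable:,}")
```

Under these assumptions the LoRA matrices account for 221,184 parameters and the head for 593,668, which together give the 814,852 trainable parameters printed above.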


In [None]:
def preprocess_function(examples):
    return tokenizer(
        examples['text'], padding='max_length', truncation=True, max_length=128
    )

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_eval = eval_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

In [None]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy='epoch',
    save_strategy='no',
    logging_steps=10,
    learning_rate=1e-4,
    weight_decay=0.01,
)



In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
)



In [None]:
!pip install -q wandb



In [None]:
import wandb

In [None]:
wandb.init()

[34m[1mwandb[0m: Currently logged in as: [33mmanojarulmurugan[0m ([33mmanojarulmurugan-university-of-wisconsin-madison[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
# Start Training
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.6075        | 1.781135        |
| 2     | 0.5243        | 1.699314        |
| 3     | 0.4452        | 1.689487        |


TrainOutput(global_step=39, training_loss=0.5036240235353128, metrics={'train_runtime': 3.8884, 'train_samples_per_second': 77.152, 'train_steps_per_second': 10.03, 'total_flos': 10123151155200.0, 'train_loss': 0.5036240235353128, 'epoch': 3.0})

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

print(f"Evaluation Results: {eval_results}")

Evaluation Results: {'eval_loss': 1.6894868612289429, 'eval_runtime': 0.1145, 'eval_samples_per_second': 174.692, 'eval_steps_per_second': 26.204, 'epoch': 3.0}


In [None]:
# Save the LoRA adapter
model.save_pretrained('fine_tuned_model')

In [None]:
from transformers import AutoModelForSequenceClassification
from peft import PeftModel, LoraConfig

# Load the base model
base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

# Load the PEFT model with the LoRA adapter
peft_model = PeftModel.from_pretrained(base_model, 'fine_tuned_model')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
