# Working with Hugging Face - Part 4

## Fine-tuning and Embeddings

Explore the different frameworks for fine-tuning, text generation, and embeddings. Start with the basics of fine-tuning a pre-trained model on a specific dataset and task to improve performance. Then, use Auto classes to generate text from prompts and images. Finally, explore how to generate and use embeddings.

### Preparing a dataset
Fine-tuning a model requires several steps: identifying the model to fine-tune, preparing the dataset, creating the trainer object that runs the training loop, and then saving the model.

A model trained on English text classification has been identified for you, but it's up to you to prepare the imdb dataset in order to fine-tune this model to classify the sentiment of movie reviews.

The imdb dataset is already loaded for you and saved as dataset.

In [None]:
# Import modules
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("stanfordnlp/imdb")

# Use tokenizer on text
dataset = dataset.map(lambda row: tokenizer(row["text"], padding='max_length', max_length=512, truncation=True), 
                      keep_in_memory=True)
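
The padding='max_length' and truncation=True arguments force every review to the same length. The sketch below is a simplified illustration of that idea in plain Python, not the tokenizer's actual implementation: pad_or_truncate and pad_id are hypothetical names, the token ids are made up, and a real tokenizer also preserves special tokens when it truncates.

```python
# Simplified illustration of padding="max_length" combined with truncation=True.
# pad_or_truncate and pad_id are hypothetical names, not part of any library.
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length ids plus a mask marking real tokens (1) and padding (0)."""
    kept = token_ids[:max_length]      # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)     # padding needed to reach max_length
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# A short sequence gets padded; a long one gets cut.
short = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long = pad_or_truncate(list(range(10)), max_length=6)
print(short)
print(long)
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside the ids.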

### Building the trainer
To fine-tune a model, it must be trained on new data. This is the process of the model learning patterns within a training dataset, then evaluating how well it can predict patterns in an unseen test dataset. The goal is to help the model build an understanding of patterns while also being generalizable to new data yet to be seen.

Build a training object to fine-tune the "distilbert-base-uncased-finetuned-sst-2-english" model to be better at identifying sentiment of movie reviews.

The training_data and testing_data datasets are available for you. Trainer and TrainingArguments from transformers are also loaded; they were modified for the purpose of this exercise.

In [None]:
from transformers import Trainer, TrainingArguments

# Create training arguments
training_args = TrainingArguments(output_dir="./results")

training_data = dataset['train']
testing_data = dataset['test']

# Create the trainer
trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=training_data, 
    eval_dataset=testing_data
)

# Start the trainer
trainer.train()
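
Evaluating on unseen data usually comes down to comparing the model's predicted class with the true label. The sketch below shows one way to compute accuracy from raw logits in plain Python. The function name accuracy_from_logits and the toy numbers are assumptions for illustration; a version for the Trainer's compute_metrics argument would need adapting to the arrays it receives.

```python
# Hypothetical helper: accuracy from per-class scores (logits).
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the true label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    # argmax per row: index of the largest logit
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four reviews; columns are [negative, positive].
toy_logits = [[2.1, -1.0], [-0.3, 0.9], [0.2, 0.1], [-1.5, 2.2]]
toy_labels = [0, 1, 1, 1]
score = accuracy_from_logits(toy_logits, toy_labels)
print(score)  # 3 of 4 predictions match, so 0.75
```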

In [None]:
# Save model
local_path = "./fine_tuned_model"
trainer.save_model(local_path)

### Using the fine-tuned model
Now that the model is fine-tuned, it can be used within pipeline tasks, such as sentiment analysis. At this point, the model is typically saved to a local directory (that is, on your own computer), so a local file path is needed.

You'll use the newly fine-tuned distilbert model. There is a sentence, "I am a HUGE fan of romantic comedies.", saved as text_example.

Note: this exercise uses a course-specific pipeline module for teaching purposes. The model is not actually saved; it is only treated as if it were stored under the path ./fine_tuned_model.

In [None]:
from transformers import pipeline

text_example = "I am a HUGE fan of romantic comedies."

# Create the classifier
classifier = pipeline(task="sentiment-analysis", model=local_path)

# Classify the text
results = classifier(text_example)

print(results)
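
A sentiment pipeline typically returns a list of dictionaries with a label and a score. The sketch below shows, under stated assumptions, how two raw logits could be turned into that shape with a softmax. The helper names, the logits, and the id2label mapping are illustrative and are not taken from the exercise's output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def to_pipeline_style(logits, id2label):
    """Mimic the [{'label': ..., 'score': ...}] shape of a classification result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": id2label[best], "score": probs[best]}]


# Assumed label mapping and made-up logits for a positive review.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
toy_result = to_pipeline_style([-2.0, 3.0], id2label)
print(toy_result)
```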

### Generating text from a text prompt
Generating text can be accomplished using Auto classes from the Hugging Face transformers library. It can be a useful method for developing content, such as technical documentation or creative material.

You'll walk through the steps to process the text prompt, "Wear sunglasses when its sunny because", then generate new text from it.

AutoTokenizer and AutoModelForCausalLM from the transformers library are already loaded for you.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set model name
model_name = "gpt2"

# Get the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Wear sunglasses when its sunny because"

# Tokenize the input (returns both input_ids and attention_mask)
inputs = tokenizer(prompt, return_tensors="pt")

# Generate the text output; setting pad_token_id explicitly avoids the pad-token warning
output = model.generate(**inputs, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

# Decode the output
generated_text = tokenizer.decode(output[0])

print(generated_text)


Wear sunglasses when its sunny because it's a hot day.

The best way to get
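
The output stops mid-sentence because generation is length-limited. When no sampling or beam search options are set, generation generally proceeds greedily: at each step the most likely next token is appended. The sketch below illustrates that loop with a toy lookup table standing in for the model. TOY_MODEL, greedy_generate, and all probabilities are hypothetical; a real model conditions on the whole context, not only the last token.

```python
# Hypothetical next-token probabilities keyed by the previous token.
TOY_MODEL = {
    "because": {"it": 0.6, "the": 0.4},
    "it": {"is": 0.7, "was": 0.3},
    "is": {"bright": 0.5, "hot": 0.3, "<eos>": 0.2},
    "bright": {"<eos>": 0.9, "outside": 0.1},
}


def greedy_generate(prompt_tokens, max_new_tokens=5, eos="<eos>"):
    """Append the most probable next token until eos or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = TOY_MODEL.get(tokens[-1])
        if not candidates:          # no known continuation
            break
        next_token = max(candidates, key=candidates.get)
        if next_token == eos:       # stop at end-of-sequence
            break
        tokens.append(next_token)
    return tokens


generated = greedy_generate(["sunny", "because"])
print(" ".join(generated))  # sunny because it is bright
```

Lowering max_new_tokens in this sketch cuts the text short in the same way a length limit does in a real generation call.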


### Generating a caption for an image
Text can also be generated from modalities other than text, such as images. One benefit is faster content creation, for example by generating captions automatically from images.

You'll create a caption for a fashion image using the Microsoft GIT model ("microsoft/git-base-coco").

AutoProcessor and AutoModelForCausalLM from the transformers library are already loaded for you, along with the image.

In [6]:
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

image = Image.open('fashion.jpeg')

# Get the processor and model
processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

# Process the image
pixels = processor(images=image, return_tensors="pt").pixel_values

# Generate the ids
output = model.generate(pixel_values=pixels)

# Decode the output
caption = processor.batch_decode(output)

print(caption[0])





[CLS] a woman wearing a black sweater and gray pants. [SEP]
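
The decoded caption still contains the [CLS] and [SEP] markers. Tokenizer decode methods in transformers accept skip_special_tokens=True, which is normally the preferred way to drop them. As a minimal sketch of the same clean-up in plain Python, the hypothetical helper below removes an assumed set of marker strings from an already decoded caption.

```python
# Hypothetical post-processing helper; the token list is an assumption.
def strip_special_tokens(text, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Remove marker tokens and collapse the leftover whitespace."""
    for token in special_tokens:
        text = text.replace(token, " ")
    return " ".join(text.split())


raw_caption = "[CLS] a woman wearing a black sweater and gray pants. [SEP]"
clean_caption = strip_special_tokens(raw_caption)
print(clean_caption)  # a woman wearing a black sweater and gray pants.
```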


### Generate embeddings for a sentence
Embeddings are playing an increasingly big role in ML and AI systems. A common use case is embedding text to support search.

The sentence-transformers package from Hugging Face is useful for getting started with embedding models. You'll compare the embedding shapes produced by two different models - "all-MiniLM-L6-v2" and "sentence-transformers/paraphrase-albert-small-v2". Comparing them can help determine which model is better suited for a project (e.g. because of storage constraints).

The sentence used for embedding, "Programmers, do you put your comments (before|after) the related code?", is saved as sentence.

SentenceTransformer from the sentence-transformers package was already loaded for you.

In [13]:
from sentence_transformers import SentenceTransformer

sentence = "Programmers, do you put your comments (before|after) the related code?"

# Create the first embedding model
embedder1 = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the sentence
embedding1 = embedder1.encode([sentence])

# Create and use second embedding model
embedder2 = SentenceTransformer("sentence-transformers/paraphrase-albert-small-v2")
embedding2 = embedder2.encode([sentence])
 
# Compare the shapes
print(embedding1.shape == embedding2.shape)



False
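
The shapes differ because the two models produce vectors of different lengths. The sketch below gives a rough estimate of what that means for storage, assuming float32 values and embedding dimensions of 384 and 768. Those dimensions are assumptions here; printing embedding1.shape and embedding2.shape confirms the actual values. The function name is hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def embedding_storage_mb(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Approximate raw storage in megabytes (10^6 bytes), ignoring index overhead."""
    return num_vectors * dim * bytes_per_value / 1_000_000


# One million sentences at two assumed embedding dimensions.
size_384 = embedding_storage_mb(1_000_000, 384)
size_768 = embedding_storage_mb(1_000_000, 768)
print(size_384, size_768)  # 1536.0 3072.0
```

Doubling the dimension doubles the raw storage, which can matter when a corpus holds millions of sentences.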


### Using semantic search
Semantic search is built on the similarity, or closeness, between a query and a set of sentences or documents. Unlike keyword matching, it takes the context and intent of the query into account. A similarity measure, such as cosine similarity, quantifies how close the query embedding is to each sentence embedding in the vector space, and the search returns the sentences closest to the query.
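
As a minimal sketch of the similarity measure itself, the function below computes cosine similarity for two plain Python lists. The function name and the toy vectors are illustrative; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, two vectors of different magnitude that point the same way still score close to 1.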

You will use semantic search to return the top two Reddit threads relevant to the user query, "I need a desktop book reader for Mac".

The embedder and sentence_embeddings are already loaded for you along with util.semantic_search().

In [20]:
from sentence_transformers import SentenceTransformer, util

sentences = ['Programmers, do you put your comments (before|after) the related code?',
 'How sure are we that there were never any intelligent dinosaurs?',
 'Can anyone suggest a desktop book reader for Mac that works similar to Stanza on the iPhone?',
 'I will be in Lima, Ohio Monday night/tuesday on business. What is there to do, and see in the area?',
 "I'm looking for a good quality headset that doesn't cost too much. Any recommendations?",
 'How do I get a list of all the duplicate items using LINQ?',
 "Please help me figure out why it's so tough for me to connect to Valve games. It's driving me insane.",
 "Is there such a thing as 'good' instant coffee?",
 'How do I get the distinct/unique values in a column in Excel?']

query = "I need a desktop book reader for Mac"

embedder = SentenceTransformer("all-MiniLM-L6-v2")
sentence_embeddings = embedder.encode(sentences)

# Generate embeddings
query_embedding = embedder.encode([query])[0]

# Compare embeddings
hits = util.semantic_search(query_embedding, sentence_embeddings, top_k=2)

# Print the top results
for hit in hits[0]:
    print(sentences[hit["corpus_id"]], "(Score: {:.4f})".format(hit["score"]))

Can anyone suggest a desktop book reader for Mac that works similar to Stanza on the iPhone? (Score: 0.8011)
I'm looking for a good quality headset that doesn't cost too much. Any recommendations? (Score: 0.1437)
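
The gap between the two scores shows that only the first result is a close match. To make the ranking step concrete, the sketch below mimics the result shape used above (a list of dictionaries with corpus_id and score) in plain Python. It normalizes each vector and takes dot products, which is equivalent to cosine similarity. The function names and three-dimensional vectors are hypothetical, and this is not the library's implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_search(query_vec, corpus_vecs, top_k=2):
    """Score every corpus vector against the query and keep the best top_k."""
    q = normalize(query_vec)
    hits = []
    for corpus_id, vec in enumerate(corpus_vecs):
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        hits.append({"corpus_id": corpus_id, "score": score})
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]


# Made-up 3-dimensional "embeddings" for three documents.
toy_corpus = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.6, 0.1]]
toy_hits = top_k_search([1.0, 0.0, 0.0], toy_corpus, top_k=2)
for hit in toy_hits:
    print(hit["corpus_id"], "(Score: {:.4f})".format(hit["score"]))
```

This brute-force scan compares the query against every document, which is fine for a handful of sentences; large corpora usually rely on an approximate nearest-neighbor index instead.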
