# CS 195: Natural Language Processing
## Transfer Learning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F7_1_TransferLearning.ipynb)

## Reference

Hugging Face NLP Course Chapter 1: Transformer Models https://huggingface.co/learn/nlp-course/chapter1/1

Hugging Face NLP Course Chapter 3: Fine-tuning a model with the Trainer API or Keras https://huggingface.co/learn/nlp-course/chapter3/1

Hugging Face NLP Course Chapter 7, Section 5: Summarization https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf

In [None]:
import sys
!{sys.executable} -m pip install --no-cache-dir datasets transformers keras tensorflow sentencepiece



## Transfer Learning

**Transfer Learning** is the process of taking a model that was trained (**pre-trained**) on one task and then **fine-tuning** it for another task.

Today we're going to practice fine-tuning a pre-trained **transformer** model - we'll cover transformers in more detail next week, but they work a lot like the other neural network models we've looked at so far.

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/pretraining.svg?raw=1" width=700>
    <br />
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/finetuning.svg?raw=1" width=700>
</div>

image source: https://huggingface.co/learn/nlp-course/chapter1/4?fw=tf
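
To make the pre-train-then-fine-tune idea concrete, here is a minimal sketch on a toy problem rather than a transformer. It fits a line `y = w*x + b` with plain gradient descent, first on plenty of data for one task and then on three examples of a related task. All names (`train`, `mse`, the data) are made up for illustration; the point is only that starting from pre-trained weights with a small learning rate beats starting from scratch under the same small budget.

```python
def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(w, b, data, lr, steps):
    """Plain gradient descent on mean squared error for y = w*x + b."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# "Pre-training": plenty of data for a general task (y = 2x).
pretrain_data = [(x / 10, 2 * x / 10) for x in range(-20, 21)]
w_pre, b_pre = train(0.0, 0.0, pretrain_data, lr=0.1, steps=200)

# "Fine-tuning": only a few examples of a related task (y = 2x + 0.5),
# a small learning rate, and very few steps.
finetune_data = [(-1.0, -1.5), (0.0, 0.5), (1.0, 2.5)]
w_ft, b_ft = train(w_pre, b_pre, finetune_data, lr=0.01, steps=20)
w_scratch, b_scratch = train(0.0, 0.0, finetune_data, lr=0.01, steps=20)

print("fine-tuned loss:  ", mse(w_ft, b_ft, finetune_data))
print("from-scratch loss:", mse(w_scratch, b_scratch, finetune_data))
```

The fine-tuned model already has the right slope, so it only has to learn the new offset.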

## Common pre-trained models

There are a variety of pre-trained models out there
* usually *very large*
* pretrained on *massive amounts of data*

<div>
    <img src="https://github.com/ericmanley/f23-CS195NLP/blob/main/images/model_parameters.png?raw=1" width=800>
</div>

**Encoders:** BERT, ALBERT, DistilBERT, ELECTRA, RoBERTa
* Usually trained on masked input - model tries to predict the missing word in a sequence


**Decoders:** CTRL, GPT, GPT-2, Transformer XL
* Neural language models - usually trying to predict the next word in a sequence

**Encoder-Decoder Models:** BART, mBART, Marian, T5
* full sequence-to-sequence models
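
The two pre-training objectives above can be sketched by looking at what the training pairs look like. This is a simplified illustration with hypothetical helper names (`masked_example`, `next_word_examples`); real models such as BERT mask a random fraction of tokens and work on subword ids, not whole words.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

def masked_example(tokens, position, mask_token="<mask>"):
    """Encoder-style objective: hide one token and predict it from both sides."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target

def next_word_examples(tokens):
    """Decoder-style objective: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(masked_example(tokens, 2))
for context, target in next_word_examples(tokens)[:3]:
    print(context, "->", target)
```

An encoder-decoder model combines both pieces: the encoder reads the whole input, and the decoder produces the output one token at a time.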


## Working Example

We're going to work through our text-to-emoji example, fine-tuning a variant of T5.

### Load and filter our dataset just like before

In [None]:
from datasets import load_dataset


# Define a function to check if 'text' is not None
def is_not_none(example):
    return example['text'] is not None

dataset = load_dataset("KomeijiForce/Text2Emoji",split="train")

# Filter the dataset
dataset = dataset.filter(is_not_none)
dataset

Dataset({
    features: ['text', 'emoji', 'topic'],
    num_rows: 503682
})

### choosing a sample to work with

Even the smaller transformer models take too long to train on the full dataset during class.

Let's choose a small sample to work with instead.

In [None]:
# Shuffle the dataset
shuffled_dataset = dataset.shuffle(seed=42)

# Select a small sample
sample_size = 1000  # Define your sample size
sample_dataset = shuffled_dataset.select(range(sample_size))

#if you want to use the entire dataset just uncomment the following
#sample_dataset = shuffled_dataset

### Train/test split

Hugging Face datasets actually include a `train_test_split` function for splitting into training and testing sets if you don't already have them split.

In [None]:
dataset_split = sample_dataset.train_test_split(test_size=0.2)
dataset_split

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic'],
        num_rows: 200
    })
})

### Reminder of what the data looks like

In [None]:
print(dataset_split["train"]["text"][46])
print(dataset_split["train"]["emoji"][46])

Italy, the land of exquisite tusani delights and passionate bella donna. Indulge in delicious pasta that melts your mouth and sip upon velvety cappuccinos in picturesque terrazas.
🇮🇹🍝🍷🍮🍴🌇💋💃


### The Tokenizer

Since we will be starting from an existing model, we need to prepare our data the same way the data was prepared when that model was trained.

**T5:** an encoder-decoder Transformer architecture suitable for sequence-to-sequence tasks

**mT5:** A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages

**mt5-small:** A small version of mT5, suitable for getting things working before attempting to train on a large model

`mt5-small` uses the SentencePiece tokenizer

In [None]:
from transformers import AutoTokenizer

#uses the sentencepiece tokenizer
model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)

### Looking at an example of the tokenization

You'll see that the token ids get returned as `input_ids`

It also includes an `attention_mask`, which tells the model's attention mechanism which positions hold real tokens (1) and which are padding to be ignored (0) - it's all 1s here because a single sentence needs no padding.

In [None]:
inputs = tokenizer(dataset_split["train"]["text"][46])
inputs

{'input_ids': [20161, 261, 287, 6604, 304, 2121, 148586, 8901, 2384, 269, 91203, 305, 259, 67387, 265, 52751, 18726, 260, 14651, 454, 1017, 281, 259, 74075, 17515, 533, 259, 108063, 263, 772, 82975, 305, 395, 325, 259, 18390, 259, 165232, 276, 317, 181091, 337, 281, 22515, 921, 12445, 13742, 260, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
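
The mask above is all 1s, so it has no visible effect yet. As a rough sketch of what a mask with 0s does once shorter sequences are padded, the hypothetical function below turns attention scores into weights while giving masked-out positions zero weight. Real implementations typically get the same effect by adding a large negative number to the masked scores before the softmax.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over attention scores where masked-out (0) positions get zero weight."""
    exps = [math.exp(s) if m == 1 else 0.0 for s, m in zip(scores, attention_mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Two real tokens followed by two padding positions.
weights = masked_softmax([2.0, 1.0, 0.5, 0.5], [1, 1, 0, 0])
print(weights)
```

The padding positions end up with weight 0, so they cannot influence the representation of the real tokens.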

### Converting ids back to tokens

Here's what the tokens look like.

The `▁` and `</s>` are hallmarks of the SentencePiece tokenizer algorithm

In [None]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁Italy',
 ',',
 '▁the',
 '▁land',
 '▁of',
 '▁ex',
 'quisite',
 '▁tus',
 'ani',
 '▁de',
 'lights',
 '▁and',
 '▁',
 'passionat',
 'e',
 '▁bella',
 '▁donna',
 '.',
 '▁Ind',
 'ul',
 'ge',
 '▁in',
 '▁',
 'delicious',
 '▁pasta',
 '▁that',
 '▁',
 'melt',
 's',
 '▁your',
 '▁mouth',
 '▁and',
 '▁si',
 'p',
 '▁',
 'upon',
 '▁',
 'velvet',
 'y',
 '▁c',
 'appuccin',
 'os',
 '▁in',
 '▁pictures',
 'que',
 '▁terra',
 'zas',
 '.',
 '</s>']
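
In the token list above, `▁` marks a piece that was preceded by a space, which is how the original spacing can be recovered. A rough sketch of that idea, using a hypothetical helper `pieces_to_text` (the real `tokenizer.decode` handles more cases than this):

```python
def pieces_to_text(pieces, special_tokens=("</s>", "<pad>")):
    """Join SentencePiece pieces; '▁' marks where a space preceded the piece."""
    kept = [p for p in pieces if p not in special_tokens]
    return "".join(kept).replace("▁", " ").strip()

pieces = ['▁Italy', ',', '▁the', '▁land', '▁of', '▁ex', 'quisite', '</s>']
print(pieces_to_text(pieces))
```

Pieces without a leading `▁`, like `quisite`, simply attach to the piece before them.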

### How does it work on the emojis?

Fortunately, this seems to work pretty well for the emoji output too, although some emojis may come back as `<unk>` (the unknown token).

In [None]:
target = tokenizer(dataset_split["train"]["emoji"][46])
target

{'input_ids': [259, 239032, 220673, 244487, 242059, 245732, 242444, 247172, 223801, 239566, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer.convert_ids_to_tokens(target.input_ids)

['▁', '🏛', '️', '⛰', '️', '🏰', '🌄', '📜', '✨', '</s>']

In [None]:
tokenizer.decode(target.input_ids)

'🏛️⛰️🏰🌄📜✨</s>'

### Let's define a preprocessing function

This will allow us to tokenize both the text and the emojis, and to store the emoji token ids under the `"labels"` key of the overall data structure, where it will be convenient to have them for training.

In [None]:
max_input_length = 100
max_target_length = 20


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["text"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["emoji"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs



Hugging Face datasets have a `map` method that allows you to apply a preprocessing function like this to every example in the data set.

Notice that we get everything we had before (text, emoji, topic), but now we also have the input_ids (the tokens), the attention mask, and the labels (also token ids).

In [None]:
#turn the tokenized data back into a dataset
tokenized_datasets = dataset_split.map(preprocess_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['text', 'emoji', 'topic', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})
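
With `batched=True`, the preprocessing function receives a dictionary of lists (one list per column) rather than a single example, which is why `examples["text"]` can be passed straight to the tokenizer. A simplified pure-Python sketch of that behavior, where `batched_map` and `add_lengths` are hypothetical stand-ins and not the library's implementation:

```python
def batched_map(rows, function, batch_size=2):
    """Call `function` on column-wise batches and merge the new columns back in."""
    output = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is a dict of lists: one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            output.append(merged)
    return output

def add_lengths(examples):
    # With batched input, examples["text"] is a list of strings.
    return {"text_length": [len(text) for text in examples["text"]]}

rows = [{"text": "pizza night"}, {"text": "beach"}, {"text": "rainy day"}]
print(batched_map(rows, add_lengths))
```

The original columns are kept and the returned columns are added alongside them, which matches the features listed in the output above.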

### Grabbing the pre-trained model

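As a reminder, `model_checkpoint` was defined earlier - it is `"google/mt5-small"`.

Note that this is an encoder-decoder transformer model that was pretrained on the multilingual mC4 corpus. The 750 GB dataset and the supervised tasks (summarization, translation, question answering, and classification) describe the original English T5; mT5 was pretrained without those supervised tasks, so it needs fine-tuning before it is useful for a task like ours.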

In [None]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFMT5ForConditionalGeneration.

All the layers of TFMT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMT5ForConditionalGeneration for predictions without further training.


### Using a data collator

Hugging Face provides a Data Collator class which is used to collect the training data into batches and dynamically pad them so that each batch is appropriately padded but without an overall fixed length.

With `return_tensors="tf"` we're asking for the batches back as TensorFlow tensors, suitable for use with Keras.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
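
As a rough picture of what dynamic padding means, the sketch below pads each batch only to the longest sequence in that batch. The function `collate` is a hypothetical stand-in, not the library's code. It assumes a pad token id of 0 for the inputs, and pads labels with -100, the collator's default label padding value, so that padded label positions are ignored by the loss.

```python
def collate(features, pad_id=0, label_pad_id=-100):
    """Pad one batch to its own longest input and longest label sequence."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_pad = max_input - len(f["input_ids"])
        label_pad = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * input_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * input_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * label_pad)
    return batch

# Made-up token ids, just to show the shapes.
examples = [
    {"input_ids": [5, 6, 7, 1], "attention_mask": [1, 1, 1, 1], "labels": [9, 1]},
    {"input_ids": [8, 1], "attention_mask": [1, 1], "labels": [9, 9, 9, 1]},
    {"input_ids": [4, 1], "attention_mask": [1, 1], "labels": [9, 1]},
    {"input_ids": [3, 2, 1], "attention_mask": [1, 1, 1], "labels": [9, 1]},
]
for start in range(0, len(examples), 2):
    batch = collate(examples[start:start + 2])
    print(len(batch["input_ids"][0]), batch["labels"])
```

The first batch is padded to length 4 and the second only to length 3, so no batch carries more padding than it needs.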

Let's make a version of the dataset where the original text fields are removed so we can use it with the collator.

In [None]:
tokenized_datasets_no_text = tokenized_datasets.remove_columns(["text","emoji","topic"])
tokenized_datasets_no_text

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

In [None]:
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)
tf_eval_dataset = model.prepare_tf_dataset(
    tokenized_datasets_no_text["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

### Setting up the optimizer

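When fine-tuning a pre-trained model, you usually want to use a smaller learning rate than you would when training from scratch, so the pre-trained weights are adjusted gently rather than overwritten.

Note that we do not specify a loss function - when `compile` is called without one, the Hugging Face model falls back to its own built-in loss, computed from the `labels` we added during preprocessing.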

*NB:* I'm using values that were in the example on the website (https://huggingface.co/learn/nlp-course/chapter7/5?fw=tf ) for a different dataset - I don't know if these are the best for this problem

In [None]:
from transformers import create_optimizer
import tensorflow as tf

num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

# Train in mixed-precision float16 - can be helpful if running on a GPU
#tf.keras.mixed_precision.set_global_policy("mixed_float16")
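
To see why `num_train_steps` matters, here is a sketch of the learning-rate schedule, under the assumption that `create_optimizer` with no warmup decays the rate linearly from `init_lr` down to 0 over `num_train_steps`. The function `linear_decay_lr` is a hypothetical re-implementation for illustration only.

```python
def linear_decay_lr(step, init_lr=5.6e-5, num_train_steps=200):
    """Learning rate that falls linearly from init_lr to 0 over training."""
    progress = min(step, num_train_steps) / num_train_steps
    return init_lr * (1.0 - progress)

steps_per_epoch = 800 // 32             # 25 batches per epoch for our 800-example sample
num_train_steps = steps_per_epoch * 8   # 200 steps over 8 epochs
for step in (0, 100, 200):
    print(step, linear_decay_lr(step, num_train_steps=num_train_steps))
```

If training runs for fewer steps than the schedule was built for, the learning rate never gets close to 0 by the end.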

In [None]:
# Train for the same number of epochs the learning-rate schedule was built for
model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_train_epochs)

### Saving a copy of the model's weights

This will allow us to load the model later and work with it without completely retraining.

In [None]:
model.save_pretrained("models/emoji-model-v2")

### Reload a saved model

In [None]:
# Load from the same directory we saved to above
model = TFAutoModelForSeq2SeqLM.from_pretrained("models/emoji-model-v2")

### Inference

Let's suppose we have an example to get a prediction for. For now, let's grab one from the test set

In [None]:
print( tokenized_datasets["test"]["text"][15] )
print( tokenized_datasets["test"]["emoji"][15] )
print( tokenized_datasets["test"]["input_ids"][15] )

Marvel at the towering cathedral steeples and intricate stained glass windows. This stunning architectural wonder radiates a sense of divine presence and spirituality.
🏙️💒🧚⛪🚄💫🕊️🌸✨
[46577, 344, 287, 288, 176572, 317, 216387, 113489, 104793, 305, 281, 92804, 346, 259, 263, 29967, 27416, 20727, 260, 1494, 259, 263, 59976, 259, 262, 115957, 29100, 79398, 1837, 259, 262, 13336, 304, 64236, 265, 65901, 265, 305, 43498, 2302, 260, 1]


Use the `generate` method to get a prediction sequence from the input IDs.

If you don't already have the tokens, make sure to use your tokenizer first.

In [None]:
prediction = model.generate([tokenized_datasets["test"]["input_ids"][15]], max_length=max_target_length)
tokenizer.convert_ids_to_tokens(prediction[0])

['<pad>', '▁', '✨', '✨', '</s>']

In [None]:
decoded_output = tokenizer.decode(prediction[0], skip_special_tokens=True)
decoded_output

'✨✨'
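
The output above starts with `<pad>` and ends with `</s>` because generation begins from a start token and stops at the end-of-sequence token. A minimal sketch of that loop, assuming greedy decoding (always take the highest-scoring next token), a start id of 0, and an end-of-sequence id of 1. `toy_scores` is a hypothetical stand-in for the decoder, not the real model.

```python
def greedy_generate(next_token_scores, start_id=0, eos_id=1, max_length=20):
    """Pick the highest-scoring token at each step until EOS or max_length."""
    output = [start_id]
    while len(output) < max_length:
        scores = next_token_scores(output)
        best = max(range(len(scores)), key=lambda token_id: scores[token_id])
        output.append(best)
        if best == eos_id:
            break
    return output

def toy_scores(prefix):
    # Hypothetical decoder: favours token 3 twice, then the EOS token (id 1).
    if len(prefix) < 3:
        return [0.0, 0.1, 0.2, 0.9]
    return [0.0, 0.9, 0.05, 0.05]

print(greedy_generate(toy_scores))
```

The `max_length` argument is a safety net: if the model never predicts the end-of-sequence token, generation still stops.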

## Applied Exploration

The applied exploration for this fortnight will be a little different. I want everyone to get some experience fine-tuning an existing model, so this will be the task for the entire fortnight.

Fine-tune an existing model with the following requirements
* Choose a different starting model - you can use any Hugging Face model, but consider starting with a general one like BART or Llama2.
* Choose a different data set - think about something that would be good to include in an application that interests you
* Evaluate how well it performed. For sequence-to-sequence models, try going back and using ROUGE from Fortnight 1.
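
As a reminder of what a ROUGE-style score measures, here is a minimal sketch of ROUGE-1 F1 (unigram overlap) on pre-tokenized text. The function name `rouge1_f1` and the example sentences are made up. For real evaluation, a maintained ROUGE implementation is the safer choice, since it also handles tokenization and other variants.

```python
from collections import Counter

def rouge1_f1(prediction_tokens, reference_tokens):
    """Unigram-overlap F1 between a predicted and a reference token list."""
    overlap = sum((Counter(prediction_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

reference = ["the", "cat", "sat", "on", "the", "mat"]
prediction = ["the", "cat", "on", "the", "sofa"]
print(round(rouge1_f1(prediction, reference), 3))
```

Four of the five predicted tokens appear in the reference (precision 0.8) and four of the six reference tokens are covered (recall about 0.67), which gives an F1 of about 0.727.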

## Applied Exploration Description

In [10]:
import re, random
from urllib.request import urlopen
from bs4 import BeautifulSoup
from datetime import date
from transformers import pipeline

def managable_chunks(text):
  # Split text into consecutive pieces of at most 4000 characters
  i = 0
  text_chunks = []
  while i < len(text):
    text_chunks.append(text[i:i+4000])
    i += 4000
  return text_chunks


url = "https://www.foodnetwork.com/recipes/recipes-a-z/p/1"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

urls_raw = []
for link in soup.find_all('a'):
    urls_raw.append(link.get('href'))
print(urls_raw)

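# Clean urls: keep recipe links and make protocol-relative links ("//www...") absolute
urls = []
for url in urls_raw:
    if url is not None:
      url = url.split(" ", 1)[0]
      if "/recipes/" in url and "photos" not in url:
        if url.startswith("//"):
          url = "https:" + url
        urls.append(url)
print(urls)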



['https://watch.foodnetwork.com/?utm_source=marketingsite&utm_medium=trendingline_watchfullseasons_text', '//www.foodnetwork.com/shows/tv-schedule', '//www.foodnetwork.com/site/newsletter-sign-up', '//www.foodnetwork.com/videos', '//www.foodnetwork.com/features/articles/sweepstakes-and-contests', 'https://www.foodnetwork.com/kitchen/classes', '//www.foodnetwork.com/magazine', '//www.foodnetwork.com/fn-dish', '//www.foodnetwork.com/shows/a-z', '//www.foodnetwork.com/profiles/talent', '//www.foodnetwork.com/restaurants', '//www.foodnetwork.com/shows/tv-schedule', 'https://www.max.com/channel/food-network', 'https://www.facebook.com/FoodNetwork', 'https://twitter.com/FoodNetwork', 'https://instagram.com/FoodNetwork', 'https://www.youtube.com/FoodNetwork', 'https://www.pinterest.com/FoodNetwork', '//www.foodnetwork.com/site/snapchat-discover.html', '//www.foodnetwork.com', '//www.foodnetwork.com/recipes', '//www.foodnetwork.com/holidays-and-parties/packages/holidays', '//www.foodnetwork.co

In [22]:
for recipes in urls[:1]:
  recipe_url = recipes
  html = urlopen("https://www.foodnetwork.com/recipes/food-network-kitchen/eye-round-christmas-roast-17206548").read()
  soup = BeautifulSoup(html, features="html.parser")
  #print(soup)

  # So things we can extract: title, description, ingredients
  # Title seems like it should be the y-variable
  # Description.... should I map description to y? hm.
  description = soup.find('div', class_ = "o-AssetDescription__a-Description").get_text()
  ingredient = soup.find('span', class_ = "o-Ingredients__a-Ingredient--CheckboxLabel").get_text()

  print(description)



      Who says Christmas roast has to be prime rib? This much more affordable roast – made with eye round – will impress everyone at the holiday table. The crisp, spicy peppercorn crust pairs beautifully with the bright horseradish cream and leftovers make amazing sandwiches – so hold on to any extra sauce.
    


In [None]:
import re, random
from urllib.request import urlopen
from bs4 import BeautifulSoup
from datetime import date
from transformers import pipeline

def managable_chunks(text):
  # Split text into consecutive pieces of at most 4000 characters
  i = 0
  text_chunks = []
  while i < len(text):
    text_chunks.append(text[i:i+4000])
    i += 4000
  return text_chunks

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

url = "https://www.themoscowtimes.com/"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

urls_raw = []
for link in soup.find_all('a'):
    urls_raw.append(link.get('href'))

# Clean urls: skip anchors without an href and drop anything after a space
urls = []
for url in urls_raw:
    if url is not None:
      url = url.split(" ", 1)[0]
      urls.append(url)

# Fetch the current year/month for finding articles
today = date.today()
year = today.strftime("%Y")
month = today.strftime("%m")

# Root of link for moscowtimes content
news_site_root = "https://www.themoscowtimes.com/"

# Root of link for articles
article_root = news_site_root + year + "/" + month

news_article_links = []
for url in urls:
  if url.find(article_root) == 0:
    news_article_links.append(url)

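# Sample up to 4 articles; random.sample raises ValueError if asked for more than exist
selected_articles = random.sample(news_article_links, min(4, len(news_article_links)))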

weekly_summary = ""

for article in selected_articles:
  article_url = article
  html = urlopen(article_url).read()
  soup = BeautifulSoup(html, features="html.parser")

  text = soup.find('div', class_ = "article__content").get_text()

  # break into lines and remove leading and trailing space on each
  lines = (line.strip() for line in text.splitlines())
  # break multi-headlines into a line each
  chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
  # drop blank lines
  text = '\n'.join(chunk for chunk in chunks if chunk)

  weekly_summary = weekly_summary + summarizer(text[:4000])[0]["summary_text"] + "\n"

print(weekly_summary)