# Machine Learning Article Tag Generation

#### Nanda Kishor M Pai

A machine learning model that generates tags for machine-learning articles. It is a version of [t5-small](https://huggingface.co/t5-small) fine-tuned on a refined version of the [190k Medium Articles](https://www.kaggle.com/datasets/fabiochiusano/medium-articles) dataset, and it takes the article's textual content as input. Tag prediction is usually formulated as a multi-label classification problem; this model instead treats _tag generation_ as a text2text generation task (inspiration and reference: [fabiochiu/t5-base-tag-generation](https://huggingface.co/fabiochiu/t5-base-tag-generation)).

Fine-tuning notebook reference: [Hugging Face summarization notebook](https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb).
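
To make the difference between the two formulations concrete, here is a minimal sketch. The tag vocabulary and the helper names are invented for illustration and are not part of the dataset or the training code.

```python
# Toy vocabulary and tags, invented for illustration only.
tag_vocabulary = ["Python", "NLP", "changepoint detection", "ruptures"]
article_tags = ["ruptures", "changepoint detection", "Python"]

def to_multi_hot(tags, vocabulary):
    """Multi-label view: one 0/1 slot per tag in a fixed vocabulary."""
    return [1 if tag in tags else 0 for tag in vocabulary]

def to_target_text(tags):
    """Text2text view: the label is simply a comma-separated string."""
    return ", ".join(tags)

multi_hot = to_multi_hot(article_tags, tag_vocabulary)
target_text = to_target_text(article_tags)
print(multi_hot)
print(target_text)
```

The text2text view needs no fixed vocabulary, so the model can in principle emit tags it never saw as labels, at the cost of having to parse its free-form output afterwards.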

In [None]:
! pip install datasets transformers rouge-score nltk
! apt install git-lfs

Log in to Hugging Face so the model can be pushed to the Hub

In [4]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Make sure your version of Transformers is at least 4.11.0, since the Hub-pushing functionality used here was introduced in that version:

In [5]:
import transformers
import datasets
import random
import pandas as pd
from IPython.display import display, HTML
from transformers import EarlyStoppingCallback
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True
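
The version requirement mentioned above can be checked programmatically. The sketch below is one possible way to do it with the standard library only; `parse_version` and `is_at_least` are hypothetical helpers, and in the notebook you would pass `transformers.__version__` as the installed version.

```python
def parse_version(version):
    """Turn '4.26.1' into (4, 26, 1); stops at the first non-numeric piece such as 'dev0'."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def is_at_least(installed, minimum):
    """Compare versions numerically, so that 4.9 sorts before 4.11."""
    return parse_version(installed) >= parse_version(minimum)

# 4.26.1 is the version that appears in this notebook's logs.
print(is_at_least("4.26.1", "4.11.0"))
print(is_at_least("4.9.2", "4.11.0"))
```

Comparing tuples of integers avoids the pitfall of plain string comparison, where `"4.9.2"` would sort after `"4.11.0"`.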

You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

In [6]:
from transformers.utils import send_example_telemetry

send_example_telemetry("summarization_notebook", framework="pytorch")

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models with the summarization recipe, using the article text as the input and its tags as the target.

We will look into how to fine-tune a model on a custom dataset using the `Trainer` API.

In [7]:
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to get the metric we need for evaluation (to compare our model to the benchmark). This is easily done with the `load_metric` function. Recent releases of 🤗 Datasets deprecate `load_metric` in favor of the separate 🤗 Evaluate library.

In [8]:
from datasets import load_metric

metric = load_metric("rouge")


## Importing Dataset

In [19]:
import pandas as pd

df = pd.read_csv("/content/Medium_ML_Specific_Refined_Tags_940articles_1000words.csv")
df = df[['text', 'corrected_tags']] 
df.head(2)

| | text | corrected_tags |
|---|---|---|
| 0 | A method to select either a condensed data tab... | data table, radio button, Dash data table, dop... |
| 1 | The ruptures Package\n\nCharles Truong adapted... | ruptures, changepoint detection, Python, PELT,... |


Convert the pandas DataFrame into a Hugging Face `Dataset` object

In [20]:
dataset = datasets.Dataset.from_pandas(df)

Train / validation / test split (80% / 10% / 10%)

In [21]:
# 80% train, 20% test + validation
train_test_dataset = dataset.train_test_split(test_size=0.2)
# Split the 20% test + valid in half test, half valid
test_valid = train_test_dataset['test'].train_test_split(test_size=0.5)
# gather everyone if you want to have a single DatasetDict
train_test_valid_dataset = datasets.DatasetDict({
    'train': train_test_dataset['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})

In [22]:
type(train_test_valid_dataset)

datasets.dataset_dict.DatasetDict
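
As a sanity check on the two-stage split, the expected row counts can be worked out by hand. This is a sketch under the assumption that the held-out share is rounded up to a whole number of rows; `split_sizes` is a hypothetical helper, and 940 is the number of articles in the CSV used here.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (train, test) sizes, rounding the test share up (assumed behavior)."""
    # round() guards against floating-point noise such as 188.00000000000003
    n_test = math.ceil(round(n_rows * test_size, 9))
    return n_rows - n_test, n_test

n_train, n_holdout = split_sizes(940, 0.2)      # first split: 80% / 20%
n_valid, n_test = split_sizes(n_holdout, 0.5)   # second split: half / half
print(n_train, n_valid, n_test)
```

These counts agree with the `num_rows` values reported for the tokenized `DatasetDict` later in the notebook.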

## Preprocessing the data

In [23]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

In [24]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""
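
For reference, a task prefix like this is normally prepended to every input string before tokenization. The sketch below shows that step in isolation; `add_prefix` is a hypothetical helper, and note that the preprocessing function in this notebook tokenizes the `text` column directly, without the prefix.

```python
def add_prefix(texts, prefix):
    """Prepend the task prefix to each input text."""
    return [prefix + text for text in texts]

example_inputs = add_prefix(
    ["The ruptures Package", "A method to select a data table"],
    "summarize: ",
)
print(example_inputs)
```

If a prefix is applied during training, the same prefix has to be applied to the text passed to `generate` at inference time, otherwise the model sees inputs in a different format from the one it was trained on.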

In [25]:
max_input_length = 1024
max_target_length = 128

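def preprocess_function(examples):
    # Tokenize the article text; inputs longer than max_input_length are truncated.
    # Note: the `prefix` defined above is not prepended here.
    model_inputs = tokenizer(examples["text"], max_length=max_input_length, truncation=True)

    # Tokenize the comma-separated tags as the targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["corrected_tags"], max_length=max_target_length, truncation=True)

    # The Trainer expects the target token ids under the "labels" key
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs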

In [26]:
# Apply the preprocessing to the train, test and validation splits in batches
tokenized_datasets = train_test_valid_dataset.map(preprocess_function, batched=True)


In [27]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'corrected_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 752
    })
    test: Dataset({
        features: ['text', 'corrected_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 94
    })
    valid: Dataset({
        features: ['text', 'corrected_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 94
    })
})

## Fine-tuning the model

In [28]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


In [29]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-machine-articles-tag-generation",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy = 'epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=20,
    predict_with_generate=True,
    fp16=True,
    metric_for_best_model = 'eval_loss',
    load_best_model_at_end = True,
    push_to_hub=False,
)

In [30]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
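
The collator pads each batch to the length of its longest sequence. A minimal pure-Python sketch of that idea follows; `pad_batch` is a hypothetical helper, the token ids are made up, and the real collator also builds attention masks and tensors. Inputs are padded with the pad token id (0 for T5, as the logs in this notebook show), while labels are padded with -100 so that padded positions are ignored by the loss.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

# Made-up token ids for a batch of two examples.
padded_inputs = pad_batch([[5, 6, 7], [8, 9]], pad_value=0)
padded_labels = pad_batch([[11, 12], [13]], pad_value=-100)
print(padded_inputs)
print(padded_labels)
```

This is also why `compute_metrics` has to replace -100 with the pad token id before decoding the labels.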

In [31]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
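
To give a feel for what the ROUGE-1 number measures on tag strings, here is a simplified unigram-overlap F-measure. It is only a sketch: `rouge1_f` is a hypothetical helper, it splits on whitespace and commas without stemming, and the `rouge` metric used above tokenizes differently and reports an aggregated score multiplied by 100.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure between two comma-separated tag strings."""
    pred_tokens = prediction.lower().replace(",", " ").split()
    ref_tokens = reference.lower().replace(",", " ").split()
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f(
    "ruptures, python, pelt",
    "ruptures, changepoint detection, python, pelt",
)
print(round(score, 4))
```

Here every predicted word is correct (precision 1.0) but two of the five reference words are missing (recall 0.6), which gives an F-measure of 0.75.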

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [32]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience = 1,early_stopping_threshold=0.01)]
)

Using cuda_amp half precision backend
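
The early-stopping settings passed to the trainer (`early_stopping_patience=1`, `early_stopping_threshold=0.01`) can be read as: stop as soon as one evaluation fails to improve the best validation loss by more than 0.01. The sketch below is a simplified paraphrase of that rule, not the library's code; `first_stopping_epoch` is a hypothetical helper. It is applied to the ten validation losses that this run reports in its training log.

```python
def first_stopping_epoch(eval_losses, patience=1, threshold=0.01):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        # An epoch counts as an improvement only if it beats the best loss by more than the threshold
        if best is None or best - loss > threshold:
            bad_epochs = 0
        else:
            bad_epochs += 1
        # The best loss itself is updated on any improvement, however small
        if best is None or loss < best:
            best = loss
        if bad_epochs >= patience:
            return epoch
    return None

logged_losses = [2.963479, 2.565983, 2.354473, 2.24477, 2.178632,
                 2.135915, 2.109996, 2.087043, 2.067296, 2.054794]
stop_on_logged = first_stopping_epoch(logged_losses)
stop_on_toy = first_stopping_epoch([1.0, 0.9, 0.895])
print(stop_on_logged, stop_on_toy)
```

Every one of the ten logged epochs improves the loss by more than 0.01, so the rule does not fire within them; in the toy sequence the third value improves by only 0.005 and triggers the stop.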


We can now finetune our model by just calling the `train` method:

In [33]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: corrected_tags, text. If corrected_tags, text are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 752
  Num Epochs = 20
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 940
  Number of trainable parameters = 60506624
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 3.8173 | 2.963479 | 18.0181 | 5.9506 | 16.6678 | 16.4276 | 18.8617 |
| 2 | 3.0348 | 2.565983 | 24.3443 | 9.1775 | 22.0591 | 22.0224 | 18.883 |
| 3 | 2.709 | 2.354473 | 26.803 | 11.614 | 24.883 | 24.6785 | 18.8085 |
| 4 | 2.5353 | 2.24477 | 30.1396 | 14.3692 | 27.3638 | 27.2744 | 18.1915 |
| 5 | 2.4272 | 2.178632 | 31.1623 | 15.2598 | 28.3844 | 28.5093 | 18.0532 |
| 6 | 2.3318 | 2.135915 | 30.4555 | 15.3011 | 27.821 | 27.8011 | 18.266 |
| 7 | 2.2745 | 2.109996 | 30.9549 | 15.3681 | 28.1127 | 28.0477 | 18.1277 |
| 8 | 2.2321 | 2.087043 | 32.0794 | 16.2302 | 28.9297 | 28.8723 | 18.2872 |
| 9 | 2.2132 | 2.067296 | 33.2077 | 16.9096 | 29.9765 | 30.009 | 18.0319 |
| 10 | 2.188 | 2.054794 | 33.5125 | 17.121 | 30.3222 | 30.2758 | 18.0213 |


The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: corrected_tags, text. If corrected_tags, text are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 94
  Batch size = 16
Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.1"
}

TrainOutput(global_step=564, training_loss=2.5053753142661237, metrics={'train_runtime': 540.7277, 'train_samples_per_second': 27.814, 'train_steps_per_second': 1.738, 'total_flos': 2442648832966656.0, 'train_loss': 2.5053753142661237, 'epoch': 12.0})
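
The step counts in the log and in `TrainOutput` are consistent with each other, as this short worked calculation shows. All numbers are taken from the output above; the variable names are just for illustration.

```python
import math

num_examples = 752     # "Num examples" in the training log
batch_size = 16        # per-device batch size, single device
planned_epochs = 20    # num_train_epochs
global_step = 564      # from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_planned_steps = steps_per_epoch * planned_epochs
epochs_completed = global_step / steps_per_epoch
print(steps_per_epoch, total_planned_steps, epochs_completed)
```

So the 940 "Total optimization steps" correspond to the full 20 epochs, while the run actually ended after 564 steps, which is 12 of those 20 epochs.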

You can now upload the result of the training to the Hub by running the instruction below (`push_to_hub=False` in the training arguments only turns off automatic uploads during training):

In [34]:
trainer.push_to_hub()

## Inference

Once the model is on the Hub, anyone can load it with the identifier `"nandakishormpai/t5-small-machine-articles-tag-generation"`, for instance:

In [35]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--nandakishormpai--t5-small-machine-articles-tag-generation/snapshots/cffe3b8589c2a9521bda72644fb3e18a40ee6ab7/config.json
Model config T5Config {
  "_name_or_path": "nandakishormpai/t5-small-machine-articles-tag-generation",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_pen

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/242M [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--nandakishormpai--t5-small-machine-articles-tag-generation/snapshots/cffe3b8589c2a9521bda72644fb3e18a40ee6ab7/pytorch_model.bin
Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.1"
}

All model checkpoint weights were used when initializing T5ForConditionalGeneration.

All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at nandakishormpai/t5-small-machine-articles-tag-generation.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.


Downloading (…)neration_config.json:   0%|          | 0.00/142 [00:00<?, ?B/s]

loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--nandakishormpai--t5-small-machine-articles-tag-generation/snapshots/cffe3b8589c2a9521bda72644fb3e18a40ee6ab7/generation_config.json
Generate config GenerationConfig {
  "_from_model_config": true,
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.1"
}



In [36]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
nltk.download('punkt')

tokenizer = AutoTokenizer.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")

# Insert the article text here
text = """
Paige, AI in pathology and genomics

Fundamentally transforming the diagnosis and treatment of cancer
Paige has raised $25M in total. We talked with Leo Grady, its CEO.
How would you describe Paige in a single tweet?
AI in pathology and genomics will fundamentally transform the diagnosis and treatment of cancer.
How did it all start and why? 
Paige was founded out of Memorial Sloan Kettering to bring technology that was developed there to doctors and patients worldwide. For over a decade, Thomas Fuchs and his colleagues have developed a new, powerful technology for pathology. This technology can improve cancer diagnostics, driving better patient care at lower cost. Paige is building clinical products from this technology and extending the technology to the development of new biomarkers for the biopharma industry.
What have you achieved so far?
TEAM: In the past year and a half, Paige has built a team with members experienced in AI, entrepreneurship, design and commercialization of clinical software.
PRODUCT: We have achieved FDA breakthrough designation for the first product we plan to launch, a testament to the impact our technology will have in this market.
CUSTOMERS: None yet, as we are working on CE and FDA regulatory clearances. We are working with several biopharma companies.
What do you plan to achieve in the next 2 or 3 years?
Commercialization of multiple clinical products for pathologists, as well as the development of novel biomarkers that can help speed up and better inform the diagnosis and treatment selection for patients with cancer.

"""

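# Tokenize the article, truncating it to at most 1024 tokens
inputs = tokenizer([text], max_length=1024, truncation=True, return_tensors="pt")
# Generate between 10 and 64 tokens using beam search combined with sampling
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
                        max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
# Optional: de-duplicate the generated tags (a set does not preserve their order)
# tags = list(set(decoded_output.strip().split(", ")))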

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



loading file spiece.model from cache at None
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--nandakishormpai--t5-small-machine-articles-tag-generation/snapshots/cffe3b8589c2a9521bda72644fb3e18a40ee6ab7/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--nandakishormpai--t5-small-machine-articles-tag-generation/snapshots/cffe3b8589c2a9521bda72644fb3e18a40ee6ab7/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--nandakishormpai--t5-small-machine-articles-tag-generation/snapshots/cffe3b8589c2a9521bda72644fb3e18a40ee6ab7/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--nandakishormpai--t5-small-machine-articles-tag-generation/snapshots/cffe3b8589c2a9521bda72644fb3e18a40ee6ab7/config.json
Model config T5Config {
  "_name_or_path": "nandakish

In [37]:
decoded_output.split(",")

['Paige', ' AI in pathology and genomics', ' AI in pathology', ' genomics']
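
The raw split above leaves a leading space on every tag after the first. One possible clean-up, sketched here with a hypothetical `clean_tags` helper, strips whitespace and removes duplicates while keeping the order in which the model produced the tags.

```python
def clean_tags(decoded_output):
    """Split a generated string into unique, whitespace-stripped tags, keeping order."""
    seen = set()
    tags = []
    for raw_tag in decoded_output.split(","):
        tag = raw_tag.strip()
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags

# The generated string from the example above
example_output = "Paige, AI in pathology and genomics, AI in pathology, genomics"
tags = clean_tags(example_output)
print(tags)
```

Unlike `list(set(...))`, this keeps the tags in generation order, which can matter if earlier tags tend to be the more relevant ones.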