# Step 2b: Fine-tuning the Keyword Extractor Model

Next, we fine-tune our second model, which should be able to extract keywords from text.

In [2]:
!pip install transformers datasets

Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)

In [3]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.20.3-py3-none-any.whl (227 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.20.3


I decided to use [this dataset](https://huggingface.co/datasets/51la5/keyword-extraction) from the HuggingFace Hub as my training dataset.

In [4]:
from datasets import load_dataset

dataset = load_dataset("51la5/keyword-extraction")

Downloading and preparing dataset csv/51la5--keyword-extraction to /root/.cache/huggingface/datasets/51la5___csv/51la5--keyword-extraction-3ae609c663a88c01/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/51la5___csv/51la5--keyword-extraction-3ae609c663a88c01/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.



However, since it has no validation split, I took slices of the train split to use for training and validation.

In [5]:
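from datasets import Dataset

# Slicing a Dataset returns a plain dict of columns, so rebuild Dataset objects afterwards.
train = dataset['train'][:16524]
# NOTE: this slice starts at row 5509, so it overlaps the training rows above;
# starting at 16524 instead would keep the two sets disjoint.
eval = dataset['train'][5509:]  # `eval` also shadows Python's built-in eval()
train = Dataset.from_dict(train)
eval = Dataset.from_dict(eval)

test = dataset["test"]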
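
The two slices above, `[:16524]` and `[5509:]`, share rows 5509 through 16523, so most validation rows are also training rows. Here is a minimal sketch of a non-overlapping split. It assumes the train split has 22,033 rows, a figure inferred from the 16,524 rows that the `[5509:]` slice yields later in the notebook. `split_indices` is a hypothetical helper, not part of `datasets`.

```python
def split_indices(n_rows, n_train):
    """Return disjoint (train, eval) index ranges that together cover n_rows rows."""
    if not 0 < n_train < n_rows:
        raise ValueError("n_train must be between 1 and n_rows - 1")
    return range(0, n_train), range(n_train, n_rows)

N_ROWS = 22033   # assumption: size of the train split, inferred from the outputs
N_TRAIN = 16524  # same training size as in the cell above

train_idx, eval_idx = split_indices(N_ROWS, N_TRAIN)

# Rows shared by the original slices [:16524] and [5509:]
original_overlap = len(set(range(0, 16524)) & set(range(5509, N_ROWS)))

print(len(train_idx), len(eval_idx))        # 16524 5509
print(len(set(train_idx) & set(eval_idx)))  # 0
print(original_overlap)                     # 11015
```

The ranges could then be passed to `dataset['train'].select(train_idx)` and `dataset['train'].select(eval_idx)`, which return `Dataset` objects directly.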

Here's an example of what this dataset looks like:

In [6]:
print(train)
print(train[0])

Dataset({
    features: ['dataset', 'file_id', 'text', 'summary', 'type'],
    num_rows: 16524
})
{'dataset': 'www', 'file_id': '13534577', 'text': 'Crosslanguage blog mining and trend visualisation People use weblogs to express thoughts, present ideas and share knowledge, therefore weblogs are extraordinarily valuable resources, amongs others, for trend analysis. Trends are derived from the chronological sequence of blog post count per topic. The comparison with a reference corpus allows qualitative statements over identified trends. We propose a crosslanguage blog mining and trend visualisation system to analyse blogs across languages and topics. The trend visualisation facilitates the identification of trends and the comparison with the reference news article corpus. To prove the correctness of our system we computed the correlation between trends in blogs and news articles for a subset of blogs and topics. The evaluation corroborated our hypothesis of a high correlation coefficient

Because this dataset doesn't come with a 'label' column, I had to create my own. The 'type' column takes two values, 'KEYWORD' and 'KEYPHRASE', so I based the label on it: 0 for 'KEYWORD' and 1 otherwise.

Since I have to alter three different datasets, I wrapped the logic in a function.

In [7]:
import pandas as pd
from datasets import Dataset

def adding_label_column(dataset):
  # Work on a pandas copy that holds the five original columns
  df = pd.DataFrame(dataset, columns=['dataset', 'file_id', 'text', 'summary', 'type'])

  # Label 0 for 'KEYWORD', 1 for everything else (i.e. 'KEYPHRASE')
  df['label'] = (df['type'] != "KEYWORD").astype(int)

  # Convert back to a Hugging Face Dataset
  return Dataset.from_pandas(df)

In [8]:
train = adding_label_column(train)
eval = adding_label_column(eval)
test = adding_label_column(test)
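
As a quick sanity check of the mapping, the same rule can be applied to a few hand-written rows. This is only a sketch: `type_to_label` and the sample rows are hypothetical stand-ins, not taken from the dataset.

```python
from collections import Counter

def type_to_label(type_value):
    """Map 'KEYWORD' to 0 and any other type (e.g. 'KEYPHRASE') to 1."""
    return 0 if type_value == "KEYWORD" else 1

# Hypothetical rows that mimic the dataset's 'type' column
rows = [
    {"text": "first example", "type": "KEYWORD"},
    {"text": "second example", "type": "KEYPHRASE"},
    {"text": "third example", "type": "KEYPHRASE"},
]

labels = [type_to_label(row["type"]) for row in rows]
label_counts = Counter(labels)

print(labels)        # [0, 1, 1]
print(label_counts)  # Counter({1: 2, 0: 1})
```

Counting labels this way on the real splits would show whether the two classes are balanced, which matters when reading an accuracy score.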

Here is the edited dataset:

In [9]:
print(train)
print(train[0])

Dataset({
    features: ['dataset', 'file_id', 'text', 'summary', 'type', 'label'],
    num_rows: 16524
})
{'dataset': 'www', 'file_id': '13534577', 'text': 'Crosslanguage blog mining and trend visualisation People use weblogs to express thoughts, present ideas and share knowledge, therefore weblogs are extraordinarily valuable resources, amongs others, for trend analysis. Trends are derived from the chronological sequence of blog post count per topic. The comparison with a reference corpus allows qualitative statements over identified trends. We propose a crosslanguage blog mining and trend visualisation system to analyse blogs across languages and topics. The trend visualisation facilitates the identification of trends and the comparison with the reference news article corpus. To prove the correctness of our system we computed the correlation between trends in blogs and news articles for a subset of blogs and topics. The evaluation corroborated our hypothesis of a high correlation co

I tokenized the text with the distilbert-base-uncased tokenizer.

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


In [11]:
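def tokenize(example):
  # truncation=True cuts texts that exceed the model's maximum length;
  # padding=True pads each mapped batch to its longest sequence
  return tokenizer(example["text"], truncation=True, padding=True)

# batched=True passes many rows to the tokenizer at once
train_tokenized = train.map(tokenize, batched=True)
eval_tokenized = eval.map(tokenize, batched=True)
test_tokenized = test.map(tokenize, batched=True)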

Map:   0%|          | 0/16524 [00:00<?, ? examples/s]

Map:   0%|          | 0/16524 [00:00<?, ? examples/s]

Map:   0%|          | 0/5513 [00:00<?, ? examples/s]

In [12]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
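
`DataCollatorWithPadding` pads each training batch to the length of its longest sequence. The toy sketch below illustrates that idea in plain Python. It is not the library's implementation, and `pad_batch`, the `pad_id=0` default, and the token ids are illustrative assumptions.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest sequence in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two short texts
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])

print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to one global length keeps short batches short, which saves computation on padding tokens.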

I decided to use the [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) model from the HuggingFace Hub.

In [13]:
from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english', num_labels=2)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="keyword-extractor",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)


In [14]:
from transformers import Trainer
from sklearn.metrics import accuracy_score

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=lambda pred: {"accuracy": accuracy_score(pred.label_ids, pred.predictions.argmax(-1))},
)
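
The `compute_metrics` lambda above can also be written as a named function, which is easier to test and extend. The sketch below computes accuracy with NumPy instead of scikit-learn. `fake_pred` is a hypothetical stand-in for the object the `Trainer` passes in, which exposes `predictions` (logits) and `label_ids`.

```python
from types import SimpleNamespace

import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from logits (eval_pred.predictions) and labels (eval_pred.label_ids)."""
    predicted = np.argmax(eval_pred.predictions, axis=-1)
    accuracy = float((predicted == eval_pred.label_ids).mean())
    return {"accuracy": accuracy}

# Hypothetical logits for four examples with two classes each
fake_pred = SimpleNamespace(
    predictions=np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]]),
    label_ids=np.array([0, 1, 1, 1]),
)

metrics = compute_metrics(fake_pred)
print(metrics)  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_metrics` to the `Trainer` would then behave like the lambda.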

In [15]:
# Train the model
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.0 | 0.002675 | 0.999576 |
| 2 | 0.0001 | 0.000569 | 0.999879 |


TrainOutput(global_step=4132, training_loss=0.004308985267762129, metrics={'train_runtime': 603.9108, 'train_samples_per_second': 54.723, 'train_steps_per_second': 6.842, 'total_flos': 4377782590783488.0, 'train_loss': 0.004308985267762129, 'epoch': 2.0})
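
The `global_step` and `train_samples_per_second` values in `TrainOutput` can be reproduced from the training arguments. This sketch assumes a single device and no gradient accumulation, which the notebook does not state explicitly.

```python
import math

train_rows = 16524        # rows in the training set
batch_size = 8            # per_device_train_batch_size
epochs = 2                # num_train_epochs
train_runtime = 603.9108  # seconds, from TrainOutput

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime

print(steps_per_epoch)               # 2066
print(total_steps)                   # 4132
print(round(samples_per_second, 3))  # 54.723
```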

In [16]:
# Evaluate the model on the test set
eval_results = trainer.evaluate(test_tokenized)

# Print the accuracy
print(f"Test Accuracy: {eval_results['eval_accuracy']}")

Test Accuracy: 0.9998186105568656
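
An accuracy this close to 1 is easier to read as a count of misclassified rows. The sketch below converts the reported accuracies back into error counts. The row counts (5513 test rows, 16524 validation rows) are taken from the `Map` output earlier in the notebook, and `errors_from_accuracy` is a hypothetical helper.

```python
def errors_from_accuracy(n_rows, accuracy):
    """Number of misclassified rows implied by an accuracy over n_rows rows."""
    return round(n_rows * (1 - accuracy))

test_errors = errors_from_accuracy(5513, 0.9998186105568656)
eval_errors_epoch_1 = errors_from_accuracy(16524, 0.999576)
eval_errors_epoch_2 = errors_from_accuracy(16524, 0.999879)

print(test_errors)          # 1
print(eval_errors_epoch_1)  # 7
print(eval_errors_epoch_2)  # 2
```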


Finally, I pushed the fine-tuned model to the HuggingFace Hub, so I can access it in Step 3!

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [19]:
model_name = "Keyword-Extractor"

model.push_to_hub(model_name,
                  use_auth_token=True,
                  commit_message="Keyword Extractor Model",
                  private=False)


CommitInfo(commit_url='https://huggingface.co/mayapapaya/Keyword-Extractor/commit/e2955703188d30fdb98b68172864a0abd566cccc', commit_message='Keyword Extractor Model', commit_description='', oid='e2955703188d30fdb98b68172864a0abd566cccc', pr_url=None, pr_revision=None, pr_num=None)