# Artist classification using BERT
This notebook applies BERT, a transformer-based model, to the task of artist classification.

The motivation is to leverage BERT's contextual understanding and pretrained knowledge by fine-tuning it on the lyrics of different artists.

We'll first use the pretrained model to transform the lyrics into BERT vector embeddings and run a kNN classifier over them. Then we'll fine-tune the model on artist classification.

## Preparing the data

In [53]:
from datasets import load_dataset

dataset_folder = '../data/'

dataset = load_dataset('csv',
                     data_files={
                        'train': dataset_folder + 'songs_train.txt',
                        'test': dataset_folder + 'songs_test.txt',
                        'dev': dataset_folder + 'songs_dev.txt',
                     },
#                     split='train[:10%]',
                     column_names=['artist', 'title', 'lyrics'],
                     sep='\t')

Using custom data configuration default-ca7abb43b6747eaf
Reusing dataset csv (/home/urbikn/.cache/huggingface/datasets/csv/default-ca7abb43b6747eaf/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)
100%|██████████| 3/3 [00:00<00:00, 905.77it/s]


Collect every artist that appears in any split, so that each one gets a class id in the label mapping.

In [54]:
from datasets import ClassLabel

if len(dataset) == 3:
    union_artists = set(dataset['train']['artist']) | set(dataset['test']['artist']) | set(dataset['dev']['artist'])
else:
    union_artists = set(dataset['artist'])

# Sort the names so the artist -> id mapping is identical on every run
artists = ClassLabel(names=sorted(union_artists))
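
As a rough illustration of what the label mapping does, the sketch below mimics `ClassLabel.str2int` and `int2str` with plain dictionaries. The artist names and the `build_label_maps` helper are made up for the example. Sorting the names keeps the ids stable across runs, since the iteration order of a set of strings can differ between Python sessions.

```python
def build_label_maps(names):
    """Return (str2int, int2str) for a collection of class names."""
    ordered = sorted(set(names))  # deterministic order, duplicates removed
    str2int = {name: idx for idx, name in enumerate(ordered)}
    int2str = dict(enumerate(ordered))
    return str2int, int2str

# Hypothetical artist column values, one per song
train_artists = ["Queen", "ABBA", "Queen"]
test_artists = ["Nirvana", "ABBA"]

str2int, int2str = build_label_maps(train_artists + test_artists)
label_ids = [str2int[a] for a in train_artists]
print(label_ids)    # [2, 0, 2]
print(int2str[1])   # Nirvana
```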

## Transform lyrics into BERT embeddings
We're using the classic pre-trained BERT model, passing in the lyrics and getting out the final hidden states (or output vectors) of the model.

In [25]:
from transformers import AutoTokenizer, AutoModelForPreTraining

# Longformer could handle much longer documents (up to 4096 tokens)
# model_name = "allenai/longformer-base-4096"

# ...but for now only BERT is used, because of RAM limits
model_name = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model =  AutoModelForPreTraining.from_pretrained(model_name)

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
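import torch

def get_embeddings(examples, n=3):
    # Tokenize the first n lyrics; BERT accepts at most 512 tokens, so longer lyrics are truncated
    tokens = tokenizer(examples[:n]["lyrics"], padding="longest", truncation=True, max_length=512, return_tensors="pt")

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        output = model(**tokens, output_hidden_states=True)

    # The pre-training head returns no pooler_output, so take the
    # final-layer hidden state of the [CLS] token as the embedding
    embeddings = output.hidden_states[-1][:, 0].numpy()

    return embeddings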
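
To make the tokenizer arguments concrete, here is a minimal pure-Python sketch of what `truncation=True` with a `max_length`, combined with `padding="longest"`, does to a batch of token id lists. The `pad_and_truncate` helper, the pad id of 0, and the toy ids are assumptions for illustration; the real tokenizer also handles special tokens such as [CLS] and [SEP], which are ignored here.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask

toy_batch = [[11, 12, 13, 14, 15, 16], [21, 22], [31, 32, 33]]
input_ids, attention_mask = pad_and_truncate(toy_batch, max_length=4)
print(input_ids)       # [[11, 12, 13, 14], [21, 22, 0, 0], [31, 32, 33, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 1, 0]]
```

With `padding="max_length"`, every row would instead be padded to the full `max_length`, which is simpler for batching but costs more memory on short lyrics.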

In [6]:
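# `dataset` is a DatasetDict here, so pick the training split explicitly
train_split = dataset['train']
labels = artists.str2int(train_split['artist'])
examples = get_embeddings(train_split, 3)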



In [7]:
from sklearn.neighbors import KNeighborsClassifier

# Smoke test: a 1-nearest-neighbour classifier fitted on the first two embeddings only
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(examples[:2], labels[:2])

In [19]:
# Hold out the third example as the single test point
references = labels[2:3]
predictions = neigh.predict(examples[2:3])
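
For intuition, a 1-nearest-neighbour classifier simply returns the label of the closest training vector. The sketch below does this in plain Python with Euclidean distance, which is also scikit-learn's default metric for `KNeighborsClassifier`. The 2-d vectors and the `nearest_label` helper are invented for the example and stand in for the 768-dimensional BERT embeddings.

```python
import math

def nearest_label(train_vectors, train_labels, query):
    """Return the label of the training vector closest to query (1-NN)."""
    distances = [math.dist(vec, query) for vec in train_vectors]
    best = min(range(len(distances)), key=distances.__getitem__)
    return train_labels[best]

# Two toy "embeddings" with their class ids
train_vectors = [(0.0, 0.0), (1.0, 1.0)]
train_labels = [0, 1]

prediction = nearest_label(train_vectors, train_labels, (0.9, 0.8))
print(prediction)  # 1
```

With only two training points, the prediction can only ever be one of those two labels, so a meaningful evaluation needs embeddings for many more songs.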

In [57]:
import evaluate

f1_metric = evaluate.load("f1")
acc_metric = evaluate.load("accuracy")

def compute_metrics(examples):
    predictions, references = examples

    metric = acc_metric.compute(predictions=predictions, references=references)
    metric.update(f1_metric.compute(predictions=predictions, references=references, average="micro"))

    return metric

compute_metrics([predictions, references])

{'accuracy': 0.0, 'f1': 0.0}
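
One thing worth knowing about this metric pair: for single-label multiclass predictions, micro-averaged F1 is equal to accuracy, so the two numbers will always agree. The sketch below checks this on made-up labels in plain Python; `accuracy` and `micro_f1` are hypothetical helpers, not the `evaluate` implementations.

```python
def accuracy(predictions, references):
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def micro_f1(predictions, references):
    """Pool true positives, false positives and false negatives over all classes."""
    classes = set(predictions) | set(references)
    tp = fp = fn = 0
    for c in classes:
        for p, r in zip(predictions, references):
            tp += (p == c and r == c)
            fp += (p == c and r != c)
            fn += (p != c and r == c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented class ids for five songs
toy_references = [0, 1, 2, 2, 1]
toy_predictions = [0, 2, 2, 2, 0]

acc = accuracy(toy_predictions, toy_references)
f1 = micro_f1(toy_predictions, toy_references)
print(acc, f1)
```

Macro averaging (`average="macro"`) would weight every artist equally, which may be more informative when some artists have far more songs than others.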

## Fine-tune BERT on artist classification
Now we're going to fine-tune the BERT model on a sequence classification task, using the standard Hugging Face `Trainer` workflow.

In [51]:
from datasets import load_dataset

dataset_folder = '../data/'

dataset = load_dataset('csv',
                     data_files={
                        'train': dataset_folder + 'songs_train.txt',
                        'test': dataset_folder + 'songs_test.txt',
                        'dev': dataset_folder + 'songs_dev.txt',
                     },
                     column_names=['artist', 'title', 'lyrics'],
                     sep='\t')

Using custom data configuration default-ca7abb43b6747eaf
Reusing dataset csv (/home/urbikn/.cache/huggingface/datasets/csv/default-ca7abb43b6747eaf/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)
100%|██████████| 3/3 [00:00<00:00, 56.69it/s]


In [None]:
from datasets import ClassLabel

if len(dataset) == 3:
    union_artists = set(dataset['train']['artist']) | set(dataset['test']['artist']) | set(dataset['dev']['artist'])
else:
    union_artists = set(dataset['artist'])

# Sort the names so the artist -> id mapping is identical on every run
artists = ClassLabel(names=sorted(union_artists))

In [45]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Longformer could handle much longer documents (up to 4096 tokens)
# model_name = "allenai/longformer-base-4096"

# ...but for now only BERT is used, because of RAM limits
model_name = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=artists.num_classes)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [55]:
from transformers import AutoTokenizer

# Turn labels to numbers and tokenize input
def transform(batch):
    batch['labels'] = artists.str2int(batch['artist'])
    return tokenizer(batch["lyrics"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(transform, batched=True).remove_columns(['artist', 'title', 'lyrics'])
tokenized_datasets

100%|██████████| 47/47 [00:19<00:00,  2.38ba/s]
100%|██████████| 6/6 [00:02<00:00,  2.36ba/s]
100%|██████████| 6/6 [00:02<00:00,  2.34ba/s]


DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 46120
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5765
    })
    dev: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5765
    })
})

In [None]:
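import numpy as np
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

# Trainer passes (logits, labels); convert logits to class ids before scoring
def compute_trainer_metrics(eval_pred):
    logits, references = eval_pred
    return compute_metrics([np.argmax(logits, axis=-1), references])

# Train on the tokenized splits, not the raw text. Trainer has no test_dataset
# argument; score the test split afterwards with trainer.predict(tokenized_datasets['test'])
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['dev'],
    compute_metrics=compute_trainer_metrics,
)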

In [None]:
trainer.train()

In [None]:
trainer.evaluate()
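
During evaluation, `Trainer` hands `compute_metrics` the model's raw logits and the label ids rather than predicted classes, so the logits need an argmax before accuracy or F1 can be computed. A minimal pure-Python sketch of that step, with invented logits for three songs and four artists (`logits_to_class_ids` is a hypothetical helper):

```python
def logits_to_class_ids(logits):
    """Pick the index of the largest score in each row (argmax over classes)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# One row per song, one column per artist
logits = [
    [0.1, 2.3, -1.0, 0.4],
    [1.5, 0.2, 0.3, -0.7],
    [-0.2, 0.0, 0.9, 3.1],
]
predicted = logits_to_class_ids(logits)
print(predicted)  # [1, 0, 3]
```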

## Use fine-tuned BERT to transform lyrics into embeddings
Now that BERT has been fine-tuned on artist classification, we'll again use its embeddings for kNN classification.