# Artist classification using BERT
This notebook applies BERT, a transformer-based model, to the task of artist classification.

The motivation is to exploit the model's contextual understanding and pretrained knowledge by fine-tuning it on the lyrics of different artists.

We'll first use the pretrained model to transform the lyrics into BERT vector embeddings and run a kNN algorithm over those embeddings. Then we'll fine-tune the model on artist classification.

## Preparing the data

In [52]:
from pathlib import Path
from datasets import load_dataset

dataset_folder = '../data/'

dataset = load_dataset('csv',
                     data_files={
                        'train': dataset_folder + 'songs_train.txt',
                        'test': dataset_folder + 'songs_test.txt',
                        'dev': dataset_folder + 'songs_dev.txt',
                     },
                     split='train[:10%]',
                     column_names=['artist', 'title', 'lyrics'],
                     sep='\t')

Using custom data configuration default-ca7abb43b6747eaf
Reusing dataset csv (/home/urbikn/.cache/huggingface/datasets/csv/default-ca7abb43b6747eaf/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)


In [53]:
dataset

Dataset({
    features: ['artist', 'title', 'lyrics'],
    num_rows: 4612
})

## Transform lyrics into BERT embeddings
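We're using the standard pre-trained BERT model: we pass in the lyrics and read off the model's output vectors. The cells below take `pooler_output`, which is the final hidden state of the `[CLS]` token passed through a dense layer and a tanh.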
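
The model returns one vector per token, so each lyric has to be reduced to a single vector before kNN can use it. Besides `pooler_output`, a common choice is to average the token vectors while ignoring padding. A minimal sketch of that averaging, with NumPy arrays standing in for the model's `last_hidden_state` and the tokenizer's `attention_mask`; the function name `mean_pool` and the toy values are illustrative assumptions, not part of this notebook:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, skipping padded positions.

    hidden_states:  (batch, seq_len, hidden) float array
    attention_mask: (batch, seq_len) array of 1 (real token) / 0 (padding)
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1, None)  # avoid division by zero
    return summed / counts

# Toy stand-in: 2 sequences, 3 token positions, hidden size 2.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                   [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
mask = np.array([[1, 1, 0],   # last position of the first sequence is padding
                 [1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled)
```

Whether mean pooling or `pooler_output` works better for lyrics is an empirical question that this sketch does not answer.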

In [54]:
from transformers import AutoTokenizer, AutoModel

# Longformer can handle much longer documents (up to 4096 tokens)
# model_name = "allenai/longformer-base-4096"

# For now we only use BERT, because of RAM constraints
model_name = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [55]:
def encode(batch):
    return tokenizer(batch["lyrics"], padding="longest", truncation=True, max_length=512, return_tensors="pt")

dataset.set_transform(encode)

In [56]:
output = model(**dataset[:2])

In [57]:
examples = output['pooler_output'].detach().numpy()
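
The cell above embeds only the first two lyrics. To run kNN over the whole dataset, the same forward pass would have to be repeated over small batches so that the activations fit in memory. A rough sketch of such a loop; `embed_in_batches` and `fake_embed` are hypothetical helpers, and `fake_embed` stands in for a call that tokenizes a batch and returns its `pooler_output` as a NumPy array:

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=8):
    """Apply embed_fn to consecutive slices of texts and stack the results.

    texts must be non-empty; embed_fn maps a list of strings to a 2-D array.
    """
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(embed_fn(batch)))
    return np.concatenate(chunks, axis=0)

def fake_embed(batch):
    # Placeholder for the real model call: one 4-dimensional vector per text,
    # filled with the text length so the output is easy to check.
    return np.array([[float(len(text))] * 4 for text in batch])

lyrics = ["la la la", "oh", "yeah yeah", "na na na na", "hey"]
embeddings = embed_in_batches(lyrics, fake_embed, batch_size=2)
print(embeddings.shape)
```

With the real model, `embed_fn` would typically run the forward pass inside `torch.no_grad()` so that no gradients are stored during inference.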

In [36]:
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)
# Placeholder labels: one dummy label per embedded example
neigh.fit(examples, [1] * len(examples))
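
The intended use of the classifier is to fit it on one embedding per song with the artist name as the label, then predict the artist of unseen songs. A small self-contained sketch, with made-up 2-dimensional vectors in place of the 768-dimensional BERT embeddings; the artist names and coordinates are invented for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented stand-ins for song embeddings: two well-separated clusters.
train_vectors = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                          [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
train_artists = ["artist_a"] * 3 + ["artist_b"] * 3

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_vectors, train_artists)  # string labels are accepted directly

# Each test point sits inside one of the two clusters.
test_vectors = np.array([[0.1, 0.1], [5.0, 5.1]])
predicted = knn.predict(test_vectors)
print(predicted)
```

Note that prediction requires at least `n_neighbors` fitted samples, so a classifier with `n_neighbors=3` cannot predict after being fitted on only two embeddings.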

In [47]:
def tokenization(batch):
    # Note: with return_tensors='pt', each example is stored with an extra leading dimension of size 1
    return tokenizer(batch['lyrics'], padding="longest", truncation=True, max_length=512, return_tensors='pt')

training_examples = dataset.map(tokenization)

100%|██████████| 4612/4612 [00:05<00:00, 786.70ex/s]


In [109]:
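def tokenization(example):
    # BERT accepts at most 512 tokens; a larger limit only makes sense for a long-document model such as Longformer
    return tokenizer(example['lyrics'], padding="longest", truncation=True, max_length=512)

dataset = dataset.map(tokenization)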

100%|██████████| 4612/4612 [00:06<00:00, 712.63ex/s]
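
`padding="longest"` pads to the longest sequence in the batch that the tokenizer receives in a single call. When `map` is called without `batched=True`, each call sees one example, so no padding is added. The effect of padding within a batch can be sketched in plain Python; `pad_to_longest` is a hypothetical helper and the token ids are arbitrary, with 0 used as the padding id as in `bert-base-uncased`:

```python
def pad_to_longest(batch_ids, pad_id=0):
    """Right-pad every id list to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids))
                      for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
padded = pad_to_longest(batch)
single = pad_to_longest([batch[0]])  # a "batch" of one: nothing to pad
print(padded["input_ids"])
print(single["input_ids"])
```

Padding can also be deferred until batches are formed for the model, which avoids storing padding tokens in the dataset.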
