In [2]:
# Transformer Applications

In [3]:
text = """I have attempted to put this review on Walmart.com\
but it indicates I have been 'opted out'.\
I purchased Binax covid tests online to pick up at store.\
Good product. Product is FIVE STARS, online shopping experience is ONE STAR.\
Online order Sunday 8/27/23 with Store pickup on Monday 8/28/23 around 2pm.\
I am reviewing to caution other online shoppers to ALWAYS scroll down\
before adding ANY item to their WalMart cart\
because the NOTICE if something is not returnable isn't visible unless you do.\
I picked up my order on the way out of town,\
and only later saw that the expiration dates were all close enough that\
I NEVER would have purchased the same amount of tests had I been shopping in-store.\
So I clicked 'start a return' on my receipt\
and only then found out I had purchased something non-refundable."""

In [4]:
# Text Classification

In [5]:

from transformers import pipeline

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [6]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs) 

Unnamed: 0,label,score
0,NEGATIVE,0.996656


In [7]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)    

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


Unnamed: 0,entity_group,score,word,start,end
0,ORG,0.963201,Walmart,39,46
1,ORG,0.976453,Binax,103,108
2,ORG,0.914135,WalMart,400,407


In [8]:
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


Unnamed: 0,score,start,end,answer
0,0.123665,103,120,Binax covid tests


In [None]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=70, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [10]:
translator = pipeline("translation_en_to_de", 
                      model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Ich habe versucht, diese Bewertung auf Walmart.com setzen, aber es deutet darauf hin, dass ich 'opted out'.Ich kaufte Binax Covid-Tests online, um im Laden abholen.Gutes Produkt. Produkt ist FIVE STARS, Online-Shopping-Erlebnis ist ONE STAR.Online-Bestellung Sonntag 27.08.23 mit Laden Pickup am Montag 28.08.23 gegen 14 Uhr.Ich überprüfe, um andere Online-Shopper auf ALWAYS nach unten scrollen, bevor Sie EINE Artikel zu ihrem WalMart Warenkorb hinzufügen, weil die HINWEIS, wenn etwas nicht wiederherstellbar ist nicht sichtbar ist, es sei denn, Sie tun.Ich nahm meine Bestellung auf dem Weg aus der Stadt, und erst später sah, dass die Ablaufdaten waren alle nah genug, dass ich nie die gleiche Menge von Tests gekauft hätte ich im Laden einkaufen gewesen.


In [11]:
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results

In [12]:
generator = pipeline("text-generation")
response = "Dear Jhonston, I am sorry that your order not refundable ."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=305)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I have attempted to put this review on Walmart.combut it indicates I have been 'opted out'.I purchased Binax covid tests online to pick up at store.Good product. Product is FIVE STARS, online shopping experience is ONE STAR.Online order Sunday 8/27/23 with Store pickup on Monday 8/28/23 around 2pm.I am reviewing to caution other online shoppers to ALWAYS scroll downbefore adding ANY item to their WalMart cartbecause the NOTICE if something is not returnable isn't visible unless you do.I picked up my order on the way out of town,and only later saw that the expiration dates were all close enough thatI NEVER would have purchased the same amount of tests had I been shopping in-store.So I clicked 'start a return' on my receiptand only then found out I had purchased something non-refundable.

Customer service response:
Dear Jhonston, I am sorry that your order not refundable . The return process could take an average of two weeks if the item is delivered, but I could not deliver it because i

In [13]:
'''Text classification is a fundamental task in Natural Language Processing (NLP) that involves assigning predefined categories or labels to a given piece of text. It is also known as text categorization. Text classification has a wide range of applications, including sentiment analysis, spam detection, topic classification, language identification, and more. Here's an overview of the key concepts and steps involved in text classification:

    Data Collection: The first step is to collect a labeled dataset. This dataset consists of text documents, such as emails, articles, tweets, or reviews, along with their corresponding labels or categories. For instance, in sentiment analysis, you might have positive and negative labels for movie reviews.

    Data Preprocessing:
        Text Cleaning: Remove any unnecessary characters, symbols, and formatting from the text.
        Tokenization: Split the text into individual words or tokens.
        Lowercasing: Convert all text to lowercase to ensure consistent handling of words in different cases.
        Stopword Removal: Eliminate common words (e.g., "the," "and," "in") that do not carry significant information for classification.
        Stemming or Lemmatization: Reduce words to their base or root form to handle different word forms (e.g., "running" to "run").

    Feature Extraction: Transform the preprocessed text into numerical features that machine learning models can understand. Common methods include:
        Bag of Words (BoW): Represent each document as a vector where each element corresponds to the frequency of a word in the document.
        Term Frequency-Inverse Document Frequency (TF-IDF): Weight each word by how often it appears in the document, discounted by how many documents in the dataset contain it.
        Word Embeddings: Use pre-trained word embeddings like Word2Vec, GloVe, or FastText to represent words as dense vectors. These embeddings capture semantic relationships between words.

    Model Selection: Choose a machine learning or deep learning model for text classification. Some commonly used models include:
        Naive Bayes: A simple probabilistic model based on Bayes' theorem, often used for text classification.
        Support Vector Machines (SVM): A linear or non-linear classifier that aims to find the best hyperplane to separate different classes.
        Neural Networks: Deep learning models like Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) can capture complex patterns in text data.
        Transformer Models: State-of-the-art models like BERT, GPT, and their variants have shown remarkable performance in various NLP tasks, including text classification.

    Model Training: Train the selected model using the preprocessed and feature-extracted data. This involves optimizing the model's parameters to make accurate predictions.

    Evaluation: Assess the model's performance using evaluation metrics such as accuracy, precision, recall, F1-score, and confusion matrix. Cross-validation or holdout validation can be used to ensure the model's generalization ability.

    Hyperparameter Tuning: Fine-tune the model's hyperparameters to improve its performance.

    Deployment: Once the model performs satisfactorily, it can be deployed in a real-world application to classify new, unseen text data.

    Monitoring and Maintenance: Continuously monitor the model's performance and update it as needed to ensure it remains effective as new data becomes available.

Text classification is a versatile and powerful NLP technique that can be applied to a wide range of problems across various domains. The choice of preprocessing techniques, feature extraction methods, and model architectures should be tailored to the specific task and dataset to achieve the best results.'''

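The steps listed above (tokenization, bag-of-words counts, a Naive Bayes model, prediction) can be sketched with only the standard library. The toy reviews, the labels, and the function names below are hypothetical and chosen purely for illustration; a real project would more likely rely on a library such as scikit-learn or on a transformer model.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy dataset
train_docs = [
    "great product fast delivery",
    "good quality would buy again",
    "terrible experience item not refundable",
    "bad service never shopping here again",
]
train_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

model = train_naive_bayes(train_docs, train_labels)
print(predict("good product", *model))
print(predict("terrible service", *model))
```

Four training sentences are far too few for a usable classifier; the sketch only shows how the listed stages connect.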

In [14]:
from datasets import list_datasets

all_datasets = list_datasets()
print(f"There are {len(all_datasets)} datasets currently available on the Hub")
print(f"The first 10 are: {all_datasets[:10]}")

There are 68448 datasets currently available on the Hub
The first 10 are: ['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue', 'ajgt_twitter_ar', 'allegro_reviews']


In [15]:
from datasets import load_dataset

emotions = load_dataset("SetFit/emotion")

Using custom data configuration SetFit___emotion-89147fdf376d67e2
Reusing dataset json (C:\Users\Erhan\.cache\huggingface\datasets\json\SetFit___emotion-89147fdf376d67e2\0.0.0\c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426)


  0%|          | 0/3 [00:00<?, ?it/s]

In [16]:
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 16000
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2000
    })
})

In [17]:
train_ds = emotions["train"]
train_ds

Dataset({
    features: ['text', 'label', 'label_text'],
    num_rows: 16000
})

In [18]:
len(train_ds)

16000

In [19]:
train_ds[0]

{'text': 'i didnt feel humiliated', 'label': 0, 'label_text': 'sadness'}

In [20]:
train_ds.column_names

['text', 'label', 'label_text']

In [21]:
print(train_ds.features)

{'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None), 'label_text': Value(dtype='string', id=None)}


In [22]:
print(train_ds[10:20])

{'text': ['i feel like i have to make the suffering i m seeing mean something', 'i do feel that running is a divine experience and that i can expect to have some type of spiritual encounter', 'i think it s the easiest time of year to feel dissatisfied', 'i feel low energy i m just thirsty', 'i have immense sympathy with the general point but as a possible proto writer trying to find time to write in the corners of life and with no sign of an agent let alone a publishing contract this feels a little precious', 'i do not feel reassured anxiety is on each side', 'i didnt really feel that embarrassed', 'i feel pretty pathetic most of the time', 'i started feeling sentimental about dolls i had as a child and so began a collection of vintage barbie dolls from the sixties', 'i now feel compromised and skeptical of the value of every unit of work i put in'], 'label': [0, 1, 3, 0, 1, 1, 0, 0, 0, 4], 'label_text': ['sadness', 'joy', 'anger', 'sadness', 'joy', 'joy', 'sadness', 'sadness', 'sadnes

In [23]:
print(train_ds["text"][10:20])

['i feel like i have to make the suffering i m seeing mean something', 'i do feel that running is a divine experience and that i can expect to have some type of spiritual encounter', 'i think it s the easiest time of year to feel dissatisfied', 'i feel low energy i m just thirsty', 'i have immense sympathy with the general point but as a possible proto writer trying to find time to write in the corners of life and with no sign of an agent let alone a publishing contract this feels a little precious', 'i do not feel reassured anxiety is on each side', 'i didnt really feel that embarrassed', 'i feel pretty pathetic most of the time', 'i started feeling sentimental about dolls i had as a child and so began a collection of vintage barbie dolls from the sixties', 'i now feel compromised and skeptical of the value of every unit of work i put in']


In [24]:
# The original URL used in the book is no longer available, so we use a different one
from urllib.request import urlretrieve

dataset_url = "https://huggingface.co/datasets/transformersbook/emotion-train-split/raw/main/train.txt"
# `wget` is not available in the default Windows shell, so download with the standard library
local_path, _ = urlretrieve(dataset_url, "train.txt")


In [25]:
# `head` is a Unix utility; read the first line in Python so the cell also runs on Windows
with open("train.txt", encoding="utf-8") as f:
    print(f.readline().rstrip("\n"))


In [26]:
dataset = load_dataset(
  "lhoestq/custom_squad",
  revision="main"  # tag name, or branch name, or commit hash
)

Reusing dataset custom_squad (C:\Users\Erhan\.cache\huggingface\datasets\lhoestq___custom_squad\plain_text\1.0.0\397916d1ae99584877e0fb4f5b8b6f01e66fcbbeff4d178afb30c933a8d0d93a)


  0%|          | 0/2 [00:00<?, ?it/s]

In [28]:
dataset_url = "https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt"
!wget {dataset_url}

'wget' is not recognized as an internal or external command,
operable program or batch file.


In [29]:
import requests

dataset_url = "https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt"

response = requests.get(dataset_url)

if response.status_code == 200:
    with open("train.txt", "wb") as file:
        file.write(response.content)
    print("Downloaded the dataset as train.txt")
else:
    print("Failed to download the dataset")


Downloaded the dataset as train.txt
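A 200 status code only says that the server returned something; it does not guarantee that the payload is the data file rather than, for example, an HTML landing page. The sketch below is a heuristic check that assumes each line has the `text;label` layout implied by the `sep=";"` and `names=["text", "label"]` arguments used when loading the file; the helper name `looks_like_emotion_data` is hypothetical.

```python
def looks_like_emotion_data(raw_bytes, n_lines=5):
    """Heuristically check that the first lines look like 'text;label' rows."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    lines = [line for line in text.splitlines() if line.strip()][:n_lines]
    if not lines:
        return False
    for line in lines:
        # An HTML page would start with markup instead of a plain sentence
        if line.lstrip().startswith("<"):
            return False
        parts = line.split(";")
        # Expect exactly one separator and non-empty text and label fields
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            return False
    return True

print(looks_like_emotion_data(b"i didnt feel humiliated;sadness\n"))
print(looks_like_emotion_data(b"<!DOCTYPE html><html>...</html>"))
```

In the download cell, the same helper could be called on `response.content` before the bytes are written to disk.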


In [30]:
%pip install requests


Note: you may need to restart the kernel to use updated packages.




In [32]:
from datasets import load_dataset_builder

# Create a builder for the generic CSV loader. "train.txt" is a local data file,
# not a dataset name, so it is passed via data_files together with the
# separator and the column names.
builder_instance = load_dataset_builder(
    "csv", data_files="train.txt", sep=";", names=["text", "label"]
)

# Prepare the dataset in the local cache (default download settings)
builder_instance.download_and_prepare()

# Materialize the prepared data as a Dataset object
emotions_from_builder = builder_instance.as_dataset(split="train")

In [33]:
emotions_local = load_dataset("csv", data_files="train.txt", sep=";",
                              names=["text", "label"])

Using custom data configuration default-3bff9914b361edad


Downloading and preparing dataset csv/default to C:\Users\Erhan\.cache\huggingface\datasets\csv\default-3bff9914b361edad\0.0.0\bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a...


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

  csv_file_reader = pd.read_csv(file, iterator=True, dtype=dtype, **self.config.read_csv_kwargs)


AttributeError: 'TextFileReader' object has no attribute 'f'

In [34]:
import requests

# Define the URL of the file you want to download
dataset_url = "https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt"

# Define the local file path where you want to save the downloaded file
local_file_path = "train.txt"

# Download the file from the URL
response = requests.get(dataset_url)

if response.status_code == 200:
    # Open a local file and write the content of the downloaded file to it
    with open(local_file_path, "wb") as file:
        file.write(response.content)
    print(f"Downloaded the dataset and saved it as {local_file_path}")
else:
    print("Failed to download the dataset")


Downloaded the dataset and saved it as train.txt


In [36]:
import pandas as pd

emotions.set_format(type="pandas")
df = emotions["train"][:]
df.head(20)

Unnamed: 0,text,label,label_text
0,i didnt feel humiliated,0,sadness
1,i can go from feeling so hopeless to so damned...,0,sadness
2,im grabbing a minute to post i feel greedy wrong,3,anger
3,i am ever feeling nostalgic about the fireplac...,2,love
4,i am feeling grouchy,3,anger
5,ive been feeling a little burdened lately wasn...,0,sadness
6,ive been taking or milligrams or times recomme...,5,surprise
7,i feel as confused about life as a teenager or...,4,fear
8,i have been with petronas for years i feel tha...,1,joy
9,i feel romantic too,2,love


In [37]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [38]:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

In [40]:
from torch import nn
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_ckpt)
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)

In [41]:
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 5, 768])

In [42]:
import torch
from math import sqrt
query = key = value = inputs_embeds
dim_k = key.size(-1)
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k)
scores.size()

torch.Size([1, 5, 5])

In [43]:
import torch.nn.functional as F
weights = F.softmax(scores, dim=-1)
weights.sum(dim=-1)

tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)

In [44]:
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

torch.Size([1, 5, 768])

In [47]:
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)
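To see what the function above computes without any tensor library, the same arithmetic can be traced on plain Python lists. This is only an illustrative sketch for a single sequence (no batch dimension); the 2-token input and the helper names are hypothetical.

```python
from math import exp, sqrt

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, key, value):
    """Scaled dot-product attention for one sequence given as lists of vectors."""
    dim_k = len(key[0])
    # scores[i][j] = (q_i . k_j) / sqrt(dim_k)
    scores = [[sum(q * k for q, k in zip(q_vec, k_vec)) / sqrt(dim_k)
               for k_vec in key] for q_vec in query]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of the value vectors
    outputs = [[sum(w * v_vec[d] for w, v_vec in zip(w_row, value))
                for d in range(len(value[0]))] for w_row in weights]
    return weights, outputs

# Hypothetical 2-token sequence with 2-dimensional embeddings
x = [[1.0, 0.0], [0.0, 1.0]]
weights, outputs = attention(x, x, x)
print(weights)
print(outputs)
```

Because query and key are identical here, each token's largest weight falls on itself, which matches the large diagonal values visible in the masked score matrix printed later in the notebook.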

In [48]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs

In [49]:
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x

In [50]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)    
attn_output.size() 

torch.Size([1, 5, 768])
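The shape is preserved because the per-head outputs are concatenated back to the full hidden size. The arithmetic can be checked by hand; the sketch below assumes `hidden_size = 768` (as shown by the `Embedding(30522, 768)` output) and `num_attention_heads = 12`, the value I would expect in the `bert-base-uncased` config. The helper `linear_params` is hypothetical.

```python
embed_dim = 768  # hidden size, taken from the Embedding(30522, 768) output
num_heads = 12   # assumed value of config.num_attention_heads

def linear_params(in_features, out_features):
    """Parameters in a linear layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features

head_dim = embed_dim // num_heads
# Each head owns three projections (q, k, v) from embed_dim to head_dim
per_head = 3 * linear_params(embed_dim, head_dim)
# All heads plus the final output_linear layer
total = num_heads * per_head + linear_params(embed_dim, embed_dim)

print(f"head_dim = {head_dim}")
print(f"concatenated width = {num_heads * head_dim}")
print(f"parameters in MultiHeadAttention = {total:,}")
```

Under these assumptions, splitting into heads does not change the parameter count relative to a single head of full width; it only partitions the projections.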

In [51]:
from bertviz import head_view
from transformers import AutoModel

model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<IPython.core.display.Javascript object>

In [52]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
    def forward(self, x):
        x = self.linear_1(x)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

In [53]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size()

torch.Size([1, 5, 768])
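The feed-forward block uses GELU as its activation. As a rough illustration, the exact (erf-based) form of GELU, which is what `nn.GELU` computes by default as far as the PyTorch documentation describes it, can be written with the standard library and compared against ReLU. The helper names are hypothetical.

```python
from math import erf, sqrt

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + erf(x / sqrt(2.0)))

def relu(x):
    """ReLU for comparison: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}")
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output instead of clamping them to zero.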

In [54]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)

    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x
     

In [55]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))
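The layer above normalizes the input before each sublayer and adds the result back to the unnormalized input (a pre-layer-norm arrangement). A minimal plain-Python sketch of that pattern for a single feature vector follows; it omits the learnable scale and shift that `nn.LayerNorm` also applies, and the toy sublayer is hypothetical.

```python
from math import sqrt

def layer_norm(vector, eps=1e-5):
    """Normalize one feature vector to zero mean and (almost) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)  # biased variance
    return [(v - mean) / sqrt(var + eps) for v in vector]

def pre_ln_block(x, sublayer):
    """Pre-layer-norm residual step: x + sublayer(layer_norm(x))."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
print(normed)

# Hypothetical sublayer that simply halves its input
out = pre_ln_block(x, lambda h: [0.5 * v for v in h])
print(out)
```

The skip connection means the sublayer only has to learn an update to `x`, not a full replacement for it.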

In [56]:
class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, 
                                             config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()

    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [57]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()

torch.Size([1, 5, 768])

In [58]:
class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderLayer(config) 
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

In [59]:
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

torch.Size([1, 5, 768])

In [60]:
class TransformerForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        
    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of [CLS] token
        x = self.dropout(x)
        x = self.classifier(x)
        return x

In [61]:
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()

torch.Size([1, 3])

In [62]:
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
mask[0]

tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])

In [63]:
scores.masked_fill(mask == 0, -float("inf"))

tensor([[[26.3880,    -inf,    -inf,    -inf,    -inf],
         [ 1.0824, 29.0536,    -inf,    -inf,    -inf],
         [-1.4391,  0.8314, 27.2194,    -inf,    -inf],
         [-0.8588, -1.6071, -0.2110, 27.2913,    -inf],
         [-0.4795, -1.9412, -0.3605, -1.6157, 28.3139]]],
       grad_fn=<MaskedFillBackward0>)

In [64]:
def scaled_dot_product_attention(query, key, value, mask=None):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value)
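The effect of the mask can be verified on plain Python lists: a score of `-inf` becomes an attention weight of exactly zero after the softmax, so no token can attend to positions to its right. This is an illustrative sketch with hypothetical uniform scores and helper names.

```python
from math import exp

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def masked_weights(scores, mask):
    """Replace masked scores with -inf, then apply a row-wise softmax."""
    masked = [[s if m == 1 else float("-inf") for s, m in zip(s_row, m_row)]
              for s_row, m_row in zip(scores, mask)]
    return [softmax(row) for row in masked]

# Hypothetical uniform scores for a 3-token sequence
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_weights(scores, causal_mask(3))
for row in weights:
    print(row)
```

The first token can only attend to itself, the second splits its attention evenly over the first two positions, and only the last token sees the whole sequence.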

In [65]:
from datasets import get_dataset_config_names
xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")

XTREME has 183 configurations


In [66]:
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[5:16]

['PAN-X.el',
 'PAN-X.en',
 'PAN-X.es',
 'PAN-X.et',
 'PAN-X.eu',
 'PAN-X.fa',
 'PAN-X.fi',
 'PAN-X.fr',
 'PAN-X.he',
 'PAN-X.hi',
 'PAN-X.hu']

In [67]:
from datasets import load_dataset
load_dataset("xtreme", name="PAN-X.de")

Reusing dataset xtreme (C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.de\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 20000
    })
})

In [68]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# Return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows))))

Reusing dataset xtreme (C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.de\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.de\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-7318edec81f76aa6.arrow
Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.de\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-cbd29dccd93ef58f.arrow
Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.de\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-4433310f7a3b2793.arrow
Reusing dataset xtreme (C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.fr\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.fr\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-4a1996403248b4e2.arrow
Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.fr\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-5d4f9e5aefa05972.arrow
Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.fr\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-6789784a489dc7d6.arrow
Reusing dataset xtreme (C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.it\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.it\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-845df155c04c1192.arrow
Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.it\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-4038e5f0ccb7a363.arrow
Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.it\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-e220bc62f3b2de61.arrow
Reusing dataset xtreme (C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.en\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17)



Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.en\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-2292d48c0b6f8502.arrow
Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.en\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-56d73ebf7717cb83.arrow
Loading cached shuffled indices for dataset at C:\Users\Erhan\.cache\huggingface\datasets\xtreme\PAN-X.en\1.0.0\2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17\cache-5117c26f1eb0d215.arrow


In [69]:
import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs},
             index=["Number of training examples"])

Unnamed: 0,de,fr,it,en
Number of training examples,12580,4580,1680,1180


In [70]:
element = panx_ch["fr"]["train"][5]
for key, value in element.items():
    print(f"{key}: {value}")

tokens: ['Sertorius', 'se', 'trouve', 'alors', 'entouré', 'non', 'seulement', "d'ennemis", 'extérieurs', 'mais', 'aussi', 'intérieurs', '.']
ner_tags: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
langs: ['fr', 'fr', 'fr', 'fr', 'fr', 'fr', 'fr', 'fr', 'fr', 'fr', 'fr', 'fr', 'fr']


In [71]:
for key, value in panx_ch["de"]["train"].features.items():
    print(f"{key}: {value}")

tokens: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ner_tags: Sequence(feature=ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], names_file=None, id=None), length=-1, id=None)
langs: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)


In [72]:
tags = panx_ch["fr"]["train"].features["ner_tags"].feature
print(tags)

ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], names_file=None, id=None)


In [73]:
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

# Note: despite its name, panx_de holds the French ("fr") subset here
panx_de = panx_ch["fr"].map(create_tag_names)




In [74]:
import pandas as pd
de_example = panx_de["train"][5]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]],
             index=["Tokens", "Tags"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
Tokens,Sertorius,se,trouve,alors,entouré,non,seulement,d'ennemis,extérieurs,mais,aussi,intérieurs,.
Tags,B-PER,O,O,O,O,O,O,O,O,O,O,O,O


In [75]:
from collections import Counter, defaultdict

split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1
pd.DataFrame.from_dict(split2freqs, orient="index")

Unnamed: 0,PER,ORG,LOC
validation,969,878,1085
test,1045,885,1130
train,2059,1758,2301


In [76]:
from transformers import AutoTokenizer

bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

In [77]:
text = "L’une est professeur, l’autre travaille dans le tourisme. Une semaine par an,\
Alda Kirstensdottir et Tina Magnusson, 34 et 37 ans, viennent bénévolement garder \
le refuge de la baie de Breidavik"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()

In [78]:
df = pd.DataFrame([bert_tokens, xlmr_tokens], index=["BERT", "XLM-R"])
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,63,64,65,66,67,68,69,70,71,72
BERT,[CLS],L,’,une,est,pro,##fe,##sse,##ur,",",...,la,b,##ai,##e,de,B,##re,##ida,##vik,[SEP]
XLM-R,<s>,▁L,’,une,▁est,▁professeur,",",▁l,’,autre,...,,,,,,,,,,


In [79]:
"".join(xlmr_tokens).replace(u"\u2581", " ")

'<s> L’une est professeur, l’autre travaille dans le tourisme. Une semaine par an,Alda Kirstensdottir et Tina Magnusson, 34 et 37 ans, viennent bénévolement garder le refuge de la baie de Breidavik</s>'
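
The join-and-replace step above works because SentencePiece (used by XLM-R) marks the start of each word with U+2581, so whitespace can be restored. WordPiece (used by BERT) instead marks continuation pieces with "##" and does not record the original spacing. The sketch below contrasts the two on a few tokens copied by hand from the table above; the function names are hypothetical helpers, not part of the transformers library.

```python
def detokenize_sentencepiece(tokens):
    """Join SentencePiece tokens; U+2581 marks the start of a new word."""
    return "".join(tokens).replace("\u2581", " ").strip()


def detokenize_wordpiece(tokens):
    """Join WordPiece tokens; a leading '##' marks a continuation piece."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Glue the continuation piece onto the previous word
            words[-1] += token[2:]
        else:
            words.append(token)
    # Original spacing is not recoverable, so separate every word by a space
    return " ".join(words)


# Hand-copied samples (special tokens such as <s> and [CLS] left out)
xlmr_sample = ["\u2581L", "’", "une", "\u2581est", "\u2581professeur", ","]
bert_sample = ["L", "’", "une", "est", "pro", "##fe", "##sse", "##ur", ","]

print(detokenize_sentencepiece(xlmr_sample))  # L’une est professeur,
print(detokenize_wordpiece(bert_sample))      # L ’ une est professeur ,
```

The WordPiece result keeps a space around the apostrophe and the comma, which illustrates why the SentencePiece output can be mapped back to the raw text while the WordPiece output cannot.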

In [80]:
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel

class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
    config_class = XLMRobertaConfig

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        # Load model body
        self.roberta = RobertaModel(config, add_pooling_layer=False)
        # Set up token classification head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Load and initialize weights
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, 
                labels=None, **kwargs):
        # Use model body to get encoder representations
        outputs = self.roberta(input_ids, attention_mask=attention_mask,
                               token_type_ids=token_type_ids, **kwargs)
        # Apply classifier to encoder representation
        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output)
        # Calculate losses
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        # Return model output object
        return TokenClassifierOutput(loss=loss, logits=logits, 
                                     hidden_states=outputs.hidden_states, 
                                     attentions=outputs.attentions)

In [81]:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}
     

In [82]:
from transformers import AutoConfig

xlmr_config = AutoConfig.from_pretrained(xlmr_model_name, 
                                         num_labels=tags.num_classes,
                                         id2label=index2tag, label2id=tag2index)

In [83]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xlmr_model = (XLMRobertaForTokenClassification
              .from_pretrained(xlmr_model_name, config=xlmr_config)
              .to(device))

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForTokenClassification: ['lm_head.dense.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifie

In [120]:
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,42,43,44,45,46,47,48,49,50,51
Tokens,<s>,▁L,’,une,▁est,▁professeur,",",▁l,’,autre,...,▁le,▁refuge,▁de,▁la,▁baie,▁de,▁Bre,ida,vik,</s>
Input IDs,0,339,26,1811,437,166104,4,96,26,34773,...,95,211190,8,21,7348,8,6499,1683,5342,2


In [121]:
outputs = xlmr_model(input_ids.to(device)).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of outputs: {outputs.shape}")

Number of tokens in sequence: 52
Shape of outputs: torch.Size([1, 52, 7])


In [122]:
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,42,43,44,45,46,47,48,49,50,51
Tokens,<s>,▁L,’,une,▁est,▁professeur,",",▁l,’,autre,...,▁le,▁refuge,▁de,▁la,▁baie,▁de,▁Bre,ida,vik,</s>
Tags,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,...,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG


In [123]:
def tag_text(text, tags, model, tokenizer):
    # Get tokens, including special tokens such as <s> and </s>
    tokens = tokenizer(text).tokens()
    # Encode the sequence into IDs with the tokenizer that was passed in
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # Get logits over the 7 possible classes
    outputs = model(input_ids)[0]
    # Take argmax to get most likely class per token
    predictions = torch.argmax(outputs, dim=2)
    # Convert to DataFrame
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
    return pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])

In [124]:
words, labels = de_example["tokens"], de_example["ner_tags"]

In [125]:
tokenized_input = xlmr_tokenizer(de_example["tokens"], is_split_into_words=True)
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

In [126]:
pd.DataFrame([tokens], index=["Tokens"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
Tokens,<s>,▁Ser,torius,▁se,▁trouve,▁alors,▁en,tour,é,▁non,...,▁,extérieur,s,▁mais,▁aussi,▁intérieur,s,▁,.,</s>


In [127]:
word_ids = tokenized_input.word_ids()
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
Tokens,<s>,▁Ser,torius,▁se,▁trouve,▁alors,▁en,tour,é,▁non,...,▁,extérieur,s,▁mais,▁aussi,▁intérieur,s,▁,.,</s>
Word IDs,,0,0,1,2,3,4,4,4,5,...,8,8,8,9,10,11,11,12,12,


In [128]:
previous_word_idx = None
label_ids = []

for word_idx in word_ids:
    # Mask special tokens (None) and non-initial subword pieces with -100
    if word_idx is None or word_idx == previous_word_idx:
        label_ids.append(-100)
    else:
        label_ids.append(labels[word_idx])
    previous_word_idx = word_idx

# Use a separate name so the integer labels are not overwritten on re-runs
label_names = [index2tag[l] if l != -100 else "IGN" for l in label_ids]
index = ["Tokens", "Word IDs", "Label IDs", "Labels"]

pd.DataFrame([tokens, word_ids, label_ids, label_names], index=index)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
Tokens,<s>,▁Ser,torius,▁se,▁trouve,▁alors,▁en,tour,é,▁non,...,▁,extérieur,s,▁mais,▁aussi,▁intérieur,s,▁,.,</s>
Word IDs,,0,0,1,2,3,4,4,4,5,...,8,8,8,9,10,11,11,12,12,
Label IDs,-100,1,-100,0,0,0,0,-100,-100,0,...,0,-100,-100,0,0,0,-100,0,-100,-100
Labels,IGN,B-PER,IGN,O,O,O,O,IGN,IGN,O,...,O,IGN,IGN,O,O,O,IGN,O,IGN,IGN
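
The value -100 in the table above is not arbitrary: it is the default ignore_index of PyTorch's nn.CrossEntropyLoss, so masked positions contribute nothing to the loss. The following is a minimal pure-Python sketch of that behavior, assuming mean reduction over the unmasked positions; masked_cross_entropy and the toy logits are hypothetical and only meant to make the masking visible.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: no contribution to the loss
        # Numerically stable log-sum-exp for the softmax normaliser
        row_max = max(row)
        log_norm = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        losses.append(log_norm - row[label])
    if not losses:
        return float("nan")  # every position was masked
    return sum(losses) / len(losses)


# Toy example: three token positions, two classes, last two positions masked
logits = [[2.0, 0.0], [0.0, 0.0], [5.0, -5.0]]
labels = [0, -100, -100]

full = masked_cross_entropy(logits, labels)
first_only = masked_cross_entropy(logits[:1], labels[:1])
print(full == first_only)  # True
```

Because only the first subword of each word keeps its label, each word is counted once in the loss no matter how many pieces the tokenizer splits it into.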


In [129]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True, 
                                      is_split_into_words=True)
    labels = []
    for idx, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
     

In [130]:
def encode_panx_dataset(corpus):
    return corpus.map(tokenize_and_align_labels, batched=True, 
                      remove_columns=['langs', 'ner_tags', 'tokens'])

In [132]:
panx_de_encoded = encode_panx_dataset(panx_ch["de"])


In [133]:
from seqeval.metrics import classification_report

y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

        MISC       0.00      0.00      0.00         1
         PER       1.00      1.00      1.00         1

   micro avg       0.50      0.50      0.50         2
   macro avg       0.50      0.50      0.50         2
weighted avg       0.50      0.50      0.50         2
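
The MISC row scores 0.00 even though most of its tokens are tagged correctly, because the report is computed over whole entities rather than individual tokens: the predicted MISC span starts one token too early, so it does not match the reference span. Below is a rough sketch of that entity-level comparison in plain Python. The extract_entities helper is a hypothetical simplification and is not seqeval's implementation.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO-tagged sequence."""
    entities = set()
    start, current = None, None
    # A trailing "O" sentinel closes any span that runs to the end
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            entities.add((current, start, i))
            current = None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    return entities


y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]

# Prefix each span with its sentence index so spans from different sentences differ
true_spans = {(i, *e) for i, seq in enumerate(y_true) for e in extract_entities(seq)}
pred_spans = {(i, *e) for i, seq in enumerate(y_pred) for e in extract_entities(seq)}
correct = true_spans & pred_spans

precision = len(correct) / len(pred_spans)
recall = len(correct) / len(true_spans)
print(precision, recall)  # 0.5 0.5
```

Only the PER span matches exactly, which gives one correct entity out of two predicted and two reference entities, in line with the micro average in the report.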



In [134]:
import numpy as np

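def align_predictions(predictions, label_ids):
    """Convert logits and label IDs into per-example lists of tag strings,
    skipping positions whose label ID is -100."""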
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    labels_list, preds_list = [], []

    for batch_idx in range(batch_size):
        example_labels, example_preds = [], []
        for seq_idx in range(seq_len):
            # Ignore label IDs = -100
            if label_ids[batch_idx, seq_idx] != -100:
                example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
                example_preds.append(index2tag[preds[batch_idx][seq_idx]])

        labels_list.append(example_labels)
        preds_list.append(example_preds)

    return preds_list, labels_list

In [135]:
from transformers import TrainingArguments

num_epochs = 3
batch_size = 24
logging_steps = len(panx_de_encoded["train"]) // batch_size
model_name = f"{xlmr_model_name}-finetuned-panx-de"
training_args = TrainingArguments(
    output_dir=model_name, log_level="error", num_train_epochs=num_epochs, 
    per_device_train_batch_size=batch_size, 
    per_device_eval_batch_size=batch_size, evaluation_strategy="epoch", 
    save_steps=1e6, weight_decay=0.01, disable_tqdm=False, 
    logging_steps=logging_steps, push_to_hub=True)

In [20]:
from huggingface_hub import notebook_login

notebook_login()


In [29]:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

In [34]:
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
print(tokenized_text)

In [35]:
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2idx)

In [36]:
input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)

In [37]:
categorical_df = pd.DataFrame(
    {"Name": ["Bumblebee", "Optimus Prime", "Megatron"], "Label ID": [0,1,2]})
categorical_df

Unnamed: 0,Name,Label ID
0,Bumblebee,0
1,Optimus Prime,1
2,Megatron,2


In [38]:
pd.get_dummies(categorical_df["Name"])

Unnamed: 0,Bumblebee,Megatron,Optimus Prime
0,1,0,0
1,0,0,1
2,0,1,0
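
The same one-hot idea can be applied to the character-level input_ids built earlier, so that no artificial ordering is implied between token IDs. Here is a small sketch in plain Python that rebuilds the vocabulary from the same sentence; the one_hot helper is hypothetical, and in practice a tensor library would normally do this step.

```python
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = [token2idx[token] for token in tokenized_text]


def one_hot(ids, num_classes):
    """Return one row per ID with a single 1 at the ID's position."""
    return [[1 if col == idx else 0 for col in range(num_classes)]
            for idx in ids]


encodings = one_hot(input_ids, num_classes=len(token2idx))

# One row per character, one column per vocabulary entry
print(len(encodings), len(encodings[0]))
print(f"Token: {tokenized_text[0]}")
print(f"Input ID: {input_ids[0]}")
print(f"One-hot: {encodings[0]}")
```

Each row contains exactly one 1, at the column given by the corresponding input ID.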
