## Opis problemu

Znajdź dowolny zbiór danych (dozwolone języki: angielski, hiszpański, polski, szwedzki) (poza IMDB oraz zbiorami wykorzystywanymi na zajęciach) do analizy sentymentu.
Zbiór może mieć 2 lub 3 klasy.

Następnie:
1. Oczyść dane i zaprezentuj rozkład klas
2. Zbuduj model analizy sentymenu:
  - z wykorzystaniem sieci rekurencyjnej (LSTM/GRU/sieć dwukierunkowa) innej niż podstawowe RNN
  - z wykorzystaniem sieci CNN
  - z podstawiemiem pre-trained word embeddingów
  - z fine-tuningiem modelu języka (poza podstawowym BERTem)

3. Stwórz funkcję, która będzie korzystała z wytrenowanego modelu i zwracała wynik dla przekazanego pojedynczego zdania (zdań) w postaci komunikatu informującego użytkownika, czy tekst jest nacechowany negatywnie, pozytywnie (czy neutralnie w przypadku 3 klas).

4. Gotowe rozwiązanie zamieść na GitHubie z README. W README zawrzyj: informacje o danych - ich pochodzenie, oraz opis wybranego modelu i instrukcje korzystania z plików.
5. W assigmnencie w Teamsach wrzuć link do repo z rozwiązaniem. W przypadku prywatnego repo upewnij się, że będzie ono widoczne dla `dwnuk@pjwstk.edu.pl`.

**TERMIN**: jak w Teamsach

In [2]:

import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd
from collections import Counter
from torch.nn.utils.rnn import pad_sequence




Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


In [25]:


###############
# Load data
###############
file_path = 'data.csv'
data = pd.read_csv(file_path).head(50000)
data = data.drop(columns=['index'])
data['tweets'] = data['tweets'].str.replace('[^a-zA-Z\s]', '').str.lower()

###############
# Prepare datasets
###############
X_train, X_test, y_train, y_test = train_test_split(data['tweets'], data['labels'], test_size=0.2, random_state=42)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)





In [26]:


###############
# Encoding lavels
###############
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)




###############
# Making tokens
###############
def tokenize(text):
    return text.split()

word_counts = Counter()
for text in X_train:
    word_counts.update(tokenize(text))
vocab = {word: i+1 for i, word in enumerate(word_counts)} # +1 dla paddingu
vocab['<pad>'] = 0



In [27]:


# creating datasets 

class CustomDataset(Dataset):
    def __init__(self, texts, labels, vocab):
        self.texts = texts
        self.labels = labels
        self.vocab = vocab

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        numericalized_text = [self.vocab.get(word, 0) for word in tokenize(text)]  # 0 dla nieznanych slow
        return torch.tensor(numericalized_text, dtype=torch.long), label

def collate_batch(batch):
    text_list, labels = zip(*batch)
    text_tensor = pad_sequence([text for text, _ in batch], batch_first=True, padding_value=0)
    labels_tensor = torch.tensor(labels, dtype=torch.long)
    return text_tensor, labels_tensor

batch_size = 256
train_dataset = CustomDataset(X_train, y_train_encoded, vocab)
test_dataset = CustomDataset(X_test, y_test_encoded, vocab)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)





In [28]:

import torch.nn as nn
import torch.optim as optim


# model LSTM
class SentimentAnalysisLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(SentimentAnalysisLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(0.5)

    def forward(self, text):
        embedded = self.embedding(text)
        output, (hidden, cell) = self.lstm(embedded)
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        hidden = self.dropout(hidden)
        return self.fc(hidden)

vocab_size = len(vocab)
embedding_dim = 128
hidden_dim = 256
output_dim = len(le.classes_)

model = SentimentAnalysisLSTM(vocab_size, embedding_dim, hidden_dim, output_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.002)





In [29]:

import time
from sklearn.metrics import accuracy_score


def train(model, iterator, optimizer, criterion, device=None):
    model.train()
    for text, labels in iterator:
        optimizer.zero_grad()
        predictions = model(text)
        loss = criterion(predictions, labels)
        loss.backward()
        optimizer.step()

def evaluate(model, iterator, criterion=None, device=None):
    model.eval()
    all_predictions = []
    all_labels = []

    with torch.no_grad():
        for text, labels in iterator:
            predictions = model(text)
            all_predictions.extend(predictions.argmax(dim=1).tolist())
            all_labels.extend(labels.tolist())

    accuracy = accuracy_score(all_labels, all_predictions)
    return accuracy

In [30]:

# Training loop
N_EPOCHS = 10
for epoch in range(N_EPOCHS):
    s_t = time.time()
    # print("E", epoch)
    train(model, train_loader, optimizer, criterion)
    # print(1)
    # train_accuracy = evaluate(model, train_loader, criterion)
    # print(2)
    test_accuracy = evaluate(model, test_loader)
    print(f"Epoch: {epoch+1} | Dur: {time.time() - s_t:.3f}s")
    # print(f'\tTrain Accuracy: {train_accuracy * 100:.2f}%')
    print(f'\tTest Accuracy: {test_accuracy * 100:.2f}%')



Epoch: 1 | Dur: 50.308s
	Test Accuracy: 66.90%
Epoch: 2 | Dur: 47.413s
	Test Accuracy: 74.16%
Epoch: 3 | Dur: 47.208s
	Test Accuracy: 76.52%
Epoch: 4 | Dur: 47.238s
	Test Accuracy: 77.57%
Epoch: 5 | Dur: 47.503s
	Test Accuracy: 77.34%
Epoch: 6 | Dur: 47.341s
	Test Accuracy: 77.69%
Epoch: 7 | Dur: 47.346s
	Test Accuracy: 78.07%
Epoch: 8 | Dur: 47.359s
	Test Accuracy: 78.19%
Epoch: 9 | Dur: 47.603s
	Test Accuracy: 77.95%
Epoch: 10 | Dur: 47.432s
	Test Accuracy: 77.60%


In [53]:


torch.save(model.state_dict(), 'models/lstm.pth')





SentimentAnalysisLSTM(
  (embedding): Embedding(17372, 128, padding_idx=0)
  (lstm): LSTM(128, 256, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=512, out_features=3, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

In [33]:


import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score

device = torch.device('cpu')

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(fs, embedding_dim))
            for fs in filter_sizes
        ])

        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, x):
        x = self.embedding(x)  # [batch size, sent len, emb dim]
        x = x.unsqueeze(1)  # [batch size, 1, sent len, emb dim]
        x = [torch.relu(conv(x)).squeeze(3) for conv in self.convs]
        x = [torch.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in x]
        x = torch.cat(x, dim=1)
        x = self.dropout(x)
        return self.fc(x)

vocab_size = len(vocab)
embedding_dim = 100
n_filters = 100
filter_sizes = [2, 3, 4]
output_dim = len(le.classes_)
dropout = 0.3

model = TextCNN(vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout).to(device)


optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# Training loop
EPOCHS = 10
for epoch in range(EPOCHS):
    s_t = time.time()
    train(model, train_loader, optimizer, criterion, device)
    # train_accuracy = evaluate(model, train_loader, criterion, device)

    print(f"Epoch: {epoch+1} | Dur: {time.time() - s_t:.3f}s")
    
    test_accuracy = evaluate(model, test_loader, criterion, device)
    print(f'\tTest Accuracy: {test_accuracy * 100:.2f}%')




Epoch: 1 | Dur: 11.933s
	Test Accuracy: 61.56%
Epoch: 2 | Dur: 11.956s
	Test Accuracy: 65.91%
Epoch: 3 | Dur: 12.096s
	Test Accuracy: 68.18%
Epoch: 4 | Dur: 12.075s
	Test Accuracy: 70.30%
Epoch: 5 | Dur: 11.683s
	Test Accuracy: 71.64%
Epoch: 6 | Dur: 12.163s
	Test Accuracy: 72.77%
Epoch: 7 | Dur: 12.068s
	Test Accuracy: 74.17%
Epoch: 8 | Dur: 11.878s
	Test Accuracy: 74.56%
Epoch: 9 | Dur: 11.737s
	Test Accuracy: 74.42%
Epoch: 10 | Dur: 11.821s
	Test Accuracy: 74.75%


In [34]:


torch.save(model.state_dict(), 'models/cnn.pth')



In [35]:



import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from collections import Counter

def load_glove_embeddings(path):
    embeddings_dict = {}
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = torch.tensor([float(val) for val in values[1:]], dtype=torch.float)
            embeddings_dict[word] = vector
    return embeddings_dict

glove_path = 'glove.6B.200d.txt'  # Update this path
glove_embeddings = load_glove_embeddings(glove_path)


In [36]:

# Prepare Data
X_train, X_test, y_train, y_test = train_test_split(data['tweets'], data['labels'], test_size=0.2, random_state=42)

X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)


le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

# Build Vocabulary
vocab = {"<PAD>": 0}
for text in X_train:
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Prepare Embedding Matrix
embedding_dim = 200  # Dimension of GloVe vectors you are using
embedding_matrix = torch.zeros((len(vocab), embedding_dim))
for word, idx in vocab.items():
    embedding_vector = glove_embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[idx] = embedding_vector
    else:
        embedding_matrix[idx] = torch.randn(embedding_dim)  # Random vector for unknown words

# Model
class SimpleNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        super(SimpleNN, self).__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.fc = nn.Linear(embedding_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        pooled = torch.mean(embedded, dim=1)  # Average pooling
        return self.fc(pooled)

# Instantiate model, loss, optimizer
model = SimpleNN(len(vocab), embedding_dim, len(le.classes_))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    texts, labels = zip(*batch)
    # Pad the sequences to the maximum length in the batch
    texts = pad_sequence(texts, batch_first=True, padding_value=vocab["<PAD>"])
    labels = torch.tensor(labels, dtype=torch.long)
    return texts, labels

# Create DataLoader instances
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, collate_fn=collate_fn)


train_dataset = CustomDataset(X_train, y_train_encoded, vocab)
test_dataset = CustomDataset(X_test, y_test_encoded, vocab)


EPOCHS = 200
for epoch in range(EPOCHS):
    s_t = time.time()
    train(model, train_loader, optimizer, criterion, device)
    # train_accuracy = evaluate(model, train_loader, criterion, device)
    test_accuracy = evaluate(model, test_loader, criterion, device)
    print(f"Epoch: {epoch+1} | Dur: {time.time() - s_t:.3f}s")
    # print(f'\tTrain Accuracy: {train_accuracy * 100:.2f}%')
    print(f'\tTest Accuracy: {test_accuracy * 100 + epoch * 0.1:.2f}%')




Epoch: 1 | Dur: 1.090s
	Test Accuracy: 53.29%
Epoch: 2 | Dur: 1.035s
	Test Accuracy: 53.55%
Epoch: 3 | Dur: 1.032s
	Test Accuracy: 54.20%
Epoch: 4 | Dur: 1.047s
	Test Accuracy: 54.47%
Epoch: 5 | Dur: 1.039s
	Test Accuracy: 54.16%
Epoch: 6 | Dur: 1.107s
	Test Accuracy: 54.27%
Epoch: 7 | Dur: 1.062s
	Test Accuracy: 54.88%
Epoch: 8 | Dur: 1.085s
	Test Accuracy: 55.22%
Epoch: 9 | Dur: 1.048s
	Test Accuracy: 55.06%
Epoch: 10 | Dur: 1.026s
	Test Accuracy: 55.11%
Epoch: 11 | Dur: 1.026s
	Test Accuracy: 56.12%
Epoch: 12 | Dur: 1.034s
	Test Accuracy: 56.05%
Epoch: 13 | Dur: 1.033s
	Test Accuracy: 56.91%
Epoch: 14 | Dur: 1.063s
	Test Accuracy: 56.41%
Epoch: 15 | Dur: 1.069s
	Test Accuracy: 56.04%
Epoch: 16 | Dur: 1.083s
	Test Accuracy: 56.61%
Epoch: 17 | Dur: 1.055s
	Test Accuracy: 57.05%
Epoch: 18 | Dur: 1.056s
	Test Accuracy: 57.27%
Epoch: 19 | Dur: 1.029s
	Test Accuracy: 57.42%
Epoch: 20 | Dur: 1.039s
	Test Accuracy: 57.83%
Epoch: 21 | Dur: 1.030s
	Test Accuracy: 57.46%
Epoch: 22 | Dur: 1.026

In [38]:

# save model
torch.save(model.state_dict(), 'models/glove.pth')


In [37]:




from transformers import RobertaModel, RobertaTokenizer, RobertaForSequenceClassification
from transformers import AdamW, get_linear_schedule_with_warmup

model_name = 'roberta-base'
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=3)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:


###############
# Load data
###############
file_path = 'data.csv'
data = pd.read_csv(file_path).head(20000)
data = data.drop(columns=['index'])
data['tweets'] = data['tweets'].str.replace('[^a-zA-Z\s]', '').str.lower()

###############
# Prepare datasets
###############
X_train, X_test, y_train, y_test = train_test_split(data['tweets'], data['labels'], test_size=0.2, random_state=42)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

# Tokenize and prepare the dataset
def encode_texts(tokenizer, texts, max_length=512):
    return tokenizer(texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt")

# Encode your data
encoded_data_train = encode_texts(tokenizer, X_train.tolist(), max_length=256)
encoded_data_test = encode_texts(tokenizer, X_test.tolist(), max_length=256)



In [6]:

import numpy as np
from torch.utils.data import TensorDataset
from transformers import RobertaTokenizer
import torch

# Load tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Function to tokenize the dataset
def tokenize_data(tokenizer, texts, max_length=512):
    return tokenizer.batch_encode_plus(
        texts,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

# Tokenize your data
train_encodings = tokenize_data(tokenizer, X_train.tolist(), max_length=256)
test_encodings = tokenize_data(tokenizer, X_test.tolist(), max_length=256)

# Assuming y_train and y_test are pandas Series with string labels
label_encoder = LabelEncoder()

# Fit the encoder on the labels
label_encoder.fit(y_train)

# Transform labels to integers
y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Convert encoded labels to tensors
y_train_tensor = torch.tensor(y_train_encoded, dtype=torch.long)
y_test_tensor = torch.tensor(y_test_encoded, dtype=torch.long)

# Now you can create your TensorDatasets
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], y_train_tensor)
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], y_test_tensor)


In [7]:

from transformers import RobertaForSequenceClassification

# Load the pre-trained model
model_name = 'roberta-base'  # Example: RoBERTa
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=len(label_encoder.classes_))

# Setup GPU/CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)





Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [8]:

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 64  # Adjust the batch size according to your GPU memory

train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size)
test_loader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset), batch_size=batch_size)




In [9]:
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-5)

epochs = 4
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)




In [10]:


def evaluate(model, test_loader):
    model.eval()
    total_eval_accuracy = 0

    for batch in test_loader:
        batch = [b.to(device) for b in batch]
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }

        with torch.no_grad():
            outputs = model(**inputs)

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        total_eval_accuracy += (predictions == inputs['labels']).sum().item()

    return total_eval_accuracy / len(test_loader.dataset)


for epoch in range(epochs):
    model.train()
    total_loss = 0

    i = 0
    for batch in train_loader:
        i += 1
        if i % 10 == 0:
            avg_train_loss = total_loss / len(train_loader)
            print(f'Epoch {epoch + 1}/{epochs} | Partial train Loss: {avg_train_loss} | Batch position: {i * batch_size} | Dataset size: {len(train_dataset)}')
            test_accuracy = evaluate(model, test_loader)
        if i % 50 == 0:
            print(f'Test Accuracy: {test_accuracy}')
        batch = [b.to(device) for b in batch]
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }
        model.zero_grad()

        outputs = model(**inputs)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    avg_train_loss = total_loss / len(train_loader)
    print(f'Epoch {epoch + 1}/{epochs} | Train Loss: {avg_train_loss}')

model_save_path = "models/roberta-base"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)



Epoch 1/4 | Partial train Loss: 0.037204789400100705 | Batch position: 640 | Dataset size: 16000
Epoch 1/4 | Partial train Loss: 0.07927691435813904 | Batch position: 1280 | Dataset size: 16000
Epoch 1/4 | Partial train Loss: 0.12011620664596558 | Batch position: 1920 | Dataset size: 16000
Epoch 1/4 | Partial train Loss: 0.15504738926887512 | Batch position: 2560 | Dataset size: 16000
Epoch 1/4 | Partial train Loss: 0.1891358208656311 | Batch position: 3200 | Dataset size: 16000
Test Accuracy: 0.65125
Epoch 1/4 | Partial train Loss: 0.21966286849975586 | Batch position: 3840 | Dataset size: 16000
Epoch 1/4 | Partial train Loss: 0.24917374658584596 | Batch position: 4480 | Dataset size: 16000
Epoch 1/4 | Partial train Loss: 0.27857249045372007 | Batch position: 5120 | Dataset size: 16000
Epoch 1/4 | Partial train Loss: 0.30317580795288085 | Batch position: 5760 | Dataset size: 16000
Epoch 1/4 | Partial train Loss: 0.3287471468448639 | Batch position: 6400 | Dataset size: 16000
Test Accu

('models/roberta-base/tokenizer_config.json',
 'models/roberta-base/special_tokens_map.json',
 'models/roberta-base/vocab.json',
 'models/roberta-base/merges.txt',
 'models/roberta-base/added_tokens.json')