
Can't load tokenizer locally after downloading it #11243

Closed
jiwidi opened this issue Apr 14, 2021 · 4 comments

Comments

jiwidi commented Apr 14, 2021

Hi!

I'm following the tutorial for this pretrained model: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment. It works the first time I run it (and download the tokenizer), but on subsequent runs it complains that there is no tokenizer at the specified path.

The code is the following:

from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Tasks:
# emoji, emotion, hate, irony, offensive, sentiment
# stance/abortion, stance/atheism, stance/climate, stance/feminist, stance/hillary

task='sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# download label mapping
labels=[]
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)

# text = "Good night 😊"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

It fails on tokenizer = AutoTokenizer.from_pretrained(MODEL) with this output:

2021-04-13 21:43:03.723523: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "train.py", line 27, in <module>
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
  File "/home/jiwidi/anaconda3/envs/cuda/lib/python3.7/site-packages/transformers/models/auto/tokenization_auto.py", line 423, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/jiwidi/anaconda3/envs/cuda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1698, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for '/mnt/kingston/github/MIARFID/ALC/cardiffnlp/twitter-roberta-base-sentiment'. Make sure that:

- '/mnt/kingston/github/MIARFID/ALC/cardiffnlp/twitter-roberta-base-sentiment' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/mnt/kingston/github/MIARFID/ALC/cardiffnlp/twitter-roberta-base-sentiment' is the correct path to a directory containing relevant tokenizer files

After running the script train.py, the files are downloaded to the path the script is in. The path structure is like this:

├── cardiffnlp
│   └── twitter-roberta-base-sentiment
│       ├── config.json
│       └── pytorch_model.bin
└── train.py

I have transformers version 4.5.1

LysandreJik (Member) commented

Hi, that's because the tokenizer first looks to see if the path specified is a local path. Since you're saving your model on a path with the same identifier as the hub checkpoint, when you're re-running the script both the model and tokenizer will look into that folder.

The tokenizer doesn't find anything in there, as you've only saved the model, not the tokenizer. You should either save the tokenizer as well, or change the path so that it isn't mistaken for a local path when it should be a hub identifier.
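To illustrate the lookup order described above, here is a simplified sketch (not the library's actual code) of why a local folder whose name matches the hub identifier shadows it on the second run:

```python
import os
import tempfile

def resolve_checkpoint(name_or_path):
    """Toy version of from_pretrained's first decision:
    an existing local directory wins over a hub identifier."""
    if os.path.isdir(name_or_path):
        return ("local", name_or_path)
    return ("hub", name_or_path)

with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "cardiffnlp/twitter-roberta-base-sentiment")

    # First run: no such folder exists yet, so the string is treated as a hub id.
    print(resolve_checkpoint(checkpoint)[0])  # hub

    # model.save_pretrained(MODEL) creates that folder on the first run...
    os.makedirs(checkpoint)

    # ...so the second run resolves the same string as a local path, and the
    # tokenizer then fails because only model files are in that folder.
    print(resolve_checkpoint(checkpoint)[0])  # local
```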

jiwidi (Author) commented Apr 14, 2021

> Hi, that's because the tokenizer first looks to see if the path specified is a local path. Since you're saving your model on a path with the same identifier as the hub checkpoint, when you're re-running the script both the model and tokenizer will look into that folder.
>
> The tokenizer doesn't find anything in there, as you've only saved the model, not the tokenizer. You should either save the tokenizer as well, or change the path so that it isn't mistaken for a local path when it should be the hub.

How could I also save the tokenizer? I'm a newbie with the transformers library and I took that code from the model's webpage.

LysandreJik (Member) commented

You can add tokenizer.save_pretrained(MODEL) right under the model's save_pretrained!
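Concretely, the download portion of the script would then persist both artifacts, so a re-run can load everything from the local folder that now shadows the hub id (a sketch of the suggested fix; requires network access on the first run):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Save both the model and the tokenizer to the same directory, so the
# second run finds tokenizer files there instead of raising OSError.
model.save_pretrained(MODEL)
tokenizer.save_pretrained(MODEL)
```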

jiwidi closed this as completed Apr 27, 2021
jingenqi commented

i love you LysandreJik
