
Can't load tokenizer locally after downloading it #11243

Closed
jiwidi opened this issue Apr 14, 2021 · 4 comments

Comments

jiwidi commented Apr 14, 2021

Hi!

I'm following the tutorial for this pretrained model: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment. It works the first time I run it (and download the tokenizer), but on subsequent runs it complains that there is no tokenizer at the specified path.

The code is the following:

from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Tasks:
# emoji, emotion, hate, irony, offensive, sentiment
# stance/abortion, stance/atheism, stance/climate, stance/feminist, stance/hillary

task='sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# download label mapping
labels=[]
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)

# text = "Good night 😊"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

It fails on tokenizer = AutoTokenizer.from_pretrained(MODEL) with this output:

2021-04-13 21:43:03.723523: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "train.py", line 27, in <module>
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
  File "/home/jiwidi/anaconda3/envs/cuda/lib/python3.7/site-packages/transformers/models/auto/tokenization_auto.py", line 423, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/jiwidi/anaconda3/envs/cuda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1698, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for '/mnt/kingston/github/MIARFID/ALC/cardiffnlp/twitter-roberta-base-sentiment'. Make sure that:

- '/mnt/kingston/github/MIARFID/ALC/cardiffnlp/twitter-roberta-base-sentiment' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/mnt/kingston/github/MIARFID/ALC/cardiffnlp/twitter-roberta-base-sentiment' is the correct path to a directory containing relevant tokenizer files

After running the script train.py, the files are downloaded to the path the script is in. The path structure is like this:

├── cardiffnlp
│   └── twitter-roberta-base-sentiment
│       ├── config.json
│       └── pytorch_model.bin
└── train.py

I have transformers version 4.5.1

LysandreJik (Member) commented

Hi, that's because the tokenizer first looks to see if the path specified is a local path. Since you're saving your model on a path with the same identifier as the hub checkpoint, when you're re-running the script both the model and tokenizer will look into that folder.

The tokenizer doesn't find anything in there, as you've only saved the model, not the tokenizer. You should either save the tokenizer as well, or change the path so that it isn't mistaken for a local path when it should be a hub identifier.
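To illustrate the lookup order described above, here is a simplified sketch (not the library's actual code) of why a local folder whose name matches the hub identifier shadows it on the second run:

```python
import os
import tempfile

def resolve_checkpoint(name_or_path):
    """Toy version of from_pretrained's first decision:
    an existing local directory wins over a hub identifier."""
    if os.path.isdir(name_or_path):
        return ("local", name_or_path)
    return ("hub", name_or_path)

with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "cardiffnlp/twitter-roberta-base-sentiment")

    # First run: no such folder exists yet, so the string is treated as a hub id.
    print(resolve_checkpoint(checkpoint)[0])  # hub

    # model.save_pretrained(MODEL) creates that folder on the first run...
    os.makedirs(checkpoint)

    # ...so the second run resolves the same string as a local path, and the
    # tokenizer then fails because only model files are in that folder.
    print(resolve_checkpoint(checkpoint)[0])  # local
```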

jiwidi (Author) commented Apr 14, 2021

> Hi, that's because the tokenizer first looks to see if the path specified is a local path. Since you're saving your model on a path with the same identifier as the hub checkpoint, when you're re-running the script both the model and tokenizer will look into that folder.
>
> The tokenizer doesn't find anything in there, as you've only saved the model, not the tokenizer. You should either save the tokenizer as well, or change the path so that it isn't mistaken for a local path when it should be the hub.

How could I also save the tokenizer? I'm a newbie with the transformers library and I took that code from the model's webpage.

LysandreJik (Member) commented

You can add tokenizer.save_pretrained(MODEL) right under the model's save_pretrained!
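Concretely, the download portion of the script would then persist both artifacts, so a re-run can load everything from the local folder that now shadows the hub id (a sketch of the suggested fix; requires network access on the first run):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Save both the model and the tokenizer to the same directory, so the
# second run finds tokenizer files there instead of raising OSError.
model.save_pretrained(MODEL)
tokenizer.save_pretrained(MODEL)
```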

jiwidi closed this as completed Apr 27, 2021
jingenqi commented

i love you LysandreJik
