Queries about the Notation and Model training of T5 and ELECTRA sentiment classification. #3704
Comments
Hi! Cased vs. uncased refers to whether a checkpoint (and its tokenizer) preserves capitalization or lowercases everything:
from transformers import AutoTokenizer
uncased_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
cased_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(uncased_tokenizer.tokenize("Hi, this is Lysandre"))
# ['hi', ',', 'this', 'is', 'l', '##ys', '##and', '##re'] <-- notice how uppercase letters are now lowercased
print(cased_tokenizer.tokenize("Hi, this is Lysandre"))
# ['Hi', ',', 'this', 'is', 'L', '##ys', '##and', '##re']
XLM models are usually multilingual, which is the case for those you mentioned: the enfr / enro part names the language pair the checkpoint was trained on (English-French, English-Romanian), and the following number is the hidden size, e.g. 1024.
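For instance, a minimal sketch (assuming the checkpoint in question is xlm-mlm-enfr-1024, an English-French XLM-MLM model) to confirm that the trailing number matches the hidden size:
from transformers import AutoConfig

# Assumed checkpoint name; the trailing 1024 should match the embedding size in its config.
config = AutoConfig.from_pretrained("xlm-mlm-enfr-1024")
print(config.emb_dim)  # expected: 1024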
As @craffel said in the issue you mentioned, T5 was trained with SST-2, so it should work out-of-the-box if you follow what he described there. There is currently no ELECTRA head for sequence classification in the library. If you're looking for easy sentiment classification, please take a look at the pipelines and at the already fine-tuned sequence classification models, especially those for sentiment classification. |
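For example, a quick sketch of the pipeline route (it downloads whichever default sentiment model the library ships, so treat it as illustrative):
from transformers import pipeline

# The sentiment-analysis pipeline wraps an already fine-tuned sequence
# classification model, so no training is required.
classifier = pipeline("sentiment-analysis")
print(classifier("This movie was great! I loved the acting."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]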
@LysandreJik thanks, it was helpful 🙂 |
Hi, it is easy to use the pre-trained T5 models for sentiment ID. You could do something like
import transformers

MODEL_NAME = "t5-base"
model = transformers.T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)

input_text = "sst2 sentence: This movie was great! I loved the acting."
inputs = tokenizer.encode_plus(input_text, return_token_type_ids=False, return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs)[0]))

input_text = "sst2 sentence: The acting was so bad in this movie I left immediately."
inputs = tokenizer.encode_plus(input_text, return_token_type_ids=False, return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs)[0]))
The generated text should contain "positive" for the first input and "negative" for the second. |
Hi @craffel, thanks for your quick response and the intuitive code snippet. As I said, I am trying to implement T5 for a binary classification task.

Model loading:
import numpy as np
import tensorflow as tf
import transformers
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

MODEL_NAME = "t5-base"
transformer_layer = transformers.T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)

A general encoder:
def regular_encode(texts, tokenizer, maxlen=512):
    enc_di = tokenizer.batch_encode_plus(
        texts,
        return_attention_masks=False,
        return_token_type_ids=False,
        pad_to_max_length=True,
        max_length=maxlen
    )
    return np.array(enc_di['input_ids'])

Build the model (as per my task):
def build_model(transformer, max_len=190):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32)
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    return model

Tokenize the data and grab the targets (0/1):
x_train = regular_encode(data.text, tokenizer, maxlen=190)
y_train = data.target.values  # (0, 1)

model = build_model(transformer_layer, max_len=190)
model.fit(...)
model.predict(...)

I'm sure I'm missing some crucial part that is not being considered here. |
No, the target should always be text for T5. You should map your 0/1 labels to the words "negative" and "positive" and fine-tune T5 to predict those words, and then map them back to 0/1 after the model outputs the text if needed. This is the point of the text-to-text framework - all tasks take text as input and produce text as output. So, for example, your "build model" code should not include a dense layer with a sigmoid output, etc. There is no modification to the model structure necessary whatsoever.
Yes, that is the intention. |
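A minimal sketch of that label mapping, assuming a recent transformers version (where the tokenizer is callable) and the same t5-base checkpoint as above; the training step is only illustrative, not a full training loop:
import transformers

MODEL_NAME = "t5-base"
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
model = transformers.T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

label_map = {0: "negative", 1: "positive"}  # 0/1 labels become label words

texts = ["The acting was so bad I left immediately.", "This movie was great!"]
labels = [0, 1]

# Inputs keep the task prefix; targets are the label words, not the integers.
enc = tokenizer(["sst2 sentence: " + t for t in texts], padding=True, return_tensors="pt")
targets = tokenizer([label_map[y] for y in labels], padding=True, return_tensors="pt")

# One standard seq2seq fine-tuning step: the decoder learns to emit the label word.
outputs = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=targets.input_ids)
outputs.loss.backward()

# At inference time, generate text and map it back to 0/1 if needed.
pred = tokenizer.decode(
    model.generate(**tokenizer("sst2 sentence: " + texts[1], return_tensors="pt"))[0],
    skip_special_tokens=True)
pred_label = {"negative": 0, "positive": 1}.get(pred.strip())
print(pred, pred_label)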
@LysandreJik @craffel
All inputs are converted to this format |
Hi @vapyc, this seems to be an unrelated issue. Would you mind opening a new issue? When you do, would it be possible for you to show the entire stack trace, e.g. the line where it fails in your code, alongside all the information you've provided here? Thanks. |
@LysandreJik I'd be very interested in an ELECTRA sequence classification implementation. |
I just posted a pull request ... it was super simple to get it working |
@liuzzi awesome! I look forward to trying it out. |
@liuzzi wonderful, thanks a lot. Well done brother. Can you share a working notebook on this, please? Thank you. |
@innat I did not use a notebook to fine-tune, but for sentiment analysis you can just use the run_glue.py script with the SST-2 task, which is a binary sentiment analysis task. You shouldn't even need to change any code, just make sure your dataset follows the format of SST-2. |
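For reference, a made-up miniature of the SST-2 layout (tab-separated train.tsv / dev.tsv files with a sentence column and a 0/1 label column), assuming I recall the GLUE format correctly:
# Hypothetical toy dataset written in the GLUE SST-2 style: a header row,
# then one "sentence<TAB>label" line per example (1 = positive, 0 = negative).
rows = [
    ("sentence", "label"),
    ("this movie was great , i loved the acting", "1"),
    ("the acting was so bad i left immediately", "0"),
]
for filename in ("train.tsv", "dev.tsv"):
    with open(filename, "w") as f:
        for row in rows:
            f.write("\t".join(row) + "\n")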
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I have a few questions about the model notation, and also some short questions about T5 and ELECTRA. I would like to make separate issues, but these things are not too complex. I mainly work on CV, so sorry if I'm being silly.

1. Cased or Uncased
What is meant by cased and uncased?

2. Suffix
I was trying to run the XLM model, but among the pre-trained models I found the following weights. I understood the XLM-MLM part but couldn't get the rest, e.g.:
enfr-1024, enro-1024
etc.

3. Sentiment Analysis using T5 and ELECTRA
Is it possible to use these two models for sentiment classification, simply just binary classification? How can we implement these two transformers? I have a high-level overview of T5: it treats both input and target as text. I found it useful, but had a bit of trouble implementing it. Using transformers, is it possible to do this in a convenient way?