
Queries about the Notation and Model training of T5 and ELECTRA sentiment classification. #3704

Closed
innat opened this issue Apr 8, 2020 · 13 comments

@innat

innat commented Apr 8, 2020

I have a few questions about the model notation, and I'm also looking for some short info about T5 and ELECTRA. I would have opened separate issues, but the questions are not too complex. I mainly work on CV, so sorry if these sound silly.

1 Cased or Uncased

What is meant by cased and uncased?

bert-base-uncased
bert-base-cased

2 Suffix

I was trying to run the XLM model, and among the pre-trained weights I found the following. I understand the XLM-MLM part, but I couldn't figure out the rest of the name, e.g. enfr-1024, enro-1024, etc.

xlm-mlm-enfr-1024
xlm-mlm-enro-1024
xlm-mlm-tlm-xnli15-1024

3 Sentiment Analysis using T5 and ELECTRA

Is it possible to use these two models for sentiment classification, i.e. simple binary classification? How can we implement these two transformers? I have a high-level overview of T5: it treats both input and target as text. I find the idea useful, but I'm having a bit of trouble implementing it. Is there a convenient way to do this with transformers?

@LysandreJik
Member

Hi!

  • 1 - Casing refers to the difference between lowercase and uppercase. Uncased models do not handle uppercase letters, and therefore lowercase them:
from transformers import AutoTokenizer

uncased_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
cased_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

print(uncased_tokenizer.tokenize("Hi, this is Lysandre"))
# ['hi', ',', 'this', 'is', 'l', '##ys', '##and', '##re'] <-- notice how uppercase letters are now lowercased

print(cased_tokenizer.tokenize("Hi, this is Lysandre"))
# ['Hi', ',', 'this', 'is', 'L', '##ys', '##and', '##re']
  • 2 - These should be clarified by model cards on the model hub, but we haven't gotten around to that yet.

XLM models are usually multilingual, which is the case for those you mentioned: ende means English-German, enfr means English-French, and xnli15 means the 15 languages that are used in XNLI.

The following number is the hidden size, e.g. 1024 means that the hidden size of the model is 1024.
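If you want to double-check, the hidden size is visible on the model configuration. A quick sketch, with the caveat that depending on the transformers version the XLM config may expose this as hidden_size or emb_dim:

from transformers import AutoConfig

# Load only the configuration of one of the XLM checkpoints mentioned above.
config = AutoConfig.from_pretrained("xlm-mlm-enfr-1024")

# The trailing "1024" in the checkpoint name matches the model's hidden size.
print(config.hidden_size)  # 1024 (may be exposed as config.emb_dim in some versions)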

  • 3 - You may use T5 for sentiment classification, and ELECTRA as well, though with a bit of additional work.

As @craffel said in the issue you mentioned, T5 was trained with SST-2, so it should work out-of-the-box if you follow what he describes in that issue.

There is no ElectraForSequenceClassification yet, as ELECTRA is so new, but it will certainly make its way into the library in the coming weeks! Once this head is there (feel free to add it yourself, it would be as easy as copying a head from another modeling file and adapting it for ELECTRA), ELECTRA can be used for sentiment classification, but it would require you to fine-tune it first on a sentiment classification dataset (like the SST-2 dataset).

If you're looking for easy sentiment classification, please take a look at the pipelines and at the already fine-tuned sequence classification models, especially those for sentiment classification.
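For example, a minimal sketch using the sentiment-analysis pipeline (which downloads an already fine-tuned sequence classification checkpoint; exactly which one is chosen by the library):

from transformers import pipeline

# Loads a sequence classification model already fine-tuned for sentiment.
classifier = pipeline("sentiment-analysis")

print(classifier("I loved the acting in this movie."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]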

@innat
Author

innat commented Apr 8, 2020

@LysandreJik thanks, it was helpful 🙂

@craffel

craffel commented Apr 8, 2020

Hi, it is easy to use the pre-trained T5 models for sentiment ID. You could do something like

MODEL_NAME = "t5-base"
model = transformers.T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
input_text = "sst2 sentence: This movie was great! I loved the acting."
inputs = tokenizer.encode_plus(input_text, return_token_type_ids=False, return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs)[0]))
input_text = "sst2 sentence: The acting was so bad in this movie I left immediately."
inputs = tokenizer.encode_plus(input_text, return_token_type_ids=False, return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs)[0]))

The "sst2 sentence:" prefix is what we used for the SST-2 task. It is a sentiment ID task. The model needs to see this prefix to know what task you want it to undertake.

@innat
Author

innat commented Apr 9, 2020

Hi @craffel, thanks for your quick response and the intuitive code snippet. As I said, I am trying to implement T5 for a binary sentiment classification task (labels 1 and 0). So, if I want to use T5, I have to treat my task as text-to-text, in other words predict positive and negative. But I'm a bit confused: if I have the following scenario, how should I approach it?

Model loading

MODEL_NAME = "t5-base"
transformer_layer = transformers.T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)

A general encoder

def regular_encode(texts, tokenizer, maxlen=512):
    enc_di = tokenizer.batch_encode_plus(
        texts, 
        return_attention_masks=False, 
        return_token_type_ids=False,
        pad_to_max_length=True,
        max_length=maxlen
    )
    return np.array(enc_di['input_ids'])

Build the model (as per my task)

def build_model(transformer, max_len=190):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32)
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

    return model

Tokenize the data and grab the integer targets (1, 0)

x_train = regular_encode(data.text, tokenizer, maxlen=190)
y_train = data.target.values # (0, 1)
model = build_model(transformer_layer, max_len=190)
model.fit...
model.predict...

I'm sure I'm missing some crucial part by not following the text-to-text manner. If I convert the 1 and 0 labels to Positive and Negative... I mean, shouldn't the target be numeric? And about the prefix, sst2 sentence: this is, in other words, a string indicator to inform the model about the goal or task. So, do I have to add this string at the beginning of every text sentence (sample)?

@craffel

craffel commented Apr 9, 2020

I'm sure I'm missing some crucial part by not following the text-to-text manner. If I convert the 1 and 0 labels to Positive and Negative... I mean, shouldn't the target be numeric?

No, the target should always be text for T5. You should map your 0/1 labels to the words "negative" and "positive" and fine-tune T5 to predict those words, and then map them back to 0/1 after the model outputs the text if needed. This is the point of the text-to-text framework - all tasks take text as input and produce text as output. So, for example, your "build model" code should not include a dense layer with a sigmoid output, etc. There is no modification to the model structure necessary whatsoever.
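Concretely, the label handling could look something like this (a minimal sketch of the mapping only, not a full fine-tuning loop; the variable names and example data are hypothetical):

# Map numeric labels to the words T5 should generate, and back again.
label_map = {0: "negative", 1: "positive"}
inverse_map = {v: k for k, v in label_map.items()}

train_texts = ["sst2 sentence: I loved it.", "sst2 sentence: It was awful."]
train_labels = [1, 0]

# These strings are the targets you fine-tune T5 to generate.
target_texts = [label_map[y] for y in train_labels]

# After generation, map the decoded string back to 0/1 if your pipeline needs numbers.
def to_numeric(decoded):
    return inverse_map.get(decoded.strip().lower(), -1)  # -1 for anything unexpected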

And about the prefix, sst2 sentence: this is, in other words, a string indicator to inform the model about the goal or task. So, do I have to add this string at the beginning of every text sentence (sample)?

Yes, that is the intention.

@parthplc

parthplc commented Apr 18, 2020

@LysandreJik @craffel
Please check this issue!
As per the discussion, I took a similar approach to binary classification on text, but it seems that I am doing something wrong. I have also converted the targets 0 and 1 to the strings "0" and "1". I don't know where I am going wrong.

MODEL_NAME = "t5-base"
transformer_layer = transformers.TFT5ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
def regular_encode(texts, tokenizer, maxlen=512):
    enc_di = tokenizer.batch_encode_plus(
        texts, 
        return_attention_masks=False, 
        return_token_type_ids=False,
        pad_to_max_length=True,
        max_length=maxlen
    )
    return np.array(enc_di['input_ids'])
def build_model(transformer, max_len=190):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32)
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

    return model

x_train = regular_encode(train_df.new_text, tokenizer, maxlen=190)
y_train = train_df.target.values # (0, 1) 0 and 1 convert to string
model = build_model(transformer_layer, max_len=190)
ValueError: in converted code:

    /opt/conda/lib/python3.6/site-packages/transformers/modeling_tf_t5.py:854 call  *
        encoder_outputs = self.encoder(
    /opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py:822 __call__
        outputs = self.call(cast_inputs, *args, **kwargs)
    /opt/conda/lib/python3.6/site-packages/transformers/modeling_tf_t5.py:445 call
        raise ValueError("You have to specify either input_ids or inputs_embeds")

    ValueError: You have to specify either input_ids or inputs_embeds

All inputs are converted to this format:
"sst2 sentence: our deeds are the reason for this..."
I used the same setup, but I'm getting this error. I need to fine-tune the model on my custom dataset.

@LysandreJik
Member

Hi @vapyc, this seems to be an unrelated issue. Would you mind opening a new issue? When you do, would it be possible for you to show the entire stack trace, e.g. the line where it fails in your code, alongside all the information you've provided here? Thanks.

@lingdoc

lingdoc commented May 3, 2020

@LysandreJik I'd be very interested in an ElectraForSequenceClassification head, as I'm not confident I could implement it myself since I'm quite new to Transformers and still learning how the library is organized. Any chance this is coming soon?

@liuzzi
Contributor

liuzzi commented May 9, 2020

I just posted a pull request ... it was super simple to get it working:

#4257

@lingdoc

lingdoc commented May 10, 2020

@liuzzi awesome! I look forward to trying it out.

@innat
Author

innat commented May 10, 2020

@liuzzi wonderful, thanks a lot. Well done brother. Can you share a working notebook on this, please? Thank you.

@liuzzi
Contributor

liuzzi commented May 10, 2020

@innat I did not use a notebook to fine-tune, but for sentiment analysis you can just use the run_glue.py script with the SST-2 task, which is a binary sentiment analysis task. You shouldn't even need to change any code; just make sure your dataset follows the format of SST-2.
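If it helps, this is roughly the layout the script expects for SST-2: tab-separated train.tsv / dev.tsv files with a sentence and a label column. A small sketch that writes such a file (the path and example rows are made up):

import csv

# Hypothetical binary sentiment data written in the SST-2 tsv layout.
rows = [
    ("This movie was great! I loved the acting.", 1),
    ("The acting was so bad I left immediately.", 0),
]

with open("SST-2/train.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["sentence", "label"])  # SST-2 header row
    writer.writerows(rows)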

@stale

stale bot commented Jul 10, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 10, 2020
@innat innat closed this as completed Jul 10, 2020