
How should one modify the code to successfully run text classification? #43

Closed

davidefiocco opened this issue Sep 21, 2018 · 7 comments

davidefiocco commented Sep 21, 2018

Hi,

I am new to PyTorch (but still more at ease with it than with TF), so I thought I would experiment with @thomwolf's implementation in this repo (thanks for sharing it!).

I would like to try out the code to perform binary classification of text snippets, similar to classification tasks such as the Corpus of Linguistic Acceptability (CoLA) and the Stanford Sentiment Treebank (SST-2) in the original reference.

These are the steps that I think are needed to get the code working (but I am not sure they are correct or exhaustive):

  1. Create two files snippets_val.csv and snippets_test.csv containing two columns, text (a string) and class (an int equal to 0 or 1).
  2. In datasets.py create two new functions (see the sketch after this list):
    • _snippets, returning two lists st, y, and
    • snippets, defined with different values of n_train and n_valid and whose return statement looks like return (trX, trY), (vaX, vaY), (teX, ).
  3. In train.py, rewrite transform_roc into a transform_snippet that doesn't use [delimiter] and takes only one argument as input. This one is somewhat tricky to me; can anyone provide some guidance?
  4. In train.py, in the encoding bit and afterwards:
  5. In train.py:
  6. In analysis.py:
    • create a new function snippets that invokes _snippets (from datasets.py) to read in snippets_test.csv, and adjust its call to _snippets to account for the fact that it returns two lists (not four).
  7. Modify the imports in train.py consistently with all of the above.
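
For step 2, here is a minimal sketch of what I have in mind, mirroring the shape of the existing rocstories loader (I am assuming pandas and scikit-learn's train_test_split; the file names follow step 1, while n_valid and seed are placeholder values):

import os
import pandas as pd
from sklearn.model_selection import train_test_split

def _snippets(path):
    # Read one CSV with the two columns from step 1: text (str) and class (0/1).
    df = pd.read_csv(path)
    return df['text'].tolist(), df['class'].tolist()

def snippets(data_dir, n_valid=374, seed=3535999445):
    # The "val" file provides both training and validation data;
    # the test file's labels are not used here.
    st, y = _snippets(os.path.join(data_dir, 'snippets_val.csv'))
    teX, _ = _snippets(os.path.join(data_dir, 'snippets_test.csv'))
    trX, vaX, trY, vaY = train_test_split(st, y, test_size=n_valid, random_state=seed)
    return (trX, trY), (vaX, vaY), (teX,)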

Does all of the above make sense as a plan, or can somebody fill in the missing bits or provide an alternative list of "sub-steps"?
Also, can someone provide some guidance on how to rewrite transform_roc? Comments on the original code would be fantastic; I am glad to annotate the original function and contribute to the repo as a result of this!

Thanks to anyone patiently reading this!

lordzuko commented

@davidefiocco I made the modifications to the transform_roc method as below for the entailment task, which is a classification problem:

def transform_roc(X1, X2, X3):
    # X1: first input sequences; X2/X3: the two candidate sequences (already BPE-encoded).
    # n_ctx, max_len, encoder, clf_token, n_vocab and n_special are module-level globals.
    n_batch = len(X1)
    # xmb channel 0 holds token ids, channel 1 holds position ids.
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    # mmb masks out the zero padding beyond each sequence's length.
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    delimiter = encoder['_delimiter_']
    for i, (x1, x2, x3) in enumerate(zip(X1, X2, X3)):
        # Build [start] input [delimiter] candidate [clf_token] for each of the two candidates.
        x12 = [start] + x1[:max_len] + [delimiter] + x2[:max_len] + [clf_token]
        x13 = [start] + x1[:max_len] + [delimiter] + x3[:max_len] + [clf_token]
        l12 = len(x12)
        l13 = len(x13)
        xmb[i, 0, :l12, 0] = x12
        xmb[i, 1, :l13, 0] = x13
        mmb[i, 0, :l12] = 1
        mmb[i, 1, :l13] = 1
    # Position information that is added to the input embeddings in the TransformerModel
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb
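
To make the two channels concrete, here is a toy illustration of that last line (with made-up small sizes; the real n_vocab, n_special and n_ctx come from the model setup):

import numpy as np

n_vocab, n_special, n_ctx = 10, 3, 5  # toy sizes for illustration only
xmb = np.zeros((1, 2, n_ctx, 2), dtype=np.int32)
xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
print(xmb[0, 0, :, 1])  # [13 14 15 16 17]: position ids offset past all token ids,
                        # so tokens and positions can share one embedding matrix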

You can refer to my question in #40. Hope this helps.

davidefiocco commented Sep 26, 2018

@lordzuko thanks!

I had seen #40, and that's excellent guidance for me on how the transform function should change depending on the task. However, given that the README shows an "architectural" difference between classification and entailment (see https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/assets/ftlm.png), I thought the edits should depart somewhat more from the original transform_roc, and that I should use a function of the form

def transform_snippet(X1):
    n_batch = len(X1)
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    for i, (x1), in enumerate(X1):
        x12 = [start] + x1[:max_len] + [clf_token]
        l12 = len(x12)
        xmb[i, 0, :l12, 0] = x12
        mmb[i, 0, :l12] = 1
    # Position information that is added to the input embeddings in the TransformerModel
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb

but this is not working as intended as of now.

lordzuko commented

@davidefiocco Are you getting an error, or is the model just not training as intended? If you are getting an error, can you please share the stack trace?

davidefiocco commented Sep 26, 2018

@lordzuko I tried implementing the steps I described in #43 (comment) (hoping they make sense...) and got:

Traceback (most recent call last):
  File "train.py", line 218, in <module>
    teX, teM = transform_snippet(teX1)
  File "train.py", line 29, in transform_snippet
    xmb[i, 0, :l12, 0] = x12
ValueError: setting an array element with a sequence.

To make this reproducible I can provide more code / publish a fork, as I modified the current code in several places (see my comment above for the full list) while trying to implement the head change (#43 (comment) is one of the changes). I am not very proficient in PyTorch yet (so these may be clumsy changes); that's why the questions in #43 (comment). Most likely, the transform_snippet function posted in #43 (comment) is not OK.
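
For the record, that ValueError is NumPy's complaint when a whole sequence lands where a scalar element is expected; a minimal sketch (not the actual failing call) that triggers the same message:

import numpy as np

xmb = np.zeros((1, 2, 77, 2), dtype=np.int32)
x12 = [40478, [249, 481], 40480]   # ragged: one element is itself a list
xmb[0, 0, :len(x12), 0] = x12      # ValueError: setting an array element with a sequence.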

thomwolf commented Oct 5, 2018

Hi @davidefiocco,
Your transform_snippet function should be the way to go.
I think it's just a Python typo. It looks like your l12 is equal to one, which probably comes from this line: for i, (x1), in enumerate(X1). Try using for i, x1 in enumerate(X1) instead.
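
With that one-line fix applied, the function from the earlier comment would read (same module-level globals as transform_roc; still a sketch, not a verified fix):

def transform_snippet(X1):
    # Single-sequence classification: [start] snippet [clf_token], no delimiter.
    n_batch = len(X1)
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    for i, x1 in enumerate(X1):  # the fixed loop header
        x12 = [start] + x1[:max_len] + [clf_token]
        l12 = len(x12)
        xmb[i, 0, :l12, 0] = x12
        mmb[i, 0, :l12] = 1
    # Position information that is added to the input embeddings in the TransformerModel
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb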

davidefiocco commented Oct 6, 2018

Hi @thomwolf, thanks for your reply and tip!

As promised, I forked the code; you can find the result at master...davidefiocco:master, and that specific edit is at https://github.com/davidefiocco/pytorch-openai-transformer-lm/blob/e9945725603544cdebaec91937d4a16f14db0ad8/train.py#L26

In the fork's naming, news stands for "newsgroup", as I tried to classify snippets of text coming from a (2-newsgroup) subset of the 20 newsgroups dataset (http://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). I haven't been successful with the algorithm yet (the code now runs without errors, but the iterations don't seem to converge).
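
For context, such a 2-newsgroup subset can be pulled with scikit-learn like this (the two category names are just an example pair, not necessarily the ones I used):

from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'sci.space']  # placeholder pair of newsgroups
train = fetch_20newsgroups(subset='train', categories=categories,
                           remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', categories=categories,
                          remove=('headers', 'footers', 'quotes'))
print(len(train.data), len(test.data))  # text snippets; train.target holds the 0/1 labels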

I will update this issue if I manage to get it sorted, and if someone is keen on giving feedback on what needs to be changed in the code I'll be very happy to work on it.

davidefiocco commented Nov 20, 2018

I had another bug, which I think I fixed with
https://github.com/davidefiocco/pytorch-openai-transformer-lm/commit/d546da7c7076fac73d8fc850b2d0066edc36680c

With that, I seem to converge and reproduce the 91+% evaluation accuracy on SST-2.

I am still not sure that everything is really fine, but at least it converges now!
