
Question: Can I use this service to obtain docvecs / paragraph vectors of an entire article? #232

Closed

kapilkd13 opened this issue Feb 8, 2019 · 4 comments


kapilkd13 commented Feb 8, 2019

Hi all,
I am trying to obtain fixed-length doc vectors / paragraph vectors with this implementation. As mentioned in the docs, I can increase max_seq_len from 25 to the desired length and pass my article as input. I want to know whether this approach is right or whether there is a downside to it. Also, is there a better approach to obtaining docvecs with the BERT model?
Currently, we use the gensim library to obtain docvecs for an article (see the sketch below). Another approach could be to use the word vectors obtained from the BERT model, one-hot encode paragraph IDs, and learn a vector per paragraph (similar to gensim's paragraph-vector implementation).
What do you suggest?
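
For context, a minimal sketch of the gensim Doc2Vec baseline mentioned above; the corpus, vector_size, and epochs values are illustrative assumptions, not the actual settings:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Illustrative corpus; in practice each entry is a tokenized article.
articles = [
    "bert produces contextual word embeddings".split(),
    "doc2vec learns one vector per document".split(),
]
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(articles)]

# vector_size and epochs are assumed values for this sketch.
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=20)

# Infer a fixed-length vector for a new, unseen article.
docvec = model.infer_vector("a new article to embed".split())
print(docvec.shape)  # (100,)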


@kapilkd13 (Author)

Hi, for now I have decided to set max_seq_length to 1500 and pass multiple sentences as a single sequence separated by |||, but I am encountering:
ValueError: The seq length (1500) cannot be greater than max_position_embeddings (512).

This is because in https://github.com/google-research/bert/blob/master/modeling.py#L43
the default value of max_position_embeddings is 512 (see the excerpt below). Could you expose this as a command-line argument in your library? I can work on this if you want.
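
For reference, a condensed paraphrase (an assumption, not a verbatim copy) of the config class in the linked modeling.py; max_position_embeddings caps the usable sequence length because the position-embedding table has exactly that many rows:

# Condensed paraphrase of BertConfig from google-research/bert modeling.py.
class BertConfig:
    def __init__(self,
                 vocab_size,
                 hidden_size=768,
                 num_hidden_layers=12,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 max_position_embeddings=512):  # hard upper bound on sequence length
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.max_position_embeddings = max_position_embeddings

config = BertConfig(vocab_size=30522)
print(config.max_position_embeddings)  # 512 -> the source of the ValueError above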

@ironflood

I don't think this is the intended use of BERT. BERT is trained at the sentence embedding level, so the representation of more than one sentence is likely to be fairly inaccurate, and the computation needed beyond 512 tokens would be huge (remember, self-attention cost is not linear in the number of tokens; it grows roughly quadratically).

There are many strategies you can try if you want a more accurate paragraph representation, for example:

  • take the element-wise average / max over the sequence of sentence embeddings that compose your paragraph (no additional training); the resulting paragraph embedding has the same number of dimensions as each sentence embedding (see the sketch after this list).
  • use a weighted average of sentence embeddings if you have extra information about how much each sentence matters to your overall paragraph representation (no additional training).
  • use an RNN / CNN downstream layer to get a paragraph embedding; however, this requires training on a target label (regression / classification task).
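
A minimal sketch of the first two strategies, assuming a bert-serving-start server is already running; the sentence texts and the weights are illustrative:

import numpy as np
from bert_serving.client import BertClient

sentences = [
    "BERT yields one fixed-length vector per input sequence.",
    "Pooling those vectors gives a paragraph-level representation.",
]

bc = BertClient()
vecs = bc.encode(sentences)        # shape: (num_sentences, emb_dim)

mean_vec = vecs.mean(axis=0)       # element-wise average pooling
max_vec = vecs.max(axis=0)         # element-wise max pooling

# Weighted variant: the weights are illustrative, e.g. from TF-IDF or position.
weights = np.array([0.7, 0.3])
weighted_vec = (weights[:, None] * vecs).sum(axis=0) / weights.sum()

print(mean_vec.shape)              # same dim as a single sentence embedding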

@hanxiao (Member) commented Feb 14, 2019

The length restriction on the server side can now be waived. This issue was fixed in #236, and the new feature is available since 1.8.2. Please do

pip install bert-serving-client bert-serving-server -U

for the update.

You can now set max_seq_len=NONE when starting a server. In this case, max_seq_len is determined by the longest sequence in a batch (or mini-batch if parallelization is enabled). That means you can send any sequence shorter than the max_position_embeddings (usually 512) defined in bert_config.json; see the example below.

You may also want to check the new -fixed_embed_length argument via bert-serving-start --help, especially if you intend to use the service for ELMo-like embeddings.
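
For example (a sketch; the model directory path is an assumption):

bert-serving-start -model_dir /tmp/uncased_L-12_H-768_A-12 -max_seq_len NONE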

@rajshrivastava


@ironflood The averaging technique sounds interesting! Could you point me to any results if someone has already tried this?
