
Question: Can I use this service to obtain docvecs / paragraph vectors of an entire article? #232

Closed

kapilkd13 opened this issue Feb 8, 2019 · 4 comments


kapilkd13 commented Feb 8, 2019

Hi all,
I am trying to obtain fixed-length doc vectors / paragraph vectors with this implementation. As mentioned in the docs, I can increase max_seq_len from 25 to the desired length and pass my article as input. I want to know whether this approach is right or whether there is a downside to it. Also, is there a better approach to obtaining docvecs with the BERT model?
Currently, we use the gensim library to obtain docvecs for an article (see the sketch below). Another approach could be to use the word vectors obtained from the BERT model, one-hot encode paragraph IDs, and learn a vector per paragraph (similar to gensim's paragraph-vector implementation).
What do you suggest?
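
For context, a minimal sketch of the gensim Doc2Vec baseline mentioned above; the corpus, vector_size, and epochs values are illustrative assumptions, not the actual settings:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Illustrative corpus; in practice each entry is a tokenized article.
articles = [
    "bert produces contextual word embeddings".split(),
    "doc2vec learns one vector per document".split(),
]
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(articles)]

# vector_size and epochs are assumed values for this sketch.
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=20)

# Infer a fixed-length vector for a new, unseen article.
docvec = model.infer_vector("a new article to embed".split())
print(docvec.shape)  # (100,)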


@kapilkd13 (Author)

Hi, for now I have decided to set max_seq_length to 1500 and pass multiple sentences as a single sequence separated by |||, but I am encountering:
ValueError: The seq length (1500) cannot be greater than max_position_embeddings (512).

This is because in https://github.com/google-research/bert/blob/master/modeling.py#L43
the default value of max_position_embeddings is 512 (see the excerpt below). Could you expose this as a command-line argument in your library? I can work on this if you want.
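
For reference, a condensed paraphrase (an assumption, not a verbatim copy) of the config class in the linked modeling.py; max_position_embeddings caps the usable sequence length because the position-embedding table has exactly that many rows:

# Condensed paraphrase of BertConfig from google-research/bert modeling.py.
class BertConfig:
    def __init__(self,
                 vocab_size,
                 hidden_size=768,
                 num_hidden_layers=12,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 max_position_embeddings=512):  # hard upper bound on sequence length
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.max_position_embeddings = max_position_embeddings

config = BertConfig(vocab_size=30522)
print(config.max_position_embeddings)  # 512 -> the source of the ValueError above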

@ironflood

I don't think this is the intended use of BERT. BERT is trained at the sentence embedding level, so the representation of more than one sentence is likely to be fairly inaccurate, and the computation needed beyond 512 tokens would be huge (remember, self-attention cost is not linear in the number of tokens; it grows roughly quadratically).

There are many strategies you can try if you want a more accurate paragraph representation, for example:

  • take the element-wise average / max over the sequence of sentence embeddings that compose your paragraph (no additional training); the resulting paragraph embedding has the same number of dimensions as each sentence embedding (see the sketch after this list).
  • use a weighted average of sentence embeddings if you have extra information about how much each sentence matters to your overall paragraph representation (no additional training).
  • use an RNN / CNN downstream layer to get a paragraph embedding; however, this requires training on a target label (regression / classification task).
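
A minimal sketch of the first two strategies, assuming a bert-serving-start server is already running; the sentence texts and the weights are illustrative:

import numpy as np
from bert_serving.client import BertClient

sentences = [
    "BERT yields one fixed-length vector per input sequence.",
    "Pooling those vectors gives a paragraph-level representation.",
]

bc = BertClient()
vecs = bc.encode(sentences)        # shape: (num_sentences, emb_dim)

mean_vec = vecs.mean(axis=0)       # element-wise average pooling
max_vec = vecs.max(axis=0)         # element-wise max pooling

# Weighted variant: the weights are illustrative, e.g. from TF-IDF or position.
weights = np.array([0.7, 0.3])
weighted_vec = (weights[:, None] * vecs).sum(axis=0) / weights.sum()

print(mean_vec.shape)              # same dim as a single sentence embedding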

@hanxiao (Member) commented Feb 14, 2019

The length restriction on the server side can now be waived. This issue was fixed in #236, and the new feature is available since 1.8.2. Please do

pip install bert-serving-client bert-serving-server -U

for the update.

You can now set max_seq_len=NONE when starting a server. In this case, max_seq_len is determined by the longest sequence in a batch (or mini-batch if parallelization is enabled). That means you can send any sequence shorter than the max_position_embeddings (usually 512) defined in bert_config.json; see the example below.

You may also want to check the new -fixed_embed_length argument via bert-serving-start --help, especially if you intend to use the service for ELMo-like embeddings.
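
For example (a sketch; the model directory path is an assumption):

bert-serving-start -model_dir /tmp/uncased_L-12_H-768_A-12 -max_seq_len NONE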

@rajshrivastava


@ironflood The averaging technique sounds interesting! Could you point me to any results if someone has already tried this?
