What does the output of feature-extraction pipeline represent? #4613

Closed
orenpapers opened this issue May 27, 2020 · 6 comments

@orenpapers

I am using the feature-extraction pipeline:

from transformers import pipeline

nlp_fe = pipeline('feature-extraction')
nlp_fe('there is a book on the desk')

As output I get a list with one element, which is itself a list of 9 elements, each of which is a list of 768 features (floats).
What does the output represent? What does each element of these lists stand for, and what is the meaning of the 768 float values?
Thanks

@Abhishek-Rnjn

They are embeddings generated by the model (a BERT-base-sized model, I guess, since it has a hidden representation of 768 dimensions). You get 9 elements: one contextual embedding for each token in your sequence, most likely the 7 words plus the [CLS] and [SEP] special tokens. The values of these embeddings represent hidden features that are not easy to interpret.
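
A quick way to inspect that structure (a minimal sketch, assuming the pipeline's default 768-dimensional checkpoint; numpy is only used to make the nested list's shape easy to read):

from transformers import pipeline
import numpy as np

nlp_fe = pipeline('feature-extraction')
features = nlp_fe('there is a book on the desk')

arr = np.array(features)
print(arr.shape)      # (1, 9, 768): 1 sequence, 9 tokens ([CLS] + 7 words + [SEP]), 768 hidden dimensions
print(arr[0, 0, :5])  # first few values of the first token's embedding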

@orenpapers

orenpapers commented May 28, 2020

So the pipeline will just return the last-layer encoding of BERT?
So what is the difference from code like:

input_ids = torch.tensor(bert_tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
outputs = bert_model(input_ids)  # bert_model loaded with output_hidden_states=True
hidden_states = outputs[-1][1:]  # tuple of per-layer hidden states (embedding output dropped)
layer_hidden_state = hidden_states[n_layer]  # hidden state of the chosen layer
return layer_hidden_state

Also, do BERT encodings have similar traits to word2vec? E.g., will similar words be closer, France - Paris = England - London, etc.?

@stale

stale bot commented Jul 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 27, 2020
@stale stale bot closed this as completed Aug 4, 2020
@merleyc

merleyc commented Apr 30, 2021

> So the pipeline will just return the last-layer encoding of BERT?
> So what is the difference from code like:
>
>     input_ids = torch.tensor(bert_tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
>     outputs = bert_model(input_ids)  # bert_model loaded with output_hidden_states=True
>     hidden_states = outputs[-1][1:]  # tuple of per-layer hidden states (embedding output dropped)
>     layer_hidden_state = hidden_states[n_layer]  # hidden state of the chosen layer
>     return layer_hidden_state
>
> Also, do BERT encodings have similar traits to word2vec? E.g., will similar words be closer, France - Paris = England - London, etc.?

Hi @orko19,
Did you understand the difference between 'hidden_states' and the 'feature-extraction' pipeline? I'd like to understand it as well.
Thanks!

@orenpapers

@merleyc I do not! Please share if you do :)

@allmwh

allmwh commented May 25, 2021

The outputs of "last_hidden_state" and the "feature-extraction" pipeline are the same; you can try it yourself.

The "feature-extraction" pipeline just handles the steps for us, from tokenizing the words to producing the embeddings.
