Does FSE guarantee ordering of vectors to be that of the input sentences? #36

Closed
grantmwilliams opened this issue Jan 26, 2021 · 1 comment

Comments

@grantmwilliams
Contributor

grantmwilliams commented Jan 26, 2021

For an example like:

import pandas as pd

from fse.models import uSIF
from fse import SplitIndexedList
from gensim.models.keyedvectors import FastTextKeyedVectors

fasttext_model_path = "models/fasttext-wiki-news-subwords-300.model"
ft = FastTextKeyedVectors.load(fasttext_model_path)

sent_fp = "data/sentences/sentences.csv.gz"
df = pd.read_csv(sent_fp)

sentences = df.sentence.values

indexed_sentences = SplitIndexedList(sentences)

model = uSIF(ft, workers=2, lang_freq="en")

sentence_count, word_count = model.train(indexed_sentences)

embeddings = model.sv.vectors

Here I read in an ordered list of sentences and then process them through a pre-trained model. Does FSE guarantee that the model vectors are in the same order in which the sentences were fed in?

I didn't see anything in the documentation or source code to suggest they wouldn't be, but I also haven't seen in the documentation any claims for guaranteed ordering either.

Thanks!

@oborchers
Owner

Hi @grantmwilliams,

The destination to which a sentence vector is written depends entirely on the input.
The input is an iterable of (list[str], int) tuples, where the int is the target index. All input wrappers of type Indexed always follow the supplied order of the inputs, so a SplitIndexedList, as in your example, keeps the input order. CIndexed wrappers can be used to supply a custom set of indices, e.g. for many-to-one mappings.
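To illustrate the (list[str], int) contract described above, here is a minimal pure-Python sketch. It does not use fse at all: `indexed`, `custom_indexed`, and `toy_train` are hypothetical stand-ins invented for this example, and the "model" simply stores the tokens in the row named by the index instead of computing a vector. It is only meant to show why position-based indices preserve input order, under the assumption that the real wrappers behave as described in this reply.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# One input item: the tokens of a sentence plus the row it should be written to.
IndexedSentence = Tuple[List[str], int]


def indexed(sentences: Sequence[str]) -> Iterator[IndexedSentence]:
    """Hypothetical stand-in for an Indexed-style wrapper.

    The target index is simply the position of the sentence in the input.
    """
    for position, sentence in enumerate(sentences):
        yield sentence.split(), position


def custom_indexed(
    sentences: Sequence[str], indices: Sequence[int]
) -> Iterator[IndexedSentence]:
    """Hypothetical stand-in for a CIndexed-style wrapper.

    The caller supplies the target index for every sentence.
    """
    if len(sentences) != len(indices):
        raise ValueError("sentences and indices must have the same length")
    for sentence, index in zip(sentences, indices):
        yield sentence.split(), index


def toy_train(pairs: Iterable[IndexedSentence], size: int) -> List[List[str]]:
    """Toy 'model': writes each sentence's tokens to its target row.

    A real model would write a sentence vector there instead.
    """
    rows: List[List[str]] = [[] for _ in range(size)]
    for tokens, index in pairs:
        rows[index] = tokens
    return rows


sentences = ["the first sentence", "a second one", "third"]

# Position-based indices: row i holds sentence i.
in_order = toy_train(indexed(sentences), len(sentences))

# Custom indices: each sentence lands in whichever row the caller names.
reordered = toy_train(custom_indexed(sentences, [2, 0, 1]), len(sentences))
```

With position-based indices, `in_order[i]` corresponds to `sentences[i]`; with the custom indices `[2, 0, 1]`, the first sentence ends up in row 2. The sketch deliberately uses a permutation and makes no claim about how fse combines several sentences that share one index.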

I think this library needs some updating by now, especially the documentation.
