TextCorpus doesn't provide a way to convert document text to indices as needed for say DL NLP models #1634

Closed
roopalgarg opened this issue Oct 18, 2017 · 3 comments
Labels
difficulty easy, feature, good first issue

Comments

@roopalgarg
Contributor

Description

TextCorpus doesn't provide a way to convert the text of a document to indices per the dictionary, as needed for, say, Deep Learning NLP models. TextCorpus uses the Dictionary object's doc2bow function, which is great for most ML models, but DL models need the token indices in document order, so doc2bow is not usable in most of those cases.

Steps/Code/Corpus to Reproduce

sample.txt

hello how are you ?
i am good

code:

from gensim.corpora import Dictionary
from gensim.corpora.textcorpus import TextCorpus

some_file_name = "sample.txt"
some_dictionary = {
    '<UNK>': 0,
    'how': 1,
    'hello': 2,
    'hi': 3,
    'are': 4,
    'you': 5,
    '?': 6,
    'good': 7
}

gensim_dictionary = Dictionary()
gensim_dictionary.token2id = some_dictionary

txt_corpus = TextCorpus(input=some_file_name, dictionary=gensim_dictionary, token_filters=[])

for text in txt_corpus:
    print(list(text))

Expected Results

Some way to convert the corpus to indices per the token2id dict in the Dictionary class, with an option to provide an unknown-token id that replaces all unknown tokens.
This could be done either by adding a doc2idx() to TextCorpus or by adding it to the Dictionary class alongside doc2bow(). Expected output:
[2, 1, 4, 5, 6]
[0, 0, 7]
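
For illustration, the requested conversion can be sketched in plain Python without gensim. The function name doc2idx_sketch and its unknown_index parameter are hypothetical, chosen only for this example, and whitespace splitting stands in for whatever tokenization the corpus really applies.

```python
def doc2idx_sketch(tokens, token2id, unknown_index=0):
    """Map each token to its id in document order.

    Hypothetical helper: tokens missing from token2id are replaced
    by unknown_index instead of being dropped.
    """
    return [token2id.get(token, unknown_index) for token in tokens]


# The mapping from the reproduction code above.
token2id = {
    '<UNK>': 0, 'how': 1, 'hello': 2, 'hi': 3,
    'are': 4, 'you': 5, '?': 6, 'good': 7,
}

# Assumption: simple whitespace tokenization of the two sample.txt lines.
first = doc2idx_sketch("hello how are you ?".split(), token2id)
second = doc2idx_sketch("i am good".split(), token2id)

print(first)   # [2, 1, 4, 5, 6]
print(second)  # [0, 0, 7]
```

Unlike a bag-of-words vector, the result has one entry per token, so its length equals the document length and repeated words appear repeatedly.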

Actual Results

[(1, 1), (2, 1), (4, 1), (5, 1), (6, 1)]
[(7, 1)]

Versions

Darwin-16.4.0-x86_64-i386-64bit
('Python', '2.7.12 |Anaconda custom (x86_64)| (default, Jul 2 2016, 17:43:17) \n[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')

@menshikh-iv
Contributor

Hi @roopalgarg, doc2bow returns data in bag-of-words format (unordered), which works fine for many gensim models. doc2bow also returns the frequency of each word in the document (the second element of each tuple).
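
To make the bag-of-words point concrete, here is a rough pure-Python approximation of that output shape. It is a sketch only, not gensim's implementation; bow_sketch is a hypothetical name, and sorting by id is an assumption made to match the Actual Results shown in the issue.

```python
from collections import Counter


def bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are dropped.

    Hypothetical helper that mimics the *shape* of a bag-of-words vector.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {
    '<UNK>': 0, 'how': 1, 'hello': 2, 'hi': 3,
    'are': 4, 'you': 5, '?': 6, 'good': 7,
}

# Word order is lost: the pairs come back ordered by id, not by position.
print(bow_sketch("hello how are you ?".split(), token2id))
# [(1, 1), (2, 1), (4, 1), (5, 1), (6, 1)]

# Repeated words collapse into a single pair with a count of 2.
print(bow_sketch("you are you".split(), token2id))
# [(4, 1), (5, 2)]
```

Because position information is discarded and unknown tokens vanish, this representation cannot be turned back into the token sequence a sequential model expects.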

Adding a doc2idx method may be a good idea, wdyt @piskvorky @gojomo?

@menshikh-iv added the difficulty easy, feature, and good first issue labels on Oct 19, 2017
@roopalgarg
Contributor Author

Awesome! Excited to work on it. I will wait for confirmation from your end on whether we are actually going to add the doc2idx feature. I am assuming we would add it to the Dictionary class?

@roopalgarg
Contributor Author

@menshikh-iv @piskvorky @gojomo any updates on this?
My use case is mainly converting a document into a sequence of indices per a word -> word_id mapping, as needed for Deep Learning based NLP models.
I had a couple of questions about adding a doc2idx feature to the Dictionary class. The class does a lot of housekeeping via the allow_update parameter in doc2bow; should the same kind of housekeeping be done for doc2idx as well? It might be overkill for this feature if it is used mainly for DL models.
Or should I simply add a feature like get_texts_idx to the TextCorpus class? It would work similarly to __iter__, but instead of calling self.dictionary.doc2bow() it would convert the text to indices using the token2id mapping in the Dictionary class.
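
As a rough sketch of the second option, a corpus-level iterator could stream one index vector per line of the input file. Everything here is an assumption for illustration: iter_index_vectors is a hypothetical standalone function rather than a TextCorpus method, it does no dictionary updating, and it tokenizes by whitespace instead of using the corpus's own preprocessing.

```python
TOKEN2ID = {
    '<UNK>': 0, 'how': 1, 'hello': 2, 'hi': 3,
    'are': 4, 'you': 5, '?': 6, 'good': 7,
}


def iter_index_vectors(path, token2id, unknown_index=0):
    """Lazily yield one list of token ids per non-empty line of a text file.

    Hypothetical helper: the mapping is read-only here, so unseen tokens
    are mapped to unknown_index rather than added to the dictionary.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()  # assumption: whitespace tokenization
            if tokens:
                yield [token2id.get(token, unknown_index) for token in tokens]
```

Called as iter_index_vectors("sample.txt", TOKEN2ID), this would yield the index vectors one document at a time, so the whole corpus never has to be held in memory.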

VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this issue Nov 26, 2017
…iskvorky#1720)

* define doc2idx to convert a document to a vector of indexes per the dictionary

* update documentation

* changes to textcorpus to add a mode for index vector format output. adding test case for the changes

* fixing doc string

* fix doc string

* fix doc string

* removing trailing white spaces

* removing trailing white spaces

* changes as per review

* change as per review.

reverting changes to TextCorpus as discussed