Restricting dtm/tf_idf creation to only the top N features from the lexicon #71

pazzo83 · 2018-02-17T17:48:31Z

I am working with a corpus of 100k+ documents, so the lexicon contains a very large number of features and I'm running into memory problems. In scikit-learn's TfidfVectorizer and similar tools, you can cap the number of features (the `max_features` option) so that the dtm and tf_idf matrices are built from only the top N terms. Is there a way to do that with this package, or are there plans to add such a feature?

Thanks!

aviks · 2018-02-18T20:12:55Z

Not yet, but might be worth adding.
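
For reference, the behaviour being requested can be sketched in a few lines. This is a minimal, library-independent illustration in plain Python, not this package's API: the helper names `top_n_vocabulary` and `sparse_dtm` are hypothetical, and ranking terms by total corpus frequency is an assumption modelled on how scikit-learn's `max_features` orders terms.

```python
from collections import Counter

def top_n_vocabulary(tokenized_docs, n):
    """Map the n most frequent terms in the corpus to column indices (hypothetical helper)."""
    totals = Counter()
    for tokens in tokenized_docs:
        totals.update(tokens)
    # Rank by descending corpus count; break ties alphabetically so the result is deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: col for col, (term, _) in enumerate(ranked[:n])}

def sparse_dtm(tokenized_docs, vocab):
    """Build a sparse document-term matrix: one {column: count} dict per document."""
    rows = []
    for tokens in tokenized_docs:
        # Terms outside the restricted vocabulary are dropped before counting.
        counts = Counter(t for t in tokens if t in vocab)
        rows.append({vocab[t]: c for t, c in counts.items()})
    return rows

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
vocab = top_n_vocabulary(docs, 3)
dtm = sparse_dtm(docs, vocab)
```

Because the vocabulary is fixed before the matrix is built, each row holds at most N entries regardless of how large the full lexicon is, which is where the memory saving comes from.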
