Issue when using n_gram_range other than (1,1) #5

ColinFerguson · 2020-10-15T21:11:04Z

Hi, really nice work with this package, it's very useful.

Model initialization takes the argument n_gram_range, but I don't think it gets used. Should line 241, referenced here, be
count = CountVectorizer(ngram_range=n_gram_range, stop_words="english").fit(documents)?

BERTopic/bertopic/model.py

Line 241 in 9f7dca1

count = CountVectorizer(stop_words="english").fit(documents)

It might also be nice to make the stop_words argument configurable at initialization, so that the user can pass a corpus-specific set of stop words.
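For illustration, here is a minimal sketch of the proposed change, assuming scikit-learn is installed. The helper `fit_vocabulary` and the sample documents are hypothetical and not part of BERTopic; only the `CountVectorizer` call mirrors the line quoted above, with `ngram_range` and `stop_words` passed through instead of being hard-coded.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample corpus, used only for this illustration.
documents = [
    "topic models group similar documents",
    "similar documents share topic words",
]


def fit_vocabulary(docs, n_gram_range=(1, 1), stop_words="english"):
    """Fit a CountVectorizer and return its sorted vocabulary.

    Hypothetical helper: it forwards both arguments to CountVectorizer,
    which is what the proposed fix does for line 241.
    """
    count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit(docs)
    return sorted(count.vocabulary_)


# Default behaviour: single words only.
unigrams = fit_vocabulary(documents)

# With the range forwarded, two-word phrases are added to the vocabulary.
with_bigrams = fit_vocabulary(documents, n_gram_range=(1, 2))

# A corpus-specific stop word list can be passed instead of "english".
custom = fit_vocabulary(documents, stop_words=["topic"])

print(unigrams)
print(with_bigrams)
print(custom)
```

If the range is not forwarded, as in the original line 241, the first and second calls would return the same single-word vocabulary regardless of what the user passed.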

MaartenGr · 2020-10-17T06:26:58Z

You are correct! Stupid oversight on my part, not actually using the n_gram_range. Same with stopwords.

MaartenGr · 2020-10-17T06:50:01Z

Master has the most up-to-date version. PyPI was updated to 0.2.3 to include the changes you proposed. Let me know if you find any other issues!

ColinFerguson · 2020-10-19T18:21:29Z

Great, thank you so much @MaartenGr!

MaartenGr mentioned this issue Oct 17, 2020

Fixed ngram + added stopwords #6

Merged

MaartenGr added a commit that referenced this issue Oct 17, 2020

Fixed ngram + added stopwords (#6, #5)

a61a768

* Fixed ngram and added stopwords * Update pypi version

MaartenGr closed this as completed Oct 17, 2020

vsean103 mentioned this issue Apr 4, 2022

Question regarding assign topics to new dataset using pretrained model #467

Closed

PhPv mentioned this issue Jul 30, 2024

incorrect result by topic_model.get_topic_info() due to zeroshot_topic_list was set #2102

Closed
