Idea: Add stemming #1

Open
voz opened this issue Jan 9, 2013 · 7 comments

Comments

@voz

voz commented Jan 9, 2013

Reduce derived words to their stems (stemming) and then match on the stems only. It might be more computationally intensive, but the list should become easier to maintain and more bullshit could be discovered.
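A minimal sketch of the idea, assuming a naive suffix-stripping stemmer rather than a real algorithm such as Porter's. The suffix list, the term list, and the names `naive_stem` and `find_bullshit` are all hypothetical and only illustrate matching on stems instead of on every derived form:

```python
import re

# Hypothetical suffix list, checked in order; a real stemmer has far more rules.
SUFFIXES = ("ization", "izing", "ized", "izes", "ize", "ment", "ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical term list: one entry per word family, stored as stems.
BULLSHIT_STEMS = {naive_stem(w) for w in ("monetize", "empowerment")}


def find_bullshit(text):
    """Return the words in text whose stem is on the list."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if naive_stem(w) in BULLSHIT_STEMS]


print(find_bullshit("We are monetizing user empowerment today"))
# -> ['monetizing', 'empowerment']
```

The list only needs 'monetize' once, yet 'monetizing', 'monetized' and 'monetization' all reduce to the same stem and get caught. A stemmer this crude will also mangle plenty of ordinary words, so it is a starting point at best.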

@mourner
Owner

mourner commented Jan 9, 2013

Agreed! Some words become bullshit only in combination, but there are others that definitely should be stemmed. Thanks for the idea!

@calvinmetcalf

Could add a point value to words, or just put them in groups with the same bullshit level, and modify the bs value based on proximity to other bullshit words. E.g. with a threshold of 1, 'monetize' might have 1.2 and always be bullshit, while 'functionality' at 0.8 is not bullshit on its own, but if it is 3 words away from 'empowerment' (also 0.8) it becomes bullshit: 0.8 + (0.8/3) ≈ 1.07.
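A rough sketch of that scoring, using the three example weights and the threshold of 1 from above. The rest is assumption: the function name `score_words`, the window of 5 words, summing the boost over every weighted neighbour in the window, and treating a score equal to the threshold as bullshit.

```python
# Weights and threshold are the example values from this comment.
WEIGHTS = {"monetize": 1.2, "functionality": 0.8, "empowerment": 0.8}
THRESHOLD = 1.0
WINDOW = 5  # assumption: neighbours further away than this are ignored


def score_words(words):
    """Return (word, score, is_bullshit) for every weighted word in the list."""
    results = []
    for i, word in enumerate(words):
        if word not in WEIGHTS:
            continue
        score = WEIGHTS[word]
        for j, other in enumerate(words):
            distance = abs(i - j)
            if other in WEIGHTS and 0 < distance <= WINDOW:
                # A nearby weighted word adds its weight divided by the distance.
                score += WEIGHTS[other] / distance
        results.append((word, score, score >= THRESHOLD))
    return results


for word, score, flagged in score_words("functionality one two empowerment".split()):
    print(word, round(score, 2), flagged)
# functionality 1.07 True
# empowerment 1.07 True
```

On its own 'functionality' stays at 0.8 and is not flagged, while 'monetize' is flagged at 1.2 regardless of its neighbours.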

@mourner
Owner

mourner commented Jan 9, 2013

Lol, that's an awesome idea. :) It may be hard to implement though, and tough to assign and maintain the values.
It should be discussed in a separate issue I think, since it's quite different from the stemming proposal.

@voz
Author

voz commented Jan 9, 2013

Yes, but the usual trick here is to come up with the right weights. How do we know that 'monetize' should have 1.2 and not 1.875?


@calvinmetcalf

My bad, I was thinking of solutions to the issue of words that are not bullshit by themselves.

@voz
Author

voz commented Jan 9, 2013

The idea of weights is a good one. The only catch is that one needs a set of manually classified bullshit texts in order to derive the values. But we can discuss it in another issue, as @mourner mentioned.
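To make the point concrete, here is one possible way to derive values from classified texts: a smoothed log-odds score per word. This is only an illustration, not something proposed in this thread; the function name `learn_weights`, the smoothing constant, and the toy data are all made up, and the resulting numbers are on a different scale from the 0.8 and 1.2 examples above.

```python
import math
from collections import Counter


def learn_weights(labeled_texts, smoothing=1.0):
    """Map each word to the log-odds of it showing up in bullshit vs. clean texts."""
    bs_counts, ok_counts = Counter(), Counter()
    for text, is_bullshit in labeled_texts:
        target = bs_counts if is_bullshit else ok_counts
        target.update(set(text.lower().split()))  # count each word once per text
    n_bs = sum(1 for _, is_bullshit in labeled_texts if is_bullshit)
    n_ok = len(labeled_texts) - n_bs
    weights = {}
    for word in set(bs_counts) | set(ok_counts):
        # Smoothed share of bullshit / clean texts that contain the word.
        p_bs = (bs_counts[word] + smoothing) / (n_bs + 2 * smoothing)
        p_ok = (ok_counts[word] + smoothing) / (n_ok + 2 * smoothing)
        weights[word] = math.log(p_bs / p_ok)
    return weights


# Made-up toy data; real use would need many manually classified texts.
TOY_DATA = [
    ("we monetize synergy", True),
    ("monetize the platform", True),
    ("the cat sat", False),
    ("the platform works", False),
]
weights = learn_weights(TOY_DATA)
```

Words seen mostly in bullshit texts ('monetize' here) get a positive value, words seen mostly in clean texts get a negative one, and words seen equally often in both end up at zero.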


@calvinmetcalf

I experimented with some of the available stemming libraries; neither the Porter stemmer nor Snowball.js is really at a level that is usable here.
