Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

func CosineSimilarity() return NaN #7

Closed
WhisperRain opened this issue Jun 7, 2018 · 4 comments
Closed

func CosineSimilarity() return NaN #7

WhisperRain opened this issue Jun 7, 2018 · 4 comments

Comments

@WhisperRain
Copy link

WhisperRain commented Jun 7, 2018

Dear James bowman
I use your library to calculate similarity. The function CosineSimilarity() returns many NaN. So i can't continue my work.
I have only changed your vectorisers.go file . All my changes are as following.

func (v *CountVectoriser) Transform(docs ...string) (mat.Matrix, error) { //function begin from here
mat := sparse.NewDOK(len(v.Vocabulary), len(docs))

for d, doc := range docs {
	v.Tokeniser.ForEachIn(doc, func(word string) {
		i, exists := v.Vocabulary[word]

		if exists {
			weight, wieghtExist := TrainingData.WeightMap[word]
			// normal weight value: 2,  unimportant weight value: 1, important  weight value: 3
			if wieghtExist {
				mat.Set(i, d, mat.At(i, d)+weight)
			} else {
				mat.Set(i, d, mat.At(i, d)+1)
			}

		}
	})
}
return mat.ToCSR(), nil

}

@james-bowman
Copy link
Owner

james-bowman commented Jun 7, 2018

The NaN usually occurs if one of the vectors being compared has an L2 norm of 0 which usually means the matrix has no non-zero values. The result of calculating cosine similarity on such a vector is undefined. Have you trained (fitted) the vectoriser prior to using it to transform your document(s)? I suspect at least some of the documents in your test data are 100% comprised of words that were not present in your training corpus. Perhaps try using a larger training data set to make sure the vectoriser learns more words during training.

@james-bowman
Copy link
Owner

Let me know how you get on?

@WhisperRain
Copy link
Author

Yeah! My training data set is pretty small now. It's true that some of the documents in my test data are comprised of words that were not present in training corpus. And I'll have a new try.
Thanks for answering my question in such a short time.

@james-bowman
Copy link
Owner

Not a problem and you are most welcome. You could also try the HashingVectoriser as a direct drop in replacement for the Count vectoriser which doesn't require training.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants