
[MRG] ENH: Make normalization an explicit transformation. #649

Merged
merged 1 commit into piskvorky:develop from norm_add on Jun 1, 2016

Conversation

devashishd12
Contributor

Addresses issue #69. What this PR includes:

  • Added support for the 'l1' norm.
  • Added a norm parameter to the Similarity constructor.
  • Added a norm parameter to matutils.unitvec (see the sketch after the TODO list).
  • Created models/normmodel.py to make normalization an explicit transformation.

TODO:

  • Check for input validation
  • Make test file test/test_norms.py
  • Add tests for input validation using assertRaises
  • Add documentation
  • Add more tests to testTransform to cover all input type validations
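For orientation, a rough sketch of how the matutils.unitvec change is meant to be used (the norm keyword is the addition described above; the printed values are illustrative):

import numpy
from gensim import matutils

v = numpy.array([3.0, 4.0])
print(matutils.unitvec(v))             # default l2 norm: [0.6, 0.8]
print(matutils.unitvec(v, norm='l1'))  # l1 norm: [3/7, 4/7]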

@piskvorky @tmylk Could you please tell me if I'm going on the right track?

@devashishd12
Contributor Author

@piskvorky @tmylk I've created models/normmodel.py in order to make normalization an explicit transformation. I hope I'm not going off on a tangent with this PR :P

@devashishd12 devashishd12 changed the title from "[WIP] ENH: added l0, l1 norm to make transformation explicit." to "[WIP] ENH: Make normalization an explicit transformation." on Apr 1, 2016
@piskvorky piskvorky added the feature label on Apr 1, 2016
@devashishd12
Contributor Author

I've added initial tests for just checking the logic. Input validation testing and testing for the Similarity class constructor still remain.

if scipy.sparse.issparse(vec):  # convert scipy.sparse to standard numpy array
    vec = vec.tocsr()
    veclen = numpy.sqrt(numpy.sum(vec.data ** 2))
    if norm == 'l0':
        veclen = len(vec.data)
Contributor

The L_0 "norm" is not homogeneous, so it is impossible to scale.

Contributor Author

Yes, true... Could you please elaborate on the scaling part? Will it affect efficiency, since as the corpus size increases we cannot use the fact that f(ax) = (a^k) f(x)? Should I remove the support for the zero "norm"?

Contributor

Sorry to confuse you; I meant scaling in the mathematical, not the IT, sense.

What number do we need to divide the vector a = (5, 5) by in order to have l_0_norm(a) = 1?
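To spell out the point: the l_0 "norm" just counts nonzero entries, so no nonzero scaling can change it:

$$\|c\,a\|_0 = \#\{i : c\,a_i \neq 0\} = \#\{i : a_i \neq 0\} = \|a\|_0 \qquad (c \neq 0),$$

so $\|(5,5)/t\|_0 = 2$ for every $t \neq 0$; no divisor gives 1.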

Contributor Author

Thanks for the explanation! I think it would be best to remove the L0 norm then and keep normalization just for l1 and l2. Should I proceed with it?

Contributor

Yep. Thanks.
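With the l0 case dropped, the sparse branch only needs l1 and l2. A self-contained sketch of those two computations (sparse_norm is a hypothetical helper, not the merged code):

import numpy
import scipy.sparse

def sparse_norm(vec, norm='l2'):
    # Compute the requested norm over the nonzero entries of a sparse vector.
    vec = vec.tocsr()
    if norm == 'l1':
        return numpy.sum(numpy.abs(vec.data))
    if norm == 'l2':
        return numpy.sqrt(numpy.sum(vec.data ** 2))
    raise ValueError("unsupported norm: %s" % norm)

v = scipy.sparse.csr_matrix([[0.0, 3.0, 4.0]])
print(sparse_norm(v, 'l1'))  # 7.0
print(sparse_norm(v, 'l2'))  # 5.0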


@devashishd12
Contributor Author

@tmylk I've removed L0 norm and added some more tests to test_normmodel.

@devashishd12 devashishd12 force-pushed the norm_add branch 2 times, most recently from ef103b5 to 739d868 on April 5, 2016 05:30
@devashishd12
Contributor Author

@tmylk @piskvorky I've added further tests to cover all input types. I think this PR is ready for review.

@devashishd12 devashishd12 changed the title from "[WIP] ENH: Make normalization an explicit transformation." to "[MRG] ENH: Make normalization an explicit transformation." on Apr 6, 2016
@devashishd12
Contributor Author

Resolved merge conflicts.

@tmylk
Contributor

tmylk commented Apr 25, 2016

@dsquareindia Hey, why did you close this pull request? It is a useful feature.

@devashishd12
Contributor Author

Sorry for that; I'll open it again.

@devashishd12
Contributor Author

devashishd12 commented May 2, 2016

Removed redundant use of corpus, rebased and squashed. Could someone please review this? Thanks!

    return list(vec)
if norm == 'l1':
    length = float(sum(abs(val) for _, val in vec))
    assert length > 0.0, "Document contains all zero entries"
Contributor

What is the reason for this check?

Contributor Author

Sorry, I had just added that to make it symmetric with the L2 case. I should return the vector unchanged in the case of zero length, right? Should I proceed with the change?

    return ret_normalized_vec(vec, length)
if norm == 'l2':
    length = 1.0 * math.sqrt(sum(val ** 2 for _, val in vec))
    assert length > 0.0, "sparse documents must not contain any explicit zero entries"
Contributor Author

Should I change this to return the vector unchanged in the case of zero length, as we are doing with the other input types?
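A sketch of the behavior being discussed, i.e. returning the input unchanged when its norm is zero (bag-of-words (id, value) pairs assumed; ret_normalized_vec mirrors the helper referenced above):

import math

def ret_normalized_vec(vec, length):
    # Scale every (id, value) entry by 1 / length.
    return [(termid, val / length) for termid, val in vec]

def unitvec_bow(vec, norm='l2'):
    # Normalize a bag-of-words vector; hand back a copy unchanged if it is all zeros.
    if norm == 'l1':
        length = float(sum(abs(val) for _, val in vec))
    elif norm == 'l2':
        length = math.sqrt(sum(val ** 2 for _, val in vec))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0.0:
        return list(vec)
    return ret_normalized_vec(vec, length)

print(unitvec_bow([(0, 3.0), (1, 4.0)]))             # [(0, 0.6), (1, 0.8)]
print(unitvec_bow([(0, 2.0), (1, 2.0)], norm='l1'))  # [(0, 0.5), (1, 0.5)]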

@devashishd12
Contributor Author

@tmylk I have split the test suite into multiple tests to ease error checking. I have also removed the __getitem__ transformation and made it a function called normalize(). I hope the PR is a bit better now! :)

@devashishd12
Contributor Author

@piskvorky could you please review this? I think it's ready for merge.

"""
Calculates the norm by calling matutils.unitvec with the norm parameter.
"""
logger.info("Normalizing...")
Owner

This log message is not very illuminating to users... try being more specific.
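For instance, something along these lines would tell the user what is actually happening (a sketch; self.norm is assumed to hold the configured norm):

logger.info("performing %s normalization on input document", self.norm)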

@piskvorky
Owner

Only did a quick scan, looks nicely done, but I'll leave the thorough review to @tmylk.

My only nitpick would be Normmodel => NormModel, for readability. WDYT?

        self.num_nnz = numnnz
        self.norms = norms

    def normalize(self, bow):
Contributor Author

@piskvorky should this rather be a __getitem__(), since a transformation here is defined as something that acts like transformation[doc]?

Owner

Sure. We can also let users use either -- a named function or [ ].
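A minimal way to support both access styles, assuming the class delegates to matutils.unitvec as described earlier (a sketch, not the merged code):

from gensim import matutils

class NormModel(object):
    def __init__(self, norm='l2'):
        self.norm = norm

    def normalize(self, bow):
        # Named-function style: model.normalize(doc).
        return matutils.unitvec(bow, norm=self.norm)

    def __getitem__(self, bow):
        # Bracket style: model[doc], the usual gensim transformation idiom.
        return self.normalize(bow)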

@devashishd12
Contributor Author

@tmylk @piskvorky I've addressed all the comments. Hopefully it's much better now! Could you please review :)

@tmylk tmylk merged commit 02e49f9 into piskvorky:develop Jun 1, 2016
@devashishd12 devashishd12 deleted the norm_add branch June 1, 2016 03:02
@devashishd12
Contributor Author

@tmylk @piskvorky thanks a ton for the reviews! 🍻

@devashishd12
Contributor Author

@piskvorky I was thinking of writing a notebook tutorial for this feature; however, I can't think of a practical use case for it. What would be the best way to go about this?

@piskvorky
Owner

Not sure... @tmylk ? Maybe indexing some corpus with/without normalization, showing how the returned "most similar" docs (or words, or embeddings... basically any vectors) differ?
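A rough sketch of that comparison, using the norm parameter this PR added to the Similarity constructor (toy corpus and output paths are illustrative):

from gensim import corpora, similarities

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "minors", "trees"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Same corpus indexed twice, differing only in the document norm.
index_l2 = similarities.Similarity('/tmp/idx_l2', corpus,
                                   num_features=len(dictionary), norm='l2')
index_l1 = similarities.Similarity('/tmp/idx_l1', corpus,
                                   num_features=len(dictionary), norm='l1')

query = dictionary.doc2bow(["computer", "system"])
print(index_l2[query])  # most-similar scores under l2 normalization
print(index_l1[query])  # scores under l1 normalization, for comparison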
