
[MRG] ENH: Make normalization an explicit transformation. #649

Merged
merged 1 commit into piskvorky:develop from norm_add on Jun 1, 2016

Conversation

devashishd12
Contributor

Addresses issue #69. What this PR includes:

  • Added support for the 'l1' norm.
  • Added a norm parameter to the Similarity constructor.
  • Added a norm parameter to matutils.unitvec (see the sketch after the TODO list).
  • Created models/normmodel.py to make normalization an explicit transformation.

TODO:

  • Check for input validation
  • Make test file test/test_norms.py
  • Add tests for input validation using assertRaises
  • Add documentation
  • Add more tests to testTransform to cover all input type validations
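For orientation, a rough sketch of how the matutils.unitvec change is meant to be used (the norm keyword is the addition described above; the printed values are illustrative):

import numpy
from gensim import matutils

v = numpy.array([3.0, 4.0])
print(matutils.unitvec(v))             # default l2 norm: [0.6, 0.8]
print(matutils.unitvec(v, norm='l1'))  # l1 norm: [3/7, 4/7]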

@piskvorky @tmylk Could you please tell me if I'm going on the right track?

@devashishd12
Contributor Author

@piskvorky @tmylk I've created models/normmodel.py in order to make normalization an explicit transformation. I hope I'm not going off on a tangent with this PR :P

@devashishd12 devashishd12 changed the title from "[WIP] ENH: added l0, l1 norm to make transformation explicit." to "[WIP] ENH: Make normalization an explicit transformation." on Apr 1, 2016
@piskvorky piskvorky added the feature label on Apr 1, 2016
@devashishd12
Contributor Author

I've added initial tests for just checking the logic. Input validation testing and testing for the Similarity class constructor still remain.

if scipy.sparse.issparse(vec):  # convert scipy.sparse to standard numpy array
    vec = vec.tocsr()
    veclen = numpy.sqrt(numpy.sum(vec.data ** 2))
    if norm == 'l0':
        veclen = len(vec.data)
Contributor

The L_0 "norm" is not homogeneous, so it is impossible to scale.

Contributor Author

Yes, true... Could you please elaborate on the scaling part? Will it affect efficiency, since as the corpus size increases we cannot use the fact that f(ax) = (a^k) f(x)? Should I remove the support for the zero "norm"?

Contributor

Sorry to confuse you; I meant scaling in the mathematical, not the IT, sense.

What number do we need to divide the vector a = (5, 5) by in order to have l_0_norm(a) = 1?
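To spell out the point: the l_0 "norm" just counts nonzero entries, so no nonzero scaling can change it:

$$\|c\,a\|_0 = \#\{i : c\,a_i \neq 0\} = \#\{i : a_i \neq 0\} = \|a\|_0 \qquad (c \neq 0),$$

so $\|(5,5)/t\|_0 = 2$ for every $t \neq 0$; no divisor gives 1.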

Contributor Author

Thanks for the explanation! I think it would be best to remove the L0 norm then and keep normalization just for l1 and l2. Should I proceed with it?

Contributor

Yep. Thanks.
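With the l0 case dropped, the sparse branch only needs l1 and l2. A self-contained sketch of those two computations (sparse_norm is a hypothetical helper, not the merged code):

import numpy
import scipy.sparse

def sparse_norm(vec, norm='l2'):
    # Compute the requested norm over the nonzero entries of a sparse vector.
    vec = vec.tocsr()
    if norm == 'l1':
        return numpy.sum(numpy.abs(vec.data))
    if norm == 'l2':
        return numpy.sqrt(numpy.sum(vec.data ** 2))
    raise ValueError("unsupported norm: %s" % norm)

v = scipy.sparse.csr_matrix([[0.0, 3.0, 4.0]])
print(sparse_norm(v, 'l1'))  # 7.0
print(sparse_norm(v, 'l2'))  # 5.0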


@devashishd12
Contributor Author

@tmylk I've removed L0 norm and added some more tests to test_normmodel.

@devashishd12 devashishd12 force-pushed the norm_add branch 2 times, most recently from ef103b5 to 739d868 on April 5, 2016 05:30
@devashishd12
Contributor Author

@tmylk @piskvorky I've added further tests to cover all input types. I think this PR is ready for review.

@devashishd12 devashishd12 changed the title from "[WIP] ENH: Make normalization an explicit transformation." to "[MRG] ENH: Make normalization an explicit transformation." on Apr 6, 2016
@devashishd12
Contributor Author

Resolved merge conflicts.

@tmylk
Contributor

tmylk commented Apr 25, 2016

@dsquareindia Hey, why did you close this pull request? It is a useful feature.

@devashishd12
Contributor Author

Sorry for that; I'll open it again.

@devashishd12
Contributor Author

devashishd12 commented May 2, 2016

Removed redundant use of corpus, rebased and squashed. Could someone please review this? Thanks!

    return list(vec)
if norm == 'l1':
    length = float(sum(abs(val) for _, val in vec))
    assert length > 0.0, "Document contains all zero entries"
Contributor

What is the reason for this check?

Contributor Author

Sorry, I had just added that to make it symmetric with the L2 case. I should return the vector unchanged in the case of zero length, right? Should I proceed with the change?

    return ret_normalized_vec(vec, length)
if norm == 'l2':
    length = 1.0 * math.sqrt(sum(val ** 2 for _, val in vec))
    assert length > 0.0, "sparse documents must not contain any explicit zero entries"
Contributor Author

Should I change this to return the vector unchanged in the case of zero length, as we are doing with the other input types?
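A sketch of the behavior being discussed, i.e. returning the input unchanged when its norm is zero (bag-of-words (id, value) pairs assumed; ret_normalized_vec mirrors the helper referenced above):

import math

def ret_normalized_vec(vec, length):
    # Scale every (id, value) entry by 1 / length.
    return [(termid, val / length) for termid, val in vec]

def unitvec_bow(vec, norm='l2'):
    # Normalize a bag-of-words vector; hand back a copy unchanged if it is all zeros.
    if norm == 'l1':
        length = float(sum(abs(val) for _, val in vec))
    elif norm == 'l2':
        length = math.sqrt(sum(val ** 2 for _, val in vec))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0.0:
        return list(vec)
    return ret_normalized_vec(vec, length)

print(unitvec_bow([(0, 3.0), (1, 4.0)]))             # [(0, 0.6), (1, 0.8)]
print(unitvec_bow([(0, 2.0), (1, 2.0)], norm='l1'))  # [(0, 0.5), (1, 0.5)]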

@devashishd12
Contributor Author

@tmylk I have split the test suite into multiple tests to ease error checking. I have also removed the __getitem__ transformation and made it a function called normalize(). I hope the PR is a bit better now! :)

@devashishd12
Contributor Author

@piskvorky could you please review this? I think it's ready for merge.

"""
Calculates the norm by calling matutils.unitvec with the norm parameter.
"""
logger.info("Normalizing...")
Owner

This log message is not very illuminating to users... try being more specific.
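For instance, something along these lines would tell the user what is actually happening (a sketch; self.norm is assumed to hold the configured norm):

logger.info("performing %s normalization on input document", self.norm)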

@piskvorky
Owner

Only did a quick scan, looks nicely done, but I'll leave the thorough review to @tmylk.

My only nitpick would be Normmodel => NormModel, for readability. WDYT?

        self.num_nnz = numnnz
        self.norms = norms

    def normalize(self, bow):
Contributor Author

@piskvorky should this rather be a __getitem__(), since a transformation here is defined as something that acts like transformation[doc]?

Owner

Sure. We can also let users use either -- a named function or [ ].
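A minimal way to support both access styles, assuming the class delegates to matutils.unitvec as described earlier (a sketch, not the merged code):

from gensim import matutils

class NormModel(object):
    def __init__(self, norm='l2'):
        self.norm = norm

    def normalize(self, bow):
        # Named-function style: model.normalize(doc).
        return matutils.unitvec(bow, norm=self.norm)

    def __getitem__(self, bow):
        # Bracket style: model[doc], the usual gensim transformation idiom.
        return self.normalize(bow)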

@devashishd12
Contributor Author

@tmylk @piskvorky I've addressed all the comments. Hopefully it's much better now! Could you please review :)

@tmylk tmylk merged commit 02e49f9 into piskvorky:develop Jun 1, 2016
@devashishd12 devashishd12 deleted the norm_add branch June 1, 2016 03:02
@devashishd12
Contributor Author

@tmylk @piskvorky thanks a ton for the reviews! 🍻

@devashishd12
Contributor Author

@piskvorky I was thinking of writing a notebook tutorial for this feature; however, I can't think of a practical use case for it. What would be the best way to go about this?

@piskvorky
Owner

Not sure... @tmylk ? Maybe indexing some corpus with/without normalization, showing how the returned "most similar" docs (or words, or embeddings... basically any vectors) differ?
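A rough sketch of that comparison, using the norm parameter this PR added to the Similarity constructor (toy corpus and output paths are illustrative):

from gensim import corpora, similarities

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "minors", "trees"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Same corpus indexed twice, differing only in the document norm.
index_l2 = similarities.Similarity('/tmp/idx_l2', corpus,
                                   num_features=len(dictionary), norm='l2')
index_l1 = similarities.Similarity('/tmp/idx_l1', corpus,
                                   num_features=len(dictionary), norm='l1')

query = dictionary.doc2bow(["computer", "system"])
print(index_l2[query])  # most-similar scores under l2 normalization
print(index_l1[query])  # scores under l1 normalization, for comparison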
