
Optimize tagger logic using numpy #35

Open
ZhanruiLiang wants to merge 8 commits into main

Conversation

@ZhanruiLiang commented on Oct 17, 2022

For #33.
Use numpy to vectorize computations instead of doing vector/matrix computations with dicts.
The observed performance gain is around 3x, tested under the use case of https://github.com/CanCLID/typo-corrector. train_tagger.py shows a similar improvement.
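
For illustration only (not code from this PR), a minimal sketch of the kind of change described, with hypothetical names weights, feature_to_index, and classes:

    import numpy as np

    # Dict-based scoring: a Python-level loop over every active feature and class.
    def score_dict(weights, classes, features):
        scores = {c: 0.0 for c in classes}
        for feat in features:
            for c, w in weights.get(feat, {}).items():
                scores[c] += w
        return max(classes, key=lambda c: scores[c])

    # Vectorized scoring: gather the active weight columns and sum them in numpy.
    # weights has shape (n_classes, n_features); features are treated as binary.
    def score_numpy(weights, feature_to_index, classes, features):
        cols = [feature_to_index[f] for f in features if f in feature_to_index]
        scores = weights[:, cols].sum(axis=1)
        return classes[int(np.argmax(scores))]

The speedup comes from pushing the per-feature loop into numpy; the exact factor depends on how many features are active per token.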

To make the tests pass, I had to fix a bug in prev, prev2 = self.START. It should really be prev2, prev = self.START, as prev2 is supposed to be the tag of the (i-2)-th word.
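
To make the ordering concrete, a toy sketch of the fix (not the PR's actual code), assuming self.START stores the two sentinel tags in (i-2, i-1) order:

    START = ("-START2-", "-START-")  # assumed order: (tag of word i-2, tag of word i-1)

    prev, prev2 = START   # buggy: prev2 ends up holding the i-1 sentinel
    prev2, prev = START   # fixed: prev2 holds the sentinel for the (i-2)-th word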

I also had to increase the training iteration count to make the tests pass. The expected value 135 in the test is changed to 136, as 5 happens to have some weight in the trained model. The iteration count is now 50% higher, and the accuracy is over 98%.

  • Add a concise title to this pull request on the GitHub web interface.
  • Add a description in this box to describe what this pull request is about.
  • If code behavior is being updated (e.g., a bug fix), relevant tests should be added.
  • The CircleCI builds should pass, including both the code styling checks by black and flake8 as well as the test suite.
  • Add an entry to CHANGELOG.md at the repository's root level.

@ZhanruiLiang (Author)

The doc test is failing with:

Example at /home/circleci/project/docs/source/parsing.rst, line 26, column 1 did not evaluate as expected:
Expected:
    *X:    你         食         咗        飯          未        呀        ?
    %mor:  PRON|nei5  VERB|sik6  PART|zo2  NOUN|faan6  ADV|mei6  PART|aa4  ?
    <BLANKLINE>
    *X:    食         咗        喇         !
    %mor:  VERB|sik6  PART|zo2  PART|laa1  !
    <BLANKLINE>
    *X:    你         聽日           得         唔      得閒           呀        ?
    %mor:  PRON|nei5  ADV|ting1jat6  VERB|dak1  ADV|m4  ADJ|dak1haan4  PART|aa4  ?
    <BLANKLINE>
Got:
    *X:    你         食         咗        飯          未        呀        ?
    %mor:  PRON|nei5  VERB|sik6  PART|zo2  NOUN|faan6  ADV|mei6  PART|aa4  ?
    <BLANKLINE>
    *X:    食         咗        喇         !
    %mor:  VERB|sik6  PART|zo2  PART|laa1  !
    <BLANKLINE>
    *X:    你         聽日           得        唔      得閒           呀        ?
    %mor:  PRON|nei5  ADV|ting1jat6  AUX|dak1  ADV|m4  ADJ|dak1haan4  PART|aa4  ?
    <BLANKLINE>

Any suggestions? I'm not sure whether "得" in this context is VERB or AUX.

@ZhanruiLiang changed the title from "Optimize tagger logic using numpy (#33)" to "Optimize tagger logic using numpy" on Oct 17, 2022
@laubonghaudoi

I think AUX is definitely wrong, but it's not VERB either; I'm not sure what @jacksonllee thinks. I'm not that familiar with PyCantonese's standards, but I'd treat 「得」 here as a shortened form of 「得閒」, so shouldn't it also be tagged ADJ?

@jacksonllee (Owner)

Hello! Just a heads-up that I see this coming through and I'll be able to take a look at this PR this week. Thanks for the contribution!

@jacksonllee (Owner) left a comment


Sorry for the delay. For the test failures, my hunch is that the noisy nature of the HKCanCor data might be to blame, but I'm digging into this and will come back with more comments. Thank you for your careful work!

# Maps feature into column index of the weights matrix.
self._feature_to_index: Dict[Hashable, int] = {}
# Matrix represented by 2D array of shape (n_classes, n_features).
self._weights = numpy.zeros(1)
@jacksonllee (Owner)
It looks like this _weights array is a pretty sparse matrix at the end of the day (2 million values, with ~93% being zero). What do you think about switching to one of those sparse matrix representations from scipy (I'd be okay with adding scipy as a dependency) to lower the memory footprint? Speaking of which, maybe we can use float32 instead of the default float64 in numpy/scipy to save more memory? I doubt if the higher precision of float64 matters for our purposes here.
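
For illustration only (not code from this PR), the kind of change being suggested; the shape below is made up so that the matrix has roughly the 2 million values mentioned above:

    import numpy as np
    from scipy import sparse

    n_classes, n_features = 50, 40_000  # illustrative sizes only

    dense64 = np.zeros((n_classes, n_features))                    # float64: ~16 MB
    dense32 = np.zeros((n_classes, n_features), dtype=np.float32)  # float32: ~8 MB

    # CSR stores only the nonzero entries, so a ~93%-zero matrix shrinks a lot,
    # at the cost of slower element-wise updates.
    sparse32 = sparse.csr_matrix((n_classes, n_features), dtype=np.float32)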

@ZhanruiLiang (Author)

Changed to float32. I tried scipy sparse matrices, but they are not a good fit here: a sparse format that supports fast arithmetic is expensive to update, and we need to do such updates in this case. As a result, it's much slower, e.g. a training iteration takes about 10 s while the current implementation takes under 1 s. After switching to float32, the memory footprint is under 8 MiB, which I don't think is really a concern. Also, importing scipy itself would take more than 8 MiB.
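
As a hedged aside on why sparse updates hurt: a perceptron-style update touches only a few cells per token, which is cheap on a dense ndarray but forces CSR/CSC formats to rebuild their internal arrays. A hypothetical sketch of the update pattern:

    import numpy as np

    def perceptron_update(weights, feature_cols, truth, guess, lr=1.0):
        """Reward the true class and penalize the guess for the active feature
        columns; on a dense ndarray this is O(number of active features)."""
        if truth == guess:
            return
        weights[truth, feature_cols] += lr
        weights[guess, feature_cols] -= lr

    # The same fancy-indexed assignment on a scipy.sparse.csr_matrix typically
    # emits a SparseEfficiencyWarning and reallocates the internal arrays,
    # which is consistent with the slowdown reported above.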

@laubonghaudoi

Just wondering whether there has been any progress on this? Hoping it can resolve #33 so that a new release can be published, which would also take care of #43.

@ZhanruiLiang (Author)

No update since my last comment. As I remember, the problem is that the test data are too flaky and the updated code doesn't produce the same output.
