Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(deinflect): Add logic to handle all inflections of the copula だ #988

Merged
merged 13 commits into from
May 1, 2022

Conversation

maawisul
Copy link
Contributor

@maawisul maawisul commented Apr 22, 2022

Fixes #89

Add one more conditional for the type of "です"
@codecov
Copy link

codecov bot commented Apr 22, 2022

Codecov Report

Merging #988 (4b57766) into main (69c7fe7) will increase coverage by 0.15%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #988      +/-   ##
==========================================
+ Coverage   78.97%   79.12%   +0.15%     
==========================================
  Files           7        7              
  Lines        3001     3004       +3     
  Branches      174      184      +10     
==========================================
+ Hits         2370     2377       +7     
+ Misses        628      622       -6     
- Partials        3        5       +2     
Impacted Files Coverage Δ
extension/data.ts 82.43% <100.00%> (+0.43%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 69c7fe7...4b57766. Read the comment docs.

@melink14
Copy link
Owner

I think the change to make it conjugate as a verb might be too broad since it will cause it to match other verb endings as well:
image

I wasn't thinking it through when I thought we could reuse the existing す verb logic... 🙇🏾

Since です is so irregular I think the complete fix would be to add a new type for cop in deinflect.dat and then add entries for the cases which only apply to this type of word. (https://www.tanoshiijapanese.com/dictionary/conjugation_details.cfm?entry_id=103243&element_id=128958&conjugation_type_id=5 is the best list I found but we don't need to add them all now.)

To add an entry to deinflect.dat is faily straight forward. It starts with a set of reasons and then each entry has the form of:
<inflected_form> <deinflected_form> <type_mask> <index_into_reasons>

The first two are encoding the logic we want and the last is a 0 based index into the reasons at the top of the file.

The type mask is opaque but it looks to be two values OR'd together (type of inflected word and type of deinflected word) as was clarified in this fork: https://gist.github.com/lynn/07f7ce3c314223d2aca19ec2bb0540cd#file-patch-deinflect-py-L7

I think if we add a new base mask for です then we can easily add a line to deinflect.dat and update your branch a bit to make things work specifically for this case!

What do you think?

@maawisul
Copy link
Contributor Author

I agree! I did think that the case of です and its conjugation is pretty irregular and need to be differentiated from any other verbs or adjectives. However, I am not sure on how to come up with a new typemask and would need some guidance on that.

@melink14
Copy link
Owner

No problem! I have some ideas about that.

For the type of the inflected word, I think 0x0080 (OTHER) makes sense in this case (though if we added forms like ではない it would be different). For BASE we'll need a new type in the upper half of the mask (imagine BASE_DESU) and since BASE_SURU is 0x1000 we can increment it to 0x2000 and be safe. (This will convert to a type of 32 in data.ts).

> node -p "0x80|0x2000"
8320

So I think a type mask of 8320 should work well for でした ー> です

Let me know if you have any other questions.

@melink14
Copy link
Owner

BTW, I was reading http://www.japaneseprofessor.com/reference/grammar/conjugations-of-the-japanese-copula/ which made me realize the honorific forms are also lost in rikaikun.

I think we'll be able to add several new lines for various copula conjugations once the base logic is in. You could do it at the same time or just focus on でした depending on your time/want.

@maawisul
Copy link
Contributor Author

Thanks! I will try working on it as much as I can submit another PR when I'm satisfied with my change.

Copy link

@tentagul tentagul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that works.

@melink14
Copy link
Owner

Thanks! I will try working on it as much as I can submit another PR when I'm satisfied with my change.

For now, I'll close this PR then and feel free to re-open when ready, or let me know if you have any more questions on the issue.

@melink14 melink14 closed this Apr 26, 2022
@melink14 melink14 mentioned this pull request Apr 26, 2022
Add entries to でした、じゃありません、じゃありませんでした
ではございません, ではございませんでした, でございましょう, でございまして
@maawisul
Copy link
Contributor Author

maawisul commented Apr 28, 2022

I think I am satisfied with my changes and ready to re-open. How do I re-open a pull request?

@melink14 melink14 reopened this Apr 28, 2022
@maawisul
Copy link
Contributor Author

I added entries and type(past negative) to deinflect.dat so that rikaikun now recognizes:
でした
じゃありません
じゃありませんでした
でございません / ではございません
でございませんでした / ではございません

I am not sure on what to do with ございます on its own as I am not sure if it is still considered a copula or not 

@melink14
Copy link
Owner

Thanks for the update! Looks like we're pretty close.

I think maybe things became complicated because the dictionary contains many congugated forms of the copula which you tried to work with directly and thus created congugation mappings to intermediate inflections.

It makes sense given that the dictionary contains things like じゃない even though that's the negative form of だ but I think we should maintain consistency and conjugate back to the base form in all cases. This means that we'll show two entries for some words but that's okay in my mind because sometimes these short common phrases have idiomatic meaning as well as the basic grammar meaning.

I think not relying on the dictionary containing certain common inflections is also more robust since they might remove the inflected forms in a future update.

In that case, it makes sense to add mappings back to だ since it's the polite affirmative form of the copula. We can start with the mappings that don't have any other entry in the dictionary or fill out the entire table. I'll leave a comment with what I mean inline.

Copy link
Owner

@melink14 melink14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good with some small comments.

We should also add some tests for the data.ts changes. There aren't tests for that file yet but it should be fairly straightforward to do in the same manner of other files like background.ts.

That being said, it can be tricky writing the first tests for a class, so let me know if you want me to add a commit with some tests this time around.

extension/data/deinflect.dat Outdated Show resolved Hide resolved
extension/data/deinflect.dat Outdated Show resolved Hide resolved
extension/data/deinflect.dat Outdated Show resolved Hide resolved
extension/data/deinflect.dat Outdated Show resolved Hide resolved
extension/data/deinflect.dat Outdated Show resolved Hide resolved
extension/data/deinflect.dat Outdated Show resolved Hide resolved
extension/data/deinflect.dat Outdated Show resolved Hide resolved
extension/data/deinflect.dat Outdated Show resolved Hide resolved
extension/data/deinflect.dat Outdated Show resolved Hide resolved
extension/data.ts Show resolved Hide resolved
maawisul and others added 4 commits April 30, 2022 11:33
Co-authored-by: Erek Speed <melink14@gmail.com>
Co-authored-by: Erek Speed <melink14@gmail.com>
fix entry for ではない and じゃない
- Add chai assertion helpers for loose matching of objects inside of arrays.
- Add positive and negative cases for new/changed branches.
@melink14
Copy link
Owner

melink14 commented May 1, 2022

I think it should be basically done now though it looks like I need to increase the timeout for the new data tests (loading the data is slower in github than on my local computer...)

Gradually increase if tests are flaky in the cloud.
@melink14 melink14 changed the title fix: Modify data.ts to recognize "でした” feat(deinflect): Add logic to handle all inflections of the copula だ May 1, 2022
@mergify mergify bot merged commit 52d30c5 into melink14:main May 1, 2022
@melink14
Copy link
Owner

melink14 commented May 1, 2022

@all-contributors Add @maawisul for code

@allcontributors
Copy link
Contributor

@melink14

I've put up a pull request to add @maawisul! 🎉

mergify bot pushed a commit that referenced this pull request May 1, 2022
Add @maawisul as a contributor for code.

This was requested by melink14 [in this comment](#988 (comment))
melink14 pushed a commit that referenced this pull request May 2, 2022
## [2.4.0](v2.3.5...v2.4.0) (2022-05-02)

### Features

* **deinflect:** Add logic to handle all inflections of the copula だ ([#988](#988)) ([52d30c5](52d30c5)), closes [#89](#89)

### Bug Fixes

* **dict:** Update dictionaries to latest versions ([#1004](#1004)) ([0c31844](0c31844))
@melink14
Copy link
Owner

melink14 commented May 2, 2022

🎉 This PR is included in version 2.4.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

でした not recognized
3 participants