Add analysis framework #236

mosuka · 2022-10-09T13:45:58Z

Add analysis framework.
However, this feature is experimental.

mocobeta · 2022-10-11T13:22:20Z

lindera/src/character_filter.rs

+    }
+}
+
+impl CharacterFilter for UnicodeNormalizeCharacterFilter {


I think it would be better to have a separate .rs file for each CharacterFilter implementation.

OK. I'll split them into separate .rs files.

Fixed.
Please see the following commit:
fbcc4e6

mocobeta · 2022-10-11T13:29:56Z

lindera/src/character_filter.rs

+
+#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
+pub struct MappingCharacterFilterConfig {
+    pub mapping: HashMap<char, char>,


Can we have HashMap<String, String> for mapping, so that users can replace character sequences with other character sequences instead of replacing just one character with another one?
For example, Lucene's MappingCharFilter takes a String -> String map.
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/NormalizeCharMap.Builder.html

I see. I'll follow Lucene's approach.

Fixed.
Please see the following commit:
c6141f4
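
To illustrate the String -> String mapping discussed above, here is a minimal sketch using only the standard library. It is not the code from the commit: `apply_mapping` is a hypothetical helper, and preferring the longest matching key at each position is an assumption made for this example.

```rust
use std::collections::HashMap;

/// Hypothetical helper, not Lindera's implementation: replaces every
/// occurrence of a key in `mapping` with its value, preferring the longest
/// key that matches at each position.
pub fn apply_mapping(mapping: &HashMap<String, String>, text: &str) -> String {
    // The longest key (in bytes) bounds how far ahead we need to look.
    let max_len = mapping.keys().map(|k| k.len()).max().unwrap_or(0);
    let mut result = String::with_capacity(text.len());
    let mut pos = 0;

    while pos < text.len() {
        let rest = &text[pos..];
        // Try candidate prefixes from longest to shortest, at char boundaries only.
        let matched = (1..=max_len.min(rest.len()))
            .rev()
            .filter(|&len| rest.is_char_boundary(len))
            .find_map(|len| mapping.get(&rest[..len]).map(|v| (len, v)));

        match matched {
            Some((len, replacement)) => {
                result.push_str(replacement);
                pos += len;
            }
            None => {
                // No key matches here: copy one character through unchanged.
                let ch = rest.chars().next().expect("rest is non-empty");
                result.push(ch);
                pos += ch.len_utf8();
            }
        }
    }
    result
}
```

With a map containing "リンデラ" -> "Lindera", calling `apply_mapping` on "リンデラは形態素解析器です。" replaces the whole four-character sequence, which a `HashMap<char, char>` could not express.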

mocobeta · 2022-10-11T13:51:42Z

lindera/src/token_filter.rs

+                }
+            }
+
+            t.push(token.clone());


Could we take references to Tokens to avoid cloning Token objects? Also, could we replace the tokens vector in place?

I haven't tried it, but

Vec<Token<'a>>

could be

Vec<Rc<RefCell<Token<'a>>>>

I'll revise it to avoid copying. Thanks.

Fixed.
Please see the following commit:
388982c

I have decided to use the retain function at this time. What do you think?
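
For readers following the thread, the retain approach could look roughly like the sketch below. The `Token` struct here is a simplified stand-in and `remove_stop_words` is a hypothetical function, so this shows the technique rather than the code in the commit.

```rust
use std::collections::HashSet;

/// Simplified stand-in for Lindera's token type (an assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Token<'a> {
    pub text: &'a str,
}

/// Hypothetical stop-word filter: drops tokens in place, without cloning any
/// of the tokens that are kept.
pub fn remove_stop_words<'a>(tokens: &mut Vec<Token<'a>>, stop_words: &HashSet<&str>) {
    // `retain` keeps only the elements for which the closure returns true and
    // shifts the survivors down inside the same allocation.
    tokens.retain(|token| !stop_words.contains(token.text));
}
```

Because `Vec::retain` mutates the vector it is given, there is no second vector to push cloned tokens into, which addresses both questions raised in the review.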

johtani

I love this PR!
I left some suggestions. Please check these.
We should also have some test cases for invalid settings in analyzer.rs. What do you think?

johtani · 2022-10-11T13:44:39Z

lindera/src/character_filter.rs

+pub const UNICODE_NORMALIZE_CHARACTER_FILTER_NAME: &str = "unicode_normalize";
+
+#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
+pub enum UnidoceNormalizeKind {


Suggested change

pub enum UnidoceNormalizeKind {

pub enum UnicodeNormalizeKind {

Fix typo

Oh, that's a terrible typo!
I'll fix it all together. Thanks!

Fixed.
Please see the following commit:
fbcc4e6

johtani · 2022-10-11T13:44:54Z

lindera/src/character_filter.rs

+}
+
+#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
+pub struct UnidoceNormalizeCharacterFilterConfig {


Suggested change

pub struct UnidoceNormalizeCharacterFilterConfig {

pub struct UnicodeNormalizeCharacterFilterConfig {

Fix typo

johtani · 2022-10-11T13:45:16Z

lindera/src/character_filter.rs

+    pub kind: UnidoceNormalizeKind,
+}
+
+impl UnidoceNormalizeCharacterFilterConfig {


Suggested change

impl UnidoceNormalizeCharacterFilterConfig {

impl UnicodeNormalizeCharacterFilterConfig {

Fix typo

johtani · 2022-10-11T13:45:29Z

lindera/src/character_filter.rs

+}
+
+impl UnidoceNormalizeCharacterFilterConfig {
+    pub fn new(kind: UnidoceNormalizeKind) -> Self {


Suggested change

pub fn new(kind: UnidoceNormalizeKind) -> Self {

pub fn new(kind: UnicodeNormalizeKind) -> Self {

Fix typo

johtani · 2022-10-11T13:45:44Z

lindera/src/character_filter.rs

+
+#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
+pub struct UnidoceNormalizeCharacterFilterConfig {
+    pub kind: UnidoceNormalizeKind,


Suggested change

pub kind: UnidoceNormalizeKind,

pub kind: UnicodeNormalizeKind,

Fix typo

johtani · 2022-10-11T13:48:41Z

lindera/src/character_filter.rs

+            UnidoceNormalizeKind::NFC => text.nfc().collect::<String>(),
+            UnidoceNormalizeKind::NFD => text.nfd().collect::<String>(),
+            UnidoceNormalizeKind::NFKC => text.nfkc().collect::<String>(),
+            UnidoceNormalizeKind::NFKD => text.nfkd().collect::<String>(),


Suggested change

-            UnidoceNormalizeKind::NFC => text.nfc().collect::<String>(),
-            UnidoceNormalizeKind::NFD => text.nfd().collect::<String>(),
-            UnidoceNormalizeKind::NFKC => text.nfkc().collect::<String>(),
-            UnidoceNormalizeKind::NFKD => text.nfkd().collect::<String>(),
+            UnicodeNormalizeKind::NFC => text.nfc().collect::<String>(),
+            UnicodeNormalizeKind::NFD => text.nfd().collect::<String>(),
+            UnicodeNormalizeKind::NFKC => text.nfkc().collect::<String>(),
+            UnicodeNormalizeKind::NFKD => text.nfkd().collect::<String>(),

Fix typo

johtani · 2022-10-11T13:54:03Z

lindera/src/character_filter.rs

+
+impl RegexCharacterFilter {
+    pub fn new(config: RegexCharacterFilterConfig) -> Self {
+        let regex = Regex::new(&config.pattern).unwrap();


Can this fail? Could we validate the pattern before creating the new instance?

Fixed.
Please see the following commit:
2caeca4
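
One common shape for this kind of fix is a constructor that returns a Result instead of calling unwrap(). The sketch below uses only the standard library: `compile_pattern` is a deliberately trivial stand-in for a real regex compiler (it only checks that parentheses are balanced), and `PatternError` and `PatternFilter` are hypothetical names, not Lindera's types.

```rust
use std::fmt;

/// Error type for this sketch (hypothetical; Lindera has its own error types).
#[derive(Debug, PartialEq)]
pub struct PatternError(String);

impl fmt::Display for PatternError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid pattern: {}", self.0)
    }
}

impl std::error::Error for PatternError {}

/// Stand-in for a real regex compiler: it only checks that parentheses are
/// balanced, which is enough to exercise the error path.
fn compile_pattern(pattern: &str) -> Result<String, PatternError> {
    let mut depth = 0i32;
    for ch in pattern.chars() {
        match ch {
            '(' => depth += 1,
            ')' => depth -= 1,
            _ => {}
        }
        if depth < 0 {
            return Err(PatternError(pattern.to_string()));
        }
    }
    if depth == 0 {
        Ok(pattern.to_string())
    } else {
        Err(PatternError(pattern.to_string()))
    }
}

#[derive(Debug)]
pub struct PatternFilter {
    compiled: String,
}

impl PatternFilter {
    /// Returns an error instead of panicking when the pattern is invalid.
    pub fn new(pattern: &str) -> Result<Self, PatternError> {
        // `?` hands the error to the caller; no `unwrap()` is needed.
        let compiled = compile_pattern(pattern)?;
        Ok(Self { compiled })
    }

    pub fn pattern(&self) -> &str {
        &self.compiled
    }
}
```

The caller can then decide how to report a bad pattern, for example when loading a configuration file, rather than having the process panic inside the constructor.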

johtani · 2022-10-11T13:54:39Z

lindera/src/character_filter.rs

+    use crate::character_filter::{
+        CharacterFilter, MappingCharacterFilter, MappingCharacterFilterConfig,
+        RegexCharacterFilter, RegexCharacterFilterConfig, UnicodeNormalizeCharacterFilter,
+        UnidoceNormalizeCharacterFilterConfig,


Suggested change

UnidoceNormalizeCharacterFilterConfig,

UnicodeNormalizeCharacterFilterConfig,

Fix typo

johtani · 2022-10-11T13:55:03Z

lindera/src/character_filter.rs

+        }
+        "#;
+        let config =
+            UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();


Suggested change

UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();

UnicodeNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();

Fix typo

johtani · 2022-10-11T13:55:15Z

lindera/src/character_filter.rs

+        let config =
+            UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
+
+        assert_eq!(config.kind, super::UnidoceNormalizeKind::NFKC);


Suggested change

assert_eq!(config.kind, super::UnidoceNormalizeKind::NFKC);

assert_eq!(config.kind, super::UnicodeNormalizeKind::NFKC);

Fix typo

johtani

LGTM apart from the typos :)

johtani · 2022-10-13T08:35:22Z

lindera/src/character_filter/unicode_normalize.rs

+    pub kind: UnicodeNormalizeKind,
+}
+
+impl UnidoceNormalizeCharacterFilterConfig {


Suggested change

impl UnidoceNormalizeCharacterFilterConfig {

impl UnicodeNormalizeCharacterFilterConfig {

typo

johtani · 2022-10-13T08:35:43Z

lindera/src/character_filter/unicode_normalize.rs

+
+#[derive(Clone, Debug)]
+pub struct UnicodeNormalizeCharacterFilter {
+    config: UnidoceNormalizeCharacterFilterConfig,


Suggested change

config: UnidoceNormalizeCharacterFilterConfig,

config: UnicodeNormalizeCharacterFilterConfig,

typo

johtani · 2022-10-13T08:35:55Z

lindera/src/character_filter/unicode_normalize.rs

+}
+
+impl UnicodeNormalizeCharacterFilter {
+    pub fn new(config: UnidoceNormalizeCharacterFilterConfig) -> Self {


Suggested change

pub fn new(config: UnidoceNormalizeCharacterFilterConfig) -> Self {

pub fn new(config: UnicodeNormalizeCharacterFilterConfig) -> Self {

typo

johtani · 2022-10-13T08:36:10Z

lindera/src/character_filter/unicode_normalize.rs

+
+    pub fn from_slice(data: &[u8]) -> LinderaResult<Self> {
+        Ok(Self::new(
+            UnidoceNormalizeCharacterFilterConfig::from_slice(data)?,


Suggested change

UnidoceNormalizeCharacterFilterConfig::from_slice(data)?,

UnicodeNormalizeCharacterFilterConfig::from_slice(data)?,

typo

johtani · 2022-10-13T08:36:38Z

lindera/src/character_filter/unicode_normalize.rs

+    use lindera_core::character_filter::CharacterFilter;
+
+    use crate::character_filter::unicode_normalize::{
+        UnicodeNormalizeCharacterFilter, UnidoceNormalizeCharacterFilterConfig,


Suggested change

UnicodeNormalizeCharacterFilter, UnidoceNormalizeCharacterFilterConfig,

UnicodeNormalizeCharacterFilter, UnicodeNormalizeCharacterFilterConfig,

typo

johtani · 2022-10-13T08:36:53Z

lindera/src/character_filter/unicode_normalize.rs

+        }
+        "#;
+        let config =
+            UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();


Suggested change

UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();

UnicodeNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();

typo

johtani · 2022-10-13T08:38:00Z

lindera/src/character_filter/unicode_normalize.rs

+}
+
+#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
+pub struct UnidoceNormalizeCharacterFilterConfig {


Suggested change

pub struct UnidoceNormalizeCharacterFilterConfig {

pub struct UnicodeNormalizeCharacterFilterConfig {

typo

mosuka · 2022-10-13T12:01:22Z

@johtani
Fixed the typos and added a test for invalid analyzer settings.
9c911d5

johtani

LGTM

mocobeta · 2022-10-15T04:30:27Z

lindera/src/character_filter/mapping.rs

+#[derive(Clone)]
+pub struct MappingCharacterFilter {
+    config: MappingCharacterFilterConfig,
+    trie: DoubleArray<Vec<u8>>,


Is the mapping that large? To me an ordinary HashMap seems sufficient (and fast) here.

OK, I found Lucene's MappingCharFilter uses FST.
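
As background on why a trie-like structure suits this filter: it can find the longest mapping key that prefixes the remaining text in a single left-to-right walk. The sketch below is a plain HashMap-based character trie written for illustration only; the diff above uses a DoubleArray, and the `Trie` type and its methods here are hypothetical.

```rust
use std::collections::HashMap;

/// Minimal character trie, for illustration only.
#[derive(Default)]
pub struct Trie {
    children: HashMap<char, Trie>,
    value: Option<String>,
}

impl Trie {
    /// Registers `key` so that it maps to `value`.
    pub fn insert(&mut self, key: &str, value: &str) {
        let mut node = self;
        for ch in key.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.value = Some(value.to_string());
    }

    /// Returns the byte length of the longest key that is a prefix of `text`,
    /// together with its value, walking the text only once.
    pub fn longest_prefix(&self, text: &str) -> Option<(usize, &str)> {
        let mut node = self;
        let mut best = None;
        for (idx, ch) in text.char_indices() {
            match node.children.get(&ch) {
                Some(next) => node = next,
                None => break,
            }
            // Remember the deepest node seen so far that ends a key.
            if let Some(value) = &node.value {
                best = Some((idx + ch.len_utf8(), value.as_str()));
            }
        }
        best
    }
}
```

A plain HashMap lookup needs one probe per candidate prefix length at each position, whereas the trie walk stops as soon as no key can match. For small mappings the difference is unlikely to matter.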

mocobeta · 2022-10-15T04:40:57Z

lindera/src/character_filter/regex.rs

+
+        let mut text = "リンデラは形態素解析器です。".to_string();
+        filter.apply(&mut text).unwrap();
+        assert_eq!("Linderaは形態素解析器です。", text);


Just in case, I think it'd be better to have tests for patterns including actual regular expressions instead of just a fixed string.

mocobeta

Looks good to me.

Add analysis framework

36fe892

mosuka requested review from johtani, mocobeta and ikawaha October 9, 2022 13:45

mosuka added 2 commits October 9, 2022 22:53

Update CHANGES.md

789bb4c

Update CHANGES.md

d52f7c8

mocobeta reviewed Oct 11, 2022

johtani requested changes Oct 11, 2022

mosuka added 4 commits October 12, 2022 00:29

Separate files

fbcc4e6

Change HashSet to HashMap

c6141f4

Use retain

388982c

Handle error

2caeca4

mosuka requested review from mocobeta and johtani October 12, 2022 13:29

johtani requested changes Oct 13, 2022

mosuka added 2 commits October 13, 2022 20:56

Fix typo and test for wrong setting

9c911d5

Remove comments

b4f5b48

mosuka requested a review from johtani October 13, 2022 12:03

johtani approved these changes Oct 13, 2022

mocobeta reviewed Oct 15, 2022

mocobeta approved these changes Oct 15, 2022

mosuka added 2 commits October 15, 2022 23:14

Add test for regex pattern matching

cf6bed7

Fix format

a90d618

mosuka merged commit d5527d5 into main Oct 15, 2022

mosuka mentioned this pull request Oct 15, 2022

Add analyzer framework #168

Closed

mosuka deleted the analysis_framework branch October 24, 2022 01:56
