Add analysis framework #236
lindera/src/character_filter.rs
Outdated
}
}

impl CharacterFilter for UnicodeNormalizeCharacterFilter {
I feel it'd be better to have a separate .rs file for each CharacterFilter implementation.
OK. I'll split them into separate .rs files.
Fixed.
Please see the following commit:
fbcc4e6
lindera/src/character_filter.rs
Outdated
#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct MappingCharacterFilterConfig {
    pub mapping: HashMap<char, char>,
Can we have HashMap<String, String> for mapping, so that users can replace character sequences with other character sequences instead of replacing just one character with another?
For example, Lucene's MappingCharFilter takes a String -> String map.
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/NormalizeCharMap.Builder.html
I see, I guess I should imitate Lucene.
Fixed.
Please see the following commit:
c6141f4
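For readers following this thread, here is a minimal sketch of what a String -> String mapping with longest-match replacement could look like. The function name `apply_mapping` is hypothetical and this is not the code from the commit above (which, per a later comment in this review, uses a trie); it only illustrates the behaviour the reviewer asked for, using a plain HashMap.

```rust
use std::collections::HashMap;

/// Hypothetical helper, not Lindera's implementation: replaces every key of
/// `mapping` found in `text` with its value, preferring the longest key that
/// matches at each position.
fn apply_mapping(mapping: &HashMap<String, String>, text: &str) -> String {
    // The longest key (in bytes) bounds how far ahead we need to look.
    let max_len = mapping.keys().map(|k| k.len()).max().unwrap_or(0);
    let mut result = String::with_capacity(text.len());
    let mut pos = 0;

    while pos < text.len() {
        let rest = &text[pos..];

        // Byte offsets of every char boundary within `max_len` bytes.
        let ends: Vec<usize> = rest
            .char_indices()
            .map(|(i, c)| i + c.len_utf8())
            .take_while(|&end| end <= max_len)
            .collect();

        // Try the longest candidate prefix first.
        let mut matched = None;
        for &end in ends.iter().rev() {
            if let Some(replacement) = mapping.get(&rest[..end]) {
                matched = Some((end, replacement));
                break;
            }
        }

        match matched {
            Some((len, replacement)) => {
                result.push_str(replacement);
                pos += len;
            }
            None => {
                // No key starts here; copy one character unchanged.
                let c = rest.chars().next().unwrap();
                result.push(c);
                pos += c.len_utf8();
            }
        }
    }
    result
}
```

With a mapping from "リンデラ" to "Lindera", the input "リンデラは形態素解析器です。" becomes "Linderaは形態素解析器です。". A HashMap<char, char> cannot express this, since the key spans four characters.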
lindera/src/token_filter.rs
Outdated
}
}

t.push(token.clone());
Could we take references to Token to avoid cloning Token objects? Also, could we modify the tokens vector in place?
I haven't tried it, but Vec<Token<'a>> could be Vec<Rc<RefCell<Token<'a>>>>.
I'll revise it to avoid copying. Thanks.
Fixed.
Please see the following commit:
388982c
I have decided to use the retain function at this time. What do you think?
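As an illustration of the retain approach mentioned above, here is a small sketch that filters a token vector in place without cloning. The `Token` struct and `remove_stop_words` function are simplified, hypothetical stand-ins and not the code from the commit; only `Vec::retain` itself is the standard-library call under discussion.

```rust
/// Hypothetical, simplified token type; Lindera's real `Token` has more fields.
#[derive(Debug, PartialEq)]
struct Token<'a> {
    text: &'a str,
}

/// Removes stop-word tokens from `tokens` in place.
/// `Vec::retain` keeps only the elements for which the closure returns true,
/// so no token is cloned and no second vector is allocated.
fn remove_stop_words(tokens: &mut Vec<Token<'_>>, stop_words: &[&str]) {
    tokens.retain(|token| !stop_words.iter().any(|word| *word == token.text));
}
```

Compared with wrapping each token in Rc<RefCell<...>>, this keeps the plain Vec<Token<'a>> type and needs only a mutable borrow of the vector.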
I love this PR!
I left some suggestions. Please check them.
We should also have some test cases for invalid settings in analyzer.rs. What do you think?
lindera/src/character_filter.rs
Outdated
pub const UNICODE_NORMALIZE_CHARACTER_FILTER_NAME: &str = "unicode_normalize";

#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub enum UnidoceNormalizeKind {
Suggested change:
- pub enum UnidoceNormalizeKind {
+ pub enum UnicodeNormalizeKind {
Fix typo
Oh, that's a terrible typo!
I'll fix it all together. Thanks!
Fixed.
Please see the following commit:
fbcc4e6
lindera/src/character_filter.rs
Outdated
}

#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct UnidoceNormalizeCharacterFilterConfig {
Suggested change:
- pub struct UnidoceNormalizeCharacterFilterConfig {
+ pub struct UnicodeNormalizeCharacterFilterConfig {
Fix typo
ditto.
lindera/src/character_filter.rs
Outdated
    pub kind: UnidoceNormalizeKind,
}

impl UnidoceNormalizeCharacterFilterConfig {
Suggested change:
- impl UnidoceNormalizeCharacterFilterConfig {
+ impl UnicodeNormalizeCharacterFilterConfig {
Fix typo
ditto.
lindera/src/character_filter.rs
Outdated
}

impl UnidoceNormalizeCharacterFilterConfig {
    pub fn new(kind: UnidoceNormalizeKind) -> Self {
Suggested change:
- pub fn new(kind: UnidoceNormalizeKind) -> Self {
+ pub fn new(kind: UnicodeNormalizeKind) -> Self {
Fix typo
ditto
lindera/src/character_filter.rs
Outdated
#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct UnidoceNormalizeCharacterFilterConfig {
    pub kind: UnidoceNormalizeKind,
Suggested change:
- pub kind: UnidoceNormalizeKind,
+ pub kind: UnicodeNormalizeKind,
Fix typo
ditto
lindera/src/character_filter.rs
Outdated
UnidoceNormalizeKind::NFC => text.nfc().collect::<String>(),
UnidoceNormalizeKind::NFD => text.nfd().collect::<String>(),
UnidoceNormalizeKind::NFKC => text.nfkc().collect::<String>(),
UnidoceNormalizeKind::NFKD => text.nfkd().collect::<String>(),
Suggested change:
- UnidoceNormalizeKind::NFC => text.nfc().collect::<String>(),
- UnidoceNormalizeKind::NFD => text.nfd().collect::<String>(),
- UnidoceNormalizeKind::NFKC => text.nfkc().collect::<String>(),
- UnidoceNormalizeKind::NFKD => text.nfkd().collect::<String>(),
+ UnicodeNormalizeKind::NFC => text.nfc().collect::<String>(),
+ UnicodeNormalizeKind::NFD => text.nfd().collect::<String>(),
+ UnicodeNormalizeKind::NFKC => text.nfkc().collect::<String>(),
+ UnicodeNormalizeKind::NFKD => text.nfkd().collect::<String>(),
Fix typo
ditto
lindera/src/character_filter.rs
Outdated
impl RegexCharacterFilter {
    pub fn new(config: RegexCharacterFilterConfig) -> Self {
        let regex = Regex::new(&config.pattern).unwrap();
Can't this fail? Can we validate the pattern before creating the new instance?
Fixed.
Please see the following commit:
2caeca4
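The concern in this thread is that calling unwrap() inside a constructor panics on an invalid pattern. In the regex crate, Regex::new returns a Result, so one common remedy is a constructor that returns a Result as well. The sketch below shows only that pattern using the standard library: `ConfigError`, `PatternFilter`, and `compile` are hypothetical names, and `compile` is a toy stand-in for real pattern compilation, not the code from the commit above.

```rust
use std::fmt;

/// Hypothetical error type for an invalid filter configuration.
#[derive(Debug, PartialEq)]
struct ConfigError(String);

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid configuration: {}", self.0)
    }
}

/// Toy stand-in for pattern compilation (an assumption for illustration):
/// it only checks that parentheses are balanced.
fn compile(pattern: &str) -> Result<String, ConfigError> {
    let mut depth = 0i32;
    for c in pattern.chars() {
        match c {
            '(' => depth += 1,
            ')' => depth -= 1,
            _ => {}
        }
        if depth < 0 {
            return Err(ConfigError(format!("unopened ')' in {:?}", pattern)));
        }
    }
    if depth != 0 {
        return Err(ConfigError(format!("unclosed '(' in {:?}", pattern)));
    }
    Ok(pattern.to_string())
}

struct PatternFilter {
    compiled: String,
}

impl PatternFilter {
    /// Fallible constructor: the `?` operator propagates the error to the
    /// caller instead of panicking the way `unwrap()` would.
    fn new(pattern: &str) -> Result<Self, ConfigError> {
        let compiled = compile(pattern)?;
        Ok(Self { compiled })
    }

    fn pattern(&self) -> &str {
        &self.compiled
    }
}
```

A caller can then decide how to report a bad pattern, and a test for invalid settings can assert that new returns an Err rather than expecting a panic.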
lindera/src/character_filter.rs
Outdated
use crate::character_filter::{
    CharacterFilter, MappingCharacterFilter, MappingCharacterFilterConfig,
    RegexCharacterFilter, RegexCharacterFilterConfig, UnicodeNormalizeCharacterFilter,
    UnidoceNormalizeCharacterFilterConfig,
Suggested change:
- UnidoceNormalizeCharacterFilterConfig,
+ UnicodeNormalizeCharacterFilterConfig,
Fix typo
ditto
lindera/src/character_filter.rs
Outdated
}
"#;
let config =
    UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
Suggested change:
- UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
+ UnicodeNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
Fix typo
ditto
lindera/src/character_filter.rs
Outdated
let config =
    UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();

assert_eq!(config.kind, super::UnidoceNormalizeKind::NFKC);
Suggested change:
- assert_eq!(config.kind, super::UnidoceNormalizeKind::NFKC);
+ assert_eq!(config.kind, super::UnicodeNormalizeKind::NFKC);
Fix typo
ditto
LGTM apart from the typos :)
    pub kind: UnicodeNormalizeKind,
}

impl UnidoceNormalizeCharacterFilterConfig {
Suggested change:
- impl UnidoceNormalizeCharacterFilterConfig {
+ impl UnicodeNormalizeCharacterFilterConfig {
typo
#[derive(Clone, Debug)]
pub struct UnicodeNormalizeCharacterFilter {
    config: UnidoceNormalizeCharacterFilterConfig,
Suggested change:
- config: UnidoceNormalizeCharacterFilterConfig,
+ config: UnicodeNormalizeCharacterFilterConfig,
typo
}

impl UnicodeNormalizeCharacterFilter {
    pub fn new(config: UnidoceNormalizeCharacterFilterConfig) -> Self {
Suggested change:
- pub fn new(config: UnidoceNormalizeCharacterFilterConfig) -> Self {
+ pub fn new(config: UnicodeNormalizeCharacterFilterConfig) -> Self {
typo
pub fn from_slice(data: &[u8]) -> LinderaResult<Self> {
    Ok(Self::new(
        UnidoceNormalizeCharacterFilterConfig::from_slice(data)?,
Suggested change:
- UnidoceNormalizeCharacterFilterConfig::from_slice(data)?,
+ UnicodeNormalizeCharacterFilterConfig::from_slice(data)?,
typo
use lindera_core::character_filter::CharacterFilter;

use crate::character_filter::unicode_normalize::{
    UnicodeNormalizeCharacterFilter, UnidoceNormalizeCharacterFilterConfig,
Suggested change:
- UnicodeNormalizeCharacterFilter, UnidoceNormalizeCharacterFilterConfig,
+ UnicodeNormalizeCharacterFilter, UnicodeNormalizeCharacterFilterConfig,
typo
}
"#;
let config =
    UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
Suggested change:
- UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
+ UnicodeNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
typo
}

#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct UnidoceNormalizeCharacterFilterConfig {
Suggested change:
- pub struct UnidoceNormalizeCharacterFilterConfig {
+ pub struct UnicodeNormalizeCharacterFilterConfig {
typo
LGTM
#[derive(Clone)]
pub struct MappingCharacterFilter {
    config: MappingCharacterFilterConfig,
    trie: DoubleArray<Vec<u8>>,
Is the mapping that large? To me, an ordinary HashMap seems sufficient (and fast) here.
OK, I found Lucene's MappingCharFilter uses FST.
let mut text = "リンデラは形態素解析器です。".to_string();
filter.apply(&mut text).unwrap();
assert_eq!("Linderaは形態素解析器です。", text);
Just in case, I think it'd be better to have tests for patterns including actual regular expressions instead of just a fixed string.
Looks good to me.
Add analysis framework.
However, this feature is experimental.