Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add analysis framework #236

Merged
merged 11 commits into from
Oct 15, 2022
Merged

Add analysis framework #236

merged 11 commits into from
Oct 15, 2022

Conversation

mosuka
Copy link
Member

@mosuka mosuka commented Oct 9, 2022

Add analysis framework.
However, this feature is experimental.

}
}

impl CharacterFilter for UnicodeNormalizeCharacterFilter {
Copy link
Member

@mocobeta mocobeta Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it'd be better to have separate .rs files for each CharacterFilter implementation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll separate .rs file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.
Please see the following commit:
fbcc4e6


#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct MappingCharacterFilterConfig {
pub mapping: HashMap<char, char>,
Copy link
Member

@mocobeta mocobeta Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have HashMap<String, String> for mapping, so that users can replace character sequences with other character sequences instead of replacing just one character with another one?
For example, Lucene's MappingCharfilter takes a String -> String map.
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/NormalizeCharMap.Builder.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I guess I should imitate Lucene.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.
Please see the following commit:
c6141f4

}
}

t.push(token.clone());
Copy link
Member

@mocobeta mocobeta Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could take Token's references to avoid cloning Token objects? Also, could we replace the tokens vector in place?

I've not tried but

Vec<Token<'a>>

could be

Vec<Rc<RefCell<<Token<'a>>>>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revise it to the no-copying method. Thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.
Please see the following commit:
388982c

Copy link
Member Author

@mosuka mosuka Oct 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have decided to use the retain function at this time. What do you think?

Copy link
Member

@johtani johtani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this PR!
I left some suggestions. Please check these.
And we should have some setting error test cases in analyzer.rs. What do you think?

pub const UNICODE_NORMALIZE_CHARACTER_FILTER_NAME: &str = "unicode_normalize";

#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub enum UnidoceNormalizeKind {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub enum UnidoceNormalizeKind {
pub enum UnicodeNormalizeKind {

Fix typo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's a terrible typo!
I'll fix it all together. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.
Please see the following commit:
fbcc4e6

}

#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct UnidoceNormalizeCharacterFilterConfig {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub struct UnidoceNormalizeCharacterFilterConfig {
pub struct UnicodeNormalizeCharacterFilterConfig {

Fix typo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

pub kind: UnidoceNormalizeKind,
}

impl UnidoceNormalizeCharacterFilterConfig {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
impl UnidoceNormalizeCharacterFilterConfig {
impl UnicodeNormalizeCharacterFilterConfig {

Fix typo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

}

impl UnidoceNormalizeCharacterFilterConfig {
pub fn new(kind: UnidoceNormalizeKind) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn new(kind: UnidoceNormalizeKind) -> Self {
pub fn new(kind: UnicodeNormalizeKind) -> Self {

Fix typo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct UnidoceNormalizeCharacterFilterConfig {
pub kind: UnidoceNormalizeKind,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub kind: UnidoceNormalizeKind,
pub kind: UnicodeNormalizeKind,

Fix typo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment on lines 59 to 62
UnidoceNormalizeKind::NFC => text.nfc().collect::<String>(),
UnidoceNormalizeKind::NFD => text.nfd().collect::<String>(),
UnidoceNormalizeKind::NFKC => text.nfkc().collect::<String>(),
UnidoceNormalizeKind::NFKD => text.nfkd().collect::<String>(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
UnidoceNormalizeKind::NFC => text.nfc().collect::<String>(),
UnidoceNormalizeKind::NFD => text.nfd().collect::<String>(),
UnidoceNormalizeKind::NFKC => text.nfkc().collect::<String>(),
UnidoceNormalizeKind::NFKD => text.nfkd().collect::<String>(),
UnicodeNormalizeKind::NFC => text.nfc().collect::<String>(),
UnicodeNormalizeKind::NFD => text.nfd().collect::<String>(),
UnicodeNormalizeKind::NFKC => text.nfkc().collect::<String>(),
UnicodeNormalizeKind::NFKD => text.nfkd().collect::<String>(),

Fix typo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


impl RegexCharacterFilter {
pub fn new(config: RegexCharacterFilterConfig) -> Self {
let regex = Regex::new(&config.pattern).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no error? Can we test the pattern before new instance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.
Please see the following commit:
2caeca4

use crate::character_filter::{
CharacterFilter, MappingCharacterFilter, MappingCharacterFilterConfig,
RegexCharacterFilter, RegexCharacterFilterConfig, UnicodeNormalizeCharacterFilter,
UnidoceNormalizeCharacterFilterConfig,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
UnidoceNormalizeCharacterFilterConfig,
UnicodeNormalizeCharacterFilterConfig,

Fix typo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

}
"#;
let config =
UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
UnicodeNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();

Fix typo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

let config =
UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();

assert_eq!(config.kind, super::UnidoceNormalizeKind::NFKC);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert_eq!(config.kind, super::UnidoceNormalizeKind::NFKC);
assert_eq!(config.kind, super::UnicodeNormalizeKind::NFKC);

Fix typo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Member

@johtani johtani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM without typo :)

pub kind: UnicodeNormalizeKind,
}

impl UnidoceNormalizeCharacterFilterConfig {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
impl UnidoceNormalizeCharacterFilterConfig {
impl UnicodeNormalizeCharacterFilterConfig {

typo


#[derive(Clone, Debug)]
pub struct UnicodeNormalizeCharacterFilter {
config: UnidoceNormalizeCharacterFilterConfig,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
config: UnidoceNormalizeCharacterFilterConfig,
config: UnicodeNormalizeCharacterFilterConfig,

typo

}

impl UnicodeNormalizeCharacterFilter {
pub fn new(config: UnidoceNormalizeCharacterFilterConfig) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn new(config: UnidoceNormalizeCharacterFilterConfig) -> Self {
pub fn new(config: UnicodeNormalizeCharacterFilterConfig) -> Self {

typo


pub fn from_slice(data: &[u8]) -> LinderaResult<Self> {
Ok(Self::new(
UnidoceNormalizeCharacterFilterConfig::from_slice(data)?,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
UnidoceNormalizeCharacterFilterConfig::from_slice(data)?,
UnicodeNormalizeCharacterFilterConfig::from_slice(data)?,

typo

use lindera_core::character_filter::CharacterFilter;

use crate::character_filter::unicode_normalize::{
UnicodeNormalizeCharacterFilter, UnidoceNormalizeCharacterFilterConfig,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
UnicodeNormalizeCharacterFilter, UnidoceNormalizeCharacterFilterConfig,
UnicodeNormalizeCharacterFilter, UnicodeNormalizeCharacterFilterConfig,

typo

}
"#;
let config =
UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
UnidoceNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();
UnicodeNormalizeCharacterFilterConfig::from_slice(config_str.as_bytes()).unwrap();

typo

}

#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct UnidoceNormalizeCharacterFilterConfig {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub struct UnidoceNormalizeCharacterFilterConfig {
pub struct UnicodeNormalizeCharacterFilterConfig {

typo

@mosuka
Copy link
Member Author

mosuka commented Oct 13, 2022

@johtani
Fix typos and add test for analyzer wrong setting.
9c911d5

@mosuka mosuka requested a review from johtani October 13, 2022 12:03
Copy link
Member

@johtani johtani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

#[derive(Clone)]
pub struct MappingCharacterFilter {
config: MappingCharacterFilterConfig,
trie: DoubleArray<Vec<u8>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the mapping so large? To me an ordinal HashMap seems to be sufficient (and fast) here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I found Lucene's MappingCharFilter uses FST.


let mut text = "リンデラは形態素解析器です。".to_string();
filter.apply(&mut text).unwrap();
assert_eq!("Linderaは形態素解析器です。", text);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case, I think it'd be better to have tests for patterns including actual regular expressions instead of just a fixed string.

Copy link
Member

@mocobeta mocobeta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@mosuka mosuka merged commit d5527d5 into main Oct 15, 2022
@mosuka mosuka mentioned this pull request Oct 15, 2022
@mosuka mosuka deleted the analysis_framework branch October 24, 2022 01:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants