Whitespace tokenizer #496

Merged
merged 1 commit into from
Nov 28, 2022

@elshize elshize commented Nov 26, 2022

Implements a whitespace tokenizer alongside the old term tokenizer. The old tokenizer is renamed to EnglishTokenizer (as it contains English-specific rules such as possessives), and both tokenizers are now organized into a class hierarchy with a common virtual interface, so that they can be used interchangeably in the future, and so that more tokenizers can potentially be implemented.
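To illustrate the idea of a common virtual interface, here is a minimal sketch. It is not the code from this PR: the `sketch` namespace, the `Tokenizer` base class, the `tokenize` signature, `token_count`, and `ToyEnglishTokenizer` are all hypothetical names chosen for illustration, and the possessive rule shown is a toy stand-in rather than the actual English-specific rules.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

namespace sketch {

// Hypothetical common interface (an assumption, not the PISA declaration).
class Tokenizer {
  public:
    virtual ~Tokenizer() = default;
    [[nodiscard]] virtual auto tokenize(std::string_view input) const
        -> std::vector<std::string> = 0;
};

// Splits on runs of whitespace only; tokens are otherwise left untouched.
class WhitespaceTokenizer final : public Tokenizer {
  public:
    [[nodiscard]] auto tokenize(std::string_view input) const
        -> std::vector<std::string> override {
        std::vector<std::string> tokens;
        std::size_t pos = 0;
        while (pos < input.size()) {
            // Skip leading whitespace.
            while (pos < input.size() && is_space(input[pos])) { ++pos; }
            std::size_t begin = pos;
            // Consume one token up to the next whitespace character.
            while (pos < input.size() && !is_space(input[pos])) { ++pos; }
            if (pos > begin) {
                tokens.emplace_back(input.substr(begin, pos - begin));
            }
        }
        return tokens;
    }

  private:
    static auto is_space(char c) -> bool {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }
};

// Toy stand-in for an English tokenizer: whitespace split, then drop a
// trailing "'s". The real English rules are not reproduced here.
class ToyEnglishTokenizer final : public Tokenizer {
  public:
    [[nodiscard]] auto tokenize(std::string_view input) const
        -> std::vector<std::string> override {
        auto tokens = WhitespaceTokenizer{}.tokenize(input);
        for (auto& token : tokens) {
            if (token.size() > 2 && token.compare(token.size() - 2, 2, "'s") == 0) {
                token.erase(token.size() - 2);
            }
        }
        return tokens;
    }
};

// Caller code depends only on the interface, so tokenizers are interchangeable.
inline auto token_count(Tokenizer const& tokenizer, std::string_view input)
    -> std::size_t {
    return tokenizer.tokenize(input).size();
}

}  // namespace sketch
```

With this shape, code such as `token_count` accepts any tokenizer through a `Tokenizer const&`, so adding another tokenizer would only require a new derived class.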

Related to #494

@elshize elshize mentioned this pull request Nov 26, 2022
3 tasks
codecov bot commented Nov 26, 2022

Codecov Report

Base: 92.87% // Head: 92.84% // Decreases project coverage by -0.03% ⚠️

Coverage data is based on head (5e529fa) compared to base (78dbc6b).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #496      +/-   ##
==========================================
- Coverage   92.87%   92.84%   -0.04%     
==========================================
  Files          92       92              
  Lines        4351     4332      -19     
==========================================
- Hits         4041     4022      -19     
  Misses        310      310              
Impacted Files                          Coverage Δ
include/pisa/query/query_stemmer.hpp    100.00% <100.00%> (ø)
include/pisa/tokenizer.hpp              100.00% <100.00%> (ø)


@elshize elshize requested review from amallia and JMMackenzie and removed request for amallia November 27, 2022 14:04
@JMMackenzie JMMackenzie left a comment

Nice, thanks! This looks great.

@elshize elshize self-assigned this Nov 28, 2022
@elshize elshize merged commit 84fa537 into master Nov 28, 2022
@elshize elshize deleted the whitespace-tokenizer branch November 28, 2022 01:05