Can tokenize using custom delimiters from UNA segment #3
Nice work Chaymae 💪
I have a few suggestions, see the inline comments.
It would also be good to have a simple, high-level test case that tests actually parsing a message including a UNA segment - the tests you've added just test the tokenizer.
I'm saying this because we might need to strip/ignore the UNA segment in the parser but that would become easier to tell if you add this high-level test.
We are currently missing the decimal mark delimiter (.).
This is intentional because it is up to the user to decide whether a given value should be treated as numeric.
We could provide a helper method that uses the decimal mark delimiter to convert a value to a number, but for now let's just ignore it.
The other missing delimiter is the repetition separator, but it is not supported in some versions of EDIFACT.
I'm not sure what the effect of that separator would be. Maybe something to consider in a later PR?
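For illustration only, the helper mentioned above could look roughly like the sketch below. The method name `edifact_to_number` and its `decimal_mark:` keyword are hypothetical and not part of this PR or of Edifunct; the sketch also assumes a plain `Float` is precise enough, which may not hold for monetary amounts.

```ruby
# Hypothetical helper, not part of Edifunct: converts an EDIFACT value to a
# Float using the decimal mark announced in the UNA segment.
def edifact_to_number(value, decimal_mark: '.')
  # Replace the message-specific decimal mark with the one Ruby expects.
  normalized = value.sub(decimal_mark, '.')
  # Kernel#Float is strict and raises ArgumentError for non-numeric input.
  Float(normalized)
end

edifact_to_number('12,5', decimal_mark: ',') # => 12.5
edifact_to_number('3.25')                    # => 3.25
```

Keeping the conversion in an explicit helper leaves the decision about which values are numeric with the caller, as described above.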
@@ -7,6 +7,7 @@
/spec/reports/
/tmp/
/Gemfile.lock
*.swp
I think a better approach is to configure this on your local machine: https://help.github.com/en/articles/ignoring-files#create-a-global-gitignore
# TODO: Should check if the message starts with `UNA`, and then extract the different separator/terminator settings to be used for initializing the tokenizer.
new
def for_message(edifact_message)
if edifact_message =~ /\AUNA/
- if edifact_message =~ /\AUNA/
+ if edifact_message[0..2] == 'UNA'
Why is the second one better than the first? 🤔
Regex seems like overkill for something like this.
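As a quick sanity check, the two variants discussed here give the same answer for a message prefix. The sketch below also includes `String#start_with?`, which is not proposed in this thread but is a third common way to express the same check. The method name `una_checks` and the sample messages are made up for illustration.

```ruby
# Hypothetical comparison helper: returns the result of each prefix check.
def una_checks(edifact_message)
  {
    regex: !!(edifact_message =~ /\AUNA/),      # variant from the PR
    slice: edifact_message[0..2] == 'UNA',      # variant suggested in review
    prefix: edifact_message.start_with?('UNA')  # alternative, not from the thread
  }
end

with_una    = "UNA:+.? 'UNB+UNOA:1'"
without_una = "UNB+UNOA:1'"

una_checks(with_una)    # => {regex: true, slice: true, prefix: true}
una_checks(without_una) # => {regex: false, slice: false, prefix: false}
```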
def for_message(edifact_message)
if edifact_message =~ /\AUNA/
# Example: UNA:+.? '
new(release_character: edifact_message[6], segment_terminator: edifact_message[8], data_element_separator: edifact_message[4], component_data_element_separator: edifact_message[3])
I think it could be clearer which character after UNA is which. I'm thinking something like this:
component_data_element_separator, data_element_separator, _decimal_mark, release_character, _reserved, segment_terminator = edifact_message.slice(3, 6).split('')
It becomes a super long line so if it could be broken up that would be nice.
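One way to break the long line up is to put each assignment target on its own line, roughly as sketched below. The wrapper method `una_delimiters` and the returned hash are hypothetical and only make the sketch self-contained; in the PR the values are passed to `new` as keyword arguments.

```ruby
# Hypothetical standalone helper: extracts the service characters that follow
# "UNA" and names each position via a multi-line destructuring assignment.
def una_delimiters(edifact_message)
  component_data_element_separator,
    data_element_separator,
    _decimal_mark,
    release_character,
    _reserved,
    segment_terminator = edifact_message.slice(3, 6).chars

  {
    component_data_element_separator: component_data_element_separator,
    data_element_separator: data_element_separator,
    release_character: release_character,
    segment_terminator: segment_terminator
  }
end

una_delimiters("UNA:+.? 'UNB+UNOA:1")
# => {component_data_element_separator: ":", data_element_separator: "+",
#     release_character: "?", segment_terminator: "'"}
```

The underscore-prefixed names mark the decimal mark and the reserved position as intentionally unused.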
@@ -1,4 +1,36 @@
RSpec.describe Edifunct::Tokenizer do
describe ".for_message" do

✂️
end


context "when UNA header is missing" do

✂️
I made some modifications and pushed 3d9f3f9, which is basically your changes from this PR with a few tweaks. Thanks for the contribution ✌️
For now we only support 4 of the 6 delimiters.
We are currently missing the decimal mark delimiter (.). The other missing delimiter is the repetition separator, but it is not supported in some versions of EDIFACT.