Fully parse all IEEE normtitle entries #4

Closed
ronaldtse opened this issue Mar 11, 2022 · 19 comments

@ronaldtse
Contributor

ronaldtse commented Mar 11, 2022

Full PubIDs from IEEE:
pubid-sorted.txt.zip

$ wc -l pubid-sorted.txt
    9436 pubid-sorted.txt

Please also look at #2 and #3 for resolved details.

Method of generating this list:

  1. Extract all xml files from ieee-rawbib2
  2. Run the following command
find . -name '*.xml' -exec bash -c 'xmllint --xpath "//publication/normtitle/text()" --nocdata "$0" >> pubid.txt; echo >> pubid.txt' {} \;
sort pubid.txt | uniq > pubid-sorted.txt
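
The same extraction can be sketched in Ruby, which may be easier to reuse inside the gem's tests. This is a minimal sketch, not the method used to produce the attached list: it assumes the REXML library is available, the helper names `normtitles` and `sorted_pubids` are hypothetical, and the sample documents are invented stand-ins for ieee-rawbib files.

```ruby
require "rexml/document"

# Return the text of every //publication/normtitle node in one XML string.
def normtitles(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//publication/normtitle")
              .map { |node| node.text.to_s.strip }
              .reject(&:empty?)
end

# Merge the titles from many documents into a sorted, de-duplicated list,
# roughly mirroring `sort | uniq` (Ruby sorts by bytes, not by locale).
def sorted_pubids(xml_documents)
  xml_documents.flat_map { |xml| normtitles(xml) }.uniq.sort
end

# Hypothetical sample documents; real ieee-rawbib files carry many more fields.
samples = [
  "<publication><normtitle>309/N42.3-1999</normtitle></publication>",
  "<publication><normtitle><![CDATA[1003.2/INT-1994]]></normtitle></publication>",
  "<publication><normtitle>309/N42.3-1999</normtitle></publication>",
]
puts sorted_pubids(samples)
```

To run it over a checkout, the sample array could be replaced with something like `Dir.glob("**/*.xml").map { |f| File.read(f) }`.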

Some observed rules (#2 (comment)):

  • REV means revision
  • INT means interpretation
  • /D{N} or /D.{N} or _D{N} means draft N
  • Cor means corrigendum
  • Amd means amendment
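
As a rough illustration of these rules, the markers could be detected with a few patterns. This is a sketch under assumptions: the method name `observed_markers` and the exact regular expressions are hypothetical, they cover only the spellings listed above, and they are not taken from relaton-ieee or pubid-ieee.

```ruby
# Detect the markers listed above in a raw normtitle string.
# The patterns are illustrative guesses, not an agreed grammar.
def observed_markers(normtitle)
  {
    # "/D19", "/D.19", "_D10", "/D1.9.8" => draft number as a string
    draft: normtitle[%r{[/_]D\.?(?<draft>\d+(?:\.\d+)*)}, "draft"],
    # "Cor 1" => corrigendum number
    corrigendum: normtitle[/\bCor\.?\s*(?<cor>\d+)/, "cor"],
    # "Amd 1" => amendment number
    amendment: normtitle[/\bAmd\.?\s*(?<amd>\d+)/, "amd"],
    revision: normtitle.match?(/\bREV\b/),
    interpretation: normtitle.match?(/\bINT\b/),
  }
end

p observed_markers("P1062/D.19, March 2015")[:draft]   # => "19"
p observed_markers("529-1980/Cor 1-2017")[:corrigendum] # => "1"
p observed_markers("P16326/DFCD, Sept 2007")[:draft]    # => nil
```

Stage codes such as DFCD or DIS begin with "D" but are not followed by a digit, so the draft pattern leaves them alone.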

Joint publications and other notable patterns:

  • P844.3/C22.2 293.3/D0, Aug 2018 is an "IEEE P844.3" joint standard with "CSA C22.2 No. 293.3", Draft 0.
  • 309/N42.3-1999 is "IEEE 309" joint with "ANSI N42.3"
  • 529-1980/Cor 1-2017 means it is a correction of "IEEE 529-1980" issued in 2017, the first corrigendum for the standard
  • P1062/D.19, March 2015 means it is Draft 19 of "IEEE P1062"
  • Notice sometimes they miss the D:
    • P1775/1.9.7, Mar 2009 <===
    • P1775/D1.9.8, Dec 2009
  • C37.60/62271-111-2018 is joint IEC 62271-111 and IEEE Std C37.60-2018
  • IEC/IEEE P60079-30-2/D4A, Jul 2013
  • P16326/201x/DIS, Dec 2018
  • P16326/DFCD, Sept 2007
  • P16326_FDIS, Apr 2019
  • P42020 FDIS, Mar 2019
  • P42020/V1.9, Aug 2018
  • P63195/ED1, 2018
  • P63195_CDV/V3, Feb 2020
  • PC37.60/IEC 62271-111_D10, Feb 2012
  • PC37.60/P62271-111_D6.1, August 2016
  • 1003.1/2003.1-1994 is "IEEE Std 1003.1/2003.l/lNT, March 1994 Edition"
  • 1003.2/INT-1994
  • 1224/1224.1/1327-1994 is "IEEE Std 1224/1327/1224.1/lNT, March 1994 Edition" (lNT "L" actually should be INT "I")
  • 323/323a-1974 is "IEEE Std 323-1974 (Revision of IEEE 323-1971 and ANSI N41.5-1971)"
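
Many of the draft entries above carry a free-form date after a comma ("Aug 2018", "2018", "March 2015"). One possible first step is to separate that tail from the identifier before any further parsing. This is only a sketch: `split_trailing_date` is a hypothetical helper, and it assumes the first comma always starts the date text, which holds for the examples listed here but has not been checked against the full list.

```ruby
# Split a normtitle into the identifier and the optional trailing date text.
# Assumes the first comma, if any, introduces the date.
def split_trailing_date(normtitle)
  identifier, _separator, date = normtitle.partition(/,\s*/)
  [identifier.strip, date.empty? ? nil : date.strip]
end

p split_trailing_date("P844.3/C22.2 293.3/D0, Aug 2018")
# => ["P844.3/C22.2 293.3/D0", "Aug 2018"]
p split_trailing_date("309/N42.3-1999")
# => ["309/N42.3-1999", nil]
```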
@ronaldtse ronaldtse added the enhancement New feature or request label Mar 11, 2022
@mico
Contributor

mico commented Mar 12, 2022

@ronaldtse do you have any references to standards or IEEE PubID parsing implementations that could help?

@mico
Contributor

mico commented Mar 12, 2022

@ronaldtse what do we want to do with the parsed PubIDs? Do we need to convert them back to PubIDs, or to other formats?

@ronaldtse
Contributor Author

Code that is now used for PubID parsing is here:
https://github.com/relaton/relaton-ieee/blob/main/lib/relaton_ieee/rawbib_id_parser.rb

@ronaldtse
Contributor Author

The source files for these entries are at https://github.com/relaton/ieee-rawbib.

There are a few problems:

  1. In these XML files, there are a number of identical normtitle entries even though they have different filenames. We need to find out how to distinguish these bibliographic entries and then report back to IEEE.
  2. Some of these entries are not properly associated, e.g. the <standard_id> value is 0.
  3. Some of these entries are draft entries of each other. As part of Relaton, we need to find out how the entries are related and re-build the full graph.
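
For the first problem, a quick way to list the colliding entries is to group filenames by their normtitle. The sketch below assumes the titles have already been extracted into a hash of filename to normtitle; the method name `duplicate_normtitles` and the filenames are hypothetical.

```ruby
# Given { filename => normtitle }, return { normtitle => [filenames] }
# for every normtitle that appears in more than one file.
def duplicate_normtitles(titles_by_file)
  titles_by_file
    .group_by { |_file, title| title }
    .transform_values { |pairs| pairs.map(&:first).sort }
    .select { |_title, files| files.size > 1 }
end

# Hypothetical filenames, for illustration only.
titles = {
  "a/0001.xml" => "309/N42.3-1999",
  "b/0002.xml" => "309/N42.3-1999",
  "c/0003.xml" => "1003.2/INT-1994",
}
p duplicate_normtitles(titles)
# => {"309/N42.3-1999"=>["a/0001.xml", "b/0002.xml"]}
```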

@ronaldtse
Contributor Author

Regarding pubid, notice that there are multiple types of IEEE PubIDs, and also some jointly-published ones with ISO PubIDs. Since we now have an ISO PubID implementation, it will help us here.

@mico
Contributor

mico commented Mar 14, 2022

@ronaldtse could you tell me what problem we are trying to solve here? Do we want to convert to another format, distinguish these bibliographic entries from "ieee-rawbib", build a relations graph, or something else?

@ronaldtse
Contributor Author

Right now, Relaton-IEEE cannot parse all IEEE PubID entries because it parses them with regular expressions. This has the following consequences:

  1. We are unable to convert all ieee-rawbib data into https://github.com/ietf-ribose/relaton-data-ieee . Around 10% of entries are currently missing, so people cannot cite from the full library. (See https://github.com/ietf-ribose/bibxml-service/issues/136#issuecomment-1047138249 and "Missing bibliographic items for these identifiers (from ieee-rawbib)", relaton/relaton-ieee#16.)
  2. Some entries in Relaton-IEEE are parsed incorrectly, so people end up citing the wrong document. For an example, see: feat: add Gemfiles ietf-ribose/bibxml-service#31 (comment)

In other words, we must parse IEEE PubIDs properly in order to make the full IEEE dataset available for citation.

@mico
Contributor

mico commented Mar 14, 2022

Will we use it (pubid-ieee) to replace https://github.com/relaton/relaton-ieee/blob/main/lib/relaton_ieee/rawbib_id_parser.rb ?

@ronaldtse
Contributor Author

ronaldtse commented Mar 14, 2022

Will we use it (pubid-ieee) to replace https://github.com/relaton/relaton-ieee/blob/main/lib/relaton_ieee/rawbib_id_parser.rb ?

Yes.

@mico
Contributor

mico commented Mar 14, 2022

@ronaldtse should we use pubid-iso to parse identifiers like:

IEC/IEEE 62704-1:2017
IEC/IEEE 62704-2:2017
IEC/IEEE 62704-3:2017
IEC/IEEE 62704-4:2020
IEC/IEEE 63113:2021
IEC/IEEE 63260:2020
IEC/ISO/IEEE 80005-1:2012
ISO/IEC FDIS P15289, April 2014(E)

?

@ronaldtse
Contributor Author

The ones that start with ISO, yes. But the rest are IEC identifiers. IEC PubIDs are similar to ISO’s, but they have different stages and allow a sub-part (e.g. IEC 1000-1-2). We need to have a pubid-iec.
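
The routing rule described here could look roughly like the following. It is a sketch only: `parser_for` is a hypothetical name, the returned symbols are plain labels rather than calls into pubid-iso or any other gem, and the rule is limited to checking the leading publisher.

```ruby
# Pick which parser family should handle an identifier, based only on
# its leading publisher. The symbols are labels, not real gem APIs.
def parser_for(identifier)
  case identifier
  when /\AISO\b/ then :pubid_iso
  when /\AIEC\b/ then :pubid_iec
  else :pubid_ieee
  end
end

p parser_for("ISO/IEC FDIS P15289, April 2014(E)") # => :pubid_iso
p parser_for("IEC/ISO/IEEE 80005-1:2012")          # => :pubid_iec
p parser_for("P1062/D.19, March 2015")             # => :pubid_ieee
```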

@mico
Contributor

mico commented Mar 15, 2022

IEEE Std 1073.1.1.1-2004 (https://standards.ieee.org/ieee/1073.1.1.1/1571/)
"Replaced by ISO/IEEE 11073-10101-2004"

Example of similar identifier:
P11073-10101c (https://standards.ieee.org/ieee/11073-10101c/10476/)
Title: "Standard for Health informatics--Point-of-care medical device communication - Part 10101: Nomenclature Amendment 3: Additional definitions".

@ronaldtse I believe IEEE Std 1073.1.1.1-2004 should be "IEEE 1073-10101-2004" or "IEEE 11073-10101-2004", what do you think?

@ronaldtse
Contributor Author

I believe IEEE Std 1073.1.1.1-2004 should be "IEEE 1073-10101-2004" or "IEEE 11073-10101-2004", what do you think?

No, we have to keep the original identifier. Its replacement "ISO/IEEE 11073-10101-2004" probably intentionally selected the 10101 part to keep identity with 1.1.1. Notice that 1073 became 11073 because ISO 1073 is already taken by another standard. This is causality in reverse.

"P11073-10101c" means it is the "provisional" (i.e. draft) version of "11073-10101c". The "c" character means it is the 3rd Amendment to "11073-10101". According to the website, "P11073-10101c" is done in 2020 so it is a "draft amendment".

i.e. historically:

  1. IEEE 1073.1.1.1-2004 was published
  2. ISO/IEEE 11073-10101-2004 was published
  3. IEEE P11073-10101c is a draft amendment of ISO/IEEE 11073-10101-2004 (the ieee-rawbib data directly indicates which standard supersedes which)
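
The reading of "P11073-10101c" given above (a leading "P" marks a draft, a trailing letter marks the Nth amendment) can be sketched as follows. Both rules are taken from this comment and are assumptions rather than a confirmed IEEE convention; `describe_ieee_number` is a hypothetical helper, and it does not separate a year suffix or a publisher prefix.

```ruby
# Interpret a bare IEEE number such as "P11073-10101c".
# Assumes: leading "P" => draft; trailing lowercase letter => amendment,
# with "a" = 1st, "b" = 2nd, "c" = 3rd, and so on.
def describe_ieee_number(identifier)
  m = identifier.match(/\A(?<project>P)?(?<base>\d+(?:[.-]\d+)*)(?<letter>[a-z])?\z/)
  return nil unless m

  letter = m[:letter]
  {
    draft: !m[:project].nil?,
    base: m[:base],
    amendment: letter && (letter.ord - "a".ord + 1),
  }
end

p describe_ieee_number("P11073-10101c")
# => {:draft=>true, :base=>"11073-10101", :amendment=>3}
```

Anything that does not fit this narrow shape (for example "ISO/IEEE 11073-10101-2004") returns nil and would need to be handled elsewhere.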

@mico
Contributor

mico commented Mar 15, 2022

@ronaldtse "IEEE 802.15.22.3-2020" - how can I tell what 22 and 3 mean here?

@mico
Contributor

mico commented Mar 15, 2022

I believe IEEE Std 1073.1.1.1-2004 should be "IEEE 1073-10101-2004" or "IEEE 11073-10101-2004", what do you think?

No, we have to keep the original identifier. Its replacement "ISO/IEEE 11073-10101-2004" probably intentionally selected the 10101 part to keep identity with 1.1.1. Notice that 1073 became 11073 because ISO 1073 is already taken by another standard. This is causality in reverse.

I'm trying to work out how I should treat these numbers. I had an idea to parse them as {number}.{part}.{subpart}, but some identifiers have more than 3 numbers. Maybe I can parse the extra numbers as extra subparts.
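
The {number}.{part}.{subpart} idea, with any extra numbers kept as further subparts, might be sketched like this. The method name `split_dotted_number` and the accepted "IEEE" / "Std" prefixes are assumptions for illustration, and the split is purely positional: it says nothing about what each component means, which the rest of this thread treats as an open question.

```ruby
# Split a dotted IEEE number positionally into number, part and subparts,
# with an optional trailing "-YYYY" year.
def split_dotted_number(identifier)
  pattern = /\A(?:IEEE\s+)?(?:Std\s+)?(?<digits>\d+(?:\.\d+)*)(?:-(?<year>\d{4}))?\z/
  m = identifier.match(pattern)
  return nil unless m

  number, part, *subparts = m[:digits].split(".")
  { number: number, part: part, subparts: subparts, year: m[:year] }
end

p split_dotted_number("IEEE Std 1073.1.1.1-2004")
# => {:number=>"1073", :part=>"1", :subparts=>["1", "1"], :year=>"2004"}
```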

@ronaldtse
Contributor Author

"IEEE 802.15.22.3-2020" "IEEE Standard for Spectrum Characterization and Occupancy Sensing":

  • 802 is a "Standards Committee" (https://www.ieee802.org)
  • 802.15 is a "Working Group" ("802.15 WG - Wireless Specialty Networks (WSN) Working Group")
  • 22.3 is the "Part number" selected by the 802.15 group. This is an arbitrary number.
    • This number is likely selected because the topic is similar to the standard called "802.22.3" (draft 802.22.3)

You can see that "22.3" is called the "Part" in the draft.
[Screenshot of the draft, 2022-03-15]

@ronaldtse
Contributor Author

I had an idea to parse it as {number}.{part}.{subpart} but there are over 3 numbers.

I am not sure whether there is a proper structure in IEEE identifiers. Some patterns are somewhat arbitrary (e.g. 802.15.22.3 exists, but 802.15.22.1 and 802.15.22.2 do not).

This is a topic we will need to investigate and analyse.

@mico
Contributor

mico commented Sep 16, 2022

@ronaldtse I believe we are finished with this issue.

@mico mico closed this as completed Sep 16, 2022
@ronaldtse
Contributor Author

@mico we have 886 identifiers that are not yet being parsed, but I will make that into a new issue.
