techStandards

Download and parse technical standard documents

Introduction

This repository contains functions to download standard documents from the ETSI website and to parse them. For related functions (e.g., accessing ITU-T standard documents), see here.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("lorenzbr/techStandards")

What does the standard document parser do?

Technical standards are often described in very large documents of hundreds and sometimes thousands of pages. Texts of this size are difficult for NLP and ML models to handle, so it helps to split a standard into smaller parts and apply your model of choice to those. To select specific chapters, sections or paragraphs of a technical standard, this parser identifies the table of contents of a standard document and searches for the corresponding text using the section title and the page number given in the table of contents. The output consists of csv files with the structured text data (the full text of each paragraph as outlined in the table of contents). Currently, the text data is also aggregated at chapter level and stored in a separate txt file. The algorithm is based on regular expressions and on exact as well as string-similarity matches. It works well for most standards, but for some the parsing may fail or be less accurate. A log file with further details and messages is also written.
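The matching idea described above can be illustrated with a small sketch in base R. This is not the package's implementation: the helper names `parse_toc()` and `find_heading()`, the table-of-contents line format, the similarity threshold, and all sample text are assumptions made up for illustration.

```r
# Made-up table-of-contents lines and page text (not from a real standard)
toc_lines <- c(
  "1 Scope ........................ 5",
  "4.2 Functional requirements ........ 12"
)
page_12_lines <- c(
  "Made-up page header",
  "4.2 Functional requirments",        # heading with an extraction typo
  "The terminal shall support ..."
)

# Hypothetical helper: split each TOC line into clause number, title and page
parse_toc <- function(toc_lines) {
  pattern <- "^(\\d+(?:\\.\\d+)*)\\s+(.*?)\\s*\\.{2,}\\s*(\\d+)$"
  m <- regmatches(toc_lines, regexec(pattern, toc_lines, perl = TRUE))
  m <- m[lengths(m) == 4]              # keep only lines that matched
  data.frame(
    number = vapply(m, `[`, character(1), 2),
    title  = vapply(m, `[`, character(1), 3),
    page   = as.integer(vapply(m, `[`, character(1), 4)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: locate a heading on a page, first by exact match,
# then by edit distance relative to the heading length (assumed threshold)
find_heading <- function(heading, page_lines, max_rel_dist = 0.2) {
  if (length(page_lines) == 0) return(NA_integer_)
  exact <- which(page_lines == heading)
  if (length(exact) > 0) return(exact[1])
  d <- as.vector(utils::adist(heading, page_lines, ignore.case = TRUE)) /
    nchar(heading)
  best <- which.min(d)
  if (d[best] <= max_rel_dist) best else NA_integer_
}

toc <- parse_toc(toc_lines)
heading <- paste(toc$number[2], toc$title[2])
hit <- find_heading(heading, page_12_lines)
print(toc)
print(hit)
```

In this sketch the exact comparison fails because of the typo in the page text, and the edit-distance fallback still finds the heading line. Lines that do not look like table-of-contents entries are dropped.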

The following two pictures show an excerpt of a standard document. As an example, the red boxes highlight the kind of information the standard document parser extracts. In practice, all information in a document is parsed.

Examples

library(techStandards)

# Download ETSI standard documents
data("etsi_standards_meta")
download_etsi_standards(etsi_standards_meta, path = "")

# Get file names
files <- list.files(system.file("extdata/etsi_examples", package = "techStandards"), 
                    pattern = "pdf", full.names = TRUE)
file <- files[1]

# Set paths
input.path <- "inst/extdata/etsi_examples"
output.path <- input.path

# Parse a single standard document
parse_standard_doc(file, output.path, sso = "ETSI", overwrite = TRUE)

# Parse all standard documents
parse_standard_docs(input.path, output.path, sso = "ETSI", overwrite = TRUE)
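After parsing, the csv output can be loaded for further processing. The following is a minimal sketch in base R. It assumes that the csv files are written to the output path and are comma-separated. The helper `read_parsed_csvs()`, the demo file and its column names are hypothetical and not part of the package.

```r
# Hypothetical helper: read every csv file in a directory into a named list.
# Assumes comma-separated files; adjust the reader if the separator differs.
read_parsed_csvs <- function(path) {
  csv_files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  parsed <- lapply(csv_files, utils::read.csv, stringsAsFactors = FALSE)
  names(parsed) <- basename(csv_files)
  parsed
}

# Demo on a temporary directory with a made-up file (not real parser output)
demo_dir <- file.path(tempdir(), "parsed_demo")
dir.create(demo_dir, showWarnings = FALSE)
utils::write.csv(
  data.frame(section = c("1", "2"), text = c("Scope ...", "References ...")),
  file.path(demo_dir, "example.csv"),
  row.names = FALSE
)

parsed <- read_parsed_csvs(demo_dir)
str(parsed[["example.csv"]])
```

With real output, `read_parsed_csvs(output.path)` would be called instead of the demo directory.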

Potential use cases

Standard essentiality/relevance assessments: fine-grained comparisons of patents with specific technical aspects of a standard
Track changes of standard documents over time: how does the text change relative to associated declared standard-essential patents?
Identify which sections of a technical standard have become void
Find technically similar implementations in other technical standards (e.g., from other standard-setting organizations)
Identify undisclosed standard-essential patents (e.g., patents filed through blanket declarations or potentially undeclared patents)

License

This R package is licensed under the MIT license.

See here for further information.
