U+0085 control character breaks PICA line parsing #277

Closed
pkiraly opened this issue Jun 1, 2023 · 2 comments

@pkiraly
Owner

pkiraly commented Jun 1, 2023

In the k10plus catalogue some subfields (typically, but not exclusively, 047I$a) contain text with one or more U+0085 (NEXT LINE) control characters. Java treats U+0085 as a line terminator, so by default it is not matched by the regex . (any character), and this blocks parsing of the field. Here is an example:

047I aSeveral recent empirical and theoretical studies have revived interest in the relationship between the level of the exchange rate and economic development. This paper develops a dynamic model based on the Ricardian framework with a continuum of goods to consider the issue from a somewhat different perspective. In the short run, a devaluation can boost pro
fits in spite of real wage rigidity. Moreover, the resulting diversi
fication can offset the negative consequences for the trade balance of higher employment and pro
fitability at Home. Over the longer run, and in the presence of learning-by-accumulation, the initial boost to pro
ts and investment induced by a devaluation could enable a country to gain a permanent foothold in new sectors at a higher real wage. While directly suppressing the real wage could also lead to diversi
cation, what makes nominal devaluations a particularly useful tool is that these make it possible to expand domestic pro
fits while limiting internal distributional conflict and the ensuing negative effects on development.
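The behaviour can be reproduced in isolation. The following is a minimal sketch, not code from the project: the class and method names are made up for illustration. It relies only on the documented behaviour of `java.util.regex.Pattern`, where `.` skips line terminators (including U+0085) unless `DOTALL` is set, and where `UNIX_LINES` restricts line terminators to `\n`.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the project.
public class NextLineDot {
    // Default mode: '.' does not match line terminators such as U+0085.
    static boolean dotMatches(String s) {
        return Pattern.compile(".").matcher(s).matches();
    }

    // DOTALL: '.' matches every character, line terminators included.
    static boolean dotAllMatches(String s) {
        return Pattern.compile(".", Pattern.DOTALL).matcher(s).matches();
    }

    // UNIX_LINES: only '\n' counts as a line terminator for '.'.
    static boolean unixLinesMatches(String s) {
        return Pattern.compile(".", Pattern.UNIX_LINES).matcher(s).matches();
    }

    public static void main(String[] args) {
        String nel = "\u0085";
        System.out.println(dotMatches(nel));
        System.out.println(dotAllMatches(nel));
        System.out.println(unixLinesMatches(nel));
    }
}
```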

Often this character stands before, after, or even in place of "fi", as in pro-fits, diversi-fication, fi-ght, or diversi-cation.
There might be two ways to fix this issue:

  1. remove all U+0085 characters from the input
  2. modify the regex to handle texts with U+0085 characters
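Both options, plus a way to report affected lines, can be sketched as follows. This is only an illustration under assumptions: the line pattern is a simplified stand-in (three digits, one letter or `@`, a space, then the content), not the regex the project uses, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; names and the line pattern are assumptions.
public class NextLineOptions {
    static final String NEL = "\u0085";

    // Option 2: DOTALL lets '.' match U+0085, so the content group keeps it.
    static final Pattern LINE =
        Pattern.compile("^(\\d{3}[A-Z@]) (.*)$", Pattern.DOTALL);

    // Option 1: drop every U+0085 before parsing.
    static String strip(String line) {
        return line.replace(NEL, "");
    }

    static boolean parsesAsIs(String line) {
        return LINE.matcher(line).matches();
    }

    // Reporting: 1-based numbers of the lines that contain U+0085.
    static List<Integer> affectedLines(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains(NEL)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("047I aprofits", "047I apro\u0085fits");
        System.out.println(strip(lines.get(1)));
        System.out.println(parsesAsIs(lines.get(1)));
        System.out.println(affectedLines(lines));
    }
}
```

Option 1 changes the data, while option 2 keeps the field value exactly as delivered; the reporting helper works with either.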

In both cases the question remains: should we report the affected lines?

@nichtich What do you prefer?

@pkiraly pkiraly added this to the PICA: 1.2 milestone Jun 1, 2023
@nichtich
Collaborator

nichtich commented Jun 2, 2023

U+0085 is a valid code point in PICA field values (only code points below U+0020, including \n, should not occur in field values), so the regex must be modified.
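One way to encode that rule directly is a negated character class in place of `.`, so that only code points below U+0020 are rejected. This is a sketch of the idea, not the change that was actually made to the project; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the "nothing below U+0020" rule.
public class PicaValueCheck {
    // Accept any sequence of characters outside U+0000..U+001F.
    static final Pattern VALUE = Pattern.compile("[^\\u0000-\\u001F]*");

    static boolean isValidValue(String value) {
        return VALUE.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidValue("pro\u0085fits")); // U+0085 is allowed
        System.out.println(isValidValue("pro\nfits"));     // U+000A is not
    }
}
```

Unlike `DOTALL`, this class still rejects `\n`, `\r`, and `\t` inside a value.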

@pkiraly
Owner Author

pkiraly commented Jun 2, 2023

Thanks! I modified the regex.

@pkiraly pkiraly closed this as completed Jun 7, 2023