U+0085 control character breaks PICA line parsing #277

pkiraly · 2023-06-01T20:33:01Z

In the k10plus catalogue some subfields (typically, but not exclusively, 047I$a) contain text with one or more U+0085 control characters. In Java this character is not matched by the regex . (any character), because Java treats U+0085 (NEXT LINE) as a line terminator, and so it blocks parsing of the field. Here is an example:

047I aSeveral recent empirical and theoretical studies have revived interest in the relationship between the level of the exchange rate and economic development. This paper develops a dynamic model based on the Ricardian framework with a continuum of goods to consider the issue from a somewhat different perspective. In the short run, a devaluation can boost pro
fits in spite of real wage rigidity. Moreover, the resulting diversi
fication can offset the negative consequences for the trade balance of higher employment and pro
fitability at Home. Over the longer run, and in the presence of learning-by-accumulation, the initial boost to pro
ts and investment induced by a devaluation could enable a country to gain a permanent foothold in new sectors at a higher real wage. While directly suppressing the real wage could also lead to diversi
cation, what makes nominal devaluations a particularly useful tool is that these make it possible to expand domestic pro
fits while limiting internal distributional conflict and the ensuing negative effects on development.

Often this character appears before, after, or even in place of fi, as in pro-fits, diversi-fication, fi-ght, or diversi-cation.
There might be two ways to fix this issue:

remove all U+0085 characters from the input
modify the regex to handle texts with U+0085 characters

In both cases the question remains: should we report the affected lines?
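
For reference, a minimal Java sketch of the behaviour and of both options. It assumes a simplified line regex (tag, space, content); the class name NelDemo, the regex, and the helper stripNel are made up for illustration and are not the project's actual code. By default Java's . skips U+0085 because it counts as a line terminator; Pattern.DOTALL or Pattern.UNIX_LINES changes that.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from the project.
class NelDemo {
    // Simplified stand-in for a PICA line regex: tag, one space, then the content.
    static final String LINE_REGEX = "^(\\d{3}[A-Z@]) (.*)$";

    // Default flags: '.' does not match line terminators, and U+0085 is one of them.
    static final Pattern DEFAULT = Pattern.compile(LINE_REGEX);
    // Option 2a: DOTALL lets '.' match every character, line terminators included.
    static final Pattern DOTALL = Pattern.compile(LINE_REGEX, Pattern.DOTALL);
    // Option 2b: UNIX_LINES makes '\n' the only line terminator for '.', '^' and '$'.
    static final Pattern UNIX_LINES = Pattern.compile(LINE_REGEX, Pattern.UNIX_LINES);

    // Option 1: drop every U+0085 from the input before parsing.
    static String stripNel(String line) {
        return line.replace("\u0085", "");
    }

    public static void main(String[] args) {
        String line = "047I apro\u0085fits";
        System.out.println(DEFAULT.matcher(line).matches());           // false
        System.out.println(DOTALL.matcher(line).matches());            // true
        System.out.println(UNIX_LINES.matcher(line).matches());        // true
        System.out.println(DEFAULT.matcher(stripNel(line)).matches()); // true
    }
}
```

Option 1 changes the data, while the flag-based variants keep the field value untouched. Reporting the affected lines would be possible in either case by checking line.indexOf('\u0085') >= 0 before parsing.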

@nichtich What do you prefer?

nichtich · 2023-06-02T06:20:40Z

U+0085 is a valid code point in PICA field values (only code points below U+0020, including \n, should not occur in field values), so the regex must be modified.
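
One possible reading of this rule as a Java regex, as a sketch only: accept every character except code points below U+0020. The class and method names (FieldValueCheck, isValidValue) are hypothetical, and the pattern is an assumption, not the regex used in the project.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names and the pattern are illustrative assumptions.
class FieldValueCheck {
    // A negated character class also matches line terminators such as U+0085,
    // so only the code points U+0000..U+001F are rejected here.
    static final Pattern VALUE = Pattern.compile("[^\\x00-\\x1F]*");

    // Returns true if the value contains no code point below U+0020.
    static boolean isValidValue(String value) {
        return VALUE.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidValue("pro\u0085fits")); // true
        System.out.println(isValidValue("pro\nfits"));     // false
    }
}
```

Unlike . with default flags, the negated class does not depend on which characters Java counts as line terminators.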

pkiraly · 2023-06-02T07:09:04Z

Thanks! I modified the regex.

pkiraly added this to the PICA: 1.2 milestone Jun 1, 2023

pkiraly added a commit that referenced this issue Jun 2, 2023

U+0085 control character breaks PICA line parsing #277

66dc6db

pkiraly closed this as completed Jun 7, 2023
