BibIndex: add possibility to transliterate phrases #1085

Open
tiborsimko opened this Issue · 1 comment

Tibor Simko
Owner

Originally on 2012-06-13

1) It would be useful to be able to transliterate phrases at indexing time, especially for author names. The indexer would read the string stored in the DB and optionally generate more terms to index, much like stemming does, depending on the index configuration.

2) We may want to put the generated terms into separate indexes, though, in case one wants to search for the exact value as opposed to the 'fuzzy' transliterated value. This is similar to what Xapian does with its Z forms for stemming, so that people can search for either the stemmed or the non-stemmed version. The same separation could also be applied to stemming; we have had such requests in the past.

2) We could use unidecode for this. Theodoros writes:

Of all(?) the packages, I found that Unidecode
(http://pypi.python.org/pypi/Unidecode) supports most of the languages
that could be transliterated (although for Greek it does not follow
the ISO 843 standard but a 'custom' scheme, which is not good
practice; more details on the official transliteration standards for
Greek are here: http://transliteration.eki.ee/pdf/Greek.pdf).

The usage is very simple, and I ran an example with the following VERY
complex Unicode string (with Hebrew, Hindi, Chinese and Greek):
---------------
The decomposition mapping is <츠, U+11B8>, and not <0x110E, ᅳ, 11B8>.
<p>The title says ‫פעילות הבינאום, W3C‬ in Hebrew</p>
abcáßçकखी國際𐎄𐎔𐎘
Ελληνικά
---------------

and is converted to:

---------------
The decomposition mapping is <ceu, b>, and not <c, eu, 11B8>.
<p>The title says p`ylvt hbynvm, W3C in Hebrew</p>
abcassckkhiiGuo Ji
Ellenika
---------------

3) As for per-index configuration of the transliteration, see also #852.
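
To make points 1) and 2) a bit more concrete, here is a minimal sketch of indexing a phrase into an exact index and a separate transliterated ('fuzzy') index. It is not BibIndex code: `asciify`, `index_phrase` and the two dictionaries are hypothetical names, and the standard-library NFKD trick is only a crude stand-in for a real transliterator such as Unidecode, since it drops any character that has no ASCII decomposition (Greek, Chinese, ß, ...).

```python
import unicodedata
from collections import defaultdict


def asciify(text):
    """Crude stand-in for a real transliterator: strip diacritics via NFKD.

    Characters without an ASCII decomposition are simply dropped, so this
    only suits Latin-script names. A library such as Unidecode would be
    needed for other scripts.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def index_phrase(recid, phrase, exact_index, fuzzy_index):
    """Store the phrase as-is in one index and its ASCII form in another."""
    exact_index[phrase].add(recid)
    fuzzy_index[asciify(phrase)].add(recid)


# Hypothetical in-memory indexes: term -> set of record IDs.
exact_index = defaultdict(set)
fuzzy_index = defaultdict(set)
index_phrase(1, "Müller, Hans", exact_index, fuzzy_index)
index_phrase(2, "Muller, Hans", exact_index, fuzzy_index)

# Exact search only matches the spelling that was stored.
print(sorted(exact_index["Muller, Hans"]))
# Fuzzy search transliterates the query too, so both records match.
print(sorted(fuzzy_index[asciify("Müller, Hans")]))
```

Keeping the two indexes apart means the exact search never returns transliterated matches, while the fuzzy search applies the same transformation to the query as was applied at indexing time.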

invenio-developers
Collaborator

Originally by arwagner on 2013-11-26

In the case of authority records, one might get a number of such "transliterations" via the "additional name forms" in the 400% fields and friends. Say you have

1001_ $a Müller, Hans
4001_ $a Muller, H
4001_ $a Mueller, H

it would signify that Muller or Mueller are "known names" for Müller.

Note that this also handles the case where Müller, Hans changes his name, for example

4001_ $a Schmidt, H

so "Schmidt, H" would also be a valid form of the name "Müller, Hans". Think of marriage, divorce, pseudonyms, and so on. For an elaborate example of this issue, see

http://viaf.org/viaf/24602065/#Goethe,_Johann_Wolfgang_%CB%9Cvon%C5%93_1749-1832

so this should be taken into account here as well.
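
As a rough illustration of the idea, the additional name forms could be indexed so that each known form resolves to the main heading. This is only a sketch under assumptions: the dictionary layout of `authority_records` and the helper `build_variant_index` are hypothetical and do not reflect how Invenio stores authority records.

```python
from collections import defaultdict

# Hypothetical in-memory authority record: the main heading (1001_ $a)
# plus the additional name forms (4001_ $a) from the example above.
authority_records = [
    {
        "100": "Müller, Hans",
        "400": ["Muller, H", "Mueller, H", "Schmidt, H"],
    },
]


def build_variant_index(records):
    """Map every known name form to the main heading(s) it belongs to."""
    index = defaultdict(set)
    for record in records:
        main = record["100"]
        index[main].add(main)
        for variant in record["400"]:
            index[variant].add(main)
    return index


variant_index = build_variant_index(authority_records)
# A search for a later or alternative name still finds the main heading.
print(variant_index["Schmidt, H"])
```

Unlike algorithmic transliteration, this lookup also covers forms such as "Schmidt, H" that cannot be derived from the main heading at all.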

Samuele Kaplun kaplun referenced this issue from a commit in kaplun/invenio
Samuele Kaplun kaplun BibIndex: new BibIndexASCIIAuthorTokenizer
* Introduces new BibIndexASCIIAuthorTokenizer(), based on
  BibIndexAuthorTokenizer(), where all strings are first asciified.
  (closes #1085)

Signed-off-by: Samuele Kaplun <samuele.kaplun@cern.ch>
a0e30cd
Samuele Kaplun kaplun referenced this issue from a commit in kaplun/invenio
Samuele Kaplun kaplun BibIndex: new BibIndexASCIIAuthorTokenizer
* Introduces new BibIndexASCIIAuthorTokenizer(), based on
  BibIndexAuthorTokenizer(), where all strings are first asciified.
  (closes #1085)

Signed-off-by: Samuele Kaplun <samuele.kaplun@cern.ch>
c24f4e4