Multiscript support in biblatex/biber #416

Open
plk opened this issue Apr 26, 2016 · 175 comments

@plk
Owner

plk commented Apr 26, 2016

No description provided.

@plk plk self-assigned this Apr 26, 2016
@duncdrum

duncdrum commented May 3, 2016

I can't get the MWE from this SE thread to work.

\documentclass{article}
\usepackage{fontspec} 
\usepackage{polyglossia} 
\setdefaultlanguage{english}
\usepackage{xeCJK}
\setCJKmainfont{Hiragino Mincho Pro}

\usepackage[style=authoryear,%
            language=auto,%
            autolang=langname,%
          vform=romanised]{biblatex}
\addbibresource{literature.bib}
\usepackage{filecontents}
\begin{filecontents}{literature.bib}
@COLLECTION{yanagida_zengaku_sosho_1975,
  LANGID = {japanese},
  EDITOR = {柳田聖山},
  EDITOR_romanised = {Yanagida, Seizan},
  TITLE = {禪學叢書},
  TITLE_romanised = {Zengaku sōsho},
  TITLE_translated_english = {Collected Materials for the Study of Zen},
  LOCATION = {京都},
  LOCATION_romanised = {Kyōto},
  LOCATION_translated_english = {Kyoto},
  PUBLISHER = {中文出版社},
  PUBLISHER_romanised = {Chūbun shuppansha},
  DATE = {1974/1977}
}
\end{filecontents}
\begin{document}
Hello World.\footcite{yanagida_zengaku_sosho_1975}
\nocite{*}

\printbibliography
\end{document}

The vform=romanised option isn't recognised:

Package xkeyval Error: `vform' undefined in families `blx@opt@pre'.

Is this still the preferred way to include multilingual data in a biblatex file? It seems that romanised is a specific use case, whereas transliteration would be more fitting. Sanskrit texts, for example, were transcribed into Chinese without being "romanised." Are there limits on the number of transcriptions and translations one can include?

Are there any restrictions on the contents of the LANGID field? Most catalogues that include language information seem to give it in ISO form. Does each item require one primary language, or would a bilingual edition have two LANGIDs? And how do these interact with babel/polyglossia?

@plk
Owner Author

plk commented May 3, 2016

The multiscript code was never in a released version. It lived in a separate git branch, which is currently in limbo and hasn't been updated in a long time, because it really over-complicated the biblatex internals and I hit several problems. I would like to look at it again at some point, but at the moment it's not really usable.

@duncdrum

duncdrum commented May 3, 2016

I see. So for the biblatex export of our records, should we wrap transliteration and translation into the standard biblatex fields?

...
EDITOR = {Yanagida~Seizan, 柳田聖山},
...

since biblatexml seems to be in a similar state of limbo?

@plk
Owner Author

plk commented May 3, 2016

It's not so much to do with the data source format (biblatexml has had a lot of work for version 3.4, which will be released soon) as with the internals which handle multi-script. There is currently no obvious way to deal with this, apart perhaps from the related entry functionality: you could have multiple entries with RELATED fields, and then you'd have to write driver macros to support this. You could ask on TSE. The format you suggest won't work because name fields have to be parsed by the usual bibtex name parsing rules.
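
A rough, untested sketch of what the data for the related-entry idea could look like. The entry keys and the `origscript` relatedtype are made-up names for illustration, and the romanised title is an assumption. As far as I know, biblatex falls back to its default related macro unless a matching `related:origscript` bibmacro is written, and writing that macro is the driver work still to be done.

```latex
\begin{filecontents}[force]{\jobname.bib}
% Romanised entry: this is the one that gets cited.
@collection{yanagida:rom,
  editor      = {Yanagida, Seizan},
  title       = {Zengaku sōsho},
  location    = {Kyōto},
  publisher   = {Chūbun shuppansha},
  date        = {1974/1977},
  related     = {yanagida:orig},
  relatedtype = {origscript},
}
% Original-script entry: only pulled in through RELATED.
% skipbib keeps it out of the list even with \nocite{*}.
@collection{yanagida:orig,
  editor    = {柳田聖山},
  title     = {禪學叢書},
  location  = {京都},
  publisher = {中文出版社},
  date      = {1974/1977},
  options   = {skipbib},
}
\end{filecontents}
```

This keeps each script variant in ordinary fields, at the cost of duplicating the entry and of having no per-field control.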

@maieul
Contributor

maieul commented May 6, 2016

Dear @plk, I will be glad to give feedback and test such features. If I remember the conversation on the previous topic correctly, two points are very important:

  • the biblatex and biber settings must be correlated
  • it must be possible to specify normalised forms in different languages for different fields of the same entry.

@u-fischer

I'm still interested in this topic too. Imho the problems are

  • lists (of locations or names), as it can happen that only some names need a translation/transcription.
  • the handling of names where "first name last name" doesn't work in the same way as in English.
  • how to avoid overly complicated "if this variant exists then ... else if this variant exists ... else ..." chains in the drivers.

@plk
Owner Author

plk commented May 19, 2016

It's exactly point three you mention that made me think that the last PoC I did of this was not the right way - it got extremely complicated and ugly in the internals. Point two about names should I think be possible already with the name changes in 3.3 - that was partly the motivation - you can define custom name parts and also customise how sort keys are constructed for name parts (see 93-nameparts.tex sample file which implements basic Russian patronymics).

A problem I foresee is how to determine labelname, labeltitle etc. without demanding that every field they reference has every script variant defined. The alternative is a messy interface to select the script variant of them and I'd rather avoid that.

@u-fischer

Can the name parts interface handle different name types in one author list in one go? That is, a Chinese name, a Russian name with patronymics and a German one? And the main question: how can one manage a bib in XML format? Things like your answer here http://tex.stackexchange.com/a/308761/2388 look very good, but I doubt that users want to write XML manually.

Perhaps biber could handle an input syntax like this:

 author={\namepart{family}{Fischer} \namepart{given}{Ulrike}  and \namepart{family}{...} \namepart{prefix}{von}  }

Then one wouldn't have to convert everything to xml to explore the power of the \namepart system.

Regarding the labelname: I think one shouldn't overdo the automation but allow users to define them manually if their wishes get too special, e.g. labelname_typeX={...} with some interface to select such labelnames.

@plk
Owner Author

plk commented May 19, 2016

I would need to experiment a bit but essentially, any name parts defined by \DeclareDatamodelConstant[type=list]{nameparts}{ ... } are available to names. This doesn't really address the multi-script requirement though. The XML format has had a lot of work for 3.4. You don't need to write it by hand - tool mode can convert to/from this (I use the XML format for my own work).

I am a bit loath to extend the bibtex format as it usually means hacking the btparse C library and this is painful and fragile. It's also a general CPAN module and so it must always remain backwards compatible with any generic bibtex usage.
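
For reference, a minimal sketch of the name parts declaration mentioned above, along the lines of the 93-nameparts.tex sample. The file name `mynameparts` is only an example, and printing the extra part would still need a custom name format, which is not shown here.

```latex
\documentclass{article}
% Write a data model extension file next to the document.
% The default parts have to be listed again together with the new one.
\begin{filecontents}[force]{mynameparts.dbx}
\DeclareDatamodelConstant[type=list]{nameparts}{prefix,family,suffix,given,patronymic}
\end{filecontents}
% Load the extension through the datamodel option.
\usepackage[datamodel=mynameparts]{biblatex}
```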

@u-fischer

The problem with converting bib<->xml is that it will (probably) only work as long as the content is usable in both formats. But I do understand that extending the format of normal name fields is difficult. What about a new field format with a strict input syntax with name parts? Then one could use xauthor={....}.

Btw: I get errors when converting to xml and back to bib:

    G:\biblatextest>biber --tool --output-format=biblatexml biblatex-examples.bib
    G:\biblatextest>biber --tool --output-format=bibtex biblatex-examples_bibertool.bltxml

@plk
Owner Author

plk commented May 19, 2016

You need to use --input-format=biblatexml on the second run. I suppose I could auto-detect this from the filename extension.

@plk
Owner Author

plk commented May 21, 2016

@u-fischer - I have added an extended name format for bibtex data sources when using biber. It allows you to specify the name parts explicitly and you can mix and match this with normal bibtex names:

AUTHOR = {Alan Smith and family=Brown, prefix=de, given=Robert}

I'd rather not encourage tex markup in names, hence this format (which has to be handled in biber anyway).

Detection of which parsing routine to use is automatic but you can turn off extended name format parsing with a biber flag in case of issues. It also allows explicit specification of prefixes and supports any custom nameparts defined in the data model. See the biber PDF doc and the 93-nameparts.tex which comes with biblatex which uses both biblatexml and this extended bibtex format. This is in 2.6/3.5 dev versions.
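
A minimal sketch of how this might look in a complete document, assuming biber 2.6/biblatex 3.5 or later. The entry key, title and date are placeholders.

```latex
\documentclass{article}
\begin{filecontents}[force]{\jobname.bib}
@book{smithbrown,
  % First name: classic bibtex syntax. Second name: explicit name parts.
  author = {Alan Smith and family=Brown, prefix=de, given=Robert},
  title  = {A Test Title},
  date   = {2016},
}
\end{filecontents}
\usepackage[style=authoryear]{biblatex}
\addbibresource{\jobname.bib}
\begin{document}
\textcite{smithbrown}
\printbibliography
\end{document}
```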

@u-fischer

This sounds very good, I will try it tomorrow. And it is of course OK that it is not TeX syntax; I only used that because I'm used to it.

@plk
Owner Author

plk commented May 23, 2016

Actually, you are right about the bib<->biblatexml round-trip. There is currently no support for biblatexml->bib, only bib->bib.

EDIT: See below, this is now possible - tool mode can now convert between anything, including the extended name format and normal name format.

@Shinoto-github

I am new to this discussion and cannot really help developing code etc.

But I can speak Japanese and have rudimentary knowledge of Chinese and Korean, and I do write in the humanities in several languages, using multilingual bibliographies (including Western languages). I will be glad to help with comments if you wish so and can test documents. If this is more of a nuisance, do not hesitate to tell me. This is OK for me.

As for author names, the solution of plk AUTHOR = {Alan Smith and family={Brown}, prefix={de}, given={Robert}} looks good. There is another case with generation names in Korean and Chinese; some want them mentioned isolated, others include them into their personal (given) name.

This is all for now.

@plk
Owner Author

plk commented May 26, 2016

This could be helpful when I get time to look more into it. The new name format you mention is already in biblatex 3.5/biber 2.6 development versions (on sourceforge) and you can define any new nameparts you need to deal with things like generation names. However, the main issue with multilingual support is having multiple copies of the same field in the same bibliography data entry and this is something which is quite hard to implement.

@Shinoto-github

Great to hear about the name format in the development versions!

I see three main problems with Far Eastern sources, which I will explain below. If I have misunderstood something, please do not spend your precious time correcting my view. I would be embarrassed if my comments cost you time rather than helping to find a solution.

  1. Different formats in the same bibliography. Some Far Eastern journals request separation of Western and Far East languages; so it is easy to get two different bibliographies and it should be easy to apply a style that is appropriate to the language. But since the style is chosen in the preamble, it must be a universal style for different cases.
  2. Brackets for certain fields (like series titles) are different to those in Western languages. If a package like csquotes can deal with Japanese etc. brackets, and we could choose them in a modification of the style, that would make the bibliography look a lot better.
  3. As for multiple name versions, there are mostly these types: (1) translations into different languages (for institutions), (2) transcriptions according to different transcription systems into different writing systems, (3) the original writing, and (4) a transcription for identification. The fourth is particularly important in styles similar to authoryear, where the author is identified and mentioned once, whereas each entry might have a different original, transcription or translation. Basically, though, every field can have these variations; a title can be translated or transcribed as well.

For the time being, I write my biblatex files with the ID transcription (4) in romanised form in the normal author and editor fields. For the other cases I use a field-naming system that adds the type and the intended language or writing system to the field name, e.g. "author" becomes "author-trsscpt-hepburn" or "author-trslation-de" or "author-orig-ja" etc. Whenever I use the entries, for now I use regular expressions to create the fields that the authoryear style recognises in order to get the respective data printed.
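
An alternative to external regular expressions might be biber's source mapping, which can rename fields while the data is read. A minimal, untested sketch, assuming the parser accepts the hyphenated field names used above and that the original-script form should replace the romanised one for printing:

```latex
% Preamble, after loading biblatex.
\DeclareSourcemap{
  \maps[datatype=bibtex]{
    % overwrite: replace AUTHOR even if the entry already has one
    \map[overwrite]{
      % Move the custom field into the standard field the style prints.
      \step[fieldsource=author-orig-ja, fieldtarget=author]
    }
  }
}
```

Note that this also changes what sorting and label generation see, so it is a workaround for choosing one variant per document, not real multiscript support.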

@moewew
Collaborator

moewew commented Jun 5, 2016

The new way to give name parts explicitly is really useful. I still think that the biblatexml is a bit too overwhelming for the user, while .bib files are very easy to understand.

Would it also be possible to give the sortnamekeyscheme in this model (per-name)?

@plk
Owner Author

plk commented Jun 5, 2016

@moewew - yes, that's important. It should work now for per-namelist and per-name scope in bibtex datasources - see the biblatex doc on \DeclareSortingNamekeyScheme.

I agree about biblatexml - until there is some GUI interface to a backend XML format like this, it's not very easy to see things at a glance even though it's conceptually easier and less prone to errors.

@moewew
Collaborator

moewew commented Jun 6, 2016

That works very well, thank you.

I noticed that Biber complains (use of uninitialized value in (.)) if a name does not include a family part. The output is fine, but it seems Biber expects names to have a family part (maybe this is connected to the next observation). With the new name scheme it would probably be necessary to allow for customisation of the uniquename option (per sortnamekeyscheme or something new).

This all came up in Bibtex/Biber: how to cite an author using Ethiopian conventions? on TeX.SX.

MWE (for the use of uninitialized value in (.))

\documentclass{article}
\usepackage{filecontents}

\begin{filecontents*}{\jobname.bib}
@book{james,
  author  = {given=James},
  title   = {Test},
  date    = {1983}
}
\end{filecontents*}

\usepackage[style=authoryear]{biblatex}

\addbibresource{\jobname.bib}

\begin{document}
\textcite{james}
\printbibliography
\end{document}

For me the allure of the .bib format is that you don't need a GUI to work with it comfortably. (Judging by the number of questions on TeX.SX about JabRef and other exporters they can even cause more problems than they solve.) Even though I have a soft spot for XML and agree that it is a nicer format to store the data, the biblatexml format is a bit too verbose for me to work with manually.

@plk
Owner Author

plk commented Jun 6, 2016

Yes, I need to remove the last traces of assumption that every name has a family name - that's been hard-coded into biblatex/biber for a long time. Then I think uniquename etc. need to be customisable. Looking into it.

@simifilm

simifilm commented Jun 6, 2016

Probably a subject for a separate discussion: While I really like the new flexible data model for names, I think biblatex finally reaches a point where we have to think more about how GUIs can handle the changes.

I just added pen names to biblatex-fiwi, which works fine. But if I have an entry like

Author = {given={William}, family={Atheling}, suffix={Jr.}, truefamily={Blish}, truegiven={James}},

in a .bib file, BibDesk, my GUI of choice, can't handle it anymore. And, of course, the biblatexml can't be handled by any application.

I realize that the biblatex devs can't also take care of the various GUIs, and unfortunately the BibDesk developers haven't been very forthcoming about biblatex in the past. But if some kind of exchange or communication with some GUI devs could be established, this would be a big boon.

OTOH, if I end up with a .bib file which I can only edit with a text editor, that would, at least for me, be a big step backwards.

@maieul
Contributor

maieul commented Jun 6, 2016

Indeed, the BibDesk team is not very open to biblatex. For example, they don't want a mechanism for nested crossrefs.

However, may I suggest using : as the separator inside the field

Author = {given:{William}, family:{Atheling}, suffix:{Jr.}, truefamily:{Blish}, truegiven:{James}}

That will make it compatible with the GUI without any modification.

@plk
Owner Author

plk commented Jun 6, 2016

Good idea - I only use Emacs and so I am not really aware of the GUI situation.

@maieul
Contributor

maieul commented Jun 6, 2016

I have tested with bibdesk: ":" is working. What should be tested is, I think, Zotero (with https://github.com/retorquere/zotero-better-bibtex) and JabRef.

@u-fischer

However, may I suggest using : as the separator inside the field

I can't test it, but why should BibDesk care about the separator (colon or equal sign)? Imho the only thing that should matter in a "normal" bibtex application is the number of commas.

@plk
Owner Author

plk commented Jun 6, 2016

I imagine it confuses it with the = after the field name. I will make the separator configurable.
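
A sketch of how a configurable separator might be used. To my knowledge biber exposes this as the `--xnamesep` option (default `=`), but the option name is quoted from memory here, so please check `biber --help` for your version. The entry key, title and date are placeholders.

```latex
% Assumed invocation (verify the option name for your biber version):
%   biber --xnamesep=: <jobname>
% With ":" as the separator, the field contains no "=" that could
% confuse a GUI which looks for "field = value".
\begin{filecontents}[force]{\jobname.bib}
@book{atheling,
  author = {given:William, family:Atheling},
  title  = {A Test Title},
  date   = {2016},
}
\end{filecontents}
```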

@mariashinoto

mariashinoto commented Feb 12, 2020 via email

@pauloney
Collaborator

pauloney commented Feb 12, 2020 via email

@norbusan

Just to chime in a bit, I agree completely with @pauloney, as I studied this kind of thing during my (Tibetan/Sanskrit/Ancient Greek) studies.

Transliteration is used to convey a proper pronunciation, something like the current use of Katakana in Japanese (and the use of Hiragana before the reform).

Transcription is just a change of representation that must be reversible: one can return from the transcribed text to the original without errors (in theory).

To pick up the example of @zaw-shinoto , actually both "Fuji" and "Huzi" are transcriptions, but different ones, one is Hepburn (revised or not) and one is Nihon-shiki or Kunrei-shiki. To be honest, both of them are actually not "transcriptions", because there is no way to distinguish Hiragana from Katakana from Kanji in it, and thus one cannot return to the original text. But this is a different topic.

@pauloney
Collaborator

pauloney commented Feb 12, 2020 via email

@mariashinoto

mariashinoto commented Feb 12, 2020 via email

@pauloney
Collaborator

pauloney commented Feb 12, 2020 via email

@mariashinoto

mariashinoto commented Feb 12, 2020 via email

@plk
Owner Author

plk commented Feb 12, 2020

To be clear, the multiscript alternate forms (translation, transliteration etc.) are not interpreted by biblatex at all. You are free to use them as you will, they are just string identifiers. The multiscript language/script identifiers must be valid BCP47 tag format (this does not enforce case). The separator between the field, form and lang is by default underscore but this is customisable. The dash separator within BCP47 tags is not customisable as it is part of the IETF standard. The example was just to show the mechanisms, not to advocate for data classifications in this example.
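
To make the naming scheme concrete, here is a small untested sketch for the experimental multiscript versions. The form name `translation` and the tag `ja-jp` are arbitrary choices, the entry is a placeholder, and `\titleorja` is a hypothetical helper name that would only make sense inside a bibliography driver or field format.

```latex
% Field name pattern: <field>_<form>_<lang>
%   "_"    default separator (customisable)
%   <form> free string identifier, not interpreted by biblatex
%   <lang> BCP47 tag, dash-separated, case not enforced
\begin{filecontents}[force]{\jobname.bib}
@book{kanji,
  TITLE                   = {Kanji},
  TITLE_translation_ja-jp = {漢字},
  YEAR                    = {2001}
}
\end{filecontents}

% Print the ja-jp translation alternate of the title if the entry
% has one, otherwise fall back to the default title field.
\newcommand*{\titleorja}{%
  \iffieldundef[translation][ja-jp]{title}
    {\thefield{title}}
    {\thefield[translation][ja-jp]{title}}}
```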

@retorquere

Why not standardize the separator between field/form/lang? Unless it is configurable in the bib file itself, having it variable means the bib file is harder to parse/produce by other tools (such as Zotero BBT).

@mariashinoto

mariashinoto commented Feb 12, 2020 via email

@plk
Owner Author

plk commented Feb 12, 2020

Well, it is simply that there is a biber option to change the default underscore when parsing data, that's all. It can also be changed in tool mode output.

@plk
Owner Author

plk commented Feb 12, 2020

The purpose is to provide functionality for defining and using multiple “alternates” of the same field within one entry. The exact forms/langs are in a sense arbitrary as you would use the form/langs as string keys to determine what to print. However, the lang is used to auto-switch babel/polyglossia language even on a per list item basis which means that real multilingual/script support is possible.

@mariashinoto

mariashinoto commented Feb 12, 2020 via email

@pauloney
Collaborator

pauloney commented Feb 12, 2020 via email

@plk
Owner Author

plk commented Feb 12, 2020

@pauloney - it's already in the manual for 4.0, with some examples. All of the new experimental functionality is documented.

@yannis1962

yannis1962 commented Nov 25, 2020 via email

@plk
Owner Author

plk commented Nov 25, 2020

Version 4.0 of biblatex and 4.0 of biber, in the "experimental" folders on Sourceforge implement the multiscript branch currently. The documentation in that version has all of the details and there is a sample "98-multiscript-biber.pdf" and the source "98-multiscript.tex" which has examples.

@yannis1962

yannis1962 commented Nov 25, 2020 via email

@plk
Owner Author

plk commented Nov 25, 2020

Unfortunately, these versions are not in TLMGR and you have to get them from Sourceforge and install manually. You need biblatex 4 and biber 4.

@yannis1962

yannis1962 commented Dec 1, 2020 via email

@plk
Owner Author

plk commented Dec 3, 2020

Here is a complete example which works with the 4.0 versions of biblatex/biber and which shows the general approach and gives the results you want. Some notes:

  • This is a slightly strange format as it has Japanese as the primary language for non-names and English/transliterations for names. So, I chose english as the default language and gave japanese as the alternate form.
  • To be more correct, you would probably want to define a custom name part for Japanese in the data model. Here, the Japanese name translations are treated as just "family" parts, which is obviously semantically incorrect.
  • More checks could be put in to test for the presence of Japanese translated variants for fields, to avoid spurious output.

\documentclass[a4paper]{article}
\usepackage{luatexja-fontspec}
\setmainfont{TeXGyrePagella-Regular}
\setmainjfont{IPAexMincho}
\usepackage[japanese,main=english]{babel}
\begin{filecontents}[force]{\jobname.bib}
@INPROCEEDINGS{bib21,
  AUTHOR                      = {Kōno, Rokurō and Nagata, Hidemasa and Sasahara, Hiroyuki},
  AUTHOR_translation_ja-jp    = {河野六郎 and 永田英正 and 笹原宏之},
  EDITOR                      = {Kōno, Rokurō and Chino, Eiichi and Nishida, Tatsuo},
  EDITOR_translation_ja-jp    = {河野六郎 and 千野栄一 and 西田龍雄},
  TITLE                       = {Kanji},
  TITLE_translation_ja-jp     = {漢字},
  BOOKTITLE                   = {Encyclopedia of the World's Scripts},
  BOOKTITLE_translation_ja-jp = {世界文字辞典},
  PUBLISHER                   = {Sanseidō},
  PUBLISHER_translation_ja-jp = {三省堂},
  ADDRESS                     = {Tokyo},
  PAGES                       = {256--281},
  YEAR                        = {2001}
}
\end{filecontents}
\usepackage{csquotes}
\usepackage[style=authoryear,%
            dynamiclabel=true,%
            language=auto,%
            autolang=other,%
            autofieldlang=other]{biblatex}
\addbibresource{\jobname.bib}

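% Print the ja-jp "translation" alternate of a name part in brackets,
% or nothing if no such alternate is defined.
\def\jpnamevariant#1{%
  \expandafter\ifdefvoid\csname #1translationja-jp\endcsname
    {}
    {\space\mkbibbrackets{\csuse{#1translationja-jp}}}}

% Same test for list items, but printed without brackets.
\def\jplistvariant#1{%
  \expandafter\ifdefvoid\csname #1translationja-jp\endcsname
    {}
    {\csuse{#1translationja-jp}}}

% Print the ja-jp "translation" alternate of a field, if present.
\def\jpfieldvariant#1{%
  \iffieldundef[translation][ja-jp]{#1}
    {}
    {\thefield[translation][ja-jp]{#1}}}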

\DeclareNameFormat{given-family}{%
  \ifgiveninits
    {\usebibmacro{name:given-family}
      {\namepartfamily}
      {\namepartgiveni}
      {\namepartprefix}
      {\namepartsuffix}}
    {\usebibmacro{name:given-family}
      {\namepartfamily}
      {\namepartgiven}
      {\namepartprefix}
      {\namepartsuffix}}%
   \jpnamevariant{namepartfamily}%  
   \usebibmacro{name:andothers}}
  
\DeclareNameFormat{family-given/given-family}{%
  \ifnumequal{\value{listcount}}{1}
    {\ifgiveninits
       {\usebibmacro{name:family-given}
         {\namepartfamily}
         {\namepartgiveni}
         {\namepartprefix}
         {\namepartsuffix}}
       {\usebibmacro{name:family-given}
         {\namepartfamily}
         {\namepartgiven}
         {\namepartprefix}
         {\namepartsuffix}}%
     \jpnamevariant{namepartfamily}%
     \ifboolexpe{%
       test {\ifdefvoid\namepartgiven}
       and
       test {\ifdefvoid\namepartprefix}}
       {}
       {\usebibmacro{name:revsdelim}}}
    {\ifgiveninits
       {\usebibmacro{name:given-family}
         {\namepartfamily}
         {\namepartgiveni}
         {\namepartprefix}
         {\namepartsuffix}}
       {\usebibmacro{name:given-family}
         {\namepartfamily}
         {\namepartgiven}
         {\namepartprefix}
         {\namepartsuffix}}%
     \jpnamevariant{namepartfamily}}%
  \usebibmacro{name:andothers}}

\DeclareFieldFormat
  [article,inbook,incollection,inproceedings,patent,thesis,unpublished]
  {title}{\mkbibquote{\jpfieldvariant{title}\addspace\mkbibbrackets{#1}\isdot}}

\DeclareFieldFormat{booktitle}{\jpfieldvariant{booktitle}\addspace\mkbibbrackets{#1}}

\DeclareListFormat{publisher}{%
  \usebibmacro{list:delim}{#1}%
  \usebibmacro{list:langswitchon}%
  \jplistvariant{listitem}\addspace\mkbibbrackets{#1}\isdot
  \usebibmacro{list:langswitchoff}%
  \usebibmacro{list:andothers}}

\begin{document}
\cite{bib21}
\printbibliography
\end{document}

@dbitouze
Contributor

dbitouze commented Jan 2, 2021

Here is a complete example which works with the 4.0 versions of biblatex/biber

Sounds very nice, thanks! But where are "the 4.0 versions of biblatex/biber" available?

@plk
Owner Author

plk commented Jan 2, 2021

@plk
Owner Author

plk commented Jul 27, 2022

We are looking into "releasing" the multiscript version 4.0 of biblatex/biber as separate packages on CTAN so that they can have wider evaluation. They would be called biblatex-ms and biber-ms and users should be able to install these in parallel. They should be (and are in all current regression tests) backwards-compatible when not using any of the multiscript features and extended .bib syntax. However, they will still be considered "experimental" as they have not had extensive testing with many styles etc. They are slower than the current "standard" version as they have to do a lot more in order to account for multiple scripts in any field.

@plk plk added this to the v4.0 milestone Mar 8, 2023
@plk
Owner Author

plk commented Mar 8, 2023

biblatex-ms and biber-ms are now in TL for testing and feedback. They are fully documented and should be backwards compatible and up to date with the current 3.19/2.19 versions of biblatex/biber.
