Word agglutination in German but not in French #27

Closed
everzeni opened this issue Apr 20, 2017 · 9 comments

@everzeni
Collaborator

In the sentence:

All persons in the civil service had to obtain an Ariernachweis (Aryan certificate) in order to prove their Aryan ancestry.

I would tend to annotate only Aryan certificate like this:

(<ENAMEX type="PERSON_TYPE">Aryan</ENAMEX> certificate)

So I have two questions:

  1. If it is correct to annotate only "Aryan" in Aryan certificate, what should we do with Ariernachweis?

  2. If we annotate the NP Aryan certificate as a whole (same for the NP Ariernachweis), what is the right class? ARTIFACT maybe?

grobid-ner/resources/dataset/ner/corpus/xml/generated/Wikipedia_holocaust.2.training.xml:41
@kermitt2
Owner

kermitt2 commented Apr 20, 2017

hello!

  1. I think it is fine to annotate only Aryan in Aryan certificate.

  2. We can leave Ariernachweis not annotated at all. A certificate is not really an ARTIFACT, it's out of scope of what we have decided (arbitrarily) to consider as named entities.
    The good point is that Ariernachweis is very well recognized via the German Wikipedia, and there is not an infinite number of types of identity papers.

@everzeni
Collaborator Author

everzeni commented Apr 25, 2017

Slightly related, what should we do with translations, for example of concepts:

Nazi leaders proclaimed the existence of a Volksgemeinschaft ("people's community").

1) the existence of a <ENAMEX type="CONCEPT">Volksgemeinschaft</ENAMEX> ("people's community").

2) the existence of a <ENAMEX type="CONCEPT">Volksgemeinschaft</ENAMEX> ("<ENAMEX type="CONCEPT">people's community</ENAMEX>").

3) the existence of a Volksgemeinschaft ("<ENAMEX type="CONCEPT">people's community</ENAMEX>").
?
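
To make the difference between the three variants concrete, here is a minimal sketch that lists the entity spans each one would yield. It is not part of grobid-ner: the helper name `extract_enamex` and the regular expression are assumptions made for illustration, and they only handle flat (non-nested) ENAMEX tags like the ones in this thread.

```python
import re

# Hypothetical helper, not part of grobid-ner: matches flat, non-nested
# ENAMEX elements and captures the type attribute and the annotated text.
ENAMEX_RE = re.compile(r'<ENAMEX\s+type="([^"]+)">(.*?)</ENAMEX>', re.DOTALL)


def extract_enamex(annotated):
    """Return (type, text) pairs for every ENAMEX span in the string."""
    pairs = []
    for match in ENAMEX_RE.finditer(annotated):
        # Collapse internal whitespace so line-wrapped spans compare equal.
        text = " ".join(match.group(2).split())
        pairs.append((match.group(1), text))
    return pairs


# The three variants from the comment above, in order.
variants = [
    'the existence of a <ENAMEX type="CONCEPT">Volksgemeinschaft</ENAMEX> '
    '("people\'s community").',
    'the existence of a <ENAMEX type="CONCEPT">Volksgemeinschaft</ENAMEX> '
    '("<ENAMEX type="CONCEPT">people\'s community</ENAMEX>").',
    'the existence of a Volksgemeinschaft '
    '("<ENAMEX type="CONCEPT">people\'s community</ENAMEX>").',
]

for number, variant in enumerate(variants, start=1):
    print(number, extract_enamex(variant))
```

The first variant yields only the German term, the second yields both the term and its translation, and the third yields only the translation.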

@lfoppiano
Collaborator

In my opinion, since we want to keep the same approach as discussed above, I would go with the 3rd answer.

@lfoppiano
Collaborator

We discussed this this afternoon.

Let me try to explain:

  1. The need to also annotate German words (in this case) arises when a concept is better represented in a foreign language.
    (A counter-argument would be that, I think, in German any word is a concept.)
    @kermitt2 what do you think?

  2. In this case
    e.g. Ethnic Germans required more Lebensraum ("living space") according to Nazi doctrine so population displacement (which included murder) and colonial settlement were intrinsically linked.

living space is not a concept today, but since it was a concept related to the culture of prewar Germany, shall we annotate it as a concept?

@kermitt2
Owner

  1. Well, in the case of Ariernachweis it's not an abstract concept, it's a legal paper. So no need to annotate it as a concept - beyond Aryan in the English translation (clearly a PERSON_TYPE).

In the other examples, Lebensraum and Volksgemeinschaft are German political ideas/principles and CONCEPT appears intuitively relevant. They are not common German words.

  2. I would say it's really Lebensraum that is the name of the concept.

@lfoppiano
Collaborator

For (2) 👍, this was the point @everzeni raised and it does make sense 🕺

Regarding (1), does it make sense to annotate it at all then?
What if we have something like

  • In France many people make use of the Sécurité Sociale (Social Security), or
  • The politician was taken to the Questura di Milano (central police station) and he had to answer some questions

In the first case both of them are clearly entities; shall we annotate both then? In the second case, central police station is more generic (because in English culture I think there is no real concept like that).

I hope these examples are relevant

@ebenaissa
Collaborator

ebenaissa commented May 15, 2017

I also have multiple cases of translated entities here. What should we do? Annotate both? Just the original? Just the translation?

The “<ENAMEX type="INSTITUTION">Archives Générales du Royaume</ENAMEX>” (<ENAMEX type="INSTITUTION">National Archives of Belgium</ENAMEX>) 
“<ENAMEX type="INSTITUTION">Archives de l’État dans les Provinces</ENAMEX>” (<ENAMEX type="INSTITUTION">State Archives in the Provinces</ENAMEX>)
in other words the <ENAMEX type="INSTITUTION">State Archives</ENAMEX> are a federal academic establishment that forms part of the “<ENAMEX type="INSTITUTION">Service Public Fédéral de Programmation Politique scientifique</ENAMEX> ”(<ENAMEX type="INSTITUTION">Belgian Federal Science Policy Office</ENAMEX>).</sentence>

Given likely cases where the translation is not really an entity (like "people's community" for Volksgemeinschaft), and following the rule that we annotate foreign words with existing classes, I vote to annotate just the original words.

@lfoppiano
Collaborator

@ebenaissa I would annotate everything in your examples.
Regarding your second paragraph, I also agree not to annotate the "translated" text when it consists of common words. For example, 'social security' is clearly an entity, while 'central police office' is not (it could be thought of as the building, but it's a group of common words).

@kermitt2 does it make sense to you?

@kermitt2
Owner

yes, this all makes perfect sense 💃

@everzeni everzeni removed the question label Jun 16, 2017
@everzeni everzeni closed this as completed Aug 2, 2017
everzeni pushed a commit that referenced this issue Aug 2, 2017