Add classification for degree_type field #1894

rikirenz · 2017-01-30T15:42:38Z

Adds a classification for the degree_type field according to the schema: https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/elements/degree_type.json

Signed-off-by: Samuele Kaplun <samuele.kaplun@cern.ch>

Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>


rikirenz · 2017-01-30T16:00:40Z

I would like to know how we should match the data from the legacy system with the new data defined in the degree_type.json (https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/elements/degree_type.json).

This is the list with values and number of occurrences on the legacy system:

   2806 PhD
   2319 PHD
    103 Master
     90 MAS
     28 MASTER
     27 master
     18 Bachelor
     12 Phd
     11 UG
      8 Habilitation
      5 Thesis
      2 Diploma
      1 MAs
      1 Mas
      1 Laurea

and this is the list of the new values in the schema for degree_type:

        other
        diploma
        bachelor
        laurea
        master
        phd
        habilitation

and this is the mapping that inspire-next currently applies:

DEGREE_TYPES_MAP = {
    'Bachelor': 'Bachelor',
    'UG': 'Bachelor',
    'MAS': 'Master',
    'master': 'Master',
    'Master': 'Master',
    'PhD': 'PhD',
    'PHD': 'PhD',
}

this is the mapping that I would like to implement:

  PhD -> phd
  PHD -> phd
  Phd -> phd
  Master -> master
  MAS -> master
  MASTER -> master
  master -> master
  Bachelor -> bachelor
  UG -> bachelor
  Habilitation -> habilitation
  Thesis -> laurea
  Diploma -> diploma
  MAs -> master
  Mas -> master
  Laurea -> laurea

am I right? @kaplun @michamos

michamos · 2017-01-30T16:02:18Z

almost, thesis -> other, not everybody is Italian :)

kaplun · 2017-01-30T16:03:59Z

You can actually simplify your mapping by performing the lowercase transformation first (Invenio is case-insensitive).

  phd -> phd
  bachelor -> bachelor
  ug -> bachelor
  habilitation -> habilitation
  thesis -> thesis
  diploma -> diploma
  mas -> master
  laurea -> thesis
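To illustrate why lowercasing first shrinks the mapping, here is a minimal sketch that merges the legacy counts listed earlier in this thread by case. The names `LEGACY_COUNTS` and `collapse_case` are made up for this example and are not part of inspire-next.

```python
from collections import Counter

# Legacy values and occurrence counts, copied from the list earlier in this thread.
LEGACY_COUNTS = {
    'PhD': 2806, 'PHD': 2319, 'Master': 103, 'MAS': 90, 'MASTER': 28,
    'master': 27, 'Bachelor': 18, 'Phd': 12, 'UG': 11, 'Habilitation': 8,
    'Thesis': 5, 'Diploma': 2, 'MAs': 1, 'Mas': 1, 'Laurea': 1,
}


def collapse_case(counts):
    """Merge the counts of values that differ only by letter case."""
    collapsed = Counter()
    for value, count in counts.items():
        collapsed[value.lower()] += count
    return collapsed


collapsed = collapse_case(LEGACY_COUNTS)
# Print the merged values, most frequent first.
for value, count in collapsed.most_common():
    print(count, value)
```

The fifteen legacy spellings collapse to nine lowercase keys, so the mapping only needs one entry per key.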

rikirenz · 2017-01-30T16:13:08Z

Thanks! Do you think it is necessary to keep this field: https://github.com/inspirehep/inspire-next/pull/1894/files#diff-338f51def51c1adf539431d8baf989f6L466
IMO no, because it is not defined in the schema. But please confirm it.

kaplun · 2017-01-30T16:19:21Z

IMHO nope, because the above mapping exhausts the above-mentioned list of occurrences on legacy.

rikirenz · 2017-01-31T17:19:19Z

tests/unit/dojson/test_dojson_hepnames.py

@@ -653,6 +661,8 @@ def test_advisors_from_701__a_g_i():
    ]
    result = hepnames.do(create_record(snippet))

+    assert jsonschema_validate(result['advisors'], subschema,
+                               resolver=LocalRefResolver('', {})) is None


This is not the correct way to validate the schema. We have already discussed this (https://github.com/inspirehep/inspire-next/pull/1896/files/10db08b2d25be331fe27b2ef7d18e63651cb1207#diff-8d2cf8d412e39817c6d484261eed7b89).
During the merge phase it has to be fixed in the "right way".

rikirenz · 2017-02-01T08:33:54Z

I was trying to understand how we convert 701 from schema to marc and I found this: https://github.com/inspirehep/inspire-next/blob/master/inspirehep/dojson/hepnames/fields/bd1xx.py#L483
IMHO _degree_type is not part of the schema; it should be removed and replaced with degree_type. Could you explain why it is there? @michamos @kaplun

kaplun · 2017-02-01T09:19:54Z

Indeed you should look at degree_type. Originally _degree_type was intended to contain those values that didn't belong to the enum. But there shouldn't be any anymore.

rikirenz · 2017-02-01T09:45:24Z

tests/unit/dojson/test_dojson_hepnames.py

+
+    result = hepnames2marc.do(snippet)
+
+    assert expected == result


I changed the test in this way for 2 reasons:

1. It is more readable
2. It tests all the 701 fields

Let me know if there was a specific reason to do the test with the previous approach. If there was not, I think this way is better.

Ok, but you're based on an old branch: that test was already deleted in master.


annetteholtkamp · 2017-02-02T10:52:25Z

Why do you want to get rid of laurea? It's more specific than thesis I think.

- Annette

kaplun · 2017-02-02T12:18:56Z

It just says that's Italian. But it doesn't say if it's master or bachelor (as much as thesis doesn't say it).

michamos · 2017-02-02T13:47:35Z

still, it's some data that we have, and that we decided to keep in the enum, so it should be mapped as laurea -> laurea.

kaplun · 2017-02-02T13:51:11Z

Right. Didn't remember we actually have it in the enum.

kaplun · 2017-02-07T16:19:41Z

As pointed out in inspirehep/inspire-schemas#80
thesis -> other (since it's not a degree type for real and doesn't bring real information).
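Putting the points agreed in this thread together (lowercase first, laurea -> laurea, thesis -> other), the classification could be sketched as below. This is only an illustration, not the code in inspire-next: the function name `classify_degree_type` is hypothetical, and stripping whitespace and falling back to `other` for unknown values are assumptions that nobody in the thread specified.

```python
# Keys are lowercase legacy values; values are entries of the degree_type enum.
DEGREE_TYPES_MAP = {
    'phd': 'phd',
    'master': 'master',
    'mas': 'master',
    'bachelor': 'bachelor',
    'ug': 'bachelor',
    'habilitation': 'habilitation',
    'diploma': 'diploma',
    'laurea': 'laurea',
    'thesis': 'other',
}


def classify_degree_type(legacy_value):
    """Return the schema degree_type for a legacy value.

    Assumption: surrounding whitespace is ignored and any value missing
    from the map is classified as 'other'.
    """
    return DEGREE_TYPES_MAP.get(legacy_value.strip().lower(), 'other')


for legacy in ('PHD', 'MAs', 'UG', 'Thesis', 'Laurea'):
    print(legacy, '->', classify_degree_type(legacy))
```

With this shape every spelling found on legacy resolves to a value of the enum, which is why `_degree_type` should no longer be needed.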

kaplun · 2017-02-07T16:20:20Z

BTW, I am confused: @spirosdelviniotis have you taken over this branch?

spirosdelviniotis · 2017-02-07T16:22:26Z

@kaplun I am working on top of the master branch.
Should I work on top of this one?

kaplun · 2017-02-07T16:25:58Z

@spirosdelviniotis I guess not. Was just asking to fully understand.

kaplun and others added 13 commits January 30, 2017 09:15

setup: enforce using inspire-schemas~=11.0

08173b2

Signed-off-by: Samuele Kaplun <samuele.kaplun@cern.ch>

OPS disabling Travis on this branch

8cd38ab

WIP

fe7067a

Signed-off-by: Samuele Kaplun <samuele.kaplun@cern.ch>

dojson: update 300 field

7335e60

Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>

dojson: update 999C5 field - parent_isbn

ed3f324

Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>

dojson: update 999C5 field - parent_report_number

954df2a

Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>

general: update thesis_info field

350493e

Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>

general: update persistent_identifier type

ba8c600

Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>

WIP

b57635d

Signed-off-by: Samuele Kaplun <samuele.kaplun@cern.ch>

dojson: add 100__v field

909dd04

Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>

dojson: add 700__v field

0782c61

Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>

dojson: update license field

d486e41

Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>

dojson: add _private_note field

28bd64f

Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>

david-caro added the WIP label Jan 30, 2017

rikirenz self-assigned this Jan 30, 2017

kaplun force-pushed the inspire-schema-10-sprint branch from 28bd64f to 00300ee January 31, 2017 10:55

kaplun mentioned this pull request Jan 31, 2017

global: bump invenio-schemas to ~26.0 #1884

Closed

25 tasks

rikirenz force-pushed the schemas-degree-type branch from 87d2d7a to b4465dd January 31, 2017 17:11


rikirenz force-pushed the schemas-degree-type branch from b4465dd to 082533f February 1, 2017 09:40


rikirenz added 2 commits February 1, 2017 10:50

dojson: change rule 701 with new degree types

4a97bf0

Signed-off-by: Riccardo Candido <riccardo.candido@gmail.com>

tests: improve tests for rule 701

692def1

* Changes tests in according to the new degree_types
* Adds schema validation

Signed-off-by: Riccardo Candido <riccardo.candido@gmail.com>

rikirenz force-pushed the schemas-degree-type branch from 082533f to 692def1 February 1, 2017 09:51

jacquerie mentioned this pull request Feb 15, 2017

general: bump inspire-schemas to version ~26.0 #1918

Merged

jacquerie closed this in #1918 Mar 1, 2017

jacquerie removed the WIP label Mar 1, 2017
