Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add classification for degree_type field #1894

Conversation

rikirenz
Copy link
Contributor

@rikirenz rikirenz commented Jan 30, 2017

Adds a classification for degree_type field in according to the schema: https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/elements/degree_type.json

kaplun and others added 13 commits January 30, 2017 09:15
Signed-off-by: Samuele Kaplun <samuele.kaplun@cern.ch>
Signed-off-by: Samuele Kaplun <samuele.kaplun@cern.ch>
Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>
Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>
Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>
Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>
Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>
Signed-off-by: Samuele Kaplun <samuele.kaplun@cern.ch>
Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>
Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>
Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>
Signed-off-by: Spiros Delviniotis <spyridon.delviniotis@cern.ch>
@david-caro david-caro added the WIP label Jan 30, 2017
@rikirenz rikirenz self-assigned this Jan 30, 2017
@rikirenz
Copy link
Contributor Author

rikirenz commented Jan 30, 2017

I would like to know how we should match the data from the legacy system with the new data defined in the degree_type.json (https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/elements/degree_type.json).

This is the list with values and number of occurrences on the legacy system:

 2806 PhD
   2319 PHD
    103 Master
     90 MAS
     28 MASTER
     27 master
     18 Bachelor
     12 Phd
     11 UG
      8 Habilitation
      5 Thesis
      2 Diploma
      1 MAs
      1 Mas
      1 Laurea

and this is the list of the new values in the schema for degree_type:

        other
        diploma
        bachelor
        laurea
        master
        phd
        habilitation

and this is the actual mapping that we are doing in inspire-next:

DEGREE_TYPES_MAP = {                                                         
         'Bachelor': 'Bachelor',                                                  
         'UG': 'Bachelor',                                                        
         'MAS': 'Master',                                                         
         'master': 'Master',                                                      
         'Master': 'Master',                                                      
         'PhD': 'PhD',                                                            
         'PHD': 'PhD',                                                            
}

this is the mapping that I would like to implement:

  PhD -> phd
  PHD -> phd
  Phd -> phd
  Master -> master
  MAS -> master
  MASTER -> master
  master -> master
  Bachelor -> bachelor
  UG -> bachelor
  Habilitation -> habilitation
  Thesis -> laurea
  Diploma -> diploma
  MAs -> master
  Mas -> master
  Laurea -> laurea

am I right? @kaplun @michamos

@michamos
Copy link
Contributor

almost, thesis -> other, not everybody is Italian :)

@kaplun
Copy link
Contributor

kaplun commented Jan 30, 2017

You can actually simplify your mapping by simply perform the lower case transformation first. (Invenio is case-insensitive).

  phd -> phd
  bachelor -> bachelor
  ug -> bachelor
  habilitation -> habilitation
  thesis -> thesis
  diploma -> diploma
  mas -> master
  laurea -> thesis

@rikirenz
Copy link
Contributor Author

Thanks! Do you think is necessary keep this field: https://github.com/inspirehep/inspire-next/pull/1894/files#diff-338f51def51c1adf539431d8baf989f6L466
IMO no, because it is not defined in the schema. But please confirm it.

@kaplun
Copy link
Contributor

kaplun commented Jan 30, 2017

IMHO nope because with the above mapping you exhaust the above mentioned list of occurrences on legacy.

@@ -653,6 +661,8 @@ def test_advisors_from_701__a_g_i():
]
result = hepnames.do(create_record(snippet))

assert jsonschema_validate(result['advisors'], subschema,
resolver=LocalRefResolver('', {})) is None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the correct way to validate the schema. We have already discussed about this (https://github.com/inspirehep/inspire-next/pull/1896/files/10db08b2d25be331fe27b2ef7d18e63651cb1207#diff-8d2cf8d412e39817c6d484261eed7b89).
During the merge phase it has to be fixed in the "right way".

@rikirenz
Copy link
Contributor Author

rikirenz commented Feb 1, 2017

I was trying to understand how we convert 701 from schema to marc and I found this: https://github.com/inspirehep/inspire-next/blob/master/inspirehep/dojson/hepnames/fields/bd1xx.py#L483
IMHO _degree_type is not part of the schema, it should be removed and replaced with degree_type. Could you give me some explanations about it? @michamos @kaplun

@kaplun
Copy link
Contributor

kaplun commented Feb 1, 2017

Indeed you should look at degree_type. Originally _degree_type was intended to contain those values that didn't belong to the enum. But there shouldn't be any anymore.


result = hepnames2marc.do(snippet)

assert expected == result
Copy link
Contributor Author

@rikirenz rikirenz Feb 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the test in this way for 2 reasons:

  1. It is more readable
  2. In order to test all the 701 fields

Let me know if there was a specific reason to do the test with the previous approach. If there was not a specific reason I think this way is better

Copy link
Contributor

@jacquerie jacquerie Feb 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but you're based on an old branch: that test was already deleted in master.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💩

Signed-off-by: Riccardo Candido <riccardo.candido@gmail.com>
* Changes tests in according to the
  new degree_types
* Adds schema validation

Signed-off-by: Riccardo Candido <riccardo.candido@gmail.com>
@annetteholtkamp
Copy link

annetteholtkamp commented Feb 2, 2017 via email

@kaplun
Copy link
Contributor

kaplun commented Feb 2, 2017

It just says that's Italian. But it doesn't say if it's master or bachelor (as much as thesis doesn't say it).

@michamos
Copy link
Contributor

michamos commented Feb 2, 2017

still, it's some data that we have, and that we decided to keep in the enum, so it should be mapped as laurea -> laurea.

@kaplun
Copy link
Contributor

kaplun commented Feb 2, 2017

Right. Didn't remember we actually have it in the enum.

@kaplun
Copy link
Contributor

kaplun commented Feb 7, 2017

As pointed out in inspirehep/inspire-schemas#80
thesis -> other (since it's not a degree type for real and doesn't bring real information).

@kaplun
Copy link
Contributor

kaplun commented Feb 7, 2017

BTW, I am confused: @spirosdelviniotis have you took over this branch?

@spirosdelviniotis
Copy link
Contributor

@kaplun I am working on top of master branch.
Should I work on top of this one?

@kaplun
Copy link
Contributor

kaplun commented Feb 7, 2017

@spirosdelviniotis I guess not. Was just asking to fully understand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants