first impressions and questions #13
Comments
To illustrate my results above a bit more, I will paste example 2 from my previous comment (I provided an empty family in reference.csv). gndiff_source.csv (A, B, C added here for reference; they were not part of the file):
gndiff_reference.csv :
It is a minimal example, but I want to illustrate that my reference list wouldn't be a simple list of very different accepted names.
TEST 1: JSON
TEST 2: CSV
COMMENTS:
I am mostly concerned about the possibility of getting the "best choice" in 1 & 2. But there are no numeric differences which could help in that task (especially when editDistance is not available). Biased by my limited experience with other gnames applications, I expected gndiff JSON output to be much more verbose than it is (as an option, at least). As names are being parsed for the matching process, I would be interested in making use of the parsed data in my output (for both source.csv and reference.csv names).
Of course, I mention all this because I assume all that info is actually being generated during the gndiff matching process. So it wouldn't be difficult to (optionally) output everything, I hope. So, to summarize FEATURE REQUESTS:
Sorry for the long explanations. Thanks a lot for all your help, and sorry for not having tested gndiff much earlier.
@abubelinha, I finally got back to gndiff and am going through your comment. I will try to modify the README according to your notes and answers to other
It is definitely a possibility. I suspect a good approach would be to keep
Source can have any number of fields, same as reference. Field 'Family' is not implemented yet. If it is given in both files, it will be
This is a bug, I will make an issue
No effect on speed at all. Families just show in the output for manual
yes, it does make sense. Currently score is used only for sorting results,
Sounds like a bug
Exact match is made by canonical forms, and authorship is used to pick
It is a bug, I have to find out why it happens.
Makes sense, I'll make a ticket.
Interesting idea, yes, can be done too. @abubelinha, thank you a lot for your feedback, a lot of good ideas and
You're welcome.
First of all, thanks a lot for creating gndiff. It's gonna be so useful for me.
I have just tried with a small file, to get the feeling of how it works.
I already found some issues to comment:
Unclear input file formats description, where it says "Prepare two files with names. There are 3 possible file formats:" ... but actually, only two formats are mentioned: (1) simple list, one name per line; (2) csv file, with some other fields (see below).
Also, it is unclear to me if the CSV format applies only to the reference.csv file or also to source.csv:
I think that is not the idea, because output already provides an autonumeric index. So I understand source.csv would usually contain just one field, with names and nothing else (just one column). With one possible exception:
`Family`: I suppose in that case it should be present in both files. Correct? But I couldn't make it work properly in my attempts (see below).
I might have misunderstood the input CSV format description above. But if `Family` and `TaxonID` are optional fields, then the JSON output sometimes contains errors:
1. If I don't provide a `Family` column in reference.csv, then the JSON output `referenceRecords[n].family` contains the same value as `name` (the `ScientificName` field provided in my reference.csv file).
2. If I provide a `Family` column in reference.csv (even with empty values), then the JSON output seems correct (`referenceRecords[n].family` contains the family values I provided).
3. But if I also provide a `Family` column in source.csv, then the JSON output includes a new `sourceRecord.id` which contains the same value as `sourceRecord.name`.
4. If source.csv contains other columns (e.g., ScientificName + LifeForm), then the JSON output produces `sourceRecord.family` = `sourceRecord.id` = `sourceRecord.name` (all containing the ScientificName provided in source.csv).
So I am a bit confused. I think it would be worth providing a couple of sample input files, and explicitly saying whether they can/should contain other columns or not.
Regarding `family`: A real example case of how "tricky homonyms where family helps to resolve taxa from each other" would be useful too (I think family is not going to solve anything in my case, but just to be sure). I wonder how this "use family" option affects speed: does it make matching faster or slower for large datasets?
CSV/TSV outputs are missing column headers? This could seem irrelevant, but it makes it a bit difficult to check if the output content is correct. Also, I cannot proceed with further tasks, like merging this output with other tabular data by means of column joins (I can try to figure out headers and add them myself ... but it would be safer if gndiff did it to avoid mistakes).
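As a stopgap, this is roughly how I would add the headers myself in Python. It is only a sketch: the column names in `ASSUMED_HEADERS` and the sample rows are invented for illustration, and the real column order of the gndiff output would have to be checked by hand, which is exactly the error-prone step I would like to avoid.

```python
import csv
import io

# Hypothetical column names: gndiff's actual column order must be verified.
ASSUMED_HEADERS = ["SourceName", "MatchType", "ReferenceName"]


def add_headers(raw_text, headers, delimiter=","):
    """Prepend a header row to header-less CSV/TSV text.

    Raises ValueError if any row has a different number of columns
    than the header, so a wrong guess is caught early.
    """
    rows = list(csv.reader(io.StringIO(raw_text), delimiter=delimiter))
    for row in rows:
        if len(row) != len(headers):
            raise ValueError(
                f"expected {len(headers)} columns, got {len(row)}: {row}"
            )
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return out.getvalue()


# Made-up rows standing in for a header-less output file.
raw = 'Aus bus L.,Exact,Aus bus L.\n"Cus dus, Mill.",Fuzzy,Cus dus Mill.\n'
with_headers = add_headers(raw, ASSUMED_HEADERS)
print(with_headers)
```

For TSV output the same function could be called with `delimiter="\t"`.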
EDIT: I have just realized that some of the above suggestions were already addressed by @Adafede in a previous closed issue (#12).
Sorry about that. My comments are pretty verbose, so @dimus might still find some helpful feedback in some of them.
This is a new one:
`subsp.` rank vs. a `var.` rank, identical in everything else). But my source.csv only contains one (i.e. the `subsp.`). How can I make the decision to select the most similar in these cases?
I will better post an example in a new comment to illustrate this.
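Just to show what I mean by "most similar", this is a rough post-processing sketch in Python. It does not use anything from gndiff: it ranks the candidate reference names by plain string similarity from the standard library's `difflib`, and the names are placeholders I made up.

```python
from difflib import SequenceMatcher


def rank_candidates(source_name, candidates):
    """Sort candidates by string similarity to source_name, best first.

    The similarity is difflib's ratio (0.0 to 1.0) on lower-cased names.
    This is a generic text measure, not the score that gndiff computes.
    """
    scored = [
        (SequenceMatcher(None, source_name.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Placeholder names: two reference candidates differing only in rank.
source = "Aus bus subsp. cus (L.) Mill."
refs = [
    "Aus bus var. cus (L.) Mill.",
    "Aus bus subsp. cus (L.) Mill.",
]

ranked = rank_candidates(source, refs)
best = ranked[0][1]
for ratio, name in ranked:
    print(f"{ratio:.3f}  {name}")
```

A numeric value like this in the gndiff output itself would let me pick the `subsp.` candidate without extra scripting.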
Thanks a lot in advance !!