Handling of duplicate entries via uniq #70

Open
mfenner opened this Issue Jul 29, 2013 · 13 comments


@mfenner
Contributor
mfenner commented Jul 29, 2013

It would be nice if we could use the Ruby Array methods uniq (and uniq!) with bibliographies. Like the corresponding Array method, uniq should compare on a default set of attributes (:title and :year) and should accept other attributes for comparison via a block.

I have implemented hash, == and eql? for the Entry class (see example at https://github.com/mfenner/orcid-feed/blob/master/lib/work.rb), and this makes it very easy to remove duplicates. I'm struggling with implementing uniq in the Bibliography class and am therefore using an Array of BibTeX::Entry as an intermediate step.
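A minimal sketch of that value-equality approach, using an illustrative Work class (not the actual BibTeX::Entry API), keyed on title and year:

```ruby
# Sketch: value-based equality for a bibliographic entry.
# With ==, eql? and hash defined consistently, Array#uniq works out of the box.
class Work
  attr_reader :title, :year

  def initialize(title, year)
    @title = title
    @year = year
  end

  # Two works are equal when their title/year pairs match.
  def ==(other)
    other.is_a?(Work) && [title, year] == [other.title, other.year]
  end
  alias eql? ==

  # hash must agree with eql? so uniq (and Hash keys) behave correctly.
  def hash
    [title, year].hash
  end
end

works = [Work.new("A", 2013), Work.new("A", 2013), Work.new("B", 2012)]
works.uniq.size  # => 2
```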

@inukshuk
Owner

I have considered porting all list/query functions to return a Bibliography so that we can use them like regular Ruby Enumerables. Currently, most functions (like queries) return Arrays; the main difference between an Array and the Bibliography class is that the Bibliography keeps track of the entries hash for quick access to individual entries and for things like resolving cross-references.

What this mostly boils down to is how to handle object references. In a Bibliography object we have a list of Entry instances which, in turn, contain Value objects that mask the different value types (typically strings or names). When methods like #uniq (or #query) return a new Bibliography, we need to decide how to deal with all these instances. Up to now I have always assumed that we ought to make copies of everything, because otherwise modifying an object in one Bibliography would change the same object in another Bibliography. On the other hand, perhaps that is exactly what one would expect, seeing as regular Ruby Arrays, for example, do not make deep copies. For instance, something I use a lot is the #names function, which returns all Name values. I can then modify values in these Name objects and the changes will be visible in my Bibliography, because the entries there still hold the references.

Have you given this some thought as well? Would you expect for #uniq to return a new Bibliography with carbon copies or with references to the same Entry objects?

@mfenner
Contributor
mfenner commented Jul 30, 2013

I haven't thought too much about whether or not copies make more sense. For #uniq I would probably expect a new Bibliography with carbon copies. I currently don't use the #names function, but I can see your point about changing the object directly.

Overall I like a design for the Bibliography object that lets me use many of the standard Ruby Array functions.

@inukshuk
Owner

I think we could add a #uniq! method right away; ideally it would use the same interface as select_duplicates_by, as you explained earlier.

If you like we can also add #uniq for now, but the #uniq! method will definitely be more efficient.

@mfenner
Contributor
mfenner commented Jul 30, 2013

This would be great. I slightly prefer #uniq over #uniq!, because with the latter I have to check that the method didn't return nil because there were no duplicates.

@inukshuk
Owner

Ah yes, that's a long-standing issue. : ) I need to check, but there are bang methods in the standard library that do not return nil, and I tend not to do it either; I've been told off for it on ruby-lang, though, so… Personally, I find conditions that rely on bang methods extremely dangerous.
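To illustrate the concern: Ruby's core bang methods such as Array#uniq! return nil when nothing changed, which is exactly why they are risky inside conditions or method chains.

```ruby
# Array#uniq! returns the array when duplicates were removed,
# but nil when the array was already unique.
a = [1, 2, 2]
b = [1, 2, 3]

a.uniq!  # => [1, 2]  (duplicates removed, array returned)
b.uniq!  # => nil     (nothing to remove, so nil, not the array)

# Safer when you need the return value: use the non-bang form.
c = b.uniq  # => [1, 2, 3]
```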

@inukshuk
Owner

I'll try to implement this in the next couple of days; would be great if you could review before we release anything though.

@mfenner
Contributor
mfenner commented Jul 31, 2013

I'm of course happy to review.

@inukshuk inukshuk added a commit that referenced this issue Aug 5, 2013
@inukshuk add a #uniq! implementation
also adds experimental #uniq method

see #70
5a3c356
@inukshuk
Owner
inukshuk commented Aug 5, 2013

I've added a quick implementation of #uniq! and a first experimental #uniq method. The signature is quite complex (it is based on select_duplicates_by) and I haven't tested many of the more complex cases. The idea is to give you a lot of flexibility, because dropping duplicates is a sensitive task, especially when working with large datasets where you could lose important data by accident.

So I'll explain how I want it to work (as I said, it needs more tests to make sure it actually does this). By default, uniq! will get rid of duplicates with the same year-title combination. You can pass in any number of field names and the method will compute a digest based on those fields. Sometimes, however, you will need even more control: e.g., let's say you want to discard duplicates by year, title, and all authors' last names and initials. For such cases you can also pass in a block; that block will be passed the computed digest for each entry along with the full entry, and it should return the final digest. That means you can either add to the digest or disregard it entirely and compute your own digest for each entry.

So the example I posted above would be (untested pseudo-code):

bib.uniq!(:year, :title) do |digest, entry|
  digest << entry.author.map { |a| [a.family, a.initials].join }.join
end
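The digest mechanism itself can be sketched independently of bibtex-ruby. In this illustrative version (plain hashes standing in for entries; dedupe is not the library's method name), a digest is built from the selected fields, optionally refined by a block, and the first entry per digest wins:

```ruby
# Sketch of digest-based deduplication: one digest per entry, computed
# from the given fields, optionally refined by a block; keep the first
# entry seen for each digest.
def dedupe(entries, *fields)
  seen = {}
  entries.each do |entry|
    digest = fields.map { |f| entry[f].to_s }.join("|")
    digest = yield(digest, entry) if block_given?
    seen[digest] ||= entry
  end
  seen.values
end

entries = [
  { year: 2013, title: "Foo", author: "Smith" },
  { year: 2013, title: "Foo", author: "Smith" },
  { year: 2013, title: "Foo", author: "Jones" }
]

dedupe(entries, :year, :title).size                            # => 1
dedupe(entries, :year, :title) { |d, e| d + e[:author] }.size  # => 2
```

The second call shows the point of the block: with the author folded into the digest, the Jones entry is no longer considered a duplicate of the Smith entries.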
@inukshuk
Owner
inukshuk commented Aug 5, 2013

About the experimental #uniq: this should currently return a new Bibliography with deep copies of all items. If we wanted to make shallow copies instead, we'd have to rewrite a number of methods, because currently each element can only be associated with a single Bibliography (this makes things like resolving strings or cross-references easier).

@mfenner
Contributor
mfenner commented Aug 7, 2013

I have problems installing the latest version from GitHub:

gem 'bibtex-ruby', :git => 'git://github.com/inukshuk/bibtex-ruby.git'

bibtex-ruby at /home/vagrant/.bundler/ruby/1.9.1/bibtex-ruby-c1addae0af57 did not have a valid gemspec.
This prevents bundler from installing bins or native extensions, but that may not affect its functionality.
The validation message from Rubygems was:
  ["Gemfile.lock", "lib/bibtex/name_parser.rb", "lib/bibtex/parser.rb"] are not files
@inukshuk
Owner
inukshuk commented Aug 7, 2013

Ah, this is because we currently do not have the generated files in the repository. As suggested elsewhere, committing them is common practice, however, so I'll add the generated parsers in a moment.

@inukshuk
Owner
inukshuk commented Aug 7, 2013

Can you try it one more time?

@mfenner
Contributor
mfenner commented Aug 7, 2013

Works now, will do some testing and report back.
