Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: ILSVRC 2010 (a.k.a. ImageNet) #68

Merged
merged 49 commits into from
Dec 3, 2015
Merged

Conversation

dwf
Copy link
Contributor

@dwf dwf commented Apr 8, 2015

Putting this up so that the interested can have a look. In particular looking for some feedback from @vdumoulin, as well as @bartvm for the right place we could split this up in order to speed it up with multiple processes.

@vdumoulin
Copy link
Contributor

Is ImageNet colloquial enough that we could use it as the name of the dataset without causing too much confusion?

@dwf
Copy link
Contributor Author

dwf commented Apr 9, 2015

Well, as test data for later year competitions become available, we'd like
to support evaluation on those too. There's also the 22k class "ImageNet
data release" version for which there is no standard train/valid/test set,
so...

My hope is that a lot of the same code will be applicable to other
"ImageNet-family" challenges, but we should probably name the class after
the official task.

On Thu, Apr 9, 2015 at 3:33 PM, vdumoulin notifications@github.com wrote:

Is ImageNet colloquial enough that we could use it as the name of the
dataset without causing too much confusion?


Reply to this email directly or view it on GitHub
#68 (comment).

@dwf
Copy link
Contributor Author

dwf commented Apr 16, 2015

@bartvm @vdumoulin I've got this converter into a near workable state now, and could use some eyes on it suggesting refactorings before I go and try to add unit tests.

I've adopted this kwargs-based approach to debug logging that I quite like, and think we could use a similar thing (or maybe a virtually identical thing) for passing information to a progress bar via logging calls.

@vdumoulin
Copy link
Contributor

You should eventually think of moving the code to fuel.converters.ilsvrc2010.

In order to support kwargs in the converter itself, we should eventually refactor converter functions to use subparsers like downloader functions do. I can review that if you choose to make it part of your PR.

@vdumoulin
Copy link
Contributor

Oh, nevermind, I just saw that this was already on your TODO list.

@dwf
Copy link
Contributor Author

dwf commented Apr 16, 2015

Yes, though I have a slight refactor I'll propose for the dispatched function, something like

from argparse import Namespace
from functools import wraps

def accept_namespace(func):
    @wraps(func)
    def wrapped(*args, **kwargs):
        if len(kwargs) == 0 and len(args) == 1 and isinstance(args[0], Namespace):
            return func(**vars(args[0]))
        else:
            return func(*args, **kwargs)
    return wrapped

That way you can write functions like this

@accepts_namespace
def foo(a, b, c):
    ...

and use it as a dispatch function for a subparser with matching argument names, while still having both a function that can be used programmatically without awkwardly instantiating a namespace and also can be documented more naturally.

@vdumoulin
Copy link
Contributor

I like that idea!

@dwf dwf force-pushed the imagenet branch 2 times, most recently from 4e7023d to 7c9765f Compare April 16, 2015 21:19
@dwf
Copy link
Contributor Author

dwf commented Apr 16, 2015

@vdumoulin I've rewritten the history to put this file in fuel/converters/ (git filter-branch is neat). I also realized I hadn't included an entire file.

xrange(1000)))

# Mapping to take ILSVRC2010 (integer) IDs to our internal 0-999 encoding.
# label_map = dict(zip(synsets['ILSVRC2010_ID'], xrange(1000)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I haven't pushed in a while but it is necessary for the valid and test
sets (yes, they distribute the ground truth for the valid and test sets
differently).
On Apr 20, 2015 1:39 PM, "vdumoulin" notifications@github.com wrote:

In fuel/converters/ilsvrc2010.py
#68 (comment):

  •    The number of images the workers should send to the sink at a
    
  •    time.
    
  • """
  • debug = make_debug_logging_function(log, 'MAIN')
  • Read what's necessary from the development kit.

  • devkit_path = os.path.join(directory, DEVKIT_ARCHIVE)
  • synsets, cost_matrix, raw_valid_groundtruth = read_devkit(devkit_path)
  • Mapping to take WordNet IDs to our internal 0-999 encoding.

  • wnid_map = dict(zip((s.decode('utf8') for s in synsets['WNID']),
  •                    xrange(1000)))
    
  • Mapping to take ILSVRC2010 (integer) IDs to our internal 0-999 encoding.

  • label_map = dict(zip(synsets['ILSVRC2010_ID'], xrange(1000)))

Is it still useful?


Reply to this email directly or view it on GitHub
https://github.com/bartvm/fuel/pull/68/files#r28710617.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it looks like you're building that later on (ilsvrc_id_to_zero_based, lines 123-124).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dwf
Copy link
Contributor Author

dwf commented Apr 20, 2015

@vdumoulin A very rough guide to ZeroMQ:

  • A PUSH socket can have many PULLers, that's what's done with the ventilator/workers. Likewise, a PULL can have many PUSHers. This is what happens between workers and the sink.
  • Atomicity is guaranteed, with a few caveats. That is to say, if you send a message on a zeromq socket, a client will get all of it or none of it. This also applies to sequences of messages where all-but-the-last message is sent with zmq.SNDMORE in the flags: this is used to tell zeromq that one or more further send calls are part of the same logical message, and should be received together. The only caveat I'm aware of is that if your process quits before all messages have been received, you may lose the last few messages; this is why I loop and sleep at the end of the ventilator function. (I'm not sure why this happens since zmq.LINGER should be infinite by default in my case...).
  • The high water mark stuff is just to stop buffers accumulating ad infinitum.

debug = make_debug_logging_function(log, 'VENTILATOR')
debug(status='START')
sender = context.socket(zmq.PUSH)
sender.hwm = high_water_mark
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use the sender.set_hwm setter instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hwm is a property which invokes set_hwm anyway, so no real difference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know!

@vdumoulin
Copy link
Contributor

@dwf I've gone through a superficial review of docstrings. I'm still reading the code to give it an in-depth review.

connected on `logging_port`.

"""
logger = logging.getLogger(__name__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need logger as argument when you immediately assign it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. 37e5f2b.

@dwf
Copy link
Contributor Author

dwf commented Apr 23, 2015

@vdumoulin I just pushed a pretty major update. Lots of new docstrings, and better factored.

The number of worker processes to deploy.
worker_batch_size : int, optional
The number of images the workers should send to the sink at a
time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add documentation for output_filename.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dwf dwf force-pushed the imagenet branch 2 times, most recently from c99dcaa to c05e7c2 Compare December 2, 2015 19:09
@vdumoulin
Copy link
Contributor

@dwf and I did an offline test converting the ImageNet dataset, and it works! Coverage is down 1.2%, but considering the importance of this PR, I'm ready to merge right away.

vdumoulin added a commit that referenced this pull request Dec 3, 2015
WIP: ILSVRC 2010 (a.k.a. ImageNet)
@vdumoulin vdumoulin merged commit 8d416ed into mila-iqia:master Dec 3, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants