v0.3.5, 2012-08-21 -- The Last Ride of v0.3.x[?]
 * EMR:
   * --pool-wait-minutes option lets you wait up to X minutes before creating a
     job flow (#455)
   * Job flow ID included in error messages on failure (#452)
   * JOB and JOB_FLOW cleanup options (#485, #455)
 * EMR and Hadoop:
   * Compatibility fixes related to deprecated options and Hadoop's bizarre
     non-sequential version numbers (#489, #534)
 * Other:
   * Warn when *_PROTOCOL is not a class (#490)
 * Bug fixes:
   * Unicode strings can be used when specifying interpreters (#431)
   * --enable-emr-logging no longer causes the wrong counters/logs to be parsed
   * TMP_DIR inserted into 'sort' environment variables (#477)
   * Setting hadoop_home in mrjob.conf works again
   * Gzipped input files work when specified with relative paths (#494)
   * Passthrough options are not re-ordered when sent to Hadoop Streaming
   * json module is supported again if simplejson doesn't exist (#544)
   * HadoopJobRunner.path_exists() is no longer backwards (#549)


v0.3.4.1, 2012-06-12 -- The test suite doesn't catch everything...
 * Local mode doesn't try to send multiple mappers to the same output file
   when using multiple compressed files as input


v0.3.4, 2012-06-11 -- We are friendly people.
 * Experimental support for IronPython in the local and inline runners
 * set_status() and increment_counter() will encode messages/names of type
   'unicode' as UTF-8 when writing to Hadoop Streaming
 * EMR and Hadoop counter parsing is more correct
 * fetches logs from S3 when asked instead of
   incorrectly refusing to do so
 * jobconf values can be booleans in mrjob.conf as well as 'true' and 'false'
 * hadoop_version can be a float in mrjob.conf, but a warning is printed to the
 * Command line help is split across several --help-* commands
 * Local runner sorts output consistently
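
The jobconf item above can be illustrated with a minimal sketch. It assumes
that boolean values end up as the lowercase strings 'true' and 'false' before
reaching Hadoop. The helper names jobconf_value_to_str() and
jobconf_to_args() are hypothetical and are not part of mrjob.

```python
def jobconf_value_to_str(value):
    """Render a jobconf value as a string (hypothetical helper)."""
    # Check bool explicitly: str(True) would give 'True', not 'true'.
    if isinstance(value, bool):
        return 'true' if value else 'false'
    return str(value)


def jobconf_to_args(jobconf):
    """Build sorted 'key=value' strings from a jobconf dict (hypothetical)."""
    return ['%s=%s' % (key, jobconf_value_to_str(value))
            for key, value in sorted(jobconf.items())]


if __name__ == '__main__':
    print(jobconf_to_args({'mapred.output.compress': True,
                           'mapred.reduce.tasks': 4}))
```

Strings such as 'true' and 'false' pass through unchanged, so both spellings
yield the same result.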


v0.3.3.2, 2012-04-10 -- It's a race [condition]!
 * Option parsing no longer dies when -- is used as an argument (#435)
 * Fixed race condition where two jobs can join same job flow thinking it is
   idle, delaying one of the jobs (#438)
 * Better error message when a config file contains no data for the current
   runner (#433)


v0.3.3.1, 2012-04-02 -- Hothothothothothothotfix
 * Fixed S3 locking mechanism parsing of last modified time to work around a
   bug in boto


v0.3.3, 2012-03-29 -- Bug...bug...bug...bug...bug...FEATURE!
 * EMR:
   * Error detection code follows symlinks in Hadoop logs (#396)
   * terminate_idle_job_flows locks job flows before terminating them (#391)
   * terminate_idle_job_flows -qq silences all output (#380)
 * Other fixes:
   * mr_tower_of_powers test no longer requires Testify (#395)
   * Various runner du() implementations no longer broken (#393, #394)
   * Hadoop counter parser regex handles long lines better (#388)
   * Hadoop counter parser regex is more correct (#305)
   * Better error when trying to parse YAML without PyYAML (#348)


AMI versions, spot instances, and more
 * Docs:
   * 'Testing with mrjob' section in docs (includes #321)
   * MRJobRunner.counters() included in docs (#321)
   * terminate_idle_job_flows is spelled correctly in docs (#339)
 * Running jobs:
   * local mode:
     * Allow non-string jobconf values again (this changed in v0.3.0)
     * Don't split *.gz files (#333)
   * emr mode:
     * Spot instance support via ec2_*_instance_bid_price and renamed instance
       type/number options (#219)
     * ami_version option to allow switching between EMR AMIs (#306)
     * 'Error while reading from input file' displays correct file (#358)
     * python_bin used for bootstrap_python_packages instead of just 'python'
     * Pooling works with bootstrap_mrjob=False (#347)
     * Pooling makes sure a job flow has space for the new job before joining
       it (#324)
 * EMR tools:
   * create_job_flow no longer tries to use an option that does not exist
   * report_long_jobs tool alerts on jobs that have run for more than X hours
   * mrboss no longer spells stderr 'stsderr'
   * terminate_idle_job_flows counts jobs with pending (but not running)
     steps as idle (#365)
   * terminate_idle_job_flows can terminate job flows near the end of a
     billable hour (#319)
   * audit_usage breaks down job flows by pool (#239)
   * Various tools (e.g. audit_usage) get list of job flows correctly (#346)


Nooooo there were bugs!
 * Instance-type command-line arguments always override mrjob.conf (Issue #311)
 * Fixed crash in (Issue #315)
 * Tests now use unittest; python test now works (Issue #292)


Worth the wait
 * Configuration:
   * Saner mrjob.conf locations (Issue #97):
     * ~/.mrjob is deprecated in favor of ~/.mrjob.conf
     * searching in PYTHONPATH is deprecated
     * MRJOB_CONF environment variable for custom paths
 * Defining Jobs (MRJob):
   * Combiner support (Issue #74)
   * *_init() and *_final() methods for mappers, combiners, and reducers
     (Issue #124)
   * mapper/combiner/reducer methods no longer need to contain a yield
     statement if they emit no data
   * Protocols:
     * Protocols can be anything with read() and write() methods, and are
       instances by default (Issue #229)
     * Set protocols with the *_PROTOCOL attributes or by re-defining the
       *_protocol() methods
     * Built-in protocol classes cache the encoded and decoded value of the
       last key for faster decoding during reducing (Issue #230)
     * --*protocol switches and aliases are deprecated (Issue #106)
   * Set Hadoop formats with HADOOP_*_FORMAT attributes or the hadoop_*_format()
     methods (Issue #241)
     * --hadoop-*-format switches are deprecated
     * Hadoop formats can no longer be set from mrjob.conf
   * Set jobconf with JOBCONF attribute or the jobconf() method (in addition
     to --jobconf)
   * Set Hadoop partitioner class with --partitioner, PARTITIONER, or
     partitioner() (Issue #6)
   * Custom option parsing (Issue #172)
   * Use mrjob.compat.get_jobconf_value() to get jobconf values from environment
 * Running jobs:
   * All modes:
     * All runners are Hadoop-version aware and use the correct jobconf and
       combiner invocation styles (Issue #111)
     * All types of URIs can be passed through to Hadoop (Issue #53)
     * Speed up steps with no mapper by using cat (Issue #5)
     * Stream compressed files with cat() method (Issue #17)
     * hadoop_bin, python_bin, and ssh_bin can now all take switches (Issue #96)
     * job_name_prefix option is gone (was deprecated)
     * Better cleanup (Issue #10):
       * Separate cleanup_on_failure option
       * More granular cleanup options
     * Cleaner handling of passthrough options (Issue #32)
   * emr mode:
     * job flow pooling (Issue #26)
     * vastly improved log fetching via SSH (Issue #2)
       * New tool:
     * default Hadoop version on EMR is 0.20 (was 0.18)
     * ec2_instance_type option now only sets instance type for slave nodes
       when there are multiple EC2 instances (Issue #66)
     * New tool: for running commands on all nodes and
       saving output locally
   * inline mode:
     * Supports cmdenv (Issue #136)
     * Passthrough options can now affect steps list (Issue #301)
   * local mode:
     * Runs 2 mappers and 2 reducers in parallel by default (Issue #228)
     * Preliminary Hadoop simulation for some jobconf variables (Issue #86)
 * Misc:
   * boto 2.0+ is now required (Issue #92)
   * Removed debian packaging (should be handled separately)
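
The protocol items above (Issues #229 and #230) can be sketched as follows.
This is a standalone illustration, not mrjob's own code. The class name
CachingJSONProtocol and the key_decodes counter are hypothetical, and the
tab-separated JSON line format is an assumption made for the example. It
shows an object with read() and write() methods that remembers the last
encoded key, so consecutive lines with the same key (as a reducer sees them)
skip the key decode.

```python
import json


class CachingJSONProtocol(object):
    """Hypothetical protocol: one 'json_key<TAB>json_value' per line."""

    def __init__(self):
        self._last_key_encoded = None
        self._last_key_decoded = None
        self.key_decodes = 0  # for illustration only

    def read(self, line):
        """Decode a line into a (key, value) pair."""
        raw_key, raw_value = line.split('\t', 1)
        # Only decode the key when it differs from the previous line's key.
        if raw_key != self._last_key_encoded:
            self._last_key_encoded = raw_key
            self._last_key_decoded = json.loads(raw_key)
            self.key_decodes += 1
        return self._last_key_decoded, json.loads(raw_value)

    def write(self, key, value):
        """Encode a (key, value) pair as a single line."""
        # json.dumps() escapes tab characters, so the separator is unambiguous.
        return '%s\t%s' % (json.dumps(key), json.dumps(value))


if __name__ == '__main__':
    protocol = CachingJSONProtocol()
    lines = [protocol.write('a', 1), protocol.write('a', 2),
             protocol.write('b', 3)]
    print([protocol.read(line) for line in lines])
    print(protocol.key_decodes)
```

One consequence of this design: the cached key object is returned again for
repeated keys, so callers should not mutate a decoded key in place.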


Bugfixes and betas
 * Fix log parsing crash dealing with timeout errors
 * Make work with simplejson
 * Add emr_additional_info option, to support EMR beta features
 * Remove debian packaging (should be handled separately)
 * Fix crash when creating tmp bucket for job in us-east-1