Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

v0.3.5

v0.3.5, 2012-08-21 -- The Last Ride of v0.3.x[?]
 * EMR:
   * --pool-wait-minutes option lets you wait up to X minutes before creating a
     job flow (#455)
   * Job flow ID included in error messages on failure (#452)
   * JOB and JOB_FLOW cleanup options (#485, #455)
 * EMR and Hadoop:
   * Compatibility fixes related to deprecated options and Hadoop's bizarre
     non-sequential version numbers (#489, #534)
 * Other:
   * Warn when *_PROTOCOL is not a class (#490)
 * Bug fixes:
   * Unicode strings can be used when specifying interpreters (#431)
   * --enable-emr-logging no longer causes the wrong counters/logs to be parsed
     (#446)
   * TMP_DIR inserted into 'sort' environment variables (#477)
   * Setting hadoop_home in mrjob.conf works again
   * Gzipped input files work when specified with relative paths (#494)
   * Passthrough options are not re-ordered when sent to Hadoop Streaming
     (#509)
   * json module is supported again if simplejson doesn't exist (#544)
   * HadoopJobRunner.path_exists() is no longer backwards (#549)

v0.3.4.1

v0.3.4.1, 2012-06-12 -- The test suite doesn't catch everything...
 * Local mode doesn't try to send multiple mappers to the same output file
   when using multiple compressed files as input

v0.3.4

v0.3.4, 2012-06-11 -- We are friendly people.
 * Experimental support for IronPython in the local and inline runners
 * set_status() and increment_counter() will encode messages/names of type
   'unicode' as UTF-8 when writing to Hadoop Streaming
 * EMR and Hadoop counter parsing is more correct
 * mrjob.tools.emr.fetch_logs fetches logs from S3 when asked instead of
   incorrectly refusing to do so
 * jobconf values can be booleans in mrjob.conf as well as 'true' and 'false'
   strings
 * hadoop_version can be a float in mrjob.conf, but a warning is printed to the
   console
 * Command line help is split across several --help-* commands
 * Local runner sorts output consistently

v0.3.3.2

v0.3.3.2, 2012-04-10 -- It's a race [condition]!
 * Option parsing no longer dies when -- is used as an argument (#435)
 * Fixed race condition where two jobs can join same job flow thinking it is
   idle, delaying one of the jobs (#438)
 * Better error message when a config file contains no data for the current
   runner (#433)

v0.3.3.1

v0.3.3.1, 2012-04-02 -- Hothothothothothothotfix
 * Fixed S3 locking mechanism parsing of last modified time to work around a
    bug in boto

v0.3.3

v0.3.3, 2012-03-29 -- Bug...bug...bug...bug...bug...FEATURE!
 * EMR:
   * Error detection code follows symlinks in Hadoop logs (#396)
   * terminate_idle_job_flows locks job flows before terminating them (#391)
   * terminate_idle_job_flows -qq silences all output (#380)
 * Other fixes:
   * mr_tower_of_powers test no longer requires Testify (#395)
   * Various runner du() implementations no longer broken (#393, #394)
   * Hadoop counter parser regex handles long lines better (#388)
   * Hadoop counter parser regex is more correct (#305)
   * Better error when trying to parse YAML without PyYAML (#348)

v0.3.2

AMI versions, spot instances, and more
 * Docs:
   * 'Testing with mrjob' section in docs (includes #321)
   * MRJobRunner.counters() included in docs (#321)
   * terminate_idle_job_flows is spelled correctly in docs (#339)
 * Running jobs:
   * local mode:
     * Allow non-string jobconf values again (this changed in v0.3.0)
     * Don't split *.gz files (#333)
   * emr mode:
     * Spot instance support via ec2_*_instance_bid_price and renamed instance
       type/number options (#219)
     * ami_version option to allow switching between EMR AMIs (#306)
     * 'Error while reading from input file' displays correct file (#358)
     * python_bin used for bootstrap_python_packages instead of just 'python'
       (#355)
     * Pooling works with bootstrap_mrjob=False (#347)
     * Pooling makes sure a job flow has space for the new job before joining
       it (#324)
 * EMR tools:
   * create_job_flow no longer tries to use an option that does not exist
     (#349)
   * report_long_jobs tool alerts on jobs that have run for more than X hours
     (#345)
   * mrboss no longer spells stderr 'stsderr'
   * terminate_idle_job_flows counts jobs with pending (but not running)
     steps as idle (#365)
   * terminate_idle_job_flows can terminate job flows near the end of a
     billable hour (#319)
   * audit_usage breaks down job flows by pool (#239)
   * Various tools (e.g. audit_usage) get list of job flows correctly (#346)

v0.3.1

Nooooo there were bugs!
 * Instance-type command-line arguments always override mrjob.conf (Issue #311)
 * Fixed crash in mrjob.tools.emr.audit_usage (Issue #315)
 * Tests now use unittest; python setup.py test now works (Issue #292)

v0.3.0

Worth the wait
 * Configuration:
   * Saner mrjob.conf locations (Issue #97):
     * ~/.mrjob is deprecated in favor of ~/.mrjob.conf
     * searching in PYTHONPATH is deprecated
     * MRJOB_CONF environment variable for custom paths
 * Defining Jobs (MRJob):
   * Combiner support (Issue #74)
   * *_init() and *_final() methods for mappers, combiners, and reducers
     (Issue #124)
   * mapper/combiner/reducer methods no longer need to contain a yield
     statement if they emit no data
   * Protocols:
     * Protocols can be anything with read() and write() methods, and are
       instances by default (Issue #229)
     * Set protocols with the *_PROTOCOL attributes or by re-defining the
       *_protocol() methods
     * Built-in protocol classes cache the encoded and decoded value of the
       last key for faster decoding during reducing (Issue #230)
     * --*protocol switches and aliases are deprecated (Issue #106)
   * Set Hadoop formats with HADOOP_*_FORMAT attributes or the hadoop_*_format()
     methods (Issue #241)
     * --hadoop-*-format switches are deprecated
     * Hadoop formats can no longer be set from mrjob.conf
   * Set jobconf with JOBCONF attribute or the jobconf() method (in addition
     to --jobconf)
   * Set Hadoop partitioner class with --partitioner, PARTITIONER, or
     partitioner() (Issue #6)
   * Custom option parsing (Issue #172)
   * Use mrjob.compat.get_jobconf_value() to get jobconf values from environment
 * Running jobs:
   * All modes:
     * All runners are Hadoop-version aware and use the correct jobconf and
       combiner invocation styles (Issue #111)
     * All types of URIs can be passed through to Hadoop (Issue #53)
     * Speed up steps with no mapper by using cat (Issue #5)
     * Stream compressed files with cat() method (Issue #17)
     * hadoop_bin, python_bin, and ssh_bin can now all take switches (Issue #96)
     * job_name_prefix option is gone (was deprecated)
     * Better cleanup (Issue #10):
       * Separate cleanup_on_failure option
       * More granular cleanup options
     * Cleaner handling of passthrough options (Issue #32)
   * emr mode:
     * job flow pooling (Issue #26)
     * vastly improved log fetching via SSH (Issue #2)
       * New tool: mrjob.tools.emr.fetch_logs
     * default Hadoop version on EMR is 0.20 (was 0.18)
     * ec2_instance_type option now only sets instance type for slave nodes
       when there are multiple EC2 instances (Issue #66)
     * New tool: mrjob.tools.emr.mrboss for running commands on all nodes and
       saving output locally
   * inline mode:
     * Supports cmdenv (Issue #136)
     * Passthrough options can now affect steps list (Issue #301)
   * local mode:
     * Runs 2 mappers and 2 reducers in parallel by default (Issue #228)
     * Preliminary Hadoop simulation for some jobconf variables (Issue #86)
 * Misc:
   * boto 2.0+ is now required (Issue #92)
   * Removed debian packaging (should be handled separately)

v0.2.8

Bugfixes and betas
 * Fix log parsing crash dealing with timeout errors
 * Make mr_travelling_salesman.py work with simplejson
 * Add emr_additional_info option, to support EMR beta features
 * Remove debian packaging (should be handled separately)
 * Fix crash when creating tmp bucket for job in us-east-1

v0.2.7

Hooray for interns!
 * All runner options can be set from the command line (Issue #121)
   * Including for mrjob.tools.emr.create_job_flow (Issue #142)
 * New EMR options:
   * availability_zone (Issue #72)
   * bootstrap_actions (Issue #69)
   * enable_emr_debugging (Issue #133)
 * Read counters from EMR log files (Issue #134)
 * Clean old files out of S3 with mrjob.tools.emr.s3_tmpwatch (Issue #9)
 * EMR parses and reports job failure due to steps timing out (Issue #15)
 * EMR boostrap files are no longer made public on S3 (Issue #70)
 * mrjob.tools.emr.terminate_idle_job_flows handles custom hadoop streaming
   jars correctly (Issue #116)
 * LocalMRJobRunner separates out counters by step (Issue #28)
 * bootstrap_python_packages works regardless of tarball name (Issue #49)
 * mrjob always creates temp buckets in the correct AWS region (Issue #64)
 * Catch abuse of __main__ in jobs (Issue #78)
 * Added mr_travelling_salesman example

v0.2.6

Hadoop 0.20 in EMR, inline runner, and more
 * Set Hadoop to run on EMR with --hadoop-version (Issue #71).
   * Default is still 0.18, but will change to 0.20 in mrjob v0.3.0.
 * New inline runner, for testing locally with a debugger
 * New --strict-protocols option, to catch unencodable data (Issue #76)
 * Added steps_python_bin option (for use with virtualenv)
 * mrjob no longer chokes when asked to run on an EMR job flow running
   Hadoop 0.20 (Issue #110)
 * mrjob no longer chokes on job flows with no LogUri (Issue #112)

v0.2.5

Hadoop input and output formats
 * Added hadoop_input/output_format options
 * You can now specify a custom Hadoop streaming jar (hadoop_streaming_jar)
 * extra args to hadoop now come before -mapper/-reducer on EMR, so
   that e.g. -libjar will work (worked in hadoop mode since v0.2.2)
 * hadoop mode now supports s3n:// URIs (Issue #53)

v0.2.4

fix bootstrapping mrjob
 * Fix bootstrapping of mrjob in hadoop and local mode (Issue #89)
 * SSH tunnels try to use the same port for the same job flow (Issue #67)
 * Added mr_postfix_bounce and mr_pegasos_svm to examples.
 * Retry on spurious 505s from EMR API

v0.2.3

boto compatibility
 * Fix incompatibility with boto 2.0b4 (Issue #91)

v0.2.2

GET/POST EMR issue
 * Use POST requests for most EMR queries (EMR was choking on large GETs)
 * find_probable_cause_of_failure() ignores transient errors (Issue #31)
 * --hadoop-arg now actually works (Issue #79)
   * on Hadoop, extra args are added first, so you can set e.g. -libjar
 * S3 buckets may now have . in their names
 * MRJob scripts now respect --quiet (Issue #84)
 * added --no-output option for MRJob scripts (Issue #81)
 * added --python-bin option (Issue #54)

v0.2.1

laststatechangereason bugfix
 * Don't assume EMR sets laststatechangereason

v0.2.0

Many bugfixes, Windows support
 * New Features/Changes:
   * EMRJobRunner now prints % of mappers and reducers completed when you
     enable the SSH tunnel.
   * Added mr_page_rank example
   * Added mrjob.tools.emr.audit_usage script (Issue #21)
   * You can specify alternate job owners with the "owner" option. Useful for
     auditing usage. (Issue #59)
   * The job_name_prefix option has been renamed to label (the old name still
     works but is deprecated)
   * bootstrap_cmds and bootstrap_scripts no longer automatically invoke sudo
 * Bugs Fixed/Cleanup:
   * bootstrap files no longer get uploaded to S3 twice (Issue #8)
   * When using add_file_option(), show_steps() can now see the local version
     of the file (Issue #45)
   * Now works on Windows (Issue #46)
   * No longer requires external jar, tar, or zip binaries (Issue #47)
   * mrjob-* scratch bucket is only created as needed (Issue #50)
   * Can now specify us-east-1 region explicitly (Issue #58)
   * mrjob.tools.emr.terminate_idle_job_flows leaves Hive jobs alone (Issue #60)

v0.1.0

Same code, better version. It's official!

v0.1.0-pre3

Pre-release to run Yelp code against
 * Added debian packaging
 * mrjob bootstrapping can now deal with symlinks in site-packages/mrjob
 * MRJobRunner.stream_output() can now be called multiple times

v0.1.0-pre2

Second pre-release after testing
 * Fixed small bugs that broke Python 2.5.1 and Python 2.7
 * Fixed reading mrjob.conf without yaml installed
 * Fix tests to work with modern simplejson and pipes.quote()
 * Auto-create temp bucket on S3 if we don't have one (Issue #16)
 * Auto-infer AWS region from bucket (Issue #7)
 * --steps now passes in all extra args (e.g. --protocol) (Issue #4)
 * Better docs
Something went wrong with that request. Please try again.