# MapReduce
MapReduce is a programming model for data processing. MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal.

## The problem with parallel processing
1. File sizes vary, so the total run time is bounded by the time taken to process the largest file. A better approach is to split the data into equal-sized chunks.
2. The results from the different processes must be combined into a single answer.
3. Processing on multiple machines introduces problems such as coordination and reliability.
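
The first two points can be sketched on a single machine in plain Java. This is only an illustration under stated assumptions: `ChunkedMax` and `parallelMax` are hypothetical names, the input is a made-up list of readings, and nothing here addresses the third point (coordination and reliability across machines), which is what a framework such as Hadoop takes care of.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of Hadoop: finds a maximum by splitting the
// input into equal-sized chunks and then combining the per-chunk results.
class ChunkedMax {

  static int parallelMax(List<Integer> data, int chunkSize, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // 1. Split the data into chunks of at most chunkSize elements and
      //    hand each chunk to a worker thread.
      List<Future<Integer>> partials = new ArrayList<>();
      for (int start = 0; start < data.size(); start += chunkSize) {
        List<Integer> chunk =
            data.subList(start, Math.min(start + chunkSize, data.size()));
        partials.add(pool.submit(() -> {
          int chunkMax = Integer.MIN_VALUE;
          for (int v : chunk) {
            chunkMax = Math.max(chunkMax, v);
          }
          return chunkMax;
        }));
      }
      // 2. Combine the partial results into one answer.
      int overallMax = Integer.MIN_VALUE;
      for (Future<Integer> partial : partials) {
        overallMax = Math.max(overallMax, partial.get());
      }
      return overallMax;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("chunk processing failed", e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<Integer> readings = Arrays.asList(0, 22, -11, 111, 78, 5, 90);
    System.out.println(parallelMax(readings, 3, 2)); // prints 111
  }
}
```

Because every chunk has (almost) the same size, no single worker dominates the run time, and the combining step only has to look at one value per chunk.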

## MapReduce: an overview
A MapReduce program needs three things: a map function, a reduce function, and some code to run the job.

### Datatypes
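Rather than the built-in Java types, Hadoop uses its own datatypes, found in the `org.apache.hadoop.io` package. For example, `LongWritable`, `Text`, and `IntWritable` play the roles of Java's `Long`, `String`, and `Integer`.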
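
Hadoop's types implement the `Writable` interface, whose `write(DataOutput)` and `readFields(DataInput)` methods let a value serialize itself to a stream and be read back. The sketch below imitates that idea in plain Java so it runs without Hadoop. `MiniIntWritable` and `roundTrip` are hypothetical names, not Hadoop classes, and the real `IntWritable` has more to it than this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for IntWritable; NOT the Hadoop class.
class MiniIntWritable {
  private int value;

  MiniIntWritable() {}

  MiniIntWritable(int value) {
    this.value = value;
  }

  int get() {
    return value;
  }

  // Serialize the wrapped int as 4 bytes.
  void write(DataOutput out) throws IOException {
    out.writeInt(value);
  }

  // Overwrite this object's state from the stream, so one instance can be reused.
  void readFields(DataInput in) throws IOException {
    value = in.readInt();
  }

  // Writes n to a byte array and reads it back into a fresh instance.
  static int roundTrip(int n) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      new MiniIntWritable(n).write(new DataOutputStream(bytes));
      MiniIntWritable copy = new MiniIntWritable();
      copy.readFields(
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy.get();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(-11)); // prints -11
  }
}
```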

### Context
The `Context` instance passed to `map()` and `reduce()` is mainly used to write output, by calling `context.write(key, value)`.

### Mapper
The `Mapper` class is a generic type with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. The `map()` method is passed a key, a value, and a `Context` to which it writes its output.

### Reducer
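Like `Mapper`, `Reducer` has four formal type parameters that specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function. The output of the `Mapper` is grouped by key before being passed to the `Reducer`, so `reduce()` receives each key together with all of its values.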
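
To make the four type parameters and the group-by-key step concrete, here is a minimal in-memory sketch in plain Java. Every name in it (`MiniMapReduce`, `Collector`, `MapFn`, `ReduceFn`, `run`, `maxPerYear`) is a hypothetical stand-in rather than a Hadoop API, the `"year temperature"` record format is made up for the illustration, and a real job distributes these phases across machines instead of running them in one loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// All names below are hypothetical stand-ins, NOT Hadoop classes.
class MiniMapReduce {

  // Plays the role of Context: receives the (key, value) pairs a function writes.
  interface Collector<K, V> {
    void write(K key, V value);
  }

  // Four type parameters: input key, input value, output key, output value.
  interface MapFn<KI, VI, KO, VO> {
    void map(KI key, VI value, Collector<KO, VO> out);
  }

  // The reduce input types (KI, VI) must match the map output types.
  interface ReduceFn<KI, VI, KO, VO> {
    void reduce(KI key, Iterable<VI> values, Collector<KO, VO> out);
  }

  static <VI, K extends Comparable<K>, V, KO, VO> Map<KO, VO> run(
      List<VI> records,
      MapFn<Long, VI, K, V> mapper,
      ReduceFn<K, V, KO, VO> reducer) {
    // Map phase: call map() once per record and group its output by key.
    // The input key here is simply the record's index.
    Map<K, List<V>> grouped = new TreeMap<>();
    for (int i = 0; i < records.size(); i++) {
      mapper.map((long) i, records.get(i),
          (k, v) -> grouped.computeIfAbsent(k, unused -> new ArrayList<>()).add(v));
    }
    // Reduce phase: call reduce() once per key with all of that key's values.
    Map<KO, VO> output = new LinkedHashMap<>();
    for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
      reducer.reduce(entry.getKey(), entry.getValue(), output::put);
    }
    return output;
  }

  // Maximum temperature per year for records shaped like "1950 22".
  static Map<String, Integer> maxPerYear(List<String> records) {
    MapFn<Long, String, String, Integer> mapper = (index, line, out) -> {
      String[] parts = line.split(" ");
      out.write(parts[0], Integer.parseInt(parts[1]));
    };
    ReduceFn<String, Integer, String, Integer> reducer = (year, temps, out) -> {
      int max = Integer.MIN_VALUE;
      for (int t : temps) {
        max = Math.max(max, t);
      }
      out.write(year, max);
    };
    return run(records, mapper, reducer);
  }

  public static void main(String[] args) {
    List<String> records =
        Arrays.asList("1950 0", "1950 22", "1950 -11", "1949 111", "1949 78");
    System.out.println(maxPerYear(records)); // prints {1949=111, 1950=22}
  }
}
```

After the map phase the grouped data is `1949 -> [111, 78]` and `1950 -> [0, 22, -11]`, which is the shape of input that each `reduce()` call sees.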

### Job
A `Job` object forms the specification of the job and gives you control over how the job is run.
Usual setup:
- The `setJarByClass()` method tells Hadoop to locate the relevant JAR file by looking for the one that contains the given class.
- The static `FileInputFormat.addInputPath()` method adds an input file, directory, or file pattern. It can be called more than once to use several inputs.
- The static `FileOutputFormat.setOutputPath()` method sets the output directory. This directory must not exist before the job is run.
- The `setMapperClass()` and `setReducerClass()` methods specify the map and reduce classes to use.
- The `setOutputKeyClass()` and `setOutputValueClass()` methods control the output types for the reduce function, and must match what the reducer class produces. The map output types default to the same types, so they do not need to be set if the mapper produces the same types as the reducer (as it does in our case). However, if they are different, the map output types must be set using the `setMapOutputKeyClass()` and `setMapOutputValueClass()` methods.


## Example: A simple MapReduce program
This program takes temperature data as input and finds the maximum temperature for each year.

`Mapper`:
```java
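import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (year, temperature) for every valid reading in a fixed-width record.
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {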

  private static final int MISSING = 9999;
  
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
```
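
The parsing steps in `map()` can be tried without a Hadoop cluster. The sketch below repeats the same substring offsets in plain Java and runs them on a synthetic record. `RecordParsingSketch`, `syntheticRecord`, and `parse` are hypothetical names, and the record is fabricated filler with only the year, temperature, and quality columns set, so it is not real weather data.

```java
import java.util.Arrays;

// Hypothetical helper that repeats the mapper's parsing steps without Hadoop.
class RecordParsingSketch {

  static final int MISSING = 9999;

  // Builds a synthetic 93-character record: '0' filler everywhere except the
  // three fields the mapper reads. signedTemp must be a sign plus four digits.
  static String syntheticRecord(String year, String signedTemp, char quality) {
    char[] record = new char[93];
    Arrays.fill(record, '0');
    year.getChars(0, 4, record, 15);        // columns 15-18: year
    signedTemp.getChars(0, 5, record, 87);  // columns 87-91: sign and digits
    record[92] = quality;                   // column 92: quality code
    return new String(record);
  }

  // Returns "year<TAB>temperature", or null when the mapper would emit nothing.
  static String parse(String line) {
    String year = line.substring(15, 19);
    // Skip a leading '+', as the mapper does, before parsing the number.
    int start = line.charAt(87) == '+' ? 88 : 87;
    int airTemperature = Integer.parseInt(line.substring(start, 92));
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      return year + "\t" + airTemperature;
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(parse(syntheticRecord("1950", "+0022", '1'))); // 1950<TAB>22
    System.out.println(parse(syntheticRecord("1950", "+9999", '1'))); // null (missing)
    System.out.println(parse(syntheticRecord("1949", "+0111", '2'))); // null (bad quality)
  }
}
```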

`Reducer`:
```java
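import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives one year with all of its readings and emits the maximum.
public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {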
  
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
```

`Driver`:
```java
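import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    // Job.getInstance() replaces the deprecated Job constructor.
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    // The output directory must not exist before the job runs.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    // The map output types default to these, so they are not set separately.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Submit the job, print progress, and exit with 0 on success or 1 on failure.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}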
```