<a href="https://colab.research.google.com/github/groda/big_data/blob/master/MapReduce_Primer_HelloWorld.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90"></div></a>

# MapReduce: A Primer with <code>Hello World!</code>
<br>
<br>

For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in _local standalone mode_:

> ❝ _By default, Hadoop is configured to run in a non-distributed mode, as a single Java process._ ❞

(see [https://hadoop.apache.org/docs/stable/.../Standalone_Operation](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation))

We are going to run a MapReduce job using MapReduce's [streaming application](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Streaming). This is not to be confused with real-time streaming:

> ❝ _Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer._ ❞

Hadoop streaming uses `IdentityMapper` and `IdentityReducer` by default, so a job can be submitted without explicitly specifying a mapper or a reducer. Finally, we show how to run a map-only job by setting `mapreduce.job.reduces` to $0$.
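To make the "any executable or script" idea concrete, here is a minimal sketch of a Python function that behaves like `/bin/cat`. The name `identity_mapper` is made up for this illustration; the only contract assumed is that a streaming mapper reads lines from standard input and writes lines to standard output.

```python
import io
import sys


def identity_mapper(stdin, stdout):
    """Copy every input line to the output unchanged, like /bin/cat."""
    for line in stdin:
        stdout.write(line)


# Small in-memory demonstration, no Hadoop needed.
demo_in = io.StringIO("Hello, World!\n")
demo_out = io.StringIO()
identity_mapper(demo_in, demo_out)
print(repr(demo_out.getvalue()))

# Used as a real mapper script, one would call instead:
#   identity_mapper(sys.stdin, sys.stdout)
```

Passing the streams as arguments keeps the function testable without a running job.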


# Download core Hadoop

In [1]:
HADOOP_URL = "https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz"

import requests
import os
import tarfile

def download_and_extract_targz(url):
    filename = url.rsplit('/', 1)[-1]
    # strip the ".tar.gz" suffix to get the name of the extracted folder
    HADOOP_HOME = filename[:-len('.tar.gz')]
    # set HADOOP_HOME environment variable
    os.environ['HADOOP_HOME'] = HADOOP_HOME
    # skip the download if the folder is already there
    if os.path.isdir(HADOOP_HOME):
        print("Not downloading, Hadoop folder {} already exists".format(HADOOP_HOME))
        return
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, 'wb') as file:
            file.write(response.content)
        with tarfile.open(filename, 'r:gz') as tar_ref:
            tar_ref.extractall(path='.')
            # Get the names of all members (files and directories) in the archive
            all_members = tar_ref.getnames()
            # If there is a top-level directory, get its name
            if all_members:
                top_level_directory = all_members[0]
                print(f"ZIP file downloaded and extracted successfully. Contents saved at: {top_level_directory}")
    else:
        print(f"Failed to download ZIP file. Status code: {response.status_code}")


download_and_extract_targz(HADOOP_URL)

ZIP file downloaded and extracted successfully. Contents saved at: hadoop-3.3.6


# Set environment variables

## Set `HADOOP_HOME` and `PATH`

In [2]:
# HADOOP_HOME was set earlier when downloading Hadoop distribution
print("HADOOP_HOME is {}".format(os.environ['HADOOP_HOME']))

os.environ['PATH'] = ':'.join([os.path.join(os.environ['HADOOP_HOME'], 'bin'), os.environ['PATH']])
print("PATH is {}".format(os.environ['PATH']))

HADOOP_HOME is hadoop-3.3.6
PATH is hadoop-3.3.6/bin:/opt/hostedtoolcache/Python/3.8.18/x64/bin:/opt/hostedtoolcache/Python/3.8.18/x64:/snap/bin:/home/runner/.local/bin:/opt/pipx_bin:/home/runner/.cargo/bin:/home/runner/.config/composer/vendor/bin:/usr/local/.ghcup/bin:/home/runner/.dotnet/tools:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin


## Set `JAVA_HOME`

While Java is readily available on Google Colab, we consider the broader scenario of an Ubuntu machine. In this case, we ensure compatibility by installing Java, specifically opting for the `openjdk-19-jre-headless` version.

In [3]:
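import shutil
import subprocess

# set variable JAVA_HOME (install Java if necessary)
def is_java_installed():
    java_path = shutil.which("java")
    # shutil.which returns None when no java executable is on the PATH
    if java_path is None:
        return None
    os.environ['JAVA_HOME'] = os.path.realpath(java_path).split('/bin')[0]
    return os.environ['JAVA_HOME']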

def install_java():
    # Uncomment and modify the desired version
    # java_version= 'openjdk-11-jre-headless'
    # java_version= 'default-jre'
    # java_version= 'openjdk-17-jre-headless'
    # java_version= 'openjdk-18-jre-headless'
    java_version= 'openjdk-19-jre-headless'

    print(f"Java not found. Installing {java_version} ... (this might take a while)")
    try:
        cmd = f"apt install -y {java_version}"
        subprocess_output = subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
        stdout_result = subprocess_output.stdout
        # Process the results as needed
        print("Done installing Java {}".format(java_version))
        os.environ['JAVA_HOME'] = os.path.realpath(shutil.which("java")).split('/bin')[0]
        print("JAVA_HOME is {}".format(os.environ['JAVA_HOME']))
    except subprocess.CalledProcessError as e:
        # Handle the error if the command returns a non-zero exit code
        print("Command failed with return code {}".format(e.returncode))
        print("stdout: {}".format(e.stdout))

# Install Java if not available
if is_java_installed():
    print("Java is already installed: {}".format(os.environ['JAVA_HOME']))
else:
    print("Installing Java")
    install_java()

Java is already installed: /usr/lib/jvm/temurin-11-jdk-amd64


# Run a MapReduce job with Hadoop streaming

## Create a file

Write the string "Hello, World!" to a local file.

**Note:** you will be writing to the file `./hello.txt` in your current directory (denoted by `./`).

In [4]:
!echo "Hello, World!">./hello.txt

## Launch the MapReduce "Hello, World!" application

Since the default filesystem is the local filesystem (as opposed to HDFS) we do not need to upload the local file `hello.txt` to HDFS.

Run a MapReduce job with `/bin/cat` as the mapper and without specifying a reducer.

**Note:** the first step of removing the output directory is necessary because, by design, MapReduce does not overwrite an existing output directory.
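Because the default filesystem here is the local one, the clean-up step can also be sketched in plain Python. This is only an illustration under that assumption: `remove_output_dir` is a hypothetical helper, not part of Hadoop, and on HDFS one would still use `hdfs dfs -rm -r`.

```python
import os
import shutil
import tempfile


def remove_output_dir(path):
    """Delete a local output directory if it exists; return True if something was removed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


# Demonstration in a throwaway directory.
workdir = tempfile.mkdtemp()
output_dir = os.path.join(workdir, "my_output")
os.mkdir(output_dir)
first = remove_output_dir(output_dir)   # the directory existed
second = remove_output_dir(output_dir)  # nothing left to remove
print(first, second)
shutil.rmtree(workdir)
```

Returning a flag instead of raising mirrors the first run of the cell below, where the directory does not exist yet and the removal is simply a no-op.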

In [5]:
%%bash
hdfs dfs -rm -r my_output

mapred streaming \
    -input hello.txt \
    -output my_output \
    -mapper '/bin/cat'

rm: `my_output': No such file or directory


2024-03-09 20:33:24,927 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties


2024-03-09 20:33:25,038 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).


2024-03-09 20:33:25,038 INFO impl.MetricsSystemImpl: JobTracker metrics system started


2024-03-09 20:33:25,051 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2024-03-09 20:33:25,204 INFO mapred.FileInputFormat: Total input files to process : 1


2024-03-09 20:33:25,221 INFO mapreduce.JobSubmitter: number of splits:1


2024-03-09 20:33:25,391 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1134687999_0001


2024-03-09 20:33:25,391 INFO mapreduce.JobSubmitter: Executing with tokens: []


2024-03-09 20:33:25,494 INFO mapreduce.Job: The url to track the job: http://localhost:8080/


2024-03-09 20:33:25,502 INFO mapreduce.Job: Running job: job_local1134687999_0001


2024-03-09 20:33:25,507 INFO mapred.LocalJobRunner: OutputCommitter set in config null


2024-03-09 20:33:25,515 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter


2024-03-09 20:33:25,529 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2024-03-09 20:33:25,529 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2024-03-09 20:33:25,568 INFO mapred.LocalJobRunner: Waiting for map tasks


2024-03-09 20:33:25,570 INFO mapred.LocalJobRunner: Starting task: attempt_local1134687999_0001_m_000000_0


2024-03-09 20:33:25,591 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2024-03-09 20:33:25,591 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2024-03-09 20:33:25,609 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2024-03-09 20:33:25,617 INFO mapred.MapTask: Processing split: file:/home/runner/work/big_data/big_data/hello.txt:0+14


2024-03-09 20:33:25,627 INFO mapred.MapTask: numReduceTasks: 1


2024-03-09 20:33:25,647 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)


2024-03-09 20:33:25,647 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100


2024-03-09 20:33:25,647 INFO mapred.MapTask: soft limit at 83886080


2024-03-09 20:33:25,647 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600


2024-03-09 20:33:25,647 INFO mapred.MapTask: kvstart = 26214396; length = 6553600


2024-03-09 20:33:25,650 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer


2024-03-09 20:33:25,652 INFO streaming.PipeMapRed: PipeMapRed exec [/bin/cat]


2024-03-09 20:33:25,656 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir


2024-03-09 20:33:25,658 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir


2024-03-09 20:33:25,658 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file


2024-03-09 20:33:25,658 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length


2024-03-09 20:33:25,659 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id


2024-03-09 20:33:25,659 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition


2024-03-09 20:33:25,660 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start


2024-03-09 20:33:25,660 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap


2024-03-09 20:33:25,660 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id


2024-03-09 20:33:25,660 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id


2024-03-09 20:33:25,661 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords


2024-03-09 20:33:25,661 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name


2024-03-09 20:33:25,681 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]


2024-03-09 20:33:25,682 INFO streaming.PipeMapRed: Records R/W=1/1


2024-03-09 20:33:25,683 INFO streaming.PipeMapRed: MRErrorThread done


2024-03-09 20:33:25,684 INFO streaming.PipeMapRed: mapRedFinished


2024-03-09 20:33:25,687 INFO mapred.LocalJobRunner: 


2024-03-09 20:33:25,687 INFO mapred.MapTask: Starting flush of map output


2024-03-09 20:33:25,687 INFO mapred.MapTask: Spilling map output


2024-03-09 20:33:25,687 INFO mapred.MapTask: bufstart = 0; bufend = 15; bufvoid = 104857600


2024-03-09 20:33:25,687 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600


2024-03-09 20:33:25,695 INFO mapred.MapTask: Finished spill 0


2024-03-09 20:33:25,704 INFO mapred.Task: Task:attempt_local1134687999_0001_m_000000_0 is done. And is in the process of committing


2024-03-09 20:33:25,706 INFO mapred.LocalJobRunner: Records R/W=1/1


2024-03-09 20:33:25,706 INFO mapred.Task: Task 'attempt_local1134687999_0001_m_000000_0' done.


2024-03-09 20:33:25,712 INFO mapred.Task: Final Counters for attempt_local1134687999_0001_m_000000_0: Counters: 17


	File System Counters


		FILE: Number of bytes read=141437


		FILE: Number of bytes written=784784


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


	Map-Reduce Framework


		Map input records=1


		Map output records=1


		Map output bytes=15


		Map output materialized bytes=23


		Input split bytes=102


		Combine input records=0


		Spilled Records=1


		Failed Shuffles=0


		Merged Map outputs=0


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=317718528


	File Input Format Counters 


		Bytes Read=14


2024-03-09 20:33:25,712 INFO mapred.LocalJobRunner: Finishing task: attempt_local1134687999_0001_m_000000_0


2024-03-09 20:33:25,713 INFO mapred.LocalJobRunner: map task executor complete.


2024-03-09 20:33:25,716 INFO mapred.LocalJobRunner: Waiting for reduce tasks


2024-03-09 20:33:25,717 INFO mapred.LocalJobRunner: Starting task: attempt_local1134687999_0001_r_000000_0


2024-03-09 20:33:25,725 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2024-03-09 20:33:25,725 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2024-03-09 20:33:25,726 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2024-03-09 20:33:25,728 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@5749d306


2024-03-09 20:33:25,730 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2024-03-09 20:33:25,745 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2933076736, maxSingleShuffleLimit=733269184, mergeThreshold=1935830784, ioSortFactor=10, memToMemMergeOutputsThreshold=10


2024-03-09 20:33:25,747 INFO reduce.EventFetcher: attempt_local1134687999_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events


2024-03-09 20:33:25,772 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1134687999_0001_m_000000_0 decomp: 19 len: 23 to MEMORY


2024-03-09 20:33:25,776 INFO reduce.InMemoryMapOutput: Read 19 bytes from map-output for attempt_local1134687999_0001_m_000000_0


2024-03-09 20:33:25,778 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 19, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->19


2024-03-09 20:33:25,781 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning


2024-03-09 20:33:25,782 INFO mapred.LocalJobRunner: 1 / 1 copied.


2024-03-09 20:33:25,782 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs


2024-03-09 20:33:25,786 INFO mapred.Merger: Merging 1 sorted segments


2024-03-09 20:33:25,787 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3 bytes


2024-03-09 20:33:25,788 INFO reduce.MergeManagerImpl: Merged 1 segments, 19 bytes to disk to satisfy reduce memory limit


2024-03-09 20:33:25,788 INFO reduce.MergeManagerImpl: Merging 1 files, 23 bytes from disk


2024-03-09 20:33:25,789 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce


2024-03-09 20:33:25,789 INFO mapred.Merger: Merging 1 sorted segments


2024-03-09 20:33:25,791 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3 bytes


2024-03-09 20:33:25,791 INFO mapred.LocalJobRunner: 1 / 1 copied.


2024-03-09 20:33:25,798 INFO mapred.Task: Task:attempt_local1134687999_0001_r_000000_0 is done. And is in the process of committing


2024-03-09 20:33:25,799 INFO mapred.LocalJobRunner: 1 / 1 copied.


2024-03-09 20:33:25,799 INFO mapred.Task: Task attempt_local1134687999_0001_r_000000_0 is allowed to commit now


2024-03-09 20:33:25,801 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1134687999_0001_r_000000_0' to file:/home/runner/work/big_data/big_data/my_output


2024-03-09 20:33:25,802 INFO mapred.LocalJobRunner: reduce > reduce


2024-03-09 20:33:25,802 INFO mapred.Task: Task 'attempt_local1134687999_0001_r_000000_0' done.


2024-03-09 20:33:25,803 INFO mapred.Task: Final Counters for attempt_local1134687999_0001_r_000000_0: Counters: 24


	File System Counters


		FILE: Number of bytes read=141515


		FILE: Number of bytes written=784834


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


	Map-Reduce Framework


		Combine input records=0


		Combine output records=0


		Reduce input groups=1


		Reduce shuffle bytes=23


		Reduce input records=1


		Reduce output records=1


		Spilled Records=1


		Shuffled Maps =1


		Failed Shuffles=0


		Merged Map outputs=1


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=317718528


	Shuffle Errors


		BAD_ID=0


		CONNECTION=0


		IO_ERROR=0


		WRONG_LENGTH=0


		WRONG_MAP=0


		WRONG_REDUCE=0


	File Output Format Counters 


		Bytes Written=27


2024-03-09 20:33:25,803 INFO mapred.LocalJobRunner: Finishing task: attempt_local1134687999_0001_r_000000_0


2024-03-09 20:33:25,803 INFO mapred.LocalJobRunner: reduce task executor complete.


2024-03-09 20:33:26,513 INFO mapreduce.Job: Job job_local1134687999_0001 running in uber mode : false


2024-03-09 20:33:26,514 INFO mapreduce.Job:  map 100% reduce 100%


2024-03-09 20:33:26,515 INFO mapreduce.Job: Job job_local1134687999_0001 completed successfully


2024-03-09 20:33:26,522 INFO mapreduce.Job: Counters: 30


	File System Counters


		FILE: Number of bytes read=282952


		FILE: Number of bytes written=1569618


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


	Map-Reduce Framework


		Map input records=1


		Map output records=1


		Map output bytes=15


		Map output materialized bytes=23


		Input split bytes=102


		Combine input records=0


		Combine output records=0


		Reduce input groups=1


		Reduce shuffle bytes=23


		Reduce input records=1


		Reduce output records=1


		Spilled Records=2


		Shuffled Maps =1


		Failed Shuffles=0


		Merged Map outputs=1


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=635437056


	Shuffle Errors


		BAD_ID=0


		CONNECTION=0


		IO_ERROR=0


		WRONG_LENGTH=0


		WRONG_MAP=0


		WRONG_REDUCE=0


	File Input Format Counters 


		Bytes Read=14


	File Output Format Counters 


		Bytes Written=27


2024-03-09 20:33:26,522 INFO streaming.StreamJob: Output directory: my_output


## Verify the result

If the job executed successfully, the output directory `my_output` contains an empty file named `_SUCCESS`.

Verify the success of the MapReduce job by checking for this file.

In [6]:
%%bash

echo "Check if MapReduce job was successful"
hdfs dfs -test -e my_output/_SUCCESS
if [ $? -eq 0 ]; then
	echo "_SUCCESS exists!"
fi

Check if MapReduce job was successful


_SUCCESS exists!


**Note:** `hdfs dfs -ls` lists the same files as `ls`, since the default filesystem is the local filesystem.

In [7]:
!hdfs dfs -ls my_output

Found 2 items
-rw-r--r--   1 runner docker          0 2024-03-09 20:33 my_output/_SUCCESS
-rw-r--r--   1 runner docker         15 2024-03-09 20:33 my_output/part-00000


In [8]:
!ls -l my_output

total 4
-rw-r--r-- 1 runner docker  0 Mar  9 20:33 _SUCCESS
-rw-r--r-- 1 runner docker 15 Mar  9 20:33 part-00000


The actual output of the MapReduce job is contained in the file `part-00000` in the output directory.

In [9]:
!cat my_output/part-00000

Hello, World!	
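Since the output lives on the local filesystem in this setup, the checks above can also be sketched in Python. The helper name `read_job_output` is hypothetical; it assumes the layout shown in the listing above, a `_SUCCESS` marker next to one or more `part-*` files.

```python
import glob
import os
import tempfile


def read_job_output(output_dir):
    """Return the concatenated part files, or raise if the _SUCCESS marker is missing."""
    if not os.path.isfile(os.path.join(output_dir, "_SUCCESS")):
        raise FileNotFoundError("no _SUCCESS marker in {}".format(output_dir))
    chunks = []
    for part in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(part, encoding="utf-8") as f:
            chunks.append(f.read())
    return "".join(chunks)


# Demonstration on a fake output directory that mimics the listing above.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "_SUCCESS"), "w").close()
with open(os.path.join(demo_dir, "part-00000"), "w", encoding="utf-8") as f:
    f.write("Hello, World!\t\n")
result = read_job_output(demo_dir)
print(repr(result))
```

Printing with `repr` makes the trailing tab in the output line visible, which `cat` does not show.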


# MapReduce without specifying mapper or reducer

In the previous example, we have seen how to run a MapReduce job without specifying any reducer.

Since the only required options for `mapred streaming` are `-input` and `-output`, we can also run a MapReduce job without specifying a mapper.

In [10]:
!mapred streaming -h

2024-03-09 20:33:29,605 ERROR streaming.StreamJob: Unrecognized option: -h
Usage: $HADOOP_HOME/bin/hadoop jar hadoop-streaming.jar [options]
Options:
  -input          <path> DFS input file(s) for the Map step.
  -output         <path> DFS output directory for the Reduce step.
  -mapper         <cmd|JavaClassName> Optional. Command to be run as mapper.
  -combiner       <cmd|JavaClassName> Optional. Command to be run as combiner.
  -reducer        <cmd|JavaClassName> Optional. Command to be run as reducer.
  -file           <file> Optional. File/dir to be shipped in the Job jar file.
                  Deprecated. Use generic option "-files" instead.
  -inputformat    <TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName>
                  Optional. The input format class.
  -outputformat   <TextOutputFormat(default)|JavaClassName>
                  Optional. The output format class.
  -partitioner    <JavaClassName>  Optional. The partitioner class.
  -nu

In [11]:
%%bash
hdfs dfs -rm -r my_output

mapred streaming \
    -input hello.txt \
    -output my_output

2024-03-09 20:33:30,627 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum


Deleted my_output


2024-03-09 20:33:31,725 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties


2024-03-09 20:33:31,809 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).


2024-03-09 20:33:31,810 INFO impl.MetricsSystemImpl: JobTracker metrics system started


2024-03-09 20:33:31,821 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2024-03-09 20:33:31,969 INFO mapred.FileInputFormat: Total input files to process : 1


2024-03-09 20:33:31,983 INFO mapreduce.JobSubmitter: number of splits:1


2024-03-09 20:33:32,144 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1144934071_0001


2024-03-09 20:33:32,144 INFO mapreduce.JobSubmitter: Executing with tokens: []


2024-03-09 20:33:32,274 INFO mapreduce.Job: The url to track the job: http://localhost:8080/


2024-03-09 20:33:32,278 INFO mapred.LocalJobRunner: OutputCommitter set in config null


2024-03-09 20:33:32,280 INFO mapreduce.Job: Running job: job_local1144934071_0001


2024-03-09 20:33:32,285 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter


2024-03-09 20:33:32,293 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2024-03-09 20:33:32,293 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2024-03-09 20:33:32,320 INFO mapred.LocalJobRunner: Waiting for map tasks


2024-03-09 20:33:32,323 INFO mapred.LocalJobRunner: Starting task: attempt_local1144934071_0001_m_000000_0


2024-03-09 20:33:32,342 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2024-03-09 20:33:32,343 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2024-03-09 20:33:32,361 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2024-03-09 20:33:32,368 INFO mapred.MapTask: Processing split: file:/home/runner/work/big_data/big_data/hello.txt:0+14


2024-03-09 20:33:32,377 INFO mapred.MapTask: numReduceTasks: 1


2024-03-09 20:33:32,397 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)


2024-03-09 20:33:32,397 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100


2024-03-09 20:33:32,397 INFO mapred.MapTask: soft limit at 83886080


2024-03-09 20:33:32,397 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600


2024-03-09 20:33:32,397 INFO mapred.MapTask: kvstart = 26214396; length = 6553600


2024-03-09 20:33:32,401 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer


2024-03-09 20:33:32,405 INFO mapred.LocalJobRunner: 


2024-03-09 20:33:32,405 INFO mapred.MapTask: Starting flush of map output


2024-03-09 20:33:32,405 INFO mapred.MapTask: Spilling map output


2024-03-09 20:33:32,405 INFO mapred.MapTask: bufstart = 0; bufend = 22; bufvoid = 104857600


2024-03-09 20:33:32,405 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600


2024-03-09 20:33:32,410 INFO mapred.MapTask: Finished spill 0


2024-03-09 20:33:32,419 INFO mapred.Task: Task:attempt_local1144934071_0001_m_000000_0 is done. And is in the process of committing


2024-03-09 20:33:32,421 INFO mapred.LocalJobRunner: file:/home/runner/work/big_data/big_data/hello.txt:0+14


2024-03-09 20:33:32,421 INFO mapred.Task: Task 'attempt_local1144934071_0001_m_000000_0' done.


2024-03-09 20:33:32,428 INFO mapred.Task: Final Counters for attempt_local1144934071_0001_m_000000_0: Counters: 17


	File System Counters


		FILE: Number of bytes read=141437


		FILE: Number of bytes written=782672


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


	Map-Reduce Framework


		Map input records=1


		Map output records=1


		Map output bytes=22


		Map output materialized bytes=30


		Input split bytes=102


		Combine input records=0


		Spilled Records=1


		Failed Shuffles=0


		Merged Map outputs=0


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=314572800


	File Input Format Counters 


		Bytes Read=14


2024-03-09 20:33:32,428 INFO mapred.LocalJobRunner: Finishing task: attempt_local1144934071_0001_m_000000_0


2024-03-09 20:33:32,430 INFO mapred.LocalJobRunner: map task executor complete.


2024-03-09 20:33:32,433 INFO mapred.LocalJobRunner: Waiting for reduce tasks


2024-03-09 20:33:32,437 INFO mapred.LocalJobRunner: Starting task: attempt_local1144934071_0001_r_000000_0


2024-03-09 20:33:32,446 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2024-03-09 20:33:32,446 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2024-03-09 20:33:32,446 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2024-03-09 20:33:32,451 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@7818ee30


2024-03-09 20:33:32,453 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2024-03-09 20:33:32,470 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2933076736, maxSingleShuffleLimit=733269184, mergeThreshold=1935830784, ioSortFactor=10, memToMemMergeOutputsThreshold=10


2024-03-09 20:33:32,472 INFO reduce.EventFetcher: attempt_local1144934071_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events


2024-03-09 20:33:32,498 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1144934071_0001_m_000000_0 decomp: 26 len: 30 to MEMORY


2024-03-09 20:33:32,501 INFO reduce.InMemoryMapOutput: Read 26 bytes from map-output for attempt_local1144934071_0001_m_000000_0


2024-03-09 20:33:32,502 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 26, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->26


2024-03-09 20:33:32,505 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning


2024-03-09 20:33:32,506 INFO mapred.LocalJobRunner: 1 / 1 copied.


2024-03-09 20:33:32,507 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs


2024-03-09 20:33:32,511 INFO mapred.Merger: Merging 1 sorted segments


2024-03-09 20:33:32,511 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes


2024-03-09 20:33:32,512 INFO reduce.MergeManagerImpl: Merged 1 segments, 26 bytes to disk to satisfy reduce memory limit


2024-03-09 20:33:32,513 INFO reduce.MergeManagerImpl: Merging 1 files, 30 bytes from disk


2024-03-09 20:33:32,513 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce


2024-03-09 20:33:32,513 INFO mapred.Merger: Merging 1 sorted segments


2024-03-09 20:33:32,514 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes


2024-03-09 20:33:32,515 INFO mapred.LocalJobRunner: 1 / 1 copied.


2024-03-09 20:33:32,520 INFO mapred.Task: Task:attempt_local1144934071_0001_r_000000_0 is done. And is in the process of committing


2024-03-09 20:33:32,521 INFO mapred.LocalJobRunner: 1 / 1 copied.


2024-03-09 20:33:32,521 INFO mapred.Task: Task attempt_local1144934071_0001_r_000000_0 is allowed to commit now


2024-03-09 20:33:32,523 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1144934071_0001_r_000000_0' to file:/home/runner/work/big_data/big_data/my_output


2024-03-09 20:33:32,523 INFO mapred.LocalJobRunner: reduce > reduce


2024-03-09 20:33:32,524 INFO mapred.Task: Task 'attempt_local1144934071_0001_r_000000_0' done.


2024-03-09 20:33:32,524 INFO mapred.Task: Final Counters for attempt_local1144934071_0001_r_000000_0: Counters: 24


	File System Counters


		FILE: Number of bytes read=141529


		FILE: Number of bytes written=782730


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


	Map-Reduce Framework


		Combine input records=0


		Combine output records=0


		Reduce input groups=1


		Reduce shuffle bytes=30


		Reduce input records=1


		Reduce output records=1


		Spilled Records=1


		Shuffled Maps =1


		Failed Shuffles=0


		Merged Map outputs=1


		GC time elapsed (ms)=5


		Total committed heap usage (bytes)=314572800


	Shuffle Errors


		BAD_ID=0


		CONNECTION=0


		IO_ERROR=0


		WRONG_LENGTH=0


		WRONG_MAP=0


		WRONG_REDUCE=0


	File Output Format Counters 


		Bytes Written=28


2024-03-09 20:33:32,524 INFO mapred.LocalJobRunner: Finishing task: attempt_local1144934071_0001_r_000000_0


2024-03-09 20:33:32,524 INFO mapred.LocalJobRunner: reduce task executor complete.


2024-03-09 20:33:33,289 INFO mapreduce.Job: Job job_local1144934071_0001 running in uber mode : false


2024-03-09 20:33:33,290 INFO mapreduce.Job:  map 100% reduce 100%


2024-03-09 20:33:33,291 INFO mapreduce.Job: Job job_local1144934071_0001 completed successfully


2024-03-09 20:33:33,298 INFO mapreduce.Job: Counters: 30


	File System Counters


		FILE: Number of bytes read=282966


		FILE: Number of bytes written=1565402


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


	Map-Reduce Framework


		Map input records=1


		Map output records=1


		Map output bytes=22


		Map output materialized bytes=30


		Input split bytes=102


		Combine input records=0


		Combine output records=0


		Reduce input groups=1


		Reduce shuffle bytes=30


		Reduce input records=1


		Reduce output records=1


		Spilled Records=2


		Shuffled Maps =1


		Failed Shuffles=0


		Merged Map outputs=1


		GC time elapsed (ms)=5


		Total committed heap usage (bytes)=629145600


	Shuffle Errors


		BAD_ID=0


		CONNECTION=0


		IO_ERROR=0


		WRONG_LENGTH=0


		WRONG_MAP=0


		WRONG_REDUCE=0


	File Input Format Counters 


		Bytes Read=14


	File Output Format Counters 


		Bytes Written=28


2024-03-09 20:33:33,298 INFO streaming.StreamJob: Output directory: my_output


## Verify the result

In [12]:
%%bash

echo "Check if MapReduce job was successful"
hdfs dfs -test -e my_output/_SUCCESS
if [ $? -eq 0 ]; then
	echo "_SUCCESS exists!"
fi

Check if MapReduce job was successful


_SUCCESS exists!


Show the output.

In [13]:
!cat my_output/part-00000

0	Hello, World!


Since we defined neither a mapper nor a reducer, the identity mapper ([IdentityMapper](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/lib/IdentityMapper.html)) and the identity reducer ([IdentityReducer](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/lib/IdentityReducer.html)) were used by default (see [Streaming command options](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Streaming_Command_Options)). The leading `0` is the key that the identity mapper passed through: the byte offset of the line in the input file.
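The two jobs wrote different lines: `Hello, World!` followed by a tab in the first case, and `0`, a tab, then the line in the second. The sketch below is a rough emulation in plain Python, not Hadoop's actual code. It assumes that text input yields the byte offset of each line as key and the line as value, that a streaming mapper's output line is split at the first tab into key and value, and that the final output is written as key, tab, value. The identity reducer is left out because it does not change the records. All function names are made up for this illustration.

```python
def text_input_records(data):
    """Yield (byte_offset, line) pairs, as assumed for plain text input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip("\n")
        offset += len(raw.encode("utf-8"))


def cat_mapper_job(data):
    """Emulate the job with '/bin/cat' as mapper: only the line reaches the mapper."""
    lines = []
    for _, line in text_input_records(data):
        mapper_output = line                            # cat echoes the line unchanged
        key, _, value = mapper_output.partition("\t")   # split at the first tab
        lines.append(key + "\t" + value)
    return lines


def identity_job(data):
    """Emulate the job without a mapper: the (offset, line) pair passes through."""
    return ["{}\t{}".format(key, value) for key, value in text_input_records(data)]


data = "Hello, World!\n"
print(cat_mapper_job(data))   # ['Hello, World!\t']
print(identity_job(data))     # ['0\tHello, World!']
```

In the first job the whole line becomes the key and the value is empty, which explains the trailing tab; in the second the offset `0` is kept as key.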

# Run a map-only MapReduce job

Not specifying mapper and reducer in the MapReduce job submission does not mean that MapReduce isn't going to run the mapper and reducer steps, it is simply going to use the Identity mapper and reducer.

To run a MapReduce job _without_ a reducer, use the generic option

    -D mapreduce.job.reduces=0

(see [specifying map-only jobs](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Specifying_Map-Only_Jobs)).
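
When the job is launched from Python rather than from a `%%bash` cell, the order of the arguments matters: the Hadoop streaming documentation asks for generic options such as `-D` to be placed before the streaming options. A minimal sketch, in which the helper name `streaming_command` is hypothetical:

```python
def streaming_command(input_path, output_path, mapper=None, reduces=None):
    """Build the argument list for `mapred streaming`."""
    cmd = ["mapred", "streaming"]
    # Generic options (-D property=value) come first.
    if reduces is not None:
        cmd += ["-D", f"mapreduce.job.reduces={reduces}"]
    # Streaming options follow.
    cmd += ["-input", input_path, "-output", output_path]
    if mapper is not None:
        cmd += ["-mapper", mapper]
    return cmd


print(" ".join(streaming_command("hello.txt", "my_output", reduces=0)))
# mapred streaming -D mapreduce.job.reduces=0 -input hello.txt -output my_output
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`, assuming `mapred` is on the `PATH`.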

In [14]:
%%bash
hdfs dfs -rm -r my_output

mapred streaming \
    -D mapreduce.job.reduces=0 \
    -input hello.txt \
    -output my_output

2024-03-09 20:33:35,392 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum


Deleted my_output


2024-03-09 20:33:36,467 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties


2024-03-09 20:33:36,555 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).


2024-03-09 20:33:36,556 INFO impl.MetricsSystemImpl: JobTracker metrics system started


2024-03-09 20:33:36,568 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2024-03-09 20:33:36,725 INFO mapred.FileInputFormat: Total input files to process : 1


2024-03-09 20:33:36,739 INFO mapreduce.JobSubmitter: number of splits:1


2024-03-09 20:33:36,902 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local702958357_0001


2024-03-09 20:33:36,902 INFO mapreduce.JobSubmitter: Executing with tokens: []


2024-03-09 20:33:37,027 INFO mapreduce.Job: The url to track the job: http://localhost:8080/


2024-03-09 20:33:37,039 INFO mapred.LocalJobRunner: OutputCommitter set in config null


2024-03-09 20:33:37,042 INFO mapreduce.Job: Running job: job_local702958357_0001


2024-03-09 20:33:37,050 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter


2024-03-09 20:33:37,059 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2024-03-09 20:33:37,059 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2024-03-09 20:33:37,092 INFO mapred.LocalJobRunner: Waiting for map tasks


2024-03-09 20:33:37,095 INFO mapred.LocalJobRunner: Starting task: attempt_local702958357_0001_m_000000_0


2024-03-09 20:33:37,115 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2024-03-09 20:33:37,116 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2024-03-09 20:33:37,131 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2024-03-09 20:33:37,137 INFO mapred.MapTask: Processing split: file:/home/runner/work/big_data/big_data/hello.txt:0+14


2024-03-09 20:33:37,145 INFO mapred.MapTask: numReduceTasks: 0


2024-03-09 20:33:37,155 INFO mapred.LocalJobRunner: 


2024-03-09 20:33:37,161 INFO mapred.Task: Task:attempt_local702958357_0001_m_000000_0 is done. And is in the process of committing


2024-03-09 20:33:37,162 INFO mapred.LocalJobRunner: 


2024-03-09 20:33:37,162 INFO mapred.Task: Task attempt_local702958357_0001_m_000000_0 is allowed to commit now


2024-03-09 20:33:37,163 INFO output.FileOutputCommitter: Saved output of task 'attempt_local702958357_0001_m_000000_0' to file:/home/runner/work/big_data/big_data/my_output


2024-03-09 20:33:37,168 INFO mapred.LocalJobRunner: file:/home/runner/work/big_data/big_data/hello.txt:0+14


2024-03-09 20:33:37,168 INFO mapred.Task: Task 'attempt_local702958357_0001_m_000000_0' done.


2024-03-09 20:33:37,174 INFO mapred.Task: Final Counters for attempt_local702958357_0001_m_000000_0: Counters: 15


	File System Counters


		FILE: Number of bytes read=141437


		FILE: Number of bytes written=779556


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


	Map-Reduce Framework


		Map input records=1


		Map output records=1


		Input split bytes=102


		Spilled Records=0


		Failed Shuffles=0


		Merged Map outputs=0


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=314572800


	File Input Format Counters 


		Bytes Read=14


	File Output Format Counters 


		Bytes Written=28


2024-03-09 20:33:37,174 INFO mapred.LocalJobRunner: Finishing task: attempt_local702958357_0001_m_000000_0


2024-03-09 20:33:37,175 INFO mapred.LocalJobRunner: map task executor complete.


2024-03-09 20:33:38,055 INFO mapreduce.Job: Job job_local702958357_0001 running in uber mode : false


2024-03-09 20:33:38,057 INFO mapreduce.Job:  map 100% reduce 0%


2024-03-09 20:33:38,059 INFO mapreduce.Job: Job job_local702958357_0001 completed successfully


2024-03-09 20:33:38,063 INFO mapreduce.Job: Counters: 15


	File System Counters


		FILE: Number of bytes read=141437


		FILE: Number of bytes written=779556


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


	Map-Reduce Framework


		Map input records=1


		Map output records=1


		Input split bytes=102


		Spilled Records=0


		Failed Shuffles=0


		Merged Map outputs=0


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=314572800


	File Input Format Counters 


		Bytes Read=14


	File Output Format Counters 


		Bytes Written=28


2024-03-09 20:33:38,064 INFO streaming.StreamJob: Output directory: my_output


## Verify the result

In [15]:
!hdfs dfs -test -e my_output/_SUCCESS && cat my_output/part-00000

0	Hello, World!


## Why a map-only application?

The advantage of a map-only job is that the shuffle and sort phases are skipped. If you do not need the output sorted or grouped by key, remember to specify `-D mapreduce.job.reduces=0`.

On the other hand, a MapReduce job delivers results sorted by key even with the default `IdentityReducer`, because the data passed from the mapper to the reducer is always sorted.
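
The difference is easiest to see with more than one record. The following sketch imitates both variants in plain Python. It is an illustration, not Hadoop code, and it assumes a single reducer and keys that are compared as plain strings.

```python
def map_only(records):
    """Map-only job: records are written in the order the mapper emits them."""
    return list(records)


def with_identity_reducer(records):
    """Job with a reducer: map output is sorted by key before it is reduced."""
    return sorted(records, key=lambda kv: kv[0])


records = [("pear", "1"), ("apple", "1"), ("fig", "1")]
print(map_only(records))
# [('pear', '1'), ('apple', '1'), ('fig', '1')]
print(with_identity_reducer(records))
# [('apple', '1'), ('fig', '1'), ('pear', '1')]
```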


# Improved version of the MapReduce "Hello, World!" application

Taking the previous considerations into account, here is a more efficient version of the "Hello, World!" application. It bypasses the shuffle and sort step and uses `/bin/cat` as an explicit mapper.

In [16]:
%%bash
hdfs dfs -rm -r my_output

mapred streaming \
    -D mapreduce.job.reduces=0 \
    -input hello.txt \
    -output my_output \
    -mapper '/bin/cat'

2024-03-09 20:33:40,205 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum


Deleted my_output


2024-03-09 20:33:41,252 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties


2024-03-09 20:33:41,343 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).


2024-03-09 20:33:41,344 INFO impl.MetricsSystemImpl: JobTracker metrics system started


2024-03-09 20:33:41,356 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2024-03-09 20:33:41,522 INFO mapred.FileInputFormat: Total input files to process : 1


2024-03-09 20:33:41,538 INFO mapreduce.JobSubmitter: number of splits:1


2024-03-09 20:33:41,689 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1030737068_0001


2024-03-09 20:33:41,690 INFO mapreduce.JobSubmitter: Executing with tokens: []


2024-03-09 20:33:41,819 INFO mapreduce.Job: The url to track the job: http://localhost:8080/


2024-03-09 20:33:41,821 INFO mapreduce.Job: Running job: job_local1030737068_0001


2024-03-09 20:33:41,835 INFO mapred.LocalJobRunner: OutputCommitter set in config null


2024-03-09 20:33:41,842 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter


2024-03-09 20:33:41,847 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2024-03-09 20:33:41,847 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2024-03-09 20:33:41,876 INFO mapred.LocalJobRunner: Waiting for map tasks


2024-03-09 20:33:41,880 INFO mapred.LocalJobRunner: Starting task: attempt_local1030737068_0001_m_000000_0


2024-03-09 20:33:41,898 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2024-03-09 20:33:41,899 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2024-03-09 20:33:41,915 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2024-03-09 20:33:41,924 INFO mapred.MapTask: Processing split: file:/home/runner/work/big_data/big_data/hello.txt:0+14


2024-03-09 20:33:41,934 INFO mapred.MapTask: numReduceTasks: 0


2024-03-09 20:33:41,941 INFO streaming.PipeMapRed: PipeMapRed exec [/bin/cat]


2024-03-09 20:33:41,944 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir


2024-03-09 20:33:41,947 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir


2024-03-09 20:33:41,947 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file


2024-03-09 20:33:41,948 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length


2024-03-09 20:33:41,949 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id


2024-03-09 20:33:41,949 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition


2024-03-09 20:33:41,953 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start


2024-03-09 20:33:41,953 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap


2024-03-09 20:33:41,954 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id


2024-03-09 20:33:41,954 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id


2024-03-09 20:33:41,954 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords


2024-03-09 20:33:41,955 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name


2024-03-09 20:33:41,973 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]


2024-03-09 20:33:41,975 INFO streaming.PipeMapRed: Records R/W=1/1


2024-03-09 20:33:41,975 INFO streaming.PipeMapRed: MRErrorThread done


2024-03-09 20:33:41,976 INFO streaming.PipeMapRed: mapRedFinished


2024-03-09 20:33:41,980 INFO mapred.LocalJobRunner: 


2024-03-09 20:33:41,985 INFO mapred.Task: Task:attempt_local1030737068_0001_m_000000_0 is done. And is in the process of committing


2024-03-09 20:33:41,987 INFO mapred.LocalJobRunner: 


2024-03-09 20:33:41,987 INFO mapred.Task: Task attempt_local1030737068_0001_m_000000_0 is allowed to commit now


2024-03-09 20:33:41,988 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1030737068_0001_m_000000_0' to file:/home/runner/work/big_data/big_data/my_output


2024-03-09 20:33:41,990 INFO mapred.LocalJobRunner: Records R/W=1/1


2024-03-09 20:33:41,990 INFO mapred.Task: Task 'attempt_local1030737068_0001_m_000000_0' done.


2024-03-09 20:33:41,997 INFO mapred.Task: Final Counters for attempt_local1030737068_0001_m_000000_0: Counters: 15


	File System Counters


		FILE: Number of bytes read=141437


		FILE: Number of bytes written=785612


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


	Map-Reduce Framework


		Map input records=1


		Map output records=1


		Input split bytes=102


		Spilled Records=0


		Failed Shuffles=0


		Merged Map outputs=0


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=314572800


	File Input Format Counters 


		Bytes Read=14


	File Output Format Counters 


		Bytes Written=27


2024-03-09 20:33:41,998 INFO mapred.LocalJobRunner: Finishing task: attempt_local1030737068_0001_m_000000_0


2024-03-09 20:33:41,999 INFO mapred.LocalJobRunner: map task executor complete.


2024-03-09 20:33:42,843 INFO mapreduce.Job: Job job_local1030737068_0001 running in uber mode : false


2024-03-09 20:33:42,845 INFO mapreduce.Job:  map 100% reduce 0%


2024-03-09 20:33:42,846 INFO mapreduce.Job: Job job_local1030737068_0001 completed successfully


2024-03-09 20:33:42,850 INFO mapreduce.Job: Counters: 15


	File System Counters


		FILE: Number of bytes read=141437


		FILE: Number of bytes written=785612


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


	Map-Reduce Framework


		Map input records=1


		Map output records=1


		Input split bytes=102


		Spilled Records=0


		Failed Shuffles=0


		Merged Map outputs=0


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=314572800


	File Input Format Counters 


		Bytes Read=14


	File Output Format Counters 


		Bytes Written=27


2024-03-09 20:33:42,850 INFO streaming.StreamJob: Output directory: my_output


In [17]:
!hdfs dfs -test -e my_output/_SUCCESS && cat my_output/part-00000

Hello, World!	
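
Note that the output now ends with a tab instead of starting with `0`. According to the Hadoop streaming documentation, each line written by a streaming mapper is split at the first tab into key and value, and a line without a tab becomes the key with an empty value. Since `/bin/cat` echoes `Hello, World!` unchanged, the whole line ends up as the key. The sketch below imitates that rule; `split_key_value` and `format_record` are hypothetical names, and the tab-joined output line is an assumption about the default text output.

```python
def split_key_value(line, sep="\t"):
    """Split a mapper output line at the first separator into (key, value)."""
    key, _, value = line.partition(sep)
    return key, value


def format_record(key, value, sep="\t"):
    """Join key and value the way the output file above shows them."""
    return f"{key}{sep}{value}"


key, value = split_key_value("Hello, World!")
print(repr(format_record(key, value)))  # 'Hello, World!\t'
```

A mapper that wants control over the key can therefore emit `key<TAB>value` lines itself, for example `split_key_value("greeting\tHello, World!")` yields the key `greeting` and the value `Hello, World!`.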
