<a href="https://colab.research.google.com/github/groda/big_data/blob/master/Hadoop_Setting_up_a_Single_Node_Cluster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90"></div></a>

# HDFS and MapReduce on a single-node Hadoop cluster
<br>
<br>

In this tutorial/notebook we'll set up a single-node cluster, following the guidelines outlined on https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html, and then run some elementary HDFS and MapReduce commands on it.

Upon downloading the software, several preliminary steps must be taken, including setting environment variables, generating SSH keys, and more. To streamline these tasks, we've consolidated them under the "Prologue" section.

Upon completion of the prologue, we can launch a single-node Hadoop cluster on the current virtual machine.

Following that, we'll execute a series of test HDFS commands and MapReduce jobs on the Hadoop cluster. These will be performed using a dataset sourced from a publicly available collection.

Finally, we'll proceed to shut down the cluster.


**TABLE OF CONTENTS**
* **[Prologue](#scrollTo=oUuQjW2oNMcJ)**

 * [Check the available Java version](#scrollTo=qFfOrktMPq8M)

 * [Download core Hadoop](#scrollTo=KE7kSYSXQYLf)

   * [Verify the downloaded file](#scrollTo=lGI4TNXPamMr)

 * [Configure `PATH`](#scrollTo=RlgP1ytnRtUK)

 * [Configure `core-site.xml` and `hdfs-site.xml`](#scrollTo=KLmxLQeJSb4A)

 * [Set environment variables](#scrollTo=kXbSKFyeMqr2)

 * [Setup localhost access via SSH key](#scrollTo=k2-Fdp73cF0V)

   * [Install `openssh` and start server](#scrollTo=-Uxmv3RdUwiF)

   * [Generate key](#scrollTo=PYKoSlaENuyG)

   * [Check SSH connection to localhost](#scrollTo=FwA6rKpScnVi)

* **[Launch a single-node Hadoop cluster](#scrollTo=V68C4cDySyek)**

   * [Initialize the namenode](#scrollTo=HTDPwnVlSbHS)

   * [Start cluster](#scrollTo=xMrEiLB_VAeR)

* **[Run some simple HDFS commands](#scrollTo=CKRRbwDFv3ZQ)**

* **[Run some simple MapReduce jobs](#scrollTo=G3KBe4R65bl1)**

   * [Simplest MapReduce job](#scrollTo=yVJA-3jSATGV)

   * [Another MapReduce example: filter a log file](#scrollTo=BbosNo0TD3oH)

   * [Aggregate data with MapReduce](#scrollTo=Sam22f-YT1xR)

* **[Stop cluster](#scrollTo=IF6-Z5RotAcO)**

* **[Concluding remarks](#scrollTo=w5N7tb0HSbZB)**



# Prologue

## Check the available Java version
Apache Hadoop 3.3 and later supports Java 8 and Java 11 (runtime only). See: https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions


Check if Java version is one of `8`, `11`

In [1]:
!java -version

openjdk version "11.0.27" 2025-04-15
OpenJDK Runtime Environment Temurin-11.0.27+6 (build 11.0.27+6)
OpenJDK 64-Bit Server VM Temurin-11.0.27+6 (build 11.0.27+6, mixed mode)


In [2]:
%%bash
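JAVA_VERSION=$(java -version 2>&1 | grep -m1 -Po '(\d+\.)+\d+')
JAVA_MAJOR_VERSION=$(echo "$JAVA_VERSION" | cut -d '.' -f1)
# Java 8 reports itself as 1.8.x, so take the second field in that case
if [[ $JAVA_MAJOR_VERSION -eq 1 ]]; then
  JAVA_MAJOR_VERSION=$(echo "$JAVA_VERSION" | cut -d '.' -f2)
fi
if [[ $JAVA_MAJOR_VERSION -eq 8 || $JAVA_MAJOR_VERSION -eq 11 ]]
 then
 echo "Java version is one of 8, 11 ✓"
 fi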

Java version is one of 8, 11 ✓
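
The same check can be sketched in Python. This is only an illustration: the helper name `java_major_version` is hypothetical, and it assumes the version appears in the form `version "…"` as in the `java -version` output above. It also covers Java 8, which reports itself with the legacy `1.8.x` scheme.

```python
import re

def java_major_version(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    # Assumed format: ... version "11.0.27" ...  or  ... version "1.8.0_292" ...
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Legacy scheme: 1.8 means Java 8
    if first == "1" and second is not None:
        return int(second)
    return int(first)

sample = 'openjdk version "11.0.27" 2025-04-15'
print(java_major_version(sample) in (8, 11))
```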


Find the path for the environment variable `JAVA_HOME`

In [3]:
!readlink -f $(which java)

/usr/lib/jvm/temurin-11-jdk-amd64/bin/java


Extract `JAVA_HOME` from the Java path by removing the `bin/java` part at the end

In [4]:
%%bash
JAVA_HOME=$(readlink -f $(which java) | sed 's/\/bin\/java$//')
echo $JAVA_HOME

/usr/lib/jvm/temurin-11-jdk-amd64
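
A variable set inside a `%%bash` cell only lives in that cell's shell process. If `JAVA_HOME` should be visible to later cells, one option is to derive it in Python and store it in `os.environ`. The following is a minimal sketch under that assumption; the helper name `java_home_from_binary` is hypothetical.

```python
import os
import shutil

def java_home_from_binary(java_binary):
    """Strip the trailing bin/java from a resolved java path."""
    bin_dir = os.path.dirname(java_binary)   # .../bin
    return os.path.dirname(bin_dir)          # candidate for JAVA_HOME

java_path = shutil.which("java")
if java_path is not None:
    # os.path.realpath follows symlinks, similar to `readlink -f`
    os.environ["JAVA_HOME"] = java_home_from_binary(os.path.realpath(java_path))
    print(os.environ["JAVA_HOME"])
```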


## Download core Hadoop
Download the latest stable version of the core Hadoop distribution from one of the download mirror locations listed at https://www.apache.org/dyn/closer.cgi/hadoop/common/.

**Note** with the option `--no-clobber`, `wget` will not download the file if it already exists.

In [5]:
!wget --no-clobber https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz

--2025-07-13 15:33:44--  https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz


Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 965537117 (921M) [application/x-gzip]
Saving to: ‘hadoop-3.4.0.tar.gz’

2025-07-13 15:33:46 (367 MB/s) - ‘hadoop-3.4.0.tar.gz’ saved [965537117/965537117]



Uncompress archive

In [6]:
%%bash
if [ ! -d "hadoop-3.4.0" ]; then
  tar xzf hadoop-3.4.0.tar.gz
fi

### Verify the downloaded file

(see https://www.apache.org/dyn/closer.cgi/hadoop/common/)

Download sha512 file

In [7]:
! wget --no-clobber https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz.sha512

--2025-07-13 15:33:56--  https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz.sha512
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 160 [text/plain]
Saving to: ‘hadoop-3.4.0.tar.gz.sha512’


2025-07-13 15:33:56 (15.1 MB/s) - ‘hadoop-3.4.0.tar.gz.sha512’ saved [160/160]



Compare

In [8]:
%%bash
A=$(sha512sum hadoop-3.4.0.tar.gz | cut - -d' ' -f1)
B=$(cut hadoop-3.4.0.tar.gz.sha512 -d' ' -f4)
printf "%s\n%s\n" $A $B
[[ $A == $B ]] && echo "True"

6f653c0109f97430047bd3677c50da7c8a2809d153b231794cf980b3208a6b4beff8ff1a03a01094299d459a3a37a3fe16731629987165d71f328657dbf2f24c


6f653c0109f97430047bd3677c50da7c8a2809d153b231794cf980b3208a6b4beff8ff1a03a01094299d459a3a37a3fe16731629987165d71f328657dbf2f24c


True
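
The comparison can also be sketched with Python's `hashlib`, reading the archive in chunks so that the whole file is never held in memory. The function names are hypothetical, and the sketch assumes the checksum file contains a single 128-digit hexadecimal SHA-512 digest, as in the file downloaded above.

```python
import hashlib
import re

def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def expected_digest(checksum_text):
    """Extract a 128-digit hex digest from the checksum file content."""
    match = re.search(r"\b([0-9a-fA-F]{128})\b", checksum_text)
    return match.group(1).lower() if match else None

def verify(archive_path, checksum_path):
    """Return True if the archive matches the digest in the checksum file."""
    with open(checksum_path) as handle:
        expected = expected_digest(handle.read())
    return expected is not None and sha512_of_file(archive_path) == expected
```

In this notebook the call would be `verify("hadoop-3.4.0.tar.gz", "hadoop-3.4.0.tar.gz.sha512")`.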


## Configure `PATH`

Add the Hadoop folder to the `PATH` environment variable


In [9]:
!echo $PATH

/opt/hostedtoolcache/Java_Temurin-Hotspot_jdk/11.0.27-6/x64/bin:/opt/hostedtoolcache/Python/3.8.18/x64/bin:/opt/hostedtoolcache/Python/3.8.18/x64:/snap/bin:/home/runner/.local/bin:/opt/pipx_bin:/home/runner/.cargo/bin:/home/runner/.config/composer/vendor/bin:/usr/local/.ghcup/bin:/home/runner/.dotnet/tools:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin


In [10]:
import os
os.environ['HADOOP_HOME'] = os.path.join(os.getcwd(), 'hadoop-3.4.0')
os.environ['PATH'] = ':'.join([os.path.join(os.environ['HADOOP_HOME'], 'bin'), os.environ['PATH']])
#os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'

In [11]:
import os
print(os.environ)

environ({'LANG': 'C.UTF-8', 'PATH': '/home/runner/work/big_data/big_data/hadoop-3.4.0/bin:/opt/hostedtoolcache/Java_Temurin-Hotspot_jdk/11.0.27-6/x64/bin:/opt/hostedtoolcache/Python/3.8.18/x64/bin:/opt/hostedtoolcache/Python/3.8.18/x64:/snap/bin:/home/runner/.local/bin:/opt/pipx_bin:/home/runner/.cargo/bin:/home/runner/.config/composer/vendor/bin:/usr/local/.ghcup/bin:/home/runner/.dotnet/tools:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'MAIL': '/var/mail/root', 'LOGNAME': 'root', 'USER': 'root', 'HOME': '/root', 'SHELL': '/bin/bash', 'TERM': 'xterm-color', 'SUDO_COMMAND': '/usr/bin/env PATH=/opt/hostedtoolcache/Java_Temurin-Hotspot_jdk/11.0.27-6/x64/bin:/opt/hostedtoolcache/Python/3.8.18/x64/bin:/opt/hostedtoolcache/Python/3.8.18/x64:/snap/bin:/home/runner/.local/bin:/opt/pipx_bin:/home/runner/.cargo/bin:/home/runner/.config/composer/vendor/bin:/usr/local/.ghcup/bin:/home/runner/.dotnet/tools:/usr/local/sbin:/usr/local/bin:/usr

In [12]:
!echo $PATH

/home/runner/work/big_data/big_data/hadoop-3.4.0/bin:/opt/hostedtoolcache/Java_Temurin-Hotspot_jdk/11.0.27-6/x64/bin:/opt/hostedtoolcache/Python/3.8.18/x64/bin:/opt/hostedtoolcache/Python/3.8.18/x64:/snap/bin:/home/runner/.local/bin:/opt/pipx_bin:/home/runner/.cargo/bin:/home/runner/.config/composer/vendor/bin:/usr/local/.ghcup/bin:/home/runner/.dotnet/tools:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
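
Running the cell that sets `PATH` a second time prepends the Hadoop `bin` folder again. A possible idempotent variant is sketched below; the helper `prepend_to_path` is hypothetical and not part of the original notebook.

```python
import os

def prepend_to_path(path_value, new_dir, sep=os.pathsep):
    """Put new_dir first in a PATH-like string, dropping earlier copies of it."""
    entries = [entry for entry in path_value.split(sep) if entry and entry != new_dir]
    return sep.join([new_dir] + entries)

# Example with a made-up PATH value
print(prepend_to_path("/usr/bin:/bin", "/opt/hadoop/bin", sep=":"))
```

With this helper the assignment would read `os.environ['PATH'] = prepend_to_path(os.environ['PATH'], os.path.join(os.environ['HADOOP_HOME'], 'bin'))`.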


## Configure `core-site.xml` and `hdfs-site.xml`

Edit the files `etc/hadoop/core-site.xml` and `etc/hadoop/hdfs-site.xml` to configure pseudo-distributed operation.

**`etc/hadoop/core-site.xml`**
```
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```

**`etc/hadoop/hdfs-site.xml`**
```
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```

In [13]:
%%bash
echo -e "<configuration> \n\
    <property> \n\
        <name>fs.defaultFS</name> \n\
        <value>hdfs://localhost:9000</value> \n\
    </property> \n\
</configuration>" >hadoop-3.4.0/etc/hadoop/core-site.xml

echo -e "<configuration> \n\
    <property> \n\
        <name>dfs.replication</name> \n\
        <value>1</value> \n\
    </property> \n\
</configuration>" >hadoop-3.4.0/etc/hadoop/hdfs-site.xml

Check

In [14]:
cat hadoop-3.4.0/etc/hadoop/hdfs-site.xml

<configuration> 
    <property> 
        <name>dfs.replication</name> 
        <value>1</value> 
    </property> 
</configuration>
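
As an alternative to `echo -e`, the two files could be generated with Python's `xml.etree.ElementTree`, which takes care of escaping special characters in the values. This is a sketch only: the function name `hadoop_config_xml` is hypothetical and the output is not pretty-printed.

```python
import xml.etree.ElementTree as ET

def hadoop_config_xml(properties):
    """Build a <configuration> document from a dict of property names and values."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# The two settings used in this notebook
print(hadoop_config_xml({"fs.defaultFS": "hdfs://localhost:9000"}))
print(hadoop_config_xml({"dfs.replication": 1}))
```

The returned strings would then be written to `hadoop-3.4.0/etc/hadoop/core-site.xml` and `hadoop-3.4.0/etc/hadoop/hdfs-site.xml`.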


## Set environment variables

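Write the following lines to the Hadoop configuration script `hadoop-env.sh` (the script is in `hadoop-3.4.0/etc/hadoop`). The cell below first saves a copy of the original script as `hadoop-env.sh.org` and then replaces its content with these lines.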
```
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
```

In [15]:
%%bash
cp -n hadoop-3.4.0/etc/hadoop/hadoop-env.sh hadoop-3.4.0/etc/hadoop/hadoop-env.sh.org
cat <<😃 >hadoop-3.4.0/etc/hadoop/hadoop-env.sh
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
😃



## Setup localhost access via SSH key

We are going to allow passphraseless access to `localhost` with a secure key.

SSH must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.



### Install `openssh` and start server

The option `StrictHostKeyChecking no` is a client-side setting (it goes into `ssh_config`, not `sshd_config`). By default, `ssh` asks for confirmation before connecting to a host whose key is not yet listed in `~/.ssh/known_hosts`, and it refuses to connect if the stored key of a host has changed. With `StrictHostKeyChecking no`, new host keys are added to `known_hosts` without asking, so the Hadoop scripts can reach `localhost` over SSH without an interactive prompt. I guess this option is needed since the notebook runs non-interactively and a fresh virtual machine has no `known_hosts` entry for `localhost`.

If the problem is a changed host key, one could alternatively delete the file `~/.ssh/known_hosts` or else use `ssh-keygen -R hostname` to remove all keys belonging to hostname from the `known_hosts` file (see for instance [How to remove strict RSA key checking in SSH and what's the problem here?](https://serverfault.com/questions/6233/how-to-remove-strict-rsa-key-checking-in-ssh-and-whats-the-problem-here) or [Remove key from known_hosts](https://superuser.com/questions/30087/remove-key-from-known-hosts)). The option `ssh-keygen -R hostname` would be the most appropriate in a production setting where the file `~/.ssh/known_hosts` might contain other entries that you do not want to delete.


In [16]:
%%bash
sudo apt-get update
sudo apt-get -y install openssh-server
echo 'StrictHostKeyChecking no' >> /etc/ssh/ssh_config
sudo /etc/init.d/ssh restart

Get:1 file:/etc/apt/apt-mirrors.txt Mirrorlist [144 B]


Hit:2 http://azure.archive.ubuntu.com/ubuntu noble InRelease


Hit:3 http://azure.archive.ubuntu.com/ubuntu noble-updates InRelease


Hit:4 http://azure.archive.ubuntu.com/ubuntu noble-backports InRelease


Hit:5 http://azure.archive.ubuntu.com/ubuntu noble-security InRelease


Hit:6 https://packages.microsoft.com/repos/azure-cli noble InRelease


Hit:7 https://packages.microsoft.com/ubuntu/24.04/prod noble InRelease


Reading package lists...


Reading package lists...


Building dependency tree...


Reading state information...


openssh-server is already the newest version (1:9.6p1-3ubuntu13.12).


0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.


Restarting ssh (via systemctl): ssh.service.


### Generate key
Generate an SSH key that does not require a password.

The private key is contained in the file `id_rsa` located in the folder `~/.ssh`.

The public key is added to the file `~/.ssh/authorized_keys` in order to allow authentication with that key.

In [17]:
%%bash
rm $HOME/.ssh/id_rsa
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

rm: cannot remove '/root/.ssh/id_rsa': No such file or directory


Generating public/private rsa key pair.


Your identification has been saved in /root/.ssh/id_rsa


Your public key has been saved in /root/.ssh/id_rsa.pub


The key fingerprint is:


SHA256:2pr28+wRpR9CSGvlONDu37pX3vpo6a8I3dgHO8fSETs root@fv-az1277-496


The key's randomart image is:


+---[RSA 3072]----+


|      ... .      |


|       o.*       |


|       .* o .  . |


|       ..o o    o|


|       .S + . .E |


|       o.  = =.=o|


|      . ..o.+o*o=|


|      .o...oo.+*.|


|     .o..+*+.o==o|


+----[SHA256]-----+


### Check SSH connection to localhost

The following command should output "hi 👋" if the connection works.

In [18]:
!ssh localhost "echo hi 👋"



hi 👋


# Launch a single-node Hadoop cluster

## Initialize the namenode

In [19]:
!hdfs namenode -format -nonInteractive



2025-07-13 15:34:03,282 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = fv-az1277-496/10.1.0.42
STARTUP_MSG:   args = [-format, -nonInteractive]
STARTUP_MSG:   version = 3.4.0
STARTUP_MSG:   classpath = /home/runner/work/big_data/big_data/hadoop-3.4.0/etc/hadoop:/home/runner/work/big_data/big_data/hadoop-3.4.0/share/hadoop/common/lib/jetty-security-9.4.53.v20231009.jar:/home/runner/work/big_data/big_data/hadoop-3.4.0/share/hadoop/common/lib/kerb-common-2.0.3.jar:/home/runner/work/big_data/big_data/hadoop-3.4.0/share/hadoop/common/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/home/runner/work/big_data/big_data/hadoop-3.4.0/share/hadoop/common/lib/netty-all-4.1.100.Final.jar:/home/runner/work/big_data/big_data/hadoop-3.4.0/share/hadoop/common/lib/commons-compress-1.24.0.jar:/home/runner/work/big_data/big_data/hadoop-3.4.0/share/hadoop/common/lib/jakarta.act

2025-07-13 15:34:03,387 INFO namenode.NameNode: createNameNode [-format, -nonInteractive]


2025-07-13 15:34:04,018 INFO namenode.NameNode: Formatting using clusterid: CID-84e57919-64e3-4d57-a8cd-573dc564ff7c


2025-07-13 15:34:04,034 INFO namenode.FSEditLog: Edit logging is async:true
2025-07-13 15:34:04,072 INFO namenode.FSNamesystem: KeyProvider: null
2025-07-13 15:34:04,073 INFO namenode.FSNamesystem: fsLock is fair: true
2025-07-13 15:34:04,073 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false


2025-07-13 15:34:04,168 INFO namenode.FSNamesystem: fsOwner                = root (auth:SIMPLE)
2025-07-13 15:34:04,168 INFO namenode.FSNamesystem: supergroup             = supergroup
2025-07-13 15:34:04,168 INFO namenode.FSNamesystem: isPermissionEnabled    = true
2025-07-13 15:34:04,168 INFO namenode.FSNamesystem: isStoragePolicyEnabled = true
2025-07-13 15:34:04,168 INFO namenode.FSNamesystem: HA Enabled: false


2025-07-13 15:34:04,202 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling


2025-07-13 15:34:04,277 INFO blockmanagement.DatanodeManager: Slow peers collection thread shutdown


2025-07-13 15:34:04,284 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit : configured=1000, counted=60, effected=1000
2025-07-13 15:34:04,284 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2025-07-13 15:34:04,286 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2025-07-13 15:34:04,286 INFO blockmanagement.BlockManager: The block deletion will start around 2025 Jul 13 15:34:04
2025-07-13 15:34:04,288 INFO util.GSet: Computing capacity for map BlocksMap
2025-07-13 15:34:04,288 INFO util.GSet: VM type       = 64-bit
2025-07-13 15:34:04,289 INFO util.GSet: 2.0% max memory 3.9 GB = 80 MB
2025-07-13 15:34:04,289 INFO util.GSet: capacity      = 2^23 = 8388608 entries
2025-07-13 15:34:04,296 INFO blockmanagement.BlockManager: Storage policy satisfier is disabled
2025-07-13 15:34:04,296 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2025

2025-07-13 15:34:04,338 INFO namenode.FSDirectory: GLOBAL serial map: bits=29 maxEntries=536870911
2025-07-13 15:34:04,338 INFO namenode.FSDirectory: USER serial map: bits=24 maxEntries=16777215
2025-07-13 15:34:04,338 INFO namenode.FSDirectory: GROUP serial map: bits=24 maxEntries=16777215
2025-07-13 15:34:04,338 INFO namenode.FSDirectory: XATTR serial map: bits=24 maxEntries=16777215
2025-07-13 15:34:04,344 INFO util.GSet: Computing capacity for map INodeMap
2025-07-13 15:34:04,344 INFO util.GSet: VM type       = 64-bit
2025-07-13 15:34:04,345 INFO util.GSet: 1.0% max memory 3.9 GB = 40 MB
2025-07-13 15:34:04,345 INFO util.GSet: capacity      = 2^22 = 4194304 entries
2025-07-13 15:34:04,347 INFO namenode.FSDirectory: ACLs enabled? true
2025-07-13 15:34:04,347 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2025-07-13 15:34:04,347 INFO namenode.FSDirectory: XAttrs enabled? true
2025-07-13 15:34:04,347 INFO namenode.NameNode: Caching file names occurring more 

2025-07-13 15:34:04,406 INFO common.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.
2025-07-13 15:34:04,434 INFO namenode.FSImageFormatProtobuf: Saving image file /tmp/hadoop-root/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression


2025-07-13 15:34:04,495 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-root/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 396 bytes saved in 0 seconds .
2025-07-13 15:34:04,505 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2025-07-13 15:34:04,510 INFO blockmanagement.DatanodeManager: Slow peers collection thread shutdown
2025-07-13 15:34:04,532 INFO namenode.FSNamesystem: Stopping services started for active state
2025-07-13 15:34:04,532 INFO namenode.FSNamesystem: Stopping services started for standby state
2025-07-13 15:34:04,535 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2025-07-13 15:34:04,535 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at fv-az1277-496/10.1.0.42
************************************************************/


## Start cluster

In [20]:
!$HADOOP_HOME/sbin/start-dfs.sh

Starting namenodes on [localhost]


Starting datanodes


Starting secondary namenodes [fv-az1277-496]




In [21]:
%%bash
# Check if HDFS is in safe mode
if hdfs dfsadmin -safemode get | grep 'ON'; then
  echo "Namenode is in safe mode. Leaving safe mode..."
  hdfs dfsadmin -safemode leave
else
  echo "Namenode is not in safe mode."
fi

Namenode is not in safe mode.
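
The safe-mode check can also be expressed in Python by parsing the command output. The sketch below assumes that `hdfs dfsadmin -safemode get` reports the state on a line starting with `Safe mode is ON` or `Safe mode is OFF`; the function names are hypothetical, and `query_safe_mode` requires the `hdfs` command on the `PATH`.

```python
import subprocess

def safe_mode_is_on(dfsadmin_output):
    """True if any line reports 'Safe mode is ON' (assumed output format)."""
    return any(line.strip().startswith("Safe mode is ON")
               for line in dfsadmin_output.splitlines())

def query_safe_mode():
    """Run the dfsadmin command and interpret its output (not called here)."""
    result = subprocess.run(["hdfs", "dfsadmin", "-safemode", "get"],
                            capture_output=True, text=True)
    return safe_mode_is_on(result.stdout)

print(safe_mode_is_on("Safe mode is OFF\n"))
```

Matching the whole phrase instead of just `ON` avoids false positives from unrelated lines that happen to contain those two letters.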


# Run some simple HDFS commands

In [22]:
%%bash
# create directory "my_dir" in HDFS home
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root # this is the "home" of user root on HDFS
hdfs dfs -mkdir my_dir

# if sample_data does not exist, create it (so that the notebook can also run outside of Colab)
mkdir -p sample_data
touch sample_data/mnist_test.csv

# Check if the file is empty and fill it if needed
if [ ! -s sample_data/mnist_test.csv ]; then
  echo -e "0 1 2 3 4\n5 6 7 8 9" > sample_data/mnist_test.csv
fi


# upload file mnist_test.csv to my_dir
hdfs dfs -put sample_data/mnist_test.csv my_dir/

# show contents of directory my_dir
hdfs dfs -ls -h my_dir

Found 1 items


-rw-r--r--   1 root supergroup         20 2025-07-13 15:34 my_dir/mnist_test.csv
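
When the listing needs to be processed further, each line can be split into its fields. The sketch below works for lines like the one printed above, where the size is a plain number; sizes printed with a unit by `-h` would need extra handling. The names `HdfsEntry` and `parse_ls_line` are hypothetical.

```python
from collections import namedtuple

HdfsEntry = namedtuple(
    "HdfsEntry", "permissions replication owner group size modified path")

def parse_ls_line(line):
    """Split one `hdfs dfs -ls` line into fields; return None for other lines."""
    parts = line.split(None, 7)   # at most 8 fields, the path comes last
    if len(parts) != 8:
        return None               # e.g. the "Found 1 items" header
    perms, repl, owner, group, size, date, time, path = parts
    return HdfsEntry(perms, repl, owner, group, size, date + " " + time, path)

line = "-rw-r--r--   1 root supergroup         20 2025-07-13 15:34 my_dir/mnist_test.csv"
print(parse_ls_line(line))
```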


# Run some simple MapReduce jobs

We'll employ the [streaming](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html) library, which broadens our options by enabling the use of any programming language for both the mapper and/or the reducer.

With this utility, any executable, or any file containing code that the operating system can interpret and execute directly, can serve as mapper and/or reducer.
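
To illustrate what such a mapper and reducer look like, here is a minimal word-count sketch in Python. It is not one of the jobs run in this notebook, and the function names are hypothetical. Streaming passes records as text lines, with key and value separated by a tab, and the reducer receives its input sorted by key; the `sorted` call below stands in for that step when trying the logic locally.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"

# Local dry run: sorted() plays the role of Hadoop's shuffle and sort
sample = ["0 1 2 3 4", "5 6 7 8 9", "0 1"]
for out in reduce_lines(sorted(map_lines(sample))):
    print(out)
```

To use the two functions in a real streaming job, each would go into its own executable script that reads `sys.stdin` and prints the results to standard output.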

## Simplest MapReduce job

This is a "no-code" example since we are going to use the existing Unix commands `cat` and `wc` as mapper and reducer, respectively. The result will show a line with three values: the counts of lines, words, and bytes that `wc` receives as input.

Since we are running as user `root`, the input folder is `/user/root/my_dir/` and the output folder is `/user/root/output_simplest`.

**Note**: the output folder must not exist before the job starts because it is created by Hadoop (this is in accordance with Hadoop's principle of not overwriting data).

Now run the MapReduce job

In [23]:
%%bash

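# Remove the output folder of a previous run, if any.
# Note: if -rm fails (e.g. the folder does not exist), the fallback tries to re-format
# the namenode; this is refused while the namenode is running.
hdfs dfs -rm -r output_simplest || hdfs namenode -format -nonInteractive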
mapred streaming \
  -input my_dir \
  -output output_simplest \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

rm: `output_simplest': No such file or directory


namenode is running as process 3461.  Stop it first and ensure /tmp/hadoop-root-namenode.pid file is empty before retry.


2025-07-13 15:34:26,243 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties


2025-07-13 15:34:26,320 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).


2025-07-13 15:34:26,320 INFO impl.MetricsSystemImpl: JobTracker metrics system started


2025-07-13 15:34:26,331 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2025-07-13 15:34:26,545 INFO mapred.FileInputFormat: Total input files to process : 1


2025-07-13 15:34:26,553 INFO mapreduce.JobSubmitter: number of splits:1


2025-07-13 15:34:26,912 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local932099876_0001


2025-07-13 15:34:26,912 INFO mapreduce.JobSubmitter: Executing with tokens: []


2025-07-13 15:34:27,009 INFO mapreduce.Job: The url to track the job: http://localhost:8080/


2025-07-13 15:34:27,010 INFO mapreduce.Job: Running job: job_local932099876_0001


2025-07-13 15:34:27,010 INFO mapred.LocalJobRunner: OutputCommitter set in config null


2025-07-13 15:34:27,011 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter


2025-07-13 15:34:27,016 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2025-07-13 15:34:27,016 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2025-07-13 15:34:27,046 INFO mapred.LocalJobRunner: Waiting for map tasks


2025-07-13 15:34:27,047 INFO mapred.LocalJobRunner: Starting task: attempt_local932099876_0001_m_000000_0


2025-07-13 15:34:27,065 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2025-07-13 15:34:27,065 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2025-07-13 15:34:27,080 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2025-07-13 15:34:27,086 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/my_dir/mnist_test.csv:0+20


2025-07-13 15:34:27,107 INFO mapred.MapTask: numReduceTasks: 1


2025-07-13 15:34:27,126 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)


2025-07-13 15:34:27,126 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100


2025-07-13 15:34:27,126 INFO mapred.MapTask: soft limit at 83886080


2025-07-13 15:34:27,126 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600


2025-07-13 15:34:27,126 INFO mapred.MapTask: kvstart = 26214396; length = 6553600


2025-07-13 15:34:27,128 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer


2025-07-13 15:34:27,130 INFO streaming.PipeMapRed: PipeMapRed exec [/bin/cat]


2025-07-13 15:34:27,134 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir


2025-07-13 15:34:27,136 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir


2025-07-13 15:34:27,136 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file


2025-07-13 15:34:27,136 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length


2025-07-13 15:34:27,136 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id


2025-07-13 15:34:27,137 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition


2025-07-13 15:34:27,138 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start


2025-07-13 15:34:27,138 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap


2025-07-13 15:34:27,138 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id


2025-07-13 15:34:27,138 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id


2025-07-13 15:34:27,139 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords


2025-07-13 15:34:27,139 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name


2025-07-13 15:34:27,213 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]


2025-07-13 15:34:27,213 INFO streaming.PipeMapRed: MRErrorThread done


2025-07-13 15:34:27,214 INFO streaming.PipeMapRed: Records R/W=2/1


2025-07-13 15:34:27,214 INFO streaming.PipeMapRed: mapRedFinished


2025-07-13 15:34:27,216 INFO mapred.LocalJobRunner: 


2025-07-13 15:34:27,216 INFO mapred.MapTask: Starting flush of map output


2025-07-13 15:34:27,216 INFO mapred.MapTask: Spilling map output


2025-07-13 15:34:27,216 INFO mapred.MapTask: bufstart = 0; bufend = 22; bufvoid = 104857600


2025-07-13 15:34:27,216 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214392(104857568); length = 5/6553600


2025-07-13 15:34:27,221 INFO mapred.MapTask: Finished spill 0


2025-07-13 15:34:27,229 INFO mapred.Task: Task:attempt_local932099876_0001_m_000000_0 is done. And is in the process of committing


2025-07-13 15:34:27,232 INFO mapred.LocalJobRunner: Records R/W=2/1


2025-07-13 15:34:27,232 INFO mapred.Task: Task 'attempt_local932099876_0001_m_000000_0' done.


2025-07-13 15:34:27,237 INFO mapred.Task: Final Counters for attempt_local932099876_0001_m_000000_0: Counters: 23


	File System Counters


		FILE: Number of bytes read=141934


		FILE: Number of bytes written=856567


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


		HDFS: Number of bytes read=20


		HDFS: Number of bytes written=0


		HDFS: Number of read operations=5


		HDFS: Number of large read operations=0


		HDFS: Number of write operations=1


		HDFS: Number of bytes read erasure-coded=0


	Map-Reduce Framework


		Map input records=2


		Map output records=2


		Map output bytes=22


		Map output materialized bytes=32


		Input split bytes=105


		Combine input records=0


		Spilled Records=2


		Failed Shuffles=0


		Merged Map outputs=0


		GC time elapsed (ms)=5


		Total committed heap usage (bytes)=157286400


	File Input Format Counters 


		Bytes Read=20


2025-07-13 15:34:27,237 INFO mapred.LocalJobRunner: Finishing task: attempt_local932099876_0001_m_000000_0


2025-07-13 15:34:27,237 INFO mapred.LocalJobRunner: map task executor complete.


2025-07-13 15:34:27,240 INFO mapred.LocalJobRunner: Waiting for reduce tasks


2025-07-13 15:34:27,240 INFO mapred.LocalJobRunner: Starting task: attempt_local932099876_0001_r_000000_0


2025-07-13 15:34:27,246 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2025-07-13 15:34:27,246 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2025-07-13 15:34:27,246 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2025-07-13 15:34:27,248 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@7ebd293c


2025-07-13 15:34:27,249 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2025-07-13 15:34:27,265 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2936012800, maxSingleShuffleLimit=734003200, mergeThreshold=1937768576, ioSortFactor=10, memToMemMergeOutputsThreshold=10


2025-07-13 15:34:27,270 INFO reduce.EventFetcher: attempt_local932099876_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events


2025-07-13 15:34:27,300 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local932099876_0001_m_000000_0 decomp: 28 len: 32 to MEMORY


2025-07-13 15:34:27,303 INFO reduce.InMemoryMapOutput: Read 28 bytes from map-output for attempt_local932099876_0001_m_000000_0


2025-07-13 15:34:27,305 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 28, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->28


2025-07-13 15:34:27,306 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning


2025-07-13 15:34:27,307 INFO mapred.LocalJobRunner: 1 / 1 copied.


2025-07-13 15:34:27,307 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs


2025-07-13 15:34:27,311 INFO mapred.Merger: Merging 1 sorted segments


2025-07-13 15:34:27,311 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes


2025-07-13 15:34:27,313 INFO reduce.MergeManagerImpl: Merged 1 segments, 28 bytes to disk to satisfy reduce memory limit


2025-07-13 15:34:27,314 INFO reduce.MergeManagerImpl: Merging 1 files, 32 bytes from disk


2025-07-13 15:34:27,314 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce


2025-07-13 15:34:27,314 INFO mapred.Merger: Merging 1 sorted segments


2025-07-13 15:34:27,315 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes


2025-07-13 15:34:27,316 INFO mapred.LocalJobRunner: 1 / 1 copied.


2025-07-13 15:34:27,317 INFO streaming.PipeMapRed: PipeMapRed exec [/usr/bin/wc]


2025-07-13 15:34:27,320 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address


2025-07-13 15:34:27,321 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps


2025-07-13 15:34:27,352 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]


2025-07-13 15:34:27,355 INFO streaming.PipeMapRed: MRErrorThread done


2025-07-13 15:34:27,355 INFO streaming.PipeMapRed: Records R/W=2/1


2025-07-13 15:34:27,356 INFO streaming.PipeMapRed: mapRedFinished


2025-07-13 15:34:27,395 INFO mapred.Task: Task:attempt_local932099876_0001_r_000000_0 is done. And is in the process of committing


2025-07-13 15:34:27,397 INFO mapred.LocalJobRunner: 1 / 1 copied.


2025-07-13 15:34:27,398 INFO mapred.Task: Task attempt_local932099876_0001_r_000000_0 is allowed to commit now


2025-07-13 15:34:27,411 INFO output.FileOutputCommitter: Saved output of task 'attempt_local932099876_0001_r_000000_0' to hdfs://localhost:9000/user/root/output_simplest


2025-07-13 15:34:27,412 INFO mapred.LocalJobRunner: Records R/W=2/1 > reduce


2025-07-13 15:34:27,412 INFO mapred.Task: Task 'attempt_local932099876_0001_r_000000_0' done.


2025-07-13 15:34:27,413 INFO mapred.Task: Final Counters for attempt_local932099876_0001_r_000000_0: Counters: 30


	File System Counters


		FILE: Number of bytes read=142030


		FILE: Number of bytes written=856599


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


		HDFS: Number of bytes read=20


		HDFS: Number of bytes written=25


		HDFS: Number of read operations=10


		HDFS: Number of large read operations=0


		HDFS: Number of write operations=3


		HDFS: Number of bytes read erasure-coded=0


	Map-Reduce Framework


		Combine input records=0


		Combine output records=0


		Reduce input groups=2


		Reduce shuffle bytes=32


		Reduce input records=2


		Reduce output records=1


		Spilled Records=2


		Shuffled Maps =1


		Failed Shuffles=0


		Merged Map outputs=1


		GC time elapsed (ms)=5


		Total committed heap usage (bytes)=157286400


	Shuffle Errors


		BAD_ID=0


		CONNECTION=0


		IO_ERROR=0


		WRONG_LENGTH=0


		WRONG_MAP=0


		WRONG_REDUCE=0


	File Output Format Counters 


		Bytes Written=25


2025-07-13 15:34:27,413 INFO mapred.LocalJobRunner: Finishing task: attempt_local932099876_0001_r_000000_0


2025-07-13 15:34:27,417 INFO mapred.LocalJobRunner: reduce task executor complete.


2025-07-13 15:34:28,012 INFO mapreduce.Job: Job job_local932099876_0001 running in uber mode : false


2025-07-13 15:34:28,013 INFO mapreduce.Job:  map 100% reduce 100%


2025-07-13 15:34:28,015 INFO mapreduce.Job: Job job_local932099876_0001 completed successfully


2025-07-13 15:34:28,021 INFO mapreduce.Job: Counters: 36


	File System Counters


		FILE: Number of bytes read=283964


		FILE: Number of bytes written=1713166


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


		HDFS: Number of bytes read=40


		HDFS: Number of bytes written=25


		HDFS: Number of read operations=15


		HDFS: Number of large read operations=0


		HDFS: Number of write operations=4


		HDFS: Number of bytes read erasure-coded=0


	Map-Reduce Framework


		Map input records=2


		Map output records=2


		Map output bytes=22


		Map output materialized bytes=32


		Input split bytes=105


		Combine input records=0


		Combine output records=0


		Reduce input groups=2


		Reduce shuffle bytes=32


		Reduce input records=2


		Reduce output records=1


		Spilled Records=4


		Shuffled Maps =1


		Failed Shuffles=0


		Merged Map outputs=1


		GC time elapsed (ms)=10


		Total committed heap usage (bytes)=314572800


	Shuffle Errors


		BAD_ID=0


		CONNECTION=0


		IO_ERROR=0


		WRONG_LENGTH=0


		WRONG_MAP=0


		WRONG_REDUCE=0


	File Input Format Counters 


		Bytes Read=20


	File Output Format Counters 


		Bytes Written=25


2025-07-13 15:34:28,021 INFO streaming.StreamJob: Output directory: output_simplest


If the output directory (here `output_simplest`) contains the empty file `_SUCCESS`, the job completed successfully.

Check the output of the MapReduce job.

In [24]:
!hdfs dfs -cat output_simplest/part-00000

      2      10      22	


`wc` prints the number of lines, words, and bytes it read: here 2 lines, 10 words, and 22 bytes. The byte count equals the `Map output bytes=22` counter in the log above, because the reducer reads the map output rather than the 20-byte input file.
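For readers less familiar with `wc`, its three output columns are the line, word, and byte counts. The following is a minimal Python sketch of the same counting; `wc_counts` is a hypothetical helper, and the sample bytes are made up for illustration rather than taken from the job's input.

```python
def wc_counts(data: bytes) -> tuple[int, int, int]:
    """Return (lines, words, bytes) the way `wc` reports them."""
    lines = data.count(b"\n")   # wc counts newline characters
    words = len(data.split())   # runs of non-whitespace bytes
    return lines, words, len(data)


# Hypothetical input: two lines with one word each
print(wc_counts(b"hello\nworld\n"))  # (2, 2, 12)
```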

## Another MapReduce example: filter a log file

We're going to use a Linux logfile and look for the string `sshd` in a given position. The file comes from [Loghub](https://github.com/logpai/loghub), a freely available collection of system logs for AI-driven log analytics research.

The mapper `mapper.py` keeps only the lines whose fifth whitespace-separated field (index 4 in Python) starts with `sshd`.

The job has no reducer (option `-reducer NONE`). Note that without a reducer the sorting and shuffling phase after the map phase is skipped.
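To make the field position concrete, here is a small sketch of the filter condition as a standalone function. The name `is_sshd_line` is an assumption introduced for illustration. The first sample line is copied from `Linux_2k.log`; the second one is hypothetical.

```python
def is_sshd_line(line: str) -> bool:
    """True if the fifth whitespace-separated field starts with 'sshd'."""
    fields = line.split()
    return len(fields) >= 5 and fields[4].startswith("sshd")


sample = "Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown"

print(sample.split()[4])     # sshd(pam_unix)[19937]:
print(is_sshd_line(sample))  # True
# Hypothetical line from another process: filtered out
print(is_sshd_line("Jun 14 15:16:02 combo cron[1]: hypothetical entry"))  # False
```

Because the condition looks at one line at a time and needs no grouping by key, a map-only job is sufficient for this kind of filter.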


Download the logfile `Linux_2k.log`:

In [25]:
!wget --no-clobber https://raw.githubusercontent.com/logpai/loghub/master/Linux/Linux_2k.log

--2025-07-13 15:34:29--  https://raw.githubusercontent.com/logpai/loghub/master/Linux/Linux_2k.log
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 216485 (211K) [text/plain]
Saving to: ‘Linux_2k.log’

2025-07-13 15:34:30 (49.5 MB/s) - ‘Linux_2k.log’ saved [216485/216485]



In [26]:
%%bash
hdfs dfs -mkdir input || true
hdfs dfs -put Linux_2k.log input/ || true

Define the mapper

In [27]:
%%writefile mapper.py
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    # split the line into whitespace-separated fields
    fields = line.split()
    # keep the line if the fifth field (the process name) starts with 'sshd'
    if len(fields) >= 5 and fields[4].startswith('sshd'):
        print(line)


Writing mapper.py


Set the permissions so that the script can be executed and tested.

In [28]:
!chmod 700 mapper.py

Look at the first 10 lines

In [29]:
!head -10 Linux_2k.log

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04

Test the mapper in the shell (not using MapReduce):

In [30]:
!head -100 Linux_2k.log| ./mapper.py

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo

Now run the MapReduce job on the pseudo-cluster

In [31]:
%%bash

hdfs dfs -rm -r output_filter

mapred streaming \
  -file mapper.py \
  -input input \
  -output output_filter \
  -mapper mapper.py \
  -reducer NONE


rm: `output_filter': No such file or directory


2025-07-13 15:34:36,675 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.


packageJobJar: [mapper.py] [] /tmp/streamjob4722781552366300910.jar tmpDir=null


2025-07-13 15:34:37,314 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties


2025-07-13 15:34:37,394 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).


2025-07-13 15:34:37,394 INFO impl.MetricsSystemImpl: JobTracker metrics system started


2025-07-13 15:34:37,405 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2025-07-13 15:34:37,607 INFO mapred.FileInputFormat: Total input files to process : 1


2025-07-13 15:34:37,616 INFO mapreduce.JobSubmitter: number of splits:1


2025-07-13 15:34:37,758 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local623326603_0001


2025-07-13 15:34:37,758 INFO mapreduce.JobSubmitter: Executing with tokens: []


2025-07-13 15:34:37,904 INFO mapred.LocalDistributedCacheManager: Localized file:/home/runner/work/big_data/big_data/mapper.py as file:/tmp/hadoop-root/mapred/local/job_local623326603_0001_be014539-6c75-43d6-a513-5be3f3b851b3/mapper.py


2025-07-13 15:34:37,963 INFO mapreduce.Job: The url to track the job: http://localhost:8080/


2025-07-13 15:34:37,964 INFO mapred.LocalJobRunner: OutputCommitter set in config null


2025-07-13 15:34:37,965 INFO mapreduce.Job: Running job: job_local623326603_0001


2025-07-13 15:34:37,965 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter


2025-07-13 15:34:37,971 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2025-07-13 15:34:37,971 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2025-07-13 15:34:38,015 INFO mapred.LocalJobRunner: Waiting for map tasks


2025-07-13 15:34:38,018 INFO mapred.LocalJobRunner: Starting task: attempt_local623326603_0001_m_000000_0


2025-07-13 15:34:38,043 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2025-07-13 15:34:38,043 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2025-07-13 15:34:38,060 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2025-07-13 15:34:38,074 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/input/Linux_2k.log:0+216485


2025-07-13 15:34:38,095 INFO mapred.MapTask: numReduceTasks: 0


2025-07-13 15:34:38,133 INFO streaming.PipeMapRed: PipeMapRed exec [/home/runner/work/big_data/big_data/./mapper.py]


2025-07-13 15:34:38,139 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir


2025-07-13 15:34:38,140 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir


2025-07-13 15:34:38,140 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file


2025-07-13 15:34:38,140 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length


2025-07-13 15:34:38,140 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id


2025-07-13 15:34:38,141 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition


2025-07-13 15:34:38,142 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start


2025-07-13 15:34:38,142 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap


2025-07-13 15:34:38,142 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id


2025-07-13 15:34:38,142 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id


2025-07-13 15:34:38,143 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords


2025-07-13 15:34:38,143 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name


2025-07-13 15:34:38,191 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]


2025-07-13 15:34:38,191 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]


2025-07-13 15:34:38,193 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]


2025-07-13 15:34:38,201 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]


2025-07-13 15:34:38,204 INFO streaming.PipeMapRed: Records R/W=1294/1


2025-07-13 15:34:38,214 INFO streaming.PipeMapRed: MRErrorThread done


2025-07-13 15:34:38,218 INFO streaming.PipeMapRed: mapRedFinished


2025-07-13 15:34:38,220 INFO mapred.LocalJobRunner: 


2025-07-13 15:34:38,254 INFO mapred.Task: Task:attempt_local623326603_0001_m_000000_0 is done. And is in the process of committing


2025-07-13 15:34:38,257 INFO mapred.LocalJobRunner: 


2025-07-13 15:34:38,257 INFO mapred.Task: Task attempt_local623326603_0001_m_000000_0 is allowed to commit now


2025-07-13 15:34:38,266 INFO output.FileOutputCommitter: Saved output of task 'attempt_local623326603_0001_m_000000_0' to hdfs://localhost:9000/user/root/output_filter


2025-07-13 15:34:38,267 INFO mapred.LocalJobRunner: Records R/W=1294/1


2025-07-13 15:34:38,268 INFO mapred.Task: Task 'attempt_local623326603_0001_m_000000_0' done.


2025-07-13 15:34:38,272 INFO mapred.Task: Final Counters for attempt_local623326603_0001_m_000000_0: Counters: 21


	File System Counters


		FILE: Number of bytes read=664


		FILE: Number of bytes written=715633


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


		HDFS: Number of bytes read=216485


		HDFS: Number of bytes written=85436


		HDFS: Number of read operations=9


		HDFS: Number of large read operations=0


		HDFS: Number of write operations=3


		HDFS: Number of bytes read erasure-coded=0


	Map-Reduce Framework


		Map input records=2000


		Map output records=677


		Input split bytes=102


		Spilled Records=0


		Failed Shuffles=0


		Merged Map outputs=0


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=111149056


	File Input Format Counters 


		Bytes Read=216485


	File Output Format Counters 


		Bytes Written=85436


2025-07-13 15:34:38,272 INFO mapred.LocalJobRunner: Finishing task: attempt_local623326603_0001_m_000000_0


2025-07-13 15:34:38,272 INFO mapred.LocalJobRunner: map task executor complete.


2025-07-13 15:34:38,969 INFO mapreduce.Job: Job job_local623326603_0001 running in uber mode : false


2025-07-13 15:34:38,970 INFO mapreduce.Job:  map 100% reduce 0%


2025-07-13 15:34:38,972 INFO mapreduce.Job: Job job_local623326603_0001 completed successfully


2025-07-13 15:34:38,975 INFO mapreduce.Job: Counters: 21


	File System Counters


		FILE: Number of bytes read=664


		FILE: Number of bytes written=715633


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


		HDFS: Number of bytes read=216485


		HDFS: Number of bytes written=85436


		HDFS: Number of read operations=9


		HDFS: Number of large read operations=0


		HDFS: Number of write operations=3


		HDFS: Number of bytes read erasure-coded=0


	Map-Reduce Framework


		Map input records=2000


		Map output records=677


		Input split bytes=102


		Spilled Records=0


		Failed Shuffles=0


		Merged Map outputs=0


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=111149056


	File Input Format Counters 


		Bytes Read=216485


	File Output Format Counters 


		Bytes Written=85436


2025-07-13 15:34:38,975 INFO streaming.StreamJob: Output directory: output_filter


Check the result

In [32]:
!hdfs dfs -ls output_filter

Found 2 items
-rw-r--r--   1 root supergroup          0 2025-07-13 15:34 output_filter/_SUCCESS
-rw-r--r--   1 root supergroup      85436 2025-07-13 15:34 output_filter/part-00000
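The `_SUCCESS` marker in the listing above can also be checked programmatically. The sketch below parses the text printed by `hdfs dfs -ls`. `has_success_marker` is a hypothetical helper, and it assumes the eight-column layout shown in this notebook's output.

```python
def has_success_marker(ls_output: str) -> bool:
    """Look for an empty file named _SUCCESS in `hdfs dfs -ls` output."""
    for row in ls_output.splitlines():
        parts = row.split()
        # Assumed columns: permissions, replication, owner, group,
        # size, date, time, path
        if len(parts) == 8 and parts[-1].endswith("/_SUCCESS") and parts[4] == "0":
            return True
    return False


listing = """Found 2 items
-rw-r--r--   1 root supergroup          0 2025-07-13 15:34 output_filter/_SUCCESS
-rw-r--r--   1 root supergroup      85436 2025-07-13 15:34 output_filter/part-00000
"""
print(has_success_marker(listing))  # True
```

On a running cluster, `hdfs dfs -test -e output_filter/_SUCCESS` should give the same answer through its exit status, without any parsing.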


In [33]:
!hdfs dfs -cat output_filter/part-00000 |head

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4	
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown	
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4	
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:5
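Each line of the job output ends with a tab, unlike the output of the shell test. As far as the Hadoop Streaming documentation describes it, a mapper output line is split into key and value at the first tab, and a line without a tab becomes the key with an empty value. The text output then writes key, tab, and value. The sketch below imitates that behaviour in plain Python; `to_key_value` and `write_record` are hypothetical names, not Hadoop APIs.

```python
def to_key_value(line: str) -> tuple[str, str]:
    """Split at the first tab; without a tab the value is empty."""
    key, _, value = line.partition("\t")
    return key, value


def write_record(key: str, value: str) -> str:
    """Join key and value with a tab, as a text output would."""
    return key + "\t" + value


line = "Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown"
record = write_record(*to_key_value(line))
print(repr(record[-8:]))  # 'unknown\t'
```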

## Aggregate data with MapReduce

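This section follows the example in [Hadoop Streaming/Aggregate package](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Aggregate_Package): the mapper emits records of the form `LongValueSum:<key>`, a tab, and `1`, and the built-in `aggregate` reducer sums the values for each key.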

In [34]:
%%writefile myAggregatorForKeyCount.py
#!/usr/bin/env python
import sys

def generateLongCountToken(key):
    # the "LongValueSum:" prefix asks the aggregate reducer to sum the values per key
    return "LongValueSum:" + key + "\t" + "1"

def main(argv):
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 5:
            # skip lines that have no process field
            continue
        # keep the process name, dropping the "[pid]:" suffix
        s = fields[4].split('[')[0]
        print(generateLongCountToken(s))

if __name__ == "__main__":
    main(sys.argv)

Writing myAggregatorForKeyCount.py


Set permissions

In [35]:
!chmod 700 myAggregatorForKeyCount.py

Test the mapper

In [36]:
!head -20 Linux_2k.log| ./myAggregatorForKeyCount.py

LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:logrotate:	1
LongValueSum:su(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
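The `aggregate` reducer then adds up the `1`s for every key that carries the `LongValueSum:` prefix. The following plain-Python sketch shows that summing step only. It is not Hadoop's implementation, `aggregate_long_value_sum` is a hypothetical name, and the input records are a few lines picked from the mapper test output above.

```python
from collections import Counter

PREFIX = "LongValueSum:"


def aggregate_long_value_sum(records):
    """Sum the integer values of LongValueSum records, grouped by key."""
    totals = Counter()
    for record in records:
        key, _, value = record.rstrip("\n").partition("\t")
        if key.startswith(PREFIX):
            totals[key[len(PREFIX):]] += int(value)
    # return the keys in sorted order, like the reducer output
    return dict(sorted(totals.items()))


mapper_output = [
    "LongValueSum:su(pam_unix)\t1",
    "LongValueSum:su(pam_unix)\t1",
    "LongValueSum:logrotate:\t1",
    "LongValueSum:sshd(pam_unix)\t1",
]
print(aggregate_long_value_sum(mapper_output))
# {'logrotate:': 1, 'sshd(pam_unix)': 1, 'su(pam_unix)': 2}
```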


Run the MapReduce job

In [37]:
%%bash

chmod +x myAggregatorForKeyCount.py

hdfs dfs -rm -r output_aggregate

mapred streaming \
  -input input \
  -output output_aggregate \
  -mapper myAggregatorForKeyCount.py \
  -reducer aggregate \
  -file myAggregatorForKeyCount.py


rm: `output_aggregate': No such file or directory


2025-07-13 15:34:44,479 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.


packageJobJar: [myAggregatorForKeyCount.py] [] /tmp/streamjob6364205998257741820.jar tmpDir=null


2025-07-13 15:34:44,995 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties


2025-07-13 15:34:45,070 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).


2025-07-13 15:34:45,070 INFO impl.MetricsSystemImpl: JobTracker metrics system started


2025-07-13 15:34:45,080 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2025-07-13 15:34:45,287 INFO mapred.FileInputFormat: Total input files to process : 1


2025-07-13 15:34:45,302 INFO mapreduce.JobSubmitter: number of splits:1


2025-07-13 15:34:45,450 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1880963874_0001


2025-07-13 15:34:45,450 INFO mapreduce.JobSubmitter: Executing with tokens: []


2025-07-13 15:34:45,594 INFO mapred.LocalDistributedCacheManager: Localized file:/home/runner/work/big_data/big_data/myAggregatorForKeyCount.py as file:/tmp/hadoop-root/mapred/local/job_local1880963874_0001_962a766c-34f3-47b2-88ac-39aad90e394f/myAggregatorForKeyCount.py


2025-07-13 15:34:45,638 INFO mapreduce.Job: The url to track the job: http://localhost:8080/


2025-07-13 15:34:45,639 INFO mapred.LocalJobRunner: OutputCommitter set in config null


2025-07-13 15:34:45,639 INFO mapreduce.Job: Running job: job_local1880963874_0001


2025-07-13 15:34:45,640 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter


2025-07-13 15:34:45,650 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2025-07-13 15:34:45,650 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2025-07-13 15:34:45,691 INFO mapred.LocalJobRunner: Waiting for map tasks


2025-07-13 15:34:45,693 INFO mapred.LocalJobRunner: Starting task: attempt_local1880963874_0001_m_000000_0


2025-07-13 15:34:45,710 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2025-07-13 15:34:45,710 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2025-07-13 15:34:45,723 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2025-07-13 15:34:45,729 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/input/Linux_2k.log:0+216485


2025-07-13 15:34:45,745 INFO mapred.MapTask: numReduceTasks: 1


2025-07-13 15:34:45,766 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)


2025-07-13 15:34:45,766 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100


2025-07-13 15:34:45,766 INFO mapred.MapTask: soft limit at 83886080


2025-07-13 15:34:45,766 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600


2025-07-13 15:34:45,766 INFO mapred.MapTask: kvstart = 26214396; length = 6553600


2025-07-13 15:34:45,769 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer


2025-07-13 15:34:45,778 INFO streaming.PipeMapRed: PipeMapRed exec [/home/runner/work/big_data/big_data/./myAggregatorForKeyCount.py]


2025-07-13 15:34:45,783 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir


2025-07-13 15:34:45,784 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir


2025-07-13 15:34:45,785 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file


2025-07-13 15:34:45,785 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length


2025-07-13 15:34:45,785 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id


2025-07-13 15:34:45,786 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition


2025-07-13 15:34:45,786 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start


2025-07-13 15:34:45,787 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap


2025-07-13 15:34:45,787 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id


2025-07-13 15:34:45,787 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id


2025-07-13 15:34:45,787 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords


2025-07-13 15:34:45,788 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name


2025-07-13 15:34:45,853 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]


2025-07-13 15:34:45,853 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]


2025-07-13 15:34:45,855 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]


2025-07-13 15:34:45,865 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]


2025-07-13 15:34:45,868 INFO streaming.PipeMapRed: Records R/W=1201/1


2025-07-13 15:34:45,884 INFO streaming.PipeMapRed: MRErrorThread done


2025-07-13 15:34:45,890 INFO streaming.PipeMapRed: mapRedFinished


2025-07-13 15:34:45,893 INFO mapred.LocalJobRunner: 


2025-07-13 15:34:45,893 INFO mapred.MapTask: Starting flush of map output


2025-07-13 15:34:45,893 INFO mapred.MapTask: Spilling map output


2025-07-13 15:34:45,893 INFO mapred.MapTask: bufstart = 0; bufend = 48923; bufvoid = 104857600


2025-07-13 15:34:45,893 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26206400(104825600); length = 7997/6553600


2025-07-13 15:34:45,917 INFO mapred.MapTask: Finished spill 0


2025-07-13 15:34:45,925 INFO mapred.Task: Task:attempt_local1880963874_0001_m_000000_0 is done. And is in the process of committing


2025-07-13 15:34:45,929 INFO mapred.LocalJobRunner: Records R/W=1201/1


2025-07-13 15:34:45,929 INFO mapred.Task: Task 'attempt_local1880963874_0001_m_000000_0' done.


2025-07-13 15:34:45,934 INFO mapred.Task: Final Counters for attempt_local1880963874_0001_m_000000_0: Counters: 24


	File System Counters


		FILE: Number of bytes read=1059


		FILE: Number of bytes written=720546


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


		HDFS: Number of bytes read=216485


		HDFS: Number of bytes written=0


		HDFS: Number of read operations=5


		HDFS: Number of large read operations=0


		HDFS: Number of write operations=1


		HDFS: Number of bytes read erasure-coded=0


	Map-Reduce Framework


		Map input records=2000


		Map output records=2000


		Map output bytes=48923


		Map output materialized bytes=782


		Input split bytes=102


		Combine input records=2000


		Combine output records=30


		Spilled Records=30


		Failed Shuffles=0


		Merged Map outputs=0


		GC time elapsed (ms)=1


		Total committed heap usage (bytes)=199229440


	File Input Format Counters 


		Bytes Read=216485


2025-07-13 15:34:45,934 INFO mapred.LocalJobRunner: Finishing task: attempt_local1880963874_0001_m_000000_0


2025-07-13 15:34:45,934 INFO mapred.LocalJobRunner: map task executor complete.


2025-07-13 15:34:45,936 INFO mapred.LocalJobRunner: Waiting for reduce tasks


2025-07-13 15:34:45,937 INFO mapred.LocalJobRunner: Starting task: attempt_local1880963874_0001_r_000000_0


2025-07-13 15:34:45,942 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2


2025-07-13 15:34:45,942 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false


2025-07-13 15:34:45,943 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]


2025-07-13 15:34:45,945 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@d14c165


2025-07-13 15:34:45,946 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!


2025-07-13 15:34:45,958 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2936012800, maxSingleShuffleLimit=734003200, mergeThreshold=1937768576, ioSortFactor=10, memToMemMergeOutputsThreshold=10


2025-07-13 15:34:45,959 INFO reduce.EventFetcher: attempt_local1880963874_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events


2025-07-13 15:34:45,977 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1880963874_0001_m_000000_0 decomp: 778 len: 782 to MEMORY


2025-07-13 15:34:45,979 INFO reduce.InMemoryMapOutput: Read 778 bytes from map-output for attempt_local1880963874_0001_m_000000_0


2025-07-13 15:34:45,980 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 778, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->778


2025-07-13 15:34:45,981 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning


2025-07-13 15:34:45,982 INFO mapred.LocalJobRunner: 1 / 1 copied.


2025-07-13 15:34:45,982 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs


2025-07-13 15:34:45,986 INFO mapred.Merger: Merging 1 sorted segments


2025-07-13 15:34:45,986 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 760 bytes


2025-07-13 15:34:45,987 INFO reduce.MergeManagerImpl: Merged 1 segments, 778 bytes to disk to satisfy reduce memory limit


2025-07-13 15:34:45,988 INFO reduce.MergeManagerImpl: Merging 1 files, 782 bytes from disk


2025-07-13 15:34:45,988 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce


2025-07-13 15:34:45,988 INFO mapred.Merger: Merging 1 sorted segments


2025-07-13 15:34:45,989 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 760 bytes


2025-07-13 15:34:45,989 INFO mapred.LocalJobRunner: 1 / 1 copied.


2025-07-13 15:34:46,038 INFO mapred.Task: Task:attempt_local1880963874_0001_r_000000_0 is done. And is in the process of committing


2025-07-13 15:34:46,040 INFO mapred.LocalJobRunner: 1 / 1 copied.


2025-07-13 15:34:46,040 INFO mapred.Task: Task attempt_local1880963874_0001_r_000000_0 is allowed to commit now


2025-07-13 15:34:46,051 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1880963874_0001_r_000000_0' to hdfs://localhost:9000/user/root/output_aggregate


2025-07-13 15:34:46,051 INFO mapred.LocalJobRunner: reduce > reduce


2025-07-13 15:34:46,051 INFO mapred.Task: Task 'attempt_local1880963874_0001_r_000000_0' done.


2025-07-13 15:34:46,055 INFO mapred.Task: Final Counters for attempt_local1880963874_0001_r_000000_0: Counters: 30


	File System Counters


		FILE: Number of bytes read=2655


		FILE: Number of bytes written=721328


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


		HDFS: Number of bytes read=216485


		HDFS: Number of bytes written=326


		HDFS: Number of read operations=10


		HDFS: Number of large read operations=0


		HDFS: Number of write operations=3


		HDFS: Number of bytes read erasure-coded=0


	Map-Reduce Framework


		Combine input records=0


		Combine output records=0


		Reduce input groups=30


		Reduce shuffle bytes=782


		Reduce input records=30


		Reduce output records=30


		Spilled Records=30


		Shuffled Maps =1


		Failed Shuffles=0


		Merged Map outputs=1


		GC time elapsed (ms)=0


		Total committed heap usage (bytes)=199229440


	Shuffle Errors


		BAD_ID=0


		CONNECTION=0


		IO_ERROR=0


		WRONG_LENGTH=0


		WRONG_MAP=0


		WRONG_REDUCE=0


	File Output Format Counters 


		Bytes Written=326


2025-07-13 15:34:46,055 INFO mapred.LocalJobRunner: Finishing task: attempt_local1880963874_0001_r_000000_0


2025-07-13 15:34:46,056 INFO mapred.LocalJobRunner: reduce task executor complete.


2025-07-13 15:34:46,642 INFO mapreduce.Job: Job job_local1880963874_0001 running in uber mode : false


2025-07-13 15:34:46,643 INFO mapreduce.Job:  map 100% reduce 100%


2025-07-13 15:34:46,644 INFO mapreduce.Job: Job job_local1880963874_0001 completed successfully


2025-07-13 15:34:46,649 INFO mapreduce.Job: Counters: 36


	File System Counters


		FILE: Number of bytes read=3714


		FILE: Number of bytes written=1441874


		FILE: Number of read operations=0


		FILE: Number of large read operations=0


		FILE: Number of write operations=0


		HDFS: Number of bytes read=432970


		HDFS: Number of bytes written=326


		HDFS: Number of read operations=15


		HDFS: Number of large read operations=0


		HDFS: Number of write operations=4


		HDFS: Number of bytes read erasure-coded=0


	Map-Reduce Framework


		Map input records=2000


		Map output records=2000


		Map output bytes=48923


		Map output materialized bytes=782


		Input split bytes=102


		Combine input records=2000


		Combine output records=30


		Reduce input groups=30


		Reduce shuffle bytes=782


		Reduce input records=30


		Reduce output records=30


		Spilled Records=60


		Shuffled Maps =1


		Failed Shuffles=0


		Merged Map outputs=1


		GC time elapsed (ms)=1


		Total committed heap usage (bytes)=398458880


	Shuffle Errors


		BAD_ID=0


		CONNECTION=0


		IO_ERROR=0


		WRONG_LENGTH=0


		WRONG_MAP=0


		WRONG_REDUCE=0


	File Input Format Counters 


		Bytes Read=216485


	File Output Format Counters 


		Bytes Written=326


2025-07-13 15:34:46,649 INFO streaming.StreamJob: Output directory: output_aggregate


Check result

In [38]:
%%bash
hdfs dfs -ls output_aggregate
hdfs dfs -cat output_aggregate/part-00000

Found 2 items


-rw-r--r--   1 root supergroup          0 2025-07-13 15:34 output_aggregate/_SUCCESS


-rw-r--r--   1 root supergroup        326 2025-07-13 15:34 output_aggregate/part-00000


--	1


bluetooth:	2


cups:	12


ftpd	916


gdm(pam_unix)	2


gdm-binary	1


gpm	2


hcid	1


irqbalance:	1


kernel:	76


klogind	46


login(pam_unix)	2


logrotate:	43


named	16


network:	2


nfslock:	1


portmap:	1


random:	1


rc:	1


rpc.statd	1


rpcidmapd:	1


sdpd	1


snmpd	1


sshd(pam_unix)	677


su(pam_unix)	172


sysctl:	1


syslog:	2


syslogd	7


udev	8


xinetd	2


Pretty-print table of aggregated data

In [39]:
%%bash
hdfs dfs -get output_aggregate/part-00000 result # download results file
# Use awk to format the output into columns and then sort by the second field numerically in descending order
awk '{printf "%-20s %s\n", $1, $2}' result | sort -k2nr

ftpd                 916


sshd(pam_unix)       677


su(pam_unix)         172


kernel:              76


klogind              46


logrotate:           43


named                16


cups:                12


udev                 8


syslogd              7


bluetooth:           2


gdm(pam_unix)        2


gpm                  2


login(pam_unix)      2


network:             2


syslog:              2


xinetd               2


--                   1


gdm-binary           1


hcid                 1


irqbalance:          1


nfslock:             1


portmap:             1


random:              1


rc:                  1


rpc.statd            1


rpcidmapd:           1


sdpd                 1


snmpd                1


sysctl:              1
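
As a quick sanity check, the counts in the aggregated table can be summed and compared with the `Map input records` counter reported by the job (2000 in the run above). The snippet below is a minimal sketch: the helper name `sum_counts` is made up for illustration, and it assumes that the local file `result` fetched in the previous cell is still present and that each line holds a key and a count separated by a tab.

```shell
# Hypothetical helper: add up the second (tab-separated) column of a file.
sum_counts() {
  awk -F'\t' '{ total += $2 } END { print total + 0 }' "$1"
}

# Run the check only if the local copy of the job output exists.
if [ -f result ]; then
  echo "total records: $(sum_counts result)"
fi
```

If the total differs from the number of map input records, some input lines did not contribute a key, which would be worth investigating in the mapper.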


# Stop cluster

When you're done with your computations, you can shut down the Hadoop cluster and stop the `sshd` service.

In [40]:
!./hadoop-3.4.0/sbin/stop-dfs.sh

Stopping namenodes on [localhost]


Stopping datanodes


Stopping secondary namenodes [fv-az1277-496]
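
To confirm that the HDFS daemons are really gone, one option is to list the running Java processes with `jps`, the process-status tool shipped with the JDK. This is only a sketch: it assumes that `jps` is on the `PATH`, and the helper name `hdfs_daemons` is made up for illustration.

```shell
# Hypothetical helper: keep only the jps lines ("<pid> <class name>")
# that name an HDFS daemon. `|| true` avoids a non-zero exit status
# when nothing matches.
hdfs_daemons() {
  grep -E ' (NameNode|DataNode|SecondaryNameNode)$' || true
}

# Run the check only if jps is available.
if command -v jps >/dev/null 2>&1; then
  running=$(jps | hdfs_daemons)
  if [ -z "$running" ]; then
    echo "no HDFS daemons are running"
  else
    echo "still running:"
    echo "$running"
  fi
fi
```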


Stop the `sshd` daemon

In [41]:
!/etc/init.d/ssh stop

Stopping ssh (via systemctl): ssh.serviceStopping 'ssh.service', but its triggering units are still active:
ssh.socket
.


# Concluding remarks

We have started a single-node Hadoop cluster and run some simple HDFS and MapReduce commands.

Even when running on a single machine, one can benefit from the parallelism provided by multiple virtual cores.

Hadoop also provides a command-line utility (the CLI MiniCluster) to start and stop a single-node Hadoop cluster "_without the need to set any environment variables or manage configuration files_" (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/CLIMiniCluster.html). The [Hadoop MiniCluster](https://github.com/groda/big_data/blob/master/Hadoop_minicluster.ipynb) notebook is a guide to launching it.

Starting a Hadoop cluster with a single command is convenient, but setting up each component by hand, as we did in this notebook, gives a better understanding of the Hadoop architecture.

If you found this notebook helpful, consider exploring:
 - [Hadoop single-node cluster setup with Python](https://github.com/groda/big_data/blob/master/Hadoop_single_node_cluster_setup_Python.ipynb): similar to this notebook, but using Python in place of bash
 - [Setting up Spark Standalone on Google Colab](https://github.com/groda/big_data/blob/master/Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb)
 - [Getting to know the Spark Standalone Architecture](https://github.com/groda/big_data/blob/master/Spark_Standalone_Architecture_on_Google_Colab.ipynb)


