<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90"></div></a>

# HDFS and MapReduce on a single-node Hadoop cluster
<br>
<br>

In this tutorial notebook we set up a single-node Hadoop cluster, following the guidelines at https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html. We then run some basic HDFS and MapReduce commands.

Upon downloading the software, several preliminary steps must be taken, including setting environment variables, generating SSH keys, and more. To streamline these tasks, we've consolidated them under the "Prologue" section.

Upon completion of the prologue, we can launch a single-node Hadoop cluster on the current virtual machine.

Following that, we'll execute a series of test HDFS commands and MapReduce jobs on the Hadoop cluster. These will be performed using a dataset sourced from a publicly available collection.

Finally, we'll proceed to shut down the cluster.


**TABLE OF CONTENTS**
* **[Prologue](#scrollTo=oUuQjW2oNMcJ)**

 * [Check the available Java version](#scrollTo=qFfOrktMPq8M)

 * [Download core Hadoop](#scrollTo=KE7kSYSXQYLf)

   * [Verify the downloaded file](#scrollTo=lGI4TNXPamMr)

 * [Configure `PATH`](#scrollTo=RlgP1ytnRtUK)

 * [Configure `core-site.xml` and `hdfs-site.xml`](#scrollTo=KLmxLQeJSb4A)

 * [Set environment variables](#scrollTo=kXbSKFyeMqr2)

 * [Setup localhost access via SSH key](#scrollTo=k2-Fdp73cF0V)

   * [Install `openssh` and start server](#scrollTo=-Uxmv3RdUwiF)

   * [Generate key](#scrollTo=PYKoSlaENuyG)

   * [Check SSH connection to localhost](#scrollTo=FwA6rKpScnVi)

* **[Launch a single-node Hadoop cluster](#scrollTo=V68C4cDySyek)**

   * [Initialize the namenode](#scrollTo=HTDPwnVlSbHS)

   * [Start cluster](#scrollTo=xMrEiLB_VAeR)

* **[Run some simple HDFS commands](#scrollTo=CKRRbwDFv3ZQ)**

* **[Run some simple MapReduce jobs](#scrollTo=G3KBe4R65bl1)**

   * [Simplest MapReduce job](#scrollTo=yVJA-3jSATGV)

   * [Another MapReduce example: filter a log file](#scrollTo=BbosNo0TD3oH)

   * [Aggregate data with MapReduce](#scrollTo=Sam22f-YT1xR)

* **[Stop cluster](#scrollTo=IF6-Z5RotAcO)**

* **[Concluding remarks](#scrollTo=w5N7tb0HSbZB)**



# Prologue

## Check the available Java version
 Apache Hadoop 3.4.2 supports Java > 8 (JDK>8). See: https://hadoop.apache.org/docs/r3.4.2/


Check whether the Java version is one of `11`, `17`

In [1]:
!java -version

openjdk version "17.0.17" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17+10-Ubuntu-122.04)
OpenJDK 64-Bit Server VM (build 17.0.17+10-Ubuntu-122.04, mixed mode, sharing)


In [2]:
%%bash
JAVA_MAJOR_VERSION=$(java -version 2>&1 | grep -m1 -Po '(\d+\.)+\d+' | cut -d '.' -f1)
if [[ $JAVA_MAJOR_VERSION -eq 11 || $JAVA_MAJOR_VERSION -eq 17 ]]
 then
 echo "Java version is one of 11, 17 ✓"
 fi

Java version is one of 11, 17 ✓
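
The same check can be sketched in Python. This is only an illustration: the function name `java_major_version` is made up for this example, and the handling of the legacy `1.x` numbering (used up to Java 8) is an addition that the bash cell above does not need, since it only tests for 11 and 17.

```python
import re

def java_major_version(version_output: str) -> int:
    """Return the major Java version found in the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    major = int(match.group(1))
    # Releases up to Java 8 use the legacy scheme "1.8.0_292",
    # where the major version is the second number.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major

# First line of the `java -version` output shown above
sample = 'openjdk version "17.0.17" 2025-10-21'
print(java_major_version(sample) in (11, 17))
```

In the notebook, the text to parse could be obtained with `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`, since `java -version` writes to standard error.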


## Set `JAVA_HOME`

Find the path for the environment variable `JAVA_HOME`

In [3]:
!readlink -f $(which java)

/usr/lib/jvm/java-17-openjdk-amd64/bin/java


In [4]:
%%bash
JAVA_HOME=$(readlink -f $(which java) | sed 's/\/bin\/java$//')
echo $JAVA_HOME

/usr/lib/jvm/java-17-openjdk-amd64


Extract `JAVA_HOME` from the Java path by removing the trailing `/bin/java` part

In [3]:
import subprocess
import os

java_home = subprocess.check_output(
    "readlink -f $(which java) | sed 's:/bin/java$::'",
    shell=True,
    text=True
).strip()

os.environ["JAVA_HOME"] = java_home

print("JAVA_HOME =", os.environ["JAVA_HOME"])

JAVA_HOME = /usr/lib/jvm/java-17-openjdk-amd64


## Download core Hadoop
Download the latest stable version of the core Hadoop distribution from one of the download mirrors listed at https://www.apache.org/dyn/closer.cgi/hadoop/common/.

**Note**: with the option `--no-clobber`, `wget` will not download the file if it already exists.

In [6]:
!wget --no-clobber https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz

File ‘hadoop-3.4.2.tar.gz’ already there; not retrieving.



Uncompress archive

In [7]:
%%bash
if [ ! -d "hadoop-3.4.2" ]; then
  tar xzf hadoop-3.4.2.tar.gz
fi

### Verify the downloaded file

(see https://www.apache.org/dyn/closer.cgi/hadoop/common/)

Download sha512 file

In [8]:
! wget --no-clobber https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz.sha512

File ‘hadoop-3.4.2.tar.gz.sha512’ already there; not retrieving.



Compare

In [9]:
%%bash
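# SHA-512 of the downloaded archive (first field of the sha512sum output)
A=$(sha512sum hadoop-3.4.2.tar.gz | cut -d' ' -f1)
# expected digest: fourth space-separated field of the .sha512 file
B=$(cut -d' ' -f4 hadoop-3.4.2.tar.gz.sha512)
printf "%s\n%s\n" "$A" "$B"
[[ "$A" == "$B" ]] && echo "True"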

79a383e156022d6690da359120b25db8146452265d92a4e890d9ea78c2078a01b661daf78163ee9b4acef7106b01fd5c8d1a55f7ad284f88b31ab3f402ae3acf
79a383e156022d6690da359120b25db8146452265d92a4e890d9ea78c2078a01b661daf78163ee9b4acef7106b01fd5c8d1a55f7ad284f88b31ab3f402ae3acf
True
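
For reference, the comparison can also be sketched in Python with the standard `hashlib` module. The function names below are made up for this illustration, and the parsing assumes that the `.sha512` file contains a single line of the form `SHA512 (file) = <digest>`, which is consistent with the `cut -f4` used above but is not checked here.

```python
import hashlib

def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def expected_digest(checksum_text: str) -> str:
    """Take the last whitespace-separated token as the digest.

    Assumption: the checksum file looks like "SHA512 (name) = <hex digest>".
    """
    return checksum_text.split()[-1].lower()

def verify(archive_path: str, checksum_path: str) -> bool:
    """Return True if the archive's digest matches the one in the checksum file."""
    with open(checksum_path) as handle:
        expected = expected_digest(handle.read())
    return sha512_of_file(archive_path) == expected
```

With the files downloaded above, the call would be `verify("hadoop-3.4.2.tar.gz", "hadoop-3.4.2.tar.gz.sha512")`.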


## Configure `PATH`

Add the Hadoop folder to the `PATH` environment variable


In [10]:
!echo $PATH

/opt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin


In [11]:
import os
os.environ['HADOOP_HOME'] = os.path.join(os.getcwd(), 'hadoop-3.4.2')
os.environ['PATH'] = ':'.join([os.path.join(os.environ['HADOOP_HOME'], 'bin'), os.environ['PATH']])
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-17-openjdk-amd64'

In [12]:
import os
for key, value in os.environ.items():
    print(f"{key}: {value}")

SHELL: /bin/bash
COLAB_JUPYTER_TRANSPORT: ipc
CGROUP_MEMORY_EVENTS: /sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events
VM_GCE_METADATA_HOST: 169.254.169.253
MODEL_PROXY_HOST: https://mp.kaggle.net
HOSTNAME: 642d0bb387c8
LANGUAGE: en_US
TBE_RUNTIME_ADDR: 172.28.0.1:8011
GCE_METADATA_TIMEOUT: 3
COLAB_JUPYTER_IP: 172.28.0.12
COLAB_LANGUAGE_SERVER_PROXY_ROOT_URL: http://172.28.0.1:8013/
KMP_LISTEN_PORT: 6000
TF_FORCE_GPU_ALLOW_GROWTH: true
ENV: /root/.bashrc
PWD: /
TBE_EPHEM_CREDS_ADDR: 172.28.0.1:8009
COLAB_LANGUAGE_SERVER_PROXY_REQUEST_TIMEOUT: 30s
TBE_CREDS_ADDR: 172.28.0.1:8008
COLAB_JUPYTER_TOKEN: 
LAST_FORCED_REBUILD: 20250623
TCLLIBPATH: /usr/share/tcltk/tcllib1.20
COLAB_KERNEL_MANAGER_PROXY_HOST: 172.28.0.12
UV_BUILD_CONSTRAINT: 
COLAB_WARMUP_DEFAULTS: 1
HOME: /root
LANG: en_US.UTF-8
CLOUDSDK_CONFIG: /content/.config
UV_SYSTEM_PYTHON: true
COLAB_RELEASE_TAG: release-colab-external_20260202-060039_RC01
KMP_TARGET_PORT: 9000
KMP_EXTRA_ARGS: --logtostderr --

In [13]:
!echo $PATH

/content/hadoop-3.4.2/bin:/opt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin
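
Re-running the cell that sets `PATH` prepends the Hadoop `bin` folder again each time. A small helper can keep the variable free of duplicates; this is a sketch, and the name `prepend_to_path` is made up for this example.

```python
import os

def prepend_to_path(path_value: str, directory: str) -> str:
    """Return path_value with directory as first entry, without duplicates."""
    entries = [entry for entry in path_value.split(os.pathsep)
               if entry and entry != directory]
    return os.pathsep.join([directory] + entries)

# Applying the helper twice gives the same result as applying it once
once = prepend_to_path("/usr/bin" + os.pathsep + "/bin", "/content/hadoop-3.4.2/bin")
twice = prepend_to_path(once, "/content/hadoop-3.4.2/bin")
print(once == twice)
```

In the notebook it could be used as `os.environ['PATH'] = prepend_to_path(os.environ['PATH'], os.path.join(os.environ['HADOOP_HOME'], 'bin'))`.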


## Configure `core-site.xml` and `hdfs-site.xml`

Edit the files `etc/hadoop/core-site.xml` and `etc/hadoop/hdfs-site.xml` to configure pseudo-distributed operation.

**`etc/hadoop/core-site.xml`**
```
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```

**`etc/hadoop/hdfs-site.xml`**
```
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```

In [14]:
%%bash
echo -e "<configuration> \n\
    <property> \n\
        <name>fs.defaultFS</name> \n\
        <value>hdfs://localhost:9000</value> \n\
    </property> \n\
</configuration>" | sudo tee hadoop-3.4.2/etc/hadoop/core-site.xml > /dev/null

echo -e "<configuration> \n\
    <property> \n\
        <name>dfs.replication</name> \n\
        <value>1</value> \n\
    </property> \n\
</configuration>" | sudo tee hadoop-3.4.2/etc/hadoop/hdfs-site.xml > /dev/null


Check

In [15]:
!cat hadoop-3.4.2/etc/hadoop/hdfs-site.xml

<configuration> 
    <property> 
        <name>dfs.replication</name> 
        <value>1</value> 
    </property> 
</configuration>
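
As an alternative to `echo`, the two configuration files could be generated from Python with the standard `xml.etree.ElementTree` module, which takes care of escaping and of the closing tags. This is a sketch: the helper name `hadoop_config_xml` is made up for this example, and `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

def hadoop_config_xml(properties: dict) -> str:
    """Build a Hadoop <configuration> document from name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    ET.indent(root, space="    ")  # pretty-print (Python 3.9+)
    return ET.tostring(root, encoding="unicode")

# Same settings as in the two files above
core_site = hadoop_config_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = hadoop_config_xml({"dfs.replication": 1})
print(core_site)
```

The resulting strings would then be written to `hadoop-3.4.2/etc/hadoop/core-site.xml` and `hadoop-3.4.2/etc/hadoop/hdfs-site.xml`.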


## Set environment variables

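Write the following lines, together with `JAVA_HOME`, to the Hadoop configuration script `hadoop-env.sh` (the script is in `hadoop-3.4.2/etc/hadoop`). The cell below first saves a copy of the original script as `hadoop-env.sh.org` and then overwrites it.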
```
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
```

In [16]:
%%bash
cp -n hadoop-3.4.2/etc/hadoop/hadoop-env.sh hadoop-3.4.2/etc/hadoop/hadoop-env.sh.org
cat <<😃 >hadoop-3.4.2/etc/hadoop/hadoop-env.sh
export JAVA_HOME="/usr/lib/jvm/java-17-openjdk-amd64"
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
😃

## Setup localhost access via SSH key

We are going to allow passphraseless access to `localhost` with a secure key.

SSH must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.



### Install `openssh` and start server

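The option `StrictHostKeyChecking no` is a setting of the `ssh` client, which is why it is appended to `/etc/ssh/ssh_config` rather than to the server configuration. With strict checking, `ssh` asks for confirmation before connecting to a host whose key is not yet in `known_hosts`, and it refuses to connect if the key of a known host has changed. Setting the option to `no` makes `ssh` add new host keys to `known_hosts` automatically, without prompting. I guess this option is needed because the Hadoop scripts connect to `localhost` via `ssh` non-interactively, so nobody could answer the confirmation prompt.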

Alternatively, one could just delete the file `~/.ssh/known_hosts` or else use `ssh-keygen -R hostname` to remove all keys belonging to hostname from the `known_hosts` file (see for instance [How to remove strict RSA key checking in SSH and what's the problem here?](https://serverfault.com/questions/6233/how-to-remove-strict-rsa-key-checking-in-ssh-and-whats-the-problem-here) or [Remove key from known_hosts](https://superuser.com/questions/30087/remove-key-from-known-hosts)). The option `ssh-keygen -R hostname` would be the most appropriate in a production setting where the file `~/.ssh/known_hosts` might contain other entries that you do not want to delete.


In [17]:
%%bash
sudo apt-get update
sudo apt-get -y install openssh-server
# tee -a appends to the file using elevated privileges
echo 'StrictHostKeyChecking no' | sudo tee -a /etc/ssh/ssh_config
sudo /etc/init.d/ssh restart

Hit:1 https://cli.github.com/packages stable InRelease
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
openssh-server is already the newest version (1:8.9p1-3ubuntu0.13).
0 upgraded, 0 newly installed, 0 to remove and 46 not upgraded.
StrictHostKeyChecking no
 * Restarting OpenBSD Secure Shell server sshd
   ...done.


W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)


### Generate key
Generate an SSH key that does not require a password.

The private key is contained in the file `id_rsa` located in the folder `~/.ssh`.

The public key is added to the file `~/.ssh/authorized_keys` in order to allow authentication with that key.

In [18]:
%%bash
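# remove a key pair left over from a previous run (-f: no error if missing)
rm -f "$HOME/.ssh/id_rsa" "$HOME/.ssh/id_rsa.pub"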
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:D62qnIgTNIXm2eqjJOooYKo3pz1DJYqBdPz/XkS2f7s root@642d0bb387c8
The key's randomart image is:
+---[RSA 3072]----+
|  o              |
| + +             |
|= = .      o     |
|o= ....  .o .    |
|.oo. o. S .o     |
|+o. .  . +. .    |
|*o .    o .. . . |
|O+o++. . ..   . .|
|@+++=+. ..     E.|
+----[SHA256]-----+


### Check SSH connection to localhost

The following command should output "hi 👋" if the connection works.

In [19]:
!ssh localhost "echo hi 👋"

hi 👋


# Launch a single-node Hadoop cluster

## Initialize the namenode

In [20]:
!sudo env JAVA_HOME=$JAVA_HOME $HADOOP_HOME/bin/hdfs namenode -format -nonInteractive

namenode is running as process 5374.  Stop it first and ensure /tmp/hadoop-root-namenode.pid file is empty before retry.


## Start cluster

In [21]:
!$HADOOP_HOME/sbin/start-dfs.sh

Starting namenodes on [localhost]
localhost: namenode is running as process 5374.  Stop it first and ensure /tmp/hadoop-root-namenode.pid file is empty before retry.
Starting datanodes
localhost: datanode is running as process 5495.  Stop it first and ensure /tmp/hadoop-root-datanode.pid file is empty before retry.
Starting secondary namenodes [642d0bb387c8]
642d0bb387c8: secondarynamenode is running as process 5695.  Stop it first and ensure /tmp/hadoop-root-secondarynamenode.pid file is empty before retry.


In [22]:
%%bash
# Check if HDFS is in safe mode
if hdfs dfsadmin -safemode get | grep 'ON'; then
  echo "Namenode is in safe mode. Leaving safe mode..."
  hdfs dfsadmin -safemode leave
else
  echo "Namenode is not in safe mode."
fi

Namenode is not in safe mode.
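
A quick way to confirm from Python that the NameNode accepts connections is to try opening a TCP connection to the address configured in `fs.defaultFS` (`localhost:9000`). This is only a sketch: the helper name `port_is_open` is made up for this example, and an open port shows that something is listening there, not that HDFS is healthy.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 9000 is the port chosen in core-site.xml (fs.defaultFS)
print("NameNode port reachable:", port_is_open("localhost", 9000))
```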


# Run some simple HDFS commands

In [23]:
%%bash
# create directory "my_dir" in HDFS home
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root # this is the "home" of user root on HDFS
hdfs dfs -mkdir my_dir

# if sample_data does not exist, create it
mkdir -p sample_data
touch sample_data/mnist_test.csv

# Check if the file is empty and fill it if needed
if [ ! -s sample_data/mnist_test.csv ]; then
  echo -e "0\n1\n2\n3\n4\n5\n6\n7\n8\n9" > sample_data/mnist_test.csv
fi


# upload file mnist_test.csv to my_dir
hdfs dfs -put sample_data/mnist_test.csv my_dir/

# show contents of directory my_dir
hdfs dfs -ls -h my_dir

Found 1 items
-rw-r--r--   1 root supergroup     17.4 M 2026-02-04 18:51 my_dir/mnist_test.csv


mkdir: `/user': File exists
mkdir: `/user/root': File exists
mkdir: `my_dir': File exists
put: `my_dir/mnist_test.csv': File exists


# Run some simple MapReduce jobs

We'll employ the [streaming](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html) library, which broadens our options by enabling the use of any programming language for both the mapper and/or the reducer.

With this utility, any executable (or any file containing code that the operating system can interpret and execute directly) can serve as mapper and/or reducer.

## Simplest MapReduce job

This is a "no-code" example since we are going to use the existing Unix commands `cat` and `wc` respectively as mapper and as reducer. The result will show a line with three values: the counts of lines, words, and characters in the input file(s).

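The input folder is `/user/root/my_dir/` and the output folder is `/user/root/output_simplest` (relative HDFS paths are resolved against the HDFS home directory of the current user, here `root`).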

**Note**: the output folder should not exist because it is created by Hadoop (this is in accordance with Hadoop's principle of not overwriting data).

Now run the MapReduce job

In [24]:
%%bash

# remove the output folder of a previous run (-f: no error if it does not exist)
hdfs dfs -rm -r -f output_simplest
mapred streaming \
  -input my_dir \
  -output output_simplest \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

Deleted output_simplest


2026-02-04 18:54:28,959 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2026-02-04 18:54:29,351 INFO mapred.FileInputFormat: Total input files to process : 1
2026-02-04 18:54:29,368 INFO mapreduce.JobSubmitter: number of splits:1
2026-02-04 18:54:29,677 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1756936430_0001
2026-02-04 18:54:29,680 INFO mapreduce.JobSubmitter: Executing with tokens: []
2026-02-04 18:54:29,883 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2026-02-04 18:54:29,883 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2026-02-04 18:54:29,885 INFO mapreduce.Job: Running job: job_local1756936430_0001
2026-02-04 18:54:29,886 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
2026-02-04 18:54:29,898 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2026-02-04 18:54:29,898 INFO output.FileOutputCommitter: FileOutputCommitter s

If the output directory (here `output_simplest`) contains the empty file `_SUCCESS`, the job was successful.

Check the output of the MapReduce job.

In [25]:
!hdfs dfs -cat output_simplest/part-00000

  10000   10000 18299443	


The number of words is in this case equal to the number of lines because there are no word separators (empty spaces) in the file, so each line is a word.
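
To make the three numbers concrete, here is a small Python sketch of what `wc` counts: newline characters, whitespace-separated words, and bytes. The function name `wc` is reused only for illustration; it runs locally and does not involve Hadoop.

```python
def wc(text: str) -> tuple:
    """Return (lines, words, bytes) for a text, like the Unix `wc` command."""
    lines = text.count("\n")            # number of newline characters
    words = len(text.split())           # whitespace-separated tokens
    size = len(text.encode("utf-8"))    # size in bytes
    return lines, words, size

# Three one-character lines: every line is also a word
print(wc("0\n1\n2\n"))  # (3, 3, 6)
```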

## Another MapReduce example: filter a log file

We're going to use a Linux logfile and look for the string `sshd` in a given position. The file stems from [Loghub](https://github.com/logpai/loghub), a freely available collection of system logs for AI-driven log analytics research.

The mapper `mapper.py` keeps only the lines whose fifth whitespace-separated field (`fields[4]`, counting from zero) starts with the string `sshd`.

The job has no reducer (option `-reducer NONE`). Note that without a reducer the sorting and shuffling phase after the map phase is skipped.
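
The filtering rule can be written as a small predicate, which is easy to try out before wrapping it in a streaming mapper. This is a sketch: the function name `keep_line` is made up for this example, the first sample line is taken from the log, and the second one is a hypothetical line invented for contrast.

```python
def keep_line(line: str, needle: str = "sshd", index: int = 4) -> bool:
    """Return True if the field at position `index` starts with `needle`."""
    fields = line.split()
    return len(fields) > index and fields[index].startswith(needle)

samples = [
    # line from Linux_2k.log
    "Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown",
    # hypothetical line, made up for this example
    "Jan 01 00:00:00 host other[1]: some message",
]
for line in samples:
    print(keep_line(line), line[:40])
```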


Download the logfile `Linux_2k.log`:

In [26]:
!wget --no-clobber https://raw.githubusercontent.com/logpai/loghub/master/Linux/Linux_2k.log

File ‘Linux_2k.log’ already there; not retrieving.



In [27]:
%%bash
hdfs dfs -mkdir input || true
hdfs dfs -put Linux_2k.log input/ || true

mkdir: `input': File exists
put: `input/Linux_2k.log': File exists


Define the mapper

In [28]:
%%writefile mapper.py
#!/usr/bin/env python
import sys

for line in sys.stdin:
    # strip surrounding whitespace and split the line into fields
    line = line.strip()
    fields = line.split()
    # keep the line if its fifth field (index 4) starts with 'sshd'
    if len(fields) >= 5 and fields[4].startswith('sshd'):
        print(line)


Overwriting mapper.py


Test the script (after setting the correct permissions)

In [29]:
!chmod 700 mapper.py

Look at the first 10 lines

In [30]:
!head -10 Linux_2k.log

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 com

Test the mapper in the shell (not using MapReduce):

In [31]:
!head -100 Linux_2k.log| ./mapper.py

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(p

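Now run the MapReduce job on the pseudo-cluster. The `-file` option ships `mapper.py` with the job; it still works but is deprecated in favor of the generic option `-files`, as the warning in the log points out.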

In [32]:
%%bash

hdfs dfs -rm -r output_filter

mapred streaming \
  -file mapper.py \
  -input input \
  -output output_filter \
  -mapper mapper.py \
  -reducer NONE


Deleted output_filter
packageJobJar: [mapper.py] [] /tmp/streamjob9992742135488871831.jar tmpDir=null


2026-02-04 18:54:42,365 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2026-02-04 18:54:43,630 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2026-02-04 18:54:43,976 INFO mapred.FileInputFormat: Total input files to process : 1
2026-02-04 18:54:43,998 INFO mapreduce.JobSubmitter: number of splits:1
2026-02-04 18:54:44,214 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1309914515_0001
2026-02-04 18:54:44,215 INFO mapreduce.JobSubmitter: Executing with tokens: []
2026-02-04 18:54:44,449 INFO mapred.LocalDistributedCacheManager: Localized file:/content/mapper.py as file:/tmp/hadoop-root/mapred/local/job_local1309914515_0001_da72e068-339a-4e35-a998-d2900de7f97e/mapper.py
2026-02-04 18:54:44,556 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2026-02-04 18:54:44,558 INFO mapreduce.Job: Running job: job_local1309914515_0001
2026-02-04 18:54:44,559 INFO mapred.LocalJobRunner: O

Check the result

In [33]:
!hdfs dfs -ls output_filter

Found 2 items
-rw-r--r--   1 root supergroup          0 2026-02-04 18:54 output_filter/_SUCCESS
-rw-r--r--   1 root supergroup      85436 2026-02-04 18:54 output_filter/part-00000


In [34]:
!hdfs dfs -cat output_filter/part-00000 |head

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4	
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown	
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4	
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo

## Aggregate data with MapReduce

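This example follows the [Hadoop Streaming/Aggregate package](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Aggregate_Package) documentation. The mapper emits one `LongValueSum:<key>` record with value `1` per log line, where the key is the process name found in field 4 (without the `[pid]` suffix), and the built-in `aggregate` reducer sums the values for each key.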

In [35]:
%%writefile myAggregatorForKeyCount.py
#!/usr/bin/env python
import sys

def generateLongCountToken(id):
    # the "LongValueSum:" prefix asks the aggregate reducer to sum the values
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    for line in sys.stdin:
        fields = line.split()
        # skip lines that are too short to contain a process field
        if len(fields) < 5:
            continue
        # process name without the "[pid]" suffix
        s = fields[4].split('[')[0]
        print(generateLongCountToken(s))

if __name__ == "__main__":
    main(sys.argv)

Overwriting myAggregatorForKeyCount.py


Set permissions

In [36]:
!chmod 700 myAggregatorForKeyCount.py

Test the mapper

In [37]:
!head -20 Linux_2k.log| ./myAggregatorForKeyCount.py

LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:logrotate:	1
LongValueSum:su(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
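
What the `aggregate` reducer will do with these records can be approximated locally. The sketch below sums the values per key for `LongValueSum` records only; it is a rough stand-in written for this example (the name `sum_long_values` is made up), not the Java implementation that Hadoop runs.

```python
from collections import Counter

def sum_long_values(mapper_output):
    """Sum the values of "LongValueSum:<key>\t<value>" records per key."""
    totals = Counter()
    for record in mapper_output:
        key, _, value = record.rstrip("\n").partition("\t")
        prefix, _, name = key.partition(":")
        if prefix != "LongValueSum" or not value:
            continue  # ignore records of other types or malformed lines
        totals[name] += int(value)
    return dict(sorted(totals.items()))

records = [
    "LongValueSum:sshd(pam_unix)\t1",
    "LongValueSum:su(pam_unix)\t1",
    "LongValueSum:logrotate:\t1",
    "LongValueSum:sshd(pam_unix)\t1",
]
print(sum_long_values(records))
```

Piping the mapper output through such a function is a cheap way to check the expected counts on a small sample before submitting the job.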


Run the MapReduce job

In [38]:
%%bash

chmod +x myAggregatorForKeyCount.py

hdfs dfs -rm -r output_aggregate

mapred streaming \
  -input input \
  -output output_aggregate \
  -mapper myAggregatorForKeyCount.py \
  -reducer aggregate \
  -file myAggregatorForKeyCount.py


Deleted output_aggregate
packageJobJar: [myAggregatorForKeyCount.py] [] /tmp/streamjob3004401346663127159.jar tmpDir=null


2026-02-04 18:54:53,598 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2026-02-04 18:54:54,782 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2026-02-04 18:54:55,188 INFO mapred.FileInputFormat: Total input files to process : 1
2026-02-04 18:54:55,202 INFO mapreduce.JobSubmitter: number of splits:1
2026-02-04 18:54:55,458 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1780223367_0001
2026-02-04 18:54:55,459 INFO mapreduce.JobSubmitter: Executing with tokens: []
2026-02-04 18:54:55,739 INFO mapred.LocalDistributedCacheManager: Localized file:/content/myAggregatorForKeyCount.py as file:/tmp/hadoop-root/mapred/local/job_local1780223367_0001_4f71e57e-3be1-43ba-8f16-ce5cc0b0da90/myAggregatorForKeyCount.py
2026-02-04 18:54:55,851 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2026-02-04 18:54:55,853 INFO mapreduce.Job: Running job: job_local1780223367_0001
2026-02-04 18:54:55

Check result

In [39]:
%%bash
hdfs dfs -ls output_aggregate
hdfs dfs -cat output_aggregate/part-00000

Found 2 items
-rw-r--r--   1 root supergroup          0 2026-02-04 18:54 output_aggregate/_SUCCESS
-rw-r--r--   1 root supergroup        326 2026-02-04 18:54 output_aggregate/part-00000
--	1
bluetooth:	2
cups:	12
ftpd	916
gdm(pam_unix)	2
gdm-binary	1
gpm	2
hcid	1
irqbalance:	1
kernel:	76
klogind	46
login(pam_unix)	2
logrotate:	43
named	16
network:	2
nfslock:	1
portmap:	1
random:	1
rc:	1
rpc.statd	1
rpcidmapd:	1
sdpd	1
snmpd	1
sshd(pam_unix)	677
su(pam_unix)	172
sysctl:	1
syslog:	2
syslogd	7
udev	8
xinetd	2


Pretty-print table of aggregated data

In [40]:
%%bash
hdfs dfs -get output_aggregate/part-00000 result # download results file
# Use awk to format the output into columns and then sort by the second field numerically in descending order
awk '{printf "%-20s %s\n", $1, $2}' result | sort -k2nr

ftpd                 916
sshd(pam_unix)       677
su(pam_unix)         172
kernel:              76
klogind              46
logrotate:           43
named                16
cups:                12
udev                 8
syslogd              7
bluetooth:           2
gdm(pam_unix)        2
gpm                  2
login(pam_unix)      2
network:             2
syslog:              2
xinetd               2
--                   1
gdm-binary           1
hcid                 1
irqbalance:          1
nfslock:             1
portmap:             1
random:              1
rc:                  1
rpcidmapd:           1
rpc.statd            1
sdpd                 1
snmpd                1
sysctl:              1


get: `result': File exists
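
The same table can be produced in Python, which may be convenient if the counts are needed for further processing in the notebook. This is a sketch with a made-up function name (`format_counts`); it mimics the `awk` format string and sorts by count in descending order, keeping the input order for ties.

```python
def format_counts(text: str, width: int = 20) -> list:
    """Format "<key> <count>" lines as aligned rows, sorted by count descending."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            rows.append((fields[0], int(fields[1])))
    # Python's sort is stable, so rows with equal counts keep their input order
    rows.sort(key=lambda row: row[1], reverse=True)
    return [f"{name:<{width}} {count}" for name, count in rows]

# A few rows taken from the job output above
sample = "cups:\t12\nftpd\t916\nudev\t8\n"
print("\n".join(format_counts(sample)))
```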


# Stop cluster

When you're done with your computations, you can shut down the Hadoop cluster and stop the `sshd` service.

In [41]:
!./hadoop-3.4.2/sbin/stop-dfs.sh

Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [642d0bb387c8]


Stop the `sshd` daemon

In [42]:
!/etc/init.d/ssh stop

 * Stopping OpenBSD Secure Shell server sshd
   ...done.


# Concluding remarks

We have started a single-node Hadoop cluster and run some simple HDFS and MapReduce commands.

Even when running on a single machine, one can benefit from the parallelism provided by multiple virtual cores.

Hadoop also provides a command-line utility (the CLI MiniCluster) to start and stop a single-node Hadoop cluster "_without the need to set any environment variables or manage configuration files_" (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/CLIMiniCluster.html). The [Hadoop MiniCluster](https://github.com/groda/big_data/blob/master/Hadoop_minicluster.ipynb) notebook serves as a guide for launching the Hadoop MiniCluster.

While it is convenient to start a Hadoop cluster with a single command, setting up each component by hand gives a better understanding of how the pieces of the Hadoop architecture fit together.

If you found this notebook helpful, consider exploring:
 - [Hadoop single-node cluster setup with Python](https://github.com/groda/big_data/blob/master/Hadoop_single_node_cluster_setup_Python.ipynb), similar to this notebook but using Python in place of bash
 - [Setting up Spark Standalone on Google Colab](https://github.com/groda/big_data/blob/master/Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb)
 - [Getting to know the Spark Standalone Architecture](https://github.com/groda/big_data/blob/master/Spark_Standalone_Architecture_on_Google_Colab.ipynb)


