<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90"></div></a>

# HDFS and MapReduce on a single-node Hadoop cluster
<br>
<br>

In this tutorial/notebook we'll set up a single-node cluster, following the guidelines outlined on https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html. Subsequently, we'll run a few basic HDFS and MapReduce commands.

Upon downloading the software, several preliminary steps must be taken, including setting environment variables, generating SSH keys, and more. To streamline these tasks, we've consolidated them under the "Prologue" section.

Upon completion of the prologue, we can launch a single-node Hadoop cluster on the current virtual machine.

Following that, we'll execute a series of test HDFS commands and MapReduce jobs on the Hadoop cluster. These will be performed using a dataset sourced from a publicly available collection.

Finally, we'll proceed to shut down the cluster.


**TABLE OF CONTENTS**
* **[Prologue](#scrollTo=oUuQjW2oNMcJ)**

 * [Check the available Java version](#scrollTo=qFfOrktMPq8M)

 * [Download core Hadoop](#scrollTo=KE7kSYSXQYLf)

   * [Verify the downloaded file](#scrollTo=lGI4TNXPamMr)

 * [Configure `PATH`](#scrollTo=RlgP1ytnRtUK)

 * [Configure `core-site.xml` and `hdfs-site.xml`](#scrollTo=KLmxLQeJSb4A)

 * [Set environment variables](#scrollTo=kXbSKFyeMqr2)

 * [Setup localhost access via SSH key](#scrollTo=k2-Fdp73cF0V)

   * [Install `openssh` and start server](#scrollTo=-Uxmv3RdUwiF)

   * [Generate key](#scrollTo=PYKoSlaENuyG)

   * [Check SSH connection to localhost](#scrollTo=FwA6rKpScnVi)

* **[Launch a single-node Hadoop cluster](#scrollTo=V68C4cDySyek)**

   * [Initialize the namenode](#scrollTo=HTDPwnVlSbHS)

   * [Start cluster](#scrollTo=xMrEiLB_VAeR)

* **[Run some simple HDFS commands](#scrollTo=CKRRbwDFv3ZQ)**

* **[Run some simple MapReduce jobs](#scrollTo=G3KBe4R65bl1)**

   * [Simplest MapReduce job](#scrollTo=yVJA-3jSATGV)

   * [Another MapReduce example: filter a log file](#scrollTo=BbosNo0TD3oH)

   * [Aggregate data with MapReduce](#scrollTo=Sam22f-YT1xR)

* **[Stop cluster](#scrollTo=IF6-Z5RotAcO)**

* **[Concluding remarks](#scrollTo=w5N7tb0HSbZB)**



> # ⛔ Do not run on Google Colab
> Run this notebook on any Ubuntu Jammy machine instead.

In [1]:
!cat /etc/os-release

PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy


# Prologue

## Check the available Java version
 Apache Hadoop 3.4.2 supports Java > 8 (JDK>8). See: https://hadoop.apache.org/docs/r3.4.2/


Check if the Java version is one of `11`, `17`

In [2]:
!java -version

openjdk version "17.0.17" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17+10-Ubuntu-122.04)
OpenJDK 64-Bit Server VM (build 17.0.17+10-Ubuntu-122.04, mixed mode, sharing)


In [3]:
%%bash
JAVA_MAJOR_VERSION=$(java -version 2>&1 | grep -m1 -Po '(\d+\.)+\d+' | cut -d '.' -f1)
if [[ $JAVA_MAJOR_VERSION -eq 11 || $JAVA_MAJOR_VERSION -eq 17 ]]
 then
 echo "Java version is one of 11, 17 ✓"
 fi

Java version is one of 11, 17 ✓
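
The same check can be sketched in Python. This is only an illustration: the helper name `java_major_version` is made up for this example, and it assumes that the version string appears in double quotes after the word `version`, as in the output above. Unlike taking the first dot-separated field, it also handles the legacy `1.x` numbering used up to Java 8, where `1.8.0_292` denotes major version 8.

```python
import re

def java_major_version(version_output: str) -> int:
    """Return the Java major version found in the output of `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_292" denotes Java 8
    if first == "1" and second is not None:
        return int(second)
    return int(first)

print(java_major_version('openjdk version "17.0.17" 2025-10-21'))
print(java_major_version('openjdk version "1.8.0_292"'))
```

In the notebook, the string to parse could be obtained with `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`, since `java -version` writes to standard error.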


## Set `JAVA_HOME`

Find the path for the environment variable `JAVA_HOME`

In [4]:
!readlink -f $(which java)

/usr/lib/jvm/java-17-openjdk-amd64/bin/java


In [5]:
%%bash
JAVA_HOME=$(readlink -f $(which java) | sed 's/\/bin\/java$//')
echo $JAVA_HOME

/usr/lib/jvm/java-17-openjdk-amd64


Extract `JAVA_HOME` from the Java path by removing the trailing `/bin/java`, and export it so that it is available to all subsequent cells

In [6]:
import subprocess
import os

java_home = subprocess.check_output(
    "readlink -f $(which java) | sed 's:/bin/java$::'",
    shell=True,
    text=True
).strip()

os.environ["JAVA_HOME"] = java_home

print("JAVA_HOME =", os.environ["JAVA_HOME"])

JAVA_HOME = /usr/lib/jvm/java-17-openjdk-amd64


## Download core Hadoop
Download the latest stable version of the core Hadoop distribution from one of the download mirrors listed at https://www.apache.org/dyn/closer.cgi/hadoop/common/.

**Note:** with the option `--no-clobber`, `wget` will not download the file if it already exists.

In [7]:
!wget --no-clobber https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz

--2026-02-04 20:43:59--  https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1065831750 (1016M) [application/x-gzip]
Saving to: ‘hadoop-3.4.2.tar.gz’


2026-02-04 20:44:05 (172 MB/s) - ‘hadoop-3.4.2.tar.gz’ saved [1065831750/1065831750]



Uncompress archive

In [8]:
%%bash
if [ ! -d "hadoop-3.4.2" ]; then
  tar xzf hadoop-3.4.2.tar.gz
fi

### Verify the downloaded file

(see https://www.apache.org/dyn/closer.cgi/hadoop/common/)

Download sha512 file

In [9]:
! wget --no-clobber https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz.sha512

--2026-02-04 20:44:26--  https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz.sha512
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 160 [text/plain]
Saving to: ‘hadoop-3.4.2.tar.gz.sha512’


2026-02-04 20:44:26 (2.93 MB/s) - ‘hadoop-3.4.2.tar.gz.sha512’ saved [160/160]



Compare

In [10]:
%%bash
A=$(sha512sum hadoop-3.4.2.tar.gz | cut - -d' ' -f1)
B=$(cut hadoop-3.4.2.tar.gz.sha512 -d' ' -f4)
printf "%s\n%s\n" $A $B
[[ $A == $B ]] && echo "True"

79a383e156022d6690da359120b25db8146452265d92a4e890d9ea78c2078a01b661daf78163ee9b4acef7106b01fd5c8d1a55f7ad284f88b31ab3f402ae3acf
79a383e156022d6690da359120b25db8146452265d92a4e890d9ea78c2078a01b661daf78163ee9b4acef7106b01fd5c8d1a55f7ad284f88b31ab3f402ae3acf
True
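
The comparison can also be done with Python's `hashlib`. The following is a minimal sketch, not the official verification procedure: the function names are invented for this example, and it assumes that the `.sha512` file consists of a single line ending with the hexadecimal digest (which is what the `cut ... -f4` above suggests).

```python
import hashlib

def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file chunk by chunk so a large archive is never fully in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def expected_digest(sha512_text: str) -> str:
    """Assumption: the digest is the last whitespace-separated token of the file."""
    return sha512_text.split()[-1].lower()

def verify(archive: str, checksum_file: str) -> bool:
    """Return True if the archive's SHA-512 matches the published one."""
    with open(checksum_file) as f:
        return sha512_of_file(archive) == expected_digest(f.read())
```

With the files downloaded above, `verify("hadoop-3.4.2.tar.gz", "hadoop-3.4.2.tar.gz.sha512")` should return `True`.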


## Configure `PATH`

Add the Hadoop folder to the `PATH` environment variable


In [11]:
!echo $PATH

/opt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin


In [12]:
import os
os.environ['HADOOP_HOME'] = os.path.join(os.getcwd(), 'hadoop-3.4.2')
os.environ['PATH'] = ':'.join([os.path.join(os.environ['HADOOP_HOME'], 'bin'), os.environ['PATH']])

In [13]:
import os
for key, value in os.environ.items():
    print(f"{key}: {value}")

SHELL: /bin/bash
COLAB_JUPYTER_TRANSPORT: ipc
CGROUP_MEMORY_EVENTS: /sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events
VM_GCE_METADATA_HOST: 169.254.169.253
MODEL_PROXY_HOST: https://mp.kaggle.net
HOSTNAME: 7a779a0ab01e
LANGUAGE: en_US
TBE_RUNTIME_ADDR: 172.28.0.1:8011
GCE_METADATA_TIMEOUT: 3
COLAB_JUPYTER_IP: 172.28.0.12
COLAB_LANGUAGE_SERVER_PROXY_ROOT_URL: http://172.28.0.1:8013/
KMP_LISTEN_PORT: 6000
TF_FORCE_GPU_ALLOW_GROWTH: true
ENV: /root/.bashrc
PWD: /
TBE_EPHEM_CREDS_ADDR: 172.28.0.1:8009
COLAB_LANGUAGE_SERVER_PROXY_REQUEST_TIMEOUT: 30s
TBE_CREDS_ADDR: 172.28.0.1:8008
COLAB_JUPYTER_TOKEN: 
LAST_FORCED_REBUILD: 20250623
TCLLIBPATH: /usr/share/tcltk/tcllib1.20
COLAB_KERNEL_MANAGER_PROXY_HOST: 172.28.0.12
UV_BUILD_CONSTRAINT: 
COLAB_WARMUP_DEFAULTS: 1
HOME: /root
LANG: en_US.UTF-8
CLOUDSDK_CONFIG: /content/.config
UV_SYSTEM_PYTHON: true
COLAB_RELEASE_TAG: release-colab-external_20260202-060039_RC01
KMP_TARGET_PORT: 9000
KMP_EXTRA_ARGS: --logtostderr --

In [14]:
!echo $PATH

/content/hadoop-3.4.2/bin:/opt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin
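
Note that re-running the cell that sets `PATH` prepends the Hadoop `bin` folder once more each time. This is harmless, but if you prefer an idempotent update, a small helper along these lines could be used (a sketch; `prepend_to_path` is a hypothetical name, not part of any library):

```python
def prepend_to_path(path_value: str, new_dir: str) -> str:
    """Put new_dir first in a colon-separated PATH, dropping any earlier copies."""
    entries = [p for p in path_value.split(":") if p and p != new_dir]
    return ":".join([new_dir] + entries)

path = "/usr/local/bin:/usr/bin:/bin"
path = prepend_to_path(path, "/content/hadoop-3.4.2/bin")
path = prepend_to_path(path, "/content/hadoop-3.4.2/bin")  # second call changes nothing
print(path)
```

In the notebook one would then write `os.environ['PATH'] = prepend_to_path(os.environ['PATH'], os.path.join(os.environ['HADOOP_HOME'], 'bin'))`.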


## Configure `core-site.xml` and `hdfs-site.xml`

Edit the files `etc/hadoop/core-site.xml` and `etc/hadoop/hdfs-site.xml` to configure pseudo-distributed operation.

**`etc/hadoop/core-site.xml`**
```
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```

**`etc/hadoop/hdfs-site.xml`**
```
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```

In [15]:
%%bash
echo -e "<configuration> \n\
    <property> \n\
        <name>fs.defaultFS</name> \n\
        <value>hdfs://localhost:9000</value> \n\
    </property> \n\
</configuration>" | sudo tee hadoop-3.4.2/etc/hadoop/core-site.xml > /dev/null

echo -e "<configuration> \n\
    <property> \n\
        <name>dfs.replication</name> \n\
        <value>1</value> \n\
    </property> \n\
</configuration>" | sudo tee hadoop-3.4.2/etc/hadoop/hdfs-site.xml > /dev/null


Check

In [16]:
!cat hadoop-3.4.2/etc/hadoop/hdfs-site.xml

<configuration> 
    <property> 
        <name>dfs.replication</name> 
        <value>1</value> 
    </property> 
</configuration>
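
Both files share the same simple structure: a `<configuration>` element containing one `<property>` per setting. As an alternative to `echo`, such a file could be generated from a Python dictionary. The sketch below uses only the standard library; `site_xml` is a name chosen for this example, and `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

def site_xml(properties: dict) -> str:
    """Render a dict of settings as the content of a Hadoop *-site.xml file."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    ET.indent(root, space="    ")  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")

print(site_xml({"dfs.replication": 1}))
```

The returned string can be written to `hadoop-3.4.2/etc/hadoop/hdfs-site.xml` with an ordinary `open(..., "w")`.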


## Set environment variables

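Add the following lines to the Hadoop configuration script `hadoop-env.sh` (the script is in `hadoop-3.4.2/etc/hadoop`). The cell below keeps a backup of the original script as `hadoop-env.sh.org` and then writes a new `hadoop-env.sh` containing these lines together with `JAVA_HOME` and `HADOOP_HOME`.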
```
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
```

In [17]:
%%bash
cp -n hadoop-3.4.2/etc/hadoop/hadoop-env.sh hadoop-3.4.2/etc/hadoop/hadoop-env.sh.org
cat <<😃 >hadoop-3.4.2/etc/hadoop/hadoop-env.sh
export JAVA_HOME=$JAVA_HOME
export HADOOP_HOME=$HADOOP_HOME
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
😃

## Setup localhost access via SSH key

We are going to allow passphraseless access to `localhost` with a secure key.

SSH must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.



### Install `openssh` and start server

The option `StrictHostKeyChecking no` is a setting of the `ssh` client (hence the file `/etc/ssh/ssh_config` and not `sshd_config`). With the default setting, `ssh` asks for interactive confirmation the first time it connects to a host whose key is not yet listed in `~/.ssh/known_hosts`, and it refuses to connect if the stored host key has changed. With `StrictHostKeyChecking no`, new host keys are added to `known_hosts` automatically, without a prompt. This is needed here because the Hadoop start scripts connect to `localhost` via `ssh` non-interactively, so nobody could answer the confirmation prompt.

If the host key has changed since a previous run, one could delete the file `~/.ssh/known_hosts` or else use `ssh-keygen -R hostname` to remove all keys belonging to hostname from the `known_hosts` file (see for instance [How to remove strict RSA key checking in SSH and what's the problem here?](https://serverfault.com/questions/6233/how-to-remove-strict-rsa-key-checking-in-ssh-and-whats-the-problem-here) or [Remove key from known_hosts](https://superuser.com/questions/30087/remove-key-from-known-hosts)). The command `ssh-keygen -R hostname` would be the most appropriate in a production setting where the file `~/.ssh/known_hosts` might contain other entries that you do not want to delete.


In [18]:
%%bash
sudo apt-get update
sudo apt-get -y install openssh-server
# tee -a appends to the file using elevated privileges
echo 'StrictHostKeyChecking no' | sudo tee -a /etc/ssh/ssh_config
sudo /etc/init.d/ssh restart

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://cli.github.com/packages stable InRelease [3,917 B]
Get:3 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [83.8 kB]
Get:5 https://cli.github.com/packages stable/main amd64 Packages [356 B]
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [9,696 kB]
Get:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Hit:8 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:9 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease [24.6 kB]
Get:10 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy/main amd64 Packages [38.8 kB]
Get:11 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,891 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:13 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 4.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 


### Generate key
Generate an SSH key that does not require a password.

The private key is contained in the file `id_rsa` located in the folder `~/.ssh`.

The public key is added to the file `~/.ssh/authorized_keys` in order to allow authentication with that key.

In [19]:
%%bash
rm $HOME/.ssh/id_rsa
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:mNsFMbLbp0RUPAjTz7ngfDDnh8OXBpI2rkNYz3AuZ+g root@7a779a0ab01e
The key's randomart image is:
+---[RSA 3072]----+
|      +o++.      |
|       =ooo      |
|      . o+ o     |
|      o*X.*      |
|     o+%SXo+ .   |
|    . ++X+B =    |
|     o.=o. =     |
|      E          |
|       .         |
+----[SHA256]-----+


rm: cannot remove '/root/.ssh/id_rsa': No such file or directory
Created directory '/root/.ssh'.


### Check SSH connection to localhost

The following command should output "hi 👋" if the connection works.

In [20]:
!ssh localhost "echo hi 👋"

hi 👋


# Launch a single-node Hadoop cluster

## Initialize the namenode

In [21]:
!sudo env JAVA_HOME=$JAVA_HOME $HADOOP_HOME/bin/hdfs namenode -format -nonInteractive

2026-02-04 20:44:47,734 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 7a779a0ab01e/172.28.0.12
STARTUP_MSG:   args = [-format, -nonInteractive]
STARTUP_MSG:   version = 3.4.2
STARTUP_MSG:   classpath = /content/hadoop-3.4.2/etc/hadoop:/content/hadoop-3.4.2/share/hadoop/common/lib/hadoop-shaded-guava-1.4.0.jar:/content/hadoop-3.4.2/share/hadoop/common/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/content/hadoop-3.4.2/share/hadoop/common/lib/hadoop-auth-3.4.2.jar:/content/hadoop-3.4.2/share/hadoop/common/lib/netty-transport-native-epoll-4.1.118.Final-linux-riscv64.jar:/content/hadoop-3.4.2/share/hadoop/common/lib/jetty-servlet-9.4.57.v20241219.jar:/content/hadoop-3.4.2/share/hadoop/common/lib/jersey-server-1.19.4.jar:/content/hadoop-3.4.2/share/hadoop/common/lib/snappy-java-1.1.10.4.jar:/content/hadoop-3.4.2/share/hadoop/common/lib/jetty-util-ajax-9.4.57.v20241

## Start cluster

In [22]:
!sudo env JAVA_HOME=$JAVA_HOME $HADOOP_HOME/sbin/start-dfs.sh

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [7a779a0ab01e]


In [23]:
%%bash
# Check if HDFS is in safe mode
if hdfs dfsadmin -safemode get | grep 'ON'; then
  echo "Namenode is in safe mode. Leaving safe mode..."
  hdfs dfsadmin -safemode leave
else
  echo "Namenode is not in safe mode."
fi

Namenode is not in safe mode.
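
If you prefer to evaluate the safe mode status in Python, the report can be parsed by looking at the status word itself rather than searching for the letters `ON` anywhere in the output. The sketch below assumes that the command prints a line starting with `Safe mode is` followed by `ON` or `OFF`; the function name is hypothetical.

```python
def safemode_is_on(report: str) -> bool:
    """Parse the text printed by `hdfs dfsadmin -safemode get`."""
    for line in report.splitlines():
        if line.startswith("Safe mode is"):
            words = line.split()
            # the fourth word is expected to be ON or OFF
            return len(words) > 3 and words[3] == "ON"
    raise ValueError("no safe mode status found")

print(safemode_is_on("Safe mode is OFF"))
print(safemode_is_on("Safe mode is ON"))
```

The report string could be obtained with `subprocess.check_output(["hdfs", "dfsadmin", "-safemode", "get"], text=True)`.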


# Run some simple HDFS commands

In [24]:
%%bash
# create directory "my_dir" in HDFS home
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root # this is the "home" of user root on HDFS
hdfs dfs -mkdir my_dir

# if sample_data does not exist, create it
mkdir -p sample_data
touch sample_data/mnist_test.csv

# Check if the file is empty and fill it if needed
if [ ! -s sample_data/mnist_test.csv ]; then
  echo -e "0\n1\n2\n3\n4\n5\n6\n7\n8\n9" > sample_data/mnist_test.csv
fi


# upload file mnist_test.csv to my_dir
hdfs dfs -put sample_data/mnist_test.csv my_dir/

# show contents of directory my_dir
hdfs dfs -ls -h my_dir

Found 1 items
-rw-r--r--   1 root supergroup     17.4 M 2026-02-04 20:45 my_dir/mnist_test.csv
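
The commands above use the relative path `my_dir`, which HDFS interprets relative to the user's home directory, here `/user/root`. The following sketch is a simplified model of that rule, not Hadoop's actual path resolution; it assumes the default home directory prefix `/user`, and the function name is made up for this example.

```python
import posixpath

def resolve_hdfs_path(path: str, user: str) -> str:
    """Simplified model: relative paths are taken relative to /user/<user>."""
    if path.startswith("/"):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join("/user", user, path))

print(resolve_hdfs_path("my_dir", "root"))
print(resolve_hdfs_path("my_dir/mnist_test.csv", "root"))
print(resolve_hdfs_path("/user", "root"))
```

This is why `/user/root` has to exist before `hdfs dfs -mkdir my_dir` can succeed.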


# Run some simple MapReduce jobs

We'll employ the [streaming](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html) library, which broadens our options by enabling the use of any programming language for both the mapper and/or the reducer.

With this utility, any executable (a binary, or a script that the operating system can run directly) can serve as mapper and/or reducer.

## Simplest MapReduce job

This is a "no-code" example since we are going to use the existing Unix commands `cat` and `wc` respectively as mapper and as reducer. The result will show a line with three values: the counts of lines, words, and bytes in the data received by the reducer.

Input folder is `/user/root/my_dir/`, output folder `/user/root/output_simplest`.

**Note**: the output folder should not exist because it is created by Hadoop (this is in accordance with Hadoop's principle of not overwriting data).

Now run the MapReduce job

In [25]:
%%bash

# remove the output folder left over from a previous run, if any
hdfs dfs -rm -r output_simplest || true
mapred streaming \
  -input my_dir \
  -output output_simplest \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

rm: `output_simplest': No such file or directory
2026-02-04 20:45:33,243 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2026-02-04 20:45:33,722 INFO mapred.FileInputFormat: Total input files to process : 1
2026-02-04 20:45:33,754 INFO mapreduce.JobSubmitter: number of splits:1
2026-02-04 20:45:34,114 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local433737235_0001
2026-02-04 20:45:34,116 INFO mapreduce.JobSubmitter: Executing with tokens: []
2026-02-04 20:45:34,398 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2026-02-04 20:45:34,400 INFO mapreduce.Job: Running job: job_local433737235_0001
2026-02-04 20:45:34,402 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2026-02-04 20:45:34,404 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
2026-02-04 2

If the output directory (here `output_simplest`) contains the empty file `_SUCCESS`, this means that the job was successful.

Check the output of the MapReduce job.

In [26]:
!hdfs dfs -cat output_simplest/part-00000

  10000   10000 18299443	


The number of words is in this case equal to the number of lines because there are no word separators (empty spaces) in the file, so each line is a word.
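
To get a feeling for what the job computes, the pipeline can be mimicked locally in plain Python. This is a rough sketch, not Hadoop's implementation: it assumes that streaming hands each record to the reducer as key, tab, value, so that a mapper output line without a tab arrives with an empty value and a trailing tab. Under that assumption the byte count reported by `wc` exceeds the size of the input by one byte per line.

```python
def simulate_cat_wc(text: str):
    """Mimic `-mapper cat -reducer wc` on a small text; returns (lines, words, bytes)."""
    # map phase: cat passes every line through unchanged
    mapped = text.splitlines()
    # shuffle phase: records are sorted by key (here the whole line)
    shuffled = sorted(mapped)
    # assumption: the reducer receives "key<TAB>value" with an empty value
    reducer_input = "".join(line + "\t\n" for line in shuffled).encode("utf-8")
    # reduce phase: wc counts newlines, whitespace-separated words, and bytes
    lines = reducer_input.count(b"\n")
    words = len(reducer_input.split())
    return lines, words, len(reducer_input)

print(simulate_cat_wc("0\n1\n2\n"))  # (3, 3, 9)
```

The three input lines take 6 bytes; the extra tab on each line brings the total to 9.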

## Another MapReduce example: filter a log file

We're going to use a Linux logfile and look for the string `sshd` in a given position. The file stems from [Loghub](https://github.com/logpai/loghub), a freely available collection of system logs for AI-driven log analytics research.

The mapper `mapper.py` keeps only the lines in which the fifth whitespace-separated field (index 4, counting from zero) starts with `sshd`.

The job has no reducer (option `-reducer NONE`). Note that without a reducer the sorting and shuffling phase after the map phase is skipped.
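
The filter condition can be written as a small pure function, which makes it easy to try out before wiring it into a streaming job. This is only a sketch: the name `is_sshd_line` is made up, the first sample line is taken from the logfile, and the second one is an invented `su` entry in the same format.

```python
def is_sshd_line(line: str) -> bool:
    """True if the fifth whitespace-separated field starts with 'sshd'."""
    fields = line.split()
    return len(fields) >= 5 and fields[4].startswith("sshd")

sample = [
    "Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown",
    "Jun 15 04:06:18 combo su(pam_unix)[21416]: session opened for user cyrus",  # invented
    "too short",
]

for line in sample:
    if is_sshd_line(line):
        print(line)
```

Only the first sample line is printed; lines with fewer than five fields are skipped instead of raising an `IndexError`.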


Download the logfile `Linux_2k.log`:

In [27]:
!wget --no-clobber https://raw.githubusercontent.com/logpai/loghub/master/Linux/Linux_2k.log

--2026-02-04 20:45:39--  https://raw.githubusercontent.com/logpai/loghub/master/Linux/Linux_2k.log
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 216485 (211K) [text/plain]
Saving to: ‘Linux_2k.log’


2026-02-04 20:45:39 (8.23 MB/s) - ‘Linux_2k.log’ saved [216485/216485]



In [28]:
%%bash
hdfs dfs -mkdir input || true
hdfs dfs -put Linux_2k.log input/ || true

Define the mapper

In [29]:
%%writefile mapper.py
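#!/usr/bin/env python3
import sys

for line in sys.stdin:
    # remove surrounding whitespace, then split the line into fields
    line = line.strip()
    fields = line.split()
    # keep the line if its fifth field (index 4) starts with "sshd"
    if len(fields) >= 5 and fields[4].startswith('sshd'):
        print(line)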


Writing mapper.py


Test the script (after setting the correct permissions)

In [30]:
!chmod 700 mapper.py

Look at the first 10 lines

In [31]:
!head -10 Linux_2k.log

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 com

Test the mapper in the shell (not using MapReduce):

In [32]:
!head -100 Linux_2k.log| ./mapper.py

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(p

Now run the MapReduce job on the pseudo-cluster

In [33]:
%%bash

hdfs dfs -rm -r output_filter

mapred streaming \
  -file mapper.py \
  -input input \
  -output output_filter \
  -mapper mapper.py \
  -reducer NONE


packageJobJar: [mapper.py] [] /tmp/streamjob15454486606118402166.jar tmpDir=null


rm: `output_filter': No such file or directory
2026-02-04 20:45:50,371 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2026-02-04 20:45:52,058 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2026-02-04 20:45:52,519 INFO mapred.FileInputFormat: Total input files to process : 1
2026-02-04 20:45:52,550 INFO mapreduce.JobSubmitter: number of splits:1
2026-02-04 20:45:52,878 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local593607530_0001
2026-02-04 20:45:52,879 INFO mapreduce.JobSubmitter: Executing with tokens: []
2026-02-04 20:45:53,209 INFO mapred.LocalDistributedCacheManager: Localized file:/content/mapper.py as file:/tmp/hadoop-root/mapred/local/job_local593607530_0001_6ec44ffd-17be-4525-98fc-86d0ab0a1684/mapper.py
2026-02-04 20:45:53,396 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2026-02-04 20:45:53,399 INFO mapred.LocalJobRunner: OutputCommitter set in config null
202

Check the result

In [34]:
!hdfs dfs -ls output_filter

Found 2 items
-rw-r--r--   1 root supergroup          0 2026-02-04 20:45 output_filter/_SUCCESS
-rw-r--r--   1 root supergroup      85436 2026-02-04 20:45 output_filter/part-00000


In [35]:
!hdfs dfs -cat output_filter/part-00000 |head

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4	
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown	
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4	
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo

## Aggregate data with MapReduce

Following the example in [Hadoop Streaming/Aggregate package](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Aggregate_Package)

In [36]:
%%writefile myAggregatorForKeyCount.py
#!/usr/bin/env python3
import sys

def generateLongCountToken(id):
    # the prefix "LongValueSum:" asks the aggregate reducer to sum the values of this key
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    for line in sys.stdin:
        fields = line.split()
        # skip lines that have no fifth field
        if len(fields) < 5:
            continue
        # process name: fifth field up to the "[" that precedes the PID
        s = fields[4].split('[')[0]
        print(generateLongCountToken(s))

if __name__ == "__main__":
    main(sys.argv)

Writing myAggregatorForKeyCount.py


Set permissions

In [37]:
!chmod 700 myAggregatorForKeyCount.py

Test the mapper

In [38]:
!head -20 Linux_2k.log| ./myAggregatorForKeyCount.py

LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:logrotate:	1
LongValueSum:su(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
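
What the `aggregate` reducer does with these records can be illustrated with a few lines of Python. The sketch below only mimics the `LongValueSum` case (the aggregate package offers other aggregators as well) and is not Hadoop's implementation; the function name is hypothetical.

```python
from collections import defaultdict

def aggregate_long_value_sum(mapper_lines):
    """Sum the values per key for records of the form 'LongValueSum:<key>\\t<n>'."""
    totals = defaultdict(int)
    for line in mapper_lines:
        key, value = line.rstrip("\n").split("\t")
        # split at the first colon only, since keys such as "logrotate:" contain one
        prefix, _, name = key.partition(":")
        if prefix != "LongValueSum":
            raise ValueError("unsupported aggregator: " + prefix)
        totals[name] += int(value)
    return dict(sorted(totals.items()))

records = [
    "LongValueSum:sshd(pam_unix)\t1",
    "LongValueSum:su(pam_unix)\t1",
    "LongValueSum:logrotate:\t1",
    "LongValueSum:sshd(pam_unix)\t1",
]
for name, total in aggregate_long_value_sum(records).items():
    print(name + "\t" + str(total))
```

The result has one line per distinct key with the summed count, which is the shape of the output produced by the MapReduce job.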


Run the MapReduce job

In [39]:
%%bash

chmod +x myAggregatorForKeyCount.py

hdfs dfs -rm -r output_aggregate

mapred streaming \
  -input input \
  -output output_aggregate \
  -mapper myAggregatorForKeyCount.py \
  -reducer aggregate \
  -file myAggregatorForKeyCount.py


packageJobJar: [myAggregatorForKeyCount.py] [] /tmp/streamjob13926960731478249436.jar tmpDir=null


rm: `output_aggregate': No such file or directory
2026-02-04 20:46:05,123 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2026-02-04 20:46:06,863 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2026-02-04 20:46:07,377 INFO mapred.FileInputFormat: Total input files to process : 1
2026-02-04 20:46:07,404 INFO mapreduce.JobSubmitter: number of splits:1
2026-02-04 20:46:07,730 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1472963981_0001
2026-02-04 20:46:07,732 INFO mapreduce.JobSubmitter: Executing with tokens: []
2026-02-04 20:46:08,090 INFO mapred.LocalDistributedCacheManager: Localized file:/content/myAggregatorForKeyCount.py as file:/tmp/hadoop-root/mapred/local/job_local1472963981_0001_5b18ad8d-1d2b-4c2d-9aa0-ad10a5c63e4f/myAggregatorForKeyCount.py
2026-02-04 20:46:08,232 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2026-02-04 20:46:08,234 INFO mapred.LocalJobRunner:

Check result

In [40]:
%%bash
hdfs dfs -ls output_aggregate
hdfs dfs -cat output_aggregate/part-00000

Found 2 items
-rw-r--r--   1 root supergroup          0 2026-02-04 20:46 output_aggregate/_SUCCESS
-rw-r--r--   1 root supergroup        326 2026-02-04 20:46 output_aggregate/part-00000
--	1
bluetooth:	2
cups:	12
ftpd	916
gdm(pam_unix)	2
gdm-binary	1
gpm	2
hcid	1
irqbalance:	1
kernel:	76
klogind	46
login(pam_unix)	2
logrotate:	43
named	16
network:	2
nfslock:	1
portmap:	1
random:	1
rc:	1
rpc.statd	1
rpcidmapd:	1
sdpd	1
snmpd	1
sshd(pam_unix)	677
su(pam_unix)	172
sysctl:	1
syslog:	2
syslogd	7
udev	8
xinetd	2


Pretty-print table of aggregated data

In [41]:
%%bash
hdfs dfs -get output_aggregate/part-00000 result # download results file
# Use awk to format the output into columns and then sort by the second field numerically in descending order
awk '{printf "%-20s %s\n", $1, $2}' result | sort -k2nr

ftpd                 916
sshd(pam_unix)       677
su(pam_unix)         172
kernel:              76
klogind              46
logrotate:           43
named                16
cups:                12
udev                 8
syslogd              7
bluetooth:           2
gdm(pam_unix)        2
gpm                  2
login(pam_unix)      2
network:             2
syslog:              2
xinetd               2
--                   1
gdm-binary           1
hcid                 1
irqbalance:          1
nfslock:             1
portmap:             1
random:              1
rc:                  1
rpcidmapd:           1
rpc.statd            1
sdpd                 1
snmpd                1
sysctl:              1


# Stop cluster

When you're done with your computations, you can shut down the Hadoop cluster and stop the `sshd` service.

In [42]:
!./hadoop-3.4.2/sbin/stop-dfs.sh

Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [7a779a0ab01e]


Stop the `sshd` daemon

In [43]:
!/etc/init.d/ssh stop

 * Stopping OpenBSD Secure Shell server sshd
   ...done.


# Concluding remarks

We have started a single-node Hadoop cluster and run some simple HDFS and MapReduce commands.

Even when running on a single machine, one can benefit from the parallelism provided by multiple virtual cores.

Hadoop also provides a command-line utility (the CLI MiniCluster) to start and stop a single-node Hadoop cluster "_without the need to set any environment variables or manage configuration files_" (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/CLIMiniCluster.html). The [Hadoop MiniCluster](https://github.com/groda/big_data/blob/master/Hadoop_minicluster.ipynb) notebook serves as a guide for launching the Hadoop MiniCluster.

While it can be useful to be able to start a Hadoop cluster with a single command, delving into the functionality of each component offers valuable insights into the intricacies of Hadoop architecture, thereby enriching the learning process.

If you found this notebook helpful, consider exploring:
 - [Hadoop single-node cluster setup with Python](https://github.com/groda/big_data/blob/master/Hadoop_single_node_cluster_setup_Python.ipynb), similar to this notebook but using Python in place of bash
 - [Setting up Spark Standalone on Google Colab](https://github.com/groda/big_data/blob/master/Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb)
 - [Getting to know the Spark Standalone Architecture](https://github.com/groda/big_data/blob/master/Spark_Standalone_Architecture_on_Google_Colab.ipynb)


