# HDFS and MapReduce on a single-node Hadoop cluster

We're going to set up a single-node cluster (following the instructions on https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html) and show how to run simple HDFS and MapReduce commands.

After downloading the software, it is necessary to carry out some preliminary steps (setting environment variables, generating SSH keys, etc.). We grouped all these steps under "Prologue".

Once done with the prologue, we are able to start a single-node Hadoop cluster on the current virtual machine.

We are going to run some test HDFS commands and MapReduce jobs on the Hadoop cluster.

Finally, the cluster will be shut down.


## Prologue

### Check the available Java version
Apache Hadoop 3.3 and later support Java 8 and Java 11 (runtime only). See: https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions


Check that the Java version is either `8` or `11`.

In [1]:
!java -version

openjdk version "11.0.16" 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu118.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu118.04, mixed mode, sharing)


In [2]:
%%bash
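# Java 8 reports itself as 1.8.x, so a leading "1." is stripped before taking the major version
JAVA_MAJOR_VERSION=$(java -version 2>&1 | grep -m1 -Po '(\d+\.)+\d+' | sed 's/^1\.//' | cut -d '.' -f1)
if [[ $JAVA_MAJOR_VERSION -eq 8 || $JAVA_MAJOR_VERSION -eq 11 ]]
then
  echo "Java version is one of 8, 11 ✓"
else
  echo "Unsupported Java version: $JAVA_MAJOR_VERSION" >&2
fi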

Java version is one of 8, 11 ✓
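
The same check can be sketched in Python. The helper name `java_major_version` is hypothetical, and the sketch assumes that `java -version` prints the version string in double quotes, as in the output above. Java 8 and earlier report themselves as `1.x`, so the second component is used in that case.

```python
import re

def java_major_version(version_output: str) -> int:
    """Return the Java major version found in `java -version` output."""
    match = re.search(r'"(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no quoted version string found")
    major, minor = match.group(1), match.group(2)
    # Java 8 and earlier use the legacy scheme 1.x (for example "1.8.0_292")
    if major == "1" and minor is not None:
        return int(minor)
    return int(major)

print(java_major_version('openjdk version "11.0.16" 2022-07-19'))  # 11
```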


Find the path for the environment variable `JAVA_HOME`.

In [3]:
!readlink -f $(which java)

/usr/lib/jvm/java-11-openjdk-amd64/bin/java


Extract `JAVA_HOME` from the Java path by removing the trailing `bin/java` part.

In [4]:
%%bash
JAVA_HOME=$(readlink -f $(which java) | sed 's/\/bin\/java$//')
echo $JAVA_HOME

/usr/lib/jvm/java-11-openjdk-amd64
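
As a rough Python counterpart of the `sed` expression, the sketch below strips the trailing `bin/java` from an already resolved path. The function name `java_home_from_binary` is an assumption made for illustration; in the notebook the resolved path could come from `os.path.realpath(shutil.which("java"))`.

```python
from pathlib import PurePosixPath

def java_home_from_binary(java_binary: str) -> str:
    """Strip the trailing bin/java from a resolved path to the java executable."""
    path = PurePosixPath(java_binary)
    if path.name != "java" or path.parent.name != "bin":
        raise ValueError(f"unexpected java path: {java_binary}")
    # two levels up: .../bin/java -> .../bin -> JAVA_HOME
    return str(path.parent.parent)

print(java_home_from_binary("/usr/lib/jvm/java-11-openjdk-amd64/bin/java"))
```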


We're going to use `JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64`.

Use `%env` to set the variable for the current notebook session. The line is commented out here because `JAVA_HOME` is set later through `os.environ`.

In [5]:
#%env JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

### Download core Hadoop 
Download the latest stable version of the core Hadoop distribution from one of the download mirror locations: https://www.apache.org/dyn/closer.cgi/hadoop/common/

In [6]:
%%bash
curl -L https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.4.tar.gz -o hadoop-3.3.4.tar.gz
tar xzf hadoop-3.3.4.tar.gz 

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  663M  100  663M    0     0   266M      0  0:00:02  0:00:02 --:--:--  266M


### Verify the downloaded file 

(see https://www.apache.org/dyn/closer.cgi/hadoop/common/)

In [7]:
!curl -O https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.4.tar.gz.sha512

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   160  100   160    0     0   6666      0 --:--:-- --:--:-- --:--:--  6666


In [8]:
%%bash
sha512sum hadoop-3.3.4.tar.gz | cut - -d' ' -f1
cut hadoop-3.3.4.tar.gz.sha512 -d' ' -f4

ca5e12625679ca95b8fd7bb7babc2a8dcb2605979b901df9ad137178718821097b67555115fafc6dbf6bb32b61864ccb6786dbc555e589694a22bf69147780b4
ca5e12625679ca95b8fd7bb7babc2a8dcb2605979b901df9ad137178718821097b67555115fafc6dbf6bb32b61864ccb6786dbc555e589694a22bf69147780b4
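
The two digests above are compared by eye. A minimal Python sketch of an automated comparison follows; it assumes the `.sha512` file has the form `SHA512 (<file>) = <digest>` (which is what taking the fourth space-separated field suggests), and the function names are hypothetical. The demo runs on a small temporary file rather than on the real archive.

```python
import hashlib
import os
import tempfile

def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def expected_digest(sha512_line):
    """Extract the digest from a line assumed to look like 'SHA512 (file) = digest'."""
    return sha512_line.split()[-1].lower()

def checksum_matches(archive_path, sha512_line):
    return sha512_of_file(archive_path) == expected_digest(sha512_line)

# self-contained demo on a temporary file (stand-in for the downloaded archive)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello hadoop")
demo_line = "SHA512 (demo.bin) = " + hashlib.sha512(b"hello hadoop").hexdigest()
print(checksum_matches(tmp.name, demo_line))
os.remove(tmp.name)
```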


### Configure `PATH`

Add the Hadoop folder to the `PATH` environment variable


In [9]:
!echo $PATH

/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin


In [10]:
import os
#os.environ['HADOOP_HOME'] = os.path.join(os.environ['HOME'], 'hadoop-3.3.4')
os.environ['HADOOP_HOME'] = os.path.join('/content', 'hadoop-3.3.4')
os.environ['PATH'] = ':'.join([os.path.join(os.environ['HADOOP_HOME'], 'bin'), os.environ['PATH']])
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'

In [11]:
import os
print(os.environ)



In [12]:
!echo $PATH

/content/hadoop-3.3.4/bin:/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin
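
Re-running the cell that sets `PATH` prepends the Hadoop `bin` folder a second time. The sketch below shows one way to make the update idempotent; `prepend_to_path` is a hypothetical helper, not part of the notebook.

```python
def prepend_to_path(path_value: str, new_dir: str, sep: str = ":") -> str:
    """Return path_value with new_dir in front, dropping any earlier copy of it."""
    entries = [entry for entry in path_value.split(sep) if entry and entry != new_dir]
    return sep.join([new_dir] + entries)

path = "/usr/local/bin:/usr/bin:/bin"
path = prepend_to_path(path, "/content/hadoop-3.3.4/bin")
path = prepend_to_path(path, "/content/hadoop-3.3.4/bin")  # second call changes nothing
print(path)
```

In the notebook the result would be assigned to `os.environ['PATH']`.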


### Configure `core-site.xml` and `hdfs-site.xml`

Edit the files `etc/hadoop/core-site.xml` and `etc/hadoop/hdfs-site.xml` to configure pseudo-distributed operation.

**`etc/hadoop/core-site.xml`**
```
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```

**`etc/hadoop/hdfs-site.xml`**
```
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```

In [13]:
%%bash
echo -e "<configuration> \n\
    <property> \n\
        <name>fs.defaultFS</name> \n\
        <value>hdfs://localhost:9000</value> \n\
    </property> \n\
</configuration>" >hadoop-3.3.4/etc/hadoop/core-site.xml

echo -e "<configuration> \n\
    <property> \n\
        <name>dfs.replication</name> \n\
        <value>1</value> \n\
    </property> \n\
</configuration>" >hadoop-3.3.4/etc/hadoop/hdfs-site.xml

Check

In [14]:
cat hadoop-3.3.4/etc/hadoop/hdfs-site.xml

<configuration> 
    <property> 
        <name>dfs.replication</name> 
        <value>1</value> 
    </property> 
</configuration>
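
Instead of echoing XML by hand, the two files can be generated from a dictionary of properties. This is only a sketch using Python's standard `xml.etree.ElementTree`; the helper name `hadoop_site_xml` is an assumption, and the output is not indented the way the hand-written files are.

```python
import xml.etree.ElementTree as ET

def hadoop_site_xml(properties: dict) -> str:
    """Build a <configuration> document with one <property> per dictionary entry."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

core_site = hadoop_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = hadoop_site_xml({"dfs.replication": 1})
print(core_site)
print(hdfs_site)
```

The strings could then be written to `hadoop-3.3.4/etc/hadoop/core-site.xml` and `hadoop-3.3.4/etc/hadoop/hdfs-site.xml`.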


### Set environment variables

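Add the following lines to the Hadoop configuration script `hadoop-env.sh` (the script is in `hadoop-3.3.4/etc/hadoop`). The cell below appends the five `*_USER` variables; `JAVA_HOME` is appended in a later cell.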
```
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
```

In [15]:
%%bash
echo -e "export HDFS_NAMENODE_USER=root \n\
export HDFS_DATANODE_USER=root \n\
export HDFS_SECONDARYNAMENODE_USER=root \n\
export YARN_RESOURCEMANAGER_USER=root \n\
export YARN_NODEMANAGER_USER=root" >> hadoop-3.3.4/etc/hadoop/hadoop-env.sh
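
Because `>>` appends on every run, re-executing the cell writes the same exports again. That is harmless for the shell, but an idempotent append keeps the file tidy. A minimal sketch, with a hypothetical helper `append_missing_lines` and a temporary file standing in for `hadoop-env.sh`:

```python
import os
import tempfile

def append_missing_lines(path, lines):
    """Append each line to the file unless an identical line is already present."""
    content = ""
    if os.path.exists(path):
        with open(path) as f:
            content = f.read()
    existing = {line.strip() for line in content.splitlines()}
    added = []
    with open(path, "a") as f:
        if content and not content.endswith("\n"):
            f.write("\n")  # avoid gluing the first new line to the last old one
        for line in lines:
            line = line.strip()
            if line not in existing:
                f.write(line + "\n")
                existing.add(line)
                added.append(line)
    return added

exports = ["export HDFS_NAMENODE_USER=root", "export HDFS_DATANODE_USER=root"]
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as env_file:
    env_file.write("# demo stand-in for hadoop-env.sh\n")
print(append_missing_lines(env_file.name, exports))  # both lines are added
print(append_missing_lines(env_file.name, exports))  # nothing left to add
os.remove(env_file.name)
```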

### Setup localhost access via SSH key

We are going to allow passphraseless access to `localhost` with a secure key.

SSH must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.



#### Install `openssh` and start server

In [16]:
%%bash
apt-get install -y openssh-server
echo 'StrictHostKeyChecking no' >> /etc/ssh/ssh_config
/etc/init.d/ssh restart

Reading package lists...
Building dependency tree...
Reading state information...
openssh-server is already the newest version (1:7.6p1-4ubuntu0.7).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.
 * Restarting OpenBSD Secure Shell server sshd
   ...done.


#### Generate key
Generate an SSH key that does not require a passphrase.

In [17]:
%%bash
rm -f ~/.ssh/id_rsa
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:dJfNcKhqqLWy7x7fxJpYHIsfvb1+WbXMj7LQ3b2fWXc root@93ad6cb45e12
The key's randomart image is:
+---[RSA 2048]----+
|            ...  |
|            .*   |
|        . ..o o  |
|       . ...    .|
|       .S.    o o|
|      ooo= . . =o|
|     ooo= = . .+E|
|    o .* * +. + O|
|    .*= = o.== +o|
+----[SHA256]-----+


#### Check SSH connection to localhost

The following command should output "hi!" if the connection works.

In [18]:
!ssh localhost "echo hi!"

hi!


## Launch a single-node Hadoop cluster

### Initialize the namenode

In [19]:
!hdfs namenode -format -nonInteractive

namenode is running as process 7406.  Stop it first and ensure /tmp/hadoop-root-namenode.pid file is empty before retry.


In [20]:
!echo -e "export JAVA_HOME=$(readlink -f $(which java) | sed 's/\/bin\/java$//') \n\
export HDFS_NAMENODE_USER=root \n\
export HDFS_DATANODE_USER=root \n\
export HDFS_SECONDARYNAMENODE_USER=root \n\
export YARN_RESOURCEMANAGER_USER=root \n\
export YARN_NODEMANAGER_USER=root" >> hadoop-3.3.4/etc/hadoop/hadoop-env.sh

### Start cluster

In [21]:
!$HADOOP_HOME/sbin/start-dfs.sh

Starting namenodes on [localhost]
localhost: namenode is running as process 7406.  Stop it first and ensure /tmp/hadoop-root-namenode.pid file is empty before retry.
Starting datanodes
localhost: datanode is running as process 7540.  Stop it first and ensure /tmp/hadoop-root-datanode.pid file is empty before retry.
Starting secondary namenodes [93ad6cb45e12]
93ad6cb45e12: secondarynamenode is running as process 7759.  Stop it first and ensure /tmp/hadoop-root-secondarynamenode.pid file is empty before retry.


## Run some simple `hdfs` commands

In [22]:
%%bash
# create directory "my_dir" in HDFS home 
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root # this is the "home" of user root on HDFS
hdfs dfs -mkdir my_dir

# upload file mnist_test.csv to my_dir
hdfs dfs -put /content/sample_data/mnist_test.csv my_dir/

# show contents of directory my_dir
hdfs dfs -ls -h my_dir


Found 1 items
-rw-r--r--   1 root supergroup     17.4 M 2022-10-08 15:43 my_dir/mnist_test.csv


mkdir: `/user': File exists
mkdir: `/user/root': File exists
mkdir: `my_dir': File exists
put: `my_dir/mnist_test.csv': File exists
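
The commands above mix absolute paths (`/user/root`) and relative ones (`my_dir`). HDFS resolves relative paths against the user's home directory `/user/<username>`, which is why `my_dir` ends up under `/user/root`. The sketch below only illustrates that rule; `resolve_hdfs_path` is a hypothetical helper and does not talk to HDFS.

```python
import posixpath

def resolve_hdfs_path(path: str, user: str) -> str:
    """Resolve a path the way relative HDFS paths are resolved against /user/<user>."""
    if path.startswith("/"):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join("/user", user, path))

print(resolve_hdfs_path("my_dir", "root"))
print(resolve_hdfs_path("my_dir/mnist_test.csv", "root"))
print(resolve_hdfs_path("/user", "root"))
```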


## Run some simple MapReduce jobs
We're going to use the [streaming](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html) library. 
With this utility any executable can be used as the mapper and/or the reducer. 

### Simplest MapReduce job

We are going to use the Unix commands `cat` as mapper and `wc` as reducer so we don't need to code anything. The result will show a line with three values: the counts of lines, words, and characters in the input file(s).

The input folder is `/user/root/my_dir/` and the output folder is `/user/root/output` (both are given as relative paths in the command, so they resolve against the HDFS home of user `root`).

**Note**: the output folder should not exist because it is created by Hadoop (this is in accordance with Hadoop's principle of not overwriting data).

In [23]:
%%bash

hdfs dfs -rm -r output >/dev/null 2>&1
mapred streaming \
  -input my_dir \
  -output output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

2022-10-08 15:58:05,691 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2022-10-08 15:58:05,889 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2022-10-08 15:58:05,889 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2022-10-08 15:58:05,908 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2022-10-08 15:58:06,351 INFO mapred.FileInputFormat: Total input files to process : 1
2022-10-08 15:58:06,368 INFO mapreduce.JobSubmitter: number of splits:1
2022-10-08 15:58:06,610 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1911765736_0001
2022-10-08 15:58:06,611 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-10-08 15:58:06,809 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2022-10-08 15:58:06,811 INFO mapreduce.Job: Running job: job_local1911765736_0001
2022-10-08 15:58:06,818 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2022-10

If the `output` directory contains the empty file `_SUCCESS`, this means that the job was successful. 

Check the output of the MapReduce job.

In [24]:
!hdfs dfs -cat output/part-00000

  10000   10000 18299443	


The number of words is in this case equal to the number of lines because there are no word separators (empty spaces) in the file, so each line is a word.
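
To see why the line and word counts coincide, here is a small Python stand-in for `wc`. The function `wc` and the three sample rows are made up for illustration; they only mimic comma-separated lines without whitespace.

```python
def wc(lines):
    """Count lines, whitespace-separated words and bytes, similar to `wc` without options."""
    line_count = word_count = byte_count = 0
    for line in lines:
        line_count += 1
        word_count += len(line.split())
        byte_count += len(line.encode("utf-8"))
    return line_count, word_count, byte_count

rows = ["7,0,0,84\n", "2,0,0,0\n", "1,0,0,0\n"]  # illustrative CSV rows, no spaces
print(wc(rows))  # (3, 3, 25)
```

Each row contains no whitespace, so `split()` yields exactly one word per line.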

### Another MapReduce example: filter a log file

We're going to use a Linux logfile and look for the string `sshd` in a given position. 

The mapper `mapper.py` keeps only the lines whose fifth whitespace-separated field (index 4 in the code) starts with `sshd`. 

The job has no reducer (option `-reducer NONE`). Note that without a reducer the sorting and shuffling phase after the map phase is skipped.


Download the logfile `Linux_2k.log`:

In [25]:
!curl -O https://raw.githubusercontent.com/logpai/loghub/master/Linux/Linux_2k.log

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  209k  100  209k    0     0  2228k      0 --:--:-- --:--:-- --:--:-- 2204k


In [26]:
%%bash
hdfs dfs -mkdir input
hdfs dfs -put Linux_2k.log input/ 
hdfs dfs -rm -r output >/dev/null 2>&1

mkdir: `input': File exists
put: `input/Linux_2k.log': File exists


Define the mapper

In [27]:
%%writefile mapper.py
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    # split the line into whitespace-separated fields
    fields = line.split()
    # keep the line only if the process field (index 4) starts with 'sshd'
    if len(fields) >= 5 and fields[4].startswith('sshd'):
        print(line)


Overwriting mapper.py


Test the script (after setting the correct permissions)

In [28]:
!chmod 700 mapper.py

Look at the first 10 lines

In [29]:
!head -10 Linux_2k.log

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4 
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd

Test the mapper in the shell (not using MapReduce):

In [30]:
!head -100 Linux_2k.log| ./mapper.py 

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root
Jun 15 02:04:59 combo sshd(p

Now run the MapReduce job on the pseudo-cluster

In [31]:
%%bash

hdfs dfs -rm -r output >/dev/null 2>&1
mapred streaming \
  -file mapper.py \
  -input input \
  -output output \
  -mapper mapper.py \
  -reducer NONE 
  

packageJobJar: [mapper.py] [] /tmp/streamjob11527518915332424217.jar tmpDir=null


2022-10-08 15:58:25,431 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2022-10-08 15:58:26,780 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2022-10-08 15:58:26,961 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2022-10-08 15:58:26,961 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2022-10-08 15:58:26,979 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2022-10-08 15:58:27,399 INFO mapred.FileInputFormat: Total input files to process : 1
2022-10-08 15:58:27,425 INFO mapreduce.JobSubmitter: number of splits:1
2022-10-08 15:58:27,706 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1672820701_0001
2022-10-08 15:58:27,706 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-10-08 15:58:28,020 INFO mapred.LocalDistributedCacheManager: Localized file:/content/mapper.py as file:/tmp/hadoop-root/mapred/local/job_local16

Check the result

In [32]:
!hdfs dfs -ls output

Found 2 items
-rw-r--r--   1 root supergroup          0 2022-10-08 15:58 output/_SUCCESS
-rw-r--r--   1 root supergroup      85436 2022-10-08 15:58 output/part-00000


In [33]:
!hdfs dfs -cat output/part-00000 |head 

Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4	
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: check pass; user unknown	
Jun 14 15:16:02 combo sshd(pam_unix)[19937]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=218.188.2.4	
Jun 15 02:04:59 combo sshd(pam_unix)[20882]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20884]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20883]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo sshd(pam_unix)[20885]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=220-135-151-1.hinet-ip.hinet.net  user=root	
Jun 15 02:04:59 combo

### Aggregate data with MapReduce

Following the example in [Hadoop Streaming/Aggregate package](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Hadoop_Aggregate_Package)

In [34]:
%%writefile myAggregatorForKeyCount.py
#!/usr/local/bin/python
import sys

def generateLongCountToken(id):
    # the "LongValueSum:" prefix tells the aggregate reducer to sum the values per key
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip lines that have no process field
        # process name: field 4 up to the opening bracket of the PID
        s = fields[4].split('[')[0]
        print(generateLongCountToken(s))

if __name__ == "__main__":
    main(sys.argv)

Overwriting myAggregatorForKeyCount.py


Set permissions

In [35]:
!chmod 700 myAggregatorForKeyCount.py

Test the mapper 

In [36]:
!head -20 Linux_2k.log| ./myAggregatorForKeyCount.py

LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:logrotate:	1
LongValueSum:su(pam_unix)	1
LongValueSum:su(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
LongValueSum:sshd(pam_unix)	1
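
The mapper only emits `LongValueSum:<key>` followed by a tab and `1`; the summing is left to the built-in `aggregate` reducer. The sketch below imitates that summing step locally for the `LongValueSum` case only. It is an illustration of the idea, not Hadoop's implementation, and the name `long_value_sum` is hypothetical.

```python
from collections import defaultdict

PREFIX = "LongValueSum:"

def long_value_sum(mapper_lines):
    """Sum the values emitted for each key and return 'key<TAB>total' rows sorted by key."""
    totals = defaultdict(int)
    for line in mapper_lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if not key.startswith(PREFIX):
            continue  # this sketch handles only the LongValueSum aggregator
        totals[key[len(PREFIX):]] += int(value)
    return [f"{key}\t{total}" for key, total in sorted(totals.items())]

sample = [
    "LongValueSum:sshd(pam_unix)\t1",
    "LongValueSum:su(pam_unix)\t1",
    "LongValueSum:logrotate:\t1",
    "LongValueSum:sshd(pam_unix)\t1",
]
for row in long_value_sum(sample):
    print(row)
```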


Run the MapReduce job

In [37]:
%%bash
hdfs dfs -rm -r output >/dev/null 2>&1
mapred streaming \
  -file mapper.py \
  -file myAggregatorForKeyCount.py \
  -input input \
  -output output \
  -mapper myAggregatorForKeyCount.py \
  -reducer aggregate  
  
  

packageJobJar: [mapper.py, myAggregatorForKeyCount.py] [] /tmp/streamjob9018992628385811252.jar tmpDir=null


2022-10-08 15:58:40,123 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2022-10-08 15:58:41,487 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2022-10-08 15:58:41,611 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2022-10-08 15:58:41,611 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2022-10-08 15:58:41,631 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2022-10-08 15:58:41,988 INFO mapred.FileInputFormat: Total input files to process : 1
2022-10-08 15:58:42,011 INFO mapreduce.JobSubmitter: number of splits:1
2022-10-08 15:58:42,305 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local368443522_0001
2022-10-08 15:58:42,305 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-10-08 15:58:42,620 INFO mapred.LocalDistributedCacheManager: Localized file:/content/mapper.py as file:/tmp/hadoop-root/mapred/local/job_local368

Check result

In [38]:
%%bash
hdfs dfs -ls output
hdfs dfs -cat output/part-00000 

Found 2 items
-rw-r--r--   1 root supergroup          0 2022-10-08 15:58 output/_SUCCESS
-rw-r--r--   1 root supergroup        326 2022-10-08 15:58 output/part-00000
--	1
bluetooth:	2
cups:	12
ftpd	916
gdm(pam_unix)	2
gdm-binary	1
gpm	2
hcid	1
irqbalance:	1
kernel:	76
klogind	46
login(pam_unix)	2
logrotate:	43
named	16
network:	2
nfslock:	1
portmap:	1
random:	1
rc:	1
rpc.statd	1
rpcidmapd:	1
sdpd	1
snmpd	1
sshd(pam_unix)	677
su(pam_unix)	172
sysctl:	1
syslog:	2
syslogd	7
udev	8
xinetd	2


Pretty-print table of aggregated data

In [39]:
%%bash
hdfs dfs -get output/part-00000 result # download results file
column -t result|sort -k2nr # sort by field 2 numerically in descending order

ftpd             916
sshd(pam_unix)   677
su(pam_unix)     172
kernel:          76
klogind          46
logrotate:       43
named            16
cups:            12
udev             8
syslogd          7
bluetooth:       2
gdm(pam_unix)    2
gpm              2
login(pam_unix)  2
network:         2
syslog:          2
xinetd           2
--               1
gdm-binary       1
hcid             1
irqbalance:      1
nfslock:         1
portmap:         1
random:          1
rc:              1
rpcidmapd:       1
rpc.statd        1
sdpd             1
snmpd            1
sysctl:          1


get: `result': File exists
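
A rough Python equivalent of `column -t result | sort -k2nr` is sketched below. The helper `format_counts` is an assumption made for illustration; it breaks ties by key, which may order equal counts differently from `sort`.

```python
def format_counts(text):
    """Parse 'key<TAB>count' rows and return aligned rows sorted by count, descending."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        key, count = line.rstrip().rsplit("\t", 1)
        rows.append((key, int(count)))
    if not rows:
        return []
    rows.sort(key=lambda row: (-row[1], row[0]))
    width = max(len(key) for key, _ in rows)
    return [f"{key.ljust(width)}  {count}" for key, count in rows]

sample_result = "cups:\t12\nftpd\t916\nkernel:\t76\n"
for row in format_counts(sample_result):
    print(row)
```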


## Stop cluster

When you're done with your computations, you can shut down the Hadoop cluster.

In [40]:
!./hadoop-3.3.4/sbin/stop-dfs.sh

Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [93ad6cb45e12]
