#Hadoop on Colab

Content:

1.   [Installing Java 8](#scrollTo=Kxt9UbTArwTC)
2.   [Installing Secure Shell Server (SSHD)](#scrollTo=E1_rIFHNb1zk)
3.   [Installing Hadoop 3.3.4](#scrollTo=qCeL0IBlrnoF)
4.   [Running Hadoop in standalone mode](#scrollTo=xyNhcphwU326)
5.   [Running Hadoop in Pseudo-distributed mode](#scrollTo=lEWV2YjJmR78)

##Introduction

Hadoop is an open-source framework for storing and analyzing large datasets on clusters of commodity hardware. In other words, it is a data management tool.

Hadoop can run in three different modes:

*   Standalone Mode
*   Pseudo-distributed Mode
*   Fully-distributed Mode


**Standalone Mode**

By default, Hadoop is configured to run in non-distributed mode, as a single Java process. Instead of HDFS, this mode uses the local file system. It is useful for debugging, and there is no need to configure core-site.xml, hdfs-site.xml, mapred-site.xml, masters, or slaves. Standalone mode is usually the fastest mode in Hadoop.

**Pseudo-distributed Mode**

Hadoop can also run on a single node in pseudo-distributed mode. In this mode, each daemon runs in a separate Java process, and custom configuration is required (core-site.xml, hdfs-site.xml, mapred-site.xml). HDFS is used for input and output. This mode of deployment is useful for testing and debugging.

**Fully-distributed Mode**

This is the production mode of Hadoop. Typically, one machine in the cluster is designated exclusively as the NameNode and another as the Resource Manager. These are the masters. All other nodes act as both DataNode and Node Manager. These are the slaves. Configuration parameters and the environment need to be specified for the Hadoop daemons.

##Installing Java 8

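Hadoop is a Java-based data processing framework, so it needs a Java runtime. Java is already present on this machine, so here we only check the installed version (OpenJDK 11 in the run below).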

OpenJDK is a development environment for building applications, applets, and components using the Java programming language.

In [None]:
#Checking the installed Java version
!java -version

openjdk version "11.0.20.1" 2023-08-24
OpenJDK Runtime Environment (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.20.1+1-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)


Creating Java-related environment variables

JAVA_HOME is an operating system environment variable that points to the file system location where the JDK or JRE is installed.

In [None]:
#Finding the default Java path
!readlink -f /usr/bin/java | sed "s:bin/java::"

/usr/lib/jvm/java-11-openjdk-amd64/
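The same lookup can be sketched in Python, which is handy if the path is needed later in a Python cell. This is only an illustration: `java_home_from_binary` is a hypothetical helper name, and unlike the `sed` command above it returns the directory without a trailing slash.

```python
import os
import shutil


def java_home_from_binary(java_binary):
    """Resolve symlinks and strip the trailing bin/java from the path."""
    real_path = os.path.realpath(java_binary)
    return os.path.dirname(os.path.dirname(real_path))


# shutil.which returns None when no java executable is on the PATH.
java_binary = shutil.which("java")
if java_binary is not None:
    print(java_home_from_binary(java_binary))
```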


In [None]:
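#Importing os module
import os
#Creating environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["JRE_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/jre"
#Python does not expand $VARIABLES inside string literals, so the paths are joined explicitly
#HADOOP_HOME is not defined yet; the Hadoop scripts are called by their full path later on
os.environ["PATH"] += ":" + os.environ["JAVA_HOME"] + "/bin:" + os.environ["JRE_HOME"] + "/bin"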
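One detail worth checking when building `PATH` from Python: a string such as `"$JAVA_HOME/bin"` is stored literally, because Python does not expand shell variables inside string literals. The sketch below shows the difference using `os.path.expandvars`. `DEMO_HOME` is a made-up variable used only for this illustration.

```python
import os

# DEMO_HOME is a hypothetical variable, used only for this illustration.
os.environ["DEMO_HOME"] = "/opt/demo"

literal = "$DEMO_HOME/bin"               # stored exactly as written
expanded = os.path.expandvars(literal)   # variables replaced by their values

print(literal)
print(expanded)
```

Either `os.path.expandvars` or plain string concatenation with `os.environ[...]` gives a usable `PATH` entry.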

## Installing Secure Shell Server (SSHD)

We need to define a means for the master node to remotely access every node in our cluster.

Hadoop uses passphraseless SSH for communication between the nodes.

SSH is a cryptographic network protocol for operating network services securely over an unsecured network.

SSH uses standard public-key cryptography to create a pair of keys for user verification: one public and one private.

In [None]:
#It is good practice to purge before installation
!apt-get purge openssh-server -qq

In [None]:

#Installing openssh-server
!apt-get install openssh-server -qq > /dev/null

In [None]:
#Starting the server
!service ssh start

 * Starting OpenBSD Secure Shell server sshd
   ...done.


The default port for SSH is 22. The `Port` line in sshd_config is commented out, so the default applies.

In [None]:
!grep Port /etc/ssh/sshd_config

#Port 22
#GatewayPorts no


Pseudo-distributed mode is a special case of fully-distributed mode in which the single host is localhost (our machine). We need to make sure that we can log in to localhost without entering a password. Therefore, SSH needs to be set up to allow passwordless login for the Hadoop user. The simplest way to achieve this is to generate a public-private key pair with an empty passphrase.

In [None]:
#Creating a new rsa key pair with empty password
!ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

Generating public/private rsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:JZPwD4YLjMfplIY59Pxx0KS96TM9aMGd0oz8A1cauFI root@1326dc238f5b
The key's randomart image is:
+---[RSA 3072]----+
|  .   oo.        |
| . O o B.o       |
|  = % + E o .    |
|   * o O # =     |
|    . + S O      |
|       o O       |
|        * =      |
|       . o o     |
|                 |
+----[SHA256]-----+


In [None]:
#Showing the public key
!more /root/.ssh/id_rsa.pub

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDTNUhc6HzaJ9o7kd1ZruS1S+xMTyw39HyppGP5dOcU
/fja94TTkxWCbWXbk52NkgecHVgyrX8Ic9tHCsnYFM6fwtAaTh6fxYD6nQYZnqfkzq3iSLSaoDKN6uKR
ah4Kk6P8cj0Qw+3TLO7Ly+ostdPC+pI7fWQSdGTUkvHNdrxbRD9OW5sS9zOFRnht9C3W/smxj3yvS3uX
VbB9gu5pvr5n8rFXhurENLccKXJ0PCKrBmE/Ga4drzDK9UFq+jgOBrbqWpTLv5CRSPhpaqVoPD0aoAmG
wBBxsxI49SR8VoxUoEuuaGh8Q9Wk7ocwXdA33xtduSKDITZAl/YOUDVqLqAOGP1KZGJ9NyMwuOydu+rn
mc58wrb47RF5FmgvU/WUEZQln0bspq4mwZjBTTBuc8F22ETuCLpShtQCnPY02vpJTWPwIHqM99yVW8j9
WSRzwivMpHVbV66enzQN1Lk8Ad7t9Q6e7pEDQMA7LGtus0X8c2/VZ+Y8tuTwqLFqWoMmXM0= root@13
26dc238f5b


In [None]:
#Copying the key to authorized keys
!cat $HOME/.ssh/id_rsa.pub>>$HOME/.ssh/authorized_keys
#Changing the permissions on the key
!chmod 0600 ~/.ssh/authorized_keys
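As a quick sanity check, the two steps above (key appended, permissions set to 0600) can be verified from Python. This is a minimal sketch: `key_is_authorized` and `permission_bits` are hypothetical helper names, and the paths assume the default key location used in this notebook.

```python
import os
import stat


def key_is_authorized(public_key_path, authorized_keys_path):
    """Return True if the public key line appears in authorized_keys."""
    with open(public_key_path) as f:
        public_key = f.read().strip()
    with open(authorized_keys_path) as f:
        authorized = [line.strip() for line in f]
    return public_key in authorized


def permission_bits(path):
    """Return the permission bits of a file, e.g. 0o600."""
    return stat.S_IMODE(os.stat(path).st_mode)


ssh_dir = os.path.expanduser("~/.ssh")
pub = os.path.join(ssh_dir, "id_rsa.pub")
auth = os.path.join(ssh_dir, "authorized_keys")
# Only run the check when both files exist on this machine.
if os.path.exists(pub) and os.path.exists(auth):
    print(key_is_authorized(pub, auth), oct(permission_bits(auth)))
```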

In [None]:
#Connecting to the local machine
!ssh -o StrictHostKeyChecking=no localhost uptime

 17:55:28 up 9 min,  0 users,  load average: 0.80, 0.43, 0.24


## Installing Hadoop 3.3.4

In [None]:
#Downloading Hadoop 3.3.4
!wget -q https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

In [None]:
#Untarring the file
!sudo tar -xzf hadoop-3.3.4.tar.gz
#Removing the tar file
#rm hadoop-3.3.4.tar.gz

The standard locations for installing Hadoop are:

*   /usr/local
*   /opt

In [None]:
#Copying the hadoop files to /usr/local
!cp -r hadoop-3.3.4/ /usr/local/
#-r copy directories recursively

In [None]:
#Exploring hadoop-3.3.4/etc/hadoop directory
!ls /usr/local/hadoop-3.3.4/etc/hadoop
#we can see various configuration files of hadoop

capacity-scheduler.xml		  kms-log4j.properties
configuration.xsl		  kms-site.xml
container-executor.cfg		  log4j.properties
core-site.xml			  mapred-env.cmd
hadoop-env.cmd			  mapred-env.sh
hadoop-env.sh			  mapred-queues.xml.template
hadoop-metrics2.properties	  mapred-site.xml
hadoop-policy.xml		  shellprofile.d
hadoop-user-functions.sh.example  ssl-client.xml.example
hdfs-rbf-site.xml		  ssl-server.xml.example
hdfs-site.xml			  user_ec_policies.xml.template
httpfs-env.sh			  workers
httpfs-log4j.properties		  yarn-env.cmd
httpfs-site.xml			  yarn-env.sh
kms-acls.xml			  yarnservice-log4j.properties
kms-env.sh			  yarn-site.xml


We need to configure a few things before running Hadoop. That is, we need to add or modify a few parameters in these configuration files to run Hadoop in the mode we want.

Configuring hadoop-env.sh file

hadoop-env.sh is a bash script that contains environment variables used by the scripts that run Hadoop.

In [None]:
#Exploring hadoop-env.sh file
!cat /usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

##
## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.
## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,
## ONE CAN USE THIS FILE TO SET

The only required environment variable is **JAVA_HOME**. All the others are optional.

To specify the JAVA_HOME variable in hadoop-env.sh we need an uncommented export line that points to the actual directory. The command below adds such a line right after the existing `export JAVA_HOME=` line.

In this case it should look like this:

`export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64`

In [None]:
#Adding JAVA_HOME directory to hadoop-env.sh file
!sed -i '/export JAVA_HOME=/a export JAVA_HOME=\/usr\/lib\/jvm\/java-11-openjdk-amd64' /usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
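To make the effect of the `sed ... /a` command concrete, here is a rough Python equivalent applied to a small list of lines. It is a sketch only: `append_after_match` is a hypothetical helper, and the `sample` lines are a made-up stand-in for the file, not its real content.

```python
def append_after_match(lines, needle, new_line):
    """Return a copy of lines with new_line inserted after each line containing needle."""
    result = []
    for line in lines:
        result.append(line)
        if needle in line:
            result.append(new_line)
    return result


# Made-up stand-in for a few lines of hadoop-env.sh.
sample = [
    "# The java implementation to use.",
    "# export JAVA_HOME=",
    "# Location of Hadoop.",
]
new_export = "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
for line in append_after_match(sample, "export JAVA_HOME=", new_export):
    print(line)
```

Note that the commented line stays in place; the new export is simply added below it.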

Because it is convenient, we create an environment variable that points to the Hadoop installation directory.

In [None]:
#Creating Hadoop home variable
import os
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.3.4"

Configuring XML files

Most Hadoop settings are contained in XML configuration files. These files are also known as **resources**.

They have the following structure:


```
<configuration>
...
  <property>
    <name>...</name>
    <value>...</value>
    <description>...</description>
  </property>
...
</configuration>
```



The XML file can contain any number of property elements. Each property element defines a specific configuration name-value pair.
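A file with this structure is easy to read programmatically. The sketch below parses a small configuration string into a dictionary with Python's standard `xml.etree.ElementTree`. `read_properties` is a hypothetical helper, and `SAMPLE` is an illustrative snippet rather than a file shipped with Hadoop.

```python
import xml.etree.ElementTree as ET

# Illustrative configuration text following the structure shown above.
SAMPLE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of block replicas.</description>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Map each property name to its value; descriptions are ignored."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


print(read_properties(SAMPLE))
```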

Hadoop configuration is driven by two distinct types of XML configuration files:

1. **Default** (read-only): core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml. These files should never be modified.
2. **Site-specific** configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml. These files are loaded from the classpath and their values override the corresponding properties in the matching default configuration files.
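The override rule can be pictured as a dictionary merge in which site values win. This is a simplified model, not how Hadoop implements it. The three default values listed are the ones I understand the default files to ship with, so treat them as assumptions.

```python
# Assumed default values, for illustration only.
defaults = {
    "fs.defaultFS": "file:///",
    "dfs.replication": "3",
    "io.file.buffer.size": "4096",
}

# Values set in the site files for a single-machine setup.
site = {
    "fs.defaultFS": "hdfs://localhost:9000",
    "dfs.replication": "1",
}

# Later entries win, so site values replace the defaults with the same name.
effective = {**defaults, **site}

for name, value in effective.items():
    print(f"{name} = {value}")
```

Properties that are not mentioned in a site file, such as `io.file.buffer.size` here, keep their default value.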


In [None]:
#Exploring hadoop-3.3.4/etc/hadoop xml files
!ls $HADOOP_HOME/etc/hadoop/*.xml

/usr/local/hadoop-3.3.4/etc/hadoop/capacity-scheduler.xml
/usr/local/hadoop-3.3.4/etc/hadoop/core-site.xml
/usr/local/hadoop-3.3.4/etc/hadoop/hadoop-policy.xml
/usr/local/hadoop-3.3.4/etc/hadoop/hdfs-rbf-site.xml
/usr/local/hadoop-3.3.4/etc/hadoop/hdfs-site.xml
/usr/local/hadoop-3.3.4/etc/hadoop/httpfs-site.xml
/usr/local/hadoop-3.3.4/etc/hadoop/kms-acls.xml
/usr/local/hadoop-3.3.4/etc/hadoop/kms-site.xml
/usr/local/hadoop-3.3.4/etc/hadoop/mapred-site.xml
/usr/local/hadoop-3.3.4/etc/hadoop/yarn-site.xml


Each component in Hadoop is configured using an XML file:

*   core-site.xml: common properties
*   hdfs-site.xml: HDFS properties
*   mapred-site.xml: MapReduce properties
*   yarn-site.xml: YARN properties

By configuring these XML files accordingly, Hadoop can be run in any of the three modes.

In [None]:
#Content of core-site.xml file
!cat $HADOOP_HOME/etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
</configuration>


As we can see, no properties are set. They are empty by default. So there is nothing to overwrite and Hadoop runs with the default properties.

## Running Hadoop in standalone mode

With the default configuration properties, Hadoop runs in standalone (non-distributed) mode. That is, standalone mode (also known as local mode) is the default mode for Hadoop.

There are no daemons to run, just a single Java process.

The local filesystem and the local MapReduce job runner are used.

The command to run a Hadoop MapReduce program written in Java is:

`$HADOOP_HOME/bin/hadoop jar <jar>`

A JAR (Java archive) packages and compresses a set of files into a single archive; `<jar>` is the path to the archive that contains the program.

The default installation already includes several example MapReduce programs that we can use.

In [None]:
#Exploring mapreduce tools
!ls $HADOOP_HOME/share/hadoop/mapreduce/*.jar

/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-app-3.3.4.jar
/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-common-3.3.4.jar
/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.4.jar
/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-3.3.4.jar
/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-3.3.4.jar
/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.4.jar
/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.4-tests.jar
/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-nativetask-3.3.4.jar
/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-3.3.4.jar
/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-client-uploader-3.3.4.jar
/usr/local/hadoop-3.3.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar


Mapreduce examples

In [None]:
#Exploring the examples of programs available
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar


wordcount

As its description says, wordcount is a map/reduce program that counts the words in the input files.
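For intuition, the result wordcount produces can be sketched in a few lines of plain Python. This is not the Hadoop implementation: it assumes words are split on whitespace only, so punctuation stays attached to the word, and `word_count` is a hypothetical helper.

```python
from collections import Counter


def word_count(lines):
    """Count whitespace-separated tokens across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Small made-up input, just to show the shape of the result.
for word, total in sorted(word_count(["to be or not", "to be"]).items()):
    print(f"{word}\t{total}")
```

The MapReduce version computes the same kind of table, but splits the work into a map step and a reduce step so it can run on many machines.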

In [None]:
#Usage of the wordcount MapReduce program
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount


The parameters are an input directory containing the text to be analyzed and an output directory where the program will write its results.

In [None]:
#Downloading text example to use as input
!wget -q https://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/1/0/101/101.txt

In [None]:
#Running MapReduce program wordcount
#the output directory will be created automatically
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount /content/101.txt /content/output

In [None]:
#Exploring the created output directory
#part-r-00000 contains the actual output
!ls /content/output

In [None]:
#Printing out first 50 lines
!head -50 /content/output/part-r-00000


## Running Hadoop in Pseudo-distributed mode

In pseudo-distributed mode all the distributed components of Hadoop come into play. That is, all the Hadoop daemons responsible for distributed storage and distributed processing run on the same machine.

Master daemons:

*   NameNode
*   Resource Manager
*   Secondary NameNode

Slave daemons:

*   DataNode
*   Node Manager


Configuring XML files

As mentioned, by setting properties in the **site** XML configuration files we override the corresponding properties in the **default** XML configuration files. This is how we tell Hadoop which machines are in the cluster and where and how to run the Hadoop daemons.

The specific content these files need for Hadoop to run in pseudo-distributed mode can be found in the release documentation on the official website. For Hadoop 3.3.4 the page is:

https://hadoop.apache.org/docs/r3.3.4/hadoop-project-dist/hadoop-common/SingleCluster.html

Configuring core-site.xml

In [None]:
#Adding required property to core-site.xml file
!sed -i '/<configuration>/a\
  <property>\n\
    <name>fs.defaultFS</name>\n\
    <value>hdfs://localhost:9000</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/core-site.xml

In [None]:
#Content of core-site.xml after the editing
!cat $HADOOP_HOME/etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
     <name>fs.defaultFS</name>
     <value>hdfs://localhost:9000</value>
   </property>
</configuration>
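As an alternative to editing the file with `sed`, the same property could be added with Python's standard `xml.etree.ElementTree`. The sketch below works on a string so it does not touch the real file. `add_property` is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def add_property(xml_text, name, value):
    """Return the configuration XML with one more <property> appended."""
    root = ET.fromstring(xml_text)
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


empty = "<configuration>\n</configuration>"
updated = add_property(empty, "fs.defaultFS", "hdfs://localhost:9000")
print(updated)
```

One trade-off: with its default settings the parser discards comments, so applying this to the real file would drop the license header that `sed` leaves untouched.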


Configuring hdfs-site.xml

In [None]:
#Adding required property to hdfs-site.xml file
#Since we are running Hadoop on only one machine, a replication factor greater than 1 does not make sense
!sed -i '/<configuration>/a\
  <property>\n\
    <name>dfs.replication</name>\n\
    <value>1</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/hdfs-site.xml

In [None]:
#Content of hdfs-site.xml after the editing
!cat $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
     <name>dfs.replication</name>
     <value>1</value>
   </property>

</configuration>


Configuring mapred-site.xml

In [None]:
#Adding required properties to mapred-site.xml file
!sed -i '/<configuration>/a\
  <property>\n\
    <name>mapreduce.framework.name</name>\n\
    <value>yarn</value>\n\
  </property>\n\
  <property>\n\
    <name>mapreduce.application.classpath</name>\n\
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/mapred-site.xml

In [None]:
#Content of mapred-site.xml after the editing
!cat $HADOOP_HOME/etc/hadoop/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
     <name>mapreduce.framework.name</name>
     <value>yarn</value>
   </property>
   <property>
     <name>mapreduce.application.classpath</name>
     <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/li

Configuring yarn-site.xml

In [None]:
#Adding required properties to yarn-site.xml file
!sed -i '/<configuration>/a\
  <property>\n\
    <description>The hostname of the RM.</description>\n\
    <name>yarn.resourcemanager.hostname</name>\n\
    <value>localhost</value>\n\
  </property>\n\
  <property>\n\
    <name>yarn.nodemanager.aux-services</name>\n\
    <value>mapreduce_shuffle</value>\n\
  </property>\n\
  <property>\n\
    <name>yarn.nodemanager.env-whitelist</name>\n\
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/yarn-site.xml

In [None]:
#Content of yarn-site.xml after the editing
!cat $HADOOP_HOME/etc/hadoop/yarn-site.xml

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<property>
     <description>The hostname of the RM.</description>
     <name>yarn.resourcemanager.hostname</name>
     <value>localhost</value>
   </property>
   <property>
     <name>yarn.nodemanager.aux-services</name>
     <value>mapreduce_shuffle</value>
   </property>
   <property>
     <name>yarn.nodemanager.env-whitelist</name>
     <value>JAVA_HOME,HADOOP_COMMON_HOME,HAD

Formatting the HDFS Filesystem

Before HDFS can be used for the first time, the file system must be formatted. The formatting process creates an empty file system by creating the storage directories and the initial versions of the NameNode's persistent data structures.

In [None]:
!$HADOOP_HOME/bin/hdfs namenode -format

2023-09-13 17:56:15,343 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 1326dc238f5b/172.28.0.12
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.3.4
STARTUP_MSG:   classpath = /usr/local/hadoop-3.3.4/etc/hadoop:/usr/local/hadoop-3.3.4/share/hadoop/common/lib/jetty-security-9.4.43.v20210629.jar:/usr/local/hadoop-3.3.4/share/hadoop/common/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/usr/local/hadoop-3.3.4/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar:/usr/local/hadoop-3.3.4/share/hadoop/common/lib/commons-cli-1.2.jar:/usr/local/hadoop-3.3.4/share/hadoop/common/lib/kerby-asn1-1.0.1.jar:/usr/local/hadoop-3.3.4/share/hadoop/common/lib/kerb-crypto-1.0.1.jar:/usr/local/hadoop-3.3.4/share/hadoop/common/lib/guava-27.0-jre.jar:/usr/local/hadoop-3.3.4/share/hadoop/common/lib/hadoop-annotations-3.3.4.jar:/usr/local/hadoop-3.3.4/share/hadoop/common/lib/hadoo

Hadoop scripts

Hadoop comes with scripts for running commands and for starting and stopping daemons across the whole cluster. These scripts can be found in the bin and sbin directories.

In [None]:
#Exploring Hadoop scripts available in sbin directory
!ls $HADOOP_HOME/sbin

distribute-exclude.sh	 start-all.sh	      stop-balancer.sh
FederationStateStore	 start-balancer.sh    stop-dfs.cmd
hadoop-daemon.sh	 start-dfs.cmd	      stop-dfs.sh
hadoop-daemons.sh	 start-dfs.sh	      stop-secure-dns.sh
httpfs.sh		 start-secure-dns.sh  stop-yarn.cmd
kms.sh			 start-yarn.cmd       stop-yarn.sh
mr-jobhistory-daemon.sh  start-yarn.sh	      workers.sh
refresh-namenodes.sh	 stop-all.cmd	      yarn-daemon.sh
start-all.cmd		 stop-all.sh	      yarn-daemons.sh


In [None]:
#Creating other necessary environment variables before starting nodes
os.environ["HDFS_NAMENODE_USER"] = "root"
os.environ["HDFS_DATANODE_USER"] = "root"
os.environ["HDFS_SECONDARYNAMENODE_USER"] = "root"
os.environ["YARN_RESOURCEMANAGER_USER"] = "root"
os.environ["YARN_NODEMANAGER_USER"] = "root"

In [None]:
!$HADOOP_HOME/sbin/start-all.sh

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [1326dc238f5b]
Starting resourcemanager
Starting nodemanagers


In [None]:
#Listing the running daemons
!jps

3632 DataNode
4246 NodeManager
3512 NameNode
4137 ResourceManager
4365 Jps
3839 SecondaryNameNode
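Reading the `jps` listing by eye works for five daemons, but the check is easy to automate. The sketch below parses text in the format shown above and reports which expected daemons are missing. `running_daemons` is a hypothetical helper, and `sample` is copied from the output above.

```python
EXPECTED = {"NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager"}


def running_daemons(jps_output):
    """Return the process names listed by jps, ignoring jps itself."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] != "Jps":
            names.add(parts[1])
    return names


sample = """3632 DataNode
4246 NodeManager
3512 NameNode
4137 ResourceManager
4365 Jps
3839 SecondaryNameNode"""

missing = EXPECTED - running_daemons(sample)
print(missing or "all expected daemons are running")
```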


The processes can also be launched one group at a time.

In [None]:
#Launching hdfs daemons
!$HADOOP_HOME/sbin/start-dfs.sh

Starting namenodes on [localhost]
localhost: namenode is running as process 3512.  Stop it first and ensure /tmp/hadoop-root-namenode.pid file is empty before retry.
Starting datanodes
localhost: datanode is running as process 3632.  Stop it first and ensure /tmp/hadoop-root-datanode.pid file is empty before retry.
Starting secondary namenodes [1326dc238f5b]
1326dc238f5b: secondarynamenode is running as process 3839.  Stop it first and ensure /tmp/hadoop-root-secondarynamenode.pid file is empty before retry.


In [None]:
#Listing the running daemons
!jps

3632 DataNode
5108 Jps
4246 NodeManager
3512 NameNode
4137 ResourceManager
3839 SecondaryNameNode


In [None]:
#Launching yarn daemons
#nohup causes a process to ignore a SIGHUP signal
!nohup $HADOOP_HOME/sbin/start-yarn.sh

nohup: ignoring input and appending output to 'nohup.out'


In [None]:
#Listing the running daemons
!jps

3632 DataNode
4246 NodeManager
3512 NameNode
4137 ResourceManager
5353 Jps
3839 SecondaryNameNode


Monitoring the Hadoop cluster with Hadoop admin commands

In [None]:
#Report the basic file system information and statistics
!$HADOOP_HOME/bin/hdfs dfsadmin -report

Configured Capacity: 115658190848 (107.72 GB)
Present Capacity: 83862347776 (78.10 GB)
DFS Remaining: 83862323200 (78.10 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Replicated Blocks:
	Under replicated blocks: 0
	Blocks with corrupt replicas: 0
	Missing blocks: 0
	Missing blocks (with replication factor 1): 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0
Erasure Coded Block Groups: 
	Low redundancy block groups: 0
	Block groups with corrupt internal blocks: 0
	Missing block groups: 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (1):

Name: 127.0.0.1:9866 (localhost)
Hostname: 1326dc238f5b
Decommission Status : Normal
Configured Capacity: 115658190848 (107.72 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 31779065856 (29.60 GB)
DFS Remaining: 83862323200 (78.10 GB)
DFS Used%: 0.00%
DFS Remaining%: 72.51%
Configured Cache Capacity: 0 (0 B)
Cache
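The figures in the report can be cross-checked by hand. The sketch below takes the byte counts from the output above and recomputes the values in parentheses and the remaining percentage. It assumes, based on those numbers, that the report's "GB" means 2^30 bytes. `to_gib` and `percent_remaining` are hypothetical helper names.

```python
configured = 115658190848  # Configured Capacity, in bytes
remaining = 83862323200    # DFS Remaining, in bytes


def to_gib(num_bytes):
    """Convert bytes to units of 2**30 bytes."""
    return num_bytes / 2**30


def percent_remaining(remaining_bytes, configured_bytes):
    """Remaining space as a percentage of the configured capacity."""
    return 100 * remaining_bytes / configured_bytes


print(f"{to_gib(configured):.2f}")                        # 107.72
print(f"{to_gib(remaining):.2f}")                         # 78.10
print(f"{percent_remaining(remaining, configured):.2f}")  # 72.51
```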

#### Copying the data from Drive to the machine


In [None]:
# mount it
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!ls -lrta /content/drive/MyDrive/00_IMF/IMF/TEMARIO/3_Hadoop/FICHEROS

total 15
-rw------- 1 root root 12567 Sep 10 08:51 medidas.txt
-rw------- 1 root root   171 Sep 10 10:36 input.txt
-rw------- 1 root root   602 Sep 10 10:37 reducer.py
-rw------- 1 root root   208 Sep 10 10:57 mapper.py


In [None]:
# copy it there
!cp /content/drive/MyDrive/00_IMF/IMF/TEMARIO/3_Hadoop/FICHEROS/* .

In [None]:
!ls -lrta

total 679884
-rw-r--r--  1 root root    678064 Feb  9  2012 101.txt
drwxr-xr-x 10 1024 1024      4096 Jul 29  2022 hadoop-3.3.4
-rw-r--r--  1 root root 695457782 Jul 29  2022 hadoop-3.3.4.tar.gz
drwxr-xr-x  4 root root      4096 Sep 12 13:21 .config
drwxr-xr-x  1 root root      4096 Sep 12 13:22 sample_data
drwxr-xr-x  1 root root      4096 Sep 13 17:45 ..
drwx------  5 root root      4096 Sep 13 17:50 drive
drwxr-xr-x  1 root root      4096 Sep 13 17:57 .
-rw-------  1 root root       320 Sep 13 17:57 nohup.out
-rw-------  1 root root       171 Sep 13 17:57 input.txt
-rw-------  1 root root       208 Sep 13 17:57 mapper.py
-rw-------  1 root root     12567 Sep 13 17:57 medidas.txt
-rw-------  1 root root       602 Sep 13 17:57 reducer.py


## We can first run it the quick-and-dirty way, without Hadoop, to check that it works

In [None]:
!cat input.txt | python3 mapper.py | sort | python3 reducer.py

Cada	1
de	3
documento	2
documento.	1
documentos.	1
ejemplo	2
en	1
es	2
Este	2
Las	1
los	1
otro	1
palabras	1
palabras.	1
pueden	1
repetirse	1
texto.	1
tiene	1
un	2
varias	1


In [None]:
#Creating directory in HDFS
!$HADOOP_HOME/bin/hdfs dfs -mkdir /word_count
#Copying file from local file system to HDFS
!$HADOOP_HOME/bin/hdfs dfs -put /content/input.txt /word_count

In [None]:
#Exploring Hadoop folder
!$HADOOP_HOME/bin/hdfs dfs -ls /word_count

Found 1 items
-rw-r--r--   1 root supergroup        171 2023-09-13 17:57 /word_count/input.txt


In [None]:
!$HADOOP_HOME/bin/mapred streaming -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input hdfs:////word_count/input.txt -output hdfs:////word_count/salida

packageJobJar: [] [/usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar] /tmp/streamjob8435826653528261435.jar tmpDir=null
2023-09-13 17:57:40,937 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-09-13 17:57:41,341 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-09-13 17:57:41,816 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1694627822238_0001
2023-09-13 17:57:42,842 INFO mapred.FileInputFormat: Total input files to process : 1
2023-09-13 17:57:43,015 INFO mapreduce.JobSubmitter: number of splits:2
2023-09-13 17:57:43,565 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1694627822238_0001
2023-09-13 17:57:43,565 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-09-13 17:57:43,922 INFO conf.Configuration: resource-types.xml not found
2023-09-13 17:57:43,9

In [None]:
#Exploring Hadoop folder
!$HADOOP_HOME/bin/hdfs dfs -ls /word_count

Found 2 items
-rw-r--r--   1 root supergroup        171 2023-09-13 17:57 /word_count/input.txt
drwxr-xr-x   - root supergroup          0 2023-09-13 17:58 /word_count/salida


### Checking the output

In [None]:
!$HADOOP_HOME/bin/hdfs dfs -cat /word_count/salida/*

Cada	1
Este	2
Las	1
de	3
documento	2
documento.	1
documentos.	1
ejemplo	2
en	1
es	2
los	1
otro	1
palabras	1
palabras.	1
pueden	1
repetirse	1
texto.	1
tiene	1
un	2
varias	1


### The exercise file

In [None]:
!head medidas.txt

2017	Enero	46	
2017	Enero	10	
2017	Enero	6	
2017	Enero	9	
2017	Enero	34	
2017	Enero	-3	
2017	Enero	48	
2017	Enero	20	
2017	Enero	-5	
2017	Enero	0	


In [None]:
!tail medidas.txt

2018	Diciembre	46	
2018	Diciembre	1	
2018	Diciembre	46	
2018	Diciembre	28	
2018	Diciembre	13	
2018	Diciembre	-1	
2018	Diciembre	4	
2018	Diciembre	38	
2018	Diciembre	29	
2018	Diciembre	2	


In [None]:
!wc -l medidas.txt

730 medidas.txt
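Judging from the `head` and `tail` output, each line of medidas.txt appears to hold three tab-separated fields followed by a trailing tab. A mapper for this file would start by splitting each line, roughly as sketched below. The field meanings (year, month, measurement) and the name `parse_record` are assumptions, since the notebook does not describe the file.

```python
def parse_record(line):
    """Split one tab-separated line into (year, month, value)."""
    year, month, value = line.rstrip("\n").split("\t")[:3]
    return int(year), month, int(value)


# Sample line copied from the head output above, assuming tabs as separators.
print(parse_record("2017\tEnero\t46\t"))
```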


In [None]:
!jps

3632 DataNode
6482 Jps
4246 NodeManager
3512 NameNode
4137 ResourceManager
3839 SecondaryNameNode
