## Hadoop
Hadoop is a **Java-based** programming framework that supports the storage and processing of extremely large datasets on a cluster of inexpensive machines. It was the first major open-source project in the big data space and is sponsored by the Apache Software Foundation.

## Step 1: Verifying JAVA Installation
Java must be installed on your system before you install Hadoop. Verify the Java installation with the following command:

`!java --version`

In [None]:
!java --version

openjdk 11.0.22 2024-01-16
OpenJDK Runtime Environment (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1)
OpenJDK 64-Bit Server VM (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1, mixed mode, sharing)
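
The version check above can also be done programmatically, which helps when a notebook should stop early on an unsuitable Java. The sketch below is a minimal, assumed helper (`parse_java_major` is a hypothetical name, not part of Hadoop or the JDK) that pulls the major version out of text like the output shown above.

```python
import re


def parse_java_major(version_output: str) -> int:
    """Return the major Java version found in `java --version` style text.

    Handles both the modern scheme ("11.0.22", "17") and the legacy
    "1.8.0_292" scheme, where the major version is the second number.
    """
    match = re.search(r"(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError("no version number found in the given text")
    first = int(match.group(1))
    second = match.group(2)
    # Legacy versions are written 1.x, so the real major version is x.
    if first == 1 and second is not None:
        return int(second)
    return first


sample = "openjdk 11.0.22 2024-01-16"
print(parse_java_major(sample))
```

In a notebook, the input text could come from `subprocess.run(["java", "--version"], capture_output=True, text=True).stdout`; which Java versions a given Hadoop release supports should be checked against that release's documentation.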


## Step 2: Use the `wget` command to download the latest Hadoop version from Apache's website
https://downloads.apache.org/hadoop/common

In [None]:
!wget https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz

--2024-04-07 20:36:07--  https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
Resolving downloads.apache.org (downloads.apache.org)... 88.99.208.237, 135.181.214.104, 2a01:4f9:3a:2c57::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|88.99.208.237|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 965537117 (921M) [application/x-gzip]
Saving to: ‘hadoop-3.4.0.tar.gz’


2024-04-07 20:36:55 (19.4 MB/s) - ‘hadoop-3.4.0.tar.gz’ saved [965537117/965537117]
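
A download of this size is worth verifying before it is extracted. The sketch below computes a SHA-512 digest of a file in chunks and compares it with a digest you supply. It assumes you have copied the hex digest from the checksum file Apache publishes next to the release; the helper names are hypothetical.

```python
import hashlib


def sha512_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large archives fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex):
    """Compare the file's digest with a published one.

    Whitespace and letter case in the published value are ignored,
    because checksum files often wrap or upper-case the hex string.
    """
    normalized = "".join(published_hex.split()).lower()
    return sha512_of_file(path) == normalized
```

A call such as `matches_published_digest("hadoop-3.4.0.tar.gz", "<digest from the .sha512 file>")` should return `True` for an intact download.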



## Step 3: Extract the downloaded Hadoop package using the `!tar -xzvf` command.

* `-x` to extract,
* `-z` to decompress with gzip,
* `-v` for verbose output, and
* `-f` to specify that we’re extracting from a file

In [None]:
!tar -xzvf hadoop-3.4.0.tar.gz
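
For readers who prefer to stay in Python, the same extraction can be sketched with the standard `tarfile` module. This is an illustrative equivalent of `tar -xzf`, not what the notebook itself runs; `extract_tarball` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def extract_tarball(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" corresponds to the -x and -z flags: read and gunzip.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()  # roughly what -v would list
        archive.extractall(dest)
    return names
```

Calling `extract_tarball("hadoop-3.4.0.tar.gz", ".")` would unpack into the current directory, as the shell command does. Only extract archives from a source you trust.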

## Step 4: Copy the Hadoop directory to `/usr/local`

In [None]:
!cp -r hadoop-3.4.0/ /usr/local/

### 4.1 Set the environment variable `HADOOP_HOME` to `/usr/local/<hadoop_version>`

In [None]:
import os
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.4.0"
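
Setting `HADOOP_HOME` does not by itself put the Hadoop commands on `PATH`, which is why the later cells call them by full path. As an optional convenience, the sketch below builds a `PATH` value with Hadoop's `bin` and `sbin` directories in front. The helper name is hypothetical.

```python
import os


def prepend_hadoop_to_path(current_path, hadoop_home):
    """Return a PATH string with Hadoop's bin and sbin directories first.

    Existing copies of those directories are dropped, so calling the
    function repeatedly does not grow PATH.
    """
    extra = [os.path.join(hadoop_home, "bin"), os.path.join(hadoop_home, "sbin")]
    parts = [p for p in current_path.split(os.pathsep) if p]
    kept = [p for p in parts if p not in extra]
    return os.pathsep.join(extra + kept)
```

In the notebook this could be applied with `os.environ["PATH"] = prepend_hadoop_to_path(os.environ["PATH"], os.environ["HADOOP_HOME"])`.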

## Step 5: Configuring Hadoop’s Java Home
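Hadoop needs to know where Java is installed. Set the path either as the `JAVA_HOME` environment variable or in the Hadoop configuration file. The command below resolves the `/usr/bin/java` symlink and strips the trailing `bin/java` to print that path.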

In [None]:
!readlink -f /usr/bin/java | sed "s:bin/java::"

/usr/lib/jvm/java-11-openjdk-amd64/


### 5.1 Set as environment variable

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
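
Hard-coding the JDK path works on this machine but breaks when the Java version changes. The sketch below mirrors the `readlink ... | sed` pipeline in Python: given the resolved path of the `java` binary, it returns the directory two levels up. `java_home_from_binary` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def java_home_from_binary(resolved_java_path):
    """Derive JAVA_HOME from a fully resolved .../bin/java path."""
    path = PurePosixPath(resolved_java_path)
    if path.name != "java" or path.parent.name != "bin":
        raise ValueError(f"expected a path ending in bin/java, got {resolved_java_path!r}")
    # Drop the trailing bin/java, as the sed expression does.
    return str(path.parent.parent)


print(java_home_from_binary("/usr/lib/jvm/java-11-openjdk-amd64/bin/java"))
```

The resolved path can be obtained with `os.path.realpath(shutil.which("java"))`, assuming `java` is on `PATH`. Note that this version returns the directory without the trailing slash.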

### 5.2 Set in Hadoop configuration file

Add the following line to `/usr/local/hadoop-3.4.0/etc/hadoop/hadoop-env.sh`:
`export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/`

## Step 6: Running Hadoop

In [None]:
!/usr/local/hadoop-3.4.0/bin/hadoop

Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in worker mode
hosts filename                   list of hosts to use in worker mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the re

## Step 7: Copy the config files

In [None]:
!mkdir ~/input

In [None]:
!cp /usr/local/hadoop-3.4.0/etc/hadoop/*.xml ~/input

## Step 8: Review the config files

In [None]:
!ls ~/input

capacity-scheduler.xml	hadoop-policy.xml  hdfs-site.xml    kms-acls.xml  mapred-site.xml
core-site.xml		hdfs-rbf-site.xml  httpfs-site.xml  kms-site.xml  yarn-site.xml


The following files must be edited to configure Hadoop:

1. core-site.xml: Contains settings such as the port number used by the Hadoop instance, the memory allocated to the file system, the memory limit for storing data, and the size of the read/write buffers.

2. hdfs-site.xml: Contains settings such as the replication factor and the NameNode and DataNode paths on your local file system, that is, where Hadoop stores its data.

3. yarn-site.xml: Configures YARN for Hadoop.

4. mapred-site.xml: Specifies which MapReduce framework to use.
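
All four files share the same `<configuration>`/`<property>` layout, so their contents can be generated from a dictionary. The sketch below is illustrative only: the property values are the commonly documented settings for a single-node (pseudo-distributed) setup and are assumptions here, since the notebook does not show its own edits. `build_site_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode") + "\n"


# Assumed single-node values; adjust to your own cluster.
SITE_FILES = {
    "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": 1},
    "mapred-site.xml": {"mapreduce.framework.name": "yarn"},
    "yarn-site.xml": {"yarn.nodemanager.aux-services": "mapreduce_shuffle"},
}

print(build_site_xml(SITE_FILES["core-site.xml"]))
```

Each rendered string would be written to the matching file under `/usr/local/hadoop-3.4.0/etc/hadoop/` before the daemons are started.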

## Step 9: Verifying Hadoop Installation

### Step I: NameNode Setup

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs namenode -format

2024-04-07 20:41:51,775 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 606cd77ca520/172.28.0.12
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.4.0
STARTUP_MSG:   classpath = /usr/local/hadoop-3.4.0/etc/hadoop:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/jetty-io-9.4.53.v20231009.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/zookeeper-jute-3.8.3.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/jersey-server-1.19.4.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/kerby-pkix-2.0.3.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/jetty-security-9.4.53.v20231009.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/netty-codec-smtp-4.1.100.Final.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/jline-3.9.0.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/jul-to-slf4j-1.7.36.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/jsr311-api-1.1.1.jar:/usr/loc

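### Step II: Verifying Hadoop DFS

Add the following to `/usr/local/hadoop-3.4.0/etc/hadoop/hadoop-env.sh` so that the daemons can be started as `root`. The start scripts also connect to each host over SSH, including `localhost`, which is why the cells below install an SSH server and set up key-based login.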

export HDFS_NAMENODE_USER="root"

export HDFS_DATANODE_USER="root"

export HDFS_SECONDARYNAMENODE_USER="root"

export YARN_RESOURCEMANAGER_USER="root"

export YARN_NODEMANAGER_USER="root"
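
Editing `hadoop-env.sh` by hand is easy to forget when a notebook is re-run on a fresh machine. The sketch below appends the export lines above to the script text only when they are missing, so it can be run repeatedly. The function name is hypothetical, and it works on a string so the file I/O stays under your control.

```python
DAEMON_USERS = {
    "HDFS_NAMENODE_USER": "root",
    "HDFS_DATANODE_USER": "root",
    "HDFS_SECONDARYNAMENODE_USER": "root",
    "YARN_RESOURCEMANAGER_USER": "root",
    "YARN_NODEMANAGER_USER": "root",
}


def ensure_exports(script_text, variables):
    """Return script_text with an `export NAME="value"` line per variable.

    Lines that are already present verbatim are not added again.
    """
    lines = script_text.splitlines()
    for name, value in variables.items():
        export_line = f'export {name}="{value}"'
        if export_line not in lines:
            lines.append(export_line)
    return "\n".join(lines) + "\n"
```

To apply it, read `/usr/local/hadoop-3.4.0/etc/hadoop/hadoop-env.sh`, pass its text and `DAEMON_USERS` to `ensure_exports`, and write the result back.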

In [None]:
!sudo apt-get install -y openssh-server

In [None]:
!sudo /etc/init.d/ssh start
!service ssh status

 * Starting OpenBSD Secure Shell server sshd
   ...done.
 * sshd is running


In [None]:
!ssh-keygen -t rsa -b 4096 -C "namitakalra@google.com"

Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:CCcxuvvownX6R9N8SkoZj7RQ1RsEEa272T4QD5a0sUo namitakalra@google.com
The key's randomart image is:
+---[RSA 4096]----+
|    o   .=B.     |
|   . o . o +     |
|  . o o . * o    |
|   . = E O .     |
|  .   = S =      |
|   o . O B o     |
|. o o o + B      |
|.. +   o + o     |
| oo o..   ...    |
+----[SHA256]-----+


In [None]:
!cat /root/.ssh/id_rsa.pub > /root/.ssh/authorized_keys

In [None]:
!ssh-keyscan -H localhost >> /root/.ssh/known_hosts

# localhost:22 SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6
# localhost:22 SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6
# localhost:22 SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6
# localhost:22 SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6
# localhost:22 SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6


In [None]:
!chmod 700 /root/.ssh
!chmod 600 /root/.ssh/authorized_keys
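
The key-authorization and permission steps above can be combined into one repeatable helper. The sketch below is an assumed alternative to the shell cells: it appends the public key only if it is not already listed (the `cat ... >` cell overwrites the file instead) and applies the same `700`/`600` modes. `authorize_key` is a hypothetical name.

```python
import os
from pathlib import Path


def authorize_key(ssh_dir, public_key):
    """Add public_key to ssh_dir/authorized_keys with strict permissions."""
    directory = Path(ssh_dir)
    directory.mkdir(parents=True, exist_ok=True)
    os.chmod(directory, 0o700)  # sshd rejects keys in group/world-writable dirs

    auth_file = directory / "authorized_keys"
    existing = auth_file.read_text().splitlines() if auth_file.exists() else []
    key = public_key.strip()
    if key not in existing:
        existing.append(key)
    auth_file.write_text("\n".join(existing) + "\n")
    os.chmod(auth_file, 0o600)
    return auth_file
```

With the key generated earlier, the call would be `authorize_key("/root/.ssh", Path("/root/.ssh/id_rsa.pub").read_text())`.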

In [None]:
!/usr/local/hadoop-3.4.0/sbin/start-dfs.sh

Starting namenodes on [606cd77ca520]
Starting datanodes
Starting secondary namenodes [606cd77ca520]


### Step III: Verifying the YARN Script

In [None]:
!/usr/local/hadoop-3.4.0/sbin/start-yarn.sh

Starting resourcemanager
Starting nodemanagers


### Step IV: Accessing Hadoop on Browser

The YARN ResourceManager web UI, which lists all applications on the cluster, listens on port 8088 by default. Use the following URL to open it:

http://localhost:8088/
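
Before opening the URL or tunnelling it, a quick check that something is actually listening on the port can save some guesswork. The sketch below is a generic TCP probe, not a Hadoop API; `is_port_open` is a hypothetical helper.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

`is_port_open("localhost", 8088)` should return `True` once the ResourceManager has finished starting; it can take a few seconds after `start-yarn.sh` returns.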

### Using `ngrok` to create a public URL

In [None]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip

--2024-04-07 20:50:06--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 54.161.241.46, 18.205.222.128, 52.202.168.65, ...
Connecting to bin.equinox.io (bin.equinox.io)|54.161.241.46|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13921656 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip’


2024-04-07 20:50:08 (17.6 MB/s) - ‘ngrok-stable-linux-amd64.zip’ saved [13921656/13921656]



In [None]:
!unzip ngrok-stable-linux-amd64.zip

Archive:  ngrok-stable-linux-amd64.zip
  inflating: ngrok                   


In [None]:
!cp -r ngrok /usr/local/bin

In [None]:
!ngrok authtoken "<YOUR_NGROK_AUTHTOKEN>"

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


In [None]:
!pip install pyngrok

Collecting pyngrok
  Downloading pyngrok-7.1.6-py3-none-any.whl (22 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.1.6


In [None]:
from pyngrok import ngrok

In [None]:
ngrok.set_auth_token("<YOUR_NGROK_AUTHTOKEN>")

tunnel = ngrok.connect(addr="8088", proto="http")
# The public URL is directly available from the tunnel object.
public_url = tunnel.public_url
print("Tunnel Public URL:", public_url)



Tunnel Public URL: https://94d8-34-83-145-234.ngrok-free.app
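
An authtoken typed into a notebook cell is visible to anyone who can read the file. A safer pattern is to read it from an environment variable at run time. The sketch below assumes the token is stored in a variable named `NGROK_AUTHTOKEN`; the helper name is hypothetical.

```python
import os


def read_ngrok_token(env=None, variable="NGROK_AUTHTOKEN"):
    """Fetch the ngrok authtoken from the environment, failing loudly if absent."""
    source = os.environ if env is None else env
    token = source.get(variable, "").strip()
    if not token:
        raise RuntimeError(f"set the {variable} environment variable before opening a tunnel")
    return token
```

The result can then be passed to the call used above, as in `ngrok.set_auth_token(read_ngrok_token())`.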


In [None]:
!/usr/local/hadoop-3.4.0/bin/hadoop jar /usr/local/hadoop-3.4.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep ~/input ~/grep_example 'allowed*'

In [None]:
!cat ~/grep_example/*

In [None]:
!cat ~/input/* | grep -c "allowed"
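
The two cells above are not expected to print the same number. The Hadoop `grep` example counts every string that matches the regular expression `allowed*` (that is, `allowe` followed by zero or more `d`), while `grep -c "allowed"` counts the lines that contain the literal text. The sketch below reproduces the per-match counting locally on a few made-up lines; it illustrates the idea and is not the example job's actual code.

```python
import re
from collections import Counter


def count_regex_matches(lines, pattern):
    """Count each distinct string matched by pattern across all lines."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        counts.update(match.group(0) for match in regex.finditer(line))
    return counts


# Made-up input for illustration only.
lines = ["is allowed and allowed", "not allowe here", "nothing"]
counts = count_regex_matches(lines, "allowed*")
lines_with_literal = sum(1 for line in lines if "allowed" in line)

print(counts)              # per-match counts, as the MapReduce job reports
print(lines_with_literal)  # per-line count, as `grep -c` reports
```

On this input the regex finds three matches across two distinct strings, while only one line contains the literal `allowed`.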