<a href="https://colab.research.google.com/github/namitakalra-google/Big-Data-Workshop/blob/main/labs/%5BSolution%5D_hdfs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Hadoop
Hadoop is a **Java-based** framework for storing and processing extremely large datasets on a cluster of inexpensive machines. It was the first major open-source project in the big data field and is maintained under the Apache Software Foundation.

## Step 1: Verifying the Java Installation
Java must be installed on your system before installing Hadoop. Verify the Java installation with the following command:

`!java --version`

In [None]:
!java --version

openjdk 11.0.22 2024-01-16
OpenJDK Runtime Environment (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1)
OpenJDK 64-Bit Server VM (build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1, mixed mode, sharing)
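The same check can be scripted so that later cells can branch on the Java version. The sketch below is a minimal example, not part of the original lab: `parse_java_major` is a hypothetical helper name, and the parsing assumes the version number is the first number in the output, as in the lines shown above. It calls `java -version` (single dash) because that form also exists on Java 8, where `--version` is not available.

```python
import re
import shutil
import subprocess


def parse_java_major(version_text):
    """Return the major Java version found in version output, or None."""
    match = re.search(r"(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Legacy numbering: "1.8.0_292" means Java 8.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


if __name__ == "__main__":
    java = shutil.which("java")
    if java is None:
        print("java not found on PATH")
    else:
        # `java -version` writes its report to stderr.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
        print("Java major version:", parse_java_major(result.stderr + result.stdout))
```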


## Step 2: Use the `wget` command to download the latest Hadoop version from Apache's website
https://downloads.apache.org/hadoop/common

In [None]:
!wget https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz

--2024-04-09 05:48:49--  https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
Resolving downloads.apache.org (downloads.apache.org)... 135.181.214.104, 88.99.208.237, 2a01:4f9:3a:2c57::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|135.181.214.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 965537117 (921M) [application/x-gzip]
Saving to: ‘hadoop-3.4.0.tar.gz’


2024-04-09 05:49:39 (18.5 MB/s) - ‘hadoop-3.4.0.tar.gz’ saved [965537117/965537117]
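A download of this size is worth verifying before extraction. The sketch below compares the file size with the `Length` value reported by `wget` above and computes a SHA-512 digest in chunks. It assumes that a matching `.sha512` file is published next to the tarball on the Apache download site, which should be checked by hand; `sha512_of` is a hypothetical helper name.

```python
import hashlib
import os


def sha512_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a large archive never sits in memory at once."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = "hadoop-3.4.0.tar.gz"
    expected_bytes = 965537117  # "Length" reported by wget above
    if os.path.exists(archive):
        print("size matches:", os.path.getsize(archive) == expected_bytes)
        print("sha512:", sha512_of(archive))
    else:
        print(archive, "not found; run the wget cell first")
```

Compare the printed digest with the published checksum before moving on to the extraction step.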



## Step 3: Extract the downloaded Hadoop package using the `!tar -xzvf` command.

* -x flag to extract,
* -z to uncompress,
* -v for verbose output, and
* -f to specify that we’re extracting from a file
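The flags above have close counterparts in Python's standard `tarfile` module, which can be handy when the listing of every extracted path is not wanted. This is a minimal sketch that assumes the archive from Step 2 sits in the working directory; `extract_tgz` is a hypothetical helper name, not something the lab defines.

```python
import os
import tarfile


def extract_tgz(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return the member names."""
    # Mode "r:gz" covers -x (extract) and -z (gzip); the path argument plays the role of -f.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


if __name__ == "__main__":
    if os.path.exists("hadoop-3.4.0.tar.gz"):
        members = extract_tgz("hadoop-3.4.0.tar.gz")
        # Print a summary instead of every path, which is what -v would show.
        print(len(members), "entries extracted")
    else:
        print("hadoop-3.4.0.tar.gz not found; run the download step first")
```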

In [None]:
!tar -xzvf hadoop-3.4.0.tar.gz

Streaming output truncated to the last 5000 lines.
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/protocolPB/package-summary.html
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/protocolPB/package-tree.html
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/UnsupportedFileSystemException.html
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/FsStatus.html
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/FileAlreadyExistsException.html
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/FsServerDefaults.html
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/Options.OpenFileOptions.html
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/StorageType.html
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/ByteBufferReadable.html
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/azurebfs/
hadoop-3.4.0/share/doc/hadoop/api/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.html
hadoop-3.4.0/share/doc/hadoop/api/org/apache/h

## Step 4: Copy the Hadoop directory to `/usr/local`

In [None]:
!cp -r hadoop-3.4.0/ /usr/local/

### 4.1 Set the environment variable `HADOOP_HOME` to `/usr/local/<hadoop_version>`

In [None]:
import os
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.4.0"

## Step 5: Configuring Hadoop’s Java Home
Hadoop requires that you set the path to Java, either as an environment variable or in the Hadoop configuration file.

In [None]:
!readlink -f /usr/bin/java | sed "s:bin/java::"

/usr/lib/jvm/java-11-openjdk-amd64/


### 5.1 Set as environment variable

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"
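Hard-coding the JDK path works for this runtime image but breaks if the installed Java version changes. As a hedged alternative, the `readlink`/`sed` step above can be reproduced in Python: resolve the `java` symlink, then drop the trailing `bin/java`. `java_home_from_binary` is a hypothetical helper name, and the result has no trailing slash.

```python
import os
import shutil


def java_home_from_binary(java_path):
    """Resolve symlinks, then drop the trailing bin/java to get the JDK root."""
    real = os.path.realpath(java_path)
    return os.path.dirname(os.path.dirname(real))


if __name__ == "__main__":
    java = shutil.which("java")
    if java is None:
        print("java not found on PATH")
    else:
        os.environ["JAVA_HOME"] = java_home_from_binary(java)
        print("JAVA_HOME =", os.environ["JAVA_HOME"])
```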

### 5.2 Set in Hadoop configuration file

Alternatively, add the following line to `/usr/local/hadoop-3.4.0/etc/hadoop/hadoop-env.sh`:
`export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/`

## Step 6: Running Hadoop

In [None]:
!/usr/local/hadoop-3.4.0/bin/hadoop

Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in worker mode
hosts filename                   list of hosts to use in worker mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the re

## Step 7: Copy the config files

In [None]:
!mkdir ~/input

In [None]:
!cp /usr/local/hadoop-3.4.0/etc/hadoop/*.xml ~/input

## Step 8: Review the config files

In [None]:
!ls ~/input

capacity-scheduler.xml	hadoop-policy.xml  hdfs-site.xml    kms-acls.xml  mapred-site.xml
core-site.xml		hdfs-rbf-site.xml  httpfs-site.xml  kms-site.xml  yarn-site.xml


The following files need to be edited to configure Hadoop:

1. core-site.xml: The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.

2. hdfs-site.xml: The hdfs-site.xml file contains information such as the replication factor, the namenode path, and the datanode path on your local file system, that is, the locations where Hadoop stores its data.

3. yarn-site.xml: This file is used to configure YARN for Hadoop.

4. mapred-site.xml: This file is used to specify which MapReduce framework we are using.
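The list above names the files but does not show an edit. As shipped, these files typically contain an empty `<configuration>` element, and with no `fs.defaultFS` set Hadoop falls back to the local file system (`file:///`), so `hdfs dfs` paths resolve against the local disk. The sketch below renders a minimal single-node configuration. The two property values follow Apache's single-node setup guide and are assumptions for this lab, not something the notebook sets; `render_site_xml` and `read_site_xml` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Build the <configuration> document used by Hadoop *-site.xml files."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def read_site_xml(xml_text):
    """Return the name -> value pairs from a *-site.xml document."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


# Assumed values for a single-node (pseudo-distributed) setup.
core_site = render_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = render_site_xml({"dfs.replication": 1})

if __name__ == "__main__":
    print(core_site)
    print(hdfs_site)
```

To apply them, write each string to the matching file under `/usr/local/hadoop-3.4.0/etc/hadoop/` before formatting the NameNode.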

## Step 9: Start HDFS

### Step I: Name Node Setup

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs namenode -format

2024-04-08 18:30:29,602 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = c95ab9dac78e/172.28.0.12
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.4.0
STARTUP_MSG:   classpath = /usr/local/hadoop-3.4.0/etc/hadoop:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/kerb-common-2.0.3.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/netty-codec-http-4.1.100.Final.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/kerb-identity-2.0.3.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/netty-transport-udt-4.1.100.Final.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/netty-codec-stomp-4.1.100.Final.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/jsr305-3.0.2.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/jetty-server-9.4.53.v20231009.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/stax2-api-4.2.1.jar:/usr/local/hadoop-3.4.0/share/hadoop/common/lib/jline-3.9.0.jar

### Step II: Verifying Hadoop dfs

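Hadoop 3 refuses to start its daemons as root unless the corresponding user variables are defined. Add the following lines to `/usr/local/hadoop-3.4.0/etc/hadoop/hadoop-env.sh`: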

export HDFS_NAMENODE_USER="root"

export HDFS_DATANODE_USER="root"

export HDFS_SECONDARYNAMENODE_USER="root"

export YARN_RESOURCEMANAGER_USER="root"

export YARN_NODEMANAGER_USER="root"
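The notebook does not include a cell that performs this edit. A minimal sketch of one way to do it from Python follows. It appends only the lines that are not already present, so re-running the cell does not duplicate them. `append_missing_exports` and `DAEMON_USERS` are hypothetical names; the same helper could also add the `JAVA_HOME` line from Step 5.2.

```python
import os

DAEMON_USERS = {
    "HDFS_NAMENODE_USER": "root",
    "HDFS_DATANODE_USER": "root",
    "HDFS_SECONDARYNAMENODE_USER": "root",
    "YARN_RESOURCEMANAGER_USER": "root",
    "YARN_NODEMANAGER_USER": "root",
}


def append_missing_exports(env_file, variables):
    """Append `export NAME="value"` lines that are not in the file yet; return them."""
    with open(env_file, "r", encoding="utf-8") as handle:
        existing = handle.read()
    wanted = [f'export {name}="{value}"' for name, value in variables.items()]
    # Compare whole lines so a commented-out "# export ..." does not count as present.
    present = {line.strip() for line in existing.splitlines()}
    missing = [line for line in wanted if line not in present]
    if missing:
        with open(env_file, "a", encoding="utf-8") as handle:
            if existing and not existing.endswith("\n"):
                handle.write("\n")
            handle.write("\n".join(missing) + "\n")
    return missing


if __name__ == "__main__":
    env_file = "/usr/local/hadoop-3.4.0/etc/hadoop/hadoop-env.sh"
    if os.path.exists(env_file):
        print("added:", append_missing_exports(env_file, DAEMON_USERS))
    else:
        print(env_file, "not found; complete Step 4 first")
```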

In [None]:
!sudo apt-get install -y openssh-server

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libwrap0 ncurses-term openssh-sftp-server ssh-import-id
Suggested packages:
  molly-guard monkeysphere ssh-askpass ufw
The following NEW packages will be installed:
  libwrap0 ncurses-term openssh-server openssh-sftp-server ssh-import-id
0 upgraded, 5 newly installed, 0 to remove and 45 not upgraded.
Need to get 800 kB of archives.
After this operation, 6,161 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 openssh-sftp-server amd64 1:8.9p1-3ubuntu0.6 [38.7 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libwrap0 amd64 7.6.q-31build2 [47.9 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 openssh-server amd64 1:8.9p1-3ubuntu0.6 [435 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 ncurses-term all 6.3-2ubuntu0.1 [267 kB]
Get:5 http:

In [None]:
!sudo /etc/init.d/ssh start
!service ssh status

 * Starting OpenBSD Secure Shell server sshd
   ...done.
 * sshd is running


In [None]:
!ssh-keygen -t rsa -b 4096 -C "<your_email_id>"

Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:7Dg/Mm56xVccLpqDVnRZPBMnJAX2lr5zfd60Pkbx2Z4 <your_email_id>
The key's randomart image is:
+---[RSA 4096]----+
|          +B*..  |
|        ..oo=+   |
|       . . o++   |
|       .. .o+  . |
|       +So o.   =|
|      oo* .  . +o|
|     .o..o  o o.=|
|      =o.    o E=|
|    .=.o..    oo+|
+----[SHA256]-----+
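The command above stops three times for input. In an unattended notebook run it may be more convenient to pass the answers as flags: `-N` for the passphrase, `-f` for the output file, and `-q` to silence the fingerprint output. The sketch below only builds and prints such a command; `keygen_command` is a hypothetical helper name. Note that `ssh-keygen` still asks before overwriting an existing key file.

```python
import shlex


def keygen_command(key_path, bits=4096, comment=""):
    """Build an ssh-keygen call that does not prompt for a file name or passphrase."""
    return [
        "ssh-keygen",
        "-t", "rsa",
        "-b", str(bits),
        "-C", comment,
        "-N", "",        # empty passphrase, so no passphrase prompt
        "-f", key_path,  # output file, so no "Enter file" prompt
        "-q",            # silence the fingerprint and randomart output
    ]


if __name__ == "__main__":
    # In a notebook cell the printed command can be run with a leading "!".
    print(shlex.join(keygen_command("/root/.ssh/id_rsa", comment="<your_email_id>")))
```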


In [None]:
!cat /root/.ssh/id_rsa.pub > /root/.ssh/authorized_keys
!ssh-keyscan -H localhost >> /root/.ssh/known_hosts
!chmod 700 /root/.ssh
!chmod 600 /root/.ssh/authorized_keys

# localhost:22 SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6
# localhost:22 SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6
# localhost:22 SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6
# localhost:22 SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6
# localhost:22 SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.6


Use the command below to start HDFS and confirm that its daemons come up before we start working with it.

In [None]:
!/usr/local/hadoop-3.4.0/sbin/start-dfs.sh

Starting namenodes on [c95ab9dac78e]
Starting datanodes
Starting secondary namenodes [c95ab9dac78e]
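The script prints these "Starting" lines before the daemons have finished launching, so they do not prove that the daemons stayed up. One hedged way to check is `jps`, the JDK tool that lists running Java processes by main class. The sketch below assumes `jps` is on the `PATH` and that the three HDFS daemons appear under the class names shown; `missing_daemons` is a hypothetical helper name.

```python
import shutil
import subprocess

EXPECTED_DAEMONS = ("NameNode", "DataNode", "SecondaryNameNode")


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in `jps` output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line is "<pid> <main class name>"; skip anything else.
        if len(parts) >= 2 and parts[0].isdigit():
            running.add(parts[1])
    return [name for name in expected if name not in running]


if __name__ == "__main__":
    jps = shutil.which("jps")
    if jps is None:
        print("jps not found; it ships with the JDK rather than the JRE")
    else:
        output = subprocess.run([jps], capture_output=True, text=True).stdout
        print("missing daemons:", missing_daemons(output) or "none")
```

If a daemon is reported missing, its log file under `/usr/local/hadoop-3.4.0/logs` is the place to look next.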


# Let's work on HDFS now - Analyze Website Traffic Logs

## Generating sample data

Run the provided Python code to generate a file (`sample_log.txt`) containing simulated website access logs. Examine the log format to understand the data.

In [None]:
import random
from datetime import datetime
import time

ips = ["192.168.1.10", "10.0.0.5", "203.0.113.8"]  # Sample IP addresses; extend as needed
referrers = ["https://google.com", "https://www.facebook.com"]  # Sample referrers; extend as needed
pages = ["/", "products/gadget1", "about", "contact"]

def generate_log_line():
  ip = random.choice(ips)
  referrer = random.choice(referrers)
  page = random.choice(pages)
  # astimezone() attaches the local UTC offset; without it %z renders as an empty string
  timestamp = datetime.now().astimezone().strftime("[%d/%b/%Y:%H:%M:%S %z]")
  method = random.choice(["GET", "POST"])
  status = random.choice([200, 404])

  return f'{ip} - - {timestamp} "{method} {page} HTTP/1.1" {status} {len(page)} "{referrer}" "Mozilla/5.0"'

with open('sample_log.txt', 'w') as f:
  for _ in range(2):  # Generate 2 log lines; raise the count (e.g. 200) for a larger sample
    f.write(generate_log_line() + '\n')
    time.sleep(random.random())  # Simulate logs coming in over time
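To examine the log format, it can help to split one line back into its fields. The sketch below is an illustrative parser for the format produced by `generate_log_line`, not a general access-log parser. `LOG_PATTERN` and `parse_log_line` are hypothetical names, and the sample line is made up from the values in the generator.

```python
import re

LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) - - \[(?P<timestamp>[^\]]*)\] '
    r'"(?P<method>\S+) (?P<page>\S+) HTTP/1\.1" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_log_line(line):
    """Split one generated log line into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    return fields


# Illustrative line in the generator's format.
sample = ('10.0.0.5 - - [08/Apr/2024:18:32:37 +0000] "GET about HTTP/1.1" '
          '200 5 "https://google.com" "Mozilla/5.0"')

if __name__ == "__main__":
    for key, value in parse_log_line(sample).items():
        print(f"{key:10} {value}")
```

The fields are, in order: client IP, timestamp, HTTP method, requested page, status code, response size (here the length of the page string), referrer, and user agent.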

## Task 1: Prepare Your HDFS Workspace

Create a directory called `website_analysis` under the HDFS path `/usr/local` to store and organize your log data. This path lives in the HDFS namespace, not on the local disk.



In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -mkdir -p /usr/local/website_analysis

## Task 2:  Transfer the Sample Data to HDFS

Description: Upload the sample_log.txt file to your HDFS workspace.

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -put sample_log.txt /usr/local/website_analysis
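Every task in this section repeats the full path to the `hdfs` binary. When the output needs to be inspected in Python rather than just printed, a thin wrapper around `subprocess` may be easier to work with. This is a sketch under the assumption that `HADOOP_HOME` is set as in Step 4.1; `hdfs_dfs_command` and `run_hdfs_dfs` are hypothetical helper names.

```python
import os
import subprocess


def hdfs_dfs_command(*args, hadoop_home=None):
    """Build the argument list for `hdfs dfs <args>` under HADOOP_HOME."""
    home = hadoop_home or os.environ.get("HADOOP_HOME", "/usr/local/hadoop-3.4.0")
    return [os.path.join(home, "bin", "hdfs"), "dfs", *args]


def run_hdfs_dfs(*args):
    """Run the command and return (exit code, stdout, stderr) without raising."""
    result = subprocess.run(hdfs_dfs_command(*args), capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    command = hdfs_dfs_command("-ls", "/usr/local/website_analysis")
    if os.path.exists(command[0]):
        code, out, err = run_hdfs_dfs("-ls", "/usr/local/website_analysis")
        print(out if code == 0 else err)
    else:
        print("hdfs binary not found at", command[0])
```

Returning the exit code makes a failed upload visible in code instead of only in the cell output.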

## Task 3: Explore/View the Data in HDFS

Description: List the files in your HDFS workspace and view the contents of sample_log.txt to verify successful transfer.

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -ls /usr/local/website_analysis

Found 1 items
-rw-r--r--   1 root root        181 2024-04-08 18:33 /usr/local/website_analysis/sample_log.txt


In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -cat /usr/local/website_analysis/sample_log.txt | head

203.0.113.8 - - [08/Apr/2024:18:32:37 ] "POST contact HTTP/1.1" 200 7 "Ellipsis" "Mozilla/5.0"
10.0.0.5 - - [08/Apr/2024:18:32:37 ] "POST / HTTP/1.1" 404 1 "Ellipsis" "Mozilla/5.0"


## Task 4: Investigating Data Size

Description: Determine the size of your log data in HDFS and get a count of files and directories within your workspace.

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -du -s /usr/local/website_analysis
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -count /usr/local/website_analysis

0  0  /usr/local/website_analysis
           1            0                  0 /usr/local/website_analysis
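The `-count` output has no header row. According to the Hadoop FileSystem shell documentation its columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME. A minimal sketch that turns one such line into named values follows; `parse_count_line` is a hypothetical helper name.

```python
def parse_count_line(line):
    """Parse one line of `hdfs dfs -count` output into a dict."""
    # Split at most three times so a path containing spaces stays intact.
    dir_count, file_count, content_size, path = line.strip().split(None, 3)
    return {
        "dir_count": int(dir_count),
        "file_count": int(file_count),
        "content_size": int(content_size),
        "path": path,
    }


if __name__ == "__main__":
    # The line recorded in the output above.
    recorded = "           1            0                  0 /usr/local/website_analysis"
    print(parse_count_line(recorded))
```

For the recorded line this gives one directory, zero files, and zero bytes, so the workspace was empty when that cell ran.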


## Task 5: Interacting with the Data

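Description: Extract lines containing the word "search" from your log file, and sort the results. The file is streamed out of HDFS and the filtering and sorting run in the local shell, which showcases how HDFS data can be piped into standard tools.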

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -text /usr/local/website_analysis/sample_log.txt | grep "search" | sort

/bin/bash: line 1: /usr/local/hadoop-3.4.0/bin/hdfs: No such file or directory


## Task 6: Checking File Permissions

Description: View the file permissions of sample_log.txt in HDFS.

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -ls /usr/local/website_analysis/sample_log.txt

ls: `/usr/local/website_analysis/sample_log.txt': No such file or directory
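When the file exists, the first column of an `-ls` line carries the permissions, as in the listing recorded in Task 3. The sketch below splits such a line into fields and converts the mode string to its octal form. It is illustrative only: `parse_ls_line` and `mode_to_octal` are hypothetical helper names, and the column order is taken from that recorded listing.

```python
def mode_to_octal(mode):
    """Convert a 10-character mode string such as "-rw-r--r--" to "644"."""
    if len(mode) != 10:
        raise ValueError(f"expected 10 characters, got {mode!r}")
    digits = []
    for start in (1, 4, 7):  # owner, group, other
        triplet = mode[start:start + 3]
        value = 0
        if triplet[0] == "r":
            value += 4
        if triplet[1] == "w":
            value += 2
        if triplet[2] in ("x", "t"):  # "t" is execute plus sticky; the sticky bit is not encoded here
            value += 1
        digits.append(str(value))
    return "".join(digits)


def parse_ls_line(line):
    """Split one `hdfs dfs -ls` entry into fields; return None for header lines."""
    parts = line.strip().split(None, 7)
    if len(parts) < 8:
        return None  # e.g. the "Found 1 items" header
    mode, replication, owner, group, size, date, time_of_day, path = parts
    return {"mode": mode, "octal": mode_to_octal(mode), "replication": replication,
            "owner": owner, "group": group, "size": int(size),
            "modified": f"{date} {time_of_day}", "path": path}


if __name__ == "__main__":
    # The entry recorded in the Task 3 listing.
    recorded = ("-rw-r--r--   1 root root        181 2024-04-08 18:33 "
                "/usr/local/website_analysis/sample_log.txt")
    print(parse_ls_line(recorded))
```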


## Task 7: Creating Nested Directories

Description: Create a new directory structure within your workspace: /usr/local/website_analysis/daily_logs/2024-04-09

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -mkdir -p /usr/local/website_analysis/daily_logs/2024-04-09

# (The -p flag creates any missing parent directories)

## Task 8: Appending Data to a File

Description: Generate a few more log lines in a new local file (`new_logs.txt`) and append those lines to the existing `sample_log.txt` file in HDFS.

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -appendToFile new_logs.txt /usr/local/website_analysis/sample_log.txt
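The append assumes that `new_logs.txt` already exists in the local working directory, and the notebook does not show it being created. A minimal sketch that writes a few lines in the same format follows; run it before the append. The IP, page, and referrer values are copied from the sample generator, while `make_log_line` and `write_new_logs` are hypothetical helper names.

```python
import random
from datetime import datetime, timezone

# Values copied from the sample generator; extend as needed.
IPS = ["192.168.1.10", "10.0.0.5", "203.0.113.8"]
PAGES = ["/", "products/gadget1", "about", "contact"]


def make_log_line(rng=random):
    """Return one line in the same format as sample_log.txt."""
    page = rng.choice(PAGES)
    # A timezone-aware timestamp, so %z renders as an offset such as +0000.
    timestamp = datetime.now(timezone.utc).strftime("[%d/%b/%Y:%H:%M:%S %z]")
    method = rng.choice(["GET", "POST"])
    status = rng.choice([200, 404])
    return (f'{rng.choice(IPS)} - - {timestamp} "{method} {page} HTTP/1.1" '
            f'{status} {len(page)} "https://google.com" "Mozilla/5.0"')


def write_new_logs(path="new_logs.txt", count=3):
    """Write `count` fresh log lines to a local file and return the path."""
    with open(path, "w") as handle:
        for _ in range(count):
            handle.write(make_log_line() + "\n")
    return path


if __name__ == "__main__":
    print("wrote", write_new_logs())
```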

## Task 9: Manipulating Files in HDFS

Description: Rename a file in HDFS and then delete it. These commands demonstrate basic file management within HDFS.

Example: Rename `/usr/local/website_analysis/sample_log.txt` to `/usr/local/website_analysis/sample.txt`, then delete the renamed file.

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -mv /usr/local/website_analysis/sample_log.txt /usr/local/website_analysis/sample.txt
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -rm /usr/local/website_analysis/sample.txt

2024-04-08 18:37:35,197 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
Deleted /usr/local/website_analysis/sample.txt


## Task 10: Cleanup
Description: Remove your HDFS workspace. This is important practice, especially in shared cluster environments.

In [None]:
!/usr/local/hadoop-3.4.0/bin/hdfs dfs -rm -r /usr/local/website_analysis