<a href="https://colab.research.google.com/github/m-afzal/BCU-BigDataManagement/blob/main/HelloWorld_Hadoop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#CMP7203 - Big Data Management - Week 2

By the end of this activity, you will be able to:

•	Interact with Hadoop’s command-line.

•	Copy files into and out of the HDFS (Hadoop Distributed File System).

•	Execute the WordCount application.

•	Copy the results from WordCount out of HDFS.


#What is Hadoop
![Hadoop Logo](https://github.com/pnavaro/big-data/blob/master/notebooks/images/hadoop.png?raw=1)

- A framework for running applications on large clusters.
- The Hadoop framework transparently provides applications both reliability and data motion. 
- Hadoop implements the computational paradigm named **Map/Reduce**, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. 
- It provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
- Both MapReduce and the **Hadoop Distributed File System** are designed so that node failures are automatically handled by the framework.
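To make the Map/Reduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in one process, whereas Hadoop would spread the map and reduce calls across the nodes of the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    # Each line is an independent fragment of work; here we simply loop over them.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["hello hadoop", "hello world"])
print(counts)  # {'hello': 2, 'hadoop': 1, 'world': 1}
```

Because each `map_phase` call only looks at its own line, any fragment can be re-run on another node if a machine fails, which is the property the bullets above describe.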

## HDFS

* It is a distributed file system.
* HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
* HDFS is suitable for applications that have large data sets. 
* HDFS provides interfaces that let applications move closer to where the data is located. Moving the computation to the data is much more efficient than moving the data to the computation, especially when the data set is huge.
* HDFS consists of a single NameNode with a number of DataNodes which manage storage. 
* HDFS exposes a file system namespace and allows user data to be stored in files. 
    1. A file is split into blocks that are stored in DataNodes; the NameNode determines the mapping of blocks to DataNodes.
    2. The [NameNode](http://svmass2.mass.uhb.fr:50070) executes operations like opening, closing, and renaming files and directories.
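    3. The [Secondary NameNode](http://svmass2.mass.uhb.fr:50090/status.html) periodically merges the edit log of the **NameNode** into a checkpoint of the file system image; it is not a standby replacement for the NameNode.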
    4. The **DataNodes** perform block creation, deletion, and replication upon instruction from the NameNode.
    5. The placement of replicas is optimized for data reliability, availability, and network bandwidth utilization.
    6. User data never flows through the NameNode.
* Files in HDFS are write-once and have strictly one writer at any time.
* The DataNode has no knowledge about HDFS files.
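The block and replica ideas above can be sketched in a few lines of Python. This is a toy model, not how HDFS is implemented: `split_into_blocks` and `place_replicas` are hypothetical helpers, the DataNode names are invented, and the round-robin placement ignores the rack awareness that real HDFS uses. The example uses a 2 MiB block size so that a small file yields several blocks; HDFS defaults are much larger (commonly 128 MB, with a replication factor of 3).

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])

def place_replicas(num_blocks, datanodes, replication):
    """Toy placement: put each block on `replication` distinct nodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

# 5458199 bytes is the size of t8.shakespeare.txt used later in this notebook.
blocks = split_into_blocks(5458199, 2 * 1024 * 1024)
print(blocks)  # [2097152, 2097152, 1263895]

placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
print(placement[0])  # ['dn1', 'dn2', 'dn3']
```

Note that only the last block is smaller than the block size, and that losing any single node still leaves two copies of every block.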

#Hadoop Installation Part
Hadoop is a Java-based programming framework that supports the processing and storage of extremely large datasets on a cluster of inexpensive machines. It was the first major open source project in the big data playing field and is sponsored by the Apache Software Foundation.



In [5]:
#!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
!wget https://raw.githubusercontent.com/besherh/BigDataManagement/924228b1a3fec29b6240e0de5e893cf7493a7de4/hadoop-examples.jar.zip


--2023-01-30 10:20:30--  https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
Resolving downloads.apache.org (downloads.apache.org)... 88.99.95.219, 135.181.214.104, 2a01:4f8:10a:201a::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|88.99.95.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 695457782 (663M) [application/x-gzip]
Saving to: ‘hadoop-3.3.4.tar.gz’


2023-01-30 10:20:57 (25.2 MB/s) - ‘hadoop-3.3.4.tar.gz’ saved [695457782/695457782]

--2023-01-30 10:20:57--  https://raw.githubusercontent.com/besherh/BigDataManagement/924228b1a3fec29b6240e0de5e893cf7493a7de4/hadoop-examples.jar.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 129392 (126K) [application/zip]
Saving to: ‘hadoop-e

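We'll use the tar command with the -x flag to extract, -z to uncompress, and -f to specify that we're extracting from a file. Adding -v would print every extracted file (verbose output); it is left out here because the archive contains a very large number of files.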
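If you prefer to stay in Python, the standard-library `tarfile` module can do the same job. The sketch below is an alternative to the shell cell, not what this notebook runs; `extract_tar_gz` is a hypothetical helper name.

```python
import tarfile

def extract_tar_gz(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive, roughly like `tar -xzf archive -C dest`."""
    # Mode "r:gz" means read (-x) with gzip decompression (-z);
    # archive_path plays the role of the -f argument.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest_dir)
    return names
```

For example, `extract_tar_gz("hadoop-3.3.4.tar.gz")` would unpack the archive into the current directory and return the list of member names, which is the information `-v` would have printed.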



In [6]:
#!tar -xzf hadoop-3.3.0.tar.gz
!tar -xzf hadoop-3.3.4.tar.gz


In [7]:
#copy the hadoop directory to /usr/local
#!cp -r hadoop-3.3.0/ /usr/local/
!cp -r hadoop-3.3.4/ /usr/local/


In [8]:
#Unzip the archive file (jars)
#!unzip hadoop-examples.jar.zip
!unzip hadoop-examples.jar.zip


Archive:  hadoop-examples.jar.zip
  inflating: hadoop-examples.jar     


In [9]:
#house cleaning
#!rm -r hadoop-3.3.0/
#!rm  hadoop-3.3.0.tar.gz
#!rm  hadoop-examples.jar.zip
!rm -r hadoop-3.3.4/
!rm  hadoop-3.3.4.tar.gz
!rm  hadoop-examples.jar.zip

#Step 2: Configuring Hadoop’s Java Home
Hadoop requires that you set the path to Java, either as an environment variable or in the Hadoop configuration file.
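The cells that follow find the Java path with a shell command and then paste it into `os.environ` by hand. As a minimal sketch of doing both steps in Python, assuming `java` is on the `PATH` and lives in a `bin/` directory directly under the Java home (`java_home_from_binary` is a hypothetical helper name):

```python
import os
import shutil

def java_home_from_binary(resolved_java_path):
    """Given a resolved path ending in bin/java, return the directory above bin/."""
    return os.path.dirname(os.path.dirname(resolved_java_path))

java = shutil.which("java")  # same lookup the shell does for `java`
if java is None:
    print("java was not found on PATH")
else:
    # realpath follows symlinks, like `readlink -f` in the shell cell
    os.environ["JAVA_HOME"] = java_home_from_binary(os.path.realpath(java))
    print(os.environ["JAVA_HOME"])
```

The result has no trailing slash, unlike the output of the `sed` command; both forms point at the same directory.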



In [10]:
#To find the default Java path
!readlink -f /usr/bin/java | sed "s:bin/java::"


/usr/lib/jvm/java-11-openjdk-amd64/


In [12]:
#make sure to add the same path that you get from the previous command
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"



In [13]:
#!/usr/local/hadoop-3.3.0/bin/hadoop
!/usr/local/hadoop-3.3.4/bin/hadoop

Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the
    

# Basic Hadoop Commands
In this section, we are going to learn some basic commands that allow us to interact with the Hadoop file system. First, we will download a text file from the internet, then we will copy it into HDFS and try some basic commands on it.

In [37]:
#!wget  http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
!wget https://raw.githubusercontent.com/m-afzal/BCU-BigDataManagement/main/myWordCount.txt 

--2023-01-30 11:09:26--  https://raw.githubusercontent.com/m-afzal/BCU-BigDataManagement/main/myWordCount.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45 [text/plain]
Saving to: ‘myWordCount.txt’


2023-01-30 11:09:26 (1.76 MB/s) - ‘myWordCount.txt’ saved [45/45]



#1.Listing Files/Directories
We can use the ls command to list the files in HDFS. If no path is given, it lists the current user's working directory.

Hadoop HDFS ls Command Usage:

```
hadoop fs -ls /path

```



In [38]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -ls
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls

Found 6 items
drwxr-xr-x   - root root       4096 2023-01-26 14:32 .config
-rw-r--r--   1 root root     142466 2013-10-16 16:52 hadoop-examples.jar
-rw-r--r--   1 root root         45 2023-01-30 11:09 myWordCount.txt
drwxr-xr-x   - root root       4096 2023-01-30 10:51 out
drwxr-xr-x   - root root       4096 2023-01-26 14:33 sample_data
-rw-r--r--   1 root root    5458199 2020-04-23 18:02 t8.shakespeare.txt


In [17]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -ls /usr/local/hadoop-3.3.0
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls /usr/local/hadoop-3.3.4

Found 13 items
-rw-r--r--   1 root root      24707 2023-01-30 10:43 /usr/local/hadoop-3.3.4/LICENSE-binary
-rw-r--r--   1 root root      15217 2023-01-30 10:43 /usr/local/hadoop-3.3.4/LICENSE.txt
-rw-r--r--   1 root root      29473 2023-01-30 10:43 /usr/local/hadoop-3.3.4/NOTICE-binary
-rw-r--r--   1 root root       1541 2023-01-30 10:43 /usr/local/hadoop-3.3.4/NOTICE.txt
-rw-r--r--   1 root root        175 2023-01-30 10:43 /usr/local/hadoop-3.3.4/README.txt
drwxr-xr-x   - root root       4096 2023-01-30 10:43 /usr/local/hadoop-3.3.4/bin
drwxr-xr-x   - root root       4096 2023-01-30 10:43 /usr/local/hadoop-3.3.4/etc
drwxr-xr-x   - root root       4096 2023-01-30 10:43 /usr/local/hadoop-3.3.4/include
drwxr-xr-x   - root root       4096 2023-01-30 10:43 /usr/local/hadoop-3.3.4/lib
drwxr-xr-x   - root root       4096 2023-01-30 10:43 /usr/local/hadoop-3.3.4/libexec
drwxr-xr-x   - root root       4096 2023-01-30 10:43 /usr/local/hadoop-3.3.4/licenses-binary
drwxr-xr-x   - root root       

#2.CP command
The cp command copies a file from one directory to another within HDFS.

```
hadoop fs -cp <src> <dest>

```



In [18]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -cp t8.shakespeare.txt /usr/local/t8.shakespeare2.txt
!/usr/local/hadoop-3.3.4/bin/hadoop fs -cp t8.shakespeare.txt /usr/local/t8.shakespeare2.txt

In [19]:
#let's verify that the file is copied using the ls command
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -ls /usr/local/
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls /usr/local/

Found 18 items
drwxr-xr-x   - root root       4096 2023-01-26 14:40 /usr/local/_gcs_config_ops.so
drwxr-xr-x   - root root       4096 2023-01-26 14:46 /usr/local/bin
drwxr-xr-x   - root root       4096 2023-01-26 14:46 /usr/local/colab
drwxr-xr-x   - root root       4096 2023-01-26 14:26 /usr/local/cuda
drwxr-xr-x   - root root       4096 2023-01-26 14:26 /usr/local/cuda-11
drwxr-xr-x   - root root       4096 2023-01-26 14:26 /usr/local/cuda-11.2
drwxr-xr-x   - root root       4096 2023-01-26 14:41 /usr/local/etc
drwxr-xr-x   - root root       4096 2022-10-19 16:47 /usr/local/games
drwxr-xr-x   - root root       4096 2023-01-30 10:43 /usr/local/hadoop-3.3.4
drwxr-xr-x   - root root       4096 2023-01-26 14:41 /usr/local/include
drwxr-xr-x   - root root       4096 2023-01-26 14:40 /usr/local/lib
drwxr-xr-x   - root root       4096 2023-01-26 14:40 /usr/local/licensing
drwxr-xr-x   - root root       4096 2023-01-26 14:40 /usr/local/man
drwxr-xr-x   - root root       4096 2022-10-19 16:49

#3.MV command
The mv command is used to move a file or directory to another directory within HDFS.
Hadoop HDFS mv Command Usage:



```
hadoop fs -mv <src> <dest>
```



In [20]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -mv /usr/local/t8.shakespeare2.txt ./t8.shakespeare2.txt 
!/usr/local/hadoop-3.3.4/bin/hadoop fs -mv /usr/local/t8.shakespeare2.txt ./t8.shakespeare2.txt 

In [21]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -ls /usr/local/
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls /usr/local/

Found 17 items
drwxr-xr-x   - root root       4096 2023-01-26 14:40 /usr/local/_gcs_config_ops.so
drwxr-xr-x   - root root       4096 2023-01-26 14:46 /usr/local/bin
drwxr-xr-x   - root root       4096 2023-01-26 14:46 /usr/local/colab
drwxr-xr-x   - root root       4096 2023-01-26 14:26 /usr/local/cuda
drwxr-xr-x   - root root       4096 2023-01-26 14:26 /usr/local/cuda-11
drwxr-xr-x   - root root       4096 2023-01-26 14:26 /usr/local/cuda-11.2
drwxr-xr-x   - root root       4096 2023-01-26 14:41 /usr/local/etc
drwxr-xr-x   - root root       4096 2022-10-19 16:47 /usr/local/games
drwxr-xr-x   - root root       4096 2023-01-30 10:43 /usr/local/hadoop-3.3.4
drwxr-xr-x   - root root       4096 2023-01-26 14:41 /usr/local/include
drwxr-xr-x   - root root       4096 2023-01-26 14:40 /usr/local/lib
drwxr-xr-x   - root root       4096 2023-01-26 14:40 /usr/local/licensing
drwxr-xr-x   - root root       4096 2023-01-26 14:40 /usr/local/man
drwxr-xr-x   - root root       4096 2022-10-19 16:49

In [22]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -ls ./
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls ./

Found 5 items
drwxr-xr-x   - root root       4096 2023-01-26 14:32 .config
-rw-r--r--   1 root root     142466 2013-10-16 16:52 hadoop-examples.jar
drwxr-xr-x   - root root       4096 2023-01-26 14:33 sample_data
-rw-r--r--   1 root root    5458199 2020-04-23 18:02 t8.shakespeare.txt
-rw-r--r--   1 root root    5458199 2023-01-30 10:47 t8.shakespeare2.txt


#4. copyFromLocal
This command is used to copy a file from the local file system into the Hadoop Distributed File System (HDFS).
```
hadoop fs -copyFromLocal <localsrc> <hdfs destination>

```
For example, the command pattern above can copy the 't8.shakespeare.txt' file from the local file system into a directory in HDFS.

**Note**: in this environment Hadoop runs on a single machine without a separate HDFS, so the local and Hadoop file systems are the same. In a production environment they are different file systems.
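The copy direction is easy to mix up, so here is a small Python sketch that builds the two commands side by side. `fs_command` and `run_fs` are hypothetical helpers, and the destination paths (`/data/...`, `copy.txt`) are invented for illustration; only the Hadoop binary path comes from this notebook.

```python
import subprocess

HADOOP_BIN = "/usr/local/hadoop-3.3.4/bin/hadoop"  # install path used in this notebook

def fs_command(action, *paths):
    """Build the argument list for `hadoop fs -<action> <paths...>` without running it."""
    return [HADOOP_BIN, "fs", f"-{action}", *paths]

def run_fs(action, *paths):
    """Run the command and return its standard output; raises on a non-zero exit code."""
    result = subprocess.run(
        fs_command(action, *paths), capture_output=True, text=True, check=True
    )
    return result.stdout

# Local file system -> HDFS
print(fs_command("copyFromLocal", "t8.shakespeare.txt", "/data/t8.shakespeare.txt"))
# HDFS -> local file system (the reverse direction)
print(fs_command("copyToLocal", "/data/t8.shakespeare.txt", "copy.txt"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Calling `run_fs("ls")` would execute the listing command from the earlier section.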


#5.Making Directories
You can make a directory using the mkdir command. Let's see it in action.


Hadoop HDFS mkdir Command Usage:

```
hadoop fs -mkdir /path/directory_name
```



In [23]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -mkdir myFirstDirectory
!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir myFirstDirectory

In [24]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -ls
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls

Found 6 items
drwxr-xr-x   - root root       4096 2023-01-26 14:32 .config
-rw-r--r--   1 root root     142466 2013-10-16 16:52 hadoop-examples.jar
drwxr-xr-x   - root root       4096 2023-01-30 10:48 myFirstDirectory
drwxr-xr-x   - root root       4096 2023-01-26 14:33 sample_data
-rw-r--r--   1 root root    5458199 2020-04-23 18:02 t8.shakespeare.txt
-rw-r--r--   1 root root    5458199 2023-01-30 10:47 t8.shakespeare2.txt


#6.Removing directory/file
You can remove a file using the rm command. To remove a directory, add the '-r' option.

Hadoop HDFS rm Command Usage:



```
hadoop fs -rm -r /path/directory_name/
hadoop fs -rm /path/directory_name/FileName

```



In [26]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -rm -r myFirstDirectory
!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r myFirstDirectory

2023-01-30 10:49:36,657 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
Deleted myFirstDirectory


In [28]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -rm  t8.shakespeare2.txt
!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm  t8.shakespeare2.txt

2023-01-30 10:50:06,222 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
Deleted t8.shakespeare2.txt


In [39]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -ls
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls

Found 6 items
drwxr-xr-x   - root root       4096 2023-01-26 14:32 .config
-rw-r--r--   1 root root     142466 2013-10-16 16:52 hadoop-examples.jar
-rw-r--r--   1 root root         45 2023-01-30 11:09 myWordCount.txt
drwxr-xr-x   - root root       4096 2023-01-30 10:51 out
drwxr-xr-x   - root root       4096 2023-01-26 14:33 sample_data
-rw-r--r--   1 root root    5458199 2020-04-23 18:02 t8.shakespeare.txt


## WordCount Example 

The [WordCount example](https://wiki.apache.org/hadoop/WordCount) is implemented in Java and is the example used in the [Hadoop MapReduce Tutorial](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html).


In [47]:
#!/usr/local/hadoop-3.3.0/bin/hadoop jar ./hadoop-examples.jar wordcount
!/usr/local/hadoop-3.3.4/bin/hadoop jar ./hadoop-examples.jar wordcount

Usage: wordcount <in> [<in>...] <out>


In [41]:
#!/usr/local/hadoop-3.3.0/bin/hadoop jar ./hadoop-examples.jar wordcount t8.shakespeare.txt out
#!/usr/local/hadoop-3.3.4/bin/hadoop jar ./hadoop-examples.jar wordcount t8.shakespeare.txt out
!/usr/local/hadoop-3.3.4/bin/hadoop jar ./hadoop-examples.jar wordcount myWordCount.txt out

2023-01-30 11:11:11,830 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2023-01-30 11:11:12,050 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2023-01-30 11:11:12,051 INFO impl.MetricsSystemImpl: JobTracker metrics system started
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/content/out already exists
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:164)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:277)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:143)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1571)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1568)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:18
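The `FileAlreadyExistsException` in the trace above means the job refused to start because the output directory `out` was left over from an earlier run; Hadoop does not overwrite an existing output directory. A minimal sketch of clearing it before rerunning, assuming (as in this notebook, where the path is `file:/content/out`) that the output lives on the local file system. `clear_output_dir` is a hypothetical helper name.

```python
import os
import shutil

def clear_output_dir(path):
    """Delete a stale job output directory; return True if something was removed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False
```

Calling `clear_output_dir("out")` before the `wordcount` cell lets the job run again. On a real cluster the output would be in HDFS, so you would use `hadoop fs -rm -r out` instead.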

Let's look inside the output directory. The directory created by WordCount contains several files. Look inside the directory by running hadoop fs -ls out

In [48]:
#!/usr/local/hadoop-3.3.0/bin/hadoop fs -ls ./out
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls ./out

Found 2 items
-rw-r--r--   1 root root          0 2023-01-30 10:51 out/_SUCCESS
-rw-r--r--   1 root root     717768 2023-01-30 10:51 out/part-r-00000


In [49]:
!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls ./out/part*

-rw-r--r--   1 root root     717768 2023-01-30 10:51 out/part-r-00000


In [50]:
#! cat out/part-r-00000 local.txt
! cat out/part*

"	241
"'Tis	1
"A	4
"AS-IS".	1
"Air,"	1
"Alas,	1
"Amen"	2
"Amen"?	1
"Amen,"	1
"And	1
"Aroint	1
"B	1
"Black	1
"Break	1
"Brutus"	1
"Brutus,	2
"C	1
"Caesar"?	1
"Caesar,	1
"Caesar."	2
"Certes,"	1
"Come	1
"Cursed	1
"D	1
"Darest	1
"Defect"	1
"Do	1
"E	1
"Fear	2
"Fly,	1
"Gentle	1
"Give	2
"Glamis	1
"God	2
"Good	1
"Havoc!"	1
"He	1
"Help	1
"Help,	2
"Here	1
"Hold,	2
"I	4
"Indeed!"	1
"King	1
"Liberty,	1
"Lo,	1
"Long	1
"Murther!"	2
"Neither	1
"Now	1
"O	2
"Peace,	1
"Pro-	1
"Project	1
"Right	1
"Shall	1
"Sing	2
"Sir,	1
"Sleep	2
"Small	2
"Speak,	1
"Sweet	1
"That	1
"The	1
"These	1
"They	2
"This	2
"Thus	2
"Tis	2
"Where	1
"Willow,	1
"You'll	1
"better"?	1
"hem,"	1
"never."	1
"not"	1
"small	1
"then"	1
"thrusting"	1
"thy	1
"twas	1
"whore"	1
"whore."	1
"willow";	1
#100]	1
&	3
&C.	2
&c.	12
&c.'	2
&c.,	2
'"All	1
'"Among	1
'"And,	1
'"But,	1
'"Gamut"	1
'"How	1
'"Lo,	2
'"Look	1
'"My	1
'"Now	1
'"O	2
'"The	1
'"When	1
''Tis	3
'-on	1
'A	53
'A-down	1
'Above	1
'Accommodated!'	1
'Accost'	1
'Achilles	2
'Ad	3
'Adieu,	1
'Afte
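Each line of the output above is a word and its count separated by a tab, sorted by word rather than by count. As a minimal sketch of post-processing it in Python, assuming the part files are readable from the local file system as they are in this notebook (`parse_wordcount_lines` and `load_wordcount` are hypothetical helper names):

```python
import glob
from collections import Counter

def parse_wordcount_lines(lines):
    """Turn 'word<TAB>count' lines into a Counter of word -> count."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the count is always the final field.
        word, _, count = line.rpartition("\t")
        counts[word] += int(count)
    return counts

def load_wordcount(pattern="out/part-*"):
    """Read every part file matching the pattern and merge the counts."""
    counts = Counter()
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            counts.update(parse_wordcount_lines(handle))
    return counts

# Three lines taken from the output shown above
sample = ['"Amen"\t2\n', "&c.\t12\n", "'A\t53\n"]
print(parse_wordcount_lines(sample).most_common(2))
```

With the real output, `load_wordcount().most_common(10)` would give the ten most frequent tokens. Punctuation stays attached to the words because WordCount splits on whitespace only.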

#Things to remember


1.   Map/Reduce 
2.   Differences between copy commands (cp, copyToLocal, put, etc.)
[Click Here](https://bigdatansql.com/2020/09/29/difference-between-copyfromlocal-put-copytolocal-and-get/)



#Exercise
Create a new colab notebook then:

1.   Download Hadoop
2.   Set JAVA_HOME
3.   Create a new folder called 'hello_hadoop'
4.   Download the following file https://raw.githubusercontent.com/m-afzal/BCU-BigDataManagement/main/myWordCount.txt and move it to the 'hello_hadoop' directory
5.   Apply MapReduce to count the words in the previous file







