# Hadoop

### Installing Java

In [1]:
#Checking the installed Java version
!java -version

openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu218.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu218.04, mixed mode, sharing)


In [2]:
#Installing Java 8
# -qq: quiet level 2, no output except for errors (also implies -y)
# > /dev/null discards whatever the command still writes to stdout
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [3]:
#Switching java version to use as default (choose option 2)
!update-alternatives --config java

There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                            Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode


In [4]:
#Switching javac version to use as default (choose option 2)
!update-alternatives --config javac

There are 2 choices for the alternative javac (providing /usr/bin/javac).

  Selection    Path                                          Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/javac   1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/javac   1111      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/bin/javac    1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/javac to provide /usr/bin/javac (javac) in manual mode


In [5]:
#Switching jps version to use as default (choose option 2)
!update-alternatives --config jps

There are 2 choices for the alternative jps (providing /usr/bin/jps).

  Selection    Path                                        Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/jps   1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/jps   1111      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/bin/jps    1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/jps to provide /usr/bin/jps (jps) in manual mode
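The three interactive prompts above can likely be avoided with `update-alternatives --set`, which takes the alternative name and the target path directly. A minimal sketch, assuming the Java 8 paths shown in the listings above; the helper names `alternatives_commands` and `apply_alternatives` are hypothetical, and actually running the commands needs root.

```python
import subprocess

# Paths copied from the update-alternatives listings above.
JAVA8_ALTERNATIVES = {
    "java": "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java",
    "javac": "/usr/lib/jvm/java-8-openjdk-amd64/bin/javac",
    "jps": "/usr/lib/jvm/java-8-openjdk-amd64/bin/jps",
}

def alternatives_commands(targets):
    """Build one `update-alternatives --set NAME PATH` command per tool."""
    return [["update-alternatives", "--set", name, path]
            for name, path in targets.items()]

def apply_alternatives(targets):
    """Run the commands; requires root and the Java 8 package installed."""
    for cmd in alternatives_commands(targets):
        subprocess.run(cmd, check=True)

# Print the commands first so they can be reviewed before running them.
for cmd in alternatives_commands(JAVA8_ALTERNATIVES):
    print(" ".join(cmd))
```

Calling `apply_alternatives(JAVA8_ALTERNATIVES)` would then switch all three tools without a prompt.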


In [6]:
#Checking Java default version
!java -version

openjdk version "1.8.0_352"
OpenJDK Runtime Environment (build 1.8.0_352-8u352-ga-1~18.04-b08)
OpenJDK 64-Bit Server VM (build 25.352-b08, mixed mode)


In [7]:
#Finding the default Java path
!readlink -f /usr/bin/java | sed "s:bin/java::"

/usr/lib/jvm/java-8-openjdk-amd64/jre/
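The path printed above ends in `jre/`, while `JAVA_HOME` is later set to the JDK root one level up. The following sketch shows one way to derive that root in Python; the function name `java_home_from_binary` is hypothetical, and the `jre` handling assumes the Java 8 directory layout seen above.

```python
import os

def java_home_from_binary(java_binary):
    """Strip the trailing bin/java (and a jre level, if any) from a path."""
    home = os.path.dirname(os.path.dirname(java_binary))  # drop bin/java
    if os.path.basename(home) == "jre":  # Java 8 keeps java under jre/bin
        home = os.path.dirname(home)
    return home

# os.path.realpath follows symlinks, much like `readlink -f`.
resolved = os.path.realpath("/usr/bin/java")
print(java_home_from_binary(resolved))
```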


### Installing Hadoop

In [8]:
#Downloading Hadoop
!wget -q  https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
!wget -q https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz.sha512

In [9]:
#File verification
!shasum -a512 hadoop-3.3.4.tar.gz | awk '{print $1}'

ca5e12625679ca95b8fd7bb7babc2a8dcb2605979b901df9ad137178718821097b67555115fafc6dbf6bb32b61864ccb6786dbc555e589694a22bf69147780b4


In [10]:
#Displaying the published checksum to compare with the value computed above
!cat hadoop-3.3.4.tar.gz.sha512

SHA512 (hadoop-3.3.4.tar.gz) = ca5e12625679ca95b8fd7bb7babc2a8dcb2605979b901df9ad137178718821097b67555115fafc6dbf6bb32b61864ccb6786dbc555e589694a22bf69147780b4
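Comparing two 128-character strings by eye is error-prone. The comparison can be automated; the sketch below assumes the checksum file uses the `SHA512 (name) = hex` layout shown above, and the function names are hypothetical. It has to run before the archive is deleted.

```python
import hashlib
import re

def sha512_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large archive is never fully in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def expected_digest(checksum_text):
    """Extract the hex digest from a 'SHA512 (name) = hex' line."""
    match = re.search(r"=\s*([0-9a-fA-F]{128})\s*$", checksum_text.strip())
    if match is None:
        raise ValueError("no SHA-512 digest found")
    return match.group(1).lower()

def verify(archive, checksum_file):
    """Return True when the archive matches the published digest."""
    with open(checksum_file) as fh:
        return sha512_of(archive) == expected_digest(fh.read())

# Usage:
# verify("hadoop-3.3.4.tar.gz", "hadoop-3.3.4.tar.gz.sha512")
```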


In [11]:
#Untarring the file
!tar -xf hadoop-3.3.4.tar.gz

In [12]:
#Removing the tar file
!rm hadoop-3.3.4.tar.gz
!rm hadoop-3.3.4.tar.gz.sha512

In [26]:
#Copying the Hadoop files to /usr/local
#-r copy directories recursively
!cp -r hadoop-3.3.4/ /usr/local/

In [30]:
#Exploring hadoop-3.3.4/etc/hadoop directory
!ls /usr/local/hadoop-3.3.4/etc/hadoop

capacity-scheduler.xml		  kms-log4j.properties
configuration.xsl		  kms-site.xml
container-executor.cfg		  log4j.properties
core-site.xml			  mapred-env.cmd
hadoop-env.cmd			  mapred-env.sh
hadoop-env.sh			  mapred-queues.xml.template
hadoop-metrics2.properties	  mapred-site.xml
hadoop-policy.xml		  shellprofile.d
hadoop-user-functions.sh.example  ssl-client.xml.example
hdfs-rbf-site.xml		  ssl-server.xml.example
hdfs-site.xml			  user_ec_policies.xml.template
httpfs-env.sh			  workers
httpfs-log4j.properties		  yarn-env.cmd
httpfs-site.xml			  yarn-env.sh
kms-acls.xml			  yarnservice-log4j.properties
kms-env.sh			  yarn-site.xml


In [None]:
#Exploring hadoop-env.sh file
!cat hadoop-3.3.4/etc/hadoop/hadoop-env.sh

In [None]:
#Adding the JAVA_HOME directory to the hadoop-env.sh file
#Uncommenting the other variables and setting them as follows

# export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# export HADOOP_HOME=/usr/local/hadoop-3.3.4
# export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
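Instead of editing the file by hand, the three lines can be appended from the notebook. A minimal sketch, assuming the copy under `/usr/local/hadoop-3.3.4` is the one to change; `append_exports` is a hypothetical helper, and it skips any line the file already contains so the cell can be re-run safely.

```python
from pathlib import Path

EXPORTS = [
    "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64",
    "export HADOOP_HOME=/usr/local/hadoop-3.3.4",
    "export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop",
]

def append_exports(env_file, lines=EXPORTS):
    """Append each export line unless the file already contains it."""
    path = Path(env_file)
    existing = path.read_text().splitlines()
    # Exact-match check: commented-out variants do not count as present.
    missing = [line for line in lines if line not in existing]
    if missing:
        with path.open("a") as fh:
            fh.write("\n" + "\n".join(missing) + "\n")
    return missing

# Usage (path is an assumption):
# append_exports("/usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh")
```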

In [43]:
#Testing
!/usr/local/hadoop-3.3.4/bin/hadoop

Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the
    

In [68]:
#Importing os module
import os
#Creating environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["JRE_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64/jre"
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.3.4"
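The reference commands in the following sections are written as plain `hdfs dfs ...`, which only resolves if the Hadoop `bin` directory is on `PATH`. One optional way to arrange that from Python is sketched below; `prepend_to_path` is a hypothetical helper and the fallback install location is an assumption taken from the cell above.

```python
import os

def prepend_to_path(directory, env=os.environ):
    """Put `directory` at the front of PATH unless it is already listed."""
    parts = env.get("PATH", "").split(os.pathsep)
    if directory not in parts:
        env["PATH"] = os.pathsep.join([directory] + [p for p in parts if p])
    return env["PATH"]

# Fall back to the install location used in this notebook if unset.
hadoop_home = os.environ.get("HADOOP_HOME", "/usr/local/hadoop-3.3.4")
prepend_to_path(os.path.join(hadoop_home, "bin"))
```

The testing cells further down call `$HADOOP_HOME/bin/hdfs` explicitly, so this step is a convenience rather than a requirement.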

### File Management

In [None]:
# hdfs dfs -cp <src> <dst>  # Copy files from source to destination
# hdfs dfs -mv <src> <dst>  # Move files from source to destination
# hdfs dfs -mkdir /foodir # Create a directory named /foodir
# hdfs dfs -rm /foodir/myfile # Delete the file /foodir/myfile
# hdfs dfs -rmdir /foodir # Delete the directory /foodir (it must be empty)
# hdfs dfs -rm -r /foodir   # Remove the directory /foodir and its contents recursively (-rmr is the deprecated form)

### List Files

In [None]:
# hdfs dfs -ls -h -R # Recursively list subdirectories with human-readable file sizes.

### Upload/Download Files

In [None]:
# -p : Preserves access and modification times, ownership and permissions.
# -f : Overwrites the destination if it already exists.
# hdfs dfs -put -f /home/myfile /hadoop # Copies the file from the local file system to HDFS
# hdfs dfs -get /hadoop/myfile /home # Copies the file from HDFS to the local file system

### Read/Write Files

In [None]:
# hdfs dfs -cat /foodir/myfile # View the contents of the file /foodir/myfile
# hdfs dfs -touchz /foodir/myfile  # Create an empty (zero-length) file named /foodir/myfile
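The commands listed in the reference cells above can also be driven from Python, which makes their output available as a string. A minimal sketch using `subprocess`; the helper names `hdfs_dfs_command` and `hdfs_dfs` are hypothetical, and the fallback install location is an assumption based on the setup earlier in this notebook.

```python
import os
import subprocess

def hdfs_dfs_command(*args, hadoop_home=None):
    """Build the argument list for `hdfs dfs <args>`."""
    home = hadoop_home or os.environ.get("HADOOP_HOME", "/usr/local/hadoop-3.3.4")
    return [os.path.join(home, "bin", "hdfs"), "dfs", *args]

def hdfs_dfs(*args, hadoop_home=None):
    """Run the command and return its stdout; raises if it exits non-zero."""
    result = subprocess.run(
        hdfs_dfs_command(*args, hadoop_home=hadoop_home),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Usage:
# print(hdfs_dfs("-ls", "-h", "-R", "/foodir"))
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.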

### Testing

In [70]:
#Creating a directory
!$HADOOP_HOME/bin/hdfs dfs -mkdir /content/hadoop_data

In [71]:
#!find / -type d -name hadoop_data

In [74]:
#Uploading a file, overwriting the destination if it already exists
!$HADOOP_HOME/bin/hdfs dfs -put -f /content/drive/MyDrive/datasets/test_data.csv /content/hadoop_data

In [76]:
#Listing the directory recursively with human-readable file sizes
!$HADOOP_HOME/bin/hdfs dfs -ls -h -R /content/hadoop_data

-rw-r--r--   1 root root        380 2023-01-07 11:13 /content/hadoop_data/test_data.csv


In [77]:
#Viewing the contents of the file
!$HADOOP_HOME/bin/hdfs dfs -cat /content/hadoop_data/test_data.csv

region,manager,product,amount
r2,m1,pr3,12
r2,m5,pr6,49
r2,m3,pr1,49
r5,m4,pr7,59
r5,m2,pr2,68
r1,m4,pr3,50
r2,m5,pr2,21
r2,m5,pr1,21
r6,m4,pr4,68
r6,m3,pr2,22
r5,m3,pr2,46
r5,m2,pr5,70
r6,m2,pr1,64
r1,m5,pr3,59
r4,m3,pr6,66
r9,m4,pr2,82
r2,m5,pr1,57
r1,m1,pr5,46
r10,m2,pr2,74
r6,m5,pr3,28
r2,m2,pr3,85
r2,m5,pr1,15
r4,m5,pr5,79
r4,m2,pr5,51
r4,m2,pr3,84

In [78]:
#-text prints a file in text format; for a plain CSV the output matches -cat
!$HADOOP_HOME/bin/hdfs dfs -text /content/hadoop_data/test_data.csv

region,manager,product,amount
r2,m1,pr3,12
r2,m5,pr6,49
r2,m3,pr1,49
r5,m4,pr7,59
r5,m2,pr2,68
r1,m4,pr3,50
r2,m5,pr2,21
r2,m5,pr1,21
r6,m4,pr4,68
r6,m3,pr2,22
r5,m3,pr2,46
r5,m2,pr5,70
r6,m2,pr1,64
r1,m5,pr3,59
r4,m3,pr6,66
r9,m4,pr2,82
r2,m5,pr1,57
r1,m1,pr5,46
r10,m2,pr2,74
r6,m5,pr3,28
r2,m2,pr3,85
r2,m5,pr1,15
r4,m5,pr5,79
r4,m2,pr5,51
r4,m2,pr3,84

In [80]:
#Creating an empty file
!$HADOOP_HOME/bin/hdfs dfs -touchz /content/hadoop_data/test_data2.csv

In [60]:
#Removing the directory and its contents recursively
!$HADOOP_HOME/bin/hdfs dfs -rm -r /content/hadoop_data

2023-01-07 10:44:05,155 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
Deleted /content/hadoop_data


In [81]:
#Removing the files one by one, then the now-empty directory
!$HADOOP_HOME/bin/hdfs dfs -rm /content/hadoop_data/test_data.csv
!$HADOOP_HOME/bin/hdfs dfs -rm /content/hadoop_data/test_data2.csv
!$HADOOP_HOME/bin/hdfs dfs -rmdir /content/hadoop_data

2023-01-07 11:23:02,001 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
Deleted /content/hadoop_data/test_data.csv
2023-01-07 11:23:03,820 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
Deleted /content/hadoop_data/test_data2.csv
