# Workshop 6 - Hadoop - HDFS

Juan Navarro, <jsnavarroa@unal.edu.co>


# Installation

```bash
cd "${HOME}/workspace/BDA"

git clone https://github.com/jsnavarroa/docker-hadoop.git
cd docker-hadoop

# Install docker-compose
conda create --name py3 python=3.6
conda activate py3
conda install -c conda-forge docker-compose

# Run local
docker-compose -f docker-compose-local.yml build
docker-compose -f docker-compose-local.yml up -d

```

* Hadoop URLs:
  * NameNode web UI: <http://localhost:9870/dfshealth.html#tab-overview>
  * HDFS: `hdfs://localhost:9800`

In [1]:
%%bash

# Environment variables
#export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

echo "JAVA_HOME=$JAVA_HOME"

JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/



# HDFS practice

Once the Hadoop cluster is installed, run the following commands for the practice exercise:

In [2]:
%%bash

HADOOP_HOME="${HOME}/Programas/BDA/hadoop-3.1.1"

# Clean up any previous run (-f avoids an error when /casa does not exist yet)
$HADOOP_HOME/bin/hdfs dfs -rm -r -f /casa

# Create house
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /casa/piso1/sala
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /casa/piso1/cocina
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /casa/piso2/alcoba

# Add furniture
$HADOOP_HOME/bin/hdfs dfs -put ./resources/mesa.jpg /casa/piso1/sala
$HADOOP_HOME/bin/hdfs dfs -put ./resources/estufa.jpg /casa/piso1/cocina
$HADOOP_HOME/bin/hdfs dfs -put ./resources/libro.doc /casa/piso2/alcoba
$HADOOP_HOME/bin/hdfs dfs -put ./resources/televisor.jpg /casa/piso2/alcoba

# Renovation
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /casa/piso2/estudio
$HADOOP_HOME/bin/hdfs dfs -put ./resources/cama.jpg /casa/piso2/alcoba
$HADOOP_HOME/bin/hdfs dfs -mv /casa/piso2/alcoba/televisor.jpg /casa/piso1/sala/televisor.jpg
$HADOOP_HOME/bin/hdfs dfs -mv /casa/piso2/alcoba/libro.doc /casa/piso2/estudio/libro.doc

# Check
$HADOOP_HOME/bin/hdfs dfs -ls -R /casa

Deleted /casa
drwxr-xr-x   - juan supergroup          0 2018-11-25 20:22 /casa/piso1
drwxr-xr-x   - juan supergroup          0 2018-11-25 20:22 /casa/piso1/cocina
-rw-r--r--   3 juan supergroup       5194 2018-11-25 20:22 /casa/piso1/cocina/estufa.jpg
drwxr-xr-x   - juan supergroup          0 2018-11-25 20:22 /casa/piso1/sala
-rw-r--r--   3 juan supergroup       3106 2018-11-25 20:22 /casa/piso1/sala/mesa.jpg
-rw-r--r--   3 juan supergroup       5688 2018-11-25 20:22 /casa/piso1/sala/televisor.jpg
drwxr-xr-x   - juan supergroup          0 2018-11-25 20:22 /casa/piso2
drwxr-xr-x   - juan supergroup          0 2018-11-25 20:22 /casa/piso2/alcoba
-rw-r--r--   3 juan supergroup       5859 2018-11-25 20:22 /casa/piso2/alcoba/cama.jpg
drwxr-xr-x   - juan supergroup          0 2018-11-25 20:22 /casa/piso2/estudio
-rw-r--r--   3 juan supergroup       9216 2018-11-25 20:22 /casa/piso2/estudio/libro.doc



# Term counting exercise

## I. Compute the value of pi in parallel with 5 map tasks and 5 samples per map

In [3]:
%%bash

HADOOP_HOME="${HOME}/Programas/BDA/hadoop-3.1.1"
HADOOP_EXAMPLES_JAR="${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar"

$HADOOP_HOME/bin/hadoop --loglevel WARN jar $HADOOP_EXAMPLES_JAR pi 5 5


Number of Maps  = 5
Samples per Map = 5
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Starting Job
2018-11-25 20:22:24,865 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2018-11-25 20:22:24,898 WARN io.ReadaheadPool: Failed readahead on ifile
EBADF: Bad file descriptor
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:270)
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:147)
	at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:208)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-11-25 20:22:24,903 WARN io.ReadaheadPool: Failed rea
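
To make the `pi 5 5` arguments concrete, here is a minimal Java sketch of the map/reduce split behind a Monte Carlo estimate of pi: each of the 5 "maps" draws its own samples independently and a final "reduce" sums the counts. It is an illustration only, not the code shipped in the Hadoop examples jar (which is understood to use a quasi-Monte Carlo variant rather than plain pseudo-random sampling). The class `PiSketch`, its method names, and the seed are hypothetical.

```java
import java.util.SplittableRandom;

class PiSketch {

    /** One "map": draw points in the unit square and count those inside the quarter circle. */
    static long countInside(int samples, SplittableRandom rng) {
        long inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        return inside;
    }

    /** The "reduce": sum the per-map counts; the inside/total ratio approximates pi/4. */
    static double estimatePi(int maps, int samplesPerMap, long seed) {
        SplittableRandom root = new SplittableRandom(seed);
        long inside = 0;
        for (int m = 0; m < maps; m++) {
            // split() hands each map an independent generator (assumed seed, for repeatability)
            inside += countInside(samplesPerMap, root.split());
        }
        return 4.0 * inside / ((long) maps * samplesPerMap);
    }

    public static void main(String[] args) {
        System.out.println(estimatePi(5, 5, 42L));        // same shape as `pi 5 5`
        System.out.println(estimatePi(5, 200_000, 42L));  // more samples per map
    }
}
```

With only 5 × 5 = 25 samples the estimate is coarse (it can only be a multiple of 0.16); raising the number of samples per map tightens it.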

## II. Word frequency

In [4]:
%%bash

HADOOP_HOME="${HOME}/Programas/BDA/hadoop-3.1.1"
HADOOP_EXAMPLES_JAR="${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar"

cd ./data

# Split file
split --additional-suffix=".txt" biblia.txt

mkdir -p ./archivos_biblia
mv xa*.txt ./archivos_biblia

# Upload files
$HADOOP_HOME/bin/hdfs dfs -mkdir -p biblia/input
$HADOOP_HOME/bin/hdfs dfs -put -f ./archivos_biblia/* biblia/input

$HADOOP_HOME/bin/hdfs dfs -ls biblia/input

# Execute wordcount
$HADOOP_HOME/bin/hdfs dfs -rm -r -f biblia/output
$HADOOP_HOME/bin/hadoop --loglevel WARN jar $HADOOP_EXAMPLES_JAR wordcount biblia/input biblia/output

# Show results
$HADOOP_HOME/bin/hdfs dfs -cat biblia/output/part-r-00000 | grep --context=10 "^él"

Found 7 items
-rw-r--r--   3 juan supergroup     111448 2018-11-25 20:22 biblia/input/xaa.txt
-rw-r--r--   3 juan supergroup     127578 2018-11-25 20:22 biblia/input/xab.txt
-rw-r--r--   3 juan supergroup     151478 2018-11-25 20:22 biblia/input/xac.txt
-rw-r--r--   3 juan supergroup     190377 2018-11-25 20:22 biblia/input/xad.txt
-rw-r--r--   3 juan supergroup     167416 2018-11-25 20:22 biblia/input/xae.txt
-rw-r--r--   3 juan supergroup     131772 2018-11-25 20:22 biblia/input/xaf.txt
-rw-r--r--   3 juan supergroup     121707 2018-11-25 20:22 biblia/input/xag.txt
Deleted biblia/output
2018-11-25 20:22:35,050 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
árboles	2
árboles,	4
árboles.	1
árboles;	2
ásperos	2
áspides	1
átate	1
échala	1
échalo	3
échate	4
él	592
él!	1
él,	215
él.	131
él:	8
él;	43
él?	8
éramos	5
ése	16
ése,	1
ésta	6
ésta,	4
éstas	3
éste	107
éste,	25
éste.	1
éste:	3
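
The listing above counts `él`, `él,`, `él.` and `él;` as separate terms because the example tokenizes on whitespace only. Below is a minimal sketch of one possible normalization step, assuming it is acceptable to strip leading and trailing punctuation and to lower-case tokens before counting. `TokenNormalizer` and its methods are hypothetical names, and the sample string uses ASCII tokens similar to those that appear in the Bible output.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TokenNormalizer {

    /** Strips leading/trailing Unicode punctuation and lower-cases the token. */
    static String normalize(String token) {
        return token.replaceAll("^\\p{P}+|\\p{P}+$", "").toLowerCase(Locale.ROOT);
    }

    /** Whitespace tokenization followed by normalization; empty tokens are skipped. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : text.split("\\s+")) {
            String term = normalize(raw);
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Without normalization these would be four different terms.
        System.out.println(count("(Como (como como, Como. dice"));
    }
}
```

In a mapper that tokenizes with `StringTokenizer`, the same `normalize` call could be applied to each token before it is written to the context, so that punctuation variants collapse into a single key.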



# Other exercises

In [5]:
%lsmagic

Available magic commands:
%%javascript 
%%js 
%%html 
%%HTML 
%%bash 
%lsmagic 
%classpath add jar <jar path>
%classpath add mvn <group name version>
%%classpath add mvn <group name version>
%classpath add dynamic 
%classpath config resolver <repoName repoUrl>
%classpath reset 
%classpath 
%import static <classpath>
%import <classpath>
%unimport <classpath>
%time 
%%time 
%timeit 
%%timeit 
%load_magic 
%%kernel 
%%python 
%%clojure 
%%groovy 
%%java 
%%kotlin 
%%scala 
%%sql 
%%async 


In [11]:
%classpath add mvn org.apache.hadoop hadoop-client 3.1.1

In [18]:
package co.edu.unal.bda.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class WordCount {

	private static final Logger log = LoggerFactory.getLogger(WordCount.class);

	public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

		private final static IntWritable one = new IntWritable(1);
		private Text word = new Text();

		public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
			StringTokenizer itr = new StringTokenizer(value.toString());
			while (itr.hasMoreTokens()) {
				word.set(itr.nextToken());
				context.write(word, one);
			}
		}
	}

	public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
		private IntWritable result = new IntWritable();

		public void reduce(Text key, Iterable<IntWritable> values, Context context)
				throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}

	public static void main(String input, String output) throws Exception {
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://localhost:9800");
		conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());

		Job job = Job.getInstance(conf, "word count");
		job.setJarByClass(WordCount.class);
		job.setMapperClass(TokenizerMapper.class);
		job.setCombinerClass(IntSumReducer.class);
		job.setReducerClass(IntSumReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

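		// Overwrite output. FileSystem.get() returns a cached, shared instance,
		// so it is left open here: closing it could break the job submitted below.
		FileSystem fileSystem = FileSystem.get(conf);
		Path outputPath = new Path(output);
		if (fileSystem.exists(outputPath)) {
			fileSystem.delete(outputPath, true);
		}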

		FileInputFormat.addInputPath(job, new Path(input));

		FileOutputFormat.setOutputPath(job, outputPath);

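		// Report the actual job outcome instead of assuming success
		boolean succeeded = job.waitForCompletion(true);

		log.info(succeeded ? "Work done" : "Job failed");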
	}
}

co.edu.unal.bda.hadoop.WordCount

## 1. Compute the tf*idf of each term in the previous collection (Bible) using MapReduce

This requires analyzing and modifying the Java WordCount file. Here tf is the term frequency and idf is the inverse document frequency of the term across the document collection, which can be computed as:

$$\mathrm{idf}(t) = \log\left(\frac{|D|}{1 + |\{d \in D : t \in d\}|}\right)$$

where $|D|$ is the number of documents in the collection $D$, and the denominator is 1 plus the number of documents in which $t$ appears.
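
As a sanity check on the formula above, here is a minimal in-memory Java sketch (no Hadoop involved) that computes tf*idf per document and term. It assumes tf is the raw count of the term within a document and that tokens are split on whitespace. The class `TfIdfSketch`, its method names, and the toy documents are hypothetical and only illustrate the arithmetic.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class TfIdfSketch {

    /** idf(t) = log(|D| / (1 + df(t))), as in the exercise statement. */
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (1 + docFreq));
    }

    /** Returns document id -> (term -> tf*idf); tf is the raw count in that document. */
    static Map<String, Map<String, Double>> tfIdf(Map<String, String> docs) {
        Map<String, Map<String, Integer>> tf = new TreeMap<>();
        Map<String, Integer> df = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Map<String, Integer> counts = new TreeMap<>();
            for (String term : doc.getValue().split("\\s+")) {
                if (!term.isEmpty()) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
            tf.put(doc.getKey(), counts);
            // Each document contributes at most 1 to a term's document frequency.
            for (String term : counts.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Map<String, Double>> scores = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : tf.entrySet()) {
            Map<String, Double> row = new TreeMap<>();
            for (Map.Entry<String, Integer> count : doc.getValue().entrySet()) {
                double weight = idf(docs.size(), df.get(count.getKey()));
                row.put(count.getKey(), count.getValue() * weight);
            }
            scores.put(doc.getKey(), row);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy stand-ins for the split files (made-up contents)
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("xaa", "a b a");
        docs.put("xab", "b c");
        docs.put("xac", "b");
        System.out.println(tfIdf(docs));
    }
}
```

Note that with the `1 +` in the denominator, a term that appears in every document gets a slightly negative idf, and a term that appears in all but one document gets exactly 0. In a MapReduce version, the per-document counts and the document frequencies would typically come from separate passes, with the input file name serving as the document id; that split is a design assumption, not something the WordCount class provides.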

In [19]:
package co.edu.unal.bda.hadoop;

try {
	WordCount.main("biblia/input","biblia/p1");
} catch (Exception e) {
    e.printStackTrace();
}

java.lang.ClassCastException: org.apache.hadoop.hdfs.DistributedFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3353)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3371)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:477)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226)
	at co.edu.unal.bda.hadoop.WordCount.main(WordCount.java:82)
	at co.edu.unal.bda.hadoop.BeakerWrapperClass1261714175Idf294ca399bcd420085e0e58769cfd929.beakerRun(BeakerWrapperClass1261714175Idf294ca399bcd420085e0e58769cfd929.java:35)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:4

null

In [9]:
%%bash

HADOOP_HOME="${HOME}/Programas/BDA/hadoop-3.1.1"

$HADOOP_HOME/bin/hdfs dfs -cat biblia/p1/part-r-00000 | head -n 15

(2	1
(Como	1
(Hablo	1
(Hch.	3
(Jn.	1
(Judas	1
(Lc.	35
(María,	1
(Mr.	84
(Mt.	188
(Porque	2
(aunque	2
(como	5
(dice	1
(digo	1
cat: Unable to write to output stream.

