## Hadoop MapReduce
- The goal is to practice using the Hadoop framework by running MapReduce jobs that count words and bigrams in the Sherlock Holmes books

### Install Hadoop

In [1]:
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

--2022-03-01 15:13:25--  https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
Resolving downloads.apache.org (downloads.apache.org)... 88.99.95.219, 135.181.214.104, 2a01:4f9:3a:2c57::2, ...
Connecting to downloads.apache.org (downloads.apache.org)|88.99.95.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 500749234 (478M) [application/x-gzip]
Saving to: ‘hadoop-3.3.0.tar.gz’


2022-03-01 15:13:52 (18.4 MB/s) - ‘hadoop-3.3.0.tar.gz’ saved [500749234/500749234]



In [2]:
!tar -xzf hadoop-3.3.0.tar.gz
!cp -r hadoop-3.3.0/ /usr/local/

### Set up path

In [3]:
!echo "export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")" >> \
/usr/local/hadoop-3.3.0/etc/hadoop/hadoop-env.sh

In [4]:
import os
os.environ['PATH'] += ':/usr/local/hadoop-3.3.0/bin'

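Before calling `hadoop`, it can help to confirm that the directory appended to `PATH` above really exposes the launcher. The snippet below is a minimal sketch of such a check; `find_on_path` is a hypothetical helper written for this notebook, not part of Hadoop, and it prints `None` when the command cannot be resolved.

```python
import os
import shutil

# Same directory that the cell above appends to PATH.
HADOOP_BIN = "/usr/local/hadoop-3.3.0/bin"


def find_on_path(command, extra_dir=None):
    """Return the full path of `command`, or None if it cannot be resolved."""
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        # Also search a directory that may not be on PATH yet.
        search_path = search_path + os.pathsep + extra_dir
    return shutil.which(command, path=search_path)


# Prints the resolved launcher path, or None if Hadoop is not installed there.
print(find_on_path("hadoop", HADOOP_BIN))
```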
### Run Hadoop

In [5]:
!hadoop

Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the
    

### Download data

In [6]:
!gdown -O sherlock_text.txt --id 130pPxku3vSEOvJb-Wsx1ysF0Fz7WTeKo 

Downloading...
From: https://drive.google.com/uc?id=130pPxku3vSEOvJb-Wsx1ysF0Fz7WTeKo
To: /content/sherlock_text.txt
100% 1.13M/1.13M [00:00<00:00, 6.45MB/s]


### Download scripts

In [7]:
!git clone https://github.com/ppkgtmm/big-data-map-reduce.git map-reduce

Cloning into 'map-reduce'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 23 (delta 9), reused 15 (delta 5), pack-reused 0
Unpacking objects: 100% (23/23), done.


In [8]:
!python3 map-reduce/mapper.py


Usage: python3 file_name.py count_type [clean]

        count_type	unigram or bigram
        clean		specify t if you want to clean texts



### Steps
- Copy the file from the local machine to HDFS. Show your command

In [9]:
!hadoop fs -mkdir -p /hduser/word_count/input
!hadoop fs -put ./sherlock_text.txt /hduser/word_count/input/text.txt
!hadoop fs -ls /hduser/word_count/input

Found 1 items
-rw-r--r--   1 root root    1126637 2022-03-01 15:14 /hduser/word_count/input/text.txt


In [10]:
!hadoop fs -head /hduser/word_count/input/text.txt


A STUDY IN SCARLET.





PART I.

(_Being a reprint from the reminiscences of_ JOHN H. WATSON, M.D., _late
of the Army Medical Department._) [2]




CHAPTER I. MR. SHERLOCK HOLMES.


IN the year 1878 I took my degree of Doctor of Medicine of the
University of London, and proceeded to Netley to go through the course
prescribed for surgeons in the army. Having completed my studies there,
I was duly attached to the Fifth Northumberland Fusiliers as Assistant
Surgeon. The regiment was stationed in India at the time, and before
I could join it, the second Afghan war had broken out. On landing at
Bombay, I learned that my corps had advanced through the passes, and
was already deep in the enemy's country. I followed, however, with many
other officers who were in the same situation as myself, and succeeded
in reaching Candahar in safety, where I found my regiment, and at once
entered upon my new duties.

The campaign brought honours and promotion to many, but for

- Run MapReduce through Hadoop Streaming. Show your result

In [11]:
!head /content/sherlock_text.txt | python3 /content/map-reduce/mapper.py unigram

A	1
STUDY	1
IN	1
SCARLET.	1
PART	1
I.	1
(_Being	1
a	1
reprint	1
from	1
the	1
reminiscences	1
of_	1
JOHN	1
H.	1
WATSON,	1
M.D.,	1
_late	1

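The mapper emits one `key<TAB>1` line per token, and Hadoop Streaming sorts those lines by key before handing them to the reducer. The block below is a minimal sketch of a reducer that sums the counts of consecutive identical keys. It is an illustration only and is not taken from the repository's `reducer.py`; the names `parse` and `reduce_counts` are hypothetical, and in a real streaming job the lines would come from `sys.stdin` instead of a list.

```python
from itertools import groupby


def parse(line):
    """Split a 'key<TAB>count' line into (key, int(count))."""
    key, _, count = line.rstrip("\n").rpartition("\t")
    return key, int(count)


def reduce_counts(sorted_lines):
    """Sum counts of consecutive identical keys; input must be sorted by key."""
    pairs = (parse(line) for line in sorted_lines if "\t" in line)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


# Small in-memory stand-in for mapper output; sorted() plays the role of the
# shuffle-and-sort phase that runs between the mapper and the reducer.
mapper_output = ["the\t1", "a\t1", "the\t1", "study\t1"]
for word, total in reduce_counts(sorted(mapper_output)):
    print(f"{word}\t{total}")
```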

In [12]:
!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-mapper 'python mapper.py unigram' -reducer 'python reducer.py' \
-input /hduser/word_count/input \
-output /hduser/word_count/output \
-file /content/map-reduce/mapper.py -file /content/map-reduce/reducer.py

2022-03-01 15:15:06,665 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/content/map-reduce/mapper.py, /content/map-reduce/reducer.py] [] /tmp/streamjob11337564572291151494.jar tmpDir=null
2022-03-01 15:15:07,408 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2022-03-01 15:15:07,649 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2022-03-01 15:15:07,650 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2022-03-01 15:15:07,675 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2022-03-01 15:15:07,902 INFO mapred.FileInputFormat: Total input files to process : 1
2022-03-01 15:15:07,923 INFO mapreduce.JobSubmitter: number of splits:1
2022-03-01 15:15:08,323 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1969085402_0001
2022-03-01 15:15:08,324 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-03-01 15:15

In [13]:
!hadoop fs -head /hduser/word_count/output/part-00000

"'About	1
"'After	1
"'All	1
"'An	1
"'And	1
"'Arthur	1
"'At	2
"'Black	1
"'But	4
"'Cause	1
"'Consider,	1
"'Does	1
"'For	1
"'Friends,'	1
"'God	1
"'Half	1
"'He	1
"'Here	1
"'How	1
"'Hum!'	1
"'I	11
"'IVY	1
"'If	1
"'It	6
"'It's	3
"'Jack	1
"'Listen	1
"'Look	1
"'Mr.	1
"'My	1
"'No	2
"'No;	2
"'None	1
"'Nonsense!'	1
"'Nonsense,	1
"'Not	2
"'Nothing	1
"'On	1
"'Perhaps,	1
"'Populus	1
"'Possibly	1
"'Quite	1
"'Rache,'	1
"'See	1
"'So	1
"'Take	1
"'Tention!"	1
"'The	1
"'Then	1
"'There	3
"'This	2
"'Tis	2
"'To	1
"'We'll	1
"'Well,	3
"'Well?'	1
"'What	3
"'When	1
"'Where	1
"'Who	1
"'Why,	1
"'Would	1
"'Yes.'	3
"'You	6
"'Your	2
"--Holmes's	1
"--diverting	1
"--in	1
"13,	1
"1742."	1
"1884."	1
"A	57
"About	6
"Absolutely	1
"After	2
"Ah!	6
"Ah,	26
"All	12
"Always!"	1
"Am	1
"Amen!	1
"American	1
"American,	1
"Ames,	1
"Ames,"	1
"Among	1
"An	6
"And	124
"And,	1
"Any	4
"Anyone	1
"Anything	4
"Apart	1
"Are	10
"Aren't	1
"Arthur	1
"As	12
"Ask	1
"At	10
"Au	1
"Away,	1
"Ay,	7
"Ay,"	1
"Baldwin--he	1
"Barrymore	2
"Bartholomew	2
"Ba

- Modify mapper.py as follows:
  - Convert every word to lowercase
  - Remove special characters
- Run MapReduce again. Show your result

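The two cleaning steps listed above can be sketched as follows. This is a hedged illustration, not the code in the repository's `mapper.py`: the helper names `clean_token` and `map_unigrams` are hypothetical, and the sketch assumes that "special character" means anything other than the ASCII letters `a-z` and the digits `0-9`. Under that assumption an apostrophe is dropped too, so `it's` becomes `its`; the real script may treat such cases differently.

```python
import re

# Assumption: everything outside a-z and 0-9 counts as a special character.
NON_ALNUM = re.compile(r"[^a-z0-9]+")


def clean_token(token):
    """Lowercase a token and strip every non-alphanumeric character."""
    return NON_ALNUM.sub("", token.lower())


def map_unigrams(lines, clean=False):
    """Yield one 'word<TAB>1' line per whitespace-separated token."""
    for line in lines:
        for token in line.split():
            word = clean_token(token) if clean else token
            if word:  # skip tokens that were nothing but punctuation
                yield f"{word}\t1"


# In a streaming job the input would be sys.stdin; a list keeps the demo local.
sample = ["A STUDY IN SCARLET.", "(_Being a reprint"]
for record in map_unigrams(sample, clean=True):
    print(record)
```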
In [14]:
!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-mapper 'python mapper.py unigram t' -reducer 'python reducer.py' \
-input /hduser/word_count/input \
-output /hduser/word_count/output_q3 \
-file /content/map-reduce/mapper.py -file /content/map-reduce/reducer.py

2022-03-01 15:15:34,156 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/content/map-reduce/mapper.py, /content/map-reduce/reducer.py] [] /tmp/streamjob3719629990838623494.jar tmpDir=null
2022-03-01 15:15:34,889 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2022-03-01 15:15:35,005 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2022-03-01 15:15:35,005 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2022-03-01 15:15:35,027 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2022-03-01 15:15:35,171 INFO mapred.FileInputFormat: Total input files to process : 1
2022-03-01 15:15:35,189 INFO mapreduce.JobSubmitter: number of splits:1
2022-03-01 15:15:35,504 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local347159830_0001
2022-03-01 15:15:35,505 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-03-01 15:15:3

In [15]:
!hadoop fs -head /hduser/word_count/output_q3/part-00000

1	6
10	5
109	1
11	5
12	3
127	1
129	1
13	7
13th	1
14	3
14th	1
15	4
1543	1
15th	1
16	1
1642	1
1644	1
1647	1
16th	1
17	1
171	1
1730	1
1742	2
1750	2
17th	1
1800	1
1857	1
1860	1
1865	1
1871	1
1872	1
1874	1
1875	2
1876	1
1878	3
1882	6
1883	1
1884	2
18th	1
19	1
2	8
20	1
200	1
21	2
22	1
221b	4
23	1
24	2
249	1
25	1
26	2
27	2
2704	3
28	1
28th	1
29	4
293	1
3	14
30	2
31	1
34	1
340	1
341	9
34th	2
36	1
37	2
3d	3
4	7
41	1
46	2
47	1
4th	5
5	7
534	8
6	6
66	1
6th	1
7	8
76	1
8	5
80	1
9	6
97163	1
a	4737
aback	2
abandon	7
abandoned	7
abandoning	1
abandons	1
abashed	1
abdullah	8
abelwhite	4
aberdeen	1
aberdonian	2
abetting	1
abhor	2
abiding	2
abilities	1
ability	2
able	73
aboard	4
abode	5
aborigines	1
abortive	1
about	355
above	53
abreast	1
abroad	2
abrupt	1
abruptly	2
absence	13
absent	14
absentee	1
absolute	6
absolutely	9
absorb	1
absorbed	12
absorbing	4
abstainers	1
abstract	2
abstracted	3
abstraction	1
abstruse	2
absurd	5
absurdity	1
absurdly	1
abused	2
abusing	1
abyss	1
ac	1
accent	6
accept	9
accepted	

- How many times does the word `data` appear?

In [16]:
!hadoop fs -cat /hduser/word_count/output_q3/part-00000 | grep "data"

data	8


> **Observation**
> - The word `data` appears `8` times in the text

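Note that `grep "data"` matches any line containing that substring, so it would also return a longer key that merely contains `data`. The run above printed a single line, so the answer is unambiguous here, but an exact-key lookup is safer in general. The sketch below shows one way to do it in Python; `lookup_count` is a hypothetical helper, and the sample lines are copied from the job output shown earlier in this notebook.

```python
def lookup_count(lines, target):
    """Return the count stored for exactly `target`, or 0 if it is absent."""
    for line in lines:
        key, sep, count = line.rstrip("\n").partition("\t")
        if sep and key == target:  # exact key match, not a substring match
            return int(count)
    return 0


# A substring search for "absolute" would match both of the first two lines.
sample_output = ["absolute\t6", "absolutely\t9", "data\t8"]
print(lookup_count(sample_output, "absolute"))  # prints 6
print(lookup_count(sample_output, "data"))      # prints 8
```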
- Modify mapper.py to capture the combination of 2 neighbouring words (Bigram)
- Run MapReduce again. Show your result

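A bigram mapper pairs each word with the word that follows it and emits `first-second<TAB>1`, matching the key format visible in the job output. The block below is a minimal sketch under stated assumptions, not the repository's `mapper.py`: the names `tokenize` and `map_bigrams` are hypothetical, cleaning is assumed to keep only `a-z` and `0-9`, and pairs are assumed to carry across line breaks, which the actual script may or may not do.

```python
import re

# Assumption: cleaning keeps only lowercase ASCII letters and digits.
NON_ALNUM = re.compile(r"[^a-z0-9]+")


def tokenize(line):
    """Return the cleaned, non-empty tokens of one line."""
    tokens = (NON_ALNUM.sub("", raw.lower()) for raw in line.split())
    return [token for token in tokens if token]


def map_bigrams(lines):
    """Yield 'first-second<TAB>1' for every pair of neighbouring words."""
    previous = None
    for line in lines:
        for word in tokenize(line):
            if previous is not None:
                yield f"{previous}-{word}\t1"
            previous = word  # carried over so pairs can span line breaks


# In a streaming job the input would be sys.stdin; a list keeps the demo local.
sample = ["Mr. Sherlock", "Holmes said"]
for record in map_bigrams(sample):
    print(record)
```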
In [17]:
!hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-mapper 'python mapper.py bigram t' -reducer 'python reducer.py' \
-input /hduser/word_count/input \
-output /hduser/word_count/output_q7 \
-file /content/map-reduce/mapper.py -file /content/map-reduce/reducer.py

2022-03-01 15:16:03,075 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/content/map-reduce/mapper.py, /content/map-reduce/reducer.py] [] /tmp/streamjob1978822126841633554.jar tmpDir=null
2022-03-01 15:16:03,721 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2022-03-01 15:16:03,863 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2022-03-01 15:16:03,864 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2022-03-01 15:16:03,886 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2022-03-01 15:16:04,084 INFO mapred.FileInputFormat: Total input files to process : 1
2022-03-01 15:16:04,106 INFO mapreduce.JobSubmitter: number of splits:1
2022-03-01 15:16:04,409 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1618948256_0001
2022-03-01 15:16:04,409 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-03-01 15:16:

In [18]:
!hadoop fs -head /hduser/word_count/output_q7/part-00000

1-knowledge	1
1-mr	1
1-the	3
10-30	1
10-and	1
10-extract	1
10-plays	1
109-293	1
11-he	1
11-in	1
11-is	1
11-the	1
12-death	1
12-has	1
127-36	1
129-camberwell	1
13-127	1
13-duncan	2
13-fixing	1
13-he	1
14-the	1
14th-of	1
15-a	1
15-and	1
15-why	1
1543-and	1
16-cushion	1
1642-charles	1
1644-of	1
1647-are	1
16th-a	1
17-21	1
1742-dr	1
1750-and	1
1750-or	1
17th-all	1
1800-i	1
1857-and	1
1865-a	1
1871-which	1
1875-it	1
1876-of	1
1878-i	1
1878-my	1
1878-nearly	1
1882-an	1
1882-do	1
1882-grimpen	1
1882-my	1
1882-to	1
1883-medical	1
1884-at	1
1884-it	1
18th-of	1
19-the	1
2-philosophy	1
2-sherlock	1
2-the	3
2-upon	1
200-pounds	1
21-41	1
21-cold	1
221b-baker	2
23-he	1
24-1872	1
25-and	1
26-fancy	1
27-as	1
27-had	1
2704-and	1
2704-is	1
2704-said	1
28-to	1
28th-of	1
29-but	1
29-chicago	3
293-5	1
3-37	1
3-astronomy	1
3-before	1
3-i	1
3-lauriston	3
3-lodge	1
3-mayfield	1
3-pinchin	1
3-the	2
3-turpey	1
30-and	1
30-train	1
31-4	1
34-do	1
340-miles	1
341-i	1
341-it	1
341-vermissa	4
341-were	1
341-whatever

- How many times do `sherlock` and `holmes` appear together?

In [19]:
!hadoop fs -cat /hduser/word_count/output_q7/part-00000 | grep "sherlock-holmes"

sherlock-holmes	115


In [20]:
!hadoop fs -cat /hduser/word_count/output_q7/part-00000 | grep "holmes-sherlock"

> **Observation**
> - The words `sherlock` and `holmes` appear together `115` times in the text, always as `sherlock-holmes`; the search for the reverse order `holmes-sherlock` returns no matches