<b>This notebook is just a tutorial to help you get familiar with skip-grams and MapReduce.  
<font color="red">You don't need to hand in this notebook</font>, so feel free to jump to the [Requirement section](#Assignment-Requirement) and work directly on your `mapper.py` and `reducer.py` if you already know how to do so.</b>  

# Week 05: Skip-gram and MapReduce

In previous assignments, you learned the concept of n-grams and how to generate them.  
This week, we introduce another type of gram, called the *skip-gram*.  
We are also going to compute it on a large dataset, so you'll have to process it with the MapReduce technique.  

So, first things first: what is a skip-gram?  

## Skip-gram

<i>\[S\]kip-grams are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over.  - from [Wikipedia](https://en.wikipedia.org/wiki/N-gram#Skip-gram)</i>  

That is, a skip-gram is the same as an n-gram, except that it is allowed to skip some words in between.  
In the sentence <i>"Strong winds blew roofs away"</i>, two of its bigrams are <i>"winds blew"</i> and <i>"blew roofs"</i>, while <i>"blew away"</i> is one of its skip-grams with distance 2, since it skips one word, <i>"roofs"</i>.  
As you can see, a skip-gram can capture a phrase whose words are separated by other words.  
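
To make the definition concrete, here is a minimal sketch that lists every word pair at one exact distance. The function name `skip_pairs` is made up for this illustration and is not part of the assignment; it also assumes a positive distance and whitespace tokenization.

```python
def skip_pairs(sentence, distance):
    """Return all (left, right) word pairs exactly `distance` positions apart."""
    tokens = sentence.split()
    # Pair token i with token i + distance while the right index stays in range.
    return [(tokens[i], tokens[i + distance])
            for i in range(len(tokens) - distance)]


sentence = "Strong winds blew roofs away"
print(skip_pairs(sentence, 1))  # distance 1: the ordinary bigrams
print(skip_pairs(sentence, 2))  # distance 2: one word skipped in between
```

With distance 1 the function reproduces the bigrams, which is why skip-grams are described as a generalization of n-grams.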

Now consider another sentence:

> "Skip-gram is used to predict the context word for a given target word".

With *predict* as the pivot word, all of its skip-grams within distance 5 are shown below:
```
.------------------------------------------------------------------------------.
| distance || -5 |     -4    | -3 |  -2  | -1 |  1  |    2    |  3   |  4  | 5 |
|----------||----|-----------|----|------|----|-----|---------|------|-----|---|
| predict  || -  | Skip-gram | is | used | to | the | context | word | for | a |
'------------------------------------------------------------------------------'
```

<a name="Practice"></a>
### Practice: Distance table of skip-gram

Now, let's practice!  
Given the sentence <i>"Skip-gram is used to predict the context word for a given target word"</i>, <u>**output all of its skip-grams with distances from -3 to 3**</u> and **show the result in a table**.  

**Example**
```
distance      -3            -2            -1            1             2             3             
--------------------------------------------------------------------------------------------
Skip-gram     -             -             -             is            used          to            
is            -             -             Skip-gram     used          to            predict       
used          -             Skip-gram     is            to            predict       the           
to            Skip-gram     is            used          predict       the           context       
predict       is            used          to            the           context       word          
the           used          to            predict       context       word          for           
context       to            predict       the           word          for           a             
word          predict       the           context       for           a             given         
for           the           context       word          a             given         target        
a             context       word          for           given         target        word          
given         word          for           a             target        word          -             
target        for           a             given         word          -             -             
word          a             given         target        -             -             -             
```

\*Hint: Try to get the skip-grams for a single word first if you have trouble generating them all at once. 
```
(predict, is, -3)
(predict, used, -2)
(predict, to, -1)
(predict, the, 1)
(predict, context, 2)
(predict, word, 3)
```
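
The hint can be sketched as a small helper that handles one pivot at a time. The name `skipgrams_for` and its `window` parameter are assumptions made for this illustration; treat it as one possible approach rather than the expected solution.

```python
def skipgrams_for(tokens, i, window=3):
    """Return (pivot, word, distance) triples for the token at index i."""
    triples = []
    for d in range(-window, window + 1):
        if d == 0:
            continue  # distance 0 would be the pivot itself
        j = i + d
        if 0 <= j < len(tokens):  # ignore neighbours outside the sentence
            triples.append((tokens[i], tokens[j], d))
    return triples


tokens = "Skip-gram is used to predict the context word for a given target word".split()
for triple in skipgrams_for(tokens, tokens.index("predict")):
    print(triple)
```

Calling the helper once per index then gives the rows of the full table.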

In [1]:
import os
import string

In [2]:
tokens = "Skip-gram is used to predict the context word for a given target word".split()
token_length = len(tokens)

for i in range(token_length):

    # Print the pivot word first
    print(tokens[i] + ' ', end='')

    # Distances -3..3; range() excludes its stop value, so the stop must be 4
    for j in range(-3, 4):
        if j == 0:
            continue  # distance 0 is the pivot itself, so it has no column
        if 0 <= i + j < token_length:
            print(tokens[i+j] + ' ', end='')
        else:
            print('- ', end='')
    print()

Skip-gram - - - is used to 
is - - Skip-gram used to predict 
used - Skip-gram is to predict the 
to Skip-gram is used predict the context 
predict is used to the context word 
the used to predict context word for 
context to predict the word for a 
word predict the context for a given 
for the context word a given target 
a context word for given target word 
given word for a target word - 
target for a given word - - 
word a given target - - - 


## MapReduce

<i>MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. - from [Wikipedia](https://en.wikipedia.org/wiki/MapReduce)</i> 

### Why MapReduce?

Imagine that you are working on a pretty large dataset, say all pages on Wikipedia (whose size had already reached 94GB in 2013).  
Most likely you cannot process the whole corpus in memory or on a single computer. Even a simple frequency counter becomes challenging at that scale.  
To deal with this, Google proposed a big-data processing model called MapReduce, which has since been implemented and supported by many distributed computing systems, such as Apache Hadoop.  
The core concept of MapReduce is to **split, apply and then combine**, so that each data segment can be handled separately.  

### Mapper-Shuffler-Reducer

![](https://www.todaysoftmag.com/images/articles/tsm33/large/a11.png)
<small><i> - image source: [Today Software Magazine](https://www.todaysoftmag.com/article/1358/hadoop-mapreduce-deep-diving-and-tuning)</i></small> 

As you can see in the picture:  
First, the whole dataset is split into smaller partitions, each of which can be processed by an independent machine.  
In this step, **mappers** generate one or more key-value pairs that can easily be grouped.  
 - example: in a word counter, the mapper emits each word together with its current count.  

Then, we **shuffle** and group all outputs from the mappers.  
 - example: sort the output from the mappers.  

Lastly, we combine the grouped values and **reduce** them into the final results.  
 - example: calculate the total frequency in each group.  
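
The word-counter example can be sketched in plain Python. This is only an in-memory illustration of the three phases, not a distributed implementation; the function names `mapper` and `reducer` and the two tiny partitions are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def mapper(partition):
    """Emit a (word, 1) pair for every word in one partition of the data."""
    for line in partition:
        for word in line.split():
            yield (word, 1)


def reducer(sorted_pairs):
    """Sum the counts of each run of identical keys in the sorted pairs."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


# Two partitions stand in for data held on two different machines.
partitions = [["a b a"], ["b c"]]

mapped = [pair for part in partitions for pair in mapper(part)]
shuffled = sorted(mapped)  # the shuffle step: identical keys become adjacent
print(list(reducer(shuffled)))
```

Because the reducer only ever looks at one run of identical keys, it never needs the whole dataset in memory at once.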

## MapReduce for skip-gram

Now, with the concepts of skip-gram and MapReduce in mind, it's time to put them together: let's generate a skip-gram table with the MapReduce technique!  

It may sound scary to some of you, so let's break it down first.  
There are 3 steps, each described below:
1. **Mapper**: Print every skip-gram with its distance information and its current count.  
   ```
   a b -3 1
   a c 3  1
   c e -2 1
   a c 1  1
   a b -3 1
   b d 2  1
   ```
2. **Shuffler**: Group identical skip-grams together. This is easily achieved by sorting.  
   ```
   a b -3 1
   a b -3 1
   a c 1  1
   a c 3  1
   b d 2  1
   c e -2 1
   ```
3. **Reducer**:  
   Since the records were sorted in the previous step, identical skip-grams with the same distance are now adjacent, so we can compute each frequency by summing consecutive counts.  
   For example, the frequency of skip-gram `a b` with distance $-3$ is $1+1=2$, while all the other skip-grams have frequency $1$.
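
The three steps can be replayed on the toy records in a few lines of Python. This is a sketch for intuition only: it keeps everything in memory, which the real pipeline must avoid, and the tuple layout `(pivot, word, distance, count)` is an assumption of the example.

```python
from itertools import groupby

# Mapper output from the toy example: (pivot, word, distance, count).
mapped = [
    ("a", "b", -3, 1),
    ("a", "c", 3, 1),
    ("c", "e", -2, 1),
    ("a", "c", 1, 1),
    ("a", "b", -3, 1),
    ("b", "d", 2, 1),
]

# Shuffler: sorting puts identical (pivot, word, distance) keys next to each other.
shuffled = sorted(mapped)

# Reducer: add up the counts of each run of identical keys.
reduced = [
    (key, sum(record[3] for record in group))
    for key, group in groupby(shuffled, key=lambda record: record[:3])
]

for (pivot, word, distance), total in reduced:
    print(pivot, word, distance, total)
```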

### Step 1: Mapper

First, in the mapper we want to generate all skip-grams with distances from $-5$ to $5$.  
Remember that you've already done something similar in the [previous Practice](#Practice)? Just adapt it to the MapReduce format!  

Output: 
 - `"{pivot}\t{word}\t{distance}\t{count}"`  

Example: 
```
predict is  -3  1
predict used    -2  1
predict to  -1  1
predict the 1   1
...
```
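
One way to produce records in this format is a generator that emits a count of 1 for every occurrence and leaves all summing to the reducer. The sketch below is an assumption-laden illustration: `map_line` and `PUNCT_TO_SPACE` are hypothetical names, and lower-casing plus replacing ASCII punctuation with spaces is just one possible normalization.

```python
import string

# Translation table that turns every ASCII punctuation mark into a space.
# (This simple normalization is an assumption of the sketch.)
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def map_line(line, window=5):
    """Yield one tab-separated record (pivot, word, distance, 1) per skip-gram occurrence."""
    tokens = line.lower().translate(PUNCT_TO_SPACE).split()
    for i, pivot in enumerate(tokens):
        for d in range(-window, window + 1):
            j = i + d
            if d != 0 and 0 <= j < len(tokens):
                yield f"{pivot}\t{tokens[j]}\t{d}\t1"


for record in map_line("Used to predict the context."):
    print(record)
```

Since nothing is accumulated between lines, memory use stays constant no matter how large the input is.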



In [11]:
print(string.punctuation)
print(''.join(' ' for c in string.punctuation))

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
                                


In [12]:
def dict_add(d: dict, s: str):
    """Increment the count stored under key s, starting from 1."""
    if s not in d:
        d[s] = 1
    else:
        d[s] += 1

# Map every punctuation mark (plus the en dash) to a space
punct = string.punctuation + '–'
table = str.maketrans(punct, ' ' * len(punct))

with open(os.path.join('data', 'wiki1G.txt')) as f:

    sg_count = {}

    for line in f:

        tokens = line.lower().translate(table).split()

        for i in range(len(tokens)):
            for j in range(-5, 6):
                # Skip distance 0 and neighbours that fall outside the line
                if j == 0 or not 0 <= i + j < len(tokens):
                    continue
                dict_add(sg_count, '\t'.join([tokens[i], tokens[i+j], str(j)]))

        # Tutorial only: stop after the first line (one Wikipedia page)
        break

for d in sg_count:
    print(d + '\t' + str(sg_count[d]))



anarchism	anarchism	1	1
anarchism	is	2	1
anarchism	a	3	3
anarchism	political	4	1
anarchism	philosophy	5	1
anarchism	anarchism	-1	1
anarchism	is	1	24
anarchism	a	2	8
anarchism	political	3	2
anarchism	philosophy	4	1
anarchism	and	5	3
is	anarchism	-2	1
is	anarchism	-1	24
is	a	1	12
is	political	2	1
is	philosophy	3	1
is	and	4	4
is	movement	5	1
a	anarchism	-3	3
a	anarchism	-2	8
a	is	-1	12
a	political	1	2
a	philosophy	2	1
a	and	3	5
a	movement	4	3
a	that	5	1
political	anarchism	-4	1
political	anarchism	-3	2
political	is	-2	1
political	a	-1	2
political	philosophy	1	1
political	and	2	4
political	movement	3	1
political	that	4	1
political	is	5	1
philosophy	anarchism	-5	1
philosophy	anarchism	-4	1
philosophy	is	-3	1
philosophy	a	-2	1
philosophy	political	-1	1
philosophy	and	1	1
philosophy	movement	2	1
philosophy	that	3	1
philosophy	is	4	1
philosophy	sceptical	5	1
and	anarchism	-5	3
and	is	-4	4
and	a	-3	5
and	political	-2	4
and	philosophy	-1	1
and	movement	1	1
and	that	2	3
and	is	3	8
and	sceptical	4

its	socialism	3	1
its	matthew	4	1
its	s	5	1
connections	negative	-5	1
connections	connotations	-4	1
connections	and	-3	1
connections	emphasise	-2	1
connections	its	-1	1
connections	with	1	1
connections	socialism	2	1
connections	matthew	3	1
connections	s	4	1
connections	adams	5	1
with	connotations	-5	1
with	and	-4	4
with	emphasise	-3	1
with	its	-2	1
with	connections	-1	1
with	socialism	1	1
with	matthew	2	1
with	s	3	1
with	adams	4	1
with	and	5	2
socialism	and	-5	1
socialism	emphasise	-4	1
socialism	its	-3	1
socialism	connections	-2	1
socialism	with	-1	1
socialism	matthew	1	1
socialism	s	2	1
socialism	adams	3	1
socialism	and	4	2
socialism	carl	5	1
matthew	emphasise	-5	1
matthew	its	-4	1
matthew	connections	-3	1
matthew	with	-2	1
matthew	socialism	-1	1
matthew	s	1	1
matthew	adams	2	1
matthew	and	3	1
matthew	carl	4	1
matthew	levy	5	1
s	its	-5	1
s	connections	-4	1
s	with	-3	1
s	socialism	-2	1
s	matthew	-1	1
s	adams	1	1
s	and	2	1
s	carl	3	1
s	levy	4	1
s	write	5	1
adams	connections	-5	1
adams	

institutions	cities	-2	1
institutions	that	-1	1
institutions	of	1	1
institutions	authority	2	1
institutions	were	3	1
institutions	established	4	1
institutions	and	5	1
of	towns	-5	1
of	cities	-3	1
of	that	-2	1
of	institutions	-1	1
of	were	2	2
of	anarchistic	5	1
authority	and	-5	1
authority	cities	-4	1
authority	that	-3	1
authority	institutions	-2	1
authority	were	1	1
authority	established	2	1
authority	and	3	1
authority	anarchistic	4	1
authority	ideas	5	1
were	cities	-5	1
were	that	-4	1
were	institutions	-3	1
were	of	-2	2
were	authority	-1	1
were	established	1	1
were	and	2	2
were	anarchistic	3	1
were	ideas	4	1
were	espoused	5	1
established	that	-5	1
established	institutions	-4	1
established	authority	-2	1
established	were	-1	1
established	and	1	1
established	anarchistic	2	1
established	ideas	3	1
established	espoused	4	1
established	as	5	1
and	institutions	-5	1
and	authority	-3	1
and	were	-2	2
and	established	-1	1
and	anarchistic	1	1
and	ideas	2	1
and	espoused	3	1
anarchistic	of	-5	1
ana

according	of	-2	1
according	goods	-1	1
according	one	2	1
according	s	3	1
according	needs	4	1
according	at	5	1
to	distribution	-4	1
to	goods	-2	1
to	one	1	1
to	s	2	1
to	needs	3	1
to	at	4	1
one	distribution	-5	1
one	of	-4	2
one	goods	-3	1
one	according	-2	1
one	to	-1	1
one	s	1	2
one	needs	2	1
one	at	3	1
one	the	4	1
one	turn	5	1
s	goods	-4	1
s	according	-3	1
s	to	-2	1
s	one	-1	2
s	needs	1	1
s	at	2	1
s	turn	4	1
s	of	5	1
needs	goods	-5	1
needs	according	-4	1
needs	to	-3	1
needs	one	-2	1
needs	s	-1	1
needs	at	1	1
needs	the	2	1
needs	turn	3	1
needs	of	4	1
needs	the	5	1
at	according	-5	1
at	to	-4	1
at	one	-3	1
at	s	-2	1
at	needs	-1	1
at	turn	2	1
at	of	3	5
at	the	4	3
at	century	5	1
the	one	-4	1
the	needs	-2	1
the	turn	1	2
the	century	4	2
the	anarchism	5	6
turn	one	-5	1
turn	s	-4	1
turn	needs	-3	1
turn	at	-2	1
turn	the	-1	2
turn	of	1	2
turn	century	3	1
turn	anarchism	4	1
turn	had	5	1
of	s	-5	1
of	needs	-4	1
of	at	-3	5
of	turn	-1	2
of	century	2	1
of	anarchism	3	4
of	spread	5	1
the	needs	-5	1
the	

punk	with	-1	1
punk	subculture	1	1
punk	as	2	1
punk	exemplified	3	1
punk	by	4	1
punk	bands	5	1
subculture	anarchism	-5	1
subculture	became	-4	1
subculture	associated	-3	1
subculture	with	-2	1
subculture	punk	-1	1
subculture	as	1	1
subculture	exemplified	2	1
subculture	by	3	1
subculture	bands	4	1
subculture	such	5	1
as	became	-5	1
as	associated	-4	1
as	with	-3	2
as	punk	-2	1
as	subculture	-1	1
as	exemplified	1	1
as	by	2	3
as	bands	3	1
as	as	5	3
exemplified	associated	-5	1
exemplified	with	-4	1
exemplified	punk	-3	1
exemplified	subculture	-2	1
exemplified	as	-1	1
exemplified	by	1	1
exemplified	bands	2	1
exemplified	such	3	1
exemplified	as	4	1
exemplified	crass	5	1
by	with	-5	1
by	punk	-4	1
by	subculture	-3	1
by	as	-2	3
by	exemplified	-1	1
by	bands	1	1
by	such	2	1
by	as	3	2
by	crass	4	1
bands	punk	-5	1
bands	subculture	-4	1
bands	as	-3	1
bands	exemplified	-2	1
bands	by	-1	1
bands	such	1	1
bands	as	2	1
bands	crass	3	1
bands	and	4	1
bands	the	5	1
such	subculture	-5	1
such	exemplified	-3	1
s

anarchist	were	2	2
anarchist	mutualism	3	1
anarchist	and	4	4
anarchist	individualism	5	1
currents	inceptive	-5	1
currents	currents	-4	1
currents	among	-3	1
currents	classical	-2	1
currents	were	1	1
currents	mutualism	2	1
currents	and	3	1
currents	individualism	4	1
currents	they	5	1
were	currents	-5	1
were	among	-4	2
were	classical	-3	1
were	anarchist	-2	2
were	currents	-1	1
were	mutualism	1	1
were	individualism	3	1
were	they	4	2
were	were	5	2
mutualism	among	-5	1
mutualism	classical	-4	1
mutualism	anarchist	-3	1
mutualism	currents	-2	1
mutualism	were	-1	1
mutualism	individualism	2	1
mutualism	they	3	1
mutualism	were	4	1
mutualism	followed	5	1
and	classical	-5	1
and	anarchist	-4	4
and	currents	-3	1
and	individualism	1	1
and	were	3	2
and	followed	4	1
and	by	5	2
individualism	anarchist	-5	1
individualism	currents	-4	1
individualism	were	-3	1
individualism	mutualism	-2	1
individualism	and	-1	1
individualism	they	1	1
individualism	were	2	1
individualism	followed	3	1
individualism	by	4	1
ind

with	anarchist	2	1
with	movement	3	1
with	although	4	1
with	contemporary	5	1
the	history	-4	1
the	engage	-2	2
the	although	3	1
the	contemporary	4	1
anarchist	history	-5	1
anarchist	engage	-3	1
anarchist	with	-2	1
anarchist	although	2	2
anarchist	anarchism	4	1
anarchist	favours	5	1
movement	to	-5	2
movement	engage	-4	1
movement	with	-3	1
movement	although	1	1
movement	contemporary	2	1
movement	anarchism	3	2
movement	favours	4	1
movement	actions	5	1
although	engage	-5	1
although	with	-4	1
although	the	-3	1
although	anarchist	-2	2
although	movement	-1	1
although	contemporary	1	1
although	anarchism	2	1
although	favours	3	1
although	actions	4	1
although	over	5	1
contemporary	with	-5	1
contemporary	the	-4	1
contemporary	movement	-2	1
contemporary	although	-1	1
contemporary	anarchism	1	2
contemporary	favours	2	1
contemporary	actions	3	1
contemporary	over	4	1
contemporary	academic	5	1
anarchism	anarchist	-4	1
anarchism	movement	-3	2
anarchism	although	-2	1
anarchism	contemporary	-1	2
anarchism

the	overthrow	5	1
movement	as	-5	1
movement	part	-3	1
movement	of	-2	1
movement	which	1	1
movement	sought	2	1
movement	to	3	1
movement	overthrow	4	1
which	a	-5	2
which	part	-4	1
which	movement	-1	1
which	sought	1	1
which	to	2	1
which	overthrow	3	1
which	the	4	1
which	state	5	1
sought	part	-5	1
sought	of	-4	1
sought	the	-3	1
sought	movement	-2	1
sought	which	-1	1
sought	overthrow	2	1
sought	the	3	1
sought	state	4	1
sought	and	5	1
to	movement	-3	1
to	which	-2	1
to	overthrow	1	1
to	capitalism	5	1
overthrow	the	-5	1
overthrow	movement	-4	1
overthrow	which	-3	1
overthrow	sought	-2	1
overthrow	to	-1	1
overthrow	the	1	1
overthrow	state	2	1
overthrow	and	3	1
overthrow	capitalism	4	1
overthrow	anarchists	5	1
the	which	-4	1
the	sought	-3	1
the	overthrow	-1	1
the	capitalism	3	2
state	which	-5	1
state	sought	-4	1
state	overthrow	-2	1
state	capitalism	2	1
state	anarchists	3	1
state	also	4	1
state	reinforced	5	1
and	sought	-5	1
and	overthrow	-3	1
and	capitalism	1	1
and	also	3	1
and	reinforced	4	1
ca

for	way	3	1
for	these	4	1
for	hacktivists	5	1
free	software	-5	1
free	that	-4	1
free	are	-3	1
free	available	-2	1
free	the	1	1
free	way	2	1
free	these	3	1
free	hacktivists	4	1
free	work	5	1
the	are	-4	2
the	available	-3	1
the	free	-1	1
the	these	2	1
the	hacktivists	3	1
the	work	4	1
way	are	-5	1
way	available	-4	1
way	for	-3	1
way	free	-2	1
way	these	1	1
way	hacktivists	2	1
way	work	3	1
way	to	4	1
way	develop	5	1
these	available	-5	1
these	for	-4	1
these	free	-3	1
these	the	-2	1
these	way	-1	1
these	hacktivists	1	1
these	work	2	1
these	develop	4	1
these	and	5	1
hacktivists	for	-5	1
hacktivists	free	-4	1
hacktivists	the	-3	1
hacktivists	way	-2	1
hacktivists	these	-1	1
hacktivists	work	1	1
hacktivists	to	2	1
hacktivists	develop	3	1
hacktivists	and	4	1
hacktivists	distribute	5	1
work	free	-5	1
work	the	-4	1
work	way	-3	1
work	these	-2	1
work	hacktivists	-1	1
work	to	1	1
work	develop	2	1
work	and	3	1
work	distribute	4	1
work	resembles	5	1
to	way	-4	1
to	hacktivists	-2	1
to	work	-1	1
to	deve

that	towards	-3	1
that	an	-2	1
that	individualism	-1	1
that	dropping	2	1
that	cause	4	1
was	leaned	-5	1
was	towards	-4	1
was	an	-3	1
was	individualism	-2	1
was	dropping	1	1
was	cause	3	1
was	social	5	1
dropping	towards	-5	1
dropping	an	-4	1
dropping	individualism	-3	1
dropping	that	-2	1
dropping	was	-1	1
dropping	the	1	1
dropping	cause	2	1
dropping	of	3	1
dropping	social	4	1
dropping	liberation	5	1
the	individualism	-4	1
the	dropping	-1	1
the	cause	1	1
the	liberation	4	1
cause	individualism	-5	1
cause	that	-4	1
cause	was	-3	1
cause	dropping	-2	1
cause	the	-1	1
cause	of	1	1
cause	social	2	1
cause	liberation	3	1
cause	the	4	1
cause	interest	5	1
of	dropping	-3	1
of	cause	-1	1
of	liberation	2	1
of	interest	4	1
social	was	-5	1
social	dropping	-4	1
social	cause	-2	1
social	liberation	1	1
social	the	2	1
social	interest	3	1
social	of	4	1
social	anarchists	5	1
liberation	dropping	-5	1
liberation	the	-4	1
liberation	cause	-3	1
liberation	of	-2	1
liberation	social	-1	1
liberation	the	1	1
liberati

major	aspects	-4	1
major	of	-3	1
major	their	-2	1
major	life	-1	1
major	decisions	1	1
major	are	2	1
major	taken	3	1
major	by	4	1
major	a	5	1
decisions	aspects	-5	1
decisions	of	-4	1
decisions	their	-3	1
decisions	life	-2	1
decisions	major	-1	1
decisions	are	1	1
decisions	taken	2	1
decisions	by	3	1
decisions	a	4	1
decisions	small	5	1
are	of	-5	2
are	their	-4	1
are	life	-3	1
are	major	-2	1
are	decisions	-1	1
are	taken	1	1
are	by	2	1
are	small	4	1
are	elite	5	1
taken	their	-5	1
taken	life	-4	1
taken	major	-3	1
taken	decisions	-2	1
taken	are	-1	1
taken	by	1	1
taken	a	2	1
taken	small	3	1
taken	elite	4	1
taken	authority	5	1
by	life	-5	1
by	major	-4	1
by	decisions	-3	1
by	are	-2	1
by	taken	-1	1
by	small	2	1
by	elite	3	1
by	authority	4	1
by	ultimately	5	1
a	major	-5	1
a	decisions	-4	1
a	taken	-2	1
a	elite	2	1
a	authority	3	1
a	ultimately	4	1
a	rests	5	1
small	decisions	-5	1
small	are	-4	1
small	taken	-3	1
small	by	-2	1
small	elite	1	1
small	authority	2	1
small	ultimately	3	1
small	rests	4	1
sm

anarchism	arguments	-2	1
anarchism	against	-1	1
anarchism	fiala	1	1
anarchism	s	2	1
anarchism	other	3	1
anarchism	critiques	4	1
fiala	list	-5	1
fiala	of	-4	1
fiala	arguments	-3	1
fiala	against	-2	1
fiala	anarchism	-1	1
fiala	s	1	1
fiala	other	2	1
fiala	critiques	3	1
fiala	were	4	1
fiala	that	5	1
s	arguments	-4	1
s	against	-3	1
s	anarchism	-2	1
s	fiala	-1	1
s	other	1	1
s	critiques	2	1
s	were	3	1
s	that	4	1
s	anarchism	5	1
other	arguments	-5	1
other	anarchism	-3	1
other	fiala	-2	1
other	s	-1	1
other	were	2	1
other	that	3	1
other	anarchism	4	1
other	is	5	1
critiques	against	-5	1
critiques	anarchism	-4	1
critiques	fiala	-3	1
critiques	s	-2	1
critiques	were	1	1
critiques	that	2	1
critiques	anarchism	3	1
critiques	is	4	1
critiques	innately	5	1
were	fiala	-4	1
were	s	-3	1
were	other	-2	1
were	critiques	-1	1
were	that	1	1
were	anarchism	2	1
were	is	3	1
were	innately	4	1
were	related	5	1
that	fiala	-5	1
that	s	-4	1
that	other	-3	1
that	critiques	-2	1
that	were	-1	1
that	innately	3	1
that	relate

In [13]:
import csv

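with open(os.path.join('data', 'mapper.tsv'), 'w', newline='') as tsvfile:
    # Tab-delimited writer: one column each for pivot, word, distance, and count
    writer = csv.writer(tsvfile, delimiter='\t', lineterminator='\n')
    for d in sg_count:
        writer.writerow(d.split('\t') + [sg_count[d]])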

### Step 2: Shuffler

All we need to do in the shuffler is sort, so let's use a built-in command to do this for us!  

Try this in your terminal/command prompt ;)  
(You can get the sample input from [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing))

**Unix**  
```bash
sort -k1,3 < mapper.sample.tsv
```
**Windows**
```powershell
type mapper.sample.tsv | sort
```
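
To see what the sort is meant to achieve, here is a rough Python equivalent that orders records by pivot, word, and then numeric distance. The function name `shuffle` is made up for this sketch, and it holds every record in memory, so it only suits small samples; for the full dataset the command-line `sort` remains the practical choice.

```python
def shuffle(records):
    """Sort tab-separated mapper records by pivot, word, then numeric distance."""
    def sort_key(record):
        pivot, word, distance, _count = record.rstrip("\n").split("\t")
        return (pivot, word, int(distance))
    return sorted(records, key=sort_key)


records = ["a\tc\t3\t1", "c\te\t-2\t1", "a\tc\t1\t1", "a\tb\t-3\t1"]
for record in shuffle(records):
    print(record)
```

What matters for the reducer is only that records with the same pivot and word end up next to each other.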

### Step 3: Reducer

Since the input has already been sorted by the shuffler, the reducer's task is pretty simple: just count how many times each skip-gram appears, and then print the count out!

Input: 
 - `"{pivot}\t{word}\t{distance}\t{count}"`
 - You can get a sample input file `shuffler.sample.tsv` from [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing)

Output: 
 - `"{pivot}\t{word}\t{total_freq}\t{-5}\t{-4}\t{-3}\t{-2}\t{-1}\t{1}\t{2}\t{3}\t{4}\t{5}"`
 - The first two columns are the skip-gram; the third column is the total frequency; columns 4\~13 are the frequencies at distances -5\~5 (excluding 0).

Example:
 - `arouse  open    4       0       0       3       0       0       0       0       0       0       1`
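
The column layout can be checked with a small formatting helper. `DISTANCES` and `format_row` are names invented for this sketch; the input is assumed to be a mapping from distance to frequency for a single skip-gram.

```python
DISTANCES = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]  # column order, 0 excluded


def format_row(pivot, word, freq_by_distance):
    """Build one reducer output line from a {distance: frequency} mapping."""
    counts = [freq_by_distance.get(d, 0) for d in DISTANCES]
    fields = [pivot, word, str(sum(counts))] + [str(c) for c in counts]
    return "\t".join(fields)


# "arouse open" seen 3 times at distance -3 and once at distance 5
print(format_row("arouse", "open", {-3: 3, 5: 1}))
```

The printed line has 13 tab-separated columns and reproduces the `arouse open` example.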

Hints: 
1. Parse the input from shuffler
2. Check if this is the same skipgram as the previous one
3. If so, add the frequency according to its distance
4. If not, output the previous skipgram data

Note that you should NOT store all your counting results in a dict or any other data structure.  
Recall that one purpose of MapReduce is to avoid memory exhaustion; that benefit is lost if you end up holding the whole result in memory again.  
Instead, <u>print each result directly or write it to a file</u> as soon as it is complete.  
(Don't get me wrong: of course you can store some temporary data, but let's not store the whole result and then print it out at once, okay?)
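
The hints can also be sketched with `itertools.groupby`, which walks through already-sorted input one group at a time and so keeps only one skip-gram's counts in memory. This is an alternative illustration, not the required structure: `reduce_sorted` and `DISTANCES` are hypothetical names, and the input is assumed to be well-formed four-column lines sorted by pivot and word.

```python
from itertools import groupby

DISTANCES = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]  # column order, 0 excluded


def reduce_sorted(lines):
    """Yield one output row per (pivot, word) from sorted mapper records."""
    parsed = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for (pivot, word), group in groupby(parsed, key=lambda f: (f[0], f[1])):
        freq = dict.fromkeys(DISTANCES, 0)  # temporary state for one skip-gram only
        for _, _, distance, count in group:
            freq[int(distance)] += int(count)
        counts = [freq[d] for d in DISTANCES]
        yield "\t".join([pivot, word, str(sum(counts))] + [str(c) for c in counts])


sample = ["a\tb\t-3\t1\n", "a\tb\t-3\t1\n", "a\tc\t1\t1\n", "a\tc\t3\t1\n"]
for row in reduce_sorted(sample):
    print(row)
```

Because each row is yielded as soon as its group ends, the last skip-gram is emitted without any special handling after the loop.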


In [29]:
with open(os.path.join('data', 'shuffler.tsv')) as f:
    with open(os.path.join('data', 'reducer.tsv'), 'w', newline='') as tsvfile:

        writer = csv.writer(tsvfile, delimiter='\t', lineterminator='\n')
        pre_sg = None          # skip-gram (pivot and word) currently being counted
        sg_count = [0] * 10    # frequencies for distances -5..-1 and 1..5
        sg_total_count = 0

        for line in f:

            tokens = line.split('\t')

            sg = tokens[0] + '\t' + tokens[1]
            distance = int(tokens[2])
            count = int(tokens[3])

            # Map distances -5..-1 to indices 0..4 and 1..5 to indices 5..9
            index = distance + 5 if distance < 0 else distance + 4

            if pre_sg is not None and sg != pre_sg:
                # A new skip-gram starts: output the finished one, then reset
                writer.writerow(pre_sg.split('\t') + [sg_total_count] + sg_count)
                sg_count = [0] * 10
                sg_total_count = 0

            pre_sg = sg
            sg_count[index] += count
            sg_total_count += count

        # Output the last skip-gram, which no later line would trigger
        if pre_sg is not None:
            writer.writerow(pre_sg.split('\t') + [sg_total_count] + sg_count)




### Step 4: Combine them together!  

Now you can move your code above into `mapper.py` and `reducer.py` (with some tiny modifications, of course), and this is your assignment this week!   
See below for the detailed requirements.  

**Hints: What should I modify in my mapper and reducer?**  

1. Receive/pass data via standard I/O rather than files (we've already done this for you)
2. Process the whole dataset, rather than only the first line

That's it!  

The processing takes some time (~1hr w/o parallel computing), so go enjoy some coffee or movies (or sleep) while you wait ;)

<a name="Assignment-Requirement"></a>
## Assignment Requirement 

1. You need to implement the `mapper.py` and `reducer.py` to calculate the skip-gram table.

2. In `mapper.py`, you need to generate skipgrams with distance within -5 to 5 (inclusive).  
   - Input: Pure text file (`wiki1G.txt`) with each line as a wikipage.
   - Output: `"{pivot}\t{word}\t{distance}\t{count}"`
   - Example: 
     ```
     predict is  -3  1
     predict used    -2  1
     predict to  -1  1
     predict the 1   1
     ...
     ```
   - Sample output: `mapper.sample.tsv` (Find it [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing); no need to be exactly the same)

3. In `reducer.py`, you have to collect the output from the shuffler (`sort`) and generate the skip-gram table.
   - Input: `"{pivot}\t{word}\t{distance}\t{count}"`
   - Output: 
     - `"{pivot}\t{word}\t{total}\t{-5}\t{-4}\t{-3}\t{-2}\t{-1}\t{1}\t{2}\t{3}\t{4}\t{5}"`
     - The first two columns are the skip-gram; the third column is the total frequency; columns 4\~13 are the frequencies at distances -5\~5 (excluding 0).
   - Example:
     ```
     arouse  of      1       0       0       0       0       0       0       0       1       0       0
     arouse  open    4       0       0       3       0       0       0       0       0       0       1
     arouse  so      2       0       1       0       0       0       0       0       0       1       0
     arouse  sufficiently    1       0       1       0       0       0       0       0       0       0       0
     ...
     ```
   - Sample output: `reducer.sample.tsv` (Find it [here](https://drive.google.com/drive/folders/1vKxr--sLd2J4kdsXUzJDBZdG3AmV4NGl?usp=sharing); no need to be exactly the same)

4. Chain your MapReduce steps together and generate the skip-gram table on the wiki1G dataset
   - Unix: 
     - Use the [local map-reduce tool](https://github.com/dspp779/local-mapreduce) (faster),
     - or run it directly: `python mapper.py < wiki1G.txt | sort -k1,2 -k3n | python reducer.py > skipgram.tsv` (slower)
   - Windows: 
     - CMD: `python mapper.py < wiki1G.txt | sort | python reducer.py > skipgram.tsv`
     - PS: `type wiki1G.txt | python mapper.py | sort | python reducer.py > skipgram.tsv`
     - or the bash environment you installed last week.  
   - See [Appendix](#built-in-command) if you want to know what these commands mean

During the demo, you need to 

1. show us your skip-gram result on the given dataset, and
2. explain your implementation in `mapper.py` and `reducer.py`.  

Note that the final result will be a large file (~6 GB), so **you may want to show it with the `more` or `less` command**.  

## TA's note

Congratulations! You've learned how to calculate skip-gram frequencies and how to deal with a huge dataset using the MapReduce technique.  

Remember to <b><a href="https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit?usp=sharing">make an appointment with the TA</a> to demo/explain your implementation <u>before <font color="red">10/21 15:30</font></u></b> .  
You should also submit your `mapper.py` and `reducer.py` to <a href="https://eeclass.nthu.edu.tw/course/homework/3285">eeclass</a> .

<a name="built-in-command"></a>
## Appendix: useful built-in commands

Several built-in commands are very useful in the MapReduce procedure.  
Here we introduce `cat` and `type`, `<` and `>`, `sort`, and pipe `|`.  

### cat (on Unix)
The `cat` command, which definitely does not refer to some cute creature (*meow~*), is short for `concatenate`. ([doc](https://man7.org/linux/man-pages/man1/cat.1.html))   

When you `cat` a file, it means you want to print the content from a file (or some files) to standard output.  
Now open your bash and test the command below!  
```bash
cat file.txt
```

You should see something like this: 

![picture](https://i.imgur.com/Z9shOYQ.png)

### type (on Windows)
`type` command works exactly the same as `cat` on Unix, but without its cute nickname (Shame on you, Windows). ([doc](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/type))  

Similarly, if you `type` a file, it means you print the content from a file (or some files) to standard output.  

```powershell
type file.txt
```

You should see something like this:  
![](https://i.imgur.com/5WFhxkq.png)

### `>`? `<`? `>///<`? 

`<` and `>` are the I/O redirections.  
`program < filename` means that you want to redirect the input from a file to a program, while `program > filename` means that you want to redirect the output of a program to that file.  

For example,
```bash
echo "hello world" > greet.txt
```
writes the string "hello world" into a file `greet.txt`.  

On the other hand, 
```bash
head < greet.txt
```
makes `head` receive the content from `greet.txt`, so it will print out the string in `greet.txt`.  
![](https://i.imgur.com/swxv8LG.png)
<small>p.s. `>///<` is just a joke. Don't take it seriously.</small>
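
From the program's side, redirection is invisible: the script simply reads standard input and writes standard output. The sketch below illustrates that idea with a made-up function `upper_filter`; the in-memory `io.StringIO` stream stands in for a redirected file so the example runs without any input.

```python
import io
import sys


def upper_filter(source=sys.stdin, sink=sys.stdout):
    """Copy text from source to sink line by line, upper-casing it on the way."""
    for line in source:
        sink.write(line.upper())


# In a real script, calling upper_filter() with the defaults lets
# `python script.py < greet.txt > out.txt` read greet.txt and write out.txt
# without the script ever opening a file itself.
demo_in = io.StringIO("hello world\n")
upper_filter(demo_in, sys.stdout)  # prints HELLO WORLD
```

A mapper or reducer built this way works unchanged whether its input comes from a file, the keyboard, or another program.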

### sort

As its name suggests, `sort` sorts the data that it receives. (doc on [Linux](https://man7.org/linux/man-pages/man1/sort.1.html) and on [Windows](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/sort))  
Try this:
```
sort sample.txt
```
You can see that the content is sorted before being printed to your screen.  
![](https://i.imgur.com/QFEq3Tc.png)

### Pipe `|`

A pipe passes the output of the previous program to the next program as its input.  
For example, 
```bash
python program.py | sort
```
will pass the output of `program.py` to `sort` command.  