# Spark assignment 1: Pairs
Find all pairs of consecutive words in which the first word is “narodnaya”. For each pair, count the number of occurrences in the Wikipedia dump. Print all pairs with their counts in lexicographical order. The output format is “word_pair <tab> count”, for example:

red_apple	100500

crazy_zoo	42

Note that the two words in a pair are joined with an underscore character, and the result is in lowercase.

One motivation for counting these continuations is to get a better understanding of the language. Some words, like “the”, have many continuations, while others, like “San”, have only a few (“San Francisco”, for example). These statistics can be used to build a language model. If you are interested in learning more, search the Internet for “n-gram language model”.
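
To make the idea of continuation counts concrete, here is a minimal plain-Python sketch that runs without Spark. The helper name `continuation_counts` and the toy sentence are made up for illustration and are not part of the assignment.

```python
from collections import Counter, defaultdict

def continuation_counts(words):
    """Map each word to a Counter of the words that directly follow it."""
    counts = defaultdict(Counter)
    # zip(words, words[1:]) walks over every pair of consecutive words.
    for first, second in zip(words, words[1:]):
        counts[first][second] += 1
    return counts

# Toy sentence, invented for this example only.
toy = "the cat saw the dog in san francisco and the cat left san francisco".split()
counts = continuation_counts(toy)

print(len(counts["the"]))  # number of distinct continuations of "the"
print(len(counts["san"]))  # number of distinct continuations of "san"
```

In this toy text “the” is followed by two different words, while “san” is only ever followed by “francisco”.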

In [34]:
from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("MyApp").setMaster("local"))
import re

In [35]:

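def parse_article(line):
    # Each input line is "<article_id>\t<text>"; return the lowercased words of the text.
    try:
        article_id, text = unicode(line.rstrip()).lower().split('\t', 1)
        # Strip leading and trailing non-word characters.
        text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
        # Split on whitespace, dropping any punctuation adjacent to it.
        words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
        return words
    except ValueError:
        # The line has no tab separator, so there is no article text to parse.
        return []

wiki = sc.textFile("/data/wiki/en_articles_part/articles-part", 16).map(parse_article)  # map is one-to-one: one word list per article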
result = wiki.take(1)[0]

In [36]:
for word in result[:50]:
    print word

anarchism
anarchism
is
often
defined
as
a
political
philosophy
which
holds
the
state
to
be
undesirable
unnecessary
or
harmful
the
following
sources
cite
anarchism
as
a
political
philosophy
slevin
carl
anarchism
the
concise
oxford
dictionary
of
politics
ed
iain
mclean
and
alistair
mcmillan
oxford
university
press
2003
however
others
argue
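
The tokenization performed by `parse_article` can be tried without a cluster. The sketch below is a Python 3 adaptation (it uses `str` instead of `unicode`, which is an assumption about the reader's environment); the function name `tokenize` and the sample line are hypothetical.

```python
import re

def tokenize(line):
    """Split one "<article_id>\\t<text>" line into lowercased words."""
    try:
        article_id, text = line.rstrip().lower().split("\t", 1)
    except ValueError:
        # No tab separator: treat the line as having no text.
        return []
    # Strip leading and trailing non-word characters.
    text = re.sub(r"^\W+|\W+$", "", text)
    # Split on whitespace, dropping punctuation adjacent to it.
    return re.split(r"\W*\s+\W*", text)

# Sample line invented for illustration; the id 12 is arbitrary.
sample = "12\tAnarchism is often defined as a political philosophy, which holds the state to be undesirable."
print(tokenize(sample))
```

Punctuation next to whitespace (the comma after “philosophy”) and at the end of the text (the final period) is dropped, which matches the word list printed above.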


In [37]:
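def pair_words(words, key_word="narodnaya"):
    # Emit ("<key_word>_<next word>", 1) for every key_word that is followed by another word.
    word_list = []
    for inx, word in enumerate(words[:-1]):
        if word == key_word:
            word_list.append((word + "_" + words[inx + 1], 1))
    return word_list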
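
An equivalent way to write the pairing step is to walk over consecutive pairs with `zip`. This is only an alternative sketch, not the version used in the rest of the notebook; the name `pair_words_zip` and the sample list are hypothetical.

```python
def pair_words_zip(words, key_word="narodnaya"):
    """Return ("<key_word>_<next word>", 1) tuples for each key_word followed by a word."""
    return [
        (first + "_" + second, 1)
        for first, second in zip(words, words[1:])
        if first == key_word
    ]

# Sample word list invented for illustration.
sample_words = ["the", "narodnaya", "volya", "and", "narodnaya", "gazeta", "narodnaya"]
sample_pairs = pair_words_zip(sample_words)
print(sample_pairs)
```

The trailing “narodnaya” produces no pair because no word follows it, which is the same behavior as iterating over `words[:-1]`.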

In [38]:
# flatMap is a one-to-many transformation: each input element yields zero or more output elements.
# map is strictly one-to-one, so flatMap is needed here to flatten the per-article lists of pairs.
paired_words = wiki.flatMap(lambda x: pair_words(x))

paired_words.collect()

[(u'narodnaya_volya', 1),
 (u'narodnaya_volya', 1),
 (u'narodnaya_volya', 1),
 (u'narodnaya_volya', 1),
 (u'narodnaya_volya', 1),
 (u'narodnaya_volya', 1),
 (u'narodnaya_volya', 1),
 (u'narodnaya_volya', 1),
 (u'narodnaya_volya', 1),
 (u'narodnaya_gazeta', 1)]
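
The difference between `map` and `flatMap` can be illustrated locally with plain Python lists. The helpers `local_map`, `local_flat_map`, and `followers` below are hypothetical stand-ins that only mimic the shape of the results; they are not Spark APIs.

```python
from itertools import chain

def local_map(func, items):
    # One output element per input element, like RDD.map.
    return [func(item) for item in items]

def local_flat_map(func, items):
    # Each input yields zero or more outputs, concatenated into one flat list, like RDD.flatMap.
    return list(chain.from_iterable(func(item) for item in items))

def followers(words, key_word="narodnaya"):
    # Words that directly follow key_word in one article.
    return [second for first, second in zip(words, words[1:]) if first == key_word]

# Three toy "articles", invented for illustration.
articles = [
    ["narodnaya", "volya"],
    ["no", "match", "here"],
    ["narodnaya", "gazeta", "and", "narodnaya", "volya"],
]

mapped = local_map(followers, articles)
flattened = local_flat_map(followers, articles)
print(mapped)     # one list per article, possibly empty
print(flattened)  # a single flat list of matches
```

With `map`, the article without a match still contributes an empty list; with `flatMap`, it contributes nothing and the matches from all articles end up in one collection.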

In [39]:
word_reduce = paired_words.reduceByKey(lambda x, y : x+y).sortByKey()

In [40]:
word_reduce.collect()

[(u'narodnaya_gazeta', 1), (u'narodnaya_volya', 9)]
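
What `reduceByKey` followed by `sortByKey` computes can be reproduced on a small list in plain Python. The helper `local_reduce_by_key` is a hypothetical single-machine stand-in, shown only to clarify the semantics; the input mirrors the ten pairs collected earlier.

```python
def local_reduce_by_key(pairs, func):
    """Combine the values of equal keys with func, like RDD.reduceByKey on one machine."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            totals[key] = func(totals[key], value)
        else:
            totals[key] = value
    return totals

# The same ten (pair, 1) tuples that paired_words.collect() returned above.
pairs = [("narodnaya_volya", 1)] * 9 + [("narodnaya_gazeta", 1)]

reduced = local_reduce_by_key(pairs, lambda x, y: x + y)
# Sorting the (key, value) tuples orders them by key, like sortByKey.
sorted_counts = sorted(reduced.items())
print(sorted_counts)
```

In Spark the reduction function is applied within and across partitions, so it should be associative and commutative; addition satisfies both.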

In [41]:
result = word_reduce.collect()
for key, value in result:
    print("{0}\t{1}".format(key,value))

narodnaya_gazeta	1
narodnaya_volya	9
