# Micro project

## Spark: Counting the Number of Pairs 

We will find all pairs of two consecutive words in which the first word is “narodnaya”, and then count how many times each pair occurs in the Wikipedia dump.

One motivation for counting these continuations is to better understand the language. Some words, like “the”, have many continuations, while others, like “San”, have only a few (“San Francisco”, for example). These statistics can be used to build a language model. If you are interested in learning more, search the Internet for “n-gram language model”.
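
To make the link to language models concrete, here is a minimal sketch of how continuation counts could be turned into estimates of P(next word | first word). The function name `continuation_probabilities` and the toy sentence are assumptions made for illustration; they are not part of the project.

```python
from collections import Counter, defaultdict

def continuation_probabilities(words):
    """Estimate P(second | first) from adjacent-word counts in a token list."""
    # Count every pair of adjacent words
    pair_counts = Counter(zip(words, words[1:]))
    # Count how often each word is followed by another word
    first_totals = Counter(words[:-1])
    probs = defaultdict(dict)
    for (first, second), count in pair_counts.items():
        probs[first][second] = float(count) / first_totals[first]
    return probs

# Hypothetical toy text, not taken from the Wikipedia dump
tokens = "the cat saw the dog in san francisco near the cat".split()
probs = continuation_probabilities(tokens)
print(probs["the"])
print(probs["san"])
```

In this toy text “the” has two distinct continuations, while “san” has only one, which mirrors the “the” versus “San” contrast described above.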

from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("Pairs").setMaster("local"))

import re

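def collect_pairs(words):
    # Emit ("narodnaya_<next word>", 1) for each occurrence that has a following word
    pairs = []
    # Stop one word early so that words[i + 1] never runs past the end of the list
    for i in range(len(words) - 1):
        if words[i].lower() == "narodnaya":
            pairs.append((words[i].lower() + "_" + words[i + 1], 1))
    return pairs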

def parse_article(line):
    # Split "<article_id>\t<text>" and return the lowercased words of the text
    try:
        article_id, text = line.rstrip().split('\t', 1)
        text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
        words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
        words = [element.lower() for element in words]
        return words
    except ValueError:
        # Lines without a tab separator cannot be unpacked; treat them as empty
        return []

# Load and parallelize text file
wiki = sc.textFile("articles-part", 16)
# Parse the article
wiki = wiki.map(parse_article)
# Find the pairs where the first word is narodnaya and flatten them into one RDD
wiki = wiki.flatMap(collect_pairs)
# Count the pairs
wiki = wiki.reduceByKey(lambda x, y: x + y)

# Sort the pairs
wiki = wiki.sortByKey()

# Collect the counts on the driver and print them
result = wiki.collect()

for t in result:
    print(str(t[0]) + "\t" + str(t[1]))

sc.stop()

narodnaya_gazeta	1
narodnaya_volya	9
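
As a way to sanity-check the counting logic without starting Spark, the same steps (parse, pair, count, sort) can be sketched in plain Python. The function name `count_continuations` and the sample lines below are hypothetical and serve only as an illustration; they are not part of the course data.

```python
import re
from collections import Counter

def count_continuations(lines, first_word="narodnaya"):
    """Count "<first_word>_<next word>" pairs over "<id>\\t<text>" lines."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip().split("\t", 1)
        if len(parts) != 2:
            # Skip lines that do not have the "<id>\t<text>" shape
            continue
        text = re.sub(r"^\W+|\W+$", "", parts[1], flags=re.UNICODE)
        words = [w.lower() for w in re.split(r"\W*\s+\W*", text, flags=re.UNICODE)]
        # zip() pairs each word with its successor and stops before the last word
        for first, second in zip(words, words[1:]):
            if first == first_word:
                counts[first + "_" + second] += 1
    # Sorting by key matches the ordering produced by sortByKey()
    return sorted(counts.items())

# Hypothetical toy input, not taken from the Wikipedia dump
sample = [
    "1\tNarodnaya Volya, then Narodnaya Gazeta.",
    "2\tAgain Narodnaya Volya; the line ends with Narodnaya",
    "no tab here",
]
for key, value in count_continuations(sample):
    print(key + "\t" + str(value))
```

On this toy input the sketch counts `narodnaya_gazeta` once and `narodnaya_volya` twice. The line without a tab is skipped, and the final “Narodnaya” on the second line contributes nothing because no word follows it.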


_This project is part of the "Big Data Essentials: HDFS, MapReduce and Spark RDD by Yandex" course on Coursera._