# Spark assignment 2: Collocations

For the second part of the assignment, your task is to extract collocations, that is, word combinations that frequently occur together, for example “high school” or “roman empire”.

To find collocations, you will use the NPMI (normalized pointwise mutual information) metric.

The PMI of two words a and b is defined as “PMI(a, b) = ln(P(ab) / (P(a) * P(b)))”, where P(ab) is the probability of the two words occurring one after the other, and P(a) and P(b) are the probabilities of the words a and b, respectively.

You will estimate probabilities with occurrence counts, that is “P(a) = # of occurrences of word a / total number of words”, and “P(ab) = # of occurrences of words ‘a b’ / total number of word pairs”.

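Therefore, PMI is large for word pairs that occur together much more often than their individual frequencies would suggest, which is typical of rare words that almost always appear as a pair.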

NPMI is computed as “NPMI(a, b) = PMI(a, b) / (-ln P(ab))”. This normalizes the quantity to the range [-1, 1].
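
Written in terms of raw counts, the definitions above combine as follows. The symbols n_a, n_b, n_ab (occurrence counts) and N_w, N_p (total number of words and of word pairs) are shorthand introduced here for illustration; they are not notation from the assignment.

```latex
\begin{align*}
P(a) &= \frac{n_a}{N_w}, \qquad P(b) = \frac{n_b}{N_w}, \qquad P(ab) = \frac{n_{ab}}{N_p} \\
\mathrm{PMI}(a, b) &= \ln \frac{P(ab)}{P(a)\,P(b)} = \ln \frac{n_{ab}\, N_w^2}{n_a\, n_b\, N_p} \\
\mathrm{NPMI}(a, b) &= \frac{\mathrm{PMI}(a, b)}{-\ln P(ab)}
  = \frac{\ln \dfrac{n_{ab}\, N_w^2}{n_a\, n_b\, N_p}}{\ln \dfrac{N_p}{n_{ab}}}
\end{align*}
```

Two limiting cases serve as a sanity check. If a and b occur independently, P(ab) = P(a) * P(b), so PMI and NPMI are both 0. If the two words only ever occur together, P(ab) = P(a) = P(b) (taking N_w and N_p as approximately equal), so PMI = -ln P(ab) and NPMI = 1.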

Your task is a bit more complicated now:

* Extract all the words, as in the previous task.
* Filter out stopwords using the dictionary (/datasets/stop_words_en.txt) (do not forget to convert words to lowercase!)
* Compute all bigrams (that is, pairs of consecutive words)
* Leave only bigrams with at least 500 occurrences
* Compute NPMI for every bigram (note: when computing probabilities, you need unpruned counts!)
* Sort word pairs by NPMI in descending order
* Print top 39 word pairs, with words delimited by the underscore “_”

For example,

    roman_empire
    south_africa
    
Dataset location: /data/wiki/en_articles_part

Part of the result on the sample dataset:

    ...
    references_reading
    notes_references
    award_best
    north_america
    new_zealand
    ...
    
Hint: if you did everything right, “roman_empire” and “south_africa” are going to be in the result.
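
The steps listed above can be tried without Spark on a tiny in-memory corpus. The following is only a sketch under stated assumptions: the three documents, the stopword set, the threshold of 2, and the function name `top_collocations` are all made up for illustration, and plain whitespace splitting stands in for the real word extraction.

```python
from __future__ import division  # true division on Python 2 as well

import math
from collections import Counter

# Hypothetical toy inputs; the assignment uses Wikipedia articles,
# /datasets/stop_words_en.txt, and a threshold of 500.
DOCS = [
    "The Roman Empire and the Roman Empire",
    "A high school in the Roman Empire",
    "The high school of the empire",
]
STOP = {"the", "a", "and", "in", "of"}
MIN_COUNT = 2


def top_collocations(docs, stop_words, min_count, top_n):
    word_counts, pair_counts = Counter(), Counter()
    for doc in docs:
        # Lowercase, drop stopwords, then pair up neighbouring words.
        words = [w for w in doc.lower().split() if w not in stop_words]
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))

    # Totals are taken from the unpruned counts.
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())

    scored = []
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_count:
            continue  # prune rare bigrams only after the totals are known
        p_ab = n_ab / n_pairs  # assumes p_ab < 1, so that ln P(ab) != 0
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        pmi = math.log(p_ab / (p_a * p_b))
        scored.append((pmi / -math.log(p_ab), a + "_" + b))

    scored.sort(reverse=True)  # descending NPMI
    return [name for _, name in scored[:top_n]]


result = top_collocations(DOCS, STOP, MIN_COUNT, top_n=39)
for name in result:
    print(name)
```

On a corpus this small the two totals (11 words, 8 pairs) differ noticeably, so the scores can fall outside [-1, 1]; on the full dataset the totals should be nearly equal.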

In [1]:
# Python2 on grader, Python3 in docker
import sys
if sys.version_info.major == 3:
    import os
    os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
    os.environ['PYTHONHASHSEED'] = '42'
    unicode = str

## Extract all the words

In [2]:
from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setAppName("PMI").setMaster("local"))

import re

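def parse_article(line):
    try:
        # Each line is "<article_id>\t<text>"; lines without a tab are skipped
        article_id, text = unicode(line.rstrip()).split('\t', 1)
        # Strip leading and trailing non-word characters
        text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
        # Split on whitespace together with any adjacent punctuation
        words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
        return words
    except ValueError:
        return []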

wiki = sc.textFile("/data/wiki/en_articles_part/articles-part", 16).map(parse_article)
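
To see what the two regular expressions in the parser above do, it can help to run them outside Spark. The sketch below is a Python 3 copy of `parse_article` (using `str` where the notebook uses `unicode`); the sample input lines are invented for illustration.

```python
import re


def parse_article(line):
    """Python 3 copy of the notebook's parser, for experimenting without Spark."""
    try:
        article_id, text = line.rstrip().split("\t", 1)
    except ValueError:
        return []  # no tab separator: skip the line
    # Strip leading and trailing non-word characters.
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    # Split on whitespace together with any punctuation touching it.
    return re.split(r"\W*\s+\W*", text, flags=re.UNICODE)


# Hypothetical input lines in the "<id>\t<text>" format the parser expects.
tokens = parse_article("12\tHello, world! (Roman Empire).")
hyphenated = parse_article("13\tAn American singer-songwriter")
no_tab = parse_article("a line without a tab")

print(tokens)
print(hyphenated)
print(no_tab)
```

Punctuation is removed only where it touches whitespace or the ends of the text, so a hyphenated word stays a single token.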

In [3]:
# Lowercase texts

lwiki = wiki.map(lambda text: [s.lower() for s in text])

In [4]:
#result = lwiki.take(1)[0]
#for word in result[:5]:
#    print(word)

## Filter out stopwords

In [5]:
def load_stopwords(fname):
    with open(fname) as h:
        # A set gives constant-time membership tests in filter_text
        return set(unicode(l.strip()) for l in h)

STOP_WORDS = load_stopwords('/datasets/stop_words_en.txt')
# print(sorted(STOP_WORDS)[:5])

In [6]:
def filter_text(words):
    return [w for w in words if w not in STOP_WORDS]
texts = lwiki.map(filter_text)

In [7]:
#result = texts.take(1)[0]
#for word in result[:5]:
#    print(word)

## Bigrams

In [8]:
def word_list_to_pairs(ls):
    return zip(ls, ls[1:])

pairs = texts.flatMap(word_list_to_pairs)
pairs = pairs.map(lambda p: (p[0] + '_' + p[1], 1))
pairs.cache()

PythonRDD[2] at RDD at PythonRDD.scala:48
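
Because stopwords are removed before the words are paired, a bigram here can join two words that were separated by stopwords in the original text. A minimal sketch of that effect, assuming a three-word stopword set and an invented word list (the notebook loads the real stopwords from /datasets/stop_words_en.txt):

```python
# Hypothetical stopword subset, for illustration only.
STOP_WORDS = frozenset(["of", "the", "in"])


def filter_text(words):
    return [w for w in words if w not in STOP_WORDS]


def word_list_to_pairs(ls):
    # list(...) so the result can be printed on Python 3, where zip is lazy
    return list(zip(ls, ls[1:]))


words = ["university", "of", "california", "in", "the", "united", "states"]
kept = filter_text(words)
pairs = word_list_to_pairs(kept)

print(kept)
print(pairs)
print(word_list_to_pairs(["single"]))  # fewer than two words: no pairs
```

Since `word_list_to_pairs` is applied to one article's word list at a time, no bigram spans two articles.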

In [9]:
# pairs.take(5)

In [10]:
# Leave only bigrams with at least 500 occurrences

sumFunc = lambda x, y: x + y

counted_pairs = pairs.aggregateByKey(0, sumFunc, sumFunc)
counted_pairs.cache()

PythonRDD[7] at RDD at PythonRDD.scala:48

In [11]:
pairs500 = counted_pairs.filter(lambda rec: rec[1] >= 500)
pairs500.cache()

PythonRDD[8] at RDD at PythonRDD.scala:48

In [12]:
# pairs500.take(5)

## Compute NPMI for every bigram

In [13]:
# Word counts

words = texts.flatMap(lambda text: [(w, 1) for w in text])
counted_words = words.aggregateByKey(0, sumFunc, sumFunc)
counted_words.cache()

PythonRDD[13] at RDD at PythonRDD.scala:48

In [14]:
# counted_words.take(5)

In [15]:
# Join for the first word
# (bigr, Nb) -> (left_word, (bigr, Nb)) -> join against (word, Nw) -> (left_word, ((bigr, Nb), Nw))
def first_word_as_key(rec):
    (key, nb) = rec
    return (key.split('_')[0], (key, nb))
for_join1 = pairs500.map(first_word_as_key)

# for_join1.take(5)
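
One caveat with recovering the words via `key.split('_')`: the tokenizer treats the underscore as a word character, so a token may itself contain one, and the split then returns only part of the word. The sketch below illustrates this with an invented token; `first_word_as_key_tuple` is a hypothetical variant that keeps the bigram as a tuple and is not part of the notebook.

```python
def first_word_as_key(rec):
    # Same logic as the cell above.
    key, nb = rec
    return (key.split("_")[0], (key, nb))


def first_word_as_key_tuple(rec):
    # Hypothetical variant: the bigram key is a (left, right) tuple,
    # so no string splitting is needed.
    (left, right), nb = rec
    return (left, ((left, right), nb))


# Invented bigram whose left token contains its own underscore.
joined = ("snake_case" + "_" + "name", 3)
from_string = first_word_as_key(joined)[0]
from_tuple = first_word_as_key_tuple((("snake_case", "name"), 3))[0]

print(from_string)
print(from_tuple)
```

With the threshold of 500 occurrences such tokens are unlikely to matter in practice, but tuple keys would avoid the ambiguity altogether.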

In [16]:
join1 = for_join1.join(counted_words)
join1.cache()

PythonRDD[21] at RDD at PythonRDD.scala:48

In [17]:
# join1.take(5)

In [18]:
# Join for the second word
# ('notes', (('notes_references', 638), 2267)) -> ('references', ('notes_references', 638, 2267))
def second_word_as_key(rec):
    (_, ((key, nb), nl)) = rec
    return (key.split('_')[1], (key, nb, nl))
for_join2 = join1.map(second_word_as_key)

#for_join2.take(5)

In [19]:
join2 = for_join2.join(counted_words)
join2.cache()

PythonRDD[29] at RDD at PythonRDD.scala:48

In [24]:
# join2.take(5)

In [27]:
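# Compute an NPMI-like score
# Caution: raw counts are used here instead of probabilities, so the total word
# and pair counts are left out. The result is not the NPMI defined above: it is
# positive, smaller values mean stronger association, and its ordering is not
# guaranteed to match the ordering by true NPMI.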
import math

# ('references', (('notes_references', 638, 2267), 4151)),
def calc_npmi(rec):
    (_, ((pair, nb, nl), nr)) = rec
    pmi = math.log(1.0 * nb / nl / nr)
    npmi = -1.0 * pmi / math.log(nb)
    return (pair, npmi)
npmi = join2.map(calc_npmi)

# npmi.take(5)
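
The cell above scores bigrams from raw counts. Whether that ordering matches the NPMI from the assignment formula can be checked on a small example. The sketch below compares the two for a pair of invented bigrams; all counts, the totals `N_WORDS` and `N_PAIRS`, and the function names are assumptions made for illustration.

```python
import math


def count_score(nb, nl, nr):
    """Score as computed in the cell above (smaller = stronger)."""
    pmi = math.log(1.0 * nb / nl / nr)
    return -1.0 * pmi / math.log(nb)


def npmi(nb, nl, nr, n_words, n_pairs):
    """NPMI per the assignment formula (larger = stronger)."""
    p_ab = 1.0 * nb / n_pairs
    p_a = 1.0 * nl / n_words
    p_b = 1.0 * nr / n_words
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


# Hypothetical counts, chosen only to illustrate the effect:
# (pair count, left word count, right word count)
N_WORDS = N_PAIRS = 1000000
rare = (500, 1000, 1000)
frequent = (50000, 150000, 150000)

for name, (nb, nl, nr) in [("rare", rare), ("frequent", frequent)]:
    print(name, count_score(nb, nl, nr), npmi(nb, nl, nr, N_WORDS, N_PAIRS))
```

With these numbers the count-based score is roughly 1.22 for the rare pair and 1.20 for the frequent one, so an ascending sort puts the frequent pair first, while the NPMI values are roughly 0.82 and 0.27, which ranks the rare pair first. Computing the formula from the assignment would additionally need the total number of words and of word pairs.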

## Sort word pairs, print top

In [28]:
# Ascending sort: calc_npmi returns a score where smaller means stronger association
npmi_sorted = npmi.sortBy(lambda rec: rec[1])
npmi_sorted.cache()

PythonRDD[55] at RDD at PythonRDD.scala:48

In [29]:
for rec in npmi_sorted.take(39):
    print(rec[0])

los_angeles
external_links
united_states
prime_minister
new_york
san_francisco
19th_century
et_al
20th_century
supreme_court
references_external
soviet_union
air_force
university_press
united_kingdom
world_war
baseball_player
war_ii
roman_catholic
north_america
civil_war
new_zealand
notes_references
references_reading
award_best
american_actor
catholic_church
united_nations
south_africa
took_place
roman_empire
american_actress
high_school
american_singer-songwriter
american_baseball
york_city
american_football
years_later
north_american
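
Only the first 39 records are needed, so a full sort is not strictly required. As a local illustration of selecting the k smallest records, here is a sketch with `heapq.nsmallest` on invented (bigram, score) records. PySpark's `RDD.takeOrdered` follows the same idea on an RDD; it is not exercised here.

```python
import heapq

# Hypothetical (bigram, score) records; smaller score = stronger association,
# matching the ascending sort used in the notebook.
records = [
    ("alpha_beta", 1.31),
    ("gamma_delta", 1.05),
    ("epsilon_zeta", 1.62),
    ("eta_theta", 1.12),
]

top2 = heapq.nsmallest(2, records, key=lambda rec: rec[1])
for name, _ in top2:
    print(name)
```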
