# Incorporating phrases (n_grams) into a Word2Vec Model
### Steps:
1. Discovering Common Phrases in Corpora
2. Tagging Corpora with Phrases
3. Training a Word2Vec Model with the tagged corpora

### Notes:
* For the most part I elected to use RDDs to hold data instead of data frames. Data frames are more intuitive, but RDDs usually have faster computation times.
* The data I used for this model came from a list of scotch reviews that I found online. It consists of about 2,000 records, and its schema is printed below. With so little data, the model performs poorly.

# Prepwork and Setup

## Imports

In [6]:
import os
import re
import logging
import pandas as pd
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from operator import add
from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql.types import *
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, Word2Vec as DfWord2Vec
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.mllib.feature import Word2Vec

char_splitter = re.compile("[.,;!:()-]")
abspath = os.path.abspath(os.path.dirname('__file__'))
data_file = 'data/scotch_review.csv'
stop_words = set(stopwords.words())

# print full column width of pandas columns
pd.set_option('max_colwidth', -1)

logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)

## Load and Prepare Data

In [7]:
schema = StructType([StructField("_c0", StringType(), True), 
                     StructField("name", StringType(), True), 
                     StructField("category", StringType(), True), 
                     StructField("review.point", IntegerType(), True), 
                     StructField("price", IntegerType(), True), 
                     StructField("currency", StringType(), True), 
                     StructField("description", StringType(), True),
                     StructField("ErrorField", StringType(), True)])

data_df = \
    (sqlContext.read
        .option('mode', 'PERMISSIVE')
        .load(os.path.join(abspath, data_file),
            format='com.databricks.spark.csv',
            header='true',
            schema=schema,
            nullValue='NA'
        )
    )

# Filter out rows that are corrupted, i.e. have the incorrect number of columns
data_df=data_df.filter(data_df.ErrorField.isNull()).drop('ErrorField')
data_df=data_df.na.drop(subset=['_c0'])

In [8]:
data_df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- name: string (nullable = true)
 |-- category: string (nullable = true)
 |-- review.point: integer (nullable = true)
 |-- price: integer (nullable = true)
 |-- currency: string (nullable = true)
 |-- description: string (nullable = true)



In [9]:
data_df.toPandas().head()

Unnamed: 0,_c0,name,category,review.point,price,currency,description
0,1,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225,$,"Magnificently powerful and intense. Caramels, dried peats, elegant cigar smoke, seeds scraped from vanilla beans, brand new pencils, peppercorn, coriander seeds, and star anise make for a deeply satisfying nosing experience. Silky caramels, bountiful fruits of ripe peach, stewed apple, orange pith, and pervasive smoke with elements of burnt tobacco. An abiding finish of smoke, dry spices, and banoffee pie sweetness. Close to perfection. Editor's Choice"
1,4,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325,$,"With a name inspired by a 1926 Buster Keaton movie, only 1,698 bottles produced, and the news that one of the two batches is more than 30 years old, the clues were there that this blend was never going to be cheap. It isn't, but it's superb, rich in flavor that screams dusty old oak office, fresh polish, and Sunday church, with spices, oak dried fruits, squiggly raisins, and a surprising melting fruit-and-nut dairy chocolate back story."
2,5,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160,$,"Captivating, enticing, and wonderfully charming, this first blended malt from Chivas Regal contains selections of five Speyside malts: Strathisla, Longmorn, Tormore, Allt-a-Bhainne, and Braeval. Red apple, cherry, raspberry fudge, peach and mango fruit salad, dusting of cinnamon, and dry heather sprigs. In essence, it’s rich and satisfying, with dark vanilla, apricot, Bourneville-covered Brazil nuts, and tangerine, smoothed over by caramel and wood spices, maltiness, and gingersnap biscuits. Quite heavenly. Editor's Choice"
3,10,"Glenfarclas Family Casks 1954 Cask #1260, 47.2%",Single Malt Scotch,96,3360,$,"A rich amber color and elegantly oxidized notes greet you. There are luscious old fruits—pineapple, dried peach, apricot—and puffs of coal-like smokiness. In time, sweet spices (cumin especially) emerge. Superbly balanced. The palate, while fragile, still has real sweetness alongside a lick of treacle. It can take a drop of water, allowing richer, darker fruits to emerge. The finish is powerful, long, and resonant. Superb, not over-wooded, and a fair price for such a rarity. £1,995"
4,13,"The Last Drop (distilled at Lochside) 1972 (cask 346), 44%",Grain Scotch Whisky,96,3108,$,"A remarkable beauty from the Angus town of Montrose. The elegant nose shows a dram at peace with itself; golden syrup, hay bales, ground hazelnut, liquid honey, French baguette, High Mountain oolong, and rubbed spice blends. Refreshing palate of honey, toffee, citrus, honeycomb wax, and a profusion of sweet vanilla. Rich, sweet oak and deep pepper notes to finish. Truly a sublime and venerable grain. (106 bottles) £2,400"


In [11]:
description_rdd = data_df.select('description').rdd.flatMap(list).repartition(10)
description_rdd.take(2)

['Straw-gold color. On the nose, sweet toffee, citrus notes, seaweed, and spice complement a powerful peat smoke infusion. In body, it is thick and oily. On the palate, a somewhat sweet maltiness up front is run over by a powerful peat smoke locomotive. Again, the whisky is enriched with citrus and pear notes, spice, and seaweed. The finish is powerful, long, and warming. The smoke lingers for minutes, if not hours. If you like your Ardbeg to go to a phenolic extreme, you will cherish this one. This big, powerful whisky makes no apologies for its Islay roots. And the fact that this whisky is bottled at 46% ABV just makes this big whisky even bigger.',
 "Ardbeg's first standard release in nearly a decade, An Oa is matured in virgin oak, Pedro Ximénez, and bourbon barrels, with component whiskies married in the distillery's French oak 'Gathering Vat.' The nose offers sweet peat, smoky lemon rind, ginger, and angelica. A soft and sweet palate entry is followed by hot peat, black tea, pepp

# 1. Discovering Common Phrases in Corpora
### Steps:
1. Segment corpora into candidate phrases
2. Select candidate phrases with more than *M* occurrences in the corpora as common phrases

### Segment corpora into candidate phrases

First, remove special characters that do not indicate a phrase boundary, e.g. "\" or "%".


In [12]:
def remove_non_phrase_boundaries(text):
    """remove characters that are not indicators of phrase boundaries"""
    return re.sub(r"[{}@\"$%&\\/*'’]|\d", "", text)

In [13]:
description_no_special = description_rdd.map(lambda txt: remove_non_phrase_boundaries(txt))
description_no_special.first()

'Straw-gold color. On the nose, sweet toffee, citrus notes, seaweed, and spice complement a powerful peat smoke infusion. In body, it is thick and oily. On the palate, a somewhat sweet maltiness up front is run over by a powerful peat smoke locomotive. Again, the whisky is enriched with citrus and pear notes, spice, and seaweed. The finish is powerful, long, and warming. The smoke lingers for minutes, if not hours. If you like your Ardbeg to go to a phenolic extreme, you will cherish this one. This big, powerful whisky makes no apologies for its Islay roots. And the fact that this whisky is bottled at  ABV just makes this big whisky even bigger.'

Next, split the text at punctuation that marks a phrase boundary, e.g. ".", ",", or "!". Then split the resulting pieces further at stop words (e.g. "and" or "afterwards"), which also act as phrase boundaries. Words of two characters or fewer are dropped.

In [14]:
def generate_candidate_phrases(text, stopwords):
    """ generate phrases using phrase boundary markers """

    # split up into phrases at the punctuation matched by char_splitter, e.g. ".", ",", or "!"
    coarse_candidates = char_splitter.split(text.lower())

    candidate_phrases = []

    for coarse_phrase in coarse_candidates:
    
        words = re.split("\\s+", coarse_phrase)
        previous_stop = False

        # examine each word to determine if it is a phrase boundary marker or
        # part of a phrase or lone ranger
        for w in words:

            if w in stopwords and not previous_stop:
                # phrase boundary encountered, so put a hard indicator
                candidate_phrases.append(";")
                previous_stop = True
            elif w not in stopwords and len(w) > 2:
                # keep adding words to list until a phrase boundary is detected
                candidate_phrases.append(w.strip())
                previous_stop = False

    # get a list of candidate phrases without boundary demarcation
    phrases = re.split(";+", ' '.join(candidate_phrases))

    return phrases

In [15]:
# Generate candidate phrases
candidate_phrases = description_no_special.map(lambda txt: generate_candidate_phrases(txt, stop_words))
print("Example description:" + "\n" + description_no_special.first() + "\n")
print("Example candidate phrases:" + "\n" + str(candidate_phrases.first()))

Example description:
Straw-gold color. On the nose, sweet toffee, citrus notes, seaweed, and spice complement a powerful peat smoke infusion. In body, it is thick and oily. On the palate, a somewhat sweet maltiness up front is run over by a powerful peat smoke locomotive. Again, the whisky is enriched with citrus and pear notes, spice, and seaweed. The finish is powerful, long, and warming. The smoke lingers for minutes, if not hours. If you like your Ardbeg to go to a phenolic extreme, you will cherish this one. This big, powerful whisky makes no apologies for its Islay roots. And the fact that this whisky is bottled at  ABV just makes this big whisky even bigger.

Example candidate phrases:
['straw gold color ', ' nose sweet toffee citrus notes seaweed ', ' spice complement ', ' powerful peat smoke infusion ', ' body ', ' thick ', ' oily ', ' palate ', ' somewhat sweet maltiness ', ' front ', ' run ', ' powerful peat smoke locomotive ', ' ', ' whisky ', ' enriched ', ' citrus ', ' pe
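In the example output above, 'nose sweet toffee citrus notes seaweed' runs across several commas, because the function joins all of the words back together before splitting on the ";" markers. The following is a minimal sketch of a variant that treats punctuation as a hard boundary as well. It is not the function used in this notebook: `split_into_phrases`, `toy_stop_words`, and `min_len` are hypothetical names, and the stop list is a tiny made-up one.

```python
import re

# Tiny illustrative stop list (an assumption, not the NLTK list used above).
toy_stop_words = {"on", "the", "and", "is"}
punctuation = re.compile(r"[.,;!:()-]")

def split_into_phrases(text, stop_words, min_len=3):
    """Split text into phrases at punctuation and at stop words."""
    phrases = []
    for chunk in punctuation.split(text.lower()):
        current = []
        for word in chunk.split():
            if word in stop_words:
                # a stop word closes the phrase collected so far
                if current:
                    phrases.append(" ".join(current))
                current = []
            elif len(word) >= min_len:
                current.append(word)
        # the end of a punctuation chunk also closes the phrase
        if current:
            phrases.append(" ".join(current))
    return phrases

toy_text = "On the nose, sweet toffee and peat smoke. The finish is long!"
toy_phrases = split_into_phrases(toy_text, toy_stop_words)
print(toy_phrases)
```

Whether this stricter splitting is preferable depends on the corpus: it yields shorter, cleaner phrases, but it also means a hyphenated term is split into separate single-word phrases.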

### Select candidate phrases with more than *M* occurrences in the corpora as common phrases

First, perform a map and reduce operation to get a count of each phrase.

In [16]:
def phrases_to_counts(phrases):
    """ strip any white space and send back a count of 1"""
    clean_phrases = []

    for p in phrases:
        word = p.strip()

        # we only need to count phrases, so ignore unigrams
        if len(word) > 1 and ' ' in word:
            clean_phrases.append([word, 1])

    return clean_phrases

In [19]:
# Get count of candidate phrases
candidates_by_count = candidate_phrases.flatMap(lambda phrase: phrases_to_counts(phrase)) \
    .reduceByKey(add).sortBy(lambda phrases: phrases[1], ascending=False)

print("Top 3 phrases by count: " + str(candidates_by_count.take(3)))
print("Total num of phrases: " + str(candidates_by_count.count()))

Top 3 phrases by count: [('year old', 103), ('sherry casks', 44), ('chill filtered', 41)]
Total num of phrases: 21113


Next, filter out phrases that do not occur a minimum number of times in the corpora

In [None]:
# Keep phrases with at least 10 occurrences
selected_phrases = candidates_by_count.filter(lambda phrases: phrases[1] >= 10) \
        .sortBy(lambda phrases: phrases[0], ascending=True)

selected_phrases.take(5)
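The count-then-filter logic of the last two cells can be tried out without Spark. The sketch below is only an illustration: `toy_candidates`, `count_phrases`, and `min_count` are hypothetical names and data, and `collections.Counter` stands in for the `flatMap` plus `reduceByKey` step.

```python
from collections import Counter

# Hypothetical candidate phrases for three reviews (illustrative data only).
toy_candidates = [
    ["sherry casks ", " nose ", " peat smoke "],
    [" peat smoke ", " sherry casks ", " finish "],
    [" sherry casks ", " long finish "],
]

def count_phrases(candidates_per_doc):
    """Count multi-word phrases, ignoring unigrams and empty strings."""
    counts = Counter()
    for phrases in candidates_per_doc:
        for p in phrases:
            phrase = p.strip()
            if " " in phrase:
                counts[phrase] += 1
    return counts

phrase_counts = count_phrases(toy_candidates)

# Keep phrases seen at least `min_count` times (the notebook uses 10).
min_count = 2
selected = sorted(p for p, n in phrase_counts.items() if n >= min_count)
print(selected)
```

On a corpus of about 2,000 reviews the threshold is a trade-off: a lower value keeps more phrases but admits more noise.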

# 2. Tagging Corpora with Phrases
### Steps:
1. Join the words that make up a selected phrase into one entity using a "_" character, e.g. "american oak" becomes "american_oak".
2. Create a mapping between the old phrase and the new combined phrase.
3. Replace all instances of the old phrase in the corpora with the new combined phrase.

### Join the words that make up a selected phrase into one entity.

In [None]:
selected_phrases_tags = selected_phrases.map(lambda phrase: (phrase[0], phrase[0].replace(" ", "_")))
selected_phrases_tags.take(5)

### Create a mapping between the old phrase and the new combined phrase.

In [None]:
selected_phrases_map = selected_phrases_tags.collectAsMap()

# broadcast a few values so that these are not copied to the worker nodes
# each time
selected_phrases_bc = sc.broadcast(selected_phrases_map)
keys = list(selected_phrases_map.keys())
keys.sort(key=len, reverse=True)
sorted_key_bc = sc.broadcast(keys)

### Replace all instances of the old phrase in the corpora with the new combined phrase.

In [None]:
def tag_data(original_text, phrase_transformation, keys):
    """Replace each selected phrase in the text with its tagged form"""
    original_text = original_text.lower()

    # greedy approach, start with the longest phrase
    for phrase in keys:
        # replace every occurrence of the phrase with its underscore-joined tag
        original_text = original_text.replace(
            phrase, phrase_transformation[phrase])

    return original_text

def remove_punctuation(text):
    """remove all special characters"""
    return re.sub(r'[^A-Za-z0-9\s]+', '', text)

In [None]:
tagged_text_rdd = description_rdd.map(
    lambda txt: tag_data(
        remove_punctuation(txt),
        selected_phrases_bc.value, sorted_key_bc.value))

tagged_text_rdd.take(3)
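Sorting the keys by length before tagging matters when one selected phrase contains another. The sketch below compares the two orderings on a made-up example. `phrase_map`, `tag_longest_first`, and `tag_shortest_first` are hypothetical names used only for illustration.

```python
# Hypothetical phrase map (illustrative only); in the notebook the map
# comes from selected_phrases_tags.collectAsMap().
phrase_map = {
    "sherry casks": "sherry_casks",
    "first fill sherry casks": "first_fill_sherry_casks",
}

def tag_longest_first(text, mapping):
    """Replace phrases with their tagged form, longest phrase first."""
    for phrase in sorted(mapping, key=len, reverse=True):
        text = text.replace(phrase, mapping[phrase])
    return text

def tag_shortest_first(text, mapping):
    """Same replacement in the opposite order, for comparison."""
    for phrase in sorted(mapping, key=len):
        text = text.replace(phrase, mapping[phrase])
    return text

toy_review = "matured in first fill sherry casks"
longest_result = tag_longest_first(toy_review, phrase_map)
shortest_result = tag_shortest_first(toy_review, phrase_map)
print(longest_result)
print(shortest_result)
```

With the shorter phrase applied first, "sherry casks" is rewritten before the longer phrase gets a chance to match, so the longer tag is never produced.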

# 3. Training a Phrase2Vec model using Word2Vec

In [None]:
# Learn a mapping from words to Vectors.

# split each review into a list of words
inp = tagged_text_rdd.map(lambda row: row.split(" "))

word2vec = Word2Vec().setVectorSize(3)
model = word2vec.fit(inp)

**Note:** I used PySpark's Word2Vec algorithm. The tutorial used Gensim. Should we consider this or another algorithm?

### Testing the Word2Vec Model

In [None]:
# Most frequently occurring tagged phrases:
selected_phrases.sortBy(lambda phrase: phrase[1], ascending=False).take(5)

In [None]:
sym1 = model.findSynonyms('sherry_casks', 3)
sym2 = model.findSynonyms('chill_filtered', 3)

print("Phrases most similar to 'sherry_casks':")
print("------------------------------------")
for s in sym1:
    print(s)

print('\n')

print("Phrases most similar to 'chill_filtered':")
print("------------------------------------")
for s in sym2:
    print(s)
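`findSynonyms` ranks words by the cosine similarity of their vectors. The sketch below shows that ranking in plain Python on made-up 3-dimensional vectors. The numbers, `toy_vectors`, `cosine_similarity`, and `most_similar` are all assumptions for illustration, not output from the model above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings (made-up numbers, not model output).
toy_vectors = {
    "sherry_casks": [1.0, 0.0, 0.0],
    "bourbon_barrels": [0.8, 0.6, 0.0],
    "peat_smoke": [0.0, 0.0, 1.0],
}

def most_similar(word, vectors, k):
    """Rank every other entry by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(w, cosine_similarity(query, v))
              for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

ranked = most_similar("sherry_casks", toy_vectors, 2)
print(ranked)
```

With a vector size of 3, as set in the training cell, there is very little room for unrelated phrases to point in different directions, so high similarity scores should be read with caution.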
    


**Note:** Our model does not perform well because of the limited amount of data we have (~2,000 records). Our text-mining approach needs more data than a semantic chunking approach would.

# Categorizing Whiskey by Review Text

In [None]:
data_df_2 = data_df.rdd.map(lambda x: (x['_c0'], x['category'], \
                            tag_data(remove_punctuation(x['description']), \
                                     selected_phrases_bc.value, sorted_key_bc.value).split(" ") \
                           )).toDF(['_c0', 'category', 'description'])

data_df_2.toPandas().head()

In [None]:
# Learn a mapping from words to Vectors.
word2Vec = DfWord2Vec(vectorSize=50, minCount=10, inputCol="description", outputCol="descriptionFeatures")
model = word2Vec.fit(data_df_2.select('description'))

data_df_2 = model.transform(data_df_2)
data_df_2.toPandas().head(1)
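The DataFrame-based `Word2Vec` model turns each review into a single feature vector by averaging the vectors of its words. The sketch below illustrates that averaging on made-up 2-dimensional vectors. `average_vectors` and `toy_word_vectors` are hypothetical, and skipping tokens without a vector is a simplification of this sketch that may differ from how Spark treats out-of-vocabulary words.

```python
def average_vectors(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # no token has a vector, so there is nothing to average
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Hypothetical word vectors (made-up numbers, not model output).
toy_word_vectors = {
    "peat_smoke": [1.0, 3.0],
    "sweet": [3.0, 1.0],
}

doc_vector = average_vectors(["peat_smoke", "sweet"], toy_word_vectors)
print(doc_vector)
```

Averaging discards word order, so two reviews that use the same vocabulary in different ways end up with the same features.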

In [None]:
stringIndexer = StringIndexer(inputCol='category', outputCol='categoryIndex')
indexed = stringIndexer.fit(data_df_2).transform(data_df_2)

indexed.toPandas().head(1)
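By default `StringIndexer` assigns indices by label frequency, so the most frequent category gets index 0. A rough plain-Python equivalent is sketched below. `index_labels` and `toy_categories` are hypothetical, and the alphabetical tie-break is an assumption made only to keep the sketch deterministic.

```python
from collections import Counter

def index_labels(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically here.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Illustrative category column (made-up frequencies).
toy_categories = [
    "Single Malt Scotch",
    "Blended Scotch Whisky",
    "Single Malt Scotch",
    "Grain Scotch Whisky",
    "Single Malt Scotch",
    "Blended Scotch Whisky",
]

label_index = index_labels(toy_categories)
print(label_index)
```

The index only encodes frequency rank; the numeric order of the resulting labels has no meaning for the categories themselves.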

In [None]:
# Randomly split data into training and test sets; set seed for reproducibility
trainingData, testData = indexed.randomSplit([0.75, 0.25], seed=100)

# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="categoryIndex", featuresCol="descriptionFeatures", maxIter=10)

# Train model with Training Data
lrModel = lr.fit(trainingData)

In [None]:
predictions = lrModel.transform(testData)

In [None]:
# Compute raw scores on the test set
predictionAndLabels = predictions.select(['prediction', 'categoryIndex']).rdd

metrics = MulticlassMetrics(predictionAndLabels)

In [None]:
cm = metrics.confusionMatrix()
print("Confusion Matrix:")
print(cm.toArray())
print("\n")
# overall accuracy: the fraction of test rows that were classified correctly
print("Model Accuracy: " + "{0:.2%}".format(metrics.accuracy))
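To read the confusion matrix by hand, recall that `MulticlassMetrics` puts actual classes in rows and predicted classes in columns. The sketch below derives accuracy and per-class precision and recall from a made-up 3x3 matrix. `toy_cm` and the helper functions are hypothetical and not output from the model above.

```python
def accuracy(cm):
    """Fraction of all rows that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def precision(cm, label):
    """Of the rows predicted as `label`, the fraction that truly are."""
    predicted = sum(row[label] for row in cm)
    return cm[label][label] / predicted if predicted else 0.0

def recall(cm, label):
    """Of the rows that truly are `label`, the fraction predicted as such."""
    actual = sum(cm[label])
    return cm[label][label] / actual if actual else 0.0

# Made-up confusion matrix: rows are actual classes, columns are predictions.
toy_cm = [
    [8, 2, 0],
    [1, 5, 0],
    [1, 1, 2],
]

print(accuracy(toy_cm))
print(precision(toy_cm, 1), recall(toy_cm, 2))
```

When the categories are unbalanced, per-class precision and recall say more than overall accuracy does, because a model can score well simply by favoring the largest class.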