
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# N-grams Lab

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you:<br>
* Learn what n-grams are
* Generate n-grams for each review

In [0]:
%run ../Includes/Classroom-Setup

Another commonly used preprocessing step is generating n-grams: ordered sequences of `n` consecutive tokens. This matters when meaningful phrases are made up of multiple words in a specific order. For example, knowing that "really good" occurred is more informative than knowing only that the tokens "really" and "good" each appeared somewhere in the text. Sometimes `n > 2` is helpful, such as for extracting the phrase "really highly recommend."

Here is an example of the n-grams with n = 1, 2, and 3 for a simple sentence:

![](https://files.training.databricks.com/images/trigram-updated.png)
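
As a rough illustration of the idea, the sketch below builds word-level n-grams with a sliding window in plain Python. The helper `word_ngrams` and the example sentence are made up for this illustration and are not part of the lab or of SparkML.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentence, already split into tokens
tokens = "i really highly recommend this".split()

unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

print(unigrams)  # ['i', 'really', 'highly', 'recommend', 'this']
print(bigrams)   # ['i really', 'really highly', 'highly recommend', 'recommend this']
print(trigrams)  # ['i really highly', 'really highly recommend', 'highly recommend this']
```

A list of `k` tokens yields `k - n + 1` n-grams, so longer n-grams are always fewer in number than shorter ones for the same text.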

In addition to the word level, n-grams are often built at the character level so that subwords can be learned and typos can be handled.
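
To see why character-level n-grams can help with typos, consider this small plain-Python sketch. The helper `char_ngrams` and the misspelled word are assumptions chosen for illustration; the lab itself works only with word-level n-grams.

```python
def char_ngrams(word, n):
    """Return every run of n consecutive characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


correct = set(char_ngrams("recommend", 3))
typo = set(char_ngrams("recomend", 3))  # hypothetical misspelling

# Character trigrams that survive the typo
shared = correct & typo
print(sorted(shared))  # ['com', 'eco', 'end', 'men', 'rec']
```

Even though the two strings differ as whole tokens, most of their character trigrams overlap, which gives a model something to match on.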

**Note:** The n-grams with `n = 1` are the same as the original list of tokens.

Load in our tokenized and processed DataFrame.

In [0]:
processedDF = spark.read.parquet("/mnt/training/reviews/tfidf.parquet")

One implementation of n-grams is SparkML's built-in `NGram` <a href="https://spark.apache.org/docs/latest/ml-features.html#n-gram" target="_blank">transformer</a>, which takes an integer `n` and a column of token lists, and creates all possible groups of **exactly** `n` consecutive tokens.
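
To make the "exactly `n`" behavior concrete, here is a plain-Python stand-in for what the Spark documentation describes: each n-gram is a space-delimited string of `n` consecutive tokens, and an input with fewer than `n` tokens yields no output. The function `ngram_like` and the token list are hypothetical; this is not Spark code.

```python
def ngram_like(tokens, n):
    """Plain-Python stand-in for the documented NGram behavior (not Spark code)."""
    if len(tokens) < n:
        return []  # too few tokens: no n-grams are produced
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


clean_tokens = ["highly", "recommend", "product"]  # hypothetical cleaned tokens

print(ngram_like(clean_tokens, 2))  # ['highly recommend', 'recommend product']
print(ngram_like(clean_tokens, 3))  # ['highly recommend product']
print(ngram_like(clean_tokens, 4))  # []
```

Note that the `n = 2` result contains no single tokens and no trigrams, so collecting several sizes in one column requires combining the separate outputs.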

Using SparkML's `NGram` transformer, fill in the following cell to return a new DataFrame, `ngramDF`, with a column, `ngrams`, containing all n-grams **up to and including** `n = 3`. In other words, append a single column that contains all the cleaned tokens, bigrams, and trigrams of each corresponding row.


**Note:** We want to construct our n-grams using the `CleanTokens` column, *not* the raw `Tokens` column.

**Hint:** You may want to take a look at <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.concat" target="_blank">pyspark.sql.functions.concat</a>.

In [0]:
# TODO
# Apply n-grams to processed DataFrame
from pyspark.ml.feature import NGram
from pyspark.sql.functions import concat, col

ngramDF = processedDF.select("Text", "Tokens", "CleanTokens")

# Create the ngram2 and ngram3 transformers
ngram2 = NGram(n=2, inputCol="CleanTokens", outputCol="ngrams2")
ngram3 = NGram(n=3, inputCol="CleanTokens", outputCol="ngrams3")

# Perform the transformation
ngramDF = ngram2.transform(ngramDF)
ngramDF = ngram3.transform(ngramDF)

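# Combine tokens, bigrams, and trigrams into a single array column.
# Caution: column names resolve case-insensitively by default, so col("tokens")
# picks up the raw Tokens column; use col("CleanTokens") for cleaned unigrams.
ngramDF = ngramDF.withColumn(
    "ngrams", concat(col("tokens"), col("ngrams2"), col("ngrams3"))
)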

display(ngramDF.select("ngrams"))

ngrams
"List(my, little, westie, benji, was, having, heartburn, diarrhea, and, upset, stomach, issues, for, quite, a, few, months, he, would, throw, up, almost, daily, we, took, him, to, an, internest, who, recommended, we, feed, him, a, kibble, containing, only, two, ingredients, before, we, proceed, with, other, tests, this, is, one, of, the, kibbles, she, suggested, it, was, in, two, days, that, his, vomiting, and, diarrhia, stopped, i, highly, recommend, this, product, thank, you, natural, balance, little westie, westie benji, benji heartburn, heartburn diarrhea, diarrhea upset, upset stomach, stomach issues, issues quite, quite months, months throw, throw almost, almost daily, daily took, took internest, internest recommended, recommended feed, feed kibble, kibble containing, containing two, two ingredients, ingredients proceed, proceed tests, tests one, one kibbles, kibbles suggested, suggested two, two days, days vomiting, vomiting diarrhia, diarrhia stopped, stopped highly, highly recommend, recommend product, product thank, thank natural, natural balance, little westie benji, westie benji heartburn, benji heartburn diarrhea, heartburn diarrhea upset, diarrhea upset stomach, upset stomach issues, stomach issues quite, issues quite months, quite months throw, months throw almost, throw almost daily, almost daily took, daily took internest, took internest recommended, internest recommended feed, recommended feed kibble, feed kibble containing, kibble containing two, containing two ingredients, two ingredients proceed, ingredients proceed tests, proceed tests one, tests one kibbles, one kibbles suggested, kibbles suggested two, suggested two days, two days vomiting, days vomiting diarrhia, vomiting diarrhia stopped, diarrhia stopped highly, stopped highly recommend, highly recommend product, recommend product thank, product thank natural, thank natural balance)"
"List(i, put, 1, 2, teabags, of, easy, now, in, an, entire, teapot, let, it, steep, for, 20, 30, min, and, have, some, before, bed, no, matter, when, i, got, to, sleep, i, always, wake, up, 7, 8, hours, later, feeling, good, and, without, having, woken, up, in, the, middle, of, the, night, except, maybe, to, use, the, restroom, but, i, always, fall, back, asleep, instantly, its, also, great, anytime, during, the, day, to, just, calm, your, nerves, it, won, t, make, you, sleepy, this, is, easily, my, favorite, calming, tea, put 1, 1 2, 2 teabags, teabags easy, easy entire, entire teapot, teapot let, let steep, steep 20, 20 30, 30 min, min bed, bed matter, matter got, got sleep, sleep always, always wake, wake 7, 7 8, 8 hours, hours later, later feeling, feeling good, good without, without woken, woken middle, middle night, night except, except maybe, maybe use, use restroom, restroom always, always fall, fall back, back asleep, asleep instantly, instantly also, also great, great anytime, anytime day, day calm, calm nerves, nerves won, won make, make sleepy, sleepy easily, easily favorite, favorite calming, calming tea, put 1 2, 1 2 teabags, 2 teabags easy, teabags easy entire, easy entire teapot, entire teapot let, teapot let steep, let steep 20, steep 20 30, 20 30 min, 30 min bed, min bed matter, bed matter got, matter got sleep, got sleep always, sleep always wake, always wake 7, wake 7 8, 7 8 hours, 8 hours later, hours later feeling, later feeling good, feeling good without, good without woken, without woken middle, woken middle night, middle night except, night except maybe, except maybe use, maybe use restroom, use restroom always, restroom always fall, always fall back, fall back asleep, back asleep instantly, asleep instantly also, instantly also great, also great anytime, great anytime day, anytime day calm, day calm nerves, calm nerves won, nerves won make, won make sleepy, make sleepy easily, sleepy easily favorite, easily favorite calming, favorite 
calming tea)"
"List(this, is, a, chips, ahoy, flavor, called, ooey, gooey, chocofudge, so, it, should, already, be, pretty, clear, from, the, get, go, who, is, going, to, love, this, and, who, isn, t, the, cookies, are, soft, but, don, t, fall, apart, or, dissolve, in, your, mouth, they, are, definitely, gooey, but, not, overly, messy, br, br, obviously, they, aren, t, going, to, beat, homemade, fudge, cookies, fresh, out, of, the, oven, but, for, a, packaged, treat, in, the, goodie, aisle, of, the, supermarket, these, are, pretty, tasty, and, well, worth, the, purchase, add, a, little, milk, and, you, re, in, for, a, seriously, good, time, chips ahoy, ahoy flavor, flavor called, called ooey, ooey gooey, gooey chocofudge, chocofudge already, already pretty, pretty clear, clear get, get go, go going, going love, love isn, isn cookies, cookies soft, soft fall, fall apart, apart dissolve, dissolve mouth, mouth definitely, definitely gooey, gooey overly, overly messy, messy obviously, obviously aren, aren going, going beat, beat homemade, homemade fudge, fudge cookies, cookies fresh, fresh oven, oven packaged, packaged treat, treat goodie, goodie aisle, aisle supermarket, supermarket pretty, pretty tasty, tasty well, well worth, worth purchase, purchase add, add little, little milk, milk re, re seriously, seriously good, good time, chips ahoy flavor, ahoy flavor called, flavor called ooey, called ooey gooey, ooey gooey chocofudge, gooey chocofudge already, chocofudge already pretty, already pretty clear, pretty clear get, clear get go, get go going, go going love, going love isn, love isn cookies, isn cookies soft, cookies soft fall, soft fall apart, fall apart dissolve, apart dissolve mouth, dissolve mouth definitely, mouth definitely gooey, definitely gooey overly, gooey overly messy, overly messy obviously, messy obviously aren, obviously aren going, aren going beat, going beat homemade, beat homemade fudge, homemade fudge cookies, fudge cookies fresh, cookies fresh oven, fresh 
oven packaged, oven packaged treat, packaged treat goodie, treat goodie aisle, goodie aisle supermarket, aisle supermarket pretty, supermarket pretty tasty, pretty tasty well, tasty well worth, well worth purchase, worth purchase add, purchase add little, add little milk, little milk re, milk re seriously, re seriously good, seriously good time)"
"List(price, was, great, and, fast, shipping, convenience, was, also, key, in, this, purchase, definately, would, buy, this, product, and, others, again, price great, great fast, fast shipping, shipping convenience, convenience also, also key, key purchase, purchase definately, definately buy, buy product, product others, price great fast, great fast shipping, fast shipping convenience, shipping convenience also, convenience also key, also key purchase, key purchase definately, purchase definately buy, definately buy product, buy product others)"
"List(this, is, a, must, have, coffee, for, christmas, time, it, is, the, perfect, nutty, coconut, blend, the, taste, is, indescribably, delicious, must coffee, coffee christmas, christmas time, time perfect, perfect nutty, nutty coconut, coconut blend, blend taste, taste indescribably, indescribably delicious, must coffee christmas, coffee christmas time, christmas time perfect, time perfect nutty, perfect nutty coconut, nutty coconut blend, coconut blend taste, blend taste indescribably, taste indescribably delicious)"
"List(lorna, doone, s, is, a, delicious, buttery, shortbread, cookie, that, melts, in, your, mouth, if, eaten, alone, or, as, a, cookie, to, accompany, your, morning, coffee, or, tea, as, well, as, a, perfect, breakfast, treat, recently, in, october, through, warehouse, deals, fulfilled, by, amazon, and, shipped, for, free, i, picked, up, 2, cases, of, 12, boxes, that, are, 5oz, packages, each, with, approximately, 20, shortbread, cookies, per, package, for, 12, 99, case, 25, 98, for, 24, total, 5oz, sleeves, of, lorna, doone, s, i, point, this, price, comparison, out, to, help, educate, my, fellow, buyers, of, this, product, seeing, that, in, your, local, supermarket, they, re, retailing, for, around, 4, 49, 4, 99, for, each, 10oz, box, containing, 2, of, the, 20, count, cookie, packages, in, each, now, that, is, the, same, exact, equivalent, as, 2, of, the, 5oz, sleeve, packages, that, i, purchased, containing, 12, boxes, per, case, amazon, frequently, sells, the, 10oz, boxes, in, a, pack, of, 12, for, around, 46, 52, buying, it, my, way, for, around, 26, is, economically, smart, and, can, save, you, as, much, as, 50, off, the, listed, above, reference, have, fun, and, enjoy, lorna doone, doone delicious, delicious buttery, buttery shortbread, shortbread cookie, cookie melts, melts mouth, mouth eaten, eaten alone, alone cookie, cookie accompany, accompany morning, morning coffee, coffee tea, tea well, well perfect, perfect breakfast, breakfast treat, treat recently, recently october, october warehouse, warehouse deals, deals fulfilled, fulfilled amazon, amazon shipped, shipped free, free picked, picked 2, 2 cases, cases 12, 12 boxes, boxes 5oz, 5oz packages, packages approximately, approximately 20, 20 shortbread, shortbread cookies, cookies per, per package, package 12, 12 99, 99 case, case 25, 25 98, 98 24, 24 total, total 5oz, 5oz sleeves, sleeves lorna, lorna doone, doone point, point price, price comparison, comparison help, help educate, educate fellow, 
fellow buyers, buyers product, product seeing, seeing local, local supermarket, supermarket re, re retailing, retailing around, around 4, 4 49, 49 4, 4 99, 99 10oz, 10oz box, box containing, containing 2, 2 20, 20 count, count cookie, cookie packages, packages exact, exact equivalent, equivalent 2, 2 5oz, 5oz sleeve, sleeve packages, packages purchased, purchased containing, containing 12, 12 boxes, boxes per, per case, case amazon, amazon frequently, frequently sells, sells 10oz, 10oz boxes, boxes pack, pack 12, 12 around, around 46, 46 52, 52 buying, buying way, way around, around 26, 26 economically, economically smart, smart save, save much, much 50, 50 listed, listed reference, reference fun, fun enjoy, lorna doone delicious, doone delicious buttery, delicious buttery shortbread, buttery shortbread cookie, shortbread cookie melts, cookie melts mouth, melts mouth eaten, mouth eaten alone, eaten alone cookie, alone cookie accompany, cookie accompany morning, accompany morning coffee, morning coffee tea, coffee tea well, tea well perfect, well perfect breakfast, perfect breakfast treat, breakfast treat recently, treat recently october, recently october warehouse, october warehouse deals, warehouse deals fulfilled, deals fulfilled amazon, fulfilled amazon shipped, amazon shipped free, shipped free picked, free picked 2, picked 2 cases, 2 cases 12, cases 12 boxes, 12 boxes 5oz, boxes 5oz packages, 5oz packages approximately, packages approximately 20, approximately 20 shortbread, 20 shortbread cookies, shortbread cookies per, cookies per package, per package 12, package 12 99, 12 99 case, 99 case 25, case 25 98, 25 98 24, 98 24 total, 24 total 5oz, total 5oz sleeves, 5oz sleeves lorna, sleeves lorna doone, lorna doone point, doone point price, point price comparison, price comparison help, comparison help educate, help educate fellow, educate fellow buyers, fellow buyers product, buyers product seeing, product seeing local, seeing local supermarket, local 
supermarket re, supermarket re retailing, re retailing around, retailing around 4, around 4 49, 4 49 4, 49 4 99, 4 99 10oz, 99 10oz box, 10oz box containing, box containing 2, containing 2 20, 2 20 count, 20 count cookie, count cookie packages, cookie packages exact, packages exact equivalent, exact equivalent 2, equivalent 2 5oz, 2 5oz sleeve, 5oz sleeve packages, sleeve packages purchased, packages purchased containing, purchased containing 12, containing 12 boxes, 12 boxes per, boxes per case, per case amazon, case amazon frequently, amazon frequently sells, frequently sells 10oz, sells 10oz boxes, 10oz boxes pack, boxes pack 12, pack 12 around, 12 around 46, around 46 52, 46 52 buying, 52 buying way, buying way around, way around 26, around 26 economically, 26 economically smart, economically smart save, smart save much, save much 50, much 50 listed, 50 listed reference, listed reference fun, reference fun enjoy)"
"List(we, were, very, pleased, with, the, preserves, they, are, very, tasty, and, not, too, sweet, as, some, are, i, am, glad, we, made, the, purchase, pleased preserves, preserves tasty, tasty sweet, sweet glad, glad made, made purchase, pleased preserves tasty, preserves tasty sweet, tasty sweet glad, sweet glad made, glad made purchase)"
"List(very, high, in, protein, and, high, in, convenience, factor, not, as, high, in, fat, as, slim, jims, but, not, nearly, as, spicy, high protein, protein high, high convenience, convenience factor, factor high, high fat, fat slim, slim jims, jims nearly, nearly spicy, high protein high, protein high convenience, high convenience factor, convenience factor high, factor high fat, high fat slim, fat slim jims, slim jims nearly, jims nearly spicy)"
"List(i, liked, this, cereal, it, has, good, mouth, feel, and, it, s, not, too, sweet, it, stays, crunchy, for, a, reasonable, time, in, cereal, the, brown, sugar, flavor, is, there, but, subtle, and, i, suppose, all, that, fiber, is, good, my, husband, and, kids, all, passed, on, it, though, because, it, wasn, t, sweet, enough, for, their, tastes, liked cereal, cereal good, good mouth, mouth feel, feel sweet, sweet stays, stays crunchy, crunchy reasonable, reasonable time, time cereal, cereal brown, brown sugar, sugar flavor, flavor subtle, subtle suppose, suppose fiber, fiber good, good husband, husband kids, kids passed, passed though, though wasn, wasn sweet, sweet enough, enough tastes, liked cereal good, cereal good mouth, good mouth feel, mouth feel sweet, feel sweet stays, sweet stays crunchy, stays crunchy reasonable, crunchy reasonable time, reasonable time cereal, time cereal brown, cereal brown sugar, brown sugar flavor, sugar flavor subtle, flavor subtle suppose, subtle suppose fiber, suppose fiber good, fiber good husband, good husband kids, husband kids passed, kids passed though, passed though wasn, though wasn sweet, wasn sweet enough, sweet enough tastes)"
"List(i, bought, the, walnut, butter, for, the, first, time, and, it, was, one, of, the, best, butter, i, have, tested, br, i, am, becoming, a, reguliar, buyer, of, fastachi, products, because, they, are, the, freshest, in, the, market, br, the, pistachio, butter, is, also, wonderfull, but, a, bit, expensive, so, i, will, buy, twice, more, walnut, butter, than, pistachio, and, i, am, going, to, try, their, hazelnut, now, good, job, fastachi, bought walnut, walnut butter, butter first, first time, time one, one best, best butter, butter tested, tested becoming, becoming reguliar, reguliar buyer, buyer fastachi, fastachi products, products freshest, freshest market, market pistachio, pistachio butter, butter also, also wonderfull, wonderfull bit, bit expensive, expensive buy, buy twice, twice walnut, walnut butter, butter pistachio, pistachio going, going try, try hazelnut, hazelnut good, good job, job fastachi, bought walnut butter, walnut butter first, butter first time, first time one, time one best, one best butter, best butter tested, butter tested becoming, tested becoming reguliar, becoming reguliar buyer, reguliar buyer fastachi, buyer fastachi products, fastachi products freshest, products freshest market, freshest market pistachio, market pistachio butter, pistachio butter also, butter also wonderfull, also wonderfull bit, wonderfull bit expensive, bit expensive buy, expensive buy twice, buy twice walnut, twice walnut butter, walnut butter pistachio, butter pistachio going, pistachio going try, going try hazelnut, try hazelnut good, hazelnut good job, good job fastachi)"


Just as we looked at the top tokens in our dataset, we can now use the `ngramDF` you created above to look at the most common n-grams (`n = 2` and `n = 3`).

In [0]:
# Most frequent n-grams (n > 1), sorted by count in descending order
from pyspark.sql.functions import col, explode, size, split

ngramDist = (
    ngramDF.withColumn("indivNGrams", explode(col("ngrams")))
    .filter(size(split("indivNGrams", " ")) > 1)  # only keep ngrams with n>1
    .groupBy("indivNGrams")
    .count()
    .sort(col("count").desc())
)

display(ngramDist)

indivNGrams,count
gluten free,16435
amazon gp,16012
gp product,16007
amazon gp product,16005
k cups,15097
taste like,14921
highly recommend,14564
ve tried,14417
peanut butter,14057
dog food,13162
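
The explode, filter, group, and count steps in the cell above can be mimicked on a tiny in-memory example, which may help when checking the logic without a cluster. The rows below are invented for illustration, and `collections.Counter` stands in for `groupBy(...).count()`.

```python
from collections import Counter

# Hypothetical "ngrams" column: one list of unigrams, bigrams, and trigrams per review
rows = [
    ["gluten", "free", "gluten free"],
    ["highly", "recommend", "product",
     "highly recommend", "recommend product", "highly recommend product"],
    ["gluten", "free", "bread",
     "gluten free", "free bread", "gluten free bread"],
]

# explode: one entry per n-gram across all rows
exploded = [gram for row in rows for gram in row]

# filter: keep only n-grams made of more than one token
multi_word = [gram for gram in exploded if len(gram.split(" ")) > 1]

# groupBy + count, then sort by descending count
counts = Counter(multi_word)
top = counts.most_common(3)
print(top[0])  # ('gluten free', 2)
```

Splitting on a single space works here because each n-gram is a space-delimited string, the same assumption the `size(split(...)) > 1` filter relies on.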


What do you notice about the frequent n-grams? Could they be important in text processing?

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>