# Indiegogo Notebook

This notebook (created on Databricks) shows how to train an ML model with Spark and how to use Spark SQL to clean and prepare the dataset.

In [0]:
%sql 

select project_id, title, tagline from default.indiegogo

project_id,title,tagline
1166581,Super Troopers 2,"The #SuperTroopers2 campaign is over, but the movie will be out in theaters on 4/20/18!"
1143140,Con Man,A new comedy from Alan Tudyk and Nathan Fillion produced by YOU!
2558245,The Camera Pack: Peter McKinnon X NOMATIC,"A Functional Camera Pack for all types of travelers! Just you, one bag, and the adventure!"
2650630,The Book,The Ultimate Guide To Rebuilding A Civilization. Over 400 pages of detailed illustrations.
1676513,Code 8 - a film from Robbie & Stephen Amell,Help Robbie & Stephen Amell make their first feature film together!
814501,Lazer Team by Rooster Teeth,Rooster Teeth is making its first feature length movie and we need your help!
2656903,GENKI: ShadowCast for the Nintendo Switch PS5 Xbox,"The easy way to play console games on your laptop. No TV needed to play, stream, or record gameplay."
731457,Gosnell Movie,A historic crowdfunding campaign for a movie about America's biggest serial killer Kermit Gosnell.
1383172,Indivisible - RPG from the Creators of Skullgirls,Indivisible - A 2D Action RPG from Lab Zero PS4/XB1/Win/Mac/Linux
2543163,GENKI: Covert Dock for the Nintendo Switch,A stealth dock hidden in a portable GaN-charger. Set your dock free and make any TV your playground.


This notebook takes a project's text as input and predicts the category of the project.
First of all, we check the distribution of our target.

In [0]:
%sql 
select count(distinct project_id) from default.indiegogo

count(DISTINCT project_id)
4600


In [0]:
%sql

select category, count(distinct project_id) num from default.indiegogo group by category order by num desc

category,num
Film,2267
Music,624
Art,302
Comics,287
Writing & Publishing,286
Video Games,274
Dance & Theater,178
Tabletop Games,172
Web Series & TV Shows,72
Creative Works,64


Create a new category column that aggregates the projects into a few main buckets.

In [0]:
%sql
-- drop table ds_projects_categories_class;
CREATE TABLE ds_projects_categories_class AS (
select project_id, title, tagline as description, category as old_category,
  case
    when category like '%Photography%' then 'Art'
    when category like '%Podcasts%' then 'Digital Product'
    when category like '%Film%' then 'Film'
    when category like '%Music%' then 'Music'
    when category like '%Art%' then 'Art'
    when category like '%Comics%' then 'Comics'
    when category like '%Writing & Publishing%' then 'Art'
    when category like '%Video Games%' then 'Digital Product'
    when category like '%Games%' then 'Digital Product'
    when category like '%Web%'  then 'Digital Product'
    when category like '%Creative%' then 'Art'
    when category like '%Blogs%' then 'Digital Product'
    when category like '%Dance%' then 'Art'
    when category like '%Theater%' then 'Art'
  else 'Others'
  end as category
  from default.indiegogo)

num_affected_rows,num_inserted_rows
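
A `case` expression returns the result of the first `when` branch that matches, so the order of the branches above matters: `Video Games` is caught by `%Video Games%` before `%Games%` is ever tested. The plain-Python sketch below mirrors that first-match behaviour. `RULES` and `map_category` are illustrative names that do not exist in the notebook, and `like '%x%'` is approximated with a case-sensitive substring test.

```python
# Ordered (substring, new_category) rules, mirroring the SQL CASE branches.
RULES = [
    ("Photography", "Art"),
    ("Podcasts", "Digital Product"),
    ("Film", "Film"),
    ("Music", "Music"),
    ("Art", "Art"),
    ("Comics", "Comics"),
    ("Writing & Publishing", "Art"),
    ("Video Games", "Digital Product"),
    ("Games", "Digital Product"),
    ("Web", "Digital Product"),
    ("Creative", "Art"),
    ("Blogs", "Digital Product"),
    ("Dance", "Art"),
    ("Theater", "Art"),
]

def map_category(category):
    """Return the target of the first rule whose substring occurs in category."""
    if category is None:
        # In SQL a null never satisfies LIKE, so it falls through to ELSE.
        return "Others"
    for needle, target in RULES:
        if needle in category:
            return target
    return "Others"

print(map_category("Tabletop Games"))         # Digital Product
print(map_category("Web Series & TV Shows"))  # Digital Product
print(map_category("Dance & Theater"))        # Art
```

Keeping the rules in an ordered list makes the precedence explicit and easy to test before it is translated into SQL.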


In [0]:
%sql
select category, count(distinct project_id) num from ds_projects_categories_class group by category order by num desc

category,num
Film,2267
Art,891
Music,624
Digital Product,531
Comics,287


Our data has an imbalanced class distribution: Film alone accounts for about half of the 4600 projects.

In [0]:
ds_dataset = spark.sql("select * from ds_projects_categories where category <> 'Others'").dropDuplicates(['project_id'])

In [0]:
ds_dataset.createOrReplaceTempView("ds_dataset")

With Spark, we can create a custom temp view on our dataset and use Spark SQL to query the data.

In [0]:
%sql

select category, count(*) num from ds_dataset group by category order by num desc

category,num
Product,1605
Illustration,1235
Games,796
Music,222


Now we can split the data into a training set and a test set.
The model learns patterns from the training data; the test data, which the model does not see during training, is used to measure how well it predicts new examples.

In [0]:
train, test = ds_dataset.randomSplit([0.9, 0.1], seed=12345)
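
As a rough mental model (not Spark's actual implementation), a weighted random split assigns each row to one side with a seeded random draw. The resulting sizes are therefore only approximately 90/10, and reusing the seed on the same input reproduces the same split. `random_split` is a hypothetical plain-Python helper written for this illustration.

```python
import random

def random_split(rows, train_weight=0.9, seed=12345):
    """Assign each row to train or test with an independent seeded draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # A draw below train_weight sends the row to the training side.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))  # roughly 900 and 100, not exactly
```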

Our goal is to create a pipeline that takes a dataset as input and performs these tasks:
- select the relevant features
- apply a tokenization method to split the description into single words
- filter the tokens to remove noise
- create n-grams from the text
- train a Word2Vec model
- train a NaiveBayes model
- evaluate our performance

First, let's select only the important features. We will use the title and the description to predict the category.
During this step we also strip the digits from the text; lowercasing is applied later by the tokenizer.

To remove noise of this kind, we often use regular expressions:
https://en.wikipedia.org/wiki/Regular_expression

In [0]:
train.createOrReplaceTempView("train")
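# Concatenate title and description, then replace every digit with a space.
# Note: the separator written as '' '' yields an empty string, so the two fields
# are joined without a space (see "disabilitiesAmicis" in the output); use ' ' to insert one.
train_cleaned = spark.sql("""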
  select regexp_replace(title || '' '' || description, '[0-9]', ' ') as description_cleaned, category as target from train
  """).filter("description_cleaned is not null")
display(train_cleaned.select("*").limit(10))

description_cleaned,target
Social media for people with disabilitiesAmicis is a social media platform for people with disabilities. Where you can talk about your disability with others like you.,Product
"BUUZA!! Vol : In the Land of Spider SilkThe Award-Winning LGBTQ Slice of Life Urban Fantasy Webcomic, BUUZA!!, returns for a second volume!",Illustration
"The Lords of VlacholdA new faction for Battleground Fantasy Warfare, featuring the artwork of Natalie Bernard and Argent Arts.",Games
C Crystallized Star Lights ( series matchless stars)These are Crystal Star Christmas light replicas from the Fourties. This will also update the old style with new technology.,Product
"Magical Beings Enamel PinsCute foxes, monsters, magic, and coy themed hard enamel pins!",Illustration
The Justice Farm: Kernel of JusticeA Children's Book On The Power Of Teamwork.,Illustration
Extended Ergonomic FootrestEnjoy comfortably working at your desk as our footrest takes pressure off of your lower back. Its long length makes it suitable for all,Product
"Cat's Cradle: A Fantasy Town for e and Other RPG SystemsA three-book set for adventures in the town of Cat's Cradle: a sourcebook, adventure book, and NPC book. For E and other fantasy RPGs.",Games
BADDA MOON RISING. The new novel by Ian JarvisThe fourth Ian Jarvis novel in the Quist and Watson series of humorous detective mysteries.,Product
"Refillable Mini Scuba Tank - Refills With a Hand PumpDive Portable Lungs is lightweight, portable, refillable via hand pump and gives you up to min underwater",Product

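The digit-stripping step can be tried out in plain Python with the standard `re` module before running it in Spark SQL. This is only a sketch of the same regular expression; `strip_digits` is a hypothetical helper and the input string is made up.

```python
import re

def strip_digits(text):
    """Replace every digit with a single space, like regexp_replace(col, '[0-9]', ' ')."""
    return re.sub(r"[0-9]", " ", text)

cleaned = strip_digits("mp3 player 4k")
print(repr(cleaned))  # each digit becomes exactly one space
```

Replacing digits with a space rather than an empty string keeps the neighbouring words apart; the extra spaces are harmless because the tokenizer later discards empty tokens.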

### Pre-processing (clean text and reduce features)

Spark has two tokenizer components (both are transformers):
- Tokenizer (the default one)
- RegexTokenizer

We use RegexTokenizer with the pattern `\W`, which splits the description on every non-word character, so whitespace and special characters (such as |, /, !) are removed in the same step.

Each Spark transformer usually has:
- an inputCol: the column used as input during the transformation
- an outputCol: the column created as output of the transformer
- a transform method that takes a dataset as input and produces a new dataset with all the columns of the incoming dataset plus the outputCol

In [0]:
from pyspark.ml.feature import RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

regexTokenizer = RegexTokenizer(inputCol="description_cleaned", outputCol="words", pattern="\\W")

# check our results
countTokens = udf(lambda words: len(words), IntegerType())
tokenized = regexTokenizer.transform(train_cleaned)
tokenized_counts = tokenized.select("description_cleaned", "words").withColumn("tokens", countTokens(col("words")))
display(tokenized_counts.select("*").limit(10))

description_cleaned,words,tokens
Social media for people with disabilitiesAmicis is a social media platform for people with disabilities. Where you can talk about your disability with others like you.,"List(social, media, for, people, with, disabilitiesamicis, is, a, social, media, platform, for, people, with, disabilities, where, you, can, talk, about, your, disability, with, others, like, you)",26
"BUUZA!! Vol : In the Land of Spider SilkThe Award-Winning LGBTQ Slice of Life Urban Fantasy Webcomic, BUUZA!!, returns for a second volume!","List(buuza, vol, in, the, land, of, spider, silkthe, award, winning, lgbtq, slice, of, life, urban, fantasy, webcomic, buuza, returns, for, a, second, volume)",23
"The Lords of VlacholdA new faction for Battleground Fantasy Warfare, featuring the artwork of Natalie Bernard and Argent Arts.","List(the, lords, of, vlacholda, new, faction, for, battleground, fantasy, warfare, featuring, the, artwork, of, natalie, bernard, and, argent, arts)",19
C Crystallized Star Lights ( series matchless stars)These are Crystal Star Christmas light replicas from the Fourties. This will also update the old style with new technology.,"List(c, crystallized, star, lights, series, matchless, stars, these, are, crystal, star, christmas, light, replicas, from, the, fourties, this, will, also, update, the, old, style, with, new, technology)",27
"Magical Beings Enamel PinsCute foxes, monsters, magic, and coy themed hard enamel pins!","List(magical, beings, enamel, pinscute, foxes, monsters, magic, and, coy, themed, hard, enamel, pins)",13
The Justice Farm: Kernel of JusticeA Children's Book On The Power Of Teamwork.,"List(the, justice, farm, kernel, of, justicea, children, s, book, on, the, power, of, teamwork)",14
Extended Ergonomic FootrestEnjoy comfortably working at your desk as our footrest takes pressure off of your lower back. Its long length makes it suitable for all,"List(extended, ergonomic, footrestenjoy, comfortably, working, at, your, desk, as, our, footrest, takes, pressure, off, of, your, lower, back, its, long, length, makes, it, suitable, for, all)",26
"Cat's Cradle: A Fantasy Town for e and Other RPG SystemsA three-book set for adventures in the town of Cat's Cradle: a sourcebook, adventure book, and NPC book. For E and other fantasy RPGs.","List(cat, s, cradle, a, fantasy, town, for, e, and, other, rpg, systemsa, three, book, set, for, adventures, in, the, town, of, cat, s, cradle, a, sourcebook, adventure, book, and, npc, book, for, e, and, other, fantasy, rpgs)",37
BADDA MOON RISING. The new novel by Ian JarvisThe fourth Ian Jarvis novel in the Quist and Watson series of humorous detective mysteries.,"List(badda, moon, rising, the, new, novel, by, ian, jarvisthe, fourth, ian, jarvis, novel, in, the, quist, and, watson, series, of, humorous, detective, mysteries)",23
"Refillable Mini Scuba Tank - Refills With a Hand PumpDive Portable Lungs is lightweight, portable, refillable via hand pump and gives you up to min underwater","List(refillable, mini, scuba, tank, refills, with, a, hand, pumpdive, portable, lungs, is, lightweight, portable, refillable, via, hand, pump, and, gives, you, up, to, min, underwater)",25

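To see why the tokens look the way they do, the tokenizer's behaviour can be approximated in plain Python: lowercase the text, split on the pattern, and drop empty tokens. This is a sketch under the assumption that RegexTokenizer runs with its defaults (`gaps=True`, `toLowercase=True`, `minTokenLength=1`). `regex_tokenize` is a hypothetical helper, and Python's `\W` may treat non-ASCII characters differently from the JVM regex engine.

```python
import re

def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Lowercase, split on the pattern, and drop tokens shorter than min_token_length."""
    return [t for t in re.split(pattern, text.lower()) if len(t) >= min_token_length]

tokens = regex_tokenize(
    "The Justice Farm: Kernel of JusticeA Children's Book On The Power Of Teamwork."
)
print(tokens)
print(len(tokens))  # 14
```

The apostrophe in "Children's" is a non-word character, which explains the stray `s` token in the output above.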

There is a lot of noise. *What to do?*

Let's check the distribution of the words (by plotting the top 20 tokens by frequency).

In [0]:
from pyspark.sql.functions import explode, desc
tokens = tokenized.select(explode(col("words")).alias("word")).groupBy(col("word")).count().orderBy(desc("count"))
display(tokens.select("*").limit(20))

word,count
the,2435
and,1861
a,1767
of,1628
to,1215
for,886
in,866
with,686
your,521
s,492
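
The explode, groupBy, count chain above is a distributed word count. On a small sample the same idea can be sketched with `collections.Counter`; the `docs` list is toy data invented for the example.

```python
from collections import Counter

# Each inner list plays the role of one row of the "words" column.
docs = [
    ["the", "lords", "of", "the", "ring"],
    ["a", "book", "of", "the", "sea"],
]

# Flatten all rows into one stream of words (the "explode" step) and count them.
counts = Counter(word for doc in docs for word in doc)
top = counts.most_common(2)
print(top)  # [('the', 3), ('of', 2)]
```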


### StopWordsRemover
We can remove stopwords, *i.e. common terms in documents or language that fill our analysis with noise.*

http://spark.apache.org/docs/latest/ml-features.html#stopwordsremover

StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stopwords is specified by the stopWords parameter. Default stop words for some languages are accessible by calling StopWordsRemover.loadDefaultStopWords(language), for which available options are “danish”, “dutch”, “english”, “finnish”, “french”, “german”, “hungarian”, “italian”, “norwegian”, “portuguese”, “russian”, “spanish”, “swedish” and “turkish”. A boolean parameter caseSensitive indicates if the matches should be case sensitive (false by default).

In [0]:
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol="words", outputCol="cleaned", caseSensitive=False)

In [0]:
remover.setStopWords(StopWordsRemover.loadDefaultStopWords("english"))
cleaned = remover.transform(tokenized)
display(cleaned)

description_cleaned,target,words,cleaned
Social media for people with disabilitiesAmicis is a social media platform for people with disabilities. Where you can talk about your disability with others like you.,Product,"List(social, media, for, people, with, disabilitiesamicis, is, a, social, media, platform, for, people, with, disabilities, where, you, can, talk, about, your, disability, with, others, like, you)","List(social, media, people, disabilitiesamicis, social, media, platform, people, disabilities, talk, disability, others, like)"
"BUUZA!! Vol : In the Land of Spider SilkThe Award-Winning LGBTQ Slice of Life Urban Fantasy Webcomic, BUUZA!!, returns for a second volume!",Illustration,"List(buuza, vol, in, the, land, of, spider, silkthe, award, winning, lgbtq, slice, of, life, urban, fantasy, webcomic, buuza, returns, for, a, second, volume)","List(buuza, vol, land, spider, silkthe, award, winning, lgbtq, slice, life, urban, fantasy, webcomic, buuza, returns, second, volume)"
"The Lords of VlacholdA new faction for Battleground Fantasy Warfare, featuring the artwork of Natalie Bernard and Argent Arts.",Games,"List(the, lords, of, vlacholda, new, faction, for, battleground, fantasy, warfare, featuring, the, artwork, of, natalie, bernard, and, argent, arts)","List(lords, vlacholda, new, faction, battleground, fantasy, warfare, featuring, artwork, natalie, bernard, argent, arts)"
C Crystallized Star Lights ( series matchless stars)These are Crystal Star Christmas light replicas from the Fourties. This will also update the old style with new technology.,Product,"List(c, crystallized, star, lights, series, matchless, stars, these, are, crystal, star, christmas, light, replicas, from, the, fourties, this, will, also, update, the, old, style, with, new, technology)","List(c, crystallized, star, lights, series, matchless, stars, crystal, star, christmas, light, replicas, fourties, also, update, old, style, new, technology)"
"Magical Beings Enamel PinsCute foxes, monsters, magic, and coy themed hard enamel pins!",Illustration,"List(magical, beings, enamel, pinscute, foxes, monsters, magic, and, coy, themed, hard, enamel, pins)","List(magical, beings, enamel, pinscute, foxes, monsters, magic, coy, themed, hard, enamel, pins)"
The Justice Farm: Kernel of JusticeA Children's Book On The Power Of Teamwork.,Illustration,"List(the, justice, farm, kernel, of, justicea, children, s, book, on, the, power, of, teamwork)","List(justice, farm, kernel, justicea, children, book, power, teamwork)"
Extended Ergonomic FootrestEnjoy comfortably working at your desk as our footrest takes pressure off of your lower back. Its long length makes it suitable for all,Product,"List(extended, ergonomic, footrestenjoy, comfortably, working, at, your, desk, as, our, footrest, takes, pressure, off, of, your, lower, back, its, long, length, makes, it, suitable, for, all)","List(extended, ergonomic, footrestenjoy, comfortably, working, desk, footrest, takes, pressure, lower, back, long, length, makes, suitable)"
"Cat's Cradle: A Fantasy Town for e and Other RPG SystemsA three-book set for adventures in the town of Cat's Cradle: a sourcebook, adventure book, and NPC book. For E and other fantasy RPGs.",Games,"List(cat, s, cradle, a, fantasy, town, for, e, and, other, rpg, systemsa, three, book, set, for, adventures, in, the, town, of, cat, s, cradle, a, sourcebook, adventure, book, and, npc, book, for, e, and, other, fantasy, rpgs)","List(cat, cradle, fantasy, town, e, rpg, systemsa, three, book, set, adventures, town, cat, cradle, sourcebook, adventure, book, npc, book, e, fantasy, rpgs)"
BADDA MOON RISING. The new novel by Ian JarvisThe fourth Ian Jarvis novel in the Quist and Watson series of humorous detective mysteries.,Product,"List(badda, moon, rising, the, new, novel, by, ian, jarvisthe, fourth, ian, jarvis, novel, in, the, quist, and, watson, series, of, humorous, detective, mysteries)","List(badda, moon, rising, new, novel, ian, jarvisthe, fourth, ian, jarvis, novel, quist, watson, series, humorous, detective, mysteries)"
"Refillable Mini Scuba Tank - Refills With a Hand PumpDive Portable Lungs is lightweight, portable, refillable via hand pump and gives you up to min underwater",Product,"List(refillable, mini, scuba, tank, refills, with, a, hand, pumpdive, portable, lungs, is, lightweight, portable, refillable, via, hand, pump, and, gives, you, up, to, min, underwater)","List(refillable, mini, scuba, tank, refills, hand, pumpdive, portable, lungs, lightweight, portable, refillable, via, hand, pump, gives, min, underwater)"

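Conceptually, stop word removal is a membership test against a word list, optionally ignoring case. The sketch below uses a tiny hand-written `STOP_WORDS` set that is only illustrative, not Spark's default English list, and `remove_stop_words` is a hypothetical helper.

```python
STOP_WORDS = {"the", "of", "a", "on", "s"}  # tiny illustrative list

def remove_stop_words(tokens, stop_words=STOP_WORDS, case_sensitive=False):
    """Drop every token that appears in stop_words."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]

tokens = ["the", "justice", "farm", "kernel", "of", "justicea", "children",
          "s", "book", "on", "the", "power", "of", "teamwork"]
kept = remove_stop_words(tokens)
print(kept)
# ['justice', 'farm', 'kernel', 'justicea', 'children', 'book', 'power', 'teamwork']
```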

Let's check the effect of the cleaning on the 20 most frequent words.

In [0]:
tokens = cleaned.select(explode(col("cleaned")).alias("word")).groupBy(col("word")).count().orderBy(desc("count"))
display(tokens.select("*").limit(20))

word,count
enamel,419
game,375
book,306
new,302
pins,236
d,200
pin,198
de,187
world,178
art,176


We still have some noise...

What to do now?

Can we remove terms that are too short?

How?

In Spark, it's quite easy to define a new function.

http://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html

User-Defined Functions (UDFs) are user-programmable routines that act on one row.

In [0]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# define our custom function to remove too short terms
def filter_by_len(words):
  filtered = [word for word in words if len(word) >= 2]
  return filtered

# register our function as udf
filter_by_len_udf = udf(filter_by_len, ArrayType(StringType()))
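
Because a Python UDF wraps an ordinary function, its logic can be checked on plain lists before it touches a DataFrame. The variant below is a sketch with two assumptions that are not in the notebook: the minimum length is a parameter (`min_len`), and a null input is treated as an empty list. `filter_by_min_len` is a hypothetical name.

```python
def filter_by_min_len(words, min_len=2):
    """Keep tokens with at least min_len characters; treat None as an empty list."""
    if words is None:
        return []
    return [w for w in words if len(w) >= min_len]

print(filter_by_min_len(["c", "crystallized", "de", "d"]))  # ['crystallized', 'de']
print(filter_by_min_len(None))                              # []
```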

In [0]:
filtered = cleaned.withColumn("filtered", filter_by_len_udf(col("cleaned")))
display(filtered)

description_cleaned,target,words,cleaned,filtered
Social media for people with disabilitiesAmicis is a social media platform for people with disabilities. Where you can talk about your disability with others like you.,Product,"List(social, media, for, people, with, disabilitiesamicis, is, a, social, media, platform, for, people, with, disabilities, where, you, can, talk, about, your, disability, with, others, like, you)","List(social, media, people, disabilitiesamicis, social, media, platform, people, disabilities, talk, disability, others, like)","List(social, media, people, disabilitiesamicis, social, media, platform, people, disabilities, talk, disability, others, like)"
"BUUZA!! Vol : In the Land of Spider SilkThe Award-Winning LGBTQ Slice of Life Urban Fantasy Webcomic, BUUZA!!, returns for a second volume!",Illustration,"List(buuza, vol, in, the, land, of, spider, silkthe, award, winning, lgbtq, slice, of, life, urban, fantasy, webcomic, buuza, returns, for, a, second, volume)","List(buuza, vol, land, spider, silkthe, award, winning, lgbtq, slice, life, urban, fantasy, webcomic, buuza, returns, second, volume)","List(buuza, vol, land, spider, silkthe, award, winning, lgbtq, slice, life, urban, fantasy, webcomic, buuza, returns, second, volume)"
"The Lords of VlacholdA new faction for Battleground Fantasy Warfare, featuring the artwork of Natalie Bernard and Argent Arts.",Games,"List(the, lords, of, vlacholda, new, faction, for, battleground, fantasy, warfare, featuring, the, artwork, of, natalie, bernard, and, argent, arts)","List(lords, vlacholda, new, faction, battleground, fantasy, warfare, featuring, artwork, natalie, bernard, argent, arts)","List(lords, vlacholda, new, faction, battleground, fantasy, warfare, featuring, artwork, natalie, bernard, argent, arts)"
C Crystallized Star Lights ( series matchless stars)These are Crystal Star Christmas light replicas from the Fourties. This will also update the old style with new technology.,Product,"List(c, crystallized, star, lights, series, matchless, stars, these, are, crystal, star, christmas, light, replicas, from, the, fourties, this, will, also, update, the, old, style, with, new, technology)","List(c, crystallized, star, lights, series, matchless, stars, crystal, star, christmas, light, replicas, fourties, also, update, old, style, new, technology)","List(crystallized, star, lights, series, matchless, stars, crystal, star, christmas, light, replicas, fourties, also, update, old, style, new, technology)"
"Magical Beings Enamel PinsCute foxes, monsters, magic, and coy themed hard enamel pins!",Illustration,"List(magical, beings, enamel, pinscute, foxes, monsters, magic, and, coy, themed, hard, enamel, pins)","List(magical, beings, enamel, pinscute, foxes, monsters, magic, coy, themed, hard, enamel, pins)","List(magical, beings, enamel, pinscute, foxes, monsters, magic, coy, themed, hard, enamel, pins)"
The Justice Farm: Kernel of JusticeA Children's Book On The Power Of Teamwork.,Illustration,"List(the, justice, farm, kernel, of, justicea, children, s, book, on, the, power, of, teamwork)","List(justice, farm, kernel, justicea, children, book, power, teamwork)","List(justice, farm, kernel, justicea, children, book, power, teamwork)"
Extended Ergonomic FootrestEnjoy comfortably working at your desk as our footrest takes pressure off of your lower back. Its long length makes it suitable for all,Product,"List(extended, ergonomic, footrestenjoy, comfortably, working, at, your, desk, as, our, footrest, takes, pressure, off, of, your, lower, back, its, long, length, makes, it, suitable, for, all)","List(extended, ergonomic, footrestenjoy, comfortably, working, desk, footrest, takes, pressure, lower, back, long, length, makes, suitable)","List(extended, ergonomic, footrestenjoy, comfortably, working, desk, footrest, takes, pressure, lower, back, long, length, makes, suitable)"
"Cat's Cradle: A Fantasy Town for e and Other RPG SystemsA three-book set for adventures in the town of Cat's Cradle: a sourcebook, adventure book, and NPC book. For E and other fantasy RPGs.",Games,"List(cat, s, cradle, a, fantasy, town, for, e, and, other, rpg, systemsa, three, book, set, for, adventures, in, the, town, of, cat, s, cradle, a, sourcebook, adventure, book, and, npc, book, for, e, and, other, fantasy, rpgs)","List(cat, cradle, fantasy, town, e, rpg, systemsa, three, book, set, adventures, town, cat, cradle, sourcebook, adventure, book, npc, book, e, fantasy, rpgs)","List(cat, cradle, fantasy, town, rpg, systemsa, three, book, set, adventures, town, cat, cradle, sourcebook, adventure, book, npc, book, fantasy, rpgs)"
BADDA MOON RISING. The new novel by Ian JarvisThe fourth Ian Jarvis novel in the Quist and Watson series of humorous detective mysteries.,Product,"List(badda, moon, rising, the, new, novel, by, ian, jarvisthe, fourth, ian, jarvis, novel, in, the, quist, and, watson, series, of, humorous, detective, mysteries)","List(badda, moon, rising, new, novel, ian, jarvisthe, fourth, ian, jarvis, novel, quist, watson, series, humorous, detective, mysteries)","List(badda, moon, rising, new, novel, ian, jarvisthe, fourth, ian, jarvis, novel, quist, watson, series, humorous, detective, mysteries)"
"Refillable Mini Scuba Tank - Refills With a Hand PumpDive Portable Lungs is lightweight, portable, refillable via hand pump and gives you up to min underwater",Product,"List(refillable, mini, scuba, tank, refills, with, a, hand, pumpdive, portable, lungs, is, lightweight, portable, refillable, via, hand, pump, and, gives, you, up, to, min, underwater)","List(refillable, mini, scuba, tank, refills, hand, pumpdive, portable, lungs, lightweight, portable, refillable, via, hand, pump, gives, min, underwater)","List(refillable, mini, scuba, tank, refills, hand, pumpdive, portable, lungs, lightweight, portable, refillable, via, hand, pump, gives, min, underwater)"


In [0]:
tokens = filtered.select(explode(col("filtered")).alias("word")).groupBy(col("word")).count().orderBy(desc("count"))
display(tokens.select("*").limit(20))

word,count
enamel,419
game,375
book,306
new,302
pins,236
pin,198
de,187
world,178
art,176
series,170


### Ngrams

An n-gram is a sequence of n tokens for some integer *n*.
The NGram component can be used to transform input features into n-grams.

Spark NGram takes as input a sequence of strings.

We will create ngrams for n=2.

In [0]:
from pyspark.ml.feature import NGram
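# Note: inputCol is "cleaned", so the bigrams still contain one-character tokens
# (e.g. "c crystallized"); use inputCol="filtered" to build them from the length-filtered tokens.
ngrams2 = NGram(n=2, inputCol="cleaned", outputCol="ngrams_2")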
ngrams = ngrams2.transform(filtered)

display(ngrams)

description_cleaned,target,words,cleaned,filtered,ngrams_2
Social media for people with disabilitiesAmicis is a social media platform for people with disabilities. Where you can talk about your disability with others like you.,Product,"List(social, media, for, people, with, disabilitiesamicis, is, a, social, media, platform, for, people, with, disabilities, where, you, can, talk, about, your, disability, with, others, like, you)","List(social, media, people, disabilitiesamicis, social, media, platform, people, disabilities, talk, disability, others, like)","List(social, media, people, disabilitiesamicis, social, media, platform, people, disabilities, talk, disability, others, like)","List(social media, media people, people disabilitiesamicis, disabilitiesamicis social, social media, media platform, platform people, people disabilities, disabilities talk, talk disability, disability others, others like)"
"BUUZA!! Vol : In the Land of Spider SilkThe Award-Winning LGBTQ Slice of Life Urban Fantasy Webcomic, BUUZA!!, returns for a second volume!",Illustration,"List(buuza, vol, in, the, land, of, spider, silkthe, award, winning, lgbtq, slice, of, life, urban, fantasy, webcomic, buuza, returns, for, a, second, volume)","List(buuza, vol, land, spider, silkthe, award, winning, lgbtq, slice, life, urban, fantasy, webcomic, buuza, returns, second, volume)","List(buuza, vol, land, spider, silkthe, award, winning, lgbtq, slice, life, urban, fantasy, webcomic, buuza, returns, second, volume)","List(buuza vol, vol land, land spider, spider silkthe, silkthe award, award winning, winning lgbtq, lgbtq slice, slice life, life urban, urban fantasy, fantasy webcomic, webcomic buuza, buuza returns, returns second, second volume)"
"The Lords of VlacholdA new faction for Battleground Fantasy Warfare, featuring the artwork of Natalie Bernard and Argent Arts.",Games,"List(the, lords, of, vlacholda, new, faction, for, battleground, fantasy, warfare, featuring, the, artwork, of, natalie, bernard, and, argent, arts)","List(lords, vlacholda, new, faction, battleground, fantasy, warfare, featuring, artwork, natalie, bernard, argent, arts)","List(lords, vlacholda, new, faction, battleground, fantasy, warfare, featuring, artwork, natalie, bernard, argent, arts)","List(lords vlacholda, vlacholda new, new faction, faction battleground, battleground fantasy, fantasy warfare, warfare featuring, featuring artwork, artwork natalie, natalie bernard, bernard argent, argent arts)"
C Crystallized Star Lights ( series matchless stars)These are Crystal Star Christmas light replicas from the Fourties. This will also update the old style with new technology.,Product,"List(c, crystallized, star, lights, series, matchless, stars, these, are, crystal, star, christmas, light, replicas, from, the, fourties, this, will, also, update, the, old, style, with, new, technology)","List(c, crystallized, star, lights, series, matchless, stars, crystal, star, christmas, light, replicas, fourties, also, update, old, style, new, technology)","List(crystallized, star, lights, series, matchless, stars, crystal, star, christmas, light, replicas, fourties, also, update, old, style, new, technology)","List(c crystallized, crystallized star, star lights, lights series, series matchless, matchless stars, stars crystal, crystal star, star christmas, christmas light, light replicas, replicas fourties, fourties also, also update, update old, old style, style new, new technology)"
"Magical Beings Enamel PinsCute foxes, monsters, magic, and coy themed hard enamel pins!",Illustration,"List(magical, beings, enamel, pinscute, foxes, monsters, magic, and, coy, themed, hard, enamel, pins)","List(magical, beings, enamel, pinscute, foxes, monsters, magic, coy, themed, hard, enamel, pins)","List(magical, beings, enamel, pinscute, foxes, monsters, magic, coy, themed, hard, enamel, pins)","List(magical beings, beings enamel, enamel pinscute, pinscute foxes, foxes monsters, monsters magic, magic coy, coy themed, themed hard, hard enamel, enamel pins)"
The Justice Farm: Kernel of JusticeA Children's Book On The Power Of Teamwork.,Illustration,"List(the, justice, farm, kernel, of, justicea, children, s, book, on, the, power, of, teamwork)","List(justice, farm, kernel, justicea, children, book, power, teamwork)","List(justice, farm, kernel, justicea, children, book, power, teamwork)","List(justice farm, farm kernel, kernel justicea, justicea children, children book, book power, power teamwork)"
Extended Ergonomic FootrestEnjoy comfortably working at your desk as our footrest takes pressure off of your lower back. Its long length makes it suitable for all,Product,"List(extended, ergonomic, footrestenjoy, comfortably, working, at, your, desk, as, our, footrest, takes, pressure, off, of, your, lower, back, its, long, length, makes, it, suitable, for, all)","List(extended, ergonomic, footrestenjoy, comfortably, working, desk, footrest, takes, pressure, lower, back, long, length, makes, suitable)","List(extended, ergonomic, footrestenjoy, comfortably, working, desk, footrest, takes, pressure, lower, back, long, length, makes, suitable)","List(extended ergonomic, ergonomic footrestenjoy, footrestenjoy comfortably, comfortably working, working desk, desk footrest, footrest takes, takes pressure, pressure lower, lower back, back long, long length, length makes, makes suitable)"
"Cat's Cradle: A Fantasy Town for e and Other RPG SystemsA three-book set for adventures in the town of Cat's Cradle: a sourcebook, adventure book, and NPC book. For E and other fantasy RPGs.",Games,"List(cat, s, cradle, a, fantasy, town, for, e, and, other, rpg, systemsa, three, book, set, for, adventures, in, the, town, of, cat, s, cradle, a, sourcebook, adventure, book, and, npc, book, for, e, and, other, fantasy, rpgs)","List(cat, cradle, fantasy, town, e, rpg, systemsa, three, book, set, adventures, town, cat, cradle, sourcebook, adventure, book, npc, book, e, fantasy, rpgs)","List(cat, cradle, fantasy, town, rpg, systemsa, three, book, set, adventures, town, cat, cradle, sourcebook, adventure, book, npc, book, fantasy, rpgs)","List(cat cradle, cradle fantasy, fantasy town, town e, e rpg, rpg systemsa, systemsa three, three book, book set, set adventures, adventures town, town cat, cat cradle, cradle sourcebook, sourcebook adventure, adventure book, book npc, npc book, book e, e fantasy, fantasy rpgs)"
BADDA MOON RISING. The new novel by Ian JarvisThe fourth Ian Jarvis novel in the Quist and Watson series of humorous detective mysteries.,Product,"List(badda, moon, rising, the, new, novel, by, ian, jarvisthe, fourth, ian, jarvis, novel, in, the, quist, and, watson, series, of, humorous, detective, mysteries)","List(badda, moon, rising, new, novel, ian, jarvisthe, fourth, ian, jarvis, novel, quist, watson, series, humorous, detective, mysteries)","List(badda, moon, rising, new, novel, ian, jarvisthe, fourth, ian, jarvis, novel, quist, watson, series, humorous, detective, mysteries)","List(badda moon, moon rising, rising new, new novel, novel ian, ian jarvisthe, jarvisthe fourth, fourth ian, ian jarvis, jarvis novel, novel quist, quist watson, watson series, series humorous, humorous detective, detective mysteries)"
"Refillable Mini Scuba Tank - Refills With a Hand PumpDive Portable Lungs is lightweight, portable, refillable via hand pump and gives you up to min underwater",Product,"List(refillable, mini, scuba, tank, refills, with, a, hand, pumpdive, portable, lungs, is, lightweight, portable, refillable, via, hand, pump, and, gives, you, up, to, min, underwater)","List(refillable, mini, scuba, tank, refills, hand, pumpdive, portable, lungs, lightweight, portable, refillable, via, hand, pump, gives, min, underwater)","List(refillable, mini, scuba, tank, refills, hand, pumpdive, portable, lungs, lightweight, portable, refillable, via, hand, pump, gives, min, underwater)","List(refillable mini, mini scuba, scuba tank, tank refills, refills hand, hand pumpdive, pumpdive portable, portable lungs, lungs lightweight, lightweight portable, portable refillable, refillable via, via hand, hand pump, pump gives, gives min, min underwater)"


Now we can merge the single words (column filtered) and the bigrams (column ngrams_2) into a single ngrams column with a new UDF.

In [0]:
# union of the results
def union_ngrams(c1,c2):
  return c1 + c2

union_ngrams_udf = udf(union_ngrams, ArrayType(StringType()))

ngrams_final = ngrams.filter("filtered is not Null").withColumn("ngrams", union_ngrams_udf(col("filtered"), col("ngrams_2")))
display(ngrams_final)

description_cleaned,target,words,cleaned,filtered,ngrams_2,ngrams
Social media for people with disabilitiesAmicis is a social media platform for people with disabilities. Where you can talk about your disability with others like you.,Product,"List(social, media, for, people, with, disabilitiesamicis, is, a, social, media, platform, for, people, with, disabilities, where, you, can, talk, about, your, disability, with, others, like, you)","List(social, media, people, disabilitiesamicis, social, media, platform, people, disabilities, talk, disability, others, like)","List(social, media, people, disabilitiesamicis, social, media, platform, people, disabilities, talk, disability, others, like)","List(social media, media people, people disabilitiesamicis, disabilitiesamicis social, social media, media platform, platform people, people disabilities, disabilities talk, talk disability, disability others, others like)","List(social, media, people, disabilitiesamicis, social, media, platform, people, disabilities, talk, disability, others, like, social media, media people, people disabilitiesamicis, disabilitiesamicis social, social media, media platform, platform people, people disabilities, disabilities talk, talk disability, disability others, others like)"
"BUUZA!! Vol : In the Land of Spider SilkThe Award-Winning LGBTQ Slice of Life Urban Fantasy Webcomic, BUUZA!!, returns for a second volume!",Illustration,"List(buuza, vol, in, the, land, of, spider, silkthe, award, winning, lgbtq, slice, of, life, urban, fantasy, webcomic, buuza, returns, for, a, second, volume)","List(buuza, vol, land, spider, silkthe, award, winning, lgbtq, slice, life, urban, fantasy, webcomic, buuza, returns, second, volume)","List(buuza, vol, land, spider, silkthe, award, winning, lgbtq, slice, life, urban, fantasy, webcomic, buuza, returns, second, volume)","List(buuza vol, vol land, land spider, spider silkthe, silkthe award, award winning, winning lgbtq, lgbtq slice, slice life, life urban, urban fantasy, fantasy webcomic, webcomic buuza, buuza returns, returns second, second volume)","List(buuza, vol, land, spider, silkthe, award, winning, lgbtq, slice, life, urban, fantasy, webcomic, buuza, returns, second, volume, buuza vol, vol land, land spider, spider silkthe, silkthe award, award winning, winning lgbtq, lgbtq slice, slice life, life urban, urban fantasy, fantasy webcomic, webcomic buuza, buuza returns, returns second, second volume)"
"The Lords of VlacholdA new faction for Battleground Fantasy Warfare, featuring the artwork of Natalie Bernard and Argent Arts.",Games,"List(the, lords, of, vlacholda, new, faction, for, battleground, fantasy, warfare, featuring, the, artwork, of, natalie, bernard, and, argent, arts)","List(lords, vlacholda, new, faction, battleground, fantasy, warfare, featuring, artwork, natalie, bernard, argent, arts)","List(lords, vlacholda, new, faction, battleground, fantasy, warfare, featuring, artwork, natalie, bernard, argent, arts)","List(lords vlacholda, vlacholda new, new faction, faction battleground, battleground fantasy, fantasy warfare, warfare featuring, featuring artwork, artwork natalie, natalie bernard, bernard argent, argent arts)","List(lords, vlacholda, new, faction, battleground, fantasy, warfare, featuring, artwork, natalie, bernard, argent, arts, lords vlacholda, vlacholda new, new faction, faction battleground, battleground fantasy, fantasy warfare, warfare featuring, featuring artwork, artwork natalie, natalie bernard, bernard argent, argent arts)"
C Crystallized Star Lights ( series matchless stars)These are Crystal Star Christmas light replicas from the Fourties. This will also update the old style with new technology.,Product,"List(c, crystallized, star, lights, series, matchless, stars, these, are, crystal, star, christmas, light, replicas, from, the, fourties, this, will, also, update, the, old, style, with, new, technology)","List(c, crystallized, star, lights, series, matchless, stars, crystal, star, christmas, light, replicas, fourties, also, update, old, style, new, technology)","List(crystallized, star, lights, series, matchless, stars, crystal, star, christmas, light, replicas, fourties, also, update, old, style, new, technology)","List(c crystallized, crystallized star, star lights, lights series, series matchless, matchless stars, stars crystal, crystal star, star christmas, christmas light, light replicas, replicas fourties, fourties also, also update, update old, old style, style new, new technology)","List(crystallized, star, lights, series, matchless, stars, crystal, star, christmas, light, replicas, fourties, also, update, old, style, new, technology, c crystallized, crystallized star, star lights, lights series, series matchless, matchless stars, stars crystal, crystal star, star christmas, christmas light, light replicas, replicas fourties, fourties also, also update, update old, old style, style new, new technology)"
"Magical Beings Enamel PinsCute foxes, monsters, magic, and coy themed hard enamel pins!",Illustration,"List(magical, beings, enamel, pinscute, foxes, monsters, magic, and, coy, themed, hard, enamel, pins)","List(magical, beings, enamel, pinscute, foxes, monsters, magic, coy, themed, hard, enamel, pins)","List(magical, beings, enamel, pinscute, foxes, monsters, magic, coy, themed, hard, enamel, pins)","List(magical beings, beings enamel, enamel pinscute, pinscute foxes, foxes monsters, monsters magic, magic coy, coy themed, themed hard, hard enamel, enamel pins)","List(magical, beings, enamel, pinscute, foxes, monsters, magic, coy, themed, hard, enamel, pins, magical beings, beings enamel, enamel pinscute, pinscute foxes, foxes monsters, monsters magic, magic coy, coy themed, themed hard, hard enamel, enamel pins)"
The Justice Farm: Kernel of JusticeA Children's Book On The Power Of Teamwork.,Illustration,"List(the, justice, farm, kernel, of, justicea, children, s, book, on, the, power, of, teamwork)","List(justice, farm, kernel, justicea, children, book, power, teamwork)","List(justice, farm, kernel, justicea, children, book, power, teamwork)","List(justice farm, farm kernel, kernel justicea, justicea children, children book, book power, power teamwork)","List(justice, farm, kernel, justicea, children, book, power, teamwork, justice farm, farm kernel, kernel justicea, justicea children, children book, book power, power teamwork)"
Extended Ergonomic FootrestEnjoy comfortably working at your desk as our footrest takes pressure off of your lower back. Its long length makes it suitable for all,Product,"List(extended, ergonomic, footrestenjoy, comfortably, working, at, your, desk, as, our, footrest, takes, pressure, off, of, your, lower, back, its, long, length, makes, it, suitable, for, all)","List(extended, ergonomic, footrestenjoy, comfortably, working, desk, footrest, takes, pressure, lower, back, long, length, makes, suitable)","List(extended, ergonomic, footrestenjoy, comfortably, working, desk, footrest, takes, pressure, lower, back, long, length, makes, suitable)","List(extended ergonomic, ergonomic footrestenjoy, footrestenjoy comfortably, comfortably working, working desk, desk footrest, footrest takes, takes pressure, pressure lower, lower back, back long, long length, length makes, makes suitable)","List(extended, ergonomic, footrestenjoy, comfortably, working, desk, footrest, takes, pressure, lower, back, long, length, makes, suitable, extended ergonomic, ergonomic footrestenjoy, footrestenjoy comfortably, comfortably working, working desk, desk footrest, footrest takes, takes pressure, pressure lower, lower back, back long, long length, length makes, makes suitable)"
"Cat's Cradle: A Fantasy Town for e and Other RPG SystemsA three-book set for adventures in the town of Cat's Cradle: a sourcebook, adventure book, and NPC book. For E and other fantasy RPGs.",Games,"List(cat, s, cradle, a, fantasy, town, for, e, and, other, rpg, systemsa, three, book, set, for, adventures, in, the, town, of, cat, s, cradle, a, sourcebook, adventure, book, and, npc, book, for, e, and, other, fantasy, rpgs)","List(cat, cradle, fantasy, town, e, rpg, systemsa, three, book, set, adventures, town, cat, cradle, sourcebook, adventure, book, npc, book, e, fantasy, rpgs)","List(cat, cradle, fantasy, town, rpg, systemsa, three, book, set, adventures, town, cat, cradle, sourcebook, adventure, book, npc, book, fantasy, rpgs)","List(cat cradle, cradle fantasy, fantasy town, town e, e rpg, rpg systemsa, systemsa three, three book, book set, set adventures, adventures town, town cat, cat cradle, cradle sourcebook, sourcebook adventure, adventure book, book npc, npc book, book e, e fantasy, fantasy rpgs)","List(cat, cradle, fantasy, town, rpg, systemsa, three, book, set, adventures, town, cat, cradle, sourcebook, adventure, book, npc, book, fantasy, rpgs, cat cradle, cradle fantasy, fantasy town, town e, e rpg, rpg systemsa, systemsa three, three book, book set, set adventures, adventures town, town cat, cat cradle, cradle sourcebook, sourcebook adventure, adventure book, book npc, npc book, book e, e fantasy, fantasy rpgs)"
BADDA MOON RISING. The new novel by Ian JarvisThe fourth Ian Jarvis novel in the Quist and Watson series of humorous detective mysteries.,Product,"List(badda, moon, rising, the, new, novel, by, ian, jarvisthe, fourth, ian, jarvis, novel, in, the, quist, and, watson, series, of, humorous, detective, mysteries)","List(badda, moon, rising, new, novel, ian, jarvisthe, fourth, ian, jarvis, novel, quist, watson, series, humorous, detective, mysteries)","List(badda, moon, rising, new, novel, ian, jarvisthe, fourth, ian, jarvis, novel, quist, watson, series, humorous, detective, mysteries)","List(badda moon, moon rising, rising new, new novel, novel ian, ian jarvisthe, jarvisthe fourth, fourth ian, ian jarvis, jarvis novel, novel quist, quist watson, watson series, series humorous, humorous detective, detective mysteries)","List(badda, moon, rising, new, novel, ian, jarvisthe, fourth, ian, jarvis, novel, quist, watson, series, humorous, detective, mysteries, badda moon, moon rising, rising new, new novel, novel ian, ian jarvisthe, jarvisthe fourth, fourth ian, ian jarvis, jarvis novel, novel quist, quist watson, watson series, series humorous, humorous detective, detective mysteries)"
"Refillable Mini Scuba Tank - Refills With a Hand PumpDive Portable Lungs is lightweight, portable, refillable via hand pump and gives you up to min underwater",Product,"List(refillable, mini, scuba, tank, refills, with, a, hand, pumpdive, portable, lungs, is, lightweight, portable, refillable, via, hand, pump, and, gives, you, up, to, min, underwater)","List(refillable, mini, scuba, tank, refills, hand, pumpdive, portable, lungs, lightweight, portable, refillable, via, hand, pump, gives, min, underwater)","List(refillable, mini, scuba, tank, refills, hand, pumpdive, portable, lungs, lightweight, portable, refillable, via, hand, pump, gives, min, underwater)","List(refillable mini, mini scuba, scuba tank, tank refills, refills hand, hand pumpdive, pumpdive portable, portable lungs, lungs lightweight, lightweight portable, portable refillable, refillable via, via hand, hand pump, pump gives, gives min, min underwater)","List(refillable, mini, scuba, tank, refills, hand, pumpdive, portable, lungs, lightweight, portable, refillable, via, hand, pump, gives, min, underwater, refillable mini, mini scuba, scuba tank, tank refills, refills hand, hand pumpdive, pumpdive portable, portable lungs, lungs lightweight, lightweight portable, portable refillable, refillable via, via hand, hand pump, pump gives, gives min, min underwater)"


### TF-IDF

In [0]:
from pyspark.ml.feature import HashingTF
from pyspark.ml.feature import IDF

hashing_tf = HashingTF(inputCol="ngrams", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="features")
hash_dataset = hashing_tf.transform(ngrams_final)
idf_model = idf.fit(hash_dataset)
idf_dataset = idf_model.transform(hash_dataset)
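
As a rough illustration of what these two stages compute, here is a plain-Python sketch on a toy corpus. It assumes the smoothed formula given in the Spark MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term. It also skips the hashing step: `HashingTF` maps each term to a bucket index, while this sketch keeps the terms as dictionary keys. The `docs` list and the helper names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["enamel", "pins", "enamel pins"],
    ["fantasy", "rpg", "fantasy rpg"],
    ["fantasy", "novel"],
]

def term_frequencies(tokens):
    """Raw term counts: the role of HashingTF, minus the hashing."""
    return Counter(tokens)

def inverse_document_frequencies(documents):
    """Smoothed IDF: log((m + 1) / (df + 1)), with m = number of documents."""
    m = len(documents)
    # Count each term once per document to get its document frequency.
    df = Counter(term for tokens in documents for term in set(tokens))
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

idf_weights = inverse_document_frequencies(docs)

# One sparse TF-IDF vector (as a dict) per document.
tfidf = [
    {term: tf * idf_weights[term] for term, tf in term_frequencies(tokens).items()}
    for tokens in docs
]
```

A term that appears in many documents ("fantasy" here) gets a lower weight than a term that appears in only one ("novel").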


### NaiveBayes

Before the training phase we have to convert the target label to an index. Spark estimators usually expect the class as a number (i.e. an integer representation of the class, stored as a double).

So we apply:
- StringIndexer to encode each category as its index
- IndexToString to decode a predicted index back to the original category

In [0]:
from pyspark.ml.feature import IndexToString, StringIndexer

indexer = StringIndexer(inputCol="target", outputCol="label")
indexer_model = indexer.fit(idf_dataset)
indexed = indexer_model.transform(idf_dataset)
converter = IndexToString(inputCol="prediction", outputCol="prediction_category", labels=indexer_model.labels)
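
The encoding and decoding can be pictured with a small plain-Python sketch. It assumes the default ordering of `StringIndexer`, where the most frequent label gets index 0; breaking ties alphabetically is an assumption of this sketch, and the `targets` list is hypothetical.

```python
from collections import Counter

# Hypothetical target column.
targets = ["Product", "Music", "Product", "Illustration", "Games",
           "Illustration", "Product"]

# Most frequent label first; ties broken alphabetically (assumption).
counts = Counter(targets)
labels = sorted(counts, key=lambda label: (-counts[label], label))

# Role of StringIndexer: category -> index (stored as a float, like Spark's double).
label_to_index = {label: float(i) for i, label in enumerate(labels)}
# Role of IndexToString: index -> category.
index_to_label = {index: label for label, index in label_to_index.items()}

encoded = [label_to_index[t] for t in targets]
decoded = [index_to_label[i] for i in encoded]
```

In the notebook, `labels=indexer_model.labels` plays the role of the `labels` list: it hands the fitted ordering to `IndexToString` so that predictions are decoded with the same mapping.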

Train the model, searching over the smoothing parameter with a train/validation split (80% train, 20% validation)...

In [0]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

nb = NaiveBayes(smoothing=1.0, modelType="multinomial", labelCol="label", featuresCol="features")
paramGrid = ParamGridBuilder()\
    .addGrid(nb.smoothing, [0.1, 0.5, 1.0]) \
    .build()

tvs = TrainValidationSplit(estimator=nb,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator(),
                           trainRatio=0.8)

In [0]:
nb_model = tvs.fit(indexed)
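
To see what the `smoothing` values in the grid change, here is a minimal sketch of additive smoothing for a single class of a multinomial Naive Bayes model: P(term | class) = (count + smoothing) / (total + smoothing * V), with V the vocabulary size. The vocabulary and the counts are hypothetical, and raw counts stand in for the TF-IDF weights used in the notebook.

```python
import math

def smoothed_log_likelihoods(term_counts, vocabulary, smoothing):
    """log P(term | class) with additive smoothing, for one class."""
    total = sum(term_counts.get(term, 0.0) for term in vocabulary)
    denominator = total + smoothing * len(vocabulary)
    return {
        term: math.log((term_counts.get(term, 0.0) + smoothing) / denominator)
        for term in vocabulary
    }

# Hypothetical vocabulary and per-class counts for the "Music" class.
vocabulary = ["album", "comic", "pins"]
music_counts = {"album": 8.0, "comic": 0.0, "pins": 2.0}

# Same candidate values as the parameter grid above.
for smoothing in [0.1, 0.5, 1.0]:
    log_probs = smoothed_log_likelihoods(music_counts, vocabulary, smoothing)
    print(smoothing, {term: round(math.exp(v), 3) for term, v in log_probs.items()})
```

A larger smoothing value moves probability mass towards terms never seen with the class ("comic" here), which is why it is worth tuning on a validation split.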

### Test
Now we have to apply the same operations to the test set, reusing the transformers already fitted on the training data (idf_model and indexer_model), to produce the same input for our ML model.

In [0]:
test.createOrReplaceTempView("test")
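# Note: the training rows displayed earlier have title and description fused
# without a separator (e.g. "disabilitiesAmicis"), while a space is inserted here.
# Check that both sets are built the same way.
test_cleaned = spark.sql("""
    select project_id, title, description,
           regexp_replace(title || ' ' || description, '[0-9]', ' ') as description_cleaned,
           category as target
    from test
""").filter("description_cleaned is not null")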
test_tokenized = regexTokenizer.transform(test_cleaned)
test_cleaned = remover.transform(test_tokenized)
test_filtered = test_cleaned.withColumn("filtered", filter_by_len_udf(col("cleaned")))
test_ngrams = ngrams2.transform(test_filtered)
test_ngrams_final = test_ngrams.filter("filtered is not Null").withColumn("ngrams", union_ngrams_udf(col("filtered"), col("ngrams_2")))

test_hash_dataset = hashing_tf.transform(test_ngrams_final)
test_idf_dataset = idf_model.transform(test_hash_dataset)
test_indexed = indexer_model.transform(test_idf_dataset)

Apply our model to the test dataset.

In [0]:
predictions = nb_model.transform(test_indexed).select("title", "description", "label", "prediction", "target")
predictions_decoded = converter.transform(predictions)
display(predictions_decoded.select("*").limit(20))

title,description,label,prediction,target,prediction_category
E.Z. Shakes album fundraiser - The Spirit,Help us make our new album come out on vinyl by backing us and getting great rewards.,3.0,3.0,Music,Music
Troc.me – Swap your skills,The whole portal which promotes the concept of barter equal exchange of skills and abilities,0.0,2.0,Product,Games
Glorified - Issue #3,A manga-style comic book series about revenge and redemption set in a post-apocalyptic world.,1.0,1.0,Illustration,Illustration
THE NEW AMAZONS: ORIGIN-A-GO-GO!,"A 44 page full-color, square-bound, tongue-in-cheek superhero team comic featuring OCTOBRIANA! By John A. Short & Gabrielle Noble.",1.0,1.0,Illustration,Illustration
MORONAVIRUS,"A short film to raise awareness of FAKE NEWS, an evil spreading faster than any virus. Entirely filmed during lockdown.",1.0,1.0,Illustration,Illustration
Custom Molding Nose Strips for Masks,Eliminates Foggy Glasses and Increases Comfort,0.0,0.0,Product,Product
Beautiful & Inclusive Book of Empowering Wishes & Happiness,"Empower children to be the best they can be, with inspiring wishes and beautifully enchanting illustrations by a prize-winning artist.",1.0,1.0,Illustration,Illustration
A brand new album from Dream Frequency,"20 new floor-filling tracks ,with remixes from Rob Tissera, K69, Si Frater and Dave Heaton",3.0,3.0,Music,Music
50 Smiles Under One Pledge,5 Animated holiday e-cards of 5 festivals of your choice and send up to 10 receivers each e-card & more quality animation rewards.,0.0,0.0,Product,Product
Be different. Bronze diver style watch.,"White Rhino Diver watch with Bronze case , 3D dial , sapphire glass .",0.0,0.0,Product,Product


Compute the classification performance (weighted precision, weighted recall and F1) on the test set.

In [0]:
evaluator_precision = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="weightedPrecision")
precision = evaluator_precision.evaluate(predictions)
print("Test set weighted precision = " + str(precision))


evaluator_recall = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="weightedRecall")
recall = evaluator_recall.evaluate(predictions)
print("Test set weighted recall = " + str(recall))


evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="f1")
f1 = evaluator_f1.evaluate(predictions)
print("Test set f1 = " + str(f1))
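
As a sketch of what the weighted metrics mean, the plain-Python code below computes per-class precision, recall and F1 and averages them with the true-label frequencies as weights. The pairs are copied from the ten sample rows displayed above, so the numbers describe that sample only, not the whole test set. The function and variable names are hypothetical.

```python
from collections import Counter

# (label, prediction) pairs from the ten sample rows displayed above.
pairs = [(3.0, 3.0), (0.0, 2.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0),
         (0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (0.0, 0.0), (0.0, 0.0)]

def weighted_metrics(label_prediction_pairs):
    """Per-class precision, recall and F1, weighted by true-label frequency."""
    label_counts = Counter(label for label, _ in label_prediction_pairs)
    predicted_counts = Counter(pred for _, pred in label_prediction_pairs)
    true_positives = Counter(label for label, pred in label_prediction_pairs if label == pred)
    n = len(label_prediction_pairs)
    w_precision = w_recall = w_f1 = 0.0
    for cls, support in label_counts.items():
        tp = true_positives[cls]
        p = tp / predicted_counts[cls] if predicted_counts[cls] else 0.0
        r = tp / support
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / n
        w_precision += weight * p
        w_recall += weight * r
        w_f1 += weight * f
    return w_precision, w_recall, w_f1

sample_precision, sample_recall, sample_f1 = weighted_metrics(pairs)
```

Because each class recall is weighted by the class frequency, the weighted recall is the same number as the plain accuracy (9 correct rows out of 10 in this sample).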

Check the misclassified records...

What can we do to improve our performance?

In [0]:
display(predictions_decoded.filter("label <> prediction").select("*").limit(20))

title,description,label,prediction,target,prediction_category
Troc.me – Swap your skills,The whole portal which promotes the concept of barter equal exchange of skills and abilities,0.0,2.0,Product,Games
Pokebug Evolution Chains,Bug Pokemon Evolution Lines - Acrylic Charms,0.0,1.0,Product,Illustration
Silent Ocean,"A mermaid saves someone from drowning, only to let curiosity get the better of her.",1.0,0.0,Illustration,Product
Aldo Novarese: Alfa-Beta,The reissue of a typographic masterpiece,1.0,3.0,Illustration,Music
Da J3rk Spot,A Taste of the Caribbean in Alaska,0.0,1.0,Product,Illustration
Adventures Of Twisted Kitties,"I want to bring my twin cats to ""virtual life"" by creating cartoon versions of them. Comical adventures of 2 very different brothers.",1.0,2.0,Illustration,Games
Final Fantasy Summon Enamel Pins,Hard enamel pins based off of the Final Fantasy VII summon materia.,0.0,1.0,Product,Illustration
Pride Cats,Hard Enamel pins shaped like cats and colored to show LGBT+ Pride.,0.0,1.0,Product,Illustration
Help Fund Our Summer Issue of Black Pages,A pocket-sized directory of black-owned businesses.,1.0,0.0,Illustration,Product
"The Missing Peace: Still Born, Still Loved","An interactive grief journal/workbook for people who have experienced or are affected by miscarriage, stillbirth and early infant loss.",1.0,0.0,Illustration,Product


### Word2Vec
Train our Word2Vec model... The goal is to create a new features column that contains the output of our Word2Vec model for each record.

In [0]:
from pyspark.ml.feature import Word2Vec
word2Vec = Word2Vec(vectorSize=300, minCount=10, inputCol="ngrams", outputCol="features")
model = word2Vec.fit(ngrams_final)
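
According to the Spark documentation, the fitted model turns each document into one vector by averaging the vectors of its words. A minimal plain-Python sketch of that averaging follows. The 3-dimensional vectors are hypothetical (the notebook uses `vectorSize=300`), and letting unknown tokens contribute a zero vector is an assumption of this sketch.

```python
# Hypothetical 3-dimensional word vectors.
word_vectors = {
    "fantasy": [0.2, 0.4, 0.0],
    "rpg": [0.4, 0.0, 0.2],
    "fantasy rpg": [0.3, 0.2, 0.4],
}

def document_vector(tokens, vectors, size):
    """Average the token vectors; unknown tokens add nothing to the sum (assumption)."""
    total = [0.0] * size
    for token in tokens:
        for i, value in enumerate(vectors.get(token, [0.0] * size)):
            total[i] += value
    return [value / len(tokens) for value in total] if tokens else total

doc_features = document_vector(["fantasy", "rpg", "fantasy rpg"], word_vectors, 3)
```

Note that with `minCount=10` a token has to appear at least 10 times in the corpus to enter the vocabulary, so rare bigrams are likely to be left out.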

Try our model and check if it works...

In [0]:
model.findSynonyms("game", 10).show(truncate=False)

In [0]:
model.findSynonyms("manga", 10).show(truncate=False)
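
`findSynonyms` ranks the vocabulary by cosine similarity to the query word. The sketch below shows that ranking in plain Python; `find_synonyms` is a hypothetical stand-in, not the Spark method, and the 2-dimensional vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def find_synonyms(word, vectors, num):
    """Return the num words closest to word, as (word, similarity) pairs."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:num]

# Hypothetical 2-dimensional word vectors.
toy_vectors = {
    "game": [1.0, 0.0],
    "rpg": [0.9, 0.1],
    "album": [0.0, 1.0],
}
synonyms = find_synonyms("game", toy_vectors, 2)
```

If the words returned for "game" or "manga" look unrelated, the vectors are probably not well trained, which can happen on a small corpus.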