<img src="https://storage.googleapis.com/kaggle-media/competitions/spooky-books/dmitrij-paskevic-44124.jpg" style="width:200px; float: left; padding-right: 10px"/>
<h2 style="font-face: verdana; font-size: 32px;">Spooky Author Identification</h2>
<h3 style="font-face: verdana; font-size: 16px;">Derive rich features for Machine Learning with the Watson Cognitive APIs</h3>
<br><br>
<a href="https://www.kaggle.com/c/spooky-author-identification/">Spooky Author Identification Kaggle Competition</a>

<h3 style="font-face: verdana; font-size: 16px;">The objective is to predict the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and HP Lovecraft.</h3>

The dataset contains text from works of fiction written by these spooky authors. The goal is to accurately identify the author of the sentences.

Data fields in the dataset:

    id - a unique identifier for each sentence
    text - some text written by one of the authors
    author - the author of the sentence (EAP: Edgar Allan Poe, HPL: HP Lovecraft, MWS: Mary Wollstonecraft Shelley)

<h3 style="font-face: verdana; font-size: 16px;">Approach</h3>

We will first tackle this challenge with a traditional multiclass classification approach. We will then use IBM Watson Natural Language Understanding to derive additional features on which to train a machine learning model.

<h3 style="font-face: verdana; font-size: 16px;">IBM Watson Natural Language Understanding</h3>

IBM Watson™ Natural Language Understanding (NLU) can analyze semantic features of text input, including categories, concepts, emotion, entities, keywords, metadata, relations, semantic roles, and sentiment. In this example, we will utilize the emotion and sentiment features of NLU to create enhanced machine learning features.


<h4 style="font-face: verdana; font-size: 16px;">Emotion</h4>

The emotion feature of NLU allows you to analyze emotion conveyed by specific target phrases or by the document as a whole. You can also enable emotion analysis for entities and keywords that are automatically detected by the service. In this example, we will simply analyze the spooky excerpt as a whole. The emotions we will derive features for are:

- Anger
- Joy
- Sadness
- Fear
- Disgust

Emotion scores range from 0 to 1 for sadness, joy, fear, disgust, and anger. A 0 means the text doesn't convey the emotion, and a 1 means the text definitely carries the emotion.

<h4 style="font-face: verdana; font-size: 16px;">Sentiment</h4>

The sentiment feature of NLU allows you to analyze the sentiment toward specific target phrases and the sentiment of the document as a whole. You can also get sentiment information for detected entities and keywords by enabling the sentiment option for those features. In this example, we will simply analyze the spooky excerpt as a whole.

The sentiment score ranges from -1 (negative sentiment) to 1 (positive sentiment).
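
To make the two score ranges concrete, here is a minimal sketch that picks the strongest emotion and turns a sentiment score into a coarse label. The numbers are copied from the single-row NLU response printed later in this notebook; the helper names and the cut-off at 0 are assumptions made for illustration, not part of the NLU service.

```python
# Sample scores copied from the single-row NLU response shown later in this notebook.
emotion_scores = {
    "anger": 0.22525,
    "joy": 0.208037,
    "sadness": 0.156157,
    "fear": 0.092108,
    "disgust": 0.024618,
}
sentiment_score = 0.834411


def dominant_emotion(scores):
    """Return the (name, score) pair with the highest emotion score."""
    name = max(scores, key=scores.get)
    return name, scores[name]


def sentiment_label(score):
    """Map a score in [-1, 1] to a coarse label (the cut-off at 0 is an assumption)."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("sentiment score out of range: {}".format(score))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(dominant_emotion(emotion_scores))
print(sentiment_label(sentiment_score))
```

Note that the five emotion scores are independent of each other and do not have to sum to 1.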



## Download and unzip the dataset

In [1]:
import os
if os.path.isfile('train.zip'):
    os.remove("train.zip")
if os.path.isfile('train.csv'):
    os.remove("train.csv")
import wget
url = 'https://github.com/hackerguy/SpookyAuthorIdentification/blob/master/train.zip?raw=true'
wget.download(url)
import zipfile
# Use a context manager so the archive is always closed, and avoid shadowing the built-in zip()
with zipfile.ZipFile('train.zip', 'r') as archive:
    archive.extractall()

## Read in the data set as a Spark DataFrame
### Infer schema and column names

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

data = (spark.read
  .format('csv')
  .option('header', 'true')
  .option("inferSchema", "true")
  .load('train.csv'))

### Display the dataset

In [3]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)
data.toPandas().head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP
1,id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL
2,id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP
3,id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS
4,id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL


### Show the schema of the data including data types

In [4]:
data.printSchema()

root
 |-- id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- author: string (nullable = true)



## Remove rows that do not have valid author fields

In [5]:
data = data.filter(data['author'].isin('EAP', 'HPL', 'MWS'))

## Limit the data size for processing efficiency

In [6]:
data = data.limit(100)

### Dataset Overview - number of rows and columns

In [7]:
print("There are {} rows in the dataset.".format(str(data.count())))
print("There are {} columns in the dataset.".format(str(len(data.columns))))



There are 100 rows in the dataset.
There are 3 columns in the dataset.


## Tokenize the text

In [8]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

tokenizer = Tokenizer(inputCol="text", outputCol="words")

countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(data)
(tokenized.select("text", "words")
    .withColumn("#tokens", countTokens(col("words"))).toPandas().head())

Unnamed: 0,text,words,#tokens
0,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.","[this, process,, however,, afforded, me, no, means, of, ascertaining, the, dimensions, of, my, dungeon;, as, i, might, make, its, circuit,, and, return, to, the, point, whence, i, set, out,, without, being, aware, of, the, fact;, so, perfectly, uniform, seemed, the, wall.]",41
1,It never once occurred to me that the fumbling might be a mere mistake.,"[it, never, once, occurred, to, me, that, the, fumbling, might, be, a, mere, mistake.]",14
2,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.","[in, his, left, hand, was, a, gold, snuff, box,, from, which,, as, he, capered, down, the, hill,, cutting, all, manner, of, fantastic, steps,, he, took, snuff, incessantly, with, an, air, of, the, greatest, possible, self, satisfaction.]",36
3,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.","[how, lovely, is, spring, as, we, looked, from, windsor, terrace, on, the, sixteen, fertile, counties, spread, beneath,, speckled, by, happy, cottages, and, wealthier, towns,, all, looked, as, in, former, years,, heart, cheering, and, fair.]",34
4,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.","[finding, nothing, else,, not, even, gold,, the, superintendent, abandoned, his, attempts;, but, a, perplexed, look, occasionally, steals, over, his, countenance, as, he, sits, thinking, at, his, desk.]",27
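
As a rough illustration of what the `Tokenizer` stage does to each row, the following plain-Python sketch lowercases a sentence and splits it on whitespace. The function name `simple_tokenize` is hypothetical, and the sketch only approximates Spark's behavior (for example, it collapses runs of whitespace).

```python
def simple_tokenize(text):
    """Lowercase the text and split it on whitespace; punctuation stays attached."""
    return text.lower().split()


sentence = "It never once occurred to me that the fumbling might be a mere mistake."
words = simple_tokenize(sentence)
print(words)
print(len(words))
```

Because punctuation is not stripped, tokens such as `mistake.` keep their trailing period, which matches the `words` column shown above.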


## Remove common words

In [9]:
from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="words", outputCol="filtered").setCaseSensitive(False)
removed = remover.transform(tokenized)
removed.select("text", "words", "filtered" ).toPandas().head()

Unnamed: 0,text,words,filtered
0,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.","[this, process,, however,, afforded, me, no, means, of, ascertaining, the, dimensions, of, my, dungeon;, as, i, might, make, its, circuit,, and, return, to, the, point, whence, i, set, out,, without, being, aware, of, the, fact;, so, perfectly, uniform, seemed, the, wall.]","[process,, however,, afforded, means, ascertaining, dimensions, dungeon;, might, make, circuit,, return, point, whence, set, out,, without, aware, fact;, perfectly, uniform, seemed, wall.]"
1,It never once occurred to me that the fumbling might be a mere mistake.,"[it, never, once, occurred, to, me, that, the, fumbling, might, be, a, mere, mistake.]","[never, occurred, fumbling, might, mere, mistake.]"
2,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.","[in, his, left, hand, was, a, gold, snuff, box,, from, which,, as, he, capered, down, the, hill,, cutting, all, manner, of, fantastic, steps,, he, took, snuff, incessantly, with, an, air, of, the, greatest, possible, self, satisfaction.]","[left, hand, gold, snuff, box,, which,, capered, hill,, cutting, manner, fantastic, steps,, took, snuff, incessantly, air, greatest, possible, self, satisfaction.]"
3,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.","[how, lovely, is, spring, as, we, looked, from, windsor, terrace, on, the, sixteen, fertile, counties, spread, beneath,, speckled, by, happy, cottages, and, wealthier, towns,, all, looked, as, in, former, years,, heart, cheering, and, fair.]","[lovely, spring, looked, windsor, terrace, sixteen, fertile, counties, spread, beneath,, speckled, happy, cottages, wealthier, towns,, looked, former, years,, heart, cheering, fair.]"
4,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.","[finding, nothing, else,, not, even, gold,, the, superintendent, abandoned, his, attempts;, but, a, perplexed, look, occasionally, steals, over, his, countenance, as, he, sits, thinking, at, his, desk.]","[finding, nothing, else,, even, gold,, superintendent, abandoned, attempts;, perplexed, look, occasionally, steals, countenance, sits, thinking, desk.]"
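
The filtering step can be sketched in plain Python as follows. The stop-word set here is only a small subset of the default list, and `remove_stop_words` is a hypothetical helper written for this illustration rather than Spark's implementation.

```python
# Small subset of the default stop-word list, enough for the sample sentence.
STOP_WORDS = {"it", "once", "to", "me", "that", "the", "be", "a", "which"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, case_sensitive=False):
    """Drop every token that exactly matches a stop word."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]


tokens = ["it", "never", "once", "occurred", "to", "me", "that", "the",
          "fumbling", "might", "be", "a", "mere", "mistake."]
print(remove_stop_words(tokens))
```

Matching is done on whole tokens, so a token with punctuation attached (for example `which,` in the third row above) is not removed even though `which` is a stop word.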


### Show list of common words removed

In [10]:
from __future__ import print_function
[print(x) for x in remover.getStopWords()]

i
me
my
myself
we
our
ours
ourselves
you
your
yours
yourself
yourselves
he
him
his
himself
she
her
hers
herself
it
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
should
now
d
ll
m
o
re
ve
y
ain
aren
couldn
didn
doesn
hadn
hasn
haven
isn
ma
mightn
mustn
needn
shan
shouldn
wasn
weren
won
wouldn



## Hash the words and down-weight words that occur frequently across all texts

In [11]:
from pyspark.ml.feature import HashingTF, IDF

hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=100)
featurizedData = hashingTF.transform(removed)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("text", "rawFeatures", "features").toPandas().head()

Unnamed: 0,text,rawFeatures,features
0,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.71404795764, 0.0, 0.0, 0.0, 1.72474875895, 3.95212637445, 0.0, 0.0, 0.0, 0.0, 0.0, 2.05017115938, 1.47962630091, 0.0, 0.0, 0.0, 0.0, 1.43706668649, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.90707031574, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.72474875895, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.8425317946, 1.8425317946, 0.0, 0.0, 2.21722524404, 1.97606318723, 1.8425317946, 1.43706668649, 2.31253542385, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 0.0, 4.10034231876, 0.0, 0.0, 0.0, 0.0, 2.05017115938)"
1,It never once occurred to me that the fumbling might be a mere mistake.,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.13021386705, 0.0, 1.35702397882, 0.0, 0.0, 0.0, 0.0, 0.0, 2.05017115938, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.13021386705, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.43706668649, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.31253542385, 0.0, 0.0, 0.0)"
2,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.","(1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0)","(2.21722524404, 0.0, 0.0, 1.67068153767, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 2.41789593951, 4.26042773411, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 0.0, 2.05017115938, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.33842073557, 0.0, 0.0, 0.0, 0.0, 0.0, 1.97606318723, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 0.0, 2.13021386705, 0.0, 1.90707031574, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.97606318723, 2.05017115938, 0.0, 2.31253542385, 0.0, 1.67068153767, 0.0)"
3,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.72474875895, 0.0, 0.0, 1.57059807912, 1.72474875895, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.65167573213, 0.0, 1.43706668649, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 3.81414063148, 2.13021386705, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.05017115938, 1.97606318723, 2.66921036779, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.31928365084, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 0.0, 2.13021386705, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 1.8425317946, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"
4,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 1.72474875895, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.8425317946, 0.0, 1.8425317946, 1.43706668649, 0.0, 0.0, 4.10034231876, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 5.07135795032, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.72474875895, 0.0, 2.13021386705, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.8425317946, 0.0, 2.31253542385, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.31253542385, 0.0, 0.0, 0.0)"
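
The two stages can be sketched in plain Python. The sketch below uses `zlib.crc32` as a stand-in hash (Spark uses a different hash function, so bucket positions will not match the output above) and the smoothed weight `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents with a non-zero count in that bucket. All function names are hypothetical.

```python
import math
import zlib


def hashing_tf(tokens, num_features=100):
    """Count tokens into a fixed-length vector, using a hash to pick each bucket."""
    vector = [0.0] * num_features
    for token in tokens:
        # zlib.crc32 is only a stand-in hash chosen for this sketch.
        bucket = (zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF) % num_features
        vector[bucket] += 1.0
    return vector


def fit_idf(tf_vectors):
    """Compute log((m + 1) / (df + 1)) per bucket over m documents."""
    m = len(tf_vectors)
    weights = []
    for bucket in range(len(tf_vectors[0])):
        df = sum(1 for vector in tf_vectors if vector[bucket] > 0)
        weights.append(math.log((m + 1.0) / (df + 1.0)))
    return weights


def apply_idf(tf_vector, weights):
    """Scale each raw count by the weight of its bucket."""
    return [tf * w for tf, w in zip(tf_vector, weights)]


docs = [["gold", "snuff", "snuff"], ["gold", "desk"]]
tf_vectors = [hashing_tf(doc, num_features=16) for doc in docs]
idf_weights = fit_idf(tf_vectors)
tfidf_vectors = [apply_idf(v, idf_weights) for v in tf_vectors]
print(tfidf_vectors[0])
```

The scaling is visible in the output above: at index 13, the first row has a raw count of 2.0 and a feature value of 2.71404795764, exactly twice the 1.35702397882 that the second row gets for a raw count of 1.0 in the same bucket. With only 100 buckets, unrelated words will often share a bucket.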


## Encode the label column

In [12]:
from pyspark.ml.feature import StringIndexer
labelIndexer = StringIndexer(inputCol='author', outputCol='label').fit(data)
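
For intuition, a label indexer of this kind can be sketched in plain Python: the most frequent label gets index 0, the next most frequent gets 1, and so on. The helper `fit_string_index` is hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; this is an assumption of the sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


authors = ["EAP", "HPL", "EAP", "MWS", "EAP", "HPL"]
label_index = fit_string_index(authors)
print(label_index)
```

Keeping the fitted mapping around matters because the same mapping is needed later to translate numeric predictions back into author codes.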

## Use Logistic Regression Algorithm to predict author

In [13]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol = "label", maxIter=10, regParam=0.3, threshold=0.7)

## Convert indexed labels back to original labels

In [14]:
from pyspark.ml.feature import IndexToString
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

## Define the machine learning pipeline

In [15]:
stages = [tokenizer, remover, hashingTF, idf, labelIndexer, lr, labelConverter]
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = stages)

### Display the parameter settings of the pipeline stages

In [16]:
print("Tokenizer:")
print(tokenizer.explainParams())
print("*************************")
print("Remover:")
print(remover.explainParams())
print("*************************")
print("HashingTF:")
print(hashingTF.explainParams())
print("*************************")
print("IDF:")
print(idf.explainParams())
print("*************************")
print("LogisticRegression:")
print(lr.explainParams())
print("*************************")
print("Pipeline:")
print(pipeline.explainParams())

Tokenizer:
inputCol: input column name. (current: text)
outputCol: output column name. (default: Tokenizer_46a982d5fd4a887ac532__output, current: words)
*************************
Remover:
caseSensitive: whether to do a case sensitive comparison over the stop words (default: False, current: False)
inputCol: input column name. (current: words)
outputCol: output column name. (default: StopWordsRemover_4d3dacac538aa8251ed2__output, current: filtered)
stopWords: The words to be filtered out (default: [u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', 

## Split the dataset into training and test data sets

In [17]:
train, test = data.randomSplit([70.0,30.0], seed=1)
print('The number of records in the training data set is {}.'.format(train.count()))
print('The number of rows labeled EAP in the training data set is {}.'.format(train.filter(train['author'] == 'EAP').count()))
print('The number of rows labeled HPL in the training data set is {}.'.format(train.filter(train['author'] == 'HPL').count()))
print('The number of rows labeled MWS in the training data set is {}.'.format(train.filter(train['author'] == 'MWS').count()))
print("")
print('The number of records in the test data set is {}.'.format(test.count()))
print('The number of rows labeled EAP in the test data set is {}.'.format(test.filter(test['author'] == 'EAP').count()))
print('The number of rows labeled HPL in the test data set is {}.'.format(test.filter(test['author'] == 'HPL').count()))
print('The number of rows labeled MWS in the test data set is {}.'.format(test.filter(test['author'] == 'MWS').count()))

The number of records in the training data set is 77.
The number of rows labeled EAP in the training data set is 27.
The number of rows labeled HPL in the training data set is 25.
The number of rows labeled MWS in the training data set is 25.

The number of records in the test data set is 23.
The number of rows labeled EAP in the test data set is 9.
The number of rows labeled HPL in the test data set is 8.
The number of rows labeled MWS in the test data set is 6.
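
The split is 77/23 rather than exactly 70/30 because each row is assigned at random; the weights only set the probabilities (and are normalized first, so `[70.0, 30.0]` behaves like `[0.7, 0.3]`). The plain-Python sketch below illustrates the idea with a hypothetical `random_split` helper; it is not Spark's sampling algorithm and will not reproduce the counts above.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split, with probability proportional to its weight."""
    total = float(sum(weights))
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total  # normalize so the bounds end at 1.0
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for index, bound in enumerate(bounds):
            if u < bound:
                splits[index].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point round-off
    return splits


train_rows, test_rows = random_split(list(range(100)), [70.0, 30.0], seed=1)
print("train: {}, test: {}".format(len(train_rows), len(test_rows)))
```

Fixing the seed makes the split repeatable, which is why the notebook passes `seed=1`.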


## Train the model using the training data set

In [18]:
model = pipeline.fit(train)

## Make predictions using the test data set

In [19]:
predictions = model.transform(test)

In [20]:
predictions.select("author", "label", "prediction", 'predictedLabel', "probability").toPandas().head()

Unnamed: 0,author,label,prediction,predictedLabel,probability
0,MWS,2,2,MWS,"[0.333603184945, 0.259378643869, 0.407018171186]"
1,MWS,2,1,HPL,"[0.14545299873, 0.506132146855, 0.348414854416]"
2,HPL,1,1,HPL,"[0.343832262321, 0.386505251073, 0.269662486606]"
3,MWS,2,2,MWS,"[0.124928674119, 0.0717508395858, 0.803320486295]"
4,HPL,1,1,HPL,"[0.282942583593, 0.540385561467, 0.17667185494]"
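
In the rows above, the predicted index is the position of the largest probability, and the predicted label is the author code at that index. A small sketch of that step, using the first three probability vectors from the output; the label order `EAP, HPL, MWS` is inferred from the `label` column above (HPL is 1 and MWS is 2, so EAP is assumed to be 0).

```python
# Index order inferred from the output above; EAP -> 0 is an assumption.
LABELS = ["EAP", "HPL", "MWS"]


def predict_label(probabilities, labels=LABELS):
    """Return (index, label) for the class with the highest probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, labels[best]


rows = [
    [0.333603184945, 0.259378643869, 0.407018171186],
    [0.14545299873, 0.506132146855, 0.348414854416],
    [0.343832262321, 0.386505251073, 0.269662486606],
]
for probabilities in rows:
    print(predict_label(probabilities))
```

The first row shows how uncertain the model can be: it predicts MWS with a probability of only about 0.41.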


## Evaluate the model performance by calculating the accuracy

In [21]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol = "label", predictionCol="prediction").setMetricName("accuracy")
print('Accuracy = {:0.2f}%.'.format(evaluator.evaluate(predictions)*100))

Accuracy = 47.83%.


## Investigate the prediction results

In [22]:
EAPandEAP = predictions.filter(predictions['author']=='EAP').filter(predictions['predictedLabel']=='EAP').count()
EAPnotEAP = predictions.filter(predictions['author']=='EAP').filter(predictions['predictedLabel']!='EAP').count()
notEAPbutEAP = predictions.filter(predictions['author']!='EAP').filter(predictions['predictedLabel']=='EAP').count()
print("Predicted EAP correctly {} times.".format(EAPandEAP))
print("Failed to predict EAP {} times.".format(EAPnotEAP))
print("Predicted EAP incorrectly {} times.".format(notEAPbutEAP))

Predicted EAP correctly 5 times.
Failed to predict EAP 4 times.
Predicted EAP incorrectly 5 times.


In [23]:
HPLandHPL = predictions.filter(predictions['author']=='HPL').filter(predictions['predictedLabel']=='HPL').count()
HPLnotHPL = predictions.filter(predictions['author']=='HPL').filter(predictions['predictedLabel']!='HPL').count()
notHPLbutHPL = predictions.filter(predictions['author']!='HPL').filter(predictions['predictedLabel']=='HPL').count()
print("Predicted HPL correctly {} times.".format(HPLandHPL))
print("Failed to predict HPL {} times.".format(HPLnotHPL))
print("Predicted HPL incorrectly {} times.".format(notHPLbutHPL))

Predicted HPL correctly 4 times.
Failed to predict HPL 4 times.
Predicted HPL incorrectly 4 times.


In [24]:
MWSandMWS = predictions.filter(predictions['author']=='MWS').filter(predictions['predictedLabel']=='MWS').count()
MWSnotMWS = predictions.filter(predictions['author']=='MWS').filter(predictions['predictedLabel']!='MWS').count()
notMWSbutMWS = predictions.filter(predictions['author']!='MWS').filter(predictions['predictedLabel']=='MWS').count()
print("Predicted MWS correctly {} times.".format(MWSandMWS))
print("Failed to predict MWS {} times.".format(MWSnotMWS))
print("Predicted MWS incorrectly {} times.".format(notMWSbutMWS))

Predicted MWS correctly 2 times.
Failed to predict MWS 4 times.
Predicted MWS incorrectly 3 times.
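
The counts printed above can be combined into per-author precision and recall, and they also reproduce the overall accuracy. This is a worked-arithmetic sketch in plain Python; the dictionary layout and the names `tp`, `fn`, and `fp` are choices made for this illustration.

```python
# Counts copied from the three outputs above:
# tp = predicted correctly, fn = failed to predict, fp = predicted incorrectly.
counts = {
    "EAP": {"tp": 5, "fn": 4, "fp": 5},
    "HPL": {"tp": 4, "fn": 4, "fp": 4},
    "MWS": {"tp": 2, "fn": 4, "fp": 3},
}


def precision(c):
    """Share of predictions for this author that were right."""
    return c["tp"] / float(c["tp"] + c["fp"])


def recall(c):
    """Share of this author's excerpts that were found."""
    return c["tp"] / float(c["tp"] + c["fn"])


total = sum(c["tp"] + c["fn"] for c in counts.values())
correct = sum(c["tp"] for c in counts.values())
accuracy = correct / float(total)

for author in sorted(counts):
    c = counts[author]
    print("{}: precision={:0.2f}, recall={:0.2f}".format(author, precision(c), recall(c)))
print("Accuracy = {:0.2f}%".format(accuracy * 100))
```

Eleven of the 23 test rows are correct, which is the 47.83% reported by the evaluator. With so few test rows, each single prediction moves the accuracy by more than four percentage points.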


# Use Natural Language Understanding to create rich features

## Setup configuration for the Natural Language Understanding service

In [25]:
import watson_developer_cloud
from watson_developer_cloud import NaturalLanguageUnderstandingV1
import watson_developer_cloud.natural_language_understanding.features.v1 as Features
import json

In [26]:
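import os
# Read the service credentials from the environment instead of hard-coding them in the notebook.
# The environment variable names are an arbitrary choice; set them before running this cell.
NLU_USERNAME = os.environ['NLU_USERNAME']
NLU_PASSWORD = os.environ['NLU_PASSWORD']
natural_language_understanding = NaturalLanguageUnderstandingV1(
  username=NLU_USERNAME,
  password=NLU_PASSWORD,
  version="2017-02-27")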

### Show example of employing NLU API on single row of the data set

In [27]:
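# toJSON() yields '{"text":"..."}'; the [8:-1] slice drops the leading '{"text":' and the
# trailing '}', so the excerpt is still wrapped in its JSON double quotes.
dataNLU = data.select(data["text"]).toJSON().collect()[0][8:-1]
print(dataNLU)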
features=[
    Features.Emotion(),
    Features.Sentiment()
  ]
nlu = natural_language_understanding.analyze(text=dataNLU, features=features)
anger = nlu['emotion']['document']['emotion']['anger']
joy = nlu['emotion']['document']['emotion']['joy']
sadness = nlu['emotion']['document']['emotion']['sadness']
fear = nlu['emotion']['document']['emotion']['fear']
disgust = nlu['emotion']['document']['emotion']['disgust']
sentiment = nlu['sentiment']['document']['score']

print("")
print("Anger = {}".format(anger))
print("Joy = {}".format(joy))
print("Sadness = {}".format(sadness))
print("Fear = {}".format(fear))
print("Disgust = {}".format(disgust))
print("Sentiment = {}".format(sentiment))
print("")
print(json.dumps(nlu, indent=2))

"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall."

Anger = 0.22525
Joy = 0.208037
Sadness = 0.156157
Fear = 0.092108
Disgust = 0.024618
Sentiment = 0.834411

{
  "usage": {
    "text_characters": 233, 
    "features": 2, 
    "text_units": 1
  }, 
  "emotion": {
    "document": {
      "emotion": {
        "anger": 0.22525, 
        "joy": 0.208037, 
        "sadness": 0.156157, 
        "fear": 0.092108, 
        "disgust": 0.024618
      }
    }
  }, 
  "language": "en", 
  "sentiment": {
    "document": {
      "score": 0.834411, 
      "label": "positive"
    }
  }
}


### Limit the data size for processing efficiency

In [28]:
data2 = data.limit(100)
data2.toPandas().head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP
1,id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL
2,id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP
3,id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS
4,id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL


## Define UDF to create NLU derived features

In [29]:
from pyspark.sql.functions import udf
# Each call builds its own NLU client inside the lambda and returns the response as a JSON string.
udfNLU = (udf(lambda text: json.dumps(NaturalLanguageUnderstandingV1(
    username=NLU_USERNAME, password=NLU_PASSWORD, version="2017-02-27")
    .analyze(text=text, features=features))))
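
Because the UDF makes one remote call per row, a single failed request can abort the whole job. One possible safeguard, shown here only as a sketch, is to wrap the call so that a failure yields `None` for that row. The names `make_safe_analyzer` and `fake_analyze` are hypothetical; `fake_analyze` stands in for the real service call so the example runs without credentials.

```python
import json


def make_safe_analyzer(analyze_fn):
    """Wrap analyze_fn so that empty input or a failed call yields None."""
    def safe_analyze(text):
        if text is None or not text.strip():
            return None
        try:
            return json.dumps(analyze_fn(text))
        except Exception:
            # Broad on purpose for this sketch; narrow it to the SDK's error type in real code.
            return None
    return safe_analyze


def fake_analyze(text):
    """Hypothetical stand-in for the NLU call; fails on demand."""
    if "fail" in text:
        raise RuntimeError("simulated service error")
    return {"usage": {"text_characters": len(text)}}


safe = make_safe_analyzer(fake_analyze)
print(safe("abc"))
print(safe("please fail"))
```

Rows that come back as `None` would then need to be filtered out or retried before the scores are extracted.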

## Invoke UDF to create new column with NLU output

In [30]:
data2 = data2.withColumn('nlu', udfNLU(data2['text']))
data2.toPandas().head()

Unnamed: 0,id,text,author,nlu
0,id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP,"{""usage"": {""text_characters"": 231, ""features"": 2, ""text_units"": 1}, ""emotion"": {""document"": {""emotion"": {""anger"": 0.22525, ""joy"": 0.208037, ""sadness"": 0.156157, ""fear"": 0.092108, ""disgust"": 0.024618}}}, ""language"": ""en"", ""sentiment"": {""document"": {""score"": 0.875231, ""label"": ""positive""}}}"
1,id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL,"{""usage"": {""text_characters"": 71, ""features"": 2, ""text_units"": 1}, ""emotion"": {""document"": {""emotion"": {""anger"": 0.268703, ""joy"": 0.07335, ""sadness"": 0.284674, ""fear"": 0.336473, ""disgust"": 0.049945}}}, ""language"": ""en"", ""sentiment"": {""document"": {""score"": -0.867677, ""label"": ""negative""}}}"
2,id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP,"{""usage"": {""text_characters"": 200, ""features"": 2, ""text_units"": 1}, ""emotion"": {""document"": {""emotion"": {""anger"": 0.047614, ""joy"": 0.856397, ""sadness"": 0.021778, ""fear"": 0.017659, ""disgust"": 0.061879}}}, ""language"": ""en"", ""sentiment"": {""document"": {""score"": -0.739374, ""label"": ""negative""}}}"
3,id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS,"{""usage"": {""text_characters"": 206, ""features"": 2, ""text_units"": 1}, ""emotion"": {""document"": {""emotion"": {""anger"": 0.002317, ""joy"": 0.895471, ""sadness"": 0.074464, ""fear"": 0.011802, ""disgust"": 0.011855}}}, ""language"": ""en"", ""sentiment"": {""document"": {""score"": 0.928205, ""label"": ""positive""}}}"
4,id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL,"{""usage"": {""text_characters"": 174, ""features"": 2, ""text_units"": 1}, ""emotion"": {""document"": {""emotion"": {""anger"": 0.239133, ""joy"": 0.005415, ""sadness"": 0.43508, ""fear"": 0.343188, ""disgust"": 0.311778}}}, ""language"": ""en"", ""sentiment"": {""document"": {""score"": -0.717046, ""label"": ""negative""}}}"


## Define UDFs to extract NLU derived features

In [31]:
from pyspark.sql.types import DoubleType
udfAnger = udf(lambda nlu: json.loads(nlu)["emotion"]["document"]["emotion"]["anger"], DoubleType())
udfJoy = udf(lambda nlu: json.loads(nlu)["emotion"]["document"]["emotion"]["joy"], DoubleType())
udfSadness = udf(lambda nlu: json.loads(nlu)["emotion"]["document"]["emotion"]["sadness"], DoubleType())
udfFear = udf(lambda nlu: json.loads(nlu)["emotion"]["document"]["emotion"]["fear"], DoubleType())
udfDisgust = udf(lambda nlu: json.loads(nlu)["emotion"]["document"]["emotion"]["disgust"], DoubleType())
udfSentiment = udf(lambda nlu: json.loads(nlu)['sentiment']['document']['score'], DoubleType())
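
Each of the six UDFs above parses the same JSON string again and raises a `KeyError` if a field is missing. As a minimal sketch of the extraction logic in plain Python, the hypothetical helper `extract_nlu_features` below parses a response once and returns all six scores, using `None` for anything absent. The sample string is the NLU output shown above for row 1 (`id17569`).

```python
import json

EMOTIONS = ("anger", "joy", "sadness", "fear", "disgust")


def extract_nlu_features(nlu_json):
    """Parse one NLU response string and return the six scores as a dict.

    Missing fields map to None instead of raising KeyError.
    """
    doc = json.loads(nlu_json)
    emotion = doc.get("emotion", {}).get("document", {}).get("emotion", {})
    sentiment = doc.get("sentiment", {}).get("document", {})
    # Column names follow the notebook: Anger, Joy, Sadness, Fear, Disgust, Sentiment.
    features = {name.capitalize(): emotion.get(name) for name in EMOTIONS}
    features["Sentiment"] = sentiment.get("score")
    return features


sample = (
    '{"usage": {"text_characters": 71, "features": 2, "text_units": 1}, '
    '"emotion": {"document": {"emotion": {"anger": 0.268703, "joy": 0.07335, '
    '"sadness": 0.284674, "fear": 0.336473, "disgust": 0.049945}}}, '
    '"language": "en", '
    '"sentiment": {"document": {"score": -0.867677, "label": "negative"}}}'
)

row_features = extract_nlu_features(sample)
print(row_features)
```

The same function body could back each per-column UDF, or a single UDF returning a struct, if re-parsing the JSON six times per row becomes a concern.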

## Invoke UDFs to create new columns for the enhanced emotion and sentiment features

In [32]:
data2 = (data2.withColumn('Anger', udfAnger(data2['nlu']))
        .withColumn('Joy', udfJoy(data2['nlu']))
        .withColumn('Sadness', udfSadness(data2['nlu']))
        .withColumn('Fear', udfFear(data2['nlu']))
        .withColumn('Disgust', udfDisgust(data2['nlu']))
        .withColumn('Sentiment', udfSentiment(data2['nlu'])))
data2.toPandas().head()

Unnamed: 0,id,text,author,nlu,Anger,Joy,Sadness,Fear,Disgust,Sentiment
0,id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP,"{""usage"": {""text_characters"": 231, ""features"": 2, ""text_units"": 1}, ""emotion"": {""document"": {""emotion"": {""anger"": 0.22525, ""joy"": 0.208037, ""sadness"": 0.156157, ""fear"": 0.092108, ""disgust"": 0.024618}}}, ""language"": ""en"", ""sentiment"": {""document"": {""score"": 0.875231, ""label"": ""positive""}}}",0.22525,0.208037,0.156157,0.092108,0.024618,0.875231
1,id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL,"{""usage"": {""text_characters"": 71, ""features"": 2, ""text_units"": 1}, ""emotion"": {""document"": {""emotion"": {""anger"": 0.268703, ""joy"": 0.07335, ""sadness"": 0.284674, ""fear"": 0.336473, ""disgust"": 0.049945}}}, ""language"": ""en"", ""sentiment"": {""document"": {""score"": -0.867677, ""label"": ""negative""}}}",0.268703,0.07335,0.284674,0.336473,0.049945,-0.867677
2,id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP,"{""usage"": {""text_characters"": 200, ""features"": 2, ""text_units"": 1}, ""emotion"": {""document"": {""emotion"": {""anger"": 0.047614, ""joy"": 0.856397, ""sadness"": 0.021778, ""fear"": 0.017659, ""disgust"": 0.061879}}}, ""language"": ""en"", ""sentiment"": {""document"": {""score"": -0.739374, ""label"": ""negative""}}}",0.047614,0.856397,0.021778,0.017659,0.061879,-0.739374
3,id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS,"{""usage"": {""text_characters"": 206, ""features"": 2, ""text_units"": 1}, ""emotion"": {""document"": {""emotion"": {""anger"": 0.002317, ""joy"": 0.895471, ""sadness"": 0.074464, ""fear"": 0.011802, ""disgust"": 0.011855}}}, ""language"": ""en"", ""sentiment"": {""document"": {""score"": 0.928205, ""label"": ""positive""}}}",0.002317,0.895471,0.074464,0.011802,0.011855,0.928205
4,id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL,"{""usage"": {""text_characters"": 174, ""features"": 2, ""text_units"": 1}, ""emotion"": {""document"": {""emotion"": {""anger"": 0.239133, ""joy"": 0.005415, ""sadness"": 0.43508, ""fear"": 0.343188, ""disgust"": 0.311778}}}, ""language"": ""en"", ""sentiment"": {""document"": {""score"": -0.717046, ""label"": ""negative""}}}",0.239133,0.005415,0.43508,0.343188,0.311778,-0.717046


In [35]:
data2.select(data2['text'], data2['Anger'], data2['Joy'], data2['Sadness'], data2['Fear'], data2['Disgust'], data2['Sentiment']).toPandas().head(10)

Unnamed: 0,text,Anger,Joy,Sadness,Fear,Disgust,Sentiment
0,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",0.22525,0.208037,0.156157,0.092108,0.024618,0.875231
1,It never once occurred to me that the fumbling might be a mere mistake.,0.268703,0.07335,0.284674,0.336473,0.049945,-0.867677
2,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",0.047614,0.856397,0.021778,0.017659,0.061879,-0.739374
3,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",0.002317,0.895471,0.074464,0.011802,0.011855,0.928205
4,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",0.239133,0.005415,0.43508,0.343188,0.311778,-0.717046
5,"A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services.",0.030202,0.380229,0.316263,0.011006,0.016229,0.987105
6,"The astronomer, perhaps, at this point, took refuge in the suggestion of non luminosity; and here analogy was suddenly let fall.",0.146274,0.03453,0.189046,0.644884,0.066424,0.0
7,The surcingle hung in ribands from my body.,0.175506,0.159886,0.116204,0.363028,0.3051,0.0
8,"I knew that you could not say to yourself 'stereotomy' without being brought to think of atomies, and thus of the theories of Epicurus; and since, when we discussed this subject not very long ago, I mentioned to you how singularly, yet with how little notice, the vague guesses of that noble Greek had met with confirmation in the late nebular cosmogony, I felt that you could not avoid casting your eyes upward to the great nebula in Orion, and I certainly expected that you would do so.",0.007289,0.17762,0.575457,0.321208,0.025653,0.569564
9,"I confess that neither the structure of languages, nor the code of governments, nor the politics of various states possessed attractions for me.",0.073041,0.262087,0.175994,0.142699,0.117686,-0.851087


# Retrain model with NLU features added

## Split the dataset into training and test data sets

In [36]:
train2, test2 = data2.randomSplit([70.0,30.0], seed=1)
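
The weights passed to `randomSplit` do not need to sum to 1; Spark normalizes them, so `[70.0, 30.0]` behaves like `[0.7, 0.3]`. The plain-Python sketch below illustrates that idea under simplifying assumptions: `normalize_weights` and `random_split` are hypothetical helpers, and the per-row uniform draw only approximates the behavior, so it will not reproduce the exact rows in `train2` and `test2`.

```python
import random


def normalize_weights(weights):
    """Scale the weights so they sum to 1.0."""
    total = float(sum(weights))
    return [w / total for w in weights]


def random_split(rows, weights, seed):
    """Assign each row to one split according to the normalized weights."""
    fractions = normalize_weights(weights)
    # Cumulative upper bounds, e.g. [0.7, 1.0] for a 70/30 split.
    bounds = []
    running = 0.0
    for fraction in fractions:
        running += fraction
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in fractions]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split catches any draw left over by rounding.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


fractions = normalize_weights([70.0, 30.0])
train_rows, test_rows = random_split(list(range(100)), [70.0, 30.0], seed=1)
print(fractions, len(train_rows), len(test_rows))
```

Because each row is assigned independently, the split sizes only approximate 70/30, which matters for a small dataset like the one used here.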

## Bucketize the NLU features

In [37]:
from pyspark.ml.feature import Bucketizer
AngerBucketSplits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
AngerBucket = Bucketizer(splits=AngerBucketSplits, inputCol="Anger", outputCol="AngerBucket")
JoyBucketSplits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
JoyBucket = Bucketizer(splits=JoyBucketSplits, inputCol="Joy", outputCol="JoyBucket")
SadnessBucketSplits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
SadnessBucket = Bucketizer(splits=SadnessBucketSplits, inputCol="Sadness", outputCol="SadnessBucket")
FearBucketSplits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
FearBucket = Bucketizer(splits=FearBucketSplits, inputCol="Fear", outputCol="FearBucket")
DisgustBucketSplits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
DisgustBucket = Bucketizer(splits=DisgustBucketSplits, inputCol="Disgust", outputCol="DisgustBucket")
SentimentBucketSplits = [-1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
SentimentBucket = Bucketizer(splits=SentimentBucketSplits, inputCol="Sentiment", outputCol="SentimentBucket")
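
To make the bucketing concrete: with n+1 split points, `Bucketizer` produces n buckets, each covering the half-open range [x, y) except the last, which also includes its upper bound. The plain-Python sketch below mirrors that rule for a single value; `bucketize` is a hypothetical helper written for illustration and is not part of PySpark. It also shows that the repeated split lists can be generated rather than typed out.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the bucket index for value as a float.

    Buckets are [splits[i], splits[i+1]), except the last bucket, which also
    includes the final split point. Values outside the splits raise ValueError.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value {} is outside the splits".format(value))
    index = bisect_right(splits, value) - 1
    # A value equal to the last split point belongs to the last bucket.
    return float(min(index, len(splits) - 2))


# Same split points as in the cell above, generated instead of repeated.
emotion_splits = [i / 10.0 for i in range(11)]         # 0.0 .. 1.0
sentiment_splits = [i / 10.0 for i in range(-10, 11)]  # -1.0 .. 1.0

print(bucketize(0.22525, emotion_splits))      # Anger score of row 0
print(bucketize(-0.867677, sentiment_splits))  # Sentiment score of row 1
```

The emotion scores fall in [0, 1] and the sentiment score in [-1, 1], which is why the two sets of split points differ.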

## Create a feature vector

In [38]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["features", "AngerBucket", "JoyBucket", "SadnessBucket", "FearBucket", "DisgustBucket","SentimentBucket"], outputCol="features2")

## Create a revised machine learning pipeline utilizing the new bucketed NLU features

In [39]:
lr2 = LogisticRegression(labelCol = "label", featuresCol= "features2", maxIter=10, regParam=0.3, threshold=0.5)
stages2 = [tokenizer, remover, hashingTF, idf, AngerBucket, JoyBucket, SadnessBucket, FearBucket, DisgustBucket, SentimentBucket, assembler, labelIndexer, lr2, labelConverter]
pipeline2 = Pipeline(stages = stages2)

## Train the new model using the training data set

In [40]:
model2 = pipeline2.fit(train2)

## Make updated predictions (with NLU features) using the test data set

In [41]:
predictions2 = model2.transform(test2)

In [42]:
predictions2.select("author", "label", "prediction", 'predictedLabel', "probability").toPandas().head()

Unnamed: 0,author,label,prediction,predictedLabel,probability
0,MWS,2,2,MWS,"[0.313963602262, 0.206820474173, 0.479215923565]"
1,MWS,2,1,HPL,"[0.205516174805, 0.56008512151, 0.234398703685]"
2,HPL,1,1,HPL,"[0.306149703703, 0.372492537266, 0.321357759031]"
3,MWS,2,2,MWS,"[0.130653342993, 0.0636292267325, 0.805717430274]"
4,HPL,1,1,HPL,"[0.284520552752, 0.637145197235, 0.0783342500136]"


## Evaluate the updated model performance by calculating the accuracy

In [43]:
evaluator2 = MulticlassClassificationEvaluator(labelCol = "label", predictionCol="prediction").setMetricName("accuracy")
print('Accuracy with NLU = {:0.2f}%.'.format(evaluator2.evaluate(predictions2)*100))

Accuracy with NLU = 52.17%.


## Investigate Improved Results

In [44]:
EAPandEAP2 = predictions2.filter(predictions2['author']=='EAP').filter(predictions2['predictedLabel']=='EAP').count()
EAPnotEAP2 = predictions2.filter(predictions2['author']=='EAP').filter(predictions2['predictedLabel']!='EAP').count()
notEAPbutEAP2 = predictions2.filter(predictions2['author']!='EAP').filter(predictions2['predictedLabel']=='EAP').count()
print("Predicted EAP correctly {} times vs. {} previously.".format(EAPandEAP2, EAPandEAP))
print("Failed to predict EAP {} times vs. {} previously.".format(EAPnotEAP2, EAPnotEAP))
print("Predicted EAP incorrectly {} times vs. {} previously.".format(notEAPbutEAP2, notEAPbutEAP))

Predicted EAP correctly 6 times vs. 5 previously.
Failed to predict EAP 3 times vs. 4 previously.
Predicted EAP incorrectly 6 times vs. 5 previously.


In [45]:
HPLandHPL2 = predictions2.filter(predictions2['author']=='HPL').filter(predictions2['predictedLabel']=='HPL').count()
HPLnotHPL2 = predictions2.filter(predictions2['author']=='HPL').filter(predictions2['predictedLabel']!='HPL').count()
notHPLbutHPL2 = predictions2.filter(predictions2['author']!='HPL').filter(predictions2['predictedLabel']=='HPL').count()
print("Predicted HPL correctly {} times vs. {} previously.".format(HPLandHPL2, HPLandHPL))
print("Failed to predict HPL {} times vs. {} previously.".format(HPLnotHPL2, HPLnotHPL))
print("Predicted HPL incorrectly {} times vs. {} previously.".format(notHPLbutHPL2, notHPLbutHPL))

Predicted HPL correctly 4 times vs. 4 previously.
Failed to predict HPL 4 times vs. 4 previously.
Predicted HPL incorrectly 3 times vs. 4 previously.


In [46]:
MWSandMWS2 = predictions2.filter(predictions2['author']=='MWS').filter(predictions2['predictedLabel']=='MWS').count()
MWSnotMWS2 = predictions2.filter(predictions2['author']=='MWS').filter(predictions2['predictedLabel']!='MWS').count()
notMWSbutMWS2 = predictions2.filter(predictions2['author']!='MWS').filter(predictions2['predictedLabel']=='MWS').count()
print("Predicted MWS correctly {} times vs. {} previously.".format(MWSandMWS2, MWSandMWS))
print("Failed to predict MWS {} times vs. {} previously.".format(MWSnotMWS2, MWSnotMWS))
print("Predicted MWS incorrectly {} times vs. {} previously.".format(notMWSbutMWS2, notMWSbutMWS))

Predicted MWS correctly 2 times vs. 2 previously.
Failed to predict MWS 4 times vs. 4 previously.
Predicted MWS incorrectly 2 times vs. 3 previously.
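
The per-author counts printed above can be tied back to the reported accuracy. The short sketch below copies the "correct" and "failed to predict" counts from those outputs and recomputes accuracy as correct predictions divided by test rows; `accuracy_from_counts` is a hypothetical helper used only for this check.

```python
# (predicted correctly, failed to predict) per author, copied from the outputs above.
with_nlu = {"EAP": (6, 3), "HPL": (4, 4), "MWS": (2, 4)}
previous = {"EAP": (5, 4), "HPL": (4, 4), "MWS": (2, 4)}


def accuracy_from_counts(counts):
    """Return (correct, total, accuracy in percent) from per-class counts."""
    correct = sum(hit for hit, _ in counts.values())
    total = sum(hit + miss for hit, miss in counts.values())
    return correct, total, 100.0 * correct / total


correct2, total2, accuracy2 = accuracy_from_counts(with_nlu)
correct1, total1, accuracy1 = accuracy_from_counts(previous)
print("With NLU: {} of {} correct = {:0.2f}%".format(correct2, total2, accuracy2))
print("Previously: {} of {} correct = {:0.2f}%".format(correct1, total1, accuracy1))
```

The counts imply a test set of 23 excerpts, so the improvement amounts to one additional correct EAP prediction (12 correct instead of 11). On a sample this small, the gain should be read as indicative rather than conclusive.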


![IBM Logo](http://www-03.ibm.com/press/img/Large_IBM_Logo_TN.jpg)

Rich Tarro  
Solutions Architect, IBM Corporation  
rtarro@us.ibm.com

November 27, 2017