<img src="https://storage.googleapis.com/kaggle-media/competitions/spooky-books/dmitrij-paskevic-44124.jpg" style="width:200px; float: left; padding-right: 10px"/>
<h2 style="font-face: verdana; font-size: 32px;">Spooky Author Identification</h2>
<h3 style="font-face: verdana; font-size: 16px;">Derive rich features for Machine Learning with the Watson Cognitive APIs</h3>
<br><br>
<a href="https://www.kaggle.com/c/spooky-author-identification/">Spooky Author Identification Kaggle Competition</a>

<h3 style="font-face: verdana; font-size: 16px;">The objective of this machine learning model is to predict the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and HP Lovecraft.</h3>

The dataset contains text from works of fiction written by these spooky authors. The goal is to accurately identify the author of the sentences.

Data fields in the dataset:

    id - a unique identifier for each sentence
    text - some text written by one of the authors
    author - the author of the sentence (EAP: Edgar Allan Poe, HPL: HP Lovecraft; MWS: Mary Wollstonecraft Shelley)

<h3 style="font-face: verdana; font-size: 16px;">Approach</h3>

We will approach this challenge by first using a traditional multiclassification machine learning approach. We will then explore using IBM Watson Natural Language Understanding to derive additional enhanced features on which to learn a machine learning model.

<h3 style="font-face: verdana; font-size: 16px;">IBM Watson Natural Language Understanding</h3>

IBM Watson™ Natural Language Understanding (NLU) can analyze semantic features of text input, including categories, concepts, emotion, entities, keywords, metadata, relations, semantic roles, and sentiment. In this example, we will utilize the emotion and sentiment features of NLU to create enhanced machine learning features.


<h4 style="font-face: verdana; font-size: 16px;">Emotion</h4>

The emotion feature of NLU allows you to analyze emotion conveyed by specific target phrases or by the document as a whole. You can also enable emotion analysis for entities and keywords that are automatically detected by the service. In this example, we will simply analyze the spooky excerpt as a whole. The emotions we will derive features for are 

- Anger
- Joy
- Sadness
- Fear
- Disgust

Emotion scores range from 0 to 1 for sadness, joy, fear, disgust, and anger. A 0 means the text doesn't convey the emotion, and a 1 means the text definitely carries the emotion.

<h4 style="font-face: verdana; font-size: 16px;">Sentiment</h4>

The sentiment feature of NLU allows you to analyze the sentiment toward specific target phrases and the sentiment of the document as a whole. You can also get sentiment information for detected entities and keywords by enabling the sentiment option for those features. In this example, we will simply analyze the spooky excerpt as a whole.

The sentiment score ranges from -1 (negative sentiment) to 1 (positive sentiment).



## Download and unzip the dataset

In [2]:
import os
if os.path.isfile('train.zip'):
    os.remove("train.zip")
if os.path.isfile('train.csv'):
    os.remove("train.csv")
import wget
url = 'https://github.com/hackerguy/SpookyAuthorIdentification/blob/master/train.zip?raw=true'
wget.download(url)
import zipfile
zip = zipfile.ZipFile('train.zip', 'r')
zip.extractall()
zip.close()

## Read in the data set as a Spark DataFrame
### Infer schema and column names

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

data = (spark.read
  .format('csv')
  .option('header', 'true')
  .option("inferSchema", "true")
  .load('train.csv'))

### Display the dataframe

In [4]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)
data.toPandas().head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP
1,id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL
2,id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP
3,id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS
4,id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL


### Show the schema of the data including data types

In [5]:
data.printSchema()

root
 |-- id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- author: string (nullable = true)



## Dataset Overview - number of rows and columns

In [6]:
print("There are {} rows in the dataset.".format(str(data.count())))
print("There are {} columns in the dataset.".format(str(len(data.columns))))

There are 19579 rows in the dataset.
There are 3 columns in the dataset.


# Optionally use a subset of the data for processing efficiency

In [7]:
fraction = 1.0
data = data.sample(False, fraction, seed=0)

## Split up the dataframe - this will be important later on to limit the API call rate to the NLU Service

In [8]:
data00,data01,data02,data03,data04,data05,data06,data07,data08,data09,data10,data11,data12,data13,data14,data15,data16,data17,data18,data19, \
data20,data21,data22,data23,data24,data25,data26,data27,data28,data29,data30,data31,data32,data33,data34,data35,data36,data37,data38,data39, \
data40,data41,data42,data43,data44,data45,data46,data47,data48,data49,data50,data51,data52,data53,data54,data55,data56,data57,data58,data59, \
data60,data61,data62,data63,data64,data65,data66,data67,data68,data69,data70,data71,data72,data73,data74,data75,data76,data77,data78,data79, \
data80,data81,data82,data83,data84,data85,data86,data87,data88,data89,data90,data91,data92,data93,data94,data95,data96,data97,data98,data99 \
 = data.randomSplit([1.0]*100, 0)

In [47]:
dataUnion= (data00.union(data01).union(data02).union(data03).union(data04).union(data05).union(data06).union(data07).union(data08).union(data09)
            .union(data10).union(data11).union(data12).union(data13).union(data14).union(data15).union(data16).union(data17).union(data18).union(data19)
            .union(data20).union(data21).union(data22).union(data23).union(data24).union(data25).union(data26).union(data27).union(data28).union(data29)
            .union(data30).union(data31).union(data32).union(data33).union(data34).union(data35).union(data36).union(data37).union(data38).union(data39)
            .union(data40).union(data41).union(data42).union(data43).union(data44).union(data45).union(data46).union(data47).union(data48).union(data49)
            .union(data50).union(data51).union(data52).union(data53).union(data54).union(data55).union(data56).union(data57).union(data58).union(data59)
            .union(data60).union(data61).union(data62).union(data63).union(data64).union(data65).union(data66).union(data67).union(data68).union(data79)
            .union(data70).union(data71).union(data72).union(data73).union(data74).union(data75).union(data76).union(data77).union(data78).union(data79)
            .union(data80).union(data81).union(data82).union(data83).union(data84).union(data85).union(data86).union(data87).union(data88).union(data89)
            .union(data90).union(data91).union(data92).union(data93).union(data94).union(data95).union(data96).union(data97).union(data98).union(data99))

print("The combined dataset contains {} rows.".format(dataUnion.count()))

The combined dataset contains 19543 rows.


## Remove punctuation from the text

In [48]:
from pyspark.ml.feature import SQLTransformer
removePunctuationTrans = SQLTransformer(
    statement="""SELECT *, TRANSLATE(text,',.;?\''"+','') AS textNoPunct FROM __THIS__""")
dataUnion = removePunctuationTrans.transform(dataUnion)
dataUnion.toPandas().head()

Unnamed: 0,id,text,author,textNoPunct
0,id00159,Where if anywhere had he been on those nights of daemoniac alienage?,HPL,Where if anywhere had he been on those nights of daemoniac alienage
1,id00553,Yet the effect is incongruous to the timid alone.,EAP,Yet the effect is incongruous to the timid alone
2,id00666,"""""""You may easily believe",""""" said he",You may easily believe
3,id00788,"I am glad Woodville is not with me for perhaps he would grieve, and I desire to see smiles alone during the last scene of my life; when I last wrote to him I told him of my ill health but not of its mortal tendency, lest he should conceive it to be his duty to come to me for I fear lest the tears of friendship should destroy the blessed calm of my mind.",MWS,I am glad Woodville is not with me for perhaps he would grieve and I desire to see smiles alone during the last scene of my life when I last wrote to him I told him of my ill health but not of its mortal tendency lest he should conceive it to be his duty to come to me for I fear lest the tears of friendship should destroy the blessed calm of my mind
4,id00830,"You know, the curled up paper tacked to that frightful canvas in the cellar; the thing I thought was a photograph of some scene he meant to use as a background for that monster.",HPL,You know the curled up paper tacked to that frightful canvas in the cellar the thing I thought was a photograph of some scene he meant to use as a background for that monster


In [49]:
dataUnion = dataUnion.drop('text').withColumnRenamed('textNoPunct', 'text')
dataUnion.toPandas().head()

Unnamed: 0,id,author,text
0,id00159,HPL,Where if anywhere had he been on those nights of daemoniac alienage
1,id00553,EAP,Yet the effect is incongruous to the timid alone
2,id00666,""""" said he",You may easily believe
3,id00788,MWS,I am glad Woodville is not with me for perhaps he would grieve and I desire to see smiles alone during the last scene of my life when I last wrote to him I told him of my ill health but not of its mortal tendency lest he should conceive it to be his duty to come to me for I fear lest the tears of friendship should destroy the blessed calm of my mind
4,id00830,HPL,You know the curled up paper tacked to that frightful canvas in the cellar the thing I thought was a photograph of some scene he meant to use as a background for that monster


## Remove rows that do not have valid author fields

In [51]:
from pyspark.ml.feature import SQLTransformer
authorTrans = SQLTransformer(
    statement="SELECT * FROM __THIS__ WHERE author='EAP' OR author='HPL' OR author='MWS'")
dataUnion = authorTrans.transform(dataUnion)
dataUnion.toPandas().head()

Unnamed: 0,id,author,text
0,id00159,HPL,Where if anywhere had he been on those nights of daemoniac alienage
1,id00553,EAP,Yet the effect is incongruous to the timid alone
2,id00788,MWS,I am glad Woodville is not with me for perhaps he would grieve and I desire to see smiles alone during the last scene of my life when I last wrote to him I told him of my ill health but not of its mortal tendency lest he should conceive it to be his duty to come to me for I fear lest the tears of friendship should destroy the blessed calm of my mind
3,id00830,HPL,You know the curled up paper tacked to that frightful canvas in the cellar the thing I thought was a photograph of some scene he meant to use as a background for that monster
4,id01380,MWS,Sometimes she observed the war of elements thinking that they also declared against her and listened to the pattering of the rain in gloomy despair


## Tokenize the text

In [52]:
from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

tokenizer = Tokenizer(inputCol="text", outputCol="words")

countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(dataUnion)
(tokenized.select("text", "words")
    .withColumn("#tokens", countTokens(col("words"))).toPandas().head())

Unnamed: 0,text,words,#tokens
0,Where if anywhere had he been on those nights of daemoniac alienage,"[where, if, anywhere, had, he, been, on, those, nights, of, daemoniac, alienage]",12
1,Yet the effect is incongruous to the timid alone,"[yet, the, effect, is, incongruous, to, the, timid, alone]",9
2,I am glad Woodville is not with me for perhaps he would grieve and I desire to see smiles alone during the last scene of my life when I last wrote to him I told him of my ill health but not of its mortal tendency lest he should conceive it to be his duty to come to me for I fear lest the tears of friendship should destroy the blessed calm of my mind,"[i, am, glad, woodville, is, not, with, me, for, perhaps, he, would, grieve, and, i, desire, to, see, smiles, alone, during, the, last, scene, of, my, life, when, i, last, wrote, to, him, i, told, him, of, my, ill, health, but, not, of, its, mortal, tendency, lest, he, should, conceive, it, to, be, his, duty, to, come, to, me, for, i, fear, lest, the, tears, of, friendship, should, destroy, the, blessed, calm, of, my, mind]",75
3,You know the curled up paper tacked to that frightful canvas in the cellar the thing I thought was a photograph of some scene he meant to use as a background for that monster,"[you, know, the, curled, up, paper, tacked, to, that, frightful, canvas, in, the, cellar, the, thing, i, thought, was, a, photograph, of, some, scene, he, meant, to, use, as, a, background, for, that, monster]",34
4,Sometimes she observed the war of elements thinking that they also declared against her and listened to the pattering of the rain in gloomy despair,"[sometimes, she, observed, the, war, of, elements, thinking, that, they, also, declared, against, her, and, listened, to, the, pattering, of, the, rain, in, gloomy, despair]",25


## Remove common words

In [53]:
from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="words", outputCol="filtered").setCaseSensitive(False)
removed = remover.transform(tokenized)
removed.select("text", "words", "filtered" ).toPandas().head()

Unnamed: 0,text,words,filtered
0,Where if anywhere had he been on those nights of daemoniac alienage,"[where, if, anywhere, had, he, been, on, those, nights, of, daemoniac, alienage]","[anywhere, nights, daemoniac, alienage]"
1,Yet the effect is incongruous to the timid alone,"[yet, the, effect, is, incongruous, to, the, timid, alone]","[yet, effect, incongruous, timid, alone]"
2,I am glad Woodville is not with me for perhaps he would grieve and I desire to see smiles alone during the last scene of my life when I last wrote to him I told him of my ill health but not of its mortal tendency lest he should conceive it to be his duty to come to me for I fear lest the tears of friendship should destroy the blessed calm of my mind,"[i, am, glad, woodville, is, not, with, me, for, perhaps, he, would, grieve, and, i, desire, to, see, smiles, alone, during, the, last, scene, of, my, life, when, i, last, wrote, to, him, i, told, him, of, my, ill, health, but, not, of, its, mortal, tendency, lest, he, should, conceive, it, to, be, his, duty, to, come, to, me, for, i, fear, lest, the, tears, of, friendship, should, destroy, the, blessed, calm, of, my, mind]","[glad, woodville, perhaps, would, grieve, desire, see, smiles, alone, last, scene, life, last, wrote, told, ill, health, mortal, tendency, lest, conceive, duty, come, fear, lest, tears, friendship, destroy, blessed, calm, mind]"
3,You know the curled up paper tacked to that frightful canvas in the cellar the thing I thought was a photograph of some scene he meant to use as a background for that monster,"[you, know, the, curled, up, paper, tacked, to, that, frightful, canvas, in, the, cellar, the, thing, i, thought, was, a, photograph, of, some, scene, he, meant, to, use, as, a, background, for, that, monster]","[know, curled, paper, tacked, frightful, canvas, cellar, thing, thought, photograph, scene, meant, use, background, monster]"
4,Sometimes she observed the war of elements thinking that they also declared against her and listened to the pattering of the rain in gloomy despair,"[sometimes, she, observed, the, war, of, elements, thinking, that, they, also, declared, against, her, and, listened, to, the, pattering, of, the, rain, in, gloomy, despair]","[sometimes, observed, war, elements, thinking, also, declared, listened, pattering, rain, gloomy, despair]"


### Show list of common words removed

In [54]:
from __future__ import print_function
[print(x) for x in remover.getStopWords()]

i
me
my
myself
we
our
ours
ourselves
you
your
yours
yourself
yourselves
he
him
his
himself
she
her
hers
herself
it
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
should
now
d
ll
m
o
re
ve
y
ain
aren
couldn
didn
doesn
hadn
hasn
haven
isn
ma
mightn
mustn
needn
shan
shouldn
wasn
weren
won
wouldn


[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,

## Hash the words

In [55]:
from pyspark.ml.feature import HashingTF

hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=100)
featurizedData = hashingTF.transform(removed)
featurizedData.select("text", "filtered", "rawFeatures").toPandas().head()

Unnamed: 0,text,filtered,rawFeatures
0,Where if anywhere had he been on those nights of daemoniac alienage,"[anywhere, nights, daemoniac, alienage]","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"
1,Yet the effect is incongruous to the timid alone,"[yet, effect, incongruous, timid, alone]","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"
2,I am glad Woodville is not with me for perhaps he would grieve and I desire to see smiles alone during the last scene of my life when I last wrote to him I told him of my ill health but not of its mortal tendency lest he should conceive it to be his duty to come to me for I fear lest the tears of friendship should destroy the blessed calm of my mind,"[glad, woodville, perhaps, would, grieve, desire, see, smiles, alone, last, scene, life, last, wrote, told, ill, health, mortal, tendency, lest, conceive, duty, come, fear, lest, tears, friendship, destroy, blessed, calm, mind]","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0)"
3,You know the curled up paper tacked to that frightful canvas in the cellar the thing I thought was a photograph of some scene he meant to use as a background for that monster,"[know, curled, paper, tacked, frightful, canvas, cellar, thing, thought, photograph, scene, meant, use, background, monster]","(0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0)"
4,Sometimes she observed the war of elements thinking that they also declared against her and listened to the pattering of the rain in gloomy despair,"[sometimes, observed, war, elements, thinking, also, declared, listened, pattering, rain, gloomy, despair]","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)"


## Inverse weight words that occur frequently across all text

In [56]:
from pyspark.ml.feature import IDF

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("text", "rawFeatures", "features").toPandas().head()

Unnamed: 0,text,rawFeatures,features
0,Where if anywhere had he been on those nights of daemoniac alienage,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 2.3578376752, 0.0, 0.0, 0.0, 0.0, 2.23381436968, 2.31083762115, 0.0, 0.0, 0.0, 2.40962528497, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"
1,Yet the effect is incongruous to the timid alone,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.23019211475, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.10190430106, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.07059584048, 0.0, 2.1897007534, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21379830497, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"
2,I am glad Woodville is not with me for perhaps he would grieve and I desire to see smiles alone during the last scene of my life when I last wrote to him I told him of my ill health but not of its mortal tendency lest he should conceive it to be his duty to come to me for I fear lest the tears of friendship should destroy the blessed calm of my mind,"(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 2.08881451813, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.40962528497, 0.0, 1.51129462677, 0.0, 4.20199207157, 2.26007638317, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.96695410623, 0.0, 2.23019211475, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.03470009477, 0.0, 0.0, 0.0, 0.0, 1.96022806703, 0.0, 2.55577345979, 0.0, 0.0, 0.0, 2.21787163036, 0.0, 4.31305395943, 1.96062246656, 0.0, 0.0, 2.50623420641, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.24685916724, 2.17982900492, 1.92117419623, 1.94768871778, 2.07059584048, 0.0, 2.1897007534, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.91814503829, 2.31814307651, 0.0, 2.12395388522, 0.0, 0.0, 0.0, 1.95159040045, 0.0, 0.0, 0.0, 2.1603731383, 4.12016652018, 0.0, 0.0, 0.0, 0.0, 1.94846783741, 1.95472274443, 0.0, 0.0, 0.0, 0.0)"
3,You know the curled up paper tacked to that frightful canvas in the cellar the thing I thought was a photograph of some scene he meant to use as a background for that monster,"(0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0)","(0.0, 2.3578376752, 0.0, 0.0, 0.0, 1.87994272924, 0.0, 0.0, 0.0, 0.0, 2.12348952487, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0091168165, 0.0, 0.0, 0.0, 0.0, 2.29197960252, 0.0, 0.0, 0.0, 2.14507647293, 2.55577345979, 0.0, 2.42960098039, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.07059584048, 0.0, 2.1897007534, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.16956749293, 0.0, 0.0, 0.0, 2.09150872722, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.06008326009, 0.0, 0.0, 0.0, 0.0, 0.0, 1.95472274443, 2.24633409503, 0.0, 0.0, 0.0)"
4,Sometimes she observed the war of elements thinking that they also declared against her and listened to the pattering of the rain in gloomy despair,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.23381436968, 2.31083762115, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.26862801198, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.04795664982, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.17982900492, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.16956749293, 0.0, 0.0, 2.31814307651, 0.0, 4.24790777044, 0.0, 0.0, 0.0, 1.95159040045, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.34965528004, 0.0, 0.0, 0.0, 2.24633409503, 0.0, 0.0, 0.0)"


## Encode the label column

In [57]:
from pyspark.ml.feature import StringIndexer
labelIndexer = StringIndexer(inputCol='author', outputCol='label').fit(dataUnion)

## Use Logistic Regression Algorithm to predict author

In [58]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol = "label", maxIter=10, regParam=0.3, threshold=0.7)

## Convert indexed labels back to original labels

In [59]:
from pyspark.ml.feature import IndexToString
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

## Define the machine learning pipeline

In [63]:
stages = [removePunctuationTrans, authorTrans, tokenizer, remover, hashingTF, idf, labelIndexer, lr, labelConverter]
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = stages)

### Display the parameter setting of the pipeline stages

In [64]:
print("Remove Punctuation SQL Transformer:")
print(removePunctuationTrans.explainParams())
print("*************************")
print("Remove Invalid Author Labels SQL Transformer:")
print(authorTrans.explainParams())
print("*************************")
print(tokenizer.explainParams())
print("*************************")
print("Tokenizer:")
print(tokenizer.explainParams())
print("*************************")
print("Remover:")
print(remover.explainParams())
print("*************************")
print("HashingTF:")
print(hashingTF.explainParams())
print("*************************")
print("IDF:")
print(idf.explainParams())
print("*************************")
print("LogisticRegression:")
print(lr.explainParams())
print("*************************")
print("Pipeline:")
print(pipeline.explainParams())

Remove Punctuation SQL Transformer:
statement: SQL statement (current: SELECT *, TRANSLATE(text,',.;?''"+','') AS textNoPunct FROM __THIS__)
*************************
Remove Invalid Author Labels SQL Transformer:
statement: SQL statement (current: SELECT * FROM __THIS__ WHERE author='EAP' OR author='HPL' OR author='MWS')
*************************
inputCol: input column name. (current: text)
outputCol: output column name. (default: Tokenizer_42c4bc9876fac23c27bd__output, current: words)
*************************
Tokenizer:
inputCol: input column name. (current: text)
outputCol: output column name. (default: Tokenizer_42c4bc9876fac23c27bd__output, current: words)
*************************
Remover:
caseSensitive: whether to do a case sensitive comparison over the stop words (default: False, current: False)
inputCol: input column name. (current: words)
outputCol: output column name. (default: StopWordsRemover_448f9ce0d82672eff3f6__output, current: filtered)
stopWords: The words to be filte

## Split the dataset into training and test data sets

In [65]:
train, test = dataUnion.randomSplit([70.0,30.0], seed=1)
print('The number of records in the training data set is {}.'.format(train.count()))
print('The number of rows labeled EAP in the training data set is {}.'.format(train.filter(train['author'] == 'EAP').count()))
print('The number of rows labeled HPL in the training data set is {}.'.format(train.filter(train['author'] == 'HPL').count()))
print('The number of rows labeled MWS in the training data set is {}.'.format(train.filter(train['author'] == 'MWS').count()))
print("")
print('The number of records in the test data set is {}.'.format(test.count()))
print('The number of rows labeled EAP in the test data set is {}.'.format(test.filter(test['author'] == 'EAP').count()))
print('The number of rows labeled HPL in the test data set is {}.'.format(test.filter(test['author'] == 'HPL').count()))
print('The number of rows labeled MWS in the test data set is {}.'.format(test.filter(test['author'] == 'MWS').count()))

The number of records in the training data set is 12547.
The number of rows labeled EAP in the training data set is 4933.
The number of rows labeled HPL in the training data set is 3801.
The number of rows labeled MWS in the training data set is 3813.

The number of records in the test data set is 5460.
The number of rows labeled EAP in the test data set is 2092.
The number of rows labeled HPL in the test data set is 1644.
The number of rows labeled MWS in the test data set is 1724.


## Train the model using the training data set

In [66]:
model = pipeline.fit(train)

## Make predictions using the test data set

In [67]:
predictions = model.transform(test)

In [68]:
predictions.select("author", "label", "prediction", 'predictedLabel', "probability").toPandas().head()

Unnamed: 0,author,label,prediction,predictedLabel,probability
0,MWS,1,0,EAP,"[0.374734316168, 0.310769999674, 0.314495684158]"
1,EAP,0,0,EAP,"[0.436005102966, 0.276185090862, 0.287809806172]"
2,MWS,1,0,EAP,"[0.390615230322, 0.291331800252, 0.318052969426]"
3,MWS,1,2,HPL,"[0.229277580922, 0.271608026015, 0.499114393063]"
4,HPL,2,2,HPL,"[0.356927998973, 0.285659993041, 0.357412007986]"


## Evaluate the model performance by calculating the accuracy

In [69]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol = "label", predictionCol="prediction").setMetricName("accuracy")
print('Accuracy = {:0.2f}%.'.format(evaluator.evaluate(predictions)*100))

Accuracy = 47.05%.


## Investigate the prediction results

In [70]:
EAPandEAP = predictions.filter(predictions['author']=='EAP').filter(predictions['predictedLabel']=='EAP').count()
EAPnotEAP = predictions.filter(predictions['author']=='EAP').filter(predictions['predictedLabel']!='EAP').count()
notEAPbutEAP = predictions.filter(predictions['author']!='EAP').filter(predictions['predictedLabel']=='EAP').count()
print("Predicted EAP correctly {} times.".format(EAPandEAP))
print("Failed to predict EAP {} times.".format(EAPnotEAP))
print("Predicted EAP incorrectly {} times.".format(notEAPbutEAP))

Predicted EAP correctly 1697 times.
Failed to predict EAP 395 times.
Predicted EAP incorrectly 2107 times.


In [71]:
HPLandHPL = predictions.filter(predictions['author']=='HPL').filter(predictions['predictedLabel']=='HPL').count()
HPLnotHPL = predictions.filter(predictions['author']=='HPL').filter(predictions['predictedLabel']!='HPL').count()
notHPLbutHPL = predictions.filter(predictions['author']!='HPL').filter(predictions['predictedLabel']=='HPL').count()
print("Predicted HPL correctly {} times.".format(HPLandHPL))
print("Failed to predict HPL {} times.".format(HPLnotHPL))
print("Predicted HPL incorrectly {} times.".format(notHPLbutHPL))

Predicted HPL correctly 511 times.
Failed to predict HPL 1133 times.
Predicted HPL incorrectly 434 times.


In [72]:
MWSandMWS = predictions.filter(predictions['author']=='MWS').filter(predictions['predictedLabel']=='MWS').count()
MWSnotMWS = predictions.filter(predictions['author']=='MWS').filter(predictions['predictedLabel']!='MWS').count()
notMWSbutMWS = predictions.filter(predictions['author']!='MWS').filter(predictions['predictedLabel']=='MWS').count()
print("Predicted MWS correctly {} times.".format(MWSandMWS))
print("Failed to predict MWS {} times.".format(MWSnotMWS))
print("Predicted MWS incorrectly {} times.".format(notMWSbutMWS))

Predicted MWS correctly 361 times.
Failed to predict MWS 1363 times.
Predicted MWS incorrectly 350 times.


# Use Natural Language Watson Understanding to create rich features

## Setup configuration for the Watson Natural Language Understanding (NLU) service

In [73]:
import watson_developer_cloud
from watson_developer_cloud import NaturalLanguageUnderstandingV1
import watson_developer_cloud.natural_language_understanding.features.v1 as Features
import json

In [76]:
NLU_USERNAME = 'de841bd0-8142-4715-b7f6-0885510fde6a'
NLU_PASSWORD = 'D8K4U5duW72f'
natural_language_understanding = NaturalLanguageUnderstandingV1(
  username=NLU_USERNAME,
  password=NLU_PASSWORD,
  version="2017-02-27")

## Show example of employing NLU API on single row of the data set

In [None]:
dataNLUtest = dataUnion.select(dataUnion["text"]).toJSON().collect()[0][8:-1]
print(dataNLUtest)
import json
features=[
    Features.Emotion(),
    Features.Sentiment()
  ]
nlu = natural_language_understanding.analyze(text=dataNLUtest, features=features, language='en', clean='true')
anger = nlu['emotion']['document']['emotion']['anger']
joy = nlu['emotion']['document']['emotion']['joy']
sadness = nlu['emotion']['document']['emotion']['sadness']
fear = nlu['emotion']['document']['emotion']['fear']
disgust = nlu['emotion']['document']['emotion']['disgust']
sentiment = nlu['sentiment']['document']['score']

print("")
print("Anger = {}".format(anger))
print("Joy = {}".format(joy))
print("Sadness = {}".format(sadness))
print("Fear = {}".format(fear))
print("Disgust = {}".format(disgust))
print("Sentiment = {}".format(sentiment))
print("")
print(json.dumps(nlu, indent=2))

## Define UDF to create NLU derived features

In [37]:
from pyspark.sql.functions import udf
#import json
udfNLU = (udf(lambda text: json.dumps(NaturalLanguageUnderstandingV1(
    username=NLU_USERNAME, password=NLU_PASSWORD, version="2017-02-27")
    .analyze(text=text, features=features, language='en', clean='true'))))

## Invoke UDF to create new column with NLU output

In [38]:
#dataSmall = data.limit(10)
#dataSmall = dataSmall.withColumn('nlu', udfNLU(dataSmall['text']))
#dataSmall.toPandas().head()

In [39]:
data00NLU = data00.withColumn('nlu', udfNLU(data00['text']))
data01NLU = data01.withColumn('nlu', udfNLU(data01['text']))
data02NLU = data02.withColumn('nlu', udfNLU(data02['text']))
data03NLU = data03.withColumn('nlu', udfNLU(data03['text']))
data04NLU = data04.withColumn('nlu', udfNLU(data04['text']))
data05NLU = data05.withColumn('nlu', udfNLU(data05['text']))
data06NLU = data06.withColumn('nlu', udfNLU(data06['text']))
data07NLU = data07.withColumn('nlu', udfNLU(data07['text']))
data08NLU = data08.withColumn('nlu', udfNLU(data08['text']))
data09NLU = data09.withColumn('nlu', udfNLU(data09['text']))
data10NLU = data10.withColumn('nlu', udfNLU(data10['text']))
data11NLU = data11.withColumn('nlu', udfNLU(data11['text']))
data12NLU = data12.withColumn('nlu', udfNLU(data12['text']))
data13NLU = data13.withColumn('nlu', udfNLU(data13['text']))
data14NLU = data14.withColumn('nlu', udfNLU(data14['text']))
data15NLU = data15.withColumn('nlu', udfNLU(data15['text']))
data16NLU = data16.withColumn('nlu', udfNLU(data16['text']))
data17NLU = data17.withColumn('nlu', udfNLU(data17['text']))
data18NLU = data18.withColumn('nlu', udfNLU(data18['text']))
data19NLU = data19.withColumn('nlu', udfNLU(data19['text']))
data20NLU = data20.withColumn('nlu', udfNLU(data20['text']))
data21NLU = data21.withColumn('nlu', udfNLU(data21['text']))
data22NLU = data22.withColumn('nlu', udfNLU(data22['text']))
data23NLU = data23.withColumn('nlu', udfNLU(data23['text']))
data24NLU = data24.withColumn('nlu', udfNLU(data24['text']))
data25NLU = data25.withColumn('nlu', udfNLU(data25['text']))
data26NLU = data26.withColumn('nlu', udfNLU(data26['text']))
data27NLU = data27.withColumn('nlu', udfNLU(data27['text']))
data28NLU = data28.withColumn('nlu', udfNLU(data28['text']))
data29NLU = data29.withColumn('nlu', udfNLU(data29['text']))
data30NLU = data30.withColumn('nlu', udfNLU(data30['text']))
data31NLU = data31.withColumn('nlu', udfNLU(data31['text']))
data32NLU = data32.withColumn('nlu', udfNLU(data32['text']))
data33NLU = data33.withColumn('nlu', udfNLU(data33['text']))
data34NLU = data34.withColumn('nlu', udfNLU(data34['text']))
data35NLU = data35.withColumn('nlu', udfNLU(data35['text']))
data36NLU = data36.withColumn('nlu', udfNLU(data36['text']))
data37NLU = data37.withColumn('nlu', udfNLU(data37['text']))
data38NLU = data38.withColumn('nlu', udfNLU(data38['text']))
data39NLU = data39.withColumn('nlu', udfNLU(data39['text']))
data40NLU = data40.withColumn('nlu', udfNLU(data40['text']))
data41NLU = data41.withColumn('nlu', udfNLU(data41['text']))
data42NLU = data42.withColumn('nlu', udfNLU(data42['text']))
data43NLU = data43.withColumn('nlu', udfNLU(data43['text']))
data44NLU = data44.withColumn('nlu', udfNLU(data44['text']))
data45NLU = data45.withColumn('nlu', udfNLU(data45['text']))
data46NLU = data46.withColumn('nlu', udfNLU(data46['text']))
data47NLU = data47.withColumn('nlu', udfNLU(data47['text']))
data48NLU = data48.withColumn('nlu', udfNLU(data48['text']))
data49NLU = data49.withColumn('nlu', udfNLU(data49['text']))
data50NLU = data50.withColumn('nlu', udfNLU(data50['text']))
data51NLU = data51.withColumn('nlu', udfNLU(data51['text']))
data52NLU = data52.withColumn('nlu', udfNLU(data52['text']))
data53NLU = data53.withColumn('nlu', udfNLU(data53['text']))
data54NLU = data54.withColumn('nlu', udfNLU(data54['text']))
data55NLU = data55.withColumn('nlu', udfNLU(data55['text']))
data56NLU = data56.withColumn('nlu', udfNLU(data56['text']))
data57NLU = data57.withColumn('nlu', udfNLU(data57['text']))
data58NLU = data58.withColumn('nlu', udfNLU(data58['text']))
data59NLU = data59.withColumn('nlu', udfNLU(data59['text']))
data60NLU = data60.withColumn('nlu', udfNLU(data60['text']))
data61NLU = data61.withColumn('nlu', udfNLU(data61['text']))
data62NLU = data62.withColumn('nlu', udfNLU(data62['text']))
data63NLU = data63.withColumn('nlu', udfNLU(data63['text']))
data64NLU = data64.withColumn('nlu', udfNLU(data64['text']))
data65NLU = data65.withColumn('nlu', udfNLU(data65['text']))
data66NLU = data66.withColumn('nlu', udfNLU(data66['text']))
data67NLU = data67.withColumn('nlu', udfNLU(data67['text']))
data68NLU = data68.withColumn('nlu', udfNLU(data68['text']))
data69NLU = data69.withColumn('nlu', udfNLU(data69['text']))
data70NLU = data70.withColumn('nlu', udfNLU(data70['text']))
data71NLU = data71.withColumn('nlu', udfNLU(data71['text']))
data72NLU = data72.withColumn('nlu', udfNLU(data72['text']))
data73NLU = data73.withColumn('nlu', udfNLU(data73['text']))
data74NLU = data74.withColumn('nlu', udfNLU(data74['text']))
data75NLU = data75.withColumn('nlu', udfNLU(data75['text']))
data76NLU = data76.withColumn('nlu', udfNLU(data76['text']))
data77NLU = data77.withColumn('nlu', udfNLU(data77['text']))
data78NLU = data78.withColumn('nlu', udfNLU(data78['text']))
data79NLU = data79.withColumn('nlu', udfNLU(data79['text']))
data80NLU = data80.withColumn('nlu', udfNLU(data80['text']))
data81NLU = data81.withColumn('nlu', udfNLU(data81['text']))
data82NLU = data82.withColumn('nlu', udfNLU(data82['text']))
data83NLU = data83.withColumn('nlu', udfNLU(data83['text']))
data84NLU = data84.withColumn('nlu', udfNLU(data84['text']))
data85NLU = data85.withColumn('nlu', udfNLU(data85['text']))
data86NLU = data86.withColumn('nlu', udfNLU(data86['text']))
data87NLU = data87.withColumn('nlu', udfNLU(data87['text']))
data88NLU = data88.withColumn('nlu', udfNLU(data88['text']))
data89NLU = data89.withColumn('nlu', udfNLU(data89['text']))
data90NLU = data90.withColumn('nlu', udfNLU(data90['text']))
data91NLU = data91.withColumn('nlu', udfNLU(data91['text']))
data92NLU = data92.withColumn('nlu', udfNLU(data92['text']))
data93NLU = data93.withColumn('nlu', udfNLU(data93['text']))
data94NLU = data94.withColumn('nlu', udfNLU(data94['text']))
data95NLU = data95.withColumn('nlu', udfNLU(data95['text']))
data96NLU = data96.withColumn('nlu', udfNLU(data96['text']))
data97NLU = data97.withColumn('nlu', udfNLU(data97['text']))
data98NLU = data98.withColumn('nlu', udfNLU(data98['text']))
data99NLU = data99.withColumn('nlu', udfNLU(data99['text']))

In [40]:
dataNLU= (data00NLU.union(data01NLU).union(data02NLU).union(data03NLU).union(data04NLU).union(data05NLU).union(data06NLU).union(data07NLU).union(data08NLU).union(data09NLU)
            .union(data10NLU).union(data11NLU).union(data12NLU).union(data13NLU).union(data14NLU).union(data15NLU).union(data16NLU).union(data17NLU).union(data18NLU).union(data19NLU)
            .union(data20NLU).union(data21NLU).union(data22NLU).union(data23NLU).union(data24NLU).union(data25NLU).union(data26NLU).union(data27NLU).union(data28NLU).union(data29NLU)
            .union(data30NLU).union(data31NLU).union(data32NLU).union(data33NLU).union(data34NLU).union(data35NLU).union(data36NLU).union(data37NLU).union(data38NLU).union(data39NLU)
            .union(data40NLU).union(data41NLU).union(data42NLU).union(data43NLU).union(data44NLU).union(data45NLU).union(data46NLU).union(data47NLU).union(data48NLU).union(data49NLU)
            .union(data50NLU).union(data51NLU).union(data52NLU).union(data53NLU).union(data54NLU).union(data55NLU).union(data56NLU).union(data57NLU).union(data58NLU).union(data59NLU)
            .union(data60NLU).union(data61NLU).union(data62NLU).union(data63NLU).union(data64NLU).union(data65NLU).union(data66NLU).union(data67NLU).union(data68NLU).union(data69NLU)
            .union(data70NLU).union(data71NLU).union(data72NLU).union(data73NLU).union(data74NLU).union(data75NLU).union(data76NLU).union(data77NLU).union(data78NLU).union(data79NLU)
            .union(data80NLU).union(data81NLU).union(data82NLU).union(data83NLU).union(data84NLU).union(data85NLU).union(data86NLU).union(data87NLU).union(data88NLU).union(data89NLU)
            .union(data90NLU).union(data91NLU).union(data92NLU).union(data93NLU).union(data94NLU).union(data95NLU).union(data96NLU).union(data97NLU).union(data98NLU).union(data99NLU)
            .cache())

print("The combined dataset contains {} rows.".format(dataNLU.count()))

The combined dataset contains 18047 rows.


## Define UDFs to identify bad rows retuned by NLU

In [41]:
from pyspark.sql.types import DoubleType
udfAngerTest = udf(lambda nlu: json.loads(nlu))
udfJoyTest = udf(lambda nlu: json.loads(nlu))
udfSadnessTest = udf(lambda nlu: json.loads(nlu))
udfFearTest = udf(lambda nlu: json.loads(nlu))
udfDisgustTest = udf(lambda nlu: json.loads(nlu))
udfSentimentTest = udf(lambda nlu: json.loads(nlu))

In [42]:
dataNLUtest = (dataNLU.withColumn('AngerTest', udfAngerTest(dataNLU['nlu']))
        .withColumn('JoyTest', udfJoyTest(dataNLU['nlu']))
        .withColumn('SadnessTest', udfSadnessTest(dataNLU['nlu']))
        .withColumn('FearTest', udfFearTest(dataNLU['nlu']))
        .withColumn('DisgustTest', udfDisgustTest(dataNLU['nlu']))
        .withColumn('SentimentTest', udfSentimentTest(dataNLU['nlu'])))

In [43]:
print("Number of bad rows found = {}".format(dataNLUtest.filter(~ col('AngerTest').like('%anger%')).count()))
print("Number of bad rows found = {}".format(dataNLUtest.filter(~ col('JoyTest').like('%joy%')).count()))
print("Number of bad rows found = {}".format(dataNLUtest.filter(~ col('SadnessTest').like('%sadness%')).count()))
print("Number of bad rows found = {}".format(dataNLUtest.filter(~ col('FearTest').like('%fear%')).count()))
print("Number of bad rows found = {}".format(dataNLUtest.filter(~ col('DisgustTest').like('%disgust%')).count()))
print("Number of bad rows found = {}".format(dataNLUtest.filter(~ col('SentimentTest').like('%sentiment%')).count()))

Number of bad rows found = 0
Number of bad rows found = 0
Number of bad rows found = 0
Number of bad rows found = 0
Number of bad rows found = 0
Number of bad rows found = 0


## Remove bad rows returned from NLU

In [44]:
dataNLU = (dataNLUtest.filter(col('AngerTest').like('%anger%'))
            .filter(col('JoyTest').like('%joy%'))
            .filter(col('SadnessTest').like('%sadness%'))
            .filter(col('FearTest').like('%fear%'))
            .filter(col('DisgustTest').like('%disgust%'))
            .filter(col('SentimentTest').like('%sentiment%')))

In [45]:
print("Number of bad rows found = {}".format(dataNLU.filter(~ col('AngerTest').like('%anger%')).count()))
print("Number of bad rows found = {}".format(dataNLU.filter(~ col('JoyTest').like('%joy%')).count()))
print("Number of bad rows found = {}".format(dataNLU.filter(~ col('SadnessTest').like('%sadness%')).count()))
print("Number of bad rows found = {}".format(dataNLU.filter(~ col('FearTest').like('%fear%')).count()))
print("Number of bad rows found = {}".format(dataNLU.filter(~ col('DisgustTest').like('%disgust%')).count()))
print("Number of bad rows found = {}".format(dataNLU.filter(~ col('SentimentTest').like('%sentiment%')).count()))

Number of bad rows found = 0
Number of bad rows found = 0
Number of bad rows found = 0
Number of bad rows found = 0
Number of bad rows found = 0
Number of bad rows found = 0


In [46]:
#Drop NLU test columnts
dataNLU = dataNLU.drop('AngerTest','JoyTest', 'SadnessTest', 'FearTest', 'DisgustTest', 'SentimentTest')

## Define UDFs to extract NLU derived features

In [47]:
from pyspark.sql.types import DoubleType
udfAnger = udf(lambda nlu: json.loads(nlu)["emotion"]["document"]["emotion"]["anger"], DoubleType())
udfJoy = udf(lambda nlu: json.loads(nlu)["emotion"]["document"]["emotion"]["joy"], DoubleType())
udfSadness = udf(lambda nlu: json.loads(nlu)["emotion"]["document"]["emotion"]["sadness"], DoubleType())
udfFear = udf(lambda nlu: json.loads(nlu)["emotion"]["document"]["emotion"]["fear"], DoubleType())
udfDisgust = udf(lambda nlu: json.loads(nlu)["emotion"]["document"]["emotion"]["disgust"], DoubleType())
udfSentiment = udf(lambda nlu: json.loads(nlu)['sentiment']['document']['score'], DoubleType())

## Invoke UDFs to create new columns for the enhanced emotion and sentiment features

In [48]:
#dataNLU2 = (dataSmall.withColumn('Anger', udfAnger(dataSmall['nlu']))
#        .withColumn('Joy', udfJoy(dataSmall['nlu']))
#        .withColumn('Sadness', udfSadness(dataSmall['nlu']))
#        .withColumn('Fear', udfFear(dataSmall['nlu']))
#        .withColumn('Disgust', udfDisgust(dataSmall['nlu']))
#        .withColumn('Sentiment', udfSentiment(dataSmall['nlu'])))
#dataNLU2.toPandas().head()

In [49]:
dataNLU = (dataNLU.withColumn('Anger', udfAnger(dataNLU['nlu']))
        .withColumn('Joy', udfJoy(dataNLU['nlu']))
        .withColumn('Sadness', udfSadness(dataNLU['nlu']))
        .withColumn('Fear', udfFear(dataNLU['nlu']))
        .withColumn('Disgust', udfDisgust(dataNLU['nlu']))
        .withColumn('Sentiment', udfSentiment(dataNLU['nlu'])))

In [50]:
dataNLU.select(dataNLU['text'], dataNLU['Anger'], dataNLU['Joy'], dataNLU['Sadness'], dataNLU['Fear'], dataNLU['Disgust'], dataNLU['Sentiment']).toPandas().head(10)

Unnamed: 0,text,Anger,Joy,Sadness,Fear,Disgust,Sentiment
0,"I walked rapidly, softly, and close to the ruined houses.",0.417696,0.021155,0.482235,0.298235,0.134835,-0.651749
1,"The youth's febrile mind, apparently, was dwelling on strange things; and the doctor shuddered now and then as he spoke of them.",0.055444,0.116561,0.365396,0.351243,0.088965,-0.765087
2,"""But since the murderer has been discovered """" """"The murderer discovered Good God how can that be? who could attempt to pursue him?""",0.218197,0.137784,0.393301,0.066075,0.42919,0.0
3,"So thick were the vapours that the way was hard, and though Atal followed on at last, he could scarce see the grey shape of Barzai on the dim slope above in the clouded moonlight.",0.032738,0.150773,0.694929,0.20387,0.02903,-0.506536
4,"By authority of the king such districts were placed under ban, and all persons forbidden, under pain of death, to intrude upon their dismal solitude.",0.247888,0.026101,0.771124,0.095373,0.122879,-0.858506
5,"Thus, while Perdita was entertaining her guests, and anxiously awaiting the arrival of her lord, his ring was brought her; and she was told that a poor woman had a note to deliver to her from its wearer.",0.098105,0.104288,0.470236,0.089047,0.107151,0.0
6,"As the evening wore away he became more and more absorbed in reverie, from which no sallies of mine could arouse him.",0.110423,0.141169,0.527687,0.172764,0.117253,0.0
7,"He stopped in his tracks then, flailing his arms wildly in the air, began to stagger backward.",0.234947,0.170794,0.115705,0.310494,0.091302,-0.868316
8,"In feeling my way I had found many angles, and thus deduced an idea of great irregularity; so potent is the effect of total darkness upon one arousing from lethargy or sleep The angles were simply those of a few slight depressions, or niches, at odd intervals.",0.021526,0.102365,0.736927,0.158123,0.025682,-0.891721
9,"He seemed insensible to the presence of any one else, but if, as a trial to awaken his sensibility, my aunt brought me into the room he would instantly rush out with every symptom of fury and distraction.",0.437653,0.155442,0.238697,0.203852,0.013623,-0.915947


# Retrain model with NLU features added

## Split the dataset into training and test data sets

In [51]:
trainNLU, testNLU = dataNLU.randomSplit([70.0,30.0], seed=1)

## Bucketize the NLU features

In [52]:
from pyspark.ml.feature import Bucketizer
AngerBucketSplits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
AngerBucket = Bucketizer(splits=AngerBucketSplits, inputCol="Anger", outputCol="AngerBucket")
JoyBucketSplits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
JoyBucket = Bucketizer(splits=JoyBucketSplits, inputCol="Joy", outputCol="JoyBucket")
SadnessBucketSplits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
SadnessBucket = Bucketizer(splits=SadnessBucketSplits, inputCol="Sadness", outputCol="SadnessBucket")
FearBucketSplits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
FearBucket = Bucketizer(splits=FearBucketSplits, inputCol="Fear", outputCol="FearBucket")
DisgustBucketSplits = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
DisgustBucket = Bucketizer(splits=DisgustBucketSplits, inputCol="Disgust", outputCol="DisgustBucket")
SentimentBucketSplits = [-1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
SentimentBucket = Bucketizer(splits=SentimentBucketSplits, inputCol="Sentiment", outputCol="SentimentBucket")

## Create a feature vector

In [53]:
from pyspark.ml.feature import VectorAssembler
assembler = (VectorAssembler(inputCols=["features", "AngerBucket", "JoyBucket", "SadnessBucket", "FearBucket", "DisgustBucket","SentimentBucket"], 
             outputCol="featuresNLU"))

## Create a revised machine learning pipeline utilizing the new bucketed NLU feaures

In [54]:
lrNLU = LogisticRegression(labelCol = "label", featuresCol= "featuresNLU", maxIter=10, regParam=0.3, threshold=0.5)
stagesNLU = ([tokenizer, remover, hashingTF, idf, AngerBucket, JoyBucket, SadnessBucket, FearBucket, DisgustBucket, SentimentBucket,
            assembler, labelIndexer, lrNLU, labelConverter])
pipelineNLU = Pipeline(stages = stagesNLU)

## Train the new model using the training data set

In [55]:
modelNLU = pipelineNLU.fit(trainNLU)

## Make updated predictions (with NLU features) using the test data set

In [56]:
predictionsNLU = modelNLU.transform(testNLU)

In [57]:
predictionsNLU.select("author", "label", "prediction", 'predictedLabel', "probability").toPandas().head()

Unnamed: 0,author,label,prediction,predictedLabel,probability
0,EAP,0,0,EAP,"[0.406589386585, 0.259198863094, 0.334211750321]"
1,EAP,0,0,EAP,"[0.445357856625, 0.310846362146, 0.24379578123]"
2,MWS,1,0,EAP,"[0.341663611648, 0.316793807219, 0.341542581133]"
3,MWS,1,0,EAP,"[0.43338070479, 0.341784103287, 0.224835191923]"
4,MWS,1,0,EAP,"[0.404221211604, 0.286485576834, 0.309293211562]"


## Evaluate the updated model performance by calculating the accuracy

In [58]:
evaluatorNLU = MulticlassClassificationEvaluator(labelCol = "label", predictionCol="prediction").setMetricName("accuracy")
print('Accuracy with NLU = {:0.2f}%.'.format(evaluatorNLU.evaluate(predictionsNLU)*100))

Accuracy with NLU = 47.94%.


## Investigate Improved Results

In [59]:
EAPandEAPnlu = predictionsNLU.filter(predictionsNLU['author']=='EAP').filter(predictionsNLU['predictedLabel']=='EAP').count()
EAPnotEAPnlu = predictionsNLU.filter(predictionsNLU['author']=='EAP').filter(predictionsNLU['predictedLabel']!='EAP').count()
notEAPbutEAPnlu = predictionsNLU.filter(predictionsNLU['author']!='EAP').filter(predictionsNLU['predictedLabel']=='EAP').count()
print("Predicted EAP correctly {} times vs. {} previously.".format(EAPandEAPnlu, EAPandEAP))
print("Failed to predict EAP {} times vs. {} previously.".format(EAPnotEAPnlu, EAPnotEAP))
print("Predicted EAP incorrectly {} times vs. {} previously.".format(notEAPbutEAPnlu, notEAPbutEAP))

Predicted EAP correctly 1597 times vs. 1703 previously.
Failed to predict EAP 532 times vs. 424 previously.
Predicted EAP incorrectly 1805 times vs. 2137 previously.


In [60]:
HPLandHPLnlu = predictionsNLU.filter(predictionsNLU['author']=='HPL').filter(predictionsNLU['predictedLabel']=='HPL').count()
HPLnotHPLnlu = predictionsNLU.filter(predictionsNLU['author']=='HPL').filter(predictionsNLU['predictedLabel']!='HPL').count()
notHPLbutHPLnlu = predictionsNLU.filter(predictionsNLU['author']!='HPL').filter(predictionsNLU['predictedLabel']=='HPL').count()
print("Predicted HPL correctly {} times vs. {} previously.".format(HPLandHPLnlu, HPLandHPL))
print("Failed to predict HPL {} times vs. {} previously.".format(HPLnotHPLnlu, HPLnotHPL))
print("Predicted HPL incorrectly {} times vs. {} previously.".format(notHPLbutHPLnlu, notHPLbutHPL))

Predicted HPL correctly 582 times vs. 466 previously.
Failed to predict HPL 1050 times vs. 1162 previously.
Predicted HPL incorrectly 581 times vs. 483 previously.


In [61]:
MWSandMWSnlu = predictionsNLU.filter(predictionsNLU['author']=='MWS').filter(predictionsNLU['predictedLabel']=='MWS').count()
MWSnotMWSnlu = predictionsNLU.filter(predictionsNLU['author']=='MWS').filter(predictionsNLU['predictedLabel']!='MWS').count()
notMWSbutMWSnlu = predictionsNLU.filter(predictionsNLU['author']!='MWS').filter(predictionsNLU['predictedLabel']=='MWS').count()
print("Predicted MWS correctly {} times vs. {} previously.".format(MWSandMWSnlu, MWSandMWS))
print("Failed to predict MWS {} times vs. {} previously.".format(MWSnotMWSnlu, MWSnotMWS))
print("Predicted MWS incorrectly {} times vs. {} previously.".format(notMWSbutMWSnlu, notMWSbutMWS))

Predicted MWS correctly 444 times vs. 318 previously.
Failed to predict MWS 1266 times vs. 1387 previously.
Predicted MWS incorrectly 462 times vs. 353 previously.


![IBM Logo](http://www-03.ibm.com/press/img/Large_IBM_Logo_TN.jpg)

Rich Tarro  
Solutions Architect, IBM Corporation  
rtarro@us.ibm.com

December 8, 2017