<img src="https://storage.googleapis.com/kaggle-media/competitions/spooky-books/dmitrij-paskevic-44124.jpg" style="width:200px; float: left; padding-right: 10px"/>
<h2 style="font-face: verdana; font-size: 32px;">Spooky Author Identification</h2>
<h3 style="font-face: verdana; font-size: 16px;">Creating Rich Machine Learning Features with the Watson Cognitive APIs</h3>
<br><br>
https://www.kaggle.com/c/spooky-author-identification

## Download and unzip the data set

In [1]:
import os
import zipfile

import wget

# Remove any earlier copies so the download starts clean.
for name in ('train.zip', 'train.csv'):
    if os.path.isfile(name):
        os.remove(name)

url = 'https://github.com/hackerguy/SpookyAuthorIdentification/blob/master/train.zip?raw=true'
wget.download(url)

# Extract train.csv; the context manager closes the archive afterwards.
with zipfile.ZipFile('train.zip', 'r') as archive:
    archive.extractall()

## Read in the data set as a Spark DataFrame
### Infer schema and column names

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

data = (spark.read
  .format('csv')
  .option('header', 'true')
  .option("inferSchema", "true")
  .load('train.csv'))

## Display the dataset

In [3]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)
data.toPandas().head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP
1,id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL
2,id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP
3,id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS
4,id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL


## Remove rows that do not have valid author fields

In [4]:
data = data.filter((data['author']=='EAP')| (data['author']=='HPL') | (data['author']=='MWS'))

## Show the schema of the data including data types

In [5]:
data.printSchema()

root
 |-- id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- author: string (nullable = true)



## Limit the data size for processing efficiency

In [6]:
data = data.limit(100)

### Dataset Overview - number of rows and columns

In [7]:
print("There are " + str(data.count()) + " observations in the dataset.")
print("There are " + str(len(data.columns)) + " variables in the dataset.")

There are 100 observations in the dataset.
There are 3 variables in the dataset.


## Tokenize the text

In [8]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

tokenizer = Tokenizer(inputCol="text", outputCol="words")

countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(data)
(tokenized.select("text", "words")
    .withColumn("#tokens", countTokens(col("words"))).toPandas().head())

Unnamed: 0,text,words,#tokens
0,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.","[this, process,, however,, afforded, me, no, means, of, ascertaining, the, dimensions, of, my, dungeon;, as, i, might, make, its, circuit,, and, return, to, the, point, whence, i, set, out,, without, being, aware, of, the, fact;, so, perfectly, uniform, seemed, the, wall.]",41
1,It never once occurred to me that the fumbling might be a mere mistake.,"[it, never, once, occurred, to, me, that, the, fumbling, might, be, a, mere, mistake.]",14
2,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.","[in, his, left, hand, was, a, gold, snuff, box,, from, which,, as, he, capered, down, the, hill,, cutting, all, manner, of, fantastic, steps,, he, took, snuff, incessantly, with, an, air, of, the, greatest, possible, self, satisfaction.]",36
3,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.","[how, lovely, is, spring, as, we, looked, from, windsor, terrace, on, the, sixteen, fertile, counties, spread, beneath,, speckled, by, happy, cottages, and, wealthier, towns,, all, looked, as, in, former, years,, heart, cheering, and, fair.]",34
4,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.","[finding, nothing, else,, not, even, gold,, the, superintendent, abandoned, his, attempts;, but, a, perplexed, look, occasionally, steals, over, his, countenance, as, he, sits, thinking, at, his, desk.]",27
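
The punctuation still attached to tokens such as `process,` and `wall.` follows from how `Tokenizer` works: it lowercases the text and splits on whitespace only. The sketch below mimics that behaviour in plain Python. Both helper names are hypothetical; `regex_tokenize` only approximates what a `RegexTokenizer` configured with a non-word pattern would return.

```python
import re

def simple_tokenize(text):
    """Lowercase and split on whitespace, approximating Spark's Tokenizer."""
    return [tok for tok in re.split(r"\s+", text.lower()) if tok]

def regex_tokenize(text):
    """Lowercase and split on runs of non-word characters, dropping empties."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

sentence = "It never once occurred to me that the fumbling might be a mere mistake."
print(simple_tokenize(sentence))  # last token keeps its full stop: 'mistake.'
print(regex_tokenize(sentence))   # last token is 'mistake'
```

Splitting on non-word characters would let `mistake.` and `mistake` share one token, at the cost of discarding punctuation that may itself be characteristic of an author.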


## Remove common words

In [9]:
from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="words", outputCol="filtered").setCaseSensitive(False)
removed = remover.transform(tokenized)
removed.select("text", "words", "filtered" ).toPandas().head()

Unnamed: 0,text,words,filtered
0,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.","[this, process,, however,, afforded, me, no, means, of, ascertaining, the, dimensions, of, my, dungeon;, as, i, might, make, its, circuit,, and, return, to, the, point, whence, i, set, out,, without, being, aware, of, the, fact;, so, perfectly, uniform, seemed, the, wall.]","[process,, however,, afforded, means, ascertaining, dimensions, dungeon;, might, make, circuit,, return, point, whence, set, out,, without, aware, fact;, perfectly, uniform, seemed, wall.]"
1,It never once occurred to me that the fumbling might be a mere mistake.,"[it, never, once, occurred, to, me, that, the, fumbling, might, be, a, mere, mistake.]","[never, occurred, fumbling, might, mere, mistake.]"
2,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.","[in, his, left, hand, was, a, gold, snuff, box,, from, which,, as, he, capered, down, the, hill,, cutting, all, manner, of, fantastic, steps,, he, took, snuff, incessantly, with, an, air, of, the, greatest, possible, self, satisfaction.]","[left, hand, gold, snuff, box,, which,, capered, hill,, cutting, manner, fantastic, steps,, took, snuff, incessantly, air, greatest, possible, self, satisfaction.]"
3,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.","[how, lovely, is, spring, as, we, looked, from, windsor, terrace, on, the, sixteen, fertile, counties, spread, beneath,, speckled, by, happy, cottages, and, wealthier, towns,, all, looked, as, in, former, years,, heart, cheering, and, fair.]","[lovely, spring, looked, windsor, terrace, sixteen, fertile, counties, spread, beneath,, speckled, happy, cottages, wealthier, towns,, looked, former, years,, heart, cheering, fair.]"
4,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.","[finding, nothing, else,, not, even, gold,, the, superintendent, abandoned, his, attempts;, but, a, perplexed, look, occasionally, steals, over, his, countenance, as, he, sits, thinking, at, his, desk.]","[finding, nothing, else,, even, gold,, superintendent, abandoned, attempts;, perplexed, look, occasionally, steals, countenance, sits, thinking, desk.]"
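
Stop-word removal is an exact match against the token, which is why `which,` and `out,` survive in the output above even though `which` and `out` are stop words: the attached comma makes them different strings. A minimal sketch of that matching rule, using a hypothetical `remove_stop_words` helper and only a small subset of the default list:

```python
# A small subset of the default English stop-word list, for illustration only.
STOP_WORDS = {"it", "once", "to", "me", "that", "the", "be", "a"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, case_sensitive=False):
    """Drop tokens that exactly match a stop word."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]

words = ["it", "never", "once", "occurred", "to", "me", "that", "the",
         "fumbling", "might", "be", "a", "mere", "mistake."]
filtered = remove_stop_words(words)
print(filtered)

# Punctuation attached by the whitespace tokenizer defeats the match.
print(remove_stop_words(["which,", "which"], {"which"}))
```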


### Show list of common words removed

In [10]:
from __future__ import print_function
for word in remover.getStopWords():
    print(word)

i
me
my
myself
we
our
ours
ourselves
you
your
yours
yourself
yourselves
he
him
his
himself
she
her
hers
herself
it
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
should
now
d
ll
m
o
re
ve
y
ain
aren
couldn
didn
doesn
hadn
hasn
haven
isn
ma
mightn
mustn
needn
shan
shouldn
wasn
weren
won
wouldn



## Hash the words and down-weight words that occur frequently across all texts

In [11]:
from pyspark.ml.feature import HashingTF, IDF

hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=100)
featurizedData = hashingTF.transform(removed)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("text", "rawFeatures", "features").toPandas().head()

Unnamed: 0,text,rawFeatures,features
0,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.71404795764, 0.0, 0.0, 0.0, 1.72474875895, 3.95212637445, 0.0, 0.0, 0.0, 0.0, 0.0, 2.05017115938, 1.47962630091, 0.0, 0.0, 0.0, 0.0, 1.43706668649, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.90707031574, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.72474875895, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.8425317946, 1.8425317946, 0.0, 0.0, 2.21722524404, 1.97606318723, 1.8425317946, 1.43706668649, 2.31253542385, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 0.0, 4.10034231876, 0.0, 0.0, 0.0, 0.0, 2.05017115938)"
1,It never once occurred to me that the fumbling might be a mere mistake.,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.13021386705, 0.0, 1.35702397882, 0.0, 0.0, 0.0, 0.0, 0.0, 2.05017115938, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.13021386705, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.43706668649, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.31253542385, 0.0, 0.0, 0.0)"
2,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.","(1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0)","(2.21722524404, 0.0, 0.0, 1.67068153767, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 2.41789593951, 4.26042773411, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 0.0, 2.05017115938, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.33842073557, 0.0, 0.0, 0.0, 0.0, 0.0, 1.97606318723, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 0.0, 2.13021386705, 0.0, 1.90707031574, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.97606318723, 2.05017115938, 0.0, 2.31253542385, 0.0, 1.67068153767, 0.0)"
3,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.72474875895, 0.0, 0.0, 1.57059807912, 1.72474875895, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.65167573213, 0.0, 1.43706668649, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 3.81414063148, 2.13021386705, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.05017115938, 1.97606318723, 2.66921036779, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.31928365084, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 0.0, 2.13021386705, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 1.8425317946, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"
4,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.57059807912, 0.0, 1.72474875895, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.8425317946, 0.0, 1.8425317946, 1.43706668649, 0.0, 0.0, 4.10034231876, 0.0, 0.0, 0.0, 0.0, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 5.07135795032, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.72474875895, 0.0, 2.13021386705, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.8425317946, 0.0, 2.31253542385, 0.0, 2.21722524404, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.31253542385, 0.0, 0.0, 0.0)"
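
To make the two columns above easier to read, here is a rough pure-Python sketch of the same idea: tokens are counted into a fixed number of hash buckets, and each bucket count is then multiplied by a smoothed inverse document frequency, `log((m + 1) / (df + 1))`, where `m` is the number of rows and `df` the number of rows that hit the bucket. The helper names are hypothetical, and `zlib.crc32` is only a stand-in for Spark's own hash function, so bucket positions will not match the notebook output.

```python
import math
import zlib

def hashing_tf(tokens, num_features=100):
    """Count tokens into a fixed number of buckets chosen by a hash.

    zlib.crc32 is a stand-in hash, so indices differ from Spark's.
    """
    vec = [0.0] * num_features
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % num_features] += 1.0
    return vec

def idf_weight(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    return math.log((num_docs + 1.0) / (doc_freq + 1.0))

def tf_idf(tf_vectors):
    """Rescale each term-frequency vector by the per-bucket IDF weight."""
    m = len(tf_vectors)
    n = len(tf_vectors[0])
    doc_freq = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(n)]
    weights = [idf_weight(m, df) for df in doc_freq]
    return [[tf * w for tf, w in zip(v, weights)] for v in tf_vectors]

docs = [["never", "occurred", "fumbling"], ["gold", "snuff", "snuff"]]
tf = [hashing_tf(d, num_features=10) for d in docs]
print(tf_idf(tf))
print(round(idf_weight(100, 25), 6))  # 1.357024
```

The last line shows that the value 1.35702397882 in the `features` column is consistent with a bucket hit by 25 of the 100 rows. With only 100 buckets, many distinct words collide in the same bucket, which limits how much these features can separate the authors.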


## Encode the label column

In [12]:
from pyspark.ml.feature import StringIndexer
labelIndexer = StringIndexer(inputCol='author', outputCol='label').fit(data)
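
`StringIndexer` assigns indices by descending label frequency, so the most common author becomes 0. The sketch below reproduces that ordering in plain Python. The helper name is hypothetical, the alphabetical tie-break is an assumption, and the counts (36, 33, 31) are taken from the train and test totals reported in the split step of this notebook.

```python
from collections import Counter

def fit_label_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Ties are broken alphabetically here; that rule is an assumption.
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: i for i, lab in enumerate(ordered)}

authors = ["EAP"] * 36 + ["HPL"] * 33 + ["MWS"] * 31
print(fit_label_index(authors))
```

This agrees with the prediction tables in this notebook, where HPL appears with label 1 and MWS with label 2.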

## Use Logistic Regression Algorithm to predict author

In [13]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol = "label", maxIter=10, regParam=0.3, threshold=0.7)
#lr = LogisticRegression(labelCol = 'label', maxIter=10, regParam=0.3, elasticNetParam=0.8)

## Convert indexed labels back to original labels

In [14]:
from pyspark.ml.feature import IndexToString
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

## Define the machine learning pipeline

In [15]:
stages = [tokenizer, remover, hashingTF, idf, labelIndexer, lr, labelConverter]
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = stages)

### Display the parameter setting of the pipeline stages

In [16]:
print("Tokenizer:")
print(tokenizer.explainParams())
print("*************************")
print("Remover:")
print(remover.explainParams())
print("*************************")
print("HashingTF:")
print(hashingTF.explainParams())
print("*************************")
print("IDF:")
print(idf.explainParams())
print("*************************")
print("LogisticRegression:")
print(lr.explainParams())
print("*************************")
print("Pipeline:")
print(pipeline.explainParams())

Tokenizer:
inputCol: input column name. (current: text)
outputCol: output column name. (default: Tokenizer_4227971a5803d5f8a5ce__output, current: words)
*************************
Remover:
caseSensitive: whether to do a case sensitive comparison over the stop words (default: False, current: False)
inputCol: input column name. (current: words)
outputCol: output column name. (default: StopWordsRemover_4fce913f94309d138f97__output, current: filtered)
stopWords: The words to be filtered out (default: [u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', 

## Split the dataset into training and test data sets

In [17]:
train, test = data.randomSplit([70.0,30.0], seed=1)
print('The number of records in the training data set is {}.'.format(train.count()))
print('The number of rows labeled EAP in the training data set is {}.'.format(train.filter(train['author'] == 'EAP').count()))
print('The number of rows labeled HPL in the training data set is {}.'.format(train.filter(train['author'] == 'HPL').count()))
print('The number of rows labeled MWS in the training data set is {}.'.format(train.filter(train['author'] == 'MWS').count()))

print('The number of records in the test data set is {}.'.format(test.count()))
print('The number of rows labeled EAP in the test data set is {}.'.format(test.filter(test['author'] == 'EAP').count()))
print('The number of rows labeled HPL in the test data set is {}.'.format(test.filter(test['author'] == 'HPL').count()))
print('The number of rows labeled MWS in the test data set is {}.'.format(test.filter(test['author'] == 'MWS').count()))

The number of records in the training data set is 77.
The number of rows labeled EAP in the training data set is 27.
The number of rows labeled HPL in the training data set is 25.
The number of rows labeled MWS in the training data set is 25.
The number of records in the test data set is 23.
The number of rows labeled EAP in the test data set is 9.
The number of rows labeled HPL in the test data set is 8.
The number of rows labeled MWS in the test data set is 6.
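
The split is 77/23 rather than exactly 70/30 because `randomSplit` normalizes the weights and then assigns each row by an independent random draw, so the sizes are only 70/30 in expectation. A minimal sketch of that idea follows. The `random_split` helper is hypothetical and uses Python's random generator, not Spark's, so it will not reproduce the exact counts above.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by an independent random draw."""
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total          # normalize, e.g. [70.0, 30.0] -> 0.7, 1.0
        bounds.append(acc)
    bounds[-1] = 1.0              # guard against floating-point drift
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()       # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(100)), [70.0, 30.0], seed=1)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which matters here because the same seed is reused later so that both models are evaluated on the same rows.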


## Train the model using the training data set

In [18]:
model = pipeline.fit(train)

## Make predictions using the test data set

In [19]:
predictions = model.transform(test)

In [20]:
predictions.select("author", "label", "prediction", 'predictedLabel', "probability").toPandas().head()

|   | author | label | prediction | predictedLabel | probability |
|---|---|---|---|---|---|
| 0 | MWS | 2 | 2 | MWS | [0.333603184945, 0.259378643869, 0.407018171186] |
| 1 | MWS | 2 | 1 | HPL | [0.14545299873, 0.506132146855, 0.348414854416] |
| 2 | HPL | 1 | 1 | HPL | [0.343832262321, 0.386505251073, 0.269662486606] |
| 3 | MWS | 2 | 2 | MWS | [0.124928674119, 0.0717508395858, 0.803320486295] |
| 4 | HPL | 1 | 1 | HPL | [0.282942583593, 0.540385561467, 0.17667185494] |


## Evaluate the model performance by calculating the accuracy

In [22]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol = "label", predictionCol="prediction").setMetricName("accuracy")
print('Accuracy = {}.'.format(evaluator.evaluate(predictions)))

Accuracy = 0.478260869565.


## Investigate the prediction results

In [64]:
EAPandEAP = predictions.filter(predictions['author']=='EAP').filter(predictions['predictedLabel']=='EAP').count()
EAPnotEAP = predictions.filter(predictions['author']=='EAP').filter(predictions['predictedLabel']!='EAP').count()
notEAPbutEAP = predictions.filter(predictions['author']!='EAP').filter(predictions['predictedLabel']=='EAP').count()
print("Predicted EAP correctly {} times.".format(EAPandEAP))
print("Failed to predict EAP {} times.".format(EAPnotEAP))
print("Predicted EAP incorrectly {} times.".format(notEAPbutEAP))

Predicted EAP correctly 5 times.
Failed to predict EAP 4 times.
Predicted EAP incorrectly 5 times.


In [65]:
HPLandHPL = predictions.filter(predictions['author']=='HPL').filter(predictions['predictedLabel']=='HPL').count()
HPLnotHPL = predictions.filter(predictions['author']=='HPL').filter(predictions['predictedLabel']!='HPL').count()
notHPLbutHPL = predictions.filter(predictions['author']!='HPL').filter(predictions['predictedLabel']=='HPL').count()
print("Predicted HPL correctly {} times.".format(HPLandHPL))
print("Failed to predict HPL {} times.".format(HPLnotHPL))
print("Predicted HPL incorrectly {} times.".format(notHPLbutHPL))

Predicted HPL correctly 4 times.
Failed to predict HPL 4 times.
Predicted HPL incorrectly 4 times.


In [66]:
MWSandMWS = predictions.filter(predictions['author']=='MWS').filter(predictions['predictedLabel']=='MWS').count()
MWSnotMWS = predictions.filter(predictions['author']=='MWS').filter(predictions['predictedLabel']!='MWS').count()
notMWSbutMWS = predictions.filter(predictions['author']!='MWS').filter(predictions['predictedLabel']=='MWS').count()
print("Predicted MWS correctly {} times.".format(MWSandMWS))
print("Failed to predict MWS {} times.".format(MWSnotMWS))
print("Predicted MWS incorrectly {} times.".format(notMWSbutMWS))

Predicted MWS correctly 2 times.
Failed to predict MWS 4 times.
Predicted MWS incorrectly 3 times.
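
The per-author counts printed above are enough to recover the overall accuracy and to derive precision and recall for each author. The sketch below does that arithmetic with a hypothetical `summarize` helper; each tuple holds the three printed numbers (predicted correctly, failed to predict, predicted incorrectly).

```python
def summarize(counts):
    """counts maps label -> (correct, missed, wrongly_predicted)."""
    total = sum(tp + fn for tp, fn, _ in counts.values())
    correct = sum(tp for tp, _, _ in counts.values())
    report = {"accuracy": correct / float(total)}
    for label, (tp, fn, fp) in counts.items():
        report[label] = {
            "recall": tp / float(tp + fn),      # share of the author's rows found
            "precision": tp / float(tp + fp),   # share of predictions that were right
        }
    return report

baseline = {"EAP": (5, 4, 5), "HPL": (4, 4, 4), "MWS": (2, 4, 3)}
report = summarize(baseline)
print(round(report["accuracy"], 6))  # 0.478261
print(report["MWS"])
```

The 11 correct predictions out of 23 test rows match the reported accuracy, and the breakdown shows MWS is the author the model recovers least often (2 of 6 rows).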


# Use Natural Language Understanding to create rich features

## Set up configuration for the Natural Language Understanding service

In [23]:
import watson_developer_cloud
from watson_developer_cloud import NaturalLanguageUnderstandingV1
import watson_developer_cloud.natural_language_understanding.features.v1 as Features
import json

In [24]:
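import os
# Read the service credentials from environment variables instead of hard-coding them
# (the variable names are chosen here; set them before starting the notebook).
NLU_USERNAME = os.environ['NLU_USERNAME']
NLU_PASSWORD = os.environ['NLU_PASSWORD']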
natural_language_understanding = NaturalLanguageUnderstandingV1(
  username=NLU_USERNAME,
  password=NLU_PASSWORD,
  version="2017-02-27")

### Pick the first text in the dataset to analyze

In [26]:
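# toJSON() yields '{"text":"..."}' per row; the slice drops the '{"text":' prefix and the
# closing brace, leaving the first row's text still wrapped in double quotes.
dataNLU = data.select(data["text"]).toJSON().collect()[0][8:-1]
print(dataNLU)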
features=[
    Features.Emotion(),
    Features.Sentiment()
  ]
nlu = natural_language_understanding.analyze(text=dataNLU, features=features)
anger = nlu['emotion']['document']['emotion']['anger']
joy = nlu['emotion']['document']['emotion']['joy']
sadness = nlu['emotion']['document']['emotion']['sadness']
fear = nlu['emotion']['document']['emotion']['fear']
disgust = nlu['emotion']['document']['emotion']['disgust']
sentiment = nlu['sentiment']['document']['score']

print("")
print("Anger = {}".format(anger))
print("Joy = {}".format(joy))
print("Sadness = {}".format(sadness))
print("Fear = {}".format(fear))
print("Disgust = {}".format(disgust))
print("Sentiment = {}".format(sentiment))
print("")
print(json.dumps(nlu, indent=2))

"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall."

Anger = 0.22525
Joy = 0.208037
Sadness = 0.156157
Fear = 0.092108
Disgust = 0.024618
Sentiment = 0.834411

{
  "usage": {
    "text_characters": 233, 
    "features": 2, 
    "text_units": 1
  }, 
  "emotion": {
    "document": {
      "emotion": {
        "anger": 0.22525, 
        "joy": 0.208037, 
        "sadness": 0.156157, 
        "fear": 0.092108, 
        "disgust": 0.024618
      }
    }
  }, 
  "language": "en", 
  "sentiment": {
    "document": {
      "score": 0.834411, 
      "label": "positive"
    }
  }
}
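
The response is nested, while the model needs flat numeric columns. The sketch below shows one way to flatten a response of this shape into six named scores. `flatten_nlu` is a hypothetical helper, and the sample is a trimmed copy of the response printed above.

```python
import json

EMOTIONS = ("anger", "joy", "sadness", "fear", "disgust")

def flatten_nlu(response):
    """Pull the five emotion scores and the sentiment score into one dict."""
    scores = response["emotion"]["document"]["emotion"]
    features = {name: scores[name] for name in EMOTIONS}
    features["sentiment"] = response["sentiment"]["document"]["score"]
    return features

sample = json.loads("""
{
  "emotion": {"document": {"emotion": {
    "anger": 0.22525, "joy": 0.208037, "sadness": 0.156157,
    "fear": 0.092108, "disgust": 0.024618}}},
  "sentiment": {"document": {"score": 0.834411, "label": "positive"}}
}
""")
print(flatten_nlu(sample))
```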


## Define UDFs to create NLU derived features

In [30]:
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
udfAnger = (udf(lambda text: NaturalLanguageUnderstandingV1(
    username=NLU_USERNAME, password=NLU_PASSWORD, version="2017-02-27")
    .analyze(text=text, features=features)['emotion']['document']['emotion']['anger'],
    FloatType()))
udfJoy = (udf(lambda text: NaturalLanguageUnderstandingV1(
    username=NLU_USERNAME, password=NLU_PASSWORD, version="2017-02-27")
    .analyze(text=text, features=features)['emotion']['document']['emotion']['joy'],
    FloatType()))
udfSadness = (udf(lambda text: NaturalLanguageUnderstandingV1(
    username=NLU_USERNAME, password=NLU_PASSWORD, version="2017-02-27")
    .analyze(text=text, features=features)['emotion']['document']['emotion']['sadness'],
    FloatType()))
udfFear = (udf(lambda text: NaturalLanguageUnderstandingV1(
    username=NLU_USERNAME, password=NLU_PASSWORD, version="2017-02-27")
    .analyze(text=text, features=features)['emotion']['document']['emotion']['fear'],
    FloatType()))
udfDisgust = (udf(lambda text: NaturalLanguageUnderstandingV1(
    username=NLU_USERNAME, password=NLU_PASSWORD, version="2017-02-27")
    .analyze(text=text, features=features)['emotion']['document']['emotion']['disgust'],
    FloatType()))
udfSentiment = (udf(lambda text: NaturalLanguageUnderstandingV1(
    username=NLU_USERNAME, password=NLU_PASSWORD, version="2017-02-27")
    .analyze(text=text, features=features)['sentiment']['document']['score'],
    FloatType()))
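
Each of the six UDFs above builds its own client and sends its own request, so every row costs six calls for the same text. One possible alternative, sketched here outside Spark, is to make a single request per text and return all six scores together. `make_row_scorer` and `fake_analyze` are hypothetical names, and `fake_analyze` is only a stand-in for the real client so that the number of requests can be counted.

```python
def make_row_scorer(analyze):
    """Wrap an analyze(text) callable so each text triggers a single request."""
    def score(text):
        response = analyze(text)
        emotion = response["emotion"]["document"]["emotion"]
        return (emotion["anger"], emotion["joy"], emotion["sadness"],
                emotion["fear"], emotion["disgust"],
                response["sentiment"]["document"]["score"])
    return score

# Hypothetical stand-in for the NLU client, used only to count requests.
calls = []

def fake_analyze(text):
    calls.append(text)
    return {"emotion": {"document": {"emotion": {
                "anger": 0.1, "joy": 0.2, "sadness": 0.3,
                "fear": 0.4, "disgust": 0.5}}},
            "sentiment": {"document": {"score": -0.6}}}

score = make_row_scorer(fake_analyze)
print(score("It never once occurred to me that the fumbling might be a mere mistake."))
print(len(calls))  # 1
```

In Spark, a function like `score` could in principle back a single UDF with a struct return type, with the six fields then selected into separate columns; that wiring is not shown here.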

### Use limited data set for processing efficiency

In [31]:
data2 = data.limit(100)
data2.toPandas().head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP
1,id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL
2,id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP
3,id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS
4,id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL


In [32]:
data2 = (data2.withColumn('Anger', udfAnger(data2['text']))
        .withColumn('Joy', udfJoy(data2['text']))
        .withColumn('Sadness', udfSadness(data2['text']))
        .withColumn('Fear', udfFear(data2['text']))
        .withColumn('Disgust', udfDisgust(data2['text']))
        .withColumn('Sentiment', udfSentiment(data2['text'])))
data2.cache()
data2.select(data2['Anger'], data2['Joy'], data2['Sadness'], data2['Fear'], data2['Disgust'], data2['Sentiment']).toPandas().head(10)

|   | Anger | Joy | Sadness | Fear | Disgust | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 0.22525 | 0.208037 | 0.156157 | 0.092108 | 0.024618 | 0.875231 |
| 1 | 0.268703 | 0.07335 | 0.284674 | 0.336473 | 0.049945 | -0.867677 |
| 2 | 0.047614 | 0.856397 | 0.021778 | 0.017659 | 0.061879 | -0.739374 |
| 3 | 0.002317 | 0.895471 | 0.074464 | 0.011802 | 0.011855 | 0.928205 |
| 4 | 0.239133 | 0.005415 | 0.43508 | 0.343188 | 0.311778 | -0.717046 |
| 5 | 0.030202 | 0.380229 | 0.316263 | 0.011006 | 0.016229 | 0.987105 |
| 6 | 0.146274 | 0.03453 | 0.189046 | 0.644884 | 0.066424 | 0.0 |
| 7 | 0.175506 | 0.159886 | 0.116204 | 0.363028 | 0.3051 | 0.0 |
| 8 | 0.007289 | 0.17762 | 0.575457 | 0.321208 | 0.025653 | 0.569564 |
| 9 | 0.073041 | 0.262087 | 0.175994 | 0.142699 | 0.117686 | -0.851087 |


In [33]:
data2.toPandas().head()

Unnamed: 0,id,text,author,Anger,Joy,Sadness,Fear,Disgust,Sentiment
0,id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP,0.22525,0.208037,0.156157,0.092108,0.024618,0.875231
1,id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL,0.268703,0.07335,0.284674,0.336473,0.049945,-0.867677
2,id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP,0.047614,0.856397,0.021778,0.017659,0.061879,-0.739374
3,id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS,0.002317,0.895471,0.074464,0.011802,0.011855,0.928205
4,id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL,0.239133,0.005415,0.43508,0.343188,0.311778,-0.717046


## Rerun Model with NLU Features Added

In [34]:
train2, test2 = data2.randomSplit([70.0,30.0], seed=1)

In [35]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["features", "Anger", "Joy", "Sadness", "Fear", "Disgust","Sentiment"], outputCol="features2")
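
`VectorAssembler` concatenates its input columns, in the order listed, into one vector. Here that means the 100 TF-IDF buckets followed by the six NLU scores, giving 106 features per row. A plain-Python sketch of the concatenation, with a hypothetical `assemble` helper and the NLU scores of the first row:

```python
def assemble(*columns):
    """Concatenate vectors and scalars, in order, into one flat feature list."""
    out = []
    for col in columns:
        if isinstance(col, (list, tuple)):
            out.extend(float(x) for x in col)
        else:
            out.append(float(col))
    return out

tfidf = [0.0] * 100   # stands in for the 100-bucket "features" vector
nlu = [0.22525, 0.208037, 0.156157, 0.092108, 0.024618, 0.875231]
features2 = assemble(tfidf, *nlu)
print(len(features2))  # 106
```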

In [36]:
lr = LogisticRegression(labelCol = "label", featuresCol= "features2", maxIter=10, regParam=0.3, threshold=0.7)
stages2 = [tokenizer, remover, hashingTF, idf, assembler, labelIndexer, lr, labelConverter]
pipeline2 = Pipeline(stages = stages2)

In [37]:
model2 = pipeline2.fit(train2)

In [38]:
predictions2 = model2.transform(test2)

In [39]:
predictions2.select("author", "label", "prediction", 'predictedLabel', "probability").toPandas().head()

|   | author | label | prediction | predictedLabel | probability |
|---|---|---|---|---|---|
| 0 | MWS | 2 | 2 | MWS | [0.309468385722, 0.221262936173, 0.469268678105] |
| 1 | MWS | 2 | 1 | HPL | [0.19436539635, 0.563329407895, 0.242305195755] |
| 2 | HPL | 1 | 1 | HPL | [0.33304496925, 0.375787198297, 0.291167832453] |
| 3 | MWS | 2 | 2 | MWS | [0.137431710759, 0.0768537530022, 0.785714536239] |
| 4 | HPL | 1 | 1 | HPL | [0.268699509092, 0.658899041912, 0.0724014489959] |


In [51]:
evaluator2 = MulticlassClassificationEvaluator(labelCol = "label", predictionCol="prediction").setMetricName("accuracy")
print('Accuracy = {}.'.format(evaluator2.evaluate(predictions2)))

Accuracy = 0.521739130435.


## Investigate Improved Results

In [75]:
EAPandEAP2 = predictions2.filter(predictions2['author']=='EAP').filter(predictions2['predictedLabel']=='EAP').count()
EAPnotEAP2 = predictions2.filter(predictions2['author']=='EAP').filter(predictions2['predictedLabel']!='EAP').count()
notEAPbutEAP2 = predictions2.filter(predictions2['author']!='EAP').filter(predictions2['predictedLabel']=='EAP').count()
print("Predicted EAP correctly {} times vs. {} previously.".format(EAPandEAP2, EAPandEAP))
print("Failed to predict EAP {} times vs. {} previously.".format(EAPnotEAP2, EAPnotEAP))
print("Predicted EAP incorrectly {} times vs. {} previously.".format(notEAPbutEAP2, notEAPbutEAP))

Predicted EAP correctly 6 times vs. 5 previously.
Failed to predict EAP 3 times vs. 4 previously.
Predicted EAP incorrectly 5 times vs. 5 previously.


In [78]:
HPLandHPL2 = predictions2.filter(predictions2['author']=='HPL').filter(predictions2['predictedLabel']=='HPL').count()
HPLnotHPL2 = predictions2.filter(predictions2['author']=='HPL').filter(predictions2['predictedLabel']!='HPL').count()
notHPLbutHPL2 = predictions2.filter(predictions2['author']!='HPL').filter(predictions2['predictedLabel']=='HPL').count()
print("Predicted HPL correctly {} times vs. {} previously.".format(HPLandHPL2, HPLandHPL))
print("Failed to predict HPL {} times vs. {} previously.".format(HPLnotHPL2, HPLnotHPL))
print("Predicted HPL incorrectly {} times vs. {} previously.".format(notHPLbutHPL2, notHPLbutHPL))

Predicted HPL correctly 4 times vs. 4 previously.
Failed to predict HPL 4 times vs. 4 previously.
Predicted HPL incorrectly 3 times vs. 4 previously.


In [81]:
MWSandMWS2 = predictions2.filter(predictions2['author']=='MWS').filter(predictions2['predictedLabel']=='MWS').count()
MWSnotMWS2 = predictions2.filter(predictions2['author']=='MWS').filter(predictions2['predictedLabel']!='MWS').count()
notMWSbutMWS2 = predictions2.filter(predictions2['author']!='MWS').filter(predictions2['predictedLabel']=='MWS').count()
print("Predicted MWS correctly {} times vs. {} previously.".format(MWSandMWS2, MWSandMWS))
print("Failed to predict MWS {} times vs. {} previously.".format(MWSnotMWS2, MWSnotMWS))
print("Predicted MWS incorrectly {} times vs. {} previously.".format(notMWSbutMWS2, notMWSbutMWS))

Predicted MWS correctly 2 times vs. 2 previously.
Failed to predict MWS 4 times vs. 4 previously.
Predicted MWS incorrectly 3 times vs. 3 previously.
