# ENRON EMAILS. TOPIC MODELLING

In [16]:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.linalg.{Vector =>mlVector, Matrix=> mlMatrix}
import org.apache.spark.mllib.linalg.{Vector =>mllibVector, Matrix => mllibMatrix}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import breeze.linalg.{DenseMatrix=>BDM, DenseVector=>BDV}
import breeze.stats.mean 
import breeze.linalg.{norm, normalize} 
import breeze.linalg.functions.{cosineDistance, euclideanDistance}


import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover,CountVectorizer, CountVectorizerModel,IDF, IDFModel}
import org.apache.spark.ml.clustering.{LDA, LDAModel}

In [2]:
//I recommend using Brunel 2.2 with Spark 2.1.
%AddJar -magic https://brunelvis.org/jar/spark-kernel-brunel-all-2.2.jar

Using cached version of spark-kernel-brunel-all-2.2.jar


## 1. Load and Clean Data

In [2]:
val docCleaner :String => String = doc => { doc.replaceAll("[^a-zA-Z0-9]", " ").replaceAll("\\s{2,}", " ").trim().toLowerCase() }
val UDF_docCleaner = udf(docCleaner)

In [3]:
val rdd = sc.textFile("enron_textfile.txt").map(_.split("\\|")).map(arr => (arr(0).toInt, arr(1), arr(2).toDouble) )
val rowRDD = rdd.map(record => Row(record._1, record._2, record._3))

val schema = new StructType().
    add(StructField("id", IntegerType, true)).
    add(StructField("email", StringType, true)).
    add(StructField("label", DoubleType, true))

val corpus = spark.createDataFrame(rowRDD,schema).withColumn("doc", UDF_docCleaner($"email"))
corpus.persist()
println("First five documents: ")
corpus.limit(5).show()

Waiting for a Spark session to start...

First five documents: 
+---+--------------------+-----+--------------------+
| id|               email|label|                 doc|
+---+--------------------+-----+--------------------+
|  1|North America's i...|  0.0|north america s i...|
|  2|FYI -----Original...|  1.0|fyi original mess...|
|  3|14:13:53 Synchron...|  0.0|14 13 53 synchron...|
|  4|^ ----- Forwarded...|  1.0|forwarded by stev...|
|  5|----- Forwarded b...|  0.0|forwarded by stev...|
+---+--------------------+-----+--------------------+



## 2. Feature Engineering

Document featurization maps every document into a vector space. The most basic technique, BoW (Bag of Words), builds that space by counting the frequency of each term j in each document i, which yields a document-term matrix. The set of terms taken into consideration is called the vocabulary, and it is usually limited by a minimum token frequency.

This matrix has D rows (# documents) and V columns (# terms), where V is vocabulary size (V <= n, the total number of terms in the corpus). In Spark it will be represented by a DataFrame where the column features holds a collection of D vectors of size V.
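As a minimal sketch of the bag-of-words idea in plain Scala (no Spark), the snippet below builds a document-term matrix for a hypothetical three-document corpus. The corpus and the names `docs`, `vocab` and `docTerm` are assumptions made for illustration only; the vocabulary is ordered by corpus frequency to mimic what a count-based vectorizer would do.

```scala
// Toy corpus (hypothetical); each string plays the role of one cleaned document.
val docs = Seq(
  "power market power",
  "enron power",
  "market state market state"
)

// Tokenize on whitespace, in the spirit of the "\\s+" tokenizer pattern.
val tokenized: Seq[Seq[String]] = docs.map(_.split("\\s+").toSeq)

// Vocabulary: distinct terms ordered by corpus frequency (descending), ties broken alphabetically.
val vocab: Seq[String] = tokenized.flatten
  .groupBy(identity)
  .map { case (term, occurrences) => (term, occurrences.size) }
  .toSeq
  .sortBy { case (term, count) => (-count, term) }
  .map(_._1)

// Document-term matrix: D rows, V columns, entry (i, j) = count of term j in document i.
val docTerm: Seq[Seq[Int]] = tokenized.map { tokens =>
  vocab.map(term => tokens.count(_ == term))
}

println(vocab.mkString(", "))
docTerm.foreach(row => println(row.mkString(" ")))
```

Here D=3 and V=4, so each document becomes a vector of four counts.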

In [4]:
val regexTokenizer = new RegexTokenizer().
    setInputCol("doc").
    setOutputCol("tokens").
    setPattern("\\s+").
    setMinTokenLength(2)
val remover = new StopWordsRemover().setCaseSensitive(false).setInputCol("tokens").setOutputCol("tokens_rm")
val TF = new CountVectorizer().setInputCol("tokens_rm").setOutputCol("features").
    setMinTF(2).
    setVocabSize(500)

val stages = Array(regexTokenizer, remover, TF)
val feat_eng_pl = new Pipeline().setStages(stages).fit(corpus)
val docTerm_df = feat_eng_pl.transform(corpus)




In [5]:
val TFModel = feat_eng_pl.stages(2).asInstanceOf[CountVectorizerModel]
val vocabulary = TFModel.vocabulary

println("Vocabulary size: "+ vocabulary.size)
println("Vocabulary: "+ vocabulary.slice(0,10).mkString(", "))

Vocabulary size: 500
Vocabulary: enron, ect, com, hou, power, 2000, subject, 2001, energy, mail


## 3. Model Building

Topic modelling is an unsupervised technique that starts from an a priori guess of how many different topics there are in a corpus (let K be the number of topics). A very basic explanation: each topic is related to a set of terms that are more frequent in that particular topic. Any document may contain words related to any of the K topics, but the frequency, number and importance of the terms it shares with each topic define how strongly the document is related to each of the K topics.

Consider the following example: given a corpus of news that covers Sports, Science and Politics, suppose our initial guess correctly sets K=3.

We could expect the algorithm to find three topics and a set of terms related to each of them, ranked by importance (given by the weights):

topic1: Obama(0.67), Congress(0.23), Democrats (0.08)
topic2: Messi(0.5), CR7(0.47), Referee(0.02)
topic3: Kepler(0.63), Sagan(0.27), Cassini(0.09)

The weights don't add up to 1 because there are other terms related to any topic, but in order to label or describe a topic, we only choose the most important ones.

Therefore, any document is mapped to a vectorial space of k dimensions, as follows:

doc-i = [topic1=0.88, topic2=0.1, topic3=0.02]

So the most important topic in doc-i is topic1. Moreover, the algorithm also yields a rank of topics by topic importance in the whole corpus, let's say that each topic weight (or importance) in our corpus is k-p, p=1,2,...,K:

[k1, k3, k2] means that topic k1 is more important than k3, which in turn is more important than k2.

Fitting an LDA (latent Dirichlet allocation) model:

LDA is a probabilistic model and does not itself compute an SVD (Singular Value Decomposition), but the SVD view of the docTerm matrix [M] is a useful analogy for reading its outputs. A rank-k SVD approximates M so that:

$$ M \approx U \Sigma V^{T} $$

[U]: each row is a document i=1,...,D and every column is a topic j=1,...,k (D x k). It maps every document into the topic vector space and is used to cluster documents by topic.

[V]: each row is a term j=1,...,n and every column is a topic i=1,...,k (n x k). It maps every topic into the feature (term) vector space. Its main task is to label (or describe) each topic with the most relevant terms.

[Sigma]: diagonal matrix of topic coefficients (k x k). It ranks the most important topics.
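
Written out with explicit dimensions, and assuming a rank-k truncation, the factorization can be read entry by entry. The symbol $\sigma_p$ (the p-th diagonal entry of Sigma) is introduced here only for illustration:

$$ M_{D \times n} \approx U_{D \times k} \, \Sigma_{k \times k} \, V^{T}_{k \times n}, \qquad M_{ij} \approx \sum_{p=1}^{k} U_{ip} \, \sigma_{p} \, V_{jp} $$

In words, the count of term j in document i is approximated by summing, over the k topics, the weight of topic p in the document, times the topic's importance, times the weight of the term in that topic.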

In [6]:
val lda = new LDA().setK(3).setMaxIter(30).setFeaturesCol("features").setTopicDistributionCol("topicDistribution")
val ldaModel = lda.fit(docTerm_df) //org.apache.spark.ml.clustering.LDA
val topicsTop_df = ldaModel.describeTopics(3)
topicsTop_df.show(false)

+-----+------------+------------------------------------------------------------------+
|topic|termIndices |termWeights                                                       |
+-----+------------+------------------------------------------------------------------+
|0    |[2, 9, 0]   |[0.21741384290888094, 0.049706981136568226, 0.04053549921057427]  |
|1    |[10, 14, 15]|[0.025309305907768573, 0.019168714028749912, 0.018024278035226916]|
|2    |[0, 1, 3]   |[0.09614295286779202, 0.09455567268586144, 0.04721695071499122]   |
+-----+------------+------------------------------------------------------------------+



In [7]:
val docTopic_df = ldaModel.transform(docTerm_df)
docTopic_df.select($"id", $"doc", $"topicDistribution").show(5)

+---+--------------------+--------------------+
| id|                 doc|   topicDistribution|
+---+--------------------+--------------------+
|  1|north america s i...|[0.00235965459353...|
|  2|fyi original mess...|[0.82798160872248...|
|  3|14 13 53 synchron...|[0.05753786926167...|
|  4|forwarded by stev...|[0.59561274836540...|
|  5|forwarded by stev...|[0.24381207134737...|
+---+--------------------+--------------------+
only showing top 5 rows



The default optimizer yields a local LDAModel: it stores information about the topics only, not about the training dataset.

In [8]:
ldaModel.isDistributed

false

### Map topics in features space and compute distances among them

In [9]:
val n = ldaModel.vocabSize

In [10]:
//Inferred topics, where each topic is represented by a distribution over terms. (Local Matrix: nxk, each column is a topic)
println("Terms coordinates in topic vectorial space:")
val topicMat = ldaModel.topicsMatrix
val topicExpArr = topicMat.toArray

Terms coordinates in topic vectorial space:


In [11]:
//topicExpArr is column-major, so each block of n consecutive values is one topic
val xTopic0 = BDV(topicExpArr.slice(0,n))
val xTopic1 = BDV(topicExpArr.slice(n,2*n))
val xTopic2 = BDV(topicExpArr.slice(2*n,topicExpArr.length))


In [12]:
val zTopic0 =  normalize(xTopic0) 
val zTopic1 =  normalize(xTopic1)
val zTopic2 =  normalize(xTopic2)

In [18]:
val norm_cosdist_01 = cosineDistance(zTopic0, zTopic1)
val norm_eucldist_01 = euclideanDistance(zTopic0, zTopic1)
println("Distance in features vectorial space from topic0 to topic1 coordinates: ")
println("Euclidean: "+ norm_eucldist_01)
println("Cosine distance: "+ norm_cosdist_01)

Distance in features vectorial space from topic0 to topic1 coordinates: 
Euclidean: 1.29576538167957
Cosine distance: 0.8395039621796017


In [21]:
val norm_cosdist_02 = cosineDistance(zTopic0, zTopic2)
val norm_eucldist_02 = euclideanDistance(zTopic0, zTopic2)
println("Distance in features vectorial space from topic0 to topic2 coordinates: ")
println("Euclidean: "+ norm_eucldist_02)
println("Cosine distance: "+ norm_cosdist_02)

Distance in features vectorial space from topic0 to topic2 coordinates: 
Euclidean: 1.2960473628091869
Cosine distance: 0.8398693833223239


In [22]:
val norm_cosdist_12 = cosineDistance(zTopic1, zTopic2)
val norm_eucldist_12 = euclideanDistance(zTopic1, zTopic2)
println("Distance in features vectorial space from topic1 to topic2 coordinates: ")
println("Euclidean: "+ norm_eucldist_12)
println("Cosine distance: "+ norm_cosdist_12)

Distance in features vectorial space from topic1 to topic2 coordinates: 
Euclidean: 1.1368491656614865
Cosine distance: 0.6462130127326077
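
For unit-length vectors the two measures carry the same information: the squared Euclidean distance equals twice the cosine distance (1 minus the cosine similarity). The pair reported for topic0 and topic1 fits this relation, since sqrt(2 x 0.8395) is about 1.2958. The sketch below checks the identity in plain Scala on two hypothetical toy vectors; the helper names `dot`, `l2Norm` and `unit` are assumptions introduced here, not Breeze functions.

```scala
// Hypothetical toy vectors standing in for two topic-term columns.
val a = Array(3.0, 1.0, 0.0)
val b = Array(1.0, 2.0, 2.0)

def dot(x: Array[Double], y: Array[Double]): Double =
  x.zip(y).map { case (xi, yi) => xi * yi }.sum

def l2Norm(x: Array[Double]): Double = math.sqrt(dot(x, x))

// Scale a vector to unit L2 norm.
def unit(x: Array[Double]): Array[Double] = {
  val n = l2Norm(x)
  x.map(_ / n)
}

val za = unit(a)
val zb = unit(b)

// Cosine distance = 1 - cosine similarity; for unit vectors the similarity is just the dot product.
val cosineDist = 1.0 - dot(za, zb)
val euclidDist = l2Norm(za.zip(zb).map { case (x, y) => x - y })

println(f"cosine distance: $cosineDist%.6f")
println(f"Euclidean distance: $euclidDist%.6f")
println(f"sqrt(2 * cosine distance): ${math.sqrt(2 * cosineDist)}%.6f")
```

Because of this identity, ranking topic pairs by either distance gives the same order once the vectors are normalized.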


### Label topics with the most relevant words

In [52]:
val topicTokens = topicsTop_df.select($"topic", posexplode($"termIndices"), $"termWeights").withColumnRenamed("col","tokenIdx").
    withColumn("vocabulary",lit(vocabulary)).select($"topic",$"pos",expr("vocabulary[tokenIdx] as token"))

 topicTokens.show(false)

+-----+---+------+
|topic|pos|token |
+-----+---+------+
|0    |0  |com   |
|0    |1  |mail  |
|0    |2  |enron |
|1    |0  |power |
|1    |1  |state |
|1    |2  |market|
|2    |0  |enron |
|2    |1  |ect   |
|2    |2  |hou   |
+-----+---+------+



In [53]:
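//Note: despite its name, the "tokenIdx" column produced here holds the term weight, not a term index
val topicCoefs = topicsTop_df.select($"topic",posexplode($"termWeights")).withColumnRenamed("col","tokenIdx")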
topicCoefs.show(false)

+-----+---+--------------------+
|topic|pos|tokenIdx            |
+-----+---+--------------------+
|0    |0  |0.21436202279104347 |
|0    |1  |0.04889714586407715 |
|0    |2  |0.03035895022908159 |
|1    |0  |0.03175053985006113 |
|1    |1  |0.019002641638193714|
|1    |2  |0.0181822689958963  |
|2    |0  |0.11093990448578811 |
|2    |1  |0.10631141311387063 |
|2    |2  |0.05316370797273026 |
+-----+---+--------------------+



In [54]:
val topicReport =  topicTokens.join(topicCoefs,Seq("topic","pos"),"inner")
println("Topic labelling: ")
topicReport.show(false)

Topic labelling: 
+-----+---+------+--------------------+
|topic|pos|token |tokenIdx            |
+-----+---+------+--------------------+
|0    |0  |com   |0.21436202279104347 |
|0    |1  |mail  |0.04889714586407715 |
|0    |2  |enron |0.03035895022908159 |
|1    |0  |power |0.03175053985006113 |
|1    |1  |state |0.019002641638193714|
|1    |2  |market|0.0181822689958963  |
|2    |0  |enron |0.11093990448578811 |
|2    |1  |ect   |0.10631141311387063 |
|2    |2  |hou   |0.05316370797273026 |
+-----+---+------+--------------------+



We can see that topic0 is closely related to [com, mail, enron], topic1 to [power, state, market] and topic2 to [enron, ect, hou].

By checking which tokens best describe a topic, we can describe (or label) that topic. Topic 1 is closely related to energy, so it is likely to be of interest in the trial, whereas topics 0 and 2 may be less interesting.
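
The labelling step is just a lookup from term index to vocabulary entry. A minimal plain-Scala sketch, using the ten vocabulary terms printed earlier and the term indices shown by `describeTopics` for topics 0 and 2 (topic 1 is left out because its indices fall outside the ten printed terms); the names `vocabHead`, `topTermIndices` and `labels` are assumptions for illustration:

```scala
// First ten vocabulary entries, as printed earlier in the notebook.
val vocabHead = Array("enron", "ect", "com", "hou", "power", "2000", "subject", "2001", "energy", "mail")

// Top term indices per topic, copied from the describeTopics output.
val topTermIndices = Map(0 -> Seq(2, 9, 0), 2 -> Seq(0, 1, 3))

// Replace every index by the term stored at that position of the vocabulary.
val labels: Map[Int, Seq[String]] =
  topTermIndices.map { case (topic, idxs) => topic -> idxs.map(i => vocabHead(i)) }

labels.toSeq.sortBy(_._1).foreach { case (topic, terms) =>
  println(s"topic$topic: ${terms.mkString(", ")}")
}
```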

### Cluster and rank documents most closely related to every topic

To leverage topic modelling we need to perform two main tasks:
* Cluster documents by topic
* In each cluster, rank documents by their relative importance
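
The two tasks above can be sketched in plain Scala, without Spark. The topic distributions below are hypothetical illustrative values, and the names `docTopics`, `assigned` and `ranked` are assumptions; the idea is to assign each document to its highest-weight topic and then sort documents inside each topic by that weight.

```scala
// (document id, topic distribution) pairs; values are hypothetical.
val docTopics: Seq[(Int, Array[Double])] = Seq(
  1 -> Array(0.05, 0.94, 0.01),
  2 -> Array(0.67, 0.01, 0.32),
  3 -> Array(0.98, 0.01, 0.01),
  4 -> Array(0.47, 0.01, 0.52),
  5 -> Array(0.22, 0.22, 0.56)
)

// Step 1: assign each document to its dominant topic (argmax) and keep that topic's weight.
val assigned: Seq[(Int, Int, Double)] = docTopics.map { case (id, dist) =>
  val (weight, topic) = dist.zipWithIndex.maxBy(_._1)
  (id, topic, weight)
}

// Step 2: within each topic, rank documents by descending weight.
val ranked: Map[Int, Seq[(Int, Double)]] = assigned
  .groupBy(_._2)
  .map { case (topic, rows) => topic -> rows.map(r => (r._1, r._3)).sortBy(p => -p._2) }

ranked.toSeq.sortBy(_._1).foreach { case (topic, ds) =>
  println(s"topic $topic: " + ds.map { case (id, w) => s"doc $id ($w)" }.mkString(", "))
}
```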

In [55]:
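//mlVector is the alias imported above for org.apache.spark.ml.linalg.Vector (the type of topicDistribution)
val maxElementIdx = (v: mlVector) => v.toArray.zipWithIndex.maxBy(_._1)._2
val maxElementIdx_UDF = udf(maxElementIdx)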

In [56]:
//Weight of the dominant topic, i.e. the largest entry of the topic distribution
val maxElement = (v: mlVector) => v.toArray.max
val maxElement_UDF = udf(maxElement)

In [138]:
val docTopic = docTopic_df.withColumn("relatedTopic", maxElementIdx_UDF($"topicDistribution")).
    withColumn("relatedTopicWeight", maxElement_UDF($"topicDistribution"))
//Every topic importance among documents
println("Documents coordinates in topics vectorial space: ")
docTopic.select($"id",$"topicDistribution").limit(5).show(false)

Documents coordinates in topics vectorial space: 
+---+--------------------------------------------------------------+
|id |topicDistribution                                             |
+---+--------------------------------------------------------------+
|1  |[0.04876631962014416,0.9474722648407216,0.003761415539134295] |
|2  |[0.6728750600791644,0.005257764192483943,0.3218671757283516]  |
|3  |[0.9880542759065452,4.522598245078323E-4,0.011493464268947028]|
|4  |[0.4709270676557802,0.011854241218315012,0.5172186911259047]  |
|5  |[0.22301099524302745,0.21453950505845448,0.562449499698518]   |
+---+--------------------------------------------------------------+



In [58]:
//Most closely related topic to every document
Range(0,3).map(x => docTopic.filter($"relatedTopic" === lit(x)).
                                       select($"id",  $"relatedTopic", $"relatedTopicWeight").
                                       orderBy($"relatedTopicWeight".desc).
                                       limit(10).show(false))


+---+------------+------------------+
|id |relatedTopic|relatedTopicWeight|
+---+------------+------------------+
|483|0           |0.9992277264915075|
|465|0           |0.9991443968816465|
|337|0           |0.9985810353565892|
|530|0           |0.9983450275653591|
|412|0           |0.9978456459393924|
|592|0           |0.9973319241233644|
|9  |0           |0.9972315506201558|
|542|0           |0.9969824379044203|
|581|0           |0.9958069633704406|
|641|0           |0.9950648492781391|
+---+------------+------------------+

+---+------------+------------------+                                           
|id |relatedTopic|relatedTopicWeight|
+---+------------+------------------+
|527|1           |0.9990804429656244|
|707|1           |0.9989610219249793|
|531|1           |0.9980240876424809|
|838|1           |0.9973335321156257|
|283|1           |0.996504936858759 |
|806|1           |0.9963827172460834|
|848|1           |0.9961773152338987|
|479|1           |0.994932511579049 |
|218|1           |0.9949145981555191|
|418|1

Vector((), (), ())

### Analyze response variable in each topic

Topic modelling is an unsupervised technique; however, in this case we have a response variable (label), so we can check whether splitting the corpus into topics yields more informative event proportions in each topic (or cluster of documents).

In [59]:
val docTopic_analytics = docTopic.groupBy($"relatedTopic").agg(count("*").as("N"), sum("label").as("N1")).
    withColumn("N0", $"N"-$"N1").
    withColumn("e", $"N1"/$"N").
    withColumn("ne", $"N0"/$"N")
println("Classification analysis by topic: ")
docTopic_analytics.show()

Classification analysis by topic: 
+------------+---+----+-----+-------------------+------------------+
|relatedTopic|  N|  N1|   N0|                  e|                ne|
+------------+---+----+-----+-------------------+------------------+
|           1|166|72.0| 94.0|0.43373493975903615|0.5662650602409639|
|           2|389|47.0|342.0|0.12082262210796915|0.8791773778920309|
|           0|300|20.0|280.0|0.06666666666666667|0.9333333333333333|
+------------+---+----+-----+-------------------+------------------+



In fact, the event proportion in topic 1 [power, state, market] is the highest (about 0.43), so the topics give an easy-to-understand way of clustering documents and therefore of classifying them. Moreover, topics 0 and 2 yield a lower event proportion than the baseline (about 0.16), so LDA with k=3 performs well here as an unsupervised model that supports text classification.
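
The baseline and the per-topic comparison can be reproduced from the counts in the table above. This is a minimal plain-Scala sketch; the name `lift` (topic event proportion divided by the baseline) is an assumption introduced here for illustration, not something computed in the notebook.

```scala
// (topic, N, N1) rows copied from the classification table.
val byTopic = Seq((1, 166, 72), (2, 389, 47), (0, 300, 20))

val totalDocs = byTopic.map(_._2).sum     // 855 documents
val totalEvents = byTopic.map(_._3).sum   // 139 documents with label = 1
val baseline = totalEvents.toDouble / totalDocs

// Lift > 1 means the topic concentrates events relative to the whole corpus.
val lift: Map[Int, Double] =
  byTopic.map { case (topic, n, n1) => topic -> (n1.toDouble / n) / baseline }.toMap

println(f"baseline event proportion: $baseline%.4f")
lift.toSeq.sortBy(_._1).foreach { case (topic, l) => println(f"topic $topic lift: $l%.2f") }
```

Only topic 1 has a lift above 1, which matches the reading that it concentrates the labelled emails.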