# ST446 Distributed Computing for Big Data
## Homework
### Milan Vojnovic and Christine Yuen, LT 2018
---

## P3: Topic Modelling

In this homework assignment problem, you are asked to perform a semantic analysis of the DBLP author publications dataset `dblp/author_large.txt`. 

A. Use Latent Dirichlet Allocation (LDA) to cluster publications by using words in their titles and represent each publication by 10 topics. You should:

A.1. Convert titles to tokens by:
   * Tokenizing words in the title of each publication
   * Removing stop words using the nltk package
   * Removing punctuation, numbers or other symbols
   * Lemmatizing tokens

Note that you may skip some of these steps or add further token editing, but if you do so, provide a justification.

A.2. Convert tokens into sparse vectors

A.3. Use the LDA algorithm to find 10 topics for each publication and represent each topic by its first few most relevant words. Note that you can choose a number of topics other than 10; if you do so, provide a justification.

A.4. Comment on the obtained results

B. Address each question as in part A, but with each "document" representing the publication titles of a specific author. For example, if an author Y wrote "introduction to databases" and "database design", then the "document" for the author Y will be "introduction to databases database design".

In addition, calculate the topic density vector for each author and use the topic densities to calculate the cosine similarity for each pair of authors. For example, if the topic density vector for author X is [0.2, 0.8, 0, ...] and the topic density vector for author Y is [0.1, 0.9, 0, ...], then the cosine similarity is $\frac{0.2 \cdot 0.1 + 0.8 \cdot 0.9}{\sqrt{0.2^2+0.8^2}\sqrt{0.1^2+0.9^2}}$. Show the 10 most similar author pairs.
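
The cosine similarity computation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the author names and the third vector are hypothetical toy data, and the helper names `cosine_similarity` and `most_similar_pairs` are made up for this sketch. In the notebook, the density vectors would presumably come from the `topicDistribution` column that the fitted LDA model's `transform` adds.

```python
import heapq
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def most_similar_pairs(densities, k=10):
    """Return the k highest-scoring (similarity, author_a, author_b) tuples."""
    scored = (
        (cosine_similarity(densities[a], densities[b]), a, b)
        for a, b in combinations(sorted(densities), 2)
    )
    return heapq.nlargest(k, scored)

# Hypothetical topic densities; X and Y reuse the numbers from the example above
densities = {
    "X": [0.2, 0.8, 0.0],
    "Y": [0.1, 0.9, 0.0],
    "Z": [0.7, 0.1, 0.2],
}

sim_xy = cosine_similarity(densities["X"], densities["Y"])
top = most_similar_pairs(densities, k=2)
print(round(sim_xy, 4))
print(top)
```

Comparing all pairs is quadratic in the number of authors, so on the full dataset this brute-force approach would likely need to be restricted to a sample or replaced by a distributed join.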


In [1]:
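from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Use Kryo serialization with larger buffers
conf = (SparkConf().setAppName('QuestionP3')
        .set("spark.kryoserializer.buffer.max", "128m")
        .set("spark.kryoserializer.buffer", "64m")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

sc = SparkContext.getOrCreate(conf=conf)
# `spark` is needed later for createDataFrame
spark = SparkSession.builder.config(conf=conf).getOrCreate()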

In [2]:
ardd0 = sc.textFile("C:/hduser/author-large.txt")

# Part A

In [3]:
ardd1 = ardd0.map(lambda x: x.split("\t", 3)) \
        .map(lambda x: (x[1],x[2])) \
        .reduceByKey(lambda x,y: x + " " + y) \
        .map(lambda x: (x[1]))
ardd1.take(1)

["On Modeling Conformance for Flexible Transformation over Data Models. On Modeling Conformance for Flexible Transformation over Data Models. Knowledge Representation and Transformation in Ontology-based Data Integration. Knowledge Representation and Transformation in Ontology-based Data Integration. The 'Family of Languages' Approach to Semantic Interoperability. The 'Family of Languages' Approach to Semantic Interoperability. UML for the Semantic Web: Transformation-Based Approaches. UML for the Semantic Web: Transformation-Based Approaches. UML for the Semantic Web: Transformation-Based Approaches. Tracing Data Lineage Using Schema Transformation Pathways. Tracing Data Lineage Using Schema Transformation Pathways. Transforming UML Domain Descriptions into Configuration Knowledge Bases. Transforming UML Domain Descriptions into Configuration Knowledge Bases. Transforming UML Domain Descriptions into Configuration Knowledge Bases. Transforming UML Domain Descriptions into Configuratio

In [4]:
# A.1. Convert titles to tokens
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)
lmtzr = WordNetLemmatizer()

def get_tokens(line):
    tokens = word_tokenize(line)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation from each word
    stripped = [w.translate(table) for w in tokens]
    # keep only purely alphabetic tokens (drops numbers and symbols)
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words
    words = [w for w in words if w not in stop_words]
    # drop short tokens (3 characters or fewer)
    words = [w for w in words if len(w) > 3]
    # lemmatize the words
    words = [lmtzr.lemmatize(w) for w in words]
    return words

In [5]:
ardd2 = ardd1.map(lambda line: (1, get_tokens(line)))

In [6]:
ardd2.take(3)

[(1,
  ['modeling',
   'conformance',
   'flexible',
   'transformation',
   'data',
   'model',
   'modeling',
   'conformance',
   'flexible',
   'transformation',
   'data',
   'model',
   'knowledge',
   'representation',
   'transformation',
   'ontologybased',
   'data',
   'integration',
   'knowledge',
   'representation',
   'transformation',
   'ontologybased',
   'data',
   'integration',
   'family',
   'language',
   'approach',
   'semantic',
   'interoperability',
   'family',
   'language',
   'approach',
   'semantic',
   'interoperability',
   'semantic',
   'transformationbased',
   'approach',
   'semantic',
   'transformationbased',
   'approach',
   'semantic',
   'transformationbased',
   'approach',
   'tracing',
   'data',
   'lineage',
   'using',
   'schema',
   'transformation',
   'pathway',
   'tracing',
   'data',
   'lineage',
   'using',
   'schema',
   'transformation',
   'pathway',
   'transforming',
   'domain',
   'description',
   'configuration',

In [7]:
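# Count how often each word occurs in the whole corpus
doc_stop_words = ardd2.flatMap(lambda r: r[1]).map(lambda r: (r,1)).reduceByKey(lambda a,b: a+b)

# Treat words occurring more than 15000 times as corpus-specific stop words;
# a set makes the membership test in the map below fast
doc_stop_words = set(doc_stop_words.filter(lambda a: a[1]>15000).map(lambda r: r[0]).collect())

ardd3 = ardd2.map(lambda r: (r[0],[w for w in r[1] if w not in doc_stop_words]))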

ardd3.take(1)[0][1][:10]

['conformance',
 'flexible',
 'transformation',
 'conformance',
 'flexible',
 'transformation',
 'transformation',
 'ontologybased',
 'transformation',
 'ontologybased']

In [8]:
# A.2. Convert tokens into sparse vectors
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import monotonically_increasing_id

ardd4 = spark.createDataFrame(ardd3, ["dummy","words"])
ardd4.cache()
ardd4.take(1)

[Row(dummy=1, words=['conformance', 'flexible', 'transformation', 'conformance', 'flexible', 'transformation', 'transformation', 'ontologybased', 'transformation', 'ontologybased', 'family', 'interoperability', 'family', 'interoperability', 'transformationbased', 'transformationbased', 'transformationbased', 'tracing', 'lineage', 'schema', 'transformation', 'pathway', 'tracing', 'lineage', 'schema', 'transformation', 'pathway', 'transforming', 'domain', 'description', 'configuration', 'base', 'transforming', 'domain', 'description', 'configuration', 'base', 'transforming', 'domain', 'description', 'configuration', 'base', 'transforming', 'domain', 'description', 'configuration', 'base', 'transforming', 'domain', 'description', 'configuration', 'base', 'transforming', 'transforming', 'schema', 'conversion', 'relational', 'schema', 'conversion', 'relational', 'schema', 'conversion', 'relational', 'rdft', 'mapping', 'metaontology', 'transformation', 'ontology', 'ontology', 'ontology', 'on

In [9]:
cv = CountVectorizer(inputCol="words", outputCol="features", minDF=2)

cv_model = cv.fit(ardd4)

ardd4_w_features = cv_model.transform(ardd4)
ardd4_w_features.cache()
ardd4_w_features.show(10)

+-----+--------------------+--------------------+
|dummy|               words|            features|
+-----+--------------------+--------------------+
|    1|[conformance, fle...|(61425,[13,20,69,...|
|    1|[logical, handlin...|(61425,[48,56,57,...|
|    1|[actor, conceptua...|(61425,[3,6,56,64...|
|    1|[entity, realtion...|(61425,[6,10,41,2...|
|    1|[expert, effectiv...|(61425,[138,188,1...|
|    1|[manufacturing, m...|(61425,[4,11,14,2...|
|    1|[abduction, analo...|(61425,[21,39,45,...|
|    1|[computinganstze,...|(61425,[24,35,46,...|
|    1|[firstclass, data...|(61425,[9,17,59,6...|
|    1|[dataflow, educat...|(61425,[6,15,17,5...|
+-----+--------------------+--------------------+
only showing top 10 rows
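
Each entry in the `features` column above reads as (vocabulary size, term indices, term counts). The pure-Python sketch below illustrates that representation on hypothetical toy documents. It is not Spark's implementation: the function names, the ordering of the vocabulary by corpus frequency, and the alphabetical tie-breaking are assumptions made for this example, and `min_df` only mimics the role of `minDF=2` (keep terms that appear in at least two documents).

```python
from collections import Counter

def fit_vocabulary(docs, min_df=2):
    """Keep terms found in at least min_df documents, most frequent first."""
    term_freq = Counter()
    doc_freq = Counter()
    for tokens in docs:
        term_freq.update(tokens)      # total occurrences in the corpus
        doc_freq.update(set(tokens))  # number of documents containing the term
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Descending corpus frequency; alphabetical order breaks ties (assumption)
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept

def to_sparse(tokens, vocabulary):
    """Return (size, indices, values) for one tokenized document."""
    index = {t: i for i, t in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]

# Hypothetical toy corpus of tokenized titles
docs = [
    ["schema", "transformation", "schema"],
    ["schema", "mapping", "ontology"],
    ["ontology", "transformation", "lineage"],
]

vocab = fit_vocabulary(docs, min_df=2)
sparse = to_sparse(docs[0], vocab)
print(vocab)
print(sparse)
```

Terms that fail the document-frequency filter ("mapping" and "lineage" here) get no index, so they simply disappear from the vectors.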



In [10]:
from pyspark.mllib.linalg import Vectors
#from pyspark.ml import linalg as ml_linalg
def as_mllib_vector(v):
    return Vectors.sparse(v.size, v.indices, v.values)

features = ardd4_w_features.select("features")
feature_vec = features.rdd.map(lambda r: as_mllib_vector(r[0]))

feature_vec.cache()
feature_vec.take(1)

[SparseVector(64293, {14: 2.0, 56: 2.0, 263: 3.0, 372: 2.0, 635: 2.0, 999: 1.0, 1288: 7.0, 1322: 2.0, 4927: 3.0, 5603: 2.0, 23719: 1.0})]

In [11]:
print("Vocabulary from CountVectorizerModel is:")
print(cv_model.vocabulary[:100])

Vocabulary from CountVectorizerModel is:
['calibration', 'belief', 'phone', 'discovering', 'gaussian', 'relevance', 'tolerance', 'awareness', 'difference', 'browsing', 'mac', 'dynamically', 'satellite', 'need', 'family', 'connection', 'movement', 'cycle', 'dialogue', 'implication', 'behaviour', 'care', 'ofdm', 'providing', 'bus', 'plan', 'modified', 'theorem', 'template', 'dna', 'arithmetic', 'life', 'alternative', 'density', 'parallelism', 'im', 'networking', 'tolerant', 'part', 'receiver', 'microprocessor', 'forecasting', 'free', 'qualitative', 'bounded', 'detector', 'gate', 'interconnection', 'controlled', 'mathematical', 'step', 'bit', 'direction', 'ct', 'minimal', 'wide', 'ontologybased', 'workload', 'frequent', 'fingerprint', 'customer', 'concurrency', 'compensation', 'selective', 'automation', 'disk', 'pose', 'creation', 'call', 'deterministic', 'potential', 'string', 'categorization', 'formation', 'repository', 'ant', 'lower', 'embedding', 'mr', 'center', 'consideration', 'effe

In [13]:
# A.3. Use the LDA algorithm to find 10 topics for each publication
from pyspark.ml.clustering import LDA

lda = LDA(k=10, maxIter=20)
lda_model = lda.fit(ardd4_w_features)

In [15]:
# Describe topics
import numpy as np

topics = lda_model.describeTopics(10)

print("The topics described by their top-weighted terms:")

topics.show(truncate=False)

# Map each topic's term indices back to vocabulary words
vocabulary = np.array(cv_model.vocabulary)
topic_i = topics.select("termIndices").rdd.map(lambda r: r[0]).collect()
for i in topic_i:
    print(vocabulary[i])

The topics described by their top-weighted terms:
+-----+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topic|termIndices                                               |termWeights                                                                                                                                                                                                                         |
+-----+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0    |[51, 13, 54, 6, 25, 35, 29, 17, 48, 66]                

In [14]:
ll = lda_model.logLikelihood(ardd4_w_features)
lp = lda_model.logPerplexity(ardd4_w_features)

print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on the perplexity: " + str(lp))

The lower bound on the log likelihood of the entire corpus: -72849486.59277481
The upper bound on the perplexity: 8.378121104596655


A.4. Comment on results:

Beyond the usual tokenizing steps, I kept only words longer than 3 characters, because spurious topic terms such as 'en' and 'um' appeared when the LDA model was trained without this filter. I also removed words that occur more than 15000 times in the corpus, to improve the quality and relevance of the topics.

I trained the model on the corpus with 10 topics and set the maximum number of iterations to 20.

Filtering words by a frequency threshold helped improve the model, because some publications were otherwise poorly represented by a handful of irrelevant topic terms like 'apl' and non-dictionary terms like 'prolog' and 'ingres'. I used the perplexity upper bound and the topics produced by the LDA model as criteria to tune the threshold, keeping the number of iterations at 20. Although I felt that 10 topics per publication is too many, LDA does a relatively good job, with an upper bound on the log perplexity (as returned by `logPerplexity`) of 8.38.
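
The reported bound is easier to interpret once it is converted to a per-token perplexity. The sketch below rests on an assumption that the notebook itself does not confirm: that the value from `logPerplexity` is a natural-log, per-token quantity equal to the negative log-likelihood bound divided by the number of tokens in the corpus. Under that assumption, exponentiating it gives a perplexity that can be compared with the vocabulary size shown in the outputs above (roughly 60 thousand terms).

```python
import math

# Values copied from the notebook output above
log_likelihood_bound = -72849486.59277481
log_perplexity_bound = 8.378121104596655

# Assumption: log perplexity = -(log-likelihood bound) / (number of tokens),
# so the token count can be recovered from the two printed numbers
implied_tokens = -log_likelihood_bound / log_perplexity_bound

# Assumption: the bound is in natural-log units
perplexity = math.exp(log_perplexity_bound)

print(f"implied corpus tokens: {implied_tokens:,.0f}")
print(f"per-token perplexity bound: {perplexity:,.0f}")
```

If the assumption holds, the perplexity is in the low thousands. A model guessing uniformly over a vocabulary of V terms would score a perplexity of V, so this would still be a substantial improvement over a uniform guess.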

# Part B

In [9]:
brdd1 = ardd0.map(lambda x: x.split("\t", 3)) \
            .map(lambda x: (x[0],x[2])) \
            .reduceByKey(lambda x,y: x+ " " +y) \
            .map(lambda x: x[1])
brdd1.take(1)

["Object SQL - A Language for the Design and Implementation of Object Databases. Overview of the Iris DBMS. Overview of the Iris DBMS. A Physician's Workstation as an Application of Object-Oriented Database Technology in Healthcare. A Powerful Wide-Area Information Clent. Database Programming Languages: A Functional Approach. Integrating a Structured-Text Retrieval System with an Object-Oriented Database System."]

In [10]:
# B.1. Convert titles to tokens
brdd2 = brdd1.map(lambda line: (1, get_tokens(line)))

In [11]:
brdd2.take(3)

[(1,
  ['object',
   'language',
   'design',
   'implementation',
   'object',
   'database',
   'overview',
   'iris',
   'dbms',
   'overview',
   'iris',
   'dbms',
   'physician',
   'workstation',
   'application',
   'objectoriented',
   'database',
   'technology',
   'healthcare',
   'powerful',
   'widearea',
   'information',
   'clent',
   'database',
   'programming',
   'language',
   'functional',
   'approach',
   'integrating',
   'structuredtext',
   'retrieval',
   'system',
   'objectoriented',
   'database',
   'system']),
 (1,
  ['physical',
   'object',
   'management',
   'service',
   'query',
   'optimization',
   'object',
   'base',
   'exploiting',
   'relational',
   'technique',
   'application',
   'generator',
   'idea',
   'programming',
   'language',
   'extension',
   'service',
   'field',
   'approach',
   'resource',
   'constrained',
   'sensoractor',
   'network',
   'autoglobe',
   'automatische',
   'administration',
   'dienstbasierten',
   

In [12]:
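# Count how often each word occurs in the whole corpus
doc_stop_words = brdd2.flatMap(lambda r: r[1]).map(lambda r: (r,1)).reduceByKey(lambda a,b: a+b)

# Treat words occurring more than 15000 times as corpus-specific stop words;
# a set makes the membership test in the map below fast
doc_stop_words = set(doc_stop_words.filter(lambda a: a[1]>15000).map(lambda r: r[0]).collect())

brdd3 = brdd2.map(lambda r: (r[0],[w for w in r[1] if w not in doc_stop_words]))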

brdd3.take(1)[0][1][:10]

['overview',
 'iris',
 'dbms',
 'overview',
 'iris',
 'dbms',
 'physician',
 'workstation',
 'objectoriented',
 'healthcare']

In [13]:
# B.2. Convert tokens into sparse vectors
brdd4 = spark.createDataFrame(brdd3, ["dummy","words"])
brdd4.cache()
brdd4.take(1)

[Row(dummy=1, words=['overview', 'iris', 'dbms', 'overview', 'iris', 'dbms', 'physician', 'workstation', 'objectoriented', 'healthcare', 'powerful', 'widearea', 'clent', 'functional', 'integrating', 'structuredtext', 'objectoriented'])]

In [14]:
cvb = CountVectorizer(inputCol="words", outputCol="features", minDF=2)

cv_modelb = cvb.fit(brdd4)

brdd4_df_w_features = cv_modelb.transform(brdd4)
brdd4_df_w_features.cache()
brdd4_df_w_features.show(10)

+-----+--------------------+--------------------+
|dummy|               words|            features|
+-----+--------------------+--------------------+
|    1|[overview, iris, ...|(158039,[66,92,11...|
|    1|[physical, base, ...|(158039,[0,3,9,11...|
|    1|[specification, e...|(158039,[6,7,13,2...|
|    1|[specification, e...|(158039,[2,3,4,6,...|
|    1|[spatial, firstcl...|(158039,[1,23,26,...|
|    1|[version, objecto...|(158039,[3,26,45,...|
|    1|[gemstone, persis...|(158039,[116,1239...|
|    1|[storage, exodus,...|(158039,[3,9,11,1...|
|    1|[storage, exodus,...|(158039,[49,66,99...|
|    1|[manager, coopera...|(158039,[3,15,41,...|
+-----+--------------------+--------------------+
only showing top 10 rows



In [15]:
from pyspark.mllib.linalg import Vectors
#from pyspark.ml import linalg as ml_linalg
def as_mllib_vector(v):
    return Vectors.sparse(v.size, v.indices, v.values)

bfeatures = brdd4_df_w_features.select("features")
bfeature_vec = bfeatures.rdd.map(lambda r: as_mllib_vector(r[0]))

bfeature_vec.cache()
bfeature_vec.take(1)

[SparseVector(158039, {66: 2.0, 92: 1.0, 111: 1.0, 457: 2.0, 931: 1.0, 991: 1.0, 1225: 2.0, 1382: 2.0, 1980: 1.0, 3158: 1.0, 4530: 1.0, 94917: 1.0, 132293: 1.0})]

In [16]:
print("Vocabulary from CountVectorizerModel is:")
print(cv_modelb.vocabulary[:100])

Vocabulary from CountVectorizerModel is:
['integrated', 'sequence', 'component', 'mechanism', 'visualization', 'linear', 'specification', 'flow', 'task', 'matching', 'hierarchical', 'comparison', 'platform', 'ontology', 'text', 'computation', 'improving', 'theory', 'traffic', 'device', 'domain', 'effect', 'vector', 'local', 'internet', 'experience', 'scalable', 'automated', 'modelling', 'game', 'heterogeneous', 'abstract', 'content', 'coding', 'face', 'project', 'synthesis', 'path', 'secure', 'active', 'identification', 'requirement', 'filter', 'spatial', 'monitoring', 'cluster', 'distribution', 'formal', 'reasoning', 'supporting', 'level', 'multiagent', 'methodology', 'error', 'business', 'concept', 'source', 'discovery', 'improved', 'building', 'solution', 'complexity', 'complex', 'context', 'group', 'evolutionary', 'objectoriented', 'allocation', 'signal', 'mapping', 'surface', 'temporal', 'behavior', 'state', 'reduction', 'social', 'shape', 'hardware', 'probabilistic', 'evolution',

LDA

In [17]:
# B.3. Use the LDA algorithm to find 10 topics for each author document
from pyspark.ml.clustering import LDA

lda = LDA(k=10, maxIter=20)
blda_model = lda.fit(brdd4_df_w_features)

In [18]:
# Describe topics
import numpy as np

btopics = blda_model.describeTopics(10)

print("The topics described by their top-weighted terms:")

btopics.show(truncate=False)

# Map each topic's term indices back to vocabulary words
bvocabulary = np.array(cv_modelb.vocabulary)
btopic_i = btopics.select("termIndices").rdd.map(lambda r: r[0]).collect()
for i in btopic_i:
    print(bvocabulary[i])

The topics described by their top-weighted terms:
+-----+-----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topic|termIndices                                                |termWeights                                                                                                                                                                                                                     |
+-----+-----------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0    |[4, 70, 19, 80, 170, 134, 75, 25, 35, 21]                  |[0.0

In [None]:
bll = blda_model.logLikelihood(brdd4_df_w_features)
blp = blda_model.logPerplexity(brdd4_df_w_features)

print("The lower bound on the log likelihood of the entire corpus: " + str(bll))
print("The upper bound on the perplexity: " + str(blp))

B.4. Comment on results

Using the same tokenizing process, I trained the model with the same parameters as in part A. Judging by inspection of the top-weighted terms, the topics are more interpretable than in part A. This could be due to the larger number of words in each document. The perplexity computation was started, but it produced no output after running for more than 3 hours.