# Topic Modeling with PySpark

   - **Supported by TingLin**
   - **Kansas State University**

# Table of Contents
  - **Extracting, transforming and selecting features**
    
- **Feature Extractors**
     - [TF-IDF](#TF-IDF)
     - [Word2Vec](#Word2Vec)
     - [CountVectorizer](#CountVectorizer)
- **Feature Transformers**
     - [Tokenizer](#Tokenizer)
     - [StopWordsRemover](#StopWordsRemover)
     - [n-gram](#nn-gram)
     - [Binarizer](#Binarizer)
     - [PCA](#PCA)
     - [PolynomialExpansion](#PolynomialExpansion)
     - [Discrete Cosine Transform (DCT)](#Discrete Cosine Transform)
     - [StringIndexer](#StringIndexer)
     - [IndexToString](#IndexToString)
     - [OneHotEncoder](#OneHotEncoder)
     - [VectorIndexer](#VectorIndexer)
     - [Interaction](#Interaction)
     - [Normalizer](#Normalizer)
     - [StandardScaler](#StandardScaler)
     - [MinMaxScaler](#MinMaxScaler)
     - [MaxAbsScaler](#MaxAbsScaler)
     - [Bucketizer](#Bucketizer)
     - [ElementwiseProduct](#ElementwiseProduct)
     - [SQLTransformer](#SQLTransformer)
     - [VectorAssembler](#VectorAssembler)
     - [QuantileDiscretizer](#QuantileDiscretizer)
- **Feature Selectors**
     - [VectorSlicer](#VectorSlicer)
     - [RFormula](#RFormula)
     - [ChiSqSelector](#ChiSqSelector)
- **Clustering**
     - [LDA](#LDA)
- **LDA Topic Modeling with csv file**
     - [LDA Topic Modeling with csv file](#LDA Topic Modeling with csv file)
- **Visualization**
     - [Visualization](#Visualization)

In [1]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.ml.linalg import Vectors, SparseVector
from pyspark.ml.clustering import LDA, BisectingKMeans
from pyspark.sql.functions import monotonically_increasing_id
import re

In [2]:
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import CountVectorizer, StopWordsRemover
# Note: the RDD-based (mllib) LDA and Vectors imported here shadow the
# DataFrame-based (ml) versions imported in the previous cell.
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vector, Vectors

In [3]:
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)


# Visualization

In [4]:
# Load Data
rawdata = sqlContext.read.load("data/airlines2.csv", format="csv", header=True)
rawdata = rawdata.fillna({'review': ''})                               # Replace nulls with blank string
rawdata = rawdata.withColumn("uid", monotonically_increasing_id())     # Create Unique ID
rawdata = rawdata.withColumn("year_month", rawdata.date.substr(1,7))   # First 7 characters of date (meant as YYYY-MM; dates here are DD-Mon-YY)
 
# Show rawdata (as DataFrame)
rawdata.show(10)

+-----+---------------+---------+--------+------+--------+-----+-----------+--------------------+---+----------+
|   id|        airline|     date|location|rating|   cabin|value|recommended|              review|uid|year_month|
+-----+---------------+---------+--------+------+--------+-----+-----------+--------------------+---+----------+
|10001|Delta Air Lines|21-Jun-14|Thailand|     7| Economy|    4|        YES|Flew Mar 30 NRT t...|  0|   21-Jun-|
|10002|Delta Air Lines|19-Jun-14|     USA|     0| Economy|    2|         NO|Flight 2463 leavi...|  1|   19-Jun-|
|10003|Delta Air Lines|18-Jun-14|     USA|     0| Economy|    1|         NO|Delta Website fro...|  2|   18-Jun-|
|10004|Delta Air Lines|17-Jun-14|     USA|     9|Business|    4|        YES|"I just returned ...|  3|   17-Jun-|
|10005|Delta Air Lines|17-Jun-14| Ecuador|     7| Economy|    3|        YES|"Round-trip fligh...|  4|   17-Jun-|
|10006|Delta Air Lines|17-Jun-14|     USA|     9|Business|    5|        YES|Narita - Bangkok ...
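
Note that `year_month` in the output above holds values such as `21-Jun-`: the `date` column is in DD-Mon-YY form, so its first seven characters are not a year and month. A minimal plain-Python sketch of deriving a `YYYY-MM` key from such a string follows. The helper name `to_year_month` is hypothetical, and the `%d-%b-%y` format is an assumption based on the rows shown; it also assumes the active locale uses English month abbreviations.

```python
from datetime import datetime

def to_year_month(date_str):
    """Convert a DD-Mon-YY string such as '21-Jun-14' to 'YYYY-MM'."""
    parsed = datetime.strptime(date_str, "%d-%b-%y")
    return parsed.strftime("%Y-%m")

print(to_year_month("21-Jun-14"))  # 2014-06
```

In Spark, the same logic could be wrapped in a UDF, in the same way `cleanup_text` is wrapped in the next cell.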

- Only the unique id (`uid`) and the cleaned words are used for topic modeling.

In [5]:
def cleanup_text(record):
    text  = record[8]
    uid   = record[9]
    words = text.split()  
    # Default list of Stopwords
    stopwords_core = ['a', u'about', u'above', u'after', u'again', u'against', u'all', u'am', u'an', u'and', u'any', u'are', u'arent', u'as', u'at', 
    u'be', u'because', u'been', u'before', u'being', u'below', u'between', u'both', u'but', u'by', 
    u'can', 'cant', 'come', u'could', 'couldnt', 
    u'd', u'did', u'didn', u'do', u'does', u'doesnt', u'doing', u'dont', u'down', u'during', 
    u'each', 
    u'few', 'finally', u'for', u'from', u'further', 
    u'had', u'hadnt', u'has', u'hasnt', u'have', u'havent', u'having', u'he', u'her', u'here', u'hers', u'herself', u'him', u'himself', u'his', u'how', 
    u'i', u'if', u'in', u'into', u'is', u'isnt', u'it', u'its', u'itself', 
    u'just', 
    u'll', 
    u'm', u'me', u'might', u'more', u'most', u'must', u'my', u'myself', 
    u'no', u'nor', u'not', u'now', 
    u'o', u'of', u'off', u'on', u'once', u'only', u'or', u'other', u'our', u'ours', u'ourselves', u'out', u'over', u'own', 
    u'r', u're', 
    u's', 'said', u'same', u'she', u'should', u'shouldnt', u'so', u'some', u'such', 
    u't', u'than', u'that', 'thats', u'the', u'their', u'theirs', u'them', u'themselves', u'then', u'there', u'these', u'they', u'this', u'those', u'through', u'to', u'too', 
    u'under', u'until', u'up', 
    u'very', 
    u'was', u'wasnt', u'we', u'were', u'werent', u'what', u'when', u'where', u'which', u'while', u'who', u'whom', u'why', u'will', u'with', u'wont', u'would', 
    u'y', u'you', u'your', u'yours', u'yourself', u'yourselves']
    
    # Custom list of stopwords: add your own here
    stopwords_custom = ['']
    stopwords = stopwords_core + stopwords_custom
    stopwords = set(word.lower() for word in stopwords)    # A set gives fast membership tests
    
    # Strip every non-alphanumeric character from each token
    text_out = [re.sub('[^a-zA-Z0-9]', '', word) for word in words]
    # Lowercase; drop stopwords and tokens shorter than 3 characters
    text_out = [word.lower() for word in text_out if len(word) > 2 and word.lower() not in stopwords]
    return text_out

udf_cleantext = udf(cleanup_text , ArrayType(StringType()))
clean_text = rawdata.withColumn("words", udf_cleantext(struct([rawdata[x] for x in rawdata.columns])))

# tokenizer = Tokenizer(inputCol="description", outputCol="words")
# wordsData = tokenizer.transform(text)

- Split each review into words, clean the words, and add the result to `rawdata` as a new `words` column.
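
The cleaning step can be tried outside Spark on a single string. The sketch below is a simplified stand-in for `cleanup_text`: the name `clean_review` is hypothetical and the stopword set is a tiny illustrative subset of the list above, so results on other text will differ from the full pipeline.

```python
import re

# Tiny illustrative subset of the stopword list used in cleanup_text
STOPWORDS = {"the", "was", "and", "all", "were", "to"}

def clean_review(text, stopwords=STOPWORDS, min_len=3):
    """Split on whitespace, strip non-alphanumerics, lowercase, filter."""
    tokens = [re.sub(r"[^a-zA-Z0-9]", "", word) for word in text.split()]
    return [
        token.lower()
        for token in tokens
        if len(token) >= min_len and token.lower() not in stopwords
    ]

cleaned = clean_review("Flew Mar 30 NRT to BKK. All flights were great.")
print(cleaned)  # ['flew', 'mar', 'nrt', 'bkk', 'flights', 'great']
```

Because punctuation is removed rather than replaced by a space, hyphenated words collapse into one token (for example `on-time` becomes `ontime`).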

In [6]:
# Show first row of clean_text
clean_text.take(1)

[Row(id=u'10001', airline=u'Delta Air Lines', date=u'21-Jun-14', location=u'Thailand', rating=u'7', cabin=u'Economy', value=u'4', recommended=u'YES', review=u'Flew Mar 30 NRT to BKK. All flights were great. Flight was on-time and the in-flight entertainment was great. Apart from the meals - some Thai passengers cannot eat beef so the flight crews tried to ask other passengers who could eat beef and changed the meals around. We feel disappointed with their food services.', uid=0, year_month=u'21-Jun-', words=[u'flew', u'mar', u'nrt', u'bkk', u'flights', u'great', u'flight', u'ontime', u'inflight', u'entertainment', u'great', u'apart', u'meals', u'thai', u'passengers', u'cannot', u'eat', u'beef', u'flight', u'crews', u'tried', u'ask', u'passengers', u'eat', u'beef', u'changed', u'meals', u'around', u'feel', u'disappointed', u'food', u'services'])]

In [7]:
# Term-frequency vectorization with CountVectorizer (vocabulary capped at 1000 terms)
vectorizer = CountVectorizer(inputCol="words", outputCol="Features", vocabSize = 1000)
vectorizer = vectorizer.fit(clean_text)
featurizedData = vectorizer.transform(clean_text)

vocablist = vectorizer.vocabulary
vocab_broadcast = sc.broadcast(vocablist)

idf = IDF(inputCol="Features", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)


In [8]:
rescaledData.take(1)

[Row(id=u'10001', airline=u'Delta Air Lines', date=u'21-Jun-14', location=u'Thailand', rating=u'7', cabin=u'Economy', value=u'4', recommended=u'YES', review=u'Flew Mar 30 NRT to BKK. All flights were great. Flight was on-time and the in-flight entertainment was great. Apart from the meals - some Thai passengers cannot eat beef so the flight crews tried to ask other passengers who could eat beef and changed the meals around. We feel disappointed with their food services.', uid=0, year_month=u'21-Jun-', words=[u'flew', u'mar', u'nrt', u'bkk', u'flights', u'great', u'flight', u'ontime', u'inflight', u'entertainment', u'great', u'apart', u'meals', u'thai', u'passengers', u'cannot', u'eat', u'beef', u'flight', u'crews', u'tried', u'ask', u'passengers', u'eat', u'beef', u'changed', u'meals', u'around', u'feel', u'disappointed', u'food', u'services'], features=SparseVector(1000, {0: 0.4099, 3: 1.0601, 11: 1.2624, 25: 1.3913, 32: 3.4155, 46: 1.8131, 56: 4.3116, 97: 2.3469, 113: 2.5063, 201: 2.

- A new `features` column holding the TF-IDF weights is added to `rescaledData`.
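
To make the weights concrete: Spark's `IDF` uses the smoothed formula `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` the number of documents containing the term, and each TF-IDF weight is the raw count times that value. The sketch below reproduces the arithmetic in plain Python on a toy corpus; the documents are made up for illustration and are not taken from the airline reviews.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned token lists (illustrative data only)
docs = [
    ["flight", "great", "flight"],
    ["flight", "delayed"],
    ["great", "food"],
]

m = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency: log((m + 1) / (df + 1))
idf = {term: math.log((m + 1.0) / (df + 1.0)) for term, df in doc_freq.items()}
# TF-IDF weight = raw term count * IDF
tfidf = [
    {term: count * idf[term] for term, count in Counter(doc).items()}
    for doc in docs
]

for weights in tfidf:
    print(sorted(weights.items()))
```

Terms that occur in fewer documents get a larger IDF, so a rare word counted once can outweigh a common word counted twice, as with indices 0 and 3 in the row shown above.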

In [9]:
from pyspark.mllib.feature import IDF
from pyspark.mllib.linalg import Vectors

# Raw term counts per document, keyed by uid
countVectors = vectorizer.transform(rescaledData).select("uid", "features")
frequencyVectors = countVectors.rdd.map(lambda row: row[1])
# The RDD-based (mllib) IDF works on mllib vectors, so convert each one to dense
frequencyDenseVectors = frequencyVectors.map(lambda vector: Vectors.dense(vector))
idf = IDF().fit(frequencyDenseVectors)
tfidf = idf.transform(frequencyDenseVectors)
# LDA.train expects (document id, vector) pairs; note that every document gets id 1 here
corpus = tfidf.map(lambda x: [1, x]).cache()

In [10]:
countVectors.take(1)

[Row(uid=0, features=SparseVector(1000, {0: 2.0, 3: 1.0, 11: 1.0, 25: 1.0, 32: 2.0, 46: 1.0, 56: 2.0, 97: 1.0, 113: 1.0, 201: 1.0, 213: 1.0, 249: 2.0, 332: 1.0, 346: 1.0, 369: 1.0, 395: 1.0, 490: 1.0, 509: 1.0, 537: 1.0, 621: 2.0, 693: 1.0, 846: 2.0}))]

In [11]:
# Raw term counts for the first document (these are counts, not probabilities)
frequencyVectors.take(1)

[SparseVector(1000, {0: 2.0, 3: 1.0, 11: 1.0, 25: 1.0, 32: 2.0, 46: 1.0, 56: 2.0, 97: 1.0, 113: 1.0, 201: 1.0, 213: 1.0, 249: 2.0, 332: 1.0, 346: 1.0, 369: 1.0, 395: 1.0, 490: 1.0, 509: 1.0, 537: 1.0, 621: 2.0, 693: 1.0, 846: 2.0})]
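
`SparseVector(1000, {...})` records the vector length and only the non-zero entries, while the dense form used for the corpus spells out every position. A minimal sketch of that expansion in plain Python; `sparse_to_dense` is a hypothetical helper, not part of PySpark.

```python
def sparse_to_dense(size, entries):
    """Expand an {index: value} mapping into a full list of length `size`."""
    dense = [0.0] * size
    for index, value in entries.items():
        dense[index] = value
    return dense

dense = sparse_to_dense(6, {0: 2.0, 3: 1.0})
print(dense)  # [2.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

With a 1000-term vocabulary and around twenty distinct words per review, the dense form is mostly zeros.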

In [12]:
corpus.take(1)

[[1,
  DenseVector([0.4099, 0.0, 0.0, 1.0601, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.2624, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.3913, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.4155, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.8131, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.3116, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.3469, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.5063, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0

In [14]:
ldaModel = LDA.train(corpus, k = 15, maxIterations=100, optimizer="online", docConcentration=2.0, topicConcentration=3.0)

- Build a Latent Dirichlet Allocation (LDA) model that clusters the reviews into 15 topics.
- Note: LDA does not perform well with the EMLDAOptimizer, which is used by default. With the EMLDAOptimizer there is a significant bias toward the most popular hashtags, so I used the OnlineLDAOptimizer instead. This optimizer implements the online variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration and updates the term-topic distribution adaptively.
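
The `docConcentration` and `topicConcentration` arguments are the parameters of the Dirichlet priors on the per-document topic mixtures and the per-topic term distributions. As a rough illustration of what such a parameter does, the sketch below draws samples from a symmetric Dirichlet using normalized gamma draws from the standard library. It is not how Spark fits the model, and `sample_symmetric_dirichlet` is a hypothetical helper.

```python
import random

def sample_symmetric_dirichlet(k, concentration, rng):
    """Draw one point from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so repeated runs match
for alpha in (0.1, 2.0, 50.0):
    mixture = sample_symmetric_dirichlet(5, alpha, rng)
    print(alpha, [round(p, 3) for p in mixture])
```

Small concentrations tend to put most of the mass on a few components, while large ones give nearly uniform mixtures, which is worth keeping in mind when tuning the two values passed to `LDA.train`.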

In [15]:
topicIndices = ldaModel.describeTopics(maxTermsPerTopic=5)

- Each topic is described by at most 5 terms.

In [16]:
vocablist = vectorizer.vocabulary

- Retrieve the vocabulary list from the fitted CountVectorizer model.

In [19]:
topicsRDD = sc.parallelize(topicIndices)

In [21]:
import operator
# Pair each topic's term indices with the vocabulary words and their weights
termsRDD = topicsRDD.map(lambda topic: list(zip(operator.itemgetter(*topic[0])(vocablist), topic[1])))
indexedTermsRDD = termsRDD.zipWithIndex()
# Flatten to one (term, probability, topicId) tuple per term
termsRDD = indexedTermsRDD.flatMap(lambda term: [(t[0], t[1], term[1]) for t in term[0]])
termDF = termsRDD.toDF(['term', 'probability', 'topicId'])
rawJson = termDF.toJSON().collect()

In [22]:
termsRDD.take(5)

[(u'boston', 0.007354214882161223, 0),
 (u'april', 0.007152689955148735, 0),
 (u'2014', 0.006619220965408662, 0),
 (u'hnl', 0.004720877709876037, 0),
 (u'phx', 0.0044917018847754325, 0)]

- Each tuple holds a term, its probability within the topic, and the topic number.
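
The cell that builds `termsRDD` looks up each topic's term indices in the vocabulary and flattens the result into `(term, probability, topicId)` tuples. The same transformation can be sketched without Spark; the vocabulary and topic data below are hypothetical stand-ins for `vectorizer.vocabulary` and the output of `describeTopics`.

```python
# Hypothetical stand-ins for vectorizer.vocabulary and describeTopics() output
vocab = ["flight", "seat", "food", "crew", "delay"]
topic_indices = [
    ([0, 4], [0.30, 0.20]),   # topic 0: (term indices, term weights)
    ([2, 3], [0.25, 0.15]),   # topic 1
]

rows = []
for topic_id, (term_ids, weights) in enumerate(topic_indices):
    for term_id, weight in zip(term_ids, weights):
        # Look up the word for each index and tag it with its topic number
        rows.append((vocab[term_id], weight, topic_id))

print(rows)
```

Here `enumerate` plays the role of `zipWithIndex`, and the inner loop corresponds to the `flatMap`.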


In [23]:
from IPython.display import display, HTML

# Join the per-row JSON strings into one comma-separated run of objects
stringJson = ",".join(str(line) for line in rawJson)

- prepare the data and transform it into JSON format.
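
Note that the joined string is a comma-separated run of JSON objects, not a complete JSON document; it only becomes valid once placed inside an array literal. A small sketch with the standard `json` module shows the idea. The rows are illustrative, with rounded values.

```python
import json

# Illustrative rows with rounded probabilities
rows = [
    {"term": "boston", "probability": 0.0074, "topicId": 0},
    {"term": "nov", "probability": 0.0085, "topicId": 1},
]

# One JSON string per row, similar to what toJSON().collect() returns
raw_json = [json.dumps(row) for row in rows]
string_json = ",".join(raw_json)

# Wrapping the joined string in brackets yields a valid JSON array
parsed = json.loads("[" + string_json + "]")
print(len(parsed), parsed[0]["term"])
```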

In [24]:
html_code = """
<!DOCTYPE html>
<meta charset="utf-8">
<style>

circle {
  fill: rgb(31, 119, 180);
  fill-opacity: 0.5;
  stroke: rgb(31, 119, 180);
  stroke-width: 1px;
}

.leaf circle {
  fill: #ff7f0e;
  fill-opacity: 1;
}

text {
  font: 14px sans-serif;
}

</style>
<body>
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js"></script>

<script>

var json = {
 "name": "data",
 "children": [
  {
     "name": "topics",
     "children": [
      %s
     ]
    }
   ]
};

var r = 1500,
    format = d3.format(",d"),
    fill = d3.scale.category20c();

var bubble = d3.layout.pack()
    .sort(null)
    .size([r, r])
    .padding(1.5);

var vis = d3.select("body").append("svg")
    .attr("width", r)
    .attr("height", r)
    .attr("class", "bubble");

  
var node = vis.selectAll("g.node")
    .data(bubble.nodes(classes(json))
    .filter(function(d) { return !d.children; }))
    .enter().append("g")
    .attr("class", "node")
    .attr("transform", function(d) { return "translate(" + d.x + "," + d.y + ")"; });

var color = d3.scale.category20();
  
  node.append("title")
      .text(function(d) { return d.className + ": " + format(d.value); });

  node.append("circle")
      .attr("r", function(d) { return d.r; })
      .style("fill", function(d) {return color(d.topicName);});

var text = node.append("text")
    .attr("text-anchor", "middle")
    .attr("dy", ".3em")
    .text(function(d) { return d.className.substring(0, d.r / 3)});
  
  text.append("tspan")
      .attr("dy", "1.2em")
      .attr("x", 0)
      .text(function(d) {return Math.ceil(d.value * 10000) /10000; });

// Returns a flattened hierarchy containing all leaf nodes under the root.
function classes(root) {
  var classes = [];

  function recurse(term, node) {
    if (node.children) node.children.forEach(function(child) { recurse(node.term, child); });
    else classes.push({topicName: node.topicId, className: node.term, value: node.probability});
  }

  recurse(null, root);
  return {children: classes};
}

</script>""" % stringJson

- Build the HTML page: the JSON records are substituted into a D3 script that draws a bubble chart.
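
The `classes` function in the script walks the nested `children` arrays and collects the leaf records that become bubbles. For readers less familiar with JavaScript, here is a Python sketch of the same traversal; `collect_leaves` is a hypothetical name and the tree holds illustrative values.

```python
def collect_leaves(node):
    """Return every node that has no 'children' key, depth first."""
    if "children" not in node:
        return [node]
    leaves = []
    for child in node["children"]:
        leaves.extend(collect_leaves(child))
    return leaves

# Same nesting as the `json` variable in the script, with illustrative leaves
tree = {"name": "data", "children": [{"name": "topics", "children": [
    {"term": "boston", "probability": 0.0074, "topicId": 0},
    {"term": "nov", "probability": 0.0085, "topicId": 1},
]}]}

for leaf in collect_leaves(tree):
    print(leaf["topicId"], leaf["term"], leaf["probability"])
```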

In [25]:
stringJson

'{"term":"boston","probability":0.007354214882161223,"topicId":0},{"term":"april","probability":0.007152689955148735,"topicId":0},{"term":"2014","probability":0.006619220965408662,"topicId":0},{"term":"hnl","probability":0.004720877709876037,"topicId":0},{"term":"phx","probability":0.0044917018847754325,"topicId":0},{"term":"nov","probability":0.008502848846337343,"topicId":1},{"term":"mexico","probability":0.005195803258617839,"topicId":1},{"term":"delta","probability":0.003894210776073452,"topicId":1},{"term":"seating","probability":0.003837598454003098,"topicId":1},{"term":"everyone","probability":0.0035464023798082197,"topicId":1},{"term":"toilets","probability":0.0058680274366245296,"topicId":2},{"term":"times","probability":0.005663976603001911,"topicId":2},{"term":"years","probability":0.005448340380389646,"topicId":2},{"term":"first","probability":0.005388637136086703,"topicId":2},{"term":"united","probability":0.0053579537855618945,"topicId":2},{"term":"service","probability":

In [26]:
# visualize data using D3JS framework
# Display the html
display(HTML(html_code))

- D3 (Data-Driven Documents or D3.js) is a JavaScript library for visualizing data using web standards. D3 helps you bring data to life using SVG, Canvas and HTML. D3 combines powerful visualization and interaction techniques with a data-driven approach to DOM manipulation, giving you the full capabilities of modern browsers and the freedom to design the right visual interface for your data.
- Download d3.js and put it in the same location as this file.
