# Clustering StackOverflow Q&A

EPFL Big Data Analysis Week 2 Assignment
https://www.coursera.org/learn/scala-spark-big-data/home/info

"The overall goal of this assignment is to implement a distributed k-means algorithm which clusters posts on the popular question-answer platform StackOverflow according to their score. Moreover, this clustering should be executed in parallel for different programming languages, and the results should be compared.

The motivation is as follows: StackOverflow is an important source of documentation. However, different user-provided answers may have very different ratings (based on user votes) based on their perceived value. Therefore, we would like to look at the distribution of questions and their answers. For example, how many highly-rated answers do StackOverflow users post, and how high are their scores? Are there big differences between higher-rated answers and lower-rated ones?"

Data file download link: http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow.csv

**WORK IN PROGRESS**

In [1]:
import time

# Credits to Fahim Sakri 
# Source (https://medium.com/pythonhive/python-decorator-to-measure-the-execution-time-of-methods-fa04cb6bb36d)
# A decorator for timing a Python function
def timeit(method):
    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()
        if 'log_time' in kw:
            name = kw.get('log_name', method.__name__.upper())
            kw['log_time'][name] = int((te - ts) * 1000)
        else:
            print ("%r  %2.2f ms" % (method.__name__, (te - ts) * 1000))
        return result
    return timed

from post import Post

In [2]:
## Setup
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark import SparkConf, SparkContext
import pandas as pd
import numpy as np

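# Executor settings are read when the session starts, so set them on the builder
spark = SparkSession \
    .builder \
    .appName("EPFL Wk2 Assignment") \
    .config("spark.executor.instances", "1") \
    .config("spark.executor.cores", "1") \
    .config("spark.cores.max", "1") \
    .getOrCreate()

# Ship post.py to the workers so they can import it
spark.sparkContext.addPyFile('post.py')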

# Load the CSV into a DataFrame and name its columns
# (post_type_id: 1 = question, 2 = answer)
data = spark.read.csv('/data/epfl-big-data-analysis/stackoverflow.csv', header=False, inferSchema=True) \
    .toDF("post_type_id", "id", "acceptedAnswerId", "parentId", "score", "tag")

#data = pd.read_csv('/data/epfl-big-data-analysis/stackoverflow.csv', 
#                   names=["post_type_id", "id", "acceptedAnswerId", "parentId", "score", "tag"],
#                   dtype={'post_type_id': np.int64, 'score': np.float16})


In [None]:
## Helper methods for clustering

def euclideanDistanceSum(px, py, x, y):
    """Sum of the Euclidean distances between (px, py) and the points (x, y)."""
    dist = np.sqrt(np.square(np.subtract(px, x)) + np.square(np.subtract(py, y)))
    return np.sum(dist)

def findClosest(px, py, x, y):
    """Index of the point in (x, y) that is closest to (px, py)."""
    dist = np.sqrt(np.square(np.subtract(px, x)) + np.square(np.subtract(py, y)))
    return np.argmin(dist)

def averageVectors(ps):
    """Average an iterable of (int, int) vectors component-wise."""
    comp1 = 0
    comp2 = 0
    count = 0
    for item in ps:
        comp1 += item[0]
        comp2 += item[1]
        count += 1
    # int() truncates toward zero, like integer division in the Scala original
    return (int(comp1 / count), int(comp2 / count))


In [3]:
# Confirm the tag attribute is a single word indicating the language
v = data.select('tag').distinct()
print(v.collect())

[Row(tag='C#'), Row(tag='JavaScript'), Row(tag='Perl'), Row(tag=None), Row(tag='C++'), Row(tag='Groovy'), Row(tag='Objective-C'), Row(tag='CSS'), Row(tag='MATLAB'), Row(tag='Haskell'), Row(tag='Scala'), Row(tag='Clojure'), Row(tag='PHP'), Row(tag='Ruby'), Row(tag='Python'), Row(tag='Java')]


In [4]:
# Work on a 10-row sample of the data while developing
posts = spark.sparkContext.parallelize(data.head(10))

### Step 1 Preparation - Group posts by question
First, use `map` to create key-value pairs for each type of post: questions keyed by their own ID, and answers keyed by the ID of their parent question.  
Then use a join to merge the two datasets. The target is an `RDD[(QID, Iterable[(Question, Answer)])]`, where the key is the ID of the question post and the value is a collection of (Question, Answer) tuples.
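
To make the target shape concrete, here is a minimal sketch in plain Python (no Spark) of what an inner join followed by grouping produces. The `SamplePost` tuple, the `group_by_question` helper, and the sample rows are made up for illustration; only the field names follow the columns used in this notebook.

```python
from collections import defaultdict, namedtuple

# Hypothetical in-memory stand-in for the Spark rows
SamplePost = namedtuple("SamplePost", "post_type_id id parentId score tag")

def group_by_question(posts):
    """Inner-join questions with their answers, then group the pairs by question ID."""
    questions = {p.id: p for p in posts if p.post_type_id == 1}
    grouped = defaultdict(list)
    for p in posts:
        # Answers whose parent question is missing are dropped, as in an inner join
        if p.post_type_id == 2 and p.parentId in questions:
            grouped[p.parentId].append((questions[p.parentId], p))
    return dict(grouped)

sample = [
    SamplePost(1, 100, None, 3, "Python"),
    SamplePost(2, 101, 100, 5, None),
    SamplePost(2, 102, 100, 1, None),
    SamplePost(1, 200, None, 0, "Java"),  # no answers, so it does not appear in the result
]

grouped_sample = group_by_question(sample)
for qid, pairs in grouped_sample.items():
    print(qid, [(q.id, a.id) for q, a in pairs])
```

Question 200 has no answers, so it is absent from the grouped result, which mirrors the effect of the inner join.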

In [5]:
questions = posts.filter(lambda p: p.post_type_id == 1).map(lambda p: (p.id, p))
answers = posts.filter(lambda p: p.post_type_id == 2).map(lambda p: (p.parentId, p))
grouped = questions.join(answers).groupByKey() # Use inner join to exclude posts with no answers
print(questions.take(1))
print(answers.take(1))
print(grouped.take(2))

[(27233496, Row(post_type_id=1, id=27233496, acceptedAnswerId=None, parentId=None, score=0, tag='C#'))]
[(5484340, Row(post_type_id=2, id=5494879, acceptedAnswerId=None, parentId=5484340, score=1, tag=None))]
[(5484340, <pyspark.resultiterable.ResultIterable object at 0x7f43298d5780>), (9002525, <pyspark.resultiterable.ResultIterable object at 0x7f43298d5588>)]


### Step 2 Calculate maximum answer score for each question
Produce a set of key-value pairs where the key is the question and the value is the maximum answer score for that question. The output is an `RDD[(Posting, Int)]`.
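
As a small illustration of this step outside Spark, the sketch below reduces the (question, answer score) pairs of one question to a single (question, maximum score) pair. The helper name `max_answer_score` and the string question labels are hypothetical. Answer scores can be negative, so the maximum is taken over the actual values rather than compared against a fixed starting value.

```python
def max_answer_score(pairs):
    """Given (question, answer_score) pairs for one question, return (question, max score)."""
    pairs = list(pairs)
    question = pairs[0][0]  # every pair carries the same question
    return (question, max(score for _, score in pairs))

print(max_answer_score([("Q100", 5), ("Q100", -2), ("Q100", 1)]))
print(max_answer_score([("Q200", -3), ("Q200", -1)]))
```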

In [6]:
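def post_max_scores(iterable):
    """Return (question, highest answer score) for one question's (question, answer) pairs."""
    question = None
    max_score = None  # scores can be negative, so do not start from a fixed value
    for pair in iterable:
        question = pair[0]
        answer_score = pair[1].score
        if max_score is None or answer_score > max_score:
            max_score = answer_score
    # Return only after every answer has been examined
    return (question, max_score)

post_scores = grouped.values().map(post_max_scores)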
print(post_scores.take(2))

[(Row(post_type_id=1, id=5484340, acceptedAnswerId=None, parentId=None, score=0, tag='C#'), 1), (Row(post_type_id=1, id=9002525, acceptedAnswerId=None, parentId=None, score=2, tag='C++'), 4)]


### Step 3 Create vectors for clustering
Prepare the vectors that serve as input for clustering. Each vector has two components:

- The index of the language (in the `langs` list) multiplied by the `langSpread` factor.
- The highest answer score (computed above).

The `langSpread` factor is provided (set to 50000). It ensures that posts about different programming languages are at least 50000 apart under the distance measure provided by the `euclideanDist` function. You will learn later what this distance means and why it is set to this value. The output is an `RDD[(Int, Int)]`.
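
The following sketch, in plain Python, shows how such a vector could be built and why the spread keeps languages apart. The names `LANG_SPREAD`, `LANGS`, `to_vector`, and `euclidean` are illustrative, the language list is shortened to its first five entries, and the scores are invented.

```python
import math

LANG_SPREAD = 50000
LANGS = ["JavaScript", "Java", "PHP", "Python", "C#"]  # shortened list, for illustration

def to_vector(tag, max_score):
    """Map a (language tag, highest answer score) pair to a 2-D point."""
    return (LANGS.index(tag) * LANG_SPREAD, max_score)

def euclidean(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

csharp = to_vector("C#", 1)
python = to_vector("Python", 4)
print(csharp, python)

# Neighbouring languages differ by LANG_SPREAD in the first component,
# so their distance can never be smaller than LANG_SPREAD.
print(euclidean(csharp, python) >= LANG_SPREAD)
```

Because the first component alone differs by a multiple of 50000 between any two languages, the score component can only increase the distance further.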

In [11]:
langSpread = 50000
langs = ["JavaScript", "Java", "PHP", "Python", "C#", "C++", "Ruby", "CSS",
      "Objective-C", "Perl", "Scala", "Haskell", "MATLAB", "Clojure", "Groovy"]

# Map each (question, max_score) pair to (language index * langSpread, max_score)
vectors = post_scores.map(lambda s: (langs.index(s[0].tag) * langSpread, s[1]))
print(vectors.count())
print(vectors.take(1))

2
[(200000, 1)]


In [None]:
# Reference (Scala): https://github.com/seahrh/stackoverflow-spark
#
#   val lines   = sc.textFile("src/main/resources/stackoverflow/stackoverflow.csv")
#   val raw     = rawPostings(lines)
#   val grouped = groupedPostings(raw)
#   val scored  = scoredPostings(grouped)
#   val vectors = vectorPostings(scored)
#
# lines:   the lines from the CSV file as strings
# raw:     the raw Posting entries for each line
# grouped: questions and answers grouped together
# scored:  questions and scores
# vectors: pairs of (language, score) for each question