# Clustering StackOverflow Q&A

EPFL Big Data Analysis Week 2 Assignment
https://www.coursera.org/learn/scala-spark-big-data/home/info

"The overall goal of this assignment is to implement a distributed k-means algorithm which clusters posts on the popular question-answer platform StackOverflow according to their score. Moreover, this clustering should be executed in parallel for different programming languages, and the results should be compared.

The motivation is as follows: StackOverflow is an important source of documentation. However, different user-provided answers may have very different ratings (based on user votes) based on their perceived value. Therefore, we would like to look at the distribution of questions and their answers. For example, how many highly-rated answers do StackOverflow users post, and how high are their scores? Are there big differences between higher-rated answers and lower-rated ones?"
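
To make the goal concrete, the following is a minimal single-machine sketch of k-means on 2-D points, not the assignment's distributed solution. The function names (`closest`, `kmeans_step`, `kmeans`), the convergence threshold `eta`, and the sample points are all assumptions made for illustration. A distributed version would presumably run the same assign-then-average steps over an RDD.

```python
from collections import defaultdict

def closest(point, means):
    """Index of the mean nearest to point (squared Euclidean distance)."""
    return min(
        range(len(means)),
        key=lambda i: (means[i][0] - point[0]) ** 2 + (means[i][1] - point[1]) ** 2,
    )

def kmeans_step(points, means):
    """One iteration: assign each point to its nearest mean, then average each cluster."""
    clusters = defaultdict(list)
    for p in points:
        clusters[closest(p, means)].append(p)
    new_means = list(means)  # a mean with no assigned points keeps its position
    for i, members in clusters.items():
        n = len(members)
        new_means[i] = (sum(x for x, _ in members) / n,
                        sum(y for _, y in members) / n)
    return new_means

def kmeans(points, means, max_iter=20, eta=1e-9):
    """Repeat kmeans_step until the means move less than eta (hypothetical threshold)."""
    for _ in range(max_iter):
        new_means = kmeans_step(points, means)
        shift = sum((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                    for a, b in zip(means, new_means))
        means = new_means
        if shift < eta:
            break
    return means

# Made-up points: two low scores and two high scores for one language (x = 0)
points = [(0, 1), (0, 3), (0, 100), (0, 102)]
final_means = kmeans(points, [(0, 0), (0, 50)])
```

With these made-up points the two means settle on the low-score group and the high-score group respectively.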

Data file download link: http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow.csv

In [1]:
import time

# Credits to Fahim Sakri
# Source: https://medium.com/pythonhive/python-decorator-to-measure-the-execution-time-of-methods-fa04cb6bb36d
# A decorator that measures the wall-clock execution time of a Python function
def timeit(method):
    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()
        if 'log_time' in kw:
            # Store the elapsed milliseconds in the caller-supplied dict
            name = kw.get('log_name', method.__name__.upper())
            kw['log_time'][name] = int((te - ts) * 1000)
        else:
            print("%r  %2.2f ms" % (method.__name__, (te - ts) * 1000))
        return result
    return timed

from post import Post

In [2]:
## Setup
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark import SparkConf, SparkContext
import pandas as pd
import numpy as np

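# Executor settings are read when the session starts, so pass them to the
# builder instead of calling spark.conf.set() afterwards.
spark = SparkSession \
    .builder \
    .appName("EPFL Wk2 Assignment") \
    .config("spark.executor.instances", "1") \
    .config("spark.executor.cores", "1") \
    .config("spark.cores.max", "1") \
    .getOrCreate()

# Ship the local post.py module to the executors
spark.sparkContext.addPyFile('post.py')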

# Load the CSV into a DataFrame and name its columns
data = spark.read.csv('/data/epfl-big-data-analysis/stackoverflow.csv', header=False, inferSchema=True)
data = data.toDF(
    "post_type_id",      # type 1 = question, type 2 = answer
    "id",
    "acceptedAnswerId",
    "parentId",
    "score",
    "tag",
)

#data = pd.read_csv('/data/epfl-big-data-analysis/stackoverflow.csv', 
#                   names=["post_type_id", "id", "acceptedAnswerId", "parentId", "score", "tag"],
#                   dtype={'post_type_id': np.int64, 'score': np.float16})


In [3]:
posts = data.rdd

In [6]:
# Preparation - build RDD[(QID, Iterable[(Question, Answer)])]
#def groupedPostings(data):
# Key questions by their own id and answers by parentId (the id of the question they answer)
questions = posts.filter(lambda p: p.post_type_id == 1).map(lambda p: (p.id, p))
answers = posts.filter(lambda p: p.post_type_id == 2).map(lambda p: (p.parentId, p))

# Peek at one keyed question
questions.take(1)



[(27233496, Row(post_type_id=1, id=27233496, acceptedAnswerId=None, parentId=None, score=0, tag='C#'))]
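
The grouping step can be sketched as follows. On RDDs keyed by question id, something like `questions.join(answers).groupByKey()` would be one way to do it. The version below is a plain-Python equivalent that runs without a cluster. `PostRow`, `grouped_postings`, and the sample posts are hypothetical stand-ins, not part of the notebook or of `post.py`.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the Row objects held in the posts RDD
PostRow = namedtuple("PostRow", "post_type_id id acceptedAnswerId parentId score tag")

def grouped_postings(posts):
    """Return {question_id: [(question, answer), ...]} for questions that have answers."""
    questions = {p.id: p for p in posts if p.post_type_id == 1}
    grouped = defaultdict(list)
    for p in posts:
        # An answer joins to its question through parentId (inner join semantics)
        if p.post_type_id == 2 and p.parentId in questions:
            grouped[p.parentId].append((questions[p.parentId], p))
    return dict(grouped)

# Made-up sample data for illustration only
sample = [
    PostRow(1, 10, None, None, 5, "Python"),
    PostRow(2, 11, None, 10, 3, None),
    PostRow(2, 12, None, 10, 8, None),
    PostRow(1, 20, None, None, 0, "Scala"),  # unanswered, so the join drops it
]
grouped = grouped_postings(sample)
```

As with an inner join, questions without any answer do not appear in the result.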


In [None]:
https://github.com/seahrh/stackoverflow-spark

val lines   = sc.textFile("src/main/resources/stackoverflow/stackoverflow.csv")
val raw     = rawPostings(lines)
val grouped = groupedPostings(raw)
val scored  = scoredPostings(grouped)
val vectors = vectorPostings(scored)

lines: the lines from the CSV file as strings
raw: the raw Posting entries for each line
grouped: questions and answers grouped together
scored: questions and scores
vectors: pairs of (language, score) for each question
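
The last two steps could look roughly like this in plain Python. This is a sketch under assumptions the notes above do not spell out: a question's score is taken to be the highest score among its answers, and the language is encoded as its index in a list multiplied by a spacing factor. `LANGS`, `LANG_SPREAD`, both function names, and the sample data are hypothetical.

```python
# Assumed language list and spacing factor; adjust to match the assignment
LANGS = ["JavaScript", "Java", "PHP", "Python", "C#"]
LANG_SPREAD = 50000

def scored_postings(grouped):
    """Pair each question with the highest score among its answers."""
    return [
        (pairs[0][0], max(answer["score"] for _, answer in pairs))
        for pairs in grouped.values()
    ]

def vector_postings(scored):
    """Turn (question, high_score) into (language_index * LANG_SPREAD, high_score)."""
    return [
        (LANGS.index(question["tag"]) * LANG_SPREAD, high_score)
        for question, high_score in scored
        if question["tag"] in LANGS  # skip questions with an unknown or missing tag
    ]

# Made-up example: one Python question with two answers scored 3 and 8
q = {"id": 10, "tag": "Python", "score": 5}
grouped = {10: [(q, {"id": 11, "score": 3}), (q, {"id": 12, "score": 8})]}
vectors = vector_postings(scored_postings(grouped))
```

Spacing the language indices far apart keeps points from different languages from landing in the same cluster when distances are computed on both coordinates.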