# Midterm Report: Question Predictor for StackOverflow

## Overview

Stack Overflow is the largest online Q&A platform where programmers ask questions, answer them, and learn new techniques. By the end of August 2015, more than 10,000,000 questions had been posted on Stack Overflow. As the number of questions grows, more and more of them focus on the same topic or feature, or effectively have the same answer. We define these as duplicate questions. Duplicate questions waste machine resources and make askers wait a long time for an answer that already exists. Currently, Stack Overflow encourages users to manually mark potential duplicate questions, which is inefficient and may introduce bias. Here is an example of duplicate questions:

![](image01.jpg)

Stack Overflow **does not** have a standard method or label to distinguish duplicate questions from normal questions. However, Stack Overflow users follow common conventions to flag possible duplicate questions. We use three conditions to decide whether a question is a duplicate:

1. Check whether **[duplicate]** appears in the title.
2. Check whether **This question already has an answer** appears in the body of the question.
3. Check whether **Possible duplicate** appears in the body of the question.

A question satisfying any of the aforementioned conditions is classified as a duplicate question.
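
The three conditions can be sketched as a single predicate. This is a minimal illustration rather than the extraction script used for this report: the function name `is_duplicate` and the two marker constants are hypothetical, and the check is assumed to be a plain case-sensitive substring match.

```python
# Hypothetical helper illustrating the three duplicate-question conditions.
TITLE_MARKER = "[duplicate]"
BODY_MARKERS = (
    "This question already has an answer",
    "Possible duplicate",
)


def is_duplicate(title, body):
    """Return True if the title or the body carries one of the markers."""
    if TITLE_MARKER in (title or ""):
        return True
    body = body or ""
    return any(marker in body for marker in BODY_MARKERS)
```

A question whose title ends in `[duplicate]` is flagged regardless of its body, while a question with neither marker is treated as a normal question.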

## Data Preparation

We obtained our data from the [Stack Exchange data dump](https://archive.org/details/stackexchange).


The original data is in XML format. We keep only questions (PostTypeId=1) and remove invalid questions (Id=-1). The code below extracts the useful information from the data. For duplicate questions, we also extract the ids of the existing questions for further analysis.

#### Basic statistics of the data

Total number of records: 32209819

Total number of valid questions: 12350818

Total number of duplicate questions: 48197


In [2]:
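import re
import sys
import json

# Matches XML attributes of the form name="value" in each <row ... /> line.
regex = re.compile(r'([a-z0-9]+)="([^"]+)"', re.I)
# Captures the id of the existing question linked from a duplicate's body.
dup_id = re.compile(r"stackoverflow\.com/questions/(\d+)/")
# Field separator between the id and the JSON payload in the output files.
SEP = "wcyz666SQL"

with open(sys.argv[1], "r") as fp, \
        open("all_questions", "w") as fp_all, \
        open("dup_questions", "w") as fp_dup:
    for line in fp:
        post = dict(regex.findall(line.strip()))
        # Keep only valid questions: PostTypeId=1 and an Id other than -1.
        if post.get("PostTypeId") != "1" or post.get("Id", "-1") == "-1":
            continue

        body = post.get("Body", "")
        if "has already been answered" in body or "[duplicate]" in post.get("Title", ""):
            # Record the ids of the existing questions this one points to.
            post["dups"] = [int(num) for num in dup_id.findall(body)]
            fp_dup.write(post["Id"] + SEP + json.dumps(post) + "\n")

        fp_all.write(post["Id"] + SEP + json.dumps(post) + "\n")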


Then we load the data into a **MySQL database** for future retrieval.

CREATE TABLE `all_questions` (`id` bigint(64), `text` text charset 'utf8mb4' collate utf8mb4_unicode_ci) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

CREATE TABLE `dup_questions` (`id` bigint(64), `text` text charset 'utf8mb4' collate utf8mb4_unicode_ci) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

load data local infile '/home/ubuntu/all_questions' into table all_questions character set 'utf8mb4' fields terminated by "wcyz666SQL";

load data local infile '/home/ubuntu/dup_questions' into table dup_questions character set 'utf8mb4' fields terminated by "wcyz666SQL";

create INDEX pds_index1 on all_questions(id);

create INDEX pds_index2 on dup_questions(id);
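
Each line written by the extraction script has the form `<id>wcyz666SQL<json>`, so a stored record can be turned back into a dictionary by splitting on the separator. The following is a minimal sketch under that assumption; `parse_record` is a hypothetical helper name, and the sample line is made up for illustration.

```python
import json

SEPARATOR = "wcyz666SQL"  # same field separator as in the LOAD DATA statements


def parse_record(line):
    """Split one '<id><SEPARATOR><json>' line into (id, post dict)."""
    post_id, _, payload = line.rstrip("\n").partition(SEPARATOR)
    return int(post_id), json.loads(payload)


if __name__ == "__main__":
    # Made-up record, for illustration only.
    sample = '42wcyz666SQL{"Id": "42", "Title": "Example [duplicate]", "dups": [7]}\n'
    print(parse_record(sample))
```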




## Feature Build

To evaluate the similarity between two questions, we use several indicators as features, including the title, the question body, and the question's topic tags. In addition, we apply an LDA model to measure the topical similarity between questions.


### Preprocess

We preprocess the text of each question title and body by removing punctuation, tokenizing, removing stop words, and stemming.

In [1]:
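import re
import nltk

def tokenize_with_preprocess(text):
    """
    Tokenize with preprocessing: lowercase the text, remove all punctuation,
    drop stop words and apply stemming on the remaining tokens.
    `__stemmer`, `stop` and `_punc_pattern` are module-level helpers not shown here.
    :param text: String
    :return: list of tokens
    """
    tokens = nltk.word_tokenize(re.sub(_punc_pattern, '', text.lower()))
    # Build the list explicitly so the result is a list on Python 3 as well.
    return [__stemmer.stem(w) for w in tokens if w not in stop]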


### Text similarity
We try both cosine similarity and Jaccard similarity as indicators, computed on TF-IDF vectors and token sets respectively.

In [2]:
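import nltk
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import TfidfVectorizer

def jaccard_similarity(text1, text2):
    """
    :param text1, text2: list of tokens
    :return: float
    """
    return 1.0 - nltk.jaccard_distance(set(text1), set(text2))

def cosine_similarity(text1, text2):
    """
    :param text1, text2: list of tokens
    :return: float
    """
    # _SimilarityUtil._inverse (not shown here) is expected to turn each token
    # list back into a string, since TfidfVectorizer takes raw text.
    docs = [_SimilarityUtil._inverse(tokens) for tokens in (text1, text2)]
    tfidf = TfidfVectorizer().fit_transform(docs)
    # scipy's cosine() expects 1-D arrays, so flatten each sparse row.
    return 1 - cosine(tfidf[0].toarray().ravel(), tfidf[1].toarray().ravel())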
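
To make the two measures concrete, here is a dependency-free sketch of both. It is a simplified stand-in for the library-based functions in this section, not a reimplementation of them: the names `jaccard` and `tfidf_cosine` are hypothetical, the TF-IDF weights use a smoothed IDF fitted on just the two texts being compared, and returning 1.0 for two empty texts is a convention chosen here.

```python
import math
from collections import Counter


def jaccard(tokens1, tokens2):
    """Jaccard similarity of two token lists, computed on their token sets."""
    s1, s2 = set(tokens1), set(tokens2)
    if not s1 and not s2:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(s1 & s2) / float(len(s1 | s2))


def tfidf_cosine(tokens1, tokens2):
    """Cosine similarity of TF-IDF vectors fitted on just these two texts."""
    docs = [Counter(tokens1), Counter(tokens2)]
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed IDF: ln((1 + n_docs) / (1 + doc_freq)) + 1, with n_docs = 2.
    idf = {t: math.log(3.0 / (1 + sum(t in d for d in docs))) + 1.0 for t in vocab}
    v1 = [docs[0][t] * idf[t] for t in vocab]
    v2 = [docs[1][t] * idf[t] for t in vocab]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0
```

Jaccard ignores how often a token occurs, while the TF-IDF cosine weights repeated and rarer tokens more heavily, which is why the two are kept as separate features.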

### LDA model
We fit an LDA model on the training set to model the topics across questions. During evaluation, we use the pretrained LDA model to transform each text into a distribution over all topics (a vector of length topic_size). The feature is the Euclidean distance between the two distribution vectors.

In [3]:
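import itertools
import numpy as np
import lda

class LDA(object):
    def __init__(self, corpus, load=False, n_topic=3):
        # Vocabulary: every distinct token in the corpus.
        self.words = np.array(list(set(itertools.chain(*corpus))))
        # _transform (not shown here) maps a token list to a vector over self.words.
        X = np.array([self._transform(tokens) for tokens in corpus])
        self._model = lda.LDA(n_topics=n_topic, random_state=0, n_iter=100)
        self._model.fit(X)

    def get_topic(self, tokens):
        # Topic distribution (vector of length n_topic) for one tokenized question.
        return self._model.transform(self._transform(tokens))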
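
The LDA feature itself is the Euclidean distance between the topic distributions of two questions. A minimal sketch of that last step, assuming each distribution is a plain sequence of floats of equal length; `topic_distance` is a hypothetical name and the example vectors are made up.

```python
import math


def topic_distance(dist1, dist2):
    """Euclidean distance between two topic distributions of equal length."""
    if len(dist1) != len(dist2):
        raise ValueError("topic distributions must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(dist1, dist2)))


if __name__ == "__main__":
    # Made-up distributions over 3 topics, for illustration only.
    q1 = [0.7, 0.2, 0.1]
    q2 = [0.1, 0.2, 0.7]
    print(topic_distance(q1, q2))
```

A distance of 0 means the two questions have the same topic mix; larger values mean the questions are dominated by different topics.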

## Train
We split the question set into a training set (70%) and a test set (30%).
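
A 70/30 split can be sketched as follows. This is an illustration only: the report does not specify whether the questions are shuffled or which seed is used, so the shuffling, the `seed` parameter, and the function name `train_test_split` are all assumptions.

```python
import random


def train_test_split(items, train_ratio=0.7, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]
```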

## Next Step

We will train our model on the current dataset and try to come up with an efficient duplicate question indicator.

## Reference

Zhang Y, Lo D, Xia X, et al. Multi-factor duplicate question detection in Stack Overflow. Journal of Computer Science and Technology, 30(5): 981–997, Sept. 2015. DOI 10.1007/s11390-015-1576-4