This repo provides the code to extract "dialogs" from Reddit data and score how "Twitter-like" each dialog is. Twitter data is usually more casual, while Reddit data usually contains more specific content.
As we don't own the Reddit data, we only provide the script here to process the data.
Steps:
- Download the raw data from a third party. The file names are in the format `<YYYY-MM>`.
- Extract valid submissions (i.e. the "main" posts) and their valid comments by `python reddit.py <YYYY-MM> --task=extract`
- Extract valid dialogs from these Reddit posts by `python reddit.py <YYYY-MM> --task=conv`
Here `<YYYY-MM>` stands for the name of the file you want to process. If there are multiple files you want to process, you can do so with a bash script via `sh run_all.sh`, where `run_all.sh` looks like this:

```
python reddit.py 2011-01 --task=extract
python reddit.py 2011-01 --task=conv
python reddit.py 2011-02 --task=extract
python reddit.py 2011-02 --task=conv
...
```
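Instead of listing every command, `run_all.sh` can be sketched as a loop. The month list here is an example, and `echo` stands in for the real invocation as a dry run; replace it with the actual call once `reddit.py` and the raw dumps are in place:

```shell
#!/bin/sh
# run_all.sh (sketch): loop over months and tasks instead of listing each line.
# Month list is an example; echo makes this a dry run.
for ym in 2011-01 2011-02 2011-03; do
    for task in extract conv; do
        echo "python reddit.py $ym --task=$task"
    done
done
```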
With the Reddit dialogs extracted above, you can run a trained classifier to score how "Twitter-like" each dialog is.

Steps:
- Collect the dialogs in a text file where each line is a tokenized dialog in the format `context \t response`. If the context has multiple turns, the turns should be delimited by ` EOS `. For example: `hello , how are you ? EOS not bad . how about yourself \t pretty good .` (if the file is generated by `reddit.py`, it already has this format)
- Score each dialog by `python classifier.py --score_path=<path>`. This will generate a file `<path>.scored` where each line has the format `context \t response \t score`, and `score` is a number in the range 0 to 1: 0 means very Reddit-like and 1 means very Twitter-like.
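A scored line can be split back into its parts following the format above. This is a sketch; `parse_scored_line` is a hypothetical helper (not part of the repo), and the score value in the example is made up:

```python
def parse_scored_line(line):
    """Split one `.scored` line into (context turns, response, score).

    Assumes the format described above: context \t response \t score,
    with multi-turn contexts delimited by EOS.
    """
    context, response, score = line.rstrip("\n").split("\t")
    turns = [t.strip() for t in context.split("EOS")]
    return turns, response.strip(), float(score)

# Example line (score 0.83 is invented for illustration):
line = "hello , how are you ? EOS not bad . how about yourself\tpretty good .\t0.83"
turns, response, score = parse_scored_line(line)
print(turns)     # ['hello , how are you ?', 'not bad . how about yourself']
print(response)  # pretty good .
print(score)     # 0.83
```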
This classifier was trained on a shuffled mix of 5M Twitter and 5M Reddit 2-turn conversations. Accuracy on a balanced test set is ~90%.
The model architecture is shown in the figure below. The context and the response are separately converted to fixed-length vectors by stacked GRUs; the two vectors are then concatenated and passed through an MLP to produce the final score.
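The architecture can be sketched in NumPy. Everything here is an assumption for illustration: the dimensions, the single-layer GRU (the real model stacks several), and the random, untrained weights; the point is only the data flow: encode context and response separately, concatenate, then score with an MLP.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell; random weights, for illustration only."""
    def __init__(self, d_in, d_h, rng):
        self.Wz = rng.normal(0, 0.1, (d_h, d_in + d_h))  # update gate
        self.Wr = rng.normal(0, 0.1, (d_h, d_in + d_h))  # reset gate
        self.Wh = rng.normal(0, 0.1, (d_h, d_in + d_h))  # candidate state

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def encode(cell, seq, d_h):
    """Run the GRU over a token-embedding sequence; final state is the fixed-length vector."""
    h = np.zeros(d_h)
    for x in seq:
        h = cell.step(x, h)
    return h

rng = np.random.default_rng(0)
d_emb, d_h = 8, 16
ctx_cell = GRUCell(d_emb, d_h, rng)
rsp_cell = GRUCell(d_emb, d_h, rng)

# Toy inputs: sequences of token embeddings for context and response.
ctx = rng.normal(size=(5, d_emb))
rsp = rng.normal(size=(3, d_emb))
v = np.concatenate([encode(ctx_cell, ctx, d_h), encode(rsp_cell, rsp, d_h)])

# MLP head: one hidden layer, sigmoid output so the score lands in (0, 1).
W1 = rng.normal(0, 0.1, (32, 2 * d_h))
W2 = rng.normal(0, 0.1, (1, 32))
score = sigmoid(W2 @ np.tanh(W1 @ v))[0]
print(0.0 < score < 1.0)  # True
```

The sigmoid output guarantees a score in (0, 1), matching the 0-to-1 range described above.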