This repo provides the code to extract "dialogs" from Reddit data and score how "Twitter-like" each dialog is. Twitter data is usually more casual, while Reddit data usually contains more specific content.
As we don't own the Reddit data, we only provide the script here to process the data.
Steps:
- Download the raw data from a third party. The file names are in the format `<YYYY-MM>`.
- Extract valid submissions (i.e. the "main" posts) and their valid comments by `python reddit.py <YYYY-MM> --task=extract`
- Extract valid dialogs from these Reddit posts by `python reddit.py <YYYY-MM> --task=conv`
Here `<YYYY-MM>` stands for the name of the file you want to process. If there are multiple files you want to process, you can do so with a bash script via `sh run_all.sh`, where `run_all.sh` looks like this:

```
python reddit.py 2011-01 --task=extract
python reddit.py 2011-01 --task=conv
python reddit.py 2011-02 --task=extract
python reddit.py 2011-02 --task=conv
...
```
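Instead of listing every command, `run_all.sh` can be sketched as a loop. The month list here is an example, and `echo` stands in for the real invocation as a dry run; replace it with the actual call once `reddit.py` and the raw dumps are in place:

```shell
#!/bin/sh
# run_all.sh (sketch): loop over months and tasks instead of listing each line.
# Month list is an example; echo makes this a dry run.
for ym in 2011-01 2011-02 2011-03; do
    for task in extract conv; do
        echo "python reddit.py $ym --task=$task"
    done
done
```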
With the Reddit dialogs extracted above, you can run a trained classifier to score how "Twitter-like" each dialog is.

Steps:
- Collect the dialogs in a text file where each line is a tokenized dialog in the format `context \t response`. If the context has multiple turns, the turns should be delimited by ` EOS `. For example: `hello , how are you ? EOS not bad . how about yourself \t pretty good .` (if the file is generated by `reddit.py`, it already has this format)
- Score each dialog by `python classifier.py --score_path=<path>`. This will generate a file `<path>.scored` where each line has the format `context \t response \t score`, and `score` is a number in the range 0 to 1: 0 means very Reddit-like and 1 means very Twitter-like.
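A scored line can be split back into its parts following the format above. This is a sketch; `parse_scored_line` is a hypothetical helper (not part of the repo), and the score value in the example is made up:

```python
def parse_scored_line(line):
    """Split one `.scored` line into (context turns, response, score).

    Assumes the format described above: context \t response \t score,
    with multi-turn contexts delimited by EOS.
    """
    context, response, score = line.rstrip("\n").split("\t")
    turns = [t.strip() for t in context.split("EOS")]
    return turns, response.strip(), float(score)

# Example line (score 0.83 is invented for illustration):
line = "hello , how are you ? EOS not bad . how about yourself\tpretty good .\t0.83"
turns, response, score = parse_scored_line(line)
print(turns)     # ['hello , how are you ?', 'not bad . how about yourself']
print(response)  # pretty good .
print(score)     # 0.83
```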
This classifier was trained on a shuffled mix of 5M Twitter and 5M Reddit 2-turn conversations. Accuracy on a balanced test set is ~90%.
The model architecture is shown in the figure below. The context and the response are separately converted to fixed-length vectors by stacked GRUs; the two vectors are then concatenated and passed through an MLP to produce the final score.
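The architecture can be sketched in NumPy. Everything here is an assumption for illustration: the dimensions, the single-layer GRU (the real model stacks several), and the random, untrained weights; the point is only the data flow: encode context and response separately, concatenate, then score with an MLP.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell; random weights, for illustration only."""
    def __init__(self, d_in, d_h, rng):
        self.Wz = rng.normal(0, 0.1, (d_h, d_in + d_h))  # update gate
        self.Wr = rng.normal(0, 0.1, (d_h, d_in + d_h))  # reset gate
        self.Wh = rng.normal(0, 0.1, (d_h, d_in + d_h))  # candidate state

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def encode(cell, seq, d_h):
    """Run the GRU over a token-embedding sequence; final state is the fixed-length vector."""
    h = np.zeros(d_h)
    for x in seq:
        h = cell.step(x, h)
    return h

rng = np.random.default_rng(0)
d_emb, d_h = 8, 16
ctx_cell = GRUCell(d_emb, d_h, rng)
rsp_cell = GRUCell(d_emb, d_h, rng)

# Toy inputs: sequences of token embeddings for context and response.
ctx = rng.normal(size=(5, d_emb))
rsp = rng.normal(size=(3, d_emb))
v = np.concatenate([encode(ctx_cell, ctx, d_h), encode(rsp_cell, rsp, d_h)])

# MLP head: one hidden layer, sigmoid output so the score lands in (0, 1).
W1 = rng.normal(0, 0.1, (32, 2 * d_h))
W2 = rng.normal(0, 0.1, (1, 32))
score = sigmoid(W2 @ np.tanh(W1 @ v))[0]
print(0.0 < score < 1.0)  # True
```

The sigmoid output guarantees a score in (0, 1), matching the 0-to-1 range described above.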