golsun/utterance_classifier

This repo provides the code to extract "dialogs" from Reddit data and score how "Twitter-like" they are. Twitter data is usually more casual, while Reddit data tends to contain more topic-specific content.

Reddit data

As we don't own the Reddit data, we only provide the scripts to process it.

Steps:

  • Download the raw data from a third party. The file names are in the format <YYYY-MM>.
  • Extract valid submissions (i.e. the "main" posts) and their valid comments with python reddit.py <YYYY-MM> --task=extract
  • Extract valid dialogs from these Reddit posts with python reddit.py <YYYY-MM> --task=conv

Here <YYYY-MM> is the name of the file you want to process. If there are multiple files to process, you can run them all with a bash script, e.g. sh run_all.sh, where run_all.sh looks like this:

python reddit.py 2011-01 --task=extract
python reddit.py 2011-01 --task=conv
python reddit.py 2011-02 --task=extract
python reddit.py 2011-02 --task=conv
...
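The same batch file can be generated programmatically. A minimal Python sketch, assuming the per-month command shape shown above (the month range and the run_all.sh output path are illustrative, not part of the repo):

```python
def batch_commands(start, end):
    """Generate extract/conv commands for every month from start to end (inclusive).

    start and end are (year, month) tuples, matching the <YYYY-MM> file names.
    """
    commands = []
    y, m = start
    while (y, m) <= end:
        stamp = f"{y:04d}-{m:02d}"
        for task in ("extract", "conv"):
            commands.append(f"python reddit.py {stamp} --task={task}")
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return commands

# Write a run_all.sh covering 2011-01 through 2011-02
with open("run_all.sh", "w") as f:
    f.write("\n".join(batch_commands((2011, 1), (2011, 2))) + "\n")
```

This avoids hand-editing the script when the month range changes.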

Twitter/Reddit Classifier

Usage

With the Reddit dialogs extracted above, you can run a trained classifier to score how "Twitter-like" each dialog is.

Steps

  • Collect dialogs in a text file where each line is a tokenized dialog in the format context \t response. If the context has multiple turns, delimit the turns with EOS, for example: hello , how are you ? EOS not bad . how about yourself \t pretty good . (If the file is generated by reddit.py, it already has this format.)
  • Score each dialog with python classifier.py --score_path=<path>. This generates a file <path>.scored where each line has the format context \t response \t score. The score is a number between 0 and 1: 0 means very Reddit-like and 1 means very Twitter-like.
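A minimal sketch of consuming a .scored file based on the line format described above; the helper names and the 0.8 threshold are assumptions for illustration, not part of the repo:

```python
def parse_scored_line(line):
    """Split one .scored line into (context turns, response, score).

    Context turns are delimited by the literal token 'EOS'.
    """
    context, response, score = line.rstrip("\n").split("\t")
    turns = [t.strip() for t in context.split(" EOS ")]
    return turns, response, float(score)

def filter_twitter_like(lines, threshold=0.8):
    """Keep only dialogs scored above a (hypothetical) Twitter-likeness threshold."""
    kept = []
    for line in lines:
        turns, response, score = parse_scored_line(line)
        if score > threshold:
            kept.append((turns, response, score))
    return kept

line = "hello , how are you ? EOS not bad . how about yourself\tpretty good .\t0.93"
turns, response, score = parse_scored_line(line)
# turns -> ['hello , how are you ?', 'not bad . how about yourself']
```

Filtering like this is a common way to use the scores, e.g. to select casual, Twitter-like Reddit dialogs for training data.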

Model

The classifier is trained on a shuffled mix of 5M Twitter and 5M Reddit 2-turn conversations. Accuracy on a balanced test set is ~90%.

The model architecture is shown in the figure below. The context and response are each converted to a fixed-length vector by a stacked GRU; the two vectors are then concatenated and passed through an MLP to produce the final score.
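The architecture can be sketched in NumPy. All dimensions, initializations, and layer counts below are illustrative, not the repo's actual hyperparameters; the sketch only shows the shape of the computation (stacked GRU encoders, concatenation, MLP with a sigmoid output):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_params(d_in, d_h):
    # One GRU layer: update gate (Wz), reset gate (Wr), candidate state (Wh)
    return {k: rng.normal(0, 0.1, (d_h, d_in + d_h)) for k in ("Wz", "Wr", "Wh")}

def gru_layer(seq, p, d_h):
    """Run one GRU layer over a sequence of vectors; return the hidden state at each step."""
    h, outs = np.zeros(d_h), []
    for x in seq:
        xh = np.concatenate([x, h])
        z = sigmoid(p["Wz"] @ xh)                                  # update gate
        r = sigmoid(p["Wr"] @ xh)                                  # reset gate
        h_cand = np.tanh(p["Wh"] @ np.concatenate([x, r * h]))     # candidate state
        h = (1 - z) * h + z * h_cand
        outs.append(h)
    return outs

def stacked_encode(seq, layers, d_h):
    """Stacked GRU: each layer consumes the per-step outputs of the previous one.
    Returns the top layer's final hidden state as the fixed-length vector."""
    for p in layers:
        seq = gru_layer(seq, p, d_h)
    return seq[-1]

def score(context, response, enc_ctx, enc_rsp, mlp, d_h):
    c = stacked_encode(context, enc_ctx, d_h)
    r = stacked_encode(response, enc_rsp, d_h)
    hidden = np.tanh(mlp["W1"] @ np.concatenate([c, r]))
    return sigmoid(mlp["W2"] @ hidden).item()                      # value in (0, 1)

# Toy dimensions: 8-d token embeddings, 16-d hidden state, 2 GRU layers per encoder
D_EMB, D_H = 8, 16
enc_ctx = [gru_params(D_EMB, D_H), gru_params(D_H, D_H)]
enc_rsp = [gru_params(D_EMB, D_H), gru_params(D_H, D_H)]
mlp = {"W1": rng.normal(0, 0.1, (32, 2 * D_H)),
       "W2": rng.normal(0, 0.1, (1, 32))}

context = [rng.normal(size=D_EMB) for _ in range(5)]   # 5 token embeddings
response = [rng.normal(size=D_EMB) for _ in range(3)]  # 3 token embeddings
s = score(context, response, enc_ctx, enc_rsp, mlp, D_H)
```

The two encoders have separate weights here; whether the actual model shares encoder weights between context and response is not stated in this README.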
