Chat corpus repository

This is a collection of chat corpora gathered from various open sources. Every file consists of question-answer pairs: odd-numbered lines are questions and even-numbered lines are the corresponding answers.

I use them to train a chatbot with a seq2seq model.
Theory: http://arxiv.org/abs/1406.1078
Implementation: https://github.com/Marsan-Ma/tf_chatbot_seq2seq_antilm.git
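The odd/even line layout described above can be read into (question, answer) tuples with a few lines of Python. This is only a minimal sketch, not the loader used by the training code: the helper names `pairs_from_lines` and `read_pairs` are hypothetical, and UTF-8 encoding is an assumption about the files.

```python
import gzip


def pairs_from_lines(lines):
    """Group consecutive lines into (question, answer) tuples.

    Line 1 pairs with line 2, line 3 with line 4, and so on.
    A trailing question with no answer line is dropped.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    # zip(it, it) pulls two items from the same iterator on every step.
    return list(zip(stripped, stripped))


def read_pairs(path, encoding="utf-8"):
    """Read a gzipped corpus file into a list of (question, answer) tuples.

    The encoding is an assumption; undecodable bytes are replaced
    rather than raising an error.
    """
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as f:
        return pairs_from_lines(f)


# Example usage (the file must exist locally):
# pairs = read_pairs("twitter_en.txt.gz")
# question, answer = pairs[0]
```

For the larger files it may be preferable to iterate lazily instead of building a full list in memory.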

1. open_subtitles

English movie subtitles parsed from http://opus.lingfil.uu.se/download.php?f=OpenSubtitles/en.tar.gz

2. movie_subtitles_en

Cornell Movie-Dialogs Corpus http://www.mpi-sws.org/~cristian/Cornell_Movie-Dialogs_Corpus.html

3. lyrics_zh

Lyrics from the PTT forum: https://www.ptt.cc/bbs/lyrics/index.html

4. twitter_en

A corpus scraped from Twitter (700k lines), where odd-numbered lines are tweets and even-numbered lines are the tweets that responded to them. You can also scrape your own with my twitter scraper repository.

5. twitter_en big

A larger Twitter corpus (5M lines). The file is split into parts to work around the 100 MB file size limit; concatenate the parts with cat to recover the original gz file:
cat twitter_en_big.txt.gz.part* > twitter_en_big.txt.gz
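On systems without cat, the same concatenation can be done in Python. The sketch below is an illustrative equivalent rather than part of the repository; the function name `join_parts` is hypothetical, and it assumes that sorting the part names alphabetically (partaa, partab, ...) gives the correct order, which is how the shell expands the wildcard.

```python
import glob
import shutil


def join_parts(pattern, output_path):
    """Concatenate all files matching `pattern`, in sorted order, into `output_path`.

    Returns the list of part files that were joined.
    """
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts are never loaded fully into memory.
                shutil.copyfileobj(src, out)
    return parts


# Example usage (run in the directory that holds the part files):
# join_parts("twitter_en_big.txt.gz.part*", "twitter_en_big.txt.gz")
```

The output name must not itself match the pattern, otherwise a rerun would pick up the previously joined file as an extra part.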

Files:
README.md
lyrics_zh.txt.gz
movie_subtitles_en.txt.gz
open_subtitles.txt.gz
twitter_en.txt.gz
twitter_en_big.txt.gz.partaa
twitter_en_big.txt.gz.partab
