Zamia Brain

The Zamia Brain project provides infrastructure useful to create natural language processing systems based on transformer networks (see https://arxiv.org/abs/1706.03762).

This project is still highly experimental; everything is subject to change without prior notice. The current approach is to generate training corpora for pre-training as well as for (multi-)domain refinement. The goal is to train networks whose natural language processing capabilities are very robust (pre-training), i.e. free of the brittleness of traditional rule-based systems, while still allowing a certain amount of control over their behavior (refinement).

To this end, the project provides these components:

Scripts that generate pre-training corpora, typically using web scraping techniques, as well as scripts that adapt scientific corpora for training (https://github.com/gooofy/zbrain)
Scripts that generate corpora from patterns ("skills") for refinement (https://github.com/gooofy/zbrain)
A GPT-2 implementation along with tokenization, training and inference tools (https://github.com/gooofy/transformer-lm)
A TransformerXL implementation along with tokenization, training and inference tools (https://github.com/gooofy/transformer-xl)
Pre-trained models (https://goofy.zamia.org/zamia-speech/brain/)

Corpora

Twitter

You will need to provide a list of accounts as input:

./twitterscrape.py -l de -s 2019-08-01 twitter_de_201908 -U user_stats_de.json twitter_sources_de.txt

Heise

Important: adapt the hard-coded paths first!

./qa_extract_heise.py

Parole

Important: adapt the hard-coded paths first!

./qa_extract_parole.py

Wikipedia

Important: adapt the hard-coded paths first!

./qa_extract_wikipedia.py -l de

Export for pretraining

This will work for GPT-2 as well as TransformerXL:

./qa_export_transformer-lm.py -o base_de heise parole twitter_de_2010 twitter_de_201907 wikipedia_de

Next, encode the corpus using a sentencepiece tokenization model and run the pretraining.

Extract skills from Zamia AI

./qa_extract_skills.py -l de skill_personal_de personal.xml
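
Skill corpora are generated from patterns. The following is only a minimal sketch of that idea, not the code used by qa_extract_skills.py: the "(a|b)" alternation syntax and the expand() helper are assumptions made for illustration, and the real skill format may differ.

```python
import itertools
import re

# Hypothetical pattern syntax for illustration: "(a|b)" marks alternatives.
GROUP = re.compile(r"\(([^()]*)\)")

def expand(pattern):
    """Return every sentence obtained by picking one alternative per group."""
    # re.split with one capture group alternates literal text and group bodies,
    # so odd indices hold the "a|b" bodies and even indices hold literal text.
    parts = GROUP.split(pattern)
    choices = [part.split("|") if i % 2 else [part] for i, part in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]

samples = expand("(hi|hello) computer, how (are you|is it going)?")
for sample in samples:
    print(sample)
```

Each pattern yields one training sample per combination of alternatives, so a handful of patterns can produce a sizeable refinement corpus.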

QA finetuning

The -q option is important here: it includes dialog samples in the export.

./qa_export_transformer-lm.py -q -o qa_de twitter_de_2010 twitter_de_201907 skill_personal_de

Next, encode the corpus using a sentencepiece tokenization model and continue the training for finetuning.

TODO: KB encoding / indexing / lookup

FIXME

Architecture

This is just a sketch of what a full QA chat bot with associative memory could look like:

                        +------------------+                                                        +------------------------------+
                        | Dialog Ctx     1 |                                                        |       Knowledge Base         |
                        | Dialog Ctx     2 |           +--------------------+                       +------------------------------+
                        | ...              |---------> | DeepNN_Context_Vec |                       | 0.1 0.2 0.01 ... | KB Line 1 |
                        | Dialog Ctx     n |           +--------------------+                       | ...              |           |
                        +------------------+                     |                                  | 0.33 0.1 0.5 ... | KB Line m |
                        | Current Input    |                     |                                  +------------------------------+
                        +------------------+                     |                                                 |
                                 |                               |                                                 |
                                 |                               |        +----------------------------------------+
                                 |                               |        |
                                 |                               |        |
                                 |                               v        v
                                 |                     +---------------------------+
                                 |                     | Nearest Neighbour search  |
                                 |                     +---------------------------+
                                 |                                   |
                                 |                                   |
                                 |                                   v
                                 |                  +--------------------------------+
                                 |                  |         KB Context Lines       |
                                 |                  +--------------------------------+
                                 |                  | 0.4 0.2 0.9 ... Info    line 1 |
                                 |                  | ...                            |
                                 |                  | 0.8 0.2 0.4 ... Info    line k |
                                 |                  +--------------------------------+
                                 |                                   |
                                 |                                   |
                                 |      +----------------------------+
                                 |      |
                                 |      |
                                 v      v
                          +-----------------------+
                          | Info    Line 1        |
                          | ...                   |
                          | Info    Line k        |
                          +-----------------------+
                          | Ctx     Line 1        |
                          | ...                   |
                          | Ctx     Line n        |
                          +-----------------------+
                          | Current Input         |
                          +-----------------------+
                                     |
                                     |
                                     v
                              +-------------+
                              |  DeepNN_QA  |
                              +-------------+
                                     |
                                     |
                              +-------------+
                              |   Response  |
                              +-------------+


[ Knowledge + Dialog History + Current Input ] -> [ Response ]
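
The retrieval and input-assembly steps in the sketch above could look roughly like this. This is a toy illustration under stated assumptions: the function names, the use of cosine similarity, and the brute-force search are all made up for this example, and the hand-written vectors stand in for whatever DeepNN_Context_Vec would produce.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_kb_lines(context_vec, kb, k):
    """Return the k KB lines whose vectors are most similar to context_vec.

    kb is a list of (vector, text) pairs, mirroring the Knowledge Base table.
    """
    ranked = sorted(kb, key=lambda entry: cosine(context_vec, entry[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_qa_input(info_lines, dialog_ctx, current_input):
    """Stack info lines, dialog context and current input in the diagram's order."""
    return "\n".join(list(info_lines) + list(dialog_ctx) + [current_input])

# Toy data (assumption): three KB lines with hand-written 3-d vectors.
kb = [
    ([1.0, 0.0, 0.0], "Berlin is the capital of Germany."),
    ([0.0, 1.0, 0.0], "Water boils at 100 degrees Celsius."),
    ([0.9, 0.1, 0.0], "Germany is a country in Europe."),
]
context_vec = [1.0, 0.05, 0.0]

info = nearest_kb_lines(context_vec, kb, k=2)
qa_input = build_qa_input(
    info,
    ["Hello!", "Hi, how can I help?"],
    "What is the capital of Germany?",
)
print(qa_input)
```

The resulting text block is what DeepNN_QA would receive. A brute-force scan like this is only practical for small knowledge bases; a larger one would need a proper nearest-neighbour index.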

Knowledge Base

Dialog

<pre>
DS_i → data/qa_src/DS_i/#.json   \
DS_j → data/qa_src/DS_j/.json    |
  .                               \   data/qa_enc/train/.json
  .                               /   data/qa_enc/val/.json
  .                              |
DS_n → data/qa_src/DS_n/##.json  /
</pre>
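
One way to route per-dataset source files into the train and val sets is a deterministic hash-based split. The sketch below is an assumption, not the project's actual procedure: the helper names, the 10% validation share, and the example file names are hypothetical, and the encoding step itself is left out.

```python
import hashlib

def assign_split(rel_path, val_percent=10):
    """Deterministically map a source file to 'train' or 'val' by hashing its path."""
    digest = hashlib.md5(rel_path.encode("utf-8")).hexdigest()
    return "val" if int(digest, 16) % 100 < val_percent else "train"

def plan_splits(rel_paths, val_percent=10):
    """Group source files (paths relative to data/qa_src) into train and val lists."""
    plan = {"train": [], "val": []}
    for rel_path in sorted(rel_paths):
        plan[assign_split(rel_path, val_percent)].append(rel_path)
    return plan

# Hypothetical file names following the DS_x/<name>.json layout.
sources = ["DS_i/1.json", "DS_i/2.json", "DS_j/1.json", "DS_n/1.json"]
plan = plan_splits(sources)
print(plan)
```

Because the assignment depends only on the file path, re-running the split after adding a dataset leaves existing files in the set they were already in.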

Datasets

Dialog Corpora

TWITTER
personachat
slashdot
reddit
SQuAD 2.0
CoQA
SQuAD 1.1
MC Test
DeepMind CNN/DM http://www.github.com/deepmind/rc-data/
MS MARCO http://www.msmarco.org/
TriviaQA
NewsQA
NarrativeQA https://github.com/deepmind/narrativeqa
HotpotQA
natural_questions
QuestionBank
WebQuestions
wordrobe20140627.csv.gz
YahooAnswers
CommonsenseQA
ComplexWebQuestions
bAbI

Chat Corpora

*          Zamia AI
*   74M    AIML bots
*  142M    chat_corpus https://github.com/Marsan-Ma-zz/chat_corpus https://github.com/Marsan-Ma/twitter_scraper
           34M open subtitles
           21M twitter_en
*   41M    cornell_movie_dialogs_corpus
*   33M    cornell_movie_quotes_corpus.zip
*    0.2M  Microsoft Research Social Media Conversation Corpus
*    4.3M  swb1_dialogact_annot.tar.gz
* 7800M    The Ubuntu Dialogue Corpus v1.0
*          NPS Chat Corpus (NLTK)
*          Internet archive Twitter stream https://archive.org/search.php?query=collection%3Atwitterstream&sort=-publicdate&page=2
*   58M    chatterbot-logs

Knowledge

WikiData
conceptnet5
framenet_v15
HappyDB
linkedgeodata
nell
opencyc
SemLink
SUMO
UMBEL
weather
wordnet

AI Architecture Survey

scalable:
XLNet
Transformer XL
OpenAI GPT-2
How to build a State-of-the-Art Conversational AI with Transfer Learning https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313
python newspaper article extractor https://github.com/codelucas/newspaper
OpenWebText https://github.com/jcpeterson/openwebtext https://pushshift.io/ http://files.pushshift.io/reddit/
BERT https://arxiv.org/pdf/1901.08634.pdf
How does TF’s universal sentence encoder work?
Transformer https://arxiv.org/pdf/1706.03762.pdf https://www.tensorflow.org/alpha/tutorials/sequences/transformer
SDNet https://arxiv.org/pdf/1812.03593.pdf
