Latest commit 1ea39bb Sep 26, 2019

Zamia Brain

The Zamia Brain project provides infrastructure useful to create natural language processing systems based on transformer networks (see https://arxiv.org/abs/1706.03762).

This project is still highly experimental; everything is subject to change without prior notice. The current approach is to generate training corpora for pretraining as well as for (multi-)domain refinement. The goal is to train networks whose natural language processing capabilities are very robust (i.e. that avoid the brittleness of traditional rule-based systems), which is the job of pretraining, while still allowing a certain amount of control over their behavior, which is the job of refinement.

For this, you will find these components:

Corpora

Twitter

You will need to provide a list of accounts as input:

./twitterscrape.py -l de -s 2019-08-01 twitter_de_201908 -U user_stats_de.json twitter_sources_de.txt

Heise

Important: adapt the hard-coded paths first!

./qa_extract_heise.py

Parole

Important: adapt the hard-coded paths first!

./qa_extract_parole.py

Wikipedia

Important: adapt the hard-coded paths first!

./qa_extract_wikipedia.py -l de

Export for pretraining

This will work for GPT-2 as well as Transformer-XL:

./qa_export_transformer-lm.py -o base_de heise parole twitter_de_2010 twitter_de_201907 wikipedia_de

Next, encode the corpus using a sentencepiece tokenization model and run the pretraining.
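A minimal sketch of the encoding step, under stated assumptions: the exported corpus is taken to be a plain-text file with one sample per line (the export format is not documented here), and `encode_corpus`, `train_and_encode`, the file paths and the vocabulary size are hypothetical names and values chosen for illustration. The calls inside `train_and_encode` come from the `sentencepiece` Python package and only work if that package is installed.

```python
from typing import Callable, Iterable, Iterator, List


def encode_corpus(lines: Iterable[str],
                  encode: Callable[[str], List[int]]) -> Iterator[List[int]]:
    """Yield one list of token ids per non-empty corpus line."""
    for line in lines:
        text = line.strip()
        if text:
            yield encode(text)


def train_and_encode(corpus_path: str, model_prefix: str,
                     vocab_size: int) -> Iterator[List[int]]:
    """Train a sentencepiece model on the corpus, then encode the corpus.

    Assumption: corpus_path is a plain-text file, one sample per line.
    """
    import sentencepiece as spm  # third-party package, imported lazily

    spm.SentencePieceTrainer.train(
        input=corpus_path, model_prefix=model_prefix, vocab_size=vocab_size)
    sp = spm.SentencePieceProcessor(model_file=model_prefix + ".model")

    with open(corpus_path, encoding="utf-8") as corpus:
        yield from encode_corpus(corpus, lambda t: sp.encode(t, out_type=int))
```

Passing the encoder in as a callable keeps the corpus loop independent of the tokenizer, so it can be tried out with a toy encoder before a real model has been trained.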

Extract skills from Zamia AI

./qa_extract_skills.py -l de skill_personal_de personal.xml

QA finetuning

The -q option is important here: it is needed to include dialog samples.

./qa_export_transformer-lm.py -q -o qa_de twitter_de_2010 twitter_de_201907 skill_personal_de

Next, encode the corpus using a sentencepiece tokenization model and continue the training for finetuning.

TODO: KB encoding / indexing / lookup

FIXME

Architecture

This is just a sketch of what a full QA chat bot with associative memory could look like:

                        +------------------+                                                        +------------------------------+
                        | Dialog Ctx     1 |                                                        |       Knowledge Base         |
                        | Dialog Ctx     2 |           +--------------------+                       +------------------------------+
                        | ...              |---------> | DeepNN_Context_Vec |                       | 0.1 0.2 0.01 ... | KB Line 1 |
                        | Dialog Ctx     n |           +--------------------+                       | ...              |           |
                        +------------------+                     |                                  | 0.33 0.1 0.5 ... | KB Line m |
                        | Current Input    |                     |                                  +------------------------------+
                        +------------------+                     |                                                 |
                                 |                               |                                                 |
                                 |                               |        +----------------------------------------+
                                 |                               |        |
                                 |                               |        |
                                 |                               v        v
                                 |                     +---------------------------+
                                 |                     | Nearest Neighbour search  |
                                 |                     +---------------------------+
                                 |                                   |
                                 |                                   |
                                 |                                   v
                                 |                  +--------------------------------+
                                 |                  |         KB Context Lines       |
                                 |                  +--------------------------------+
                                 |                  | 0.4 0.2 0.9 ... Info    line 1 |
                                 |                  | ...                            |
                                 |                  | 0.8 0.2 0.4 ... Info    line k |
                                 |                  +--------------------------------+
                                 |                                   |
                                 |                                   |
                                 |      +----------------------------+
                                 |      |
                                 |      |
                                 v      v
                          +-----------------------+
                          | Info    Line 1        |
                          | ...                   |
                          | Info    Line k        |
                          +-----------------------+
                          | Ctx     Line 1        |
                          | ...                   |
                          | Ctx     Line n        |
                          +-----------------------+
                          | Current Input         |
                          +-----------------------+
                                     |
                                     |
                                     v
                              +-------------+
                              |  DeepNN_QA  |
                              +-------------+
                                     |
                                     |
                              +-------------+
                              |   Response  |
                              +-------------+


[ Knowledge + Dialog History + Current Input ] -> [ Response ]
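
The data flow in the sketch above can be illustrated with a few lines of Python. This is only a toy illustration under assumptions, not part of the project: the function names are hypothetical, cosine similarity stands in for whatever metric the nearest-neighbour search would use, and the hand-written vectors stand in for the output of DeepNN_Context_Vec.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_kb_lines(context_vec: Vector,
                     kb: List[Tuple[Vector, str]],
                     k: int) -> List[str]:
    """Return the text of the k KB lines most similar to the context vector."""
    ranked = sorted(kb, key=lambda entry: cosine(context_vec, entry[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]


def build_qa_input(info_lines: List[str],
                   ctx_lines: List[str],
                   current_input: str) -> List[str]:
    """Stack info lines, dialog context and current input, as in the diagram."""
    return info_lines + ctx_lines + [current_input]


# Toy knowledge base: (vector, KB line). The vectors are made up.
kb = [
    ([1.0, 0.0], "KB line about weather"),
    ([0.0, 1.0], "KB line about music"),
    ([0.9, 0.1], "KB line about rain"),
]
context_vec = [1.0, 0.05]  # assumed output of the context encoder

info = nearest_kb_lines(context_vec, kb, k=2)
qa_input = build_qa_input(info, ["Hi!", "Hello."], "Will it rain today?")
```

The resulting `qa_input` list is what would be handed to DeepNN_QA. A real system would replace the linear scan with an indexed nearest-neighbour search once the knowledge base grows.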

Knowledge Base

Dialog

DS_i → data/qa_src/DS_i/#.json   \
DS_j → data/qa_src/DS_j/.json    |
   .                              \   data/qa_enc/train/.json
   .                              /   data/qa_enc/val/.json
   .                             |
DS_n → data/qa_src/DS_n/##.json  /
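
One way to read the layout above is that the per-dataset source files are pooled and then divided into a training and a validation set. The following is a rough sketch of that idea only; the helper names, the file-level split, the `*.json` glob, the validation fraction and the seed are all assumptions made for illustration, and the project may well split at sample level after encoding.

```python
import random
from pathlib import Path
from typing import Dict, List, Tuple


def collect_sources(src_root: Path) -> Dict[str, List[Path]]:
    """Map each dataset directory under src_root to its JSON files.

    Assumption: every dataset lives in its own subdirectory of src_root.
    """
    return {d.name: sorted(d.glob("*.json"))
            for d in sorted(src_root.iterdir()) if d.is_dir()}


def split_datasets(files_by_dataset: Dict[str, List[Path]],
                   val_fraction: float = 0.1,
                   seed: int = 42) -> Tuple[List[Path], List[Path]]:
    """Pool the files of all datasets and split them into (train, val)."""
    pooled = sorted(p for files in files_by_dataset.values() for p in files)
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(pooled)
    n_val = int(len(pooled) * val_fraction)
    return pooled[n_val:], pooled[:n_val]
```

Typical use would be `train, val = split_datasets(collect_sources(Path("data/qa_src")))`, after which each list is encoded into its own output directory.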

Datasets

Dialog Corpora

Chat Corpora

  • Zamia AI

  • 74M AIML bots

  • 142M chat_corpus https://github.com/Marsan-Ma-zz/chat_corpus https://github.com/Marsan-Ma/twitter_scraper

      • 34M open subtitles

      • 21M twitter_en

  • 41M cornell_movie_dialogs_corpus

  • 33M cornell_movie_quotes_corpus.zip

  • 0.2M Microsoft Research Social Media Conversation Corpus

  • 4.3M swb1_dialogact_annot.tar.gz

  • 7800M The Ubuntu Dialogue Corpus v1.0

  • NPS Chat Corpus (NLTK)

  • Internet archive Twitter stream https://archive.org/search.php?query=collection%3Atwitterstream&sort=-publicdate&page=2

  • 58M chatterbot-logs

Knowledge

  • WikiData

  • conceptnet5

  • framenet_v15

  • HappyDB

  • linkedgeodata

  • nell

  • opencyc

  • SemLink

  • SUMO

  • UMBEL

  • weather

  • wordnet
