
Domain Specific KBC

Creating a knowledge base for a particular business domain (e.g. insurance) using DeepDive

Preparation

  • get the input data (e.g. transcripts.tsv.tar.gz) from S3 (or any other remote storage location) and place it inside the input folder
    • DeepDive's built-in process process/init/app will use input/init.sh to take care of extracting the archive and making the file ready for processing
  • make the necessary path changes to the docker-compose file
  • run the docker-compose file, which sets up both the DeepDive & Postgres containers
docker-compose up -d

Follow the instructions in the deepdive-docker-setup folder to create the DeepDive Docker image used in docker-compose.yml

DeepDive Workflow

Please refer to DeepDive's documentation for extensive details on each command listed below.

The deepdive compile command should be executed before running any of the do commands mentioned below, and after any change is made to the app.ddlog file.

  • Loading the transcripts into the DB (if not loaded before). The process of transforming the unstructured chat corpus into loadable transcripts is described below under Input Data Processing
deepdive do transcripts
  • In case the transcripts were loaded before, the following commands are required to reload new data from a fresh transcripts file
deepdive redo process/init/app

deepdive redo transcripts
  • Fetching all the distinct agents from the transcripts
deepdive do agents
  • Converting transcripts into turn-based chat conversations
deepdive do turns
    • Each transcript is split into a turn-based chat conversation so that fine-grained analysis can be done. The turns relation looks like this:

```
turns(
    trans_id uuid,
    agent_or_user text,
    name text,
    chat text,
    turn_id int,
    reply_to int
).
```
    • Turn data is generated via a user-defined function (UDF). The UDF here is a Scala script run with [Ammonite](http://www.lihaoyi.com/Ammonite/). The function takes the relevant `id` & `chat` column data and processes them to generate rows matching the turns structure above

```
function turn_gen over (
        id uuid,
        chat text
    ) returns rows like turns
    implementation "udf/turn_gen.sc" handles tsv lines.

turns += turn_gen(id, chat) :-
      transcripts(id, _, _, _, chat).
```
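
The actual udf/turn_gen.sc is not included in this README. As a rough illustration of the `handles tsv lines` contract (the UDF reads TSV rows on stdin and writes one TSV row per output tuple to stdout), a minimal Ammonite-style sketch could look like the following; the chat line format, the agent/user heuristic, and the reply_to convention are assumptions, not the project's actual logic.

```
// Minimal sketch of a turn-splitting UDF (the real udf/turn_gen.sc may differ).
// DeepDive pipes TSV lines (id \t chat) on stdin; every line printed to stdout
// becomes one row of `turns`: trans_id, agent_or_user, name, chat, turn_id, reply_to.

// Assumed chat format: one utterance per line, "Speaker Name: message text".
val speakerLine = """([^:]+):\s*(.*)""".r

for (line <- scala.io.Source.stdin.getLines() if line.trim.nonEmpty) {
  val Array(id, chat) = line.split("\t", 2)

  // Newlines inside a TSV field arrive escaped as the two characters "\n".
  val utterances = chat.split("""\\n""").map(_.trim).filter(_.nonEmpty)

  // Placeholder heuristic: treat the conversation's first speaker as the agent.
  val firstSpeaker = utterances.headOption.collect { case speakerLine(n, _) => n.trim }

  utterances.zipWithIndex.foreach { case (utterance, turnId) =>
    val (name, text) = utterance match {
      case speakerLine(n, t) => (n.trim, t)
      case other             => ("unknown", other)
    }
    val role    = if (firstSpeaker.contains(name)) "agent" else "user"
    val replyTo = if (turnId == 0) -1 else turnId - 1   // assumed convention for the first turn

    // Emit a TSV row matching the `turns` relation.
    println(Seq(id, role, name, text, turnId.toString, replyTo.toString).mkString("\t"))
  }
}
```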
  • Adding NLP markups to Turn data

    • This step splits the turn data into sentences and their component tokens. Additionally, other markups such as lemmas, part-of-speech tags, named-entity recognition tags, and a dependency parse of each sentence are generated. The output schema looks like this:
    sentences(
        doc_id         text,
        sentence_index int,
        sentence_text  text,
        tokens         text[],
        lemmas         text[],
        pos_tags       text[],
        ner_tags       text[],
        doc_offsets    int[],
        dep_types      text[],
        dep_tokens     int[]
    ).
    
    • A UDF uses DeepDive's wrapper around the Stanford CoreNLP parser to produce the required markups, taking the id & chat text from the turn data. Additional checks on the fetched turn data take care of null data & malformed HTML, which Stanford CoreNLP refuses to parse.
    function nlp_chat over (
        id text,
        chat text
    ) returns rows like sentences
    implementation "udf/nlp_chat.sh" handles tsv lines.
    
    sentences += nlp_chat((trans_id :: text) || "_" || turn_id, chat):-
        turns(trans_id,_,_,chat,turn_id,_), [chat != "";chat IS NOT NULL],!chat LIKE "%<p align=\"center\">Chat Origin:%".
    
    • The above process is executed with the following command
    deepdive do sentences
    

Querying

The deepdive query command can be used to take a look at the loaded data, e.g.

  • show the count of unique agents mentioned in the input corpus
deepdive query 'count(n) ?- agents(n).'
  • show the first 5 agents with their departments & respective chat transcripts
deepdive query 'agent, dep, chat | 5 ?- transcripts(_, _, dep, agent, chat).'
  • show the ids of the 10 biggest chat conversations (along with their number of turns)
deepdive query 'id,@order_by("DESC") COUNT(id) | 10 ?- turns(id,_,_,_,_,_).'

For more information on DeepDive's query syntax and features, refer to DeepDive's documentation.

Input Data Processing

The raw input data corpus (in this case, chat transcripts) is parsed and processed into structured content, as modelled by the transcripts relation in app.ddlog, via the kbc-data-processing project. If any model changes are made to the transcripts relation here, then corresponding changes are required in that project.

Also, to process a new input data corpus the same kbc-data-processing project needs to be run again. For information on how to run the processing task, please refer to its documentation.
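
As a very rough illustration only (the real kbc-data-processing project defines its own input format, and the full column list of the transcripts relation lives in app.ddlog), turning a raw chat export into a loadable TSV row boils down to something like this:

```
import java.util.UUID

// Hypothetical raw record from a chat export; the actual input format is defined
// by the kbc-data-processing project, not by this sketch.
case class RawChat(department: String, agent: String, body: String)

// Escape a value so it survives as a single TSV field (backslashes, tabs, newlines).
def tsvEscape(s: String): String =
  s.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

// Emit one TSV row per transcript; the column order shown (id, department, agent, chat)
// is an assumption based on the columns referenced elsewhere in this README.
def toTranscriptRow(raw: RawChat): String =
  Seq(UUID.randomUUID().toString, raw.department, raw.agent, tsvEscape(raw.body))
    .mkString("\t")

println(toTranscriptRow(RawChat("insurance", "Alex", "Hello\nHow can I help you today?")))
```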

Initialisation Process (process/init/app)

This is a built-in process of DeepDive. During this process, we:

  • Remove the first blank line of the transcripts input file
  • Install & set up Ammonite
  • Pre-compile any UDF Scala scripts via Ammonite so that the actual UDF execution becomes faster
  • Install & set up a fork of DeepDive's wrapper of the Stanford CoreNLP parser (bazaar/parser). The fork resolves a bug that prevented NLP-parsed data from being copied into the Postgres DB.

Not part of the init script, but the following script needs to be sourced before running any workflow commands; it adds the ${HOME}/local/util folder to the global $PATH. That folder contains some useful executables (like tsv2json & jq) required during the DeepDive workflow

source input/env.sh

Troubleshooting

  • apt-get update fails

    • create an apt.conf file inside the /etc/apt/ folder with the following contents
    Acquire::http::proxy "<YOUR_PROXY>";
    Acquire::https::proxy "<YOUR_PROXY>";
    
  • getting certificate errors during JVM operations, such as fetching artifacts via sbt

    • add the MITM certificate to the /usr/local/share/ca-certificates/ folder
    • then execute the following commands
    $ /usr/sbin/update-ca-certificates -f
    $ /usr/bin/keytool -import -noprompt -trustcacerts -file "/usr/local/share/ca-certificates/<YOUR_CERTIFICATE>" -keystore "/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/lib/security/cacerts" -storepass changeit
    
  • if for some reason a Scala-script-based UDF fails the first time a relevant DeepDive process is run, try running it again. The initial failure may have been caused by script compilation messages being printed to stdout, which DeepDive took as actual data for Postgres
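    • as a general rule for `handles tsv lines` UDFs, only output rows should be written to stdout; any diagnostics from the script itself should go to stderr, e.g. (a sketch, not code from this repo)
    // stdout must carry only TSV output rows; everything else should go to stderr
    def log(msg: String): Unit = System.err.println(s"[udf] $msg")
    log("starting")                                       // not treated as data by DeepDive
    println(Seq("col1", "col2", "col3").mkString("\t"))   // one output row on stdout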

  • the NLP parsing step might fail because of a lack of compute resources on the local machine. In that case it's better to do the same computation on a cloud / external server & import the parsed data like this (deepdive load can also load bz2-compressed files)

deepdive load sentences sentences_parsed.tsv.bz2

Latent Semantic Analysis

Check out the lsa folder for more information on how Scala & Spark are used to perform latent semantic analysis on this data.
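
The details live in that folder; as a general sketch of LSA (not necessarily the exact pipeline used there), the usual approach is to build TF-IDF vectors over the chat text and take a truncated SVD, e.g. with Spark MLlib:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Minimal LSA sketch: TF-IDF over tokenised chat text, followed by a truncated SVD.
// The lsa folder's actual pipeline (tokenisation, weighting, choice of k, ...) may differ.
object LsaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lsa-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // One "document" per chat or turn; in practice these would come from a TSV export of the DB.
    val docs = sc.parallelize(Seq(
      "hello i need some help with my insurance claim",
      "sure i can help you check the status of the claim"
    )).map(_.split("\\s+").toSeq)

    val tf = new HashingTF(64).transform(docs)   // small hashed feature space, just for the sketch
    tf.cache()
    val tfidf = new IDF().fit(tf).transform(tf)

    // Truncated SVD: rows of U are documents in the latent space, V spans the "topic" space.
    val svd = new RowMatrix(tfidf).computeSVD(k = 2, computeU = true)
    println(s"Singular values: ${svd.s}")

    spark.stop()
  }
}
```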