KB is an Akinator like engine written in T-SQL
PLpgSQL C#
Latest commit e8330c7 Jul 12, 2016 @vnesterovsky vnesterovsky committed on GitHub Update bulk_insert.sql
Permalink
Failed to load latest commit information.
SQL Update bulk_insert.sql Jul 12, 2016
README.md Update README.md Dec 20, 2015

README.md

Akinator like engine

A collegue has approached to us with a question on how Akinator engine may work.

To our shame we have never heard about this amazing game before. To fill the gap we have immediately started to play it, and have identified it as a Troubleshooting solver.

It took us a couple of minutes to come up with a brilliant solution: "We just need to google and find the engine in the internet". :-)

Unfortunately, this led to nowhere, as no Akinator itself is open sourced, and no other good quality open source solutions are available.

After another hour we have got two more ideas:

  1. The task should fit into SQL;
  2. The task is a good candidate for a neural network.

In fact, the first might be required to teach the second, so we have decided to formalize the problem in terms of SQL, while still keeping in mind a neural network.

SQL solution

We have selected SQL Server as a database to implement our solution. A similar implementation is possible in other SQL (DB2, Oracle, MySql, SQLite), or NoSQL databases. SQL Server's facilities have allowed us to implement the task as a pure T-SQL API.

Concepts

The task can be defined as:"Guess an entity through a series of predicates".

The database should contain following tables:

  • Entity - to define objects that can be found;
  • PredicateType - to define questions that can be asked about objects;
  • Predicate - to relate objects with answers to the questions.

These tables is enough to construct an algorithm that takes as input a list of questions with answers, and offers the following question(s), if any (see Algorithm to suggest next question). But before building such an algorithm we must understand how can we populate these tables.

We argued like this:

  • entities have properties;
  • predicates are based on properties of entities.

Having entities with properties we can:

  • populate Entity;
  • mine properties to populate PredicateType;
  • populate Predicate based on answers.

Thus, the database should also contain following tables:

  • PropertyType - to define different types of properties that entities can have;
  • Property - to define properties per entity.

Test data

Samples are required to evaluate the quality of the implementation. It took awhile to find such a data, mainly due to the problem with formulation of the search query. We have found the DBpedia - a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.

Many interesting datasets are collected there. We have decided to try Person dataset, which we got in cvs format from http://web.informatik.uni-mannheim.de/DBpediaAsTables/DBpediaClasses.htm. It contains information for more than 1.5 million persons, with several hundreds of different types of properties.

Through a sequence of simple manipulations (see later in Load Persons) we have imported this cvs into PropertyType and into Property tables.

Data mining

At the following stage we looked into data, calculated use counts per each property type, and per property value. This has helped us to formulate several dozen questions about persons, like: "Is he/she living person?", "Is he/she politician?", "Is he/she artist?", "Is he/she writer?", "Is he/she tall?", "Is he a male?".

We have found that there are questions that split whole set of persons into two more or less equal subsets. Other questions subset rather small set of persons. Yet other questions have fuzzy meaning, or are based on impresice property values (e.g. what does it mean "tall", or what to do with property "sex" that is defined not for all persons?)

Looking more closely we can see that questions might be related. Consider that if "Is football player?" is true, then "Is sportsman?" is true also; or if "Was born somewhere in Europe?" is false then "Was born in France?" is also false. One way to represent such relations between questions is through hierarchy:

  1. Is sportsman?
    1. Is football player?
    2. ...
  2. Was born somewhere in Europe?
    1. Was born in France?
    2. ...
  3. ...

Sql definitions

Let's define tables.

Table Entity defines objects:

create table Data.Entity
(
  EntityID int not null primary key
);

where

  • EntityID - is an object identity.

Table PredicateType defines predicate types:

create table Data.PredicateType
(
  PredicateID hierarchyid not null primary key,
  Name nvarchar(128) not null unique,
  Expression nvarchar(max) null,
  Computed bit not null default 0,
  ScopePredicateID hierarchyid null,
  Hidden bit not null default 0,
  Imprecise bit not null default 0
);

where

  • PredicateID - hierarchy ID that uniquely defines a predicate type.
  • Name - a unique predicate name in form: 'IsLivingPerson', 'IsMale', 'IsPolitician'. Name can be seen as a question in human readable form.
  • Expression - an sql source used to evaluate the predicate. Usually it is of the form: Data.Predicate_IsLivingPerson(). If Computed = 0 then this source is used to populate Data.Predicate table; otherwise when Computed = 1 this source is used every time to get entities that match the predicate.
  • Computed - indicates whether to store entities that match the predicate in Data.Predicate table (value is 0), or to evaluate Expression each time a predicate is requested (value is 1).
  • ScopePredicateID - if specified, then the value defines the scope of PredicateID (e.g. "IsMale", "IsFemale" predicates are defined not for all persons, but only for those having "Sex" property; in this case we define "Sex" preicate, and "IsMale", "IsFemale" refer to "Sex" as a scope).
  • Hidden - indicates that the predicate should not be offered (value is 1).
  • Imprecise - indicates whether the predicate is not precise, meanning that some irrelevant objects might be matched by the predicates, and some relevant object might be not matched. (E.g. "IsTall" is a imprecise predicate as the answer considerably depends on player. Answer to such a question should always be considered as "probalby")

Table Predicate stores entities for which predicate is true:

create table Data.Predicate
(
  PredicateID hierarchyid not null,
  EntityID int not null,
  constraint PK_Predicate primary key clustered(PredicateID, EntityID),
  constraint IX_Predicate_Entity unique(EntityID, PredicateID),
  constraint FK_Predicate_Entity foreign key(EntityID) 
    references Data.Entity(EntityID)
    on update cascade
    on delete cascade
);

where

  • PredicateID - a predicate reference;
  • EntityID - an entity reference.

Table PropertyType defines types of properties:

create table Data.PropertyType
(
  PropertyID int not null constraint PK_PropertyType primary key,
  Name nvarchar(128) not null,
  Type nvarchar(256) not null,
  index IX_PropertyType(Name)
);

where

  • PropertyID - a unique property ID;
  • Name - a property name;
  • Type - a property type.

Table Property defines object properties:

create table Data.Property
(
  EntityID int not null,
  PropertyID int not null,
  Value nvarchar(4000) null,
  constraint FK_Property_Entity foreign key(EntityID)
    references Data.Entity(EntityID)
    on update cascade
    on delete cascade,
  index IX_Property_Property clustered(PropertyID, EntityID),
  index IX_Property_Entity(EntityID, PropertyID)
);

where

  • EntityID - is an entity reference;
  • PropertyID - refers to a preperty type;
  • Value - a property value.

Together we had a heated debate on whether Data.Entity is required at all, and if it is required then it might worth to add more info (like name or description) to it. We have agreed that something like this is required: either table or inexed view based on Data.Property or on Data.Predicate tables. In favor of the table decision it says that we can build foreign keys. As for additional columns to Data.Entity the argument was that the Data.Property already contains entity properties, and it is not exactly clear what property should be added directly to Data.Entity.

Predicate definitions

A question definition translated into SQL would look like a query against Property table, like the following:

select EntityID from Data.Property where PropertyID = @livingPersonPropertID

or

select EntityID from Data.Property where PropertyID = @sexProperyID and TextValue = 'male'

Further, these queries are wrapped into sql functions:

select * from Data.Predicate_IsLivingPerson()

or

select * from Data.Predicate_IsMale()

To manage predicates we have defined several simple stored procedures:

-- Defines and populates a predicate
create procedure Data.DefinePredicate
(
  -- A predicate name to define.
  @name nvarchar(128),
  -- A predicate expression.
  @expression nvarchar(max) = null,
  -- Computed indicator
  @computed bit = 0,
  -- Optional parent predicate.
  @parent nvarchar(128) = null,
  -- Optional scope predicate.
  @scope nvarchar(128) = null,
  -- Optional value that indicates that the predicate is hidden.
  @hidden bit = 0,
  -- Indicates whether the predicate lacks accuracy. Default is 0.
  @imprecise bit = 0,
  -- 1 (default) to populate the predicate immediately.
  @populate bit = 1
);

-- Deletes a predicate. create procedure Data.DeletePredicate ( -- A predicate name to define. @name nvarchar(128) = null );

-- Invalidates a predicate. create procedure Data.InvalidatePredicate ( -- A predicate name. @name nvarchar(128) = null );

-- Invalidates all predicates. create procedure Data.InvalidatePredicates();

and a couple of utility functions:

create function Data.GetPropertyID(@name nvarchar(128))
returns int
as
begin
  return (select PropertyID from Data.PropertyType where Name = @name);
end;

create function Data.GetPredicateID(@name nvarchar(128)) returns hierarchyid as begin return (select PredicateID from Data.PredicateType where Name = @name); end;

Now, when you want to define a new predicate, you follow these steps:

  • Define a predicate function that returns a set of entities that apply to a question, like this:
create function Data.Predicate_IsActor()
returns table
as
return
  select
    EntityID
  from 
    Data.Property
  where 
    (PropertyID = Data.GetPropertyID('22-rdf-syntax-ns#type_label')) and
    (TextValue = 'actor');
  • Define a predicate like this:
execute Data.DefinePredicate
  @name = 'IsActor', 
  @expression = 'Data.Predicate_IsActor()';

While you are experimenting with questions you might want to reformulate the question, or delete the question. So to delete the question you call:

execute Data.DeletePredicate @name = 'IsActor'

To invalidate predicate (delete and repopulate relevant data into Data.Predicate table) you call:

execute Data.InvalidatePredicate @name = 'IsActor'

To invalidate all predicates (delete and repopulate all data into Data.Predicate table) you call:

execute Data.InvalidatePredicates

Algorithm to suggest next questions

Core algorithm spins around Predicate, Entity, and PredicateType tables.

Assuming we have P(i) - predicates, and A(i) - answers, where i = 1..n; let answer A(i) be 0 for "no", and 1 for "yes". We are going to build a select that returns next predicates.

The initial part of select gets subsets of entities that match the predicates:

with P1 as -- P(1)
(
  select EntityID from Data.Predicate where PredicateID = @p1
),
P2 as -- P(2)
(
  select EntityID from Data.Predicate where PredicateID = @p2
),
...
Pn as -- P(n)
(
  select EntityID from Data.Predicate where PredicateID = @pn
),

at the next step we get entities that are matched by those predicates:

M as
(
  select EntityID from Data.Entity

-- Intersect those Pi that has A(i) = 1 intersect select EntityID from Pi -- where A(i) = 1 ...

-- Except those Pj that has A(j) = 0 except select EntityID from Pj -- where A(j) = 0 ... ),

Now, we can query predicates for matched entities, except those predicates that has been already used:

P as
(
  select
    P.PredicateID,
    count(*) EntityCount,
    (select count(*) from M) TotalCount 
  from
    Data.Predicate P
    inner join
    M
    on
      P.EntityID = M.EntityID
  where
    P.PredicateID not in (@p1, @p2, ..., @pn)
  group by
    P.PredicateID
)

where

  • EntityCount - is number of entities that match a specified predicate;
  • TotalCount - a total number of entities that match input predicates.

As a final result we can return first several predicates that split set of entities more evenly:

select top(5) * from P order by abs(TotalCount - EntityCount * 2);

For example, if we have n = 5, and A(1) = A(3) = 1, A(2) = A(4) = A(5) = 0 then the select will look like this:

with P1 as -- P(1), A(1) = 1
(
  select EntityID from Data.Predicate where PredicateID = @p1
),
P2 as -- P(2), A(2) = 0
(
  select EntityID from Data.Predicate where PredicateID = @p2
),
P3 as -- P(3), A(3) = 1
(
  select EntityID from Data.Predicate where PredicateID = @p3
),
P4 as -- P(4), A(4) = 0
(
  select EntityID from Data.Predicate where PredicateID = @p4
),
P5 as -- P(5), A(5) = 0
(
  select EntityID from Data.Predicate where PredicateID = @p5
),
M as
(
  select EntityID from Data.Entity
  intersect
  select EntityID from P1
  intersect
  select EntityID from P3
  except
  select EntityID from P2
  except
  select EntityID from P4
  except
  select EntityID from P5
),
P as
(
  select
    P.PredicateID,
    count(*) EntityCount,
    (select count(*) from M) TotalCount
  from
    Data.Predicate P
    inner join
    M
    on
      P.EntityID = M.EntityID
  where
    P.PredicateID not in (@p1, @p2, @p3, @p4, @p5)
  group by
    P.PredicateID
)
select top(5) * from P order by abs(TotalCount - EntityCount * 2);

Now, let's complicate the task, and assume fuzzy answers:

  • 0 - for "no";
  • (0, 0.5) - for "probably no";
  • 0.5 - for "don't know";
  • (0.5, 1) - for "probably yes";
  • 1 - for "yes".

Fuzzy answers should not cut subsets of entities but prioritize order of predicates returned.

Lets's start from "don't know" answer. In this case we should equally accept both subsets of entities that match and that don't match the predicate. The only impact on result from such answer is that the relevant predicate is excluded from the next offers.

So, lets assume in the previous exmaple that we have n = 5, and A(1) = 1, A(3) = 0.5, A(2) = A(4) = A(5) = 0 then the select will look like this:

The result select will look like this:

with P1 as -- P(1), A(1) = 1
(
  select EntityID from Data.Predicate where PredicateID = @p1
),
P2 as -- P(2), A(2) = 0
(
  select EntityID from Data.Predicate where PredicateID = @p2
),
P3 as -- P(3), A(3) = 0.5
(
  select EntityID from Data.Predicate where PredicateID = @p3
),
P4 as -- P(4), A(4) = 0
(
  select EntityID from Data.Predicate where PredicateID = @p4
),
P5 as -- P(5), A(5) = 0
(
  select EntityID from Data.Predicate where PredicateID = @p5
),
M as
(
  select EntityID from Data.Entity
  intersect
  select EntityID from P1
-- intersect
-- select EntityID from P3
  except
  select EntityID from P2
  except
  select EntityID from P4
  except
  select EntityID from P5
),
P as
(
  select
    P.PredicateID,
    count(*) EntityCount,
    (select count(*) from M) TotalCount
  from
    Data.Predicate P
    inner join
    M
    on
      P.EntityID = M.EntityID
  where
    P.PredicateID not in (@p1, @p2, @p3, @p4, @p5)
  group by
    P.PredicateID
)
select top(5) * from P order by abs(TotalCount - EntityCount * 2);

Notice that P3 is not used in M.

The final step is to account "probably" answers. Such answers give weight to each entity. Offered predicates should be ordered according to weight of entities they are based on.

Assuming we have P(i) - predicates, and A(i) - answers, where i = 1..n; and assume that A(k), and A(l) have "probably" answers. In this case let's define weighted entities:

E as
(
  select
    EntityID,
    iif(EntityID in (select EntityID from Pk), @ak, 1 - @ak) *
      iif(EntityID in (select EntityID from Pl), @al, 1 - @al) Weight
  from
    M
),

where

  • Weight - an entity weight.

Now, the query for next predicates can be written like this:

P as
(
  select distinct
    E.Weight,
    P.PredicateID, 
    count(*) over(partition by E.Weight, P.PredicateID) EntityCount,
    count(*) over(partition by E.Weight) TotalCount
  from
    Data.Predicate P
    inner join
    E
    on
      P.EntityID = E.EntityID
  where
    P.PredicateID not in (@p1, @p2, ..., @pn)
)

where

  • Weight - a predicate weight.
  • EntityCount - is number of entities that match a specified predicate per weight;
  • TotalCount - a total number of entities that match input predicates per weight.

The final result will be almost the same as earlier but ordered at first by weight:

select top(5) * from P order by Weight desc, abs(TotalCount - EntityCount * 2);

So, lets assume in our previous exmaple that we have n = 5, and A(1) = 1, A(3) = 0.5, A(2) = 0, A(4) = 0.3, A(5) = 0.8 then the select will look like this:

with P1 as -- P(1), A(1) = 1
(
  select EntityID from Data.Predicate where PredicateID = @p1
),
P2 as -- P(2), A(2) = 0
(
  select EntityID from Data.Predicate where PredicateID = @p2
),
P3 as -- P(3), A(3) = 0.5
(
  select EntityID from Data.Predicate where PredicateID = @p3
),
P4 as -- P(4), A(4) = 0.3
(
  select EntityID from Data.Predicate where PredicateID = @p4
),
P5 as -- P(5), A(5) = 0.8
(
  select EntityID from Data.Predicate where PredicateID = @p5
),
M as
(
  select EntityID from Data.Entity
  intersect
  select EntityID from P1
-- intersect
-- select EntityID from P3
  except
  select EntityID from P2
-- except
-- select EntityID from P4
-- except
-- select EntityID from P5
),
E as
(
  select
    EntityID,
    iif(EntityID in (select EntityID from P4), 0.3, 0.7) *
      iif(EntityID in (select EntityID from P5), 0.8, 0.2) Weight
  from
    M
),
P as
(
  select distinct
    E.Weight,
    P.PredicateID, 
    count(*) over(partition by E.Weight, P.PredicateID) EntityCount,
    count(*) over(partition by E.Weight) TotalCount
  from
    Data.Predicate P
    inner join
    E
    on
      P.EntityID = E.EntityID
  where
    P.PredicateID not in (@p1, @p2, @p3, @p4, @p5)
)
select top(5) * from P order by Weight desc, abs(TotalCount - EntityCount * 2);

That is the most complex form of select that algorithm should produce.

Consider now results of these selects.

If there are rows in result set, then we can either offer predicate from the first row, or, if we want to model some randomness, we can at times go to some other predicate from those returned.

If no rows are returned then we are finished with offers. The only thing we can return is a set of matched entities that is:

select * from M;

or in case or "probably" answers:

select * from E;

We might want to join those selects with Data.Property table to bring some entity properties, like name, or description.

Implementation of algorithm

Now algorithm is clear. The deal is just to implement it. We can see that the structure of select depends considerably on number of questions and type of answers.

To deal with this we use dynamic SQL. In the past we wrote an article on "Dealing with dynamic SQL in SQL Server".

The idea was to use XQuery as a SQL template language. At first you might think that this is too radical step but after a close look you will observe that it might be the most straightforward solution to deal with dynamic SQL. Just consider an XQuery snapshot that builds "Pi as (...), ..." text:

'<sql>with </sql>,

for $predicate in $predicates let $row := xs:integer($predicate/@Row) let $predicateID := xs:string($predicate/@PredicateID) let $name := xs:string($predicate/@Name) return <sql>P{$row} as -- <name>{$name}</name> ( select EntityID from Data.Predicate where PredicateID = <string>{$predicateID}</string> ), </sql>'

We have defined two SQL functions that build SQL text for such input xml:

create function Dynamic.GetSQL_GetNextPredicate
(
  -- Request parameters.
  @params xml
);

create function Dynamic.GetSQL_GetEntities ( -- Request parameters. @params xml, -- Optional property value to return. @property nvarchar(128) );

and two more stored procedures one that offers new predicates, and the other that returns matched entities:

-- Gets next predicates
create procedure Data.GetNextPredicates
(
  -- Request parameters.
  @params xml,
  -- Result as a predicate
  @result xml output
);

-- Gets entities for predicates create procedure Data.GetEntities ( -- Request parameters. @params xml, -- Optional property value to return. @property nvarchar(128) );

The input for these procedures are in the form of xml like this:

<request>
  <question name="IsLivingPerson" answer="1"/>
  <question name="IsFootballPlayer" answer="0"/>
  <question name="IsArtist" answer="0.3"/>
  ...
<request>

Data.GetNextPredicates returns an xml result fragment with next suggested predicates in the form:

<question name="IsPolitician"/>
<question name="IsReligious"/>
<question name="IsMilitary"/>

Data.GetEntities returns a set of entities with possible value of some property (like name or description).

Cache of results

At this point we could complete our explanation of the algorithm and its implementation, especially taking into account that performance over test data, which is more than 1.5 million of entities, is very good. Indeed, it take 100ms on average to build SQL, and from dozens milliseconds and up to 3 - 4 seconds to run the query. Execution plans look good, and there are no bottlenecks for scalability.

But we have thought that we can do better! According to the algorithm there are just several best predicates on the top level; there are also not too many best predicates on the second level, and so on. We have estimated that we can cache different results for all requests, say, for up to ten or even more input predicates. This means that we can immediately give answers to different sets of inputs, and descend to a rather small set of remaining entities. On the remaining set, a regular, even non-cached, search works very fast.

So, we will continue.

Caching implementation

We cache search offers in a tree. Each node in this tree:

  • has a node identifier;
  • refers to the parent node (parent offer); and
  • is classified with answer to the parent offer, and with new offered predicate.

Path from any specific node to the root of the tree defines a set of questions and answers, and the node itself offers next predicate.

This way cache table can be defined like this:

create table Data.PredicateTree
(
  ID int not null primary key,
  ParentID int not null,
  Answer decimal(2, 1) not null,
  PredicateID hierarchyid null,
  Populated bit not null default 0,
  constraint IX_PredicateTree unique(ParentID, Answer, PredicateID)
);

where

  • ID - a node identifier.
  • ParentID - reference to a parent node.
  • Answer - answer to the parent offer.
  • PredicateID - offered predicate; when value is null then this search brings no next predicate.
  • Populated - indicates whether it is a populated search result (1), or it is a row inserted to populate a descendant search result.

How caching works

Caching is integrated into the procedure Data.GetNextPredicates.

  1. When the GetNextPredicates is called, request's questions are sorted by PredicateID. This is to reduce a number of permutations to store in the cache. This can be done, as an offer does not depend on order of questions but on whole set of questions only.
  2. PredicateTree is checked to find a node that corresponds requested questions with answers.
  3. If such node is found then offered predicates are returned.
  4. Otherwise regular search is done, and results are cached into the PredicateTree.

Decision Tree

If you will look at caching from other perspective, and will decide to cache all data in such tree, then it can be qualified as a decition tree. The data contained in such table will be enough to guess any entity.

Play the search

We can guess a specific entity and start playing executing Data.GetNextPredicates iteratively and answering offered questions.

This way we shall reach to the point where no more predicates are offered. This way procedures either localized a minimal subset of entities or found required entity itself.

We have defined a procedure Data.PlaySearch that does exactly this. It plays the game. Among other roles this procedure populates the search cache.

Sources

Solution sources are published at github.com/nesterovsky-bros/KB.

SQL Server scripts are found at github.com/nesterovsky-bros/KB/SQL.

Steps to load DBpedia Persons data are described at github.com/nesterovsky-bros/KB/SQL/Persons.

Thank you for your attention.