Bug 858158 - [keyboard] document prediction algorithm and make it tuneable #9071

Closed
wants to merge 11 commits into from

7 participants

@ckerschb
Collaborator

No description provided.

@ckerschb
Collaborator
apps/keyboard/js/imes/latin/predictions.js
((15 lines not shown))
+// int16 ch; // character
+// int16 lPtr; // left child
+// int16 cPtr; // center child
+// int16 rPtr; // right child
+// int16 nPtr; // next child, holds a pointer to the node
+// // with the next highest frequency
+// int16 high; // holds an overflow byte for lPtr, cPtr, rPtr, nPtr
+// // which keeps nodes as small as possible.
+// int16 frequency; // frequency from the XML file
+// };
+//
+// The algorithm operates in two stages:
+//
+// First, we permutate the user input (prefix) by
+// - adding a character
+// - deleting a character
@davidflanagan Owner

Do you try all possible additions and deletions? And is this really done as a separate phase or while traversing the TST?

apps/keyboard/js/imes/latin/predictions.js
((18 lines not shown))
+// int16 rPtr; // right child
+// int16 nPtr; // next child, holds a pointer to the node
+// // with the next highest frequency
+// int16 high; // holds an overflow byte for lPtr, cPtr, rPtr, nPtr
+// // which keeps nodes as small as possible.
+// int16 frequency; // frequency from the XML file
+// };
+//
+// The algorithm operates in two stages:
+//
+// First, we permutate the user input (prefix) by
+// - adding a character
+// - deleting a character
+// - replacing characters with surrounding key-characters
+// - transposing characters.
+//
@davidflanagan Owner

What weights are associated with these various permutations? I would assume for example, that the user is more likely to mistype a key and hit a neighboring key than they are to add an extra character to their input. So predictions based on near-key replacements should probably have a higher ranking than predictions based on character deletions.
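
As an illustration of the weighting suggested in this comment (not code from this pull request), a minimal sketch might look like the following. The names `EDIT_WEIGHTS`, `rankCandidate` and `sortCandidates` are hypothetical, and the weight values are placeholders that would need tuning.

```javascript
'use strict';

// Hypothetical weights per edit type; placeholder values, not from the patch.
var EDIT_WEIGHTS = {
  exact: 1.0,
  substitution: 0.8,   // replacement with a neighboring key
  transposition: 0.6,
  insertion: 0.4,
  deletion: 0.4
};

// Scale a dictionary frequency by the weight of the edit that produced
// the candidate prefix.
function rankCandidate(frequency, editType) {
  var weight = EDIT_WEIGHTS[editType];
  if (weight === undefined) {
    throw new Error('unknown edit type: ' + editType);
  }
  return frequency * weight;
}

// Return a sorted copy so that the highest-ranked candidate is at index 0.
function sortCandidates(candidates) {
  return candidates.slice().sort(function(a, b) {
    return rankCandidate(b.frequency, b.editType) -
           rankCandidate(a.frequency, a.editType);
  });
}
```

With these placeholder weights, a near-key substitution keeps most of its frequency while insertions and deletions are penalized more heavily, which matches the ordering argued for above.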

apps/keyboard/js/imes/latin/predictions.js
((39 lines not shown))
+// candidates because 'a' is a surrounding key of 's' and there is a
+// word that starts with that prefix in the TDAG (e.g. 'and').
+// This array is sorted, so that the highest ranked node is found
+// at index 0.
+//
+// Second, the function 'predictSuffixes' iterates that array of candidates
+// and follows the next pointers (nPtr). Again, this pointer
+// points to the node with the next highest frequency starting with
+// that prefix. We can see this nPtr as a kind of linked list.
+// Using this linked list we can prune whole subtrees, which favors
+// lookup speed.
+//
+// So the next pointer of 's' points to 'h' ('she' ranked 170).
+// The nPtr in the node of that 'h' (prefix 's') points to
+// 'u' ('such' ranked also 170), and so on.
+//
@davidflanagan Owner

So if I understand correctly, the "s" node has an nPtr that points to "sh" (because she is high frequency) and "sh" has an nPtr that points to "su".

But then what if the user actually types "sh"? When the algorithm traverses to "sh" it sees the same (now irrelevant) pointer to "su", doesn't it? I still don't understand.

Maybe a figure would make this clearer.

Or an explanation of what the frequency field means for nodes that don't represent the ends of words.

@davidflanagan Owner

An explanation of how frequencies are handled for shared suffixes would be nice, too.

apps/keyboard/js/imes/latin/predictions.js
((8 lines not shown))
+// The underlying data structure is a ternary search tree (TST) which also uses
+// a directed acyclic graph to compress suffixes (we call this Ternary DAGs).
+// see http://www.strchr.com/ternary_dags for further details on TDAGs.
+//
+// Every Node in the tree uses this format:
+//
+// Node {
+// int16 ch; // character
+// int16 lPtr; // left child
+// int16 cPtr; // center child
+// int16 rPtr; // right child
+// int16 nPtr; // next child, holds a pointer to the node
+// // with the next highest frequency
+// int16 high; // holds an overflow byte for lPtr, cPtr, rPtr, nPtr
+// // which keeps nodes as small as possible.
+// int16 frequency; // frequency from the XML file
@davidflanagan Owner

What does frequency represent for nodes that are prefixes rather than complete words?

@davidflanagan Owner

You still haven't clarified that this is the frequency of the most common word with this node as a prefix.

@davidflanagan Owner

When you build the dictionary, can you use the length of the prefix and the length of the word to weight the frequencies?

Our dictionary lists 'released' and 'received' as the most common words that begin with 'r'. But when the user types 're', we should really predict short words like 'red', 'rest' and 'read' before long words. I understand that we can't do this easily while searching the dictionary as it is currently structured. But does the data structure allow you to take prefix length and word length into account when building the dictionary?
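
One possible reading of this suggestion, sketched under assumptions: scale each word's frequency by the fraction of the word that the prefix already covers. The function names are hypothetical, the formula is only one of many candidates, and the frequencies below are made up for illustration; none of this comes from the patch or the real dictionary.

```javascript
'use strict';

// Hypothetical build-time weighting: scale a word's frequency by the
// fraction of the word that the prefix already covers.
function lengthWeightedFrequency(frequency, prefixLength, wordLength) {
  if (prefixLength < 1 || prefixLength > wordLength) {
    throw new RangeError('prefix length must be between 1 and word length');
  }
  return frequency * prefixLength / wordLength;
}

// Rank the words that start with `prefix`, best first.
function rankCompletions(words, prefix) {
  return words
    .filter(function(w) { return w.word.indexOf(prefix) === 0; })
    .map(function(w) {
      return {
        word: w.word,
        score: lengthWeightedFrequency(w.frequency, prefix.length,
                                       w.word.length)
      };
    })
    .sort(function(a, b) { return b.score - a.score; });
}

// Made-up frequencies, chosen only to illustrate the effect.
var sample = [
  { word: 'released', frequency: 180 },
  { word: 'received', frequency: 175 },
  { word: 'read', frequency: 130 },
  { word: 'red', frequency: 120 },
  { word: 'rest', frequency: 110 }
];
var ranked = rankCompletions(sample, 're').map(function(r) {
  return r.word;
});
```

With these sample numbers, the short words move ahead of 'released' and 'received' even though their raw frequencies are lower.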

@davidflanagan davidflanagan commented on the diff
apps/keyboard/js/imes/latin/predictions.js
((5 lines not shown))
+// Description of the algorithm:
+//
+// We use a precompiled dictionary that is loaded into a typed Array (_dict).
+// The underlying data structure is a ternary search tree (TST) which also uses
+// a directed acyclic graph to compress suffixes (we call this Ternary DAGs).
+// see http://www.strchr.com/ternary_dags for further details on TDAGs.
+//
+// Every Node in the tree uses this format:
+//
+// Node {
+// int16 ch; // character
+// int16 lPtr; // left child
+// int16 cPtr; // center child
+// int16 rPtr; // right child
+// int16 nPtr; // next child, holds a pointer to the node
+// // with the next highest frequency
@davidflanagan Owner

"next highest" compared to what?

@ckerschb
Collaborator

@davidflanagan, I added more documentation; hopefully this makes things clearer now.

@ckerschb
Collaborator

@gregorwagner, is anything missing in the documentation?

@rwaldron

If body has a value, the return from Utils.escapeHTML(body); will have a value, even if the value is nothing more than a single space.

@rwaldron

Why not pass just the activity.source.data object? For MMS we're going to need to support multiple recipients.

@gregorwagner

Looks great! One thing that would help a lot would be an example of maybe 2 connected nodes. For instance, if you type 'a', what does the 'a' node look like, where do its pointers point, and what does the next center node look like? ASCII art maybe, or a hand drawing that we can put in the folder?
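
A small sketch of such connected nodes, using the characters and frequencies from the 't' / 's' / 'h' / 'u' example in the documentation comment. Plain object references stand in for the int16 offsets into `_dict`, the `lPtr`, `rPtr` and `high` fields are omitted, and `makeNode` and `followNext` are hypothetical helper names, so this is an illustration rather than the actual storage format.

```javascript
'use strict';

// Simplified node: object references replace the int16 offsets.
function makeNode(ch, frequency) {
  return { ch: ch, frequency: frequency, cPtr: null, nPtr: null };
}

// First-letter level: 't' (222), then 's' (170) via nPtr.
var t = makeNode('t', 222);
var s = makeNode('s', 170);
t.nPtr = s;

// Second-letter level below 's': 'h' (as in 'she'), then 'u' (as in 'such').
var sh = makeNode('h', 170);
var su = makeNode('u', 170);
s.cPtr = sh;
sh.nPtr = su;

// Collect the characters reached by following nPtr from a start node.
function followNext(node) {
  var chars = [];
  while (node) {
    chars.push(node.ch);
    node = node.nPtr;
  }
  return chars;
}
```

Here `followNext(t)` visits 't' and then 's', while `followNext(s.cPtr)` visits 'h' and then 'u': the nPtr chain links nodes on the same level in descending frequency order, without touching the subtrees in between.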

@davidflanagan davidflanagan commented on the diff
apps/keyboard/js/imes/latin/predictions.js
((54 lines not shown))
+// In this case, for example, we would also add 'a' to this array of
+// candidates because 'a' is a surrounding key of 's' and there is a
+// word that starts with that prefix in the TDAG (e.g. 'and').
+// This array is sorted, so that the highest ranked node is found
+// at index 0.
+//
+// Second, the function 'predictSuffixes' iterates that array of
+// candidates and follows the center pointers (cPtr).
+// The TST is a balanced binary search tree with one exception.
+// The node with the highest frequency is assigned to the center
+// pointer (cPtr). This means, that following the cPtr we always
+// find the word with the highest frequency starting with that
+// prefix.
+// So the center pointer of 's' points to 'h' ('she' ranked 170)
+// The nPtr in the node of that 'h' (prefix 's') points to
+// 'u' ('such' ranked also 170), and so on. While following
@davidflanagan Owner

Ah! Now I understand. Before I thought you were saying that the nptr of "s" pointed to "su".

apps/keyboard/js/imes/latin/predictions.js
((14 lines not shown))
+// Node {
+// int16 ch; // character
+// int16 lPtr; // left child
+// int16 cPtr; // center child
+// int16 rPtr; // right child
+// int16 nPtr; // next child, holds a pointer to the node
+// // with the next highest frequency
+// int16 high; // holds an overflow byte for lPtr, cPtr, rPtr, nPtr
+// // which keeps nodes as small as possible.
+// int16 frequency; // frequency from the XML file, or
+// // average of compressed/combined nodes.
+// };
+//
+// The algorithm operates in two stages:
+//
+// First, we permutate the user input (prefix) by
@davidflanagan Owner

Do you mean to say here that you permute (note spelling) each node that you visit while traversing the tree?

apps/keyboard/js/imes/latin/predictions.js
((16 lines not shown))
+// int16 lPtr; // left child
+// int16 cPtr; // center child
+// int16 rPtr; // right child
+// int16 nPtr; // next child, holds a pointer to the node
+// // with the next highest frequency
+// int16 high; // holds an overflow byte for lPtr, cPtr, rPtr, nPtr
+// // which keeps nodes as small as possible.
+// int16 frequency; // frequency from the XML file, or
+// // average of compressed/combined nodes.
+// };
+//
+// The algorithm operates in two stages:
+//
+// First, we permutate the user input (prefix) by
+// * inserting a character;
+// following direct successors of the current node.
@davidflanagan Owner

I don't understand this new line.

@davidflanagan davidflanagan commented on the diff
apps/keyboard/js/imes/latin/predictions.js
((34 lines not shown))
+// find direct successor nodes with the next character
+// following the skipped one.
+// * replacing characters with surrounding key-characters;
+// if the character in the successor is a neighbouring
+// key of the current key, we also follow this path.
+// * transposing characters.
+// we swap neighboring characters in the prefix and try to
+// find successor nodes in the TST.
+//
+// This user input permutation is done while traversing the TDAG trying
+// to find possible candidates. Note, that we multiply the frequency
+// only for exact matches in TDAG. In other words, if there is a word
+// in the TDAG that starts with that prefix, the user most probably
+// has not mistapped it. Therefore we need to boost this candidate.
+// All other permutations are treated equally. We do not rank
+// candidates differently based on the detected error.
@davidflanagan Owner

Interesting. I would have done the opposite and reduced the weight of prefixes that were permuted, leaving unpermuted ones alone...

In any case, I think we need tuneability here. I think, for example, that prefixes generated by replacing characters with nearby characters should have a higher weight than prefixes generated by inserting and deleting characters.

@davidflanagan davidflanagan commented on the diff
apps/keyboard/js/imes/latin/predictions.js
((44 lines not shown))
+// to find possible candidates. Note, that we multiply the frequency
+// only for exact matches in TDAG. In other words, if there is a word
+// in the TDAG that starts with that prefix, the user most probably
+// has not mistapped it. Therefore we need to boost this candidate.
+// All other permutations are treated equally. We do not rank
+// candidates differently based on the detected error.
+//
+// For example, the user taps 's' on the keyboard.
+// Therefore, we add the root node for 's' to the array of
+// candidates (which will later predict she, such, some,...).
+// In this case, for example, we would also add 'a' to this array of
+// candidates because 'a' is a surrounding key of 's' and there is a
+// word that starts with that prefix in the TDAG (e.g. 'and').
+// This array is sorted, so that the highest ranked node is found
+// at index 0.
+//
@davidflanagan Owner

But if we suggest "and" when the user types "s" we have failed! So we need weighting by length, somehow.

apps/keyboard/js/imes/latin/predictions.js
((76 lines not shown))
+// Using this linked list we can prune whole subtrees, which favors
+// lookup speed.
+//
+// Again, a character equal to 0 (node.ch == 0) indicates we have
+// reached the end of a word in the tree. The frequency associated
+// with this node is the frequency of that word. Note that we
+// compress suffixes where the character in the node matches, but
+// not necessarily the frequency; therefore we average the frequency
+// of all compressed suffix nodes which are combined into such a
+// suffix node. Even though this seems to be not accurate and might
+// cause mispredictions, we highlight the fact that commonly shorter
+// input words (prefixes) are not compressed, which means that the
+// correct frequency is still stored in that node. Once the input
+// words get longer and longer, we have already narrowed the search
+// space for that prefix, so that the averaging of frequencies
+// in compressed nodes does not cause mispredictions.
@davidflanagan Owner

I'd say "hopefully does not" :-)
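
To make the concern concrete, here is a minimal sketch of the averaging described in the comment. The helper names are hypothetical, and rounding to an integer is an assumption made here because the node stores its frequency in an int16 field.

```javascript
'use strict';

// Average the end-of-word frequencies of the suffix nodes that are
// merged into one shared node. Rounding is an assumption (int16 field).
function mergedFrequency(frequencies) {
  if (frequencies.length === 0) {
    throw new Error('need at least one frequency');
  }
  var sum = frequencies.reduce(function(acc, f) { return acc + f; }, 0);
  return Math.round(sum / frequencies.length);
}

// Largest deviation between any original frequency and the stored average.
function maxDeviation(frequencies) {
  var merged = mergedFrequency(frequencies);
  return Math.max.apply(null, frequencies.map(function(f) {
    return Math.abs(f - merged);
  }));
}
```

For two merged suffixes with frequencies 170 and 110, the shared node would store 140, so each word's ranking is off by 30. Whether that changes the order of suggestions depends on how close the competing candidates are.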

@ckerschb
Collaborator

I opened a new pull request:
#9106

@ckerschb ckerschb closed this
3  apps/communications/contacts/js/contacts.js
@@ -793,6 +793,9 @@ var Contacts = (function() {
   if (!handling || ActivityHandler.activityName === 'pick') {
     initContactsList();
     checkUrl();
+  } else {
+    // Unregister here to avoid unnecessary list operations.
+    navigator.mozContacts.oncontactchange = null;
   }
   window.dispatchEvent(new CustomEvent('asyncScriptsLoaded'));
 });
152 apps/keyboard/js/imes/latin/predictions.js
@@ -27,6 +27,158 @@
// predict: given an input string, return the most likely
// completions or corrections for it.
//
+//
+// Description of the algorithm:
+//
+// We use a precompiled dictionary that is loaded into a typed Array (_dict).
+// The underlying data structure is a ternary search tree (TST) which also uses
+// a directed acyclic graph to compress suffixes (we call this a Ternary DAG,
+// or TDAG). See http://www.strchr.com/ternary_dags for further details.
+//
+// Every Node in the TDAG uses this format:
+//
+// Node {
+//   int16 ch;          // character
+//   int16 lPtr;        // left child
+//   int16 cPtr;        // center child
+//   int16 rPtr;        // right child
+//   int16 nPtr;        // next child, holds a pointer to the node
+//                      // with the next highest frequency after the
+//                      // frequency in the current node.
+//   int16 high;        // holds an overflow byte for lPtr, cPtr, rPtr, nPtr
+//                      // which keeps nodes as small as possible.
+//   int16 frequency;   // frequency from the XML file, or
+//                      // average of compressed/combined nodes.
+// };
+//
+// The algorithm operates in two stages:
+//
+// First, we permute the user input (prefix) by
+//   * inserting a character;
+//       we follow the direct successors of the current node.
+//   * deleting a character;
+//       we skip one character in the prefix and try to
+//       find direct successor nodes with the next character
+//       following the skipped one.
+//   * replacing characters with surrounding key-characters;
+//       if the character in the successor is a neighboring
+//       key of the current key, we also follow this path.
+//   * transposing characters;
+//       we swap neighboring characters in the prefix and try to
+//       find successor nodes in the TST.
+//
+// This user input permutation is done while traversing the TDAG trying
+// to find possible candidates. Note that we multiply the frequency
+// only for exact matches in the TDAG. In other words, if there is a word
+// in the TDAG that starts with that prefix, the user most probably
+// has not mistapped it. Therefore we need to boost this candidate.
+// All other permutations are treated equally. We do not rank
+// candidates differently based on the detected error.
+//
+// For example, the user taps 's' on the keyboard.
+// Therefore, we add the prefix for 's', with a pointer to the next
+// best candidate ('h' following the 's'), to the array of
+// candidates (which will later predict 'she', and then 'such',
+// 'some', ...). This insertion into the sorted candidates array is
+// based on the frequency of this node.
+// In this case, for example, we would also add 'a' to this array of
+// candidates because 'a' is a surrounding key of 's' and there is a
+// word that starts with that prefix in the TDAG (e.g. 'and').
+// This array is sorted, so that the highest ranked node is found
+// at index 0.
+//
+// Second, the function 'predictSuffixes' iterates that array of
+// candidates and follows the center pointers (cPtr).
+// The TST is a balanced binary search tree with one exception.
+// The node with the highest frequency is assigned to the center
+// pointer (cPtr). This means that by following the cPtr we always
+// find the word with the highest frequency starting with that
+// prefix.
+// So the center pointer of 's' points to 'h' ('she' ranked 170).
+// The nPtr in the node of that 'h' (prefix 's') points to
+// 'u' ('such' ranked also 170), and so on. While following
+// the cPtr we keep adding candidates to the candidates array.
+// Once we reach the end of a word (node.ch == 0) we take out
+// the next best candidate from the sorted candidates array.
+// Since we add candidates while following the cPtr we now
+// might find a new, better ranked candidate at index 0 in
+// the sorted array. We can see this nPtr as a kind of linked list.
+// Using this linked list we can prune whole subtrees, which favors
+// lookup speed.
+//
+// Again, a character equal to 0 (node.ch == 0) indicates we have
+// reached the end of a word in the tree. The frequency associated
+// with this node is the frequency of that word. Note that we
+// compress suffixes where the character in the node matches, but
+// not necessarily the frequency; therefore we average the frequency
+// of all compressed suffix nodes which are combined into such a
+// suffix node. Even though this is not accurate and might
+// cause mispredictions, note that shorter input words (prefixes)
+// are commonly not compressed, which means that the
+// correct frequency is still stored in that node. Once the input
+// words get longer and longer, we have already narrowed the search
+// space for that prefix, so that the averaging of frequencies
+// in compressed nodes hopefully does not cause mispredictions.
+//
+// Once the algorithm reaches the maximum number of requested
+// suggestions (_maxSuggestions), we return that array of possible
+// alternatives which are displayed for the user.
+//
+// A simplified example to demonstrate the use of the nPtr:
+//
+//
+//                  -------------
+//                  | ch: 't'   |
+//                  | cPtr: 'h' |
+//                  | nPtr: 's' | <-!!!
+//                  | freq: 222 |
+//                  -------------
+//                    /   |   \
+//                  /     |     \
+//                /       |       \
+//  -------------   -------------   -------------
+//  | ch: 'k'   |   | ch: 'h'   |   | ch: 'u'   |
+//  | cPtr: *** |   | cPtr: *** |   | cPtr: *** |
+//  | nPtr: *** |   | nPtr: *** |   | nPtr: *** |
+//  | freq: 160 |   | freq: 222 |   | freq: *** |
+//  -------------   -------------   -------------
+//     /  |  \         /  |  \         /  |  \
+//    *   *   \       *   *   *       *   *   *
+//             \
+//              -------------
+//              | ch: 's'   |
+//              | cPtr: 'h' |
+//              | nPtr: *** |
+//              | freq: 170 |
+//              -------------
+//                 /  |  \
+//                *   |   *
+//                    |
+//              -------------
+//              | ch: 'h'   |
+//              | cPtr: 'e' |
+//              | nPtr: 'u' | <-!!!
+//              | freq: 170 |
+//              -------------
+//                 /  |  \
+//                   |*|
+//                        \
+//                         -------------
+//                         | ch: 'u'   |
+//                         | cPtr: 'c' |
+//                         | nPtr: *** |
+//                         | freq: 170 |
+//                         -------------
+//
+// The root node is 't', whose cPtr points to 'h', whose cPtr points
+// to 'e' ('the'). The lPtr of 't' points to 'k' (binary tree), but
+// the nPtr of 't' points to 's'. The cPtr of 's' points to 'h',
+// whose cPtr points to 'e' ('she'). The nPtr of the node with ch 'h'
+// points to 'u', because the word with the next highest frequency
+// in the dictionary is 'such'. This way, we can prune whole subtrees
+// and take shortcuts in the tree to the candidate with the next best
+// frequency after the current frequency.
+
'use strict';
var Predictions = function() {
17 apps/sms/js/activity_handler.js
@@ -3,8 +3,19 @@
 'use strict';
-function showThreadFromSystemMessage(number) {
+function showThreadFromSystemMessage(number, body) {
   var showAction = function act_action(number) {
+    // If we only have a body, just trigger a new message.
+    if (!number && body) {
+      var escapedBody = Utils.escapeHTML(body);
+      if (escapedBody === '') {
+        return;
+      }
+      MessageManager.activityBody = escapedBody;
+      window.location.hash = '#new';
+      return;
+    }
+
     var currentLocation = window.location.hash;
     switch (currentLocation) {
       case '#thread-list':
@@ -58,8 +69,8 @@ window.navigator.mozSetMessageHandler('activity', function actHandle(activity) {
     return;
   MessageManager.lockActivity = true;
   activity.postResult({ status: 'accepted' });
-  var number = activity.source.data.number;
-  showThreadFromSystemMessage(number);
+  showThreadFromSystemMessage(activity.source.data.number,
+                              activity.source.data.body);
 });
 /* === Incoming SMS support === */
/* === Incoming SMS support === */
6 apps/sms/js/message_manager.js
@@ -6,6 +6,7 @@
 var MessageManager = {
   currentNum: null,
   currentThread: null,
+  activityBody: null, // Used when getting a sms:?body=... activity.
   init: function mm_init(callback) {
     if (this.initialized) {
       return;
@@ -168,6 +169,11 @@ var MessageManager = {
   contactButton.parentNode.appendChild(contactButton);
   document.getElementById('messages-container').innerHTML = '';
   ThreadUI.cleanFields();
+  // If the message has a body, use it to populate the input field.
+  if (MessageManager.activityBody) {
+    input.value = MessageManager.activityBody;
+    MessageManager.activityBody = null;
+  }
   // Cleaning global params related with the previous thread
   MessageManager.currentNum = null;
   MessageManager.currentThread = null;