Merge 6cd3345 into 16b953c
vladoohr committed Feb 20, 2020
2 parents 16b953c + 6cd3345 commit 383e099
Showing 28 changed files with 1,126 additions and 535 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -47,4 +47,4 @@ docs/_build/

# Keras model files
history.p
keras_model.*
keras_model*
56 changes: 34 additions & 22 deletions README.md
@@ -61,19 +61,6 @@ knowledgehub -c /etc/ckan/default/production.ini db init
sudo service apache2 reload
```

6. Run the predictive search command periodically (daily or weekly is recommended).
This command starts training the model:

```
knowledgehub -c /etc/ckan/default/production.ini predictive_search train
```

There is an action that can run CLI commands for Knowledge Hub.
This example shows how to run the above command through the API action:
```
curl -v 'http://hostname/api/3/action/run_command' -H'Authorization: API-KEY' -d '{"command": "predictive_search train"}'
```

### Config Settings

These are the required configuration options used by the extension:
@@ -94,35 +81,40 @@ ckanext.knowledgehub.sub_themes_per_page = 20
ckanext.knowledgehub.dashboards_per_page = 20
```
4. Predictive Search
- Length of the seuqunce after which the model can start predict, recommended at least 15 chars long
- Length of the sequence after which the model can start predicting; at least 10 characters is recommended
```
# (optional, default: 10)
ckanext.knowledgehub.rnn.sequence_length = 12
```
- Number of chars to be skipped in generation of next sentence
```
# (optional, default: 3)
# (optional, default: 1)
ckanext.knowledgehub.rnn.sentence_step = 2
```
- Number of predictions to return
```
# (optional, default: 3)
ckanext.knowledgehub.rnn.number_prediction = 2
ckanext.knowledgehub.rnn.number_predictions = 2
```
- Minimum length of the corpus after which the model should start to predict
```
# (optional, default: 3)
# (optional, default: 10000)
ckanext.knowledgehub.rnn.min_length_corpus = 300
```
- Maximum epochs to learn
```
# (optional, default: 50)
ckanext.knowledgehub.rnn.max_epochs = 30
```
- Full path to the RNN model
- Full path to the RNN weights model
```
# (optional, default: ./keras_model_weights.h5)
ckanext.knowledgehub.rnn.model_weights = /home/user/model_weights.h5
```
# (optional, default: ./keras_model.h5)
ckanext.knowledgehub.rnn.model = /home/user/model.h5
- Full path to the RNN network model
```
# (optional, default: ./keras_model_network.h5)
ckanext.knowledgehub.rnn.model_network = /home/user/model_network.h5
```
- Full path to the model history
```
@@ -226,7 +218,7 @@ knowledgehub -c /etc/ckan/default/production.ini search-index rebuild --model da

This would rebuild the index for dashboards.

Available model types are:
Available model types are:
* `ckan` - rebuilds the CKAN core (package) index,
* `dashboard` - rebuilds the dashboards index,
* `research-question` - rebuilds the research questions index and
@@ -264,4 +256,24 @@ The crontab should look something like this:

Data Quality is measured across the six primary dimensions for data quality assessment.

A lot more details are available in the dedicated [documentation section](docs/data-qualtiy-metrics.md).
A lot more details are available in the dedicated [documentation section](docs/data-qualtiy-metrics.md).

# Predictive search

The predictive search functionality predicts the next n characters in a word or the most likely next word.
The training data consists of the title and description of all entities on Knowledge Hub, including themes, sub-themes,
research questions, datasets, visualizations and dashboards. The machine learning model has to be trained before it can make predictions.
By default, the user must type 10 characters in the search box on the home page before predictions start.
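
The commit does not show how the model turns the corpus into training samples, so the following is only a minimal sketch of the usual character-level approach, assuming that `sequence_length` is the window size and `sentence_step` is the stride between windows. The function name `make_training_windows` is hypothetical and the sample text is made up.

```python
def make_training_windows(corpus, sequence_length=10, step=1):
    """Cut the corpus into fixed-length windows and the character that follows each one."""
    windows = []
    next_chars = []
    # Stop early enough that every window still has a following character.
    for start in range(0, len(corpus) - sequence_length, step):
        windows.append(corpus[start:start + sequence_length])
        next_chars.append(corpus[start + sequence_length])
    return windows, next_chars


# Assumed sample corpus; the real one is built from titles and descriptions.
windows, next_chars = make_training_windows(
    "knowledge hub data", sequence_length=10, step=3)

for window, next_char in zip(windows, next_chars):
    print(repr(window), "->", repr(next_char))
```

A larger step produces fewer, less overlapping samples, which shortens training at the cost of seeing each character in fewer contexts.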

Run the predictive search command periodically (daily or weekly is recommended).
This command starts training the model:

```
knowledgehub -c /etc/ckan/default/production.ini predictive_search train
```

There is an action that can run CLI commands for Knowledge Hub.
This example shows how to run the above command through the API action:
```
curl -v 'http://hostname/api/3/action/run_command' -H'Authorization: API-KEY' -d '{"command": "predictive_search train"}'
```
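
For scheduled jobs written in Python, the same call can be prepared with the standard library. This is a sketch in Python 3 that only builds the request; `build_train_request` is a hypothetical helper, `http://hostname` and `API-KEY` are the placeholders from the curl example, and the explicit JSON content type is an assumption that curl does not send by default.

```python
import json
import urllib.request


def build_train_request(base_url, api_key):
    """Build a POST request for the run_command action without sending it."""
    payload = json.dumps({"command": "predictive_search train"}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/3/action/run_command",
        data=payload,
        headers={
            "Authorization": api_key,
            # Assumption: declare the body as JSON explicitly.
            "Content-Type": "application/json",
        },
    )


request = build_train_request("http://hostname", "API-KEY")

# Sending is left commented out because it needs a live Knowledge Hub instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```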
5 changes: 3 additions & 2 deletions ckanext/knowledgehub/cli/predictive_search.py
@@ -4,7 +4,7 @@
import logging

from ckanext.knowledgehub.cli import error_shout
from ckanext.knowledgehub.rnn.worker import learn
from ckanext.knowledgehub.lib.rnn import PredictiveSearchWorker

log = logging.getLogger(__name__)

@@ -19,7 +19,8 @@ def train():
u'''Train the predictive search model'''
log.info(u"Train the predictive search model")
try:
learn()
worker = PredictiveSearchWorker()
worker.run()
except Exception as e:
error_shout(e)
else:
18 changes: 11 additions & 7 deletions ckanext/knowledgehub/fanstatic/javascript/search_prediction.js
@@ -50,6 +50,7 @@
x[i].parentNode.removeChild(x[i]);
}
}
currentFocus = -1;
}

$(document).ready(function () {
@@ -58,21 +59,21 @@
searchInput
.bind("change keyup", function (event) {
clearTimeout(timer)
if (!(event.keyCode >= 13 && event.keyCode <= 20) && !(event.keyCode >= 37 && event.keyCode <= 40)) {
if (!(event.keyCode >= 13 && event.keyCode <= 20) && !(event.keyCode >= 37 && event.keyCode <= 40) && event.keyCode != 27) {
// detect that user has stopped typing for a while
timer = setTimeout(function() {
var text = searchInput.val();

if (text !== '') {
api.get('get_predictions', {
text: text
query: text
}, true)
.done(function (data) {
if (data.success) {
var a, b;
var results = data.result;

closeAllLists()
closeAllLists();

a = document.createElement("DIV");
a.setAttribute("id", "autocomplete-list");
@@ -84,8 +85,9 @@
b.innerHTML = text;
b.innerHTML += "<strong>" + r + "</strong>";
b.addEventListener("click", function (e) {
searchInput.val(text + r);
closeAllLists();
searchInput.val(text + r);
searchInput.trigger("change");
});
a.append(b)
});
@@ -95,25 +97,24 @@
console.log("Get predictions: " + error.statusText);
});
}
}, 500);
}, 300);
}
})
});

$('.search-input-group').on("mouseover", autocompleteItems, function(e){

var activeItem = document.getElementsByClassName('autocomplete-active')[0];
activeItem ? activeItem.classList.remove('autocomplete-active') : null;
event.target !== input ? event.target.classList.add('autocomplete-active') : null;
var p = e.target.parentElement;
var index = Array.prototype.indexOf.call(p.children, e.target);
activeItem ? currentFocus = index : currentFocus = -1

});

searchInput.on('keydown', function (e) {
var x = document.getElementById("autocomplete-list");
if (x) x = x.getElementsByTagName("div");

if (e.keyCode == 40) {
// The arrow DOWN key is pressed
currentFocus++;
@@ -125,9 +126,12 @@
} else if (e.keyCode == 13) {
// ENTER key is pressed
if (currentFocus > -1) {
e.preventDefault();
// simulate a click on the "active" item*
if (x) x[currentFocus].click();
}
} else if (e.keyCode == 27) {
closeAllLists();
}
});
})(ckan.i18n.ngettext, $);
18 changes: 16 additions & 2 deletions ckanext/knowledgehub/fanstatic/javascript/user_query_result.js
@@ -37,10 +37,10 @@
result_id: result_id
})
.done(function (data) {
console.log("User Quere Result: SAVED!");
console.log("User Query Result: SAVED!");
})
.fail(function (error) {
console.log("User Quere Result failed: " + error.statusText);
console.log("User Query Result failed: " + error.statusText);
});
}
})
@@ -49,6 +49,19 @@
});
}

function saveKnowledgeHubData(query_text) {
api.post('kwh_data_create', {
type: 'search_query',
title: query_text
})
.done(function (data) {
console.log("User query added to kwh data");
})
.fail(function (error) {
console.log("Failed to add user query to kwh data: " + error.statusText);
});
}

$(document).ready(function () {
var save_user_query = function(callback) {
var tab_content = $('.tab_content');
@@ -81,6 +94,7 @@
if (query_text) {
var user_id = $('#user-id').val();
saveUserQueryResult(query_text, result_type, result_id, user_id)
saveKnowledgeHubData(query_text, user_id)
}
}
});
1 change: 0 additions & 1 deletion ckanext/knowledgehub/helpers.py
@@ -31,7 +31,6 @@

from ckanext.knowledgehub.model import Dashboard
from ckanext.knowledgehub.model import ResourceValidation
from ckanext.knowledgehub.rnn import helpers as rnn_helpers


log = logging.getLogger(__name__)
8 changes: 8 additions & 0 deletions ckanext/knowledgehub/lib/rnn/__init__.py
@@ -0,0 +1,8 @@
from ckanext.knowledgehub.lib.rnn.worker import PredictiveSearchWorker
from ckanext.knowledgehub.lib.rnn.model import PredictiveSearchModel


__all__ = [
    'PredictiveSearchWorker',
    'PredictiveSearchModel'
]
40 changes: 40 additions & 0 deletions ckanext/knowledgehub/lib/rnn/config.py
@@ -0,0 +1,40 @@
import os
import time

from ckan.common import config


class PredictiveSearchConfig(object):
    ''' Hold the configuration for the machine learning model and worker '''

    def __init__(self):
        self.corpus_length = int(config.get(
            u'ckanext.knowledgehub.rnn.min_length_corpus', 10000))
        # Values read from the .ini file are strings, so coerce to int
        # like the other numeric settings
        self.sequence_length = int(config.get(
            u'ckanext.knowledgehub.rnn.sequence_length', 10))
        self.step = int(
            config.get(u'ckanext.knowledgehub.rnn.sentence_step', 1))
        self.epochs = int(
            config.get(u'ckanext.knowledgehub.rnn.max_epochs', 50))
        self.weights_path = config.get(
            u'ckanext.knowledgehub.rnn.model_weights',
            './keras_model_weights.h5'
        )
        self.network_path = config.get(
            u'ckanext.knowledgehub.rnn.model_network',
            './keras_model_network.h5'
        )
        self.history_path = config.get(
            u'ckanext.knowledgehub.rnn.history',
            './history.p'
        )
        # Timestamped temporary file next to the final weights file
        self.temp_weigths_path = os.path.join(
            os.path.dirname(self.weights_path),
            'keras_model_%s.h5' % time.time()
        )
        self.number_predictions = int(
            config.get(
                u'ckanext.knowledgehub.rnn.number_predictions',
                3
            )
        )
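
The pattern in `PredictiveSearchConfig` can be tried outside CKAN with a plain dict standing in for `ckan.common.config`. This is a sketch under that assumption, written for Python 3; `read_rnn_settings` is a hypothetical function and not part of the extension. Values that come from an `.ini` file are strings, which is why numeric settings need `int()` before they are used as lengths or counts.

```python
import os
import time


def read_rnn_settings(config):
    """Read predictive search settings from a dict-like config, coercing numbers."""
    prefix = "ckanext.knowledgehub.rnn."
    weights_path = config.get(
        prefix + "model_weights", "./keras_model_weights.h5")
    return {
        "corpus_length": int(config.get(prefix + "min_length_corpus", 10000)),
        "sequence_length": int(config.get(prefix + "sequence_length", 10)),
        "step": int(config.get(prefix + "sentence_step", 1)),
        "epochs": int(config.get(prefix + "max_epochs", 50)),
        "number_predictions": int(config.get(prefix + "number_predictions", 3)),
        "weights_path": weights_path,
        # Timestamped temporary file in the same directory as the weights.
        "temp_weights_path": os.path.join(
            os.path.dirname(weights_path), "keras_model_%s.h5" % time.time()),
    }


# A dict with string values, as they would arrive from an .ini file.
settings = read_rnn_settings({
    "ckanext.knowledgehub.rnn.sequence_length": "12",
    "ckanext.knowledgehub.rnn.model_weights": "/home/user/model_weights.h5",
})
print(settings["sequence_length"], settings["temp_weights_path"])
```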
57 changes: 57 additions & 0 deletions ckanext/knowledgehub/lib/rnn/data_manager.py
@@ -0,0 +1,57 @@
import ckan.plugins.toolkit as toolkit


class DataManager:
    ''' Manage the training data'''

    @staticmethod
    def create_corpus(corpus):
        ''' Store the machine learning corpus
        :param corpus: the machine learning corpus
        :type corpus: string
        :returns: the stored corpus
        :rtype: dict
        '''
        return toolkit.get_action('corpus_create')(
            {'ignore_auth': True},
            {'corpus': corpus}
        )

    @staticmethod
    def get_corpus():
        ''' Get the data in knowledgehub and create corpus
        :returns: the machine learning corpus
        :rtype: string
        '''
        kwh_data = toolkit.get_action(
            'kwh_data_list')({'ignore_auth': True}, {})

        corpus = ''
        if kwh_data.get('total'):
            data = kwh_data.get('data', [])
            for entry in data:
                corpus += ' %s' % entry.get('title')
                if entry.get('description'):
                    corpus += ' %s' % entry.get('description')

        return corpus

    @staticmethod
    def get_last_corpus():
        ''' Return the corpus used in the last training of the model '''

        return toolkit.get_action('get_last_rnn_corpus')(
            {'ignore_auth': True}, {})

    @staticmethod
    def prepare_corpus(corpus):
        ''' Find the unique chars in the corpus and index the characters '''

        unique_chars = sorted(list(set(corpus)))
        char_indices = dict((c, i) for i, c in enumerate(unique_chars))
        indices_char = dict((i, c) for i, c in enumerate(unique_chars))

        return (unique_chars, char_indices, indices_char)
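
To illustrate what `prepare_corpus` returns, here is a standalone sketch of the same indexing logic in Python 3, with a made-up sample string in place of the real corpus. The round trip at the end is an illustration only; the commit does not show how the model consumes these mappings.

```python
def prepare_corpus(corpus):
    """Index the unique characters of the corpus in both directions."""
    unique_chars = sorted(set(corpus))
    char_indices = {c: i for i, c in enumerate(unique_chars)}
    indices_char = {i: c for i, c in enumerate(unique_chars)}
    return unique_chars, char_indices, indices_char


# Assumed sample text; the real corpus comes from DataManager.get_corpus().
sample = "data hub"
unique_chars, char_indices, indices_char = prepare_corpus(sample)

# Encode the text as integer indices and decode it back.
encoded = [char_indices[c] for c in sample]
decoded = "".join(indices_char[i] for i in encoded)
print(unique_chars)
print(encoded)
print(decoded)
```

Sorting the characters makes the mapping deterministic for a given corpus, so the indices only stay valid for stored weights while the corpus keeps the same character set.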