Migrating classifier data from an older classifier-reborn structure #172

Open
tracegilton opened this issue Feb 22, 2018 · 14 comments

@tracegilton

Hi all,

I am attempting to upgrade a classifier that I built using a previous version of ClassifierReborn::Bayes. It looks like I initialized the classifier before backends were added, so now I am running into compatibility issues. I store the classifier structure on disk with Marshal and now when it is loaded it does not have the backend attribute that the newer gem expects.

Is there a best practice for how to update the older classifier so that it will be compatible with the backends system?

@Ch4s3
Member

Ch4s3 commented Feb 26, 2018

Interesting. I guess we should have thought this through before publishing the new code. @ibnesayeed do you have any thoughts?

Without having taken a look yet, I imagine you could do some metaprogramming to open the class and add the backend attribute. But, I'm not sure. I'll give it some thought.
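
A minimal sketch of that metaprogramming idea, using a hypothetical `LegacyClassifier` stand-in rather than the real `ClassifierReborn::Bayes` (its internal instance variables are not shown in this thread, so the names below are assumptions). `Marshal.load` does not call `initialize`, so an instance variable introduced after the dump is simply missing and can be set by hand:

```ruby
# Hypothetical stand-in for a classifier written before backends existed.
class LegacyClassifier
  attr_reader :categories

  def initialize
    @categories = { "Ham" => { "work" => 2 } }
  end
end

dump = Marshal.dump(LegacyClassifier.new)

# Simulate the upgrade: newer code expects a @backend instance variable.
class LegacyClassifier
  attr_reader :backend
end

# Marshal.load restores the dumped instance variables only; initialize is not run,
# so @backend does not exist on the restored object.
restored = Marshal.load(dump)

# Patch the missing attribute. :placeholder_backend is illustrative; a real
# migration would assign an actual backend object populated with the old data.
unless restored.instance_variable_defined?(:@backend)
  restored.instance_variable_set(:@backend, :placeholder_backend)
end
```

Whether this is enough for the real gem depends on how many other attributes changed alongside the backend, which is not established here.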

@ibnesayeed
Contributor

ibnesayeed commented Feb 26, 2018

Marshaling a complex object always risks breaking compatibility when the data structure or any other attribute changes. I remember wasting a month figuring out why a Weka model was not predicting anything; the culprit turned out to be that the model was built with a different version of Weka than the one used to load the model file.

I think we need to implement an importer/exporter that serializes just the data without tying it to classes or other object state. The output can be something like JSON, YAML, or even Google's Protocol Buffers. This will help migrate models not only from one version to another, but also from one backend to another.

@tracegilton
Author

Thank you for the replies. Exporting as YAML seems like a good way for me to move the trained data into the new structure, but I am not familiar enough with the inner workings of the backends code to know what impact this will have.

I was able to use a new ClassifierReborn exported to YAML as a template and added in my old data and training totals. When re-importing that YAML structure, it looks correct and it can classify data but training new data still fails.

If nothing else, I can use the YAML export to re-train a new classifier, looping over each word and training it as many times as its weight 😅
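
A rough sketch of that fallback, assuming the old data has been reduced to a `category => { word => weight }` hash. The `replay_training` helper and the sample counts are hypothetical; the block is where a real classifier's `train(category, text)` call would go:

```ruby
# Hypothetical word counts recovered from an old export.
old_counts = {
  "Ham"  => { "sunday" => 1, "work" => 2 },
  "Spam" => { "winner" => 2 }
}

# Yields one synthetic training text per category, repeating each word
# as many times as its stored weight.
def replay_training(counts)
  counts.each do |category, words|
    text = words.flat_map { |word, weight| [word] * weight }.join(" ")
    yield category, text
  end
end

replayed = {}
replay_training(old_counts) { |category, text| replayed[category] = text }
# With the gem loaded, the block could call classifier.train(category, text) instead.
```

Two caveats: this produces one training call per category, so the training totals will not match the original, and stored words may already be stemmed, so re-training on them might not reproduce the old keys exactly.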

@ibnesayeed
Contributor

The point of an object-independent, data-only serialization would be to decouple the data from the class structure and object state. Exporting such a data structure means looping through all the stored keys and serializing them in a backend-independent way. Importing it means populating the backend store with those keys and values from the serialized data, rather than loading a ready-made object (as in the case of marshaling).
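
To illustrate the export and import loops described above, here is a small sketch over a hypothetical `ToyBackend`. Its method names (`add_word`, `each_entry`) are made up for this example and are not the gem's backend API:

```ruby
# Hypothetical in-memory store standing in for any backend.
class ToyBackend
  def initialize
    @counts = Hash.new { |hash, category| hash[category] = Hash.new(0) }
  end

  def add_word(category, word, count)
    @counts[category][word] += count
  end

  def each_entry
    @counts.each do |category, words|
      words.each { |word, count| yield category, word, count }
    end
  end
end

# Export: walk every stored key and emit plain hashes, no classifier objects.
def export_data(backend)
  data = {}
  backend.each_entry do |category, word, count|
    (data[category] ||= {})[word] = count
  end
  data
end

# Import: populate a (possibly different) backend from the plain data.
def import_data(backend, data)
  data.each do |category, words|
    words.each { |word, count| backend.add_word(category, word, count) }
  end
  backend
end

source = ToyBackend.new
source.add_word("Ham", "work", 2)
source.add_word("Spam", "winner", 2)

snapshot = export_data(source)
target = import_data(ToyBackend.new, snapshot)
```

Because `snapshot` holds only strings and integers, it can be handed to any serializer, and `target` could just as well be a different backend type.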

@Ch4s3
Member

Ch4s3 commented Feb 27, 2018

@ibnesayeed I feel like an import/export class is probably the right solution. I can take a crack at it maybe this weekend? If you want to take a look, feel free.

@ibnesayeed
Contributor

Once you have a PR in place I will be happy to review it. My current priorities leave me very little time; otherwise I would have implemented it myself.

@Ch4s3
Member

Ch4s3 commented Mar 1, 2018

Sounds good. I'll dig in as soon as I can.

@ibnesayeed
Contributor

We need to implement the following two methods in each backend and have a proxy/alias method to call them from the main Bayes class:

def import(yaml_data_file)
  # Read the yaml_data_file and populate the backend in use
end

def export(yaml_data_file)
  # Traverse the data structure in the used backend and serialize it to the yaml_data_file
end

Instead of taking a file name as the parameter, these methods could accept and return plain objects, moving the serialization/deserialization responsibility into a task or some other method. That way YAML support is not baked in, and alternate formats can be used without changing the underlying implementation.
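
A sketch of what that separation might look like, assuming the backend hands over a plain hash. The `serialize` and `deserialize` helpers and the sample data shape are hypothetical; only Ruby's standard `yaml` and `json` libraries are used:

```ruby
require "json"
require "yaml"

# Plain data as an export method might return it (hypothetical shape).
data = { "total_words" => 7, "categories" => { "Ham" => { "work" => 2 } } }

# Turn plain data into text in the requested format.
def serialize(data, format)
  case format
  when :yaml then YAML.dump(data)
  when :json then JSON.generate(data)
  else raise ArgumentError, "unsupported format: #{format}"
  end
end

# Turn text in the given format back into plain data.
def deserialize(text, format)
  case format
  when :yaml then YAML.safe_load(text)
  when :json then JSON.parse(text)
  else raise ArgumentError, "unsupported format: #{format}"
  end
end

yaml_copy = deserialize(serialize(data, :yaml), :yaml)
json_copy = deserialize(serialize(data, :json), :json)
```

Adding another format then means adding one branch to each helper, while the backends keep dealing only with plain hashes.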

The exported YAML data file (say, bayes-data.yml) will look something like this:

---
# Imported from ClassifierReborn::Bayes
total_words: 7
total_trainings: 3
category_counts:
  - Ham:
    - training: 2
    - word: 4
  - Spam:
    - training: 1
    - word: 3
categories:
  - Ham:
    - sunday: 1
    - holiday: 1
    - work: 2
  - Spam:
    - holiday: 1
    - winner: 2
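
As a quick check of this layout, the following sketch loads the same YAML with Ruby's standard library and flattens the lists of single-key maps into nested hashes. The `merge_pairs` helper is hypothetical and not part of the gem:

```ruby
require "yaml"

yaml_text = <<~EXPORT
  ---
  total_words: 7
  total_trainings: 3
  category_counts:
    - Ham:
      - training: 2
      - word: 4
    - Spam:
      - training: 1
      - word: 3
  categories:
    - Ham:
      - sunday: 1
      - holiday: 1
      - work: 2
    - Spam:
      - holiday: 1
      - winner: 2
EXPORT

raw = YAML.safe_load(yaml_text)

# Merge a list of single-key maps, e.g. [{"a" => 1}, {"b" => 2}], into one hash.
def merge_pairs(list)
  list.reduce({}) { |merged, pair| merged.merge(pair) }
end

word_counts = merge_pairs(raw["categories"]).transform_values { |pairs| merge_pairs(pairs) }
category_counts = merge_pairs(raw["category_counts"]).transform_values { |pairs| merge_pairs(pairs) }

# Sanity check: per-category word totals should add up to total_words.
totals_match = word_counts.values.sum { |words| words.values.sum } == raw["total_words"]
```

A check like `totals_match` could help catch mistakes when old data is edited into an exported template by hand.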

@Ch4s3
Member

Ch4s3 commented Mar 1, 2018

I'm trying to think if we need to do a minor release of a pre-backend version to make this work. Thoughts?

@ibnesayeed
Contributor

> I'm trying to think if we need to do a minor release of a pre-backend version to make this work. Thoughts?

That's a good idea indeed. This feature can be released as minor versions for both the pre- and post-backend releases at the same time.

@Ch4s3
Member

Ch4s3 commented Mar 1, 2018

Ok, I'll try building it against 2.1, and releasing this as 2.1.1 and 2.2.1. It kind of breaks semver to add new functionality in a patch version, but I don't see a way around that.

@ibnesayeed
Contributor

Yes, the backend change was big enough to warrant a major version bump, but we couldn't see it coming. So, for now 2.1.1 and 2.2.1 will do the trick if we aren't too religious about semver.

@Ch4s3
Member

Ch4s3 commented Mar 1, 2018

I agree. I'll take a look tonight then and see if I can pull together a poc.

@Ch4s3 Ch4s3 mentioned this issue Mar 2, 2018
6 tasks
@Ch4s3
Member

Ch4s3 commented Mar 2, 2018

OK, I have a WIP PR at #174.

3 participants