On how to store the models in the database #2

Closed · 5 tasks done
pomodoren opened this issue May 27, 2021 · 7 comments
Comments

pomodoren commented May 27, 2021

The first question seems to be:

Firstly, develop a simple classification algorithm which attempts
to predict the variable "promoted" through the other variables. 
The focus of this model is pure prediction capability.

How to do it is described here.

Bonus: Please save the model, the current page,
the coefficients and any relevant statistical measure 
to the SQLite database (on a different table than "data") while you are updating it.

Step by step
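As a starting point, a minimal sketch of what such a classifier could look like, using scikit-learn's SGDClassifier in a streaming-friendly way. The synthetic features and labels below are stand-ins; the real features and the "promoted" column would come from the "data" table.

```python
# Hedged sketch: an online linear classifier for a binary "promoted" target.
# X and y are synthetic stand-ins, not the real dataset.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                 # stand-in features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # stand-in "promoted" labels

clf = SGDClassifier(loss="hinge", random_state=0)
# partial_fit allows training batch by batch (the "streaming" part);
# classes must be passed on the first call.
clf.partial_fit(X[:100], y[:100], classes=np.array([0, 1]))
clf.partial_fit(X[100:], y[100:])             # a later batch
print(clf.score(X, y))
```

Because training happens through partial_fit, the same estimator can keep learning as new pages of data are ingested.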

@pomodoren commented:

Which is the best model for prediction?

After running the usual suspects (SGD, ASGD, Perceptron, Passive Aggressive I and II), we could see that

  1. SGD
  2. ASGD

did better than the others, and were more stable.
[Screenshot: model accuracy comparison, 2021-05-27 22-33-40]

Which model has the best timing?

Also, after checking their training and prediction times, we see that they are similar (< 100 ms difference).
[Screenshot: training and prediction time comparison, 2021-05-27 22-33-53]

So the choice won't matter that much. As ASGD's prediction time (for 1000 instances) is quicker, we will pick that.
If we run into any issue, we can change to standard SGD.

Additionally, let's read quickly about ASGD and SGD, just so as not to be ignorant.

@pomodoren commented:

So SGD defines how the weights change: stochastic gradient descent on a linear classifier. Right now it has a linear SVM in the background (the default hinge loss). Read more here.
On the other hand ASGD is SGD with average=True.

average: bool or int, default=False
When set to True, computes the averaged SGD weights across all updates and stores the
result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total 
number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples.
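As a quick illustration of the quoted docstring: ASGD is literally just SGDClassifier with the average=True flag (toy data below, purely illustrative):

```python
# ASGD vs plain SGD in scikit-learn: the only difference is average=True,
# which stores the averaged weights in coef_. Toy data for illustration.
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]] * 5)
y = np.array([1, 1, 0, 0] * 5)

sgd = SGDClassifier(random_state=0).fit(X, y)                 # plain SGD
asgd = SGDClassifier(average=True, random_state=0).fit(X, y)  # ASGD
print(asgd.coef_)          # averaged weights across all updates
```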

@pomodoren commented:

The basic database table was set up under #1, an issue that was just solved.
Next is to understand how to save the model in the database.

On another note

IMPORTANT The ingestion script should ingest data in batches and feed it to the model in batches.
Do not just pre-load all the data in advance. This is the "streaming" part of the challenge.

Batch size? I guess we can keep the batch size as a CONFIG value, and then load and train based on that. Remember that the model and the current page are stored together, so maybe it's implied that BATCH_SIZE can depend on the documents per page. Still, this does not solve the issue of how we test the model...
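A hedged sketch of the batched ingestion idea: BATCH_SIZE as a config value and a generator that yields chunks, instead of pre-loading everything (the names here are assumptions, not repo code):

```python
# Minimal sketch: stream rows in fixed-size batches rather than loading
# the whole dataset up front. BATCH_SIZE stands in for a CONFIG value.
BATCH_SIZE = 1000  # assumption: could be tied to documents per page

def iter_batches(rows, batch_size=BATCH_SIZE):
    """Yield lists of at most batch_size items from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final, possibly smaller, batch
        yield batch
```

Each yielded batch can then be fed to the model's partial_fit, so memory usage stays bounded by the batch size.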

@pomodoren commented:

After #6, we have more or less decided on the learning process.

when a batch of 10 loads:
- if Instance.count() == N:
    - train a new model
    - store the model in the PredictionModel table
- elif Instance.count() % N == 0:
    - test the existing model
    - store the stats results
    - train a new model with the new N (this will wait for the next input)
    - store the new model in the db

This can be a class method, because it does not depend that much on the ingestion batch.
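The checks above could be sketched roughly like this. This is a toy stand-in mirroring the pseudocode: ModelManager, the string placeholders, and N = 10 are all assumptions, not real repo code.

```python
# Hedged sketch of the decision logic from the pseudocode above.
N = 10  # threshold from the pseudocode ("batch of 10"); an assumption

class ModelManager:
    def __init__(self):
        self.models = []   # stands in for the PredictionModel table
        self.stats = []    # stands in for stored test statistics

    def on_batch_loaded(self, instance_count):
        if instance_count == N:
            # first threshold reached: train and store the first model
            self.models.append(f"model@{instance_count}")
        elif instance_count % N == 0:
            # every further multiple of N: test, store stats, retrain
            self.stats.append(f"stats@{instance_count}")
            self.models.append(f"model@{instance_count}")
```

In the real code the placeholder strings would be trained estimators and stats rows, but the branching is the same.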

@pomodoren commented:

Storing pickled data into SQLite

Bonus: Please save the model, the current page,
the coefficients and any relevant statistical measure 
to the SQLite database (on a different table than "data") while you are updating it.

I am new at this, so I do not have a really specific idea of what needs to be saved, or how we can use it later.
Still, after searching around, I found something really interesting: Modellogger.
I will check its code, and store the model in a similar way.

[Screenshot: Modellogger overview, 2021-05-28 09-02-42]

Integration?

Before letting this go, I might force an integration a bit... somehow finding a way to use modellogger's script within the SQLAlchemy model structure.

Second thoughts?

(We do not think it was a bad idea to store these with SQLAlchemy #1, right?!)
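A minimal sketch of what storing a pickled model in SQLite via SQLAlchemy could look like, on a table separate from "data". The table and column names here are assumptions, and the pickled dict stands in for a real estimator.

```python
# Hedged sketch: pickle a model into a LargeBinary column of a separate
# SQLite table via SQLAlchemy. Names are assumptions, not repo code.
import pickle
from sqlalchemy import Column, Integer, LargeBinary, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class PredictionModel(Base):
    __tablename__ = "prediction_models"   # separate from "data"
    id = Column(Integer, primary_key=True)
    current_page = Column(Integer)        # which page the model was saved at
    blob = Column(LargeBinary)            # pickled estimator bytes

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

# A dict stands in for the trained estimator / coefficients.
with Session(engine) as session:
    session.add(PredictionModel(current_page=1,
                                blob=pickle.dumps({"coef": [0.1]})))
    session.commit()

with Session(engine) as session:
    row = session.query(PredictionModel).first()
    restored = pickle.loads(row.blob)     # round-trips back to the object
print(restored)
```

Coefficients and statistical measures could go in additional columns of the same table, so model, page, and stats are stored together as the bonus asks.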

pomodoren added a commit that referenced this issue May 28, 2021

pomodoren commented May 28, 2021

[Screenshot: notes on what to store, 2021-05-28 09-13-46]

[Screenshot: notes on what to store, 2021-05-28 09-15-15]

These are additional notes regarding what to store.
Source


pomodoren commented May 28, 2021

Use case (to remember what we were doing):

  • Load 1000 cases
  • Update PredictionModel table with a method to take care of the checks
  • Create a model, train it, save it with parameters
  • Load 1000 new cases
  • Test for these new cases
  • Create a model, train it, save it with new parameters
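The use case above could be sketched end-to-end with toy data. The sizes, the synthetic load_cases helper, and the list standing in for the PredictionModel table are all assumptions:

```python
# Hedged end-to-end sketch of the use case: load, train, save, load,
# test, retrain, save again. Synthetic data; names are illustrative.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def load_cases(n=1000):
    """Stand-in for loading a page of cases from the database."""
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] > 0).astype(int)
    return X, y

saved_models = []                                  # stands in for PredictionModel table

X1, y1 = load_cases()                              # load 1000 cases
clf = SGDClassifier(random_state=0)
clf.partial_fit(X1, y1, classes=np.array([0, 1]))  # create and train a model
saved_models.append(clf.coef_.copy())              # save it with parameters

X2, y2 = load_cases()                              # load 1000 new cases
score = clf.score(X2, y2)                          # test on the new cases
clf.partial_fit(X2, y2)                            # train with the new batch
saved_models.append(clf.coef_.copy())              # save with new parameters
print(round(score, 2))
```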

pomodoren added a commit that referenced this issue May 28, 2021