Add some messy lab notes.
erikrose committed May 20, 2021
1 parent 957608b commit db4a430
Showing 1 changed file with 13 additions and 0 deletions.
13 changes: 13 additions & 0 deletions autoextract/lab_notes.txt
@@ -0,0 +1,13 @@
Default SVM settings converged on 81.6% testing accuracy.
Calling fit() instead of our own partial_fit() loop (with no early stopping on either) gave us 80% each time.
Turning on early stopping with default params gave us 80%, then 81.6%, then 88.3%.
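
A sketch of the three training styles, assuming the "SVM" here is sklearn's SGDClassifier with hinge loss (partial_fit() and early_stopping are SGD features); make_classification stands in for our real feature matrix, so the scores won't match:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split

    # Stand-in data; swap in the real X and y.
    X, y = make_classification(n_samples=300, n_features=100, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 1. Plain fit(), no early stopping.
    clf = SGDClassifier(loss='hinge')
    print('fit():', clf.fit(X_train, y_train).score(X_test, y_test))

    # 2. Our own partial_fit() loop, one full pass over the data per iteration.
    clf = SGDClassifier(loss='hinge')
    for _ in range(5):
        clf.partial_fit(X_train, y_train, classes=np.unique(y))
    print('partial_fit() loop:', clf.score(X_test, y_test))

    # 3. Built-in early stopping with default params (holds out 10% of the
    # training set as a validation set).
    clf = SGDClassifier(loss='hinge', early_stopping=True)
    print('early stopping:', clf.fit(X_train, y_train).score(X_test, y_test))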

Maybe the stochastic methods like SVM don't converge to the same solution each time simply because there are so many params.
Experiments with the LogisticRegression classifier (using LBFGS) corroborate this. I think it's still somewhat stochastic, but it converges the same every time for random_states 43, 42, and 48. Training accuracy 96.7%, testing 81.7%. So it's still somewhat of an overfit but no worse than the various SGD solvers using early stopping.
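
Sketch of that check (in sklearn, LogisticRegression's random_state only matters for the sag, saga, and liblinear solvers, so lbfgs converging identically across seeds makes sense):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=300, n_features=100, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for state in (43, 42, 48):
        clf = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=state)
        clf.fit(X_train, y_train)
        print(state, clf.score(X_train, y_train), clf.score(X_test, y_test))
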
Bayes: beware that it's a good classifier (dimensionality independent!) but a crappy estimator, so don't believe its confidence estimates. It ended up giving 100% training accuracy but only 71.7% testing. It didn't care what numbers the y values were: {-1, 1} or {1, 2}. Not sure I'm preprocessing X appropriately for Bayes. Would it rather the values be fractions?
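
Roughly what that run looked like, assuming GaussianNB over densified TF/IDF features (the notes don't record which NB variant this first run used); the 20 newsgroups set stands in for our corpus:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
    # GaussianNB needs dense input, hence toarray().
    X = TfidfVectorizer(max_features=2000).fit_transform(data.data).toarray()
    X_train, X_test, y_train, y_test = train_test_split(X, data.target, random_state=0)

    clf = GaussianNB().fit(X_train, y_train)
    print(clf.score(X_train, y_train), clf.score(X_test, y_test))
    # predict_proba() is the part not to trust: NB tends to push its
    # probabilities out toward 0 and 1.
    print(clf.predict_proba(X_test[:3]))
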
Wow, using CountVectorizer instead of TF/IDF makes it worse: 93.8% training, 60% testing. Same with or without a max_df ceiling. Even worse with MultinomialNB and CategoricalNB (both 77%/58.3%). BernoulliNB was bad too. CategoricalNB crashed.
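
Sketch of that comparison on the same stand-in corpus. (CategoricalNB expects each feature to be a small set of category indices rather than word counts, which might explain the crash.)

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB

    data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
    for vectorizer in (CountVectorizer(), TfidfVectorizer()):
        X = vectorizer.fit_transform(data.data)  # sparse is fine for these NBs
        X_train, X_test, y_train, y_test = train_test_split(
            X, data.target, random_state=0)
        for clf in (MultinomialNB(), BernoulliNB()):
            clf.fit(X_train, y_train)
            print(type(vectorizer).__name__, type(clf).__name__,
                  clf.score(X_train, y_train), clf.score(X_test, y_test))
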
RandomForestClassifier(n_estimators=100) did 100%/80% consistently with the CountVectorizer. Same with TF/IDF.
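
Same sketch for the forest (sklearn's forests take the sparse matrices both vectorizers produce, and trees are fairly insensitive to per-feature rescaling, which may be why counts and TF/IDF weights scored alike):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split

    data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
    X = CountVectorizer().fit_transform(data.data)
    X_train, X_test, y_train, y_test = train_test_split(X, data.target, random_state=0)

    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    print(clf.score(X_train, y_train), clf.score(X_test, y_test))
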
MLPClassifier(random_state=48, max_iter=400, verbose=1) converged around iteration 146 and did 100%/81.7% with no early stopping. With early stopping, it's hard to get it dialed in: sometimes it's 97.9%/80%, sometimes 95.4%/73.3%.
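
The two MLP configurations, sketched with stand-in data (hidden_layer_sizes isn't recorded in the notes, so defaults are assumed):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=300, n_features=100, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # No early stopping; verbose=1 prints the loss at each iteration.
    clf = MLPClassifier(random_state=48, max_iter=400, verbose=1)
    clf.fit(X_train, y_train)
    print(clf.score(X_train, y_train), clf.score(X_test, y_test))

    # The finicky variant: early_stopping=True carves validation_fraction
    # (default 0.1) off the training set, so scores swing with that split.
    clf = MLPClassifier(random_state=48, max_iter=400, early_stopping=True)
    clf.fit(X_train, y_train)
    print(clf.score(X_train, y_train), clf.score(X_test, y_test))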

Talked to evgeny and motin in #maml about sklearn and Fathom and stuff. evgeny recommended the ONNX format for exporting models trained in PyTorch to be executed by various libs in JS.
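
For reference, the PyTorch half of that ONNX workflow is a single call; the model here is a made-up stand-in:

    import torch

    model = torch.nn.Linear(100, 2)      # stand-in for a real trained model
    example_input = torch.randn(1, 100)  # example input fixes the graph's shape
    torch.onnx.export(model, example_input, 'model.onnx')
    # The resulting .onnx file can then be run in JS with e.g. onnxruntime-web.
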
alissy says they statically linked with TF for DeepSpeech: "I think it's complicated enough not to qualify for providing features in gecko. except if you are okay loading a shared object." TFLite can be a couple of KB "if you build it right."
