Code classifier frontend prediction API services. #3719
Conversation
@anmolshkl @AllanYangZhou @seanlip can you check the overall structure and placement of different modules? Also I am currently working on frontend tests.
Just commenting on code structure as requested. Leaving further review to @AllanYangZhou.
@@ -0,0 +1,45 @@
// Copyright 2017 The Oppia Authors. All Rights Reserved.
Maybe put these in templates/dev/head/domain/classifier. They're not specific to any one interaction, right?
OK. No they are not specific to any interaction.
 * Oppia-ml.
 */

oppia.factory('CodeClassifierPredictionService', [
Maybe call this CodeReplPredictionService so we have a standard naming scheme for these things ("interaction id" + prediction service). Also some changes may be needed to the .py file to say whether such a service exists so that the relevant files can be imported only if necessary?
We can include these services directly in CodeRepl.html, can't we? Then they will be included automatically every time CodeRepl interaction is included. Or are you talking about something else here?
Yes -- look at how the interaction stuff gets loaded; there are files in the extensions/interaction/{{InteractionID}} directory and they are named in quite a specific way. So no, don't include things directly in CodeRepl.html. You probably need to:
- add a bool attr to the interaction .py file definition that indicates whether the interaction supports a prediction service
- for interactions that do support a prediction service, add that service to the {{InteractionId}}.js file
- write a backend test to ensure that all interactions have that boolean defined in their .py file, and that all interactions with that boolean set to True have a service with the appropriate name defined in their JS file. You can just do string matching on the raw JS file, I think.
I think this should go into @pranavsid98's PR, since he is going to manage how prediction services are detected and loaded. Also, classifiers may require additional services (for example, the code classifier requires SVMPredictionService, CountVectorizerService, etc.) -- how should we keep track of those extra services? Maybe maintain a dependency list for each classifier.
Also, I think it'd be better not to have tight coupling between the interaction and the classifier. That enables us to use multiple classifiers per interaction. If we name the classifier service after the interaction id then we will only be able to use one prediction algorithm at a time. Consider a scenario where we upgrade an existing classifier: for backward compatibility, if an exploration has an old version of the classifier data then we can still use the old prediction API. We can't do this if we name each prediction service after its interaction id. In general, I think it would be better to name each prediction service after its algorithm id rather than its interaction id. What do you think?
Overall structure seems fine, I'll do a more detailed review tomorrow. Regarding the use of template strings/ES6, won't there be an issue of browser compatibility?
@AllanYangZhou, according to this, most browsers nowadays support template literals (except older browsers such as IE 11). So this should be fine as long as the user's browser is updated.
Codecov Report
@@ Coverage Diff @@
## develop #3719 +/- ##
===========================================
+ Coverage 45.04% 45.63% +0.59%
===========================================
Files 265 275 +10
Lines 20510 20956 +446
Branches 3184 3259 +75
===========================================
+ Hits 9238 9563 +325
- Misses 11272 11393 +121
Continue to review full report at Codecov.
I have added a few frontend tests for the different services. I will add the rest of the tests later.
@prasanna08 a few minor changes.
This looks a little complex and I can imagine the effort that had to be put in! Really great! @seanlip Prasanna had to do a lot of duplication (of libraries/our Python code) to get this in place. In the long term, does frontend prediction look maintainable to you? The scope for things to go wrong is just huge with frontend prediction.
// limitations under the License.

/**
 * Vecotrize function for CountVectorizer feature extractor of sklearn.
Vecotrize -> Vectorizer
Perhaps we can say -> Vectorizer function which mirrors the CountVectorizer feature extractor of sklearn
Done
/**
 * Vecotrize function for CountVectorizer feature extractor of sklearn.
 *
 * IMPORTANT NOTE: The vectorize function is uses the vocabulary that was
Vectorizer*
is uses -> uses
Done
 * Vecotrize function for CountVectorizer feature extractor of sklearn.
 *
 * IMPORTANT NOTE: The vectorize function is uses the vocabulary that was
 * extracted during classifier's training. During training scikit's
During training,
Done
 *
 * IMPORTANT NOTE: The vectorize function is uses the vocabulary that was
 * extracted during classifier's training. During training scikit's
 * CountVectorize class is used for this purpose. If there are any changes
CountVectorizer*
Done
Hm, yeah -- the duplication is a pity, but how would you do ML offline or in poor-connectivity environments otherwise? I think this is probably the best we can do...
  pos += 1;
}

if (pos == max) {
Preferable to use ===, here and elsewhere.
Done
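To illustrate the reviewer's point: strict equality avoids the implicit type coercion that `==` performs, which is why the Oppia style guide prefers `===`. A minimal sketch (variable names are illustrative):

```javascript
// Why === is preferred over ==: loose equality coerces types before
// comparing, while strict equality compares type and value.
var pos = 5;
var looseMatch = (pos == '5');    // true: the string '5' is coerced to 5.
var strictMatch = (pos === '5');  // false: number vs. string.
```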
  return group(arguments) + '?';
};

var any = function() {
Maybe more descriptive names for these 3 functions?
Using groupOfRegEx, regExMayBePresent, and repeatedRegEx for them respectively.
Done
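For context, the three renamed helpers mirror the `group()`/`maybe()`/`any()` utilities in Python's `tokenize` module, which build regex alternations. A minimal sketch of what they might look like with the names agreed above (this is illustrative, not the exact PR code):

```javascript
// Joins the given patterns into a single alternation group: (a|b|c).
var groupOfRegEx = function() {
  return '(' + Array.prototype.join.call(arguments, '|') + ')';
};

// Marks the grouped patterns as optional: (a|b)?
var regExMayBePresent = function() {
  return groupOfRegEx.apply(null, arguments) + '?';
};

// Marks the grouped patterns as repeatable: (a|b)*
var repeatedRegEx = function() {
  return groupOfRegEx.apply(null, arguments) + '*';
};
```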
};

var Whitespace = String.raw`[ \f\t]*`;
var Comment = String.raw`#[^\r\n]*`;
Comment is a keyword in JS. Also, is there a reason to uppercase these variables?
No specific reason, I just kept them as they were. I will change them to lowercase.
Done
@seanlip Do we still care about IE support? I think a lot of these ES6 features don't work on IE at all.
It would be nice to support some version of IE, especially since this is the learner view and students do not always have control over the computers which are available to them. @prasanna08 could we do that?
Just FYI, I tried using Oppia on IE 11 once and many things (like the editor) completely broke on it. So I'm not sure about supporting IE, unless we are talking about the Microsoft Edge browser.
Yeah, I think the editor view may have problems (and I'm not actually too worried about that). But what about the learner view? If we are to be accessible to learners then we need to try and meet them where they're at, and that includes their browser. I thought the learner view currently works OK in IE...
@seanlip if it's not very helpful then I'd rather not do that. We need ES6 for raw regular-expression strings, and it's difficult to convert them to escaped strings; that's why many people prefer raw strings over escaped strings when it comes to regexes. I don't even know whether it will be possible to convert each raw string into an escaped string, but I will give it a shot if this is an important feature. Aside from that, can't we fall back to the default outcome on IE browsers?
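To make the trade-off concrete: an ES6 raw string and its manually escaped ES5 equivalent produce the same pattern text, so a conversion is mechanically possible but error-prone for long regexes. A small sketch using the `Whitespace` pattern from the diff above:

```javascript
// ES6 raw string vs. the manually escaped ES5 equivalent: both describe
// the same regex source, but the escaped form doubles every backslash.
var whitespaceRaw = String.raw`[ \f\t]*`;  // ES6 only (breaks on IE 11).
var whitespaceEscaped = '[ \\f\\t]*';      // Plain ES5 string literal.

// Both strings build identical RegExp objects.
var sameSource = new RegExp(whitespaceRaw).source ===
  new RegExp(whitespaceEscaped).source;
```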
Hmm, the learner view seems OK but the other pages that lead you to the learner view seem to be broken. I'm not sure a learner would actually play any exploration after having a bad experience, unless they are going directly to play a particular exploration.
I'm torn. IE 11 is apparently the second most popular browser after Chrome: https://www.netmarketshare.com/browser-market-share.aspx?qprid=2&qpcustomd=0 . So, definitely non-negligible, and in fact we've generally tried to support later versions of IE in the past (with a cutoff of 8 or 9). If the pages leading up to the learner view are broken, then we really should fix that (@anmolshkl, would you mind filing issues for this please?). It's similar to how we currently do i18n for the learner-facing but not the editor-facing pages.

On the other hand, I do see the issue here, and I agree that not using String.raw would make the code quite a bit less maintainable. I think a reasonable thing to do is preserve the current impl, but make sure it fails gracefully on IE. So, e.g., explorations with no code classification should still work fine, but explorations with code classification should show an error message at the time the classification is triggered that says something like "Answer classification does not work in IE. Please use another browser." and return the default outcome. That way, there are no surprises, and the page doesn't just break completely under the user. Would that work?
I think disabling classification on IE and then returning the warning message (without breaking completely) is a good compromise for now. |
Returning a warning message sounds good to me. But I am not sure how that message should be displayed to the learner. Some sort of prompt or pop-up box?
Actually, thinking about it, you probably don't need to surface this to the learner. Just do a |
Yep, that's what I thought. The learner does not need to know about it. I will make that change. Edit: I think there might be a way to use escaped strings somehow. Look at this. I'll try this out and let you guys know if it works.
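The graceful-degradation idea discussed above could be sketched like this. The function and parameter names (`classify`, `predict`, `defaultOutcomeIndex`) are illustrative placeholders, not actual Oppia APIs:

```javascript
// Hypothetical sketch: if the browser lacks the ES6 features the prediction
// services rely on, log a warning and fall back to the default outcome
// instead of letting the page break under the user.
var supportsEs6Features = (
  typeof String.raw === 'function' && typeof Object.values === 'function');

var classify = function(answer, predict, defaultOutcomeIndex) {
  if (!supportsEs6Features) {
    console.warn(
      'Answer classification is not supported in this browser; ' +
      'falling back to the default outcome.');
    return defaultOutcomeIndex;
  }
  return predict(answer);
};
```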
I have removed the use of template literals, so now there shouldn't be any compatibility issues. The PR is ready for review now.
// limitations under the License.

/**
 * Vecotrizer function which mirrors the CountVectorizer feature
Vectorizer
Done
oppia.factory('CountVectorizerService', [function() {
  return {
    vectorize: function(tokens, vocabulary) {
      var vectorLength = Math.max.apply(Math, Object.values(vocabulary)) + 1;
Object.values is also ES6, like template literals.
Oh I see. I will change it to Object.keys, because we only need to know the length, so it does not matter whether we use keys or values. Is that fine?
Done
Yeah, seems fine to use keys instead.
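A minimal sketch of a CountVectorizer-style `vectorize` using `Object.keys` (plain ES5), as agreed above. This is an illustration of the approach, not the exact Oppia service; it assumes the vocabulary maps each token to a contiguous column index 0..n-1, which is why the number of keys equals the vector length:

```javascript
// Builds a token-count vector from a list of tokens and a vocabulary that
// maps each known token to its column index.
var vectorize = function(tokens, vocabulary) {
  var vector = [];
  // ES5-compatible replacement for Math.max over Object.values: the number
  // of vocabulary entries is the vector length.
  var vectorLength = Object.keys(vocabulary).length;
  for (var i = 0; i < vectorLength; i++) {
    vector.push(0);
  }
  tokens.forEach(function(token) {
    // Unknown tokens are simply ignored, mirroring a fixed vocabulary.
    if (vocabulary.hasOwnProperty(token)) {
      vector[vocabulary[token]] += 1;
    }
  });
  return vector;
};
```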
else if (namechars.indexOf(initial) !== -1) {
  tokenizedProgram.push([PythonProgramTokenType.NAME, token]);
}
// continued stmt
Better to just spell this out.
Done
var maxVoteIdx = 0;
for(var i = 0; i < votes.length; i++) {
  if (votes[i] > votes[maxVoteIdx]) {
    maxVoteIdx = votes[i];
I'm a bit confused reading this. The name maxVoteIdx suggests we're trying to find an index, not the max number of votes, right? Unless you are actually trying to find the biggest number of votes, but then you should call it maxVotes.
Oh, the code is wrong. I want index and not votes. Thanks for spotting it.
Done
Great! Btw, were any of the SVMPredictionService tests supposed to have caught this error?
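The corrected loop stores the class index rather than the vote count. A sketch of the fix (the function wrapper is illustrative; in the PR this loop lives inside the SVM prediction service):

```javascript
// Returns the index of the class with the most votes (argmax).
var findMaxVoteIndex = function(votes) {
  var maxVoteIdx = 0;
  for (var i = 1; i < votes.length; i++) {
    if (votes[i] > votes[maxVoteIdx]) {
      maxVoteIdx = i;  // Store the index i, not votes[i].
    }
  }
  return maxVoteIdx;
};
```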
var TOKEN_NAME_UNK = 'UNK';

// List of python keywords.
var KWLIST = [
minor nit: KW_LIST is slightly clearer
Done
}
}

// new statement
For consistency, capitalize and add a period (same applies for all other comments).
Done
@prasanna08 Looking good overall. I guess we should get all the checks fixed.
@AllanYangZhou fixed tests. PTAL?
@prasanna08 just a couple of comments here. I'm trusting you and the tests here :P
 * extractor of sklearn.
 *
 * IMPORTANT NOTE: The Vectorizer function uses the vocabulary that was
 * extracted during training. During training scikit's
During the training,
Done
/**
 * Tokenizer for Python code.
 *
 * IMPORTANT NOTE: The tokenizer is build using Python's own tokenizer module.
build -> built
Done
 * IMPORTANT NOTE: The tokenizer is build using Python's own tokenizer module.
 * These functions are simply translated from Python code to JS code and they
 * both do same task. The unnecessary code from Python's tokenizer module
 * has been removed before translating it into JS andcCode relevant to
and code*
Done
  'num in range(1000):\n if num%7 == 0 or num%5 == 0:\n\ts +=x\n' +
  'print s');

var expectedTokens = [
I believe that this matches the output produced by Python's tokenizer?
Not exactly. The Python tokenizer produces \n tokens but this one fails to do so. Other than that, all the important tokens are the same. I think the issue with \n is that Python's tokenizer uses StringIO for reading input lines, whereas here I am splitting the program on \n before passing it to the tokenizer. But all other tokens are the same as Python's tokenizer.
So the count for '\n' in the count vector would always be 0 in the case of the frontend count vectorizer?
Yes, unfortunately, that's likely to happen.
Hmm, I'm not sure what impact this could have (the impact could be negligible, or it might decrease the accuracy significantly). If we are splitting on '\n', can't we just add the '\n' tokens back later?
I just remembered: we are already filtering out NL and COMMENT tokens (in the backend as well as the frontend), and all \n tokens are NL tokens, so they are not present in the backend either. That means we are not losing anything by not having \n tokens.
  startIndices[i] = startIndices[i - 1] + nSupport[i - 1];
}

// Find kernel values for supportVectors and given input. Assumes that
Shouldn't there be a check that flags an error if the dimensions are not the same?
Added a check.
var svmData = {
  classes: [0, 1],
  n_support: [2, 2],
  support_vectors: [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
is this generated with scikit?
Yep, I trained XOR on scikit and extracted the data using the classifier_utils.extract_svm_parameter() function.
Ok
@prasanna08 The code looks fine; the only concern I have is why the SVMPredictionService tests didn't catch the maxVoteIdx bug from earlier.
@AllanYangZhou the reason is that there were only 2 classes, so the flawed logic happened to give the correct answer in every test case. Let me explain the cases one by one.

My old logic updated maxVoteIdx with votes[i] instead of i. For votes of [1, 0], there are two passes of the for loop, maxVoteIdx is never updated, and it stays 0, which is the correct index. For votes of [0, 1], maxVoteIdx is updated to votes[1], which is 1 and coincidentally equals the correct index.

So the result was correct under the flawed logic only because we had 2 classes and the votes were binary (either 0 votes or 1 vote). If we had a few more classes then we would also have more votes per class, and the error might have been identified at that time. Is that understandable?
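The coincidence above can be demonstrated directly. This sketch contrasts the buggy update with the correct one and shows that they agree on every binary 2-class vote vector but diverge once a third class is added (the function wrappers are illustrative, not the PR's code):

```javascript
// Buggy version: stores the vote count instead of the index.
var buggyArgmax = function(votes) {
  var maxVoteIdx = 0;
  for (var i = 0; i < votes.length; i++) {
    if (votes[i] > votes[maxVoteIdx]) {
      maxVoteIdx = votes[i];  // Bug: assigns votes[i].
    }
  }
  return maxVoteIdx;
};

// Fixed version: stores the index of the winning class.
var correctArgmax = function(votes) {
  var maxVoteIdx = 0;
  for (var i = 0; i < votes.length; i++) {
    if (votes[i] > votes[maxVoteIdx]) {
      maxVoteIdx = i;  // Fix: assigns i.
    }
  }
  return maxVoteIdx;
};
```

With [1, 0] the buggy loop never updates and returns 0; with [0, 1] it assigns votes[1] = 1, which happens to equal the correct index. With three classes, e.g. [0, 2, 1], the buggy version returns 2 while the correct index is 1, so a 3-class test would have caught it.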
@prasanna08 Alright, I understand now. Thanks!

if (
tokenId == PythonProgramTokenType.NL || | ||
tokenId == PythonProgramTokenType.COMMENT || |
=== preferred (and elsewhere)
Done.
@anmolshkl can you take a final look, so that this can be merged and I can enable the code classifier on a development instance for some end-to-end testing?
This PR implements the frontend prediction API for the code classifier. It implements all the necessary preprocessing and predict functions. It also changes the eslint ecmaVersion to 6, because this PR uses backtick strings (for raw strings). This PR can only be merged after @pranavsid98's #3716 is merged. Until then we should review this PR.