Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code classifier frontend prediction API services. #3719

Merged
merged 19 commits into from Aug 18, 2017

Conversation

prasanna08
Copy link
Contributor

@prasanna08 prasanna08 commented Aug 7, 2017

This PR implements frontend prediction API for code classifier. It implements all necessary reprocessing and predict function. Also changed eslint ecmaVersion to 6 because this PR uses backtick strings (for raw strings).

This PR can only be merged after @pranavsid98's #3716 is merged. Until then we should review this PR.

@prasanna08
Copy link
Contributor Author

@anmolshkl @AllanYangZhou @seanlip can you check the overall structure and placement of different modules? Also I am currently working on frontend tests.

@prasanna08 prasanna08 changed the title [NOT FOR MERGE]: Code classifier prediction API [WIP]: Code classifier prediction API Aug 7, 2017
Copy link
Member

@seanlip seanlip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just commenting on code structure as requested. Leaving further review to @AllanYangZhou.

@@ -0,0 +1,45 @@
// Copyright 2017 The Oppia Authors. All Rights Reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put these in templates/dev/head/domain/classifier. They're not specific to any one interaction, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. No they are not specific to any interaction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. No they are not specific to any interaction.

* Oppia-ml.
*/

oppia.factory('CodeClassifierPredictionService', [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call this CodeReplPredictionService so we have a standard naming scheme for these things ("interaction id" + prediction service). Also some changes may be needed to the .py file to say whether such a service exists so that the relevant files can be imported only if necessary?

Copy link
Contributor Author

@prasanna08 prasanna08 Aug 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can include these services directly in CodeRepl.html, can't we? Then they will be included automatically every time CodeRepl interaction is included. Or are you talking about something else here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- look at how the interaction stuff gets loaded; there are files in the extensions/interaction/{{InteractionID}} directory and they are named in quite a specific way. So no, don't include things directly in CodeRepl.html. You probably need to:

  • add a bool attr to the interaction .py file definition that indicates whether the interaction supports a prediction service
  • for interactions that do support a prediction service, add that service to the {{InteractionId}}.js file
  • write a backend test to ensure that all interactions have that boolean defined in their .py file, and that all interactions with that boolean set to True have a service with the appropriate name defined in their JS file. You can just do string matching on the raw JS file, I think.

Copy link
Contributor Author

@prasanna08 prasanna08 Aug 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should go into @pranavsid98's PR. Since he is going to manage how prediction services will be detected and loaded. Also classifiers may also require additional services (like code classifier requires SVMPredictionService, CountVectorizerService etc...), how should we keep track of those extra services? May be maintain a dependency list for each classifier.

Also I think it'd be better to not have tight coupling between interaction and classifier. This enables us to use multiple classifiers per interaction. If we name classifier service same as interaction id then we will be able to use only prediction algorithm at a time. Consider a scenario where we upgarde an existing classifier code. In that case, for backward compatibility, if exploration has old version of classifier data then we can still use old prediction API. We can't do this if we name each prediction service with its interaction id. In general, I think, it would be better to name prediction service same with its algorithm id rather than interaction id. What do you think?

@AllanYangZhou
Copy link
Contributor

Overall structure seems fine, I'll do a more detailed review tomorrow.

Regarding the use of template strings/ES6, won't there be an issue of browser compatability?

@prasanna08
Copy link
Contributor Author

@AllanYangZhou, according to this most browsers now days support for template literals (except older IE11 browser). So this should be fine as long as user's browser is updated.

@codecov-io
Copy link

codecov-io commented Aug 8, 2017

Codecov Report

Merging #3719 into develop will increase coverage by 0.59%.
The diff coverage is 70.39%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3719      +/-   ##
===========================================
+ Coverage    45.04%   45.63%   +0.59%     
===========================================
  Files          265      275      +10     
  Lines        20510    20956     +446     
  Branches      3184     3259      +75     
===========================================
+ Hits          9238     9563     +325     
- Misses       11272    11393     +121
Impacted Files Coverage Δ
...domain/classifier/WinnowingPreprocessingService.js 100% <100%> (ø)
...v/head/domain/classifier/CountVectorizerService.js 100% <100%> (ø)
...interactions/CodeRepl/CodeReplPredictionService.js 53.48% <53.48%> (ø)
...v/head/domain/classifier/PythonProgramTokenizer.js 67.41% <67.41%> (ø)
...dev/head/domain/classifier/SVMPredictionService.js 98.21% <98.21%> (ø)
...ead/domain/exploration/InteractionObjectFactory.js 68.57% <0%> (-11.43%) ⬇️
.../dev/head/domain/exploration/StateObjectFactory.js 80% <0%> (-6.67%) ⬇️
.../exploration_player/AnswerClassificationService.js 95.65% <0%> (-1.79%) ⬇️
...s/dev/head/pages/library/SearchResultsDirective.js 5.88% <0%> (-0.37%) ⬇️
...v/head/pages/learner_dashboard/LearnerDashboard.js 2.76% <0%> (-0.03%) ⬇️
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2ccf54...6c08be6. Read the comment docs.

@prasanna08
Copy link
Contributor Author

prasanna08 commented Aug 8, 2017

I have added few frontend tests for different services. I will add rest of the tests later.

Copy link
Contributor

@anmolshkl anmolshkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prasanna08 a few minor changes.

This looks a little complex and I can imagine the efforts had to put in! Really great! @seanlip Prasanna had to do a lot of duplication (of libraries/our python code) to get this in place. In the long term, does front end prediction looks maintainable to you? The scope for things to go wrong is just huge with front end prediction.

// limitations under the License.

/**
* Vecotrize function for CountVectorizer feature extractor of sklearn.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vecotrize -> Vectorizer
Perhaps we can say -> Vectorizer function which mirrors the CountVectorizer feature extractor of sklearn

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOne

/**
* Vecotrize function for CountVectorizer feature extractor of sklearn.
*
* IMPORTANT NOTE: The vectorize function is uses the vocabulary that was
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vectorizer*
is uses -> uses

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* Vecotrize function for CountVectorizer feature extractor of sklearn.
*
* IMPORTANT NOTE: The vectorize function is uses the vocabulary that was
* extracted during classifier's training. During training scikit's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During training,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

*
* IMPORTANT NOTE: The vectorize function is uses the vocabulary that was
* extracted during classifier's training. During training scikit's
* CountVectorize class is used for this purpose. If there are any changes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CountVectorizer*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@seanlip
Copy link
Member

seanlip commented Aug 9, 2017

This looks a little complex and I can imagine the efforts had to put in! Really great! @seanlip Prasanna had to do a lot of duplication (of libraries/our python code) to get this in place. In the long term, does front end prediction looks maintainable to you? The scope for things to go wrong is just huge with front end prediction.

Hm, yeah -- the duplication is a pity, but how would you do ML offline or in poor-connectivity environments otherwise? I think this is probably the best we can do...

pos += 1;
}

if (pos == max) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preferable to use ===, here and elsewhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return group(arguments) + '?';
};

var any = function() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe more descriptive names for these 3 functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using groupOfRegEx, regExMayBePresent, repeatedRegEx for them respectively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

};

var Whitespace = String.raw`[ \f\t]*`;
var Comment = String.raw`#[^\r\n]*`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is keyword in JS. Also, is there a reason to uppercase these variables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No specific reason, I just kept them as it is. I will change them to lowercase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@AllanYangZhou
Copy link
Contributor

@seanlip Do we still care about IE support? I think a lot of these ES6 features don't work on IE at all.

@seanlip
Copy link
Member

seanlip commented Aug 9, 2017

It would be nice to support some version of IE, especially since this is the learner view and students do not always have control over the computers which are available to them. @prasanna08 could we do that?

@anmolshkl
Copy link
Contributor

anmolshkl commented Aug 9, 2017 via email

@seanlip
Copy link
Member

seanlip commented Aug 9, 2017

Yeah, I think the editor view may have problems (and I'm not actually too worried about that). But what about the learner view? If we are to be accessible to learners then we need to try and meet them where they're at, and that includes their browser. I thought the learner view currently works OK in IE...

@prasanna08
Copy link
Contributor Author

@seanlip if its not very much helpful then I'd rather not do that. We need ES6 for raw strings of regular expression. And its difficult to convert them to escaped strings, that's why many prefer raw strings over escaped strings when it comes to regex. So if its not much helpful than it would be better to use raw strings. I don't even know that whether it will be possible to convert each raw string into escaped string. But I will give it a shot if this is important feature.

Aside from that, can't we fallback to default outcome in case of IE browser?

@anmolshkl
Copy link
Contributor

anmolshkl commented Aug 9, 2017 via email

@seanlip
Copy link
Member

seanlip commented Aug 9, 2017

I'm torn. IE 11 is apparently the second most popular browser after Chrome: https://www.netmarketshare.com/browser-market-share.aspx?qprid=2&qpcustomd=0 . So, definitely non-negligible, and in fact we've generally tried to support later versions of IE in the past (with a cutoff of 8 or 9). If the pages leading up to the learner view are broken, then we really should fix that (@anmolshkl, would you mind filing issues for this please?) It's similar to how we currently do i18n for the learner-facing but not the editor-facing pages.

On the other hand I do see the issue here, and I agree that not using String.raw would make the code quite a bit less maintainable.

I think a reasonable thing to do is preserve the current impl, but make sure it fails gracefully on IE. So, e.g., explorations with no code classification should still work fine, but explorations with code classification should show an error message at the time the classification is triggered that says something like "Answer classification does not work in IE. Please use another browser." and return the default outcome. That way, there are no surprises, and the page doesn't just break completely under the user. Would that work?

@AllanYangZhou
Copy link
Contributor

I think disabling classification on IE and then returning the warning message (without breaking completely) is a good compromise for now.

@prasanna08
Copy link
Contributor Author

Returning warning message sounds good to me. But I am not sure about how that message should be displayed to learner. Some sort of prompt or pop-up box?

@seanlip
Copy link
Member

seanlip commented Aug 10, 2017

Actually, thinking about it, you probably don't need to surface this to the learner. Just do a $log.error(); that should suffice.

@prasanna08
Copy link
Contributor Author

prasanna08 commented Aug 10, 2017

Yep that's what I thought. Learner does not need to know about it. I will make that change.

Edit: I think there might be a way to use escape strings somehow. Look at this. I'll try this out and let you guys know if it works.

@prasanna08
Copy link
Contributor Author

prasanna08 commented Aug 10, 2017

I have removed use of template literals, so now there shouldn't be any compatibility issue. PR is ready for review now.

// limitations under the License.

/**
* Vecotrizer function which mirrors the CountVectorizer feature
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vectorizer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

oppia.factory('CountVectorizerService', [function() {
return {
vectorize: function(tokens, vocabulary) {
var vectorLength = Math.max.apply(Math, Object.values(vocabulary)) + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Object.values is also ES6 like template literals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. I will change it to Object.key because we only need to know length, so it does not matter whether we use keys or values. Is that fine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, seems fine to use keys instead.

else if (namechars.indexOf(initial) !== -1) {
tokenizedProgram.push([PythonProgramTokenType.NAME, token]);
}
// continued stmt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to just spell this out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

var maxVoteIdx = 0;
for(var i = 0; i < votes.length; i++) {
if (votes[i] > votes[maxVoteIdx]) {
maxVoteIdx = votes[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused reading this. The name maxVoteIdx suggests we're trying to find an index, not the max number of votes, right? Unless you are actually trying to find the biggest number of votes, but then you should call it maxVotes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the code is wrong. I want index and not votes. Thanks for spotting it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Btw, were any of the SVMPredictionService tests supposed to have caught this error?

var TOKEN_NAME_UNK = 'UNK';

// List of python keywords.
var KWLIST = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: KW_LIST is slightly clearer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
}

// new statement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, capitalize and add a period (same applies for all other comments).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@AllanYangZhou
Copy link
Contributor

@prasanna08 Looking good overall. I guess we should get all the checks fixed.

@prasanna08
Copy link
Contributor Author

@AllanYangZhou fixed tests. PTAL?

Copy link
Contributor

@anmolshkl anmolshkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prasanna08 just a couple of comments here. I'm trusting you and the tests here :P

* extractor of sklearn.
*
* IMPORTANT NOTE: The Vectorizer function uses the vocabulary that was
* extracted during training. During training scikit's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the training,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

/**
* Tokenizer for Python code.
*
* IMPORTANT NOTE: The tokenizer is build using Python's own tokenizer module.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build -> built

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* IMPORTANT NOTE: The tokenizer is build using Python's own tokenizer module.
* These functions are simply translated from Python code to JS code and they
* both do same task. The unnecessary code from Python's tokenizer module
* has been removed before translating it into JS andcCode relevant to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and code*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

'num in range(1000):\n if num%7 == 0 or num%5 == 0:\n\ts +=x\n' +
'print s');

var expectedTokens = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this matches the output produced by Python's tokenizer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly. The python tokenizer produces \n tokens but this one fails to do so. Other than that all the important tokens are same. I think the issue with \n is that python's tokenizer uses StringIO for reading input line but here I am splitting the program with \n before passing it to tokenizer. But all other tokens are same as python's tokenizer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the count for '\n' in the count vector would always be 0 in the case of front-end count-vectorizer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unfortunately, that's likely to happen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure what impact this could have (the impact could be negligible or it might decrease the accuracy significantly). If we are splitting on '\n', then, can't we just add '\n' tokens later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just remembered this, we are already filtering out NL and COMMENT tokens (in backend as well as frontend) and all \n are NL tokens, so they are not present in backend either. That means we are not losing anything by not having \n tokens.

startIndices[i] = startIndices[i - 1] + nSupport[i - 1];
}

// Find kernel values for supportVectors and given input. Assumes that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there be a check that flags an error if the dimensions are not same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check.

var svmData = {
classes: [0, 1],
n_support: [2, 2],
support_vectors: [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this generated with scikit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I trained XOR on scikit and extracted data using classifier_utils.extract_svm_parameter() function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@AllanYangZhou
Copy link
Contributor

@prasanna08 The code looks fine, the only concern I have is why the SVMPredictionServiceTests didn't catch the maxVoteIdx bug from earlier?

@prasanna08
Copy link
Contributor Author

@AllanYangZhou so the reason for that is that there were only 2 classes. So let me explain all the cases one by one.

  1. Input [1, 1] produced votes [1, 0]. That is it gave 1 vote to output class 0 and 0 votes to output class 1.
  2. Input [1, 0] produced votes [0, 1]
  3. Input [0, 1] produced votes [0, 1]
  4. Input [0, 0] produced votes [1, 0]

Now my old logic was like this:

maxVoteIdx = 0;
for (var i = 0; i < votes.length; i++) {
  if (votes[i] > maxVoteIdx) {
    maxVoteIdx = i;
  }
}

So let's see what happens in case of [1, 0] votes. There will be 2 passes of for loop.

  1. votes[0] > 0 which is True (1 > 0). Therefore maxVoteIdx = 0;
  2. votes[1] > 0 which is False (0 = 0). So no change.
    Therefore returned value is class[maxVoteIdx] which is 0 and that is True in both cases when [1, 0] votes are generated.

Now let's see what happens for [0, 1]

  1. votes[0] > 0. False.
  2. votes[1] > 0. True (1 > 0). maxVoteIdx = 1;
    Therefore returned class is 1 and that is True for both cases when votes are [0, 1].

So the result was correct in flawed logic because we only had 2 classes and votes were binary (either 0 vote or 1 vote). If we had few more classes then we would also have more votes per class and error may have been identified at that time. Is that understandable?

@AllanYangZhou
Copy link
Contributor

@prasanna08 Alright, I understand now. Thanks!


if (
tokenId == PythonProgramTokenType.NL ||
tokenId == PythonProgramTokenType.COMMENT ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=== preferred (and elsewhere)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@prasanna08 prasanna08 changed the title [WIP]: Code classifier prediction API Code classifier frontend prediction API services. Aug 17, 2017
@prasanna08
Copy link
Contributor Author

@anmolshkl can you take a final look. So that this can be merged and I can enable code classifier in development instance for some end to end testing.

@anmolshkl anmolshkl merged commit 12df8f6 into oppia:develop Aug 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants