Meaning of cross_val_score output #5

Closed
sarwatfatimam opened this issue May 2, 2016 · 2 comments

@sarwatfatimam

Hi.

I have a question about scikit-learn-videos/07_cross_validation.ipynb. The classification accuracy is usually reported with several digits after the decimal point, e.g. 0.966666666667. If I multiply this value by the total number of observations, i.e. 25, I get 24.1666666667. What does this mean? That 24.1666666667 observations were classified correctly? Shouldn't it give me a whole number, such as 24?

@deepish

deepish commented May 2, 2016

Hi @Sarwat-Fatima,
That's the accuracy score, which is normalized, i.e. it lies between 0 and 1, where 0 means none of the predictions were correct and 1 means every prediction was correct.
You should not multiply it by the number of observations directly. 0.96666667 means that about 96.7% of your observations were predicted correctly.

If you want to know how many observations were predicted correctly, just pass normalize=False to accuracy_score, as shown in the example at http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

accuracy_score(y_true, y_pred, normalize=False)
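
To make the difference concrete, here is a minimal sketch of the same call with and without normalize=False. The label lists are made up for illustration and are not taken from the notebook.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, invented for this example (not from the notebook).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]  # one mistake, at index 3

# Default behaviour: fraction of correct predictions, between 0 and 1.
fraction_correct = accuracy_score(y_true, y_pred)

# normalize=False: number of correct predictions instead of the fraction.
count_correct = accuracy_score(y_true, y_pred, normalize=False)

print(fraction_correct)
print(count_correct)
```

With these made-up labels, five of the six predictions match, so the first value is 5/6 and the second is the count 5.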

@justmarkham, this is a really nice tutorial series you have prepared for beginners. Thanks for that.

@justmarkham
Owner

@Sarwat-Fatima I assume you are asking about the output from this code:

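from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# X and y are the feature matrix and response vector defined earlier in the notebook
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)         # one accuracy score per fold
print(scores.mean())  # average of the 10 fold scores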

In this case, there are 150 response values in the y object. Since this is 10-fold cross-validation, each iteration of cross-validation involves predicting 15 response values. A score of "1" means 15 of 15 were predicted correctly, a score of 0.9333 means 14 of 15 were predicted correctly, etc.

The 0.9667 number is the average of those 10 scores. Because every fold contains the same number of observations (15), you can multiply it by 150 and get 145, meaning 145 of the 150 response values were correctly predicted.
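
The arithmetic can be checked with a short sketch. The per-fold counts below are an assumption, chosen only so that their mean matches the 0.9667 figure; they are not copied from the notebook output.

```python
# Hypothetical number of correct predictions in each of the 10 folds,
# chosen so the mean accuracy comes out at about 0.9667 (an assumption).
fold_size = 15
correct_per_fold = [15, 14, 15, 15, 13, 14, 14, 15, 15, 15]

# Per-fold accuracy, the kind of value cross_val_score reports for each fold.
scores = [correct / fold_size for correct in correct_per_fold]
mean_score = sum(scores) / len(scores)

total_obs = fold_size * len(correct_per_fold)  # 150 observations overall
total_correct = sum(correct_per_fold)          # whole-number count

print(round(mean_score, 4))
print(round(mean_score * total_obs), total_correct)
```

Multiplying the mean score by 150 recovers the whole-number total only because all folds have the same size. Multiplying by 25, which is not the number of observations that were scored, gives a value with no such interpretation.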

I think you were confused by code in the notebook that showed an example dataset in which there were 25 observations. The relevant data actually had 150 observations.

Hope that helps!

@deepish Thanks for your kind words, I'm glad you like the video series!
