HW 2 review by OttoS #2
Hi Otto, thanks for the comments on the HW. You're right, I put 51 because of the previous lab; I thought that's what the figure meant, but I didn't want to mess with anything! I also saw that 27 was the optimized score, but I wanted to see how the data changed when I put in a different value and forgot to change it back. Thanks for catching that, and thanks for explaining the K value further; after the HW and your comments I think I understand it better. I skimmed the data but completely missed the proline values. Scaling is obviously important when you see something like that, so I'll pay closer attention next time. Thank you for the feedback!
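(On the proline point: in the wine data the proline column runs into the thousands while most other features are single digits, so an unscaled Euclidean distance is effectively decided by proline alone. A minimal before/after sketch, assuming the data comes from sklearn's load_wine rather than the course CSV:)

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Unscaled: the large-valued proline column dominates the distance metric
raw = KNeighborsClassifier(n_neighbors=5)
print("raw:   ", cross_val_score(raw, X, y, cv=5).mean())

# Scaled: every feature contributes on a comparable footing
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print("scaled:", cross_val_score(scaled, X, y, cv=5).mean())
```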
No worries! As a general rule, I'd say never be afraid to mess with things.
@jonesmatt415
Hi Matt,
The "official" solutions are posted now. You should be able to do a git pull and find those for review. They're also located here: https://github.com/ga-students/DAT_SF_12/tree/gh-pages/Solutions
So... some comments, in chronological order as they come up in your homework.
Why 51? That was the number of observations we had in the demo example. In this case 51 isn't a bad choice, but I wanted to make sure it was a conscious one. This data set has 178 observations, so technically you could go up to 177 neighbors, although I think 51 is a better cap.... Anyway, small point, but I wanted to make sure you understood why we used 51.
If you were to run the code below (with a random seed of 1), you'd find a different value for K.
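The snippet referred to here didn't survive in the thread, so this is a reconstruction of the kind of K search being discussed, not the original code; it assumes the wine data via sklearn's load_wine rather than the course CSV:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# A different random_state gives a different test slice,
# and therefore a different "best" K.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

scores = {}
for k in range(1, 52):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```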
The point is, be careful picking K values based on a single random slice of the data. Sometimes you end up overfitting the model to that one slice. Another way to do it would be to fit the model and score it with cross-validation before choosing your K value (see the sketch below).
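A sketch of that cross-validation approach (again assuming load_wine; this is not the official solution):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

cv_scores = {}
for k in range(1, 52, 2):  # odd K only, to sidestep ties
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds, instead of one train/test slice
    cv_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 3))
```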
Additionally, you usually want to use an odd value for K so that there are no ties (if I use 2 nearest neighbors and they belong to different classes, which way do I go?).
Let me know if you have questions!
Thanks