HW 2 review by OttoS #2

Open
ostegm opened this issue Feb 24, 2015 · 2 comments
Comments

@ostegm

ostegm commented Feb 24, 2015

@jonesmatt415

Hi Matt,

The "official" solutions are posted now. You should be able to do a git pull and find those for review. They're also located here: https://github.com/ga-students/DAT_SF_12/tree/gh-pages/Solutions

So... some comments, in chronological order running through your homework.

  1. Your application of K-nearest neighbors and cross-validation is great. You arrived at the correct answer, but I wanted to point out a few subtle things. It looks like you used the code from class (which is awesome) - just be sure to think about what the inputs are. For example, you looped through the odd integers from 1 to 49:
n_neighbors = range(1, 51, 2)

Why 51? That was the number of observations we had in the demo example. In this case 51 isn't a bad choice, but I wanted to make sure it was conscious. In this data set we have 178 observations, so technically you could go to 177, although I think 51 is a better choice.... Anyway, small point, but wanted to make sure you understood why we used 51.
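To make the "choose K by cross-validation" idea concrete, here is a minimal sketch of that loop. It assumes a recent scikit-learn; `load_wine` is used as a stand-in for the homework's wine data, and the 5-fold setup is my choice, not from the homework.

```python
# Hypothetical sketch: choosing K by cross-validated accuracy rather than
# eyeballing a single train/test split. load_wine stands in for the
# homework's wine data (178 observations, 3 classes).
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Odd K values only, up to 49, matching range(1, 51, 2) from the homework.
scores = {}
for k in range(1, 51, 2):
    clf = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation averages the score over several slices of the data.
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```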

  2. When you chose the number of neighbors (30), I see that you based this on the graph, which actually showed the score peaking at 27. Remember that this graph was built from only one slice of the data:

If you were to run the code below (with a random seed of 1), you'd find a different optimal value for K.

# train_test_split lived in sklearn.cross_validation at the time; it is in sklearn.model_selection in newer versions
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(wine_variables, wine['Wine'], test_size=0.3, random_state=1)

The point is: be careful picking K values based on a single random slice of the data. You can end up overfitting the model to that one slice. A better approach is to fit the model and score it with cross-validation before choosing your K value.
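Here is a small sketch of the instability described above: on a single train/test split, the "best" K shifts when the random seed changes. It assumes a recent scikit-learn (`train_test_split` now lives in `sklearn.model_selection`), and `load_wine` stands in for the homework data.

```python
# Sketch: the best K found on a single 70/30 split depends on the seed
# used to make that split, which is why cross-validation is safer.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

def best_k_for_seed(seed):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    # Score each odd K on this one held-out slice.
    scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
              for k in range(1, 51, 2)}
    return max(scores, key=scores.get)

# Different seeds will often disagree about the "best" K.
print(best_k_for_seed(0), best_k_for_seed(1))
```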

Additionally, you usually want an odd value for K so that there are no ties (if I use the 2 nearest neighbors and they belong to different classes, which way do I go?).

  3. This is kind of bonus material, but did you notice that proline is of a much bigger magnitude than all the other variables? It ranges from 0-168 while some of the other variables are much, much smaller. This scale problem over-amplifies the effect of proline. If you scale the data, you can get an accuracy of 96%.
from sklearn.preprocessing import StandardScaler
from sklearn import neighbors

# Fit the scaler on the training data only, then apply the same
# transform to the test set so no test information leaks in.
features_scaler = StandardScaler()
X_train_scaled = features_scaler.fit_transform(X_train)
X_test_scaled = features_scaler.transform(X_test)

clf_scaled = neighbors.KNeighborsClassifier(3, weights='uniform')
clf_scaled.fit(X_train_scaled, y_train)
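A convenient way to combine scaling with cross-validation is a Pipeline, which re-fits the scaler on each training fold so no test information leaks into the scoring. This is a sketch under my own assumptions (recent scikit-learn, `load_wine` standing in for the homework data, K=3 as in the snippet above):

```python
# Sketch: bundle StandardScaler and KNN in a Pipeline, then cross-validate
# the whole thing. The scaler is re-fit inside each fold automatically.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
print(round(cross_val_score(model, X, y, cv=5).mean(), 3))
```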
  4. In the clustering section, I like that you worked with the two most influential features, but take a look at the solution set to see how to use the entire (scaled) feature set.
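For reference, clustering the full scaled feature set might look like the sketch below. The details (3 clusters for the 3 wine classes, `load_wine` as a stand-in for the homework data) are my assumptions, not taken from the posted solutions.

```python
# Sketch: K-means on the entire scaled feature set rather than two features.
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
# Scale first so large-magnitude features like proline don't dominate.
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(km.labels_[:10])
```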

Let me know if you have questions!

Thanks

@jonesmatt415
Owner

Hi Otto,

Thanks for the comments regarding the HW. You're right, I put 51 because of the previous lab. I thought that's what the figure meant but I didn't want to mess with anything!

Also I saw that 27 was the optimized score, but I wanted to see how the data changed when I put in a different value and I forgot to change it back. Thanks for catching that.

And thanks for explaining the K value further. After the HW and your comments I think I understand it better. I also skimmed the data, but I completely missed the proline values. Scaling is obviously important when I see something like that, so I'll pay better attention next time!

Thank you for the feedback!

@ostegm
Author

ostegm commented Feb 25, 2015

No worries! As a general rule, I'd say never be afraid to mess with things! That's the whole fun. In fact, mess with everything! Haha

