variance explained #7
Okay, let's try lemmatizing the words before we create the data matrix. That may help.
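Something like this is what I have in mind (a rough sketch with NLTK; `essays` is a placeholder for however we load the essay text in the notebook):

```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("wordnet")
nltk.download("punkt")

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    """Lowercase, tokenize, and lemmatize a single essay string."""
    tokens = word_tokenize(text.lower())
    return " ".join(lemmatizer.lemmatize(tok) for tok in tokens)

# essays is a placeholder for our list of essay strings
lemmatized_essays = [lemmatize_text(essay) for essay in essays]
```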
I just added a bit of code in. I'll push in an hour, got a meeting.
@jnaras: I am going to push the notebook with the fixes for the NaNs (from #4). I am choosing this instead of updating the original notebook since you mentioned you added some code. I want to avoid merge conflicts. What do you both think about moving functions to a separate .py file?
That works for me.
Cool, thanks.
Back on topic. @semerj mentioned that 18% is pretty good for text!
Really? For Marti's paper they used 3 factors and accounted for 48% of the variance. It seems like 10 components and 18% is really low, but we can try and see if it works...
Hmm. Did Marti and her co-author reduce the text prior to the factor analysis (like we did with the 1%)? Or was that their dimensionality reduction approach? (I don't know much about factor analysis.) Yeah, we'll see what happens after lemmatizing.
PCA isn't a panacea. Your performance is going to depend heavily on your features/preprocessing.
Thanks, @semerj. I think we were surprised by our value, especially in comparison to Marti's previous work, which I hadn't read too closely. We have a few other things to try, so we'll see how that works. Also wondering whether SVD or non-negative matrix factorization might work well with text. I think both of these were mentioned in class.
PCA is basically SVD. But we could try it. If 18% works well, that's fine. Marti's previous work used factor analysis to explain 40% of the variance, but it was also a smaller dataset. I think we should then really use the lemmatizing. I'll push that code into the PMI notebook.

As far as code organization goes, I'm hoping that the PMI notebook can just be used as a black box to create the data matrix. Unless you both would rather I push a .py file conversion of it? Let me know.
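If we do end up trying them, it's only a few lines with scikit-learn. A rough sketch (`X` here is a stand-in for the data matrix the PMI notebook produces):

```python
from sklearn.decomposition import TruncatedSVD, NMF

# TruncatedSVD works directly on sparse term matrices (a.k.a. LSA)
svd = TruncatedSVD(n_components=10, random_state=42)
X_svd = svd.fit_transform(X)
print("SVD explained variance:", svd.explained_variance_ratio_.sum())

# NMF needs non-negative input, so it fits counts or TF-IDF,
# but not a PMI matrix with negative entries
nmf = NMF(n_components=10, random_state=42)
X_nmf = nmf.fit_transform(X)
```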
I'm okay with it being used as a black box for now to create the data matrix.
I agree, and I prefer having .py files. But, as Matar mentioned, that's a lower priority right now.
Okay, I pushed it. I didn't have any 'nan' entries in the data matrix at the end. If that doesn't really help, we can try stemming as well.
@jnaras I ran your notebook and the data matrix I got still had NaNs.
The last two lines of
The last line of my notebook asks if there are any NaNs in the data. Without Juan's add-on, I didn't get any. I don't really understand why. But maybe try it?
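For reference, the check is essentially something along these lines (a sketch; `data_matrix` stands in for whatever the notebook builds, and if it's a DataFrame, `data_matrix.isnull().values.any()` does the same thing):

```python
import numpy as np

# True if any entry in the data matrix is NaN
print(np.isnan(data_matrix).any())

# and, if there are NaNs, where they are
rows, cols = np.where(np.isnan(data_matrix))
```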
Okay, I pushed a new fix to the notebook with the additional lines from Juan's edit. I'm sorry, I should have added it earlier. I was just confused where the NaNs came from. Hopefully that helps. I wonder if the pickle module is messing up the data...I'll look into it.
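A quick way to rule pickle in or out (a sketch; it just round-trips the matrix through pickle and compares, treating NaNs as equal):

```python
import pickle
import numpy as np

# round-trip the matrix through pickle and compare the result to the original
roundtripped = pickle.loads(pickle.dumps(data_matrix))
print(np.allclose(data_matrix, roundtripped, equal_nan=True))
```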
@jnaras It's totally fine. I should have communicated better about the update. I wasn't explicit about the fix—I wrote it as a comment on an issue and pushed to master. Anyway, thanks for updating the notebook! I will try to run it now.

Also, I'm going to remove

One additional note. The change to

Thanks again!
After trying a bunch of things, here is a summary of where we are at with PCA/kmeans:
Given that we have 3k features, I guess 50 components isn't too much. I was thinking of sticking with 50 components for now, since 45% of the variance seems respectable. Or maybe even dropping to 20.
The code is ugly and PCA takes a long time to run with 50 components. Until I clean up the code, I saved out the components and the reduced data matrix into a pickle file so you can play around with it. I'll email it to you guys (I think it's small enough). Thoughts?
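For context, the relevant part of the (ugly) code is roughly this (a sketch, not the exact notebook code; `X` is the lemmatized data matrix and the pickle filename is made up):

```python
import pickle
from sklearn.decomposition import PCA

pca = PCA(n_components=50, whiten=True)
X_reduced = pca.fit_transform(X)

# cumulative variance explained by the 50 components (~45% in my run)
print(pca.explained_variance_ratio_.sum())

# save the fitted components and reduced matrix so you don't have to re-run PCA
with open("pca_reduced.pkl", "wb") as f:
    pickle.dump({"components": pca.components_, "X_reduced": X_reduced}, f)
```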
Thanks for the detailed message, @matarhaller!

In terms of the running time for PCA, it looks as if IncrementalPCA might help. Are the clusters still very different?

It's great that you started using the silhouette score. Three clusters might not be too bad given that we're combining all of the essay responses.

I've been thinking a lot about this today. Is it possible that combining makes it more difficult to find different "topics" people are writing about? What I'm thinking of is that a set of users might write about topic A for prompt 0 while another set writes about topic B. However, what if the second set writes about topic A for prompt 1? I'm thinking that across prompts, what's written about might not be too distinctive. Just conjecture, but something to think about.
Since PCA just needs to be run once, and the data matrix fits in memory, I opted to just use the standard PCA.

The only indication I have that the clusters are different comes from different runs of kmeans.

As for separating out different essays - I need to think about this a bit more, but we can try just using a single essay and seeing if it separates out better. My inclination is that since we're using ngrams, we aren't really getting at broader topics anyway, so I don't know if it would be sensitive to the example you gave above. Not sure though...
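To make the "different runs" point concrete, this is the kind of comparison I have in mind (a sketch; `X_reduced` is the 50-component matrix from the pickle file):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

labels = []
for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, n_init=10, random_state=seed)
    labels.append(km.fit_predict(X_reduced))
    print(f"seed {seed}: silhouette = {silhouette_score(X_reduced, labels[-1]):.3f}")

# how similar are the assignments across runs? (1.0 means identical clusterings)
print(adjusted_rand_score(labels[0], labels[1]))
print(adjusted_rand_score(labels[1], labels[2]))
```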
They have an example on the Iris data set, which, for sure, fits into memory, using IncrementalPCA.

Regarding the documentation noting that PCA only works with dense arrays, I think they mean the actual data structure or representation of the data and not the values themselves.

I'd like to think more about the last point, too. Even though we're using ngrams, I'm thinking it might still be influenced. Let me try another example in 2d space with just two essays. If the features we were clustering on, for example, are the tokens "good" and "bad," a few possibilities are:

1. users who mostly write "good" in both essays
2. users who write "good" in one essay and "bad" in the other
3. users who mostly write "bad" in both essays
Imagine "good" along the x-axis and "bad" along the y-axis. Regardless of whether we combine essays or not, users in scenarios 1 and 3 would presumably be correctly grouped. We can imagine them on opposite sides of the coordinate system—large x and small y versus small x and large y. If combining essays for users under scenario 2, however, they might be in between (and possibly overlap with) users in 1 and 3, making it would be difficult to tell which cluster they might belong to. If the essays are separate, though, they we know those under scenario 2 will be in the "good" group for one of the essays and the "bad" group for the other. I'm not sure if this might even be remotely representative of what's happening in our data, but it's what I was thinking about. Thanks for entertaining the idea, though! |
Re: IncrementalPCA - Since we're only running PCA one time, I think I'm okay with just doing the standard one. Also, since it seems like IncrementalPCA is an approximation of PCA, I think we're better off just using PCA if we can.

I agree with your assessment of dense arrays - I just wanted to hear it from someone else :)

And as for splitting up the essays - I think it's an empirical question. We can try! Not sure, though, whether we should focus on a single essay or do it separately for a few (and if so, which do we pick?)
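If we ever want to sanity-check how close the approximation is, something like this would do it (a sketch; the batch size is arbitrary and `X` is our data matrix):

```python
from sklearn.decomposition import PCA, IncrementalPCA

pca = PCA(n_components=50)
ipca = IncrementalPCA(n_components=50, batch_size=500)

X_pca = pca.fit_transform(X)
X_ipca = ipca.fit_transform(X)

# the two should explain roughly the same amount of variance
print(pca.explained_variance_ratio_.sum(), ipca.explained_variance_ratio_.sum())
```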
You're hard to convince, @matarhaller!
@juanshishido If incremental PCA can be used for matrices that don't all fit into memory, it's going to hit disk a lot. This will actually make it a lot slower.

@matarhaller Okay! I'll try to read up on varimax rotation and implement it in Python for non-square matrices.

@both, I'll convert my data_matrix generation script into .py for ease of use and merging.
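@matarhaller in case it saves you time, this is the kind of thing I'm planning to start from (a sketch of the standard varimax update; `loadings` would be the PCA components arranged as a features-by-components matrix, which doesn't need to be square):

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Varimax rotation of a (p x k) loading matrix; p and k need not be equal."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        d_old = d
        Lam = loadings @ R
        # SVD of the varimax criterion gradient gives the next rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (Lam ** 3 - (gamma / p) * Lam @ np.diag((Lam ** 2).sum(axis=0)))
        )
        R = u @ vt
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return loadings @ R
```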
@jnaras I think I'll need to
Wow, ~73% slower!
Thanks, @jnaras!
Looks like PCA will be harder than we thought.
When I take the first 10 principal components (after whitening), they collectively only explain about 18% of the variance. I'm not sure if the problem is in the data we're putting in (maybe imputing isn't helping our cause?), but we might need to think about this a little bit.
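For reference, the reduction step is essentially this (a rough sketch, not the exact notebook code; `X` stands in for the imputed data matrix):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=10, whiten=True)
X_reduced = pca.fit_transform(X)

# total variance explained by the first 10 components (~0.18 for me)
print(pca.explained_variance_ratio_.sum())
```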