nan in datamatrix #4

matarhaller · 2015-11-23T19:25:35Z

The data matrix has NaNs in it, which breaks PCA. I'm not completely sure why they are there, but do you think it's reasonable to just replace the NaNs with 0?

juanshishido · 2015-11-23T22:04:22Z

I think I know what might be going on. Concatenating NaNs with non-NaNs results in NaNs (somewhat related (at about 3:40)). I'm working on a fix now.
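A minimal sketch of the NaN propagation described above, assuming the essays are combined with pandas string concatenation (the thread does not show the actual code). The column names `essay0` and `essay1` and the sample text are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical essay columns; the real column names are not shown in this thread.
df = pd.DataFrame({
    "essay0": ["I like hiking.", np.nan],
    "essay1": ["I work in tech.", "Ask me anything."],
})

# Element-wise string concatenation propagates NaN: one missing essay
# makes the whole combined value missing.
combined = df["essay0"] + " " + df["essay1"]
print(combined.isnull().tolist())

# Filling missing essays with empty strings first avoids the propagation.
filled = df["essay0"].fillna("") + " " + df["essay1"].fillna("")
print(filled.isnull().tolist())
```

The second row of `combined` is missing even though `essay1` has text, which is one way a single blank essay could end up as a NaN downstream.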

juanshishido · 2015-11-23T22:17:00Z

Maybe.

juanshishido · 2015-11-23T23:07:02Z

@matarhaller What was the code you had for checking NaNs? I have the data_matrix object and want to check it.

juanshishido · 2015-11-23T23:09:42Z

Got it 😅

>>> np.isnan(data_matrix.todense()).sum()
0

👍

matarhaller · 2015-11-23T23:09:45Z

just to check if anything is nan: np.isnan(datamatrix).any()

or you can do np.where(np.isnan(datamatrix)) to figure out exactly where the nans are
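The checks above can also be run without densifying the matrix. This is a sketch under the assumption that `data_matrix` is a SciPy sparse matrix (which the `.todense()` call suggests); the toy values below are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for data_matrix; the real matrix is (57822, 3429).
dense = np.array([[1.0, 0.0, np.nan],
                  [0.0, 2.0, 0.0]])
data_matrix = sparse.csr_matrix(dense)

# Only explicitly stored entries can be NaN, so checking .data is enough
# and avoids allocating the full dense array.
has_nan = bool(np.isnan(data_matrix.data).any())
nan_count = int(np.isnan(data_matrix.data).sum())

# COO format exposes row and column indices aligned with .data,
# which gives the (row, col) position of every NaN.
coo = data_matrix.tocoo()
mask = np.isnan(coo.data)
nan_positions = list(zip(coo.row[mask].tolist(), coo.col[mask].tolist()))

print(has_nan, nan_count, nan_positions)
```

For a matrix of this size, looking only at the stored values should use far less memory than `todense()`, though both approaches answer the same question.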

matarhaller · 2015-11-23T23:10:12Z

@juanshishido You're too speedy!

juanshishido · 2015-11-23T23:11:22Z

Thanks!

The shape of the matrix is now: (57822, 3429). I don't remember the original dimensions, but it's good now.

I created a new notebook for this in a new branch. I think it might be better to just modify the original. What do you all think?

juanshishido · 2015-11-23T23:16:27Z

What was happening was that some people did not fill out any essays. So their TotalEssays values were blank. I am returning this instead: return df[df.TotalEssays.str.len() > 0]. I also found that some of those "empty" TotalEssays had a length greater than 0. So I also added this: .apply(lambda x: re.sub('\s+', ' ', x).strip()).
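Put together, the two steps might look like the sketch below. Only the `TotalEssays` column name, the length filter, and the `re.sub` call come from the comment above; the function name `drop_empty_essays` and the sample rows are hypothetical.

```python
import re

import pandas as pd

# Hypothetical sample: two real essays, one whitespace-only, one empty.
df = pd.DataFrame(
    {"TotalEssays": ["I love  hiking.\n", "   \n ", "", "Books and\tfilms"]}
)


def drop_empty_essays(df):
    """Normalize whitespace, then keep only rows with non-empty essay text."""
    cleaned = df.copy()
    # Collapse runs of whitespace to a single space and trim the ends, so
    # that whitespace-only "empty" essays end up with length 0.
    cleaned["TotalEssays"] = cleaned["TotalEssays"].apply(
        lambda x: re.sub(r"\s+", " ", x).strip()
    )
    return cleaned[cleaned["TotalEssays"].str.len() > 0]


result = drop_empty_essays(df)
print(result["TotalEssays"].tolist())
```

The order matters: normalizing first is what lets the length filter catch the essays that only contain spaces or newlines.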

juanshishido · 2015-11-23T23:20:32Z

A question that stems (NLP joke) from this is, do we want to only use individuals who filled something out for all essays or are partial responses okay (of course, no responses aren't useful)?

matarhaller · 2015-11-23T23:23:00Z

Good point. Since we have so much data, I'm okay with dropping people that didn't answer all the essays.
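The two options being discussed (all essays answered versus at least one) could be compared with something like the sketch below. It assumes one column per essay; the column names `essay0` and `essay1` and the sample rows are hypothetical, since the thread does not show the schema.

```python
import numpy as np
import pandas as pd

# Hypothetical per-essay columns; the actual schema is not shown in this thread.
essay_cols = ["essay0", "essay1"]
df = pd.DataFrame({
    "essay0": ["Hiking.", np.nan, "  ", "Cooking.", np.nan],
    "essay1": ["Engineer.", "Teacher.", "Nurse.", np.nan, ""],
})

# An essay counts as answered only if it is present and not blank
# after stripping whitespace.
stripped = df[essay_cols].apply(lambda col: col.str.strip())
answered = stripped.notnull() & (stripped != "")

# Strict option: keep people who answered every essay.
complete = df[answered.all(axis=1)]

# Lenient option: keep people who answered at least one essay.
partial = df[answered.any(axis=1)]

print(len(df), len(complete), len(partial))
```

Comparing `len(complete)` with `len(partial)` on the real data would show how many people the stricter rule drops.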

jnaras · 2015-11-23T23:32:47Z

Oh, okay! Sounds good. Happy to drop people who didn't answer and happy to convert to .py files.

juanshishido · 2015-11-23T23:38:16Z

Great! We'll have to make sure to add that in.

juanshishido · 2015-11-23T23:38:20Z

4b9355f fixes this.

juanshishido · 2015-11-24T08:23:53Z

Decided to move the conversation of NaNs we were having in #7 here.

@jnaras Everything ran and confirmed that np.isnan(data_matrix.todense()).sum() == 0.

With 5fd38b0, I rearranged the imports slightly (and removed the ones we were not using), removed the print statements in filter_vocab and create_data_matrix, added whitespace to the list comprehensions in generate_freqdists and filter_vocab, and changed the formatting for the "Calculating PMI Features" cell.

Thank you!

juanshishido · 2015-11-24T08:39:50Z

Also, the pickled data is good 👍

matarhaller · 2015-11-24T12:55:40Z

So is master fully updated?

juanshishido · 2015-11-24T16:16:17Z

@matarhaller Yeah. It says jaya is 3 commits ahead of master, but that's because of how I updated master: I fetched the jaya branch to get Calculate PMI features.ipynb, updated it, and pushed to master.

juanshishido mentioned this issue Nov 23, 2015

variance explained #7

Open

juanshishido closed this as completed Nov 23, 2015
