
Scatter plot from learned code representations #19

Open
sathickibrahims18 opened this issue Apr 15, 2019 · 16 comments
@sathickibrahims18 commented Apr 15, 2019

Hello Ed,

In Med2Vec, after creating the model file, you created a 2D scatter plot using the learned code representations. Is any grouping performed among the medical codes after creating the model file for the scatter plot?

I ask because in Highcharts, the coloring is based on a grouping.
example:
https://jsfiddle.net/gh/get/library/pure/highcharts/highcharts/tree/master/samples/highcharts/demo/scatter/

I tried to create a scatter plot after running t-SNE on the embeddings. The plot is created, but there is no grouping: the colors are placed randomly and no clusters form.

Can you please help me in understanding this?

Thanks,
SathickIbrahim

@mp2893 (Owner) commented Apr 15, 2019

I showed the scatterplot of ICD9 diagnosis codes, which can be grouped by the ICD9 taxonomy (http://www.icd9data.com/2015/Volume1/default.htm).
For other codes (medication, procedure codes), you just need to find the right grouper.
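As a sketch of that approach: the snippet below maps ICD-9 code strings to their Volume 1 chapter and uses the chapter as the color group for the t-SNE scatter plot. The chapter boundaries follow the ICD-9 taxonomy linked above; `code_list` (the row-to-code mapping from your own preprocessing) and the plotting calls are assumptions, not names defined by the med2vec repo.

```python
# Sketch: color a 2-D t-SNE projection of the code embeddings by ICD-9 chapter.
import numpy as np

# ICD-9 Volume 1 chapters, keyed by the numeric range of the 3-digit prefix.
ICD9_CHAPTERS = [
    (1, 139, "Infectious/Parasitic"),
    (140, 239, "Neoplasms"),
    (240, 279, "Endocrine/Metabolic"),
    (280, 289, "Blood"),
    (290, 319, "Mental"),
    (320, 389, "Nervous/Sense"),
    (390, 459, "Circulatory"),
    (460, 519, "Respiratory"),
    (520, 579, "Digestive"),
    (580, 629, "Genitourinary"),
    (630, 679, "Pregnancy"),
    (680, 709, "Skin"),
    (710, 739, "Musculoskeletal"),
    (740, 759, "Congenital"),
    (760, 779, "Perinatal"),
    (780, 799, "Symptoms/Ill-defined"),
    (800, 999, "Injury/Poisoning"),
]

def icd9_chapter(code):
    """Map an ICD-9 code string (e.g. '250.00') to its chapter name."""
    if code.startswith(("E", "V")):
        return "E/V codes"
    prefix = int(code.split(".")[0])
    for lo, hi, name in ICD9_CHAPTERS:
        if lo <= prefix <= hi:
            return name
    return "Unknown"

# After t-SNE (e.g. sklearn.manifold.TSNE) reduces the code vectors to a
# 2-D array `xy`, one scatter call per chapter gives grouped colors:
#   for name in sorted(set(map(icd9_chapter, code_list))):
#       idx = [i for i, c in enumerate(code_list) if icd9_chapter(c) == name]
#       plt.scatter(xy[idx, 0], xy[idx, 1], label=name, s=5)
#   plt.legend()
```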

@sathickibrahims18 (Author)

Thanks Ed!

@sathickibrahims18 (Author) commented May 14, 2019

Hello Ed,

My goal is to predict similar diagnosis codes using Med2Vec.

For example, suppose I have 140 medical codes, an embedding dimension of 200, and a hidden dimension of 2000.

I ran your code, got the .npz file, and found six numpy arrays inside it: W_emb, b_output, b_hidden, b_emb, W_output, W_hidden.

Which one is used to predict similar codes, W_emb or W_output?

With my input, W_emb is 140x200 and W_output is 2000x140. In https://github.com/mp2893/med2vec/issues/16#issue-403688598 you mentioned that W_output is used to predict neighboring visits.

Also, how do we know which diagnosis code each embedding corresponds to? Where does the mapping between embeddings and medical codes happen?

Could you please clarify these points?
Thank you

@mp2893 (Owner) commented May 15, 2019

If you want to find similar diagnosis codes, you should use W_emb and b_emb.
Each row of W_emb corresponds to a specific medical code (e.g. diagnosis code, medication code, etc.).
Given diagnosis code A, its vector representation is relu(W_emb[row that corresponds to A] + b_emb). Given A's vector representation, calculate the cosine similarity between it and the vector representations of all other medical codes. Whichever is closest in terms of cosine similarity is the diagnosis code most similar to A.
Hope this helps.
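A minimal sketch of this lookup, assuming the .npz saved by training holds `W_emb` and `b_emb` and that you keep your own row-to-code mapping (the filename below is hypothetical):

```python
# Build code vectors as relu(W_emb + b_emb), then rank all codes by
# cosine similarity to a query row.
import numpy as np

def code_vectors(W_emb, b_emb):
    """relu(W_emb + b_emb): one row per medical code."""
    return np.maximum(W_emb + b_emb, 0.0)

def most_similar(vecs, query_row, topk=5):
    """Row indices of the top-k codes closest to `query_row` by cosine similarity."""
    q = vecs[query_row]
    norms = np.linalg.norm(vecs, axis=1) * np.linalg.norm(q)
    sims = vecs @ q / np.maximum(norms, 1e-12)
    sims[query_row] = -np.inf          # exclude the query itself
    return np.argsort(-sims)[:topk]

# Usage with the saved model file:
# params = np.load("med2vec_model.npz")   # hypothetical filename
# vecs = code_vectors(params["W_emb"], params["b_emb"])
# print(most_similar(vecs, row_of_query_code, topk=10))
```

Which row corresponds to which code is entirely determined by the integer IDs you assigned during preprocessing, so the inverse of that ID dictionary translates the returned indices back into diagnosis codes.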

@sathickibrahims18 (Author)

Thanks Ed,

When I run the Theano code on CPU it works fine, but it is slow (16 hours per epoch).

If I run it on GPU, it throws a segmentation fault at this line: cost = f_grad_shared(x, batchD, y, mask, iVector, jVector).

Could you please help me to solve this issue?

@mp2893 (Owner) commented May 24, 2019

Unfortunately, that error seems to be caused by system-related issues rather than the algorithm itself (unless it's the NaN error).
There is nothing much I can do unless we sit together side-by-side.
I'd suggest taking a look at whether you are using compatible versions of CUDA, Nvidia driver, and Theano. (I know these issues are difficult, but I can't think of any reason why the code would run fine on CPU but not on GPU).

@sathickibrahims18 (Author)

Could you please tell me the versions of CUDA, Nvidia driver and Theano you have used?

@mp2893 (Owner) commented May 31, 2019

It says in the README that I used Theano 0.7.
As for CUDA, IIRC, I used either 6.0 or 7.0.
As for the Nvidia driver, I really have no idea.
If you understand med2vec code, I suggest you just implement a TensorFlow version, since med2vec is not that complicated. Most work is spent on pre-processing the data, and the neural net itself is quite straightforward.

@sathickibrahims18 (Author)

Thanks Ed,

But I also tried a TensorFlow version; it too takes 16 hours per epoch because of the huge volume of data.

Could you please help me reduce the time per epoch?

@mp2893 (Owner) commented Jun 12, 2019

That's weird. How can the job take the same amount of time (16 hours) on both CPU and GPU?

@sathickibrahims18 (Author)

Yes Ed, I have run into some weird issues.

cost = f_grad_shared(x, batchD, mask, iVector, jVector)

This particular line takes most of the time; after removing it, the model runs fine.
Could you please help me understand whether this line is really necessary for the model?

@mp2893 (Owner) commented Jun 12, 2019

If you are using demographic information, and are using grouped codes for the softmax output label, then you are going to need that line. Otherwise, the model won't be trained at all.
Please check whether you are using the correct set of option arguments.

@sathickibrahims18 (Author)

OK Ed, but I didn't use demographic information, and I didn't use grouped codes either.

cost = f_grad_shared(x, batchD, mask, iVector, jVector)

I have removed this line and generated the embeddings.

The generated embeddings are quite good; I validated them using cosine similarity.
For example, if we search for Type 2 Diabetes, it returns comorbidities such as Chronic Kidney Disease.

Could you please confirm whether this approach is correct or not?

@mp2893 (Owner) commented Jun 14, 2019

My mistake. The line you deleted is used only when you are using demographic information, but not grouped codes (see line 274 of the source code).
It's still weird, because even though you are not using demographic info, deleting that line somehow impacts the experiment.
Although I don't have any good answer, I guess it's fine as long as your experiment runs without any issue.

@victorconan

> If you want to find similar diagnosis codes, you should use W_emb and b_emb.
> Each row of W_emb corresponds to a specific medical code (e.g. diagnosis code, medication code, etc.).
> Given diagnosis code A, its vector representation is relu(W_emb[row that corresponds to A] + b_emb). Given A's vector representation, calculate the cosine similarity between it and the vector representations of all other medical codes. Whichever is closest in terms of cosine similarity is the diagnosis code most similar to A.
> Hope this helps.

Is the diagnosis code representation ReLU(W_emb + b_emb) or ReLU(W_emb)? In the paper, it says:

> The code representations to be learned is denoted as a matrix Wc' = ReLU(Wc)

@mp2893 (Owner) commented Jan 14, 2021

Technically it should be ReLU(W_emb[row that corresponds to A] + b_emb).
I guess I omitted the bias term in the paper.
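A tiny numeric illustration of the difference (made-up numbers): with the bias, ReLU can activate a dimension that ReLU of the raw embedding row would zero out, so the two formulas give different, if usually similar, code vectors.

```python
import numpy as np

w = np.array([-0.25, 0.5])   # one (made-up) row of W_emb
b = np.array([0.5, 0.0])     # a (made-up) b_emb

relu_w  = np.maximum(w, 0)       # ReLU(W_emb row): first dimension is zeroed out
relu_wb = np.maximum(w + b, 0)   # ReLU(W_emb row + b_emb): first dimension survives
print(relu_w, relu_wb)
```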
