
Scatter plot from learned code representations #19

Open
sathickibrahims18 opened this issue Apr 15, 2019 · 16 comments
@sathickibrahims18 commented Apr 15, 2019

Hello Ed,

In Med2Vec, after creating the model file, you created a 2D scatter plot using the learned code representations. Is any grouping performed among the medical codes after creating the model file for the scatter plot?

I ask because in Highcharts, the coloring is based on a grouping.
example:
https://jsfiddle.net/gh/get/library/pure/highcharts/highcharts/tree/master/samples/highcharts/demo/scatter/

I tried to create a scatter plot after running t-SNE on the embeddings. The plot is created, but there is no grouping: the colors are placed randomly and no clusters form.

Can you please help me in understanding this?

Thanks,
SathickIbrahim

@mp2893 (Owner) commented Apr 15, 2019

I showed the scatterplot of ICD9 diagnosis codes, which can be grouped by the ICD9 taxonomy (http://www.icd9data.com/2015/Volume1/default.htm).
For other codes (medication, procedure codes), you just need to find the right grouper.
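As a sketch of that approach: the snippet below maps ICD-9 code strings to their Volume 1 chapter and uses the chapter as the color group for the t-SNE scatter plot. The chapter boundaries follow the ICD-9 taxonomy linked above; `code_list` (the row-to-code mapping from your own preprocessing) and the plotting calls are assumptions, not names defined by the med2vec repo.

```python
# Sketch: color a 2-D t-SNE projection of the code embeddings by ICD-9 chapter.
import numpy as np

# ICD-9 Volume 1 chapters, keyed by the numeric range of the 3-digit prefix.
ICD9_CHAPTERS = [
    (1, 139, "Infectious/Parasitic"),
    (140, 239, "Neoplasms"),
    (240, 279, "Endocrine/Metabolic"),
    (280, 289, "Blood"),
    (290, 319, "Mental"),
    (320, 389, "Nervous/Sense"),
    (390, 459, "Circulatory"),
    (460, 519, "Respiratory"),
    (520, 579, "Digestive"),
    (580, 629, "Genitourinary"),
    (630, 679, "Pregnancy"),
    (680, 709, "Skin"),
    (710, 739, "Musculoskeletal"),
    (740, 759, "Congenital"),
    (760, 779, "Perinatal"),
    (780, 799, "Symptoms/Ill-defined"),
    (800, 999, "Injury/Poisoning"),
]

def icd9_chapter(code):
    """Map an ICD-9 code string (e.g. '250.00') to its chapter name."""
    if code.startswith(("E", "V")):
        return "E/V codes"
    prefix = int(code.split(".")[0])
    for lo, hi, name in ICD9_CHAPTERS:
        if lo <= prefix <= hi:
            return name
    return "Unknown"

# After t-SNE (e.g. sklearn.manifold.TSNE) reduces the code vectors to a
# 2-D array `xy`, one scatter call per chapter gives grouped colors:
#   for name in sorted(set(map(icd9_chapter, code_list))):
#       idx = [i for i, c in enumerate(code_list) if icd9_chapter(c) == name]
#       plt.scatter(xy[idx, 0], xy[idx, 1], label=name, s=5)
#   plt.legend()
```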

@sathickibrahims18 (Author)

Thanks Ed!

@sathickibrahims18 (Author) commented May 14, 2019

Hello Ed,

My goal is to predict similar diagnosis codes using Med2Vec.

For example, suppose I have 140 medical codes, an embedding dimension of 200, and a hidden dimension of 2000.

I ran your code, got the .npz file, and found six numpy arrays inside it: W_emb, b_output, b_hidden, b_emb, W_output, W_hidden.

Which one is used to predict similar codes, W_emb or W_output?

With my input, W_emb is 140x200 and W_output is 2000x140. In https://github.com/mp2893/med2vec/issues/16#issue-403688598 you mentioned that W_output is used to predict neighboring visits.

Also, how do we know which diagnosis code each embedding corresponds to? Where does the mapping between embeddings and medical codes happen?

Could you please clarify these points?
Thank you

@mp2893 (Owner) commented May 15, 2019

If you want to find similar diagnosis codes, you should use W_emb and b_emb.
Each row of W_emb corresponds to a specific medical code (e.g. diagnosis code, medication code, etc.).
Given diagnosis code A, its vector representation is relu(W_emb[row that corresponds to A] + b_emb). Given A's vector representation, calculate the cosine similarity between it and the vector representations of all other medical codes. Whichever is closest in terms of cosine similarity is the diagnosis code most similar to A.
Hope this helps.
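A minimal sketch of this lookup, assuming the .npz saved by training holds `W_emb` and `b_emb` and that you keep your own row-to-code mapping (the filename below is hypothetical):

```python
# Build code vectors as relu(W_emb + b_emb), then rank all codes by
# cosine similarity to a query row.
import numpy as np

def code_vectors(W_emb, b_emb):
    """relu(W_emb + b_emb): one row per medical code."""
    return np.maximum(W_emb + b_emb, 0.0)

def most_similar(vecs, query_row, topk=5):
    """Row indices of the top-k codes closest to `query_row` by cosine similarity."""
    q = vecs[query_row]
    norms = np.linalg.norm(vecs, axis=1) * np.linalg.norm(q)
    sims = vecs @ q / np.maximum(norms, 1e-12)
    sims[query_row] = -np.inf          # exclude the query itself
    return np.argsort(-sims)[:topk]

# Usage with the saved model file:
# params = np.load("med2vec_model.npz")   # hypothetical filename
# vecs = code_vectors(params["W_emb"], params["b_emb"])
# print(most_similar(vecs, row_of_query_code, topk=10))
```

Which row corresponds to which code is entirely determined by the integer IDs you assigned during preprocessing, so the inverse of that ID dictionary translates the returned indices back into diagnosis codes.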

@sathickibrahims18 (Author)

Thanks Ed,

When I run the Theano code on CPU it works fine, but it is slow (16 hours per epoch).

If I run it on GPU, it throws a segmentation fault at this line: cost = f_grad_shared(x, batchD, y, mask, iVector, jVector).

Could you please help me to solve this issue?

@mp2893 (Owner) commented May 24, 2019

Unfortunately, that error seems to be caused by system-related issues rather than the algorithm itself (unless it's the NaN error).
There is nothing much I can do unless we sit together side-by-side.
I'd suggest taking a look at whether you are using compatible versions of CUDA, Nvidia driver, and Theano. (I know these issues are difficult, but I can't think of any reason why the code would run fine on CPU but not on GPU).

@sathickibrahims18 (Author)

Could you please tell me the versions of CUDA, Nvidia driver and Theano you have used?

@mp2893 (Owner) commented May 31, 2019

It says in the README that I used Theano 0.7.
As for CUDA, IIRC, I used either 6.0 or 7.0.
As for the Nvidia driver, I really have no idea.
If you understand med2vec code, I suggest you just implement a TensorFlow version, since med2vec is not that complicated. Most work is spent on pre-processing the data, and the neural net itself is quite straightforward.

@sathickibrahims18 (Author)

Thanks Ed,

But I also tried a TensorFlow version; it too takes 16 hours per epoch because of the huge volume of data.

Could you please help me reduce the time per epoch?

@mp2893 (Owner) commented Jun 12, 2019

That's weird. How can the job take the same amount of time (16 hours) on both CPU and GPU?

@sathickibrahims18 (Author)

Yes Ed, I have run into some weird issues.

cost = f_grad_shared(x, batchD, mask, iVector, jVector)

This particular line takes most of the time; after removing it, the model runs fine.
Could you please help me understand whether this line is really necessary for the model?

@mp2893 (Owner) commented Jun 12, 2019

If you are using demographic information, and are using grouped codes for the softmax output label, then you are going to need that line. Otherwise, the model won't be trained at all.
Please check whether you are using the correct set of option arguments.

@sathickibrahims18 (Author)

OK Ed, but I didn't use demographic information, and I didn't use grouped codes either.

cost = f_grad_shared(x, batchD, mask, iVector, jVector)

I have removed this line and generated the embeddings.

The generated embeddings are quite good; I validated them using cosine similarity.
For example, if we search for Type 2 Diabetes, it returns comorbidities such as Chronic Kidney Disease.

Could you please confirm whether this approach is correct or not?

@mp2893 (Owner) commented Jun 14, 2019

My mistake. The line you deleted is used only when you are using demographic information, but not grouped codes (see line 274 of the source code).
It's still weird, because even though you are not using demographic info, deleting that line somehow impacts the experiment.
Although I don't have any good answer, I guess it's fine as long as your experiment runs without any issue.

@victorconan

> If you want to find similar diagnosis codes, you should use W_emb and b_emb.
> Each row of W_emb corresponds to a specific medical code (e.g. diagnosis code, medication code, etc.).
> Given diagnosis code A, its vector representation is relu(W_emb[row that corresponds to A] + b_emb). Given A's vector representation, calculate the cosine similarity between it and the vector representations of all other medical codes. Whichever is closest in terms of cosine similarity is the diagnosis code most similar to A.
> Hope this helps.

Is the diagnosis code representation ReLU(W_emb + b_emb) or ReLU(W_emb)? In the paper, it says:

> The code representations to be learned is denoted as a matrix Wc' = ReLU(Wc)

@mp2893 (Owner) commented Jan 14, 2021

Technically it should be ReLU(W_emb[row that corresponds to A] + b_emb).
I guess I omitted the bias term in the paper.
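A tiny numeric illustration of the difference (made-up numbers): with the bias, ReLU can activate a dimension that ReLU of the raw embedding row would zero out, so the two formulas give different, if usually similar, code vectors.

```python
import numpy as np

w = np.array([-0.25, 0.5])   # one (made-up) row of W_emb
b = np.array([0.5, 0.0])     # a (made-up) b_emb

relu_w  = np.maximum(w, 0)       # ReLU(W_emb row): first dimension is zeroed out
relu_wb = np.maximum(w + b, 0)   # ReLU(W_emb row + b_emb): first dimension survives
print(relu_w, relu_wb)
```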
