A few questions #5

Closed
2g-XzenG opened this issue Oct 11, 2017 · 4 comments

Comments

@2g-XzenG

Hello Ed,

Thanks for sharing this great work with us!

After having trouble accessing an EHR dataset, I wondered whether we could generate synthetic data instead, which led me to this paper.

I have a few questions though:

  1. It seems to me that sequential patient data is more useful for many tasks. Have you tried generating this kind of data (as you mentioned in future work)? For example, each patient could be treated as a matrix in which each row is a visit.

  2. Have you tried any real-world tasks on the synthetic data? If so, can we trust the results we get from the synthetic data?
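
To make the representation in question 1 concrete, here is a minimal sketch. The integer code indices, the vocabulary size, and the helper names are all hypothetical and not taken from the medGAN code. It contrasts the per-visit matrix with a single aggregated count vector per patient.

```python
from typing import List


def visits_to_matrix(visits: List[List[int]], num_codes: int) -> List[List[int]]:
    """Return a (num_visits x num_codes) binary matrix, one row per visit."""
    matrix = []
    for codes in visits:
        row = [0] * num_codes
        for code in codes:
            row[code] = 1  # mark the code as present in this visit
        matrix.append(row)
    return matrix


def aggregate_visits(matrix: List[List[int]]) -> List[int]:
    """Collapse the visit matrix into one count vector for the patient."""
    return [sum(column) for column in zip(*matrix)]


# Hypothetical patient: three visits over a vocabulary of five medical codes.
patient_visits = [[0, 2], [2], [1, 4]]
patient_matrix = visits_to_matrix(patient_visits, num_codes=5)
patient_counts = aggregate_visits(patient_matrix)
```

The matrix keeps the order of visits, while the aggregated vector discards it.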

Thanks!
Xianlong

@mp2893
Owner

mp2893 commented Oct 11, 2017

Hi Xianlong,

  1. Actually I'm currently working on it.

  2. I've used the synthetic data from medGAN to train a heart-failure prediction model (I supplemented the dataset with synthetic heart-failure case patients, as they are rarer than control patients), and I observed improved recall. But this was very preliminary work, and a more rigorous evaluation is necessary.
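
The augmentation described in point 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the code used in the experiment: the function names and the toy rows are hypothetical, and the synthetic rows stand in for samples drawn from a generator trained on case patients.

```python
from typing import List, Sequence, Tuple


def augment_with_synthetic_cases(
    real_x: Sequence[Sequence[int]],
    real_y: Sequence[int],
    synthetic_cases: Sequence[Sequence[int]],
) -> Tuple[List[List[int]], List[int]]:
    """Append synthetic case patients (label 1) to the real training set."""
    x = [list(row) for row in real_x] + [list(row) for row in synthetic_cases]
    y = list(real_y) + [1] * len(synthetic_cases)
    return x, y


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true cases that the model flagged as cases."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy, hand-written rows: one real case and three real controls.
real_x = [[1, 0], [0, 1], [0, 0], [0, 1]]
real_y = [1, 0, 0, 0]
synthetic_cases = [[1, 1], [1, 0]]  # placeholder for generator output

aug_x, aug_y = augment_with_synthetic_cases(real_x, real_y, synthetic_cases)
```

Recall should be measured on held-out real patients only, so that the synthetic rows influence training but not the evaluation.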

@2g-XzenG
Author

Hi Ed,

Thanks for the reply!

For 2: have you tried training the model entirely on the synthetic data? If a model that performs well on the synthetic data also performs well on the real data (much like training and validation sets), I think that would be a strong argument that the synthetic data is really good. Am I right?

Also, since you mentioned a heart-failure prediction model, I was wondering whether you are also generating the labels for the EHR data. For example, heart failure would be 1 and control would be 0. In other words, can this model be used to generate labeled data, for example by adding the label as the last column of the data?

Thank you

@mp2893
Owner

mp2893 commented Oct 20, 2017

Hi Xianlong,

Figures 3 and 7 in my paper show exactly what you described. I trained logistic regression classifiers on real data and on synthetic data, then tested them on held-out real data. There are many details that cannot be covered here, so I recommend you read my paper.
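
The shape of that comparison can be sketched as below. This is not the paper's evaluation code: the data is a toy stand-in produced by a hypothetical `sample_patients` helper, the "synthetic" set is simply a second draw from the same toy sampler rather than generator output, and the logistic regression is a small gradient-descent version written out so that the example has no dependencies.

```python
import math
import random
from typing import List, Sequence, Tuple


def sample_patients(n: int, rng: random.Random) -> Tuple[List[List[int]], List[int]]:
    """Toy data: feature 0 agrees with the label 90% of the time, the rest is noise."""
    x, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        signal = label if rng.random() < 0.9 else 1 - label
        x.append([signal, rng.randint(0, 1), rng.randint(0, 1)])
        y.append(label)
    return x, y


def predict_proba(w: Sequence[float], row: Sequence[int]) -> float:
    """Sigmoid of the linear score; the bias is stored as the last weight."""
    z = w[-1] + sum(wj * v for wj, v in zip(w, row))
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(x, y, epochs: int = 200, lr: float = 0.5) -> List[float]:
    """Batch gradient descent on the logistic loss."""
    w = [0.0] * (len(x[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(x, y):
            err = predict_proba(w, row) - label
            for j, v in enumerate(row):
                grad[j] += err * v
            grad[-1] += err
        w = [wj - lr * gj / len(x) for wj, gj in zip(w, grad)]
    return w


def accuracy(w, x, y) -> float:
    hits = sum(1 for row, label in zip(x, y) if (predict_proba(w, row) >= 0.5) == (label == 1))
    return hits / len(x)


real_train = sample_patients(300, random.Random(0))
real_test = sample_patients(200, random.Random(1))   # held-out real data
synthetic = sample_patients(300, random.Random(2))   # placeholder for generated data

acc_real = accuracy(train_logreg(*real_train), *real_test)
acc_synth = accuracy(train_logreg(*synthetic), *real_test)
```

If the synthetic data preserves the signal in the real data, `acc_synth` should land close to `acc_real`; a large gap suggests the generator is missing something the classifier needs.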

You can generate a labeled dataset in many ways. You can add an additional column, as you suggested, or you can develop a conditional generator. In my case, I trained two separate medGANs, one for the case dataset and the other for the control dataset. But as I said, this experiment was not rigorously conducted, so I can't say that my method is optimal.
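
Two of these labeling options can be sketched as follows. The helper names and the toy rows are hypothetical, and the rows stand in for generator samples; no medGAN API is used or implied.

```python
import random
from typing import List, Sequence, Tuple


def split_label_column(rows: Sequence[Sequence[float]]) -> Tuple[List[List[float]], List[int]]:
    """Option A: the generator emits features plus a final label column."""
    x = [list(row[:-1]) for row in rows]
    # Threshold the last column, since a generator may emit values in [0, 1].
    y = [1 if row[-1] >= 0.5 else 0 for row in rows]
    return x, y


def merge_class_samples(
    case_rows: Sequence[Sequence[float]],
    control_rows: Sequence[Sequence[float]],
    seed: int = 0,
) -> Tuple[List[List[float]], List[int]]:
    """Option B: one generator per class; the label is the generator a row came from."""
    labeled = [(list(row), 1) for row in case_rows] + [(list(row), 0) for row in control_rows]
    random.Random(seed).shuffle(labeled)  # mix the classes before training
    x = [row for row, _ in labeled]
    y = [label for _, label in labeled]
    return x, y


# Option A: the last value of each toy row is the generated label.
xa, ya = split_label_column([[1, 0, 0.9], [0, 1, 0.1]])

# Option B: toy samples from a "case" generator and a "control" generator.
xb, yb = merge_class_samples(case_rows=[[1, 1]], control_rows=[[0, 0], [0, 1]])
```

Option B keeps the label exact by construction, while Option A relies on the generator learning the relationship between the features and the label column.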

Thanks,
Ed

@2g-XzenG
Author

Cool! I didn't see the connection between these two at first.
I think it would be very useful if we could train models without accessing the real dataset. I will look into this direction.

Thanks!

@mp2893 mp2893 closed this as completed Nov 7, 2017