A few questions #5

2g-XzenG · 2017-10-11T20:51:40Z

Hello Ed,

Thanks for sharing this great work with us!

After having trouble accessing the EHR dataset, I was wondering whether we could generate synthetic data instead, which led me to this paper.

I have a few questions though:

1. Sequential patient data seems more useful for many tasks. Have you tried generating this kind of data (as you mentioned in the future work)? For example, each patient could be treated as a matrix in which each row is a visit.
2. Have you tried any real-world tasks on the synthetic data? If so, can we trust the results we get from the synthetic data?
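
To make the "one matrix per patient" idea concrete, here is a minimal sketch of that representation. The code vocabulary, the example patient, and the helper names are all hypothetical and chosen only for illustration; nothing here comes from the medGAN code base.

```python
# Hypothetical code vocabulary; real EHR data would have thousands of codes.
CODES = ["I50", "E11", "N18", "J44"]
CODE_INDEX = {code: i for i, code in enumerate(CODES)}


def visits_to_matrix(visits):
    """Return a (visits x codes) binary matrix for one patient, one row per visit."""
    matrix = []
    for visit in visits:
        row = [0] * len(CODES)
        for code in visit:
            row[CODE_INDEX[code]] = 1
        matrix.append(row)
    return matrix


def aggregate(matrix):
    """Collapse all visits into a single binary vector (1 if the code ever occurred)."""
    return [int(any(column)) for column in zip(*matrix)]


# A made-up patient with three visits, listed in time order.
patient = [["E11"], ["E11", "N18"], ["I50"]]
matrix = visits_to_matrix(patient)
flat = aggregate(matrix)
```

The matrix keeps the order of visits, while `aggregate` discards it and leaves one flat vector per patient, which is the information a non-sequential generator would have to work with.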

Thanks!
Xianlong

mp2893 · 2017-10-11T20:59:10Z

Hi Xianlong,

Actually I'm currently working on it.
I've used the synthetic data from medGAN to train a heart-failure prediction model (I supplemented the dataset with synthetic heart-failure case patients, as they are rarer than control patients), and I've observed improved recall. But this was very preliminary work, and a more rigorous evaluation is necessary.

2g-XzenG · 2017-10-20T00:26:05Z

Hi Ed,

Thanks for the reply!

For 2: have you tried training the model entirely on synthetic data? If a model that performs well on the synthetic data also performs well on the real data (somewhat like training and validation sets), I think that would be a strong argument that the synthetic data is really good. Am I right?

Also, since you mentioned a heart-failure prediction model, I was wondering whether you are also generating the labels for the EHR data. For example, heart failure would be 1 and control would be 0. In other words, can this model be used to generate labeled data, for instance by adding the label as the last column of the data?

Thank you

mp2893 · 2017-10-20T00:32:06Z

Hi Xianlong,

Figures 3 and 7 in my paper are exactly what you described. I trained logistic regression classifiers on real data and on synthetic data, then tested them on held-out real data. There are many details that cannot be covered here, so I recommend you read my paper.
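
As a rough illustration of that protocol (train on real or on synthetic data, always test on held-out real data), here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the toy data process, the function names, and especially the "synthetic" set, which is just a noisier draw from the same toy process standing in for generator output. It is not medGAN and does not reproduce the experiments in the paper.

```python
import math
import random


def sample_records(n, label_noise, rng):
    """Toy binary records: 3 features; the label follows feature 0 with some noise."""
    data = []
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in range(3)]
        y = x[0] if rng.random() > label_noise else 1 - x[0]
        data.append((x, y))
    return data


def train_logreg(data, steps=200, lr=0.5):
    """Fit logistic regression with plain batch gradient descent."""
    w, b = [0.0] * 3, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * 3, 0.0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-score))
            for j in range(3):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def accuracy(model, data):
    """Fraction of records whose thresholded score matches the label."""
    w, b = model
    hits = 0
    for x, y in data:
        predicted = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        hits += predicted == (y == 1)
    return hits / len(data)


rng = random.Random(0)
real_train = sample_records(300, 0.1, rng)
real_test = sample_records(500, 0.1, rng)   # held-out real data
synthetic = sample_records(300, 0.2, rng)   # stand-in for generator output

acc_real = accuracy(train_logreg(real_train), real_test)
acc_synth = accuracy(train_logreg(synthetic), real_test)
print(f"trained on real: {acc_real:.3f}, trained on synthetic: {acc_synth:.3f}")
```

The point of the comparison is that both classifiers are scored on the same held-out real records, so a small gap between the two numbers suggests the synthetic data carries similar predictive signal.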

You can generate a labeled dataset in many ways. You can add an additional column like you suggested. Or you can develop a conditional generator. In my case, I trained two separate medGANs, one for the case dataset and the other for the control dataset. But as I said, this experiment was not rigorously conducted, so I can't say that my method is optimal.
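
A minimal sketch of the two-generator approach, with the label appended as the last column, might look like the following. The generators here are hypothetical stand-ins that sample each code independently with made-up probabilities; in practice they would be two separately trained models, and `make_toy_generator` and `build_labeled_dataset` are names invented for this example.

```python
import random


def make_toy_generator(code_probs, rng):
    """Hypothetical stand-in for a trained generator: samples each code independently."""
    def generate(n):
        return [[int(rng.random() < p) for p in code_probs] for _ in range(n)]
    return generate


def build_labeled_dataset(case_gen, control_gen, n_case, n_control, rng):
    """Sample from each generator, append the label as the last column, and shuffle."""
    rows = [x + [1] for x in case_gen(n_case)]
    rows += [x + [0] for x in control_gen(n_control)]
    rng.shuffle(rows)
    return rows


rng = random.Random(0)
case_gen = make_toy_generator([0.8, 0.5, 0.3], rng)      # made-up code frequencies
control_gen = make_toy_generator([0.1, 0.4, 0.3], rng)   # made-up code frequencies

dataset = build_labeled_dataset(case_gen, control_gen, n_case=50, n_control=150, rng=rng)
n_cases = sum(row[-1] for row in dataset)
```

Because the label is implied by which generator a row came from, the case/control ratio of the output is set directly through `n_case` and `n_control`.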

Thanks,
Ed

2g-XzenG · 2017-10-20T01:10:28Z

Cool! I didn't see the connection between these two at the beginning.
I think it would be very useful if we could train models without accessing the real dataset. I will look into this direction.

Thanks!

mp2893 closed this as completed Nov 7, 2017
