unlock "other study ID numbers" field in clinicaltrials.gov #510
Comments
Just merged some code from @georgiana-b to extract the secondary IDs from clinicaltrials.gov, so any "well-behaved" identifier will be extracted the next time we update our data. There aren't many, though. To improve this, we need to understand the format of the identifiers so that we can get better at detecting them. Do you know if the registries document their identifier formats somewhere? |
I don't know of any formal documents that define the formats for each of the registers - we'd probably have to manually go through as we add registers. |
A "well-behaved" identifier is (IMHO) an HTTP-based URI with a look-up URL (redirect/resolver service) and a promise of persistence. So, instead of a text string such as "D5130L00067" as a secondary/sponsor identifier, I am pushing for http://clinicaltrials.astrazeneca.com/study/D5130L00067. A first step is an internal process to assign azct.IDs to our 20,000 listed studies, plus an internal look-up service. I would like to move to "things not strings" in the information we send to e.g. CT.gov, see https://lists.w3.org/Archives/Public/public-semweb-lifesci/2013Feb/0067.html |
@kerfors Could you open a new issue explaining this, so we can discuss? |
@vitorbaptista @georgiana-b is this now merged into production? If so, can you please provide me an example link or two I can share with @jessflem |
@pwalsh Yes, this is in production. However, finding a link is quite difficult, because the secondary IDs from CT.gov don't tell us which registry they come from; they're simply strings like This is related in part to the way we handle identifiers, and I want to change it (see #502). With our current design, an identifier needs a source. To handle CT.gov, we'll probably need to add a bogus source (e.g. |
@vitorbaptista ok. can you prep this issue so we can get the team to do it in the next couple of weeks. |
Blocked by #502 |
Adding another example found by @Bengoldacre, the trials https://explorer.opentrials.net/trials/2a745a57-3e64-4ec1-88a4-21f6a2cefd1a and https://explorer.opentrials.net/trials/a021b738-9294-4126-8eea-393e9c0743d2 are the same. The identifier 2015-002933-23 appears on its clinicaltrials.gov entry at https://clinicaltrials.gov/ct2/show/record/NCT02688400, but without the EUCTR prefix. |
It is a bit of a grim field, but if we could start by getting it broken down into individual IDs, and matching the obvious ones, e.g. the EUCTR ones (always the same structure), then I think we will significantly improve our deduplication rate.
It will also be really helpful to get an idea of the other ID sources in this field, so we can start thinking how to match them.
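As a rough illustration of the "break it down, then match the obvious ones" idea, here is a minimal Python sketch. It is not the OpenTrials implementation. The function names, the choice of separators, and the optional `EUCTR` prefix and country suffix are all assumptions made for the example; the only things taken from this thread are the `2015-002933-23` shape of the EUCTR number and the `NCT` plus eight digits shape of the clinicaltrials.gov ID.

```python
import re

# Assumed shape of a EudraCT number: YYYY-NNNNNN-CC, e.g. 2015-002933-23.
# The optional "EUCTR" prefix and two-letter country suffix are assumptions.
EUDRACT_RE = re.compile(r"(?:EUCTR)?(\d{4}-\d{6}-\d{2})(?:-[A-Z]{2})?", re.IGNORECASE)

# clinicaltrials.gov IDs as seen in this thread: "NCT" followed by 8 digits.
NCT_RE = re.compile(r"NCT\d{8}", re.IGNORECASE)


def split_secondary_ids(raw):
    """Split a free-text ID field into candidate IDs.

    Splitting on commas, semicolons, and newlines is an assumption about
    how multiple IDs are packed into one field.
    """
    return [part.strip() for part in re.split(r"[,;\n]+", raw) if part.strip()]


def classify_identifier(value):
    """Return (source, normalised_id); source is None when nothing matches."""
    value = value.strip()
    match = EUDRACT_RE.fullmatch(value)
    if match:
        # Normalise to the prefixed form so both spellings deduplicate.
        return ("euctr", "EUCTR" + match.group(1))
    if NCT_RE.fullmatch(value):
        return ("nct", value.upper())
    return (None, value)


if __name__ == "__main__":
    for candidate in split_secondary_ids("2015-002933-23; D5130L00067"):
        print(classify_identifier(candidate))
```

Identifiers that match no known pattern come back with a `None` source, which would also give a simple way to list the other ID formats that show up in this field.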
Thanks!