This repository has been archived by the owner on Jan 29, 2022. It is now read-only.

unlock "other study ID numbers" field in clinicaltrials.gov #510

Open
jessflem opened this issue Oct 19, 2016 · 9 comments

Comments

@jessflem

It is a bit of a grim field, but if we could start by breaking it down into individual IDs and matching the obvious ones, e.g. the EUCTR ones (always the same structure), then I think we will significantly improve our deduplication rate.

It will also be really helpful to get an idea of the other ID sources in this field, so we can start thinking how to match them.
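
As a rough illustration of the first step (not the project's actual code), the field could be split on common separators and scanned for EudraCT-shaped numbers. The separator set, the function names, and the sample IDs other than 2015-002933-23 and D5130L00067 are assumptions made up for this sketch; the pattern assumes the YYYY-NNNNNN-CC EudraCT layout.

```python
import re

# Assumed EudraCT layout: YYYY-NNNNNN-CC, optionally prefixed with "EUCTR".
# The lookarounds stop the pattern from matching inside a longer token.
EUCTR_RE = re.compile(r"(?<![0-9A-Za-z])(?:EUCTR)?(\d{4}-\d{6}-\d{2})(?![0-9])")


def split_ids(raw):
    """Split the free-text field into candidate IDs.

    The separators (comma, semicolon, pipe, newline) are a guess at what
    the field contains, not something confirmed from the data.
    """
    return [part.strip() for part in re.split(r"[,;|\n]+", raw) if part.strip()]


def find_euctr_ids(raw):
    """Return every EudraCT-shaped number in the field, with the EUCTR prefix added."""
    return ["EUCTR" + number for number in EUCTR_RE.findall(raw)]


# Hypothetical field content, for illustration only.
field = "D5130L00067, 2015-002933-23; EUCTR2014-000001-12"
candidates = split_ids(field)
euctr_ids = find_euctr_ids(field)
```

Whatever is left in `candidates` after the recognised IDs are removed would give a first picture of the other ID sources in the field.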

Thanks!

@vitorbaptista
Contributor

Just merged some code from @georgiana-b to extract the secondary IDs from clinicaltrials.gov, so any "well-behaved" identifier will be extracted the next time we update our data. There aren't many of those, though. To improve this, we need to understand the formats of the identifiers so that we can detect more of them. Do you know if the registries document their identifier formats somewhere?

@jessflem
Author

I don't know of any formal documents that define the formats for each of the registers - we'd probably have to manually go through as we add registers.

@kerfors

kerfors commented Oct 24, 2016

A "well-behaved" identifier is (IMHO) an HTTP-based URI with a look-up URL (a redirect/resolver service) and a promise of persistence. So, instead of a text string such as "D5130L00067" as a secondary/sponsor identifier, I am pushing for http://clinicaltrials.astrazeneca.com/study/D5130L00067. A first step is an internal process to assign azct.IDs to our 20,000 listed studies, plus an internal look-up service. I would like to move to "things not strings" in the information we send to e.g. CT.gov; see https://lists.w3.org/Archives/Public/public-semweb-lifesci/2013Feb/0067.html

@vitorbaptista
Contributor

@kerfors Could you open a new issue explaining this, so we can discuss?

@pwalsh
Member

pwalsh commented Nov 30, 2016

@vitorbaptista @georgiana-b is this now merged into production? If so, can you please provide an example link or two that I can share with @jessflem?

@vitorbaptista
Contributor

@pwalsh Yes, this is in production. However, finding an example link is quite difficult, because the secondary IDs from CT.gov don't tell us which registry they come from; they're simply strings like 193284798. Our identifier detection uses regexps to validate them (for example, an NCT identifier must match NCT\d+), so any well-behaved identifier in the secondary identifiers field will be picked up. The others, which are the majority, won't.

This is related in part to the way we handle identifiers, which I want to change (see #502). With our current design, an identifier needs a source. To handle CT.gov, we'll probably need to add a bogus source (e.g. unknown) and handle multiple identifiers from the same source.
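
To make the idea concrete, here is a minimal sketch of regexp-based detection with an `unknown` fallback source. It is not the code that is in production: only the `NCT\d+` pattern comes from this thread, while the `euctr` and `isrctn` patterns, the `PATTERNS` table, and the function names are assumptions for illustration.

```python
import re

# Only the NCT pattern is taken from the discussion above.
# The EUCTR and ISRCTN patterns are assumptions about those registries' formats.
PATTERNS = {
    "nct": re.compile(r"NCT\d+"),
    "euctr": re.compile(r"(?:EUCTR)?\d{4}-\d{6}-\d{2}"),
    "isrctn": re.compile(r"ISRCTN\d+"),
}


def classify(identifier):
    """Return (source, value); strings matching no pattern get the 'unknown' source."""
    value = identifier.strip()
    for source, pattern in PATTERNS.items():
        if pattern.fullmatch(value):
            return source, value
    return "unknown", value


def group_by_source(identifiers):
    """Group identifiers by detected source, keeping a list per source
    so that several identifiers can share the same source."""
    grouped = {}
    for raw in identifiers:
        source, value = classify(raw)
        grouped.setdefault(source, []).append(value)
    return grouped


secondary_ids = ["NCT02688400", "193284798", "D5130L00067", "2015-002933-23"]
grouped = group_by_source(secondary_ids)
```

Storing a list per source is one way to cover both points at once: unrecognised strings end up under `unknown`, and nothing is lost when several of them arrive for the same trial.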

@pwalsh
Member

pwalsh commented Dec 1, 2016

@vitorbaptista OK. Can you prep this issue so we can get the team to do it in the next couple of weeks?

@vitorbaptista
Contributor

Blocked by #502

@vitorbaptista
Contributor

Adding another example found by @Bengoldacre: the trials https://explorer.opentrials.net/trials/2a745a57-3e64-4ec1-88a4-21f6a2cefd1a and https://explorer.opentrials.net/trials/a021b738-9294-4126-8eea-393e9c0743d2 are the same. The identifier 2015-002933-23 appears on its clinicaltrials.gov entry at https://clinicaltrials.gov/ct2/show/record/NCT02688400, but without the EUCTR prefix.
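
A case like this could be caught by normalising EudraCT-shaped numbers to one canonical form before comparing records. The sketch below is only an illustration under assumptions: the record labels (`record_a`, `record_b`) and helper names are hypothetical, and the optional two-letter country suffix is a guess at how some EUCTR entries are written.

```python
import re
from collections import defaultdict

# Assumed shape: optional EUCTR prefix, YYYY-NNNNNN-CC, optional "-XX" country suffix.
EUDRACT_RE = re.compile(r"(?:EUCTR)?(\d{4}-\d{6}-\d{2})(?:-[A-Z]{2})?")


def canonical_euctr(identifier):
    """Return the identifier as EUCTR + number, or None if it is not EudraCT-shaped."""
    match = EUDRACT_RE.fullmatch(identifier.strip().upper())
    return "EUCTR" + match.group(1) if match else None


def find_duplicates(records):
    """Map each canonical EUCTR ID to the records that mention it,
    keeping only IDs shared by more than one record."""
    by_key = defaultdict(set)
    for record_id, identifiers in records.items():
        for identifier in identifiers:
            key = canonical_euctr(identifier)
            if key:
                by_key[key].add(record_id)
    return {key: sorted(ids) for key, ids in by_key.items() if len(ids) > 1}


# Hypothetical records: one carries the bare number, the other the prefixed form.
records = {
    "record_a": ["NCT02688400", "2015-002933-23"],
    "record_b": ["EUCTR2015-002933-23"],
}
duplicates = find_duplicates(records)
```

With both spellings mapped to the same key, the two records land in one group and can be flagged as candidates for merging.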
