Normalize study_phase values #353

vitorbaptista opened this Issue Aug 30, 2016 · 8 comments



Currently we have the following study_phase in our database:

study_phase total
N/A 107936
Phase 2 36444
Phase 3 27630
Phase 1 26288
Phase 4 23598
Not Applicable 10581
Phase 1/Phase 2 8391
Not applicable 5572
Not Specified 5245
Phase 2/Phase 3 4293
Phase II 2526
Phase 0 1897
2-3 1271
2 1250
Phase III 966
3 963
III 804
Phase IV 758
Phase I 664
"Phase I II"
II 375
" 337
IV 253
1 223
Phase 1 / Phase 2 216
Phase 2 / Phase 3 207
Phase 3 / Phase 4 200
4 196
"Phase II III"
n\a 171
phase 3 148
phase 1 135
phase 4 133
1-2 128
Other 124
I 107
0(exploratory trials)) 90
phase 2 73
Phase I/II 61
Phase4 56
Phase II/III 50
Phase1 43
Post-market 42
I-II 40
Phase3 32
Pilot study 22
New Treatment Measure Clinical Study 20
N/ 19
IV (Phase IV study) 17
I (Phase I study) 16
NA 15
III (Phase III study) 14
Phase2 13
IIa 13
Phase 1/2 13
IIb 10
Phase 2/3 10
0 9
Phase III/IV 9
Diagnostic New Technique Clincal Study 8
IIIb 8
Phase1/Phase2 7
II (Phase II study) 4
phase 1/2 4
Phase2/Phase3 3
phase 2/3 3
Not entered 1
V 1

They should all be normalized so they can be compared. There are a few trials with multiple phases (e.g. Phase 1/2) that I'm not sure how to handle. @opentrials/research could you add your comments here? How should we handle the trials with multiple phases?

roll commented Aug 30, 2016

wow it's really messy


@opentrials/research ping. As far as I can understand from the data, a single trial can be of multiple phases. Is this correct?


@vitorbaptista ah yes, the never ending nuances of trials - yes, a trial can be more than one phase. ideally we would like a trial which is two phases (eg phase 2&3) to be searchable as phase 2, phase 3, and phase 2&3. thanks!
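The search behaviour requested above (a phase 2&3 trial matching phase 2, phase 3, and the combined label) could be sketched as follows. This is only an illustration: the function name `phase_search_terms` and the "Phase N/Phase M" label format are assumptions, not something the thread settled on.

```python
def phase_search_terms(phases):
    """Return every label a trial with these phase numbers should match.

    `phases` is an iterable of integers, e.g. [2, 3] for a Phase 2/3 trial.
    """
    ordered = sorted(set(phases))
    # Each individual phase is searchable on its own.
    terms = [f"Phase {n}" for n in ordered]
    # A multi-phase trial is also searchable by its combined label.
    if len(ordered) > 1:
        terms.append("/".join(f"Phase {n}" for n in ordered))
    return terms


print(phase_search_terms([2, 3]))
# ['Phase 2', 'Phase 3', 'Phase 2/Phase 3']
```

Indexing a trial under all returned terms would let one stored value answer all three kinds of query.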

kerfors commented Nov 14, 2016 edited

Hi, I would like to join this discussion as I'm addressing the same challenge for phase and for a few other core properties describing studies. I would like to discuss the different standards, the formats used to represent them, and whether any simple vocabulary normalization/mapping service is available.

Related activities:

Related "standard"s I know of:

MCRIx commented Nov 17, 2016

I favour the reduction of the term variety to the elements. These have the benefit over more detailed elements of:

  • being able to be mapped to from more detailed elements without loss of accuracy
  • having unambiguous arabic numerals
  • self-identifying as phase descriptions
  • allowing for mapping with only a small loss of precision

Sub-phase descriptors like Ia, IIb, etc., though not arbitrary, are open to some interpretation. Details of this type are often revealed in trial title fields.
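As a rough sketch of the reduction described above, a single-phase value can be collapsed to a "Phase N" label with an Arabic numeral, dropping any sub-phase letter. The function name `to_base_phase`, the pattern, and the choice to accept only phases 0 to 4 are assumptions made for illustration; combined phases such as "I-II" are deliberately not handled here.

```python
import re

ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4}

# Optional "Phase" prefix, a Roman (I-IV) or Arabic (0-4) number,
# and an optional sub-phase letter (a or b), which is discarded.
PATTERN = re.compile(r"^(?:phase\s*)?(IV|III|II|I|[0-4])\s*([ab])?$", re.IGNORECASE)


def to_base_phase(raw):
    """Return 'Phase N' for a single-phase value, or None if it doesn't match."""
    match = PATTERN.match(raw.strip())
    if match is None:
        return None
    number = match.group(1).upper()
    value = ROMAN.get(number)
    if value is None:
        # Not a Roman numeral, so it must be an Arabic digit.
        value = int(number)
    return f"Phase {value}"


print(to_base_phase("IIb"))         # Phase 2
print(to_base_phase("Phase IIIb"))  # Phase 3
print(to_base_phase("Phase1"))      # Phase 1
```

Values that return `None` (for example "Other" or "V") would still need a separate decision.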

kerfors commented Jan 5, 2017 edited

In our internal work we use a short-term, basic solution for this: a from-to map table. Below is our current set of map pairs.

I am thinking of developing a Jupyter notebook that combines the information above with some code to gather the different vocabularies, for instance by scraping values and definitions such as:
"Phase 1" "includes initial studies to determine the metabolism and pharmacologic actions of drugs in humans, ...."

It would also include the map table below as a potential training set for a basic mapping function that is given a new set of values, e.g. the list above from @vitorbaptista

Non-Interventional Study,N/A
Not applicable,N/A
Real World Evidence,N/A
I,Phase 1
IA,Phase 1
IB,Phase 1
Phase I,Phase 1
Phase Ia,Phase 1
Phase Ib,Phase 1
I-II,Phase 1/Phase 2
I-IIA,Phase 1/Phase 2
IB-II,Phase 1/Phase 2
IB-IIA,Phase 1/Phase 2
Phase I-II,Phase 1/Phase 2
Phase I-IIa,Phase 1/Phase 2
Phase Ib-II,Phase 1/Phase 2
Phase Ib-IIa,Phase 1/Phase 2
II,Phase 2
IIA,Phase 2
IIB,Phase 2
Phase II,Phase 2
Phase IIa,Phase 2
Phase IIb,Phase 2
Phase II-III,Phase 2/Phase 3
Phase II-IIIa,Phase 2/Phase 3
Phase IIb-III,Phase 2/Phase 3
Phase IIb-IIIa,Phase 2/Phase 3
III,Phase 3
IIIA,Phase 3
IIIB,Phase 3
Phase III,Phase 3
Phase IIIa,Phase 3
Phase IIIb,Phase 3
IV,Phase 4
Phase IV,Phase 4
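A minimal sketch of how a from-to table like the one above could be loaded and used as a lookup. Only a handful of the pairs are repeated inline to keep the example short, and the helper names (`load_phase_map`, `map_phase`) and the case-insensitive matching are assumptions rather than a description of the internal tool.

```python
import csv
import io

# A short excerpt of the map pairs listed above, in the same "from,to" format.
MAP_PAIRS = """\
Not applicable,N/A
I,Phase 1
Phase Ib,Phase 1
I-II,Phase 1/Phase 2
Phase IIb,Phase 2
Phase II-III,Phase 2/Phase 3
IV,Phase 4
"""


def load_phase_map(text):
    """Parse 'from,to' rows into a dict keyed by the case-folded source value."""
    mapping = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        source, target = row
        mapping[source.strip().casefold()] = target.strip()
    return mapping


def map_phase(raw, phase_map):
    """Return the mapped phase, or None when the value is not in the table."""
    return phase_map.get(raw.strip().casefold())


phase_map = load_phase_map(MAP_PAIRS)
print(map_phase("phase ib", phase_map))  # Phase 1
print(map_phase(" I-II ", phase_map))    # Phase 1/Phase 2
print(map_phase("V", phase_map))         # None
```

Returning `None` for unknown values makes it easy to collect the leftovers and extend the table by hand.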

@georgiana-b georgiana-b self-assigned this Jan 11, 2017

@jessflem What do you think should be done with phases like: "New Treatment Measure Clinical Study" or "Diagnostic New Technique Clinical Study"?

I was thinking we can add "(other)" to any phase that is not null or a phase number. This way they can be searched by the word "other" which would return these ambiguous phases and those that have "Other" as value.

@georgiana-b georgiana-b removed their assignment Jan 11, 2017


> I was thinking we can add "(other)" to any phase that is not null or a phase number. This way they can be searched by the word "other" which would return these ambiguous phases and those that have "Other" as value.

this sounds sensible, yes. thanks very much!
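The "(other)" suggestion could look roughly like the sketch below. It assumes that phase values have already been normalized to the "Phase N" or "Phase N/Phase M" form and that not-applicable values are stored as null (`None`) before this step; the function name `tag_other` is hypothetical.

```python
import re

# Matches "Phase 2" as well as combined labels such as "Phase 2/Phase 3".
PHASE_NUMBER = re.compile(r"^Phase [0-4](/Phase [0-4])*$")


def tag_other(phase):
    """Append '(other)' to any value that is neither null nor a phase number."""
    if phase is None or PHASE_NUMBER.match(phase):
        return phase
    if phase.strip().lower() == "other":
        # Already searchable by the word "other".
        return phase
    return f"{phase} (other)"


print(tag_other("New Treatment Measure Clinical Study"))
# New Treatment Measure Clinical Study (other)
print(tag_other("Phase 2/Phase 3"))
# Phase 2/Phase 3
```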

@vitorbaptista vitorbaptista added a commit to opentrials/processors that referenced this issue Jan 25, 2017
@arthurSena @vitorbaptista arthurSena + vitorbaptista [#353] Normalize study_phase values
`processors/base/normalizers/phases_variations.json` file. Each processor that
needs to normalize a phase should call the
`base.normalizers.get_normalized_phase()` method, which looks up the
normalized value in the `processors/base/normalizers/phases_variations.json`
file. If the value is found, the method returns it; otherwise it logs the
miss and returns the value it received unchanged.

This patch required us to change the DB schema to set `study_phase` to be a
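The lookup-with-fallback behaviour described in the commit message can be illustrated with the sketch below. It is not the code from the commit: the structure of `phases_variations.json` is not shown in this thread, so the flat "variation to normalized value" JSON object, the function name `normalize_phase_sketch`, and its signature are all assumptions.

```python
import json
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the file contents; the real file layout may differ.
VARIATIONS_JSON = '{"Phase II": "Phase 2", "III": "Phase 3", "Not applicable": "N/A"}'


def normalize_phase_sketch(phase, variations):
    """Return the normalized phase if known; otherwise log and return the input."""
    if phase in variations:
        return variations[phase]
    logger.warning("No normalized value found for phase %r", phase)
    return phase


variations = json.loads(VARIATIONS_JSON)
print(normalize_phase_sketch("Phase II", variations))  # Phase 2
print(normalize_phase_sketch("V", variations))         # V (and a warning is logged)
```

Returning the input unchanged means an unknown value never blocks a processor, while the log entry shows which variations still need to be added.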

@vitorbaptista vitorbaptista self-assigned this Jan 25, 2017
@vitorbaptista vitorbaptista moved from To Do to Done in Data Quality Jan 26, 2017