# Preprocess Targets

For each target, group together similar classes for which we have little data. Additionally, break a few classes up into one-versus-all classifiers because of class overlap.

In [1]:
%store -r abstracts_targets

abstracts_targets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2093 entries, 0 to 2126
Data columns (total 15 columns):
abstract                   2093 non-null object
pmid                       2093 non-null object
allocation                 1746 non-null object
endpoint_classification    1882 non-null object
intervention_model         2085 non-null object
masking                    2092 non-null object
primary_purpose            2015 non-null object
condition                  2093 non-null object
gender                     2093 non-null object
healthy_volunteers         2092 non-null object
maximum_age                2093 non-null object
minimum_age                2093 non-null object
phase                      2093 non-null object
study_type                 2093 non-null object
primary_outcome_measure    2093 non-null object
dtypes: object(15)
memory usage: 261.6+ KB


### Allocation

In [2]:
abstracts_targets.groupby('allocation').size()

allocation
Non-Randomized     228
Randomized        1518
dtype: int64

### Endpoint Classification

Break into one-versus-all classifiers. The motivation is the shared `Safety/Efficacy` class: with separate classifiers for Safety and Efficacy, a study can be classified as both if each classifier fires strongly enough.

However, these are the **only** two classes here that can appear together. For instance, `Safety` and `Bio` **cannot** appear together. This type of conflict will have to be resolved at run time. I'll default to choosing Safety/Efficacy because those classes are more common and we have more data for them.

- (`Bio-availability Study`, `Bio-equivalence Study`) $\rightarrow$ `Bio`
- (`Pharmacodynamics Study`, `Pharmacokinetics Study`, `Pharmacokinetics/Dynamics Study`) $\rightarrow$ `Pharmaco`
- Break out Safety and Efficacy into their own separate classifiers.
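The one-versus-all split and the run-time tie-break described above can be sketched on toy data. This is only an illustration under assumptions: `resolve_endpoint`, the score scale in [0, 1], and the 0.5 threshold are hypothetical and do not come from this notebook's classifiers.

```python
import pandas as pd

# Toy labels mirroring the endpoint_classification values in this notebook.
endpoint = pd.Series([
    'Safety/Efficacy Study',
    'Efficacy Study',
    'Safety Study',
    'Bio-equivalence Study',
    'Pharmacokinetics Study',
    None,
])

# One-versus-all targets: a Safety/Efficacy study is positive for both.
targets = pd.DataFrame({
    'bio': endpoint.fillna('').str.startswith('Bio-'),
    'pharmaco': endpoint.fillna('').str.startswith('Pharmaco'),
    'efficacy_study': endpoint.isin(['Efficacy Study', 'Safety/Efficacy Study']),
    'safety_study': endpoint.isin(['Safety Study', 'Safety/Efficacy Study']),
})


def resolve_endpoint(scores, threshold=0.5):
    """Hypothetical run-time rule: prefer Safety/Efficacy over Bio/Pharmaco.

    `scores` maps each one-versus-all classifier name to a score in [0, 1];
    both the score scale and the 0.5 threshold are assumptions.
    """
    fired = {name for name, score in scores.items() if score >= threshold}
    preferred = fired & {'safety_study', 'efficacy_study'}
    if preferred:
        # Safety and Efficacy are the only pair allowed to co-occur.
        return sorted(preferred)
    others = fired & {'bio', 'pharmaco'}
    if others:
        # Bio and Pharmaco are mutually exclusive: keep the highest score.
        return [max(others, key=lambda name: scores[name])]
    return []
```

With this rule, a study whose `bio` and `safety_study` classifiers both fire is labelled as a safety study only, which matches the default stated above.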

In [3]:
abstracts_targets.groupby('endpoint_classification').size()

endpoint_classification
Bio-availability Study                2
Bio-equivalence Study                 9
Efficacy Study                      642
Pharmacodynamics Study               30
Pharmacokinetics Study               39
Pharmacokinetics/Dynamics Study      23
Safety Study                        106
Safety/Efficacy Study              1031
dtype: int64

#### Bio, Pharmaco, Safety and Efficacy

In [4]:
# Bio
abstracts_targets['bio'] = abstracts_targets.endpoint_classification.fillna('').str.startswith('Bio-')

# Pharmaco
abstracts_targets['pharmaco'] = abstracts_targets.endpoint_classification.fillna('').str.startswith('Pharmaco')

# Safety and Efficacy
abstracts_targets['efficacy_study'] = abstracts_targets.endpoint_classification.isin(['Efficacy Study', 'Safety/Efficacy Study'])
abstracts_targets['safety_study'] = abstracts_targets.endpoint_classification.isin(['Safety Study', 'Safety/Efficacy Study'])

In [5]:
abstracts_targets.groupby('bio').size()

bio
False    2082
True       11
dtype: int64

In [6]:
abstracts_targets.groupby('pharmaco').size()

pharmaco
False    2001
True       92
dtype: int64

In [7]:
abstracts_targets.groupby('efficacy_study').size()

efficacy_study
False     420
True     1673
dtype: int64

In [8]:
abstracts_targets.groupby('safety_study').size()

safety_study
False     956
True     1137
dtype: int64

### Intervention Model

In [9]:
abstracts_targets.groupby('intervention_model').size()

intervention_model
Crossover Assignment        157
Factorial Assignment         47
Parallel Assignment        1380
Single Group Assignment     501
dtype: int64

### Masking

Collapse masking down into just three labels:

- Single Blind
- Double Blind
- Open Label
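As a minimal sketch of this collapse, the snippet below maps a toy sample of raw labels onto the three targets by prefix. The `PREFIXES` table and the `collapse_masking` helper are hypothetical names introduced for illustration, and the sample values are a small hand-picked subset of the labels in this dataset.

```python
import pandas as pd

# Toy sample of raw masking labels as they appear in this dataset.
masking = pd.Series([
    'Double Blind (Subject Investigator)',
    'Double-Blind',
    'Open Label',
    'Single Blind (Caregiver)',
    None,
])

# Hypothetical prefix table: raw prefix -> collapsed label.
PREFIXES = {
    'Double Blind': 'Double Blind',
    'Double-Blind': 'Double Blind',
    'Single Blind': 'Single Blind',
    'Open Label': 'Open Label',
}


def collapse_masking(label):
    """Map a raw masking label to one of the three collapsed labels."""
    if not isinstance(label, str):
        return label  # leave missing values untouched
    for prefix, collapsed in PREFIXES.items():
        if label.startswith(prefix):
            return collapsed
    return label  # unknown labels pass through unchanged


collapsed = masking.map(collapse_masking)
```

Keeping the prefixes in one table makes it easy to see that the hyphenated `Double-Blind` spelling is folded into `Double Blind`.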

In [10]:
abstracts_targets.groupby('masking').size()

masking
Double Blind (Caregiver Investigator Outcomes Assessor)               2
Double Blind (Caregiver Investigator)                                 2
Double Blind (Investigator Outcomes Assessor)                         6
Double Blind (Subject Caregiver Investigator Outcomes Assessor)     378
Double Blind (Subject Caregiver Investigator)                        84
Double Blind (Subject Caregiver Outcomes Assessor)                    5
Double Blind (Subject Caregiver)                                      7
Double Blind (Subject Investigator Outcomes Assessor)                91
Double Blind (Subject Investigator)                                 268
Double Blind (Subject Outcomes Assessor)                             35
Double-Blind                                                          7
Open Label                                                         1013
Single Blind                                                          1
Single Blind (Caregiver)                                

In [11]:
# Assign through .loc: chained indexing (df.col[mask] = ...) may write to a copy
# and raises SettingWithCopyWarning.
masking = abstracts_targets.masking.fillna('')
abstracts_targets.loc[masking.str.startswith('Double Blind'), 'masking'] = 'Double Blind'
abstracts_targets.loc[masking.str.startswith('Double-Blind'), 'masking'] = 'Double Blind'
abstracts_targets.loc[masking.str.startswith('Single Blind'), 'masking'] = 'Single Blind'

abstracts_targets.groupby('masking').size()


masking
Double Blind     885
Open Label      1013
Single Blind     194
dtype: int64

### Primary Purpose

In [12]:
abstracts_targets.groupby('primary_purpose').size()

primary_purpose
Basic Science                 46
Diagnostic                    49
Health Services Research      49
Prevention                   263
Screening                      7
Supportive Care               38
Treatment                   1563
dtype: int64

### Gender

In [13]:
abstracts_targets.groupby('gender').size()

gender
Both      1821
Female     186
Male        86
dtype: int64

### Healthy Volunteers

In [14]:
abstracts_targets.groupby('healthy_volunteers').size()

healthy_volunteers
Accepts Healthy Volunteers     411
No                            1681
dtype: int64

### Phase

Create binary labels for:

- Phase 1
- Phase 2
- Phase 3
- Phase 4

Factor combination phases into their discrete phases (e.g. a `Phase 1/Phase 2` study counts as both Phase 1 and Phase 2).
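The factoring rule can be checked on a handful of toy labels of the kind found in the `phase` column. This is a sketch under assumptions, not the notebook's own cell: it splits each label on `/` and tests membership, and the names `parts` and `flags` are introduced here for illustration.

```python
import pandas as pd

# Toy sample of phase labels of the kind found in the phase column.
phase = pd.Series(['Phase 1', 'Phase 1/Phase 2', 'Phase 2/Phase 3', 'Phase 4', 'N/A'])

# Split combination labels on '/' so each row becomes a list of discrete phases.
parts = phase.str.split('/')

# One boolean column per phase: True when that phase is among the row's parts.
flags = pd.DataFrame({
    'phase{}'.format(n): parts.apply(lambda p, n=n: 'Phase {}'.format(n) in p)
    for n in range(1, 5)
})
```

Here the `Phase 1/Phase 2` row is `True` in both `phase1` and `phase2`, while the `N/A` row is `False` in every column.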

In [15]:
abstracts_targets.groupby('phase').size()

phase
N/A                410
Phase 0              8
Phase 1            104
Phase 1/Phase 2     92
Phase 2            530
Phase 2/Phase 3     60
Phase 3            614
Phase 4            275
dtype: int64

In [16]:
abstracts_targets['phaseNA'] = abstracts_targets.phase == 'N/A'

# Combination phases count toward each of their discrete phases
abstracts_targets['phase0'] = abstracts_targets.phase == 'Phase 0'
abstracts_targets['phase1'] = abstracts_targets.phase.isin(['Phase 1', 'Phase 1/Phase 2'])
abstracts_targets['phase2'] = abstracts_targets.phase.isin(['Phase 1/Phase 2', 'Phase 2', 'Phase 2/Phase 3'])
abstracts_targets['phase3'] = abstracts_targets.phase.isin(['Phase 2/Phase 3', 'Phase 3'])
abstracts_targets['phase4'] = abstracts_targets.phase == 'Phase 4'

In [17]:
abstracts_targets.groupby('phaseNA').size()

phaseNA
False    1683
True      410
dtype: int64

In [18]:
abstracts_targets.groupby('phase0').size()

phase0
False    2085
True        8
dtype: int64

In [19]:
abstracts_targets.groupby('phase1').size()

phase1
False    1897
True      196
dtype: int64

In [20]:
abstracts_targets.groupby('phase2').size()

phase2
False    1411
True      682
dtype: int64

In [21]:
abstracts_targets.groupby('phase3').size()

phase3
False    1419
True      674
dtype: int64

In [22]:
abstracts_targets.groupby('phase4').size()

phase4
False    1818
True      275
dtype: int64

### Study Type

In [23]:
abstracts_targets.groupby('study_type').size()

study_type
Interventional    2093
dtype: int64

### Save it!

In [24]:
abstracts_targets_collapsed = abstracts_targets

%store abstracts_targets_collapsed

Stored 'abstracts_targets_collapsed' (DataFrame)
