## Protein Tokenization Dataset

This notebook generates a dataset of T cell entity mentions that also include phenotype markers, together with their associated context. Each record pairs a phenotype (as a sequence of proteins) with a gold label for the associated cell type, a useful combination for evaluating a process that attempts to infer cell types from phenotype markers.

In [1]:
import os
import os.path as osp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tqdm
from tcre import tokenization
from tcre.env import *  # presumably where DATA_DIR and RESULTS_DATA_DIR (used below) come from
from tcre import lib
from ptkn import protein_tokenization as ptkn

In [None]:
# string = 'CD4brightCD45RAloCD45RO-4-1BB-CD62L+++'
# for t in tokenizer.tokenize(string):
#     print(f'{t.text:10} [term={t.token_text}, sign={t.sign_text}, value={t.sign_value}, preferred={t.metadata[2]}]')

## Overlapping Entity Extraction

First, entities tagged by both the JNLPBA-trained NER model in SciSpaCy and the dictionary cell type tagger are extracted to identify entity mentions containing a T cell surface form (e.g. "CD4+CD25+ Treg"). The result is a set of mentions that each carry a phenotype string (i.e. the protein marker sequence) and a **single** T cell type.
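
The pairing rule can be illustrated without the interval library. The sketch below is a minimal stand-in, not the notebook's implementation: `Span`, `covers_first_word`, and `pair_mentions` are hypothetical names, the word offsets are toy values, and both offsets are assumed to be inclusive.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start_wrd: int  # offset of the first word in the span
    end_wrd: int    # offset of the last word (assumed inclusive)
    text: str


def covers_first_word(ner: Span, lookup: Span) -> bool:
    """True if the NER span contains the first word of the dictionary match."""
    return ner.start_wrd <= lookup.start_wrd <= ner.end_wrd


def pair_mentions(ner_spans: List[Span], lookup_spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (NER span, dictionary match) pairs, one per overlap."""
    return [
        (ner, lookup)
        for lookup in lookup_spans
        for ner in ner_spans
        if covers_first_word(ner, lookup)
    ]


# Toy offsets, for illustration only
ner_spans = [Span(10, 12, "CD4+CD25+ Treg cells"), Span(30, 31, "B cells")]
lookup_spans = [Span(11, 11, "Treg")]
pairs = pair_mentions(ner_spans, lookup_spans)
```

This nested loop is quadratic per document; an interval index is the better choice once a document has many spans.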

In [3]:
df_tag = pd.read_csv(osp.join(DATA_DIR, 'articles', 'corpus', 'corpus_01', 'tags-union.csv'))
pd.set_option('display.max_info_rows', 10000000)
df_tag.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2310512 entries, 0 to 2310511
Data columns (total 11 columns):
end_chr       2310512 non-null int64
end_wrd       2310512 non-null int64
ent_id        961277 non-null object
ent_lbl       961277 non-null object
ent_prefid    961277 non-null object
ent_src       2310512 non-null object
id            2310512 non-null object
start_chr     2310512 non-null int64
start_wrd     2310512 non-null int64
text          2310512 non-null object
type          2310512 non-null object
dtypes: int64(4), object(7)
memory usage: 193.9+ MB


In [4]:
df_tag['ent_src'].value_counts()

jnlpba    1349235
lkp        961277
Name: ent_src, dtype: int64

In [5]:
import interlap

def get_paired_mentions(g):
    df1 = g[(g['ent_src'] == 'jnlpba') & (g['type'].isin(['CELL_TYPE', 'CELL_LINE']))]
    df2 = g[(g['ent_src'] == 'lkp') & (g['type'] == 'IMMUNE_CELL_TYPE')]
    
    rngs = interlap.InterLap()
    
    # Add jnlpba entries to interval lookup 
    for _, r in df1.iterrows():
        rngs.add((r['start_wrd'], r['end_wrd'], r))
        
    # Loop through direct matches, find jnlpba entries that overlap with first word,
    # and accumulate results which contain the records from both sources side-by-side
    df = []
    for i, r in df2.iterrows():
        matches = [
            o[-1][['id', 'start_wrd', 'end_wrd', 'type', 'text']].to_dict()
            for o in rngs.find((r.start_wrd, r.start_wrd))    
        ]
        for m in matches:
            # Add jnlpba row with extra data for corresponding lookup match
            df.append({**m, **{'match_prefid': r['ent_prefid'], 'match_lbl': r['ent_lbl'], 'match_text': r['text']}})
    return pd.DataFrame(df)
        
def get_cell_type_tags(df):
    res = []
    grps = df.groupby('id')
    for k, g in tqdm.tqdm(grps, total=len(grps)):
        res.append(get_paired_mentions(g))
    return pd.concat(res)
    

df_ct = (
    df_tag
    .pipe(lambda df: df[df['text'].str.contains('[tT]')])
    .pipe(get_cell_type_tags)
)
df_ct.info()

100%|██████████| 9878/9878 [06:10<00:00, 26.64it/s]


<class 'pandas.core.frame.DataFrame'>
Int64Index: 188915 entries, 0 to 1
Data columns (total 8 columns):
end_wrd         188915 non-null int64
id              188915 non-null object
match_lbl       188915 non-null object
match_prefid    188915 non-null object
match_text      188915 non-null object
start_wrd       188915 non-null int64
text            188915 non-null object
type            188915 non-null object
dtypes: int64(2), object(6)
memory usage: 13.0+ MB


In [6]:
# Group by jnlpba span and aggregate all overlapping direct matches
df_ct_grp = df_ct.groupby(['id', 'start_wrd', 'end_wrd', 'text']).agg({'match_lbl': 'unique', 'match_text': 'unique'}).reset_index()
df_ct_grp.head()

Unnamed: 0,id,start_wrd,end_wrd,text,match_lbl,match_text
0,PMC101751,250,252,Th1 cells,[Th1],[Th1]
1,PMC102037,107,111,Gag-specific cytotoxic T lymphocytes,[Tc],[cytotoxic T]
2,PMC103809,3,6,Cytotoxic T Lymphocytes,[Tc],[Cytotoxic T]
3,PMC103836,12,19,Cytokine-Induced CD4+ T-Helper 1 (Th1)-,[Th],[T-Helper]
4,PMC103836,51,59,T-helper 1 (Th1) or Th2 cells,"[Th, Th1, Th2]","[T-helper, Th1, Th2]"
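
The aggregation above collects the distinct cell type labels found inside each NER span. A plain-Python sketch of the same idea follows; `group_unique` is a hypothetical helper, and the rows are a hand-copied subset of the table above with one duplicate added to show de-duplication.

```python
from collections import defaultdict


def group_unique(rows, key_fields, value_field):
    """Map each key tuple to its distinct values, in first-seen order."""
    grouped = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if row[value_field] not in grouped[key]:
            grouped[key].append(row[value_field])
    return dict(grouped)


rows = [
    {"id": "PMC101751", "start_wrd": 250, "end_wrd": 252, "match_lbl": "Th1"},
    {"id": "PMC103836", "start_wrd": 51, "end_wrd": 59, "match_lbl": "Th"},
    {"id": "PMC103836", "start_wrd": 51, "end_wrd": 59, "match_lbl": "Th1"},
    {"id": "PMC103836", "start_wrd": 51, "end_wrd": 59, "match_lbl": "Th2"},
    {"id": "PMC103836", "start_wrd": 51, "end_wrd": 59, "match_lbl": "Th1"},  # duplicate
]
labels_by_span = group_unique(rows, ["id", "start_wrd", "end_wrd"], "match_lbl")
```

Spans that end up with more than one label, like the second one here, are ambiguous about which cell type the phenotype belongs to.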


### Filter Mentions

In [7]:
# Filter to mentions matched to exactly one cell type label, then unpack that label and the first surface form
df = (
    df_ct_grp
    .pipe(lambda df: df[df['match_lbl'].apply(len) == 1])
    .copy()
    .assign(match_lbl=lambda df: df['match_lbl'].apply(lambda v: v[0]))
    .assign(match_text=lambda df: df['match_text'].apply(lambda v: v[0]))
    .rename(columns={'match_lbl': 'cell_type', 'match_text': 'cell_type_text'})
)
df.head()

Unnamed: 0,id,start_wrd,end_wrd,text,cell_type,cell_type_text
0,PMC101751,250,252,Th1 cells,Th1,Th1
1,PMC102037,107,111,Gag-specific cytotoxic T lymphocytes,Tc,cytotoxic T
2,PMC103809,3,6,Cytotoxic T Lymphocytes,Tc,Cytotoxic T
3,PMC103836,12,19,Cytokine-Induced CD4+ T-Helper 1 (Th1)-,Th,T-Helper
11,PMC1064915,193,202,T helper 1 (Th1) polarized effector cells,Th1,T helper 1


In [8]:
df['cell_type'].value_counts()

Treg          42645
Th17          23082
NKT           12662
Thymocyte     10275
Tfh           10171
Th1            9993
Th             9165
Th2            8352
TN             6971
TMEM           4939
Tc             2832
γδT            2230
MAIT           1859
Treg1          1603
TCM            1598
Th9            1590
TEM            1515
Trm            1230
iTreg          1203
γδT-Vγ9Vδ2     1198
nTreg          1192
Tscm            981
Tc17            711
Th0             631
Th22            580
TEMRA           579
pTreg           448
Tc1             376
γδT-Vδ2         352
γδT-Vδ1         251
DETC            229
IEL             164
Treg17          146
Tc9             136
Th3             124
Tc2             104
ThP             104
Tfreg           102
γδT-17           93
γδT-Vγ4          13
Tc22              5
Tc0               4
γδT-Vγ1           3
Tsupp             2
Tfh17like         1
γδT-Vγ9           1
Name: cell_type, dtype: int64

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 162445 entries, 0 to 172225
Data columns (total 6 columns):
id                162445 non-null object
start_wrd         162445 non-null int64
end_wrd           162445 non-null int64
text              162445 non-null object
cell_type         162445 non-null object
cell_type_text    162445 non-null object
dtypes: int64(2), object(4)
memory usage: 8.7+ MB


In [97]:
df.head()

Unnamed: 0,id,start_wrd,end_wrd,text,cell_type,cell_type_text,ptkn
0,PMC101751,250,252,Th1 cells,Th1,Th1,"[Th1, cells]"
1,PMC102037,107,111,Gag-specific cytotoxic T lymphocytes,Tc,cytotoxic T,"[Gag, specific, cytotoxic, T, lymphocytes]"
2,PMC103809,3,6,Cytotoxic T Lymphocytes,Tc,Cytotoxic T,"[Cytotoxic, T, Lymphocytes]"
3,PMC103836,12,19,Cytokine-Induced CD4+ T-Helper 1 (Th1)-,Th,T-Helper,"[Cytokine, Induced, CD4⁺(+), T, Helper, 1, (Th1)]"
11,PMC1064915,193,202,T helper 1 (Th1) polarized effector cells,Th1,T helper 1,"[T, helper, 1, (Th1), polarized, effector, cells]"


In [100]:
# Add text with no cell surface form
df['clean_text'] = df.apply(lambda r: r['text'].replace(r['cell_type_text'], '').strip(), axis=1)
df.head()

Unnamed: 0,id,start_wrd,end_wrd,text,cell_type,cell_type_text,ptkn,clean_text
0,PMC101751,250,252,Th1 cells,Th1,Th1,"[Th1, cells]",cells
1,PMC102037,107,111,Gag-specific cytotoxic T lymphocytes,Tc,cytotoxic T,"[Gag, specific, cytotoxic, T, lymphocytes]",Gag-specific lymphocytes
2,PMC103809,3,6,Cytotoxic T Lymphocytes,Tc,Cytotoxic T,"[Cytotoxic, T, Lymphocytes]",Lymphocytes
3,PMC103836,12,19,Cytokine-Induced CD4+ T-Helper 1 (Th1)-,Th,T-Helper,"[Cytokine, Induced, CD4⁺(+), T, Helper, 1, (Th1)]",Cytokine-Induced CD4+ 1 (Th1)-
11,PMC1064915,193,202,T helper 1 (Th1) polarized effector cells,Th1,T helper 1,"[T, helper, 1, (Th1), polarized, effector, cells]",(Th1) polarized effector cells
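
Note that `str.replace` removes every occurrence of the surface form and can leave doubled spaces where the form sat mid-string. That is harmless here because the text is later split on whitespace, but a variant that also collapses whitespace could look like the sketch below. `strip_surface_form` is a hypothetical helper, not something the notebook ran.

```python
import re


def strip_surface_form(text: str, surface: str) -> str:
    """Remove all occurrences of the surface form, then normalize whitespace."""
    without = text.replace(surface, "")
    return re.sub(r"\s+", " ", without).strip()


cleaned = strip_surface_form("Gag-specific cytotoxic T lymphocytes", "cytotoxic T")
```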


In [101]:
# Add ptkn tokenization results
tokenizer = tokenization.load_protein_tokenizer()
tokens = [
    [t for w in text.split() for t in tokenizer.tokenize(w)]
    for text in tqdm.tqdm(df['clean_text'])
]
assert len(tokens) == len(df)

100%|██████████| 162445/162445 [16:46<00:00, 161.33it/s]


In [103]:
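# Assign as an object Series aligned to df's index; np.array() on ragged
# token lists raises an error on recent NumPy versions
df['ptkn'] = pd.Series(tokens, index=df.index)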
df['ptkn'].head()

0                                    [cells]
1               [Gag, specific, lymphocytes]
2                              [Lymphocytes]
3     [Cytokine, Induced, CD4⁺(+), 1, (Th1)]
11       [(Th1), polarized, effector, cells]
Name: ptkn, dtype: object

In [104]:
# Filter to entries with at least two qualifying protein mentions
rm_prs = ['CD4', 'CD8']

def has_prs(tokens):
    prs = [
        t.metadata[2] for t in tokens 
        # Ignore tokens with no metadata (i.e. not proteins), 
        # excluded proteins (CD4, CD8), or those missing term text or sign text
        if (t.metadata is not None)
        and (t.metadata[2] not in rm_prs)
        and t.token_text and t.sign_text
    ]
    return len(prs) >= 2

dff = df.pipe(lambda df: df[[has_prs(l) for l in df['ptkn']]]).copy()
dff.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3184 entries, 592 to 172195
Data columns (total 8 columns):
id                3184 non-null object
start_wrd         3184 non-null int64
end_wrd           3184 non-null int64
text              3184 non-null object
cell_type         3184 non-null object
cell_type_text    3184 non-null object
ptkn              3184 non-null object
clean_text        3184 non-null object
dtypes: int64(2), object(6)
memory usage: 223.9+ KB
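
To make the filter criteria concrete, here is a small sketch using stand-in tokens. `Token` is a hypothetical substitute for the tokenizer's real token class: it only mimics the three attributes used above, and the layout of `metadata` beyond index 2 (the preferred protein name) is an assumption.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    token_text: str                            # term text, e.g. "CD25"
    sign_text: str                             # sign text, e.g. "+" or "hi"
    metadata: Optional[Tuple[str, str, str]]   # assumed layout; index 2 = preferred name


EXCLUDED = {"CD4", "CD8"}


def count_qualifying(tokens: List[Token]) -> int:
    """Count protein tokens that are not excluded and have both term and sign text."""
    return sum(
        1
        for t in tokens
        if t.metadata is not None
        and t.metadata[2] not in EXCLUDED
        and t.token_text
        and t.sign_text
    )


def protein(name: str, sign: str) -> Token:
    # Placeholder values fill the metadata slots that are not inspected
    return Token(name, sign, ("", "", name))


kept = [protein("CD4", "+"), protein("CD25", "+"), protein("FoxP3", "+"), Token("cells", "", None)]
dropped = [protein("CD4", "+"), protein("CD25", "+"), Token("cells", "", None)]
```

The first mention has two qualifying proteins once CD4 is excluded and passes the `>= 2` threshold; the second has only one and is filtered out.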


In [105]:
# Filter entries associated with very general or rare cell types

# Generic cell types that rarely have canonical phenotypes in mention text 
ct_general = ['Thymocyte', 'TMEM', 'TN', 'Th']
cts = dff['cell_type'].value_counts()
ct_rare = list(cts[cts < 10].index.values)

dff = dff[~dff['cell_type'].isin(ct_general + ct_rare)]
dff.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2780 entries, 592 to 172195
Data columns (total 8 columns):
id                2780 non-null object
start_wrd         2780 non-null int64
end_wrd           2780 non-null int64
text              2780 non-null object
cell_type         2780 non-null object
cell_type_text    2780 non-null object
ptkn              2780 non-null object
clean_text        2780 non-null object
dtypes: int64(2), object(6)
memory usage: 195.5+ KB
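
The same label filter can be sketched with `collections.Counter`. The label list below is toy data and `labels_to_keep` is a hypothetical helper; as in the cell above, frequencies are counted before the general types are removed.

```python
from collections import Counter

GENERAL = {"Thymocyte", "TMEM", "TN", "Th"}


def labels_to_keep(labels, min_count=10):
    """Return labels that occur at least min_count times and are not general types."""
    counts = Counter(labels)
    return {
        label
        for label, n in counts.items()
        if n >= min_count and label not in GENERAL
    }


# Toy label distribution, for illustration only
labels = ["Treg"] * 12 + ["Th"] * 20 + ["Tc22"] * 3 + ["iTreg"] * 10
keep = labels_to_keep(labels)
```

A label with exactly `min_count` occurrences is retained, matching the strict `cts < 10` comparison used to define the rare set.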


In [106]:
dff.sample(n=15, random_state=0)

Unnamed: 0,id,start_wrd,end_wrd,text,cell_type,cell_type_text,ptkn,clean_text
153527,PMC6125584,7017,7022,"Foxp3+CD69−Tregs, Foxp3+CD69+ Tregs",Treg,Tregs,"[Foxp3⁺(+), CD69⁻(−), ,, Foxp3⁺(+), CD69⁺(+)]","Foxp3+CD69−, Foxp3+CD69+"
54404,PMC4038546,4174,4179,CD25++FoxP3+CD4+ regulatory T cells,Treg,regulatory T,"[CD25⁺(++), FoxP3⁺(+), CD4⁺(+), cells]",CD25++FoxP3+CD4+ cells
143272,PMC5988690,2235,2240,CD4+CD25+/CD4+Foxp3+ Treg cells,Treg,Treg,"[CD4⁺(+), CD25⁺(+), /, CD4⁺(+), Foxp3⁺(+), cells]",CD4+CD25+/CD4+Foxp3+ cells
37911,PMC3580431,5981,5986,CD4+CD25+FoxP3+ regulatory T cells,Treg,regulatory T,"[CD4⁺(+), CD25⁺(+), FoxP3⁺(+), cells]",CD4+CD25+FoxP3+ cells
166982,PMC6386992,3532,3536,CD25+Foxp3+CD39+ Treg cells,Treg,Treg,"[CD25⁺(+), Foxp3⁺(+), CD39⁺(+), cells]",CD25+Foxp3+CD39+ cells
155076,PMC6160197,5255,5258,CD4+CD25+foxp3+ Tregs,Treg,Tregs,"[CD4⁺(+), CD25⁺(+), foxp3⁺(+)]",CD4+CD25+foxp3+
50188,PMC3907717,6139,6146,Human DC10-induced CD25+Foxp3+ Treg express LAG3,Treg,Treg,"[Human, D, C10⁻(-), induced, CD25⁺(+), Foxp3⁺(...",Human DC10-induced CD25+Foxp3+ express LAG3
97590,PMC5061386,4076,4082,canonical CXCR5+PD-1hiBcl-6+ Tfh phenotype (,Tfh,Tfh,"[canonical, CXCR5⁺(+), PD-1⁺(hi), Bcl, 6, phen...",canonical CXCR5+PD-1hiBcl-6+ phenotype (
12525,PMC2738944,5542,5547,CD25+Foxp3+ T regulatory cells,Treg,T regulatory,"[CD25⁺(+), Foxp3⁺(+), cells]",CD25+Foxp3+ cells
129359,PMC5750223,6683,6686,IFN-γ−IL-17A−Foxp3+CD4+ Treg,Treg,Treg,"[IFN-γ⁻(−), IL-17A⁻(−), Foxp3⁺(+), CD4⁺(+)]",IFN-γ−IL-17A−Foxp3+CD4+


In [107]:
dff['cell_type'].value_counts()

Treg     1745
Tfh       352
Th17      144
Th1        78
NKT        75
TCM        73
TEM        71
Th2        46
nTreg      40
γδT        39
Treg1      37
MAIT       29
TEMRA      28
Tscm       13
iTreg      10
Name: cell_type, dtype: int64

In [109]:
dff[dff['cell_type'].isin(['Tscm'])].drop('clean_text', axis=1)

Unnamed: 0,id,start_wrd,end_wrd,text,cell_type,cell_type_text,ptkn
29599,PMC3376488,5414,5425,"CD27+, CD28+and IL-7Rα\n+, human Tscm cells",Tscm,Tscm,"[CD27⁺(+), ,, CD28⁺(+), and, IL-7, Rα, ⁺(+), ,..."
78893,PMC4605076,1109,1116,CD44lowCD62LhighSca-1highCD122highBcl-2high se...,Tscm,TSCM,"[CD44⁻(low), CD62L⁺(high), Sca, 1, CD122⁺(high..."
78896,PMC4605076,1236,1240,CD45RA+CD62L+CCR7+CD95+ TSCM cells,Tscm,TSCM,"[CD45RA⁺(+), CD62L⁺(+), CCR7⁺(+), CD95⁺(+), ce..."
78899,PMC4605076,1326,1331,CD8+CD45RA+CCR7+CD127+CD95+ viral-specific TSC...,Tscm,TSCM,"[CD8⁺(+), CD45RA⁺(+), CCR7⁺(+), CD127⁺(+), CD9..."
83141,PMC4690867,4338,4353,CD4+ CCR7+ CD45RA+ CD45RO− CD95+ T memory stem...,Tscm,T memory stem,"[CD4⁺(+), CCR7⁺(+), CD45RA⁺(+), CD45RO⁻(−), CD..."
91774,PMC4902324,1720,1728,Tscm CD4+CD45RA+CD45RO−CD62L+CCR7+CD127+CD27+C...,Tscm,Tscm,"[CD4⁺(+), CD45RA⁺(+), CD45RO⁻(−), CD62L⁺(+), C..."
93977,PMC4952071,1804,1812,CCR7+CD45RO−CD45RA+CD27+CD95+ TSCM (Fig. ).Fig,Tscm,TSCM,"[CCR7⁺(+), CD45RO⁻(−), CD45RA⁺(+), CD27⁺(+), C..."
106410,PMC5347696,1695,1700,"CD45RA+CD62L+CD95+, Tscm)",Tscm,Tscm,"[CD45RA⁺(+), CD62L⁺(+), CD95⁺(+), ,, )]"
119077,PMC5591266,2728,2738,CD4+CD45RA+CD62L+CD95+ for T stem cell memory ...,Tscm,Tscm,"[CD4⁺(+), CD45RA⁺(+), CD62L⁺(+), CD95⁺(+), for..."
135583,PMC5852080,3398,3402,CD44lo CD62Lhi TSCM cells,Tscm,TSCM,"[CD44⁻(lo), CD62L⁺(hi), cells]"


## Export

In [112]:
path = osp.join(RESULTS_DATA_DIR, 'protein-tokenization', 'dataset.pkl')
os.makedirs(osp.dirname(path), exist_ok=True)
dff.to_pickle(path)
path

'/lab/data/results/protein-tokenization/dataset.pkl'