## GPT measurementTechnique extraction Evaluation


### Analyze the extraction of measurementTechniques by ChatGPT
Approach:
- Select 10 records at random from each of the sampled repositories
- For each record, evaluate the techniques pulled by ChatGPT as True or False positive based on description
- Calculate precision and recall for each record
- If there are terms which are the same, but appear different, pull those out for standardization analysis

### Analyse the extraction of extraneous generic 'stop word' terms
Approach:
- Split all ChatGPT generated terms into words (split by space)
- Generate frequency table of terms
- Identify top set of generic terms for use as permutations in standardization

### Determine how well ChatGPT extractions match with measTech terms from repos
NCBI GEO, LINCs, and REFRAMEDB have vocabulary-based (GEO), or semi-vocabulary-based (LINCs, REFRAMEDB) measurementTechnique values for their records
Approach:
- Run ChatGPT against 10-25 records from GEO, LINCS, and REFRAMEDB
- Evaluate how well ChatGPT does in terms of getting a match (whether or not it can get at least 1 true positive per record)

In [1]:
import requests
import json
import os
import pandas as pd

In [2]:
script_path = os.getcwd()
data_path = os.path.join(script_path,'data')
result_path = os.path.join(script_path,'result')

### Analyze ChatGPT's extraction

Sample and format the ChatGPT extracted measurementTechniques data for manual evaluation

In [3]:
## Load the file
gpt_results = pd.read_csv(os.path.join(data_path,'GPT Measurement Techniques results.tsv'),delimiter='\t',header=0)
print(gpt_results.head(n=2))

## select 10 random records
ransamps = gpt_results.groupby('Data Repository').sample(10)

def ifenumed(rowdata):
    numlist = ['1.','2.','3.','4.','5.','6.','7.','8.','9.','10.','11.','12.','13.','14.','15.']
    for eachnum in numlist:
        if eachnum in rowdata:
            rowdata.replace(eachnum," - ")
        else:
            break
    return rowdata

## format results from records
def clean_predictions(row):
    rowdata = row['Predictions']
    if '1. ' in rowdata:
        rowdata = ifenumed(rowdata)
    tmpdata = rowdata.replace(" - ","|")
    tmplist = tmpdata.split("|")
    cleanlist = [x.replace("- ","").strip() for x in tmplist]
    return cleanlist 



                       _id Data Repository  \
0  clinepidb_DS_010e5612b8       ClinEpiDB   
1  clinepidb_DS_515a92c711       ClinEpiDB   

                                           Name  \
0  WASH Benefits Kenya Cluster Randomized Trial   
1               LAKANA Cluster Randomized Trial   

                                         Description          Model  \
0  The WASH Benefits Study        Publications fr...  gpt-3.5-turbo   
1  Background: Mass drug administration (MDA) of ...  gpt-3.5-turbo   

                                         Predictions  
0  - Cluster-randomized controlled trial - Survey...  
1  - Mass drug administration - Cluster randomize...  


In [None]:
ransamps['clean_pred'] = ransamps.apply(lambda row: clean_predictions(row), axis=1)
exploded_df = ransamps.explode('clean_pred')
#print(ransamps.head(n=2))
exploded_df.drop(columns=['Model','Predictions'],inplace=True)
print(exploded_df.head(n=2))

## Export results for manual evaluation
exploded_df.to_csv(os.path.join(result_path,'GPT_sample.tsv'),sep='\t',header=True)

### Analyse the extraction of extraneous generic 'stop word' terms

These generic terms can be used to mutate measurementTechnique terms for testing how the measurementTechnique mapping pipeline is affected by ChatGPT's tendency to add these types of stop words

In [4]:
## Pull the terms from Dylan's list (since the source is not needed)
t2t_results = pd.read_csv(os.path.join(data_path,'Measurement Techniques mapped.tsv'),delimiter='\t',header=2)

## Pull the list of techniques
techniques = t2t_results['Technique'].tolist()

## process the terms
cleanlist = []
for eachterm in techniques:
    tmpterm = eachterm.lower()
    tmpterm.replace("-"," ").replace(":"," ")
    tmplist = tmpterm.split(" ")
    cleantmp = [x.replace("(","").replace(")","").strip() for x in tmplist]
    cleanlist.extend(cleantmp)

termseries = pd.Series(cleanlist)
termdf = termseries.to_frame('technique')
termfreq = termdf.groupby('technique').size().reset_index(name='counts')
termfreq.sort_values('counts',ascending=False,inplace=True)
print(termfreq.loc[termfreq['counts']>5])
termfreq.to_csv(os.path.join(result_path,'stopword_freq.tsv'),sep='\t',header=0)

          technique  counts
28         analysis      72
172            data      44
657      sequencing      29
133      collection      28
29              and      24
711           study      21
517              of      19
82            blood      13
738         testing      11
470      microscopy      10
646        sampling       9
539             pcr       9
594   questionnaire       9
49       assessment       9
98             cell       8
609        receptor       8
189      diagnostic       8
107           chain       8
739           tests       8
234         ethical       7
248      expression       7
147         consent       7
183          design       7
633          review       7
200             dna       7
345  identification       7
606        reaction       6
387      interviews       6
715    surveillance       6
434         mapping       6
395       isolation       6
276             for       6
300      geographic       6
185       detection       6
559      polymerase 

### Determine how well ChatGPT extractions match with measTech terms from repos

#### Format the ChatGPT predictions for ReFRAMEDB, GEO, and LINCS

In [4]:
def format_predictions(row):
    rowdata = row['Predictions']
    if '1. ' in rowdata:
        rowdata = ifenumed(rowdata)
    cleanlist = list(map(str.strip, rowdata.strip("][").replace("'", "").split(",")))
    return cleanlist 

In [5]:
raw_25 = pd.read_csv(os.path.join(data_path,'GPT_GEO_ReframeDB_measTech_results.tsv'),delimiter='\t',header=0)
print(raw_25.head(n=2))

raw_10 = pd.read_csv(os.path.join(data_path,'GPT_LINCS_NCBI GEO_ReframeDB.tsv'),delimiter='\t',header=0)

         _id                                               Name  \
0   GSE57323  microRNA Expression Profile on Stimulated Peri...   
1  GSE108631  Genome-wide maps of EWS-FLI1 binding sites and...   

                                         Description          Model  \
0  Background: The emerging relationship between ...  gpt-3.5-turbo   
1  We identified global DNA binding properties of...  gpt-3.5-turbo   

                                         Predictions  
0  ['RNA extraction', 'miRNA profiling using TaqM...  
1                    ['ChIP-seq', 'Hi-C', 'RNA-Seq']  


In [6]:
raw_25['clean_pred'] = raw_25.apply(lambda row: format_predictions(row), axis=1)
exploded_df = raw_25.explode('clean_pred')
clean_25 = exploded_df.drop(['Name','Description','Model','Predictions'],axis=1)
print(clean_25.head(n=2))

        _id                                         clean_pred
0  GSE57323                                     RNA extraction
0  GSE57323  miRNA profiling using TaqMan® Array Human micr...


In [7]:
raw_10['clean_pred'] = raw_10.apply(lambda row: format_predictions(row), axis=1)
exploded_10 = raw_10.explode('clean_pred')
clean_10 = exploded_10.drop(['Model','Predictions','name','description','Measurement Technique'],axis=1)
print(clean_10.head(n=2))

clean_df = pd.concat((clean_25,clean_10),ignore_index=True)
clean_df['_id']=clean_df['_id'].astype(str).str.lower()
clean_df.drop_duplicates(keep='first',inplace=True)
print(len(clean_df))

        _id                 clean_pred
0  lds-1013  Competition binding assay
0  lds-1013           KinomeScan assay
937


In [8]:
print(clean_df.head(n=2))

        _id                                         clean_pred
0  gse57323                                     RNA extraction
1  gse57323  miRNA profiling using TaqMan® Array Human micr...


In [9]:
clean_id_list = clean_df['_id'].unique().tolist()
print(len(clean_id_list))

171


#### Pull the measurementTechnique values for the IDs of the records that were extracted

In [10]:
def parse_measTech(jsonresult):
    tmpTechlist = []
    for eachhit in jsonresult['hits']:
        nde_id = eachhit['_id']
        if isinstance(eachhit['measurementTechnique'],list):
            for eachmeas in eachhit['measurementTechnique']:
                tmpdict = {"_id":nde_id, 'measurementTechnique':eachmeas['name']}
                tmpTechlist.append(tmpdict)
        elif isinstance(eachhit['measurementTechnique'],dict):
            tmpTechlist.append({"_id":nde_id, 'measurementTechnique':eachhit['measurementTechnique']['name']})
    return tmpTechlist

In [11]:
def get_measTechs(clean_id_list):
    measTechlist = []
    for nde_id in clean_id_list:
        api_url = f'https://api-staging.data.niaid.nih.gov/v1/query?&q=identifier%3A"{nde_id}"&fields=measurementTechnique'
        r = requests.get(api_url)
        result = json.loads(r.text)
        if len(result['hits'])>0:
            tmplist = parse_measTech(result)
            measTechlist.extend(tmplist)
        else:
            api_url = f'https://api-staging.data.niaid.nih.gov/v1/query?&q={nde_id}&fields=measurementTechnique'
            r = requests.get(api_url)
            result = json.loads(r.text)
            tmplist = parse_measTech(result)
            measTechlist.extend(tmplist)
    measTechdf = pd.DataFrame(measTechlist)    
    return measTechdf

In [12]:
%%time
measTechdf = get_measTechs(clean_id_list)
print(measTechdf.head(n=2))
print(len(measTechdf))

         _id                               measurementTechnique
0   gse57323                     Expression profiling by RT-PCR
1  gse108631  Genome binding/occupancy profiling by high thr...
218
CPU times: total: 17 s
Wall time: 1min 20s


#### Evaluate if ChatGPT successfully extracted the measurementTechnique

In [13]:
measTechdf['_id'] = measTechdf['_id'].astype(str).str.lower()
print(measTechdf.head(n=2))

         _id                               measurementTechnique
0   gse57323                     Expression profiling by RT-PCR
1  gse108631  Genome binding/occupancy profiling by high thr...


In [14]:
merged_result = clean_df.merge(measTechdf, on='_id', how='left')
merged_result.drop_duplicates(keep='first')
print(len(clean_df),len(merged_result))
print(len(merged_result['_id'].unique().tolist()))

937 1226
171


In [15]:
print(merged_result.head(n=2))

        _id                                         clean_pred  \
0  gse57323                                     RNA extraction   
1  gse57323  miRNA profiling using TaqMan® Array Human micr...   

             measurementTechnique  
0  Expression profiling by RT-PCR  
1  Expression profiling by RT-PCR  


In [16]:
### Basic comparisons

## Try matching by exact match
tmp4test = merged_result.copy()
tmp4test['clean_pred'] = tmp4test['clean_pred'].astype(str).str.lower()
tmp4test['measurementTechnique'] = tmp4test['measurementTechnique'].astype(str).str.lower()
tmp4test['match_text?'] = tmp4test['clean_pred'].equals(tmp4test['measurementTechnique'])
matched = tmp4test.loc[tmp4test['match_text?']==True]
print(len(matched))

## Try matching by length
tmp4test = merged_result.copy()
tmp4test['clean_len'] = [len(x) for x in tmp4test['clean_pred']]
tmp4test['meas_len'] = tmp4test['measurementTechnique'].astype(str).str.len()
tmp4test['match_len?'] = tmp4test['clean_len'].equals(tmp4test['meas_len'])
matched = tmp4test.loc[tmp4test['match_len?']==True]
print(len(matched))

0
0


In [18]:
### Advanced comparisons

## Try calculating Jaccard similarity
def get_jsim(row):
    tmpbag1 = str(row['clean_pred']).split(' ')
    tmpbag2 = str(row['measurementTechnique']).split(' ')
    bag1 = set([x.lower() for x in tmpbag1])
    bag2 = set([x.lower() for x in tmpbag2])
    bag_intersect = bag1.intersection(bag2)
    bag_union = bag1.union(bag2)
    jsim = len(bag_intersect)/len(bag_union)
    return jsim

merged_result['jsim'] = merged_result.apply(lambda row: get_jsim(row),axis=1)
print(merged_result.head(n=2))

        _id                                         clean_pred  \
0  gse57323                                     RNA extraction   
1  gse57323  miRNA profiling using TaqMan® Array Human micr...   

             measurementTechnique      jsim  
0  Expression profiling by RT-PCR  0.000000  
1  Expression profiling by RT-PCR  0.083333  


In [19]:
### Reduce the number of manual evaluations needed
sorted_merged = merged_result.sort_values(by='jsim',ascending = False)
best_matches = sorted_merged.drop_duplicates(subset='_id',keep='first')

### Export the best_matches for evaluation
best_matches.to_csv(os.path.join(result_path,'Lincs_GEO_reframe_best_for_evaluation.tsv'),sep='\t',header=True)

### Export the all_matches for evaluation
sorted_merged.to_csv(os.path.join(result_path,'Lincs_GEO_reframe_for_evaluation.tsv'),sep='\t',header=True)

### Manually evaluate the results
Since GPT can potentially identify more techniques than are available as vocabulary-controlled categories in GEO, ReFRAMEDB, LINCS, we are looking only for the number of matches relative to the total number of records tested. The number of "false" predictions (ie- predictions that don't match the measTech values) are not being evaluated here since it is possible that they are "true" predictions with no matches due to limitations in the vocabulary used for the repository 

In [1]:
import requests
import json
import os
import pandas as pd

In [2]:
script_path = os.getcwd()
data_path = os.path.join(script_path,'data')
result_path = os.path.join(script_path,'result')

In [3]:
evaluateddf = pd.read_csv(os.path.join(result_path,'Lincs_GEO_reframe_evaluated.tsv'),delimiter='\t',header=0,index_col=0)
print(evaluateddf.head(n=2))
all_evaluated_ids = evaluateddf['_id'].unique().tolist()

      _id                                    clean_pred  \
947  gse1  Clustering based on correlation coefficients   
946  gse1                     Gene expression profiling   

              measurementTechnique  jsim      evaluation  
947  Expression profiling by array   0.0            poor  
946  Expression profiling by array   0.4  GPT more broad  


In [4]:
match_found = evaluateddf.loc[evaluateddf['evaluation']!='poor']
unmatched = evaluateddf.loc[evaluateddf['evaluation']=='poor']

In [5]:
unique_records_measTech_combos = evaluateddf.groupby(['_id','measurementTechnique']).size().reset_index(name='counts')
print(len(unique_records_measTech_combos))
resultdict = {"Total number of records evaluated": len(all_evaluated_ids),
              "Total number of unique records & measurementTechnique combos evaluated": len(unique_records_measTech_combos)}

215


In [7]:
good_match = evaluateddf.loc[evaluateddf['evaluation']=='good']
ok_match = evaluateddf.loc[(evaluateddf['evaluation']=='GPT more specific')|(evaluateddf['evaluation']=='GPT more broad')]
good_match_ids = good_match['_id'].unique().tolist()
ok_match_ids = ok_match['_id'].unique().tolist()
all_matched_ids = list(set(good_match_ids).union(set(ok_match_ids)))
match_found_ids = match_found['_id'].unique().tolist()
true_unmatched = [x for x in all_evaluated_ids if x not in match_found_ids]
missing_matched_ids = [x for x in match_found_ids if x not in all_matched_ids]
print(len(all_matched_ids),len(match_found_ids), len(true_unmatched),len(all_evaluated_ids)) 
accounted_ids = list(set(all_matched_ids).union(set(true_unmatched)))
unaccounted_ids = [x for x in all_evaluated_ids if x not in accounted_ids]
print(len(accounted_ids))

resultdict["Number of records with a good match"] = len(good_match_ids)
resultdict["Number of records with an ok match (slightly more broad or specific)"] = len(ok_match_ids)
resultdict["Number of records with ok or better match"] = len(all_matched_ids)
resultdict["Number of records where GPT did not identify at least one matching measurementTechnique"] = len(true_unmatched)
resultdict["Ratio total found"] = len(all_matched_ids)/len(all_evaluated_ids)
resultdict["good match ratio"] = len(good_match_ids)/len(all_evaluated_ids)
resultdict["ok match ratio"] = len(ok_match_ids)/len(all_evaluated_ids)
resultdict["unmatched ration"] = len(true_unmatched)/len(all_evaluated_ids)


with open(os.path.join(result_path,'GEO_LINCS_Reframe_analysis.txt'),'w') as outwrite:
    for k in list(resultdict.keys()):
        outwrite.write(k+'\t'+str(resultdict[k])+'\n')

146 146 25 171
171


### Test functions

In [24]:
### Test the parse_measTech function

nde_id = 'reframedb_a00414'
api_url = f'https://api-staging.data.niaid.nih.gov/v1/query?&q=identifier%3A"{nde_id}"&fields=measurementTechnique'
r = requests.get(api_url)
result = json.loads(r.text)
print(result)
tmpTechlist = parse_measTech(result)

{'took': 323, 'total': 1, 'max_score': 24.680622, 'hits': [{'_id': 'reframedb_a00414', '_ignored': ['all.keyword'], '_score': 24.680622, 'measurementTechnique': [{'name': 'cell based'}]}]}
