## Descriptive length coverage

This notebook checks how much descriptive text is available for each record, measured as the combined length of the `name` and `description` fields. When LLMs were evaluated on measurementTechnique extraction and on sorting clinical from non-clinical studies, hallucinations were generally much more frequent when less descriptive information was available.

For example, measurementTechnique extraction needed a different minimum name + description length to succeed, depending on the repository:
* For technique-based repositories
  * the minimum length was 50 characters
* For generalist repositories (GREIs), the minimum length was much higher
  * Harvard Dataverse: 300 characters
  * Mendeley: 250 characters
  * Zenodo: 200 characters
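
The thresholds listed above can be expressed as a small lookup. This is a minimal sketch, not code from the evaluation itself: the dictionary keys are assumed to match the `_id` prefixes used later in this notebook, and `has_enough_text` and `DEFAULT_MIN_LENGTH` are hypothetical names. Using the 50-character value as a fallback for any source not listed is also an assumption.

```python
# Minimum name + description lengths (in characters) from the list above.
# Keys are assumed to match the _id prefixes seen later in the notebook.
MIN_LENGTHS = {
    "dataverse": 300,
    "mendeley": 250,
    "zenodo": 200,
}
# Assumption: fall back to the 50-character minimum reported for
# technique-based repositories when a source is not listed.
DEFAULT_MIN_LENGTH = 50


def has_enough_text(source: str, length: int) -> bool:
    """Return True when a record meets the minimum length for its source."""
    return length >= MIN_LENGTHS.get(source, DEFAULT_MIN_LENGTH)


# Illustrative (source, length) pairs, not real records.
examples = [("dataverse", 120), ("mendeley", 250), ("zenodo", 199), ("zenodo", 800)]
flags = [has_enough_text(source, length) for source, length in examples]
print(list(zip(examples, flags)))
```

Records flagged `False` would be the ones most at risk of hallucinated output under the thresholds above.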


In [1]:
import os
import json
import pandas as pd
import requests
import math
import time

In [2]:
sources = ["Figshare","Mendeley","Harvard+Dataverse","Zenodo"]

In [3]:
allresults = pd.DataFrame(columns=["_id","name","description"])
base_url = 'https://api-staging.data.niaid.nih.gov/v1/query'
for eachsource in sources:
    print("now fetching: ",eachsource)
    # First request: fetch_all=true returns one batch of hits plus a _scroll_id
    r = requests.get(f'{base_url}?q=includedInDataCatalog.name:"{eachsource}"&fields=_id,name,description&fetch_all=true')
    results = r.json()
    tmpdf = pd.DataFrame(results['hits'])
    allresults = pd.concat((allresults,tmpdf),ignore_index=True)
    if results['total']>=500:
        # Upper bound on follow-up requests, assuming at least 500 hits per batch
        maxscrolls = math.ceil(results['total']/500)
        scroll_id = results['_scroll_id']
        for _ in range(maxscrolls):
            try:
                r2 = requests.get(f'{base_url}?scroll_id={scroll_id}')
                tmp = r2.json()
                scroll_id = tmp['_scroll_id']
                tmpdf = pd.DataFrame(tmp['hits'])
                allresults = pd.concat((allresults,tmpdf),ignore_index=True)
            except (requests.RequestException, KeyError, ValueError):
                # A failed request or a response without _scroll_id/hits ends paging for this source
                break

now fetching:  Figshare
now fetching:  Mendeley
now fetching:  Harvard+Dataverse
now fetching:  Zenodo
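
The scroll loop above can also be written as a generator that stops on its own when the results run out. The sketch below makes assumptions: that the API signals the end of a scroll with a response containing no hits (the notebook only shows that the loop breaks on an exception), and that the names `iter_hits` and `fetch_page` are hypothetical. An offline stand-in replaces the real HTTP requests so the sketch runs without network access.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_hits(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield every hit from a scroll-style API.

    fetch_page(scroll_id) is a hypothetical callable: it returns the first
    batch when scroll_id is None and the next batch otherwise.
    """
    scroll_id = None
    while True:
        page = fetch_page(scroll_id)
        hits = page.get("hits", [])
        if not hits:
            break  # assumed end-of-scroll signal: a response without hits
        yield from hits
        scroll_id = page.get("_scroll_id")
        if scroll_id is None:
            break  # nothing left to scroll through


# Offline stand-in for the real requests: two batches, then an error response.
_batches = {
    None: {"hits": [{"_id": "zenodo_1"}, {"_id": "zenodo_2"}], "_scroll_id": "a"},
    "a": {"hits": [{"_id": "zenodo_3"}], "_scroll_id": "b"},
    "b": {"error": "no more results"},
}


def fake_fetch(scroll_id: Optional[str]) -> Dict:
    return _batches[scroll_id]


all_ids = [hit["_id"] for hit in iter_hits(fake_fetch)]
print(all_ids)
```

With the real API, `fetch_page` would wrap the two `requests.get` calls used in the cell above. Collecting the hits in a list and building one DataFrame at the end also avoids repeated `pd.concat` calls.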


In [12]:
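# Missing values become the string "N/A" so name and description can be concatenated
rawdf = allresults[['_id','name','description']].copy().fillna("N/A")
# Drop rows that have no record identifier
testdf = rawdf.loc[rawdf['_id']!="N/A"].copy()
testdf['text'] = testdf['name']+'\n'+testdf['description']
# True when the description just repeats the name
testdf['redundant descript?'] = testdf['name'] == testdf['description']
# Character count of name + newline + description (an "N/A" placeholder counts as 3)
testdf['length'] = testdf['text'].astype(str).str.len()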

In [14]:
def assume_src(x):
    """Infer the source repository from the prefix of an _id such as 'figshare_17034748'."""
    return x.split('_')[0]

In [15]:
testdf['source'] = testdf['_id'].apply(assume_src)
print(testdf.head(n=2))

                 _id                                               name  \
3  figshare_17034748        Properties of study participants (N = 214).   
4  figshare_17034754  Pearson correlation coefficient between CSE, L...   

                                         description  \
3        Properties of study participants (N = 214).   
4  Pearson correlation coefficient between CSE, L...   

                                                text  redundant descript?  \
3  Properties of study participants (N = 214).\nP...                 True   
4  Pearson correlation coefficient between CSE, L...                 True   

   length    source  
3      87  figshare  
4     181  figshare  
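
In the output above, the first record has a length of 87 because its 43-character name is counted twice: once as the name and once as the identical description, joined by a newline. One possible alternative measure, sketched below, counts duplicated text only once and treats a missing description as empty. The function name `effective_length` is hypothetical, and the notebook itself does not use this measure.

```python
from typing import Optional


def effective_length(name: Optional[str], description: Optional[str]) -> int:
    """Length of name + description, counting a duplicated description once."""
    name = (name or "").strip()
    description = (description or "").strip()
    if not description or description == name:
        return len(name)
    return len(name) + 1 + len(description)  # +1 for the joining newline


# Illustrative records; only the first one is taken from the output above.
records = [
    ("Properties of study participants (N = 214).",
     "Properties of study participants (N = 214)."),
    ("Survey data", "Responses from 214 participants."),
    ("Untitled", None),
]
lengths = [effective_length(name, description) for name, description in records]
print(lengths)
```

To apply this to the notebook's frame, note that missing values were replaced with the string "N/A" in an earlier cell, so that placeholder would need to be mapped back to an empty string first.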


In [17]:
minlengths = [50,100,200,300,400,500]

In [18]:
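# For each threshold, count the records shorter than it, split by source and by
# whether the description merely repeats the name. Counts are cumulative: a
# record shorter than 50 characters is also counted under every larger threshold.
frequencydf = pd.DataFrame(columns=["source","counts","redundant descript?","minlength"])
for eachlen in minlengths:
    tmpdf = testdf.loc[testdf['length']<eachlen]
    freqdf = tmpdf.groupby(["source","redundant descript?"]).size().reset_index(name="counts")
    freqdf['minlength']=eachlen
    frequencydf = pd.concat((frequencydf,freqdf),ignore_index=True)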

print(frequencydf)

       source  counts redundant descript? minlength
0   dataverse    3523               False        50
1   dataverse     121                True        50
2       dryad      16               False        50
3       dryad       1                True        50
4    figshare   32836               False        50
5    figshare   18330                True        50
6    mendeley    1979               False        50
7    mendeley     345                True        50
8      zenodo   21113               False        50
9      zenodo     860                True        50
10  dataverse    4610               False       100
11  dataverse     222                True       100
12      dryad     125               False       100
13      dryad       3                True       100
14   figshare   80143               False       100
15   figshare  100328                True       100
16   mendeley   13829               False       100
17   mendeley     768                True       100
18     zenod
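
Raw counts like those above are hard to compare across repositories of very different sizes. A possible follow-up, sketched here on a toy frame, is to compute the share of each source's records that fall below each threshold. The toy values are illustrative only and are not results from this notebook; `share_below` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for testdf; the values are illustrative, not real results.
toy = pd.DataFrame(
    {
        "source": ["zenodo", "zenodo", "zenodo", "zenodo", "mendeley", "mendeley"],
        "length": [40, 150, 260, 900, 80, 700],
    }
)
thresholds = [50, 100, 200, 300]

# One column per threshold: the fraction of each source's records shorter than it.
# The mean of a boolean mask within a group is the share of True values.
share_below = pd.DataFrame(
    {t: (toy["length"] < t).groupby(toy["source"]).mean() for t in thresholds}
)
print(share_below)
```

The same expression should work on `testdf` with `minlengths` in place of the toy inputs, since it relies only on the `source` and `length` columns.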

In [19]:
import pickle

with open(os.path.join('data','length_check','name_descript_check.pkl'),'wb') as outfile:
    pickle.dump(testdf, outfile)

In [20]:
frequencydf.to_csv(os.path.join('data','length_check','length_frequencies.tsv'),sep='\t',header=True)

In [4]:
import pickle

with open(os.path.join('data','length_check','name_descript_check.pkl'),'rb') as infile:
    data = pickle.load(infile)

print(data.head(n=2))

                 _id                                               name  \
3  figshare_17034748        Properties of study participants (N = 214).   
4  figshare_17034754  Pearson correlation coefficient between CSE, L...   

                                         description  \
3        Properties of study participants (N = 214).   
4  Pearson correlation coefficient between CSE, L...   

                                                text  redundant descript?  \
3  Properties of study participants (N = 214).\nP...                 True   
4  Pearson correlation coefficient between CSE, L...                 True   

   length    source  
3      87  figshare  
4     181  figshare  


In [7]:
# Figshare records whose description is identical to the name
figset = data.loc[data['source']=="figshare"]
redundant_figs = figset.loc[figset['redundant descript?']]
print(len(redundant_figs))
redundant_fig_ids = redundant_figs['_id'].unique().tolist()
print(len(redundant_fig_ids))

471555
471555


In [8]:
with open(os.path.join('data','length_check','redundant_text_in_fig.txt'),'w') as outfile:
    for figid in redundant_fig_ids:
        outfile.write(figid+'\n')

In [10]:
print(redundant_fig_ids[0:10])

['figshare_17034748', 'figshare_17034754', 'figshare_23694168', 'figshare_17034781', 'figshare_26108144', 'figshare_26108146', 'figshare_26108150', 'figshare_26108153', 'figshare_26108159', 'figshare_26108171']
