## Measuring Data Sharing with NLP in SEEK

### Data Sharing Ratio

- Ds: Data Sharing Ratio
- A: Number of Data assets Available from an Open Access Publication.
- U: Number of Data assets Used in an Open Access Publication.

Ds = _A/U_

Researchers use data asset files in diverse ways. Data assets can be fragmented across different files or fully integrated into a single file. Nevertheless, the Open Access Publications we are looking at have an experimental setup and a modelling setup (simulation), and we can obtain the authors' disciplines. Therefore we can assume that at least one data asset per discipline should be available. For example, in [#achcar_dynamic_2012] :

> Dissecting the Catalytic Mechanism of Trypanosoma brucei Trypanothione Synthetase by Kinetic Analysis and Computational Modeling*
Alejandro E. Leroux,‡,1 Jurgen R. Haanstra,§¶,1 Barbara M. Bakker,§¶,2 and R. Luise Krauth-Siegel‡,3

- ‡,1 : Experimentalist
- §¶,1 : Modeller
- §¶,2 : Modeller
- ‡,3 : Experimentalist

_U_ for this Open Access Publication would be __2__, that is, one data set and one model.
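As a minimal sketch of the counting rule above, the following Python snippet derives _U_ from the authors' roles and then computes the ratio. The function names and the value chosen for _A_ are illustrative assumptions, not figures from the case study.

```python
def assets_used(author_roles):
    """Estimate U as the number of distinct disciplines among the authors."""
    return len(set(author_roles))


def data_sharing_ratio(available, used):
    """Ds = A / U; undefined when no data asset is expected."""
    if used == 0:
        raise ValueError("U must be greater than zero")
    return available / used


# Roles of the four authors in the example above.
roles = ["Experimentalist", "Modeller", "Modeller", "Experimentalist"]
u = assets_used(roles)  # one data set and one model
a = 1                   # hypothetical: only one asset could be found
print(u, data_sharing_ratio(a, u))
```

With these assumed inputs the ratio is 0.5: half of the assets expected for the publication would be available.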

A further assumption could be that members of different institutions do not partner in a publication unless they have provided a data asset for it.

### Approaches to identify accessible shared data

Identifying the number of accessible shared data assets is a more complicated task, as it requires actually linking a data file with a publication. Data files can be distributed in many places, and often no direct link between the two assets is provided. We are considering three approaches:

#### Deposition Statements based approach

Academic publications may include notes on the data assets they use and where they are deposited for access. These deposition statements can be used to identify shared data.
Piwowar [#piwowar_identifying_2008] uses NLP to identify __deposition patterns__ in publications. For example:

>  Microarray design and data are also available from the ArrayExpress database under __accession__ no. E-MEXP-2503 (http://www.ebi.ac.uk/microarray-as/aer/entry).

Assumptions:
- Deposition statements are included in publications that use either experimental or modelling data.
- Deposition statements are about data created by the authors.

#### ISA structure based approach

In our case study, scientists can link data to publications using the ISA structure model. These links can be used to track shared data.

Assumptions:
- Publications (from a SEEK-based collaboration) are connected to their assets (data) using the ISA structure model.

#### Biological terms based approach

Academic Publications mention descriptions of the data they use and these descriptions are also used to name and describe files that contain the shared data. Terms in those descriptions and in the publications can be used to link shared data. For example in [#achcar_dynamic_2012] :

Publication extract:

>EXPERIMENTAL PROCEDURES
>…
>__Substrate__ inhibition of __TryS__ by GSH. The __activity__ of __TryS__ at variable __concentrations__ of GSH was measured in the in vivo-like buffer system. The assays contained 8 mm Spd and different fixed ATP __concentrations__ (A) and 2.3 mm ATP and different fixed Spd ...
>...

Data asset in the Repository:
>Filename: __Activity__ of Tb__TryS__ measured by the spectrophotometric assay.xls
>Description: The file contains the initial rate measurements of Tb__TryS__ obtained under different __substrate__ and product initial __concentrations__.

These terms could be used to link publications and data sets, provided that data files could be restricted by author or collaboration.

Assumptions:
- Scientists describe data in publications with terms that are the same as, or similar to, the terms they use to name or describe the files containing the data.
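The matching idea can be sketched in a few lines of Python. The candidate term list below is hand-picked for illustration (in the study the terms come from Termine), and plain case-insensitive substring matching is an assumption, chosen here so that `TryS` also matches `TbTryS`.

```python
def shared_terms(candidate_terms, publication_text, description_text):
    """Return the candidate terms that occur in both texts (case-insensitive)."""
    pub = publication_text.lower()
    desc = description_text.lower()
    return [t for t in candidate_terms if t.lower() in pub and t.lower() in desc]


publication = (
    "Substrate inhibition of TryS by GSH. The activity of TryS at variable "
    "concentrations of GSH was measured in the in vivo-like buffer system."
)
description = (
    "Activity of TbTryS measured by the spectrophotometric assay.xls. "
    "The file contains the initial rate measurements of TbTryS obtained under "
    "different substrate and product initial concentrations."
)
# Hypothetical candidate terms; GSH appears only in the publication.
candidates = ["substrate", "TryS", "activity", "concentrations", "GSH"]
print(shared_terms(candidates, publication, description))
```

Substring matching is deliberately naive: short terms can match inside unrelated words, so a real implementation would probably need tokenisation or term normalisation.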

### Evaluating the approaches' assumptions on the case study data set

Each approach requires a degree of implementation. To decide which approach would be most helpful, we evaluate whether our case study data set fulfils the assumptions of each approach.

The evaluation was done manually, in some cases using a sample of 10% instead of the complete data set.

The SysMO SEEK database contains links to 206 publications produced by the collaboration participants. The database copy that we use dates back to January 2013. By February 2014, the SysMO SEEK webpage showed 200 publications.
I counted them with JavaScript on the webpage:



```
// Count the publication entries listed on the page.
document.getElementsByClassName("list_item with_smaller_shadow curved").length
```



There are six fewer publications than in our database copy from January 2013. I'm uncertain about the cause of this discrepancy. However, it shows that the database has not changed much compared to its current state.

> Olga: ...publications can not be hidden.

The results of the assumptions evaluation indicate that our data set fits the assumptions of the __Biological Term-based__ approach better than those of the other two approaches.

| Approach                        | %   |
| :--                             | :-- |
| Deposition based [^1]           | 40% |
| ISA Structure                   | 25% |
| Biological Term-based [^1] [^2] | 72% |
[Percentage of data set items that fulfil the assumptions.][summryTbl]

The following sections describe how the evaluations were made.


[^1]: We use a sample of 10% of the data set.


[^2]: The Term-based analysis is limited to data files; it does not evaluate models or SOPs.


## Approach 1: Using Deposition Statements

We took a randomized sample of 20 publications and looked manually for deposition statements in each of them.

Publications ids:

> 130, 93, 25, 78, 94, 118, 199, 69, 152, 150, 1, 77, 45, 190, 205, 198, 15, 142, 111, 184

Deposition terms searched for manually:
- deposited
- accession
- include
- provided
- supplementary

There were deposition statements in __40%__ of the publications.
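The check itself was done by hand. As a rough sketch of how it might be automated, the Python snippet below searches a sentence for the five terms listed above. The names `DEPOSITION_TERMS`, `find_deposition_terms` and `share_with_statements` are illustrative assumptions, and the prefix matching is a simplification.

```python
import re

# The five terms used in the manual check.
DEPOSITION_TERMS = ["deposited", "accession", "include", "provided", "supplementary"]
# Word-prefix match, so "include" also catches "included" and "includes".
PATTERN = re.compile(r"\b(" + "|".join(DEPOSITION_TERMS) + r")\w*", re.IGNORECASE)


def find_deposition_terms(text):
    """Return the sorted, lower-cased deposition terms found in a text."""
    return sorted({m.group(1).lower() for m in PATTERN.finditer(text)})


def share_with_statements(publications):
    """Fraction of publication texts with at least one matching term."""
    hits = sum(1 for text in publications if find_deposition_terms(text))
    return hits / len(publications)


example = ("Microarray design and data are also available from the ArrayExpress "
           "database under accession no. E-MEXP-2503.")
print(find_deposition_terms(example))
```

Broad terms such as "include" and "provided" are likely to match sentences that are not deposition statements, so the output of such a filter would still need manual review.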

## Approach 2: Using ISA Structure

We identified all publications related to a scientific asset, via their ISA structure, in the SEEK database.

The *relationships* table contains information about the links between publications, documents and assays.

We obtained only 55 publications related to an asset or assay.



```
SELECT DISTINCT object_type,object_id
FROM `relationships`
WHERE object_type = "publication";
```

> 93,97,34,98,8,113,134,35,142,137,145,136,147,149,176,175,174,179,112,148,161,84,162,128,143,127,183,186,94,61,102,105,111,133,182,132,181,184,103,131,104,108,110,106,130,107,146,190,87,177,100,185,139,205,206

This means that **25%** of publications are described with their ISA structure.



## Approach 3: Using Biological Terms

We took a random sample of publications and the descriptions of their related data files from the SEEK database. Biological terms were identified manually using the [Termine Web Demonstrator](http://www.nactem.ac.uk/software/termine/), for the set of publications and the set of file descriptions. Then a simple comparison between the two sets of terms was made.

Our data sample was restricted in three dimensions:
- We used a 10% randomised sample of publications.
- The set of publications was restricted to those with an ISA structure.
- The set of data files was limited to data sets. Models and SOPs were not included.

There are *248 data sets* created by users. There are also data sets created by bots or some other entity.

```
SELECT DISTINCT id
FROM `data_files`
WHERE contributor_type = "User";

SELECT DISTINCT subject_id
FROM `relationships`
WHERE object_type = "publication" and `subject_type` = "DataFile";
```

I'll take 20 at random (about 10%) and check whether the [Termine Web Demonstrator](http://www.nactem.ac.uk/software/termine/) finds terms in their descriptions.

```
import random
myset = [601,600,598,599,868,826,874,876,877,878,879,880,864,992,993,994,995,996,997,998,999,1001,1002,871,1054,1056,1055,1057,1052,1058,1059,1060,863,914,1068,54,1070,1071,1072,1073,1069,832,835,836,833,834]
random.sample(myset, 20)
```

Random sample of the data files linked to a publication through the ISA structure:

> [826, 863, 836, 599, 1073, 879, 54, 992, 997, 835, 1056, 832, 871, 995, 868, 880, 1069, 864, 877, 1070]

Get the titles and descriptions:

```
SELECT DISTINCT id, title, description, contributor_type
FROM `data_files`
WHERE id IN (826, 863, 836, 599, 1073, 879, 54, 992, 997, 835, 1056, 832, 871, 995, 868, 880, 1069, 864, 877, 1070);
```

All data files have at least one term detected by Termine.

```
SELECT pubmed_id, `title`, id
FROM `publications`
WHERE id IN (SELECT DISTINCT object_id
FROM `relationships`
WHERE `subject_id` IN (826, 863, 836, 599, 1073, 879, 54, 992, 997, 835, 1056, 832, 871, 995, 868, 880, 1069, 864, 877, 1070)
);
```

There are fewer than 20 results because some data files belong to the same paper. The list of relations between data files and papers, obtained as follows, will help to build the ground truth.

```
SELECT DISTINCT subject_id, object_id FROM `relationships` WHERE object_type = "publication" and `subject_type` = "DataFile";
```



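| subject_id | object_id |
| :--        | :--       |
| 601        | 97        |
| 600        | 97        |
| 598        | 97        |
| 599        | 97        |
| 868        | 97        |
| 826        | 8         |
| 874        | 113       |
| 876        | 113       |
| 877        | 113       |
| 878        | 113       |
| 879        | 113       |
| 880        | 113       |
| 864        | 149       |
| 992        | 175       |
| 993        | 175       |
| 994        | 176       |
| 995        | 175       |
| 996        | 176       |
| 997        | 176       |
| 998        | 176       |
| 999        | 175       |
| 1001       | 176       |
| 1002       | 175       |
| 871        | 112       |
| 1054       | 84        |
| 1056       | 162       |
| 1055       | 162       |
| 1057       | 186       |
| 1052       | 94        |
| 1058       | 148       |
| 1059       | 186       |
| 1060       | 61        |
| 863        | 149       |
| 914        | 102       |
| 914        | 105       |
| 914        | 183       |
| 914        | 111       |
| 914        | 127       |
| 914        | 133       |
| 914        | 182       |
| 914        | 128       |
| 914        | 132       |
| 914        | 181       |
| 914        | 184       |
| 914        | 103       |
| 914        | 131       |
| 914        | 104       |
| 914        | 108       |
| 914        | 143       |
| 914        | 110       |
| 914        | 106       |
| 914        | 130       |
| 914        | 107       |
| 1068       | 190       |
| 54         | 100       |
| 1070       | 185       |
| 1071       | 185       |
| 1072       | 185       |
| 1073       | 185       |
| 1069       | 185       |
| 832        | 206       |
| 835        | 206       |
| 836        | 206       |
| 833        | 206       |
| 834        | 206       |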

To compare the two sets, the set of terms from the data file description and the set from the publication, we use:

```
# "997" holds the Termine output for data file 997,
# "22712534" the Termine output for the publication with that PubMed ID.
# Both are tab-separated; the term is the second field.
with open("997") as data_file_terms, open("22712534") as publication_terms:
    # A set gives O(1) lookups, so the comparison is O(N) overall.
    pub_terms = {line.split('\t', 2)[1].strip()
                 for line in publication_terms if line.strip()}
    # Print every data file term that also appears in the publication.
    for line in data_file_terms:
        if not line.strip():
            continue  # skip blank lines
        term = line.split('\t', 2)[1].strip()
        if term in pub_terms:
            print(term)

| PubmedID                   | Terms in common                                                                                                                      |
| :--                        | :--                                                                                                                                  |
| 21252224                   | steady-state chemostat culture, aerobiosis scale                                                                                     |
| 20053288                   | --                                                                                                                                   |
| 21097579                   | flux distribution, growth rate, s. pyogene, maximal growth rate                                                                      |
| 21106498                   | --                                                                                                                                   |
| 19802714                   | s. solfataricus cell, s. solfataricus, non-phosphorylating glyceraldehyde 3-phosphate dehydrogenase                                  |
| Proteolysisofbeta-galactos | stress response                                                                                                                      |
| 21815947                   | --                                                                                                                                   |
| 22686585                   | yeast glycolytic oscillation, yeast extract, flow rate                                                                               |
| 22712534                   | yeast culture                                                                                                                        |
| Glucosetransport           | by-product formation rate, by-product formation rate, glucose consumption rate, glucose transport system, clostridium acetobutylicum |
| Integrativemodelling       | clostridium acetobutylicum                                                                                                           |
[Common terms in publications and data sets][termcommon]

Three of the 11 publications had no terms in common with their related data files (according to the ISA structure). That is, 72% of the papers in the sample (8 of 11) share terms with their data files.

| Pubmed_Id | Recall | Precision |
| :--       | :--:   | :--:      |
| x         | 1.0    | 0.45      |
| x         | 1.0    | 0.55      |
| x         | 1.0    | 0.50      |
| x         | 0.5    | 1         |
| x         | 1.0    | 0.45      |
| x         | 1.0    | 0.44      |
| x         | 1.0    | 0.25      |
| x         | 1.0    | 0.625     |
| x         | 1.0    | 0.833     |
[Recall and Precision of Bio-terms approach in a sample of 10% of publications][rec_pre]
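The notes do not spell out how recall and precision were computed. One plausible reading, stated here as an assumption, is that for each publication the data files proposed by term matching are compared with the data files linked to it in the `relationships` table. A minimal Python sketch of that calculation follows; the predicted IDs are invented for illustration.

```python
def precision_recall(predicted, relevant):
    """Precision and recall of a collection of predicted data file ids."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Ground truth: data files linked to publication 206 in the relationships listing.
relevant = [832, 833, 834, 835, 836]
# Hypothetical output of a term-based matcher for that publication.
predicted = [832, 833, 834, 835, 836, 54, 871, 914, 1068, 1060]
precision, recall = precision_recall(predicted, relevant)
print(precision, recall)
```

Under this reading, a recall of 1.0 with low precision means the matcher finds every linked data file but also proposes many unrelated ones.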

## Sample Bias

Check the other 10%

| pubmed_id | recall    | precision |
| :--       | :--:      | :--:      |
| 20053288  | 1         | 0.125     |
| 20933603  | 0         | 0         |
| 21097579  | 0.2       | 1         |
| 21252224  | 1         | 1         |
| 20233302  | 1         | 0.3333333 |
| 21123069  | 1         | 0.3333333 |
| 20300532  | 1         | 0.3333333 |
| 19737355  | 1         | 0.3333333 |
| 19684115  | 1         | 0.3333333 |
| 21106498  | 1         | 1         |
| 19802714  | 1         | 1         |
| 21651626  | 1         | 0.3333333 |
| 21841760  | 1         | 0.3333333 |
| 22431591  | 1         | 0.2       |
| 22383849  | 1         | 0.25      |
| 22511268  | 1         | 0.2857143 |
| 22686585  | 0.4       | 0.2857143 |
| 22712534  | 0.8       | 0.5714286 |
| 23033921  | 1         | 0.3333333 |
| 22923596  | 0.8       | 1         |
| 22001508  | 0         | 0         |
| 23332010  | 1         | 0.8333333 |
| AVG       | 0.8272727 | 0.4644481 |

Publications without an ISA structure:

```
SELECT DISTINCT id FROM `publications` WHERE id NOT IN (93,97,34,98,8,113,134,35,142,137,145,136,147,149,176,175,174,179,112,148,161,84,162,128,143,127,183,186,94,61,102,105,111,133,182,132,181,184,103,131,104,108,110,106,130,107,146,190,87,177,100,185,139,205,206);
```

Take a random sample of 20 of them:

```
import random
myset = [1,2,4,5,7,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,36,37,38,39,40,41,42,43,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,62,63,64,65,66,67,68,69,70,71,72,73,75,76,77,78,79,80,81,82,83,86,90,91,92,96,99,101,109,114,115,116,117,118,119,120,121,122,123,124,129,138,140,141,144,150,151,152,153,154,155,156,158,159,160,163,164,165,166,167,168,169,170,171,172,173,178,180,187,188,189,191,192,193,194,195,196,197,198,199,200,201,202,203,204]
random.sample(myset, 20)
```

Get the PubMed IDs of the sampled publications:

```
SELECT pubmed_id FROM publications WHERE id IN (200, 16, 53, 155, 192, 71, 36, 156, 86, 151, 48, 24, 19, 63, 141, 168, 92, 101, 65, 165);
```

> 19321498,18395130,20412803,20214910,17725564,18491319,18546160,19047653,19193632,19403106,19374982,21133689,22281772,18086213,21219666,20947526,21479178,22096228,23175651,22052476