In [1]:
%run notebook_setup.ipynb

In [2]:
%vault from pubmed_derived_data import literature
%vault from pubmed_derived_data import predicted_article_types, reliable_article_types
%vault from pubmed_derived_data import domain_features
%vault from pubmed_derived_data import popular_journals

Imported `literature` (904B0F94) at Monday, 03. Aug 2020 01:11

Imported:

 - `predicted_article_types` (3D39430E)
 - `reliable_article_types` (5D584CB5)

at Monday, 03. Aug 2020 01:11

Imported `domain_features` (02FA7AED) at Monday, 03. Aug 2020 01:11

Imported `popular_journals` (0B2CABD1) at Monday, 03. Aug 2020 01:11

**Aim**:
- verify whether TCGA is indeed over-represented in methods papers (and by how much)
- collect the disease terms and create an ontology plot to highlight which kinds of diseases are well-studied and which are not

In [3]:
textual = literature['title'] + ' ' + literature['abstract_clean'].fillna('') + ' ' + literature['full_text'].fillna('')

In [4]:
literature['mentions_tcga'] = (
    textual
    .str.lower().str.contains('tcga|the cancer genome atlas')
)
literature['mentions_tcga'].mean()

0.11805555555555555
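To make the matching step concrete, here is a minimal sketch on made-up strings (the texts and the `toy_*` names are hypothetical and not taken from `literature`). It applies the same lower-casing and pattern as the cell above, and relies on the mean of a boolean Series being the fraction of `True` values:

```python
import pandas as pd

# Hypothetical example texts; not taken from the `literature` table.
toy_text = pd.Series([
    'Integrative analysis of TCGA breast cancer data',
    'Data from The Cancer Genome Atlas were used',
    'A multi-omics study of plant metabolism',
])

# Same pattern as above: lower-case first, then match either spelling.
toy_mentions_tcga = (
    toy_text
    .str.lower().str.contains('tcga|the cancer genome atlas')
)

# The mean of a boolean Series is the fraction of True values.
toy_fraction = toy_mentions_tcga.mean()
```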

In [5]:
from pandas import concat

In [6]:
combined_article_types = concat([
    predicted_article_types,
    reliable_article_types
]).loc[literature.index]

In [7]:
data = (
    literature
    .drop(columns=['full_text', 'abstract'])
    .join(combined_article_types)
)
data['is_type_predicted'] = data.index.isin(predicted_article_types.index)

In [8]:
all_articles = data.assign(one=1)
open_access_subset = all_articles[all_articles.has_full_text == True]

In [9]:
from scipy.stats import fisher_exact

## Cancer enrichment in multi-omics papers (compared to matched papers from same context)

TIAB is the PubMed code for the 'title and abstract' search restriction. Here we start with all the articles published in the popular journals; the TIAB restriction is used to match the feature extraction, which was performed on the abstracts of articles:

In [10]:
%vault from pubmed_derived_data import cancer_articles_from_popular_journals_tiab_only

Imported `cancer_articles_from_popular_journals_tiab_only` (C6D2493E) at Monday, 03. Aug 2020 01:11

In [11]:
%vault from pubmed_derived_data import all_articles_by_journal_and_year

Imported `all_articles_by_journal_and_year` (AB6E261E) at Monday, 03. Aug 2020 01:11

In [12]:
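from pandas import Series


def count_articles_mentioning_disease(data):
    # `mentioned_diseases_set` stores each set as a string, hence the eval;
    # each disease is counted at most once per article
    return (
        Series(
            data
            .mentioned_diseases_set
            .astype(object).apply(eval).apply(list)
            .sum()  # concatenate the per-article lists into one long list
        )
        .value_counts()
    )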

In [13]:
number_of_articles_mentioning_diseases = count_articles_mentioning_disease(domain_features)
number_of_articles_mentioning_diseases.head(10)

cancer                      786
disease                     722
carcinoma                   132
inflammation                 77
cardiovascular               68
diabetes                     60
colorectal cancer            59
adenocarcinoma               53
hepatocellular carcinoma     47
glioblastoma                 42
dtype: int64
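The same count can probably be obtained without concatenating lists through `.sum()`, which gets slow for many articles. The following is only a sketch on hypothetical data: it assumes that each cell of `mentioned_diseases_set` holds the string form of a non-empty set literal, and the `toy_*` names are made up for illustration.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical stand-in for `domain_features`; assumes each cell is the
# string representation of a non-empty Python set literal.
toy_features = pd.DataFrame({
    'mentioned_diseases_set': [
        "{'cancer', 'carcinoma'}",
        "{'cancer'}",
        "{'diabetes'}",
    ]
})


def count_articles_mentioning_disease_toy(data):
    return (
        data['mentioned_diseases_set']
        .apply(literal_eval)  # parse the stored string back into a set
        .apply(sorted)        # sets -> lists, so that explode() can expand them
        .explode()            # one row per (article, disease) pair
        .value_counts()       # number of articles mentioning each disease
    )


toy_counts = count_articles_mentioning_disease_toy(toy_features)
```

Because each article contributes a set, a disease is counted at most once per article, which matches the intent of the function above.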

In [14]:
journal_share_in_multiomics = popular_journals.journal / sum(popular_journals.journal)
journal_share_in_multiomics.name = 'share'
journal_share_in_multiomics.head(2)

index
Scientific reports                          0.048592
Omics : a journal of integrative biology    0.030081
Name: share, dtype: float64

In [15]:
def counts_weighted_by_share(data, share):
    with_share = data.groupby('journal').sum().join(share)
    return (with_share['count'] * with_share['share']).sum()

In [16]:
cancer_articles_weighted = counts_weighted_by_share(cancer_articles_from_popular_journals_tiab_only, journal_share_in_multiomics)
all_articles_weighted = counts_weighted_by_share(all_articles_by_journal_and_year, journal_share_in_multiomics)

cancer_articles_in_multi_omics = number_of_articles_mentioning_diseases.loc['cancer']
articles_in_multi_omics = len(domain_features)

cancer_articles_in_multi_omics / articles_in_multi_omics, cancer_articles_weighted / all_articles_weighted

(0.22743055555555555, 0.09978695663666994)

In [17]:
fisher_exact([
    [cancer_articles_in_multi_omics, cancer_articles_weighted],
    [articles_in_multi_omics, all_articles_weighted]
])

(2.2795896225172543, 3.1306778185794075e-66)
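Fisher's exact test is conventionally applied to a 2×2 table whose four cells are disjoint (for example cancer and non-cancer counts in each group). The table above places the totals in the second row, in which case the first returned value is approximately the ratio of the two fractions (0.227 / 0.100 ≈ 2.28) rather than an odds ratio. A minimal sketch of the disjoint-cell layout follows; all counts are hypothetical and none come from the dataset:

```python
from scipy.stats import fisher_exact

# Hypothetical counts, for illustration only.
toy_cancer_in_multi_omics = 20
toy_articles_in_multi_omics = 100
toy_cancer_in_background = 100
toy_articles_in_background = 1000

# Second row holds the non-cancer counts (total minus cancer),
# so that the four cells do not overlap.
toy_table = [
    [toy_cancer_in_multi_omics, toy_cancer_in_background],
    [
        toy_articles_in_multi_omics - toy_cancer_in_multi_omics,
        toy_articles_in_background - toy_cancer_in_background,
    ],
]

toy_odds_ratio, toy_p_value = fisher_exact(toy_table)
```

Note that `fisher_exact` works on integer counts, so weighted (non-integer) counts would be converted to integers before the test is computed.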

### Note: this is not as strong without weighting

This is not surprising, given that journals differ in how strongly they focus on specific topics, including cancer. Without weighting, a journal that publishes a lot of cancer research but has published only 3 multi-omics articles would count as much as "Omics" or "Bioinformatics", even though the latter are where the majority of the multi-omics articles get published.
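A toy example may help to show why the weighting matters. The journals, counts and shares below are hypothetical, and `weighted` is an illustrative re-implementation of the same idea as `counts_weighted_by_share` (restricted to the `count` column):

```python
import pandas as pd

# Hypothetical shares of each journal among the multi-omics articles.
toy_share = pd.Series(
    {'Omics-focused journal': 0.9, 'Cancer-focused journal': 0.1},
    name='share'
)
# Hypothetical background counts per journal.
toy_cancer = pd.DataFrame({
    'journal': ['Omics-focused journal', 'Cancer-focused journal'],
    'count': [10, 900],
})
toy_all = pd.DataFrame({
    'journal': ['Omics-focused journal', 'Cancer-focused journal'],
    'count': [100, 1000],
})


def weighted(data, share):
    # Sum counts per journal, then weight each journal by its share
    per_journal = data.groupby('journal')['count'].sum().to_frame().join(share)
    return (per_journal['count'] * per_journal['share']).sum()


# Crude: the cancer-focused journal dominates the background fraction
crude_fraction = toy_cancer['count'].sum() / toy_all['count'].sum()
# Weighted: journals contribute in proportion to their multi-omics share
weighted_fraction = weighted(toy_cancer, toy_share) / weighted(toy_all, toy_share)
```

In this made-up case the crude background fraction is 910 / 1100 and the weighted one is 99 / 190. The two estimates differ because the cancer-focused journal, which contributes few multi-omics articles, dominates the crude count.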

In [18]:
cancer_articles_crude = cancer_articles_from_popular_journals_tiab_only['count'].sum()
all_articles_crude = all_articles_by_journal_and_year['count'].sum()

cancer_articles_crude / all_articles_crude

0.11564909586403536

In [19]:
fisher_exact([
    [cancer_articles_in_multi_omics, cancer_articles_crude],
    [articles_in_multi_omics, all_articles_crude]
])

(1.9665571430228712, 6.762074590844365e-57)

### Diligence check: would it hold if we looked at the full-text articles only?

Yes, but the effect size is lower (a higher p-value is also expected because we look at a subset).

In [20]:
%vault from pubmed_derived_data import cancer_articles_from_popular_journals_any_field

Imported `cancer_articles_from_popular_journals_any_field` (6931F0FF) at Monday, 03. Aug 2020 01:11

In [21]:
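open_access_journal_freq = open_access_subset.journal.sorted_value_counts()
oa_popular_journals = open_access_journal_freq[open_access_journal_freq >= 3]
# NOTE: numerator and denominator are identical, so this is always 1.0;
# the coverage of popular journals was presumably meant to be
# oa_popular_journals.sum() / open_access_journal_freq.sum()
oa_popular_journals.sum() / oa_popular_journals.sum()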

1.0

In [22]:
oa_journal_share_in_multiomics = oa_popular_journals / sum(oa_popular_journals)
oa_journal_share_in_multiomics.name = 'share'
oa_journal_share_in_multiomics.head(2)

index
Scientific reports    0.102310
PloS one              0.056106
Name: share, dtype: float64

In [23]:
oa_cancer_articles_weighted = counts_weighted_by_share(cancer_articles_from_popular_journals_any_field, oa_journal_share_in_multiomics)
oa_all_articles_weighted = counts_weighted_by_share(all_articles_by_journal_and_year, oa_journal_share_in_multiomics)

oa_number_of_articles_mentioning_diseases = count_articles_mentioning_disease(domain_features.loc[open_access_subset.index])
oa_cancer_articles_in_multi_omics = oa_number_of_articles_mentioning_diseases.loc['cancer']
oa_articles_in_multi_omics = len(open_access_subset)

oa_cancer_articles_in_multi_omics / oa_articles_in_multi_omics, oa_cancer_articles_weighted / oa_all_articles_weighted

(0.2565789473684211, 0.1283517981102067)

In [24]:
fisher_exact([
    [oa_cancer_articles_in_multi_omics, oa_cancer_articles_weighted],
    [oa_articles_in_multi_omics, oa_all_articles_weighted]
])

(1.9992917328123092, 2.1290705982870885e-28)

## TCGA enrichment in computational method papers (compared to other types)

In [25]:
fisher_exact(
    [
        [open_access_subset.query('is_method and mentions_tcga').one.sum(), open_access_subset.query('is_method and not mentions_tcga').one.sum()],
        [open_access_subset.query('not is_method and mentions_tcga').one.sum(), open_access_subset.query('not is_method and not mentions_tcga').one.sum()]
    ]
)

(3.827013732322197, 4.452431104649725e-07)
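The four `query(...)` calls above fill a 2×2 contingency table cell by cell. As a sketch under assumptions (the data below are hypothetical and the `toy_*` names are made up), the same layout could also be produced with `pandas.crosstab`, ordering rows and columns as `[True, False]` to mirror the table above:

```python
import pandas as pd
from scipy.stats import fisher_exact

# Hypothetical articles, for illustration only.
toy_articles = pd.DataFrame({
    'is_method':     [True, True, True, False, False, False, False, False],
    'mentions_tcga': [True, True, False, True, False, False, False, False],
})

# Rows: is_method (True, False); columns: mentions_tcga (True, False).
# fill_value=0 keeps a cell present even if a combination never occurs.
toy_table = (
    pd.crosstab(toy_articles['is_method'], toy_articles['mentions_tcga'])
    .reindex(index=[True, False], columns=[True, False], fill_value=0)
)

toy_odds_ratio, toy_p_value = fisher_exact(toy_table.values)
```

With these toy counts the table is `[[2, 1], [1, 4]]`, so the sample odds ratio is (2 × 4) / (1 × 1) = 8.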

In [26]:
open_access_subset.query('not is_method').mentions_tcga.mean()

0.19738651994497936

In [27]:
open_access_subset.query('is_method').mentions_tcga.mean()

0.48484848484848486

### Diligence check: does it hold on the manually verified methods?

(Yes, because all full-text method articles were manually verified: no new methods were predicted from the open-access subset.)

In [28]:
open_access_subset.query('not is_method and (not is_type_predicted)').mentions_tcga.mean()

0.19738651994497936

In [29]:
open_access_subset.query('is_method and (not is_type_predicted)').mentions_tcga.mean()

0.48484848484848486

### Diligence check: does it hold on the larger superset (including articles with no full text)?

In [30]:
all_articles.query('not is_method').mentions_tcga.mean()

0.1094692400482509

In [31]:
all_articles.query('is_method').mentions_tcga.mean()

0.32142857142857145

In [32]:
fisher_exact(
    [
        [all_articles.query('is_method and mentions_tcga').one.sum(), all_articles.query('is_method and not mentions_tcga').one.sum()],
        [all_articles.query('not is_method and mentions_tcga').one.sum(), all_articles.query('not is_method and not mentions_tcga').one.sum()]
    ]
)

(3.8534145280556764, 5.318078390481294e-11)

Yes, and the effect size is even larger and the p-value lower! But we should report the more conservative finding from the open-access subset, because:

- I would not expect computational method papers to announce in the abstract that they use TCGA data; they will keep that as a detail in the methods section
  - thus the open-access subset should provide a more accurate representation
- All the computational method articles in the open-access subset come from manual curation and not from prediction