 # Test-Construct frequency matrix

 This notebook generates a matrix of test/construct co-occurrence frequencies. Each cell value is the number of articles that mention both the cognitive test and the cognitive construct.
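As a rough illustration of what a cell value means, the sketch below counts co-occurrences for a toy example. The test names, construct names, and PMIDs are made up for illustration, and `cooccurrence_counts` is a hypothetical helper rather than a function from this project.

```python
# Hypothetical corpora: each name maps to the set of PMIDs that mention it.
# All names and PMIDs here are invented for illustration only.
test_corpora = {
    'Stroop': {101, 102, 103, 104},
    'N-back': {103, 105},
}
construct_corpora = {
    'Inhibition': {101, 102, 200},
    'Working memory': {103, 105, 106},
}


def cooccurrence_counts(tests, constructs):
    """Return {(test, construct): number of PMIDs present in both corpora}."""
    return {
        (test, construct): len(test_pmids & construct_pmids)
        for test, test_pmids in tests.items()
        for construct, construct_pmids in constructs.items()
    }


counts = cooccurrence_counts(test_corpora, construct_corpora)
for (test, construct), n_shared in counts.items():
    print(f'{test} x {construct}: {n_shared} shared article(s)')
```

Each count is simply the size of a set intersection, so a pair with no article in common gets a value of 0 rather than being dropped.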


## Codebook

The matrix is stored in long format (one row per test/construct pair) at `data/pubmed/test_construct_count_matrix.csv`. This CSV file contains the following columns:

 - `test`: Name of the cognitive test
 - `construct`: Name of the cognitive construct
 - `test_corpus_size`: Number of articles in the cognitive test corpus
 - `construct_corpus_size`: Number of articles in the cognitive construct corpus
 - `shared_corpus_size`: Number of articles that occur in both the test and construct corpora.

**Note**: Values in the matrix are neither normalized nor scaled.
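Because the file holds one row per test/construct pair, it may need to be reshaped before it looks like a conventional matrix. The following is a minimal sketch of that reshaping with `DataFrame.pivot`, assuming the columns listed above. The rows in `long_df` are invented so the example runs without the real file.

```python
import pandas as pd

# Hypothetical rows following the codebook's columns (values are made up).
long_df = pd.DataFrame(
    [
        ['Stroop', 'Inhibition', 4, 3, 2],
        ['Stroop', 'Working memory', 4, 3, 1],
        ['N-back', 'Inhibition', 2, 3, 0],
        ['N-back', 'Working memory', 2, 3, 2],
    ],
    columns=['test', 'construct', 'test_corpus_size',
             'construct_corpus_size', 'shared_corpus_size'],
)

# Rows become tests, columns become constructs, and cells hold the raw
# (unnormalized) number of shared articles.
matrix = long_df.pivot(index='test', columns='construct',
                       values='shared_corpus_size')
print(matrix)
```

With the real data, `long_df` would presumably come from `pd.read_csv` on the CSV file described above. The corpus-size columns are dropped by the pivot, so keep the long table around if a normalization is planned later.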

In [None]:
import pandas as pd
from pathlib import Path
from tqdm import tqdm


OUTPUT_FILE = Path('data/pubmed/test_construct_count_matrix.csv')

TEST_FILES = list(Path('data/pubmed/tests/').glob('*.csv'))
CONSTRUCT_FILES = list(Path('data/pubmed/constructs/').glob('*.csv'))

# Load the unique PMIDs of each corpus once, keyed by file stem,
# so that no CSV file is re-read inside the nested loop below.
test_pmids = {f.stem: set(pd.read_csv(f)['pmid'].unique()) for f in tqdm(TEST_FILES)}
construct_pmids = {f.stem: set(pd.read_csv(f)['pmid'].unique()) for f in CONSTRUCT_FILES}

freqs = []

# Count the articles shared by every test/construct pair.
for test, t_pmids in test_pmids.items():
    for construct, c_pmids in construct_pmids.items():
        freqs.append([
            test,
            construct,
            len(t_pmids),
            len(c_pmids),
            len(t_pmids & c_pmids),
        ])

freqs_df = pd.DataFrame(
    freqs,
    columns=['test', 'construct', 'test_corpus_size', 'construct_corpus_size', 'shared_corpus_size'])

freqs_df.sort_values('shared_corpus_size', ascending=False).to_csv(OUTPUT_FILE, index=False)