In [None]:
import numpy as np
import os
import shelve
from tqdm import tqdm

DICTIONARIES STRUCTURE:

1) disucussion_queries = {0: 
                          {'doc_id': 2627996, 'dis': ['..', '..','..']},
                          1: 
                          {'doc_id': 2622196, 'dis': ['..', '..','..']}
                        }
2) paragraph_split = {0: 
                      {'doc_id': 2627996, 'others': ['..','..'], 'res': ['..','..'], 'dis': ['..','..'], 'par': ['..','..', '..', '..']}}


In [None]:
# Discussion queries (3 per paper)
with shelve.open("discussion_queries") as shelf:
    disucussion_queries = shelf["my_dict"]

In [None]:
# Full paper split into paragraphs
with shelve.open("paragraph_split") as shelf:
    paragraph_split = shelf["my_dict"]

In [None]:
def automatic_pairing(disucussion_queries, paragraph_split,  K=3, idxs_auto=range(5)):
    
    '''
        Automatic pairing of discussion paragraphs with result paragraphs.
        Each discussion paragraph is used as a BM25 query against all paragraphs
        of the same article, and the top-K hits from the results section are kept.
        Each article is expected to provide 3 discussion paragraphs.
    
        Arguments:
        
        -> disucussion_queries -- discussion paragraphs used as queries
        -> paragraph_split -- full list of paragraphs (introduction, results, methods, conclusion, ...)
        -> K -- number of top result paragraphs to keep per query
        -> idxs_auto -- dictionary indexes (articles) to pair automatically
    '''
    
    
    automatic_dict = dict()
    
    for idx in idxs_auto:
        automatic_dict[idx] = dict()
        
        full_body = paragraph_split[idx]['par']
        discs = disucussion_queries[idx]['dis']
    
        discs_tok = [preprocessor_tokenizer(s) for s in discs]
        full_tok = [preprocessor_tokenizer(s) for s in full_body]
        ids = bm25_predict(discs_tok, full_tok)
        ids_checked = check_ids(paragraph_split, idx, ids, K)

        print('Visual checking that 1st index corresponds to the query...\n')
        for i,di in enumerate(discs):
            print(idx, '.....', di)
            print('-->', full_body[ids[i][0]], '\n')
        
        # Store every discussion paragraph with its top-K result paragraphs
        automatic_dict[idx]["nr_of_disc"] = len(discs)
        body_arr = np.array(full_body)
        for i, disc in enumerate(discs):
            automatic_dict[idx][f'disc{i}'] = disc
            automatic_dict[idx][f'res{i}'] = body_arr[ids_checked[i]]
        
    return automatic_dict

In [None]:
def check_ids(paragraph_split, idx, ids, K):
    
    ''' 
        For each query, keep the first K ranked ids for which:
        
        1) the paragraph is not a discussion paragraph (the query itself ranks first)
        2) the paragraph belongs to the results section
        
        Remember that the paragraphs in 'par' are listed in this order:
        others - results - discussion
    
    '''
    len_others = len(paragraph_split[idx]['others'])
    len_res = len(paragraph_split[idx]['res'])
    ids_checked = []
    for j in ids:
        ids_checked.append([i for i in j if len_others <= i < len_others + len_res][:K])
    return ids_checked
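
The filtering rule in check_ids can be illustrated on a toy article. This is only a sketch: the paragraph strings and the ranking are made up, and the function is repeated in compact form so that the cell runs on its own. It assumes, as the docstring states, that 'par' is the concatenation others + results + discussion.

```python
def check_ids(paragraph_split, idx, ids, K):
    # Same rule as above: keep ranked ids that fall inside the results block.
    len_others = len(paragraph_split[idx]['others'])
    len_res = len(paragraph_split[idx]['res'])
    lo, hi = len_others, len_others + len_res
    return [[i for i in ranking if lo <= i < hi][:K] for ranking in ids]

# Hypothetical toy article: 2 "others", 3 "res" and 1 "dis" paragraph.
toy_split = {0: {
    'doc_id': 0,
    'others': ['o0', 'o1'],
    'res': ['r0', 'r1', 'r2'],
    'dis': ['d0'],
}}
toy_split[0]['par'] = (toy_split[0]['others']
                       + toy_split[0]['res']
                       + toy_split[0]['dis'])

# One made-up ranking over the 6 paragraphs, best match first.
# Index 5 is the discussion paragraph itself; indexes 2-4 are results.
toy_ranking = [[5, 3, 0, 4, 2, 1]]

top2 = check_ids(toy_split, 0, toy_ranking, K=2)
print(top2)  # [[3, 4]]
```

The query itself (index 5) and the "others" paragraphs (indexes 0 and 1) are dropped, and the remaining result paragraphs keep their ranking order.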

In [None]:
def preprocessor_tokenizer(s):
    # Lowercase, tokenize on word characters, stem each token, re-join with spaces
    from nltk import RegexpTokenizer
    from nltk.stem.snowball import SnowballStemmer
    
    tokenizer = RegexpTokenizer(r"\w+")  # raw string avoids an invalid-escape warning
    stemmer = SnowballStemmer('english')
    
    s_list = tokenizer.tokenize(s.lower())
    s_list = [stemmer.stem(word) for word in s_list]
    return " ".join(s_list)

In [None]:
def bm25_predict(query_list, doc_list):
    from rank_bm25 import BM25Okapi
    
    # build the index for the corpus (documents are space-joined token strings)
    tokenized_corpus = [doc.split() for doc in doc_list]
    bm25 = BM25Okapi( tokenized_corpus )
    
    Knn_ids_record = []
    for query in tqdm(query_list):  
        doc_scores = bm25.get_scores( query.split() )
        # rank all documents by decreasing BM25 score
        Knn_ids = np.argsort( -doc_scores )
        Knn_ids_record.append( Knn_ids )
    return np.asarray( Knn_ids_record )
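
To make the ranking step less of a black box, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the rank_bm25 implementation: the idf below is a common non-negative variant, k1 and b are assumed typical values, and the helper names (bm25_scores, rank_by_score) and the toy corpus are hypothetical. Absolute scores are therefore not expected to match BM25Okapi; only the idea of ranking documents by decreasing score is the same.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query (simplified Okapi BM25)."""
    n_docs = len(tokenized_corpus)
    avgdl = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in tokenized_corpus:
        df.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        tf = Counter(doc)
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

def rank_by_score(scores):
    # Indexes sorted by decreasing score, like np.argsort(-doc_scores).
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Made-up, already stemmed toy paragraphs.
corpus = [
    "gene express in song nucleus",
    "cadherin express chang dure develop",
    "imag method and scanner set",
]
tokenized = [doc.split() for doc in corpus]
scores = bm25_scores("cadherin develop".split(), tokenized)
ranking = rank_by_score(scores)
print(ranking[0])  # 1
```

Only the second paragraph shares terms with the query, so it ranks first and the other two score zero.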

In [None]:
paired = automatic_pairing(disucussion_queries, paragraph_split, idxs_auto=[5,6,7,8,9,15,16,17,18,19])

100%|██████████| 3/3 [00:00<00:00, 232.54it/s]
100%|██████████| 3/3 [00:00<00:00, 86.75it/s]


Visual checking that 1st index corresponds to the query...

5 ..... Some cadherin expression patterns change during devel- opment. For example, cadherin expression was not the same between chick embryos (Redies et al., 2001) and postnatal quail in the present study. Cad7 downregulation and Cad6B upregu- lation are observed in the RA nucleus during the transition from the sensory to sensorimotor learning stage (Matsunaga and Okanoya, 2008a). In this study, to verify the possibility that some gene expression differences were caused by developmental differences among species, we used juvenile birds at two develop-mental stages to perform the comparative gene expression analy- sis. We found cadherin expressions were changed in some areas between two different developmental stages of the same species (light blue region, Table 1). However, in the vocal system, cad- herin expressions differed in many areas among different species, even their expressions were similar between two different deve

100%|██████████| 3/3 [00:00<00:00, 146.72it/s]
100%|██████████| 3/3 [00:00<00:00, 111.05it/s]
  0%|          | 0/3 [00:00<?, ?it/s]

Visual checking that 1st index corresponds to the query...

7 ..... The discovery of bona fide markers based on the comparison of HVC vs. the underlying nidopallial Shelf provides important support for our rationale that HVC constitutes a molecular specialization of the nidopallium that is the product of specific programs of gene regulation. Furthermore, because our micro- array comparison was not to the whole brain, we were able to identify markers that are also expressed in other song nuclei and brain subdivisions in various combinations (Figure 2 and Table S4). Thus, some molecular specializations of HVC potentially reflect properties that are common among subsets of song nuclei. For example, similar to zRalDH, an enrichment in HVC and LMAN may reflect a nidopallial characteristic that is absent from arcopallial and/or striatal nuclei (e.g. the local synthesis of retinoic acid). In contrast, a shared enrichment in HVC and striatal area X might indicate a possible involvement in the 

100%|██████████| 3/3 [00:00<00:00, 129.61it/s]
100%|██████████| 3/3 [00:00<00:00, 127.44it/s]
100%|██████████| 3/3 [00:00<00:00, 149.36it/s]


Visual checking that 1st index corresponds to the query...

9 ..... Our identification of SDs that emerged following evolutionary divergence of Galliformes (e.g. chicken) and Neoaves (e.g. Zebra finch) substantially improves on pre- vious studies [27,36] by refining the location of SD sites, identifying breakpoints on chrs 11–28 and Z, and distin- guishing SDs present in Zebra finch only, thus possibly specific to songbirds, from those present in chicken only, and thus possibly specific to Galliformes. The fact that the majority of novel genes, both those unique to song- birds as well as those present in other avian groups, are located within or immediately adjacent to SDs suggests that chromosomal rearrangement is a major mechanism for the emergence of novel genomic features in passer- ines and other avian groups, as found in other lineages [48,49]. This corroborates previous reports establishing non-allelic homologous recombination following inter- or intra-chromosomal rearrangement 

100%|██████████| 3/3 [00:00<00:00, 206.24it/s]
100%|██████████| 3/3 [00:00<00:00, 154.40it/s]
100%|██████████| 3/3 [00:00<00:00, 156.58it/s]

Visual checking that 1st index corresponds to the query...

17 ..... The tasks employed in the present study activated ventral occipitotemporal, inferior frontal, superior frontal gyrus, and occipitoparietal regions, but not the posterior left infe- rior frontal gyrus (LIFG) (BA44/45), which is commonly activated in semantic memory tasks. The role of the LIFG remains unclear. Some studies claimed that the anterior LIFG (BA47) plays an important role in semantic processing [18], whereas the posterior LIFG (BA44/45) is specialized for phonological processing [19]. The data presented above suggest that the anterior region is associated with semantic processing, irrespective of phonological demands. The LIFG has been previously shown to be involved in generating semantic associations [20, 21], particularly while making decisions concerning semantic associations [22–24]. Another explanation of this modulatory response is that it reflects increased demand for selection between categorical as




In [None]:
# Save the automatically paired paragraphs
with shelve.open("automatic_split") as shelf:
    shelf["my_dict"] = paired
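
As a quick sanity check of the storage format, the sketch below writes a dictionary under the "my_dict" key and reads it back. The toy dictionary, the temporary directory and the file name automatic_split_demo are assumptions made for illustration; the real result paragraphs are NumPy arrays rather than lists, which shelve pickles in the same way.

```python
import os
import shelve
import tempfile

# Hypothetical stand-in for the output of automatic_pairing.
toy_paired = {
    5: {
        "nr_of_disc": 1,
        "disc0": "a discussion paragraph",
        "res0": ["a result paragraph"],
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "automatic_split_demo")

    # Write the whole dictionary under a single string key.
    with shelve.open(path) as shelf:
        shelf["my_dict"] = toy_paired

    # Re-open the shelf and load the dictionary back.
    with shelve.open(path) as shelf:
        restored = shelf["my_dict"]

print(restored[5]["disc0"])  # a discussion paragraph
```

Using the shelf as a context manager closes it automatically, so the data is flushed to disk even if an exception is raised in between.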