# LISC

LIterature SCanner (LISC) is a Python module for scraping literature data. It is essentially a wrapper around the PubMed [E-Utilities](https://www.ncbi.nlm.nih.gov/books/NBK25501/).

LISC provides two types of scraping: 'Counts' and 'Words'.

### Counts

'Counts' scrapes for the co-occurrence of given set(s) of terms.

### Words

'Words' scrapes abstract text data, and paper meta-data, for all papers found for a given set of terms.

### Functions vs. Objects

Each of these types of scrapes can be run in one of two ways: by calling the scrape functions provided by LISC (function approach), or by using the objects provided by LISC (OOP approach).

Note that, under the hood, these approaches are the same: the OOP approach simply provides wrappers around the scraping functions.

In [1]:
# Add lisc to path - assumes lisc is available one step up in the directory
#  This should work if cloned from Github
import os
import sys
sys.path.append(os.path.split(os.getcwd())[0])

In [2]:
# Set up some test data
#  Note that each entry is itself a list
terms_a = [['brain'], ['cognition']]
terms_b = [['body'], ['biology'], ['disease']]

## Counts

'Counts' scraping collects data about the co-occurrence of terms of interest.

Specifically, it searches titles and abstracts, and checks how often two terms of interest appear together in the literature.
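
To make the idea concrete, here is a minimal sketch of the kind of E-Utilities request that such a co-occurrence search boils down to. It only builds the URL and does not send it. The helper `build_count_url` is hypothetical and not part of LISC, and whether LISC builds its queries exactly this way is an assumption. The `esearch` endpoint, the `rettype=count` option, and the `[tiab]` (title/abstract) field tag are standard E-Utilities and PubMed features.

```python
from urllib.parse import urlencode

# Base URL of the E-Utilities search endpoint
ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def build_count_url(term_a, term_b, db='pubmed'):
    """Build an esearch URL that asks only for the number of matching papers.

    Hypothetical helper for illustration; not part of LISC.
    """
    # [tiab] restricts each term to the title and abstract fields
    query = '"{}"[tiab] AND "{}"[tiab]'.format(term_a, term_b)
    params = {'db': db, 'term': query, 'rettype': 'count'}
    # urlencode escapes the quotes, brackets, and spaces in the query
    return ESEARCH_URL + '?' + urlencode(params)


url = build_count_url('brain', 'cognition')
print(url)
```

The count returned for such a query is the number of papers in which both terms appear, which is the quantity the 'Counts' scrape collects for each pair of terms.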

In [3]:
# Import LISC - Count
from lisc.count import Count
from lisc.scrape import scrape_counts

### scrape_counts function

In [4]:
# Run a scrape of 'counts' (co-occurrence data) - across a single list of terms
dat_numbers, dat_percent, term_counts, _, meta_dat = scrape_counts(terms_a, db='pubmed', verbose=True)

Running counts for:  brain
Running counts for:  cognition


In [5]:
# Check out how many papers were found for each combination
print(dat_numbers)

[[    0 15583]
 [15583     0]]


In [6]:
# Check out the percent of paper overlap
print(dat_percent)

[[ 0.          0.01822714]
 [ 0.27393383  0.        ]]
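
The values in `dat_percent` are fractions rather than percentages. Judging from the numbers printed in this notebook, each row appears to be the co-occurrence count divided by the total number of papers found for that row's term. That normalization is inferred from the outputs, not confirmed from the LISC source, so treat the sketch below as an assumption. It uses plain Python with values copied from the outputs shown here.

```python
# Co-occurrence counts and per-term totals, copied from the outputs in this notebook
dat_numbers = [[0, 15583],
               [15583, 0]]
term_counts = [854934, 56886]

# Assumed normalization: divide each row by the total for that row's term
fractions = [[n_shared / total for n_shared in row]
             for row, total in zip(dat_numbers, term_counts)]

for row in fractions:
    print(['{:.8f}'.format(value) for value in row])
```

Read this way, about 1.8% of 'brain' papers also mention 'cognition', while about 27.4% of 'cognition' papers also mention 'brain'.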


In [7]:
# Print out how many papers were found for each term
for term, count in zip(terms_a, term_counts):
    print('{:12} : {}'.format(term[0], count))

brain        : 854934
cognition    : 56886


When given a single set of terms, 'Counts' scrapes each term against every other term in that set.

You can also specify two different sets of terms to scrape, as below, whereby each term in list A is scraped for co-occurrence with each term in list B (but not with other terms in list A).
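
The difference between the two modes can be sketched by listing which term pairs get compared in each case. This is an illustration using the standard library only; it is not how LISC is implemented internally, and it uses the first entry of each term list as its label.

```python
from itertools import combinations, product

terms_a = [['brain'], ['cognition']]
terms_b = [['body'], ['biology'], ['disease']]

# One list: every unordered pair of distinct terms within the list
pairs_single = [(a[0], b[0]) for a, b in combinations(terms_a, 2)]

# Two lists: every term in list A paired with every term in list B
pairs_double = [(a[0], b[0]) for a, b in product(terms_a, terms_b)]

print(pairs_single)
print(pairs_double)
```

With two terms in list A and three in list B, the two-list mode compares six pairs, and 'brain' is never compared with 'cognition'.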

In [8]:
# Run a scrape of 'counts' (co-occurrence data) across two different lists of terms
dat_numbers, dat_percent, term_counts_a, term_counts_b, meta_dat = scrape_counts(
    terms_lst_a=terms_a, terms_lst_b=terms_b, db='pubmed', verbose=True)

Running counts for:  brain
Running counts for:  cognition


### Count Object

There is also an OOP interface available in LISC, to organize the terms and data, and run scrapes. 

Note that the underlying code is the same - the count object ultimately calls the same scrape function as above. 

In [9]:
# Initialize counts object
counts = Count()

# Set terms to run
counts.set_terms(terms_a)

In [10]:
# Run scrape
counts.run_scrape(verbose=True)

Running counts for:  brain
Running counts for:  cognition


The Counts object also comes with some helper methods to check out the data.

In [11]:
# Check the highest associations for each term
counts.check_cooc()

For the  brain        the most common association is 	 cognition          with 	 %01.82
For the  cognition    the most common association is 	 brain              with 	 %27.39


In [12]:
# Check how many papers were found for each search term
counts.check_counts()

brain        -   854934
cognition    -    56886


In [13]:
# Check the top term
counts.check_top()

The most studied term is  brain         with   854934 papers


#### Co-occurrence data - different word lists

In [14]:
# Initialize count object
counts_two = Count()

# Set terms lists
#  Different terms lists are indexed by the 'A' and 'B' labels
counts_two.set_terms(terms_a, 'A')
counts_two.set_terms(terms_b, 'B')

In [15]:
# Scrape co-occurrence data
counts_two.run_scrape()

In [16]:
# From there you can use all the same methods to explore the data
#  You can also specify which list to check
counts_two.check_cooc('A')
print('\n')
counts_two.check_cooc('B')

For the  brain        the most common association is 	 disease            with 	 %14.81
For the  cognition    the most common association is 	 disease            with 	 %20.65


For the  body         the most common association is 	 cognition          with 	 %05.26
For the  biology      the most common association is 	 cognition          with 	 %00.82
For the  disease      the most common association is 	 cognition          with 	 %20.65


###  Synonyms & Exclusion Words

There is also support for adding synonyms and exclusion words. 

Synonyms are combined with the 'OR' operator, meaning results will be returned if they include any of the given terms. 

Exclusion words are combined with the 'NOT' operator, meaning entries will be excluded if they include these terms. 

For example, using search terms ['gene', 'genetic'] with exclusion words ['protein'] creates the search:
- ("gene"OR"genetic"NOT"protein")
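
As a rough sketch of how such a search string could be assembled, the hypothetical helper below joins synonyms with OR and appends exclusions with NOT. `build_search_term` is not part of LISC. For a single exclusion word it reproduces the string shown above; how LISC combines several exclusion words is not shown in this notebook, so giving each one its own NOT is an assumption.

```python
def build_search_term(synonyms, exclusions=()):
    """Combine synonyms and exclusion words into one search string.

    Hypothetical helper for illustration; not part of LISC.
    """
    # Synonyms are quoted and joined with OR
    query = 'OR'.join('"{}"'.format(word) for word in synonyms)
    # Assumption: each exclusion word is appended with its own NOT
    for word in exclusions:
        query += 'NOT"{}"'.format(word)
    return '(' + query + ')'


print(build_search_term(['gene', 'genetic'], ['protein']))
print(build_search_term(['cortex', 'cortical']))
```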

In [17]:
# Initialize Count object
counts = Count()

In [18]:
# Set up terms with synonyms
#  Being able to include synonyms is the reason each term entry is itself a list
terms_lst = [['gene', 'genetic'], ['cortex', 'cortical']]

# Set up exclusions
#  You can also include synonyms for exclusions - which is why each entry is also a list
excl_lst = [['protein'], ['subcortical']]

# Set the terms & exclusions
counts.set_terms(terms_lst, 'A')
counts.set_exclusions(excl_lst, 'A')

In [19]:
# You can check which terms are loaded
counts.terms['A'].check_terms()

List of terms used: 

gene, genetic
cortex, cortical


In [20]:
# Check exclusion words
counts.terms['A'].check_exclusions()

List of exclusion words used: 

gene	 : protein
cortex	 : subcortical


In [21]:
# LISC objects use the first item of each term list as the label for that term
counts.terms['A'].labels

['gene', 'cortex']
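
The labelling rule is simple enough to reproduce in plain Python. This is a sketch of the rule as described here, not LISC's own code; the `label_to_synonyms` mapping is added only for illustration.

```python
terms_lst = [['gene', 'genetic'], ['cortex', 'cortical']]

# The label for each term is the first entry of its synonym list
labels = [term[0] for term in terms_lst]

# Illustrative lookup from each label back to its full synonym list
label_to_synonyms = dict(zip(labels, terms_lst))

print(labels)
print(label_to_synonyms['cortex'])
```

This is why the order of synonyms matters: the first entry is the name used when indexing results and printing summaries.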

Note that searching across different terms lists, and using synonyms and exclusions can all also be done directly using the scrape_counts function. 

## Words

Another way to scrape the data is to collect abstract text and paper meta-data for all papers found for a given set of terms.

In [22]:
# Import LISC - Words
from lisc.words import Words
from lisc.scrape import scrape_words

### scrape_words function

In [23]:
# Scrape words data
#  Set the scrape to return data for at most 5 papers per term
dat, meta_dat = scrape_words(terms_a, retmax='5', use_hist=False, save_n_clear=False, verbose=True)

Scraping words for:  brain
Scraping words for:  cognition


In [24]:
# The function returns a list of LISC Data objects
dat

[<lisc.data.Data at 0x118f2aef0>, <lisc.data.Data at 0x118f97400>]

In [25]:
# Each data object holds the data for the scraped papers
d1 = dat[0]

# Print out some of the data
print(d1.n_articles, '\n')
print('\n'.join(d1.titles), '\n')

5 

Dehydroepiandrosterone impacts working memory by shaping cortico-hippocampal structural covariance during development.
Neurobiology of Criterion A: self and interpersonal personality functioning.
The Analgesic Acetaminophen and the Antipsychotic Clozapine can each Redox-Cycle with Melanin.
Dab1 contributes differently to the morphogenesis of the hippocampal subdivisions.
Single-Fraction Radiosurgery Using Conservative Doses for Brain Metastases: Durable Responses in Select Primaries With Limited Toxicity. 



### Words Object

In [26]:
# Initialize Words object
words = Words()

# Set terms to search
words.set_terms(terms_a)

In [27]:
# Run words scrape
words.run_scrape(retmax='5', save_n_clear=False)

In [28]:
# Words also saves the same list of Data objects
words.results

[<lisc.data.Data at 0x118f90f60>, <lisc.data.Data at 0x1194229e8>]

The use of synonyms and exclusion words, demonstrated above for counts, applies in the same way to words scraping.

### Exploring Words Data

The Words object also has a couple of convenience methods for exploring the data.

In [29]:
# Indexing with labels
print(words['brain'])

<lisc.data.Data object at 0x118f90f60>


In [30]:
# Iterating through papers found from a particular search term
#  The iteration returns a dictionary with all the paper data
for art in words['cognition']:
    print(art['title'])

Long-term effects of tDCS on fatigue, mood and cognition in multiple sclerosis.
Dehydroepiandrosterone impacts working memory by shaping cortico-hippocampal structural covariance during development.
Late, but not early, arriving younger siblings foster firstborns' understanding of second-order false belief.
Decline of prefrontal cortical-mediated executive functions but attenuated delay discounting in aged Fischer 344 × brown Norway hybrid rats.
Does phytoestrogen supplementation improve cognition in humans? A systematic review.
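
The indexing and iteration behaviour shown in these cells can be sketched with two small classes. These are hypothetical stand-ins, not LISC's `Words` and `Data` classes, and the titles are placeholders rather than scraped results. The sketch only shows how label-based indexing (`__getitem__`) and per-paper iteration (`__iter__`) could be provided.

```python
class PaperData:
    """Hypothetical stand-in for the results of one search term."""

    def __init__(self, label, titles):
        self.label = label
        self.titles = list(titles)

    @property
    def n_articles(self):
        return len(self.titles)

    def __iter__(self):
        # Yield one dictionary per paper
        for title in self.titles:
            yield {'title': title}


class PaperCollection:
    """Hypothetical stand-in for an object holding results for several terms."""

    def __init__(self, results):
        self.results = list(results)

    def __getitem__(self, label):
        # Look up the results object whose label matches
        for result in self.results:
            if result.label == label:
                return result
        raise KeyError(label)


# Placeholder titles, not real papers
collection = PaperCollection([
    PaperData('brain', ['Placeholder title 1', 'Placeholder title 2']),
    PaperData('cognition', ['Placeholder title 3']),
])

for art in collection['cognition']:
    print(art['title'])
```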


### Metadata

Regardless of what you are scraping, or how you run it through LISC, some meta-data is saved.

This meta-data is collected in a dictionary that is returned by the scrape functions (and saved to the objects, if applicable).

In [31]:
# The meta data includes some information on the database that was scraped
meta_dat['db_info']

{'count': '27630427',
 'dbbuild': 'Build170925-2207m.9',
 'dbname': 'pubmed',
 'description': 'PubMed bibliographic record',
 'lastupdate': '2017/09/27 16:06',
 'menuname': 'PubMed'}

In [32]:
# It also includes the Requester object, used to launch URL requests, which also has some details about the scrape
print('Start time:    ', meta_dat['req'].st_time)
print('End time:      ', meta_dat['req'].en_time)
print('# of requests: ', meta_dat['req'].n_requests)

Start time:     15:42 Wednesday 27 September
End time:       15:42 Wednesday 27 September
# of requests:  5
