# Article Summarisation
***

## Background

The goal is to determine the degree of consensus in contentious academic fields.

Collect the title, publication date and summary of scholarly articles containing one or more keywords. Apply NLP models to this data to identify and categorise concepts in the field and to test for statistically significant differences between opposing 'truths', if any. Ranking these groups by weighted influence should indicate the degree of consensus among the various approaches in a given academic field.

To this end, the academic_consensus model has already searched the abstracts of academic papers for the keyword "nutrition" and saved the results in corpus_raw.csv.
  
Overview of this notebook:
- Set up notebook environment and load data (corpus_raw.csv)
- Review articles published per year (sklearn's CountVectorizer)
- Create Bag Of Words (BOW) of all articles for Titles and Conclusions (nltk)
- Create interactive BOW per year with Bokeh

### Credit

- Alfrick Opidi - https://blog.floydhub.com/gentle-introduction-to-text-summarization-in-machine-learning/
- Jesse JCharis - https://jcharistech.wordpress.com/2018/12/31/text-summarization-using-spacy-and-python/
  
Thank you!

## Setup

### Packages and setup

In [1]:
# Common
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Workspace
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display, HTML

# NLP
# Note: gensim.summarization was removed in gensim 4.0, so this import needs gensim 3.x
from gensim.summarization import summarize

In [2]:
# Set workspace
sns.set()
# Set pandas display width to 110 characters
pd.options.display.width = 110
# Display only the last statement of each cell (use 'all' to display every expression)
InteractiveShell.ast_node_interactivity = 'last'
# Make notebook wider to fit pyLDAvis plot
display(HTML("<style>.container { width:60% !important; }</style>"))

### Load and inspect corpus_raw.csv

In [3]:
# Load corpus_raw.csv as a DataFrame, parsing the first column as dates
corpus = pd.read_csv('../data/interim/corpus_raw.csv', parse_dates=[0])

In [4]:
# Keyword 
keywords = ['nutrition', 'diet']

In [5]:
# Inspect
corpus.info()
corpus.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1530 entries, 0 to 1529
Data columns (total 3 columns):
publication_date    1530 non-null datetime64[ns, UTC]
title               1530 non-null object
conclusions         1530 non-null object
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 36.0+ KB


| | publication_date | title | conclusions |
|---|---|---|---|
| 0 | 2016-03-09 00:00:00+00:00 | Pregnancy Requires Major Changes in the Qualit... | ['Pregnancy tends to markedly widen the nutrit... |
| 1 | 2016-08-23 00:00:00+00:00 | Continental-Scale Patterns Reveal Potential fo... | ['In all, given the geographic patterns in die... |
| 2 | 2015-06-17 00:00:00+00:00 | Assessing Nutritional Parameters of Brown Bear... | ['Previous studies have illustrated the differ... |
| 3 | 2015-04-17 00:00:00+00:00 | The Self-Reported Clinical Practice Behaviors ... | ['The present study provides a valuable insigh... |
| 4 | 2017-10-09 00:00:00+00:00 | The impact of nutritional supplement intake on... | ['Our study shows that the propensity to consu... |


## Summarisation

### Functions
- Convert DataFrame feature to list of documents (**clean_df_column**)
- Compile new list of documents containing all search words (**search_docs_in_corpus**)
- Convert documents to one list of all sentences, if needed (**all_docs_to_sents**)
- Summarise each document in searched corpus (**summarise_docs**)
- Summarise all sentences together in searched corpus (**summarise_sents**)
- Bring it all together in one function (**summarise_docs_or_sents**)

In [13]:
def clean_df_column(df, col):
    '''
    Return the documents in df[col] as a list of cleaned strings.
    Strips quote, bracket and escaped-newline artefacts, then re-joins
    the sentences so that every full stop is followed by a space.
    '''
    docs = []
    for doc in df[col]:
        # Remove single quotes (including apostrophes), brackets and
        # literal "\n" sequences left over from the stringified lists
        text = doc.replace('\'', '')
        text = text.replace('[', '')
        text = text.replace(']', '')
        text = text.replace('\\n', '')
        docs.append(text)

    # Some sentences have no space after the full stop, so split on "."
    # and re-join with ". " to separate them consistently
    return ['. '.join(doc.split('.')) for doc in docs]
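
The `conclusions` values in the preview look like stringified Python lists (`['Pregnancy tends to ...`). Assuming that holds for the complete rows, which the truncated preview cannot confirm, `ast.literal_eval` from the standard library could parse them directly instead of stripping brackets and quotes, which would also keep apostrophes inside words. This is only a sketch: `parse_conclusions` is a hypothetical helper and the sample string is made up.

```python
import ast

def parse_conclusions(raw):
    """Hypothetical helper: turn a stringified list of paragraphs into one string.

    Falls back to the raw text when it is not a valid Python literal.
    """
    try:
        parts = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(parts, (list, tuple)):
        # Strip stray newlines/whitespace and join paragraphs with one space
        return ' '.join(str(part).strip() for part in parts)
    return str(parts)

# Made-up sample in the same shape as the preview rows
sample = "['Protein intake didn\\'t change.\\n', 'Muscle mass increased.']"
cleaned = parse_conclusions(sample)
print(cleaned)
```

Unlike the `replace` chain, this leaves square brackets and apostrophes that belong to the text untouched, at the cost of depending on the column really containing valid list literals.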

In [7]:
def search_docs_in_corpus(docs, search_words):
    '''
    Return the documents that contain all of the search words.
    docs         -> list of documents (corpus)
    search_words -> list of search words (str)
    Matching is by case-sensitive substring.
    '''
    docs_hits = []
    for doc in docs:
        # Keep the document only if every search word occurs in it
        if all(search_word in doc for search_word in search_words):
            docs_hits.append(doc)
    return docs_hits
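
Because `search_docs_in_corpus` tests `search_word in doc`, it matches case-sensitive substrings: 'protein' also hits 'lipoprotein' and misses 'Protein'. If whole-word, case-insensitive matching is preferred, a regex-based variant is one option. The function name `search_docs_whole_words` is hypothetical and not part of the original notebook.

```python
import re

def search_docs_whole_words(docs, search_words):
    """Hypothetical variant: keep docs containing every search word as a whole word."""
    # \b anchors each word at word boundaries; re.escape guards special characters
    patterns = [re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
                for word in search_words]
    return [doc for doc in docs if all(p.search(doc) for p in patterns)]

# Made-up documents for illustration
docs = ['Protein supports muscle repair.',
        'Lipoprotein levels and muscle mass.',
        'Exercise alone.']
hits = search_docs_whole_words(docs, ['protein', 'muscle'])
print(hits)
```

Whole-word matching is stricter than the original: plurals such as 'proteins' no longer match, so which variant fits depends on the search terms.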

In [8]:
def all_docs_to_sents(docs):
    '''
    Break up each document into sentences and combine all
    the sentences into one list.
    '''
    # Split each document into sentences
    docs_split = [doc.split('.') for doc in docs]

    # Collect all sentences into one flat list
    sents = [sent for doc in docs_split for sent in doc]

    # Remove double spaces
    sents = [sent.replace('  ', '') for sent in sents]

    # Drop empty or single-character fragments
    docs_sents = [sent for sent in sents if len(sent) > 1]

    return docs_sents
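
Splitting on every `.` also cuts decimals such as 0.05 into separate fragments. A slightly more careful heuristic, sketched below, splits only where sentence-ending punctuation is followed by whitespace or directly by a capital letter (the missing-space case mentioned in `clean_df_column`). It is still a heuristic and will mis-split abbreviations such as "U.S."; `split_sentences` is a hypothetical name, and splitting on empty matches needs Python 3.7 or later.

```python
import re

# Split after . ! or ? when followed by whitespace, or directly by a capital letter
_SENT_BOUNDARY = re.compile(r'(?<=[.!?])\s+|(?<=[.!?])(?=[A-Z])')

def split_sentences(text):
    """Hypothetical helper: split text into sentences without breaking decimals."""
    parts = _SENT_BOUNDARY.split(text)
    # Drop empty fragments and surrounding whitespace
    return [part.strip() for part in parts if part.strip()]

# Made-up text with a decimal and a missing space between sentences
text = 'Intake fell by 0.05 g.Muscle mass rose. Done.'
sentences = split_sentences(text)
print(sentences)
```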

In [9]:
def summarise_docs(docs):
    '''
    Summarise each article in roughly 50 words or fewer.
    docs is a list of articles.
    Returns one list of summary sentences per article.
    '''
    summarised = []
    for doc in docs:
        # Only summarise if there is more than one sentence
        # (.split('.') on a single sentence returns 2 items)
        if len(doc.split('.')) > 2:
            summarised.append(summarize(doc, word_count=50, split=True))
        else:
            # Wrap in a list so every entry has the same shape
            summarised.append([doc])
    return summarised

In [10]:
def summarise_sents(docs, ratio=0.05):
    '''
    Summarise the sentences of all documents together.
    ratio is the proportion of sentences to keep in the summary.
    '''
    # Convert docs to one list of sentences
    docs_sents = all_docs_to_sents(docs)
    # Summarise the combined text
    summarised = summarize('. '.join(docs_sents), ratio=ratio, split=True)

    return summarised
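
`gensim.summarization` was removed in gensim 4.0, so `summarize` can only be imported with gensim 3.x. Where pinning that version is not an option, a small extractive scorer can stand in. The sketch below ranks sentences by the summed frequencies of their words and keeps the top fraction. This is a simpler technique than the TextRank-based `summarize` and will not reproduce its output; `freq_summarise` and the tiny stop-word list are assumptions for illustration only.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; use a fuller list in practice)
STOP_WORDS = {'the', 'a', 'an', 'and', 'of', 'in', 'to', 'is', 'are', 'for', 'on', 'with'}

def freq_summarise(sentences, ratio=0.5):
    """Hypothetical stand-in: keep the highest-scoring fraction of sentences.

    A sentence's score is the sum of the corpus-wide frequencies of its
    non-stop-words. Selected sentences are returned in their original order.
    """
    tokenised = [[w for w in re.findall(r'[a-z]+', s.lower()) if w not in STOP_WORDS]
                 for s in sentences]
    freqs = Counter(word for words in tokenised for word in words)
    scores = [sum(freqs[w] for w in words) for words in tokenised]
    # Keep at least one sentence when there is any input
    keep = max(1, int(len(sentences) * ratio)) if sentences else 0
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]

# Made-up sentences for illustration
sents = ['Protein intake supports muscle recovery',
         'Muscle recovery improves with protein intake and exercise',
         'The weather was mild']
summary = freq_summarise(sents, ratio=0.34)
print(summary)
```

Summed frequencies favour long sentences; dividing each score by the sentence length is a common adjustment if that bias matters.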

Bringing it all together.

In [14]:
def summarise_docs_or_sents(df, col, search_words, all_sents=False, ratio=0.05):
    '''
    Summarise the documents in df[col] that contain all search words.
    df  -> The corpus dataframe
    col -> The column to use
    search_words -> list of words to search for in the corpus
    all_sents -> If True, summarise all sentences of the matching
                    documents together
                 If False, summarise each document independently
    ratio -> Passed to summarise_sents (only used if all_sents is True)
    Returns the summary, the total number of documents, the number of
    matching documents and the search words.
    '''
    # Get data from dataframe and store in docs
    docs = clean_df_column(df, col)
    # Find documents containing the search words
    docs_hits = search_docs_in_corpus(docs, search_words)

    if all_sents:
        summarised = summarise_sents(docs_hits, ratio)
        print('Summary for SENTENCES:' + '\n')
    else:
        summarised = summarise_docs(docs_hits)
        print('Summary for DOCUMENTS:' + '\n')

    # Total number of articles
    no_docs = len(docs)
    # Number of articles containing the search terms
    no_docs_hits = len(docs_hits)

    return summarised, no_docs, no_docs_hits, search_words

### Application

In [12]:
# summarise_docs_or_sents
summarised, no_docs, no_docs_hits, search_words = summarise_docs_or_sents(
    corpus,
    'conclusions',
    search_words=['protein', 'muscle', 'exercise'],
    all_sents=False)
# Show
print('Number of articles in corpus: {}'.format(no_docs))
print('Number of articles with {}: {}'.format(search_words, no_docs_hits))
print('\n' + 'Sentences: ' + '\n')
# Show the first summary sentence of each document (skip empty summaries)
for sent in summarised:
    if sent:
        print(sent[0] + '\n')

Summary for each DOCUMENTS:

Number of articles in corpus: 1530
Number of articles with ['protein', 'muscle', 'exercise']: 4

Sentences: 

Fortunately, by following the recommended diet and exercise regimes, 80% of PSSM2 WB were reported to show overall improvement as well as a significant decrease in signs of rhabdomyolysis, muscle atrophy, change in behavior and decline in performance.

Our findings suggest that decrease in urinary creatinine excretion rate may appear early in CKD patients, independent of decreased protein intake assessed by urinary urea excretion, as well as of other determinants of muscle mass loss including gender, an older age, non-African origin, diabetes, lower BMI and 24-h proteinuria levels.

Perhaps, by knowing the specific amino acids levels of an athlete or patient, one can use this model to predict an individualized supplement regimen appropriate for which muscle tissue recovery is desired.

Mutations in components of the mitochondrial pathway for fatty a