# Overview
---
This notebook contains the code for the extractive summarization methods outlined in the accompanying article. Each method has its own section, and the imports for all of them are grouped below. Every summary function takes a string containing the text of the entire article and returns a string containing the sentences that make up the summary.

In [1]:
# Text Cleaning
import string
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
import re

# Cosine Similarity
from nltk.cluster.util import cosine_distance
import networkx as nx

# Similarity Matrix (used by both the cosine and Jaccard methods)
import numpy as np

# SpaCy
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

# LSA
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser

# LexRank
from sumy.summarizers.lex_rank import LexRankSummarizer
# from sumy.nlp.tokenizers import Tokenizer
# from sumy.parsers.plaintext import PlaintextParser

# Gensim
# The summarization module was removed in gensim 4.0; uncomment only with gensim < 4.0
#from gensim.summarization.summarizer import summarize

### Preprocessing Helper Functions
---
While some of the methods rely on libraries that preprocess the text automatically, others still require this step to be done manually. The helper functions I used to clean my test data are included below, though they would need to be adapted to fit the data that you're working with.

In [None]:
#Instantiating lists of punctuation and stopwords to use later
punct = list(string.punctuation)
sw = stopwords.words('english')

#Function to replace part of speech tags
def pos_replace(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

#Returns text as string after preprocessing
def bare_text(text):
    text = text.replace('\n', ' ')
    text = text.lower()
    #Adds spaces where they are missing after punctuation
    text = re.sub(r'(?<=[.,\?!])(?=[^\s])', r' ', text)
    #Tokenize and lemmatize text
    text_token = word_tokenize(text)
    text_token = [w for w in text_token if w.lower() not in sw]
    text_token = pos_tag(text_token)
    text_token = [(w[0], pos_replace(w[1])) for w in text_token]
    lemmatizer = WordNetLemmatizer() 
    text_token = [lemmatizer.lemmatize(word[0], word[1]) for word in text_token]
    #Get rid of punctuation
    text_token = [w for w in text_token if w not in punct]
    text_token = [w for w in text_token if w not in ["’", "-", "‘"]]
    #Reconcatenate tokens
    text = TreebankWordDetokenizer().detokenize(text_token)
    return text

#Returns list of lowercased sentences with links removed
def clean_sentences(text):
    #Replace newlines with spaces so words on adjacent lines do not merge
    text = text.replace('\n', ' ')
    #Get rid of links
    text = re.sub(r'www\.[a-z]?\.?(com)+|[a-z]+\.(com)', '', text)
    #Add space after punctuation if it's not there
    text = re.sub(r'(?<=[.,\?!:])(?=[^\s])', r' ', text)
    text = text.lower()
    #Punctuation is kept because sent_tokenize needs it to find sentence boundaries
    sent = sent_tokenize(text)
    return sent
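
The regular expression that adds missing spaces after punctuation is the least obvious step in the helpers above, so here is a minimal, standalone sketch of what it does. The name `add_missing_spaces` and the sample string are illustrative assumptions and are not part of the notebook.

```python
import re

# Same lookbehind/lookahead pattern used by clean_sentences: it matches the
# empty position after . , ? ! or : when the next character is not whitespace.
MISSING_SPACE = re.compile(r'(?<=[.,\?!:])(?=[^\s])')

def add_missing_spaces(text):
    """Insert a space after punctuation that is directly followed by text."""
    return MISSING_SPACE.sub(' ', text)

sample = "First sentence.Second sentence,with a clause!Done."
spaced = add_missing_spaces(sample)
print(spaced)  # First sentence. Second sentence, with a clause! Done.
```

One side effect worth knowing about: the pattern cannot tell a sentence boundary from a decimal point, so `add_missing_spaces("pi is 3.14")` returns `"pi is 3. 14"`. For text with many numbers, the pattern would likely need adjusting.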

### Cosine Similarity
---
The first summary method is based on cosine similarity. The first two functions are helpers: `sent_sim` computes the similarity between a pair of sentences, and `sim_matrix` builds a matrix of those similarities for every pair of sentences. The final function, `generate_summary_cosine`, generates the summary by running PageRank over the similarity matrix, where `top_n` is the number of sentences to include in the summary.

In [None]:
#Returns similarity between two sentences
def sent_sim(sent1, sent2, stopwords = None, method = 'cosine'):
    if stopwords is None:
        stopwords = []

    #Splits sentence strings into words; lists of tokens are used as they are
    if isinstance(sent1, str):
        sent1 = word_tokenize(sent1)
    if isinstance(sent2, str):
        sent2 = word_tokenize(sent2)

    #Filters out stopwords and punctuation tokens
    words1 = [w for w in sent1 if w not in stopwords and w not in punct]
    words2 = [w for w in sent2 if w not in stopwords and w not in punct]

    #A sentence with no remaining words is treated as dissimilar to everything
    if not words1 or not words2:
        return 0.0

    #Computes similarity based on which metric is used
    if method == 'cosine':
        #Builds word-count vectors over the combined vocabulary
        all_words = list(set(words1 + words2))
        vector1 = [words1.count(w) for w in all_words]
        vector2 = [words2.count(w) for w in all_words]
        return 1 - cosine_distance(vector1, vector2)
    else:
        #Jaccard: shared unique words divided by all unique words
        set1, set2 = set(words1), set(words2)
        return len(set1 & set2) / float(len(set1 | set2))
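
To make the cosine calculation concrete, the sketch below works through it on two toy sentences using only the standard library. It is an illustration rather than the notebook's implementation: `cosine_similarity` is a hypothetical name, and it computes the similarity directly instead of going through NLTK's `cosine_distance`.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(words1, words2):
    """Cosine similarity between two tokenized sentences using word counts."""
    counts1, counts2 = Counter(words1), Counter(words2)
    vocab = set(counts1) | set(counts2)
    # Dot product of the two word-count vectors over the combined vocabulary
    dot = sum(counts1[w] * counts2[w] for w in vocab)
    norm1 = sqrt(sum(c * c for c in counts1.values()))
    norm2 = sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

a = "the cat sat on the mat".split()
b = "the cat lay on the rug".split()
print(round(cosine_similarity(a, b), 3))  # 0.75
```

Both vectors have length sqrt(8) and their dot product is 6 (4 from "the", 1 each from "cat" and "on"), giving 6 / 8 = 0.75. Because "the" is counted twice in each sentence it dominates the score, which is why stopwords are usually filtered out first.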

In [None]:
#Makes similarity matrix for all sentences
def sim_matrix(sent, stopwords = sw, method = 'cosine'):
    #Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sent), len(sent)))
    
    #Calculate similarity for each sentence pairing, leaving the diagonal at 0
    for ind1 in range(len(sent)):
        for ind2 in range(len(sent)):
            if ind1 == ind2:
                continue
            similarity_matrix[ind1][ind2] = sent_sim(sent[ind1], sent[ind2], stopwords, method)
            
    return similarity_matrix

In [None]:
#Generates summary based on sentence similarity
def generate_summary_cosine(article, top_n = 3):
    #Cleans text and prepares list for summary sentences
    summarize_text = []
    sentences = clean_sentences(article)
    
    #Find similar sentences
    sentence_sim_matrix = sim_matrix(sentences, method = 'cosine')
    sentence_sim_graph = nx.from_numpy_array(sentence_sim_matrix)
    scores = nx.pagerank(sentence_sim_graph)
    
    #Rank sentences by score and keep the best n (fewer if the article is short)
    ranked_sentence = sorted(((scores[i], sent) for i, sent in enumerate(sentences)), reverse = True)
    for score, sent in ranked_sentence[:top_n]:
        summarize_text.append(sent)
        
    summary = " ".join(summarize_text)
    return summary
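
The ranking step above relies on `nx.pagerank`. As a rough illustration of what that call computes, here is a simplified power-iteration sketch in plain Python. It is not networkx's implementation: `pagerank_scores`, the fixed iteration count, and the toy similarity values are all assumptions made for this example.

```python
def pagerank_scores(matrix, damping=0.85, iterations=100):
    """Power-iteration PageRank over a non-negative similarity matrix."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            rank = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    # Sentence i passes on its score in proportion to edge weight
                    rank += scores[i] * matrix[i][j] / row_sums[i]
                else:
                    # A sentence with no links spreads its score evenly
                    rank += scores[i] / n
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores

# Hypothetical similarity values for four sentences (0 on the diagonal)
toy_matrix = [
    [0.0, 0.8, 0.6, 0.1],
    [0.8, 0.0, 0.7, 0.0],
    [0.6, 0.7, 0.0, 0.2],
    [0.1, 0.0, 0.2, 0.0],
]
scores = pagerank_scores(toy_matrix)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Sentence 3 is only weakly similar to the others, so it ends up last in `ranking`. The scores always sum to 1, which means they can be read as the share of "attention" each sentence receives from the rest of the article.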

### Jaccard Similarity
---
The next summary method is based on Jaccard similarity. It reuses the two helper functions from the cosine similarity section above, passing `method = 'jaccard'` so that sentence pairs are compared by Jaccard similarity instead. The summary function itself largely duplicates the cosine version, but it is included here for reference. Again, `top_n` sets how many sentences are returned in the summary.

In [None]:
#Generates summary based on sentence similarity
def generate_summary_jaccard(article, top_n = 3):
    #Cleans text and prepares list for summary sentences
    summarize_text = []
    sentences =  clean_sentences(article)
    
    #Find similar sentences
    sentence_sim_matrix = sim_matrix(sentences, method = 'jaccard')
    sentence_sim_graph = nx.from_numpy_array(sentence_sim_matrix)
    scores = nx.pagerank(sentence_sim_graph)
    
    #Rank sentences by score and keep the best n (fewer if the article is short)
    ranked_sentence = sorted(((scores[i], sent) for i, sent in enumerate(sentences)), reverse = True)
    for score, sent in ranked_sentence[:top_n]:
        summarize_text.append(sent)
        
    summary = " ".join(summarize_text)
    return summary
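
For reference, Jaccard similarity is the number of unique words two sentences share divided by the number of unique words in either sentence. The sketch below shows this on two toy sentences; `jaccard_similarity` is a hypothetical standalone helper and is not one of the notebook's functions.

```python
def jaccard_similarity(words1, words2):
    """Jaccard similarity between two tokenized sentences."""
    set1, set2 = set(words1), set(words2)
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)

a = "the cat sat on the mat".split()
b = "the cat lay on the rug".split()
print(round(jaccard_similarity(a, b), 3))  # 0.429
```

The sentences share 3 unique words ("the", "cat", "on") out of 7 in total, so the score is 3 / 7. Since the measure works on sets, a word repeated within a sentence counts only once.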

### SpaCy
---
Another summary method discussed in the article is based on the spaCy library. It selects the summary sentences by scoring each sentence on the frequency of the words it contains and keeping the highest-scoring ones.

In [None]:
#Generates summary using SpaCy library
def generate_summary_spacy(article, top_n = 3):
    #Loads pipeline and instantiates article as nlp object
    nlp = spacy.load('en_core_web_sm')
    text = nlp(article)
    
    #Generates lowercased word frequencies, skipping stopwords, punctuation and whitespace
    word_frequencies = {}
    for word in text:
        key = word.text.lower()
        if key in STOP_WORDS or key in punctuation or word.is_space:
            continue
        word_frequencies[key] = word_frequencies.get(key, 0) + 1
    
    #An article with no scorable words has no summary
    if not word_frequencies:
        return ''
    
    #Normalizes frequencies by the count of the most frequent word
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / max_frequency
    
    #Scores each sentence based on word frequencies
    sentence_scores = {}
    for sent in text.sents:
        for word in sent:
            key = word.text.lower()
            if key in word_frequencies:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]
    
    #Finds n sentences with the most frequent words/highest score
    summary = nlargest(top_n, sentence_scores, key = sentence_scores.get)
    final_summary = [sent.text.strip() for sent in summary]
    summary = ' '.join(final_summary)
    
    #Cleans up newlines and missing spaces after punctuation
    summary = summary.replace('\n', ' ')
    summary = re.sub(r'(?<=[.,\?!:])(?=[^\s])', r' ', summary)
    return summary
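
The scoring idea behind this method does not depend on spaCy itself. The sketch below reproduces it with only the standard library so the arithmetic is easy to follow. Everything in it is an assumption made for illustration: the tiny `TOY_STOP_WORDS` set stands in for spaCy's `STOP_WORDS`, and the regex-based sentence and word splitting is far cruder than spaCy's pipeline.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stopword list; the notebook uses spaCy's STOP_WORDS instead
TOY_STOP_WORDS = {"the", "a", "is", "are", "and", "of", "to", "in", "it"}

def content_words(sentence):
    """Lowercased words of a sentence with the toy stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in TOY_STOP_WORDS]

def frequency_summary(article, top_n=1):
    """Pick the top_n sentences whose words are most frequent in the article."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', article) if s.strip()]
    counts = Counter(w for s in sentences for w in content_words(s))
    if not counts:
        return ''
    # Normalize counts by the most frequent word, as in the function above
    max_count = max(counts.values())
    freq = {w: c / max_count for w, c in counts.items()}
    # A sentence's score is the sum of its words' normalized frequencies
    scores = {i: sum(freq[w] for w in content_words(s))
              for i, s in enumerate(sentences)}
    best = nlargest(top_n, scores, key=scores.get)
    return ' '.join(sentences[i] for i in best)

article = "Cats sleep a lot. Cats and dogs are popular pets. The weather is nice."
print(frequency_summary(article, top_n=1))  # Cats and dogs are popular pets.
```

Here "cats" appears twice and gets a frequency of 1.0, while every other word gets 0.5, so the second sentence scores 2.5 against 2.0 and 1.0 for the others. Because scores are sums, longer sentences tend to be favored, and the selected sentences come back in score order instead of their original order.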

### LSA
---
Latent Semantic Analysis (LSA) is another summary method that can be used. Here it is implemented with sumy, a module designed for automatic summarization. The module also offers several other summary methods, and a second example from it, LexRank, is included below.

In [None]:
#Generates summary using LSA
def generate_summary_lsa(article, top_n = 3):
    #Tokenizes text and summarizes using LSA
    parser = PlaintextParser.from_string(article, Tokenizer('english'))
    lsa = LsaSummarizer()
    lsa_summary = lsa(parser.document, top_n)
    
    #Joins sentences together, separated by single spaces
    summary = ' '.join(str(s) for s in lsa_summary)
    return summary

### LexRank
---
Finally, the last method mentioned in the article uses LexRank through the sumy module. It is called in much the same way as the LSA method above.

In [None]:
#Generates summary using LexRank
def generate_summary_lexrank(article, top_n = 3):
    #Tokenizes text and summarizes using LexRank
    parser = PlaintextParser.from_string(article, Tokenizer('english'))
    lex_rank = LexRankSummarizer() 
    lex_summary = lex_rank(parser.document, top_n)
    
    #Joins sentences together, separated by single spaces
    summary = ' '.join(str(s) for s in lex_summary)
    return summary

### Bonus: Gensim
---
While not explicitly discussed in the article, Gensim is another popular NLP library that has its own pre-built summary method. Example code for generating a summary with Gensim is shown below. Note that the `gensim.summarization` module was removed in Gensim 4.0, so this code needs an earlier release.

In [None]:
#Generate summary with n word_count
def generate_summary_genism(article, word_count = 250):
    #The summarization module only exists in gensim releases before 4.0
    from gensim.summarization.summarizer import summarize
    
    #Preprocesses text
    sentences = clean_sentences(article)
    text = " ".join(sentences)
    
    #Generates summary with n words based on given word_count
    summary = summarize(text, word_count = word_count)
    summary = summary.replace('\n',' ')
    return summary