Practice code for summary

Reference : 
- https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70

In [1]:
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

In [2]:
def read_article(file_name):
    file = open(file_name, "r")
    filedata = file.readlines()
    return filedata[0].split(". ")

In [3]:
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
 
    sent1 = list(filter(lambda x: x not in stopwords, [w.lower() for w in sent1]))
    sent2 = list(filter(lambda x: x not in stopwords, [w.lower() for w in sent2]))
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    for w in sent1:
        vector1[all_words.index(w)] += 1
 
    for w in sent2:
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)

In [4]:
def build_similarity_matrix(sentences, stop_words):
    # initialize
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: 
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
    
    return similarity_matrix

In [5]:
def generate_summary(sentences, top_n=5):
    stop_words = stopwords.words('english')
    summarize_text = []
    
    # Similary Martix across sentences
    sentence_similarity_martix = build_similarity_matrix(sentences, stop_words)
    
    # Rank sentences
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_martix)
    scores = nx.pagerank(sentence_similarity_graph)
    
    # Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)    
    print("Indexes of top ranked_sentence order are ", ranked_sentence)
    for i in range(top_n):
          summarize_text.append("".join(ranked_sentence[i][1]))

    # the summarize texr
    print("Summarize Text: \n", ". ".join(summarize_text))

In [6]:
# sentences =  read_article(file_name)
sentences = """
T cells have a central role in the orchestration of the immune pathways that contribute to the inflammation and joint destruction characteristic of rheumatoid arthritis (RA). The requirement for a dual signal for T-cell activation and the construction of a fusion protein that prevents engagement of the costimulatory molecules required for this activation has led to a new approach to RA therapy. This approach is mechanistically distinct from other currently used therapies; it targets events early rather than late in the immune cascade, and it results in immunomodulation rather than complete immunosuppression. The fusion protein abatacept is a selective costimulation modulator that avidly binds to the CD80/CD86 ligands on an antigen-presenting cell, resulting in the inability of these ligands to engage the CD28 receptor on the T cell. Abatacept dose-dependently reduces T-cell proliferation, serum concentrations of acute-phase reactants, and other markers of inflammation, including the production of rheumatoid factor by B cells. Recent studies have provided consistent evidence that treatment with abatacept results in a rapid onset of efficacy that is maintained over the course of treatment in patients with inadequate response to methotrexate and anti-tumor necrosis factor therapies. This efficacy includes patient-centered outcomes and radiographic measurement of disease progression. Abatacept has also demonstrated a very favorable safety profile to date. This article reviews the rationale for this therapeutic approach and highlights some of the recent studies that demonstrate the benefits obtained by using abatacept. This clinical experience indicates that abatacept is a significant addition to the therapeutic armamentarium for the management of patients with RA.
""".split('. ')

generate_summary(sentences)

Indexes of top ranked_sentence order are  [(0.10153491027389872, 'The requirement for a dual signal for T-cell activation and the construction of a fusion protein that prevents engagement of the costimulatory molecules required for this activation has led to a new approach to RA therapy'), (0.10136729352732622, 'This clinical experience indicates that abatacept is a significant addition to the therapeutic armamentarium for the management of patients with RA.\n'), (0.10112823554963091, 'This approach is mechanistically distinct from other currently used therapies; it targets events early rather than late in the immune cascade, and it results in immunomodulation rather than complete immunosuppression'), (0.10105756019060837, 'Recent studies have provided consistent evidence that treatment with abatacept results in a rapid onset of efficacy that is maintained over the course of treatment in patients with inadequate response to methotrexate and anti-tumor necrosis factor therapies'), (0.10