# Text Summarizer
by Manuel Romero Muñoz / mrm8488

![Image of Summarizing](summarize.png)

## Two types of summarization

1. **Abstractive**: selects words based on semantic understanding, including words that do not appear in the source document. It aims to restate the important material in a new way, interpreting and examining the text with advanced NLP techniques.

    - Pros: Closer to the way a human summarizes (smarter).
    - Cons: Has to cope with hard problems such as semantic representation, inference, and natural language generation.
    
> Input document → understand context → semantics → create own summary.    

2. **Extractive**: summarizes an article by selecting a subset of the words (or, as in the example below, whole sentences) from the article that retain the most important points.

    - Pros: No model has to be trained or built before you can use it in your project (it often gives better results than automatic abstractive summaries).
    - Cons: Not so smart.
    
> Input document → sentence similarity → weight sentences → select the highest-ranked sentences.

### Example of Extractive summarization:

##### 1. Import all necessary libraries

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

##### 2. Generate clean sentences

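In [7]:
import re

def read_article(file_name):
    # Read the article; as in the original, only the first line of the file is used
    with open(file_name, "r") as file:
        filedata = file.readlines()
    article = filedata[0].split(". ")
    sentences = []

    for sentence in article:
        print(sentence)
        # re.sub applies the character class; str.replace would search for it literally.
        # split() without arguments avoids empty strings from repeated spaces.
        sentences.append(re.sub("[^a-zA-Z]", " ", sentence).split())
    # Kept from the original: discard the last element produced by the split
    sentences.pop()

    return sentences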
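
The cleaning step depends on a regular expression. `str.replace` treats its first argument as a literal string, so a character class such as `[^a-zA-Z]` only takes effect through `re.sub`. A minimal sketch of the difference, using a made-up sample sentence (not taken from `trump.txt`):

```python
import re

# Illustrative sentence, invented for this sketch
sentence = "Roughly 7,000 troops, officials said"

# str.replace looks for the literal text "[^a-zA-Z]", which is not present,
# so the sentence comes back unchanged
literal = sentence.replace("[^a-zA-Z]", " ")

# re.sub interprets the pattern and replaces every non-letter with a space
cleaned = re.sub(r"[^a-zA-Z]", " ", sentence)

# split() with no argument collapses runs of spaces and drops empty strings
words = cleaned.split()
print(words)
```

Whether digits and punctuation should be dropped at all is a design choice; removing them makes the word vectors simpler, but it also strips them from the final summary text.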

##### 3. Similarity matrix
Here we use cosine similarity to measure how similar two sentences are.

In [8]:
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)
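
To make the computation concrete, here is a minimal standard-library sketch of the same idea: count the words of each sentence, then divide the dot product of the two count vectors by the product of their lengths. The name `cosine_similarity` and the zero-vector guard are assumptions of this sketch; it is not how `nltk.cluster.util.cosine_distance` is implemented, and it skips the stop-word filtering for brevity.

```python
import math
from collections import Counter

def cosine_similarity(words1, words2):
    """Cosine similarity of two tokenized sentences (hypothetical helper)."""
    counts1 = Counter(w.lower() for w in words1)
    counts2 = Counter(w.lower() for w in words2)

    # Dot product over the words the two sentences share
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())

    # Euclidean length of each count vector
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))

    # Assumption of this sketch: an empty sentence is similar to nothing
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Two of the three words match, so the score is 2 / (sqrt(3) * sqrt(3))
score = cosine_similarity("the cat sat".split(), "the cat ran".split())
print(score)
```

The guard matters when every word of a sentence is filtered out as a stop word, since the count vector is then all zeros and the division is undefined.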

In [9]:
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # skip comparing a sentence with itself
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix
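
Because cosine similarity is symmetric, each pair only needs to be scored once. The following is a sketch of that variation, not the author's code: `symmetric_similarity_matrix` and `word_overlap` are hypothetical names, plain lists stand in for the NumPy array, and a simple Jaccard word overlap stands in for `sentence_similarity` so the block runs on its own.

```python
def symmetric_similarity_matrix(items, similarity):
    """Fill both halves of the matrix while scoring each pair once."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Start at i + 1: the diagonal stays 0 and (j, i) mirrors (i, j)
        for j in range(i + 1, n):
            score = similarity(items[i], items[j])
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix

def word_overlap(a, b):
    """Stand-in similarity (Jaccard overlap), used only for this sketch."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

toy_sentences = [["a", "b"], ["b", "c"], ["x"]]
toy_matrix = symmetric_similarity_matrix(toy_sentences, word_overlap)
print(toy_matrix)
```

For a handful of sentences the saving is negligible, but the number of similarity calls is halved, which helps on longer articles.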

##### 4. Generate Summary Method
This method calls the helper functions above in turn to drive the summarization pipeline. Take a look at each `# Step` comment in the code below.

In [13]:
def generate_summary(file_name, top_n=5):
    nltk.download("stopwords")
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read the text and split it into sentences
    sentences = read_article(file_name)
    print("\n")

    # Step 2 - Generate the similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences using the similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort by rank and pick the top sentences
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    print("Sentences ranked by score: ", ranked_sentence)
    print("\n")

    # Never ask for more sentences than the article contains
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Of course, output the summarized text
    print("Summarized Text: \n", ". ".join(summarize_text))
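
The ranking in Step 3 treats the similarity matrix as a weighted graph and scores each sentence with PageRank. The sketch below shows the idea with a plain power iteration. It is a simplified stand-in written for illustration, not the implementation behind `nx.pagerank`; the function name `pagerank_scores`, its parameters, and the toy matrix are all assumptions.

```python
def pagerank_scores(matrix, damping=0.85, iterations=100, tol=1e-8):
    """Simplified PageRank by power iteration over a weighted adjacency matrix."""
    n = len(matrix)
    if n == 0:
        return []
    row_sums = [sum(row) for row in matrix]
    scores = [1.0 / n] * n  # start from a uniform distribution

    for _ in range(iterations):
        # Rows with no outgoing weight spread their score uniformly
        dangling = sum(scores[i] for i in range(n) if row_sums[i] == 0)
        new_scores = []
        for j in range(n):
            # Score flowing into j, proportional to each edge's share of its row
            incoming = sum(
                scores[i] * matrix[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores

# Toy matrix: sentence 0 is similar to both others, which are unrelated to each other
toy = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
toy_scores = pagerank_scores(toy)

# Indices of the highest-scoring sentences, best first
top_indices = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)[:1]
print(toy_scores, top_indices)
```

A sentence that resembles many others collects the most score, which is why the central sentence of the toy graph ranks first.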

In [14]:
generate_summary("trump.txt", 2)

[nltk_data] Downloading package stopwords to C:\Users\Manuel
[nltk_data]     Romero\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
WASHINGTON - The Trump administration has ordered the military to start withdrawing roughly 7,000 troops from Afghanistan in the coming months, two defense officials said Thursday, an abrupt shift in the 17-year-old war there and a decision that stunned Afghan officials, who said they had not been briefed on the plans.President Trump made the decision to pull the troops - about half the number the United States has in Afghanistan now - at the same time he decided to pull American forces out of Syria, one official said.The announcement came hours after Jim Mattis, the secretary of defense, said that he would resign from his position at the end of February after disagreeing with the president over his approach to policy in the Middle East.The whirlwind of troop withdrawals and the resignation of Mr
Mattis leave a murky pic