# Article Summarizer
In hindsight of the recent 2020 election and the ongoing covid-19 pandemic, there is a constant stream of updating news articles that are hard to constantly keep up with. Especially with a busy schedule, staying well informed during these times is quite difficult and time consuming. 

My solution for this is to find the top 'n' important sentences in an article so that when put together it can provide the reader a synopsis of the article.

In [1]:
#importing libraries
from nltk.tokenize import sent_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import re 
import numpy as np
import networkx as nx

Articles I am going to use as test examples:

https://www.theverge.com/2020/5/30/21269703/spacex-launch-crew-dragon-nasa-orbit-successful

https://apnews.com/article/joe-biden-wins-white-house-ap-fd58df73aa677acb74fce2a69adb71f9

https://apnews.com/article/pfizer-vaccine-effective-early-data-4f4ae2e3bad122d17742be22a2240ae8

In [2]:
#opening article file
f = open('spacex_article.txt', 'r')
article1 = f.read()
f.close()

f = open('election_article.txt', 'r')
article2 = f.read()
f.close()

f = open('pfizervacc_article.txt', 'r')
article3 = f.read()
f.close()

In [3]:
#function to split sentences into words while stemming the words
def stemming_tokenizer(str_input):
    porter_stemmer = PorterStemmer()
    words = re.sub(r"[^A-Za-z]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

In [4]:
#function to create tfidf vectors for each sentence in the article
def vectorize_sentences(sentences):
    
    #define tfidf vectorizer
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=True, norm='l1')
    
    #vectorize sentences
    vect_sentences = []
    x=0
    for sentence in sentences:
        sentence = [sentence] #tfidf vectorizer only accepts lists 
        #first we need to fit_transform
        if(x == 0):
            vect_sentences.append(tfidf_vectorizer.fit_transform(sentence))
            x = x+1
        #then after only transform
        vect_sentences.append(tfidf_vectorizer.transform(sentence))
    
    return vect_sentences

In [5]:
#function to build a matrix for the cosine similarity between all the sentences in the article
def sentence_similarity_matrix(vect_sentences, num_sentences):
    similarity_matrix = np.zeros((num_sentences, num_sentences))

    for sentence1 in range(num_sentences):
            for sentence2 in range(num_sentences):
                #dont calculate if the sentence is the same
                if sentence1 == sentence2: 
                    continue 
                similarity_matrix[sentence1][sentence2] = cosine_similarity(vect_sentences[sentence1], vect_sentences[sentence2])
    
    return similarity_matrix

In [6]:
#function to use networkx's pagerank() to rank sentences that are most similar to the rest of the article
def rank_sentences(similarity_matrix, sentences):
    sentence_similarity_graph = nx.from_numpy_array(similarity_matrix)

    similarity_scores = nx.pagerank(sentence_similarity_graph)
    ranked_sentences = sorted(((similarity_scores[i],s) for i,s in enumerate(sentences)), reverse=True) 
    return ranked_sentences

In [7]:
#combine the top 10% ranked sentences to form the summary 
def create_summary(ranked_sentences, num_sentences):
    summary_sentences = []
    summary_sentence_count = round(num_sentences*.15)

    for i in range(summary_sentence_count):
        summary_sentences.append("".join(ranked_sentences[i][1]))
    print("Article Summary (", num_sentences, "->", summary_sentence_count, "sentences): \n"  ," ".join(summary_sentences))

In [8]:
def article_summarizer(article):
    #splitting article into sentences and counting how many sentences there are
    sentences = sent_tokenize(article)
    sentence_count = len(sentences)
    
    #creating tfidf vectors from the sentences
    vect_sentences = vectorize_sentences(sentences)
    
    #building a similarity matrix using cosine similarity
    similarity_matrix = sentence_similarity_matrix(vect_sentences, sentence_count)
    
    #ranking the sentences by similarity using pagerank
    ranked_sentences = rank_sentences(similarity_matrix, sentences)
    
    #compiling the top sentences and printing the summary
    create_summary(ranked_sentences, sentence_count)

In [9]:
article_summarizer(article1)



Article Summary ( 47 -> 7 sentences): 
 This launch is a critical moment for SpaceX, a company formed by Musk with the express purpose of sending humans into space and building settlements on Mars. Through that initiative, NASA enlisted two companies, SpaceX and Boeing, to develop new spacecraft that could regularly ferry the agency’s astronauts to and from the space station. Though this mission is considered a test, it still carried enormous weight for the United States. The rocket dropped the Crew Dragon off in orbit about 12 minutes later. That means we could be soon entering a new era where private companies are the ones routinely taking people to low Earth orbit. “They’re laying the foundation for a new era in human spaceflight,” NASA administrator Jim Bridenstine said before launch. The company is currently working on a new monster rocket called Starship, which may one day take humans to deep space destinations like the Moon and Mars.


In [10]:
article_summarizer(article2)

Article Summary ( 71 -> 11 sentences): 
 The strategy, as well as an appeal to Americans fatigued by Trump’s disruptions and wanting a return to a more traditional presidency, proved effective and resulted in pivotal victories in Michigan and Wisconsin as well as Pennsylvania, onetime Democratic bastions that had flipped to Trump in 2016. The third president to be impeached, though acquitted in the Senate, Trump will leave office having left an indelible imprint in a tenure defined by the shattering of White House norms and a day-to-day whirlwind of turnover, partisan divide and Twitter blasts. It was a precarious balance for Trump’s allies as they try to be supportive of the president -- and avoid risking further fallout -- but face the reality of the vote count. Trump is the first incumbent president to lose reelection since Republican George H.W. There was another COVID-19 outbreak in the White House this week, which sickened his chief of staff Mark Meadows. The president defied cal

In [11]:
article_summarizer(article3)

Article Summary ( 41 -> 6 sentences): 
 Pfizer said its only involvement in Operation Warp Speed is that those doses are part of the administration’s goal to have 300 million doses of COVID-19 vaccines ready sometime next year. Pfizer instead said it has invested $2 billion of its own money in testing and expanding manufacturing capacity. Even if all goes well, authorities have stressed it is unlikely any vaccine will arrive much before the end of the year, and the limited initial supplies will be rationed. Global markets, already buoyed by the victory of President-elect Joe Biden, rallied on the news from Pfizer. The study is continuing, and Pfizer cautioned that the protection rate might change as more COVID-19 cases are added to the calculations. The strong results were a surprise.
