# News Summarizer

In [32]:
import sqlite3
import pandas as pd
import numpy as np
import os
import spacy

from gensim.corpora import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from gensim.matutils import sparse2full

## Load and clean dataset

The dataset was built by Andrew Thompson and is available from [components.one](https://components.one/datasets/all-the-news-articles-dataset/). Its description reads:

`204,135 articles from 18 American publications. Includes date, title, publication, article text, publication name, year, month, and URL (for some). Articles mostly unevenly span from 2013 to early 2018, with a smattering pre-2013.`

It ships as a SQLite database, so we'll use `sqlite3` to open it and pandas to load the `longform` table.
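
Before querying an unfamiliar SQLite file, it can help to list its tables and columns. The following is a minimal sketch of that check, not part of the original notebook: it uses an in-memory database as a stand-in for `datasets/all-the-news.db`, and the three-column `longform` table and its single row are made up for illustration.

```python
import sqlite3

# In-memory stand-in for 'datasets/all-the-news.db'; the row below is made up.
cnx = sqlite3.connect(":memory:")
cnx.execute("CREATE TABLE longform (id INTEGER, title TEXT, content TEXT)")
cnx.execute("INSERT INTO longform VALUES (1, 'Example title', 'Example content')")

# sqlite_master lists every table stored in the database.
tables = [name for (name,) in cnx.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column; position 1 is the column name.
columns = [col[1] for col in cnx.execute("PRAGMA table_info(longform)")]

cnx.close()
print(tables, columns)
```

Pointing `sqlite3.connect` at the real file instead of `":memory:"` (and skipping the `CREATE`/`INSERT` lines) would show which table names are available to `pd.read_sql_query`.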

In [4]:
cnx = sqlite3.connect('datasets/all-the-news.db')
df = pd.read_sql_query("SELECT * FROM longform", cnx)
print(df.shape)
cnx.close()

df.head()

(204135, 12)


| | id | title | author | date | content | year | month | publication | category | digital | section | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Agent Cooper in Twin Peaks is the audience: on... | \nTasha Robinson\n | 2017-05-31 | And never more so than in Showtime’s new... | 2017 | 5 | Verge | Longform | 1.0 | | |
| 1 | 2 | AI, the humanity! | \nSam Byford\n | 2017-05-30 | AlphaGo’s victory isn’t a defeat for hum... | 2017 | 5 | Verge | Longform | 1.0 | | |
| 2 | 3 | The Viral Machine | \nKaitlyn Tiffany\n | 2017-05-25 | Super Deluxe built a weird internet empi... | 2017 | 5 | Verge | Longform | 1.0 | | |
| 3 | 4 | How Anker is beating Apple and Samsung at thei... | \nNick Statt\n | 2017-05-22 | Steven Yang quit his job at Google in th... | 2017 | 5 | Verge | Longform | 1.0 | | |
| 4 | 5 | Tour Black Panther’s reimagined homeland with ... | \nKwame Opam\n | 2017-05-15 | Ahead of Black Panther’s 2018 theatrical... | 2017 | 5 | Verge | Longform | 1.0 | | |


To save RAM, I keep only a random sample of one third of the New York Times articles.

Articles whose content is 500 characters or shorter are dropped before sampling.

In [6]:
def clean_df(df, category='newspaper', publication='New York Times',
             sample_frac=1/3, len_threshold=500, random_state=42):
    # Keep only the requested category and publication, then drop unused columns.
    # The explicit copy avoids pandas' SettingWithCopyWarning on later assignments.
    mask = (df['category'] == category) & (df['publication'] == publication)
    df = df.loc[mask].drop(['id', 'digital', 'section', 'url', 'category', 'publication'], axis=1).copy()
    # content_len is the article length in characters; drop short articles.
    df['content_len'] = df['content'].apply(len)
    df = df.loc[df['content_len'] > len_threshold].copy()
    # Author names in the raw data are wrapped in newlines.
    df['author'] = df['author'].str.replace('\n', '', regex=False)
    
    return df.sample(frac=sample_frac, random_state=random_state)
    
df = clean_df(df).reset_index(drop=True)
print(df.shape)
df.head()

(10014, 7)



| | title | author | date | content | year | month | content_len |
|---|---|---|---|---|---|---|---|
| 0 | A Policeman’s Bear Hug Stops a Suicide Bomber ... | Rod Nordland and Fahim Abed | 2017-11-17 | KABUL, Afghanistan — No one will ever know wha... | 2017 | 11 | 5821 |
| 1 | Olympics, Jerusalem, John Conyers: Your Tuesda... | Karen Zraick and Sandra Stevenson | 2017-12-05 | (Want to get this briefing by email? Here’s th... | 2017 | 12 | 4709 |
| 2 | A Strange Itch, Trouble Breathing, Then Anaphy... | Lisa Sanders, M.D. | 2018-01-06 | “I can’t breathe,” the woman panted, her voice... | 2018 | 1 | 7639 |
| 3 | Why The Times Covers Its Own Industry - The Ne... | Sydney Ember | 2017-09-19 | Rolling Stone was one of the many magazines th... | 2017 | 9 | 4146 |
| 4 | Judge Decides to Import Jury for Bill Cosby’s ... | Graham Bowley | 2017-05-24 | Responding to Bill Cosby’s concerns about the ... | 2017 | 5 | 2324 |


In [7]:
print(df.loc[0, 'title'], end='\n\n')
print(df.loc[0, 'content'])

A Policeman’s Bear Hug Stops a Suicide Bomber From Killing More - The New York Times

KABUL, Afghanistan — No one will ever know what went through the mind of Afghan Police Lt. Sayed Basam Pacha in those moments when he came face to face with a man he suspected of being a suicide bomber on Thursday afternoon, but whatever it was, he did not hesitate to act. At his back was a crowd of civilians, many of them dignitaries, leaving the hall he was guarding. Around him were officers from the police company he commanded. The suspect had just approached their heavily guarded gate, the only way in or out of the compound around the hall. Broad-shouldered and heavily muscled, Lieutenant Pacha shouted at the suspect to halt, but instead the man started running. The officer stopped him, throwing his arms around him in a bear hug. A second later the bomber detonated the explosive vest hidden under his coat. Fourteen people, including Lieutenant Pacha and seven other police officers as well as six c

## Preprocessing and TFIDF

We'll use spaCy for preprocessing and gensim's TF-IDF model to weight the resulting terms.

### Preprocessing

We only want to keep tokens that:
- consist of alphabetic characters (`is_alpha`);
- are not whitespace (`is_space`);
- are not punctuation (`is_punct`);
- are not in spaCy's [stop list](https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py) (`is_stop`);
- do not represent a number (`like_num`).

We also want to lemmatize each token. The lemma of a word is its base (dictionary) form.

Example: `reading`, `reads` and `read` all become `read`.

### TFIDF
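
TF-IDF weights a term by how often it occurs in a document, discounted by how many documents contain it, so words that appear everywhere contribute little. The sketch below is a plain-Python illustration rather than gensim's code. It assumes what I understand to be the defaults of `TfidfModel`: the raw term count multiplied by log2(N / document frequency), followed by L2 normalization of each document vector. The `tfidf` helper and the `toy_docs` corpus are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (each doc is a list of terms)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        # Term count times inverse document frequency (log base 2).
        weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        # Scale the document vector to unit length (L2 norm).
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        weighted.append(weights)
    return weighted

toy_docs = [
    ["police", "officer", "bomber"],
    ["police", "drug", "price"],
    ["police", "jury", "judge"],
]
toy_weights = tfidf(toy_docs)
print(toy_weights[0])
```

In this toy corpus `police` occurs in every document, so its weight is zero, while the two terms unique to each document share the remaining weight equally.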



In [8]:
nlp = spacy.load('en_core_web_sm')

def keep_token(t):
    return (t.is_alpha and 
            not (t.is_space or t.is_punct or 
                 t.is_stop or t.like_num))

def lemmatize_doc(doc):
    return [t.lemma_ for t in doc if keep_token(t)]

def preprocess(texts):
    docs = [lemmatize_doc(nlp(doc)) for doc in texts]
    docs_dict = Dictionary(docs)
    docs_dict.compactify()
    docs_corpus = [docs_dict.doc2bow(doc) for doc in docs]
    model_tfidf = TfidfModel(docs_corpus, id2word=docs_dict)
    docs_tfidf = model_tfidf[docs_corpus]
    tfidf_matrix = np.vstack([sparse2full(c, len(docs_dict)) for c in docs_tfidf])
    
    return pd.DataFrame(tfidf_matrix, columns=docs_dict.token2id.keys())

tfidf_df = preprocess(df['content'])
print(tfidf_df.shape)
tfidf_df.head()

(10014, 99846)


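| | -PRON- | a | abdullahzada | abroad | academy | accord | act | advanced | afghan | afghanistan | ... | werft | colossians | epiphanny | fila | kobes | laimbeer | robyne | savanah | swoopes | weatherspoon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.004457 | 0.010696 | 0.129247 | 0.022316 | 0.022282 | 0.014902 | 0.010471 | 0.02522 | 0.062795 | 0.073481 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.003487 | 0.006741 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.006247 | 0.013482 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.008503 | 0.008219 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.001134 | 0.0 | 0.0 | 0.0 | 0.0 | 0.013741 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |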


## News Summarizer
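
The cell below scores each sentence by the mean TF-IDF weight of its kept lemmas, always keeps the opening sentence (plus the second one when the first is shorter than 30 characters), and then adds the five highest-scoring remaining sentences in their original order. Here is a library-free sketch of that selection logic. It is a simplification under stated assumptions: `simple_tokens` is a hypothetical stand-in for `lemmatize_doc` (no lemmatization or stop-word removal), and the sentences and weights are made up.

```python
import re

def simple_tokens(sentence):
    # Hypothetical stand-in for lemmatize_doc: lowercase alphabetic words only.
    return [w.lower() for w in re.findall(r"[A-Za-z]+", sentence)]

def summarize(sentences, weights, top_k=5, len_threshold=30):
    """Keep the lead sentence(s) plus the top_k best-scoring remaining ones."""
    if not sentences:
        return ""
    # A very short first sentence pulls in the second one as well.
    n_lead = 2 if len(sentences[0]) < len_threshold else 1
    lead = sentences[:n_lead]
    rest = list(enumerate(sentences))[n_lead:]

    def score(sentence):
        # Mean weight of the sentence's terms; unknown terms count as zero.
        terms = simple_tokens(sentence)
        if not terms:
            return 0.0
        return sum(weights.get(t, 0.0) for t in terms) / len(terms)

    best = sorted(rest, key=lambda pair: score(pair[1]), reverse=True)[:top_k]
    # Restore document order before joining.
    chosen = [s for _, s in sorted(best)]
    return " ".join(lead + chosen)

sentences = [
    "Short lead.",
    "Second sentence about police.",
    "Filler words here.",
    "The bomber was stopped by police.",
]
weights = {"bomber": 0.9, "stopped": 0.5, "police": 0.1}
summary = summarize(sentences, weights, top_k=1)
print(summary)
```

Dividing by the number of terms keeps long sentences from winning simply because they contain more words.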

In [20]:
def summarizer(row):
    def formatter(phrases):
        def capitalize(s):
            # s[:1] avoids an IndexError when a phrase is empty after stripping.
            return s[:1].upper() + s[1:]
        return ' '.join([capitalize(phrase.strip()) for phrase in phrases])
    
    def get_first_phrases(df, len_threshold=30):
        # Always keep the opening sentence; if it is shorter than
        # len_threshold characters, keep the second one as well.
        phrases = df.sort_values(by='order')['phrase'].tolist()
        n_first = 2 if len(phrases[0]) < len_threshold else 1
        selected = phrases[:n_first]
        # Index labels equal sentence order, so drop the first len(selected) rows.
        to_drop = list(range(len(selected)))
        return df.drop(to_drop, axis=0), selected
    
    def get_corpus(df):
        # Top 5 sentences by score, restored to their original order.
        return df.sort_values(by='score', ascending=False).head(5).sort_values(by='order')['phrase'].tolist()
        
    records = []
    for i, sent in enumerate(nlp(row['content']).sents):
        lemmas = lemmatize_doc(sent)
        # Sentence score: mean TF-IDF weight of its kept lemmas in this article.
        score = sum(tfidf_df.loc[row.name, word] for word in lemmas)
        if lemmas:
            score /= len(lemmas)
        records.append({'phrase': sent.text, 'order': i, 'score': score})
    # Build the frame in one go; DataFrame.append was removed in pandas 2.0.
    df = pd.DataFrame(records, columns=['phrase', 'order', 'score'])
    
    df, first_phrases = get_first_phrases(df)
    corpus = get_corpus(df)
    return formatter(first_phrases + corpus)
    
df['summarizations'] = df.apply(summarizer, axis=1)

In [21]:
df.to_csv('datasets/summarized_nytimes_news.csv')

In [31]:
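def print_summarizations(row):
    print('Title:', row['title'])
    print()
    # The summary column is named 'summarizations' (plural).
    print(row['summarizations'])
    print()
    print(row['content'])
    print()
    print('*' * 15)

# Loop instead of apply so the cell does not also return a Series of None.
for _, row in df.sample(30, random_state=42).iterrows():
    print_summarizations(row)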

Title: Express Scripts Sues Maker of Overdose Drug, Intensifying Feud - The New York Times

This article is a collaboration between The Times and ProPublica, the independent nonprofit investigative-journalism organization. Express Scripts claims it is owed more than $14.5 million in fees and rebates related to Evzio, and it has dropped the drug from its preferred list. In January 2016, Evzio carried a list price of $937.50 for two injectors, and Express Scripts billed Kaléo for about $25,000 in administrative fees for its commercial clients for that month. This is not the first time Express Scripts has sued a drug maker with expensive products. In 2015, Express Scripts filed suit against Horizon Pharma, also over unpaid fees. Express Scripts is also being sued.

This article is a collaboration between The Times and ProPublica, the independent nonprofit investigative-journalism organization. A company that manages prescription drug plans for tens of millions of Americans has sued a tiny
