<a href="https://colab.research.google.com/github/jedavis82/topic_modeling_summarization/blob/main/generate_doc_summaries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Generate Document Summaries
This notebook generates a summary for each document in the Kaggle news article files (`articles1.csv`, `articles2.csv`, `articles3.csv`), processing one file per run.

### Connect to GDrive to access our data files

In [1]:
from google.colab import drive 
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [2]:
# Change directories to the location of the kaggle news data
%cd 'gdrive/MyDrive/Colab Notebooks/kaggle_news/'
!ls

/content/gdrive/MyDrive/Colab Notebooks/kaggle_news
articles1.csv  articles2.csv  articles3.csv  requirements.txt


### Install Requirements

In [3]:
"""
top2vec[sentence_transformers]
spacy
spacytextblob
pandas
transformers
sentencepiece
numpy
jupyter
scikit-learn"""
# !pip install -r requirements.txt
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Import Required Libraries

In [4]:
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
import pandas as pd

In [14]:
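# Select which articles file to summarize (articles1, articles2, or articles3)
articles_name = 'articles1'
articles_file = f'./{articles_name}.csv'
# The summaries are written to a new file next to the input
output_file = f'./{articles_name}_wsummary.csv'
df = pd.read_csv(articles_file)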

In [6]:
# Define a batching function to use for batching documents for processing 
# def batch(data, batch_size=64):
#     # See: https://stackoverflow.com/a/8290508
#     l = len(data)
#     for idx in range(0, l, batch_size):
#         yield data[idx:min(idx + batch_size, l)]

In [7]:
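# Define a function to summarize documents in the articles csv
def doc_summaries(docs, summarizer):
    """Return one summary string per document, in input order."""
    summaries = []
    # truncation=True clips documents that exceed the model's maximum input length
    for out in summarizer(docs, batch_size=16, truncation=True):
        summaries.append(out['summary_text'])
    return summaries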
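
Before running the full model, the function can be smoke-tested without a GPU. The cell below is only a sketch: `fake_summarizer` is a hypothetical stand-in (not part of `transformers`) that mimics the shape the function reads, namely one dict with a `summary_text` key per document. `doc_summaries` is repeated so the cell runs on its own.

```python
# Hypothetical stand-in for the Hugging Face pipeline. It only mimics the
# output shape that doc_summaries reads: one dict with 'summary_text' per doc.
def fake_summarizer(docs, batch_size=16, truncation=True):
    for doc in docs:
        # "Summarize" by keeping the first five words of each document.
        yield {'summary_text': ' '.join(doc.split()[:5])}


def doc_summaries(docs, summarizer):
    # Same logic as the cell above, repeated so this sketch is self-contained.
    summaries = []
    for out in summarizer(docs, batch_size=16, truncation=True):
        summaries.append(out['summary_text'])
    return summaries


sample_docs = [
    'one two three four five six seven',
    'short document',
]
sample_summaries = doc_summaries(sample_docs, fake_summarizer)
print(sample_summaries)
```

If the stub run returns one string per input document, the same call should work unchanged with the real summarizer.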

In [8]:
# Create a summarizer object
# device=0 places the model on the first GPU, so a GPU runtime is required
summarizer = pipeline('summarization', model='facebook/bart-large-cnn',
                      tokenizer='facebook/bart-large-cnn', framework='pt',
                      device=0)

In [9]:
# df = df.head(100)  # Testing purposes

In [15]:
docs = list(df['content'])
len(docs)

50000

In [None]:
summaries = doc_summaries(docs, summarizer)
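
Summarizing all 50,000 documents in one call means a dropped Colab session loses everything. One possible safeguard, sketched below under assumptions, is to work in chunks and save progress after each one. The helper `summarize_in_chunks`, its `checkpoint_path` argument, and the JSON checkpoint format are all hypothetical and not part of the original notebook.

```python
import json
import os


def summarize_in_chunks(docs, summarize_fn, checkpoint_path, chunk_size=1000):
    """Summarize docs chunk by chunk, saving progress after each chunk.

    summarize_fn is any callable that maps a list of documents to a list of
    summaries of the same length (hypothetical interface for this sketch).
    """
    # Resume from an earlier run if a checkpoint file already exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            summaries = json.load(f)
    else:
        summaries = []

    # Start where the checkpoint left off and process the remaining documents.
    for start in range(len(summaries), len(docs), chunk_size):
        chunk = docs[start:start + chunk_size]
        summaries.extend(summarize_fn(chunk))
        # Overwrite the checkpoint with everything summarized so far.
        with open(checkpoint_path, 'w') as f:
            json.dump(summaries, f)
    return summaries
```

In this notebook it could be called as `summarize_in_chunks(docs, lambda chunk: doc_summaries(chunk, summarizer), './summaries_checkpoint.json')`, where the checkpoint file name is an arbitrary choice. The checkpoint is only valid for the same `docs` list in the same order.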

In [12]:
df['summary'] = summaries

In [13]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content,summary
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...,House Republicans have a new fear when it come...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood...",Four of every five shootings in the 40th Preci...
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri...","Tyrus Wong, a Chinese immigrant, was one of th..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t...","The pop music world had, hands down, the bleak..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ...",North Korea has conducted five nuclear tests i...


In [None]:
# index=False avoids writing the row index as yet another "Unnamed" column
df.to_csv(output_file, index=False)