# Phase 1 - Data Acquisition and Refinement

The goals for Phase 1, per the roadmap in the readme, are as follows:

1. Obtain a dataset of news articles that includes the text content as well as a summary of each article. 
2. Explore the dataset to get a sense of the data, such as the number of articles, length of the articles and summaries, and distribution of topics and keywords.
3. Clean and preprocess the data to remove unnecessary characters, punctuation, and stop words. 
4. Tokenize the text into words or subwords, and create input sequences and output summaries.

## Goal 1: Obtain a Dataset

We've obtained a dataset from Kaggle that contains 870,521 articles. Each row holds the full text of an article and a short summary of its main points.

For more information on the dataset, please see the dataset section in `README.md`.

In [20]:
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\devin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [6]:
df = pd.read_csv('data/news_summarization.csv') # expect this step to take about 30 seconds

## Goal 2: Explore the Dataset

We've explored the dataset and found the following information:
- Total of 870,521 articles
- Only 580,013 unique articles (about 1/3 of the articles are duplicates)
- The most frequently occurring words are generic words like "this", "that", "the", "a", "an", etc.
  - Should we remove these words from the dataset?
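
A minimal sketch of how figures like these could be reproduced, run on a toy DataFrame rather than the real CSV. The column names follow the dataset. The `STOP_WORDS` set and the `word_counts` helper are hypothetical and only illustrate the question of whether to filter generic words.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the real dataset (assumption: same column names).
toy = pd.DataFrame({
    'Content': [
        'The cat sat on the mat.',
        'The cat sat on the mat.',
        'A dog chased the cat.',
    ],
    'Summary': ['Cat sits.', 'Cat sits.', 'Dog chases cat.'],
})

# Count rows before and after dropping exact (Content, Summary) duplicates.
total = len(toy)
unique = len(toy.drop_duplicates(subset=['Content', 'Summary']))
duplicate_share = 1 - unique / total

# Illustrative stop-word list (hypothetical and far from complete).
STOP_WORDS = {'the', 'a', 'an', 'this', 'that', 'on'}

def word_counts(texts, stop_words=frozenset()):
    """Count lowercase words with edge punctuation stripped, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            word = word.strip('.,!?;:"\'')
            if word and word not in stop_words:
                counts[word] += 1
    return counts

raw_counts = word_counts(toy['Content'])
filtered_counts = word_counts(toy['Content'], STOP_WORDS)

print(total, unique, round(duplicate_share, 2))
print(raw_counts.most_common(3))
print(filtered_counts.most_common(3))
```

Comparing `raw_counts` with `filtered_counts` shows how much of the frequency ranking is taken up by generic words.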

The dataset contains the following columns:
- `Unnamed: 0` - index, can be ignored
- `ID` - unique ID for each article (appears to be a generated UUID)
- `Content` - the text content of the article
- `Summary` - the summary of the article
- `Dataset` - the dataset that the article came from (XSum, CNN/Daily Mail, Multi-News)

In [8]:
df = df.drop(df.columns[0], axis=1) # unused index column
df.head()

Unnamed: 0,Content,Summary,Dataset
0,New York police are concerned drones could bec...,Police have investigated criminals who have ri...,CNN/Daily Mail
1,By . Ryan Lipman . Perhaps Australian porn sta...,Porn star Angela White secretly filmed sex act...,CNN/Daily Mail
2,"This was, Sergio Garcia conceded, much like be...",American draws inspiration from fellow country...,CNN/Daily Mail
3,An Ebola outbreak that began in Guinea four mo...,World Health Organisation: 635 infections and ...,CNN/Daily Mail
4,By . Associated Press and Daily Mail Reporter ...,A sinkhole opened up at 5:15am this morning in...,CNN/Daily Mail


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 870521 entries, 0 to 870520
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Content  870487 non-null  object
 1   Summary  870521 non-null  object
 2   Dataset  870521 non-null  object
dtypes: object(3)
memory usage: 19.9+ MB


In [10]:
# Let's find the average length of the Content and Summary columns
# (in characters, words, and sentences).
#
# Analyzing the entire dataset is impractical because of its size, so we take a random sample of 1,000 rows.
df_sample = df.sample(1000)

In [11]:
data = [
    df_sample['Content'].apply(len),
    df_sample['Summary'].apply(len),
    df_sample['Content'].apply(lambda x: len(x.split())),
    df_sample['Summary'].apply(lambda x: len(x.split())),
    # rough sentence count: number of '.'-separated pieces
    df_sample['Content'].apply(lambda x: len(x.split('.'))),
    df_sample['Summary'].apply(lambda x: len(x.split('.')))
]

sample = pd.DataFrame(data, index=['Content Length', 'Summary Length', 'Content Words', 'Summary Words', 'Content Sentences', 'Summary Sentences']).T.describe().round(2)

sample.loc['mean']

Content Length       3962.20
Summary Length        312.55
Content Words         671.96
Summary Words          53.93
Content Sentences      39.58
Summary Sentences       4.53
Name: mean, dtype: float64

In [14]:
# Now let's look at the distribution of topics and keywords.
#
# The dataset doesn't provide topics, so as a rough proxy we treat the
# last five words of each summary as its keywords.
df_sample2 = df.sample(10000)
df_sample2['Keywords'] = df_sample2['Summary'].apply(lambda x: x.split()[-5:])

# Now let's list all the keywords and their frequencies.
keywords = []
for i in df_sample2['Keywords']:
    keywords.extend(i)

keywords = pd.Series(keywords).value_counts()

# Let's plot the top 20 keywords as a word cloud.
# keywords = keywords[20:]
wordcloud = WordCloud(width=800, height=400, max_words=20, background_color='white').generate_from_frequencies(keywords)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

NameError: name 'WordCloud' is not defined

## Goal 3: Clean and Preprocess the Data

We'll clean and preprocess the data by removing articles that are duplicates or empty.

I considered removing unnecessary punctuation and filler words, but I decided against it because I think it would be better to leave the data as-is and let the model learn how to handle these words. For example, if we remove the word "the", then the model will not be able to learn that "the" is a filler word and should be ignored.

If the results are not good enough, then we can try removing unnecessary punctuation and filler words.
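
If that fallback is ever needed, it could look roughly like the sketch below. The `FILLER_WORDS` set and the `strip_fillers` helper are hypothetical names introduced for illustration. A real run would more likely use a full stop-word list such as the one shipped with NLTK.

```python
import string

# Hypothetical, deliberately small filler-word list for illustration only.
FILLER_WORDS = {'the', 'a', 'an', 'this', 'that'}

def strip_fillers(text, filler_words=FILLER_WORDS):
    """Remove ASCII punctuation characters and filler words from a string."""
    # Delete every character listed in string.punctuation.
    no_punct = text.translate(str.maketrans('', '', string.punctuation))
    # Keep words whose lowercase form is not in the filler list.
    kept = [w for w in no_punct.split() if w.lower() not in filler_words]
    return ' '.join(kept)

print(strip_fillers('The police, this week, opened an inquiry.'))
```

Note that this also strips apostrophes and hyphens, so "don't" becomes "dont". That side effect is one reason to leave the text untouched at first.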

In [15]:
df = df.drop_duplicates(subset=['Content', 'Summary'])
df = df.dropna(axis=0, how='any')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 580164 entries, 0 to 870518
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   Content  580164 non-null  object
 1   Summary  580164 non-null  object
 2   Dataset  580164 non-null  object
dtypes: object(3)
memory usage: 17.7+ MB


## Goal 4: Tokenize the Text

We'll tokenize the text by splitting it into words, then use the tokenized articles as input sequences and the tokenized summaries as output sequences. To feed these to a model, we'll create a dictionary that maps each word to a unique integer and use it to convert both sequences to integers.

We'll also create a reverse dictionary that maps each integer back to its word, so that model output can be converted back to text.

Finally, we'll create a dictionary that maps each word to the number of times it appears in the dataset, and use it to filter out words that appear fewer than a chosen number of times.
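
A minimal sketch of the three dictionaries described above, on toy token lists. The `<pad>` and `<unk>` tokens, the `MIN_COUNT` threshold, and the helper names are assumptions made for this example, not choices already fixed by the project.

```python
from collections import Counter

# Special tokens and the frequency threshold are assumptions for this sketch.
PAD, UNK = '<pad>', '<unk>'
MIN_COUNT = 2

def build_vocab(token_lists, min_count=MIN_COUNT):
    """Return (word_to_id, id_to_word, counts) for a list of token lists."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    word_to_id = {PAD: 0, UNK: 1}
    # Assign ids in order of decreasing frequency, skipping rare words.
    for word, n in counts.most_common():
        if n >= min_count:
            word_to_id[word] = len(word_to_id)
    id_to_word = {i: w for w, i in word_to_id.items()}
    return word_to_id, id_to_word, counts

def encode(tokens, word_to_id):
    """Map tokens to ids, sending unknown or filtered words to UNK."""
    return [word_to_id.get(t, word_to_id[UNK]) for t in tokens]

def decode(ids, id_to_word):
    """Map ids back to tokens."""
    return [id_to_word[i] for i in ids]

docs = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['a', 'cat']]
word_to_id, id_to_word, counts = build_vocab(docs)

ids = encode(['the', 'bird', 'sat'], word_to_id)
print(ids, decode(ids, id_to_word))
```

Words below the threshold ("dog", "a") and unseen words ("bird") all map to the same `<unk>` id, so decoding does not recover them.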

In [14]:
# split into words
df['Content'] = df['Content'].apply(lambda x: x.split())
df['Summary'] = df['Summary'].apply(lambda x: x.split())

# keep only purely alphabetic tokens
# (note: this drops whole tokens with attached punctuation, e.g. 'pal.')
df['Content'] = df['Content'].apply(lambda x: [i for i in x if i.isalpha()])
df['Summary'] = df['Summary'].apply(lambda x: [i for i in x if i.isalpha()])
df.head()

AttributeError: 'float' object has no attribute 'split'
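
The `AttributeError` above is most likely caused by missing values. `df.info()` reports 870,487 non-null `Content` values out of 870,521 rows, and pandas represents a missing string as a float `NaN`, which has no `.split()` method. A minimal sketch on a toy frame, assuming rows with missing text can simply be dropped before splitting:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing article body (assumption: same column names).
toy = pd.DataFrame({
    'Content': ['Hi, pal.', np.nan, 'Eat your veggies'],
    'Summary': ['Greeting', 'Orphan summary', 'Advice'],
})

# Drop rows where either text column is missing, then split on whitespace.
clean = toy.dropna(subset=['Content', 'Summary']).copy()
clean['Content'] = clean['Content'].str.split()
clean['Summary'] = clean['Summary'].str.split()

print(clean)
```

In this notebook, running the `dropna` cell under Goal 3 before this cell should have the same effect.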

In [43]:
from nltk.tokenize import word_tokenize

data = ['Hi, pal.', 'Eat your veggies']
test_df = pd.DataFrame(data, columns = ['Sentences'])
tokenized_test = test_df['Sentences'].apply(word_tokenize)
tokenized_test

0         [Hi, ,, pal, .]
1    [Eat, your, veggies]
Name: Sentences, dtype: object

In [29]:
# word_tokenize expects a single string, so apply it to each article
# instead of passing it the whole Series; skip missing values first
tokenized_content = df_sample['Content'].dropna().apply(word_tokenize)