# <p style="text-align: center;">News NLP and Summarization</p>
# <p style="text-align: center;">Module-1</p>

# <p style="text-align: center;">Group - 3</p>
### <p style="text-align: center;"> Members: Adedeji Aluko, Ankur Kumar, Apurva Arni, Bora Unalmis, Jack O'Donoghue, Kashin Shah </p>


## Agenda
1. Project Description and Objectives
2. Description of Data and Sources
3. Data Preprocessing
4. Data Analysis
5. Summary and Next Steps

## Project Description and Objective

In a world with an overwhelming amount of information, people often find it hard to keep up with the news cycle. A machine learning model that can adequately summarize news would let readers digest large volumes of information quickly and stay up to date on current events. The challenge of distilling complex news stories into concise, contextual, and accurate summaries presents a fascinating opportunity for advances in artificial intelligence and NLP.

To achieve this goal, we will first use an NLP toolkit to preprocess the data. We will then use the K-Means clustering algorithm to group news articles based on similarities in their text content. Further, we will use supervised classification algorithms to predict which cluster a new article most closely belongs to. Additionally, we will compute pairwise cosine similarity on a user-supplied article to generate an extractive summary that gives the reader a brief overview of the article's key points.
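To make the extractive summarization idea concrete, here is a minimal sketch that scores each sentence by its cosine similarity to every other sentence and keeps the highest-scoring ones. It assumes that similarity is computed between sentences represented as plain word-count vectors; the function names (`sentence_vectors`, `cosine`, `summarize`) and the `n_sentences` parameter are illustrative and are not part of the project code.

```python
import math
import re
from collections import Counter


def sentence_vectors(sentences):
    """Turn each sentence into a bag-of-words Counter of lowercased tokens."""
    return [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(text, n_sentences=2):
    """Keep the sentences that are most similar, in total, to all the others."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vectors = sentence_vectors(sentences)
    scores = [
        sum(cosine(v, other) for j, other in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    # Pick the top-scoring sentences, then restore their original order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


demo_text = (
    "The council approved the budget. The budget funds schools. "
    "Council members debated the budget. Bananas are yellow."
)
demo_summary = summarize(demo_text, n_sentences=2)
```

A sentence that shares no words with the rest of the article scores zero and is dropped first. A production version would likely use TF-IDF weights or embeddings instead of raw counts.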


## Description of Data and Sources

In this project, we combined readily available datasets with scraped data. We used the sitemaps of news websites to scrape articles and merged the results into a single dataset of 8.4 GB. It contains 2,688,878 news articles and essays from 27 different news publications, published between 2016 and 2020. Because the dataset is large and we hit a memory bottleneck, we divided it into thousands of chunks and used a random chunk to perform the exploratory data analysis.
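One way to pick a random chunk without loading the whole file is sketched below. It assumes the number of chunks is known or estimated in advance; `read_random_chunk`, `n_chunks`, and `seed` are hypothetical names, and the in-memory CSV only stands in for the real file.

```python
import io
import random

import pandas as pd


def read_random_chunk(source, n_chunks, chunksize, seed=None):
    """Stream a CSV in chunks and return one chunk chosen at random.

    Only one chunk is held in memory at a time.
    """
    target = random.Random(seed).randrange(n_chunks)
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        if i == target:
            return chunk
    raise ValueError("n_chunks is larger than the number of chunks in the file")


# Tiny in-memory stand-in for the real CSV (10 rows, read 2 rows at a time)
demo_csv = io.StringIO(
    "title,article\n" + "\n".join(f"t{i},body {i}" for i in range(10))
)
sample = read_random_chunk(demo_csv, n_chunks=5, chunksize=2, seed=0)
```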

The data included in every dataframe are:

* **date** date the article was published, in yy-mm-dd format
* **year** year the article was published
* **month** month the article was published (stored as a float)
* **day** day the article was published
* **author** author of the article
* **title** title of the article
* **article** the body of the article

![Screenshot%202023-02-13%20at%209.39.05%20PM.png](attachment:Screenshot%202023-02-13%20at%209.39.05%20PM.png)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#%pip install wordcloud
from wordcloud import WordCloud, STOPWORDS

# Read the CSV lazily and keep only the first non-empty 500,000-row chunk
for chunk in pd.read_csv('/Users/ankur/Downloads/all-the-news-2-1.csv', chunksize=500000):
    df = chunk  # each chunk is already a DataFrame
    if not df.empty:
        break
df.head(5)


Before cleaning the data, we checked the shape of the dataframe chunk. With `chunksize=500000`, we read 500,000 rows of the dataset at a time in a single chunk. As the shape shows, the chunk has 500,000 rows and 10 columns. The exploratory data analysis that follows was performed on this 500,000-row chunk. In the steps below, we clean the dataset before performing the exploratory analysis.

In [None]:
df.shape

### Data Pre-processing (Data Cleaning)

For the data cleaning, we followed the steps below; each step is shown with its code in this subsection.
1. We started off by reorganizing the columns to promote clarity and comprehension of each row, and omitted any unnecessary columns.
2. Then a column was created to hold the length of each article.

In [None]:
df = df[['title', 'author', 'article', 'year', 'month', 'day', 'section', 'publication']]

In [None]:
# Article length in characters (Series.str.len() counts characters, not words)
df['len_article'] = df.article.str.len()

In [None]:
df.isnull().sum()

In [None]:
# Drop every row that contains at least one NaN and renumber the index
# (len_article was already computed above)
df_new = df.dropna().reset_index(drop=True)

3. The dataset was checked for null values, and any rows containing NaN were deleted.

In [None]:
df_new.head()

In [None]:
print("The original dataset has shape of", df.shape)
print("The new dataset has shape of", df_new.shape)

We found that rows with NaN values made up 72% of this chunk. Because such a large proportion of the original rows contain NaN values, we used both the original and the cleaned dataframes selectively to keep the exploratory data analysis comprehensive.
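A share like the one above can be checked with a small helper. This is a sketch on a toy dataframe rather than the project data, and `null_report` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def null_report(frame):
    """Return the share of nulls per column and the share of rows with any null."""
    per_column = frame.isnull().mean()
    rows_with_null = frame.isnull().any(axis=1).mean()
    return per_column, rows_with_null


# Toy data: three of the four rows contain at least one missing value
demo = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "author": ["x", np.nan, np.nan, "y"],
    "section": [np.nan, np.nan, "tech", "tech"],
})
per_column, rows_with_null = null_report(demo)
```

The per-column shares also show which columns drive the loss, which helps decide whether to drop rows or drop a sparsely filled column instead.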

In [None]:
df.groupby('publication').article.count()

In [None]:
import seaborn as sns

### Word Count

Keeping the main objective of this project in mind, we use two datasets (one with minimal cleaning and one with no NaN values) to understand the distribution of the word count and the effect that removing all NaN rows has on it.

In [None]:
# Roughly bell-shaped distribution with a positive (right) skew
plt.figure(figsize=(16,8))
mean = df.len_article.mean()
plt.axvline(mean, ls='--', color='black')  # dashed vertical line at the mean
sns.histplot(df.len_article)
plt.xlim([0, 12500])
#plt.ylim([0, 5000])
plt.xlabel('Word Count')
plt.ylabel('Number of Articles')
plt.title(f'Distribution of Word Count (Mean: {mean:.0f} words)', fontsize=18);

The dashed vertical line marks the mean of 3152, crossing the x-axis at that value. It serves as a reference point for the average article length. Note that `len_article` is computed with `str.len()`, which counts characters rather than words. The histogram also shows the spread of article lengths, from the shortest to the longest articles in the chunk.

In [None]:
# Roughly bell-shaped distribution with a positive (right) skew
plt.figure(figsize=(16,8))
mean = df_new.len_article.mean()
plt.axvline(mean, ls='--', color='black')  # dashed vertical line at the mean
sns.histplot(df_new.len_article)
plt.xlim([0, 12500])
#plt.ylim([0, 5000])
plt.xlabel('Word Count')
plt.ylabel('Number of Articles')
plt.title(f'Distribution of Word Count with Cleaner Dataset (Mean: {mean:.0f} words)', fontsize=18);

The graph shows how often each word count value occurs in the cleaned set of articles. The x-axis represents the word count of an article, and the y-axis represents the number of articles with that word count.

Both distributions are roughly bell-shaped with a heavy positive skew. Interestingly, the mean for the cleaner dataset (4218) is greater than that of the dataset with minimal cleaning (3152). We speculate that this is because many of the removed rows had a length below 2000, as can be seen in the first figure.

### Word Cloud


Our article data is rich in content on a variety of topics from several different publications. To gain insight into what this content focuses on, we decided to create word clouds. Word clouds are visualizations that reveal essential details about the text being analyzed. The frequency of a word in the text determines its size in the visualization. This gives the audience an overall sense of the text and is a good indicator of the themes that have been talked about.

The word clouds use the cleaned dataframe built after dropping the null values.
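The counting step behind a word cloud can be sketched with the standard library alone. This illustrates the idea and is not the `wordcloud` package's own implementation; `word_frequencies` and the sample sentences are made up for the example.

```python
import re
from collections import Counter


def word_frequencies(texts, stopwords=frozenset()):
    """Count how often each non-stopword token appears across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


demo_counts = word_frequencies(
    ["Markets rallied as markets reopened.", "The markets said little."],
    stopwords={"the", "as", "said"},
)
top_word, top_count = demo_counts.most_common(1)[0]
```

The most frequent tokens are the ones a word cloud would draw largest, so inspecting `most_common` is a quick way to sanity-check a cloud.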

In [None]:
# Extra stopwords: filler words and publication names that would otherwise dominate the clouds
words = ['said', 'S', 'Vice', 'Reuters', 'The Verge', 'now', 'told', 'week', 'would', 'new', 'take','year', 'will', 'much', 'time', 'one']

STOPWORDS.update(words)

In [None]:
publications = df_new.publication.unique()

for publication in publications:
    print(publication)

For the chunk of data that we analyzed, we decided to segregate articles by publication to get a sense of the themes and topics that most articles focus on for each publication. Three unique publications were found in the cleaned dataframe for this chunk. The first word cloud is generic and is created from the articles of all the publications. As expected, this word cloud has mixed themes. The themes to note here are politics, business, everyday issues, and emerging events.

In [None]:
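# Join all articles into one string; str() on a large NumPy array returns a truncated repr
text = ' '.join(df_new['article'].astype(str))

wordcloud = WordCloud(stopwords=STOPWORDS).generate(text)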
plt.figure(figsize=(16,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

The first publication was Vice. It promotes itself as covering stories that are not well covered by other publications. Based on the word cloud, the themes to note here are astrology and mental health. This was an interesting find because Vice covers several other themes such as arts, culture, and general news, but astrology and spirituality stood out for this period of time. Other news seems to be rare based on this word cloud.

In [None]:
df_vice = df_new[df_new['publication'] == 'Vice']
df_reuters = df_new[df_new['publication'] == 'Reuters']
df_verge = df_new[df_new['publication'] == 'The Verge']

In [None]:
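# Join all Vice articles into one string; str() on a large NumPy array returns a truncated repr
text = ' '.join(df_vice['article'].astype(str))

wordcloud = WordCloud(stopwords=STOPWORDS).generate(text)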
plt.figure(figsize=(16,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

The second publication was Reuters. Reuters, established in London, is one of the largest news agencies in the world. It provides current business, financial, national, and international news. Based on the word cloud, the articles clearly had themes of politics, business, and finance. Many of the articles also seem to focus on national news in England.

In [None]:
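# Join all Reuters articles into one string; str() on a large NumPy array returns a truncated repr
text = ' '.join(df_reuters['article'].astype(str))

wordcloud = WordCloud(stopwords=STOPWORDS).generate(text)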
plt.figure(figsize=(16,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

The third publication was The Verge. It is an American technology news website that features product reviews, consumer news, feature stories, etc. The word cloud has themes in line with this: current news, consumer news and reviews, product information, and other topics popular in the time period.

In [None]:
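# Join all Verge articles into one string; str() on a large NumPy array returns a truncated repr
text = ' '.join(df_verge['article'].astype(str))

wordcloud = WordCloud(stopwords=STOPWORDS).generate(text)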
plt.figure(figsize=(16,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

In [None]:
df['year'].value_counts()

## Publication Date Analysis

#### a) Number of Articles distributed based on the Year
As discussed above, we split our data into multiple chunks. The EDA plots below are based on a 500,000-row chunk of the raw data file. The plot below shows the number of articles published in each year from 2016 to 2019 in the chunk used for the analysis. The most articles were published in 2017, with more than 150000 articles. The fewest were published in 2019, with close to 80000 articles. The number of articles fell by ~13.3% from 2017 to 2018 and by a comparatively high 36% from 2018 to 2019.

As a future step, we would like to analyse how this trend looks for the rest of the chunks. That analysis is outside the scope of this project.

In [None]:
plt.rcParams['figure.figsize'] = [16, 8]
sns.set(font_scale = 1.2, style = 'whitegrid')
sns_year = sns.countplot(x=df['year'], palette = sns.color_palette("crest"))
sns_year.set(xlabel = "Year", ylabel = "Articles", title = "Number of the articles per year")

#### b) Articles published per month in a year

In our analysis, June has the highest number of published articles, with 58997. January, February, March, April, May, and July each have between 43000 and 48000 articles. From August to December, the number of articles drops to ~30000 per month.

In [None]:
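from collections import Counter

# Count articles per month. bar_charts is the helper defined in a later cell
# (under "Publications Source Analysis"), so that cell must be run first.
month = df['month']
count = Counter(month)
count = pd.Series(count).sort_values(ascending=False)

bar_charts('Distribution of articles released per month', 'Month of the Year', 'Number of articles',
           count, count.keys(), (15, 10), 1, 18)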

#### c) Articles published per day in a month

For the chunk used for EDA, the highest number of articles is published in the second week of each month (especially on the 12th). The plot below shows that the fewest articles are published on the 31st. The reason is that not every month has 31 days: February, April, June, September, and November do not. Similarly, February never has a 30th or 31st, and within the years covered by this chunk it has a 29th only in 2016, so fewer articles fall on those days. Otherwise, almost every day of the month has close to or more than 15000 articles.
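The calendar effect described above can be quantified with the standard library. The sketch below counts how many times each day of the month occurs between 2016 and 2019, the year range of this chunk; `day_of_month_occurrences` is a hypothetical helper name.

```python
import calendar
from collections import Counter


def day_of_month_occurrences(first_year, last_year):
    """Count how many calendar dates fall on each day-of-month number."""
    counts = Counter()
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            # monthrange returns (weekday of the 1st, number of days in the month)
            days_in_month = calendar.monthrange(year, month)[1]
            counts.update(range(1, days_in_month + 1))
    return counts


occurrences = day_of_month_occurrences(2016, 2019)
```

Dividing the article count for each day of the month by `occurrences[day]` would give a per-date average that is comparable across days, including the 29th to 31st.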

In [None]:
plt.rcParams['figure.figsize'] = [17, 9]
sns.set(font_scale = 1, style = 'whitegrid')
df_source = df.day.value_counts()
sns.barplot(x=df_source.index, y=df_source, palette = sns.color_palette("crest"))
plt.ylabel('Number of Articles', fontsize=15)
plt.xlabel('Days of the Month', fontsize=15)
plt.xticks(rotation=0)
plt.title('Distribution of Articles Per Day', fontsize=15);

In [None]:
df_new['author'].value_counts()

### Publications Source Analysis
#### a) Top 10 Authors based on the number of publications
In the plot below, we identify the top 10 authors by the number of articles they published. The top entry is Axios, with between 4000 and 5000 articles in the 500,000-row chunk of the raw dataset used for this analysis. Sarah Perez, the second highest, published about 1600 fewer articles than Axios. The other 8 authors and their article counts are shown in the plot below.

In [None]:
plt.rcParams['figure.figsize'] = [16,8]
sns.set(font_scale = 1.2, style = 'whitegrid')
df_author = df.author.value_counts().head(10)
sns.barplot(x=df_author, y=df_author.index,palette = sns.color_palette("rocket_r"))
plt.ylabel('Authors', fontsize=12)
plt.xlabel('Number of Articles', fontsize=12)
#plt.legend(fontsize=14)
plt.title('Top 10 Authors', fontsize=14);
#sns_year.set( ylabel = "Number of Articles", title = "Top 10 Authur")

In [None]:
# List the distinct publications in the cleaned dataframe
df_new['publication'].unique()

In [None]:
def bar_charts(title, x, y, values, keys, figsize, color_num, fontsize=12):
    """Draw a labelled bar chart; color_num == 1 selects the "crest" palette, otherwise "rocket_r"."""
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    if color_num == 1:
        colors = sns.color_palette("crest")
    else:
        colors = sns.color_palette("rocket_r")
    bars = plt.bar(keys, values, color=colors)

    # Annotate each bar with its own value (taken from `values`, not the global `count`)
    for bar, value in zip(bars, values):
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2, height, value, ha='center', va='bottom',
                 fontsize=fontsize)

    plt.title(title, fontsize=fontsize)
    plt.xlabel(x, fontsize=fontsize)
    plt.ylabel(y, fontsize=fontsize)
    plt.xticks(rotation=45)
    plt.show()


#### b) Articles sourced based on publications

The plot below shows the number of articles per publication in the 500,000-row chunk of the raw data file. Reuters published the most articles, about 119988. Vice is second with 93980 articles, ~21.7% fewer than Reuters. Refinery 29 is third with 80073 articles, followed by TechCrunch, TMZ, Vox, and Axios, each with between 40000 and 45000 articles. We treat Vice News and Vice as two separate entities in our analysis because their articles are published separately. Another interesting observation is that, in the chunk used for EDA, the publication "The Verge" has only 2 articles.

As a next step for future out-of-scope analysis, we would like to check whether "The Verge" also has the fewest articles across all other chunks. If so, we would like to dig deeper into how large the gap is and why so few of its articles appear in these years.
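The relative gap between the two largest publications can be recomputed directly from the counts quoted above. `percent_lower` is a hypothetical helper, and the gap is expressed relative to the larger count.

```python
def percent_lower(larger, smaller):
    """Return how much lower `smaller` is than `larger`, as a percentage of `larger`."""
    return (larger - smaller) / larger * 100


# Article counts for Reuters and Vice as reported in the text
reuters_vs_vice = percent_lower(119988, 93980)
```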

In [None]:
from collections import Counter
count = Counter(df['publication'])
count = pd.Series(count).sort_values(ascending=False)
#print(count)
keys = list(count.keys())
#print(keys)
bar_charts('Articles per source', 'Publication', 'Number of articles', count, keys,
          (20, 13), 2, 18)

#### c) Top 10 sections in the published articles

One of the interesting analyses we performed looks at the sections under which the articles we sourced were published. "Tech" is the most common section, which suggests that readers of these publications find it the most engaging subject, while "Health" is the least covered of the top sections. The number of articles written on Tech is therefore relatively higher than on Politics or Health.

In [None]:
section = df_new['section']
count = Counter(section)
count = pd.Series(count).sort_values(ascending=False).head(10)  # keep the ten most frequent sections

bar_charts('Top 10 Sections of Published Articles', 'Sections', 'Number of articles', count, count.keys(), (16, 8), 2, 16)

### Sentiment Analysis

The goal of the sentiment analysis is to gauge the tone of each article.
Our initial method uses Python's Natural Language Toolkit (NLTK) to remove stopwords, strip, standardize, and lemmatize every article into a list of 'words'. These lists are then parsed, and the numbers of 'negative' and 'positive' words are counted. The word lists come from txt files downloaded from http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html, which references the following papers:<br>

Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.<br>

Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.<br>

Bing Liu. "Sentiment Analysis and Subjectivity." A chapter in Handbook of Natural Language Processing, Second Edition, (editors: N. Indurkhya and F. J. Damerau), 2010.<br>

The sentiment score is then calculated as the number of positive words minus the number of negative words, normalized by the total number of words in the preprocessed article. A positive value thus conveys a positive article tone. However, a positive word does not necessarily imply a positive tone, and words can carry differing levels of positivity. As we continue work on this project, we therefore intend to explore VADER (Valence Aware Dictionary and sEntiment Reasoner) tools to find a more accurate scaled value for polarity from negative and positive valence: https://www.analyticsvidhya.com/blog/2021/12/different-methods-for-calculating-sentiment-score-of-text/.
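As a small worked example of the lexicon-based score used in this section, the sketch below computes (positive count - negative count) / total words for one tokenized article. The word sets are toy stand-ins for the downloaded lexicon files, and `sentiment_score` is a hypothetical helper that mirrors the column arithmetic in the cells further down.

```python
def sentiment_score(tokens, pos_words, neg_words):
    """Net share of positive words: (positive - negative) / total tokens."""
    if not tokens:
        return 0.0  # avoid dividing by zero for empty articles
    pos = sum(1 for t in tokens if t in pos_words)
    neg = sum(1 for t in tokens if t in neg_words)
    return round((pos - neg) / len(tokens), 2)


# Toy lexicons (assumptions for illustration, not the real word lists)
toy_pos = {"good", "strong", "win"}
toy_neg = {"bad", "weak", "loss"}

# One positive and two negative words out of six tokens
score = sentiment_score(
    ["strong", "quarter", "despite", "one", "bad", "loss"], toy_pos, toy_neg
)
```

Because the score is divided by the article length, long articles with only a few emotive words land close to zero.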

In [None]:
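import nltk
# Resources used by the preprocessing below: stopword list, tokenizer models, WordNet data
# (recent NLTK releases may also need 'punkt_tab' for word_tokenize)
for resource in ['stopwords', 'punkt', 'wordnet', 'omw-1.4']:
    nltk.download(resource)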

In [None]:
csv_path = '/Users/ankur/Downloads/all-the-news-2-1.csv'

# First non-empty 50,000-row chunk
for chunk in pd.read_csv(csv_path, chunksize=50000):
    df_sen = chunk
    if not df_sen.empty:
        break

# A second chunk from further into the file (the loop stops after reading the 19th chunk)
n = 1
for chunk in pd.read_csv(csv_path, chunksize=50000):
    dfx = chunk
    n = n+1
    if n == 20:
        break

df_sen.head()

In [None]:
# Combine the two chunks into one dataframe for the sentiment analysis
frames = [df_sen, dfx]
df_sen = pd.concat(frames)
df_sen.head()

### Sentiment Analysis by Publication

By separating the publication sources, we can assess the sentiment of their articles. We do not want to simply consider the average/mean tone of their pieces, but we can use that to sort by publication. Then a stacked ridge plot can be used to highlight the distribution of their articles' sentiment scores. What is clear from the plot is that the sentiment scores are mostly close to zero, and follow a relatively normal distribution. With a larger chunk of the dataset however, we may find more insights into how polarising certain sources are. We also may learn more when using the VADER tools to define sentiment scores. <br>

Our hope is that by understanding the polarity and sentiment, we can train our deep learning model to summarize articles while maintaining the key points and the overall sentiment. Furthermore, the absolute count of negative and positive words (emotive language) may allow our model to somewhat mirror the style of writing of that publication.

In [None]:
df_sen.groupby('publication')['month'].mean()

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

lemma = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # set membership is faster than a list

def text_prep(x: str) -> list:
    """Lowercase, keep letters only, tokenize, drop stopwords, and lemmatize."""
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    return [lemma.lemmatize(w) for w in words]

In [None]:
preprocess_tag = [text_prep(i) for i in df_sen['article']]
df_sen["preprocess_txt"] = preprocess_tag
df_sen['total_len'] = df_sen['preprocess_txt'].map(lambda x: len(x))

In [None]:
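def load_lexicon(path):
    """Read one word per line, skipping blank lines and lines starting with ';'
    (assumed to be the comment header of the lexicon files)."""
    # latin-1 can decode any byte, which avoids errors on non-UTF-8 entries
    with open(path, 'r', encoding='latin-1') as file:
        return {line.strip() for line in file if line.strip() and not line.startswith(';')}

neg_words = load_lexicon('/Users/ankur/Downloads/negative-words.txt')
pos_words = load_lexicon('/Users/ankur/Downloads/positive-words.txt')
# Sets make the membership tests below fast
df_sen['pos_count'] = df_sen['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df_sen['neg_count'] = df_sen['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
# Net share of positive words: (positive - negative) / total words
df_sen['sentiment'] = round((df_sen['pos_count'] - df_sen['neg_count']) / df_sen['total_len'], 2)
df_sen.head()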

In [None]:
df1 = df_sen.groupby('publication')['sentiment'].mean()
df1 = pd.DataFrame(df1.sort_values())
pubs = df1.index.tolist()
df_sen['publication'] = pd.Categorical(df_sen['publication'], categories = pubs)
df_sen = df_sen.sort_values(by=['publication'])
df1.head(10)

In [None]:
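# Map mean sentiment to an index into a 100-colour red-to-green palette,
# clipping so that values outside the expected range stay within bounds
df1['color'] = ((df1['sentiment']/0.05)*100+50).clip(0, 99)
df1['color'] = df1['color'].astype(int)
palette = sns.color_palette("RdYlGn", 100)
palette = [palette[i] for i in df1['color']]
df1.head()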

In [None]:
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})

g = sns.FacetGrid(data=df_sen, row="publication", hue="publication", palette=palette, aspect=10, height=1.5)

# Draw the densities: a filled KDE per publication plus a grey outline
# (x= is passed by keyword because map_dataframe supplies data= itself)
g.map_dataframe(sns.kdeplot, x="sentiment",
                bw_adjust=0.8, clip_on=False, fill=True, alpha=0.9, linewidth=5, multiple='layer')
g.map_dataframe(sns.kdeplot, x="sentiment", bw_adjust=0.8, color='grey')
g.map(plt.axhline, y=0, lw=1, clip_on=False, color='black')
# Faint vertical reference lines at fixed sentiment scores
for x_ref in (-0.1, -0.05, 0, 0.05, 0.1, 0.15):
    g.map(plt.axvline, x=x_ref, lw=0.25, clip_on=False, color='grey')

# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, 0.2, x.iloc[0], fontweight="bold", color='black',
           ha="left", va="center", transform=ax.transAxes)

g.map(label, "publication")

# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-0.55)

# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[], xlabel="Article Aggregate Sentiment Score", ylabel="")
g.despine(bottom=False, left=True)
plt.xlim(-0.15,0.15)


plt.show()

### Sentiment Analysis by Date

Another idea is to assess the trend in polarity and sentiment over time. We cannot tell much from the diagram below, and the high average sentiment scores at the end of the graph are simply due to the small sample of articles available from those dates. However, we do anticipate notable trends over time as politics grew more divisive from 2016 to 2021, and as the COVID-19 pandemic took over the news stream in 2020.
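One possible way to keep thinly sampled dates from distorting the trend is to compute the daily mean together with the number of articles and drop days below a threshold. The sketch below assumes `date` and `sentiment` columns like those in `df_sen`; the `min_articles` threshold and the helper name `daily_sentiment` are illustrative choices, not part of the original analysis.

```python
import pandas as pd


def daily_sentiment(frame, min_articles=30):
    """Mean sentiment per date, keeping only dates with enough articles."""
    daily = frame.groupby("date")["sentiment"].agg(["mean", "count"])
    return daily[daily["count"] >= min_articles]


# Toy data: the second date has a single, unusually positive article
demo = pd.DataFrame({
    "date": ["2016-01-01"] * 3 + ["2016-01-02"],
    "sentiment": [0.0, 0.02, 0.04, 0.30],
})
trend = daily_sentiment(demo, min_articles=2)
```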

In [None]:
from datetime import datetime
import plotly.graph_objs as go
import plotly.express as px

df_sen['new_date'] = pd.to_datetime(df_sen['date']).dt.date
df2 = df_sen.groupby('new_date')['sentiment'].mean()
df2 = pd.DataFrame(df2)
df2['negative'] = df_sen.groupby('new_date')['neg_count'].mean()
df2['positive'] = df_sen.groupby('new_date')['pos_count'].mean()
px.line(df2, x = df2.index, y = df2.columns)

### Summary and Next Steps

In this module, we found that working with a big dataset (8.8 GB) like ours requires specialized tooling such as SQLite, because loading such a large dataset and running loops over it causes a memory bottleneck. Further, through visualizations such as the distribution of articles by date and the top sections of the articles by publisher, we gained a global view and understanding of the dataset.
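A minimal sketch of the SQLite route, assuming the CSV is streamed in chunks and appended to a single table, might look as follows. The table name `articles`, the helper `csv_to_sqlite`, and the in-memory database are assumptions for illustration; the real dataset would use a database file on disk.

```python
import io
import sqlite3

import pandas as pd


def csv_to_sqlite(source, connection, table="articles", chunksize=50_000):
    """Append a CSV to a SQLite table chunk by chunk and return the row count."""
    rows = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        chunk.to_sql(table, connection, if_exists="append", index=False)
        rows += len(chunk)
    return rows


connection = sqlite3.connect(":memory:")
demo_csv = io.StringIO("publication,title\nReuters,a\nVice,b\nReuters,c\n")
loaded = csv_to_sqlite(demo_csv, connection, chunksize=2)

# Aggregations then run inside SQLite instead of in a Python loop
per_publication = pd.read_sql_query(
    "SELECT publication, COUNT(*) AS n FROM articles "
    "GROUP BY publication ORDER BY n DESC",
    connection,
)
```

Once the data is in a table, counts and filters can be pushed down to SQL, so only the aggregated result has to fit in memory.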

Furthermore, by running sentiment and polarity analysis, we gained valuable insights, such as Reuters having largely negative sentiment and entertainment publishers like Mashable having positive sentiment. These insights will help us architect our machine learning models to closely maintain the context and sentiment of an article in its summary.

As a next step, we will train several clustering algorithms to categorize groups of news articles, and supervised classification algorithms to predict which cluster a new article belongs to. By benchmarking these algorithms against each other, we will be able to identify the best-performing model.