News Recommender

In [33]:
pip install plotly

Note: you may need to restart the kernel to use updated packages.


In [34]:
import os
import math
import time
import numpy as np
import pandas as pd
from collections import defaultdict

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px

Import Data

In [35]:
news_data = pd.read_json("News_Category_Dataset_v2.json", lines=True)

In [36]:
news_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200853 entries, 0 to 200852
Data columns (total 6 columns):
authors              200853 non-null object
category             200853 non-null object
date                 200853 non-null datetime64[ns]
headline             200853 non-null object
link                 200853 non-null object
short_description    200853 non-null object
dtypes: datetime64[ns](1), object(5)
memory usage: 9.2+ MB


In [37]:
news_data.head()

Unnamed: 0,authors,category,date,headline,link,short_description
0,Melissa Jeltsen,CRIME,2018-05-26,There Were 2 Mass Shootings In Texas Last Week...,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...
1,Andy McDonald,ENTERTAINMENT,2018-05-26,Will Smith Joins Diplo And Nicky Jam For The 2...,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.
2,Ron Dicker,ENTERTAINMENT,2018-05-26,Hugh Grant Marries For The First Time At Age 57,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...
3,Ron Dicker,ENTERTAINMENT,2018-05-26,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...
4,Ron Dicker,ENTERTAINMENT,2018-05-26,Julianna Margulies Uses Donald Trump Poop Bags...,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ..."


Preprocessing

In [38]:
news_data.shape

(200853, 6)

Checking article headline word lengths

In [39]:
hl_len = defaultdict(int)

for h in news_data['headline']:
    hl_len[len(h.split())] += 1

In [40]:
for k in sorted(hl_len):
    print('{}:{}'.format(k, hl_len[k]))

0:6
1:256
2:1428
3:3332
4:6068
5:9220
6:13183
7:17168
8:21721
9:25259
10:26682
11:24716
12:19607
13:13688
14:8415
15:4910
16:2631
17:1255
18:626
19:296
20:172
21:95
22:50
23:24
24:15
25:6
26:6
27:6
28:6
29:1
30:1
31:1
34:1
38:1
44:1


In [41]:
# Add Graph?
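
One way to fill in the graph suggested above is a plain Matplotlib bar chart of the word-count tally. This is only a sketch: the `headlines` list is a hypothetical stand-in for `news_data['headline']`, and the axis labels and figure size are assumptions, not choices made elsewhere in this notebook.

```python
from collections import defaultdict

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside Jupyter; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical sample standing in for news_data['headline'].
headlines = [
    "Hugh Grant Marries For The First Time At Age 57",
    "What To Watch On Hulu That's New This Week",
    "Of course it has a song",
    "Short title",
]

# Same tally as in the cells above: word count -> number of headlines.
hl_len = defaultdict(int)
for h in headlines:
    hl_len[len(h.split())] += 1

lengths = sorted(hl_len)
counts = [hl_len[k] for k in lengths]

# One bar per distinct word count.
fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(lengths, counts)
ax.set_xlabel("Words in headline")
ax.set_ylabel("Number of articles")
ax.set_title("Headline length distribution")
```

On the full dataset the same code would be fed the `hl_len` dictionary built earlier instead of the toy list.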

In [42]:
# Retain only articles whose headline has more than 5 words
print('Total Articles before removal of short title articles:', news_data.shape[0])
news_data = news_data[news_data['headline'].apply(lambda x: len(x.split()) > 5)]
print('Total Articles after removal of short title articles:', news_data.shape[0])

Total Articles before removal of short title articles: 200853
Total Articles after removal of short title articles: 180543
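
As a quick sanity check, the drop from 200853 to 180543 articles should equal the number of headlines with five or fewer words in the tally printed earlier. A minimal sketch, with the counts copied by hand from that output (the name `short_counts` is only for illustration):

```python
# Counts for headlines of 0..5 words, copied from the tally printed above.
short_counts = {0: 6, 1: 256, 2: 1428, 3: 3332, 4: 6068, 5: 9220}

total_before = 200853
removed = sum(short_counts.values())
total_after = total_before - removed

print(removed)      # 20310
print(total_after)  # 180543
```

The result matches the count reported by the filter cell.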


In [43]:
# Check and remove duplicates

news_data

Unnamed: 0,authors,category,date,headline,link,short_description
0,Melissa Jeltsen,CRIME,2018-05-26,There Were 2 Mass Shootings In Texas Last Week...,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...
1,Andy McDonald,ENTERTAINMENT,2018-05-26,Will Smith Joins Diplo And Nicky Jam For The 2...,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.
2,Ron Dicker,ENTERTAINMENT,2018-05-26,Hugh Grant Marries For The First Time At Age 57,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...
3,Ron Dicker,ENTERTAINMENT,2018-05-26,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...
4,Ron Dicker,ENTERTAINMENT,2018-05-26,Julianna Margulies Uses Donald Trump Poop Bags...,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ..."
5,Ron Dicker,ENTERTAINMENT,2018-05-26,Morgan Freeman 'Devastated' That Sexual Harass...,https://www.huffingtonpost.com/entry/morgan-fr...,"""It is not right to equate horrific incidents ..."
6,Ron Dicker,ENTERTAINMENT,2018-05-26,Donald Trump Is Lovin' New McDonald's Jingle I...,https://www.huffingtonpost.com/entry/donald-tr...,"It's catchy, all right."
7,Todd Van Luling,ENTERTAINMENT,2018-05-26,What To Watch On Amazon Prime That’s New This ...,https://www.huffingtonpost.com/entry/amazon-pr...,There's a great mini-series joining this week.
8,Andy McDonald,ENTERTAINMENT,2018-05-26,Mike Myers Reveals He'd 'Like To' Do A Fourth ...,https://www.huffingtonpost.com/entry/mike-myer...,"Myer's kids may be pushing for a new ""Powers"" ..."
9,Todd Van Luling,ENTERTAINMENT,2018-05-26,What To Watch On Hulu That’s New This Week,https://www.huffingtonpost.com/entry/hulu-what...,You're getting a recent Academy Award-winning ...


In [44]:
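# Reassign instead of sorting in place: news_data is a filtered slice,
# and in-place changes on a slice can trigger SettingWithCopyWarning.
news_data = news_data.sort_values('headline', ascending=False)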
news_data

Unnamed: 0,authors,category,date,headline,link,short_description
36290,"Darin Graham, ContributorJournalist",WOMEN,2017-01-25,"“We Shall Overcomb!” Say The 100,000 Marching ...",https://www.huffingtonpost.com/entry/we-shall-...,Thousands of activists descended on London to ...
21194,"Mycah Hazel, Contributorblogger, equal opportu...",HEALTHY LIVING,2017-07-18,“To The Bone” Didn’t Teach Me Glamour. It Taug...,https://www.huffingtonpost.com/entry/to-the-bo...,"Oftentimes, films or TV shows about eating dis..."
29672,"Dana Brownlee, ContributorPresident of Profess...",BUSINESS,2017-04-10,"“I’m Sorry""--The Two Tragically Forgotten Word...",https://www.huffingtonpost.com/entry/im-sorryt...,"Unfortunately, I was one of those frustrated p..."
199048,,DIVORCE,2012-02-16,‘Your Divorce Ruined My Life' What To Do When ...,https://www.huffingtonpost.comhttp://www.thegl...,It was Sunday night and Lucas’s mother had had...
193783,,DIVORCE,2012-04-13,"‘You Better Sit Down,' By The Civilians, At Fl...",https://www.huffingtonpost.comhttp://theater.n...,"The Civilians, the enterprising troupe special..."
112446,,WOMEN,2014-09-09,‘Yes' Is Better Than ‘No' When It Comes To Con...,https://www.huffingtonpost.com/entry/michael-k...,
123546,,WEIRD NEWS,2014-05-05,‘Worst Mom In The World' Selfies,https://www.huffingtonpost.com/entry/worst-mom...,
2932,Elyse Wanshel,QUEER VOICES,2018-04-02,‘Will & Grace’ Creator To Donate Gay Bunny Boo...,https://www.huffingtonpost.com/entry/will-grac...,It's about to be a lot easier for kids in Mike...
67601,Nina Golgowski,WEIRD NEWS,2016-02-03,‘Wild Boar Curling’ Rescues Stranded Wild Boar...,https://www.huffingtonpost.com/entry/wild-boar...,Get this pig in a blanket!
85380,Lilly Workneh,BLACK VOICES,2015-07-17,‘We’re Never Gonna Forget’: Eric Garner’s Fami...,https://www.huffingtonpost.com/entry/were-neve...,Eric Garner's family share memories of the fam...


In [45]:
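# keep=False flags every row whose headline occurs more than once,
# including the first occurrence.
duplicates = news_data.duplicated('headline', keep=False)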
duplicates

36290     False
21194     False
29672     False
199048    False
193783    False
112446    False
123546    False
2932      False
67601     False
85380     False
25186     False
26210     False
20973     False
26939     False
15705     False
83968     False
35588     False
16236      True
14817      True
74066     False
39284     False
10348     False
19088     False
10318     False
66764     False
85147     False
46537     False
9375      False
4487      False
12525     False
          ...  
79194     False
184553    False
127174    False
93360     False
170481    False
180013    False
25151     False
118713    False
137976    False
120489    False
106321    False
184863    False
60589     False
68627     False
91213     False
115463    False
135037    False
50799     False
147824    False
193086    False
40225     False
120801    False
57464     False
197915    False
151118    False
146670    False
110111    False
194610    False
130009    False
149150    False
Length: 180543, dtype: bool

In [46]:
print('Total Articles before removal of duplicate title articles:', news_data.shape[0])
news_data = news_data[~duplicates]
print('Total Articles after removal of duplicate title articles:', news_data.shape[0])

Total Articles before removal of duplicate title articles: 180543
Total Articles after removal of duplicate title articles: 178760
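
Because the mask was built with `keep=False`, every copy of a repeated headline is dropped, not just the extra ones. The toy example below (hypothetical data, not taken from the dataset) contrasts that with `drop_duplicates(keep="first")`, which would retain one article per headline. Which behaviour is preferable for the recommender is a judgment call; the notebook itself uses the first.

```python
import pandas as pd

# Hypothetical headlines: "A" appears three times, "B" twice, "C" once.
toy = pd.DataFrame({"headline": ["A", "B", "A", "C", "B", "A"]})

# Approach used in this notebook: remove all rows with a repeated headline.
drop_all = toy[~toy.duplicated("headline", keep=False)]

# Alternative: keep the first occurrence of each headline.
keep_first = toy.drop_duplicates("headline", keep="first")

print(drop_all["headline"].tolist())    # ['C']
print(keep_first["headline"].tolist())  # ['A', 'B', 'C']
```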


In [47]:
# Checking for missing values
news_data.isna().sum()

authors              0
category             0
date                 0
headline             0
link                 0
short_description    0
dtype: int64
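
Note that `isna()` only counts true missing values (`NaN`/`None`). The sorted table above shows rows whose `authors` or `short_description` field is an empty string, and those are not counted here. A minimal sketch of a complementary check on toy data (the frame `toy` and the name `blank_counts` are illustrative assumptions):

```python
import pandas as pd

# Hypothetical rows: one empty author, one whitespace-only author,
# and one empty description.
toy = pd.DataFrame({
    "authors": ["Ron Dicker", "", "  "],
    "short_description": ["Of course it has a song.", "", "text"],
})

# isna() sees no missing values because empty strings are valid objects.
na_counts = toy.isna().sum()

# Count cells that are empty or whitespace-only in each text column.
blank_counts = toy.apply(lambda col: col.str.strip().eq("").sum())

print(na_counts.sum())                     # 0
print(blank_counts["authors"])             # 2
print(blank_counts["short_description"])   # 1
```

On the real data the same `apply` could be restricted to the text columns, since `date` is not a string column.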

Data Exploration

In [48]:
news_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 178760 entries, 36290 to 149150
Data columns (total 6 columns):
authors              178760 non-null object
category             178760 non-null object
date                 178760 non-null datetime64[ns]
headline             178760 non-null object
link                 178760 non-null object
short_description    178760 non-null object
dtypes: datetime64[ns](1), object(5)
memory usage: 9.5+ MB


In [49]:
news_data.shape

(178760, 6)

In [50]:
print('Total number of Articles:', news_data.shape[0])

Total number of Articles: 178760


In [51]:
print('Total number of unique authors:', news_data['authors'].nunique())

Total number of unique authors: 24589


In [52]:
print('Total number of unique categories:', news_data['category'].nunique())

Total number of unique categories: 41


Grouping articles category-wise

In [53]:
# replace with Matplotlib?
category_counts = news_data["category"].value_counts()
fig = go.Figure([go.Bar(x=category_counts.index, y=category_counts.values)])
fig.update_layout(
    title={"text": "Distribution of articles category-wise", "y": 0.9, "x": 0.5, "xanchor": "center", "yanchor": "top"},
    xaxis_title="Category name", yaxis_title="Number of articles",
    width=800, height=700,
)
fig

From the bar chart, it is clear that the 'Politics' category has the highest number of articles.

No. of articles per month

In [54]:
# needed?
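
If this cell is kept, one possible way to count articles per month is to group on the `date` column converted to monthly periods. The sketch below uses a small hypothetical frame in place of `news_data`; the name `per_month` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample with the same 'date' dtype as news_data.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-05-26", "2018-05-26", "2018-04-02", "2017-01-25"]),
    "headline": ["h1", "h2", "h3", "h4"],
})

# Convert each timestamp to its calendar month, then count rows per month.
per_month = toy.groupby(toy["date"].dt.to_period("M")).size()

print([str(p) for p in per_month.index])  # ['2017-01', '2018-04', '2018-05']
print(per_month.tolist())                 # [1, 1, 2]
```

The resulting series could then be passed to a bar or line chart in the same way as the category counts.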

Probability density function (PDF) of headline lengths

In [55]:
fig = ff.create_distplot(
    [news_data['headline'].str.len()],  # headline length in characters
    ['ht'],
    show_hist=False,
    show_rug=False
)
fig.update_layout(
    title={
        'text': 'PDF',
        'y': 0.9,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    xaxis_title='Length of a headline (characters)',
    yaxis_title='Probability density',
    showlegend=False,
    width=500,
    height=500
)
fig
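
Note that `.str.len()` measures headline length in characters, whereas the earlier filtering step counted words. A small sketch of the difference, using one headline from the table above (the variable names are illustrative):

```python
import pandas as pd

headlines = pd.Series(["Hugh Grant Marries For The First Time At Age 57"])

# Character count, as used for the density plot.
chars = headlines.str.len()

# Word count, as used when filtering out short headlines.
words = headlines.str.split().str.len()

print(chars.tolist())  # [47]
print(words.tolist())  # [10]
```

Swapping `chars` for `words` in the plotting call would give the density of word counts instead, if that is the quantity of interest.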