## Preface

With an enormous number of items (services) readily available on your favourite online platforms, such as *e-commerce*, *job portals*, *food delivery*, and *music or video streaming*, it is hard and time-consuming to find the item you want quickly. These platforms can help by **recommending** items that match your interests and preferences, simply by analyzing your past interactions or behaviour with the system.  

From **Amazon** to **LinkedIn**, **Uber Eats** to **Spotify**, **Netflix** to **Facebook**, **recommender systems** are widely used to suggest "Similar items", "Relevant jobs", "Preferred foods", "Movies of interest", etc. to their users. 

A **recommender system** that suggests appropriate items helps boost sales, increase revenue, and retain customers, and it also adds a competitive advantage. 
There are basically two kinds of **recommendation** methods:
1. **Content based recommendation**
2. **Collaborative filtering**

**Content based recommendation** is based on the similarity among users/items obtained through their **attributes**. It uses additional information (metadata) about the **users** or **items**, i.e. it relies on what kind of **content** is already available. This metadata could be a **user's demographic information** like *age*, *gender*, *job*, *location*, *skillsets*, etc. Similarly, for **items** it can be *item name*, *specifications*, *category*, *registration date*, etc.

So the core idea is to recommend items by finding the items/users most similar to the given **item/user** based on their **attributes**. 
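As a rough illustration of this idea, the sketch below ranks a few made-up items by how many attribute values they share with a query item. The item names, the attribute sets and the helpers `jaccard` and `most_similar` are hypothetical, and Jaccard similarity is just one simple choice of similarity measure; it is not the method used later in this kernel.

```python
# Hypothetical catalogue: each item is described by a set of attribute values.
items = {
    "article_a": {"politics", "election", "senate"},
    "article_b": {"politics", "election", "debate"},
    "article_c": {"sports", "football", "final"},
}

def jaccard(a, b):
    """Jaccard similarity: shared attributes divided by all attributes."""
    return len(a & b) / len(a | b)

def most_similar(query, catalogue):
    """Return the other items sorted from most to least similar to `query`."""
    scores = {
        name: jaccard(catalogue[query], attrs)
        for name, attrs in catalogue.items()
        if name != query
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

ranking = most_similar("article_a", items)
print(ranking)
```

Here `article_b` shares two of four distinct attribute values with `article_a`, so it ranks above `article_c`, which shares none.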

In this kernel, I am going to discuss **content based recommendation** using the **News category** dataset. The goal is to recommend **news articles** that are similar to an article the user has already read, using attributes like article *headline*, *category*, *author* and *publishing date*.

So let's get started without any further delay.

## Notebook - Table of Contents

1. [**Importing necessary Libraries**](#1.-Importing-necessary-Libraries)   
2. [**Loading Data**](#2.-Loading-Data)  
3. [**Data Preprocessing**](#3.-Data-Preprocessing)  
    3.a [**Fetching only the articles from 2018**](#3.a-Fetching-only-the-articles-from-2018)  
    3.b [**Removing all the short headline articles**](#3.b-Removing-all-the-short-headline-articles)  
    3.c [**Checking and removing all the duplicates**](#3.c-Checking-and-removing-all-the-duplicates)  
    3.d [**Checking for missing values**](#3.d-Checking-for-missing-values)  
4. [**Basic Data Exploration**](#4.-Basic-Data-Exploration)  
    4.a [**Basic statistics - Number of articles,authors,categories**](#4.a-Basic-statistics---Number-of-articles,authors,categories)  
    4.b [**Distribution of articles category-wise**](#4.b-Distribution-of-articles-category-wise)  
    4.c [**Number of articles per month**](#4.c-Number-of-articles-per-month)   
    4.d [**PDF for the length of headlines**](#4.d-PDF-for-the-length-of-headlines)  
5. [**Text Preprocessing**](#5.-Text-Preprocessing)  
    5.a [**Stopwords removal**](#5.a-Stopwords-removal)  
    5.b [**Lemmatization**](#5.b-Lemmatization)  


## 1. Importing necessary Libraries

In [None]:
import numpy as np
import pandas as pd

import os
import math
import time

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px

# Below libraries are for text processing using NLTK
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Below libraries are for feature representation using sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Below libraries are for similarity matrices using sklearn
from sklearn.metrics.pairwise import cosine_similarity  
from sklearn.metrics import pairwise_distances

## 2. Loading Data

In [None]:
news_articles = pd.read_json("/kaggle/input/news-category-dataset/News_Category_Dataset_v2.json", lines = True)

In [None]:
news_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200853 entries, 0 to 200852
Data columns (total 6 columns):
category             200853 non-null object
headline             200853 non-null object
authors              200853 non-null object
link                 200853 non-null object
short_description    200853 non-null object
date                 200853 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.2+ MB


The dataset contains about two hundred thousand records (200,853 to be exact), each with six features. 

In [None]:
news_articles.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26


## 3. Data Preprocessing

### 3.a Fetching only the articles from 2018  

Since the dataset is quite large, processing all of it may take too much time. To avoid this, we consider only the latest articles, those from the year 2018. 

In [None]:
news_articles = news_articles[news_articles['date'] >= pd.Timestamp(2018,1,1)]

In [None]:
news_articles.shape

(8583, 6)

Now, the number of news articles comes down to 8583.

### 3.b Removing all the short headline articles 

After stop word removal, articles with very short headlines may end up with blank headlines. So let's remove all articles whose headline has five words or fewer; the filter below keeps only headlines with more than five words.   

In [None]:
news_articles = news_articles[news_articles['headline'].apply(lambda x: len(x.split())>5)]
print("Total number of articles after removal of headlines with short title:", news_articles.shape[0])

Total number of articles after removal of headlines with short title: 8530


### 3.c Checking and removing all the duplicates

Some articles share exactly the same headline, so let's remove all articles whose headline appears more than once.

In [None]:
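# Reassign instead of sorting in place: news_articles is a filtered slice, and in-place changes on a slice can trigger SettingWithCopyWarning
news_articles = news_articles.sort_values('headline', ascending=False)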
duplicated_articles_series = news_articles.duplicated('headline', keep = False)
news_articles = news_articles[~duplicated_articles_series]
print("Total number of articles after removing duplicates:", news_articles.shape[0])

Total number of articles after removing duplicates: 8485
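Note that `keep=False` marks every copy of a repeated headline, so none of the copies survive. The toy frame below (made-up data) contrasts this with `drop_duplicates`, which keeps the first copy by default. It is only a sketch to show the difference, not a change to the pipeline above.

```python
import pandas as pd

# Made-up headlines; "A" appears twice.
toy = pd.DataFrame({"headline": ["A", "B", "A", "C"]})

# keep=False flags *all* rows whose headline is repeated.
all_copies_removed = toy[~toy.duplicated("headline", keep=False)]

# drop_duplicates keeps the first occurrence of each headline by default.
first_copy_kept = toy.drop_duplicates("headline")

print(all_copies_removed["headline"].tolist())
print(first_copy_kept["headline"].tolist())
```

If one copy of each repeated article should remain available for recommendation, the second variant may be the better fit.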


### 3.d Checking for missing values

In [None]:
news_articles.isna().sum()

category             0
headline             0
authors              0
link                 0
short_description    0
date                 0
dtype: int64

## 4. Basic Data Exploration 

### 4.a Basic statistics - Number of articles,authors,categories

In [None]:
print("Total number of articles : ", news_articles.shape[0])
print("Total number of authors : ", news_articles["authors"].nunique())
print("Total number of unique categories : ", news_articles["category"].nunique())

Total number of articles :  8485
Total number of authors :  892
Total number of unique categories :  26


### 4.b Distribution of articles category-wise

In [None]:
fig = go.Figure([go.Bar(x=news_articles["category"].value_counts().index, y=news_articles["category"].value_counts().values)])
fig['layout'].update(title={"text" : 'Distribution of articles category-wise','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Category name",yaxis_title="Number of articles")
fig.update_layout(width=800,height=700)
fig

From the bar chart, we can observe that the **politics** category has the **highest** number of articles, followed by **entertainment**, and so on.  

### 4.c Number of articles per month

Let's first group the data on a monthly basis using the **resample()** function. 

In [None]:
# 'M' groups by calendar month end; recent pandas releases (2.2+) prefer the alias 'ME'
news_articles_per_month = news_articles.resample('M', on='date')['headline'].count()
news_articles_per_month

date
2018-01-31    2065
2018-02-28    1694
2018-03-31    1778
2018-04-30    1580
2018-05-31    1368
Freq: M, Name: headline, dtype: int64

In [None]:
fig = go.Figure([go.Bar(x=news_articles_per_month.index.strftime("%b"), y=news_articles_per_month)])
fig['layout'].update(title={"text" : 'Distribution of articles month-wise','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Month",yaxis_title="Number of articles")
fig.update_layout(width=500,height=500)
fig

From the bar chart, we can observe that **January** has the **highest** number of articles, followed by **March**, and so on.  

### 4.d PDF for the length of headlines 

In [None]:
fig = ff.create_distplot([news_articles['headline'].str.len()], ["ht"],show_hist=False,show_rug=False)
fig['layout'].update(title={'text':'PDF','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Length of a headline",yaxis_title="probability")
fig.update_layout(showlegend = False,width=500,height=500)
fig

The probability density function of headline length looks roughly like a **Gaussian distribution**, with most headlines between 58 and 80 characters long. 
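Keep in mind that `.str.len()` measures a headline in characters, while the filter in section 3.b counted whitespace-separated words. A quick check on one headline from the sample shown earlier illustrates the difference:

```python
headline = "Hugh Grant Marries For The First Time At Age 57"

n_chars = len(headline)          # what .str.len() reports for this row
n_words = len(headline.split())  # what the word-count filter in 3.b uses

print(n_chars, n_words)
```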

The data preprocessing in Step 3 left us with a subset of the original dataset whose index labels are no longer contiguous, so let's reset the index to run from 0 to the total number of articles minus one. 

In [None]:
news_articles.index = range(news_articles.shape[0])

In [None]:
# Adding a new column containing both day of the week and month, it will be required later while recommending based on day of the week and month
news_articles["day and month"] = news_articles["date"].dt.strftime("%a") + "_" + news_articles["date"].dt.strftime("%b")

Text preprocessing will modify the original headlines, and it doesn't make sense to display modified headlines when recommending articles. So let's copy the dataset and perform text preprocessing on the copy.

In [None]:
news_articles_temp = news_articles.copy()

## 5. Text Preprocessing

### 5.a Stopwords removal

Stop words are not much help in the analysis, and including them also costs processing time, so let's remove them. 

In [None]:
stop_words = set(stopwords.words('english'))

In [None]:
for i in range(len(news_articles_temp["headline"])):
    string = ""
    for word in news_articles_temp["headline"][i].split():
        # Keep only alphanumeric characters, then lowercase the word
        word = "".join(e for e in word if e.isalnum())
        word = word.lower()
        if word not in stop_words:
            string += word + " "
    if i % 1000 == 0:
        print(i)           # To track number of records processed
    news_articles_temp.at[i, "headline"] = string.strip()

0
1000
2000
3000
4000
5000
6000
7000
8000
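The same cleaning step can also be written as a small function and applied to the whole column. This is a sketch under assumptions: `clean_headline` is a hypothetical helper, and `demo_stop_words` is a tiny hand-written set standing in for NLTK's English list so that the snippet runs without downloading any corpus.

```python
# Tiny stand-in for set(stopwords.words('english')); an assumption for this demo.
demo_stop_words = {"the", "for", "at", "a", "in", "of", "and"}

def clean_headline(headline, stop_words):
    """Lowercase, strip non-alphanumeric characters and drop stop words."""
    kept = []
    for word in headline.split():
        word = "".join(ch for ch in word if ch.isalnum()).lower()
        # Skip stop words and tokens that became empty (e.g. a lone "-").
        if word and word not in stop_words:
            kept.append(word)
    return " ".join(kept)

cleaned = clean_headline(
    "Hugh Grant Marries For The First Time At Age 57", demo_stop_words
)
print(cleaned)
```

On the real data this could be applied with `news_articles_temp["headline"].apply(lambda h: clean_headline(h, stop_words))`, which avoids the explicit index loop.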


### 5.b Lemmatization

Let's reduce each word to its base form (lemma) so that different inflections of a word are treated as the same word.

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
for i in range(len(news_articles_temp["headline"])):
    string = ""
    for w in word_tokenize(news_articles_temp["headline"][i]):
        string += lemmatizer.lemmatize(w,pos = "v") + " "
    news_articles_temp.at[i, "headline"] = string.strip()
    if(i%1000==0):
        print(i)           # To track number of records processed

0
1000
2000
3000
4000
5000
6000
7000
8000


Generally, we assess **similarity** based on **distance**: the smaller the **distance** between two items, the higher their **similarity**, and the larger the **distance**, the lower their **similarity**.
To calculate the **distance**, we need to represent each headline as a **d-dimensional** vector. We can then measure **similarity** from the **distance** between these vectors.
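To make the relation between distance and similarity concrete, the sketch below computes the Euclidean distance and the cosine similarity between made-up 3-dimensional vectors using only the standard library. The vectors and the helper names `euclidean_distance` and `cosine_sim` are illustrative assumptions.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [1, 1, 0]
close = [1, 1, 1]   # differs from the query in one coordinate
far = [0, 0, 5]     # shares no non-zero coordinate with the query

print(euclidean_distance(query, close), cosine_sim(query, close))
print(euclidean_distance(query, far), cosine_sim(query, far))
```

The vector at the smaller distance from the query also has the higher cosine similarity, which is the behaviour described above.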

There are multiple methods to represent a **text** as a **d-dimensional** vector, such as **Bag of words**, **TF-IDF**, and **Word2Vec embeddings**. Each method has its own advantages and disadvantages. 
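As a minimal sketch of the first two representations, the snippet below turns three made-up, already-cleaned headlines into bag-of-words and TF-IDF vectors with the scikit-learn classes imported in section 1. The example headlines are assumptions, and the vectorizers are used with their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up headlines, already lowercased and stripped of stop words.
docs = [
    "senate vote budget",
    "senate debate budget bill",
    "team win final",
]

# Bag of words: one column per vocabulary word, values are raw counts.
bow_vectorizer = CountVectorizer()
bow = bow_vectorizer.fit_transform(docs)

# TF-IDF: same columns, but counts are reweighted so that words occurring
# in many headlines contribute less than rare ones.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

print(bow.shape, tfidf.shape)
print(sorted(bow_vectorizer.vocabulary_))
```

Each row is one headline and each column one vocabulary word, so here `d` equals the vocabulary size.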
