<a href="https://colab.research.google.com/github/joocahyadi/NLP_Recommendation_System/blob/main/NLP_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing Libraries

First of all, let's import all the libraries that we're going to use.

In [None]:
import pandas as pd
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Importing Dataset

In [None]:
data = pd.read_csv('/content/drive/MyDrive/Data Science Projects/NLP - Recommendation System/CNN_Articels_clean.csv')

We're going to use only the "Article text" column for this project.

In [None]:
data = data[['Article text']]

In [None]:
data.head()

Unnamed: 0,Article text
0,"(CNN)Right now, there's a shortage of truck d..."
1,(CNN)Working in a factory or warehouse can me...
2,"(CNN)In a Hong Kong warehouse, a swarm of aut..."
3,"New York (CNN Business)For many years, the wor..."
4,The European Union formally approved on Tuesda...


In [None]:
# Check the length of the dataset
len(data)

4076

Let's take a look at one random article.

In [None]:
data.sample(1).iloc[0]['Article text']



# Checking if the Data Have Null Value

Let's check for two kinds of missing data: NaN/null values and blank (empty or whitespace-only) entries.

In [None]:
# For the null value
data.isnull().sum()

Article text    0
dtype: int64

In [None]:
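# For the blank entries
blanks = []

for i, text in data.itertuples():
  # Use isinstance(): comparing type(text) with the string 'str' is always False
  # not text.strip() catches both empty and whitespace-only strings
  if isinstance(text, str) and not text.strip():
    blanks.append(i)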

blanks

[]

We can conclude that there aren't any null values or blank entries in the text.
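
The same two checks can also be written without an explicit loop. The sketch below is a minimal illustration on a small made-up frame (`toy` and its contents are assumptions, not rows from the CNN dataset). It counts missing values and collects the index of empty or whitespace-only strings separately.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({'Article text': ['Some article', '   ', '', np.nan, 'Another one']})

col = toy['Article text']

# Missing values (NaN/None)
n_null = int(col.isnull().sum())

# Blank entries: strings that are empty or contain only whitespace.
# NaN is filled with a placeholder first so it is not counted as blank.
is_blank = col.fillna('placeholder').str.strip().eq('')
blank_index = col[is_blank].index.tolist()

print(n_null)       # 1
print(blank_index)  # [1, 2]
```

On the real data, the same expressions would be applied to `data['Article text']` instead of the toy column.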

# Text Preprocessing

We're going to use spaCy for text preprocessing.

First, let's load the "en_core_web_sm" model.

In [None]:
nlp = spacy.load('en_core_web_sm')

Let's apply the nlp model to create a spaCy Doc for each text.

In [None]:
data['Lemmatized Article Text'] = data['Article text'].apply(nlp)

In [None]:
data.head()

Unnamed: 0,Article text,Lemmatized Article Text
0,"(CNN)Right now, there's a shortage of truck d...","( , (, CNN)Right, now, ,, there, 's, a, shorta..."
1,(CNN)Working in a factory or warehouse can me...,"( , (, CNN)Working, in, a, factory, or, wareho..."
2,"(CNN)In a Hong Kong warehouse, a swarm of aut...","( , (, CNN)In, a, Hong, Kong, warehouse, ,, a,..."
3,"New York (CNN Business)For many years, the wor...","(New, York, (, CNN, Business)For, many, years,..."
4,The European Union formally approved on Tuesda...,"(The, European, Union, formally, approved, on,..."


Create the preprocess function, which removes punctuation and stop words and then lemmatizes each remaining word.

In [None]:
def preprocess(text):
  # Remove punctuation
  tokens_no_punct = [token for token in text if not token.is_punct]

  # Remove stop words
  tokens_no_punct_stop = [token for token in tokens_no_punct if not token.is_stop]

  # Lemmatize each word 
  tokens_lemma = [token.lemma_ for token in tokens_no_punct_stop]

  # Join the lemmas back into a single string
  text_lemma = ' '.join(tokens_lemma)

  return text_lemma

Apply the preprocess function to the "Lemmatized Article Text" column.

After applying the function, let's also call lower() to convert every token (word) to lowercase.

In [None]:
data['Lemmatized Article Text'] = data['Lemmatized Article Text'].apply(preprocess)

In [None]:
data['Lemmatized Article Text'] = data['Lemmatized Article Text'].apply(lambda text: text.lower())

Here's the current overview of our dataset.

In [None]:
data.head()

Unnamed: 0,Article text,Lemmatized Article Text
0,"(CNN)Right now, there's a shortage of truck d...",cnn)right shortage truck driver worldwide ex...
1,(CNN)Working in a factory or warehouse can me...,cnn)worke factory warehouse mean task repeti...
2,"(CNN)In a Hong Kong warehouse, a swarm of aut...",cnn)in hong kong warehouse swarm autonomous ...
3,"New York (CNN Business)For many years, the wor...",new york cnn business)for year world popular e...
4,The European Union formally approved on Tuesda...,european union formally approve tuesday new ba...


Let's take a look at 5 random samples of the current dataset.

In [None]:
data.sample(5)

Unnamed: 0,Article text,Lemmatized Article Text
2050,London (CNN)Lawmakers from the UK's governing ...,london cnn)lawmaker uk govern conservative par...
3109,London (CNN)E-cigarettes could be prescribed b...,london cnn)e cigarette prescribe england natio...
1867,London (CNN)The Scottish National Party is pro...,london cnn)the scottish national party promise...
255,Asutosh Padhi is McKinsey & Company's managing...,asutosh padhi mckinsey company managing partne...
2797,(CNN)Here's some background information about...,cnn)here background information berlin wall ...


# Building TF-IDF

Now that preprocessing is done, we're ready to build the recommender system.

First, we need to turn each article into a numeric vector. Here, I'm using the Term Frequency - Inverse Document Frequency (TF-IDF) method.

I'm using the max_df and min_df arguments to drop tokens (words) that appear too often (in more than 95% of the texts) or too rarely (in only 1 text).

In [None]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
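
To see what these two thresholds do, here is a minimal sketch on a made-up four-document corpus (`toy_corpus` and the `toy_` names are assumptions for illustration only). With 4 documents, `max_df=0.95` drops any term found in more than 3.8 documents, and `min_df=2` drops any term found in fewer than 2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 'market' occurs in every document,
# 'stock' and 'election' in two each, the remaining words in one each.
toy_corpus = [
    'market stock rally',
    'market stock fall',
    'market election vote',
    'market election poll',
]

toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
toy_dtm = toy_tfidf.fit_transform(toy_corpus)

# Terms that survive both filters (vocabulary_ maps term -> column index)
kept_terms = sorted(toy_tfidf.vocabulary_)
print(kept_terms)     # ['election', 'stock']
print(toy_dtm.shape)  # (4, 2)
```

'market' is removed for being too common, and the four words that occur once are removed for being too rare.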

In [None]:
dtm = tfidf.fit_transform(data['Lemmatized Article Text'])

In [None]:
dtm

<4076x29922 sparse matrix of type '<class 'numpy.float64'>'
	with 1071897 stored elements in Compressed Sparse Row format>

# Building NMF

With the TF-IDF document-term matrix in hand, we can move on to the second step: Non-Negative Matrix Factorization (NMF).

In [None]:
# Let's try 30 n_components
# The n_components represent the number of topics.
nmf_model = NMF(n_components=30, random_state=42)

In [None]:
# Fitting the NMF model to the dtm
nmf_model.fit(dtm)



NMF(n_components=30, random_state=42)

In [None]:
# Because we chose 30 n_components, the len of nmf_model.components_ should be 30
len(nmf_model.components_)

30

In [None]:
# Check the shape of nmf_model.components_
nmf_model.components_.shape

(30, 29922)
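
Each row of `components_` holds one weight per vocabulary term, so the highest-weighted terms in a row give a rough label for that topic. The following is a sketch on a made-up corpus, not the notebook's own code: `toy_corpus`, `top_terms`, and the other `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two rough themes (finance and sports)
toy_corpus = [
    'stock market bank rate',
    'bank rate inflation market',
    'stock inflation bank',
    'team match goal player',
    'player goal league team',
    'match league team',
]

toy_tfidf = TfidfVectorizer()
toy_dtm = toy_tfidf.fit_transform(toy_corpus)

toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)
toy_nmf.fit(toy_dtm)

# Order terms by column index (vocabulary_ maps term -> column index)
terms = np.array(sorted(toy_tfidf.vocabulary_, key=toy_tfidf.vocabulary_.get))

def top_terms(model, terms, n=3):
    """Return the n highest-weighted terms for each topic."""
    return [list(terms[np.argsort(row)[::-1][:n]]) for row in model.components_]

for topic_id, words in enumerate(top_terms(toy_nmf, terms)):
    print(topic_id, words)
```

The same helper could be pointed at `nmf_model` and the vocabulary of `tfidf` to inspect the 30 topics learned from the articles.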

In [None]:
topic_results = nmf_model.transform(dtm)

In [None]:
# Check the shape of topic_results matrix
topic_results.shape

(4076, 30)

In [None]:
# Let's take a look at the first row of topic_results matrix
topic_results[0]

array([0.01935892, 0.        , 0.        , 0.00095875, 0.00696324,
       0.        , 0.13735303, 0.        , 0.01809424, 0.        ,
       0.00061374, 0.        , 0.00031728, 0.        , 0.0095273 ,
       0.01216267, 0.        , 0.00077774, 0.0112343 , 0.        ,
       0.        , 0.        , 0.00095646, 0.        , 0.00550749,
       0.        , 0.        , 0.        , 0.00036116, 0.        ])
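
One way to read such a row is to look for its largest weight. The short sketch below copies the row printed above into a new array (`first_row` is a name introduced here for illustration) and finds the dominant topic and the number of topics with a non-zero weight.

```python
import numpy as np

# First row of topic_results, copied from the output above
first_row = np.array([
    0.01935892, 0.        , 0.        , 0.00095875, 0.00696324,
    0.        , 0.13735303, 0.        , 0.01809424, 0.        ,
    0.00061374, 0.        , 0.00031728, 0.        , 0.0095273 ,
    0.01216267, 0.        , 0.00077774, 0.0112343 , 0.        ,
    0.        , 0.        , 0.00095646, 0.        , 0.00550749,
    0.        , 0.        , 0.        , 0.00036116, 0.        ,
])

# Index of the topic with the largest weight
dominant_topic = int(np.argmax(first_row))

# Number of topics that contribute at all to this article
n_active = int(np.count_nonzero(first_row))

print(dominant_topic)  # 6
print(n_active)        # 14
```

For the whole matrix, `topic_results.argmax(axis=1)` would give the dominant topic of every article at once.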

# The Recommender Function (Cosine Function)

Finally, we've reached the last step: applying the recommender function to get the recommendations (the top n texts most similar to a given text).

I use cosine similarity because it measures the cosine of the angle between two vectors, so it compares their direction rather than their length.

In general, the value lies between -1 and 1. Because NMF topic weights are non-negative, here it lies between 0 and 1, where 0 means the two vectors share no topics and 1 means they point in the same direction.

For two vectors $a$ and $b$ in $\mathbb{R}^n$, it is defined as:

\begin{align}
    \cos(\theta) = \frac{a \cdot b}{\|a\| \, \|b\|}
\end{align}

In [None]:
# L2-normalize each article's topic vector so that a dot product equals the cosine similarity
norm_topic_results = normalize(topic_results)
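
Normalizing first is what lets a plain dot product stand in for the cosine formula: once every vector has length 1, the denominator is 1. Here is a small worked check with two made-up vectors (`a` and `b` are assumptions, not rows of `topic_results`).

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two hypothetical non-negative topic vectors (one row each)
a = np.array([[3.0, 0.0, 4.0]])
b = np.array([[4.0, 0.0, 3.0]])

# Cosine similarity straight from the definition: (a . b) / (|a| |b|)
cos_direct = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# Same value as a plain dot product after L2-normalizing each row
a_unit, b_unit = normalize(a), normalize(b)
cos_normalized = (a_unit @ b_unit.T).item()

print(round(cos_direct, 2))      # 0.96
print(round(cos_normalized, 2))  # 0.96
```

Here $a \cdot b = 24$ and both norms are 5, so the similarity is $24 / 25 = 0.96$ either way.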

In [None]:
# Create a new DataFrame holding norm_topic_results, indexed by the article text
new_data = pd.DataFrame(norm_topic_results, index=data['Article text'])

In [None]:
new_data = new_data.reset_index()

In [None]:
new_data.head()

Unnamed: 0,Article text,0,1,2,3,4,5,6,7,8,...,20,21,22,23,24,25,26,27,28,29
0,"(CNN)Right now, there's a shortage of truck d...",0.136838,0.0,0.0,0.006777,0.049219,0.0,0.970873,0.0,0.127898,...,0.0,0.0,0.006761,0.0,0.038929,0.0,0.0,0.0,0.002553,0.0
1,(CNN)Working in a factory or warehouse can me...,0.175583,0.0,0.0,0.017456,0.075632,0.0,0.974493,0.0,0.0,...,0.012342,0.0,0.025617,0.0,0.0,0.0,0.0,0.0,0.024106,0.033406
2,"(CNN)In a Hong Kong warehouse, a swarm of aut...",0.107858,0.0,0.0,0.009095,0.009227,0.0,0.986812,0.0,0.0,...,0.018001,0.0,0.0,0.0,0.005872,0.0,0.0,0.0,0.0,0.012867
3,"New York (CNN Business)For many years, the wor...",0.113347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.978112,0.0,0.054598,0.0,0.0,0.0,0.080647,0.0
4,The European Union formally approved on Tuesda...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.885083,0.0,0.003234,0.0,0.0,0.0,0.0,0.0


# Recommended Articles for the 24th Article



Let's take the 24th article (index 23) as an example.

In [None]:
current_article = new_data.iloc[23]

In [None]:
similarities = new_data.iloc[:, 1:].dot(current_article[1:])

In [None]:
print(similarities.astype(float).nlargest())

23      1.000000
163     0.908235
189     0.900680
423     0.899432
3033    0.897820
dtype: float64


In [None]:
new_data.iloc[163]['Article text']

'Washington (CNN)When President Joe Biden passingly said in a voting rights speech last week that he had been "arrested" in the context of the civil rights movement -- even suggesting this had happened more than once -- it was a classic Biden false claim: an anecdote about his past for which there is no evidence, prompted by a decision to ad-lib rather than stick to a prepared text, resulting in easily avoidable questions about his honesty.   Biden\'s imaginary or embellished stories about his own history were the most memorable falsehoods of his first year in office. They were not, however, the only ones. The President also made multiple false claims about important policy matters, notably including three subjects that occupied much of his time: the US withdrawal from Afghanistan, the economy and the Covid-19 pandemic. And Biden was incorrect on numerous occasions when ad-libbing about a wide assortment of facts and figures -- sometimes in a way that appeared inadvertent, but other ti

In [None]:
new_data.iloc[23]['Article text']

'New York (CNN Business)President Joe Biden planned to reshape the Federal Reserve through his nominations for the three vacant seats on the board of governors. But Democratic Sen. Joe Manchin of West Virginia threw a wrench into those plans Monday.The crux of the matter is the nomination of Sarah Bloom Raskin, a former deputy Treasury secretary and a governor of the Federal Reserve Board during the Obama administration, who is facing opposition in a divided Senate.Raskin\'s stance on environmental issues, including her view on the transition away from fossil fuels, are colliding with soaring gas prices and a renewed debate about oil independence in the face of the Russia-Ukraine conflict."Her previous public statements have failed to satisfactorily address my concerns about the critical importance of financing an all-of-the-above energy policy to meet our nation\'s critical energy needs," Manchin said in a statement on Monday announcing his opposition to Raskin\'s nomination. Manchin 

# Implementation

Let's wrap the recommender system up so that it takes the index of the current article from the user and prints the top 5 recommended articles related to it.

In [None]:
index = int(input('Please enter the index number of the article that you currently read: '))
current_article = new_data.iloc[index]
similarities = new_data.iloc[:, 1:].dot(current_article[1:])
print(' ')
print(f'The top 5 articles related to the article number {index} are: \n')
print('Index   Similarity Score')
# Ask for 6 rows because the best match is always the article itself (similarity 1.0)
print(similarities.astype(float).nlargest(6))

Please enter the index number of the article that you currently read: 23
 
The top 5 articles related to the article number 23 are: 

Index   Similarity Score
23      1.000000
163     0.908235
189     0.900680
423     0.899432
3033    0.897820
428     0.893899
dtype: float64
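
Since the article itself always comes back with a score of 1.0, a reusable version may prefer to drop it from the results. The following is one possible sketch, not the notebook's own implementation: `recommend` and `toy_vectors` are hypothetical names, and the input is assumed to contain one L2-normalized vector per row.

```python
import numpy as np

def recommend(unit_vectors, index, top_n=5):
    """Return (row, score) pairs for the top_n rows most similar to `index`.

    `unit_vectors` is assumed to hold one L2-normalized vector per row,
    so a dot product equals the cosine similarity.
    """
    scores = unit_vectors @ unit_vectors[index]
    # Sort by descending score, then drop the query row itself
    order = np.argsort(scores)[::-1]
    order = order[order != index][:top_n]
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical unit vectors for four articles
toy_vectors = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [0.6, 0.8],
])

print(recommend(toy_vectors, 0, top_n=2))  # [(1, 0.8), (3, 0.6)]
```

With the real data, the function could be called on `norm_topic_results` with the article index entered by the user.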


# Voila!