# Named Entity based Recommender System

### Recommender System based on Named Entities as representation of documents

# Named Entity Based Recommender
1. Represent the articles as a TF-IDF matrix
2. Represent the user as - 
        (Alpha) [TF-IDF Vector] + (1-Alpha) [NER Vector]
   where
   Alpha => a weight in [0,1] <br/>
   [TF-IDF Vector] => TF-IDF vector of the concatenated read articles<br/>
   [NER Vector]    => TF-IDF vector of the named entities (NERs) extracted from the concatenated read articles <br/>
3. Calculate the cosine similarity between the user vector and the articles' TF-IDF matrix <br/>
4. Get the recommended articles <br/>
5. What happens if Alpha is 0 when defining the user vector?

**Describing parameters**:

*1. PATH_NEWS_ARTICLES: path to news_articles.csv*  <br/>
*2. ARTICLES_READ: list of Article_Ids read by the user*  <br/>
*3. NUM_RECOMMENDED_ARTICLES: number of articles to recommend*  <br/>
*4. ALPHA: weight in [0,1] given to the TF-IDF vector when building the user vector; the NER vector gets 1-ALPHA*

In [1]:
PATH_NEWS_ARTICLES="news_articles.csv"
ARTICLES_READ=[4,5,7,8]
NUM_RECOMMENDED_ARTICLES=5
ALPHA = 0.5
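
A minimal sketch of a sanity check for these settings, in case they are edited by hand. The helper `check_settings` and its `num_articles` argument are hypothetical and not part of the notebook; 4831 is the number of article rows reported for the TF-IDF matrix further down.

```python
def check_settings(articles_read, num_recommended, alpha, num_articles):
    """Raise ValueError if the settings cannot produce a recommendation."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("ALPHA must lie in [0, 1]")
    if num_recommended < 1:
        raise ValueError("NUM_RECOMMENDED_ARTICLES must be at least 1")
    if not articles_read:
        raise ValueError("ARTICLES_READ must not be empty")
    # Ids are used as row positions later on, so they must be valid positions.
    out_of_range = [i for i in articles_read if not 0 <= i < num_articles]
    if out_of_range:
        raise ValueError("unknown article ids: %s" % out_of_range)
    return True

# Same values as the cell above.
settings_ok = check_settings([4, 5, 7, 8], 5, 0.5, num_articles=4831)
```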

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags
stemmer = SnowballStemmer("english")

# 1. Represent articles in terms of a TF-IDF Matrix

In [3]:
news_articles = pd.read_csv(PATH_NEWS_ARTICLES)
news_articles.head()

| Unnamed: 0 | Article_Id | Title | Author | Date | Content | URL |
|---|---|---|---|---|---|---|
| 0 | 0 | 14 dead after bus falls into canal in Telangan... | Devyani Sultania | August 22, 2016 12:34 IST | At least 14 people died and 17 others were inj... | http://www.ibtimes.co.in/14-dead-after-bus-fal... |
| 1 | 1 | Pratibha Tiwari molested on busy road Saath ... | Suparno Sarkar | August 22, 2016 19:47 IST | TV actress Pratibha Tiwari who is best known ... | |
| 2 | 2 | US South Korea begin joint military drill ami... | Namrata Tripathi | August 22, 2016 18:10 IST | The United States and South Korea began a join... | http://www.ibtimes.co.in/us-south-korea-begin-... |
| 3 | 3 | Illegal construction in Bengaluru Will my hou... | S V Krishnamachari | August 22, 2016 17:39 IST | The relentless drive by Bengaluru s Bangalore... | http://www.ibtimes.co.in/illegal-construction-... |
| 4 | 4 | Punjab Gau Rakshak Dal chief held for assaulti... | Pranshu Rathee | August 22, 2016 17:34 IST | Punjab Gau Raksha Dal chief Satish Kumar and h... | http://www.ibtimes.co.in/punjab-gau-rakshak-da... |


In [4]:
#Select relevant columns and remove rows with missing values
news_articles = news_articles[['Article_Id','Title','Content']].dropna()
articles = news_articles['Content'].tolist()
articles[0] #an uncleaned article

'At least 14 people died and 17 others were injured after a bus travelling from Hyderabad to Kakinada plunged into a canal from a bridge on the accident-prone stretch of the Hyderabad-Khammam highway in Telangana early Monday morning \nThe injured were admitted to the Government General Hospital for treatment \n\n\nSeven people died on the spot and the others succumbed to injuries while undergoing treatment at the hospital  The passengers belonged to the East and West Godavari districts of Andhra Pradesh \nThe bus  owned by private operator Yatra Genie  commenced its journey from Hyderabad at 11 30 p m  on Sunday  Khammam Superintendent of Police Shah Nawaz Khan was quoted by the Hindustan Times as saying \nThe accident happened around 2 30 a m  when the driver slammed the brakes to avoid a collision with another vehicle coming from the opposite direction on a bridge over Nagarjunsagar project left canal at Nayankangudem village in Khammam district  the daily reported  The bus hit the 

In [5]:
def clean_tokenize(document):
    document = re.sub(r'[^\w\s-]', ' ', document)                          #replace punctuation and other symbols with spaces (hyphens are kept)
    tokens = nltk.word_tokenize(document)                                  #split the text into word tokens
    cleaned_article = ' '.join([stemmer.stem(item) for item in tokens])    #stem each token (the Snowball stemmer also lowercases)
    return cleaned_article
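
To see what the regular expression alone does, here is a small sketch that uses only the standard library. It applies the same character class as `clean_tokenize` (the `_` there is already covered by `\w`); `str.split` stands in for `nltk.word_tokenize`, and stemming is left out. The helper name and the sample sentence are made up for illustration.

```python
import re

def strip_symbols(document):
    # Keep word characters, whitespace and hyphens; turn everything else into a space.
    return re.sub(r'[^\w\s-]', ' ', document)

sample = "accident-prone stretch, at 11:30 p.m."
stripped = strip_symbols(sample)
tokens = stripped.split()   # whitespace split as a stand-in for a real tokenizer

print(tokens)
```

The hyphenated word survives as one token, while times and abbreviations are broken apart, which matches fragments such as `11 30 p m` in the cleaned article.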

In [6]:
cleaned_articles = list(map(clean_tokenize, articles))  #list() so the result can also be indexed on Python 3
cleaned_articles[0]  #a cleaned, tokenized and stemmed article 

u'at least 14 peopl die and 17 other were injur after a bus travel from hyderabad to kakinada plung into a canal from a bridg on the accident-pron stretch of the hyderabad-khammam highway in telangana earli monday morn the injur were admit to the govern general hospit for treatment seven peopl die on the spot and the other succumb to injuri while undergo treatment at the hospit the passeng belong to the east and west godavari district of andhra pradesh the bus own by privat oper yatra geni commenc it journey from hyderabad at 11 30 p m on sunday khammam superintend of polic shah nawaz khan was quot by the hindustan time as say the accid happen around 2 30 a m when the driver slam the brake to avoid a collis with anoth vehicl come from the opposit direct on a bridg over nagarjunsagar project left canal at nayankangudem villag in khammam district the daili report the bus hit the parapet wall of the bridg and nose-div into the canal the driver of the bus was appar drive at high speed due 

In [7]:
#Fit the TF-IDF vectorizer and build the article-term matrix
tfidf_matrix = TfidfVectorizer(stop_words='english', min_df=2)   #drop English stop words and terms found in fewer than 2 articles
articles_tfidf_matrix = tfidf_matrix.fit_transform(cleaned_articles)
articles_tfidf_matrix #one TF-IDF row per article

<4831x16009 sparse matrix of type '<type 'numpy.float64'>'
	with 468648 stored elements in Compressed Sparse Row format>
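
For intuition about what the vectorizer computes, the following is a rough pure-Python sketch of TF-IDF weighting with a `min_df` cut-off. It assumes scikit-learn's default settings (smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2-normalised rows) and skips stop-word removal; the function `tfidf_rows` and the toy documents are hypothetical, not the notebook's data.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs, min_df=2):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))           # count each term once per document
    # min_df: keep only terms that occur in at least min_df documents.
    vocab = sorted(t for t, c in doc_freq.items() if c >= min_df)
    idf = dict((t, math.log((1.0 + n_docs) / (1.0 + doc_freq[t])) + 1.0) for t in vocab)
    rows = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

toy_docs = [
    "bus canal bridg".split(),
    "bus polic".split(),
    "polic court canal".split(),
]
vocab, rows = tfidf_rows(toy_docs, min_df=2)
print(vocab)
```

Terms seen in only one toy document (`bridg`, `court`) fall out of the vocabulary, which is the effect `min_df=2` has on rare terms in the real corpus.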

# 2. Represent user in terms of articles read

Represent the user as - <br/>
<br/>
(Alpha) [TF-IDF Vector] + (1-Alpha) [NER Vector] <br/>

   where <br/>
   Alpha => a weight in [0,1] <br/>
   [TF-IDF Vector] => TF-IDF vector of the concatenated read articles  <br/>
   [NER Vector]    => TF-IDF vector of the named entities (NERs) extracted from the concatenated read articles
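
The weighted sum above can be sketched on plain Python lists. The function name `blend_user_vector` and the toy weights are assumptions for illustration only; the notebook itself works on sparse matrices.

```python
def blend_user_vector(alpha, tfidf_vec, ner_vec):
    """Return alpha * tfidf_vec + (1 - alpha) * ner_vec, element by element."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(tfidf_vec) != len(ner_vec):
        raise ValueError("both vectors must share the same vocabulary")
    return [alpha * t + (1.0 - alpha) * n for t, n in zip(tfidf_vec, ner_vec)]

# Toy weights over a hypothetical 4-term vocabulary (not taken from the dataset).
toy_tfidf = [0.5, 0.5, 0.0, 0.25]
toy_ner = [1.0, 0.0, 0.0, 0.0]

balanced = blend_user_vector(0.5, toy_tfidf, toy_ner)    # both parts count equally
only_tfidf = blend_user_vector(1.0, toy_tfidf, toy_ner)  # NER part ignored
only_ner = blend_user_vector(0.0, toy_tfidf, toy_ner)    # TF-IDF part ignored
```

Alpha therefore slides the user profile between the full article text (Alpha = 1) and the named entities only (Alpha = 0).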

In [8]:
def get_ner(article):
    ne_tree = ne_chunk(pos_tag(word_tokenize(article)))      #POS-tag the tokens, then chunk named entities
    iob_tagged = tree2conlltags(ne_tree)                     #flatten the tree into (token, pos, IOB tag) triples
    ner_token = ' '.join([token for token,pos,ner_tag in iob_tagged if ner_tag != 'O']) #Keep only tokens inside an entity ('O' = outside)
    return ner_token
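
The filtering step of `get_ner` can be tried without NLTK on hand-written triples. The list below only imitates the `(token, POS tag, IOB tag)` shape that `tree2conlltags` returns; the tags are illustrative guesses, not actual NLTK output.

```python
# Hand-written (token, POS tag, IOB tag) triples; illustrative only.
toy_iob = [
    ("Khammam", "NNP", "B-GPE"),
    ("Superintendent", "NNP", "O"),
    ("of", "IN", "O"),
    ("Police", "NNP", "O"),
    ("Shah", "NNP", "B-PERSON"),
    ("Nawaz", "NNP", "I-PERSON"),
    ("Khan", "NNP", "I-PERSON"),
    ("said", "VBD", "O"),
]

def keep_entity_tokens(iob_tagged):
    # B-* marks the first token of an entity, I-* a continuation, O a token outside any entity.
    return " ".join(token for token, pos, tag in iob_tagged if tag != "O")

entity_text = keep_entity_tokens(toy_iob)
print(entity_text)
```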

In [9]:
#Represent user in terms of cleaned content of read articles
user_articles = ' '.join(cleaned_articles[i] for i in ARTICLES_READ) 
print("User Article => " + user_articles)
print('\n')
#Represent user in terms of NERs associated with read articles (extracted from the raw, unstemmed text)
user_articles_ner = ' '.join([get_ner(articles[i]) for i in ARTICLES_READ])
print("NERs of Read Article => " + user_articles_ner)

User Article => punjab gau raksha dal chief satish kumar and his accomplic were remand to a day s polic custodi by a patiala court on sunday the polic said satish and his aid were respons for wrong restraint extort loot assault and sodomis cattl trader his two accomplic arun kumar alia anu and kapil kumar alia gauri both resid of rajpura were also arrest along with satish in vrindavan on saturday satish was book under section 323 324 341 342 382 384 148 149 and 506 of the ipc at the rajpura polic station on aug 8 after video of him assault cattl trader were found on social media satish had been on the run after section 377 sodomi of the indian penal code was ad in the first inform report fir against him 10 day ago the charg were ad after two victim imran and naseem both resid of saharanpur in uttar pradesh complain against the cow vigilant the polic said that they are attempt to probe satish s properti in ambala and other district of haryana as they suspect it was financ through money 

In [10]:
#Vectorize both user representations with the fitted TF-IDF vectorizer
user_articles_tfidf_vector = tfidf_matrix.transform([user_articles])
user_articles_ner_tfidf_vector = tfidf_matrix.transform([user_articles_ner])
user_articles_tfidf_vector

<1x16009 sparse matrix of type '<type 'numpy.float64'>'
	with 292 stored elements in Compressed Sparse Row format>
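
One detail worth keeping in mind here: `transform` only counts terms that are already in the fitted vocabulary and silently ignores the rest. Since the entity text comes from the raw articles while the vocabulary was built from stemmed text, an entity token contributes only when its lowercased form happens to equal a stem. The sketch below imitates that behaviour with a hypothetical three-term vocabulary and a whitespace split; it is not the vectorizer's actual code.

```python
from collections import Counter

def count_in_vocabulary(text, vocabulary):
    """Count lowercased tokens, dropping anything outside the vocabulary."""
    counts = Counter(token for token in text.lower().split() if token in vocabulary)
    return dict(counts)

toy_vocabulary = set(["hyderabad", "polic", "khammam"])  # hypothetical stemmed vocabulary
toy_entities = "Hyderabad Khammam Police Hyderabad"      # unstemmed entity text

kept = count_in_vocabulary(toy_entities, toy_vocabulary)
print(kept)
```

`Police` is dropped because only its stem `polic` is in the toy vocabulary, while place names that the stemmer leaves unchanged are counted.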

In [11]:
# User_Vector =>  (Alpha) [TF-IDF Vector] + (1-Alpha) [NER Vector] 
alpha_tfidf_vector = ALPHA * user_articles_tfidf_vector
alpha_ner_vector = (1-ALPHA) * user_articles_ner_tfidf_vector

user_vector = alpha_tfidf_vector + alpha_ner_vector   #element-wise sum of the two sparse 1xN vectors
user_vector

<1x16009 sparse matrix of type '<type 'numpy.float64'>'
	with 295 stored elements in Compressed Sparse Row format>

In [12]:
user_vector.toarray()

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])

# 3. Calculate cosine similarity between user vector and articles TF-IDF matrix


In [13]:
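def calculate_cosine_similarity(articles_tfidf_matrix, user_vector):
    #Similarity of every article row to the user vector, shape (n_articles, 1)
    articles_similarity_score=cosine_similarity(articles_tfidf_matrix, user_vector.toarray())
    #Row positions sorted from most to least similar
    recommended_articles_id = articles_similarity_score.flatten().argsort()[::-1]
    #Remove read articles from recommendations and keep the top N
    #(row positions are treated as Article_Ids, which holds only if ids run 0..n-1 in row order)
    final_recommended_articles_id = [article_id for article_id in recommended_articles_id 
                                     if article_id not in ARTICLES_READ ][:NUM_RECOMMENDED_ARTICLES]
    return final_recommended_articles_id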
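
The same rank-and-filter logic can be sketched in plain Python, which may help when checking the function by hand. The names `cosine` and `recommend` and the toy vectors are assumptions for illustration; the notebook relies on scikit-learn's `cosine_similarity` instead.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(article_rows, user_vec, already_read, top_n):
    """Rank row positions by similarity to user_vec, skipping rows already read."""
    scores = [(cosine(row, user_vec), idx) for idx, row in enumerate(article_rows)]
    ranked = sorted(scores, key=lambda pair: pair[0], reverse=True)
    read = set(already_read)
    return [idx for score, idx in ranked if idx not in read][:top_n]

toy_rows = [
    [1.0, 0.0, 0.0],   # identical direction to the user vector
    [0.8, 0.6, 0.0],   # similarity 0.8
    [0.6, 0.8, 0.0],   # similarity 0.6
    [0.0, 0.0, 1.0],   # orthogonal, similarity 0
]
toy_user = [1.0, 0.0, 0.0]
picks = recommend(toy_rows, toy_user, already_read=[0], top_n=2)
print(picks)
```

Row 0 is the best match but is excluded because it was already read, so the next two rows are returned.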

In [14]:
recommended_articles_id = calculate_cosine_similarity(articles_tfidf_matrix, user_vector)
recommended_articles_id

[3742, 2724, 3053, 2862, 2811]

# 4. Get the recommended articles 

In [15]:
#Titles of the read and recommended articles
print('Articles Read')
print(news_articles.loc[news_articles['Article_Id'].isin(ARTICLES_READ)]['Title'])
print('\n')
print('Recommender ')
print(news_articles.loc[news_articles['Article_Id'].isin(recommended_articles_id)]['Title'])

Articles Read
4    Punjab Gau Rakshak Dal chief held for assaulti...
5    Phillipines drug war  1 800 drug-related death...
7    Dialogue crucial in finding permanent solution...
8    School bus overturns in Jammu killing 1 and in...
Name: Title, dtype: object


Recommender 
2724    PM Modi says at all-party meeting that PoK is ...
2811    Kashmir tense after Hizbul Mujahideen militant...
2862    J K  PM Modi appeals for peace in Valley  assu...
3053    Jammu   Kashmir  Pakistan  Isis Flags Waved in...
3742    Narendra Modi Likely to Announce   70 000 Cror...
Name: Title, dtype: object
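
Note that the titles above are listed in Article_Id order rather than in similarity order, because filtering with `isin` keeps the frame's row order. A small sketch of the difference, using the ids returned by In [14] and placeholder titles (the helper `titles_in_rank_order` and the title strings are hypothetical):

```python
def titles_in_rank_order(ranked_ids, id_to_title):
    """Return (id, title) pairs in the order given by ranked_ids."""
    return [(article_id, id_to_title[article_id]) for article_id in ranked_ids]

ranked_ids = [3742, 2724, 3053, 2862, 2811]   # output of In [14], best match first
# Placeholder titles keyed by id; the real titles live in news_articles.
toy_titles = dict((i, "title of article %d" % i) for i in ranked_ids)

# Lookup by rank keeps the similarity order.
ordered = titles_in_rank_order(ranked_ids, toy_titles)
# A membership filter over id-sorted rows mimics isin and loses that order.
filtered = [(i, t) for i, t in sorted(toy_titles.items()) if i in ranked_ids]

print([i for i, t in ordered])
print([i for i, t in filtered])
```

With pandas, one way to get the ranked ordering might be `news_articles.set_index('Article_Id').loc[recommended_articles_id, 'Title']`, assuming every recommended id is present in the frame.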


# 5. Case when Alpha=0

User Vector => (Alpha) [TF-IDF Vector] + (1-Alpha) [NER Vector] <br/>
With Alpha = 0 the first term vanishes, so <br/>
User Vector => [NER Vector]

The user is then defined only by the named entities found in the read articles.
This corresponds to assuming that the named entities alone characterize an article.

In [16]:
ALPHA = 0

alpha_tfidf_vector = ALPHA * user_articles_tfidf_vector    # ==> all zeros
alpha_ner_vector = (1-ALPHA) * user_articles_ner_tfidf_vector

user_vector = alpha_tfidf_vector + alpha_ner_vector        # reduces to the NER vector
user_vector

<1x16009 sparse matrix of type '<type 'numpy.float64'>'
	with 41 stored elements in Compressed Sparse Row format>

In [28]:
user_vector.toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0. 

In [17]:
recommended_articles_id = calculate_cosine_similarity(articles_tfidf_matrix, user_vector)
recommended_articles_id

[3742, 3053, 2724, 2844, 2792]

In [18]:
#Titles of the read and recommended articles
print('Articles Read')
print(news_articles.loc[news_articles['Article_Id'].isin(ARTICLES_READ)]['Title'])
print('\n')
print('Recommender ')
print(news_articles.loc[news_articles['Article_Id'].isin(recommended_articles_id)]['Title'])

Articles Read
4    Punjab Gau Rakshak Dal chief held for assaulti...
5    Phillipines drug war  1 800 drug-related death...
7    Dialogue crucial in finding permanent solution...
8    School bus overturns in Jammu killing 1 and in...
Name: Title, dtype: object


Recommender 
2724    PM Modi says at all-party meeting that PoK is ...
2792    Pampore encounter  5 dead  fire breaks out in ...
2844    J K encounter  Army jawan  2 terrorists killed...
3053    Jammu   Kashmir  Pakistan  Isis Flags Waved in...
3742    Narendra Modi Likely to Announce   70 000 Cror...
Name: Title, dtype: object
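
To compare the two runs, the id lists returned by In [14] (Alpha = 0.5) and In [17] (Alpha = 0) can be set side by side. This is only a sketch over the ids printed above; the variable names are made up.

```python
recs_alpha_half = [3742, 2724, 3053, 2862, 2811]   # Alpha = 0.5, output of In [14]
recs_alpha_zero = [3742, 3053, 2724, 2844, 2792]   # Alpha = 0,   output of In [17]

zero_set = set(recs_alpha_zero)
half_set = set(recs_alpha_half)

shared = [a for a in recs_alpha_half if a in zero_set]          # recommended in both runs
only_half = [a for a in recs_alpha_half if a not in zero_set]   # dropped when Alpha = 0
only_zero = [a for a in recs_alpha_zero if a not in half_set]   # new when Alpha = 0

print(shared, only_half, only_zero)
```

Three of the five recommendations are common to both settings, and the top-ranked article (3742) is the same; the other two slots change when the user is described by named entities alone.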
