# Profile-based retrieval

by Tolga Buz, Antonio Javier González Ferrer and Aitor Palacios Cuesta.


This notebook was created with Jupyter Notebook and Python 3. It requires the following libraries to be installed:
* numpy
* scipy
* scikit-learn
* tabulate

In [1]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from tabulate import tabulate

Usually, when evaluating an information retrieval model, the basic concepts are the documents $d$ and a query $q$. The documents are the static elements, which express ideas about some topic in natural language. The queries, on the other hand, represent a variable information need for documents pertaining to some topic. A document is represented by a set of index terms $k_i$.

In our particular case, the <u>topics</u> conceptually play the role of the documents, since they remain fixed throughout the model, and the <u>text snippets</u> are the queries, since we want to determine which profile each query relates to. The users have some topic preferences, and the goal is to provide each user with the text snippets they are interested in.

Note that in a real scenario the topics may be built from a long list of specific terms. These index terms may be extracted from documents that we are sure represent the topic. For instance, the terms for the topic `sport` may be extracted from sports magazines.

In [2]:
# Defining topics.
# Each topic is treated conceptually as a document.
topic_music    = ["music", "sound", "song", "songs", "Taylor", "Swift", "Justin", "Bieber", "Mozart", "pop", "stars", "singing"]
topic_film     = ["movie", "movies", "film", "Tarantino", "Pulp", "Fiction", "actor", "director"]
topic_sports   = ["football", "star", "soccer", "goal", "messi", "jogging", "swimming", "fitness"]
topic_cars     = ["Ferrari", "Lamborghini", "car", "speed", "high", "enzo", "aventador", "driving"]
topic_politics = ["politics", "economics", "trump", "stock", "market" ]

# List of topics.
topics_names = ["topic_music", "topic_film", "topic_sports", "topic_cars", "topic_politics"]
topics_values = [topic_music, topic_film, topic_sports, topic_cars, topic_politics]

We match the topics to some fictitious users. For instance, `user1` likes cars, sports and films; `user2` prefers music and politics; and `user3` loves films:

In [3]:
# Defining the users
user1  = ["topic_cars", "topic_sports", "topic_film"]
user2  = ["topic_music", "topic_politics"]
user3  = ["topic_film"]
users = {"user1": user1,
         "user2": user2,
         "user3": user3}

Ideally, the users could receive any kind of text snippet, and each snippet would be classified according to its topic. However, since this is a toy example meant to illustrate the concepts, and the gold standard of terms in the topics is small, we need to define the queries manually using related terms from the topics.

In [4]:
# Loading in text snippets that need to be classified.
# By definition, every text snippet has exactly one topic.
# Each text snippet is a query.
text_snippets = ["Jogging is one of the best sports, but I love football", 
                 "Politics news are strongly affecting economics.",
                 "Tarantino is a bad actor, but a good director.", 
                 "Ferrari Enzo is faster than Lamborghini Aventador, though its music accessories have more quality.", 
                 "Ferrari builds the best high speed cars.",
                 "Football star Messi is also doing jogging, swimming and fitness to stay in shape.",
                 "Pulp fiction is one of the best movies ever made.",
                 "Taylor Swift and Justin Bieber are pop stars.",
                 "Mozart is much better than Justin Bieber.",
                 "Many people do not like to hear Taylor Swift singing in spite of she might be married with Tarantino."]

We have decided to represent both documents and queries as a bag of words: each document or query is a column in a matrix whose rows correspond to terms, and each cell holds the number of times that term appears in the text. After forming this matrix, the cell count is used to calculate a new score for the term in that text, so that each column becomes the vector of the corresponding topic or snippet. The score is calculated using the `tf-idf` measure, which we found appropriate.
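
As a rough illustration of the bag-of-words matrix described above, the following dependency-free sketch counts terms for two of the text snippets. The helper names `tokenize` and `term_count_matrix` are hypothetical, and the tokenizer is a simplification (lowercasing and splitting on non-letters, with no stop-word removal); the notebook itself relies on `TfidfVectorizer` for this step.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (simplified)."""
    return re.findall(r"[a-z]+", text.lower())

def term_count_matrix(texts):
    """Return (terms, matrix) with one row per term and one column per text."""
    counts = [Counter(tokenize(text)) for text in texts]
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a text.
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

snippets = ["Mozart is much better than Justin Bieber.",
            "Taylor Swift and Justin Bieber are pop stars."]
terms, matrix = term_count_matrix(snippets)
for term, row in zip(terms, matrix):
    print(f"{term:>8}: {row}")
```

Each printed row is a term and each position in the row is a snippet, which is the layout the weighting step starts from.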

The standard scheme used throughout this document is to apply the inverse document frequency to the queries (`tfidf` object) but not to the documents (`tf` object). Both queries and documents use logarithmic tf.

In [5]:
tf = TfidfVectorizer(analyzer='word', 
                     strip_accents='unicode', # Remove accents during the preprocessing step
                     stop_words = 'english', # Eliminate stop words from English
                     lowercase=True, # Convert all characters to lowercase before tokenizing
                     use_idf=False, # Disable the inverse-document-frequency reweighting
                     sublinear_tf = True, # Logarithmic tf
                     norm='l2') # Normalization

tfidf = TfidfVectorizer(analyzer='word', 
                     strip_accents='unicode', 
                     stop_words = 'english', 
                     lowercase=True, 
                     use_idf=True, # Enable inverse-document-frequency reweighting
                     smooth_idf = True, # Smooth idf weights by adding one to document frequencies. Prevents zero divisions.
                     sublinear_tf = True,
                     norm='l2')
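
To make the options above more concrete, here is a minimal sketch of the weighting these settings correspond to in scikit-learn: `sublinear_tf=True` replaces a raw count `tf` with `1 + ln(tf)`, `smooth_idf=True` uses `ln((1 + n) / (1 + df)) + 1`, and `norm='l2'` scales each vector to unit length. The function names are hypothetical, and the code is a plain-Python illustration rather than the library's implementation.

```python
import math

def sublinear_tf(count):
    """Logarithmic term frequency: 1 + ln(count), or 0 for absent terms."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def smooth_idf(n_docs, doc_freq):
    """Smoothed idf: one is added to both counts, so df = 0 cannot divide by zero."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in vector))
    return [w / norm for w in vector] if norm > 0 else list(vector)

# Example: a term occurring 3 times in a text and present in 1 of the 5 topics.
weight = sublinear_tf(3) * smooth_idf(n_docs=5, doc_freq=1)
print(round(weight, 3))
print(l2_normalize([3.0, 4.0]))
```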

First of all, we fit the topics with the logarithmic tf approach. Then, we transform the topics into the matrix of weights:

In [6]:
# Represent each document (topic) as a weighted tf vector (idf is disabled here)
topics_text = [",".join(topic) for topic in topics_values]
tf_matrix = tf.fit_transform(topics_text)

Secondly, we fit the logarithmic tf and idf vectorizer on the topics, so that the vocabulary and the idf weights come from the topics, and use it to transform the text snippets into the corresponding matrix of weights:

In [7]:
# Fit vocabulary and idf on the topics, then represent each query (text snippet) as a weighted tf-idf vector
tfidf_matrix = tfidf.fit_transform(topics_text)
tfidf_query = tfidf.transform(text_snippets)

Finally, we apply the cosine similarity between the two matrices. Since we have normalized both documents and queries, the cosine similarity is simply the dot product between the matrices.

In [8]:
# Compute the cosine similarity score for the query vector and each document vector.
# Note: Cosine for length-normalized vectors is simply the dot product (or scalar product).
cosine_similarity = (tf_matrix @ tfidf_query.T).toarray()
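
The note about length-normalized vectors can be checked with a small, self-contained sketch. The vectors and helper names below are made up for illustration; they are not taken from the notebook's matrices.

```python
import math

def dot(u, v):
    """Scalar product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity computed from its definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def unit(u):
    """Return u scaled to unit Euclidean length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

topic = [1.0, 1.0, 0.0, 1.0]   # hypothetical topic weights
query = [2.0, 0.0, 1.0, 1.0]   # hypothetical query weights

# Both lines print the same value (up to floating-point rounding).
print(cosine(topic, query))
print(dot(unit(topic), unit(query)))
```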

Let us visualize the cosine similarity matrix. For each text snippet (columns), it shows the similarity to each of the topics (rows):

In [9]:
headers = ['text1', 'text2', 'text3', 'text4', 'text5', 'text6', 'text7', 'text8', 'text9', 'text10']
body = np.append(np.array([["music"],["films"],["sports"],["cars"],["politics"]])
                 , cosine_similarity, axis=1)
print(tabulate(body, headers=headers, tablefmt='pipe', floatfmt=".2f"))

|          |   text1 |   text2 |   text3 |   text4 |   text5 |   text6 |   text7 |   text8 |   text9 |   text10 |
|:---------|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|---------:|
| music    |    0.00 |    0.00 |    0.00 |    0.13 |    0.00 |    0.00 |    0.00 |    0.71 |    0.50 |     0.43 |
| films    |    0.00 |    0.00 |    0.61 |    0.00 |    0.00 |    0.00 |    0.61 |    0.00 |    0.00 |     0.18 |
| sports   |    0.50 |    0.00 |    0.00 |    0.00 |    0.00 |    0.87 |    0.00 |    0.00 |    0.00 |     0.00 |
| cars     |    0.00 |    0.00 |    0.00 |    0.63 |    0.61 |    0.00 |    0.00 |    0.00 |    0.00 |     0.00 |
| politics |    0.00 |    0.63 |    0.00 |    0.00 |    0.00 |    0.00 |    0.00 |    0.00 |    0.00 |     0.00 |


Our solution assigns a single topic to each text snippet. Therefore, let us select the maximum value in each column:

In [10]:
# Categorization of the text_snippets
# Rank documents with respect to the query by score (the higher, the better)
# Return best ones.
categories = np.argmax(cosine_similarity, axis=0)
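
One caveat of taking the maximum per column: if a snippet shares no term with any topic, its column is all zeros and `argmax` still returns index 0 (the first topic). A possible guard, sketched here in plain Python with a hypothetical `min_score` threshold that is not part of the original notebook, is to leave such snippets unassigned. The scores below are made up for the example.

```python
def best_topic_per_snippet(similarity, min_score=0.0):
    """similarity[i][j] is the score of topic i for snippet j.

    Returns, for each snippet, the index of the best-scoring topic,
    or None when the best score does not exceed min_score.
    """
    n_topics = len(similarity)
    n_snippets = len(similarity[0]) if similarity else 0
    result = []
    for j in range(n_snippets):
        best = max(range(n_topics), key=lambda i: similarity[i][j])
        result.append(best if similarity[best][j] > min_score else None)
    return result

scores = [[0.00, 0.71, 0.00],   # music
          [0.61, 0.00, 0.00],   # films
          [0.00, 0.00, 0.00]]   # sports
print(best_topic_per_snippet(scores))
```

Here the third snippet matches no topic, so it is reported as `None` instead of being silently assigned to the first topic.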

This `categories` vector contains the index of the matched topic for each of the text snippets. The rest of the notebook is straightforward: look up the topics of each user and give them the text snippets that match their profile:

In [11]:
# Assign text_snippets to users
user1_result = []
user2_result = []
user3_result = []

for text_id, category_id in enumerate(categories):
    category = topics_names[category_id]
    if category in user1:
        user1_result.append(text_id)
    if category in user2:
        user2_result.append(text_id)
    if category in user3:
        user3_result.append(text_id)
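
The same assignment can be written without one list per user by looping over the `users` dictionary. This is only a sketch: the function name `assign_snippets` is hypothetical, and the `categories` list below is typed in by hand from the column maxima of the similarity table above rather than computed.

```python
def assign_snippets(categories, topics_names, users):
    """Map each user name to the ids of the snippets whose topic is in their profile."""
    results = {name: [] for name in users}
    for text_id, category_id in enumerate(categories):
        category = topics_names[category_id]
        for name, preferred_topics in users.items():
            if category in preferred_topics:
                results[name].append(text_id)
    return results

topics_names = ["topic_music", "topic_film", "topic_sports", "topic_cars", "topic_politics"]
users = {"user1": ["topic_cars", "topic_sports", "topic_film"],
         "user2": ["topic_music", "topic_politics"],
         "user3": ["topic_film"]}
# Hand-copied topic index per snippet (assumption: read off the table, not recomputed).
categories = [2, 4, 1, 3, 3, 2, 1, 0, 0, 0]

results = assign_snippets(categories, topics_names, users)
for name, text_ids in results.items():
    print(name, text_ids)
```

With this shape, adding a fourth user only requires a new entry in `users`.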

To conclude, we show which of the initial text snippets are matched to which user:

In [12]:
def print_text_snippets(user, name):
    """Print a user's profile followed by the text snippets matched to it"""
    print("\n" + name + " profile: " + str(users[name]))
    print("\nTexts:")
    for text_id in user:
        print(text_snippets[text_id])
    print("\n")
    
print_text_snippets(user1_result, "user1")
print_text_snippets(user2_result, "user2")
print_text_snippets(user3_result, "user3")


user1 profile: ['topic_cars', 'topic_sports', 'topic_film']

Texts:
Jogging is one of the best sports, but I love football
Tarantino is a bad actor, but a good director.
Ferrari Enzo is faster than Lamborghini Aventador, though its music accessories have more quality.
Ferrari builds the best high speed cars.
Football star Messi is also doing jogging, swimming and fitness to stay in shape.
Pulp fiction is one of the best movies ever made.



user2 profile: ['topic_music', 'topic_politics']

Texts:
Politics news are strongly affecting economics.
Taylor Swift and Justin Bieber are pop stars.
Mozart is much better than Justin Bieber.
Many people do not like to hear Taylor Swift singing in spite of she might be married with Tarantino.



user3 profile: ['topic_film']

Texts:
Tarantino is a bad actor, but a good director.
Pulp fiction is one of the best movies ever made.


