# Similarity of documents

In the [data analysis and processing directory](./02_Data_Analysis_And_Processing) we created a vector for each article which, in some sense, attempts to capture the content of the article (or at least the content of its abstract). So when we talk about whether two articles are similar, we're really talking about the similarity of their vectors. A standard way of evaluating the similarity between two vectors is to look at the angle between them. The motivation is that parallel vectors seem like they should be similar, while orthogonal vectors should be dissimilar. Numerically, we measure this with the normalized dot product of the two vectors, $\frac{u \cdot v}{\|u\| \, \|v\|}$, which is the cosine of the angle between them.

The normalized dot product of two vectors can take values between $-1$ and $1$, with $1$ corresponding to parallel vectors (high similarity of articles) and $-1$ corresponding to vectors pointing in opposite directions (very dissimilar articles). All of the scores we use refer to the cosine similarity. 
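
As a small illustration of the definition above, the sketch below computes the cosine similarity of a few vector pairs with `numpy`. The helper name `cosine_similarity` is ours, for illustration only, and is not part of the notebook's code.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])      # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])    # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])    # opposite direction
```

Up to floating-point rounding, the three values are $1$, $0$, and $-1$ respectively.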

## Prep the data for computation

We use `numpy` to perform fairly efficient numerical matrix operations and some sorting. In terms of recommending articles to a user, these operations amount to computing similarity scores and then sorting the articles by score.

The `json` and `pickle` libraries are used to persist the objects used in these computations, since we will need to reuse them to update the database. While preprocessing is still running, the same objects can also be kept in memory by the Flask app, whose API can then compute recommendations online.

The `csv` module is used to simplify reading the `CSV` file which stores the article vectors.


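For the sake of storage and memory, we'll only keep the 50 highest-scoring recommendations for each article. In practice the code keeps 51 entries per article, because each article is its own best match.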

In [2]:
import numpy as np
import json
import pickle
import csv

In [3]:
def split_vectors_ids(vectors_file='../../vectors/arxiv_vectors.csv',
                      id_json='../../vectors/id.json',
                      vectors_pkl='../../vectors/vectors.pkl',
                     ):
    """
    Takes in the total csv of vectors for 
    the articles, indexed by arxiv_id and 
    transforms them into a list of ids
    and array of vectors. Then the array is 
    pickled and the ids list is stored as 
    a json.
    """
    
    ids = []
    vectors = []
    
    with open(vectors_file, 'r', newline='') as vectors_csv:
        vectors_reader = csv.reader(vectors_csv)
        for id, *vector in vectors_reader:
            ids.append(id)
            vector = np.array([float(component) for component in vector])
            vectors.append(vector)
    
    with open(id_json, 'w') as json_file:
        json.dump(ids, json_file)
    
    vectors = np.array(vectors)
    vectors = normalize_rows(vectors)
    
    with open(vectors_pkl, 'bw') as pkl_file:
        pickle.dump(vectors, pkl_file)
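
`split_vectors_ids` calls `normalize_rows`, whose definition does not appear in this section. A minimal sketch of what such a helper might look like, assuming it scales each row to unit Euclidean length so that the dot product of two rows equals their cosine similarity. The guard for all-zero rows is our addition, not something the original is known to do.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit L2 norm (assumed behavior, not the original helper)."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Leave all-zero rows unchanged instead of dividing by zero.
    norms[norms == 0] = 1.0
    return matrix / norms

example = normalize_rows([[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]])
row_norms = np.linalg.norm(example, axis=1)
```

Normalizing once, before pickling, means the later scoring step can use a plain matrix product without dividing by norms again.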

In [4]:
# split_vectors_ids()

There is a bug with loading large pickled files on macOS, so if you try to run this locally on a Mac, it probably won't work!

In [5]:
# with open('../../vectors/vectors.pkl', 'rb') as vec_pkl:
#     vectors = pickle.load(vec_pkl)
    
# with open('../../vectors/id.json', 'r') as ids_json:
#     ids = json.load(ids_json)

## Computing the Scores

The functions in this subsection serve the following purposes:

1. `score` - compute a batch of similarity scores
1. `sort_scores` - sort the similarity scores, keeping in mind which articles the scores are associated with
1. `compute_recs_paper_id` - A convenience function for computing the recommendations of an article by specifying the id of the article. This will be used in the Flask app for online computation of recommendations.

In [6]:
def score(vectors, all_ids, id_low, id_high):
    """
    Outputs an array, scores. The columns of scores
    correspond to the slice of article ids

    all_ids[id_low:id_high]
    
    The rows are indexed by all_ids.
    
    The value in column C and row R is the 
    similarity between the articles all_ids[R]
    and all_ids[id_low+C].
    """
    # A contiguous range of rows is just a slice; all_ids is kept in the
    # signature for compatibility with existing callers.
    rows = vectors[id_low:id_high]
    scores = vectors @ rows.T
    return scores
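
Because `score` works on a slice of ids, the full similarity matrix never has to be held in memory at once. A sketch of how the slices might be generated, where `batch_bounds` and the batch size are hypothetical and the scoring is done inline on toy data rather than on the real article vectors:

```python
import numpy as np

def batch_bounds(n_items, batch_size):
    """Yield (low, high) index pairs covering range(n_items) (hypothetical helper)."""
    for low in range(0, n_items, batch_size):
        yield low, min(low + batch_size, n_items)

# Toy data: 5 unit-length row vectors in 3 dimensions.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(5, 3))
toy_vectors /= np.linalg.norm(toy_vectors, axis=1, keepdims=True)

bounds = list(batch_bounds(len(toy_vectors), 2))
# Each block has one row per article and one column per article in the batch.
blocks = [toy_vectors @ toy_vectors[low:high].T for low, high in bounds]
# Stacking the blocks side by side recovers the full similarity matrix.
full = np.hstack(blocks)
```

In a real run each block would be sorted and stored before the next one is computed, so only one block is alive at a time.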

In [7]:
def sort_scores(scores, all_ids, cur_ids):
    """
    Returns a dictionary of lists. The keys are the ids
    of articles we're currently evaluating.

    The lists contain tuples of scores and ids. The
    score is the similarity score between the current article
    and the other component of the tuple.
    """
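    # Row indices of the 51 highest scores in each column, best first.
    recs_index = scores.argsort(0)[-51:, :][::-1]
    recs = {
        # Read each score from the column of the current article, and
        # order the tuples (score, id) as described in the docstring.
        cur_id: [(scores[index, col_num], all_ids[index])
                 for index in recs_index[:, col_num]]
        for col_num, cur_id in enumerate(cur_ids)
    }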
    return recs
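
To make the indexing in `sort_scores` concrete, here is the same `argsort` pattern on a toy score matrix, keeping the top 2 rows per column instead of 51. The ids and scores are made up for illustration.

```python
import numpy as np

toy_ids = ['a', 'b', 'c', 'd']
# Rows are indexed by toy_ids; columns are the two articles being evaluated.
toy_scores = np.array([
    [1.0, 0.2],
    [0.2, 1.0],
    [0.9, 0.1],
    [0.3, 0.8],
])
top_k = 2

# argsort(0) sorts each column in ascending order; keep the last top_k rows,
# then reverse the row order so the best match comes first.
top_index = toy_scores.argsort(0)[-top_k:, :][::-1]

toy_recs = {
    cur_id: [(float(toy_scores[row, col]), toy_ids[row])
             for row in top_index[:, col]]
    for col, cur_id in enumerate(['a', 'b'])
}
# toy_recs == {'a': [(1.0, 'a'), (0.9, 'c')], 'b': [(1.0, 'b'), (0.8, 'd')]}
```

Note that each article appears as its own top recommendation with a score of 1, which is why one extra entry is kept.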

In [8]:
def compute_recs_paper_id(arxiv_id, ids, vectors):
    """
    Takes the arxiv_id in question,
    the list of all ids, and the array of vectors,
    and returns a dictionary whose key is the arxiv_id
    and whose value is a list of the recommended articles and scores.
    """
    
    id_num = ids.index(arxiv_id)
    scored = score(vectors, ids, id_num, id_num+1)
    recs = sort_scores(scored, ids, ids[id_num:id_num+1])
    return recs

In [10]:
# %time recs = compute_recs_paper_id('1801.08262', ids, vectors)

In [9]:
len(recs['1801.08262'])

51

In [10]:
recs['1801.08262'][:5]

[(1.0000000000000002, '1801.08262'),
 (0.96994439808605737, '1212.2697'),
 (0.9675059791477697, '1410.0230'),
 (0.96315506293288022, '1611.03840'),
 (0.96170088971694145, '1705.00164')]

### Interacting with SQL

We'll define some functions that allow us to store the recommendations in a database so we don't have to compute them again when they're requested by the Flask app. In particular, the Flask app should be able to perform two specific operations with the database.

1. We should be able to check whether a record exists in the table and, if it does, render something for the user.
2. If there is no matching record in the table, we can update the table and then serve the user.
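
These two operations amount to a look-up with a compute-on-miss fallback. A rough sketch of that control flow, using a plain dictionary as a stand-in for the database table. The names `get_or_compute_recs`, `store`, and `compute` are hypothetical, and the real app would go through the SQLAlchemy session instead of a dictionary.

```python
def get_or_compute_recs(arxiv_id, store, compute):
    """Return stored recommendations, computing and storing them on a miss.

    store   -- dict standing in for the database table (assumption)
    compute -- callable mapping an id to its list of recommendations
    """
    if arxiv_id in store:
        return store[arxiv_id]
    recs = compute(arxiv_id)
    store[arxiv_id] = recs
    return recs

calls = []

def fake_compute(arxiv_id):
    # Placeholder for the real similarity computation.
    calls.append(arxiv_id)
    return [(1.0, arxiv_id)]

store = {}
first = get_or_compute_recs('1801.08262', store, fake_compute)
second = get_or_compute_recs('1801.08262', store, fake_compute)
```

The second call finds the record already stored, so the expensive computation runs only once per article.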

In [2]:
from arxiv_imports.sqlalchemy_arxiv import Session, articles_similar

In [12]:
def send_to_server(arxiv_id, recs, table_class, session):
    """
    Sends computed recommendations to the database
    """
    new_recs = [{'id':key,'recs':value} for key, value in recs.items()]
    new_recs = [table_class(**args) for args in new_recs]
    session.add_all(new_recs)
    session.commit()

Send the new recommendations to the server.

In [14]:
session = Session()
send_to_server('1801.08262', recs, articles_similar, session)
session.close()

Get the records back from the server.

In [15]:
def request_recs(id_request, table_class_recs, session):
    """
    Retrieves computed recommendations from the database
    """
    query = (session
             .query(table_class_recs)
             .filter(table_class_recs.id==id_request)
            )
    records = query.all()
    if records:
        return records


In [16]:
session = Session()
recs_ = request_recs('1801.08262', articles_similar, session)
session.close()

In [30]:
recs_[0].recs[:5]

[[1.0000000000000002, '1801.08262'],
 [0.9699443980860574, '1212.2697'],
 [0.9675059791477697, '1410.0230'],
 [0.9631550629328802, '1611.03840'],
 [0.9617008897169415, '1705.00164']]

## Random articles results