# GloVe Vectors

In this notebook, we tokenize the article abstracts using the `re` module for regular expressions, use the `GloVe` library to create vectors for the tokens based on the corpus of abstracts, and then use these token vectors to create a vector for each article. This vector is the primary mechanism for comparing articles. In particular, these vectors can be thought of as geometric vectors, so any two of them have an angle between them. This angle is precisely the quantity that the similarity score we will use, cosine similarity, measures: the score is the cosine of the angle. Two vectors have a higher similarity score when the angle between them is smaller.
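As a minimal sketch of the relationship between angle and score, the snippet below computes cosine similarity for three toy two-dimensional vectors. The function name `cosine_similarity` and the vectors are illustrative assumptions, not part of the pipeline in this notebook.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, chosen only for illustration.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])  # 45 degrees away from a
c = np.array([0.0, 1.0])  # 90 degrees away from a

sim_ab = cosine_similarity(a, b)  # smaller angle, larger score
sim_ac = cosine_similarity(a, c)  # orthogonal vectors, score of zero
```

Because `b` is closer in angle to `a` than `c` is, `sim_ab` comes out larger than `sim_ac`.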

GloVe (short for Global Vectors) is a scheme for associating tokens with vectors by reading the entirety of the corpus of texts. GloVe is known to preserve semantic relationships between tokens, and so could do a good job of representing an entire abstract numerically.

## Load a SQLAlchemy `sessionmaker`
The `sqlalchemy_load.py` file is set up to load a `sessionmaker` object which is already pointed at the `postgres` server. We'll use this `sessionmaker` object, called `Session`, to manage transactions with the database.

In [1]:
from sqlalchemy_arxiv import Session, articles_raw, articles_vectors
from sqlalchemy import func

## Set up the SQLAlchemy query


In [2]:
import pandas as pd
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [3]:
session = Session()
query = session.query(articles_raw).limit(100)
df = pd.read_sql(query.statement, query.session.bind)
session.close()

In [4]:
df.head()

Unnamed: 0,id,created,setspec,title,abstract
0,1711.08738,2017-11-23,physics:physics,Complex Fluid-Fluid Interface may Non Triviall...,The present study theoretically predicts the...
1,1711.08739,2017-11-23,physics:physics,Thermally modulated cross-stream migration of ...,"In the present study, we investigate the cro..."
2,1711.0874,2017-11-23,cs,fpgaConvNet: A Toolflow for Mapping Diverse Co...,"In recent years, Convolutional Neural Networ..."
3,1711.08741,2017-11-23,math,Remark on the strong solvability of the Navier...,The initial value problem of the incompressi...
4,1711.08742,2017-11-23,cs,Estimating Missing Data in Temporal Data Strea...,Missing data is a ubiquitous problem. It is ...


## Tokenizing

To perform natural language processing, we first convert a text document, represented by a single string, into a sequence of strings (tokens), which are treated as the constituent objects making up the document. Since our textual data is highly specialized with specific types of formatting (mostly $\LaTeX$ commands), I set up a tokenizer that tries to retain these relationships.

For example, in mathematical writing, the string `\mathbb{R}`, which renders in $\LaTeX$ as $\mathbb{R}$, is typically used to refer to the real number line, while the string `R` by itself can refer to a number of very different objects (for example a [ring](https://en.wikipedia.org/wiki/Ring_(mathematics&#41;)). While the real number line is a ring, we absolutely do not want to conflate these two strings. So the idea here is to retain the `\mathbb` command as a part of the whole token for the real number line.

The functions in this section perform this tokenization, which tries to respect the $\LaTeX$.
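To make the goal concrete, here is a small sketch using a deliberately simplified pattern. The names `toy_latex_regex` and `pad_match` are hypothetical and cover far fewer cases than a real tokenizer would need; the point is only that `\mathbb{R}` survives as a single token, distinct from a bare `R`.

```python
import re

# Hypothetical, simplified pattern: a LaTeX command with one braced
# argument, or a bare command, or a single dollar sign.
toy_latex_regex = re.compile(r"\\[a-zA-Z]+\{.*?\}|\\[a-zA-Z]+|\$")

def pad_match(match):
    # Surround each match with spaces so that str.split() isolates it.
    return " " + match.group(0) + " "

text = r"a ring $R$ versus the real line $\mathbb{R}$"
tokens = toy_latex_regex.sub(pad_match, text).split()
```

After this runs, `tokens` contains both `R` and `\mathbb{R}` as separate, distinguishable entries, with the dollar signs split off as their own tokens.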

In [5]:
def latex_repl(latex_string):
    return " " + latex_string.group(0) + " "

In [6]:
def latex():
    # things to add spaces between
    dollar_sign = r'\${1,2}'

    # inline math delimiters \( and \)
    parens = r"\\\(|\\\)"

    # display math delimiters \[ and \]; the bracket itself must be escaped,
    # otherwise "[" would open a character class
    brackets_left = r"\\\["
    brackets_right = r"\\\]"

    # things like \begin, \mathbb, \emph etc.
    commands = r'\\[a-zA-Z]+?\{.*?\}'

    # bare commands such as \infty or \sin
    constants_functions = r'\\[a-zA-Z]+'
    simple_math = r'[+=\-/]'

    pattern = [
        dollar_sign,
        brackets_left,
        brackets_right,
        parens,
        commands,
        constants_functions,
        simple_math,
    ]

    pattern = "|".join(pattern)

    regex_compiled = re.compile(pattern, flags=re.DOTALL)

    return regex_compiled

In [7]:
def process_abstract(abstract, latex_regex, latex_repl, whitespace_regex):
    abstract = latex_regex.sub(latex_repl, abstract)
    abstract = whitespace_regex.sub(' ', abstract)
    stop_words_removed = [ word for word in abstract.strip().split() if word.lower() not in ENGLISH_STOP_WORDS ]
    return ' '.join(stop_words_removed)

In [14]:
abstract = df.at[3,'abstract']
abstract

'  The initial value problem of the incompressible Navier-Stokes equations with\nnon-zero forces in $L^{n,\\infty}(\\mathbb{R}^n)$ is investigated. Even though\nthe Stokes semigroup is not strongly continuous on\n$L^{n,\\infty}(\\mathbb{R}^n)$, with the qualitative condition for the external\nforces, it is clarified that the mild solution of the Naiver-Stokes equations\nsatisfies the differential equations in the topology of\n$L^{n,\\infty}(\\mathbb{R}^n)$. Inspired by the conditions for the forces, we\ncharacterize the maximal complete subspace in $L^{n,\\infty}(\\mathbb{R}^n)$\nwhere the Stokes semigroup is strongly continuous at $t=0$. By virtue of this\nsubspace, we also show local well-posedness of the strong solvability of the\nCauchy problem without any smallness condition on the initial data in the\nsubspace. Finally, we discuss the uniqueness criterion for the mild solutions\nin weak Lebesgue spaces by the argument by Brezis.\n'

In [15]:
abstract = df.at[3,'abstract']

latex_regex = latex()
whitespace_regex = re.compile(r'[:;.,?\s]+')

process_abstract(abstract, latex_regex, latex_repl, whitespace_regex)

'initial value problem incompressible Navier - Stokes equations non - zero forces $ L^{n \\infty }( \\mathbb{R} ^n) $ investigated Stokes semigroup strongly continuous $ L^{n \\infty }( \\mathbb{R} ^n) $ qualitative condition external forces clarified mild solution Naiver - Stokes equations satisfies differential equations topology $ L^{n \\infty }( \\mathbb{R} ^n) $ Inspired conditions forces characterize maximal complete subspace $ L^{n \\infty }( \\mathbb{R} ^n) $ Stokes semigroup strongly continuous $ t = 0 $ virtue subspace local - posedness strong solvability Cauchy problem smallness condition initial data subspace Finally discuss uniqueness criterion mild solutions weak Lebesgue spaces argument Brezis'

## Create a corpus text file

The `GloVe` library is set up to read a plain text file with one document per line and tokens separated by spaces. The `table_processor` function reads the abstracts from the PostgreSQL database, tokenizes them as above, and writes them to a file so that we can use `GloVe`.
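As a rough illustration of that file layout, the sketch below writes two already processed toy documents to a temporary file and reads them back. The file name `toy_corpus.txt` and the documents are made up for the example.

```python
import os
import tempfile

# Two toy documents that have already been tokenized and re-joined with spaces.
documents = ["initial value problem", r"$ \mathbb{R} ^n $ investigated"]

with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "toy_corpus.txt")

    # One document per line, tokens separated by single spaces.
    with open(corpus_path, "w") as corpus:
        for doc in documents:
            corpus.write(doc + "\n")

    # Reading it back: each line is a document, each split piece a token.
    with open(corpus_path) as corpus:
        tokenized = [line.split() for line in corpus]
```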

Some of the processes below are a bit intensive. I ran this notebook in a `t2.large` instance on AWS. The `Wall time` output of `%time` below refers to elapsed real time, i.e. the time according to "the clock on the wall", as opposed to the CPU time spent on the computation.

In [10]:
def table_processor(session, table_class, corpus_file_path,
                    batch_size, latex_regex, latex_repl, whitespace_regex):
    
    query = session.query(table_class.abstract).yield_per(batch_size)
    
    with open(corpus_file_path, 'w') as corpus:
        for row in query:
            abstract = row.abstract
            abstract = process_abstract(abstract, latex_regex, latex_repl, whitespace_regex)
            corpus.write(abstract + '\n')

In [11]:
# options = {
#     'session':Session(),
#     'table_class':articles_raw,
#     'corpus_file_path':'../../vectors/arxiv_corpus.txt',
#     'batch_size':1000,
#     'latex_regex':latex_regex,
#     'latex_repl':latex_repl,
#     'whitespace_regex':whitespace_regex,
# }

# %time table_processor(**options)

In [12]:
import subprocess

def train_GloVe():
    cmd = ["./glove_arxiv.sh"]
    with open("glove.log", "w") as log:
        subprocess.run(cmd, stderr=log)

The cell below took about 24 hours on a `t2.2xlarge` instance from AWS EC2.

In [13]:
# train_GloVe()

### Vectors for abstracts

Reading the `GloVe` output vectors into a `pandas` DataFrame presented some odd challenges. Even with the reader set up correctly (proper separator etc.), a small fraction of the rows were not read correctly: some index elements were doubled and some entire rows were read as indexes. To address this issue, I first tried to use the `np.genfromtxt` function to read the file directly as an array. This also failed.

Ultimately I just read the file in manually and explicitly constructed a DataFrame.

I discovered this issue by looking at the lengths of different elements in the DataFrame index and saw a number of index elements (the `GloVe` tokens) that were more than one-hundred thousand characters long.
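A check along these lines can be sketched on a toy DataFrame. The data and the threshold of 1000 characters are arbitrary assumptions for illustration; the last index label is constructed to mimic a row that was swallowed into the index.

```python
import pandas as pd

# Toy frame: the last index label mimics a misread row.
toy = pd.DataFrame(
    {"0": [0.1, 0.2, 0.3]},
    index=["model", "$", "x" * 150_000],
)

# Length of every index label, then the labels that are implausibly long.
label_lengths = toy.index.str.len()
suspicious = toy.index[label_lengths > 1000]  # arbitrary cutoff
```

Any entry in `suspicious` is a strong hint that a line of the vectors file was not parsed into a label plus numeric columns.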

In [14]:
import numpy as np

In [15]:
with open('../../vectors/GloVe_scratch_files/vectors.txt', 'r') as file:
    lines = file.readlines()


In [16]:
def row_split(line, cols, index):
    label, *vector = line.split()
    index.append(label)
    
    for col_num, new_value in enumerate(vector):
        cols[col_num].append(new_value)
    
    

In [17]:
cols = {i:[] for i in range(300)}
index = []

for line in lines:
    row_split(line, cols, index)

df = pd.DataFrame(data=cols, index=index).astype(float)
    

In [18]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
-,0.221337,0.034171,-0.1181,-0.172244,-0.117517,0.128754,-0.116952,-0.028031,0.048693,0.01018,...,0.07531,0.199566,0.091558,-0.025151,-0.073502,-0.143817,-0.247823,0.089644,-0.154207,0.135924
$,0.570122,0.025272,-0.226462,0.163964,-0.141922,0.023358,-0.216404,-0.382666,-0.047065,0.433195,...,0.074485,0.42572,-0.03299,0.137178,-0.172863,0.279619,0.578546,-0.14947,0.093483,-0.060006
model,0.230135,0.122644,0.185519,-0.034758,-0.236756,0.11798,-0.215023,-0.135311,0.253569,0.096707,...,0.036827,0.079115,-0.155606,-0.144959,0.156771,-0.164449,-0.417972,0.015737,-0.024429,0.098938
/,0.045956,0.271367,-0.392238,-0.259288,-0.272326,-0.161802,0.072062,-0.191248,-0.225739,-0.172222,...,0.005772,0.013629,0.012839,-0.014166,0.076333,0.026164,0.160343,0.11287,-0.216186,0.002026
=,0.084244,0.014944,-0.228621,0.041487,-0.07657,-0.190362,-0.490148,-0.123931,0.126358,0.247899,...,-0.042797,0.206089,-0.154815,0.264402,-0.213689,0.084507,0.19247,-0.124478,-0.008774,-0.069953


The `build_doc_vector` function takes an article abstract, represented as a string, and computes the vector associated with that article by summing the vectors of its tokens. The `abstract_csv` function computes the vectors for all articles and writes them to a file.
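The summation that `build_doc_vector` performs can be seen on a toy example. The two-dimensional word vectors and the helper name `sum_token_vectors` below are assumptions made for illustration; tokens missing from the vocabulary are simply skipped.

```python
import numpy as np
import pandas as pd

# Toy vocabulary with two-dimensional vectors (the real ones have 300 components).
word_vectors = pd.DataFrame(
    [[1.0, 0.0], [0.0, 2.0]],
    index=["stokes", "semigroup"],
)

def sum_token_vectors(document, word_df):
    """Sum the vectors of all tokens of `document` that appear in `word_df`."""
    total = np.zeros(word_df.shape[1])
    for token in document.split():
        if token in word_df.index:
            total += word_df.loc[token].to_numpy()
    return total

# "stokes" is counted twice; "unknown" is not in the vocabulary and is ignored.
doc_vector = sum_token_vectors("stokes semigroup stokes unknown", word_vectors)
```

Since cosine similarity depends only on direction, summing rather than averaging the token vectors does not change the similarity scores between documents.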

In [19]:
def build_doc_vector(document, word_df):
    
    """
    Returns the vector for a given, already tokenized, document.
    """
    document = document.split()
    
    vector = np.array([0.0 for _ in word_df.columns])
    
    for token in document:
        if token in word_df.index:
            vector += word_df.loc[token,:].values
    return vector
        

In [20]:
def abstract_csv(article_class,
                 session,
                 latex_regex,
                 latex_repl,
                 whitespace_regex,
                 process_abstract,
                 word_df,
                 output_file
                ):
    
    """
    Write the article vectors to output_file.
    """
    
    query = session.query(article_class.id, article_class.abstract).yield_per(10000)
    
    with open(output_file, "w") as file:
        for record in query:
            abstract = process_abstract(record.abstract, latex_regex, latex_repl, whitespace_regex)
            abstract_vector = build_doc_vector(abstract, word_df)
            abstract_vector = list(abstract_vector)
            abstract_vector = [str(vector_component) for vector_component in abstract_vector]

            new_line = [record.id]
            new_line.extend(abstract_vector)
            new_line = ', '.join(new_line) + '\n'
            
            file.write(new_line)



The cell below took about 6 hours to run on a `t2.large`.

In [21]:
session = Session()
output_file = "../../vectors/arxiv_vectors.csv"
abstract_csv(articles_raw,
             session,
             latex_regex,
             latex_repl,
             whitespace_regex,
             process_abstract,
             word_df=df,
             output_file=output_file,
)
session.close()

## Pass it back to the database

We'll store the vectors associated with the articles in our PostgreSQL database. The `csv` module is a convenient way to read a `CSV` file directly.
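As a small sketch of how such a file can be parsed, the snippet below reads two made-up lines in the `id, component, component, ...` layout from an in-memory buffer. The values are invented, and only two components are used instead of 300.

```python
import csv
import io

# Toy stand-in for the vectors file: an article id followed by its components.
toy_csv = io.StringIO("1711.08741, 0.5, -1.25\n1711.08742, 2.0, 0.0\n")

records = []
for arxiv_id, *components in csv.reader(toy_csv):
    # float() tolerates the leading space left after each comma.
    records.append((arxiv_id, [float(comp) for comp in components]))
```

The id is kept as a string, so the formatting of identifiers in the file is preserved exactly.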

In [10]:
import csv

In [13]:
def migrate_to_db(vector_file='../../vectors/arxiv_vectors.csv', session=None, articles_vectors=articles_vectors):
    
    """
    Sends the vector_file to the articles_vectors table in the 
    Postgres database. 
    """
    with open(vector_file, newline='', mode='r') as vector_csv:
        csv_reader = csv.reader(vector_csv)
        new_records = []
        for line_num, line in enumerate(csv_reader):
            
            arxiv_id, *vector = line
            vector = [float(comp) for comp in vector]
            table_args = {f'comp_{i}':vector[i] for i in range(300) }
            table_args['id'] = arxiv_id
            new_table_entry = articles_vectors(**table_args)
            new_records.append(new_table_entry)

            #update every now and then.
            if line_num % 1000 == 0:
                session.add_all(new_records)
                session.commit()
            
                new_records = []
                
        #final table update
        session.add_all(new_records)
        session.commit()
        

In [None]:
session = Session()
%time migrate_to_db(session=session)
session.close()