# GloVe Vectors

## Load a SQLAlchemy `sessionmaker`
The `sqlalchemy_load.py` module is set up to provide a session maker that already points at the `postgres` server. We'll use the `sessionmaker` object to manage transactions with the database.

In [12]:
from sqlalchemy_arxiv import Session, articles_raw, articles_vectors
from sqlalchemy import func

## Set up the query and tokenizing.

In [2]:
import pandas as pd
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [3]:
session = Session()
query = session.query(articles_raw).limit(100)
df = pd.read_sql(query.statement, query.session.bind)
session.close()

In [4]:
df.head()

Unnamed: 0,id,created,setspec,title,abstract
0,1506.00545,2015-06-01,physics:cond-mat,Area law and its violation: A microscopic insp...,Quantum fluctuations of local quantities can...
1,1506.00546,2015-06-01,physics:hep-ex,An amplitude analysis of the $\pi^{0}\pi^{0}$ ...,An amplitude analysis of the $\pi^{0}\pi^{0}...
2,1506.00547,2015-06-01,cs,Differential Geometric SLAM,The simultaneous localization and mapping (S...
3,1506.00548,2015-06-01,cs,GRADOOP: Scalable Graph Data Management and An...,Many Big Data applications in business and s...
4,1506.00549,2015-06-01,physics:cond-mat,Quantum Monte Carlo study of strange correlato...,Distinguishing the nontrivial symmetry-prote...


What's my plan here?

1. Loop over the table so I can process each abstract.
2. Format the LaTeX as best I can.

    i. The result should be a stripped string, with no leading or trailing whitespace.
    
    ii. The only whitespace allowed is spaces: no tabs, newlines, etc.
    
    iii. LaTeX commands should be kept together, so that `\begin{align}` becomes a single token, etc.
    
3. Write the formatted version to a file, separating documents with newlines.
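
As a rough illustration of the rules above, here is a minimal sketch on a toy string. The pattern and the `normalize` helper are simplified stand-ins assumed for illustration; they are not the regex this notebook actually uses.

```python
import re

# Simplified, assumed pattern: a LaTeX command with one braced argument,
# or a bare command such as \pi.
LATEX_TOKEN = re.compile(r"\\[a-zA-Z]+\{[^}]*\}|\\[a-zA-Z]+")


def normalize(text):
    # Rule iii: pad each LaTeX command with spaces so it stands alone as a token.
    text = LATEX_TOKEN.sub(lambda m: " " + m.group(0) + " ", text)
    # Rules i and ii: split() drops leading/trailing whitespace and treats tabs
    # and newlines like spaces; join() puts back single spaces.
    return " ".join(text.split())


print(normalize("  The map\\begin{align}x\\end{align}\n\tis\tlinear.\n"))
```

Padding first and collapsing whitespace afterwards means the extra spaces introduced around a command never survive as doubled spaces.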

In [5]:
def latex_repl(latex_string):
    return " " + latex_string.group(0) + " "

In [6]:
def latex():
    # Things to pad with spaces so they become separate tokens.
    dollar_sign = r'\${1,2}'

    # Inline math delimiters: \( and \)
    parens = r"\\\(|\\\)"

    # Display math delimiters: \[ and \]
    # (the bracket itself must be escaped, or it opens a character class)
    brackets_left = r"\\\["
    brackets_right = r"\\\]"

    # Commands with an argument, like \begin{...}, \mathbb{...}, \emph{...}
    commands = r'\\[a-zA-Z]+?\{.*?\}'

    # Bare commands, like \pi or \sin
    constants_functions = r'\\[a-zA-Z]+'
    simple_math = r'[+=\-/]'

    pattern = [
        dollar_sign,
        brackets_left,
        brackets_right,
        parens,
        commands,
        constants_functions,
        simple_math,
    ]

    pattern = "|".join(pattern)

    regex_compiled = re.compile(pattern, flags=re.DOTALL)

    return regex_compiled

In [7]:
def process_abstract(abstract, latex_regex, latex_repl, whitespace_regex):
    abstract = latex_regex.sub(latex_repl, abstract)
    abstract = whitespace_regex.sub(' ', abstract)
    stop_words_removed = [ word for word in abstract.strip().split() if word.lower() not in ENGLISH_STOP_WORDS ]
    return ' '.join(stop_words_removed)

In [8]:
abstract = df.at[10,'abstract']
abstract

'  We present a new extensive analysis of the old problem of finding a\nsatisfactory calibration of the relation between the geometric albedo and some\nmeasurable polarization properties of the asteroids. To achieve our goals, we\nuse all polarimetric data at our disposal. For the purposes of calibration, we\nuse a limited sample of objects for which we can be confident to know the\nalbedo with good accuracy, according to previous investigations of other\nauthors. We find a new set of updated calibration coefficients for the\nclassical slope - albedo relation, but we generalize our analysis and we\nconsider also alternative possibilities, including the use of other\npolarimetric parameters, one being proposed here for the first time, and the\npossibility to exclude from best-fit analyzes the asteroids having low albedos.\nWe also consider a possible parabolic fit of the whole set of data.\n'

In [9]:
abstract = df.at[10,'abstract']

latex_regex = latex()
whitespace_regex = re.compile(r'[:;.,?\s]+')

process_abstract(abstract, latex_regex, latex_repl, whitespace_regex)

'present new extensive analysis old problem finding satisfactory calibration relation geometric albedo measurable polarization properties asteroids achieve goals use polarimetric data disposal purposes calibration use limited sample objects confident know albedo good accuracy according previous investigations authors new set updated calibration coefficients classical slope - albedo relation generalize analysis consider alternative possibilities including use polarimetric parameters proposed time possibility exclude best - fit analyzes asteroids having low albedos consider possible parabolic fit set data'

## Create a corpus text file.

Some of the steps below are fairly intensive. I ran this notebook on a `t2.large` instance on AWS. The `wall time` output below refers to elapsed real time (as measured by the computer), i.e. the time according to "the clock on the wall", as opposed to the CPU time spent on the computation.
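
To make the distinction concrete, the sketch below measures both quantities with the standard library. The `timed` helper is a made-up name for illustration; it is not what `%time` runs internally.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real ("wall clock") time
    cpu_start = time.process_time()    # CPU time used by this process
    result = func(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


# Sleeping passes wall-clock time but uses almost no CPU time.
_, wall, cpu = timed(time.sleep, 0.2)
print(f"wall: {wall:.3f}s, cpu: {cpu:.3f}s")
```

For a job that mostly waits on the database or the disk, wall time can be much larger than CPU time, so wall time is the number that tells you how long you will actually be waiting.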

In [10]:
def table_processor(session, table_class, corpus_file_path,
                    batch_size, latex_regex, latex_repl, whitespace_regex):
    
    query = session.query(table_class.abstract).yield_per(batch_size)
    
    with open(corpus_file_path, 'w') as corpus:
        for row in query:
            abstract = row.abstract
            abstract = process_abstract(abstract, latex_regex, latex_repl, whitespace_regex)
            corpus.write(abstract + '\n')

In [11]:
# options = {
#     'session':Session(),
#     'table_class':articles_raw,
#     'corpus_file_path':'../../vectors/arxiv_corpus.txt',
#     'batch_size':1000,
#     'latex_regex':latex_regex,
#     'latex_repl':latex_repl,
#     'whitespace_regex':whitespace_regex,
# }

# %time table_processor(**options)

In [12]:
import subprocess

def train_GloVe():
    cmd = ["./glove_arxiv.sh"]
    with open("glove.log", "w") as log:
        subprocess.run(cmd, stderr=log)

The cell below took about 24 hours on a `t2.2xlarge` instance from AWS EC2.

In [13]:
# train_GloVe()

### Vectors for abstracts

Reading the `GloVe` output vectors into a `pandas` DataFrame presented some odd challenges. Even with the reader set up correctly (proper separator, etc.), a small fraction of the rows were not read correctly: some index elements were doubled, and some entire rows were read as index entries. To address this, I first tried the `np.genfromtxt` function to read the file directly as an array. This also failed.

Ultimately I just read the file in manually and explicitly constructed a DataFrame.

I discovered this issue by looking at the lengths of different elements in the DataFrame index and saw a number of index elements (the `GloVe` tokens) that were more than one-hundred thousand characters long.
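
A similar sanity check can be run on the raw text before any DataFrame is built. The sketch below is an assumed helper (`find_bad_lines` is not part of this notebook); it flags any line that is not one token followed by the expected number of float components, with the default of 300 matching the vector size used here.

```python
def find_bad_lines(lines, dim=300):
    """Return (line_number, reason) pairs for lines that are not token + dim floats."""
    problems = []
    for line_num, line in enumerate(lines):
        fields = line.split()
        if len(fields) != dim + 1:
            problems.append((line_num, f"expected {dim + 1} fields, got {len(fields)}"))
            continue
        try:
            [float(value) for value in fields[1:]]
        except ValueError:
            problems.append((line_num, "non-numeric vector component"))
    return problems


# Toy data with dim=3: line 1 is too short, line 2 has a bad component.
sample = ["model 0.1 0.2 0.3\n", "albedo 0.4 0.5\n", "fit 0.1 oops 0.3\n"]
print(find_bad_lines(sample, dim=3))
```

An empty result on the real file would suggest the file itself is well formed and that the trouble lies in how it is being parsed.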

In [14]:
import numpy as np

In [15]:
with open('../../vectors/GloVe_scratch_files/vectors.txt', 'r') as file:
    lines = file.readlines()


In [16]:
def row_split(line, cols, index):
    label, *vector = line.split()
    index.append(label)
    
    for col_num, new_value in enumerate(vector):
        cols[col_num].append(new_value)
    
    

In [17]:
cols = {i:[] for i in range(300)}
index = []

for line in lines:
    row_split(line, cols, index)

df = pd.DataFrame(data=cols, index=index).astype(float)
    

In [18]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
-,0.221337,0.034171,-0.1181,-0.172244,-0.117517,0.128754,-0.116952,-0.028031,0.048693,0.01018,...,0.07531,0.199566,0.091558,-0.025151,-0.073502,-0.143817,-0.247823,0.089644,-0.154207,0.135924
$,0.570122,0.025272,-0.226462,0.163964,-0.141922,0.023358,-0.216404,-0.382666,-0.047065,0.433195,...,0.074485,0.42572,-0.03299,0.137178,-0.172863,0.279619,0.578546,-0.14947,0.093483,-0.060006
model,0.230135,0.122644,0.185519,-0.034758,-0.236756,0.11798,-0.215023,-0.135311,0.253569,0.096707,...,0.036827,0.079115,-0.155606,-0.144959,0.156771,-0.164449,-0.417972,0.015737,-0.024429,0.098938
/,0.045956,0.271367,-0.392238,-0.259288,-0.272326,-0.161802,0.072062,-0.191248,-0.225739,-0.172222,...,0.005772,0.013629,0.012839,-0.014166,0.076333,0.026164,0.160343,0.11287,-0.216186,0.002026
=,0.084244,0.014944,-0.228621,0.041487,-0.07657,-0.190362,-0.490148,-0.123931,0.126358,0.247899,...,-0.042797,0.206089,-0.154815,0.264402,-0.213689,0.084507,0.19247,-0.124478,-0.008774,-0.069953


In [19]:
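def build_doc_vector(document, word_df):
    document = document.split()
    
    # Start from the zero vector, one component per embedding dimension.
    vector = np.zeros(len(word_df.columns))
    
    # Sum the vectors of all tokens in the vocabulary; unknown tokens are skipped.
    for token in document:
        if token in word_df.index:
            vector += word_df.loc[token,:].values
    return vector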
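
To make the summing step concrete, here is a dependency-free sketch of the same idea with a plain dict in place of the DataFrame. The two-dimensional toy vectors and the `sum_vectors` name are made up for illustration.

```python
def sum_vectors(document, word_vectors, dim):
    """Sum the vectors of the known tokens in a whitespace-separated document."""
    total = [0.0] * dim
    for token in document.split():
        vector = word_vectors.get(token)
        if vector is None:
            continue  # out-of-vocabulary tokens contribute nothing
        total = [running + component for running, component in zip(total, vector)]
    return total


toy_vectors = {"albedo": [1.0, 0.5], "fit": [0.25, -0.5]}
print(sum_vectors("albedo fit unknown albedo", toy_vectors, dim=2))
# prints [2.25, 0.5]
```

Because the vectors are summed rather than averaged, a repeated token counts once per occurrence, and a document with no known tokens maps to the zero vector.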
        

In [20]:
def abstract_csv(article_class,
                 session,
                 latex_regex,
                 latex_repl,
                 whitespace_regex,
                 process_abstract,
                 word_df,
                 output_file
                ):
    
    query = session.query(article_class.id, article_class.abstract).yield_per(10000)
    
    with open(output_file, "w") as file:
        for record in query:
            abstract = process_abstract(record.abstract, latex_regex, latex_repl, whitespace_regex)
            abstract_vector = build_doc_vector(abstract, word_df)
            abstract_vector = list(abstract_vector)
            abstract_vector = [str(vector_component) for vector_component in abstract_vector]

            new_line = [record.id]
            new_line.extend(abstract_vector)
            new_line = ', '.join(new_line) + '\n'
            
            file.write(new_line)
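
One detail of this output format is worth checking: fields are joined with `', '`, so every component after the id carries a leading space when the file is read back. The sketch below uses a made-up two-component line to show how the standard `csv` module treats it.

```python
import csv
import io

# A line in the format written above: id first, then ", "-separated components.
line = ", ".join(["1506.00545", "0.25", "-1.5"]) + "\n"

# Default parsing keeps the space after each comma.
default_row = next(csv.reader(io.StringIO(line)))
print(default_row)

# skipinitialspace=True drops it, giving clean fields.
clean_row = next(csv.reader(io.StringIO(line), skipinitialspace=True))
print(clean_row)

# float() accepts surrounding whitespace, so both rows convert the same way.
assert [float(x) for x in default_row[1:]] == [float(x) for x in clean_row[1:]]
```

So the numeric components survive either way; the leading space only matters for fields that are used as strings.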



The cell below took about 6 hours to run on a `t2.large`. In the future I should look into parallelizing this process.

In [21]:
session = Session()
output_file = "../../vectors/arxiv_vectors.csv"
# abstract_csv writes to output_file and returns nothing.
abstract_csv(articles_raw,
             session,
             latex_regex,
             latex_repl,
             whitespace_regex,
             process_abstract,
             word_df=df,
             output_file=output_file,
)
session.close()

## Pass it back to the table

In [10]:
import csv

In [13]:
def migrate_to_db(vector_file='../../vectors/arxiv_vectors.csv', session=None):
    with open(vector_file, newline='', mode='r') as vector_csv:
        csv_reader = csv.reader(vector_csv)
        new_records = []
        for line_num, line in enumerate(csv_reader):
            
            arxiv_id, *vector = line
            vector = [float(comp) for comp in vector]
            table_args = {f'comp_{i}':vector[i] for i in range(300) }
            table_args['id'] = arxiv_id
            new_table_entry = articles_vectors(**table_args)
            new_records.append(new_table_entry)

            #update every now and then.
            if line_num % 1000 == 0:
                session.add_all(new_records)
                session.commit()
            
                new_records = []
                
        #final table update
        session.add_all(new_records)
        session.commit()
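
The commit-every-so-often pattern can also be factored into a small batching helper. This is only a sketch, and `batched` here is a hypothetical local function; each yielded list would be passed to `session.add_all` followed by `session.commit`.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items from iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Every batch except possibly the last holds exactly batch_size items.
print(list(batched(range(5), 2)))
# prints [[0, 1], [2, 3], [4]]
```

With this shape there is no separate "final update" step, since the last partial batch is yielded like any other.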
        

In [None]:
session = Session()
%time migrate_to_db(session=session)
session.close()