## Outline 
In this notebook we create two new tables in a PostgreSQL database, `arxiv_detex` and `arxiv_pandoc`. These tables contain the titles and abstracts of the `arXiv` articles after they have been processed by `detex` and `pandoc`, respectively. These processing steps sometimes fail; when they do, we store the original string in the new table instead. Both tables have the following columns.


| id | created | setspec | title | title_converted | abstract | abstract_converted |
|----|---------|---------|-------|-----------------|----------|--------------------|
|    |         |         |       |                 |          |                    |

The `title_converted` and `abstract_converted` columns are boolean: they are `f` only when the program used to convert the text failed to return anything, and `t` otherwise.
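To make the layout concrete, here is a minimal sketch of the same seven columns in an in-memory SQLite table. SQLite stands in for PostgreSQL only so the example runs anywhere, and the two rows are made-up placeholders, not real `arXiv` records. Note that SQLite stores booleans as `1`/`0` rather than displaying `t`/`f`.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the PostgreSQL server
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE arxiv_detex (
        id TEXT PRIMARY KEY,
        created DATE,
        setspec TEXT,
        title TEXT,
        title_converted BOOLEAN,
        abstract TEXT,
        abstract_converted BOOLEAN
    )
""")

# Placeholder rows: the second one keeps its raw abstract because conversion "failed"
rows = [
    ('example-1', '2020-01-01', 'math', 'A title', True, 'Plain abstract', True),
    ('example-2', '2020-01-02', 'cs', 'Another title', True, r'Raw $\alpha$ abstract', False),
]
conn.executemany('INSERT INTO arxiv_detex VALUES (?, ?, ?, ?, ?, ?, ?)', rows)

# Find the records whose abstract could not be converted
failed_ids = [row[0] for row in conn.execute(
    'SELECT id FROM arxiv_detex WHERE NOT abstract_converted ORDER BY id')]
print(failed_ids)
```

A query of this shape is one way to find the rows where the stored string is still the original $\LaTeX$.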

# Cleaning up $\LaTeX$

There's a bit of a mismatch between the tools we're using and the task at hand. The abstracts and titles are written in $\LaTeX$, which creates a few different problems. Consider the following source text.

```text
My favorite theorem is Euler's equation. Euler's equation says
$$
e^{i\theta} = \cos(\theta) + i\sin(\theta).
$$
Euler's equation can be proven using the power series representations of the three functions $e^{i\theta}, \cos(\theta)$ and $\sin(\theta)$. For example 

$$
\sum_{n=0}^\infty \frac{x^n}{n!} = e^x
$$
```


This text block, after being run through a $\LaTeX$ compiler, would produce a document (e.g. a PDF) that looks something like the cell below.




My favorite theorem is Euler's equation. Euler's equation says
$$
e^{i\theta} = \cos(\theta) + i\sin(\theta).
$$
Euler's equation can be proven using the power series representations of the three functions $e^{i\theta}, \cos(\theta)$ and $\sin(\theta)$. For example 

$$
\sum_{n=0}^\infty \frac{x^n}{n!} = e^x.
$$


The problem here is that the NLP tools aren't set up to handle all of the peculiarities of $\LaTeX$. To get around this, I used a program called `detex`, which takes the raw $\LaTeX$ above and converts it to the text below.

In [1]:
import subprocess

str_to_fix = r"""
My favorite theorem is Euler's equation. Euler's equation says
$$
e^{i\theta} = \cos(\theta) + i\sin(\theta).
$$
Euler's equation can be proven using the power series representations of the three functions $e^{i\theta}, \cos(\theta)$ and $\sin(\theta)$. For example 

$$
\sum_{n=0}^\infty \frac{x^n}{n!} = e^x
$$
"""

str_to_fix = bytes(str_to_fix, encoding='utf-8')
detex_path = '/usr/bin/detex'

new_str = subprocess.run(detex_path.split(), input=str_to_fix, stdout=subprocess.PIPE) 
new_str = new_str.stdout
new_str = str(new_str, encoding='utf-8')
print(new_str)


My favorite theorem is Euler's equation. Euler's equation says



Euler's equation can be proven using the power series representations of the three functions  and . For example 







Of course we've lost information here, but the goal of converting our $\LaTeX$ to plain text was never to preserve all of the information; it was to preserve what we can in a format we can work with. The plain text above is certainly something that `spaCy` can understand and work with.
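
The encode, pipe, decode round trip used for `detex` can be wrapped in a small helper. The sketch below is an illustration, not the notebook's own code: `run_text_filter` and `upper_filter` are hypothetical names, and a tiny Python one-liner stands in for `detex` so the example runs even where `detex` is not installed.

```python
import subprocess
import sys

def run_text_filter(command, text):
    """Pipe `text` through an external command and return its stdout as a str."""
    result = subprocess.run(
        command,
        input=text.encode('utf-8'),   # the external program reads bytes on stdin
        stdout=subprocess.PIPE,
        check=True,                   # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode('utf-8')

# Stand-in filter (upper-cases its input) so the sketch does not depend on detex
upper_filter = [sys.executable, '-c',
                'import sys; sys.stdout.write(sys.stdin.read().upper())']
print(run_text_filter(upper_filter, 'euler'))
```

With `detex` installed at the path used in this notebook, the equivalent call would be `run_text_filter(['/usr/bin/detex'], text)`.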

## WARNING

Not everything in the metadata is proper $\LaTeX$. `detex` won't catch bad code; it just tries to convert the text without really parsing it. For example, let's try the following variation on the above code.

```text
My favorite theorem is Euler's equation. Euler's equation says
$$
e^{i\theta} = \cos(\theta) + i\sin(\theta).
$$
Euler's equation can be proven using the power series representations of the three functions $e^{i\theta}, \cos(\theta)$ and $\sin(\theta)$. For example 

$
\sum_{n=0}^\infty \frac{x^n}{n!} = e^x
$$
```

The only change is that a `$` has been removed from the opening delimiter of the second equation. This is a critical $\LaTeX$ error: the code above will fail to generate a correct document. However, `detex` will not detect this.

In [2]:
str_to_fix = r"""
My favorite theorem is Euler's equation. Euler's equation says

e^{i\theta} = \cos(\theta) + i\sin(\theta).
Euler's equation can be proven using the power series representations of the three functions $e^{i\theta}, \cos(\theta)$ and $\sin(\theta)$. For example 


\sum_{n=0}^\infty \frac{x^n}{n!} = e^x
$$
"""

str_to_fix = bytes(str_to_fix, encoding='utf-8')

new_str = subprocess.run(detex_path.split(), input=str_to_fix, stdout=subprocess.PIPE) 
new_str = new_str.stdout
new_str = str(new_str, encoding='utf-8')
print(new_str)


My favorite theorem is Euler's equation. Euler's equation says

e^i = () + i().
Euler's equation can be proven using the power series representations of the three functions  and . For example 


_n=0^x^nn! = e^x
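
Since `detex` does not flag this kind of mistake, one could run a rough check on the math delimiters before converting. The sketch below is a heuristic and not part of the original pipeline; `math_delimiters_balanced` is a hypothetical helper that only tracks `$` and `$$`, ignores other math environments, and will misjudge two inline formulas written back to back (`$x$$y$`).

```python
import re

# Match an escaped character (e.g. \$ or \t), a display delimiter, or an inline delimiter
_TOKEN = re.compile(r'\\.|\$\$|\$', re.DOTALL)

def math_delimiters_balanced(text):
    """Return True if every `$` or `$$` that opens math mode is closed by the same delimiter."""
    mode = None  # None outside math, otherwise the delimiter that opened math mode
    for match in _TOKEN.finditer(text):
        token = match.group()
        if token.startswith('\\'):
            continue              # escaped characters never open or close math mode
        if mode is None:
            mode = token          # entering math mode
        elif mode == token:
            mode = None           # leaving math mode with the matching delimiter
        else:
            return False          # e.g. opened with `$` but closed with `$$`
    return mode is None

print(math_delimiters_balanced(r'Euler: $$ e^{i\theta} $$ and $x$'))
print(math_delimiters_balanced('$\n\\sum_{n=0}^\\infty x^n\n$$'))
```

The first call describes well-formed text and the second reproduces the `$` ... `$$` mismatch from the broken example.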




Another program, `pandoc`, can be used to try to convert individual $\LaTeX$ documents to plain text (`pandoc` is a general document converter, not $\LaTeX$-specific). For the broken example above, `pandoc` fails to convert the text and gives an error. Unlike `detex`, `pandoc` has a convenient Python wrapper, `pypandoc`, so we can call it directly rather than converting to byte strings and feeding them to an external program.

In [23]:
import pypandoc

try:
    x = pypandoc.convert_text(new_str, to='plain', format='tex')
    print('pandoc parsed the string successfully')
# pypandoc raises RuntimeError when pandoc exits with an error
except RuntimeError:
    print('pandoc failed to parse the string')

pandoc failed to parse the string


Some errors are not caught by `pandoc`. The string 

```text
My favorite theorem is Euler's equation. Euler's equation says

e^{i\theta} = \cos(\theta) + i\sin(\theta).
$$
```
shouldn't be accepted by a $\LaTeX$ compiler, but `pandoc` won't catch the error.

In [35]:
str_to_fix = r"""
My favorite theorem is Euler's equation. Euler's equation says

e^{i\theta} = \cos(\theta) + i\sin(\theta).
$$
"""

try:
    x = pypandoc.convert_text(str_to_fix, to='plain', format='latex')
    print('pandoc parsed the string successfully')
# pypandoc raises RuntimeError when pandoc exits with an error
except RuntimeError:
    print('pandoc failed to parse the string')

pandoc parsed the string successfully


In [36]:
print(x)

My favorite theorem is Euler’s equation. Euler’s equation says

e^i = () + i().



With all these caveats in mind, I am going to treat the abstracts for which `pandoc` returns a value as "valid", and when we later train our GloVe vectors we will train only on those abstracts. In the next notebook we'll look at the actual rates at which `pandoc` fails.

### Load the data

We use `sqlalchemy` as our framework for interacting with the `postgres` server.

In [1]:
import json
import subprocess  # used by str_fix to call detex

import pypandoc  # used by str_fix to call pandoc
from sqlalchemy import create_engine, Column, String, Integer, DATE, BOOLEAN
from sqlalchemy.orm import sessionmaker

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import func


This loads the credentials required to access the database.

In [2]:
with open('../../postgres.json') as pg_info:
    pg_json = json.load(pg_info)
    pg_username = pg_json['pg_username']
    pg_password = pg_json['pg_password']
    pg_ip = pg_json['pg_ip']
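
The file is expected to hold the three keys read above. As a sketch of how they turn into a connection URL, the hypothetical helper `build_pg_url` below percent-encodes the user name and password, which matters if either contains characters such as `@` or `/`. The credential values are placeholders, and the escaping step is an addition, not something the notebook itself does.

```python
from urllib.parse import quote_plus

def build_pg_url(creds, port=5432):
    """Build a PostgreSQL connection URL from a dict shaped like postgres.json."""
    user = quote_plus(creds['pg_username'])
    password = quote_plus(creds['pg_password'])  # escape characters like '@' and '/'
    return f"postgresql://{user}:{password}@{creds['pg_ip']}:{port}"

# Placeholder credentials, for illustration only
example_creds = {
    'pg_username': 'arxiv_user',
    'pg_password': 'p@ss/word',
    'pg_ip': '127.0.0.1',
}
print(build_pg_url(example_creds))
```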


These are the objects that `sqlalchemy` uses to represent the tables in the database.

In [3]:
Base = declarative_base()

class articles_raw(Base):
    __tablename__ = 'arxiv_raw'
    
    id = Column(String, primary_key=True)
    created = Column(DATE)
    setspec = Column(String)
    title = Column(String)
    abstract = Column(String)
    
class articles_detex(Base):
    __tablename__ = 'arxiv_detex'
    
    id = Column(String, primary_key=True)
    created = Column(DATE)
    setspec = Column(String)
    
    title = Column(String)
    title_converted = Column(BOOLEAN)
    
    abstract = Column(String)
    abstract_converted = Column(BOOLEAN)
    
class articles_pandoc(Base):
    __tablename__ = 'arxiv_pandoc'
    
    id = Column(String, primary_key=True)
    created = Column(DATE)
    setspec = Column(String)
    
    title = Column(String)
    title_converted = Column(BOOLEAN)
    
    abstract = Column(String)
    abstract_converted = Column(BOOLEAN)
    
# SQLAlchemy 1.4 and later only accept the 'postgresql' dialect name, not 'postgres'
engine = create_engine(f'postgresql://{pg_username}:{pg_password}@{pg_ip}:5432')
# Create any of the tables defined above that do not exist yet
Base.metadata.create_all(engine)


The `str_fix` function runs either `detex` or `pandoc` on the string, depending on the `use_pandoc` flag, and returns the result together with a boolean indicating whether the conversion succeeded.

In [4]:
def str_fix(str_to_fix, detex_path, use_pandoc=False):
    """Return (text, converted); on failure the original string is returned."""
    if not use_pandoc:
        try:
            # detex reads bytes on stdin; check=True turns a non-zero exit into an exception
            result = subprocess.run(detex_path.split(), input=str_to_fix.encode('utf-8'),
                                    stdout=subprocess.PIPE, check=True)
            return result.stdout.decode('utf-8'), True
        except (OSError, subprocess.CalledProcessError):
            # detex is missing or exited with an error: keep the original string
            return str_to_fix, False
    else:
        try:
            new_str = pypandoc.convert_text(str_to_fix, to='plain', format='latex')
            return new_str, True
        except (OSError, RuntimeError):
            # pypandoc raises RuntimeError when pandoc fails to convert the text
            return str_to_fix, False
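
The contract that matters downstream is "converted text plus a success flag, falling back to the original text on failure". A minimal sketch of just that pattern, independent of `detex` and `pandoc`, is given below; `convert_or_keep` and `always_fails` are hypothetical names used only for illustration.

```python
def convert_or_keep(text, converter):
    """Apply `converter` to `text`; return (result, True) or (text, False) if it raises."""
    try:
        return converter(text), True
    except Exception:
        # Conversion failed: fall back to the original string and record the failure
        return text, False

def always_fails(text):
    """Stand-in for a converter that rejects its input."""
    raise RuntimeError('conversion failed')

print(convert_or_keep('$x$ is small', str.upper))
print(convert_or_keep('$x$ is small', always_fails))
```

The boolean in the returned pair is what ends up in the `title_converted` and `abstract_converted` columns.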


The `change_tex` function fixes up a record from `arxiv_raw` and processes it for insertion into one of `arxiv_detex` or `arxiv_pandoc`.

In [5]:
def change_tex(record, detex_path, use_pandoc, article_class):
    processed_article_info = {
                'id':record.id,
                'created':record.created,
                'setspec':record.setspec,
                'title':record.title,
                'abstract':record.abstract,
            }
    
    
    processed_abstract, abstract_bool = str_fix(record.abstract, detex_path, use_pandoc)
    processed_title, title_bool = str_fix(record.title, detex_path, use_pandoc)
    
    processed_article_info['abstract'] = processed_abstract
    processed_article_info['abstract_converted'] = abstract_bool
    
    processed_article_info['title'] = processed_title
    processed_article_info['title_converted'] = title_bool
    

    processed_article = article_class(**processed_article_info)
    return processed_article

The `query_tex` function processes the rows of `arxiv_raw` and inserts the processed rows into either `arxiv_detex` or `arxiv_pandoc`.

In [6]:
def query_tex(limit_num=None, detex_path='/usr/bin/detex', batch_size=10000,
              commit_size=1000, use_pandoc=None, article_class=articles_detex):
    
    # SQLAlchemy 1.4 and later only accept the 'postgresql' dialect name, not 'postgres'
    engine = create_engine(f'postgresql://{pg_username}:{pg_password}@{pg_ip}:5432')
    Session = sessionmaker(engine)

    query_session = Session()
    commit_session = Session()
    
    if limit_num:
        query = query_session.query(articles_raw).limit(limit_num)

    else:
        query = query_session.query(articles_raw).yield_per(batch_size)
    
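    new_records = []
    row_num = 0  # stays 0 if the query returns no rows
    
    # Count rows from 1 so that a commit happens after every `commit_size` rows
    for row_num, record in enumerate(query, start=1): 
        processed_article = change_tex(record, detex_path=detex_path,
                                       use_pandoc=use_pandoc, article_class=article_class)
        new_records.append(processed_article)
        
        if row_num % commit_size == 0:
            commit_session.add_all(new_records)
            commit_session.commit()
            new_records = []
    
    # Commit whatever is left over from the last partial batch
    commit_session.add_all(new_records)    
    commit_session.commit()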
    
    commit_session.close()
    query_session.close()
    engine.dispose()
        
    
    return row_num
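
The batching logic in `query_tex` can be looked at in isolation. The sketch below is an illustration under simplifying assumptions: `flush_in_batches` is a hypothetical helper, a plain list stands in for the database session, and `flush` plays the role of `add_all` followed by `commit`.

```python
def flush_in_batches(records, commit_size, flush):
    """Call `flush` on successive groups of at most `commit_size` records; return the total."""
    pending = []
    total = 0  # stays 0 when `records` is empty
    for total, record in enumerate(records, start=1):
        pending.append(record)
        if total % commit_size == 0:
            flush(pending)
            pending = []      # start a fresh batch
    if pending:
        flush(pending)        # final partial batch
    return total

batches = []
count = flush_in_batches(range(7), commit_size=3, flush=batches.append)
print(count, batches)
```

Committing in batches keeps the number of pending objects in the session bounded while avoiding one round trip to the database per row.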
    


### Populate the `arxiv_detex` table


In [None]:
row_nums = query_tex()

### Populate the `arxiv_pandoc` table


In [None]:
row_nums = query_tex(use_pandoc=True, article_class=articles_pandoc)