# Flatten everything

JSON is, to my mind, preferable to the XML we started with, but the records are still not quite as flat as we'd like. We are going to flatten each JSON record and load it into a PostgreSQL server on AWS. The table will have six columns, shown below with one example row.

| id | created | setspec | title | abstract | tex |
|----|---------|---------|-------|----------|-----|
| 0704.0001| 2007-04-02| physics:hep-ph| Calculation of prompt diphoton production cross sections at Tevatron and LHC energies| A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark  gluon-(anti)quark  and gluon-gluon subprocesses are included  as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron  and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC  showing that enhanced sensitivity to the signal can be obtained with judicious selection of events.| good| 


## First we handle the $\LaTeX$

On arXiv it's very common for abstracts, and even titles, to contain at least a little $\LaTeX$. This isn't ideal for our purposes, so we'll try to remove all the formatted equations and the commands that change the text styling (which are themselves English words, such as `\emph`), doing our best to leave the actual words unaffected.

In [1]:
import re
import pypandoc
from bs4 import BeautifulSoup

from pathlib import Path
import json


In [2]:
# Remove LaTeX math.
# Converting to plain text means that the different bold and emphasis schemes
# are represented by symbols rather than by text commands. This is preferable,
# since we can remove those symbols without fear of affecting the words themselves!


math_regex = '<span class="math.*?">.*?</span>'
math_compiled = re.compile(math_regex, flags=re.DOTALL)

#there's maybe a simpler way of doing this...but this is straightforward :) 

def latex_clean(latex_string, re_compiled):

    
    #first we convert to html
    cleaned_string = pypandoc.convert_text(source=latex_string, to='html', format='latex')
    
    #remove the math mode stuff
    cleaned_string = re_compiled.sub(repl=' ', string=cleaned_string)

    #convert to plaintext
    cleaned_string = pypandoc.convert_text(source=cleaned_string, to='plain', format='html')
    
    #get rid of new lines
    cleaned_string = cleaned_string.replace('\n', ' ')
    
    return cleaned_string
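
To see what the regex step does in isolation, here is a minimal sketch that skips `pandoc` entirely. The `sample_html` string is hand-written to resemble the spans `pandoc` emits for math (an assumption, not captured output); the pattern is the same one compiled above.

```python
import re

# Same pattern as above: a non-greedy match of any span whose class starts with "math".
math_compiled = re.compile('<span class="math.*?">.*?</span>', flags=re.DOTALL)

# Hand-written stand-in for pandoc's HTML output (assumed shape, for illustration only).
sample_html = (
    '<p>We measure <span class="math inline">\\(\\alpha_s\\)</span> at '
    '<span class="math inline">\\(\\sqrt{s} = 7\\)</span> TeV.</p>'
)

# Each math span is replaced by a single space, which leaves runs of spaces behind.
stripped = math_compiled.sub(repl=' ', string=sample_html)

# Collapse the leftover whitespace so the result is easier to read.
collapsed = ' '.join(stripped.split())
print(collapsed)
```

The pattern is non-greedy, so it stops at the first `</span>` it meets. If a math span ever contained a nested `<span>`, part of it would probably survive, which is one reason to treat this as a best-effort cleanup.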
    
    

In [3]:
def json_to_csv(full_article_json):
    
    #the category information in the header seems a little cleaner
    #than the one in the metadata
    header = full_article_json['header']
    metadata = full_article_json['metadata']
    
    key_prefix = '{http://www.openarchives.org/OAI/2.0/}'
    
    csv = {}
    #the arXiv identifier is kept under the key 'id', matching the table column
    csv['id'] = metadata[f'{key_prefix}id'][0]
    csv['created'] = metadata[f'{key_prefix}created'][0]
    csv['setspec'] = header[f'{key_prefix}setSpec'][0]
    csv['title'] = metadata[f'{key_prefix}title'][0]
    csv['abstract'] = metadata[f'{key_prefix}abstract'][0]
    

    #some of the abstracts or titles contain broken tex
    #if this happens then we can't safely remove the tex automatically.
    #We'll keep these in the exceptions list

    
    math_regex = '<span class="math.*?">.*?</span>'
    math_compiled = re.compile(math_regex, flags=re.DOTALL)

    try:
        #need to fix the title
        csv['title'] = latex_clean(csv['title'], math_compiled).replace(',', ' ')

        #need to fix the abstract
        csv['abstract'] = latex_clean(csv['abstract'], math_compiled).replace(',', ' ')

        csv['tex'] = 'good'
        
    except Exception:  # pandoc could not convert the TeX; flag the article instead of failing
        csv['tex'] = 'bad'
        
    return csv
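
The flattening itself is easier to follow on a tiny record. The sketch below uses a hypothetical `record` shaped the way `json_to_csv` reads it (namespaced keys whose values are lists); the helper `first` and the sample title and abstract are made up for illustration, and the $\LaTeX$ cleaning step is left out.

```python
# Namespace prefix used for every key, as in json_to_csv above.
NS = '{http://www.openarchives.org/OAI/2.0/}'

# Hypothetical record: each field is stored as a list, so we take element 0.
record = {
    'header': {f'{NS}setSpec': ['physics:hep-ph']},
    'metadata': {
        f'{NS}id': ['0704.0001'],
        f'{NS}created': ['2007-04-02'],
        f'{NS}title': ['A title, with a comma'],
        f'{NS}abstract': ['An abstract.'],
    },
}


def first(section, name):
    """Return the first value stored under a namespaced key (illustrative helper)."""
    return section[f'{NS}{name}'][0]


# One flat dictionary per article; commas are replaced by spaces as in json_to_csv.
flat = {
    'id': first(record['metadata'], 'id'),
    'created': first(record['metadata'], 'created'),
    'setspec': first(record['header'], 'setSpec'),
    'title': first(record['metadata'], 'title').replace(',', ' '),
    'abstract': first(record['metadata'], 'abstract').replace(',', ' '),
}
print(flat)
```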

In [4]:
def json_to_lines(json_file_name):
    
    with open(json_file_name) as json_file:
        jtmp = json.load(json_file)['ListRecords']
    
    article_info = []
    
    for jtmp_sample in jtmp:
        new_entry = json_to_csv(jtmp_sample)
        article_info.append(new_entry)
    return article_info

### Example

A decent percentage of the articles have some $\LaTeX$ that doesn't play nicely with `pandoc`. In some cases this is due to truly broken $\TeX$. This is the reasoning behind the `tex` feature: if `pandoc` fails, for whatever reason, to convert the $\TeX$ to __HTML__ or the __HTML__ to plain text, the article is marked `bad`; otherwise it is marked `good`.

In [5]:
json_file_name = '../../data/json/initial_harvest_2018_06_21/0.json'
article_info = json_to_lines(json_file_name)

In [6]:
len(article_info)

1000

In [7]:
type(article_info)

list

In [8]:
sort_articles = {
    'good':[],
    'bad':[]
}
for artic in article_info:
    sort_articles[artic['tex']].append(artic)
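
If only the counts are needed, `collections.Counter` gives the same tally without building the two lists. This is a small sketch on a hypothetical `sample_articles` list; with the real data one would pass `article_info` instead.

```python
from collections import Counter

# Hypothetical stand-in for article_info: only the 'tex' field matters here.
sample_articles = [
    {'id': 'a', 'tex': 'good'},
    {'id': 'b', 'tex': 'bad'},
    {'id': 'c', 'tex': 'good'},
]

# Count how many articles carry each label.
tex_counts = Counter(article['tex'] for article in sample_articles)

# Fraction of articles whose TeX was converted successfully.
good_share = tex_counts['good'] / len(sample_articles)
print(tex_counts, good_share)
```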
        

In [9]:
len(sort_articles['good'])

895

In [10]:
sort_articles['good'][0]

{'id': '0704.0001',
 'created': '2007-04-02',
 'setspec': 'physics:hep-ph',
 'title': 'Calculation of prompt diphoton production cross sections at Tevatron and LHC energies ',
 'abstract': 'A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark  gluon-(anti)quark  and gluon-gluon subprocesses are included  as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron  and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs b

# To Postgres

Now we move everything over to our Postgres database hosted on AWS.


In [11]:
from sqlalchemy import create_engine, Column, String, Integer, DATE
from sqlalchemy.orm import sessionmaker

from sqlalchemy.ext.declarative import declarative_base

In [19]:
with open('./postgres.json') as pg_info:
    pg_json = json.load(pg_info)
    pg_username = pg_json['username']
    pg_password = pg_json['password']
    pg_ip = pg_json['ip']

engine = create_engine(f'postgres://{pg_username}:{pg_password}@{pg_ip}:5432')

In [20]:
engine.url

postgres://postgres:***@52.39.221.147:5432
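
One thing to watch with an f-string URL: a password containing characters such as `@` or `/` would be misread as part of the URL structure. A possible sketch, using only the standard library, is to percent-encode the credentials first. The credentials below are hypothetical, and the `postgresql://` spelling is used because newer SQLAlchemy releases no longer accept the `postgres://` alias.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only.
pg_username = 'postgres'
pg_password = 'p@ss/word'
pg_ip = '127.0.0.1'

# Percent-encode the parts that may contain reserved characters.
safe_url = (
    f'postgresql://{quote_plus(pg_username)}:{quote_plus(pg_password)}'
    f'@{pg_ip}:5432'
)
print(safe_url)  # postgresql://postgres:p%40ss%2Fword@127.0.0.1:5432
```

The resulting string could then be handed to `create_engine` in place of the plain f-string.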

In [21]:
Base = declarative_base()

# the article class is how sqlalchemy treats the objects of a row
class Articles(Base):
    __tablename__ = 'arxiv'
    
    id = Column(String, primary_key=True)
    created = Column(DATE)
    setspec = Column(String)
    title = Column(String)
    abstract = Column(String)
    tex = Column(String)


In [23]:
# need to actually put the table in the database
Base.metadata.create_all(engine)

In [24]:
def json_to_sql(json_dir, engine):
    json_dir = Path(json_dir)
    
    Session = sessionmaker(bind=engine)
    
    for json_file_name in json_dir.iterdir():
        if json_file_name.suffix == '.json': #make sure we've got the file, just in case.
            session = Session()
            articles = json_to_lines(json_file_name) 

            articles = [Articles(**article_info) for article_info in articles]

            session.add_all(articles)

            session.commit()
            session.close()  # release the connection before moving to the next file


In [None]:
json_dir = "../../data/json/initial_harvest_2018_06_21"

json_to_sql(json_dir, engine)
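
Since `id` is the primary key, running the load a second time over the same directory would try to insert ids that already exist, and I would expect the commit to fail with an integrity error. The sketch below illustrates that behaviour with an in-memory SQLite database from the standard library standing in for Postgres; the table mirrors the `Articles` columns, and the row values are made up.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres server (an assumption for this sketch).
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE arxiv ('
    'id TEXT PRIMARY KEY, created TEXT, setspec TEXT, '
    'title TEXT, abstract TEXT, tex TEXT)'
)

# Hypothetical row with the same six fields as the flattened articles.
row = ('0704.0001', '2007-04-02', 'physics:hep-ph', 'A title', 'An abstract', 'good')
insert_sql = 'INSERT INTO arxiv VALUES (?, ?, ?, ?, ?, ?)'

conn.execute(insert_sql, row)
try:
    conn.execute(insert_sql, row)  # same id a second time
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Only the first insert made it into the table.
row_count = conn.execute('SELECT COUNT(*) FROM arxiv').fetchone()[0]
print(duplicate_rejected, row_count)
conn.close()
```

In practice this means the load is best run once against an empty table, or the already-loaded files skipped on a rerun.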