# JSON to Flat Files

JSON is, to my mind, certainly preferable to the XML we got in the first place, but it's still not quite as flat as we'd like. We are going to take our JSON files and transform them into CSV files with the following structure:

```
id, created, updated, setSpec, categories, authors, title, abstract
0704.0002, 2007-03-30, 2008-12-13, cs math, math.CO cs.CG, IleanaStreinu LouisTheran, Sparsity-certifying Graph Decompositions, We describe a...
.
.
.
```

Some articles have other information like `journal-ref` and `comments`, but these are not present in any consistent way.
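As a rough sketch of the target transformation, the block below flattens one harvested record into a row with the columns listed above. The helper names (`values`, `squash`, `flatten_record`, `write_csv`) are hypothetical, and it assumes every record has the `header`/`metadata` layout of the sample shown later in this notebook, with each author stored as a dict of `forenames` and `keyname`. Joining forenames and keyname without spaces is my reading of the example row, not a confirmed rule.

```python
import csv
import io

# Namespace prefix carried by every key in the harvested records.
KEY_PREFIX = '{http://www.openarchives.org/OAI/2.0/}'
COLUMNS = ['id', 'created', 'updated', 'setSpec', 'categories',
           'authors', 'title', 'abstract']


def values(record, name):
    """Return the list stored under a namespaced key (empty if missing)."""
    return record.get(f'{KEY_PREFIX}{name}', [])


def squash(text):
    """Collapse newlines and runs of spaces into single spaces."""
    return ' '.join(text.split())


def flatten_record(record):
    """Hypothetical helper: turn one harvested record into a flat dict."""
    head, meta = record['header'], record['metadata']
    # Assumption: each author becomes forenames + keyname with spaces removed.
    authors = [
        (a.get(f'{KEY_PREFIX}forenames', '')
         + a.get(f'{KEY_PREFIX}keyname', '')).replace(' ', '')
        for a in values(meta, 'authors')
    ]
    row = {name: squash(' '.join(values(meta, name)))
           for name in ('id', 'created', 'updated',
                        'categories', 'title', 'abstract')}
    row['setSpec'] = ' '.join(values(head, 'setSpec'))
    row['authors'] = ' '.join(authors)
    return row


def write_csv(records, handle):
    """Write the flattened records to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for record in records:
        writer.writerow(flatten_record(record))


# Sample record copied from the harvest (abstract shortened here).
p = KEY_PREFIX
sample = {
    'header': {f'{p}identifier': ['oai:arXiv.org:0704.0003'],
               f'{p}datestamp': ['2008-01-13'],
               f'{p}setSpec': ['physics:physics']},
    'metadata': {f'{p}id': ['0704.0003'],
                 f'{p}created': ['2007-04-01'],
                 f'{p}updated': ['2008-01-12'],
                 f'{p}authors': [{f'{p}keyname': 'Pan',
                                  f'{p}forenames': 'Hongjun'}],
                 f'{p}title': ['The evolution of the Earth-Moon system based '
                               'on the dark matter field\n  fluid model'],
                 f'{p}categories': ['physics.gen-ph'],
                 f'{p}comments': ['23 pages, 3 figures'],
                 f'{p}abstract': ['  The evolution of Earth-Moon system is '
                                  'described by the dark matter field\nfluid model']},
}

buffer = io.StringIO(newline='')
write_csv([sample], buffer)
print(buffer.getvalue())
```

Fields outside `COLUMNS`, such as `comments`, are simply dropped, which matches the decision to ignore the inconsistently present ones.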


There are some real complications here, specifically regarding the `title`, `authors` and `abstract`. 

1. Author names can come in a variety of formats including, but not limited to:
    * Tim Dwyer  
    * James Tim Dwyer  
    * James Timothy Dwyer  
    * J. Dwyer  
    * J. T. Dwyer  
    * Dr. Tim Dwyer
    
1. There are complex text features in the text fields, including:
    * LaTeX snippets in the titles and abstracts
    * Non-English letters in the various text fields
    * Sometimes the abstract and/or title isn't written in English at all.
    * Sometimes the abstract is in English but is also provided in French.
    
    > I could read a string through and find any `\'` type substrings and just remove them.   
    > This is what I should do, before I 
 
    > Something that finds `\w*\a*\'\a+` or something like that?
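To make the idea in the note above concrete, here is a minimal sketch of stripping LaTeX accent commands with a regex. The names `ACCENT_COMMAND` and `strip_accent_commands` are hypothetical, and the pattern only covers the punctuation-style accents (`\'`, `\"`, `` \` ``, `\^`, `\~`); letter-named accents such as `\c` or `\v` are deliberately left out, so treat it as a starting point rather than a complete cleaner.

```python
import re

# A backslash, one punctuation accent mark, then the accented letter,
# either bare (\'e) or wrapped in braces (\"{o}).
ACCENT_COMMAND = re.compile(r"""\\['"`^~](?:\{([a-zA-Z])\}|([a-zA-Z]))""")


def strip_accent_commands(text):
    """Drop the accent command and keep only the bare letter."""
    return ACCENT_COMMAND.sub(lambda m: m.group(1) or m.group(2), text)


print(strip_accent_commands(r"N\'eel"))           # Neel
print(strip_accent_commands(r'Schr\"{o}dinger'))  # Schrodinger
```

Note that this throws the accent away rather than converting it to the accented Unicode letter, so it loses information.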

In [336]:
import json
import re

In [337]:
with open('../../data/json/initial_harvest_2018_06_21/0.json') as file:
    jtmp = json.load(file)['ListRecords']
    

In [630]:
jtmp_sample = jtmp[2]
head, meta  = jtmp_sample['header'], jtmp_sample['metadata']

In [631]:
head

{'{http://www.openarchives.org/OAI/2.0/}identifier': ['oai:arXiv.org:0704.0003'],
 '{http://www.openarchives.org/OAI/2.0/}datestamp': ['2008-01-13'],
 '{http://www.openarchives.org/OAI/2.0/}setSpec': ['physics:physics']}

In [632]:
meta

{'{http://www.openarchives.org/OAI/2.0/}id': ['0704.0003'],
 '{http://www.openarchives.org/OAI/2.0/}created': ['2007-04-01'],
 '{http://www.openarchives.org/OAI/2.0/}updated': ['2008-01-12'],
 '{http://www.openarchives.org/OAI/2.0/}authors': [{'{http://www.openarchives.org/OAI/2.0/}keyname': 'Pan',
   '{http://www.openarchives.org/OAI/2.0/}forenames': 'Hongjun'}],
 '{http://www.openarchives.org/OAI/2.0/}title': ['The evolution of the Earth-Moon system based on the dark matter field\n  fluid model'],
 '{http://www.openarchives.org/OAI/2.0/}categories': ['physics.gen-ph'],
 '{http://www.openarchives.org/OAI/2.0/}comments': ['23 pages, 3 figures'],
 '{http://www.openarchives.org/OAI/2.0/}abstract': ["  The evolution of Earth-Moon system is described by the dark matter field\nfluid model proposed in the Meeting of Division of Particle and Field 2004,\nAmerican Physical Society. The current behavior of the Earth-Moon system agrees\nwith this model very well and the general pattern of the ev

In [633]:
key_prefix = '{http://www.openarchives.org/OAI/2.0/}'
key_type = 'abstract'
key = f'{key_prefix}{key_type}'
meta[key][0].strip()

"The evolution of Earth-Moon system is described by the dark matter field\nfluid model proposed in the Meeting of Division of Particle and Field 2004,\nAmerican Physical Society. The current behavior of the Earth-Moon system agrees\nwith this model very well and the general pattern of the evolution of the\nMoon-Earth system described by this model agrees with geological and fossil\nevidence. The closest distance of the Moon to Earth was about 259000 km at 4.5\nbillion years ago, which is far beyond the Roche's limit. The result suggests\nthat the tidal friction may not be the primary cause for the evolution of the\nEarth-Moon system. The average dark matter field fluid constant derived from\nEarth-Moon system data is 4.39 x 10^(-22) s^(-1)m^(-1). This model predicts\nthat the Mars's rotation is also slowing with the angular acceleration rate\nabout -4.38 x 10^(-22) rad s^(-2)."

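Building the full key by hand for every lookup gets tedious. One option, sketched below, is to strip the namespace prefix from every key once, up front. `strip_namespace` is a hypothetical helper, and it assumes the records contain only dicts, lists and plain values, as in the sample above.

```python
KEY_PREFIX = '{http://www.openarchives.org/OAI/2.0/}'


def strip_namespace(obj, prefix=KEY_PREFIX):
    """Recursively remove the namespace prefix from every dict key."""
    if isinstance(obj, dict):
        return {
            (key[len(prefix):] if key.startswith(prefix) else key):
                strip_namespace(value, prefix)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [strip_namespace(item, prefix) for item in obj]
    return obj


head = {
    '{http://www.openarchives.org/OAI/2.0/}identifier': ['oai:arXiv.org:0704.0003'],
    '{http://www.openarchives.org/OAI/2.0/}datestamp': ['2008-01-13'],
    '{http://www.openarchives.org/OAI/2.0/}setSpec': ['physics:physics'],
}
print(strip_namespace(head))
```

After this, a lookup such as `clean['abstract'][0]` works without the f-string.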
In [627]:
from tex2py import tex2py


In [628]:
tex_str = meta[key][0].strip().replace('\n', ' ')
tex_abs = tex2py(tex_str)
abs_str = tex_str

In [629]:
abs_str

'A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced sensitivity to the signal can be obtained with judicious selection of events.'
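One small wrinkle with `replace('\n', ' ')`: the harvested text sometimes wraps with indentation (the sample title contains `\n  `), which leaves a run of spaces behind. A minimal alternative, with `squash_whitespace` as a hypothetical helper name, is to split on any whitespace and re-join:

```python
def squash_whitespace(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return ' '.join(text.split())


title = ('The evolution of the Earth-Moon system based on the dark matter '
         'field\n  fluid model')

print(repr(title.replace('\n', ' ')))   # a run of spaces remains after 'field'
print(repr(squash_whitespace(title)))
```

Because `str.split()` with no arguments also drops leading and trailing whitespace, this replaces the separate `.strip()` call as well.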

In [634]:
# LaTeX uses both ' and " as accent markers, for the acute and umlaut respectively. This makes it a little difficult
# to specify a single string that will contain both characters since those are the characters for opening and closing 
# strings. We can avoid this as shown below.
most_accents = '`^"~=.uvHtcdbk'
acute_accent = "'"

accents = f'{most_accents}{acute_accent}'
accent_regex = rf'[a-zA-Z]*\\[{accents}][a-zA-Z]+'

math_operators = r'[\-*+=]+'

latex_regex_dollar_sign = r'(\$[^\$]*\$)'
latex_regex_double_dollar_sign = r'(\$\$[^\$]*\$\$)' #need to test this

latex_regex_parens = r'(\([^\(\)]*\))' #this is just picking up parentheses
latex_regex_bracket = r'(\[[^\$]*\])' #need to test this

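# Try the double-dollar pattern first; otherwise the single-dollar pattern matches the empty `$$` pair.
latex_regex = rf'{latex_regex_double_dollar_sign}|{latex_regex_dollar_sign}|{latex_regex_parens}|{latex_regex_bracket}'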

abstract_regex = fr'{math_operators}'

abstract_regex


'[\\-*+=]+'
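Since the dollar-sign patterns are flagged as untested, here is a quick check, under the assumption that they will be combined with `|`. The variable names and the test string are made up for illustration. The thing to watch is alternation order: Python's `re` takes the first alternative that matches at a position, so the order of the two patterns changes what is found for display math.

```python
import re

single_dollar = r'\$[^\$]*\$'
double_dollar = r'\$\$[^\$]*\$\$'

text = r'Energy $E$ and $$E = mc^2$$ are both math.'

single_first = re.compile(rf'{single_dollar}|{double_dollar}')
double_first = re.compile(rf'{double_dollar}|{single_dollar}')

print(single_first.findall(text))  # ['$E$', '$$', '$$']
print(double_first.findall(text))  # ['$E$', '$$E = mc^2$$']
```

With the single-dollar pattern first, `$$` is matched as an empty inline formula and the display math body is left behind in the text.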

In [616]:
re.findall(abstract_regex, abs_str + r'this is a test  \HfEE + --= +')

['-', '*', '=', '-', '+', '--=', '+']

In [614]:
re.sub(abstract_regex,  ' ', abs_str + r'this is a test  \HfEE -- +++')

'We report studies of cyclotron resonance in monolayer graphene. Cyclotron resonance is detected using the photoconductive response of the sample for several different Landau level occupancies. The experiments measure an electron velocity at the K    point of     1.093 x 10  ms  and in addition detect a significant asymmetry between the electron and hole bands, leading to a difference in the electron and hole velocities of 5% by energies of 125 meV away from the Dirac point.this is a test       '

Wait, maybe I should actually leave the accent characters alone... like `'N\'eel'` is just that person's name. There's no real issue with keeping that as our string. Yeah, okay, that makes sense to me: just leave it alone.
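If the accent commands stay in the text, the one thing worth confirming is that they survive the trip into CSV and back. A minimal round-trip check with the standard `csv` module is sketched below; the row values are placeholders made up for the test, not real records.

```python
import csv
import io

# Placeholder row: a fake id, a name with a LaTeX accent, a title with a comma.
row = ['0000.0000', r"N\'eel", 'A title, with a comma']

buffer = io.StringIO(newline='')
csv.writer(buffer).writerow(row)

buffer.seek(0)
round_tripped = next(csv.reader(buffer))
print(round_tripped == row)  # True
```

With the default dialect, neither the backslash nor the apostrophe is special, and the comma is handled by quoting, so nothing needs escaping by hand.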