# JSON to Flat files

JSONs are, to my mind, certainly preferable to the XML we got in the first place, but they're still not quite as flat as we'd like. We are going to take our JSONs and transform them into CSV files with the following structure:

```
id, created, updated, setSpec, categories, authors, title, abstract
0704.0002, 2007-03-30, 2008-12-13, cs math, math.CO cs.CG, IleanaStreinu LouisTheran, Sparsity-certifying Graph Decompositions, We describe a...
.
.
.
```

Some articles have other information like `journal-ref` and `comments`, but these are not present in any consistent way.


There are some real complications here, specifically regarding the `title`, `authors` and `abstract`. 

1. Author names can come in a variety of formats including, but not limited to
    * Tim Dwyer  
    * James Tim Dwyer  
    * James Timothy Dwyer  
    * J. Dwyer  
    * J. T. Dwyer  
    * Dr. Tim Dwyer
    
    > This shouldn't be too crazy, I should be able to keep track of this pretty definitively. Record at most three names, first, middle, last.
    > Ignore titles, they probably aren't really pertinent and they're certainly not unchanging if they're even present
    > Cool!
    
1. There are complex text features in the text fields, including:  
    * LaTeX snippets in the titles and abstracts
    * Non-English letters in the various text fields
    * Sometimes the abstract and/or title straight up isn't written in English.
    * Sometimes the abstract is in English but it's also provided in French
    
    > I could read a string through and find any `\'` type substrings and just remove them.   
    > This is what I should do, before I 
 
    > Something that finds `\w*\a*\'\a+` or something like that?
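
The two notes above could be sketched as a pair of small helpers. This is only a rough sketch under stated assumptions: `split_author_name`, `strip_latex_accents`, `TITLES` and `ACCENT_RE` are hypothetical names, the list of titles is a hand-picked guess, and the accent pattern only covers the common one-character accent commands (`\'`, `\"`, `` \` ``, `\^`, `\~`) applied to a single ASCII letter.

```python
import re

# Assumption: a small, hand-picked set of honorifics to ignore.
TITLES = {'dr', 'prof', 'professor', 'mr', 'mrs', 'ms'}


def split_author_name(full_name):
    """Return (first, middle, last), dropping titles and extra middle names."""
    parts = [p for p in full_name.split()
             if p.rstrip('.').lower() not in TITLES]
    if not parts:
        return ('', '', '')
    if len(parts) == 1:
        return ('', '', parts[0])
    middle = parts[1] if len(parts) > 2 else ''
    return (parts[0], middle, parts[-1])


# Three spellings of the same accent: {\'e}, \'{e} and \'e.
ACCENT_RE = re.compile(r"""
    \{\\['"`^~]([A-Za-z])\}     # accent command wrapped in braces
  | \\['"`^~]\{([A-Za-z])\}     # letter wrapped in braces
  | \\['"`^~]([A-Za-z])         # bare form
""", re.VERBOSE)


def strip_latex_accents(text):
    """Replace each accent command with the plain letter it decorates."""
    return ACCENT_RE.sub(
        lambda m: m.group(1) or m.group(2) or m.group(3), text)


for name in ['Tim Dwyer', 'James Timothy Dwyer', 'J. T. Dwyer', 'Dr. Tim Dwyer']:
    print(name, '->', split_author_name(name))

print(strip_latex_accents(r"Poincar\'e"))  # Poincare
```

Dropping the accent rather than converting it to the matching Unicode letter loses information, but it keeps the field plain ASCII, which is what the note above is after.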

In [1]:
import json
import re
import pypandoc

In [2]:
with open('../../data/json/initial_harvest_2018_06_21/0.json') as file:
    jtmp = json.load(file)['ListRecords']
    

In [3]:
jtmp_sample = jtmp[143]
head, meta  = jtmp_sample['header'], jtmp_sample['metadata']

In [4]:
head

{'{http://www.openarchives.org/OAI/2.0/}identifier': ['oai:arXiv.org:0704.0144'],
 '{http://www.openarchives.org/OAI/2.0/}datestamp': ['2009-02-09'],
 '{http://www.openarchives.org/OAI/2.0/}setSpec': ['physics:astro-ph',
  'physics:gr-qc',
  'physics:hep-th']}

In [5]:
meta

{'{http://www.openarchives.org/OAI/2.0/}id': ['0704.0144'],
 '{http://www.openarchives.org/OAI/2.0/}created': ['2007-04-02'],
 '{http://www.openarchives.org/OAI/2.0/}updated': ['2007-09-18'],
 '{http://www.openarchives.org/OAI/2.0/}authors': [{'{http://www.openarchives.org/OAI/2.0/}keyname': 'Podolsky',
   '{http://www.openarchives.org/OAI/2.0/}forenames': 'D.'},
  {'{http://www.openarchives.org/OAI/2.0/}keyname': 'Enqvist',
   '{http://www.openarchives.org/OAI/2.0/}forenames': 'K.'}],
 '{http://www.openarchives.org/OAI/2.0/}title': ['Eternal inflation and localization on the landscape'],
 '{http://www.openarchives.org/OAI/2.0/}categories': ['hep-th astro-ph gr-qc'],
 '{http://www.openarchives.org/OAI/2.0/}comments': ['4 pages; more references added; discussion enlarged'],
 '{http://www.openarchives.org/OAI/2.0/}report-no': ['HIP-2007-15/TH'],
 '{http://www.openarchives.org/OAI/2.0/}journal-ref': ['JCAP 0902:007,2009'],
 '{http://www.openarchives.org/OAI/2.0/}doi': ['10.1088/1475-7516/

In [6]:
key_prefix = '{http://www.openarchives.org/OAI/2.0/}'
key_type = 'abstract'
key = f'{key_prefix}{key_type}'
abs_str = meta[key][0].strip()
abs_str

'We model the essential features of eternal inflation on the landscape of a\ndense discretuum of vacua by the potential $V(\\phi)=V_{0}+\\delta V(\\phi)$,\nwhere $|\\delta V(\\phi)|\\ll V_{0}$ is random. We find that the diffusion of the\ndistribution function $\\rho(\\phi,t)$ of the inflaton expectation value in\ndifferent Hubble patches may be suppressed due to the effect analogous to the\nAnderson localization in disordered quantum systems. At $t \\to \\infty$ only the\nlocalized part of the distribution function $\\rho (\\phi, t)$ survives which\nleads to dynamical selection principle on the landscape. The probability to\nmeasure any but a small value of the cosmological constant in a given Hubble\npatch on the landscape is exponentially suppressed at $t\\to \\infty$.'
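
With the header, the metadata and the abstract all visible, flattening one record into the target CSV row might look roughly like the sketch below. It assumes every record has the same layout as the sample above (prefixed keys, single-element lists, authors as `keyname`/`forenames` dicts). `first_value`, `flatten_record` and `records_to_csv` are hypothetical helper names, and the sample abstract is shortened for brevity.

```python
import csv
import io

# Namespace prefix carried by every key in the harvested records.
KEY_PREFIX = '{http://www.openarchives.org/OAI/2.0/}'
COLUMNS = ['id', 'created', 'updated', 'setSpec',
           'categories', 'authors', 'title', 'abstract']


def first_value(section, name):
    """Return the first entry stored under a prefixed key, or '' if absent."""
    values = section.get(KEY_PREFIX + name) or ['']
    return values[0]


def flatten_record(record):
    """Turn one harvested record into a dict keyed by the CSV columns."""
    head, meta = record['header'], record['metadata']
    authors = []
    for author in meta.get(KEY_PREFIX + 'authors', []):
        forenames = author.get(KEY_PREFIX + 'forenames', '')
        keyname = author.get(KEY_PREFIX + 'keyname', '')
        # Assumption: squash each author into one token, as in "IleanaStreinu".
        authors.append((forenames + keyname).replace(' ', ''))
    row = {name: first_value(meta, name)
           for name in ('id', 'created', 'updated', 'categories')}
    row['setSpec'] = ' '.join(head.get(KEY_PREFIX + 'setSpec', []))
    row['authors'] = ' '.join(authors)
    # Collapse the hard line breaks that the harvested text fields contain.
    row['title'] = ' '.join(first_value(meta, 'title').split())
    row['abstract'] = ' '.join(first_value(meta, 'abstract').split())
    return row


def records_to_csv(records):
    """Serialise records to CSV text; the csv module quotes embedded commas."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for record in records:
        writer.writerow(flatten_record(record))
    return buffer.getvalue()


# A cut-down copy of the sample record shown above.
sample = {
    'header': {KEY_PREFIX + 'setSpec': ['physics:astro-ph', 'physics:gr-qc',
                                        'physics:hep-th']},
    'metadata': {
        KEY_PREFIX + 'id': ['0704.0144'],
        KEY_PREFIX + 'created': ['2007-04-02'],
        KEY_PREFIX + 'updated': ['2007-09-18'],
        KEY_PREFIX + 'authors': [
            {KEY_PREFIX + 'keyname': 'Podolsky', KEY_PREFIX + 'forenames': 'D.'},
            {KEY_PREFIX + 'keyname': 'Enqvist', KEY_PREFIX + 'forenames': 'K.'},
        ],
        KEY_PREFIX + 'title': ['Eternal inflation and localization on the landscape'],
        KEY_PREFIX + 'categories': ['hep-th astro-ph gr-qc'],
        KEY_PREFIX + 'abstract': ['We model the essential features of eternal '
                                  'inflation on the landscape of a\ndense '
                                  'discretuum of vacua'],
    },
}
print(records_to_csv([sample]))
```

Using `csv.DictWriter` instead of joining strings by hand means commas inside titles and abstracts get quoted automatically. Records missing a field such as `updated` simply produce an empty cell.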

In [260]:
rst_abs = pypandoc.convert_text(r'$1+1$', 
                                to='rst', format='latex')
rst_abs

':math:`1+1`\n'

In [263]:
rst_abs = pypandoc.convert_text(r'\(1+1\)', 
                                to='rst', format='latex')
rst_abs

':math:`1+1`\n'

In [290]:
# All of the above patterns should be captured by the following regular expression.

# The purpose of the ? is to make the match lazy, so it only runs to the next newline.
# These math modes are not suited to multi-line math, so the next newline character should be the end of the math.
inline_math = ':math:.*?\n'


In [291]:
rst_abs = pypandoc.convert_text(r'$1+1$', 
                                to='rst', format='latex')
p = re.compile(inline_math)

p.findall(string=rst_abs)

[':math:`1+1`\n']

In [292]:
rst_abs = pypandoc.convert_text(r'$$1+1$$', 
                                to='rst', format='latex')
rst_abs

'.. math:: 1+1\n'

In [293]:
rst_abs = pypandoc.convert_text(r'\[1+1\]', 
                                to='rst', format='latex')
rst_abs

'.. math:: 1+1\n'

In [294]:
rst_abs = pypandoc.convert_text(r'\begin{equation}1+1\end{equation}', 
                                to='rst', format='latex')
rst_abs

'.. math:: 1+1\n'

In [297]:
# Escape the dots so they match literal '.' characters rather than any character.
display_math_mode = r'\.\. math::.*?\n'

rst_abs = pypandoc.convert_text(r'\begin{equation}1+1\end{equation}', 
                                to='rst', format='latex')
p = re.compile(display_math_mode)

p.findall(string=rst_abs)

['.. math:: 1+1\n']

# multi line

In [312]:

rst_abs = pypandoc.convert_text(r'\begin{eqnarray}1+1\\ new\end{eqnarray}', 
                                to='rst', format='latex')
p = re.compile(display_math_mode)

p.findall(string=rst_abs)

['.. math::\n']

In [313]:
rst_abs

'.. math::\n\n   \\begin{aligned}\n   1+1\\\\ new\\end{aligned}\n'
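
The truncated match above comes down to how `.` treats newlines. Here is a minimal sketch using the same RST string, not tied to any particular pandoc version:

```python
import re

rst = '.. math::\n\n   \\begin{aligned}\n   1+1\\\\ new\\end{aligned}\n'

# Without re.DOTALL, '.' refuses to cross a newline, so the lazy match
# ends at the very first '\n', right after the directive header.
single_line = re.findall(r'\.\. math::.*?\n', rst)
print(single_line)  # ['.. math::\n']

# With re.DOTALL, '.' also matches '\n'; anchoring the lazy match on the
# closing \end{aligned} then captures the whole block.
whole_block = re.findall(r'\.\. math::.*?\\end\{aligned\}', rst, flags=re.DOTALL)
print(len(whole_block))  # 1
```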

In [418]:
rst_abs = pypandoc.convert_text(r'\begin{align}1+1\\ new\end{align}', 
                                to='rst', format='latex')
p = re.compile(display_math_mode)

p.findall(string=rst_abs)

['.. math::\n']

In [419]:
rst_abs

'.. math::\n\n   \\begin{aligned}\n   1+1\\\\ new\\end{aligned}\n'

# This took forever!!!!!!!

In [414]:
# Raw string: \\ matches a literal backslash; braces and dots are escaped to match literally.
multi_line_display_math_mode = r'\.\. math::\n\n   \\begin\{aligned\}.*?\\end\{aligned\}'
p = re.compile(multi_line_display_math_mode, flags=re.DOTALL)
p.findall(string=rst_abs)

['.. math::\n\n   \\begin{aligned}\n   1+1\\\\ new\\end{aligned}']
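
Putting the three patterns together, a helper that strips every math span from a converted abstract might look like the sketch below. `strip_rst_math` is a hypothetical name. The inline pattern here is a backtick-delimited variant rather than the to-the-newline `inline_math` pattern above. That is an assumption on my part, so that text following an inline formula on the same line survives.

```python
import re

# Multi-line display math, as pandoc emits it for align/eqnarray input.
MULTI_LINE_MATH = re.compile(
    r'\.\. math::\n\n   \\begin\{aligned\}.*?\\end\{aligned\}',
    flags=re.DOTALL)
# Single-line display math: the directive and its content on one line.
DISPLAY_MATH = re.compile(r'\.\. math::.*?\n')
# Inline math role; assumes the formula itself contains no backtick.
INLINE_MATH = re.compile(r':math:`.*?`')


def strip_rst_math(rst, placeholder=''):
    """Replace every math span in an RST string with a placeholder."""
    # Order matters: the multi-line pattern must run first, otherwise the
    # single-line pattern would eat only the '.. math::' header line.
    for pattern in (MULTI_LINE_MATH, DISPLAY_MATH, INLINE_MATH):
        rst = pattern.sub(lambda match: placeholder, rst)
    return rst


print(repr(strip_rst_math('where :math:`x` is random.\n', placeholder='MATH')))
# 'where MATH is random.\n'
```

Multi-line blocks that pandoc wraps in something other than `aligned` would slip past the first pattern, so this only covers the cases explored above.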