# Create Publications from rubyscholar JSON
Script by [J. Nathan Matias](https://natematias.com) 2018, made available under an [MIT License](https://opensource.org/licenses/MIT).

This script loads the JSON file created by [rubyscholar](https://github.com/wurmlab/rubyscholar), compares it to the files in \_publications, and creates a new file for any publication that doesn't already have one. This allows a Jekyll website to add papers from Google Scholar.

In [12]:
import json, yaml, glob, os, re, frontmatter
from collections import Counter, defaultdict
from dateutil import parser

In [13]:
papers_file = "../rubyscholar/publications.json"

### Load All Markdown Files in \_publications
Iterate through \_publications and extract the metadata so we can avoid creating duplicates. This code should never overwrite an existing markdown file, since we want to be able to customize individual markdown files after they are initially created.

In [61]:
current_publications = {}
current_filenames = []
for filename in glob.glob(os.path.join("_publications", "*")):
    current_filenames.append(filename)
    with open(filename, "r") as f:
        contents = f.read()
    # Skip empty files and files whose front matter has no title
    if len(contents) > 0:
        md = frontmatter.loads(contents)
        if 'title' in md.metadata:
            current_publications[md.metadata['title']] = md.metadata

### Load and Parse Publications JSON File

In [55]:
def slugify(value):
    """
    Converts to lowercase, removes characters other than letters, digits,
    underscores, whitespace, and hyphens, and converts spaces to hyphens.
    Modified from the Django codebase.
    """
    # Raw strings keep \w and \s from being read as string escape sequences
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    value = re.sub(r'[-\s]+', '-', value)
    return value
    
def create_markdown_from_entry(publication):
    post = frontmatter.Post(content="")
    post.metadata['title']     = publication['title']
    post.metadata['excerpt']   = ''
    post.metadata['date']      = publication['year']
    post.metadata['venue']     = publication['journal']
    post.metadata['paperurl']  = publication['citationUrl']
    post.metadata['citation']  = publication['authors'] + ". (" + publication['year'] + "). " + publication['title']  + ". " + publication['journal']
    filename_title = (" ".join(publication['title'].split(" ")[0:5])).lower()
    post.metadata['filename']  = publication['year'] + "-" + slugify(filename_title)
    return post
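
To make the filename rule concrete, here is a minimal sketch of how a title becomes a filename: the first five words of the title are lowercased, slugified, and prefixed with the year. The entry below is hypothetical (the title and year are invented for illustration), and `slugify` is repeated so the example runs on its own.

```python
import re

def slugify(value):
    """Lowercase, drop punctuation, and join words with hyphens."""
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

# Hypothetical entry; the title and year are invented for illustration.
publication = {
    "title": "A Hypothetical Study: Measuring Things, Carefully and Well",
    "year": "2018",
}

# Keep only the first five words of the title, as create_markdown_from_entry does.
filename_title = " ".join(publication["title"].split(" ")[0:5]).lower()
filename = publication["year"] + "-" + slugify(filename_title)
print(filename)
```

This prints `2018-a-hypothetical-study-measuring-things`. Because only five words are kept, two papers from the same year whose titles start with the same five words would map to the same filename.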



In [56]:
## LOAD PUBLICATIONS FROM FILE
with open(papers_file, "r") as f:
    publications = json.load(f)
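
The cells in this notebook assume a particular shape for the JSON: a top-level object whose values each hold a `scholar` object with `title`, `authors`, `year`, `journal`, and `citationUrl`. The sketch below is inferred from the fields this notebook reads, not from rubyscholar's documentation; the outer key and all values are invented, and `year` is assumed to be a string because the citation is built by string concatenation.

```python
import json

# Hypothetical sample: the outer key and all values are invented for illustration.
sample = """
{
  "example-key": {
    "scholar": {
      "title": "A Hypothetical Study",
      "authors": "A Author, B Author",
      "year": "2018",
      "journal": "Journal of Examples",
      "citationUrl": "https://example.com/citation"
    }
  }
}
"""

publications = json.loads(sample)
required = ["title", "authors", "year", "journal", "citationUrl"]

# Report any field that create_markdown_from_entry would fail to find.
for key, item in publications.items():
    publication = item["scholar"]
    missing = [field for field in required if field not in publication]
    print(key, "missing fields:", missing)
```

Running a check like this on the real file before generating posts would surface entries that would otherwise raise a `KeyError`.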

In [57]:
for item in publications.values():
    publication = item['scholar']

### Output Posts to Markdown Where Posts Don't Already Exist

In [62]:
files_written = []
files_omitted = []
for item in publications.values():
    publication = item['scholar']
    post = create_markdown_from_entry(publication)
    # current_filenames holds full paths from glob, so compare against the full target path
    target = os.path.join("_publications", post.metadata['filename'] + ".md")
    if(post.metadata['title'] not in current_publications and
       target not in current_filenames):
        with open(target, "w") as f:
            f.write(frontmatter.dumps(post))
            files_written.append(post.metadata['filename'])
    else:
        files_omitted.append(post.metadata['filename'])

print("Wrote {0} new files".format(len(files_written)))
print("Skipped {0} existing files".format(len(files_omitted)))

Wrote 20 new files
Skipped 0 existing files
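
As an extra safeguard for the rule that existing markdown files are never overwritten, one option is to open files in exclusive-creation mode. This is a sketch of an alternative, not what the notebook does: `write_if_missing` is a hypothetical helper, and the path and contents are placeholders written to a temporary directory.

```python
import os
import tempfile

def write_if_missing(path, text):
    """Hypothetical helper: create path with text, or return False if it exists."""
    try:
        # Mode "x" creates the file and raises FileExistsError if it is already there.
        with open(path, "x") as f:
            f.write(text)
        return True
    except FileExistsError:
        return False

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "2018-example.md")
    first = write_if_missing(path, "generated\n")
    second = write_if_missing(path, "regenerated\n")
    with open(path) as f:
        contents = f.read()

print(first, second, repr(contents))
```

The second call returns `False` and leaves the first version in place, so a hand-edited file would survive even if the title and filename checks missed it.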
