# pkgw/worklog-tools

Framework for generating CV, publications list, etc.
Python HTML TeX Makefile
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.
bibtexparser
example
.gitignore
inifile.py
unicode_to_latex.py
wlgithub.py
wltool
worklog.py

# worklog-tools

This software is part of a system for recording academic output and reporting it in documents like CVs or publication lists.

### “Why do you need software for that?”

Two reasons:

• I want to have my CV and publications list available in both nicely printable and slick web-native formats … without having to keep two very different documents in sync.
• I want my CV to include things like my h-index, citation counts, total telescope time allocated, and so on. (Oh, by the way: I’m an astronomer.) And I want the computer to be responsible for figuring those out. Because that kind of stuff is why we invented computers.

To accomplish these things, the worklog tools process simple structured text files and use the results to fill in LaTeX and HTML templates. There’s built-in functionality to fetch citation counts and fill in statistics like your h-index.

### “Sounds like a thing. Should I care?”

Yes! You can copy the tools and example files to quickly get started automating the generation of your own CV. The log format is flexible and the scripts are simple, so the sky’s the limit in terms of what effects you can achieve.

Also, I like to think that my LaTeX templates are pretty nice. Here are my printable CV and my publications list. Here’s my online publications list.

## Diving in

The worklog system has three pieces:

• Simple text files logging academic output
• LaTeX/HTML templates used to generate output documents
• Software to fill the latter using data gathered from the former

The software is in the same directory as this file — the wltool script drives everything from the command line. The example subdirectory contains sample copies of templates (in *.tmpl.*) and log files (in 2012.txt, 2013.txt).

To get started, first check out this repository if you haven’t done so already. Go into the example directory and type make. This will create the outputs: a CV and publication list in PDF and HTML formats. (Assuming nothing breaks … the scripts are in Python and have few dependencies, so they should be widely portable.) The HTML results have not been particularly beautified, but I've tried to make the PDFs come out nicely.

Now check out example/2013.txt. Log files are in a basic “ini file” format, with records coming in paragraphs headed by a word encased in square brackets. A typical record is:

[talk]
date = 2013 Apr
where = OIR seminar, Harvard/Smithsonian Center for Astrophysics
what = Magnetic Activity Past the Bottom of the Main Sequence
invited = n

(The precise file format is defined among the “Technical details” below.) A major point of emphasis is that this format is very simple and readable for humans and computers alike.

The template files, on the other hand, are complicated since they go to some effort to create attractive output. (Well, currently, this is much more true of the LaTeX templates than the HTML templates.) Most of this effort is in initialization, so the ends of the files are where the actual content shows up. For instance, toward the bottom of example/cv.tmpl.tex you’ll find:

FORMAT \item[|date|] \emph{|where|} \\ |what|''

\section*{Professional Talks --- Invited}
\begin{datelist} % this is a custom environment for nice spacing
RMISCLIST_IF talk invited
\end{datelist}

\section*{Professional Talks --- Other}
\begin{datelist}
RMISCLIST_IF_NOT talk invited
\end{datelist}

The lines beginning with ALL_CAPS trigger actions in the templating scripts. The RMISCLIST_IF directive writes a sequence of [talk] records in reversed order, filtering for an invited field equal to y. Each record is LaTeXified using the template specified in the most recent FORMAT directive. Strings between pipes (|what|) in the FORMAT are replaced by the corresponding values from each record. (The precise functionalities of the various directives are also defined among the “Technical details” below.)

Finally, the Makefile in the example directory wires up commands to automatically create or update the output files using the standard make command.

## Using the tools for your own documents

To get started using this system for yourself, you should copy the script code and example files. You can put them all in the same directory and just edit the Makefile to change toolsdir to . rather than ... Then there are two things to work on: customizing the templates, and entering your previous accomplishments into log files. You’ll probably edit these hand-in-hand as you decide what things to include in your CV and how you want to express them in your log files. Obviously, for some things it makes more sense just to put them directly in the templates, rather than filling in fields from the log files.

Unsurprisingly, the templates get filled by running the wltool program. It has several subcommands that are documented below. The latex and html subcommands are the main ones to use. Of special note, however, are two subcommands. The bootstrap-bibtex subcommand creates a set of log files, seeding them with publication information gathered from a NASA ADS BibTeX file. The update-cites subcommand which automatically fetches citation information from NASA ADS.

I organize my log files by year (e.g. 2012.txt) but you can arrange them any way you want. The wltool reads in data from every file in the current directory whose name ends in .txt, processing the files in alphabetical order.

I recommend that you use make to drive the template filling and git to version-control your files, but those things are up to you.

### Generic lists

The main method for filling the templates with data is a combination of RMISCLIST and FORMAT directives. These provide almost complete flexibility because you can define whatever fields you want in your log files and get them inserted by writing a FORMAT directive.

Important! When fields from the logs are inserted into the templates, they are escaped for LaTeX and HTML as appropriate. So if you write

[foo]
bar = $x^2$

and insert |bar| in a template somewhere, you’ll get dollar signs and carets, not . We can’t sanely output to both HTML and LaTeX without doing this escaping. To get non-keyboard characters, use Unicode — get friendly with http://unicodeit.net/ and Compose Key keyboard features. The log files are fully Unicode-enabled and Unicode characters will be correctly converted to their LaTeX expressions in the output (see the module unicode_to_latex.py).

The main constraint in the list system is in the ability to filter and reorder records. The built-in facilities for doing so are limited, though so far they’ve been sufficient for my needs. The only supported ordering is “reversed”, where that’s relative to the order that the data files are read in: alphabetically by file name, beginning to end. Since I name my files chronologically and append records to them as I do things, this works out to reverse chronological order, which is generally what you want.

As for filtering, the RMISCLIST directives select log records of only a certain type, where the “type” is defined by the word inside square brackets beginning each record (e.g., [talk]). The RMISCLIST_IF and RMISCLIST_IF_NOT directives further filter checking whether a field in each record is equal to y, with a missing field being treated as n.

To extend this behavior, you’re going to need to edit worklog.py. See the cmd_rev_misc_list* functions and setup_processing, which activates different directives.

### Publication lists

Publications are listed in [pub] records. These follow a more detailed format to allow automatic fetching of citation counts, generation of reference lists with links, and computation of statistics such as the h-index.

Publication records are read in and then “partitioned” into various groups (i.e., “refereed”, “non-refereed”) by the partition_pubs function in worklog.py. The PUBLIST directive causes one of these groups to be output, with the crucial wrinkle that each record is augmented with a variety of extra fields to allow various special effects. This augmentation is done in the cite_info function in worklog.py.

If you want to group your publications differently (i.e. “refereed first-author”), then, you’ll need to edit partition_pubs. To change citation text generation or do more processing, you’ll need to dive into cite_info.

The details of publication processing are documented in the next section.

### Awarded proposals list

Proposals can be listed in [prop] records. These also follow a somewhat specific format to allow one particular feature: the tools can compute the total awarded time by each facility. The relevant fields are:

• mepi — should be y if you were the PI.
• accepted — should be y if the proposal was accepted.
• facil — the name of the facility that was proposed to.
• request — the resources that the proposal requested. This should consist of a number followed by a space and then some text specifying the units. Every proposal for the same facility should use the same units.
• award — the resources that were actually awarded. If not specified, it is assumed that the full request was awarded.

Proposals are grouped by facility, and the total amount awarded in each proposal where both mepi and accepted are y is computed. The totals may then be inserted into the template using TALLOCLIST or SPLIT_TALLOCLIST.

### Software repositories list

I have attempted to develop a framework for quantifying contributions that come in the form of software. I wouldn’t say that anyone really has any good idea of how to do this, but I’ve given it my best shot. Software contributions are quantified by tracing contributions to version-control repositories in [repo] records. These follow a somewhat specific format so that the tools can compute statistics such as total numbers of commits. The relevant fields are:

• usercommits — the number of commits you have made to the repository.
• allcommits — the total number of commits in the repository.
• lastusercommit — the date on which you last made a commit, in the form YYYY/MM/DD.
• forks — the number of times the repository has been “forked”, if the hosting service happens to allow and track that.
• stars — the number of times the repository has been “starred”, if the hosting service happens to allow and track that.

TODO: this feature needs more documentation.

### Public engagement activities list

Public engagement activities can be listed in [engagement] records. These can be quite freeform, but if you give these records a class field, the tools will be able to count up different types of engagement activities. Different classes of activity that are tracked are:

• interview — a media interview
• outreach_event — a public outreach event
• press_release — a press release
• public_talk — a public lecture

See the engagement_stats statistics group that can be used with the BEGIN_SUBST directive.

### Other derived information

The BEGIN_SUBST directive begins a block of text in which special text fragments can be inserted into the template directly. This block continues until a bare END directive is encountered. Different substitution groups are available:

• cite_stats — statistics regarding refereed publications drawn from NASA ADS.
• engagement_stats — statistics regarding public engagement activities.
• repo_stats — statistics regarding open-source software contributions.
• talk_stats — statistics regarding professional talks

### Miscellaneous template directives

There are also some more specialized templating directives that are fully documented below. For example, TODAY., inserts the current date.

## Technical details: publication processing

As mentioned above, there are two basic wrinkles to the processing of publications. First, extra fields are added to the records on the fly. Second, the publications are “partitioned” into various subgroups.

The automatic processing assumes that all publication records will define certain fields. Some of the key ones are:

• title — the title of the publication
• authors — the list of authors, in a special format. Separate authors’ names are separated by semicolons. Each author’s name should be in first-middle-last order — no commas. Surnames containing multiple words should have the words separated by underscores — this is the easiest way to have the software pull out surnames automatically. Initials are OK.
• mypos — your numerical position in the author list, with 1 (sensibly) being first. Negative numbers are treated like Python list indices: -1 means that you are the last author, -2 means second-to-last, and so on.
• advpos — a comma-separated list of positions in the author list, again with 1 being first. The corresponding author names will be underlined in the full author list. The intention is to highlight the names of directly-advised students.
• pubdate — the year and month of the publication, in numerical YYYY/MM format. The worklog system generally tries not to enforce a particular date format, but here it does.
• refereedy if the publication is refereed, n if not.
• refpreprinty if the publication has been submitted to the refereeing process but has not yet been accepted.
• informaly if the publication is informal (e.g., a poster); this affects the partitioning process as described below.
• cite — citation text for the publication. This is free-form. My personal preference is to keep it terse and undecorated. Examples include:
• ApJ 746 L20
• proceedings of “RFI Mitigation Workshop” (Groningen)
• The Astronomer’s Telegram #3135
• MNRAS submitted

There are also a set of fields used to create various hyperlinks. As many of these should be defined as exist:

• arxiv — the item’s unadorned Arxiv identifier, i.e. 1310.6757.
• bibcode — the item’s unadorned NASA ADS bibcode, i.e. 2006ApJ...649.1020W.
• doi — the item’s unadorned DOI, i.e. 10.1088/2041-8205/733/2/L20.
• url — some other relevant URL.

Some fields are optional:

• adscites — records NASA ADS citation counts. Automatically set by the wltool update-cites command.
• kind — a one-word description of the item kind if it is nonstandard (e.g., poster). This is only used for the other_link field described below.

The cite_info function uses the above information to create the following fields:

• abstract_link — a hyperlink reading “abstract” that leads to the ADS page for the publication, if its bibcode is defined.
• bold_if_first_title — a copy of title, but with markup to render it in bold if mypos is 1, that is, you are the first author of the item.
• citecountnote — text such as “ [4]” (including a leading space) if the item has 4 ADS citations; otherwise, it is empty.
• full_authors — the full author list, with names separated by commas and non-surnames reduced to initials without punctuation; e.g. “PKG Williams, GC Bower”.
• lcite — a copy of cite, but with markup to make it a hyperlink to an appropriate URL for the publication, based on arxiv, bibcode, doi, or url.
• month — the numerical month of publication
• official_link — a hyperlink reading “official” that leads to the DOI page for the publication, if doi is defined.
• other_link — a hyperlink leading to the publication’s url field. The link text is the value of the kind field.
• preprint_link — a hyperlink reading “preprint” that leads to the Arxiv page for the publication, if its arxiv is defined.
• quotable_title — a copy of title with double quotes replaced with single quotes. This makes it suitable for encasing in double quotes itself. (We don‘t worry about subquotes in the title itself.) Note that the replacement operates on proper typographic left and right quotes; that is, >“< and >”<, but not >"<.
• pubdate — this is modified to read “{year} {Mon}”, where “{Mon}” is the standard three-letter abbreviation of the month name. The space between the year and the month is nonbreaking.
• refereed_mark — a guillemet (») if the publication has refereed equal to y; nothing otherwise.
• short_authors — a shortened author list; either “Foo”, “Foo & Bar”, “Foo, Bar, Baz”, or “Foo et al.”. Only surnames are included. If the MYABBREVNAME directive has been used, your name (as determined from mypos) is replaced with an abbreviated value, so that the author list might read “PKGW & Bower”.
• year — the numerical year of publication.

The groups that publications can be sorted into are:

• all — absolutely all publications in chronological order
• all_rev — all publications in reverse chronological order
• all_formal — publications without informal = y
• refereed — publications with refereed = y
• refereed_rev — as above but in reverse chronological order
• refpreprint — publications with refpreprint = y
• refpreprint_rev — as above but in reverse chronological order
• all_non_refereed — publications without refereed = y or refpreprint = y
• non_refereed — publications without refereed = y or refpreprint = y or informal = y
• non_refereed_rev — as above but in reverse chronological order
• informal — publications without refereed = y or refpreprint = y but with informal = y
• informal_rev — as above but in reverse chronological order

It’s assumed that refereed = y and informal = y never go together. The combination of refereed, refpreprint, non_refereed, and informal captures all publications.

## Technical details: wltool invocation

Here are the subcommands supported by the wltool program:

### bootstrap-bibtex {bibtex-file} {your-surname} {output-dir}

This creates a new set of worklog files, seeding them with publication information from the BibTeX file bibtex-file. You must specify your-surname so that the system can guess the mypos field. The output directory output-dir is created — it may not already exist, to avoid any possibility of overwriting existing data.

Search the created files for XXX to find items that will need manual attention. Other items should generally be correct, but the output only going to be as correct as the input, and things like the cite field are generated with crude heuristics.

Warning: the tool currently assumes that the BibTeX file in question is of the form generated by NASA ADS. Files from other sources will probably not parse successfully.

This is a sort of grep for your log files. It merely reads them all in and prints out all of the records whose type matches record-type. This can be useful for scripting or reminding yourself what fields you used for a certain kind of entry.

The optional argument datadir specifies where the log files are; the default is the current directory.

Example:

$./wltool extract pub [pub] title = An amazing paper … lots more output …$

Fill in the specified template-file by processing the directives described below. The filled-in template is printed to standard output. The output is assumed to be in HTML format; in particular, this means that special characters are encoded in Unicode with UTF-8, and certain fancy text effects (making text bold, for instance) are done with HTML tags.

The optional argument datadir specifies where the log files are; the default is the current directory.

Example:

$./wltool html pubslist.tmpl.html >pubslist.html$

Operates exactly as the html subcommand, except that the output is assumed to be in LaTeX format. Special characters are converted to LaTeX escapes (e.g., “α” → “\alpha”) and text effects are done with LaTeX constructs (e.g., “\textbf{…}”).

Example:

$./wltool html cv.tmpl.tex >cv.tex$

Print out the number of records of each type in your log files.

The optional argument datadir specifies where the log files are; the default is the current directory.

Example:

$./wltool summarize award: 1 job: 2 outreach: 2 prop: 1 pub: 10 talk: 13 teaching: 1$

For each record in the log files with a bibcode field, the script connects to the ADS website and fetches the number of citations. The log files are modified in-place to have the citation counts inserted into each record in a field called adscites. The adscites field records the date that citation counts were last checked, and the script won’t update check counts more frequently than once a week.

As the updates are conducted, the script will print out each bibcode and the change in the number of citations it has received. Records will be annotated with an asterisk if they’re first-author publications and an “R” if they’re refereed.

The optional argument datadir specifies where the log files are; the default is the current directory.

Example:

$./wltool update-cites 2012arXiv1201.5413W * ... 0 (+0) 2012PASP..124..624W *R ... 4 (+0) 2013ApJ...762...85W *R ... 1 (+0) 2013PASA...30....6M R ... 12 (+1) …$

## Technical details: template directives

Here are the directives supported by the template processor. Each directive is recognized when it appears at the very beginning of a line in a template. Most of the directives take arguments that appear on the same line, separated by whitespace.

### BEGIN_SUBST {group}

Marks the beginning of a region in which text will be substituted. The behavior resembles a FORMAT directive in that the following lines should contain pipe-delimited field names that will be substituted with computed values. This behavior will continue until a line containing just the word END is encountered.

The subsitutions are batched into different groups. The available substitution groups are listed below.

Example:

BEGIN_SUBST cite_stats
<p>I may have written only |refpubs| refereed publications, but I assure you
that they are all profound.</p>
END

### FORMAT {template text ...}

This sets the template text that will be used by subsequent PUBLIST, and RMISCLIST-variant commands, until a new FORMAT directive is issued. For each record produced by these commands, the template text will be inserted, with field names delimited by pipes (e.g., |title|) getting replaced by data from the records. Missing fields are an error.

Note that this directive (and all others) must appear on a single line, so it gets a little awkward if you have a very long piece of template text.

Example:

FORMAT \item[|date|] |what|

\begin{enumerate}
RMISCLIST outreach
\end{enumerate}

### MYABBREVNAME {text ...}

This directive turns on special replacement of your name in short author lists; it allows the style of citing your own works as (e.g.) “PKGW & Bower” rather than “Williams & Bower”. The text is this shortened version of your name.

This feature only affects the short_authors field of publications; full_authors is not modified. If you don’t use this directive, short_authors doesn’t get modified either.

Example:

MYABBREVNAME PKGW

…

FORMAT Short authors: |short_authors|
PUBLIST all


### PUBLIST {group}

Causes data for a group of publications to be inserted according to the most recently-specified FORMAT template. The group is one of the “partitions” defined in the publication processing section. The FORMAT template is inserted once for each publication in the specified group.

The template has access to all of the fields defined in your matching [pub] records, as well as the numerous extra fields computed as described in the publication processing section

Example:

FORMAT <dt>|pubdate|</dt><dd>|title|</dd>

<h1>Refereed publications</h1>
<dl>
PUBLIST refereed_rev
</dl>

### TALLOCLIST

Causes information about total resources allocated in proposals to be inserted according to the most recently-specified FORMAT template. The FORMAT template is inserted once for each facility with a successful proposal as PI.

The fields accessible to the template are:

• facil — the facility name, but see below.
• total — the total of the allocations for that facility
• unit — the unit string specified in the proposals for the facility

The records are sorted alphabetically by facil, with the following exception. All facil names starting with the text SUMMARY: are sorted after the other facilities, and are rendered in italics with the SUMMARY: tag stripped off. This feature is intended for money awards, which one typically wants to emphasize. A relevant worklog entry might look like:

[prop]
title = My Awesome Science
date = 2017/11
pi = P K G Williams
mepi = y
facil = Very Large Array
request = 100 hr
request2 = 100 ks Chandra
accepted = y
award3 = 100000 USD SUMMARY:Support funding


Example of the template:

FORMAT |facil| & |total| & |unit| \cr

\section*{Total Allocations as PI}
\begin{tabular}{lrl}
TALLOCLIST
\end{tabular}

### SPLIT_TALLOCLIST {text...}

This is a giant hack to typeset time allocation lists when space is at a premium. It is just like TALLOCLIST, but you can specify arbitrary raw text that will be inserted halfway through the list of allocations. This can let you break the list of allocations into two equal-sized tables, which can then by typeset side-by-side.

Example

FORMAT |facil| & |total| & |unit| \cr

\section*{Total Allocations as PI}
\begin{tabular}{lrl}
SPLIT_TALLOCLIST \end{tabular} \begin{tabular}{lrl}
\end{tabular}

### RMISCLIST {type1[,type2,...]}

Causes worklog data to be inserted according to the most recently-specified FORMAT directive. Records matching any of the comma-separated list of types will be output in reversed order, with the FORMAT template being inserted once for each record.

Example:

FORMAT |institution| & |city| \cr

\begin{tabular}{rl}
RMISCLIST placesilike
\end{tabular}

### RMISCLIST_IF {type1[,type2,...]} {gatefield}

The same as RMISCLIST, except that matching records will be output only if they have the field named gatefield and its value is precisely “y”.

Example:

FORMAT |institution| & |city| \cr

\begin{tabular}{rl}
RMISCLIST_IF placesilike ilikeit
\end{tabular}

### RMISCLIST_IF_NOT {type1[,type2,...]} {gatefield}

The same as RMISCLIST, except that matching records will be output only if either they do not have the field named gatefield, or its value is not precisely “y”.

Example:

FORMAT |institution| & |city| \cr

\begin{tabular}{rl}
RMISCLIST_IF_NOT placesilike ihateit
\end{tabular}

### TODAY.

Inserts the current day as “{Mon} {day}, {year}.” Both the directive name and the output include the trailing period! Here {Mon} is the three-letter abbreviated month name.

Example:

This document was last updated
TODAY.


## Technical details: substitution groups

Here are the different substitution groups known to the BEGIN_SUBST directive.

### cite_stats

These are statistics relating to refereed publications and their citations. The fields within this group are:

• day — the numerical day of the month of the median date when citations were updated.
• hindex — your numerical h-index.
• italich — a bit of a hack; code for the letter “h” in italics, appropriate for either LaTeX or HTML as needed.
• meddate — the median Unix time around which citations were updated.
• month — the numerical month of the median date when citations were updated.
• monthstr — the three-letter abbreviated month of the median date when citations were updated.
• refcites — the total number of citations to refereed publications
• reffirstauth — the total number of refereed first-author publications
• refpubs — the total number of refereed publications
• year — the year of the median date around which citations were updated.

### engagement_stats

These are statistics relating to public engagement activities. The fields are:

• n_interviews — the number of [engagement] records with class = interview.
• n_outreach_events — the number of [engagement] records with class = outreach_event.
• n_press_releases — the number of [engagement] records with class = press_release.
• n_public_talks — the number of [engagement] records with class = public_talk.

### repo_stats

These are statistics relating to open-source software contributions. The fields within this group are:

• total_repos — the total number of tracked repositories.
• total_commits — the total number of commits made by you in all repositories.
• primary_author_stars — the total number of GitHub stars given to repositories in which you are the primary author, which we define as those in which you have made more than 50% of the commits.
• primary_author_forks — the total number of GitHub forks of repositories in which you are the primary author.

### talk_stats

These are statistics relating to professional talks you have given. The fields within this group are:

• n_total — the total number of talks
• n_invited — the number of talks with invited = y
• n_conference — the number of talks with conference = y

## Technical details: the ini file format

The “ini file” format is implemented in the inifile.py module.

Each file is a line-oriented Unicode text file encoded in UTF-8.

Any text in a line at or after a pound sign (#) is ignored, except for quoted field values as described below.

A line of the form [.....]{whitespace} denotes a new record. The text between the brackets is saved in a field called section.

Data lines take the form {field name}{whitespace}={value}. Field names can be anything without spaces, but should take the form of valid Python identifiers.

If the field value is encased in straight double quotes, the text inside the quotes is stored verbatim. This includes leading and trailing whitespace and pound signs.

Otherwise, the field value is constructed from the text after the equals sign, plus text on subsequent lines if those lines begin with whitespace. Leading and trailing whitespace of the resulting value are removed, and the newline/whitespace combination is replaced with a single space.

Example:

# This is a comment
[pub]
pubdate = 2011/09 # and this
title = This is a somewhat long title and so we wrap it onto
the next line
# But not the next pound sign:
cite = "Exciting Committee whitepaper #32"
authors = Peter K. G. Williams
myfield = εχαμπλε
equation = y²×π

## System requirements

The tools should be broadly portable, but they do have a few requirements:

• Python. Version 2.6 or higher should work. Only standard modules are used.
• To drive processing with the example Makefile, you need command-line access to pdflatex and (of course) make.
• The following LaTeX packages are used:
• To use the GitHub features, you need to install the following Python packages and their dependencies:
• httplib2