This repo holds a few files related to the classifiers discussed in the paper:
Halterman, Andrew, Philip A. Schrodt, Andreas Beger, Benjamin E. Bagozzi and Grace I. Scarborough. 2023. “Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks.” Working paper presented at the International Studies Association, Montreal, March 2023.
Additional code related to this paper, specifically on the entity-resolution side, can be found at https://github.com/ahalterman/NGEC/
At present, and likely well into the future, this site will mostly contain some utility files useful for reducing typing and typos, and a few code fragments, largely from the Huggingface and sklearn documentation with slight modifications, which clarify exactly what we mean by "default" parameters and techniques. The complete data-generation pipeline is at Leidos, and probably the U.S. government (which remains disinclined to share [1]) holds the intellectual property rights to it, that issue yet to be decided by legions of Gucci-shod lawyers, so it isn't here.
Which is to say, there's not a turn-key system here. But the remaining code in the operational pipeline is the routine stuff of, well, operational pipelines, so if you know enough to be creating an event data pipeline, you know enough to write that sort of code, and will almost certainly be better off writing it in whatever idiomatic style you are comfortable with rather than adapting to ours. Or rather, mine. [2]
A couple more useful links:
- PLOVER manual: https://github.com/openeventdata/PLOVER
- plovigy family of low-footprint, terminal-based Python programs for rapid annotation: https://github.com/openeventdata/plovigy
- An earlier complete event data pipeline, if you are looking for an example: https://github.com/openeventdata/phoenix_pipeline
A final and, alas, I-wish-it-were-unnecessary note: In the last few years there has developed in some parts of the event data community a most unfortunate, and certainly thoroughly unscientific, pathology of rigorously suppressing any criticism of data sets, variously using contractual non-disparagement clauses, threats of legal action, or ruthless pursuit of critics through various all-too-available institutional mechanisms originally designed to protect scientific integrity but now used against it. I believe I speak for our entire team when I say that we welcome criticism, and if you sincerely believe this data and/or the entire exercise is the dumbest thing you've ever seen, you are welcome, even encouraged, to express that opinion. Really. [3]
The golden-oldie file that lists all of the CAMEO categories and translates the textual descriptions used in ICEWS to the numerical codes used everywhere else. Also see https://github.com/openeventdata/text_to_CAMEO, which also uses this file.
CAMEO-to-PLOVER conversion files; per the embedded date, these may need a bit of updating.
Code fragments for reading and using these files to get the PLOVER equivalent of a CAMEO code
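For orientation only, here is a minimal sketch of what such a lookup might look like. It is not the code in this repo: the tab-delimited layout, the column names `cameo` and `plover`, the function names, and the three sample rows are all assumptions made up for illustration, so check the actual conversion files before relying on any of it. The only thing it leans on is that CAMEO codes are hierarchical, so a 3- or 4-digit subcode can fall back to its shorter prefix when the table has no exact entry.

```python
import csv
import io

# Illustrative rows only: the column layout and the specific mappings are
# assumptions, not the contents of the conversion files in this repo.
SAMPLE_CONVERSION = """cameo\tplover
04\tCONSULT
13\tTHREATEN
14\tPROTEST
"""


def load_conversion(handle):
    """Read tab-delimited rows into a dict keyed by CAMEO code string."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["cameo"].strip(): row["plover"].strip() for row in reader}


def cameo_to_plover(code, table):
    """Return the PLOVER category for a CAMEO code.

    Tries the full code first, then progressively shorter prefixes down to
    the 2-digit root code. Returns None if nothing matches.
    """
    code = str(code).strip()
    while len(code) >= 2:
        if code in table:
            return table[code]
        code = code[:-1]  # drop the last digit and retry with the parent code
    return None


# A real run would pass an open file instead of the in-memory sample.
table = load_conversion(io.StringIO(SAMPLE_CONVERSION))
print(cameo_to_plover("141", table))  # falls back from "141" to "14"
print(cameo_to_plover("99", table))   # no match at any level
```

Codes are kept as strings throughout so that leading zeros (as in "04") survive; reading them as integers would silently break the prefix fallback.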
Code fragments for estimating the models for determining event categories, modes, and contexts, mostly just showing the appropriate libraries to import and then call.
Assorted Python-formatted lists including PLOVER categories, 4-character category abbreviations, modes, contexts, and intensity scores.
Footnotes
1. Why utterly mundane code funded entirely by U.S. taxpayers remains proprietary while billions of dollars of pathbreaking and exceedingly high quality state-of-the-art software generated by corporations such as Alphabet/Google, Meta/Facebook, Amazon, and Microsoft has been made open source is, well, a great mystery. Though as the periodic discourses in War on the Rocks on the utterly dysfunctional character of US defense procurement note repeatedly, the simple combination of Soviet-style central planning and US-style corporate incentives gets you most of the way: nothing personal, just business.
2. Which includes (I'm not making this up) occasional integer loop indices labelled ka, kb, ... because variables beginning with "k" were integers in FORTRAN IV and using the suffixes 'a', 'b', ... rather than '1', '2', ... saved hitting the numerical shift key on a card punch, said shift requiring moving approximately as much finely machined, but not necessarily balanced, brass and steel as a Honda Civic. But nowadays my code mostly just uses iterators. Honest.
3. Or in the words of the late Michael Nicholson, one of the early pioneers in quantitative conflict analysis, "I'd rather have people saying 'that bastard Michael Nicholson' than 'Who is Michael Nicholson?'" Our sentiment exactly for PLOVER.