Skip to content
Python tools for processing data from the Catalog of Copyright Entries
Python
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
registrations @ 34f2726
renewals @ ebb6156
.gitmodules
0-parse-registrations.py
1-parse-renewals.py
2-match-renewals.py
3-filter.py
4-sort-it-out.py
5-make-tsv.py
LICENSE
README.md
compare.py
countries.json
model.py
requirements.txt

README.md

Identifying Public Domain Works with Python

This set of Python scripts combines copyright registration data with copyright renewal data to identify works whose copyright has lapsed because a registration was not renewed.

The data comes from datasets provided by the New York Public Library. These datasets build on work done by the Internet Archive, Stanford Libraries, and Project Gutenberg.

Thanks are due to Sean Redmond, Josh Hadro, and Greg Cram.

Statistics

  • There are about 730,000 books where a single renewal record would be necessary for the work to still be in copyright today.
  • Of these, about 19% definitely have that renewal record and are still in copyright.
  • About 10% may have a renewal record, but a manual inspection is necessary to make sure.
  • About 71% definitely have no renewal record.

Getting the Data

The best way to get this data is to generate it yourself by following the instructions below.

If you just want a quick overview listing renewed and unrenewed books, you can find a simplified version of these results in the cce-spreadsheets project.

Generating the Data

First, clone this repository and initialize the submodules. This will bring in the raw registration and renewal data; it'll take a long time.

git submodule init
git submodule update

Make sure the lxml XML parser is installed:

pip install -r requirements.txt

Then run the scripts, one after another:

python 0-parse-registrations.py
python 1-parse-renewals.py
python 2-match-renewals.py
python 3-filter.py
python 4-sort-it-out.py

The final script's output will look something like this:

Among all publications:
output/FINAL-foreign.ndjson: 193732 (12.95%)
output/FINAL-previously-published.ndjson: 77950 (5.21%)
output/FINAL-too-late.ndjson: 429869 (28.72%)
output/FINAL-too-early.ndjson: 7374 (0.49%)
output/FINAL-renewed.ndjson: 136669 (9.13%)
output/FINAL-probably-renewed.ndjson: 976 (0.07%)
output/FINAL-possibly-renewed.ndjson: 70980 (4.74%)
output/FINAL-not-renewed.ndjson: 521499 (34.85%)
output/FINAL-not-books-proper.ndjson: 36408 (2.43%)
output/FINAL-error.ndjson: 21048 (1.41%)
Total: 1496505

Among first US publications in renewal range:
output/FINAL-renewed.ndjson: 136669 (18.72%)
output/FINAL-probably-renewed.ndjson: 976 (0.13%)
output/FINAL-possibly-renewed.ndjson: 70980 (9.72%)
output/FINAL-not-renewed.ndjson: 521499 (71.43%)
Total: 730124

You'll see a number of large files in the output directory. These files represent the work product of each step in the process. The files you're most likely interested in are the FINAL- series, mentioned above. These files represent this project's final conclusions about which books were renewed and which weren't; which books were published in the US and which weren't.

If you think there's been a mistake or a bad assumption somewhere in this process, it's easy to fix. Change the corresponding script, re-run it, then re-run the subsequent scripts to get a new set of FINAL- files.

How it works

I'll cover each script in order.

0-parse-registrations.py

This script converts each copyright registration record from XML to JSON, with a minimum of processing.

Outputs:

  • 0-parsed-registrations.ndjson - A list of registration records, each in JSON format.

1-parse-renewals.py

This script converts each copyright renewal record from CSV to JSON, with a minimum of processing.

Outputs:

  • 1-parsed-renewals.ndjson - A list of renewal records, each in JSON format.

2-match-renewals.py

Match up registrations with their renewals.

Outputs:

  • 2-registrations-with-renewals.ndjson - A list of the same registrations from 0-parsed-registrations.ndjson, except that every registration with one or more renewals has been combined with its renewal information.

  • 2-cross-references-in-foreign-registrations.ndjson - A list of non-obvious potential foreign registrations, to be used in the next step.

  • 2-renewals-with-registrations.ndjson - A list of renewals that could be matched to a registration.

  • 2-renewals-with-no-registrations.ndjson - A list of renewals that couldn't be matched to a registration. Some of these are renewals for pamphlets and such -- works other than "books proper" -- so although their registrations exist, they aren't in this dataset. Others may represent missing data or errors in matching a book to its registration.

3-filter.py

For each registration, make a decision about the quality of the registrations found for it, where it was published, and so on.

Outputs:

  • 3-registrations-foreign.ndjson - Registrations for foreign works, interim registrations (used while foreign works were looking for a US publisher), and registrations where the place of publication looks like a place outside the United States. Foreign works had their copyright renewed by treaty, so the absence of a renewal doesn't prove anything.

  • 3-registrations-previously-published.ndjson - Registrations for works that seem to have been previously published in the US. Even if the copyright has lapsed on this work, there may be a renewed copyright on a previous edition -- in that case, only the new material might be in the public domain. These need to be checked manually.

  • 3-registrations-too-early.ndjson - Registrations that are moot because they happened more than 95 years ago. These books are in the public domain regardless of whether the copyright was renewed, so renewals probably aren't relevant.

  • 3-registrations-too-late.ndjson: Copyright registrations that happened after 1963. These were renewed automatically, so renewals probably aren't relevant.

  • 3-registrations-in-range.ndjson - Registrations where the absence of a renewal record could make the difference between still being in-copyright and being in the public domain.

  • 3-registrations-error.ndjson - Contains about 20,000 registrations which can't be processed because they're missing essential information. This information might be missing from the original registrations, it might be missing from the transcription, or the information might be represented in a form that these scripts can't understand.

4-sort-it-out.py

This simple script takes the output files generated by steps 2 and 3, and consolidates them into a number of files:

  • FINAL-not-renewed.ndjson: These books were almost certainly not renewed and are now in the public domain.
  • FINAL-probably-renewed.ndjson: These books were probably renewed, but a manual check is necessary to make sure.
  • FINAL-possibly-renewed.ndjson: These books had one or more renewal records, but none of them seemed like a good match. A manual check is necessary to see whether the renewals are legit. In particular, this script will count a tentative match for a book if there's a renewal record for any book with the same title.
  • FINAL-foreign.ndjson: These books appear to be foreign publications, or were mentioned in a foreign publication, so their renewal status probably isn't relevant.
  • FINAL-previously-published.ndjson: These books seem to have been previously published in the US. Part or all of the work may still be in copyright -- it depends on whether the previous publications were registered, and whether the registrations were renewed. This must be checked manually.
  • FINAL-not-books-proper.ndjson: These works are not "books proper" -- they're periodicals or something else. Their renewal status can only be determined once the rest of the CCE is made machine-readable.
  • FINAL-error.ndjson: These books couldn't be processed because of errors in the data.

These files represent the final work product. At this point you can take one or more of them and use them in your own research.

5-make-tsv.py

Generates some tab-separated files that summarize the main results:

  • FINAL-not-renewed.tsv - Works that needed a renewal to be under copyright today, but for which no renewal could be found.
  • FINAL-renewed.tsv - Works that needed a renewal to be under copyright today, and for which a renewal was (probably) found.
  • FINAL-foreign.tsv - Works that appear to be foreign publications, for which renewal is now irrelevant. (But any of these works that were renewed are matched to their renewals.)

Dispositions

Each JSON object in the FINAL- files has a disposition key that explains this script's final conclusion about its renewal status. Here are the possible dispositions:

  • Not renewed. - No renewal record was found, and we saw no complicating factors. The copyright on this book has almost certainly lapsed.

  • Renewed (date match). - A renewal was found which has a date match with the original registration. That's almost certainly the 'real' renewal, and if so, this book is still in copyright.

  • Probably renewed (author match). - There was no date match, but one of the renewals has the same author as the one mentioned in this registration. That's probably the 'real' renewal, and if so, this book is still in copyright.

  • Probably renewed (title match). - There was no date or author match, but one of the renewals has a similar title to the one mentioned in the registration. That's probably the 'real' renewal, and if so, this book is still in copyright.

  • Possibly renewed, but none of these renewals seem like a good match. - One or more renewals was found based on the registration ID, but the other data doesn't match. Since renewal IDs were reused over time, this may or may not mean that this particular publication had its copyright renewed. It needs to be checked manually.

  • Foreign publication. - There's strong evidence that this work is a foreign publication. If so, then its copyright was restored by treaty and the presence or absence of a renewal is irrelevant.

  • Possible foreign publication -- mentioned in a registration for a likely foreign publication. - This work was mentioned in the registration record for a foreign publication. This may mean that it, itself, is a foreign publication. It needs to be checked manually.

  • Classified with parent. - This work was grouped beneath another registration, and the parent registration was removed from consideration -- probably because it was post-1963 or because it was a foreign work. In the absence of strong evidence to the contrary, all of its children were also removed from consideration, but they were given a different disposition in case you want to check them manually.

  • Not a book proper. - This registration is for something other than a 'book proper' -- a pamphlet or serial, for instance. These can be renewed and can fall into the public domain, just like books, but we don't have complete data for them, so we can't draw any conclusions about them, and they're excluded.

  • Published before cutoff year. - This registration happened more than 95 years ago, so the question of renewal is moot -- the copyright has expired.

  • Published after cutoff year. - This registration happened after 1963, so the question of renewal is moot -- the copyright was renewed automatically.

  • Error. - The data associated with this registration was missing essential information, and it couldn't be processed.

You can’t perform that action at this time.