/
CHANGES
330 lines (267 loc) · 16.6 KB
/
CHANGES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
CHANGES
0.10.24 (April 22, 2024)
- The `heading` column in the database's author-book many to many table was being ignored by much of our code. The result was that multiple authors were being listed in alphabetical order. now, the heading column is used and the first sort column for the authors of a book, and the authors other than the first author are have heading=2 (instead of the default `heading=1`) set on initial metadata load. The cataloguer can reset the heading numbers, but does not wish the order of authors other than the "main" author to be tracked in the database.
0.10.23 (March 15, 2024)
- fixed a reversion in 0.10.10 that made author name matching case sensitive.
0.10.22 (February 26, 2024)
- credits should be replaced, not appended
- added test for credit replacement
- fixed tests for current db
- 0.10.21 was not deployed
0.10.20 (January 12, 2024)
- add credit cleaning to GutenbergDatabaseDublinCore. required for autocat3 https://github.com/gutenbergtools/autocat3/issues/116
- refactor credit cleaning
- refine pretty_title truncation to always add … to truncated titles. improves #33
- change no-gutenberg-id log message from warning to info, improve wording fixes #35
0.10.19 (December 7, 2023)
- add media type for `.zip`
- add aliases to metadata parser for `author of introduction` and `author of afterward'.
- fix off by one error in rendering of BCE dates: 1 BCE is represented as 0.
- fix metadata parser not to raise an error only because there's no ebook number - a title suffices.
- ensures no markup in 'language attributes. this was crashing ebookmaker.
- tests added for DCMIType
- improves title truncation in dc.make_pretty_title(). fixes #33
0.10.18 (March 14, 2023)
- add handler for 'original publication' in pg header. This field is not parsed.
0.10.17 (March 7, 2023)
- handle $c and $v title subfields
- include subfield marker removal in GutenbergGlobals.insert_breaks
- normalize capitalization of "eBook" in an error message
0.10.16 (February 15, 2023)
- dc.update_date initialized to datetime.date.min
0.10.15 (February 1, 2023)
- libgutenberg is not compatible with SQLAlchemy 2.0; added a version restriction
- in logging tests, clean up the open log file
- fileinfo didn't work properly in non-pg installations, which depended on symlinks in the filesystem. refactored the file path massaging when storing filenames.
- curly quotes in titles are straightened.
- store subtitle in a marc subfield
- pubinfo was not being reconstituted when loaded from the database
- add support for place in pubinfo
- add spaces after tight commas in rendered pubinfo string
- update json support to version 3.03 of workflow
- remove unused DublinCore.PARSEABLE_EXTENSIONS
- added `update_date` property to DublinCore
- removed code that was adding update date to credit.
- strip updates from db credits.
0.10.13 (December 26, 2022)
- load_from_pgheader now treats allows any plural metadata keywords. ending 's' of a keyword it stripped so that "Editors" is treated the same way "Editor" is.
- refactored and delinted DublinCore
- added tests for translator handling
0.10.10 (December 14, 2022)
- fix undefined variable in pub info handler
- matching of author names with the data base is now stricter - matches may not match partial names.
- delint DublinCore and DublinCoreMapping
0.10.9 (November 28, 2022)
- don't allow creators to be named ""
- fix tests so they don't break so easily when database is updated.
0.10.8 (October 23, 2022)
- dc.credit is doing double duty - it's storing update notices as well as production credits. so updates need to be more sophisticated - updates can be added, credits are fixed. This fixes a problem where credits pulled from the source file are not included in the credites metadata.
0.10.7 (October 7, 2022)
- tweak display of marc fields. "United States: D. Appleton and Company, 1918." instead of "United States :D. Appleton and Company ,1918."
- update tests
- make sure all scanned metadata is normal form composed.
0.10.5 (October 1, 2022)
- fixed bug in pubinfo.__str__ revealed by bugfix in ebookmaker renderer.
0.10.4 (September 30, 2022)
- don't skip last metadata item when it is followed immediately by the START marker
- add `produced by` as a synonym for `credit`
- add `self_closing` parameter to `GutenbergGlobals.insert_breaks()`
- dc.load_from_database() was not loading dc attributes corresponding to marc attributes Support for this has now been added.
- since dc.pubinfo is never False, dc.save() tried (in vain) to save empty pubinfos. No longer.
0.10.3 (September 15, 2022)
- fix bug where list attributes to set were not checked for empty text items
0.10.2 (September 14, 2022)
- fix bug that wiped authorlist if an updated dc object was saved to the DB
- fix bug that failed the marc formated display when there is a pubinfo without a country
- refactor handle_languages so code can be used outside the dc object
- updated lxml dependency
0.10.1 (September 12, 2022)
- fixed bug where metadata upload failed if pubinfo was missing the country.
0.10.0 (July 12, 2022)
- breaking change! moved notifier to be a Logger global because loggers are global
- revised Logger to avoid duplicate logging. Now, every time Logger.setup() is called, at most 3 handlers are set, a StreamHandler, a base_logger, and a ebook-specific logger.
0.9.3 (June 17, 2022)
- remove duplicate log handlers when logging to console
0.9.0 (June 16, 2022)
- tested with Python 3.7 and 3.8; Python 3.6 no longer supported
- added a metadata handle for "created:" fields
- turns out we didn't really want to clear the log handlers
0.8.16 (February 4, 2022)
- add HTML5_DOCTYPE to GutenbergGlobals
- add epub prefix for NSMAP
0.8.15 (January 20, 2022)
- find gutenberg header in html even if html header is long
- catch exception when a non-existent language code is used
- fix exception with empty file
- remove double spaces from names
- add mime types for font files
- update lxml
- delint
- fix error in 0.8.14
0.8.13 (November 16, 2021)
- handle errors in store_file_in_database due to dns errors
0.8.12 (November 5, 2021)
- Attributes should not contain <CR> in HTML5. so we adjusted the code method makes dc meta tags to escape <CR>, <LF> and any combination thereof with the numeric entity 
0.8.11 (September 28, 2021)
- fix issue of polymorphism in dc.languages. Without a db, it's a list of structs; with a db, its a related collection.
- ebookmaker will no longer ignore xml:lang or DC meta attributes
- fix windows path comparison - ebookmaker will behave properly when input file is in outputdir
- delint
0.8.10 (September 7, 2021)
- fixed bug where dc.project_gutenberg_id(ebook) was not handling null ebook
- added session management decorator to methods in GutenbergFiles
0.8.9 (September 3, 2021)
- added a session management decorator to DBUtils so that functions that get a new session also close them. Rebuilds were failing after 75 books.
0.8.8 (September 3, 2021)
- fixed missing log method
- fixed import failures when psycopg2 not present
- skip more tests when psycopg2 not present
0.8.7 (September 2, 2021)
- change xhtml/RDFa style meta elements in dc.to_html()min to of HTML5 style meta elements.
- change psycopg2 WARNING to an INFO
- deleted redundant log handler. Log messages were printing twice
- set encoding for python files to UTF8; this was causing an encoding error in data sent to db.
- fixed bug in is_same_path
0.8.6 (August 25, 2021)
- fix formatting of MARC attributes. Previously the marc formatter assumed a word boundary after the subfield indicator, but that's not aligned with the rest of the world. We have a few attribute fields with dollar amounts, but in every case the char after the $ is a digit. also # means [space] in the MARC docs. Ignore that $ means \1F in MARC docs, we'll just use $, until we make real MARC files.
0.8.5 (August 24, 2021)
- don't skip zip files
- remove debugging statements
- cache the filetypes and compressions loaded from db
- improved test cleanup
0.8.3 (August 23, 2021)
- fixed problem with guess_encoding. In reproducing the regexes I didn't realize that re.match != php's regexp match.
- the handling of updatemode was backwards. While correcting this, I made the logic in save() much clearer
0.8.2 (August 20, 2021)
- bugfix for db loading after file parsing
- bugfix update date
- silent update window changed to 14 days from 28
- bad json logging escalated to critical
- delint
0.8.1 (August 18, 2021)
Bugfixes - 0.8.0 tagged but never released.
- removed debugging code
- fixed PubInfo.__bool__
- fixed another release_date null test
- added subtitle handling
- workflow means files are all always utf8
- orm-based support for putting files in the database has been refactored out of DublinCoreMapping; libgutenberg is now used for non-generated files (from ebookconverter/FileInfo.py) , too
- to remove an ebook with files need to remove files first, because of CASCADE settings in db schema
- added metadata updates.
- update notes are applied only after more than 28 days (can be changed) after release date
- update notes are stored as marc 508 attribute along with production notes
- if a note is supplied in the CREDIT field of the workflow metadata, it is used as the text of the note, without an added date, so WW should enter date if desired.
- if no CREDIT note, then then the note will be "Updated: TODA-YS-DT"
- should also note that there on initial entry, the CREDIT note is sved without additions.
- and that for updates, the only other data currently looked at is the NOTIFY field; all the notify entries are stored in a non-public json file.
- added count_files - handy for tests
- removed support for old file structure and bitcollider
- switched to using refactored GutenbergFiles code for file/db work
- stop cataloging dot files
- added a way to plug a notification handler keyed on the 'critical' loglevel into the Logger.
- added the DublinCore.PGDCObject which can be used without a database for non-database DC stuff
- added a notification handler that sends CRITICAL log entries to a configurable callable
0.8.0 (August 3, 2021)
To be able to use ORM for EbookConverter, we needed to add code code that saves the DublinCoreObject to the database. most of the functions of "autocat.php" which used a text pipe from FileInfo to capture metadata from the dc loaders, are now handled by DublinCoreObject
To support the new publication workflows, we have added a new way to do the initial ingest of metadata from a json file created by the workflow tool. An example json file can be found in the tests directory.
- added a test for the dc header loader, saver, and deleter
- refactored ROLES and LANGS into GutenbergGlobals and made them dicts
- added 'alt_title' attribute to DublinCore objects as this is already handled by autocat.php
- added utility functions for ORM objects in the new DBUtils module. these methods support an optional session param
- ebook_exists(ebook)
- is_not_text(ebook)
- remove_ebook(ebook)
- author_exists(author)
- remove_author(author)
- filetype_books(filetype)
- get_lang(language)
- last_ebook()
- recent_books(interval)
- top_books()
- added save() and delete() methods to DublinCoreObjects. This work was done primarily mostly by the autocat.php script, which interfaced awkwardly with python.
- improve authorname ingest- ", Jr." no longer causes a spurious author
- fixed deletion cascade in m2m book relations
- dc.load_from_database no longer overwrites data loaded from headers and metadata files
- added the following attributes to Dublin Core objects to support ingest from workflow
- pubinfo - an object with publisher name, year and country
- credit - the producer credit line
- added the following attributes to Gutenberg Dublin Core objects to support ingest from workflow
- scan_urls - a list of archive urls
- request_key - key for linking to clearance db
- added save methods for new information items from workflow
- credit - uses the existing 508 attribute (Creation / Production Credits Note)
- scan_urls - uses the newly defined (repeatable) local attribute 904 (Archived Scan URL).
- request_key - uses the local attribute 905 (PGLAF Clearance Number).
- pubinfo
- enters a MARC subfielded string in the 240 field
- adds the first year in the local attribute 906 (First Publication Year)
- adds the 2-letter country code to the local attribute 907 (Publication Country)
- removed publisher from contributor roles because they should go in pubinfo
- fixed a bug which caused ebookmaker to choke when not connected to PG database
- start using pycountry so we have up-to-date language and country codes
- changed default value for DublinCore.release_date to datetime.date.min because otherwise the ORMs autocommits fail as release_date has a NOT_NULL constraint
- added DublinCore.PGDCObject so that ebookmaker can load an object that works with or without a backing database. Note that ebookmaker was broken with 0.7.2 without the database
- tests now check if the database is connected before running (and failing) tests of the database and issue appropriate warnings
- delint
0.7.2 (July 19, 2021)
- test_orm now tests the orm
- removed stray savepoint that caused file addition to fail
- added tests for ORM add/delete file
- added a setter for dc.project_gutenberg_id to populate dc.canonical_url and dc.is_format_of
- added warning for bad filetypes
- updated logger syntax
- accommodate output dirs to be named cache1, cache2, etc for saving in db
- delint
0.7.1 (May 25, 2021)
- added GutenbergDublinCore.canonical_url to support https canonical urls distinct from the http URIs we're keeping so as not to break RDF links.
- updated RDFa style in html head elements from archaic to less archaic (DOCTYPE and schema declarations)
- use dublin core elements instead of dcterms where appropriate for modern metadata.
0.7.0 (April 13, 2021)
- fixed a spelling mistake
- added an is_audiobook method
- added a back relationship
- quelled scalar_subquery coercion warnings
0.7.0a (November 24, 2020)
This version is a significant change from previous version, adding an object relational mapping for the Project Gutenberg database and an ORM-based "Dublin Core" object, "DublinCoreObject" which replicates the "GutenbergDatabaseDublinCore" for most uses. because it is "lazy" - only making queries when needed - it may be much faster in many use cases, and about 20% slower when all attributes are accessed. This release is considered "alpha" with regard to the new objects, but should be production worthy and need no changes when the ORM based objects are not used.
- add Models.py and DublinCoreMapping modules.
- test suite greatly expanded, includes a comparative load test
-
0.6.8 (November 2, 2020)
- add SVG to mediatypes
0.6.7 (October 25, 2020)
- fix GutenbergGlobals.archive_dir with a special case for single digit book numbers
0.6.6 (October 16, 2020)
- add tests for the DC object. This is preparing for a new version, with ORM
0.6.5 (September 1, 2020)
- fix logger so that it handles non int ebook
0.6.3 (April 17, 2020)
- catch OSError when importing cairocffi
0.6.2 (January 18, 2020)
- add rollback for missing book id exception.
0.6.1 (January 15, 2020)
- added exception handling in store_file_in_database for missing book id.
0.6.0 (January 7, 2020)
- added a 'first_letter' attribute to authors from the database to support linking to authorlists in the new version of PG website
0.5.1 (January 4, 2020)
- changed MediaTypes so that exceptions aren't raised when an extension is not known or present
- added tests for MediaTypes
0.5.0 (October 22, 2019)
- fixed mimetype extensions that should't have dots
- added Project Gutenberg branding to cover
0.4.1 (October 8, 2019)
- fixed bug where creators with marcrel='aut' were not included as authors on the cover
0.4.0 (September 16, 2019)
- added a constant PG_URL with the https url so we can keep the http version for RDF and XML namespacing only. This has the effect of changing dc file URLs to https. For some discussion of https in RDF, see https://www.w3.org/blog/2016/05/https-and-the-semantic-weblinked-data/
0.3.2 (May 10, 2019)
- fixed a missing import for internationalization. This code relied on installation of "_" into the context's builtins. If you are installing a translator into builtins, you'll also need to install it into the gettext module.
0.3.1 (Feb 15, 2019)
- added mimetypes for assorted files found in PG ebooks
0.3.0 (Feb 12, 2019)
- added Cover
- added convenience methods on DublinCore
- handling of optional dependencies psycopg and cairocffi
0.2.0 (Jan 10, 2019)
- setup.py now uses setuptools to facilitate use in more complex packaging environments
- added a borg class to provide global option across PG python classes
0.1.6 (Apr 21, 2017)
- earliest version in version control