# BL: Notebook to assign the title - NLP - alias mapping, and copy the data and images into the correct structure

We have received the BL data, but we now have to verify what we received and the actual years spanned, decide on the working title we will assign to each newspaper, and assign a unique alias to each of them.

In addition, we have to ensure that we have a correct title -> NLPs mapping since the files are organized by NLP in the file structure.

This will allow us to reorganise the existing data to reflect this organization, and ensure that the expected file structure (directory tree) is in place for our later processing needs.

The file `/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BL-title-alias-mapping.csv` contains the current list of newspapers, NLPs, assigned aliases and actual identified years.
It can thus be used to check this 1-1 mapping, and can then be imported back into Google Sheets so that all sources contain the correct data.
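
As a minimal sketch of the check described above, assuming the mapping is loaded into a DataFrame with one row per (title, NLP) pair: the column names `title`, `NLP` and `alias` and the toy rows are illustrative only, and are not the actual headers of the CSV. A title may own several NLPs, but each NLP and each title should be linked to exactly one alias.

```python
import pandas as pd

# Toy rows standing in for the mapping file; the column names are assumptions.
mapping_df = pd.DataFrame(
    {
        "title": ["Aberdeen Press and Journal", "Aberdeen Press and Journal", "Bargoed Journal"],
        "NLP": ["0000031", "0000032", "0003104"],
        "alias": ["ANJO", "ANJO", "BGJO"],
    }
)


def find_mapping_conflicts(df: pd.DataFrame) -> dict:
    """Return the NLPs and titles that are linked to more than one alias."""
    aliases_per_nlp = df.groupby("NLP")["alias"].nunique()
    aliases_per_title = df.groupby("title")["alias"].nunique()
    return {
        "nlps_with_several_aliases": sorted(aliases_per_nlp[aliases_per_nlp > 1].index),
        "titles_with_several_aliases": sorted(aliases_per_title[aliases_per_title > 1].index),
    }


conflicts = find_mapping_conflicts(mapping_df)
print(conflicts)
```

Returning the offending NLPs and titles, rather than a single boolean, makes it easier to correct the spreadsheet by hand.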

In [1]:
from impresso_essentials.utils import ALL_MEDIA, PARTNER_TO_MEDIA
import re
import pandas as pd
import os
import shutil
import numpy as np
from tqdm import tqdm
from datetime import datetime
from impresso_essentials.utils import chunk
from ast import literal_eval
import dask.bag as db
import json
from dask.diagnostics import ProgressBar
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup, element
from PIL import Image, ImageDraw
from IPython.display import display
from text_preparation.importers.mets_alto import alto
from text_preparation.utils import rescale_coords



## 1. Assigning unique aliases for each of the working titles in the collection

The aliases were generated by querying ChatGPT, after providing it with a list of the aliases already present in the BL data (available for only some titles).

### The final mapping of working title to alias for the new BL data

We will of course ensure that all of these aliases are unique, both with respect to each other and to the list of existing Impresso aliases.

In [28]:
bl_aliases = {
    'Aberdeen Press and Journal': 'ANJO',
    'Alston Herald and East Cumberland Advertiser': 'AHEC',
    "Baldwin's London Weekly Journal": 'BLWJ',
    'Baner ac Amserau Cymru': 'BNER',
    'Bargoed Journal': 'BGJO',
    'Barnsley Telephone': 'BTEP',
    "Bell's Family Newspaper": 'BFNP',
    "Bell's News": 'BELL',
    "Bell's Penny Dispatch": 'BPDH',
    "Berrow's Worcester Journal": 'WOJL',
    "Berthold's Political Handkerchief": 'BPHF',
    'Birmingham Daily Post': 'BDPO',
    'Blandford Weekly News': 'BWNW',
    'Bradford Observer': 'BROR',
    'Bridgend Chronicle': 'BGCH',
    'Bridlington and Quay Gazette': 'BQGA',
    'Bridport, Beaminster, and Lyme Regis Telegram': 'BBLT',
    'Brief': 'BRIF',
    'Brighouse & Rastrick Gazette': 'BRGA',
    'Brighton Patriot': 'BRPT',
    'British Army Despatch': 'BRAD',
    'British Mercury or Wednesday Evening Post': 'BRMW',
    'British Miner and General Newsman': 'BRMG',
    'Caledonian Mercury': 'CNMR',
    "Charles Knight's Town & Country Newspaper": 'CKTC',
    'Chelsea & Pimlico Advertiser': 'CPAD',
    'Cheshire Observer': 'CHOR',
    'Christian Times': 'CHTI',
    'City of London Trade Protection Circular': 'CLTP',
    "Cleave's Weekly Police Gazette": 'CWPG',
    "Cobbett's Evening Post": 'CBEP',
    "Cobbett's Weekly Political Register": 'CWPR',
    'Colored News': 'CLNW',
    'Common Sense': 'CMSN',
    'Cradley Heath & Stourbridge Observer': 'CHSO',
    'Daily Gazette For Middlesbrough': 'DGMH',
    'Daily News': 'DNLN',
    'Daily Politician': 'DPLT',
    'Darlington & Richmond Herald': 'DRHE',
    'Denton and Haughton Examiner': 'DHEX',
    'Derby Mercury': 'DYMR',
    'Dewsbury Chronicle and West Riding Advertiser': 'DCWA',
    'Dorset County Express and Agricultural Gazette': 'DCEA',
    "Douglas Jerrold's Weekly Newspaper": 'DJWN',
    "Duckett's Dispatch": 'DDIS',
    'Dundee Courier': 'DUCR',
    'East London Advertiser': 'ELAD',
    'East Wind': 'EAWN',
    'Exeter Flying Post': 'TEFP',
    'Finsbury Free Press': 'FFPR',
    "Fleming's British Farmers' Chronicle": 'FBFC',
    "Fleming's Weekly Express": 'FWEX',
    'Fonetic Nuz': 'FONU',
    "Francis's Metropolitan News": 'FMNW',
    "Freeman's Journal": 'FRJO',
    'Glasgow Courier': 'GLCO',
    "Glasgow Herald": "GWHD",
    "Glasgow Sentinel": "GLSE",
    "Golden Times": "GOTM",
    "Halifax Comet": "HLCM",
    "Hampshire Telegraph": "HPTE",
    "Haslingden Gazette": "HAGZ",
    "Hetherington's Twopenny Dispatch": "HTWD",
    "High Life in London": "HLLN",
    "Holt's Weekly Chronicle": "HWCH",
    "Hour": "HOUR",
    "Huddersfield Chronicle": "HUCE",
    "Hull Packet": "HLPA",
    "Illustrated Crystal Palace Gazette": "ICPG",
    "Illustrated London Life": "ILOL",
    "Illustrated Midland News": "IMNW",
    "Illustrated Sporting News and Theatrical and Musical Review": "ISNT",
    "Illustrated Times 1853": "ILT53",
    "Illustrated Weekly Times": "ILWT",
    "Irvine Express": "IREX",
    "Isle of Wight Observer": "IWOR",
    "Islington Times": "ISTM",
    "Jewish Record": "JWRC",
    "Johnson's Sunday Monitor": "JSMN",
    "Kenilworth Advertiser": "KEAD",
    "Lancaster Standard and County Advertiser": "LSCA",
    "Leeds Intelligencer": "LSIR",
    "Leicester Chronicle": "LECH",
    "Liverpool Mercury": "LVMR",
    "Liverpool Standard and General Commercial Advertiser": "LSGA",
    "Liverpool Weekly Courier": "LWC",
    "Lloyd's Companion to the Penny Sunday Times and Peoples' Police Gazette": "LCPP",
    "Lloyd's Weekly Newspaper": "LINP",
    "London & Provincial News and General Advertiser": "LPNGA",
    "London Dispatch": "LNDH",
    "London Halfpenny Newspaper": "LHPN",
    "London Journal and General Advertiser for Town and Country": "LJGA",
    "London Life": "LNLF",
    "London Moderator and National Adviser": "LMNA",
    "London Railway Newspaper": "LRNW",
    "The London News Letter and Price Current": "LNPC",
    "Manchester Examiner": "MEXM",
    "Manchester Times": "MRTM",
    "Mirror of the Times": "MRTT",
    "Morning Chronicle": "MCLN",
    "Morning Herald": "MRHD",
    "Morning Post": "MOPT",
    "Nantwich, Sandbach & Crewe Star": "NSCS",
    "National Register": "NTRG",
    "Nelson Chronicle, Colne Observer and Clitheroe Division News": "NCCO",
    "New Court Gazette": "NCGA",
    "New Times": "NWTM",
    "Nonconformist Elector": "NCEF",
    "North London Record": "NLRD",
    "North Wales Chronicle": "NRWC",
    "Northern Echo": "NREC",
    "Northern Liberator": "NRLR",
    "Northern Star and Leeds General Advertiser": "NRSR",
    "Northern Weekly Gazette": "NWGZ",
    "Old England": "OLEN",
    "Orr's Kentish Journal": "OKJL",
    "Oxford Journal": "JOJL",
    "Passing Events": "PSEV",
    "Pen and Pencil": "PNPC",
    "Penistone, Stocksbridge and Hoyland Express": "PSHE",
    "Pictorial Times": "PICT",
    "Picture Times": "PITM",
    "Pierce Egan's Life in London, and Sporting Guide": "PELL",
    "Poole Telegram": "POTG",
    "Preston Pilot": "PRPL",
    "Reynold's Newspaper": "RDNP",
    "Ripon Observer": "RIOB",
    "Royal Cornwall Gazette": "COGE",
    "Royal York": "RYRK",
    "Runcorn Examiner": "RUEX",
    "Sainsbury's Weekly Register and Advertising Journal": "SWRJ",
    "Sheffield Public Advertiser": "SHPA",
    "South London Advertiser": "SLAD",
    "South London Times and Lambeth Observer": "SLTL",
    "Southern Star": "SNSR",
    "Southwark Mercury": "SWME",
    "Sport": "SPRT",
    "Stalybridge Examiner": "STEX",
    "Stockton Herald, South Durham and Cleveland Advertiser": "SHSD",
    "Stretford and Urmston Examiner": "STUE",
    "Sunday Gazette": "SUGA",
    "Sunday News": "SUNW",
    "Surrey & Middlesex Standard": "SMSD",
    "Surrey Herald and County Advertiser": "SHCA",
    "Surrey Mercury": "SURY",
    "Swansea and Glamorgan Herald": "SGHL",
    "Swansea Journal and South Wales Liberal": "SJWL",
    "Thacker's Overland News for India and the Colonies": "TONI",
    "The Age (London)": "TALN",
    "The Age 1852": "AGE52",
    "The Agricultural Advertiser and Tenant-Farmers' Advocate": "AATA",
    "The Albion": "ALBN",
    "The Albion and the Star": "ALST",
    "The Anti-Gallican Monitor": "AGMO",
    "The Argus, or, Broad-sheet of the Empire": "ARGB",
    "The Atherstone, Nuneaton, and Warwickshire Times": "ANWT",
    "The Aurora Borealis": "AUBO",
    "The Ballot": "BLOT",
    "The Barrow Herald and Furness Advertiser": "BHFA",
    "The Bath Chronicle": "BHCH",
    "The Beacon (Edinburgh)": "BCE1",
    "The Beacon (London)": "BCL2",
    "The Bee-Hive": "BEHI",
    "The Belfast News-Letter": "BNWL",
    "The Birkenhead News": "BKNW",
    "The Blackburn Standard": "BLSD",
    "The Blackpool Herald": "BLHD",
    "The Blandford and Wimborne Telegram": "BWTE",
    "The Borough of Greenwich Free Press": "BGFP",
    "The Bristol Mercury": "BLMY",
    "The British Banner": "BRBN",
    "The British Emancipator": "BREM",
    "The British Ensign": "BREN",
    "The British Liberator": "BRLB",
    "The British Luminary": "BRLU",
    "The British Neptune": "BRNP",
    "The British Press": "BRPR",
    "The British Standard": "BRST",
    "The British Statesman": "BRSS",
    "The Brunswick, or, True Blue": "BRTB",
    "The Bury and Norwich Post": "BNPT",
    "The Cannock Chase Examiner": "CCEX",
    "The Censor or Satirical Times": "CSTT",
    "The Central Glamorgan Gazette": "CGGA",
    "The Champion (London)": "CHPL",
    "The Champion": "CHPN",
    "The Charter": "CHTR",
    "The Chartist": "CHTT",
    "The City Chronicle": "CICN",
    "The Civil & Military Gazette": "CMGA",
    "The Clerkenwell Dial and Finsbury Advertiser": "CLDF",
    "The Colonist and Commercial Weekly Advertiser": "CCWA",
    "The Commercial Chronicle": "CMCH",
    "The Constitution": "CNSN",
    "The Cosmopolitan": "CSMP",
    "The Cotton Factory Times": "CFTM",
    "The Courier": "COUR",
    "The Court Gazette and Fashionable Guide": "CGFG",
    "The Crim. Con. Gazette": "CCGZ",
    "The Crown": "CRWN",
    "The Daily Director and Entr'acte": "DDEN",
    "The Day": "TDAY",
    "The Dewsbury Chronicle and West Riding Advertiser": "DCWR",
    "The Dial": "TDIA",
    "The Dissenter": "DSNR",
    "The East Riding Telegraph": "ERTG",
    "The Eastern Star": "EAST",
    "The Emigrant and the Colonial Advocate": "ECLA",
    "The English Chronicle and Whitehall Evening Post": "ECWP",
    "The Englishman": "ENGL",
    "The Era": "ERLN",
    "The Essex Standard": "ESSD",
    "The Evening Star": "EVST",
    "The Evening Times (London)": "EVTL",
    "The Evening Times 1825": "EVT25",
    "The Examiner": "EXLN",
    "The Express": "EXPR",
    "The Forest of Dean Examiner": "FODE",
    "The General Evening Post": "GEVP",
    "The Glasgow Chronicle": "GLCH",
    "The Graphic": "GCLN",
    "The Hammersmith Advertiser": "HMSA",
    "The Hampshire Advertiser": "SOHD",
    "The Hebrew Observer": "HBOV",
    "The Herald of Wales": "HOWL",
    "The Illustrated Newspaper": "ILNP",
    "The Illustrated Police News": "HPNW",
    "The Imperial Weekly Gazette": "IWGZ",
    "The Instructor and Select Weekly Advertiser": "ISWA",
    "The Ipswich Journal": "IPJO",
    "The Isle of Man Times": "IMTS",
    "The Kingsland Times and General Advertiser": "KTGA",
    "The Lady's Newspaper and Pictorial Times": "LNPT",
    "The Lady's Own Paper": "LOPA",
    "The Lancaster Gazette": "LAGER",
    "The Lancaster Herald and Town and County Advertiser": "LHTC",
    "The Leeds Mercury": "LEMR",
    "The Little Times": "LTIM",
    "The Liverpool Albion": "LIAL",
    "The Liverpool Chronicle": "LIVC",
    "The Liverpool Telegraph": "LITG",
    "The London & China Herald": "LCHH",
    "The London and Liverpool Advertiser": "LLAD",
    "The London and Scottish Review": "LSCR",
    "The London Chronicle": "LNCH",
    "The London Chronicle and Country Record": "LCCR",
    "The London Daily Guide and Stranger's Companion": "LDGS",
    "The London Evening Post": "LEVP",
    "The London Free Press": "LFPR",
    "The London Illustrated Weekly": "LIWL",
    "The London Journal and Pioneer Newspaper": "LJPN",
    "The London Mercury": "LNM1",
    "The London Mercury 1836": "LNM2",
    "The London Mercury 1847": "LNM3",
    "The London Mirror": "LONM",
    "The London Packet and New Lloyd's Evening Post": "LPNL",
    "The London Phalanx": "LOPH",
    "The London Scotsman": "LSCT",
    "The London Telegraph": "LTLG",
    "The London Weekly Investigator": "LWI",
    "The Man about Town": "MATN",
    "The Manchester Examiner": "MEXA",
    "The Metropolitan": "MTPN",
    "The Midland Examiner and Wolverhampton Times": "MEWT",
    "The Monthly Times": "MNTM",
    "The Morning Gazette": "MOGA",
    "The Morning Mail": "MOMA",
    "The Nation": "NATN",
    "The National": "NTNL",
    "The National Protector": "NTPR",
    "The National Standard": "NTSD",
    "The New Globe": "NGLB",
    "The New Weekly True Sun": "NWTS",
    "The Newcastle Courant": "NECT",
    "The News": "TNEW",
    "The North Cumberland Reformer": "NCRF",
    "The North Londoner": "NLON",
    "The North-West London Times": "NWLT",
    "The Northern Daily Times": "NDTM",
    "The Northern Guardian": "NOGU",
    "The Nottinghamshire Guardian": "NOGN",
    "The Nuneaton Times": "NUNT",
    "The Observer of the Times": "OBTM",
    "The Odd Fellow": "ODFW",
    "The Operative": "OPTE",
    "The Oracle and the Daily Advertiser": "ORDA",
    "The Paddington Advertiser": "PADV",
    "The Pall Mall Gazette": "PMGZ",
    "The Palladium": "PLDM",
    "The Patriot": "PATR",
    "The People's Hue and Cry or Weekly Police Register": "PHCW",
    "The People's Paper": "PPLP",
    "The Pilot": "PLTO",
    "The Pioneer and Weekly Record of Movements": "PWRM",
    "The Planet": "PLNT",
    "The Political Letter": "POLL",
    "The Political Observer": "PLOB",
    "The Pontypridd District Herald": "PDHD",
    "The Poor Man's Guardian": "PMGU",
    "The Porcupine": "PORC",
    "The Potteries Examiner": "POEX",
    "The Press": "TPRS",
    "The Preston Chronicle and Lancashire Advertiser": "PNCH",
    "The Public Cause": "PUCA",
    "The Radical": "RADL",
    "The Railway Bell and London Advertiser": "RBLD",
    "The Reformer": "REFM",
    "The Representative": "REPR",
    "The Saint James's Chronicle": "SJCH",
    "The Satirist; or, the Censor of the Times": "SATR",
    "The Sheffield Independent": "SHIN",
    "The Shropshire Examiner": "SHRE",
    "The Slaithwaite Guardian and Colne Valley News": "SGCV",
    "The South Staffordshire Examiner": "SSEX",
    "The St. Helens Examiner, and Prescot Weekly News": "SHEP",
    "The Standard": "SDLN",
    "The Standard of Freedom": "SOFR",
    "The Star": "STGY",
    "The Stockton Examiner and South Durham and North Yorkshire Herald": "SESD",
    "The Sun": "TSUN",
    "The Sun & Central Press": "SCPR",
    "The Sunday Evening Globe": "SEGL",
    "The Sunday Morning Herald": "SMHE",
    "The Sussex & Surrey Chronicle": "SSCH",
    "The Tamworth Miners' Examiner and Working Men's Journal": "TMEW",
    "The Tichborne Gazette": "TIGA",
    "The Tichborne News and Anti-Oppression Journal": "TNAJ",
    "The Tower Hamlets Mail": "THML",
    "The Trades' Free Press": "TFPR",
    "The True Briton": "TRBT",
    "The True Sun": "TRSN",
    "The Union": "TUNI",
    "The Universe": "UNIV",
    "The Verulam": "VERL",
    "The Vindicator": "VIND",
    "The Warrington Examiner": "WAEX",
    "The Warwickshire Herald": "WAHD",
    "The Watchman": "WTCH",
    "The Week's News": "WKNW",
    "The Weekly Advertiser": "WKAD",
    "The Weekly Chronicle": "WKCH",
    "The Weekly Echo": "WKEC",
    "The Weekly Globe": "WKGB",
    "The Weekly Independent": "WKIN",
    "The Weekly Intelligence": "WKIT",
    "The Weekly Journal": "WKJL",
    "The Weekly Mail": "WKML",
    "The Weekly Review": "WKRV",
    "The Weekly Star and Bell's News": "WSBN",
    "The Wellington Gazette and Military Chronicle": "WGMC",
    "The West End News": "WENW",
    "The West London Times": "WLTM",
    "The Westminster Times": "WMTM",
    "The Weymouth Telegram": "WMTG",
    "The World": "WRLD",
    "The World and Fashionable Sunday Chronicle": "WFSC",
    "The York Herald": "YOHD",
    "Town & Country Daily Newspaper": "TCDN",
    "Town and Country Advertiser": "TCAA",
    "Town Talk": "TTLK",
    "Town Talk 1822": "TTK22",
    "Trade Protection Record": "TPRD",
    "Weekly Times": "WKTN",
    "Weekly True Sun": "WKTS",
    "West Londoner and Select Advertiser for the Borough of Marylebone": "WLSA",
    "Western Mail": "WMCF",
    "Westminster Journal and Old British Spy": "WJBS",
    "Whitehall Evening Post": "WHEP",
    "Widnes Examiner": "WDEX",
    "Wooler's British Gazette": "WBGZ",
    "Wrexham Advertiser": "WRWA",
    "Y Genedl Gymreig": "GNDL",
    "Y Goleuad": "GLAD",
    "York House Papers": "YOHP"
}

In [3]:
len(bl_aliases)

374

In [4]:
# ensuring that the list of aliases provided by ChatGPT is indeed unique.
len(set(bl_aliases.values())) == len(bl_aliases)

True

In [5]:
# also ensure that none of the new aliases are already in our collection
# (KNOWN_JOURNALS: the existing Impresso aliases, assumed to be defined earlier in the session)
any(j in bl_aliases.values() for j in KNOWN_JOURNALS)

False
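
The two checks above only return a boolean. If one of them ever fails, a small helper that names the offending aliases is more convenient. The following is a sketch under assumptions: the function name `alias_problems` and the toy inputs are hypothetical, and the "existing" aliases are placeholders rather than the real Impresso list.

```python
from collections import Counter


def alias_problems(new_aliases: dict, existing_aliases) -> dict:
    """List aliases used more than once and aliases that already exist."""
    counts = Counter(new_aliases.values())
    duplicated = sorted(alias for alias, n in counts.items() if n > 1)
    clashes = sorted(set(new_aliases.values()) & set(existing_aliases))
    return {"duplicated": duplicated, "already_in_use": clashes}


# Toy inputs: "BELL" is deliberately reused and "BRIF" deliberately clashes.
toy_new = {"Bell's News": "BELL", "Brief": "BRIF", "Hour": "BELL"}
toy_existing = ["AAAA", "BRIF"]  # placeholder values, not real Impresso aliases
report = alias_problems(toy_new, toy_existing)
print(report)
```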

The final list contains 374 unique titles.

We will have to reorganize the file structure so that the input data is organized into these groups.

Aberdeen Press and Journal	ANJO
Alston Herald and East Cumberland Advertiser	AHEC
Baldwin's London Weekly Journal	BLWJ
Baner ac Amserau Cymru	BNER
Bargoed Journal	BGJO
Barnsley Telephone	BTEP
Bell's Family Newspaper	BFNP
Bell's News	BELL
Bell's Penny Dispatch	BPDH
Berrow's Worcester Journal	WOJL
Berthold's Political Handkerchief	BPHF
Birmingham Daily Post	BDPO
Blandford Weekly News	BWNW
Bradford Observer	BROR
Bridgend Chronicle	BGCH
Bridlington and Quay Gazette	BQGA
Bridport, Beaminster, and Lyme Regis Telegram	BBLT
Brief	BRIF
Brighouse & Rastrick Gazette	BRGA
Brighton Patriot 	BRPT
British Army Despatch	BRAD
British Mercury or Wednesday Evening Post	BRMW
British Miner and General Newsman	BRMG
Caledonian Mercury	CNMR
Charles Knight's Town & Country Newspaper	CKTC
Chelsea & Pimlico Advertiser	CPAD
Cheshire Observer	CHOR
Christian Times 	CHTI
City of London Trade Protection Circular	CLTP
Cleave's Weekly Police Gazette	CWPG
Cobbett's Evening Post	CBEP
Cobbett's Weekly Political Register 	CWPR
Colored News	CLNW
Common Sense	CMSN
Cradley Heath & Stourbridge Observer	CHSO
Daily Gazette For Middlesbrough	DGMH
Daily News	DNLN
Daily Politician	DPLT
Darlington & Richmond Herald	DRHE
Denton and Haughton Examiner	DHEX
Derby Mercury	DYMR
Dewsbury Chronicle and West Riding Advertiser 	DCWA
Dorset County Express and Agricultural Gazette	DCEA
Douglas Jerrold's Weekly Newspaper	DJWN
Duckett's Dispatch	DDIS
Dundee Courier	DUCR
East London Advertiser	ELAD
East Wind	EAWN
Exeter Flying Post	TEFP
Finsbury Free Press	FFPR
Fleming's British Farmers' Chronicle	FBFC
Fleming's Weekly Express	FWEX
Fonetic Nuz	FONU
Francis's Metropolitan News	FMNW
Freeman's Journal	FRJO
Glasgow Courier	GLCO
Glasgow Herald	GWHD  ----------------
Glasgow Sentinel	GLSE
Golden Times	
Halifax Comet	
Hampshire Telegraph	HPTE
Haslingden Gazette	
Hetherington's Twopenny Dispatch	
High Life in London	
Holt's Weekly Chronicle	
Hour	
Huddersfield Chronicle	HUCE
Hull Packet	HLPA
Illustrated Crystal Palace Gazette	
Illustrated London Life	
Illustrated Midland News	
Illustrated Sporting News and Theatrical and Musical Review	
Illustrated Times 1853	
Illustrated Weekly Times	
Irvine Express	
Isle of Wight Observer	IWOR
Islington Times	
Jewish Record	
Johnson's Sunday Monitor	
Kenilworth Advertiser	
Lancaster Standard and County Advertiser	
Leeds Intelligencer	LSIR
Leicester Chronicle	LECH
Liverpool Mercury	LVMR
Liverpool Standard and General Commercial Advertiser	
Liverpool Weekly Courier	
Lloyd's Companion to the Penny Sunday Times and Peoples' Police Gazette	
Lloyd's Weekly Newspaper	LINP
London & Provincial News and General Advertiser	
London Dispatch	LNDH
London Halfpenny Newspaper	
London Journal and General Advertiser for Town and Country	
London Life	
London Moderator and National Adviser	
London Railway Newspaper	
The London News Letter and Price Current	
Manchester Examiner	
Manchester Times	MRTM
Mirror of the Times	
Morning Chronicle	MCLN
Morning Herald	
Morning Post	MOPT
Nantwich, Sandbach & Crewe Star	
National Register	
Nelson Chronicle, Colne Observer and Clitheroe Division News	
New Court Gazette	
New Times	
Nonconformist Elector	
North London Record	
North Wales Chronicle	NRWC
Northern Echo	NREC
Northern Liberator	NRLR
Northern Star and Leeds General Advertiser	NRSR
Northern Weekly Gazette	
Old England	
Orr's Kentish Journal	
Oxford Journal	JOJL
Passing Events	
Pen and Pencil	
Penistone, Stocksbridge and Hoyland Express	
Pictorial Times	
Picture Times	
Pierce Egan's Life in London, and Sporting Guide	
Poole Telegram	
Preston Pilot	
Reynold's Newspaper	RDNP
Ripon Observer	
Royal Cornwall Gazette	COGE
Royal York	
Runcorn Examiner	
Sainsbury's Weekly Register and Advertising Journal	
Sheffield Public Advertiser	
South London Advertiser	
South London Times and Lambeth Observer	
Southern Star	SNSR
Southwark Mercury	
Sport	
Stalybridge Examiner	
Stockton Herald, South Durham and Cleveland Advertiser	
Stretford and Urmston Examiner	
Sunday Gazette	
Sunday News	
Surrey & Middlesex Standard	
Surrey Herald and County Advertiser	
Surrey Mercury	
Swansea and Glamorgan Herald	
Swansea Journal and South Wales Liberal	
Thacker's Overland News for India and the Colonies	
The Age (London)	
The Age 1852	
The Agricultural Advertiser and Tenant-Farmers' Advocate	AATA
The Albion	
The Albion and the Star 	
The Anti-Gallican Monitor	
The Argus, or, Broad-sheet of the Empire	
The Atherstone, Nuneaton, and Warwickshire Times	
The Aurora Borealis	
The Ballot	
The Barrow Herald and Furness Advertiser	
The Bath Chronicle	BHCH
The Beacon (Edinburgh)	
The Beacon (London)	
The Bee-Hive	
The Belfast News-Letter	BNWL
The Birkenhead News	
The Blackburn Standard	BLSD
The Blackpool Herald	
The Blandford and Wimborne Telegram	
The Borough of Greenwich Free Press	
The Bristol Mercury	BLMY
The British Banner	
The British Emancipator	
The British Ensign	
The British Liberator	
The British Luminary	
The British Neptune	
The British Press	
The British Standard	
The British Statesman	
The Brunswick, or, True Blue	
The Bury and Norwich Post	BNPT
The Cannock Chase Examiner	
The Censor or Satirical Times	
The Central Glamorgan Gazette	----------------
The Champion (London)
The Champion 	CHPN
The Charter	CHTR
The Chartist	CHTT
The City Chronicle	
The Civil & Military Gazette	
The Clerkenwell Dial and Finsbury Advertiser	
The Colonist and Commercial Weekly Advertiser	
The Commercial Chronicle	
The Constitution	
The Cosmopolitan	
The Cotton Factory Times	
The Courier	
The Court Gazette and Fashionable Guide	
The Crim. Con. Gazette
The Crown	
The Daily Director and Entr'acte	
The Day	
The Dewsbury Chronicle and West Riding Advertiser	
The Dial	
The Dissenter	
The East Riding Telegraph	
The Eastern Star	
The Emigrant and the Colonial Advocate	
The English Chronicle and Whitehall Evening Post
The Englishman	
The Era	ERLN
The Essex Standard	ESSD
The Evening Star	
The Evening Times (London)	
The Evening Times 1825	
The Examiner	EXLN
The Express	
The Forest of Dean Examiner	
The General Evening Post	
The Glasgow Chronicle	
The Graphic	GCLN
The Hammersmith Advertiser	
The Hampshire Advertiser	SOHD
The Hebrew Observer	
The Herald of Wales	
The Illustrated Newspaper	
The Illustrated Police News	HPNW
The Imperial Weekly Gazette	
The Instructor and Select Weekly Advertiser	
The Ipswich Journal	IPJO
The Isle of Man Times	IMTS
The Kingsland Times and General Advertiser	
The Lady's Newspaper and Pictorial Times	
The Lady's Own Paper	
The Lancaster Gazette	LAGER
The Lancaster Herald and Town and County Advertiser 	
The Leeds Mercury	LEMR
The Little Times	
The Liverpool Albion	
The Liverpool Chronicle	
The Liverpool Telegraph	
The London & China Herald	
The London and Liverpool Advertiser	
The London and Scottish Review	
The London Chronicle	
The London Chronicle and Country Record	
The London Daily Guide and Stranger's Companion	
The London Evening Post 	
The London Free Press	
The London Illustrated Weekly	
The London Journal and Pioneer Newspaper	
The London Mercury	
The London Mercury 1836	
The London Mercury 1847	
The London Mirror	
The London Packet and New Lloyd's Evening Post	
The London Phalanx	
The London Scotsman	
The London Telegraph	
The London Weekly Investigator	
The Man about Town	
The Manchester Examiner	
The Metropolitan	
The Midland Examiner and Wolverhampton Times	
The Monthly Times	
The Morning Gazette	
The Morning Mail	
The Nation	
The National	
The National Protector	
The National Standard	
The New Globe	
The New Weekly True Sun	
The Newcastle Courant 	NECT
The News	
The North Cumberland Reformer	
The North Londoner	
The North-West London Times	
The Northern Daily Times	
The Northern Guardian	
The Nottinghamshire Guardian	NOGN
The Nuneaton Times	
The Observer of the Times	
The Odd Fellow	ODFW
The Operative	OPTE
The Oracle and the Daily Advertiser	
The Paddington Advertiser	
The Pall Mall Gazette	PMGZ
The Palladium	
The Patriot	
The People's Hue and Cry or Weekly Police Register	
The People's Paper	
The Pilot	
The Pioneer and Weekly Record of Movements	
The Planet	
The Political Letter	
The Political Observer	
The Pontypridd District Herald	
The Poor Man's Guardian	PMGU
The Porcupine	
The Potteries Examiner	
The Press	
The Preston Chronicle and Lancashire Advertiser	PNCH
The Public Cause	
The Radical	
The Railway Bell and London Advertiser	
The Reformer	
The Representative	
The Saint James's Chronicle	
The Satirist; or, the Censor of the Times	
The Sheffield Independent	SHIN
The Shropshire Examiner	
The Slaithwaite Guardian and Colne Valley News	
The South Staffordshire Examiner	
The St. Helens Examiner, and Prescot Weekly News	
The Standard	SDLN
The Standard of Freedom 	
The Star	STGY
The Stockton Examiner and South Durham and North Yorkshire Herald	
The Sun	
The Sun & Central Press	
The Sunday Evening Globe	
The Sunday Morning Herald	
The Sussex & Surrey Chronicle	
The Tamworth Miners' Examiner and Working Men's Journal	
The Tichborne Gazette	
The Tichborne News and Anti-Oppression Journal	
The Tower Hamlets Mail	
The Trades' Free Press	
The True Briton	
The True Sun	
The Union	
The Universe	
The Verulam	
The Vindicator	
The Warrington Examiner	
The Warwickshire Herald	
The Watchman	
The Week's News	
The Weekly Advertiser	
The Weekly Chronicle	
The Weekly Echo	
The Weekly Globe	
The Weekly Independent	
The Weekly Intelligence	
The Weekly Journal	
The Weekly Mail	
The Weekly Review	
The Weekly Star and Bell's News	
The Wellington Gazette and Military Chronicle	
The West End News	
The West London Times	
The Westminster Times	
The Weymouth Telegram	
The World	
The World and Fashionable Sunday Chronicle	
The York Herald	YOHD
Town & Country Daily Newspaper	
Town and Country Advertiser	
Town Talk	
Town Talk 1822	
Town Talk 1823	
Trade Protection Record	
Weekly Times	
Weekly True Sun	
West Londoner and Select Advertiser for the Borough of Marylebone	
Western Mail	WMCF
Westminster Journal and Old British Spy	
Whitehall Evening Post	
Widnes Examiner	
Wooler's British Gazette	
Wrexham Advertiser	WRWA
Y Genedl Gymreig	GNDL
Y Goleuad	GLAD
York House Papers	
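
The reorganisation mentioned above, where the input data is grouped by alias, could be sketched as follows. This is only an illustration under assumptions: it supposes that the received data has one directory per NLP directly under a source root, and that the target layout is `<alias>/<NLP>/...`. The helper name `copy_by_alias` and the demo paths are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_by_alias(src_root: Path, dest_root: Path, nlp_to_alias: dict) -> list:
    """Copy each NLP directory under its alias; return the NLPs that have no alias."""
    unmapped = []
    for nlp_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        alias = nlp_to_alias.get(nlp_dir.name)
        if alias is None:
            unmapped.append(nlp_dir.name)
            continue
        # dirs_exist_ok lets the copy be re-run without failing on existing folders
        shutil.copytree(nlp_dir, dest_root / alias / nlp_dir.name, dirs_exist_ok=True)
    return unmapped


# Demo on a throwaway directory tree (assumed layout, not the real BL data).
with tempfile.TemporaryDirectory() as tmp:
    src, dest = Path(tmp) / "original", Path(tmp) / "by_alias"
    for nlp in ["0000031", "0000032", "9999999"]:
        (src / nlp / "1800").mkdir(parents=True)
        (src / nlp / "1800" / "page.xml").write_text("<alto/>")
    unmapped = copy_by_alias(src, dest, {"0000031": "ANJO", "0000032": "ANJO"})
    copied = sorted(p.relative_to(dest).as_posix() for p in dest.rglob("page.xml"))
    print(copied, unmapped)
```

Copying instead of moving keeps the originally received data intact until the new structure has been verified, and collecting the unmapped NLPs surfaces gaps in the mapping without interrupting the run.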

## Using this mapping and list to get a final list of title-alias pairs with their corresponding lists of NLPs and effective start/end years in the data

This will allow us to generate the final, updated DSA_access-rights list, which is then used in several internal processes.
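
Before combining the two sources, it can help to confirm that they refer to the same set of titles. The sketch below is a hedged illustration: `compare_titles` is a hypothetical helper, and the toy inputs stand in for the alias dictionary and the `Normalized Working Title` column.

```python
def compare_titles(alias_by_title: dict, listed_titles) -> dict:
    """Compare the titles that have an alias with the titles found in the list."""
    with_alias = set(alias_by_title)
    # strip stray whitespace, which would otherwise hide a match
    listed = {title.strip() for title in listed_titles}
    return {
        "listed_without_alias": sorted(listed - with_alias),
        "alias_without_listing": sorted(with_alias - listed),
    }


# Toy inputs: one title carries a trailing space, one is missing on each side.
toy_aliases = {"Bargoed Journal": "BGJO", "Brief": "BRIF"}
toy_listed = ["Bargoed Journal", "Bargoed Journal ", "Hour"]
diff = compare_titles(toy_aliases, toy_listed)
print(diff)
```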

In [85]:
# read the data in the csv, extracted from the gsheet

bl_media_list_ext_path = '/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BL_extended_title_list.csv'

bl_media_lst_ext_raw_df = pd.read_csv(bl_media_list_ext_path, header=1, index_col=0)
print(bl_media_lst_ext_raw_df.info())
bl_media_lst_ext_raw_df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 647 entries, 1 to 628
Data columns (total 12 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Normalized Working Title           647 non-null    object 
 1   Working title (BL)                 647 non-null    object 
 2   Variant Title                      647 non-null    object 
 3   NLP                                647 non-null    int64  
 4   Alias (in file-syst or generated)  647 non-null    object 
 5   Country                            549 non-null    object 
 6   Start Year                         544 non-null    float64
 7   End Year                           544 non-null    float64
 8   Copy already shared with Impresso  647 non-null    object 
 9   Start year in Impresso local copy  647 non-null    int64  
 10  End year in Impresso local copy    647 non-null    int64  
 11  Notes about local copy             87 non-null     object 
dtyp

Unnamed: 0,Normalized Working Title,Working title (BL),Variant Title,NLP,Alias (in file-syst or generated),Country,Start Year,End Year,Copy already shared with Impresso,Start year in Impresso local copy,End year in Impresso local copy,Notes about local copy
1,Aberdeen Press and Journal,Aberdeen Press and Journal,Aberdeen Journal and General Advertiser,31,ANJO,Scotland,1798.0,1876.0,"Yes, fully",1789,1876,
2,Aberdeen Press and Journal,Aberdeen Press and Journal,Aberdeen Weekly Journal and General Advertiser,32,ANJO,Scotland,1876.0,1900.0,"Yes, fully",1877,1900,There were some small problems in the filenami...
444,Alston Herald and East Cumberland Advertiser,Alston Herald and East Cumberland Advertiser,"Alston Herald, and East Cumberland Advertiser.",3043,AHEC,England,1875.0,1879.0,"Yes, fully",1875,1879,Not separated in the data
492,Alston Herald and East Cumberland Advertiser,Alston Herald and East Cumberland Advertiser,"Alston Herald, and East Cumberland Advertiser",3043,AHEC,England,1880.0,1880.0,"Yes, fully",1880,1880,Not separated in the data
369,Baldwin's London Weekly Journal,Baldwin's London Weekly Journal,"Baldwin's London Weekly Journal, etc",2243,BLWJ,England,1803.0,1836.0,"Yes, fully",1803,1836,


In [86]:
# ensuring that we actually have data for all the titles present in the list
bl_media_lst_ext_raw_df['Copy already shared with Impresso'].value_counts()

Copy already shared with Impresso
Yes, fully                    505
Yes, not originally lsit      102
Yes, partially                 21
Yes, more than in the list     19
Name: count, dtype: int64

In [87]:
# reformatting the csv to keep only columns of interest, and have the columns be in the correct format

cols_to_remove = ['Country', 'Notes about local copy', 'Start Year', 'End Year', 'Copy already shared with Impresso']

bl_media_lst_ext_df = bl_media_lst_ext_raw_df.drop(cols_to_remove, axis=1)
bl_media_lst_ext_df['NLP'] = bl_media_lst_ext_df['NLP'].astype(str).str.zfill(7)
bl_media_lst_ext_df['Normalized Working Title'] = bl_media_lst_ext_df['Normalized Working Title'].str.strip()
bl_media_lst_ext_df.head()

Unnamed: 0,Normalized Working Title,Working title (BL),Variant Title,NLP,Alias (in file-syst or generated),Start year in Impresso local copy,End year in Impresso local copy
1,Aberdeen Press and Journal,Aberdeen Press and Journal,Aberdeen Journal and General Advertiser,0000031,ANJO,1789,1876
2,Aberdeen Press and Journal,Aberdeen Press and Journal,Aberdeen Weekly Journal and General Advertiser,0000032,ANJO,1877,1900
444,Alston Herald and East Cumberland Advertiser,Alston Herald and East Cumberland Advertiser,"Alston Herald, and East Cumberland Advertiser.",0003043,AHEC,1875,1879
492,Alston Herald and East Cumberland Advertiser,Alston Herald and East Cumberland Advertiser,"Alston Herald, and East Cumberland Advertiser",0003043,AHEC,1880,1880
369,Baldwin's London Weekly Journal,Baldwin's London Weekly Journal,"Baldwin's London Weekly Journal, etc",0002243,BLWJ,1803,1836
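
The zero-padding step matters because a CSV reader parses a value such as `0000031` as the integer `31`. A small sketch of the same normalisation on toy data follows, under the assumption that NLP identifiers are meant to be 7-digit strings; the inline CSV and its column names are illustrative only.

```python
import io

import pandas as pd

csv_text = "title,NLP\nAberdeen Press and Journal,31\nBargoed Journal,0003104\n"

# dtype=str keeps the identifiers as text, so existing leading zeros survive the read
df = pd.read_csv(io.StringIO(csv_text), dtype={"NLP": str})
df["NLP"] = df["NLP"].str.zfill(7)

# sanity check: every identifier is now exactly 7 digits
assert df["NLP"].str.fullmatch(r"\d{7}").all()
print(df["NLP"].tolist())
```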


In [88]:
bl_media_lst_gpd = bl_media_lst_ext_df.groupby('Normalized Working Title').agg({
        "Alias (in file-syst or generated)": lambda x: x.unique()[0] if len(x.unique())==1 else x.unique(),
        "Start year in Impresso local copy": lambda x: x.min(),
        "End year in Impresso local copy": lambda x: x.max(),
        "NLP": lambda x: x.unique(),
        'Working title (BL)': lambda x: x.unique(),
        "Variant Title": lambda x: x.unique(),
    },
).reset_index().rename(columns={
    "Alias (in file-syst or generated)": "Alias",
    'Working title (BL)': 'BL Working Titles',
    "Variant Title": "Variant Titles",
    "NLP": "NLPs",
    "Start year in Impresso local copy": "Start Year",
    "End year in Impresso local copy": "End Year",
})

# assert that there is indeed only one alias for each working title:
assert all([isinstance(x, str) for x in bl_media_lst_gpd["Alias"].values]), "There are working titles with multiple aliases!"

bl_media_lst_gpd

Unnamed: 0,Normalized Working Title,Alias,Start Year,End Year,NLPs,BL Working Titles,Variant Titles
0,Aberdeen Press and Journal,ANJO,1789,1900,"[0000031, 0000032]",[Aberdeen Press and Journal],"[Aberdeen Journal and General Advertiser, Aber..."
1,Alston Herald and East Cumberland Advertiser,AHEC,1875,1880,[0003043],[Alston Herald and East Cumberland Advertiser],"[Alston Herald, and East Cumberland Advertiser..."
2,Baldwin's London Weekly Journal,BLWJ,1803,1836,[0002243],[Baldwin's London Weekly Journal],"[Baldwin's London Weekly Journal, etc]"
3,Baner ac Amserau Cymru,BNER,1857,1900,"[0000036, 0000037]",[Baner ac Amserau Cymru],"[Baner Cymru, Baner ac Amserau Cymru]"
4,Bargoed Journal,BGJO,1904,1912,"[0003104, 0003548]",[Bargoed Journal],"[Bargoed Journal, New Tredegar, Bargoed & Caer..."
...,...,...,...,...,...,...,...
369,Wooler's British Gazette,WBGZ,1819,1823,[0002762],[Wooler's British Gazette],[Wooler's British Gazette]
370,Wrexham Advertiser,WRWA,1854,1900,"[0000185, 0000496]",[Wrexham Advertiser],"[Wrexham Weekly Advertiser, Wrexham Advertiser]"
371,Y Genedl Gymreig,GNDL,1877,1900,[0000059],[Y Genedl Gymreig],[Y Genedl Gymreig]
372,Y Goleuad,GLAD,1869,1900,[0000058],[Y Goleuad],[Y Goleuad]


Small sanity check that all title-alias mappings are indeed correct

In [89]:
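# NOTE: despite its name, `alias_mismatch` holds True for every title whose alias matches `bl_aliases`
alias_mismatch = [bl_aliases[t]==a for t, a in bl_media_lst_gpd[['Normalized Working Title',"Alias"]].values]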
mismatch_idices = np.where(~np.array(alias_mismatch))[0].tolist()
assert all(alias_mismatch), f"There is a mismatch in the title-aliases mapping!, indices: {mismatch_idices}"
print("It's all good! all IDs are unique and match!")

It's all good! all IDs are unique and match!


In [90]:
for idx in mismatch_idices:
    title = bl_media_lst_gpd.iloc[idx]['Normalized Working Title']
    print(f"Working title: {title}, alias in gsheet: {bl_media_lst_gpd.iloc[idx]['Alias']}, correct alias: {bl_aliases[title]}")

In [97]:
# reformat list-like columns
list_like_cols = ['NLPs', "BL Working Titles", "Variant Titles"]

for col in list_like_cols:
    bl_media_lst_gpd[col] = bl_media_lst_gpd[col].apply(lambda x: list(x))

bl_media_lst_gpd['NLPs'][0]

['0000031', '0000032']

### Now that the list is finalized and compiled, save it

In [98]:
out_dir = os.path.dirname(bl_media_list_ext_path)

out_path = os.path.join(out_dir, "BL_title_alias_mapping.csv")
out_path

'/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BL_title_alias_mapping.csv'

In [100]:
bl_media_lst_gpd.to_csv(out_path)

## 2. Reorganizing the data on the NAS to fit our requirements

Currently we have under `/mnt/project_impresso/original/BL_old`:
- All the NLPs that were shared with us
- With the substructure `NLP/YYYY/MMDD/files`
- There are some errors with files copied across multiple subdirs
- Images are included in the issues files

We would like to have, under `/mnt/project_impresso/original/BL`:
- A file structure of the following format: `alias/NLP(s)/YYYY/MM/DD/files`
- Only the OCR XML files copied, not the images

The copy is done through mount points to NAS folders for which I have read-write access (the mount under `original` is read-only for security):
- `/mnt/impresso_ocr_BL` for the files
- `/mnt/impresso_images_BL` for the images
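
The path mapping implied by the two layouts above can be sketched as follows. This is only an illustration: `build_dest_issue_dir` is a hypothetical helper, not a function from this notebook or from `text_preparation`, and it assumes that the source issue directory always ends in `NLP/YYYY/MMDD`.

```python
import os


def build_dest_issue_dir(src_issue_dir: str, alias: str, dest_root: str) -> str:
    """Map `.../NLP/YYYY/MMDD` to `dest_root/alias/NLP/YYYY/MM/DD` (hypothetical helper)."""
    # the last three components of the source path are assumed to be NLP, year and MMDD
    nlp, year, mmdd = src_issue_dir.rstrip('/').split('/')[-3:]
    if len(mmdd) != 4 or not mmdd.isdigit():
        raise ValueError(f"Unexpected MMDD component: {mmdd!r}")
    return os.path.join(dest_root, alias, nlp, year, mmdd[:2], mmdd[2:])


example = build_dest_issue_dir(
    '/mnt/project_impresso/original/BL_old/0000031/1860/0104',
    'ANJO',
    '/mnt/impresso_ocr_BL',
)
print(example)
```

Directories that do not follow the expected pattern (for instance the `.backup` ones handled further down) would raise here, which is one way of surfacing them early.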

In [4]:
source_path = '/mnt/project_impresso/original/BL_old'
dest_path = '/mnt/impresso_ocr_BL'

In [5]:
NLPs_in_source = os.listdir(source_path)
all_dirs = sorted(os.listdir(source_path))
all_nlps = [d for d in all_dirs if re.fullmatch(r"\d{7}", d)]
len(NLPs_in_source), NLPs_in_source[:5], len(all_nlps), all_nlps[-5:]

(601,
 ['0000504', '0002366', '0000491', '0002364', '0002357'],
 596,
 ['0004691', '0004692', '0004693', '0004694', '0004738'])

In [70]:
nlp_chunks = list(chunk(NLPs_in_source, 101))
len(nlp_chunks), len(nlp_chunks[0]), nlp_chunks[0][:5], nlp_chunks[0][-5:]

(6,
 101,
 ['0000504', '0002366', '0000491', '0002364', '0002357'],
 ['0000151', '0002432', '0002608', '0002635', '0004204'])

In [8]:
nlp_chunks_2 = list(chunk(all_nlps, 100))
len(nlp_chunks_2), len(nlp_chunks_2[0]), len(nlp_chunks_2[-1]), nlp_chunks_2[0][:5], nlp_chunks_2[0][-5:]

(6,
 100,
 96,
 ['0000031', '0000032', '0000033', '0000034', '0000035'],
 ['0000185', '0000186', '0000189', '0000191', '0000193'])

Create a mapping from NLP to Alias

In [34]:
alias_to_nlps = bl_media_lst_gpd[['Alias', 'NLPs']].to_dict(orient='records')
print(alias_to_nlps[:5])

nlp_to_alias = {nlp: record['Alias'] for record in alias_to_nlps for nlp in record['NLPs']}
nlp_to_alias

[{'Alias': 'ANJO', 'NLPs': array(['0000031', '0000032'], dtype=object)}, {'Alias': 'AHEC', 'NLPs': array(['0003043'], dtype=object)}, {'Alias': 'BLWJ', 'NLPs': array(['0002243'], dtype=object)}, {'Alias': 'BNER', 'NLPs': array(['0000036', '0000037'], dtype=object)}, {'Alias': 'BGJO', 'NLPs': array(['0003104', '0003548'], dtype=object)}]


{'0000031': 'ANJO',
 '0000032': 'ANJO',
 '0003043': 'AHEC',
 '0002243': 'BLWJ',
 '0000036': 'BNER',
 '0000037': 'BNER',
 '0003104': 'BGJO',
 '0003548': 'BGJO',
 '0003041': 'BTEP',
 '0002986': 'BFNP',
 '0002789': 'BELL',
 '0002347': 'BPDH',
 '0000150': 'WOJL',
 '0002778': 'BPHF',
 '0000033': 'BDPO',
 '0003052': 'BWNW',
 '0003053': 'BWNW',
 '0000155': 'BROR',
 '0003056': 'BGCH',
 '0003057': 'BGCH',
 '0003059': 'BQGA',
 '0003060': 'BBLT',
 '0002769': 'BRIF',
 '0002770': 'BRIF',
 '0002771': 'BRIF',
 '0003062': 'BRGA',
 '0003061': 'BRGA',
 '0000040': 'BRPT',
 '0002811': 'BRAD',
 '0002812': 'BRAD',
 '0002813': 'BRAD',
 '0002772': 'BRMW',
 '0002773': 'BRMW',
 '0003537': 'BRMG',
 '0003538': 'BRMG',
 '0003539': 'BRMG',
 '0003540': 'BRMG',
 '0003541': 'BRMG',
 '0000045': 'CNMR',
 '0000046': 'CNMR',
 '0000047': 'CNMR',
 '0002984': 'CKTC',
 '0002985': 'CKTC',
 '0003244': 'CPAD',
 '0003245': 'CPAD',
 '0000157': 'CHOR',
 '0000158': 'CHOR',
 '0000485': 'CHOR',
 '0002765': 'CHTI',
 '0002766': 'CHTI',


In [12]:
def extract_date(root_path):
    # extract the year, month and day for a root path which has been format-checked

    # edge case for issue dir "/mnt/project_impresso/original/BL_old/0000071/1785/0618.backup"
    # "/mnt/project_impresso/original/BL_old/0002634/1820/0317.backup/0317/"
    if ".backup" in root_path:
        # remove the '.backup' and everything after to parse the date
        root_path = root_path.split('.backup')[0]
        msg = f"{root_path}: found an unexpected component to the path, removing all after '.backup'!"
        print(msg)
        #logger.error(msg)
    
    path_tail = root_path.split('/')[6:]#[-2:]
    # slicing never raises, so explicitly check that both the year and the MMDD components are present
    if len(path_tail) < 2:
        msg = f"{root_path}: Missing elements! path_tail: {path_tail}"
        print(msg)
        #logger.error(msg)
        return False, '','',''
    
    # path_tail should be in format: ['YYYY', 'MMDD']
    y, m, d = path_tail[0], path_tail[1][:2], path_tail[1][2:]
    
    try:
        # assert that this is a valid date
        date = datetime(year=int(y), month=int(m), day=int(d))
        return True, y, m, d
    except ValueError as e:
        msg = f"{root_path}: Invalid date! {y, m, d}, error: {e}"
        print(msg)
        #logger.error(msg)
        return False, y, m, d

In [11]:
p = "/mnt/project_impresso/original/BL_old/0002634/1820/0317.backup/0317/"
p2 = p.split('.backup')[0]
path_tail = p2.split('/')[6:]
path_tail

['1820', '0317']

In [35]:
def check_if_to_be_copied(source_dir_files: list, dest_issue_dir: str, possible_date_formats: list, xml_ext = '.xml') -> tuple:
    # check if the copy needs to be done when within an issue dir; returns (to_copy, xml files to copy).
    # should not be done if:
    # - it was already done (dest issue dir exists and has exactly the same xml files)

    # list all the files to copy from the source dir: xml files which have the correct date
    src_xml_files_to_copy = [f for f in source_dir_files for d in possible_date_formats if f.endswith(xml_ext) and d in f]

    if len(src_xml_files_to_copy) == 0:
        # if there are no files to copy at all, log it (might be an error)
        msg = f"{dest_issue_dir} - No files to copy in source dir: source_dir_files={source_dir_files}, src_xml_files_to_copy={src_xml_files_to_copy}!"
        print(msg)
        #logger.warning(msg)
        return False, src_xml_files_to_copy
    
    # check if dir exists and all xml files from source are there
    if os.path.exists(dest_issue_dir):
        existing_dest_files = os.listdir(dest_issue_dir)
        if all(f in existing_dest_files for f in src_xml_files_to_copy):
            # the copy was already done correctly; to be copied = False
            return False, src_xml_files_to_copy
        
    # if dest_issue_dir doesn't exist or not all xml files from source dir are there, the copy needs to be redone
    return True, src_xml_files_to_copy
    

In [50]:
def copy_files_for_NLP(nlp, alias, source_dir=source_path, dest_dir=dest_path, xml_ext = '.xml', date_fmt_chars = ['-', '', '_']):
    # given an NLP, copy all the files within it in the new desired structure
    msg = f"Processing {alias} - NLP {nlp}"
    print(msg)
    #logger.info(msg)

    problem_input_dirs = []
    failed_copies = []
    nlp_dest_dir_path = os.path.join(dest_dir, alias, nlp)
    # first create the subdir for the NLP, inside a directory for the alias, creating it if it does not exist yet
    os.makedirs(nlp_dest_dir_path, exist_ok=True)

    # then iterate on all the years, and for each one, recreate the structure (MM/DD) and copy the *.xml files
    all_years = os.listdir(os.path.join(source_dir, nlp))
    for root,dirs,files in os.walk(os.path.join(source_dir, nlp)):
        # identify the cases when we are in an issue's directory, and we are in the standard case scenario
        if len(files)!=0:
            if len(dirs) == 0:
                valid_date, y, m, d = extract_date(root)
                # ensure the date identified is correct
                if valid_date:
                    # define the out_path where to copy the issue OCR data, to check if it was already processed
                    issue_out_dir = os.path.join(nlp_dest_dir_path, y, m, d)
                    # list the possible dates to find in the files to copy
                    date_formats = [c.join([y, m, d]) for c in date_fmt_chars]
                    # identify the list of files to copy and if there are files left to copy
                    copy_to_do, src_xml_files_to_copy = check_if_to_be_copied(files, issue_out_dir, date_formats)

                    if copy_to_do:
                        # ensure dest issue dir exists
                        os.makedirs(issue_out_dir, exist_ok=True)
                        for f in src_xml_files_to_copy:
                            try:
                                shutil.copy(os.path.join(root, f), issue_out_dir)
                            except Exception as e:
                                msg = f"{alias}-{nlp}-{date_formats[0]} — Copy of {os.path.join(root, f)} failed due to execption {e}, To copy again!."
                                print(msg)
                                #logger.exception(msg)
                                failed_copies.append(os.path.join(root, f))
                    else:
                        msg = f"{alias}-{nlp}-{date_formats[0]} — Skipping: no files to copy issue contents of {root} already exist in {issue_out_dir}:\n"
                        print(msg)
                        #logger.info(msg)
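                        # the dest dir may not exist when there were simply no files to copy
                        dest_contents = os.listdir(issue_out_dir) if os.path.exists(issue_out_dir) else []
                        msg = (
                            f"   - source dir contents: {os.listdir(root)}\n"
                            f"   - dest dir contents: {dest_contents}."
                        )
                        #print(msg)
                        #logger.debug(msg)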
                else:
                    msg = (
                        f"{alias}-{nlp} — Invalid date!! {root}"
                    )
                    print(msg)
                    #logger.warning(msg)
                    problem_input_dirs.append(root)
            else:
                msg = (
                    f"{alias}-{nlp} — Invalid directoy!! root:{root}, dirs:{dirs}, files={files}"
                )
                print(msg)
                #logger.warning(msg)
                problem_input_dirs.append(root)

    return problem_input_dirs, failed_copies


In [51]:
test_nlp = '0000035'
nlp_to_alias[test_nlp]

'BLMY'

In [52]:
problem_input_dirs, failed_copies = copy_files_for_NLP(test_nlp, nlp_to_alias[test_nlp])

Processing BLMY - NLP 0000035
BLMY-0000035-1898-01-04 — Skipping: no files to copy issue contents of /mnt/project_impresso/original/BL_old/0000035/1898/0104 already exist in /mnt/impresso_ocr_BL/BLMY/0000035/1898/01/04:

BLMY-0000035-1898-11-14 — Skipping: no files to copy issue contents of /mnt/project_impresso/original/BL_old/0000035/1898/1114 already exist in /mnt/impresso_ocr_BL/BLMY/0000035/1898/11/14:

BLMY-0000035-1898-02-23 — Skipping: no files to copy issue contents of /mnt/project_impresso/original/BL_old/0000035/1898/0223 already exist in /mnt/impresso_ocr_BL/BLMY/0000035/1898/02/23:

BLMY-0000035-1898-01-25 — Skipping: no files to copy issue contents of /mnt/project_impresso/original/BL_old/0000035/1898/0125 already exist in /mnt/impresso_ocr_BL/BLMY/0000035/1898/01/25:

BLMY-0000035-1898-04-29 — Skipping: no files to copy issue contents of /mnt/project_impresso/original/BL_old/0000035/1898/0429 already exist in /mnt/impresso_ocr_BL/BLMY/0000035/1898/04/29:

BLMY-0000035-18

KeyboardInterrupt: 

The function works; we can now put it in a script and launch the copy in a `screen` session.
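
As a rough idea of what such a script could look like, here is a minimal driver sketch. It is not the script that was actually used: `run_copy` and `fake_copy` are hypothetical names, and `fake_copy` only stands in for `copy_files_for_NLP` so that the example runs without access to the NAS.

```python
import json


def run_copy(nlp_to_alias, copy_fn, report_path=None):
    """Call `copy_fn(nlp, alias)` for every NLP and gather whatever went wrong."""
    report = {}
    for nlp, alias in sorted(nlp_to_alias.items()):
        problem_dirs, failed = copy_fn(nlp, alias)
        if problem_dirs or failed:
            report[nlp] = {
                'alias': alias,
                'problem_input_dirs': problem_dirs,
                'failed_copies': failed,
            }
    if report_path is not None:
        # keep a trace on disk so the run can be inspected after the session ends
        with open(report_path, 'w', encoding='utf-8') as f:
            json.dump(report, f, indent=4)
    return report


def fake_copy(nlp, alias):
    # stand-in for `copy_files_for_NLP`: pretends one NLP has a malformed directory
    return (['/some/odd/dir'], []) if nlp == '0000032' else ([], [])


report = run_copy({'0000031': 'ANJO', '0000032': 'ANJO'}, fake_copy)
print(report)
```

In a real run, `copy_fn` would be `copy_files_for_NLP` and `report_path` a file in the logs directory.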

#### Ensuring that all paths can be processed with the extract_date function, and finding any cases where the data would not fit the expected file-structure

In [106]:
already_done = """0000504
0002366
0000491
0002364
0002357
0003064
0000045
0002385
0002603
0002813
0002349
0002980
0004127
0004201
0000503
0004196
0000191
0003046
0002636
0000075
0003098
0002757
0000071
0002424
0002639
0002604
0000097
0002351
0002808
0002415
0003087
0000079
0004134
0002801
0000496
0002439
0003102
0002378
0003071
0002786
0002427
0002805
0002620
0002244
0002363
0000268
0003081
0002375
0002789
0003251
0000150
0002978
0003032
0003049
0000501
0000060
0002421
0004692
0003265
0002792
0003061
0002436
0004683
0000042
0000172
0003034
0003083
0002587
0003246
0003011
0004132
0003107
0002759
0000183
0003028
0002267
0002619
0002633
0002778
0003245
0002814
0003539
0000181
0000031
0000095
0002760
0003026
0003044
0003077
0000494
0002585
0000073
0002371
0003047
0000493
0002803
0000151
0002432
0002608
0002635
0004204
0000155
0003009
0003029
0000157
0000166
0002993
0002586
0002641
0002981
0002580
0000050
0002613
0003056
0003408
0002416
0002592
0003248
0000499
0003057
0000034
0002353
0003398
0004691
0004688
0002974
0004690
0004209
0002607
0002777
0002606
0003006
0003055
0002791
0002414
0002588
0004212
0002428
0003079
0002985
0002574
0002755
0000211
0000175
0004686
0000098
0003059
0002806
0002642
0000162
0003037
0000154
0002256
0004190
0003262
0000103
0000497
0002810
0003015
0002754
0002596
0002998
0003027
0000495
0003069
0003082
0002386
0004694
0002630
0000186
0004682
0000064
0002643
0002409
0003091
0002420
0004186
0000189
0003058
0000170
0002644
0004214
0000193
0000062
0000161
0002758
0000063
0002261
0003112
0002350
0002601
0000047
0002605
0002594
0003257
0003074
0002816
0000081
0000269
0002374
0000085
0002612
0002410
0002597
0002615
0003115
0000068
0002354
0000058
0000178
0000037
0003080
0002369
0002776
0003051
0000056
0003103
0003540
0002370
0004138
0002084
0002268
0002634
0003088
0004203
0003020
0002419
0002368
0002972
0002769
0002770
0003270
0000078
0002408
0004143
0002609
0004309
0000488
0004197
0000072
0003007
0000054
0000177
0002771
0004693
0000070
0002802
0002781
0002610
0000035"""

In [15]:
already_done

'0000504\n0002366\n0000491\n0002364\n0002357\n0003064\n0000045\n0002385\n0002603\n0002813\n0002349\n0002980\n0004127\n0004201\n0000503\n0004196\n0000191\n0003046\n0002636\n0000075\n0003098\n0002757\n0000071\n0002424\n0002639\n0002604\n0000097\n0002351\n0002808\n0002415\n0003087\n0000079\n0004134\n0002801\n0000496\n0002439\n0003102\n0002378\n0003071\n0002786\n0002427\n0002805\n0002620\n0002244\n0002363\n0000268\n0003081\n0002375\n0002789\n0003251\n0000150\n0002978\n0003032\n0003049\n0000501\n0000060\n0002421\n0004692\n0003265\n0002792\n0003061\n0002436\n0004683\n0000042\n0000172\n0003034\n0003083\n0002587\n0003246\n0003011\n0004132\n0003107\n0002759\n0000183\n0003028\n0002267\n0002619\n0002633\n0002778\n0003245\n0002814\n0003539\n0000181\n0000031\n0000095\n0002760\n0003026\n0003044\n0003077\n0000494\n0002585\n0000073\n0002371\n0003047\n0000493\n0002803\n0000151\n0002432\n0002608\n0002635\n0004204\n0000155\n0003009\n0003029\n0000157\n0000166\n0002993\n0002586\n0002641\n0002981\n0002580\n

In [None]:
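last_nlp = ''
iters = {}
# use a set of NLP ids rather than substring checks on the raw string
done_nlps = set(already_done.split('\n'))
for root,dirs,files in os.walk(source_path):
    path_parts = root.split('/')
    if len(path_parts) < 6:
        # source_path itself: there is no NLP component yet
        continue
    # the NLP is always the 6th component of the path, whatever the depth of root
    curr_nlp = path_parts[5]
    if curr_nlp in done_nlps:
        print(f"{curr_nlp} - already done")
        # do not descend into an NLP that was already checked
        dirs[:] = []
        continue
    if curr_nlp not in iters:
        iters[curr_nlp] = 0
    if curr_nlp != last_nlp and len(curr_nlp)==7:
        print(f"curr_nlp: {curr_nlp}, last_nlp: {last_nlp}")
        last_nlp = curr_nlp
    if len(files)!=0 and any('.xml' in f for f in files):
        iters[curr_nlp] += 1
        valid_date, y, m ,d = extract_date(root)
        if not valid_date:
            print(f"root: {root}, dirs: {dirs}, files: {files}")
    if iters[curr_nlp]>40:
        print(f"curr_nlp: {curr_nlp} - finished tests")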

In [18]:
i = 0
for root,dirs,files in os.walk(os.path.join(source_path, '0000031')):
    if i < 5 or len(files)==0:
        print(f"root: {root}, dirs: {dirs}, files: {files}")
    i+=1
    if root == '/mnt/project_impresso/original/BL_old/0000031/1813':
        break

root: /mnt/project_impresso/original/BL_old/0000031, dirs: ['1860', '1814', '1811', '1813', '1810', '1806', '1867', '1808', '1824', '1837', '1870', '1807', '1800', '1842', '1874', '1823', '1844', '1856', '1798', '1832', '1846', '1850', '1833', '1849', '1840', '1869', '1821', '1809', '1817', '1859', '1835', '1873', '1865', '1858', '1862', '1819', '1843', '1805', '1853', '1804', '1866', '1839', '1826', '1847', '1841', '1801', '1831', '1827', '1872', '1861', '1828', '1868', '1803', '1864', '1830', '1851', '1845', '1854', '1825', '1871', '1855', '1820', '1812', '1822', '1802', '1834', '1836', '1818', '1876', '1875', '1852', '1863', '1848', '1829', '1838', '1816', '1799', '1857'], files: []
root: /mnt/project_impresso/original/BL_old/0000031/1860, dirs: ['0104', '1114', '0125', '0822', '0502', '0718', '0111', '0229', '1205', '0613', '1031', '0314', '0307', '0425', '0711', '0620', '0919', '1003', '0118', '0704', '0208', '0808', '0926', '0725', '0411', '0321', '1017', '0516', '1219', '0815', 

In [16]:
r = "/mnt/project_impresso/original/BL_old/0000031/1860/0104"
text_source_files = os.listdir(r)

print(text_source_files)
text_source_files.append("test.xml")

valid_date, y, m, d = extract_date(r)

print(f"valid_date={valid_date}, y={y}, m={m}, d={d}")
possible_date_formats = [c.join([y, m, d]) for c in ['-', '', '_']]

source_xml_files1 = [f for f in text_source_files if f.endswith('.xml')]
source_xml_files2 = [f for f in text_source_files for d in possible_date_formats if f.endswith('.xml') and d in f]

print(f"source_xml_files1={source_xml_files1}")
print(f"source_xml_files2={source_xml_files2}")
source_xml_files1 == source_xml_files2

['WO1_ANJO_1860_01_04-0008-037.xml', 'WO1_ANJO_1860_01_04-0007.xml', 'WO1_ANJO_1860_01_04-0005-012.xml', 'WO1_ANJO_1860_01_04-0007-022.xml', 'WO1_ANJO_1860_01_04-0007-024.xml', 'WO1_ANJO_1860_01_04-0004-007.xml', 'WO1_ANJO_1860_01_04-0007-019.xml', 'WO1_ANJO_1860_01_04-0002-003.xml', 'WO1_ANJO_1860_01_04-0008-038.xml', 'WO1_ANJO_1860_01_04-0004-005.xml', '0000031_18600104_0008.xml', 'WO1_ANJO_1860_01_04-0001.xml', 'WO1_ANJO_1860_01_04-0007-026.xml', '0000031_18600104_0001.xml', '0000031_18600104_0005.jp2', 'WO1_ANJO_1860_01_04-0006-014.xml', 'WO1_ANJO_1860_01_04-0005-009.xml', '0000031_18600104_0006.jp2', 'WO1_ANJO_1860_01_04-0008-031.xml', '0000031_18600104_0004.jp2', 'WO1_ANJO_1860_01_04-0008-034.xml', 'WO1_ANJO_1860_01_04-0006-015.xml', 'WO1_ANJO_1860_01_04-0006.xml', 'WO1_ANJO_1860_01_04-0007-020.xml', 'WO1_ANJO_1860_01_04-0002-002.xml', 'WO1_ANJO_1860_01_04-0002.xml', 'WO1_ANJO_1860_01_04-0004.xml', 'WO1_ANJO_1860_01_04-0008-035.xml', '0000031_18600104_manifest.txt', 'WO1_ANJO_186

False
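
The comparison is `False` because `test.xml` was appended to the listing on purpose: it ends in `.xml` but carries none of the expected date strings, so only the date-aware filter drops it. A self-contained sketch of that filter, using a few of the file names from the listing above (`select_issue_xml` is a hypothetical name, and `any(...)` is used here so that a file matching several date formats is only listed once):

```python
def select_issue_xml(files, y, m, d, seps=('-', '', '_'), xml_ext='.xml'):
    """Keep the XML files whose name contains the issue date in one of the expected formats."""
    date_formats = [sep.join([y, m, d]) for sep in seps]
    return [f for f in files if f.endswith(xml_ext) and any(fmt in f for fmt in date_formats)]


files = [
    'WO1_ANJO_1860_01_04-0001.xml',
    '0000031_18600104_0001.xml',
    '0000031_18600104_0005.jp2',
    'test.xml',
]
selected = select_issue_xml(files, '1860', '01', '04')
print(selected)
```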

In [None]:
last_nlp = ''
i = 0
for root,dirs,files in tqdm(os.walk(source_path)):
    if root.endswith('BL_old'):
        continue
    if len(files)!=0:
        curr_nlp = root.split('/')[5]
        if curr_nlp != last_nlp and len(curr_nlp)==7:
            print(f"starting {i}, curr_nlp={curr_nlp}")
            last_nlp = curr_nlp
            i+=1
        # only consider cases where we are in the issue files
        if len(dirs) != 0:
            print(f"Found a problematic case!! root: {root}, dirs: {dirs}, files: {files}")

#### Simple debug code to ensure the mapping of alias to NLP is correct in the script

In [101]:
sample_data_dir = "/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL"
title_alias_mapping_file= "BL_title_alias_mapping.csv"

nlp_alias_df = pd.read_csv(os.path.join(sample_data_dir, title_alias_mapping_file), index_col=0)
nlp_alias_df.head()

Unnamed: 0,Normalized Working Title,Alias,Start Year,End Year,NLPs,BL Working Titles,Variant Titles
0,Aberdeen Press and Journal,ANJO,1789,1900,"['0000031', '0000032']",['Aberdeen Press and Journal'],"['Aberdeen Journal and General Advertiser', 'A..."
1,Alston Herald and East Cumberland Advertiser,AHEC,1875,1880,['0003043'],['Alston Herald and East Cumberland Advertiser'],"['Alston Herald, and East Cumberland Advertise..."
2,Baldwin's London Weekly Journal,BLWJ,1803,1836,['0002243'],"[""Baldwin's London Weekly Journal""]","[""Baldwin's London Weekly Journal, etc""]"
3,Baner ac Amserau Cymru,BNER,1857,1900,"['0000036', '0000037']",['Baner ac Amserau Cymru'],"['Baner Cymru', 'Baner ac Amserau Cymru']"
4,Bargoed Journal,BGJO,1904,1912,"['0003104', '0003548']",['Bargoed Journal'],"['Bargoed Journal', 'New Tredegar, Bargoed & C..."


In [102]:
literal_eval(nlp_alias_df['NLPs'][0])

['0000031', '0000032']

In [None]:
alias_to_nlps = nlp_alias_df[['Alias', 'NLPs']].to_dict(orient='records')
print(alias_to_nlps[:5])

nlp_to_alias2 = {nlp: record['Alias'] for record in alias_to_nlps for nlp in literal_eval(record['NLPs'])}
nlp_to_alias2

In [104]:
nlp_to_alias == nlp_to_alias2

True

## 3. Reorganize the images to fit IIIF requirements

In [2]:
len(PARTNER_TO_MEDIA['BL'])

374

In [None]:
def fixed_dir(issue_dir):
    split_dir = issue_dir.split('/')
    # drop the NLP component (index 4) and append the default edition 'a'
    return '/'.join(split_dir[:4] + split_dir[5:] + ['a'])

In [3]:
fixed_dir('/mnt/impresso_images_BL/ANJO/0000031/1824/01/07')

'/mnt/impresso_images_BL/ANJO/1824/01/07/a'

In [4]:
def get_issues_list(alias, source_path='/mnt/impresso_images_BL'):
    return [(dirpath, fixed_dir(dirpath)) for dirpath, dirnames, _ in os.walk(os.path.join(source_path, alias)) if not dirnames]

In [32]:
def get_issues_list_2(alias, source_path='/mnt/impresso_images_BL'):
    base_path = os.path.join(source_path, alias)
    issues_list = []

    def find_leaf_dirs(path):
        with os.scandir(path) as entries:
            subdirs = [entry for entry in entries if entry.is_dir()]
            if not subdirs:  # If no subdirectories, it's a leaf directory
                issues_list.append((path, fixed_dir(path)))
            else:
                for subdir in subdirs:
                    find_leaf_dirs(subdir.path)

    find_leaf_dirs(base_path)
    return issues_list

In [None]:
bl_aliases = db.from_sequence(PARTNER_TO_MEDIA['BL'])

with ProgressBar():
    bl_issue_paths = bl_aliases.map(lambda x: {x: get_issues_list(x)}).compute()#.take(10, compute=False).compute()

with open("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BL_issue_paths.json", 'w') as f:
    json.dump(bl_issue_paths, f)

bl_issue_paths

In [None]:
with open("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BL_issue_paths.json", 'r') as f:
    bl_issue_paths = json.load(f)

bl_issue_paths

### Create the final Alias to NLP mapping

In [15]:
alias_to_nlp = {alias: os.listdir(os.path.join('/mnt/impresso_ocr_BL/', alias)) for alias in PARTNER_TO_MEDIA['BL'] if alias not in ['DCWA', 'MEXM']}
alias_to_nlp['ANJO'] = ['0000031', '0000032']
alias_to_nlp

with open("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BL_alias_to_NLP.json", 'w', encoding='utf-8') as fin:
    json.dump(alias_to_nlp, fin, indent=4)

In [3]:
with open("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BL_alias_to_NLP.json", 'r', encoding='utf-8') as f:
    alias_to_nlp = json.load(f)

In [4]:
for alias in PARTNER_TO_MEDIA['BL']:
    assert alias in alias_to_nlp or alias in ['DCWA', 'MEXM']

### Perform the final move

In [5]:
def move_issue_imgs(dirs):
    alias = dirs[0].split('/')[3]
    status = 1

    # Ensure the source directory exists before listing it (os.listdir would raise otherwise)
    if not os.path.exists(dirs[0]):
        print(f"{alias} - Source directory {dirs[0]} does not exist!.")
        # status code 0: FileNotFoundError
        return 0, dirs

    og_list = os.listdir(dirs[0])
    
    try:
        # ensure the destination directory exists
        os.makedirs(dirs[1])
    except FileExistsError:
        # if dest dir already exists, two options:
        # 1. the issue had already been moved (len(og_list) == 0 and len(os.listdir(dirs[1])) != 0) --> skip
        # 2. there was already an issue for that day (len(og_list) != 0 and len(os.listdir(dirs[1])) != 0) --> move to next edition

        if len(os.listdir(dirs[1])) != 0:
            if len(og_list) == 0:
                print(f"{alias}-{dirs} - Dest dir already exists! had already been moved!")
                # status code 2: already moved
                return 2, dirs
        
            if len(og_list) != 0:
                new_ed = chr(ord('a')+len(os.listdir(dirs[1][:-1])))
                new_dest = dirs[1][:-1]+new_ed
                print(f"{alias}-{dirs} - Dest dir already exists, there was already an issue for that day! using edition {new_ed}, new_dest_dir: {new_dest}")
                # status code 4: other edition for the same day
                # (rebuild the pair instead of assigning in place, so tuples are supported too)
                dirs = [dirs[0], new_dest]
                os.makedirs(dirs[1])
                status = 4
    
    # Move the files or directories
    for f in og_list:
        shutil.move(os.path.join(dirs[0], f), dirs[1])

    if len(og_list) == len(os.listdir(dirs[1])):
        # status code 1: all good
        return status, dirs

    print(f"{alias} - len(og_list) != len(os.listdir(dirs[1])) - Another problem occurred!")
    # status code 3: problem during operation
    return 3, dirs
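
The edition-letter rule used in `move_issue_imgs` can be isolated in a small sketch. `next_edition_dir` is a hypothetical helper written for illustration only; like the function above, it assumes that edition directories are created contiguously starting from `a`, so that the number of existing editions gives the next letter.

```python
import os
import tempfile


def next_edition_dir(dest_dir: str) -> str:
    """Given `.../YYYY/MM/DD/a`, return the edition dir to use next (hypothetical helper)."""
    day_dir = os.path.dirname(dest_dir)
    existing = os.listdir(day_dir) if os.path.isdir(day_dir) else []
    # 0 existing editions -> 'a', 1 -> 'b', 2 -> 'c', ...
    return os.path.join(day_dir, chr(ord('a') + len(existing)))


with tempfile.TemporaryDirectory() as tmp:
    day = os.path.join(tmp, 'ANJO', '1824', '01', '07')
    first = next_edition_dir(os.path.join(day, 'a'))
    os.makedirs(first)
    second = next_edition_dir(os.path.join(day, 'a'))
    editions = (os.path.basename(first), os.path.basename(second))

print(editions)
```

Since the letter is derived from a directory listing, two threads handling the same day at the same time could in principle pick the same letter; this sketch does not guard against that.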

In [9]:
with open("/home/piconti/impresso-text-acquisition/text_preparation/data/logs/processing_logs/img_move.log", 'r', encoding='utf-8') as fout:
    final_log=json.load(fout)

In [6]:
last_log_name = '2025-04-09-img_move.log'
final_log_dir = "/home/piconti/impresso-text-acquisition/text_preparation/data/logs/processing_logs"
with open(os.path.join(final_log_dir, last_log_name), 'r', encoding='utf-8') as fout:
    final_log=json.load(fout)

In [None]:
num_issues_per_alias = {}
no_sources = final_log['no_source_dirs']
other_problems = final_log['other_problems']
correct_moves = final_log['correct_moves']
moved_prior = final_log['moved_prior']
if 'other_editions' in final_log:
    other_editions = final_log['other_editions']
else:
    other_editions = {}

for idx, alias_dict in enumerate(bl_issue_paths):
    for alias, dirs_list in alias_dict.items():
        if len(dirs_list)>0:
            print(f"\n\n{'-'*15} {alias} ({idx+1}/{len(bl_issue_paths)}) {'-'*15}")
            num_issues_per_alias[alias] = len(dirs_list)
            if any([os.path.exists(os.path.join('/mnt/impresso_images_BL/', alias, nlp)) for nlp in alias_to_nlp[alias]]):
                
                # Use ThreadPoolExecutor for parallel I/O-bound tasks
                with ThreadPoolExecutor() as executor:
                    results = list(executor.map(move_issue_imgs, dirs_list))

                # Filter results based on the status code (all correct)
                all_good = [res[1] for res in results if res[0] == 1]

                correct_moves[alias] = all_good

                if len(results) != len(all_good):
                    already_moved = [res[1] for res in results if res[0] == 2]
                    no_source = [res[1] for res in results if res[0] == 0]
                    other = [res[1] for res in results if res[0] == 3]
                    other_edition = [res[1] for res in results if res[0] == 4]

                    msg = (
                        f"{alias} - Finished moving all, but there were problems!\n"
                        f"- {len(all_good)} issues were moved without problems\n"
                        f"- {len(already_moved)} issues had already been moved\n"
                        f"- {len(no_source)} issues had no source dir (saved)\n"
                        f"- {len(other)} issues had other problems\n"
                        f"- {len(other_edition)} issues had other editions for the same day\n"
                    )
                    if len(no_source) != 0:
                        no_sources[alias] = no_source
                    if len(other) != 0:
                        other_problems[alias] = other
                    if len(already_moved) != 0:
                        moved_prior[alias] = already_moved
                    if len(other_edition) != 0:
                        other_editions[alias] = other_edition
                    print(msg)
                else:
                    print(f"{alias} - Finished moving all without any problems!\n")

                # only delete the file structure if there were no problems
                if alias not in no_sources and alias not in other_problems:
                    # now that the move is done, delete the now empty filestructure:
                    for nlp in alias_to_nlp[alias]:
                        print(f"{alias} - removing the empty filestructure for {nlp}")
                        shutil.rmtree(os.path.join('/mnt/impresso_images_BL/', alias, nlp))

                final_log = {
                    'correct_moves': correct_moves,
                    'moved_prior': moved_prior,
                    'no_source_dirs': no_sources,
                    'other_problems': other_problems,
                    'other_editions': other_editions,
                    'num_issues_per_alias': num_issues_per_alias,
                }

                with open(os.path.join(final_log_dir, "2025-04-10-img_move.log"), 'w', encoding='utf-8') as fin:
                    json.dump(final_log, fin, indent=4)

            else:
                print(f"\n\n{alias} - already processed prior")

print("DONE!!")

In [14]:
final_log = {
    'correct_moves': correct_moves,
    'moved_prior': moved_prior,
    'no_source_dirs': no_sources,
    'other_problems': other_problems,
    'other_editions': other_editions,
    'num_issues_per_alias': num_issues_per_alias,
}

with open(os.path.join(final_log_dir, "2025-04-09-img_move.log"), 'w', encoding='utf-8') as fin:
    json.dump(final_log, fin, indent=4)

## 4. Remove duplicate issues when necessary

- BGCH (NLPs 0003056 and 0003057) --> the entirety of 1892 (53 issues) as well as 29 issues in 1888 (06/15 to 12/28), which have duplicated images in their corresponding /mnt/impresso_images_BL/BGCH/1892/MM/DD/a dir (one image each for 0003056 and 0003057).
  - Choose the NLP for which to keep the issues, based on OCR format and quality, and delete the others. Note: at first glance, the images from 0003057 seem to be consistently less "slanted" than the ones from 0003056.

- CGFG (NLPs 0002424 and 0002425) --> 14 issues in 1841 (01/16 to 04/17), which were separated into "edition" subfolders a & b
  - Ensure these are indeed duplicated and choose the NLP for which to keep the issues.

- CCGZ (NLPs 0004682 and 0004683) --> 1 issue on CCGZ/1840/01/11
  - Ensure these are indeed duplicated and choose the NLP for which to keep the issue.

- IPJO (NLPs 0000071 and 0000191) --> 51 issues in 1778 (all but 1778/05/02), 52 in 1800 (all but 1800/05/22), 52 in 1748 (all but 1748/01/31), 52 in 1785 (all but 1785/06/18), 3 in 1765 (01/19, 01/26 and 09/28), 2 in 1764 (09/15, 06/30), 3 in 1763 (02/12, 10/08, 10/29), 51 in 1790 (all but 05/29), 52 in 1749 (all), 19 in 1750, 52 in 1773 (all but 01/30), 52 in 1789 (all but 10/03)
  - Perform a more precise, in-depth check of the situation and choose the best solution accordingly.
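
The overlaps listed above can be re-derived from the original delivery. The following is a minimal sketch, assuming the original data is laid out as `<nlp>/<year>/<MMDD>/` (the layout used by the issue paths later in this section). The helper names `shared_issue_dates` and `count_per_year` are hypothetical and not part of the existing code base.

```python
import os
from collections import Counter


def shared_issue_dates(base_dir, nlp_a, nlp_b):
    """Return the sorted (year, MMDD) pairs that exist under both NLP directories."""

    def issue_dates(nlp):
        # Collect every <year>/<MMDD> directory found under base_dir/<nlp>.
        dates = set()
        nlp_dir = os.path.join(base_dir, nlp)
        if not os.path.isdir(nlp_dir):
            return dates
        for year in os.listdir(nlp_dir):
            year_dir = os.path.join(nlp_dir, year)
            if not os.path.isdir(year_dir):
                continue
            for mmdd in os.listdir(year_dir):
                if os.path.isdir(os.path.join(year_dir, mmdd)):
                    dates.add((year, mmdd))
        return dates

    return sorted(issue_dates(nlp_a) & issue_dates(nlp_b))


def count_per_year(dates):
    """Count the shared issue dates for each year."""
    return dict(Counter(year for year, _ in dates))
```

Calling `count_per_year(shared_issue_dates(bl_og_data_dir, "0003056", "0003057"))` would then give the number of potentially duplicated issues per year for BGCH. Sharing a date only makes two issues candidates for duplication: the content still has to be compared, as done in the experiments below.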

#### `BGCH` (NLPs `0003056` and `0003057`) 

- 1892: 52 issues
- 1888: 29 issues

In [11]:
def coords_to_xy(coords):
    return [coords[0], coords[1], coords[0]+coords[2], coords[1]+coords[3]]


def draw_box_on_img(base_img_path, coords_xy, img=None, width=10):
    # open the base image only when no already-loaded image is passed in
    if img is None:
        img = Image.open(base_img_path)
    ImageDraw.Draw(img).rectangle(coords_xy, outline="red", width=width)
    return img

def read_xml(file_path):
    with open(file_path, 'rb') as f:
        raw_xml = f.read()

    return BeautifulSoup(raw_xml, 'xml')

In [7]:
def print_blocks_on_page(page_xml_cnt, img_path, pg_num, issue_og_path, rescale_block_coords=False):
    pg_block_xy_coords = [coords_to_xy(alto.distill_coordinates(block)) for block in page_xml_cnt.find_all('TextBlock')]

    #print(pg_7_57_11_11.find_all('TextBlock')[0])
    print(len(pg_block_xy_coords), f" blocks found in page {pg_num} of {issue_og_path}.")

    pil_img = Image.open(img_path)

    # fetch necessary info to rescale the coords
    if rescale_block_coords:
        pg_ocr_height = int(page_xml_cnt.find_all('Page')[0]['HEIGHT'])
        pg_ocr_width = int(page_xml_cnt.find_all('Page')[0]['WIDTH'])
        pg_ocr_size = (pg_ocr_width, pg_ocr_height)
        pg_img_size = pil_img.size

        pg_block_xy_resc_coords = [rescale_coords(c, pg_ocr_size, pg_img_size) for c in pg_block_xy_coords]
    else:
        pg_block_xy_resc_coords = pg_block_xy_coords

    for coords in pg_block_xy_resc_coords:
        pil_img = draw_box_on_img(None, coords, img=pil_img, width=20)

    return pil_img

In [14]:
def test_cases_from_date(alias, year, month, day, pg_num, nlps, bl_og_dir, sample_data_dir, save_results=True, check_rescale=True):
    # Gather everything needed per NLP in one tuple, so that a page missing for
    # one NLP cannot shift the XML, image and issue paths of the others.
    pages = []
    pg_str = str(pg_num).zfill(4)
    for (intern_alias, nlp) in nlps:
        issue_path = f"{nlp}/{year}/{month}{day}"

        # the OCR file is named either after the internal alias or after the NLP
        pg_ocr_path = os.path.join(bl_og_dir, issue_path, f"{intern_alias}-{year}-{month}-{day}-{pg_str}.xml")
        if not os.path.exists(pg_ocr_path):
            pg_ocr_path = os.path.join(bl_og_dir, issue_path, f"{nlp}_{year}{month}{day}_{pg_str}.xml")
            if not os.path.exists(pg_ocr_path):
                print(f"File not found: {pg_ocr_path}")
                continue

        pg_img_path = os.path.join(bl_og_dir, issue_path, f"{nlp}_{year}{month}{day}_{pg_str}.jp2")
        pages.append((read_xml(pg_ocr_path), pg_img_path, issue_path, intern_alias, nlp))

    for pg_xml, pg_img_path, issue_path, intern_alias, nlp in pages:
        img_not_rsc = print_blocks_on_page(pg_xml, pg_img_path, pg_num, issue_path)
        img_rsc = None
        if check_rescale:
            img_rsc = print_blocks_on_page(pg_xml, pg_img_path, pg_num, issue_path, rescale_block_coords=True)

        if save_results:
            print(f"Saving results for {year}-{month}-{day} - pg {pg_num} - alias: {intern_alias}, nlp = {nlp}")
            out_prefix = os.path.join(sample_data_dir, f"{alias}_{nlp}_{year}_{month}_{day}_pg_{pg_num}_blocks")
            try:
                img_not_rsc.save(f"{out_prefix}_wrg_scale.jpg")
                if check_rescale:
                    img_rsc.save(f"{out_prefix}_rescaled.jpg")
            except Exception as e:
                # JPEG has no alpha channel, so e.g. RGBA images must be converted to RGB first
                print(f"Error saving images: {e}")
                img_not_rsc.convert('RGB').save(f"{out_prefix}_wrg_scale.jpg")
                if check_rescale:
                    img_rsc.convert('RGB').save(f"{out_prefix}_rescaled.jpg")

In [9]:
bl_og_data_dir = "/mnt/project_impresso/original/BL_old"

#### 1892/11/11

In [4]:
issue_path_56_11_11 = "0003056/1892/1111"
issue_path_57_11_11 = "0003057/1892/1111"

pg_7_path_56_11_11 = os.path.join(bl_og_data_dir, issue_path_56_11_11, "0003056_18921111_0007.xml")
pg_7_path_57_11_11 = os.path.join(bl_og_data_dir, issue_path_57_11_11, "0003057_18921111_0007.xml")

In [5]:
pg_7_56_11_11 = read_xml(pg_7_path_56_11_11)
pg_7_57_11_11 = read_xml(pg_7_path_57_11_11)

In [None]:
pg_7_block_0_56_coords = alto.distill_coordinates(pg_7_56_11_11.find_all('TextBlock')[0])
print(pg_7_56_11_11.find_all('TextBlock')[0])

pg_7_block_40_56_coords = alto.distill_coordinates(pg_7_56_11_11.find_all('TextBlock')[40])
print(pg_7_56_11_11.find_all('TextBlock')[40])

pg_7_block_0_56_coords, pg_7_block_40_56_coords

In [None]:
pg_7_56_img_path = os.path.join(bl_og_data_dir, issue_path_56_11_11, "0003056_18921111_0007.jp2")
img = draw_box_on_img(pg_7_56_img_path, coords_to_xy(pg_7_block_0_56_coords), width=10)
img = draw_box_on_img(None, coords_to_xy(pg_7_block_40_56_coords), img=img, width=20)
img.show()

In [None]:
pg_7_block_57_xy_coords = [coords_to_xy(alto.distill_coordinates(block)) for block in pg_7_57_11_11.find_all('TextBlock')]
#print(pg_7_57_11_11.find_all('TextBlock')[0])
print(len(pg_7_block_57_xy_coords), " blocks found in page 7 of 0003057.")
pg_7_57_img_path = os.path.join(bl_og_data_dir, issue_path_57_11_11, "0003057_18921111_0007.jp2")
img_57 = draw_box_on_img(pg_7_57_img_path, pg_7_block_57_xy_coords[0], width=20)
for coords in pg_7_block_57_xy_coords[1:]:
    img_57 = draw_box_on_img(None, coords, img=img_57, width=20)
img_57.show()

In [None]:
pg_7_block_56_xy_coords = [coords_to_xy(alto.distill_coordinates(block)) for block in pg_7_56_11_11.find_all('TextBlock')]
#print(pg_7_57_11_11.find_all('TextBlock')[0])
print(len(pg_7_block_56_xy_coords), " blocks found in page 7 of 0003056.")
pg_7_56_img_path = os.path.join(bl_og_data_dir, issue_path_56_11_11, "0003056_18921111_0007.jp2")
img_56 = draw_box_on_img(pg_7_56_img_path, pg_7_block_56_xy_coords[0], width=20)
for coords in pg_7_block_56_xy_coords[1:]:
    img_56 = draw_box_on_img(None, coords, img=img_56, width=20)
img_56.show()

In [10]:
img_57.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BGCH_0003057_1892_11_11_pg_7_blocks.jpg")
img_56.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BGCH_0003056_1892_11_11_pg_7_blocks.jpg")

#### 1888/08/31

In [12]:
issue_path_56_08_31 = "0003056/1888/0831"
issue_path_57_08_31 = "0003057/1888/0831"

pg_2_path_56_08_31 = os.path.join(bl_og_data_dir, issue_path_56_08_31, "0003056_18880831_0002.xml")
pg_2_path_57_08_31 = os.path.join(bl_og_data_dir, issue_path_57_08_31, "0003057_18880831_0002.xml")

pg_2_56_08_31 = read_xml(pg_2_path_56_08_31)
pg_2_57_08_31 = read_xml(pg_2_path_57_08_31)

In [None]:
pg_2_0831_block_57_xy_coords = [coords_to_xy(alto.distill_coordinates(block)) for block in pg_2_57_08_31.find_all('TextBlock')]
#print(pg_7_57_11_11.find_all('TextBlock')[0])
print(len(pg_2_0831_block_57_xy_coords), f" blocks found in page 2 of {issue_path_57_08_31}.")
pg_2_0831_57_img_path = os.path.join(bl_og_data_dir, issue_path_57_08_31, "0003057_18880831_0002.jp2")
img_57_0831 = draw_box_on_img(pg_2_0831_57_img_path, pg_2_0831_block_57_xy_coords[0], width=20)
for coords in pg_2_0831_block_57_xy_coords[1:]:
    img_57_0831 = draw_box_on_img(None, coords, img=img_57_0831, width=20)
img_57_0831.show()

In [None]:
pg_2_0831_block_56_xy_coords = [coords_to_xy(alto.distill_coordinates(block)) for block in pg_2_56_08_31.find_all('TextBlock')]
#print(pg_7_57_11_11.find_all('TextBlock')[0])
print(len(pg_2_0831_block_56_xy_coords), f" blocks found in page 2 of {issue_path_56_08_31}.")
pg_2_0831_56_img_path = os.path.join(bl_og_data_dir, issue_path_56_08_31, "0003056_18880831_0002.jp2")
img_56_0831 = draw_box_on_img(pg_2_0831_56_img_path, pg_2_0831_block_56_xy_coords[0], width=20)
for coords in pg_2_0831_block_56_xy_coords[1:]:
    img_56_0831 = draw_box_on_img(None, coords, img=img_56_0831, width=20)
img_56_0831.show()

In [19]:
img_57_0831.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BGCH_0003057_1888_08_31_pg_2_blocks.jpg")
img_56_0831.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/BGCH_0003056_1888_08_31_pg_2_blocks.jpg")

#### Testing for more examples of BGCH

In [None]:
sample_data_dir = "/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/preprocessing_experiments"

nlps = [('BGCH', '0003057'), ('BGCH', '0003056')]


cases_bgch = [
    {'year': 1892, 'month': '01', 'day': '01', 'pg_num': 1},
    {'year': 1892, 'month': '04', 'day': '15', 'pg_num': 2},
    {'year': 1892, 'month': '07', 'day': '08', 'pg_num': 4},
    {'year': 1892, 'month': '12', 'day': '30', 'pg_num': 1},
    {'year': 1888, 'month': '11', 'day': '02', 'pg_num': 2},
    {'year': 1888, 'month': '09', 'day': '07', 'pg_num': 3},
    {'year': 1888, 'month': '12', 'day': '14', 'pg_num': 1},
]

for idx, case in enumerate(cases_bgch):
    print(f"\n Testing case {idx+1}/{len(cases_bgch)}: for {case}")
    test_cases_from_date('BGCH', case['year'], case['month'], case['day'], case['pg_num'], nlps, bl_og_data_dir, sample_data_dir, check_rescale=False)


 Testing case 1/7: for {'year': 1892, 'month': '01', 'day': '01', 'pg_num': 1}
143  blocks found in page 1 of 0003057/1892/0101.
Saving results for 1892-01-01 - pg 1 - alias: BGCH, nlp = 0003057
157  blocks found in page 1 of 0003056/1892/0101.
Saving results for 1892-01-01 - pg 1 - alias: BGCH, nlp = 0003056

 Testing case 2/7: for {'year': 1892, 'month': '04', 'day': '15', 'pg_num': 2}


#### `IPJO` (NLPs `0000071` and `0000191`) 

- many issues for years: 1748, 1749, 1750, 1763, 1764, 1765, 1773, 1778, 1785, 1789, 1790, 1800.
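
For IPJO, the experiments below draw the text blocks twice: once with the raw ALTO coordinates (saved as `wrong_scale` / `wrg_scale`) and once after rescaling them from the OCR page size to the actual image size (saved as `rescaled`). The following is a minimal sketch of such a proportional rescaling, assuming boxes in `[x0, y0, x1, y1]` form and sizes given as `(width, height)`. The name `rescale_xy_box` is hypothetical, and this is not necessarily how `rescale_coords` from `text_preparation.utils` is implemented.

```python
def rescale_xy_box(box, ocr_size, img_size):
    """Scale an [x0, y0, x1, y1] box from OCR page space to image pixel space."""
    # ratio between the image dimensions and the dimensions declared in the OCR
    x_ratio = img_size[0] / ocr_size[0]
    y_ratio = img_size[1] / ocr_size[1]
    x0, y0, x1, y1 = box
    return [
        round(x0 * x_ratio),
        round(y0 * y_ratio),
        round(x1 * x_ratio),
        round(y1 * y_ratio),
    ]


# an image twice as large as the OCR page doubles every coordinate
print(rescale_xy_box([100, 200, 300, 400], (1000, 2000), (2000, 4000)))
# → [200, 400, 600, 800]
```

When the OCR page size and the image size are identical, the box is returned unchanged, so a visible difference between the two saved images signals a scale mismatch between the OCR and the scan.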

#### 1778/07/25

In [4]:
issue_path_71_07_25 = "0000071/1778/0725"
issue_path_191_07_25 = "0000191/1778/0725"

pg_3_path_71_07_25 = os.path.join(bl_og_data_dir, issue_path_71_07_25, "IPJO-1778-07-25-0003.xml")
pg_3_path_191_07_25 = os.path.join(bl_og_data_dir, issue_path_191_07_25, "IPJL-1778-07-25-0003.xml")

pg_3_71_07_25 = read_xml(pg_3_path_71_07_25)
pg_3_191_07_25 = read_xml(pg_3_path_191_07_25)

In [29]:
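# the image path is otherwise only defined in a later cell; define it here so this cell runs on its own
pg_3_0725_71_img_path = os.path.join(bl_og_data_dir, issue_path_71_07_25, "0000071_17780725_0003.jp2")
pg_3_0725_71_height = int(pg_3_71_07_25.find_all('Page')[0]['HEIGHT'])
pg_3_0725_71_width = int(pg_3_71_07_25.find_all('Page')[0]['WIDTH'])
img_71_0725 = Image.open(pg_3_0725_71_img_path).resize((pg_3_0725_71_width, pg_3_0725_71_height))
img_71_0725.size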

(2951, 4616)

In [None]:
pg_3_0725_block_71_xy_coords = [coords_to_xy(alto.distill_coordinates(block)) for block in pg_3_71_07_25.find_all('TextBlock')]
#print(pg_7_57_11_11.find_all('TextBlock')[0])
print(len(pg_3_0725_block_71_xy_coords), f" blocks found in page 3 of {issue_path_71_07_25}.")
pg_3_0725_71_img_path = os.path.join(bl_og_data_dir, issue_path_71_07_25, "0000071_17780725_0003.jp2")
img_71_0725 = draw_box_on_img(pg_3_0725_71_img_path, pg_3_0725_block_71_xy_coords[0], width=20)
for coords in pg_3_0725_block_71_xy_coords[1:]:
    img_71_0725 = draw_box_on_img(None, coords, img=img_71_0725, width=20)
img_71_0725.show()

In [None]:
pg_3_0725_block_191_xy_coords = [coords_to_xy(alto.distill_coordinates(block)) for block in pg_3_191_07_25.find_all('TextBlock')]
#print(pg_7_57_11_11.find_all('TextBlock')[0])
print(len(pg_3_0725_block_191_xy_coords), f" blocks found in page 3 of {issue_path_191_07_25}.")
pg_3_0725_191_img_path = os.path.join(bl_og_data_dir, issue_path_191_07_25, "0000191_17780725_0003.jp2")
img_191_0725 = draw_box_on_img(pg_3_0725_191_img_path, pg_3_0725_block_191_xy_coords[0], width=20)

"""pg_3_0725_191_height= int(pg_3_191_07_25.find_all('Page')[0]['HEIGHT'])
pg_3_0725_191_width= int(pg_3_191_07_25.find_all('Page')[0]['WIDTH'])
img_191_0725 = Image.open(pg_3_0725_191_img_path).resize((pg_3_0725_191_width, pg_3_0725_191_height))
print(img_191_0725.size)"""

for coords in pg_3_0725_block_191_xy_coords[1:]:
    img_191_0725 = draw_box_on_img(None, coords, img=img_191_0725, width=20)
img_191_0725.show()

In [9]:
img_71_0725.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/IPJO_0000071_1778_07_25_pg_3_blocks_wrong_scale.jpg")
img_191_0725.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/IPJO_0000191_1778_07_25_pg_3_blocks_wrong_scale.jpg")

In [None]:
pg_3_0725_block_71_xy_coords = [coords_to_xy(alto.distill_coordinates(block)) for block in pg_3_71_07_25.find_all('TextBlock')]

#print(pg_7_57_11_11.find_all('TextBlock')[0])
print(len(pg_3_0725_block_71_xy_coords), f" blocks found in page 3 of {issue_path_71_07_25}.")
pg_3_0725_71_img_path = os.path.join(bl_og_data_dir, issue_path_71_07_25, "0000071_17780725_0003.jp2")

# fetch necessary info to rescale the coords
pg_3_0725_71_ocr_height = int(pg_3_71_07_25.find_all('Page')[0]['HEIGHT'])
pg_3_0725_71_ocr_width = int(pg_3_71_07_25.find_all('Page')[0]['WIDTH'])
pg_3_0725_71_ocr_size = (pg_3_0725_71_ocr_width, pg_3_0725_71_ocr_height)
img_71_0725 = Image.open(pg_3_0725_71_img_path)
pg_3_0725_71_img_size = img_71_0725.size

pg_3_0725_block_71_xy_resc_coords = [rescale_coords(c, pg_3_0725_71_ocr_size, pg_3_0725_71_img_size) for c in pg_3_0725_block_71_xy_coords]

for coords in pg_3_0725_block_71_xy_resc_coords:
    img_71_0725 = draw_box_on_img(None, coords, img=img_71_0725, width=20)
img_71_0725.show()

In [None]:
pg_3_0725_block_191_xy_coords = [coords_to_xy(alto.distill_coordinates(block)) for block in pg_3_191_07_25.find_all('TextBlock')]

#print(pg_7_57_11_11.find_all('TextBlock')[0])
print(len(pg_3_0725_block_191_xy_coords), f" blocks found in page 3 of {issue_path_191_07_25}.")
pg_3_0725_191_img_path = os.path.join(bl_og_data_dir, issue_path_191_07_25, "0000191_17780725_0003.jp2")

# fetch necessary info to rescale the coords
pg_3_0725_191_ocr_height = int(pg_3_191_07_25.find_all('Page')[0]['HEIGHT'])
pg_3_0725_191_ocr_width = int(pg_3_191_07_25.find_all('Page')[0]['WIDTH'])
pg_3_0725_191_ocr_size = (pg_3_0725_191_ocr_width, pg_3_0725_191_ocr_height)
img_191_0725 = Image.open(pg_3_0725_191_img_path)
pg_3_0725_191_img_size = img_191_0725.size

pg_3_0725_block_191_xy_resc_coords = [rescale_coords(c, pg_3_0725_191_ocr_size, pg_3_0725_191_img_size) for c in pg_3_0725_block_191_xy_coords]

for coords in pg_3_0725_block_191_xy_resc_coords:
    img_191_0725 = draw_box_on_img(None, coords, img=img_191_0725, width=20)
img_191_0725.show()

In [15]:
img_71_0725.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/IPJO_0000071_1778_07_25_pg_3_blocks_rescaled.jpg")
img_191_0725.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/IPJO_0000191_1778_07_25_pg_3_blocks_rescaled.jpg")

#### 1763/10/08

In [40]:
issue_path_71_10_08 = "0000071/1763/1008"
issue_path_191_10_08 = "0000191/1763/1008"

pg_4_path_71_10_08 = os.path.join(bl_og_data_dir, issue_path_71_10_08, "IPJO-1763-10-08-0004.xml")
pg_4_path_191_10_08 = os.path.join(bl_og_data_dir, issue_path_191_10_08, "IPJL-1763-10-08-0004.xml")

pg_4_71_10_08 = read_xml(pg_4_path_71_10_08)
pg_4_191_10_08 = read_xml(pg_4_path_191_10_08)

pg_4_71_10_08_img_path = os.path.join(bl_og_data_dir, issue_path_71_10_08, "0000071_17631008_0004.jp2")
pg_4_191_10_08_img_path = os.path.join(bl_og_data_dir, issue_path_191_10_08, "0000191_17631008_0004.jp2")

In [None]:
img_71_10_08 = print_blocks_on_page(pg_4_71_10_08, pg_4_71_10_08_img_path, 4, issue_path_71_10_08)
img_71_10_08.show()

In [None]:
img_191_10_08 = print_blocks_on_page(pg_4_191_10_08, pg_4_191_10_08_img_path, 4, issue_path_191_10_08)
img_191_10_08.show()

In [None]:
img_71_10_08_rsc = print_blocks_on_page(pg_4_71_10_08, pg_4_71_10_08_img_path, 4, issue_path_71_10_08, rescale_block_coords=True)
img_71_10_08_rsc.show()

In [None]:
img_191_10_08_rsc = print_blocks_on_page(pg_4_191_10_08, pg_4_191_10_08_img_path, 4, issue_path_191_10_08, rescale_block_coords=True)
img_191_10_08_rsc.show()

In [50]:
img_71_10_08.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/IPJO_0000071_1763_10_08_pg_4_blocks_wrg_scale.jpg")
img_191_10_08.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/IPJO_0000191_1763_10_08_pg_4_blocks_wrg_scale.jpg")
img_71_10_08_rsc.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/IPJO_0000071_1763_10_08_pg_4_blocks_rescaled.jpg")
img_191_10_08_rsc.save("/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/IPJO_0000191_1763_10_08_pg_4_blocks_rescaled.jpg")

### Creating a function to repeat this experiment multiple times

In [68]:
sample_data_dir = "/home/piconti/impresso-text-acquisition/text_preparation/data/sample_data/BL/preprocessing_experiments"

nlps = [('IPJO', '0000071'), ('IPJL', '0000191')]


cases = [
    {'year': 1778, 'month': '12', 'day': '12', 'pg_num': 1},
    {'year': 1800, 'month': '09', 'day': '13', 'pg_num': 2},
    {'year': 1800, 'month': '04', 'day': '26', 'pg_num': 4},
    {'year': 1748, 'month': '06', 'day': '04', 'pg_num': 3},
    {'year': 1748, 'month': '01', 'day': '16', 'pg_num': 1},
    #{'year': 1785, 'month': '06', 'day': '18', 'pg_num': 2},
    {'year': 1785, 'month': '10', 'day': '22', 'pg_num': 3},
    #{'year': 1765, 'month': '06', 'day': '04', 'pg_num': 3},
    #{'year': 1765, 'month': '01', 'day': '16', 'pg_num': 1},
    {'year': 1764, 'month': '09', 'day': '15', 'pg_num': 2},
    {'year': 1764, 'month': '06', 'day': '30', 'pg_num': 4},
    {'year': 1763, 'month': '02', 'day': '12', 'pg_num': 1},
    {'year': 1763, 'month': '10', 'day': '29', 'pg_num': 3},
    {'year': 1790, 'month': '05', 'day': '08', 'pg_num': 2},
    {'year': 1790, 'month': '11', 'day': '20', 'pg_num': 4},
    {'year': 1749, 'month': '08', 'day': '26', 'pg_num': 1},
    {'year': 1749, 'month': '04', 'day': '01', 'pg_num': 4},
    {'year': 1750, 'month': '06', 'day': '23', 'pg_num': 1},
    {'year': 1750, 'month': '02', 'day': '03', 'pg_num': 2},
    {'year': 1773, 'month': '10', 'day': '02', 'pg_num': 2},
    {'year': 1773, 'month': '03', 'day': '06', 'pg_num': 3},
    {'year': 1789, 'month': '09', 'day': '05', 'pg_num': 2},
    {'year': 1789, 'month': '08', 'day': '08', 'pg_num': 1},
]

for idx, case in enumerate(cases):
    print(f"\n Testing case {idx+1}/{len(cases)}: for {case}")
    # the function expects the Impresso alias as its first argument
    test_cases_from_date('IPJO', case['year'], case['month'], case['day'], case['pg_num'], nlps, bl_og_data_dir, sample_data_dir)


 Testing case 1/23: for {'year': 1778, 'month': '12', 'day': '12', 'pg_num': 1}
9  blocks found in page 1 of 0000071/1778/1212.
9  blocks found in page 1 of 0000071/1778/1212.
Saving results for 1778-12-12 - pg 1 - alias: IPJO, nlp = 0000071
15  blocks found in page 1 of 0000191/1778/1212.
15  blocks found in page 1 of 0000191/1778/1212.
Saving results for 1778-12-12 - pg 1 - alias: IPJL, nlp = 0000191

 Testing case 2/23: for {'year': 1800, 'month': '09', 'day': '13', 'pg_num': 2}
0  blocks found in page 2 of 0000071/1800/0913.
0  blocks found in page 2 of 0000071/1800/0913.
Saving results for 1800-09-13 - pg 2 - alias: IPJO, nlp = 0000071
12  blocks found in page 2 of 0000191/1800/0913.
12  blocks found in page 2 of 0000191/1800/0913.
Saving results for 1800-09-13 - pg 2 - alias: IPJL, nlp = 0000191

 Testing case 3/23: for {'year': 1800, 'month': '04', 'day': '26', 'pg_num': 4}
0  blocks found in page 4 of 0000071/1800/0426.
0  blocks found in page 4 of 0000071/1800/0426.
Saving re

In [69]:
cases_2 = [
    {'year': 1765, 'month': '09', 'day': '28', 'pg_num': 3},
    {'year': 1765, 'month': '01', 'day': '19', 'pg_num': 1},
    {'year': 1785, 'month': '08', 'day': '13', 'pg_num': 2},
]


for idx, case in enumerate(cases_2):
    print(f"\n Testing case {idx+1}/{len(cases_2)}: for {case}")
    # the function expects the Impresso alias as its first argument
    test_cases_from_date('IPJO', case['year'], case['month'], case['day'], case['pg_num'], nlps, bl_og_data_dir, sample_data_dir)


 Testing case 1/3: for {'year': 1765, 'month': '09', 'day': '28', 'pg_num': 3}
10  blocks found in page 3 of 0000071/1765/0928.
10  blocks found in page 3 of 0000071/1765/0928.
Saving results for 1765-09-28 - pg 3 - alias: IPJO, nlp = 0000071
12  blocks found in page 3 of 0000191/1765/0928.
12  blocks found in page 3 of 0000191/1765/0928.
Saving results for 1765-09-28 - pg 3 - alias: IPJL, nlp = 0000191

 Testing case 2/3: for {'year': 1765, 'month': '01', 'day': '19', 'pg_num': 1}
13  blocks found in page 1 of 0000071/1765/0119.
13  blocks found in page 1 of 0000071/1765/0119.
Saving results for 1765-01-19 - pg 1 - alias: IPJO, nlp = 0000071
26  blocks found in page 1 of 0000191/1765/0119.
26  blocks found in page 1 of 0000191/1765/0119.
Saving results for 1765-01-19 - pg 1 - alias: IPJL, nlp = 0000191

 Testing case 3/3: for {'year': 1785, 'month': '08', 'day': '13', 'pg_num': 2}
30  blocks found in page 2 of 0000071/1785/0813.
30  blocks found in page 2 of 0000071/1785/0813.
Saving

In [None]:
cases_3 = [
    #{'year': 1800, 'month': '09', 'day': '13', 'pg_num': 1},
    {'year': 1800, 'month': '08', 'day': '02', 'pg_num': 1},
    {'year': 1800, 'month': '04', 'day': '26', 'pg_num': 2},
    {'year': 1778, 'month': '09', 'day': '12', 'pg_num': 1},
]


for idx, case in enumerate(cases_3):
    print(f"\n Testing case {idx+1}/{len(cases_3)}: for {case}")
    # the function expects the Impresso alias as its first argument
    test_cases_from_date('IPJO', case['year'], case['month'], case['day'], case['pg_num'], nlps, bl_og_data_dir, sample_data_dir)


 Testing case 1/3: for {'year': 1800, 'month': '08', 'day': '02', 'pg_num': 1}
101  blocks found in page 1 of 0000071/1800/0802.
101  blocks found in page 1 of 0000071/1800/0802.
Saving results for 1800-08-02 - pg 1 - alias: IPJO, nlp = 0000071
12  blocks found in page 1 of 0000191/1800/0802.
12  blocks found in page 1 of 0000191/1800/0802.
Saving results for 1800-08-02 - pg 1 - alias: IPJL, nlp = 0000191

 Testing case 2/3: for {'year': 1800, 'month': '04', 'day': '26', 'pg_num': 2}
23  blocks found in page 2 of 0000071/1800/0426.
23  blocks found in page 2 of 0000071/1800/0426.
Saving results for 1800-04-26 - pg 2 - alias: IPJO, nlp = 0000071
22  blocks found in page 2 of 0000191/1800/0426.
22  blocks found in page 2 of 0000191/1800/0426.
Saving results for 1800-04-26 - pg 2 - alias: IPJL, nlp = 0000191

 Testing case 3/3: for {'year': 1778, 'month': '09', 'day': '12', 'pg_num': 1}
27  blocks found in page 1 of 0000071/1778/0912.
27  blocks found in page 1 of 0000071/1778/0912.
Savi

In [76]:
cases_4 = [
    #{'year': 1800, 'month': '09', 'day': '13', 'pg_num': 1},
    {'year': 1800, 'month': '01', 'day': '11', 'pg_num': 4},
    {'year': 1800, 'month': '03', 'day': '29', 'pg_num': 4},
    {'year': 1800, 'month': '01', 'day': '25', 'pg_num': 4},
    {'year': 1800, 'month': '10', 'day': '04', 'pg_num': 4},
    {'year': 1800, 'month': '11', 'day': '29', 'pg_num': 4},
]

for idx, case in enumerate(cases_4):
    print(f"\n Testing case {idx+1}/{len(cases_4)}: for {case}")
    # the function expects the Impresso alias as its first argument
    test_cases_from_date('IPJO', case['year'], case['month'], case['day'], case['pg_num'], nlps, bl_og_data_dir, sample_data_dir)


 Testing case 1/3: for {'year': 1800, 'month': '01', 'day': '11', 'pg_num': 4}
0  blocks found in page 4 of 0000071/1800/0111.
0  blocks found in page 4 of 0000071/1800/0111.
Saving results for 1800-01-11 - pg 4 - alias: IPJO, nlp = 0000071
Error saving images: cannot write mode RGBA as JPEG
8  blocks found in page 4 of 0000191/1800/0111.
8  blocks found in page 4 of 0000191/1800/0111.
Saving results for 1800-01-11 - pg 4 - alias: IPJL, nlp = 0000191

 Testing case 2/3: for {'year': 1800, 'month': '03', 'day': '29', 'pg_num': 4}
0  blocks found in page 4 of 0000071/1800/0329.
0  blocks found in page 4 of 0000071/1800/0329.
Saving results for 1800-03-29 - pg 4 - alias: IPJO, nlp = 0000071
Error saving images: cannot write mode RGBA as JPEG
19  blocks found in page 4 of 0000191/1800/0329.
19  blocks found in page 4 of 0000191/1800/0329.
Saving results for 1800-03-29 - pg 4 - alias: IPJL, nlp = 0000191

 Testing case 3/3: for {'year': 1800, 'month': '01', 'day': '25', 'pg_num': 4}
File n