This notebook will:
* Create a dataframe to store the OCR'd text, with a row for each issue (based on metadata spreadsheet already compiled)
* Ingest the OCR'd text into the dataframe
* Check that the ingest has worked and that the text is associated with the correct metadata
* Pickle the dataframe for later use


In [1]:
#set working directory
import os
os.chdir("C:\\Users\\jakeb\\OneDrive\\birkbeck\\PG Cert\\Project\\Polymags\\Polydata")#this path needs to contain the filelist.csv

In [2]:
#create dataframe of metadata; the OCR'd text will be added to it as a new column
import pandas as pd
df = pd.read_csv('filelist.csv')
df

               Filename                           Type     Date  Volume Issue  \
0         CWSupp007.pdf  Christian Workers’ Supplement  1891-07     NaN   NaN   
1         CWSupp008.pdf  Christian Workers’ Supplement  1891-09     NaN   NaN   
2         CWSupp009.pdf  Christian Workers’ Supplement  1891-10     NaN   NaN   
3         CWSupp010.pdf  Christian Workers’ Supplement  1891-11     NaN   NaN   
4         CWSupp011.pdf  Christian Workers’ Supplement  1891-12     NaN   NaN   
...                 ...                            ...      ...     ...   ...   
1720  Quintinian005.pdf                     Quintinian  1892-08     NaN     5   
1721  Quintinian006.pdf                     Quintinian  1892-09     NaN     6   
1722  Quintinian007.pdf                     Quintinian  1892-10     NaN     7   
1723  Quintinian008.pdf                     Quintinian  1892-11     NaN     8   
1724  Quintinian009.pdf                     Quintinian  1892-12     NaN     9   

      Pages  OCR batch  
0 

In [4]:
#create a Text column to hold the OCR'd text and an OCR filename column to hold the name of the OCR file (for checking it matches up with the original filename & metadata)
df['Text'] = ''
df['OCR filename'] = ''
df

Unnamed: 0,Filename,Type,Date,Volume,Issue,Pages,OCR batch,Text,OCR filename
0,CWSupp007.pdf,Christian Workers’ Supplement,1891-07,,,4,1,,
1,CWSupp008.pdf,Christian Workers’ Supplement,1891-09,,,4,1,,
2,CWSupp009.pdf,Christian Workers’ Supplement,1891-10,,,4,1,,
3,CWSupp010.pdf,Christian Workers’ Supplement,1891-11,,,4,1,,
4,CWSupp011.pdf,Christian Workers’ Supplement,1891-12,,,4,1,,
...,...,...,...,...,...,...,...,...,...
1720,Quintinian005.pdf,Quintinian,1892-08,,5,6,12,,
1721,Quintinian006.pdf,Quintinian,1892-09,,6,2,12,,
1722,Quintinian007.pdf,Quintinian,1892-10,,7,8,12,,
1723,Quintinian008.pdf,Quintinian,1892-11,,8,8,12,,


In [9]:
#get list of OCR files
import os
os.chdir("C:\\Users\\jakeb\\OneDrive\\birkbeck\\PG Cert\\Project\\Polymags\\Polymags for processing\\All_OCR_Files_Corrected")#this folder needs to contain the OCR'd text files. You can download these from https://github.com/jakebickford/PolyMags/
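files = os.listdir() #note: os.listdir() returns names in arbitrary order, so the order is not guaranteed to match filelist.csv
print(files)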

['CWSupp007.pdf.tiff.txt', 'CWSupp008.pdf.tiff.txt', 'CWSupp009.pdf.tiff.txt', 'CWSupp010.pdf.tiff.txt', 'CWSupp011.pdf.tiff.txt', 'CWSupp012.pdf.tiff.txt', 'HolidayGuide.pdf.tiff.txt', 'HT0010001.pdf.tiff.txt', 'HT0010002.pdf.tiff.txt', 'HT0010003.pdf.tiff.txt', 'HT0010004.pdf.tiff.txt', 'HT0010005.pdf.tiff.txt', 'HT0010006.pdf.tiff.txt', 'HT0010007.pdf.tiff.txt', 'HT0010008.pdf.tiff.txt', 'HT0010009.pdf.tiff.txt', 'HT0010010.pdf.tiff.txt', 'HT0010011.pdf.tiff.txt', 'HT0010012.pdf.tiff.txt', 'HT0010013.pdf.tiff.txt', 'HT0010014.pdf.tiff.txt', 'HT0010015.pdf.tiff.txt', 'HT0010016.pdf.tiff.txt', 'HT0010017.pdf.tiff.txt', 'HT0010018.pdf.tiff.txt', 'HT0020019.pdf.tiff.txt', 'HT0020020.pdf.tiff.txt', 'HT0020021.pdf.tiff.txt', 'HT0020022.pdf.tiff.txt', 'HT0020023.pdf.tiff.txt', 'HT0020024.pdf.tiff.txt', 'HT0020025.pdf.tiff.txt', 'HT0020026.pdf.tiff.txt', 'HT0020027.pdf.tiff.txt', 'HT0020028.pdf.tiff.txt', 'HT0020029.pdf.tiff.txt', 'HT0020030.pdf.tiff.txt', 'HT0030031.pdf.tiff.txt', 'HT00300
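
Before ingesting, it may help to confirm that every PDF listed in the metadata has a matching OCR file and that the folder holds nothing unexpected. The sketch below is one possible way to do that, assuming each OCR file is named as the PDF filename plus `.tiff.txt` (as in the listing above). The helper `compare_filenames` and the toy data are hypothetical; in the notebook you would pass `df['Filename']` and `files` instead.

```python
import pandas as pd

def compare_filenames(pdf_names, ocr_names, suffix=".tiff.txt"):
    """Return (missing, unexpected) OCR filenames relative to the metadata."""
    expected = {name + suffix for name in pdf_names}
    found = set(ocr_names)
    return sorted(expected - found), sorted(found - expected)

# Toy stand-ins for df['Filename'] and the `files` list (illustrative values only)
toy_df = pd.DataFrame({"Filename": ["CWSupp007.pdf", "HT0010001.pdf", "Quintinian001.pdf"]})
toy_files = ["CWSupp007.pdf.tiff.txt", "HT0010001.pdf.tiff.txt", "notes.txt"]

missing, unexpected = compare_filenames(toy_df["Filename"], toy_files)
print("Missing OCR files:", missing)
print("Unexpected files:", unexpected)
```

If both lists come back empty, the metadata and the folder contain the same set of issues, although this says nothing yet about their order.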

In [10]:

#testing opening a file
with open('HT0530005.pdf.tiff.txt', encoding='utf8') as f:
    content = f.read()

print(content)


  

A

 

AZI\

7

   

ph BE”

)

THE OFFICIAL ORGAN OF

THE POLYTECHNIC Y.M.C.lI.

FOUNDED BY QUINTIN HOGG.

309 REGENT STREET, LONDON, W.

PATRONS:

THEIR MAJESTIES THE KING AND QUEEN.

 

 

 

Voi. LIII.—No. 5. New SERIES.

JULY,

1913. Pric—E TWopPENCE.

 

 

In and Hbout the Poly.

World’s Record. Itearty congratulations to W. R. Apple-
garth on having achieved the World’s
Amateur Record for the 150 yards. The excellent eflort made
in this direction at Stamtord Bridge on May 31st was
tepeated at Cardifl on Saturday, June 28th, with complete
success,
A )

We are glad to see that W. J. Bailey was again
successful in winning the N.C.U. Championship for
the } Mile, and the Mile, and regret that his chances in the 5
Mile Championship were tuined by puncture. — It was also sad
that Harry Ryan should have been early put out of the race by
a fall. The races that Bailey won were all the more interesting
as amongst the competitors was his determined opponent Vic
Johnson.

Cycling.

a)

N

In [11]:
df2 = df.copy() #work on a copy so the original metadata dataframe is left unchanged

#read each of the files in turn, adding them to the dataframe by position - this assumes the files are listed in the same order as the rows of filelist.csv
for index, file in enumerate(files):
    with open(file, encoding='utf8') as f:
        content = f.read()
    df2.at[index, 'Text'] = content
    df2.at[index, 'OCR filename'] = file #record the filename of the OCR'd text, to cross-check with the original PDF filename

print(df2)


               Filename                           Type     Date  Volume Issue  \
0         CWSupp007.pdf  Christian Workers’ Supplement  1891-07     NaN   NaN   
1         CWSupp008.pdf  Christian Workers’ Supplement  1891-09     NaN   NaN   
2         CWSupp009.pdf  Christian Workers’ Supplement  1891-10     NaN   NaN   
3         CWSupp010.pdf  Christian Workers’ Supplement  1891-11     NaN   NaN   
4         CWSupp011.pdf  Christian Workers’ Supplement  1891-12     NaN   NaN   
...                 ...                            ...      ...     ...   ...   
1720  Quintinian005.pdf                     Quintinian  1892-08     NaN     5   
1721  Quintinian006.pdf                     Quintinian  1892-09     NaN     6   
1722  Quintinian007.pdf                     Quintinian  1892-10     NaN     7   
1723  Quintinian008.pdf                     Quintinian  1892-11     NaN     8   
1724  Quintinian009.pdf                     Quintinian  1892-12     NaN     9   

      Pages  OCR batch     
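
The ingest above pairs files and rows by position, so it depends on the file listing and the csv being in the same order. A possible alternative, sketched here and not what this notebook actually does, is to look up each row's text file by name. It assumes the `.tiff.txt` naming seen in the file list. The function `ingest_by_filename` and the temporary toy files are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

def ingest_by_filename(meta, ocr_dir, suffix=".tiff.txt"):
    """Attach OCR text to each metadata row by looking up its expected text file."""
    out = meta.copy()
    out["OCR filename"] = out["Filename"] + suffix
    texts = []
    for name in out["OCR filename"]:
        path = Path(ocr_dir) / name
        # Leave the text empty when no file exists, so gaps are easy to spot later
        texts.append(path.read_text(encoding="utf8") if path.exists() else "")
    out["Text"] = texts
    return out

# Toy demonstration with temporary files (names and contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "B.pdf.tiff.txt").write_text("text of B", encoding="utf8")
    Path(tmp, "A.pdf.tiff.txt").write_text("text of A", encoding="utf8")
    toy_meta = pd.DataFrame({"Filename": ["B.pdf", "A.pdf", "C.pdf"]})
    toy_result = ingest_by_filename(toy_meta, tmp)

print(toy_result)
```

With this approach the row order no longer matters, but the `OCR filename` column is derived from the metadata instead of observed, so it stops being an independent cross-check.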

Now we need to do some testing to make sure the metadata and the text have lined up correctly.
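
As well as inspecting rows by eye, the alignment can be checked across the whole dataframe. This is a minimal sketch, assuming each OCR filename should equal the PDF filename plus `.tiff.txt`. The helper `misaligned_rows` and the toy dataframe are hypothetical; in the notebook you would call it on `df2`.

```python
import pandas as pd

def misaligned_rows(frame, suffix=".tiff.txt"):
    """Return rows whose OCR filename is not the PDF filename plus the suffix."""
    expected = frame["Filename"] + suffix
    return frame[frame["OCR filename"] != expected]

# Toy data with one deliberately mismatched row (illustrative values only)
toy = pd.DataFrame({
    "Filename": ["HT0010001.pdf", "Quintinian001.pdf"],
    "OCR filename": ["HT0010001.pdf.tiff.txt", "HT0010002.pdf.tiff.txt"],
})
bad = misaligned_rows(toy)
print(len(bad), "misaligned row(s)")
print(bad)
```

An empty result would suggest that every row received the text file that belongs to it.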

In [12]:
df2.isnull().sum()
#we can see there are some missing Volume and Issue numbers - this is fine as we know there are special issues which weren't part of a volume and didn't have issue numbers

Filename         0
Type             0
Date             0
Volume          16
Issue            7
Pages            0
OCR batch        0
Text             0
OCR filename     0
dtype: int64
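
Note that `Text` and `OCR filename` were initialised as empty strings, which `isnull()` does not count as missing, so a zero above does not by itself show that every row received text. The sketch below counts empty or whitespace-only strings instead. The helper `count_blank` and the toy dataframe are hypothetical; in the notebook you would call it on `df2`.

```python
import pandas as pd

def count_blank(frame, columns=("Text", "OCR filename")):
    """Count cells per column that are empty or whitespace-only strings."""
    return {
        col: int(frame[col].astype(str).str.strip().eq("").sum())
        for col in columns
    }

# Toy data: one empty Text cell, one whitespace-only Text cell, one empty filename
toy = pd.DataFrame({
    "Text": ["THE POLYTECHNIC MAGAZINE", "", "  \n "],
    "OCR filename": ["a.pdf.tiff.txt", "", "c.pdf.tiff.txt"],
})
blank_counts = count_blank(toy)
print(blank_counts)
```

This complements `isnull()` and does not replace it: true NaN values are not counted as blank here.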

In [13]:
df2[df2['Volume'].isnull()]
#however, on closer inspection it looks like the OCR filename and the PDF filename aren't lining up properly, e.g. Quintinian001.pdf is showing up as 'Home Tidings'
#ok, have gone back and changed the sort order in the csv and re-ingested; now it is looking ok

Unnamed: 0,Filename,Type,Date,Volume,Issue,Pages,OCR batch,Text,OCR filename
0,CWSupp007.pdf,Christian Workers’ Supplement,1891-07,,,4,1,"THE POLYTECHNIC MAGAZINE,\n\nChristian Workers...",CWSupp007.pdf.tiff.txt
1,CWSupp008.pdf,Christian Workers’ Supplement,1891-09,,,4,1,"THE POLYTECHNIC MAGAZINE, 809, Regent Street,\...",CWSupp008.pdf.tiff.txt
2,CWSupp009.pdf,Christian Workers’ Supplement,1891-10,,,4,1,"THE POLYTECHNIC MAGAZINE,\n\nChristian Workers...",CWSupp009.pdf.tiff.txt
3,CWSupp010.pdf,Christian Workers’ Supplement,1891-11,,,4,1,"THE POLYTECHNIC MAGAZINE, 809, Regent Street, ...",CWSupp010.pdf.tiff.txt
4,CWSupp011.pdf,Christian Workers’ Supplement,1891-12,,,4,1,"THE POLYTECHNIC MAGAZINE, 809, Regent Street, ...",CWSupp011.pdf.tiff.txt
5,CWSupp012.pdf,Christian Workers’ Supplement,1892-01,,,18,1,"THE POLYTECHNIC MAGAZINE, 809, Regent\n\nNo. 1...",CWSupp012.pdf.tiff.txt
6,HolidayGuide.pdf,Holiday Guide,1891-05-29,,,24,1,THE POLYTEGHNIG HOLIDAY CUIDE.\n\nBEING AN EXT...,HolidayGuide.pdf.tiff.txt
1716,Quintinian001.pdf,Quintinian,1892-04,,1.0,4,12,Che Old Owintinian\n\nAND SUPPLEMENT FOR OLD P...,Quintinian001.pdf.tiff.txt
1717,Quintinian002.pdf,Quintinian,1892-05,,2.0,4,12,Che Ouintinian\n\nAND SUPPLEMENT FOR OLD POLYT...,Quintinian002.pdf.tiff.txt
1718,Quintinian003.pdf,Quintinian,1892-06,,3.0,4,12,\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n;...,Quintinian003.pdf.tiff.txt


In [14]:
#more checking that OCR and metadata are lined up correctly
df2.head(50)

Unnamed: 0,Filename,Type,Date,Volume,Issue,Pages,OCR batch,Text,OCR filename
0,CWSupp007.pdf,Christian Workers’ Supplement,1891-07,,,4,1,"THE POLYTECHNIC MAGAZINE,\n\nChristian Workers...",CWSupp007.pdf.tiff.txt
1,CWSupp008.pdf,Christian Workers’ Supplement,1891-09,,,4,1,"THE POLYTECHNIC MAGAZINE, 809, Regent Street,\...",CWSupp008.pdf.tiff.txt
2,CWSupp009.pdf,Christian Workers’ Supplement,1891-10,,,4,1,"THE POLYTECHNIC MAGAZINE,\n\nChristian Workers...",CWSupp009.pdf.tiff.txt
3,CWSupp010.pdf,Christian Workers’ Supplement,1891-11,,,4,1,"THE POLYTECHNIC MAGAZINE, 809, Regent Street, ...",CWSupp010.pdf.tiff.txt
4,CWSupp011.pdf,Christian Workers’ Supplement,1891-12,,,4,1,"THE POLYTECHNIC MAGAZINE, 809, Regent Street, ...",CWSupp011.pdf.tiff.txt
5,CWSupp012.pdf,Christian Workers’ Supplement,1892-01,,,18,1,"THE POLYTECHNIC MAGAZINE, 809, Regent\n\nNo. 1...",CWSupp012.pdf.tiff.txt
6,HolidayGuide.pdf,Holiday Guide,1891-05-29,,,24,1,THE POLYTEGHNIG HOLIDAY CUIDE.\n\nBEING AN EXT...,HolidayGuide.pdf.tiff.txt
7,HT0010001.pdf,Home Tidings,1879-06,1.0,1.0,28,1,HOME\n\nPUBLISHED\n\nTIDINGS\n\nNEWS.\n\nMONTH...,HT0010001.pdf.tiff.txt
8,HT0010002.pdf,Home Tidings,1879-07,1.0,2.0,24,1,"PUBLISHED\n\n \n\nTIDINGS\n\nMONTHLY,\n\n \n\n...",HT0010002.pdf.tiff.txt
9,HT0010003.pdf,Home Tidings,1879-07,1.0,3.0,16,1,HOME\n\nPUBLISHED\n\n \n\nTIDINGS\n\nMONTHLY.\...,HT0010003.pdf.tiff.txt


In [15]:
#more checking that OCR and metadata are lined up correctly
df2.tail(50)

Unnamed: 0,Filename,Type,Date,Volume,Issue,Pages,OCR batch,Text,OCR filename
1675,HT0970008.pdf,Polytechnic Magazine,1957-08,97.0,8,28,12,\n\n \n\nTHE\nPOLYTECHNIC\n\nMAGAZINE\n\n \n\...,HT0970008.pdf.tiff.txt
1676,HT0970009.pdf,Polytechnic Magazine,1957-09,97.0,9,36,12,\n\n \n\nTHE\nPOLYTECHNIC\n\nMAGAZINE\n\n \n\...,HT0970009.pdf.tiff.txt
1677,HT0970010.pdf,Polytechnic Magazine,1957-10,97.0,10,40,12,\n\n \n\nTHE\nPOLYTECHNIC\n\nMAGAZINE\n\n \n\...,HT0970010.pdf.tiff.txt
1678,HT0970011.pdf,Polytechnic Magazine,1957-11,97.0,11,44,12,\n\n \n\nTHE\nPOLYTECHNIC\n\nMAGAZINE\n\n \n\...,HT0970011.pdf.tiff.txt
1679,HT0970012.pdf,Polytechnic Magazine,1957-12,97.0,12,40,12,\n\n \n\n \n\nTHE\nPOLYTECHNIC\n\nMAGAZINE.\n...,HT0970012.pdf.tiff.txt
1680,HT0980001.pdf,Polytechnic Magazine,1958-01,98.0,1,36,12,\n\n \n\nTHE\nPOLYTECHNIC\n\nMAGAZINE\n\n \n\...,HT0980001.pdf.tiff.txt
1681,HT0980002.pdf,Polytechnic Magazine,1958-02,98.0,2,36,12,\n\n \n\nTHE\n\nPOLYTECHNIC\nMAGAZINE\n\n \n\...,HT0980002.pdf.tiff.txt
1682,HT0980003.pdf,Polytechnic Magazine,1958-03,98.0,3,40,12,\n\n \n\nTHE\nPOLYTECHNIC\n\nMAGAZINE\n\n \n\...,HT0980003.pdf.tiff.txt
1683,HT0980004.pdf,Polytechnic Magazine,1958-04,98.0,4,52,12,\n\n \n\n \n\n \n\nTHE\nPOLYTECHNIC\n\nMAGAZI...,HT0980004.pdf.tiff.txt
1684,HT0980005.pdf,Polytechnic Magazine,1958-05,98.0,5,44,12,\n\n \n\n \n\nTHE\nPOLYTECHNIC\n\nMAGAZINE\n\...,HT0980005.pdf.tiff.txt


In [13]:
df2[df2['Issue'].isnull()]
#looking at blank issue numbers it is as we would expect: they are the special supplements that didn't have numbers

Unnamed: 0,Filename,Type,Date,Volume,Issue,Pages,OCR batch,Text,OCR filename
0,CWSupp007.pdf,Christian Workers’ Supplement,1891-07,,,4,1,"THE POLYTECHNIC MAGAZINE,\n\nChristian Workers...",CWSupp007.pdf.tiff.txt
1,CWSupp008.pdf,Christian Workers’ Supplement,1891-09,,,4,1,"THE POLYTECHNIC MAGAZINE, 809, Regent Street,\...",CWSupp008.pdf.tiff.txt
2,CWSupp009.pdf,Christian Workers’ Supplement,1891-10,,,4,1,"THE POLYTECHNIC MAGAZINE,\n\nChristian Workers...",CWSupp009.pdf.tiff.txt
3,CWSupp010.pdf,Christian Workers’ Supplement,1891-11,,,4,1,"THE POLYTECHNIC MAGAZINE, 809, Regent Street, ...",CWSupp010.pdf.tiff.txt
4,CWSupp011.pdf,Christian Workers’ Supplement,1891-12,,,4,1,"THE POLYTECHNIC MAGAZINE, 809, Regent Street, ...",CWSupp011.pdf.tiff.txt
5,CWSupp012.pdf,Christian Workers’ Supplement,1892-01,,,18,1,"THE POLYTECHNIC MAGAZINE, 809, Regent\n\nNo. 1...",CWSupp012.pdf.tiff.txt
6,HolidayGuide.pdf,Holiday Guide,1891-05-29,,,24,1,THE POLYTEGHNIG HOLIDAY CUIDE.\n\nBEING AN EXT...,HolidayGuide.pdf.tiff.txt


In [16]:
#print some examples to check they look ok
example = int(input('Enter example number to check (0-1724): '))
print(df2.loc[example])

print(df2.loc[example,'Text']) 

Filename                                            HT0210494.pdf
Type                                         Polytechnic Magazine
Date                                                   1892-12-28
Volume                                                       21.0
Issue                                                         494
Pages                                                          12
OCR batch                                                       3
Text            THE\n\nDecember 28, 892.\n\nPOLYTHCHNIC MAGAZI...
OCR filename                               HT0210494.pdf.tiff.txt
Name: 500, dtype: object
THE

December 28, 892.

POLYTHCHNIC MAGAZINE.

443

 

THE POLYTECHNIC MAGAZINE
_WEDNESDAY, DEC. 28, 1892.

 

 

 

3nstitute Gossip.

"OUR old member, E. J. Painter, writes
to tell me that he has recently started in
business with a friend under the style of
Morrelland Painter, at 225, Kentish Town-
road, where he has opened a stationery,
bookselling, and bookbinding business.


In [18]:
#looks good, so pickle the initial version of the corpus
os.chdir("C:\\Users\\jakeb\\OneDrive\\birkbeck\\PG Cert\\Project\\Polymags\\Polydata\\PreparatoryNotebooksFinalv2")#set this to where you want to keep your pickles, the other notebook assumes it will be the same directory as the notebooks
df2.to_pickle("corpus_1.pkl")
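
To confirm the pickle can be read back unchanged, a round-trip check along these lines may be useful. It is a sketch using a toy dataframe and a temporary directory (both hypothetical); in the notebook you would compare `df2` with `pd.read_pickle("corpus_1.pkl")`.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for df2 (illustrative values only)
toy = pd.DataFrame({
    "Filename": ["CWSupp007.pdf"],
    "Volume": [float("nan")],
    "Text": ["THE POLYTECHNIC MAGAZINE"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus_toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# DataFrame.equals treats NaNs in the same position as equal
round_trip_ok = toy.equals(restored)
print("Round trip preserved the dataframe:", round_trip_ok)
```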