# IAM Dataset

_Published_ -> The database was first published at ICDAR 1999.


About dataset
------------------
The database contains forms of unconstrained handwritten text, scanned at a resolution of 300 dpi and saved as PNG images with 256 gray levels.

The IAM Handwriting Database 3.0 is structured as follows:
- 657 writers contributed samples of their handwriting
- 1'539 pages of scanned text
- 5'685 isolated and labeled sentences
- 13'353 isolated and labeled text lines
- 115'320 isolated and labeled words

The words were extracted from the scanned pages using an automatic segmentation scheme and then verified manually. The segmentation scheme is described in the following paper:
- Paper name - Automatic Segmentation of the IAM Off-line Database for Handwritten English Text
- Authors - Matthias Zimmermann, Horst Bunke
- Link - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.652.1885&rep=rep1&type=pdf



Reading dataset
----------------
- __forms.txt__
  - format: a01-000u 000 2 prt 7 5 52 36
  - a01-000u  -> form id
  - 000       -> writer id
  - 2         -> number of sentences
  - prt       -> word segmentation
    - prt: some lines correctly segmented
    - all: all lines correctly segmented
  - 7 5       -> 5 of 7 lines are correctly segmented into words
  - 52 36     -> the form contains 52 words, 36 are in lines which have been correctly segmented
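
The field layout above can be turned into a small parser. This is a minimal sketch, not part of the original notebook: the `FormRecord` type, its field names, and the `parse_form_line` helper are hypothetical names chosen for illustration, and the only input used is the example line shown above.

```python
from typing import NamedTuple


class FormRecord(NamedTuple):
    """One entry of forms.txt (field names are illustrative assumptions)."""
    form_id: str
    writer_id: str
    n_sentences: int
    segmentation: str   # "all" or "prt"
    n_lines: int
    n_lines_ok: int     # lines correctly segmented into words
    n_words: int
    n_words_ok: int     # words in correctly segmented lines


def parse_form_line(line: str) -> FormRecord:
    """Split one whitespace-separated forms.txt entry into named fields."""
    parts = line.split()
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}: {line!r}")
    return FormRecord(
        form_id=parts[0],
        writer_id=parts[1],          # kept as a string to preserve leading zeros
        n_sentences=int(parts[2]),
        segmentation=parts[3],
        n_lines=int(parts[4]),
        n_lines_ok=int(parts[5]),
        n_words=int(parts[6]),
        n_words_ok=int(parts[7]),
    )


record = parse_form_line("a01-000u 000 2 prt 7 5 52 36")
print(record.writer_id, record.n_lines_ok, "of", record.n_lines, "lines segmented")
```

Keeping the writer id as a string matters because ids such as `000` or `085` would lose their leading zeros if converted to integers.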

In [1]:
# import
import os
import shutil
from itertools import islice
from collections import defaultdict

In [2]:
# Map each writer id to the list of form ids written by that writer
writer_form = defaultdict(list)
forms_file_path = "D:\\dataset\\IAM\\forms.txt"
with open(forms_file_path) as f:
    # Skip the first 16 lines of forms.txt (the file header)
    for line in islice(f, 16, None):
        line_list = line.split()
        form_id = line_list[0]
        writer = line_list[1]
        writer_form[writer].append(form_id)
#list(writer_form.items())

In [3]:
# Print each writer with its number of forms, most forms first,
# and count how many writers share each form count
print("Writer id \t No. of form")
no_of_form_no_of_writer = defaultdict(int)
for key, value in sorted(writer_form.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(f"{key}\t\t\t{len(value)}")
    no_of_form_no_of_writer[len(value)] += 1

Writer id 	 No. of form
000			59
150			10
151			10
152			10
153			10
154			10
384			10
551			10
552			10
588			10
635			10
670			10
671			10
155			9
333			9
334			9
336			9
337			9
338			9
339			9
340			9
341			9
342			9
343			9
344			9
345			9
346			9
347			9
348			9
349			9
634			9
332			8
335			8
118			7
209			7
315			7
415			7
085			6
567			6
025			5
026			5
037			5
123			5
125			5
126			5
128			5
130			5
133			5
173			5
174			5
202			5
203			5
204			5
205			5
206			5
207			5
208			5
247			5
248			5
273			5
274			5
285			5
287			5
288			5
289			5
292			5
293			5
351			5
352			5
353			5
354			5
355			5
385			5
386			5
387			5
389			5
390			5
391			5
393			5
454			5
455			5
456			5
498			5
544			5
546			5
547			5
548			5
549			5
550			5
582			5
583			5
584			5
585			5
058			4
059			4
060			4
061			4
064			4
107			4
108			4
109			4
110			4
111			4
112			4
113			4
114			4
117			4
124			4
129			4
131			4
132			4
193			4
199			4
239			4
241			4
246			4
286			4
291			4
294			4
330			4
350

In [4]:
# Print, for each form count, the number of writers with that many forms
print("No. of form \t No. of Writer")
for key, value in sorted(no_of_form_no_of_writer.items()):
    print(f"{key}\t\t\t{value}")

No. of form 	 No. of Writer
1			356
2			142
3			32
4			34
5			54
6			2
7			4
8			2
9			18
10			12
59			1
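
As a quick sanity check, the histogram above can be compared with the dataset statistics quoted earlier (657 writers, 1'539 pages). The sketch below simply re-enters the printed counts by hand; the variable names are illustrative and not part of the original notebook.

```python
# Number of forms -> number of writers, copied from the output above
forms_per_writer = {
    1: 356, 2: 142, 3: 32, 4: 34, 5: 54,
    6: 2, 7: 4, 8: 2, 9: 18, 10: 12, 59: 1,
}

# Every writer falls into exactly one bucket
total_writers = sum(forms_per_writer.values())

# Each bucket contributes (forms per writer) * (writers in the bucket) forms
total_forms = sum(n_forms * n_writers for n_forms, n_writers in forms_per_writer.items())

print(total_writers, total_forms)  # 657 1539
```

Both totals agree with the figures in the dataset description, which suggests that skipping the header lines did not drop or duplicate any form entries.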


In [5]:
# Copy every word image of one writer into a single folder
def getWriterData(writer_id, writer_form_dict, source_path, dest_path):
    '''Copy all images written by writer_id into dest_path/writer_id.'''
    writer_id = str(writer_id)

    # Use .get() so that an unknown id does not add an empty entry to the defaultdict
    form_ids = writer_form_dict.get(writer_id, [])
    if not form_ids:
        print("Invalid Writer id")
        return False

    # Create the destination folder (and any missing parents) if needed
    dest_fol_path = os.path.join(dest_path, writer_id)
    os.makedirs(dest_fol_path, exist_ok=True)

    for form_id in form_ids:
        # Images of form "a01-000u" are read from <source_path>/a01/a01-000u/
        parent_fol = form_id.split("-")[0]
        fol_path = os.path.join(source_path, parent_fol, form_id)
        for f in os.listdir(fol_path):
            # os.path.join keeps the paths portable across operating systems
            shutil.copy(os.path.join(fol_path, f), os.path.join(dest_fol_path, f))
    print("Extracted successfully writer ", writer_id)
    return True

In [6]:
sourcepath = 'D:\\dataset\\IAM\\words'
destpath = 'D:\\dataset\\exp\\wldata_10_10'
# Ten of the writers who contributed 10 forms each (see the listing above)
wid_list = ['150','151','152','153','154','384','551','552','588','635']
for wid in wid_list:
    getWriterData(wid, writer_form, sourcepath, destpath)

Extracted successfully writer  150
Extracted successfully writer  151
Extracted successfully writer  152
Extracted successfully writer  153
Extracted successfully writer  154
Extracted successfully writer  384
Extracted successfully writer  551
Extracted successfully writer  552
Extracted successfully writer  588
Extracted successfully writer  635
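
The hard-coded `wid_list` above could also be derived from the writer-to-forms mapping instead of being typed by hand. The sketch below is one possible approach, not the notebook's own method: `writers_with_min_forms` is a hypothetical helper, and the toy dictionary uses placeholder form ids with form counts taken from the listing above.

```python
def writers_with_min_forms(writer_form_dict, min_forms):
    """Return writer ids with at least min_forms forms, most forms first."""
    selected = [w for w, forms in writer_form_dict.items() if len(forms) >= min_forms]
    # Sort by descending form count, then by writer id for a stable order
    return sorted(selected, key=lambda w: (-len(writer_form_dict[w]), w))


# Toy stand-in for writer_form: placeholder form ids, real form counts
toy_writer_form = {
    "150": ["form"] * 10,
    "155": ["form"] * 9,
    "000": ["form"] * 59,
    "332": ["form"] * 8,
}

selected = writers_with_min_forms(toy_writer_form, 9)
print(selected)  # ['000', '150', '155']
```

With the real `writer_form` dictionary, a threshold of 10 would also return writer `000` (59 forms), so that id may need to be filtered out if a balanced number of forms per writer is wanted.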
