## The following code makes LMDB databases for the specific data in the ICFHR 2018 competition. Part of the competition is to find out how many pages of a specific document are truly necessary to successfully fine-tune a general handwriting recognition model to that document type. The organizers have provided lists of 1, 4 and 16 pages for each document type so that fine-tuning performance can be measured on the test set. I believe a good validation split for us is to fine-tune on the pages of the 16-page set that are not in the 4-page set (the `train_8` lists built below), and then validate on the 4 pages. The provided 1- and 4-page sets are strict subsets of the next larger fine-tune lists (4 and 16 pages, respectively).
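
As a minimal sketch of the split described above, here is the same idea on made-up file names rather than the competition lists. Because the lists are nested, the held-out part is just a set difference. The names `pages_4`, `pages_16` and `held_out` are illustrative only.

```python
# Toy stand-ins for the line-image names in a 4-page and a 16-page list.
# The real lists hold .jpg file names; these names are made up.
pages_16 = {"a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"}
pages_4 = {"a.jpg", "b.jpg"}

# The smaller list must be contained in the larger one for the split to make sense.
assert pages_4 <= pages_16

# Everything in the 16-page list that is not in the 4-page list.
held_out = pages_16 - pages_4

# The two parts are disjoint and together rebuild the 16-page list.
assert held_out.isdisjoint(pages_4)
assert held_out | pages_4 == pages_16
print(sorted(held_out))
```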

In [5]:
import os
import sys
from glob import glob
import shutil
import subprocess

In [6]:
#lmdb_database_base = "/deep_data/nephi/data/lmdb_ICFHR/specific_data_each_doc"
#spec_tr_lists_dir = "/deep_data/datasets/ICFHR_Data/specific_data_train_list"

lmdb_database_base = "/home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/"
spec_tr_lists_dir = "/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/"

spec_lists_files = glob(os.path.join(spec_tr_lists_dir, "*"))
dirs = [os.path.basename(f).partition(".lst")[0] for f in spec_lists_files]

### Make directories for lmdb databases

In [3]:
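for num in set([d.partition("_train_")[0] for d in dirs]):
    # Guard against duplicates if this cell is run more than once.
    if num + "_train_8" not in dirs:
        dirs.append(num + "_train_8")
for d in dirs:
    # os.makedirs raises if the directory already exists, so check first.
    if not os.path.isdir(os.path.join(lmdb_database_base, d)):
        os.makedirs(os.path.join(lmdb_database_base, d))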

### Loop through all of the lists and make the 8 page fine-tuning training lists for validating our fine-tuning approach

In [4]:
#spec_val_lists_dir = "/deep_data/datasets/ICFHR_Data/specific_data_val_list"
spec_val_lists_dir = "/home/ubuntu/datasets/read_ICFHR/specific_data_val_list/"
for num in set([d.partition("_train_")[0] for d in dirs]):
    #list_1 = os.path.join(spec_tr_lists_dir, num + "_train_1.lst")
    list_4 = os.path.join(spec_tr_lists_dir, num + "_train_4.lst")
    list_16 = os.path.join(spec_tr_lists_dir, num + "_train_16.lst")
    with open(list_16, "r") as t_16, open(list_4, "r") as t_4:
        imgs_16 = set(t_16.read().split())
        imgs_4 = set(t_4.read().split())
        imgs_8 = imgs_16 - imgs_4
        
        list_8 = os.path.join(spec_val_lists_dir, num + "_train_8.lst")
        
        with open(list_8, "w") as val_8:
            for img in sorted(imgs_8):  # sorted so the list file is reproducible
                val_8.write(img + "\n")
        

### Verify that the files are correct (they are correct)

In [5]:
for num in set([d.partition("_train_")[0] for d in dirs]):
    list_8 = os.path.join(spec_val_lists_dir, num + "_train_8.lst")
    list_4 = os.path.join(spec_tr_lists_dir, num + "_train_4.lst")
    list_16 = os.path.join(spec_tr_lists_dir, num + "_train_16.lst")
    with open(list_16, "r") as t_16, open(list_4, "r") as t_4, open(list_8, "r") as t_8:
        imgs_16 = set(t_16.read().split())
        imgs_4 = set(t_4.read().split())
        imgs_8 = set(t_8.read().split())
        test_8 = imgs_16 - imgs_4
        print ("Testing number:" + num)
        print(str(imgs_8 == test_8))

Testing number:35013
True
Testing number:30882
True
Testing number:30893
True
Testing number:30866
True
Testing number:35015
True


### Make all the python calls to create lmdb databases for all lists

In [7]:
#lmdb_database_base = "/deep_data/nephi/data/lmdb_ICFHR/specific_data_each_doc"
#spec_tr_lists_dir = "/deep_data/datasets/ICFHR_Data/specific_data_train_list"
#train_data = "/deep_data/datasets/ICFHR_Data/specific_data"

lmdb_database_base = "/home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/"
spec_tr_lists_dir = "/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/"
train_data = "/home/ubuntu/datasets/read_ICFHR/specific_data/"
howe_data = "/home/ubuntu/datasets/read_ICFHR/specific_data_howe/"
simplebin_data = "/home/ubuntu/datasets/read_ICFHR/specific_data_imgtxt/"

#python create_dataset.py  --data_dir ~/datasets/read_ICFHR/specific_data --output_dir ~/russell/nephi/data/lmdb_ICFHR_bin/specific_data --icfhr --binarize --howe_dir ~/datasets/read_ICFHR/specific_data_howe --simplebin_dir ~/datasets/read_ICFHR/specific_data_imgtxt

#python create_dataset.py  --data_dir ~/datasets/read_ICFHR/general_data  --output_dir ~/russell/nephi/data/lmdb_ICFHR_bin/general_data --icfhr --binarize --howe_dir ~/datasets/read_ICFHR/general_data_howe --simplebin_dir ~/datasets/read_ICFHR/general_data_imgtxt

#python create_dataset.py ~/datasets/read_ICFHR/specific_data ~/russell/nephi/data/lmdb_ICFHR/specific_data --icfhr /file to include

for num in set([d.partition("_train_")[0] for d in dirs]):
    for s in ["1", "4", "8", "16"]:
        # The 8 page lists live in the validation list directory; the others are the provided lists.
        lists_dir = spec_val_lists_dir if s == "8" else spec_tr_lists_dir
        script = ' '.join(["python create_dataset.py", "--data_dir", train_data, "--output_dir", os.path.join(lmdb_database_base, num + "_train_" + s), 
                           "--icfhr", "--files_include", os.path.join(lists_dir, num + "_train_" + s + ".lst"),
                           "--binarize", "--howe_dir", howe_data, "--simplebin_dir", simplebin_data])
        print(script)


python create_dataset.py --data_dir /home/ubuntu/datasets/read_ICFHR/specific_data/ --output_dir /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/35013_train_1 --icfhr --files_include /home/ubuntu/datasets/read_ICFHR/specific_data_train_list/35013_train_1.lst --binarize --howe_dir /home/ubuntu/datasets/read_ICFHR/specific_data_howe/ --simplebin_dir /home/ubuntu/datasets/read_ICFHR/specific_data_imgtxt/
python create_dataset.py --data_dir /home/ubuntu/datasets/read_ICFHR/specific_data/ --output_dir /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/35013_train_4 --icfhr --files_include /home/ubuntu/datasets/read_ICFHR/specific_data_train_list/35013_train_4.lst --binarize --howe_dir /home/ubuntu/datasets/read_ICFHR/specific_data_howe/ --simplebin_dir /home/ubuntu/datasets/read_ICFHR/specific_data_imgtxt/
python create_dataset.py --data_dir /home/ubuntu/datasets/read_ICFHR/specific_data/ --output_dir /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/sp
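
The commands printed above are meant to be run from the shell. One way to do that, sketched here as an assumption rather than what was actually done, is to collect the command strings in a list and write them to a bash script. The helper `write_shell_script` and the file name `make_lmdb.sh` are hypothetical, and the `echo` commands stand in for the real `script` strings.

```python
import os
import stat
import tempfile

def write_shell_script(commands, path):
    """Write one command per line to a bash script and mark it executable."""
    with open(path, "w") as fh:
        fh.write("#!/bin/bash\n")
        # Stop at the first failing command instead of silently continuing.
        fh.write("set -e\n")
        for cmd in commands:
            fh.write(cmd + "\n")
    # Add the owner-executable bit to whatever mode the file already has.
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)
    return path

# Placeholder commands; in the notebook these would be the `script` strings.
example_commands = ["echo first", "echo second"]
script_path = write_shell_script(
    example_commands, os.path.join(tempfile.mkdtemp(), "make_lmdb.sh"))
with open(script_path) as fh:
    print(fh.read())
```

In the loop, this would mean appending each `script` to a list instead of (or in addition to) printing it.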

## Now make all python calls to create lmdb databases for all the separate test results

In [9]:
for s in ["30866", "30882", "30893", "35013", "35015"]:
    os.makedirs(os.path.join(lmdb_database_base, s))

In [13]:
#python create_dataset.py --data_dir ~/datasets/read_ICFHR/test_data --output_dir 
#~/russell/nephi/data/lmdb_ICFHR_bin/test_data --icfhr --binarize --howe_dir
#~/datasets/read_ICFHR/test_data_howe --simplebin_dir ~/datasets/read_ICFHR/test_data_simplebin --test

#lmdb_database_base = "/deep_data/nephi/data/lmdb_ICFHR/specific_data_each_doc"
#spec_tr_lists_dir = "/deep_data/datasets/ICFHR_Data/specific_data_train_list"
#train_data = "/deep_data/datasets/ICFHR_Data/specific_data"

lmdb_database_base = "/home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/test_data_each_doc/"
train_data = "/home/ubuntu/datasets/read_ICFHR/test_data/"
howe_data = "/home/ubuntu/datasets/read_ICFHR/test_data_howe/"
simplebin_data = "/home/ubuntu/datasets/read_ICFHR/test_data_simplebin/"

for s in ["30866", "30882", "30893", "35013", "35015"]:
    script = ' '.join(["python create_dataset.py", "--data_dir", os.path.join(train_data, s), "--output_dir", os.path.join(lmdb_database_base, s), 
                           "--icfhr", "--test",
                          "--binarize", "--howe_dir", howe_data, "--simplebin_dir", simplebin_data])
    print(script)


python create_dataset.py --data_dir /home/ubuntu/datasets/read_ICFHR/test_data/30866 --output_dir /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/test_data_each_doc/30866 --icfhr --test --binarize --howe_dir /home/ubuntu/datasets/read_ICFHR/test_data_howe/ --simplebin_dir /home/ubuntu/datasets/read_ICFHR/test_data_simplebin/
python create_dataset.py --data_dir /home/ubuntu/datasets/read_ICFHR/test_data/30882 --output_dir /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/test_data_each_doc/30882 --icfhr --test --binarize --howe_dir /home/ubuntu/datasets/read_ICFHR/test_data_howe/ --simplebin_dir /home/ubuntu/datasets/read_ICFHR/test_data_simplebin/
python create_dataset.py --data_dir /home/ubuntu/datasets/read_ICFHR/test_data/30893 --output_dir /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/test_data_each_doc/30893 --icfhr --test --binarize --howe_dir /home/ubuntu/datasets/read_ICFHR/test_data_howe/ --simplebin_dir /home/ubuntu/datasets/read_ICFHR/test_data_simplebin/
python create_dataset.

## Now make all the predictions on the test sets

In [17]:
for num in set([d.partition("_train_")[0] for d in dirs]):
    print(num)

35013
30882
30893
30866
35015


In [18]:


#experiments/expr_ICFHR_17Apr_binarization_augmentation/netCRNN_6_1988.pth
lmdb_database_base = "/home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/test_data_each_doc/"
spec_tr_lists_dir = "/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/"
train_data = "/home/ubuntu/datasets/read_ICFHR/specific_data/"
howe_data = "/home/ubuntu/datasets/read_ICFHR/specific_data_howe/"
simplebin_data = "/home/ubuntu/datasets/read_ICFHR/specific_data_imgtxt/"

tuned_models = ["experiments/expr_ICFHR_18Apr_finetuning_allnet_35013_train_1/netCRNN_5_7.pth",
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_35013_train_4/netCRNN_6_26.pth",
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_35013_train_16/netCRNN_8_107.pth",
            
              
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_30882_train_1/netCRNN_6_4.pth",
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_30882_train_4/netCRNN_6_14.pth",
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_30882_train_16/netCRNN_7_55.pth",
                
    
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_30893_train_1/netCRNN_7_4.pth",
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_30893_train_4/netCRNN_12_15.pth",
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_30893_train_16/netCRNN_14_64.pth",
                

                "experiments/expr_ICFHR_18Apr_finetuning_allnet_30866_train_1/netCRNN_14_5.pth",
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_30866_train_4/netCRNN_13_20.pth",
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_30866_train_16/netCRNN_12_79.pth",
                
                
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_35015_train_1/netCRNN_14_12.pth",
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_35015_train_4/netCRNN_7_44.pth",
                "experiments/expr_ICFHR_18Apr_finetuning_allnet_35015_train_16/netCRNN_8_177.pth"]

# Let's make the script
#python crnn_main.py --trainroot this is the specific data 1, 4, 16 /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/general_data 
#--valroot this will be the corresponding train_8 for validation /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data --dataset ICFHR --cuda --lr 0.0001 
#--displayInterval 120 --valEpoch 1 --saveEpoch 1 --workers 10 --niter 15 --experiment this should be a name unique to the train setexperiments/expr_ICFHR_17Apr_binarization_augmentation 
#--keep_ratio --imgH 60 --imgW 240 --batchSize 6 --binarize > name unique to the train set log_files/log_ICFHR_17Apr_binarization_augmentation.txt

#python crnn_main.py --trainroot /deep_data/nephi/data/lmdb_ICFHR/general_data 
#--valroot /deep_data/nephi/data/lmdb_ICFHR/test_data/30865_testtrack 
#--crnn /deep_data/nephi/experiments/expr_ICFHR_27Mar_alph_werr_fixed_extended/netCRNN_20_5963.pth 
#--cuda --lr 0.00005 --displayInterval 120 --valEpoch 5 --saveEpoch 10 --workers 10 --niter 200
#--experiment experiments/expr_ICFHR_24Mar_testset_test --keep_ratio --imgH 80 --imgW 240 --batchSize 32
#--test_icfhr --test_file 30865_0_unicodeformatted.txt > log_ICFHR_27Mar_russell_testset_test.txt
# The general model, used for "0" (no fine-tuning).
base_model = "experiments/expr_ICFHR_17Apr_binarization_augmentation/netCRNN_6_1988.pth"
# Look checkpoints up by "<num>_train_<s>" instead of a running index, so the
# pairing does not depend on the iteration order of the set below.
tuned_by_key = dict((m.split("allnet_")[1].split("/")[0], m) for m in tuned_models)
for num in set([d.partition("_train_")[0] for d in dirs]):
    for s in ["0", "1", "4", "16"]:
        model = base_model if s == "0" else tuned_by_key[num + "_train_" + s]
        script = ' '.join(["python crnn_main.py", "--trainroot", os.path.join(lmdb_database_base, num), 
                       "--valroot", os.path.join(lmdb_database_base, num), 
                       "--dataset ICFHR --cuda --lr 0.0001 --displayInterval 4 --valEpoch 1 --saveEpoch 1 --workers 10 --niter 15",
                       "--keep_ratio --imgH 60 --imgW 240 --batchSize 6 --binarize", 
                       #"--experiment", "experiments/expr_" + "ICFHR_18Apr_finetuning_allnet_" + num + "_train_" + s, 
                       "--crnn",  model, 
                       "--test_icfhr --test_file", os.path.join("test_results/21Apr_firstfinetune_submission", num + "_" + s + ".txt"), ">",
                       "log_files/test_logs/log_ICFHR_21Apr_test_results_" + num + "_" + s + ".txt"])
        print(script)



python crnn_main.py --trainroot /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/test_data_each_doc/35013 --valroot /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/test_data_each_doc/35013 --dataset ICFHR --cuda --lr 0.0001 --displayInterval 4 --valEpoch 1 --saveEpoch 1 --workers 10 --niter 15 --keep_ratio --imgH 60 --imgW 240 --batchSize 6 --binarize --crnn experiments/expr_ICFHR_17Apr_binarization_augmentation/netCRNN_6_1988.pth --test_icfhr --test_file test_results/21Apr_firstfinetune_submission/35013_0.txt > log_files/test_logs/log_ICFHR_21Apr_test_results_35013_0.txt
python crnn_main.py --trainroot /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/test_data_each_doc/35013 --valroot /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/test_data_each_doc/35013 --dataset ICFHR --cuda --lr 0.0001 --displayInterval 4 --valEpoch 1 --saveEpoch 1 --workers 10 --niter 15 --keep_ratio --imgH 60 --imgW 240 --batchSize 6 --binarize --crnn experiments/expr_ICFHR_18Apr_finetuning_allnet_35013_train_1/netCR

## Now make all the crnn_main calls to fine tune the results on different specific data before using it on test data

### Here is the fine-tuning for the 3rd round of submissions to ICFHR, using the attention+CTC model

In [9]:
# Here was the best model training on general data and validating on specific data:
#experiments/expr_ICFHR_17Apr_binarization_augmentation/netCRNN_6_1988.pth

lmdb_database_base = "/home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/"
pre_model = "/home/ubuntu/russell/nephi/trained_models/results_icfhr_aug/attention+ctc/netCNN_24_2982.pth"
# python crnn_main.py --trainroot data/lmdb_ICFHR/general_data --valroot data/lmdb_ICFHR/specific_data --plot  
# --dataset ICFHR --cuda --lr 0.0001 --displayInterval 120 
#--valEpoch 2 --saveEpoch 2 --workers 3 --niter 200 --keep_ratio --imgH 60 --imgW 240 
#--batchSize 4 --transform --rescale --rescale_dim 3 --grid_distort --rdir results_icfhr_aug 
#--model attention+ctc >logs/icfhr/aug/attctc.log



for num in set([d.partition("_train_")[0] for d in dirs]):
    for s in ["1", "4", "16"]:
        script = ' '.join(["python crnn_main.py", "--trainroot", os.path.join(lmdb_database_base, num + "_train_" + s), 
                           "--valroot", os.path.join(lmdb_database_base, num + "_train_" + "8"), 
                           "--dataset ICFHR --cuda --lr 0.0001 --displayInterval 4 --valEpoch 1 --saveEpoch 1 --workers 3",
                           "--niter 60 --keep_ratio --imgH 60 --imgW 240 --batchSize 4",
                           "--transform --rescale --rescale_dim 3 --grid_distort",
                           "--model attention+ctc --plot",
                           "--rdir", "experiments/7May_finetuning/expr_" + "ICFHR_7May_finetuning_attention+ctc_" + num + "_train_" + s, 
                           "--pre_model",  pre_model, ">",
                           "log_files/log_ICFHR_7May_finetuning_attention+ctc_" + num + "_train_" + s + ".txt"])
        print(script)
        print("")


python crnn_main.py --trainroot /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/35013_train_1 --valroot /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/35013_train_8 --dataset ICFHR --cuda --lr 0.0001 --displayInterval 4 --valEpoch 1 --saveEpoch 1 --workers 3 --niter 60 --keep_ratio --imgH 60 --imgW 240 --batchSize 4 --transform --rescale --rescale_dim 3 --grid_distort --model attention+ctc --plot --rdir experiments/7May_finetuning/expr_ICFHR_7May_finetuning_attention+ctc_35013_train_1 --pre_model /home/ubuntu/russell/nephi/trained_models/results_icfhr_aug/attention+ctc/netCNN_24_2982.pth > log_files/log_ICFHR_7May_finetuning_attention+ctc_35013_train_1.txt

python crnn_main.py --trainroot /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/35013_train_4 --valroot /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/35013_train_8 --dataset ICFHR --cuda --lr 0.0001 --displayInterval 4 --valEpoch 1 --saveEpoch

#### Here is the fine-tuning for the 1st submission to ICFHR 2018

In [None]:
# Here was the best model training on general data and validating on specific data:
#experiments/expr_ICFHR_17Apr_binarization_augmentation/netCRNN_6_1988.pth

lmdb_database_base = "/home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/"
spec_tr_lists_dir = "/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/"
train_data = "/home/ubuntu/datasets/read_ICFHR/specific_data/"
howe_data = "/home/ubuntu/datasets/read_ICFHR/specific_data_howe/"
simplebin_data = "/home/ubuntu/datasets/read_ICFHR/specific_data_imgtxt/"

# Let's make the script
#python crnn_main.py --trainroot this is the specific data 1, 4, 16 /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/general_data 
#--valroot this will be the corresponding train_8 for validation /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data --dataset ICFHR --cuda --lr 0.0001 
#--displayInterval 120 --valEpoch 1 --saveEpoch 1 --workers 10 --niter 15 --experiment this should be a name unique to the train setexperiments/expr_ICFHR_17Apr_binarization_augmentation 
#--keep_ratio --imgH 60 --imgW 240 --batchSize 6 --binarize > name unique to the train set log_files/log_ICFHR_17Apr_binarization_augmentation.txt

for num in set([d.partition("_train_")[0] for d in dirs]):
    for s in ["1", "4", "16"]:
        script = ' '.join(["python crnn_main.py", "--trainroot", os.path.join(lmdb_database_base, num + "_train_" + s), "--valroot", os.path.join(lmdb_database_base, num + "_train_" + "8"), 
                               "--dataset ICFHR --cuda --lr 0.0001 --displayInterval 4 --valEpoch 1 --saveEpoch 1 --workers 10 --niter 15 --keep_ratio --imgH 60 --imgW 240 --batchSize 6 --binarize", 
                           "--experiment", "experiments/expr_" + "ICFHR_18Apr_finetuning_allnet_" + num + "_train_" + s, 
                           "--crnn",  "experiments/expr_ICFHR_17Apr_binarization_augmentation/netCRNN_6_1988.pth", ">",
                           "log_files/log_ICFHR_18Apr_finetuning_allnet_" + num + "_train_" + s + ".txt"])
        print(script)


### Here is the code for Russell's computer

In [4]:
# Here was the best model training on general data and validating on specific data:
#0.4007 First model I want to try
#experiments/expr_ICFHR_27Apr_binarization_distortion_randomaffine_testside/netCRNN_27_5963.pth
#0.3937 one model I could try
#experiments/expr_ICFHR_27Apr_binarization_distortion_randomaffine_testside/netCRNN_45_5963.

#Make sure to do the training augmentation, same one.


lmdb_database_base = "/deep_data/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/"

# Let's make the script
#python crnn_main.py --trainroot this is the specific data 1, 4, 16 /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/general_data 
#--valroot this will be the corresponding train_8 for validation /home/ubuntu/russell/nephi/data/lmdb_ICFHR_bin/specific_data --dataset ICFHR --cuda --lr 0.0001 
#--displayInterval 120 --valEpoch 1 --saveEpoch 1 --workers 10 --niter 15 --experiment this should be a name unique to the train setexperiments/expr_ICFHR_17Apr_binarization_augmentation 
#--keep_ratio --imgH 60 --imgW 240 --batchSize 6 --binarize > name unique to the train set log_files/log_ICFHR_17Apr_binarization_augmentation.txt

for num in set([d.partition("_train_")[0] for d in dirs]):
    for s in ["1", "4", "16"]:
        script = ' '.join(["python crnn_main.py", "--trainroot", os.path.join(lmdb_database_base, num + "_train_" + s), "--valroot", os.path.join(lmdb_database_base, num + "_train_" + "8"), 
                               "--dataset ICFHR --cuda --lr 0.0001 --displayInterval 4 --valEpoch 1 --saveEpoch 1 --workers 3 --niter 40 --keep_ratio --imgH 60 --imgW 240 --batchSize 2 --binarize", 
                           "--experiment", "/home/remi10001/deep_data/experiments/expr_" + "ICFHR_28Apr_finetuning_fullaugment_" + num + "_train_" + s, 
                           "--crnn",  "experiments/expr_ICFHR_27Apr_binarization_distortion_randomaffine_testside/netCRNN_27_5963.pth", ">",
                           "log_files/log_ICFHR_28Apr_finetuning_fullaugment_" + num + "_train_" + s + ".txt"])
        print(script)
        print("")


python crnn_main.py --trainroot /deep_data/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/35013_train_1 --valroot /deep_data/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/35013_train_8 --dataset ICFHR --cuda --lr 0.0001 --displayInterval 4 --valEpoch 1 --saveEpoch 1 --workers 3 --niter 40 --keep_ratio --imgH 60 --imgW 240 --batchSize 2 --binarize --experiment /home/remi10001/deep_data/experiments/expr_ICFHR_28Apr_finetuning_fullaugment_35013_train_1 --crnn experiments/expr_ICFHR_27Apr_binarization_distortion_randomaffine_testside/netCRNN_27_5963.pth > log_files/log_ICFHR_28Apr_finetuning_fullaugment_35013_train_1.txt

python crnn_main.py --trainroot /deep_data/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/35013_train_4 --valroot /deep_data/nephi/data/lmdb_ICFHR_bin/specific_data_each_doc/35013_train_8 --dataset ICFHR --cuda --lr 0.0001 --displayInterval 4 --valEpoch 1 --saveEpoch 1 --workers 3 --niter 40 --keep_ratio --imgH 60 --imgW 240 --batchSize 2 --binarize --experiment /h

# Below is code I wrote in developing this notebook

In [None]:
# /deep_data/datasets/ICFHR_Data/specific_data_train_list
# /deep_data/datasets/ICFHR_Data/specific_data_val_list

num = str(30882)

spec_data = "/deep_data/datasets/ICFHR_Data/specific_data"
spec_tr_lists_dir = "/deep_data/datasets/ICFHR_Data/specific_data_train_list"
spec_lists_files = glob(os.path.join(spec_tr_lists_dir, "*"))

total_files = glob(os.path.join(spec_data, num + "_train", "*/*.jpg"))
base_files = set([os.path.basename(f) for f in total_files])

list_1 = "/deep_data/datasets/ICFHR_Data/specific_data_train_list/30882_train_1.lst"
list_4 = "/deep_data/datasets/ICFHR_Data/specific_data_train_list/30882_train_4.lst"
list_16 = "/deep_data/datasets/ICFHR_Data/specific_data_train_list/30882_train_16.lst"

fs_1 = set(open(list_1).read().split())
fs_4 = set(open(list_4).read().split())
fs_16 = set(open(list_16).read().split())
fs_8 = fs_16 - fs_4

## I was trying to figure out a Python way of running the loop through the shell, but I might as well just print all the shell commands and run them as a script (a script to make all of these LMDB databases). That won't be hard at all. But now my time is up.
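
For reference, here is a minimal sketch of the "Python way" using the standard `subprocess` module. It is not what this notebook ended up doing. The helper `run_command` is hypothetical, and it assumes each command has at most one ` > logfile` redirect at the end, which is how the commands in this notebook are built. The example runs a harmless Python one-liner instead of `crnn_main.py`.

```python
import os
import shlex
import subprocess
import sys
import tempfile

def run_command(command):
    """Run one printed command; a trailing '> file' sends stdout to that file."""
    # Assumption: " > " appears only as the final output redirect.
    cmd, sep, log_path = command.partition(" > ")
    args = shlex.split(cmd)
    if sep:
        with open(log_path.strip(), "w") as log:
            return subprocess.call(args, stdout=log)
    return subprocess.call(args)

# Stand-in command: a Python one-liner whose output is redirected to a log file.
log_file = os.path.join(tempfile.mkdtemp(), "log.txt")
status = run_command(shlex.quote(sys.executable) + ' -c "print(42)" > ' + log_file)
print(status)
```

A non-zero return value means the command failed, so a loop over all commands could stop or record the failure at that point.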

# Next step is to make the 8 lists and then just create the script. Not hard at all

In [26]:
! ls $lmdb_database_base



30866_train_1	30882_train_1	30893_train_1	35013_train_1	35015_train_1
30866_train_16	30882_train_16	30893_train_16	35013_train_16	35015_train_16
30866_train_4	30882_train_4	30893_train_4	35013_train_4	35015_train_4
30866_train_8	30882_train_8	30893_train_8	35013_train_8	35015_train_8


In [3]:
spec_data = "/home/ubuntu/datasets/read_ICFHR/specific_data"
spec_tr_lists_dir = "/home/ubuntu/datasets/read_ICFHR/specific_data_train_list"
spec_lists_files = glob(os.path.join(spec_tr_lists_dir, "*"))

In [4]:
spec_lists_files

['/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30866_train_1.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30882_train_4.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30882_train_1.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/35013_train_4.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/35015_train_16.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30866_train_16.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30893_train_16.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30882_train_16.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/35015_train_1.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30893_train_4.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/35013_train_1.lst',
 '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/35015_train_4.lst',
 '/home/ubuntu/datasets/read_ICFHR/s

In [24]:
for f in spec_lists_files:
    if "30882" in f:
        print(f)

/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30882_train_4.lst
/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30882_train_1.lst
/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30882_train_16.lst


In [None]:
with open(spec_lists_files[1], "r") as f:
    
    print(f.read())

In [6]:
    
# Fragment: copy one file into 'folder_name' (filePath is not defined anywhere in this notebook).
folderPath = os.path.join('folder_name', os.path.basename(filePath))
shutil.copyfile(filePath, folderPath)

30882_0004_1065839_region_1439904809242_204_line_1439904834897_206.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904843793_207.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904853185_208.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904869649_209.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904881136_210.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904890756_211.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904904562_212.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904914236_213.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904927283_214.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904941685_215.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904955219_216.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904965800_217.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904980099_218.jpg
30882_0004_1065839_region_1439904809242_204_line_1439904991252_219.jpg
30882_

In [9]:
with open(spec_lists_files[1], "r") as f:
    
    print(f.read().split())

['30882_0004_1065839_region_1439904809242_204_line_1439904834897_206.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904843793_207.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904853185_208.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904869649_209.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904881136_210.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904890756_211.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904904562_212.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904914236_213.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904927283_214.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904941685_215.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904955219_216.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904965800_217.jpg', '30882_0004_1065839_region_1439904809242_204_line_1439904980099_218.jpg', '30882_0004_1065839_region_1439904809

I should do subset analyses of the specific training files with each other and with all the files that are present in the full set, to try to make a development set.
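
A subset analysis like this could be sketched as a small helper that names the relationship between every pair of file sets. The function `subset_report` and the toy sets are hypothetical. With the real data, the values would be sets such as `fs_1`, `fs_4`, `fs_16` and `base_files`.

```python
from itertools import combinations

def subset_report(named_sets):
    """Describe how each pair of named sets relates to each other."""
    report = {}
    for name_a, name_b in combinations(sorted(named_sets), 2):
        a, b = named_sets[name_a], named_sets[name_b]
        if a == b:
            relation = "equal"
        elif a < b:
            relation = "proper subset"
        elif a > b:
            relation = "proper superset"
        elif a & b:
            relation = "overlapping"
        else:
            relation = "disjoint"
        # The relation reads as "<name_a> is a <relation> of <name_b>".
        report[(name_a, name_b)] = relation
    return report

# Made-up page names, nested the way the provided lists are expected to be.
toy = {
    "fs_1": {"p1"},
    "fs_4": {"p1", "p2"},
    "fs_16": {"p1", "p2", "p3", "p4"},
    "all": {"p1", "p2", "p3", "p4"},
}
for pair, relation in sorted(subset_report(toy).items()):
    print(pair, relation)
```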

In [4]:
num = str(30882)

spec_data = "/home/ubuntu/datasets/read_ICFHR/specific_data"
spec_tr_lists_dir = "/home/ubuntu/datasets/read_ICFHR/specific_data_train_list"
spec_lists_files = glob(os.path.join(spec_tr_lists_dir, "*"))

total_files = glob(os.path.join(spec_data, num + "_train", "*/*.jpg"))
base_files = set([os.path.basename(f) for f in total_files])

list_1 = "/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30882_train_1.lst"
list_4 = "/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30882_train_4.lst"
list_16 = "/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30882_train_16.lst"

fs_1 = set(open(list_1).read().split())
fs_4 = set(open(list_4).read().split())
fs_16 = set(open(list_16).read().split())
fs_8 = fs_16 - fs_4

IOError: [Errno 2] No such file or directory: '/home/ubuntu/datasets/read_ICFHR/specific_data_train_list/30882_train_1.lst'

Here are the paths for working at my home computer

In [7]:
num = str(30882)

spec_data = "/deep_data/datasets/ICFHR_Data/specific_data"
spec_tr_lists_dir = "/deep_data/datasets/ICFHR_Data/specific_data_train_list"
spec_lists_files = glob(os.path.join(spec_tr_lists_dir, "*"))

total_files = glob(os.path.join(spec_data, num + "_train", "*/*.jpg"))
base_files = set([os.path.basename(f) for f in total_files])

list_1 = "/deep_data/datasets/ICFHR_Data/specific_data_train_list/30882_train_1.lst"
list_4 = "/deep_data/datasets/ICFHR_Data/specific_data_train_list/30882_train_4.lst"
list_16 = "/deep_data/datasets/ICFHR_Data/specific_data_train_list/30882_train_16.lst"

fs_1 = set(open(list_1).read().split())
fs_4 = set(open(list_4).read().split())
fs_16 = set(open(list_16).read().split())
fs_8 = fs_16 - fs_4

The code below demonstrates that fs_16 contains as many files as are available in total (the 84 files in fs_4 plus the 244 in fs_8 give 328, the size of base_files), and that each page list is a subset of the next larger page list. (I've shown it for collection 30882, but I am sure it is true for all of them.)

### Therefore, I think a good validation split for the tuning of training files is to select those files not in fs_4, train on those, then predict on fs_4. Then we should use this model for tuning with 1, 4 and 16 additional pages, as it should be the best model.
### I will need to make a split for this "fs_12" and a split for all 1, 4, and 16 subsets for all the document types. Then we can experiment with fine tuning after I enable fine-tuning of convolutional layers.

# I think once I've done this splitting and made lmdb databases of the same, then I can start figuring out how to do a results file from them.

In [46]:
# Union of all the training files (fs_1 and fs_4 are contained in fs_16)
train_subs = fs_16  # fs_1 | fs_4 | fs_16

# What remains of total from this union
diff = base_files - train_subs

# Is fs_1 a subset of fs_4? (0 leftover files means yes)
print(len(fs_1 - fs_4))

0


In [8]:
len(fs_4)

84

In [41]:
len(base_files)

328

In [10]:
len(fs_8)

244