# Introduction

Optical character recognition is an old "AI" and image-processing task.  It involves taking a photograph or scan of a piece of text (printed or handwritten) and turning the images of the characters into character codes, so that the text can be edited, indexed, etc.  A key part of that process is identifying where the characters actually are, especially if they are mixed in with non-writing content, such as images of objects or people.

In this assignment, you will take images from a Chinese image database with annotations that indicate where the Chinese characters are, and you will train a model that takes test images and superimposes upon them a visualization (of your choosing, e.g., a "heat map") of the likelihood that a pixel is close to or part of a valid Chinese character.  The image database contains annotations of "bounding boxes", coordinates of the corners of a box that contains a single Chinese character.  In a sense, this assignment asks you to detect the bounding boxes in test images without the annotation, but a softer version of this: simply to provide the probability, for each pixel, whether that pixel was part of a bounding box containing a Chinese character.  Then, you are to (1) superimpose upon the image a pixel-based map of likelihoods of where the bounding boxes ought to be and (2) apply an evaluation statistic.

This assignment grants you a lot of freedom in how you organize your code and set up the task overall.  Because of the degree of freedom it involves, it will mostly be graded on our evaluation of the effort put into the solution.  An actual high success at the task is not a requirement to get a high grade.  However, you will have to report in detail, in your own format, what you did, why you did it, how to run it -- it must run on mltgpu, be implemented in Python using PyTorch, and make use of the GPUs -- and how to apply it easily to our own test images.

You will have almost a month to do this assignment, even though it is worth only 30% of your grade.  Another assignment with 30% will be given out for the last/remaining two weeks of the study period.   These time periods are coextensive with that of the project, but we expect you to be able to schedule your time well enough to put in an effort at both. This assignment is officially due at 23:59 on 2021 October 18. There are 30 points on this assignment, and a maximum of 20 bonus points.
# The data

The source of the task is this dataset (https://ctwdataset.github.io/). The site has example images and an example of a baseline task that is much more advanced than what we are doing, but it will give you an idea of the data format, particularly the metadata.  Pay attention especially to the "Annotation format" section of this page: https://ctwdataset.github.io/tutorial/1-basics.html

The metadata and a small sample of the whole image dataset are available at /scratch/lt2326-h21/a1 on mltgpu. The metadata is in json format.  info.json contains information about every image file.  We will unzip only a minority of the original training image files.  train.jsonl is a list of json entities, one per line (each of which has to be parsed separately with the json package), that correspond to the files in info.json.  This contains the bounding box information, as well as other information for the original challenge on the web.  See the "Annotation format" section mentioned on the dataset web page linked above.
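Because train.jsonl holds one json entity per line, it cannot be read with a single `json.load` call. A minimal sketch of line-by-line parsing is given below; the two-record sample string only stands in for the real file, and its file names and coordinates are made up for illustration.

```python
import io
import json

# Hypothetical two-line sample standing in for train.jsonl (illustrative values only).
sample = io.StringIO(
    '{"file_name": "a.jpg", "annotations": [[{"is_chinese": true, '
    '"polygon": [[0, 0], [2, 0], [2, 2], [0, 2]]}]]}\n'
    '{"file_name": "b.jpg", "annotations": []}\n'
)

def read_jsonl(handle):
    """Parse one json object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

records = read_jsonl(sample)

# Map each file name to the polygons of its Chinese characters.
polygons_by_file = {
    rec["file_name"]: [
        char["polygon"]
        for sentence in rec["annotations"]
        for char in sentence
        if char["is_chinese"]
    ]
    for rec in records
}
print(polygons_by_file["a.jpg"])
```

With the real file, `sample` would be replaced by the handle returned by `open(...)` on train.jsonl.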

# Part 1: data preparation (7 points)

The image files are in /scratch/lt2326-h21/a1/images on mltgpu. They are in jpg format.  The code that you write for this part of the project should:

- Use the info.json file to figure out what files are in the training set.  You will just use the official training data for everything.  Remember that you will only see a small minority of training examples in the images directory, for space reasons.
- Divide up the official training data files into your own training, validation, and test datasets depending on your own preferences. You can choose to use fewer files than the maximum available if you run into problems with memory and so on (but first make sure your implementation is reasonably efficient).
- Find the corresponding bounding box information in train.jsonl for each image.

You can represent the data in any way you like, but remember that it will become a numpy array for processing and a torch tensor for training.  Remember also that the classes are defined by pixel: for each pixel, you will eventually have a set of features (e.g. colour values), and a binary class corresponding to whether the pixel was in a Chinese character bounding box or not (note that there are non-Chinese characters in the set -- see the annotation instructions).  You are allowed to reduce the dimensionality of the images for processing, but consider using a pooling and/or upsampling technique in Part 2 of this assignment to accomplish this goal. 
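As an illustration of the per-pixel binary class, the sketch below rasterizes one polygon into a boolean mask with matplotlib's `Path.contains_points`. The helper name `polygon_mask` and the toy box are assumptions made for this example. It also assumes that polygon vertices are (x, y) pairs, whereas image arrays are indexed [row, column], so the points are built as (column, row).

```python
import numpy as np
from matplotlib.path import Path

def polygon_mask(polygon, height, width):
    """Return a (height, width) boolean mask; polygon vertices are (x, y) pairs."""
    rows, cols = np.indices((height, width))
    # Test pixel centres; each point is (x, y) = (column + 0.5, row + 0.5).
    points = np.column_stack([cols.ravel() + 0.5, rows.ravel() + 0.5])
    inside = Path(polygon).contains_points(points)
    return inside.reshape(height, width)

# Hypothetical box spanning x in [1, 4] and y in [2, 4] on a 6-row, 5-column image.
mask = polygon_mask([[1, 2], [4, 2], [4, 4], [1, 4]], height=6, width=5)
print(mask.astype(int))
```

Masks of several polygons can be combined with a logical OR to obtain the label map for a whole image.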

Describe the choices you made and the challenges you found in your report.

In [19]:
import json
import os
from os import listdir
import numpy as np
import random
import matplotlib.image as mpimg
import matplotlib.path as mplpath
from joblib import Parallel, delayed
import time
import torch

Use the info.json file to figure out what files are in the training set. You will just use the 
official training data for everything.  Remember that you will only see a small minority of 
training examples in the images directory, for space reasons.

In [2]:
"""
EXTRACT of info.json:
{
  "train": 
  [
      {
          "file_name": "0000172.jpg",
          "height": 2048,
          "image_id": "0000172",
          "width": 2048
      },
      ...
}
"""
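path = '/scratch/lt2326-h21/a1/images'

with open('/scratch/lt2326-h21/a1/info.json') as info_f:
    info_data = json.load(info_f)

# file names listed in the official training set
trainset_in_info = [file['file_name'] for file in info_data['train']]

# keep only the files that are actually present in the images directory;
# the directory is listed once and stored in a set for fast lookups
available_files = set(os.listdir(path))
files_in_info = [file for file in trainset_in_info if file in available_files]

print("Total images:",len(files_in_info))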

Total images: 845


Divide up the official training data files into your own training, validation, and test datasets 
depending on your own preferences. You can choose to use fewer files than the maximum available 
if you run into problems with memory and so on (but first make sure your implementation is 
reasonably efficient).

In [3]:
# Train 60%, Validation 20%, Test 20%
# shuffle the file names so that the split does not depend on file order
# https://www.w3schools.com/python/ref_random_shuffle.asp
random.shuffle(files_in_info)

train_data, val_data, test_data = np.split(files_in_info, [int(.6*len(files_in_info)), int(.8*len(files_in_info))])
print("Images in train:",len(train_data),"\nImages in test:",len(test_data),"\nImages in validation:",len(val_data))

Images in train: 507 
Images in test: 169 
Images in validation: 169
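
The shuffle above is unseeded, so every run produces a different split. A possible variant with a fixed seed is sketched here, which may help when comparing models across runs. The function name `split_files`, the seed value and the generated file names are assumptions for illustration, not part of the original notebook.

```python
import random

def split_files(files, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a sorted copy of `files` with a fixed seed and cut it into three parts."""
    files = sorted(files)                      # start from a deterministic order
    random.Random(seed).shuffle(files)         # private generator, global state untouched
    n_train = int(train_frac * len(files))
    n_val = n_train + int(val_frac * len(files))
    return files[:n_train], files[n_train:n_val], files[n_val:]

names = [f"{i:07d}.jpg" for i in range(845)]   # hypothetical file names
train, val, test = split_files(names)
print(len(train), len(val), len(test))
```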


Find the corresponding bounding box information in train.jsonl for each image.

In [49]:
"""
EXTRACT of train.jsonl:
[
    {"annotations": 
        [
            [
                {"adjusted_bbox": 
                    [140.26028096262758,
                    897.1957001682758,
                    22.167573140645146,
                    38.36424196832945],
                "attributes": 
                    ["distorted", "raised"], 
                "is_chinese": 
                    true, 
                "polygon": 
                    [
                        [140.26028096262758, 
                        896.7550603352049], 
                        [162.42785410327272,
                        898.0769798344178], 
                        [162.42785410327272, 
                        935.7929346470926], 
                        [140.26028096262758, 
                        935.0939571156308]
                    ], 
                "text": 
                    "\u660e"
                },
                {'adjusted_bbox': 
                    [162.42785410327272, 
                    898.5416545674744, 
                    23.376713493771263, 
                    37.74268246537315], 
                'attributes': 
                    ['distorted', 'raised'], 
                'is_chinese': 
                    True, 
                'polygon': 
                    [
                        [162.42785410327272, 
                        898.0769798344178], 
                        [185.80456759704398, 
                        899.4710040335876], 
                        [185.80456759704398, 
                        936.5300382257251], 
                        [162.42785410327272, 
                        935.7929346470926]
                    ], 
                'text': '海'},
            ],
            ...,
        ],
    'file_name': '0000176.jpg',
    'height': 2048,
    'ignore': 
        [
            {'bbox': 
                [317.58560991549297, 
                917.757599265996, 
                38.45594460201204,
                16.419392077263637], 
            'polygon': 
                [
                    [318.44978844587524, 
                    917.757599265996], 
                    [356.041554517505, 
                    919.0538670615695], 
                    [355.17737598712273, 
                    934.1769913432596], 
                    [317.58560991549297, 
                    932.8807235476861]
                ]
            },
            {'bbox': 
                [30.67833782857143, 
                1040.4823966535316, 
                9.938053099396377, 
                15.526483903023745], 
            'polygon': 
                [
                    [30.67833782857143, 
                    1040.4823966535316], 
                    [40.61639092796781, 
                    1040.6140264959079], 
                    [40.61639092796781, 
                    1055.6113584325794], 
                    [30.67833782857143, 
                    1056.0088805565554]
                ]
            },
            ...,
        ],
    'image_id': '0000176',
    'width': 2048
    },
]
            
"""
#do a training set with polygon
#dictionary with images as key and list of polygon as values
#{0000172:[[[140,896],[162,898],[162,935],[140,935]],[polygon,polygon,polygon,polygon],[...]]}

def get_bounding_box0(data):
    with open('/scratch/lt2326-h21/a1/train.jsonl') as jsonl_f:
        jsonl_data = [json.loads(x) for x in jsonl_f]
        images_polygons = []

        for dictionary in jsonl_data:
            if dictionary['file_name'] in data:
                insider_data = {dictionary['file_name']:[]}

                for sentence in dictionary['annotations']:
                    for character in sentence:
                        if character['is_chinese']:
                            insider_data[dictionary['file_name']].append(character['polygon'])
                images_polygons.append(insider_data)
        
    return images_polygons

def get_bounding_box(data):
    # returns a list of (file_name, [polygon, ...]) tuples for the files in data,
    # keeping only the polygons of Chinese characters
    wanted_files = set(data)
    images_polygons = []

    with open('/scratch/lt2326-h21/a1/train.jsonl') as jsonl_f:
        # one json object per line
        for line in jsonl_f:
            item = json.loads(line)
            if item['file_name'] in wanted_files:
                insider_data = []

                for sentence in item['annotations']:
                    for character in sentence:
                        if character['is_chinese']:
                            insider_data.append(character['polygon'])
                images_polygons.append((item['file_name'],insider_data))

    return images_polygons

In [50]:
train_polygon = get_bounding_box(train_data)
test_polygon = get_bounding_box(test_data)
val_polygon = get_bounding_box(val_data)

You can represent the data in any way you like, but remember that it will become a numpy array for processing and a torch tensor for training. Remember also that the classes are defined by pixel: for each pixel, you will eventually have a set of features (e.g. colour values), and a binary class corresponding to whether the pixel was in a Chinese character bounding box or not (note that there are non-Chinese characters in the set -- see the annotation instructions). You are allowed to reduce the dimensionality of the images for processing, but consider using a pooling and/or upsampling technique in Part 2 of this assignment to accomplish this goal. 
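
One possible way to reduce the resolution before training is block pooling with plain numpy, sketched below: colours are averaged per block, while the label mask uses a maximum so that a block stays positive if any of its pixels was positive. The helper name `block_reduce` and the toy 4x4 arrays are assumptions for illustration; a 2048x2048 image would work the same way with any factor that divides 2048.

```python
import numpy as np

def block_reduce(array, factor, reducer):
    """Reduce the first two axes of `array` by `factor`, applying `reducer` to each block."""
    h, w = array.shape[:2]
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    blocks = array.reshape(h // factor, factor, w // factor, factor, *array.shape[2:])
    return reducer(blocks, axis=(1, 3))

image = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)   # hypothetical 4x4 RGB image
mask = np.zeros((4, 4), dtype=bool)
mask[0, 1] = True                                            # a single labelled pixel

small_image = block_reduce(image, 2, np.mean)   # average colour of each 2x2 block
small_mask = block_reduce(mask, 2, np.max)      # positive if any pixel in the block is
print(small_image.shape)
print(small_mask.astype(int))
```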

In [73]:
# resize images (2048 to 200) because too big
# image to tensors to train
# tensors to np array to process
# binary class (2048 * 2048)=4194304
# where a pixel is either in a chinese character bounding box or not
# gold / truth / expected values

"""
# https://datacarpentry.org/image-processing/aio/index.html
# coordinate system, x y
Imagine that we have a fairly large, but very boring image: a 5,000 × 5,000 pixel image composed of nothing but white pixels. If we used an uncompressed image format such as BMP, with the 24-bit RGB color model, how much storage would be required for the file?
In such an image, there are 5,000 × 5,000 = 25,000,000 pixels, and 24 bits for each pixel, leading to 25,000,000 × 24 = 600,000,000 bits, or 75,000,000 bytes (71.5MB). That is quite a lot of space for a very uninteresting image!

get images resized and tensored
"""

def get_truth(dataset,key):
    # older variant: expects the output of get_bounding_box0, i.e. a list of
    # {file_name: [polygon, ...]} dictionaries, with the file name as key

    # all pixel coordinates [a, b], shape (2048*2048, 2)
    grid = np.indices((2048, 2048)).reshape(2, -1).T

    image_polygon = np.zeros(2048 * 2048, dtype=bool)

    for item in dataset:
        # dictionaries that belong to other images are skipped
        for polygon in item.get(key, []):
            polygon_path = mplpath.Path(polygon)
            truth_polygon = polygon_path.contains_points(grid)

            # updating
            # logical OR in case the same pixel is in different polygons
            image_polygon = np.logical_or(truth_polygon, image_polygon)

    # turning it into 0s and 1s
    truth_tensor = torch.from_numpy(image_polygon).long()

    # load the image that goes with the mask
    return (get_image(key), truth_tensor)

def get_image(key):
    image = mpimg.imread(path + "/" + key)
    # image = transform.resize(image, (250, 250))
    image = torch.tensor(image).float()
    
    return image

def get_truth0(polygons_list):
    # all pixel coordinates [a, b] in a-major order, shape (2048*2048, 2)
    grid = np.indices((2048, 2048)).reshape(2, -1).T

    truth_array = np.zeros(2048 * 2048, dtype=bool)
    for polygon in polygons_list:
        polygon_path = mplpath.Path(polygon)
        # a pixel is positive if it lies inside at least one polygon
        truth_array |= polygon_path.contains_points(grid)

    # NOTE: polygons are given as (x, y), so flat index a * 2048 + b holds the
    # label for x=a, y=b; reshaping to (2048, 2048) gives a mask indexed [x, y],
    # which has to be transposed to line up with image arrays indexed [row, column]
    truth_tensor = torch.from_numpy(truth_array).long() #.to(device)

    return truth_tensor

def process_images_in_parallel(item):
    return (get_image(item[0]), get_truth0(item[1]))

In [74]:
# Parallelize python code to save time on data processing
# https://stackoverflow.com/questions/42220458/what-does-the-delayed-function-do-when-used-with-joblib-in-python
# https://www.tutorialdocs.com/tutorial/joblib/examples.html

def process_in_parallel(dataset):

    tic = time.time()
    #processed_data = Parallel(n_jobs=10)(delayed(get_truth)(dataset,key) for item in dataset for key in item.keys())
    processed_data = Parallel(n_jobs=10)(delayed(process_images_in_parallel)(item) for item in dataset)
    
    toc = time.time()

    print('Elapsed time for the entire processing: {:.2f} s'.format(toc - tic))
    
    return processed_data

In [76]:
for item in val_polygon:
    print(item[0:1])

('0000176.jpg',)
('0000188.jpg',)
('0000400.jpg',)
('0000431.jpg',)
('0000444.jpg',)
('0000445.jpg',)
('0000447.jpg',)
('0000449.jpg',)
('0000453.jpg',)
('0000464.jpg',)
('0000471.jpg',)
('0000526.jpg',)
('0000559.jpg',)
('0000560.jpg',)
('0000565.jpg',)
('0000574.jpg',)
('0000579.jpg',)
('0000606.jpg',)
('0000614.jpg',)
('0000619.jpg',)
('0000621.jpg',)
('0000638.jpg',)
('0000640.jpg',)
('0000683.jpg',)
('0000705.jpg',)
('0000707.jpg',)
('0000710.jpg',)
('0000810.jpg',)
('0000821.jpg',)
('0000827.jpg',)
('0000832.jpg',)
('0000835.jpg',)
('0000852.jpg',)
('0000854.jpg',)
('0000855.jpg',)
('0000863.jpg',)
('0000884.jpg',)
('0000889.jpg',)
('0000892.jpg',)
('0000901.jpg',)
('0000913.jpg',)
('0000932.jpg',)
('0001230.jpg',)
('0001234.jpg',)
('0001250.jpg',)
('0001270.jpg',)
('0001366.jpg',)
('0001369.jpg',)
('0001376.jpg',)
('0001401.jpg',)
('0001403.jpg',)
('0001525.jpg',)
('0001531.jpg',)
('0001547.jpg',)
('0001548.jpg',)
('0001566.jpg',)
('0001570.jpg',)
('0001571.jpg',)
('0001573.jpg'

In [75]:
print("Processing validation data...")
val_processed = process_in_parallel(val_polygon)

Processing validation data...
Elapsed time for the entire processing: 213.52 s


In [None]:
print("Processing training data...")
train_processed = process_in_parallel(train_polygon)
print("Processing test data...")
test_processed = process_in_parallel(test_polygon)
print("Processing validation data...")
val_processed = process_in_parallel(val_polygon)

In [26]:
# https://d2l.ai/chapter_computer-vision/object-detection-dataset.html
# https://thegradient.pub/semantic-segmentation/
import torch
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.path as mppath
from PIL import Image
import cv2

device = torch.device('cuda:3')

batch_size = 4
learning_rate = 0.001
epochs = 3

#train_dataloader = torch.utils.data.DataLoader(train_processed, batch_size=batch_size, shuffle=True)
array([[   0,    0],
       [   0,    1],
       [   0,    2],
       ...,
       [2047, 2045],
       [2047, 2046],
       [2047, 2047]])

# Part 2: the models (10 points)

In this part, you will implement two substantially different model architectures, that both take your representation of the images as training input and both take your representation of the bounding boxes as objective (HINT: the binary classification of pixels as belonging to a bounding box or not).  They will save the trained models to files so that they can be loaded and tested later. The output of the models will be a "soft binary" -- the probability of each pixel being inside a bounding box, from 0 to 1.  Consider examining some of the training data before designing your architectures.

You have a large grant of freedom as to what these model architectures will look like (remember: grading is on a "reasonable effort" basis).  There's a high chance (HINT) that they will both use one or more convolutional layers, among other things.  Describe the models and the motivations for the architecture in your report.
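
To make the input/output contract concrete, here is a minimal sketch of one possible convolutional architecture, not a recommended solution. The class name `TinyPixelClassifier`, the layer sizes and the random stand-in tensors are all assumptions for illustration. It maps an (N, 3, H, W) batch to per-pixel probabilities of the same height and width, is trained with binary cross-entropy, and is saved and reloaded through its state dict. The sketch runs on the CPU; on mltgpu the model and tensors would additionally be moved to a GPU device with `.to(device)`.

```python
import io

import torch
from torch import nn

class TinyPixelClassifier(nn.Module):
    """Hypothetical fully convolutional net: (N, 3, H, W) -> (N, 1, H, W) probabilities."""

    def __init__(self, hidden=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                               # halve the resolution
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),   # restore the resolution
        )
        self.head = nn.Conv2d(hidden, 1, kernel_size=1)    # one score per pixel

    def forward(self, x):
        return torch.sigmoid(self.head(self.features(x)))

model = TinyPixelClassifier()
images = torch.rand(2, 3, 16, 16)                      # random stand-in images
targets = (torch.rand(2, 1, 16, 16) > 0.5).float()     # random stand-in masks

probs = model(images)
loss = nn.functional.binary_cross_entropy(probs, targets)
loss.backward()                                        # gradients for one training step

# Save and reload the weights; an in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)
restored = TinyPixelClassifier()
restored.load_state_dict(torch.load(buffer))
```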

In [None]:
# representation of the images as training input
# representation of the bounding boxes as objective -> binary classification

# save the models for later
# output : soft binary = probability between 0 and 1 that a pixel is in a bounding box

# Part 3: testing and evaluation (13 points)

You can use your test data by feeding the test images forward through the models. The output of the models will be pixel maps of the probability of a particular pixel being inside a bounding box.  These will be compared outside the model to the test data's bounding boxes.  You can use a number of different evaluation strategies -- one of them being to choose a probability threshold to decide whether a pixel is inside the bounding box or not, and then take recall/precision/F1/accuracy. Another one is to report it in terms of error, such as mean squared error. Even given your architectural choices, you will likely have hyperparameters to tune.  Describe the progress of your training and testing, with graphs if necessary, in your report.
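
The two evaluation strategies mentioned above can be sketched as follows. This is a minimal illustration with numpy, assuming the model output and the ground truth are arrays of the same shape; the function name `pixel_metrics`, the default threshold of 0.5 and the toy values are choices made for this example.

```python
import numpy as np

def pixel_metrics(probs, truth, threshold=0.5):
    """Compare per-pixel probabilities with a 0/1 ground-truth mask."""
    probs = np.asarray(probs, dtype=float)
    truth = np.asarray(truth).astype(bool)
    pred = probs >= threshold                  # hard decision per pixel

    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / truth.size,
        # the error-based alternative works on the probabilities directly
        "mse": float(np.mean((probs - truth.astype(float)) ** 2)),
    }

probs = np.array([[0.9, 0.8], [0.4, 0.1]])     # toy model output
truth = np.array([[1, 0], [1, 0]])             # toy ground truth
metrics = pixel_metrics(probs, truth)
print(metrics)
```

Since most pixels in a street photograph are likely to lie outside any bounding box, accuracy alone may look high even for a model that predicts no text at all, which is one reason to report precision and recall as well.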

It should also be possible to examine the effects of applying the model to individual images.  Make it possible to visually represent the pixel/bounding box probabilities superimposed on the original images.  Examine some of the images to conduct a qualitative error analysis of your trained models. Include this analysis in your report.
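
One simple way to superimpose the probabilities is per-pixel alpha blending, sketched below with numpy only. The function name `overlay_heatmap`, the red overlay colour and the maximum opacity are arbitrary choices for illustration, and the image is assumed to be an (H, W, 3) uint8 array with a probability map of matching height and width.

```python
import numpy as np

def overlay_heatmap(image, probs, colour=(255, 0, 0), max_alpha=0.6):
    """Blend `colour` into an (H, W, 3) uint8 image, weighted per pixel by `probs`."""
    image = np.asarray(image, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), 0.0, 1.0)
    alpha = max_alpha * probs[..., None]       # (H, W, 1), broadcast over the channels
    blended = (1.0 - alpha) * image + alpha * np.asarray(colour, dtype=float)
    return blended.round().astype(np.uint8)

image = np.full((2, 2, 3), 100, dtype=np.uint8)    # hypothetical uniform grey image
probs = np.array([[0.0, 1.0], [0.5, 0.0]])         # toy probability map
result = overlay_heatmap(image, probs)
print(result[0, 1])
```

The blended array can then be shown with `matplotlib.pyplot.imshow` or written to disk for the report.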

# Bonus A: detecting a specific character (10 points)

Use the features from your model in Part 2 to classify images as to whether or not they contain a particular Chinese character.  Evaluate and report on your classifier.

# Bonus B: optical character recognition (20 points)

Try your hand at attempting to build a model for Chinese character identification in the same vein as the original task.  Devise an evaluation and report on your work.