# **Introduction**

Optical character recognition (OCR) is a long-standing "AI" and image-processing task.  It involves taking a photograph or scan of a piece of text (printed or handwritten) and turning the characters (as images) into character codes, so that the text can be edited, indexed, etc.  A key part of that process is identifying where the characters actually are, especially if they are mixed in with non-text content, such as images of objects or people.

In this assignment, you will take images from a Chinese image database with annotations that indicate where the Chinese characters are, and you will train a model that takes test images, and superimposes upon them a visualization (of your choosing, e.g., a "heat map") of the likelihood that a pixel is close to or part of a valid Chinese character.  The image database contains annotations of "bounding boxes", coordinates of the corners of a box that contains a single Chinese character.  In a sense, this assignment asks you to detect the bounding boxes in test images without the annotation, but a softer version of this: simply to provide the probability, for each pixel, whether that pixel was part of a bounding box containing a Chinese character.  Then, you are to (1) superimpose upon the image a pixel-based map of likelihoods of where the bounding boxes ought to be and (2) apply an evaluation statistic.

This assignment grants you a lot of freedom in how you organize your code and set up the task overall.  Because of the degree of freedom it involves, it will mostly be graded on our evaluation of the effort put into the solution.  An actual high success at the task is not a requirement to get a high grade.  However, you will have to report in detail, in your own format, what you did, why you did it, how to run it -- it must run on mltgpu, be implemented in Python using PyTorch, and make use of the GPUs -- and how to apply it easily to our own test images.

You will have almost a month to do this assignment, even though it is worth only 30% of your grade.  Another assignment with 30% will be given out for the last/remaining two weeks of the study period.   These time periods are coextensive with that of the project, but we expect you to be able to schedule your time well enough to put in an effort at both. This assignment is officially due at **23:59 on 2021 October 18**. There are 30 points on this assignment, and a maximum of 20 bonus points.

# **The data**

The source of the task is here: https://ctwdataset.github.io/ They have example images and an example of a baseline task that is much more advanced than what we are doing, but it will give you an idea of the data format, particularly the metadata.  Pay attention especially to the "Annotation format" section of this page: https://ctwdataset.github.io/tutorial/1-basics.html

The metadata and a small sample of the whole image dataset is available at /scratch/lt2326-h21/a1 on mltgpu. The metadata is in json format.  info.json contains information about every image file.  We will unzip only a minority of the original training image files.  train.jsonl is a list of json entities, one per line (that have to be parsed with the json package each separately) that correspond to the files in info.json.  This contains the bounding box information, as well as other information for the original challenge on the web.  See the "Annotation format" section mentioned on the dataset web page linked above.

# **Part 1: data preparation (7 points)**

The image files are in /scratch/lt2326-h21/a1/images on mltgpu. They are in jpg format.  The code that you write for this part of the project should:

- Use the info.json file to figure out what files are in the training set.  You will just use the official training data for everything.  Remember that you will only see a small minority of training examples in the images directory, for space reasons.
- Divide up the official training data files into your own training, validation, and test datasets depending on your own preferences. You can choose to use fewer files than the maximum available if you run into problems with memory and so on (but first make sure your implementation is reasonably efficient).
- Find the corresponding bounding box information in train.jsonl for each image.
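
The split described in the list above could be sketched as follows. This is only an illustration: the function name `split_dataset`, the 80/10/10 proportions, and the fixed seed are assumptions, not requirements of the assignment. Shuffling before slicing avoids any ordering in the metadata file leaking into the split.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle items reproducibly and cut them into train/validation/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)      # private RNG, so the global seed is untouched
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # whatever remains
    return train, val, test

train, val, test = split_dataset(range(100))
```

Using a fixed seed means the same files end up in the test set on every run, which keeps evaluation results comparable.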

You can represent the data in any way you like, but remember that it will become a numpy array for processing and a torch tensor for training.  Remember also that the classes are defined by pixel: for each pixel, you will eventually have a set of features (e.g. colour values), and a binary class corresponding to whether the pixel was in a Chinese character bounding box or not (note that there are non-Chinese characters in the set -- see the annotation instructions).  You are allowed to reduce the dimensionality of the images for processing, but consider using a pooling and/or upsampling technique in Part 2 of this assignment to accomplish this goal. 
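
One possible way to reduce the image size while keeping image and mask aligned is sketched below. It assumes the image is an (H, W, 3) float tensor and the mask an (H, W) tensor of 0/1 values; the helper name `downsample_pair` and the pooling factor are hypothetical. Colours are averaged, while the mask uses max pooling so that a reduced cell stays positive if any of its pixels was inside a bounding box.

```python
import torch
import torch.nn.functional as F

def downsample_pair(image, mask, factor=4):
    """Shrink an (H, W, 3) image and its (H, W) 0/1 mask by the same factor."""
    img = image.permute(2, 0, 1).unsqueeze(0)           # (1, 3, H, W)
    msk = mask.float().unsqueeze(0).unsqueeze(0)        # (1, 1, H, W)
    small_img = F.avg_pool2d(img, kernel_size=factor)   # average the colours
    small_msk = F.max_pool2d(msk, kernel_size=factor)   # positive if any pixel was positive
    return small_img.squeeze(0), small_msk.squeeze(0).squeeze(0).long()

image = torch.rand(8, 8, 3)
mask = torch.zeros(8, 8, dtype=torch.long)
mask[1, 1] = 1
small_image, small_mask = downsample_pair(image, mask, factor=4)
```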

Describe the choices you made and the challenges you found in your report.

In [1]:
#imports

import json
import numpy as np
import pandas as pd
import random
from torch import nn
import os
import torch
from torchvision import transforms
from PIL import Image
import matplotlib.path as mplpath
from skimage import io

#device
device = torch.device('cuda:0')

In [2]:
#paths

images_directory = "/scratch/lt2326-h21/a1/images"
train_json = "/scratch/lt2326-h21/a1/train.jsonl"
info_json = "/scratch/lt2326-h21/a1/info.json"


#opening files and images

with open(train_json) as trainfile:
    train_data = [json.loads(x) for x in trainfile]

with open(info_json) as infofile:
    info_data = json.load(infofile)

In [3]:
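#figure out what files are in the training set

# info.json maps each split name to a list of image records; keep the official training split
filenames = {d['file_name'] for d in info_data['train']}

# only a subset of the training images is actually unpacked in the images directory
available_images = os.listdir(images_directory)

# image ids (file names without the .jpg extension) that are both listed and present on disk
my_images = []
for n in available_images:
    if n in filenames:
        my_images.append(n.replace('.jpg', ''))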

In [4]:
#getting all usable files

files = []
for image in train_data:
    if image['image_id'] in my_images:
        files.append(image)

In [5]:
#find the corresponding bounding box information in train.jsonl for each image

def chinese(images):
    # returns a list of (image_id, polygons) pairs, keeping only Chinese characters
    ch_list = []
    
    for i in images:
        poly_list = []
        # 'annotations' is a list of sentences, each a list of character instances
        for annotation in i['annotations']:
            for an in annotation:
                if an['is_chinese']:
                    poly_list.append(an['polygon'])
        ch_list.append((i['image_id'], poly_list))
                    
    return ch_list

chinese_list = chinese(files)

In [6]:
def img_to_tensor(path_to_img):
    # read the jpg as an (H, W, 3) array and convert it to a float tensor
    img = io.imread(path_to_img)
    img = torch.tensor(img).float()
    
    return img

In [7]:
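def get_truth(polygons_list, size=2048):
    
    # Pixel coordinates as (x, y) pairs in row-major order (row = y, column = x), so the
    # flat mask lines up with the image array's [row, column] layout.
    # Assumes the polygon vertices are given as [x, y], as in the CTW annotation format.
    ys, xs = np.mgrid[0:size, 0:size]
    grid = np.column_stack((xs.ravel(), ys.ravel()))

    truth_array = np.zeros(size * size, dtype=bool)
    for polygon in polygons_list:
        # a pixel is positive if it falls inside any of the polygons
        truth_array |= mplpath.Path(polygon).contains_points(grid)
    
    # from_numpy avoids the copy-construct warning raised by torch.tensor(tensor)
    truth_tensor = torch.from_numpy(truth_array.astype(np.int64))
    
    return truth_tensor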
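
The mask construction above tests every pixel of the image against every polygon with matplotlib's `Path.contains_points`. A cheaper alternative, sketched here under the assumption that each character polygon is convex and given as (x, y) vertices, is to test only the pixels inside the polygon's own extent using the sign of edge cross products. The function name `rasterize_convex_polygon` is hypothetical, and this is a different technique from the one used in the cell above, not a drop-in equivalent (boundary pixels are counted as inside here).

```python
import numpy as np

def rasterize_convex_polygon(mask, polygon):
    """Set mask[y, x] = 1 for pixels inside a convex polygon given as (x, y) vertices."""
    poly = np.asarray(polygon, dtype=float)
    height, width = mask.shape
    # Restrict the test to the polygon's axis-aligned extent, clipped to the image.
    x0 = max(int(np.floor(poly[:, 0].min())), 0)
    x1 = min(int(np.ceil(poly[:, 0].max())), width - 1)
    y0 = max(int(np.floor(poly[:, 1].min())), 0)
    y1 = min(int(np.ceil(poly[:, 1].max())), height - 1)
    if x0 > x1 or y0 > y1:
        return mask                                   # polygon lies outside the image
    ys, xs = np.mgrid[y0:y1 + 1, x0:x1 + 1]
    # For each edge, the cross product of the edge with the vector to the pixel.
    # A point is inside a convex polygon when all cross products share one sign.
    edges = np.roll(poly, -1, axis=0) - poly
    cross = np.stack([ex * (ys - py) - ey * (xs - px)
                      for (px, py), (ex, ey) in zip(poly, edges)])
    inside = np.all(cross >= 0, axis=0) | np.all(cross <= 0, axis=0)
    mask[y0:y1 + 1, x0:x1 + 1] |= inside.astype(mask.dtype)
    return mask

demo_mask = np.zeros((8, 8), dtype=np.uint8)
rasterize_convex_polygon(demo_mask, [[2, 2], [5, 2], [5, 4], [2, 4]])
```

Because the work per polygon is proportional to the polygon's area rather than to the whole image, this should scale better when an image contains many small characters.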

In [8]:
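# slice the list directly; np.split would first coerce the ragged (image_id, polygons)
# pairs into an object array, which is presumably what triggered the NumPy warning
n_train, n_val = int(len(chinese_list)*0.8), int(len(chinese_list)*0.9)
training_real = chinese_list[:n_train]
validation_real = chinese_list[n_train:n_val]
test_real = chinese_list[n_val:]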


In [24]:
def potato(data):
    chaos_probably = []
    for image in data:
        tensor_i = img_to_tensor(images_directory + "/" + image[0] + ".jpg")
        truth = get_truth(image[1])
        chaos_probably.append((tensor_i, truth))
    
    return chaos_probably

training_potatoed = potato(training_real)
test_potatoed = potato(test_real)

  truth_tensor = torch.tensor(t_tensor).type(torch.LongTensor) #.to(device)


KeyboardInterrupt: 

In [9]:
import time
from joblib import Parallel, delayed

def process_images_in_parallel(image):
    return (img_to_tensor(images_directory + "/" + image[0] + ".jpg"), 
            get_truth(image[1])) 

In [10]:
start = time.time()
processed_data = Parallel(n_jobs = 10)(delayed(process_images_in_parallel)(image) for image in training_real)
end = time.time()

print("Processing data took", round(end-start, 0), "seconds")

Processing data took 918.0 seconds


# **Part 2: the models (10 points)**

In this part, you will implement two substantially different model architectures that both take your representation of the images as training input and both take your representation of the bounding boxes as objective (HINT: the binary classification of pixels as belonging to a bounding box or not).  They will save the trained models to files so that they can be loaded and tested later. The output of the models will be a "soft binary" -- the probability of each pixel being inside a bounding box, from 0 to 1.  Consider examining some of the training data before designing your architectures.
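
As an illustration of saving a trained model to a file and loading it again, the sketch below stores only the parameters (the `state_dict`) and restores them into a freshly constructed model of the same architecture. The one-layer network and the file name `example_model.pt` are placeholders chosen for the example, not part of the assignment.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in for a trained model; any nn.Module works the same way.
model = nn.Sequential(nn.Conv2d(3, 1, kernel_size=3, padding=1))

path = os.path.join(tempfile.mkdtemp(), "example_model.pt")  # hypothetical file name
torch.save(model.state_dict(), path)                          # parameters only

# Loading requires building the same architecture first, then filling in the weights.
restored = nn.Sequential(nn.Conv2d(3, 1, kernel_size=3, padding=1))
restored.load_state_dict(torch.load(path))
restored.eval()

x = torch.rand(1, 3, 8, 8)
with torch.no_grad():
    same_output = torch.allclose(model(x), restored(x))
```

Saving the `state_dict` rather than the whole model object keeps the file independent of where the class definition lives, at the cost of having to reconstruct the model before loading.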

You have a large grant of freedom as to what these model architectures will look like (remember: grading is on a "reasonable effort" basis).  There's a high chance (HINT) that they will both use one or more convolutional layers, among other things.  Describe the models and the motivations for the architecture in your report.
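
To make the "probability per pixel" requirement concrete, here is a minimal sketch of a fully convolutional network whose output has the same height and width as its input. The class name `TinyPixelNet`, the layer sizes, and the 32x32 random inputs are all assumptions made for illustration; it is not a recommended architecture. The network returns raw scores (logits), which are trained with `BCEWithLogitsLoss` and turned into probabilities with a sigmoid at prediction time.

```python
import torch
from torch import nn

class TinyPixelNet(nn.Module):
    """Two conv+pool stages, then upsampling back to the input resolution."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # H/2 x W/2
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # H/4 x W/4
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode='nearest'),       # back to H x W
            nn.Conv2d(16, 1, kernel_size=3, padding=1),        # one score per pixel
        )

    def forward(self, x):
        # x: (N, 3, H, W) with H and W divisible by 4; returns logits of shape (N, H, W)
        return self.decoder(self.encoder(x)).squeeze(1)

net = TinyPixelNet()
images = torch.rand(2, 3, 32, 32)
targets = (torch.rand(2, 32, 32) > 0.5).float()     # random 0/1 masks for the demo

logits = net(images)
loss = nn.BCEWithLogitsLoss()(logits, targets)      # applies the sigmoid internally
probabilities = torch.sigmoid(logits)               # per-pixel values in [0, 1]
```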

In [11]:
#model imports
from torch.nn import Module
from torch.nn import Conv2d
from torch.nn import Linear
from torch.nn import MaxPool2d
from torch.nn import ReLU
from torch.nn import LogSoftmax
from torch import flatten
import torch.optim as optim
from torch.utils.data import DataLoader

In [12]:
#parameters
learning_rate = 0.001
epochs = 3

In [13]:
#(CONV => RELU => POOL) * 2 => FC => RELU => FC => SIGMOID

class Model1(nn.Module):
    def __init__(self):
        super(Model1, self).__init__()
        
        #CONV => RELU => POOL 1.0
        self.conv1 = Conv2d(in_channels=3, out_channels=3, kernel_size=(5, 5))
        self.relu1 = ReLU()
        self.maxpool1 = MaxPool2d(kernel_size=(15, 15), stride=(5, 5))
        
        #CONV => RELU => POOL 2.0
        self.conv2 = Conv2d(in_channels=3, out_channels=3, kernel_size=(5, 5)) #in_channels matches with previous out_channels
        self.relu2 = ReLU()
        self.maxpool2 = MaxPool2d(kernel_size=(15, 15), stride=(5, 5))

        #FC => RELU
        # in_features = 3 channels * 78 * 78: a 2048x2048 input shrinks to 2044 (conv1),
        # 406 (maxpool1), 402 (conv2) and 78 (maxpool2) along each side
        self.fc1 = Linear(in_features=18252, out_features=500)
        self.relu3 = ReLU()
        
        #sigmoid classifier: two scores per image (outside / inside a bounding box)
        self.fc2 = Linear(in_features=500, out_features=2)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        
        #CONV => RELU => POOL 1.0
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.maxpool1(x)
        
        #CONV => RELU => POOL 2.0
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.maxpool2(x)

        #FC => RELU
        x = flatten(x, 1)
        x = self.fc1(x)
        x = self.relu3(x)
        
        #sigmoid classifier
        x = self.fc2(x)
        out = self.sigmoid(x)

        return out

In [15]:
dataloader_train = DataLoader(processed_data, shuffle=True, batch_size=32)
# dataloader_test = DataLoader(test_potatoed, shuffle=True, batch_size=32)

In [22]:
model1 = Model1().to(device)

model1.train()

loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model1.parameters(), lr=learning_rate)

# repeats the two per-image scores across every pixel position
upsample = torch.nn.Upsample(size=2048*2048)

for e in range(epochs):
    total_loss = 0
    for i, (image, truth) in enumerate(dataloader_train):
        # The DataLoader already adds the batch dimension: image is (N, 2048, 2048, 3).
        # permute moves the channels to the front, giving (N, 3, 2048, 2048).
        # The earlier reshape(1, 3, 2048, 2048) failed on a batch of 32 with
        # "RuntimeError: shape '[1, 3, 2048, 2048]' is invalid for input of size 402653184".
        batch = image.permute(0, 3, 1, 2).to(device)
        truth = truth.to(device)                 # (N, 2048*2048)
        
        optimizer.zero_grad()
        
        out = model1(batch)                      # (N, 2)
        out = out.reshape(-1, 2, 1)
        out = upsample(out)                      # (N, 2, 2048*2048)
        
        # note: CrossEntropyLoss expects raw scores, while Model1 ends in a sigmoid
        loss = loss_function(out, truth)
        total_loss += loss.item()
        print(total_loss/(i+1), end='\r')
        
        loss.backward()
        optimizer.step()
    
torch.save(model1, 'model1_chinese.pt')

In [19]:
402653184/(2048*2048)

96.0
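
The 96 above is 32 images times 3 colour channels: the DataLoader stacks 32 images of shape (2048, 2048, 3) into one batch, so the batch cannot be reshaped into a single (1, 3, 2048, 2048) tensor. The small sketch below, on a toy 4x4 batch, also shows why `permute` rather than `reshape` is the usual way to move the channel axis to the front: `reshape` only reinterprets the flat memory and mixes values from different pixels, while `permute` keeps each pixel's channel values together. The variable names are illustrative.

```python
import torch

# Toy batch in the layout the DataLoader produces: (N, H, W, C)
batch = torch.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)

channels_first = batch.permute(0, 3, 1, 2)   # (N, C, H, W), axes reordered
reshaped = batch.reshape(2, 3, 4, 4)         # same shape, but values are scrambled

# With permute, the channel values of pixel (row 1, col 2) are unchanged.
same_pixel = torch.equal(channels_first[0, :, 1, 2], batch[0, 1, 2, :])
scrambled = not torch.equal(reshaped, channels_first)

# Element count of a real batch: 32 images, 3 channels, 2048 x 2048 pixels
elements_per_batch = 32 * 3 * 2048 * 2048
```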

In [None]:
#for pretrained model maybe i don't know??????
# https://github.com/jwyang/faster-rcnn.pytorch
# https://modelzoo.co/model/pytorch-cnn-finetune
# https://pytorch.org/vision/stable/models.html
# https://github.com/pytorch/vision/blob/main/torchvision/models/alexnet.py

# the one i'm doin':
# https://haochen23.github.io/2020/06/fine-tune-faster-rcnn-pytorch.html#.YWgohBrP3BV

In [None]:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

number_of_classes = 2  # background + Chinese character

def model2(number_of_classes):
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    # the new box predictor must accept the feature size produced by the existing head
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, number_of_classes)
    
    return model

In [None]:
number_of_classes = 2

# get the model using our helper function
model = model2(number_of_classes)
# move model to the right device
model.to(device)

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler which decreases the learning rate by
# 10x every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)

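In [None]:
# detection batches hold images with differing numbers of boxes, so keep each batch
# as tuples rather than stacking it into one tensor
def collate_fn(batch):
    return tuple(zip(*batch))

data_loader = torch.utils.data.DataLoader(
    training_real, batch_size=2, shuffle=True, num_workers=4,
    collate_fn=collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    test_real, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=collate_fn)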

CNN for inspo: https://www.pyimagesearch.com/2021/07/19/pytorch-training-your-first-convolutional-neural-network-cnn/#pyis-cta-modal
https://towardsdatascience.com/beginners-guide-to-building-convolutional-neural-networks-using-tensorflow-s-keras-api-in-python-6e8035e28238
http://parneetk.github.io/blog/cnn-mnist/

# **Part 3: testing and evaluation (13 points)**

You can use your test data by feeding the test images forward through the models. The output of the models will be pixel maps of the probability of a particular pixel being inside a bounding box.  These will be compared outside the model to the test data's bounding boxes.  You can use a number of different evaluation strategies -- one of them being to choose a probability threshold to decide whether a pixel is inside the bounding box or not, and then take recall/precision/F1/accuracy. Another one is to report it in terms of error, such as mean squared error. Even given your architectural choices, you will likely have hyperparameters to tune.  Describe the progress of your training and testing, with graphs if necessary, in your report.
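
The two evaluation strategies mentioned above can be sketched together as follows. The helper name `pixel_metrics`, the default threshold of 0.5, and the tiny 2x2 example are assumptions for illustration; the function takes a probability map and a 0/1 ground-truth mask of the same shape.

```python
import numpy as np

def pixel_metrics(prob_map, truth, threshold=0.5):
    """Thresholded precision/recall/F1/accuracy plus mean squared error."""
    prob_map = np.asarray(prob_map, dtype=float)
    truth = np.asarray(truth).astype(bool)
    predicted = prob_map >= threshold            # hard decision per pixel

    tp = np.sum(predicted & truth)               # inside a box, predicted inside
    fp = np.sum(predicted & ~truth)              # outside, predicted inside
    fn = np.sum(~predicted & truth)              # inside, predicted outside
    tn = np.sum(~predicted & ~truth)             # outside, predicted outside

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / truth.size
    mse = float(np.mean((prob_map - truth.astype(float)) ** 2))
    return {"precision": float(precision), "recall": float(recall),
            "f1": float(f1), "accuracy": float(accuracy), "mse": mse}

scores = pixel_metrics([[0.9, 0.2], [0.6, 0.1]], [[1, 0], [0, 1]])
```

Since most pixels in a street image are usually outside any character box, accuracy alone can look high for a model that predicts "outside" everywhere, which is one reason to report precision and recall as well.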

It should also be possible to examine the effects of applying the model to individual images.  Make it possible to visually represent the pixel/bounding box probabilities superimposed on the original images.  Examine some of the images to conduct a qualitative error analysis of your trained models. Include this analysis in your report.
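
One simple way to superimpose the probabilities on an image is to blend each pixel towards a highlight colour in proportion to its probability. The sketch below does this with NumPy only; the function name `overlay_heatmap`, the choice of red, and the `alpha` parameter are illustrative assumptions. The result is an ordinary RGB array that can be displayed or saved with whatever imaging library is at hand.

```python
import numpy as np

def overlay_heatmap(image, prob_map, alpha=0.6):
    """Blend an (H, W, 3) uint8 image towards red where prob_map (H, W) is high."""
    image = np.asarray(image, dtype=float)
    # Per-pixel blend weight in [0, alpha]; the trailing axis broadcasts over RGB.
    weight = alpha * np.clip(np.asarray(prob_map, dtype=float), 0.0, 1.0)[..., None]
    red = np.array([255.0, 0.0, 0.0])
    blended = (1.0 - weight) * image + weight * red
    return blended.round().astype(np.uint8)

grey = np.full((2, 2, 3), 100, dtype=np.uint8)
probs = np.array([[0.0, 1.0],
                  [0.0, 0.0]])
overlaid = overlay_heatmap(grey, probs, alpha=1.0)
```

With `alpha` below 1 the original image remains visible even under high-probability regions, which makes it easier to judge whether the highlighted area really covers a character.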

In [None]:
model1.eval()

test_loss = 0

for i, image in enumerate(test_real):
    tensor_i = img_to_tensor(images_directory + "/" + image[0] + ".jpg")
    truth = get_truth(image[1]).unsqueeze(0).to(device)    # (1, 2048*2048)
        
    # (H, W, C) -> (1, C, H, W); permute reorders the axes, reshape would scramble pixels
    batch = tensor_i.permute(2, 0, 1).unsqueeze(0).to(device)
    
    with torch.no_grad():
        out = model1(batch)                                # (1, 2)
        out = out.reshape(1, 2, 1)
        out = nn.Upsample(size=2048*2048)(out)             # (1, 2, 2048*2048)
        
    loss = loss_function(out, truth)
    test_loss += loss.item()
    # print the running average loss over the test images
    print("Loss:")
    print(test_loss/(i+1), end='\r')
    
    # assumption: channel 1 is read as the score for "inside a bounding box"
    MSE = torch.mean((out[:, 1, :] - truth.float()) ** 2).item()
    print("Mean Squared Error:")
    print(MSE)

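In [None]:
model.eval()  # `model` is the Faster R-CNN built by model2(); model2 itself is a function

for image in test_real:
    tensor_i = img_to_tensor(images_directory + "/" + image[0] + ".jpg")
    # assumes get_truth returns a row-major mask (row = y, column = x)
    truth = get_truth(image[1]).reshape(2048, 2048).float()
        
    # torchvision detection models take a list of (C, H, W) tensors scaled to [0, 1]
    model_input = (tensor_i.permute(2, 0, 1) / 255.0).to(device)
    
    with torch.no_grad():
        prediction = model([model_input])[0]   # dict with 'boxes', 'labels', 'scores'
        
    # Faster R-CNN returns boxes rather than a pixel map and no loss in eval mode,
    # so paint each box's score onto an empty map, keeping the maximum on overlaps
    prob_map = torch.zeros(2048, 2048)
    for box, score in zip(prediction['boxes'].cpu(), prediction['scores'].cpu()):
        x0, y0, x1, y1 = box.round().long().tolist()
        prob_map[y0:y1, x0:x1] = prob_map[y0:y1, x0:x1].clamp(min=score.item())
    
    MSE = torch.mean((prob_map - truth) ** 2).item()
    print("Mean Squared Error:")
    print(MSE)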