# Introduction

- This notebook walks through my solution to the problem of detecting document angle
  - A notebook was chosen instead of a Python script for the sake of interactivity & convenience of the examiner
- In addition to the notebook, the folder contains the following files/directories:
  - A `requirements.txt` file to set up a virtual environment
  - A `scripts.py` file that contains the defined functions required for the solution
  - `train` and `test` directories that contain the provided data
- Although suitable default values have been used, the examiner is requested to set the global variables (in the 3rd code cell) to values appropriate for their own setup

In [1]:
from IPython.display import Image
Image(url='https://c.tenor.com/6Igas8ss6BAAAAAC/let-us-begin-lets-start.gif')

# Installs & Imports

In [2]:
## RUN to install required libraries (comment out if already installed)
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting flaml==1.0.12
  Downloading FLAML-1.0.12-py3-none-any.whl (206 kB)
Collecting tqdm==4.62.2
  Downloading tqdm-4.62.2-py2.py3-none-any.whl (76 kB)
Collecting lightgbm>=2.3.1
  Downloading lightgbm-3.3.2-py3-none-manylinux1_x86_64.whl (2.0 MB)
Installing collected packages: lightgbm, tqdm, flaml
  Attempting uninstall: lightgbm
    Found existing installation: lightgbm 2.2.3
    Uninstalling lightgbm-2.2.3:
      Successfully uninstalled lightgbm-2.2.3
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.64.1
    Uninstalling tqdm-4.64.1:
      Successfully uninstalled tqdm-4.64.1
Successfully installed flaml-1.0.12 lightgbm-3.3.2 tqdm-4.62.2


In [3]:
# Define global variables here
# Path to the 'train' folder
TRAIN_PATH = 'drive/MyDrive/flexday_hw_assignment/data/train'

# Path to the 'test' folder
TEST_PATH = 'drive/MyDrive/flexday_hw_assignment/data/test'

# Resized preprocessed image dimension
RESIZE_DIM = 100 

# List of angle values (used during data enrichment)
ANGLES_L = [0, 90, 180, 270]

# Random State
SEED = 42 

# Flag to indicate whether it's being run on Colab
COLAB = True

In [4]:
# If using Colab
if COLAB:
  from google.colab import drive

  # Had stored the files in Drive for the assignment
  drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Make the necessary imports
from pathlib import Path
from collections import Counter
from sklearn.metrics import accuracy_score
from util import load_data_labels, preprocess_image, preprocess_dataset, fit_automl, enrich_data
if COLAB:
  from google.colab.patches import cv2_imshow

# Loading The Data

- Although we can use `os.walk()` or `Path.iterdir()` to iterate through the images in the `train` directory, we have no way to retrieve the angle (target) from the file paths
- We use the provided `labels` file to obtain both the images as well as their corresponding angles
- Note: Although we load the test set, we do not touch it at all!

In [6]:
# Define path objects
train_loc = Path(TRAIN_PATH)
test_loc = Path(TEST_PATH)

# Use \n delimiter to split the full corpus into a list of strings
# Last element of the list is an empty string → slice it off
# read_text() opens and closes the file for us
train_raw_labels = (train_loc/'labels').read_text().split('\n')[:-1]
test_raw_labels = (test_loc/'labels').read_text().split('\n')[:-1]

# Load images+labels from both training and test data 
train_imgs_l, train_angles_l = load_data_labels(train_raw_labels, train_loc)
test_imgs_l, test_angles_l = load_data_labels(test_raw_labels, test_loc)

100%|██████████| 203/203 [01:06<00:00,  3.07it/s]
100%|██████████| 88/88 [00:29<00:00,  2.97it/s]


## Checking for Imbalance

- Accuracy is a straightforward metric to use if the training set is balanced
- If it is imbalanced, we need to consider a metric like the F1 score

In [7]:
# Use a counter to quickly determine distribution
Counter(train_angles_l)

Counter({0: 55, 90: 51, 270: 45, 180: 52})
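
The balance check can be made explicit by turning the counts into class shares. The snippet below is a minimal sketch that reuses the counts printed above; the 1.5 cut-off for the max/min ratio is an arbitrary illustrative choice, not something taken from the assignment.

```python
from collections import Counter

# Counts copied from the Counter output above
counts = Counter({0: 55, 90: 51, 270: 45, 180: 52})
total = sum(counts.values())

# Share of each class in the training set
shares = {angle: n / total for angle, n in sorted(counts.items())}
for angle, share in shares.items():
    print(f"{angle:>3} deg: {share:.1%}")

# Ratio between the largest and smallest class; 1.0 means perfectly balanced
imbalance_ratio = max(counts.values()) / min(counts.values())

# Hypothetical rule of thumb: treat the data as balanced below a ratio of 1.5
is_balanced = imbalance_ratio < 1.5
print(f"max/min ratio: {imbalance_ratio:.2f}, balanced: {is_balanced}")
```

With every class holding roughly a quarter of the samples, plain accuracy is a reasonable choice here.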

# Assumptions/Constraints Followed

- Data contains only driver's licenses
- Test set will be used only for evaluation and not for hyperparameter tuning to control overfitting
- Train set will not be enriched by scraping new data from the web and manually labelling their angles
- Neural Networks were not used in any part of the ML pipeline - processing, feature extraction or modelling

# Observations & Approach

1. Since training data is balanced, we will use `accuracy_score` as the performance metric
2. Not all Driver's Licenses are in landscape mode
  - For example, [this](https://www.quora.com/Why-are-drivers-licenses-vertical-in-some-states-and-horizontal-in-others) mentions how portrait-landscape orientation corresponds to the age of the license holder
  - Simple width-height heuristics won't work!
3. It is likely that document orientation is:
  - Not related to colour
  - Related to direction of bulk of the text fields
4. Training samples are not consistent in size, brightness, contrast, noise etc
5. Training samples are few in number (203 images)


# Data Preprocessing

- Based on observations #3 and #4, preprocessing is primarily focused on:
  - Denoising the image for generalisability
  - Converting to grayscale
  - Thresholding the grayscaled image to boost contrast between text and non-text areas
  - Using a Laplacian filter to map where intensity changes most sharply across the image (a second-derivative edge map)
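
The steps above can be sketched end to end. The real `preprocess_image` lives in the helper module and is not shown in this notebook, so what follows is only a NumPy stand-in under stated assumptions: RGB channel order, a 3x3 mean filter in place of a proper denoiser, a global mean threshold, and a 4-neighbour Laplacian. All function names here are hypothetical.

```python
import numpy as np

def to_grayscale(img):
    """Collapse an H x W x 3 image to H x W with luma weights (assumes RGB order)."""
    weights = np.array([0.299, 0.587, 0.114])
    return img[..., :3].astype(np.float64) @ weights

def box_blur(gray):
    """3x3 mean filter, a crude stand-in for a real denoiser."""
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    acc = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / 9.0

def threshold(gray):
    """Binarise around the mean intensity (a simple global threshold)."""
    return np.where(gray > gray.mean(), 255.0, 0.0)

def laplacian(img):
    """4-neighbour discrete Laplacian: non-zero only where intensity changes."""
    p = np.pad(img, 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * p[1:-1, 1:-1])

def sketch_preprocess(img):
    """Grayscale -> denoise -> threshold -> Laplacian, in that order."""
    return laplacian(threshold(box_blur(to_grayscale(img))))

# Toy image: left half black, right half white
toy = np.zeros((8, 8, 3), dtype=np.uint8)
toy[:, 4:, :] = 255
edges = sketch_preprocess(toy)
print(edges.shape, np.count_nonzero(edges))
```

The output is zero everywhere except along the black/white boundary, which is the sparsity the later dimensionality-reduction step relies on.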

In [8]:
# Generating features for train set & labels
train_X, train_y = preprocess_dataset(train_imgs_l, train_angles_l, RESIZE_DIM)

100%|██████████| 203/203 [00:00<00:00, 746.84it/s]


## Modelling

- We restrict the scope to traditional ML algorithms, including ensemble models
- As such, we can take advantage of `AutoML` packages. For the solution, we use Microsoft's lightweight `FLAML`

In [9]:
# Run custom function to fit AutoML model, print results and return model
automl = fit_automl(train_X, train_y)

[flaml.automl: 09-21 20:08:17] {2600} INFO - task = classification
[flaml.automl: 09-21 20:08:17] {2602} INFO - Data split method: stratified
[flaml.automl: 09-21 20:08:17] {2605} INFO - Evaluation method: holdout
[flaml.automl: 09-21 20:08:17] {2727} INFO - Minimizing error metric: log_loss
[flaml.automl: 09-21 20:08:17] {2869} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl: 09-21 20:08:17] {3174} INFO - iteration 0, current learner lgbm
[flaml.automl: 09-21 20:08:18] {3308} INFO - Estimated sufficient time budget=9959s. Estimated necessary time budget=230s.

[INFO] Accuracy on train data: 1.0


## Testing

- As expected, the training data performance is extremely high
- How do we fare on the test set?
- Note: Since this is for skill demonstration purposes, I'm making an exception by considering test set for evaluation
  - Ideally, training data ought to be split into a smaller training and a separate validation/holdout set
  - Evaluation must be performed on the holdout set
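
For reference, the ideal setup described above could look like the sketch below. It is a standard-library illustration rather than the split used in this notebook (the FLAML log already reports its own internal stratified holdout); `stratified_holdout` and `holdout_frac` are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, holdout_frac=0.2, seed=42):
    """Return (train_idx, holdout_idx) with similar class shares in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, holdout_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one sample per class in the holdout set
        n_holdout = max(1, round(len(indices) * holdout_frac))
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)

# Toy labels mirroring the class counts seen earlier in the notebook
labels = [0] * 55 + [90] * 51 + [180] * 52 + [270] * 45
train_idx, holdout_idx = stratified_holdout(labels)
print(len(train_idx), len(holdout_idx))
```

Model selection would then use only the holdout indices, leaving the test set for a single final evaluation.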

In [10]:
# Obtain the test set features and labels as numpy arrays
test_X, test_y = preprocess_dataset(test_imgs_l, test_angles_l, RESIZE_DIM)

# Print the accuracy score on the test set
print(f'[INFO] Accuracy on test set: {accuracy_score(test_y, automl.predict(test_X))}')

100%|██████████| 88/88 [00:00<00:00, 1123.70it/s]

[INFO] Accuracy on test set: 0.4318181818181818





# Improving the Model

- Random guessing would yield 25% accuracy (balanced dataset, 4 classes)
- The current AutoML model reaches 43.2% on the test set, about 1.7x better than random guessing
- We make 2 improvements to our pipeline:
  - Using rotation to generate more training samples
  - Using PCA to reduce dimensionality of training samples  

## Training Data Enrichment via Rotation

- For each training image, we can generate 3 additional versions by rotating it by 90, 180 and 270 degrees
  - Rotation will be counterclockwise to be in line with the labelling convention in the training data
- Each generated version will have a different angle (and hence class label). For example,
  - If the original angle is 0, then successive rotations yield 90, 180 and 270
  - If the original angle is 90, then successive rotations yield 180, 270 and 0
  - If the original angle is 180, then successive rotations yield 270, 0 and 90
  - If the original angle is 270, then successive rotations yield 0, 90 and 180
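
The label arithmetic above reduces to `(angle + 90 * k) % 360` for `k` quarter-turns. The snippet below is a minimal sketch of that idea, not the actual `enrich_data` helper (which is not shown here); `sketch_enrich` is a hypothetical name, and it assumes the label is the counterclockwise angle, as stated above.

```python
import numpy as np

def sketch_enrich(images, angles):
    """Return every image in all four orientations with matching labels."""
    out_images, out_angles = [], []
    for img, angle in zip(images, angles):
        for k in range(4):  # k quarter-turns; k = 0 keeps the original
            out_images.append(np.rot90(img, k))  # np.rot90 rotates counterclockwise
            out_angles.append((angle + 90 * k) % 360)
    return out_images, out_angles

# Toy 2 x 3 "image" labelled as rotated by 90 degrees
toy = np.arange(6).reshape(2, 3)
rot_imgs, rot_angles = sketch_enrich([toy], [90])
print(rot_angles)
print([im.shape for im in rot_imgs])
```

Each input yields four samples, which matches the jump from 203 to 812 images in the progress bars below.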

In [11]:
# Create a list of enriched images and corresponding labels
rot_imgs_l, rot_labels_l = enrich_data(train_imgs_l, train_angles_l)

# Preprocess this enriched list
train_X_rot, train_y_rot = preprocess_dataset(rot_imgs_l, rot_labels_l, RESIZE_DIM, shuffle=True)

100%|██████████| 203/203 [00:00<00:00, 392.70it/s]
100%|██████████| 812/812 [00:00<00:00, 832.32it/s]


### Modelling + Testing 2.0

- We repeat the AutoML process and subsequent test set evaluation
- As is evident, test accuracy rises from 43.2% to 72.7% (about 30 percentage points) with data enrichment!

In [12]:
# Run the custom function to fit an AutoML model and display the results
automl_rot = fit_automl(train_X_rot, train_y_rot)

[flaml.automl: 09-21 20:09:21] {2600} INFO - task = classification
[flaml.automl: 09-21 20:09:21] {2602} INFO - Data split method: stratified
[flaml.automl: 09-21 20:09:21] {2605} INFO - Evaluation method: holdout
[flaml.automl: 09-21 20:09:21] {2727} INFO - Minimizing error metric: log_loss
[flaml.automl: 09-21 20:09:21] {2869} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl: 09-21 20:09:21] {3174} INFO - iteration 0, current learner lgbm
[flaml.automl: 09-21 20:09:23] {3308} INFO - Estimated sufficient time budget=21493s. Estimated necessary time budget=495s.

[INFO] Accuracy on train data: 1.0


In [13]:
# Print the accuracy score on the test set
print(f'[INFO] Accuracy on test set: {accuracy_score(test_y, automl_rot.predict(test_X))}')

[INFO] Accuracy on test set: 0.7272727272727273


## Dimensionality Reduction

- Even after enrichment, the curse of dimensionality looms over our problem
  - Feature dimensions outnumber samples by 10-15x
- On account of thresholding + Laplacian, images are sparse, i.e., most pixels are 0 and carry no information
- Using PCA can help preserve the structure while reducing the dimensionality
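
The key point when applying PCA is that the projection is learned on the training data only and then reused unchanged on the test data. The sketch below shows this with plain NumPy as a stand-in for `sklearn.decomposition.PCA`; how `fit_automl(..., pca=True)` configures PCA internally is not shown in this notebook, so `fit_pca`, `apply_pca` and the toy data are illustrative assumptions.

```python
import numpy as np

def fit_pca(X, n_components):
    """Learn the mean and top principal directions from the training matrix X."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal directions sorted by decreasing variance
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project samples onto the learned directions."""
    return (X - mean) @ components.T

rng = np.random.default_rng(42)
# Toy data: 50 samples in 200 dimensions that really live on a 3-D subspace
mixing = rng.normal(size=(3, 200))
X_train = rng.normal(size=(50, 3)) @ mixing
X_test = rng.normal(size=(10, 3)) @ mixing

mean, components = fit_pca(X_train, n_components=3)
Z_train = apply_pca(X_train, mean, components)
Z_test = apply_pca(X_test, mean, components)  # reuse the training mean and directions
print(Z_train.shape, Z_test.shape)
```

Here 200 features collapse to 3 without losing information because the toy data is low-rank by construction; real images will only be approximated.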

### Modelling + Testing 3.0

In [14]:
automl_pca, pca_object = fit_automl(train_X_rot, train_y_rot, pca=True)

[flaml.automl: 09-21 20:10:30] {2600} INFO - task = classification
[flaml.automl: 09-21 20:10:30] {2602} INFO - Data split method: stratified
[flaml.automl: 09-21 20:10:30] {2605} INFO - Evaluation method: holdout
[flaml.automl: 09-21 20:10:30] {2727} INFO - Minimizing error metric: log_loss
[flaml.automl: 09-21 20:10:30] {2869} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl: 09-21 20:10:30] {3174} INFO - iteration 0, current learner lgbm
[flaml.automl: 09-21 20:10:31] {3308} INFO - Estimated sufficient time budget=3375s. Estimated necessary time budget=78s.

[INFO] Accuracy on train data: 1.0


In [15]:
# Evaluating on the test set
test_X_reduced = pca_object.transform(test_X)
print(accuracy_score(test_y, automl_pca.predict(test_X_reduced)))

0.8181818181818182


- Compared to the model after data enrichment, test accuracy rises by about 9 percentage points (72.7% to 81.8%)
- Compared to the original model (43.2%), the gain is about 39 percentage points, i.e. roughly 89% in relative terms
- Note: Multiple image augmentation approaches (random resized cropping, brightness/contrast fluctuation, edge detection, etc.) were attempted through libraries like `albumentations`
  - However, they did not translate to improvements in performance
  - As such, they were dropped from this notebook
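
Since "increase in accuracy" can mean either percentage points or relative change, the short calculation below works out both from the three test accuracies reported in this notebook. The `gain` helper is a hypothetical name used only for this illustration.

```python
# Test accuracies reported above (88 test images)
acc_base = 0.4318181818181818   # original features
acc_rot = 0.7272727272727273    # after rotation enrichment
acc_pca = 0.8181818181818182    # after rotation enrichment + PCA

def gain(before, after):
    """Return (percentage-point gain, relative gain in %)."""
    return (after - before) * 100, (after - before) / before * 100

for name, before, after in [
    ("rotation vs original", acc_base, acc_rot),
    ("PCA vs rotation", acc_rot, acc_pca),
    ("PCA vs original", acc_base, acc_pca),
]:
    points, relative = gain(before, after)
    print(f"{name}: +{points:.1f} points, +{relative:.1f}% relative")
```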

# Closing Note

- We obtained 80%+ accuracy for the task by:
  - Using composable image processing transforms to accentuate textual area in our images
  - Using AutoML to efficiently explore learner and hyperparameter search space
  - Enriching the dataset by creating rotated versions of existing data
  - Reducing dimensionality of the preprocessed images by PCA
- As laid out in the problem statement, we avoided Neural Networks for the task
- However, if we were to solve the problem using Neural Networks, we could consider a variety of tactics, not limited to those mentioned below:
  - Extracting image features using a pretrained model with the fully connected head at the end 'cut off'
  - Using architecture-specific regularisation techniques like Dropout in addition to data-specific regularisation (augmentation that does not distort document angle)
  - Using non-linear embeddings via UMAP or AutoEncoders in place of PCA to capture richer dependencies
  - Using NN-based OCR engines like Tesseract (via `pytesseract`) to identify specific text likely to contain more information about document orientation