# Notes on Problem


## Challenge: Support many different text areas robustly

Robustness is key. Not every film, game, or other source has the same number of text boxes, or has them in the same places. Even within a single stream, text boxes may change size or be occluded in unpredictable ways.

It seems one of the first (and probably harder) tasks here is to make sure that we can accurately identify the different text areas, and maintain continuity of those text areas between frames.

## What does a solution even look like?

### High Level Solution

Robustness for games: idea of text groups. The idea is that while we are processing the data, we will naturally find "groups" of text that make sense. In the example of streams, this can be:

- In-game chat
- Twitch chat
- Game information (things like "Channeling", health numbers, timer, kill count, etc.)

During the initial processing, we will auto-detect these groups. This will probably be done by some kind of classification or clustering, using some efficient pre-trained network (maybe MobileNet?).

Afterwards, we want to expose an easy-to-use UI that visualizes all of these groups for the user and lets them label, merge, and (STRETCH) split them.


### Pre-processing

Make a pass through the data and identify the different types of text. Important notes:

- Err on the side of more groups. It's a better user experience (and easier technically) for the user to combine groups later, i.e. to tell the system that two of them are actually the same thing.
- Have a fallback group and a minimum probability. In the long term I don't know exactly what the solution looks like here, but for this project we care most about nailing the main, important text. Smaller things which may be outliers can be thrown in a bucket to start, and we should have a convenient way of setting a floor probability below which the system designates a new bucket.
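A minimal sketch of the fallback-plus-floor idea, assuming the pre-processing step produces a score per candidate group for each detected text region. The function name `assign_group`, the `FALLBACK_GROUP` label, and the default floor of 0.6 are all placeholders for illustration, not settled design:

```python
from typing import Dict

FALLBACK_GROUP = "fallback"  # hypothetical name for the catch-all bucket


def assign_group(scores: Dict[str, float], floor: float = 0.6) -> str:
    """Return the best-scoring group, or the fallback bucket if nothing clears the floor."""
    if not scores:
        return FALLBACK_GROUP
    best_group = max(scores, key=scores.get)
    # Anything below the floor is treated as an outlier rather than forced into a group.
    return best_group if scores[best_group] >= floor else FALLBACK_GROUP
```

Raising `floor` pushes more regions into the fallback bucket, which fits the "more groups now, merge later" preference above.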

### Actual OCR

Make a more expensive pass through the data using some SOTA OCR engine and track every change to the relevant group. Important notes:

- One simple (but probably costly/wasteful) idea is just to do regular timestamps on every group. This is likely a bad idea
- The better idea is to have the frequency of "check-ins" be a tunable parameter (to help support lower-end hardware) and then, at every check-in, compare the text predicted for a group to what it was at the last check-in, and only write data if there's a change
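A rough sketch of the check-in idea, assuming the OCR engine hands back one string per group at each check-in. `ChangeLog`, `check_in`, and `check_in_times` are hypothetical names, and comparing strings exactly is a simplification (noisy OCR would probably need a fuzzier comparison):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ChangeLog:
    """Records (timestamp, group, text) only when a group's text changes."""
    last_text: Dict[str, str] = field(default_factory=dict)
    events: List[Tuple[float, str, str]] = field(default_factory=list)

    def check_in(self, timestamp: float, group: str, text: str) -> bool:
        """Return True if the text differed from the previous check-in and was logged."""
        if self.last_text.get(group) == text:
            return False  # unchanged since the last check-in, write nothing
        self.last_text[group] = text
        self.events.append((timestamp, group, text))
        return True


def check_in_times(duration: float, interval: float) -> List[float]:
    """Timestamps at which to run OCR; `interval` is the tunable check-in period."""
    count = int(duration // interval) + 1
    return [i * interval for i in range(count)]
```

A larger `interval` means fewer OCR calls on lower-end hardware, at the cost of coarser timestamps for each change.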

### Aggregation

Aggregate the data in a more organized way. This basically means only tracking changes and marking them nicely with timestamps.
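One possible shape for the aggregated output, assuming the OCR pass yields a flat list of `(timestamp, group, text)` change events. The interval format and the `to_intervals` name are my own guesses at what "marking them nicely with timestamps" could mean:

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Event = Tuple[float, str, str]                 # (timestamp, group, text)
Interval = Tuple[float, Optional[float], str]  # (start, end or None if still showing, text)


def to_intervals(events: List[Event]) -> Dict[str, List[Interval]]:
    """Turn change events into per-group timelines of (start, end, text)."""
    by_group: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for timestamp, group, text in sorted(events):
        by_group[group].append((timestamp, text))

    timelines: Dict[str, List[Interval]] = {}
    for group, changes in by_group.items():
        intervals = []
        for i, (start, text) in enumerate(changes):
            # A text value lasts until the next change in the same group.
            end = changes[i + 1][0] if i + 1 < len(changes) else None
            intervals.append((start, end, text))
        timelines[group] = intervals
    return timelines
```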

### Post-processing

Now we expose everything we've done to the user, ask them for better group names, and let them combine or split groups as they see fit.

IDEA: Some on-screen symbols (think dragon up, baron up, kill icon) probably don't correspond to any English text. A stretch goal would be to detect these and expose them to the user as well, so they can give them readable names for sorting and cataloging.

# Implementation Goal 1: Pre-process into Text Groups

## Attempt 1: Using [CRAFT](https://pypi.org/project/craft-text-detector/)

Had difficulty installing it. Moving to Tesseract, which was recommended anyway.

## Attempt 2: Using Tesseract

Was able to get a basic image to text + text box working.

Notes:

- For League streams, some of the text is quite small, so the difference between .png and .jpeg initially seemed significant. On reflection I'm not convinced; sticking with JPEG for now for speed
- There's a fair amount of noise. I.e., there are a lot of boxes around seemingly nothing, and some of the boxes flicker in and out between streams
- The boxes are surrounding individual words, not sentences.

Next steps:
- Explore the Tesseract documentation more thoroughly to see if there are any built-in tools that will let me detect larger boxes
- If NOT, build something custom that's loosely based on clustering / distance between boxes
- Try out some of the optimizations [here](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html)
  - Inverting seems good
  - Noise removal seems good
  - Better binarization seems good
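For the custom fallback in the next steps above (clustering on distance between boxes), here is a minimal sketch. It assumes boxes are `(left, top, width, height)` tuples in pixels and merges any boxes whose gap is within a single `max_gap` threshold. The threshold value and the use of one gap for both axes are assumptions; in practice horizontal and vertical gaps would probably need separate tuning:

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height)


def gap(a: Box, b: Box) -> int:
    """Largest axis-aligned gap between two boxes; 0 if they touch or overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = max(bx - (ax + aw), ax - (bx + bw), 0)
    dy = max(by - (ay + ah), ay - (by + bh), 0)
    return max(dx, dy)


def cluster_boxes(boxes: List[Box], max_gap: int = 10) -> List[Box]:
    """Merge boxes that are transitively within `max_gap` pixels of each other."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    merged = []
    for members in groups.values():
        left = min(b[0] for b in members)
        top = min(b[1] for b in members)
        right = max(b[0] + b[2] for b in members)
        bottom = max(b[1] + b[3] for b in members)
        merged.append((left, top, right - left, bottom - top))
    return sorted(merged)
```

The pairwise loop is quadratic in the number of boxes, which should be fine for the few dozen word boxes on a typical frame.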

### Attempt 2.1: Simple Bigger Boxes

As noted above, the simple extraction steps tended to give boxes around individual words. I think this is because I just wasn't using the right data from the tesseract output.

Yup, I should first be looking at block_num. Then aggregating. Then worrying about par_num and creating consistent logs and the like.

This is looking pretty decent! I was able to get boxes to blend together better
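A sketch of the aggregation by block_num. It assumes the Tesseract output has been loaded as a dict of parallel lists with the standard TSV columns (`block_num`, `left`, `top`, `width`, `height`, `text`), which is the shape the `pytesseract` wrapper returns from `image_to_data(..., output_type=Output.DICT)`. Whether that wrapper is what was actually used here is an assumption, and `boxes_by_block` is a made-up name:

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height)


def boxes_by_block(data: Dict[str, list]) -> Dict[int, Tuple[Box, str]]:
    """Merge word-level rows into one bounding box and one string per block_num."""
    words: Dict[int, List[int]] = {}
    for i, text in enumerate(data["text"]):
        if not str(text).strip():
            continue  # skip structural rows and empty detections
        words.setdefault(data["block_num"][i], []).append(i)

    blocks: Dict[int, Tuple[Box, str]] = {}
    for block_num, indices in words.items():
        left = min(data["left"][i] for i in indices)
        top = min(data["top"][i] for i in indices)
        right = max(data["left"][i] + data["width"][i] for i in indices)
        bottom = max(data["top"][i] + data["height"][i] for i in indices)
        text = " ".join(str(data["text"][i]) for i in indices)
        blocks[block_num] = ((left, top, right - left, bottom - top), text)
    return blocks
```

This only keys on block_num, so it assumes one page per frame. Splitting further by par_num would mean keying on the `(block_num, par_num)` pair instead.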


### Attempt 2.2: Better Binarization

It's not entirely clear what kind of binarization is going on under the hood in Tesseract. As suggested in the docs, it's possible that performing our own custom binarization may improve results.
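As one candidate for a custom binarization, here is a pure-Python sketch of local mean (adaptive) thresholding: each pixel is compared to the average of its neighborhood rather than to one global cutoff. The choice of this method, the `radius` and `offset` parameters, and the function name are all mine for illustration; nothing above commits to a particular technique. It assumes a non-empty grayscale image stored as rows of 0-255 integers:

```python
from typing import List

Image = List[List[int]]  # grayscale rows, values 0-255


def adaptive_threshold(image: Image, radius: int = 1, offset: int = 0) -> Image:
    """Set a pixel to 255 if it is brighter than its local mean minus `offset`, else 0."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighborhood is clipped at the image border.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(255 if image[y][x] > local_mean - offset else 0)
        out.append(row)
    return out
```

The nested loops are far too slow for full video frames. This is only meant to pin down the logic before swapping in an optimized image-library routine.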

# Implementation Goal 2: Segment into Classes with the Help of the User

## Attempt 1: Clustering Using Feature Vectors from Existing Classification Net

The biggest problem here is that the bounding boxes are potentially very different sizes. The first thing I'm going to try is simply resizing all cropped boxes to the same dimensions, plugging them into an existing efficient network (MobileNet perhaps) to get feature vectors, and then clustering from there.
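A sketch of just the clustering step, assuming each resized crop has already been turned into a fixed-length feature vector (the network forward pass is not shown, and the tiny 2-D vectors in the usage note stand in for real embeddings). Plain k-means is my placeholder choice, since the notes don't pick an algorithm, and it needs the number of groups `k` up front, which may not be known in practice:

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[Vector], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster label (0..k-1) for each vector. Assumes len(vectors) >= k >= 1."""
    # Deterministic farthest-point initialization: start from the first vector,
    # then repeatedly add the vector farthest from all chosen centroids.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(squared_distance(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels
```

For example, `kmeans([(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)], k=2)` puts the first two vectors in one cluster and the last two in another.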