# Overview

* In the time available it is not feasible to run through a full real-world document classification pipeline...
* ...however, we can walk through the same process for a simple test case

## Problem statement
* Imagine you have access to a subset of the raw text dataset from https://www.kaggle.com/c/spooky-author-identification.
* Your boss then sends you a folder containing thousands of scanned pages from books that somehow exploded!
    * They say they need them sorted by author ASAP and your job depends on it!
    * We're lucky, though: the scan quality is really good (so no cleaning is required)
    * What do you do?!!

Don't worry, we've got a solution in mind...

## Proposed solution

* Train a text classifier on the raw text data (training set)
* OCR all the given images and create another raw text dataset
* Apply your classifier to the OCR'ed text and boom! You have a sorted set (up to the accuracy of the classifier)
* Job Saved!
    
    

# Train text classifier
* How do we represent text as an input to a mathematical model?
    * This question has many answers, but the simplest is counting!
    * We can count occurrences of words across a corpus (a collection of documents)
    * We can then map a document (text string) to a vector of length $N$, where $N$ is the size of the vocabulary and each entry is the count of one vocabulary word in that document!
* Okay, but how do we actually do this?
    * You could write your own algorithms, but time is of the essence, so we should take to the internet.
    * The resounding answer is: scikit-learn
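
To make the counting idea concrete, here is a minimal pure-Python sketch of a bag-of-words mapping. The toy corpus and the helper name `to_count_vector` are made up for illustration; scikit-learn's vectorizers do the same job with more careful tokenization and sparse storage.

```python
from collections import Counter

# Toy corpus: each string is one "document" (hypothetical example text).
corpus = [
    "the raven flew over the house",
    "the house was silent",
]

# Vocabulary: every distinct word in the corpus, sorted for a stable column order.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def to_count_vector(doc, vocabulary):
    """Map a document to a list of word counts, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc, vocabulary) for doc in corpus]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Every document becomes a vector of the same length $N$ (here $N = 7$), no matter how long the original text was, which is what lets a standard classifier consume it.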
    
## scikit-learn
scikit-learn is an amazing open-source library with a ton of robust implementations of your favorite ML models:
https://scikit-learn.org/stable/

* In particular we will be taking advantage of a few modules:
    * text feature extraction https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
    * pipeline https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline (for "productionalizing")
    * multinomial naive bayes https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
    * random forest https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
    
## Loose steps

Typically, in a real-world scenario, data gathering and cleaning would be your number one issue (I can't stress this enough).

But we have a nice dataset in this case, so we can follow a few steps to develop a robust classifier:

## NUMBER ONE RULE: This is an experiment; treat it as such, with scientific rigor and reasoning

1) Define your metric and your goal for that metric (accuracy, F1 score, recall, precision, etc.)

2) Define your training set and testing set (these need to be distinct, with no overlap!)

3) Define your preprocessing steps: we need to convert our raw data into a form that a mathematical model can use

4) Select a set of candidate algorithms (we'll test them all to pick a winner)

5) Define your validation strategy (how do we decide one model is better than another BEFORE TOUCHING THE TEST SET?)

6) Set up your scripts and follow your procedure

7) Select the best candidate and measure the chosen metric on your test set to see if it's acceptable. If not, go back, tweak your procedure, and repeat.
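
As a rough sketch of how steps 2 to 7 fit together in scikit-learn, the snippet below holds out a test set, cross-validates one candidate on the training part only, and scores the test set last. The toy texts and labels are invented stand-ins for the real (text, author) pairs, and the split sizes are arbitrary assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real (text, author) pairs.
gloomy = ["raven", "shadow", "grave", "ghost", "tomb", "crypt", "coffin", "spectre"]
sunny = ["garden", "flower", "meadow", "bird", "river", "breeze", "orchard", "picnic"]
texts = [f"the {w} waited in the old dark house" for w in gloomy]
texts += [f"the {w} glowed in the bright warm morning" for w in sunny]
labels = ["A"] * len(gloomy) + ["B"] * len(sunny)

# Step 2: hold out a test set once; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Steps 3-4: preprocessing plus a candidate model, bundled so they are fit together.
candidate = make_pipeline(CountVectorizer(), MultinomialNB())

# Step 5: compare candidates with cross-validation on the training part only.
scores = cross_val_score(candidate, X_train, y_train, cv=3, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

# Step 7: only the chosen candidate is scored on the held-out test set.
test_accuracy = candidate.fit(X_train, y_train).score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Bundling the vectorizer and the model in one pipeline matters here: the vocabulary is then rebuilt inside each cross-validation fold, so nothing from the validation fold leaks into training.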


    
 

 


In [None]:
from PIL import Image
import os
import pytesseract
import pandas as pd

# Grab our Training Set

In [None]:
train = pd.read_csv('images/train.csv')


print('Class Breakdown:')
print(train['author'].value_counts())
print(f'Total:{len(train)}')
print('\n')

print('Sample:')
for _, data in train.head(5).iterrows():
    
    print(data.author)
    print(data.text)
    print('\n')
    

# Organize our data

In [None]:
text_train = train.text.tolist()
labels_train = train.author.tolist()

# Define preprocessing steps

In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

vectorizer = CountVectorizer()
#vectorizer = TfidfVectorizer()
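
To see what the two vectorizer options produce, here is a small sketch on two made-up sentences. `CountVectorizer` returns raw word counts, while `TfidfVectorizer` down-weights words that appear in every document and, with its default settings, scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-document corpus, just to compare the two representations.
docs = ["the raven and the night", "the garden and the sun"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

# Same shape (documents x vocabulary), different values.
print(counts.toarray())
print(np.round(tfidf.toarray(), 2))
```

Both outputs are sparse matrices with one row per document and one column per vocabulary word, so either vectorizer can be dropped into the same pipeline.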


# Pick a model 

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

mnb = MultinomialNB()

#rf = RandomForestClassifier()

#ada = AdaBoostClassifier()


# Build Pipeline

In [None]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vectorizer',vectorizer),('base_clf',mnb)])

# Set up 5-Fold Cross-Validation for hyperparameter tuning

In [None]:
from sklearn.model_selection import GridSearchCV

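# Hyperparameters to search; keys follow the '<step name>__<parameter>' convention.
# This grid holds a single combination, so the search reduces to plain 5-fold CV.
param_grid = {'vectorizer__analyzer':['word'],
              'vectorizer__ngram_range':[(1,1)]}

# refit=True retrains the best configuration on the full training set afterwards.
clf = GridSearchCV(pipeline,
                   param_grid=param_grid,
                   refit=True,
                   verbose=2,
                   cv=5)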


clf.fit(text_train,labels_train)

results = pd.DataFrame(data = clf.cv_results_)

display(results)
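
The grid above contains only one combination. As a sketch of what a wider search could look like, the snippet below tries several analyzers, n-gram ranges and smoothing values on a toy dataset. The data, the specific values in `wider_grid`, and the use of 3 folds are illustrative assumptions, not tuned recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: six short texts per class.
gloomy = ["raven", "shadow", "grave", "ghost", "tomb", "crypt"]
sunny = ["garden", "flower", "meadow", "bird", "river", "breeze"]
toy_texts = [f"the {w} waited in the old dark house" for w in gloomy]
toy_texts += [f"the {w} glowed in the bright warm morning" for w in sunny]
toy_labels = ["A"] * len(gloomy) + ["B"] * len(sunny)

# Same step names as the notebook's pipeline, rebuilt here to stay self-contained.
toy_pipeline = Pipeline([("vectorizer", CountVectorizer()),
                         ("base_clf", MultinomialNB())])

# 2 analyzers x 2 n-gram ranges x 2 smoothing values = 8 combinations.
wider_grid = {
    "vectorizer__analyzer": ["word", "char_wb"],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "base_clf__alpha": [0.1, 1.0],
}

search = GridSearchCV(toy_pipeline, param_grid=wider_grid, cv=3)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(len(search.cv_results_["params"]))
```

Every extra value multiplies the number of fits (combinations times folds), so on the full training set it is worth growing the grid gradually.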

# OCR test set sample (a larger set would take too long)

In [None]:
data = {"path":[],
        "id":[],
        "predict":[],
        "text":[]
       }

directory = 'images/test-set-sample'

filenames = [name for name in os.listdir(directory) if name.endswith('.png')]

for fn in filenames:

    print(f'Processing {fn}')

    full_path = os.path.join(directory,fn)
    pilimg = Image.open(full_path)

    img_id = os.path.splitext(fn)[0]  # file name without the '.png' extension
    
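    # Detect the page orientation first; correcting it gives Tesseract a better chance at extraction.
    try:
        orientation_results = pytesseract.image_to_osd(pilimg,output_type='dict')
        degrees = orientation_results['rotate']
        if degrees != 0:
            # PIL rotates counter-clockwise for positive angles, hence the negated angle.
            pilimg = pilimg.rotate(-degrees,expand=True)
            
        text = pytesseract.image_to_string(pilimg,lang='eng')
    except Exception as e:
        # Fall back to empty text so one failed page does not stop the whole run.
        print(f'OCR failed for {fn}: {e}')
        text = ''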
    
        
    data['predict'].append(clf.predict([text])[0])
    data['path'].append(full_path)
    data['id'].append(img_id)
    data['text'].append(text)
    


# Spot Check Test Sample

In [None]:
from IPython.display import clear_output

test = pd.read_csv('images/test.csv')

sample = pd.DataFrame(data=data).merge(test[['id','author']],on='id',how='left')

for _, d in sample.iterrows():
    
    img = Image.open(d.path)
    
    display(img)
    
    print('Extracted Text:')
    print(d.text)
    print('\n')
    
    print(f'Predicted: {d.predict}, Actual: {d.author}')
    
    cont = input('continue? (y or n)')
    
    if cont not in ['y','']:
        break
    else:
        clear_output(wait=True)

# Full test set
Imagine we OCR'ed everything! We'll check our performance on the full hold-out (test) set.

In [None]:
text_test = test['text'].tolist()
labels_test = test['author'].tolist()

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

preds = clf.predict(text_test)

fig_labels = clf.classes_
CM=confusion_matrix(labels_test,preds,labels=fig_labels)

fig,ax = plt.subplots(1,1,figsize=(20,20))

# fmt='d' prints the counts as plain integers instead of scientific notation.
sns.heatmap(CM,annot=True,fmt='d',xticklabels=fig_labels, yticklabels=fig_labels ,ax = ax)
ax.set_title("Overall Accuracy: {}".format(accuracy_score(labels_test,preds)))
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
fig.set_facecolor('w')

# Challenges
* Can you beat this baseline accuracy?
* Can you beat the best on Kaggle? (note they use a different metric there)
* Is it possible to classify these images based on their visual content? (as impractical as that may be)
* Can you think of other features you could generate?
* What about other modelling strategies (e.g. sequence models)?
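
As a starting point for the first few challenges, here is a sketch that swaps in character n-gram features and scores the model with log loss instead of accuracy. We believe the Kaggle competition ranks submissions by multi-class log loss, but treat that as an assumption and check the competition page. The toy texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real training and hold-out sets.
train_texts = [
    "the raven waited in the dark house",
    "a ghost walked through the old crypt",
    "shadows fell across the silent grave",
    "the garden glowed in the warm morning",
    "a bird sang over the bright meadow",
    "flowers swayed beside the sunny river",
]
train_labels = ["A", "A", "A", "B", "B", "B"]
held_out_texts = ["the ghost waited in the dark crypt", "a bright bird in the garden"]
held_out_labels = ["A", "B"]

# Character n-grams (within word boundaries) instead of whole-word counts.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

# Log loss needs class probabilities, not hard predictions.
proba = model.predict_proba(held_out_texts)
loss = log_loss(held_out_labels, proba, labels=model.classes_)
print(f"Log loss: {loss:.3f}")
```

Unlike accuracy, log loss punishes confident wrong answers heavily, so a model that wins on accuracy will not necessarily win on this metric.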