___
# __OCR NLP Project__
### Mercedes Wu
##### Time limit: 4 hrs
___

___
## __Getting OCR Receipt Data from: https://github.com/clovaai/cord__
- data is stored in json format with xy coordinates
- train size - 800 images
- test size - 100 images
- including some sample images in data folder
- data does not contain personal information
___

In [None]:
import nltk
import pandas as pd
import numpy as np
import os
import json
import matplotlib.pyplot as plt
from skimage import io
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

___
## __Data Exploration of Dataset__
- looking at a sample image and its respective json data to get a feel for the data
___

In [None]:
img = io.imread('./data/receipt_ocr/images/train/receipt_00797.png')
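# figsize is in inches, so (10, 20) is already a large figure
fig, ax = plt.subplots(figsize=(10, 20))
# drawing on the existing axes instead of overwriting the ax variable
ax.imshow(img)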


In [None]:
folder = './data/receipt_ocr/json/train/'
with open(f'{folder}receipt_00797.json') as f:
    d = json.load(f)

In [None]:
d.keys()

In [None]:
d['valid_line'][0].keys()

In [None]:
d['dontcare']

In [None]:
d['meta']

In [None]:
d['roi']

In [None]:
d['repeating_symbol']

In [None]:
d['valid_line'][0]['words']

In [None]:
d['valid_line'][1]['words']

In [None]:
d['valid_line'][0]['category']

In [None]:
d['valid_line'][0]['group_id']

___
__notes so far__
- valid_line gives the most useful information
- can ignore dontcare, meta, and roi for now, but real-life data might not be this simple
- quad gives the bounding box of the OCR scan
- category is the correct label for the OCR text, so we can use it as the target
- group_id seems to be a numerical key of the category
- not sure what row_id refers to yet <br>

__ideas__
- would be nice to get this into a dataframe format
- can use the quad data to help group together certain items for the menu
    - e.g. "1 Grilled Baby Potato (R"
- whether the text is a number or a string can also give us valuable information
- will probably need regex or heuristic-based filtering to get rid of some noise
    - e.g. "(R"
- repeating_symbol would be useful to help segment the image, but for the sake of time/complexity we will ignore it for now
___
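
The noise-filtering idea could look something like the sketch below. The rule itself is a hypothetical heuristic chosen for illustration (a short trailing fragment that opens a bracket and never closes it, like "(R"); it is not something the dataset prescribes, and `strip_noise` and `max_fragment_len` are made-up names.

```python
import re

# Hypothetical heuristic: an opening bracket that is never closed before
# the end of the string, e.g. the "(R" in "1 Grilled Baby Potato (R".
UNMATCHED_OPEN = re.compile(r"[(\[{][^)\]}]*$")


def strip_noise(text, max_fragment_len=3):
    """Remove a short trailing fragment that opens a bracket but never closes it."""
    match = UNMATCHED_OPEN.search(text)
    if match and len(match.group(0)) <= max_fragment_len:
        return text[:match.start()].rstrip()
    return text


print(strip_noise("1 Grilled Baby Potato (R"))
print(strip_noise("Iced Tea (Large)"))
```

Text with balanced brackets is left untouched, and longer unmatched fragments are kept because they may carry real content.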

___
## __Creating a Training Dataset for Modeling__
- transforming json data in "valid_line" to dataframe
___

In [None]:
# after some initial testing, I can't find a super easy solution, so starting with a plain dataframe of valid_line
df = pd.DataFrame.from_dict(d['valid_line'], orient='columns')
df

In [None]:
df['words'].iloc[0]

In [None]:
# need to break down the words column, going to use lambda functions but there is most likely room for improvement
# note: x[0] only looks at the first word of each line
# first breaking down the quad to coordinate values
df['quad'] = df['words'].map(lambda x: x[0]['quad'])
df['x1'] = df['quad'].map(lambda x: x['x1'])
df['x2'] = df['quad'].map(lambda x: x['x2'])
df['x3'] = df['quad'].map(lambda x: x['x3'])
df['x4'] = df['quad'].map(lambda x: x['x4'])
df['y1'] = df['quad'].map(lambda x: x['y1'])
df['y2'] = df['quad'].map(lambda x: x['y2'])
df['y3'] = df['quad'].map(lambda x: x['y3'])
df['y4'] = df['quad'].map(lambda x: x['y4'])

In [None]:
# splitting out text to a column
df['text'] = df['words'].map(lambda x: x[0]['text'])
# splitting out iskey&rowid to get a better idea, might need to not use it as a feature to be fair though
df['row_id'] = df['words'].map(lambda x: x[0]['row_id'])
df['is_key'] = df['words'].map(lambda x: x[0]['is_key'])


In [None]:
df.columns

___
__notes__
- looks like row_id is calculated from the quad: items on the same line share a row_id
- techniques for this include random sample consensus (RANSAC)
     - we should try implementing this, but due to time constraints let's first see the results of initial modeling techniques
___
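
As a much simpler stand-in for RANSAC, rows can be approximated by grouping boxes whose vertical centres are close together. This is only a sketch of that idea, not how the dataset's row_id was produced; `group_into_rows` and the `tolerance` value are assumptions.

```python
def group_into_rows(boxes, tolerance=10):
    """Assign a row index to each box by greedy grouping on vertical centre.

    boxes: list of dicts with 'y1'..'y4' keys (same layout as the quad data).
    tolerance: assumed maximum pixel gap between neighbouring centres in one row.
    """
    centres = [
        (sum(b[k] for k in ("y1", "y2", "y3", "y4")) / 4, i)
        for i, b in enumerate(boxes)
    ]
    rows = [None] * len(boxes)
    current_row, last_centre = -1, None
    # Walk the boxes from top to bottom; start a new row when the gap is large.
    for centre, i in sorted(centres):
        if last_centre is None or centre - last_centre > tolerance:
            current_row += 1
        rows[i] = current_row
        last_centre = centre
    return rows


boxes = [
    {"y1": 100, "y2": 100, "y3": 120, "y4": 120},
    {"y1": 300, "y2": 300, "y3": 320, "y4": 320},
    {"y1": 104, "y2": 104, "y3": 124, "y4": 124},
]
print(group_into_rows(boxes))
```

Because each box is compared with the previous centre, a skewed scan can chain neighbouring lines together, which is the kind of case a line-fitting method such as RANSAC handles better.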


In [None]:
# putting above code into a function to run over all the json images
def json_to_df_helper(d):
    df = pd.DataFrame.from_dict(d['valid_line'], orient='columns')
    # each row's 'words' is a list of word dicts; as above, only the first word of each line is used
    first_word = df['words'].map(lambda x: x[0])
    # breaking down the quad into its eight coordinate values
    for key in ['x1', 'x2', 'x3', 'x4', 'y1', 'y2', 'y3', 'y4']:
        df[key] = first_word.map(lambda w, key=key: w['quad'][key])
    # splitting out text, row_id and is_key into their own columns
    # (row_id and is_key are kept for inspection, they might not be fair to use as features)
    for key in ['text', 'row_id', 'is_key']:
        df[key] = first_word.map(lambda w, key=key: w[key])
    
    # filtering down to training features
    return df[['x1', 'x2', 'x3', 'x4', 'y1', 'y2', 'y3', 'y4', 'text', 'row_id', 'is_key', 'category', 'group_id']]

In [None]:
# creating train and test dataframes
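def dataset_generator(path_to_jsonfiles):
    dfs = []
    # sorted so the row order is reproducible; skipping anything that is not a json file
    for file in sorted(os.listdir(path_to_jsonfiles)):
        if not file.endswith('.json'):
            continue
        full_filename = os.path.join(path_to_jsonfiles, file)
        with open(full_filename, 'r') as f:
            d = json.load(f)
        dfs.append(json_to_df_helper(d))
    # ignore_index gives every row a unique index instead of restarting at 0 for each receipt
    return pd.concat(dfs, axis=0, ignore_index=True)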


In [None]:
train_df = dataset_generator('./data/receipt_ocr/json/train/')
test_df = dataset_generator('./data/receipt_ocr/json/test/')

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
# lets look at the count of values for each category first as a quick check to see if there are any class imbalances
count_train_gb = train_df[['category', 'text']].groupby('category').count().reset_index()
ax = count_train_gb.plot.bar(x='category', y='text', title='Count Classes in Training Data', figsize=(20,10))

In [None]:
# same check on the testing data: count of values for each category
count_test_gb = test_df[['category', 'text']].groupby('category').count().reset_index()
ax = count_test_gb.plot.bar(x='category', y='text', title='Count Classes in Testing Data', figsize=(20,10))

___
__Feature Engineering__
- to simplify the classification task for now, we can:
    - drop categories that have 100 or fewer datapoints in the training data
        - we can address the dropped classes later
    - make sure the classes between testing and training sets are the same
        - this doesn't mimic a real-world scenario but will still give us an idea of how well our models do in ideal situations
___

In [None]:
filt_categories = list(count_train_gb[count_train_gb['text']>100]['category'].unique())

In [None]:
filt_categories

In [None]:
filt_train_df = train_df[train_df['category'].isin(filt_categories)].copy()
filt_test_df = test_df[test_df['category'].isin(filt_categories)].copy()

___
TODO 11/29/21:
- add some simple feature generation like:
    - length of the string
    - whether it is a str, int, or float
    - some ranges that items would cost
        - unlikely to find an item > $10,000

Note:
- looks like the receipts can be from different countries, need to account for the fact that some countries use "," as the decimal separator
___
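
A rough sketch of the features listed in the TODO follows. The separator rule is an assumption made for illustration (a last separator followed by exactly three digits is read as a thousands separator, otherwise as the decimal mark), and `parse_number`, `simple_features` and `max_price` are hypothetical names.

```python
import re


def parse_number(text):
    """Return a float if text looks like a number, else None.

    Assumption: the last ',' or '.' is the decimal mark unless it is followed
    by exactly three digits, in which case it is a thousands separator.
    """
    s = text.strip()
    if not re.fullmatch(r"-?\d[\d.,]*", s):
        return None
    last_sep = max(s.rfind(","), s.rfind("."))
    if last_sep == -1:
        return float(s)
    whole, frac = s[:last_sep], s[last_sep + 1:]
    whole = whole.replace(",", "").replace(".", "")
    if len(frac) == 3:  # e.g. "12,500" is read as twelve thousand five hundred
        return float(whole + frac)
    return float(f"{whole}.{frac or '0'}")


def simple_features(text, max_price=10_000):
    """Length, numeric type and price-range flags for one OCR text value."""
    value = parse_number(text)
    return {
        "length": len(text),
        "is_numeric": value is not None,
        "is_integer": value is not None and value.is_integer(),
        "in_price_range": value is not None and 0 <= value <= max_price,
    }


print(simple_features("12.50"))
print(simple_features("Potato"))
```

The three-digit rule is ambiguous for values such as "1.234", so a per-receipt or per-country setting would likely be more reliable than a single global rule.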

___
## __Modeling__
- is this a supervised machine learning problem?
    - i.e. can we leverage past labeled data for inference?
    - this way we can utilize techniques that take advantage of the labeled dataset
- is this an unsupervised machine learning problem?
    - can utilize:
        - business logic
        - positional and linguistic cues
        - some heuristics on receipts to group unlabeled data
        - clustering techniques based on position
- given the current data, the best choice may be a combination of the two
    - we can use the training data to get us closer to the preferred or custom categories, then use heuristics/decision trees to break it down further
___

In [None]:
# vectorizing text column
vectorizer = CountVectorizer()
# term frequency inverse document frequency transformer
tfidf_transformer = TfidfTransformer()

In [None]:
# fitting the vectorizer and tfidf transformer on the training data only
X_train_counts = vectorizer.fit_transform(filt_train_df['text'])
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
y_train = filt_train_df['category']
# reusing the fitted vectorizer and transformer on the testing data (transform, not fit_transform)
# so the test matrix has the same vocabulary columns as the training matrix
X_test_counts = vectorizer.transform(filt_test_df['text'])
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
y_test = filt_test_df['category']


In [None]:
X_train_counts

In [None]:
X_test_counts

In [None]:
set(list(y_train.unique())) - set(list(y_test.unique()))

___
Note:
- the number of columns must match between training and testing data in order to predict
    - fitting the vectorizer on the training data only and calling transform on the testing data guarantees this
- can use select k best to keep the more important features
___

In [None]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import KFold, cross_val_score

In [None]:
# fitting SelectKBest on the training data only, then applying the same column selection to the testing data
selector = SelectKBest(chi2, k=405)
X_train_tfidf_new = selector.fit_transform(X_train_tfidf, y_train)
X_test_tfidf_new = selector.transform(X_test_tfidf)

In [None]:
# fitting multinomial naive bayes classifier
clf = MultinomialNB().fit(X_train_tfidf_new, y_train)

In [None]:
filt_test_df['NB_pred'] = clf.predict(X_test_tfidf_new)

In [None]:
filt_test_df['NB_correct_prediction'] = np.where(filt_test_df['category'] == filt_test_df['NB_pred'], 'yes', 'no')

In [None]:
nb_results = filt_test_df['NB_correct_prediction'].value_counts()
nb_results

___
Note:
- initial results with the Naive Bayes classifier don't seem great
- can test whether a different model, combined with k-fold cross validation (due to the low number of data points for some classes), improves the accuracy
___
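
One way to try k-fold cross validation is to wrap the vectorizer, feature selection and classifier in a scikit-learn pipeline, so every step is refit on the training folds only. The sketch below uses made-up toy texts and labels standing in for `filt_train_df['text']` and `filt_train_df['category']`, and `k="all"` only because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the real text and category columns (values are made up).
texts = [
    "nasi goreng", "es teh", "ayam bakar", "mie goreng", "kopi susu", "sate ayam",
    "total", "total bayar", "grand total", "total harga", "total item", "sub total",
]
labels = ["menu"] * 6 + ["total"] * 6

pipeline = make_pipeline(
    TfidfVectorizer(),                              # counts + tf-idf in one step
    SelectKBest(chi2, k="all"),                     # a fixed k could be used on real data
    LinearSVC(class_weight="balanced", dual=False),
)

# Stratified folds keep every class present in each split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv)
print(scores.mean())
```

Stratified folds are chosen here because some classes have few data points; a plain `KFold` could leave a class out of a training fold.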

In [None]:
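# setting support vector class weight to balanced to automatically adjust weights inversely proportional to class frequencies
# max_iter has to be an integer (recent scikit-learn versions reject the float 1e5)
model = LinearSVC(class_weight="balanced", dual=False, tol=1e-2, max_iter=100000)
kf = KFold(n_splits=3)
# passing the model positionally since the keyword was renamed from base_estimator to estimator in newer scikit-learn
cclf = CalibratedClassifierCV(model, cv=kf)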
cclf.fit(X_train_tfidf_new, y_train)

In [None]:
filt_test_df['SCV_pred_w_CCCV'] = cclf.predict(X_test_tfidf_new)

In [None]:
filt_test_df['SCV_pred_w_CCCV_correct_prediction'] = np.where(filt_test_df['category'] == filt_test_df['SCV_pred_w_CCCV'], 'yes', 'no')

In [None]:
scv_results = filt_test_df['SCV_pred_w_CCCV_correct_prediction'].value_counts()
scv_results

In [None]:
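# naming each column explicitly so the rows are labelled by model after transposing
results = pd.concat([nb_results, scv_results], axis=1, keys=['NB_correct_prediction', 'SCV_pred_w_CCCV_correct_prediction']).T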

In [None]:
results['accuracy'] = results['yes'] / (results['no']+results['yes'])

In [None]:
results.reset_index(inplace=True)

In [None]:
# one colour per bar: blue for naive bayes, orange for the calibrated linear SVC
ax = results.set_index('index')['accuracy'].plot.bar(figsize=(20,10), color=['blue', 'orange'])

___
## __Conclusion__
- This project was fun and a nice introduction to how computer vision and NLP can be related!
- With more time, I would have liked to:
    - explore the modeling side more, especially the unsupervised approach and using the OCR x,y coordinates as hints as to what a field could be
    - use more feature generation, especially with the number fields
        - e.g. date formatting, number limits, integers for quantity vs floats for prices
- I believe using a combination of both supervised and unsupervised learning would make the model development process go faster
    - we could use any available training data to help give us an idea of where certain categories like menu usually lie in a receipt
        - I think this would look nice visualized as a heatmap
___
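
The heatmap idea could start from a 2D histogram of box centres per category. This sketch assumes a dataframe with the `x1`..`y4` and `category` columns built earlier and assumes coordinates are comparable across receipts, which may not hold if image sizes differ; `position_heatmap` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def position_heatmap(df, category, bins=10):
    """2D histogram of box centres for one category, normalised to sum to 1."""
    rows = df[df["category"] == category]
    x_centre = rows[["x1", "x2", "x3", "x4"]].mean(axis=1)
    y_centre = rows[["y1", "y2", "y3", "y4"]].mean(axis=1)
    # y first so the array rows follow the vertical axis of the receipt.
    counts, _, _ = np.histogram2d(y_centre, x_centre, bins=bins)
    total = counts.sum()
    return counts / total if total else counts


# Toy data with made-up coordinates and category names.
toy = pd.DataFrame({
    "category": ["menu", "menu", "total"],
    "x1": [10, 90, 50], "x2": [10, 90, 50], "x3": [10, 90, 50], "x4": [10, 90, 50],
    "y1": [10, 50, 99], "y2": [10, 50, 99], "y3": [10, 50, 99], "y4": [10, 50, 99],
})
heat = position_heatmap(toy, "menu", bins=2)
print(heat)
```

The returned array can be displayed with `plt.imshow(heat)`; dividing coordinates by the image width and height first would make receipts of different sizes comparable.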
