## Meta-data Extraction Bot

    
    Format - PEP8
    Python - 3.6.3
    Created Date - 04 April 2020
    Author - Pranjal

### CONTENTS


1. Possible Approaches
2. Imports
3. Settings: Hyper-parameters
4. Load Mapper File 
5. Abbreviations Module 
	- Legal Known abbrvs
	- Dynamic Abbrvs


6. Reading from XML Module
7. Initial Preprocessing Module
8. Address Block (AB) Identification Module
9. Labelling Data Module
10. Feature Engineering Module
    - TYPE I. Linguistic Features
	- TYPE II. Text Features


11. Saving & Loading Module 
12. Filtering Dataset Module 
13. Training Module
    - Phase I: Line Classification Training
    - **FINAL ARCHITECTURE**
    - Phase II: Chunk Identification Module


14. OLD Methods/Versions for Chunk Identification Module

### <ins>1. Possible Approaches</ins>
    
...

`Approach 1`
    
        Phase I: Line Classification

        1. Extract text from XML files line-by-line
        2. Filter using NNP, NP; remove VB, ADV; and Levenshtein match between supplier name, score, without filter score.
        2. Create features such as:
            :email
            :URL
            :1Cap
            :Keywords list - Presence of list keywords in line
            :Legal abbrv - Presence of legal abbrv from list
            
                a. Create a list of user-defined abbrv - scraped from Wikipedia for all English countries.
                b. SN -> split -> last token -> dictword(?) POS tag(?) Synonyms using Word2Vec model
                c. Custom NER Spacy 
                    i.   Filter out LINES from XML containing SN. - Fuzzywuzzy, Levenshtein, Exact Regex Match
                    ii.  Find out index-start, index-end of SN.
                    iii. Train custom Spacy NER on at least 1k sentences / biLSTM CRF using IOB tags.
                    iv.  Test data: "ABC INC 10 May" -> Spacy.ents (ABC INC, ent)
                d. Split SN by space and take last token
                e. Check last token has vowels or not
                f. Contains punct marks or not.
                   
                
            :city name
            :country name
            :%Cap1DictWord
            :%CapDictWord
            :%DictWord
            :%Cap1NonDictWord
            :%CapNonDictWord
            :%NonDictWord
            :%ProperN - spacys pos tag
            :LinePos - position of line in quadrant
            :l,t,r,b - Regularized l,t,r,b coordinates of line
            :FontSize - font size of Regularized (l,t,r,b) coordinates of line
            :%Spacy NER+ - Presence of Spacy's NER tags in line
            :AddressNeighbour - Context of 'n' neighbouring lines above/below selected line if context is address or not 1/0
         3. Target Values (1/0) if its a "Supplier Name Line" or "Not a Supplier Name Line"
         4. Labelling -
             a. [HIGH WEIGHT] Regex match of Line with Excel Mapper file Supplier Name. - fuzzywuzzy/difflib Levenshtein
             b. [LESS WEIGHT] Score of above values
             c. Manual Check by DA
         5. Classification using XGBClassifier or SVM.
         
         Phase II. Chunk Identification
         Chunk identification - extract the NNPs of a sentence as chunks.
         Chunk - % of NER ORG in sentence -> thresholding to decide whether a chunk is present in the sentence or not.
         1. Once a LINE is classified as a 'Supplier Name Line' (SNL), it shall be broken into 'n' chunks.
         2. Repeat the above steps on each chunk, which shall then be classified as "Supplier"/"Not supplier".
      
...
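
Phase I of Approach 1 can be illustrated with a small sketch covering a handful of the line features and the labelling step. It is not the notebook's implementation: the function names, the reading of `1Cap` as "first word starts with a capital", and the `0.8` similarity threshold are all assumptions, and `difflib` from the standard library stands in for fuzzywuzzy.

```python
import re
from difflib import SequenceMatcher

EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+")
URL_RE = re.compile(r"(?:https?://|ftp://|www\.)\S+", re.IGNORECASE)


def line_features(line):
    """Return a few Phase I style features for one OCR line (hypothetical helper)."""
    tokens = line.split()
    alpha = [t for t in tokens if t[:1].isalpha()]
    n = len(alpha) or 1  # avoid division by zero on lines without words
    return {
        "email": int(bool(EMAIL_RE.search(line))),
        "url": int(bool(URL_RE.search(line))),
        # 1Cap (assumed meaning): the first alphabetic token starts with a capital
        "1Cap": int(bool(alpha) and alpha[0][0].isupper()),
        # share of tokens written fully in capitals
        "pct_all_caps": sum(t.isupper() for t in alpha) / n,
        # share of tokens with only the first letter capitalised
        "pct_cap1": sum(t[0].isupper() and not t.isupper() for t in alpha) / n,
    }


def label_line(line, supplier_name, threshold=0.8):
    """Label a line 1 if it contains, or closely resembles, the supplier name."""
    a, b = line.lower().strip(), supplier_name.lower().strip()
    if b and b in a:
        return 1  # exact (case-insensitive) containment
    # threshold is an assumed cut-off, to be tuned on real data
    return int(SequenceMatcher(None, a, b).ratio() >= threshold)


feats = line_features("ABC Trading Ltd. info@abc.com")
print(feats)
print(label_line("Invoice from ABC Trading Ltd.", "ABC Trading Ltd."))
```

The resulting feature dictionaries and 1/0 labels are the kind of input a classifier such as the one named in step 5 would be trained on.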

`Approach 2`
    
        1. Get text in a paragraph from XML. Combine lines side-by-side to get a paragraph.
        2. Word tokenize this to extract words from it and POS tag it.
        3. IOB annotate it or Use BILOU annotate. - Rsa NPM UI, spacy prodigy 
        4. Annotated data is used to train custom bi-LSTM CRF or an HMM model.
        5. Test data is annotated using tagging mechanism and passed to trained model to predict as a tag (I,O,B)
        6. CHALLENGE - Sequential data is required from XML files.
...
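
The IOB annotation in step 3 of Approach 2 can be sketched as follows. This is only an illustration under assumptions: the `iob_tags` helper and the `SN` label are hypothetical, and matching is a plain case-insensitive token comparison rather than the output of an annotation tool.

```python
def iob_tags(tokens, entity_tokens, label="SN"):
    """Tag every occurrence of entity_tokens inside tokens with B-/I- tags."""
    tags = ["O"] * len(tokens)
    n = len(entity_tokens)
    if n == 0:
        return tags
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in entity_tokens]
    i = 0
    while i <= len(tokens) - n:
        if lowered[i:i + n] == target:
            tags[i] = "B-" + label          # first token of the entity
            for j in range(i + 1, i + n):
                tags[j] = "I-" + label      # remaining tokens of the entity
            i += n
        else:
            i += 1
    return tags


tokens = "Invoice from ABC INC dated 10 May".split()
print(list(zip(tokens, iob_tags(tokens, ["ABC", "INC"]))))
```

Token/tag pairs of this shape are what a sequence model such as the bi-LSTM CRF in step 4 would be trained on.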

`Approach 3`
        
        1. Train an LSTM on patterns of words followed by a legal abbrv, such that the model predicts whether a word will be
           followed by a legal abbrv or not.
...

`Approach 4 (Common)`
        
        1. Convert PDF -> Images using VOTT
        2. Annotate images - Annotate Address (tag: addr); Supplier Name (tag: SN).
        3. Build a custom NER (TF Object detection model) to identify location of address and SN
        4. Use the output from TF Object detection model -> Coordinates of Address and SN across 10k Images.
        5. Train a classifier on these coordinates as Address/Not Address; SN/Not SN
        6. Done.

## <ins>2. Imports</ins>

In [None]:
# Standard libs
import os
import sys
import re
import time
import string
import xml.etree.ElementTree as ET
import uuid
import json
import sqlite3
import configparser
import ast
from collections import OrderedDict, defaultdict
from operator import itemgetter    
import pandas as pd
import numpy as np
import pickle
# Rectangle area
from collections import namedtuple

# Pre-processing libs
## NLTK
import nltk
nltk.data.path.append("../nltk_data__v1.0/")
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet as wn
from nltk.tokenize import word_tokenize, sent_tokenize, wordpunct_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import words as eng_words
lemma = WordNetLemmatizer()
ENG_words = eng_words.words()

## sklearn
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.utils import class_weight

## Gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.models import Phrases, phrases
from gensim.models import Word2Vec, Doc2Vec
from gensim.models import KeyedVectors

## StanfordNLP
import stanfordnlp 
stf_nlp = stanfordnlp.Pipeline(processors = "tokenize,mwt,pos")

import unicodedata
import sacremoses
from sacremoses import truecase, MosesTruecaser, MosesPunctNormalizer, normalize

## LEXNLP
# from bs4 import BeautifulSoup
# import lexnlp.extract.en.money
# import lexnlp.extract.en.urls

## FuzzyWuzzy: Distances
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

## geotext: Geoparsing
import geopandas as gpd
from urllib import request
from geotext import GeoText

## libpostal: address
from postal.parser import parse_address as crf_NER

## Spacy
import spacy
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_lg")

# Text Vectorization libs
from imblearn.over_sampling import SMOTE, SVMSMOTE, ADASYN
from sklearn.decomposition import TruncatedSVD, NMF, PCA
from sklearn.preprocessing import Normalizer
from scipy.sparse import hstack

# Metrics libs
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_recall_curve, precision_score, recall_score, f1_score, auc, roc_curve, roc_auc_score, accuracy_score, classification_report, confusion_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.exceptions import NotFittedError

# Models
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.cluster import KMeans
from sklearn.ensemble import VotingClassifier 
# from xgboost import XGBClassifier
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.initializers import Constant
from keras.models import Sequential, Input, Model
from keras.layers import *
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import SGD
from keras_contrib.layers import CRF
from keras.callbacks import EarlyStopping
from keras import regularizers

# KERAS ELMO IMPLEMENTATION
# import tensorflow as tf
# import tensorflow_hub as hub
# import numpy as np
# import os
# import pandas as pd
# import re
# import keras.layers as layers
# import pydot
# from collections import Counter
# from keras import backend as K
# from keras.callbacks import TensorBoard
# from keras.callbacks import LearningRateScheduler
# from keras.layers import Input, Embedding, BatchNormalization, LSTM, Dense, Concatenate,Bidirectional
# from keras.models import Model
# from keras.utils import plot_model

# Plotting libs
import matplotlib.pyplot as plt
import seaborn as sns

# warnings
import warnings
warnings.filterwarnings('ignore')

----

## <ins>3. Settings: Hyper-parameters</ins>

#### File paths

In [None]:
# Load "Training" data path
data_path = "Data/OCR_XML/"

# deps
mapper_file_path = "Data/actual_training_labels.xls"
legal_abbrv_path = "Data/Legal_Abbrvs_20042020.txt"
dynamic_abbrv_path = "Data/Dynamic_Abbrvs_20042020.txt"
# Load the contractions map, closing the file handle once it is read
with open('Data/Contractions.json', 'r') as f:
    contractions = dict(json.load(f))
REGEX_PUNCTS = "\\\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\]\^\_\`\{\|\}\~"
REGEX_SYMBOLS = "\\■\\¤\\¦\\§\\¨\\ª\\«\\¬\\®\\ˉ\\°\\±\\²\\³\\´\\µ\\¶\\¹\\º\\»\\¼\\½\\¾\\¿\\À\\Á\\Â\\Ã\\Å\\Æ\\È\\É\\Ê\\Ë\\Ì\\Í\\Î\\Ï\\Ð\\Ñ\\Ò\\Ó\\Ô\\Õ\\Ö\\Ø\\Ù\\Ú\\Û\\Ü\\Ý\\ß\\à\\á\\â\\ã\\ä\\å\\æ\\ç\\è\\é\\ê\\ë\\ì\\í\\î\\ï\\ð\\ñ\\ò\\ó\\ô\\õ\\ö\\÷\\ø\\ù\\ú\\û\\ü\\ý\\þ\\ÿ"

#### :: Settings ::

In [None]:
LINE_TAG = " <&> "
MINIMUM_CHARS_IN_LINE = 2
LENGTH_OF_EMAIL = 3
LENGTH_OF_URL = 3

----

## <ins> 4. Load Mapper File </ins>

In [None]:
# Loading mapper file (labelling data)
mapper_file = pd.read_excel(mapper_file_path)

# Supplier Names found in mapper file
SN_all = list(mapper_file.SUPPLIER_NAME.values)
SN_unique = list(mapper_file.SUPPLIER_NAME.unique())
print('total invoices = ', len(SN_all))
print('Unique SN = ', len(SN_unique))

----

## <ins>5. Abbreviations Module </ins>

1. Legal known abbrvs
2. Dynamic frequent abbrvs

### 1. Legal Known abbrvs

Legal known abbrvs = extensions found in company names that usually follow the rules below:


- These lack vowels and are not morphologically well-formed words, which makes them detectable.
- Infringe upon the phonotactics of the language in which they occur.
- These predominantly use the period "." and might also use non-alphanumeric characters such as /, & or ~ within them.
- Have the same collocations as their unabbreviated counterparts.
- Can exploit the rebus principle (expanding contractions eg. inb4 "in before", NRG "energy").
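
The first and third rules can be turned into a rough heuristic. The sketch below is an assumption-laden illustration, not part of the pipeline: `looks_like_abbreviation` is a hypothetical helper, "vowel" means only a, e, i, o, u, and the mark set is limited to the characters named in the rules.

```python
import re

VOWELS = set("aeiou")


def looks_like_abbreviation(token):
    """Heuristic: no vowels among the letters, or contains one of . / & ~"""
    letters = [c for c in token.lower() if c.isalpha()]
    if not letters:
        return False  # purely numeric or symbolic tokens are not abbreviations
    no_vowels = not any(c in VOWELS for c in letters)
    has_marks = bool(re.search(r"[./&~]", token))
    return no_vowels or has_marks


for tok in ["GmbH", "Ltd.", "Co.", "Inc", "Trading"]:
    print(tok, looks_like_abbreviation(tok))
```

A token such as "Inc" has a vowel and no marks, so the heuristic misses it; cases like this are one reason a curated list of known legal abbreviations is still needed.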

...

<ins>STEPS:</ins>

1. Scraping a Wikipedia page using Beautiful Soup 4.
    
    scraped = https://en.wikipedia.org/wiki/List_of_legal_entity_types_by_country

In [None]:
# Legal abbrvs
list_legal_abbrv = []
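with open(legal_abbrv_path, "r") as f:
    for line in f:
        # Each line is expected to look like "ABBRV (Full form)"; only the abbreviation is kept
        abbrv = line.split("(")[0].strip()
        # Skip blank lines so that no empty pattern enters the regex alternation
        if abbrv:
            list_legal_abbrv.append(abbrv)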
list_legal_abbrv = list(set(list_legal_abbrv + list(map(lambda x:x.upper(), list_legal_abbrv)) + list(map(lambda x:x.lower(), list_legal_abbrv))))
list_legal_abbrv_regex = list(map(lambda x: x.replace(".", "\.").replace(",", "\,").replace("-", "\-").replace("(", "\(").replace(")", "\)").replace("[", "\[").replace("]", "\]"), list_legal_abbrv))
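
As an alternative to escaping characters one by one, `re.escape` from the standard library escapes every regex metacharacter in a string. The sketch below is a suggestion rather than the notebook's method: `build_abbrv_pattern` is a hypothetical helper, and the sample abbreviations are illustrative. It uses `(?<!\w)` and `(?!\w)` instead of `\b`, because `\b` does not match after a trailing period that is followed by a space or the end of the string.

```python
import re


def build_abbrv_pattern(abbrvs):
    """Compile one case-insensitive pattern matching any abbreviation as a whole token."""
    # Longest first, so that "Pvt. Ltd." wins over the shorter "Ltd."
    ordered = sorted(set(abbrvs), key=len, reverse=True)
    alternation = "|".join(re.escape(a) for a in ordered)
    # Lookarounds require that no word character touches the match on either side
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


pattern = build_abbrv_pattern(["Ltd.", "Pvt. Ltd.", "Inc", "GmbH"])
print(pattern.findall("Acme Pvt. Ltd."))
print(pattern.findall("Incorporated"))
```

Compiling the pattern once also avoids rebuilding the alternation for every supplier name.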

### 2. Dynamic Abbrvs

Dynamic Abbrvs = Last token in a Supplier Name with a POS tag=Noun/Pronoun and which occurs in at least 50 documents.

- These tokens are considered to be dynamically found abbreviation, and stored in a 'dynamic_list'.
- These will be added along with the known legal abbrvs.
- From time-to-time, when more Supplier Names are discovered, this dynamic list will be updated and new tokens will be allowed to enter the list.

....

<ins>STEPS</ins>

**RULES for considering 'last token' as a Dynamic Abbrv:**

1. Should occur in at-least 50 documents.
2. Should not be a stop-word.
3. Should have a POS tag = NN or NNS or similar.
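
Rules 1 and 2 amount to a document-frequency count over last tokens. A minimal standard-library sketch, assuming one supplier name per document, could look like the following. The `dynamic_abbrvs` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins, and the POS check from rule 3 is left out to keep the example dependency-free.

```python
from collections import Counter

# Tiny stand-in for a real stop-word list (assumption for this sketch)
STOP_WORDS = {"the", "and", "of", "for"}


def dynamic_abbrvs(supplier_names, min_docs=50):
    """Return last tokens that end at least min_docs supplier names."""
    counts = Counter()
    for name in supplier_names:
        tokens = name.strip().split()
        if not tokens:
            continue  # empty supplier name
        last = tokens[-1].upper()
        if last.lower() in STOP_WORDS:
            continue  # rule 2: stop-words are never abbreviations
        counts[last] += 1
    # rule 1: keep tokens seen in enough documents
    return sorted(tok for tok, c in counts.items() if c >= min_docs)


names = ["Acme Traders", "Bolt Traders", "Zed Supplies", "House of The"]
print(dynamic_abbrvs(names, min_docs=2))
```

The threshold is lowered to 2 in the usage line only so that the toy list produces a result.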

In [None]:
# Checking last token across all Supplier Names
stop_words = set(stopwords.words('english'))
frequent_abbrv = []
for S in SN_all:
    S = S.strip()
    known_abbrv = re.findall(r"(?=(\b" + '\\b|\\b'.join(list_legal_abbrv_regex) + r"\b))", S.lower())
    if len(known_abbrv) > 0:
        continue
    else:
        # If not a known legal abbrv...
        last_token = S.split(" ")[-1]
        # POS tag should be NN or similar; nltk.pos_tag returns a list of (token, tag) pairs
        tagged = nltk.pos_tag(word_tokenize(last_token))
        if not tagged or tagged[-1][1] in {"CC", "CD", "DT", "EX", "IN", "LS", "PDT", "TO", "UH", "WDT", "WP", "WRB"}: continue
        # Should not be a stop-word
        if last_token.lower() in stop_words: continue
        frequent_abbrv.append(last_token.upper())


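# MINIMUM FREQUENCY: passed to TfidfVectorizer as min_df (minimum document frequency)
# NOTE: the rules above call for at least 50 documents; the value below is currently set to 2
MIN_FREQ_ABBRV = 2

corpus = frequent_abbrv
vectorizer = TfidfVectorizer(min_df=MIN_FREQ_ABBRV) # keep every last_token occurring in at least MIN_FREQ_ABBRV supplier names as an addition to the abbrv list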
X = vectorizer.fit_transform(corpus)
list_frequent_abbrv = vectorizer.get_feature_names() # lowercase
list_frequent_abbrv = list_frequent_abbrv + list(map(lambda x: x.upper(), list_frequent_abbrv))

#### Combined: Legal + Dynamic Abbrvs

In [None]:
list_abbrv = list_legal_abbrv + list_frequent_abbrv
list_abbrv_regex = list(map(lambda x: x.replace(".", "\.").replace(",", "\,").replace("-", "\-").replace("(", "\(").replace(")", "\)").replace("[", "\[").replace("]", "\]"), list_abbrv))

print("Known Abbrvs (Upper + lower) = ", len(list_legal_abbrv))
print("Dynamic Abbrvs = ", len(list_frequent_abbrv))
print("TOTAL ABBRVS = ", len(list_abbrv))

----

## <ins>6. Reading from XML Module</ins>

#### 1. XML data extraction

In [10]:
def get_word_bounding_box(X, Y):
    """
    Finds word-boundary for a given word.
    :param: list X: X coordinates of all chars; list Y: Y coordinates of all chars
    :return: The left,top & right,bottom coordinates of the char constituting the word.
    """
    r = max(X)
    l = min(X)
    b = max(Y)
    t = min(Y)
    return l, t, r, b

def get_average_font_size(list_font_size, total_n, dtype):
    """
    Finds font-size of word by calculating avg font-size of all chars. (e.g. "AbcD" -> FS=[25,20,20,25]; avgFS = SUM/4)
    :param: 
    list list_font_size: Font-sizes of all chars in word; int total_n: No of chars in word; type dtype: Data type of required font-size
    :return: Average font-size of a word.
    """
    if total_n != 0:
        avg_font_size = float(sum(list_font_size)/total_n)
    else:
        avg_font_size = 0.0
    if dtype == int:
        return int(avg_font_size)
    elif dtype == float:
        return float(avg_font_size)
    else:
        return avg_font_size

def get_words_from_char(char_dict):
    """
    Finds out words from set of chars. Creates a dict of {Words: coordinates, font-size, page-dims}
    :param: dict char_dict: A dict containing character wise Coordinates, FS
    :return: A dict of word and its Coordinates, FS, Page-dims
    """    
    word_dict = []
    word = ""
    X, Y, X_, Y_ = [], [], [], []
    font_size, font_size_, total_chars = [], [], 0
    page_width, page_height = 0, 0
    for ch in char_dict:
        # NOT EMPTY
        if ch["text"]:
            # 1. It's a LINEBREAK - appends the pending word (if any), then the line-break marker
            if ch["text"] == "<<LINEBREAK>>":
                # Guard: X/Y are empty when the line is blank or ends with a space
                if word.strip():
                    l, t, r, b = get_word_bounding_box(X, Y)
                    l_, t_, r_, b_ = get_word_bounding_box(X_, Y_)
                    avg_font_size = get_average_font_size(font_size, total_chars, int)
                    avg_font_size_ = get_average_font_size(font_size_, total_chars, float)
                    word_dict.append({"content": word, "l": l, "t": t, "r": r, "b": b, 'font_size': avg_font_size, 
                                      "l_":l_, "t_": t_, "r_":r_, "b_":b_, 'font_size_': avg_font_size_,
                                      "page_width": page_width, "page_height": page_height})
                
                word_dict.append("<<LINEBREAK>>")
                word = ""
                X, Y, X_, Y_ = [], [], [], []
                font_size, font_size_, total_chars = [], [], 0
                page_width, page_height = 0, 0
            
            else:
                # 2. It's a SPACE (word separator) - appends stored variables
                if ch["text"] == " ":
                    # Flush only if a word has been accumulated (skips leading or repeated spaces)
                    if len(word.strip(" ")):
                        l, t, r, b = get_word_bounding_box(X, Y)
                        l_, t_, r_, b_ = get_word_bounding_box(X_, Y_)
                        avg_font_size = get_average_font_size(font_size, total_chars, int)
                        avg_font_size_ = get_average_font_size(font_size_, total_chars, float)
                        word_dict.append({"content": word, "l": l, "t": t, "r": r, "b": b, 'font_size': avg_font_size, 
                                          "l_":l_, "t_": t_, "r_":r_, "b_":b_, 'font_size_': avg_font_size_,
                                          "page_width": page_width, "page_height": page_height})
                        word = ""
                        X, Y, X_, Y_ = [], [], [], []
                        font_size, font_size_, total_chars = [], [], 0
                        page_width, page_height = 0, 0
                    continue
                else:
                    # 3. IT IS A WORD
                    word += ch["text"]
                    # 1. Normal Coordinates
                    X.append(int(ch['l']))
                    X.append(int(ch['r']))
                    Y.append(int(ch['t']))
                    Y.append(int(ch['b']))
                    # 2. Regularized Coordinates
                    X_.append(float(ch['l_']))
                    X_.append(float(ch['r_']))
                    Y_.append(float(ch['t_']))
                    Y_.append(float(ch['b_']))
                    # Storing font-sizes of all chars (letters, digits and some marks)
                    # The brackets are escaped so that [ and ] are part of the character class
                    if re.match(r"^[A-Za-z0-9!@#$%&{}\[\]()]*$", ch["text"]):
                        font_size.append(np.abs(int(ch['t']) - int(ch['b'])))
                        font_size_.append(np.abs(ch['t_'] - ch['b_']))
                        total_chars += 1
                    page_width, page_height = int(ch['page_width']), int(ch['page_height'])
    return word_dict

def get_non_table_blocks(page):
    """
    Creates a dict of all words present outside the table area.
    :param: obj page: XML object of page
    :return: A dict of all words outside the table area with Coordinates, FS, Page-dims
    """
    page_width, page_height, page_res  = page.attrib['width'], page.attrib['height'], page.attrib['resolution']
    non_table_char_coor = []
    for block in page.iter('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}block'):
        if block.attrib["blockType"] == 'Text':
            for text in block.findall('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}text'):
                for line in text.iter('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}line'):  
                    for ch in line.iter('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}charParams'):
                        non_table_char_coor.append({"text": ch.text, 
                                                    "l": int(ch.attrib['l']), "t": int(ch.attrib['t']),
                                                    "r": int(ch.attrib['r']), "b": int(ch.attrib['b']),
                                                    "l_": float(int(ch.attrib['l'])/int(page_width)), 
                                                    "t_": float(int(ch.attrib['t'])/int(page_height)), 
                                                    "r_": float(int(ch.attrib['r'])/int(page_width)), 
                                                    "b_": float(int(ch.attrib['b'])/int(page_height)),
                                                    "page_width": int(page_width), "page_height": int(page_height)})
                    non_table_char_coor.append({"text": "<<LINEBREAK>>", "l": "", "t": "", "r": "", "b": "", 
                                                "l_": "", "t_": "", "r_": "", "b_": "", "page_width": "", "page_height": ""})
    non_table_word_dict = get_words_from_char(non_table_char_coor)
    return non_table_word_dict

def get_table_blocks(page):
    """
    Creates a dict of all words present inside the table area.
    :param: obj page: XML object of page
    :return: A dict of all words inside the table area with Coordinates, FS, Page-dims.
    """
    page_width, page_height, page_res  = page.attrib['width'], page.attrib['height'], page.attrib['resolution']
    table_char_coor = []
    for block in page.iter('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}block'):
        if block.attrib["blockType"] == 'Table':
            for row in block.findall('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}row'):
                for cell in row.iter('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}cell'):
                    for text in cell.iter('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}text'):
                        for line in text.iter('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}line'):
                            for ch in line.iter('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}charParams'):
                                table_char_coor.append({"text": ch.text, 
                                                        "l": int(ch.attrib['l']), "t": int(ch.attrib['t']), 
                                                        "r": int(ch.attrib['r']), "b": int(ch.attrib['b']),
                                                        "l_": float(int(ch.attrib['l'])/int(page_width)), 
                                                        "t_": float(int(ch.attrib['t'])/int(page_height)), 
                                                        "r_": float(int(ch.attrib['r'])/int(page_width)), 
                                                        "b_": float(int(ch.attrib['b'])/int(page_height)),
                                                        "page_width": int(page_width), "page_height": int(page_height)})
                            table_char_coor.append({"text": "<<LINEBREAK>>", "l": "", "t": "", "r": "", "b": "", 
                                                    "l_": "", "t_": "", "r_": "", "b_": "", "page_width": "", "page_height": ""})
    table_word_dict = get_words_from_char(table_char_coor)
    return table_word_dict

def get_invoice_pages(sub_folder_location, complete_extraction):
    """
    Finds all words (non-table blocks and, optionally, table blocks) on the first page of each XML file, with their coords, fs, page-dim.
    :param: string sub_folder_location: Path of folder containing XML file; bool complete_extraction: If True, words inside table blocks are extracted too
    :return: List of words (with attributes) and "<<LINEBREAK>>" markers from the 1st page.
    """
    # Read only the XML files; a few folders also contain zip and xlsx files
    all_pages = []
    print("DIR : {}".format(sub_folder_location))
    for file in os.listdir(sub_folder_location):
        file_add = os.path.join(sub_folder_location, file)
        if file.endswith(".xml"):
            tree = ET.parse(file_add)
            r = tree.getroot()
            pages = r.iter('{http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml}page')
            for page in pages:
                
                # Page Dims
                page_width, page_height, page_res  = page.attrib['width'], page.attrib['height'], page.attrib['resolution']
                
                # Non-Table Text
                non_table_blocks = get_non_table_blocks(page)
                
                # Table Text (extracted only when complete_extraction is set)
                if complete_extraction:
                    table_blocks = get_table_blocks(page)
                else:
                    # Not extracting data from table region
                    table_blocks = []
                
                # Add: Non Table + Table data
                all_blocks = non_table_blocks + table_blocks
                
                # All Pages in XML
                all_pages.extend(all_blocks)
                
                # *** NOTE *** 
                # considering only PAGE 1
                break
                
    return all_pages

def get_all_words(all_word_dict):
    """
    Extracts meta-data from dict of words/lines
    :param: dict all_word_dict: A dict of all words across all pages with coords, fs, page-dim
    :return: 
    string text: The text from the entire page
    dict text_dict: A dictionary containing word: meta-data from the entire page
    list lines: A list of all lines from the entire page
    dict line_dict: A dictionary containing line: meta-data from the entire page
    """
    X, Y, X_, Y_ = [], [], [], []
    font_size, font_size_, total_words = [], [], 0
    page_width, page_height = 0, 0
    text, text_dict = " ", []
    line_text, line_dict = " ", []
    lines = []
    for w in all_word_dict:
        if w != "<<LINEBREAK>>":
            # Word: l,t,r,b, font_size, l*, t*, r*, b*, font_size*
            text_dict.append({'word': str(w["content"]), 'l': w['l'], 't': w['t'], 'r': w['r'], 'b': w['b'], 
                              'font_size': w['font_size'], 'l_': w['l_'], 't_': w['t_'], 'r_': w['r_'], 'b_': w['b_'], 
                              'font_size_': w['font_size_'], 'page_width': w['page_width'], 'page_height': w['page_height']})
            # Text
            text = text + str(w["content"]) + " "
            # Line: Bounding Box of all words in line
            # 1. Normal Coords
            X.append(int(w['l']))
            X.append(int(w['r']))
            Y.append(int(w['t']))
            Y.append(int(w['b']))
            # 2. Regularized Coords
            X_.append(float(w['l_']))
            X_.append(float(w['r_']))
            Y_.append(float(w['t_']))
            Y_.append(float(w['b_']))
            # Font-size of normal & regularized coords
            font_size.append(int(w['font_size']))
            font_size_.append(float(w['font_size_']))
            total_words += 1
            # Page Dims
            page_width, page_height =  w['page_width'], w['page_height']
            line_text = line_text + str(w["content"]) + " " 

        else:
            # SAVE LINE DICT (skipped for blank lines, which have no words and hence no bounding box)
            if X:
                l, t, r, b = get_word_bounding_box(X, Y)
                l_, t_, r_, b_ = get_word_bounding_box(X_, Y_)
                avg_font_size = get_average_font_size(font_size, total_words, int)
                avg_font_size_ = get_average_font_size(font_size_, total_words, float)
                line_dict.append({'line': line_text.strip(), 'l':l, 't':t, 'r':r, 'b':b, 'font_size':avg_font_size, 
                                  'l_':l_, 't_':t_, 'r_':r_, 'b_':b_, 'font_size_':avg_font_size_,
                                  'page_width': page_width, 'page_height': page_height})
                # SAVE LINE TEXT
                lines.append(line_text.strip())
            # Re-initialize
            X, Y, X_, Y_ = [], [], [], []
            font_size, font_size_, total_words = [], [], 0
            page_width, page_height = 0, 0
            line_text = " "
    text = text.strip()
    return text, text_dict, lines, line_dict

def extract_data(directory, complete_extraction):
    """
    Extracts linguistic meta-data for each XML file.
    :param: string directory: Path where all XML files are located
    :return: List of text, text-with-attributes, line, line-with-attributes for all XML files.
    """
    all_text, all_text_dict, all_lines, all_line_dict = [], [], [], []
    try:
        # Extract XML Text data
        all_words = get_invoice_pages(directory, complete_extraction)
        text, text_dict, lines, line_dict = get_all_words(all_words)
        print("Characters = ", len(text))
        # Check for non xml sub-folders...
        if len(text) == 0:
            print("No XML file found in {}".format(directory))
            text = np.nan
        # Stores all text
        all_text.append(text)
        all_text_dict.append(text_dict)
        all_lines.append(lines)
        all_line_dict.append(line_dict)
        print("Extracted !")
        print("******************")
    except Exception as e:
        print("Error in extraction! Error = {}".format(str(e)))
        print("******************")
        return ["error"]*4
    return all_text, all_text_dict, all_lines, all_line_dict

In [11]:
def Create_Data(data_path, complete_extraction):
    """
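    Extracts linguistic meta-data for all XML files located in the given path.
    :param: string data_path: Path whose sub-folders contain the XML files; bool complete_extraction: If True, table blocks are extracted too
    :return: DataFrame with one row per line (FILE_INDEX, FILENAME, TEXT, WORDS, LINES, LINES_DICT, SUPPLIER_NAME) for all XML files.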
    """
    counter = 0
    mapping_notfound = []
    df = pd.DataFrame()
    for file in os.listdir(data_path):
        print(file)
        # Reading a XML file
        print("#>", counter)
        AID = re.findall(r"(.*)", str(file))[0]
        text_data, text_dict, lines_data, lines_dict = extract_data(data_path + file + "/", complete_extraction)
        if type(text_data) == str:
            if set([text_data, text_dict, lines_data, lines_dict]) == {"error"}:
                counter+=1
                continue
        
        # Storing in a df
        df_data = pd.DataFrame()
        df_data['LINES'] = pd.Series(lines_data[0])
        df_data['LINES_DICT'] = pd.Series(lines_dict[0])
        df_data['TEXT'] = text_data[0]
        df_data['WORDS'] = pd.Series(text_dict*df_data.shape[0])
        df_data['FILE_INDEX'] = [counter] * df_data.shape[0]
        df_data['FILENAME'] = str(file)

        # Finding 'Target variable Y_SN' using mapper file
        if AID in mapper_file.FILENAME.values:
            mapper_df = mapper_file[mapper_file.FILENAME == AID].copy()
            df_data['SUPPLIER_NAME'] = mapper_df['SUPPLIER_NAME'].values[0]
        else:
            print("Supplier Name Mapping DOESN'T EXIST for - ", file)
            mapping_notfound.append(file)

        # Appending in a common df
        df = df.append(df_data)
        counter+=1
    
    # Final df
    df = df.reindex(columns=['FILE_INDEX', 'FILENAME', 'TEXT', 'WORDS', 'LINES', 'LINES_DICT', 
                             'SUPPLIER_NAME']).reset_index(drop=True)
    return df

In [None]:
# df_complete --> For Address Block extraction
df_complete = Create_Data(data_path, complete_extraction=True)

# df --> For Supplier Name extraction
df = Create_Data(data_path, complete_extraction=False)
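
The extraction functions spell out the full ABBYY namespace URI in every tag lookup. As a minimal sketch, `xml.etree.ElementTree` also accepts a prefix-to-URI mapping in `find`, `findall` and `iterfind`, which keeps the paths short. The `SAMPLE` document below is hand-written and far simpler than real FineReader output, and the `fr` prefix and the `iter_chars` helper are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI mapping passed to the ElementTree search methods
NS = {"fr": "http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml"}

# Hand-written miniature document (assumption: real files have more nesting)
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
  <page width="1000" height="2000" resolution="300">
    <block blockType="Text">
      <text>
        <par>
          <line>
            <charParams l="100" t="200" r="120" b="240">A</charParams>
            <charParams l="125" t="200" r="145" b="240">B</charParams>
          </line>
        </par>
      </text>
    </block>
  </page>
</document>"""


def iter_chars(page):
    """Yield each character of the Text blocks with raw and page-normalized coordinates."""
    width, height = int(page.attrib["width"]), int(page.attrib["height"])
    for block in page.iterfind("fr:block", NS):
        if block.attrib.get("blockType") != "Text":
            continue
        for ch in block.iterfind(".//fr:charParams", NS):
            l, t, r, b = (int(ch.attrib[k]) for k in "ltrb")
            yield {"text": ch.text, "l": l, "t": t, "r": r, "b": b,
                   # normalized ("regularized") coordinates in the range 0..1
                   "l_": l / width, "t_": t / height,
                   "r_": r / width, "b_": b / height}


root = ET.fromstring(SAMPLE)
page = root.find("fr:page", NS)
chars = list(iter_chars(page))
print(chars[0])
```

Dividing by the page width and height mirrors how the `l_`, `t_`, `r_`, `b_` values are computed in `get_non_table_blocks`, so coordinates from pages of different sizes become comparable.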

----

## <ins>7. Initial Preprocessing Module </ins>

#### Helper Functions: Finding NER entities in `df.Lines`

In [14]:
def find_email(s):
    """
    Extracts NER email in text.
    :param: string s: Paragraph or line
    :return: Returns email-address and email-domain.
    """
    LENGTH_OF_EMAIL = 3
    email_tags = re.findall(r"[\w\.-]+@[\w\.-]+", s, flags=re.IGNORECASE | re.MULTILINE)
    email_tags = [email.strip() for email in email_tags if len(email) >= LENGTH_OF_EMAIL]
    email_domains = re.findall(r"[\w\.-]+@([\w\.-]+)", s, flags=re.IGNORECASE | re.MULTILINE)
    email_domains = list(map(lambda x: x.split('.')[0].lower(), email_domains))
    return email_tags, email_domains

def find_url(s):   
    """
    Extracts NER url in text.
    :param: string s: Paragraph or line
    :return: Returns all url-addresses, non-email url-addresses, non-email url-domains.
    """
    # removed emails and decimal values to reduce noise
    s = re.sub(r"\s+", " ", re.sub(r"[\w\.-]+@[\w\.-]+", " ", s, flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE)
    s = re.sub(r"\d+\.\d+", " ", s, flags=re.IGNORECASE | re.MULTILINE).strip()
    
    # TYPE 1:  On entire data that may/maynot have email attributes like (From: <URL> To: <URL>)
    url_tags = re.findall(r'(https://www.|http://www.|ftp://www.|http://|ftp://|https://|www.)+([A-Za-z_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', s, flags=re.IGNORECASE | re.MULTILINE)
    url_tags = list(map(lambda x: "".join(x), url_tags))
    
    # TYPE 2:  On data that DO-NOT have email attributes - ** !! IMP FOR CHECKING DOMAIN NAME MATCHING !!**
    # A group (not a character class) is needed to match the header keywords as whole words
    s_ = re.sub(r'\b(?:from|to|cc|bcc|subject)\s*:+\s+(http://|ftp://|https://|www.)+([A-Za-z_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', " ", s, flags=re.IGNORECASE | re.MULTILINE)
    url_tags_ = re.findall(r'(https://www.|http://www.|ftp://www.|http://|ftp://|https://|www.)+([A-Za-z_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', s_, flags=re.IGNORECASE | re.MULTILINE)
    url_domains = list(map(lambda x: x[1].split('.')[0].lower(), url_tags_))
    url_tags_ = list(map(lambda x: "".join(x), url_tags_))
    
    return url_tags, url_tags_, url_domains

def find_phone(s):
    """
    Extracts NER telephone in text.
    :param: string s: Paragraph or line
    :return: Returns telephone tags.
    """
    phone_tags = re.findall(r"\s*(?:\+?(\d{1,3}))?[-. (]*(\d{3})[-. )]*(\d{3})[-. ]*(\d{4})(?: *x(\d+))?\s*", s, flags=re.IGNORECASE | re.MULTILINE)
    return phone_tags

def find_money(s):
    """
    Extracts NER currency in text.
    :param: string s: Paragraph or line
    :return: Returns currency tags in regex pattern.
    
    MONEY TAGS
    ----------
    '$': 'USD',
    '€': 'EUR',
    '¥': 'JPY',
    '£': 'GBP',
    '₠': 'EUR',
    '₨': 'INR',
    '₹': 'INR',
    '₺': 'TRY',
    '元': 'CNY',
    '₽': 'RUB',
    '₩': 'KRW'
    """         
    # "lexNLP" sometimes catches "Cost" or "Amount" keywords as money-phrases! Thus removing these.
    s = re.sub(r"\bcost\b\s+|\bamount\b\s+|\bcostamount\b\s+|\bcosts\b\s+|\bamounts\b\s+|\bprice\b\s+|\bpriced\b\s+", " ", s.strip().lower())
    # lexNLP
    money_tags = list(lexnlp.extract.en.money.get_money(s, return_sources=True))
    money_tags = [str(tag[2]) for tag in money_tags]
    money_tags = [tags for tags in money_tags if len(re.sub(r'\s+', ' ', str(tags)).split(' ')) < 4]
    # to regex format
    money_tags_RE = [tag.replace('$','\$').replace('£','\£').replace('€','\€').replace('¥','\¥').replace('₠','\₠').replace('₨','\₨').replace('₹','\₹').replace('₺','\₺').replace('元','\元').replace('₽','\₽').replace('₩','\₩').strip() for tag in money_tags]
    return money_tags_RE

def find_date(s):
    """
    Extracts NER date in text.
    :param: string s: Paragraph or line
    :return: Returns date tags in regex pattern.
    """
    date_tags = []
    # 10/10/2015
    date_tags += re.findall(r"[\d]{1,2}/[\d]{1,2}/[\d]{4}", s, flags=re.IGNORECASE | re.MULTILINE)
    # 1-11-10
    date_tags += re.findall(r"[\d]{1,2}-[\d]{1,2}-[\d]{2}", s, flags=re.IGNORECASE | re.MULTILINE)
    # 1 NOV 2010 (all twelve month abbreviations)
    date_tags += re.findall(r"([\d]{1,2}\s(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s[\d]{4})", s, flags=re.IGNORECASE | re.MULTILINE)
    # 1 November 2010
    date_tags += re.findall(r"(\d{1,2} (?:January|February|March|April|May|June|July|August|September|October|November|December) \d{4})", s, flags=re.IGNORECASE | re.MULTILINE)
    # November 1, 2010
    date_tags += re.findall(r"(\s*(?:January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}\,* \d{4})", s, flags=re.IGNORECASE | re.MULTILINE)
    # to regex format
    date_tags_RE = [tag.replace('/','\/').replace("\\","\\").replace('-','\-').replace(',','\,').replace('.','\.').replace('|','\|').strip() for tag in date_tags]
    return date_tags_RE

#### Pre-Processing in `df.LINES`

In [15]:
def get_lines_cleaned(line):
    """
    Performs basic-level cleansing on text.
    :param: string line: A single line of text
    :return: Returns cleaned line.
    """

    # Removes a line if it contains: only 1 digit, only 1 letter, only symbols, or it is empty
    def find_outliers(line):
        original_line = line
        line = str(line).strip().lower()
        line_stripped = re.sub(r"\s+", "", re.sub("["+REGEX_PUNCTS+"]+", "", re.sub(r"[^A-Za-z0-9]", "", line))).strip()
        # line_stripped has no whitespace, so "total amount" appears as "totalamount"
        if len(line) < MINIMUM_CHARS_IN_LINE or len(line_stripped) < MINIMUM_CHARS_IN_LINE \
           or re.match(r'^(?:totals?|amounts?|sum|whole|final)+\d*$',
                       line_stripped) is not None:
            return np.nan
        else:
            return original_line

    # Normalizes accented characters in a line.
    def find_accentedChars(line):
        accented_text = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        return accented_text
    
    # Expands contracted words in a line.
    def find_contractions(line):
        expanded_text = list(pd.Series(re.sub(r"\’", "\'", line).split('\n')).apply(lambda x: re.sub(r"(\w+\'\w+)", lambda x: contractions.get(x.group().lower(), x.group().lower()), x)))
        return "\n".join(expanded_text)
    
    # Removes unwanted symbols from lines
    def find_symbols(line):
        return re.sub(r"["+REGEX_SYMBOLS+"]+", " ", line, flags=re.IGNORECASE | re.MULTILINE).strip()
    
    # Removes line tag "<&>" present in a line. Removing it is important for postal_tagging step
    def find_lineTag(line):
        LINE_TAG_RE = LINE_TAG.replace('<','\<').replace('&','\&').replace('>','\>').strip()
        line = re.sub(LINE_TAG_RE, " ", line, flags=re.IGNORECASE | re.MULTILINE).strip()
        return line
    
    # CLEANING
    line = find_outliers(line)
    if str(line) != 'nan':
        line = find_accentedChars(line)
        line = find_contractions(line)
        line = find_symbols(line)
        line = find_lineTag(line)
        line = re.sub(r"\s+", " ", re.sub(r"[\n\t\r]+", "  ", line, flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE).strip()
    return line

In [None]:
# df_complete
original_shape = df_complete.shape[0]
df_complete['LINES'] = df_complete.LINES.apply(get_lines_cleaned)
df_complete = df_complete.dropna(subset=['LINES']).reset_index(drop=True)
print("Original Shape (before cleaning) = ", original_shape, "\nCleaned Shape = ", df_complete.shape[0], "\n")


# df
original_shape = df.shape[0]
df['LINES'] = df.LINES.apply(get_lines_cleaned)
df = df.dropna(subset=['LINES']).reset_index(drop=True)
print("Original Shape (before cleaning) = ", original_shape, "\nCleaned Shape = ", df.shape[0], '\n')

#### NER Entities Replacement with whitespace

- Has been removed.

In [17]:
# # Accepts cleaned/pre-processed 'df.LINES' and performs entity recognition and its replacement with tags.
# def get_lines_preprocessed(line):
    
#     # REPLACE entities like: | EMAIL | URL | MONEY | DATE |
#     def replace_email(s):
#         try:
#             email_tags, _ = find_email(s)
#             for tag in email_tags:
#                 s = re.sub(tag, EMAIL_TAG, s, flags=re.IGNORECASE | re.MULTILINE)
#         except Exception as e:
#             print(e)
#         return s
#     def replace_url(s):
#         try:
#             url_tags, _, _ = find_url(s)
#             for tag in url_tags:
#                 s = re.sub(tag, URL_TAG, s, flags=re.IGNORECASE | re.MULTILINE)
#         except Exception as e:
#             print(e)
#         return s
#     def replace_money(s):
#         try:
#             money_tags = find_money(s)
#             for tag in money_tags:
#                 s = re.sub(tag, MONEY_TAG, s, flags=re.IGNORECASE | re.MULTILINE)
#         except Exception as e:
#             print(e)
#         return s
#     def replace_date(s):
#         try:
#             date_tags = find_date(s)
#             for tag in date_tags:
#                 s = re.sub(tag, DATE_TAG, s, flags=re.IGNORECASE | re.MULTILINE)
#         except Exception as e:
#             print(e)
#         return s
    
#     # finds NER entity and replaces them with assigned TAGS - for feature extraction
#     line = replace_email(line)
#     line = replace_url(line)
#     line = replace_date(line)
#     line = replace_money(line)
#     return line

----

## <ins>8. Address Block (AB) Identification Module </ins>

- Approach 1: Using Custom NER (biLSTM-CRF): BILUO Tagging and tag prediction
- Approach 2: Using Word-Level features, n-grams with a Decision Tree model
- **Approach 3: Using lexNLP, libpostal and a scoring decision matrix**

.....................................................................................................................................................................................................................................................

### Approach 1
Custom NER (biLSTM-CRF): BILUO Tagging and tag prediction


- All lines are first tagged with spaCy NER and then passed to spaCy's GoldParse to create BILUO tags. A BiLSTM-CRF model is trained on these BILUO tags so that it can predict the correct tags when a batch of new lines is passed to it.
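
To make the BILUO scheme concrete, here is a minimal sketch that derives the tags for one entity from its character span. It deliberately avoids spaCy and works on whitespace tokens only; `biluo_from_span` is a hypothetical helper and the sample line is invented for illustration.

```python
def biluo_from_span(text, start, end, label):
    """Tag whitespace-separated tokens of `text` with BILUO tags for one
    entity covering the character span [start, end)."""
    tokens, pos = [], 0
    for tok in text.split():
        tok_start = text.index(tok, pos)
        tok_end = tok_start + len(tok)
        tokens.append((tok, tok_start, tok_end))
        pos = tok_end

    # Indexes of the tokens that lie fully inside the entity span
    inside = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]

    tags = ["O"] * len(tokens)
    if len(inside) == 1:
        tags[inside[0]] = "U-" + label          # Unit: single-token entity
    elif len(inside) > 1:
        tags[inside[0]] = "B-" + label          # Beginning
        tags[inside[-1]] = "L-" + label         # Last
        for i in inside[1:-1]:
            tags[i] = "I-" + label              # Inside
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


line = "Invoice from ABC Trading Inc dated 10 May"
sn = "ABC Trading Inc"
start = line.index(sn)
tagged = biluo_from_span(line, start, start + len(sn), "SN")
print(tagged)
```

Every token outside the supplier-name span receives `O`, which is why most of the training data ends up with the `O` tag.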

In [18]:
# def create_BILOU_tagging(df):
    
#     def get_BILUO_tags(file_num, line_num, text, sn):
        
#         file_num = "FILE#" + str(file_num)
#         line_num = "LINE#" + str(line_num)

#         text = re.sub(r"[\n\t\r]+", " ", text)
#         text = re.sub(r"\s+", " ", re.sub(r"[^A-z0-9!!@#$%&?]+", " ", text))
#         text = re.sub(r"\s+", " ", re.sub(r"[\`\~\^\*\(\)\[\]\{\}\:\;\'\"\,\<\.\>\/\\\|\-\_\=\+]+", " ", text))
#         sn = re.sub(r"\s+", " ", re.sub(r"[^A-z0-9!@#$%&?]+", " ", sn))
#         # indexes of regex match
#         match_indexs = [(a.start(), a.end()) for a in list(re.finditer(sn.lower(), text.lower()))]

#         # Spacy GOLD parser
#         doc = nlp(text)
#         pos_tags = [pos[1] for pos in nltk.pos_tag(word_tokenize(text))]
#         entities = [tuple(list(i) + ["SN"]) for i in match_indexs]
#         tags = biluo_tags_from_offsets(doc, entities)
#         BILUO = [(file_num, line_num, str(token), str(pos), tag) for token,pos,tag in zip(doc, pos_tags, tags)]
#         return BILUO
    
#     # Creating tag line-by-line...
#     list_line_word_tags = []
#     line_num = 0
#     for file_num, line, sn in zip(df.index.values, df.TEXT.values, df.SUPPLIER_NAME.values):
#         list_line_word_tags.append(get_BILUO_tags(file_num, line_num, line, sn))
#         line_num+=1
    
#     # Extract attributes
#     FILE_NO, LINE_NO, WORD, POS, TAG = zip(*sum(list_line_word_tags, []))
#     tag_df = pd.DataFrame({"File#": FILE_NO, "Line#":LINE_NO, "Word": WORD, "POS": POS, "Tag": TAG})
#     return tag_df

In [19]:
# df_BILOU_SN = create_BILOU_tagging(df_text[:5000])

In [20]:
# def preprocess_bilou(x):
#     w = x.Word
#     t = x.Tag
#     original_w = w
#     w = str(w).strip().lower()
#     if t not in ['B-SN', 'I-SN', 'L-SN', 'U-SN']:
#         if len(w) < 4 or w.isdigit() or re.sub("\s+", "", re.sub(r"[^A-z0-9]", " ", w)).strip().isdigit() or len(re.sub("\s+", "", re.sub(r"[^A-z0-9]", " ", w)).strip()) == 0:
#             return np.nan
#     return original_w
    
# data = df_BILOU_SN.copy()
# data["Word"] = data.apply(preprocess_bilou, axis=1)
# data = data.dropna(subset=['Word']).reset_index(drop=True)

In [21]:
# words = list(set(data["Word"].values))
# words.append("ENDPAD")
# n_words = len(words); n_words

In [22]:
# tags = list(set(data["Tag"].values))
# n_tags = len(tags); n_tags

In [23]:
# max_len = 100
# word2idx = {w: i + 1 for i, w in enumerate(words)}
# tag2idx = {t: i for i, t in enumerate(tags)}

In [24]:
# class SentenceGetter(object):
    
#     def __init__(self, data):
#         self.n_sent = 1
#         self.data = data
#         self.empty = False
#         agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
#                                                            s["POS"].values.tolist(),
#                                                            s["Tag"].values.tolist())]
#         self.grouped = self.data.groupby("Line#").apply(agg_func)
#         self.sentences = [s for s in self.grouped]
    
#     def get_next(self):
#         try:
#             s = self.grouped["LINE#{}".format(self.n_sent)]
#             self.n_sent += 1
#             return s
#         except:
#             return None

In [25]:
# getter = SentenceGetter(data)
# sentences = getter.sentences

In [26]:
# # pad the sequence
# X = [[word2idx[w[0]] for w in s] for s in sentences]
# X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=n_words-1)

# # pad the target
# y = [[tag2idx[w[2]] for w in s] for s in sentences]
# y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])

# y = [to_categorical(i, num_classes=n_tags) for i in y]
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)

In [27]:
# model = Sequential()
# model.add(Embedding(input_dim=n_words+1, output_dim=200, input_length=max_len))
# model.add(Dropout(0.5))
# model.add(Bidirectional(LSTM(units=128, return_sequences=True, recurrent_dropout=0.1)))
# model.add(TimeDistributed(Dense(n_tags, activation="relu")))
# crf_layer = CRF(n_tags)
# model.add(crf_layer)
# model.summary()

In [28]:
# # adam = k.optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999)
# model.compile(optimizer='adam', loss=crf_layer.loss_function, metrics=[crf_layer.accuracy])
# history = model.fit(X_tr, np.array(y_tr), batch_size=256, epochs=10, validation_split=0.1, verbose=1)

In [29]:
# hist = pd.DataFrame(history.history)

# plt.style.use("ggplot")
# plt.figure(figsize=(12,12))
# plt.plot(hist["crf_viterbi_accuracy"])
# plt.plot(hist["val_crf_viterbi_accuracy"])
# plt.show()

In [30]:
# test_pred = model.predict(X_te, verbose=1)

# idx2tag = {i: w for w, i in tag2idx.items()}

# def pred2label(pred):
#     out = []
#     for pred_i in pred:
#         out_i = []
#         for p in pred_i:
#             p_i = np.argmax(p)
#             out_i.append(idx2tag[p_i].replace("PAD", "O"))
#         out.append(out_i)
#     return out
    
# pred_labels = pred2label(test_pred)
# test_labels = pred2label(y_te)
# print("F1-score: {:.1%}".format(f1_score(test_labels, pred_labels)))
# print(classification_report(test_labels, pred_labels))

# model.evaluate(X_te, np.array(y_te))

In [31]:
# i = 10
# p = model.predict(np.array([X_te[i]]))
# p = np.argmax(p, axis=-1)
# true = np.argmax(y_te[i], -1)
# print("{:15}||{:5}||{}".format("Word", "True", "Pred"))
# print(30 * "=")
# for w, t, pred in zip(X_te[i], true, p[0]):
#     if w != 0:
#         print("{:15}: {:5} {}".format(words[w-1], tags[t], tags[pred]))

In [32]:
# # Custom Tokenizer
# re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

# def tokenize(s): 
#     return re_tok.sub(r' \1 ', s).split()

In [33]:
# x_test_sent = pad_sequences(sequences=[[word2idx.get(w, 0) for w in tokenize(test_sentence)]], 
#                             padding="post", value=0, maxlen=max_len)

# p = model.predict(np.array([x_test_sent[0]]))
# p = np.argmax(p, axis=-1)

# print("{:15}||{}".format("Word", "Prediction"))
# print(30 * "=")

# for w, pred in zip(tokenize(test_sentence), p[0]):
#     print("{:15}: {:5}".format(w, tags[pred]))

.....................................................................................................................................................................................................................................................

### Approach 2
Using Word-Level features, n-grams with a Decision Tree model

- Create n-grams for each word in a line and construct Word-Level features for the middle word of each n-gram. Label the data automatically with BILUO tags (using spaCy's GoldParse). Pass the constructed features to a Decision Tree classifier that classifies the middle word as S (Start), M (Middle) or E (End) of an Address Block field. Finally, pass the S, M, E values to a Scoring Decision Matrix (SME) for the final chunk prediction.

<ins>STEPS</ins>


`Step 1: Construct n-grams`

1. The entire XML file is stored as an *'assumed sequential text'* in a paragraph P. For e.g. P = (w1, w2, w3, ..., wn)


2. The paragraph is then broken into trigrams. For e.g. n-grams = ([_, w1, w2], [w1, w2, w3], [w2, w3, w4], ..., [wn-1, wn, _ ])
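
A minimal sketch of this step, assuming the paragraph is already tokenised into a list of words. `padded_trigrams` is a hypothetical helper name and the sample address is invented.

```python
def padded_trigrams(words, pad="_"):
    """Return one trigram per word, with that word in the middle position."""
    padded = [pad] + list(words) + [pad]
    return [tuple(padded[i:i + 3]) for i in range(len(words))]


words = "12 Main Street Springfield 62704".split()
for gram in padded_trigrams(words):
    print(gram)
```

Padding both ends with `_` keeps the number of trigrams equal to the number of words, so every word gets exactly one feature vector in the next step.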

...

`Step 2: Construct Word-Level features`

1. Word-Level features are computed for each middle word. Word-Level features are: 
            
            - is Digit
            - is Text
            - is AlphaNumeric
            - is NER(email)
            - is NER(url)
            - is NER(phone)
            - has Punct marks
            - is All Punct marks
            - is Dictionary Word
            - is 1st letter Capital Dictionary Word
            - is All Capital Dictionary Word
            - is Non Dictionary Word
            - is 1st letter Capital Non Dictionary Word
            - is All Capital Non Dictionary Word
            
            
2. All features give BOOL values (1 or 0). For each middle word in an n-gram, these BOOL values are concatenated to create a feature set.


3. For e.g. for each n-gram these features are calculated:

        - (_,  w1, w2)  = CalWordLevelFeatures(w1) =  [1,0,1,0, ...., 1]
        - (w1, w2, w3)  = CalWordLevelFeatures(w2) =  [0,0,0,0, ...., 0]
        - (w2, w3, w4)  = CalWordLevelFeatures(w3) =  [1,1,1,1, ...., 1]
          ...
        - (wn-1, wn, _)  = CalWordLevelFeatures(wn) =  [0,1,1,1, ...., 0]
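
The sketch below shows how such a 0/1 feature vector could be computed for a single word. It covers only a subset of the features listed above, and `TOY_DICTIONARY`, the two regexes and `word_level_features` are illustrative assumptions rather than the notebook's actual implementation.

```python
import re
import string

# Toy stand-in for an English dictionary (assumption for this example only)
TOY_DICTIONARY = {"street", "road", "invoice", "main"}

EMAIL_RE = re.compile(r"^[\w.-]+@[\w.-]+$")
URL_RE = re.compile(r"^(?:https?://|ftp://|www\.)\S+$", re.IGNORECASE)


def word_level_features(word):
    """Return a list of 0/1 flags describing a single token."""
    is_dict = word.lower() in TOY_DICTIONARY
    has_alpha = any(c.isalpha() for c in word)
    has_digit = any(c.isdigit() for c in word)
    flags = [
        word.isdigit(),                                             # is Digit
        word.isalpha(),                                             # is Text
        word.isalnum() and has_alpha and has_digit,                 # is AlphaNumeric
        bool(EMAIL_RE.match(word)),                                 # is NER(email)
        bool(URL_RE.match(word)),                                   # is NER(url)
        any(c in string.punctuation for c in word),                 # contains punctuation
        bool(word) and all(c in string.punctuation for c in word),  # is all punctuation
        is_dict,                                                    # is Dictionary Word
        is_dict and word[:1].isupper(),                             # 1st letter capital dictionary word
        is_dict and word.isupper(),                                 # all capital dictionary word
        word.isalpha() and not is_dict,                             # is Non Dictionary Word
    ]
    return [int(flag) for flag in flags]


for token in ["Street", "62704", "a@b.com", "--", "B12"]:
    print(token, word_level_features(token))
```

Because every flag is a plain integer, the vectors of all middle words can be stacked directly into a feature matrix for the classifier.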

...

`Step 3: Labelling data`

1. Each n-gram is considered and using some static rules, these n-grams are labelled: Start(S), Middle(M), End(E) or Other(O)

2. The words of each n-gram (e.g. W1, W2, W3) are checked against the rules below:

            - RULE1: If W1 is a digit, and W2 or W3 contains text, mark the n-gram as S
            - RULE2: If W3 contains a zipcode, mark the n-gram as E
            - RULE3: If W1 & W3 do not contain a zipcode and W2 contains a city-name or state-name, mark the n-gram as M
            - RULE4: If W1 and W3 both contain a digit, mark the n-gram as O
            - RULE5: If W3 contains a NER(phone) and W1 contains a zipcode, mark the n-gram as E
            - RULE6: If W3 contains a NER(email) and W1 contains a zipcode, mark the n-gram as E
            ... likewise
            
            
3. This labelled data is passed to SpaCy's GoldParse to create BILUO tags: Beginning of a chunk(B-tag), Inside of a chunk(I-tag), Last of a chunk(L-tag), Unit chunk(U-tag), Outside of a chunk(O-tag). Finally all n-grams are labelled as BILUO tags.
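
As an illustration of the static rules, the sketch below applies RULE1 to RULE4 to a single trigram. The order in which the rules fire is not specified above, so the precedence used here is an assumption, as are the US-style ZIP regex, the tiny `TOY_PLACES` gazetteer and the helper names.

```python
import re

ZIP_RE = re.compile(r"^\d{5}(?:-\d{4})?$")   # assumption: US-style ZIP codes only
TOY_PLACES = {"springfield", "illinois"}      # hypothetical city/state gazetteer


def is_zip(word):
    return bool(ZIP_RE.match(word))


def has_text(word):
    return any(c.isalpha() for c in word)


def has_digit(word):
    return any(c.isdigit() for c in word)


def label_trigram(w1, w2, w3):
    """Return S, M, E or O for one trigram (assumed rule precedence: 4, 2, 1, 3)."""
    if has_digit(w1) and has_digit(w3):                        # RULE4
        return "O"
    if is_zip(w3):                                             # RULE2
        return "E"
    if w1.isdigit() and (has_text(w2) or has_text(w3)):        # RULE1
        return "S"
    if not is_zip(w1) and not is_zip(w3) and w2.lower() in TOY_PLACES:  # RULE3
        return "M"
    return "O"


print(label_trigram("12", "Main", "Street"))
print(label_trigram("Street", "Springfield", "Illinois"))
print(label_trigram("Springfield", "Illinois", "62704"))
```

Changing the precedence changes the labels for trigrams that satisfy several rules at once, so the order should be fixed before any data is labelled.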

...

`Step 4: Construct a Decision Tree`

1. Each labelled n-gram along with its feature set is passed to a DT model.


2. Intensive(heavy) pruning of the created DT model is carried out.


3. Overfitting is allowed (relaxed state).

...

`Step 5: Pass the predicted values to a SME`

1. For new data, n-grams are generated. The feature set of each new n-gram is passed to the fitted, pruned DT model, which generates the end-labels (B-tag, I-tag, .., O-tag). These values are converted back to the initial labels (S, M, E and O), which are then passed to a final SME. 


2. Final set of n-grams for new data looks like:

        - Ngram1 = S
        - Ngram2 = M
        - Ngram3 = O
        - Ngram4 = M
        - Ngram5 = O
        - Ngram6 = E
        ...
        - NgramN = S
        

3. A *potential AB* is built from the labelled n-grams above: an 'S' marks the beginning of a *potential AB*, and all following labels are stored until an 'E' is encountered, which marks the end of that *potential AB*.



4. All *potential ABs* are created likewise. An AB may contain as few as 3 n-grams (S, M, E) or many more, such as (S, O, O, O, O, E) or (S, M, O, O, E). The newly created AB bunches look like:

        - Ngram1  = S  :>
        - Ngram2  = M  :>  Potential AB 1  (Pattern = SME)
        - Ngram3  = E  :> 
        - Ngram4  = O
        - Ngram5  = O
        - Ngram6  = M
        - Ngram7  = M
        - Ngram8  = O
        - Ngram9  = S  :>
        - Ngram10 = O  :>
        - Ngram11 = M  :>  Potential AB 2  (Pattern = SOMOE)
        - Ngram12 = O  :>
        - Ngram13 = E  :>
        - Ngram14 = O
        - Ngram15 = S  :>
        - Ngram16 = O  :>  Potential AB 3  (Pattern = SOE)
        - Ngram17 = E  :>


5. All patterns which have at least 1 "M" label are subjected to a pattern-check following the rules below:
        
        - "S" is followed by either "O" or "M"  AND  not followed by an "E"
        - "M" is followed by either "O" or "E"  AND  not followed by an "S"
        - "E" is followed by an "O"             AND  not followed by either "M" or "S"
        ... likewise
        
        
6. All potential AB blocks which follow a correct pattern are considered the ***final Address Block (AB)***.
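
The block-building and pattern-check steps can be sketched as follows, using the 17 labels from the example in item 4. Treating patterns without an "M" as invalid and checking only the adjacent pairs "SE" and "MS" are simplifying assumptions, and both function names are hypothetical.

```python
def potential_blocks(labels):
    """Return (start, end, pattern) for every run that opens with S and closes with E."""
    blocks, start = [], None
    for i, label in enumerate(labels):
        if start is None:
            if label == "S":
                start = i                      # an S opens a potential AB
        elif label == "E":
            blocks.append((start, i, "".join(labels[start:i + 1])))
            start = None                       # an E closes it
    return blocks


def is_valid_pattern(pattern):
    """Require at least one M, and forbid S directly before E and M directly before S."""
    return "M" in pattern and "SE" not in pattern and "MS" not in pattern


labels = list("SMEOOMMOSOMOEOSOE")
for start, end, pattern in potential_blocks(labels):
    print(start, end, pattern, is_valid_pattern(pattern))
```

In this sketch an "S" that appears inside an already open block is simply kept in the pattern, and a block that is still open at the end of the sequence is discarded.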

.....................................................................................................................................................................................................................................................

### Approach 3
Using lexNLP, libpostal and a scoring decision matrix

- All lines in an XML file are passed to a 3rd-party library. All Address Blocks (AB) are then found using a combination of rules and a scoring decision matrix.

<ins>STEPS:</ins>


1. INPUT = All `LINES` in the 1st page of the XML file are considered and stored in a separate dataframe i.e. df_complete. (As opposed to another dataframe which reads only non-table blocks from the 1st page of XML file and was used for SN extraction process).


2. All `LINES` are passed to *libpostal* and their `POSTAL-TAG`s are found.


3. Using some *Rules*, POSTAL-TAG are converted to `LABEL-TAG`.


4. Using some *Rules*, LABEL-TAG are converted to `LABEL`.


5. Using some *Rules*, a `potential-AB` is discovered.


6. For all potential-AB, LABEL are collected together to create a Pattern.


7. This pattern is passed to a pattern-checker module to identify all actual `AB`.
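
A toy walk-through of steps 3 to 7, assuming the POSTAL-TAGs are already available. The hand-written tag lists below stand in for real libpostal output, the tag groups are copied from `get_postal_labels` in this notebook, and the strict regex comes from the `S+O*M+O*E` comment in that function; the notebook's own `checkPattern` applies a looser ordering check. `block_pattern` and `is_address_block` are hypothetical names.

```python
import re

# Tag groups as defined in get_postal_labels
S_LABELS = ['house_number', 'road', 'unit', 'level', 'po_box']
M_LABELS = ['suburb', 'city_district', 'city', 'island', 'state_district',
            'state', 'country_region', 'country', 'world_region']
E_LABELS = ['postcode']

LABEL_OF = {tag: 'S' for tag in S_LABELS}
LABEL_OF.update({tag: 'M' for tag in M_LABELS})
LABEL_OF.update({tag: 'E' for tag in E_LABELS})

AB_PATTERN = re.compile(r"S+O*M+O*E")


def block_pattern(tags_per_line):
    """Map every POSTAL-TAG to S, M, E or O and join the labels into one pattern."""
    return "".join(LABEL_OF.get(tag, "O") for tags in tags_per_line for tag in tags)


def is_address_block(tags_per_line):
    """Accept the candidate only if its whole pattern matches S+O*M+O*E."""
    return AB_PATTERN.fullmatch(block_pattern(tags_per_line)) is not None


# Hand-written tags for three consecutive lines (illustrative only)
candidate = [["house_number", "road"], ["city", "state"], ["postcode"]]
print(block_pattern(candidate), is_address_block(candidate))
```

A candidate with a street line and a postcode but no city, state or country collapses to `SE` and is rejected, which mirrors the requirement that an address block contains at least one M label.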

In [34]:
"""
----------------------------------------
ADDRESS BLOCK LABELLING RULES
----------------------------------------
Start = HAS a NUMBER (DIGIT), AND following one word from (street number, house number, road number, block number)

MIDDLE = between fields like highway, floor, unit, level, building, compound, etc

END = HAS a ZIPCODE AND preceding one pair from (city AND state | city AND country| state AND country)

OTHERS = house
"""

print("...Rules for AB")

...Rules for AB


### Labelling Conventions

- **POSTAL-TAG**


    - generated using libpostal
    - Examples: house_number, unit, level, po_box, suburb, city_district, city, island, state_district, state, 
                country_region, country,  world_region, postcode
. 
- **LABEL-TAG**


    - generated using rules
    - Examples:
      S_LABELS: Marks the beginning of an AB
      M_LABELS: Marks the middle of an AB
      E_LABELS: Marks the ending of an AB
.

- **LABEL**


    - generated using rules
    - Examples:
      S house_number: usually refers to the external (street-facing) building number.
      S unit: an apartment, unit, office, lot, or other secondary unit designator
      S level: expressions indicating a floor number e.g. "3rd Floor", "Ground Floor", etc.
      M po_box: post office box number
      M suburb: an unofficial neighborhood name like "Harlem", "South Bronx", or "Crown Heights"
      M city_district: districts within a city e.g. "Brooklyn" or "Hackney" or "Bratislava IV"
      M city: settlement including cities, towns, villages, hamlets, localities, etc.
      M island: named islands e.g. "Maui"
      M state_district: usually a second-level administrative division or county.
      M state: a first-level administrative division.
      M country_region: informal subdivision of a country without any political status
      M country: sovereign nations and their dependent territories, anything with an ISO-3166 code.
      M world_region: a pattern frequently used in the English-speaking Caribbean e.g. “Jamaica, West Indies”
      E postcode: postal codes used for mail sorting    
.


In [35]:
##########################################################################
# Module 1: Extracts POSTAL-TAG for each line in a file.
##########################################################################
# INPUT  ::  df.LINES
# OUTPUT ::  returns a list of "tagged_lines" and "tags"

def get_postal_lines(df_lines):
    list_tagged_lines, list_tags = [], []
    for line in df_lines.values:
        tagged_line = crf_NER(line)
        tags = list(map(lambda x: x[1], tagged_line))
        list_tagged_lines.append(tagged_line)
        list_tags.append(tags)
    return list_tagged_lines, list_tags

In [36]:
#########################################################################################################################
# Module 2: Decides which bunch of lines form an AddressBlock. Creates a final-tag "AB" or "O" for a bunch of lines
#########################################################################################################################
# INPUT  ::  Accepts a column having POSTAL-TAG(house, road,..,city) for all lines in file
# OUTPUT ::  Returns O or AB predictions for each line in file

def get_postal_labels(df_TAGS):
    
    # DEFINE LABELS
    S_LABELS = ['house_number', 'road', 'unit', 'level', 'po_box']
    M_LABELS = ['suburb', 'city_district', 'city', 'island', 'state_district', 'state', 'country_region', 'country', 'world_region']
    E_LABELS = ['postcode']
    E1_LABELS = ['phonetag', 'urltag', 'emailtag']
    
    # 1. Replacing POSTAL-TAG with corresponding LABEL using the above MAPPER.
    # Returns corresponding LABEL per line (e.g. L1 = S, L2 = M, L3 = E, etc)
    def get_label_perLine(line):
        label = []
        for i in line:
            if i in S_LABELS:
                label.append('S')
            elif i in M_LABELS:
                label.append('M')
            elif i in E_LABELS:
                label.append('E')
            elif i in E1_LABELS:
                label.append('E1')
            else:
                label.append('O')
        return label
    df = pd.DataFrame({'TAGS': df_TAGS})
    df['LABELS'] = df.TAGS.apply(get_label_perLine)
    
 
    # This module runs after each TAGGED_LINE has been mapped with corresponding "S", "M", "E" labels.
    # Executes some rules to find address-boundaries!
    # Pattern - S+O*M+O*E

    
    # 2. Finding LABEL(S and E) for bunch of tagged lines.
    # Returns a 2D list containing S label(Row_num, Col_num) & E label(Row_num, Col_num)
    found_index, store_found_index = [], []
    finding_value = 'S'
    for row_index, row in enumerate(df.LABELS.tolist()): 
        for column_index, col in enumerate(row):
            # if a row contains only 'O' labels OR two consecutive lines contain only 'O'  ---> break
            if list(set(row)) == ['O'] or list(set(sum(df.LABELS.iloc[row_index:row_index+2].tolist(), []))) == ['O']:
                finding_value = 'S'
                found_index = []
                break
            if col == finding_value and finding_value == 'S':
                found_index.append((row_index, column_index))
                finding_value = 'E'
                continue
            if col == finding_value and len(found_index) == 0:
                continue
            if col == finding_value and len(found_index) != 0:
                found_index.append((row_index, column_index))
                finding_value = 'S'
                store_found_index.append(found_index)
                found_index = []
    
    # 3. Finding LABEL-PATTERN("S+O*M+O*E") for bunch of tagged lines.
    # Returns a series of LABEL-PATTERN per line (e.g. L1 = [], L2 = [SOOMOOE], L3 = [SMESS], etc)
    list_final_labels = ['O']*df.LABELS.shape[0]
    for index_tuple in store_found_index:
        start_row, end_row = index_tuple[0][0], index_tuple[1][0]
        start_col, end_col = index_tuple[0][1], index_tuple[1][1]
        i = 0
        values = []
        label_list = df.LABELS.tolist()[start_row: end_row+1]
        for lst in label_list:
            if i == 0:
                slicer_start, slicer_end = start_col, len(lst)
            elif i == len(label_list)-1:
                slicer_start, slicer_end = 0, end_col 
            else:
                slicer_start, slicer_end = 0, len(lst)
            values.append(lst[slicer_start: slicer_end+1])
            i+=1
        for i in range(start_row, end_row+1):
            list_final_labels[i] = "".join(sum(values, []))
    df['POSTAL_LABEL'] = list_final_labels
    
    # 4. Checks if the found LABEL-PATTERN qualifies to be "AB" or "O".
    # Returns a series of final tag "AB" or "O" per line (e.g. L1 = O, L2 = AB, L3 = AB, etc)
    def checkPattern(string, pattern): 
        if 'M' not in string:
            return False
        l = len(pattern) 
        if len(string) < l: 
            return False
        for i in range(l - 1): 
            x = pattern[i] 
            y = pattern[i + 1] 
            # str.rfind/find return -1 when absent (rindex/index would raise ValueError)
            last = string.rfind(x)
            first = string.find(y)
            if last == -1 or first == -1 or last > first:
                return False
        return True
    # Check for the pattern, if M is not followed by 'S' likewise...
    final_label = []
    for tags in df.POSTAL_LABEL.values:
        label = checkPattern(tags, pattern="SME")
        if label == True:
            final_label.append('AB')
        else:
            final_label.append('O')
    df['FINAL_LABEL'] = final_label
    return df.LABELS, df.POSTAL_LABEL, df.FINAL_LABEL

In [37]:
#####################################################################################################################
# Module 3: Accepts a df and returns various postal columns. Creates final tag "AB" or "O" for each line in a df
#####################################################################################################################
# INPUT  ::  Accepts a whole dataframe
# OUTPUT ::  Returns 5 pd.series with postal information

def get_postal(df):
    
    # Step 1: Create POSTAL-TAG for each line
    df['POSTAL_taggedLines'], df['POSTAL_tags'] = get_postal_lines(df['LINES'])
    
    # Step 2: Create final tag("AB" or "O") for each line
    df['POSTAL_labels'], df['POSTAL_labels_AB'], df['POSTAL_AB'] = get_postal_labels(df['POSTAL_tags'])
    
    return df.POSTAL_taggedLines, df.POSTAL_tags, df.POSTAL_labels, df.POSTAL_labels_AB, df.POSTAL_AB

In [38]:
##################################################################################################################
# Module 4: Runs above 3 modules and combines them to create final "AB CHUNK" for each file
##################################################################################################################
# INPUT  ::  Accepts the whole dataframe(viz. "without" postal columns)
# OUTPUT ::  List of AB chunks for each file

def get_AB_file(df):
    
    AB_file = {}
    for index, f in enumerate(df.FILENAME.unique()):
        
        # Create df
        df_AB = df[df.FILENAME == f].copy()
        print("Extracting AB for File: #{} - {}".format(index, f))
        
        # Create feature set
        df_AB['POSTAL_taggedLines'], df_AB['POSTAL_tags'], df_AB['POSTAL_labels'], \
        df_AB['POSTAL_labels_AB'], df_AB['POSTAL_AB'] = get_postal(df_AB)
        
        # Extract AB chunks
        AB_blocks, AB_lines = [], []
        start, end = 0, 0
        for line, postal in zip(df_AB.LINES, df_AB.POSTAL_AB):
            if postal == 'AB':
                start = 1
                AB_lines.append(line)
            if postal != 'AB':
                end = 1
                if start == 1 and end == 1 and len(AB_lines) > 0:
                    AB_blocks.append(" ".join(AB_lines))
                    AB_lines, start, end = [], 0, 0
        
        # Store extracted AB Chunks
        AB_file[str(f)] = list(map(lambda x: [x], AB_blocks))
        
    # Storing AB Chunks in a df
    k,v = zip(*AB_file.items())
    df_AB_chunks = pd.DataFrame({'FILENAME': k, 'AddressBlocks': v})
    df_AB_chunks = df_AB_chunks[df_AB_chunks.AddressBlocks.apply(lambda x: len(x)) > 0].reset_index(drop=True)
    return df_AB_chunks

### AddressBlock Extraction Module


`df`          - Used for model prediction (i.e. contains only non-table blocks on the 1st page)

`df_complete` - Used for AB extraction (i.e. contains all blocks, table + non-table, on the 1st page)

#### 1. AB Extraction for `df` 

(i.e. for creating feature set used later in *SupplierName Extraction* module)

In [None]:
# for whole df
df['POSTAL_taggedLines'], df['POSTAL_tags'], df['POSTAL_labels'], df['POSTAL_labels_AB'], df['POSTAL_AB'] = get_postal(df)

In [None]:
df[['LINES', 'POSTAL_tags', 'POSTAL_labels', 'POSTAL_labels_AB', 'POSTAL_AB']].head()

#### 2. AB Extraction for `df_complete` 

(i.e. creating feature set for each file and then extracting final AB CHUNKS for output report)
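
The chunking step can be pictured as grouping runs of consecutive `AB` lines into one string per run. A minimal, self-contained sketch under that assumption: `group_ab_blocks` and the sample lines are hypothetical and not part of the notebook, and `itertools.groupby` stands in for the explicit loop used in `get_AB_file`.

```python
from itertools import groupby


def group_ab_blocks(lines, tags):
    """Join each run of consecutive 'AB'-tagged lines into one block string."""
    blocks = []
    # groupby yields one group per run of equal keys (True for 'AB', False otherwise)
    for is_ab, run in groupby(zip(lines, tags), key=lambda pair: pair[1] == 'AB'):
        if is_ab:
            blocks.append(" ".join(line for line, _ in run))
    return blocks


# Hypothetical lines and POSTAL_AB tags for a single file
lines = ["ACME Corp", "12 Main Street", "Springfield IL 62701",
         "Invoice No 42", "PO Box 9", "Chicago IL"]
tags = ["O", "AB", "AB", "O", "AB", "AB"]
print(group_ab_blocks(lines, tags))
```

Because every run is emitted when it ends, a block that extends to the last line of the file is kept as well.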

In [None]:
# Extract Address Block Chunks for each file
df_AB_chunks = get_AB_file(df_complete)

In [43]:
# # Save the df in a report...
# df_AB_chunks.to_excel('AmeriHealth_AB_8000Files.xlsx')
# df_AB_chunks.head()

---

## <ins>9. Labelling Data Module </ins>

- A line is labelled 1 if it contains a Supplier Name (as found in the Mapper File), otherwise 0.
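
The labelling rule can be sketched in a few lines. This is an illustration only, with stated assumptions: `normalise` and `label_line` are hypothetical helpers, and `difflib.SequenceMatcher` from the standard library is swapped in for FuzzyWuzzy, so the similarity values will not be identical to `fuzz.ratio`.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case, turn non-alphanumeric characters (except '&') into spaces, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^A-Za-z0-9&]+", " ", text)).strip().lower()


def label_line(supplier_name, line, threshold=95):
    """Return 1 if the line contains (or is nearly identical to) the supplier name, else 0."""
    sn, ln = normalise(supplier_name), normalise(line)
    if not sn:
        return 0  # an empty name would otherwise match every line
    # Whole-word match; re.escape makes the name a literal pattern
    if re.search(r"\b" + re.escape(sn) + r"\b", ln):
        return 1
    # Near-identical line, similarity on a 0-100 scale (stand-in for fuzz.ratio)
    similarity = 100 * SequenceMatcher(None, sn, ln).ratio()
    return int(similarity >= threshold)


print(label_line("ACME Corp.", "Remit to: ACME CORP, 12 Main St"))
print(label_line("ACME Corp.", "Invoice date 04/04/2020"))
```

The first call should give 1 (the name appears as whole words in the line) and the second 0.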

In [44]:
%%time 

s, ts = 0, 0
orgsn = []
for sn in list(df.SUPPLIER_NAME.unique()):
    # English Dictionary words
    ts+=len(word_tokenize(sn))
    s+=len([w for w in word_tokenize(sn) if lemma.lemmatize(w.lower()).strip() in ENG_words])
    
    # NER Org tags: 1 only if spaCy tags at least one entity as 'ORG'
    x = int(any(ent.label_ == 'ORG' for ent in nlp(sn).ents))
    orgsn.append(x)
    
print("% of English Dictionary words in Supplier Names = ", s*100.0/ts)
print("% of Spacy NER TAG \'ORG\' in Supplier Names = ", sum(orgsn)*100.0/len(list(df.SUPPLIER_NAME.unique())))


In [45]:
##########################################################################
# Labels df w.r.t computed similarity (b/w line & SN) from mapper file
##########################################################################

def label_data(df):
    
    def clean(sn, line):
        """
        Regex cleaning of SN and line.
        :param: string sn: Supplier Name; string line: A text line
        :return: Returns cleaned SN, cleaned line.
        """
        # "A-Za-z" is used on purpose: the range "A-z" would also keep the punctuation [ \ ] ^ _ `
        sn = re.sub(r"\s+", " ", re.sub(r"[^A-Za-z0-9!@#$%&?]+", " ", sn))
        line = re.sub(r"[\n\t\r]+", " ", line)
        line = re.sub(r"\s+", " ", re.sub(r"[^A-Za-z0-9!@#$%&?]+", " ", line))
        line = re.sub(r"\s+", " ", re.sub(r"[\`\~\^\*\(\)\[\]\{\}\:\;\'\"\,\<\.\>\/\\\|\-\_\=\+]+", " ", line, flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE)
        return sn.lower(), line.lower()
    
    def score(sn, line):
        """
        Labels a line based on (1) regex match, (2) whole-word regex match, (3) FuzzyWuzzy ratio >= 95 between (SN, line).
        :param: string sn: Supplier Name; string line: A text line
        :return: Returns 1 or 0
        """
        # An empty SN would match every line
        if not sn.strip():
            return 0
        # SN is matched literally: escape regex metacharacters such as "$" and "?"
        sn_pattern = re.escape(sn)

        # 1. Regex match
        iter_score = 1 if re.search(sn_pattern, line) else 0

        # 2. Whole-word regex match
        find_score = 1 if re.search(r"\b" + sn_pattern + r"\b", line, flags=re.IGNORECASE) else 0

        # 3. FuzzyWuzzy ratio
        fuzzy_score = 1 if fuzz.ratio(sn, line) >= 95 else 0

        # Returns "1" if any of the above is 1 else a "0"
        return max(iter_score, find_score, fuzzy_score)

    def exact_score(sn, line):
        """
        Finds 'exact' label for a line.
        :param: string sn: Supplier Name; string line: A text line
        :return: Returns 1 or 0
        """
        exact_sn = re.sub(r"\s+", " ", re.sub(r"[^A-z\s\&]+", " ", sn.strip(), flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE).strip().lower()
        exact_line = re.sub(r"\s+", " ", re.sub(r"[^A-z\s\&]+", " ", line.strip(), flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE).strip().lower()
        if exact_line == exact_sn:
            exact_Label = 1
        else:
            exact_Label = 0
        return exact_Label
    
    
    # LABELLING
    def regex_match(x):
        # SN, Line
        sn = x.SUPPLIER_NAME
        line = x.LINES
        # 1. Regex cleaning
        clean_sn, clean_line = clean(sn, line)
        # 2. Normal labelling
        label = score(clean_sn, clean_line)
        # 3. Exact labelling
        exact_Label = exact_score(sn, line)
        return pd.Series([exact_Label, label])
    
    # Using only 'Y_SN' value in training...
    df[['Y_SN_EXACT', 'Y_SN']] = df.apply(regex_match, axis=1)
    return df

In [46]:
%%time
%%memit

df = label_data(df)


In [47]:
print("\nLabel Distribution ::")
df.Y_SN.value_counts()



----

## <ins>10. Feature Engineering Module </ins>

- Linguistic Features = Features derived from meta-information about the text, such as font size, coordinates, presence of dictionary words in a line, the context neighbouring a line, etc.


- Text Features = Text converted into numbers and used as a feature. A count vectorizer converts each line's text into per-token counts.

---

## `TYPE I. Linguistic Features`

**Features List ::**
        
1. XML feature: Coordinates | Font-size of a line.

2. Line-Level feature: Contains Digit | All Digit | Punct | Email | URL | Date | Phone in a line.

3. Contextual feature: Presence of an AB in neighbouring lines.

4. Position feature: Quadrant of the page in which a line is found.

5. Geographical Parser feature: Presence of NER(GPE) tags in a line.

#### `1. XML Feature`

#### `2. Line-Level Feature`

In [48]:
##########################################################################
# BOOL FEATURE :: Finds various line-level features in each line
##########################################################################

def f1_lineLevel(line):
    """
    Finds line-level features for each line.
    :param: df.LINES
    :return: Returns various features at line-level with BOOL values (1 or 0)
    """
    
    def is_CONTAINSDIGIT(w):
        if len(re.findall(r"\d+", w)) > 0:
            return 1
        else:
            return 0
    def is_CONTAINSALLDIGITS(w):
        if w.isdigit() or re.sub(r"\s+", "", re.sub("["+REGEX_PUNCTS+"]+", " ", re.sub(r"[^A-Za-z0-9]", " ", w))).strip().isdigit():
            return 1
        else:
            return 0
    def is_CONTAINSPUNCT(w):
        if len(re.findall("["+REGEX_PUNCTS+"]+", w)) > 0:
            return 1
        else:
            return 0
    def is_CONTAINSEMAIL(w):
        email_tags, _ = find_email(w)
        if len(email_tags) > 0:
            return 1
        else:
            return 0
    def is_CONTAINSURL(w):
        url_tags, _, _ = find_url(w)
        if len(url_tags) > 0:
            return 1
        else:
            return 0
    def is_CONTAINSMONEY(w):
        money_tags = find_money(w)
        if len(money_tags) > 0:
            return 1
        else:
            return 0
    def is_CONTAINSDATE(w):
        date_tags = find_date(w)
        if len(date_tags) > 0:
            return 1
        else:
            return 0
    def is_CONTAINSPHONE(w):
        phone_tags = find_phone(w)
        if len(phone_tags) > 0:
            return 1
        else:
            return 0
    
    # LINE-LEVEL FEATURES
    line_CONTAINSDIGIT = is_CONTAINSDIGIT(line)
    line_CONTAINSALLDIGITS = is_CONTAINSALLDIGITS(line)
    line_CONTAINSEMAIL = is_CONTAINSEMAIL(line)
    line_CONTAINSURL = is_CONTAINSURL(line)
    # OLD: line_CONTAINSMONEY = is_CONTAINSMONEY(line)
    line_CONTAINSDATE = is_CONTAINSDATE(line)
    line_CONTAINSPHONE = is_CONTAINSPHONE(line)
    return line_CONTAINSDIGIT, line_CONTAINSALLDIGITS, line_CONTAINSEMAIL, line_CONTAINSURL, line_CONTAINSDATE, \
            line_CONTAINSPHONE

In [49]:
##########################################################################
# BOOL FEATURE :: Finds if the line contains a legal/dynamic extension ABBRV
##########################################################################

def f3_checkAbbrv(s):
    """
    Finds if a line contains a Legal or Dynamic abbreviation
    :param: df.LINES
    :return: Returns 1 if it contains else a 0
    """
    finds = re.findall(r"(?=(\b" + '\\b|\\b'.join(list_abbrv_regex) + r"\b))", s.lower(), flags=re.IGNORECASE | re.MULTILINE)
    if len(finds) > 0:
        return 1
    else:
        return 0

In [50]:
##########################################################################
# PERCENTAGE FEATURE :: Finds all email address present in line
##########################################################################

def f1_checkEmail(s):
    """
    Finds the % of words in a line that are email addresses.
    :param: df.LINES
    :return: Returns n(emails) * 100 / n(total words)
    """
    if len(s) < LENGTH_OF_EMAIL or re.sub("\s+", "", re.sub(r"[^A-z0-9]", " ", s, flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE).isdigit():
        emails = []
    else:
        s = s.strip()
        s = re.sub(r"\s+", " ", re.sub("[^A-z\&\@\.\\\/\:\;]", " ", re.sub(r"\s+[\~\`\!\#\$\^\*\(\)\-\_\+\=\[\]\{\}\:\;\'\"\,\<\.\>\?\/\\\|]+\s+", " ", re.sub(r"[\n\r\t]+", " ", s, flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE).strip()
        emails = re.findall(r"[\w\.-]+@[\w\.-]+", s, flags=re.IGNORECASE | re.MULTILINE)
        emails = [eml for eml in emails if len(eml) >= LENGTH_OF_EMAIL]
    total_words = s.split(" ")
    per_emails_line = len(emails)*100.0/len(total_words)
    return per_emails_line

In [51]:
##########################################################################
# PERCENTAGE FEATURE :: Finds all URLs/Websites present in line
##########################################################################

def f1_checkURL(s):
    """
    Finds the % of words in a line that are URLs.
    :param: df.LINES
    :return: Returns n(urls) * 100 / n(total words)
    """
    if len(s) < LENGTH_OF_URL or re.sub("\s+", "", re.sub(r"[^A-z0-9]", " ", s, flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE).isdigit():
        URL = []
    else:
        s = s.strip()
        s = re.sub(r"\s+", " ", re.sub(r"[\w\.-]+@[\w\.-]+", " ", s, flags=re.IGNORECASE | re.MULTILINE).strip(), flags=re.IGNORECASE | re.MULTILINE)
        s = re.sub(r"\d+\.\d+", " ", s.strip(), flags=re.IGNORECASE | re.MULTILINE)
        URL = re.findall(r'(https://www.|http://www.|ftp://www.|http://|ftp://|https://|www.)+([A-z_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', s, flags=re.IGNORECASE | re.MULTILINE)
    total_words = s.split(" ")
    per_URL_line = len(URL)*100.0/len(total_words)
    return per_URL_line

In [52]:
##########################################################################
# PERCENTAGE FEATURE :: checks % of DictW, C1DictW, CDictW, NDictW, C1NDictW, CNDictW, Digits
##########################################################################

def f4_checkDictWord(s):
    """
    Finds % of dictionary and non-dictionary words in a line, split by capitalisation.
    :param: df.LINES
    :return: Returns % of DictW, C1DictW, CDictW, NDictW, C1NDictW, CNDictW (all 0 for a line without words)
    """
    # Keep letters and digits only ("A-Za-z", since the range "A-z" also spans [ \ ] ^ _ `)
    s = re.sub(r"\s+", " ", re.sub(r"[^A-Za-z0-9]+", " ", s))
    words = word_tokenize(s)
    # A line made only of punctuation has no words: avoid dividing by zero
    if len(words) == 0:
        return 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
    def percentage(no):
        return no*100.0/len(words)
    DictWords = [w for w in words if lemma.lemmatize(w).lower().strip() in ENG_words]
    dict_set = set(DictWords)  # built once instead of once per word
    Cap1DictWords = [w for w in DictWords if w[0].isupper()]
    CapDictWords = [w for w in DictWords if w.isupper()]
    NonDictWords = [w for w in words if w not in dict_set]
    Cap1NonDictWords = [w for w in NonDictWords if w[0].isupper()]
    CapNonDictWords = [w for w in NonDictWords if w.isupper()]
    return percentage(len(DictWords)), percentage(len(Cap1DictWords)), percentage(len(CapDictWords)), percentage(len(NonDictWords)), percentage(len(Cap1NonDictWords)), percentage(len(CapNonDictWords))

#### `3. Contextual Feature`

In [53]:
##########################################################################
# PERCENTAGE & BOOL FEATURE :: Finds features related to neighbouring lines
##########################################################################

def f5_checkNeighbourAddress(df, enable_truecasing=True):
    """
    Extracts various features using neighbouring lines(context) for a particular line.
    :param: dataframe df
    :return: 
    list list_is_spacy_NER_line: Bool value if line contains a NER(ORG/FAC) tag.
    list list_per_spacy_NER_address: % value of NER(GPE) tag in neighbouring lines.
    list list_per_crf_NER_address: % value of libpostal(Location related) tags in neighbouring lines.
    list list_postal_ab_score: A %score if the line could be a Supplier Name line based on Scoring Decision Matrix. 
    """
    
    def stanfordNLP_truecasing(text):
        if len(text.strip()) > 0:
            doc = stf_nlp(text)
            text = " ".join([w.text.capitalize() if w.upos in ["PROPN","NNS"] else w.text for sent in doc.sentences for w in sent.words])
            return text
        else:
            return text

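    # iloc slice ends are exclusive: [line_num+1 : line_num+K] covers K-1 lines below L
    LINE_CONTEXT_LOOKUP_1 = 7  # CONTEXT #1: 6 lines below L
    LINE_CONTEXT_LOOKUP_2 = 4  # CONTEXT #2: 3 lines below L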
    
    list_is_spacy_NER_line, list_per_spacy_NER_address, list_per_crf_NER_address, list_postal_ab_score = [], [], [], []
    for line_num in range(len(df.LINES)):
        
        '''Line is the present line L'''
        # LINE FEATURE
        line = str(df.LINES.iloc[line_num:line_num+1].values[0])
        original_line = line
        line_tag = str(df.POSTAL_AB.iloc[line_num:line_num+1].values[0])
        doc_line = nlp(line)
        
        # bool spacy NER "ORG/FAC" tag in line
        spacy_NER_line_tag = ['ORG', 'FAC']
        spacy_NER_line = set([ent.label_ for ent in doc_line.ents if ent.label_ in spacy_NER_line_tag])
        is_spacy_NER_line = int(len(spacy_NER_line) > 0)
        list_is_spacy_NER_line.append(is_spacy_NER_line)
        
        '''Address is generally found below SN, context could be of varying size'''
        # CONTEXT FEATURE: 
        
        # CONTEXT #1 - context is 6 lines below L 
        address_1 = " ".join(df.LINES.iloc[line_num+1: line_num+LINE_CONTEXT_LOOKUP_1].tolist())
        original_address_1 = " .  ".join(df.LINES.iloc[line_num+1: line_num+LINE_CONTEXT_LOOKUP_1].tolist())
        address_1 = re.sub("\s+", " ", re.sub(r"\^+", " ", re.sub(r"[^A-z\!\@\&\(\)\,\.\?]+", " ", address_1.strip(), flags=re.IGNORECASE | re.MULTILINE))).strip()
        address_1 = re.sub(r"po.box|po.box.|po. box|po. box.|p.o.box|p.o.box.|p.o box|p.o box.|p.o. box|p.o. box.", "Chicago", address_1.strip(), flags=re.IGNORECASE | re.MULTILINE).strip()
        
        if enable_truecasing == True:
            # No preprocessing only truecasing needed!
            address_1 = stanfordNLP_truecasing(address_1)
        doc_address = nlp(address_1)
            
        # Total words: use the context length, fall back to the current line when there is
        # no context (last line), and never go below 1 so the percentages cannot divide by zero
        total_words = max(len(doc_address) or len(doc_line), 1)
        
        # F1: % of spacy location tags in context
        spacy_NER_address_tag =  ['FAC', 'GPE', 'LOC', 'ORDINAL']
        spacy_NER_address = [ent.label_ for ent in doc_address.ents if ent.label_ in spacy_NER_address_tag]
        per_spacy_NER_address = len(spacy_NER_address)*100.0/total_words
        list_per_spacy_NER_address.append(per_spacy_NER_address)

        # F2: % of CRF_NER location tags in context
        crf_NER_address_tag = ['house_number', 'road', 'level', 'po_box', 'postcode', 'city_district', 'city', 'state_district', 'state', 'country_region', 'country']
        crf_NER_address = [ent[1] for ent in crf_NER(doc_address) if ent[1] in crf_NER_address_tag]
        per_crf_NER_address = len(crf_NER_address)*100.0/total_words
        list_per_crf_NER_address.append(per_crf_NER_address)
        
        # CONTEXT #2 - context is 3 lines below L 
        address_2 = df.LINES.iloc[line_num+1: line_num+LINE_CONTEXT_LOOKUP_2]
        address_tags = df.POSTAL_AB.iloc[line_num+1: line_num+LINE_CONTEXT_LOOKUP_2].tolist()
                
        def check_phoneInAddress():
            bool_phone_address = 0
            found = find_phone(original_address_1)
            if len(found) > 0:
                bool_phone_address = 1
            return bool_phone_address
            
        def check_emailInAddress():
            bool_domain_address = 0
            _, found_email_domains = find_email(original_address_1)
            _, _, found_url_domains = find_url(original_address_1)
            domain = found_email_domains + found_url_domains
            if len(domain) > 0:
                # Fuzzy match (Line L, domains)
                domain_match_fuzz = any(fuzz.ratio(x, original_line.lower()) >= 50 for x in domain)
                # Literal match of a domain inside the letters-only, whitespace-free version of L
                line_letters = re.sub(r"\s+", "", re.sub(r"[^A-Za-z]+", " ", original_line)).lower()
                domain_match_re = any(x in line_letters for x in domain)
                if domain_match_fuzz or domain_match_re:
                    bool_domain_address = 1
            return bool_domain_address
             
        # Scoring Mechanism for Address Blocks
        '''
                   1      2       3       4
        L          O      AB     AB       O
        context    O      O      AB       AB
        Phone      -      -       -       -
        Email      -      -       -       -
        -----------------------------------------
        Score  =   0     25%     25-50%    75-100%
            
        For Case 3: (1). 50   - if L is the 1st/2nd line in 'AB' and has 3 more 'AB' in context
                    (2). 41.6 - if L is the 3rd/4th line in 'AB' and has 2 more 'AB' in context
                    (3). 33.3 - if L is the 5th line in 'AB' and has 1 more 'AB' in context
                    (4). 25   - if L is the last line in 'AB' and has no 'AB' left in context [is same as Case 2]
                    
        For Case 4: (1). 100  - if an email domain matches with L
                    (2). 100  - if L is the line just above 'AB' and has a phone number in context.
                    (3). 85   - if L is the line just above 'AB'
                    (4). 75   - if L is two lines above 'AB'
        '''
    
        score = 0
        # 1. Score = 0%
        if 'AB' not in address_tags and line_tag != 'AB':
            score = 0
        # 2. Score = 25%
        elif 'AB' not in address_tags and line_tag == 'AB':
            score = 25
        # 3. Score = 25% - 50%
        elif 'AB' in address_tags and line_tag == 'AB':
            # cases 3.1, 3.2, 3.3, 3.4
            score = 25 + 25*(int(address_tags.count('AB'))/3.0)
        # 4. Score = 75% - 100%
        elif 'AB' in address_tags and line_tag != 'AB':
            value_phoneInAddress = check_phoneInAddress()
            value_emailInAddress = check_emailInAddress()
            if value_emailInAddress != 0:
                # case 4.1
                score = 100
            elif address_tags[0] == 'AB':
                # case 4.2
                if value_phoneInAddress != 0:
                    score = 100
                else:
                    # case 4.3
                    score = 85
            else:
                # case 4.4
                score = 75
        # 5. Unknown
        else:
            score = 0
        
        # append score to global list
        list_postal_ab_score.append(score)
        
    return list_is_spacy_NER_line, list_per_spacy_NER_address, list_per_crf_NER_address, list_postal_ab_score
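
The Scoring Decision Matrix can also be read as a small pure function. The sketch below is a hedged restatement of the branches above, assuming a context of three lines below L; `postal_ab_score` and its boolean arguments are hypothetical names that replace the `check_phoneInAddress` / `check_emailInAddress` helpers.

```python
def postal_ab_score(line_tag, context_tags, phone_in_context=False, domain_matches_line=False):
    """Score a line L from its own tag and the tags of the lines below it."""
    n_ab = context_tags.count('AB')
    if line_tag == 'AB':
        # Cases 2 and 3: 25 when no 'AB' follows, rising to 50 with three 'AB' lines below
        return 25 + 25 * (n_ab / 3.0)
    if n_ab == 0:
        return 0                                   # Case 1: no address block nearby
    if domain_matches_line:
        return 100                                 # Case 4.1: email/URL domain matches L
    if context_tags[0] == 'AB':
        return 100 if phone_in_context else 85     # Cases 4.2 / 4.3: L is just above 'AB'
    return 75                                      # Case 4.4: 'AB' starts further below


print(postal_ab_score('O', ['AB', 'AB', 'O']))
print(postal_ab_score('AB', ['AB', 'AB', 'AB']))
```

Lines sitting directly above an address block get the highest scores, which matches the observation that the address is generally found below the Supplier Name.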

#### `4. Position Feature`

In [54]:
##########################################################################
# NUMERICAL FEATURE :: Finds quadrant (1 to 6) in which line is present
##########################################################################

def f6_checkLineQuadrant(x):
    """
    Finds the quadrant for every line in df.
    :param: Meta-data information (Coordinates, Page-Dimensions) for df.LINES
    :return: Quadrant for each line in df.
    """
    
    # NOT USED
    def Overlap(l1,t1,r1,b1, l2,t2,r2,b2):
        def Rectangel_overlapp(l1,t1,r1,b1, l2,t2,r2,b2): 
            # If one rectangle is on left side of other 
            if(l1 >= r2 or l2 >= r1): 
                return False
            # If one rectangle is above other 
            if(t1 <= b2 or t2 <= b1): 
                return False
            return True
        if(Rectangel_overlapp(l1,t1,r1,b1, l2,t2,r2,b2)): 
            # Rectangles overlap
            return 1
        else: 
            # Rectangles Don't Overlap
            return 0
    
    def Overlap_area(l1,t1,r1,b1, page_width, page_height):
        # Calculates intersection area between two rectangles(LINE's bounding box & Quadrant Bounding box)
        def area_overlap(l1,t1,r1,b1, l2,t2,r2,b2):
            a = Rectangle(min(l1,r1), min(t1,b1), max(l1,r1), max(t1,b1))
            b = Rectangle(min(l2,r2), min(t2,b2), max(l2,r2), max(t2,b2))
            dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
            dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
            if (dx>=0) and (dy>=0):
                return dx*dy
            else: # returns 0 if no intersection (outside quad)
                return 0
        # Create a named tuple of 2 diagonal points of a rect
        Rectangle = namedtuple('Rectangle', 'xmin ymin xmax ymax')
        w, h = page_width, page_height
        # Area of intersection(LINE Bounding Box, Quadrant Bounding Box)
        # Q1
        l2, t2, r2, b2 = 0, 0, w/2, h/3
        areaL_Q1 = area_overlap(l1,t1,r1,b1, l2,t2,r2,b2)
        # Q2
        l2, t2, r2, b2 = w/2, 0, w, h/3
        areaL_Q2 = area_overlap(l1,t1,r1,b1, l2,t2,r2,b2)
        # Q3
        l2, t2, r2, b2 = 0, h/3, w/2, 2*h/3
        areaL_Q3 = area_overlap(l1,t1,r1,b1, l2,t2,r2,b2)
        # Q4
        l2, t2, r2, b2 = w/2, h/3, w, 2*h/3
        areaL_Q4 = area_overlap(l1,t1,r1,b1, l2,t2,r2,b2)
        # Q5
        l2, t2, r2, b2 = 0, 2*h/3, w/2, h
        areaL_Q5 = area_overlap(l1,t1,r1,b1, l2,t2,r2,b2)
        # Q6
        l2, t2, r2, b2 = w/2, 2*h/3, w, h
        areaL_Q6 = area_overlap(l1,t1,r1,b1, l2,t2,r2,b2)
        # Find out maximum area overlap = final position quadrant index(1,2,.., 6)
        max_area = {'1': areaL_Q1, '2': areaL_Q2, '3': areaL_Q3, '4': areaL_Q4, '5': areaL_Q5, '6': areaL_Q6}
        return max(max_area, key=max_area.get)
    
    l1,t1,r1,b1 = x['l'], x['t'], x['r'], x['b']
    page_width, page_height = x['page_width'], x['page_height']
    Quad = int(Overlap_area(l1,t1,r1,b1, page_width, page_height))
    return Quad
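
As a compact illustration of the same idea, the page can be treated as a grid of 2 columns x 3 rows, and a line assigned to the cell with the largest overlap. This is a sketch under that reading of the code above; `line_quadrant` and the page size used in the example are hypothetical.

```python
def line_quadrant(l, t, r, b, page_width, page_height):
    """Return the 1-based index (1..6) of the grid cell that overlaps the line's box most."""
    best_cell, best_area = 1, -1.0
    for row in range(3):
        for col in range(2):
            # Bounds of the current cell
            cl, cr = col * page_width / 2, (col + 1) * page_width / 2
            ct, cb = row * page_height / 3, (row + 1) * page_height / 3
            # Width and height of the intersection (negative means no overlap)
            dx = min(max(l, r), cr) - max(min(l, r), cl)
            dy = min(max(t, b), cb) - max(min(t, b), ct)
            area = dx * dy if dx >= 0 and dy >= 0 else 0.0
            if area > best_area:
                best_cell, best_area = row * 2 + col + 1, area
    return best_cell


# Hypothetical 600 x 900 page: a box near the top-left corner falls in cell 1
print(line_quadrant(50, 50, 250, 80, 600, 900))
```

Cells are numbered row by row (1 and 2 at the top, 5 and 6 at the bottom), and ties resolve to the lowest-numbered cell.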

#### `5. Geographical Parser Feature`

In [55]:
##########################################################################
# PERCENTAGE FEATURE :: Finds % of spacy + GeoText NER (GPE) in line
##########################################################################

def f7_checkGPE(s):
    """
    Finds the % of Spacy NER(GPE) + GeoText Parser(GPE) tags in line.
    :param: df.LINES
    :return: % of Spacy NER(GPE) + GeoText Parser(GPE) tags in line
    """
    # General
    s = re.sub("\s+", " ", re.sub(r"\\", " ", re.sub(r"\^+", " ", re.sub(r"[^A-z\!\@\&\(\)\,\.\?]+", " ", s.strip(), flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE).strip()   
    # Replace PO BOX with a GPE location (arbitrary choice: "Chicago" is used because Spacy NER identifies this city reliably)
    s = re.sub(r"po.box|po.box.|po. box|po. box.|p.o.box|p.o.box.|p.o box|p.o box.|p.o. box|p.o. box.", "Chicago", s.strip(), flags=re.IGNORECASE | re.MULTILINE).strip()
    doc = nlp(s)
    total_words = len(doc)
    if total_words > 0:
        # Spacy NER Tagging
        spacy_NER_GPE_tag = ['GPE']
        NER_tags = [ent.orth_ for ent in doc.ents if ent.label_ in spacy_NER_GPE_tag]
        if NER_tags:
            # Remove already-found entities from the line so GeoText does not count them twice
            pattern = r"\b(?:" + "|".join(re.escape(tag) for tag in NER_tags) + r")\b"
            s = re.sub(r"\s+", " ", re.sub(pattern, " ", s, flags=re.IGNORECASE))
        # Geotext NER Tagging
        GEO_tags = GeoText(s).cities
        per_GPE = len(NER_tags + GEO_tags)*100.0/total_words
    else:
        per_GPE = 0
    return per_GPE
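
The PO BOX substitution is worth seeing in isolation: a PO Box is a location cue that an NER model may miss, so it is swapped for a city name the tagger recognises. A minimal sketch with a tightened pattern; `PO_BOX` and `replace_po_box` are hypothetical and this regex is not the one used above.

```python
import re

# Matches "PO Box", "P.O. Box", "P.O.Box", "po.box" and similar spellings
PO_BOX = re.compile(r"\bp\.?\s?o\.?\s?box\b\.?", flags=re.IGNORECASE)


def replace_po_box(text, placeholder="Chicago"):
    """Swap every PO Box mention for a placeholder location name."""
    return PO_BOX.sub(placeholder, text)


print(replace_po_box("P.O. Box 123, Dallas"))
print(replace_po_box("Postbox 5"))
```

The first line should print `Chicago 123, Dallas`; the second is left unchanged because "Postbox" is not a PO Box spelling covered by the pattern.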

### Final Features

All features are calculated on `df.LINES`, not on processed lines. Make sure features are calculated before any line preprocessing.

In [56]:
%%time
%%memit

# 0. XML features
df['l'], df['t'], df['r'], df['b'], df['FS'], \
df['l_'], df['t_'], df['r_'], df['b_'], df['FS_'], \
df['page_width'], df['page_height'] = zip(*[(C['l'], C['t'], C['r'], C['b'], C['font_size'],
                                             C['l_'], C['t_'], C['r_'], C['b_'], C['font_size_'],
                                             C['page_width'], C['page_height']) for C in df.LINES_DICT])

# 1. Line-Level features
df['F1_CONTAINSDIGIT'], df['F1_CONTAINSALLDIGIT'], df['F1_CONTAINSEMAIL'], df['F1_CONTAINSURL'], \
df['F1_CONTAINSDATE'], df['F1_CONTAINSPHONE'] = zip(*df.LINES.apply(f1_lineLevel))
df['F1_emails'] = df.LINES.apply(f1_checkEmail)
df['F1_urls'] = df.LINES.apply(f1_checkURL)
df['F3_abbrv'] = df.LINES.apply(f3_checkAbbrv)
df['F4_DictWords'], df['F4_Cap1DictWords'], df['F4_CapDictWords'], df['F4_NonDictWords'], df['F4_Cap1NonDictWords'], df['F4_CapNonDictWords'] = zip(*df.LINES.apply(f4_checkDictWord))

# 2. Contextual and AB features
df['F5_isSpacyNERLine'], df['F5_perSpacyNERAddress'], df['F5_perCrfNERAddress'], df['F5_postalAB'] = f5_checkNeighbourAddress(df, enable_truecasing=True)

# 3. Position features
df['F6_lineQuadrant'] = df.apply(f6_checkLineQuadrant, axis=1)

# 4. Geographical features
df['F7_gpe'] = df.LINES.apply(f7_checkGPE)


### **Total Numerical Features = 26**

- l_ > Regularized left coordinate of bounding box (rectangle) around char/word/line. 
- t_  > Regularized top coordinate of bounding box (rectangle) around char/word/line. Point(x,y) = Point(l,t) of principal diagonal of box.
- r_  > Regularized right coordinate of bounding box (rectangle) around char/word/line. 
- b_  > Regularized bottom coordinate of bounding box (rectangle) around char/word/line. Point(x,y) = Point(r,b) of principal diagonal of box.
- FS_  > Regularized font size of line (taken as mean of all font-sizes of all chars present in line)

- F1_CONTAINSDIGIT > If line contains a digit or not
- F1_CONTAINSALLDIGIT > If line contains only digits
- F1_CONTAINSEMAIL > If line contains email or not
- F1_CONTAINSURL > If line contains URL or not
- F1_CONTAINSDATE > If line contains Date NER or not
- F1_CONTAINSPHONE > If line contains Phone NER or not
- F1_emails  > % of emails/gmails present in line. % = n(emails)/n(total words)
- F1_urls  > % of urls/weblinks with http/https present in line. % = n(urls)/n(total words)
- F3_abbrv  > BOOL whether a known abbrv (legal + dynamic) is present in line or not. BOOL = 1 or 0

- F4_DictWords  > % of lang='en' dict words present in line. % = n(dict words)/n(total words)
- F4_Cap1DictWords  > % of lang='en' dict words with 1st letter capital in line. % = n(Cap1dict words)/n(total words)
- F4_CapDictWords  > % of lang='en' dict words in uppercase/capital in line. % = n(Capdict words)/n(total words)
- F4_NonDictWords  > % of lang='en' NON dict words present in line. % = n(NONdict words)/n(total words)
- F4_Cap1NonDictWords  > % of lang='en' NON dict words with 1st letter capital in line. % = n(Cap1NONdict words)/n(total words)
- F4_CapNonDictWords  > % of lang='en' NON dict words in uppercase/capital in line. % = n(CapNONdict words)/n(total words)

- F5_isSpacyNERLine  > BOOL if the line contains spacyNER(ORG, FAC) tag or not. BOOL = 1 or 0
- F5_perSpacyNERAddress  > % of spacyNER(GPE, LOC, FAC, ORDINAL) tags in '6' Lines below (excluding present line). % = n(tags)/n(total words)
- F5_perCrfNERAddress  > % of crfNER(list of 11 tags) in '6' Lines below (excluding present line). % = n(tags)/n(total words)
- F5_postalAB > Calculates the % score of each line based on its position relative to an AddressBlock and other factors.

- F6_lineQuadrant  > Calculates the intersection area between the line-box and all 6 quadrant-boxes. Finds the Q in which the line lies.
- F7_gpe  > % of Spacy+GeoText (GPE) tags in line. % = n(tags)/n(total words)
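
Most of the percentage features above share one formula, % = n(matching words) * 100 / n(total words). A small worked sketch for the dictionary-word case, assuming a toy dictionary and no lemmatisation: `TOY_DICT` and `dict_word_percentages` are hypothetical, whereas the notebook uses `ENG_words` and a lemmatizer.

```python
import re

# Hypothetical miniature dictionary; the notebook uses the full ENG_words set
TOY_DICT = {"invoice", "total", "street", "main"}


def dict_word_percentages(line, dictionary=TOY_DICT):
    """Return (% dictionary words, % non-dictionary words) for a line."""
    words = re.findall(r"[A-Za-z0-9]+", line)
    if not words:
        return 0.0, 0.0  # nothing to count, and no division by zero

    def pct(n):
        return n * 100.0 / len(words)

    n_dict = sum(1 for w in words if w.lower() in dictionary)
    return pct(n_dict), pct(len(words) - n_dict)


print(dict_word_percentages("ACME 12 Main Street"))
```

For the sample line, two of the four tokens ("Main", "Street") are in the toy dictionary, so both percentages come out at 50.0.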

## `TYPE II. Text Features`

For each line (L), the line itself and the 3 lines below it (4 lines in total) are taken, and all Organizational/Geographical NER tags are identified. Each line is then represented by these NER tags.

    L = "hello what's up?"               -> []
    L+1 = "Going to Mumbai"              -> [Mumbai, GPE]
    L+2 = "I work at Google"             -> [Google, ORG]
    L+3 = "Nice, noteworthy greenland!"  -> [Greenland, LOC]

    L_NER = [GPE, ORG, LOC]
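
To turn such a tag list into numbers, each line can be mapped to a fixed-length count vector over the tag vocabulary. The sketch below uses `collections.Counter` as a stand-in for a count vectorizer; `TAG_VOCAB` lists the five labels kept by `line_context`, and `tags_to_counts` is a hypothetical helper.

```python
from collections import Counter

# Labels kept by line_context, in a fixed column order
TAG_VOCAB = ['ORG', 'FAC', 'GPE', 'LOC', 'ORDINAL']


def tags_to_counts(tags, vocab=TAG_VOCAB):
    """Count how often each vocabulary tag occurs in a line's context."""
    counts = Counter(tags)
    return [counts[tag] for tag in vocab]


# L_NER from the example above
print(tags_to_counts(['GPE', 'ORG', 'LOC']))
```

The printed vector has one column per tag, so `[1, 0, 1, 1, 0]` reads as one ORG, no FAC, one GPE, one LOC and no ORDINAL.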

In [57]:
##########################################################################
# VECTORIZER: Converts a line into NER tags corresponding to its context
##########################################################################

def line_context(df, enable_regex=True, enable_truecasing=True):
    """
    Finds the NER (ORG and location-related) tags in the context of each line (the line plus the 3 lines below it).
    :param: dataframe df
    :return: list of NER tag labels found in the context, one list per line.
    """

    def regex(text):
        # removes digits symbols, etc
        text = re.sub("\s+", " ", re.sub(r"\^+", " ", re.sub(r"[^A-z\!\@\&\(\)\,\.\?]+", " ", text.strip(), flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE), flags=re.IGNORECASE | re.MULTILINE).strip()
        text = re.sub(r"po.box|po.box.|po. box|po. box.|p.o.box|p.o.box.|p.o box|p.o box.|p.o. box|p.o. box.", "Chicago", text.strip(), flags=re.IGNORECASE | re.MULTILINE).strip()
        return text
    
    def stanfordNLP_truecasing(text):
        # TrueCasing - converts text into original casing  
        if len(text.strip()) > 0:
            doc = stf_nlp(text)
            text = " ".join([w.text.capitalize() if w.upos in ["PROPN","NNS"] else w.text for sent in doc.sentences for w in sent.words])
            return text
        else:
            return text
        
    LINE_CONTEXT_LOOKUP = 4
    # spaCy entity labels kept as context tags
    spacy_NER_line_context_tag = ['ORG', 'FAC', 'GPE', 'LOC', 'ORDINAL']
    line_context_NERtags = []
    for i in range(len(df.LINES)):
        # text = present line + 3 lines below as context
        text = " ".join(df.LINES.iloc[i:i+LINE_CONTEXT_LOOKUP].tolist()).strip()
        # preprocessing
        if enable_regex:
            text = regex(text)
        if enable_truecasing:
            text = stanfordNLP_truecasing(text)
        # spaCy NER tagging
        doc = nlp(text)
        NER_tags = [ent.label_ for ent in doc.ents if ent.label_ in spacy_NER_line_context_tag]
        line_context_NERtags.append(NER_tags)
    return line_context_NERtags

In [58]:
%%time
%%memit

# Vectorization - Represents each line (L) by the spaCy NER tags found in its context.
#                 Each line is concatenated with the 3 lines below it using " ".join() and passed
#                 through regex cleaning (removal of digits), truecasing, and spaCy NER tagging.
#
#                 vector(L) == L + 3 lines -> remove digits -> truecasing -> spaCy NER -> [list of NER tags]

df['LINE_NER'] = line_context(df, enable_regex=True, enable_truecasing=True)


----

## <ins>11. Saving & Loading Module </ins>

- All data is saved in Excel chunks.
- A DB is not used as the data volume is large.

BOT

--> SAVE_DF
    
- ------->  DF_30042020
- ------->  DF_04052020
- ------->  **DF_14052020_LATEST** : Contains the latest df.
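
A small sketch of how fixed-size chunking maps row ranges to file names, assuming 50,000 rows per chunk. The helper names and the `SAVE_DF/example/` folder are placeholders for illustration, not the notebook's actual paths.

```python
def chunk_ranges(n_rows, size):
    """Yield (chunk_index, start, stop) with stop exclusive, covering rows 0..n_rows-1."""
    for k, start in enumerate(range(0, n_rows, size)):
        yield k, start, min(start + size, n_rows)


def chunk_filenames(n_rows, size, folder="SAVE_DF/example/"):
    # `folder` is a placeholder path, not the notebook's real directory.
    return [f"{folder}DF_{k}.xlsx" for k, _, _ in chunk_ranges(n_rows, size)]


ranges = list(chunk_ranges(120_000, 50_000))
print(ranges)
print(chunk_filenames(120_000, 50_000))
```

The last chunk is simply shorter than the others; nothing is padded.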

#### 1. SAVING

In [59]:
save_data = "SAVE_DF/DF_LATEST (01_07_2020)/"

In [60]:
%%time
print("Original DF Shape = ", df.shape)

# Chunks of 50,000 rows
size = 50000
# .copy() so that adding the KEY column does not write into a slice of df
list_of_dfs = [df.iloc[i:i+size].copy() for i in range(0, len(df), size)]
for i, d in enumerate(list_of_dfs):
    d['KEY'] = "KEY_" + str(i)
    fn = save_data + "DF_" + str(i) + ".xlsx"
    print("Saving  ::  File = ", fn)
    d.to_excel(fn)


#### 2. LOADING

In [61]:
load_data = "SAVE_DF/DF_LATEST (01_07_2020)/"

In [62]:
%%time
total_saved_df = len(os.listdir(load_data))
chunks = []
for i in range(total_saved_df):
    fn = load_data + "DF_" + str(i) + ".xlsx"
    print("Loading  ::  File = ", fn)
    chunks.append(pd.read_excel(fn))
# pd.concat replaces the repeated DataFrame.append calls (append was removed in pandas 2.0)
df = pd.concat(chunks)

# Re-shaping loaded df...
df = df.drop(columns=['Unnamed: 0', 'KEY']).reset_index(drop=True)
# LINE_NER comes back from Excel as a string; parse it into a list again
df.LINE_NER = df.LINE_NER.apply(ast.literal_eval)
print("\nLoaded df Shape = ", df.shape, '\n')


----

## <ins> 12. Filtering Dataset Module </ins>

- Filters out those XML files for which no mapping exists in the mapper file, i.e. no labels exist and manual labelling effort is required for these files.


- Manual labelling shall be done following the existing mapper file.
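
The filtering rule amounts to keeping a file only if at least one of its lines carries a label. A library-free sketch of that rule, using made-up file names and an illustrative helper name:

```python
from collections import defaultdict


def files_with_labels(rows):
    """rows: iterable of (filename, y_sn) pairs; returns filenames with any y_sn == 1."""
    label_sum = defaultdict(int)
    for filename, y_sn in rows:
        label_sum[filename] += y_sn
    return {name for name, total in label_sum.items() if total > 0}


# Toy data: b.xml has no labelled line and is dropped.
rows = [("a.xml", 0), ("a.xml", 1), ("b.xml", 0), ("b.xml", 0), ("c.xml", 1)]
labelled = files_with_labels(rows)
kept = [row for row in rows if row[0] in labelled]
print(sorted(labelled))
```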

In [63]:
# Label count per file
Filename_Label_data = pd.DataFrame(df.groupby('FILENAME').Y_SN.sum()).reset_index(level=0)
Filename_Label_data['Y_SN'] = np.where(Filename_Label_data.Y_SN > 0, 1, 0)

# Files with at least 1 label
label_files = list(set(Filename_Label_data[Filename_Label_data.Y_SN == 1].FILENAME.unique()))

# Files with no labels (a label split across multiple lines is possible - ignored for now!)
no_label_files = list(set(Filename_Label_data[Filename_Label_data.Y_SN == 0].FILENAME.unique()))

# FILTERING DATASET...
# - Taking files with at least one label (possibly a one-line SN)
df_labelled = df[df['FILENAME'].isin(label_files)]
df_labelled = df_labelled.dropna(subset=['LINES']).reset_index(drop=True).copy()


In [64]:
print("<............... STATS .................>\n")

print("{}\nShape = {}\nFiles = {} || 0: {} | 1: {} ||\nLines = {} || 0: {} | 1: {} ||\n{}\n"
      .format("------------- ORIGINAL DF -------------", df.shape, len(df.FILENAME.unique()), Filename_Label_data.Y_SN.value_counts()[0], Filename_Label_data.Y_SN.value_counts()[1], df.Y_SN.shape[0], df.Y_SN.value_counts()[0], df.Y_SN.value_counts()[1], "-------------------------------------"))
              
print("{}\nShape = {}\nFiles = {} || 0: {} | 1: {} ||\nLines = {} || 0: {} | 1: {} ||\n{}"
      .format("------------- FILTERED DF -------------", df_labelled.shape, len(df_labelled.FILENAME.unique()), 0, pd.Series(np.where(df_labelled.groupby("FILENAME").Y_SN.sum() > 0, 1, 0)).value_counts()[1], df_labelled.Y_SN.shape[0], df_labelled.Y_SN.value_counts()[0], df_labelled.Y_SN.value_counts()[1], "-------------------------------------"))

<............... STATS .................>




----

#  <ins>13. Training Module</ins>

- Phase I: Line Classification Training
    - Trains a classifier model to learn which lines are SN lines and which are not.

...

- Phase II: Chunk Identification Training
    - Trains a Scoring Decision Matrix to decide which part of a chunk corresponds to an SN.

`TEXT`: Textual features 

`LINGUISTIC`: Numerical features 

## `Phase I: Line Classification Training`

<ins> `Normal Methods: ` </ins>

1. **Method 1: LINGUISTIC [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC) --------------BEST----------**


2. Method 2: TEXT + LINGUISTIC via direct concat [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC)


3. Method 3: TEXT(DR) + LINGUISTIC via direct concat [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC)


4. Method 4: TEXT [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC)


5. Method 5: TEXT + LINGUISTIC via hstack [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC)


6. Method 6: LINGUISTIC using Ensemble_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC)


7. Method 7: TEXT + LINGUISTIC using Ensemble_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC)

...

<ins> `NN Methods: ` </ins>

8. Method 8: TEXT using LSTM_CLF


9. Method 9: TEXT using LSTM_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC)


10. Method 10: TEXT using LSTM_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC + TEXT)


11. Method 11: TEXT(*ELMO*) using LSTM_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC)


12. Method 12: LINGUISTIC using NN_CLF


13. Method 13: LINGUISTIC using NN_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC)


14. Method 14: LINGUISTIC using NN_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC + TEXT)


15. Method 15: TEXT + LINGUISTIC via h-stack using NN_CLF


16. Method 16: TEXT + LINGUISTIC via h-stack using NN_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + LINGUISTIC)


17. **Method 17: TEXT + LINGUISTIC via 2-input Channel using LSTM_CLF [BASE]   --------------BEST----------**


18. Method 18: TEXT(NER -> Clustering -> OHE) + LINGUISTIC via h-stack *<ins>EXPERIMENTAL</ins>*


19. Method 19: Ensembling via Stacking - LINGUISTIC Features using VotingClassifier (XGB, RF, AB)

.

20. **FINAL ARCHITECTURE**

...

#### Normalization
    
RandomForest feature importance was used to compare results with and without normalization of the meta-data attributes. Leaving the meta-data attributes unnormalized gave better results.
    
1. Normalize: Meta-data attributes(Coordinates, FS) + Engineered Numerical Features
2. Normalize: Engineered Numerical Features   **[SELECTED]**
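
The selected option scales only the engineered columns and leaves the meta-data attributes untouched. A minimal, library-free sketch of that idea follows. The helper names and the toy values are assumptions for illustration; the notebook itself uses scikit-learn's `MinMaxScaler`.

```python
def min_max_fit(rows, cols):
    """Learn (min, max) for each selected column from the training rows."""
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in cols}


def min_max_apply(rows, params):
    """Scale only the selected columns; all other columns pass through unchanged."""
    scaled = []
    for r in rows:
        new = dict(r)
        for c, (lo, hi) in params.items():
            # A constant column has no range; map it to 0.0 by convention here.
            new[c] = 0.0 if hi == lo else (r[c] - lo) / (hi - lo)
        scaled.append(new)
    return scaled


# FS_ is a meta-data attribute (left as-is); F4_DictWords is an engineered feature.
train = [{"FS_": 12, "F4_DictWords": 2},
         {"FS_": 9, "F4_DictWords": 6},
         {"FS_": 10, "F4_DictWords": 4}]
test = [{"FS_": 11, "F4_DictWords": 8}]

params = min_max_fit(train, ["F4_DictWords"])
train_scaled = min_max_apply(train, params)
test_scaled = min_max_apply(test, params)  # reuses the training min/max
print(train_scaled)
print(test_scaled)
```

Test values outside the training range land outside [0, 1], which is expected when the scaler is fitted on the training data only.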
    
...
    
#### Feature Transformation
Various feature transformation techniques were tried. None of them seemed to improve results, and thus **no feature transformation was used.**

#### FEATURE SELECTION 

In [65]:
# ALL FEATURES

# Features_NUM = ['l_', 't_', 'r_', 'b_', 'FS_', 'F1_CONTAINSDIGIT', 'F1_CONTAINSALLDIGIT', 'F1_CONTAINSEMAIL',
#                 'F1_CONTAINSURL', 'F1_CONTAINSDATE', 'F1_CONTAINSPHONE', 'F1_emails', 'F1_urls', 'F3_abbrv',
#                 'F4_DictWords', 'F4_Cap1DictWords', 'F4_CapDictWords', 'F4_NonDictWords', 'F4_Cap1NonDictWords',
#                 'F4_CapNonDictWords', 'F5_isSpacyNERLine', 'F5_perSpacyNERAddress', 'F5_perCrfNERAddress', 'F5_postalAB', 
#                 'F6_lineQuadrant', 'F7_gpe']

# normalize_cols = ['F4_DictWords', 'F4_Cap1DictWords', 'F4_CapDictWords', 'F4_NonDictWords', 'F4_Cap1NonDictWords', 
#                   'F4_CapNonDictWords', 'F5_perSpacyNERAddress', 'F5_perCrfNERAddress', 'F5_postalAB', 'F7_gpe']

# Features_L2 = Features_NUM + ['P0', 'P1']

# Target = 'Y_SN'

In [66]:
# SELECTED FEATURES
# - All features were taken except 3 (SpacyNERLine, SpacyNERAddress, CrfNERAddress).

Features_NUM = ['l_', 't_', 'r_', 'b_', 'FS_', 'F1_CONTAINSDIGIT', 'F1_CONTAINSALLDIGIT', 'F1_CONTAINSEMAIL',
                'F1_CONTAINSURL', 'F1_CONTAINSDATE', 'F1_CONTAINSPHONE', 'F1_emails', 'F1_urls', 'F3_abbrv',
                'F4_DictWords', 'F4_Cap1DictWords', 'F4_CapDictWords', 'F4_NonDictWords', 'F4_Cap1NonDictWords',
                'F4_CapNonDictWords', 'F5_postalAB', 'F6_lineQuadrant', 'F7_gpe']

normalize_cols = ['F4_DictWords', 'F4_Cap1DictWords', 'F4_CapDictWords', 'F4_NonDictWords', 'F4_Cap1NonDictWords', 
                  'F4_CapNonDictWords', 'F5_postalAB', 'F7_gpe']

Features_L2 = Features_NUM + ['P0', 'P1']

Target = 'Y_SN'

#### HELPER FUNCTIONS

In [67]:
##########################################################################
# Helper 1: Accuracy Metrics for Generic and NN models
##########################################################################
# INPUT  -> df, X, Y_true, model(keras/sklearn)
# OUTPUT -> df with Y_PRED, P0 (prob of 0) and P1 (prob of 1) columns added; metrics are printed

def accuracy(df, X, Y, model):
    
    # Calculate metrics
    def accuracy_metrics(Y_true, Y_pred, pos_probs):
        def percent(value):
            return round(float(value)*100.0, 3)
        accuracy = percent(accuracy_score(Y_true, Y_pred))
        precision = percent(precision_score(Y_true, Y_pred))
        recall = percent(recall_score(Y_true, Y_pred))
        f1 = percent(f1_score(Y_true, Y_pred))
        roc_auc = roc_auc_score(Y_true, pos_probs)
        P, R, _ = precision_recall_curve(Y_true, pos_probs)
        pr_auc = auc(R, P)
        confusion = confusion_matrix(Y_true, Y_pred)
        TP, TN, FP, FN = confusion[1, 1], confusion[0, 0], confusion[0, 1], confusion[1, 0]
        print("Accuracy: {};  P: {};  R: {};  F1: {};  ROC-AUC: {};  PR-AUC: {}\nValueCounts() :: \n{}\nTP = {}\nTN = {}\
              \nFP = {}\nFN = {}".format(accuracy, precision, recall, f1, roc_auc, pr_auc, 
                                         pd.Series(Y_true).value_counts(), TP, TN, FP, FN))
    
    # Decides probability function based on model type
    def accuracy_normal():
        Y_pred_probs = model.predict_proba(X)
        Y_pred = np.where(Y_pred_probs>0.50, 1, 0).argmax(axis=1)
        pos_probs = Y_pred_probs[:, 1]
        return Y_pred_probs, Y_pred, pos_probs
    def accuracy_NN():
        Y_pred_probs = model.predict(X, verbose=0)
        Y_pred = np.where(Y_pred_probs>0.50, 1, 0).argmax(axis=1)
        pos_probs = Y_pred_probs[:, 1]
        return Y_pred_probs, Y_pred, pos_probs
    
    # Check model type
    if 'keras' in str(model):
        # NN model
        Y_pred_probs, Y_pred, pos_probs = accuracy_NN()
    else:
        # Normal model
        Y_pred_probs, Y_pred, pos_probs = accuracy_normal()
    
    # Calculate Accuracy Metrics
    ACCURACY = accuracy_metrics(Y, Y_pred, pos_probs)
    
    # Store Model Probability Predictions P0, P1
    df['Y_PRED'] = Y_pred
    df['P0'], df['P1'] = zip(*Y_pred_probs)
    return df
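
The precision, recall, and F1 values printed by Helper 1 all derive from the four confusion-matrix counts. As a minimal, library-free sketch (the function names and the toy labels are illustrative, not part of the notebook):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, tn, fp, fn, precision, recall, f1)
```

Precision and recall on the positive (SN) class are usually more informative than plain accuracy when positives are rare, which is presumably why the helper reports them alongside PR-AUC.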

In [68]:
##########################################################################
# Helper 2: 2-layer Contextual Classifier Model
##########################################################################
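# INPUT  -> level (1 or 2), model, normalized X_train and X_test, Y_train, Y_test
#           (also reads the globals train_df, test_df, Features_L2, normalize_cols)
# OUTPUT -> level 1: (model, TEST_DF_1); level 2: (Normalize, model, model2, TEST_DF)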

def execute_model(level, model, X_train, Y_train, X_test, Y_test):
    
    ##########################
    # LEVEL 1 - Classification
    ##########################
    print("## ---- Training: Level 1 ---- ##")
    
    # 1. Normalization
    # Input data should be NORMALIZED already!
    
    # 2. Sampling (oversample the minority class)
    Sampler = SMOTE(sampling_strategy='minority')
    X_train_final, Y_train_final = Sampler.fit_resample(X_train, Y_train)
    
    # 3. Fitting model
    model.fit(X_train_final, Y_train_final)
    
    # 4. Predictions(Probs: P0, P1)
    print("Test Data Results :: ")
    TEST_DF_1 = accuracy(test_df, X_test, Y_test, model)
    
    ##########################
    # LEVEL 2 - Classification
    ##########################
    if level == 2:
        print("## ---- Training: Level 2 ---- ##")
        
        # 1. Adding probs P0, P1 columns to df
        train_df['P0'], train_df['P1'] = zip(*model.predict_proba(X_train))
        test_df['P0'], test_df['P1'] = zip(*model.predict_proba(X_test))
        
        # 2. Considering L2 features (i.e. linguistic (numerical) features + P0, P1)
        #    .copy() so the normalization below does not write into a slice of train_df/test_df
        X_train_L2, X_test_L2 = train_df[Features_L2].copy(), test_df[Features_L2].copy()
        
        # 3. Normalize the final features
        Normalize = MinMaxScaler()
        X_train_L2[normalize_cols] = Normalize.fit_transform(X_train_L2[normalize_cols])
        X_test_L2[normalize_cols] = Normalize.transform(X_test_L2[normalize_cols])
        
        # 4. Sampling (oversample the minority class)
        Sampler = SMOTE(sampling_strategy='minority')
        X_train_final, Y_train_final = Sampler.fit_resample(X_train_L2, Y_train)

        # 5. Fitting model
        model2 = RandomForestClassifier(n_jobs=-1)
        model2.fit(X_train_final, Y_train_final)
        
        # 6. Final Predictions
        print("\nTest Data Results :: ")
        TEST_DF = accuracy(test_df, X_test_L2, Y_test, model2)
        return Normalize, model, model2, TEST_DF
    
    else:
        # level 1 return
        return model, TEST_DF_1

In [69]:
################################################################################################
# Helper 3: Loading Testing data from customer(Porsche), called as "UNSEEN_DF" henceforth
################################################################################################
# INPUT  -> None
# OUTPUT -> returns loaded dataframe "UNSEEN_DF"

def unseendata_load():
    # PATH
    UNSEEN_DATA_PATH = SAVED_test_data_path
    # Loading unseen df
    total_saved_df = len(os.listdir(UNSEEN_DATA_PATH))
    chunks = []
    for i in range(total_saved_df):
        fn = UNSEEN_DATA_PATH + "DF_" + str(i) + ".xlsx"
        print("Loading      ::", fn)
        chunks.append(pd.read_excel(fn))
    # pd.concat replaces the repeated DataFrame.append calls (append was removed in pandas 2.0)
    unseen_df = pd.concat(chunks)
    unseen_df = unseen_df.drop(columns=['Unnamed: 0', 'KEY']).reset_index(drop=True)
    # Label count per file
    Filename_Label_data = pd.DataFrame(unseen_df.groupby('FILENAME').Y_SN.sum()).reset_index(level=0)
    Filename_Label_data['Y_SN'] = np.where(Filename_Label_data.Y_SN > 0, 1, 0)
    # Files with at least 1 label
    label_files = list(set(Filename_Label_data[Filename_Label_data.Y_SN == 1].FILENAME.unique()))
    # Files with no labels
    no_label_files = list(set(Filename_Label_data[Filename_Label_data.Y_SN == 0].FILENAME.unique()))
    # FILTERING DF
    # - Taking files with at least 1 label found!
    unseen_df_labelled = unseen_df[unseen_df['FILENAME'].isin(label_files)]
    unseen_df_labelled = unseen_df_labelled.dropna(subset=['LINES']).reset_index(drop=True).copy()
    unseen_df_labelled.LINE_NER = unseen_df_labelled.LINE_NER.apply(lambda x: ast.literal_eval(x))
    # Display
    print("Original Df  :: Shape = {};   Files = {}  >> 0 = {};  1 = {}".format(unseen_df.shape, Filename_Label_data.shape[0], Filename_Label_data.Y_SN.value_counts()[0], Filename_Label_data.Y_SN.value_counts()[1]))
    print("Filtered Df  :: Shape = {};   Files = {}  >> 0 = {};   1 = {}".format(unseen_df_labelled.shape, pd.Series(unseen_df_labelled.groupby('FILENAME').Y_SN.sum()).shape[0], 0, pd.Series(np.where(unseen_df_labelled.groupby("FILENAME").Y_SN.sum() > 0, 1, 0)).value_counts()[1]))
    return unseen_df_labelled


################################################################################################
# Helper 4: Loading a 'sample' of testing data from same customer(AH), called as "UNSEEN_AHDF" henceforth
################################################################################################
# INPUT  -> df
# OUTPUT -> returns loaded dataframe "UNSEEN_AHDF"

def create_AH_testdf(df):
    # Collect up to 100 files that have at least 2 lines labelled as SN
    frames = []
    for f in df.FILENAME.unique():
        tempdf = df[df.FILENAME == f]
        if (tempdf.Y_SN == 1).sum() >= 2:
            frames.append(tempdf)
        if len(frames) == 100:
            break
    # pd.concat replaces the repeated DataFrame.append calls (append was removed in pandas 2.0)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames).reset_index(drop=True)

In [70]:
# ##########################################################################
# # Helper 5: Loading different Vectorizer functions for TEXTUAL FEATURES
# ##########################################################################
# # INPUT  -> None
# # OUTPUT -> returns loaded vectorizer class

# # VECTORIZERS...
# class Word2vec_Vectorizer(object):
#     def __init__(self, pretrained_model='local'):
#         self.word2vec_modelname = pretrained_model 
#         if self.word2vec_modelname == 'google':
#             self.word2vec = KeyedVectors.load_word2vec_format('Models/GoogleNews-vectors-negative300.bin', binary=True)
#         elif self.word2vec_modelname == 'glove':
#             self.word2vec = KeyedVectors.load_word2vec_format('Models/glove.6B.300d.txt.word2vec', binary=False)
#         else:
#             self.word2vec = None
#         print("word2vec loaded ", self.word2vec_modelname)
#         return
#     def fit(self, X):
#         if self.word2vec_modelname != 'local':
#             return
#         sentences = X.tolist()
#         self.word2vec = Word2Vec(sentences, size=300, min_count=1, seed=7)
#         return
#     def transform(self, X):
#         doc_vector = []
#         for doc in X:  
#             word_vector_doc = []  
#             if doc != []:
#                 for word in doc:    
#                     if word in self.word2vec.wv:
#                         word_vector_doc.append(self.word2vec.wv[word])
#                     else:
#                         word_vector_doc.append(np.ones(300))
#                 doc_vector.append(list(np.mean(word_vector_doc, axis=0)))
#             else:
#                 doc_vector.append(list(np.zeros(300)))     
#         doc_vector = np.array(doc_vector)
#         return doc_vector
#     def fit_transform(self, X):
#         self.fit(X)
#         return self.transform(X)
#     def model(self):
#         return self.word2vec
    
# class Spacy_Vectorizer(object):
#     def __init__(self, nlp):
#         self.nlp = nlp
#         return
#     def fit(self, X, y=None):
#         return self    
#     def transform(self, X):
#         doc_vector = []
#         print(len(X))
#         c=0
#         for doc in X:
#             print(c)
#             doc_vector.append(self.nlp(doc).vector)
#             c+=1
#         doc_vector = np.array(doc_vector)
#         return doc_vector
#     def fit_transform(self, X, y=None):
#         return self.transform(X)
#     def model(self):
#         return self.nlp

- Vectorizers are not being used

In [71]:
##################################################################################################
# Helper 6: Uses only 1 model to predict 'chunk' using Phase II module (which actually requires 2 models)
##################################################################################################
# INPUT   --> df with only 1 model(for all code before phase II)
# OUTPUT  --> Comparison df having SN by prob, SN by order(line number), CN by prob

def predict_chunk_lines_PHASE1(df):

    # Execute for each filename...
    final_df = []
    for f in df.FILENAME.unique():
        # every df
        tempdf = df[df.FILENAME == f].copy().reset_index(drop=True)

        #########################################################################
        # 1. Line Classification
        # TRUE
        actual = tempdf[tempdf.Y_SN == 1]
        actual_SN = str(actual.SUPPLIER_NAME.tolist()[0])
        # Accuracy
        correct_df = tempdf[(tempdf.Y_SN == 1) & (tempdf.Y_PRED == 1)]
        correct_LINES = correct_df['LINES'].tolist()
        if len(correct_LINES) > 0:
            correct_LINE_found = 1
        else:
            correct_LINE_found = 0
        #########################################################################

        #########################################################################
        # 2. Chunk Identification
        # Prep df with only 1 model for running phase II prediction module
        tempdf = tempdf.rename(columns={"Y_PRED": "Model1_Y_PRED", "P0": "Model1_P0", "P1": "Model1_P1"})
        # Creating empty model 2 prediction cols for running phase II
        tempdf['Model2_Y_PRED'], tempdf['Model2_P0'], tempdf['Model2_P1']= 0, 0, 0
           
        if tempdf[tempdf.Model1_Y_PRED==1].shape[0] > 0:
            # By Prob...
            # > PHASE II MODEL PREDICTION
            FINAL_SN_list, SN_list, CN_list = chunk_identification(tempdf)
            Final_Pred_SN_byprob = SN_list[0][0]
            S1 = fuzz.partial_ratio(actual_SN.lower(), Final_Pred_SN_byprob.lower())
            if S1 < 90: S1 = 0

            # By Order...
            tempdf = tempdf.sort_values(by='Model1_P1', ascending=False)
            Final_Pred_SN_byorder = tempdf[tempdf.Model1_Y_PRED==1].LINES.values[0]
            S2 = fuzz.partial_ratio(actual_SN.lower(), Final_Pred_SN_byorder.lower())
            if S2 < 90: S2 = 0
    
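        else:
            # No line was predicted as SN: use empty predictions, and reset CN_list
            # so it is neither undefined nor carried over from the previous file
            Final_Pred_SN_byprob, Final_Pred_SN_byorder, S1, S2 = "", "", 0, 0
            CN_list = []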
        #########################################################################
        
        # Store in common df
        final_df.append({"FILE": f, "SN": actual_SN, "CL": correct_LINE_found, 
                         "PRED_SN_Prob":Final_Pred_SN_byprob, "S1":S1, 
                         "PRED_SN_Order":Final_Pred_SN_byorder, "S2":S2, 
                         "PRED_CN": CN_list[:3]})
    
    # DISPLAY
    pred_df = pd.DataFrame.from_dict(final_df)
    print("Total Files = {}; Correct Lines = {}; Lines Missed = {}\nScore By Prob = {};  Score by Order = {}".format(pred_df.shape[0], pred_df.CL.sum(), pred_df.shape[0] - pred_df.CL.sum(), pred_df.S1.mean(), pred_df.S2.mean()))
    pred_df = pred_df[['FILE', 'SN', 'PRED_SN_Prob', 'PRED_SN_Order', 'PRED_CN']]    
    return pred_df
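
The scores `S1` and `S2` in Helper 6 are fuzzy-match scores that are zeroed when they fall below 90. The sketch below illustrates that thresholding with a simplified partial-match score built on the standard library's `difflib`. It is a stand-in written for illustration and is not the algorithm behind `fuzz.partial_ratio`; the function names are assumptions.

```python
from difflib import SequenceMatcher


def partial_ratio(a, b):
    """Best similarity (0-100) of the shorter string against any
    equal-length window of the longer string."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(best * 100)


def thresholded_score(actual, predicted, cutoff=90):
    """Case-insensitive partial match; scores below `cutoff` count as a miss (0)."""
    score = partial_ratio(actual.lower(), predicted.lower())
    return score if score >= cutoff else 0


print(thresholded_score("ABC Inc", "ABC INC 10 May"))
print(thresholded_score("ABC Inc", "Invoice total"))
```

Zeroing near-misses keeps the averaged score from being inflated by many weak partial matches.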

-----
-----

................................................................................................................................................................................................................................................
## GENERIC METHODS
................................................................................................................................................................................................................................................

### `Method 1: `  Linguistic [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic)

    ARCHITECTURE

    1. BASE Classifier: Linguistic Features used for training a generic model.
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features used for training another generic model.
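
The second classifier sees the original linguistic features plus the base classifier's two probabilities. A minimal sketch of how such a second-level feature row can be assembled is shown here. `toy_predict_proba` is a hypothetical stand-in for a trained base model, and the feature values are made up.

```python
import math


def toy_predict_proba(row):
    """Hypothetical base model: a fixed logistic score on a single feature."""
    p1 = 1.0 / (1.0 + math.exp(-(row["F3_abbrv"] - 0.5) * 4))
    return 1.0 - p1, p1


def add_base_probs(rows, predict_proba):
    """Return copies of the rows with the base model's P0 and P1 appended."""
    out = []
    for row in rows:
        p0, p1 = predict_proba(row)
        out.append({**row, "P0": p0, "P1": p1})
    return out


rows = [{"F3_abbrv": 1, "F7_gpe": 0}, {"F3_abbrv": 0, "F7_gpe": 2}]
level2_rows = add_base_probs(rows, toy_predict_proba)
print(sorted(level2_rows[0]))  # feature names seen by the second classifier
```

The second-level model can then correct the base model where its probabilities disagree with the raw features.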

#### Load Data

In [72]:
# Load training, testing data
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)

# Load testing data from another customer (called as 'UNSEEN_DF' henceforth)
# unseen_df_labelled = unseendata_load()

NameError: name 'df_labelled' is not defined

#### Calculate feature importance

In [73]:
# 1. Selected features (.copy() so the normalization below does not write into a slice)
X_train, X_test = train_df[Features_NUM].copy(), test_df[Features_NUM].copy()
Y_train, Y_test = train_df[Target], test_df[Target]

# 2. Normalize the final features
Normalize = MinMaxScaler()
X_train[normalize_cols] = Normalize.fit_transform(X_train[normalize_cols])
X_test[normalize_cols] = Normalize.transform(X_test[normalize_cols])

# 3. Sampling (oversample the minority class)
Sampler = SMOTE(sampling_strategy='minority')
X_train_final, Y_train_final = Sampler.fit_resample(X_train, Y_train)

# 4. Fitting model
rnd_clf = RandomForestClassifier(n_jobs=-1)
rnd_clf.fit(X_train_final, Y_train_final)

# 5. Predictions
print("Test Data Results :: ")
ACCURACY = accuracy(test_df, X_test, Y_test, rnd_clf)

# 6. Calculate Feature Importance
plt.figure(figsize=(8,8))
features = X_train.columns
importances = rnd_clf.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()


#### Execute Model

In [74]:
# Load training, testing data
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)

# 1. Selected features (.copy() so the normalization below does not write into a slice)
X_train, X_test = train_df[Features_NUM].copy(), test_df[Features_NUM].copy()
Y_train, Y_test = train_df[Target], test_df[Target]

# 2. Normalize the final features
Normalize = MinMaxScaler()
X_train[normalize_cols] = Normalize.fit_transform(X_train[normalize_cols])
X_test[normalize_cols] = Normalize.transform(X_test[normalize_cols])


#### Different models

In [75]:
# # LR
# model1 = LogisticRegression(solver='sag', multi_class='multinomial', C=1.8, penalty='l2', n_jobs=-1, class_weight='balanced')
# model1 = execute_model(1, model1, X_train, Y_train, X_test, Y_test, X_unseen, Y_unseen)

In [76]:
# # SVM
# model2 = SVC(kernel='linear', C=0.5, gamma=0.001, probability=True, tol=1e-3, class_weight='balanced')
#model2 = execute_model(1, model2, X_train, Y_train, X_test, Y_test, X_unseen, Y_unseen)

In [77]:
# # AD
# model4 = AdaBoostClassifier(random_state=1)
# model4 = execute_model(1, model4, X_train, Y_train, X_test, Y_test, X_unseen, Y_unseen)

In [78]:
# XGB
model3 = XGBClassifier(random_state=1, n_jobs=-1)
Normalize, model31, model32, TEST_DF  = execute_model(2, model3, X_train, Y_train, X_test, Y_test)


In [79]:
# RF -> BEST
model5 = RandomForestClassifier(n_jobs=-1)
Normalize, model51, model52, TEST_DF = execute_model(2, model5, X_train, Y_train, X_test, Y_test)


In [80]:
# Running phase II: chunk identification module over the above "TEST_DF"
predict_chunk_lines_PHASE1(TEST_DF)


In [81]:
# # BEST GENERIC MODEL...
# # SAVE DATE 19 May 2020 (Models = Normalize, model51, model52, unseen_df_labelled)

# pickle.dump(Normalize, open("Models/RF_2Level_Generic_Num_19052020/Model_Normalize_19052020.pickle", "wb"))
# pickle.dump(model51, open("Models/RF_2Level_Generic_Num_19052020/Model_model51_19052020.pickle", "wb")) 
# pickle.dump(model52, open("Models/RF_2Level_Generic_Num_19052020/Model_model52_19052020.pickle", "wb")) 

## `BEST GENERIC METHOD USE-CASE`

In [82]:
# load models (context managers close the files after reading)
with open("Models/RF_2Level_Generic_Num_19052020/Model_model51_19052020.pickle", "rb") as f:
    model = pickle.load(f)
with open("Models/RF_2Level_Generic_Num_19052020/Model_model52_19052020.pickle", "rb") as f:
    model2 = pickle.load(f)
with open("Models/RF_2Level_Generic_Num_19052020/Model_Normalize_19052020.pickle", "rb") as f:
    Normalizer = pickle.load(f)

# Load training, testing and unseen df
# Note: random_state=10 differs from the random_state=7 used in the training cells above,
# so this test_df may contain rows the saved models were trained on.
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=10, stratify=df_labelled.Y_SN)
unseen_df_labelled = unseendata_load()


#### Testing above loaded models

In [83]:
# LEVEL 1
# 1. Select Features (.copy() so the normalization below does not write into a slice)
X_train, X_test, X_unseen = train_df[Features_NUM].copy(), test_df[Features_NUM].copy(), unseen_df_labelled[Features_NUM].copy()
Y_train, Y_test, Y_unseen = train_df[Target], test_df[Target], unseen_df_labelled[Target]

# 2. Normalize features using loaded "Normalizer"
X_train[normalize_cols] =  Normalizer.transform(X_train[normalize_cols])
X_test[normalize_cols] = Normalizer.transform(X_test[normalize_cols])
X_unseen[normalize_cols] = Normalizer.transform(X_unseen[normalize_cols])

# # 3. Predictions using fitted "model_1"
# print("Test Data Results :: ")
# ACCURACY = accuracy(test_df, X_test, Y_test, model5)
# print("Unseen Data Results :: ")
# UDL = accuracy(unseen_df_labelled, X_unseen, Y_unseen, model5)

# LEVEL 2
# 3. Predictions using "model_1"
train_df['P0'], train_df['P1'] = zip(*model.predict_proba(X_train))
test_df['P0'], test_df['P1'] = zip(*model.predict_proba(X_test))
unseen_df_labelled['P0'], unseen_df_labelled['P1'] = zip(*model.predict_proba(X_unseen))

# 4. Select Features: Linguistic + P0, P1 (.copy() so the normalization below does not write into a slice)
X_train_L2, X_test_L2, X_unseen_L2 = train_df[Features_L2].copy(), test_df[Features_L2].copy(), unseen_df_labelled[Features_L2].copy()

# 5. Normalize features using loaded "Normalizer"
X_train_L2[normalize_cols] = Normalizer.transform(X_train_L2[normalize_cols])
X_test_L2[normalize_cols] = Normalizer.transform(X_test_L2[normalize_cols])
X_unseen_L2[normalize_cols] = Normalizer.transform(X_unseen_L2[normalize_cols])

# 6. Predictions using "model_2"
print("\nTest Data Results :: ")
ACCURACY = accuracy(test_df, X_test_L2, Y_test, model2)
print("\nUnseen Data Results :: ")
UNSEEN_DF = accuracy(unseen_df_labelled, X_unseen_L2, Y_unseen, model2)


In [84]:
# Running phase II: chunk identification module over above "UNSEEN_DF"
predict_chunk_lines_PHASE1(UNSEEN_DF)


----

### `Method 2: ` Text + Linguistic via direct concat [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic)

    ARCHITECTURE

    1. BASE Classifier: Text(sparse matrix) + Linguistic(dense matrix) via direct concat --> [Both Sparse] 
    used for training a generic model.
    
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features used for training another generic model.

- Very poor results 
- Discarded

----

### `Method 3: ` Text (DR) + Linguistic via direct concat [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic)

    ARCHITECTURE

    1. BASE Classifier: Text(DR using SVD) + Linguistic(dense matrix) via direct concatenation --> [Both Dense] 
    used for training a generic model.
    
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features used for training another generic model.

- Takes a significant amount of time and gives very poor results
- Discarded

----

### `Method 4: ` Text [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic)

    ARCHITECTURE

    1. BASE Classifier: Text (Count Vectorized) used for training a generic model.
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features used for training another generic model.
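
Count vectorization of a line's NER tags produces one count per tag in a fixed vocabulary. A library-free sketch of that representation follows; the helper names are illustrative, and the notebook itself uses scikit-learn's `CountVectorizer` on the space-joined tags.

```python
from collections import Counter


def fit_vocabulary(tag_lists):
    """Build a sorted vocabulary from the tags seen in the training lines."""
    return sorted({tag for tags in tag_lists for tag in tags})


def count_vectors(tag_lists, vocabulary):
    """One row per line; each column counts one vocabulary tag."""
    vectors = []
    for tags in tag_lists:
        counts = Counter(tags)
        vectors.append([counts[tag] for tag in vocabulary])
    return vectors


train_tags = [["GPE", "ORG", "LOC"], ["ORG", "ORG"], []]
vocab = fit_vocabulary(train_tags)
X_train_counts = count_vectors(train_tags, vocab)
# Tags not seen during fitting (here FAC) are ignored at transform time.
X_unseen_counts = count_vectors([["FAC", "GPE"]], vocab)
print(vocab)
print(X_train_counts)
print(X_unseen_counts)
```

With only a handful of distinct tags the vectors are very short, which may be one reason a text-only base model carries little signal.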

In [85]:
# Loading training, testing, unseen df
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)
unseen_df_labelled = unseendata_load()


In [86]:
# Countvectorizer
Vect = CountVectorizer()
Vect.fit(train_df.LINE_NER.apply(lambda x: " ".join(x)))
X_train_text = Vect.transform(train_df.LINE_NER.apply(lambda x: " ".join(x)))
X_test_text = Vect.transform(test_df.LINE_NER.apply(lambda x: " ".join(x)))
X_unseen_text = Vect.transform(unseen_df_labelled.LINE_NER.apply(lambda x: " ".join(x)))
Y_train, Y_test, Y_unseen = train_df.Y_SN, test_df.Y_SN, unseen_df_labelled.Y_SN

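# Running 2-level contextual classifier using above data
# Note: this call passes X_unseen_text and Y_unseen, which the execute_model signature in
# Helper 2 (level, model, X_train, Y_train, X_test, Y_test) does not accept, so it raises a TypeError as written.
model1 = RandomForestClassifier(n_jobs=-1)
Normalize, model1, model2, unseen_df_labelled = execute_model(2, model1, X_train_text, Y_train, X_test_text, Y_test, X_unseen_text, Y_unseen)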


In [87]:
# Running phase II
predict_chunk_lines_PHASE1(unseen_df_labelled)


- Underfitted results.

----

### `Method 5: ` Text + Linguistic via h-stack [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic)

    ARCHITECTURE

    1. BASE Classifier: Text(CountVectorized) + Linguistic(Numerical) using h-stack used for training a generic model.
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features used for training another generic model.

In [88]:
# Loading training, testing, unseen df
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)
unseen_df_labelled = unseendata_load()

NameError: name 'df_labelled' is not defined

In [89]:
# # word2vec local
# X_train_vector_w2v, X_test_vector_w2v, X_unseen_vector_w2v = X_train_vector_w2v, X_test_vector_w2v, X_unseen_vector_w2v
# X_train_num, X_test_num, X_unseen_num = train_df[Features_NUM], test_df[Features_NUM], unseen_df_labelled[Features_NUM]
# Y_train, Y_test, Y_unseen = train_df.Y_SN, test_df.Y_SN, unseen_df_labelled.Y_SN

# X_train = hstack([X_train_vector_w2v, X_train_num]).tocsr() 
# X_test = hstack([X_test_vector_w2v, X_test_num]).tocsr()
# X_unseen = hstack([X_unseen_vector_w2v, X_unseen_num]).tocsr() 

# model1 = RandomForestClassifier(n_jobs=-1)
# model1 = execute_model(1, model1, X_train, Y_train, X_test, Y_test, X_unseen, Y_unseen)

In [90]:
# # word2vec glove
# X_train_vector_glove, X_test_vector_glove, X_unseen_vector_glove = X_train_vector_glove, X_test_vector_glove, X_unseen_vector_glove
# X_train_num, X_test_num, X_unseen_num = train_df[Features_NUM], test_df[Features_NUM], unseen_df_labelled[Features_NUM]
# Y_train, Y_test, Y_unseen = train_df.Y_SN, test_df.Y_SN, unseen_df_labelled.Y_SN

# X_train = hstack([X_train_vector_glove, X_train_num]).tocsr() 
# X_test = hstack([X_test_vector_glove, X_test_num]).tocsr() 
# X_unseen = hstack([X_unseen_vector_glove, X_unseen_num]).tocsr() 

# model1 = RandomForestClassifier(n_jobs=-1)
# model1 = execute_model(1, model1, X_train, Y_train, X_test, Y_test, X_unseen, Y_unseen)

In [91]:
# # Spacy
# X_train_vector_spacy, X_test_vector_spacy, X_unseen_vector_spacy = X_train_vector_spacy, X_test_vector_spacy, X_unseen_vector_spacy
# X_train_num, X_test_num, X_unseen_num = train_df[Features_NUM], test_df[Features_NUM], unseen_df_labelled[Features_NUM]
# Y_train, Y_test, Y_unseen = train_df.Y_SN, test_df.Y_SN, unseen_df_labelled.Y_SN

# X_train = hstack([X_train_vector_spacy, X_train_num]).tocsr() 
# X_test = hstack([X_test_vector_spacy, X_test_num]).tocsr()
# X_unseen = hstack([X_unseen_vector_spacy, X_unseen_num]).tocsr() 

# model1 = RandomForestClassifier(n_jobs=-1)
# model1 = execute_model(1, model1, X_train, Y_train, X_test, Y_test, X_unseen, Y_unseen)

In [92]:
# LABEL
Y_train, Y_test, Y_unseen = train_df.Y_SN, test_df.Y_SN, unseen_df_labelled.Y_SN

# TEXT
Vect = CountVectorizer()
X_train_text = Vect.fit_transform(train_df.LINE_NER.apply(lambda x: " ".join(x)))
X_test_text = Vect.transform(test_df.LINE_NER.apply(lambda x: " ".join(x)))
X_unseen_text = Vect.transform(unseen_df_labelled.LINE_NER.apply(lambda x: " ".join(x)))

# NUM (copies, so the scaled columns can be assigned without touching the source frames)
X_train_NUM = train_df[Features_NUM].copy()
X_test_NUM = test_df[Features_NUM].copy()
X_unseen_NUM = unseen_df_labelled[Features_NUM].copy()

# Normalize numerical features (scaler is fitted on the training rows only)
Normalize = MinMaxScaler()
X_train_NUM[normalize_cols] = Normalize.fit_transform(X_train_NUM[normalize_cols])
X_test_NUM[normalize_cols] = Normalize.transform(X_test_NUM[normalize_cols])
X_unseen_NUM[normalize_cols] = Normalize.transform(X_unseen_NUM[normalize_cols])

# COMBINE
X_train = hstack([X_train_text, X_train_NUM]).tocsr()
X_test = hstack([X_test_text, X_test_NUM]).tocsr()
X_unseen = hstack([X_unseen_text, X_unseen_NUM]).tocsr()

# Running 2-level contextual classifier using combined features above
model1 = RandomForestClassifier(n_jobs=-1)
Normalize, model1, model2, unseen_df_labelled= execute_model(2, model1, X_train, Y_train, X_test, Y_test, X_unseen, Y_unseen)

NameError: name 'train_df' is not defined

In [93]:
# Running phase II
predict_chunk_lines_PHASE1(unseen_df_labelled)

NameError: name 'unseen_df_labelled' is not defined

- Slightly better results but still underfitted.
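
The h-stack step used in this method joins the sparse count-vectorized text block and the numeric block column-wise. A minimal sketch with made-up matrices (the values and shapes are assumptions chosen only for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical toy inputs: 3 lines, 4 count-vectorized columns, 2 numeric columns.
X_text = csr_matrix(np.array([[1, 0, 2, 0],
                              [0, 1, 0, 0],
                              [0, 0, 0, 3]]))
X_num = csr_matrix(np.array([[0.5, 1.0],
                             [0.0, 0.0],
                             [1.0, 0.25]]))

# hstack places the blocks side by side (rows must match) and returns a COO matrix;
# tocsr() converts it to a format that supports efficient row slicing.
X = hstack([X_text, X_num]).tocsr()
n_rows, n_cols = X.shape
```

The text columns come first and the numeric columns last, so column `4` of `X` here is the first numeric feature.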

----

### `Method 6: ` Linguistic using Ensemble_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic)

    ARCHITECTURE

    1. BASE Classifier: Linguistic(Numerical) features used for training a Voting classifier model(RF, XGB, AB).
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features used for training Voting Classifier Ensemble model(RF, XGB, AB).

- Results were average and over-fitted.

---

### `Method 7: ` Text + Linguistic using Ensemble_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic)

    ARCHITECTURE

    1. BASE Classifier: Text(CountVectorized) + Linguistic(Numerical) via h-stack used for training a Voting classifier.
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features used for training Voting Classifier Ensemble model(RF, XGB, AB).

- Results were average and over-fitted.

-----

................................................................................................................................................................................................................................................
## NN MODELS
................................................................................................................................................................................................................................................

### `Method 8: ` Text using LSTM_CLF

    ARCHITECTURE

    1. BASE Classifier: Text(Normal/Glove/Spacy embeddings) used for training an LSTM Classifier model

In [94]:
# Loading training, testing, unseen df
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)
unseen_df_labelled = unseendata_load()

NameError: name 'df_labelled' is not defined

In [95]:
# Select Features: Tokenized LINES (df.LINES_T)
X_train, X_test, X_unseen = train_df.LINES_T, test_df.LINES_T, unseen_df_labelled.LINES_T
Y_train, Y_test, Y_unseen = train_df.Y_SN, test_df.Y_SN, unseen_df_labelled.Y_SN

NameError: name 'train_df' is not defined

In [96]:
# NN Settings
max_features = 40000
sequence_length = 30
embedding_dim = 300

In [97]:
# Tokenizing
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X_train)
X_train, X_test, X_unseen = tokenizer.texts_to_sequences(X_train), \
                            tokenizer.texts_to_sequences(X_test), \
                            tokenizer.texts_to_sequences(X_unseen)

# Padding the sequences
X_train, X_test, X_unseen = pad_sequences(X_train, padding='post', maxlen=sequence_length), \
                            pad_sequences(X_test, padding='post', maxlen=sequence_length), \
                            pad_sequences(X_unseen, padding='post', maxlen=sequence_length)

NameError: name 'Tokenizer' is not defined
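
What the tokenizing and padding cell does can be illustrated without Keras. The sketch below is a plain-Python approximation under assumptions: the helper names are hypothetical, and ties between equally frequent tokens are broken alphabetically here, which is not how Keras orders them. It does mirror two behaviours of the Keras utilities: ids at or above `num_words` (and unknown tokens) are dropped, and `pad_sequences` truncates from the front by default even when `padding='post'`.

```python
from collections import Counter

def fit_word_index(token_lists):
    """Rank tokens by frequency; index 0 is reserved for padding."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i + 1 for i, tok in enumerate(ranked)}

def to_sequences(token_lists, word_index, num_words):
    """Map tokens to ids, dropping unknown tokens and ids >= num_words."""
    return [[word_index[t] for t in toks
             if t in word_index and word_index[t] < num_words]
            for toks in token_lists]

def pad_post(seqs, maxlen, value=0):
    """Pad at the end; over-long sequences keep their LAST maxlen ids."""
    out = []
    for s in seqs:
        s = s[-maxlen:]
        out.append(s + [value] * (maxlen - len(s)))
    return out

train = [["abc", "ltd"], ["abc", "inc", "india"], ["ltd"]]
word_index = fit_word_index(train)
seqs = to_sequences([["abc", "ltd", "unknown"], ["india", "inc", "abc", "ltd"]],
                    word_index, num_words=4)
padded = pad_post(seqs, maxlen=3)      # short sequence is right-padded with 0
truncated = pad_post(seqs, maxlen=2)   # long sequence loses its first id
```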

In [98]:
# Sampling (oversample the minority class)
sampler = SMOTE(sampling_strategy='minority')
X_train, Y_train = sampler.fit_resample(X_train, Y_train)

NameError: name 'SMOTE' is not defined

In [99]:
# Training Labels
Y_train = to_categorical(Y_train)

NameError: name 'to_categorical' is not defined

In [100]:
# Unique Tokens
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

NameError: name 'tokenizer' is not defined

In [101]:
# Unique Tokens + 1 for Embedding Layer
num_words = min(max_features, len(word_index)) + 1
print(num_words)

NameError: name 'word_index' is not defined

In [102]:
# Creating EMBEDDING LAYER ....

# Creating an initial matrix of zeros
embedding_matrix = np.zeros((num_words, embedding_dim))

# >> GLOVE EMBEDDINGS <<
# embedding_glove = KeyedVectors.load_word2vec_format('Models/glove.6B.300d.txt.word2vec', binary=False)
# # for each word in our tokenizer, try to find that word in the local/pretrained embeddings model
# for word, i in word_index.items():
#     if i > max_features:
#         continue
#     if word in embedding_glove:
#         # we found the word - add that word's vector to the matrix
#         embedding_vector = embedding_glove[word]
#         embedding_matrix[i] = embedding_vector
#     else:
#         # doesn't exist, assign a random vector
#         embedding_matrix[i] = np.random.randn(embedding_dim)

# >> SPACY EMBEDDINGS <<
nlp = spacy.load('en_core_web_lg')
# for each word in our tokenizer, look up its vector in the spaCy model
for word, i in word_index.items():
    if i > max_features:
        continue
    embedding_vector = nlp(word).vector
    embedding_matrix[i] = embedding_vector

NameError: name 'num_words' is not defined
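
The embedding-matrix construction above can be tried on a tiny scale with a dictionary standing in for the GloVe or spaCy lookup. This is a sketch under assumptions: the 3-dimensional vectors, the `build_embedding_matrix` helper and the fixed seed are hypothetical. Row 0 stays all-zero because index 0 is the padding id, which is why `num_words` carries the `+ 1`.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, max_features, embedding_dim, seed=7):
    """Row i holds the vector of the word with index i; row 0 stays zero (padding)."""
    rng = np.random.RandomState(seed)
    num_words = min(max_features, len(word_index)) + 1
    matrix = np.zeros((num_words, embedding_dim))
    for word, i in word_index.items():
        if i >= num_words:
            continue  # index falls outside the matrix
        if word in vectors:
            matrix[i] = vectors[word]
        else:
            # Random fallback for words without a pretrained vector.
            matrix[i] = rng.randn(embedding_dim)
    return matrix

# Hypothetical 3-dimensional vectors standing in for pretrained embeddings.
vectors = {"abc": np.array([1.0, 0.0, 0.0]), "ltd": np.array([0.0, 1.0, 0.0])}
word_index = {"abc": 1, "ltd": 2, "zzz": 3}
matrix = build_embedding_matrix(word_index, vectors, max_features=40000, embedding_dim=3)
```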

In [103]:
# NN MODEL
model = Sequential()

model.add(Embedding(num_words, embedding_dim, embeddings_initializer=Constant(embedding_matrix), input_length=sequence_length, trainable=True))
model.add(Bidirectional(LSTM(128)))
model.add(Dropout(0.25))
model.add(Dense(2, activation='softmax'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['accuracy'])

NameError: name 'Sequential' is not defined
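
The model above pairs a 2-unit softmax output with `binary_crossentropy`. For one-hot labels and exactly two softmax outputs, the per-unit binary cross-entropy averaged over both units equals the categorical cross-entropy, so the two losses give the same value. The plain-Python check below illustrates this; the function names are hypothetical, and the probability clipping that Keras applies for numerical stability is ignored.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean over output units of the per-unit binary cross-entropy."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0]   # one-hot label, as produced by to_categorical
y_pred = [0.3, 0.7]   # softmax output: the two entries sum to 1
bce = binary_crossentropy(y_true, y_pred)
cce = categorical_crossentropy(y_true, y_pred)
```

The equality only holds for two classes; with more output units `categorical_crossentropy` is the conventional pairing for a softmax layer.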

In [104]:
model.summary()

NameError: name 'model' is not defined

In [105]:
history = model.fit(X_train, Y_train, batch_size=5000, epochs=10, verbose=1)

NameError: name 'model' is not defined

In [106]:
ACC = accuracy(test_df, X_test, Y_test, model)
ACC = accuracy(unseen_df_labelled, X_unseen, Y_unseen, model)

NameError: name 'test_df' is not defined

- Results were average and under-fitted.

----

### `Method 9: ` Text using LSTM_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic)

    ARCHITECTURE

    1. BASE Classifier: Text(Normal/Glove/Spacy embeddings) used for training an LSTM Classifier model
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features used for training a Dense NN model

- Results were under-fitted and poor.

----

### `Method 10: ` Text using LSTM_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic + Text)

    ARCHITECTURE

    1. BASE Classifier: Text(Normal/Glove/Spacy embeddings) used for training an LSTM Classifier model
    2. 2nd  Classifier: Text(CountVectorized) + Linguistic Features + Prob(0,1) via h-stack for training a Dense NN model.

- Results were under-fitted and poor.

----

### `Method 11: ` Text - Keras ELMo Embeddings using LSTM_CLF [BASE]

    ARCHITECTURE

    1. BASE Classifier: Text in ELMo embeddings (TensorFlow Hub module wrapped in a Keras Lambda layer) used for training an LSTM Classifier model

- Results could not be determined as embedding creation took more than 2 days of training. Rejected the method mid-way.

In [107]:
# Loading training, testing, unseen df
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)
unseen_df_labelled = unseendata_load()

NameError: name 'df_labelled' is not defined

In [108]:
# ELMO
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

NameError: name 'hub' is not defined

In [109]:
# just a random sentence
x = ["Roasted ants are a popular snack in Colombia"]

# Extract ELMo features 
embeddings = elmo(x, signature="default", as_dict=True)["elmo"]

embeddings.shape

NameError: name 'elmo' is not defined

In [110]:
# get elmo from tensorflow hub
embed = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

# ELMo Embedding
def ELMoEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]

NameError: name 'hub' is not defined

In [111]:
Input_Numeric = Input(shape=(X_train_num.values.shape[1],), name='Input_Numeric')
Input_Text = Input(shape=(sequence_length,), name='Input_Text')

# Text layer
layer_Embedding = Lambda(ELMoEmbedding, output_shape=(1024,))(Input_Text)
layer_LSTM = Bidirectional(LSTM(64, dropout=0.3, recurrent_dropout=0.3, kernel_regularizer=regularizers.l2(0.01)))(layer_Embedding)      
layer_LSTM_norm = BatchNormalization()(layer_LSTM)

# Num layer
layer_Dense1 = Dense(128, activation='relu')(Input_Numeric)
#layer_Dense2 = Dense(64, activation='relu')(layer_Dense1)

# layer_Dense2 is commented out above, so the numeric branch is joined via layer_Dense1
x = concatenate([layer_LSTM_norm, layer_Dense1])
x = Dense(32, activation='relu')(x)
x = Dense(2, activation='softmax')(x)       

model = Model(inputs=[Input_Text, Input_Numeric] , outputs=[x])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

NameError: name 'Input' is not defined

In [112]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())  
    session.run(tf.tables_initializer())
    class_weights = class_weight.compute_class_weight('balanced', np.unique(Y_train), Y_train)
    history = model.fit([X_train_text, X_train_num], Y_train_final, 
                        batch_size=5000, epochs=5, verbose=1, class_weight=class_weights)
    model.save_weights('Models/Response-elmo-model.h5')

NameError: name 'tf' is not defined

In [113]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    
    model.load_weights('Models/Response-elmo-model.h5')
    predicts = model.predict([X_test_text, X_test_num], batch_size=16)

NameError: name 'tf' is not defined

----

### `Method 12: ` Linguistic features only using a Dense NN_CLF

    ARCHITECTURE

    1. BASE Classifier: Linguistic(Numerical) features used for training a Dense NN Model

- Results were under-fitted and poor.

----

### `Method 13: ` Linguistic using NN_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic)

    ARCHITECTURE

    1. BASE Classifier: Linguistic(Numerical) features used for training a Dense NN Model
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features used for training a Dense Model

- Results were under-fitted and poor.

----

### `Method 14: ` Linguistic using NN_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic + Text)

    ARCHITECTURE

    1. BASE Classifier: Linguistic(Numerical) features used for training a Dense NN Model
    2. 2nd  Classifier: Text(CountVectorized) + Linguistic Features + Prob(0,1) via h-stack used for training a Dense Model

- Results were under-fitted and poor.

----

### `Method 15: ` Text + Linguistic via <ins>h-stack</ins> using NN_CLF

    ARCHITECTURE

    1. BASE Classifier: Text(CountVectorized) + Linguistic(Numerical) via h-stack used for training a Dense NN model

- Results were over-fitted and poor.

In [114]:
# Loading training, testing, unseen df
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)
unseen_df_labelled = unseendata_load()

NameError: name 'df_labelled' is not defined

In [115]:
# LABEL
Y_train, Y_test, Y_unseen = train_df.Y_SN, test_df.Y_SN, unseen_df_labelled.Y_SN

# TEXT
Vect = CountVectorizer()
X_train_text = Vect.fit_transform(train_df.LINE_NER.apply(lambda x: " ".join(x)))
X_test_text = Vect.transform(test_df.LINE_NER.apply(lambda x: " ".join(x)))
X_unseen_text = Vect.transform(unseen_df_labelled.LINE_NER.apply(lambda x: " ".join(x)))

# NUM
X_train_NUM, X_test_NUM, X_unseen_NUM = train_df[Features_NUM], test_df[Features_NUM], unseen_df_labelled[Features_NUM]

# Normalize numerical features
Normalize = MinMaxScaler()
X_train_NUM[normalize_cols] =  Normalize.fit_transform(X_train_NUM[normalize_cols])
X_test_NUM[normalize_cols] = Normalize.transform(X_test_NUM[normalize_cols])
X_unseen_NUM[normalize_cols] = Normalize.transform(X_unseen_NUM[normalize_cols])

# Combine Text + Num using h-stack
X_train = hstack([X_train_text, X_train_NUM]).tocsr() 
X_test = hstack([X_test_text, X_test_NUM]).tocsr()
X_unseen = hstack([X_unseen_text, X_unseen_NUM]).tocsr()

# Sampling combined features
Sampler = SMOTE(sampling_strategy='minority')
X_train_final, Y_train_final = Sampler.fit_resample(X_train, Y_train)

NameError: name 'train_df' is not defined

In [116]:
# Conversion to categorical values
Y_train_final = to_categorical(Y_train_final)

NameError: name 'to_categorical' is not defined

In [117]:
# DENSE NN MODEL
Input_Numeric = Input(shape=(X_train.shape[1],))

layer_Dense1 = Dense(128, activation='relu')(Input_Numeric)
layer_Dense2 = Dense(64, activation='relu')(layer_Dense1)
layer_Dense3 = Dense(2, activation='softmax')(layer_Dense2)

model = Model(inputs=Input_Numeric, outputs=layer_Dense3)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

NameError: name 'Input' is not defined

In [118]:
history = model.fit(X_train_final, Y_train_final, batch_size=256, epochs=20, verbose=1)

NameError: name 'model' is not defined

In [119]:
ACC = accuracy(test_df, X_test, Y_test, model)
unseen_df_labelled = accuracy(unseen_df_labelled, X_unseen, Y_unseen, model)

NameError: name 'test_df' is not defined

In [120]:
predict_chunk_lines_PHASE1(unseen_df_labelled)[:40]

NameError: name 'unseen_df_labelled' is not defined

----

### `Method 16: ` Text + Linguistic via h-stack using NN_CLF [BASE] + 2nd-level Ensemble_CLF(Features: Base(Probs) + Linguistic)

    ARCHITECTURE

    1. BASE Classifier: Text(CountVectorized) + Linguistic(numerical) via h-stack used for training a Dense NN Model
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features used for training a Dense Model

- Results were under-fitted and poor.

----

### `Method 17: ` Text + Linguistic via 2-input Channel using LSTM_CLF

    ARCHITECTURE

    1. BASE Classifier: Text(df.LINE_NER) + Linguistic(numerical) in a 2-channel input bi-LSTM network.

- Results are the best in terms of recall; this method is used in the final architecture!

#### Load df

In [121]:
# Loading training, testing, unseen df
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)
unseen_df_labelled = unseendata_load()

NameError: name 'df_labelled' is not defined

#### Select Features

In [122]:
# TEXT (using df.LINE_NER --> a "line" in terms of context NER-GPE tags)
X_train_text, X_test_text, X_unseen_text = train_df.LINE_NER, test_df.LINE_NER, unseen_df_labelled.LINE_NER

# NUM
X_train_num, X_test_num, X_unseen_num = train_df[Features_NUM], test_df[Features_NUM], unseen_df_labelled[Features_NUM]

# LABELS
Y_train, Y_test, Y_unseen = train_df.Y_SN, test_df.Y_SN, unseen_df_labelled.Y_SN

NameError: name 'train_df' is not defined

#### 1. Text layer

In [123]:
# NN Settings
max_features = 5
sequence_length = 6
embedding_dim = 6

In [124]:
# Tokenizer
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X_train_text)
X_train_text, X_test_text, X_unseen_text = tokenizer.texts_to_sequences(X_train_text), tokenizer.texts_to_sequences(X_test_text), tokenizer.texts_to_sequences(X_unseen_text)

# Padding sequences (to make same len)
X_train_text, X_test_text, X_unseen_text = pad_sequences(X_train_text, padding='post', maxlen=sequence_length), pad_sequences(X_test_text, padding='post', maxlen=sequence_length), pad_sequences(X_unseen_text, padding='post', maxlen=sequence_length)

NameError: name 'Tokenizer' is not defined

In [125]:
# Unique Tokens
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

NameError: name 'tokenizer' is not defined

In [126]:
# Unique Tokens + 1 for Embedding Layer
num_words = min(max_features, len(word_index)) + 1
print(num_words)

NameError: name 'word_index' is not defined

In [127]:
# Sampling
sampler = SMOTE(sampling_strategy='minority')
X_train_text, Y_train_final = sampler.fit_resample(X_train_text, Y_train)

NameError: name 'SMOTE' is not defined

#### 2. Numerical Layer

In [128]:
# Normalize features
N = MinMaxScaler()
X_train_num[normalize_cols], X_test_num[normalize_cols], X_unseen_num[normalize_cols] = N.fit_transform(X_train_num[normalize_cols]), N.transform(X_test_num[normalize_cols]), N.transform(X_unseen_num[normalize_cols])

NameError: name 'X_train_num' is not defined
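
The normalization step fits the scaler on the training rows and reuses the learned minimum and range for the test and unseen rows. A small NumPy sketch of that min-max formula, with hypothetical helper names and toy values, shows that held-out values may land outside `[0, 1]` when they exceed the training range:

```python
import numpy as np

def fit_min_max(train):
    """Learn per-column minimum and range from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid division by zero for constant columns
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with statistics learned elsewhere: (x - min) / (max - min)."""
    return (data - col_min) / col_range

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])        # second value lies outside the training range
col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
```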

In [129]:
# Sampling
Sampler = SMOTE(sampling_strategy='minority')
X_train_num, Y_train_final_2 = Sampler.fit_resample(X_train_num, Y_train)

NameError: name 'SMOTE' is not defined
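
The text and numeric inputs are resampled above by two independent SMOTE calls, so the synthetic rows of one input are not guaranteed to correspond row-for-row to the synthetic rows of the other, and interpolated token ids are not meaningful tokens. One possible alternative, shown here as a sketch and not as the notebook's method, is plain random oversampling (a different technique from SMOTE): a single set of row indices is drawn and applied to both inputs and the labels. The helper name and toy arrays are hypothetical.

```python
import numpy as np

def oversample_indices(y, seed=7):
    """Return row indices that repeat minority rows until all classes are balanced."""
    rng = np.random.RandomState(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        # Draw the missing rows with replacement from this class only.
        extra = rng.choice(idx, size=target - n, replace=True)
        parts.append(np.concatenate([idx, extra]))
    return np.concatenate(parts)

# Hypothetical toy inputs: 5 rows, two feature blocks that must stay aligned.
X_text = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
X_num = np.array([[0.1], [0.2], [0.3], [0.4], [0.5]])
y = np.array([0, 0, 0, 0, 1])

idx = oversample_indices(y)
X_text_bal, X_num_bal, y_bal = X_text[idx], X_num[idx], y[idx]
```

Because the same `idx` is used for every array, each repeated minority row keeps its own text sequence, numeric features and label together.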

In [130]:
# Training labels
Y_train_final = to_categorical(Y_train_final)

NameError: name 'to_categorical' is not defined

#### Model

In [131]:
# finetuning
Input_Numeric = Input(shape=(X_train_num.values.shape[1],), name='Input_Numeric')
Input_Text = Input(shape=(sequence_length,), name='Input_Text')

# Text layer
layer_Embedding = Embedding(num_words, embedding_dim, trainable=True)(Input_Text)
layer_LSTM = Bidirectional(LSTM(32, dropout=0.25, recurrent_dropout=0.25, kernel_regularizer=regularizers.l2(0.01)))(layer_Embedding)

# Num layer
layer_Dense1 = Dense(256, activation='relu')(Input_Numeric)
layer_Dropout1 = Dropout(0.25)(layer_Dense1)

x = concatenate([layer_LSTM, layer_Dropout1])
x = Dense(256, activation='relu')(x)
x = Dense(2, activation='softmax')(x)       

model = Model(inputs=[Input_Text, Input_Numeric] , outputs=[x])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

NameError: name 'Input' is not defined

In [132]:
%%time
class_weights = class_weight.compute_class_weight('balanced', np.unique(Y_train), Y_train)

history = model.fit([X_train_text, X_train_num], Y_train_final, 
                    batch_size=256, epochs=30, verbose=1, class_weight=class_weights)

modelNN = model

ACC = accuracy(test_df, [X_test_text, X_test_num], Y_test, model)
unseen_df_labelled = accuracy(unseen_df_labelled, [X_unseen_text, X_unseen_num], Y_unseen, model)

NameError: name 'Y_train' is not defined
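
The `'balanced'` heuristic used for `class_weights` assigns each class the weight `n_samples / (n_classes * count[class])`. The sketch below reproduces that formula in plain Python and returns a dictionary keyed by class index, which is the form Keras documents for the `class_weight` argument of `fit`; the helper name and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c]), as a dict keyed by class."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # hypothetical imbalanced labels
weights = balanced_class_weights(y)
```

Note that `Y_train` in the cell above is the label vector before SMOTE, so the weights reflect the original imbalance even though the training rows were already oversampled; whether that double correction is intended is not stated in the notebook.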

#### Predictions

In [133]:
# Running phase II: chunk identification module over above "unseen_df_labelled"
predict_chunk_lines_PHASE1(unseen_df_labelled)

NameError: name 'unseen_df_labelled' is not defined

In [134]:
# Running phase II: chunk identification module over the new testing df "ahdf"
ahdf = create_AH_testdf(ACC)
predict_chunk_lines_PHASE1(ahdf)

NameError: name 'ACC' is not defined

## `BEST NN METHOD USE-CASE`

In [135]:
# # Saving 

# pickle.dump(modelNN, open("Models/NN_TextNum_2SepLayers_30epochs_21052020/ModelNN_21052020.pickle", 'wb'))
# pickle.dump(tokenizer, open("Models/NN_TextNum_2SepLayers_30epochs_21052020/Tokenizer_21052020.pickle", 'wb'))
# pickle.dump(N, open("Models/NN_TextNum_2SepLayers_30epochs_21052020/Normalizer_21052020.pickle", 'wb'))

In [136]:
# # Loading Best NN Model

# modelNN = pickle.load(open("Models/NN_TextNum_2SepLayers_30epochs_21052020/ModelNN_21052020.pickle", 'rb'))
# tokenizer = pickle.load(open("Models/NN_TextNum_2SepLayers_30epochs_21052020/Tokenizer_21052020.pickle", 'rb'))
# N = pickle.load(open("Models/NN_TextNum_2SepLayers_30epochs_21052020/Normalizer_21052020.pickle", 'rb'))

### Experimentation on fine-tuning above network

#### 1. Finetuning - Alternative Model I

In [137]:
unseen_df_labelled = unseendata_load()
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)

NameError: name 'SAVED_test_data_path' is not defined

In [138]:
# finetuning
Input_Numeric = Input(shape=(X_train_num.values.shape[1],), name='Input_Numeric')
Input_Text = Input(shape=(sequence_length,), name='Input_Text')

# Text layer
layer_Embedding = Embedding(num_words, embedding_dim, trainable=True)(Input_Text)
layer_LSTM = Bidirectional(LSTM(32, dropout=0.25, recurrent_dropout=0.25, kernel_regularizer=regularizers.l2(0.01)))(layer_Embedding)

# Num layer
layer_Dense1 = Dense(256, activation='relu')(Input_Numeric)
layer_Dropout1 = Dropout(0.25)(layer_Dense1)

x = concatenate([layer_LSTM, layer_Dropout1])
x = Dense(256, activation='relu')(x)
x = Dense(2, activation='softmax')(x)       

model = Model(inputs=[Input_Text, Input_Numeric] , outputs=[x])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

NameError: name 'Input' is not defined

In [139]:
%%time
class_weights = class_weight.compute_class_weight('balanced', np.unique(Y_train), Y_train)

history = model.fit([X_train_text, X_train_num], Y_train_final, 
                    batch_size=256, epochs=30, verbose=1, class_weight=class_weights)

ACC = accuracy(test_df, [X_test_text, X_test_num], Y_test, model)
unseen_df_labelled_1 = accuracy(unseen_df_labelled, [X_unseen_text, X_unseen_num], Y_unseen, model)

NameError: name 'Y_train' is not defined

In [140]:
predict_chunk_lines_PHASE1(unseen_df_labelled_1)

NameError: name 'unseen_df_labelled_1' is not defined

#### 2. Finetuning - Alternative Model II

In [141]:
unseen_df_labelled = unseendata_load()
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)

NameError: name 'SAVED_test_data_path' is not defined

In [142]:
Input_Numeric = Input(shape=(X_train_num.values.shape[1],), name='Input_Numeric')
Input_Text = Input(shape=(sequence_length,), name='Input_Text')

# Text layer
layer_Embedding = Embedding(num_words, embedding_dim, trainable=True)(Input_Text)
layer_Conv1D1 = Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(layer_Embedding)
layer_pool1 = MaxPooling1D(pool_size=2)(layer_Conv1D1)
layer_LSTM = Bidirectional(LSTM(64, dropout=0.3, recurrent_dropout=0.3, kernel_regularizer=regularizers.l2(0.01)))(layer_pool1)

# Num layer
layer_Dense1 = Dense(128, activation='relu')(Input_Numeric)
layer_Dense2 = Dense(64, activation='relu')(layer_Dense1)
layer_Droput1 = Dropout(0.2)(layer_Dense2)

x = concatenate([layer_LSTM, layer_Droput1])
x = Dense(256, activation='relu')(x)
x = Dense(64, activation='relu')(x)
x = Dense(2, activation='softmax')(x)       

model = Model(inputs=[Input_Text, Input_Numeric] , outputs=[x])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

NameError: name 'Input' is not defined

In [143]:
%%time
class_weights = class_weight.compute_class_weight('balanced', np.unique(Y_train), Y_train)

history = model.fit([X_train_text, X_train_num], Y_train_final, 
                    batch_size=256, epochs=30, verbose=1, class_weight=class_weights)

ACC = accuracy(test_df, [X_test_text, X_test_num], Y_test, model)
unseen_df_labelled_2 = accuracy(unseen_df_labelled, [X_unseen_text, X_unseen_num], Y_unseen, model)

NameError: name 'Y_train' is not defined

In [144]:
predict_chunk_lines_PHASE1(unseen_df_labelled_2)

NameError: name 'unseen_df_labelled_2' is not defined

----

Different NN MODELS Finetuning ANALYSIS

-----

### `EXPERIMENTAL METHOD 18: ` TEXT(NER -> Clustering -> OHE) + LINGUISTIC via h-stack

#### Text Vector Representation in Clusters

- This feature uses the KMeans clustering algorithm to represent each line as a vector of word-cluster IDs. Each line is tokenized, and the tokens are used to train a Word2vec model, which provides the vocabulary and its word vectors.
- KMeans is fitted on these word vectors, so that every word in a line can be assigned a cluster. Using the elbow method, the number of clusters was chosen as 5.
- The cluster IDs of the tokens in a line are then padded to a fixed length and one-hot encoded into vector format.
- New words that are not present in the Word2vec vocabulary are assigned a default cluster, such as Cn.

...

if word in word2vec.wv.vocab:
    # calculate the Word2vec vector -> KM.predict(vector)

else:
    # SKIP the Word2vec vector and put the word directly in the cluster where ORG (i.e. NAME) is present (let's suppose CN)
    km_label = CN
    
...
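
The steps listed above (cluster lookup with a default for unseen words, padding, one-hot encoding, flattening) can be sketched in plain Python. The word-to-cluster table is a hypothetical stand-in for `km.predict` on Word2vec vectors, and the helper names are assumptions; the default cluster `4`, the padding value `-1` and the length `10` follow the sample cells further down.

```python
DEFAULT_CLUSTER = 4  # default cluster for words missing from the vocabulary

def to_clusters(tokens, word_to_cluster):
    """Look up each token's cluster; unseen words fall back to the default."""
    return [word_to_cluster.get(tok, DEFAULT_CLUSTER) for tok in tokens]

def pad_post(labels, maxlen, value=-1):
    """Keep the last maxlen labels, then right-pad with the padding value."""
    labels = labels[-maxlen:]
    return labels + [value] * (maxlen - len(labels))

def one_hot(label, n_clusters=5):
    """Cluster k becomes a unit vector; the padding value -1 becomes all zeros."""
    return [1 if k == label else 0 for k in range(n_clusters)]

# Hypothetical word -> cluster table.
word_to_cluster = {"abc": 1, "ltd": 2, "is": 3, "located": 3, "in": 3, "india": 1}
tokens = ["abc", "ltd", "is", "located", "in", "india", "newword"]

line_clustered = to_clusters(tokens, word_to_cluster)
line_padded = pad_post(line_clustered, maxlen=10)
line_ohe = [one_hot(label) for label in line_padded]
line_vector = [bit for row in line_ohe for bit in row]   # length = maxlen * n_clusters
```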

<ins>Description</ins>
   
    Line            =  "ABC Ltd is located in India"
    Line_tokenized  =  ["ABC", "Ltd", "is", "located", "in", "India"]
    Line_word2vec   =  [embed("ABC"), embed("Ltd"), …, embed("India")]
    Line_clustered  =  [c1, c2, c3, c3, c3, c1]
    Line_padded     =  [1, 2, 3, 3, 3, 1, -1, -1, -1, …, -1]
    Line_OHE        =  [[0 1 0 0 0], [0 0 1 0 0], [0 0 0 1 0], …, [0 0 0 0 0]]
    Line_Vector     =  [0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 … 0 0 0 0 0]

#### SAMPLE

In [145]:
# Load sample of df
a = df_labelled[:10].copy()

NameError: name 'df_labelled' is not defined

In [146]:
def Preprocess(doc):
    doc = doc.strip()
    # Raw strings keep the regex escapes intact (no invalid-escape warnings)
    doc = re.sub(r"\s+", " ", re.sub(r"[^A-z0-9\r\t\n\,\.\?\!\;\:\’\'\"\-\_\`\@\(\)\[\]\{\}\#]", " ", doc).strip())
    doc = re.sub(r"\s+", " ", re.sub(r"[^A-z0-9\&\-\.\,]", " ", doc)).strip()
    doc = nlp(doc)
    # Replace ORG / GPE tokens with their entity tag, e.g. __ORG__
    line = " ".join(["__" + str(token.ent_type_) + "__" if token.ent_type_ in ('ORG', 'GPE') else token.text for token in doc])
    # "_" lies inside the A-z range, so the __ORG__ / __GPE__ markers survive this step
    line = re.sub(r"\s+", " ", re.sub(r"[^A-z0-9\&]", " ", line)).strip()
    # Replace runs of word characters that contain an inner digit with __D__
    line = re.sub(r"\s+", " ", re.sub(r"\w+\d+\w+", "__D__", line)).strip()
    return word_tokenize(line.lower())

# Preprocess lines
a['LINES_T'] = a.LINES.apply(Preprocess)

NameError: name 'a' is not defined
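
The character class `A-z` used in `Preprocess` is wider than the letters alone: in ASCII, six punctuation characters sit between `Z` and `a`, and all of them are matched. The check below shows this, and also why the `__ORG__` markers survive the later clean-up step only because `_` is one of those six. The sample string is made up for illustration.

```python
import re

# Characters between 'Z' (90) and 'a' (97) in ASCII: [ \ ] ^ _ `
between = "".join(chr(code) for code in range(ord("Z") + 1, ord("a")))

# All six are matched by [A-z] ...
matched_wide = re.findall(r"[A-z]", between)
# ... but none by the explicit letter ranges.
matched_strict = re.findall(r"[A-Za-z]", between)

# Effect on the tag-preserving step of Preprocess.
wide = re.sub(r"[^A-z0-9&]", " ", "__ORG__ pvt. ltd.")
strict = re.sub(r"[^A-Za-z0-9&]", " ", "__ORG__ pvt. ltd.")
```

If the ranges were ever tightened to `A-Za-z`, the underscore would have to be listed explicitly in the third pattern to keep the entity markers intact.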

In [147]:
a.LINES_T

NameError: name 'a' is not defined

In [148]:
# Training data
a_train = a.LINES_T

NameError: name 'a' is not defined

In [149]:
# Word2vec Vectorize
sentences = a_train.tolist()
word2vec = Word2Vec(sentences, size=300, min_count=1)
a_words = word2vec.wv[word2vec.wv.vocab]

NameError: name 'a_train' is not defined

In [150]:
# # ELBOW - calculate distortion for a range of number of cluster
# distortions = []
# for i in range(1, 15):
#     km = KMeans(n_clusters=i, init='random', n_init=30, max_iter=500, tol=1e-04, random_state=7, n_jobs=-1)
#     km.fit(a_words)
#     distortions.append(km.inertia_)
# # plot
# plt.plot(range(1, 15), distortions, marker='o')
# plt.xlabel('Number of clusters')
# plt.ylabel('Distortion')
# plt.show()

In [151]:
# KMeans Clustering (n_clusters=5)
km = KMeans(n_clusters=5, init='random', n_init=30, max_iter=500, tol=1e-04, random_state=7, n_jobs=-1)
km.fit(a_words)

# Display of clusters: one column per cluster, padded with NaN to equal length
y_km = km.predict(a_words)
words = list(word2vec.wv.vocab)
clusters = [[] for _ in range(5)]
for word, label in zip(words, y_km):
    clusters[label].append(word)
max_len = max(len(c) for c in clusters)
clusters = [c + [np.nan] * (max_len - len(c)) for c in clusters]
column_names = ['C0_abbrv', 'C1_units', 'C2_company', 'C3_Prepos', 'C4_Names']
cluster_df = pd.DataFrame(dict(zip(column_names, clusters)))

NameError: name 'KMeans' is not defined

In [152]:
cluster_df

NameError: name 'cluster_df' is not defined

In [153]:
# Vectorization
def kmeans_transform(X):
    doc_vector = []
    for doc in X:  
        word_vector_doc = []  
        if doc != []:
            for word in doc:    
                if word in word2vec.wv:
                    word_vector_doc.append(km.predict([word2vec.wv[word]])[0])
                else:
                    word_vector_doc.append(4)
            doc_vector.append(word_vector_doc)
        else:
            doc_vector.append([])     
    return doc_vector

# KMeans vector
a_train_vector = kmeans_transform(a_train.tolist())
print("Kmeans Vector = ", a_train_vector[5])

# Padding
a_train_vector_padded = pad_sequences(a_train_vector, padding='post', value=-1, maxlen=10)
print("\nKmeans Vector Padded = ", a_train_vector_padded[5])

# OneHotEncoding Kmeans padded vector
d = {0:[1,0,0,0,0], 1:[0,1,0,0,0], 2:[0,0,1,0,0], 3:[0,0,0,1,0], 4:[0,0,0,0,1], -1:[0,0,0,0,0]}
a_train_vector_padded_OHE = [[d[b] for b in i] for i in a_train_vector_padded]
print("\nOneHotEncoder - Kmeans Vector Padded: \n", a_train_vector_padded_OHE[5])

# Final OHE Vector
a_train_vector_padded_OHE_vector = np.array(list(map(lambda x: sum(x, []), a_train_vector_padded_OHE)))
print("\nOHE Kmeans Vector: \n", a_train_vector_padded_OHE_vector[5])

NameError: name 'a_train' is not defined

In [154]:
# Shape = padding(pad) x 1hotencoding(5clusters) = pad x 5 dim
a_train_vector_padded_OHE_vector.shape

NameError: name 'a_train_vector_padded_OHE_vector' is not defined

In [155]:
a_train_vector_padded_OHE_vector[5:6]

NameError: name 'a_train_vector_padded_OHE_vector' is not defined

#### Executing for df

In [156]:
# Load df
df_cluster = df_labelled[:10000].copy()

NameError: name 'df_labelled' is not defined

In [157]:
%%time
def Preprocess(doc):
    doc = doc.strip()
    doc = re.sub(r"\s+", " ", re.sub("[^A-z0-9\r\t\n\,\.\?\!\;\:\’\'\"\-\_\`\@\(\)\[\]\{\}\#]", " ", doc).strip())
    doc = re.sub(r"\s+", " ", re.sub(r"[^A-z0-9\&\-\.\,]", " ", doc)).strip()
    doc = nlp(doc)
    line = " ".join(["__" + str(token.ent_type_) + "__" if token.ent_type_ == 'ORG' or token.ent_type_ == 'GPE' else token.text for token in doc])
    line = re.sub(r"\s+", " ", re.sub(r"[^A-z0-9\&]", " ", line)).strip()
    line = re.sub(r"\s+", " ", re.sub(r"\w+\d+\w+", "__D__", line)).strip()
    return word_tokenize(line.lower())

# Preprocess lines
df_cluster['LINES_T'] = df_cluster.LINES.apply(Preprocess)

NameError: name 'df_cluster' is not defined

In [158]:
# Training and Testing data
train_df, test_df = train_test_split(df_cluster, shuffle=True, random_state=7, stratify=df_cluster.Y_SN)
X_train, X_test = train_df.LINES_T, test_df.LINES_T
Y_train, Y_test = train_df.Y_SN, test_df.Y_SN

NameError: name 'df_cluster' is not defined

In [159]:
# Word2vec Vectorize
sentences = X_train.tolist()
word2vec = Word2Vec(sentences, size=300, min_count=1)
X_train_words = word2vec.wv[word2vec.wv.vocab]

NameError: name 'X_train' is not defined

In [160]:
%%time
# Elbow Method
distortions = []
for i in range(1, 20):
    km = KMeans(n_clusters=i, init='random', n_init=30, max_iter=500, tol=1e-04, random_state=7, n_jobs=-1)
    km.fit(X_train_words)
    distortions.append(km.inertia_)
# plot
plt.plot(range(1, 20), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()

NameError: name 'KMeans' is not defined

- SELECTED: n_clusters = 5
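
The value above was read off the elbow plot by eye. As a rough cross-check, one could pick the `k` where the distortion curve bends most sharply, measured by the largest discrete second difference. The sketch below is only an illustration: `pick_elbow` is a hypothetical helper, not part of this notebook, and the distortion values are made up rather than taken from the KMeans run.

```python
def pick_elbow(ks, distortions):
    """Return the k at which the distortion curve bends most sharply.

    Uses the discrete second difference d[i-1] - 2*d[i] + d[i+1].
    The first and last k have no second difference and are never selected.
    """
    if len(ks) != len(distortions) or len(ks) < 3:
        raise ValueError("need at least three (k, distortion) pairs")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = distortions[i - 1] - 2 * distortions[i] + distortions[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical inertia values for k = 1..8 (illustrative only).
example_ks = list(range(1, 9))
example_distortions = [1000.0, 700.0, 500.0, 380.0, 200.0, 180.0, 165.0, 155.0]
print(pick_elbow(example_ks, example_distortions))
```

A heuristic like this is sensitive to noise in the curve, so it is best treated as a sanity check on the visual choice rather than a replacement for it.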

In [161]:
%%time
# KMeans
km = KMeans(n_clusters=5, init='random', n_init=30, max_iter=500, tol=1e-04, random_state=7, n_jobs=-1)
km.fit(X_train_words)

# Display of clusters...
y_km = km.predict(X_train_words)
words = list(word2vec.wv.vocab)
c0, c1, c2, c3, c4 = [], [], [], [], []
for word, label in zip(words,y_km):   
    if label == 0:
        c0.append(word)
    elif label == 1:
        c1.append(word)
    elif label == 2:
        c2.append(word)
    elif label == 3:
        c3.append(word)
    else:
        c4.append(word)
        
c0 = c0 + [np.nan for i in range(int(np.abs(max(len(c0), len(c1), len(c2), len(c3), len(c4)) - len(c0))))]
c1 = c1 + [np.nan for i in range(int(np.abs(max(len(c0), len(c1), len(c2), len(c3), len(c4)) - len(c1))))]
c2 = c2 + [np.nan for i in range(int(np.abs(max(len(c0), len(c1), len(c2), len(c3), len(c4)) - len(c2))))]
c3 = c3 + [np.nan for i in range(int(np.abs(max(len(c0), len(c1), len(c2), len(c3), len(c4)) - len(c3))))]
c4 = c4 + [np.nan for i in range(int(np.abs(max(len(c0), len(c1), len(c2), len(c3), len(c4)) - len(c4))))]
cluster_df = pd.DataFrame({'C0_abbrv': c0, "C1_units": c1, "C2_company":c2, "C3_Prepos":c3, "C4_Names":c4})

NameError: name 'KMeans' is not defined

In [162]:
cluster_df.head()

NameError: name 'cluster_df' is not defined

#### Vectorization

In [163]:
def kmeans_transform(X):
    doc_vector = []
    for doc in X:  
        word_vector_doc = []  
        if doc != []:
            for word in doc:    
                if word in word2vec.wv:
                    word_vector_doc.append(km.predict([word2vec.wv[word]])[0])
                else:
                    # Words not in word2vec vocab are assigned Cluster 4
                    word_vector_doc.append(4)
            doc_vector.append(word_vector_doc)
        else:
            doc_vector.append([])     
    return doc_vector

# KMeans vector
X_train_vector, X_test_vector = kmeans_transform(X_train.tolist()), kmeans_transform(X_test.tolist())
print("Kmeans Vector = ", X_train_vector[0])

# Padding
X_train_vector_padded, X_test_vector_padded = pad_sequences(X_train_vector, padding='post', value=-1, maxlen=30), pad_sequences(X_test_vector, padding='post', value=-1, maxlen=30)
print("\nKmeans Vector Padded = ", X_train_vector_padded[0])

# OneHotEncoding Kmeans padded vector
d = {0:[1,0,0,0,0], 1:[0,1,0,0,0], 2:[0,0,1,0,0], 3:[0,0,0,1,0], 4:[0,0,0,0,1], -1:[0,0,0,0,0]}
X_train_padded_OHE, X_test_padded_OHE = [[d[b] for b in i] for i in X_train_vector_padded], [[d[b] for b in i] for i in X_test_vector_padded]
print("\nOneHotEncoder - Kmeans Vector Padded: \n", X_train_padded_OHE[0])

# Final OHE Vector
X_train_padded_OHE_vector, X_test_padded_OHE_vector = np.array(list(map(lambda x: sum(x, []), X_train_padded_OHE))), np.array(list(map(lambda x: sum(x, []), X_test_padded_OHE)))
print("\nOHE Kmeans Vector: \n", X_train_padded_OHE_vector[0])

NameError: name 'X_train' is not defined

In [164]:
X_train_padded_OHE_vector

NameError: name 'X_train_padded_OHE_vector' is not defined
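
The vectorization cell turns each line into a list of cluster IDs, pads it to a fixed length with `-1`, one-hot encodes every position (padding becomes an all-zero row), and flattens the result. A minimal dependency-free sketch of those three steps is shown below. The helper names `pad_post` and `one_hot_flat` are assumptions made for illustration; `pad_post` mimics `pad_sequences(padding='post')` with its default truncation, which keeps the last `maxlen` items.

```python
N_CLUSTERS = 5   # number of KMeans clusters selected above
PAD_ID = -1      # padding value used in the notebook


def pad_post(seq, maxlen, value=PAD_ID):
    """Keep the last `maxlen` items, then right-pad with `value`."""
    kept = list(seq)[-maxlen:]
    return kept + [value] * (maxlen - len(kept))


def one_hot_flat(cluster_ids, n_clusters=N_CLUSTERS, pad_id=PAD_ID):
    """One-hot encode each cluster ID and concatenate into one flat vector."""
    flat = []
    for cid in cluster_ids:
        row = [0] * n_clusters
        if cid != pad_id:      # padding stays all zeros
            row[cid] = 1
        flat.extend(row)
    return flat


# Hypothetical line whose three tokens fall in clusters 2, 0 and 4.
line_clusters = [2, 0, 4]
padded = pad_post(line_clusters, maxlen=4)
vector = one_hot_flat(padded)
print(padded)
print(len(vector))
```

Using `list.extend` keeps the flattening linear in the vector length, whereas `sum(x, [])` re-copies the accumulated list on every step.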

#### Sampling

In [165]:
Sampler = SMOTE('minority')
X_train_final, Y_train_final = Sampler.fit_sample(X_train_padded_OHE_vector, Y_train)

NameError: name 'SMOTE' is not defined

#### Fitting

In [166]:
%%time
# Fit
model5 = RandomForestClassifier(n_jobs=-1)
model5.fit(X_train_final, Y_train_final)

NameError: name 'RandomForestClassifier' is not defined

#### Prediction

In [167]:
# Predict
testing = pd.DataFrame()
testing['Y_SN'] = Y_test
testing['Y_PRED'] = model5.predict(X_test_padded_OHE_vector)
Y_TRUE, Y_PRED = np.array(testing.Y_SN.astype(int)), np.array(testing.Y_PRED.astype(int))

# Metrics
confusion = confusion_matrix(Y_TRUE, Y_PRED)     
TP, TN, FP, FN = confusion[1, 1], confusion[0, 0], confusion[0, 1], confusion[1, 0]
clf_accuracy = round(float((TP + TN)/(TP + TN + FP + FN))*100.0, 3)
P = round(float(TP /(TP + FP))*100.0, 3)
R = round(float(TP /(TP + FN))*100.0, 3)
f1_score = round(float((2 * P * R)/(P + R + 0.01)), 3)
print("ACC = {}; P = {}, R = {}, F1 Score = {}\n".format(clf_accuracy, P, R, f1_score))
# Scores
print("Test_df valueCounts() :: \n{}\n".format(testing.Y_SN.value_counts()))
# TP
print("TP = ", testing[(testing.Y_SN == 1) & (testing.Y_PRED == 1)].shape)
# TN
print("TN = ", testing[(testing.Y_SN == 0) & (testing.Y_PRED == 0)].shape)
# False Positive
print("FP = ", testing[(testing.Y_SN == 0) & (testing.Y_PRED == 1)].shape)
# FN
print("FN = ", testing[(testing.Y_SN == 1) & (testing.Y_PRED == 0)].shape)

NameError: name 'Y_test' is not defined

- Vectorizing lines as KMeans cluster IDs helps, but features built from NER tags proved to work better.
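
The metrics cell above adds `0.01` to the F1 denominator to avoid a division by zero, which slightly lowers every reported score. A minimal sketch of the same metrics with explicit zero guards follows. `binary_metrics` is a hypothetical helper and the counts are illustrative, not results from this notebook.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 (all in percent) from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total if total else 0.0
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "ACC": round(accuracy, 3),
        "P": round(precision, 3),
        "R": round(recall, 3),
        "F1": round(f1, 3),
    }


# Hypothetical counts for an imbalanced test set.
scores = binary_metrics(tp=40, tn=900, fp=10, fn=50)
print(scores)
```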

----

### `Method 19: ` Ensembling via Stacking - Linguistic using VotingClassifier

    ARCHITECTURE

    1. BASE Classifier: Linguistic(numerical) for training a Voting Classifier Model (RF, XGB, AB)

- Results are manageable!

#### Load Data

In [168]:
# Load training, testing, unseen df
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)
unseen_df_labelled = unseendata_load()

NameError: name 'df_labelled' is not defined

#### Voting Classifier (XGB + RF 1 layer)

In [169]:
# ensemble of models 
estimator = [] 
estimator.append(('XGBoost',  XGBClassifier(random_state=1, n_jobs=-1))) 
estimator.append(('RandomForest', RandomForestClassifier(random_state=1, n_jobs=-1))) 
  
# Voting Classifier with soft voting 
ensemble_VC = VotingClassifier(estimators=estimator, voting ='soft') 

NameError: name 'XGBClassifier' is not defined
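
With `voting='soft'`, the ensemble averages the class probabilities predicted by each base model and returns the class with the highest average. The sketch below illustrates that rule in plain Python. `soft_vote` is a hypothetical helper and the probability values are invented stand-ins for the XGBoost and RandomForest outputs.

```python
def soft_vote(probas_per_model):
    """Average per-class probabilities over models and return (averages, labels).

    probas_per_model[m][s][c] is model m's probability of class c for sample s.
    """
    n_models = len(probas_per_model)
    n_samples = len(probas_per_model[0])
    n_classes = len(probas_per_model[0][0])
    averages, labels = [], []
    for s in range(n_samples):
        avg = [
            sum(probas_per_model[m][s][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        averages.append(avg)
        # argmax over classes; ties resolve to the lowest class index
        labels.append(max(range(n_classes), key=lambda c: avg[c]))
    return averages, labels


# Hypothetical [P0, P1] outputs of two base models for two lines.
xgb_proba = [[0.55, 0.45], [0.80, 0.20]]
rf_proba = [[0.10, 0.90], [0.60, 0.40]]
averages, labels = soft_vote([xgb_proba, rf_proba])
print(labels)
```

On the first line the two models disagree, and the more confident one decides the averaged result. With only two estimators, hard voting could not break such a tie on counts alone.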

#### `INPUT 1`: Input = Linguistic(numerical) features 

In [170]:
# 1. Selected features: Linguistic(Numerical)
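# .copy() avoids pandas' SettingWithCopyWarning when the columns are normalized in place below
X_train, X_test, X_unseen = train_df[Features_NUM].copy(), test_df[Features_NUM].copy(), unseen_df_labelled[Features_NUM].copy()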
Y_train, Y_test, Y_unseen = train_df[Target], test_df[Target], unseen_df_labelled[Target]

# 2. Normalize features
Normalize = MinMaxScaler()
X_train[normalize_cols] =  Normalize.fit_transform(X_train[normalize_cols])
X_test[normalize_cols] = Normalize.transform(X_test[normalize_cols])
X_unseen[normalize_cols] = Normalize.transform(X_unseen[normalize_cols])

# 3. Sampling
Sampler = SMOTE('minority')
X_train_final, Y_train_final = Sampler.fit_sample(X_train, Y_train)

# 4. Fitting model
ensemble_VC = VotingClassifier(estimators=estimator, voting ='soft') 
ensemble_VC.fit(X_train_final, Y_train_final)

# 5. Making predictions
print("Test Data Results :: ")
ACCURACY = accuracy(test_df, X_test, Y_test, ensemble_VC)
print("Unseen Data Results :: ")
UDL = accuracy(unseen_df_labelled, X_unseen, Y_unseen, ensemble_VC)

NameError: name 'train_df' is not defined

In [171]:
# Running phase II: chunk identification using above "Unseen Df (UDL)"
predict_chunk_lines_PHASE1(UDL)

NameError: name 'UDL' is not defined

In [172]:
# Running phase II: chunk identification using new testing data "ahdf"
ahdf = create_AH_testdf(ACCURACY)
predict_chunk_lines_PHASE1(ahdf)

NameError: name 'ACCURACY' is not defined

#### `INPUT 2: ` Input = Text(CountVectorized) + Linguistic(numerical) via <ins>h-stack</ins> for training above Voting CLF

- Model is highly over-fitting and hence discarded!

-----
-----

# <ins>FINAL ARCHITECTURE</ins>

The final architecture consists of two separate models, each giving label probabilities (P0, P1).

`Model 1: Generic Model`

2-layer RandomForest model
   
    ARCHITECTURE
    
    - Input  : df[Linguistic_Features]
    
    1. BASE Classifier: Linguistic Features(numerical) used for training a generic RandomForest Model.
    2. 2nd  Classifier: Prob(0,1) + Linguistic Features(numerical) used for training another generic RF model.
    
*Taken from **Method** 1*
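
The second-level input is the original feature row with the first-level probabilities appended, which corresponds to `Features_L2 = Features_NUM + ['L1_P0', 'L1_P1']`. A minimal sketch of that augmentation step is given below. `augment_with_level1` and `toy_level1` are hypothetical names, and `toy_level1` is a stand-in for the level-1 model's `predict_proba`, not a RandomForest.

```python
def augment_with_level1(rows, level1_predict_proba):
    """Append the level-1 probabilities (P0, P1) to every feature row."""
    augmented = []
    for row in rows:
        p0, p1 = level1_predict_proba(row)
        augmented.append(list(row) + [p0, p1])
    return augmented


def toy_level1(row):
    """Hypothetical stand-in for the level-1 model: P1 is the clipped feature mean."""
    p1 = min(1.0, max(0.0, sum(row) / len(row)))
    return 1.0 - p1, p1


# Two hypothetical, already-normalized feature rows.
features = [[0.0, 0.5], [1.0, 1.0]]
level2_input = augment_with_level1(features, toy_level1)
print(level2_input)
```

A common refinement in stacking is to compute the level-1 probabilities for training rows from out-of-fold predictions, so the second model does not learn from probabilities the first model produced on data it was fitted to.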

...

`Model 2: NN Model`

2-input Channel biLSTM network
   
    ARCHITECTURE
    
    - 1st Input : Text (*vectorized in terms of neighbouring context NER tags `df.LINE_NER`*) in an Embedding layer
    - 2nd Input : Linguistic features(numerical) in a dense layer 
    - Concat    : Together in a Dense layer to give softmax probability outputs.
    
    1. BASE Classifier: Text(df.LINE_NER) + Linguistic(numerical) in a 2-channel input bi-LSTM network.

*Taken from **Method** 17*

#### LOAD MODELS

In [173]:
%%time
%%memit

# Load Generic Model...
model = pickle.load(open("Models/RF_2Level_Generic_Num_19052020/Model_model51_19052020.pickle", "rb")) 
model2 = pickle.load(open("Models/RF_2Level_Generic_Num_19052020/Model_model52_19052020.pickle", "rb")) 
Normalizer = pickle.load(open("Models/RF_2Level_Generic_Num_19052020/Model_Normalize_19052020.pickle", "rb"))

# Load NN Model...
modelNN = pickle.load(open("Models/NN_TextNum_2SepLayers_30epochs_21052020/ModelNN_21052020.pickle", 'rb'))
Tokenizer = pickle.load(open("Models/NN_TextNum_2SepLayers_30epochs_21052020/Tokenizer_21052020.pickle", 'rb'))
N = pickle.load(open("Models/NN_TextNum_2SepLayers_30epochs_21052020/Normalizer_21052020.pickle", 'rb'))

FileNotFoundError: [Errno 2] No such file or directory: 'Models/RF_2Level_Generic_Num_19052020/Model_model51_19052020.pickle'

#### FEATURE SELECTION

In [174]:
# SELECTED FEATURES
# - All features are taken except 3 (i.e. SpacyNERLine, SpacyNERAddress, CrfNERAddress)

Features_NUM = ['l_', 't_', 'r_', 'b_', 'FS_', 'F1_CONTAINSDIGIT', 'F1_CONTAINSALLDIGIT', 'F1_CONTAINSEMAIL',
                'F1_CONTAINSURL', 'F1_CONTAINSDATE', 'F1_CONTAINSPHONE', 'F1_emails', 'F1_urls', 'F3_abbrv',
                'F4_DictWords', 'F4_Cap1DictWords', 'F4_CapDictWords', 'F4_NonDictWords', 'F4_Cap1NonDictWords',
                'F4_CapNonDictWords', 'F5_postalAB', 'F6_lineQuadrant', 'F7_gpe']

normalize_cols = ['F4_DictWords', 'F4_Cap1DictWords', 'F4_CapDictWords', 'F4_NonDictWords', 'F4_Cap1NonDictWords', 
                  'F4_CapNonDictWords', 'F5_postalAB', 'F7_gpe']

Features_L2 = Features_NUM + ['L1_P0', 'L1_P1']

Target = 'Y_SN'

#### LOAD DATA

In [175]:
train_df, test_df = train_test_split(df_labelled, shuffle=True, random_state=7, stratify=df_labelled.Y_SN)
unseen_df_labelled = unseendata_load()

NameError: name 'df_labelled' is not defined

### <ins>Phase I: Line Classification</ins>

#### Model 1: Generic Model

In [176]:
# LEVEL 1
# 1. Select features
X_train, X_test, X_unseen = train_df[Features_NUM].copy(), test_df[Features_NUM].copy(), unseen_df_labelled[Features_NUM].copy()
Y_train, Y_test, Y_unseen = train_df[Target], test_df[Target], unseen_df_labelled[Target]

# 2. Normalize selected features
X_train[normalize_cols] =  Normalizer.transform(X_train[normalize_cols])
X_test[normalize_cols] = Normalizer.transform(X_test[normalize_cols])
X_unseen[normalize_cols] = Normalizer.transform(X_unseen[normalize_cols])

# LEVEL 2
train_df['L1_P0'], train_df['L1_P1'] = zip(*model.predict_proba(X_train))
test_df['L1_P0'], test_df['L1_P1'] = zip(*model.predict_proba(X_test))
unseen_df_labelled['L1_P0'], unseen_df_labelled['L1_P1'] = zip(*model.predict_proba(X_unseen))

# 3. Select features: L2 [P(0,1) + Linguistic Features]
X_train_L2, X_test_L2, X_unseen_L2 = train_df[Features_L2].copy(), test_df[Features_L2].copy(), unseen_df_labelled[Features_L2].copy()

# 4. Normalize above L2 features
X_train_L2[normalize_cols] = Normalizer.transform(X_train_L2[normalize_cols])
X_test_L2[normalize_cols] = Normalizer.transform(X_test_L2[normalize_cols])
X_unseen_L2[normalize_cols] = Normalizer.transform(X_unseen_L2[normalize_cols])

# 5. Make Predictions
print("\n ----- AmeriHealth Test Data -------")
test_df__ = accuracy(test_df.copy(), X_test_L2, Y_test, model2)
unseen_AHdf_labelled = create_AH_testdf(test_df__)
unseen_AHdf_labelled = unseen_AHdf_labelled.rename(columns={"Y_PRED": "Model1_Y_PRED", "P0": "Model1_P0", "P1": "Model1_P1"})

print("\n ------- Porsche Test Data -------")
unseen_df_labelled = accuracy(unseen_df_labelled.copy(), X_unseen_L2, Y_unseen, model2)
unseen_df_labelled = unseen_df_labelled.rename(columns={"Y_PRED": "Model1_Y_PRED", "P0": "Model1_P0", "P1": "Model1_P1"})

NameError: name 'train_df' is not defined

#### Model 2: NN Model

In [177]:
# NN Settings
max_features = 5
sequence_length = 6
embedding_dim = 6

# Text (df.LINE_NER)
X_train_text, X_test_text, X_unseen_text = train_df.LINE_NER, unseen_AHdf_labelled.LINE_NER, unseen_df_labelled.LINE_NER
X_train_text, X_test_text, X_unseen_text = Tokenizer.texts_to_sequences(X_train_text), Tokenizer.texts_to_sequences(X_test_text), Tokenizer.texts_to_sequences(X_unseen_text)
X_train_text, X_test_text, X_unseen_text = pad_sequences(X_train_text, padding='post', maxlen=sequence_length), pad_sequences(X_test_text, padding='post', maxlen=sequence_length), pad_sequences(X_unseen_text, padding='post', maxlen=sequence_length)

# Num
X_train_num, X_test_num, X_unseen_num = train_df[Features_NUM].copy(), unseen_AHdf_labelled[Features_NUM].copy(), unseen_df_labelled[Features_NUM].copy()
X_train_num[normalize_cols], X_test_num[normalize_cols], X_unseen_num[normalize_cols] = N.transform(X_train_num[normalize_cols]), N.transform(X_test_num[normalize_cols]), N.transform(X_unseen_num[normalize_cols])

# Both as 2-input channel
X_train_NN, X_test_NN, X_unseen_NN = [X_train_text, X_train_num], [X_test_text, X_test_num], [X_unseen_text, X_unseen_num]
Y_train_NN, Y_test_NN, Y_unseen_NN = train_df.Y_SN, unseen_AHdf_labelled.Y_SN, unseen_df_labelled.Y_SN

# Predictions
print("\n ----- AmeriHealth Test Data -------")
unseen_AHdf_labelled = accuracy(unseen_AHdf_labelled, X_test_NN, Y_test_NN, modelNN)
unseen_AHdf_labelled = unseen_AHdf_labelled.rename(columns={"Y_PRED": "Model2_Y_PRED", "P0": "Model2_P0", "P1": "Model2_P1"})

print("\n ------- Porsche Test Data -------")
unseen_df_labelled = accuracy(unseen_df_labelled, X_unseen_NN, Y_unseen_NN, modelNN)
unseen_df_labelled = unseen_df_labelled.rename(columns={"Y_PRED": "Model2_Y_PRED", "P0": "Model2_P0", "P1": "Model2_P1"})

NameError: name 'train_df' is not defined

### <ins>Phase II: Chunk Identification</ins>

Sample of unseen data from the AmeriHealth customer

In [178]:
predict_chunk_lines(unseen_AHdf_labelled)

NameError: name 'predict_chunk_lines' is not defined

Unseen data from Porsche golden dataset

In [179]:
predict_chunk_lines(unseen_df_labelled)

NameError: name 'predict_chunk_lines' is not defined

-----
-----

# <ins> `Phase II: Chunk Identification Module` </ins> 

- A chunk is a line, or a span of tokens within a line.
- The final "chunk" is separated from the predicted SN line using a scoring decision matrix, regex rules, static rules, keyword analysis, filtering, noise reduction, and Address Block (AB) line reduction.
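
One of the regex rules works by locating a supplier tag such as "remit to:" and taking the text that follows it as the candidate chunk. The sketch below illustrates the idea with two of the tag patterns from the module's `SUPPLIER_TAGS` list. `find_supplier_chunk` is a hypothetical, simplified helper: it omits the token limits and extra lookaheads of the real module, and the sample lines are invented.

```python
import re

# Two of the supplier-tag patterns used by the module; the full list is longer.
SUPPLIER_TAGS = [
    r"\bremit\s+to\b\s*[\:|\-]+\s*",
    r"\bpay\s+to\b\s*[\:|\-]+\s*",
]


def find_supplier_chunk(line):
    """Return the text following the first matching supplier tag, or None."""
    for tag in SUPPLIER_TAGS:
        # The lookahead requires at least one letter after the tag,
        # so a tag followed only by digits or punctuation is ignored.
        match = re.search(tag + r"(?=.*[A-Za-z])(.*)", line.strip(), flags=re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return None


print(find_supplier_chunk("Please remit to: ABC Supplies Inc"))
print(find_supplier_chunk("Invoice total 120.00"))
```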

#### Data Analysis

In [180]:
a = df.SUPPLIER_NAME.apply(lambda x: len(x))
b = df.SUPPLIER_NAME.apply(lambda x: len(word_tokenize(x)))
print("Number of Chars  :: Min={}; Mean={}; Max={}; MODE={}".format(np.min(a), np.mean(a), np.max(a), a.value_counts().index[0]))
print("Number of Words :: Min={}; Mean={}; Max={}; MODE={}".format(np.min(b), np.mean(b), np.max(b), b.value_counts().index[0]))
sns.distplot(a)
sns.distplot(b)

AttributeError: 'DataFrame' object has no attribute 'SUPPLIER_NAME'

In [181]:
a = ahdf.SUPPLIER_NAME.apply(lambda x: len(x))
b = ahdf.SUPPLIER_NAME.apply(lambda x: len(word_tokenize(x)))
print("Number of Chars  :: Min={}; Mean={}; Max={}; MODE={}".format(np.min(a), np.mean(a), np.max(a), a.value_counts().index[0]))
print("Number of Words :: Min={}; Mean={}; Max={}; MODE={}".format(np.min(b), np.mean(b), np.max(b), b.value_counts().index[0]))
sns.distplot(a)
sns.distplot(b)

NameError: name 'ahdf' is not defined

In [182]:
a = df.SUPPLIER_NAME.apply(lambda x: len([w for w in word_tokenize(x) if w.lower() in stop_words])*100.0/len(word_tokenize(x)))
print("Number of Stopwords  :: Min={}; Mean={}; Max={}; MODE={}".format(np.min(a), np.mean(a), np.max(a), a.value_counts().index[0]))
sns.distplot(a)

AttributeError: 'DataFrame' object has no attribute 'SUPPLIER_NAME'

In [183]:
a = ahdf.SUPPLIER_NAME.apply(lambda x: len([w for w in word_tokenize(x) if w.lower() in stop_words])*100.0/len(word_tokenize(x)))
print("Number of Stopwords  :: Min={}; Mean={}; Max={}; MODE={}".format(np.min(a), np.mean(a), np.max(a), a.value_counts().index[0]))
sns.distplot(a)

NameError: name 'ahdf' is not defined

### Final Chunk Identification Module

In [184]:
#####################################################################################
# Performs chunk identification for each file (i.e. df)
#####################################################################################
# VERSION 4.0
# Author: Pranjal Pathak
# INPUT   --> df per file
# OUTPUT  --> list of Supplier Name chunks, list of Customer Name chunks

def chunk_identification(df):
    
    '''
    STATIC TAGS DESCRIPTION
    ----------------------------------------------------------------------------------------------------------------------
    '''
    SUPPLIER_TAGS = [r"\bdirected\s+inquiries\s+to\b\s*[\:|\-]+\s*", r"\bdirect\s+inquiries\s+to\b\s*[\:|\-]+\s*", 
                     r"\bdirecting\s+to\b\s*[\:|\-]+\s*", 
                     r"\bdirected\s+to\b\s*[\:|\-]+\s*", r"\bdirects\s+to\b\s*[\:|\-]+\s*", r"\bdirect\s+to\b\s*[\:|\-]+\s*", 
                     r"\bcrediting\s+to\b\s*[\:|\-]+\s*", r"\bcredited\s+to\b\s*[\:|\-]+\s*", r"\bcredits\s+to\b\s*[\:|\-]+\s*",
                     r"\bcredit\s+to\b\s*[\:|\-]+\s*", r"\bremitt?ing\s+to\b\s*[\:|\-]+\s*", r"\bremittance\s+to\b\s*[\:|\-]+\s*",
                     r"\bremitted\s+to\b\s*[\:|\-]+\s*", r"\bremits\s+to\b\s*[\:|\-]+\s*", r"\bremit\s+to\b\s*[\:|\-]+\s*",
                     r"f/b/o\s*", r"\baccount\s+name\b\s*[\:|\-]+\s*", r"\bchecked\s+payable\s+to\b\s*[\:|\-]+\s*", 
                     r"\bchecks\s+payable\s+to\b\s*[\:|\-]+\s*", r"\bchecks\s+payable\s+at\b\s*[\:|\-]+\s*", 
                     r"\bcheck\s+payable\s+at\b\s*[\:|\-]+\s*", r"\bcheck\s+payable\s+to\b\s*[\:|\-]+\s*", 
                     r"\bchecks\s+pay\s+to\b\s*[\:|\-]+\s*", r"\bcheck\s+pay\s+to\b\s*[\:|\-]+\s*", 
                     r"\bchecking\s+to\b\s*[\:|\-]+\s*", r"\bchecked\s+to\b\s*[\:|\-]+\s*", r"\bpayable\s+to\b\s*[\:|\-]+\s*", 
                     r"\bchecks\s+to\b\s*[\:|\-]+\s*", r"\bcheck\s+to\b\s*[\:|\-]+\s*", r"\bpays\s+to\b\s*[\:|\-]+\s*", 
                     r"\bpay\s+to\b\s*[\:|\-]+\s*"]
    
    NEGATIVE_SUPPLIERWORDS = ["attn", "from", "subject", "terms", "please", "invoice", "invoices", "invoiceno", 
                              "amount", "amounts", "address", "addresses", "addrss", "telephone", "phone", "phones", "call", 
                              "callat", "phoneat", "fax", "cancel", "notice", "confidential",  "client"]               
                           
    CUSTOMER_TAGS = [r"\bbilling\s+to\b\s*\:*\s*", r"\bbilled\s+to\b\s*\:*\s*", r"\bbills\s+to\b\s*\:*\s*", 
                     r"\bbill\s+to\b\s*\:*\s*", r"\bbilling\s+at\b\s*\:*\s*", r"\bbilled\s+at\b\s*\:*\s*", 
                     r"\bbills\s+at\b\s*\:*\s*", r"\bbill\s+at\b\s*\:*\s*", r"\bshipping\s+to\b\s*\:*\s*", 
                     r"\bshipped\s+to\b\s*\:*\s*", r"\bships\s+to\b\s*\:*\s*", r"\bship\s+to\b\s*\:*\s*", 
                     r"\bshipto\b\s*\:*\s*", r"\bshippingto\b\s*\:*\s*", r"\bshipsto\b\s*\:*\s*"]
    
    MIXED_TAGS = [r"\bbelonging\s+to\b\s*\:*\s*", r"\bbelonged\s+to\b\s*\:*\s*", r"\bbelongs\s+to\b\s*\:*\s*", 
                  r"\bbelong\s+to\b\s*\:*\s*", r"\bdivisions\s+of\b\s*\:*\s*", r"\bdivision\s+of\b\s*\:*\s*", 
                  r"\bdivs\s+of\b\s*\:*\s*", r"\bdiv\s+of\b\s*\:*\s*", 
                  r"\bconnect\s+with\b\s*\:*\s*", r"\bcompany\s+of\b\s*\:*\s*", r"\bcomp\s+of\b\s*\:*\s*", 
                  r"\bfirm\s+of\b\s*\:*\s*", r"\bname\s+of\b\s*\:*\s*", r"\baccount\s+of\b\s*\:*\s*", 
                  r"\bacc\s+of\b\s*\:*\s*", r"\baccount\s+name\b\s*\:*\s*", r"\bacc\s+name\b\s*\:*\s*", 
                  r"\bis\s+of\b\s*\:*\s*", r"\bis\s+to\b\s*\:*\s*", 
                  r"\bto\b\s*\:*\s+", 
                  r"^by\b\s*\:*\s+", r"^mailing\b\s*\:*\s+", r"^to\b\s*\:*\s+", r"^of\b\s*\:*\s+",
                  r"^mail\b\s*\:*\s+", r"^payable\s+to\b\s*\:*\s+", r"^checks\s+payable\s+to\b\s*\:*\s+",
                  r"^check\s+payable\s+to\b\s*\:*\s+", r"^checks\s+to\b\s*\:*\s+", r"^check\s+to\b\s*\:*\s+"]
        
    NOISE_KEYWORDS = [r"\battn\b[^\:\s]*\:+", r"\bfrom\b\s*\:+", r"\bsubject\b\s*\:+", r"\bcc\b\s*\:+", r"\bbcc\b\s*\:+", 
                      r"\bterms\s+and\s+condition\b", r"\bterms\s+\&\s+conditions\b", r"\bterms\&conditions\b", 
                      r"\btermsandconditions\b", r"\bT\&C\b", r"\bTC\b", r"\bif\s+you\s+have\s+any\s+questions\b",
                      r"\bif\s+you\s+have\s+any\s+question\b", r"\byou\s+have\s+any\s+questions\b", 
                      r"\byou\s+have\s+any\s+question\b", r"\byou\s+have\s+questions\b", r"\byou\s+have\s+question\b",
                      r"\bif\s+you\s+have\s+questions", r"\bplease\s+write\s+to\b[^\:\s]*\:+", 
                      r"\bplease\s+write\s+at\b[^\:\s]*\:+", r"\bwrite\s+to\b[^\:\s]*\:+", 
                      r"\bcontact\s+to\b[^\:\s]*\:+", r"\bemail\s+to\b[^\:\s]*\:+", r"\bemail\s+at\b[^\:\s]*\:+", 
                      r"\bemail\b\s+\@\s*\:+", r"\bconfidential\b"]

    PHONE_TAGS = [r"\bphone\b\s*\:*\s+", r"\bphone\s+at\b\s*\:*\s+", r"\bphones\b\s*\:*\s+", r"\bphone\s+on\b\s*\:*\s+", 
                  r"\bcall\b\s*\:*\s+", r"\bcalling\b\s*\:*\s+", r"\bcall\s+at\b\s*\:*\s+", r"\bcalls\s+at\b\s*\:*\s+", 
                  r"\btelephone\b\s*\:*\s+", r"\bt\b[^\:\s]*\:+\s+", r"\btelephone\s+at\b\s*\:*\s+", 
                  r"\btelephones\b\s*\:*\s+", r"\bdial\b\s*\:*\s+", r"\bdial\s+at\b\s*\:*\s+", r"\bdate\b\s*\:*\s+", 
                  r"\bday\b\s*\:*\s+", r"\btime\b\s*\:*\s+", r"\bcalendar\b\s*\:*\s+", r"\bfax\b\s*\:*\s+"]
                              
    # runs on each prediction...
    SMART_STOPWORDS = ["https", "http", "please", "pls", "lets", "know", "knowing", "knows", 
                       "respond", "responding", "responds", "naming", "names", "thanks", "thank", "thankyou", 
                       "thanksyou", "thanking", "thankingyou", "thanx", "thnx", "thx", "asap", 
                       "dear", "hello", "kindly", "regards", "hi", "hola", "hey", "yeah", "nope", 
                       "advise", "attached", "days", "months", "month", "invoice", "invoices", "invocing", 
                       "ordernumber", "invoiceno", "accountno", "remit", "remittance", "fbo", "inquiries", 
                       "payable", "billed", "billing", "bills", "shipped", "shipping", "ships", "shipto", "shippingto", 
                       "shipsto", "terms&conditions", "termsandconditions", "confidential", "t&c","emailed", "customerno", 
                       "duration", "since", "address", "order", "make", "check", "checks", "submit", "submits", "sumitting" , 
                       "submitto",  "paying", "paymentsto", "notices", "phone", "phones", "phoning", "phono", "emailo", 
                       "cancel", "cancelling", "cancellation", "cancellations", "requires", "require", "required", 
                       "requiring", "noticeable", "noticing", "date", "dates", "months", "month", "day", "days",
                       "addresses", "address", "inovicenumber",
                       "if", "because", "is", "there", "always", 'itself', 'shouldn', 'there', 'were', 'should', 
                       'who', 'hasn', "you've", 'he', "mustn't", 'whom', 'mustn', 'very', 'doesn', 'have', 'here', 
                       'those', 'wasn', 'having', "aren't", 'yourselves', "hasn't", 'themselves', 'until', "you'd", 
                       'shan', 'him', 'ourselves', 'aren', "isn't", "haven't", 'that', 'herself', "hadn't", 'both',
                       'where', 'most', 'doing', 'further', 'any', 'didn', 'theirs', "weren't", 'same', 'you', 'hadn', 
                       'myself', 'yours', 'won', 'mightn', 'she', 'hers', 'weren', "she's", 'does', 'during', 'was', 
                       'wouldn', 'because', "won't", 'again', 'his', "you'll", 'once', 'between', 'couldn', 'how', 
                       'what', "shouldn't", 'then', 'own', 'above', "doesn't", 'had', "wasn't", 'them', 'when', 'such', 
                       'while', 'if', 'did', 'before', "couldn't", 'your', "shan't", 'other', 'which', 'himself', 'through', 
                       'below', 'under', 'too', "mightn't", 'been', "should've", 'few', 'these', "you're", 'their', 
                       'can', 'each', "it's", 'has', "needn't", 'but', "wouldn't", 'needn', 'they', "didn't", 'nor', 
                       "that'll", "don't", 'than', 'some']
    
    # POSTAL-TAG LABELS
    S_LABELS = ['house_number', 'road', 'unit', 'level', 'po_box']
    M_LABELS = ['suburb', 'city_district', 'city', 'island', 'state_district', 'state', 'country_region', 'country', 'world_region']
    E_LABELS = ['postcode']
    ''' ------------------------------------------------------------------------------------------------------------------'''
    
    
    # ################################################################################################################## #
    # ------------------------------------------ MODULES :: file level (FL) -------------------------------------------- #
    # ################################################################################################################## #
    
    """ MODULE FL 1:> REDUCE NER ENTITIES AND AB IN PREDICTION LINES
    ----------------------------------------------------------------------------------------------------------------------
    - Replaces NER(PHONE, FAX, DATE) tags with 'whitespace' and reduces(or slices) "AB LINES" or potential "AB LINES".  """
    def check_entities(df):
        
        # Reducing line-by-line
        def reduce_entities(x):
            """
            Takes a line and replaces NER tags and tries to reduce "AB LINE".
            :param: df.LINES
            :returns: Reduced LINE or np.nan
            """
            
            # MODULE 1: Check if a line could be "AB"
            def check_AB(line):
                """
                Takes a line and finds if it could be an "AB LINE" or not. Returns True if found, else False.
                :param: string line: A line
                :returns: bool True or False
                """
                
                def checkPattern(string, pattern): 
                    # All three label groups (S, M, E) must be present
                    if not all(["S" in string, "M" in string, "E" in string]):
                        return False
                    l = len(pattern) 
                    if len(string) < l: return False
                    for i in range(l - 1): 
                        x, y = pattern[i], pattern[i + 1] 
                        # rfind/find return -1 when absent (rindex/index would raise ValueError instead)
                        last = string.rfind(x) 
                        first = string.find(y) 
                        # every x must occur before the first y
                        if last == -1 or first == -1 or last > first: return False
                    return True

                # create a dic that maps S_LABELS to "S", M_LABELS to "M", E_LABELS to "E"
                label_mapper = {**dict(zip(S_LABELS, ["S"]*len(S_LABELS))), **dict(zip(M_LABELS, ["M"]*len(M_LABELS))), **dict(zip(E_LABELS, ["E"]*len(E_LABELS)))}

                # Remove NER(PHONE, FAX, DATE) tags from line, if present
                line = line.strip()
                find_PhoneDateRE = sum([x for x in [[i for x in find_phone(line) for i in x if i!= ''], find_date(line)] if any(x)], [])
                if len(find_PhoneDateRE) > 0:
                    find_PhoneDateRE += PHONE_TAGS
                    line = re.sub(r"\((?=\d+\))|(?<=\d)\)", "", line, flags=re.IGNORECASE).strip()
                    line = re.sub(r"\s+", " ", re.sub("|".join(find_PhoneDateRE), " ", line, flags=re.IGNORECASE)).strip()

                # Run "libpostal" on line to generate POSTAL-TAGS
                line_AB = crf_NER(line)

                # Use the above dict(label_mapper) to create a LABEL-pattern for line
                tagged_line = [(tag[0], label_mapper[tag[1]]) if tag[1] in label_mapper.keys() else (tag[0], 'O') for tag in line_AB]
                
                # If LABEL-pattern follows "AB" rule return True, else False
                tagged_line_pattern = checkPattern("".join([tag[1] for tag in tagged_line]), 'SME')
                return tagged_line_pattern


            # MODULE 2: Reduce a line if it is an "AB LINE"
            def check_softAB(line):
                """
                Takes a line that contains at least one SpacyNER(GPE) tag and re-checks, under relaxed rules,
                whether it could be an "AB LINE". If it qualifies, keeps only the text before the first comma(',').
                :param: string line: A line which contains at-least one SpacyNER(GPE) tag
                :returns: string line: Reduced line
                """
                
                original_line = line

                # Runs "libpostal" on line to generate POSTAL-TAGS
                line_AB = crf_NER(line)

                # List of POSTAL-TAGS
                line_AB_tags = [tag[1] for tag in line_AB]

                # find all commas(",") in line
                line_commas = len(re.findall(r"\,", line))

                # If POSTAL-TAG contains: City + State + Zipcode + >=3 commas ---> It's a "AB LINE" (relaxed-tone)
                if all(["city" in line_AB_tags, "state" in line_AB_tags, "postcode" in line_AB_tags, line_commas >= 3]):
                    line = line.strip().split(',')[0]
                    # print("SOFT AB == ", original_line, "; Transformed ==", line)
                return line
        
            # df.LINES
            line = x.LINES
            
            # 1). Replace NER(PHONE, FAX, DATE) with 'whitespace'
            check_PhoneDate = sum([x for x in [[i for x in find_phone(line) for i in x if i!= ''], find_date(line)] if any(x)], [])
            if len(check_PhoneDate) > 0:
                line = re.sub(r"\((?=\d+\))|(?<=\d)\)", " ", line, flags=re.IGNORECASE)
                line = re.sub(r"\s+", " ", re.sub("|".join(check_PhoneDate), " ", line, flags=re.IGNORECASE)).strip()
            
            # 2). Reduce "AB LINE"
            # MODULE 1: Check if a line could be "AB"
            if x.POSTAL_AB == 'AB' or check_AB(line):
                # crf_NER
                tagged_line = [tag[0] for tag in crf_NER(line) if tag[1] =='house']
                tagged_line = tagged_line[0] if len(tagged_line) > 0 else line
                # spacy
                chunkNER = [ent.orth_ for ent in nlp(line).ents if ent.label_ in ['ORG', 'FAC']]
                chunkNER = chunkNER[0] if chunkNER != [] else tagged_line
                line = chunkNER.strip()
                # print("HARD AB == ", x.LINES, "; Transformed == ", line)
            # MODULE 2: Reduce a line if it is an "AB LINE" (relaxed check)
            elif len([ent.label_ for ent in nlp(line).ents if ent.label_=='GPE']) > 0:
                line = check_softAB(line)
            else:
                pass

            # 3). Reduce empty lines
            if len(line) == 0: line = np.nan
            return line
        
        # Reducing line-by-line
        df['LINES'] = df.apply(reduce_entities, axis=1)
        df = df.dropna(subset=['LINES']).reset_index(drop=True)
        return df


    """ MODULE FL 2:> FIND SUPPLIER TAGS IN FILE
    ----------------------------------------------------------------------------------------------------------------------
    - Looks for SupplierTag in PRESENT_LINE, ABOVE_LINE and if found, returns a SupplierName                           """
    def check_SupplierTags(df):
        
        # looking for SupplierTag line-by-line
        MAX_TOKENS = 15
        def find_SupplierTag(line):
            """
            Finds a SupplierTag in line, and takes the following chunk of line as a SupplierName (based on rules).
            :param: string line: A line
            :returns: string line: SupplierName if SupplierTag is found else None
            """
            SupplierTag = [t for t in SUPPLIER_TAGS if re.findall(r"{}".format(t), line, flags=re.IGNORECASE)]
            SupplierTag_value = " ".join(re.findall(r"{}(?!.*[\:].*)(?=.*[A-Za-z].*)(.*)".format(SupplierTag[0]), line.strip(), flags=re.IGNORECASE)) if len(SupplierTag) > 0 else None
            SupplierTag_tokens = re.findall(r"\w+", SupplierTag_value) if SupplierTag_value != None and len(SupplierTag_value) > 0 else []
            if SupplierTag_tokens != []:
                condition1 = 0 < len(SupplierTag_tokens) <= MAX_TOKENS 
                condition2 = SupplierTag_tokens[0].strip().lower() not in NEGATIVE_SUPPLIERWORDS
                if all([condition1, condition2]): 
                    # print("SUPPLIER TAG == ", line, SupplierTag, SupplierTag_value)
                    return SupplierTag_value 
            return None
        
        # looking for SupplierTag
        SUPPLIER_TAG_FOUND = None
        for line_index in df.index.values:
            # looking in "PRESENT_LINE"
            line_present = df.iloc[line_index:line_index+1].LINES.values[0]
            SupplierTag_value = find_SupplierTag(line_present)
            if SupplierTag_value != None:
                SUPPLIER_NAME = SupplierTag_value
                SUPPLIER_TAG_FOUND = "line_present"
                # print("SupplierTag found in present line! Line = {}; SN = {}".format(line_present, SUPPLIER_NAME))
                break
            else:
                # looking in "ABOVE_LINE"
                if line_index != 0:
                    line_above = df.iloc[line_index-1:line_index].LINES.values[0]
                else:
                    line_above = ""
                SupplierTag = [t for t in SUPPLIER_TAGS if re.findall(r"{}".format(t), line_above, flags=re.IGNORECASE)]
                SupplierTag_value = re.findall(r"\w+", " ".join(re.findall(r"{}(.*)".format(SupplierTag[0]), line_above.strip(), flags=re.IGNORECASE))) if len(SupplierTag) > 0 else None
                # Return a SupplierName if both conditions are fulfilled
                condition1 = len(SupplierTag) > 0
                condition2 = SupplierTag_value == []
                if all([condition1, condition2]):
                    SUPPLIER_NAME = line_present
                    SUPPLIER_TAG_FOUND = "line_above"
                    # print("SupplierTag found in above line! Line = {}; SN = {}".format(line_above, SUPPLIER_NAME))
                else:
                    continue 
                    
        if SUPPLIER_TAG_FOUND != None:
            return SUPPLIER_NAME.strip()
        else:
            return None
        

    """ MODULE FL 3:> FIND CUSTOMER TAGS IN FILE
    ----------------------------------------------------------------------------------------------------------------------
    - Looks for CustomerTag in PRESENT_LINE, ABOVE_LINE and if found, returns a CustomerName.                          """
    def check_CustomerTags(df):
        
        # looking for CustomerTag line-by-line
        def find_CustomerTag(line):
            """
            Finds a CustomerTag in line, and takes the following chunk of line as a CustomerName (based on rules).
            :param: string line: A line
            :returns: string line: CustomerName if CustomerTag is found else None
            """
            CustomerTag = [t for t in CUSTOMER_TAGS if re.findall(r"{}".format(t), line, flags=re.IGNORECASE)]
            CustomerTag_value = " ".join(re.findall(r"{}(?!.*[\:].*)(?=.*[a-zA-Z].*)(.*[A-Za-z\&\.\,\(\)\-\@\<\>\\\/\s]+.*)".format(CustomerTag[0]), line.strip(), flags=re.IGNORECASE)) if len(CustomerTag) > 0 else None
            CustomerTag_tokens = re.findall(r"\w+", CustomerTag_value) if CustomerTag_value != None and len(CustomerTag_value) > 0 else []
            if CustomerTag_tokens != []: 
                return CustomerTag_value 
            return None
        
        # looking for CustomerTag
        CUSTOMER_TAG_FOUND = None
        for line_index in df.index.values:
            # looking in "PRESENT_LINE"
            line_present = df.iloc[line_index:line_index+1].LINES.values[0]
            CustomerTag_value = find_CustomerTag(line_present)
            if CustomerTag_value != None:
                CUSTOMER_NAME = CustomerTag_value
                CUSTOMER_TAG_FOUND = "line_present"
                # print("CustomerTag found in present line! Line = {}; CN = {}".format(line_present, CUSTOMER_NAME))
                break
            else:
                # looking in "ABOVE_LINE"
                if line_index != 0:
                    line_above = df.iloc[line_index-1:line_index].LINES.values[0]
                else:
                    line_above = ""
                CustomerTag = [t for t in CUSTOMER_TAGS if re.findall(r"{}".format(t), line_above, flags=re.IGNORECASE)]
                CustomerTag_value = re.findall(r"\w+", " ".join(re.findall(r"{}(.*)".format(CustomerTag[0]), line_above.strip(), flags=re.IGNORECASE))) if len(CustomerTag) > 0 else None
                # Return a CustomerName if both conditions are fulfilled
                condition1 = len(CustomerTag) > 0
                condition2 = CustomerTag_value == []
                if all([condition1, condition2]): 
                    CUSTOMER_NAME = line_present
                    CUSTOMER_TAG_FOUND = "line_above"
                    # print("CustomerTag found in above line! Line = {}; CN = {}".format(line_above, CUSTOMER_NAME))
                else:
                    continue 
        
        if CUSTOMER_TAG_FOUND != None:
            return CUSTOMER_NAME.strip()
        else:
            return None
    # ------------------------------------------------------------------------------------------------------------------ #
    
    
    # ################################################################################################################## #
    # ------------------------------------------ MODULES :: line level (LL) -------------------------------------------- #
    # ################################################################################################################## #

    """ MODULE LL 1:> SORT df.LINES
    ----------------------------------------------------------------------------------------------------------------------
    - Sort the df.LINES giving more importance to some columns.                                                        """
    def sort(df):
        """
        SORTING MECHANISM
        ------------------
        Sort df.LINES in priority order:
        (1). Model_1 Probability    (i.e. generic model's probability of label 1)
        (2). Model_2 Probability    (i.e. NN model's probability of label 1)
        (3). Line-number            (i.e. position/quadrant of line in file)
        """
        df['LINE_NO'] = [line_no for line_no in range(df.shape[0])]  
        # OLD: df['Model1_P1_LINENO'] = [line_no if prob >=0.95 else 0 for prob, line_no in zip(df.Model1_P1, df.LINE_NO)]
        # OLD: df = df.sort_values(by=['Model1_P1', 'Model1_P1_LINENO', 'Model2_P1', 'LINE_NO'], ascending=[False, True, False, True]) 
        df = df.sort_values(by=['Model1_P1', 'Model2_P1', 'LINE_NO'], ascending=[False, False, True]) 
        return df

    
    """ MODULE LL 2:> FILTER NOISE LINES
    ----------------------------------------------------------------------------------------------------------------------
    - Flags lines that fulfil a 'noise/outlier' condition as "CN" in df.QUALIFY (no rows are dropped here).            """
    def check_NoiseLines(df):
        """
        NOISE FILTERING MECHANISM
        --------------------------
        Flag a line as "CN" (instead of "SN") if it does not contain any 'abbrv' AND at least one of these holds:
        (1). Number of tokens is more than 20
        (2). Number of chars is more than 80
        (3). Contains only digits
        (4). Contains more than 70% stop-words
        (5). Contains a negative keyword from NOISE_KEYWORDS or a CustomerTag from CUSTOMER_TAGS
        """
        ######################
        ##  Noise Settings  ##
        MAX_TOKENS = 20
        MAX_LEN = 80
        MAX_ALLDIGITS = 0
        MAX_PERSTOPWORS = 70.0
        MAX_NOISE = 0
        ######################
        def find_NoiseLines(x):           
            tokens = word_tokenize(re.sub(r"\s+", " ", re.sub(r"\^", " ", re.sub(r"[^A-Za-z]", " ", x.LINES.strip(), flags=re.IGNORECASE | re.MULTILINE))))
            check_NumOfTokens = len(tokens)
            check_Length = len(x.LINES.lower().strip())
            check_AllDigits = int(x.F1_CONTAINSALLDIGIT)
            check_PercentStopWords = len([w for w in tokens if w.lower() in stop_words]) * 100.0/check_NumOfTokens if check_NumOfTokens > 0 else 0
            # join both lists into ONE alternation; "|".join(a) + "|".join(b) would fuse a's last pattern with b's first
            check_NegativeKeywords = len(re.findall("|".join(NOISE_KEYWORDS + CUSTOMER_TAGS), str(x.LINES).strip(), flags=re.IGNORECASE | re.MULTILINE))
            check_abbrv = len(re.findall(r"(?=(\b" + '\\b|\\b'.join(list_abbrv_regex) + r"\b))", x.LINES.lower().strip()))
            QUALIFY = "SN"  
            if check_abbrv == 0 and any([check_NumOfTokens > MAX_TOKENS, check_Length > MAX_LEN, 
                                         check_AllDigits > MAX_ALLDIGITS, check_PercentStopWords > MAX_PERSTOPWORS,
                                         check_NegativeKeywords > MAX_NOISE]):
                QUALIFY = "CN"
            return QUALIFY        
        df['QUALIFY'] = df.apply(find_NoiseLines, axis=1)
        return df


    """ MODULE LL 3:> REMOVE DUPLICATE PREDICTIONS
    ----------------------------------------------------------------------------------------------------------------------
    - Removes multiple(duplicate) predictions from SupplierName list and CustomerName list.                            """
    def check_duplicates(prediction_list):
        ##########################
        ##  Duplicate Settings  ##
        MAX_SIMILARITY = 30
        ##########################
        # Returns True(similar or duplicate) if 1st token is a match AND FuzzyWuzzy score > 30
        def check_similarity(s1, s2):
            return True if word_tokenize(s1)[0].strip().lower() == word_tokenize(s2)[0].strip().lower() and \
                           fuzz.ratio(s1.lower(), s2.lower()) > MAX_SIMILARITY else False
        def eliminate_duplicates(a):
            for index, name in enumerate(a):
                for i, match_with in enumerate(a):
                    if index != i and check_similarity(name[0], match_with[0]):
                        a[i][0] = '<DUPLICATE>'
            return a
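        # work on a copy so that the '<DUPLICATE>' markers never leak into prediction_list (it is re-used as a fallback below)
        PL = [list(p) for p in prediction_list]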
        if len(PL) > 2:
            PL = eliminate_duplicates(PL)
            PL = [w for w in PL if w[0] != '<DUPLICATE>']
            if len(PL) == 1:   # in case all were duplicates and hence got removed
                PL += prediction_list
        return PL


    """ MODULE LL 4:> SPLIT PREDICTIONS INTO 2 GROUPS
    ----------------------------------------------------------------------------------------------------------------------
    - Splits the initial prediction list into 2 lists: SupplierName list and CustomerName list.                        """
    def check_split(prediction_list, TAG_SupplierName, TAG_CustomerName):
        """
        SPLITTING MECHANISM
        --------------------
        Split the list based on:
        (1). list contains a SupplierTag
        (2). list contains a CustomerTag
        (3). list contains noise lines
        """
        
        # TAG_SupplierName(T_SN) - Contains potential SupplierName(SN) based on found SupplierTag (viz. MODULE FL 2)
        # TAG_CustomerName(T_CN) - Contains potential CustomerName(CN) based on found CustomerTag (viz. MODULE FL 3)
        
        # If T_SN (i.e. the potential SN) contains an email/url, compare it with the remaining SN predictions
        def update__TAG_SupplierName():
            updated__TAG_SupplierName = TAG_SupplierName
            if TAG_SupplierName != None:  
                # Check for email/url
                domains = sum([x[-1] for x in [find_email(TAG_SupplierName), find_url(TAG_SupplierName)] if any(x)], [])
                # If T_SN is an email/url, find its "domain-name" (e.g. abc@gmail.com -> "gmail")
                domain = domains[0] if len(domains) > 0 else None
                # If T_SN is an email/url, assign a new T_SN
                if domain != None:
                    for chunk in prediction_list:
                        # Assign a new T_SN if: Model_1 P1 >= 70%  AND  FuzzyWuzzy(found-domain, line) > 50
                        if all( [chunk[1] >= 0.70, fuzz.partial_ratio(domain.lower(), chunk[0].lower()) > 50] ):
                            updated__TAG_SupplierName = chunk[0]
                            #print("TAG_SupplierName is a email/url. Updated Tag_SupplierName = ", updated__TAG_SupplierName)
                            break
            return updated__TAG_SupplierName
        
        # Sanity Check: if T_SN contains an email/url -> assign a new T_SN, otherwise keep the same potential SN
        SupplierName = update__TAG_SupplierName()
        
        # Potential CN
        CustomerName = TAG_CustomerName
        
        # 1. If both T_SN, T_CN were found in file!
        if SupplierName != None and CustomerName != None: 
            # SN list = T_SN + predictions;  CN list = T_CN
            SN_list = [(SupplierName, 1.0, 1.0, 'SN', 0)] + prediction_list
            CN_list = [(CustomerName, 1.0, 1.0, 'CN', 0)]
        
        # 2. If only T_CN was found!
        elif SupplierName == None and CustomerName != None:
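            # SN list = predictions - T_CN;  CN list = T_CN
            # drop lines that equal OR contain the CustomerName (escaped, since a name may hold regex characters)
            SN_list = [x for x in prediction_list if x[0] != CustomerName and len(re.findall(re.escape(CustomerName.strip()), x[0], flags=re.IGNORECASE)) == 0]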
            CN_list = [(CustomerName, 1.0, 1.0, 'CN', 0)]
        
        # 3. If only T_SN was found!
        elif SupplierName != None and CustomerName == None:
            # SN list = T_SN + predictions;  CN list = predictions - T_SN
            SN_list = [(SupplierName, 1.0, 1.0, 'SN', 0)] + prediction_list
            # drop lines that equal OR contain the SupplierName (escaped, since a name may hold regex characters)
            CN_list = [x for x in prediction_list if x[0] != SupplierName and len(re.findall(re.escape(SupplierName.strip()), x[0], flags=re.IGNORECASE)) == 0]
        
        # 4. If none were found!
        else: 
            # SN list = predictions;  CN list = predictions
            SN_list = [x for x in prediction_list if x[3] == 'SN']
            CN_list = [x for x in prediction_list if x[3] == 'CN']
        
        # Final Sanity Check: extracted SN, CN are subjected to final review to get separate lists.
        SN_list = [chunk for chunk in SN_list if chunk[3] == 'SN']
        CN_list = [chunk for chunk in CN_list if chunk[3] == 'CN']
        return SN_list, CN_list


    """ MODULE LL 5:> CLEAN SN_LIST 
    ----------------------------------------------------------------------------------------------------------------------
    - Cleans the final prediction SN list, and finds chunks (i.e. exact Supplier Name) from it.                         """
    def check_cleaning(chunk):
        """
        CHUNK-IDENTIFICATION MECHANISM
        -------------------------------
        Find a chunk from the line if:
        (1). Line contains an email/url              --> chunk = domain-name
        (2). Line contains a ".com"                  --> chunk = domain-name
        (3). Line contains a SupplierTag             --> chunk = Supplier tag value
        (4). Line contains more than '12'  words     --> chunk = Spacy NER(ORG, FAC) tag
        (5). Line contains more than '40%' stopwords --> chunk = Removed stopwords
        (6). Line contains whitespaces, symbols      --> chunk = Regex cleaned 
        
        (7). Line does not fulfill (1) to (6)        --> chunk = Entire remaining line
        """
        ######################
        ##  Clean Settings  ##
        MAX_TOKENS = 12
        MAX_STOPWORDS_PERCENT = 40.0
        ######################
        
        # chunk - Predicted lines from final SN list
        
        # 1. If line is an email/url --> reduce the line to its domain-name
        domains = sum([x[-1] for x in [find_url(chunk), find_email(chunk)] if any(x)], [])
        domain = domains[0].upper() if len(domains) > 0 else None
        if domain != None:
            return domain
        else:
            # 2. If no domain-name was found above, perform a hard-check for ".com" --> reduce the line to its domain-name
            domain_2 = re.findall(r"([A-Za-z]+)\.com\b", chunk, flags=re.IGNORECASE)
            if len(domain_2) > 0:
                return domain_2[0].upper()
        
        # 3. If line contains a tag (Supplier, Customer or Mixed) --> reduce the line to the value that follows the tag
        ALL_TAGS = SUPPLIER_TAGS + CUSTOMER_TAGS + MIXED_TAGS
        find_Tag = [tag for tag in ALL_TAGS if re.findall(r"{}".format(tag), chunk.strip(), flags=re.IGNORECASE)]
        value_TAG = " ".join(re.findall(r"{}(?=.*[a-zA-Z].*)(.*)".format(find_Tag[0]), chunk.strip(), flags=re.IGNORECASE)).strip() if len(find_Tag) > 0 else None
        if value_TAG:   # an empty match is skipped so that '' is never returned as a chunk
            return value_TAG
        
        # Tokenize the line into tokens and calculate % of stop-words
        tokens = re.findall(r"\w+", chunk.strip())
        tokens_stopwords_percentage = len([w for w in tokens if w.lower() in stop_words])*100.0/len(tokens) if len(tokens) > 0 else 0
            
        # 4. If line contains more than '12' words --> reduce the line to SpacyNER(ORG, FAC) tag
        if len(tokens) > MAX_TOKENS:
            chunkNER = [ent.orth_ for ent in nlp(chunk).ents if ent.label_ in ['ORG', 'FAC']]
            chunkNER = chunkNER[0] if chunkNER != [] else None
            if chunkNER != None: return chunkNER.strip()
        
        # 5. If line contains more than '40%' stopwords --> remove stop-words from the line
        if tokens_stopwords_percentage > MAX_STOPWORDS_PERCENT:
            chunk = " ".join([w for w in tokens if w.lower() not in stop_words])
        
        # 6. Clean the line
        chunk = re.sub(r"\s+", " ", re.sub(r"\^", " ", re.sub(r"[^A-Za-z0-9\@\&\(\)\-]", " ", chunk.strip(), flags=re.IGNORECASE))).strip()
        chunk = " ".join([w for w in word_tokenize(chunk) if w.lower() not in SMART_STOPWORDS])
        return chunk
    
    
    """ MODULE LL 6:> FINAL SN_LIST
    ----------------------------------------------------------------------------------------------------------------------
    - Chooses the final SN list from SN_list and CN_list.                                                              """
    def get_final_chunks(SN_list, CN_list):
        """
        Picks top "N" from SN_list + "1" from CN_list (condition: if the mean prob(1) of CN_list's top value >= 0.80)
        :param: list SN_list, list CN_list
        :returns: list FINAL SN_LIST containing 'N' or 'N+1' Supplier Names
        """
        # # EDIT THIS :: Gives top 'N' predictions from SN lists
        N_VALUES = 3

        TOP_SNlist = [(sn[0], round(np.mean([sn[1], sn[2]]) * 100.0, 2)) for sn in SN_list[:N_VALUES]]
        if len(CN_list) != 0 and np.mean([CN_list[0][1], CN_list[0][2]]) >= 0.80:
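            if len(TOP_SNlist) > 0:
                # ranked 10 points below the last SN that was kept (SN_list may hold fewer than N_VALUES entries)
                TOP_CNlist = [(CN_list[0][0], float(TOP_SNlist[-1][1]) - 10.0)]
            else:
                # no SN available: fall back to the CN's own mean probability
                TOP_CNlist = [(CN_list[0][0], round(np.mean([CN_list[0][1], CN_list[0][2]]) * 100.0, 2))]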
        else:
            # borrow from SN_list if CN_list is empty!
            TOP_CNlist = [(sn[0], round(np.mean([sn[1], sn[2]]) * 100.0, 2)) for sn in SN_list[N_VALUES:N_VALUES + 1]]
        FINAL_LIST = TOP_SNlist + TOP_CNlist
        return FINAL_LIST
            
    ######################################################################################################################
    # Chunk Identification Starts......
    
    
    # ------------------------------------------ CHUNK :: file level ---------------------------------------------------- #
    # 1. MODULE FL 1: Reduce NER entities, AB Lines in predictions
    df = check_entities(df)
    
    # 2. MODULE FL 2: Find Supplier Tags in whole file
    TAG_SupplierName = check_SupplierTags(df)
    
    # 3. MODULE FL 3: Find Customer Tags in whole file
    TAG_CustomerName = check_CustomerTags(df)
    print("Found :: SN_Tag = {};  CN_Tag = {}".format(TAG_SupplierName, TAG_CustomerName))
    # ------------------------------------------------------------------------------------------------------------------- #

    
    # ------------------------------------------ CHUNK :: line level ---------------------------------------------------- #
    final_SN, final_CN = [], []
    
    # 4. MODULE LL 1: Sort df.LINES
    sorted_df = sort(df)  
    print("\nModule_LL_1: Sorted Lines, Predictions == ", sorted_df.LINES.tolist())
    
    # 5. MODULE LL 2: Filter noise(outliers/unwanted) df.LINES
    sorted_df = check_NoiseLines(sorted_df)
    print("\nModule_LL_2: Removed Noise, Predictions == ", sorted_df.LINES.tolist())
    
    # 6. Create "Prediction List" with only the necessary columns
    # chunk[0] = df.LINES
    # chunk[1] = Model1_P1
    # chunk[2] = Model2_P1
    # chunk[3] = QUALIFY (SN/CN)
    # chunk[4] = LINE_NO
    prediction_list = [[L, P1, P2, Q, Ln] for L, P1, P2, Q, Ln in zip(sorted_df.LINES.tolist(), sorted_df.Model1_P1.tolist(), sorted_df.Model2_P1.tolist(), sorted_df.QUALIFY.tolist(), sorted_df.LINE_NO.tolist())]
    
    # 7. MODULE LL 3: Remove duplicates from predictions
    prediction_list = check_duplicates(prediction_list) 
    print("\nModule_LL_3: Removed Duplicates, Predictions == ", prediction_list)
    
    # 8. MODULE LL 4: Split predictions list --> SN list, CN list
    SN_list, CN_list = check_split(prediction_list, TAG_SupplierName, TAG_CustomerName)
    print("\nModule_LL_4: Split Predictions into two, SN_LIST == {}; CN_LIST == {};".format(SN_list, CN_list))
    
    # 9. MODULE LL 5: Cleans the final prediction SN list, and finds chunks (i.e. exact Supplier Name) from it. 
    SN_list = list(map(lambda chunk: [check_cleaning(chunk[0]), chunk[1], chunk[2], chunk[3], chunk[4]], SN_list))
    print("\nModule_LL_5: Cleaned List, SN_LIST == {}".format(SN_list))
    
    # 10. MODULE LL 6: Final List of SN...
    FINAL_SN_list = get_final_chunks(SN_list, CN_list)
    print("\nModule_LL_6: Final SN List == {}".format(FINAL_SN_list))
    
    #------------------------------------------------------------------------------------------------------------------- #    
    ######################################################################################################################

    return FINAL_SN_list, SN_list, CN_list
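
The duplicate rule of MODULE LL 3 (same first token AND a similarity score above 30) can be tried in isolation. The sketch below is a simplified, non-mutating variant, not the notebook's implementation: it assumes `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` and plain whitespace splitting instead of `word_tokenize`, so the scores will differ somewhat from FuzzyWuzzy's. The helper names `similarity`, `first_token` and `drop_duplicates` are hypothetical.

```python
from difflib import SequenceMatcher

MAX_SIMILARITY = 30  # same threshold as in MODULE LL 3


def similarity(s1, s2):
    """0-100 similarity score (assumption: difflib replaces fuzz.ratio here)."""
    return 100.0 * SequenceMatcher(None, s1.lower(), s2.lower()).ratio()


def first_token(s):
    """First whitespace-separated token, lower-cased ('' for an empty string)."""
    parts = s.split()
    return parts[0].lower() if parts else ""


def drop_duplicates(predictions):
    """Keep the first prediction of each group of near-duplicates.

    Works on copies, so the input list is left untouched.
    """
    kept = []
    for pred in predictions:
        is_duplicate = any(
            first_token(pred[0]) == first_token(k[0])
            and similarity(pred[0], k[0]) > MAX_SIMILARITY
            for k in kept
        )
        if not is_duplicate:
            kept.append(list(pred))
    return kept


# Each entry mirrors the prediction layout: [LINES, Model1_P1, Model2_P1, QUALIFY, LINE_NO]
predictions = [
    ["ACME Corp", 0.98, 0.91, "SN", 0],
    ["ACME Corporation Ltd", 0.95, 0.90, "SN", 4],
    ["Globex Inc", 0.90, 0.85, "SN", 7],
]
unique = drop_duplicates(predictions)
print([p[0] for p in unique])
```

Because the sketch never writes a `'<DUPLICATE>'` marker into its input, the original predictions remain usable as a fallback afterwards.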

In [185]:
#####################################################################################
# Breaks the entire df into file-by-file df and extracts chunk per file
#####################################################################################
# INPUT   --> all files (whole df)
# OUTPUT  --> list of Supplier Name chunks, list of Customer Name chunks for all files

def predict_chunk_lines(df):

    # Execute for each filename...
    final_df = []
    for f in df.FILENAME.unique():
        
        # For each filename, create a temp df
        tempdf = df[df.FILENAME == f].copy().reset_index(drop=True)

        ###########################################################################
        # 1. Line Classification
        # TRUE
        actual_SN = str(tempdf.SUPPLIER_NAME.values[0])
        # IF SUPPLIER NAME LINE WAS FOUND
        if 1 in tempdf.Y_SN.values:
            correct_LINES = tempdf[(tempdf.Y_SN == 1) & (tempdf.Model1_Y_PRED == 1)]['LINES'].tolist()
        else:
            correct_LINES = []
        if len(correct_LINES) > 0:
            correct_LINE_found = 1
        else:
            correct_LINE_found = 0
        ###########################################################################

        ###########################################################################
        # 2. Chunk Identification
        FINAL_SN_list, SN_list, CN_list = chunk_identification(tempdf)
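        # SN_list can be empty when every line was filtered out; such a file scores 0
        S1 = fuzz.partial_ratio(actual_SN.lower(), SN_list[0][0].lower()) if len(SN_list) > 0 else 0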
        if S1 < 90:
            S1 = 0
        ###########################################################################

        # STORE
        final_df.append({"FILENAME": f,
                         "SN": actual_SN,
                         "FINAL_SN": FINAL_SN_list,
                         "PRED_SN": SN_list, 
                         "CL": correct_LINE_found,
                         "S1": S1, 
                         "PRED_CN": CN_list})

    # FINAL DATAFRAME
    pred_df = pd.DataFrame.from_dict(final_df)
    print("Total Files = {}; Correct Lines = {}; Lines Missed = {}\n>> SCORE = {};"
          .format(pred_df.shape[0], pred_df.CL.sum(), pred_df.shape[0] - pred_df.CL.sum(), pred_df.S1.mean()))
    
    # DISPLAY RESULTS
    pred_df = pred_df[['FILENAME', 'SN', 'FINAL_SN', 'PRED_SN', 'PRED_CN']]
    return pred_df
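
The per-file score `S1` used above keeps the partial-match score of the top prediction only when it reaches 90, and the reported `SCORE` is the mean over all files. A minimal sketch of that rule follows. It assumes a sliding-window comparison built on `difflib.SequenceMatcher` as a rough approximation of `fuzz.partial_ratio` (the real FuzzyWuzzy scores can differ), and the helper names `partial_score` and `file_score` are hypothetical.

```python
from difflib import SequenceMatcher


def partial_score(a, b):
    """Best 0-100 ratio of the shorter string against equal-length windows of the longer one."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    if not short:
        return 0.0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return 100.0 * best


def file_score(actual_sn, predicted_sns, threshold=90):
    """Score of the top prediction; below the threshold, or with no prediction, the file scores 0."""
    if not predicted_sns:
        return 0.0
    score = partial_score(actual_sn, predicted_sns[0])
    return score if score >= threshold else 0.0


results = [
    file_score("Acme Corp", ["ACME CORP LTD", "Globex"]),  # true name is a substring of the top prediction
    file_score("Acme Corp", ["Globex Inc"]),               # wrong name
    file_score("Acme Corp", []),                           # nothing predicted
]
print(results, sum(results) / len(results))
```

Zeroing every score under the threshold makes the mean behave like a strict hit rate: a near miss contributes nothing rather than partial credit.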

#### SAMPLE CASE


In [186]:
# Sample

# create unseen_df with probs
# 1. load unseen_df
# 2. run a fitted model on unseen_df
# 3. store predictions in unseen_df

# run Phase II: Chunk Identification Module
final_df = predict_chunk_lines(unseen_df_labelled)

# examine final_df
pd.set_option('display.max_colwidth', 800)
final_df[['FILENAME', 'SN', 'FINAL_SN']][:50]

NameError: name 'unseen_df_labelled' is not defined
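
The sample cell needs `unseen_df_labelled` to exist before `predict_chunk_lines` is called. Once the three commented steps have produced it, a quick check of its columns can catch a missing model output early. The sketch below lists only the columns that the code in this section reads; other parts of the notebook may require more. The helper name `missing_columns` is hypothetical.

```python
# Columns read by chunk_identification / predict_chunk_lines in this section.
REQUIRED_COLUMNS = [
    "FILENAME", "LINES", "SUPPLIER_NAME", "Y_SN",
    "Model1_Y_PRED", "Model1_P1", "Model2_P1",
    "POSTAL_AB", "F1_CONTAINSALLDIGIT",
]


def missing_columns(columns):
    """Return the required columns that are absent from `columns` (hypothetical helper)."""
    present = set(columns)
    return [c for c in REQUIRED_COLUMNS if c not in present]


# With a DataFrame this would be called as missing_columns(unseen_df_labelled.columns).
example_columns = ["FILENAME", "LINES", "SUPPLIER_NAME", "Y_SN", "Model1_P1"]
print(missing_columns(example_columns))
```

An empty list means the frame carries everything this section expects.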

----
----

# APPENDIX

## <ins>14. OLD Methods/Versions for Chunk Identification Module</ins>

- Version 3.0 - 21/05/2020
- Version 2.0 - 10/05/2020
- Version 1.0 - 30/04/2020

##### Version 3.0

In [187]:
# OLD :: Version 3.0
# def chunk_identification(df):
#     ''' REGEX TAGS "remit\s+to[^\:]*\:+",
#     ----------------------------------------------------------------------------------------------------------------------'''
#     SUPPLIER_TAGS = [r"\bdirected\s+inquiries\s+to\b\s*[\:|\-]+\s*", r"\bdirect\s+inquiries\s+to\b\s*[\:|\-]+\s*", 
#                      r"\bdirect\s+inquiries\s+to\b\s*[\:|\-]+\s*", r"\bdirecting\s+to\b\s*[\:|\-]+\s*", 
#                      r"\bdirected\s+to\b\s*[\:|\-]+\s*", r"\bdirects\s+to\b\s*[\:|\-]+\s*", r"\bdirect\s+to\b\s*[\:|\-]+\s*", 
#                      r"\bcrediting\s+to\b\s*[\:|\-]+\s*", r"\bcredited\s+to\b\s*[\:|\-]+\s*", r"\bcredits\s+to\b\s*[\:|\-]+\s*",
#                      r"\bcredit\s+to\b\s*[\:|\-]+\s*", r"\bremiting\s+to\b\s*[\:|\-]+\s*", r"\bremittance\s+to\b\s*[\:|\-]+\s*",
#                      r"\bremitted\s+to\b\s*[\:|\-]+\s*", r"\bremits\s+to\b\s*[\:|\-]+\s*", r"\bremit\s+to\b\s*[\:|\-]+\s*",
#                      r"f/b/o\s*", r"\baccount\s+name\b\s*[\:|\-]+\s*", r"\bchecked\s+payable\s+to\b\s*[\:|\-]+\s*", 
#                      r"\bchecks\s+payable\s+to\b\s*[\:|\-]+\s*", r"\bchecks\s+payable\s+at\b\s*[\:|\-]+\s*", 
#                      r"\bcheck\s+payable\s+at\b\s*[\:|\-]+\s*", r"\bcheck\s+payable\s+to\b\s*[\:|\-]+\s*", 
#                      r"\bchecks\s+pay\s+to\b\s*[\:|\-]+\s*", r"\bcheck\s+pay\s+to\b\s*[\:|\-]+\s*", 
#                      r"\bchecking\s+to\b\s*[\:|\-]+\s*", r"\bchecked\s+to\b\s*[\:|\-]+\s*", r"\bpayable\s+to\b\s*[\:|\-]+\s*", 
#                      r"\bchecks\s+to\b\s*[\:|\-]+\s*", r"\bcheck\s+to\b\s*[\:|\-]+\s*", r"\bpays\s+to\b\s*[\:|\-]+\s*", 
#                      r"\bpay\s+to\b\s*[\:|\-]+\s*"]
    
#     NEGATIVE_SUPPLIERWORDS = ["attn", "from", "subject", "terms", "please", "invoice", "invoices", "invoiceno", 
#                               "amount", "amounts", "address", "addresses", "addrss", "telephone", "phone", "phones", "call", 
#                               "callat", "phoneat", "fax", "cancel", "notice", "confidential",  "client"]               
                           
#     CUSTOMER_TAGS = [r"\bbilling\s+to\b\s*\:*\s*", r"\bbilled\s+to\b\s*\:*\s*", r"\bbills\s+to\b\s*\:*\s*", 
#                      r"\bbill\s+to\b\s*\:*\s*", r"\bbilling\s+at\b\s*\:*\s*", r"\bbilled\s+at\b\s*\:*\s*", 
#                      r"\bbills\s+at\b\s*\:*\s*", r"\bbill\s+at\b\s*\:*\s*", r"\bshipping\s+to\b\s*\:*\s*", 
#                      r"\bshipped\s+to\b\s*\:*\s*", r"\bships\s+to\b\s*\:*\s*", r"\bship\s+to\b\s*\:*\s*", 
#                      r"\bshipto\b\s*\:*\s*", r"\bshippingto\b\s*\:*\s*", r"\bshipsto\b\s*\:*\s*"]
    
#     MIXED_TAGS = [r"\bbelonging\s+to\b\s*\:*\s*", r"\bbelonged\s+to\b\s*\:*\s*", r"\bbelongs\s+to\b\s*\:*\s*", 
#                   r"\bbelong\s+to\b\s*\:*\s*", r"\bdivisions\s+of\b\s*\:*\s*", r"\bdivision\s+of\b\s*\:*\s*", 
#                   r"\bdivision\s+of\b\s*\:*\s*", r"\bdivs\s+of\b\s*\:*\s*", r"\bdiv\s+of\b\s*\:*\s*", 
#                   r"\bconnect\s+with\b\s*\:*\s*", r"\bcompany\s+of\b\s*\:*\s*", r"\bcomp\s+of\b\s*\:*\s*", 
#                   r"\bfirm\s+of\b\s*\:*\s*", r"\bname\s+of\b\s*\:*\s*", r"\baccount\s+of\b\s*\:*\s*", 
#                   r"\bacc\s+of\b\s*\:*\s*", r"\baccount\s+name\b\s*\:*\s*", r"\bacc\s+name\b\s*\:*\s*", 
#                   r"\bis\s+of\b\s*\:*\s*", r"\bis\s+to\b\s*\:*\s*", 
#                   r"\bto\b\s*\:*\s+", 
#                   r"^by\b\s*\:*\s+", r"^mailing\b\s*\:*\s+", r"^to\b\s*\:*\s+", r"^of\b\s*\:*\s+",
#                   r"^mail\b\s*\:*\s+", r"^payable\s+to\b\s*\:*\s+", r"^checks\s+payable\s+to\b\s*\:*\s+",
#                   r"^check\s+payable\s+to\b\s*\:*\s+", r"^checks\s+to\b\s*\:*\s+", r"^check\s+to\b\s*\:*\s+"]
        
#     NOISE_KEYWORDS = [r"\battn\b[^\:\s]*\:+", r"\bfrom\b\s*\:+", r"\bsubject\b\s*\:+", r"\bcc\b\s*\:+", r"\bbcc\b\s*\:+", 
#                       r"\bterms\s+and\s+condition\b", r"\bterms\s+\&\s+conditions\b", r"\bterms\&conditions\b", 
#                       r"\btermsandconditions\b", r"\bT\&C\b", r"\bTC\b", r"\bif\s+you\s+have\s+any\s+questions\b",
#                       r"\bif\s+you\s+have\s+any\s+question\b", r"\byou\s+have\s+any\s+questions\b", 
#                       r"\byou\s+have\s+any\s+question\b", r"\byou\s+have\s+questions\b", r"\byou\s+have\s+question\b",
#                       r"\bif\s+you\s+have\s+questions", r"\bplease\s+write\s+to\b[^\:\s]*\:+", 
#                       r"\bplease\s+write\s+at\b[^\:\s]*\:+", r"\bwrite\s+to\b[^\:\s]*\:+", 
#                       r"\bcontact\s+to\b[^\:\s]*\:+", r"\bemail\s+to\b[^\:\s]*\:+", r"\bemail\s+at\b[^\:\s]*\:+", 
#                       r"\bemail\b\s+\@\s*\:+", r"\bconfidential\b"]

#     PHONE_TAGS = [r"\bphone\b\s*\:*\s+", r"\bphone\s+at\b\s*\:*\s+", r"\bphones\b\s*\:*\s+", r"\bphone\s+on\b\s*\:*\s+", 
#                   r"\bcall\b\s*\:*\s+", r"\bcalling\b\s*\:*\s+", r"\bcall\s+at\b\s*\:*\s+", r"\bcalls\s+at\b\s*\:*\s+", 
#                   r"\btelephone\b\s*\:*\s+", r"\bt\b[^\:\s]*\:+\s+", r"\btelephone\s+at\b\s*\:*\s+", 
#                   r"\btelephones\b\s*\:*\s+", r"\bdial\b\s*\:*\s+", r"\bdial\s+at\b\s*\:*\s+", r"\bdate\b\s*\:*\s+", 
#                   r"\bday\b\s*\:*\s+", r"\btime\b\s*\:*\s+", r"\bcalendar\b\s*\:*\s+", r"\bfax\b\s*\:*\s+"]
                              
#     # runs on each prediction...
#     SMART_STOPWORDS = ["https", "http", "please", "pls", "lets", "know", "knowing", "knows", 
#                        "respond", "responding", "responds", "naming", "names", "thanks", "thank", "thankyou", 
#                        "thanksyou", "thanking", "thankingyou", "thanx", "thnx", "thx", "asap", 
#                        "dear", "hello", "kindly", "regards", "hi", "hola", "hey", "yeah", "nope", 
#                        "advise", "attached", "days", "months", "month", "invoice", "invoices", "invocing", 
#                        "ordernumber", "invoiceno", "accountno", "remit", "remittance", "fbo", "inquiries", 
#                        "payable", "billed", "billing", "bills", "shipped", "shipping", "ships", "shipto", "shippingto", 
#                        "shipsto", "terms&conditions", "termsandconditions", "confidential", "t&c","emailed", "customerno", 
#                        "duration", "since", "address", "order", "make", "check", "checks", "submit", "submits", "sumitting" , 
#                        "submitto",  "paying", "paymentsto", "notices", "phone", "phones", "phoning", "phono", "emailo", 
#                        "cancel", "cancelling", "cancellation", "cancellations", "requires", "require", "required", 
#                        "requiring", "noticeable", "noticing", "date", "dates", "months", "month", "day", "days",
#                        "addresses", "address", "inovicenumber",
#                        "if", "because", "is", "there", "always", 'itself', 'shouldn', 'there', 'were', 'should', 
#                        'who', 'hasn', "you've", 'he', "mustn't", 'whom', 'mustn', 'very', 'doesn', 'have', 'here', 
#                        'those', 'wasn', 'having', "aren't", 'yourselves', "hasn't", 'themselves', 'until', "you'd", 
#                        'shan', 'him', 'ourselves', 'aren', "isn't", "haven't", 'that', 'herself', "hadn't", 'both',
#                        'where', 'most', 'doing', 'further', 'any', 'didn', 'theirs', "weren't", 'same', 'you', 'hadn', 
#                        'myself', 'yours', 'won', 'mightn', 'she', 'hers', 'weren', "she's", 'does', 'during', 'was', 
#                        'wouldn', 'because', "won't", 'again', 'his', "you'll", 'once', 'between', 'couldn', 'how', 
#                        'what', "shouldn't", 'then', 'own', 'above', "doesn't", 'had', "wasn't", 'them', 'when', 'such', 
#                        'while', 'if', 'did', 'before', "couldn't", 'your', "shan't", 'other', 'which', 'himself', 'through', 
#                        'below', 'under', 'too', "mightn't", 'been', "should've", 'few', 'these', "you're", 'their', 
#                        'can', 'each', "it's", 'has', "needn't", 'but', "wouldn't", 'needn', 'they', "didn't", 'nor', 
#                        "that'll", "don't", 'than', 'some']

#     ''' ------------------------------------------------------------------------------------------------------------------'''
    
    
#     # ################################################################################################################## #
#     # ------------------------------------------ MODULES :: file level ------------------------------------------------- #
#     # ################################################################################################################## #
    
#     ''' FIND SUPPLIER TAGS IN FILE
#     ----------------------------------------------------------------------------------------------------------------------
#     - Looks for supplier tags in whole file. Finds a matching supplier tag in: 'present_line' and 'above_line'.        '''
#     def check_SupplierTags(df):
#         # looking for SUPPLIER_TAG
#         MAX_TOKENS = 15
#         def find_SupplierTag(line):
#             SupplierTag = [t for t in SUPPLIER_TAGS if re.findall(r"{}".format(t), line, flags=re.IGNORECASE)]
#             SupplierTag_value = " ".join(re.findall(r"{}(?!.*[\:].*)(?=.*[A-Za-z].*)(.*)".format(SupplierTag[0]), line.strip(), flags=re.IGNORECASE)) if len(SupplierTag) > 0 else None
#             SupplierTag_tokens = re.findall(r"\w+", SupplierTag_value) if SupplierTag_value != None and len(SupplierTag_value) > 0 else []
#             if SupplierTag_tokens != []:
#                 condition1 = 0 < len(SupplierTag_tokens) <= MAX_TOKENS 
#                 condition2 = SupplierTag_tokens[0].strip().lower() not in NEGATIVE_SUPPLIERWORDS
#                 if all([condition1, condition2]): return SupplierTag_value 
#             return None
        
#         # if Tag not found, checks for Line context - AddressBlock(AB, Phone/Fax, Email/Url)
#         def find_SupplierAddressBlock(line):
#             return
        
#         SUPPLIER_TAG_FOUND = None
#         for line_index in df.index.values:
#             line_present = df.iloc[line_index:line_index+1].LINES.values[0]
#             SupplierTag_value = find_SupplierTag(line_present)
#             if SupplierTag_value != None:
#                 # looking in present_line
#                 SUPPLIER_NAME = SupplierTag_value
#                 SUPPLIER_TAG_FOUND = "line_present"
#                 break
#             else:
#                 # looking in above_line
#                 if line_index != 0:
#                     line_above = df.iloc[line_index-1:line_index].LINES.values[0]
#                 else:
#                     line_above = ""
#                 if len([t for t in SUPPLIER_TAGS if re.findall(r"{}".format(t), line_above, flags=re.IGNORECASE)]) > 0:
#                     SUPPLIER_NAME = line_present
#                     SUPPLIER_TAG_FOUND = "line_above"
#                 else:
#                     continue 
#         if SUPPLIER_TAG_FOUND != None:
#             return SUPPLIER_NAME.strip()
#         else:
#             #find_SupplierAddressBlock()
#             return None
        

#     ''' FIND CUSTOMER TAGS IN FILE
#     ----------------------------------------------------------------------------------------------------------------------
#     - Looks for customer tags in whole file. Finds a matching customer tag in: 'present_line' and 'above_line'.        '''
#     def check_CustomerTags(df):
#         def find_CustomerTag(line):
#             CustomerTag = [t for t in CUSTOMER_TAGS if re.findall(r"{}".format(t), line, flags=re.IGNORECASE)]
#             CustomerTag_value = " ".join(re.findall(r"{}(?!.*[\:].*)(?=.*[a-zA-Z].*)(.*[A-Za-z\&\.\,\(\)\-\@\<\>\\\/\s]+.*)".format(CustomerTag[0]), line.strip(), flags=re.IGNORECASE)) if len(CustomerTag) > 0 else None
#             CustomerTag_tokens = re.findall(r"\w+", CustomerTag_value) if CustomerTag_value != None and len(CustomerTag_value) > 0 else []
#             if CustomerTag_tokens != []: 
#                 return CustomerTag_value 
#             return None
#         CUSTOMER_TAG_FOUND = None
#         for line_index in df.index.values:
#             line_present = df.iloc[line_index:line_index+1].LINES.values[0]
#             CustomerTag_value = find_CustomerTag(line_present)
#             if CustomerTag_value != None:
#                 # looking in present_line
#                 CUSTOMER_NAME = CustomerTag_value
#                 CUSTOMER_TAG_FOUND = "line_present"
#                 break
#             else:
#                 # looking in above_line
#                 if line_index != 0:
#                     line_above = df.iloc[line_index-1:line_index].LINES.values[0]
#                 else:
#                     line_above = ""
#                 if len([t for t in CUSTOMER_TAGS if re.findall(r"{}".format(t), line_above, flags=re.IGNORECASE)]) > 0:
#                     CUSTOMER_NAME = line_present
#                     CUSTOMER_TAG_FOUND = "line_above"
#                 else:
#                     continue 
#         if CUSTOMER_TAG_FOUND != None:
#             return CUSTOMER_NAME.strip()
#         else:
#             return None
#     # ------------------------------------------------------------------------------------------------------------------ #
    
    
#     # ################################################################################################################## #
#     # ------------------------------------------ MODULES :: line level ------------------------------------------------- #
#     # ################################################################################################################## #

#     ''' SORT USING A SCORING DECISION MATRIX
#     ----------------------------------------------------------------------------------------------------------------------
#     - Sort the df Lines based on a scoring mechanism. SORT on:
#         (1). on model1_probability
#         (2). on line_number for lines with model1_probability > 95%
#         (3). on model2_probability ...(if model1_probability less than 95%, skip line_number and check model2_probability)
#         (4). line_number                                                                                                '''
#     def sort(df):
#         # generic line_number
#         df['LINE_NO'] = [line_no for line_no in range(df.shape[0])]  
#         # modified line_number: kept only where prob >= 0.95, else 0
#         df['Model1_P1_LINENO'] = [line_no if prob >=0.95 else 0 for prob, line_no in zip(df.Model1_P1, df.LINE_NO)]
#         # sorting
#         df = df.sort_values(by=['Model1_P1', 'Model1_P1_LINENO', 'Model2_P1', 'LINE_NO'], ascending=[False, True, False, True]) 
#         return df


# #     ''' REDUCE ENTITIES (AB, FAX, PHONE, DATE) IN PREDICTION LINES
# #     ----------------------------------------------------------------------------------------------------------------------
# #     - Finds and replaces or reduces NER with <address>, <phone>, <fax>, <date> phone-numbers, date in list of predicted names.                                                          '''
# #     def check_entities(df):
        
# #         def replace_entities(x):
# #             line = x.LINES
            
# #             # Reduce AB
# #             if x.POSTAL_AB == 'AB':
# #                 chunkNER = [ent.orth_ for ent in nlp(line).ents if ent.label_ in ['ORG', 'FAC']]
# #                 chunkNER = chunkNER[0] if chunkNER != [] else '<address>'
# #                 line = chunkNER.strip()
            
# #             # Reduce Phone, Date
# #             check_PhoneDate = sum([x for x in [[i for x in find_phone(line) for i in x if i!= ''], find_date(line)] if any(x)], [])
# #             if len(check_PhoneDate) > 0:
# #                 line = re.sub(r"\((?=\d+\))|(?<=\d)\)", " ", line, flags=re.IGNORECASE)
# #                 line = re.sub(r"\s+", " ", re.sub("|".join(check_PhoneDate), " ", line, flags=re.IGNORECASE)).strip()
            
# #             # Reduce empty lines
# #             if len(line) == 0: line = np.nan
# #             return line
        
# #         df['LINES'] = df.apply(replace_entities, axis=1)
# #         df = df.dropna(subset=['LINES']).reset_index(drop=True)
# #         return df
         

#     ''' FILTER NOISE LINES IN PREDICTION LINES
#     ----------------------------------------------------------------------------------------------------------------------
#     - Looks for lines that fulfill the 'noise/outlier' condition as per below configuration.                           '''
#     def check_NoiseLines(df):
#         ######################
#         ##  Noise Settings  ##
#         MAX_TOKENS = 20
#         MAX_LEN = 80
#         MAX_ALLDIGITS = 0
#         MAX_PERSTOPWORS = 70.0
#         MAX_NOISE = 0
#         MAX_ABBRV = 0
#         ######################
#         def find_NoiseLines(x):           
#             tokens = word_tokenize(re.sub(r"\s+", " ", re.sub(r"\^", " ", re.sub(r"[^A-Za-z]", " ", x.LINES.strip(), flags=re.IGNORECASE | re.MULTILINE))))
#             check_NumOfTokens = len(tokens)
#             check_Length = len(x.LINES.lower().strip())
#             check_AllDigits = int(x.F1_CONTAINSALLDIGIT)
#             check_PercentStopWords = len([w for w in tokens if w.lower() in stop_words]) * 100.0/check_NumOfTokens if check_NumOfTokens > 0 else 0
#             check_NegativeKeywords = len(re.findall("|".join(NOISE_KEYWORDS + CUSTOMER_TAGS), str(x.LINES).strip(), flags=re.IGNORECASE | re.MULTILINE))
#             check_abbrv = len(re.findall(r"(?=(\b" + '\\b|\\b'.join(list_abbrv_regex) + r"\b))", x.LINES.lower().strip()))
#             # adjust above noise settings
#             QUALIFY = "SN"  
#             if check_abbrv == 0 and any([check_NumOfTokens > MAX_TOKENS, check_Length > MAX_LEN, 
#                                          check_AllDigits > MAX_ALLDIGITS, check_PercentStopWords > MAX_PERSTOPWORS,
#                                          check_NegativeKeywords > MAX_NOISE]):
#                 QUALIFY = "CN"
#             return QUALIFY        
#         df['QUALIFY'] = df.apply(find_NoiseLines, axis=1)
#         return df


#     ''' REMOVE DUPLICATE NAMES (SN/CN) IN PREDICTION LINES
#     ----------------------------------------------------------------------------------------------------------------------
#     - Removes duplicate-names in list of predicted names.                                                              '''
#     def check_duplicates(prediction_list):
#         ##########################
#         ##  Duplicate Settings  ##
#         MAX_SIMILARITY = 30
#         ##########################
#         def check_similarity(s1, s2):
#             return True if word_tokenize(s1)[0].strip().lower() == word_tokenize(s2)[0].strip().lower() and \
#                            fuzz.ratio(s1.lower(), s2.lower()) > MAX_SIMILARITY else False
#         def eliminate_duplicates(a):
#             for index, name in enumerate(a):
#                 for i, match_with in enumerate(a):
#                     if index != i and check_similarity(name[0], match_with[0]):
#                         a[i][0] = '<DUPLICATE>'
#             return a
#         PL = prediction_list
#         if len(PL) > 2:
#             PL = eliminate_duplicates(PL)
#             PL = [w for w in PL if w[0] != '<DUPLICATE>']
#             if len(PL) == 1:   # in case all were duplicates and hence got removed
#                 PL += prediction_list
#         return PL


#     ''' REMOVE ENTITIES (FAX, PHONE, DATE) IN PREDICTION LINES
#     ----------------------------------------------------------------------------------------------------------------------
#     - Removes phone-numbers, date in list of predicted names.                                                          '''
#     def check_entities(prediction_list):
#         filtered_prediction_list = []
#         for chunk in prediction_list:
#             find_PhoneDateRE = sum([x for x in [[i for x in find_phone(chunk[0]) for i in x if i!= ''], find_date(chunk[0])] if any(x)], [])
#             if len(find_PhoneDateRE) > 0:
#                 find_PhoneDateRE += PHONE_TAGS
#                 chunk[0] = re.sub(r"\((?=\d+\))|(?<=\d)\)", "", chunk[0], flags=re.IGNORECASE).strip()
#                 chunk[0] = re.sub(r"\s+", " ", re.sub("|".join(find_PhoneDateRE), " ", chunk[0], flags=re.IGNORECASE))
#             filtered_prediction_list.append(chunk)
#         return filtered_prediction_list


#     ''' SPLIT PREDICTION LINES INTO "Supplier_Names" and "Customer_Names"
#     ----------------------------------------------------------------------------------------------------------------------
#     - Splits the list of prediction names into SN_list and CN_list based on findings of:-
#         1. check_SupplierTags(); 
#         2. check_CustomerTags();    
#         3. check_NoiseLines();                                                                                         '''
#     def check_split(prediction_list, TAG_SupplierName, TAG_CustomerName):
#         # Sanity Check :: 
#         # - checking if the "TAG_SupplierName" is Email/URL, in that case compare with remaining similar chunks.
#         def update__TAG_SupplierName():
#             updated__TAG_SupplierName = TAG_SupplierName
#             if TAG_SupplierName != None:  
#                 domains = sum([x[-1] for x in [find_email(TAG_SupplierName), find_url(TAG_SupplierName)] if any(x)], [])
#                 domain = domains[0] if len(domains) > 0 else None
#                 if domain != None:
#                     for chunk in prediction_list:
#                         if all( [chunk[1] >= 0.70, fuzz.partial_ratio(domain.lower(), chunk[0].lower()) > 50] ):
#                             updated__TAG_SupplierName = chunk[0]
#                             #print("Updated Tag_SupplierName = ", updated__TAG_SupplierName)
#                             break
#             return updated__TAG_SupplierName
#         # Sanity Check for SN 
#         SupplierName = update__TAG_SupplierName()
#         CustomerName = TAG_CustomerName
#         # Conditions to select SN,CN between: <File-level> OR <Line-level>
#         if SupplierName != None and CustomerName != None:       
#             SN_list = [(SupplierName, 100.0, 100.0, 'SN', 0)] + prediction_list
#             CN_list = [(CustomerName, 100.0, 100.0, 'CN', 0)]
#         elif SupplierName == None and CustomerName != None:
#             SN_list = [x for x in prediction_list if x[0] != CustomerName or len(re.findall(re.escape(CustomerName.strip()), x[0], flags=re.IGNORECASE)) == 0]
#             CN_list = [(CustomerName, 100.0, 100.0, 'CN', 0)]
#         elif SupplierName != None and CustomerName == None:
#             SN_list = [(SupplierName, 100.0, 100.0, 'SN', 0)] + prediction_list
#             CN_list = [x for x in prediction_list if x[0] != SupplierName or len(re.findall(re.escape(SupplierName.strip()), x[0], flags=re.IGNORECASE)) == 0]
#         else: 
#             # SupplierName == None & CustomerName == None
#             SN_list = [x for x in prediction_list if x[3] == 'SN']
#             CN_list = [x for x in prediction_list if x[3] == 'CN']
#         # Final Sanity Check for SN
#         # - Extracted SN, CN are subjected to final review to get separate lists.
#         SN_list = [chunk for chunk in SN_list if chunk[3] == 'SN']
#         CN_list = [chunk for chunk in CN_list if chunk[3] == 'CN']
#         return SN_list, CN_list


#     ''' CLEAN SN_LIST AND CN_LIST 
#     ----------------------------------------------------------------------------------------------------------------------
#     - Cleans a single predicted chunk: resolves email/URL domains, strips tags, trims long or stopword-heavy text.      '''
#     def check_cleaning(chunk):
#         ######################
#         ##  Clean Settings  ##
#         MAX_TOKENS = 12
#         MAX_STOPWORDS_PERCENT = 40.0
#         ######################
        
#         # 1. if chunk is an email/url
#         domains = sum([x[-1] for x in [find_url(chunk), find_email(chunk)] if any(x)], [])
#         domain = domains[0].upper() if len(domains) > 0 else None
#         if domain != None:
#             return domain.upper()
#         else:
#             # 2nd check for domain name
#             domain_2 = re.findall(r"([A-Za-z]+)\.com\b", chunk, flags=re.IGNORECASE)
#             if len(domain_2) > 0:
#                 return domain_2[0].upper()
        
#         # 2. if chunk contains a TAG
#         ALL_TAGS = SUPPLIER_TAGS + CUSTOMER_TAGS + MIXED_TAGS
#         find_Tag = [tag for tag in ALL_TAGS if re.findall(r"{}".format(tag), chunk.strip(), flags=re.IGNORECASE)]
#         value_TAG = " ".join(re.findall(r"{}(?=.*[a-zA-Z].*)(.*)".format(find_Tag[0]), chunk.strip(), 
#                                         flags=re.IGNORECASE)).strip() if len(find_Tag) > 0 else None
#         if value_TAG != None:
#             return value_TAG
        
#         ## tokenize
#         tokens = re.findall(r"\w+", chunk.strip())
#         tokens_stopwords_percentage = len([w for w in tokens if w.lower() in stop_words])*100.0/len(tokens) if len(tokens) > 0 else 0
            
#         # 3. filter if len is > max_tokens :: Spacy's NER
#         if len(tokens) > MAX_TOKENS:
#             chunkNER = [ent.orth_ for ent in nlp(chunk).ents if ent.label_ in ['ORG', 'FAC']]
#             chunkNER = chunkNER[0] if chunkNER != [] else None
#             if chunkNER != None: return chunkNER.strip()
        
#         # 4. filter if it contains more than 40% stopwords :: remove stopwords
#         if tokens_stopwords_percentage > MAX_STOPWORDS_PERCENT:
#             chunk = " ".join([w for w in tokens if w.lower() not in stop_words])
        
#         # 5. [FINAL] :: Regex based cleaning for all types of chunks...
#         chunk = re.sub(r"\s+", " ", re.sub(r"\^", " ", re.sub(r"[^A-Za-z0-9\@\&\(\)\-]", " ", chunk.strip(), flags=re.IGNORECASE))).strip()
#         chunk = " ".join([w for w in word_tokenize(chunk) if w.lower() not in SMART_STOPWORDS])
#         return chunk
    
            
#     ##########################################
#     # Chunk Identification Starts......
#     ##########################################
#     print("\n Chunk Identification...")
    
#     # ------------------------------------------ CHUNK :: file level ---------------------------------------------------- #
#     TAG_SupplierName = check_SupplierTags(df)
#     TAG_CustomerName = check_CustomerTags(df)
#     print(" --> :: TAG_SN = {}; TAG_CN = {}".format(TAG_SupplierName, TAG_CustomerName))
#     # ------------------------------------------------------------------------------------------------------------------- #


#     # ------------------------------------------ CHUNK :: line level ---------------------------------------------------- #
#     final_SN, final_CN = [], []
    
#     #################################################################
#     # MODEL 1 + MODEL 2
#     #  - Generic model running only numerical features in 2-layer
#     #  - LSTM-CNN model running Text + Numerical features
#     #################################################################
    
#     sorted_df = sort(df)  # 1. sort df
#     print("\n1. Sorted Predictions == ", sorted_df.LINES.tolist())
    
#     sorted_df = check_NoiseLines(sorted_df)  # 2. remove noise and unwanted long lines
#     print("\n2. Noise Predictions == ", sorted_df.LINES.tolist())
        
#     prediction_list = [[L, P1, P2, Q, Ln] for L, P1, P2, Q, Ln in zip(sorted_df.LINES.tolist(), 
#                                                                       sorted_df.Model1_P1.tolist(),
#                                                                       sorted_df.Model2_P1.tolist(), 
#                                                                       sorted_df.QUALIFY.tolist(), 
#                                                                       sorted_df.LINE_NO.tolist())]
#     print("\n3. Scored Predictions == ", prediction_list)
    
#     prediction_list = check_duplicates(prediction_list) 
#     print("\n4. Duplicated Predictions == ", prediction_list)
    
#     prediction_list = check_entities(prediction_list)  # 5. remove entities
#     print("\n5. Ents Predictions == ", prediction_list)
    
#     SN_list, CN_list = check_split(prediction_list, TAG_SupplierName, TAG_CustomerName)  # 6. split pred_list -> SN, CN
#     print("\n6. SPLIT SN_LIST == {}; CN_LIST == {};".format(SN_list, CN_list))
    
#     SN_list = list(map(lambda chunk: [check_cleaning(chunk[0]), chunk[1], chunk[2], chunk[3], chunk[4]], SN_list))  # 7. clean SN chunks
#     print("\n FINAL LIST :: SN_LIST == {}; CN_LIST == {}".format(SN_list, CN_list))
    
    
#     # MODEL 2 
#     if SN_list and SN_list[0][1] < 0.10:  # guard against an empty SN_list
#         print('INSIDE NN MODELLING')
#         SN_list = list(map(lambda chunk: [chunk[0], 'NN', chunk[2]], SN_list))
    
#     #------------------------------------------------------------------------------------------------------------------- #    
    
    
    
#     return SN_list, CN_list
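
The scoring order described in the `sort` docstring can be illustrated without pandas. The following is a minimal sketch, not the notebook's implementation: `sort_lines` and the sample rows are hypothetical, and plain dicts stand in for DataFrame rows. Note that because `Model1_P1` is the primary key, the later keys only break exact ties on it.

```python
def sort_lines(rows, threshold=0.95):
    """Order rows like sort(df): Model1_P1 desc, gated line number asc,
    Model2_P1 desc, line number asc. `rows` is a list of dicts (assumed shape)."""
    keyed = []
    for line_no, row in enumerate(rows):
        # The line number only participates for confident Model 1 predictions.
        gated_line_no = line_no if row["Model1_P1"] >= threshold else 0
        keyed.append(dict(row, LINE_NO=line_no, Model1_P1_LINENO=gated_line_no))
    return sorted(
        keyed,
        key=lambda r: (-r["Model1_P1"], r["Model1_P1_LINENO"], -r["Model2_P1"], r["LINE_NO"]),
    )


# Hypothetical sample lines with made-up probabilities.
rows = [
    {"LINES": "Invoice 123", "Model1_P1": 0.20, "Model2_P1": 0.90},
    {"LINES": "ABC Inc", "Model1_P1": 0.97, "Model2_P1": 0.60},
    {"LINES": "Remit to: XYZ LLC", "Model1_P1": 0.97, "Model2_P1": 0.95},
    {"LINES": "Thanks", "Model1_P1": 0.20, "Model2_P1": 0.10},
]
ordered = [r["LINES"] for r in sort_lines(rows)]
print(ordered)
```

In this sample the two lines tied at 0.97 are ordered by line number even though the later one has the higher `Model2_P1`, while the two lines tied at 0.20 fall back to `Model2_P1`.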

----
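
The duplicate-removal rule used in `check_duplicates` (same first token and a fuzzy ratio above the threshold) can be sketched as follows. This is an illustration under stated assumptions: `difflib.SequenceMatcher` is swapped in for `fuzz.ratio`, `str.split` replaces `word_tokenize`, and `drop_duplicates` is a hypothetical, non-mutating variant that keeps the first occurrence of each name.

```python
from difflib import SequenceMatcher

MAX_SIMILARITY = 30  # percent, same value as the duplicate settings block


def similarity(s1, s2):
    # Stand-in for fuzz.ratio: a 0-100 score from the standard library.
    return 100.0 * SequenceMatcher(None, s1.lower(), s2.lower()).ratio()


def is_duplicate(s1, s2):
    t1, t2 = s1.split(), s2.split()
    if not t1 or not t2:
        return False  # empty names never match
    return t1[0].lower() == t2[0].lower() and similarity(s1, s2) > MAX_SIMILARITY


def drop_duplicates(prediction_list):
    """Keep the first chunk of every group of near-duplicate names.
    Each chunk is a list whose first element is the predicted name."""
    kept = []
    for chunk in prediction_list:
        if not any(is_duplicate(chunk[0], k[0]) for k in kept):
            kept.append(chunk)
    return kept


# Hypothetical predictions: [name, probability]
preds = [["ABC Inc", 0.97], ["ABC Incorporated", 0.91], ["XYZ LLC", 0.80]]
unique = drop_duplicates(preds)
print(unique)
```

Because the input list is left untouched, the original predictions remain available as a fallback if the filter removes too much.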

##### Version 2.0

# OLD :: Version 2.0
# def chunk_identification(df): 
#     # TAGS
#     SUPPLIER_TAGS = [r"credit\s+to[^\:]*\:+", r"remit\s+to[^\:]*\:+", r"remittance\s+to[^\:]*\:+", r"f/b/o\s*", 
#                      r"direct\s+\w+inquiries\s+to[^\:]*\:+", r"payable\s+to[^\:]*\:+", r"checks\s+to[^\:]*\:+", 
#                      r"check\s+to[^\:]*\:+", r"account\s+name[^\:]*\:+", ]
    
#     CUSTOMER_TAGS = [r"billed\s+to[^\:]*\:*", r"bill\s+to[^\:]*\:*", r"billing\s+to[^\:]*\:*", r'bills\s+to[^\:]*\:*',
#                      r"bill\s+to\:*", r"billed\s+to\:*", r"ship\s+to[^\:]*\:*", r"shipped\s+to[^\:]*\:*", r"shipping\s+to[^\:]*\:*", 
#                      r"ships\s+to[^\:]*\:*", r"shipto[^\:]*\:*", r"shippingto[^\:]*\:*", r"shipsto[^\:]*\:*"]
    
#     MIXED_TAGS = [r".*\s+to\:*\s+", r".*\s+is\:*\s+", r".*\s+belongs\s+to\:*\s+", r".*\s+belong\s+to\:*\s+", r".*\s+is\s+of\:*\s+", 
#                   r".*\s+is\s+to\:*\s+", r".*\s+name\:*\s+"]
    
#     smart_stop_words = ["https", "http", "www", "com", "please", "pls", "let", "lets", "know", "knowing", "knows", 
#                         "respond", "responding", "responds", "naming", "names", "name", "thanks", "thank", "thankyou", 
#                         "thanksyou", "thanking", "thankingyou", "thanx", "thnx", "thx", "jan", "feb", "mar", "apr", 
#                         "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec", "asap", "january", "february", "march", 
#                         "april", "june", "july", "august", "september", "october", "november", "december", 
#                         "dear", "hello", "kindly", "ok", "okay", "regards", "hi", "hola", "hey", "yeah", "nope", 
#                         "advise", "many", "th", "st", "nd", "rd", "attached", "screen", "shot", "screenshot", 
#                         "sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "pfa", 
#                         "emp", "id", "invoice", "contract", "remit", "remittance", "fbo", "inquiries", 
#                         "payable", "billed", "billing", "bills", "shipped", "shipping", "ships", "shipto", "shippingto", 
#                         "shipsto", "terms&conditions", "termsandconditions", "confidential", "t&c", 
#                         "email", "emailing", "emails", "emailed", "mailing", "mails", "mail", "to", "is", 
#                         "customer", "client", "duration", "since", "address", "order", "invoices",
#                         "invocing", "ordernumber", "invoiceno", "accountno", "customerno", "make", "check", "checks",  
#                         "submit", "submits", "sumitting" , "submitto", "payment", "paying", "payments", "paymentsto", 
#                         "phone", "phones", "phoning", "phono", "emailo", "cancel", "cancelling", "cancellation", 
#                         "cancellations", "requires", "require", "required", "requiring", "noticeable", "noticing", "notice", 
#                         "notices", "day", "days", "months", "month"]
    
        
#     # 1. Look for "SUPPLIER_TAGS" in whole file
#     def check_SupplierTags(df):
#         def find_SupplierTag(line):
#             SupplierTag = [t for t in SUPPLIER_TAGS if re.findall(r"{}".format(t), line, flags=re.IGNORECASE)]
#             if len(SupplierTag) > 0:
#                 SupplierTag_value = " ".join(re.findall(r"{}(?!.*[\:].*)(?=.*[a-zA-Z].*)(.*[A-Za-z\&\.\,\(\)\-\@\<\>\\\/\s]+.*)".format(SupplierTag[0]), line, flags=re.IGNORECASE)).strip()
#                 #print(SupplierTag, "line", line, "==", SupplierTag_value)
#                 if len(SupplierTag_value) > 0:
#                     return SupplierTag_value
#             return None
#         # find "SUPPLIER_TAG" in each 'line' or in 'line above'
#         SUPPLIER_TAG_FOUND = None
#         for line_index in df.index.values:
#             line_present = df.iloc[line_index:line_index+1].LINES.values[0]
#             SupplierTag_value = find_SupplierTag(line_present)
#             if SupplierTag_value != None:
#                 # > SUPPLIER_TAG found in 'line'
#                 SUPPLIER_NAME = SupplierTag_value
#                 SUPPLIER_TAG_FOUND = "line_present"
#                 break
#             else:
#                 # > SUPPLIER_TAG NOT found in 'line'
#                 if line_index != 0:
#                     line_above = df.iloc[line_index-1:line_index].LINES.values[0]
#                 else:
#                     line_above = ""
#                 if len([t for t in SUPPLIER_TAGS if re.findall(r"{}".format(t), line_above, flags=re.IGNORECASE)]) > 0:
#                     # > SUPPLIER_TAG found in 'line above'
#                     SUPPLIER_NAME = line_present
#                     SUPPLIER_TAG_FOUND = "line_above"
#                 else:
#                     # > SUPPLIER_TAG NOT found in 'line above'
#                     continue 
#         if SUPPLIER_TAG_FOUND != None:
#             return SUPPLIER_NAME.strip()
#         else:
#             return None
        
        
#     # 2. Look for "CUSTOMER_TAGS" in whole file
#     def check_CustomerTags(df):
#         def find_CustomerTag(line):
#             CustomerTag = [t for t in CUSTOMER_TAGS if re.findall(r"{}".format(t), line, flags=re.IGNORECASE)]
#             if len(CustomerTag) > 0:
#                 CustomerTag_value = " ".join(re.findall(r"{}(?!.*[\:].*)(?=.*[a-zA-Z].*)(.*[A-Za-z\&\.\,\(\)\-\@\<\>\\\/\s]+.*)".format(CustomerTag[0]), line, flags=re.IGNORECASE)).strip()
#                 if len(CustomerTag_value) > 0:
#                     return CustomerTag_value    
#             return None
#         # find "CUSTOMER_TAG" in each 'line' or in 'line above'
#         CUSTOMER_TAG_FOUND = None
#         for line_index in df.index.values:
#             line_present = df.iloc[line_index:line_index+1].LINES.values[0]
#             CustomerTag_value = find_CustomerTag(line_present)
#             if CustomerTag_value != None:
#                 # > CUSTOMER_TAG found in 'line'
#                 CUSTOMER_NAME = CustomerTag_value
#                 CUSTOMER_TAG_FOUND = "line_present"
#                 break
#             else:
#                 # > CUSTOMER_TAG NOT found in 'line'
#                 if line_index != 0:
#                     line_above = df.iloc[line_index-1:line_index].LINES.values[0]
#                 else:
#                     line_above = ""
#                 if len([t for t in CUSTOMER_TAGS if re.findall(r"{}".format(t), line_above, flags=re.IGNORECASE)]) > 0:
#                     # > CUSTOMER_TAG found in 'line above'
#                     CUSTOMER_NAME = line_present
#                     CUSTOMER_TAG_FOUND = "line_above"
                    
#                 else:
#                     # > CUSTOMER_TAG NOT found in 'line above'
#                     continue 
#         if CUSTOMER_TAG_FOUND != None:
#             return CUSTOMER_NAME.strip()
#         else:
#             return None

#     # 3. Remove Noise lines
#     def check_NoiseLines(df):
#         # Check noise in predicted lines
#         def find_NoiseLines(x):              
#             tokens = word_tokenize(re.sub(r"\s+", " ", re.sub(r"\^", " ", re.sub(r"[^A-Za-z]", " ", x.LINES.strip(), 
#                                                                                  flags=re.IGNORECASE | re.MULTILINE))))
#             find_numofTokens = len(tokens)
#             find_length = len(x.LINES.lower().strip())
#             find_allDigits = x.F1_CONTAINSALLDIGIT
#             # adjacent raw strings are concatenated; a backslash continuation inside one string
#             # would leak the indentation of the next line into the pattern
#             find_noise = re.findall(r"attn[^\:]*\:+\s*|from\s*\:+|subject\s*\:+|cc\s*\:+|bcc\s*\:+|"
#                                     r"terms\s+and\s+condition\s*|terms\s+&\s+conditions\s*|terms&conditions\s*|"
#                                     r"termsandconditions\s*|confidential\s*|T&C\s*|TC\s*|if you have any questions\s*|"
#                                     r"if you have questions\s*|write\s*to\:+\s*|bill\s*to\:+\s*|billed\s*to.*\:+\s*|"
#                                     r"contact\s*to\:+\s*|email\s*to\:+\s*|write\s*at\:+\s*|email\s*at\:+\s*|"
#                                     r"email\s*\@\:+\s*|client\s*|" + "|".join(CUSTOMER_TAGS),
#                                     x.LINES, flags=re.IGNORECASE | re.MULTILINE)
#             if len(tokens) > 0:
#                 find_stopwords = len([w for w in tokens if w.lower() in stop_words])*100.0/len(tokens)
#             else:
#                 find_stopwords = 0
#             # Condition for "NOT a Supplier Name"...
#             QUALIFY = "SN"         
#             if find_numofTokens > 30 or find_length > 80 or find_allDigits > 0 or len(find_noise) > 0 or find_stopwords > 60:
#                 QUALIFY = "CN"
#             return QUALIFY                
#         df['QUALIFY'] = df.apply(find_NoiseLines, axis=1)
#         return df
        
#     # Post-Processing
#     def cleaning(prediction_chunks, TAG_SupplierName, TAG_CustomerName):
        
#         # A. Removing duplicate values from prediction
#         def remove_duplicates(Top_lines):   
#             def check_similarity(s1, s2):
#                 if word_tokenize(s1)[0].strip().lower() == word_tokenize(s2)[0].strip().lower() and fuzz.ratio(s1.lower(), s2.lower()) > 30:
#                     return True
#                 else:
#                     return False
#             def eliminate_duplicates(a):
#                 for index, name in enumerate(a):
#                     for i, match_with in enumerate(a):
#                         if index != i and check_similarity(name[0], match_with[0]):
#                             a[i][0] = '<DUPLICATE>'
#                 return a
#             a = Top_lines
#             if len(a) > 2:
#                 a = eliminate_duplicates(a)
#                 a = [w for w in a if w[0] != '<DUPLICATE>']
#                 if len(a) == 1:
#                     a += Top_lines
#                 return a
#             else:
#                 return Top_lines

#         # B. Splitting prediction chunks into SupplierName, CustomerName
#         def split_chunklist(prediction_chunks):
#             # CHECK: Checking if found "TAG_SupplierName" = Email/URL
#             def check_TAGSupplierNameType():
#                 UPDATED_TAG_SupplierName = TAG_SupplierName
#                 if TAG_SupplierName != None:                    
#                     _, email_domain = find_email(TAG_SupplierName)
#                     _, _, url_domain = find_url(TAG_SupplierName)
#                     if len(email_domain) > 0: 
#                         domain = email_domain[0]
#                     elif len(url_domain) > 0: 
#                         domain = url_domain[0]
#                     else:
#                         domain = None
#                     if domain != None:
#                         # check in top 3
#                         for chunk in prediction_chunks[:3]:
#                             # if prob(1) is > 70%
#                             if chunk[1] >= 0.70:
#                                 score = fuzz.partial_ratio(domain.lower(), chunk[0].lower())
#                                 if score > 50:
#                                     UPDATED_TAG_SupplierName = chunk[0]
#                                     #print("NEW TAG SN == ", UPDATED_TAG_SupplierName)
#                                     break
#                 return UPDATED_TAG_SupplierName

#             # Store 'TAG_SupplierName/Updated_TAG_SupplierName' as SN; 'TAG_CustomerName' as CN
#             SupplierName = check_TAGSupplierNameType()
#             CustomerName = TAG_CustomerName
#             # Conditions to check SN, CN present in whole file
#             if SupplierName != None and CustomerName != None:       
#                 SN_list = [(SupplierName, 100.0, 'SN')] + prediction_chunks
#                 CN_list = [(CustomerName, 100.0, 'CN')]
#             elif SupplierName == None and CustomerName != None:
#                 SN_list = [x for x in prediction_chunks if x[0] != CustomerName or len(re.findall(re.escape(CustomerName.strip()), x[0], flags=re.IGNORECASE)) == 0]
#                 CN_list = [(CustomerName, 100.0, 'CN')]
#             elif SupplierName != None and CustomerName == None:
#                 SN_list = [(SupplierName, 100.0, 'SN')] + prediction_chunks
#                 CN_list = [x for x in prediction_chunks if x[0] != SupplierName or len(re.findall(re.escape(SupplierName.strip()), x[0], flags=re.IGNORECASE)) == 0]
#             else: 
#                 # SupplierName == None and CustomerName == None
#                 SN_list = [x for x in prediction_chunks if x[2] == 'SN']
#                 CN_list = [x for x in prediction_chunks if x[2] == 'CN']
#             return SN_list, CN_list
            
#         # C. Final Chunk Cleaning
#         def extract_cleanchunk(chunk):
#             # Clean - EMAIL CHUNKS
#             _, email_domain = find_email(chunk)
#             if len(email_domain) > 0:
#                 chunk = email_domain[0].upper()
#                 return chunk
            
#             _, _, url_domain = find_url(chunk)
#             if len(url_domain) > 0:
#                 chunk = url_domain[0].upper()
#                 return chunk
#             # Clean - TAG CHUNKS
#             SupplierTags = [t for t in SUPPLIER_TAGS if re.findall(r"{}".format(t), chunk.strip(), flags=re.IGNORECASE)]
#             CustomerTags = [t for t in CUSTOMER_TAGS if re.findall(r"{}".format(t), chunk.strip(), flags=re.IGNORECASE)]
#             if len(SupplierTags) > 0:
#                 chunk = " ".join(re.findall(r"{}([\w\&\.\,\(\)\-\@\<\>\\\/\s]+)".format(SupplierTags[0]), chunk, flags=re.IGNORECASE)).strip()
#                 return chunk
#             if len(CustomerTags) > 0:
#                 chunk = " ".join(re.findall(r"{}([\w\&\.\,\(\)\-\@\<\>\\\/\s]+)".format(CustomerTags[0]), chunk, flags=re.IGNORECASE)).strip()
#                 return chunk
#             # Clean - Generic
#             chunk = re.sub(r"\s+", " ", re.sub(r"\^", " ", re.sub(r"[^A-Za-z0-9\@\&\.\,\(\)\-]", " ", chunk.strip(), flags=re.IGNORECASE))).strip()
#             if len(chunk.split(' ')) > 6: 
#                 chunk = " ".join([w for w in word_tokenize(chunk) if w.lower() not in stop_words])
#             return chunk
        
#         # D. Final SN_LIST Cleaning
#         def final_cleaning(SN_list):
#             cleaned_SN_list = []
#             for SN_chunk in SN_list:
#                 SN = SN_chunk[0]
#                 # Remove if a Clean Tag is found...
#                 CLEAN_TAGS = SUPPLIER_TAGS + CUSTOMER_TAGS + MIXED_TAGS
#                 CLEAN_TAG_found = [t for t in CLEAN_TAGS if re.findall(r"{}".format(t), SN, flags=re.IGNORECASE | re.MULTILINE)]
#                 if len(CLEAN_TAG_found) > 0:
#                     SN = " ".join(re.findall(r"{}(?=.*[a-zA-Z].*)(.*)".format(CLEAN_TAG_found[0]), SN, flags=re.IGNORECASE | re.MULTILINE))
                
#                 # Remove stop_words if len is more than 8 words...
#                 if len(SN.split(' ')) > 8:
#                     SN = " ".join([w for w in word_tokenize(SN) if w.lower() not in stop_words])
#                 # Remove Smart Stop_words if len is more than 3 words...
#                 if len(SN.split(' ')) >= 4:
#                     SN = " ".join([w for w in word_tokenize(SN) if w.lower() not in smart_stop_words])
#                 SN_chunk[0] = SN
#                 cleaned_SN_list.append(SN_chunk)
#             return cleaned_SN_list
            
        
#         # Run [A, B, C, D] Steps...
#         prediction_chunks = remove_duplicates(prediction_chunks) # A
#         #print("PREDICTION :: ", prediction_chunks)
        
#         SN_list, CN_list = split_chunklist(prediction_chunks) # B
#         SN_list = [x for x in SN_list if x[2] == 'SN']
#         CN_list = [x for x in CN_list if x[2] == 'CN']
#         #print("SPLIT :: SN LIST == ", SN_list, "; CN LIST == ", CN_list)
        
#         SN_list = list(map(lambda x: [extract_cleanchunk(x[0]), x[1]], SN_list)) # C
#         CN_list = list(map(lambda x: [extract_cleanchunk(x[0]), x[1]], CN_list))
#         #print("Semi-Final :: SN LIST == ", SN_list, "; CN LIST == ", CN_list)
        
#         SN_list = final_cleaning(SN_list) # D
#         #print(" ** FINAL ** :: SN LIST == ", SN_list, "; CN LIST == ", CN_list)
        
#         return SN_list, CN_list
    

#     ##########################################
#     # Chunk Identification starts...
#     ##########################################
#     #print("\n OLD")
    
#     # Find Supplier Tags & Customer Tags in file...
#     TAG_SupplierName = check_SupplierTags(df)
#     TAG_CustomerName = check_CustomerTags(df)
#     #print("\nTAG :: TAG_SN = ", TAG_SupplierName, "; TAG_CN = ", TAG_CustomerName)
    
#     final_SN, final_CN = [], []
#     ### Y_PRED == 1 ###
#     if df[df.Y_PRED == 1].shape[0] > 0:
#         # sorted by prob of 1 (0.50 to 1.0)
#         sorted_df = df[df.Y_PRED == 1].sort_values(by=['P1'], ascending=False)
#         sorted_df = check_NoiseLines(sorted_df)
#         predictions = [[line, score, qualify] for line, score, qualify in zip(sorted_df.LINES.tolist(), sorted_df.P1.tolist(), sorted_df.QUALIFY.tolist())]
#         final_SN, final_CN = cleaning(predictions, TAG_SupplierName, TAG_CustomerName)

#     ### Y_PRED != 1 ###
#     else:
#         # sorted by prob of 1 (0 to 0.50)
#         sorted_df = df.sort_values(by=['P1'], ascending=False)
#         sorted_df = check_NoiseLines(sorted_df)
#         predictions = [[line, score, qualify] for line, score, qualify in zip(sorted_df.LINES.tolist(), sorted_df.P1.tolist(), sorted_df.QUALIFY.tolist())]
#         final_SN, final_CN = cleaning(predictions, TAG_SupplierName, TAG_CustomerName)
        
#     return final_SN, final_CN
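
The tag lookup used in the commented-out cell above (look for a customer tag in each line, then fall back to the line above it) can be sketched in isolation as follows. This is a minimal sketch under stated assumptions, not the notebook's implementation: `find_tagged_name` and `CUSTOMER_TAG_PATTERNS` are hypothetical names, the input is a plain list of strings rather than the `df.LINES` column, and the function returns at the first match in either position.

```python
import re

# Hypothetical tag patterns, modelled on the CUSTOMER_TAGS list in Version 1.0.
CUSTOMER_TAG_PATTERNS = [
    r"Billed\s+to[^:]*:+",
    r"Bill\s+to[^:]*:+",
    r"Billing\s+to[^:]*:+",
]


def find_tagged_name(lines, tag_patterns):
    """Return the text following a tag, or the line directly below a tag-only line."""
    compiled = [re.compile(p, flags=re.IGNORECASE) for p in tag_patterns]
    for index, line in enumerate(lines):
        # Case 1: the tag and the name share a line ("Bill to: ACME Corp").
        for pattern in compiled:
            match = pattern.search(line)
            if match:
                value = line[match.end():].strip()
                if value:
                    return value
        # Case 2: the tag sits on the previous line, so this line is the name.
        line_above = lines[index - 1] if index > 0 else ""
        if line.strip() and any(p.search(line_above) for p in compiled):
            return line.strip()
    return None


same_line = find_tagged_name(["Bill to: ACME Corp"], CUSTOMER_TAG_PATTERNS)
next_line = find_tagged_name(["Invoice 42", "Bill to:", "ACME Corp"], CUSTOMER_TAG_PATTERNS)
```

Compiling the patterns once and slicing the line at `match.end()` avoids formatting each tag into a second, larger regex, which is one way to sidestep escaping problems when tag strings contain special characters.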

----

##### Version 1.0

# OLD :: Version 1.0
#     def chunk_identification(df):   
#         SUPPLIER_TAGS = [r"Credit\s+to[^\:]*\:+", r"Remit\s+to[^\:]*\:+", r"Remittance\s+to[^\:]*\:+", r"f/b/o\s*",
#                          r"Direct\s+\w+Inquiries\s+to[^\:]*\:+", r"Payable\s+to[^\:]*\:+", r"Checks\s+to[^\:]*\:+",
#                          r"Check\s+to[^\:]*\:+", r"Account\s+Name[^\:]*\:+"]

#         CUSTOMER_TAGS = [r"Billed\s+to[^\:]*\:+", r"Bill\s+to[^\:]*\:+", r"Billing\s+to[^\:]*\:+"]

#         # 'A-Za-z' instead of 'A-z': the latter also covers [ \ ] ^ _ ` between 'Z' and 'a'
#         SN_LINE_regex = r"[^A-Za-z\&\.\,\(\)\-\@\<\>\\\/]"

#         MAX_LENGTH_LINE = 200

#         def final_list_chunks(Top_lines):
#             def check_similarity(s1, s2):
#                 if word_tokenize(s1)[0].strip().lower() == word_tokenize(s2)[0].strip().lower() and fuzz.ratio(s1.lower(), s2.lower()) > 30:
#                     return True
#                 else:
#                     return False
#             def eliminate_duplicates(a):
#                 # Build a new list; deleting from 'a' while iterating over it can skip entries
#                 unique = []
#                 for name in a:
#                     if not any(check_similarity(name, kept) for kept in unique):
#                         unique.append(name)
#                 return unique

#             a = Top_lines
#             a = list(OrderedDict.fromkeys(a))
#             if len(a) > 2:
#                 a = eliminate_duplicates(a[:5])
#                 if len(a) == 1:
#                     a += Top_lines[5:6]
#                 return a
#             else:
#                 return Top_lines
        
#         def check_SupplierTags(df):
#             def find_header_tag(line):
#                 header_tag = [t for t in SUPPLIER_TAGS if re.findall(r"{}".format(t), line, re.IGNORECASE)]
#                 if len(header_tag) > 0:
#                     header_value = " ".join(re.findall(r"{}([\w\&\.\,\(\)\-\@\<\>\\\/\s]+)".format(header_tag[0]), line, re.IGNORECASE)).strip()
#                     if len(header_value) > 0:
#                         return header_value
#                 return None
            
#             HEADER_FOUND = None
#             for line_index in df[df.Y_PRED==1].index.values:
#                 line_present = df.iloc[line_index:line_index+1].LINES.values[0]
#                 header_value = find_header_tag(line_present)
#                 if header_value != None:
#                     header_line = header_value
#                     HEADER_FOUND = "line_present"
#                     break
#                 else:
#                     if line_index != 0:
#                         line_above = df.iloc[line_index-1:line_index].LINES.values[0]
#                     else:
#                         line_above = ""
#                     if len([t for t in SUPPLIER_TAGS if re.findall(r"{}".format(t), line_above, re.IGNORECASE)]) > 0:
#                         header_line = line_present
#                         HEADER_FOUND = "line_above"
#                     else:
#                         continue 
            
#             if HEADER_FOUND != None:                
#                 Pred_SN, Pred_SN_P0, Pred_SN_P1 = header_line, 0.01, 0.99
#                 return Pred_SN, Pred_SN_P0, Pred_SN_P1
#             else:
#                 return None
        
#         def check_CustomerTags(df):
#             return None
                       
#         def check_NoiseLines(df):
#             def noise_line_found(x):  
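#                 # Pass flags by keyword: the 4th positional argument of re.sub is 'count', not 'flags'
#                 find_keyword = re.sub(r"\s+", " ", re.sub(SN_LINE_regex, " ", x.LINES.strip(), flags=re.IGNORECASE))
#                 find_len = len(re.sub(r"\s+", " ", re.sub(SN_LINE_regex, " ", x.LINES.strip(), flags=re.IGNORECASE).strip()))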
#                 find_abbrv = len(re.findall(r"(?=(\b" + '\\b|\\b'.join(list_abbrv_regex) + r"\b))", x.LINES.lower().strip()))
#                 find_digits = x.F1_CONTAINSALLDIGIT
#                 find_emails = len(re.findall(r"[\w\.-]+@[\w\.-]+", x.LINES.strip(), re.IGNORECASE))
#                 find_header_tag = len(re.findall(r"^[^\:]*\:\s*$", x.LINES.strip(), re.IGNORECASE))
#                 # Adjacent raw-string literals keep indentation whitespace out of the pattern
#                 find_noise = len(re.findall(r"To\s*\:+|From\s*\:+|Subject\s*\:+|CC\s*\:+|BCC\s*\:+|Customer\s*\-*\:*|"
#                                             r"Terms\s+and\s+conditions|Terms\s+&\s+conditions|Terms&Conditions\s+|"
#                                             r"TermsandConditions\s+|Confidential\s+|T&C\s+|TC\s+|If you have any questions|"
#                                             r"If you have questions|Write\s*to\:+|Bill\s*to\:+|Billed\s*to.*\:+|"
#                                             r"Contact\s*to\:+|Email\s*to\:+|Write\s*at\:+|Email\s*at\:+|Email\s*\@\:+|"
#                                             r"Confidential|Client\s*|Manager\s*|Customs\s*|Trade\s*",
#                                             re.sub(SN_LINE_regex, " ", x.LINES, flags=re.IGNORECASE).strip(), flags=re.IGNORECASE))
#                 cleaned_tokens = word_tokenize(re.sub(SN_LINE_regex, " ", x.LINES, flags=re.IGNORECASE).strip())
#                 # Guard against lines with no tokens to avoid a ZeroDivisionError
#                 find_stopwords = (len([w for w in cleaned_tokens if w in stop_words]) * 100.0 / len(cleaned_tokens)
#                                   if cleaned_tokens else 0)
#                 # Checks
#                 if find_abbrv > 0 and find_len <= 30:
#                     return 1
#                 if find_len > MAX_LENGTH_LINE or find_digits > 0 or find_header_tag > 0 or find_noise > 0 :
#                     return 0
                 
#                 return 1
#             df['QUALIFY'] = df.apply(noise_line_found, axis=1)
#             df = df[df.QUALIFY != 0].drop(columns=['QUALIFY'])
#             return df
        
#         # Chunk Identification starts...
#         if df[df.Y_PRED == 1].shape[0] > 0:
            
            
#             Pred_SN, Pred_SN_P0, Pred_SN_P1 = "", 0, 0
            
#             # Check for header tags (call once and reuse the result)
#             tagged_SN = check_SupplierTags(df)
#             if tagged_SN is not None:
#                 Pred_SN, Pred_SN_P0, Pred_SN_P1 = tagged_SN
#                 return Pred_SN, Pred_SN_P0, Pred_SN_P1, [Pred_SN]
            
#             # Prediction using a classifier
#             fdf = df[df.Y_PRED == 1].copy()
            
# #             # remove noise lines
# #             if fdf.shape[0] > 2:
# #                 fdf = check_NoiseLines(fdf)

# #             # Normalize it
# #             fdf[normalize_cols] = Normalize.transform(fdf[normalize_cols])

# #             if 'P0' in fdf.columns:
# #                 fdf['P0'], fdf['P1'] = 0,0

# #             # MODELS...
# # #             #NN
# # #             max_features = 5
# # #             sequence_length = 6
# # #             embedding_dim = 6
# # #             X_unseen_text = tokenizer.texts_to_sequences(fdf.LINE_NER)
# # #             X_unseen_text = pad_sequences(X_unseen_text, padding='post', maxlen=sequence_length)
# # #             X_unseen_num = fdf[Features_NUM]
# # #             X_unseen_num[normalize_cols] = Normalize.transform(X_unseen_num[normalize_cols])
# # #             fdf['Final_P0'], fdf['Final_P1'] = zip(*model.predict([X_unseen_text, X_unseen_num]))
# #             ## RF
# #             fdf['Final_P0'], fdf['Final_P1'] = zip(*model.predict_proba(fdf[Features_L2]))

#             fdf['Final_P0'], fdf['Final_P1'] = df.P0, df.P1

#             # Max Prob in "Final_P1"
#             prediction = fdf[fdf['Final_P1']==fdf['Final_P1'].max()]
#             Pred_SN, Pred_SN_P0, Pred_SN_P1 = fdf.LINES.values[0], prediction['Final_P0'].values[0], prediction['Final_P1'].values[0]
            
#             # Lines sorted on predicted score
#             Top_Lines = fdf.sort_values(by=['Final_P1'], ascending=False).LINES.tolist()
            
#             Top_Lines = final_list_chunks(Top_Lines)[:5]
            
#             return Pred_SN, Pred_SN_P0, Pred_SN_P1, Top_Lines
#         else:
#             return "", 0, 0, []

#     # Execute for each filename...
#     final_df = []
#     for f in df.FILENAME.unique():
        
#         # every df
#         tempdf = df[df.FILENAME == f].copy().reset_index(drop=True)
        
#         #########################
#         # 1. Line Classification
#         # TRUE
#         actual = tempdf[tempdf.Y_SN == 1]
#         actual_SN = str(actual.SUPPLIER_NAME.tolist()[0])
#         # PRED
#         predicted = tempdf[tempdf.Y_PRED == 1]
#         predicted_SN = predicted.LINES.tolist()
#         # Accuracy
#         correct_df = tempdf[(tempdf.Y_SN == 1) & (tempdf.Y_PRED == 1)]
#         correct_LINES, correct_QUAD = correct_df['LINES'].tolist(), correct_df['F6_lineQuadrant'].tolist()
#         if len(correct_LINES) > 0:
#             correct_LINE_found = 1
#         else:
#             correct_LINE_found = 0
#         #########################
        
#         #########################
#         # 2. Chunk Identification
#         Final_Pred_SN, Final_Pred_SN_P0, Final_Pred_SN_P1, Top_Lines = chunk_identification(tempdf)
#         Final_Pred_SN = generic_cleaning(Final_Pred_SN)
        
#         #########################
    
#         # STORE
#         final_df.append({"FILE": f, "SN": actual_SN, 
#                          #"Count_True_Lines": actual.shape[0], 
#                          #"Count_Pred_Lines": predicted.shape[0], 
#                          "Pred_Lines": predicted_SN, "is_CorrectLineFound": correct_LINE_found, 
#                          "Correct_Pred_Lines": correct_LINES, "Correct_Quad": correct_QUAD,
#                          #"PRED_SN_P0": Final_Pred_SN_P0, "PRED_SN_P1": Final_Pred_SN_P1, 
#                          "PRED_SN_Byorder":Final_Pred_SN, 'PRED_SN_Byprob':Top_Lines})
    
#     pred_df = pd.DataFrame.from_dict(final_df)
#     print("Total Files = {}\n**LINE CLASSIFICATION**\nCorrect Line Found = {}\nLines Missed = {}"
#           .format(pred_df.shape[0], pred_df.is_CorrectLineFound.sum(), pred_df.shape[0] - pred_df.is_CorrectLineFound.sum()))
    
#     # DISPLAY
#     pred_df = pred_df.drop(columns=['is_CorrectLineFound', 'Correct_Pred_Lines', 'Correct_Quad', 'Pred_Lines', 'PRED_SN_Byorder'])
    
#     return pred_df
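
The near-duplicate rule used in `final_list_chunks` (two lines count as duplicates when their first tokens match and their fuzzy ratio is above 30) can be sketched as a standalone function that builds a new list rather than editing one in place. This is an illustration under stated assumptions, not the notebook's code: `difflib.SequenceMatcher` from the standard library stands in for `fuzz.ratio`, `str.split` stands in for `word_tokenize`, and the names `is_similar` and `drop_near_duplicates` are hypothetical. Scores will differ somewhat from those of fuzzywuzzy.

```python
from difflib import SequenceMatcher


def is_similar(s1, s2, threshold=30):
    """True when both strings share a first token and are similar enough."""
    tokens1, tokens2 = s1.split(), s2.split()
    if not tokens1 or not tokens2:
        return False  # empty strings have no first token to compare
    if tokens1[0].lower() != tokens2[0].lower():
        return False
    score = 100 * SequenceMatcher(None, s1.lower(), s2.lower()).ratio()
    return score > threshold


def drop_near_duplicates(lines):
    """Keep the first line of every group of near-duplicates, preserving order."""
    kept = []
    for line in lines:
        if not any(is_similar(line, seen) for seen in kept):
            kept.append(line)
    return kept


candidates = ["ABC Inc", "ABC Inc.", "XYZ Ltd", "abc incorporated"]
unique_candidates = drop_near_duplicates(candidates)
```

Because the input is assumed to be sorted by predicted probability, keeping the first member of each group retains the highest-scoring variant of a name.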

--- X -- X ---