###### Author : Harinadh Appidi

### Importing Required Libraries 

In [358]:
import spacy
import pandas as pd
import re
import pickle
from spacy.tokens import DocBin
import json
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

### Reading Data from csv file

In [359]:
# Load the dataset
data = pd.read_csv('Summer Internship - Homework Exercise.csv')
data.head()

Unnamed: 0,transaction_descriptor,store_number,dataset
0,DOLRTREE 2257 00022574 ROSWELL,2257,train
1,AUTOZONE #3547,3547,train
2,TGI FRIDAYS 1485 0000,1485,train
3,BUFFALO WILD WINGS 003,3,train
4,J. CREW #568 0,568,train


#### Cleaning Data

In [360]:
def clean_data(x):
    # Replace each run of non-word characters with a space, then collapse whitespace
    x = re.sub(r'\W+', ' ', x).split()
    x = ' '.join(x)
    return x
data.transaction_descriptor = data.transaction_descriptor.apply(clean_data)
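
To make the effect of the cleaning step concrete, here is a small sketch that applies the same regular expression to a few descriptors taken from the `data.head()` output above. The one-line `clean_data` is a condensed restatement of the cell's function, not a different method.

```python
import re

def clean_data(x):
    # Same logic as the notebook cell: each run of non-word characters
    # becomes a single space, and surrounding whitespace is dropped
    return ' '.join(re.sub(r'\W+', ' ', x).split())

samples = ['AUTOZONE #3547', 'J. CREW #568 0', 'TGI FRIDAYS 1485 0000']
cleaned = [clean_data(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(f'{raw!r} -> {out!r}')
```

Note that `\W` treats the underscore as a word character, so underscores would survive this cleaning.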

#### Split Data to train, test and validation folds

In [361]:
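# .copy() gives independent frames, so columns can be added later without touching a slice of `data`
train_data = data[data.dataset=='train'].copy()
val_data = data[data.dataset=='validation'].copy()
test_data = data[data.dataset=='test'].copy()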

#### Create custom entity annotations so the spaCy model can be fine-tuned to learn the new entity

In [362]:
def json_data(data):
    transaction_descriptor = data.transaction_descriptor.tolist()
    store_number_list = data.store_number.tolist()
    annotated = []
    for descriptor, store_number in zip(transaction_descriptor, store_number_list):
        # Match the store number literally, search once, and label its first occurrence
        match = re.search(re.escape(str(store_number)), descriptor)
        annotated.append((descriptor, {'entities': [(match.start(), match.end(), 'store_detector')]}))
    return annotated
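
As an illustration of the annotation format this produces, the sketch below builds the tuple for two single rows from the sample data. The helper name `annotate` is hypothetical and only mirrors the per-row logic of `json_data`.

```python
import re

def annotate(descriptor, store_number):
    # Hypothetical single-row helper: locate the first occurrence of the
    # store number and record its character offsets with the entity label
    match = re.search(re.escape(str(store_number)), descriptor)
    return (descriptor, {'entities': [(match.start(), match.end(), 'store_detector')]})

example = annotate('DOLRTREE 2257 00022574 ROSWELL', '2257')
padded = annotate('BUFFALO WILD WINGS 003', '3')
print(example)
print(padded)
```

In the second row the store number `3` appears zero-padded as `003`, so the recorded offsets cover only the final character. The search also always picks the first occurrence, which may not be the intended one when the same digits appear earlier in the descriptor.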


#### Dump newly created train, test, valid data to json files

In [363]:
with open('train.json','w',encoding='utf-8') as f:
    json.dump(json_data(train_data),f)
with open('valid.json','w',encoding='utf-8') as f:
    json.dump(json_data(val_data),f)
with open('test.json','w') as f:
    json.dump(json_data(test_data),f)
    

#### Load the pretrained spaCy model and define the conversion of train and validation data to spaCy format

In [364]:
# nlp = spacy.blank('en') ## create blank spacy model
# nlp = spacy.load('en_core_web_trf')
nlp = spacy.load('en_core_web_lg')


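def covert_spacy_format(TRAIN_DATA):
    # Pack (text, annotations) pairs into a DocBin, the binary format read by `spacy train`
    db = DocBin()
    for text, annotations in tqdm(TRAIN_DATA):
        doc = nlp.make_doc(text)  # tokenize only, no pipeline components are run
        ents = []
        for start, end, label in annotations['entities']:
            # 'expand' widens offsets that fall inside a token to cover the whole token
            span = doc.char_span(start, end, label=label, alignment_mode='expand')
            if span is None:
                print("skipping entity {} in {}".format(label, text))
            else:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    return db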
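
The `alignment_mode='expand'` argument matters when the annotated offsets fall inside a token, as with a store number `3` written as `003`. The sketch below is a simplified pure-Python stand-in, assuming tokens are split on whitespace only; spaCy's real tokenizer applies additional rules, so this is an illustration of the idea rather than spaCy's implementation. The function name `expand_to_tokens` is hypothetical.

```python
def expand_to_tokens(text, start, end):
    """Widen (start, end) to the whitespace-delimited tokens it touches."""
    # Move start left until the previous character is whitespace or the text begins
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    # Move end right until the next character is whitespace or the text ends
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end

text = 'BUFFALO WILD WINGS 003'
start, end = expand_to_tokens(text, 21, 22)  # offsets of the bare '3'
span_text = text[start:end]
print((start, end), span_text)
```

Because the training span then covers the padded token `003`, a model trained this way would be expected to predict padded text too, which is why leading zeros are stripped at prediction time.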

#### Convert train and validation data from JSON to spaCy format

In [365]:
with open('train.json','r') as f:
    train_data = json.load(f)
with open('valid.json','r') as f:
    valid_data = json.load(f)
    
train_data = covert_spacy_format(train_data)
train_data.to_disk('./data/train.spacy')
valid_data = covert_spacy_format(valid_data)
valid_data.to_disk('./data/valid.spacy')

100%|██████████| 100/100 [00:00<00:00, 2273.02it/s]
100%|██████████| 100/100 [00:00<00:00, 2273.00it/s]


#### Create a new config.cfg file from the command line after copying the default parameter values from the spaCy website into base_config.cfg

* python -m spacy init fill-config base_config.cfg config.cfg

#### Train the spaCy model from the command line
* python -m spacy train config.cfg --output ./large_out

### Load the spacy model and predict on test data
1. When the model predicts multiple entities for a test sample, only the first prediction is used.

In [366]:
nlp = spacy.load('large_out/model-best')
docs = test_data.transaction_descriptor.tolist()
store_numbers = []
for text in docs:
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == 'store_detector':
            # Keep only the first store entity; strip zero padding (e.g. '003' -> '3')
            store_numbers.append(ent.text.lstrip('0'))
            break
    else:
        # for/else: reached only when the loop finished without a break
        store_numbers.append('Not_Predicted')
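
The selection rule in this cell can be isolated from spaCy and checked on its own. The sketch below is a hypothetical helper operating on plain `(text, label)` pairs that stand in for `doc.ents`. The `or '0'` fallback is an assumption added here, not part of the notebook: it keeps an all-zero string from collapsing to an empty string.

```python
def first_store_number(entities, label='store_detector', default='Not_Predicted'):
    """Return the first entity text with the given label, without zero padding."""
    for text, ent_label in entities:
        if ent_label == label:
            # Assumption: an all-zero entity such as '000' is reported as '0'
            return text.lstrip('0') or '0'
    return default

# Made-up (text, label) pairs standing in for doc.ents
a = first_store_number([('003', 'store_detector'), ('22574', 'store_detector')])
b = first_store_number([])
c = first_store_number([('000', 'store_detector')])
print(a, b, c)
```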

#### Write test predictions to a csv file

In [367]:
test_data['store_number_pred'] = store_numbers
test_data.to_csv('test_data_predictions.csv')

### Calculating Accuracy score for Entity Extraction model on test data

In [372]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# With exactly one label per sample, micro-averaged precision, recall and F1 all equal accuracy
print("***** Accuracy *****",accuracy_score(test_data.store_number,test_data.store_number_pred))
print("****** Precision *****",precision_score(test_data.store_number,test_data.store_number_pred, average='micro'))
print("****** Recall *****",recall_score(test_data.store_number,test_data.store_number_pred,average='micro'))
print("****** F1 Score *****",f1_score(test_data.store_number,test_data.store_number_pred,average='micro'))

***** Accuracy ***** 0.9
****** Precision ***** 0.9
****** Recall ***** 0.9
****** F1 Score ***** 0.9
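
All four numbers coincide because of how micro-averaging works when every sample has exactly one true and one predicted label: each wrong prediction is one false positive (for the predicted class) and one false negative (for the true class). The sketch below works this out by hand on made-up labels; the values in `y_true` and `y_pred` are illustrative and are not the notebook's test data.

```python
from collections import Counter

y_true = ['2257', '3547', '1485', '3', '568']
y_pred = ['2257', '3547', '1485', '003', 'Not_Predicted']  # made-up predictions

tp, fp, fn = Counter(), Counter(), Counter()
for t, p in zip(y_true, y_pred):
    if t == p:
        tp[t] += 1
    else:
        fp[p] += 1  # a wrong prediction counts against the predicted class
        fn[t] += 1  # and as a miss for the true class

# Micro-averaging pools the counts over all classes before dividing
precision = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
recall = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision, recall, f1)
```

Macro or weighted averaging would generally give values that differ from accuracy, so those would be the options to try if per-class behaviour is of interest.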


#### Steps Followed to extract entities
* I first studied the data to understand the problem of extracting entities from the transaction descriptors.
* Some of the transaction descriptors are simple and contain only one number, which a regex can find.
* In other cases, regex or other rule-based techniques may not fetch the right store number because other numbers appear alongside it.
* My next approach was to build a learning-based model to extract entities from these transaction descriptors.
* Given the instruction to use a pretrained model, I chose the spaCy model 'en_core_web_lg', which is trained on a large corpus of English web text.
* I chose it because of my previous experience with spaCy models, which are neural-network based and widely used in the NLP community.
* spaCy models are also well suited to named entity recognition tasks and come with thorough documentation.
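
The limitation of a rule-based approach described above can be seen with a minimal regex baseline. This sketch assumes a simple rule, "every run of digits is a candidate store number", and uses cleaned descriptors from the sample data; the function name `candidate_numbers` is hypothetical.

```python
import re

def candidate_numbers(descriptor):
    # Every maximal run of digits is a possible store number
    return re.findall(r'\d+', descriptor)

single = candidate_numbers('AUTOZONE 3547')
several = candidate_numbers('DOLRTREE 2257 00022574 ROSWELL')
print(single)
print(several)
```

With a single candidate the rule is enough, but with several candidates the regex alone cannot tell which one is the store number, which is the gap the learned model is meant to close.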

#### Data Cleaning and Preparation:
* Transaction descriptors are cleaned so that only alphanumeric characters remain; runs of other characters are replaced by a single space.
* The data is split into train, validation and test sets, and the train and validation sets are converted to spaCy format after annotating the new entity store_detector for custom training of the spaCy model.
#### Model Training:
* The pretrained en_core_web_lg model is fine-tuned on the train data and validated on the validation data.
#### Model Prediction and Metrics:
* The best model is used for predictions on the test data.
* Accuracy and the other metrics computed on the test data are all 90%.
 
#### Other Notes:
* I also tried the small language model and a transformer-based model (which I could not train to completion on my laptop and had to abort). The large model performed slightly better than the small model, with a 1-point accuracy improvement.

