# Introduction

**Note:** This is an explanatory file about the process of training the algorithm. Do not run this file unless necessary!

### About the Project

The task was to find important features of products based on the titles under which they are listed on the website. We received a dataset of over 600,000 products, a size that is certainly challenging to navigate. The main goal of this project is therefore to create a tool that helps customers orient themselves quickly among these products. The tool we chose for this task is a **Named Entity Recognition** algorithm.

### About Named Entity Recognition

Named entity recognition (NER) — sometimes referred to as entity chunking, extraction, or identification — is the task of identifying and categorizing key information (entities) in the text. An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category.

NER is a form of natural language processing (NLP), a subfield of artificial intelligence. NLP is concerned with computers processing and analyzing natural language, i.e., any language that has developed naturally rather than being designed artificially, as programming languages are.

**How NER works:**

At the heart of any NER model is a two-step process:

1. Detect a named entity 
  - This step involves detecting a word or string of words that form an entity. Each word represents a token: “The Great Lakes” is a string of three tokens that represents one entity.

2. Categorize the entity
  - This step requires the creation of entity categories. Here are some common entity categories: Person, Organization, Time, Location, Work of art, etc.
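
The two steps above can be illustrated with a deliberately simple sketch. It uses a hand-written lookup table (`GAZETTEER`, a hypothetical stand-in for a trained model) instead of anything learned from data, so it only shows the shape of the process: first find the token sequence that forms an entity, then attach a category to it.

```python
# Toy illustration of the two steps; a real NER model learns both from data.
# GAZETTEER is a hypothetical lookup table, not part of any library.
GAZETTEER = {
    ("the", "great", "lakes"): "Location",
    ("puma",): "Organization",
}


def find_entities(text):
    """Return (entity_text, category) pairs found in `text`."""
    tokens = text.split()
    lowered = [token.lower() for token in tokens]
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Step 1: detect the longest token sequence that forms a known entity.
        for length in range(len(tokens) - i, 0, -1):
            candidate = tuple(lowered[i:i + length])
            if candidate in GAZETTEER:
                match = (length, GAZETTEER[candidate])
                break
        if match:
            length, category = match
            # Step 2: attach the category to the detected span.
            found.append((" ".join(tokens[i:i + length]), category))
            i += length
        else:
            i += 1
    return found


print(find_entities("We sailed across The Great Lakes last summer"))
# [('The Great Lakes', 'Location')]
```

A lookup table like this cannot handle unseen entities, which is exactly why a trained model is needed.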

To learn what is and is not a relevant entity and how to categorize it, a model requires training data. The more relevant the training data is to the task, the more accurate the model will be at completing that task. Once the entities and categories have been defined, they can be used to label data and create a training dataset. This training dataset can then be used to train an algorithm to label any given text predictively.

Generally, an NER model learns the context in which entities appear and is then capable of detecting them in unseen text. This makes it well suited to any situation in which a high-level overview of a large quantity of text is helpful, which is certainly our case.

### Custom Named-Entity Recognition

After examining our data more closely, we found that the titles are a mix of English and Finnish. This causes some difficulty, as existing pretrained models are usually built for a single language. Likewise, the existing models are not trained to detect the entities specific to our dataset. We therefore decided on a custom NER model, which allows us to detect the required entities in both languages.

### Choosing Custom SpaCy Model

After researching our options for custom models, we decided to work with spaCy. SpaCy's NER model is a simple classifier that is made powerful by some clever feature engineering. Before the input features are fed into the classifier, a stack of weighted bloom embedding layers merges neighbouring features together. This gives each word a unique representation for each distinct context it appears in. SpaCy is also highly flexible and allows us to add new entity types and train the model on them, which is exactly what we need.
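
To give a feel for the bloom embeddings mentioned above, here is a minimal pure-Python sketch of the general idea: each feature is hashed several times into a small shared table, and the selected rows are summed. This is an illustration under our own assumptions, not spaCy's implementation; the function names, table size, and number of hashes are all hypothetical.

```python
import hashlib
import random


def make_table(rows, width, seed=0):
    """Create a small table of random vectors (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(rows)]


def bucket(feature, seed, rows):
    """Map a feature string to a row index using a seeded hash."""
    digest = hashlib.md5(f"{seed}:{feature}".encode("utf-8")).hexdigest()
    return int(digest, 16) % rows


def bloom_embed(feature, table, n_hashes=4):
    """Sum the rows selected by several independent hashes of the feature."""
    rows, width = len(table), len(table[0])
    vector = [0.0] * width
    for seed in range(n_hashes):
        row = table[bucket(feature, seed, rows)]
        vector = [v + r for v, r in zip(vector, row)]
    return vector


# 50 rows are shared by every possible word, so the table stays small
# no matter how large the vocabulary grows.
table = make_table(rows=50, width=8)
vec_a = bloom_embed("valkoinen", table)
vec_b = bloom_embed("musta", table)
print(len(vec_a), len(vec_b))
```

Two words may collide on one row, but they are unlikely to collide on all of them, so most words still end up with distinct vectors.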

## Importing the Customer Data

For better performance, we decided to work in Google Colab, so we uploaded the data to Google Drive and explored it there.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
path = "/content/drive/My Drive/integrify_ner_project_team/data.csv"
df_bonus = pd.read_csv(path) # Dataset is now stored in a Pandas Dataframe

In [None]:
df_bonus.head()

Unnamed: 0,title,description,summary,brand,price,meta,provider_category,provider
0,"adidas Originals - Superstar - Valkoinen - US 5,5",,,adidas Originals,66.5,"{""SIZE"": [""us 5,5""], ""COLOR"": [""valkoinen""], ""...",17-muoti-ja-vaatetus,Caliroots
1,Sc-Erna Polvipituinen Hame Sininen Soyaconcept,"SOYACONCEPT on tanskalainen brändi, joka luo e...",,Soyaconcept,49.99,"{""SIZE"": [""36""], ""COLOR"": [""cristal blue""], ""G...",17-muoti-ja-vaatetus,Boozt
2,Dana Buchman Silmälasit Taren CARAMEL TORTOISE,Dana Buchman Taren Silmälasit. Collection:Men....,,Dana Buchman,146.0,"{""SIZE"": [""54""], ""COLOR"": [""tortoise""], ""GENDE...",13-silmalasit-ja-piilolinssit,Smartbuy Glasses
3,Active Sports Woven Shorts B Shortsit Musta PUMA,PUMA Active Sports Woven Shorts B,,PUMA,27.0,"{""SIZE"": [""164"", ""128"", ""110"", ""116"", ""104"", ""...",17-muoti-ja-vaatetus,Boozt
4,Renata Polvipituinen Hame Musta Fall Winter Sp...,Fall Winter Spring Summer. A-linjainen.,,Fall Winter Spring Summer,199.0,"{""SIZE"": [""xs""], ""COLOR"": [""jet black""], ""GEND...",17-muoti-ja-vaatetus,Boozt


First, we explored the dataset. It has 8 columns (features) and over 600,000 rows (products). We decided to train the model to search for entities only in the **title** column, as the description column not only has a different structure but also has far more missing values.

## Creating a Blank SpaCy Model

Even though we are creating our own custom model, it is good practice to start from a blank spaCy model, which serves as a skeleton for the selected language. We chose the blank Finnish model, as Finnish has a more complicated structure than English. We also added only the **'ner'** component to the NLP pipeline.

In [None]:
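# Train NER from a blank spaCy model (spaCy 2.x API)
import spacy
spacy.prefer_gpu()  # use the GPU if one is available

# Blank Finnish pipeline without any trained components
nlp=spacy.blank("fi")

# Add an untrained NER component to the pipeline
nlp.add_pipe(nlp.create_pipe('ner'))

# Initialize the model weights; the call returns the optimizer
nlp.begin_training()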

<thinc.neural.optimizers.Optimizer at 0x7f6861fe0fd0>

In [None]:
# Getting the ner component
ner=nlp.get_pipe('ner')

## Getting the Training Data

#### Labelling Strategy

After creating the blank spaCy model, we needed to gather training data for our task. First, we had to choose the labels we wanted the algorithm to detect. Based on our exploration of the title column, we decided to include 6 labels:
- **'BRAND'** - the brand of the given product, e.g., PUMA or adidas
- **'PRODUCT'** - the name of the product, e.g., shoe, dress, or sunglasses
- **'COLOR'** - the color of the product
- **'SIZE/WEIGHT'** - the size or weight of the product, e.g., small / medium / large, UK 38, 50ml, or 44mm
- **'GENDER'** - the gender the product is targeted at
- **'MATERIAL'** - the material of the product, e.g., silk or leather

#### Required Data

To train the algorithm efficiently, we need to feed it good-quality training data. Training data for a spaCy model needs to be stored as a list of tuples. Each labelled title starts out as a record like the following example, which is later converted into a tuple:

- {"content":"adidas Originals - Superstar - Valkoinen - US 5,5","entities":[[31,40,"COLOR"],[43,49,"SIZE"],[0,6,"BRAND"]]}

**"content"** is an exact copy of the title from our dataset; **"entities"** stores the start and end character indices of each entity in the text, along with its label.
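
Because the indices are character offsets, a single off-by-one error silently produces a wrong entity. The sketch below shows one way such records could be sanity-checked before training; `check_entities` is a hypothetical helper written for this illustration, not part of spaCy or the annotator.

```python
def check_entities(record):
    """Return the text covered by each labelled span, raising on bad offsets."""
    text = record["content"]
    covered = []
    for start, end, label in record["entities"]:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"Span {start}-{end} is outside the title")
        span = text[start:end]
        if span != span.strip():
            raise ValueError(f"Span {start}-{end} starts or ends with whitespace")
        covered.append((label, span))
    return covered


record = {
    "content": "adidas Originals - Superstar - Valkoinen - US 5,5",
    "entities": [[31, 40, "COLOR"], [43, 49, "SIZE"], [0, 6, "BRAND"]],
}
print(check_entities(record))
# [('COLOR', 'Valkoinen'), ('SIZE', 'US 5,5'), ('BRAND', 'adidas')]
```

Printing the covered text next to each label makes it easy to spot spans that were shifted or cut short during annotation.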

In total, we labelled about 1,000 titles and used them as training data. We split the workload and then used the code below to merge all the files into the final training data. We also labelled an additional 50 examples for testing the algorithm.

#### Annotation Strategy

To create the annotations, we used an available annotation tool, which we fed with manually annotated labels. We believe this strategy gave us training data of the highest possible accuracy, as we entered the entities manually and checked them for errors.

For the annotations we used the spaCy [NER Annotator](https://manivannanmurugavel.github.io/annotating-tool/spacy-ner-annotator/), which was created to reduce annotation time and convert the training data into the spaCy format. Because the spaCy training format is a list of tuples, which JavaScript does not support, we also used the annotator's Python script convert_spacy_train_data.py to convert the data to the final training format (available in this [GitHub repository](https://github.com/ManivannanMurugavel/spacy-ner-annotator)).

Next, we stored the names of the new entity categories in a list variable **LABEL**.

In [None]:
# This code was used for merging all the training files (it was run on a local machine; do not run this cell!)

# import json
# import glob

# result = []
# for f in glob.glob("train_files/*.json"):
#     with open(f, "rb") as infile:
#         x=json.load(infile)
#         result+=x

# with open("train_data.json", "w") as outfile:
#      json.dump(result, outfile)

In [None]:
# Add the new label to ner
LABEL = ['COLOR','BRAND','SIZE/WEIGHT','MATERIAL','GENDER','PRODUCT']
for i in LABEL:
  ner.add_label(i)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [None]:
# code for converting our training data from json format to txt format

import json

filename='/content/drive/My Drive/integrify_ner_project_team/fixed_train_data.json'
print(filename)


with open(filename) as train_data:
    train = json.load(train_data)

TRAIN_DATA = []
for data in train:
    ents = [tuple(entity) for entity in data['entities']]
    TRAIN_DATA.append((data['content'],{'entities':ents}))


with open('{}'.format(filename.replace('json','txt')),'w') as write:
    write.write(str(TRAIN_DATA))

print('-------------Copy and Paste to spacy training-------------\n\n\n')

print(TRAIN_DATA)

print('\n\n\n--------------------------End-----------------------------')





/content/drive/My Drive/integrify_ner_project_team/fixed_train_data.json
-------------Copy and Paste to spacy training-------------



[('Lacoste Silmälasit L2840 220', {'entities': [(0, 7, 'BRAND'), (8, 18, 'PRODUCT')]}), ('Beyond Perf. Foundation + Concealer, 09 Neutral', {'entities': [(13, 35, 'PRODUCT'), (40, 47, 'COLOR')]}), ('Silk Screen Refining Powder Makeup, 24 Meikkivoide Meikki Origins', {'entities': [(21, 34, 'PRODUCT'), (39, 50, 'PRODUCT'), (58, 65, 'BRAND')]}), ('Calvin Klein Earth Miesten kello K5Y31XWL Vihreä/Nahka Ø44 mm', {'entities': [(0, 12, 'BRAND'), (19, 26, 'GENDER'), (27, 32, 'PRODUCT'), (42, 48, 'COLOR'), (49, 54, 'MATERIAL'), (56, 61, 'SIZE/WEIGHT')]}), ('Just Cavalli Silmälasit JC 0817 086', {'entities': [(0, 12, 'BRAND'), (13, 23, 'PRODUCT')]}), ('Alliance Forestry 333 ( 320/85 -24 127A8 14PR TL kaksoistunnus\xa0 124B )', {'entities': [(0, 8, 'BRAND'), (24, 48, 'SIZE/WEIGHT')]}), ('ITP Bajacross ( 26x10.00 R14 TL 81D )', {'entities': [(0, 3, 'BRAND'), (16, 3

## Training the NER Model

In the process of training the model we followed these steps: 

(a) To train an NER model, the model has to loop over the examples for a sufficient number of iterations. Training for just 5 or 6 iterations may not be effective. We used 100 iterations in our final file.

(b) Before every iteration, it is good practice to shuffle the examples randomly with the **random.shuffle()** function. This ensures the model does not make generalizations based on the order of the examples.

(c) The training data is usually passed in batches. We used spaCy's **minibatch()** function, which returns the training data in batches. Its size parameter denotes the batch size.
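
Steps (b) and (c) can be sketched without spaCy. The helper `simple_minibatch` below is a hypothetical, fixed-size stand-in for spaCy's minibatch() function, and the list of strings stands in for the real training data; the point is only to show the shuffle-then-batch pattern repeated in every iteration.

```python
import random


def simple_minibatch(items, size):
    """Yield consecutive batches of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


examples = [f"title {i}" for i in range(10)]  # stand-in for the training data

rng = random.Random(42)
for epoch in range(2):
    # Reorder the examples so each iteration sees them in a new order.
    rng.shuffle(examples)
    batches = list(simple_minibatch(examples, size=4))
    print(epoch, [len(batch) for batch in batches])
# 0 [4, 4, 2]
# 1 [4, 4, 2]
```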

#### Explanation of Sections of the Training Code

- The **compounding()** function takes three inputs: start (the first value), stop (the maximum value that can be generated), and compound. The value passed as compound is the compounding factor for the series.

- For each iteration, the NER model is updated through the **nlp.update()** command. The parameters of nlp.update() are:
  - sgd: the optimizer that was returned by **resume_training()** is passed here.
  - drop: the dropout rate. We chose a dropout rate of 0.1: we want to prevent the model from overfitting, but because our training data is only a small portion of the total data, we decided not to set a larger dropout rate.
  - losses: a dictionary to hold the losses for each pipeline component. We created an empty dictionary and passed it here.
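
The compounding series described in the first bullet can be re-created in a few lines. The generator below is our own sketch for the increasing case only (the name `compounding_series` is hypothetical); it uses a factor of 2.0 rather than 1.001 so that the cap becomes visible after a few steps.

```python
from itertools import islice


def compounding_series(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


print(list(islice(compounding_series(1.0, 4.0, 2.0), 5)))
# [1.0, 2.0, 4.0, 4.0, 4.0]
```

With a factor of 1.001, the series grows far more slowly: reaching the cap of 4.0 from 1.0 takes on the order of 1,400 steps, so the batch size increases gradually over the course of training.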

At each word, the **update()** function makes a prediction. It then consults the annotations to check if the prediction is right. If it isn’t, it adjusts the weights so that the correct action will score higher next time.
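
The predict, compare, and adjust cycle can be illustrated with a perceptron-style toy. This is a simplified analogy under our own assumptions, not how spaCy computes its updates (spaCy trains a neural network with gradient-based updates); the feature strings and the `update` helper are hypothetical.

```python
def update(weights, features, gold, labels):
    """One error-driven step: predict, compare with gold, adjust on a mistake."""
    def score(label):
        return sum(weights.get((label, feature), 0.0) for feature in features)

    predicted = max(labels, key=score)
    if predicted != gold:
        for feature in features:
            # Raise the score of the correct label, lower the wrong one.
            weights[(gold, feature)] = weights.get((gold, feature), 0.0) + 1.0
            weights[(predicted, feature)] = weights.get((predicted, feature), 0.0) - 1.0
    return predicted


labels = ["BRAND", "COLOR"]
weights = {}
features = ["word=ruskea", "is_title=True"]

first = update(weights, features, gold="COLOR", labels=labels)
second = update(weights, features, gold="COLOR", labels=labels)
print(first, second)
# BRAND COLOR
```

The first prediction is wrong, so the weights are adjusted; on the second pass the correct label scores higher and is predicted.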


Finally, all of the training is done within the context of the nlp model with the other pipeline components disabled, to prevent them from being involved.



In [None]:
%%time
# Importing requirements (the %%time cell magic must be the first line of the cell)

from spacy.util import minibatch, compounding
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes) :

  # Batch sizes grow from 1.0 towards 4.0 by a factor of 1.001 per batch
  sizes = compounding(1.0, 4.0, 1.001)
  # Training for 100 iterations
  for itn in range(100):
    # Shuffle examples before training
    random.shuffle(TRAIN_DATA)
    # Batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=sizes)
    # Dictionary to store losses
    losses = {}
    for batch in batches:
      texts, annotations = zip(*batch)
      # Calling update() over the iteration
      nlp.update(texts, annotations, sgd=optimizer, drop=0.1, losses=losses)
      # Print the running (cumulative) loss of this iteration after each batch
      print("Losses", losses)

Streaming output truncated to the last 5000 lines.
Losses {'ner': 642.2270280124836}
Losses {'ner': 644.2252210846228}
Losses {'ner': 646.7551652606013}
Losses {'ner': 648.3043738514887}
Losses {'ner': 653.8038466452986}
Losses {'ner': 656.7461098254321}
Losses {'ner': 658.8545484015735}
Losses {'ner': 662.6672264233043}
Losses {'ner': 663.16821911526}
Losses {'ner': 667.3053826054443}
Losses {'ner': 668.8005121194315}
Losses {'ner': 676.8522074662637}
Losses {'ner': 678.1027543340844}
Losses {'ner': 680.8963917347511}
Losses {'ner': 684.6214948083116}
Losses {'ner': 686.6099255705645}
Losses {'ner': 689.2587863213953}
Losses {'ner': 696.4658521729648}
Losses {'ner': 698.299338055811}
Losses {'ner': 700.4002308537674}
Losses {'ner': 700.4058445443802}
Losses {'ner': 2.3568634694820503}
Losses {'ner': 8.553603086285875}
Losses {'ner': 11.616634262740263}
Losses {'ner': 12.630646660927596}
Losses {'ner': 15.464993955906039}
Losses {'ner': 15.465215239811295}
Losses {'ner': 

## Remarks About the Training Process


We experimented with different settings of the model, especially the dropout rate and the number of iterations. We started with a dropout rate of 0.35, which resulted in losses of about 2,600, which was not the best possible result. When we increased the dropout rate, the losses increased even more. We reached the best results with a dropout rate of 0.1, which still helps to prevent overfitting without discarding too much of the signal from our small training set. The final losses of our model were about 600.
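
To make the effect of the dropout rate concrete, the sketch below zeroes each value of a feature vector with the given probability. The helper `apply_dropout` is hypothetical, and rescaling the surviving values is one common convention that we assume here; the source does not state how spaCy handles this internally.

```python
import random


def apply_dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale for value in values]


rng = random.Random(0)
features = [1.0] * 1000
for rate in (0.1, 0.35):
    dropped = apply_dropout(features, rate, rng)
    share_zeroed = dropped.count(0.0) / len(dropped)
    # The zeroed share should land close to the chosen rate.
    print(rate, round(share_zeroed, 2))
```

At a rate of 0.35, roughly a third of the values are hidden in every update, which is a lot to give up when only about 1,000 labelled titles are available.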

Regarding the number of iterations, we tried the model with 100, 200, and 500 iterations. Although 500 iterations produced the lowest losses, the additional training time was not matched by a significant decrease in losses. We also found that 100 iterations gave the best accuracy, so we used 100 iterations in the final model.
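
The text above does not spell out how accuracy was measured. One common way to compare settings on held-out titles is to score exact span matches, as in the sketch below; `span_scores` and the example spans are hypothetical and are not taken from our actual evaluation.

```python
def span_scores(gold, predicted):
    """Precision, recall and F1 over exact (start, end, label) matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if true_pos else 0.0
    return precision, recall, f1


# Hypothetical gold annotations and model predictions for one title.
gold = [(0, 6, "BRAND"), (31, 40, "COLOR"), (43, 49, "SIZE/WEIGHT")]
predicted = [(0, 6, "BRAND"), (31, 40, "COLOR"), (19, 28, "PRODUCT")]

precision, recall, f1 = span_scores(gold, predicted)
print(round(precision, 2), round(recall, 2), round(f1, 2))
# 0.67 0.67 0.67
```

Unlike the training loss, a score like this is computed on examples the model has not seen, so it shows whether extra iterations actually improve the predictions.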

## Saving the NER

Once we found the performance of the model satisfactory, we saved the updated model to a directory using the **to_disk** command. That way, we can load the model from the directory at any time by passing the directory path to the **spacy.load()** function.

In [None]:
# Output directory
from pathlib import Path
output_dir=Path('/content/drive/My Drive/integrify_ner_project_team/model/')

# Saving the model to the output directory
if not output_dir.exists():
  output_dir.mkdir()
nlp.meta['name'] = 'my_ner'  # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

# Loading the model from the directory
print("Loading from", output_dir)
nlp_custom = spacy.load(output_dir)
assert nlp_custom.get_pipe("ner").move_names == move_names

doc2 = nlp_custom('Flairville Low Laceshoes Shoes Business Laced Shoes Ruskea GANT')
for ent in doc2.ents:
  print(ent.label_, ent.text)

Saved model to /content/drive/My Drive/integrify_ner_project_team/model
Loading from /content/drive/My Drive/integrify_ner_project_team/model
PRODUCT Flairville Low Laceshoes
PRODUCT Business Laced Shoes
COLOR Ruskea
BRAND GANT
