## Creating a spaCy NER Model for Laptops

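The entities we want the model to recognize are Brand, Model, Processor, RAM, OS, Disk, and Dim (screen size). The Category column is annotated as well, because every column of the table is used as a label.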

In [1]:
## import libraries
import pandas as pd
import numpy as np
import spacy
import random
import time

### 1. Data


In [2]:
## load the data; the column names serve as the entity labels
data = pd.read_csv("laptop.csv",index_col=None)
data

,Brand,Model,Processor,RAM,OS,Disk,Dim,Category
0,Lenovo,Ideapad,Intel Core i3 Processor (7th Gen),4 GB DDR4,64 bit Windows 10,1 TB HDD,39.62 cm (15.6 inch),Laptop
1,Lenovo,Ideapad,Intel Core i3 Processor (7th Gen),4 GB DDR4,64 bit Windows 10,1 TB HDD,39.62 cm (15.6 inch),Laptop
2,HP,EliteBook,Intel Core i3 Processor (7th Gen),8 GB DDR4,64 bit Windows 10,256 GB SSD,35.56 cm (14 inch),Laptop
3,Dell,Vostro,Intel Core i3 Processor (8th Gen),4 GB DDR4,Linux/Ubuntu,1 TB HDD,35.56 cm (14 inch),Laptop
4,HP,Zbook,Intel Core i5 Processor (8th Gen),8 GB DDR4,64 bit Windows 10,1 TB HDD,35.56 cm (14 inch),Laptop
...,...,...,...,...,...,...,...,...
411,Lenovo,Ideapad,Intel Core i5 Processor (6th Gen),4 GB DDR3,64 bit Windows 10,1 TB HDD,35.56 cm (14 inch),Laptop
412,Lenovo,Ideapad,Intel Core i7 Processor (8th Gen),8 GB DDR4,64 bit Windows 10,1 TB HDD,39.62 cm (15.6 inch),Laptop
413,Lenovo,Ideapad,AMD APU Quad Core A6 Processor,4 GB DDR3,64 bit Windows 10,1 TB HDD,39.62 cm (15.6 inch),Laptop
414,Lenovo,Legion,Intel Pentium Quad Core Processor (4th Gen),4 GB DDR3,D,500 GB HDD,39.62 cm (15.6 inch),Laptop


### 2. Pre-processing

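The training data has to be in the format spaCy expects: a list of (text, {"entities": [(start, end, label), ...]}) tuples, where start and end are character offsets into text.

Steps:
1. For each laptop, build a phrase from its field values in random order and record the character offsets of each value as its annotation.
2. Write the phrases and their annotations to a JSON file, one object per line.
3. Read the JSON file back into a list in the spaCy training format.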
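The target format from the steps above can be sketched on a single record. This is only an illustration: `build_example` is a hypothetical helper and the record is hand-written, not taken from laptop.csv.

```python
import random

def build_example(record, rng):
    """Join the field values in random order and record character offsets.

    `record` maps an entity label to its text value. Returns a
    (text, {"entities": [(start, end, label), ...]}) tuple.
    """
    labels = list(record)
    rng.shuffle(labels)          # jumble the order of the entities
    parts, entities, pos = [], [], 0
    for label in labels:
        value = record[label]
        entities.append((pos, pos + len(value), label))
        parts.append(value)
        pos += len(value) + 1    # +1 for the joining space
    return " ".join(parts), {"entities": entities}

# Hypothetical record, shaped like one row of the laptop table
record = {"Brand": "HP", "Model": "EliteBook", "Disk": "256 GB SSD"}
text, annotations = build_example(record, random.Random(0))

# Every span should slice back to the original field value
for start, end, label in annotations["entities"]:
    assert text[start:end] == record[label]
```

Passing a seeded `random.Random` keeps the shuffled order reproducible between runs.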


#### a. Creating content for the laptops

In [3]:
## the entities
cols = data.columns
cols

Index(['Brand', 'Model', 'Processor', 'RAM', 'OS', 'Disk', 'Dim', 'Category'], dtype='object')

In [4]:
## number of entities
num_ent = len(data.columns)
ent_list = list(np.arange(num_ent))
ent_list

[0, 1, 2, 3, 4, 5, 6, 7]

In [5]:
# sample of an entity
data.iloc[0,2]

'Intel Core i3 Processor (7th Gen)'

In [6]:
## jumble the order of the entities to create a phrase
prod_name = [] # list of all product names
prod_ann = [] # list of all the annotations
for i in range(len(data)): # loop over each laptop
    idx_list = random.sample(ent_list,num_ent) # shuffled column indices
    cont = []
    ann = []
    ann_idx = 0 # character offset where the next entity starts
    for j in range(num_ent): # creating the jumbled product name
        col_num = idx_list[j] # column number according to the jumbled index
        val = data.iloc[i,col_num] # value of the entity, used as-is (including any trailing whitespace)
        cont.append(val) # collecting the entity values for this laptop
        ann.append((ann_idx, len(val)+ ann_idx, cols[col_num])) # (start, end, entity name)
        ann_idx = ann_idx + len(val) + 1 # move past the value and the joining space

    prod_name.append( ' '.join(cont)) # complete phrase for each laptop
    prod_ann.append(ann)


    

In [7]:
## Example
## This is the content
sample_prod_name = prod_name[8]
sample_prod_name

'Laptop Modern 35.56 cm (14 inch)  8 GB DDR4  MSI 64 bit Windows 10 512 GB SSD Intel Core i5 Processor (10th Gen)'

In [8]:
## entities in it with their annotations
sample_prod_ent = prod_ann[8]
sample_prod_ent

[(0, 6, 'Category'),
 (7, 13, 'Model'),
 (14, 33, 'Dim'),
 (34, 44, 'RAM'),
 (45, 48, 'Brand'),
 (49, 66, 'OS'),
 (67, 77, 'Disk'),
 (78, 112, 'Processor')]

In [9]:
## slice one entity out of the phrase using its offsets
st_id = sample_prod_ent[7][0]
end_id = sample_prod_ent[7][1]
ent = sample_prod_ent[7][2]
print(ent,':', sample_prod_name[st_id : end_id])


Processor : Intel Core i5 Processor (10th Gen)
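The same slice check can be applied to every span instead of one. The sketch below uses a hypothetical helper, `check_annotations`, on the sample phrase and offsets shown above; it flags spans that begin or end with whitespace. spaCy generally expects entity spans to line up with token boundaries, so such spans may be ignored or rejected depending on the version.

```python
def check_annotations(text, entities):
    """Return a list of (label, problem) pairs for suspicious spans."""
    problems = []
    for start, end, label in entities:
        span = text[start:end]
        if not span:
            problems.append((label, "empty span"))
        elif span != span.strip():
            problems.append((label, "leading or trailing whitespace"))
    return problems

# Text and offsets copied from the sample above
text = ("Laptop Modern 35.56 cm (14 inch)  8 GB DDR4  MSI "
        "64 bit Windows 10 512 GB SSD Intel Core i5 Processor (10th Gen)")
entities = [(0, 6, "Category"), (7, 13, "Model"), (14, 33, "Dim"),
            (34, 44, "RAM"), (45, 48, "Brand"), (49, 66, "OS"),
            (67, 77, "Disk"), (78, 112, "Processor")]

for label, problem in check_annotations(text, entities):
    print(label, ":", problem)
```

On this sample the Dim and RAM spans are reported, which matches the double spaces visible in the phrase: those two cell values carry a trailing space. Stripping the values before building the phrase would be one way to avoid this, at the cost of changing the offsets shown in this notebook.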


In [10]:
prod =[]
for i in range(len(data)):
    prod.append([prod_name[i], prod_ann[i]])

prod[4]

['Laptop Zbook 64 bit Windows 10 1 TB HDD 35.56 cm (14 inch)  8 GB DDR4  Intel Core i5 Processor (8th Gen) HP',
 [(0, 6, 'Category'),
  (7, 12, 'Model'),
  (13, 30, 'OS'),
  (31, 39, 'Disk'),
  (40, 59, 'Dim'),
  (60, 70, 'RAM'),
  (71, 104, 'Processor'),
  (105, 107, 'Brand')]]

In [11]:
## creating a dataframe with product names and annotations
prod_data = pd.DataFrame(prod, columns = ['ProdName','Annotations'])
prod_data.head()

,ProdName,Annotations
0,1 TB HDD Ideapad Laptop Intel Core i3 Processo...,"[(0, 8, Disk), (9, 16, Model), (17, 23, Catego..."
1,1 TB HDD Lenovo 64 bit Windows 10 4 GB DDR4 L...,"[(0, 8, Disk), (9, 15, Brand), (16, 33, OS), (..."
2,HP 8 GB DDR4 Intel Core i3 Processor (7th Gen...,"[(0, 2, Brand), (3, 13, RAM), (14, 47, Process..."
3,Dell Linux/Ubuntu Intel Core i3 Processor (8th...,"[(0, 4, Brand), (5, 17, OS), (18, 51, Processo..."
4,Laptop Zbook 64 bit Windows 10 1 TB HDD 35.56 ...,"[(0, 6, Category), (7, 12, Model), (13, 30, OS..."


In [12]:
# converting into csv file
prod_data.to_csv('laptop_prodNames.csv', index= None)

#### b. Creating json file

In [13]:
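# converting into json format (one JSON object per line)
import csv
import json

fieldnames = ('ProdName', 'Annotations')

# context managers make sure both files are flushed and closed
with open('laptop_prodNames.csv', 'r', newline='') as csvfile, \
        open('laptop_prodNames.json', 'w') as jsonfile:
    # fieldnames are passed explicitly, so the CSV header row is read as
    # data and becomes the first line of the JSON file
    reader = csv.DictReader(csvfile, fieldnames)
    for row in reader:
        json.dump(row, jsonfile)
        jsonfile.write('\n')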
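The CSV round trip above stores each annotation list as a string, which is why the next step has to parse it back. A possible alternative, sketched here with hypothetical helpers `write_jsonl` and `read_jsonl`, writes one JSON object per line directly from the in-memory list. JSON has no tuple type, so the spans come back as lists and are converted to tuples on read.

```python
import json
import os
import tempfile

def write_jsonl(path, examples):
    """Write (text, spans) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text, spans in examples:
            f.write(json.dumps({"ProdName": text, "Annotations": spans}) + "\n")

def read_jsonl(path):
    """Read the file back into (text, {"entities": [...]}) tuples."""
    training_data = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            spans = [tuple(span) for span in row["Annotations"]]
            training_data.append((row["ProdName"], {"entities": spans}))
    return training_data

# Hypothetical two-field example written to a temporary directory
examples = [("HP 1 TB HDD", [(0, 2, "Brand"), (3, 11, "Disk")])]
path = os.path.join(tempfile.mkdtemp(), "laptop_prodNames.jsonl")
write_jsonl(path, examples)
loaded = read_jsonl(path)
```

With this layout there is no header line to skip and no string to evaluate.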

#### c. json to list (spaCy format)

In [14]:
## function to convert the json file into spaCy training data format
import ast
import logging

def convert_to_spacytrain(json_file):
    try:
        training_data = []
        with open(json_file, 'r') as f:
            lines = f.readlines() # 416 lines: the header row plus 415 laptops

        # skip the header row (line 0); lines 1-399 are used for training
        for line in lines[1:400]:
            data = json.loads(line) # single row
            text = data['ProdName'] # the complete phrase
            # the annotations were stored as a string; parse it without eval
            entities = ast.literal_eval(data['Annotations'])
            training_data.append((text, {"entities" : entities}))

        return training_data

    except Exception as e:
        logging.exception("Unable to process " + json_file + "\n" + "error = " + str(e))
        return None

In [21]:
train_data = convert_to_spacytrain('laptop_prodNames.json')
train_data

[('1 TB HDD Ideapad Laptop Intel Core i3 Processor (7th Gen) 4 GB DDR4  64 bit Windows 10 39.62 cm (15.6 inch)  Lenovo',
  {'entities': [(0, 8, 'Disk'),
    (9, 16, 'Model'),
    (17, 23, 'Category'),
    (24, 57, 'Processor'),
    (58, 68, 'RAM'),
    (69, 86, 'OS'),
    (87, 108, 'Dim'),
    (109, 115, 'Brand')]}),
 ('1 TB HDD Lenovo 64 bit Windows 10 4 GB DDR4  Laptop 39.62 cm (15.6 inch)  Ideapad Intel Core i3 Processor (7th Gen)',
  {'entities': [(0, 8, 'Disk'),
    (9, 15, 'Brand'),
    (16, 33, 'OS'),
    (34, 44, 'RAM'),
    (45, 51, 'Category'),
    (52, 73, 'Dim'),
    (74, 81, 'Model'),
    (82, 115, 'Processor')]}),
 ('HP 8 GB DDR4  Intel Core i3 Processor (7th Gen) 256 GB SSD Laptop EliteBook 64 bit Windows 10 35.56 cm (14 inch) ',
  {'entities': [(0, 2, 'Brand'),
    (3, 13, 'RAM'),
    (14, 47, 'Processor'),
    (48, 58, 'Disk'),
    (59, 65, 'Category'),
    (66, 75, 'Model'),
    (76, 93, 'OS'),
    (94, 113, 'Dim')]}),
 ('Dell Linux/Ubuntu Intel Core i3 Processor (8th

### 3. Training NER model

In [16]:
def train_spacy(data, iterations):
    # NOTE: this uses the spaCy 2.x training API (create_pipe, begin_training)
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class

    # create the NER component and add it to the pipeline
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)  # shuffles the caller's list in place
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
            # stop early once the accumulated loss of an iteration drops below 100
            if losses.get('ner', float('inf')) < 100:
                break
    return nlp

In [None]:
# the per-iteration losses can be plotted to choose the best model

In [17]:
%%time
prdnlp = train_spacy(train_data,10)

Starting iteration 0
{'ner': 1973.4669107749435}
Starting iteration 1
{'ner': 426.8618686048913}
Starting iteration 2
{'ner': 379.4073549236349}
Starting iteration 3
{'ner': 333.3488322236765}
Starting iteration 4
{'ner': 154.84842001685763}
Starting iteration 5
{'ner': 340.11241003607614}
Starting iteration 6
{'ner': 316.0971494102368}
Starting iteration 7
{'ner': 108.69856905705248}
Starting iteration 8
{'ner': 187.32028293791328}
Starting iteration 9
{'ner': 112.57227631507551}
CPU times: user 2min 57s, sys: 692 ms, total: 2min 57s
Wall time: 2min 58s


In [18]:
prdnlp_1 = train_spacy(train_data,20)

Starting iteration 0
{'ner': 2075.323239810802}
Starting iteration 1
{'ner': 210.35872319320842}
Starting iteration 2
{'ner': 160.28514150741734}
Starting iteration 3
{'ner': 263.17171813443895}
Starting iteration 4
{'ner': 446.17392687042405}
Starting iteration 5
{'ner': 583.7702890509131}
Starting iteration 6
{'ner': 311.69052782090375}
Starting iteration 7
{'ner': 328.0147173392929}
Starting iteration 8
{'ner': 424.0727626126882}
Starting iteration 9
{'ner': 404.5050261430823}
Starting iteration 10
{'ner': 423.78265992964145}
Starting iteration 11
{'ner': 226.66804519802028}
Starting iteration 12
{'ner': 346.1391590602029}
Starting iteration 13
{'ner': 483.48646824623137}
Starting iteration 14
{'ner': 369.323711042288}
Starting iteration 15
{'ner': 486.0551578276055}
Starting iteration 16
{'ner': 742.163702002263}
Starting iteration 17
{'ner': 528.1660210407974}
Starting iteration 18
{'ner': 587.0425977933282}
Starting iteration 19
{'ner': 447.530279200851}
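The loss in both runs fluctuates rather than falling steadily, so the last iteration is not necessarily the best one. As a rough sketch, the lowest-loss iteration can be picked from the recorded values. The list below copies the 20-iteration run above, rounded to two decimals, and `best_iteration` is a hypothetical helper. Training loss alone is a weak selection criterion; scoring each candidate on held-out rows would be more reliable.

```python
def best_iteration(losses):
    """Return (index, loss) of the iteration with the lowest loss."""
    index = min(range(len(losses)), key=losses.__getitem__)
    return index, losses[index]

# NER losses from the 20-iteration run, rounded to two decimals
losses_20 = [2075.32, 210.36, 160.29, 263.17, 446.17, 583.77, 311.69,
             328.01, 424.07, 404.51, 423.78, 226.67, 346.14, 483.49,
             369.32, 486.06, 742.16, 528.17, 587.04, 447.53]

index, loss = best_iteration(losses_20)
print(f"lowest loss {loss} at iteration {index}")
```

To act on this, the training loop would need to save the model after each iteration so that the chosen one can be reloaded.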


### 4. Testing the model

In [19]:
test_text = input("Enter your testing text: ")
doc = prdnlp(test_text)

for ent in doc.ents:

    print('Entity: ',ent.text)
    print('Details: ',ent.start_char, ent.end_char, ent.label_)

Enter your testing text: I have a HP Pavillion Laptop with 1 TB HDD.
Entity:  I
Details:  0 1 Model
Entity:  have
Details:  2 6 Brand
Entity:  a
Details:  7 8 OS
Entity:  HP
Details:  9 11 Brand
Entity:  Pavillion
Details:  12 21 Model
Entity:  Laptop
Details:  22 28 Category
Entity:  with
Details:  29 33 Model
Entity:  1 TB HDD
Details:  34 42 Disk
Entity:  .
Details:  42 43 Model


In [20]:
test_text = input("Enter your testing text: ")
doc = prdnlp(test_text)

for ent in doc.ents:

    print('Entity: ',ent.text)
    print('Details: ',ent.start_char, ent.end_char, ent.label_)

Enter your testing text: What is the price of Intel Core i3 Processor (7th Gen) Dell Vostro and Lenovo Legion laptop?
Entity:  What
Details:  0 4 Model
Entity:  is the
Details:  5 11 OS
Entity:  price
Details:  12 17 Brand
Entity:  of
Details:  18 20 Model
Entity:  Intel Core i3 Processor (7th Gen)
Details:  21 54 Processor
Entity:  Dell
Details:  55 59 Brand
Entity:  Vostro
Details:  60 66 Model
Entity:  Lenovo
Details:  71 77 Brand
Entity:  Legion
Details:  78 84 Model
Entity:  laptop
Details:  85 91 Category
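Both test sentences show the model tagging ordinary words such as "I", "have", and "What" as entities. One plausible cause is that every token in the generated training phrases belongs to some entity, so the model never sees text that should stay unlabeled. To put a number on this, span-level precision and recall can be computed against a hand-labelled answer. The sketch below uses the predictions printed for the first test sentence; the gold spans are an assumption made for illustration, and `span_scores` is a hypothetical helper.

```python
def span_scores(predicted, gold):
    """Exact-match precision and recall over (start, end, label) spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Predictions copied from the output of the first test sentence
predicted = [(0, 1, "Model"), (2, 6, "Brand"), (7, 8, "OS"),
             (9, 11, "Brand"), (12, 21, "Model"), (22, 28, "Category"),
             (29, 33, "Model"), (34, 42, "Disk"), (42, 43, "Model")]

# Assumed gold labels for "I have a HP Pavillion Laptop with 1 TB HDD."
gold = [(9, 11, "Brand"), (12, 21, "Model"),
        (22, 28, "Category"), (34, 42, "Disk")]

precision, recall = span_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these assumed labels the model finds all four real entities but only four of its nine predictions are correct, so recall is high and precision is low.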
