## Custom NER Model

Build a custom named entity recognition (NER) model that extracts the relevant named entities from CVs to help the human resources team.

# Word Vectorization

Here's an example of how you can use pre-trained word embeddings in Python using the Gensim library:

In [None]:
import gensim.downloader as api

# Download pre-trained word embeddings
word_vectors = api.load("glove-wiki-gigaword-100")

We are using the gensim.downloader module to download the pre-trained glove-wiki-gigaword-100 embeddings. You can then use the word_vectors object to access word vectors and perform other operations, as shown below:

In [None]:
# Access word vectors for a given word
word_vectors.get_vector("apple")

In [None]:
# Perform similarity calculations between words
word_vectors.similarity("apple", "mango")

In [None]:
# Check the similarity between king and queen
word_vectors.similarity("king", "queen")

In [None]:
# Find the most similar words
most_similar = word_vectors.most_similar("man")
most_similar

In [None]:
# Find the word that best completes an analogy
result = word_vectors.most_similar(positive=["woman", "king"], negative=["man"])
result
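
The analogy query above can be read as vector arithmetic: add the "positive" vectors, subtract the "negative" ones, and rank the remaining words by cosine similarity to the result. The sketch below illustrates that idea with made-up 2-dimensional vectors (the names toy_vectors, cosine_similarity and analogy are hypothetical). It is not gensim's implementation, which differs in details such as vector normalization.

```python
import math

# Toy 2-dimensional vectors, made up for illustration only
# (the GloVe vectors loaded above have 100 dimensions).
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, 0.5],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # The query words themselves are excluded from the candidates
    exclude = set(positive) | set(negative)
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = analogy(["woman", "king"], ["man"], toy_vectors)
print(ranking[0][0])  # queen
```

With these toy values, woman + king - man lands exactly on the queen vector, which is why queen ranks first.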

These are just a few of the many operations you can perform using pre-trained word embeddings and the gensim library. You can also use these embeddings as features for other NLP tasks, such as text classification or sentiment analysis, by feeding the word vectors into a machine learning model.
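
One simple way to turn word vectors into features for a classifier is to average the vectors of the words in a document. The sketch below shows that idea under stated assumptions: average_vector is a hypothetical helper, and toy_vectors is a made-up stand-in for the word_vectors object. It only relies on `word in vectors` and `vectors[word]`, which gensim's KeyedVectors also supports.

```python
def average_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token had an embedding
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Made-up 2-dimensional vectors, for illustration only
toy_vectors = {"python": [1.0, 0.0], "sql": [0.0, 1.0]}

features = average_vector(["python", "sql", "unknownword"], toy_vectors)
print(features)  # [0.5, 0.5]
```

Words without an embedding are skipped, so a document made up entirely of unknown words yields None rather than a misleading zero vector.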

# <H1>NER case study </H1> 

1. A sample CV from which named entities will be extracted

2. A pickle file containing training data from 200 CVs

<H1> Introduction </H1> 

In this notebook we will learn how to set up custom named entity recognition (NER) in spaCy.

**What is Named Entity?**
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title.

**About spaCy :-**
spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license and its main developers are Matthew Honnibal and Ines Montani, the founders of the software company Explosion. 

**NLTK vs spaCy :-**
While NLTK provides access to many algorithms for a given task, spaCy is more opinionated and usually offers one well-tuned way to do it. It is designed for fast and accurate syntactic analysis, and it offers access to larger word vectors that are easier to customize. For an app-builder mindset that prioritizes getting features done, spaCy is often the better choice.

**Some Features of spaCy :-**

- Tokenization
- Part-of-speech (POS) Tagging
- Lemmatization
- Named Entity Recognition (NER)
- Similarity
- Text Classification

We will be focusing on NER.

**What is NER ?**
Named-entity recognition is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

**What is POS Tagging?**

Part-of-speech (POS) tagging is a common natural language processing task that assigns each word in a corpus a part of speech, based on the word's definition and its context.

The tags include nouns, verbs, adjectives, adverbs, determiners, etc.


In [None]:
# Install spaCy version 2.3.7 for this exercise; the training code below uses the spaCy 2.x API.

!pip install -U spacy==2.3.7

In [None]:
# Importing spacy lib and checking the version of Spacy
import spacy
print(spacy.__version__)

In [None]:
# Print explanations of some of spaCy's built-in named entity labels
print(f'PERSON - {spacy.explain("PERSON")}')
print(f'GPE    - {spacy.explain("GPE")}')
print(f'DATE   - {spacy.explain("DATE")}')
print(f'MONEY  - {spacy.explain("MONEY")}')

In [None]:
# Import the required libraries
import pickle
import random
import warnings
warnings.filterwarnings("ignore")


In [None]:
# Mount Google Drive in Google Colab. Place the data (the resume PDF and the training pickle file) in a Drive folder first.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# The os.walk() function generates the file names in a directory tree by walking the tree either top-down or bottom-up.
import os
for dirname, _, filenames in os.walk('/content/drive/MyDrive/NER Custom Model'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# since our resume is in pdf format we will use PyMuPDF to extract data from it.
# You can also use PyPDF2.

!pip install PyMuPDF

In [None]:
import fitz  # PyMuPDF is imported under the name fitz

fname = '/content/drive/MyDrive/Prac Data/Alice Clark CV.pdf'
doc = fitz.open(fname)
alice_cv = ""
for i in range(doc.page_count):
    page = doc.load_page(i)
    alice_cv += page.get_text()

print(alice_cv)

# We have extracted the text from the PDF file using PyMuPDF and stored it in the alice_cv variable.

In [None]:
# Here we tag each token with its part of speech and print the result as formatted JSON.
# This only illustrates how part-of-speech tagging works.

import nltk
import json

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Tokenize the text
tokens = nltk.word_tokenize(alice_cv)

# Perform POS tagging
tagged_tokens = nltk.pos_tag(tokens)

# Print out the POS tags
print(json.dumps(tagged_tokens,indent=4))

In [None]:
# Run NER with NLTK's pre-trained chunker. spaCy's en_core_web_sm model could be used instead. The results are not very accurate on this CV.

nltk.download('maxent_ne_chunker')
nltk.download('words')

# Tokenize the text
tokens = nltk.word_tokenize(alice_cv)

# Perform POS tagging
tagged_tokens = nltk.pos_tag(tokens)

# Perform Named Entity Recognition
entities = nltk.ne_chunk(tagged_tokens)

# Print out the entities
for token in entities:
    if hasattr(token, 'label'):
        print(token.label(), ' '.join(c[0] for c in token))

<H1> Observation </H1> 
POS tagging gives quite good results, but a standard NER model may not give the accurate results you are looking for.

In this example, here are a few of the inaccurate entity extractions:

- SQL is not an organization but a programming language.
- Stored procedures is not a person.
- Stream analytics is not a person.
- Karnataka is not a person but a state.
- Skills is not a person.

Hence we will train a custom model on a corpus of CVs to make entity extraction more accurate.

In [None]:
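# Open the pickle file with a context manager so the file handle is closed afterwards
with open('/content/drive/MyDrive/Prac Data/train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)
print(f"Training data consists of {len(train_data)} manually labelled resumes.")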

In [None]:
train_data[96]

**Anatomy of our train data**

Our training data is a collection of 200 labelled resumes. Each resume is stored as a tuple with 2 parts/indexes.

- The first index [0] holds the full text of the resume (name, degree, designation, companies worked at, and so on).
- The second index [1] holds a dictionary with a single key, 'entities'.
- The value of the 'entities' key is a list of tuples, and each tuple contains a start index, an end index and a label.

For example, in (0, 15, 'Name'), 0 is the start index and 15 is the end index of the text labelled 'Name', which is 'Ramesh chokkala'. The end index is exclusive, as in Python slicing: 'Ramesh chokkala' is 15 characters long. Similarly, every other tuple has a start index, an end index and its label. This is how you can manually create data for training.
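
A quick way to sanity-check an annotation is to slice the resume text with each (start, end) pair and look at what comes out. The sketch below assumes the (text, {'entities': [...]}) shape described above. The extract_spans helper, the sample text and the 'Designation' label are hypothetical.

```python
def extract_spans(example):
    """Return (label, text) pairs for one (text, {'entities': [...]}) example."""
    text, annotation = example
    return [(label, text[start:end]) for start, end, label in annotation["entities"]]

# Hypothetical example in the same shape as one item of the training data
sample = (
    "Ramesh chokkala Software Engineer",
    {"entities": [(0, 15, "Name"), (16, 33, "Designation")]},
)

for label, span in extract_spans(sample):
    print(f"{label:12} -> {span!r}")
```

Running the same helper on an item such as train_data[96] would show whether the offsets line up with the intended words.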

Note: labels must be consistent across the training data. If you use the label 'Name' for names in one resume, use exactly 'Name' for names in every other resume, and not a variant.
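
Label consistency can be checked by counting how often each label occurs; near-duplicates such as 'Name' and 'name' then stand out. This is a minimal sketch with made-up data, and count_labels is a hypothetical helper, not part of spaCy.

```python
from collections import Counter

def count_labels(data):
    """Count how many times each entity label appears in the data."""
    counts = Counter()
    for _, annotation in data:
        for _, _, label in annotation["entities"]:
            counts[label] += 1
    return counts

# Made-up examples; the second one uses an inconsistent label on purpose
toy_data = [
    ("Ramesh chokkala", {"entities": [(0, 15, "Name")]}),
    ("Alice Clark", {"entities": [(0, 11, "name")]}),
]

label_counts = count_labels(toy_data)
print(label_counts)
```

Calling count_labels(train_data) on the real data would list every label the model is about to learn.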

Now that our training data is ready, we will train a spaCy model with a custom NER component.

In [None]:
nlp = spacy.blank('en')

# Function to train a custom NER component (spaCy 2.x API)

def train_model(train_data):
    if 'ner' not in nlp.pipe_names:  # check whether NER is already in the pipeline
        ner = nlp.create_pipe('ner')  # create the NER pipe if it is not present
        nlp.add_pipe(ner, last=True)  # add the NER pipe at the end
    else:
        ner = nlp.get_pipe('ner')  # reuse the existing NER pipe

    for _, annotation in train_data:  # one resume at a time from the 200 training resumes
        # each entity is a tuple such as (0, 15, 'Name'); register its label
        for ent in annotation['entities']:
            ner.add_label(ent[2])

    # all other pipes except NER
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(10):
            print("Starting iteration " + str(itn))
            random.shuffle(train_data)
            losses = {}
            skipped = 0
            for text, annotations in train_data:
                try:
                    nlp.update(
                        [text],  # batch of texts
                        [annotations],  # batch of annotations
                        drop=0.2,  # dropout - make it harder to memorise data
                        sgd=optimizer,  # callable to update weights
                        losses=losses)
                except Exception:
                    skipped += 1  # count the examples spaCy rejected instead of hiding them

            print(losses, f"(skipped {skipped} examples)")
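
The training loop above wraps nlp.update in a try/except, so any example that spaCy rejects is skipped. One possible cause, stated here as an assumption and not verified against this dataset, is entity spans that overlap each other. The sketch below shows a pre-processing step that keeps the longest span in each overlapping group. The drop_overlapping and clean_training_data helpers are hypothetical.

```python
def drop_overlapping(entities):
    """Keep longer spans first and drop any span that overlaps a kept one."""
    kept = []
    # Longest spans are considered first; ties go to the earlier start
    for start, end, label in sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

def clean_training_data(data):
    """Apply drop_overlapping to every (text, annotation) example."""
    return [
        (text, {"entities": drop_overlapping(annotation["entities"])})
        for text, annotation in data
    ]

# Made-up spans: the second one lies inside the first
entities = [(0, 15, "Name"), (7, 15, "Name"), (16, 33, "Designation")]
print(drop_overlapping(entities))
# [(0, 15, 'Name'), (16, 33, 'Designation')]
```

Passing clean_training_data(train_data) to train_model instead of the raw data would be one way to try this out; whether it reduces the number of skipped examples depends on the data.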

In [None]:
# pass train data to function.

train_model(train_data)

In [None]:
# Saving our trained model to re-use.

nlp.to_disk('/content/drive/MyDrive/NER Custom Model/nlp_model')

In [None]:
# Loading our trained model

nlp_model = spacy.load('/content/drive/MyDrive/NER Custom Model/nlp_model')

In [None]:
# Checking all the custom NER created
nlp_model.get_pipe('ner').labels

In [None]:
doc = nlp_model(" ".join(alice_cv.split('\n')))
for ent in doc.ents:
  print(f'{ent.label_.upper():20} - {ent.text}')
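
For an HR workflow it may be more convenient to collect the extracted entities per label than to print them one by one. The sketch below groups (label, text) pairs into a dictionary. The group_entities helper and the sample pairs are hypothetical; with the trained model, the pairs could be built as [(ent.label_, ent.text) for ent in doc.ents].

```python
from collections import defaultdict

def group_entities(pairs):
    """Group entity texts by label, keeping each text once per label."""
    grouped = defaultdict(list)
    for label, text in pairs:
        if text not in grouped[label]:  # skip repeated mentions
            grouped[label].append(text)
    return dict(grouped)

# Made-up (label, text) pairs for illustration
pairs = [
    ("Name", "Alice Clark"),
    ("Skills", "SQL"),
    ("Skills", "Stream Analytics"),
    ("Skills", "SQL"),
]

summary = group_entities(pairs)
print(summary)
# {'Name': ['Alice Clark'], 'Skills': ['SQL', 'Stream Analytics']}
```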

# <H1>Conclusion </H1>

The custom NER model gives far better results on this CV than the built-in model, as can be seen from the output printed above.