# Named Entity Recognition

---

## Introduction
Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task: locating and categorising entities (such as the names of people, organisations, and places, as well as dates) in text. In this project, we fine-tune a BERT (Bidirectional Encoder Representations from Transformers) model for NER using the "simpletransformers" library, which is built on Hugging Face Transformers. The objective is a model that correctly identifies and classifies the entity types found in a given text.

# Section A: Data Preparation

First, we install the libraries that are not preinstalled in Google Colab, using the !pip install command. These are "simpletransformers", which provides the Named Entity Recognition (NER) model, and "gradio", which is used to build an application-like interface. With "simpletransformers" we can fine-tune the NER model and use it to identify named entities in text. With "gradio" we can build an interactive interface where users enter text, have it processed by the model, and see the predicted entity tags.


In [1]:
!pip install simpletransformers
!pip install gradio

Collecting simpletransformers
  Downloading simpletransformers-0.64.3-py3-none-any.whl (250 kB)
Collecting transformers>=4.31.0 (from simpletransformers)
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting datasets (from simpletransformers)
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting tokenizers (from simpletransformers)

In [2]:
# Import the required modules
import pandas as pd
import gradio as gr
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from simpletransformers.ner import NERModel, NERArgs

In [3]:
# Mount Google Drive so the dataset can be loaded into a pandas DataFrame
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
path = '/content/drive/MyDrive/Dataset/ner_dataset.csv'
data = pd.read_csv(path, encoding="latin1")
data.head(15)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


In [5]:
# Forward-fill the NaN values in the sentence column so that every token row carries its sentence id
data = data.ffill()
data.rename(columns={"Sentence #": "sentence_id", "Word": "words", "Tag": "labels"}, inplace=True)
data.head(15)

Unnamed: 0,sentence_id,words,POS,labels
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O


# Section B: Training and testing
The training script involves the following steps:

1. Loading the dataset and preprocessing it for the NER task.
2. Fine-tuning the BERT model on the NER task using Hugging Face's Transformers library.
3. Training the model using GPU hardware acceleration to expedite the training process.
4. Evaluating the trained model's performance using standard NER metrics like precision, recall, and F1-score.
5. Saving the fine-tuned model and associated metadata for later use.
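To make the evaluation step concrete, here is a minimal sketch of entity-level precision, recall, and F1 over B-/I-/O tag sequences. It is a simplified illustration only, not the implementation used by "seqeval" or "simpletransformers"; the function names `extract_entities` and `entity_prf` are hypothetical, and the sketch assumes an entity counts as correct only when its type and both boundaries match exactly.

```python
def extract_entities(tags):
    """Return the set of (entity_type, start, end) spans in one tag sequence (end is exclusive)."""
    spans = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        # The open entity ends on "O", on a new "B-", or when the type changes.
        if start is not None and (prefix != "I" or label != ent_type):
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        # Assumption: a stray "I-" with no open entity is treated leniently as a new entity.
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, label
    return spans


def entity_prf(gold_sentences, pred_sentences):
    """Compute entity-level precision, recall, and F1 over parallel lists of tag sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_entities(gold), extract_entities(pred)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the prediction finds the GEO entity but misses the PER entity.
gold = [["O", "B-GEO", "I-GEO", "O", "B-PER"]]
pred = [["O", "B-GEO", "I-GEO", "O", "O"]]
precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, round(f1, 3))  # precision 1.0, recall 0.5, F1 0.667
```

Scoring whole entities rather than individual tokens matters here because most tokens are tagged O, so token-level accuracy can look high even when few entities are found.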


In [6]:
# Split the dataset into training (x_train, y_train) and testing (x_test, y_test) sets
data["labels"] = data["labels"].str.upper()
X = data[["sentence_id", "words"]]
Y = data["labels"]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)


# Build the train and test DataFrames
train_data = pd.DataFrame({"sentence_id": x_train["sentence_id"], "words": x_train["words"], "labels": y_train})
test_data = pd.DataFrame({"sentence_id": x_test["sentence_id"], "words": x_test["words"], "labels": y_test})
train_data

Unnamed: 0,sentence_id,words,labels
976868,Sentence: 44670,the,O
137519,Sentence: 6269,to,O
397470,Sentence: 18159,.,O
347595,Sentence: 15900,for,O
653928,Sentence: 29885,says,O
...,...,...,...
62046,Sentence: 2806,streets,O
572925,Sentence: 26196,sectors,O
661496,Sentence: 30223,from,O
108966,Sentence: 4961,with,O
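The output above interleaves rows from many different sentences, because `train_test_split` shuffles individual token rows by default, so the tokens of one sentence can end up in both the training and test sets. One possible alternative, sketched below, is to split on sentence ids so that each sentence stays whole. The helper name `split_sentence_ids`, the seed, and the toy ids are hypothetical and were not part of the original notebook.

```python
import random


def split_sentence_ids(sentence_ids, test_size=0.2, seed=42):
    """Split the unique sentence ids into disjoint train and test sets."""
    unique_ids = sorted(set(sentence_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_size))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids


# Toy token rows: one sentence id per token, as in the forward-filled table.
ids = (["Sentence: 1"] * 3 + ["Sentence: 2"] * 2 + ["Sentence: 3"] * 4
       + ["Sentence: 4"] * 2 + ["Sentence: 5"])
train_ids, test_ids = split_sentence_ids(ids)
print(len(train_ids), len(test_ids))  # 4 1
```

With the DataFrame from this notebook, the rows for each split could then be selected with `data[data["sentence_id"].isin(train_ids)]` and `data[data["sentence_id"].isin(test_ids)]`.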


In [7]:
label = data["labels"].unique().tolist()
label

['O',
 'B-GEO',
 'B-GPE',
 'B-PER',
 'I-GEO',
 'B-ORG',
 'I-ORG',
 'B-TIM',
 'B-ART',
 'I-ART',
 'I-PER',
 'I-GPE',
 'I-TIM',
 'B-NAT',
 'B-EVE',
 'I-EVE',
 'I-NAT']

In [8]:
# Set the training arguments: number of epochs, learning rate, batch sizes, etc.
args = NERArgs()
args.num_train_epochs = 1
args.learning_rate = 1e-4
args.overwrite_output_dir = True
args.train_batch_size = 32
args.eval_batch_size = 32

# Create the model with NERModel from the simpletransformers module
model = NERModel('bert', 'bert-base-cased', labels=label, args=args)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [9]:
# Train the model
model.train_model(train_data, eval_data=test_data, acc=accuracy_score)

(1499, 0.19113378726030447)

In [10]:
# Evaluate on the test set with model.eval_model()
result, model_outputs, preds_list = model.eval_model(test_data)
result

# Predict on a sample sentence
prediction, model_output = model.predict(["What is the new name of Mumbai"])
prediction

[[{'What': 'O'},
  {'is': 'O'},
  {'the': 'O'},
  {'new': 'O'},
  {'name': 'O'},
  {'of': 'O'},
  {'Mumbai': 'B-GEO'}]]
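Each prediction is a list of single-key dictionaries mapping a word to its tag. As a minimal sketch, the helper below groups consecutive B-/I- tagged words into (entity text, entity type) pairs; the name `prediction_to_entities` is hypothetical, and the second example input is invented for illustration rather than produced by the model.

```python
def prediction_to_entities(sentence_prediction):
    """Group a list of {word: tag} dicts into (entity_text, entity_type) pairs."""
    entities = []
    words, ent_type = [], None
    for item in sentence_prediction:
        word, tag = next(iter(item.items()))  # each dict holds exactly one word/tag pair
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == ent_type
        # Close the open entity when the current token does not continue it.
        if words and not continues:
            entities.append((" ".join(words), ent_type))
            words, ent_type = [], None
        if prefix in ("B", "I"):
            if not words:
                ent_type = label
            words.append(word)
    if words:
        entities.append((" ".join(words), ent_type))
    return entities


sample = [{'What': 'O'}, {'is': 'O'}, {'the': 'O'}, {'new': 'O'},
          {'name': 'O'}, {'of': 'O'}, {'Mumbai': 'B-GEO'}]
print(prediction_to_entities(sample))  # [('Mumbai', 'GEO')]

# Invented multi-word example to show how B- and I- tags are merged.
multi = [{'New': 'B-GEO'}, {'York': 'I-GEO'}, {'is': 'O'}, {'big': 'O'}]
print(prediction_to_entities(multi))  # [('New York', 'GEO')]
```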

In [17]:
prompt = "My name is Hassan"
def predict_ner(prompt):
    predictions, _ = model.predict([prompt])
    return predictions[0]
predict_ner(prompt)

[{'My': 'O'}, {'name': 'O'}, {'is': 'O'}, {'Hassan': 'I-PER'}]
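In the output above, "Hassan" is tagged I-PER with no preceding B-PER. Under the usual reading of the B-/I- prefixes in the label list, an I- tag continues an entity that a B- tag opened, so this sequence is not well formed. A minimal sketch of one possible post-processing step, assuming that convention, rewrites such stray I- tags as B- tags. The helper `repair_bio` is hypothetical and is not part of "simpletransformers".

```python
def repair_bio(tags):
    """Rewrite an I- tag as B- when it does not continue an entity of the same type."""
    repaired = []
    prev_type = None  # entity type of the previous token, or None after an "O" tag
    for tag in tags:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label != prev_type:
            tag = "B-" + label
        repaired.append(tag)
        prev_type = label if prefix in ("B", "I") else None
    return repaired


# Tags from the prediction for "My name is Hassan".
print(repair_bio(["O", "O", "O", "I-PER"]))  # ['O', 'O', 'O', 'B-PER']
```

This only normalises the tag sequence; it does not change which tokens the model considers part of an entity.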

# Section C: Gradio interface and linkage
We will create a simple interactive demo application (interface) using Gradio, a user-friendly library for building UIs that interact with machine learning models. The demo app lets users enter text and see the NER predictions made by our fine-tuned BERT model: each word is shown with the entity tag the model assigns to it.

The Gradio demo app serves as a user-friendly interface to showcase the capabilities of our NER model. Users can enter text samples, view the model's predictions, and get a better understanding of how the model performs on real-world examples.

In [13]:
# Define predict_ner(), which returns the per-word tags for a single prompt
def predict_ner(prompt):
    predictions, _ = model.predict([prompt])
    return predictions[0]

# Create the Gradio interface, with a textbox for the input and a textbox for the output
iface = gr.Interface(
    fn=predict_ner,
    inputs=gr.Textbox(),
    outputs=gr.Textbox(),
    live=True,
    title="Named Entity Recognition Demo",
    description="Enter a prompt and see named entities highlighted."
)

iface.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://a2edb8d9688f44e234.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
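The interface above writes the raw list of word/tag dictionaries to a textbox. To actually highlight entities, one option is to convert each prediction into (word, label) pairs. The sketch below does only that conversion in plain Python; the helper name `to_highlight_pairs` is hypothetical. Using its result with Gradio's `gr.HighlightedText` output component, which accepts a list of (text, label) pairs, is an assumption that was not run as part of this notebook.

```python
def to_highlight_pairs(sentence_prediction):
    """Convert [{word: tag}, ...] into [(word, entity_type_or_None), ...]."""
    pairs = []
    for item in sentence_prediction:
        word, tag = next(iter(item.items()))  # each dict holds exactly one word/tag pair
        # Strip the B-/I- prefix; words tagged "O" get None so they are not highlighted.
        label = tag.split("-", 1)[1] if "-" in tag else None
        pairs.append((word, label))
    return pairs


sample = [{'What': 'O'}, {'is': 'O'}, {'the': 'O'}, {'new': 'O'},
          {'name': 'O'}, {'of': 'O'}, {'Mumbai': 'B-GEO'}]
print(to_highlight_pairs(sample)[-2:])  # [('of', None), ('Mumbai', 'GEO')]
```

In the demo, `predict_ner` could return `to_highlight_pairs(predictions[0])` and the interface could use `outputs=gr.HighlightedText()` in place of the textbox, subject to the assumption noted above.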


