# NER model exploration

## File load

In [1]:
import re
from markdown import markdown

with open("../data/ProjectPhoenixPlan.md", "r", encoding="utf-8") as file:
    md_text = file.read()

print(md_text[:1000])

# Project Phoenix: A Comprehensive Plan to Expand Redwood Tech Solutions

### Table of Contents
1. [Introduction](#introduction)
2. [Project Objectives](#project-objectives)
3. [Key Stakeholders](#key-stakeholders)
4. [Market Analysis](#market-analysis)
5. [Product Roadmap](#product-roadmap)
6. [Budget & Resource Allocation](#budget--resource-allocation)
7. [Timeline](#timeline)
8. [Risk Assessment & Mitigation](#risk-assessment--mitigation)
9. [Conclusion](#conclusion)

---

## Introduction

Redwood Tech Solutions (RTS) is a leading provider of artificial intelligence and data analytics platforms headquartered in **Redwood City, California**. Founded in 2012 by **Michael Lansing** and **Sasha Petrov**, the company has experienced consistent growth over the past decade. Today, RTS has over **2,000 employees** distributed across offices in **London, Tokyo, São Paulo**, and **Melbourne**. Over the last three years, market demand for advanced analytics and cloud-based AI services has soar

## SpaCy
This section uses the spaCy library to extract named entities from a Markdown document. The document is read from disk and processed with the `en_core_web_sm` pipeline to identify entities such as organizations, locations, dates, and people. The extracted entities are displayed as a list of `(text, label)` tuples.

In [2]:
import spacy
import spacy.cli

# spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")

In [3]:
# list possible entities
nlp.get_pipe("ner").labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

In [4]:
doc = nlp(md_text)
entities = [(ent.text, ent.label_) for ent in doc.ents]

### Output before preprocessing

In [5]:
entities

[('Expand Redwood Tech Solutions', 'ORG'),
 ('###', 'MONEY'),
 ('1', 'CARDINAL'),
 ('2', 'CARDINAL'),
 ('3', 'CARDINAL'),
 ('4', 'CARDINAL'),
 ('5', 'CARDINAL'),
 ('6', 'CARDINAL'),
 ('Budget & Resource', 'ORG'),
 ('7', 'CARDINAL'),
 ('8', 'CARDINAL'),
 ('Risk Assessment & Mitigation](#risk-assessment', 'ORG'),
 ('9', 'CARDINAL'),
 ('## Introduction', 'MONEY'),
 ('RTS', 'ORG'),
 ('Redwood City', 'GPE'),
 ('California', 'GPE'),
 ('2012', 'DATE'),
 ('Michael Lansing', 'PERSON'),
 ('the past decade', 'DATE'),
 ('Today', 'DATE'),
 ('RTS', 'ORG'),
 ('2,000', 'CARDINAL'),
 ('London', 'GPE'),
 ('Tokyo', 'GPE'),
 ('São Paulo', 'PERSON'),
 ('the last three years', 'DATE'),
 ('AI', 'ORG'),
 ('Redwood Tech Solutions', 'ORG'),
 ('the RTS Board of Directors', 'ORG'),
 ('Redwood Tech Solutions', 'ORG'),
 ('Amazon Web Services', 'ORG'),
 ('Microsoft', 'ORG'),
 ('International Data Corporation', 'ORG'),
 ('Project Phoenix', 'PERSON'),
 ('RTS', 'ORG'),
 ('## Project', 'MONEY'),
 ('Project Phoenix', 'LO
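
The list above contains Markdown syntax tagged as entities, for example `('###', 'MONEY')` and `('Risk Assessment & Mitigation](#risk-assessment', 'ORG')`. One option is to strip the markup before calling `nlp`. The following is a minimal regex-based sketch, not the method used in this notebook: `strip_markdown` is a hypothetical helper, and it only handles headings, inline links, bold text, horizontal rules, and numbered-list markers.

```python
import re


def strip_markdown(md: str) -> str:
    """Remove common Markdown syntax and keep the visible text (hypothetical helper)."""
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", md)  # [label](target) -> label
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"^[ \t]*(?:-{3,}|\*{3,}|_{3,})[ \t]*$", "", text, flags=re.MULTILINE)  # horizontal rules
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)  # bold markers
    text = re.sub(r"^[ \t]*\d+\.[ \t]+", "", text, flags=re.MULTILINE)  # numbered-list markers
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()


# Sample lines copied from the document printed earlier
sample = (
    "### Table of Contents\n"
    "1. [Introduction](#introduction)\n"
    "6. [Budget & Resource Allocation](#budget--resource-allocation)\n"
    "\n"
    "---\n"
    "\n"
    "## Introduction\n"
    "\n"
    "Founded in 2012 by **Michael Lansing** and **Sasha Petrov**.\n"
)

print(strip_markdown(sample))
```

The cleaned string can then be passed to `nlp` in place of `md_text`. Punctuation and capitalization are left untouched, so sentence boundaries stay intact.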

## Preprocess text

In [6]:
from spacy.language import Language
from bs4 import BeautifulSoup
from nltk.corpus import stopwords


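def preprocess(text):
    """Strip HTML tags, non-alphanumeric characters, and English stop words."""
    # Lower case the text (disabled)
    # text = text.lower()

    # Remove any HTML tags, keeping only the text content
    text = BeautifulSoup(text, "html.parser").get_text()
    # Drop everything except ASCII letters, digits, and whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)

    # Requires a one-time nltk.download("stopwords")
    stop_words = set(stopwords.words("english"))
    text = " ".join(word for word in text.split() if word not in stop_words)

    return text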

def add_entityFishing(nlp):
    nlp.add_pipe(
        "entityfishing", config={"language": "en", "api_ef_base": "http://nerd.huma-num.fr/nerd/service"}
    )
    # nlp.add_pipe("spancat")
    
    return nlp

def process_with_entityFishing(input_text, nlp):
    return nlp(preprocess(input_text))
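
The character filter in `preprocess` keeps only ASCII letters, digits, and whitespace. It therefore also removes accented letters, hyphens, and the punctuation that separates sentences and list items. The sketch below contrasts that pattern with a gentler alternative. The second pattern and both function names are assumptions made for illustration, not part of the notebook.

```python
import re

# Pattern used in preprocess(): ASCII letters, digits, and whitespace only
ASCII_ONLY = re.compile(r"[^a-zA-Z0-9\s]")

# Hypothetical alternative: keep Unicode word characters and basic punctuation
KEEP_UNICODE = re.compile(r"[^\w\s.,;:'()&-]")


def strip_ascii_only(text: str) -> str:
    return ASCII_ONLY.sub("", text)


def strip_keep_unicode(text: str) -> str:
    return KEEP_UNICODE.sub("", text)


# "\u00e3" is the precomposed character "ã" in "São Paulo"
sample = "Offices in **London**, Tokyo, S\u00e3o Paulo use AI-driven tools."

print(strip_ascii_only(sample))    # accents, commas, and hyphens are dropped
print(strip_keep_unicode(sample))  # only the Markdown asterisks are dropped
```

With the ASCII-only pattern, "São" loses its accented letter and "AI-driven" collapses into one token, and the commas that separate neighbouring names are removed.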

In [8]:
nlp = spacy.load("en_core_web_sm")
nlp = add_entityFishing(nlp)

In [9]:
# spancat = nlp.get_pipe("spancat")
# spancat.cfg["threshold"] = 0.99

In [11]:
doc = process_with_entityFishing(md_text, nlp)
entities = [(ent.text, ent.label_) for ent in doc.ents]

In [12]:
entities

[('Project Phoenix A Comprehensive Plan Expand Redwood Tech Solutions Table',
  'ORG'),
 ('5', 'CARDINAL'),
 ('Redwood City California Founded', 'ORG'),
 ('Michael Lansing Sasha Petrov', 'PERSON'),
 ('past decade', 'DATE'),
 ('London', 'GPE'),
 ('Tokyo', 'GPE'),
 ('Paulo Melbourne', 'PERSON'),
 ('last three years', 'DATE'),
 ('AI', 'ORG'),
 ('Redwood Tech Solutions', 'ORG'),
 ('RTS Board Directors', 'ORG'),
 ('Project Phoenix This', 'PRODUCT'),
 ('Redwood Tech Solutions', 'ORG'),
 ('Amazon Web Services AWS', 'ORG'),
 ('International Data Corporation', 'ORG'),
 ('Project Phoenix RTS', 'ORG'),
 ('AIdriven', 'CARDINAL'),
 ('Project Objectives', 'ORG'),
 ('Project Phoenix', 'ORG'),
 ('Redwood Tech Solutions', 'ORG'),
 ('AI', 'ORG'),
 ('1', 'CARDINAL'),
 ('2 Infrastructure Modernization Upgrading', 'ORG'),
 ('3', 'CARDINAL'),
 ('Talent Acquisition Development Hiring', 'ORG'),
 ('AI', 'ORG'),
 ('4', 'CARDINAL'),
 ('Partnerships Alliances Establishing', 'ORG'),
 ('five', 'CARDINAL'),
 ('Redwo
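
To see what preprocessing changed, the two entity lists can be compared as sets. The sketch below uses a small subset of tuples copied by hand from the two outputs in this notebook; `diff_entities` is a hypothetical helper.

```python
def diff_entities(before, after):
    """Split (text, label) pairs into those kept, lost, and newly introduced."""
    before_set, after_set = set(before), set(after)
    return {
        "kept": sorted(before_set & after_set),
        "lost": sorted(before_set - after_set),
        "new": sorted(after_set - before_set),
    }


# Subset of the output before preprocessing
before = [
    ("Redwood City", "GPE"),
    ("California", "GPE"),
    ("Michael Lansing", "PERSON"),
    ("London", "GPE"),
    ("Tokyo", "GPE"),
    ("AI", "ORG"),
]

# Subset of the output after preprocessing
after = [
    ("Redwood City California Founded", "ORG"),
    ("Michael Lansing Sasha Petrov", "PERSON"),
    ("London", "GPE"),
    ("Tokyo", "GPE"),
    ("AI", "ORG"),
]

result = diff_entities(before, after)
for key, pairs in result.items():
    print(key, pairs)
```

In this subset, the entities that disappear are the ones that were merged with neighbouring words once punctuation and stop words were removed.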

In [13]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'entityfishing']

In [15]:
# Convert Markdown to Plain Text
# plain_text = re.sub(r"<[^>]+>", "", markdown(md_text))  # Strip HTML tags

In [16]:
# print(plain_text[:1000])

## HuggingFace NER on BERT

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

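# Use a separate name so the spaCy `nlp` object defined earlier is not overwritten
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)
example = md_text

ner_results = ner_pipeline(example)
print(ner_results)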

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


[{'entity': 'I-ORG', 'score': 0.5630673, 'index': 3, 'word': 'Phoenix', 'start': 10, 'end': 17}, {'entity': 'B-ORG', 'score': 0.9984162, 'index': 12, 'word': 'Red', 'start': 50, 'end': 53}, {'entity': 'I-ORG', 'score': 0.99888057, 'index': 13, 'word': '##wood', 'start': 53, 'end': 57}, {'entity': 'I-ORG', 'score': 0.9992242, 'index': 14, 'word': 'Tech', 'start': 58, 'end': 62}, {'entity': 'I-ORG', 'score': 0.9987562, 'index': 15, 'word': 'Solutions', 'start': 63, 'end': 72}, {'entity': 'B-ORG', 'score': 0.99760616, 'index': 155, 'word': 'Red', 'start': 498, 'end': 501}, {'entity': 'I-ORG', 'score': 0.99848217, 'index': 156, 'word': '##wood', 'start': 501, 'end': 505}, {'entity': 'I-ORG', 'score': 0.99928457, 'index': 157, 'word': 'Tech', 'start': 506, 'end': 510}, {'entity': 'I-ORG', 'score': 0.9991239, 'index': 158, 'word': 'Solutions', 'start': 511, 'end': 520}, {'entity': 'B-ORG', 'score': 0.99591416, 'index': 160, 'word': 'R', 'start': 522, 'end': 523}, {'entity': 'I-ORG', 'score':

In [19]:
ner_results

[{'entity': 'I-ORG',
  'score': 0.5630673,
  'index': 3,
  'word': 'Phoenix',
  'start': 10,
  'end': 17},
 {'entity': 'B-ORG',
  'score': 0.9984162,
  'index': 12,
  'word': 'Red',
  'start': 50,
  'end': 53},
 {'entity': 'I-ORG',
  'score': 0.99888057,
  'index': 13,
  'word': '##wood',
  'start': 53,
  'end': 57},
 {'entity': 'I-ORG',
  'score': 0.9992242,
  'index': 14,
  'word': 'Tech',
  'start': 58,
  'end': 62},
 {'entity': 'I-ORG',
  'score': 0.9987562,
  'index': 15,
  'word': 'Solutions',
  'start': 63,
  'end': 72},
 {'entity': 'B-ORG',
  'score': 0.99760616,
  'index': 155,
  'word': 'Red',
  'start': 498,
  'end': 501},
 {'entity': 'I-ORG',
  'score': 0.99848217,
  'index': 156,
  'word': '##wood',
  'start': 501,
  'end': 505},
 {'entity': 'I-ORG',
  'score': 0.99928457,
  'index': 157,
  'word': 'Tech',
  'start': 506,
  'end': 510},
 {'entity': 'I-ORG',
  'score': 0.9991239,
  'index': 158,
  'word': 'Solutions',
  'start': 511,
  'end': 520},
 {'entity': 'B-ORG',
  's

In [20]:
import pandas as pd

pd.DataFrame(ner_results)

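|   | entity | score | index | word | start | end |
|---|--------|-------|-------|------|-------|-----|
| 0 | I-ORG | 0.563067 | 3 | Phoenix | 10 | 17 |
| 1 | B-ORG | 0.998416 | 12 | Red | 50 | 53 |
| 2 | I-ORG | 0.998881 | 13 | ##wood | 53 | 57 |
| 3 | I-ORG | 0.999224 | 14 | Tech | 58 | 62 |
| 4 | I-ORG | 0.998756 | 15 | Solutions | 63 | 72 |
| 5 | B-ORG | 0.997606 | 155 | Red | 498 | 501 |
| 6 | I-ORG | 0.998482 | 156 | ##wood | 501 | 505 |
| 7 | I-ORG | 0.999285 | 157 | Tech | 506 | 510 |
| 8 | I-ORG | 0.999124 | 158 | Solutions | 511 | 520 |
| 9 | B-ORG | 0.995914 | 160 | R | 522 | 523 |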
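
The pipeline returns one row per word piece (`Red`, `##wood`, `Tech`, `Solutions`), not one row per entity. The `transformers` pipeline can group these itself when created with `aggregation_strategy="simple"`. The sketch below shows the same idea by hand, using the character offsets from the first five rows above. `merge_wordpieces` is a hypothetical helper, and the rule that pieces at most one character apart belong together is an assumption.

```python
def merge_wordpieces(tokens, text):
    """Group consecutive B-/I- tagged word pieces into entity spans."""
    spans = []
    for tok in tokens:
        prefix, label = tok["entity"].split("-", 1)
        prev = spans[-1] if spans else None
        # Continue the previous span only for an I- tag with the same label
        # that starts right after it (directly adjacent or one space apart).
        if (
            prev is not None
            and prefix == "I"
            and prev["label"] == label
            and tok["start"] - prev["end"] <= 1
        ):
            prev["end"] = tok["end"]
            prev["score"] = min(prev["score"], tok["score"])
        else:
            spans.append(
                {"label": label, "start": tok["start"], "end": tok["end"], "score": tok["score"]}
            )
    # Recover the surface text from the character offsets
    return [{**span, "text": text[span["start"]:span["end"]]} for span in spans]


# First line of the document and the first five rows of the output above
text = "# Project Phoenix: A Comprehensive Plan to Expand Redwood Tech Solutions"
tokens = [
    {"entity": "I-ORG", "score": 0.5630673, "word": "Phoenix", "start": 10, "end": 17},
    {"entity": "B-ORG", "score": 0.9984162, "word": "Red", "start": 50, "end": 53},
    {"entity": "I-ORG", "score": 0.99888057, "word": "##wood", "start": 53, "end": 57},
    {"entity": "I-ORG", "score": 0.9992242, "word": "Tech", "start": 58, "end": 62},
    {"entity": "I-ORG", "score": 0.9987562, "word": "Solutions", "start": 63, "end": 72},
]

merged = merge_wordpieces(tokens, text)
for span in merged:
    print(span["text"], span["label"], round(span["score"], 4))
```

Each merged span keeps the lowest score among its pieces, which is a conservative choice. Averaging the scores would be an equally reasonable alternative.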
