<a href="https://colab.research.google.com/github/chrischibueze/Name-entity-recognition-The-Data-City/blob/main/NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Named entity recognition (NER) is a natural language processing technique for identifying and classifying named entities in text. Named entities are typically people, organizations, locations, dates, times, and quantities.

A NER model's ability to recognize named entities depends on the data it was trained on, so entity types or writing styles missing from that data are likely to be missed or mislabeled.

# Example 1




In [45]:
import spacy

In [46]:
nlp = spacy.load("en_core_web_sm")

In [47]:
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.HOuse address 1-3 Newdigate street nottingham NG7 4FD")
doc = nlp(text)

In [48]:
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him', 'I', 'you', 'very senior CEOs', 'major American car companies', 'my hand', 'I', 'Thrun', 'an interview', 'Recode', 'HOuse address']
Verbs: ['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'talk', 'say']


# Example 2

In [40]:
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE
HOuse ORG
1 CARDINAL
Newdigate ORG
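
The small model tagged pieces of the address line as `ORG` and `CARDINAL` and did not pick up the postcode at all. One possible complement, sketched here as an assumption and not as part of the original notebook, is a rule-based pass for strings with a fixed shape. The regular expression is a simplified UK postcode pattern (it does not validate against the official list of postcode areas), and `find_postcodes` is a hypothetical helper name.

```python
import re

# Simplified UK postcode shape: 1-2 letters, a digit, an optional digit or letter,
# an optional space, then a digit and two letters (e.g. "NG7 4FD").
# Illustrative only; this is not a full validator.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE)

def find_postcodes(text):
    """Return (postcode, start_char, end_char) tuples found in text."""
    return [(m.group().upper(), m.start(), m.end()) for m in POSTCODE_RE.finditer(text)]

address = "HOuse address 1-3 Newdigate street nottingham NG7 4FD"
postcodes = find_postcodes(address)
print(postcodes)
```

The character offsets have the same meaning as `ent.start_char` and `ent.end_char`, so matches like this could later be merged with the model's entities.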


# Example 3

In [36]:
from spacy.tokens import Doc

words = ["hello", "world", "!"]
spaces = [True, True, True]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc

hello world ! 

# Example 4

In [41]:
from spacy.tokenizer import Tokenizer

In [42]:
import re

def split_joined_words(text):
    # Define a regular expression pattern to split words
    pattern = r'([a-z])([A-Z])'
    return re.sub(pattern, r'\1 \2', text)

# Example input: a camelCase string with no spaces
joined_text = "theDataCity"
split_text = split_joined_words(joined_text)

print(split_text)  # Output: "the Data City"


the Data City
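
The pattern above only inserts a space at a lowercase-to-uppercase boundary, so a run of capitals followed by a word (as in `HTMLParser`) stays joined. A minimal sketch of one way to extend it is shown here; `split_camel_case` is a hypothetical name and the second rule is an assumption about how acronyms should be treated.

```python
import re

def split_camel_case(text):
    """Insert spaces at lower->Upper boundaries and between an acronym and a following word."""
    # "theData" -> "the Data"
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    # "HTMLParser" -> "HTML Parser": split before the last capital of a run
    # when that capital starts a lowercase word.
    text = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", text)
    return text

for sample in ["theDataCity", "HTMLParser", "parseHTTPResponse"]:
    print(sample, "->", split_camel_case(sample))
```

Neither rule helps with all-lowercase input such as `thedatacity`, because that string has no case boundary to split on.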


# Example 7

In [None]:
import requests

# Fetch the content from the URL
url = "http://thedatacity.com"
response = requests.get(url)
content = response.text

# Process the content using spaCy
doc = nlp(content)


In [None]:
# Extract named entities
named_entities = []
for ent in doc.ents:
    named_entities.append((ent.text, ent.label_))

In [None]:
# Print the extracted named entities
for text, label in named_entities:
    print(f"Text: {text}, Label: {label}")

Text: GB, Label: GPE
Text: href="https://thedatacity.com, Label: ORG
Text: max, Label: PERSON
Text: Data City, Label: GPE
Text: UK&#039;s, Label: NORP
Text: sectors &amp, Label: ORG
Text: equiv="X-UA-Compatible, Label: ORG
Text: href="https://thedatacity.com, Label: ORG
Text: href="https://thedatacity.com, Label: ORG
Text: max, Label: PERSON
Text: max-snippet:-1, Label: PERSON
Text: max, Label: PERSON
Text: dataLayer, Label: PERSON
Text: UK, Label: GPE
Text: cluster &amp, Label: ORG
Text: 5, Label: CARDINAL
Text: 350, Label: CARDINAL
Text: Data City, Label: GPE
Text: UK&#039;s, Label: NORP
Text: sectors &, Label: ORG
Text: UK, Label: GPE
Text: cluster &amp, Label: ORG
Text: 5, Label: CARDINAL
Text: 350, Label: CARDINAL
Text: Data City, Label: GPE
Text: UK, Label: GPE
Text: sectors & companies","isPartOf":{"@id":"https://thedatacity.com/#website"},"datePublished":"2021-05-05T10:18:53, Label: ORG
Text: UK, Label: GPE
Text: cluster & company, Label: ORG
Text: The Data City's, Label: GPE
T
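
The output above repeats the same entities many times and includes markup fragments (`href="...`, `&amp`) because the raw HTML was passed to the model. A quick triage step, sketched here under assumptions, is to drop entity texts that contain markup punctuation and count the rest. The sample pairs are copied from the output above; `looks_like_markup` and `MARKUP_CHARS` are hypothetical names and the character set is only a heuristic.

```python
from collections import Counter

# A few (text, label) pairs copied from the output above.
named_entities = [
    ("GB", "GPE"),
    ('href="https://thedatacity.com', "ORG"),
    ("max", "PERSON"),
    ("Data City", "GPE"),
    ("sectors &amp", "ORG"),
    ('href="https://thedatacity.com', "ORG"),
    ("UK", "GPE"),
    ("UK", "GPE"),
    ("Data City", "GPE"),
]

# Heuristic assumption: these characters rarely appear in real entity names.
MARKUP_CHARS = set('<>="{}&;')

def looks_like_markup(text):
    """Return True if the entity text contains HTML/JSON punctuation."""
    return any(ch in MARKUP_CHARS for ch in text)

counts = Counter(pair for pair in named_entities if not looks_like_markup(pair[0]))
for (text, label), n in counts.most_common():
    print(f"{text} ({label}): {n}")
```

False positives such as `max` tagged as `PERSON` survive this filter, so it is a way to inspect the results and not a substitute for cleaning the input text.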

# Example 8

In [None]:
import requests
import spacy
from bs4 import BeautifulSoup

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Fetch the HTML content from the URL
url = "http://thedatacity.com"
response = requests.get(url)
html_content = response.text

# Parse the HTML content to extract the text
soup = BeautifulSoup(html_content, "html.parser")
text = soup.get_text()

# Process the text with spaCy NER
doc = nlp(text)

# Extract named entities
named_entities = []
for ent in doc.ents:
    named_entities.append((ent.text, ent.label_))

# Print the named entities
for entity, label in named_entities:
    print(f"Entity: {entity}, Label: {label}")

Entity: Data City, Label: GPE
Entity: UK, Label: GPE
Entity: GOVERNMENTPE, Label: ORG
Entity: VC & INVESTMENTB2B COMPANIESPlatformDirectoryGlobal PlatformAccreditationLocal Government Package, Label: ORG
Entity: 350, Label: CARDINAL
Entity: Real-Time Industrial Classifications &, Label: ORG
Entity: Featured RTICsAdvanced ManufacturingAgriTechArtificial IntelligenceCryptocurrency EconomyImmersive, Label: PERSON
Entity: ZeroQuantum, Label: ORG
Entity: SIC, Label: ORG
Entity: Role Opening: Business Development ExecutiveData Explorer Release NotesUncovering Life Sciences’ Innovation CommunitiesReviewing the Space Economy’s networkEmbracing Innovation and Empowering Teams The UK, Label: WORK_OF_ART
Entity: Top Artificial Intelligence, Label: ORG
Entity: Skills, Label: ORG
Entity: Midlands Engine, Label: PERSON
Entity: CityDiscover, Label: PRODUCT
Entity: UK, Label: GPE
Entity: UK, Label: GPE
Entity: over 5 million, Label: CARDINAL
Entity: 350, Label: CARDINAL
Entity: SIC, Label: ORG
Entity:
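
Several entities above, such as `GOVERNMENTPE` and `CityDiscover`, are adjacent menu items run together, because `get_text()` called without arguments joins text nodes with no separator. With BeautifulSoup, `soup.get_text(separator=" ", strip=True)` is one option. The sketch below shows the same idea using only the standard library so that it runs without extra dependencies; `TextExtractor` and `html_to_text` are hypothetical names and the HTML snippet is made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text nodes, skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; and similar
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)

def html_to_text(html):
    """Return the visible text of an HTML string, one space between text nodes."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Made-up snippet imitating a navigation menu followed by a script tag.
sample_html = (
    "<ul><li>GOVERNMENT</li><li>VC &amp; INVESTMENT</li></ul>"
    "<script>var dataLayer = [];</script>"
    "<p>The Data City</p>"
)
clean_text = html_to_text(sample_html)
print(clean_text)
```

Separating text nodes this way gives the model word boundaries it can use, which the concatenated menu text above lacks.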

# Example 9

In [None]:
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Define a custom tokenization rule to split combined words
def custom_tokenizer(nlp):
    infixes = nlp.Defaults.infixes + [r"(?<=thedata)(?=city)"]
    infix_re = spacy.util.compile_infix_regex(infixes)
    return spacy.tokenizer.Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

# Set the custom tokenizer
nlp.tokenizer = custom_tokenizer(nlp)

# Text to be tokenized
text = "thedatacity.com is a great company."

# Process the text using the customized tokenizer
doc = nlp(text)

# Print the tokens
for token in doc:
    print(token.text)


thedata
city.com
is
a
great
company.
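
The infix rule above is hard-coded for one string and leaves `thedata` unsplit. A more general approach, sketched here as an assumption and not taken from the notebook, is dictionary-based segmentation: use dynamic programming to find a split of the string into known words. `segment` is a hypothetical helper and `VOCAB` is a toy vocabulary; a real run would need a much larger word list.

```python
def segment(text, vocab):
    """Split text into words from vocab, or return None if no full split exists.

    best[i] holds a segmentation of text[:i] with the fewest words found so far.
    """
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]

# Hypothetical toy vocabulary; a real one would come from a word list.
VOCAB = {"the", "data", "city", "a", "at", "link", "linked", "in", "you", "tube"}

for name in ["thedatacity", "linkedin", "youtube", "xyzzy"]:
    print(name, "->", segment(name, VOCAB))
```

Preferring the split with the fewest words is a simple tie-breaker; a frequency-weighted score would be a natural refinement.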


# Example 10

In [None]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Dataset of URLs
dataset = [
    "https://www.linkedin.com",
    "https://thedatacity.com",
    "https://www.youtube.com"
]

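# Attempt to split URL parts using NER.
# Note: en_core_web_sm has no "URL" entity label and the tokenizer keeps each URL
# as a single token, so the URLs come back unsplit (see the output below).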
def split_url_parts(url):
    doc = nlp(url)
    split_parts = []
    current_part = ""

    for token in doc:
        if token.ent_type_ == "URL" or token.is_punct:
            if current_part:
                split_parts.append(current_part)
            current_part = ""
        else:
            current_part += token.text + " "

    if current_part:
        split_parts.append(current_part)

    return split_parts

# Process and print split URL parts
for url in dataset:
    split_parts = split_url_parts(url)
    print(f"Original URL: {url}")
    print(f"Split URL Parts: {split_parts}")
    print()


Original URL: https://www.linkedin.com
Split URL Parts: ['https://www.linkedin.com ']

Original URL: https://thedatacity.com
Split URL Parts: ['https://thedatacity.com ']

Original URL: https://www.youtube.com
Split URL Parts: ['https://www.youtube.com ']
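
Each URL comes back unchanged because the loop never finds a split point. For URLs specifically, the structure can be recovered deterministically without a model. The sketch below swaps NER for the standard library's `urllib.parse.urlsplit`; `split_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Return the scheme and the dot-separated host labels of a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {"scheme": parts.scheme, "host_labels": host.split(".") if host else []}

dataset = [
    "https://www.linkedin.com",
    "https://thedatacity.com",
    "https://www.youtube.com",
]

for url in dataset:
    print(url, "->", split_url(url))
```

This isolates the label of interest (for example `thedatacity`), which can then be passed to a word-splitting step.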



# Example 12

In [None]:
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/ner-english")

# make example sentence
sentence = Sentence("George Washington wenttoWashington")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

2023-08-17 12:51:28,062 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
Sentence[3]: "George Washington wenttoWashington" → ["George Washington"/PER]
The following NER tags are found:
Span[0:2]: "George Washington" → PER (0.9879)


# Example

In [None]:
import re
import spacy
from spacy.tokenizer import Tokenizer

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r'''^[\[\("']''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, rules=special_cases,
                                prefix_search=prefix_re.search,
                                infix_finditer=infix_re.finditer,
                                url_match=simple_url_re.match)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("hello-world. :)")
print([t.text for t in doc]) # ['hello', '-', 'world.', ':)']

['hello', '-', 'world.', ':)']


# Example

In [68]:
import spacy
import en_core_web_sm

# Two equivalent ways to load the same pipeline; only one is needed.
nlp = spacy.load("en_core_web_sm")
nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

[('This', 'PRON'), ('is', 'AUX'), ('a', 'DET'), ('sentence', 'NOUN'), ('.', 'PUNCT')]


# Example

In [69]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dialogflow, previously known as api.ai, is a chatbot framework provided by Google. Google acquired API.AI in 2016.")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Google 75 81 ORG
Google 83 89 ORG
2016 109 113 DATE


# Steps I plan to take in the project

In [2]:
# I am going to use spaCy for this process.
# spaCy is an open-source library for advanced natural language processing in Python.
# spaCy provides an efficient statistical system for NER in Python that assigns labels to contiguous groups of tokens. Its default model recognizes a wide range of named and numerical entities, including person, organization, language, and event.
# Beyond these default entities, spaCy also lets us add arbitrary classes to the NER model by training it on new labeled examples.


# 1. I am going to use the redictLeads_enriched_sample_report_Dataset to train the model.

# 2. Load the dataset.

# 3. Convert the dataset to a JSON file, then use it to train the model.

# 4. Train the spaCy model to recognize the custom entities present in our dataset.

# a. Load the model
# b. Add the new entity label
# c. Loop over the training examples and update the model
# d. Save the trained model

# 5. Test
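
Steps 2 and 3 could look roughly like the sketch below. The notebook does not show the dataset's columns, so the records, the field names `text` and `company`, the `COMPANY` label, and the helper `to_training_example` are all hypothetical. The target layout mirrors the text-plus-character-offset shape commonly used for spaCy NER training examples.

```python
import json

# Hypothetical records; the real dataset's fields are not shown in the notebook.
records = [
    {"text": "Example Corp hired ten engineers.", "company": "Example Corp"},
    {"text": "Google acquired API.AI in 2016.", "company": "Google"},
    {"text": "No company is named here.", "company": "Missing Ltd"},
]

def to_training_example(record, label="COMPANY"):
    """Build {"text": ..., "entities": [[start, end, label]]} from one record."""
    text, name = record["text"], record["company"]
    start = text.find(name)
    if start == -1:
        return None  # entity string not present in the text; skip the record
    return {"text": text, "entities": [[start, start + len(name), label]]}

examples = [ex for ex in map(to_training_example, records) if ex is not None]

# Serialize to JSON (step 3); json.dump would write the same data to a file.
training_json = json.dumps(examples, indent=2)
print(training_json)
```

Using `str.find` only annotates the first occurrence of each name, which keeps the sketch short; real data with repeated or overlapping mentions would need a more careful matcher.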