
In [5]:
import spacy

In [6]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.5.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


## **Word Vectors**
Word vectors, or word embeddings, are numerical representations of words: each word is mapped to a point in a multidimensional space, and the vectors for a whole vocabulary are stored as the rows of a matrix. Their purpose is to give a computer a workable notion of what a word means. Computers cannot interpret raw text directly, but they process numbers quickly and well, so each word must first be converted into numbers.
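To make the idea concrete, here is a minimal sketch of how similarity between word vectors can be measured with cosine similarity. The three-dimensional vectors and the `toy_vectors` table are invented for illustration; vectors in a real model have far more dimensions and are learned from data.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

Vectors that point in nearly the same direction score close to 1, so in this toy table "cat" ends up closer to "dog" than to "car".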

In [7]:
nlp = spacy.load('en_core_web_md')

In [8]:
with open("/content/Rise of AI.txt", "r") as f:
  text = f.read()

In [9]:
doc = nlp(text)
sentence1 = list(doc.sents)[0]
print(sentence1)

The Rise of Artificial Intelligence: Revolutionizing Industries and Transforming Society\

Introduction
Artificial Intelligence (AI) has emerged as one of the most transformative and revolutionary technologies of the 21st century.


In [10]:
my_word = "learning"

In [11]:
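# Find the words whose vectors are closest to the vector for my_word.
# These are distributionally similar words, not necessarily synonyms.
import numpy as np
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[my_word]]]), n=10)
# most_similar returns (keys, best_rows, scores); the keys are string hashes.
words = [nlp.vocab.strings[w] for w in ms[0][0]]
scores = ms[2]  # similarity scores, one per returned word
print(words)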

['itslearning', 'teachin', 'Garners', 'Microlearning', 'understandingly', 'educates', 'knowles', 'renverser', 'pressroom', 'memristive']


In [12]:
doc1 = nlp("I like fries and vegetarian burgers")
doc2 = nlp("Fast food tastes very good")

In [13]:
print(doc1, "<->", doc2, ":",doc1.similarity(doc2))

I like fries and vegetarian burgers <-> Fast food tastes very good : 0.594179107276761


In [14]:
doc3 =  nlp("Brooklyn bridge is a bridge in Brooklyn, New York")
print(doc1, "<->", doc3, ":",doc1.similarity(doc3))

I like fries and vegetarian burgers <-> Brooklyn bridge is a bridge in Brooklyn, New York : 0.2522421310765939


In [15]:
doc4 = nlp("I like having fruits")
doc5 = nlp("I love staying healthy")
print(doc4, "<->", doc5, ":",doc4.similarity(doc5))

I like having fruits <-> I love staying healthy : 0.8781820811565578
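The scores above compare whole documents. By default spaCy builds a document vector by averaging its token vectors and then compares documents with cosine similarity. The sketch below imitates that with NumPy on a hand-made table; the two-dimensional vectors and the helper names `text_vector` and `cosine` are assumptions made for this example, not spaCy code.

```python
import numpy as np

# Hypothetical 2-dimensional word vectors, invented for this sketch.
toy_vectors = {
    "fries": np.array([0.9, 0.1]),
    "burgers": np.array([0.8, 0.2]),
    "food": np.array([0.7, 0.3]),
    "bridge": np.array([0.1, 0.9]),
    "york": np.array([0.2, 0.8]),
}

def text_vector(words):
    """Average the vectors of the words that have an entry in the toy table."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    return np.mean(known, axis=0)

def cosine(a, b):
    """Cosine similarity between two NumPy vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

meal = text_vector(["fries", "burgers"])
fast_food = text_vector(["food"])
landmark = text_vector(["bridge", "york"])

print("meal vs fast food:", round(cosine(meal, fast_food), 3))
print("meal vs landmark:", round(cosine(meal, landmark), 3))
```

Because averaging discards word order, two texts with similar vocabulary can score high even when they say different things, which is worth keeping in mind when reading the numbers above.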


## **spaCy Pipelines**
spaCy is much more than an NLP framework. It is also a way of designing and implementing complex pipelines. A pipeline is a sequence of pipes, or actors on data, that alter the data or extract information from it.
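The "sequence of pipes" idea can be seen by running the pipes by hand. The sketch below tokenizes a sample sentence (made up for this example) and then passes the resulting `Doc` through each pipe in order, which is roughly what calling `nlp(text)` does; the real call also handles details such as error handling that are left out here.

```python
import spacy

nlp = spacy.blank("en")        # tokenizer only, no pipes yet
nlp.add_pipe("sentencizer")    # rule-based sentence splitter

text = "This is the first sentence. This is the second one."

# Tokenize first, then hand the Doc to each pipe in order.
doc = nlp.make_doc(text)
for name, pipe in nlp.pipeline:
    print("running pipe:", name)
    doc = pipe(doc)

sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Each pipe receives a `Doc` and returns a `Doc`, so the order of the pipes decides which annotations are already available to later ones.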

In [16]:
nlp = spacy.blank("en")

In [17]:
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x79430af13c80>

In [18]:
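import requests
from bs4 import BeautifulSoup
# Download the complete works of Shakespeare as plain text.
s = requests.get("https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt")
# Name the parser explicitly, rejoin words hyphenated across line breaks, and flatten newlines.
soup = BeautifulSoup(s.content, "html.parser").text.replace("-\n", "").replace("\n", " ")
# Raise spaCy's default limit of 1,000,000 characters so the full text can be processed.
nlp.max_length = 5278439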

In [19]:
nlp.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'doc.sents': {'assigns': ['sentencizer'], 'requires': []},
  'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []}}}

In [20]:
nlp2 = spacy.load("en_core_web_sm")

In [21]:
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att

## **Entity Ruler**
The EntityRuler is a spaCy factory that lets you create a set of patterns with corresponding labels. A factory in spaCy is a set of classes and functions preloaded in spaCy that perform set tasks. In the case of the EntityRuler, the factory lets the user create an EntityRuler, give it a set of instructions, and then use these instructions to find and label entities.
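As a minimal sketch of those "instructions", the example below adds an EntityRuler to a blank English pipeline and gives it two patterns: one exact phrase and one list of token attributes. The labels, patterns, and sentence are illustrative choices, not part of this notebook's data.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative patterns: a phrase pattern and a token-attribute pattern.
patterns = [
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Apple is opening an office in San Francisco.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

No trained model is involved here: every entity comes from a pattern, so the result is fully predictable but only covers what the patterns describe.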

Two Approaches -

**Rule-Based Approach** - In the rule-based approach, linguistic experts manually create a set of predefined rules to analyze and process natural language text. These rules are designed to identify specific patterns, syntactic structures, or semantic relationships within the text. The rules are typically based on linguistic knowledge and grammatical principles.

*Use a Rule-Based Approach When*:

Well-defined tasks: Rule-based approaches work well for tasks with clear and specific rules, such as simple text preprocessing, keyword matching, or basic information extraction.

Domain-specific knowledge: If the task requires expertise in a particular domain, rule-based systems can be crafted by linguistic experts to capture domain-specific nuances and jargon effectively.

Limited data: When training data is scarce or of poor quality, rule-based systems can be a viable alternative as they do not require large datasets for development.

Interpretability and explainability: In applications where transparency and interpretability are crucial, rule-based systems offer clear and understandable processing steps.

**Machine Learning-Based Approach** -
In the machine learning-based approach, NLP models are trained on large datasets, where the model learns patterns and relationships from the data itself. These models use statistical algorithms to make predictions and derive meaning from the text. Popular machine learning techniques used in NLP include deep learning models like Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformers.

*Use a Machine Learning-Based Approach When*:

Complex tasks: Machine learning-based approaches are ideal for more complex NLP tasks, such as sentiment analysis, named entity recognition, machine translation, and language generation.

Large and diverse datasets: If sufficient high-quality data is available, machine learning models can learn patterns and relationships effectively, leading to better generalization and performance.

Contextual understanding: Machine learning models, especially deep learning-based ones, are well-suited for tasks that require context-dependent comprehension, such as machine comprehension or question-answering systems.

Adaptability and scalability: In applications where the language patterns change frequently or adaptability to new data is essential, machine learning models can continuously improve with additional training data.

In [22]:
nlp = spacy.load("en_core_web_sm")
text = "West Chestertenfieldville was referred in Mr. Deeds."

In [23]:
doc = nlp(text)

In [24]:
for ent in doc.ents:
  print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON


In [25]:
ruler = nlp.add_pipe("entity_ruler")

In [26]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [27]:
nlp2 = spacy.load("en_core_web_sm")

In [28]:
ruler = nlp2.add_pipe("entity_ruler", before="ner")

In [29]:
patterns = [
  {"label": "Film", "pattern":"Mr. Deeds" }
]

In [30]:
ruler.add_patterns(patterns)

In [31]:
doc = nlp2(text)

In [32]:
for ent in doc.ents:
  print(ent.text, ent.label_)

West Chestertenfieldville GPE
Mr. Deeds Film
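By default an EntityRuler leaves tokens alone if an earlier component has already labeled them, which is why the ruler above is placed before `ner`. Another option is the ruler's `overwrite_ents` setting. The sketch below shows it on a blank pipeline, where a first ruler stands in for the statistical `ner` component; the pipe names `first_ruler` and `film_ruler` are hypothetical.

```python
import spacy

nlp = spacy.blank("en")

# Stand-in for an earlier component that labels "Deeds" as a person.
first = nlp.add_pipe("entity_ruler", name="first_ruler")
first.add_patterns([{"label": "PERSON", "pattern": "Deeds"}])

# A later ruler that is allowed to replace overlapping entities.
second = nlp.add_pipe(
    "entity_ruler", name="film_ruler", config={"overwrite_ents": True}
)
second.add_patterns([{"label": "Film", "pattern": "Mr. Deeds"}])

doc = nlp("West Chestertenfieldville was referred in Mr. Deeds.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

Placing the ruler before `ner` keeps the model's other predictions in play, while `overwrite_ents` lets the rules win when they overlap an existing entity; which one fits depends on how much the patterns should be trusted.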


## **Matcher**

In [33]:
from spacy.matcher import Matcher

In [34]:
nlp = spacy.load("en_core_web_sm")

In [35]:
matcher=Matcher(nlp.vocab)
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL_ADDRESS", [pattern])

In [36]:
doc = nlp("This is a test email: 6l6Rm@example.com")
matches = matcher(doc)

In [37]:
print(matches)

[(16571425990740197027, 6, 7)]


In [38]:
print(nlp.vocab[matches[0][0]].text)

EMAIL_ADDRESS


In [39]:
with open("/content/Martin_Luther_King.txt", "r") as f:
  text = f.read()

In [40]:
print(text)

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who was one of the most prominent leaders in the civil rights movement from 1955 until his assassination on April 4, 1968. A Black church leader and a son of early civil rights activist and minister Martin Luther King Sr., King advanced civil rights for people of color in the United States through nonviolence and civil disobedience. Inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi, he led targeted, nonviolent resistance against Jim Crow laws and other forms of discrimination in the United States.

King participated in and led marches for the right to vote, desegregation, labor rights, and other civil rights.[1] He oversaw the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, a

In [41]:
nlp = spacy.load("en_core_web_sm")

In [42]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN"}]
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
  print(match, doc[match[1]:match[2]])

67
(451313080118390996, 0, 1) Martin
(451313080118390996, 1, 2) Luther
(451313080118390996, 2, 3) King
(451313080118390996, 3, 4) Jr.
(451313080118390996, 6, 7) Michael
(451313080118390996, 7, 8) King
(451313080118390996, 8, 9) Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 23, 24) Baptist


In [43]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN", "OP":"+"}]
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
  print(match, doc[match[1]:match[2]])

111
(451313080118390996, 0, 1) Martin
(451313080118390996, 0, 2) Martin Luther
(451313080118390996, 1, 2) Luther
(451313080118390996, 0, 3) Martin Luther King
(451313080118390996, 1, 3) Luther King
(451313080118390996, 2, 3) King
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 1, 4) Luther King Jr.
(451313080118390996, 2, 4) King Jr.
(451313080118390996, 3, 4) Jr.


In [44]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN", "OP":"+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
  print(match, doc[match[1]:match[2]])

40
(451313080118390996, 65, 70) Martin Luther King Sr.
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 160, 164) Southern Christian Leadership Conference
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 241, 244) Civil Rights Act
(451313080118390996, 247, 250) Voting Rights Act
(451313080118390996, 255, 258) Fair Housing Act
(451313080118390996, 317, 320) J. Edgar Hoover
(451313080118390996, 81, 83) United States
(451313080118390996, 99, 101) Mahatma Gandhi


In [45]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN", "OP":"+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])
print(len(matches))
for match in matches[:10]:
  print(match, doc[match[1]:match[2]])

40
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 23, 24) Baptist
(451313080118390996, 46, 47) April
(451313080118390996, 65, 70) Martin Luther King Sr.
(451313080118390996, 71, 72) King
(451313080118390996, 81, 83) United States
(451313080118390996, 99, 101) Mahatma Gandhi


In [46]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN", "OP":"+"}, {"POS":"VERB"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])
print(len(matches))
for match in matches[:10]:
  print(match, doc[match[1]:match[2]])

4
(451313080118390996, 71, 73) King advanced
(451313080118390996, 123, 125) King participated
(451313080118390996, 317, 321) J. Edgar Hoover considered
(451313080118390996, 362, 364) FBI mailed
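The effect of `"OP": "+"` and `greedy="LONGEST"` in the cells above can be reproduced on a small scale. This sketch uses a blank pipeline and a made-up sentence, and swaps `POS` for the lexical attribute `IS_TITLE` because a blank pipeline has no tagger to assign part-of-speech tags.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Martin Luther King led marches in Atlanta")

# One or more consecutive title-cased tokens.
pattern = [{"IS_TITLE": True, "OP": "+"}]

# Without greedy, every sub-span of a run is returned as its own match.
all_matcher = Matcher(nlp.vocab)
all_matcher.add("TITLE_RUN", [pattern])
all_spans = [doc[start:end].text for _, start, end in all_matcher(doc)]

# With greedy="LONGEST", overlapping matches collapse into the longest one.
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("TITLE_RUN", [pattern], greedy="LONGEST")
longest = sorted(longest_matcher(doc), key=lambda m: m[1])
longest_spans = [doc[start:end].text for _, start, end in longest]

print(len(all_spans), all_spans)
print(longest_spans)
```

Sorting by the start index, as in the cells above, puts the filtered matches back into document order.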


## **Working with Custom Components**


In [48]:
doc2 = nlp("New York is a place. Steve is a person.")

In [49]:
for ent in doc2.ents:
  print(ent.text, ent.label_)

New York GPE
Steve PERSON


In [50]:
from spacy.language import Language

In [51]:
@Language.component("remove_GPE")
def remove_GPE(doc):
  # Copy the entities so the copy can be edited while doc.ents is iterated.
  original_ents = list(doc.ents)
  for ent in doc.ents:
    if ent.label_ == "GPE":
      original_ents.remove(ent)
  # Write the filtered entities back to the Doc.
  doc.ents = original_ents
  return doc


In [52]:
nlp.add_pipe("remove_GPE")

<function __main__.remove_GPE(doc)>

In [53]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'remove_GPE': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  

In [54]:
doc2 = nlp("New York is a place. Steve is a person.")
for ent in doc2.ents:
  print(ent.text, ent.label_)

Steve PERSON
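The same filter can be written more compactly with a list comprehension. The sketch below runs on a blank pipeline, so an `entity_ruler` with two hand-written patterns stands in for the statistical `ner` component; the component name `drop_gpe_sketch` is hypothetical.

```python
import spacy
from spacy.language import Language

@Language.component("drop_gpe_sketch")  # hypothetical component name
def drop_gpe_sketch(doc):
    """Keep every entity except those labeled GPE."""
    doc.ents = [ent for ent in doc.ents if ent.label_ != "GPE"]
    return doc

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "New York"},
    {"label": "PERSON", "pattern": "Steve"},
])

sample = "New York is a place. Steve is a person."
before = [(ent.text, ent.label_) for ent in nlp(sample).ents]

# Add the custom component at the end, after the entities are set.
nlp.add_pipe("drop_gpe_sketch", last=True)
after = [(ent.text, ent.label_) for ent in nlp(sample).ents]

print(before)
print(after)
```

The component has to run after whatever assigns the entities, otherwise there is nothing for it to remove.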


## **Regular Expressions**

In [55]:
import re

In [56]:
text = "Paul Neuman was an American actor, but Paul Hollywood is a British TV host. The name Paul is quite common."

In [57]:
pattern = r"Paul [A-Z]\w+"

In [58]:
matches = re.finditer(pattern, text)
for match in matches:
  print(match)

<re.Match object; span=(0, 11), match='Paul Neuman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>


In [59]:
from spacy.tokens import Span

In [63]:
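nlp = spacy.blank("en")
doc3 = nlp(text)
original_ents = list(doc3.ents)
mwt_ents = []  # multi-word matches found by the regex
for match in re.finditer(pattern, text):
  start, end = match.span()  # character offsets of the match
  # Convert character offsets to a token span; None if they split a token.
  span = doc3.char_span(start, end)
  if span is not None:
    mwt_ents.append((span.start, span.end, span.text))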

In [64]:
print(mwt_ents)

[(0, 2, 'Paul Neuman'), (8, 10, 'Paul Hollywood')]
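One possible next step is to turn these regex matches into entities on the `Doc`. The sketch below passes a label to `char_span` and assigns the resulting spans to `doc.ents`; the `PERSON` label is an assumption chosen for this example.

```python
import re
import spacy

text = ("Paul Neuman was an American actor, but Paul Hollywood is a "
        "British TV host. The name Paul is quite common.")
pattern = r"Paul [A-Z]\w+"

nlp = spacy.blank("en")
doc = nlp(text)

spans = []
for match in re.finditer(pattern, text):
    start, end = match.span()
    # char_span returns None when the character offsets do not line up
    # with token boundaries, so those matches are skipped.
    span = doc.char_span(start, end, label="PERSON")
    if span is not None:
        spans.append(span)

# Entities must not overlap; these two regex matches do not.
doc.ents = spans
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

The lone "Paul" in the last sentence is not labeled, because the regex requires a capitalized word after the first name.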
