
In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

In [4]:
doc

Apple is looking at buying U.K. startup for $1 billion

# Tokenization: Split the text into individual tokens (words, punctuation, symbols)


In [5]:
for token in doc:
  print(token)       # printing a Token shows its text
  print(token.text)  # the same string, accessed explicitly

Apple
Apple
is
is
looking
looking
at
at
buying
buying
U.K.
U.K.
startup
startup
for
for
$
$
1
1
billion
billion
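
Each token carries more than its text. The sketch below is a minimal illustration, assuming `spacy.blank("en")` (a tokenizer-only English pipeline, so no trained model is required) and the hypothetical names `nlp_blank` and `doc_attrs`. It prints a few lexical attributes for the same sentence.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because these
# attributes come from the tokenizer and lexical rules, not from a trained model.
nlp_blank = spacy.blank("en")
doc_attrs = nlp_blank("Apple is looking at buying U.K. startup for $1 billion")

for token in doc_attrs:
    # i = position in the Doc, idx = character offset in the original text
    print(token.i, token.idx, token.text,
          token.is_alpha, token.like_num, token.is_currency)

token_texts = [token.text for token in doc_attrs]
```

`token.idx` is useful when you need to map a token back to its position in the raw string.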


In [6]:
doc2 = nlp("Python is a programming language")

In [7]:
for token in doc2:
  print(token)

Python
is
a
programming
language


# Adding a special-case tokenization rule

In [8]:
from spacy.symbols import ORTH

In [9]:
nlp

<spacy.lang.en.English at 0x7fc0e8fcaf10>

In [10]:
doc = nlp("gimme that")

In [11]:
doc

gimme that

In [12]:
for token in doc:
  print(token.text)

gimme
that


In [13]:
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
special_case

[{65: 'gim'}, {65: 'me'}]

In [14]:
nlp.tokenizer.add_special_case("gimme", special_case)

In [15]:
for token in nlp("gimme that"):
  print(token.text)

gim
me
that
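
To see which rule produced each token, `Tokenizer.explain` can help. The following is a small sketch rather than a definitive recipe. It assumes a fresh `spacy.blank("en")` pipeline (named `nlp_blank` here for illustration) so that the special case is added to a clean tokenizer.

```python
import spacy
from spacy.symbols import ORTH

nlp_blank = spacy.blank("en")
before = [t.text for t in nlp_blank("gimme that")]

# The ORTH pieces must concatenate back to the original string ("gim" + "me").
nlp_blank.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [t.text for t in nlp_blank("gimme that")]

print(before)
print(after)

# Tokenizer.explain yields (rule name, substring) pairs for debugging.
for rule, text in nlp_blank.tokenizer.explain("gimme that"):
    print(rule, text)
```

A special case only changes how the exact string is split. It does not alter the characters of the text.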


# Parts of Speech (POS)

In [16]:
from spacy import displacy

In [17]:
doc = nlp("python is a programming language. Current year is 2022. Dollar symbol is $")

In [18]:
doc

python is a programming language. Current year is 2022. Dollar symbol is $

In [19]:
for token in doc:
  print(token)

python
is
a
programming
language
.
Current
year
is
2022
.
Dollar
symbol
is
$


In [20]:
for token in doc:
  print(token, "-->", token.pos_)

python --> PROPN
is --> AUX
a --> DET
programming --> NOUN
language --> NOUN
. --> PUNCT
Current --> ADJ
year --> NOUN
is --> AUX
2022 --> NUM
. --> PUNCT
Dollar --> NOUN
symbol --> NOUN
is --> AUX
$ --> SYM


In [21]:
for token in doc:
  print(token, "-->", token.pos)  # integer ID of the tag; token.pos_ is the string form

python --> 96
is --> 87
a --> 90
programming --> 92
language --> 92
. --> 97
Current --> 84
year --> 92
is --> 87
2022 --> 93
. --> 97
Dollar --> 92
symbol --> 92
is --> 87
$ --> 99
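
The tag abbreviations can be looked up with `spacy.explain`, which returns a short description for a label. This is a minimal sketch. The list of tags is simply copied from the output above, and `propn_description` is a name introduced here for illustration.

```python
import spacy

# Tags taken from the POS output above; no trained model is needed for the lookup.
for tag in ["PROPN", "AUX", "DET", "NOUN", "ADJ", "NUM", "PUNCT", "SYM"]:
    print(tag, "-->", spacy.explain(tag))

propn_description = spacy.explain("PROPN")
```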


In [22]:
displacy.render(doc, style="dep", jupyter=True)

# Stop Words

In [25]:
from spacy.lang.en.stop_words import STOP_WORDS

In [26]:
print(STOP_WORDS)

{'front', 'had', 'becomes', 'therein', 'under', 'around', 'a', 'then', 'how', 'here', 'hence', 're', 'why', '‘m', "'re", 'most', 'make', 'empty', 'n‘t', 'quite', 'after', 'amount', 'have', 'still', 'latter', 'alone', 'whereby', 'seems', 'its', 'two', 'once', '‘ll', 'name', 'thereupon', 'latterly', 'beforehand', 'anyone', 'other', 'give', 'always', 'either', 'top', 'whatever', 'anyhow', 'whole', 'mine', 'an', 'everything', 'again', 'be', 'no', 'last', 'being', 'due', 'has', 'would', 'been', 'seem', 'hereafter', 'my', 'him', "n't", 'will', 'of', 'because', 'anyway', 'wherever', 'between', 'next', 'are', 'cannot', 'otherwise', 'before', 'three', 'among', '‘re', 'everywhere', 'each', 'he', 'sixty', '‘ve', 'now', 'when', 'get', 'few', 'per', 'as', 'do', '’m', 'using', 'while', 'besides', 'done', 'amongst', 'without', 'nowhere', '‘d', 'there', 'behind', 'with', 'the', 'eleven', 'i', 'to', 'toward', 'did', 'nor', 'however', 'itself', 'whence', 'against', 'anything', 'meanwhile', 'regarding', 

In [27]:
len(STOP_WORDS)

326

In [29]:
"in" in STOP_WORDS

True

In [30]:
"apple" in STOP_WORDS

False

In [31]:
nlp.vocab["apple"].is_stop

False

In [33]:
nlp.vocab["in"].is_stop

True

In [34]:
doc = nlp("Python is a programming language.I am learning NLP")

In [35]:
for token in doc:
  print(token.text)

Python
is
a
programming
language
.
I
am
learning
NLP


In [36]:
for token in doc:
  if nlp.vocab[token.text].is_stop:
    print(token.text)

is
a
I
am


In [37]:
for token in doc:
  if not nlp.vocab[token.text].is_stop:
    print(token.text)

Python
programming
language
.
learning
NLP
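
The cells above look each token up in `nlp.vocab`. The same flag is also exposed directly as `token.is_stop`, and it can be combined with `token.is_punct` to drop punctuation as well. A minimal sketch, assuming a tokenizer-only `spacy.blank("en")` pipeline and the hypothetical names `doc_sw` and `content_words`:

```python
import spacy

nlp_blank = spacy.blank("en")
doc_sw = nlp_blank("Python is a programming language.I am learning NLP")

# Keep tokens that are neither stop words nor punctuation.
content_words = [t.text for t in doc_sw if not t.is_stop and not t.is_punct]
print(content_words)
```

Filtering on both flags removes the stray `.` that the stop-word check alone lets through.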


# Named Entity Recognition

In [40]:
doc = nlp("Apple is looking at buying U.K. startup at $1 billion")
doc

Apple is looking at buying U.K. startup at $1 billion

In [41]:
for token in doc:
  print(token.text, "-", token.pos_)

Apple - PROPN
is - AUX
looking - VERB
at - ADP
buying - VERB
U.K. - PROPN
startup - NOUN
at - ADP
$ - SYM
1 - NUM
billion - NUM


In [42]:
displacy.render(doc, style="ent", jupyter=True)

In [43]:
for entity in doc.ents:
  if entity.label_ == "ORG":
    print(entity)

Apple
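
Each entity in `doc.ents` is a `Span` with a label and character offsets. The sketch below illustrates that structure only. The entities are assigned by hand on a blank pipeline, not predicted by a model, and the labels chosen (`ORG`, `GPE`, `MONEY`) are assumptions made for the example.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc_ner = nlp_blank("Apple is looking at buying U.K. startup at $1 billion")

# Hand-assigned spans (token start, token end), standing in for model predictions.
doc_ner.ents = [
    Span(doc_ner, 0, 1, label="ORG"),     # "Apple"
    Span(doc_ner, 5, 6, label="GPE"),     # "U.K."
    Span(doc_ner, 8, 11, label="MONEY"),  # "$1 billion"
]

for ent in doc_ner.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

entity_pairs = [(ent.text, ent.label_) for ent in doc_ner.ents]
orgs = [ent.text for ent in doc_ner.ents if ent.label_ == "ORG"]
```

With a trained pipeline such as `en_core_web_sm`, the same loop works on `doc.ents` without setting the spans manually.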


# Lemmatization

In [44]:
doc = nlp("I am recording video")

In [45]:
for token in doc:
  print(token.text)

I
am
recording
video


In [46]:
for token in doc:
  print(token.lemma_)

I
be
record
video


In [47]:
doc = nlp("I love swimming but I hate jogging")

In [48]:
for token in doc:
  print(token.lemma_)

I
love
swimming
but
I
hate
jog


In [49]:
for token in doc:
  print(token.lemma)  # 64-bit hash of the lemma string; token.lemma_ is the readable form

4690420944186131903
3702023516439754181
12526975369366237900
14560795576765492085
4690420944186131903
8706232279129489120
14708015734410536110


In [50]:
for token in doc:
  print(token.pos_)

PRON
VERB
NOUN
CCONJ
PRON
AUX
VERB


# Word Similarity  

In [51]:
g1 = nlp("hi")

In [52]:
g2 = nlp("hello")

In [54]:
g1.similarity(g2)

0.6473686254886916

In [55]:
g2.similarity(g1)

0.6473686254886916

In [56]:
g3 = nlp("similarity")

In [57]:
g1.similarity(g3)

0.15432497907791906

In [58]:
s1 = nlp("nlp is used for dealing with text analysis")

In [59]:
s2 = nlp("spacy is a part of nlp which is used for text analysis")

In [60]:
s3 = nlp("I like movie")

In [61]:
s1.similarity(s2)

0.6729594214644644

In [62]:
s1.similarity(s3)

0.2643360398405988
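
By default, `similarity` is a cosine similarity between the vectors of the two objects, which is why swapping the arguments gives the same score. The sketch below shows the formula in plain Python. The function name `cosine_similarity` and the toy vectors are hypothetical and are not taken from any spaCy model. Note also that small pipelines such as `en_core_web_sm` do not ship static word vectors, so their scores are best treated as a rough signal.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.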

# Sentence Segmentation

In [63]:
doc = nlp("This is our first sentence.This is our second sentence.") 

In [64]:
for sent in doc.sents:
  print(sent.text)

This is our first sentence.
This is our second sentence.


In [65]:
doc = nlp("Sometimes to understand a word's meaning you need more than a definition; you need to see the word used in a sentence. At YourDictionary, we give you the tools to learn what a word means and how to use it correctly. With this sentence maker, simply type a word in the search bar and see a variety of sentences with that word used in its different ways. Our sentence generator can provide more context and relevance, ensuring you use a word the right way.")


In [66]:
doc

Sometimes to understand a word's meaning you need more than a definition; you need to see the word used in a sentence. At YourDictionary, we give you the tools to learn what a word means and how to use it correctly. With this sentence maker, simply type a word in the search bar and see a variety of sentences with that word used in its different ways. Our sentence generator can provide more context and relevance, ensuring you use a word the right way.

In [67]:
for sent in doc.sents:
  print(sent.text)

Sometimes to understand a word's meaning you need more than a definition; you need to see the word used in a sentence.
At YourDictionary, we give you the tools to learn what a word means and how to use it correctly.
With this sentence maker, simply type a word in the search bar and see a variety of sentences with that word used in its different ways.
Our sentence generator can provide more context and relevance, ensuring you use a word the right way.
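
The cells above rely on the loaded pipeline to set sentence boundaries. When only punctuation-based splitting is needed, spaCy's rule-based `sentencizer` component is a lighter alternative. A minimal sketch, assuming spaCy v3 (where components are added by string name) and the hypothetical names `nlp_rules` and `doc_rules`:

```python
import spacy

# Assumption: spaCy v3, where add_pipe accepts the component's string name.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")  # splits on sentence-final punctuation

doc_rules = nlp_rules("This is our first sentence.This is our second sentence.")
sentences = [sent.text for sent in doc_rules.sents]
print(sentences)
```

The rule-based splitter is fast but ignores syntax, so it can mis-split text where a period does not end a sentence.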
