# Introduction to spaCy

    1). spaCy is an advanced library for Natural Language Processing (NLP), typically used to build NLP
        applications.
    2). spaCy ships with pre-trained models that can perform the most common NLP tasks, such as tokenization,
        part-of-speech (POS) tagging, lemmatization, and working with word vectors.
    3). spaCy supports multiple languages and is widely used to build a variety of applications in different
        languages.
        
**spaCy Installation:**

In [1]:
#!pip install spacy

**Before loading 'en_core_web_sm', download the model using the command below.**

In [2]:
#!python -m spacy download en_core_web_sm
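
If the download step has been skipped, `spacy.load` raises an error. The sketch below is one possible way to guard against that: it checks for the installed package with `spacy.util.is_package` and otherwise falls back to a blank English pipeline. The helper name `load_pipeline` is made up for this example, and the fallback is an assumption: a blank pipeline only tokenizes, so POS tags, lemmas and entities will not be available from it.

```python
import spacy
from spacy.util import is_package


def load_pipeline(model_name="en_core_web_sm"):
    """Load the named model if it is installed, else a blank English pipeline."""
    if is_package(model_name):
        return spacy.load(model_name)
    # Assumption: a tokenizer-only pipeline is an acceptable stand-in here.
    return spacy.blank("en")


pipeline = load_pipeline()
# Show the language code and the names of the loaded pipeline components.
print(pipeline.lang, pipeline.pipe_names)
```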

In [3]:
import spacy

In [4]:
nlp = spacy.load('en_core_web_sm')

In [5]:
nlp

<spacy.lang.en.English at 0xf0edca48e0>

In [6]:
type(nlp)

spacy.lang.en.English

In [7]:
my_text = "Hello I am learning Spacy."

In [8]:
my_doc = nlp(my_text)

In [9]:
my_doc

Hello I am learning Spacy.

In [10]:
type(my_doc)

spacy.tokens.doc.Doc

**Doc Object:**

    - A Doc is a sequence of tokens that holds not just the original text but also all the results produced by the
      spaCy pipeline after processing the text.
    - Information such as the lemma of each token, its word vector, and so on is precomputed and stored in the
      Doc object.
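
A minimal sketch of what "sequence of tokens" means in practice: a Doc supports `len`, indexing and slicing. To stay independent of the model download it uses a blank English pipeline (tokenizer only), which is an assumption of this example; the names `blank_nlp`, `doc`, `first_token` and `span` are illustrative.

```python
import spacy

# Assumption: a blank English pipeline is enough to show the container
# behaviour of a Doc; it needs no model download.
blank_nlp = spacy.blank("en")
doc = blank_nlp("Hello I am learning Spacy.")

print(len(doc))        # number of tokens in the Doc
first_token = doc[0]   # indexing returns a Token
span = doc[3:5]        # slicing returns a Span
print(first_token.text, "|", span.text)

# The Doc keeps the original text and each token's character offset.
print(doc.text)
print([(token.text, token.idx) for token in doc])
```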

In [11]:
print(my_doc.text)

Hello I am learning Spacy.


In [12]:
for item in my_doc:
    print(item.text,"\t",item.idx,"\t",item.pos_,"\t",item.pos)

Hello 	 0 	 INTJ 	 91
I 	 6 	 PRON 	 95
am 	 8 	 AUX 	 87
learning 	 11 	 VERB 	 100
Spacy 	 20 	 PROPN 	 96
. 	 25 	 PUNCT 	 97


In [13]:
data = """My name is Dharmesh. I love travelling. I live in Mumbai"""

In [14]:
data_doc = nlp(data)

In [15]:
for item in data_doc.sents:
    print(item)

My name is Dharmesh.
I love travelling.
I live in Mumbai
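
Sentence boundaries above come from the loaded model. As a rough alternative, the sketch below uses spaCy's rule-based `sentencizer` component on a blank pipeline, which splits on sentence-final punctuation only. It assumes spaCy v3, where components are added by name with `add_pipe("sentencizer")`; the variable names are illustrative.

```python
import spacy

# Assumption: spaCy v3 API. A blank pipeline has no parser, so a rule-based
# sentencizer is added to set sentence boundaries.
sent_nlp = spacy.blank("en")
sent_nlp.add_pipe("sentencizer")

sent_doc = sent_nlp("My name is Dharmesh. I love travelling. I live in Mumbai")
sentences = [sent.text for sent in sent_doc.sents]

# Each sentence is a Span with character offsets into the original text.
for sent in sent_doc.sents:
    print(sent.start_char, sent.end_char, sent.text)
```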


In [16]:
for item in data_doc:
    print(item.text,"\t",item.idx,"\t",item.pos_,"\t",item.pos)

My 	 0 	 PRON 	 95
name 	 3 	 NOUN 	 92
is 	 8 	 AUX 	 87
Dharmesh 	 11 	 PROPN 	 96
. 	 19 	 PUNCT 	 97
I 	 21 	 PRON 	 95
love 	 23 	 VERB 	 100
travelling 	 28 	 VERB 	 100
. 	 38 	 PUNCT 	 97
I 	 40 	 PRON 	 95
live 	 42 	 VERB 	 100
in 	 47 	 ADP 	 85
Mumbai 	 50 	 PROPN 	 96


In [17]:
spacy.explain('AUX')

'auxiliary'

In [18]:
spacy.explain('ADP')

'adposition'

In [19]:
spacy.explain('PROPN')

'proper noun'

In [20]:
spacy.explain('PRON')

'pronoun'
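
Instead of calling `spacy.explain` one tag at a time, the tags printed above can be looked up in a single pass. This is a small sketch; the list `tags` and the dictionary `glossary` are names chosen for the example.

```python
import spacy

# POS tags that appear in the token tables above.
tags = ["INTJ", "PRON", "AUX", "VERB", "PROPN", "PUNCT", "NOUN", "ADP"]

# spacy.explain needs no loaded model; it looks the label up in a glossary.
glossary = {tag: spacy.explain(tag) for tag in tags}

for tag, description in glossary.items():
    print(f"{tag:<6} {description}")
```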

### Entity Recognition

In [21]:
data_doc

My name is Dharmesh. I love travelling. I live in Mumbai

In [22]:
# Use `ent` as the loop variable so the earlier `data` string is not overwritten.
for ent in data_doc.ents:
    print(ent.text," ",ent.label," ",ent.label_)

Dharmesh   380   PERSON
Mumbai   384   GPE
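
The output above shows two forms of the label: `label` is an integer ID and `label_` is its string. The sketch below illustrates that relationship and the character offsets stored on each entity span. To run without the model, the entities are set by hand on a blank pipeline, which is an assumption of this example and not how the model finds them; the token positions 3 and 12 were counted for this specific sentence.

```python
import spacy
from spacy.tokens import Span

ent_nlp = spacy.blank("en")
ent_doc = ent_nlp("My name is Dharmesh. I love travelling. I live in Mumbai")

# Assumption: entities are assigned manually here. Token 3 is "Dharmesh"
# and token 12 is "Mumbai" in this sentence.
ent_doc.ents = [
    Span(ent_doc, 3, 4, label="PERSON"),
    Span(ent_doc, 12, 13, label="GPE"),
]

rows = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in ent_doc.ents]
for row in rows:
    print(row)

# The integer label resolves back to the string label through the vocab's string store.
resolved = [ent_doc.vocab.strings[ent.label] for ent in ent_doc.ents]
print(resolved)
```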


In [23]:
spacy.explain('GPE')

'Countries, cities, states'

In [24]:
text = """Mumbai or Bombay is the capital city of the Indian state of Maharashtra. According to the United Nations, as of 2018, Mumbai was the second most populated city in India after Delhi. In the world with a population of roughly 20 million"""

In [25]:
text_doc = nlp(text)

In [26]:
for item in text_doc.ents:
    print(item.text," ",item.label_)

Mumbai   GPE
Bombay   GPE
Indian   NORP
Maharashtra   ORG
the United Nations   ORG
2018   DATE
Mumbai   GPE
second   ORDINAL
India   GPE
Delhi   GPE
roughly 20 million   CARDINAL


In [27]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

In [28]:
spacy.explain('CARDINAL')

'Numerals that do not fall under another type'

### Stopwords

In [29]:
data1 = """The economic situtation of the country is on the edge, as the stock market crashed causing \nloss of million of dollars. Citizens who had their main investment in the share market are facing great loss\nMany Companies might layoff thousand of people to reduce labour cost"""

In [30]:
data_doc1 = nlp(data1)

In [31]:
for item in data_doc1:
    print(item,": :",item.is_stop)

The : : True
economic : : False
situtation : : False
of : : True
the : : True
country : : False
is : : True
on : : True
the : : True
edge : : False
, : : False
as : : True
the : : True
stock : : False
market : : False
crashed : : False
causing : : False

 : : False
loss : : False
of : : True
million : : False
of : : True
dollars : : False
. : : False
Citizens : : False
who : : True
had : : True
their : : True
main : : False
investment : : False
in : : True
the : : True
share : : False
market : : False
are : : True
facing : : False
great : : False
loss : : False

 : : False
Many : : True
Companies : : False
might : : True
layoff : : False
thousand : : False
of : : True
people : : False
to : : True
reduce : : False
labour : : False
cost : : False


### Punctuation

In [32]:
for item in data_doc1:
    print(item,": :",item.is_punct)

The : : False
economic : : False
situtation : : False
of : : False
the : : False
country : : False
is : : False
on : : False
the : : False
edge : : False
, : : True
as : : False
the : : False
stock : : False
market : : False
crashed : : False
causing : : False

 : : False
loss : : False
of : : False
million : : False
of : : False
dollars : : False
. : : True
Citizens : : False
who : : False
had : : False
their : : False
main : : False
investment : : False
in : : False
the : : False
share : : False
market : : False
are : : False
facing : : False
great : : False
loss : : False

 : : False
Many : : False
Companies : : False
might : : False
layoff : : False
thousand : : False
of : : False
people : : False
to : : False
reduce : : False
labour : : False
cost : : False


### Removing Stopwords & Punctuation

In [33]:
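def rm_stop_punc(text):
    # Keep tokens that are neither stop words nor punctuation.
    # Whitespace tokens such as "\n" are not filtered and stay in the result.
    text_doc = nlp(text)
    cleaned_data = [item for item in text_doc if not item.is_stop and not item.is_punct]
    return cleaned_data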

In [34]:
rm_stop_punc(data1)

[economic,
 situtation,
 country,
 edge,
 stock,
 market,
 crashed,
 causing,
 ,
 loss,
 million,
 dollars,
 Citizens,
 main,
 investment,
 share,
 market,
 facing,
 great,
 loss,
 ,
 Companies,
 layoff,
 thousand,
 people,
 reduce,
 labour,
 cost]
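
The result above still contains the newline tokens, because they are neither stop words nor punctuation. A possible variant, sketched below, also checks `is_space` and returns plain strings instead of Token objects. The function name `content_words` and the shortened sample text are made up for this example, and a blank English pipeline is assumed so that no model is required.

```python
import spacy

# Assumption: is_stop, is_punct and is_space are lexical attributes, so a
# blank English pipeline is sufficient for this filter.
clean_nlp = spacy.blank("en")


def content_words(text):
    """Return token texts that are not stop words, punctuation or whitespace."""
    doc = clean_nlp(text)
    return [
        token.text
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


sample = "The stock market crashed causing \nloss of million of dollars."
words = content_words(sample)
print(words)
```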

### Visualizing Entities with displaCy

In [35]:
from spacy import displacy

In [36]:
displacy.render(data_doc1,style='ent' ,jupyter=True)

In [37]:
displacy.render(text_doc,style='ent' ,jupyter=True)
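
Outside a notebook, `displacy.render` can return the markup as a string when `jupyter=False` is passed. The sketch below uses displaCy's manual mode, where the text and entity offsets are supplied as a dictionary, so it runs without a model. The example sentence and its hand-written offsets are assumptions made for illustration.

```python
from spacy import displacy

# Assumption: entity offsets are written by hand (manual mode) instead of
# being predicted by a model. "Mumbai" spans characters 10 to 16.
example = {
    "text": "I live in Mumbai",
    "ents": [{"start": 10, "end": 16, "label": "GPE"}],
    "title": None,
}

# With jupyter=False the rendered HTML is returned as a string.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file or embedded in a web page.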

### Lemmatization

In [38]:
for item in data_doc1:
    print(item," ",item.lemma_)

The   the
economic   economic
situtation   situtation
of   of
the   the
country   country
is   be
on   on
the   the
edge   edge
,   ,
as   as
the   the
stock   stock
market   market
crashed   crash
causing   cause

   

loss   loss
of   of
million   million
of   of
dollars   dollar
.   .
Citizens   citizen
who   who
had   have
their   their
main   main
investment   investment
in   in
the   the
share   share
market   market
are   be
facing   face
great   great
loss   loss

   

Many   many
Companies   Companies
might   might
layoff   layoff
thousand   thousand
of   of
people   people
to   to
reduce   reduce
labour   labour
cost   cost


### Some Special Token Attributes

    is_alpha : Returns True if the token consists of alphabetic characters
    is_ascii : Returns True if the token consists of ASCII characters
    is_digit : Returns True if the token consists of digits (0-9)
    is_lower : Returns True if the token is in lower case
    is_space : Returns True if the token consists of whitespace
    is_bracket : Returns True if the token is a bracket
    is_quote : Returns True if the token is a quotation mark
    like_url : Returns True if the token resembles a URL
    like_num : Returns True if the token represents a number (e.g. "10" or "ten")
    like_email : Returns True if the token resembles an email address
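
To see several of these flags side by side, the sketch below collects them per token into a dictionary. The sample sentence is invented for this example, and a blank English pipeline is assumed because these flags do not depend on a trained model.

```python
import spacy

flag_nlp = spacy.blank("en")
# Assumption: an invented sentence in which tokens are separated by spaces.
flag_doc = flag_nlp("Mail test@gmail.com or visit https://spacy.io for ten items ( 25 )")

attribute_names = ["is_alpha", "is_lower", "is_digit", "is_bracket",
                   "like_num", "like_url", "like_email"]

# Map each token text to a dictionary of its boolean flags.
flags = {
    token.text: {name: getattr(token, name) for name in attribute_names}
    for token in flag_doc
}

for text, values in flags.items():
    active = [name for name, value in values.items() if value]
    print(f"{text:<18} {active}")
```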

In [39]:
emp_txt = """name: test age: 25 email: test@gmail.com, name: alpha age: 32 email: alpha@yahoo.com, name: beta age: 29 email: beta@hotmail.com"""

In [40]:
emp_txt

'name: test age: 25 email: test@gmail.com, name: alpha age: 32 email: alpha@yahoo.com, name: beta age: 29 email: beta@hotmail.com'

In [41]:
emp_doc = nlp(emp_txt)

In [42]:
for item in emp_doc:
    if item.like_num:
        print(item)

25
32
29


In [43]:
for item in emp_doc:
    if item.like_email:
        print(item)

test@gmail.com
alpha@yahoo.com
beta@hotmail.com
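
The two loops above can be combined to turn the text into simple records. This is a sketch under the assumption that every entry has exactly one age and one email in the same order, so the two lists can be paired with `zip`; the names `rec_nlp`, `ages`, `emails` and `records` are illustrative, and the text is a shortened copy of `emp_txt`.

```python
import spacy

rec_nlp = spacy.blank("en")
rec_doc = rec_nlp(
    "name: test age: 25 email: test@gmail.com, "
    "name: alpha age: 32 email: alpha@yahoo.com"
)

# is_digit is used so that int() only ever sees digit-only tokens.
ages = [int(token.text) for token in rec_doc if token.is_digit]
emails = [token.text for token in rec_doc if token.like_email]

# Assumption: ages and emails appear in matching order, one pair per entry.
records = [{"age": age, "email": email} for age, email in zip(ages, emails)]
print(records)
```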


### Removing Other Types of Words

In [44]:
spacy.explain('X')

'other'

In [45]:
raw_data = """I liked the movies etc The movie had good direction The movie was amazing i.e. \nThe movie was average direction was not bad The cinematography was nice i.e.\nThe movie was a bit lengthy otherwise fantastic etc etc"""

In [46]:
raw_data

'I liked the movies etc The movie had good direction The movie was amazing i.e. \nThe movie was average direction was not bad The cinematography was nice i.e.\nThe movie was a bit lengthy otherwise fantastic etc etc'

In [47]:
raw_data_doc = nlp(raw_data)

In [48]:
for item in raw_data_doc:
    if item.pos_ == 'X':
        print(item)

etc
i.e.
i.e.
etc
etc


In [49]:
cleaned_data = [item for item in raw_data_doc if item.pos_ != 'X']

In [50]:
cleaned_data

[I,
 liked,
 the,
 movies,
 The,
 movie,
 had,
 good,
 direction,
 The,
 movie,
 was,
 amazing,
 ,
 The,
 movie,
 was,
 average,
 direction,
 was,
 not,
 bad,
 The,
 cinematography,
 was,
 nice,
 ,
 The,
 movie,
 was,
 a,
 bit,
 lengthy,
 otherwise,
 fantastic]

In [51]:
displacy.render(data_doc, style='dep', jupyter=True)

In [52]:
displacy.render(data_doc, style='ent', jupyter=True)