### spaCy-
* Open-source natural language processing library.
* For many NLP tasks, spaCy implements a single method, choosing the most efficient algorithm currently available.
* As a result, we cannot choose among alternative algorithms.

### NLTK-
* NLTK is another popular open-source NLP library.
* It is older than spaCy, and its implementations are generally less efficient.

### NLTK vs spaCy-
spaCy does not include pre-created models for some applications, such as sentiment analysis, which is typically easier to perform with NLTK.

### What is NLP-
NLP is an area of computer science and AI concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data.

### spaCy Basics
* .text = The original word text.
* .lemma_ = The base form of the word.
* .pos_ = The simple part-of-speech tag.
* .tag_ = The detailed part-of-speech tag.
* .shape_ = The word shape: capitalization, punctuation, digits.
* .is_alpha = Does the token consist of alphabetic characters?
* .is_stop = Is the token part of the stop list, i.e. the most common words of the language?

In [21]:
import spacy

In [22]:
#loading a model
nlp = spacy.load('en_core_web_sm')

In [23]:
#this doc object contains the processed text.
doc = nlp('Tesla is looking at buying U.S startup $6 million')

In [24]:
#Example
for token in doc:
    print(token.text,token.pos_,token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S PROPN nsubj
startup NOUN conj
$ SYM quantmod
6 NUM compound
million NUM dobj


In [25]:
# spaCy processes text through a pipeline.
# Passing our text to the nlp object ran each of these components in order.
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7fdb6d2c7860>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7fdb6ce7d3a8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7fdb6ce7d408>)]

In [26]:
# Get just the names of the pipeline components
nlp.pipe_names

['tagger', 'parser', 'ner']

### Tokenization-
* The first step in processing any text is to split all of its component parts, i.e. words and punctuation, into tokens.

### Tokenization is done based on-
* Prefix - Characters at the beginning, e.g. '$'.
* Suffix - Characters at the end, e.g. 'km'.
* Infix - Characters in between, e.g. '-'.
* Exceptions - Special-case rules to split a string into several tokens, or to prevent a token from being split, when punctuation rules are applied, e.g. 'U.S.'.
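The four rule types can be sketched in plain Python. This is only a toy illustration and not spaCy's actual tokenizer: the regular expressions, the `EXCEPTIONS` table, and the `tokenize` function are all assumptions made up for this example.

```python
import re

# Toy rules for illustration only; spaCy's real rule sets are far larger.
PREFIX_RE = re.compile(r'^[$("]')                         # e.g. '$'
SUFFIX_RE = re.compile(r'(?:(?<=\d)km|[).,!?"])$')        # e.g. 'km' after a digit
INFIX_RE = re.compile(r'(?<=[A-Za-z])-(?=[A-Za-z])')      # e.g. '-' between letters
EXCEPTIONS = {"U.S.": ["U.S."]}                           # strings kept as-is


def tokenize(text):
    """Split text on whitespace, then apply prefix/suffix/infix/exception rules."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel off prefixes and suffixes until nothing changes
        # or the remainder is a known exception.
        while chunk and chunk not in EXCEPTIONS:
            m = PREFIX_RE.search(chunk)
            if m and len(chunk) > 1:
                prefixes.append(m.group())
                chunk = chunk[m.end():]
                continue
            m = SUFFIX_RE.search(chunk)
            if m and len(chunk) > len(m.group()):
                suffixes.insert(0, m.group())
                chunk = chunk[:m.start()]
                continue
            break
        tokens.extend(prefixes)
        if chunk in EXCEPTIONS:
            tokens.extend(EXCEPTIONS[chunk])
        elif chunk:
            # Split on infixes, keeping each infix as its own token.
            pos = 0
            for m in INFIX_RE.finditer(chunk):
                tokens.append(chunk[pos:m.start()])
                tokens.append(m.group())
                pos = m.end()
            tokens.append(chunk[pos:])
        tokens.extend(suffixes)
    return tokens


print(tokenize("We drove 10km to a U.S. start-up worth $6 million."))
```

Note how the exception entry stops 'U.S.' from losing its final period, while the same period at the end of 'million.' is split off as a suffix.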

In [27]:
doc2 = nlp("Tesla is not looking into start ups any more.")
for token in doc2:
    print(token.text,token.pos_,token.dep_)

Tesla PROPN nsubj
is AUX aux
not PART neg
looking VERB ROOT
into ADP prep
start NOUN pobj
ups NOUN advcl
any DET advmod
more ADV advmod
. PUNCT punct


### Spans
Large Doc objects can be hard to work with at times. A span is a slice of a Doc object in the form Doc[start:stop].

In [28]:
doc3 = nlp("Databases are a great, secure, and reliable way to store data. All major relational databases have something in common — SQL — a language to manipulate databases, tables, and data. SQL is a broad topic to cover, especially when dealing with different database vendors, such as Microsoft, IBM, or Oracle, so let’s start with SQLite — the most lightweight database system.")

In [29]:
limit_text = doc3[15:50]
limit_text

major relational databases have something in common — SQL — a language to manipulate databases, tables, and data. SQL is a broad topic to cover, especially when dealing with different database

In [30]:
for sentence in doc3.sents:
    print(sentence)

Databases are a great, secure, and reliable way to store data.
All major relational databases have something in common — SQL — a language to manipulate databases, tables, and data.
SQL is a broad topic to cover, especially when dealing with different database vendors, such as Microsoft, IBM, or Oracle, so let’s start with SQLite — the most lightweight database system.


### Vocab
Vocab is the list of tokens that a loaded model contains. The model used here is 'en_core_web_sm', which has its own vocabulary of tokens.

In [31]:
len(doc.vocab)
# Number of entries currently in the vocab of the loaded en_core_web_sm model

545

### Named Entity Recognition (NER)-
* Named entities add another layer of context: the language model we loaded at the beginning recognizes organization names, locations, etc.
* These are available through the .ents property of the Doc object.

In [32]:
doc7 = nlp("Apple to build a hong kong factory for $6 million")

In [33]:
for entity in doc7.ents:
    print(entity)
    print(entity.label_)
    print(str(spacy.explain(entity.label_)))
    print()

Apple
ORG
Companies, agencies, institutions, etc.

hong kong
GPE
Countries, cities, states

$6 million
MONEY
Monetary values, including unit



### Noun chunks-
A noun chunk can be defined as a noun plus the words describing that noun.


In [34]:
doc7 = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc7.noun_chunks:
    print(chunk)

Autonomous cars
insurance liability
manufacturers


### displaCy-

In [35]:
doc = nlp("Apple is going to build a U.K. factory $6 million.")

In [36]:
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)

### Stemming-

* Often when searching for a certain keyword, it helps if the search returns variations of the word.

* For instance, searching for "boat" might also return "boats" and "boating". Here "boat" would be the stem for "boat", "boater", "boating", and "boats".

* Stemming chops letters off the end of a word until the stem is reached.
#### But the English language has too many exceptions.

So we need a more sophisticated way to reach a root word, which is why spaCy uses lemmatization.
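To see why plain suffix chopping runs into trouble, here is a minimal sketch of a naive stemmer. The suffix list and the `naive_stem` function are hypothetical and much cruder than the Porter or Snowball algorithms.

```python
def naive_stem(word, suffixes=("ing", "ly", "es", "s")):
    """Strip the first matching suffix; a toy rule set, not the Porter algorithm."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


for word in ["boats", "boating", "running", "ran", "mice"]:
    print(word + "---->" + naive_stem(word))
```

Regular forms such as "boats" and "boating" reduce to "boat", but "running" becomes "runn", and irregular forms such as "ran" and "mice" are left untouched.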

#### Porter Stemmer-

In [37]:
import nltk
from nltk.stem.porter import PorterStemmer

In [38]:
p_stemmer = PorterStemmer()

In [39]:
words = ['run','runner','ran','runs','easily','fairly']

In [40]:
for i in words:
    print(i + '---->' + p_stemmer.stem(i))

run---->run
runner---->runner
ran---->ran
runs---->run
easily---->easili
fairly---->fairli


#### Snowball Stemmer-

In [41]:
from nltk.stem.snowball import SnowballStemmer

In [42]:
s_stemmer = SnowballStemmer(language='english')

In [43]:
for i in words:
    print(i + '---->' + s_stemmer.stem(i))

run---->run
runner---->runner
ran---->ran
runs---->run
easily---->easili
fairly---->fair


### Lemmatization-
* In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full vocabulary to apply morphological analysis to words.
* Lemmatization looks at the surrounding text to determine a given word's part of speech; it does not categorize phrases.
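The role of part of speech can be illustrated with a tiny lookup-based sketch. The `LEMMA_TABLE` entries and the `lemmatize` function below are hypothetical and hand-written for this example; spaCy's own lemmatizer is considerably more elaborate.

```python
# Hypothetical lookup table keyed on (word, part of speech); illustration only.
LEMMA_TABLE = {
    ("ran", "VERB"): "run",
    ("running", "VERB"): "run",
    ("am", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos), falling back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(lemmatize("ran", "VERB"))       # irregular form resolved by lookup
print(lemmatize("meeting", "VERB"))   # "We are meeting today"
print(lemmatize("meeting", "NOUN"))   # "The meeting ran late"
```

The same surface form "meeting" maps to different lemmas depending on its part of speech, which is something suffix stripping alone cannot capture.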

In [44]:
doc1 = nlp("I am a runner running in a race because I love to run since I ran today")

In [45]:
for token in doc1:
    print(token.text, '\t',token.lemma_)

I 	 -PRON-
am 	 be
a 	 a
runner 	 runner
running 	 run
in 	 in
a 	 a
race 	 race
because 	 because
I 	 -PRON-
love 	 love
to 	 to
run 	 run
since 	 since
I 	 -PRON-
ran 	 run
today 	 today


In [47]:
# Compare the lemma hash values to check whether different words reduce to the same lemma
for token in doc1:
    print(token.text, '\t',token.lemma_,'\t',token.lemma)

I 	 -PRON- 	 561228191312463089
am 	 be 	 10382539506755952630
a 	 a 	 11901859001352538922
runner 	 runner 	 12640964157389618806
running 	 run 	 12767647472892411841
in 	 in 	 3002984154512732771
a 	 a 	 11901859001352538922
race 	 race 	 8048469955494714898
because 	 because 	 16950148841647037698
I 	 -PRON- 	 561228191312463089
love 	 love 	 3702023516439754181
to 	 to 	 3791531372978436496
run 	 run 	 12767647472892411841
since 	 since 	 10066841407251338481
I 	 -PRON- 	 561228191312463089
ran 	 run 	 12767647472892411841
today 	 today 	 11042482332948150395


### Stop words-
* Words like "a" and "the" appear so frequently that they don't require tagging as thoroughly as nouns, verbs, and modifiers.
* We call them stop words, and they can be filtered from the text to be processed.
* spaCy holds a built-in list of some 326 English stop words.
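Filtering stop words can be sketched in a few lines of plain Python. The `STOP_WORDS` set here is a tiny made-up subset for illustration, not spaCy's list, and `remove_stop_words` is a hypothetical helper.

```python
# Tiny illustrative subset; spaCy's built-in list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The cat is sitting in a box".split()
print(remove_stop_words(tokens))
```

Comparing on the lowercase form means "The" at the start of a sentence is filtered just like "the".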

In [50]:
print(len(nlp.Defaults.stop_words))

326


### Add a stop word to the vocab-

In [53]:
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True

In [54]:
print(len(nlp.Defaults.stop_words))

327


### Remove a stop word-

In [55]:
nlp.Defaults.stop_words.remove('btw')
nlp.vocab['btw'].is_stop = False

In [56]:
print(len(nlp.Defaults.stop_words))

326


### Phrase matching-

This can be considered a more powerful version of regular expressions: patterns are written against token attributes rather than raw characters.

In [58]:
from spacy.matcher import Matcher

In [59]:
matcher = Matcher(nlp.vocab)

In [62]:
# SolarPower
pattern1 = [{'LOWER':'solarpower'}]
# Solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True},{'LOWER':'power'}]
# Solar power
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]

In [64]:
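# spaCy v2 signature (the second argument is an optional callback).
# In spaCy v3 the patterns are passed as a list: matcher.add('SolarPower', [pattern1, pattern2, pattern3])
matcher.add('SolarPower',None,pattern1,pattern2,pattern3)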

In [65]:
doc = nlp("The solar power industry continues to grow as solarpower increases. Solar-power is great")

In [66]:
found_matches = matcher(doc)
found_matches

[(8656102463236116519, 1, 3),
 (8656102463236116519, 8, 9),
 (8656102463236116519, 11, 14)]
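Each tuple in the result is (match id, start token index, end token index), so the three matches cover tokens 1 to 3, 8 to 9, and 11 to 14. To make the idea of token-level patterns concrete, here is a rough pure-Python sketch. It is not how spaCy implements its Matcher: `token_matches` and `find_matches` are hypothetical helpers that support only the two attributes used above, and the sentence is tokenized by hand.

```python
import string


def token_matches(token, spec):
    """Check one token against one pattern dict (only LOWER and IS_PUNCT supported)."""
    for attr, expected in spec.items():
        if attr == "LOWER":
            if token.lower() != expected:
                return False
        elif attr == "IS_PUNCT":
            is_punct = all(ch in string.punctuation for ch in token)
            if is_punct != expected:
                return False
        else:
            raise ValueError("unsupported attribute: " + attr)
    return True


def find_matches(tokens, patterns):
    """Return (start, end) index pairs where any pattern matches consecutive tokens."""
    matches = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            window = tokens[start:end]
            if len(window) == len(pattern) and all(
                token_matches(tok, spec) for tok, spec in zip(window, pattern)
            ):
                matches.append((start, end))
    return matches


# Hand-tokenized sentence, so no model is needed for this sketch.
tokens = ["The", "solar", "power", "industry", "continues", "to", "grow", "as",
          "solarpower", "increases", ".", "Solar", "-", "power", "is", "great"]
patterns = [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
]

for start, end in find_matches(tokens, patterns):
    print(start, end, " ".join(tokens[start:end]))
```

Because each pattern element describes a whole token, "Solar-power" is matched as three tokens (word, punctuation, word) rather than as a character sequence.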