# spaCy:
- spaCy is a free, open-source library for advanced NLP in Python.
- It is made for:
    - production use
    - information extraction
    - natural language understanding systems
    - pre-processing text for deep learning

## spaCy's Statistical Models
## spaCy's Processing Pipeline


## Features of spaCy:
| Name                          | Description                                                                                     |
|-------------------------------|-------------------------------------------------------------------------------------------------|
| Tokenization                  | Segmenting text into words, punctuation marks, etc.                                              |
| Part-of-speech (POS) Tagging | Assigning word types to tokens, like verb or noun.                                               |
| Dependency Parsing            | Assigning syntactic dependency labels, describing the relations between individual tokens.      |
| Lemmatization                 | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”. |
| Sentence Boundary Detection   | Finding and segmenting individual sentences.                                                      |
| Named Entity Recognition (NER)| Labelling named “real-world” objects, like persons, companies or locations.                      |
| Entity Linking (EL)          | Disambiguating textual entities to unique identifiers in a knowledge base.                       |
| Similarity                    | Comparing words, text spans and documents and how similar they are to each other.                 |
| Text Classification          | Assigning categories or labels to a whole document, or parts of a document.                        |
| Rule-based Matching          | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. |
| Training                      | Updating and improving a statistical model’s predictions.                                         |
| Serialization                | Saving objects to files or byte strings.                                                          |


### Statistical Models in spaCy:
- en_core_web_sm (small)
- en_core_web_md (medium)
- en_core_web_lg (large)

A model must be loaded with ```spacy.load()```.
- It returns a callable Language object, commonly named nlp.

In [6]:
# Firstly, download the models as required:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 6.1 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [7]:
# spaCy itself must be installed before a model can be downloaded; install it if needed:
!pip install spacy
import spacy
nlp = spacy.load("en_core_web_sm")
nlp



<spacy.lang.en.English at 0x1840cebbe50>

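Alternatively, we can import the language class directly from ```spacy.lang.en``` for English. This creates a blank pipeline that contains only the tokenizer, without the trained components of a downloaded model.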

In [23]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()
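
As a rough illustration of what "blank" means here, the sketch below inspects a pipeline created with ```spacy.blank("en")```, which should behave like instantiating the ```English``` class directly. It assumes spaCy v3 is installed; the variable name `blank_nlp` is illustrative and not part of the notebook.

```python
import spacy

# Assumption: spaCy v3 is installed; a blank pipeline needs no downloaded model.
blank_nlp = spacy.blank("en")

# A blank pipeline has no components such as a tagger or parser.
print(blank_nlp.pipe_names)

# Tokenization still works, because the tokenizer is always part of the Language object.
doc = blank_nlp("Hello world!")
print([token.text for token in doc])

# Without a tagger, the part-of-speech annotations stay empty strings.
print([token.pos_ for token in doc])
```

Features from the table above that depend on trained components (POS tagging, dependency parsing, NER) therefore need a loaded model such as en_core_web_sm.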

### Doc Object
- You can instantiate a Doc object by calling the Language object with the input string as an argument.
- Iterating over a Doc object, for instance with a list comprehension, yields a series of Token objects.
### Token Object:
- A token represents a single unit of text, for example a word or a punctuation character.
- To get the token at a specific position, you can index into the Doc.
- Token objects provide various attributes that let you access more information about the tokens.
    - For example, the .text attribute returns the text contained within a token.

In [24]:
introduction_doc = nlp("This tutorial is about Natural Language Processing in spaCy.")
type(introduction_doc)

spacy.tokens.doc.Doc

In [14]:
[token.text for token in introduction_doc]

['This',
 'tutorial',
 'is',
 'about',
 'Natural',
 'Language',
 'Processing',
 'in',
 'spaCy',
 '.']

In [27]:
# Index into the Doc to get a specific token:
token1 = introduction_doc[0]
token1

This
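
A Doc also supports the usual Python sequence operations. The following is a small sketch, assuming a blank English pipeline (the name `nlp_blank` is illustrative), that shows ```len()```, negative indexing, and two token attributes: ```i``` (the token index) and ```idx``` (the character offset of the token in the original text).

```python
import spacy

# Assumption: a blank pipeline is enough here, since only tokenization is needed.
nlp_blank = spacy.blank("en")
doc = nlp_blank("This tutorial is about Natural Language Processing in spaCy.")

# Number of tokens in the Doc
print(len(doc))

# Negative indices count from the end, as with Python lists
last_token = doc[-1]
print(last_token.text)

# Token index within the Doc vs. character offset within the text
second = doc[1]
print(second.text, second.i, second.idx)
```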

### Span Object:
- A span object is a slice of the document consisting of one or more tokens.
- It's only a view of the Doc and doesn't contain any data itself.
- To create a span, you can use Python's slice notation.
- For example:
    - 1:3 will create a slice starting from the token at position 1, up to – but not including! – the token at position 3.


In [30]:
doc = nlp("How are you doing?")

# A slice from the doc object is a span object
span = doc[1:3]

# to get span text, use .text attribute
print(span.text)

are you
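
To make the "view" idea concrete, the sketch below checks that a span only records where it starts and ends in its parent Doc, and that its tokens keep their original indices. It assumes a blank English pipeline; `nlp_blank` is an illustrative name.

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("How are you doing?")
span = doc[1:3]

# The span stores its boundaries as token indices into the parent Doc
print(span.start, span.end)

# The span refers back to the Doc it was sliced from
print(span.doc is doc)

# Tokens inside the span keep their position in the parent Doc
print([token.i for token in span])
```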


### Lexical Attributes

Some available token attributes:

- ```i``` is the index of the token within the parent document.
- ```text``` returns the token text.
- ```is_alpha``` returns a boolean indicating whether the token consists of alphabetic characters.
    - example: the word "ten"
- ```is_punct``` returns a boolean indicating whether the token is punctuation.
    - example: the comma in "one, zero"
- ```like_num``` returns a boolean indicating whether the token resembles a number.
    - example: the token "10"
    
These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.

In [33]:
# Create doc object:
doc = nlp("Hello sir, here is you coffee! It costs $10")

# lexical attributes: i, text
print("Indexes = ",[token.i for token in doc])
print("Text = ", [token.text for token in doc])

# lexical attributes: is_alpha, is_punct, like_num
print("is_alpha = ", [token.is_alpha for token in doc])
print("is_punct = ", [token.is_punct for token in doc])
print("like_num = ", [token.like_num for token in doc])

Indexes =  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
Text =  ['Hello', 'sir', ',', 'here', 'is', 'you', 'coffee', '!', 'It', 'costs', '$', '10']
is_alpha =  [True, True, False, True, True, True, True, False, True, True, False, False]
is_punct =  [False, False, True, False, False, False, False, True, False, False, False, False]
like_num =  [False, False, False, False, False, False, False, False, False, False, False, True]
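
Because attributes such as ```is_alpha```, ```is_punct``` and ```like_num``` belong to the vocabulary entry rather than to a token in context, they can also be read without creating a Doc at all. The sketch below looks up entries through ```nlp.vocab```; it assumes a blank English pipeline, and the variable names are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")

# Look up vocabulary entries (lexemes) directly by string; no Doc is involved.
ten = nlp_blank.vocab["ten"]
comma = nlp_blank.vocab[","]
number = nlp_blank.vocab["10"]

# For English, like_num also recognizes spelled-out number words
print(ten.text, ten.is_alpha, ten.like_num)
print(comma.text, comma.is_punct)
print(number.text, number.like_num, number.is_alpha)
```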


In [34]:
# Example of lexical attributes: find the numbers in front of a percent sign

# Create the doc object:
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

for token in doc:
    # Only look ahead if a following token exists (avoids an IndexError on the last token)
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]

        if next_token.text == "%":
            print("Percentage found :", token)

Percentage found : 60
Percentage found : 4
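
The same lookup can also be written with rule-based matching, one of the features listed in the table at the top. This is an alternative sketch rather than the notebook's own approach: it uses spaCy v3's ```Matcher``` with a two-token pattern, and the rule name "PERCENTAGE" and the variable names are arbitrary choices.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
matcher = Matcher(nlp_blank.vocab)

# Pattern: a number-like token immediately followed by a literal "%"
pattern = [{"LIKE_NUM": True}, {"TEXT": "%"}]
matcher.add("PERCENTAGE", [pattern])  # "PERCENTAGE" is an arbitrary rule name

doc = nlp_blank(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Each match is a (match_id, start, end) tuple; slice the Doc to get the matched span
percentages = [doc[start:end].text for _, start, end in matcher(doc)]
print(percentages)
```

Compared with the loop, the pattern states the condition declaratively and needs no manual bounds check.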


#### Creating a Doc Object from a File


In [18]:
import pathlib

# Read the file as UTF-8 text and pass its contents to the pipeline
file_name = "../test.txt"
introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
print([token.text for token in introduction_doc])

['Hello', ',', 'this', 'is', 'just', 'the', 'dummy', 'text', 'file', 'to', 'test', 'the', 'document', 'object', 'of', 'the', 'Spacy', 'library', '.']


## Sentence Detection:
- locates where sentences start and end in a given text
- divides text into linguistically meaningful units
- useful for "POS tagging" and "Named-Entity Recognition"
- the ```.sents``` property is used to extract sentences from a Doc object

In [20]:
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)

# Create a doc object:
about_doc = nlp(about_text)

# Use the .sents property to get the sentences from the document
sentences = list(about_doc.sents)
print(sentences)

[Gus Proto is a Python developer currently working for a London-based Fintech company., He is interested in learning Natural Language Processing.]


In [22]:
# Iterate over each sentence and print its first five tokens
for sentence in sentences:
    print(f"{sentence[:5]}...")

Gus Proto is a Python...
He is interested in learning...
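
Note that ```.sents``` needs sentence boundaries to be set by some pipeline component. With a loaded model such as en_core_web_sm this happens automatically, but a blank pipeline like ```English()``` has no such component, so accessing ```.sents``` there is expected to raise an error. A minimal sketch of one workaround, assuming spaCy v3, is to add the built-in rule-based ```sentencizer``` component; `nlp_blank` is an illustrative name.

```python
import spacy

nlp_blank = spacy.blank("en")

# "sentencizer" is spaCy's built-in rule-based sentence splitter (v3 API);
# it splits on sentence-final punctuation and needs no trained model.
nlp_blank.add_pipe("sentencizer")

about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)

about_doc = nlp_blank(about_text)
sentences = [sent.text for sent in about_doc.sents]
for sentence in sentences:
    print(sentence)
```

The rule-based splitter is fast but relies on punctuation only, so a trained model may segment informal or unpunctuated text more reliably.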
