# spaCy

Installing spaCy:

```shell
pip install spacy
```

The following demo is adapted from the spaCy site and the [Real Python web site](https://realpython.com/natural-language-processing-spacy-python/).

In [None]:
# Uncomment to download the pipelines (only needed once per environment)
#!python -m spacy download en_core_web_sm  # English
#!python -m spacy download es_core_news_sm  # Spanish
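
If a pipeline has not been downloaded yet, `spacy.load` raises an `OSError`. The following is a minimal sketch of a guard around it. The helper name `load_pipeline` and the placeholder package name are illustrative assumptions, not part of spaCy.

```python
import spacy


def load_pipeline(name):
    """Return the named pipeline, or None (with a hint) if it is not installed."""
    try:
        return spacy.load(name)
    except OSError:
        # spaCy raises OSError when the package cannot be found
        print(f"Pipeline {name!r} not found. Try: python -m spacy download {name}")
        return None


# Placeholder name, chosen so that the fallback branch is exercised
missing = load_pipeline("pipeline_that_is_not_installed")
print(missing)
```

Calling `load_pipeline("en_core_web_sm")` instead returns the loaded pipeline once the download above has been run.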

## Simple Example

In [None]:
import spacy

# Load the small English pipeline: tokenizer, tagger, parser and NER
# (the "sm" packages do not ship real word vectors)
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

In [None]:
for chunk in doc.noun_chunks:
    print(chunk.text)

## Complete exercise

In [2]:
import spacy
nlp = spacy.load("es_core_news_sm")

text = "Esto es una demo para familiarizarse con la librería spaCy, que se utiliza para el procesamiento del lenguaje natural"

In [8]:
doc = nlp(text)

In [9]:
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Noun phrases: ['Esto', 'una demo', 'familiarizarse', 'la librería', 'que', 'se', 'el procesamiento', 'lenguaje']
Verbs: ['utilizar']


In [3]:
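file_name = './data/sample_text.txt'
# Read as UTF-8 so accented characters survive, and close the file promptly
with open(file_name, encoding='utf-8') as text_file:
    introduction_file_text = text_file.read()
text_to_doc = nlp(introduction_file_text)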

print("La frase es:", [token.text for token in text_to_doc])

La frase es: ['Esto', 'es', 'una', 'prueba', 'de', 'texto', 'en', 'español', 'para', 'ver', 'si', 'la', 'librería', 'SpaCy', 'es', 'capaz', 'de', 'sacar', 'los', 'verbos', 'y', 'sustantivos', '.', '\n']


In [6]:
print("Los sustantivos son: ", [token.lemma_ for token in text_to_doc if token.pos_ == "NOUN"])
print("Los verbos son: ", [token.lemma_ for token in text_to_doc if token.pos_ == "VERB"])

Los sustantivos son:  ['probar', 'texto', 'español', 'librería', 'verbo']
Los verbos son:  ['sacar']


### Sentence detection

Sentence detection is the process of locating the start and end of sentences in a given text. You’ll use these units when processing your text for tasks such as **part of speech tagging** and **entity extraction**.

In spaCy, the `sents` property of a `Doc` is used to extract sentences. Here’s how you would extract the sentences for a given input text (their total number is `len(sentences)`):

In [9]:
text_with_sentences = ('Dado que hemos cargado el diccionario en español en esta segunda parte'
                       ' probaremos la detección de frases con texto en español.'
                       ' Puede ser muy útil para procesar pequeñas porciones de texto'
                       ' de esta manera las conclusiones obtenidas pueden ser mejores.'
                       ' El resultado es el siguiente.')

sentence_to_doc = nlp(text_with_sentences)
sentences = list(sentence_to_doc.sents)

for sentence in sentences:
    print(sentence)

Dado que hemos cargado el diccionario en español en esta segunda parte probaremos la detección de frases con texto en español.
Puede ser muy útil para procesar pequeñas porciones de texto de esta manera las conclusiones obtenidas pueden ser mejores.
El resultado es el siguiente.
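
Sentence boundaries can also be found without a trained pipeline. The sketch below uses spaCy's rule-based `sentencizer` component on a blank English pipeline and prints the sentence count. It assumes spaCy v3, where components are added by string name, and the variable names are illustrative.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
rule_nlp = spacy.blank("en")
# Adding a component by its string name is the spaCy v3 API (assumed version)
rule_nlp.add_pipe("sentencizer")

rule_doc = rule_nlp(
    "This is the first sentence. Here is the second one! Is this the third?"
)
rule_sentences = list(rule_doc.sents)

# Total number of sentences, followed by each sentence
print(len(rule_sentences))
for sentence in rule_sentences:
    print(sentence.text)
```

The `sentencizer` only splits on sentence-final punctuation, so it is faster but less accurate than the parser-based boundaries used above.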


### Tokenization

Tokenization is the next step after sentence detection. It breaks a text into its basic meaningful units, called tokens, which are then used for further analysis such as part of speech tagging. spaCy also preserves the starting character index of each token (`token.idx`), which is useful for in-place word replacement.

In spaCy, you can print tokens by iterating over the `Doc` object:

In [None]:
for token in sentence_to_doc:
    print(token, token.idx)
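
Because each token records its starting character offset in `token.idx`, a word can be swapped in the original string without disturbing the surrounding text. This is a minimal sketch on a blank English pipeline. The helper `replace_word` and the sample sentence are illustrative assumptions.

```python
import spacy

# Tokenizer-only pipeline, so no model download is required
idx_nlp = spacy.blank("en")
idx_doc = idx_nlp("Gus is learning piano.")


def replace_word(doc, target, replacement):
    """Replace every token equal to `target`, using character offsets."""
    text = doc.text
    # Walk the tokens right to left so earlier offsets stay valid
    for token in reversed(list(doc)):
        if token.text == target:
            start = token.idx
            end = start + len(token.text)
            text = text[:start] + replacement + text[end:]
    return text


result = replace_word(idx_doc, "piano", "guitar")
print(result)
```

Working from offsets rather than calling `str.replace` means only whole tokens are replaced, so a substring inside a longer word is left alone.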

spaCy provides various attributes for the Token class:

- `text_with_ws` prints token text with trailing space (if present).
- `is_alpha` detects if the token consists of alphabetic characters or not.
- `is_punct` detects if the token is a punctuation symbol or not.
- `is_space` detects if the token is a space or not.
- `shape_` prints out the shape of the word.
- `is_stop` detects if the token is a stop word or not.

In [None]:
for token in sentence_to_doc:
    print(token,
          token.idx,
          token.text_with_ws,
          token.is_alpha,
          token.is_punct,
          token.is_space,
          token.shape_,
          token.is_stop, sep="-")

### Custom Tokenization

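spaCy lets you customize tokenization by assigning a new `Tokenizer` object to `nlp.tokenizer`. The `Tokenizer` constructor accepts the following:

- `nlp.vocab` is the shared vocabulary in which the tokenizer stores its strings. Special cases such as contractions and emoticons are supplied through the separate `rules` argument.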
- `prefix_search` is the function that is used to handle preceding punctuation, such as opening parentheses.
- `infix_finditer` is the function that is used to handle non-whitespace separators, such as hyphens.
- `suffix_search` is the function that is used to handle succeeding punctuation, such as closing parentheses.
- `token_match` is an optional boolean function that is used to match strings that should never be split. It overrides the previous rules and is useful for entities like URLs or numbers.

In [None]:
import re
import spacy
from spacy.tokenizer import Tokenizer


text = ('Gus Proto is a Python developer currently working for a '
        'London-based Fintech company. '
        'He is interested in learning Natural Language Processing.')
custom_nlp = spacy.load("en_core_web_sm")

prefix_re = spacy.util.compile_prefix_regex(custom_nlp.Defaults.prefixes)
suffix_re = spacy.util.compile_suffix_regex(custom_nlp.Defaults.suffixes)

infix_re = re.compile(r'''[-~]''')

def customize_tokenizer(nlp):
    # Adds support to use `-` as the delimiter for tokenization
    return Tokenizer(nlp.vocab, 
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None
                     )

custom_nlp.tokenizer = customize_tokenizer(custom_nlp)
custom_tokenizer_about_doc = custom_nlp(text)

print([token.text for token in custom_tokenizer_about_doc])
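
Special cases can also be registered on an existing tokenizer instead of building a new one. The sketch below uses `add_special_case` on a blank English pipeline to split a contraction-like word. The example string follows the one used in the spaCy documentation, and the variable names are illustrative.

```python
import spacy
from spacy.symbols import ORTH

special_nlp = spacy.blank("en")

# Default behaviour: the word stays in one piece
before = [token.text for token in special_nlp("gimme that")]

# Register a rule that splits "gimme" into two tokens.
# The ORTH values must join back into the original string.
special_nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [token.text for token in special_nlp("gimme that")]

print(before)
print(after)
```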

## Stop Words

Stop words are the most common words in a language. In English, some examples of stop words are `the`, `are`, `but`, and `they`. Most sentences need to contain stop words in order to be full sentences that make sense.

In [34]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

326

In [None]:
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

Removing those words from the text strips out "noise" that can obscure what a sentence is actually about.

In [None]:
text = ('Gus Proto is a Python developer currently working for a '
        'London-based Fintech company. '
        'He is interested in learning Natural Language Processing.')
# English stop words are only flagged by an English pipeline, and `nlp`
# was last bound to the Spanish one, so load the English pipeline here.
nlp = spacy.load("en_core_web_sm")
text_to_doc = nlp(text)

for token in text_to_doc:
    if not token.is_stop:
        print(token)

We can also create a **list** of tokens not containing stop words. The output can be used to form a sentence with no stop words.

In [39]:
about_no_stopword_doc = [token for token in text_to_doc if not token.is_stop]
print(about_no_stopword_doc)

[Gus, Proto, Python, developer, currently, working, London, -, based, Fintech, company, ., interested, learning, Natural, Language, Processing, .]
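
To turn the filtered tokens back into a sentence, their texts can be joined into a string. The sketch below does this on a blank English pipeline, which is enough because `is_stop` and `is_punct` are lexical attributes that need no trained model. The sample sentence and variable names are illustrative.

```python
import spacy

stop_nlp = spacy.blank("en")
stop_doc = stop_nlp("Gus is a Python developer.")

# Keep tokens that are neither stop words nor punctuation
kept = [token.text for token in stop_doc
        if not token.is_stop and not token.is_punct]

# Join the surviving tokens into a condensed sentence
condensed = " ".join(kept)
print(condensed)
```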


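## Lemmatization

Lemmatization is the process of reducing the inflected forms of a word to a single base form, the lemma. For example, the Spanish verb forms `canto`, `cantas`, and `cantamos` all share the lemma `cantar`. In spaCy, the lemma of a token is available as `token.lemma_`:
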
In [None]:
import spacy
nlp = spacy.load("es_core_news_sm")
text = 'Soy un texto que pide a gritos que lo procesen. Por eso yo canto, tú cantas, ella canta, nosotros cantamos, cantáis, cantan…'
doc = nlp(text)
lemmas = [tok.lemma_.lower() for tok in doc]

print("Original\t\t  Lemma")
print("----------------------------------")
for token in doc:
    print(token, "\t\t", token.lemma_)
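
The idea behind lemmatization can be illustrated with a plain lookup table. This is a toy sketch only: the table `TOY_LEMMAS` and the helper `toy_lemma` are hand-written assumptions for illustration and are not how spaCy's trained pipelines compute `token.lemma_`.

```python
# Hand-written lookup table: a toy stand-in, not spaCy's lemmatizer
TOY_LEMMAS = {
    "canto": "cantar",
    "cantas": "cantar",
    "canta": "cantar",
    "cantamos": "cantar",
    "cantáis": "cantar",
    "cantan": "cantar",
}


def toy_lemma(word):
    """Return the table entry for a word, or the lowercased word itself."""
    key = word.lower()
    return TOY_LEMMAS.get(key, key)


forms = ["canto", "Cantas", "cantamos", "texto"]
toy_lemmas = [toy_lemma(form) for form in forms]
print(toy_lemmas)
```

A lookup like this ignores context, so it cannot tell the verb form `canto` from the noun `canto`. That ambiguity is one reason a pipeline with part of speech information tends to give better lemmas.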

## Word Frequency

You can now convert a given text into tokens and perform statistical analysis over it. This analysis can give you various insights about word patterns, such as common words or unique words in the text:

In [None]:
from collections import Counter

nlp = spacy.load("en_core_web_sm")

complete_text = ('Gus Proto is a Python developer currently'
                 ' working for a London-based Fintech company. He is'
                 ' interested in learning Natural Language Processing.'
                 ' There is a developer conference happening on 21 July'
                 ' 2019 in London. It is titled "Applications of Natural'
                 ' Language Processing". There is a helpline number '
                 ' available at +1-1234567891. Gus is helping organize it.'
                 ' He keeps organizing local Python meetups and several'
                 ' internal talks at his workplace. Gus is also presenting'
                 ' a talk. The talk will introduce the reader about "Use'
                 ' cases of Natural Language Processing in Fintech".'
                 ' Apart from his work, he is very passionate about music.'
                 ' Gus is learning to play the Piano. He has enrolled '
                 ' himself in the weekend batch of Great Piano Academy.'
                 ' Great Piano Academy is situated in Mayfair or the City'
                 ' of London and has world-class piano instructors.')

complete_doc = nlp(complete_text)

# Remove stop words and punctuation symbols
words = [token.text for token in complete_doc
         if not token.is_stop and not token.is_punct]
word_freq = Counter(words)

# 5 commonly occurring words with their frequencies
common_words = word_freq.most_common(5)
print (common_words, "\n")

# Unique words
unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
print (unique_words)

The following example repeats the analysis above but keeps the **stop words**:

In [24]:
words_all = [token.text for token in complete_doc if not token.is_punct]
word_freq_all = Counter(words_all)
# 5 commonly occurring words with their frequencies
common_words_all = word_freq_all.most_common(5)
print (common_words_all)

[('is', 10), ('a', 5), ('in', 5), ('Gus', 4), ('of', 4)]


Four out of five of the most common words are stop words, which don’t tell you much about the text. If you consider stop words while doing word frequency analysis, then you won’t be able to derive meaningful insights from the input text. This is why removing stop words is so important.
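
A related refinement is to normalize case before counting, so that `Piano` and `piano` are treated as the same word. The sketch below uses `token.lower_` on a blank English pipeline. The sample text and variable names are illustrative assumptions.

```python
from collections import Counter

import spacy

freq_nlp = spacy.blank("en")
freq_doc = freq_nlp("Gus plays the Piano. The piano is great.")

# Lowercase each token and drop stop words and punctuation before counting
normalized = [token.lower_ for token in freq_doc
              if not token.is_stop and not token.is_punct]
normalized_freq = Counter(normalized)

print(normalized_freq.most_common(1))
```

Counting `token.lemma_.lower()` instead would additionally merge inflected forms, at the cost of needing a pipeline with a lemmatizer.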

## Part of Speech Tagging

Part of speech, or POS, is a grammatical role that explains how a particular word is used in a sentence. Traditional English grammar distinguishes eight parts of speech:

- Noun
- Pronoun
- Adjective
- Verb
- Adverb
- Preposition
- Conjunction
- Interjection

Part of speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category like noun or verb to each word.

In [None]:
for token in complete_doc:
    print (token, token.tag_, token.pos_, spacy.explain(token.tag_))

In [None]:
nouns = []
adjectives = []
for token in complete_doc:
    if token.pos_ == 'NOUN':
        nouns.append(token)
    if token.pos_ == 'ADJ':
        adjectives.append(token)
print(nouns, "\n")
print(adjectives)
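
The values in `token.pos_` are Universal POS tags, and `spacy.explain` returns a short description for each. The sketch below prints a legend for the eight categories listed above. The mapping `UPOS_FOR_CATEGORY` is an assumption made for illustration; for instance, conjunctions are split into `CCONJ` and `SCONJ`, and only the first is shown.

```python
import spacy

# Assumed mapping from the eight traditional categories to Universal POS tags
UPOS_FOR_CATEGORY = {
    "Noun": "NOUN",
    "Pronoun": "PRON",
    "Adjective": "ADJ",
    "Verb": "VERB",
    "Adverb": "ADV",
    "Preposition": "ADP",
    "Conjunction": "CCONJ",
    "Interjection": "INTJ",
}

# spacy.explain looks the tag up in spaCy's glossary; no pipeline is needed
legend = {tag: spacy.explain(tag) for tag in UPOS_FOR_CATEGORY.values()}

for category, tag in UPOS_FOR_CATEGORY.items():
    print(f"{category:<13}{tag:<6}{legend[tag]}")
```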

## Visualization: Using displaCy

spaCy comes with a built-in visualizer called displaCy. You can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

You can use displaCy to find POS tags for tokens:

In [33]:
from spacy import displacy
about_interest_text = ('He is interested in learning'
    ' Natural Language Processing.')
about_interest_doc = nlp(about_interest_text)
displacy.render(about_interest_doc, style='dep', jupyter=True)
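
displaCy can also return its markup as a string, which is convenient outside a notebook. The sketch below renders named entities from hand-written annotations using `manual=True`, so no trained pipeline is involved. The sentence, the labels, and the helper `char_span` are illustrative assumptions and not model predictions.

```python
from spacy import displacy

sentence = "Gus Proto works in London."


def char_span(text, phrase, label):
    """Build a displaCy entity dict from the first occurrence of `phrase`."""
    start = text.find(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# Hand-written annotations in displaCy's manual format
manual_doc = {
    "text": sentence,
    "ents": [
        char_span(sentence, "Gus Proto", "PERSON"),
        char_span(sentence, "London", "GPE"),
    ],
    "title": None,
}

# manual=True skips the Doc object; jupyter=False returns the markup as a string
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(type(html).__name__, len(html) > 0)
```

The returned string can be written to an `.html` file and opened in a browser. Passing a processed `Doc` without `manual=True` renders the pipeline's own predictions instead.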

## Preprocessing Functions

You can create a preprocessing function that takes text as input and applies the following operations:

- Lowercases the text
- Lemmatizes each token
- Removes punctuation symbols
- Removes stop words


A preprocessing function converts text to an analyzable format. It’s necessary for most NLP tasks. Here’s an example:

In [36]:
def is_token_allowed(token):
    '''
        Only allow valid tokens which are not stop words
        and punctuation symbols.
    '''
    # `token.text` is used instead of `token.string`, which spaCy v3 removed
    if (not token or not token.text.strip() or
        token.is_stop or token.is_punct):
        return False
    return True


def preprocess_token(token):
    # Reduce token to its lowercase lemma form
    return token.lemma_.strip().lower()


# Collect the preprocessed form of every allowed token
complete_filtered_tokens = [preprocess_token(token)
                            for token in complete_doc
                            if is_token_allowed(token)]

print(complete_filtered_tokens)

['gus', 'proto', 'python', 'developer', 'currentlyworking', 'london', 'base', 'fintech', 'company', 'interested', 'learn', 'natural', 'language', 'processing', 'developer', 'conference', 'happen', '21', 'july', '2019', 'london', 'title', 'applications', 'natural', 'language', 'processing', 'helpline', 'number', 'available', '+1', '1234567891', 'gus', 'help', 'organize', 'keep', 'organize', 'local', 'python', 'meetup', 'internal', 'talk', 'workplace', 'gus', 'present', 'talk', 'talk', 'introduce', 'reader', 'use', 'case', 'natural', 'language', 'processing', 'fintech', 'apart', 'work', 'passionate', 'music', 'gus', 'learn', 'play', 'piano', 'enrol', 'weekend', 'batch', 'great', 'piano', 'academy', 'great', 'piano', 'academy', 'situate', 'mayfair', 'city', 'london', 'world', 'class', 'piano', 'instructor']


## Rule-Based Matching

Rule-based matching extracts entities (such as phone numbers) from unstructured text using token patterns, which may themselves contain regular expressions. It differs from plain regular expressions in that the patterns can take the lexical and grammatical attributes of the text into account.

With rule-based matching, you can extract a first name and a last name, which are always proper nouns.

In this example, `pattern` is a list of objects that defines the combination of tokens to be matched. Both POS tags in it are `PROPN` (proper noun), so the pattern matches two consecutive tokens that are both proper nouns. The pattern is then added to the `Matcher` under the match ID `FULL_NAME`. Finally, matches are returned with their start and end indexes.

The phone number pattern also uses the following token attributes:

- `ORTH` gives the exact text of the token.
- `SHAPE` transforms the token string to show orthographic features.
- `OP` defines operators. Using ? as a value means that the pattern is optional, meaning it can match 0 or 1 times.

In [66]:
from spacy.matcher import Matcher

text = ('Gus Proto is a Python developer currently working for a '
        'London-based Fintech company. '
        'He is interested in learning Natural Language Processing.')

conference_org_text = ('There is a developer conference'
    ' happening on 21 July 2019 in London. It is titled'
    ' "Applications of Natural Language Processing".'
    ' There is a helpline number available'
    ' at (123) 456-789')

text_to_doc = nlp(text)
conference_org_doc = nlp(conference_org_text)


def extract_full_name(nlp_doc):
    # A fresh Matcher per function keeps the two patterns from mixing:
    # with a shared matcher, FULL_NAME matches such as "Natural Language"
    # would also be returned when looking for a phone number.
    matcher = Matcher(nlp.vocab)
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    # spaCy v3 signature; in v2 this was matcher.add('FULL_NAME', None, pattern)
    matcher.add('FULL_NAME', [pattern])
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text  # first match only

def extract_phone_number(nlp_doc):
    matcher = Matcher(nlp.vocab)
    pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'},
               {'ORTH': ')'}, {'SHAPE': 'ddd'},
               {'ORTH': '-', 'OP': '?'},
               {'SHAPE': 'ddd'}]
    matcher.add('PHONE_NUMBER', [pattern])
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text  # first match only


print(extract_full_name(text_to_doc))
print(extract_phone_number(conference_org_doc))

Gus Proto
(123) 456-789
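
The effect of the `?` operator can be seen by running one pattern over a number written with and without the hyphen. The sketch below uses a blank English pipeline, which is sufficient because `ORTH` and `SHAPE` are lexical attributes. It assumes the spaCy v3 form of `Matcher.add`, and the sample strings and variable names are illustrative.

```python
import spacy
from spacy.matcher import Matcher

# ORTH and SHAPE are lexical attributes, so no trained model is needed
phone_nlp = spacy.blank("en")
phone_matcher = Matcher(phone_nlp.vocab)

phone_pattern = [
    {"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"},
    {"SHAPE": "ddd"},
    {"ORTH": "-", "OP": "?"},  # the hyphen may appear zero or one time
    {"SHAPE": "ddd"},
]
# Passing a list of patterns is the spaCy v3 signature (assumed version)
phone_matcher.add("PHONE_NUMBER", [phone_pattern])

found = []
for sample in ("Call (123) 456-789 today", "Call (123) 456 789 today"):
    sample_doc = phone_nlp(sample)
    for match_id, start, end in phone_matcher(sample_doc):
        found.append(sample_doc[start:end].text)

print(found)
```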
