# Natural Language Processing With spaCy in Python

Notebook for Introduction to NLP and spaCy.

Source: [Real Python](https://realpython.com/natural-language-processing-spacy-python/#conclusion)

---

In this tutorial, you’ll learn how to:

- Implement NLP in spaCy.
- Customize and extend built-in functionalities in spaCy.
- Perform basic statistical analysis on a text.
- Create a pipeline to process unstructured text.
- Parse a sentence and extract meaningful insights from it.

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")  # small pre-trained English pipeline

In [2]:
nlp

<spacy.lang.en.English at 0x151e674ba00>

In [3]:
introduction_doc = nlp(
    """spaCy is an open-source library for NLP, mostly used in Python and Cython
    which provides pre-trained Neural Networks, specifically, Convolutional Neural
    Networks (CNNs) to perform NLP related taks in several languages (will
    provide images with examples)."""
)
type(introduction_doc)

spacy.tokens.doc.Doc

In [4]:
print([token.text for token in introduction_doc])

['spaCy', 'is', 'an', 'open', '-', 'source', 'library', 'for', 'NLP', ',', 'mostly', 'used', 'in', 'Python', 'and', 'Cython', '\n    ', 'which', 'provides', 'pre', '-', 'trained', 'Neural', 'Networks', ',', 'specifically', ',', 'Convolutional', 'Neural', '\n    ', 'Networks', '(', 'CNNs', ')', 'to', 'perform', 'NLP', 'related', 'taks', 'in', 'several', 'languages', '(', 'will', '\n    ', 'provide', 'images', 'with', 'examples', ')', '.']


In [5]:
import pathlib

In [6]:
# introduction.txt is expected in the notebook's working directory
file_name = "introduction.txt"
introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
print([token.text for token in introduction_doc])

['"', '"', '"', 'spaCy', 'is', 'an', 'open', '-', 'source', 'library', 'for', 'NLP', ',', 'mostly', 'used', 'in', 'Python', 'and', 'Cython', '\n    ', 'which', 'provides', 'pre', '-', 'trained', 'Neural', 'Networks', ',', 'specifically', ',', 'Convolutional', 'Neural', '\n    ', 'Networks', '(', 'CNNs', ')', 'to', 'perform', 'NLP', 'related', 'taks', 'in', 'several', 'languages', '(', 'will', '\n    ', 'provide', 'images', 'with', 'examples', ')', '.', '"', '"', '"']


## Sentence Detection

In [7]:
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
about_doc = nlp(about_text)
sentences = list(about_doc.sents)
len(sentences)

2

In [8]:
sentences[0]

Gus Proto is a Python developer currently working for a London-based Fintech company.

In [9]:
for sentence in sentences:
    print(sentence)

Gus Proto is a Python developer currently working for a London-based Fintech company.
He is interested in learning Natural Language Processing.
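
The cells above rely on the trained `en_core_web_sm` pipeline to find sentence boundaries. As a rough alternative, spaCy also provides a rule-based `sentencizer` component that splits on sentence-final punctuation and needs no trained model. The sketch below assumes spaCy v3, where `add_pipe` takes a component name; `split_sentences` is a hypothetical helper name, not part of spaCy.

```python
import spacy

# A blank English pipeline contains only the tokenizer; the rule-based
# "sentencizer" adds sentence boundaries based on punctuation.
rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")


def split_sentences(text):
    """Return the sentences of `text` as plain strings (hypothetical helper)."""
    return [sent.text for sent in rule_nlp(text).sents]


about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
for sentence in split_sentences(about_text):
    print(sentence)
```

Because the sentencizer only looks at punctuation, it may split differently from the trained pipeline on text with abbreviations or unusual formatting.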


## Tokens in spaCy

In [10]:
nlp = spacy.load("en_core_web_sm")
about_text = (
    """spaCy is an open-source library for NLP, mostly used in Python and Cython
    which provides pre-trained Neural Networks, specifically, Convolutional Neural
    Networks (CNNs) to perform NLP related taks in several languages (will
    provide images with examples)."""
)
about_doc = nlp(about_text)

In [11]:
for token in about_doc:
    print(token, token.idx)

spaCy 0
is 6
an 9
open 12
- 16
source 17
library 24
for 32
NLP 36
, 39
mostly 41
used 48
in 53
Python 56
and 63
Cython 67

     73
which 78
provides 84
pre 93
- 96
trained 97
Neural 105
Networks 112
, 120
specifically 122
, 134
Convolutional 136
Neural 150

     156
Networks 161
( 170
CNNs 171
) 175
to 177
perform 180
NLP 188
related 192
taks 200
in 205
several 208
languages 216
( 226
will 227

     231
provide 236
images 244
with 251
examples 256
) 264
. 265
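
The number printed next to each token is `token.idx`, the character offset at which the token starts in the original string. A minimal sketch of how that offset relates to the text, using a blank English pipeline (tokenizer only) and a shortened example sentence chosen here for illustration:

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only; no trained model needed
text = "spaCy is an open-source library for NLP."
doc = nlp_blank(text)

for token in doc:
    start = token.idx          # character offset of the token's first character
    end = start + len(token)   # len(token) is the token's length in characters
    # Slicing the original string with these offsets recovers the token text.
    assert text[start:end] == token.text
    print(f"{token.text:10} {start:>3}-{end:<3}")

# Tokenization keeps trailing whitespace, so the text can be rebuilt exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```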


In [28]:
print(
    f"{'Text with whitespace':22}"
    f"{'Is Alphabetic?':18}"
    f"{'Is Punctuation?':18}"
    f"{'Is Stop Word?'}"
)
for token in about_doc:
    # is_alpha is True only when every character in the token is alphabetic
    print(
        f"{str(token.text_with_ws):22}"
        f"{str(token.is_alpha):18}"
        f"{str(token.is_punct):18}"
        f"{str(token.is_stop)}"
    )

Text with whitespace  Is Alphabetic?    Is Punctuation?   Is Stop Word?
spaCy                 True              False             False
is                    True              False             True
an                    True              False             True
open                  True              False             False
-                     False             True              False
source                True              False             False
library               True              False             False
for                   True              False             True
NLP                   True              False             False
,                     False             True              False
mostly                True              False             True
used                  True              False             True
in                    True              False             True
Python                True              False             False
and                   True            

---
## Stop Words

Stop words are typically defined as the most common words in a language. In English, examples include *the*, *are*, *but*, and *they*. Most sentences need stop words to be grammatical, but they carry little meaning on their own, so they are often removed before analyzing a text.

In [15]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

326

In [16]:
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

yours
seeming
neither
us
can
whereby
‘s
cannot
myself
keep


In [36]:
about_doc

spaCy is an open-source library for NLP, mostly used in Python and Cython
    which provides pre-trained Neural Networks, specifically, Convolutional Neural
    Networks (CNNs) to perform NLP related taks in several languages (will
    provide images with examples).

In [35]:
len(about_doc)

51

In [33]:
print([token for token in about_doc if not token.is_stop])

[spaCy, open, -, source, library, NLP, ,, Python, Cython, 
    , provides, pre, -, trained, Neural, Networks, ,, specifically, ,, Convolutional, Neural, 
    , Networks, (, CNNs, ), perform, NLP, related, taks, languages, (, 
    , provide, images, examples, ), .]


In [34]:
len([token for token in about_doc if not token.is_stop])

38
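
The filtered list above still contains punctuation and whitespace tokens. A small sketch of a stricter filter, again on a blank English pipeline and a shortened sentence; `content_words` is a hypothetical helper name used only for illustration:

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp_blank = spacy.blank("en")  # tokenizer and lexical attributes only


def content_words(text):
    """Return token texts that are not stop words, punctuation, or whitespace."""
    return [
        token.text
        for token in nlp_blank(text)
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


# The example words named in the text are all in spaCy's English list.
print(all(word in STOP_WORDS for word in ("the", "are", "but", "they")))
print(content_words("spaCy is an open-source library for NLP."))
```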

---
## Lemmatization

Lemmatization reduces the inflected forms of a word to a single base form that is itself a valid word in the language, for example *provides* to *provide*. This base form, or root word, is called a **lemma**.

In [40]:
for token in about_doc:
    if str(token) != str(token.lemma_):
        print(f"{str(token):>20} : {str(token.lemma_)}")

               spaCy : spacy
                  is : be
                used : use
            provides : provide
             trained : train
             related : relate
                taks : tak
           languages : language
              images : image
            examples : example
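
The same comparison can be wrapped in a reusable function. The sketch below is illustrative only: `changed_lemmas` is a hypothetical helper, and the `Doc` is built by hand with lemmas copied from the output above so that it runs without a trained model (this assumes spaCy v3, whose `Doc` constructor accepts a `lemmas` argument).

```python
import spacy
from spacy.tokens import Doc


def changed_lemmas(doc):
    """Map each token text to its lemma when the two differ (hypothetical helper)."""
    return {
        token.text: token.lemma_
        for token in doc
        if token.text != token.lemma_
    }


# Hand-annotated Doc for illustration; the lemmas are copied from the
# notebook output above, not predicted by a model here.
vocab = spacy.blank("en").vocab
words = ["spaCy", "provides", "images", "with", "examples"]
lemmas = ["spacy", "provide", "image", "with", "example"]
sample_doc = Doc(vocab, words=words, lemmas=lemmas)

for text, lemma in changed_lemmas(sample_doc).items():
    print(f"{text:>20} : {lemma}")
```

With the notebook's trained pipeline, `changed_lemmas(about_doc)` would return the same pairs as the loop above, collected into a dictionary.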
