# Natural Language Processing With spaCy in Python

Notebook for Introduction to NLP and spaCy.

Source: [Real Python](https://realpython.com/natural-language-processing-spacy-python/#conclusion)

---

In this tutorial, you’ll learn how to:

- Implement NLP in spaCy.
- Customize and extend built-in functionalities in spaCy.
- Perform basic statistical analysis on a text.
- Create a pipeline to process unstructured text.
- Parse a sentence and extract meaningful insights from it.

In [1]:
import pathlib
from collections import Counter

import spacy
from spacy import displacy


In [2]:
nlp = spacy.load("en_core_web_sm")      # model for English language
nlp

<spacy.lang.en.English at 0x1f088afd360>

In [3]:
introduction_doc = nlp(
    """spaCy is an open-source library for NLP, mostly used in Python and Cython
    which provides pre-trained Neural Networks, specifically, Convolutional Neural
    Networks (CNNs) to perform NLP related taks in several languages (will
    provide images with examples)."""
)
type(introduction_doc)

spacy.tokens.doc.Doc

In [4]:
print([token.text for token in introduction_doc])

['spaCy', 'is', 'an', 'open', '-', 'source', 'library', 'for', 'NLP', ',', 'mostly', 'used', 'in', 'Python', 'and', 'Cython', '\n    ', 'which', 'provides', 'pre', '-', 'trained', 'Neural', 'Networks', ',', 'specifically', ',', 'Convolutional', 'Neural', '\n    ', 'Networks', '(', 'CNNs', ')', 'to', 'perform', 'NLP', 'related', 'taks', 'in', 'several', 'languages', '(', 'will', '\n    ', 'provide', 'images', 'with', 'examples', ')', '.']


In [5]:
# file_name = "C:\\Users\\Usuario\\Documents\\Cryptocurrency-NLP-main\\introduction.txt"
file_name = "introduction.txt"
introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
print([token.text for token in introduction_doc])

['"', '"', '"', 'spaCy', 'is', 'an', 'open', '-', 'source', 'library', 'for', 'NLP', ',', 'mostly', 'used', 'in', 'Python', 'and', 'Cython', '\n    ', 'which', 'provides', 'pre', '-', 'trained', 'Neural', 'Networks', ',', 'specifically', ',', 'Convolutional', 'Neural', '\n    ', 'Networks', '(', 'CNNs', ')', 'to', 'perform', 'NLP', 'related', 'taks', 'in', 'several', 'languages', '(', 'will', '\n    ', 'provide', 'images', 'with', 'examples', ')', '.', '"', '"', '"']
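
The three `"` tokens at each end of this output suggest that `introduction.txt` contains the literal triple quotes from the Python snippet and not just the text itself. A minimal sketch of stripping them before tokenizing, assuming that is what the file holds; the temporary file and the `load_clean_text` helper are hypothetical stand-ins used only for illustration:

```python
import pathlib
import tempfile


def load_clean_text(path):
    """Read a UTF-8 text file and strip surrounding whitespace and double quotes."""
    raw = pathlib.Path(path).read_text(encoding="utf-8")
    # strip() removes outer whitespace, strip('"') then removes any run of
    # double-quote characters at either end of the text.
    return raw.strip().strip('"').strip()


# Hypothetical stand-in for introduction.txt, including the stray triple quotes.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = pathlib.Path(tmp_dir) / "introduction_sample.txt"
    sample_path.write_text(
        '"""spaCy is an open-source library for NLP."""\n', encoding="utf-8"
    )
    clean_text = load_clean_text(sample_path)

print(clean_text)
```

The cleaned string can then be passed to `nlp(...)` in the same way as in cell 5.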


## Sentence Detection

In [6]:
about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)
about_doc = nlp(about_text)
sentences = list(about_doc.sents)
len(sentences)

2

In [7]:
sentences[0]

Gus Proto is a Python developer currently working for a London-based Fintech company.

In [8]:
for sentence in sentences:
    print(sentence)

Gus Proto is a Python developer currently working for a London-based Fintech company.
He is interested in learning Natural Language Processing.


## Tokens in spaCy

In [9]:
nlp = spacy.load("en_core_web_sm")
about_text = (
    """spaCy is an open-source library for NLP, mostly used in Python and Cython
    which provides pre-trained Neural Networks, specifically, Convolutional Neural
    Networks (CNNs) to perform NLP related taks in several languages (will
    provide images with examples)."""
)
about_doc = nlp(about_text)

In [10]:
for token in about_doc:
    print(token, token.idx)

spaCy 0
is 6
an 9
open 12
- 16
source 17
library 24
for 32
NLP 36
, 39
mostly 41
used 48
in 53
Python 56
and 63
Cython 67

     73
which 78
provides 84
pre 93
- 96
trained 97
Neural 105
Networks 112
, 120
specifically 122
, 134
Convolutional 136
Neural 150

     156
Networks 161
( 170
CNNs 171
) 175
to 177
perform 180
NLP 188
related 192
taks 200
in 205
several 208
languages 216
( 226
will 227

     231
provide 236
images 244
with 251
examples 256
) 264
. 265
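
`token.idx` is the character offset of the token's first character in the original string, so slicing the text at that offset recovers the token. A small check in plain Python, with the first few (text, offset) pairs copied from the output above and not recomputed by spaCy:

```python
text = "spaCy is an open-source library for NLP"

# (token text, token.idx) pairs copied from the printed output above.
tokens_with_offsets = [
    ("spaCy", 0),
    ("is", 6),
    ("an", 9),
    ("open", 12),
    ("-", 16),
    ("source", 17),
    ("library", 24),
    ("for", 32),
    ("NLP", 36),
]

for token_text, idx in tokens_with_offsets:
    # Slicing from idx for len(token_text) characters gives back the token.
    recovered = text[idx : idx + len(token_text)]
    assert recovered == token_text
    print(f"{idx:>3} {recovered}")
```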


In [11]:
print(
    f"{'Text with whitespace':22}"
    f"{'Is Alphabetic?':18}"
    f"{'Is Punctuation?':18}"
    f"{'Is Stop Word?'}"
)
for token in about_doc:
    print(
        f"{str(token.text_with_ws):22}"
        f"{str(token.is_alpha):18}"
        f"{str(token.is_punct):18}"
        f"{str(token.is_stop)}"
    )

Text with whitespace  Is Alphabetic?    Is Punctuation?   Is Stop Word?
spaCy                 True              False             False
is                    True              False             True
an                    True              False             True
open                  True              False             False
-                     False             True              False
source                True              False             False
library               True              False             False
for                   True              False             True
NLP                   True              False             False
,                     False             True              False
mostly                True              False             True
used                  True              False             True
in                    True              False             True
Python                True              False             False
and                   True            

---
## Stop Words

Stop words are typically defined as the most common words in a language. In English, examples of stop words include "the", "are", "but", and "they". Most sentences need stop words in order to be complete and grammatical.

In [12]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

326

In [13]:
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

everywhere
becomes
however
via
again
all
of
herself
either
perhaps


In [14]:
about_doc

spaCy is an open-source library for NLP, mostly used in Python and Cython
    which provides pre-trained Neural Networks, specifically, Convolutional Neural
    Networks (CNNs) to perform NLP related taks in several languages (will
    provide images with examples).

In [15]:
len(about_doc)

51

In [16]:
print([token for token in about_doc if not token.is_stop])

[spaCy, open, -, source, library, NLP, ,, Python, Cython, 
    , provides, pre, -, trained, Neural, Networks, ,, specifically, ,, Convolutional, Neural, 
    , Networks, (, CNNs, ), perform, NLP, related, taks, languages, (, 
    , provide, images, examples, ), .]


In [17]:
len([token for token in about_doc if not token.is_stop])

38
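
Taken together, the two counts above give the share of stop words in the text. A quick calculation, with both numbers copied from the cell outputs; note that the total also includes punctuation and whitespace tokens:

```python
total_tokens = 51      # len(about_doc), from cell 15
non_stop_tokens = 38   # tokens where is_stop is False, from cell 17

stop_tokens = total_tokens - non_stop_tokens
stop_share = stop_tokens / total_tokens

print(f"{stop_tokens} stop words, {stop_share:.1%} of all tokens")
```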

---
## Lemmatization

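Lemmatization is the process of reducing the inflected forms of a word to a common base form while still ensuring that the reduced form belongs to the language. This reduced form, or root word, is called a **lemma**. In the output below, for example, "is" is reduced to "be" and "provides" to "provide".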

In [18]:
for token in about_doc:
    if str(token) != str(token.lemma_):
        print(f"{str(token):>20} : {str(token.lemma_)}")

               spaCy : spacy
                  is : be
                used : use
            provides : provide
             trained : train
             related : relate
                taks : tak
           languages : language
              images : image
            examples : example


---
## Word Frequency

In [19]:
nlp = spacy.load("en_core_web_sm")
complete_text = (
    """It has never been easier for individual investors to get started trading
    stocks or options. Companies like Robinhood and Webull even
    offer zero commission trades, no account minimum size, and incentives
    like a free stock if a user creates an account. Recently, there
    has been a huge increase in the growth of users of these types of platforms.
    The combination of many people out of work because of the
    coronavirus and the US government stimulus package appears to have
    sparked this. This surge in new investors has sparked tons of activity
    on popular social media websites like Reddit. There users regularly
    post stock recommendations and trading strategies. It appears
    that many Reddit traders are grouping together and causing irrational
    stock market moves. Panic buying stocks for companies that just declared
    bankruptcy, betting against Warren Buffet, and ignoring the
    impact of the coronavirus on airlines and cruise ships are a few of the
    unusual market moves lately. The purpose of this project is to identify
    if there is a relationship between the Reddit sentiment on stocks and
    performance."""
)

In [20]:
complete_doc = nlp(complete_text)
words = [
    token.text
    for token in complete_doc
    if not token.is_stop and not token.is_punct
]

In [21]:
print(words)

['easier', 'individual', 'investors', 'started', 'trading', '\n    ', 'stocks', 'options', 'Companies', 'like', 'Robinhood', 'Webull', '\n    ', 'offer', 'zero', 'commission', 'trades', 'account', 'minimum', 'size', 'incentives', '\n    ', 'like', 'free', 'stock', 'user', 'creates', 'account', 'Recently', '\n    ', 'huge', 'increase', 'growth', 'users', 'types', 'platforms', '\n    ', 'combination', 'people', 'work', '\n    ', 'coronavirus', 'government', 'stimulus', 'package', 'appears', '\n    ', 'sparked', 'surge', 'new', 'investors', 'sparked', 'tons', 'activity', '\n    ', 'popular', 'social', 'media', 'websites', 'like', 'Reddit', 'users', 'regularly', '\n    ', 'post', 'stock', 'recommendations', 'trading', 'strategies', 'appears', '\n    ', 'Reddit', 'traders', 'grouping', 'causing', 'irrational', '\n    ', 'stock', 'market', 'moves', 'Panic', 'buying', 'stocks', 'companies', 'declared', '\n    ', 'bankruptcy', 'betting', 'Warren', 'Buffet', 'ignoring', '\n    ', 'impact', 'cor

In [22]:
common_words = Counter(words).most_common()
print(common_words)

[('\n    ', 16), ('stocks', 3), ('like', 3), ('stock', 3), ('Reddit', 3), ('investors', 2), ('trading', 2), ('account', 2), ('users', 2), ('coronavirus', 2), ('appears', 2), ('sparked', 2), ('market', 2), ('moves', 2), ('easier', 1), ('individual', 1), ('started', 1), ('options', 1), ('Companies', 1), ('Robinhood', 1), ('Webull', 1), ('offer', 1), ('zero', 1), ('commission', 1), ('trades', 1), ('minimum', 1), ('size', 1), ('incentives', 1), ('free', 1), ('user', 1), ('creates', 1), ('Recently', 1), ('huge', 1), ('increase', 1), ('growth', 1), ('types', 1), ('platforms', 1), ('combination', 1), ('people', 1), ('work', 1), ('government', 1), ('stimulus', 1), ('package', 1), ('surge', 1), ('new', 1), ('tons', 1), ('activity', 1), ('popular', 1), ('social', 1), ('media', 1), ('websites', 1), ('regularly', 1), ('post', 1), ('recommendations', 1), ('strategies', 1), ('traders', 1), ('grouping', 1), ('causing', 1), ('irrational', 1), ('Panic', 1), ('buying', 1), ('companies', 1), ('declared',
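
The most frequent entry in this output is the whitespace token `'\n    '`: the line breaks and indentation inside the triple-quoted string are tokenized too, and they are neither stop words nor punctuation. One way to keep them out of the count is to drop strings that are empty after `str.strip()`. A minimal sketch on a short, hand-picked sample in the shape of the `words` list (not the full output):

```python
from collections import Counter

# A short hand-picked sample in the shape of the `words` list above.
sample_words = [
    "investors", "trading", "\n    ", "stocks", "options",
    "\n    ", "stock", "account", "\n    ", "stock", "investors", "stock",
]

# Keep only entries that still contain characters after stripping whitespace.
real_words = [word for word in sample_words if word.strip()]

top_words = Counter(real_words).most_common(2)
print(top_words)
```

When working with spaCy tokens directly, the same filter could be added to the list comprehension as `not token.is_space`.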

---
## Part-of-Speech Tagging

Part of speech, or POS, is a grammatical role that explains how a particular word is used in a sentence. There are typically eight parts of speech:

1. Noun
2. Pronoun
3. Adjective
4. Verb
5. Adverb
6. Preposition
7. Conjunction
8. Interjection

**Part-of-speech tagging** is the process of assigning a **POS** tag to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category, such as noun or verb, to each word.

In [23]:
for token in about_doc[:5]:
    print(
        f"""
        TOKEN: {str(token)}
        =====
        TAG: {str(token.tag_):10} POS: {token.pos_}
        EXPLANATION: {spacy.explain(token.tag_)}"""
    )


        TOKEN: spaCy
        =====
        TAG: UH         POS: INTJ
        EXPLANATION: interjection

        TOKEN: is
        =====
        TAG: VBZ        POS: AUX
        EXPLANATION: verb, 3rd person singular present

        TOKEN: an
        =====
        TAG: DT         POS: DET
        EXPLANATION: determiner

        TOKEN: open
        =====
        TAG: JJ         POS: ADJ
        EXPLANATION: adjective (English), other noun-modifier (Chinese)

        TOKEN: -
        =====
        TAG: HYPH       POS: PUNCT
        EXPLANATION: punctuation mark, hyphen


In [24]:
nouns = []
adjectives = []
for token in about_doc:
    if token.pos_ == "NOUN":
        nouns.append(token)
    # Compare the string label pos_; token.pos is an integer ID
    # and never equals the string "ADJ".
    if token.pos_ == "ADJ":
        adjectives.append(token)

In [25]:
nouns

[source, library, taks, languages, images, examples]

In [26]:
adjectives
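
Tokens can also be grouped by their coarse POS label in a single pass. The sketch below works on (text, `pos_`) pairs copied from the output of cell 23 and does not call the model; the `tokens_by_pos` name is only illustrative. The labels are handled as strings, which is what `token.pos_` returns, whereas `token.pos` is an integer ID.

```python
from collections import defaultdict

# (token text, token.pos_) pairs copied from the output of cell 23.
tagged_tokens = [
    ("spaCy", "INTJ"),
    ("is", "AUX"),
    ("an", "DET"),
    ("open", "ADJ"),
    ("-", "PUNCT"),
]

# Map each POS label to the list of token texts carrying that label.
tokens_by_pos = defaultdict(list)
for text, pos_label in tagged_tokens:
    tokens_by_pos[pos_label].append(text)

print(tokens_by_pos["ADJ"])
print(sorted(tokens_by_pos))
```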

---
## Visualization: Using displaCy

In [36]:
displacy.serve(about_doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


---
## Preprocessing Functions

To bring your text into a format ideal for analysis, you can write preprocessing functions to encapsulate your cleaning process. For example, in this section, you’ll create a preprocessor that applies the following operations:

- Lowercases the text
- Lemmatizes each token
- Removes punctuation symbols
- Removes stop words

In [30]:
def is_token_allowed(token):
    return bool(
        token
        and str(token).split()
        and not token.is_stop
        and not token.is_punct
    )
    
def preprocess_token(token):
    return token.lemma_.strip().lower()

In [32]:
complete_filtered_tokens = [
    preprocess_token(token)
    for token in complete_doc
    if is_token_allowed(token)
]

In [34]:
print(complete_filtered_tokens)

['easy', 'individual', 'investor', 'start', 'trade', 'stock', 'option', 'company', 'like', 'robinhood', 'webull', 'offer', 'zero', 'commission', 'trade', 'account', 'minimum', 'size', 'incentive', 'like', 'free', 'stock', 'user', 'create', 'account', 'recently', 'huge', 'increase', 'growth', 'user', 'type', 'platform', 'combination', 'people', 'work', 'coronavirus', 'government', 'stimulus', 'package', 'appear', 'spark', 'surge', 'new', 'investor', 'spark', 'ton', 'activity', 'popular', 'social', 'medium', 'website', 'like', 'reddit', 'user', 'regularly', 'post', 'stock', 'recommendation', 'trading', 'strategy', 'appear', 'reddit', 'trader', 'group', 'cause', 'irrational', 'stock', 'market', 'move', 'panic', 'buy', 'stock', 'company', 'declare', 'bankruptcy', 'bet', 'warren', 'buffet', 'ignore', 'impact', 'coronavirus', 'airline', 'cruise', 'ship', 'unusual', 'market', 'move', 'lately', 'purpose', 'project', 'identify', 'relationship', 'reddit', 'sentiment', 'stock', 'performance']


In [35]:
complete_doc

It has never been easier for individual investors to get started trading
    stocks or options. Companies like Robinhood and Webull even
    offer zero commission trades, no account minimum size, and incentives
    like a free stock if a user creates an account. Recently, there
    has been a huge increase in the growth of users of these types of platforms.
    The combination of many people out of work because of the
    coronavirus and the US government stimulus package appears to have
    sparked this. This surge in new investors has sparked tons of activity
    on popular social media websites like Reddit. There users regularly
    post stock recommendations and trading strategies. It appears
    that many Reddit traders are grouping together and causing irrational
    stock market moves. Panic buying stocks for companies that just declared
    bankruptcy, betting against Warren Buffet, and ignoring the
    impact of the coronavirus on airlines and cruise ships are a few of the
   