## Text Preprocessing Using spaCy

### Stop Words

- Stop words are words that are filtered out before or after natural language text is processed.
- The term typically refers to the most common words in a language.
- There is no universal list of stop words shared by all NLP tools.

**What are stop words?**
- Stop words are words in a language that add little meaning to a sentence.
- They can usually be ignored without sacrificing the meaning of the sentence.
- For search engines, these are typically the most common, short function words, such as *the*, *is*, *at*, *which*, and *on*.

**But sometimes stop words are useful and should not be removed.**

**When to remove stop words?**

- For text classification or sentiment analysis, stop words are often removed because they usually add little information for the model, i.e. unwanted words are kept out of the corpus.
- For language translation, stop words are useful because they have to be translated along with the other words.
- There is no hard and fast rule on when to remove stop words:
    1) Remove stop words if the task is language classification, spam filtering, caption generation, auto-tag generation, sentiment analysis, or another task related to text classification.
    2) It is better not to remove stop words if the task is machine translation, question answering, text summarization, or language modeling.

**Pros of Removing stop words**

- Stop words are often removed from the text before training deep learning and machine learning models because they occur in abundance and therefore provide little to no unique information for classification or clustering.
- Removing stop words decreases the dataset size and the training time, usually without a large impact on model accuracy.
- Stop word removal can potentially improve performance, since fewer and only significant tokens are left; classification accuracy may therefore improve.

**Cons of Removing Stop Words**

- Improper selection and removal of stop words can change the meaning of the text, so the stop-word list must be chosen carefully.
- Example: "This movie is not good"
    - If *not* is removed during preprocessing, the sentence becomes "this movie is good", which would wrongly be interpreted as positive.
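One way to avoid losing negations is to filter against a custom stop-word set. The sketch below is a minimal illustration, assuming spaCy v3 is installed. It uses `spacy.blank("en")` (tokenizer only, no trained model), and the names `NEGATIONS` and `remove_stop_words` are hypothetical helpers, not part of spaCy.

```python
import spacy

# Tokenizer-only English pipeline; no trained model is needed for this sketch.
nlp = spacy.blank("en")

# Hypothetical choice of words to protect from removal.
NEGATIONS = {"not", "no", "never"}

default_stops = nlp.Defaults.stop_words
custom_stops = default_stops - NEGATIONS  # new set; spaCy's own list is untouched


def remove_stop_words(text, stop_words):
    """Return the token strings whose lowercase form is not in stop_words."""
    return [token.text for token in nlp(text) if token.lower_ not in stop_words]


sentence = "This movie is not good"
with_default = remove_stop_words(sentence, default_stops)
with_custom = remove_stop_words(sentence, custom_stops)
print(with_default)
print(with_custom)
```

Building a separate set, rather than editing spaCy's defaults in place, keeps the change local to this one filtering step.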

**Removing Stop words using SpaCy Library**

- Compared to NLTK, spaCy ships a larger set of English stop words (326 versus 179 in NLTK; the exact counts may differ between library versions).
- Installation (spaCy and the English language model):
    - `pip install -U spacy`
    - `python -m spacy download en_core_web_sm`

In [1]:
!pip install spacy



In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 5.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [4]:
# list stopwords from spacy:
stopwords = nlp.Defaults.stop_words
print(stopwords)

{'our', 'than', 'whereas', 'would', 'there', 'being', 'not', 'take', 'except', 'were', 'least', 'afterwards', 'sometimes', 'much', 'among', 'part', 'regarding', 'must', 'others', 'elsewhere', 'indeed', 'off', 'now', 'throughout', 'top', '‘ll', 'yet', 'whether', 'whereby', 'hundred', 'herself', 'all', 'made', 'whoever', 'whither', 'unless', 'other', '‘re', 'twenty', 'name', 'get', 'one', 'someone', 'ever', 'had', 'below', '’m', 'nine', 'is', 'though', 'most', '‘ve', 'whereupon', 'before', 'keep', 'seemed', 'never', 'many', "'d", 'same', 'ours', 'hence', 'for', 'doing', '‘s', 'beforehand', "n't", 'so', 'are', 'whom', 'hers', 'was', 'did', 'seeming', 'eight', 'please', 'ten', 'two', 'might', 'noone', 'nowhere', 'it', 'already', 'these', 'due', 'hereby', 'in', 'no', 'somehow', 'within', 'a', 'been', 'that', 'anything', 'to', 'through', 'done', 'twelve', 'mine', 'five', 'often', 'from', 'thence', 'why', 'each', 'without', 'thereby', 'anywhere', 'formerly', 'perhaps', 'just', 'seem', 'seems'

In [5]:
text = '''Hello, this text is written from the text preprocessing with Spacy library which is open source.
Various statistical models and pipelines are available here. Techniques like stop words,
POS tagging and dependency matching is pre-trained.
'''

In [7]:
# Tokenize the text, then drop the tokens whose text is in the stop-word set.
doc = nlp(text)

# Keep only the tokens that are not stop words (case-sensitive string comparison).
tokens_without_sw = [token for token in doc if token.text not in stopwords]
print(tokens_without_sw)

[Hello, ,, text, written, text, preprocessing, Spacy, library, open, source, ., 
, Various, statistical, models, pipelines, available, ., Techniques, like, stop, words, ,, 
, POS, tagging, dependency, matching, pre, -, trained, ., 
]


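Note: The stop-word set holds strings, so each `Token` must be converted with the `.text` attribute before the membership test. The comparison is case-sensitive, so a capitalized token is kept even when its lowercase form is in the set.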
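The filter in the cell above compares the raw token text, which is case-sensitive. A minimal sketch of two case-insensitive alternatives, assuming spaCy v3: comparing `token.lower_` against the set, or using the built-in `token.is_stop` flag. The sample sentence is made up for illustration, and a tokenizer-only pipeline is enough here.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; stop-word flags come from the language defaults
stop_words = nlp.Defaults.stop_words

doc = nlp("Many models are available here.")

# Case-sensitive: "Many" is not the same string as "many".
by_text = [t.text for t in doc if t.text not in stop_words]

# Case-insensitive alternatives.
by_lower = [t.text for t in doc if t.lower_ not in stop_words]
by_flag = [t.text for t in doc if not t.is_stop]

print(by_text)
print(by_lower)
print(by_flag)
```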

## Tokenization:
- Tokenization refers to dividing the whole text into multiple manageable units.
- It helps to form a sequence of words or sentences.
- Each token carries meaning and has a semantic relation to the other tokens.

- Word Tokenization
    - Word tokenization simply means splitting a sentence or text into words.
    - Calling `nlp(text)` tokenizes the text; the `token.text` attribute gives each token's string.

- Sentence Tokenization

    - Sentence tokenization is the process of splitting a string into sentences.
    - A sentence usually ends with a full stop (.); the focus here is on studying the structure of each sentence in the analysis.
    - Use the `sents` attribute of a spaCy `Doc` to iterate over the sentences.

In [8]:
# Word tokenization:

nlp = spacy.load("en_core_web_sm")

doc = nlp(text)

tokens = [token.text for token in doc]
print(tokens, len(tokens))

['Hello', ',', 'this', 'text', 'is', 'written', 'from', 'the', 'text', 'preprocessing', 'with', 'Spacy', 'library', 'which', 'is', 'open', 'source', '.', '\n', 'Various', 'statistical', 'models', 'and', 'pipelines', 'are', 'available', 'here', '.', 'Techniques', 'like', 'stop', 'words', ',', '\n', 'POS', 'tagging', 'and', 'dependency', 'matching', 'is', 'pre', '-', 'trained', '.', '\n'] 45


In [9]:
# Sentence tokenization:
sentences = [sent.text for sent in doc.sents]
# Or just list(doc.sents), which keeps Span objects instead of strings
print(sentences, len(sentences))

['Hello, this text is written from the text preprocessing with Spacy library which is open source.\n', 'Various statistical models and pipelines are available here.', 'Techniques like stop words,\nPOS tagging and dependency matching is pre-trained.\n'] 3
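`doc.sents` needs sentence boundaries to be set by some pipeline component. With `en_core_web_sm` they come from the trained pipeline; as a lighter alternative, spaCy also provides a rule-based `sentencizer` component. A minimal sketch, assuming spaCy v3 and a made-up sample text:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based splitting on sentence-final punctuation

doc = nlp("Stop words are common words. Should they always be removed? Not necessarily!")
sentences = [sent.text for sent in doc.sents]
print(sentences, len(sentences))
```

The rule-based splitter is fast but only looks at punctuation, so it may behave differently from the trained pipeline on text such as abbreviations.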


## Punctuation:
- Punctuation marks are special marks placed in a text to show the division between phrases and sentences.
- There are 14 punctuation marks commonly used in English grammar.
- They are the **period, question mark, exclamation point, comma, semicolon, colon, dash, hyphen, parentheses, brackets, braces, apostrophe, quotation marks, and ellipsis**.
- We can remove punctuation from a text using the token attribute `is_punct`.


In [10]:
# Removing the punctuation:
# Check each token and keep only those that are not punctuation:

non_punct_tokens = []
for token in doc:
    if not token.is_punct:
        non_punct_tokens.append(token.text)

print(non_punct_tokens)

['Hello', 'this', 'text', 'is', 'written', 'from', 'the', 'text', 'preprocessing', 'with', 'Spacy', 'library', 'which', 'is', 'open', 'source', '\n', 'Various', 'statistical', 'models', 'and', 'pipelines', 'are', 'available', 'here', 'Techniques', 'like', 'stop', 'words', '\n', 'POS', 'tagging', 'and', 'dependency', 'matching', 'is', 'pre', 'trained', '\n']
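The output above still contains the newline tokens (`'\n'`), because whitespace is not punctuation. A minimal sketch that drops both kinds of token, using the attributes `is_punct` and `is_space` on a tokenizer-only pipeline (the sample string is made up for illustration):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello, world!\nBye.")

# Keep a token only if it is neither punctuation nor whitespace.
clean_tokens = [t.text for t in doc if not t.is_punct and not t.is_space]
print(clean_tokens)
```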


## Lower Casing:
- Converting words to lower case (NLP -> nlp).
- **Q. Why lower casing?**
    - Words like *Book* and *book* mean the same thing.
    - When not converted to lower case, the two are represented as two different words in the vector space model (resulting in more dimensions).
    - The higher the dimension, the more computational resources are required.

In [11]:
lower_text = text.lower()
lower_text

'hello, this text is written from the text preprocessing with spacy library which is open source.\nvarious statistical models and pipelines are available here. techniques like stop words,\npos tagging and dependency matching is pre-trained.\n'
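The effect on vocabulary size can be sketched in plain Python. The word list below is made up for illustration; each distinct string would become its own dimension in a bag-of-words style representation.

```python
words = "Book book BOOK reads Reads the The".split()

vocab_original = set(words)                  # case-sensitive vocabulary
vocab_lowered = {w.lower() for w in words}   # vocabulary after lower casing

print(len(vocab_original), sorted(vocab_original))
print(len(vocab_lowered), sorted(vocab_lowered))
```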

## Stemming:
- Stemming converts words or tokens to their root form (stem).
- The stem might not be a valid word.
- It is based on rule-driven algorithms that strip word endings.

Note: Stemming is not available in the spaCy library.
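To make the idea concrete, here is a deliberately simplified suffix-stripping sketch in plain Python. It is a toy for illustration only, not the Porter algorithm or any library's stemmer; the suffix list and the `toy_stem` helper are assumptions made up for this example.

```python
# Toy suffix-stripping stemmer for illustration only (NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "ly", "s")  # checked in this order


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


stems = {w: toy_stem(w) for w in ["models", "studies", "caring", "trained", "is"]}
for word, stem in stems.items():
    print(word, "=>", stem)
```

Stems such as `studi` and `car` show why the result of stemming might not be a meaningful word.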

## Lemmatization:
- Lemmatization is the process of converting a word to its base form (lemma).
- For example, lemmatization would correctly reduce *caring* to *care*.
- Lemmatization can be carried out using the attribute `token.lemma_`.
- It is a search-based (lookup) approach rather than pure suffix stripping.
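The lookup idea can be sketched in its simplest form with a hand-made table. This is only a toy illustration in plain Python: `LEMMA_TABLE` and `toy_lemma` are made up for this example and are not a description of how spaCy's lemmatizer is implemented.

```python
# Hand-made lookup table for illustration; real lemmatizers rely on far
# larger resources and additional rules.
LEMMA_TABLE = {
    "is": "be",
    "are": "be",
    "written": "write",
    "caring": "care",
    "models": "model",
}


def toy_lemma(word):
    """Look the lowercased word up; fall back to the word itself if unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


lemmas = {w: toy_lemma(w) for w in ["is", "written", "caring", "Models", "library"]}
for word, lemma in lemmas.items():
    print(word, "=>", lemma)
```

Unlike suffix stripping, a lookup can map irregular forms such as *written* to *write*, but it only knows the words in its table.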

In [12]:
for token in doc:
    print(token.text, "=>", token.lemma_)

Hello => hello
, => ,
this => this
text => text
is => be
written => write
from => from
the => the
text => text
preprocessing => preprocesse
with => with
Spacy => Spacy
library => library
which => which
is => be
open => open
source => source
. => .

 => 

Various => various
statistical => statistical
models => model
and => and
pipelines => pipeline
are => be
available => available
here => here
. => .
Techniques => technique
like => like
stop => stop
words => word
, => ,

 => 

POS => POS
tagging => tagging
and => and
dependency => dependency
matching => matching
is => be
pre => pre
- => -
trained => train
. => .

 => 



## POS Tagging:
- Part-of-speech (POS) tagging is the process of tagging words in textual input with their appropriate parts of speech.
- This is one of the core features loaded into the pipeline.
- The coarse-grained POS tag can be accessed using `token.pos_` and the fine-grained tag using `token.tag_`.

In [13]:

for token in doc:
    print(token.text, token.pos_, token.tag_)

Hello INTJ UH
, PUNCT ,
this DET DT
text NOUN NN
is AUX VBZ
written VERB VBN
from ADP IN
the DET DT
text NOUN NN
preprocessing VERB VBG
with ADP IN
Spacy PROPN NNP
library NOUN NN
which PRON WDT
is AUX VBZ
open ADJ JJ
source NOUN NN
. PUNCT .

 SPACE _SP
Various ADJ JJ
statistical ADJ JJ
models NOUN NNS
and CCONJ CC
pipelines NOUN NNS
are AUX VBP
available ADJ JJ
here ADV RB
. PUNCT .
Techniques NOUN NNS
like AUX IN
stop VERB VB
words NOUN NNS
, PUNCT ,

 SPACE _SP
POS PROPN NNP
tagging NOUN NN
and CCONJ CC
dependency NOUN NN
matching NOUN NN
is AUX VBZ
pre VERB VBN
- ADJ JJ
trained VERB VBN
. PUNCT .

 SPACE _SP
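The tag abbreviations in the output above can be hard to read. spaCy's `spacy.explain` helper returns a short description for a known label; a minimal sketch using a few tags copied from that output:

```python
import spacy

# A few coarse-grained and fine-grained tags copied from the output above.
labels = ["INTJ", "AUX", "ADP", "PROPN", "VBZ", "VBN", "WDT"]

explanations = {label: spacy.explain(label) for label in labels}
for label, description in explanations.items():
    print(label, "=>", description)
```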


## Named Entity Recognition
- It is the process of detecting named entities such as person names, locations, companies, quantities, and monetary values.
- We can find the named entities using the `ents` attribute of a spaCy `Doc`.
- Each entity exposes attributes such as `entity.text` (the entity string) and `entity.label_` (the entity type as a string).

In [15]:
text = """
Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics. His work is also known for its influence on the philosophy of science. He was born in Ulm, in the Kingdom of Württemberg in the German Empire, on 14 March 1879. When he was 17, he moved to Switzerland, where he began his theoretical physics studies at the Swiss Federal Institute of Technology in Zurich. He published his first paper in 1900, at the age of 21.
"""

In [16]:
import spacy

from spacy import displacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

for entity in doc.ents:
    print(entity.text, entity.label_)


Albert Einstein PERSON
German NORP
one CARDINAL
two CARDINAL
Ulm GPE
the Kingdom of Württemberg GPE
the German Empire GPE
14 March 1879 DATE
17 DATE
Switzerland GPE
the Swiss Federal Institute of Technology ORG
Zurich GPE
first ORDINAL
1900 DATE
the age of 21 DATE
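Besides `text` and `label_`, each entity span also carries character offsets (`start_char`, `end_char`), and `spacy.explain` describes its label. The sketch below sets two entities by hand on a blank pipeline, standing in for model predictions, so that it does not depend on a trained model; the short sentence and the chosen spans are assumptions for illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # no trained NER; entities are set by hand below
doc = nlp("Albert Einstein was born in Ulm.")

# Hand-labelled spans (token start/end indices), standing in for model predictions.
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 5, 6, label="GPE")]

rows = [
    (ent.text, ent.label_, ent.start_char, ent.end_char, spacy.explain(ent.label_))
    for ent in doc.ents
]
for row in rows:
    print(row)
```

The character offsets index into `doc.text`, which is useful for highlighting entities in the original string.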


In [17]:
displacy.serve(doc, style = "ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
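`displacy.serve` starts a web server and blocks until it is shut down. When the markup should be saved to a file or embedded elsewhere, `displacy.render` can be more convenient; with `jupyter=False` it returns the HTML as a string. A minimal sketch, assuming spaCy v3, with hand-labelled entities on a blank pipeline so that it does not depend on a trained model:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Albert Einstein was born in Ulm.")

# Hand-labelled entities, standing in for model predictions.
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 5, 6, label="GPE")]

# jupyter=False forces a string return value instead of inline display.
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:200])
```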
