
# Introduction to NLP Library: Spacy in Python

Uzair Adamjee, https://medium.com/analytics-vidhya/introduction-to-nlp-library-spacy-in-python-a98cf344eb6d

The **spaCy** package https://spacy.io/



**Natural Language Processing (NLP)** allows computers to perform tasks such as reading and understanding human language.
One of the major challenges when working with textual data is extracting meaningful patterns and using that information to find actionable insights.

In today’s post we will explore an NLP library that can help us find patterns in text data.

**spaCy** is a popular, easy-to-use natural language processing library for Python. It helps build applications that process and extract insights from large volumes of text. It can be used for tasks such as information extraction, natural language understanding, and preparing text for deep learning. Companies like Airbnb, Quora, and Uber use it in production, and it has an active open-source community.

**spaCy** is a good choice for NLP tasks. Some of the features it provides are:

- Tokenization
- Lemmatization
- Entity recognition
- Dependency parsing
- Sentence recognition
- Part-of-speech tagging

# Installation

In [1]:
#pip install -U spacy

In [2]:
# import libraries

import spacy
import pandas as pd
#ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
spacy.__version__

'3.7.6'


After installing the library, we need to download a language model:

In [5]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 13.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
from spacy import displacy

# Load the small English pipeline downloaded above
nlp = spacy.load("en_core_web_sm")
# nlp = spacy.load("en_core_web_lg")  # larger pipeline; must be downloaded separately
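
If the pipeline package might be missing on a given machine, we can check for it before loading. The sketch below is one possible way to do that; `load_pipeline` is a hypothetical helper name (not part of spaCy), and the fallback to a blank English pipeline is an assumption made for illustration. A blank pipeline only tokenizes; it has no tagger, parser, or entity recognizer.

```python
import spacy


def load_pipeline(name: str = "en_core_web_sm"):
    """Hypothetical helper: load an installed pipeline package,
    or fall back to a blank English pipeline (tokenizer only)."""
    if spacy.util.is_package(name):
        return spacy.load(name)
    # Assumption for this sketch: a tokenizer-only pipeline is an acceptable fallback.
    return spacy.blank("en")


nlp_checked = load_pipeline()
# Lists the processing components; empty for a blank pipeline.
print(nlp_checked.pipe_names)
```

In a real project it is usually better to fail loudly when the model is missing, since most of the features below depend on it.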

Define a test sentence:

In [7]:
test_sent= "Pakistan got independence in 1947. Karachi, Lahore and Islamabad are few of the major cities Pakistan."

# Tokenization

Tokenization is the process of splitting text into words, symbols, punctuation, and spaces; in short, into tokens.

In [8]:
parsed_sent = nlp(test_sent)

In [9]:
type(parsed_sent)

spacy.tokens.doc.Doc

In [10]:
parsed_sent.text.split()

['Pakistan',
 'got',
 'independence',
 'in',
 '1947.',
 'Karachi,',
 'Lahore',
 'and',
 'Islamabad',
 'are',
 'few',
 'of',
 'the',
 'major',
 'cities',
 'Pakistan.']

The `.orth_` attribute returns the string representation of a token rather than a **spaCy** token object (it is identical to `token.text`). This might not always be desirable, but it is worth noting. Unlike the plain whitespace split above, **spaCy** recognises punctuation and splits punctuation tokens from word tokens.

In [11]:
for token in parsed_sent:
    print(token.orth_ )

Pakistan
got
independence
in
1947
.
Karachi
,
Lahore
and
Islamabad
are
few
of
the
major
cities
Pakistan
.


In [12]:
for token in parsed_sent:
    # Skip punctuation and whitespace tokens
    if not (token.is_punct or token.is_space):
        print(token.orth_)

Pakistan
got
independence
in
1947
Karachi
Lahore
and
Islamabad
are
few
of
the
major
cities
Pakistan


In [13]:
[token.orth_ for token in parsed_sent if not (token.is_punct or token.is_space)]

['Pakistan',
 'got',
 'independence',
 'in',
 '1947',
 'Karachi',
 'Lahore',
 'and',
 'Islamabad',
 'are',
 'few',
 'of',
 'the',
 'major',
 'cities',
 'Pakistan']
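
Tokenization itself does not need a trained model. As a minimal sketch, assuming only that spaCy is installed, a blank English pipeline (`spacy.blank("en")`) runs the same rule-based tokenizer and exposes the lexical attributes used above. The variable names `nlp_blank`, `doc` and `words` are chosen for illustration.

```python
import spacy

# A blank pipeline contains only the rule-based English tokenizer.
nlp_blank = spacy.blank("en")
doc = nlp_blank("Pakistan got independence in 1947.")

for token in doc:
    # token.i is the token index, token.idx the character offset in the text
    print(token.i, token.idx, repr(token.text),
          token.is_alpha, token.like_num, token.is_punct)

# Keep only the non-punctuation tokens
words = [token.text for token in doc if not token.is_punct]
print(words)
```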

# Named Entities

**Entity recognition** is the process of classifying named entities found in a text into **pre-defined categories**, such as **persons, places, organizations, dates,** etc.

In [14]:
parsed_sent = nlp(test_sent)
spacy.displacy.render(parsed_sent, style='ent',jupyter=True)

# Part-of-Speech Tagging

**spaCy** makes it easy to get **part-of-speech** tags. Some of the most common categories are:

**N(oun)** : This usually denotes words that depict some object or entity which may be living or nonliving.

**V(erb)** : Verbs are words that are used to describe certain actions, states, or occurrences.

**Adj(ective)** : Adjectives are words used to describe or qualify other words, typically nouns and noun phrases. The phrase beautiful flower has the noun (N) flower which is described or qualified using the adjective (ADJ) beautiful.

**Adv(erb)** : Adverbs usually act as modifiers for other words including nouns, adjectives, verbs, or other adverbs. The phrase very beautiful flower has the adverb (ADV) very, which modifies the adjective (ADJ) beautiful.
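
spaCy reports these categories as short labels such as `NOUN`, `ADJ` or `PROPN`. When a label is unfamiliar, `spacy.explain` returns a short description of it. The small lookup below is only an illustration; the list of labels is our own selection.

```python
import spacy

# Coarse POS tags, one fine-grained tag (NNP) and one dependency label (nsubj)
for label in ["NOUN", "VERB", "ADJ", "ADV", "PROPN", "NNP", "nsubj"]:
    print(f"{label:>6}  {spacy.explain(label)}")
```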

In [15]:
sentence_spans = list(parsed_sent.sents)
displacy.render(sentence_spans, style='dep', jupyter=True)
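
The `.sents` property used in the cell above relies on the sentence boundaries set by the dependency parser of `en_core_web_sm`. When only sentence splitting is needed, a lighter rule-based alternative is spaCy's `sentencizer` component, which splits on sentence-final punctuation. This is a sketch of that alternative, not the approach used in this notebook; `nlp_rules`, `doc_rules` and `sentences` are illustrative names.

```python
import spacy

# Blank pipeline plus the rule-based sentence splitter (no trained model needed)
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

text = ("Pakistan got independence in 1947. Karachi, Lahore and Islamabad "
        "are few of the major cities Pakistan.")
doc_rules = nlp_rules(text)

sentences = [sent.text for sent in doc_rules.sents]
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```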

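The entity type of each token is available as `token.ent_type_`; it is an empty string for tokens that are not part of an entity:

In [16]: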
for token in parsed_sent:
    print(token.orth_, token.ent_type_ if token.ent_type_ != "" else "(not an entity)")

Pakistan GPE
got (not an entity)
independence (not an entity)
in (not an entity)
1947 DATE
. (not an entity)
Karachi PERSON
, (not an entity)
Lahore GPE
and (not an entity)
Islamabad GPE
are (not an entity)
few (not an entity)
of (not an entity)
the (not an entity)
major (not an entity)
cities (not an entity)
Pakistan GPE
. (not an entity)
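
In the output above, the statistical model labels Karachi as `PERSON` rather than `GPE`. One possible way to handle known names is spaCy's rule-based `entity_ruler` component. The sketch below runs it on a blank pipeline so that it works without a trained model; the pattern list is our own assumption, and this is a different technique from the statistical recognizer used in the notebook.

```python
import spacy

nlp_ruler = spacy.blank("en")
ruler = nlp_ruler.add_pipe("entity_ruler")
# Hand-written patterns (an assumption for this example)
ruler.add_patterns([
    {"label": "GPE", "pattern": "Karachi"},
    {"label": "GPE", "pattern": "Lahore"},
    {"label": "GPE", "pattern": "Islamabad"},
])

doc_ruler = nlp_ruler(
    "Karachi, Lahore and Islamabad are few of the major cities Pakistan."
)
for ent in doc_ruler.ents:
    # Entity text, label, and character offsets in the original string
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```

With the full `en_core_web_sm` pipeline, the ruler would typically be added with `before="ner"` so that the statistical recognizer respects the rule-based spans.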


# Lemmatization

**Lemmatization** is the reduction of each word to its **root**, or **lemma**. In **spaCy**, the `token.lemma_` attribute holds the lemma of each token.

In [17]:
for token in parsed_sent:
    print(token, ' -> Its Lemma word ', token.lemma_)
    print()

Pakistan  -> Its Lemma word  Pakistan

got  -> Its Lemma word  get

independence  -> Its Lemma word  independence

in  -> Its Lemma word  in

1947  -> Its Lemma word  1947

.  -> Its Lemma word  .

Karachi  -> Its Lemma word  Karachi

,  -> Its Lemma word  ,

Lahore  -> Its Lemma word  Lahore

and  -> Its Lemma word  and

Islamabad  -> Its Lemma word  Islamabad

are  -> Its Lemma word  be

few  -> Its Lemma word  few

of  -> Its Lemma word  of

the  -> Its Lemma word  the

major  -> Its Lemma word  major

cities  -> Its Lemma word  city

Pakistan  -> Its Lemma word  Pakistan

.  -> Its Lemma word  .
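
Most tokens above are already in their base form. To see only the words that lemmatization changed, and to count lemmas, we can compare `token.text` with `token.lemma_`. The sketch below does not run the lemmatizer: so that it works without a downloaded model, it rebuilds a `Doc` by hand from the words and lemmas printed above. With the notebook's `parsed_sent`, the same two expressions apply directly.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

# Words and lemmas copied from the output above (not computed here)
words = ["Pakistan", "got", "independence", "in", "1947", ".", "Karachi", ",",
         "Lahore", "and", "Islamabad", "are", "few", "of", "the", "major",
         "cities", "Pakistan", "."]
lemmas = ["Pakistan", "get", "independence", "in", "1947", ".", "Karachi", ",",
          "Lahore", "and", "Islamabad", "be", "few", "of", "the", "major",
          "city", "Pakistan", "."]

vocab = spacy.blank("en").vocab
doc_manual = Doc(vocab, words=words, lemmas=lemmas)

# Words whose lemma differs from the surface form
changed = [(t.text, t.lemma_) for t in doc_manual if t.text != t.lemma_]
print(changed)

# Lemma frequencies, ignoring punctuation
lemma_counts = Counter(t.lemma_ for t in doc_manual if not t.is_punct)
print(lemma_counts.most_common(3))
```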



### Convert **spaCy** data into a DataFrame:

In [18]:
df_token = pd.DataFrame()

# One row per token, one column per token attribute
for i, token in enumerate(parsed_sent):
    df_token.loc[i, 'text'] = token.text
    df_token.loc[i, 'lemma'] = token.lemma_
    df_token.loc[i, 'pos'] = token.pos_
    df_token.loc[i, 'tag'] = token.tag_
    df_token.loc[i, 'dep'] = token.dep_
    #df_token.loc[i, 'shape'] = token.shape_
    #df_token.loc[i, 'is_alpha'] = token.is_alpha
    df_token.loc[i, 'is_stop'] = token.is_stop

print(df_token)

            text         lemma    pos  tag    dep is_stop
0       Pakistan      Pakistan  PROPN  NNP  nsubj   False
1            got           get   VERB  VBD   ROOT   False
2   independence  independence   NOUN   NN   dobj   False
3             in            in    ADP   IN   prep    True
4           1947          1947    NUM   CD   pobj   False
5              .             .  PUNCT    .  punct   False
6        Karachi       Karachi  PROPN  NNP  nsubj   False
7              ,             ,  PUNCT    ,  punct   False
8         Lahore        Lahore  PROPN  NNP   conj   False
9            and           and  CCONJ   CC     cc    True
10     Islamabad     Islamabad  PROPN  NNP   conj   False
11           are            be    AUX  VBP   ROOT    True
12           few           few    ADJ   JJ   attr    True
13            of            of    ADP   IN   prep    True
14           the           the    DET   DT    det    True
15         major        

In [24]:
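import openpyxl  # not called directly; pandas needs it installed to write .xlsx files

# Save the token table as CSV and as an Excel workbook
df_token.to_csv('./Tokens_Data.csv', index=False)
df_token.to_excel('./Tokens_Data.xlsx', index=False)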
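
Filling a DataFrame cell by cell with `.loc` works for a short sentence but becomes slow on longer texts. A common alternative is to collect one dictionary per token and build the DataFrame in a single call. The sketch below uses a blank English pipeline so it runs without a downloaded model, which means only lexical attributes are available (including the `shape_` and `is_alpha` columns commented out above); `nlp_lex`, `doc_lex` and `df_lex` are illustrative names. With the full pipeline, `lemma_`, `pos_`, `tag_` and `dep_` can be added to each dictionary in the same way.

```python
import pandas as pd
import spacy

nlp_lex = spacy.blank("en")  # tokenizer only; no tagger, parser, or lemmatizer
doc_lex = nlp_lex("Pakistan got independence in 1947.")

# One dictionary per token, then a single DataFrame construction
rows = [
    {
        "text": token.text,
        "shape": token.shape_,
        "is_alpha": token.is_alpha,
        "is_stop": token.is_stop,
    }
    for token in doc_lex
]
df_lex = pd.DataFrame(rows)
print(df_lex)
```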