<a href="https://colab.research.google.com/github/aisha-partha/AIMLOps-Assignments/blob/main/M5_AST_01_TextPreprocessing_using_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Assignment 1: Text Preprocessing using spaCy

## Learning Objectives:

At the end of the experiment, you will be able to:

* understand the spaCy library
* perform simple natural language processing tasks using the spaCy library

## Introduction

**spaCy** is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

It is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

spaCy's features and capabilities include:

- ***Tokenization***: Segmenting text into words, punctuation marks, etc.
- ***Part-of-speech (POS) Tagging***: Assigning word types to tokens, like verb or noun.
- ***Dependency Parsing***: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
- ***Lemmatization***: Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.
- ***Sentence Boundary Detection (SBD)***: Finding and segmenting individual sentences.
- ***Named Entity Recognition (NER)***: Labelling named “real-world” objects, like persons, companies or locations.
- ***Entity Linking (EL)***: Disambiguating textual entities to unique identifiers in a knowledge base.
- ***Similarity***: Comparing words, text spans and documents and how similar they are to each other.
- ***Text Classification***: Assigning categories or labels to a whole document, or parts of a document.
- ***Rule-based Matching***: Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
- ***Training***: Updating and improving a statistical model's predictions.
- ***Serialization***: Saving objects to files or byte strings.


### Statistical models

While some of spaCy's features work independently, others require ***trained pipelines*** to be loaded, which enable spaCy to predict linguistic annotations - for example, whether a word is a verb or a noun.

A trained pipeline can consist of multiple components that use a statistical model trained on labeled data.
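
As a rough illustration of this split, the sketch below uses `spacy.blank("en")`, which creates an English pipeline that contains only the rule-based tokenizer and no trained components. The variable names `blank_nlp` and `blank_doc` are chosen for this example and do not appear elsewhere in the notebook.

```python
import spacy

# A blank English pipeline: tokenizer and language defaults only, no trained components
blank_nlp = spacy.blank("en")
print(blank_nlp.pipe_names)  # no components have been added yet

blank_doc = blank_nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Tokenization is rule-based, so it works without a trained pipeline
print([token.text for token in blank_doc])

# POS tags are predicted by a trained component, so they are empty strings here
print([token.pos_ for token in blank_doc])
```

Predicted annotations such as part-of-speech tags only become available once a trained pipeline is loaded.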

spaCy currently offers trained pipelines for a variety of languages, which can be installed as individual Python modules. Pipeline packages can differ in size, speed, memory usage, accuracy and the data they include.

For English, the available trained pipelines include:
- `en_core_web_sm`
- `en_core_web_md`
- `en_core_web_lg`
- `en_core_web_trf` - English transformer pipeline

To learn more about the trained pipelines for English, see the [spaCy English models page](https://spacy.io/models/en).

Let's perform basic NLP tasks with spaCy using an English trained pipeline.


### Setup Steps:

In [1]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2304896" #@param {type:"string"}

In [2]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "9916583736" #@param {type:"string"}

In [3]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M5_AST_01_TextPreprocessing_using_spaCy" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")

    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      # Read the notebook file and close it again once it has been encoded
      with open(notebook + ".ipynb", "rb") as f:
        file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


### Install packages

In [4]:
!pip -q install spacy==3.7.4

In [5]:
!python -m spacy info

spaCy version    3.7.4                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.85+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_sm (3.7.1)        



From the above info, we can see that the small trained pipeline for English, `en_core_web_sm`, is already installed in this environment.

To use the medium, large, or transformer pipelines, install them first using the `!python -m spacy download` command.

For example: `!python -m spacy download en_core_web_trf`
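
Because pipeline packages are installed as regular Python modules, one way to check which of them are already available is to ask Python's import system. The following is a minimal sketch using only the standard library; the helper name `is_pipeline_installed` is made up for this example and is not a spaCy function.

```python
from importlib.util import find_spec


def is_pipeline_installed(name: str) -> bool:
    """Return True if a module with this name can be found in the current environment."""
    return find_spec(name) is not None


# Pipeline names taken from the list of English pipelines above
for name in ["en_core_web_sm", "en_core_web_md", "en_core_web_lg", "en_core_web_trf"]:
    status = "installed" if is_pipeline_installed(name) else "not installed"
    print(f"{name:<20} {status}")
```

A check like this can help avoid downloading a large package, such as the transformer pipeline, more than once.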

In [6]:
# Install English transformer pipeline
# NOTE that Runtime needs to restart after this step

!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.7.3
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl (457.4 MB)
Collecting spacy-curated-transformers<0.3.0,>=0.2.0 (from en-core-web-trf==3.7.3)
  Downloading spacy_curated_transformers-0.2.2-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting curated-transformers<0.2.0,>=0.1.0 (from spacy-curated-transformers<0.3.0,>=0.2.0->en-core-web-trf==3.7.3)
  Downloading curated_transformers-0.1.1-py2.py3-none-any.whl.metadata (965 bytes)
Collecting curated-tokenizers<0.1.0,>=0.0.9 (from spacy-curated-transformers<0.3.0,>=0.2.0->en-core-web-trf==3.7.3)
  Downloading curated_tokenizers-0.0.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.12.0->spacy-curated-transformers<0.3.0

**Restart the Runtime/Session**

In [4]:
!python -m spacy info

spaCy version    3.7.4                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.85+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_trf (3.7.3), en_core_web_sm (3.7.1)



### Import required packages

In [5]:
import spacy
from spacy import displacy

### Load the trained pipeline

Once you've downloaded and installed a trained pipeline, you can load it via `spacy.load()`. This will return a *Language object* containing all components and data needed to process text. We usually call it `nlp`.


In [6]:
# Load transformer pipeline for English
nlp = spacy.load("en_core_web_trf")

# This gives us a Language object
nlp

<spacy.lang.en.English at 0x7946f2c8d570>

Essentially, spaCy's *Language* object is a pipeline that uses the language model to perform a number of natural language processing tasks such as *tokenization*, *part-of-speech tagging*, *syntactic parsing*, *named entity recognition*, etc.

<br>
<img src='https://cdn.iisc.talentsprint.com/AIandMLOps/Images/spacy_pipeline.png' width=800px>

<br>

In [7]:
# Pipeline names
nlp.pipe_names

['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
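
To see how components end up in `pipe_names`, the sketch below starts from an empty pipeline and adds spaCy's built-in rule-based `sentencizer` component, which performs sentence boundary detection without a trained model. The names `demo_nlp`, `demo_doc` and `sentences` are illustrative only.

```python
import spacy

# Start with a pipeline that has a tokenizer but no components
demo_nlp = spacy.blank("en")
print(demo_nlp.pipe_names)

# Add the built-in rule-based sentence segmenter
demo_nlp.add_pipe("sentencizer")
print(demo_nlp.pipe_names)

# The component now runs every time the pipeline is called on a text
demo_doc = demo_nlp("This is one sentence. This is another one.")
sentences = [sent.text for sent in demo_doc.sents]
print(sentences)
```

The trained transformer pipeline works the same way, except that its components are already assembled when `spacy.load()` returns.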

## Performing basic NLP tasks using spaCy

Calling the Language object, `nlp`, on a string of text will return a processed *Doc*.

In [8]:
# An example sentence
text = "Apple is looking at buying U.K. startup for $1 billion."
text

'Apple is looking at buying U.K. startup for $1 billion.'

In [9]:
# Feed the string object under 'text' to the Language object under 'nlp'
# Store the result under the variable 'doc'
doc = nlp(text)

In [10]:
type(doc)

spacy.tokens.doc.Doc

Passing the variable `text` to the _Language_ object `nlp` returns a spaCy *Doc* object, short for document.

This object contains both the input text stored under `text` and the results of natural language processing using spaCy.

In [11]:
# Call the variable to examine the object
doc

Apple is looking at buying U.K. startup for $1 billion.

Calling the variable `doc` returns the contents of the object.

Although the output resembles that of a Python string, the *Doc* object contains a wealth of information about its linguistic structure, which spaCy generated by passing the text through the NLP pipeline.
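
The difference from a plain string shows up when indexing. The sketch below uses a blank English pipeline (an assumption made so that it runs without a trained model) to show that a single index yields a *Token* and a slice yields a *Span*; `demo_nlp`, `demo_doc`, `first` and `phrase` are names chosen for illustration.

```python
import spacy

sentence = "Apple is looking at buying U.K. startup for $1 billion."
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp(sentence)

# len() counts tokens, not characters
print(len(demo_doc))

# A single index returns a Token, a slice returns a Span
first = demo_doc[0]
phrase = demo_doc[4:7]
print(type(first).__name__, "->", first.text)
print(type(phrase).__name__, "->", phrase.text)

# The original string is preserved exactly
print(demo_doc.text == sentence)
```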

Let's examine the tasks that were performed under the hood after the input sentence was provided to the language model.

### Tokenization

*Tokenization* breaks the text down into words, punctuation and so on.

The diagram below outlines the tasks that spaCy can perform after a text has been tokenised, such as *part-of-speech tagging*, *syntactic parsing* and *named entity recognition*.

<img src='https://cdn.iisc.talentsprint.com/AIandMLOps/Images/spacy_pipeline.png' width=800px>

Each *Doc* consists of individual tokens, and we can iterate over them.

Let's print out each *Token* object stored in the _Doc_ object `doc`.

In [12]:
# Tokens present inside the document

print("Token\n"+'='*20)

for token in doc:
    print(token.text)

Token
Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
.
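
For comparison, the sketch below contrasts spaCy's tokenizer with naive whitespace splitting on the same sentence. It uses a blank English pipeline, which is assumed here to apply the same tokenization rules as the trained pipeline; the variable names are illustrative.

```python
import spacy

sample = "Apple is looking at buying U.K. startup for $1 billion."

# Naive approach: split on whitespace only
whitespace_tokens = sample.split()

# spaCy's rule-based tokenizer (no trained components required)
spacy_tokens = [token.text for token in spacy.blank("en")(sample)]

print(whitespace_tokens)
print(spacy_tokens)
```

Whitespace splitting leaves `$1` and `billion.` glued together, whereas spaCy separates the currency symbol and the final full stop while keeping the abbreviation `U.K.` intact.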


### Part-of-speech tagging

Part-of-speech (POS) tagging is the task of determining the word class of a token. This is crucial for *disambiguation*, because different parts of speech may have similar forms.

>Consider the example: *The sailor dogs the hatch*.<br>
>The third-person singular present form of the verb *dog* (to fasten something with something) is precisely the same as the plural form of the noun *dog*: *dogs*.

To identify the correct word class, we must examine the context in which the word appears.

spaCy provides two types of part-of-speech tags, coarse and fine-grained, which are stored under the attributes `pos_` and `tag_`, respectively.

To access the results of POS tagging, let's loop over the *Doc* object `doc` and print each *Token* and its part-of-speech tags.

In [13]:
# Print the token and the POS tags

print(f"{' ':<30}POS tag\n{' ':<20}{'-'*25}")
print(f"{'Token':<20}{'Coarse':<13}Fine-grained\n{'='*45}")

for token in doc:
    coarse = token.pos_         # coarse pos tag
    fine = token.tag_           # fine-grained pos tag

    print(f"{token.text:<20}{coarse:<13}{fine}")

                              POS tag
                    -------------------------
Token               Coarse       Fine-grained
Apple               PROPN        NNP
is                  AUX          VBZ
looking             VERB         VBG
at                  ADP          IN
buying              VERB         VBG
U.K.                PROPN        NNP
startup             NOUN         NN
for                 ADP          IN
$                   SYM          $
1                   NUM          CD
billion             NUM          CD
.                   PUNCT        .
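
If a tag in the table is unfamiliar, `spacy.explain()` can describe POS tags as well as entity labels. The short sketch below looks up a few of the tags from the output above; the dictionary name `tag_explanations` is illustrative.

```python
import spacy

# A few coarse and fine-grained tags taken from the table above
tags = ["PROPN", "AUX", "ADP", "NNP", "VBZ", "VBG"]

# spacy.explain() returns a short description, or None for an unknown label
tag_explanations = {tag: spacy.explain(tag) for tag in tags}

for tag, explanation in tag_explanations.items():
    print(f"{tag:<8}{explanation}")
```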


### Lemmatization

A **lemma** is the base form of a word.

Unless explicitly instructed, computers cannot tell that the singular and plural forms of a word belong together; they treat them as distinct tokens because their forms differ.

For instance, if we want to count the occurrences of words, a process known as _lemmatization_ is needed to group together the different forms of the same word.

Lemmas are available for each _Token_ under the attribute `lemma_`.

In [14]:
# Print the token and its base form

print(f"{'Token':<20} Lemma\n{'='*30}")

for token in doc:
    lemma = token.lemma_
    print(f"{token.text:<20} {lemma}")

Token                Lemma
Apple                Apple
is                   be
looking              look
at                   at
buying               buy
U.K.                 U.K.
startup              startup
for                  for
$                    $
1                    1
billion              billion
.                    .
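
The word-counting use case mentioned above can be sketched with the standard library alone. The `(token, lemma)` pairs below are hand-written for illustration, following the examples "was" → "be" and "rats" → "rat" from the introduction; they are not output from the pipeline.

```python
from collections import Counter

# Hypothetical (token, lemma) pairs, as a lemmatizer might produce them
pairs = [("rats", "rat"), ("rat", "rat"), ("was", "be"), ("is", "be"), ("be", "be")]

# Counting surface forms treats every form as a separate word
surface_counts = Counter(token for token, _ in pairs)

# Counting lemmas groups the different forms together
lemma_counts = Counter(lemma for _, lemma in pairs)

print(surface_counts)
print(lemma_counts)
```

With a processed *Doc*, the same grouping could be obtained with `Counter(token.lemma_ for token in doc)`.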


### Removing punctuation and stop words, and converting to lowercase

In [15]:
# Stop words in spaCy
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS
print(len(spacy_stopwords))
print(spacy_stopwords)

326
{'‘ve', 'indeed', 'take', 'her', '‘s', 'as', 'these', 'hereupon', 'whence', 'thru', 'forty', 'had', 'part', 'wherever', 'first', 'too', 'various', 'never', 'will', 'few', 'most', 'was', 'no', "'s", 'almost', 'with', '’d', 'many', 'onto', 'somehow', 'were', 'am', 'down', 'hers', 'latterly', 'of', 'very', 'seemed', 'then', 'via', 'whole', 'up', 'himself', 'it', 'above', 'my', 'less', 'two', 'when', 'move', 'although', 'without', 'that', 'wherein', 'top', 'elsewhere', 'third', 'until', '’ll', 'beyond', 'sometime', '’ve', 'but', 'an', 'there', 'amongst', 'same', 'not', 'therefore', 'they', 'also', 'whose', 'me', 'at', 'last', 'someone', 'them', 'something', 'put', 'what', 'make', 'once', 'whether', 'your', 'often', 're', 'see', 'alone', 'out', 'seeming', 'fifteen', 'every', 'who', 'quite', 'anyway', 'which', 'full', 'over', 'least', 'their', 'along', 'behind', 'a', 'have', 'towards', 'still', 'i', 'some', 'regarding', 'five', 'again', 'anywhere', 'can', 'noone', 'and', 'because', 'made

In [16]:
# Check for punctuation and stop words

print(f"{'Token':<20} {'Is stopword':<15} Is Punctuation\n{'='*55}")

for token in doc:
    stop = token.is_stop
    punct = token.is_punct
    # Use !s so the boolean prints as True/False; a bare width spec would format it as 1/0
    print(f"{token.text:<20} {stop!s:<15} {punct}")

Token                Is stopword     Is Punctuation
Apple                False           False
is                   True            False
looking              False           False
at                   True            False
buying               False           False
U.K.                 False           False
startup              False           False
for                  True            False
$                    False           False
1                    False           False
billion              False           False
.                    False           True


In [17]:
def preprocess_token(token):
    """ If a token is neither a stop word nor punctuation, reduce it to its
        base form, strip surrounding whitespace, and convert it to lowercase.
        Otherwise return None. """

    if not token.is_stop and not token.is_punct:
        return token.lemma_.strip().lower()
    return None

In [18]:
# Preprocess the text

print(f"{'Token':<20} Preprocessed token\n{'='*40}")

for token in doc:
    output = preprocess_token(token)
    print(f"{token.text:<20} {output}")

Token                Preprocessed token
Apple                apple
is                   None
looking              look
at                   None
buying               buy
U.K.                 u.k.
startup              startup
for                  None
$                    $
1                    1
billion              billion
.                    None
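
In practice, the `None` entries are usually dropped so that only the cleaned tokens remain. The sketch below is one possible way to do this for a whole *Doc*; the function name `preprocess_text` is hypothetical. So that it runs without a trained pipeline, the demo uses a blank English pipeline, which has no lemmatizer, and therefore falls back to the token text when no lemma is set.

```python
import spacy


def preprocess_text(doc):
    """Return lowercased base forms of all tokens that are neither stop words nor punctuation."""
    cleaned = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        # Fall back to the raw text if no lemmatizer has run on this Doc
        base = token.lemma_ or token.text
        cleaned.append(base.strip().lower())
    return cleaned


demo_doc = spacy.blank("en")("Apple is looking at buying U.K. startup for $1 billion.")
cleaned_tokens = preprocess_text(demo_doc)
print(cleaned_tokens)
```

When the function is applied to a *Doc* produced by the trained pipeline, forms such as "looking" and "buying" would be reduced to their lemmas, as in the table above.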


### Named Entity Recognition (NER)

Named entity recognition (NER) is the task of recognising and classifying entities named in a text.

spaCy can recognise named entities such as persons, geographic locations, and products because these were annotated in the dataset the pipeline was trained on (the OntoNotes 5 corpus).

We can use the *Doc* object's `.ents` attribute to get the named entities.

In [19]:
# Entities
doc.ents

(Apple, U.K., $1 billion)

This returns a tuple with the named entities.

Each item in the tuple is a spaCy *Span* object. *Span* objects can consist of multiple *Token* objects, as many named entities span multiple *Tokens*.

In [20]:
# Check the type of the object used to store named entities
type(doc.ents[0])

spacy.tokens.span.Span

The named entities and their types are stored under the attributes `.text` and `.label_` of each *Span* object.

Let's loop over the *Span* objects in the tuple and print out both attributes.

In [21]:
# Loop over the named entities in the Doc object, and print the named entity and its label

print(f"{'Text':<20} {'Entity_label':<16} Explanation\n{'='*80}")

for ent in doc.ents:
    ent_text = ent.text           # named entity
    ent_label = ent.label_        # entity label
    ent_label_val = spacy.explain(ent_label)       # entity label explanation

    print(f"{ent_text:<20} {ent_label:<16} {ent_label_val}")

Text                 Entity_label     Explanation
Apple                ORG              Companies, agencies, institutions, etc.
U.K.                 GPE              Countries, cities, states
$1 billion           MONEY            Monetary values, including unit


As you can see, named entities like '$1 billion' identified in the *Doc* consist of multiple *Tokens*, which is why they are represented as *Span* objects.

spaCy [*Span*](https://spacy.io/api/span) objects have several useful attributes.

Most importantly, the attributes `start` and `end` return the indices of _Tokens_, which determine where the _Span_ starts and ends in the *Doc* object.

In [22]:
# Print the named entity and indices of its start and end Tokens
print(doc.ents[2], doc.ents[2].start, doc.ents[2].end)

$1 billion 8 11


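The named entity starts at token index 8 and covers the tokens at indices 8, 9 and 10. The `end` value of 11 is exclusive: it is the index of the first token after the *Span* in the *Doc* object.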
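
The `start` and `end` indices follow the same half-open convention as Python slices, so they can be used directly to slice the *Doc*. The sketch below reproduces the span by slicing a *Doc* created with a blank English pipeline (an assumption made so that the example runs without a trained model); the name `money` is illustrative.

```python
import spacy

demo_doc = spacy.blank("en")("Apple is looking at buying U.K. startup for $1 billion.")

# Slicing with [start:end] returns a Span; the end index is exclusive
money = demo_doc[8:11]
print(money.text, money.start, money.end)

# The tokens inside the Span have indices 8, 9 and 10
print([token.i for token in money])

# The length of a Span equals end - start
print(len(money) == money.end - money.start)
```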

#### Visualize Named Entities

We can also render the named entities using *displacy*, spaCy's built-in visualisation module, which we imported earlier.

Note that we must pass the string `ent` to the `style` argument to indicate that we wish to visualise named entities.

In [23]:
# Visualize named entity
spacy.displacy.render(doc, style='ent')
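
Outside a notebook, `displacy.render()` can return the markup as a string instead of displaying it. The sketch below is a minimal example under the assumption that no trained pipeline is available: it sets the three entities by hand on a *Doc* from a blank pipeline, using the labels and token indices shown earlier, and passes `jupyter=False` to get the HTML back. The names `demo_doc` and `html` are illustrative.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

demo_doc = spacy.blank("en")("Apple is looking at buying U.K. startup for $1 billion.")

# Manually assign the entities that the trained pipeline predicted above
demo_doc.ents = [
    Span(demo_doc, 0, 1, label="ORG"),     # Apple
    Span(demo_doc, 5, 6, label="GPE"),     # U.K.
    Span(demo_doc, 8, 11, label="MONEY"),  # $1 billion
]

# jupyter=False makes render() return the HTML markup as a string
html = displacy.render(demo_doc, style="ent", jupyter=False)
print(html[:200])
```

The returned string can be written to an `.html` file and opened in a browser.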

**Example 2:**

In [24]:
# Visualize another sample text
text2 = "On 3rd Feb, Ram was in Delhi.\nLater he traveled to Mumbai via Air India flight reading a Time magazine to meet Raj.\nAfter 10 days, he went again back to Delhi wearing a Timex watch."
doc2 = nlp(text2)
spacy.displacy.render(doc2, style='ent')

If a particular label used for a named entity is unfamiliar, you can check its explanation.

In [25]:
spacy.explain('DATE')

'Absolute or relative dates or periods'

In [26]:
spacy.explain('PERSON')

'People, including fictional'

**Example 3:**

In [27]:
# Visualize another sample text
text3 = "Holmes solves his another case while sitting at his home in Baker Street, without moving a single inch."
doc3 = nlp(text3)
spacy.displacy.render(doc3, style='ent')

In [28]:
spacy.explain('FAC')

'Buildings, airports, highways, bridges, etc.'

References:
* https://spacy.io/usage/spacy-101

### Please answer the questions below to complete the experiment:




In [29]:
#@title Which of the following is a technique to convert a word into its base form? { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "Lemmatization" #@param ["", "Tokenization", "Part-of-Speech Tagging", "Lemmatization", "Named Entity Recognition"]

In [30]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good and Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [31]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "na" #@param {type:"string"}


In [32]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]


In [33]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [34]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [35]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 5774
Date of submission:  12 Aug 2024
Time of submission:  22:25:13
View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions
