<a href="https://colab.research.google.com/github/iPALVIKAS/Linear-Regression-using-Gradient-Descent/blob/main/Text_pre_processing_for_NLP/Text_pre_processing_for_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center> Text Pre Processing for NLP


<img src="https://images.datacamp.com/image/upload/v1669223212/Text_Mining_6eeff5cb7c.png" width="800"/>

</center>

## Housekeeping
1. Check that the recording is on
2. Check audio and screenshare
3. Share link to notebook in chat
4. Check for light mode and readable font size

## What is Text Pre-processing?

Text preprocessing involves **cleaning and preparing raw text data** so that it can be passed on to downstream NLP tasks, such as text analysis or model training. When conducted effectively, text preprocessing can significantly impact the performance and accuracy of NLP models.

Today, we will look at some of the essential steps involved in text preprocessing for NLP tasks.



## Why is Text Preprocessing Important?

Raw text data is often:
- **Noisy**: The contents could include inconsistencies such as typos, slang, and abbreviations.
- **Unstructured**: It exists as a sequence of characters that is not directly machine-readable.

Preprocessing helps in:

- **Improving Data Quality:**
By removing noise and irrelevant components, the processed data is made clean and consistent.

- **Enhancing Model Performance:**
When a dataset is well preprocessed, feature extraction improves, which in turn improves the performance of NLP models.

- **Reducing Complexity:**
By narrowing the dataset to relevant text elements, we can reduce computational overhead and make model training more efficient.



## Today's workshop will focus on 3 use-cases for text-preprocessing, and we will work with:

- **Plain Text**
- **Web pages**
- **PDF files**



# Plain Text: Simple Text Preprocessing Techniques:

## Basic Text Processing

In this example, we will work on the following tasks:
- Converting text to lowercase
- Removing punctuation and special characters (reducing special characters)
- Removing numbers

These tasks can be carried out with basic Python functions and regular expressions. More details on regular expressions (regex) for NLP are available in our previous [workshop notebook](https://github.com/ua-datalab/NLP-Speech/blob/main/Introduction_to_Regular_Expressions/Introduction_to_Regular_Expressions.ipynb).

In [1]:
corpus = [
    "I can't wait for the new season of my favorite show!",
    "The COVID-19 pandemic has affected millions of people worldwide.",
    "U.S. stocks fell on Friday after news of rising inflation.",
    "Python is a great programming language ^-^ !!! ??"
]


In [2]:
import re
import string

def clean_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'\W', ' ', text)  # Remove special characters
    return text

cleaned_corpus = [clean_text(doc) for doc in corpus]

print(*(item for item in cleaned_corpus), sep='\n')

i cant wait for the new season of my favorite show
the covid pandemic has affected millions of people worldwide
us stocks fell on friday after news of rising inflation
python is a great programming language   
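
The last cleaned sentence above ends with leftover spaces where the emoticon and punctuation used to be. A minimal sketch of an extra normalization step, assuming we want single spaces and no leading or trailing whitespace; `clean_text_compact` is a hypothetical name and not part of the workshop code:

```python
import re
import string

def clean_text_compact(text):
    """Same steps as clean_text above, plus whitespace normalization (hypothetical variant)."""
    text = text.lower()                                               # Lowercase
    text = re.sub(r'\d+', '', text)                                   # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'\W', ' ', text)                                   # Replace remaining special characters
    text = re.sub(r'\s+', ' ', text).strip()                          # Collapse runs of whitespace
    return text

sample = "Python is a great programming language ^-^ !!! ??"
print(repr(clean_text_compact(sample)))
```

Printing with `repr` makes any remaining leading or trailing spaces visible.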


For the NLP tasks, we will use the NLTK Python library. These tasks can also be carried out with the spaCy package.

[See our previous workshop on NLP with SpaCy for more details](https://github.com/ua-datalab/NLP-Speech/blob/main/Natural_Language_Processing_Text_Mining_and_Sentiment_Analysis/Natural_Language_Processing_Text_Mining_and_Sentiment_Analysis.ipynb).

In [3]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [7]:
tokenized_corpus = [word_tokenize(doc) for doc in cleaned_corpus]
print(len(tokenized_corpus))
print(*(item for item in tokenized_corpus), sep='\n')

4
['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show']
['the', 'covid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide']
['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation']
['python', 'is', 'a', 'great', 'programming', 'language']


## Stop Words Removal

Once we have reduced our dataset to words, we may want to keep only the meaning-carrying (content) words, depending on the task. For tasks such as topic modeling, the objective involves looking for nouns and verbs, and excluding adverbs, articles and other grammatical elements of the text. These frequent, low-content words are called stop words.





In [8]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [9]:

# Create a set containing all the English stop words:
stop_words = set(stopwords.words('english'))
# Iterate over each sentence, keeping only the content words:
filtered_corpus = [[word for word in doc if word not in stop_words] for doc in tokenized_corpus]
print(*(item for item in filtered_corpus), sep='\n')

['cant', 'wait', 'new', 'season', 'favorite', 'show']
['covid', 'pandemic', 'affected', 'millions', 'people', 'worldwide']
['us', 'stocks', 'fell', 'friday', 'news', 'rising', 'inflation']
['python', 'great', 'programming', 'language']
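
Stop word lists are task dependent. NLTK's English list includes negations such as `not`, which can matter for tasks like sentiment analysis. A minimal sketch of customizing a list, using a small hand-written set as a stand-in (an assumption made so the example runs without downloads; it is not NLTK's actual list):

```python
# A tiny stand-in stop word set (hypothetical; NLTK's English list is much longer)
base_stop_words = {"i", "the", "this", "is", "not", "a", "of"}

# Keep negations by removing them from the stop word set
negations = {"not", "no", "never"}
custom_stop_words = base_stop_words - negations

tokens = ["this", "show", "is", "not", "good"]
default_filtered = [t for t in tokens if t not in base_stop_words]
custom_filtered = [t for t in tokens if t not in custom_stop_words]

print(default_filtered)
print(custom_filtered)
```

With the default set the negation disappears and the sentence appears positive; with the customized set it is preserved.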


## Lemmatization and Stemming

While processing our documents, we may not want the model to be confused by words that have the same meaning but different inflections (plural markers, tense markers, etc.). We can remove the inflections in two ways: stemming strips suffixes using a set of rules, while lemmatization replaces each token with its dictionary form (lemma). NLTK's lemmatizer looks words up in a lexical database called WordNet; see the [NLTK stemming how-to](https://www.nltk.org/howto/stem.html) for more on the stemmers.

Learn more about WordNet here: https://wordnet.princeton.edu/


In [10]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [11]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_corpus = [[stemmer.stem(word) for word in doc] for doc in filtered_corpus]
lemmatized_corpus = [[lemmatizer.lemmatize(word) for word in doc] for doc in filtered_corpus]
print("Stems:")
print(*(item for item in stemmed_corpus), sep='\n')
print("Lemmas:")
print(*(item for item in lemmatized_corpus), sep='\n')

Stems:
['cant', 'wait', 'new', 'season', 'favorit', 'show']
['covid', 'pandem', 'affect', 'million', 'peopl', 'worldwid']
['us', 'stock', 'fell', 'friday', 'news', 'rise', 'inflat']
['python', 'great', 'program', 'languag']
Lemmas:
['cant', 'wait', 'new', 'season', 'favorite', 'show']
['covid', 'pandemic', 'affected', 'million', 'people', 'worldwide']
['u', 'stock', 'fell', 'friday', 'news', 'rising', 'inflation']
['python', 'great', 'programming', 'language']


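Note: After stemming, some of the words may look strange or misspelt (`favorit`, `peopl`). This is expected: the Porter stemmer strips suffixes using rules and does not check that the result is a dictionary word. The lemmatizer returns real words, but it treats every token as a noun unless it is given a part-of-speech tag, which is why `affected` and `rising` are unchanged and `us` becomes `u`.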
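
To build intuition for why stems can look truncated, here is a toy suffix stripper. It is a deliberately simplified sketch and not the Porter algorithm; the suffix list and the `toy_stem` name are assumptions made for illustration:

```python
# Suffixes are tried in order, so longer ones come first
SUFFIXES = ("ing", "ed", "es", "s", "e")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word

for word in ["favorite", "people", "millions", "affected", "us"]:
    print(word, "->", toy_stem(word))
```

Because the rules only look at the ending of a word, the result is computed rather than looked up, and nothing guarantees it is a real word.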

## Handling Contractions
Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe. In this step, we expand the contractions in the text back to their full forms.

For example: ` I’ll be there within 5 minutes, won't you?`

To process contractions, we will use the `contractions` library:

In [12]:
!pip install contractions --quiet

In [14]:
# import library
import textwrap
import contractions
# contracted text
text = '''I'll be there within 5 min. Shouldn't you be there too?
		      I'd love to see u there my dear. It's awesome to meet new friends.
		      We've been waiting for this day for so long.
          Aren't you leaving today?
          '''

# start with an empty string to collect the expanded words
expanded_words = ""

# using contractions.fix() to expand the shortened words
for word in text.split():
  expanded_words += contractions.fix(word) + " "

print("Original text:\n" + textwrap.fill(text, width=70)+"\n")
print("Text without contractions:\n" +textwrap.fill(expanded_words, width=70))

Original text:
I'll be there within 5 min. Shouldn't you be there too?
I'd love to see u there my dear. It's awesome to meet new friends.
We've been waiting for this day for so long.           Aren't you
leaving today?

Text without contractions:
I will be there within 5 min. Should not you be there too? I would
love to see you there my dear. It is awesome to meet new friends. We
have been waiting for this day for so long. Are not you leaving today?
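
Conceptually, contraction expansion is a lookup from a contracted form to its full form. A minimal sketch with a tiny hand-written table (hypothetical; the `contractions` library ships a much larger one, and the names below are not part of its API):

```python
import re

# A tiny, hand-written lookup table (hypothetical)
CONTRACTION_MAP = {
    "i'll": "I will",
    "won't": "will not",
    "shouldn't": "should not",
    "it's": "it is",  # ambiguous: "it's" can also mean "it has"
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    text = text.replace("’", "'")  # normalize curly apostrophes first

    def _replace(match):
        original = match.group(0)
        expanded = CONTRACTION_MAP[original.lower()]
        # Preserve capitalization at the start of the contraction
        if original[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)

print(expand_contractions("I’ll be there within 5 minutes, won't you?"))
```

A lookup table cannot resolve ambiguous forms from context, which is one reason to prefer a maintained library for real data.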



## Parts of Speech (POS)

In this task, we label each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc.

This information is crucial for many NLP applications, including parsing, information retrieval, and text analysis.



In [15]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [17]:

nltk.download('averaged_perceptron_tagger_eng') # More specific

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [24]:
# Importing the NLTK library
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Sample text
text = "NLTK is a powerful library for natural language processing. _^_. Also lets plan to learn this on a public datasaet avaiable in The U.S. and Europe"
words = word_tokenize(text)


# Performing PoS tagging
pos_tags = pos_tag(words)

# Displaying the PoS tagged result in separate lines
print("Original Text:")
print(text)

print("\nPoS Tagging Result:")
for word, tag in pos_tags:  # named `tag` so the loop does not overwrite the imported pos_tag function
    print(f"{word}: {tag}")


Original Text:
NLTK is a powerful library for natural language processing. _^_. Also lets plan to learn this on a public datasaet avaiable in The US and Europe

PoS Tagging Result:
NLTK: NNP
is: VBZ
a: DT
powerful: JJ
library: NN
for: IN
natural: JJ
language: NN
processing: NN
.: .
_^_: NN
.: .
Also: RB
lets: VBZ
plan: NN
to: TO
learn: VB
this: DT
on: IN
a: DT
public: JJ
datasaet: NN
avaiable: NN
in: IN
The: DT
US: NNP
and: CC
Europe: NNP


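Here is a breakdown of the most common tags you will come across. NLTK uses the Penn Treebank tag set, in which tags such as `NNP` (proper noun) and `VBZ` (verb, third person singular present) are more specific variants of the base tags below: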

1. **Noun (NN)**: A word that represents a person, place, thing, or idea.
Examples: “cat,” “house,” “love.”

2. **Verb (VB)**: A word that expresses an action or state of being.

Examples: “run,” “eat,” “is.”

3. **Adjective (JJ)**: A word that describes or modifies a noun.

Examples: “red,” “happy,” “tall.”

4. **Adverb (RB)**: A word that modifies a verb, adjective, or other adverb, often indicating manner, time, place, degree, etc.

Examples: “quickly,” “very,” “here.”

5. **Pronoun (PRP)**: A word that substitutes for a noun or noun phrase.

Examples: “he,” “she,” “they.”

6. **Preposition (IN)**: A word that shows the relationship between a noun (or pronoun) and other words in a sentence.

Examples: “in,” “on,” “at.”

7. **Conjunction (CC)**: A word that connects words, phrases, or clauses.

Examples: “and,” “but,” “or.”

8. **Interjection (UH)**: A word or phrase that expresses emotion or exclamation.
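
The tagger output above contains finer-grained tags (`NNP`, `VBZ`) than the base tags in this list. A minimal sketch of collapsing them into the coarse categories, assuming a prefix match is good enough for a quick overview; the mapping and the `coarse_pos` helper are illustrative and not exhaustive:

```python
# Penn Treebank tag prefixes mapped to the coarse categories listed above
COARSE_POS = [
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
    ("PRP", "pronoun"),
    ("IN", "preposition"),
    ("CC", "conjunction"),
    ("UH", "interjection"),
]

def coarse_pos(tag):
    """Return the coarse category for a tag, or 'other' if no prefix matches."""
    for prefix, label in COARSE_POS:
        if tag.startswith(prefix):
            return label
    return "other"

# A few (word, tag) pairs copied from the tagging output above
tagged = [("NLTK", "NNP"), ("is", "VBZ"), ("a", "DT"), ("powerful", "JJ"), ("library", "NN")]
for word, tag in tagged:
    print(f"{word}: {tag} -> {coarse_pos(tag)}")
```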

# Pre-processing web pages

If you get a web data dump, a considerable portion of it will be the CSS and HTML elements used for creating the front end. We need to extract the actual contents of the page. For this task, we can use the Beautiful Soup library, which can identify and process HTML tags.

Often, the text in a page is stored under specific tags (such as `<body>` or `<article>`). We may want to preserve our text along with the tag it is saved under.

In [19]:
# Extracting text from HTML pages using Beautiful Soup

from bs4 import BeautifulSoup

In [30]:
def extract_text_from_html(html_content: str):
  '''
  Parses the HTML and returns the full page text, along with a dictionary
  that maps each tag name to the text housed in that tag.
  Note: if a tag name occurs more than once, a later occurrence overwrites an earlier one.
  '''
  # Initialize an empty dictionary
  text_dict = {}

  # Create a BeautifulSoup object
  soup = BeautifulSoup(html_content, 'html.parser')

  # Extract the text, separating the contents of different elements with new lines.
  # This helps improve readability of real, complex web pages
  text = soup.get_text(separator='\n').strip()

  # Loop through all tags and extract text
  for tag in soup.find_all(True):  # True finds all tags
      # Get the tag name and text
      tag_name = tag.name
      tag_text = tag.get_text(strip=True)

      # Only add to the dictionary if there's text
      if tag_text:
          text_dict[tag_name] = tag_text

  return text, text_dict


In [31]:
# A simple initial example with dummy HTML:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# extract content:
text, tag_content = extract_text_from_html(html_doc)
# print content attached to each contentful html tag:
print("html tags found: ")
for key, value in tag_content.items():
    print(f"{key}: {value}")

print("\nContents of page:")
print(text)

Extracting text this way drops URLs, because they are stored in the `href` attributes of `<a>` tags rather than in the text the tags enclose.

We can search for content housed in specific tags as well.
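
A minimal sketch of tag-specific extraction, written with the standard library's `html.parser` so that it runs without extra installs (Beautiful Soup offers the same idea through `find_all`). The `TagTextCollector` class is a hypothetical helper, and it assumes the target tag is not nested inside itself:

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text and href of every occurrence of one tag (hypothetical helper)."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.inside = False   # True while the parser is inside the target tag
        self.texts = []       # one entry per occurrence of the tag
        self.links = []       # href attribute values, when present

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.inside = True
            self.texts.append("")
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self.target_tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data

sample_html = (
    '<p>Their names were <a href="http://example.com/elsie">Elsie</a> and '
    '<a href="http://example.com/lacie">Lacie</a>.</p>'
)

collector = TagTextCollector("a")
collector.feed(sample_html)
print(collector.texts)
print(collector.links)
```

Keeping a list per tag also avoids losing repeated tags, since every occurrence gets its own entry.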



## Scrape and extract text

We will use [this](https://en.wikisource.org/wiki/Moral_letters_to_Lucilius) Wikisource page, which links to the letters written by Seneca.

In [32]:
import requests
def fetch_html(url):
    try:
        # Send a GET request to the URL
        response = requests.get(url)

        # Raise an exception if the request was unsuccessful
        response.raise_for_status()

        # Return the HTML content
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return None

In [33]:
# import page containing links to all of Seneca's letters
# get web address
src = "https://en.wikisource.org/wiki/Moral_letters_to_Lucilius"

wiki = requests.get(src).text  # pull html as text
# print a small chunk, to see the output:
print(textwrap.fill(wiki[:500], width=70))  # first 500 characters of the downloaded page


## Extract plain text from the raw source code

In [34]:
# Create a BeautifulSoup object
soup = BeautifulSoup(wiki, 'html.parser')

# Extract text, separating the contents of different elements with new lines:
text = soup.get_text(separator='\n').strip()

print(textwrap.fill(text, width=70))

Moral letters to Lucilius - Wikisource, the free online library
Jump to content                 Main menu             Main menu   move
to sidebar   hide                        Navigation
Main Page Community portal Central discussion Recent changes Subject
index Authors Random work Random author Random transcription Help
Display Options
Search                         Search
Appearance                                   Donate     Create account
Log in                   Personal tools             Donate   Create
account   Log in                            Pages for logged out
editors  learn more         Contributions Talk
Moral letters to Lucilius         4 languages           العربية
Español Français Latina     Edit links                         Page
Source Discussion             English
Read Edit View history                 Tools             Tools   move
to sidebar   hide                        Actions                Read
Edit View history                            General
What links 

## Optional

Try running the preprocessing steps from the previous section to turn this text dump into input for an NLP pipeline.
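
As a possible first step, much of the dump consists of short navigation labels separated by large amounts of whitespace. A minimal sketch of a heuristic filter, assuming that lines with fewer than five words are menu items rather than content; the threshold and the `drop_navigation_lines` name are assumptions, and real pages may need a different rule:

```python
def drop_navigation_lines(text, min_words=5):
    """Normalize whitespace and keep only lines with at least min_words words."""
    kept = []
    for line in text.splitlines():
        line = " ".join(line.split())  # collapse runs of whitespace
        if len(line.split()) >= min_words:
            kept.append(line)
    return "\n".join(kept)

# Made-up sample that imitates the shape of the dump above
raw_dump = """Main menu

Jump to content
This sentence stands in for a paragraph of the actual letter text.
   Log in
Create account"""

cleaned = drop_navigation_lines(raw_dump)
print(cleaned)
```

The filtered text can then go through lowercasing, tokenization and stop word removal as before.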

# PDF files

PDF files are a common way to share documents. However, a PDF describes how a page looks rather than storing a clean text corpus, so we need to extract the text first.

For this task, we will use a Python library that processes PDFs (PyPDF2), and carry out the NLP tasks with NLTK.

In [35]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


## Task: Extract names of individuals from an excerpt of a report in PDF format.

Consider the names from the [Wiki entry for Adar](https://lotr.fandom.com/wiki/Adar), from the LOTR universe.

Every entry on the wiki contains several names, and researching one character can turn into looking up many others. So, to understand what we are getting into, it may be worth taking stock of all the names mentioned on a page:


> Adar became known to the peoples of Middle-earth much later in the Second Age, first appearing in a large trench dug by his servant, Magrot, immediately after Arondir's failed attempt to cause a fray and escape. While Arondir is pinned down, having stabbed Magrot in the neck, Lurka orders he be brought to Adar. The "Lord-father" then emerges as the Orcs around him bow. He gently tends to the dying Magrot, who had sustained a mortal wound in the Elves' escape attempt, before suddenly ending his suffering with a dagger. As the rest of the Orcs leave, Adar speaks to Arondir, learning the Silvan Elf's birthplace to be in Beleriand. Adar reminisces about his days along the river Sirion, though he evades Arondir's own questions, before releasing Arondir to take a message to the Southlanders taking refuge in the Watchtower of Ostirith: that they may live if they forsake the territory and swear fealty to him. Later, as he watches one of the caged Wargs devouring fresh flesh, Adar is informed by Grugzûk that the Orc Sigil Hilt that they seek is in the watchtower.


**What are some of the character names you can spot?**

In [None]:
#Optional: run in case of bugs:
# !pip install pip==24.0
# !pip install textract --no-cache-dir

In [36]:
import requests

# URL of the PDF file
url = "https://raw.githubusercontent.com/ua-datalab/NLP-Speech/main/Text_pre_processing_for_NLP/adar_pdf_example.pdf"
# Send a GET request
response = requests.get(url)

# Check if the response is successful
if response.status_code == 200:
    # Save the PDF to a file
    with open('adar_pdf_example.pdf', 'wb') as file:
        file.write(response.content)
    print("PDF downloaded successfully!")
else:
    print(f"Failed to download PDF. Status code: {response.status_code}")

PDF downloaded successfully!


In [37]:
# Reading the PDF
import PyPDF2

# Relative path to the file saved in the previous cell (in Colab this resolves to /content/)
pdf_file = 'adar_pdf_example.pdf'
pdfreader = PyPDF2.PdfReader(pdf_file)

## Extracting characters from a PDF:

In [38]:
# extracting the third page of the document (page indices start at 0)
pageObj = pdfreader.pages[2]
pdf_extract = pageObj.extract_text()
print(textwrap.fill(pdf_extract, width=90))

10/9/24, 12:57Adar | The One Wiki to Rule Them All | Fandom Page 3 of
10https://lotr.fandom.com/wiki/AdarBiographyYears of the Trees & First AgeIn the First
Age, the Elf who would later become known as Adar walkedalongside the river Sirion in
Beleriand, which had banks covered by miles ofsage blossoms.[1]He was one of the
Moriondor, the thirteen Elves chosen to be corrupted byMorgoth in the Elder Days. Lured by
the promise of power, he was led up a dark,nameless peak, chained and abandoned to hunger
and thirst. Morgoth'sservant, Sauron, eventually appeared and ofered him red wine. He
drank thewine, forever changing his nature.[2] His nature was later perceived by
Galadrielafter she briegy captured him in the Second Age.[3] Subsequent generations ofthe
newly bred race of Orcs considered him to be their "father", and followedhim
willingly.Second AgeAfter Morgoth's defeat,Adar entered theservice of Sauron
afteranswering his call tofollow him to thefortress Dúrnost,becoming his lieutenant.

Now that we have extracted text from the PDF, let us fix some extraction issues that could impact our search for people. You may notice that the extracted text is missing a considerable number of spaces.

While there are advanced tools for typo detection, this is fantasy fiction, so spell-checkers would also flag names they do not recognize. So, we will stick to regular expressions for cleanup. Since we are only interested in proper nouns, splitting words at capital letters is all we would need to do.

In [39]:
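# we re-run our text cleaning step, with a new version of clean_text for the PDF extract:
import re
import string

def clean_text(text):
    # Keep letters, whitespace and the punctuation listed below; remove everything else.
    # Note: `*-+` inside the brackets is read as a character range, so hyphens are
    # removed too ("Middle-earth" becomes "Middleearth"), as are accented letters.
    text = re.sub(r'[^a-zA-Z\s.,!\'?()~@#$%^&*-+=]', '', text)
    # Optional: look for capital letters in the middle of words, and split the word.
    # Uncomment if the extracted text has fused words:
    # text = re.sub(r'(?<!^)(?=[A-Z])', ' ', text)
    return text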

cleaned_corpus = clean_text(pdf_extract)
print(textwrap.fill(cleaned_corpus, width=90))

, Adar  The One Wiki to Rule Them All  Fandom Page  of
httpslotr.fandom.comwikiAdarBiographyYears of the Trees & First AgeIn the First Age, the
Elf who would later become known as Adar walkedalongside the river Sirion in Beleriand,
which had banks covered by miles ofsage blossoms.He was one of the Moriondor, the thirteen
Elves chosen to be corrupted byMorgoth in the Elder Days. Lured by the promise of power,
he was led up a dark,nameless peak, chained and abandoned to hunger and thirst.
Morgoth'sservant, Sauron, eventually appeared and ofered him red wine. He drank thewine,
forever changing his nature. His nature was later perceived by Galadrielafter she briegy
captured him in the Second Age. Subsequent generations ofthe newly bred race of Orcs
considered him to be their father, and followedhim willingly.Second AgeAfter Morgoth's
defeat,Adar entered theservice of Sauron afteranswering his call tofollow him to
thefortress Drnost,becoming his lieutenant. At Drst he helped his new master 
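
The cleaned text still contains fused words such as `AgeAfter` and `defeat,Adar`. A minimal sketch of a narrower split than the commented-out pattern above, assuming we only want to add a space at a lowercase-to-uppercase boundary and after a comma or full stop; `split_fused_words` is a hypothetical helper:

```python
import re

def split_fused_words(text):
    # Insert a space where a lowercase letter is directly followed by an uppercase letter
    text = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)
    # Insert a space after a comma or full stop that is directly followed by a letter
    text = re.sub(r'(?<=[.,])(?=[A-Za-z])', ' ', text)
    return text

sample = "followedhim willingly.Second AgeAfter Morgoth's defeat,Adar entered"
print(split_fused_words(sample))
```

This leaves fused lowercase words such as `followedhim` untouched, and it would also split dotted strings such as URLs, so it is a partial fix at best.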

## Utilizing the NLTK pipeline

Next, we will run the NLTK pipeline to extract named entities. The `ne_chunk` function groups POS-tagged tokens into subtrees labelled with an entity type, such as `PERSON`, `ORGANIZATION` or `GPE`.

In [40]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

import re


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [42]:
# nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


True

In [43]:
# Process the cleaned corpus
def clean_nltk(text):
  tokens = word_tokenize(text)
  tagged = pos_tag(tokens)
  named_entities = ne_chunk(tagged)
  return named_entities

# Parse and tag each sentence, and extract named entities:
nltk_results = clean_nltk(cleaned_corpus)

print(nltk_results)


(S
  ,/,
  Adar/NNP
  The/DT
  (ORGANIZATION One/NNP Wiki/NNP)
  to/TO
  Rule/VB
  Them/NNP
  All/NNP
  Fandom/NNP
  Page/NNP
  of/IN
  httpslotr.fandom.comwikiAdarBiographyYears/NNS
  of/IN
  the/DT
  (ORGANIZATION Trees/NNP)
  &/CC
  (PERSON First/NNP)
  AgeIn/NNP
  the/DT
  (ORGANIZATION First/NNP Age/NNP)
  ,/,
  the/DT
  Elf/NNP
  who/WP
  would/MD
  later/RB
  become/VB
  known/VBN
  as/IN
  (PERSON Adar/NNP)
  walkedalongside/IN
  the/DT
  river/NN
  Sirion/NNP
  in/IN
  (GPE Beleriand/NNP)
  ,/,
  which/WDT
  had/VBD
  banks/NNS
  covered/VBN
  by/IN
  miles/NNS
  ofsage/RB
  blossoms.He/NN
  was/VBD
  one/CD
  of/IN
  the/DT
  (ORGANIZATION Moriondor/NNP)
  ,/,
  the/DT
  thirteen/JJ
  Elves/NNP
  chosen/NN
  to/TO
  be/VB
  corrupted/VBN
  byMorgoth/DT
  in/IN
  the/DT
  (ORGANIZATION Elder/NNP Days/NNP)
  ./.
  Lured/VBN
  by/IN
  the/DT
  promise/NN
  of/IN
  power/NN
  ,/,
  he/PRP
  was/VBD
  led/VBN
  up/RP
  a/DT
  dark/NN
  ,/,
  nameless/JJ
  peak/NN
  ,/,
  chained/V

In [44]:
# We will only need the contents of the Tree item for each named entity, which provides all the words
# clustered for each NE label, along with their POS tags
# Learn more about this module here: https://www.nltk.org/api/nltk.tree.tree.html#nltk.tree.tree.Tree
named_entities = [ne for ne in nltk_results if isinstance(ne, Tree)]
print(*(ne for ne in named_entities[:20]), sep='\n')

#  Try and see what is excluded from this search.
# You will see POS tags, but no NE labels:
# not_named_entities = [nne for nne in nltk_results if type(nne) != Tree]
# print(*(ne for ne in not_named_entities[:20]), sep='\n')

(ORGANIZATION One/NNP Wiki/NNP)
(ORGANIZATION Trees/NNP)
(PERSON First/NNP)
(ORGANIZATION First/NNP Age/NNP)
(PERSON Adar/NNP)
(GPE Beleriand/NNP)
(ORGANIZATION Moriondor/NNP)
(ORGANIZATION Elder/NNP Days/NNP)
(GPE Sauron/NNP)
(ORGANIZATION Second/JJ Age/NNP)
(PERSON Orcs/NNP)
(ORGANIZATION AgeAfter/NNP Morgoth/NNP)
(PERSON Adar/NNP)
(GPE Sauron/NNP)
(PERSON Drnost/NNP)
(GPE Middleearth/NNP)
(GPE Sauron/NNP)
(PERSON Sauron/NNP)
(GPE Sauron/NNP)
(GPE Drnost/NNP)


## Extracting Person names from the list of NERs

You will notice that the label "PERSON" is the one relevant to our search. Because this is a fantasy universe, some people have also been labelled as "GPE" (geopolitical entities), and some places and objects have been labelled as "PERSON". For the purposes of this exercise, we will only extract named entities with the label "PERSON".

Try this code with some real-world text and compare the results!

In [45]:
# Finally, we will only keep the Named Entities with the label "PERSON"
for ne in named_entities:
    if ne.label() == 'PERSON':
        # join all the nouns housed in the label:
        name = ''
        for leaf in ne.leaves():
            name += leaf[0] + ' '
        print ('Type: ', ne.label(), 'Name: ', name)

Type:  PERSON Name:  First 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Orcs 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Drnost 
Type:  PERSON Name:  Sauron 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Dark Lord 
Type:  PERSON Name:  Morgoth 
Type:  PERSON Name:  Sauron 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Orcs 
Type:  PERSON Name:  Magrot 
Type:  PERSON Name:  Arondir 
Type:  PERSON Name:  Magrot 
Type:  PERSON Name:  Magrot 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Sirion 
Type:  PERSON Name:  Arondir 
Type:  PERSON Name:  Arondir 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Grugzk 
Type:  PERSON Name:  Orc Sigil Hilt 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Adar 
Type:  PERSON Name:  Arondir START 
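
The list above repeats names many times. A minimal sketch of summarizing it with the standard library's `Counter`, using a handful of strings copied from the output above as stand-in data (in the notebook, the names would be collected in the loop instead of printed):

```python
from collections import Counter

# A few PERSON strings copied from the output above, including their trailing spaces
person_mentions = [
    "First ", "Adar ", "Orcs ", "Adar ", "Drnost ",
    "Sauron ", "Adar ", "Magrot ", "Arondir ", "Magrot ",
]

# Strip the trailing spaces, then count how often each name occurs
name_counts = Counter(name.strip() for name in person_mentions)

for name, count in name_counts.most_common():
    print(f"{name}: {count}")
```

Sorting by frequency puts the main characters first and makes one-off mislabels such as `First` easy to spot.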


# Final thoughts

In this workshop, we worked through some simple pre-processing steps for text extracted from different sources, using the NLTK toolkit. We encountered different kinds of noise in different sources, and more complex tasks needed different approaches and additional steps. When we moved to a fantasy universe, the NLTK models did a fairly good job, but ran into issues when the pre-processing introduced errors (such as incorrect POS tagging). The purpose of showing such a complex task was to examine how important text pre-processing is for downstream tasks.

As this is a garbage in, garbage out (GIGO) situation, setups with language models trained on more data (potentially spaCy or even an LLM) could do an even better job, or handle noisy data more gracefully. We demonstrated some lightweight setups for fast processing. Try out this code with other pre-processing tools and tell us about your experience!

# Sources and References

- https://www.geeksforgeeks.org/text-preprocessing-for-nlp-tasks/
- https://www.xbyte.io/how-to-do-web-scraping-and-pre-processing-for-nlp-using-python/
- https://ydv-poonam.medium.com/how-to-extract-text-from-a-pdf-nlp-b6409422cfd2
- https://unbiased-coder.com/extract-names-python-nltk/
- https://towardsdatascience.com/elegant-text-pre-processing-with-nltk-in-sklearn-pipeline-d6fe18b91eb8

## Housekeeping
1. Turn off screenshare
2. Stop recording
3. Make the next instructor Zoom host
