# Natural Language Processing

The goal is to convert raw text into numerical features that machine learning models can consume.

Typical preprocessing steps, in the order used below:
1. Remove punctuation
2. Tokenization
3. Case normalization
4. Stopword removal
5. Lemmatization (or stemming)
6. Vectorization

Related tasks:
- POS tagging
- Named Entity Recognition


In [None]:
!pip install -U spacy
!pip install beautifulsoup4
!pip install newspaper3k
!pip install nltk pandas
!python -m spacy download en_core_web_sm

In [64]:
url  = "https://www.thedailystar.net/opinion/editorial/news/strict-oversight-vital-end-the-tree-cutting-bonanza-3604996"

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

"Latest incident saw the startling transformation of Altadighi National Park\n\nThe High Court's nod on reining in tree-cutting practices by forming supervisory committees at the district and upazila levels could not have come at a more appropriate time. Despite the recent heatwave that turned out to be the longest in 76 years, tree felling by both public and private entities continues unabated, setting the stage for an even warmer future. You hear news of Bashundhara mowing down trees along the main road of its residential area. You hear of the forest department moving to cut down 2,044 trees on four roads in Jashore, similar to previous attempts targeting century-old trees on the Jashore-Benapole highway. You hear of the LGED felling trees in Patuakhali in the name of canal restoration.\n\nThese developments represent a dangerous disregard for trees and forests that keep temperatures down, among other things. One particularly disturbing development of late, as reported by this daily 

In [65]:
print(article.text)

Latest incident saw the startling transformation of Altadighi National Park

The High Court's nod on reining in tree-cutting practices by forming supervisory committees at the district and upazila levels could not have come at a more appropriate time. Despite the recent heatwave that turned out to be the longest in 76 years, tree felling by both public and private entities continues unabated, setting the stage for an even warmer future. You hear news of Bashundhara mowing down trees along the main road of its residential area. You hear of the forest department moving to cut down 2,044 trees on four roads in Jashore, similar to previous attempts targeting century-old trees on the Jashore-Benapole highway. You hear of the LGED felling trees in Patuakhali in the name of canal restoration.

These developments represent a dangerous disregard for trees and forests that keep temperatures down, among other things. One particularly disturbing development of late, as reported by this daily on We

### Apply NLP on text data

In [66]:
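# Remove newline characters. Note: replacing '\n' with '' (rather than ' ') glues the headline to the first paragraph, producing "ParkThe" in the output.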
doc = article.text.replace('\n', '')
print(doc)

Latest incident saw the startling transformation of Altadighi National ParkThe High Court's nod on reining in tree-cutting practices by forming supervisory committees at the district and upazila levels could not have come at a more appropriate time. Despite the recent heatwave that turned out to be the longest in 76 years, tree felling by both public and private entities continues unabated, setting the stage for an even warmer future. You hear news of Bashundhara mowing down trees along the main road of its residential area. You hear of the forest department moving to cut down 2,044 trees on four roads in Jashore, similar to previous attempts targeting century-old trees on the Jashore-Benapole highway. You hear of the LGED felling trees in Patuakhali in the name of canal restoration.These developments represent a dangerous disregard for trees and forests that keep temperatures down, among other things. One particularly disturbing development of late, as reported by this daily on Wednes
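
Deleting newlines outright joins the words on either side of each line break. A minimal sketch of one alternative, using a short made-up string in place of `article.text`: collapse every run of whitespace into a single space.

```python
# Hypothetical sample standing in for article.text
sample = "Altadighi National Park\n\nThe High Court's nod"

# str.split() with no argument splits on any run of whitespace
# (spaces, tabs, newlines), so re-joining with ' ' normalizes the text.
normalized = " ".join(sample.split())
print(normalized)  # Altadighi National Park The High Court's nod
```

This keeps word boundaries intact, so a later tokenization step sees `Park` and `The` as separate tokens.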

In [74]:
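# Remove punctuation
import string
print(string.punctuation)

# '.' is already part of string.punctuation, so appending it is redundant.
# Because this mutates the module-level constant, every re-run of the cell adds another dot.
string.punctuation = string.punctuation + '.'
# NOTE: str.replace() searches for the whole punctuation string as ONE substring,
# which never occurs in the text, so this call leaves doc unchanged.
doc = doc.replace(string.punctuation, '')
print(doc)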


!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~......
Latest incident saw the startling transformation of Altadighi National ParkThe High Court's nod on reining in tree-cutting practices by forming supervisory committees at the district and upazila levels could not have come at a more appropriate time. Despite the recent heatwave that turned out to be the longest in 76 years, tree felling by both public and private entities continues unabated, setting the stage for an even warmer future. You hear news of Bashundhara mowing down trees along the main road of its residential area. You hear of the forest department moving to cut down 2,044 trees on four roads in Jashore, similar to previous attempts targeting century-old trees on the Jashore-Benapole highway. You hear of the LGED felling trees in Patuakhali in the name of canal restoration.These developments represent a dangerous disregard for trees and forests that keep temperatures down, among other things. One particularly disturbing development of la
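
The printed text still contains commas, periods, and apostrophes, because `str.replace` matches its first argument as one contiguous substring. A small sketch of per-character removal with `str.translate`, shown on a sample sentence rather than the full article:

```python
import string

sample = "The High Court's nod could not have come at a more appropriate time."

# str.replace(old, new) looks for `old` as a single substring, so passing
# the whole punctuation string matches nothing in ordinary text.
unchanged = sample.replace(string.punctuation, "")

# A translation table whose third argument lists characters to delete
# removes each punctuation character individually.
table = str.maketrans("", "", string.punctuation)
cleaned = sample.translate(table)

print(unchanged == sample)  # True
print(cleaned)  # The High Courts nod could not have come at a more appropriate time
```

Two caveats: `string.punctuation` covers ASCII characters only, so a typographic dash such as the one in "undertaking—initiated" is left in place, and deleting hyphens turns "tree-cutting" into "treecutting". Whether that is acceptable depends on the task.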

In [75]:
# Tokenize the document
doc = doc.split(' ')
print(doc)

['Latest', 'incident', 'saw', 'the', 'startling', 'transformation', 'of', 'Altadighi', 'National', 'ParkThe', 'High', "Court's", 'nod', 'on', 'reining', 'in', 'tree-cutting', 'practices', 'by', 'forming', 'supervisory', 'committees', 'at', 'the', 'district', 'and', 'upazila', 'levels', 'could', 'not', 'have', 'come', 'at', 'a', 'more', 'appropriate', 'time.', 'Despite', 'the', 'recent', 'heatwave', 'that', 'turned', 'out', 'to', 'be', 'the', 'longest', 'in', '76', 'years,', 'tree', 'felling', 'by', 'both', 'public', 'and', 'private', 'entities', 'continues', 'unabated,', 'setting', 'the', 'stage', 'for', 'an', 'even', 'warmer', 'future.', 'You', 'hear', 'news', 'of', 'Bashundhara', 'mowing', 'down', 'trees', 'along', 'the', 'main', 'road', 'of', 'its', 'residential', 'area.', 'You', 'hear', 'of', 'the', 'forest', 'department', 'moving', 'to', 'cut', 'down', '2,044', 'trees', 'on', 'four', 'roads', 'in', 'Jashore,', 'similar', 'to', 'previous', 'attempts', 'targeting', 'century-old', 't
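
Splitting on a single space leaves punctuation attached to tokens (`'time.'`, `'years,'`) and produces empty strings wherever two spaces meet. One possible alternative is a regular-expression tokenizer; the pattern below is an illustrative assumption, not a standard:

```python
import re

sample = "Despite the recent heatwave, tree-cutting continues unabated.  The High Court's nod came."

# Naive split: punctuation stays attached and the double space yields ''.
naive = sample.split(" ")

# Runs of word characters, optionally joined by internal hyphens or apostrophes.
TOKEN = re.compile(r"\w+(?:[-']\w+)*")
tokens = TOKEN.findall(sample)

print(naive)
print(tokens)
```

With this pattern, `tree-cutting` and `Court's` each stay as one token while trailing commas and periods are dropped. A number written as `2,044` would be split into `2` and `044`, so the pattern would need extending for text where that matters.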

In [76]:
# Case normalization
doc = [word.lower() for word in doc]
print(doc)

['latest', 'incident', 'saw', 'the', 'startling', 'transformation', 'of', 'altadighi', 'national', 'parkthe', 'high', "court's", 'nod', 'on', 'reining', 'in', 'tree-cutting', 'practices', 'by', 'forming', 'supervisory', 'committees', 'at', 'the', 'district', 'and', 'upazila', 'levels', 'could', 'not', 'have', 'come', 'at', 'a', 'more', 'appropriate', 'time.', 'despite', 'the', 'recent', 'heatwave', 'that', 'turned', 'out', 'to', 'be', 'the', 'longest', 'in', '76', 'years,', 'tree', 'felling', 'by', 'both', 'public', 'and', 'private', 'entities', 'continues', 'unabated,', 'setting', 'the', 'stage', 'for', 'an', 'even', 'warmer', 'future.', 'you', 'hear', 'news', 'of', 'bashundhara', 'mowing', 'down', 'trees', 'along', 'the', 'main', 'road', 'of', 'its', 'residential', 'area.', 'you', 'hear', 'of', 'the', 'forest', 'department', 'moving', 'to', 'cut', 'down', '2,044', 'trees', 'on', 'four', 'roads', 'in', 'jashore,', 'similar', 'to', 'previous', 'attempts', 'targeting', 'century-old', 't

In [79]:
# Define a stopword list and remove the stopwords from doc

stop_word_list = ['a', 'an', 'the', 'is', 'are', 'was', 'were', 'will', 'shall', 'would', 'should', 
                  'can', 'could', 'may', 'might', 'be', 'been', 'being', 'have', 'has', 'had', 'do', 
                  'does', 'did', 'am', 'are', 'is', 'was', 'were', 'be', 'been', 'being', 'have',
                  'has', 'had', 'do', 'does', 'did', 'may', 'might', 'must', 'need', 'ought',
                  'shall', 'will', 'would', 'can', 'could', 'should', 'here', 'there', 
                  'where', 'when', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most',
                  'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
                  'very', 's', 't', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y',
                  'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 
                  'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn', '']

# Iterate over a copy: removing items from a list while looping over it skips elements
for word in list(doc):
    if word in stop_word_list:
        doc.remove(word)

In [None]:
# clean doc
doc = [word for word in doc if word not in stop_word_list]

print(doc)
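
A common pitfall when filtering stopwords is removing items from a list while looping over that same list. The sketch below uses a small, made-up token list and stopword set to show what goes wrong and how building a new list avoids it:

```python
tokens = ["could", "not", "have", "come", "at", "a", "more", "appropriate", "time"]
stop_words = {"a", "could", "have", "more", "not"}  # small illustrative set

# Pitfall: each removal shifts later items one position to the left, so the
# element right after a removed one is never examined by the loop.
buggy = list(tokens)
for word in buggy:
    if word in stop_words:
        buggy.remove(word)

# Building a new list sidesteps the problem; a set gives constant-time lookups.
filtered = [word for word in tokens if word not in stop_words]

print(buggy)     # ['not', 'come', 'at', 'more', 'appropriate', 'time']
print(filtered)  # ['come', 'at', 'appropriate', 'time']
```

In the in-place version, `'not'` and `'more'` survive because each sits directly after a word that was just removed.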

In [80]:
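# Lemmatization with NLTK's WordNetLemmatizer
# (requires the WordNet corpus: run nltk.download('wordnet') once)

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize() treats every word as a noun unless a pos argument is given
doc_lemma = [lemmatizer.lemmatize(word) for word in doc]

# print only the words that changed
for word, lemma in zip(doc, doc_lemma):
    if word != lemma:
        print(word, lemma)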

practices practice
committees committee
levels level
entities entity
trees tree
its it
trees tree
roads road
attempts attempt
trees tree
trees tree
developments development
trees tree
forests forest
temperatures temperature
as a
trees tree
as a
years year
areas area
says say
trees tree
species specie
trees tree
locals local
trees tree
varieties variety
species specie
highlights highlight
policies policy
ecosystems ecosystem
experts expert
decisions decision
authorities authority


In [81]:
# Stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

doc_stem = [stemmer.stem(word) for word in doc]

# print only the words that changed
for word, stem in zip(doc, doc_stem):
    if word != stem:
        print(word, stem)

incident incid
startling startl
transformation transform
national nation
parkthe parkth
court's court'
reining rein
tree-cutting tree-cut
practices practic
forming form
supervisory supervisori
committees committe
levels level
appropriate appropri
despite despit
heatwave heatwav
turned turn
felling fell
private privat
entities entiti
continues continu
setting set
mowing mow
trees tree
its it
residential residenti
department depart
moving move
trees tree
roads road
previous previou
attempts attempt
targeting target
trees tree
jashore-benapole jashore-benapol
felling fell
trees tree
restoration.these restoration.thes
developments develop
represent repres
dangerous danger
trees tree
forests forest
temperatures temperatur
particularly particularli
disturbing disturb
development develop
reported report
this thi
daily daili
trees tree
felled fell
government govern
multi-crore multi-cror
undertaking—initiated undertaking—initi
department depart
years year
go—aims go—aim
restore restor
conserve
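
The two outputs differ in kind: the stemmer chops suffixes and often leaves non-words (`incid`, `entiti`), while the lemmatizer maps words to dictionary forms (`entity`, `committee`). The toy sketch below contrasts the two ideas. Its suffix rules and lookup table are invented for illustration and are far simpler than `PorterStemmer` or `WordNetLemmatizer`:

```python
# Toy illustration only: made-up rules, not the Porter algorithm or WordNet.
SUFFIXES = ("ies", "ing", "ed", "s")  # checked in this order
LEMMA_TABLE = {
    "committees": "committee",
    "entities": "entity",
    "felling": "fell",
    "trees": "tree",
}

def toy_stem(word):
    """Strip the first matching suffix, whether or not a real word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table of known forms; leave unknown words alone."""
    return LEMMA_TABLE.get(word, word)

for w in ["entities", "committees", "felling", "trees"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Rule-based stripping is fast and needs no vocabulary, but `entities` becomes `entit`; the lookup approach returns `entity` yet can only handle words it already knows.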

In [82]:
doc

['latest',
 'incident',
 'saw',
 'startling',
 'transformation',
 'of',
 'altadighi',
 'national',
 'parkthe',
 'high',
 "court's",
 'nod',
 'on',
 'reining',
 'in',
 'tree-cutting',
 'practices',
 'by',
 'forming',
 'supervisory',
 'committees',
 'at',
 'district',
 'and',
 'upazila',
 'levels',
 'come',
 'at',
 'appropriate',
 'time.',
 'despite',
 'recent',
 'heatwave',
 'that',
 'turned',
 'out',
 'to',
 'longest',
 'in',
 '76',
 'years,',
 'tree',
 'felling',
 'by',
 'public',
 'and',
 'private',
 'entities',
 'continues',
 'unabated,',
 'setting',
 'stage',
 'for',
 'even',
 'warmer',
 'future.',
 'you',
 'hear',
 'news',
 'of',
 'bashundhara',
 'mowing',
 'down',
 'trees',
 'along',
 'main',
 'road',
 'of',
 'its',
 'residential',
 'area.',
 'you',
 'hear',
 'of',
 'forest',
 'department',
 'moving',
 'to',
 'cut',
 'down',
 '2,044',
 'trees',
 'on',
 'four',
 'roads',
 'in',
 'jashore,',
 'similar',
 'to',
 'previous',
 'attempts',
 'targeting',
 'century-old',
 'trees',
 'on',

### NLP using spaCy

In [83]:
print(article.text)

Latest incident saw the startling transformation of Altadighi National Park

The High Court's nod on reining in tree-cutting practices by forming supervisory committees at the district and upazila levels could not have come at a more appropriate time. Despite the recent heatwave that turned out to be the longest in 76 years, tree felling by both public and private entities continues unabated, setting the stage for an even warmer future. You hear news of Bashundhara mowing down trees along the main road of its residential area. You hear of the forest department moving to cut down 2,044 trees on four roads in Jashore, similar to previous attempts targeting century-old trees on the Jashore-Benapole highway. You hear of the LGED felling trees in Patuakhali in the name of canal restoration.

These developments represent a dangerous disregard for trees and forests that keep temperatures down, among other things. One particularly disturbing development of late, as reported by this daily on We

In [84]:
import spacy
nlp = spacy.load('en_core_web_sm')

spacy_doc = nlp(article.text)

type(spacy_doc)

spacy.tokens.doc.Doc

In [89]:
for token in spacy_doc:
    print(f'{token.text:20} {token.pos_:10} {token.lemma_:20} {token.is_stop}')

Latest               ADJ        late                 False
incident             NOUN       incident             False
saw                  VERB       see                  False
the                  DET        the                  True
startling            ADJ        startling            False
transformation       NOUN       transformation       False
of                   ADP        of                   True
Altadighi            PROPN      Altadighi            False
National             PROPN      National             False
Park                 PROPN      Park                 False


                   SPACE      

                   False
The                  DET        the                  True
High                 PROPN      High                 False
Court                PROPN      Court                False
's                   PART       's                   True
nod                  NOUN       nod                  False
on                   ADP        on                   True
re

In [90]:
# remove stop words, punctuation and spaces
spacy_doc_token = [token for token in spacy_doc if not token.is_stop and not token.is_punct and not token.is_space]

spacy_doc_token

[Latest,
 incident,
 saw,
 startling,
 transformation,
 Altadighi,
 National,
 Park,
 High,
 Court,
 nod,
 reining,
 tree,
 cutting,
 practices,
 forming,
 supervisory,
 committees,
 district,
 upazila,
 levels,
 come,
 appropriate,
 time,
 Despite,
 recent,
 heatwave,
 turned,
 longest,
 76,
 years,
 tree,
 felling,
 public,
 private,
 entities,
 continues,
 unabated,
 setting,
 stage,
 warmer,
 future,
 hear,
 news,
 Bashundhara,
 mowing,
 trees,
 main,
 road,
 residential,
 area,
 hear,
 forest,
 department,
 moving,
 cut,
 2,044,
 trees,
 roads,
 Jashore,
 similar,
 previous,
 attempts,
 targeting,
 century,
 old,
 trees,
 Jashore,
 Benapole,
 highway,
 hear,
 LGED,
 felling,
 trees,
 Patuakhali,
 canal,
 restoration,
 developments,
 represent,
 dangerous,
 disregard,
 trees,
 forests,
 temperatures,
 things,
 particularly,
 disturbing,
 development,
 late,
 reported,
 daily,
 Wednesday,
 saw,
 1,000,
 trees,
 felled,
 Altadighi,
 Lake,
 Naogaon,
 government,
 project,
 multi,
 cro

In [91]:
# Lemmatization
spacy_doc_lemma = [token.lemma_ for token in spacy_doc_token]

spacy_doc_lemma

['late',
 'incident',
 'see',
 'startling',
 'transformation',
 'Altadighi',
 'National',
 'Park',
 'High',
 'Court',
 'nod',
 'rein',
 'tree',
 'cut',
 'practice',
 'form',
 'supervisory',
 'committee',
 'district',
 'upazila',
 'level',
 'come',
 'appropriate',
 'time',
 'despite',
 'recent',
 'heatwave',
 'turn',
 'long',
 '76',
 'year',
 'tree',
 'fell',
 'public',
 'private',
 'entity',
 'continue',
 'unabated',
 'set',
 'stage',
 'warm',
 'future',
 'hear',
 'news',
 'Bashundhara',
 'mow',
 'tree',
 'main',
 'road',
 'residential',
 'area',
 'hear',
 'forest',
 'department',
 'move',
 'cut',
 '2,044',
 'tree',
 'road',
 'Jashore',
 'similar',
 'previous',
 'attempt',
 'target',
 'century',
 'old',
 'tree',
 'Jashore',
 'Benapole',
 'highway',
 'hear',
 'LGED',
 'felling',
 'tree',
 'Patuakhali',
 'canal',
 'restoration',
 'development',
 'represent',
 'dangerous',
 'disregard',
 'tree',
 'forest',
 'temperature',
 'thing',
 'particularly',
 'disturb',
 'development',
 'late',
 'r

In [92]:
# Case normalization (applied to the lemmas from the previous cell)
spacy_doc_lower = [token.lower() for token in spacy_doc_lemma]

spacy_doc_lower

['late',
 'incident',
 'see',
 'startling',
 'transformation',
 'altadighi',
 'national',
 'park',
 'high',
 'court',
 'nod',
 'rein',
 'tree',
 'cut',
 'practice',
 'form',
 'supervisory',
 'committee',
 'district',
 'upazila',
 'level',
 'come',
 'appropriate',
 'time',
 'despite',
 'recent',
 'heatwave',
 'turn',
 'long',
 '76',
 'year',
 'tree',
 'fell',
 'public',
 'private',
 'entity',
 'continue',
 'unabated',
 'set',
 'stage',
 'warm',
 'future',
 'hear',
 'news',
 'bashundhara',
 'mow',
 'tree',
 'main',
 'road',
 'residential',
 'area',
 'hear',
 'forest',
 'department',
 'move',
 'cut',
 '2,044',
 'tree',
 'road',
 'jashore',
 'similar',
 'previous',
 'attempt',
 'target',
 'century',
 'old',
 'tree',
 'jashore',
 'benapole',
 'highway',
 'hear',
 'lged',
 'felling',
 'tree',
 'patuakhali',
 'canal',
 'restoration',
 'development',
 'represent',
 'dangerous',
 'disregard',
 'tree',
 'forest',
 'temperature',
 'thing',
 'particularly',
 'disturb',
 'development',
 'late',
 'r

### Vectorization: Bag of Words

In [93]:
# Setup: split the article into sentences; each sentence is one "document" in the collection
doc_collection = [sent.text for sent in spacy_doc.sents]
doc_collection

["Latest incident saw the startling transformation of Altadighi National Park\n\nThe High Court's nod on reining in tree-cutting practices by forming supervisory committees at the district and upazila levels could not have come at a more appropriate time.",
 'Despite the recent heatwave that turned out to be the longest in 76 years, tree felling by both public and private entities continues unabated, setting the stage for an even warmer future.',
 'You hear news of Bashundhara mowing down trees along the main road of its residential area.',
 'You hear of the forest department moving to cut down 2,044 trees on four roads in Jashore, similar to previous attempts targeting century-old trees on the Jashore-Benapole highway.',
 'You hear of the LGED felling trees in Patuakhali in the name of canal restoration.\n\n',
 'These developments represent a dangerous disregard for trees and forests that keep temperatures down, among other things.',
 'One particularly disturbing development of late, 

In [94]:
len(doc_collection)

18

In [95]:
# Apply the same preprocessing steps to the entire document collection

doc_collection = [sent.replace('\n', ' ') for sent in doc_collection]
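# NOTE: str.replace() looks for the whole punctuation string as one substring, so this line
# is a no-op and tokens such as 'years,' keep their punctuation
doc_collection = [sent.replace(string.punctuation, '') for sent in doc_collection]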
doc_collection = [sent.split(' ') for sent in doc_collection]
doc_collection = [[word.lower() for word in sent] for sent in doc_collection]
doc_collection = [[word for word in sent if word not in stop_word_list] for sent in doc_collection]
doc_collection = [[lemmatizer.lemmatize(word) for word in sent] for sent in doc_collection]
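
When the same steps are applied to both a single document and a collection, wrapping them in one function keeps the two code paths identical. A minimal sketch follows; the `preprocess` name, the tiny stopword set, and the sample sentences are assumptions for the example, and lemmatization is left out to keep it dependency-free:

```python
import string

# Assumption: a deliberately tiny stopword set for the example.
STOP_WORDS = {"a", "an", "the", "be", "to", "that"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(sentence):
    """Lowercase, strip ASCII punctuation, tokenize, and drop stopwords."""
    text = sentence.replace("\n", " ").lower().translate(PUNCT_TABLE)
    # split() with no argument never yields empty strings
    return [tok for tok in text.split() if tok not in STOP_WORDS]

sentences = [
    "You hear news of Bashundhara mowing down trees.",
    "Despite the recent heatwave, tree felling continues unabated.\n\n",
]
processed = [preprocess(s) for s in sentences]
print(processed)
```

Each sentence comes back as a clean token list, with no trailing periods or commas attached to the last word of a clause.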

In [98]:
doc_collection[1]

['despite',
 'recent',
 'heatwave',
 'that',
 'turned',
 'out',
 'to',
 'longest',
 'in',
 '76',
 'years,',
 'tree',
 'felling',
 'by',
 'public',
 'and',
 'private',
 'entity',
 'continues',
 'unabated,',
 'setting',
 'stage',
 'for',
 'even',
 'warmer',
 'future.']

In [99]:
# View the document collection as a DataFrame (one row per sentence, one column per token position)
import pandas as pd

df = pd.DataFrame(doc_collection)

df

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | latest | incident | saw | startling | transformation | of | altadighi | national | park | high | ... | committee | at | district | and | upazila | level | come | at | appropriate | time. |
| 1 | despite | recent | heatwave | that | turned | out | to | longest | in | 76 | ... | setting | stage | for | even | warmer | future. | | | | |
| 2 | you | hear | news | of | bashundhara | mowing | down | tree | along | main | ... | | | | | | | | | | |
| 3 | you | hear | of | forest | department | moving | to | cut | down | 2044 | ... | targeting | century-old | tree | on | jashore-benapole | highway. | | | | |
| 4 | you | hear | of | lged | felling | tree | in | patuakhali | in | name | ... | | | | | | | | | | |
| 5 | these | development | represent | dangerous | disregard | for | tree | and | forest | that | ... | | | | | | | | | | |
| 6 | one | particularly | disturbing | development | of | late, | a | reported | by | this | ... | lake | in | naogaon | a | part | of | government | project. | | |
| 7 | multi-crore | undertaking—initiated | by | forest | department | three | year | go—aims | to | restore | ... | form | part | of | altadighi | national | park. | | | | |
| 8 | part | of | plan | draining | and | re-excavating | lake, | which | almost | done. | ... | | | | | | | | | | |
| 9 | forest | department | say | tree | removed | to | facilitate | excavation, | adding | that | ... | | | | | | | | | | |


In [101]:
# Build a vocabulary

vocab = []
for sent in doc_collection:
    for word in sent:
        if word not in vocab:
            vocab.append(word)

len(vocab)

215
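
Checking `word not in vocab` against a list rescans the list for every token, which slows down as the vocabulary grows. A sketch of an equivalent construction using `dict.fromkeys`, shown on a small made-up collection:

```python
# Hypothetical miniature collection standing in for doc_collection
doc_collection = [["tree", "felling", "tree"], ["forest", "tree", "department"]]

# dict keys are unique and keep first-seen order, so this reproduces the
# list-based loop with constant-time membership checks.
vocab = list(dict.fromkeys(word for sent in doc_collection for word in sent))

# Optional: map each word to its column index for later lookups.
word_to_index = {word: i for i, word in enumerate(vocab)}

print(vocab)          # ['tree', 'felling', 'forest', 'department']
print(word_to_index)  # {'tree': 0, 'felling': 1, 'forest': 2, 'department': 3}
```

The resulting order matches what the nested loop with `append` would produce, so any columns built from it line up the same way.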

In [102]:

# Bag of words implementation
bow = []
for sent in doc_collection:
    bow_sent = []
    for word in vocab:
        bow_sent.append(sent.count(word))
    bow.append(bow_sent)

# Convert to a dataframe
df_bow = pd.DataFrame(bow, columns=vocab)

df_bow


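| | latest | incident | saw | startling | transformation | of | altadighi | national | park | high | ... | supervision | limit | practices. | that, | relevant | authority | made | accountable | their | activities. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |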
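
Calling `sent.count(word)` for every vocabulary entry rescans the sentence once per word. A sketch of the same bag-of-words counts using `collections.Counter`, on a small made-up collection (the `bag_of_words` helper is a name chosen for this example):

```python
from collections import Counter

# Hypothetical miniature collection standing in for doc_collection
doc_collection = [
    ["you", "hear", "of", "tree", "of"],
    ["forest", "department", "tree"],
]
vocab = list(dict.fromkeys(w for sent in doc_collection for w in sent))

def bag_of_words(sentence, vocab):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(sentence)  # a single pass over the sentence
    return [counts[word] for word in vocab]  # Counter gives 0 for absent words

bow = [bag_of_words(sent, vocab) for sent in doc_collection]

print(vocab)  # ['you', 'hear', 'of', 'tree', 'forest', 'department']
print(bow)    # [[1, 1, 2, 1, 0, 0], [0, 0, 0, 1, 1, 1]]
```

Each row is one sentence and each column one vocabulary word, which is the same layout as the DataFrame above and can be passed to `pd.DataFrame(bow, columns=vocab)` in the same way.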
