# **CAP6640 - Natural Language Processing - Florida Atlantic University**
## **Nha Tran - Z23537257** 

**Text Processing**

Given a news article, you can choose to use any tool we mentioned in the class (e.g., NLTK, OpenNLP, TextBlob, spaCy, and other text processing tools). The following steps should be completed using the tool you pick:

1. Detect sentences in the given news article 
2. Tokenize each sentence into words 
3. Perform Part-of-Speech (POS) tagging on each sentence 
4. Find named entities, including person names and locations 

Please include your screenshots for each of the above steps and also the results of each step in your report. Please submit your report on Canvas.



# **Using NLTK**

In [104]:
# Import libraries
import string 
import os
import nltk
import spacy
from collections import Counter
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.chunk import ne_chunk_sents
from nltk.chunk import ne_chunk
from nltk.tag import pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [97]:
# Open and read the file
file_name = 'news_article.txt'
# The context manager closes the file even if reading fails
with open(file_name, 'r') as data_file:
    text = data_file.read()
print(text)

Anglo-French Channel Tunnel operator Eurotunnel Monday announced a deal giving its creditor banks 45.5 percent of the company in return for wiping out one billion pounds ($1.56 billion) of its debt.
The long-awaited restructuring brings to an end months of wrangling between Eurotunnel and the 225 banks to which it owes nearly nine billion pounds ($14.1 billion).
The deal, announced simultaneously in Paris and London, brings the company back from the brink of insolvency but leaves shareholders owning only 54.5 percent of the company.
"The restructuring plan provides Eurotunnel with the medium-term financial stability to allow it to consolidate its substantial commercial achievements to date and to develop its operations," Eurotunnel co-chairman Alastair Morton said.
The firm was now making a profit before interest, he added.
Although shareholders will see their interests diluted, they were offered the prospect of a brighter future after months of uncertainty while Eurotunnel wrestled to

## 1. Detect sentences in the given news article

In [100]:
# Split the text into sentences
sentences = sent_tokenize(text)
print(sentences)
print(f"Total sentences in document: {len(sentences)}")

['Anglo-French Channel Tunnel operator Eurotunnel Monday announced a deal giving its creditor banks 45.5 percent of the company in return for wiping out one billion pounds ($1.56 billion) of its debt.', 'The long-awaited restructuring brings to an end months of wrangling between Eurotunnel and the 225 banks to which it owes nearly nine billion pounds ($14.1 billion).', 'The deal, announced simultaneously in Paris and London, brings the company back from the brink of insolvency but leaves shareholders owning only 54.5 percent of the company.', '"The restructuring plan provides Eurotunnel with the medium-term financial stability to allow it to consolidate its substantial commercial achievements to date and to develop its operations," Eurotunnel co-chairman Alastair Morton said.', 'The firm was now making a profit before interest, he added.', "Although shareholders will see their interests diluted, they were offered the prospect of a brighter future after months of uncertainty while Eurot

## 2. Tokenize each sentence into words 

In [101]:
# Tokenize each sentence into words
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
print(tokenized_sentences)

[['Anglo-French', 'Channel', 'Tunnel', 'operator', 'Eurotunnel', 'Monday', 'announced', 'a', 'deal', 'giving', 'its', 'creditor', 'banks', '45.5', 'percent', 'of', 'the', 'company', 'in', 'return', 'for', 'wiping', 'out', 'one', 'billion', 'pounds', '(', '$', '1.56', 'billion', ')', 'of', 'its', 'debt', '.'], ['The', 'long-awaited', 'restructuring', 'brings', 'to', 'an', 'end', 'months', 'of', 'wrangling', 'between', 'Eurotunnel', 'and', 'the', '225', 'banks', 'to', 'which', 'it', 'owes', 'nearly', 'nine', 'billion', 'pounds', '(', '$', '14.1', 'billion', ')', '.'], ['The', 'deal', ',', 'announced', 'simultaneously', 'in', 'Paris', 'and', 'London', ',', 'brings', 'the', 'company', 'back', 'from', 'the', 'brink', 'of', 'insolvency', 'but', 'leaves', 'shareholders', 'owning', 'only', '54.5', 'percent', 'of', 'the', 'company', '.'], ['``', 'The', 'restructuring', 'plan', 'provides', 'Eurotunnel', 'with', 'the', 'medium-term', 'financial', 'stability', 'to', 'allow', 'it', 'to', 'consolida

## 3. Perform Part-of-Speech (POS) tagging on each sentence 

In [102]:
# POS tagging
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
print(tagged_sentences)

[[('Anglo-French', 'JJ'), ('Channel', 'NNP'), ('Tunnel', 'NNP'), ('operator', 'NN'), ('Eurotunnel', 'NNP'), ('Monday', 'NNP'), ('announced', 'VBD'), ('a', 'DT'), ('deal', 'NN'), ('giving', 'VBG'), ('its', 'PRP$'), ('creditor', 'NN'), ('banks', 'NNS'), ('45.5', 'CD'), ('percent', 'NN'), ('of', 'IN'), ('the', 'DT'), ('company', 'NN'), ('in', 'IN'), ('return', 'NN'), ('for', 'IN'), ('wiping', 'VBG'), ('out', 'RP'), ('one', 'CD'), ('billion', 'CD'), ('pounds', 'NNS'), ('(', '('), ('$', '$'), ('1.56', 'CD'), ('billion', 'CD'), (')', ')'), ('of', 'IN'), ('its', 'PRP$'), ('debt', 'NN'), ('.', '.')], [('The', 'DT'), ('long-awaited', 'JJ'), ('restructuring', 'NN'), ('brings', 'NNS'), ('to', 'TO'), ('an', 'DT'), ('end', 'JJ'), ('months', 'NNS'), ('of', 'IN'), ('wrangling', 'VBG'), ('between', 'IN'), ('Eurotunnel', 'NNP'), ('and', 'CC'), ('the', 'DT'), ('225', 'CD'), ('banks', 'NNS'), ('to', 'TO'), ('which', 'WDT'), ('it', 'PRP'), ('owes', 'VBZ'), ('nearly', 'RB'), ('nine', 'CD'), ('billion', 'CD

## 4. Find named entities, including person names and locations

In [103]:
# Chunk the sentences. binary=False adds category labels such as PERSON and GPE
# (binary=True would tag every named entity simply as NE)
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=False)

# Recursively collect the names of PERSON and GPE chunks from a tree
def extract_entity_names(tree):
  entity_names = []
  if hasattr(tree, 'label') and tree.label():
    if tree.label() == 'GPE' or tree.label() == 'PERSON':
      entity_names.append(' '.join([child[0] for child in tree]))
    else:
      # Check child node label of the tree
      for child in tree:
        entity_names.extend(extract_entity_names(child))
  return entity_names

# Check all the sentences 
entity = []
for tree in chunked_sentences:
  entity.extend(extract_entity_names(tree))

# Print all the entities
print(entity)
# Print unique ones (using set since set won't accept duplicates)
print(set(entity))

['Paris', 'London', 'Alastair Morton', 'Eurotunnel', 'Eurotunnel', 'European', 'French', 'Patrick Ponsolle', 'Eurotunnel', 'London', 'Eurotunnel', 'Eurotunnel', 'Eurotunnel']
{'Alastair Morton', 'European', 'Patrick Ponsolle', 'Eurotunnel', 'French', 'Paris', 'London'}
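
The combined list above mixes people and places in one collection. A minimal sketch of grouping the entities by label is shown below. It is not part of the original notebook: the helper name `group_entities` and the hand-built tree are illustrative assumptions, with the tree mimicking the shape of `ne_chunk` output so that the example runs without the tagger or chunker models.

```python
from collections import defaultdict

from nltk.tree import Tree


def group_entities(tree, wanted=("PERSON", "GPE")):
    """Return {label: [entity, ...]} for the wanted chunk labels."""
    grouped = defaultdict(list)
    # subtrees() walks every nested chunk, so no manual recursion is needed
    for subtree in tree.subtrees(filter=lambda t: t.label() in wanted):
        name = " ".join(token for token, _tag in subtree.leaves())
        grouped[subtree.label()].append(name)
    return dict(grouped)


# Hand-built tree (an assumption for illustration) in the shape ne_chunk returns
sample = Tree("S", [
    ("announced", "VBD"),
    ("in", "IN"),
    Tree("GPE", [("Paris", "NNP")]),
    ("and", "CC"),
    Tree("GPE", [("London", "NNP")]),
    Tree("PERSON", [("Alastair", "NNP"), ("Morton", "NNP")]),
    ("said", "VBD"),
])

grouped = group_entities(sample)
print(grouped)
```

In the notebook, the same helper could be called on each tree from `chunked_sentences` and the per-sentence results merged.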


# **Using spaCy**


## 1. Detect sentences in the given news article

In [105]:


# Load the English language model
nlp = spacy.load('en_core_web_sm')

# Run the pipeline on the text
doc = nlp(text)

print(f"Total sentences in document: {len(list(doc.sents))}")
print(list(doc.sents))


Total sentences in document: 17
[Anglo-French Channel Tunnel operator Eurotunnel Monday announced a deal giving its creditor banks 45.5 percent of the company in return for wiping out one billion pounds ($1.56 billion) of its debt.
, The long-awaited restructuring brings to an end months of wrangling between Eurotunnel and the 225 banks to which it owes nearly nine billion pounds ($14.1 billion).
, The deal, announced simultaneously in Paris and London, brings the company back from the brink of insolvency but leaves shareholders owning only 54.5 percent of the company.
, "The restructuring plan provides Eurotunnel with the medium-term financial stability to allow it to consolidate its substantial commercial achievements to date and to develop its operations," Eurotunnel co-chairman Alastair Morton said.
, The firm was now making a profit before interest, he added.
, Although shareholders will see their interests diluted, they were offered the prospect of a brighter future after months 

## 2. Tokenize each sentence into words 

In [109]:
# Tokenize the whole document into words (the flat list includes newline tokens)
words = [word.text for word in doc]
print(words)


['Anglo', '-', 'French', 'Channel', 'Tunnel', 'operator', 'Eurotunnel', 'Monday', 'announced', 'a', 'deal', 'giving', 'its', 'creditor', 'banks', '45.5', 'percent', 'of', 'the', 'company', 'in', 'return', 'for', 'wiping', 'out', 'one', 'billion', 'pounds', '(', '$', '1.56', 'billion', ')', 'of', 'its', 'debt', '.', '\n', 'The', 'long', '-', 'awaited', 'restructuring', 'brings', 'to', 'an', 'end', 'months', 'of', 'wrangling', 'between', 'Eurotunnel', 'and', 'the', '225', 'banks', 'to', 'which', 'it', 'owes', 'nearly', 'nine', 'billion', 'pounds', '(', '$', '14.1', 'billion', ')', '.', '\n', 'The', 'deal', ',', 'announced', 'simultaneously', 'in', 'Paris', 'and', 'London', ',', 'brings', 'the', 'company', 'back', 'from', 'the', 'brink', 'of', 'insolvency', 'but', 'leaves', 'shareholders', 'owning', 'only', '54.5', 'percent', 'of', 'the', 'company', '.', '\n', '"', 'The', 'restructuring', 'plan', 'provides', 'Eurotunnel', 'with', 'the', 'medium', '-', 'term', 'financial', 'stability', 'to
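
Note that the list above is flat and includes the newline tokens (`'\n'`) that separate the article's lines, so it is not yet grouped by sentence. A minimal sketch of regrouping is shown below. It assumes, as appears to be the case in this article, that each sentence sits on its own line, and the helper name `split_on_newlines` is hypothetical. With the spaCy `doc` itself, iterating over `doc.sents` and skipping tokens where `token.is_space` is true should give an equivalent result.

```python
from itertools import groupby


def split_on_newlines(tokens):
    """Group a flat token list into sub-lists, dropping newline separators."""
    return [
        list(group)
        for is_newline, group in groupby(tokens, key=lambda tok: tok == "\n")
        if not is_newline
    ]


# Excerpt of the tokens printed above (end of one line, start of the next)
flat = ["of", "its", "debt", ".", "\n", "The", "long", "-", "awaited",
        "restructuring", "\n"]

per_sentence = split_on_newlines(flat)
print(per_sentence)
```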

## 3. Perform Part-of-Speech (POS) tagging on each sentence 

In [119]:
# Collect every token with its coarse-grained POS tag and print the pairs
word_list = []
pos_list = []
for word in doc:
  word_list.append(word)
  pos_list.append(word.pos_)
  print(word, word.pos_)

Anglo ADJ
- PUNCT
French ADJ
Channel PROPN
Tunnel PROPN
operator NOUN
Eurotunnel PROPN
Monday PROPN
announced VERB
a DET
deal NOUN
giving VERB
its DET
creditor NOUN
banks NOUN
45.5 NUM
percent NOUN
of ADP
the DET
company NOUN
in ADP
return NOUN
for ADP
wiping VERB
out ADP
one NUM
billion NUM
pounds NOUN
( PUNCT
$ SYM
1.56 NUM
billion NUM
) PUNCT
of ADP
its DET
debt NOUN
. PUNCT

 SPACE
The DET
long ADV
- PUNCT
awaited VERB
restructuring NOUN
brings VERB
to ADP
an DET
end NOUN
months NOUN
of ADP
wrangling VERB
between ADP
Eurotunnel PROPN
and CCONJ
the DET
225 NUM
banks NOUN
to PART
which DET
it PRON
owes VERB
nearly ADV
nine NUM
billion NUM
pounds NOUN
( PUNCT
$ SYM
14.1 NUM
billion NUM
) PUNCT
. PUNCT

 SPACE
The DET
deal NOUN
, PUNCT
announced VERB
simultaneously ADV
in ADP
Paris PROPN
and CCONJ
London PROPN
, PUNCT
brings VERB
the DET
company NOUN
back ADV
from ADP
the DET
brink NOUN
of ADP
insolvency NOUN
but CCONJ
leaves VERB
shareholders NOUN
owning VERB
only ADV
54.5 NUM
percent

In [122]:
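# Show the (word, POS) pairs as a table
import pandas as pd
# Named pos_pairs so that the pos_tag function imported from nltk is not shadowed
pos_pairs = list(set(zip(word_list, pos_list)))
pos_df = pd.DataFrame(pos_pairs)
pos_df.columns = ["Word", "POS"]
print(pos_df)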

             Word    POS
0       interests   NOUN
1              in    ADP
2          brings   VERB
3    shareholders   NOUN
4            nine    NUM
..            ...    ...
520          long    ADV
521       company   NOUN
522             "  PUNCT
523            on    ADP
524             ,  PUNCT

[525 rows x 2 columns]
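
A table of 525 rows is hard to read at a glance. One way to summarise it, sketched here under the assumption that only the tag frequencies are of interest, is to count the tags with `collections.Counter`. The pairs below are copied from the first tokens of the POS output above; in the notebook, `Counter(pos_list)` would count the tags for the whole document.

```python
from collections import Counter

# (word, POS) pairs copied from the start of the spaCy output above
pairs = [
    ("Anglo", "ADJ"), ("-", "PUNCT"), ("French", "ADJ"),
    ("Channel", "PROPN"), ("Tunnel", "PROPN"), ("operator", "NOUN"),
    ("Eurotunnel", "PROPN"), ("Monday", "PROPN"), ("announced", "VERB"),
    ("a", "DET"), ("deal", "NOUN"),
]

# Count how often each tag occurs, ignoring the words themselves
tag_counts = Counter(pos for _word, pos in pairs)

# Print tags from most to least frequent
for tag, count in tag_counts.most_common():
    print(f"{tag:<6}{count}")
```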


## 4. Find named entities, including person names and locations 

In [125]:
# All the unique labels in the document
labels = {ent.label_ for ent in doc.ents}
print(f"All the labels in the document: \n {labels}")

All the labels in the document: 
 {'PERSON', 'CARDINAL', 'NORP', 'GPE', 'PERCENT', 'ORG', 'ORDINAL', 'DATE', 'MONEY'}


In [126]:
# Print the unique named entities for each label, including persons and locations
for label in labels: 
    entities = [e.text for e in doc.ents if label == e.label_] 
    entities = list(set(entities)) 
    print(label,entities)

PERSON ['Patrick Ponsolle', 'Alastair Morton']
CARDINAL ['225', 'six', 'around half', '24']
NORP ['French', 'European']
GPE ['Paris', 'London', 'Eurotunnel']
PERCENT ['45.5 percent', 'only 54.5 percent', 'just over 39 percent']
ORG ['Anglo-French Channel Tunnel', 'Eurotunnel']
ORDINAL ['first']
DATE ['Monday', 'two weeks', 'an end months', 'the next 10 years', 'months', 'the end of December 2003', 'Tuesday', 'last week', 'late last week', 'early in 1997']
MONEY ['around 160 pence', '$5.8 billion', 'nearly nine billion pounds', '130 pence', '8.7 billion pounds', '113.5 pence', 'one billion pounds', '1.0 billion', '10.40 francs', '$13.6 billion', '$1.56 billion', '$14.1 billion', '3.7 billion pounds']
