This notebook walks through basic processing steps and visualizations to help you understand the NLP techniques you have studied. The steps are illustrated mainly with built-in nltk and spacy functions. They are meant purely for demonstration and understanding; you are not expected to implement them as part of any requirement.

## Part 1: Lexical analysis

The first part of the notebook covers one lexical analysis step on the COVID tweet dataset: part-of-speech (POS) tagging.

### Applying POS tagging to COVID tweet dataset

In [1]:
# Import libraries here that you need for different processing steps
import nltk
import csv
import spacy
import pandas as pd

In [2]:
# Read the CSV into a dataframe and drop rows with missing values in the OriginalTweet column

data_df = pd.read_csv("Dataset/covid.csv")

print ("Data set: ", len(data_df))

data_df = data_df[data_df['OriginalTweet'].notna()]
print ("Data set: ", len(data_df))

data_df.head()

Data set:  44957
Data set:  44955


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,Hashtags,CleanedTweet,Accounts,TokenizedTweet,StopwordRemovedTweet,StemmedTweet
0,3799,48751,London,16-03-2020,@menyrbie @phil_gahan @chrisitv https://t.co/i...,Neutral,,https t co ifz9fan2pa and https t co xx6ghgfz...,"['menyrbie', 'phil_gahan', 'chrisitv']","['https', 't', 'co', 'ifz9fan2pa', 'and', 'htt...","['https', 'co', 'ifz9fan2pa', 'https', 'co', '...","['http', 't', 'co', 'ifz9fan2pa', 'and', 'http..."
1,3800,48752,UK,16-03-2020,advice talk to your neighbours family to excha...,Positive,,advice talk to your neighbours family to excha...,,"['advice', 'talk', 'to', 'your', 'neighbours',...","['advice', 'talk', 'neighbours', 'family', 'ex...","['advic', 'talk', 'to', 'your', 'neighbour', '..."
2,3801,48753,Vagabonds,16-03-2020,coronavirus australia: woolworths to give elde...,Positive,,coronavirus australia woolworths to give elder...,,"['coronavirus', 'australia', 'woolworths', 'to...","['coronavirus', 'australia', 'woolworths', 'gi...","['coronaviru', 'australia', 'woolworth', 'to',..."
3,3802,48754,,16-03-2020,my food stock is not the only one which is emp...,Positive,"['covid19france', 'covid_19', 'covid19', 'coro...",my food stock is not the only one which is emp...,,"['my', 'food', 'stock', 'is', 'not', 'the', 'o...","['food', 'stock', 'one', 'empty', 'please', 'p...","['my', 'food', 'stock', 'is', 'not', 'the', 'o..."
4,3803,48755,,16-03-2020,"me, ready to go at supermarket during the #cov...",Negative,"['covid19', 'coronavirus', 'coronavirusfrance'...",me ready to go at supermarket during the outbr...,,"['me', 'ready', 'to', 'go', 'at', 'supermarket...","['ready', 'go', 'supermarket', 'outbreak', 'pa...","['me', 'readi', 'to', 'go', 'at', 'supermarket..."


### Functions for Penn Treebank-style tokenization and POS tagging

#### POS tagging using nltk
Refer to this link for an explanation of all POS tag abbreviations:
https://www.guru99.com/pos-tagging-chunking-nltk.html
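
As a quick offline companion to that reference, the sketch below keeps a small lookup table for the Penn Treebank tags that appear in this notebook's outputs. The names `PENN_TAGS` and `describe` are illustrative helpers of our own, not part of nltk, and the table is deliberately incomplete.

```python
# Short glosses for the Penn Treebank tags seen in this notebook's outputs.
# PENN_TAGS and describe() are illustrative helpers, not nltk functions.
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, 3rd person singular present",
    "WRB": "wh-adverb",
}

def describe(tag):
    """Return a gloss for a tag, or a fallback for tags not in the table."""
    return PENN_TAGS.get(tag, "not in this short table")

for tag in ["NN", "VBD", "WRB", "HYPH"]:
    print(f"{tag:5} {describe(tag)}")
```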

In [3]:
import nltk

# Sentence splitter plus Penn Treebank-style word tokenizer
class Splitter(object):
    def __init__(self):
        # pre-trained Punkt sentence splitter and Treebank word tokenizer
        self.nltk_splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.nltk_tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self, text):
        # split the text into sentences, then each sentence into tokens
        sentences = self.nltk_splitter.tokenize(text)
        tokenized_sentences = [self.nltk_tokenizer.tokenize(sent) for sent in sentences]
        return tokenized_sentences

# POS tagger applied sentence by sentence
class POSTagger(object):
    def __init__(self):
        pass

    def pos_tag(self, sentences):
        # tag each tokenized sentence with nltk's default tagger
        pos = [nltk.pos_tag(sentence) for sentence in sentences]
        # reshape each (word, tag) pair into (word, word, [tag])
        pos = [[(word, word, [postag]) for (word, postag) in sentence] for sentence in pos]
        return pos
    
splitter = Splitter()
postagger = POSTagger()

In [4]:
# POS tagging on an example tweet 
# We use OriginalTweet as input because stemming and stopword removal would make POS tagging somewhat meaningless, as the integrity of sentences/tokens is violated. 

print(data_df.OriginalTweet.tolist()[98])
print("\n")

tweet = data_df.OriginalTweet.tolist()[98]
splitted_sentences = splitter.split(tweet)
pos_tagged_sentences = postagger.pos_tag(splitted_sentences)
for sentence in pos_tagged_sentences:
    for words in sentence:
        print(words)
    print("\n")

i followed this when i went shopping a few days ago. it's a pain but necessary! protect yourself from grocery shopping - consumer reports #covid2019 #stayhealthy https://t.co/48ng14me6e


('i', 'i', ['NN'])
('followed', 'followed', ['VBD'])
('this', 'this', ['DT'])
('when', 'when', ['WRB'])
('i', 'i', ['JJ'])
('went', 'went', ['VBD'])
('shopping', 'shopping', ['VBG'])
('a', 'a', ['DT'])
('few', 'few', ['JJ'])
('days', 'days', ['NNS'])
('ago', 'ago', ['RB'])
('.', '.', ['.'])


('it', 'it', ['PRP'])
("'s", "'s", ['VBZ'])
('a', 'a', ['DT'])
('pain', 'pain', ['NN'])
('but', 'but', ['CC'])
('necessary', 'necessary', ['JJ'])
('!', '!', ['.'])


('protect', 'protect', ['VB'])
('yourself', 'yourself', ['PRP'])
('from', 'from', ['IN'])
('grocery', 'grocery', ['NN'])
('shopping', 'shopping', ['NN'])
('-', '-', [':'])
('consumer', 'consumer', ['NN'])
('reports', 'reports', ['NNS'])
('#', '#', ['#'])
('covid2019', 'covid2019', ['JJ'])
('#', '#', ['#'])
('stayhealthy', 'stayhealthy', ['JJ'])
('htt

#### POS tagging using spacy

Refer to this link for more explanation and examples: https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/

In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(tweet)

for token in doc:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}}')

i          PRON       PRP       
followed   VERB       VBD       
this       DET        DT        
when       ADV        WRB       
i          PRON       PRP       
went       VERB       VBD       
shopping   VERB       VBG       
a          DET        DT        
few        ADJ        JJ        
days       NOUN       NNS       
ago        ADV        RB        
.          PUNCT      .         
it         PRON       PRP       
's         AUX        VBZ       
a          DET        DT        
pain       NOUN       NN        
but        CCONJ      CC        
necessary  ADJ        JJ        
!          PUNCT      .         
protect    VERB       VB        
yourself   PRON       PRP       
from       ADP        IN        
grocery    NOUN       NN        
shopping   NOUN       NN        
-          PUNCT      HYPH      
consumer   NOUN       NN        
reports    NOUN       NNS       
#          SYM        $         
covid2019  PROPN      NNP       
#          NOUN       NN        
stayhealth

#### Comparison of nltk and spacy POS tagging results

Does one tagger perform better than the other? What are some of the differences you observe?
You can also try out tweets with more irregular text, which is likely to yield poorer results.
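
One way to make the comparison concrete is to line up the two tag sequences and list where they disagree. The sketch below hard-codes the first sentence's tags, copied from the nltk and spacy outputs above. The names `nltk_tags`, `spacy_tags`, and `disagreements` are illustrative, and the position-by-position alignment assumes both tools produced the same tokens, which holds for this sentence but not in general.

```python
# Tags for the first sentence, copied from the nltk and spacy outputs above.
tokens = ["i", "followed", "this", "when", "i", "went",
          "shopping", "a", "few", "days", "ago", "."]
nltk_tags = ["NN", "VBD", "DT", "WRB", "JJ", "VBD",
             "VBG", "DT", "JJ", "NNS", "RB", "."]
spacy_tags = ["PRP", "VBD", "DT", "WRB", "PRP", "VBD",
              "VBG", "DT", "JJ", "NNS", "RB", "."]

# Collect (position, token, nltk tag, spacy tag) wherever the two differ.
disagreements = [
    (i, tok, a, b)
    for i, (tok, a, b) in enumerate(zip(tokens, nltk_tags, spacy_tags))
    if a != b
]

for i, tok, a, b in disagreements:
    print(f"token {i} {tok!r}: nltk={a} spacy={b}")

# Share of positions where both taggers chose the same tag.
agreement = 1 - len(disagreements) / len(tokens)
print(f"agreement: {agreement:.0%}")
```

For this sentence the two taggers differ only on the lowercase pronoun "i", which nltk tags as `NN` and `JJ` while spacy tags it `PRP` both times.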

## Part 2: Syntactic analysis

### Chunking

#### Chunking with nltk

We can use regular expressions in NLTK for chunking. The following is a simple noun phrase chunking example. Study the `np` regular expression and try to interpret what it does. 

In [6]:
import nltk
# NLTK comes with the RegexpParser class, which lets us build a simple noun phrase chunker from a tag pattern.
np = ("NP: {<DT>?<JJ>*<NN>+}")

# build the chunk parser from the grammar
chunking = nltk.RegexpParser(np)

# tokenize the tweet sentence
sent_token = nltk.word_tokenize(tweet)

# POS tagging, a prerequisite for chunking
tagging = nltk.pos_tag(sent_token)
tagging


[('i', 'NN'),
 ('followed', 'VBD'),
 ('this', 'DT'),
 ('when', 'WRB'),
 ('i', 'JJ'),
 ('went', 'VBD'),
 ('shopping', 'VBG'),
 ('a', 'DT'),
 ('few', 'JJ'),
 ('days', 'NNS'),
 ('ago', 'RB'),
 ('.', '.'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('pain', 'NN'),
 ('but', 'CC'),
 ('necessary', 'JJ'),
 ('!', '.'),
 ('protect', 'VB'),
 ('yourself', 'PRP'),
 ('from', 'IN'),
 ('grocery', 'NN'),
 ('shopping', 'NN'),
 ('-', ':'),
 ('consumer', 'NN'),
 ('reports', 'NNS'),
 ('#', '#'),
 ('covid2019', 'JJ'),
 ('#', '#'),
 ('stayhealthy', 'JJ'),
 ('https', 'NN'),
 (':', ':'),
 ('//t.co/48ng14me6e', 'NN')]

In [8]:
#!pip install svgling

# parse the tagged tokens into a chunk tree and print it
tree = chunking.parse(tagging)
print(tree)

(S
  (NP i/NN)
  followed/VBD
  this/DT
  when/WRB
  i/JJ
  went/VBD
  shopping/VBG
  a/DT
  few/JJ
  days/NNS
  ago/RB
  ./.
  it/PRP
  's/VBZ
  (NP a/DT pain/NN)
  but/CC
  necessary/JJ
  !/.
  protect/VB
  yourself/PRP
  from/IN
  (NP grocery/NN shopping/NN)
  -/:
  (NP consumer/NN)
  reports/NNS
  #/#
  covid2019/JJ
  #/#
  (NP stayhealthy/JJ https/NN)
  :/:
  (NP //t.co/48ng14me6e/NN))


In [10]:
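# Print only the noun phrase chunks (subtrees labelled NP).
# Checking the node type avoids matching plain tokens whose text contains "NP".
for subtree in tree:
    if isinstance(subtree, nltk.Tree) and subtree.label() == "NP":
        print(subtree)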

(NP i/NN)
(NP a/DT pain/NN)
(NP grocery/NN shopping/NN)
(NP consumer/NN)
(NP stayhealthy/JJ https/NN)
(NP //t.co/48ng14me6e/NN)
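
To make the tag pattern concrete, here is a plain-Python sketch of what `{<DT>?<JJ>*<NN>+}` matches: an optional determiner, any number of adjectives, then one or more tokens tagged exactly `NN`. This is not how nltk implements `RegexpParser`; `np_chunks` is a hypothetical helper written for illustration, and the input is a handful of tagged tokens copied (not contiguously) from the output above.

```python
def np_chunks(tagged):
    """Return token groups matching <DT>?<JJ>*<NN>+, scanning left to right."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DT":      # <DT>? : at most one determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":   # <JJ>* : zero or more adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NN":   # <NN>+ : one or more NN (not NNS)
            k += 1
        if k > j:                               # at least one NN was found
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:                                   # no match starts at position i
            i += 1
    return chunks

tagged = [("it", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("pain", "NN"),
          ("but", "CC"), ("necessary", "JJ"), ("!", "."),
          ("a", "DT"), ("few", "JJ"), ("days", "NNS"),
          ("consumer", "NN"), ("reports", "NNS")]

print(np_chunks(tagged))
```

Because `<NN>` matches only the tag `NN`, "a few days" (`NNS`) is not chunked and "consumer reports" contributes only "consumer". Widening the pattern to `<NN.*>` would be one way to include plural and proper nouns.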


#### Chunking with spacy

spacy derives its noun chunks from the dependency parse, so we first need to parse dependencies.

For more details on spacy linguistic processing, refer to https://spacy.io/usage/linguistic-features. 

In [13]:
# displacy provides nice visualization features
from spacy import displacy

doc = nlp(tweet)    

# visualize sentence dependencies
displacy.render(doc, style='dep', jupyter = True, options = {'distance': 100})

### Constituency parsing

There is limited built-in functionality for constituency parsing in nltk and spacy, although wrappers to existing tools such as the CoreNLP library are available.

Stanza is a state-of-the-art NLP pipeline that includes constituency parsing (in addition to other linguistic analyses) and has a similar feel to spacy. You need to install it first. If you're interested in getting started with Stanza, see the link: https://stanfordnlp.github.io/stanza/#getting-started.
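
To get a feel for what a constituency parser returns, the sketch below loads a hand-written bracketed parse into `nltk.Tree`. The bracketing is our own illustration for a sentence used later in this notebook, not the output of Stanza or any other parser.

```python
from nltk import Tree

# Hand-written Penn Treebank-style bracketing (an assumption for illustration).
bracketed = """
(S
  (NP (DT The) (NN key))
  (VP (VBD broke)
      (PP (IN in) (NP (DT the) (NN lock))))
  (. .))
"""
parse_tree = Tree.fromstring(bracketed)

print(parse_tree.label())    # label of the root constituent
print(parse_tree.leaves())   # the tokens, left to right

# Every constituent is a subtree; collect the text of the noun phrases.
noun_phrases = [" ".join(subtree.leaves())
                for subtree in parse_tree.subtrees(lambda t: t.label() == "NP")]
print(noun_phrases)
```

Unlike the flat chunks produced earlier, constituents here nest: the second noun phrase sits inside a prepositional phrase, which in turn sits inside the verb phrase.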

## Part 3: Semantic analysis

### Named entity recognition 

Below are simple examples of named entity recognition using spacy and nltk.

#### Named entity recognition using spacy

In [14]:
from collections import Counter
from pprint import pprint
import en_core_web_sm
nlp = en_core_web_sm.load()

tweet = "Mark Zuckerberg is one of the founders of Facebook, a company from the United States"
doc = nlp(tweet)
print(tweet)
pprint([(X.text, X.label_) for X in doc.ents])

Mark Zuckerberg is one of the founders of Facebook, a company from the United States
[('Mark Zuckerberg', 'PERSON'),
 ('one', 'CARDINAL'),
 ('the United States', 'GPE')]


#### Named entity recognition using nltk

In [16]:
import nltk
# Download necessary packages
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
 
# a function that performs named entity recognition and prints the results
def nltk_ner(text): 
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))
    return

print(tweet) 
nltk_ner(tweet)
print('\n')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/owenmonroe/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/owenmonroe/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/owenmonroe/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to
[nltk_data]     /Users/owenmonroe/nltk_data...
[nltk_data]   Package words is already up-to-date!


Mark Zuckerberg is one of the founders of Facebook, a company from the United States
PERSON Mark
ORGANIZATION Zuckerberg
GPE Facebook
GPE United States




Which is better? Try some other sentences before you come to a conclusion.
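
As a starting point for that comparison, the two outputs above can be put side by side. The sketch below hard-codes both result lists as printed; the `normalize` helper (lowercasing and dropping a leading "the") is our own assumption about what should count as the same entity.

```python
# Entities copied from the spacy and nltk outputs above.
spacy_ents = [("Mark Zuckerberg", "PERSON"), ("one", "CARDINAL"),
              ("the United States", "GPE")]
nltk_ents = [("Mark", "PERSON"), ("Zuckerberg", "ORGANIZATION"),
             ("Facebook", "GPE"), ("United States", "GPE")]

def normalize(text):
    """Lowercase and drop a leading 'the ' so span boundaries compare fairly."""
    text = text.lower()
    return text[4:] if text.startswith("the ") else text

spacy_set = {(normalize(text), label) for text, label in spacy_ents}
nltk_set = {(normalize(text), label) for text, label in nltk_ents}

print("both:      ", sorted(spacy_set & nltk_set))
print("spacy only:", sorted(spacy_set - nltk_set))
print("nltk only: ", sorted(nltk_set - spacy_set))
```

On this sentence the two agree only on the United States. spacy keeps "Mark Zuckerberg" together as one PERSON but misses Facebook, while nltk splits the name in two and labels Facebook as a GPE.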

If you need an efficient, ready-made solution for NER, spaCy is a good choice with its pre-trained NER models. If you need more customization for domain-specific tasks, NLTK lets you assemble and train your own NER models from its chunking and classification components, though spaCy pipelines can also be retrained. Ultimately, the choice depends on your project's requirements and your familiarity with the libraries. You may even use both libraries in different parts of your NER pipeline if that best suits your needs.

## Part 4: Ambiguity

Let's try out some examples mentioned in class, or others of your own.

### Examples of lexical ambiguity 

Let's take the following two sentences.

```
sentence1 = "The key broke in the lock."
sentence2 = "The key problem was not one of quality but of quantity."
```
Both sentences contain the word "key", but with a different part of speech and a different meaning in each.

#### POS tagging using nltk for the above sentences
Let's see how our nltk POS tagger tags these sentences.

In [17]:
sentence1 = "The key broke in the lock."
sentence2 = "The key problem was not one of quality but of quantity."

# Tag the first sentence
print(sentence1)
forpos = sentence1.strip()
splitted_sentences = splitter.split(forpos)
pos_tagged_sentences = postagger.pos_tag(splitted_sentences)
print(pos_tagged_sentences)

print("\n")

# Tag the second sentence
print(sentence2)
forpos = sentence2.strip()
splitted_sentences = splitter.split(forpos)
pos_tagged_sentences = postagger.pos_tag(splitted_sentences)
print(pos_tagged_sentences)

The key broke in the lock.
[[('The', 'The', ['DT']), ('key', 'key', ['JJ']), ('broke', 'broke', ['NN']), ('in', 'in', ['IN']), ('the', 'the', ['DT']), ('lock', 'lock', ['NN']), ('.', '.', ['.'])]]


The key problem was not one of quality but of quantity.
[[('The', 'The', ['DT']), ('key', 'key', ['NN']), ('problem', 'problem', ['NN']), ('was', 'was', ['VBD']), ('not', 'not', ['RB']), ('one', 'one', ['CD']), ('of', 'of', ['IN']), ('quality', 'quality', ['NN']), ('but', 'but', ['CC']), ('of', 'of', ['IN']), ('quantity', 'quantity', ['NN']), ('.', '.', ['.'])]]


What do you think? Are the results as expected? 
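
To check the tagger's decision for a single word without scanning the whole printout, the nested output can be searched directly. The sketch below copies the first few tokens of the two results above; `tags_for` is a hypothetical helper that relies on the `(word, word, [tag])` triples produced by `POSTagger`.

```python
# Tagged output for both sentences, copied from the results above and
# truncated to the first four tokens. Each is a list of sentences of triples.
tagged1 = [[("The", "The", ["DT"]), ("key", "key", ["JJ"]),
            ("broke", "broke", ["NN"]), ("in", "in", ["IN"])]]
tagged2 = [[("The", "The", ["DT"]), ("key", "key", ["NN"]),
            ("problem", "problem", ["NN"]), ("was", "was", ["VBD"])]]

def tags_for(word, tagged_sentences):
    """Collect the tags assigned to every occurrence of `word`."""
    return [tag
            for sentence in tagged_sentences
            for token, _, tags in sentence
            if token.lower() == word.lower()
            for tag in tags]

print("sentence1:", tags_for("key", tagged1))
print("sentence2:", tags_for("key", tagged2))
print("sentence1:", tags_for("broke", tagged1))
```

In these results "key" is tagged as an adjective where it names the object and as a noun where it modifies "problem", which is arguably the reverse of what a reader would expect; "broke" is also tagged `NN` rather than as a past-tense verb.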


#### POS tagging using spacy for the above sentences

In [60]:
sentence1 = "The key broke in the lock."
sentence2 = "The key problem was not one of quality but of quantity."

print(sentence1)
doc1 = nlp(sentence1)  
for token in doc1:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}}')

print("\n")

print(sentence2)

doc2 = nlp(sentence2)  
for token in doc2:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}}')


How do spacy's results compare to nltk's? Does spacy handle the word "key" better?

Try other examples of interest. For example, you may try "I made her duck". 
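
A toy grammar makes the two readings of that sentence visible. The sketch below uses `nltk.CFG` and `nltk.ChartParser` with a hand-written grammar of our own; the rule set, and the label `PRPS` for the possessive, are assumptions made for this example. It needs no downloaded models.

```python
import nltk

# Hand-written toy grammar covering only this sentence (an assumption for
# illustration, not a general grammar of English).
grammar = nltk.CFG.fromstring("""
S    -> NP VP
NP   -> PRP | PRPS N
VP   -> V NP | V NP VB
PRP  -> 'I' | 'her'
PRPS -> 'her'
N    -> 'duck'
V    -> 'made'
VB   -> 'duck'
""")

# The chart parser enumerates every tree the grammar allows for the tokens.
parser = nltk.ChartParser(grammar)
parses = list(parser.parse("I made her duck".split()))

print(len(parses), "parses")
for parse in parses:
    print(parse)
```

One parse treats "her duck" as a noun phrase (the duck belongs to her); the other treats "her" as the object and "duck" as a verb (she was made to duck).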
