# STA 141B Data & Web Technologies for Data Analysis

### Lecture 19, 3/14/24, Natural language processing


### Announcements 

- Extra OH this Friday, 3/15, 12-1 PM via Zoom.

### Today's topics
- Chunking
    - Noun Phrase Chunking
    - Tag Patterns
    - Developing and Evaluating Chunkers
    - Chinking
- Training Classifier-Based Chunkers
- Cascaded Chunker
- Named Entity Recognition
- Relation Extraction

In [None]:
import re
import requests
import pandas as pd
import time
import lxml.html as lx
import nltk

In [None]:
def get_info(name):
    time.sleep(0.2)
    name = name.lower()
    name = re.sub(' ', '-', name)
    name = re.sub(r'[^\w-]', '', name)
    result = requests.get('https://www.cia.gov/the-world-factbook/page-data/countries/' \
                          + name + '/page-data.json')
    result.raise_for_status()
    return result.json()

In [None]:
result = requests.get('https://www.cia.gov/the-world-factbook/page-data/sq/d/1627106492.json')
country_names = [i.get('name') \
                 if i.get('redirect') is None else i.get('redirect').get('name') \
                 for i in result.json()['data']['countries']['nodes']]
countries = [get_info(name)['result']['data'] for name in country_names]

In [None]:
index = [i for i, e in enumerate(countries) if e['country']['name'] == "Burma"][0]

In [None]:
document = [i for i in countries[index]['fields']['nodes'] if i.get('name') == 'Background'][0]['data']
document = "".join([t for t in lx.fromstring(document).xpath('//p/text()')])

In [None]:
document

In [None]:
def preprocess(document):
    document = document.lower()
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

In [None]:
processed_document = preprocess(document)
sentence = processed_document[0]
sentence

#### Noun Phrase Chunking

We will begin by considering the task of noun phrase chunking, or NP-chunking,
where we search for chunks corresponding to individual noun phrases.

One of the most useful sources of information for NP-chunking is part-of-speech tags. This is one of the motivations for performing part-of-speech tagging in our information extraction system. In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked.

We will define a simple grammar with a single regular expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (`DT`) followed by any number of adjectives (`JJ`) and then one or more nouns of any type (`NN.*`). Using this grammar, we create a chunk parser and test it on our example sentence. The result is a tree, which we can either print or display graphically.

In [None]:
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)

In [None]:
type(cp)

In [None]:
result = cp.parse(sentence)
type(result)

In [None]:
#print(result)

In [None]:
result
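The same kind of rule can be tried on a small hand-tagged sentence, which makes the resulting chunks easy to check by eye. This is only a sketch: the toy sentence, its tags, and the names `toy_sentence`, `toy_cp`, and `np_chunks` are made up for illustration and are not part of the Factbook example.

```python
import nltk

# Hand-tagged toy sentence (illustrative; not taken from the Factbook text)
toy_sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
                ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Optional determiner, any number of adjectives, then a single noun
toy_cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
toy_tree = toy_cp.parse(toy_sentence)

# Collect the words of every NP chunk found by the rule
np_chunks = [" ".join(word for word, tag in subtree.leaves())
             for subtree in toy_tree.subtrees(lambda t: t.label() == "NP")]
print(np_chunks)
```

Because the tags are supplied by hand, this runs without downloading any tagger model.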

#### Tag Patterns

The rules that make up a chunk grammar use tag patterns to describe sequences of
tagged words. A tag pattern is a sequence of part-of-speech tags delimited using angle
brackets, e.g., `<DT>?<JJ>*<NN>`. Tag patterns are similar to regular expression patterns.

In [None]:
grammar = "NP: {<DT>?<JJ.*>*<NN.*>+}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
result

This will chunk any sequence of tokens beginning with an optional determiner, followed by zero or more adjectives of any type, followed by one or more nouns of any type. 

To find the chunk structure for a given sentence, the RegexpParser chunker begins with
a flat structure in which no tokens are chunked. The chunking rules are applied in turn,
successively updating the chunk structure. Once all of the rules have been invoked, the
resulting chunk structure is returned.

The next example shows a simple chunk grammar consisting of two rules. The first rule
matches an optional determiner or possessive pronoun, zero or more adjectives, then one or more nouns. The second rule matches one or more proper nouns. We also define an example
sentence to be chunked, and run the chunker on this input.

In [None]:
grammar = r"""
    NP: {<DT|P.*>?<JJ>*<NN.*>+} # chunk determiner/possessive, adjectives and nouns
    {<NNP>+} # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
example = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
cp.parse(example)

If a tag pattern matches at overlapping locations, the leftmost match takes precedence.
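A minimal sketch of the leftmost-match behaviour, using three consecutive nouns and a rule that chunks exactly two nouns. The names `nouns`, `pair_cp`, and `chunked` are illustrative.

```python
import nltk

# Three nouns in a row: the pattern <NN><NN> could match at position 0 or position 1
nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]

pair_cp = nltk.RegexpParser("NP: {<NN><NN>}")
pair_tree = pair_cp.parse(nouns)
print(pair_tree)

# Leaves of each NP chunk; only the leftmost pair is expected to be chunked
chunked = [st.leaves() for st in pair_tree.subtrees(lambda t: t.label() == "NP")]
```

Once the leftmost pair is chunked, the remaining noun cannot form a two-noun chunk on its own, so it stays outside. A rule such as `NP: {<NN>+}` would capture all three nouns together.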

Sometimes it is easier to define what we want to exclude from a chunk. We can define a chink to be a sequence of tokens that is not included in a chunk.

In [None]:
sentence

In [None]:
grammar = r""" NP:
    {<.*>+}        # Chunk everything
    }<CC|.*DT|TO>?<\.|,|VB.*>+<IN>?{  # Chink 
"""
cp = nltk.RegexpParser(grammar)
cp.parse(sentence)
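Chinking can also be checked on a small hand-tagged sentence. The sketch below first chunks everything and then removes verbs and prepositions from the chunk; the sentence and the names `chink_grammar`, `chink_tree`, and `chink_chunks` are assumptions made for illustration.

```python
import nltk

chink_grammar = r"""
  NP:
    {<.*>+}       # chunk everything
    }<VBD|IN>+{   # chink sequences of VBD and IN
"""

# Hand-tagged toy sentence (illustrative)
toy_tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
              ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

chink_tree = nltk.RegexpParser(chink_grammar).parse(toy_tagged)

# Words of each remaining NP chunk after the chink has been cut out
chink_chunks = [" ".join(word for word, tag in st.leaves())
                for st in chink_tree.subtrees(lambda t: t.label() == "NP")]
print(chink_chunks)
```

Removing a chink from the middle of a chunk splits that chunk in two, which is why two separate noun phrases remain here.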

As befits their intermediate status between tagging and parsing, chunk
structures can be represented using either tags or trees. The most widespread file representation
uses IOB tags. In this scheme, each token is tagged with one of three special
chunk tags, I (inside), O (outside), or B (begin).
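To make the IOB scheme concrete, here is a minimal pure-Python sketch that groups IOB-tagged tokens back into chunks. The function `iob_to_chunks` is a hypothetical helper written for this note (NLTK provides `nltk.chunk.conlltags2tree` for the real conversion), and the tagged tokens are made up.

```python
def iob_to_chunks(conlltags):
    """Group (word, pos, iob) triples into (label, [words]) chunks."""
    chunks = []
    current = None
    for word, pos, iob in conlltags:
        if iob.startswith("B-"):
            # B-XX opens a new chunk of type XX
            current = (iob[2:], [word])
            chunks.append(current)
        elif iob.startswith("I-") and current is not None and current[0] == iob[2:]:
            # I-XX continues the open chunk of the same type
            current[1].append(word)
        else:
            # O closes any open chunk; in this sketch a stray I- tag is also ignored
            current = None
    return chunks


tagged = [("we", "PRP", "B-NP"), ("saw", "VBD", "O"),
          ("the", "DT", "B-NP"), ("yellow", "JJ", "I-NP"), ("dog", "NN", "I-NP")]
decoded = iob_to_chunks(tagged)
print(decoded)
```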

#### Developing and Evaluating Chunkers
Now you have a taste of what chunking does, but we haven’t explained how to evaluate
chunkers. As usual, this requires a suitably annotated corpus. We begin by looking at
the mechanics of converting IOB format into an NLTK tree, then at how this is done
on a larger scale using a chunked corpus. We will see how to score the accuracy of a
chunker relative to a corpus, then look at some more data-driven ways to search for
NP chunks. Our focus throughout will be on expanding the coverage of a chunker.

In [None]:
# Score cp against a collection of gold-standard trees
# (here, a one-element tuple holding `result`)
print(cp.accuracy((result,)))

In [None]:
38 / len(sentence) # 38 tokens have been correctly classified in terms of IOB
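The ratio above is a per-token comparison of IOB tags. As a rough sketch of that idea (our own simplified helper `iob_accuracy`, not NLTK's implementation), we can compare a gold tag sequence with a predicted one. The tag sequences below are made up.

```python
def iob_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted IOB tag equals the gold IOB tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


gold = ["B-NP", "I-NP", "O", "B-NP", "I-NP"]
pred = ["B-NP", "I-NP", "O", "O", "O"]

score = iob_accuracy(gold, pred)                     # 3 of 5 tags agree
all_outside = iob_accuracy(gold, ["O"] * len(gold))  # a chunker that finds no chunks
print(score, all_outside)
```

The second score shows why a chunker that creates no chunks at all can still reach a nonzero accuracy: every token that is truly outside a chunk counts as correct.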

Using the `nltk.corpus` module, we can load the `conll2000` corpus, which has been tagged
and then chunked using the IOB notation. The chunk categories provided in this corpus
are NP, VP, and PP.

In [None]:
from nltk.corpus import conll2000
conll2000.chunked_sents('train.txt')[99]

In [None]:
len(conll2000.chunked_sents('train.txt'))

In [None]:
conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]

In [None]:
cp = nltk.RegexpParser("") # we are not providing any grammar!
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.accuracy(test_sents))

Now let’s try a naive regular expression chunker that
looks for tags beginning with letters that are characteristic of noun phrase tags (e.g.,
`CD` (cardinal number), `DT`, and `JJ`).

In [None]:
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar) 
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.accuracy(test_sents))

We can improve on it
by adopting a more data-driven approach, where we use the training corpus to find the
chunk tag (I, O, or B) that is most likely for each part-of-speech tag. In other words, we
can build a chunker using a unigram tagger (covered two weeks ago). But rather than trying to
determine the correct part-of-speech tag for each word, we are trying to determine the
correct chunk tag, given each word’s part-of-speech tag.

In [None]:
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Train on (POS tag, chunk tag) pairs: the POS tag plays the role of the "word"
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        # Predict an IOB chunk tag for each POS tag, then rebuild the chunk tree
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

In [None]:
nltk.chunk.tree2conlltags(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[0])

In [None]:
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.accuracy(test_sents))
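To see roughly what the unigram tagger inside `UnigramChunker` learns, the sketch below counts chunk tags per part-of-speech tag and keeps the most frequent one. It is a simplified stand-in written with `collections.Counter`, not `nltk.UnigramTagger` itself; `most_likely_chunk_tags` and the toy training pairs are assumptions for illustration.

```python
from collections import Counter, defaultdict


def most_likely_chunk_tags(train_pairs):
    """Map each POS tag to the chunk tag it co-occurs with most often."""
    counts = defaultdict(Counter)
    for pos, chunk_tag in train_pairs:
        counts[pos][chunk_tag] += 1
    return {pos: tag_counts.most_common(1)[0][0] for pos, tag_counts in counts.items()}


# Toy (POS tag, chunk tag) pairs; made up, not drawn from conll2000
toy_pairs = [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"),
             ("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"),
             ("NN", "B-NP"), ("IN", "O")]

model = most_likely_chunk_tags(toy_pairs)
print(model)
```

Note that `NN` appears with both `I-NP` and `B-NP` in the toy data; the model keeps only the more frequent tag, which is exactly the kind of information a unigram approach throws away.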

### Training Classifier-Based Chunkers
Both the regular expression-based chunkers and the n-gram chunkers decide what
chunks to create entirely based on part-of-speech tags. However, sometimes part-of-speech
tags are insufficient to determine how a sentence should be chunked. For example,
consider the following two statements:

In [None]:
from nltk.tree import Tree
Tree.fromstring("(S (NP Joey) sold (NP the farmer) (NP rice) .)")

In [None]:
example = Tree.fromstring("(S (NP Joey) sold (NP the computer monitor) .)")
example

In [None]:
unchunked = example.flatten()
unchunked

In [None]:
unigram_chunker.parse(nltk.pos_tag(unchunked))

In [None]:
unigram_chunker.parse(nltk.pos_tag(Tree.fromstring("(S (NP Joey) sold (NP the farmer) (NP rice) .)").flatten()))

One way that we can incorporate information about the content of words is to use a
classifier-based tagger to chunk the sentence. Like the n-gram chunker considered in
the previous section, this classifier-based chunker will work by assigning IOB tags to
the words in a sentence, and then converting those tags to chunks.
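A classifier-based chunker needs a feature extractor that describes each token. The sketch below is one possible extractor, assuming we use the current POS tag, the current word, and the previous POS tag as features; the function name `npchunk_features` and the hand-assigned tags are illustrative, and no classifier is trained here.

```python
def npchunk_features(sentence, i):
    """Features for token i of a POS-tagged sentence (a list of (word, pos) pairs)."""
    word, pos = sentence[i]
    # Use a sentinel value when there is no previous token
    prevpos = "<START>" if i == 0 else sentence[i - 1][1]
    return {"pos": pos, "word": word.lower(), "prevpos": prevpos}


# Hand-tagged versions of the two statements above (tags assigned by hand)
farmer = [("Joey", "NNP"), ("sold", "VBD"), ("the", "DT"), ("farmer", "NN"), ("rice", "NN")]
monitor = [("Joey", "NNP"), ("sold", "VBD"), ("the", "DT"), ("computer", "NN"), ("monitor", "NN")]

rice_features = npchunk_features(farmer, 4)
monitor_features = npchunk_features(monitor, 4)
print(rice_features)
print(monitor_features)
```

The two final tokens have identical POS context, so a tag-only chunker must treat them alike. Only the `word` feature differs, which gives a classifier a chance to start a new chunk at "rice" but continue the chunk at "monitor".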

### Cascaded Chunks

So far, our chunk structures have been relatively flat. Trees consist of tagged tokens,
optionally grouped under a chunk node such as NP. However, it is possible to build
chunk structures of arbitrary depth, simply by creating a multistage chunk grammar.

In [None]:
grammar = r"""
    NP: {<DT|JJ>*<NN.*>+} # Chunk sequences of DT, JJ, NN (noun phrase)
    PP: {<IN><NP>} # Chunk prepositions followed by NP (prepositional phrase)
    VP: {<VB.*><NP|PP>+$} # Chunk verbs and their arguments (verb phrase)
"""
cp = nltk.RegexpParser(grammar)
cp.parse(sentence)

This solution is not perfect, as some `NP` chunks are needlessly split. We will refine our grammar.

In [None]:
grammar = r"""
    NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN (noun phrase)
    PP: {<IN><NP>} # Chunk prepositions followed by NP (prepositional phrase)
    VP: {<VB.*><NP|PP>+$} # Chunk verbs and their arguments (verb phrase)
    NP: {<NP|PP><CC>?<NP|PP>}
"""
cp = nltk.RegexpParser(grammar)
cp.parse(sentence)

In [None]:
cp = nltk.RegexpParser(grammar, loop = 2)
cp.parse(sentence)

Recall: when matches overlap, the leftmost match takes precedence when assigning the chunks!
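The staged behaviour of a multistage grammar is easier to follow on a short hand-tagged sentence. The sketch below reuses the three-stage NP/PP/VP grammar from this section; the sentence, its tags, and the names `cascade_grammar`, `cascade_tree`, and `labels` are assumptions for illustration.

```python
import nltk

cascade_grammar = r"""
    NP: {<DT|JJ>*<NN.*>+}   # noun phrase
    PP: {<IN><NP>}          # preposition followed by an NP chunk
    VP: {<VB.*><NP|PP>+$}   # verb and its arguments at the end of the sentence
"""

# Hand-tagged toy sentence (illustrative)
toy_words = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
             ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

cascade_tree = nltk.RegexpParser(cascade_grammar).parse(toy_words)
print(cascade_tree)

# Node labels in pre-order, to show how chunks nest inside each other
labels = [st.label() for st in cascade_tree.subtrees()]
```

Each stage sees the chunks produced by the earlier stages, so the `PP` rule can refer to `<NP>` and the `VP` rule can refer to `<NP|PP>`.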

### Named Entity Recognition
Named entities are definite noun phrases that refer to specific types of individuals, such as organizations,
persons, dates, and so on.

| NAMED ENTITY | EXAMPLE | 
| ---- | ---- |
| ORGANIZATION | Georgia-Pacific Corp., WHO |
| PERSON | Eddy Bonte, President Obama |
| LOCATION | Murray River, Mount Everest |
| DATE | June, 2008-06-29 |
| TIME | two fifty a m, 1:30 p.m. |
| FACILITY | Washington Monument, Stonehenge |
| GEO-POLITICAL ENTITIES | South East Asia, Midlothian |

The goal of a named entity recognition (NER) system is to identify textual mentions
of the named entities. This can be broken down into two subtasks: identifying
the boundaries of the NE, and identifying its type. 

How do we go about identifying named entities? One option would be to look up each
word in an appropriate list of names. However, this is prone to errors caused by the fact that many named entity terms
are ambiguous.
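A small sketch of why list lookup alone is unreliable. The gazetteer `LOCATIONS`, the helper `gazetteer_lookup`, and the sentence are all made up for illustration.

```python
# Toy gazetteer of location names, stored in lower case (illustrative entries)
LOCATIONS = {"reading", "nice", "china", "peru"}


def gazetteer_lookup(tokens, gazetteer):
    """Return every token whose lower-cased form appears in the gazetteer."""
    return [tok for tok in tokens if tok.lower() in gazetteer]


tokens = "She enjoys reading about Peru and finds the fine china very nice".split()
hits = gazetteer_lookup(tokens, LOCATIONS)
print(hits)
```

Only one of the four hits is actually used as a location in this sentence; the others are a verb, a common noun, and an adjective that happen to share a spelling with place names.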

Named entity recognition is a task that is well suited to the type of classifier-based
approach that we saw for noun phrase chunking. In particular, we can build a tagger
that labels each word in a sentence using the IOB format, where chunks are labeled by
their appropriate type.

NLTK provides a classifier that has already been trained to recognize named entities,
accessed with the function `nltk.ne_chunk()`. If we set the parameter `binary=True`,
then named entities are just tagged as `NE`; otherwise, the classifier adds category labels
such as `PERSON`, `ORGANIZATION`, and `GPE`.

In [None]:
import requests
import lxml.html as lx

In [None]:
r=requests.get('https://plato.stanford.edu/entries/liberalism-latin-america/')
html=lx.fromstring(r.text)
d=" ".join(html.xpath('//div[@id="aueditable"]//p//text()'))

In [None]:
d[:100]

In [None]:
import nltk
import re
def preprocess(document):
    document = re.sub(r"\s+", " ", document)
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

In [None]:
document=preprocess(d)

In [None]:
document[0]

In [None]:
t=[nltk.ne_chunk(sentence) for sentence in document]

In [None]:
t[15]

In [None]:
for sent in t:
    for chunk in sent:
        if hasattr(chunk, "label"):
            print(chunk)
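Instead of printing each chunk, we may want the entity type and text as plain tuples. The sketch below does this for a hand-built tree with the same shape as the output of `nltk.ne_chunk()`, so it runs without the pretrained model; the helper `entities_with_labels` and the toy tree are assumptions for illustration.

```python
from nltk.tree import Tree


def entities_with_labels(chunked_sentence):
    """Return (label, text) pairs for every chunk directly under the sentence node."""
    return [(node.label(), " ".join(word for word, tag in node.leaves()))
            for node in chunked_sentence if isinstance(node, Tree)]


# Hand-built tree in the shape produced by a named entity chunker (illustrative)
toy_chunked = Tree("S", [Tree("PERSON", [("President", "NNP"), ("Obama", "NNP")]),
                         ("visited", "VBD"),
                         Tree("GPE", [("Washington", "NNP")]),
                         (".", ".")])

entities = entities_with_labels(toy_chunked)
print(entities)
```

Checking `isinstance(node, Tree)` separates chunk nodes from plain `(word, tag)` tuples, which plays the same role as the `hasattr(chunk, "label")` test above.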

### Relation Extraction

Once named entities have been identified in a text, we then want to extract the relations
that exist between them. We will typically be looking for relations
between specified types of named entity. One way of approaching this task is to initially
look for all triples of the form `(X, α, Y)`, where `X` and `Y` are named entities of the required
types, and `α` is the string of words that intervenes between `X` and `Y`.
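A minimal pure-Python sketch of this triple search. It assumes the sentence has already been reduced to a list in which named entities are `(label, text)` tuples and all other tokens are plain strings; the function `extract_triples`, the sample data, and the pattern for "in" are illustrative choices, not a library API.

```python
import re


def extract_triples(chunks, x_type, y_type, pattern):
    """Find (X, alpha, Y) triples between consecutive named entities.

    chunks: list whose items are either plain token strings or
            (label, text) tuples standing for named entities.
    """
    triples = []
    entity_positions = [i for i, c in enumerate(chunks) if isinstance(c, tuple)]
    # Pair each entity with the next entity in the sentence
    for a, b in zip(entity_positions, entity_positions[1:]):
        x_label, x_text = chunks[a]
        y_label, y_text = chunks[b]
        alpha = " ".join(chunks[a + 1:b])  # the words between X and Y
        if x_label == x_type and y_label == y_type and pattern.search(alpha):
            triples.append((x_text, alpha, y_text))
    return triples


sample = [("ORGANIZATION", "WHYY"), "in", ("LOCATION", "Philadelphia"), "and",
          ("ORGANIZATION", "Freedom Forum"), ",", "based", "in",
          ("LOCATION", "Arlington")]

in_pattern = re.compile(r"\bin\b")
triples = extract_triples(sample, "ORGANIZATION", "LOCATION", in_pattern)
print(triples)
```

The regular expression on `alpha` is what restricts the triples to a particular relation, here "organization located in place". The pair in the middle (a location followed by an organization) is skipped because its entity types do not match.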