# Lab1-Assignment

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook describes the assignment for Lab 1 of the text mining course. 

**Points**: each exercise is prefixed with the number of points you can obtain for the exercise.

We assume you have worked through the following notebooks:
* **Lab1.1-introduction**
* **Lab1.2-introduction-to-NLTK**
* **Lab1.3-introduction-to-spaCy** 

In this assignment, you will process an English text (**Lab1-apple-samsung-example.txt**) with both NLTK and spaCy and discuss the similarities and differences.

## Credits
The notebooks in this block were originally created by [Marten Postma](https://martenpostma.github.io). Adaptations were made by [Filip Ilievski](http://ilievski.nl).

## Tip: how to read a file from disk
Let's open the file **Lab1-apple-samsung-example.txt** from disk.

In [None]:
from pathlib import Path

In [None]:
cur_dir = Path().resolve()  # this should provide you with the folder in which this notebook is placed
path_to_file = cur_dir / 'Lab1-apple-samsung-example.txt'
print(path_to_file)
print('does path exist? ->', path_to_file.exists())

If the output from the code cell above states that **does path exist? -> False**, please check that the file **Lab1-apple-samsung-example.txt** is in the same directory as this notebook.

In [None]:
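# The text contains non-ASCII characters such as '£'; we assume the file is UTF-8 encoded
with open(path_to_file, encoding='utf-8') as infile:
    text = infile.read()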


print('number of characters', len(text))

## [total points: 4] Exercise 1: NLTK
In this exercise, we use NLTK to apply **Part-of-speech (POS) tagging**, **Named Entity Recognition (NER)**, and **Constituency parsing**. The following code snippet already performs sentence splitting and tokenization. 

In [None]:
# %pip uninstall nltk
# %pip install nltk

import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize

# nltk.download()

In [None]:
sentences_nltk = sent_tokenize(text)

In [None]:
tokens_per_sentence = []
for sentence_nltk in sentences_nltk:
    sent_tokens = word_tokenize(sentence_nltk)
    tokens_per_sentence.append(sent_tokens)

We will use lists to keep track of the output of the NLP tasks. We can hence inspect the output for each task using the index of the sentence.

In [None]:
sent_id = 1
print('SENTENCE', sentences_nltk[sent_id])
print('TOKENS', tokens_per_sentence[sent_id])

### [point: 1] Exercise 1a: Part-of-speech (POS) tagging
Use `nltk.pos_tag` to perform part-of-speech tagging on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [None]:
pos_tags_per_sentence = []
for tokens in tokens_per_sentence:
    # nltk.pos_tag returns a list of (token, tag) pairs for one sentence
    tagged_tokens = nltk.pos_tag(tokens)
    pos_tags_per_sentence.append(tagged_tokens)

In [None]:
print(pos_tags_per_sentence)
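
Printing the whole list at once produces one long line in the exported PDF. The sketch below shows one way to print a tagged sentence per line instead. It uses hand-written data in the shape that `nltk.pos_tag` returns, not the assignment text; `toy_pos_tags` and `format_tagged` are names made up for this illustration.

```python
# Hand-written example in the shape nltk.pos_tag returns:
# one list of (token, tag) pairs per sentence. Not real tagger output.
toy_pos_tags = [
    [('Apple', 'NNP'), ('sued', 'VBD'), ('Samsung', 'NNP'), ('.', '.')],
    [('The', 'DT'), ('court', 'NN'), ('ruled', 'VBD'), ('.', '.')],
]


def format_tagged(tagged_sentence):
    """Join a tagged sentence into a single 'token/TAG token/TAG ...' line."""
    return ' '.join(f'{token}/{tag}' for token, tag in tagged_sentence)


# The sentence index makes it easy to look up the same sentence in other lists.
for sent_id, tagged in enumerate(toy_pos_tags):
    print(sent_id, format_tagged(tagged))
```

The same loop should work on `pos_tags_per_sentence`, since it has the same structure.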

### [point: 1] Exercise 1b: Named Entity Recognition (NER)
Use `nltk.chunk.ne_chunk` to perform Named Entity Recognition (NER) on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [None]:
ner_tags_per_sentence = []
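for tagged_tokens in pos_tags_per_sentence:
    # ne_chunk expects POS-tagged tokens and returns a tree with labelled entity chunks
    ne_tree = nltk.chunk.ne_chunk(tagged_tokens)
    ner_tags_per_sentence.append(ne_tree)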

In [None]:
print(ner_tags_per_sentence)
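
The printed trees can be hard to scan for entities. The sketch below pulls the entity chunks out of a tree as (text, label) pairs. The tree is built by hand in the shape that `nltk.chunk.ne_chunk` returns, so the labels in it are illustrative and not actual NER output; `toy_ne_tree` and `extract_entities` are names made up for this example.

```python
from nltk import Tree

# Hand-built tree in the shape ne_chunk returns: entity chunks are subtrees,
# all other tokens are plain (token, tag) pairs. Labels are illustrative.
toy_ne_tree = Tree('S', [
    Tree('ORGANIZATION', [('Samsung', 'NNP')]),
    ('lost', 'VBD'),
    ('a', 'DT'),
    Tree('GPE', [('US', 'NNP')]),
    ('patent', 'NN'),
    ('case', 'NN'),
    ('to', 'TO'),
    Tree('ORGANIZATION', [('Apple', 'NNP')]),
    ('.', '.'),
])


def extract_entities(ne_tree):
    """Return (entity text, label) pairs for every chunk directly under the root."""
    entities = []
    for child in ne_tree:
        if isinstance(child, Tree):
            text = ' '.join(token for token, tag in child.leaves())
            entities.append((text, child.label()))
    return entities


print(extract_entities(toy_ne_tree))
```

Applied to each tree in `ner_tags_per_sentence`, a helper like this gives a compact list that is easier to compare with spaCy's `ent.text` and `ent.label_` output.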

### [points: 2] Exercise 1c: Constituency parsing
Use the `nltk.RegexpParser` to perform constituency parsing on each sentence.

Use `print` to **show** the output in the notebook (and hence also in the exported PDF!).

In [None]:
constituent_parser = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*''')

In [None]:
constituency_output_per_sentence = []
for ne_tree in ner_tags_per_sentence:
    constituency_output_per_sentence.append(constituent_parser.parse(ne_tree))

In [None]:
print(constituency_output_per_sentence)

Augment the RegexpParser so that it also detects Named Entity Phrases (NEP), e.g., *Galaxy S III* and *Ice Cream Sandwich*.

In [None]:
constituent_parser_v2 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
NEP: {}             # ???''')

In [None]:
constituency_v2_output_per_sentence = []
for ne_tree in ner_tags_per_sentence:
    constituency_v2_output_per_sentence.append(constituent_parser_v2.parse(ne_tree))

In [None]:
print(constituency_v2_output_per_sentence)
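
One possible starting point for an NEP rule is to chunk runs of consecutive proper nouns. The sketch below tries this on a hand-tagged toy sentence; it is an illustration under assumptions, not the intended solution. The tags are assigned by hand, and a real tagger may label tokens such as *S* or *III* differently. In the notebook the parser also receives the `ne_chunk` output, where recognized entities already appear as labelled subtrees (e.g. `ORGANIZATION`), so a rule for the real data would probably have to cover those labels as well.

```python
import nltk

# Hand-tagged toy sentence (tags are assumptions, not tagger output).
toy_tagged = [('Samsung', 'NNP'), ('released', 'VBD'), ('the', 'DT'),
              ('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NNP'), ('.', '.')]

# A single-rule grammar: one or more consecutive proper nouns form an NEP.
nep_parser = nltk.RegexpParser('''
NEP: {<NNP>+}   # run of proper nouns
''')

toy_tree = nep_parser.parse(toy_tagged)

# Collect the text of every NEP chunk in the parsed tree.
neps = [' '.join(token for token, tag in subtree.leaves())
        for subtree in toy_tree.subtrees()
        if subtree.label() == 'NEP']
print(neps)
```

A rule based only on `NNP` will also group unrelated proper nouns that happen to be adjacent, so the output on the full text is worth checking by hand.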

## [total points: 1] Exercise 2: spaCy
Use spaCy to process the same text as you analyzed with NLTK.

In [None]:
# %pip install -U pip setuptools wheel
# %pip install -U spacy
# !python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

sentences_spacy = list(doc.sents)

sentence_1 = sentences_spacy[2]
sentence_2 = sentences_spacy[5]

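print("Sentence 1:", sentence_1.text)
print("POS tags sentence 1:")
for token in sentence_1:
    # pos_ is the coarse universal POS tag; tag_ is the fine-grained tag,
    # which is easier to compare with NLTK's Penn Treebank tags
    print(token.text, "-", token.pos_, "-", token.tag_)

print("\nSentence 2:", sentence_2.text)
print("POS tags sentence 2:")
for token in sentence_2:
    print(token.text, "-", token.pos_, "-", token.tag_)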

print("\nNamed entities in sentence 1:")
for ent in sentence_1.ents:
    print(ent.text, "-", ent.label_)

print("\nNamed entities in sentence 2:")
for ent in sentence_2.ents:
    print(ent.text, "-", ent.label_)

Small tip: you can use **sents = list(doc.sents)** so that you can access a sentence by index, e.g., **sents[2]** for the third sentence.
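
The tip can be tried without a trained model. The sketch below assumes spaCy v3 and uses a blank English pipeline with the rule-based `sentencizer` component, so its sentence boundaries may differ from those of `en_core_web_sm`. The text and the names `nlp_toy`, `doc_toy` and `toy_sents` are made up for this illustration.

```python
import spacy

# Blank English pipeline with a rule-based sentence splitter; no trained model is loaded.
nlp_toy = spacy.blank('en')
nlp_toy.add_pipe('sentencizer')

doc_toy = nlp_toy('Apple sued Samsung. The court ruled quickly. Samsung appealed.')

# doc.sents cannot be indexed directly, so convert it to a list first.
toy_sents = list(doc_toy.sents)
print(len(toy_sents))
print(toy_sents[2].text)
```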


## [total points: 7] Exercise 3: Comparison NLTK and spaCy
We will now compare the output of NLTK and spaCy, i.e., in what ways do they differ?

### [points: 3] Exercise 3a: Part-of-speech tagging
Compare the output from NLTK and spaCy regarding part-of-speech tagging.

* In general, NLTK and spaCy segment well-formed English text into sentences in similar ways. In our case, both split the article into sentences correctly, with only minor variations in how they handled quotations and sentences with multiple clauses. We noticed no significant errors for this text.
* We selected the sentence “The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.” because it contains multiple product names and numbers, which makes it a good test case for comparing the two libraries.
* The only differences we observed were in how spaCy and NLTK tag compound product names. spaCy handled them more consistently: it tagged parts such as “S”, “mini” and “Pro” as NNP, i.e., as proper nouns. NLTK, on the other hand, sometimes mislabeled them as NN (common noun).
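
A token-by-token comparison like the one described above can be automated once both libraries have produced (token, tag) pairs with the same tag set. The sketch below assumes that both libraries tokenized the sentence identically. The tag sequences are hand-written and hypothetical, not actual tagger output, and `tag_differences` is a helper name made up for this example.

```python
# Hand-written, hypothetical tag sequences for illustration (not real tagger output).
toy_nltk_tags = [('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NNP'), ('mini', 'NN')]
toy_spacy_tags = [('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NNP'), ('mini', 'NNP')]


def tag_differences(tags_a, tags_b):
    """Return (token, tag_a, tag_b) for aligned tokens whose tags differ."""
    if len(tags_a) != len(tags_b):
        raise ValueError('the two sequences have different lengths')
    diffs = []
    for (token_a, tag_a), (token_b, tag_b) in zip(tags_a, tags_b):
        if token_a != token_b:
            raise ValueError(f'tokenization differs: {token_a!r} vs {token_b!r}')
        if tag_a != tag_b:
            diffs.append((token_a, tag_a, tag_b))
    return diffs


print(tag_differences(toy_nltk_tags, toy_spacy_tags))
```

For real output, the spaCy side would be built from `token.tag_` rather than `token.pos_`, because `tag_` holds the fine-grained tags that correspond to NLTK's tag set.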


### [points: 2] Exercise 3b: Named Entity Recognition (NER)
* We think that spaCy performs better of the two libraries: it detects not only organization and location names but also product names, dates and monetary values. NLTK missed most multi-word names, such as Galaxy S III.


### [points: 2] Exercise 3c: Constituency/dependency parsing
Choose one sentence from the text and run constituency parsing using NLTK and dependency parsing using spaCy.
* The difference between constituency and dependency parsing is that constituency parsing divides the sentence into nested phrases, each labeled with a syntactic category (such as NP or VP). Dependency parsing, on the other hand, focuses on the grammatical relations between individual words, which form a directed graph.
* For parsing, we selected the sentence “Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices.”. Comparing the two libraries, spaCy was clearly superior for this sentence: it showed clearer grammatical relationships and did a much better job of connecting verbs to their subjects and objects. NLTK's output was nevertheless useful for a hierarchical breakdown of the sentence.
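
The difference between the two representations can be made concrete with a toy sentence. In the sketch below, both analyses of "Samsung lost a case" are written by hand for illustration. They are not output from NLTK or spaCy, and the relation names (`nsubj`, `dobj`, `det`) only follow common conventions.

```python
# Toy sentence: "Samsung lost a case". Both analyses are written by hand.

# Constituency: nested phrases; each tuple is (label, child, child, ...).
toy_constituency = ('S',
                    ('NP', 'Samsung'),
                    ('VP', ('V', 'lost'),
                           ('NP', 'a', 'case')))

# Dependency: directed edges (head, relation, dependent) between words.
toy_dependency = [('lost', 'nsubj', 'Samsung'),
                  ('lost', 'dobj', 'case'),
                  ('case', 'det', 'a')]


def leaves(node):
    """Collect the words of a constituency tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the phrase label
        words.extend(leaves(child))
    return words


def dependents(head, edges):
    """Return the words that depend directly on the given head word."""
    return [dependent for edge_head, relation, dependent in edges if edge_head == head]


print(leaves(toy_constituency))
print(dependents('lost', toy_dependency))
```

The constituency tree groups words into phrases without stating which word is the subject, whereas the dependency edges link the verb directly to its subject and object.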


# End of this notebook