# POS Chunking
**1. Create a chunker that detects noun-phrases (NPs) and lists the NPs in the text below.**

- Both [NLTK](https://www.nltk.org/book/ch07.html) and [spaCy](https://spacy.io/api/matcher) support chunking.
- Look up RegEx parsing for NLTK and the document object for spaCy.
- Make use of what you've learned about tokenization.

In [1]:
import spacy

text = "The language model predicted the next word. It was a very nice word!"

# load spaCy model
nlp = spacy.load("en_core_web_sm")

# preprocess text (tokenization, POS tagging etc)
doc = nlp(text)

# initialize a set to keep track of the indices of noun phrase tokens
noun_phrase_tokens = set()
# dictionary to keep track of which tokens belong to which noun phrase
noun_phrases_dict = {}

# mark the tokens that are part of noun phrases
for chunk in doc.noun_chunks:
    np_text = chunk.text  # The text of the current noun phrase
    for token in chunk:
        noun_phrase_tokens.add(token.i)  # Add the token index to the set
        noun_phrases_dict[token.i] = np_text  # Map token index to noun phrase text

# initialize a list to store the final output
output = []

# iterate over each token in the document
for token in doc:
    if token.i in noun_phrase_tokens:
        # If this token is part of a noun phrase and not already added, add the noun phrase
        if noun_phrases_dict[token.i] not in output:
            output.append(noun_phrases_dict[token.i])
    else:
        # If the token is not part of a noun phrase, add it as an individual token
        output.append(token.text)


# Output: a list of all tokens, grouped as noun-phrases where applicable
output


['The language model',
 'predicted',
 'the next word',
 '.',
 'It',
 'was',
 'a very nice word',
 '!']
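
The loop above skips a noun phrase whose text is already in `output`, so a phrase that occurs twice in a text would be listed only once. A minimal sketch of grouping by token position instead, assuming the phrases are available as `(start, end)` index pairs (spaCy exposes these as `chunk.start` and `chunk.end`). `group_by_spans` is a hypothetical helper, and plain strings stand in for spaCy tokens:

```python
def group_by_spans(tokens, spans):
    """Merge the tokens covered by each (start, end) span into one string.

    `spans` are half-open index ranges, assumed non-empty and non-overlapping.
    """
    span_ends = {start: end for start, end in spans}
    output = []
    i = 0
    while i < len(tokens):
        if i in span_ends:
            # a phrase starts here: emit it once and jump past its last token
            end = span_ends[i]
            output.append(" ".join(tokens[i:end]))
            i = end
        else:
            # token outside any phrase: emit it on its own
            output.append(tokens[i])
            i += 1
    return output


# Hand-written tokens and spans for illustration only
tokens = ["the", "word", "followed", "the", "word", "."]
spans = [(0, 2), (3, 5)]
grouped = group_by_spans(tokens, spans)
print(grouped)
```

Because the grouping relies on positions rather than phrase text, both occurrences of "the word" are kept.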

**2. Modify the chunker to handle verb-phrases (VPs) as well.**
- This can be done by using a RegEx parser in NLTK or using a spaCy Matcher.

In [2]:
from spacy.matcher import Matcher

# using spaCy matcher for verb phrases
matcher = Matcher(nlp.vocab)

# example pattern for verb phrases (zero or more adverbs followed by one or more verbs)
vp_pattern = [{"POS": "ADV", "OP": "*"}, {"POS": "VERB", "OP": "+"}]
matcher.add("VP", [vp_pattern])

# initialize a set to keep track of the indices of phrase tokens
phrase_tokens = set()

# dictionary to keep track of which tokens belong to which phrase
phrases_dict = {}

# noun phrases
for chunk in doc.noun_chunks:
    np_text = chunk.text 
    for token in chunk:
        phrase_tokens.add(token.i)
        phrases_dict[token.i] = (np_text, "NP")

# verb phrases
matches = matcher(doc)
for match_id, start, end in matches:
    vp_text = doc[start:end].text
    for token in doc[start:end]:
        phrase_tokens.add(token.i)
        phrases_dict[token.i] = (vp_text, "VP")

# initialize a list to store the final output
output = []

# iterate over each token in the document
for token in doc:
    if token.i in phrase_tokens:
        phrase_text, phrase_type = phrases_dict[token.i]
        # add the phrase if its text is not already in the output (the phrase type is not checked here)
        if phrase_text not in output:
            output.append(phrase_text)
    else:
        # if the token is not part of a phrase, add it as an individual token
        output.append(token.text)

# output: a list of all tokens, grouped as phrases if relevant
print(output)

['The language model', 'predicted', 'the next word', '.', 'It', 'was', 'a very nice word', '!']


**3. Verb-phrases (VPs) can be defined by many different grammatical rules. Give four examples.**


1. VP -> V (simple verb phrase, e.g. "run")

2. VP -> V NP (verb with direct object, e.g. "saw the bird")

3. VP -> V NP NP (verb with indirect object and direct object, e.g. "gave her a hug")

4. VP -> V PP (verb with a prepositional phrase, e.g. "walked to the park")
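
As a rough illustration, the four rules can be written as regular expressions over a sequence of POS tags. This is a minimal sketch, not the NLTK or spaCy chunker: the `NP` and `PP` definitions are simplified assumptions, `match_vp_rule` is a hypothetical helper, and the tags for the example phrases are assigned by hand.

```python
import re

# Building blocks as regexes over a space-terminated tag string.
# NP and PP are deliberately simplified assumptions.
NP = r"(?:(?:DET )?(?:ADJ )*NOUN |PRON )"
PP = r"ADP " + NP

VP_RULES = [
    ("VP -> V NP NP", r"VERB " + NP + NP),
    ("VP -> V NP", r"VERB " + NP),
    ("VP -> V PP", r"VERB " + PP),
    ("VP -> V", r"VERB "),
]


def match_vp_rule(tags):
    """Return the first rule whose pattern covers the whole tag sequence."""
    tag_string = "".join(tag + " " for tag in tags)
    for name, pattern in VP_RULES:
        if re.fullmatch(pattern, tag_string):
            return name
    return None


# Hand-tagged versions of the example phrases from the list above
examples = {
    "run": ["VERB"],
    "saw the bird": ["VERB", "DET", "NOUN"],
    "gave her a hug": ["VERB", "PRON", "DET", "NOUN"],
    "walked to the park": ["VERB", "ADP", "DET", "NOUN"],
}
for phrase, tags in examples.items():
    print(f"{phrase!r}: {match_vp_rule(tags)}")
```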

**4. After these applications, do you find chunking to be beneficial in the context of language modeling and next-word prediction? Why or why not?**

Chunking a text into noun and verb phrases can help a language model predict words, but it depends on accurate phrase identification, which is difficult. Phrases that are misidentified because of ambiguous syntax or part-of-speech tagging errors can mislead the model and lead to inaccurate predictions. Natural language also has many exceptions to syntactic rules, so it is hard to devise chunking rules that are both comprehensive and accurate across contexts and text types.

___

# Dependency Parsing

**1. Use spaCy to inspect/visualise the dependency tree of the text provided below.**
- Optional addition: visualize the dependencies as a graph using `networkx`

In [3]:
import spacy
from spacy import displacy

text = "The language model predicted the next word"

# load spaCy model
nlp = spacy.load("en_core_web_sm")

# preprocess the text
doc = nlp(text)

# visualize the dependency tree
displacy.render(doc, style="dep")

**2. What is the root of the sentence? Attempt to spot it yourself, but produce the answer with code.**

In [4]:
# find the root
sentence_root = [token for token in doc if token.head == token][0]

# print word and POS tag
print(f"Root: {sentence_root.text}, POS tag: {sentence_root.pos_}")

Root: predicted, POS tag: VERB


**3. Find the subject and object of a sentence. Print the results for the sentence above.**

In [17]:
def find_subjects_objects(text):
    doc = nlp(text)
    
    subjects = []
    objects = []
    
    # iterate over the tokens in the document
    for token in doc:
        # if the token is a subject, add it to the subjects list
        if "subj" in token.dep_:
            subjects.append(token.text)
        # if the token is an object, add it to the objects list
        # (note: this substring test also matches labels such as "pobj")
        elif "obj" in token.dep_:
            objects.append(token.text)
    
    return subjects, objects

subjects, objects = find_subjects_objects(text)

print("Subjects:", subjects)
print("Objects:", objects)


Subjects: ['model']
Objects: ['word']
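
The substring tests `"subj" in token.dep_` and `"obj" in token.dep_` match every label that contains those letters, including `pobj` (object of a preposition). A minimal sketch with explicit label sets follows. The sets are an assumption about which labels are of interest and should be checked against the labels the loaded parser actually uses; `split_subjects_objects` is a hypothetical helper, and the `(text, label)` pairs are written by hand rather than produced by spaCy.

```python
# Assumed label sets; adjust them to the parser's label scheme.
SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj", "csubjpass"}
OBJECT_DEPS = {"dobj", "obj", "iobj", "dative"}


def split_subjects_objects(tagged_tokens):
    """Split (text, dependency label) pairs into subjects and objects."""
    subjects = [text for text, dep in tagged_tokens if dep in SUBJECT_DEPS]
    objects = [text for text, dep in tagged_tokens if dep in OBJECT_DEPS]
    return subjects, objects


# Hand-written (text, dependency label) pairs for illustration only
tagged = [
    ("model", "nsubj"),
    ("predicted", "ROOT"),
    ("word", "dobj"),
    ("in", "prep"),
    ("context", "pobj"),
]
subjects, objects = split_subjects_objects(tagged)
print("Subjects:", subjects)
print("Objects:", objects)
```

With the substring test, `context` would also be reported as an object; with the explicit sets it is not.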


**4. How would you use the relationships extracted from dependency parsing in language modeling contexts?**

Dependency parsing shows how the parts of a sentence are connected and which roles they play, which allows for a deeper understanding of the text. Integrating these relationships into language modeling can improve the model's handling of grammar, syntax, and semantics, which in turn can improve word prediction and text generation.

___

# Wordnet

**1. Use Wordnet (from NLTK) and create a function to get all synonyms of a word of your choice. Try with "language"**

In [18]:
from nltk.corpus import wordnet as wn

def find_synonyms(word):
    
    # find synsets of the word
    synsets = wn.synsets(word)
    
    # extract the lemma names (synonyms) from each synset
    synonyms = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
    
    return list(synonyms)

word = "course"

print(f'Synonyms of "{word}": {find_synonyms(word)}')

Synonyms of "course": ['feed', 'naturally', 'course', 'trend', 'flow', 'class', 'grade', 'of_course', 'form', 'course_of_study', 'row', 'line', 'course_of_action', 'track', 'course_of_instruction', 'run', 'path']


**2. From the same word you chose, extract an additional 4 or more features from wordnet (such as hyponyms). Describe each category briefly.**

In [19]:
from nltk.corpus import wordnet as wn

# format the output
def format_synsets(synsets):
    return [f"{synset.lemmas()[0].name()}" for synset in synsets]

# Hyponyms: words that are a subclass of the concept represented by the synset, e.g. "eagle" given the concept "bird"
hyponyms = wn.synsets(word)[0].hyponyms()

# Hypernyms: opposite of hyponyms - a general concept a subclass belongs to, e.g. "bird" given the subclass "eagle"
hypernyms = wn.synsets(word)[0].hypernyms()

# Meronyms: a part of the concept represented by the synset, e.g. "leaf" given the concept "tree" (part_meronyms returns part relations only)
meronyms = wn.synsets(word)[0].part_meronyms()

# Holonyms: opposite of meronyms - the whole that a concept belongs to, e.g. "tree" given the part "leaf" (member_holonyms returns only the groups the concept is a member of)
holonyms = wn.synsets(word)[0].member_holonyms()

print("Hyponyms:", hyponyms)
print("Hypernyms:", hypernyms)
print("Meronyms:", meronyms)
print("Holonyms:", holonyms)


Hyponyms: [Synset('adult_education.n.01'), Synset('art_class.n.01'), Synset('childbirth-preparation_class.n.01'), Synset('correspondence_course.n.01'), Synset('course_of_lectures.n.01'), Synset('directed_study.n.01'), Synset('elective_course.n.01'), Synset('extension_course.n.01'), Synset('home_study.n.01'), Synset('industrial_arts.n.01'), Synset('orientation_course.n.01'), Synset('propaedeutic.n.01'), Synset('refresher_course.n.01'), Synset('required_course.n.01'), Synset('seminar.n.02'), Synset('shop_class.n.01'), Synset('workshop.n.02')]
Hypernyms: [Synset('education.n.01')]
Meronyms: [Synset('course_session.n.01'), Synset('coursework.n.01'), Synset('lecture.n.03'), Synset('lesson.n.01')]
Holonyms: []


___

# Machine Learning Exercise - A sentiment classifier
- A rule-based approach with SentiWordNet + A machine learning classifier

**1. There are several steps required to build a classifier or any sort of machine learning application for textual data. For data including (INPUT_TEXT, LABEL), list the typical pipeline for classification.**

1. Select relevant data: make sure that each text (INPUT_TEXT) is paired with a label (LABEL), that the data is relevant to the task, and that the classes are clearly defined.

2. Preprocessing: cleaning the data (e.g. removing noise, correcting spelling errors) and preparing it for further processing (e.g., tokenization, normalization, stemming, lemmatization).

3. Splitting the data: dividing the dataset into a training set and a test set. The training set is used to train the machine learning model, while the test set is used to evaluate the model's performance.

4. Feature extraction: features are extracted from the preprocessed text, using methods like Bag of Words, TF-IDF, or word embeddings.

5. Training the classifier: train the selected classifier on the training set.

6. Evaluating the model: evaluate the performance of the model based on how accurately it can classify the unseen data in the test set, using metrics such as accuracy, precision, recall, and F1 score.

7. Iterate and refine: potentially improve the model based on the evaluation, e.g. by selecting a different set of features, trying a different classification algorithm, or tuning the model's parameters.
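
The metrics named in step 6 can be computed directly from the true and predicted labels. The following is a minimal sketch for the binary case; `classification_metrics` is a hypothetical helper and the predictions are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]  # made-up predictions
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```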

**2. Before developing a classifier, having a baseline is very useful. Build a baseline model for sentiment classification using SentiWordNet.**
- How you decide to aggregate sentiment is up to you. Explain your approach.
- It should report the accuracy of the classifier.

In [20]:
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn

sents = [
    "I liked it! Did you?",
    "It's not bad but... Nevermind, it is.",
    "It's awful",
    "I don't care if you loved it - it was terrible!",
    "I don't care if you hated it, I think it was awesome"
]

# This method assumes that the sentiment of a text can be approximated by the cumulative sentiment of its constituent words.

# convert spaCy POS tags to WordNet POS tags since SentiWordNet requires WordNet-specific POS tags.
def convert_tags(pos_tag):
    if pos_tag.startswith("ADJ"):
        return wn.ADJ
    elif pos_tag.startswith("NOUN"):
        return wn.NOUN
    elif pos_tag.startswith("ADV"):
        return wn.ADV
    elif pos_tag.startswith("VERB"):
        return wn.VERB
    return None  # For unhandled cases

# get sentiment of a text using SentiWordNet
def get_sentiment(text):
    # preprocess the text using spaCy, which automatically tokenizes and assigns POS-tags
    doc = nlp(text)
    score = 0
    for token in doc:
        wn_tag = convert_tags(token.pos_)
        if wn_tag not in (wn.ADJ, wn.ADV, wn.NOUN, wn.VERB):
            continue  # skip irrelevant words

        # add the word's sentiment to the overall score of the text
        synsets = list(swn.senti_synsets(token.lemma_, pos=wn_tag))
        if synsets:
            # sum the (positive - negative) scores over all synsets of the word
            for synset in synsets:
                score += synset.pos_score() - synset.neg_score()
    
    # if the sum is positive, the text is classified as positive (1), otherwise, it's classified as negative (0). 
    return 1 if score > 0 else 0

# truth labels
y_true = [1, 0, 0, 0, 1]

# predicted labels
y_pred = [get_sentiment(sent) for sent in sents]

# evaluate accuracy based on the proportion of correctly predicted sentiments in relation to the total number of sentences
def get_accuracy(y_true, y_pred):
    accuracy = sum(1 for true, pred in zip(y_true, y_pred) if true == pred) / len(y_true)
    return accuracy

print("Accuracy:", get_accuracy(y_true, y_pred))


Accuracy: 0.8
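
The baseline adds up `pos_score() - neg_score()` over every sense of a word, so words with many senses weigh more than words with few. One alternative is to average over the senses of each word before summing over words. The sketch below compares the two aggregations; `sentiment_label` is a hypothetical helper and the scores are invented for illustration, not taken from SentiWordNet.

```python
def sentiment_label(word_sense_scores, average_senses=True):
    """Classify a text from per-word lists of (pos, neg) sense scores."""
    total = 0.0
    for senses in word_sense_scores:
        if not senses:
            continue  # word without any scored sense
        diffs = [pos - neg for pos, neg in senses]
        # either average the senses of a word or let every sense count fully
        total += sum(diffs) / len(diffs) if average_senses else sum(diffs)
    return 1 if total > 0 else 0


# Hypothetical scores: a mildly positive word with four senses and
# a clearly negative word with a single sense
scores = [
    [(0.25, 0.0), (0.25, 0.0), (0.25, 0.0), (0.25, 0.0)],
    [(0.0, 0.5)],
]
label_sum = sentiment_label(scores, average_senses=False)
label_mean = sentiment_label(scores, average_senses=True)
print("summed over senses:", label_sum)
print("averaged over senses:", label_mean)
```

In this toy case the summed variant labels the text as positive and the averaged variant as negative, which shows how much the aggregation choice can matter.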


## The SST-2 binary sentiment dataset

**3. Split the training set into a training and test set. Choose a split size, and justify your choice.**

In [21]:
from datasets import load_dataset
dataset = load_dataset("sst2")

df = dataset["train"].to_pandas().drop(columns=["idx"])
df_sample = df.sample(10000)  # a tiny subset
print(df_sample.label.value_counts())
df_sample.head()

label
1    5539
0    4461
Name: count, dtype: int64


                                                sentence  label
62435                                       succeeded in      1
32549  figures prominently in this movie , and helps ...      1
45783  with an unusual protagonist ( a kilt-wearing j...      1
50278  give a spark to `` chasing amy '' and `` chang...      0
45351  what 's most refreshing about real women have ...      1


In [22]:
from sklearn.model_selection import train_test_split

# splitting the dataset into training and test sets with an 80/20 split, to train the model on a substantial portion of the data while reserving enough unseen examples to evaluate its performance on data it hasn't seen before (as shown in Kochmar, ch. 2)

# using seed 42 to make sure that all future runs will shuffle the data in the same way (Kochmar, ch. 2)
train_data, test_data = train_test_split(df_sample, test_size=0.2, random_state=42)
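
To make the effect of the split size and the seed concrete, here is a minimal plain-Python sketch of a seeded shuffle-and-split. It is not scikit-learn's implementation; `split_train_test` is a hypothetical helper and the rows are stand-in integers.

```python
import random


def split_train_test(rows, test_size=0.2, seed=42):
    """Shuffle the rows with a fixed seed and cut off a test portion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed, same order on every run
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for dataset rows
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))
```

Calling the function again with the same seed returns the same split, which is what makes later runs comparable.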


**4. Evaluate your baseline model on the test set.**

- Additionally: compare it against a random baseline. That is, a random guess for each example

In [23]:
import numpy as np
from sklearn.metrics import accuracy_score

# predict using the baseline model
y_test_pred = [get_sentiment(sent) for sent in test_data['sentence']]

# calculate accuracy of the baseline model
baseline_accuracy = get_accuracy(test_data['label'], y_test_pred)
print(f"Accuracy (baseline model): {baseline_accuracy:.4f}")

# generate random predictions
class_distribution = train_data['label'].value_counts(normalize=True)
random_predictions = np.random.choice([0, 1], size=len(test_data), p=[class_distribution[0], class_distribution[1]])

# calculate accuracy for the random predictions
random_accuracy = get_accuracy(test_data['label'], random_predictions)
print(f"Accuracy (random baseline): {random_accuracy:.4f}")

Accuracy (baseline model): 0.6765
Accuracy (random baseline): 0.5120


**5. Did you beat random guess?**

If not, can you think of any reasons why?

Yes. The SentiWordNet baseline reached an accuracy of 0.6765 on the test set, compared with 0.5120 for random guessing weighted by the class distribution, so the word-level sentiment scores carry a usable signal.
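
As a sanity check on the random baseline: if labels are guessed with probability p for the positive class and a fraction p of the true labels is positive, the expected accuracy is p² + (1 − p)². Always predicting the majority class gives max(p, 1 − p). The sketch below plugs in the label counts printed for the 10,000-row sample, under the assumption that the train and test splits have roughly the same proportions; the function names are hypothetical.

```python
def expected_random_accuracy(p_positive):
    """Expected accuracy when guessing labels with the class proportions."""
    return p_positive ** 2 + (1 - p_positive) ** 2


def majority_accuracy(p_positive):
    """Accuracy of always predicting the more frequent class."""
    return max(p_positive, 1 - p_positive)


p = 5539 / 10000  # share of label 1 in the sampled data
random_expected = expected_random_accuracy(p)
majority_expected = majority_accuracy(p)
print(f"Expected accuracy (proportional guessing): {random_expected:.4f}")
print(f"Expected accuracy (majority class): {majority_expected:.4f}")
```

The majority-class value is the stricter of the two reference points, so it is also worth comparing a model against it.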

## Classification with Naive Bayes and TF-IDF
This is the final task of the lab. You will use high-level libraries to implement a TF-IDF vectorizer and train your data using a Naive Bayes classifier.

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

# source: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#exercise-2-sentiment-analysis-on-movie-reviews

# using a scikit-learn pipeline for TF-IDF vectorization and Naive Bayes classification as explained in the source
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

# training the classifier on the SST-2 data
text_clf.fit(train_data['sentence'], train_data['label'])

# predict test dataset labels
predicted = text_clf.predict(test_data['sentence'])

# evaluate the classifier
# formatting with scikit-learn's classification report
print(classification_report(test_data['label'], predicted))
print("Accuracy:", accuracy_score(test_data['label'], predicted))


              precision    recall  f1-score   support

           0       0.88      0.70      0.78       903
           1       0.79      0.92      0.85      1097

    accuracy                           0.82      2000
   macro avg       0.83      0.81      0.82      2000
weighted avg       0.83      0.82      0.82      2000

Accuracy: 0.822
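
To show what the TF-IDF step computes, here is a minimal plain-Python sketch of the weighting. It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to my understanding matches the `TfidfVectorizer` defaults. The whitespace tokenization is a simplification, `tfidf_vectors` is a hypothetical helper, and the documents are toy examples.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ["good movie", "bad movie"]  # toy documents
vectors = tfidf_vectors(docs)
print(vectors[0])
```

The word "movie" occurs in both documents and therefore gets a lower weight than "good" or "bad", which each occur in only one.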


Clearly, this is just one out of infinitely many ways to approach this problem.
Consider this: in lab 2 you explored word embeddings. In the playground file, there is an example of how we can project these word embeddings into a 2D space, and suddenly text becomes regular data points! Can you think of a way to use these word embeddings for classification?

Word embeddings represent the meaning of words as vectors. A fixed-length feature vector can be created for each text, for example by averaging the vectors of its words, and then fed into a classifier such as an SVM to classify texts based on their semantic content.
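
One simple way to obtain such a fixed-length vector is to average the embeddings of the words in a text. The sketch below assumes the embeddings are available as a dictionary from word to vector; `average_embedding` is a hypothetical helper and the 2-dimensional vectors are invented for illustration.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical 2-dimensional embeddings for illustration only
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
features = average_embedding(["good", "movie", "unseen"], toy_embeddings, dim=2)
print(features)
```

The resulting vectors have the same length for every text, so they can be passed to any standard classifier in place of the TF-IDF features.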

## Optional task: using a pre-trained transformer model
If you wish to push the accuracy as far as you can, take a look at BERT-based or other pre-trained language models. As a starting point, take a look at a model already fine-tuned on the SST-2 dataset: [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)

**Advanced:**

Going beyond this, you could look into the addition of a *classification head* on top of the pooling layer of a BERT-based model. This is a common approach to fine-tuning these models on classification or regression problems.
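
Conceptually, a classification head is a linear layer applied to the pooled sentence vector, followed by a softmax over the classes. The sketch below shows only that arithmetic in plain Python. In practice the head would be built and trained in a deep learning framework together with the encoder; `classification_head` is a hypothetical helper and all values are placeholders.

```python
import math


def classification_head(pooled, weights, bias):
    """Apply a linear layer and a softmax to a pooled vector."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    # subtract the maximum logit for numerical stability
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


pooled = [0.5, -1.0, 2.0]  # stand-in for the pooled encoder output
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # untrained head: one row per class
bias = [0.0, 0.0]
probs = classification_head(pooled, weights, bias)
print(probs)
```

With all-zero weights the head assigns equal probability to both classes; fine-tuning adjusts the weights (and usually the encoder) so that the probabilities reflect the label.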