#Artificial Intelligence - COMP9414
###Tutorial week 8 - Language Processing

NLP methods applied to a Sentiment Analysis use-case scenario


##Theoretical Background

**Sentiment Analysis** is the task of determining the emotional tone behind a certain text. Sentiment can be classified as either *positive*, *negative* or *neutral* depending on the attitude expressed in the text. The following texts provide examples of positive, negative and neutral sentiment:

*   "This is beautiful! You really did a great job" -- <font color='green'>Positive</font>
*   "Whaat? This is all wrong! I'm not happy with this at all" -- <font color='red'>Negative</font>
*   "I guess it could be worse" -- <font color='orange'>Neutral</font>
*   "Oh yeah, great job breaking my only laptop!" -- <font color='red'>Negative</font>

The last example illustrates a well-known challenge in sentiment analysis, **sarcasm**: the sentence *sounds* positive but is really sarcastic and conveys a negative sentiment. Identifying such cases is the task of **sarcasm detection**.

In this tutorial, we will use a supervised machine learning technique to create an automated sentiment classifier for movie reviews. This will also give us a chance to familiarise ourselves with popular NLP techniques (such as **Regular Expressions** and **Context-free Grammars**) as well as libraries that researchers in the field use to make their job easier (such as **NLTK** and **Scikit-Learn**).


##Part 1 - Grammars

A **formal grammar** is a set of rules that define how to generate and recognise strings that belong to a certain Language *L*. Grammars are designed to formalise the *syntax* of a language (i.e. how to write things that make sense in that language) and give no information on its *semantics* (i.e. what sentences in that language mean).

We can formally define a grammar as follows:

\begin{equation}
G=(V,\Sigma,R,S).
\end{equation}

Where

$V$ is a finite set of _nonterminal symbols_ or _variables_. These are the symbols used in the grammar to denote syntactic categories.

$\Sigma$ is a finite set of _terminal symbols_ which make up the actual content of the generated sentences. This is also known as the *alphabet* of the language generated by $G$.

$R$ is a finite relation in $V\times (V\cup \Sigma )^{*}$, i.e. a set of pairs that each map a variable to a sequence of variables and terminal symbols. These pairs are also known as _Rewrite Rules_ and indicate how grammar variables can be converted into other variables or into terminal symbols.

$S$ is the *Start Symbol* used to represent the whole sentence.

In this section, we will explore some basic generation rules for grammars and use them as a tool to generate training data for our Sentiment Classifier.
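
As a rough illustration of how the four components fit together, the following pure-Python sketch stores the rewrite rules $R$ in a dictionary and enumerates every sentence derivable from the start symbol $S$. The toy grammar and the names `RULES`, `START` and `expand` are assumptions made for this example and are not part of NLTK; the approach only terminates for grammars without recursive rules.

```python
from itertools import product

# Hypothetical toy grammar.
# V = {"S", "H", "N"} (the dictionary keys), Sigma = {"hello", "hi", "Alice", "Bob"}.
RULES = {
    "S": [["H", "N"]],
    "H": [["hello"], ["hi"]],
    "N": [["Alice"], ["Bob"]],
}
START = "S"


def expand(symbol):
    """Return every terminal sequence derivable from `symbol` (non-recursive grammars only)."""
    if symbol not in RULES:
        # Anything without a rewrite rule is treated as a terminal symbol.
        return [[symbol]]
    results = []
    for rhs in RULES[symbol]:
        # Expand each symbol on the right-hand side, then combine the alternatives.
        parts = [expand(s) for s in rhs]
        for combo in product(*parts):
            results.append([token for part in combo for token in part])
    return results


language = expand(START)
print(language)
```

Every key of `RULES` plays the role of a variable in $V$, and every symbol that never appears as a key is a terminal in $\Sigma$.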

__Section 1.a.__ Write a grammar `greetGrammar` that generates the following three strings:
["hello", "hi", "good to see you"]


In [None]:
greetGrammar = """
  S -> "hello" | "hi" | "good to see you"
"""

__Section 1.b.__ Using NLTK, load your grammar and verify that its language corresponds to the given strings

In [None]:
import nltk
from nltk.parse.generate import generate

grammar = nltk.CFG.fromstring(greetGrammar)
print([g for g in generate(grammar)])

[['hello'], ['hi'], ['good to see you']]


__Section 1.c.__ Modify the `greetGrammar` to accept a name after the greeting. The new grammar should generate every greeting string followed by any of these names: ["Alice", "Bob", "Charlie"]. Test your grammar with NLTK to verify that it produces the correct language.

_HINT:_ use a new non-terminal symbol N to indicate the names and combine it with the existing rules

In [None]:
greetGrammarWithNames = """
  S -> "hello" N | "hi" N | "good to see you" N
  N -> "Alice" | "Bob" | "Charlie"
"""
#Alternatively
#greetGrammarWithNames = """
#  S -> H N
#  H -> "hello" | "hi" | "good to see you"
#  N -> "Alice" | "Bob" | "Charlie"
#"""
grammar = nltk.CFG.fromstring(greetGrammarWithNames)
print([g for g in generate(grammar)])

[['hello', 'Alice'], ['hello', 'Bob'], ['hello', 'Charlie'], ['hi', 'Alice'], ['hi', 'Bob'], ['hi', 'Charlie'], ['good to see you', 'Alice'], ['good to see you', 'Bob'], ['good to see you', 'Charlie']]


__Section 1.d.__ We will now use grammars to automatically generate some data for our Sentiment Analysis classifier. Consider the following grammar that generates positive reviews for films:

```
S -> NP VP | PR VPR
NP -> Det N
N -> 'director' | 'screenplay' | 'plot' | 'story' | 'scenes' | 'special effects' | 'costumes' | 'actors' | 'dialogues' | 'characters'

VP -> Verb Adj

VPR -> VerbPR NP

Det -> 'the' | 'this' | 'these' | 'those'

Verb -> 'is' | 'looks' | 'was' | 'are' | 'look' | 'were'

VerbPR -> 'love' | 'loved' | 'enjoy' | 'enjoyed' | 'fell in love with' | 'adore' | 'adored'
PR -> 'I'

Adj -> 'great' | 'cool' | 'amazing' | 'fantastic' | 'very nice'
```

Load the grammar in NLTK and store all generated strings in a variable. Then, print 5 randomly generated strings from the grammar


In [None]:
import random


positive_film_grammar = nltk.CFG.fromstring("""
S -> NP VP | PR VPR
NP -> Det N
N -> 'director' | 'screenplay' | 'plot' | 'story' | 'scenes' | 'special effects' | 'costumes' | 'actors' | 'dialogues' | 'characters'

VP -> Verb Adj

VPR -> VerbPR NP

Det -> 'the' | 'this' | 'these' | 'those'

Verb -> 'is' | 'looks' | 'was' | 'are' | 'look' | 'were'

VerbPR -> 'love' | 'loved' | 'enjoy' | 'enjoyed' | 'fell in love with' | 'adore' | 'adored'
PR -> 'I'

Adj -> 'great' | 'cool' | 'amazing' | 'fantastic' | 'very nice'
""")

positive_reviews = list(generate(positive_film_grammar))
for i in range(5):
  print(random.choice(positive_reviews))

['I', 'loved', 'these', 'costumes']
['this', 'scenes', 'were', 'great']
['these', 'dialogues', 'look', 'great']
['these', 'plot', 'was', 'great']
['I', 'loved', 'the', 'actors']


__Section 1.e.__ You may have noticed that the grammar generates strings that violate singular/plural agreement between determiners, nouns and verbs (e.g. "this scenes were great"). Fix the grammar so that it only generates strings that are grammatically correct.




In [None]:
positive_film_grammar = nltk.CFG.fromstring("""
S -> NPS VPS | NPP VPP | PR VPR
NPS -> DetS NS
NPP -> DetP NP
NS -> 'director' | 'screenplay' | 'plot' | 'story' | 'atmosphere'
NP -> 'scenes' | 'special effects' | 'costumes' | 'actors' | 'dialogues' | 'characters'

VPS -> VerbS Adj
VPP -> VerbP Adj

VPR -> VerbPR NPS | VerbPR NPP
DetS -> 'the' | 'this'
DetP -> 'the' | 'these' | 'those'

VerbS -> 'is' | 'looks' | 'was'
VerbP -> 'are' | 'look' | 'were'

VerbPR -> 'love' | 'loved' | 'enjoy' | 'enjoyed' | 'fell in love with' | 'adore' | 'adored'
PR -> 'I'

Adj -> 'great' | 'cool' | 'amazing' | 'fantastic' | 'very nice'
""")

positive_reviews = [' '.join(s) for s in generate(positive_film_grammar)]
for i in range(5):
  print(random.choice(positive_reviews))

I adored those dialogues
those special effects look fantastic
I love those actors
those costumes were very nice
I love those special effects


__Section 1.f.__ Modify the `positive_film_grammar` to obtain a `negative_film_grammar` that produces negative reviews.

In [None]:
negative_film_grammar = nltk.CFG.fromstring("""
S -> NPS VPS | NPP VPP | PR VPR
NPS -> DetS NS
NPP -> DetP NP
NS -> 'director' | 'screenplay' | 'plot' | 'story' | 'atmosphere'
NP -> 'scenes' | 'special effects' | 'costumes' | 'actors' | 'dialogues' | 'characters'

VPS -> VerbS Adj
VPP -> VerbP Adj

VPR -> VerbPR NPS | VerbPR NPP
DetS -> 'the' | 'this'
DetP -> 'the' | 'these' | 'those'

VerbS -> 'is' | 'looks' | 'was'
VerbP -> 'are' | 'look' | 'were'

VerbPR -> 'hate' | 'hated' | 'do not like' | 'did not enjoy' | 'got bored with' | 'despise' | 'despised'
PR -> 'I'

Adj -> 'mediocre' | 'dull' | 'terrible' | 'boring' | 'lame' | 'dumb'
""")

negative_reviews = [' '.join(s) for s in generate(negative_film_grammar)]
for i in range(5):
  print(random.choice(negative_reviews))

those actors are lame
those actors are dumb
the screenplay is boring
I despised the scenes
the screenplay was mediocre


__Section 1.g.__ Generate a training dataset for the Movie Reviews Sentiment Analysis task. Your dataset should have 1,000 positive reviews obtained by sampling three random utterances from the `positive_reviews` language and concatenating them together, and 1,000 negative reviews obtained by applying the same method to the `negative_reviews` language. Each datapoint should be a tuple T= (utterance, label), where label can be either "neg" or "pos" depending on the sentiment of the generated datapoint.


In [None]:
grammar_training_dataset = []
for i in range(1000):
  positive_utterance = f"{random.choice(positive_reviews)}. {random.choice(positive_reviews)}. {random.choice(positive_reviews)}"
  negative_utterance = f"{random.choice(negative_reviews)}. {random.choice(negative_reviews)}. {random.choice(negative_reviews)}"
  grammar_training_dataset.append((positive_utterance, "pos"))
  grammar_training_dataset.append((negative_utterance, "neg"))

for i in range(5):
  print(random.choice(grammar_training_dataset))

('I despise these actors. the characters were terrible. those dialogues are terrible', 'neg')
('these dialogues were terrible. the atmosphere is mediocre. those actors are boring', 'neg')
('this atmosphere looks cool. the dialogues look great. I adore the plot', 'pos')
('the atmosphere is fantastic. this atmosphere was great. the special effects look cool', 'pos')
('I hated those scenes. the dialogues were terrible. those scenes look dumb', 'neg')


##Part 2 - Data Preparation

A crucial component of every Natural Language Processing application is the *Data Preparation Pipeline*, which converts the data into a format that can be parsed by a statistical machine learning algorithm. This pipeline usually combines at least the following essential steps:

- **Data cleanup** (i.e. removing noisy elements from a dataset such as special characters, abbreviations, URLs, multiple spaces etc.)
- **Word Embedding** (i.e. converting text data into numerical vectors). This can be performed in a number of different ways, but usually involves the creation of a *Vocabulary*, which maps words to unique numerical IDs.
- **Data splitting and collation**, which involves dividing a dataset into different subsets for training, testing and validation of hyperparameters (splitting), as well as combining multiple datapoints into a dense batch for training (collation)
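
To make the idea of a *Vocabulary* more concrete, here is a minimal sketch that assigns a unique integer ID to every word seen in a small set of texts and then encodes a new sentence with those IDs. The helpers `build_vocabulary` and `encode` and the example sentences are hypothetical, written only for this illustration; scikit-learn's vectorizers build their vocabulary internally and use a more careful tokenizer than `str.split`.

```python
def build_vocabulary(texts):
    """Map each distinct lower-cased word to the next free integer ID."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert a text into a list of word IDs, skipping out-of-vocabulary words."""
    return [vocab[word] for word in text.lower().split() if word in vocab]


docs = ["the plot was great", "the actors were dull"]
vocab = build_vocabulary(docs)

print(vocab)
print(encode("the actors were great", vocab))
```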

We will use NLTK's **Movie Review Corpus** as the dataset for our sentiment classifier. This is a well-known corpus for the task, and it can be downloaded and imported directly from a Python script using the **NLTK** library (note that we won't tokenize reviews at this stage, as tokenization will be handled later when building the Machine Learning model). We will use _Regular Expressions_ for the Data Cleanup step, relying on the ```regex``` package, a third-party library whose interface is compatible with Python's built-in ```re``` module. For the other steps, we will rely on **Scikit-Learn**, a general-purpose machine learning library that includes tools for processing textual data.



##2.1 - Regular Expressions

A regular expression (often shortened as regex) is a sequence of characters that specifies a match pattern in text. This sequence is used in combination with a matching algorithm that finds the pattern in a text, possibly replacing it with another piece of text.

There are various implementations of regular expressions, including online portals, UNIX shell commands and libraries for basically every imperative programming language. In this tutorial, we will use the ```regex``` package (a third-party library that is interchangeable with Python's built-in ```re``` module for the patterns used here) and its ```sub``` function, which finds regular expression patterns and replaces every occurrence with a new piece of text.
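
As a quick illustration of matching and replacing, the sketch below uses Python's built-in `re` module (an assumption made to keep the example dependency-free). It shows how the word-boundary anchor `\b` stops a pattern from matching inside longer words. The example sentence is made up for this purpose.

```python
import re

text = "The cat sat in the category of 3 cats and 12 dogs."

# \d+ matches one or more consecutive digits.
numbers = re.findall(r"\d+", text)

# Without word boundaries, "cat" is also replaced inside "category" and "cats".
naive = re.sub(r"cat", "dog", text)

# \b only matches at the edge of a word, so only the standalone "cat" is replaced.
bounded = re.sub(r"\bcat\b", "dog", text)

print(numbers)
print(naive)
print(bounded)
```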

__Section 2.1.a. (familiarising with Regular Expressions)__ Import the ```regex``` library as ```re```, and use its ```re.sub``` function to replace all occurrences of the word ```"men"``` with the word ```"people"``` in the following portion of text (do not replace the string if it is a substring of another word):

```"We hold these truths to be self-evident, that all men are created equal, that all men are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness and mental stability."```

In [None]:
import regex as re

input_text = "We hold these truths to be self-evident, that all men are created equal, that all men are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness and mental stability."

re.sub(r"\bmen\b", "people", input_text)


'We hold these truths to be self-evident, that all people are created equal, that all people are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness and mental stability.'

__Section 2.1.b.__ Load the NLTK Movie Reviews corpus. You will have to download the corpus first with the command:

`nltk.download("movie_reviews")`

and then you will be able to load it by importing movie_reviews from nltk.corpus. Load the corpus utterances and labels in a list of tuples similarly to the dataset created in Section 1.g

In [None]:
nltk.download('movie_reviews')

from nltk.corpus import movie_reviews

nltk_data = []
for file_id in movie_reviews.fileids():
  nltk_data.append((movie_reviews.raw(file_id), movie_reviews.categories(file_id)[0]))

print(len(nltk_data))

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


2000


__Section 2.1.c.__ Using regular expressions, write a function called ```cleanup_review``` which performs the following cleanup operations on your data:

1. Replace all URLs with URLTOKEN
2. Replace all dates with DATETOKEN
3. Remove all non-alphanumerical characters (except for the following: ```["?", "!", ",", ".", ":", ";", "'", "\""]```)
4. Collapse multiple spaces into one space

Test your function with the following string:

```"Hello!! My name is Stefano, I have been a tutor for COMP~9414 since 01/04/2023.    My personal website is http://stefano.com . (Nice to meet you ^__^)"```

should return

```"Hello!! My name is Stefano, I have been a tutor for COMP9414 since DATETOKEN. My personal website is URLTOKEN . Nice to meet you"```

In [None]:
import regex as re

def cleanup_review(review):

  # 1. Replace all URLs with URLTOKEN
  review = re.sub(r'http\S+', 'URLTOKEN', review)

  # 2. Replace all dates with DATETOKEN (YYYY-MM-DD, YYYY/MM/DD, DD-MM-YYYY, DD/MM/YYYY)
  review = re.sub(r"\d{4}-\d{2}-\d{2}", "DATETOKEN", review)
  review = re.sub(r"\d{4}/\d{2}/\d{2}", "DATETOKEN", review)
  review = re.sub(r"\d{2}-\d{2}-\d{4}", "DATETOKEN", review)
  review = re.sub(r"\d{2}/\d{2}/\d{4}", "DATETOKEN", review)

  # 3. Remove all non-alphanumerical characters, keeping ? ! , . : ; ' " and spaces
  review = re.sub(r'[^a-zA-Z0-9,!.\'";:? ]', '', review)

  # 4. Collapse multiple spaces into one space and drop leading/trailing spaces
  review = re.sub(r'\s+', ' ', review).strip()

  return review


cleanup_review("Hello!! My name is Stefano, I have been a tutor for COMP~9414 since 01/04/2023.    My personal website is http://stefano.com . (Nice to meet you ^__^)")

'Hello!! My name is Stefano, I have been a tutor for COMP9414 since DATETOKEN. My personal website is URLTOKEN . Nice to meet you'
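
The four date formats can also be handled by a single pattern. The following is a minimal sketch of that alternative, using Python's built-in `re` module; the names `DATE_PATTERN` and `replace_dates` are hypothetical. Unlike four separate substitutions, this version also accepts mixed separators (e.g. `2023-04/01`) and requires the date to sit on word boundaries, so it is not a drop-in equivalent.

```python
import re

# Either YYYY-MM-DD / YYYY/MM/DD or DD-MM-YYYY / DD/MM/YYYY.
# [-/] accepts either separator; \b stops matches inside longer digit runs.
DATE_PATTERN = re.compile(
    r"\b(?:\d{4}[-/]\d{2}[-/]\d{2}|\d{2}[-/]\d{2}[-/]\d{4})\b"
)


def replace_dates(text):
    """Replace every date-like substring with the placeholder DATETOKEN."""
    return DATE_PATTERN.sub("DATETOKEN", text)


example = replace_dates("Released 2023-04-01, reviewed 01/04/2023.")
print(example)
```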

##2.2 - Data splitting and dataset creation

__Section 2.2.a__ Apply the ```cleanup_review``` function to the NLTK dataset. Then, split it into three subsets:

- train_nltk_data (which should contain the first 85% of the reviews (1-1700))
- test_nltk_data (which should contain reviews 85%~95% (1701-1900))
- valid_nltk_data (which should contain the remaining reviews (1901-2000))

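Remember to shuffle the cleaned data before splitting so that both labels are spread roughly evenly across the three sets (the corpus is loaded with all negative reviews first, so an unshuffled split would leave the test and validation sets with a single label). You can set a seed of 999 to make the shuffle, and therefore the split, reproducible.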

In [None]:
import numpy as np
np.random.seed(999)

cleanup_data = [(cleanup_review(r), l) for r,l in nltk_data]

np.random.shuffle(cleanup_data)

train_nltk_data = cleanup_data[0:int(len(cleanup_data)*0.85)]
test_nltk_data = cleanup_data[int(len(cleanup_data)*0.85):int(len(cleanup_data)*0.95)]
valid_nltk_data = cleanup_data[int(len(cleanup_data)*0.95):]
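
A quick sanity check after splitting is to count the labels in each subset. The sketch below reproduces the 85% / 10% / 5% split on a made-up stand-in dataset (`toy_data` is hypothetical, and Python's `random` module is used instead of NumPy to keep the example dependency-free), then prints the size and label distribution of each subset.

```python
import random
from collections import Counter

# Hypothetical stand-in for the cleaned corpus: 1,000 'neg' items followed by 1,000 'pos' items.
toy_data = [(f"review {i}", "neg" if i < 1000 else "pos") for i in range(2000)]

rng = random.Random(999)  # local generator, so the global random state is untouched
rng.shuffle(toy_data)

n = len(toy_data)
train_end, test_end = int(n * 0.85), int(n * 0.95)
splits = {
    "train": toy_data[:train_end],
    "test": toy_data[train_end:test_end],
    "valid": toy_data[test_end:],
}

for name, subset in splits.items():
    # Counter shows how many 'neg' and 'pos' items ended up in each subset.
    print(name, len(subset), Counter(label for _, label in subset))
```

Because the shuffle is random, the label counts in each subset are only approximately balanced.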

##Part 3 - Machine Learning model

In this last section, we will implement a statistical machine learning model to fit the dataset and correctly predict the sentiment on new reviews. We will use scikit-learn for this section, and experiment with different classes of ML models.

Scikit-learn offers a data structure called the `Pipeline`, which allows us to combine data transformation functions and machine learning algorithms into a single object. We will rely on it for this section, and combine a Support Vector Classifier with a TF-IDF Vectorizer. The former is a type of machine learning classification algorithm that is known to work particularly well with text classification tasks such as Sentiment Analysis. The latter is a type of vectorizer that maps each word in the vocabulary to a numerical ID and weighs it according to how often it appears in the review and in the rest of the corpus: a word receives a high weight when it is frequent in a review but rare in the corpus overall, and a low weight when it appears in almost every review. The `min_df` and `max_df` parameters additionally discard words that appear in too few or too many reviews, since the former will not help the model to generalize to unseen reviews and the latter carry little information about sentiment.
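
As a rough illustration of TF-IDF weighting, the following pure-Python sketch computes weights for three made-up reviews. It assumes the smoothed formula $\mathrm{idf}(w) = \ln\frac{1+n}{1+\mathrm{df}(w)} + 1$, which scikit-learn documents as its default. Scikit-learn additionally normalises each document vector, a step omitted here. The names `docs`, `idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

docs = [
    "great plot great actors",
    "dull plot",
    "great film",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Document frequency: in how many reviews does each word appear at least once?
df = Counter(word for tokens in tokenised for word in set(tokens))


def idf(word):
    """Smoothed inverse document frequency (an assumption of this sketch)."""
    return math.log((1 + n_docs) / (1 + df[word])) + 1


def tfidf(tokens):
    """Weight each word by its count in the review times its idf."""
    counts = Counter(tokens)
    return {word: count * idf(word) for word, count in counts.items()}


weights = [tfidf(tokens) for tokens in tokenised]
for doc, w in zip(docs, weights):
    print(doc, {word: round(value, 2) for word, value in w.items()})
```

In the first review, "actors" occurs once and only in that review, so it outweighs "plot", which also occurs once but is shared with another review.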

**Section 3.a** Create a scikit-learn pipeline comprised of two separate components:
- A TfidfVectorizer with min_df=3 and max_df=0.95 called "vect"
- A LinearSVC classifier with C=1000 called "clf"




In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
  ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
  ('clf', LinearSVC(C=1000, max_iter=10000)),
])

**Section 3.b** Train your pipeline with the `fit` method, giving it a list of training samples and a list of labels associated with those samples. Use your `grammar_training_dataset` for this step



In [None]:
pipeline.fit([d[0] for d in grammar_training_dataset], [d[1] for d in grammar_training_dataset])

**Section 3.c** Run your pipeline on the `test_nltk_data`. Then, evaluate your pipeline by using the `classification_report` function from `sklearn.metrics`. The performance will likely not be very good, because the data generated via grammars is repetitive and covers only a tiny vocabulary compared with real reviews.

In [None]:
from sklearn import metrics

y_true = [t[1] for t in test_nltk_data]
y_predicted = pipeline.predict([t[0] for t in test_nltk_data])

# Print the classification report.
# Classes are reported in sorted label order ('neg' before 'pos'),
# so target_names must list 'negative' first.
print(metrics.classification_report(y_true, y_predicted,
                                    target_names=['negative', 'positive']))

              precision    recall  f1-score   support

    negative       0.69      0.20      0.31       111
    positive       0.47      0.89      0.61        89

    accuracy                           0.51       200
   macro avg       0.58      0.54      0.46       200
weighted avg       0.59      0.51      0.44       200
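
To read the report, it helps to see how its columns are computed. The sketch below derives precision, recall and F1 for one class from true and predicted labels. The helper `precision_recall_f1` and the toy label lists are hypothetical and serve only as an illustration of the standard definitions; they are not scikit-learn's implementation.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as `label`
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as `label`
    fn = sum(t == label and p != label for t, p in pairs)  # missed instances of `label`
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neg"]

for label in ("neg", "pos"):
    p, r, f = precision_recall_f1(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The *support* column of the report is simply the number of true instances of each class in the evaluated set.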



**Section 3.d** Train your pipeline again, this time using your `train_nltk_data` dataset.

In [None]:
pipeline = Pipeline([
  ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
  ('clf', LinearSVC(C=1000, max_iter=10000)),
])

pipeline.fit([d[0] for d in train_nltk_data], [d[1] for d in train_nltk_data])

**Section 3.e** Evaluate the performance of your model using `classification_report`. The new model should be a lot more accurate than the previous one.



In [None]:
from sklearn import metrics

y_true = [t[1] for t in test_nltk_data]
y_predicted = pipeline.predict([t[0] for t in test_nltk_data])

# Print the classification report.
# Classes are reported in sorted label order ('neg' before 'pos'),
# so target_names must list 'negative' first.
print(metrics.classification_report(y_true, y_predicted,
                                    target_names=['negative', 'positive']))

              precision    recall  f1-score   support

    negative       0.88      0.86      0.87       111
    positive       0.83      0.85      0.84        89

    accuracy                           0.85       200
   macro avg       0.85      0.85      0.85       200
weighted avg       0.86      0.85      0.86       200



**Section 3.f** Experiment with your code to familiarise yourself with scikit-learn's suite of classifiers and parameters. Some experiments you may want to run include:
- Training a model on the combination of `grammar_training_dataset` and `train_nltk_data`, and verifying whether it performs better than the model trained on `train_nltk_data` only
- Experimenting with different classifiers -- for example, you may want to try a simple Naive Bayes classifier (note that `GaussianNB` requires dense input, so `MultinomialNB` is easier to use on the sparse TF-IDF matrix), or try some classifiers that usually perform well on this task such as `AdaBoostClassifier` or `RandomForestClassifier`
- Changing the parameters of your classifiers -- for example, try reducing the `C` regularization parameter in your SVC, or increasing it further.

In [None]:
# 3.f (i) grammar_training_dataset and train_nltk_data combined

combined_data = train_nltk_data + grammar_training_dataset

pipeline = Pipeline([
    ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
    ('clf', LinearSVC(C=1000, max_iter=10000)),
])
pipeline.fit([d[0] for d in combined_data], [d[1] for d in combined_data])
y_predicted = pipeline.predict([t[0] for t in test_nltk_data])
# Classes are reported in sorted label order ('neg' before 'pos')
print(metrics.classification_report([t[1] for t in test_nltk_data], y_predicted,
                                    target_names=['negative', 'positive']))

              precision    recall  f1-score   support

    negative       0.89      0.83      0.86       111
    positive       0.80      0.88      0.84        89

    accuracy                           0.85       200
   macro avg       0.85      0.85      0.85       200
weighted avg       0.85      0.85      0.85       200



In [None]:
# 3.f (ii) classifiers experiments

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

classifiers = [
    KNeighborsClassifier(3),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier()
]
for c in classifiers:
  print(f"Training classifier model: {c}")
  # Build a fresh pipeline for each classifier, reusing the same vectorizer settings
  pipeline = Pipeline([
      ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
      ('clf', c),
  ])
  pipeline.fit([d[0] for d in train_nltk_data], [d[1] for d in train_nltk_data])
  y_predicted = pipeline.predict([t[0] for t in valid_nltk_data])
  # Classes are reported in sorted label order ('neg' before 'pos')
  print(metrics.classification_report([t[1] for t in valid_nltk_data], y_predicted,
                                      target_names=['negative', 'positive']))

Training classifier model: KNeighborsClassifier(n_neighbors=3)
              precision    recall  f1-score   support

    negative       0.79      0.69      0.74        55
    positive       0.67      0.78      0.72        45

    accuracy                           0.73       100
   macro avg       0.73      0.73      0.73       100
weighted avg       0.74      0.73      0.73       100

Training classifier model: SVC(C=1, gamma=2)
              precision    recall  f1-score   support

    negative       0.89      0.87      0.88        55
    positive       0.85      0.87      0.86        45

    accuracy                           0.87       100
   macro avg       0.87      0.87      0.87       100
weighted avg       0.87      0.87      0.87       100

Training classifier model: DecisionTreeClassifier(max_depth=5)
              precision    recall  f1-score   support

    negative       0.67      0.53      0.59        55
    positive       0.54      0.69      0.61        45

    accurac