# 1. Introduction

In this notebook we will become familiar with some topics related to preprocessing. You will use existing NLP pipelines to preprocess a dataset for sentiment analysis. The preprocessed dataset will then be used as input for a naive, heuristic-based sentiment analysis.

The dataset we are going to use annotates polarity on a scale from 0 to 4, where 0 denotes extremely negative sentiment and 4 is the most positive. For this lab we simplify the task and turn the 5-way classification into a 2-way classification: labels 0 and 1 become negative (0), labels 3 and 4 become positive (1), and neutral examples (label 2) are discarded.

At the end of the notebook, we will use a Python implementation of easy data augmentation (EDA).

**Goals**:
- Learn to use some of the existing pipelines
  + [**Natural Language Toolkit (NLTK)**](http://www.nltk.org/) 
  + [**SpaCy**](https://spacy.io/)
  + [**Stanford NLP**](https://stanfordnlp.github.io/stanfordnlp/)
  + [**Trankit**](http://nlp.uoregon.edu/trankit)
- Measure the effect of different preprocessing steps on a specific task such as sentiment analysis.
- Learn to do some EDA


# 2. Load data

Let's load the Stanford Sentiment Treebank. The data can originally be downloaded from here: [the train/dev/test Stanford Sentiment Treebank distribution](http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip). But **you don't need to download it!** If you have already copied the ```nlp-app-II``` folder to your ```Colab Notebooks```, you should have the data for this lab in ```nlp-app-II/data/trees```.
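
For orientation, each line of these files holds one sentence as a bracketed parse tree in which every node carries a sentiment label from 0 to 4. The sketch below applies the label lookup and the regular expression used by the loader in this notebook to a single line. The example line is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical line in the bracketed tree format; not taken from the dataset.
line = "(3 (2 A) (3 (2 fun) (2 movie)))\n"

# Same 5-way to 2-way mapping as the loader: 0/1 -> negative, 3/4 -> positive,
# and the neutral label 2 is mapped to None so it can be skipped.
easy_label_map = {0: 0, 1: 0, 2: None, 3: 1, 4: 1}

# The sentence-level label is the digit right after the first "(".
label = easy_label_map[int(line[1])]

# Remove every "(<digit>" opening and every ")" closing,
# then drop the single leading space that is left over.
text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)[1:]

print(label, repr(text))
```

Only the root label is read here; the phrase-level labels inside the tree are thrown away together with the brackets.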

In order to load the data, you will need to mount your Drive folder first and give the notebook access to it. This requires a one-step authentication: when you run the cell below, please follow the instructions.

Once everything is mounted, make sure ```sst_home = 'drive/My Drive/Colab Notebooks/nlp-app-II/data/trees/'``` is the correct path for the data.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load the data
import re
import pandas as pd

# Let's do 2-way positive/negative classification instead of 5-way    
def load_sst_data(path,
                  easy_label_map={0:0, 1:0, 2:None, 3:1, 4:1}):
    data = []
    with open(path) as f:
        for i, line in enumerate(f): 
            example = {}
            example['label'] = easy_label_map[int(line[1])]
            if example['label'] is None:
                continue
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)
    data = pd.DataFrame(data)
    return data

sst_home = 'drive/My Drive/Colab Notebooks/nlp-app-II/data/trees/'
training_set = load_sst_data(sst_home + 'train.txt')
dev_set = load_sst_data(sst_home + 'dev.txt')
test_set = load_sst_data(sst_home + 'test.txt')

print('Training size: {}'.format(len(training_set)))
print('Dev size: {}'.format(len(dev_set)))
print('Test size: {}'.format(len(test_set)))

In [None]:
training_set.head()

# 3. Preprocessing: Tokenization, lemmatization, removing semantically empty stuff
In almost every Natural Language Processing task you come across, you will have to apply a few preprocessing steps to convert the raw input text into a form that your model can read. Text preprocessing can be boiled down to these few simple steps:

1. **Tokenization** - Segmentation of the text into its individual constituent words.
2. **Lemmatization** - The process of mapping all the different forms of a word to its base form (_lemma_).
3. **PoS tagging** - The process of mapping a word to its grammatical category in the sentence.
4. **Stopwords** - Throw away any words that occur too frequently, as their frequency of occurrence will not help in detecting relevant texts (as an aside, also consider throwing away words that occur very infrequently).
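
To make these steps concrete, here is a toy sketch that uses only the Python standard library. It covers tokenization, punctuation removal and stopword removal; lemmatization and PoS tagging are left out because they need a real toolkit. The function name `toy_preprocess` and the tiny stopword list are assumptions made for this illustration.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; real toolkits ship much larger ones.
TOY_STOPWORDS = {"a", "an", "the", "is", "of", "and", "it", "this"}
PUNCTUATION = set(string.punctuation)

def toy_preprocess(sentence):
    # Tokenization: lowercase, then split into word chunks and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    # Remove punctuation tokens.
    tokens = [t for t in tokens if t not in PUNCTUATION]
    # Remove stopwords.
    return [t for t in tokens if t not in TOY_STOPWORDS]

print(toy_preprocess("This is a fun, clever movie!"))
```

On this sentence only `fun`, `clever` and `movie` survive, which are the words that carry the sentiment.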

There are many toolkits in Python that help with preprocessing input text. Four well-known packages are:

- [**Natural Language Toolkit (NLTK)**](http://www.nltk.org/) 
- [**SpaCy**](https://spacy.io/)
- [**Stanford NLP**](https://stanfordnlp.github.io/stanfordnlp/)
- [**Trankit**](http://nlp.uoregon.edu/trankit)


## Exercise 1

Implement the `preprocess` function to preprocess the examples in the pandas dataframe loaded above. The `preprocess` function will add a new column to the input dataframe: `preproc`. The `preproc` column contains the tokenized and cleaned sentences.

The function has to perform the following preprocessing steps:
- Tokenization
- Part-of-speech tagging to keep only content words.
- Remove stopwords.
- Remove punctuation marks (take a look at the `string` package).


In [None]:
import spacy
from spacy.lang.en.examples import sentences 
import string

def preprocess(data, lemmatize=True, remove_stopwords=True, remove_func_words=True):
    ## YOUR CODE HERE
    ## The function will add a "preproc" column to the input dataframe data.
    ## The preproc column contains the tokenized and cleaned sentences.

    return data

In [None]:
preproc_training = preprocess(training_set)

In [None]:
preproc_training.head()

## Visualization of term frequencies

Having preprocessed the input data, we can plot the term frequencies of the top 50 words and compare them with those of the raw text. If the preprocessing worked, the effort has not gone to waste: the stopwords that dominate the raw-text plot should be gone, and the remaining words should look much more meaningful.
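
The counting behind such a plot can be sketched with the standard library alone. The three sentences below are made up for illustration; the notebook cells further down do the same counting with pandas on the real data.

```python
from collections import Counter

# Hypothetical raw sentences, whitespace-tokenized as in the plotting cells.
toy_sentences = ["the film is fun", "the cast is the best", "the end"]

# Count every token across all sentences.
counts = Counter(word for sentence in toy_sentences for word in sentence.split())

# Keep the most frequent terms (top 2 here, top 50 in the plots).
top_terms = counts.most_common(2)
print(top_terms)
```

Even in this tiny sample the most frequent terms are stopwords (`the`, `is`), which is the pattern to look for in the raw-text plot.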

In [None]:
# Plotly imports
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from matplotlib import pyplot as plt
%matplotlib inline

def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)

## Exercise 2
Plot different term-frequency barplots with and without preprocessing. Do you see any differences? What happened?

In [None]:
# Code to plot raw text.

enable_plotly_in_cell()

# word frequencies
all_words = preproc_training['text'].str.split(expand=True).unstack().value_counts()

data = [go.Bar(
            x = all_words.index.values[0:50],
            y = all_words.values[0:50],
            marker= dict(colorscale='Jet',
                         color = all_words.values[0:50]
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Top 50 (raw data) Word frequencies in the training dataset'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')

In [None]:
# Code to plot preprocessed text
 
enable_plotly_in_cell()

# word frequencies
all_words = preproc_training['preproc'].str.split(expand=True).unstack().value_counts()

data = [go.Bar(
            x = all_words.index.values[0:50],
            y = all_words.values[0:50],
            marker= dict(colorscale='Jet',
                         color = all_words.values[0:50]
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Top 50 (cleaned) Word frequencies in the training dataset'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-bar')

# 4. Naive sentiment analysis

The __semantic orientation__ method of [Turney and Littman 2003](http://doi.acm.org/10.1145/944012.944013) is a method for automatically scoring words along some single semantic dimension like sentiment. It works from a pair of small seed sets of words that represent two opposing points on that dimension.

We can extend this idea to calculate the polarity of a sentence, by aggregating the score of each word in the sentence. Your goal in this section is to use the semantic model to obtain the aggregated polarity score of the sentence. 
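
As a warm-up, the word-level score can be sketched on toy data: a word's orientation is its summed cosine similarity to the positive seeds minus its summed cosine similarity to the negative seeds. The 2-dimensional vectors and the names `toy_vectors`, `cosine` and `orientation` are assumptions made for this illustration; the notebook itself works with GloVe embeddings.

```python
import math

# Hypothetical 2-d "embeddings"; real embeddings have many more dimensions.
toy_vectors = {
    "good": (1.0, 0.0),
    "bad": (0.0, 1.0),
    "nice": (0.8, 0.6),
    "nasty": (0.6, 0.8),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def orientation(word, seed_pos, seed_neg):
    # Out-of-vocabulary words get a neutral score.
    if word not in toy_vectors:
        return 0.0
    vec = toy_vectors[word]
    pos_sim = sum(cosine(vec, toy_vectors[s]) for s in seed_pos)
    neg_sim = sum(cosine(vec, toy_vectors[s]) for s in seed_neg)
    return pos_sim - neg_sim

for word in ["nice", "nasty", "unknown"]:
    print(word, orientation(word, ["good"], ["bad"]))
```

A positive score means the word sits closer to the positive seeds, a negative score means it sits closer to the negative ones.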

## Helper functions

In [None]:
import numpy as np

def read(file, threshold=0, dim=50, vocabulary=None):
    count = 400000 if threshold <= 0 else min(threshold, 400000)
    words = []
    matrix = np.empty((count, dim)) if vocabulary is None else []
    for i in range(count):
        word, vec = file.readline().decode('utf-8').split(' ', 1)
        if vocabulary is None:
            words.append(word)
            matrix[i] = np.fromstring(vec, sep=' ')
        elif word in vocabulary:
            words.append(word)
            matrix.append(np.fromstring(vec, sep=' '))
    return (words, matrix) if vocabulary is None else (words, np.array(matrix))

In [None]:
def length_normalize(matrix):
    norms = np.sqrt(np.sum(matrix**2, axis=1))
    norms[norms == 0] = 1
    return matrix / norms[:, np.newaxis]

In [None]:
def determine_coefficient(candidate_word, seed_pos, seed_neg):
    if candidate_word not in word2ind:
        return 0.0
    pos_ind = np.array([word2ind[word] for word in seed_pos])
    pos_mat = matrix[pos_ind]

    neg_ind = np.array([word2ind[word] for word in seed_neg])
    neg_mat = matrix[neg_ind]

    i = word2ind[candidate_word]

    pos_sim = np.sum(matrix[i].dot(pos_mat.T))
    neg_sim = np.sum(matrix[i].dot(neg_mat.T))

    return pos_sim - neg_sim



## Set up sentiment model

In [None]:
import bz2

# Read input embeddings
glove_home = 'drive/My Drive/Colab Notebooks/2020-2021_labs/data/embeddings/glove.6B.50d.txt.bz2'
embsfile = bz2.open(glove_home)
words, matrix = read(embsfile)

# Length normalize embeddings so their dot product effectively computes the cosine similarity
matrix = length_normalize(matrix)

# Build word to index map
word2ind = {word: i for i, word in enumerate(words)}

In [None]:
seed_pos = ['good', 'great', 'awesome', 'like', 'love']
seed_neg = ['bad', 'awful', 'terrible', 'hate', 'dislike']

In [None]:
print(determine_coefficient('abhorrent', seed_pos, seed_neg))
print(determine_coefficient('vacations', seed_pos, seed_neg))
print(determine_coefficient('hunger', seed_pos, seed_neg))

## Apply sentiment analysis

## Exercise 3
- Build a sentiment analysis model using the `determine_coefficient` function, aggregating the score of each word in the input sentence (preprocessed or not).

- You have to complete the code for `predict_sentiment`, which takes the preprocessed dataframe as input and returns the predicted sentiments as well as the gold sentiments.

In [None]:
def predict_sentiment(data):
    ## YOUR CODE HERE

    return sentiments, gold_sentiments


In [None]:
from sklearn.metrics import accuracy_score

pred_sentiments, gold_sentiments = predict_sentiment(preproc_training)
accuracy_score(gold_sentiments, pred_sentiments)

# 5. EDA: Easy Data Augmentation 
We will use the [EDA package](https://github.com/jasonwei20/eda_nlp) available on GitHub. This code applies some transformations to the input sentence to obtain extra examples that are similar but not identical:

- **Synonym Replacement (SR)**: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
- **Random Insertion (RI)**: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
- **Random Swap (RS)**: Randomly choose two words in the sentence and swap their positions. Do this n times.
- **Random Deletion (RD)**: For each word in the sentence, randomly remove it with probability p.

Note that these transformations are useful for text classification tasks such as sentiment analysis. 
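
The two operations that need no synonym dictionary, random swap and random deletion, can be sketched in a few lines. This is an illustrative sketch and not the code of the `eda_nlp` package; the function names, the `rng` argument and the fallback for an empty result are assumptions.

```python
import random

def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word independently with probability p."""
    words = list(words)
    if not words:
        return words
    kept = [w for w in words if rng.random() >= p]
    # Assumption: never return an empty sentence; fall back to one random word.
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sentence = "a fun and clever movie".split()
print(random_swap(sentence, n=1, rng=rng))
print(random_deletion(sentence, p=0.2, rng=rng))
```

Both operations keep the label of the original sentence, which is why they are cheap ways to create additional training examples.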

In [None]:
!git clone https://github.com/jasonwei20/eda_nlp.git

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
!python eda_nlp/code/augment.py --input=eda_nlp/data/sst2_train_500.txt

### Exercise 4
- Inspect and analyse the `eda_nlp/data/eda_sst2_train_500.txt` file. Can you identify an example of each transformation?

### Exercise 5
In this exercise we are going to run some sentiment analysis experiments to see whether EDA helps when we only have a few annotated examples to train on.

In order to simulate a scenario with little annotated data, we are going to take the following steps:

- Create a small dataset of 100 examples from the SST dataset.
- Train and evaluate the baseline model.
- Augment the training set with EDA (TODO).
- Train and evaluate a new model on the augmented dataset (TODO).

#### Create small dataset of sentiment analysis


In [None]:
# Create small dataset of sentiment analysis
positive_examples = training_set[training_set.label == 1].sample(50)
negative_examples = training_set[training_set.label == 0].sample(50)

small_training = pd.concat([positive_examples, negative_examples], axis=0).sample(frac=1)
small_training.to_csv("small_sentiment_training_set.txt", sep="\t",
                      columns=["label", "text"], header=False, index=False)

#### Train and evaluate
This means: preprocess the data, define the model, and run the model on the dataset.

In [None]:
import tensorflow as tf

# prepare dataset (train/dev)
vocab = small_training['text'].str.split(expand=True).unstack().value_counts()
max_features = len(vocab)
sequence_length = 40
batch_size = 32

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

raw_train_ds = tf.data.Dataset.from_tensor_slices((small_training.text, small_training.label))
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

raw_dev_ds = tf.data.Dataset.from_tensor_slices((dev_set.text, dev_set.label))

train_ds = raw_train_ds.batch(batch_size).map(vectorize_text)
dev_ds = raw_dev_ds.batch(batch_size).map(vectorize_text)


dataset = raw_train_ds.batch(batch_size)
text_batch, label_batch = next(iter(dataset))
first_review, first_label = text_batch[0], label_batch[0]
print("Review", first_review)
print("Label", first_label)
print("Vectorized review", vectorize_text(first_review, first_label))
print("Vocab size", max_features)

Define the model for text classification with Keras

In [None]:
import tensorflow as tf

embedding_dim = 300
model = tf.keras.Sequential([
  tf.keras.layers.Embedding(max_features + 1, embedding_dim),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.GlobalAveragePooling1D(),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(1)])

model.summary()

Fit the model and evaluate on validation

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])

# train_ds is already batched above, so it is passed to fit() as-is
model.fit(train_ds, epochs=20, validation_data=dev_ds)

#### Apply EDA on the small dataset

__TODO__: Run the EDA script on the newly created small dataset ("small_sentiment_training_set.txt") and create an augmented dataset.

You can run the following command to see the options for EDA.
```
! python eda_nlp/code/augment.py --help
```

In [None]:
# ADD YOUR CODE HERE: Create the augmented dataset


__TODO__: Load the dataset, prepare the training set, define the model and train it (you can repeat the code used above).

__TODO__: Check if the augmented data improves the results
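
As a hint for the loading step, a tab-separated file of labels and sentences can be read with the standard library as sketched below. The sketch assumes the augmented file keeps the same `label<TAB>text` layout as "small_sentiment_training_set.txt"; check this against the file you generate. The in-memory content stands in for a real file and is made up.

```python
import csv
import io

# Hypothetical file contents: one "label<TAB>text" pair per line.
toy_file = io.StringIO("1\ta fun movie\n0\ta dull plot\n")

labels, texts = [], []
# QUOTE_NONE keeps any quotation marks inside the sentences untouched.
for label, text in csv.reader(toy_file, delimiter="\t", quoting=csv.QUOTE_NONE):
    labels.append(int(label))
    texts.append(text)

print(labels, texts)
```

With a real file, replace `toy_file` with an object returned by `open(path, newline="")`.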

In [None]:
# ADD YOUR CODE HERE:
import tensorflow as tf

# load dataset

# prepare new training set (and dev set)


__TODO__: Define the model and fit it using the augmented dataset (you can repeat the code used for the baseline model):

In [None]:
# ADD YOUR CODE HERE:
import tensorflow as tf

# model definition

# fit the model