
# Task: sentiment classification

The task is to classify one-sentence long movie reviews/opinions according to the sentiment they express. There are only two categories: positive and negative sentiment.


> "Data source: [UMICH SI650 - Sentiment Classification](https://www.kaggle.com/c/si650winter11/data)

> Training data: 7086 lines.
  
> Format: 1|0 (tab) sentence

> Test data: 33052 lines, each contains one sentence.

> The data was originally collected from opinmind.com (which is no longer active)."

The data is in the file "sentiment.tsv".

## Download/install necessary components/data

In [None]:
! python -m spacy download en_core_web_sm
! pip install wordcloud
! wget "https://drive.google.com/uc?export=download&id=19NUVV29Pq-j2WrNBYf6WRD8or7SOUHp2" -O sentiment.tsv

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
--2021-07-16 07:52:55--  https://drive.google.com/uc?export=download&id=19NUVV29Pq-j2WrNBYf6WRD8or7SOUHp2
Resolving drive.google.com (drive.google.com)... 172.217.218.138, 172.217.218.102, 172.217.218.113, ...
Connecting to drive.google.com (drive.google.com)|172.217.218.138|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0k-3o-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/miu7v9je13t94ufa2srb63g817vp3sub/1626421950000/10227734428265054086/*/19NUVV29Pq-j2WrNBYf6WRD8or7SOUHp2?e=download [following]
--2021-07-16 07:52:56--  https://doc-0k-3o-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg

# Loading the data

In [None]:
import pandas as pd

df = pd.read_csv('sentiment.tsv', sep='\t',
                 quoting=3, # Quotes are _never_ field separators
                 header=None)

df.head()

In [None]:
df = df[[1,0]] # rearrange columns

df.rename(columns={1:"text", 0:"sentiment"}, inplace=True) # rename columns

df.head()

# Splitting into train, validation and test

Before doing anything else (!), we divide our data into train, validation and test parts.

In [None]:
# Import the necessary function from Scikit
from ...

# Note that the function can only split the data into two parts,
# so our best option is to call it twice in a chain.
# Don't forget to fix the random seed as well, e.g. to 13, since that is a lucky number! :-)
# Try to make sure that the class proportions are the same in all three splits!
df_train, df_test_valid = ...

df_test, df_valid = ...

assert len(df_train)==5668 and len(df_valid)==709 and len(df_test)==709
print(len(df_train), len(df_valid), len(df_test))
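
The chained, stratified split described in the comments above can be sketched on toy data. This is only an illustration, not the handout's solution: the `toy` DataFrame and the 80/10/10 ratios are assumptions (the ratios are inferred from the sizes in the assertion), and `stratify` is the `train_test_split` parameter that keeps class proportions equal across the parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (assumption): 60 positive and 40 negative rows
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(100)],
    "sentiment": [1] * 60 + [0] * 40,
})

# First call: 80% train, 20% held out, stratified on the label
toy_train, toy_rest = train_test_split(
    toy, test_size=0.2, random_state=13, stratify=toy["sentiment"])

# Second call: cut the held-out part in half to get test and validation
toy_test, toy_valid = train_test_split(
    toy_rest, test_size=0.5, random_state=13, stratify=toy_rest["sentiment"])

print(len(toy_train), len(toy_valid), len(toy_test))
# Share of positive rows in each part; stratification keeps them equal
print(toy_train["sentiment"].mean(),
      toy_valid["sentiment"].mean(),
      toy_test["sentiment"].mean())
```

Stratifying the second call on the labels of the held-out part (not of the full data) matters, because `stratify` must have one entry per row being split.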

In [None]:
## Just to check class proportions.
print("Classes (%):")
pd.concat([df_train["sentiment"].rename("train").value_counts() / len(df_train) * 100,
          df_valid["sentiment"].rename("valid").value_counts() / len(df_valid) * 100,
          df_test["sentiment"].rename("test").value_counts() / len(df_test) * 100,], axis=1).sort_index().round(2)

# Inspecting the data

In [None]:
df_train.describe()

We can examine the lengths of sentences as well.

In [None]:
n_chars = df_train.text.str.len()  # number of characters per sentence

n_chars.describe()

The first sentence with the maximal length:

In [None]:
long_sentence = df_train.loc[n_chars.idxmax(), "text"]
long_sentence

# Extra task: Let's do a word cloud!

Let us visualize the sentences, both all together and separately by category!

Tool: https://github.com/amueller/word_cloud


Good example: https://github.com/amueller/word_cloud/blob/master/examples/simple.py



In [None]:
# Helper function for displaying a word cloud
# Input: one _UNIFIED_, space-separated string!
# Protip: https://www.tutorialspoint.com/python/string_join.htm
def do_wordcloud(text, figsize=(15, 10)):
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # Generate the word cloud image once, with a lowered max_font_size
    wordcloud = WordCloud(max_font_size=40).generate(text)

    # Display the generated image the matplotlib way
    plt.figure(figsize=figsize)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
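
The helper expects one space-separated string, so a text column has to be joined first. A minimal sketch of the two ingredients (joining a column, and selecting one class with a boolean mask) on a toy DataFrame; the `toy` data and variable names are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-in for the training data (assumption)
toy = pd.DataFrame({
    "text": ["I love it", "I hate it", "so good", "so bad"],
    "sentiment": [1, 0, 1, 0],
})

# A boolean mask keeps only the rows of one class
toy_negative = toy[toy["sentiment"] == 0]

# str.join concatenates all sentences into one space-separated string
all_text = " ".join(toy["text"])
negative_text = " ".join(toy_negative["text"])

print(toy_negative.shape)
print(negative_text)
```

A string built this way can then be passed to `do_wordcloud`.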


In [None]:
### TASK !!! ####
# Put the word cloud here!




In [None]:
### TASK !!! ####
# Here only the cloud for sentences with negative sentiment!
# Help: the shape of the DataFrame with only the negative sentences is: (2975, 2)
# Source: https://pandas.pydata.org/pandas-docs/stable/indexing.html



# Bag of words (BoW) representation of the texts

We will represent each text as a (sparse) vector of lemma (dictionary form) counts, restricted to lemmas that are frequent in the training data.

For tokenization and lemmatization we use [spaCy](https://spacy.io/), an open source Python NLP library, which can produce a list of unique lemma ids from the text.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# We only need tokenization and lemmatization, so the parser and NER components are disabled.

spaCy can produce Doc objects from texts. A Doc contains the linguistic analysis of its text, including the lemmas and their unique spaCy string ids.

In [None]:
doc = nlp(long_sentence)
type(doc)

In [None]:
print([token.lemma_ for token in doc ]) # Lemmas

In [None]:
print([token.lemma for token in doc]) # The corresponding unique IDs

Now we have to convert these lists into BoW vectors. We could "roll our own", but, fortunately, scikit-learn has a feature extractor doing exactly that, the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer), so, for the sake of simplicity, we will use that along with spaCy.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def spacy_lemmatizer(s):
    return [token.lemma for token in nlp(s)]

cv = CountVectorizer(analyzer=spacy_lemmatizer, #spaCy for analysis
                     min_df= 0.001) # We ignore the lemmas with low document frequency
cv
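
A float `min_df` is interpreted as a proportion of documents, not as an absolute count. The following sketch, on four made-up documents with scikit-learn's default tokenizer, shows the effect; the documents and the threshold `0.5` are assumptions chosen so that the filtering is visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four made-up documents (assumption, for illustration only)
docs = ["good movie", "good plot", "bad movie", "good acting"]

# No filtering: every term enters the vocabulary
cv_all = CountVectorizer()
cv_all.fit(docs)

# min_df=0.5 keeps only terms that occur in at least half of the documents
cv_frequent = CountVectorizer(min_df=0.5)
cv_frequent.fit(docs)

all_terms = sorted(cv_all.vocabulary_)
frequent_terms = sorted(cv_frequent.vocabulary_)

print(all_terms)
print(frequent_terms)
```

With only a handful of documents, a threshold as small as `0.001` filters nothing; it starts to matter once the vectorizer is fitted on thousands of sentences.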

In [None]:
sents = ["I hate this movie.", "The movie is the worst I've seen."]
bows = cv.fit_transform(sents)
# A CountVectorizer produces a sparse matrix, we convert to ndarray to inspect it.
# The rows are the sentences, the columns are the features: in our case, lemmas.
pd.DataFrame(bows.toarray(), columns=[nlp.tokenizer.vocab.strings[k] for k in cv.get_feature_names_out()])

In [None]:
### We will be using spacy's tokenizer, but just for comparison,
### here is the scikit-only bow representation for the same sentences:
for stop_words in [None, "english"]:
    print(f"Not spacy, but sklearn tokenization; stop_words: {stop_words}:")
    cv2 = CountVectorizer(analyzer="word", # scikit-learn's built-in word tokenizer
                        stop_words=stop_words,
                        min_df= 0.001) # We ignore the tokens with low document frequency
    bows2 = cv2.fit_transform(sents)
    # A CountVectorizer produces a sparse matrix, we convert to ndarray to inspect it.
    # The rows are the sentences, the columns are the features: in this case, lowercased word tokens.
    display(pd.DataFrame(bows2.toarray(), columns=cv2.get_feature_names_out()))
    print()

Using the CountVectorizer, we convert the text columns of our train, validation and test data into three sparse matrices.

In [None]:
bows_train = cv.fit_transform(df_train.text)
bows_train.sort_indices() # comes from TF2.0 sparse implementation, obscure requirement
bow_length = bows_train.shape[1]  ## the number of features (lemmas) used
print("BoW length:", bow_length)
bows_train
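
What `sort_indices()` does to a CSR matrix can be seen on a tiny hand-built example. This is a sketch with made-up numbers; it only illustrates the SciPy operation and makes no claim about why the downstream library needs it.

```python
import numpy as np
from scipy.sparse import csr_matrix

# One row with three stored values whose column indices are deliberately unsorted
data = np.array([1, 2, 3])
indices = np.array([2, 0, 1])
indptr = np.array([0, 3])
m = csr_matrix((data, indices, indptr), shape=(1, 3))

dense_before = m.toarray()

# Reorders the stored entries in place so that column indices ascend within each row
m.sort_indices()

print(m.indices, m.data)
dense_after = m.toarray()
```

The matrix content is unchanged; only the internal storage order of the entries differs.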

In [None]:
## not necessary for the model, just so that we can check some encodings:
# Invert the vocabulary: column index -> spaCy lemma hash
cv_key_lemmahash_dict = {idx: lemma_hash for lemma_hash, idx in cv.vocabulary_.items()}
# Column index -> lemma string, looked up in spaCy's string store
cv_key_lemmastring_dict = {idx: nlp.tokenizer.vocab.strings[lemma_hash] for idx, lemma_hash in cv_key_lemmahash_dict.items()}

In [None]:
idx = 0
print(f"The sentence at index {idx}:")
print(df_train["text"].iloc[idx], "\n")

print("Its cv representation in the sparse matrix (index: (document, term), values: occurrences):")
tmp = bows_train[idx,:]
print()
print(tmp, "\n")

## using the dict we defined:
print("CountVectorizer-encoded terms turned into the strings they encode:")
tmps = pd.Series(tmp.toarray()[0][tmp.indices], index=[cv_key_lemmastring_dict[k] for k in tmp.indices])
print(tmps, "\n")

In [None]:
## Just a little insight into spacy lemmatizer:
print("'Awesome' lemmas:")
print([k for k in cv_key_lemmastring_dict.values() if k.lower() == "awesome"], "\n")
wholetext = df["text"].str.cat(sep=" ")
for k in ["Awesome", "awesome", "AWESOME"]:
  print(k, "in a sentence:", k in wholetext)

In [None]:
bows_valid = cv.transform(df_valid.text)
bows_valid.sort_indices() # comes from TF2.0 sparse implementation, obscure requirement
bows_test = cv.transform(df_test.text)
bows_test.sort_indices() # comes from TF2.0 sparse implementation, obscure requirement

# Task: The model

We build a feed-forward neural network in Keras for our binary classification task, which will be trained with cross-entropy loss and minibatch SGD.
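
To make the loss concrete before building the network, here is a NumPy sketch of a two-class softmax and of the cross-entropy computed from integer labels. It is an illustration with made-up logits, not Keras code; the function names are chosen to mirror the Keras loss name but are defined locally.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability; the result is unchanged
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    # Negative log-probability of the true class, labels given as integer ids
    return -np.log(probs[np.arange(len(labels)), labels])

# Made-up outputs of the last layer for two examples (assumption)
logits = np.array([[2.0, 0.0], [0.0, 0.0]])
labels = np.array([0, 1])

probs = softmax(logits)
losses = sparse_categorical_crossentropy(probs, labels)

# With two classes, the softmax probability of class 1 equals a sigmoid
# applied to the difference of the two logits
sigmoid_of_difference = 1.0 / (1.0 + np.exp(-(logits[:, 1] - logits[:, 0])))

print(probs)
print(losses)
```

The last comparison is why a two-unit softmax output and a single sigmoid unit are interchangeable ways to model a binary decision.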

In [None]:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
import tensorflow as tf

## set random seed for reproducibility:
tf.random.set_seed(42)

## clear session: important if we retrain the model with different hyperparams
tf.keras.backend.clear_session()


# USE KERAS FUNCTIONAL API if possible, or SEQUENTIAL API if you prefer

# Parameters
############

hidden_size = 100

# Model
#######
## Create an "empty" model when using Sequential API.
## Don't forget to import the class you need to do this!

# Define (instantiate) the input layer if you use Functional API.
# Give the shape parameter the length of a BoW vector as length (hint: you can use bows_train's shape...)
# WARNING: shape only accepts a tuple, even if it is one dimensional
# (do not forget the comma after the single number in that case)!


# Hidden layer
##############
# Define a fully connected hidden layer that can be modified by the parameters above.
# Use the ReLU activation function.
# Give the inputs to the hidden layer if you use keras functional API,
# or pay attention to giving the input shape (specified above) when using sequential API.
# Please be aware that in Keras functional API, the parameters defining the layer are
# "instantiation" parameters, but the input of the layer is already a "function call" parameter!
# (The magic lies in the brackets... )



# Softmax
#########
# Define the output, softmax (!) layer.
# (Which is a fully connected layer with activation accordingly...)
# Please remember, we have exactly two classes!
# (We choose to use this generalized, Softmax approach...)
# We feed the layer with the output of the hidden one in functional API.


# Whole model
##############
# Nothing more is left than to instantiate the model when using functional API.
# Please ensure that the input and the output are right!


# Optimization
##############
# For now, we stick to the basic SGD with a relatively large learning rate...but experiment with others!
optimizer = SGD(learning_rate=0.1)


# Compilation and training
##########################

model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy', # use this cross-entropy variant
                                                      # since the labels are integer class ids, not one-hot vectors
              metrics=['accuracy']) # We measure and print accuracy during training

In [None]:
## print out a summary of the model


# Training

In [None]:
epochs = 10  # no trailing comma: "10," would create the tuple (10,)
batch_size = 200

history = model.fit(x=bows_train,
          y=df_train.sentiment.values,
          validation_data=(bows_valid, df_valid.sentiment.values),
          epochs=epochs,
          batch_size=batch_size)

# Please don't just run the code, understand it!
# Experiment with other hyperparameter setups for fitting! (Caution: always re-create your model before that, which includes clearing the session!)

In [None]:
## Run the code and interpret the plots.

historydf = pd.DataFrame(history.history)

historydf[["loss", "val_loss"]].plot(title="loss");

historydf[["accuracy", "val_accuracy"]].plot(title="accuracy");

# Prediction

In [None]:
print("=== INTERACTIVE DEMO ===")
while True:
    s = input("Enter a short text to evaluate or press return to quit: ")
    if s == "":
        break
    else:
        ## using count vectorizer to transform input string:
        bow = cv.transform([s])
        ## use model to predict:
        prob_pred = model.predict(bow[0])
        print(f"Positive vs negative sentiment probability: {prob_pred[0,1]} vs {prob_pred[0,0]}\n")