<a href="https://colab.research.google.com/github/oztuka/IS584/blob/main/Sentiment_Prediction_%C3%96zkan_Tu%C4%9Fberk_Kartal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="margin-bottom:0">IS 584: Deep Learning for Text Analytics</h1>
<h3 style="margin-top:0">Lab 2: Sentiment Prediction with Multinomial Naive Bayes</h3>
<h4 style="margin-top:0">Given by Volga Sezen</h4>
<i>Originally created by Özgün Ozan Kılıç.</i>
<br>
<br>

In this notebook, we will look at how we can predict the sentiment of a text by training some machine learning models from scratch.

-----

## Before we start<a id="start"></a>

We will use Pandas and NLTK packages for data manipulation and NLP. In this notebook, you are assumed to know the basics. If your environment does not have the necessary packages or the datasets, you can run this cell:

In [None]:
# Installing necessary packages
# !pip install pandas
# !pip install matplotlib
# !pip install scikit-learn
# !pip install numpy
#!pip install nltk
#!pip install gensim
!pip install contractions

# Installing some datasets for NLTK:
from nltk import download
download("popular") # Popular datasets
download('tagsets') # Tagsets for POS tagging
download('punkt_tab')
download('averaged_perceptron_tagger_eng')



[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

Executing the following command will download the dataset we will work with. Since even the zip is kinda big, I will upload it to ODTUClass in case this link breaks.

In [None]:
!wget https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
!unzip trainingandtestdata.zip

## Importing the data<a id="data"></a>

We will use the [Sentiment140](https://www.kaggle.com/kazanova/sentiment140) dataset that has 1.6 million tweets, labeled by the existence of positive or negative emoticons. It has its limitations, but it should suffice for learning purposes. You can check the corresponding paper, [Twitter Sentiment Classification using Distant Supervision](http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf), for more information. Download the dataset and put it in the same folder as this notebook for convenience. We will use Pandas for data manipulation.

In [None]:
import pandas as pd

# This option will automatically set the column width when we display data:
pd.set_option("display.max_colwidth", 0)

# Importing it using Pandas. Change the path as necessary. Note that we need to
# specify the encoding as "latin-1" and we need to provide the headers since the
# dataset itself does not have them.

# For Colab
#dataset = pd.read_csv("/content/gdrive/MyDrive/Deep Learning for TA/Sentiment Prediction/sentiment140_dataset.csv", encoding="latin-1", names=["sentiment", "id", "date", "tweet_source", "username", "tweet"])

# Downloading from the link
dataset = pd.read_csv("training.1600000.processed.noemoticon.csv", encoding="latin-1", names=["sentiment", "id", "date", "tweet_source", "username", "tweet"])
dataset.head()

Unnamed: 0,sentiment,id,date,tweet_source,username,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."


We will not use most of the columns here, so we can drop them. Also, this dataset's sentiment labels are 0 (negative) and 4 (positive). We can map these to a more familiar range like -1 and 1.

In [None]:
# Dropping these columns:
dataset.drop(columns=["id", "date", "tweet_source", "username"], inplace=True)

# Min-max rescaling:
# https://en.wikipedia.org/wiki/Feature_scaling#Rescaling_(min-max_normalization)

# These are already known but it is still good practice:
dataset_min = min(dataset["sentiment"])
dataset_max = max(dataset["sentiment"])

# The range we want:
new_min = -1
new_max = 1

# Applying the formula:
dataset["sentiment"] = new_min + ((dataset["sentiment"] - dataset_min) * (new_max - new_min) / (dataset_max - dataset_min))

# Displaying five rows for both sentiment labels:
dataset.groupby("sentiment", as_index=False).head()

Unnamed: 0,sentiment,tweet
0,-1.0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,-1.0,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,-1.0,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,-1.0,my whole body feels itchy and like its on fire
4,-1.0,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."
800000,1.0,I LOVE @Health4UandPets u guys r the best!!
800001,1.0,im meeting up with one of my besties tonight! Cant wait!! - GIRL TALK!!
800002,1.0,"@DaRealSunisaKim Thanks for the Twitter add, Sunisa! I got to meet you once at a HIN show here in the DC area and you were a sweetheart."
800003,1.0,"Being sick can be really cheap when it hurts too much to eat real food Plus, your friends make you soup"
800004,1.0,@LovesBrooklyn2 he has that effect on everyone


Since our dataset is a bit large, it will take some time to process it. If you would like to sample the dataset to speed things up, run the cell below:

In [None]:
# Fraction to drop:
frac_to_drop = 0.99

# Dropping negative and positive tweets separately to strictly preserve the
# ratio (random_state is set to get the same sample every time):
dataset = dataset.drop(dataset[dataset.sentiment == -1].sample(frac=frac_to_drop, random_state=42).index)
dataset = dataset.drop(dataset[dataset.sentiment == 1].sample(frac=frac_to_drop, random_state=42).index)

## Preprocessing<a id="preprocessing"></a>

Now, we need to process the tweets before we vectorize them. We will first tokenize the tweets into sentences and then into words. For each sentence, we will POS-tag the word tokens and lemmatize them using their POS tags. We will also handle negation and remove certain tokens to reduce our vector size. We will use NLTK's tweet tokenizer for word tokenization. The rest of the processes will be handled using NLTK as well.

In [None]:
import nltk.data
from nltk.tokenize import TweetTokenizer
from functools import reduce
import operator
from nltk import pos_tag_sents
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.sentiment.util import mark_negation
from nltk.corpus import stopwords
import contractions
import re
import sys


# This tokenizer will guess where the sentence ends instead of tokenizing
# sentences by simply looking at sentence terminators.
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# This tokenizer will tokenize tweets. "preserve_case" parameter can be used to
# preserve cases or make it all lowercase. "reduce_len" parameter shortens
# consecutive character repetitions to at most three consecutive repetitions to
# reduce noise.
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)

# This function takes the POS tag and returns it in the correct form for
# lemmatization:
def get_lemmatizer_pos(pos):
    pos_start = pos[0] # Takes the first letter to simplify the POS tag
    if pos_start == "J":
        return wn.ADJ
    elif pos_start == "V":
        return wn.VERB
    elif pos_start == "R":
        return wn.ADV
    else:
        return wn.NOUN

# This will lemmatize tokens. Note that it assumes the word is a noun unless a
# POS tag is provided.
lemmatizer = WordNetLemmatizer()

# This retrieves a list of stop words in English, which will be used to remove the
# stop words:
stop_words = stopwords.words("english")

# These combined punctuations will be used to remove punctuations from tweets (it
# is an extension to string.punctuation):
punctuations = "!\"“”#$%&'‘’()*+,-./:;<=>?@[\]^_`{|}~‍"

def preprocess(tweet):

    ###### Contraction expansion

    tweet = contractions.fix(tweet)

    ###### Sentence tokenization

    # Separates tweets into sentences:
    tweet_sentences = sentence_tokenizer.tokenize(tweet)

    ###### Word tokenization

    # Tokenization outputs are kept in separate lists for each sentence:
    tweet_sentences_tokens = [tokenizer.tokenize(sentence) for sentence in tweet_sentences]

    ###### POS-tagging

    # POS tagging happens separately for each sentence before they are combined:
    tokens_pos = [pos_tag for pos_tags in pos_tag_sents(tweet_sentences_tokens) for pos_tag in pos_tags]

    ###### Lemmatization

    # For each POS-tagged token, a lemma is obtained:
    lemmas = [lemmatizer.lemmatize(token[0], pos=get_lemmatizer_pos(token[1])) for token in tokens_pos]

    ###### Negation handling

    # Marks negations:
    lemmas = mark_negation(lemmas)

    ###### Removals

    # Filters stop words (considers negations):
    lemmas = [lemma for lemma in lemmas if not lemma.replace("_NEG", "") in stop_words]

    # Filters punctuation (considers negations):
    lemmas = [lemma for lemma in lemmas
              if not lemma.replace("_NEG", "").translate(lemma.maketrans('', '', punctuations)) == ""]

    # Filters numbers (considers negation and some other details):
    lemmas = [lemma for lemma in lemmas if not lemma.replace("_NEG", "").translate(lemma.maketrans({",": "", ".": "", "%": ""})).isdigit()]

    # Filters hashtags:
    lemmas = [lemma for lemma in lemmas if not lemma.startswith("#")]

    # Filters user handles:
    lemmas = [lemma for lemma in lemmas if not lemma.startswith("@")]

    # Filters out URLs: any lemma containing "https://", "http://", or "www."
    # (matched with a regular expression) is dropped.
    lemmas = [lemma for lemma in lemmas if not re.search(r"(https?://)|(www\.)", lemma)]

    return lemmas

dataset["tweet_processed"] = dataset["tweet"].map(preprocess)

dataset.head()

Unnamed: 0,sentiment,tweet,tweet_processed
126,-1.0,"@grum WAH I can't see clip, must be el-stupido work filters. Can't wait 'till I get a 'puter. Something else 2 blame ex 4. He broke mine","[wah, cannot, see, clip, must, el-stupido, work, filter, cannot, wait, till, get, puter, something, else, blame, ex, break, mine]"
159,-1.0,Oh - Just got all my MacHeist 3.0 apps - sweet. Didn't get the Espresso serial no though although they said they sent it - oh well,"[oh, get, macheist, apps, sweet, get_NEG, espresso_NEG, serial_NEG, though_NEG, although_NEG, say_NEG, send_NEG, oh_NEG, well_NEG]"
215,-1.0,I'm gonna get up late tomorrow and it's 132am here. I gonna get tipsy by my lonesome. That's...that's just sad,"[go, get, late, tomorrow, 132am, go, get, tipsy, lonesome, sad]"
363,-1.0,@Sara_Kate Im afraid too ( ur reply about uni from ages ago,"[afraid, reply, uni, age, ago]"
390,-1.0,Wait should I eat?? Or be skinny for vegas!! I'm hungry!,"[wait, eat, skinny, vega, hungry]"
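
The `_NEG` suffixes in the output above come from the negation-marking step: once a negation word is seen, the tokens that follow are tagged until clause-ending punctuation appears. The sketch below is a simplified re-implementation for illustration only. The sets `NEGATION_WORDS` and `CLAUSE_PUNCTUATION` and the helper `mark_negation_sketch` are assumptions and do not reproduce NLTK's actual `mark_negation` rules.

```python
# Simplified illustration of negation scoping; NOT NLTK's implementation.
# NEGATION_WORDS and CLAUSE_PUNCTUATION are small, assumed sets.
NEGATION_WORDS = {"not", "no", "never", "nothing"}
CLAUSE_PUNCTUATION = {".", ":", ";", "!", "?"}

def mark_negation_sketch(tokens):
    marked = []
    in_scope = False
    for token in tokens:
        if token in CLAUSE_PUNCTUATION:
            # Clause-ending punctuation closes the negation scope.
            in_scope = False
            marked.append(token)
        elif in_scope:
            # Every token inside the scope gets the suffix.
            marked.append(token + "_NEG")
        else:
            # A negation word opens the scope but is not tagged itself.
            if token in NEGATION_WORDS:
                in_scope = True
            marked.append(token)
    return marked

example = ["i", "do", "not", "like", "it", ".", "fine"]
print(mark_negation_sketch(example))
# ['i', 'do', 'not', 'like_NEG', 'it_NEG', '.', 'fine']
```

In this sketch, commas and hyphens do not close the scope, so a long clause after a negation is tagged all the way to the next sentence terminator. This is consistent with the second row of the output above, where every lemma after the negation carries `_NEG`.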


We can now split our dataset into training and testing sets using scikit-learn. Note that we can maintain the class balance by splitting the dataset using `stratify`.

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the dataset into training and testing (random_state is set to get the
# same sample every time):
dataset_train, dataset_test = train_test_split(dataset, test_size=0.3, random_state=42, stratify=dataset["sentiment"])

## Vectorization<a id="vectorization"></a>

Now, we have a dataset that we can use to train our model, but we need to represent each tweet in a vector format. Using the training set, let us combine the lemma lists, flatten them, count each lemma's frequency, and look at some statistics:

In [None]:
terms = pd.Series(dataset_train.explode('tweet_processed').tweet_processed).value_counts()

terms.describe()

Unnamed: 0,count
count,14553.0
mean,5.223253
std,25.286962
min,1.0
25%,1.0
50%,1.0
75%,2.0
max,1104.0


As expected, we have a highly skewed distribution here as most terms occur only once. If we do not prune our terms, we will end up with a very large vector that mostly corresponds to highly biased terms. To deal with this problem, we can use only the top n terms. Since we have two sentiment groups, we can count these terms separately and select the most frequent terms for each:

In [None]:
# Top terms from the negative tweets:
terms_negative = pd.Series(dataset_train[dataset_train.sentiment == -1].explode('tweet_processed').tweet_processed).value_counts()

terms_negative.head()

Unnamed: 0_level_0,count
tweet_processed,Unnamed: 1_level_1
go,596
get,570
cannot,339
work,325
miss,316


In [None]:
# Top terms from the positive tweets:
terms_positive = pd.Series(dataset_train[dataset_train.sentiment == 1].explode('tweet_processed').tweet_processed).value_counts()

terms_positive.head()

Unnamed: 0_level_0,count
tweet_processed,Unnamed: 1_level_1
get,534
go,502
good,472
love,415
day,393


We can take the terms that occur more than n times for both groups to obtain the final term list that we will use for training. We can also limit our vector to the top m frequent words from each sentiment to reduce the vector size even more. Depending on your dataset, applying only one of them could be enough. We will simply obtain the top 500 words from each group.

In [None]:
# Retrieving the top 500 terms for both sentiments, remove the duplicates
# (since some words occur in both groups), sort them, and obtain a list:
vector_terms = sorted(set(terms_positive.nlargest(500).index) | set(terms_negative.nlargest(500).index))

# This would first filter the words with a minimum occurrence threshold (more
# than 4 occurrences in this case), but it makes no difference for this dataset.
# vector_terms = sorted(set(terms_positive[terms_positive > 4].nlargest(500).index) | set(terms_negative[terms_negative > 4].nlargest(500).index))

# Vector size:
print("Vector size:",len(vector_terms))

# The first 20 terms:
vector_terms[0:20]

Vector size: 639


['1st',
 '<3',
 'able_NEG',
 'absolutely',
 'ache',
 'actually',
 'add',
 'afternoon',
 'ago',
 'agree',
 'ah',
 'ahh',
 'ahhh',
 'album',
 'almost',
 'alone',
 'already',
 'also',
 'always',
 'amaze']

Now, we have a relatively small term list that can represent both groups. We need to represent all tweets according to this term list. To do so, we will vectorize them using the term occurrences.

Assume that our term list has only four terms: "i," "love," "you," and "myself." We can represent these terms as a vector (or a one dimensional list in Python): `["i", "love", "you", "myself"]`

If a tweet is `Love you.`, when we process it, its vector form would look like this:

`[0, 1, 1, 0]`

Notice that the first and the last numbers are 0, because the tweet does not have any "i" or "myself" in it. The first count corresponds to "i" while the last one corresponds to "myself."

Therefore, if a tweet is `I love... I love you.`, its vector form would be:

`[2, 2, 1, 0]`

Since we removed many terms, our vector cannot represent every text perfectly. Even if we included all the terms from our training set, the test set may have many terms that did not exist in our training set, so this is inevitable. If a tweet is `I love you too.`, it would be represented like this:

`[1, 1, 1, 0]`

This is indistinguishable from `I love you.`, because we cannot represent "too" here. When we vectorize the tweets, they simply become bags of words. The only thing we now know about them is how many times some specific words occur in them. Note that we also lose the relationship between the terms and their order. Each word is independent.
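
A minimal sketch of the toy example above, using a crude regex tokenizer in place of the preprocessing pipeline. The names `toy_terms` and `toy_vectorize` are illustrative and not part of the notebook.

```python
import re
from collections import Counter

# Toy vocabulary from the text; its order fixes the vector positions.
toy_terms = ["i", "love", "you", "myself"]

def toy_vectorize(text, terms=toy_terms):
    # Lowercase and keep alphabetic runs only (a deliberately crude tokenizer).
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # A Counter returns 0 for terms that never occurred.
    return [counts[term] for term in terms]

print(toy_vectorize("Love you."))              # [0, 1, 1, 0]
print(toy_vectorize("I love... I love you."))  # [2, 2, 1, 0]
print(toy_vectorize("I love you too."))        # [1, 1, 1, 0]
```

The word "too" is counted by the tokenizer but has no position in `toy_terms`, so it leaves no trace in the vector.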

To programmatically vectorize each tweet, we need to create a dictionary that keeps the index of each term from our term list. Then, we can check if a word of a given tweet exists in that dictionary. If so, the dictionary gives us the location to fill in, so that we can obtain the vector. For example, our term index dictionary would be like this:

`term_indices = {"i": 0, "love": 1, "you": 2, "myself": 3}`

Using this, we know which index to fill in for a given word of a given tweet.

We start with a list full of zeros: `[0, 0, 0, 0]`

For the term "love," we know that its corresponding index is 1 according to our dictionary. Therefore, whenever we see it occur, we increase the number that corresponds to its index. Let us see how it works in Python:

In [None]:
term_indices = {term: index for index, term in enumerate(vector_terms)}

# First 20 terms from the dictionary
list(term_indices.items())[:20]

[('1st', 0),
 ('<3', 1),
 ('able_NEG', 2),
 ('absolutely', 3),
 ('ache', 4),
 ('actually', 5),
 ('add', 6),
 ('afternoon', 7),
 ('ago', 8),
 ('agree', 9),
 ('ah', 10),
 ('ahh', 11),
 ('ahhh', 12),
 ('album', 13),
 ('almost', 14),
 ('alone', 15),
 ('already', 16),
 ('also', 17),
 ('always', 18),
 ('amaze', 19)]

In [None]:
# This function vectorizes a preprocessed text according to the term vector:
def vectorize(terms, terms_dict=term_indices):

    # Creating a list full of zeros, according to the dictionary size
    vector = [0] * len(terms_dict)

    # For all terms:
    for term in terms:
        # The vector is only updated if the term exists in our dictionary:
        if term in terms_dict:
            index = terms_dict[term]
            vector[index] += 1

    return vector

# Now we can apply this to the training and testing set:
# dataset_train.loc[:,"vector"] = dataset_train["tweet_processed"].apply(vectorize)
dataset_train = dataset_train.assign(vector = dataset_train["tweet_processed"].apply(vectorize))
# dataset_test.loc[:,"vector"] = dataset_test["tweet_processed"].apply(vectorize)
dataset_test = dataset_test.assign(vector = dataset_test["tweet_processed"].apply(vectorize))
# You can also use map() since we are dealing with the columns here. However, apply()
# can take multiple arguments, which is necessary if you want to pass a different
# dictionary as terms_dict.

dataset_train.head()

Unnamed: 0,sentiment,tweet,tweet_processed,vector
1309492,1.0,http://twitpic.com/6ik6t - Will be doing this tomorrow.,[tomorrow],"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
925149,1.0,"@Rsltruly Haha, it seemed a little distraught, so I thought I'd ask","[haha, seem, little, distraught, thought, would, ask]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
742849,-1.0,@d33pak ya...but m missing my &quot;beeryani&quot; on the weekends,"[ya, miss, beeryani, weekend]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
1528263,1.0,smiling. my DJ is about to hook it up for Music Monday...trade lyrics with me today too twittFAM. music IS the universal language,"[smile, dj, hook, music, monday, trade, lyric, today, twittfam, music, universal, language]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
393444,-1.0,"@randfish I miss the SF fog Watching it roll in when you have nothing to do, yeah baby!","[miss, sf, fog, watch, roll, nothing, yeah_NEG, baby_NEG]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"


We can check if our vectorization works by recreating the terms from the vector:

In [None]:
print("Vector representation of the first tweet:", dataset_train["vector"].iloc[0], "\n")

print("Unique sorted terms that can be represented for the first tweet:",
      sorted([vector_terms[index] for index, term in enumerate(dataset_train["vector"].iloc[0]) if term != 0]), "\n")

print("All sorted terms of the first tweet:", sorted(dataset_train["tweet_processed"].iloc[0]))

Vector representation of the first tweet: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

While these vectors are meaningless to the human eye, we can now quantify a given text and train a model using these term occurrences. So, each occurrence count in the tweet vector is actually a feature. Our feature space is the vector itself.

Before we move forward, there is a limitation that you should keep in mind. To dramatically reduce the vector size, we limit our bag of words to the most frequent words from the two sentiment groups. However, this elimination process may not be ideal. A word occurring often does not necessarily mean that it helps with classification: a word can be frequent in both groups and give us no information. In that case, leaving the word out of the vector could actually help. In short, we want words that occur frequently in one group but not in every group (both positive and negative tweets in this case). We can use the term frequency-inverse document frequency ([TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) approach to evaluate a word's usefulness in this sense. It also down-weights stop words and other terms that occur on both sides. We will not implement this now, but you can use it to rank these terms, remove the unimportant ones, and also normalize the vectors to obtain more efficient vectors.

You can also use the [count vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and the [TF-IDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) of scikit-learn.
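
To make the TF-IDF idea concrete, here is a minimal sketch of the textbook formula (raw term count times `log(N / df)`) on a few made-up token lists. The `docs` list and the helper names `idf` and `tf_idf` are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default, so its numbers would differ from these.

```python
import math
from collections import Counter

# Hypothetical preprocessed tweets (token lists), for illustration only.
docs = [
    ["get", "love", "day"],
    ["get", "miss", "work"],
    ["get", "love", "love"],
    ["get", "sad"],
]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Textbook inverse document frequency: log(N / df).
    return math.log(len(docs) / doc_freq[term])

def tf_idf(doc):
    # Raw term count times IDF for each distinct term in the document.
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

print(tf_idf(docs[2]))
```

Here "get" appears in every document, so its IDF is `log(4 / 4) = 0` and it contributes nothing, while "love" keeps a positive weight because it appears in only half of the documents.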

## Multinomial Naive Bayes<a id="mnb"></a>

We can use Multinomial Naive Bayes through these sentiment classes' term occurrences. Before we jump into the code, let us go over the logic behind Multinomial Naive Bayes.

Take a look at the ratio of tweets with a positive sentiment in our dataset (the stratified split keeps the same ratio in the training set):

In [None]:
dataset[dataset.sentiment == 1].shape[0]/dataset.shape[0]
# Remember that df.shape[0] retrieves the row count for a given DataFrame object.
# Therefore, dataset[dataset.sentiment == 1].shape[0] retrieves the number of tweets
# that have a sentiment value of 1 (positive).

0.5

This means 50% of our training set has a positive sentiment. Using this information, we can roughly guess that, for a given tweet, the probability of it having a positive sentiment is 50%. This probability can be denoted by "p(positive)." This initial guess is called the "prior probability."

We can improve our estimation by looking at the tweet content and comparing it with the ones we have in our dataset. For example, what is the probability of a given tweet, provided that it includes the word "love," to have a positive sentiment? We had already counted the occurrences of terms for each sentiment. We can use these frequencies:

In [None]:
# Frequency of the word "love" divided by the total number of word occurrences in
# negative tweets:
terms_negative["love"]/sum(terms_negative)

0.0032645024224680673

In [None]:
# Frequency of the word "love" divided by the total number of word occurrences in
# positive tweets:
terms_positive["love"]/sum(terms_positive)

0.011091215223026966

We can also think about these ratios as probabilities of finding the word "love" in negative and positive tweets, separately.

These probabilities (or likelihoods) are shown as "p(<span style="color:purple">love</span><span style="color:blue">|</span><span style="color:red">negative</span>)" and "p(<span style="color:purple">love</span><span style="color:blue">|</span><span style="color:green">positive</span>)," which means <span style="color:purple">the probability of "love"</span> <span style="color:blue">given that</span> the tweet is <span style="color:red">negative</span>/<span style="color:green">positive</span>.

Notice that the probability of finding "love" is much higher in positive tweets. Since the numbers of positive and negative tweets are about the same in our dataset, we can intuitively understand that a tweet that has the word "love" has a higher chance of having a positive sentiment. However, this may not be that easy with different numbers, so let us formulate it. To calculate the posterior probability (which means the probability that is calculated based on another given event/condition), we essentially multiply the prior probability with the likelihood in the sentiment group. So, the probability of a tweet having a positive sentiment given that it has the word "love" (can be denoted by "p(positive|love)") is actually the combination of these two events:
* The tweet has the word "love."
* The tweet is positive.

Therefore, we need to multiply these separate probabilities:

`p(positive|love) ∝ p(love|positive) x p(positive)`

This is the core of Naive Bayes. If you have heard about Bayes' theorem, it may look familiar to you. In Bayes' theorem, we divide this product by the probability of the condition, p(love). Since that denominator is the same for both sentiments, it does not change which sentiment scores higher, so we leave it out. The results are therefore scores that are proportional to the posterior probabilities rather than probabilities that sum to 1.

Now, we can calculate these scores for our case.



In [None]:
print("p(positive|love) =",(terms_positive["love"]/sum(terms_positive)) * (dataset[dataset.sentiment == 1].shape[0]/dataset.shape[0]))

p(positive|love) = 0.005545607611513483


In [None]:
print("p(negative|love) =",(terms_negative["love"]/sum(terms_negative)) * (dataset[dataset.sentiment == -1].shape[0]/dataset.shape[0]))

p(negative|love) = 0.0016322512112340337


So, if we have two conditions (for example our tweet has the words "love" and "hate"), we multiply their likelihoods along with the prior probability:

In [None]:
print("p(positive|love ∩ hate) =",
      (terms_positive["love"]/sum(terms_positive)) *
      (terms_positive["hate"]/sum(terms_positive)) *
      (dataset[dataset.sentiment == 1].shape[0]/dataset.shape[0]))

p(positive|love ∩ hate) = 2.9642181957471114e-06


In [None]:
print("p(negative|love ∩ hate) =",
      (terms_negative["love"]/sum(terms_negative)) *
      (terms_negative["hate"]/sum(terms_negative)) *
      (dataset[dataset.sentiment == -1].shape[0]/dataset.shape[0]))

p(negative|love ∩ hate) = 6.047411021749535e-06


One potential issue is that since we keep multiplying these probabilities, if a class has no occurrence of a specific word, its likelihood would be zero, which would zero out the whole product. To solve this, we can add 1 to every frequency during calculation so that no likelihood is ever zero (this is known as Laplace or add-one smoothing). Since we represent each tweet strictly with a vector of the specific words at hand, words outside our term list are simply ignored, and scikit-learn's `MultinomialNB` applies this smoothing by default (`alpha=1.0`), so this is not a problem for us.
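
The zero-likelihood problem and the add-one fix can be sketched with a few made-up counts. The dictionary `toy_counts` and the helper `likelihood` are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical term counts for one sentiment class (illustration only).
toy_counts = {"love": 3, "hate": 0, "day": 1}
vocabulary_size = len(toy_counts)
total = sum(toy_counts.values())

def likelihood(term, alpha=0.0):
    # Additive smoothing: (count + alpha) / (total + alpha * vocabulary size).
    return (toy_counts[term] + alpha) / (total + alpha * vocabulary_size)

# Without smoothing, one unseen word wipes out the whole product:
print(likelihood("love") * likelihood("hate"))            # 0.0

# With add-one smoothing, the product stays positive (4/7 * 1/7):
print(likelihood("love", 1.0) * likelihood("hate", 1.0))
```

Adding `alpha * vocabulary_size` to the denominator keeps the smoothed likelihoods summing to 1 across the vocabulary.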

Anyway, we can now jump into training a Multinomial Naive Bayes classifier:

In [None]:
from sklearn.naive_bayes import MultinomialNB

mnb_classifier = MultinomialNB()

Remember that we have already split our dataset into training and test sets. To train a model, we need to split the training set into input and output. Our input is the tweet vector and our output is the sentiment, because we are interested in predicting the sentiment for a given tweet vector.

In [None]:
dataset_train_x = dataset_train["vector"].tolist()
dataset_train_y = dataset_train["sentiment"]

# We can now train our model:
mnb_classifier.fit(dataset_train_x, dataset_train_y)

We can now feed the training input back and see their predicted sentiments. We can compare these predictions with the expected sentiments (dataset_train_y) and evaluate the model's success.

In [None]:
# Predicting the training set:
train_predictions = mnb_classifier.predict(dataset_train_x)

# Evaluating the model:
from sklearn.metrics import classification_report

# Printing the classification report
print(classification_report(dataset_train_y, train_predictions))

              precision    recall  f1-score   support

        -1.0       0.75      0.78      0.76      5600
         1.0       0.77      0.74      0.75      5600

    accuracy                           0.76     11200
   macro avg       0.76      0.76      0.76     11200
weighted avg       0.76      0.76      0.76     11200



There are multiple ways one can evaluate the classification results. You can learn more about performance evaluation metrics from [here](https://www.kdnuggets.com/2020/04/performance-evaluation-metrics-classification.html).

The results are not bad, but remember that we need our model to be successful with the unknown data, which is our testing set. Let us try to predict the testing set and evaluate it.

In [None]:
# Splitting the testing set:
dataset_test_x = dataset_test["vector"].tolist()
dataset_test_y = dataset_test["sentiment"]

# Predicting the testing set:
test_predictions = mnb_classifier.predict(dataset_test_x)

# Printing the classification report
print(classification_report(dataset_test_y, test_predictions))

              precision    recall  f1-score   support

        -1.0       0.72      0.74      0.73      2400
         1.0       0.73      0.71      0.72      2400

    accuracy                           0.72      4800
   macro avg       0.72      0.72      0.72      4800
weighted avg       0.72      0.72      0.72      4800



Slightly lower scores are usually expected on unseen data. Still, for the amount of work involved, the result is not bad.

We can also inspect the feature importances to see which words matter most for predicting the sentiment. For every class, we can get the log probability of each word in the input vector and convert it to a probability (for a slightly more intuitive understanding). We can label them in human-readable form through `vector_terms`. Note that we did not need to train a model to see this; we could calculate it directly from the (smoothed) word occurrences in the dataset.

In [None]:
mnb_feature_log_prob = mnb_classifier.feature_log_prob_

import numpy as np

# We define an empty DataFrame with the vector terms as indices:
mnb_feature_prob = pd.DataFrame(index=vector_terms)

# For each sentiment class:
for sentiment_index, sentiment in enumerate(mnb_classifier.classes_):
    # Dumping the term probabilities to mnb_feature_prob:
    mnb_feature_prob[sentiment] = np.exp(mnb_feature_log_prob[sentiment_index])

# Using floats as column names can be confusing, so we rename them in place:
mnb_feature_prob.rename(columns={-1.0: "negative", 1.0: "positive"}, inplace=True)

mnb_feature_prob.head()

            negative  positive
1st         0.000768  0.000260
<3          0.000938  0.003547
able_NEG    0.000853  0.000173
absolutely  0.000256  0.000649
ache        0.000597  0.000087
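
The remark that these probabilities can be obtained without the model can be checked on a toy example. The sketch below uses a made-up count matrix (not the tweet vectors) and assumes the default additive smoothing of `MultinomialNB` (`alpha=1.0`): for each class, it sums the term counts, adds the smoothing constant, and normalizes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy term-count vectors (made up): 4 documents, 3 terms.
X = np.array([[2, 1, 0],
              [1, 0, 1],
              [0, 3, 1],
              [0, 1, 2]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

alpha = 1.0  # Default smoothing constant of MultinomialNB
classifier = MultinomialNB(alpha=alpha).fit(X, y)

# Computing the smoothed term probabilities directly from the counts:
manual_prob = []
for label in classifier.classes_:
    term_counts = X[y == label].sum(axis=0)
    manual_prob.append((term_counts + alpha) / (term_counts.sum() + alpha * X.shape[1]))
manual_prob = np.array(manual_prob)

# Both approaches should give the same probabilities:
print(np.allclose(manual_prob, np.exp(classifier.feature_log_prob_)))
```

Each row of `manual_prob` sums to 1, since it is a probability distribution over the terms for one class.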


## Practice<a id="practice"></a>

Now is a good time to practice what you have learned. Use the processed forms of the tweets ("tweet_processed" column) and vectorize both training and testing sets **using only the top 500 positive terms**. Train the same model, predict the testing set, and evaluate the accuracy. Variables and some parts are already provided for you. You only need to fill in the lines where you see `## FILL IN HERE ##`.

In [None]:
# Creating disposable copies and dropping the existing vector column:
practice_dataset_train = dataset_train.copy().drop(columns="vector")
practice_dataset_test = dataset_test.copy().drop(columns="vector")

# Obtain the top 500 positive terms.
practice_vector_terms = ## FILL IN HERE ##

# Adding the indices to the terms
practice_term_indices = {term: index for index, term in enumerate(practice_vector_terms)}

# Vector size (this must be 500):
print("Vector size:",len(practice_vector_terms))

# First 20 terms from the dictionary
list(practice_term_indices.items())[:20]

Vector size: 500


[('<3', 0),
 ('absolutely', 1),
 ('actually', 2),
 ('add', 3),
 ('afternoon', 4),
 ('ago', 5),
 ('agree', 6),
 ('ah', 7),
 ('ahh', 8),
 ('ahhh', 9),
 ('album', 10),
 ('almost', 11),
 ('already', 12),
 ('also', 13),
 ('always', 14),
 ('amaze', 15),
 ('amazing', 16),
 ('another', 17),
 ('anyone', 18),
 ('anything', 19)]

In [None]:
# Use the terms you obtained to vectorize the sets.
# Hint: Do not forget to pass your new term indices while applying vectorize.
# Otherwise, you will obtain the same vectors by default.
practice_dataset_train["vector"] = ## FILL IN HERE ##
practice_dataset_test["vector"] = ## FILL IN HERE ##

In [None]:
# Obtain the input and output for training.
practice_dataset_train_x = ## FILL IN HERE ##
practice_dataset_train_y = ## FILL IN HERE ##

# Obtain the input and output for testing.
practice_dataset_test_x = ## FILL IN HERE ##
practice_dataset_test_y = ## FILL IN HERE ##

# Training the model:
mnb_classifier.fit(practice_dataset_train_x, practice_dataset_train_y)

# Predicting the testing set:
practice_test_predictions = mnb_classifier.predict(practice_dataset_test_x)

# Printing the classification report
print(classification_report(practice_dataset_test_y, practice_test_predictions))

              precision    recall  f1-score   support

        -1.0       0.69      0.74      0.72      2400
         1.0       0.72      0.67      0.69      2400

    accuracy                           0.71      4800
   macro avg       0.71      0.71      0.71      4800
weighted avg       0.71      0.71      0.71      4800



Compare the results you obtained with the original ones. How and why are they different? Feel free to experiment with the actual code as well. Reduce `frac_to_drop` to use more of the dataset, change the vector size, process the tweets further, etc., and check the difference each change makes.

## Using another model<a id="svm"></a>

Once we obtain the vector representations, we can switch the model. For example, instead of Multinomial Naive Bayes, we can use a Support Vector Machine (SVM). While the code we will use is very similar to that for Multinomial Naive Bayes thanks to the library, SVM has a very different working principle. You can watch [this short video](https://www.youtube.com/watch?v=Y6RRHw9uN9o) to learn more.

In [None]:
from sklearn import svm

# Defining an SVM model:
svm_classifier = svm.SVC()

# Training the model:
svm_classifier.fit(dataset_train_x, dataset_train_y)

# Predicting the testing set:
test_predictions = svm_classifier.predict(dataset_test_x)

# Printing the classification report
print(classification_report(dataset_test_y, test_predictions))

              precision    recall  f1-score   support

        -1.0       0.76      0.68      0.72      2400
         1.0       0.71      0.79      0.75      2400

    accuracy                           0.73      4800
   macro avg       0.74      0.73      0.73      4800
weighted avg       0.74      0.73      0.73      4800



## A better way to vectorize<a id="vectorize-better"></a>

The way we vectorize the tweets is easy to understand but involves a bit of manual labor. We could write a function that handles this for us, but the approach is also a bit primitive. There are more sophisticated alternatives. For example, we can use Word2Vec or Doc2Vec neural network models to vectorize a word or a document (which can be a sentence, a paragraph, or even an actual text document in this context).

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Obtaining TaggedDocument objects in a list using the training set:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(dataset_train["tweet_processed"])]

# Training Doc2Vec model using these documents:
model = Doc2Vec(documents, vector_size=2, window=2, min_count=1, workers=4)
# vector_size: Number of dimensions that will represent our tweets
# window: Simply put, the window size that will be used to detect occurrence patterns
# min_count: A minimum threshold to filter out less common words
# workers: Number of concurrent threads to parallelize the process

We can now obtain the vector that represents a document (here, simply a list of words). Take a look at this example:

In [None]:
# Obtaining the vector
model.infer_vector(["natural", "language", "processing"])

array([ 0.16299061, -0.2089302 ], dtype=float32)

Now, take a look at this one and compare their vector representations:

In [None]:
model.infer_vector(["processing", "language", "natural"])

array([ 0.21345688, -0.16054054], dtype=float32)

You might have three important questions in your mind now:
1. **Why do we get these weird numbers?**

Previously, what we had obtained was a simple bag of words: each chosen word's occurrence count was represented by one component of the vector. Here, each component is a dimension of a shared space rather than the occurrence count of a particular word in that document. This is called a "distributed representation." Think about the furniture and other items in your room. Each has X, Y, and Z coordinates with respect to the center point of your room, so we can represent each item with an `<X, Y, Z>` vector. The lamp is right at the center of your ceiling. The carpet is on the floor. The wall clock is on the wall in front of you, vertically between the lamp and the carpet. X, Y, and Z do not represent the existence of a specific item; the items exist at specific points in the space (your room). In the same way, words or documents can be represented as points in a high-dimensional space.

2. **Can we even represent many words (documents) with smaller vector sizes?**

Think about how many pieces of furniture you can represent using their X, Y, and Z coordinates in your room. It is far more than the number of dimensions in this scenario. Likewise, we can efficiently squeeze many representations into fewer dimensions. Still, this example vectorization was oversimplified for demonstration purposes. Normally, higher dimensions (somewhere between 100 and 300) are preferred. Compared to the simple bag-of-words approach, higher-dimensional spaces are harder to comprehend and envision. Luckily, computers do not mind.

3. **Why do these two documents not have the same representation?**

Remember that we are now using Doc2Vec and we embed documents. So, instead of representing the furniture in your room, we are actually representing the room itself. When we change the position of a piece of furniture (word) in the room (document), the representation changes. Note also that Doc2Vec inference is itself partly randomized, so repeated calls can return slightly different vectors even for the same input. In some cases, you might prefer to use Word2Vec, obtain word-level vectors, and average them to obtain document-level representations.
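
The averaging alternative mentioned above can be sketched as follows. The word vectors here are hypothetical 3-dimensional values written by hand; in practice, they would come from a trained Word2Vec model. The helper `average_vector` is also made up for this illustration. Unlike the Doc2Vec example, the order of the words does not affect the result, because an average ignores order.

```python
import numpy as np

# Hypothetical word vectors (made up); a trained Word2Vec model would provide these.
word_vectors = {
    "natural": np.array([0.2, -0.1, 0.4]),
    "language": np.array([0.0, 0.3, 0.1]),
    "processing": np.array([-0.2, 0.1, 0.3]),
}

def average_vector(tokens, vectors, size=3):
    """Averages the vectors of the known tokens; returns zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

forward = average_vector(["natural", "language", "processing"], word_vectors)
backward = average_vector(["processing", "language", "natural"], word_vectors)

print(forward)
print(np.allclose(forward, backward))
```

Skipping unknown words and falling back to a zero vector is only one possible design choice; other strategies exist for out-of-vocabulary words.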

Let us vectorize the dataset using Doc2Vec:

In [None]:
# Let us first train a proper Doc2Vec model:
model = Doc2Vec(documents, vector_size=200, workers=4)

# Vectorization with Doc2Vec:
dataset_train_x_distributed = [model.infer_vector(tweet_processed) for tweet_processed in dataset_train["tweet_processed"]]
dataset_test_x_distributed = [model.infer_vector(tweet_processed) for tweet_processed in dataset_test["tweet_processed"]]

# Distributed representation for the first row:
dataset_train_x_distributed[0]

array([ 3.04732705e-03, -7.56429695e-03,  4.07447526e-03,  1.37106823e-02,
        1.67388692e-02, -1.82423703e-02, -4.11761412e-03,  2.74238586e-02,
       -4.12344560e-03,  1.79019924e-02, -9.95547324e-03, -1.08452616e-02,
        7.12523982e-03,  5.94163593e-03, -5.18688327e-03, -1.10455062e-02,
       -6.19061012e-03, -5.00199012e-03,  7.94325490e-03, -2.22684778e-02,
        8.77674855e-03, -1.40149770e-02,  5.54288551e-03,  6.26977882e-04,
        5.32958284e-03, -6.44624745e-03, -6.06276328e-03, -9.19247232e-03,
       -1.74699742e-02,  6.92194002e-03,  2.22266689e-02,  8.45513307e-03,
        3.53865209e-03,  5.96398162e-03, -1.80712237e-03,  9.63222235e-03,
        8.97638779e-03,  1.41677959e-03, -8.01365543e-03, -2.09047403e-02,
       -9.62120853e-03, -5.70791482e-04,  1.01324208e-02,  5.65510150e-03,
        2.29565650e-02, -2.85862293e-03, -9.95202642e-03, -6.30581845e-03,
        9.89388768e-03,  1.41810505e-02,  1.17747858e-02, -3.74690909e-03,
       -1.60711948e-02, -

Since a distributed representation does not contain frequencies (its components can even be negative), we cannot use Multinomial Naive Bayes anymore. Instead, we can use a model that can work with continuous data. Here, we will use Gaussian Naive Bayes, but it would be possible to use SVM or another model as well.
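
To make this restriction concrete, the following sketch uses four made-up 2D vectors with negative components (stand-ins for Doc2Vec output, not real embeddings). scikit-learn's `MultinomialNB` rejects negative values with a `ValueError`, while `GaussianNB` accepts them.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Toy continuous vectors (made up) with negative components:
X = np.array([[0.2, -0.1], [0.3, -0.2], [-0.4, 0.5], [-0.3, 0.6]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# Multinomial Naive Bayes expects non-negative, count-like features:
try:
    MultinomialNB().fit(X, y)
    mnb_error = None
except ValueError as error:
    mnb_error = str(error)
print("MultinomialNB error:", mnb_error)

# Gaussian Naive Bayes models each feature as a continuous value:
gnb_predictions = GaussianNB().fit(X, y).predict(X)
print("GaussianNB predictions:", gnb_predictions)
```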

In [None]:
from sklearn.naive_bayes import GaussianNB

# Defining a Gaussian Naive Bayes model:
gnb_classifier = GaussianNB()

# Training the model:
gnb_classifier.fit(dataset_train_x_distributed, dataset_train_y)

# Predicting the testing set:
test_predictions = gnb_classifier.predict(dataset_test_x_distributed)

# Printing the classification report
print(classification_report(dataset_test_y, test_predictions))

              precision    recall  f1-score   support

        -1.0       0.57      0.13      0.21      2400
         1.0       0.51      0.90      0.65      2400

    accuracy                           0.52      4800
   macro avg       0.54      0.52      0.43      4800
weighted avg       0.54      0.52      0.43      4800



With a vector size of 200 and default parameters, it does not work well, which is not surprising. Since these are neural network models, they require a lot of data. You can try training the model on a larger portion of the whole dataset. Instead of finding a large training set and the time to train on it, it is common to use pre-trained [GloVe](https://nlp.stanford.edu/projects/glove/) vectors for vectorization. You will learn more about these later.

### Discussion Question

* Think of a challenge you faced at work or in your studies. How would you (or did you) translate it into the NLP domain? <br> (Topic classification, text generation, etc.) Would you rather use pre-trained vectors or train them from scratch?

## More information<a id="more-info"></a>

* [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
* [scikit-learn](https://scikit-learn.org)
    * [Count vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
    * [TF-IDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
    * [Multinomial Naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
    * [SVM classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
    * [Gaussian Naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
* [Gensim](https://radimrehurek.com/gensim/)
    * [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html)
    * [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)
* [GloVe](https://nlp.stanford.edu/projects/glove/)