<a href="https://colab.research.google.com/github/meghna2312/SentimentCNN/blob/master/meg__Sentiment_analyzer_using_Convolutional_Neural_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Let's delve into how computers deal with human language. 

###Natural Language Processing (NLP) is the part of AI (Artificial Intelligence) that deals with human language. We want computers to grasp the meaning carried by words, which is not straightforward because computers only deal with sequences of zeroes and ones. 

###Sentiment analysis is one of the essential tasks in NLP. We will build a sentiment analyzer using a CNN (Convolutional Neural Network). It will take tweets as input and output whether each tweet conveys a positive or negative sentiment. 

###CNN was first used to process images. For example, at the very beginning, it was used to recognize digits that were handwritten and also to recognize the contents of an image. Researchers soon realized that CNN could also be a powerful tool for language processing. Classification tasks are easily addressed with CNNs. 

###First, we will go into the origins of CNNs: how they work for images and what their components are. Then we will see how text relates to images and how CNNs are applied to text. For images, a CNN takes an image as input and outputs a label (the image class). 

##Steps in CNN:

###The first and most crucial step is convolution. Convolution applies many feature detectors that scan the entire image and gives us a list of feature maps, each indicating whether and where a specific feature appears in the image.

###The next operation is max pooling, where we apply the maximum function over small windows of each feature map to make the maps smaller and reduce the amount of computation.

###Then comes the straightforward flattening phase where we make a massive vector out of all the maps which are matrices.

###We end with a feed-forward neural network which learns from the feature extraction phase. 

###Convolution consists of feature detectors applying over the entire image. We go through patches in the image that have the same size as that of our feature detector and perform an element-wise multiplication and sum all the products. This result might not make any sense for us humans, but when this is connected with a feed-forward neural network, CNN can achieve great results with it. 
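###The patch-by-patch multiplication and sum described above can be sketched in a few lines of NumPy. This is only an illustration: the 4x4 image, the 2x2 detector and the helper name convolve2d_valid are made up for this example and are not part of the model built later.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Element-wise multiplication followed by a sum of all the products
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Hypothetical 4x4 "image" containing a diagonal line
image = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]])
# Hypothetical 2x2 feature detector that responds to a diagonal pattern
kernel = np.array([[1, 0],
                   [0, 1]])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
```

###The feature map is largest at the positions where the patch looks like the detector, which is how a feature map records where a feature appears.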

###A CNN consists of a lot of feature detectors, and each detector will provide a feature map. The feature detector contains randomly initialized values, and those are the variables learned throughout the process of training. 

###After comparing the results after one cycle of prediction with the actual correct labels, the model will tune those numbers to learn to detect more useful and accurate features. 

###Next, in the max-pooling operation, we don't take the global maximum but the maximum within each window of the pooling kernel's size. Max-pooling reduces size and computation. The idea is that we do not need to know exactly where a feature appeared in the image. The most important thing is to see whether it appears or not. But we still need to keep some locality in this process. 

###We still want to keep the relations between the positions of the features. There exists a tradeoff between getting rid of the information we don't need and also making the map smaller so that we can improve the process and reduce the costs. We apply max-pooling to several feature maps, and we get the same number of pooled maps. 
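###As a rough sketch of the pooling step, assuming non-overlapping 2x2 windows (the window size and the helper name max_pool2d are choices made for this example), each window of a feature map is replaced by its largest value:

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a square window of side `size`."""
    h, w = feature_map.shape
    out_h, out_w = h // size, w // size
    # Drop rows/columns that do not fill a complete window
    trimmed = feature_map[:out_h * size, :out_w * size]
    # Group the map into (out_h, size, out_w, size) windows and take each window's maximum
    windows = trimmed.reshape(out_h, size, out_w, size)
    return windows.max(axis=(1, 3))

# Made-up 4x4 feature map
feature_map = np.array([[1, 3, 0, 2],
                        [4, 2, 1, 0],
                        [0, 1, 5, 6],
                        [2, 0, 7, 1]])
pooled = max_pool2d(feature_map)
print(pooled)
```

###The 4x4 map becomes a 2x2 map, and each pooled value still belongs to a known quarter of the original map, so some locality is kept.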

###Flattening converts the pooled maps into a one-dimensional vector that serves as input to the feed-forward neural network. We transform each pooled map into a vector and join the vectors together. This keeps the locality information (position). 

###We can add as many hidden layers as required in the feed-forward neural network. The higher the output value for a class, the higher the probability that the input belongs to that class. 

###Similar to the way we looked for features in the images, we can look for features in the text. For that, a sentence should be converted into a matrix as images were matrices. 

###The easiest way of representing each word of a sentence as a vector is also the most inefficient one. It's called one-hot encoding. The vocabulary consists of all the unique words in our corpus (text dataset). Each vector has as many entries as the vocabulary has words. All the entries in a vector are zero except for one entry, at the word's position in the vocabulary, which is one. This encoding doesn't convey any relation between words. 
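###A small sketch of one-hot encoding, using a made-up six-word sentence as the whole corpus (the sentence and variable names are assumptions for this example only):

```python
import numpy as np

sentence = "the cat sat on the mat".split()
vocabulary = sorted(set(sentence))  # unique words of this toy corpus
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# One row per word of the sentence, one column per vocabulary entry
one_hot = np.zeros((len(sentence), len(vocabulary)), dtype=int)
for row, word in enumerate(sentence):
    one_hot[row, word_to_index[word]] = 1

print(vocabulary)
print(one_hot)
```

###Every row contains a single 1, and the dot product between the rows of two different words is always 0, which is why this encoding cannot express that two words are related.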

###Reducing the size of the vectors leaves the model less freedom, and that forces it to create relations between words. Instead of being binary, each entry of the vector is a real number learned during training. Words with similar meaning will be closer in the embedding space. 
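###To illustrate the smaller, dense vectors, here is a minimal sketch of an embedding lookup. The matrix is filled with random numbers standing in for learned weights, and the sizes (5 words, 3 dimensions) are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 5, 3
# Random stand-in for an embedding matrix that would normally be learned during training
embedding_matrix = rng.normal(size=(vocab_size, emb_dim))

token_ids = np.array([4, 0, 3])          # a made-up sentence of three token ids
dense_vectors = embedding_matrix[token_ids]  # one row of the matrix per token

# Selecting rows is the same as multiplying the one-hot vectors by the matrix
one_hot = np.eye(vocab_size)[token_ids]
assert np.allclose(one_hot @ embedding_matrix, dense_vectors)

print(dense_vectors.shape)
```

###The sentence becomes a matrix with one row per token and one column per embedding dimension, which is the matrix the convolution will scan.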

###Image matrices have a meaning when moved from left to right and top to bottom. But word matrices only have a purpose when moved from top to bottom. Because the right to left movement indicates the dimension of that vector, it doesn't make sense to perform 2D convolution. The width of the filter or kernel would be the same as that of the embedding dimension. 1D convolution implemented as feature extraction along one dimension is needed. 

###The global maximum of each convolution output is taken, since the position of a feature in a sentence is not as important as in an image. We care whether that feature is present or not. Unlike with images, filters of several different sizes are used for the convolution. 

###The flattening phase is not required, as the output of each convolution filter is a vector and not a matrix. 
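###The text version of convolution and pooling can be sketched with NumPy as follows. The sentence matrix and the filters hold random values, and the sizes (7 tokens, embedding dimension 4, one filter each of heights 2, 3 and 4) are assumptions for this illustration, not the values used in the model below.

```python
import numpy as np

def conv1d_global_max(sentence_matrix, kernel):
    """Slide `kernel` down the rows of `sentence_matrix` and keep the largest response."""
    k = kernel.shape[0]  # number of consecutive tokens covered by the filter
    n_positions = sentence_matrix.shape[0] - k + 1
    responses = np.array([
        # The kernel spans the full embedding width, so it only moves top to bottom
        np.sum(sentence_matrix[i:i + k] * kernel)
        for i in range(n_positions)
    ])
    responses = np.maximum(responses, 0)  # ReLU activation
    return responses.max()                # global max pooling: one number per filter

rng = np.random.default_rng(0)
sentence_matrix = rng.normal(size=(7, 4))                  # 7 tokens, embedding dimension 4
filters = [rng.normal(size=(k, 4)) for k in (2, 3, 4)]     # filters covering 2, 3 and 4 tokens

features = np.array([conv1d_global_max(sentence_matrix, f) for f in filters])
print(features.shape)
```

###Each filter contributes a single number regardless of the sentence length, so the result is already a vector and no flattening is needed.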

###We are done with the theory, so let's jump into the implementation. The implementation will be done in Google Colab. It offers free GPU and TPU instances, which speed up code execution. You will also learn how to use Google Colab.

### Dataset Link: http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip

# Stage 1: Importing dependencies.

In [None]:
import numpy as np
import math
import re #regex for string cleaning
import pandas as pd
from bs4 import BeautifulSoup

from google.colab import drive
#The best way to use files in Google colab is via Google Drive. So, we import drive module to connect it with Google colab

In [None]:
#We are asking for the Tensorflow version of 2.x (it can be 2.1, 2.0.2 or any such ones but it should start with 2)
#If it doesn't have that it gives any version it has
try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

from tensorflow.keras import layers #Used to create layers in our deep learning model
import tensorflow_datasets as tfds #Tensorflow datasets are ready-to-use datasets with Tensorflow or other Python ML frameworks

# Stage 2: Data preprocessing

## Loading files

###The first step is to load our files. As mentioned above, Drive is the most convenient way of accessing files in Google Colab. The mount method of drive connects one's Drive to Colab.

###'/content/drive' is the path at which Drive is located

In [None]:
#When you run this, it asks you to go to a URL for authentication. Once you land on the page, copy the code available and paste it in the textbox titled
#'Enter your authorization code' and hit enter.
drive.mount("/content/drive")

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


###cols is the list of column names of our dataset.

###`sentiment` : indicates if the sentiment is positive or negative. In the raw file, 0 denotes negative and 4 denotes positive (we will replace 4 by 1 during preprocessing)

###`id` : ID of the tweet

###`date` : date on which the tweet was sent

###`query` : this column is not very useful. All the values for this column are 'NO_QUERY' which means none of the tweets have any query.

###`user` : the user who tweeted (Twitter handle)

###`text` : the tweet 

In [None]:
cols = ["sentiment", "id", "date", "query", "user", "text"]

####**We will be using Pandas to work with our data. Pandas stores the data in a dataframe which is a neat tabular representation of the data.**
###Our dataset is a CSV file (Comma separated value) which means all our values in a line (a row if you think of the data as being in a table) are separated by commas.

###We use the read_csv method to read in the CSV file.

###`First argument`: The path to the file

###`Second argument`: We don't want Pandas to use any row of the file as the header, because we supply the column names ourselves. So, header = None

###`Third argument`: names = column names which are stored in the cols list. So, we set names = cols

###`Fourth argument`: skiprows = 1 tells Pandas to skip the first row of the file, so that row is not loaded as data

###`Fifth argument`: an optional argument, specifies which engine to use to parse the data, we select Python

###`Sixth argument`: encoding to use for the file, that is, how the bytes of the file are turned into characters. We specify latin1 because it maps every possible byte to a character, so reading never fails with a decoding error even if the file contains bytes that are not valid UTF-8. Characters outside Latin-1 (for example Chinese, Arabic or Hebrew text) will not be decoded correctly, but the cleaning step later keeps only English letters and basic punctuation anyway.

In [None]:
actual_data = pd.read_csv(
    "/content/drive/My Drive/Sentiment Analyzer Datasets/train.csv",
    header=None,
    names=cols,
    skiprows = 1,
    engine="python",
    encoding="latin1"
)

In [None]:
#1.6M tweets
actual_data.shape

(1599999, 6)

In [None]:
#no imbalance in classes
actual_data.sentiment.value_counts()

4    800000
0    799999
Name: sentiment, dtype: int64

###All emoticons have been removed from the dataset. The data was collected through the Twitter API using keyword search, and it was labelled automatically rather than by manually annotating each tweet. 

In [None]:
#We load data.csv, a smaller file with 30000 tweets, into a variable called train_data to use it in the following steps.
train_data = pd.read_csv(
    "/content/drive/My Drive/Sentiment Analyzer Datasets/data.csv",
    header=None,
    names=cols,
    skiprows = 1,
    engine="python",
    encoding="latin1"
)

In [None]:
#Shape returns the number of rows and columns in our dataset. There are 30000 rows and 6 columns
train_data.shape

(30000, 6)

In [None]:
#Seeing first five rows of our dataset
train_data.head()

Unnamed: 0,sentiment,id,date,query,user,text
0,0,1553795194,Sat Apr 18 15:13:59 PDT 2009,NO_QUERY,t_win,It's been the longest day ever! I still haven'...
1,0,2179002334,Mon Jun 15 08:30:28 PDT 2009,NO_QUERY,badsotheynv,I feel uber bad little ol lady is sick wanted ...
2,0,1936039755,Wed May 27 07:20:42 PDT 2009,NO_QUERY,mubi_just_do_it,goose just died...saddest scene i've seen...
3,0,2185132296,Mon Jun 15 16:56:05 PDT 2009,NO_QUERY,walkthistown,@alexamarzi I KNOWW dont move
4,0,2180496762,Mon Jun 15 10:33:02 PDT 2009,NO_QUERY,clare666,@Piewacket1 awwww pie... the 'once in a lifeti...


In [None]:
#no imbalance: distribution of data is preserved
train_data.sentiment.value_counts()

4    15000
0    15000
Name: sentiment, dtype: int64

In [None]:
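#Let's keep our master copy as it is, in case we want to return to it or see how the original file looked
#So, we create a variable called data holding a copy of train_data. Plain assignment (data = train_data) would only create a second name for the
#same dataframe, and the in-place changes made later would then also modify train_data.
data = train_data.copy()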

## Preprocessing

### Cleaning

###The only columns important for our text analysis are the tweet and its sentiment. We can remove all other columns. This has 2 advantages:
###1. Smaller dataframes lead to higher performance
###2. We can focus on what is required and need not worry about unimportant columns

In [None]:
#Pandas gives a method called drop to drop rows or columns from the dataframe.
#We need to specify which columns to drop (remove), axis : if 0: removes rows, if 1: removes columns
#By default, drop returns a new dataframe and leaves the original one unchanged. To make the change permanent, there are 2 methods:
#1. Assigning the result of this statement back to data (the variable containing our data). In this way we overwrite the old data
#2. Using the inplace argument, setting it to True tells Pandas to modify the original dataframe directly.
data.drop(["id", "date", "query", "user"],
          axis=1,
          inplace=True)

####Tweets may contain punctuation, extra white spaces and URL links, and they may mix upper and lower case. Every tweet needs to be cleaned, and we have a lot of tweets, so this is the perfect situation to use a function: a piece of code that can be used repeatedly. To bring all tweets to a standard form, let's define a function called clean_tweet. It takes a tweet as input.

###**BeautifulSoup** is a Python library used to parse HTML. HTML stands for Hyper Text Markup Language. This is the language behind the Web. Every webpage uses HTML to structure the layout and define components. Raw HTML is messy: apart from the text we need, it contains a lot of tags, formatting and structuring components. This is where BeautifulSoup comes in: it parses HTML using a parser such as lxml, html5lib or html.parser, and we can choose the one we want. We use lxml, which is generally the fastest of the three.

###`Line 1`: Creates a BeautifulSoup object with lxml parser. The first argument is the data we want to parse which is the tweet in our case. get_text() returns only the text.

###For text cleaning, regular expressions (regex) are a convenient tool. They allow us to specify patterns and search for strings that follow a pattern. Python provides them in the re module, which we imported using the line 'import re' in the first cell.

###`Line 2`: Many tweets mention other users using '@'. These mentions need to be removed. Removing can also be thought of as replacing that text with a whitespace. The re module provides a method called sub, which replaces every match of the pattern in the 1st argument by the 2nd argument in the text given as the 3rd argument.

###`1st arg`: the r prefix marks a raw string, so that backslashes in the pattern are not interpreted by Python. The pattern itself is within the quotes: '@' followed by upper or lower case letters or digits (A to Z, a to z, 0 to 9 is written as [A-Za-z0-9], where [] specifies a class of characters). '+' means one or more of them.

###`2nd arg`: a space in quotes => Whitespace

###`3rd arg`: tweet

###The next step is to remove any links in the tweet. Links start with http or https followed by :// and any number of letters, digits, dots and slashes. As before, the pattern is written as a raw string. 

###`1st arg`: Since links can start with either http or https, we need to make the s optional. '?' makes the preceding character optional, so we write 'https?'. Then comes '://', followed by the class [A-Za-z0-9./] for the content of the URL. Inside [], '.' and '/' are literal characters (a dot and a slash), so the class matches letters, digits, dots and slashes. '+' means one or more such characters. 

###Next, we keep only letters and the common punctuation marks '.', '!', '?' and the apostrophe. '^' at the start of [] stands for NOT, so the pattern matches every character that is not in the list, and each match is replaced by a whitespace.

###Since we have replaced all of the unnecessary content by whitespaces, many consecutive whitespaces may exist. The last step is to collapse each run of whitespaces into a single one. To match one or more whitespaces, we use ' +' (whitespace followed by +).

###After the manipulation, return tweet.

In [None]:
def clean_tweet(tweet):
    tweet = BeautifulSoup(tweet, "lxml").get_text()
    # Removing the @
    tweet = re.sub(r"@[A-Za-z0-9]+", ' ', tweet)
    # Removing the URL links
    tweet = re.sub(r"https?://[A-Za-z0-9./]+", ' ', tweet)
    # Keeping only letters and common punctuations used in text
    tweet = re.sub(r"[^a-zA-Z.!?']", ' ', tweet)
    # Removing additional whitespaces
    tweet = re.sub(r" +", ' ', tweet)
    return tweet
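###To see what each regex step does, here is a small walk-through on a made-up tweet. It uses only the re module and the same four patterns as clean_tweet (the BeautifulSoup step is left out); the tweet text and the example.com link are invented for this illustration.

```python
import re

raw = "@alice check   https://example.com/page it's GREAT!!! #100"

no_mentions = re.sub(r"@[A-Za-z0-9]+", " ", raw)                  # drop the @mention
no_links = re.sub(r"https?://[A-Za-z0-9./]+", " ", no_mentions)   # drop the URL
letters_only = re.sub(r"[^a-zA-Z.!?']", " ", no_links)            # drop '#', digits, etc.
cleaned = re.sub(r" +", " ", letters_only)                        # collapse runs of spaces

for step in (no_mentions, no_links, letters_only, cleaned):
    print(repr(step))
```

###Note that a single leading or trailing space can remain after the last step. If that matters for your use case, calling .strip() on the result removes it.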

###Now, we have to call this function on all our tweets. A for loop can be used, but a much more compact way is list comprehension (only takes 1 line). 

###We say call the function clean_tweet on each tweet in the text column of data.

In [None]:
data_clean = [clean_tweet(tweet) for tweet in data.text]

###We can see the possible values of a column and their respective counts using value_counts(). Let's apply this on the sentiment column. We see that 4 is used instead of 1 to denote positive sentiments. So, all the occurrences of 4 have to be replaced by 1. Using the values attribute on any column returns an array with all column values. 

###A very powerful concept used in Pandas is boolean masking. An example of this is used below. 'data_labels == 4' returns either True if a value is 4 or False if a value is not 4, which means 0 (since 0 and 4 are the only possible values). We can use this True, False list as an index to decide which values need to be replaced by 1. Wherever there is True, it replaces that value by 1. The values with zero remain as is. 

In [None]:
data['sentiment'].value_counts()

4    15000
0    15000
Name: sentiment, dtype: int64

In [None]:
data_labels = data.sentiment.values
data_labels[data_labels == 4] = 1
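###The same boolean masking idea on a tiny made-up array, so the intermediate mask is visible:

```python
import numpy as np

labels = np.array([0, 4, 4, 0, 4])   # made-up labels using the 0 / 4 convention
mask = labels == 4                   # True wherever the label is 4
labels[mask] = 1                     # only the positions where mask is True are overwritten

print(mask)
print(labels)
```

###The zeros are untouched because their positions in the mask are False.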

In [None]:
data.shape

(30000, 2)

### Tokenization

###Now that we are done with preprocessing steps, let's get into the meat of this project. Tokenization is the first step in any NLP task. A tokenizer separates text into components. A word tokenizer splits text into words, a sentence tokenizer splits it into sentences, and a subword tokenizer splits words into smaller pieces. The tokenizer is also what converts text into numbers. Assigning a number to every word by hand would be tedious, and such a hand-made mapping could not convert a new, unseen word into numbers.  

###Luckily for us, tensorflow datasets, which we imported before, can create the tokenizer for us. We just give it the corpus (the tweets) and a target vocabulary size. The target can be smaller than the number of unique words in the corpus: when a word has no entry of its own, the encoder composes it from smaller subword pieces. This is useful for words that appear only a few times in the corpus. The result is an encoder, an object that can transform any string into a list of numbers. We set the target vocabulary size to 2**11 = 2048. It takes quite some time to create the tokenizer. 

In [None]:
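#Note: in newer tensorflow_datasets releases this class is available as tfds.deprecated.text.SubwordTextEncoder
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    data_clean, target_vocab_size=2**11
)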

data_inputs = [tokenizer.encode(sentence) for sentence in data_clean]

### Padding

###We don't train with a single example at a time. We train in batches. But in order to do that, we need all tweets to have the same length. A simple way is to add zeros at the end of each sentence so that they all have the same length. An important point is that zero is not a number used by our tokenizer. Zero doesn't correspond to any token, so it can be used without altering the meaning of our sentences. The first thing to do is to determine the maximum length of our tweets. Note that the length of a tweet after encoding is its number of tokens, not its number of characters as it was before tokenizing. 

###pad_sequences is the method used for padding. It takes the corpus, the value to pad with, and the way we want to pad: 'post' indicates we want to pad at the end. maxlen is the maximum length, which we compute first in the cell below.


In [None]:
MAX_LEN = max([len(sentence) for sentence in data_inputs])
data_inputs = tf.keras.preprocessing.sequence.pad_sequences(data_inputs,
                                                            value=0,
                                                            padding="post",
                                                            maxlen=MAX_LEN)
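###What post-padding does can be sketched in plain NumPy. The helper pad_post and the token ids below are made up for this illustration; the notebook itself relies on pad_sequences.

```python
import numpy as np

def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), value, dtype=int)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

encoded = [[12, 7, 3], [5], [9, 9, 1, 4]]   # made-up token ids of three tweets
padded = pad_post(encoded)
print(padded)
```

###All rows now have the length of the longest tweet, so they can be stacked into one array and fed to the model in batches.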

###We need to split our dataset into training and testing sets. Our data is ordered: the first half has negative sentiments while the second half has positive sentiments. We will take 10% of the data as the testing set. To keep the proportion of positive and negative sentiments in the test set the same as in the full data, so that the measured accuracy is representative, we need 1500 positive and 1500 negative tweets (3000 tweets in total).

###With 30000 rows, the valid indices are 0 to 29999: negatives occupy 0 to 14999 and positives occupy 15000 to 29999. We draw 1500 distinct indices from each range using the random.choice() method of numpy. First arg: the array of candidate indices, second arg: number of indices required, replace=False: no index is drawn twice, so the test set contains exactly 3000 different tweets.

### Splitting into training/testing set



In [None]:
test_neg_idx = np.random.choice(np.arange(0, 15000), 1500, replace=False)
test_pos_idx = np.random.choice(np.arange(15000, 30000), 1500, replace=False)
test_idx = np.concatenate((test_neg_idx, test_pos_idx))

###Using indices, we obtain the rows by accessing it like an array from both data_inputs and data_labels. Training inputs is obtained by deleting the test indices (randomly generated indices). 

###Axis specifies if we want to remove along row or column. Since we need to remove row-wise, set axis to 0. Delete from both data_inputs and data_labels to generate data and targets in their respective variables. 

###Axis is not required for labels since it's a vector and the only possible way to delete is row-wise.

In [None]:
test_inputs = data_inputs[test_idx]
test_labels = data_labels[test_idx]
train_inputs = np.delete(data_inputs, test_idx, axis=0)
train_labels = np.delete(data_labels, test_idx)
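###A toy version of such a split, with 10 tweets per class instead of 15000, makes it easy to check that the test set is balanced and that no row ends up in both sets. All names and sizes here are assumptions for this sketch, and the indices are drawn without replacement.

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_class, n_test_per_class = 10, 2

inputs = np.arange(2 * n_per_class).reshape(-1, 1)         # stand-in for data_inputs
labels = np.array([0] * n_per_class + [1] * n_per_class)   # negatives first, then positives

# Draw distinct test indices separately from the negative and the positive half
neg_idx = rng.choice(np.arange(0, n_per_class), n_test_per_class, replace=False)
pos_idx = rng.choice(np.arange(n_per_class, 2 * n_per_class), n_test_per_class, replace=False)
test_idx = np.concatenate((neg_idx, pos_idx))

test_x, test_y = inputs[test_idx], labels[test_idx]
train_x = np.delete(inputs, test_idx, axis=0)   # remove the test rows
train_y = np.delete(labels, test_idx)

print(len(train_x), len(test_x))
```

###Because the indices are distinct, the training set has exactly the remaining rows and the test set has the same number of positive and negative examples.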

# Stage 3: Model building

###Architecture: First comes an embedding layer, which converts each token of a sentence into a vector with the given dimension. Then come 3 different kinds of 1D convolution, with filters of size two (bigram), three (trigram) and four (fourgram). Each filter slides along one dimension, the sequence of tokens, and we apply a certain number of filters of each size. After applying the activation function, the output of each filter is a vector. We take the max of each of those vectors via max pooling and concatenate the results. We then apply dense layers (a small feed-forward network), and the last one outputs the classification. 

###Our class is called DCNN, which stands for Deep Convolutional Neural Network, and it inherits from tf.keras.Model. Basically, we are building our own model. 

###First is the `__init__` method, which is the initialization function and has to be implemented for every custom model or layer in TensorFlow. The first parameter is self (it refers to the current object when instantiating a class). This is followed by all the variables that we need to build our model. 

In [None]:
class DCNN(tf.keras.Model):
    
    def __init__(self,
                 vocab_size,
                 emb_dim=128, #128 dimensions for embedding layer is default
                 nb_filters=50, #number of filters = 50 (default value), number of times to apply convolution
                 FFN_units=512, #number of units in the feed forward neural network = 512 (default value)
                 nb_classes=2, #number of classes = 2 (positive or negative)
                 dropout_rate=0.1, #default value. Dropout is a tool to turn off certain parameters and variables in order to avoid overfitting
                 training=False, #not used inside __init__. Whether dropout is active is decided by the training argument of call(), as dropout is 
                 #only applied during training. 
                 name="dcnn"): #name of our model
        
        #call the init function from the class we are inheriting from. Done by calling the super method giving the name of the class we are writing now and 
        #self. Give the name of our model to init method to initialize properly
        super(DCNN, self).__init__(name=name)  
        
        #Defining layers
        #1. Embedding layer with vocab size and embedding dimensions
        self.embedding = layers.Embedding(vocab_size,
                                          emb_dim)
        self.bigram = layers.Conv1D(filters=nb_filters,
                                    kernel_size=2, #filter size
                                    padding="valid", #"valid" means no padding: the filter is only applied where it fits entirely inside the sequence, 
                            #so the output is slightly shorter than the input (length - kernel_size + 1). The stride (step size) is 1, word by word. 
                                    activation="relu") #ReLU (Rectified Linear Unit) is a standard activation function to introduce non-linearity into our model
        self.trigram = layers.Conv1D(filters=nb_filters,
                                     kernel_size=3,
                                     padding="valid",
                                     activation="relu")
        self.fourgram = layers.Conv1D(filters=nb_filters,
                                      kernel_size=4,
                                      padding="valid",
                                      activation="relu")
        
        #1D Max Pooling since it's a 1D convolution 
        self.pool = layers.GlobalMaxPool1D() # no training variable so we can use the same layer for each pooling step
        self.dense_1 = layers.Dense(units=FFN_units, activation="relu") #Dense layer 
        self.dropout = layers.Dropout(rate=dropout_rate) #Since there is a lot of variables and connections between them, this is a good place to apply Dropout 
        #to avoid overfitting
        
        #The last dense layer depends on how many classes we have. If there are 2 classes, we need a single number between 0 and 1 as the output.
        #Below 0.5, belongs to class 0 (Negative sentiment). Above 0.5, belongs to class 1 (Positive sentiment)
        if nb_classes == 2:
            self.last_dense = layers.Dense(units=1, #1 unit means a single number 
                                           activation="sigmoid") #Sigmoid takes a number between -infinity and +infinity and returns a value between 0 and 1. 
                                           #This is the choice of activation in binary classification tasks
        else:
            self.last_dense = layers.Dense(units=nb_classes,
                                           activation="softmax") #Softmax gives the number of values (equal to number of classes) between 0 and 1 whose sum is 
                                           #1. It basically indicates the probability of belonging to each class
    
    #After defining the layers, we have to apply them. The call function does this: it computes the outputs from the inputs
    def call(self, inputs, training=False): #training indicates whether to apply dropout or not; Keras sets it to True inside fit()
        x = self.embedding(inputs) #applying embedding
        x_1 = self.bigram(x)
        x_1 = self.pool(x_1)
        x_2 = self.trigram(x)
        x_2 = self.pool(x_2)
        x_3 = self.fourgram(x)
        x_3 = self.pool(x_3)
        
        merged = tf.concat([x_1, x_2, x_3], axis=-1) # (batch_size, 3 * nb_filters), axis = -1 indicates last axis where all the pooling values are present
        merged = self.dense_1(merged) #First dense layer (starting feedforward process)
        merged = self.dropout(merged, training)
        output = self.last_dense(merged) #outputs
        
        return output
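###To follow the tensor shapes through call, the arithmetic can be sketched without TensorFlow. The padded length of 40 tokens is a made-up value (the real one is MAX_LEN), and conv1d_valid_length is a helper written for this example only.

```python
def conv1d_valid_length(seq_len, kernel_size, stride=1):
    """Output length of a 1D convolution with padding='valid'."""
    return (seq_len - kernel_size) // stride + 1

seq_len, nb_filters = 40, 100   # hypothetical padded tweet length and number of filters per size

for name, kernel_size in [("bigram", 2), ("trigram", 3), ("fourgram", 4)]:
    steps = conv1d_valid_length(seq_len, kernel_size)
    print(f"{name}: conv output (batch, {steps}, {nb_filters}) -> pooled (batch, {nb_filters})")

# The three pooled outputs are concatenated along the last axis
merged_width = 3 * nb_filters
print("concatenated feature vector width:", merged_width)
```

###Global max pooling removes the length dimension, which is why the three branches can be concatenated even though their convolution outputs have different lengths.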

# Stage 4: Application

## Config

In [None]:
#Model parameters (Global variables)
#Rather than passing the values as arguments, it's better to pass the variables containing those values. In this way, we can change all the values easily to 
#modify our model

VOCAB_SIZE = tokenizer.vocab_size

EMB_DIM = 200

NB_FILTERS = 100

FFN_UNITS = 256

NB_CLASSES = len(set(train_labels))

DROPOUT_RATE = 0.2

BATCH_SIZE = 32

NB_EPOCHS = 5
#You can play around with these parameters (hyperparameter tuning) to achieve the highest accuracy.

## Training

In [None]:
#Creating an instance of the model and pass all the required parameters as defined before
Dcnn = DCNN(vocab_size=VOCAB_SIZE,
            emb_dim=EMB_DIM,
            nb_filters=NB_FILTERS,
            FFN_units=FFN_UNITS,
            nb_classes=NB_CLASSES,
            dropout_rate=DROPOUT_RATE)

###Loss function should return high values for bad predictions and low values for good predictions.



![Binary Cross Entropy Loss](https://miro.medium.com/max/1096/1*rdBw0E-My8Gu3f_BOB6GMA.png)

###y is the label (1 for positive tweets and 0 for negative tweets) and p(y) is the predicted probability of a tweet being positive for all N tweets (total number of tweets). 

###For each positive tweet (y=1), it adds log(p(y)) to the loss, that is, the log probability of it being positive. Conversely, it adds log(1-p(y)), that is, the log probability of it being negative, for each negative tweet (y=0).

###Since we’re trying to compute a loss, we need to penalize bad predictions, right? If the probability associated with the true class is 1.0, we need its loss to be zero. Conversely, if that probability is low, say, 0.01, we need its loss to be HUGE!

###It turns out, taking the (negative) log of the probability suits us well enough for this purpose (since the log of values between 0.0 and 1.0 is negative, we take the negative log to obtain a positive value for the loss).
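###The behaviour described above can be checked numerically with a minimal NumPy sketch of binary cross-entropy. The labels and probabilities are made up, and the clipping constant eps is an assumption added here to avoid log(0).

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    p = np.clip(p_pred, eps, 1 - eps)   # keep probabilities away from exactly 0 and 1
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 1, 0, 0])
confident_right = np.array([0.99, 0.95, 0.05, 0.01])   # high probability on the true class
confident_wrong = np.array([0.01, 0.05, 0.95, 0.99])   # high probability on the wrong class

loss_good = binary_cross_entropy(y_true, confident_right)
loss_bad = binary_cross_entropy(y_true, confident_wrong)
print(loss_good, loss_bad)
```

###Good predictions give a loss close to zero, while confident mistakes are penalized heavily.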

###Entropy is a measure of the uncertainty associated with a given distribution q(y). What if all our tweets were positive? What would be the uncertainty of that distribution? ZERO, right? After all, there would be no doubt about the sentiment of a tweet: it is always positive! So, entropy is zero!

###On the other hand, what if we knew exactly half of the tweets were positive and the other half, negative? That’s the worst case scenario, right? We would have absolutely no edge on guessing the sentiment of a tweet: it is totally random! For that case, entropy is given by the formula below (we have two classes (sentiments)— positive or negative — hence, 2):

![alt text](https://miro.medium.com/max/223/1*1R1M9mDGxcrN3tGy8M9atw.png)

###For every other case in between, we can compute the entropy of a distribution, like our q(y), using the formula below, where C is the number of classes:

![alt text](https://miro.medium.com/max/612/1*Y0OJOAehqME6ePORQzzliQ.png)


###So, if we know the true distribution of a random variable, we can compute its entropy. But, if that’s the case, why bother training a classifier in the first place? After all, we KNOW the true distribution…

###But, what if we DON’T? Can we try to approximate the true distribution with some other distribution, say, p(y)? We can!

###Let’s assume our points follow this other distribution p(y). But we know they are actually coming from the true (unknown) distribution q(y), right?

###If we compute entropy like this, we are actually computing the cross-entropy between both distributions:

![alt text](https://miro.medium.com/max/640/1*loucyTXzGHuHi6D4PxjDlA.png)

###If we, somewhat miraculously, match p(y) to q(y) perfectly, the computed values for both cross-entropy and entropy will match as well.
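###These statements about entropy and cross-entropy can be verified with a short sketch. It uses natural logarithms and made-up two-class distributions; the helper names are chosen for this example.

```python
import numpy as np

def entropy(q):
    """Entropy of a discrete distribution q (natural log), treating 0 * log(0) as 0."""
    q = np.asarray(q, dtype=float)
    nonzero = q[q > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))

def cross_entropy(q, p):
    """Cross-entropy between the true distribution q and the approximation p."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    keep = q > 0
    return float(-np.sum(q[keep] * np.log(p[keep])))

all_positive = [1.0, 0.0]   # every tweet is positive
balanced = [0.5, 0.5]       # half positive, half negative
approximation = [0.8, 0.2]  # an imperfect guess at the balanced distribution

print(entropy(all_positive))
print(entropy(balanced), np.log(2))
print(cross_entropy(balanced, approximation), cross_entropy(balanced, balanced))
```

###The entropy is zero when all tweets are positive and log(2) for the balanced case, and the cross-entropy is larger than the entropy unless the approximation matches the true distribution.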

##Adam Optimizer: It is used to iteratively update the weights 

###1. Computationally efficient.
###2. Little memory requirements.
###3. Works well with large datasets or a large number of parameters
###4. Requires less tuning

###Good default settings for the tested machine learning problems are alpha=0.001, beta1=0.9, beta2=0.999 and epsilon=10^-8. 

##Adam Configuration Parameters
###alpha: Also referred to as the learning rate or step size. It controls how much the weights are changed at each update (e.g. 0.001). Larger values (e.g. 0.3) result in faster initial learning before the rate is updated. Smaller values (e.g. 1.0E-5) slow learning right down during training.
###beta1: The exponential decay rate for the first moment estimates (e.g. 0.9).
###beta2: The exponential decay rate for the second-moment estimates (e.g. 0.999). This value should be set close to 1.0 on problems with a sparse gradient (e.g. NLP and computer vision problems).
###epsilon: A very small number to prevent any division by zero in the implementation (e.g. 1E-8).

###Popular deep learning libraries generally use the default parameters recommended by the paper.

###Adam combines the best of AdaGrad and RMSProp optimizers

###Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.
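###A minimal sketch of one Adam update, written from the description above with the bias-correction step of the original algorithm. The function adam_step and the toy objective f(w) = (w - 3)^2 are assumptions for this illustration; Keras' own optimizer is what the notebook actually uses.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimise the toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, alpha=0.01)
print(w)
```

###On the very first step the update has a size of about alpha regardless of how large the gradient is, which is part of why Adam needs little tuning of the learning rate.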

In [None]:
#Model compilation depending on the number of classes
if NB_CLASSES == 2:
    Dcnn.compile(loss="binary_crossentropy", #standard loss in binary classification
                 optimizer="adam", #standard
                 metrics=["accuracy"]) #metrics to track during training
else: #if more than 2 classes in a different application
    Dcnn.compile(loss="sparse_categorical_crossentropy",
                 optimizer="adam",
                 metrics=["sparse_categorical_accuracy"])

###Let's create a checkpoint before training our model. This is a way to store our model once it's trained so that we need not train from scratch when we want to use it later. 

In [None]:
#Defining the path
checkpoint_path = "./drive/My Drive/NLP/ckpt/" 

#Creating checkpoint object
ckpt = tf.train.Checkpoint(Dcnn=Dcnn)

#Creating checkpoint manager
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5) #max_to_keep is maximum number of checkpoints we want to keep

#Checking if there is already a checkpoint in the checkpoint path. If so, we will restore it and print a message saying the same.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print("Latest checkpoint restored!!")

In [None]:
#Fitting the model
Dcnn.fit(train_inputs,
         train_labels,
         batch_size=BATCH_SIZE,
         epochs=NB_EPOCHS)
#Saving the checkpoint after training
ckpt_manager.save()

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


'./drive/My Drive/NLP/ckpt/ckpt-1'

## Evaluation

###Let's see how our model performs on new or unknown data. 

In [None]:
results = Dcnn.evaluate(test_inputs, test_labels, batch_size=BATCH_SIZE)
print(results)
#Outputs [loss, accuracy]

[0.9236537218093872, 0.7450000047683716]


In [None]:
Dcnn.metrics_names

['loss', 'accuracy']

###For testing our own sentences, we have to encode the sentence using the tokenizer and convert it into a numpy array. We set the training argument to False, as we aren't training and dropout isn't required. The output will be a tensor, which is hard to read. So, let's convert it into numpy format.

###An output value very close to zero indicates negative sentiment, while a value close to 1 indicates positive sentiment. You can also choose a threshold for classification. A standard one is 0.5 (above 0.5, positive sentiment and below 0.5, negative sentiment).

In [None]:
Dcnn(np.array([tokenizer.encode("He is the best")]), training=False).numpy()

In [None]:
Dcnn(np.array([tokenizer.encode("Doesn't make sense")]), training=False).numpy()

In [None]:
Dcnn(np.array([tokenizer.encode("He sucks at playing")]), training=False).numpy()

In [None]:
Dcnn(np.array([tokenizer.encode("Why does he look ugly")]), training=False).numpy()

In [None]:
Dcnn(np.array([tokenizer.encode("He is a great guy")]), training=False).numpy()

In [None]:
Dcnn(np.array([tokenizer.encode("You are so funny")]), training=False).numpy()
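###To turn such outputs into labels, a small helper can apply the threshold. This is a sketch: to_label is a hypothetical helper, and the probabilities below are made-up numbers standing in for the output of Dcnn(...).numpy(), not actual predictions of the model.

```python
import numpy as np

def to_label(probabilities, threshold=0.5):
    """Map sigmoid outputs of shape (batch, 1) to 'positive' / 'negative' strings."""
    flat = np.asarray(probabilities).reshape(-1)
    return ["positive" if p > threshold else "negative" for p in flat]

# Made-up probabilities with the same shape as the model output for three sentences
example_outputs = np.array([[0.93], [0.08], [0.51]])
predicted = to_label(example_outputs)
print(predicted)
```

###Raising or lowering the threshold trades false positives against false negatives, so 0.5 is a reasonable default rather than a requirement.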

###The model reached about 74.5% accuracy on the test dataset, and it seems to work reasonably well on our sentences. Of course, the data used is far from perfect: it is only a small set of tweets, and it may not contain all the possible words. You have to get your own dataset depending on the task at hand. Still, this was a good dataset to show that our model works. You can also try with your sentences or even your own dataset. In this project we started from scratch, learnt theory and implemented a sentiment analyzer. Hope you like it!

In [None]:
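#joblib pickles Python objects, and dump(Dcnn, ...) raised a TypeError for this Keras model in this environment.
#Keras' own save_weights method is the supported way to persist the trained parameters.
Dcnn.save_weights("dcnn.weights.h5")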