<a href="https://colab.research.google.com/github/nidhi76/Sentiment-Analysis/blob/master/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sentiment analysis
Sentiment analysis is a natural language processing problem in which a model reads a piece of text and predicts the sentiment behind it. Here, we use the Keras deep learning library in Python to predict whether a movie review is positive or negative.

## Data description
The [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and another 25,000 for testing. The problem is to determine whether a given movie review has a positive or negative sentiment. The reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
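
To make the encoding concrete, here is a minimal sketch with a made-up five-word vocabulary. The names `toy_word_index`, `encode`, and `decode` are hypothetical and not part of Keras. The sketch follows the convention that `imdb.load_data` uses by default: indexes 0, 1, and 2 are reserved for padding, start-of-sequence, and out-of-vocabulary words, so real word ranks are shifted by 3.

```python
# Toy word index: word -> rank by frequency (1 = most frequent).
# Hypothetical vocabulary for illustration only.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5}

PAD, START, OOV = 0, 1, 2   # reserved indexes (Keras defaults)
INDEX_FROM = 3              # real words start after the reserved block


def encode(text, word_index, num_words=None):
    """Turn a review into a list of integers, Keras-IMDB style."""
    ids = [START]
    for word in text.lower().split():
        rank = word_index.get(word)
        if rank is None:
            ids.append(OOV)
            continue
        idx = rank + INDEX_FROM
        # Words outside the top `num_words` indexes are replaced by OOV.
        ids.append(idx if num_words is None or idx < num_words else OOV)
    return ids


def decode(ids, word_index):
    """Invert `encode`; reserved indexes are shown as '?'."""
    id_to_word = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    return " ".join(id_to_word.get(i, "?") for i in ids)


encoded = encode("The movie was surprisingly great", toy_word_index)
print(encoded)                          # [1, 4, 5, 6, 2, 7]
print(decode(encoded, toy_word_index))  # ? the movie was ? great
```

The leading `?` in a decoded review is the start marker, and every other `?` stands for a word that fell outside the vocabulary.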

## Loading dataset
First, we load the complete dataset and analyze some of its properties.


In [None]:
import numpy as np
from matplotlib import pyplot
import keras
from keras import regularizers, layers
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten
from keras.preprocessing import sequence

In [None]:
# imdb.load_data calls np.load internally. Older Keras versions rely on np.load
# allowing pickled data by default, while newer NumPy versions default to
# allow_pickle=False, so np.load is patched temporarily.

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# keep a reference to the original np.load
np_load_old = np.load

# force allow_pickle=True and pass every other argument through unchanged
np.load = lambda *a, **k: np_load_old(*a, **{**k, "allow_pickle": True})

In [None]:
# call load_data with allow_pickle implicitly set to true
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old

X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)

print(X.shape)
print(X_train.shape)

(50000,)
(25000,)
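
The patch-and-restore steps above leave `np.load` modified if `imdb.load_data` raises before the restore line runs. A minimal sketch of a safer pattern wraps the patch in a context manager so the original function is always restored. The helper name `allow_pickle_load` is hypothetical and not part of NumPy or Keras. The demo uses a small in-memory object array instead of the real dataset.

```python
import io
from contextlib import contextmanager

import numpy as np


@contextmanager
def allow_pickle_load():
    """Temporarily force allow_pickle=True for every np.load call."""
    original_load = np.load

    def patched_load(*args, **kwargs):
        kwargs["allow_pickle"] = True
        return original_load(*args, **kwargs)

    np.load = patched_load
    try:
        yield
    finally:
        np.load = original_load  # restored even if the body raises


# Demo with an in-memory object array instead of the real dataset.
buffer = io.BytesIO()
np.save(buffer, np.array([[1, 2, 3], [4, 5]], dtype=object), allow_pickle=True)

buffer.seek(0)
with allow_pickle_load():
    ragged = np.load(buffer)
print(len(ragged))
```

With such a helper, the loading cell could be reduced to a single `with allow_pickle_load():` block around the `imdb.load_data(num_words=10000)` call. Recent Keras releases may not need the patch at all.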


## Let's look at some reviews

In [None]:
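# get_word_index() maps each word to its integer index
word_to_id = keras.datasets.imdb.get_word_index()
id_to_word = {value: key for key, value in word_to_id.items()}
for i in range(15, 20):
  print("********************************************")
  # indexes are offset by 3 because 0, 1 and 2 are reserved (padding, start, unknown)
  print(' '.join(id_to_word.get(idx - 3, '?') for idx in X_train[i]))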

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
********************************************
? a total waste of time just throw in a few explosions non stop fighting exotic cars a deranged millionaire slow motion computer generated car crashes and last but not least a hugh ? like character with wall to wall hot babes and mix in a ? and you will have this sorry excuse for a movie i really got a laugh out of the dr evil like heavily ? compound the plot was somewhere between preposterous and non existent how many ? are willing to make a 25 million dollar bet on a car race answer 4 but didn't they become ? through ? responsibility this was written for ? males it plays like a video game i did enjoy the ? ii landing in the desert though
********************************************
? laputa castle in the sky is the bomb the message is as strong as his newer works and more pure fantastic and flying pirates how could it be any better the ar

## Summarize the data

In [None]:
def summarize_data():
  """
  Output:
                    classes: list, unique classes in y_train
                no_of_words: int, number of unique words in dataset X
     list_of_review_lengths: list, length of each review in X
         mean_review_length: float, mean of list_of_review_lengths
          std_review_length: float, standard deviation of list_of_review_lengths
  """
  classes = list(np.unique(y_train))

  # collect every distinct word index across all reviews
  no_of_words = len({word for review in X for word in review})

  list_of_review_lengths = [len(review) for review in X]
  mean_review_length = np.mean(list_of_review_lengths)
  std_review_length = np.std(list_of_review_lengths)
  return classes, no_of_words, list_of_review_lengths, mean_review_length, std_review_length


classes, no_of_words, list_of_review_lengths, mean_review_length, std_review_length = summarize_data()
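
As a quick check of what these statistics mean, the sketch below computes them for three made-up encoded reviews (`toy_reviews` is a hypothetical stand-in for `X`). It uses the population standard deviation, which is what `np.std` returns by default.

```python
import statistics

# Three hypothetical encoded reviews (lists of word indexes).
toy_reviews = [[1, 14, 22, 16], [1, 14, 9], [1, 22, 16, 43, 9]]

# Vocabulary size: number of distinct indexes across all reviews.
unique_words = {index for review in toy_reviews for index in review}

# Length statistics.
lengths = [len(review) for review in toy_reviews]
mean_length = statistics.mean(lengths)
std_length = statistics.pstdev(lengths)  # population std, like np.std

print(len(unique_words))      # 6
print(lengths)                # [4, 3, 5]
print(mean_length)            # 4
print(round(std_length, 3))   # 0.816
```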


In [None]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

## One-hot encode the output data

In [None]:
def one_hot(y):
  """
  Inputs:
    y: numpy array with class labels
  Outputs:
    y_oh: numpy array with corresponding one-hot encodings
  """
  y_oh = np.zeros((y.shape[0], 2)) 
  for i in range(y.shape[0]):
    if y[i]==0:
      y_oh[i][0]=1
    else:
      y_oh[i][1]=1
  return y_oh
y_train = one_hot(y_train)
y_test = one_hot(y_test)
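
The same encoding can be written without an explicit loop. A minimal sketch, assuming the labels are integers in `{0, 1}` (`labels` is a made-up example array): indexing the rows of a 2×2 identity matrix with the labels picks the matching one-hot row for each entry.

```python
import numpy as np

labels = np.array([1, 0, 0, 1])   # hypothetical class labels

# Row k of the identity matrix is the one-hot vector for class k.
one_hot_labels = np.eye(2)[labels]

print(one_hot_labels)
# [[0. 1.]
#  [1. 0.]
#  [1. 0.]
#  [0. 1.]]

# argmax along the class axis recovers the original labels.
print(np.argmax(one_hot_labels, axis=1))   # [1 0 0 1]
```

Keras also ships `keras.utils.to_categorical`, which serves the same purpose.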

### Multi-hot encode the input data

The sequences all have different lengths, and our vocabulary size is 10K. To turn each sequence into a fixed-size vector: <br>

1) Initialize a vector of dimension 10,000 with all values set to 0. <br>
2) For every token in the sequence that is present in the vocabulary, set the corresponding position to 1 and leave all other positions at 0. <br>

For example, take Vocabulary = {'I': 0, '…': 1, 'eat': 2, 'mango': 3, 'fruit': 4, 'happy': 5, 'you': 6}. <br>
We have two sequences, and the multi-hot encoding of each has dimension 7 (the vocabulary size). <br>
1) *Mango is my favourite fruit* becomes *Mango ? ? ? fruit* after removing the words that are not in the vocabulary. Hence the multi-hot encoding has two 1's, at the positions of mango and fruit, i.e. [0, 0, 0, 1, 1, 0, 0]. <br>
2) Similarly, *I love to eat mango* = *I ? ? eat mango* = [1, 0, 1, 1, 0, 0, 0].
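
The first example can be reproduced in a few lines. This is a minimal sketch on the toy vocabulary, not the function used for the IMDB data. The helper `multi_hot` and the placeholder token at index 1 are assumptions, and the second sentence is made up for illustration.

```python
import numpy as np

# Toy vocabulary from the example above. The token at index 1 is not
# used in the example, so "<unused>" is a placeholder assumption.
vocabulary = {"i": 0, "<unused>": 1, "eat": 2, "mango": 3,
              "fruit": 4, "happy": 5, "you": 6}


def multi_hot(sentence, vocab):
    """Return a 0/1 vector with a 1 at the index of every known word."""
    vector = np.zeros(len(vocab), dtype=int)
    for word in sentence.lower().split():
        if word in vocab:          # out-of-vocabulary words are skipped
            vector[vocab[word]] = 1
    return vector


print(multi_hot("Mango is my favourite fruit", vocabulary))  # [0 0 0 1 1 0 0]
print(multi_hot("you eat fruit", vocabulary))                # [0 0 1 0 1 0 1]
```

Word order and repetition are lost in this representation: only the presence of each vocabulary word is recorded.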

In [None]:
def multi_hot_encode(sequences, dimension):
  """
    Input:
          sequences: array of sequences (lists of word indexes) from X_train or X_test
          dimension: int, size of the vocabulary

    Output:
          results: numpy matrix of shape (len(sequences), dimension)
  """
  results = np.zeros((len(sequences), dimension))
  for i, sequence in enumerate(sequences):
    for j in sequence:
      # word indexes start at 1, so index j is stored in column j-1
      results[i][j-1] = 1
  return results


In [None]:
x_train = multi_hot_encode(X_train, 10000)
x_test = multi_hot_encode(X_test, 10000)

print("x_train ", x_train.shape)
print("x_test ", x_test.shape)


x_train  (25000, 10000)
x_test  (25000, 10000)
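
Dense multi-hot matrices of this shape are large. As a back-of-the-envelope check, assuming the default `float64` dtype that `np.zeros` uses, each matrix needs about 2 GB. The sketch below also shows what smaller dtypes would need. It is plain arithmetic and does not allocate anything.

```python
rows, cols = 25_000, 10_000   # shape of x_train and x_test

# Bytes per element for a few NumPy dtypes.
bytes_per_element = {"float64": 8, "float32": 4, "uint8": 1}

for dtype, size in bytes_per_element.items():
    gigabytes = rows * cols * size / 1e9
    print(f"{dtype}: {gigabytes:.2f} GB")
# float64: 2.00 GB
# float32: 1.00 GB
# uint8: 0.25 GB
```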


## Splitting the data into train and validation

In [None]:
from sklearn.model_selection import train_test_split
x_strat, x_dev, y_strat, y_dev = train_test_split(x_train, y_train,test_size=0.40,random_state=0, stratify=y_train)
x_strat.shape

(15000, 10000)

## Building Model
We build a multi-layer feed-forward network in Keras.

### Creating the model

In [None]:
def create_model():
    """
    Output:
        model: A compiled keras model
    """
    model = Sequential()
    # Each of the 10,000 multi-hot entries (0 or 1) is looked up in the embedding table
    model.add(Embedding(15000, 32, input_length=10000))
    model.add(Flatten())
    model.add(Dense(200, activation='relu'))
    # Two sigmoid outputs, one per one-hot encoded class
    model.add(Dense(2, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
  
model = create_model()
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 10000, 32)         480000    
_________________________________________________________________
flatten (Flatten)            (None, 320000)            0         
_________________________________________________________________
dense (Dense)                (None, 200)               64000200  
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 402       
Total params: 64,480,602
Trainable params: 64,480,602
Non-trainable params: 0
_________________________________________________________________
None
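
The parameter counts in the summary can be verified by hand. The sketch below recomputes them from the layer sizes in `create_model` (the variable names are illustrative):

```python
vocab_size, embed_dim, input_length = 15_000, 32, 10_000
hidden_units, output_units = 200, 2

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Flatten has no weights; it reshapes (10000, 32) into one long vector.
flattened_size = input_length * embed_dim

# Dense: one weight per (input, unit) pair plus one bias per unit.
dense_params = flattened_size * hidden_units + hidden_units
output_params = hidden_units * output_units + output_units

total = embedding_params + dense_params + output_params
print(embedding_params, flattened_size, dense_params, output_params, total)
# 480000 320000 64000200 402 64480602
```

Almost all of the roughly 64 million weights sit in the first dense layer, because flattening 10,000 positions × 32 dimensions gives it 320,000 inputs.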


### Fit the Model

In [None]:
def fit(model):
    """
    Action:
        Fit the model created above for 5 epochs, using x_strat and y_strat as training data
        and x_dev and y_dev as validation data (verbose=2), and store the result in 'history'.

        Evaluate the model on x_test and y_test (verbose=0) and store the result in 'scores'.
    Output:
        scores: list of length 2, [test loss, test accuracy]
        history_dict: output of history.history where history is output of model.fit()
    """
    history = model.fit(x_strat, y_strat, validation_data=(x_dev, y_dev), verbose=2, epochs=5)
    scores = model.evaluate(x_test, y_test, verbose=0)
    history_dict = history.history
    return scores, history_dict

scores, history_dict = fit(model)


Epoch 1/5
469/469 - 37s - loss: 0.5291 - accuracy: 0.7677 - val_loss: 0.2945 - val_accuracy: 0.8803
Epoch 2/5
469/469 - 36s - loss: 0.2073 - accuracy: 0.9200 - val_loss: 0.2934 - val_accuracy: 0.8786
Epoch 3/5
469/469 - 37s - loss: 0.1154 - accuracy: 0.9583 - val_loss: 0.3659 - val_accuracy: 0.8711
Epoch 4/5
469/469 - 37s - loss: 0.0568 - accuracy: 0.9813 - val_loss: 0.4785 - val_accuracy: 0.8727
Epoch 5/5
469/469 - 37s - loss: 0.0252 - accuracy: 0.9920 - val_loss: 0.6138 - val_accuracy: 0.8662
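
The log shows training accuracy climbing to about 99% while validation loss rises after epoch 2, a typical sign of overfitting. A small sketch, using the validation numbers copied from the log above, picks out the best epoch and checks the 469 steps per epoch (15,000 training samples at Keras's default batch size of 32):

```python
import math

# Values copied from the training log above.
val_loss = [0.2945, 0.2934, 0.3659, 0.4785, 0.6138]
val_accuracy = [0.8803, 0.8786, 0.8711, 0.8727, 0.8662]

# Epochs are 1-based in the log, list positions are 0-based.
best_loss_epoch = val_loss.index(min(val_loss)) + 1
best_acc_epoch = val_accuracy.index(max(val_accuracy)) + 1
print(best_loss_epoch, best_acc_epoch)   # 2 1

# Number of batches per epoch: ceil(samples / batch_size).
steps_per_epoch = math.ceil(15_000 / 32)
print(steps_per_epoch)                   # 469
```

Stopping after the second epoch, for example with the `keras.callbacks.EarlyStopping` callback monitoring `val_loss`, would likely avoid most of the overfitting seen in epochs 3 to 5.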


In [None]:
accuracy = scores[1] * 100
print('Accuracy of your model is')
print(accuracy)

Accuracy of your model is
86.39600276947021
