# Sentiment Analysis
- The process of analysing online pieces of writing to determine the emotional tone they carry.
- Tools categorize pieces of writing as positive, neutral, or negative.
- Source texts can be found on discussion forums, review sites, Twitter, Instagram, Facebook and other publicly available online sources.

### Sentiment analysis challenges
Because language is complex, sentiment analysis has to face several issues.

- **Contrastive conjunction**
    - A contrastive conjunction occurs when one sentence joins two clauses with opposing sentiment (one positive, one negative), so a tool cannot assign a single clear polarity.

Example sentence: “The weather was terrible, but the hike was amazing!”
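
To see why this is hard, it can help to score each clause separately. This is a minimal sketch, assuming a tiny hand-made word list (`TOY_LEXICON` and `clause_scores` are hypothetical, not part of any real tool) and assuming that splitting on "but" is enough to separate the clauses:

```python
import re

# Hypothetical toy lexicon, for illustration only
TOY_LEXICON = {"terrible": -1, "amazing": 1}

def clause_scores(sentence):
    """Split on the conjunction 'but' and score each clause separately."""
    clauses = re.split(r"\bbut\b", sentence.lower())
    scores = []
    for clause in clauses:
        words = re.findall(r"[a-z']+", clause)
        scores.append(sum(TOY_LEXICON.get(word, 0) for word in words))
    return scores

print(clause_scores("The weather was terrible, but the hike was amazing!"))
```

A single sum over the whole sentence would come out at 0 and hide the fact that it contains two opposite opinions.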

- **Named-entity recognition**
    - Another major problem for sentiment analysis algorithms is named-entity recognition: the same word can refer to different things depending on context.

Does “Everest” refer to the mountain or to the movie?

- **Anaphora resolution**
    - Also known as pronoun resolution, this is the problem of working out what a pronoun or noun refers to within a text.

Example sentence: “We went to the theater and went for a dinner. It was awful.”

- **Sarcasm**
    - Sarcastic text says the opposite of what it means, and it is an open question whether any sentiment analysis tool detects it reliably.

Example sentence: “I’m so happy the plane is delayed.”

- **The Internet**
    - Language used online takes its own form. The economy of language and the Internet as a medium result in poor spelling, abbreviations, acronyms, missing capitals and poor grammar. Such text can cause problems for sentiment analysis algorithms.
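
One common response is to normalize the text before analysis. The sketch below is only an illustration under stated assumptions: the `ABBREVIATIONS` table and `normalize_text` are hypothetical, and real pipelines may deliberately keep capitals or repeated letters because they can carry sentiment.

```python
import re

# Hypothetical abbreviation table; a real one would be much larger
ABBREVIATIONS = {"gr8": "great", "u": "you", "omg": "oh my god"}

def normalize_text(text):
    """Lowercase, shorten stretched letters and expand known abbreviations."""
    text = text.lower()
    # Collapse runs of three or more identical characters to two ("soooo" -> "soo")
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)

print(normalize_text("OMG this is soooo gr8"))
```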

### Sentiment analysis can be done on:
- Document level – modeling long-term relationships
- Sentence level – is there a sentiment, and which?
- Aspect extraction – “great phone but crappy display” (difficult)

### Lexical methods
Early sentiment analysis used manually curated lists of good/bad words. This approach is now widely considered inferior to machine learning.
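
A minimal sketch of the word-list idea, assuming two tiny hypothetical lists (`POSITIVE`, `NEGATIVE`) in place of a real curated lexicon:

```python
# Hypothetical word lists; real lexicons contain thousands of entries
POSITIVE = {"great", "amazing", "awesome", "worth"}
NEGATIVE = {"waste", "awful", "terrible"}

def lexicon_label(text):
    """Count positive and negative words and label the text by the difference."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_label("it was great"))
print(lexicon_label("waste of time"))
print(lexicon_label("I really didnt like it"))
```

The last call comes out as neutral because none of its words are in either list, which illustrates how a plain word list misses negation.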

### Classical Supervised Learning
Bag-of-words and Naive Bayes work to some extent for simpler sentiment analysis tasks. Support Vector Machines are used to model complex content, with and without word embeddings.
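
As a rough illustration of bag-of-words with Naive Bayes, here is a small multinomial Naive Bayes written in plain Python. It is a sketch, not a reference implementation: the function names and the four training sentences are made up for this example, and add-one smoothing is an assumption about how to handle unseen word/class pairs.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Collect per-class word counts (the bag-of-words) and class frequencies."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for c in classes for w in word_counts[c]}
    return classes, word_counts, doc_counts, vocab

def predict_naive_bayes(model, doc):
    """Return the class with the highest log prior plus log likelihood."""
    classes, word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_class, best_score = None, -math.inf
    for c in classes:
        score = math.log(doc_counts[c] / total_docs)
        total_words = sum(word_counts[c].values())
        for word in doc.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0)
                score += math.log((word_counts[c][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up training data: 1 = positive, 0 = negative
train_docs = ["it was great", "well worth it", "waste of time", "it was awful"]
train_labels = [1, 1, 0, 0]
nb_model = train_naive_bayes(train_docs, train_labels)

print(predict_naive_bayes(nb_model, "great and worth it"))
print(predict_naive_bayes(nb_model, "awful waste"))
```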

### Embedding methods
Most modern sentiment analysis models are based on word embeddings. Popular architectures include:

- LSTMs:
    - Embedding -> LSTM -> Output layer
- LSTM with Pooling
    - Embedding -> LSTM -> MeanPool -> LogReg
- Convolution
    - Claimed to train faster:
    - Embedding -> Conv1D -> Conv1D -> Dense -> Output

### Character-level embeddings
Both LSTMs and Convolutions work not only with word embeddings, but also character-level embeddings. In that case, the input would be an integer for each character, and the weights of the embedding layer would be trained, too, instead of using pre-trained weights.
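
The "integer for each character" input can be sketched as follows. The helper names are hypothetical, and reserving id 0 for padding is an assumption carried over from the usual word-level setup:

```python
def build_char_index(texts):
    """Map every distinct character to an integer id, keeping 0 free for padding."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(chars, start=1)}

def encode_chars(text, char_index):
    """Turn a string into a list of integer ids, one per character."""
    return [char_index[ch] for ch in text]

char_index = build_char_index(["good", "bad"])
print(char_index)
print(encode_chars("good", char_index))
print(encode_chars("bad", char_index))
```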

## Sentiment Analysis - Two Options: Build your own, or use a handy Python package!

### We will try both

---

### 1: Build your own Sentiment Analysis model using Keras

* First, set up our imports
* We'll use a small hand-written set of reviews defined below (not the imdb dataset from keras)
* The text still has to be preprocessed before the network can use it

In [1]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
from keras.preprocessing import sequence
import numpy as np

Using TensorFlow backend.


### Word Embedding is contained in the Embedding layer
- **It takes 3 arguments - the size of the vocab (input_dim), the no. of dimensions of each word embedding (output_dim), and the length of each document (input_length)**
- **For each document it outputs a 2D matrix, with one row per word in the document and one column per dimension of the word embedding**
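
Conceptually, the layer is a row lookup in a weight matrix. The NumPy sketch below only illustrates the shapes involved: the sizes are assumed example values, and random weights stand in for the trainable embedding matrix.

```python
import numpy as np

# Assumed example sizes
vocab_size, output_dim, input_length = 19, 16, 6

# Random weights stand in for the trainable embedding matrix
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, output_dim))

# One padded document of integer word ids (0 is padding)
document = np.array([4, 8, 15, 2, 3, 0])

# Looking up each id returns one embedding vector per word
embedded = embedding_matrix[document]
print(embedded.shape)
```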

### Create the data

In [2]:
reviews = ['I really didnt like it', 'it was amazing',
        'it was great','as great as talking to nedra',
        'waste of time', 'well worth it', 'awesome']

In [3]:
labels = [0,1,1,1,0,1,1]

#### Integer encode the text
* We have to transform the text before we give it to the sentiment analysis network

##### Part 1: Build a vocab list containing each unique word in the data

In [4]:
vocab = []
max_length = 0

for review in reviews:
    review = review.lower().split()
    vocab.extend(review)
    # Track the length (in words) of the longest review
    max_length = max(max_length, len(review))

vocab = list(set(vocab))
vocab_size = len(vocab)
vocab

['talking',
 'like',
 'it',
 'i',
 'great',
 'to',
 'well',
 'really',
 'awesome',
 'waste',
 'as',
 'nedra',
 'of',
 'amazing',
 'didnt',
 'was',
 'time',
 'worth']

In [5]:
max_length

6

In [7]:
vocab_size = vocab_size + 1  # word ids start at 1; index 0 is reserved for padding
vocab_size

19

##### Part 2: Integer encode the words in each document

- **Turn our words from words into numbers**

In [8]:
vocab_to_keys = {}  # key: word, value: unique id
key_to_vocab = {}   # key: unique id, value: word

# Ids start at 1 so that 0 stays free for padding
for i, word in enumerate(vocab, start=1):
    vocab_to_keys[word] = i
    key_to_vocab[i] = word

In [9]:
vocab_to_keys

{'talking': 1,
 'like': 2,
 'it': 3,
 'i': 4,
 'great': 5,
 'to': 6,
 'well': 7,
 'really': 8,
 'awesome': 9,
 'waste': 10,
 'as': 11,
 'nedra': 12,
 'of': 13,
 'amazing': 14,
 'didnt': 15,
 'was': 16,
 'time': 17,
 'worth': 18}

In [10]:
key_to_vocab

{1: 'talking',
 2: 'like',
 3: 'it',
 4: 'i',
 5: 'great',
 6: 'to',
 7: 'well',
 8: 'really',
 9: 'awesome',
 10: 'waste',
 11: 'as',
 12: 'nedra',
 13: 'of',
 14: 'amazing',
 15: 'didnt',
 16: 'was',
 17: 'time',
 18: 'worth'}

In [11]:
reviews

['I really didnt like it',
 'it was amazing',
 'it was great',
 'as great as talking to nedra',
 'waste of time',
 'well worth it',
 'awesome']

In [13]:
embedded_docs = [[vocab_to_keys[x] for x in review.lower().split()] for review in reviews]
embedded_docs

[[4, 8, 15, 2, 3],
 [3, 16, 14],
 [3, 16, 5],
 [11, 5, 11, 1, 6, 12],
 [10, 13, 17],
 [7, 18, 3],
 [9]]

**We've created two lookup dictionaries and used them to turn each review into a list of word ids!**

### Truncate and pad the review sequences
* Every input has to have the same shape, in our case the length of the longest review (max_length = 6 words)

In [14]:
padded_docs = sequence.pad_sequences(embedded_docs, maxlen=max_length, padding='post')
padded_docs

array([[ 4,  8, 15,  2,  3,  0],
       [ 3, 16, 14,  0,  0,  0],
       [ 3, 16,  5,  0,  0,  0],
       [11,  5, 11,  1,  6, 12],
       [10, 13, 17,  0,  0,  0],
       [ 7, 18,  3,  0,  0,  0],
       [ 9,  0,  0,  0,  0,  0]], dtype=int32)
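
What `padding='post'` does can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: `pad_post` is a hypothetical helper, and it cuts over-long sequences at the end, whereas `pad_sequences` by default drops items from the start (`truncating='pre'`).

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value`, cutting it to `maxlen` items first."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # simplification: keep the first maxlen ids
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[3, 16, 14], [9]], maxlen=6))
```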

### Build the model

#### Fit the model on the training data
* Fit the word embeddings from scratch

In [15]:
from keras import backend as K 

In [20]:
K.clear_session()

In [21]:
model = Sequential()
model.add(Embedding(vocab_size, 16, input_length=max_length)) # Embedding Layer
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

In [22]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

In [23]:
X = padded_docs
y = np.array(labels)  # pass the targets as an array rather than a plain list

model.fit(X, y, epochs=50, verbose=1)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x1253b7898>

In [24]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 6, 16)             304       
_________________________________________________________________
flatten_1 (Flatten)          (None, 96)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 97        
Total params: 401
Trainable params: 401
Non-trainable params: 0
_________________________________________________________________
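
The parameter counts in the summary can be checked by hand. A small sketch of the arithmetic, using the sizes from this notebook:

```python
vocab_size, embedding_dim, max_length = 19, 16, 6

embedding_params = vocab_size * embedding_dim  # one 16-d vector per word id
flatten_units = max_length * embedding_dim     # the 6 x 16 matrix unrolled into a vector
dense_params = flatten_units * 1 + 1           # one weight per input plus one bias
total_params = embedding_params + dense_params

print(embedding_params, flatten_units, dense_params, total_params)
```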


In [25]:
model.evaluate(X,y,verbose=1)



[0.4718186557292938, 1.0]
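
The two numbers are the loss and the accuracy. A loss of about 0.47 together with an accuracy of 1.0 is possible because accuracy only asks which side of 0.5 a prediction falls on, while binary cross-entropy also penalizes low confidence. The sketch below uses made-up probabilities (not the model's actual outputs) to show the effect:

```python
import math

def binary_crossentropy(y_true, y_prob):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over all examples."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for t, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

# Hypothetical predictions: all on the correct side of 0.5, none confident
y_true = [0, 1, 1]
y_prob = [0.4, 0.6, 0.65]

loss = binary_crossentropy(y_true, y_prob)
accuracy = sum((p >= 0.5) == bool(t) for t, p in zip(y_true, y_prob)) / len(y_true)
print(loss, accuracy)
```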

### Now try to predict the sentiment of new text you feed in

### Preprocess the text in the same way
#### Then run the model on the new padded_docs and compare its predictions with the expected labels

In [28]:
# max_length = 6
new_reviews = ['it was really amazing nedra', 'it was great like', 'amazing waste of time']
embedded_docs = [[vocab_to_keys[x] for x in review.lower().split()] for review in new_reviews]
padded_docs = sequence.pad_sequences(embedded_docs, maxlen=max_length, padding='post')

In [29]:
padded_docs

array([[ 3, 16,  8, 14, 12,  0],
       [ 3, 16,  5,  2,  0,  0],
       [14, 10, 13, 17,  0,  0]], dtype=int32)
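
Note that the lookup `vocab_to_keys[x]` raises a `KeyError` for any word that was not in the training reviews. One possible workaround, sketched under the assumption that unseen words may simply be mapped to the padding id 0 (a dedicated unknown-word id is another option); `encode_review` and `toy_vocab_to_keys` are hypothetical names:

```python
def encode_review(review, vocab_to_keys, unknown_id=0):
    """Integer encode a review, mapping unseen words to `unknown_id`."""
    return [vocab_to_keys.get(word, unknown_id) for word in review.lower().split()]

# Small stand-in for the full dictionary built earlier
toy_vocab_to_keys = {"it": 3, "was": 16, "great": 5}
print(encode_review("It was truly great", toy_vocab_to_keys))
```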

In [30]:
X = padded_docs
test_labels = [1,1,0]
ypred = model.predict(X)

In [31]:
ypred

array([[0.5803242],
       [0.6323064],
       [0.5480741]], dtype=float32)
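
To compare these probabilities with `test_labels`, they can be thresholded at 0.5. A short sketch, with the values above copied in by hand and rounded:

```python
# Probabilities copied from the output above, rounded to four decimals
probabilities = [0.5803, 0.6323, 0.5481]
test_labels = [1, 1, 0]

predicted = [1 if p >= 0.5 else 0 for p in probabilities]
accuracy = sum(p == t for p, t in zip(predicted, test_labels)) / len(test_labels)

print(predicted)
print(round(accuracy, 2))
```

All three reviews are predicted positive, so only two of the three labels are matched, and none of the probabilities is far from 0.5.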

### Is our model good? 
* Discuss why
* Discuss bias
* Discuss how to circumvent training - 2 solutions!

We can see that the model is confused by the final review: 'amazing' pushes it towards positive, but the word is juxtaposed with 'waste of time'!

--- 

### VADER
* We will be using **VADER**, a lexicon- and rule-based sentiment analysis package tuned for social media text
* It is a good out-of-the-box tool and very easy to use
* Good at handling subtle sentiment signals, e.g.:
    * good!!! > good!
    * omg so good > good
    * GOOD > good
    * good :) > good
    
*Install with `pip install vaderSentiment`*

In [32]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [33]:
analyzer = SentimentIntensityAnalyzer() 

In [34]:
analyzer.polarity_scores('all humans are shit')

{'neg': 0.545, 'neu': 0.455, 'pos': 0.0, 'compound': -0.5574}

In [37]:
string = 'hi this string is awesome'
analyzer.polarity_scores(string)

{'neg': 0.0, 'neu': 0.494, 'pos': 0.506, 'compound': 0.6249}
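
The `compound` value is a normalized score between -1 and 1. The sketch below is based on two assumptions that should be checked against the package source: that the summed word valence `x` is squashed with `x / sqrt(x*x + alpha)` using `alpha = 15`, and that 'awesome' and 'amazing' carry lexicon valences of 3.1 and 2.8. Under those assumptions it reproduces the compound scores reported in this notebook for 'awesome' (0.6249) and 'it was amazing' (0.5859).

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Assumed lexicon valences for a single sentiment-bearing word
print(round(normalize(3.1), 4))
print(round(normalize(2.8), 4))
```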

In [36]:
for review in reviews:
    print(review, analyzer.polarity_scores(review))
    print()

I really didnt like it {'neg': 0.443, 'neu': 0.557, 'pos': 0.0, 'compound': -0.3374}

it was amazing {'neg': 0.0, 'neu': 0.345, 'pos': 0.655, 'compound': 0.5859}

it was great {'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}

as great as talking to nedra {'neg': 0.0, 'neu': 0.549, 'pos': 0.451, 'compound': 0.6249}

waste of time {'neg': 0.583, 'neu': 0.417, 'pos': 0.0, 'compound': -0.4215}

well worth it {'neg': 0.0, 'neu': 0.2, 'pos': 0.8, 'compound': 0.4588}

awesome {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.6249}
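
To turn a compound score into a label, a common convention is a threshold of ±0.05. The helper below is a hypothetical sketch of that convention, not a function provided by the package:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a positive / negative / neutral label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label_from_compound(-0.3374))  # 'I really didnt like it'
print(label_from_compound(0.6249))   # 'awesome'
print(label_from_compound(0.0))
```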



In [35]:
analyzer.polarity_scores('lecture over')

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}