<a href="https://colab.research.google.com/github/k87rte/Sentiment-analyses/blob/main/sentiment_analyses.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# All about the data


The data was downloaded from kaggle.com.
But what does the data represent? It represents sentiments (positive or negative) attached to sentences (for example, 'I feel good today'). In this section, data from 3 different sources is first read using **pandas.read_csv** and later cleaned using an in-house function, **str_num_separator**.


## Reading data

Read the data, which consists of sentences and their corresponding sentiment labels. The data comes from Amazon, IMDb, and Yelp.

In [65]:
import pandas as pd

amazon_labelled = 'amazon_cells_labelled.csv'
amazon_df = pd.read_csv(amazon_labelled, header=None)
imdb_labelled = 'imdb_labelled.csv'
imdb_df = pd.read_csv(imdb_labelled, header=None)
yelp_labelled = 'yelp_labelled.csv'
yelp_df = pd.read_csv(yelp_labelled, header=None)

## Function to represent the data

I noticed that pd.read_csv does not work well here. Why? Because the sentences themselves contain ',', for example 'what do you think about this, and that?', so a single sentence can end up split across several columns. So I created a function that reassembles the data the way we intend.
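
To make the comma problem concrete, here is a minimal sketch that uses Python's standard `csv` module instead of pandas. The sample line is made up from the example sentence above plus a hypothetical label of 1; it is not an actual row from the dataset.

```python
import csv
import io

# Hypothetical raw line: the example sentence followed by an assumed label of 1.
line = "what do you think about this, and that?,1\n"

# A comma-separated parser sees three fields here, not two.
fields = next(csv.reader(io.StringIO(line)))

# Treat the last field as the label and glue the remaining fragments back together.
sentence = ",".join(fields[:-1])
label = int(fields[-1])

print(len(fields))
print(sentence)
print(label)
```

Joining on ',' puts back the comma that the parser consumed. The function in the next cell concatenates the fragments directly, so the comma itself is not restored there.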

In [66]:
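def str_num_separator(current_loc):
    """Join the text fragments of one row and collect its digit-only cells."""
    current_loc = current_loc.dropna()  # ignore NaN (empty) cells
    text = ''  # not named `str`, which would shadow the built-in type
    num = []
    for i in range(current_loc.shape[0]):
        cell = current_loc.iloc[i]
        # non-string cells have no .isdigit() and raise AttributeError,
        # which the calling loops catch and report
        if cell.isdigit():
            num.append(int(cell))
        else:
            text = text + cell
    return text, num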

## View the data

In [67]:
# amazon data
sentence=[]
sentiment=[]
for i in range(amazon_df.shape[0]):
    try:
        text, num = str_num_separator(amazon_df.loc[i])
        sentence.append(text)
        sentiment.append(num)
    except AttributeError:
        print(f'not possible for row num: {i}')
amazon_filtered_df = pd.concat([pd.DataFrame(sentence), pd.DataFrame(sentiment)], axis=1)
amazon_filtered_df.columns = ['sentence','sentiment']
amazon_filtered_df.head()

not possible for row num: 545
not possible for row num: 717
not possible for row num: 821
not possible for row num: 864


| | sentence | sentiment |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [68]:
# imdb data
sentence=[]
sentiment=[]
for i in range(imdb_df.shape[0]):
    try:
        text, num = str_num_separator(imdb_df.loc[i])
        sentence.append(text)
        sentiment.append(num)
    except AttributeError:
        print(f'not possible for row num: {i}')
imdb_filtered_df = pd.concat([pd.DataFrame(sentence), pd.DataFrame(sentiment)], axis=1)
imdb_filtered_df.columns = ['sentence','sentiment']
imdb_filtered_df.head()

not possible for row num: 17


| | sentence | sentiment |
|---|---|---|
| 0 | A very very very slow-moving aimless movie abo... | 0.0 |
| 1 | Not sure who was more lost - the flat characte... | 0.0 |
| 2 | Attempting artiness with black & white and cle... | 0.0 |
| 3 | Very little music or anything to speak of. | 0.0 |
| 4 | The best scene in the movie was when Gerardo i... | 1.0 |


In [69]:
# yelp data
sentence=[]
sentiment=[]
for i in range(yelp_df.shape[0]):
    try:
        text, num = str_num_separator(yelp_df.loc[i])
        sentence.append(text)
        sentiment.append(num)
    except AttributeError:
        print(f'not possible for row num: {i}')
yelp_filtered_df = pd.concat([pd.DataFrame(sentence), pd.DataFrame(sentiment)], axis=1)
yelp_filtered_df.columns = ['sentence','sentiment']
yelp_filtered_df.head()

not possible for row num: 760


| | sentence | sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


# All about the 'words to numbers'

## Tokenize the sentences

Machines like numbers, so we convert the words to numbers. Here, each unique word is assigned a token number. See word_index in the next cell for the dictionary of unique words and their corresponding tokens.

In [70]:
import tensorflow as tf
from tensorflow import keras
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(amazon_filtered_df.iloc[:,0].tolist())
words_to_numbers = tokenizer.texts_to_sequences(amazon_filtered_df.iloc[:,0].tolist())
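
To show what "one token number per unique word" means, here is a simplified pure-Python stand-in. It is a sketch under assumptions, not the Keras implementation: the helper names `build_word_index` and `to_sequence` and the regular expression used to split words are made up for illustration. It follows the same general idea of lowercasing the text, reserving index 1 for the out-of-vocabulary token, and numbering the remaining words from most to least frequent.

```python
import re
from collections import Counter

WORD_PATTERN = re.compile(r"[a-z0-9']+")  # assumed, simplified word splitter


def build_word_index(sentences, oov_token="<OOV>"):
    """Map each unique word to an integer; index 1 is reserved for unknown words."""
    counts = Counter()
    for sentence in sentences:
        counts.update(WORD_PATTERN.findall(sentence.lower()))
    word_index = {oov_token: 1}
    # More frequent words get smaller numbers.
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequence(sentence, word_index, oov_token="<OOV>"):
    """Replace every word with its number, falling back to the OOV number."""
    oov = word_index[oov_token]
    return [word_index.get(w, oov) for w in WORD_PATTERN.findall(sentence.lower())]


corpus = ["The mic is great.", "Great for the jawbone."]
word_index = build_word_index(corpus)
seq = to_sequence("The mic is terrible", word_index)
print(word_index)
print(seq)
```

The word 'terrible' never appears in the small corpus, so it is mapped to the out-of-vocabulary number instead of being dropped.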


## DL model creation, compilation, and fitting

In [71]:
from keras.utils import pad_sequences

padded_words_to_numbers = pad_sequences(words_to_numbers, padding = 'post')
padded_sentiments = amazon_filtered_df.iloc[:,1]
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1
embedding_dim = 16
input_length = padded_words_to_numbers.shape[1]
print(input_length)

from keras import Sequential, layers

model = Sequential([
                    layers.Embedding(vocab_size, embedding_dim, input_length=input_length),
                    layers.GlobalAveragePooling1D(),
                    layers.Dense(24, activation = 'relu'),
                    layers.Dense(1, activation = 'sigmoid')
                   ])
model.summary()

30
Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 30, 16)            29856     
                                                                 
 global_average_pooling1d_5   (None, 16)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_8 (Dense)             (None, 24)                408       
                                                                 
 dense_9 (Dense)             (None, 1)                 25        
                                                                 
Total params: 30,289
Trainable params: 30,289
Non-trainable params: 0
_________________________________________________________________
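
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic; the vocabulary size of 1866 is not printed anywhere in the notebook and is inferred here from the embedding row (29,856 parameters divided by 16 dimensions).

```python
embedding_dim = 16
vocab_size = 29856 // embedding_dim  # inferred from the summary, not printed by the notebook

# Embedding: one vector of size embedding_dim per token number.
embedding_params = vocab_size * embedding_dim

# GlobalAveragePooling1D only averages, so it has no parameters.
pooling_params = 0

# Dense layers: (inputs * units) weights plus one bias per unit.
dense_24_params = embedding_dim * 24 + 24
dense_1_params = 24 * 1 + 1

total_params = embedding_params + pooling_params + dense_24_params + dense_1_params
print(vocab_size, embedding_params, dense_24_params, dense_1_params, total_params)
```

Almost all of the parameters sit in the embedding table, so the model size is driven mainly by the vocabulary.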


In [73]:
model.compile(loss='binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])
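
For a single example with label y (0 or 1) and predicted probability p, binary cross-entropy is -[y·log(p) + (1 - y)·log(1 - p)]. The sketch below computes it in plain Python to show how the loss behaves; the function name and the clipping value `eps` are assumptions made for this illustration, not taken from the notebook.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one example; eps (an assumed value) keeps log() away from zero."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# The same confident prediction (p = 0.9) scored against both possible labels.
confident_right = binary_crossentropy(1, 0.9)
confident_wrong = binary_crossentropy(0, 0.9)
print(confident_right, confident_wrong)
```

A confident prediction costs little when it is right and a lot when it is wrong, which is what pushes the sigmoid output towards the correct side during training.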

## Number of epochs = 30

In [74]:
num_epochs = 30
history = model.fit(
    padded_words_to_numbers[0:900],
    padded_sentiments[0:900],
    epochs=num_epochs,
    validation_data=(padded_words_to_numbers[900:], padded_sentiments[900:])

)

Epoch 1/30
...
Epoch 30/30


## Predict the sentiment of a sentence

We now create a sentence that the model has not seen before and see how the model performs. The output comes from a sigmoid, that is, it is a value between 0 (negative) and 1 (positive). For the example 'I love to sing!', a sentiment value of ~0.97 is returned, i.e., a strongly positive valence.

In [103]:
sentence = ["I love to sing!"]
# tokenize with the tokenizer fitted above, then pad/truncate to the training length
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, maxlen=input_length, padding='post', truncating='post')
print(model.predict(padded))

[[0.97427183]]
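
Two steps around the model call are worth spelling out: fitting the token sequence to `input_length`, and reading the sigmoid output as a label. The pure-Python sketch below mimics post-padding and post-truncation and applies a 0.5 decision threshold. The helper names, the example token numbers, and the threshold are assumptions for illustration; the notebook itself only prints the raw score.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut extra tokens from the end, or append `value` until the length is maxlen."""
    return (list(sequence) + [value] * maxlen)[:maxlen]


def label_from_score(score, threshold=0.5):
    """Turn a sigmoid output into a label; the 0.5 threshold is an assumed convention."""
    return "positive" if score >= threshold else "negative"


short = pad_post([5, 9, 3], maxlen=6)               # padded with zeros at the end
long = pad_post([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)  # last two tokens are cut
label = label_from_score(0.97427183)                 # the score printed above
print(short, long, label)
```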


## The Transformers module

In [106]:
!pip install transformers
from transformers import pipeline




In [107]:
classifier = pipeline("sentiment-analysis")
classifier("I love to sing!")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'POSITIVE', 'score': 0.999854564666748}]
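
The log above recommends naming the model explicitly. The sketch below pins the model name and revision that the log reports, and adds a small helper that puts the pipeline result on the same 0 to 1 "positive" scale as the Keras model. The names `build_classifier` and `positive_probability` are hypothetical, and the helper assumes the pipeline only returns the labels 'POSITIVE' and 'NEGATIVE'.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"  # revision reported in the log above


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    from transformers import pipeline  # imported here so the helper below works without it
    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)


def positive_probability(result):
    """Convert one pipeline result dict into a probability of 'positive'."""
    score = result["score"]
    return score if result["label"] == "POSITIVE" else 1.0 - score


# Result copied from the output above.
transformer_score = positive_probability({"label": "POSITIVE", "score": 0.999854564666748})
print(transformer_score)
```

With `transformers` installed, `build_classifier()("I love to sing!")` should return a list of result dicts like the one above, and each entry can be passed to `positive_probability` for a side-by-side comparison with the small Keras model's ~0.97.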