# Prediction of Twitter Data using Trained Word2Seq and Word2Vec Models

This notebook predicts sentiment for a Twitter dataset gathered by the University of Malaya Halal research group.  
The data is not publicly available, but it can be provided upon request.  
The predictions use the trained **Word2Vec** and **Word2Seq** models.  
The available models are:
* Word2Seq Convolutional Neural Network
* Word2Seq Long Short Term Memory
* Word2Seq Convolutional Neural Network + Long Short Term Memory
* Word2Seq Convolutional Neural Network + Bi-directional Recurrent Neural Network + Bi-directional Long Short Term Memory
* Word2Vec Convolutional Neural Network
* Word2Vec Long Short Term Memory
* Word2Vec Convolutional Neural Network + Long Short Term Memory
* Word2Vec Convolutional Neural Network + Bi-directional Recurrent Neural Network + Bi-directional Long Short Term Memory

###### Import the required libraries

In [1]:
import numpy as np
import pandas as pd
from tensorflow.python.keras.models import load_model
from tensorflow.python.keras.preprocessing import text as keras_text, sequence as keras_seq
from sklearn.utils import shuffle
from tensorflow import set_random_seed
import os
import gc

# Set seed
myrand=58584
np.random.seed(myrand)
set_random_seed(myrand)

WORDS_SIZE=8000

######  Load and prepare the collected Twitter data

In [2]:
data = pd.read_excel('../../data.xlsx',sheet_name='Sheet1')
data.text = data.text.astype(str)
data.shape

(105542, 5)

In [3]:
data.head(n=10)

| | file_name | hash | text | timestamp | user_ |
|---|---|---|---|---|---|
| 0 | halal_skincare.json | 0e8df15d4bd22ee2a025e2ba244afc4e | Darah ko ni Dah kira Halal JAKIM | 2014-01-06T08:46:19.000Z | Mylea_Skincare |
| 1 | halal_skincare.json | 8b8a05bb640c3a08f13da6beefb2c458 | Menurut kajian ada sesetengah pemakan rasuah o... | 2014-01-06T08:45:11.000Z | Mylea_Skincare |
| 2 | halal_skincare.json | c6863b3517ccedc7c05b9acf9fb71d1b | we have a full range of cleansing and skincare... | 2014-01-01T23:15:59.000Z | halalcosco |
| 3 | halal_skincare.json | 5f5722fc7e900ec7074d702a8f3d3343 | Inovasi skin care terkini bebas mercury dan no... | 2014-01-01T04:35:54.000Z | rullynursesi |
| 4 | halal_trip.json | 1bb31c3e8faebddec3c158a017be21f0 | Wuih suami istri ikutan open trip I thought th... | 2018-04-07T08:49:15.000Z | sayannisa |
| 5 | halal_trip.json | 0d6140e38f2aa4f4414c3080ba651e6a | Love it check it out halalexpo | 2018-04-06T12:11:39.000Z | MillanUS |
| 6 | halal_trip.json | 23cbd1544d1e4359a248ba9898efa3db | Trip hobi traveling generasi muslim milenial d... | 2018-04-06T00:10:58.000Z | Irsyad_af21 |
| 7 | halal_skincare.json | 937a238c510e64d0c33110a6639412dc | Halal is a requirement not only for food and b... | 2013-12-30T19:18:30.000Z | Famiza72 |
| 8 | halal_skincare.json | b1f410a86f23cd081b78c74cae03d74c | Love my pretty purchases from latifahalalbeaut... | 2013-12-30T14:08:47.000Z | BlossomAndBean |
| 9 | halal_skincare.json | 3730d033aaa900c211c296a40da48db2 | Soyeux Skin Care adalah produk halal dan selam... | 2013-12-25T03:52:14.000Z | soyeuxofficial |


###### Load and prepare the tokenizer for Word2Seq and Word2Vec

In [4]:
mydata = pd.read_csv('../../../../../Master (Sentiment Analysis)/Paper/Paper 3/Datasets/eRezeki/eRezeki_(text_class)_unclean.csv',header=0,encoding='utf-8')
mydata = mydata.loc[mydata['sentiment'] != "neutral"]
mydata['sentiment'] = mydata['sentiment'].map({'negative': 0, 'positive': 1})

mydata1 = pd.read_csv('../../../../../Master (Sentiment Analysis)/Paper/Paper 3/Datasets/IMDB/all_random.csv',header=0,encoding='utf-8')
mydata = mydata.append(mydata1)
mydata = shuffle(mydata)

mydata1 = pd.read_csv('../../../../../Master (Sentiment Analysis)/Paper/Paper 3/Datasets/Amazon(sports_outdoors)/Amazon_UCSD.csv',header=0,encoding='utf-8')
mydata1['feedback'] = mydata1['feedback'].astype(str)
mydata = mydata.append(mydata1)
mydata = shuffle(mydata)

mydata1 = pd.read_csv('../../../../../Master (Sentiment Analysis)/Paper/Paper 3/Datasets/Yelp(zhang_paper)/yelp_zhang.csv',header=0,encoding='utf-8')
mydata1['feedback'] = mydata1['feedback'].astype(str)
mydata = mydata.append(mydata1)

del(mydata1)
gc.collect()

mydata = shuffle(mydata)
mydata = shuffle(mydata)
mydata = shuffle(mydata)

###### Create the tokenizer from the full list of texts

In [5]:
tokenizer = keras_text.Tokenizer(char_level=False)
tokenizer.fit_on_texts(list(mydata['feedback']))
tokenizer.num_words=WORDS_SIZE
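
To make the effect of `num_words` concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: it assumes plain whitespace tokenisation and lowercasing, and `build_index` and `to_sequence` are hypothetical helpers introduced only for illustration. As in the Keras tokenizer, indices start at 1 for the most frequent word, and only words whose index is below the cap are kept when texts are converted to sequences.

```python
from collections import Counter

def build_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Map words to indices, dropping unknown words and those at or above the cap."""
    return [word_index[w] for w in text.lower().split()
            if w in word_index and word_index[w] < num_words]

# Illustrative texts only; these are not taken from the training data.
texts = ["halal food is good", "halal trip is good", "halal skincare"]
word_index = build_index(texts)
capped = to_sequence("halal food is good", word_index, num_words=3)
```

With a cap of 3, only the two most frequent words survive, which is why rare words in the tweets contribute nothing to the model input.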

###### Load the trained models

Create a dictionary that maps each model name to its input sequence length

In [6]:
models_list = os.listdir('../Models/')
input_sizes = {'word2seq_cnn':700,
               'word2seq_cnn_birnn_bilstm':100,
               'word2seq_cnn_lstm':500,
               'word2seq_lstm':100,
               'word2vec_cnn':700,
               'word2vec_cnn_birnn_bilstm':100,
               'word2vec_cnn_lstm':500,
               'word2vec_lstm':100}
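
Because `os.listdir` returns every entry in the directory, any entry without a configured input size would raise a `KeyError` when it is looked up in `input_sizes`. The sketch below shows one way to guard against that. `select_known_models` is a hypothetical helper, and the `.ipynb_checkpoints` entry is only an illustrative example of a stray directory, not something known to exist in `../Models/`.

```python
def select_known_models(found, input_sizes):
    """Split directory entries into models with a configured input size and the rest."""
    known = [name for name in sorted(found) if name in input_sizes]
    skipped = [name for name in sorted(found) if name not in input_sizes]
    return known, skipped

# Illustrative values only.
example_sizes = {'word2seq_cnn': 700, 'word2seq_lstm': 100}
example_found = ['word2seq_lstm', '.ipynb_checkpoints', 'word2seq_cnn']
known, skipped = select_known_models(example_found, example_sizes)
```

Iterating over `known` rather than the raw directory listing keeps the prediction loop restricted to models that have an entry in the dictionary.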

###### Function for sequence data matrix creation from Twitter data

In [7]:
def create_seq(input_size):
    list_tokenized = tokenizer.texts_to_sequences(list(data.text))
    x_data = keras_seq.pad_sequences(list_tokenized, 
                                     maxlen=input_size,
                                     padding='post')
    x_data = x_data.astype(np.int64)
    return(x_data)
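
The behaviour of `pad_sequences` with `padding='post'` can be illustrated with a small NumPy sketch. This is an approximation for explanation, not the Keras code: `pad_post` is a hypothetical helper, and it assumes the default `truncating='pre'`, so tweets longer than `maxlen` keep their last `maxlen` tokens while shorter tweets are zero-padded at the end.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Zero-pad at the end; keep the last `maxlen` tokens of longer sequences."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]          # truncating='pre' (the Keras default)
        out[row, :len(kept)] = kept   # padding='post'
    return out

# One short and one long sequence of illustrative token ids.
padded = pad_post([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
```

Every row of the result has the same length, which is what allows a single matrix to be passed to a model expecting a fixed input size.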

###### Function for loading the trained model

In [8]:
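def load_model(model_name):
    # Import Keras's loader under an alias: this function shadows the
    # imported load_model, so calling load_model() here would recurse forever.
    from tensorflow.python.keras.models import load_model as keras_load_model
    mydir = '../Models/%s/%s.hdf5' % (model_name, model_name)
    model = keras_load_model(mydir)
    return model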

###### Function for predicting the data

In [9]:
def predict_data(model, x_data):
    # Predicted class per tweet: 1 = positive, 0 = negative
    sentiment = model.predict_classes(x_data)
    sentiment = sentiment.astype(str)
    sentiment[sentiment == '1'] = "Positive"
    sentiment[sentiment == '0'] = "Negative"
    # Class probabilities: column 0 = negative, column 1 = positive
    probability = model.predict_proba(x_data)
    positive_probability = probability[:, 1]
    negative_probability = probability[:, 0]
    return sentiment, positive_probability, negative_probability
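
`predict_classes` and `predict_proba` are helpers of the Keras `Sequential` model that newer Keras releases no longer provide. If they are unavailable, the same three outputs can be derived from one probability matrix, as in this minimal sketch. It assumes, as the function above does, that each model outputs two columns with column 0 for negative and column 1 for positive. `labels_from_probabilities` is a hypothetical helper, and the example matrix stands in for the result of `model.predict(x_data)`.

```python
import numpy as np

def labels_from_probabilities(probability):
    """Return (labels, positive, negative) from an (n, 2) array of class probabilities."""
    probability = np.asarray(probability, dtype=float)
    class_ids = probability.argmax(axis=1)  # index of the most likely class
    labels = np.where(class_ids == 1, "Positive", "Negative")
    return labels, probability[:, 1], probability[:, 0]

# Illustrative probabilities only, not real model output.
example = [[0.9, 0.1], [0.2, 0.8]]
labels, pos, neg = labels_from_probabilities(example)
```

This also avoids running the network twice, since the labels and both probability columns come from a single forward pass.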

###### Function to add new columns to the Excel dataframe

In [10]:
def add_columns(data, model_name, sentiment, positive_probability, negative_probability):
    # Build per-model column names so each model's results get their own columns
    name_1 = '%s_sentiment' % (model_name)
    name_2 = '%s_posProb' % (model_name)
    name_3 = '%s_negProb' % (model_name)
    # Use the variables, not the literal strings 'name_1' etc., as column keys
    data[name_1] = sentiment
    data[name_2] = positive_probability
    data[name_3] = negative_probability
    return data

###### Start looping to predict the sentiment

In [None]:
for name in models_list:
    x_data = create_seq(input_sizes[name])
    model = load_model(name)
    sentiment, positive_prob, negative_prob = predict_data(model, x_data)
    add_columns(data, name, sentiment, positive_prob, negative_prob)

data.head(n=10)