This problem statement comes from the HackerEarth challenge "Predict the Happiness". The objective is to use a 2-layer fully connected (dense) neural network to predict whether hotel reviews on the TripAdvisor site express positive or negative sentiment.

References:
https://appliedmachinelearning.wordpress.com/2017/12/21/predict-the-happiness-on-tripadvisor-reviews-using-dense-neural-network-with-keras-hackerearth-challenge/



Dataset description:

- User_ID :: unique ID of the customer

- Description :: description of the review posted

- Browser_Used :: browser used to post the review

- Device_Used :: device used to post the review

- Is_Response :: target variable

We are interested only in the Description and Is_Response variables. This is a two-class sentiment analysis problem: we have to tell whether the customer is happy or not.


Steps :
- Preprocess data 
- Extract features 
- Build model
- Train model
- Test performance


In [1]:
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import LabelEncoder
import json
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

Using TensorFlow backend.


Create a function that reads the CSV file and prepares the dataset. It also converts the Is_Response variable to binary 0/1 labels using LabelEncoder.

In [2]:
def preprocess(file_path):
    """Read the CSV and return (list of review texts, array of 0/1 labels)."""
    data = pd.read_csv(file_path, sep=',')

    # LabelEncoder assigns integer codes to the classes in sorted order
    label_encoder = LabelEncoder()
    labels = np.asarray(label_encoder.fit_transform(data['Is_Response']))

    # The review text is the only input feature used
    features = data['Description'].tolist()

    return features, labels
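
To see what the encoding step produces, the following minimal sketch mimics LabelEncoder in plain Python: integer codes are assigned to the distinct labels in sorted order. The helper name and the label strings are assumed for illustration and may not match the exact values in train.csv.

```python
def encode_labels(raw_labels):
    """Mimic LabelEncoder.fit_transform: codes follow the sorted order of the classes."""
    classes = sorted(set(raw_labels))
    code_of = {name: code for code, name in enumerate(classes)}
    return classes, [code_of[name] for name in raw_labels]

# Hypothetical label values, assumed for illustration only
classes, codes = encode_labels(["not happy", "happy", "happy", "not happy"])
print(classes)  # ['happy', 'not happy']
print(codes)    # [1, 0, 0, 1]
```

With labels like these, 0 stands for the happy class and 1 for the unhappy class, so any list that maps predicted indices back to names has to follow the same order.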
    

Here the words are the features, so we use a bag-of-words model to create the feature vectors.

- Make a dictionary of word-index pairs. The order is not important.
- Convert the words of each review into an array of word indices and store it in a global array.
- The global array becomes the feature matrix, with the number of columns equal to the vocabulary size (limited here to 10000). It is a sparse binary matrix with a 1 in each column whose word is present in the review.
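
The steps above can be sketched without Keras to make the resulting matrix concrete. This is a simplified illustration, not the Tokenizer implementation: it splits on whitespace only, the function names are hypothetical, and it assumes the Keras conventions that indices start at 1 and that indices at or above num_words are dropped.

```python
def build_word_index(texts):
    """Map each word to a 1-based index, most frequent words first."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ranked)}

def to_binary_matrix(texts, word_index, num_words):
    """One row per text; column j is 1 if the word with index j occurs in it."""
    rows = []
    for text in texts:
        row = [0] * num_words
        for word in text.lower().split():
            idx = word_index.get(word)  # unknown words are skipped
            if idx is not None and idx < num_words:
                row[idx] = 1
        rows.append(row)
    return rows

reviews = ["great room great staff", "dirty room"]
word_index = build_word_index(reviews)
matrix = to_binary_matrix(reviews, word_index, num_words=6)
print(word_index)  # {'great': 1, 'room': 2, 'staff': 3, 'dirty': 4}
print(matrix)      # [[0, 1, 1, 1, 0, 0], [0, 0, 1, 0, 1, 0]]
```

Column 0 stays empty because no word is assigned index 0, and repeated words such as "great" still produce a single 1 in binary mode.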


In [3]:
def convert_text2indexarray(text):
    # Relies on the global `dictionary` (word -> index) built from the training tokenizer
    return [dictionary[word] for word in text_to_word_sequence(text)]

Create the model: we use a 2-layer NN with 2 nodes at the output, one for each target class (happy and not happy).

In [4]:
max_words = 10000

model = Sequential()
model.add(Dense(256, input_shape=(max_words,), activation='elu')) ## Exponential linear unit
model.add(Dropout(0.5))
model.add(Dense(128, activation='elu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
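
As a rough check on the model size, the number of trainable parameters can be worked out by hand: each Dense layer has inputs × units weights plus one bias per unit, and the Dropout layers add none. The helper name below is illustrative.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

# Input size followed by the units of each Dense layer in the model above
layer_sizes = [10000, 256, 128, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [2560256, 32896, 258]
print(total)      # 2593410
```

Almost all of the parameters sit in the first layer, which is a direct consequence of the 10000-column bag-of-words input.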

In [5]:
features, labels = preprocess("train.csv")

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(features)

dictionary = tokenizer.word_index  ## dict mapping each word to its integer index

In [6]:
# Saving the vocabulary dictionary
with open('dictionary.json', 'w') as dictionary_file:
    json.dump(dictionary, dictionary_file)

In [7]:
# Replace the words in each review with their indices
allWordIndices = [convert_text2indexarray(text) for text in features]  ## global word matrix

In [8]:
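# sequences_to_matrix accepts the list of index lists directly; converting the
# ragged list with np.asarray first raises an error on recent NumPy versions
train_x = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')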

labels = keras.utils.to_categorical(labels, num_classes=2)
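
The one-hot step matters because the output layer has two softmax nodes and the categorical cross-entropy loss compares them with a two-element target. The following plain-Python sketch shows the idea; the function names are illustrative and the predicted probabilities are made up.

```python
import math

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows, like to_categorical."""
    return [[1.0 if k == y else 0.0 for k in range(num_classes)] for y in labels]

def categorical_crossentropy(target, predicted):
    """Loss for one sample: minus the log-probability given to the true class."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

targets = one_hot([0, 1, 1], num_classes=2)
print(targets)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Made-up softmax output for the first sample
loss = categorical_crossentropy(targets[0], [0.8, 0.2])
print(round(loss, 4))  # 0.2231
```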

In [9]:
# Compile and fit the model
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(train_x, labels, batch_size=32, epochs=15, verbose=1, validation_split=0.1, shuffle=True)

Train on 35038 samples, validate on 3894 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x104dbca90>
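
The sample counts in the log follow from validation_split=0.1. Keras holds out the last fraction of the samples, before shuffling, as validation data. The sketch below reproduces the arithmetic under the assumption that the split index is the truncated product of the sample count and (1 - validation_split); the function name is hypothetical.

```python
def split_counts(n_samples, validation_split):
    """Return (training count, validation count) for a trailing hold-out split."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

n_total = 35038 + 3894  # total rows implied by the training log
train_n, val_n = split_counts(n_total, 0.1)
print(n_total, train_n, val_n)  # 38932 35038 3894
```

Because the hold-out set is taken from the end of the data, an ordered CSV could give an unrepresentative validation set; shuffling the rows before calling fit is one way to guard against that.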

In [10]:
# save model 
model_json = model.to_json()

with open('model.json', 'w') as json_file:
    json_file.write(model_json)
    
model.save_weights('model.h5')

Testing the model

In [11]:
# Index order must match the 0/1 codes assigned by LabelEncoder during training
labels = ['happy', 'not_happy']

# Load dictionary
with open('dictionary.json', 'r') as dictionary_file:
    dictionary = json.load(dictionary_file)

# Load the model architecture from JSON
with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()

model = keras.models.model_from_json(loaded_model_json)
model.load_weights('model.h5')

test_data = pd.read_csv('test.csv')
length = len(test_data['Description'])

# Only used for sequences_to_matrix, which needs num_words but not a fitted vocabulary
tokenizer = Tokenizer(num_words=max_words)


In [12]:
def convert_text_to_index_array(text):
    words = text_to_word_sequence(text)
    wordIndices = []
    for word in words:
        if word in dictionary:
            wordIndices.append(dictionary[word])
    return wordIndices

In [13]:
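y_pred = []

for i in range(length):
    feature = test_data['Description'][i]
    test_array = convert_text_to_index_array(feature)

    # One-row binary bag-of-words matrix for this review
    # (named test_x to avoid shadowing the built-in input)
    test_x = tokenizer.sequences_to_matrix([test_array], mode='binary')

    pred = model.predict(test_x)

    # Pick the class with the highest softmax probability
    y_pred.append(labels[np.argmax(pred)])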
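
The last step of the loop, turning a softmax output into a class name, can be illustrated on its own. This is a minimal plain-Python sketch with made-up probabilities; the function name is hypothetical.

```python
def decode_prediction(probabilities, class_names):
    """Return the name of the class with the highest predicted probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best_index]

class_names = ['happy', 'not_happy']

# Made-up softmax outputs for two reviews
print(decode_prediction([0.91, 0.09], class_names))  # happy
print(decode_prediction([0.35, 0.65], class_names))  # not_happy
```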


In [14]:

raw_data = {'User_ID':test_data['User_ID'], 'Is_Response':y_pred}

df = pd.DataFrame(raw_data, columns=['User_ID', 'Is_Response'])

df.to_csv('results.csv', sep=',', index=False)