## Stacked Denoising Auto-encoder for Sentiment Analysis in Domain Adaptation Setting

This notebook is based on the following papers: <i>Is Joint Training Better for Deep Auto-Encoders?</i> (global optimization objective vs. greedy training), <i>Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach</i> (framework for DAEs in domain adaptation), <i>Estimating User Location in Social Media with Stacked Denoising Auto-encoders</i>, and <i>Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion</i>.

The auto-encoder is trained with the global optimization objective. The task is domain adaptation on sentiment review data from two different companies (Amazon and Yelp): the distribution of the text differs between the two, but the sentiment classification task is the same. The assumption is that labeled training data is available only for the Amazon reviews.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense,Input,Layer
from tensorflow.keras import backend as K
from tensorflow.keras.activations import relu
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
tf.keras.backend.set_floatx('float32')
tf.compat.v1.enable_eager_execution()

import spacy
nlp = spacy.load("en_core_web_lg")

import warnings
warnings.filterwarnings('ignore')

### Cleaning the sentiment data

Each example is a text review with a binary label indicating whether the review is positive. Features are averaged word embeddings. Bag-of-words features were also tested; with them, the difference in performance between the two models was even smaller.
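
For contrast with the averaged-embedding features, a bag-of-words representation can be sketched with the standard library alone. This is only an illustration: the toy reviews and the `bag_of_words` helper are made up here, and the notebook's own bag-of-words experiment is not shown (the imports suggest `CountVectorizer` would be the natural tool).

```python
from collections import Counter

def bag_of_words(texts):
    """Hypothetical helper: map each text to a vector of token counts."""
    tokenized = [t.lower().split() for t in texts]
    # vocabulary is the sorted set of all tokens seen in the corpus
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors

# toy reviews, made up for illustration
vocab, vectors = bag_of_words(["great phone great price", "bad phone"])
```

Each vector has one entry per vocabulary word, so its dimensionality grows with the corpus, whereas the averaged embeddings always have 300 dimensions.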

In [2]:
def clean_data(path):
    x = [] # list of text
    y = [] # list of numerical labels
    with open(path) as text_file:
        for line in text_file.readlines():
            line = line.strip()
            sent,score = line.split("\t")
            score = float(score)
            x.append(sent)
            y.append(score)
        return x,y
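
`clean_data` expects one review per line, with the text and a numeric label separated by a tab. A minimal sketch of the same parsing on an in-memory sample follows; the sample rows and the `parse_lines` name are assumptions for illustration, and the blank-line check is an added safeguard that the original function does not have.

```python
import io

def parse_lines(handle):
    """Same parsing logic as clean_data, but reading from any file-like object."""
    texts, labels = [], []
    for line in handle:
        line = line.strip()
        if not line:  # skip blank lines (added safeguard, not in the original)
            continue
        sent, score = line.split("\t")
        texts.append(sent)
        labels.append(float(score))
    return texts, labels

# made-up rows in the assumed "text<TAB>label" format
sample = io.StringIO("Great product.\t1\nNever worked once!\t0\n")
texts, labels = parse_lines(sample)
```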

In [3]:
amazon_x,amazon_y = clean_data("../data/sentiment/amazon.txt")
yelp_x,yelp_y = clean_data("../data/sentiment/yelp.txt")
print(amazon_x[-1],amazon_y[-1])
print(yelp_x[-1],yelp_y[-1])

You can not answer calls with the unit, never worked once! 0.0
Then, as if I hadn't wasted enough of my life there, they poured salt in the wound by drawing out the time it took to bring the check. 0.0


In [4]:
def vectorize_text(text_list):
    x = []
    for string in text_list:
        vecs = [token.vector for token in nlp(string) if not token.is_stop and token.has_vector and not token.is_punct]
        if len(vecs)==0:
            x.append(np.zeros((300,)))
        else:
            x.append(np.mean(np.array(vecs),axis=0))
    return x
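
The averaging step in `vectorize_text` can be checked in isolation with plain NumPy. The sketch below uses made-up 2-dimensional vectors instead of the 300-dimensional spaCy vectors, and `mean_vector` is a hypothetical name.

```python
import numpy as np

def mean_vector(token_vectors, dim=300):
    """Average a list of token vectors; fall back to zeros when the list is empty."""
    if len(token_vectors) == 0:
        return np.zeros((dim,), dtype="float32")
    return np.mean(np.asarray(token_vectors, dtype="float32"), axis=0)

toy = [np.array([1.0, 3.0]), np.array([3.0, 5.0])]  # made-up 2-d "embeddings"
avg = mean_vector(toy, dim=2)      # element-wise mean of the two vectors
empty = mean_vector([], dim=2)     # review with no usable tokens
```

The zero fallback matters for reviews whose tokens are all stop words, punctuation, or out of vocabulary; without it `np.mean` over an empty array would produce NaN values.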

In [5]:
amazon_x = np.array(vectorize_text(amazon_x))
amazon_x = amazon_x.astype("float32")
yelp_x = np.array(vectorize_text(yelp_x))
yelp_x = yelp_x.astype("float32")
x = np.vstack([amazon_x,yelp_x])

amazon_y = np.expand_dims(np.array(amazon_y),axis=-1).astype("float32")
yelp_y = np.expand_dims(np.array(yelp_y),axis=-1).astype("float32")

## Baseline Modeling

In [6]:
def loss_function(pred,y):
    """ binary crossentropy loss function
    """
    return tf.reduce_sum(K.binary_crossentropy(y,pred,from_logits=True))
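
Because `from_logits=True` is set, the model's raw outputs are treated as logits and the loss is summed over the batch rather than averaged. A NumPy sketch of the usual numerically stable form, `max(z, 0) - z*y + log(1 + exp(-|z|))`, is given below; the function name is an assumption and this is not the Keras implementation itself.

```python
import numpy as np

def bce_from_logits_sum(logits, labels):
    """Sum of binary cross-entropy terms computed directly from logits."""
    logits = np.asarray(logits, dtype="float64")
    labels = np.asarray(labels, dtype="float64")
    per_example = (np.maximum(logits, 0.0)
                   - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_example.sum()

total = bce_from_logits_sum([0.0, 2.0], [1.0, 1.0])

# reference value from the textbook definition -sum(log(sigmoid(z))) for y = 1
reference = -(np.log(0.5) + np.log(1.0 / (1.0 + np.exp(-2.0))))
```

Since the loss is a sum, its magnitude scales with the batch size of 100 used later in training.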

In [7]:
def predict(pred):
    """ returns the most likely class, 0 or 1
    """
    scaled = tf.sigmoid(pred)
    predictions = scaled >= 0.5 # setting cutoff at 0.5
    predictions = tf.cast(predictions,tf.float32)
    return predictions
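
Thresholding the sigmoid output at 0.5 is equivalent to thresholding the raw logit at 0, which the following NumPy check illustrates on made-up logits.

```python
import numpy as np

logits = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])  # made-up model outputs
probs = 1.0 / (1.0 + np.exp(-logits))

by_prob = (probs >= 0.5).astype("float32")    # cutoff on the probability
by_logit = (logits >= 0.0).astype("float32")  # cutoff on the logit
```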

In [8]:
def accuracy(pred,y):
    """ returns the accuracy based on class labels
    """
    return tf.keras.metrics.Accuracy()(y,pred) #(labels=y,predictions=pred)

In [9]:
def baseline_model():
    """ DNN with the same architecture as the "encoder" for the SDAE
    """
    x = Input(shape=(300,)) # 300-dimensional averaged word embedding
    # encoder
    h1 = Dense(256,activation="relu")(x)
    h2 = Dense(128,activation="relu")(h1)
    # linear model
    out = Dense(1,activation=None)(h2)
    model = Model(inputs=x,outputs=out)
    return model

In [10]:
baseline = baseline_model()

In [11]:
def train_model(x_train,y_train,x_val,y_val,model,epochs=100,lr=0.001):
    optimizer = Adam(lr)
    for epoch in range(epochs):
        losses = []
        for i in range(0,len(x_train)-100,100): # batch size of 100; note that this range stops before the final batch when len(x_train) is a multiple of 100
            x = x_train[i:i+100]
            y = y_train[i:i+100]
            with tf.GradientTape() as tape:
                pred = model(x)
                loss = loss_function(pred,y)
            losses.append(float(loss))
            gradients = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))
            
        train_accuracy = accuracy(predict(model(x_train)),y_train)
        test_accuracy = accuracy(predict(model(x_val)),y_val)
        avg_loss = sum(losses)/len(losses)
        
        print("epoch {}; loss:{}; train_acc:{}; test_acc:{};".format(epoch+1,avg_loss,train_accuracy,test_accuracy))
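
The batch loop in `train_model` can be inspected without any training. The sketch below lists the start indices it visits for 1000 examples (the size of each review set) and compares them with an alternative range that includes the last full batch; both helper names are hypothetical.

```python
def batch_starts(n, batch_size=100):
    """Start indices visited by range(0, n - batch_size, batch_size), as in train_model."""
    return list(range(0, n - batch_size, batch_size))

def full_batch_starts(n, batch_size=100):
    """Alternative that visits every full batch and still drops a trailing partial one."""
    return list(range(0, n - batch_size + 1, batch_size))

starts = batch_starts(1000)
all_starts = full_batch_starts(1000)
```

With 1000 examples the original range yields nine batches, so examples 900 to 999 never contribute a gradient step, although they are still included in the reported training accuracy.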

In [12]:
train_model(amazon_x,amazon_y,yelp_x,yelp_y,baseline,epochs=10,lr=0.001)

epoch 1; loss:61.496163262261284; train_acc:0.8140000104904175; test_acc:0.7919999957084656;
epoch 2; loss:44.87359703911675; train_acc:0.8349999785423279; test_acc:0.800000011920929;
epoch 3; loss:36.59075948927138; train_acc:0.8460000157356262; test_acc:0.8029999732971191;
epoch 4; loss:31.989912668863933; train_acc:0.8700000047683716; test_acc:0.7919999957084656;
epoch 5; loss:28.347305085923935; train_acc:0.890999972820282; test_acc:0.777999997138977;
epoch 6; loss:25.075159284803604; train_acc:0.9020000100135803; test_acc:0.7710000276565552;
epoch 7; loss:22.090785132514107; train_acc:0.9129999876022339; test_acc:0.7699999809265137;
epoch 8; loss:19.238612916734482; train_acc:0.9290000200271606; test_acc:0.7689999938011169;
epoch 9; loss:16.453226195441353; train_acc:0.9430000185966492; test_acc:0.7699999809265137;
epoch 10; loss:13.792529477013481; train_acc:0.9559999704360962; test_acc:

## SDAE Modeling

In [13]:
def reconstruction_loss(pred,y):
    """ reconstruction loss: mean squared error per example, summed over the batch
    """
    return tf.reduce_sum(tf.keras.losses.MSE(y,pred))
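
`tf.keras.losses.MSE` averages the squared error over the last axis, giving one value per example, and `tf.reduce_sum` then adds those values across the batch. A NumPy sketch of the same computation on made-up numbers (the function name is an assumption):

```python
import numpy as np

def reconstruction_loss_np(pred, target):
    """Mean squared error over features, summed over the examples in the batch."""
    per_example = np.mean((target - pred) ** 2, axis=-1)
    return per_example.sum()

target = np.array([[1.0, 2.0], [0.0, 0.0]])
pred = np.array([[1.0, 0.0], [1.0, 1.0]])
loss = reconstruction_loss_np(pred, target)  # (0 + 4)/2 + (1 + 1)/2
```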

In [14]:
class AddBias(Layer): # ultimately not used: tying the decoder to the transposed encoder weights did not perform well
    """ custom layer for adding bias terms into the decoder portion of the SDAE
    """
    def __init__(self,output_dim,**kwargs):
        self.output_dim = output_dim
        super(AddBias,self).__init__(**kwargs)
    
    def build(self,input_shape):
        self.bias_term = self.add_weight(name="bias",shape=(self.output_dim,),initializer="zeros",trainable=True)
        super(AddBias,self).build(input_shape)
        
    def call(self,x):
        return x+self.bias_term
    
    def compute_output_shape(self,input_shape):
        return input_shape
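
`AddBias` was meant for a tied-weights decoder, in which each decoder layer multiplies by the transpose of the matching encoder kernel and then adds its own bias. The shapes involved can be sketched in NumPy as follows; the random arrays are stand-ins for trained weights, and the sketch covers only the final 256 -> 300 decoder layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(300, 256)).astype("float32")  # stand-in for the first encoder kernel
b_dec = np.zeros((300,), dtype="float32")           # stand-in for the AddBias term

d1 = rng.normal(size=(4, 256)).astype("float32")    # made-up batch of 4 hidden activations
recon = d1 @ W1.T + b_dec                           # tied decoder: reuse W1 transposed
```

Tying the weights roughly halves the number of kernel parameters, at the cost of forcing the decoder to mirror the encoder.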

In [15]:
def sdae(dense1,dense2):
    """ 2-layered stacked denoising auto-encoder, using masking noise
    """
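    x = Input(shape=(300,))
    
    # encoder
    # note: as written, both masks are sampled once when the model is built and have a fixed batch dimension of 100
    input_mask = tf.cast(tf.random.uniform((100,300),minval=0,maxval=1)<=0.25,tf.float32) # entries are 1 with probability 0.25, so about 75% of values are zeroed
    x_masked = x*input_mask
    h1 = dense1(x_masked)
    h_mask = tf.cast(tf.random.uniform((100,256),minval=0,maxval=1)<=0.25,tf.float32) # entries are 1 with probability 0.25, so about 75% of values are zeroed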
    h1_masked = h1*h_mask
    h2 = dense2(h1_masked) # latent layer
    
    # decoder
    d1 = Dense(256,activation="relu")(h2)  # AddBias(256)(K.dot(h2,K.transpose(dense2.get_weights()[0]))) << using the transpose of encoder weight matrix
    d2 = Dense(300)(d1) # AddBias(300)(K.dot(d1,K.transpose(dense1.get_weights()[0])))
    
    model = Model(inputs=x,outputs=d2)
    return model
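
Masking noise multiplies the input by a random 0/1 mask. The NumPy sketch below shows how the comparison `uniform <= p` controls the fraction of entries that survive, and draws a fresh mask on every call; `masking_noise` and `keep_prob` are hypothetical names, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def masking_noise(batch, keep_prob, rng):
    """Zero each entry independently; an entry survives with probability keep_prob."""
    mask = (rng.uniform(size=batch.shape) <= keep_prob).astype(batch.dtype)
    return batch * mask, mask

batch = np.ones((1000, 300), dtype="float32")
noisy, mask = masking_noise(batch, keep_prob=0.25, rng=rng)
kept_fraction = mask.mean()                     # close to 0.25

# a second call draws a different mask, as denoising training normally requires
_, mask2 = masking_noise(batch, keep_prob=0.25, rng=rng)
```

To zero roughly 25% of the values instead, the same helper would be called with `keep_prob=0.75`.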

In [25]:
dense1 = Dense(256,activation="relu",kernel_regularizer=l2(10)) # as specified in the first paper, regularization is important for the global loss 
dense2 = Dense(128,activation="relu",kernel_regularizer=l2(10))
model = sdae(dense1,dense2)
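
An `l2(10)` kernel regularizer contributes ten times the sum of squared kernel weights to the objective. The sketch below computes that penalty in NumPy, first for a tiny made-up kernel and then for a 300x256 kernel drawn from a Glorot-uniform distribution, which is the Keras default initializer for `Dense` kernels; the helper name is an assumption.

```python
import numpy as np

def l2_penalty(kernel, factor):
    """What an l2(factor) kernel regularizer computes: factor * sum of squared weights."""
    return factor * np.sum(np.square(kernel))

small = l2_penalty(np.array([[0.1, -0.2], [0.3, 0.0]]), factor=10.0)  # 10 * 0.14

# Glorot-uniform initialisation for a 300x256 kernel
rng = np.random.default_rng(0)
limit = np.sqrt(6.0 / (300 + 256))
kernel = rng.uniform(-limit, limit, size=(300, 256))
initial_penalty = l2_penalty(kernel, factor=10.0)
```

At initialisation the penalty for the first layer alone comes out in the thousands, so a factor of 10 is a very strong penalty relative to a reconstruction loss of order 1 to 5.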

In [26]:
np.random.shuffle(x)

In [27]:
# training the feature extractor
epochs=100
optimizer = Adam(0.001)
for epoch in range(epochs):
    losses = []
    for i in range(0,len(x)-100,100): # batch size of 100
        a_x = x[i:i+100] # auto-encoder, so y=a_x
        with tf.GradientTape() as tape:
            pred = model(a_x)
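            # note: only the reconstruction term is optimized here; the l2 penalties are collected in model.losses and would need to be added explicitly to take effect
            loss = reconstruction_loss(pred,a_x)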
        losses.append(float(loss))
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    avg_loss = sum(losses)/len(losses)

    print("epoch {}; loss:{}".format(epoch+1,avg_loss))

epoch 1; loss:5.492444966968737
epoch 2; loss:3.734927440944471
epoch 3; loss:3.458477911196257
epoch 4; loss:3.3228472784945837
epoch 5; loss:3.134611543856169
epoch 6; loss:2.9703965689006604
epoch 7; loss:2.8086777486299215
epoch 8; loss:2.6499110648506568
epoch 9; loss:2.506976416236476
epoch 10; loss:2.3767161494807194
epoch 11; loss:2.258667883120085
epoch 12; loss:2.1541127091959904
epoch 13; loss:2.0633562615043237
epoch 14; loss:1.98483491571326
epoch 15; loss:1.9188478808653981
epoch 16; loss:1.877349106889022
epoch 17; loss:1.8645670413970947
epoch 18; loss:1.8210481342516447
epoch 19; loss:1.7436867073962563
epoch 20; loss:1.7020318696373387
epoch 21; loss:1.6692597677833156
epoch 22; loss:1.6198507672862004
epoch 23; loss:1.5741715368471647
epoch 24; loss:1.539167793173539
epoch 25; loss:1.50889411725496
epoch 26; loss:1.4803690910339355
epoch 27; loss:1.4503541306445473
epoch 28; loss:1.4245770793212087
epoch 29; loss:1.4000361342179148
epoch 30; loss:1.3782022438551251
e

In [28]:
amazon_feat = dense2(dense1(amazon_x)) # 1000x128
yelp_feat = dense2(dense1(yelp_x)) # 1000x128

In [29]:
amazon_feat = amazon_feat.numpy()
yelp_feat = yelp_feat.numpy()
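
Feature extraction reuses only the two encoder layers, without masking noise and without the decoder. The forward pass can be sketched in NumPy as below, with random arrays standing in for the trained `dense1` and `dense2` weights; the `encode` helper is hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def encode(x, W1, b1, W2, b2):
    """Two-layer ReLU encoder: 300 -> 256 -> 128, with no masking noise applied."""
    return relu(relu(x @ W1 + b1) @ W2 + b2)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.05, size=(300, 256)), np.zeros(256)  # stand-in weights
W2, b2 = rng.normal(scale=0.05, size=(256, 128)), np.zeros(128)
x = rng.normal(size=(5, 300))                                    # made-up batch of 5 inputs
features = encode(x, W1, b1, W2, b2)
```

The resulting 128-dimensional, non-negative features are what the linear classifier in the next cells is trained on.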

In [30]:
def sdae_model():
    """ Linear model for the extracted features of SDAE
    """
    x = Input(shape=(128,)) # 128-dimensional SDAE features
    # linear model
    out = Dense(1,activation=None)(x)
    model = Model(inputs=x,outputs=out)
    return model

In [31]:
sdae_classifier = sdae_model() # separate name so the sdae_model function is not overwritten

In [32]:
train_model(amazon_feat,amazon_y,yelp_feat,yelp_y,sdae_classifier,epochs=10,lr=0.005)

epoch 1; loss:75.29274410671658; train_acc:0.6309999823570251; test_acc:0.5590000152587891;
epoch 2; loss:59.13066864013672; train_acc:0.7210000157356262; test_acc:0.6940000057220459;
epoch 3; loss:51.987831115722656; train_acc:0.7559999823570251; test_acc:0.7250000238418579;
epoch 4; loss:48.97614500257704; train_acc:0.765999972820282; test_acc:0.7419999837875366;
epoch 5; loss:47.30768076578776; train_acc:0.7749999761581421; test_acc:0.7639999985694885;
epoch 6; loss:46.167416678534615; train_acc:0.781000018119812; test_acc:0.777999997138977;
epoch 7; loss:45.321928872002495; train_acc:0.7879999876022339; test_acc:0.7879999876022339;
epoch 8; loss:44.64265018039279; train_acc:0.7940000295639038; test_acc:0.7919999957084656;
epoch 9; loss:44.077796936035156; train_acc:0.7929999828338623; test_acc:0.7950000166893005;
epoch 10; loss:43.588846842447914; train_acc:0.8009999990463257; test_acc:0.800000011920929;
