# **"Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts"**

*Reproduced by Kassymkhan Tengel and Nagima Chalkarova*

# **Introduction**
---

   Nowadays the number of different data is increasing. Especially, there is an exponential growth in textual data. That is why a big amount of studies is actively conducted for the text analysis. Most of works are related to the classification of texts. Because textual data can have various categories, such as author gender, sentiment, language category and so on. In this work sentiment analysis is studied. It involves classes such as very negative, negative, neutral, positive, and very positive. The sentiment analysis of short texts is important. It has its own applications in different industrial fields. For example, on the websites about restaurants or movies, people can leave their opinions or sentiments regarding some places or movies, respectively. If these sentiments are well predicted, then it can be used further to give restaurants or movies recommendations and so on. 

   This work is aimed to reproduce the paper of Santos and Gatti  called “Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts” [1](https://www.aclweb.org/anthology/C14-1008.pdf). In their work, the aim is to do sentiment classification of short texts like Twitter messages which is challenging task. Because it usually contains limited contextual information. They used a deep convolutional network. The proposed network is called Character to Sentence Convolutional Neural Network (CharSCNN).The strenght of this network is that it analyzes the data in character level, in other words more deeper than others. It uses two convolutional layers. .One is aimed to extract related features from words and another is used to extract related features from sentences. The next model is called SCNN, which stands for Sentence Convolutional Neural Networks. It is in some way similar to previous model, but the difference is that it dpes not analyze the data in character level, just in sentence level. It is assumed that the first one is more accurate than latter. Also, the paper shows that using unsupervised pre-training is useful and compares different methods of machine learning for classification with these proposed models.


# **Data description** 
---
Two datasets were used in this paper. They are Twitter posts, which are from Stanford Twitter Sentiment (STS) corpus and movie reviews, which are proposed by Stanford Sentiment Treebank (SSTb). STS corpus has 1.6 million Twitter messages with class labels positive or negative. In the experiment we took randomly 80,000 tweets for training set and 20% of training set (16,000) was taken for validation set. Test set which was manually picked consists of 498 tweets. SSTb corpus contains fine grained (5 classes) sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences. Training set contains 8544 sentences, validation set contains 1101 sentences and test set has 2210 sentences [1](https://www.aclweb.org/anthology/C14-1008.pdf).

Dataset | Set | # of sentences/tweets | # of classes
--- | --- | --- | ---
             | Training | 8544 | 5
SSTb | Validation | 1101 | 5
             | Test| 2210 | 5
             | Training | 80,000 | 2
STS | Validation | 16,000 | 2
             | Test | 498 | 2
             |**Table-1. Dataset for Sentiment analysis**

# **Necessary Python libraries**
---
The code below contains importings of necessary python tools. Most tools are from keras. And we use google colab to get data from my drive. Also, the constants which will be used throgout the project are given.

In [0]:
import csv
import sys
from numpy import loadtxt
import os
from numpy import array, asarray, zeros
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, SpatialDropout1D, LSTM
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
import pandas as pd
import re
import random
import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
import pandas as pd
from keras.layers import Input, Embedding, Activation, Flatten, Dense
from keras.layers import Conv1D, MaxPooling1D, Dropout
from keras.models import Model
adam = Adam(lr=0.01)
num_epochs = 10
val_portion=0.2
embedding_dim = 100
training_size=80000
max_length = 20
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"

# **Reproduction for STS dataset**
---
Before working with models, it is better to preprocess the dataset. Our data has 6 columns, but 4 of them are not necesssary. So we deleted them, and 2 remaining columns containins the label and and tweet were used. After we removed unnecessary columns, we shuffled our data with two columns. Then by using tokenizer from Keras, we set tokens to each sentences, then by using padded we made our sentences have the same lenght. This process was applied to train data, val data and test data. Now, we can work with models.



# Pretrained SCNN model
The code below contains the pretrained SCNN model for STS dataset. To implement the model We used only some portion of STS dataset as author did. He used 80K samples for training data. Also, from 80k samples, its 20% is used for validation data. For the test portion, We manually picked 498 another samples. SCNN model contains one embedding layer, 2 bidirectional layers and 2 dense layers. For the embedding layer I used fixed valued elements such as vocab size, input size and embedding dimension. Also, the weights were derived from pretrained 100 dimensional glove file with 6 billion tokens inside. It is an open source material that can be used by everyone. The first LSTM bidirectional layer has 64 neurons, whereas the latter LSTM has 32 neurons. Actually, it all depends on user, but preferably, it is better to use values that can bring high accuracy of the model. The first dense layer has relu activation, whereas the next one has sigmoid activation function. Sigmoid is used, because our data contains binary labels. As optimizer, Adam optimizer with 0.01 learning rate was chosen. We decided to run epochs as author did, because increasing epoch number increases the time to run. As a resutl, the accuracy obtained from model reached 77%. When model was evaluated with test data, it reached almost 80% accuracy. It is worth to mention that we tried many models such as GRU, single LSTM, Convolution in 1D, but the best one among them was bidirectional lstm model. If to compare with author's accuracy which is 85.2%, our model did it poorly. However, it is due to some differences in our models. Author used wikipedia of 2013th year as pretrained data, but we could not find it, so instead of it we used 100 dimensional glove tokens. STS dataset was trained with other models. The following results were obtained: SVM-82.2%, NB-82.7%.

In [0]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1pIQYABUucCVV7qKy4sl8EWZfiBaNzYMu' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1pIQYABUucCVV7qKy4sl8EWZfiBaNzYMu" -O /content/tootrain.csv && rm -rf /tmp/cookies.txt
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1xoJwRR_1nnGFeBtngzbp7SLDhxc4ijSJ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1xoJwRR_1nnGFeBtngzbp7SLDhxc4ijSJ" -O /content/totest.csv && rm -rf /tmp/cookies.txt
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1R-V-TP8TkQQLTcQVEVSagIU_U6N4_jth' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1R-V-TP8TkQQLTcQVEVSagIU_U6N4_jth" -O /content/glove.6B.100d.txt && rm -rf /tmp/cookies.txt


--2019-11-22 11:10:03--  https://docs.google.com/uc?export=download&confirm=ko9B&id=1pIQYABUucCVV7qKy4sl8EWZfiBaNzYMu
Resolving docs.google.com (docs.google.com)... 108.177.125.101, 108.177.125.138, 108.177.125.139, ...
Connecting to docs.google.com (docs.google.com)|108.177.125.101|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0g-3s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/l8ilscfubh99cflbjkufdrtp640sn969/1574416800000/14221645452890223212/*/1pIQYABUucCVV7qKy4sl8EWZfiBaNzYMu?e=download [following]
--2019-11-22 11:10:03--  https://doc-0g-3s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/l8ilscfubh99cflbjkufdrtp640sn969/1574416800000/14221645452890223212/*/1pIQYABUucCVV7qKy4sl8EWZfiBaNzYMu?e=download
Resolving doc-0g-3s-docs.googleusercontent.com (doc-0g-3s-docs.googleusercontent.com)... 64.233.189.132, 2404:6800:4008:c07::84
Connecting to doc-0g-3s-docs.googleusercontent

In [0]:
corpus = []
sentences=[]
labels=[]
num_sentences = 0

with open('/content/tootrain.csv',encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        list_item=[]
        list_item.append(row[5])
        this_label=row[0]
        if this_label=='0':
            list_item.append(0)
        else:
            list_item.append(1)
        num_sentences = num_sentences + 1
        corpus.append(list_item)
random.shuffle(corpus)
for x in range(training_size):
    sentences.append(corpus[x][0])
    labels.append(corpus[x][1])
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
vocab_size=len(word_index)
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
split = int(val_portion * training_size)
test_sequences = padded[0:split]
training_sequences = padded[split:training_size]
test_labels = labels[0:split]
training_labels = labels[split:training_size]
embeddings_index = {};
with open('/content/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split();
        word = values[0];
        coefs = np.asarray(values[1:], dtype='float32');
        embeddings_index[word] = coefs;

embeddings_matrix = np.zeros((vocab_size+1, embedding_dim));
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word);
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector;
# model = tf.keras.Sequential([
# tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length, weights=[embeddings_matrix], trainable=False),
# tf.keras.layers.Conv1D(64, 5, activation='relu'),
# tf.keras.layers.MaxPooling1D(pool_size=4),
# tf.keras.layers.Flatten(),
# tf.keras.layers.Dropout(0.2),
# tf.keras.layers.Dense(1, activation='sigmoid'),
# ])
# model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length,weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
history1 = model.fit(training_sequences, training_labels, epochs=num_epochs, validation_data=(test_sequences, test_labels), verbose=1)   
newcorpus = []
newsentences=[]
newlabels=[]
with open('/content/totest.csv',encoding='cp1252') as csvfile:
    newreader = csv.reader(csvfile, delimiter=',')
    for row in newreader:
        list_item=[]
        list_item.append(row[5])
        this_label=row[0]
        if this_label=='0':
            list_item.append(0)
        else:
            list_item.append(1)
        num_sentences = num_sentences + 1
        newcorpus.append(list_item)
random.shuffle(newcorpus)
for x in range(s):
    newsentences.append(newcorpus[x][0])
    newlabels.append(newcorpus[x][1])
tokenizer = Tokenizer()
tokenizer.fit_on_texts(newsentences)
word_index = tokenizer.word_index
vocab_size=len(word_index)
sequences1 = tokenizer.texts_to_sequences(newsentences)
padded1 = pad_sequences(sequences1, maxlen=max_length, padding=padding_type, truncating=trunc_type)
teest_sequences = padded[0:498]
teest_labels = labels[0:498]
results1 = model.evaluate(teest_sequences, teest_labels, batch_size=128)                   

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 20, 100)           8438900   
_________________________________________________________________
bidirectional (Bidirectional (None, 20, 128)           84480     
___

# Random SCNN model
The model below consists of series of layers such as embedding layer, bidirectional lstm layers, dense layers. Each layer has its own characteristics which can be seen in the code. The data proportion is same as in the previous model. We tried many different combination of models,but the peak of our accuracy was almost 70%. For the unused test data, accuracy was 71%. For the same model with slightly differences, author had 82.2%, SVM model had 82.2% and NB model had 82.7% accuracy.The reason for that is SCNN model is a some convolution model that does not consist of concrete layers, so we tried many combinations of models, but the peak was reached by used model. Actually, we had the same trend in the reproduction. As author, with pretrained model we had greater accuracy than non trained random model.

In [0]:
corpus = []
sentences=[]
labels=[]
num_sentences = 0

with open('/content/tootrain.csv',encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        list_item=[]
        list_item.append(row[5])
        this_label=row[0]
        if this_label=='0':
            list_item.append(0)
        else:
            list_item.append(1)
        num_sentences = num_sentences + 1
        corpus.append(list_item)
random.shuffle(corpus)
for x in range(training_size):
    sentences.append(corpus[x][0])
    labels.append(corpus[x][1])
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
vocab_size=len(word_index)
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
split = int(val_portion * training_size)
test_sequences = padded[0:split]
training_sequences = padded[split:training_size]
test_labels = labels[0:split]
training_labels = labels[split:training_size]
# model = tf.keras.Sequential([
# tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length, weights=[embeddings_matrix], trainable=False),
# tf.keras.layers.Conv1D(64, 5, activation='relu'),
# tf.keras.layers.MaxPooling1D(pool_size=4),
# tf.keras.layers.Flatten(),
# tf.keras.layers.Dropout(0.2),
# tf.keras.layers.Dense(1, activation='sigmoid'),
# ])
# model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length, trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
history2 = model.fit(training_sequences, training_labels, epochs=num_epochs, validation_data=(test_sequences, test_labels), verbose=1)   
newcorpus = []
newsentences=[]
newlabels=[]
with open('/content/totest.csv',encoding='cp1252') as csvfile:
    newreader = csv.reader(csvfile, delimiter=',')
    for row in newreader:
        list_item=[]
        list_item.append(row[5])
        this_label=row[0]
        if this_label=='0':
            list_item.append(0)
        else:
            list_item.append(1)
        num_sentences = num_sentences + 1
        newcorpus.append(list_item)
random.shuffle(newcorpus)
for x in range(s):
    newsentences.append(newcorpus[x][0])
    newlabels.append(newcorpus[x][1])
tokenizer = Tokenizer()
tokenizer.fit_on_texts(newsentences)
word_index = tokenizer.word_index
vocab_size=len(word_index)
sequences1 = tokenizer.texts_to_sequences(newsentences)
padded1 = pad_sequences(sequences1, maxlen=max_length, padding=padding_type, truncating=trunc_type)
teest_sequences = padded[0:498]
teest_labels = labels[0:498]
results2 = model.evaluate(teest_sequences, teest_labels, batch_size=128)   

# Pretrained CharSCNN model
The following model was assumed to be more accurate, since it works with character level, and so it analyses the data more deeply. We enumerate each character with different numbers, and using this numbers convert the data into the array of numbers. Then by using the weights derived from pretrained glovo model, we construct our model and run it. The model has two 1D convolution layers with maxpoolings, flatten layer and dense layer. That list of layers is given by author. We use dropout element to drop some less significant terms. Activation is sigmoid and optimizer is adam as before. The model reached an accuracy of 76% which good enough. For the test data, unused 498 samples, accuracy was 72%. Author had 86.4% accuracy for this model. But it seems he had better pretrained data. Since our pretrained data is glove 100 dimension with 6 billion tokens, it may be non char level causing obstacles for our model, also lowering our accuracy. But, logically, char level cnn should have greater accuracy than SCNN model, since it analyses the data much more deeper. 


In [0]:
corpus = []
with open('/content/tootrain.csv',encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        list_item=[]
        list_item.append(row[5])
        this_label=row[0]
        if this_label=='0':
            list_item.append(0)
        else:
            list_item.append(1)
        num_sentences = num_sentences + 1
        corpus.append(list_item)
sentences=[]
labels=[]
num_sentences = 0
random.shuffle(corpus)
for x in range(training_size):
    sentences.append(corpus[x][0])
    labels.append(corpus[x][1])
split = int(val_portion * training_size)
test_sequences = sentences[0:split]
training_sequences = sentences[split:training_size]
test_labels = labels[0:split]
training_labels = labels[split:training_size]
train_texts = training_sequences
train_texts = [s.lower() for s in train_texts]


test_texts = test_sequences
test_texts = [s.lower() for s in test_texts]

# =======================Convert string to index================
# Tokenizer
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(train_texts)
# If we already have a character list, then replace the tk.word_index
# If not, just skip below part

# -----------------------Skip part start--------------------------
# construct a new vocabulary
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy()
# Add 'UNK' to the vocabulary
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
# -----------------------Skip part end----------------------------

# Convert string to index
train_seq = tk.texts_to_sequences(train_texts)
test_texts = tk.texts_to_sequences(test_texts)

# Padding
train_data = pad_sequences(train_seq, maxlen=1014, padding='post')
test_data = pad_sequences(test_texts, maxlen=1014, padding='post')

# Convert to numpy array
train_data = np.array(train_data, dtype='float32')
test_data = np.array(test_data, dtype='float32')
input_size = 1014
vocab_size = len(tk.word_index)
embedding_size = 100

fully_connected_layers = [1024, 1024]
num_of_classes = 1
dropout_p = 0.2
optimizer = 'adam'
loss = 'binary_crossentropy'

embeddings_index = {};
with open('/content/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split();
        word = values[0];
        coefs = np.asarray(values[1:], dtype='float32');
        embeddings_index[word] = coefs;

embeddings_matrix = np.zeros((vocab_size+1, embedding_size));
for word, i in tk.word_index.items():
    embedding_vector = embeddings_index.get(word);
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector;
import tensorflow as tf
model = tf.keras.Sequential([
tf.keras.layers.Embedding(vocab_size + 1,
                            5,
                            input_length=input_size,
                            weights=[embeddings_matrix]),
tf.keras.layers.Conv1D(256, 5, activation='relu'),
tf.keras.layers.MaxPooling1D(pool_size=4),
tf.keras.layers.Conv1D(256, 5, activation='relu'),
tf.keras.layers.MaxPooling1D(pool_size=4),
tf.keras.layers.Flatten(),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(1, activation='sigmoid'),
])


model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])  # Adam, categorical_crossentropy
model.summary()

# Shuffle


x_train = train_data
y_train = training_labels

x_test = test_data
y_test = test_labels

# Training
history3 = model.fit(x_train, y_train,
          validation_data=(x_test, y_test), 
          batch_size=128,
          epochs=10,
          verbose=1)  
newcorpus = []
newsentences=[]
newlabels=[]
with open('/content/totest.csv',encoding='cp1252') as csvfile:
    newreader = csv.reader(csvfile, delimiter=',')
    for row in newreader:
        list_item=[]
        list_item.append(row[5])
        this_label=row[0]
        if this_label=='0':
            list_item.append(0)
        else:
            list_item.append(1)
        num_sentences = num_sentences + 1
        newcorpus.append(list_item)
    
random.shuffle(newcorpus)
sentences=[]
labels=[]
num_sentences = 0


for x in range(s):
    newsentences.append(newcorpus[x][0])
    newlabels.append(newcorpus[x][1])


test_sequences = newsentences[0:498]
teest_labels = newlabels[0:498]  
teest_texts = test_sequences
teest_texts = [s.lower() for s in teest_texts]



# =======================Convert string to index================
# Tokenizer
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(teest_texts)
# If we already have a character list, then replace the tk.word_index
# If not, just skip below part

# -----------------------Skip part start--------------------------
# construct a new vocabulary
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy()
# Add 'UNK' to the vocabulary
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
# -----------------------Skip part end----------------------------

# Convert string to index
teest_seq = tk.texts_to_sequences(teest_texts)

# Padding
teest_data = pad_sequences(teest_seq, maxlen=1014, padding='post')

# Convert to numpy array
teest_data = np.array(teest_data, dtype='float32')
result3 = model.evaluate(teest_data, teest_labels, batch_size=128)    

# Random CharSCNN model
This model is as previous model trains the data deeply with character level. The difference is that we do not use here pretrained glove tokens. Model as before consists of layers that author stated. Activation is sigmoid, optimizer is adam with learning rate 0.01. As we mentioned, it is logically true that char scnn model should have more accuracy than simple scnn model, it is due the deepness of the model. If in previous data we had 76% accuracy, here without pretrained information, we have 78% accuracy. It is one more proof of that glove token is not good pretrained data for charscnn, because instead of bettering the accuracy, it lowered the acccuracy. The authar had 82.2% accuracy for the same model, but our model is also good enough, if we compare it with previous 3 models.


In [0]:
corpus = []
with open('/content/tootrain.csv',encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        list_item=[]
        list_item.append(row[5])
        this_label=row[0]
        if this_label=='0':
            list_item.append(0)
        else:
            list_item.append(1)
        num_sentences = num_sentences + 1
        corpus.append(list_item)
sentences=[]
labels=[]
num_sentences = 0
random.shuffle(corpus)
for x in range(training_size):
    sentences.append(corpus[x][0])
    labels.append(corpus[x][1])
split = int(val_portion * training_size)
test_sequences = sentences[0:split]
training_sequences = sentences[split:training_size]
test_labels = labels[0:split]
training_labels = labels[split:training_size]
train_texts = training_sequences
train_texts = [s.lower() for s in train_texts]


test_texts = test_sequences
test_texts = [s.lower() for s in test_texts]

# =======================Convert string to index================
# Tokenizer
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(train_texts)
# If we already have a character list, then replace the tk.word_index
# If not, just skip below part

# -----------------------Skip part start--------------------------
# construct a new vocabulary
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy()
# Add 'UNK' to the vocabulary
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
# -----------------------Skip part end----------------------------

# Convert string to index
train_seq = tk.texts_to_sequences(train_texts)
test_texts = tk.texts_to_sequences(test_texts)

# Padding
train_data = pad_sequences(train_seq, maxlen=1014, padding='post')
test_data = pad_sequences(test_texts, maxlen=1014, padding='post')

# Convert to numpy array
train_data = np.array(train_data, dtype='float32')
test_data = np.array(test_data, dtype='float32')
input_size = 1014
vocab_size = len(tk.word_index)
embedding_size = 100

fully_connected_layers = [1024, 1024]
num_of_classes = 1
dropout_p = 0.2
optimizer = 'adam'
loss = 'binary_crossentropy'

model = tf.keras.Sequential([
tf.keras.layers.Embedding(vocab_size + 1,
                            5,
                            input_length=input_size),
tf.keras.layers.Conv1D(256, 5, activation='relu'),
tf.keras.layers.MaxPooling1D(pool_size=4),
tf.keras.layers.Conv1D(256, 5, activation='relu'),
tf.keras.layers.MaxPooling1D(pool_size=4),
tf.keras.layers.Flatten(),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(1, activation='sigmoid'),
])


model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])  # Adam, categorical_crossentropy
model.summary()

# Shuffle


x_train = train_data
y_train = training_labels

x_test = test_data
y_test = test_labels

# Training
history4=model.fit(x_train, y_train,
          validation_data=(x_test, y_test), 
          batch_size=128,
          epochs=10,
          verbose=1)  
newcorpus = []
newsentences=[]
newlabels=[]
with open('/content/totest.csv',encoding='cp1252') as csvfile:
    newreader = csv.reader(csvfile, delimiter=',')
    for row in newreader:
        list_item=[]
        list_item.append(row[5])
        this_label=row[0]
        if this_label=='0':
            list_item.append(0)
        else:
            list_item.append(1)
        num_sentences = num_sentences + 1
        newcorpus.append(list_item)     
random.shuffle(newcorpus)
sentences=[]
labels=[]
num_sentences = 0


for x in range(s):
    newsentences.append(newcorpus[x][0])
    newlabels.append(newcorpus[x][1])


test_sequences = newsentences[0:498]
teest_labels = newlabels[0:498]  
teest_texts = test_sequences
teest_texts = [s.lower() for s in teest_texts]



# =======================Convert string to index================
# Tokenizer
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(teest_texts)
# If we already have a character list, then replace the tk.word_index
# If not, just skip below part

# -----------------------Skip part start--------------------------
# construct a new vocabulary
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy()
# Add 'UNK' to the vocabulary
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
# -----------------------Skip part end----------------------------

# Convert string to index
teest_seq = tk.texts_to_sequences(teest_texts)

# Padding
teest_data = pad_sequences(teest_seq, maxlen=1014, padding='post')

# Convert to numpy array
teest_data = np.array(teest_data, dtype='float32')
result4 = model.evaluate(teest_data, teest_labels, batch_size=128)    

# Results for STS in Tabular form
---
Results for STS corpus


Model | Accuracy(unsupervised pre-training) | Accuracy( random word embeddings)|
--- | --- | --- 
CharSCNN     | 76% | 72% 
SCNN | 79.5% | 70.6%
CharSCNN (Santos and Gatti, 2014)     | 86.4% | 81.9% 
SCNN (Santos and Gatti, 2014)   | 85.2% | 82.2%
LProp (Speriosu et al., 2011)            | 84.7%
MaxEnt (Go et al., 2009)             | 83.0%
NB (Go et al., 2009)  | 82.7%
SVM (Go et al., 2009) | 82.2%
             **Table-2. Accuracy of different models for binary classification using STS.**


# **Reproduction for SSTb dataset**
---
In order to work with this dataset, firstly we installed pytreebank. Because it contains fine grained (5 classes) sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences. We used pytreebank library to convert tree structured data to a tabular form.


# Pretrained SCNN model
In the model below, we work with SSTb dataset. It has 8544 samples for training, 1101 samples for validation and 2210 samples test data. Samples contain one of the 5 labels expressing different sentiments. In our model, there are 1 embedding layer, 2 bidirectional lstm layers and 2 dense layers. We chose these layer, because this combination of layers had more accuracy than any other layers we tried. The softmax function is used as activation function, because our data has 5 classes. As optimizer we took adam optimizer from keras with learning rate 0.01. Weights were taken from pretrained glove data with dimension 100. Running the model after constructing it properly, we faced 42% accuracy. Then we tested this model with unused test data, as a result it gave us 40% accuracy. The author of the paper had 48% accuracy for this same model with slightly changes. We can say that our model is good enough. because there are some differences in our and author's models. As pretrained work, he used 2013 december's wikipedia. Unfortunately, we could not use it, because we could not find this exact work, other works weighted several gigabytes in memory. If we compare our model's accuracy  with other popular models', those data can be derived: MVRNN - 44.4%, RNN - 43%, SVM - 40.7%, NB- 41%. So, it can be concluded that our model can be used in the same way as other models. But, We had a powerful computer, we could better our model, by increasing epoch numbers and giving different parameters.


In [0]:
!pip install pytreebank
import pytreebank

In [0]:

out_path = os.path.join(sys.path[0], 'sst_{}.txt')
dataset = pytreebank.load_sst('./raw_data')

# Store train, dev and test in separate files
for category in ['train', 'test', 'dev']:
    with open(out_path.format(category), 'w') as outfile:
        for item in dataset[category]:
            outfile.write("{}\t{}\n".format(
                item.to_labeled_lines()[0][0] ,
                item.to_labeled_lines()[0][1]
            ))
# Print the length of the training set
print(len(dataset['train']))
len(dataset['dev'])
train = loadtxt("sst_train.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
train_data=pd.DataFrame(train)
train_data.columns=['label', 'text']
print(train_data)
val = loadtxt("sst_dev.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
val_data=pd.DataFrame(val)
#val_data.columns=['label', 'text']
val_labels=val_data[0]
val_labels.shape
print(val_labels)
print(val_data[0][0])
val_sentences=val_data[1]
val_sentences.shape
print(val_sentences)
valid_labels=[]

for m in val_labels:
 valid_labels.append(int(m))
  #train_labels.append(str(m))
v_labels=np.array(valid_labels)

print(v_labels)
training_labels=train_data['label']
training_labels.shape
training_sentences=train_data['text']
training_sentences.shape
print(training_sentences)
train_labels=[]

for m in training_labels:
 train_labels.append(int(m))
  #train_labels.append(str(m))
tr_labels=np.array(train_labels) 
print(tr_labels)
import re
def preprocess_text(sen):
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
v_sentences=[]

for k in val_sentences:
 v_sentences.append(preprocess_text(k))

print(v_sentences)
train_sentences=[]

for k in training_sentences:
 train_sentences.append(preprocess_text(k))

vocab_size = 100000


tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(train_sentences)
padded = pad_sequences(sequences, truncating=trunc_type, maxlen=max_length)

val_sequences = tokenizer.texts_to_sequences(v_sentences)
val_padded = pad_sequences(val_sequences, maxlen=max_length)
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
from keras.utils import to_categorical
y_binary = to_categorical(v_labels)
from keras.utils import to_categorical
t_binary = to_categorical(tr_labels)
# model = tf.keras.Sequential([
#     tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=[embeddings_matrix], input_length=max_length),
#     tf.keras.layers.Conv1D(300, 5, activation='relu'),
#     tf.keras.layers.GlobalAveragePooling1D(),
#     tf.keras.layers.Dense(6, activation='relu'),
#     tf.keras.layers.Dense(5, activation='softmax')
# ])
# model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
adam = Adam(lr=0.01)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=[embeddings_matrix],input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
history5 = model.fit(padded, t_binary, epochs=num_epochs, validation_data=(val_padded, y_binary))
test = loadtxt("sst_test.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
test_data=pd.DataFrame(test)
#val_data.columns=['label', 'text']
testing_labels=test_data[0]
print(testing_labels.shape)
#print(testing_labels)
test_labels=[]

for m in testing_labels:
  test_labels.append(int(m))
t_labels=np.array(test_labels) 
print("For test set :",t_labels)
testing_sentences=test_data[1]
testing_sentences.shape
print(testing_sentences)
test_sentences=[]

for k in testing_sentences:
 test_sentences.append(preprocess_text(k))

print(test_sentences)
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, maxlen=max_length)
from keras.utils import to_categorical
test_binary = to_categorical(t_labels)
print(test_binary)
results5 = model.evaluate(test_padded, test_binary, batch_size=128)


# Random SCNN model
This SCNN model uses the same elements as pretrained SCNN model, except pretrained work. Here we do not used pretrained weights from any other work. After running the model, we had 35% accuracy. Testing it with unused test data brought 33% accuracy. So, it is clear that pretrained SCNN model works better than this model. We could not compare our results with author's results, because in the paper he did not give information about which SCNN model give the 48% accuracy. We believe that this accuracy was derived from pretrained SCNN model. 

In [0]:
out_path = os.path.join(sys.path[0], 'sst_{}.txt')
dataset = pytreebank.load_sst('./raw_data')

# Store train, dev and test in separate files
for category in ['train', 'test', 'dev']:
    with open(out_path.format(category), 'w') as outfile:
        for item in dataset[category]:
            outfile.write("{}\t{}\n".format(
                item.to_labeled_lines()[0][0] ,
                item.to_labeled_lines()[0][1]
            ))
# Print the length of the training set
print(len(dataset['train']))
len(dataset['dev'])
train = loadtxt("sst_train.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
train_data=pd.DataFrame(train)
train_data.columns=['label', 'text']
print(train_data)
val = loadtxt("sst_dev.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
val_data=pd.DataFrame(val)
#val_data.columns=['label', 'text']
val_labels=val_data[0]
val_labels.shape
print(val_labels)
print(val_data[0][0])
val_sentences=val_data[1]
val_sentences.shape
print(val_sentences)
valid_labels=[]

for m in val_labels:
 valid_labels.append(int(m))
  #train_labels.append(str(m))
v_labels=np.array(valid_labels)

print(v_labels)
training_labels=train_data['label']
training_labels.shape
training_sentences=train_data['text']
training_sentences.shape
print(training_sentences)
train_labels=[]

for m in training_labels:
 train_labels.append(int(m))
  #train_labels.append(str(m))
tr_labels=np.array(train_labels) 
print(tr_labels)
import re
def preprocess_text(sen):
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
v_sentences=[]

for k in val_sentences:
 v_sentences.append(preprocess_text(k))

print(v_sentences)
train_sentences=[]

for k in training_sentences:
 train_sentences.append(preprocess_text(k))

vocab_size = 100000


tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(train_sentences)
padded = pad_sequences(sequences, truncating=trunc_type, maxlen=max_length)

val_sequences = tokenizer.texts_to_sequences(v_sentences)
val_padded = pad_sequences(val_sequences, maxlen=max_length)
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
from keras.utils import to_categorical
y_binary = to_categorical(v_labels)
from keras.utils import to_categorical
t_binary = to_categorical(tr_labels)
# model = tf.keras.Sequential([
#     tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=[embeddings_matrix], input_length=max_length),
#     tf.keras.layers.Conv1D(300, 5, activation='relu'),
#     tf.keras.layers.GlobalAveragePooling1D(),
#     tf.keras.layers.Dense(6, activation='relu'),
#     tf.keras.layers.Dense(5, activation='softmax')
# ])
# model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
adam = Adam(lr=0.01)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
history6 = model.fit(padded, t_binary, epochs=num_epochs, validation_data=(val_padded, y_binary))
test = loadtxt("sst_test.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
test_data=pd.DataFrame(test)
#val_data.columns=['label', 'text']
testing_labels=test_data[0]
print(testing_labels.shape)
#print(testing_labels)
test_labels=[]

for m in testing_labels:
  test_labels.append(int(m))
t_labels=np.array(test_labels) 
print("For test set :",t_labels)
testing_sentences=test_data[1]
testing_sentences.shape
print(testing_sentences)
test_sentences=[]

for k in testing_sentences:
 test_sentences.append(preprocess_text(k))

print(test_sentences)
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, maxlen=max_length)
from keras.utils import to_categorical
test_binary = to_categorical(t_labels)
print(test_binary)
results6 = model.evaluate(test_padded, test_binary, batch_size=128)


# Pretrained CharSCNN model
This model consists of sequence of layers such as embedding layer, two one dimensional convolution layers 256 neurons inside, every convolution layer is followed by maxpooling layers with pooling size 4, and flattening layer, dropout layer which drops insignificant elemens, and the last layer is dense layer with 5 neurons. 5 because, we have classes. As optimizer we chose adam, softmax is activation fucntion for categorical case. In this model, we enumerate each existing character with numbers. Hence, the sentences in samples are converted into new array of numbers. It is supposed that this model should bring higher results, since the model deeply analyzes, in other words, we consider every character. As a pretrained work, we use glove samples. By using it, we construct weight matrix. But, when we used char level analysis for STS dataset by using pretrained work, it did not bring higher results. It was explained that glove samples may suit char level analysis. Despite those facts, we run our model and got this results: validation accuracy is 28% and test accuracy is 28% too. If we compare our result with aurthor's accuracy which is 43%, it is clear that he used better pretrained data and got meaningful weights. 

In [0]:
out_path = os.path.join(sys.path[0], 'sst_{}.txt')
dataset = pytreebank.load_sst('./raw_data')

# Store train, dev and test in separate files
for category in ['train', 'test', 'dev']:
    with open(out_path.format(category), 'w') as outfile:
        for item in dataset[category]:
            outfile.write("{}\t{}\n".format(
                item.to_labeled_lines()[0][0] ,
                item.to_labeled_lines()[0][1]
            ))
# Print the length of the training set
print(len(dataset['train']))
len(dataset['dev'])
train = loadtxt("sst_train.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
train_data=pd.DataFrame(train)
train_data.columns=['label', 'text']
print(train_data)
val = loadtxt("sst_dev.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
val_data=pd.DataFrame(val)
#val_data.columns=['label', 'text']
val_labels=val_data[0]
val_labels.shape
print(val_labels)
print(val_data[0][0])
val_sentences=val_data[1]
val_sentences.shape
print(val_sentences)
valid_labels=[]

for m in val_labels:
 valid_labels.append(int(m))
  #train_labels.append(str(m))
v_labels=np.array(valid_labels)

print(v_labels)
training_labels=train_data['label']
training_labels.shape
training_sentences=train_data['text']
training_sentences.shape
print(training_sentences)
train_labels=[]

for m in training_labels:
 train_labels.append(int(m))
  #train_labels.append(str(m))
tr_labels=np.array(train_labels) 
print(tr_labels)
import re
def preprocess_text(sen):
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
v_sentences=[]

for k in val_sentences:
 v_sentences.append(preprocess_text(k))

print(v_sentences)
train_sentences=[]

for k in training_sentences:
 train_sentences.append(preprocess_text(k))

print(train_sentences)
train_texts = train_sentences
train_texts = [s.lower() for s in train_texts]


test_texts = v_sentences
test_texts = [s.lower() for s in test_texts]

# =======================Convert string to index================
# Tokenizer
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(train_texts)
# If we already have a character list, then replace the tk.word_index
# If not, just skip below part

# -----------------------Skip part start--------------------------
# construct a new vocabulary
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy()
# Add 'UNK' to the vocabulary
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
# -----------------------Skip part end----------------------------

# Convert string to index
train_seq = tk.texts_to_sequences(train_texts)
test_texts = tk.texts_to_sequences(test_texts)


# Padding
train_data = pad_sequences(train_seq, maxlen=1014, padding='post')
test_data = pad_sequences(test_texts, maxlen=1014, padding='post')

# Convert to numpy array
train_data = np.array(train_data, dtype='float32')
test_data = np.array(test_data, dtype='float32')
word_index = tk.word_index
from keras.utils import to_categorical
y_binary = to_categorical(v_labels)
from keras.utils import to_categorical
t_binary = to_categorical(tr_labels)
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=[embeddings_matrix], input_length=max_length),
    tf.keras.layers.Conv1D(300, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
history7 = model.fit(train_data, t_binary, epochs=num_epochs, validation_data=(test_data, y_binary))
test = loadtxt("sst_test.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
test_data=pd.DataFrame(test)
#val_data.columns=['label', 'text']
testing_labels=test_data[0]
print(testing_labels.shape)
#print(testing_labels)
test_labels=[]

for m in testing_labels:
  test_labels.append(int(m))
t_labels=np.array(test_labels) 
print("For test set :",t_labels)
testing_sentences=test_data[1]
testing_sentences.shape
print(testing_sentences)
test_sentences=[]
for k in testing_sentences:
 test_sentences.append(preprocess_text(k))


teest_texts = test_sentences
teest_texts = [s.lower() for s in teest_texts]



# =======================Convert string to index================
# Tokenizer
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(teest_texts)
# If we already have a character list, then replace the tk.word_index
# If not, just skip below part

# -----------------------Skip part start--------------------------
# construct a new vocabulary
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy()
# Add 'UNK' to the vocabulary
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
# -----------------------Skip part end----------------------------

# Convert string to index
teest_seq = tk.texts_to_sequences(teest_texts)

# Padding
teest_data = pad_sequences(teest_seq, maxlen=1014, padding='post')

# Convert to numpy array
teest_data = np.array(teest_data, dtype='float32')
from keras.utils import to_categorical
test_binary = to_categorical(t_labels)
print(test_binary)
results7 = model.evaluate(teest_data, test_binary, batch_size=128)

# Random Char SCNN model

This model consists of the same layers as previous model, but the difference is we don't we pretrained data which give us weights as a matrix. Usually, when pretrained data does not suit the model you use, in our case character level CNN, it is better to not use it. For example, our model without glove bring us 30% accuracy which is higher than previous model's accuracy. Testing it with unused test data resulted in 28% accuracy which is same as in Pretrained Char SCNN model. If we compare it with author's accuracu which is 43%, it is clear that our model works poorly. But there are several enhancements can be made to increase the accuracy. Increasing the epoch number, finding good pretrained source, changing parameters. 

In [0]:
out_path = os.path.join(sys.path[0], 'sst_{}.txt')
dataset = pytreebank.load_sst('./raw_data')

# Store train, dev and test in separate files
for category in ['train', 'test', 'dev']:
    with open(out_path.format(category), 'w') as outfile:
        for item in dataset[category]:
            outfile.write("{}\t{}\n".format(
                item.to_labeled_lines()[0][0] ,
                item.to_labeled_lines()[0][1]
            ))
# Print the length of the training set
print(len(dataset['train']))
len(dataset['dev'])
train = loadtxt("sst_train.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
train_data=pd.DataFrame(train)
train_data.columns=['label', 'text']
print(train_data)
val = loadtxt("sst_dev.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
val_data=pd.DataFrame(val)
#val_data.columns=['label', 'text']
val_labels=val_data[0]
val_labels.shape
print(val_labels)
print(val_data[0][0])
val_sentences=val_data[1]
val_sentences.shape
print(val_sentences)
valid_labels=[]

for m in val_labels:
 valid_labels.append(int(m))
  #train_labels.append(str(m))
v_labels=np.array(valid_labels)

print(v_labels)
training_labels=train_data['label']
training_labels.shape
training_sentences=train_data['text']
training_sentences.shape
print(training_sentences)
train_labels=[]

for m in training_labels:
 train_labels.append(int(m))
  #train_labels.append(str(m))
tr_labels=np.array(train_labels) 
print(tr_labels)
import re
def preprocess_text(sen):
    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
v_sentences=[]

for k in val_sentences:
 v_sentences.append(preprocess_text(k))

print(v_sentences)
train_sentences=[]

for k in training_sentences:
 train_sentences.append(preprocess_text(k))

print(train_sentences)
train_texts = train_sentences
train_texts = [s.lower() for s in train_texts]


test_texts = v_sentences
test_texts = [s.lower() for s in test_texts]

# =======================Convert string to index================
# Tokenizer
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(train_texts)
# If we already have a character list, then replace the tk.word_index
# If not, just skip below part

# -----------------------Skip part start--------------------------
# construct a new vocabulary
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy()
# Add 'UNK' to the vocabulary
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
# -----------------------Skip part end----------------------------

# Convert string to index
train_seq = tk.texts_to_sequences(train_texts)
test_texts = tk.texts_to_sequences(test_texts)


# Padding
train_data = pad_sequences(train_seq, maxlen=1014, padding='post')
test_data = pad_sequences(test_texts, maxlen=1014, padding='post')

# Convert to numpy array
train_data = np.array(train_data, dtype='float32')
test_data = np.array(test_data, dtype='float32')
word_index = tk.word_index
from keras.utils import to_categorical
y_binary = to_categorical(v_labels)
from keras.utils import to_categorical
t_binary = to_categorical(tr_labels)
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(300, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
history7 = model.fit(train_data, t_binary, epochs=num_epochs, validation_data=(test_data, y_binary))
test = loadtxt("sst_test.txt", dtype=str, comments="#", delimiter="\t", unpack=False)
test_data=pd.DataFrame(test)
#val_data.columns=['label', 'text']
testing_labels=test_data[0]
print(testing_labels.shape)
#print(testing_labels)
test_labels=[]

for m in testing_labels:
  test_labels.append(int(m))
t_labels=np.array(test_labels) 
print("For test set :",t_labels)
testing_sentences=test_data[1]
testing_sentences.shape
print(testing_sentences)
test_sentences=[]
for k in testing_sentences:
 test_sentences.append(preprocess_text(k))


teest_texts = test_sentences
teest_texts = [s.lower() for s in teest_texts]



# =======================Convert string to index================
# Tokenizer
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(teest_texts)
# If we already have a character list, then replace the tk.word_index
# If not, just skip below part

# -----------------------Skip part start--------------------------
# construct a new vocabulary
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1

# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy()
# Add 'UNK' to the vocabulary
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
# -----------------------Skip part end----------------------------

# Convert string to index
teest_seq = tk.texts_to_sequences(teest_texts)

# Padding
teest_data = pad_sequences(teest_seq, maxlen=1014, padding='post')

# Convert to numpy array
teest_data = np.array(teest_data, dtype='float32')
from keras.utils import to_categorical
test_binary = to_categorical(t_labels)
print(test_binary)
results8 = model.evaluate(teest_data, test_binary, batch_size=128)

# **Results for SSTb in Tabular form**
---
Results for SSTb corpus


Model | Accuracy(unsupervised pre-training) | Accuracy( random word embeddings)|
--- | --- | --- 
CharSCNN     | 28% | 28% 
SCNN | 40% | 33%
CharSCNN (Santos and Gatti, 2014)     | 43%
SCNN (Santos and Gatti, 2014)   | 48% 
RNTN (Socher et al., 2013b)            | 45.7%
MV-RNN (Socher et al., 2013b)              | 44.4%
RNN (Socher et al., 2013b)   | 43.2%
NB (Socher et al., 2013b)  | 41.0%
SVM (Socher et al., 2013b)  | 40.7%
             **Table-3. Accuracy of different models for binary classification using SSTb.**

# **Conclusion**
---
In this work we tried to reproduce the paper "Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts", which is done by Santos and Gatti in 2014. We have used CharSCNN, SCNN  for sentiment analysis of STS and SSTb datasets. Both models were used with and without pretrained data. As a result, we run 8 models, 4 for each dataset. We came to conclusion that SCNN model works best for STS dataset, because this model brought the highest result. However, every model used for STS dataset can be improved by using more epochs and different parameters. When it comes to SSTb dataset, similarly, pretrained SCNN model reached the best accuracy which close enough to the author had. All in all, after reproducing the paper, we were convinced that NLP can be applied to analyze human sentiments. Even if no one has 100% accuracy, every model made in the past can be improved, to reach more accuracy.