# Introduction
Code-mixing poses several difficulties for NLP tasks such as:
* part-of-speech tagging
* dependency parsing
* machine translation and semantic processing
* word-level language identification

Sentiment analysis becomes harder when the data is noisy and collected from social media. Code-mixed text adopts the syntax and vocabulary of multiple languages, which is a challenge for sentiment analysis because traditional semantic analysis approaches do not capture the meaning of such sentences. Another challenge is the short, abbreviated forms present in the sentences. The same word can also be written in many forms, which is a further limitation. Pre-processing operations are needed to address these challenges. This paper focuses on pre-processing tweets and classifying them into their corresponding sentiments: positive, negative, or neutral.




# Dataset Details


> Follow this link for dataset details [Dataset](https://competitions.codalab.org/competitions/20654#participate)
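
The parsing function later in this notebook reads tab-separated lines in which a `meta` line (tweet id, then sentiment) opens each tweet and every following line holds one token and its language tag. The sketch below illustrates that layout on a toy sample. The sample lines and the `Hin`/`Eng` tags are invented for illustration, `parse_sample` is a hypothetical helper, and unlike the notebook's parser it skips blank lines.

```python
# Toy sample in the layout the notebook's parser appears to expect (invented data).
SAMPLE = (
    "meta\t1\tpositive\n"
    "yaar\tHin\n"
    "great\tEng\n"
    "\n"
    "meta\t2\tnegative\n"
    "bura\tHin\n"
    "day\tEng\n"
)


def parse_sample(text):
    """Group token lines under the 'meta' line that precedes them."""
    records = []
    for raw in text.splitlines():
        fields = raw.strip().split("\t")
        if fields[0] == "meta":
            # A meta line starts a new tweet: id, then an optional sentiment label.
            records.append({
                "id": fields[1],
                "sentiment": fields[2] if len(fields) > 2 else "",
                "tokens": [],
                "lang_tags": [],
            })
        elif fields[0] and records:
            # Any other non-blank line is a token with an optional language tag.
            records[-1]["tokens"].append(fields[0])
            records[-1]["lang_tags"].append(fields[1] if len(fields) > 1 else "")
    return records


records = parse_sample(SAMPLE)
for record in records:
    print(record["id"], record["sentiment"], record["tokens"], record["lang_tags"])
```

Keeping tokens and language tags in parallel lists mirrors the per-tweet buffers the notebook builds.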





# Loading & Pre-Processing the Dataset





In [None]:
# Import re "regular expression" module
import re
print("re module is imported successfully")


re module is imported successfully


In [None]:
# Load the training and validation datasets (files are closed automatically)
with open('/content/train_data.txt', encoding='utf8') as f:
    training_data = f.readlines()
with open('/content/validation_data.txt', encoding='utf8') as f:
    validation_data = f.readlines()

In [None]:
print(training_data[0])
print(validation_data[0])

In [None]:
from tqdm.notebook import tqdm

In [None]:
# Parsing Function
def parse_lines(lines):
    tweet_id = [] 
    tweet = [] 
    lang_tag = [] 
    sentiment = [] 
    tweet_max_length = 0

    print("Parsing lines from file...")
    for i, line in tqdm(enumerate(lines), total=len(lines)):
        line = line.strip().split('\t')
        if line[0]=='meta':
            if i!=0:
                tweet_id.append(buffer_id)
                tweet.append(buffer_tokens)
                lang_tag.append(buffer_labels)
                sentiment.append(buffer_sentiment)
                if len(buffer_tokens) > tweet_max_length:
                    tweet_max_length = len(buffer_tokens)
            buffer_id = line[1]
            try:
                buffer_sentiment = line[2]
            except IndexError:
                buffer_sentiment = ''
            buffer_tokens = []
            buffer_labels = []
        else:
            buffer_tokens.append(line[0])
            try:
                buffer_labels.append(line[1])
            except IndexError:
                buffer_labels.append('')

    tweet_id.append(buffer_id)
    tweet.append(buffer_tokens)
    lang_tag.append(buffer_labels)
    sentiment.append(buffer_sentiment)
    if len(buffer_tokens) > tweet_max_length:
        tweet_max_length = len(buffer_tokens)
    
    return tweet_id, tweet, lang_tag, sentiment, tweet_max_length

In [None]:
train_tweet_id, train_tweet, train_lang_tag, train_sentiment, train_tweet_max_length = parse_lines(training_data)
valid_tweet_id, valid_tweet, valid_lang_tag, valid_sentiment, valid_tweet_max_length = parse_lines(validation_data)

Parsing lines from file...




Parsing lines from file...






In [None]:
# source: https://en.wikipedia.org/wiki/Contraction_%28grammar%29
def load_dict_contractions():
    return {
        "ain't":"is not",
        "amn't":"am not",
        "aren't":"are not",
        "can't":"cannot",
        "'cause":"because",
        "couldn't":"could not",
        "couldn't've":"could not have",
        "could've":"could have",
        "daren't":"dare not",
        "daresn't":"dare not",
        "dasn't":"dare not",
        "didn't":"did not",
        "doesn't":"does not",
        "don't":"do not",
        "e'er":"ever",
        "em":"them",
        "everyone's":"everyone is",
        "finna":"fixing to",
        "gimme":"give me",
        "gonna":"going to",
        "gon't":"go not",
        "gotta":"got to",
        "hadn't":"had not",
        "hasn't":"has not",
        "haven't":"have not",
        "he'd":"he would",
        "he'll":"he will",
        "he's":"he is",
        "he've":"he have",
        "how'd":"how would",
        "how'll":"how will",
        "how're":"how are",
        "how's":"how is",
        "I'd":"I would",
        "I'll":"I will",
        "I'm":"I am",
        "I'm'a":"I am about to",
        "I'm'o":"I am going to",
        "isn't":"is not",
        "it'd":"it would",
        "it'll":"it will",
        "it's":"it is",
        "I've":"I have",
        "kinda":"kind of",
        "let's":"let us",
        "mayn't":"may not",
        "may've":"may have",
        "mightn't":"might not",
        "might've":"might have",
        "mustn't":"must not",
        "mustn't've":"must not have",
        "must've":"must have",
        "needn't":"need not",
        "ne'er":"never",
        "o'":"of",
        "o'er":"over",
        "ol'":"old",
        "oughtn't":"ought not",
        "shalln't":"shall not",
        "shan't":"shall not",
        "she'd":"she would",
        "she'll":"she will",
        "she's":"she is",
        "shouldn't":"should not",
        "shouldn't've":"should not have",
        "should've":"should have",
        "somebody's":"somebody is",
        "someone's":"someone is",
        "something's":"something is",
        "that'd":"that would",
        "that'll":"that will",
        "that're":"that are",
        "that's":"that is",
        "there'd":"there would",
        "there'll":"there will",
        "there're":"there are",
        "there's":"there is",
        "these're":"these are",
        "they'd":"they would",
        "they'll":"they will",
        "they're":"they are",
        "they've":"they have",
        "this's":"this is",
        "those're":"those are",
        "'tis":"it is",
        "'twas":"it was",
        "wanna":"want to",
        "wasn't":"was not",
        "we'd":"we would",
        "we'd've":"we would have",
        "we'll":"we will",
        "we're":"we are",
        "weren't":"were not",
        "we've":"we have",
        "what'd":"what did",
        "what'll":"what will",
        "what're":"what are",
        "what's":"what is",
        "what've":"what have",
        "when's":"when is",
        "where'd":"where did",
        "where're":"where are",
        "where's":"where is",
        "where've":"where have",
        "which's":"which is",
        "who'd":"who would",
        "who'd've":"who would have",
        "who'll":"who will",
        "who're":"who are",
        "who's":"who is",
        "who've":"who have",
        "why'd":"why did",
        "why're":"why are",
        "why's":"why is",
        "won't":"will not",
        "wouldn't":"would not",
        "would've":"would have",
        "y'all":"you all",
        "you'd":"you would",
        "you'll":"you will",
        "you're":"you are",
        "you've":"you have",
        "Whatcha":"What are you",
        "luv":"love",
        "sux":"sucks"
        }

In [None]:
!pip install emoji


In [None]:
import emoji
import itertools

In [None]:
def pre_processing_each_tweet(tweet):
    
    # Converting tweet into lowercase
    tweet = tweet.lower()

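    # replace contractions (keys are lowercased because the tweet was lowercased above)
    CONTRACTIONS = {key.lower(): value for key, value in load_dict_contractions().items()}
    tweet = tweet.replace("’","'")
    words = tweet.split()

    # Re-join tokens split around a standalone apostrophe, e.g. ["don", "'", "t"] -> ["don't"]
    merged = []
    i = 0
    while i < len(words):
        if words[i] == "'" and merged and i + 1 < len(words):
            merged[-1] = merged[-1] + "'" + words[i + 1]
            i += 2
        else:
            merged.append(words[i])
            i += 1

    reformed = [CONTRACTIONS[word] if word in CONTRACTIONS else word for word in merged]
    tweet = " ".join(reformed)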

    
    # demojize emojis 
    tweet = emoji.demojize(tweet)
    
    # other cleaning
    tweet = tweet.replace(":"," ")
    tweet = ' '.join(tweet.split())

    # Replace repeating characters with maximum length of two characters
    tweet = re.sub(r"(.)\1{2,}", r'\1\1', tweet)

    return tweet

# train_tweet_test = ['cittttty', 'ciiiity', 'cccciiiittttyyyy', 'citttyyyy', 'ciitty', '!!!','...']
# pre_train_tweet = pre_processing_each_tweet(' '.join(train_tweet_test)).split(' ')

# print(train_tweet_test)
# print(pre_train_tweet)

print(train_tweet[1])
pre_train_tweet = pre_processing_each_tweet(' '.join(train_tweet[1])).split(' ')
print(pre_train_tweet)

['@', 'nehantics', 'Haan', 'yaar', 'neha', '😔😔', 'kab', 'karega', 'woh', 'post', '😭', 'Usne', 'na', 'sach', 'mein', 'photoshoot', 'karna', 'chahiye', 'phir', 'woh', 'post', 'karega', '…', 'https', '//', 'tco', '/', '5RSlSbZNtt', '']
['@', 'nehantics', 'haan', 'yaar', 'neha', 'pensive_face', 'pensive_face', 'kab', 'karega', 'woh', 'post', 'loudly_crying_face', 'usne', 'na', 'sach', 'mein', 'photoshoot', 'karna', 'chahiye', 'phir', 'woh', 'post', 'karega', '…', 'https', '//', 'tco', '/', '5rslsbzntt']
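
The repeated-character step can be checked in isolation. The sketch below applies the same regular expression to the test words from the commented-out lines in the cell above; `squeeze_repeats` is a hypothetical name introduced here for illustration.

```python
import re


def squeeze_repeats(text):
    """Collapse any character repeated three or more times down to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


words = ["cittttty", "ciiiity", "cccciiiittttyyyy", "citttyyyy", "ciitty", "!!!", "..."]
squeezed = [squeeze_repeats(word) for word in words]
print(squeezed)
# ['citty', 'ciity', 'cciittyy', 'cittyy', 'ciitty', '!!', '..']
```

The pattern applies to every character, not only letters, so a token such as `1000` becomes `100`. If numbers or URLs matter downstream, restricting the pattern to letters may be worth considering.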


In [None]:
def pre_processing_tweets(tweets):
    # Clean every tweet in place (each tweet is a list of tokens) and return the same list
    for i in range(len(tweets)):
        tweets[i] = pre_processing_each_tweet(' '.join(tweets[i])).split(' ')

    return tweets

In [None]:
# Pre-Processing the data
pre_processed_training_data = pre_processing_tweets(train_tweet)
pre_processed_validation_data = pre_processing_tweets(valid_tweet)

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# Tokenization and padding hyper-parameters
embedding_dim = 16
max_length = 50
trunc_type='post'
padding='post'
oov_tok = "<OOV>"

In [None]:
tokenizer = Tokenizer(num_words = None, oov_token=oov_tok)
tokenizer.fit_on_texts(pre_processed_training_data)
word_index = tokenizer.word_index
training_sequences = tokenizer.texts_to_sequences(pre_processed_training_data)
training_padded = pad_sequences(training_sequences,maxlen=max_length, truncating=trunc_type, padding=padding)

validation_sequences = tokenizer.texts_to_sequences(pre_processed_validation_data)
validation_padded = pad_sequences(validation_sequences,maxlen=max_length, truncating=trunc_type, padding=padding)

In [None]:
total_words = len(word_index)
print(total_words)

48876
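
The `padding` and `truncating` arguments of `pad_sequences` decide which end of a sequence is filled with zeros or cut. Keras defaults to `'pre'` for both, so a call that omits them behaves differently from one that passes `'post'`. The sketch below imitates that behaviour in plain Python to make the difference visible. `pad_or_truncate` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="post", value=0):
    """Return a list of exactly maxlen items, padded or cut at the chosen end."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'post' keeps the first maxlen items, 'pre' keeps the last maxlen items.
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    # 'post' appends the filler, 'pre' prepends it.
    return seq + pad if padding == "post" else pad + seq


short = [4, 9, 2]
long_seq = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short, 5, padding="post"))        # [4, 9, 2, 0, 0]
print(pad_or_truncate(short, 5, padding="pre"))         # [0, 0, 4, 9, 2]
print(pad_or_truncate(long_seq, 4, truncating="post"))  # [1, 2, 3, 4]
print(pad_or_truncate(long_seq, 4, truncating="pre"))   # [3, 4, 5, 6]
```

Using the same settings for the training, validation, test, and deployment inputs keeps token positions comparable across all of them.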


In [None]:
from sklearn import preprocessing
from tensorflow.keras.utils import to_categorical

In [None]:
# output label encoding
le = preprocessing.LabelEncoder()
le.fit(train_sentiment)
print(le.classes_)

training_labels = to_categorical(le.transform(train_sentiment))

validation_labels = to_categorical(le.transform(valid_sentiment))

['negative' 'neutral' 'positive']
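
The printed class order matters because it fixes which column of the one-hot labels, and therefore which softmax output, belongs to which sentiment. The sketch below reproduces that mapping in plain Python: classes sorted alphabetically, labels one-hot encoded, and predictions decoded with an argmax. The helper names and the probability rows are invented for illustration.

```python
def fit_classes(labels):
    """Sorted unique labels, matching the alphabetical order printed above."""
    return sorted(set(labels))


def one_hot(labels, classes):
    """Encode each label as a vector with a 1.0 at its class index."""
    index = {name: i for i, name in enumerate(classes)}
    return [[1.0 if i == index[label] else 0.0 for i in range(len(classes))]
            for label in labels]


def decode(prob_rows, classes):
    """Map each row of class probabilities back to the most likely label."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in prob_rows]


labels = ["positive", "negative", "neutral", "positive"]
classes = fit_classes(labels)
encoded = one_hot(labels, classes)

# Made-up softmax outputs for two tweets.
probs = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
decoded = decode(probs, classes)

print(classes)     # ['negative', 'neutral', 'positive']
print(encoded[0])  # [0.0, 0.0, 1.0]
print(decoded)     # ['positive', 'negative']
```

This is why index 0 is read as negative, 1 as neutral, and 2 as positive when predictions are written to file later in the notebook.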


# CNN Model

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd

In [None]:
# CNN Model Hyper-Parameters
kernel_size = 5
filters = 128
cnn_optimizer = 'adamax'

In [None]:
cnn_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words+1, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(filters, kernel_size, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    # tf.keras.layers.MaxPooling1D(pool_size=3),
    # tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])
cnn_model.compile(loss='categorical_crossentropy',
              optimizer=cnn_optimizer,
              metrics=['accuracy'])
cnn_model.summary()

In [None]:
cnn_history = cnn_model.fit(training_padded, training_labels,
          validation_data=(validation_padded,validation_labels),
          epochs=15)

In [None]:
import matplotlib.pyplot as plt


def plot_graphs(history, string):
  # Plot the metric for the history object passed in, not a global one
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plot_graphs(cnn_history, "accuracy")
plot_graphs(cnn_history, "loss")

# LSTM Model


In [None]:
# LSTM Hyper-Parameters
lstm_units = 16
lstm_recurrent_dropout = 0.2
lstm_dropout = 0.2

In [None]:
# Model Definition with LSTM
lstm_model = tf.keras.Sequential([
    # +1 because Tokenizer word indices start at 1 and index 0 is reserved for padding
    tf.keras.layers.Embedding(total_words+1, embedding_dim, input_length=max_length),
    tf.keras.layers.LSTM(units=lstm_units),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])
lstm_model.compile(loss='categorical_crossentropy',optimizer='adamax',metrics=['accuracy'])
lstm_model.summary()


In [None]:
lstm_history = lstm_model.fit(training_padded, training_labels,
          validation_data=(validation_padded,validation_labels),
          epochs=10)

In [None]:
import matplotlib.pyplot as plt


def plot_graphs(history, string):
  # Plot the metric for the history object passed in, not a global one
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plot_graphs(lstm_history, "accuracy")
plot_graphs(lstm_history, "loss")

# Bi-LSTM Model

In [None]:
# Bi-LSTM Hyper-Parameters
bi_lstm_units = 128
bi_lstm_recurrent_dropout = 0.2
bi_lstm_dropout = 0.2

In [None]:
# Model Definition with Bi-LSTM
bi_lstm_model = tf.keras.Sequential([
    # +1 because Tokenizer word indices start at 1 and index 0 is reserved for padding
    tf.keras.layers.Embedding(total_words+1, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=bi_lstm_units)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])
bi_lstm_model.compile(loss='categorical_crossentropy',optimizer='adamax',metrics=['accuracy'])
bi_lstm_model.summary()

In [None]:
bi_lstm_history = bi_lstm_model.fit(training_padded, training_labels,
          validation_data=(validation_padded,validation_labels),
          epochs=15)

In [None]:
import matplotlib.pyplot as plt


def plot_graphs(history, string):
  # Plot the metric for the history object passed in, not a global one
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plot_graphs(bi_lstm_history, "accuracy")
plot_graphs(bi_lstm_history, "loss")

# Testing the Models



In [None]:
from sklearn.metrics import classification_report

In [None]:
#Loading Test Dataset
testing_data = open('/content/test_data.txt', encoding='utf8').readlines()

In [None]:
test_tweet_id, test_tweet, test_lang_tag, test_sentiment, test_tweet_max_length = parse_lines(testing_data)

Parsing lines from file...






In [None]:
pre_processed_testing_data = pre_processing_tweets(test_tweet)

In [None]:
testing_sequences = tokenizer.texts_to_sequences(pre_processed_testing_data)
testing_padded = pad_sequences(testing_sequences,maxlen=max_length, truncating=trunc_type, padding=padding)

In [None]:
cnn_predictions = cnn_model.predict(testing_padded)
cnn_predictions = np.argmax(cnn_predictions,axis=-1)

# write predictions to file
with open('cnn_preds.txt', 'w') as out:
    out.write('Uid,Sentiment')
    for i, uid in enumerate(test_tweet_id):
        if cnn_predictions[i] == 0:
            sentiment = 'negative'
        elif cnn_predictions[i] == 1:
            sentiment = 'neutral'
        else:
            sentiment = 'positive'
        out.write("\n%s,%s"%(uid, sentiment))

In [None]:

lstm_predictions = lstm_model.predict(testing_padded)
lstm_predictions = np.argmax(lstm_predictions,axis=-1)

# write predictions to file
with open('lstm_preds.txt', 'w') as out:
    out.write('Uid,Sentiment')
    for i, uid in enumerate(test_tweet_id):
        if lstm_predictions[i] == 0:
            sentiment = 'negative'
        elif lstm_predictions[i] == 1:
            sentiment = 'neutral'
        else:
            sentiment = 'positive'
        out.write("\n%s,%s"%(uid, sentiment))

In [None]:

bi_lstm_predictions = bi_lstm_model.predict(testing_padded)
bi_lstm_predictions = np.argmax(bi_lstm_predictions,axis=-1)

# write predictions to file
with open('bi_lstm_preds.txt', 'w') as out:
    out.write('Uid,Sentiment')
    for i, uid in enumerate(test_tweet_id):
        if bi_lstm_predictions[i] == 0:
            sentiment = 'negative'
        elif bi_lstm_predictions[i] == 1:
            sentiment = 'neutral'
        else:
            sentiment = 'positive'
        out.write("\n%s,%s"%(uid, sentiment))

In [None]:
# load correct labels
test = pd.read_csv('/content/test_labels.txt')
# load predictions
cnn_preds = pd.read_csv('/content/cnn_preds.txt')
# lstm_preds = pd.read_csv('/content/lstm_preds.txt')
# bi_lstm_preds = pd.read_csv('/content/bi_lstm_preds.txt')

# compute evaluation metrics
results = {'cnn_preds': classification_report(test['Sentiment'], 
                                          cnn_preds['Sentiment'], 
                                          labels=['positive', 'neutral', 'negative'], 
                                          output_dict=True, digits=6)
          #  'lstm_preds': classification_report(test['Sentiment'], 
          #                                 lstm_preds['Sentiment'], 
          #                                 labels=['positive', 'neutral', 'negative'], 
          #                                 output_dict=True, digits=6),
          #  'bi_lstm_preds': classification_report(test['Sentiment'], 
          #                                 bi_lstm_preds['Sentiment'], 
          #                                 labels=['positive', 'neutral', 'negative'], 
          #                                 output_dict=True, digits=6),
}

In [None]:
# format and print scores
formatted_results = [['model', 'precision', 'recall', 'accuracy', 'f1-score']]
for ki in results.keys():
    scores = results[ki]['macro avg']
    model = [ki, scores['precision'], scores['recall'], results[ki]['accuracy'], scores['f1-score']]
    formatted_results.append(model)
    
formatted_results = pd.DataFrame(formatted_results[1:], columns=formatted_results[0])
print(formatted_results)

       model  precision    recall  accuracy  f1-score
0  cnn_preds   0.649452  0.638933     0.638  0.642569


In [None]:
# model  precision    recall  accuracy  f1-score
# 0      cnn_preds   0.669827  0.651923     0.651  0.657892
# 1     lstm_preds   0.552983  0.356939     0.359  0.222178
# 2  bi_lstm_preds   0.737094  0.418081     0.448  0.345919

# Saving the Models


# Deployment Using Anvil


>  Steps to connect Colab with an Anvil app:

* Add a getpass prompt so you can enter your app's Uplink key
* Install the anvil-uplink library
* Import the anvil.server package
* Connect the notebook using your app's Uplink key
* Create a function to call from your app and mark it with the anvil.server.callable decorator
* Add anvil.server.wait_forever() to the end of the notebook



In [None]:
# Prompt for the app's Uplink key with getpass so it is not echoed in the notebook
from getpass import getpass
uplink_key = getpass('Enter your Uplink key: ')

In [None]:
#Install the anvil-uplink library
!pip install anvil-uplink

In [None]:
import anvil.server

In [None]:
anvil.server.connect(uplink_key)

Connecting to wss://anvil.works/uplink
Anvil websocket open
Connected to "Default environment (dev)" as SERVER


In [None]:
@anvil.server.callable
def predict_sentiment(input_message):
  print(input_message)
  # Apply the same cleaning that was used on the training tweets
  cleaned_message = pre_processing_each_tweet(input_message)
  input_message_sequences = tokenizer.texts_to_sequences([cleaned_message])
  print(input_message_sequences)
  # Pad with the same settings as the training data
  input_message_padded = pad_sequences(input_message_sequences, maxlen=max_length,
                                       truncating=trunc_type, padding=padding)
  # One row of class probabilities, ordered as in le.classes_
  classification = cnn_model.predict(input_message_padded)
  print(classification)
  class_index = int(np.argmax(classification, axis=-1)[0])
  # Map the index back to 'negative', 'neutral' or 'positive'
  sentiment = str(le.classes_[class_index])
  print(sentiment)
  return sentiment

In [None]:
anvil.server.wait_forever()