# Sentiment Analysis of Hin-Eng mixed tweets

Dataset: <a href="https://drive.google.com/file/d/1qpXAWrbNzL_TRK5OAYgXp6y-KAg0T11h/view?usp=sharing" >link</a>

## Methodology:
Github: <a href="">link</a>

For this task, I propose a three-phase architecture built on linear SVMs, which are strong baselines for text classification. Because the tweets mix Hindi and English, they need pre-processing before they can be classified. The method has three phases:

<ol>
    <li> <b>Phase I (Pre-processing):</b> First, I remove unwanted text from the tweets (tokens labelled "O", links, etc.). On closer inspection, the "Hin" and "Eng" language labels turned out to be unreliable in most tweets, especially for tokens tagged "Hin". So, instead of trusting the labels, I process each word as follows:
        <ul>
            <li> Check whether the word exists in WordNet. If it does, keep it unchanged; otherwise pass it on to the next step.</li>
            <li> Check whether the word is a profanity using <a href="https://github.com/precog-iiitd/mind-your-language-aaai">this</a> dictionary. If it is, translate it; otherwise pass it on to the next step.</li>
            <li> Finally, translate any remaining word using Google's Cloud Translate API.</li>
        </ul>
    </li> 
    <li> <b>Phase II:</b> In this phase, the aim is to classify tweets as neutral or non-neutral. For that, I use the weighted MPQA subjectivity lexicon: if a tweet's cumulative subjectivity score is either $\lt$ -2 or $\gt$ 2, the MPQA feature marks it as non-neutral, otherwise as neutral. In addition, an adjective in a tweet generally implies subjectivity, so a WordNet-based potential-adjective check provides a second feature. Both features feed the classifier that separates neutral from non-neutral tweets.
    </li>
    <li> <b> Phase III:</b> In this phase, I classify the non-neutral tweets as positive or negative. For that, I use SentiWordNet to fetch the positive and negative scores of the words and accumulate them over the tweet, alongside the cumulative MPQA polarity score. I concatenate these scores with the CountVectorizer (bag-of-words count) representation of the tweet to form the feature vector, which is the input to the SVM model.
    </li>
</ol>
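
The Phase I lookup cascade can be sketched as follows. This is a minimal illustration rather than the notebook's actual pre-processing code: `normalize_token`, `in_wordnet`, `profanity_map`, and `translate` are hypothetical names, and the WordNet lookup and the translation service are passed in as plain callables so that the sketch runs without NLTK data or API credentials. It also assumes the profanity dictionary can be used as a word-to-English mapping.

```python
from typing import Callable, Dict


def normalize_token(
    token: str,
    in_wordnet: Callable[[str], bool],
    profanity_map: Dict[str, str],
    translate: Callable[[str], str],
) -> str:
    """Return an English form of `token` using the three-step cascade."""
    word = token.lower()
    # Step 1: words already known to WordNet are treated as English.
    if in_wordnet(word):
        return word
    # Step 2: known profanity is mapped through the dictionary (assumed mapping).
    if word in profanity_map:
        return profanity_map[word]
    # Step 3: everything else goes to the translation service.
    return translate(word)


# Toy stand-ins, for illustration only (not real lexicon or API data).
english_words = {"movie", "good"}
toy_profanity = {"kamina": "scoundrel"}
toy_translations = {"accha": "good", "bahut": "very"}


def toy_translate(word: str) -> str:
    # Falls back to the input when no translation is known.
    return toy_translations.get(word, word)


tokens = ["Movie", "bahut", "accha", "kamina"]
normalized = [
    normalize_token(t, english_words.__contains__, toy_profanity, toy_translate)
    for t in tokens
]
print(normalized)
```

In a real pipeline the stand-ins would be replaced by a WordNet membership test and a call to the translation service; ordering the checks this way keeps the number of translation requests low.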

[**Note:** I also tried tf-idf vectors and GloVe embeddings, but the CountVectorizer bag-of-words encoding gave the highest accuracy of the three.]

**Libraries**

In [19]:
import numpy as np
from tqdm.notebook import tqdm
import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.utils import shuffle
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

In [2]:
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk import ngrams,bigrams

**Fetching MPQA Subjectivity Lexicon**

In [3]:
lexicon = {}
with open('./MPQA/lexicon_easy.csv',"r") as csvfile:
    data = csv.reader(csvfile)
    for row in data:
        row[1] = int(row[1])
        row[2] = int(row[2])
        lexicon[row[0]] = {}
        lexicon[row[0]]['subj'] = row[1]
        lexicon[row[0]]['sent'] = row[2]

**Utility functions**

In [4]:
# Function which returns subjectivity score of a given tweet. See description for scores.
def mpqa_subj(tweet):
    feat = 0
    score = 0
    tokens = word_tokenize(tweet)
    for token in tokens: 
        if token in lexicon:
            score+= lexicon[token]['subj']
    if score > 2 or score < -2:
        feat = 1
    return feat
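
To make the thresholding concrete, here is a small self-contained sketch of the same rule on a toy lexicon. The weights in `toy_lexicon` are invented for illustration (they are not MPQA values), `subjectivity_flag` is a hypothetical name, and `str.split` stands in for `word_tokenize`.

```python
def subjectivity_flag(tokens, lexicon, threshold=2):
    """Return 1 when the summed subjectivity weight lies outside [-threshold, threshold]."""
    score = sum(lexicon.get(tok, 0) for tok in tokens)
    return 1 if abs(score) > threshold else 0


# Invented weights, for illustration only.
toy_lexicon = {"love": 2, "great": 2, "okay": 1}

flag_strong = subjectivity_flag("i love this great movie".split(), toy_lexicon)
flag_weak = subjectivity_flag("the movie was okay".split(), toy_lexicon)
print(flag_strong, flag_weak)
```

The first tweet accumulates a score of 4 and is flagged as subjective; the second only reaches 1 and is not.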

In [5]:
# Function which returns presence of adjectives
def pot_adj(tweet):
    feat = 0
    tokens = word_tokenize(tweet)
    for token in tokens:
        synsets = wn.synsets(token)
        for s in synsets:
            if s.pos() == 'a':
                feat = 1
    return feat

In [6]:
lemmatizer = WordNetLemmatizer()

# Function to convert Penn Treebank tags to simple WordNet tags
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None

In [7]:
# Function which returns sentiment scores [SentiWordnet]
def swn_polarity(tweet):
    sentiment = 0.0
    tokens_count = 0
    tagged_sentence = pos_tag(word_tokenize(tweet))
    for word, tag in tagged_sentence:
        wn_tag = penn_to_wn(tag)
        
        if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
            continue
        lemma = lemmatizer.lemmatize(word, pos=wn_tag)
        
        if not lemma:
            continue
        synsets = wn.synsets(lemma, pos=wn_tag)
        
        if not synsets:
            continue
        synset = synsets[0]
        swn_synset = swn.senti_synset(synset.name())
        
        if swn_synset.pos_score() - swn_synset.neg_score()>0:
            sentiment+=1
        elif swn_synset.pos_score() - swn_synset.neg_score()<0:
            sentiment+=-1
        tokens_count += 1
    
    if not tokens_count:
        return 0
    
    return sentiment

In [8]:
# Function which returns the polarity features [SentiWordNet score, cumulative MPQA sentiment score]
def sentiword_mpqa_sentiment(tweet):
    feat = [0,0]
    feat[0] = swn_polarity(tweet)
    tokens = word_tokenize(tweet)
    for token in tokens:
        if token in lexicon:
            feat[1]+= lexicon[token]['sent']
    return feat

In [9]:
# Function for prediction: returns 2 (neutral), 0 (positive) or 1 (negative)
def predict(clf1,clf2,x):
    ph1 = [x[0]]
    ph2 = [x[1]]
    p1 = clf1.predict(ph1)
    if p1[0] == 1:
        return 2
    else:
        p2 = clf2.predict(ph2)
        if p2[0] == 1:
            return 0
        else:
            return 1
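
The two-stage decision can be exercised without trained models. The sketch below assumes the label encoding used in this notebook (first stage: 1 = neutral; second stage: 1 = positive) and replaces the fitted estimators with a hypothetical `ConstantClassifier` that exposes the same `predict` interface; `cascade_predict` is likewise an illustrative name.

```python
POSITIVE, NEGATIVE, NEUTRAL = 0, 1, 2


class ConstantClassifier:
    """Stand-in for a fitted estimator: always predicts the same label."""

    def __init__(self, label):
        self.label = label

    def predict(self, rows):
        return [self.label for _ in rows]


def cascade_predict(neutral_clf, polarity_clf, phase1_features, phase2_features):
    """Run the neutral/non-neutral stage first, then the polarity stage."""
    if neutral_clf.predict([phase1_features])[0] == 1:
        return NEUTRAL
    if polarity_clf.predict([phase2_features])[0] == 1:
        return POSITIVE
    return NEGATIVE


says_neutral = ConstantClassifier(1)
says_non_neutral = ConstantClassifier(0)
says_positive = ConstantClassifier(1)
says_negative = ConstantClassifier(0)

print(cascade_predict(says_neutral, says_positive, [0], [0]))
print(cascade_predict(says_non_neutral, says_positive, [0], [0]))
print(cascade_predict(says_non_neutral, says_negative, [0], [0]))
```

Note that the second-stage classifier is only consulted when the first stage rejects the neutral class, so any first-stage error on a polar tweet cannot be recovered later.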

In [10]:
# Function for calculating scores: per-class counts, mean F1 over the positive and negative classes, and accuracy
def score(y_true,y_pred):
    fav = [0,0,0]
    ag = [0,0,0]
    tot = [fav,ag]
    corr = 0
    for y_t,y_p in zip(y_true,y_pred):
        if y_t < 2:
            tot[y_t][2]+=1
        if y_p < 2:
            tot[y_p][1]+=1
        if y_t == y_p and y_t < 2:
            tot[y_t][0]+=1
        if y_t == y_p:
            corr+=1

    # Compute recall, precision and F1 once, after all predictions are counted
    r0 = tot[0][0]/(tot[0][2]+1e-5)
    p0 = tot[0][0]/(tot[0][1]+1e-5)
    r1 = tot[1][0]/(tot[1][2]+1e-5)
    p1 = tot[1][0]/(tot[1][1]+1e-5)
    f0 = 2*r0*p0/(r0+p0+1e-5)
    f1 = 2*r1*p1/(r1+p1+1e-5)

    f_avg = (f0+f1)/2
    return tot,f_avg, corr/len(y_pred)
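
As a worked example of the metric, the sketch below averages F1 over the two polar classes only (labels 0 and 1), ignoring the neutral class as a target. It is a simplified variant, not the function above: `polarity_macro_f1` is a hypothetical name, and it uses explicit zero guards instead of the `1e-5` smoothing terms.

```python
def polarity_macro_f1(y_true, y_pred, classes=(0, 1)):
    """Mean F1 over the given classes; neutral (2) is excluded by default."""
    f_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_true = sum(1 for t in y_true if t == c)
        n_pred = sum(1 for p in y_pred if p == c)
        recall = tp / n_true if n_true else 0.0
        precision = tp / n_pred if n_pred else 0.0
        denom = precision + recall
        f_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f_scores) / len(f_scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

f_avg = polarity_macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(f_avg, 2), round(accuracy, 2))
```

Here class 0 has precision and recall of 1/2 (F1 = 0.5), class 1 has precision 2/3 and recall 1 (F1 = 0.8), so the averaged F1 is 0.65 while accuracy is 4/6.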

### Fetching data and preparing Vectorizer

In [11]:
data = {}
features = {}
train_file = "./data1/train_modified.txt"
test_file = "./data1/test_modified.txt"
#Set up the CountVectorizer on the combined train and test corpus
corpus = []
with open(train_file,'r', encoding="utf-8") as fr:
    lines = fr.readlines()
    for line in lines:
        row = line.split('\t')
        if row[0] == 'ID':
            continue
        tweet = row[1].lower()
        corpus.append(tweet)
with open(test_file,'r', encoding="utf-8") as fr:
    lines = fr.readlines()
    for line in lines:
        row = line.split('\t')
        if row[0] == 'ID':
            continue
        tweet = row[1].lower()
        corpus.append(tweet)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
X = X.todense()

### Train data feature extraction

In [12]:
#Train Feature Extraction
train_ph1 = []
train_ph2 = []
i = 0
with open(train_file,'r', encoding="utf-8") as fr:
    lines = fr.readlines()
    for line in lines:
        feature_vector_1 = []
        feature_vector_2 = []
        row = line.split('\t')
        if row[0] == 'ID':
            continue
        tweet = row[1].lower()
        
        #First phase features
        feature_vector_1.append(mpqa_subj(tweet))
        feature_vector_1.append(pot_adj(tweet))
        feature_vector_1.extend(X[i].tolist()[0])

        #Second Phase features
        senti = sentiword_mpqa_sentiment(tweet)
        feature_vector_2.extend(senti)
        feature_vector_2.extend(X[i].tolist()[0])
        i+=1
        category = row[2].rstrip()
        if category == 'neutral':
            feature_vector_1.append(1)
            train_ph1.append(feature_vector_1)
        else:
            feature_vector_1.append(0)
            train_ph1.append(feature_vector_1)
            if category == 'negative':
                feature_vector_2.append(0)
                train_ph2.append(feature_vector_2)
            else:
                feature_vector_2.append(1)
                train_ph2.append(feature_vector_2)
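
The row layout built above (lexicon features first, then the bag-of-words counts, then the label in the last column) can be sketched in plain Python. The helpers `build_vocabulary` and `count_vector` are hypothetical simplifications of `CountVectorizer`: they split on whitespace, whereas `CountVectorizer` by default also drops single-character tokens.

```python
def build_vocabulary(corpus):
    """Map each distinct token to a column index (alphabetical order)."""
    vocab = sorted({tok for doc in corpus for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(vocab)}


def count_vector(doc, vocab):
    """Return per-token counts for `doc` in vocabulary order."""
    counts = [0] * len(vocab)
    for tok in doc.split():
        if tok in vocab:
            counts[vocab[tok]] += 1
    return counts


corpus = ["good good movie", "bad movie"]
vocab = build_vocabulary(corpus)

lexicon_features = [1, 1]  # e.g. subjectivity flag and adjective flag
label = 0                  # appended as the last column

row = lexicon_features + count_vector(corpus[0], vocab) + [label]
features, target = row[:-1], row[-1]
print(row)
```

Keeping the label in the final column is what allows the training cell to recover inputs and targets with `[:, :-1]` and `[:, -1]`.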

### Test data feature extraction

In [13]:
#Load the test data and calculate features
#(i continues from the training loop, so X[i] indexes the test rows of the combined corpus)
test = []
test_y = []
with open(test_file,'r', encoding="utf-8") as fr:
    lines = fr.readlines()
    for line in lines:
        feature_vector_1 = []
        feature_vector_2 = []
        row = line.split('\t')
        if row[0] == 'ID':
            continue
        tweet = row[1].lower()
        target = ""

        #First phase features
        feature_vector_1.append(mpqa_subj(tweet))
        feature_vector_1.append(pot_adj(tweet))
        feature_vector_1.extend(X[i].tolist()[0])

        #Second Phase features
        senti = sentiword_mpqa_sentiment(tweet)
        feature_vector_2.extend(senti)
        feature_vector_2.extend(X[i].tolist()[0])
        i+=1 
        test.append((np.array(feature_vector_1,dtype=np.int32),np.array(feature_vector_2,dtype=np.int32)))
        category = row[2].rstrip()
        if category == 'neutral':
            test_y.append(2)
        else:
            if category == 'negative':
                test_y.append(1)
            else:
                test_y.append(0)

In [14]:
train_ph1 = np.array(train_ph1, dtype = np.int32)
train_ph2 = np.array(train_ph2,dtype = np.int32)
test_y = np.array(test_y,dtype = np.int32)
print(train_ph1.shape)
print(train_ph2.shape)
print(test_y.shape)
data_key = (train_ph1,train_ph2,test,test_y)

(15130, 54176)
(9492, 54176)
(1868,)


## Training and Parameter tuning

In [15]:
parameters = {
    # np.logspace takes exponents, so this grid spans 10**0.001 (about 1.002) to 10**5
    'C' : np.logspace(start = 0.001,stop = 5,num = 10),
    'dual' : [False]
}
tot = np.array([[0,0,0],[0,0,0]])
# print(data)

train_ph1 = shuffle(data_key[0])
train_ph2 =shuffle(data_key[1])
test = data_key[2]
test_y = data_key[3]

#Phase 1 Training
train_ph1_x = train_ph1[:,:-1]
train_ph1_y = train_ph1[:,-1]
svc = LinearSVC(max_iter = 100000, verbose=10)
print("SVC done")
clf1  = GridSearchCV(svc, parameters, cv=3, verbose=10) # Applying Grid Search
clf1.fit(train_ph1_x,train_ph1_y)
print("Phase 1 done")
#Phase 2 Training
train_ph2_x = train_ph2[:,:-1]
train_ph2_y = train_ph2[:,-1]
clf2 = GridSearchCV(svc, parameters, cv=3, verbose=10) # Applying Grid Search
waste= clf2.fit(train_ph2_x,train_ph2_y)
print("Phase 2 done")

SVC done
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] C=1.0023052380778996, dual=False ................................
[LibLinear][CV] .... C=1.0023052380778996, dual=False, score=0.628, total=  13.0s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   13.2s remaining:    0.0s


[CV] C=1.0023052380778996, dual=False ................................
[LibLinear][CV] .... C=1.0023052380778996, dual=False, score=0.622, total=  10.4s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   23.9s remaining:    0.0s


[CV] C=1.0023052380778996, dual=False ................................
[LibLinear][CV] .... C=1.0023052380778996, dual=False, score=0.629, total=  10.8s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   34.9s remaining:    0.0s


[CV] C=3.6011768069240193, dual=False ................................
[LibLinear][CV] .... C=3.6011768069240193, dual=False, score=0.626, total=  13.8s


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   49.0s remaining:    0.0s


[CV] C=3.6011768069240193, dual=False ................................
[LibLinear][CV] .... C=3.6011768069240193, dual=False, score=0.620, total=  13.9s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.1min remaining:    0.0s


[CV] C=3.6011768069240193, dual=False ................................
[LibLinear][CV] .... C=3.6011768069240193, dual=False, score=0.628, total=  12.7s


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  1.3min remaining:    0.0s


[CV] C=12.938647731300748, dual=False ................................
[LibLinear][CV] .... C=12.938647731300748, dual=False, score=0.623, total=  20.3s


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  1.6min remaining:    0.0s


[CV] C=12.938647731300748, dual=False ................................
[LibLinear][CV] .... C=12.938647731300748, dual=False, score=0.615, total=  22.0s


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  2.0min remaining:    0.0s


[CV] C=12.938647731300748, dual=False ................................
[LibLinear][CV] .... C=12.938647731300748, dual=False, score=0.623, total=  19.4s


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  2.3min remaining:    0.0s


[CV] C=46.487194073008524, dual=False ................................
[LibLinear][CV] .... C=46.487194073008524, dual=False, score=0.622, total=  22.3s
[CV] C=46.487194073008524, dual=False ................................
[LibLinear][CV] .... C=46.487194073008524, dual=False, score=0.614, total=  26.2s
[CV] C=46.487194073008524, dual=False ................................
[LibLinear][CV] .... C=46.487194073008524, dual=False, score=0.622, total=  37.0s
[CV] C=167.02357600737488, dual=False ................................
[LibLinear][CV] .... C=167.02357600737488, dual=False, score=0.624, total=  10.4s
[CV] C=167.02357600737488, dual=False ................................
[LibLinear][CV] .... C=167.02357600737488, dual=False, score=0.614, total=  11.2s
[CV] C=167.02357600737488, dual=False ................................
[LibLinear][CV] .... C=167.02357600737488, dual=False, score=0.623, total=  29.1s
[CV] C=600.0980592306573, dual=False .................................
[LibLinear]

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  8.3min finished


[LibLinear]Phase 1 done
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] C=1.0023052380778996, dual=False ................................
[LibLinear][CV] .... C=1.0023052380778996, dual=False, score=0.837, total=   7.1s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.2s remaining:    0.0s


[CV] C=1.0023052380778996, dual=False ................................
[LibLinear][CV] .... C=1.0023052380778996, dual=False, score=0.827, total=   6.6s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   14.0s remaining:    0.0s


[CV] C=1.0023052380778996, dual=False ................................
[LibLinear][CV] .... C=1.0023052380778996, dual=False, score=0.841, total=   6.5s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   20.7s remaining:    0.0s


[CV] C=3.6011768069240193, dual=False ................................
[LibLinear][CV] .... C=3.6011768069240193, dual=False, score=0.836, total=   6.7s


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   27.5s remaining:    0.0s


[CV] C=3.6011768069240193, dual=False ................................
[LibLinear][CV] .... C=3.6011768069240193, dual=False, score=0.823, total=   7.1s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   34.8s remaining:    0.0s


[CV] C=3.6011768069240193, dual=False ................................
[LibLinear][CV] .... C=3.6011768069240193, dual=False, score=0.839, total=   6.6s


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   41.6s remaining:    0.0s


[CV] C=12.938647731300748, dual=False ................................
[LibLinear][CV] .... C=12.938647731300748, dual=False, score=0.837, total=   7.4s


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   49.2s remaining:    0.0s


[CV] C=12.938647731300748, dual=False ................................
[LibLinear][CV] .... C=12.938647731300748, dual=False, score=0.823, total=   6.9s


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   56.3s remaining:    0.0s


[CV] C=12.938647731300748, dual=False ................................
[LibLinear][CV] .... C=12.938647731300748, dual=False, score=0.840, total=   6.7s


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  1.1min remaining:    0.0s


[CV] C=46.487194073008524, dual=False ................................
[LibLinear][CV] .... C=46.487194073008524, dual=False, score=0.836, total=   6.0s
[CV] C=46.487194073008524, dual=False ................................
[LibLinear][CV] .... C=46.487194073008524, dual=False, score=0.823, total=   6.1s
[CV] C=46.487194073008524, dual=False ................................
[LibLinear][CV] .... C=46.487194073008524, dual=False, score=0.840, total=   6.1s
[CV] C=167.02357600737488, dual=False ................................
[LibLinear][CV] .... C=167.02357600737488, dual=False, score=0.837, total=   5.8s
[CV] C=167.02357600737488, dual=False ................................
[LibLinear][CV] .... C=167.02357600737488, dual=False, score=0.822, total=   6.0s
[CV] C=167.02357600737488, dual=False ................................
[LibLinear][CV] .... C=167.02357600737488, dual=False, score=0.840, total=   5.8s
[CV] C=600.0980592306573, dual=False .................................
[LibLinear]

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  3.2min finished


[LibLinear]Phase 2 done
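
The `C` values in the logs start at about 1.002 rather than 0.001 because `np.logspace` treats `start` and `stop` as exponents of the base. The pure-Python sketch below reproduces the grid to make this explicit; the `logspace` helper here is a hypothetical stand-in written for illustration, not NumPy's implementation.

```python
def logspace(start, stop, num, base=10.0):
    """Evenly spaced exponents from start to stop (inclusive), raised to `base`.

    Assumes num >= 2.
    """
    step = (stop - start) / (num - 1)
    return [base ** (start + k * step) for k in range(num)]


c_grid = logspace(0.001, 5, 10)
for c in c_grid:
    print(round(c, 4))
```

If a grid beginning at `C = 0.001` is wanted, the exponent form would be `np.logspace(-3, 5, num=10)`.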


## Testing

In [16]:
preds = []
for x in test:
    preds.append(predict(clf1,clf2,x))

In [31]:
print("\t\t   Classification Report\n")
# Label encoding used above: 0 = positive, 1 = negative, 2 = neutral
print(classification_report(test_y,preds, target_names=["Positive", "Negative", "Neutral"]))

		   Classification Report

              precision    recall  f1-score   support

    Positive       0.53      0.58      0.55       582
    Negative       0.55      0.59      0.57       532
     Neutral       0.50      0.44      0.47       754

    accuracy                           0.53      1868
   macro avg       0.53      0.54      0.53      1868
weighted avg       0.53      0.53      0.53      1868

