# Assignment Two:  Sentiment Classification

For this exercise you will be using the "SemEval 2017 task 4" corpus provided on the module website, available through the following link: https://warwick.ac.uk/fac/sci/dcs/teaching/material/cs918/semeval-tweets.tar.bz2. You will focus particularly on Subtask A, i.e. classifying the overall sentiment of a tweet as positive, negative or neutral.

You are requested to produce a *Jupyter notebook* for the coursework submission. The input to your program is the downloaded SemEval data. Note that the TAs need to run your program on their own machines using the original SemEval data. As such, don’t submit a Python program that takes preprocessed files as input.

#### Import necessary packages
You may import more packages here.

In [13]:
# Import necessary packages here
import re
from os.path import join
import numpy as np

In [14]:
# Define test sets
dataDir = '../semeval-tweets'
testsetStrings = ['twitter-test1.txt', 'twitter-test2.txt', 'twitter-test3.txt']
testsets = [join(dataDir, t) for t in testsetStrings]
print(testsets)

['../semeval-tweets/twitter-test1.txt', '../semeval-tweets/twitter-test2.txt', '../semeval-tweets/twitter-test3.txt']


In [15]:
# checking the structure of the dataset
with open(testsets[0], 'r', encoding='utf8') as f1:
    i = 0
    for line in f1:
        fields = line.split('\t')
        if i < 30:
            length = len(fields[2])
            if length > 130:
                print(f"Length: {length}")
                print(fields[0])  # 1st column - tweet ID
                print(fields[1])  # 2nd column - tweet sentiment
                print(fields[2])  # 3rd column - tweet text
                i += 1

# preprocessing questions and notes:
  # -> is it advisable to remove the @usernames?
  # -> URLs need to be removed!
  # -> The data is noisy, with many mistakes and missing punctuation.
  # -> what about adding a start token?

Length: 139
102313285628711403
neutral
"Bing one-ups knowledge graph, hires Encyclopaedia Britannica to supply results:   It may have retired from the cut-throat world of pr..."

Length: 139
653274888624828198
neutral
"On Thursday, concealed-carry gun license holders will be given a new right in the state of Oklahoma: the ability... http://t.co/oSgGHKi1"

Length: 137
420747042670198316
negative
Miyagi just got banned from yoga. He was caught him sniffing the sphincter of the girl in front of him. There may be police involvement!

Length: 135
822064800445716046
neutral
Join us tonight at Boston Pizza - Centre on Barton for THURSDAY NIGHT FOOTBALL! Tonight the Chiefs take on the... http://t.co/iegTxPQv

Length: 139
055480020953212084
neutral
"#FX NEW YORK, Oct 18 (Reuters) - The Federal Reserve provided $4.701 billion of liquidity to the ... http://t.co/BJhIQTtO #EUR #AUD #CAD"

Length: 132
429443270273347255
neutral
"13 April 1996, History is made, as the MetroStars and the Los Angeles 

In [16]:
# Skeleton: Evaluation code for the test sets
def read_test(testset):
    '''
    reading the testset and return a dictionary with: ID -> sentiment
    :param testset: str, the file name of the testset to compare
    '''
    id_gts = {}  # init the dictionary
    with open(testset, 'r', encoding='utf8') as fh:
        for line in fh:
            fields = line.split('\t')
            tweetid = fields[0]
            gt = fields[1]
            id_gts[tweetid] = gt

    return id_gts


def confusion(id_preds, testset, classifier):
    '''
    print the confusion matrix of {'positive', 'negative', 'neutral'} between preds and testset
    (rows are predicted labels, columns are gold labels, each row is normalised to sum to 1)
    :param id_preds: a dictionary of predictions formatted as {<tweetid>:<sentiment>, ... }
    :param testset: str, the file name of the testset to compare
    :param classifier: str, the name of the classifier
    '''
    id_gts = read_test(testset)

    # fixed label order for the rows and columns of the matrix
    gts = ['positive', 'negative', 'neutral']

    conf = {}
    for c1 in gts:
        conf[c1] = {}
        for c2 in gts:
            conf[c1][c2] = 0

    for tweetid, gt in id_gts.items():
        if tweetid in id_preds:
            pred = id_preds[tweetid]
        else:
            pred = 'neutral'
        conf[pred][gt] += 1

    print(''.ljust(12) + '  '.join(gts))

    for c1 in gts:
        print(c1.ljust(12), end='')
        for c2 in gts:
            if sum(conf[c1].values()) > 0:
                print('%.3f     ' % (conf[c1][c2] / float(sum(conf[c1].values()))), end='')
            else:
                print('0.000     ', end='')
        print('')

    print('')


def evaluate(id_preds, testset, classifier):
    '''
    print the macro-F1 score of {'positive', 'negative'} between preds and testset
    :param id_preds: a dictionary of predictions formatted as {<tweetid>:<sentiment>, ... }
    :param testset: str, the file name of the testset to compare
    :param classifier: str, the name of the classifier
    '''
    id_gts = read_test(testset)

    acc_by_class = {}
    for gt in ['positive', 'negative', 'neutral']:
        acc_by_class[gt] = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}

    catf1s = {}

    ok = 0
    for tweetid, gt in id_gts.items():
        if tweetid in id_preds:
            pred = id_preds[tweetid]
        else:
            pred = 'neutral'

        if gt == pred:
            ok += 1
            acc_by_class[gt]['tp'] += 1
        else:
            acc_by_class[gt]['fn'] += 1
            acc_by_class[pred]['fp'] += 1

    catcount = 0
    itemcount = 0
    macro = {'p': 0, 'r': 0, 'f1': 0}
    micro = {'p': 0, 'r': 0, 'f1': 0}
    semevalmacro = {'p': 0, 'r': 0, 'f1': 0}

    microtp = 0
    microfp = 0
    microtn = 0
    microfn = 0
    for cat, acc in acc_by_class.items():
        catcount += 1

        microtp += acc['tp']
        microfp += acc['fp']
        microtn += acc['tn']
        microfn += acc['fn']

        p = 0
        if (acc['tp'] + acc['fp']) > 0:
            p = float(acc['tp']) / (acc['tp'] + acc['fp'])

        r = 0
        if (acc['tp'] + acc['fn']) > 0:
            r = float(acc['tp']) / (acc['tp'] + acc['fn'])

        f1 = 0
        if (p + r) > 0:
            f1 = 2 * p * r / (p + r)

        catf1s[cat] = f1

        n = acc['tp'] + acc['fn']

        macro['p'] += p
        macro['r'] += r
        macro['f1'] += f1

        if cat in ['positive', 'negative']:
            semevalmacro['p'] += p
            semevalmacro['r'] += r
            semevalmacro['f1'] += f1

        itemcount += n

    micro['p'] = float(microtp) / float(microtp + microfp)
    micro['r'] = float(microtp) / float(microtp + microfn)
    micro['f1'] = 2 * float(micro['p']) * micro['r'] / float(micro['p'] + micro['r'])

    semevalmacrof1 = semevalmacro['f1'] / 2

    print(testset + ' (' + classifier + '): %.3f' % semevalmacrof1)

In [17]:
# testing the evaluation functions
tweetDict = read_test('../semeval-tweets/twitter-test1.txt')
confusion(tweetDict, '../semeval-tweets/twitter-test1.txt', "PerfectClassifier")
evaluate(tweetDict, '../semeval-tweets/twitter-test1.txt', "PerfectClassifier")

            positive  negative  neutral
positive    1.000     0.000     0.000     
negative    0.000     1.000     0.000     
neutral     0.000     0.000     1.000     

../semeval-tweets/twitter-test1.txt (PerfectClassifier): 1.000
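
The score printed by `evaluate` is the mean of the F1 values of the positive and negative classes only. Neutral tweets still matter, because misclassifying them produces false positives or false negatives for the other two classes. A minimal sketch of that computation on made-up labels (the helper name `f1_for_class` and the toy data are assumptions for illustration, not part of the skeleton):

```python
def f1_for_class(gold, pred, label):
    """Return the F1 score of one class from parallel lists of labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy gold labels and predictions (invented for this example).
gold = ['positive', 'positive', 'negative', 'neutral', 'negative', 'neutral']
pred = ['positive', 'neutral', 'negative', 'neutral', 'positive', 'neutral']

f1_pos = f1_for_class(gold, pred, 'positive')
f1_neg = f1_for_class(gold, pred, 'negative')
semeval_macro_f1 = (f1_pos + f1_neg) / 2  # neutral is excluded from the average
print('%.3f %.3f %.3f' % (f1_pos, f1_neg, semeval_macro_f1))
```

Here the positive class has precision and recall of 0.5 each, the negative class has precision 1.0 and recall 0.5, and the reported score is the average of the two resulting F1 values.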


---
## Load training set, dev set and testing set
Here, you need to load the training set, the development set and the test set. For better classification results, you may need to preprocess tweets before sending them to the classifiers.
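
Each line of the corpus files holds a tweet ID, a sentiment label and the tweet text, separated by tabs. The sketch below shows one way to parse that layout. It reads from an in-memory sample rather than the real files, and the function name `parse_tweets` and the sample rows are hypothetical:

```python
import io


def parse_tweets(handle):
    """Read tab-separated lines of (tweet ID, sentiment, text) into two dicts."""
    sentiments, texts = {}, {}
    for line in handle:
        # Split at most twice so that a tab inside the text would be preserved.
        fields = line.rstrip('\n').split('\t', 2)
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        tweet_id, sentiment, text = fields
        sentiments[tweet_id] = sentiment
        texts[tweet_id] = text
    return sentiments, texts


# Invented sample standing in for a file such as twitter-dev-data.txt.
sample = io.StringIO(
    "001\tpositive\tGreat match tonight!\n"
    "002\tneutral\tThe game starts at 8pm\n"
)
sentiments, texts = parse_tweets(sample)
print(sentiments['001'], '|', texts['002'])
```

Stripping the trailing newline at load time keeps it out of the tweet text, so later cleaning steps do not have to deal with it.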

In [18]:
# Load training set, dev set and testing set

dataDir = '../semeval-tweets'  # change to the proper directory
datasetStrings = ['twitter-training-data.txt', 'twitter-test1.txt', 'twitter-test2.txt', 'twitter-test3.txt', 'twitter-dev-data.txt']
datasets = [join(dataDir, t) for t in datasetStrings]

tweet_IDs = {}          # init dictionary with tweet IDs
tweet_sentiments = {}   # init dictionary with sentiments
tweet_texts = {}        # init dictionary with tweet texts

for DatasetString in datasets:
    data_ID, data_sent, data_text  = {}, {}, {}    # temp dictionaries
    with open(DatasetString, 'r', encoding='utf8') as f1:
        for i, line in enumerate(f1):
            fields = line.split('\t')
            data_ID[i] = fields[0]            # tweet IDs
            data_sent[fields[0]] = fields[1]  # sentiments
            data_text[fields[0]] = fields[2]  # tweet text
    tweet_IDs[DatasetString] = data_ID
    tweet_sentiments[DatasetString] = data_sent
    tweet_texts[DatasetString] = data_text

# sentiment dictionaries
sent_train = tweet_sentiments[datasets[0]]
sent_test1 = tweet_sentiments[datasets[1]]
sent_test2 = tweet_sentiments[datasets[2]]
sent_test3 = tweet_sentiments[datasets[3]]
sent_dev = tweet_sentiments[datasets[4]]

# tweet text dictionaries
text_train = tweet_texts[datasets[0]]
text_test1 = tweet_texts[datasets[1]]
text_test2 = tweet_texts[datasets[2]]
text_test3 = tweet_texts[datasets[3]]
text_dev = tweet_texts[datasets[4]]

# tweet IDs dictionaries
IDs_train = tweet_IDs[datasets[0]]
IDs_test1 = tweet_IDs[datasets[1]]
IDs_test2 = tweet_IDs[datasets[2]]
IDs_test3 = tweet_IDs[datasets[3]]
IDs_dev = tweet_IDs[datasets[4]]


## examples and tests
# id = IDs_train[0]
# id_dev = IDs_dev[0]
# id1 = IDs_test1[0]
# id2 = IDs_test2[0]
# id3 = IDs_test3[0]
# print(f"-ID:{id} \n-TEXT:{text_train[id]}-SENTIMENT: {sent_train[id]}\n")
# print(f"-ID:{id_dev} \n-TEXT:{text_dev[id_dev]}-SENTIMENT: {sent_dev[id_dev]}\n")
# print(f"-ID:{id1} \n-TEXT:{text_test1[id1]}-SENTIMENT: {sent_test1[id1]}\n")
# print(f"-ID:{id2} \n-TEXT:{text_test2[id2]}-SENTIMENT: {sent_test2[id2]}\n")
# print(f"-ID:{id3} \n-TEXT:{text_test3[id3]}-SENTIMENT: {sent_test3[id3]}\n")
# print(len(IDs_train.keys()), len(text_train.keys()), len(sent_train.keys()))  # 45101
# print(len(IDs_test1.keys()), len(text_test1.keys()), len(sent_test1.keys()))  # 3531
# print(len(IDs_test2.keys()), len(text_test2.keys()), len(sent_test2.keys()))  # 1853
# print(len(IDs_test3.keys()), len(text_test3.keys()), len(sent_test3.keys()))  # 2379
# print(len(IDs_dev.keys()), len(text_dev.keys()), len(sent_dev.keys()))        # 2000

---
#### Order of preprocessing
* lowercase text
* regex cleaning
   * Remove URLs
   * Remove non-alphanumeric characters (leave hashtags and usernames)
   * Remove tokens made up entirely of digits
   * (Remove words with only 1 character)

#### Preprocessing questions and notes:
* Is it advisable to remove the @usernames?
* URLs need to be removed!
* The data is noisy, with many mistakes and missing punctuation.
* What about adding a start token?
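
The cleaning order listed above can be sketched as a single function. This is a simplified illustration, not the notebook's full pipeline: it leaves out the TLD-based link removal and the lemmatisation, and the name `clean_tweet` and the exact regular expressions are assumptions.

```python
import re


def clean_tweet(text):
    """Apply the listed cleaning steps, in order, to one tweet."""
    text = text.lower()
    text = re.sub(r'(?:https?://|www\.)\S+', ' ', text)  # URLs
    text = re.sub(r'[^\w\s@#]', '', text)                 # keep letters, digits, _, @ and #
    text = re.sub(r'(?<!\S)\d+(?!\S)', ' ', text)         # tokens made only of digits
    text = re.sub(r'(?<!\S)\w(?!\S)', ' ', text)          # single-character words
    return re.sub(r'\s+', ' ', text).strip()              # collapse whitespace


print(clean_tweet("Join us at 8 for THURSDAY NIGHT FOOTBALL! http://t.co/iegTxPQv #NFL @BostonPizza"))
```

URLs are removed before the punctuation is stripped. Otherwise `http://t.co/...` would lose its slashes and dots and no longer be recognisable as a link.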

In [19]:
top100 = ['com', 'net', 'org', 'jp', 'de', 'uk', 'fr', 'br', 'it', 'ru', 'es', 'me', 'gov', 'pl', 'ca', 'au', 'cn', 'co', 'in', 'nl', 'edu', 'info', 'eu', 'ch', 'id', 'at', 'kr', 'cz', 'mx', 'be', 'tv', 'se', 'tr', 'tw', 'al', 'ua', 'ir', 'vn', 'cl', 'sk', 'ly', 'cc', 'to', 'no', 'fi', 'us', 'pt', 'dk', 'ar', 'hu', 'tk', 'gr', 'il', 'news', 'ro', 'my', 'biz', 'ie', 'za', 'nz', 'sg', 'ee', 'th', 'io', 'xyz', 'pe', 'bg', 'hk', 'rs', 'lt', 'link', 'ph', 'club', 'si', 'site', 'mobi', 'by', 'cat', 'wiki', 'la', 'ga', 'xxx', 'cf', 'hr', 'ng', 'jobs', 'online', 'kz', 'ug', 'gq', 'ae', 'is', 'lv', 'pro', 'fm', 'tips', 'ms', 'sa', 'app', 'lat']


text = '''
www.abc.com
www.bcd.net
www.sss.cc
www.dcamp.uk
U.S.A.
google.fr
google.it
ltd.
etc.
sell.uk
I said 'yes'.But I did not say Why.
https://www.scoutdns.com/100-most-popular-tlds-by-google-index/
'''


for ext in top100:
    # raw strings keep '\s' and '\.' as regex escapes rather than string escapes
    re_string = r"[^\s]*\." + ext + r"[^\s]*"
    text = re.sub(re_string, '[Deleted link]', text)
print(text)




[Deleted link]
[Deleted link]
[Deleted link]
[Deleted link]
U.S.A.
[Deleted link]
[Deleted link]
ltd.
etc.
[Deleted link]
I said 'yes'.But I did not say Why.
[Deleted link]
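
The loop above runs one substitution per extension. The same test can be expressed as a single compiled pattern with an alternation, which also makes it easy to probe for false positives. The sketch uses a shortened list (`tlds`) and an invented sample sentence:

```python
import re

# Shortened, illustrative subset of the TLD list used above.
tlds = ['com', 'net', 'uk', 'fr', 'it', 'cc']

# One pattern for all extensions: any whitespace-delimited token containing ".<tld>".
link_re = re.compile(r'\S*\.(?:' + '|'.join(map(re.escape, tlds)) + r')\S*')

sample = "see www.abc.com or google.fr etc. U.S.A. fine.it seems"
cleaned = link_re.sub('[Deleted link]', sample)
print(cleaned)
```

The token `fine.it` is deleted as well, because the pattern only requires `.it` to appear somewhere inside a token. Tweets often omit the space after a full stop, so this kind of match may remove ordinary words.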



In [21]:
import os
import re
import pickle
import nltk  # word_tokenize, pos_tag and WordNetLemmatizer need their NLTK data packages (see nltk.download)
from nltk.stem import WordNetLemmatizer

if os.path.isfile("preprocessing.pkl"):  # loading preprocessed datasets
    with open('preprocessing.pkl', 'rb') as inp_file:
        temp_dicts = pickle.load(inp_file)
        txt_dicts = temp_dicts[0:5]
        txtlist_dicts = temp_dicts[5:]

else:
    ID_dicts = [IDs_train, IDs_test1, IDs_test2, IDs_test3, IDs_dev]
    txt_dicts = [text_train, text_test1, text_test2, text_test3, text_dev]
    txtlist_dicts = []

    lemmatizer = WordNetLemmatizer()  # init the lemmatizer
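    # Penn Treebank tag -> WordNet POS: J* -> 'a' (adjective), N*/R*/V* -> 'n'/'r'/'v', anything else -> 'n'
    POSconvert = lambda e: 'a' if e[0].lower() == 'j' else (e[0].lower() if e[0].lower() in ['n', 'r', 'v'] else 'n')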

    for i, IDdict in enumerate(ID_dicts):
        output = txt_dicts[i]
        output_txt = {}
        for id in IDdict.values():
            text = output[id].lower()

            # delete all URLs starting with 'http' and 'www'
            new_text = re.sub(r"http[^\s]*", '', text)
            new_text = re.sub(r"www[^\s]*", '', new_text)

            # delete all URLs which have one of 100 most common extensions ('.com', '.net', ...)
            top100 = ['com', 'net', 'org', 'jp', 'de', 'uk', 'fr', 'br', 'it', 'ru', 'es', 'me', 'gov', 'pl', 'ca', 'au', 'cn', 'co', 'in', 'nl', 'edu', 'info', 'eu', 'ch', 'id', 'at', 'kr', 'cz', 'mx', 'be', 'tv', 'se', 'tr', 'tw', 'al', 'ua', 'ir', 'vn', 'cl', 'sk', 'ly', 'cc', 'to', 'no', 'fi', 'us', 'pt', 'dk', 'ar', 'hu', 'tk', 'gr', 'il', 'news', 'ro', 'my', 'biz', 'ie', 'za', 'nz', 'sg', 'ee', 'th', 'io', 'xyz', 'pe', 'bg', 'hk', 'rs', 'lt', 'link', 'ph', 'club', 'si', 'site', 'mobi', 'by', 'cat', 'wiki', 'la', 'ga', 'xxx', 'cf', 'hr', 'ng', 'jobs', 'online', 'kz', 'ug', 'gq', 'ae', 'is', 'lv', 'pro', 'fm', 'tips', 'ms', 'sa', 'app', 'lat']
            for ext in top100:
                re_string = r"[^\s]*\." + ext + r"[^\s]*"
                new_text = re.sub(re_string, '', new_text)

            # removing '&amp'
            new_text = re.sub('&amp','', new_text)

            # remove all non-alphanumeric chars except for '#' and '@'
            new_text = re.sub(r'[^\w\s@#]', '', new_text)

            # remove strings with '#' not at the beginning (to keep only hashtags)
            new_text = re.sub(r'\s[\w]+#[\w]*', '', new_text)

            # remove tokens made up entirely of digits (digits inside words, e.g. 'top10', are kept)
            new_text = re.sub(r'(?<!\S)\d+(?!\S)', '', new_text)

            # remove words with only 1 character
            new_text = re.sub(r'\b\w\b', '', new_text)

            # remove newline chars
            new_text = new_text.replace('\n', ' ')

            # replace multiple spaces with a single space
            new_text = re.sub(r'\s+', ' ', new_text)

            # using the lemmatizer
            txt_list = nltk.word_tokenize(new_text)
            for k, word in enumerate(txt_list):  # fixing the separation of hashtags by the tokenizer
                if word == '#' or word == '@':
                    if k < len(txt_list) - 1:
                        txt_list[k] = txt_list[k] + txt_list[k+1]
                        txt_list.pop(k+1)
            POS = nltk.pos_tag(txt_list)                  # POS tags from nltk
            WordNetPOS = [POSconvert(P[1]) for P in POS]  # POS tags for lemmatizer
            for j in range(len(txt_list)):
                word = txt_list[j]
                lemmatized = lemmatizer.lemmatize(word, WordNetPOS[j])  # process each token/word one by one
                txt_list[j] = lemmatized  # update the word in the txt_list

            ## UPDATE the dictionary
            output_txt[id] = ' '.join(txt_list)
            output[id] = txt_list

        txt_dicts[i] = output_txt
        txtlist_dicts.append(output)

text_train = txt_dicts[0]
text_test1 = txt_dicts[1]
text_test2 = txt_dicts[2]
text_test3 = txt_dicts[3]
text_dev = txt_dicts[4]
txtlist_train = txtlist_dicts[0]
txtlist_test1 = txtlist_dicts[1]
txtlist_test2 = txtlist_dicts[2]
txtlist_test3 = txtlist_dicts[3]
txtlist_dev = txtlist_dicts[4]

# saving preprocessing.pkl (five text dictionaries followed by five token-list dictionaries)
if not os.path.isfile("preprocessing.pkl"):
    all_dicts = txt_dicts + txtlist_dicts
    with open('preprocessing.pkl', 'wb') as out_file:
        pickle.dump(all_dicts, out_file, protocol=-1)
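
`nltk.pos_tag` returns Penn Treebank tags such as `NN`, `VBD`, `JJ` or `RB`, whereas `WordNetLemmatizer.lemmatize` expects one of the WordNet letters `n`, `v`, `a` or `r`. A sketch of the intended mapping as a plain function (the name `penn_to_wordnet` is hypothetical, and falling back to noun for all other tags is a simplifying assumption):

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    first = tag[:1].lower()
    if first == 'j':
        return 'a'  # adjectives: JJ, JJR, JJS
    if first in ('n', 'v', 'r'):
        return first  # nouns, verbs, adverbs
    return 'n'


print([penn_to_wordnet(t) for t in ['NN', 'VBD', 'JJ', 'RB', 'IN']])
```

The adjective case has to be tested before the general membership check, because `j` is not itself a valid WordNet letter.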

---
## Feature Extraction: Bag of words

In [81]:
# Bag of Words - my implementation:
from nltk.probability import FreqDist
from nltk.corpus import stopwords as Stopwords

# 1) building the stop-word list (apostrophes stripped to match the cleaned text)
stopwords = Stopwords.words('english')
stopwords = {word.replace('\'', '') for word in stopwords}  # set for fast membership tests

# 2) extracting the dictionary/vocabulary
freq = FreqDist()   # frequency distribution
for Dict in txtlist_dicts:
    for tweet in Dict.values():
        for word in tweet:
            if not word in stopwords:
                freq[word] += 1

nums = range(len(freq.keys()))
vocabulary = list(freq.keys())              # creating the dictionary
vocabularyOOV = vocabulary + ['<OOV>']      # dictionary with 'out of vocabulary' word
vocab2num = dict(zip(vocabulary, nums))     # word to index mapping
vocab2num['<OOV>'] = max(vocab2num.values()) + 1  # out of vocabulary words -> len: 69742

# auxiliary function which takes a list of words and returns its BoW representation as np.array
def text2BOW(text_list, vocabulary, stopwords):
    BOW_vec = np.zeros(len(vocabulary) + 1)
    for word in text_list:
        if word not in stopwords:
            # dict lookup instead of scanning the vocabulary list; unknown words go to '<OOV>'
            BOW_vec[vocab2num.get(word, vocab2num['<OOV>'])] += 1
    return BOW_vec

# if os.path.isfile("BOWs.pkl"):  # loading preprocessed datasets
#     with open('BOWs.pkl', 'rb') as inp_file:
#         ll = pickle.load(inp_file)
#         BOW_train, BOW_test1, BOW_test2, BOW_test3, BOW_dev = ll[0], ll[1], ll[2], ll[3], ll[4]
#
# else:
# Bag of Words (BOW) for each tweet
BOW_train = {}
for ID, tweet in txtlist_train.items():
    BOW = text2BOW(tweet, vocabulary=vocabulary, stopwords=stopwords)
    BOW_train[ID] = BOW

BOW_test1 = {}
for ID, tweet in txtlist_test1.items():
    BOW = text2BOW(tweet, vocabulary=vocabulary, stopwords=stopwords)
    BOW_test1[ID] = BOW

BOW_test2 = {}
for ID, tweet in txtlist_test2.items():
    BOW = text2BOW(tweet, vocabulary=vocabulary, stopwords=stopwords)
    BOW_test2[ID] = BOW

BOW_test3 = {}
for ID, tweet in txtlist_test3.items():
    BOW = text2BOW(tweet, vocabulary=vocabulary, stopwords=stopwords)
    BOW_test3[ID] = BOW

BOW_dev = {}
for ID, tweet in txtlist_dev.items():
    BOW = text2BOW(tweet, vocabulary=vocabulary, stopwords=stopwords)
    BOW_dev[ID] = BOW


# saving BOWs.pkl: very large file - maybe not the best idea to save it?
    # if not os.path.isfile("BOWs.pkl"):
    #     BOW_dicts = [BOW_train, BOW_test1, BOW_test2, BOW_test3, BOW_dev]
    #     with open("BOWs.pkl", 'wb') as out_file:
    #         pickle.dump(BOW_dicts, out_file, protocol=-1)
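
The vocabulary comment above reports 69742 entries. A dense float64 vector of that length for each of the 45101 training tweets would take on the order of tens of gigabytes, which is likely why saving `BOWs.pkl` is impractical. Since a tweet contains only a handful of distinct words, one alternative is to store just the non-zero counts. A minimal sketch, with a tiny invented vocabulary and a hypothetical function name `text_to_sparse_bow`:

```python
from collections import Counter


def text_to_sparse_bow(tokens, vocab_index, stopwords, oov_token='<OOV>'):
    """Map a token list to {feature index: count}, skipping stop words."""
    counts = Counter()
    for token in tokens:
        if token in stopwords:
            continue
        # Unknown tokens share the single out-of-vocabulary index.
        counts[vocab_index.get(token, vocab_index[oov_token])] += 1
    return dict(counts)


# Tiny invented vocabulary and tweet for demonstration.
vocab_index = {'love': 0, 'python': 1, 'hate': 2, '<OOV>': 3}
stop = {'i', 'but'}
bow = text_to_sparse_bow(['i', 'love', 'python', 'but', 'love', 'rust'], vocab_index, stop)
print(bow)
```

Such index-to-count dictionaries can later be converted to whatever sparse matrix format the chosen classifier library accepts.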

---
## Feature Extraction: TF-IDF weighted Bag of words

In [89]:
a = np.array(['a', 'b', 'b'])
np.count_nonzero(a == 'b')

2

In [144]:
# extracting the dictionary
# freq = FreqDist()   # frequency distribution
# for Dict in txtlist_dicts:
#     for tweet in Dict.values():
#         for word in tweet:
#             if not word in stopwords:
#                 freq[word] += 1
#
# nums = range(len(freq.keys()))
# vocabulary = list(freq.keys())              # creating the dictionary
# vocabulary_array = np.array(vocabulary)     # np.array of the dictionary
# vocabularyOOV = vocabulary + ['<OOV>']      # dictionary with 'out of vocabulary' word
# vocab2num = dict(zip(vocabulary, nums))     # word to index mapping
# vocab2num['<OOV>'] = max(vocab2num.values()) + 1  # out of vocabulary words -> len: 69742

# counting document frequencies: the number of tweets in which each word occurs
DFfreq = FreqDist()   # document frequency distribution
Ntexts = len(IDs_train) + len(IDs_test1) + len(IDs_test2) + len(IDs_test3) + len(IDs_dev)
for Dict in txtlist_dicts:
    for tweet in Dict.values():
        for word in np.unique(tweet):
            if not word in stopwords:
                DFfreq[word] += 1

# auxiliary function which takes a list of words and returns its TF-IDF representation as np.array
def text2TFIDF(text_list, vocabulary, stopwords, Ntexts):
    TFIDF_vec = np.zeros(len(vocabulary) + 1)
    tokens = np.array(text_list)
    for word in np.unique(tokens):
        if word in stopwords:
            continue
        tf = np.count_nonzero(tokens == word) / len(text_list)  # relative term frequency
        if word in vocab2num:  # dict lookup instead of scanning the vocabulary list
            idf = np.log2(Ntexts / DFfreq[word])
            TFIDF_vec[vocab2num[word]] = tf * idf
        else:
            idf = np.log2(Ntexts / 0.000001)  # ad hoc large IDF for words outside the vocabulary
            TFIDF_vec[vocab2num['<OOV>']] = tf * idf
    return TFIDF_vec


# TFIDF-weighted Bag of Words for each tweet
TFIDF_train = {}
for ID, tweet in txtlist_train.items():
    tfidf = text2TFIDF(tweet, vocabulary=vocabulary, stopwords=stopwords, Ntexts=Ntexts)
    TFIDF_train[ID] = tfidf

# TFIDF_test1 = {}
# for ID, tweet in txtlist_test1.items():
#     tfidf = text2TFIDF(tweet, vocabulary=vocabulary, stopwords=stopwords, Ntexts=Ntexts)
#     TFIDF_test1[ID] = tfidf
#
# TFIDF_test2 = {}
# for ID, tweet in txtlist_test2.items():
#     tfidf = text2TFIDF(tweet, vocabulary=vocabulary, stopwords=stopwords, Ntexts=Ntexts)
#     TFIDF_test2[ID] = tfidf
#
# TFIDF_test3 = {}
# for ID, tweet in txtlist_test3.items():
#     tfidf = text2TFIDF(tweet, vocabulary=vocabulary, stopwords=stopwords, Ntexts=Ntexts)
#     TFIDF_test3[ID] = tfidf
#
# TFIDF_dev = {}
# for ID, tweet in txtlist_dev.items():
#     tfidf = text2TFIDF(tweet, vocabulary=vocabulary, stopwords=stopwords, Ntexts=Ntexts)
#     TFIDF_dev[ID] = tfidf

In [150]:
print(len(TFIDF_train), len(text_train))
np.count_nonzero(TFIDF_train[IDs_train[500]])

45101 45101


14

In [129]:
DFfreq = FreqDist()   # document frequency distribution
temp = [['i', 'love', 'natural', 'language', 'processing', 'but', 'i', 'hate', 'python'],
['i', 'like' , 'image', 'processing'],['i', 'like', 'signal', 'processing', 'and' , 'image', 'processing' ]]

for tweet in temp:
    for word in np.unique(tweet):
        if not word in []:
            DFfreq[word] += 1


print(DFfreq.items())
DFfreq['but']

dict_items([('but', 1), ('hate', 1), ('i', 3), ('language', 1), ('love', 1), ('natural', 1), ('processing', 3), ('python', 1), ('image', 2), ('like', 2), ('and', 1), ('signal', 1)])


1
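
Using the same toy corpus, the weighting applied in `text2TFIDF` (term frequency `count / len(tweet)` multiplied by `log2(N / df)`) can be checked by hand. The sketch below uses only the standard library, and the helper name `tfidf` is an assumption:

```python
import math
from collections import Counter

docs = [
    ['i', 'love', 'natural', 'language', 'processing', 'but', 'i', 'hate', 'python'],
    ['i', 'like', 'image', 'processing'],
    ['i', 'like', 'signal', 'processing', 'and', 'image', 'processing'],
]

# Document frequency: in how many documents each word occurs.
df = Counter(word for doc in docs for word in set(doc))


def tfidf(doc, df, n_docs):
    """Return {word: tf * idf} with tf = count / len(doc) and idf = log2(N / df)."""
    counts = Counter(doc)
    return {w: (c / len(doc)) * math.log2(n_docs / df[w]) for w, c in counts.items()}


weights = tfidf(docs[1], df, len(docs))
print(weights)
```

Words that occur in every document, here `i` and `processing`, get an IDF of `log2(1) = 0` and therefore carry no weight. `like` and `image` occur in two of the three documents and receive `0.25 * log2(1.5)`.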

---
## Feature Extraction: GloVe
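
Pre-trained GloVe vectors are distributed as plain text, one word per line followed by its vector components. One simple way to turn a tweet into a fixed-length feature vector is to average the vectors of the words it contains. The sketch below illustrates this with invented two-dimensional vectors; a real run would read a downloaded GloVe file instead, and the function names are hypothetical.

```python
import io


def load_glove(handle):
    """Parse lines of the form '<word> <v1> <v2> ...' into {word: [floats]}."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def average_vector(tokens, vectors, dim):
    """Average the embeddings of the known tokens; all zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


# Invented 2-dimensional "embeddings" standing in for a real GloVe file.
fake_glove = io.StringIO("good 1.0 0.5\nbad -1.0 0.5\n")
vectors = load_glove(fake_glove)
print(average_vector(['good', 'bad', 'unknownword'], vectors, dim=2))
```

Averaging discards word order, so it suits the traditional classifiers. A sequence model such as an LSTM would instead consume the per-token vectors one by one.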

---
## Build sentiment classifiers
You need to create your own classifiers (at least 3 classifiers). For each classifier, you can choose between the bag-of-words features and the word-embedding-based features. Each classifier has to be evaluated over the 3 provided test sets. Make sure your classifiers produce consistent performance across the test sets. Marking will be based on the performance over all 5 test sets (2 of them are not provided to you).
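
As an illustration of what one such classifier could look like, here is a sketch of a multinomial Naive Bayes model with add-one smoothing that works directly on token lists. It is a toy under stated assumptions: the class name `TinyNaiveBayes`, the training tweets and the tweet IDs are all invented, and a real submission would train on the full training set.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training tweets (already tokenised) and labels.
train_docs = [['love', 'this'], ['great', 'day'], ['hate', 'this'],
              ['awful', 'day'], ['meeting', 'today']]
train_labels = ['positive', 'positive', 'negative', 'negative', 'neutral']
model = TinyNaiveBayes().fit(train_docs, train_labels)

# Predictions keyed by tweet ID, the format expected by evaluate() and confusion().
test_docs = {'101': ['love', 'great'], '102': ['hate', 'awful']}
id_preds = {tweet_id: model.predict(tokens) for tweet_id, tokens in test_docs.items()}
print(id_preds)
```

The resulting `id_preds` dictionary maps each tweet ID to a predicted label, which is the shape that `evaluate` and `confusion` take as their first argument.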

In [None]:
# Build traditional sentiment classifiers. An example classifier name 'SVM' is given
# in the code below. You should replace the other two classifier names
# with your own choices. For features used for classifier training,
# the 'BOW' feature is given in the code. But you could also explore the
# use of other features.
for classifier in ['NearestNeighbour', 'NaiveBayes', 'SVM']:
    for features in ['BOW', '<feature-2-name>']:
        # Skeleton: Creation and training of the classifiers
        if classifier == 'NearestNeighbour':
            # write the nearest-neighbour classifier here
            print('Training ' + classifier)
        elif classifier == 'NaiveBayes':
            # write the Naive Bayes classifier here
            print('Training ' + classifier)
        elif classifier == 'SVM':
            # write the SVM classifier here
            print('Training ' + classifier)
        elif classifier == 'LSTM':
            # write the LSTM classifier here
            if features == 'BOW':
                continue
            print('Training ' + classifier)
        else:
            print('Unknown classifier name: ' + classifier)
            continue

        # Prediction performance of the classifiers
        for testset in testsets:
            id_preds = {}
            # write the prediction and evaluation code here

            # testsets already holds full paths (built with dataDir above)
            evaluate(id_preds, testset, features + '-' + classifier)