**Load & Preprocessing Function Description**  
  
1. Read tweets
2. Extract text and sentiment label
3. Tokenize, lower-case, lemmatize the tokens, then exclude punctuations, pure numbers and web links.
4. Collect the processed tweeter text as a list of token list + sentiment label.

In [1]:
import sys
import nltk
import re
from nltk.tokenize import TweetTokenizer

# Filters used in the main function below
def filter_tokens(tokens):
    pattern1 = re.compile(r'^(https?://[^\s]+$|[^\w\s])')
    pattern2 = re.compile(r'\d+')
    # Exclude punctuations and web links
    filtered_ts = [token for token in tokens if not pattern1.match(token)]
    filtered_ts= [token for token in filtered_ts if not pattern2.match(token)]
    return filtered_ts
def sw_filter(tokens):
    # stopwords provided by 'TweetData'
    stopwords = [line.strip() for line in open('TweetData/stopwords_twitter.txt')]
    filtered_tokens = [token for token in tokens if token not in stopwords]
    return filtered_tokens

# Main function
def processtweets(path,nosw):

    # initialize NLTK built-in tweet tokenizer
    twtokenizer = TweetTokenizer()
    # read file
    f = open(path, 'r')

    # gather the original data
    tweetdata = []
    for line in f:
        line = line.strip()
        # each line has 4 items separated by tabs
        # ignore the tweet and user ids, and keep the sentiment and tweet text
        tweetdata.append(line.split('\t')[2:4])

    # create list of tweet documents as (list of words, label)
    # where the labels are condensed to just 3:  'pos', 'neg', 'neu'
    # Create a list for the data
    tweetdocs = []
    # add all the tweets except the ones whose text is Not Available
    neg_num=0
    pos_num=0
    neu_num=0
    for tweet in tweetdata:
        if (tweet[1] != 'Not Available'):
            # tokenize each tweet text
            tokens = twtokenizer.tokenize(tweet[1])
            

            # and used lemmatizer on them
            tokens_lower=[token.lower() for token in tokens]
            text=nltk.Text(tokens_lower)
            wnl = nltk.WordNetLemmatizer()
            tokens_lemma=[wnl.lemmatize(t) for t in text]
            # Then filter out web pages and pure numbers and punctuations
            tokens_filtered=filter_tokens(tokens_lemma)

            # if we choose to exclude stop words
            if nosw==1:
                tokens_filtered=sw_filter(tokens_filtered)
            
            if tweet[0] == '"negative"':
                label = 'neg'
                neg_num+=1
            elif tweet[0] == '"positive"':
                label = 'pos'
                pos_num+=1
            else:
                label='neu'
                neu_num+=1

            # add tokens, label to our document list
            tweetdocs.append((tokens_filtered, label))
    return [tweetdocs,pos_num,neg_num,neu_num]

In [2]:
#In a.dev.dist.tsv are just ids, pure numbers of tweets, not useful.
# b-dist can be seen as a train set while b.dev.dist is a separate test set we can use later.
import random
train_path='TweetData/corpus/downloaded-tweeti-b-dist.tsv'
test_path='TweetData/corpus/downloaded-tweeti-b.dev.dist.tsv'

# print number of tweets in each group
train_documents,pos_num,neg_num,neu_num=processtweets(train_path,0)
print([pos_num,neg_num,neu_num])
test_documents,pos_num,neg_num,neu_num=processtweets(test_path,0)
print([pos_num,neg_num,neu_num])

train_documents_nosw=processtweets(train_path,1)[0]
test_documents_nosw=processtweets(test_path,1)[0]

random.seed(42)
random.shuffle(train_documents)
random.seed(42)
random.shuffle(train_documents_nosw)

# Print the amount of tweets
print(len(train_documents))
print(len(test_documents))

# Print the tokenized tweet and label of the thrid document
print(train_documents[2][0])
print(train_documents[2][1])
print(train_documents_nosw[2][0])
print(train_documents_nosw[2][1])

[3059, 1207, 3942]
[491, 290, 632]
8208
1413
['last', 'day', 'in', 'jeddah', 'will', 'be', 'in', 'brunei', 'tomorrow', 'night', 'and', 'then', 'surabaya', 'the', 'following', 'night', 'and', 'then', 'bali', 'the', 'night', 'after', 'that', 'whee']
neu
['last', 'day', 'jeddah', 'will', 'brunei', 'tomorrow', 'night', 'surabaya', 'following', 'night', 'bali', 'night', 'whee']
neu


Neutral tweets are the most, followed by positive ones and negative ones are limited.

**Word features generating and selection**  
  
1. Create word features using top half of frequent words
2. Calculate baseline accuracy
3. 5-fold cross validation, for each time, record the most informative 100 words
4. Gather the 10 100-words sets together as the most useful features later

In [3]:
all_words_list = [word for (sent,cat) in train_documents for word in sent]
all_words = nltk.FreqDist(all_words_list)
print('There are',len(all_words),'unique words in total')
word_items = all_words.most_common(int(0.5*len(all_words)))
word_features = [word for (word, freq) in word_items]

def document_features(document, word_features):
	document_words = set(document)
	features = {}
	for word in word_features:
		features['V_{}'.format(word)] = (word in document_words)
	return features
featuresets = [(document_features(d,word_features), c) for (d,c) in train_documents]

There are 14060 unique words in total


In [4]:
# baseline accuracy
train_set, test_set = featuresets[800:], featuresets[:800]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.62875


In the next cross validation function, after regular printing accuracy, confusion matrix, precision, recall and F-1 scores, the most informative 100 word features each round are collected as a python set. 
  
We will use those word features for further classification.

In [7]:
def eval_measures(gold, predicted):
    # confusion matrix
    cm = nltk.ConfusionMatrix(gold, predicted)
    print(cm.pretty_format(sort_by_count=True, show_percents=False, truncate=9))

    # get a list of labels
    labels = list(set(gold))
    # these lists have values for each label 
    recall_list = []
    precision_list = []
    F1_list = []
    for lab in labels:
        # for each label, compare gold and predicted lists and compute values
        TP = FP = FN = TN = 0
        for i, val in enumerate(gold):
            if val == lab and predicted[i] == lab:  TP += 1
            if val == lab and predicted[i] != lab:  FN += 1
            if val != lab and predicted[i] == lab:  FP += 1
            if val != lab and predicted[i] != lab:  TN += 1
        # use these to compute recall, precision, F1
        recall = TP / (TP + FP)
        precision = TP / (TP + FN)
        recall_list.append(recall)
        precision_list.append(precision)
        F1_list.append( 2 * (recall * precision) / (recall + precision))
    # the evaluation measures in a table with one row per label
    print('\tPrecision\tRecall\t\tF1')
    # print measures for each label
    for i, lab in enumerate(labels):
        print(lab, '\t', "{:10.3f}".format(precision_list[i]), \
          "{:10.3f}".format(recall_list[i]), "{:10.3f}".format(F1_list[i]))

def cross_validation(num_folds, featuresets):
    subset_size = int(len(featuresets)/num_folds)
    accuracy_list = []
    # iterate over the folds
    word_set=set()
    for i in range(num_folds):
        test_this_round = featuresets[i*subset_size:(i+1)*subset_size]
        train_this_round = featuresets[:i*subset_size]+featuresets[(i+1)*subset_size:]
        # train using train_this_round
        classifier_this_round = nltk.NaiveBayesClassifier.train(train_this_round)
        infeatures=classifier_this_round.most_informative_features(100)
        for item in infeatures:
            word=item[0][2:]
            word_set.add(word)
        # evaluate against test_this_round and save accuracy
        accuracy_this_round = nltk.classify.accuracy(classifier_this_round, test_this_round)
        print(i, accuracy_this_round)
        accuracy_list.append(accuracy_this_round)

        # predicted and test 
        goldlist = []
        predictedlist = []
        for (features, label) in test_this_round:
            goldlist.append(label)
            predictedlist.append(classifier_this_round.classify(features))

        # print confusion matrix and evaluating measures
        eval_measures(goldlist,predictedlist)

    
    # find mean accuracy over all rounds
    print('mean accuracy', sum(accuracy_list) / num_folds)
    return list(word_set)

In [6]:
word_feature_final=cross_validation(5,featuresets)
print(word_feature_final)

0 0.6191346739792809
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<491>198  85 |
pos | 156<404> 64 |
neg |  57  65<121>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
neu 	      0.634      0.697      0.664
neg 	      0.498      0.448      0.472
pos 	      0.647      0.606      0.626
1 0.6514320536258379
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<565>191  68 |
pos | 133<404> 36 |
neg |  76  68<100>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
neu 	      0.686      0.730      0.707
neg 	      0.410      0.490      0.446
pos 	      0.705      0.609      0.654
2 0.6179159049360147
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<543>185  82 |
pos | 160<380> 41 |
neg |  84  75 <91>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
neu 	      0.670      0.690      0.680
neg 	      0.364      0.4

In [7]:
print(len(word_feature_final))
print(word_feature_final[:10])
print(word_feature_final[10:20])
print(word_feature_final[20:30])

213
['damn', 'meant', 'kill', 'bless', 'breitbart', 'sb', 'suck', 'fuckin', 'losing', 'poetry']
['score', 'failed', 'report', 'changed', 'tired', 'window', 'except', 'kinda', 'six', 'swift']
['nooooooooo', 'h', 'favorite', 'enjoying', 'khl', 'body', 'bro', 'bored', 'sorry', 'funny']


The base line result is not good, I believe this is partly because of the limited data observations. Neutral tweets classification get better scores than positive ones, the negative ones have the worst scores.  
  
Among the 5 times of cross validations, each time we collected the most informative 100 words. However, there are only 213 unique words left. That means the tweets share many words that played a significant role in classification.

**Bigram features generating and selection**  
  
1. Create bigram features using top 1000 best bigrams measuered by chi square
2. Calculate baseline accuracy
3. 10-fold cross validation, for each time, record the most informative 100 bigrams
4. Gather the 10 100-bigram sets together as the useful features later

In [8]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(all_words_list)
bigram_features = finder.nbest(bigram_measures.chi_sq, 1000)

def bigram_document_features(document, bigram_features):
    document_bigrams = nltk.bigrams(document)
    features = {}
    for bigram in bigram_features:
        features['B_{}_{}'.format(bigram[0], bigram[1])] = (bigram in document_bigrams)    
    return features

In [9]:
bigram_featuresets = [(bigram_document_features(d,bigram_features), c) for (d,c) in train_documents]
thresh=int(len(bigram_featuresets)*0.1)
print(thresh)
train_set, test_set = bigram_featuresets[thresh:], bigram_featuresets[:thresh]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

820
0.47073170731707314


In [10]:
goldlist = []
predictedlist = []
for (features, label) in test_set:
    goldlist.append(label)
    predictedlist.append(classifier.classify(features))
cm = nltk.ConfusionMatrix(goldlist, predictedlist)
print(cm.pretty_format(sort_by_count=True, show_percents=False, truncate=9))

    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<386>  .   . |
pos | 318  <.>  . |
neg | 116   .  <.>|
----+-------------+
(row = reference; col = test)



It seems the bigrams won't give any valuable information

**POS tagged features**  

In [11]:
def POS_features(document,word_features):
    document_words = set(document)
    tagged_words = nltk.pos_tag(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    numNoun = 0
    numVerb = 0
    numAdj = 0
    numAdverb = 0
    for (word, tag) in tagged_words:
        if tag.startswith('N'): numNoun += 1
        if tag.startswith('V'): numVerb += 1
        if tag.startswith('J'): numAdj += 1
        if tag.startswith('R'): numAdverb += 1
    features['nouns'] = numNoun
    features['verbs'] = numVerb
    features['adjectives'] = numAdj
    features['adverbs'] = numAdverb
    return features

POS_featuresets = [(POS_features(d, word_features), c) for (d, c) in train_documents]

In [12]:
thresh=int(len(POS_featuresets)*0.1)
train_set, test_set = POS_featuresets[thresh:], POS_featuresets[:thresh]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.6365853658536585

In [13]:
POS_feature_final=cross_validation(5,POS_featuresets)
len(POS_feature_final)

0 0.6288848263254113
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<503>184  87 |
pos | 146<405> 73 |
neg |  54  65<124>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
neu 	      0.650      0.716      0.681
neg 	      0.510      0.437      0.471
pos 	      0.649      0.619      0.634
1 0.6550883607556368
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<569>177  78 |
pos | 135<401> 37 |
neg |  74  65<105>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
neu 	      0.691      0.731      0.710
neg 	      0.430      0.477      0.453
pos 	      0.700      0.624      0.660
2 0.6227909811090798
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<548>177  85 |
pos | 168<372> 41 |
neg |  73  75<102>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
neu 	      0.677      0.695      0.685
neg 	      0.408      0.4

213

In [4]:
from textblob import TextBlob
def senti_features(document,word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)

    # record the polarity and subjectivity of each word
    pol_list=[]
    sub_list=[]
    for word in document_words:
        pol_list.append(TextBlob(word).polarity)
        sub_list.append(TextBlob(word).subjectivity)
    features['polarity']=sum(pol_list)/len(pol_list)
    features['subjectivity']=sum(sub_list)/len(sub_list)
    return features

TBsenti_featuresets = [(senti_features(d, word_features), c) for (d, c) in train_documents]

In [5]:
thresh=int(len(TBsenti_featuresets)*0.1)
train_set, test_set = TBsenti_featuresets[thresh:], TBsenti_featuresets[:thresh]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.6707317073170732

In [8]:
cross_validation(5,TBsenti_featuresets)

0 0.6569165143205362
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<551>150  73 |
pos | 153<408> 63 |
neg |  69  55<119>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
neu 	      0.712      0.713      0.712
neg 	      0.490      0.467      0.478
pos 	      0.654      0.666      0.660
1 0.680073126142596
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<612>149  63 |
pos | 138<402> 33 |
neg |  85  57<102>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
neu 	      0.743      0.733      0.738
neg 	      0.418      0.515      0.462
pos 	      0.702      0.661      0.681
2 0.6532602071907374
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<586>151  73 |
pos | 155<386> 40 |
neg |  83  67<100>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
neu 	      0.723      0.711      0.717
neg 	      0.400      0.46

['ntains(poll)',
 'ntains(via)',
 'ntains(favorite)',
 'ntains(sick)',
 'ntains(killed)',
 'ntains(doesnt)',
 'ntains(net)',
 'ntains(hopefully)',
 'ntains(serious)',
 "ntains(hasn't)",
 'ntains(demitra)',
 'ntains(missing)',
 'ntains(score)',
 'ntains(thanks)',
 'ntains(emerson)',
 "ntains(can't)",
 'ntains(sad)',
 'ntains(twat)',
 'ntains(fucked)',
 'ntains(fit)',
 'ntains(love)',
 'ntains(rookie)',
 'ntains(absolutely)',
 'ntains(rudd)',
 'ntains(scotland)',
 'ntains(compared)',
 'ntains(language)',
 'ntains(knew)',
 'ntains(injury)',
 'ntains(bro)',
 'ntains(body)',
 'ntains(leg)',
 'ntains(fl)',
 'ntains(cancelled)',
 'ntains(crap)',
 'ntains(bellamy)',
 'ntains(canceled)',
 'ntains(fantastic)',
 'ntains(trayvon)',
 'larity',
 'ntains(warned)',
 'ntains(awesome)',
 'ntains(weekend)',
 'ntains(bitch)',
 'ntains(funny)',
 'ntains(anymore)',
 'ntains(cuz)',
 'ntains(leaf)',
 'ntains(potus)',
 'ntains(except)',
 'ntains(nooooooooo)',
 'ntains(wrong)',
 'ntains(yay)',
 'ntains(rafa)',


**Conclusion at this stage - Hang**  
  
For now, we did preprocessing on the data, tried classification with Naive Bayes model on the tokenized tweets using word-only, bigram, and POS-tagged features. We collected the most useful several hundred words for further classification tasks.  
  
We found:  
1. Bigram features are found not valuable for classification
2. POS-tagged features have slightly better result, compared with words-only ones.
3. Apply other sentiment analysis API on a token level increased the accuracy

**Some plans for next steps - Hang**
  
1. Check the scores of no-stop-words version of the document (already made as "train_document_nosw")
2. Try using some sentiment score APIs on the words so that we get sentiment values as features.
3. Try other models for this classification problem.