**Load & Preprocessing Function Description**  
  
1. Read tweets
2. Extract text and sentiment label
3. Tokenize, lower-case, lemmatize the tokens, then exclude punctuations, pure numbers and web links.
4. Collect the processed tweeter text as a list of token list + sentiment label.

In [1]:
import sys
import nltk
import re
from nltk.tokenize import TweetTokenizer

# Filters used in the main function below
def filter_tokens(tokens):
    pattern1 = re.compile(r'^(https?://[^\s]+$|[^\w\s])')
    pattern2 = re.compile(r'\d+')
    # Exclude punctuations and web links
    filtered_ts = [token for token in tokens if not pattern1.match(token)]
    filtered_ts= [token for token in filtered_ts if not pattern2.match(token)]
    return filtered_ts
def sw_filter(tokens):
    # stopwords provided by 'TweetData'
    stopwords = [line.strip() for line in open('TweetData/stopwords_twitter.txt')]
    filtered_tokens = [token for token in tokens if token not in stopwords]
    return filtered_tokens

# Main function
def processtweets(path,nosw):

    # initialize NLTK built-in tweet tokenizer
    twtokenizer = TweetTokenizer()
    # read file
    f = open(path, 'r')

    # gather the original data
    tweetdata = []
    for line in f:
        line = line.strip()
        # each line has 4 items separated by tabs
        # ignore the tweet and user ids, and keep the sentiment and tweet text
        tweetdata.append(line.split('\t')[2:4])

    # create list of tweet documents as (list of words, label)
    # where the labels are condensed to just 3:  'pos', 'neg', 'neu'
    # Create a list for the data
    tweetdocs = []
    # add all the tweets except the ones whose text is Not Available
    neg_num=0
    pos_num=0
    neu_num=0
    for tweet in tweetdata:
        if (tweet[1] != 'Not Available'):
            # tokenize each tweet text
            tokens = twtokenizer.tokenize(tweet[1])
            

            # and used lemmatizer on them
            tokens_lower=[token.lower() for token in tokens]
            text=nltk.Text(tokens_lower)
            wnl = nltk.WordNetLemmatizer()
            tokens_lemma=[wnl.lemmatize(t) for t in text]
            # Then filter out web pages and pure numbers and punctuations
            tokens_filtered=filter_tokens(tokens_lemma)

            # if we choose to exclude stop words
            if nosw==1:
                tokens_filtered=sw_filter(tokens_filtered)
            
            if tweet[0] == '"negative"':
                label = 'neg'
                neg_num+=1
            elif tweet[0] == '"positive"':
                label = 'pos'
                pos_num+=1
            else:
                label='neu'
                neu_num+=1

            # add tokens, label to our document list
            tweetdocs.append((tokens_filtered, label))
    return [tweetdocs,pos_num,neg_num,neu_num]

In [2]:
#In a.dev.dist.tsv are just ids, pure numbers of tweets, not useful.
# b-dist can be seen as a train set while b.dev.dist is a separate test set we can use later.
import random
train_path='TweetData/corpus/downloaded-tweeti-b-dist.tsv'
test_path='TweetData/corpus/downloaded-tweeti-b.dev.dist.tsv'

# print number of tweets in each group
train_documents,pos_num,neg_num,neu_num=processtweets(train_path,0)
print([pos_num,neg_num,neu_num])
test_documents,pos_num,neg_num,neu_num=processtweets(test_path,0)
print([pos_num,neg_num,neu_num])

train_documents_nosw=processtweets(train_path,1)[0]
test_documents_nosw=processtweets(test_path,1)[0]

random.seed(42)
random.shuffle(train_documents)
random.seed(42)
random.shuffle(train_documents_nosw)

# Print the amount of tweets
print(len(train_documents))
print(len(test_documents))

# Print the tokenized tweet and label of the thrid document
print(train_documents[2][0])
print(train_documents[2][1])
print(train_documents_nosw[2][0])
print(train_documents_nosw[2][1])

[3059, 1207, 3942]
[491, 290, 632]
8208
1413
['last', 'day', 'in', 'jeddah', 'will', 'be', 'in', 'brunei', 'tomorrow', 'night', 'and', 'then', 'surabaya', 'the', 'following', 'night', 'and', 'then', 'bali', 'the', 'night', 'after', 'that', 'whee']
neu
['last', 'day', 'jeddah', 'will', 'brunei', 'tomorrow', 'night', 'surabaya', 'following', 'night', 'bali', 'night', 'whee']
neu


Neutral tweets are the most, followed by positive ones and negative ones are limited.

**Word features generating and selection**  
  
1. Create word features using top half of frequent words
2. Calculate baseline accuracy
3. 5-fold cross validation, for each time, record the most informative 100 words
4. Gather the 10 100-words sets together as the most useful features later

In [3]:
all_words_list = [word for (sent,cat) in train_documents for word in sent]
all_words = nltk.FreqDist(all_words_list)
print('There are',len(all_words),'unique words in total')
word_items = all_words.most_common(int(0.5*len(all_words)))
word_features = [word for (word, freq) in word_items]

def document_features(document, word_features):
	document_words = set(document)
	features = {}
	for word in word_features:
		features['V_{}'.format(word)] = (word in document_words)
	return features
featuresets = [(document_features(d,word_features), c) for (d,c) in train_documents]

There are 14069 unique words in total


In [4]:
# baseline accuracy
train_set, test_set = featuresets[800:], featuresets[:800]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.63


In the next cross validation function, after regular printing accuracy, confusion matrix, precision, recall and F-1 scores, the most informative 100 word features each round are collected as a python set. 
  
We will use those word features for further classification.

In [5]:
def eval_measures(gold, predicted):
    # confusion matrix
    cm = nltk.ConfusionMatrix(gold, predicted)
    print(cm.pretty_format(sort_by_count=True, show_percents=False, truncate=9))

    # get a list of labels
    labels = list(set(gold))
    # these lists have values for each label 
    recall_list = []
    precision_list = []
    F1_list = []
    for lab in labels:
        # for each label, compare gold and predicted lists and compute values
        TP = FP = FN = TN = 0
        for i, val in enumerate(gold):
            if val == lab and predicted[i] == lab:  TP += 1
            if val == lab and predicted[i] != lab:  FN += 1
            if val != lab and predicted[i] == lab:  FP += 1
            if val != lab and predicted[i] != lab:  TN += 1
        # use these to compute recall, precision, F1
        recall = TP / (TP + FP)
        precision = TP / (TP + FN)
        recall_list.append(recall)
        precision_list.append(precision)
        F1_list.append( 2 * (recall * precision) / (recall + precision))
    # the evaluation measures in a table with one row per label
    print('\tPrecision\tRecall\t\tF1')
    # print measures for each label
    for i, lab in enumerate(labels):
        print(lab, '\t', "{:10.3f}".format(precision_list[i]), \
          "{:10.3f}".format(recall_list[i]), "{:10.3f}".format(F1_list[i]))

def cross_validation(num_folds, featuresets):
    subset_size = int(len(featuresets)/num_folds)
    accuracy_list = []
    # iterate over the folds
    word_set=set()
    for i in range(num_folds):
        test_this_round = featuresets[i*subset_size:(i+1)*subset_size]
        train_this_round = featuresets[:i*subset_size]+featuresets[(i+1)*subset_size:]
        # train using train_this_round
        classifier_this_round = nltk.NaiveBayesClassifier.train(train_this_round)
        infeatures=classifier_this_round.most_informative_features(100)
        for item in infeatures:
            word=item[0][2:]
            word_set.add(word)
        # evaluate against test_this_round and save accuracy
        accuracy_this_round = nltk.classify.accuracy(classifier_this_round, test_this_round)
        print(i, accuracy_this_round)
        accuracy_list.append(accuracy_this_round)

        # predicted and test 
        goldlist = []
        predictedlist = []
        for (features, label) in test_this_round:
            goldlist.append(label)
            predictedlist.append(classifier_this_round.classify(features))

        # print confusion matrix and evaluating measures
        eval_measures(goldlist,predictedlist)

    
    # find mean accuracy over all rounds
    print('mean accuracy', sum(accuracy_list) / num_folds)
    return list(word_set)

In [6]:
word_feature_final=cross_validation(5,featuresets)
print(word_feature_final)

0 0.6197440585009141
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<492>198  84 |
pos | 155<405> 64 |
neg |  58  65<120>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
pos 	      0.649      0.606      0.627
neg 	      0.494      0.448      0.470
neu 	      0.636      0.698      0.665
1 0.6520414381474711
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<565>191  68 |
pos | 133<405> 35 |
neg |  76  68<100>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
pos 	      0.707      0.610      0.655
neg 	      0.410      0.493      0.447
neu 	      0.686      0.730      0.707
2 0.6185252894576477
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<543>185  82 |
pos | 160<380> 41 |
neg |  83  75 <92>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
pos 	      0.654      0.594      0.622
neg 	      0.368      0.4

In [7]:
print(len(word_feature_final))
print(word_feature_final[:10])
print(word_feature_final[10:20])
print(word_feature_final[20:30])

213
['worst', 'coz', 'ceo', 'why', 'potus', 'loss', 'bitch', 'dropping', 'suppose', 'cool']
['h', 'love', 'marijuana', 'luck', 'option', 'rookie', 'fl', 'suck', 'as', 'error']
['fucked', 'tired', 'delay', 'less', 'netanyahu', 'killed', 'fuckin', 'possibly', 'mistake', 'stevie']


The base line result is not good, I believe this is partly because of the limited data observations. Neutral tweets classification get better scores than positive ones, the negative ones have the worst scores.  
  
Among the 5 times of cross validations, each time we collected the most informative 100 words. However, there are only 213 unique words left. That means the tweets share many words that played a significant role in classification.

**Bigram features generating and selection**  
  
1. Create bigram features using top 1000 best bigrams measuered by chi square
2. Calculate baseline accuracy
3. 10-fold cross validation, for each time, record the most informative 100 bigrams
4. Gather the 10 100-bigram sets together as the useful features later

In [8]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(all_words_list)
bigram_features = finder.nbest(bigram_measures.chi_sq, 1000)

def bigram_document_features(document, bigram_features):
    document_bigrams = nltk.bigrams(document)
    features = {}
    for bigram in bigram_features:
        features['B_{}_{}'.format(bigram[0], bigram[1])] = (bigram in document_bigrams)    
    return features

In [9]:
bigram_featuresets = [(bigram_document_features(d,bigram_features), c) for (d,c) in train_documents]
thresh=int(len(bigram_featuresets)*0.1)
print(thresh)
train_set, test_set = bigram_featuresets[thresh:], bigram_featuresets[:thresh]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

820
0.47073170731707314


In [10]:
goldlist = []
predictedlist = []
for (features, label) in test_set:
    goldlist.append(label)
    predictedlist.append(classifier.classify(features))
cm = nltk.ConfusionMatrix(goldlist, predictedlist)
print(cm.pretty_format(sort_by_count=True, show_percents=False, truncate=9))

    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<386>  .   . |
pos | 318  <.>  . |
neg | 116   .  <.>|
----+-------------+
(row = reference; col = test)



It seems the bigrams won't give any valuable information

**POS tagged features**  

In [11]:
def POS_features(document,word_features):
    document_words = set(document)
    tagged_words = nltk.pos_tag(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    numNoun = 0
    numVerb = 0
    numAdj = 0
    numAdverb = 0
    for (word, tag) in tagged_words:
        if tag.startswith('N'): numNoun += 1
        if tag.startswith('V'): numVerb += 1
        if tag.startswith('J'): numAdj += 1
        if tag.startswith('R'): numAdverb += 1
    features['nouns'] = numNoun
    features['verbs'] = numVerb
    features['adjectives'] = numAdj
    features['adverbs'] = numAdverb
    return features

POS_featuresets = [(POS_features(d, word_features), c) for (d, c) in train_documents]

In [12]:
thresh=int(len(POS_featuresets)*0.1)
train_set, test_set = POS_featuresets[thresh:], POS_featuresets[:thresh]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.6365853658536585

In [13]:
POS_feature_final=cross_validation(5,POS_featuresets)
len(POS_feature_final)

0 0.6301035953686777
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<504>183  87 |
pos | 145<406> 73 |
neg |  54  65<124>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
pos 	      0.651      0.621      0.635
neg 	      0.510      0.437      0.471
neu 	      0.651      0.717      0.682
1 0.6544789762340036
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<566>179  79 |
pos | 133<403> 37 |
neg |  74  65<105>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
pos 	      0.703      0.623      0.661
neg 	      0.430      0.475      0.452
neu 	      0.687      0.732      0.709
2 0.6240097501523462
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<549>177  84 |
pos | 168<372> 41 |
neg |  72  75<103>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
pos 	      0.640      0.596      0.617
neg 	      0.412      0.4

213

### polarity features 

In [14]:
from textblob import TextBlob
def senti_features(document,word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)

    # record the polarity and subjectivity of each word
    pol_list=[]
    sub_list=[]
    for word in document_words:
        pol_list.append(TextBlob(word).polarity)
        sub_list.append(TextBlob(word).subjectivity)
    features['polarity']=sum(pol_list)/len(pol_list)
    features['subjectivity']=sum(sub_list)/len(sub_list)
    return features

TBsenti_featuresets = [(senti_features(d, word_features), c) for (d, c) in train_documents]

In [15]:
thresh=int(len(TBsenti_featuresets)*0.1)
train_set, test_set = TBsenti_featuresets[thresh:], TBsenti_featuresets[:thresh]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.6743902439024391

In [16]:
cross_validation(5,TBsenti_featuresets)

0 0.6532602071907374
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<549>150  75 |
pos | 154<404> 66 |
neg |  69  55<119>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
pos 	      0.647      0.663      0.655
neg 	      0.490      0.458      0.473
neu 	      0.709      0.711      0.710
1 0.6794637416209628
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<612>149  63 |
pos | 139<401> 33 |
neg |  83  59<102>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
pos 	      0.700      0.658      0.679
neg 	      0.418      0.515      0.462
neu 	      0.743      0.734      0.738
2 0.6544789762340036
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<588>150  72 |
pos | 155<386> 40 |
neg |  84  66<100>|
----+-------------+
(row = reference; col = test)

	Precision	Recall		F1
pos 	      0.664      0.641      0.653
neg 	      0.400      0.4

['ntains(dying)',
 'ntains(computer)',
 'ntains(parade)',
 "ntains(hasn't)",
 'ntains(hate)',
 'ntains(service)',
 'ntains(final)',
 'ntains(fuckin)',
 'ntains(smh)',
 'ntains(dont)',
 'ntains(report)',
 'ntains(demitra)',
 'ntains(damn)',
 'ntains(except)',
 'ntains(khl)',
 'ntains(language)',
 'ntains(changed)',
 'ntains(ceo)',
 'ntains(crap)',
 'ntains(juice)',
 'ntains(hell)',
 'ntains(enjoy)',
 'ntains(kinda)',
 "ntains(ain't)",
 'ntains(bro)',
 'ntains(international)',
 'ntains(trial)',
 'ntains(andrew)',
 'ntains(policy)',
 'ntains(penalty)',
 'ntains(luck)',
 'ntains(caltrain)',
 'ntains(awesome)',
 'ntains(fl)',
 'ntains(believe)',
 'ntains(absolutely)',
 'ntains(factor)',
 'ntains(less)',
 'ntains(killed)',
 'ntains(compared)',
 'ntains(pavol)',
 "ntains(wouldn't)",
 'ntains(seem)',
 'ntains(best)',
 'ntains(enjoying)',
 'ntains(poll)',
 'ntains(dwts)',
 'ntains(debate)',
 "ntains(couldn't)",
 'ntains(deal)',
 'ntains(love)',
 'ntains(alone)',
 'ntains(sick)',
 'ntains(score)

**Conclusion at this stage - Hang**  
  
For now, we did preprocessing on the data, tried classification with Naive Bayes model on the tokenized tweets using word-only, bigram, and POS-tagged features. We collected the most useful several hundred words for further classification tasks.  
  
We found:  
1. Bigram features are found not valuable for classification
2. POS-tagged features have slightly better result, compared with words-only ones.
3. Apply other sentiment analysis API on a token level increased the accuracy

**Some plans for Future next steps - Hang**
  
1. Check the scores of no-stop-words version of the document (already made as "train_document_nosw")
2. Try using some sentiment score APIs on the words so that we get sentiment values as features.
3. Try other models for this classification problem.
4. Creating more features

# As per advice from my teamate Hang, Im working on trying more algorithms and Creating new features, and trying them out.

## Additional changes - Vaishnavi Meka Work

#### - I'm trying experiments which includes trying bi-grams, tri-grams, with pos features and at last combining all features with Different models

In [17]:
# feature generation code
def bigram_pos_features(document):
    """Extract bigram + POS features."""
    document_bigrams = list(nltk.bigrams(document))
    tagged_bigrams = nltk.pos_tag([word for bigram in document_bigrams for word in bigram])
    
    features = {}
    for i, (word, tag) in enumerate(tagged_bigrams):
        features[f'B_{word}_POS_{tag}'] = True
    return features

def trigram_pos_features(document):
    """Extract trigram + POS features."""
    document_trigrams = list(nltk.trigrams(document))
    tagged_trigrams = nltk.pos_tag([word for trigram in document_trigrams for word in trigram])
    
    features = {}
    for i, (word, tag) in enumerate(tagged_trigrams):
        features[f'T_{word}_POS_{tag}'] = True
    return features

# Combine all features
def combined_features(document, word_features, bigram_features, trigram_features):
    """Combine word, bigram, trigram, and POS features."""
    document_words = set(document)
    document_bigrams = list(nltk.bigrams(document))
    document_trigrams = list(nltk.trigrams(document))
    tagged_words = nltk.pos_tag(document)

    features = {}
    # Word-level features
    for word in word_features:
        features[f'W_{word}'] = (word in document_words)
    
    # Bigram-level features
    for bigram in bigram_features:
        features[f'B_{bigram[0]}_{bigram[1]}'] = (bigram in document_bigrams)
    
    # Trigram-level features
    for trigram in trigram_features:
        features[f'T_{trigram[0]}_{trigram[1]}_{trigram[2]}'] = (trigram in document_trigrams)
    
    # POS-level features
    num_nouns = sum(1 for word, tag in tagged_words if tag.startswith('N'))
    num_verbs = sum(1 for word, tag in tagged_words if tag.startswith('V'))
    num_adjs = sum(1 for word, tag in tagged_words if tag.startswith('J'))
    num_advs = sum(1 for word, tag in tagged_words if tag.startswith('R'))

    features['num_nouns'] = num_nouns
    features['num_verbs'] = num_verbs
    features['num_adjectives'] = num_adjs
    features['num_adverbs'] = num_advs
    
    return features

# Generate combined feature sets
bigram_features = finder.nbest(bigram_measures.chi_sq, 1000)
trigram_features = list(nltk.trigrams(all_words_list))[:1000]  # Top 1000 trigrams

combined_featuresets = [
    (combined_features(d, word_features, bigram_features, trigram_features), c)
    for (d, c) in train_documents
]



In [None]:
# Defining NB Classifier from NLTK Class

# Function to evaluate a feature set
def evaluate_features(featuresets, description):
    """Train and evaluate a classifier on the given feature sets."""
    thresh = int(len(featuresets) * 0.1)  # Use 10% for testing
    train_set, test_set = featuresets[thresh:], featuresets[:thresh]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.accuracy(classifier, test_set)
    print(f"Accuracy with {description}: {accuracy:.4f}")

    # Confusion Matrix
    goldlist = []
    predictedlist = []
    for (features, label) in test_set:
        goldlist.append(label)
        predictedlist.append(classifier.classify(features))
    cm = nltk.ConfusionMatrix(goldlist, predictedlist)
    print(f"Confusion Matrix for {description}:\n{cm.pretty_format(sort_by_count=True, show_percents=False, truncate=9)}")
    return accuracy






In [None]:
#Defining SVC Classifier from Sklearn class

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def evaluate_features_svm(featuresets, description):
    """Train and evaluate an SVM classifier on the given feature sets."""
    thresh = int(len(featuresets) * 0.1)  # Use 10% for testing
    train_set, test_set = featuresets[thresh:], featuresets[:thresh]

    # Separate features and labels for training and testing
    train_X = [features for features, label in train_set]
    train_y = [label for features, label in train_set]
    test_X = [features for features, label in test_set]
    test_y = [label for features, label in test_set]

    # Convert feature dictionaries into feature matrices
    from sklearn.feature_extraction import DictVectorizer
    vectorizer = DictVectorizer(sparse=True)
    train_X = vectorizer.fit_transform(train_X)
    test_X = vectorizer.transform(test_X)

    # Train SVM classifier
    svm_classifier = SVC(kernel='linear', random_state=42)
    svm_classifier.fit(train_X, train_y)

    # Predict on the test set
    predicted_y = svm_classifier.predict(test_X)

    # Evaluate accuracy
    accuracy = accuracy_score(test_y, predicted_y)
    print(f"Accuracy with {description}: {accuracy:.4f}")

    # Confusion Matrix
    cm = confusion_matrix(test_y, predicted_y)
    print(f"Confusion Matrix for {description}:\n{cm}")

    # Classification Report
    print("Classification Report:")
    print(classification_report(test_y, predicted_y))

    return accuracy


##### 1. performing evaluation using NB model from NLTK Class on Bigram features + Pos Features #####

In [43]:
# Bigram + POS Features Evluating using NB Classifier
bigram_pos_featuresets = [
    (bigram_pos_features(d), c) for (d, c) in train_documents
]
evaluate_features(bigram_pos_featuresets, "Bigram + POS features")



Accuracy with Bigram + POS features: 0.5402
Confusion Matrix for Bigram + POS features:
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<160> 95 131 |
pos |  34<190> 94 |
neg |   8  15 <93>|
----+-------------+
(row = reference; col = test)



0.5402439024390244

##### 2. performing evaluation using NB model from NLTK Class on Trigram features + Pos Features #####

In [44]:
# Trigram + POS Features Evluating using NB Classifier
trigram_pos_featuresets = [
    (trigram_pos_features(d), c) for (d, c) in train_documents
]
evaluate_features(trigram_pos_featuresets, "Trigram + POS features")


Accuracy with Trigram + POS features: 0.5305
Confusion Matrix for Trigram + POS features:
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<148> 84 154 |
pos |  35<190> 93 |
neg |   8  11 <97>|
----+-------------+
(row = reference; col = test)



0.5304878048780488

##### 3. performing evaluation using NB model from NLTK Class on  Bigram features + Trigram + word features #####

In [45]:
# Combined Features (for comparison) using NB Classifier
combined_featuresets = [
    (combined_features(d, word_features, bigram_features, trigram_features), c)
    for (d, c) in train_documents
]
evaluate_features(combined_featuresets, "Combined features")

Accuracy with Combined features: 0.6024
Confusion Matrix for Combined features:
    |   n   p   n |
    |   e   o   e |
    |   u   s   g |
----+-------------+
neu |<221> 88  77 |
pos |  64<205> 49 |
neg |  18  30 <68>|
----+-------------+
(row = reference; col = test)



0.6024390243902439

##### 4. Performing evaluation using SVM model from Sklearn Class on Bigram features + Pos Features #####

In [None]:
# Now trying SVM to check if we can improve any accuracy? 

In [46]:
accuracy = evaluate_features_svm(bigram_pos_featuresets, "Bigram + POS features")
accuracy

Accuracy with Bigram + POS features: 0.6256
Confusion Matrix for Bigram + POS features:
[[ 52  49  15]
 [ 32 274  80]
 [ 37  94 187]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.43      0.45      0.44       116
         neu       0.66      0.71      0.68       386
         pos       0.66      0.59      0.62       318

    accuracy                           0.63       820
   macro avg       0.58      0.58      0.58       820
weighted avg       0.63      0.63      0.63       820



0.625609756097561

##### 5. Performing evaluation using SVM model from Sklearn Class on Trigram features + Pos Features #####

In [47]:
accuracy = evaluate_features_svm(trigram_pos_featuresets, "Trigram + POS features")
accuracy

Accuracy with Trigram + POS features: 0.6329
Confusion Matrix for Trigram + POS features:
[[ 57  41  18]
 [ 36 272  78]
 [ 28 100 190]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.47      0.49      0.48       116
         neu       0.66      0.70      0.68       386
         pos       0.66      0.60      0.63       318

    accuracy                           0.63       820
   macro avg       0.60      0.60      0.60       820
weighted avg       0.63      0.63      0.63       820



0.6329268292682927

##### 6. Performing evaluation using SVM model from Sklearn Class on Bigram features + word Features + trigram features #####

In [48]:
accuracy = evaluate_features_svm(combined_featuresets, "Combined features")
accuracy

Accuracy with Combined features: 0.6366
Confusion Matrix for Combined features:
[[ 53  44  19]
 [ 41 267  78]
 [ 35  81 202]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.41      0.46      0.43       116
         neu       0.68      0.69      0.69       386
         pos       0.68      0.64      0.65       318

    accuracy                           0.64       820
   macro avg       0.59      0.59      0.59       820
weighted avg       0.64      0.64      0.64       820



0.6365853658536585

### Now trying Random Forest to check if we can improve any accuracy? 

In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def evaluate_features_rf(featuresets, description, n_estimators=100):
    """Train and evaluate a Random Forest classifier on the given feature sets."""
    thresh = int(len(featuresets) * 0.1)  # Use 10% for testing
    train_set, test_set = featuresets[thresh:], featuresets[:thresh]

    # Separate features and labels for training and testing
    train_X = [features for features, label in train_set]
    train_y = [label for features, label in train_set]
    test_X = [features for features, label in test_set]
    test_y = [label for features, label in test_set]

    # Convert feature dictionaries into feature matrices
    from sklearn.feature_extraction import DictVectorizer
    vectorizer = DictVectorizer(sparse=True)
    train_X = vectorizer.fit_transform(train_X)
    test_X = vectorizer.transform(test_X)

    # Train Random Forest classifier
    rf_classifier = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    rf_classifier.fit(train_X, train_y)

    # Predict on the test set
    predicted_y = rf_classifier.predict(test_X)

    # Evaluate accuracy
    accuracy = accuracy_score(test_y, predicted_y)
    print(f"Accuracy with {description}: {accuracy:.4f}")

    # Confusion Matrix
    cm = confusion_matrix(test_y, predicted_y)
    print(f"Confusion Matrix for {description}:\n{cm}")

    # Classification Report
    print("Classification Report:")
    print(classification_report(test_y, predicted_y))

    return accuracy


##### 7. Performing evaluation using Random Forest model from Sklearn Class on Bigram features + Pos Features #####

In [50]:
accuracy = evaluate_features_rf(bigram_pos_featuresets, "Bigram + POS features")
accuracy

Accuracy with Bigram + POS features: 0.6317
Confusion Matrix for Bigram + POS features:
[[  7  87  22]
 [  4 348  34]
 [  2 153 163]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.54      0.06      0.11       116
         neu       0.59      0.90      0.71       386
         pos       0.74      0.51      0.61       318

    accuracy                           0.63       820
   macro avg       0.62      0.49      0.48       820
weighted avg       0.64      0.63      0.59       820



0.6317073170731707

##### 8. Performing evaluation using Random Forest model from Sklearn Class on Trigram features + Pos Features #####

In [51]:
accuracy = evaluate_features_rf(trigram_pos_featuresets, "Trigram + POS features")
accuracy

Accuracy with Trigram + POS features: 0.6402
Confusion Matrix for Trigram + POS features:
[[  7  90  19]
 [  3 350  33]
 [  3 147 168]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.54      0.06      0.11       116
         neu       0.60      0.91      0.72       386
         pos       0.76      0.53      0.62       318

    accuracy                           0.64       820
   macro avg       0.63      0.50      0.48       820
weighted avg       0.65      0.64      0.60       820



0.6402439024390244

##### 9. Performing evaluation using Random Forest model from Sklearn Class on Bigram features + word + trigram Features #####

In [52]:
accuracy = evaluate_features_rf(combined_featuresets, "Combined features")
accuracy

Accuracy with Combined features: 0.6366
Confusion Matrix for Combined features:
[[ 10  87  19]
 [  3 348  35]
 [  2 152 164]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.67      0.09      0.15       116
         neu       0.59      0.90      0.72       386
         pos       0.75      0.52      0.61       318

    accuracy                           0.64       820
   macro avg       0.67      0.50      0.49       820
weighted avg       0.67      0.64      0.60       820



0.6365853658536585

## Can we do better ? trying ensemble models

In [18]:
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def evaluate_ensemble(featuresets, description):
    """Train and evaluate an ensemble model on the given feature sets."""
    # Split into train and test sets
    thresh = int(len(featuresets) * 0.1)  # Use 10% for testing
    train_set, test_set = featuresets[thresh:], featuresets[:thresh]

    # Separate features and labels
    train_X = [features for features, label in train_set]
    train_y = [label for features, label in train_set]
    test_X = [features for features, label in test_set]
    test_y = [label for features, label in test_set]

    # Convert feature dictionaries to feature matrices
    vectorizer = DictVectorizer(sparse=True)
    train_X = vectorizer.fit_transform(train_X)
    test_X = vectorizer.transform(test_X)

    # Initialize individual classifiers
    svm_clf = SVC(kernel='linear', probability=True, random_state=42)
    rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
    nb_clf = MultinomialNB()

    # Create ensemble model
    ensemble_clf = VotingClassifier(
        estimators=[('SVM', svm_clf), ('RF', rf_clf), ('NB', nb_clf)],
        voting='soft'  # Use 'hard' for majority vote
    )

    # Train ensemble model
    ensemble_clf.fit(train_X, train_y)

    # Predict on the test set
    predicted_y = ensemble_clf.predict(test_X)

    # Evaluate accuracy
    accuracy = accuracy_score(test_y, predicted_y)
    print(f"Accuracy with {description} (Ensemble): {accuracy:.4f}")

    # Confusion Matrix
    cm = confusion_matrix(test_y, predicted_y)
    print(f"Confusion Matrix for {description} (Ensemble):\n{cm}")

    # Classification Report
    print("Classification Report:")
    print(classification_report(test_y, predicted_y))

    return accuracy






##### 8. Performing evaluation using voting Ensemble model from sklearn.ensemble package (using SVC, RF, NB) from Sklearn Class on word level Features #####

In [55]:
# Evaluate Word-Level Features with Ensemble
word_featuresets = [(document_features(d, word_features), c) for (d, c) in train_documents]
accuracy_ensemble_word = evaluate_ensemble(word_featuresets, "Word-Level Features")

Accuracy with Word-Level Features (Ensemble): 0.6720
Confusion Matrix for Word-Level Features (Ensemble):
[[ 36  52  28]
 [ 15 302  69]
 [ 13  92 213]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.56      0.31      0.40       116
         neu       0.68      0.78      0.73       386
         pos       0.69      0.67      0.68       318

    accuracy                           0.67       820
   macro avg       0.64      0.59      0.60       820
weighted avg       0.66      0.67      0.66       820



In [56]:
accuracy_ensemble_word

0.6719512195121952

##### 9. Performing evaluation using voting Ensemble model from sklearn.ensemble package (using SVC, RF, NB) from Sklearn Class on bigram level Features #####

In [57]:
# Evaluate Bigram Features with Ensemble
bigram_featuresets = [(bigram_document_features(d, bigram_features), c) for (d, c) in train_documents]
accuracy_ensemble_bigram = evaluate_ensemble(bigram_featuresets, "Bigram Features")

Accuracy with Bigram Features (Ensemble): 0.4707
Confusion Matrix for Bigram Features (Ensemble):
[[  0 116   0]
 [  0 386   0]
 [  0 318   0]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.00      0.00      0.00       116
         neu       0.47      1.00      0.64       386
         pos       0.00      0.00      0.00       318

    accuracy                           0.47       820
   macro avg       0.16      0.33      0.21       820
weighted avg       0.22      0.47      0.30       820



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [58]:
accuracy_ensemble_bigram

0.47073170731707314

##### 10. Performing evaluation using voting Ensemble model from sklearn.ensemble package (using SVC, RF, NB) from Sklearn Class on word + bigram + trigram Features #####

In [59]:
# Evaluate Combined Features with Ensemble
combined_featuresets = [
    (combined_features(d, word_features, bigram_features, trigram_features), c)
    for (d, c) in train_documents
]
accuracy_ensemble_combined = evaluate_ensemble(combined_featuresets, "Combined Features")


Accuracy with Combined Features (Ensemble): 0.6732
Confusion Matrix for Combined Features (Ensemble):
[[ 46  39  31]
 [ 19 296  71]
 [ 18  90 210]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.55      0.40      0.46       116
         neu       0.70      0.77      0.73       386
         pos       0.67      0.66      0.67       318

    accuracy                           0.67       820
   macro avg       0.64      0.61      0.62       820
weighted avg       0.67      0.67      0.67       820



In [60]:
accuracy_ensemble_combined

0.6731707317073171

##### 11. Performing evaluation using voting Ensemble model from sklearn.ensemble package (using SVC, RF, NB) from Sklearn Class on bigram plus pos features #####

In [61]:
accuracy = evaluate_ensemble(bigram_pos_featuresets, "Bigram + POS features")
accuracy

Accuracy with Bigram + POS features (Ensemble): 0.6573
Confusion Matrix for Bigram + POS features (Ensemble):
[[ 14  73  29]
 [  3 312  71]
 [  4 101 213]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.67      0.12      0.20       116
         neu       0.64      0.81      0.72       386
         pos       0.68      0.67      0.68       318

    accuracy                           0.66       820
   macro avg       0.66      0.53      0.53       820
weighted avg       0.66      0.66      0.63       820



0.6573170731707317

##### 12. Performing evaluation using voting Ensemble model from sklearn.ensemble package (using SVC, RF, NB) from Sklearn Class on trigram + pos Features #####

In [62]:
accuracy = evaluate_ensemble(trigram_pos_featuresets, "Trigram + POS features")
accuracy

Accuracy with Trigram + POS features (Ensemble): 0.6610
Confusion Matrix for Trigram + POS features (Ensemble):
[[ 12  68  36]
 [  4 319  63]
 [  4 103 211]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.60      0.10      0.18       116
         neu       0.65      0.83      0.73       386
         pos       0.68      0.66      0.67       318

    accuracy                           0.66       820
   macro avg       0.64      0.53      0.53       820
weighted avg       0.66      0.66      0.63       820



0.6609756097560976

##### 13. Performing evaluation using voting Ensemble model from sklearn.ensemble package (using SVC, RF, NB) from Sklearn Class on word + bigram + trigram Features #####

In [None]:
accuracy = evaluate_ensemble(combined_featuresets, "Combined features")
accuracy

Accuracy with Combined features (Ensemble): 0.6732
Confusion Matrix for Combined features (Ensemble):
[[ 46  39  31]
 [ 19 296  71]
 [ 18  90 210]]
Classification Report:
              precision    recall  f1-score   support

         neg       0.55      0.40      0.46       116
         neu       0.70      0.77      0.73       386
         pos       0.67      0.66      0.67       318

    accuracy                           0.67       820
   macro avg       0.64      0.61      0.62       820
weighted avg       0.67      0.67      0.67       820



0.6731707317073171

## Conclusions on Further Analysis - Vaishnavi Meka ##

1. I tried different models (NB, SVC, RF, DT, SVC+RF+NB) and tried to mix and match the models with different featuresets like word, bigrams, trigrams, Bigrams with POS, tri-grams with POS, Word+bigrams+trigrams featuresets.

2. Out of all the features & Models used, I find that Ensemble model gave the higest accuracy like 67%. This has the featuresets: Word, bigrams, trigrams combined. 

3. Out of all the features & Models used, I find that Ensemble model gave the higest accuracy like 67%. This has the featuresets: Word Level.

4. Over all Ensemble models did well with out any hyper parameter optimization.

## Future Scope of work - Vaishnavi Meka

1. Futue scope of work can be using DL Neural network based models
2. Using different hyper parameter optimization techniques for tunning the hyper parameters
3. Using more complex feature engineering techniques like TF-IDF, Word embeddings, GloVe, FastText etc.