#### Sentiment Analysis And Data Split

#### Guan Yue Wang

This sentiment analysis classifier is built on the Sentiment140 corpus and NLTK, with code from Laurent Luce's blog.

References:

Sentiment140 corpus: http://help.sentiment140.com/for-students

Laurent Luce's blog: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/

NLTK: https://www.nltk.org/

### Limitations:
- We assume the training corpus (tweets) and the target corpus (Reddit comments) are in the same domain, so that words carry the same meaning in both
- Sarcasm and hyperbole are not easily identified by the classifier
- Reddit-specific stop words are not considered
- Punctuation is not fully taken into account in the sentiment analysis
- URLs, links, and other unnecessary content are not excluded from the analysis
- The program takes a long time to run because of the training on the Sentiment140 corpus
- Due to computing constraints, only 10,000 tweets from the Sentiment140 corpus are used for training
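
The URL and punctuation limitations could be reduced with a small cleaning step applied to both tweets and Reddit comments before feature extraction. The sketch below is one possible approach and is not part of the original notebook; the function name `clean_tokens` and the regular expression are assumptions chosen for illustration.

```python
import re
import string

# Hypothetical pattern: anything starting with http(s):// or www. counts as a URL.
URL_PATTERN = re.compile(r'(https?://\S+|www\.\S+)')
# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)


def clean_tokens(text, min_length=3):
    """Return lowercase tokens with URLs and ASCII punctuation removed."""
    without_urls = URL_PATTERN.sub(' ', text)
    tokens = []
    for token in without_urls.lower().split():
        token = token.translate(PUNCTUATION_TABLE)
        # Mirror the notebook's rule of keeping tokens of three or more characters.
        if len(token) >= min_length:
            tokens.append(token)
    return tokens


print(clean_tokens('Check this out: https://example.com/page, it is GREAT!!!'))
```

Stripping punctuation this way also removes apostrophes, so a word such as "don't" becomes "dont"; whether that is acceptable depends on the analysis.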

In [6]:
import nltk
from enum import Enum
import random

class Sentiment(Enum):
    negative = 0
    neutral = 2
    positive = 4


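def get_tweets(csvFile):
    """Read a Sentiment140 CSV file into a list of (words, label) pairs."""
    tweets = []
    with open(csvFile, 'r') as f:
        for line in f:
            # Naive split: a tweet containing commas is cut at its first comma.
            columns = line.split(',')
            # Column 0 is the polarity (0, 2 or 4); column 5 is the tweet text.
            sentiment = Sentiment(int(columns[0].replace('"', '')))
            tweet = columns[5]
            # Keep tokens of 3+ characters, lowercased and with double quotes removed.
            filteredTweet = [e.lower().replace('"', '') for e in tweet.split() if len(e) >= 3]
            tweets.append((filteredTweet, sentiment.name))

    return tweets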


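def get_words_in_tweets(tweets):
    """Flatten the word lists of all tweets into a single list."""
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words


def get_word_features(wordlist):
    """Return the distinct words in wordlist; these form the feature vocabulary."""
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()

    return word_features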


training_tweets = get_tweets('training.1600000.processed.noemoticon.csv')
# Shuffle, then keep a random sample of 10,000 tweets to limit training time.
random.shuffle(training_tweets)
training_tweets = training_tweets[0:10000]
test_tweets = get_tweets('testdata.manual.2009.06.14.csv')
# The vocabulary is built from the training sample only.
word_features = get_word_features(get_words_in_tweets(training_tweets))


def extract_features(document):
    """Map a list of words to boolean 'contains(word)' features over the vocabulary."""
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features


def sentiment_analysis_classifier():
    """Train a Naive Bayes classifier, print its test accuracy, then classify the Reddit comments."""
    training_set = nltk.classify.apply_features(extract_features, training_tweets)
    classifier = nltk.NaiveBayesClassifier.train(training_set)
    print('classifier accuracy:')
    test_set = nltk.classify.apply_features(extract_features, test_tweets)
    print('\t' + str(nltk.classify.accuracy(classifier, test_set)) + '\n')
    reddit_sentiment_classification(classifier)
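
`get_tweets` splits each line on commas, so a tweet whose text contains a comma is truncated. A possible alternative is Python's standard `csv` module, which honours double-quoted fields. This is a sketch only: `parse_tweets` is a hypothetical helper, the two sample rows are made up, and the labels are kept as integers rather than converted to `Sentiment` names.

```python
import csv
import io


def parse_tweets(lines, min_length=3):
    """Parse Sentiment140-style rows into (words, polarity) pairs.

    `lines` is any iterable of CSV lines, for example an open file.
    """
    tweets = []
    for row in csv.reader(lines):
        polarity = int(row[0])   # 0, 2 or 4
        text = row[5]            # full tweet text, embedded commas included
        words = [w.lower() for w in text.split() if len(w) >= min_length]
        tweets.append((words, polarity))
    return tweets


# Made-up rows in the six-column layout the notebook assumes.
sample = io.StringIO(
    '"4","1","Mon Apr 06 2009","NO_QUERY","user_a","Great day, really great"\n'
    '"0","2","Mon Apr 06 2009","NO_QUERY","user_b","Bad service, never again"\n'
)
print(parse_tweets(sample))
```

When reading from a real file, the `csv` documentation recommends opening it with `newline=''`.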




In [7]:
def reddit_sentiment_classification(classifier):
    """Split the Reddit comments into positive and negative files and print the counts."""
    print('Sentiment Analysis Classifier - Reddit Comments: ')
    positivecounts = 0
    negativecounts = 0

    # Context managers close all three files, so the output is flushed to disk.
    with open('finaldata_marketing.txt', encoding='utf8') as f, \
         open('finaldata_marketing_positive.txt', 'w', encoding='utf-8') as file1, \
         open('finaldata_marketing_negative.txt', 'w', encoding='utf-8') as file2:
        for comment in f:
            # Any label other than "positive" is written to the negative file.
            if classifier.classify(extract_features(comment.split())) == "positive":
                file1.write(comment.replace('\n', '') + '\n')
                positivecounts += 1
            else:
                file2.write(comment.replace('\n', '') + '\n')
                negativecounts += 1
    print("positive sentiment comment counts: ", positivecounts)
    print("negative sentiment comment counts: ", negativecounts)
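
The function above writes every comment that is not classified as positive to the negative file, including borderline cases. NLTK classifiers also provide `prob_classify`, which returns a probability distribution over labels with a `prob(label)` method. The sketch below shows how such probabilities could route low-confidence comments to a third bucket. It takes a plain dictionary so that it runs without NLTK; `route_comment`, the `uncertain` bucket, and the 0.7 threshold are assumptions, not part of the original analysis.

```python
def route_comment(label_probs, threshold=0.7):
    """Return the most probable label, or 'uncertain' below the threshold.

    `label_probs` maps each label to its probability, for example
    {'positive': 0.55, 'negative': 0.45}.
    """
    best_label = max(label_probs, key=label_probs.get)
    if label_probs[best_label] < threshold:
        return 'uncertain'
    return best_label


print(route_comment({'positive': 0.9, 'negative': 0.1}))
print(route_comment({'positive': 0.55, 'negative': 0.45}))
```

With a trained NLTK classifier, the dictionary could be built from `dist = classifier.prob_classify(features)` as `{label: dist.prob(label) for label in dist.samples()}`.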

In [8]:
sentiment_analysis_classifier()

classifier accuracy:
	0.5200803212851406

Sentiment Analysis Classifier - Reddit Comments: 
positive sentiment comment counts:  14872
negative sentiment comment counts:  4250
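
One possible contributor to the reported accuracy of about 0.52 is a label mismatch. The `Sentiment` enum allows a neutral label (2); if the test file contains neutral tweets while the training sample contains only negative and positive ones, the classifier can never label those correctly. That is an assumption about the data files worth checking, not something this notebook verifies. The sketch below, with made-up data and a hypothetical `drop_unseen_labels` helper, shows how the test set could be restricted to labels seen in training before measuring accuracy.

```python
def drop_unseen_labels(training_tweets, test_tweets):
    """Keep only the test examples whose label occurs in the training data."""
    seen_labels = {label for _, label in training_tweets}
    return [(words, label) for words, label in test_tweets if label in seen_labels]


# Made-up examples in the notebook's (words, label) format.
train = [(['love', 'this'], 'positive'), (['hate', 'this'], 'negative')]
test = [
    (['love', 'that'], 'positive'),
    (['went', 'home'], 'neutral'),
    (['hate', 'that'], 'negative'),
]

filtered = drop_unseen_labels(train, test)
print(len(test), '->', len(filtered))
```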
