### Assignment 10 - Sentiment Analysis

#### Guan Yue Wang

This Sentiment analysis classifier is built based on Sentiment 140 corpus, NLTK, and code from Laurent Luce's Blog

Reference:

Sentiment 140 corpus: http://help.sentiment140.com/for-students

Laurent Luce's Blog: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/

NLTK:https://www.nltk.org/

### Limitations:
- We assume training corpus and actual corpus are in the same domain and thus has the same meaning of words
- Sarcasm and hyperbole cannot be easily identified by the sentiment analysis classifier
- Reddit specific stop words are not considered
- Punctuation is not fully considered into sentiment analysis
- URLs, links, unnecessary contents cannot be excluded from the analysis
- The program takes long time to run due to the training on sentiment 140 corpus
- Due to computing constraints, only 8000 of the sentiment 140 corpus is used for training

In [1]:
import nltk
from enum import Enum
import random

class Sentiment(Enum):
    negative = 0
    neutral = 2
    positive = 4


def get_tweets(csvFile):
    tweets = []
    with open(csvFile, 'r') as f:
        for line in f:
            columns = line.split(',')
            sentiment = Sentiment(int(columns[0].replace('"','')))
            tweet = columns[5]
            filteredTweet = [e.lower().replace('"','') for e in tweet.split() if len(e) >= 3]
            tweets.append((filteredTweet, sentiment.name))

    return tweets


def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
      all_words.extend(words)
    return all_words


def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()

    return word_features


training_tweets = get_tweets('training.1600000.processed.noemoticon.csv')
random.shuffle(training_tweets)
training_tweets = training_tweets[0:8000]
test_tweets = get_tweets('testdata.manual.2009.06.14.csv')
word_features = get_word_features(get_words_in_tweets(training_tweets))


def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features


def sentiment_analysis_classifier():
    training_set = nltk.classify.apply_features(extract_features, training_tweets)
    classifier = nltk.NaiveBayesClassifier.train(training_set)
    print ('classifier accuracy:')
    test_set = nltk.classify.apply_features(extract_features, test_tweets)
    print ('\t' + str(nltk.classify.accuracy(classifier, test_set)) + '\n')
    reddit_sentiment_classification(classifier)


def reddit_sentiment_classification(classifier):
    print ('Sentiment Analysis Classifier - Reddit Comments: ')
    with open('reddit_comments-datascience.txt', encoding="utf8") as f:
        comments = f.readlines()
        for comment in comments:
            if comment.startswith('Body'):
                print ('\tcomment: ' + comment.replace('\n',''))
                print ('\tsentiment: ' +  classifier.classify(extract_features(comment.split())))


In [2]:
sentiment_analysis_classifier()

classifier accuracy:
	0.5140562248995983

Sentiment Analysis Classifier - Reddit Comments: 
	comment: Body: As a mere undergrad that took a stochastic processes course - this seems like a question that could be answered with markov chains??
	sentiment: negative
	comment: Body: I recently curated some [SQL resources for the wiki](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions#wiki_how_do_i_learn_sql.3F). You may find it useful. 
	sentiment: negative
	comment: Body: Quit it, willya?
	sentiment: positive
	comment: Body: StackOverflow, Math Stackexchange, Cross Validated
	sentiment: positive
	comment: Body: Downvote to hell!
	sentiment: positive
	comment: Body: Hello all. I just recently graduated with a Master's degree in Physics and I just moved to a new city to become a data scientist. I'm working through a bootcamp course at the moment. I don't really have any questions as I don't really know what to ask, but I figured I'd throw my hat in the ring and say hello. I

	sentiment: negative
	comment: Body: This is really a basic question. You should at least google around for directories or databases of business listings. If all else fails just scrape yelp. 
	sentiment: negative
	comment: Body: Can I ask why?
	sentiment: positive
	comment: Body: This data science master degree has a lot of math and statistics in it. It’s actually a lot of math and stats, come CS, like machine learning and coding, and, well, it looks great. You think it would be better pursing a regular degree and then ending up anyway in data science?
	sentiment: negative
	comment: Body: You think the data science craze will eventually die?
	sentiment: negative
	comment: Body: Do you think I could get hired with a Bachelor in physics? I really doubt it. Plus I’m from Italy. I don’t know what’s the situation like for physicists abroad, but here, with a bachelor only you can’t go anywhere.
	sentiment: negative
	comment: Body: Main question shows removed. Apparently deleted via mods. 
	s

	sentiment: negative
	comment: Body: Depends on the company and industry. As a head of analytics / ds team in tech I care about concepts not degrees 
	sentiment: negative
	comment: Body: Sounds like you found an important niche and that you are excited. All in all, sounds awesome! Congrats!
	sentiment: positive
	comment: Body: Money.
	sentiment: positive
	comment: Body: For computational social scientists: 
	sentiment: positive
	comment: Body: Short answer is no, I imagine with a stats or cs degree supplemented with coursework from the other discipline in conjunction with some networking, hard work,  and determination will get you there. It definitely will help to have an advanced degree because you DS requires a decent amount of breadth and depth. Plus recruiters might not even consider people with less than a masters. Here’s some quick DS stats:
	sentiment: negative
	comment: Body: All depends on the field you want to go in to was well as how you currently plan to learn new skills an

	sentiment: positive
	comment: Body: oh also How Not to Be Wrong, not data science specific but helps with general statistical intuition. 
	sentiment: positive
	comment: Body: Signal and the Noise - Nate Silver
	sentiment: positive
	comment: Body: It *might* have been due to culture concerns about interpretability - I have a colleague who used to work in DS for a company specialising in loans and they prioritised Decision Trees because they were so easy to interpret and ensure that they wasn't any bias/discrimination in their recommendations. 
	sentiment: negative
	comment: Body: Introduction to Statistical Learning by Hastie is a must read
	sentiment: positive
	comment: Body: Second but instead think you should look at stats and/or computer science. But put in a lot of dilligence looking for a CS program that is what you want.
	sentiment: negative
	comment: Body: Awesome suggestion, thank you! And good noting on seasonality, the industry is finance and accounting so you can imagine wh

	sentiment: positive
	comment: Body: I guess the way I asked this question was kind of weird.
	sentiment: negative
	comment: Body: Fellow data scientist with an MS in stats here! :) Great job, congratulations!! 
	sentiment: positive
	comment: Body: It's my data-science site.
	sentiment: positive
	comment: Body: Take a look at https://www.amazon.com/Data-Mining-Masses-Third-Implementations/dp/1727102479. It covers all the basics with examples in both RapidMiner and R
	sentiment: negative
	comment: Body: What is Black Swans?
	sentiment: negative
	comment: Body: Current Population Survey tobacco supplement via IPUMS (U. Mich.). They let you generate free extracts with tons of individual-level data. It's a complicated dataset, but super good. That's the most common dataset for tobacco research, I believe.  
	sentiment: positive
	comment: Body: Are you sure you are receiving a JavaScript dictionary and not a JSON object? You can use the “jsonlite” package in R to automatically parse the con