In this notebook, I wrote code to download and save tweets about Pokemon Go. I decided to do this after seeing the negative reactions on social media to the Pokemon Go Fest in Chicago on July 22nd, which was plagued by long lines, server overload, and a glitchy user experience. I ran these scripts every day (except while on vacation) for about a month to see whether I could follow the sentiment of Pokemon Go users over time.

The steps in this project were to:
1. Search for tweets about Pokemon Go and save them for later analysis. I did two searches. One looked at a few different search terms individually, and the other looked for any of those search terms specifically in Chicago vs. everywhere.
2. Train a sentiment classifier (Naive Bayes classifier). I used the built-in NLTK twitter corpus as a gold standard at first, but didn't feel like it was doing a great job. In particular, the NLTK corpus used emoticons to determine the true sentiment. As a result, emoticons became absurdly informative features, but I think emoticons are now used less than emoji, so I didn't feel like the corpus was well matched to present-day tweets.
3. Prepare the saved tweets for analysis and apply the classifier to see trends over time.
4. Train a second sentiment classifier on a massive twitter corpus that I found online.
5. Compare the two classifiers.
6. Look at the tweets from the best and worst days.

In [None]:
import pickle
import os
import nltk
import twitter
from datetime import datetime
import pandas as pd
import string
import json
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
%cd "/Users/hjohnsen/Dropbox (Personal)/Data Science/Week-8-NLP-Databases/pickles"

In [None]:
pwd = !pwd
print(pwd)
if not pwd[0] == "/Users/hjohnsen/Dropbox (Personal)/Data Science/Week-8-NLP-Databases":
    %cd "/Users/hjohnsen/Dropbox (Personal)/Data Science/Week-8-NLP-Databases/"    

if not os.path.exists('secret_twitter_credentials.pkl'):
    Twitter={}
    Twitter['Consumer Key'] = ''
    Twitter['Consumer Secret'] = ''
    Twitter['Access Token'] = ''
    Twitter['Access Token Secret'] = ''
    with open('secret_twitter_credentials.pkl','wb') as f:
        pickle.dump(Twitter, f)
else:
    Twitter=pickle.load(open('secret_twitter_credentials.pkl','rb'))
    
auth = twitter.oauth.OAuth(Twitter['Access Token'],
                           Twitter['Access Token Secret'],
                           Twitter['Consumer Key'],
                           Twitter['Consumer Secret'])

twitter_api = twitter.Twitter(auth=auth)

%cd pickles/

I did not end up using these trends for analysis.

In [None]:
worldID = 1
usID = 23424977
chicagoID = 2379574
grantparkID = 12784255 #using zipcode 60601 for lookup

trends = {}
trends["world_trends"] = twitter_api.trends.place(_id=worldID)
trends["us_trends"] = twitter_api.trends.place(_id=usID)
trends["chicago_trends"] = twitter_api.trends.place(_id=chicagoID)
#not working
#trends["grantpark_trends"] = twitter_api.trends.place(_id=grantparkID)  

#since trends might be time sensitive, I want to save them
currTime = str(datetime.now())
with open(currTime + "trends.pkl",'wb') as f:
    pickle.dump(trends, f)


In [None]:

#alltrends = {}
#for trend in trends.keys():
#    trendlist = trends[trend][0]["trends"]
#    for item in trendlist:
#        #print(item["name"] + ", " + str(item["tweet_volume"]))
#        currentValue = alltrends.get(item["name"], 0)
#        if item["tweet_volume"] is None or item["tweet_volume"]==0:
#            pass
#        else: alltrends[item["name"]] = item["tweet_volume"] + currentValue
##for item in alltrends.keys():
##    print (item + ", " ,alltrends[item])
#df = pd.Series(alltrends)
##print(df)
#df.sort_values(ascending = False)


In [None]:
#cTrends = {}
#for item in trends["chicago_trends"][0]["trends"]:
#    cTrends[item["name"]] = item["tweet_volume"]
#dfc = pd.Series(cTrends)
#dfc.sort_values(ascending = False)

# Getting and saving tweets about PoGo

Here I searched for tweets about Pokemon Go or the specific event. I did not realize it at the time, but Twitter's search is case-insensitive, so "#PokemonGOFest" and "#pokemongofest" were identical searches.

100 is the maximum number of tweets that a single one of these search requests could return.
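
Because the search ignores case, duplicate terms could be dropped before querying so that each request returns something new. Here is a minimal sketch of that idea; `unique_queries` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def unique_queries(terms):
    """Return terms with case-insensitive duplicates removed, keeping the first spelling."""
    seen = set()
    unique = []
    for term in terms:
        key = term.casefold()  # compare without regard to case
        if key not in seen:
            seen.add(key)
            unique.append(term)
    return unique


terms = ["Pokemon Go", "#PokemonGOFest", "#pokemongofest", "pokemongo"]
print(unique_queries(terms))
```

With the four terms used here, only three distinct searches would remain.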

In [None]:
a = 'Pokemon Go'
b = "#PokemonGOFest"
c = "#pokemongofest"
d = "pokemongo"

In [None]:
number = 100
searchresults = {}
searchresults["a"] = twitter_api.search.tweets(q=a, count=number)
searchresults["b"] = twitter_api.search.tweets(q=b, count=number)
searchresults["c"] = twitter_api.search.tweets(q=c, count=number)
searchresults["d"] = twitter_api.search.tweets(q=d, count=number)

currTime = str(datetime.now())
with open(currTime + "search.pkl",'wb') as f:
    pickle.dump(searchresults, f)

In [None]:
for search in searchresults.keys():
    print(search)
    print(len(searchresults[search]["statuses"]))

# Getting and saving tweets about PoGo in Chicago vs. elsewhere

In [None]:
q = '-RT Pokemon Go OR #PokemonGOFest OR #pokemongofest OR pokemongo'
# This is centered on Grant Park, where the event took place.
loc = "41.8722,-87.621887,100mi"
lang = "en"
number = 100

search = {}
search["chicagoSearch"] = twitter_api.search.tweets(q=q, geocode=loc, lang=lang, count = number)
search["everywhereSearch"]  = twitter_api.search.tweets(q=q, lang=lang, count = number)

currTime = str(datetime.now())
with open(currTime + "Csearch.pkl",'wb') as f:
    pickle.dump(search, f)

In [None]:
for i in search.keys():
    print(i)
    print(len(search[i]["statuses"]))

I am not sure why the search without the geocode limit returns fewer results.

In [None]:
#for a given "search" (stored as a dictionary), 
#extract a list of status texts for a location (a key in that dict)
def getStatuses(d,k):
    return [s['text'] for s in d[k]['statuses']]

In [None]:
#filter, from "Using the Twitter API for Tweet Analysis" 
# modified since statuses here is a list of just the text
def filterRepeats(statuses):
    all_text = []
    filtered_statuses = []
    for s in statuses:
        if not s in all_text:
            filtered_statuses.append(s)
            all_text.append(s)
    return filtered_statuses     

In [None]:
everywhereTexts = getStatuses(search, "everywhereSearch")
#print(everywhereTexts)
print(len(everywhereTexts))
everywhereTextsFiltered = filterRepeats(everywhereTexts)
print(len(everywhereTextsFiltered))

I'm not entirely sure why this filter didn't seem to work (it basically never changed the length of the tweet list). Maybe, since retweets had already been excluded, there were few identical repeats. I also noticed that when there were apparent repeats, they often had different URLs. For example, there might be two shortened URLs that pointed to the same place but differed, perhaps so that the clicks could be tracked.
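
One way to catch repeats that differ only in their shortened links would be to compare the texts with URLs removed. The sketch below is an assumption about how that could look, not what this notebook ran; `normalize_text` and `filter_near_repeats` are hypothetical names, and the sample tweets and URLs are made up.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def normalize_text(text):
    """Drop URLs, collapse whitespace, and ignore case so near-identical tweets compare equal."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split()).casefold()


def filter_near_repeats(statuses):
    """Keep the first tweet for each normalized text."""
    seen = set()
    kept = []
    for s in statuses:
        key = normalize_text(s)
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


sample = [
    "Shiny Pikachu spotted! https://t.co/aaa111",
    "Shiny Pikachu spotted! https://t.co/bbb222",
    "Servers are down again",
]
print(filter_near_repeats(sample))
```

The first two sample tweets differ only in their links, so only the first of them is kept.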

In [None]:
everywhereTexts[:5]

In [None]:
chicagoTexts = getStatuses(search, "chicagoSearch")
print(len(chicagoTexts))
chicagoTextsFiltered = filterRepeats(chicagoTexts)
print(len(chicagoTextsFiltered))
#print(chicagoTextsFiltered)


I wasted a lot of time clicking on links in the tweets printed by this line :)

In [None]:
chicagoTexts[:5]

In [None]:
everywhereTexts[:5]

I notice that my search is also finding things that are unrelated to Pokemon Go but just have those two words in them. For example:

"@ChildrensITV It won't let me watch pokemon sun and moon on ITV Hub. It says unavailable when I click on go on the app?"
 or 
'When you go in the tall grass without your starting pokemon. https://t.co/dg9lrnsRp0'
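
A stricter check on the tweet text could screen out matches like these after the fact. The sketch below assumes that a relevant tweet has "pokemon" and "go" adjacent or joined (as in the hashtags); `POGO_PATTERN` and `mentions_pogo` are hypothetical names, and this filter was not applied in the analysis that follows.

```python
import re

# Matches "pokemon go" / "pokémon go" as adjacent words, plus the joined
# forms "pokemongo" and "pokemongofest" (with or without a leading "#").
POGO_PATTERN = re.compile(r"pok[eé]mon\s?go(?:fest)?\b", re.IGNORECASE)


def mentions_pogo(text):
    """Return True when the text names Pokemon Go rather than just containing both words."""
    return POGO_PATTERN.search(text) is not None


examples = [
    "@ChildrensITV It won't let me watch pokemon sun and moon on ITV Hub. "
    "It says unavailable when I click on go on the app?",
    "When you go in the tall grass without your starting pokemon.",
    "Pretty sad how bad the Pokemon Go Chicago Event turned out.",
    "Heading downtown for #PokemonGOFest",
]
for text in examples:
    print(mentions_pogo(text), text[:40])
```

This is still a heuristic: a sentence where "go" is only a verb directly after "pokemon" would pass the check.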

# Training a twitter sentiment classifier

I trained a classifier in the same way that was shown in the MOOC examples. I trained it on the NLTK twitter sample corpus, which provides the tweets already tokenized.

In [None]:
nltk.download("twitter_samples")
from nltk.corpus import twitter_samples

In [None]:
len(twitter_samples.fileids())

In [None]:
print(twitter_samples.fileids())

In [None]:
# I am attempting to remove @s and URLs since they are not real, useful words.
def build_bag_of_words_features_filtered(words):
    bag = {}
    useless_words = nltk.corpus.stopwords.words("english") + list(string.punctuation)
    for word in words:
        if not word in useless_words:
            if not "http" in word:
                if not "@" in word:
                    bag[word]=1
    return bag

In [None]:
negstrings = twitter_samples.strings("negative_tweets.json")
#print(negstrings[:5])

In [None]:
negtokens = twitter_samples.tokenized("negative_tweets.json")
postokens = twitter_samples.tokenized("positive_tweets.json")
#print(negtokens[:5])

In [None]:
negbag = [build_bag_of_words_features_filtered(i) for i in negtokens]
negfeatures = [(bag, "neg") for bag in negbag]
posbag = [build_bag_of_words_features_filtered(i) for i in postokens]
posfeatures = [(bag, "pos") for bag in posbag]

In [None]:
#print(len(negfeatures))
#print(len(posfeatures))

In [None]:
from nltk.classify import NaiveBayesClassifier

In [None]:
split = 4000

In [None]:
sentiment_classifier = NaiveBayesClassifier.train(posfeatures[:split]+negfeatures[:split])

The model seems highly accurate, but this is likely because it is somewhat overfit: the corpus is not representative of real tweets, since the tweets were filtered to include emoticons.

In [None]:
nltk.classify.util.accuracy(sentiment_classifier, posfeatures[split:]+negfeatures[split:])*100

It is highly overfit for identifying emoticons, which is how neg and pos were originally defined.

In [None]:
sentiment_classifier.show_most_informative_features()

# Determining positivity for the PoGo tweets

In [None]:
chicagoTexts[0]

In [None]:
def tokenizeTweets(tweetList):
    wordsList = []
    for tweet in tweetList:
        wordsList.append(nltk.word_tokenize(tweet))
    return wordsList

In [None]:
chicagoTokens = tokenizeTweets(chicagoTexts)

In [None]:
chicagoBag = [build_bag_of_words_features_filtered(i) for i in chicagoTokens]
#print(chicagoBag[:3])

The following cell prints some sample tweets along with their probability of being positive. Based on these results, I didn't feel like my classifier was doing a great job. But I also realized that many tweets didn't have a particularly obvious sentiment; many were simply giving information about game updates.

In [None]:
classifications = []
for tweet in chicagoBag:
    classifications.append(sentiment_classifier.prob_classify(tweet))
for i in range(10):
    print(chicagoTexts[i])
    print(classifications[i].prob("pos"))

In [None]:
#dir(classifications[0])

Instead of actually classifying tweets as positive or negative, I used the probability of being positive as the score for each tweet. I averaged over all tweets collected in one session to get an overall approval rating for pogo or the pogo fest for that day.

In [None]:
def approvalRating(classifList):
    runningScore = 0
    count = 0
    for tweet in classifList:
        runningScore += tweet.prob("pos")
        count += 1
    return 100*runningScore/count

In [None]:
approvalRating(classifications)

After trying out my code, I made a pipeline that could be run for each group of saved tweets to process them from raw tweets into an average approval rating.

In [None]:
#if repeatFilterOn is true, then this will filter repeats out of the tweets. Otherwise, it will not.
repeatFilterOn = False

def pipeline(query):
    scores = {}
    for place in query.keys():
        #print(place)
        if repeatFilterOn:
            statuses = filterRepeats(getStatuses(query, place))
        else:
            statuses = getStatuses(query, place)
        #print(statuses[0])
        bag = [build_bag_of_words_features_filtered(i) for i in tokenizeTweets(statuses)]
        #print(bag[0])
        classifications = []
        for tweet in bag:
            classifications.append(sentiment_classifier.prob_classify(tweet))
        #    print(classifications[-1:])
        nTweets = len(classifications)
        if nTweets == 0:
            print("No tweets saved; skipping")
        else:
            print("number of tweets: ", nTweets)
            score = approvalRating(classifications)
            #print(place, score)
            scores[place]=score
    print(scores)
    return scores

# Using saved historical tweets to find trends over time

In [None]:
files = !ls
datetimes = []
output = []
for filename in files:
    if "search" in filename:
        if "Csearch" not in filename:
            print(filename)
            searchresults = pickle.load(open(filename, "rb"))
            datetimes.append(datetime.strptime(filename[:-10], "%Y-%m-%d %H:%M:%S.%f"))
            output.append(pipeline(searchresults))

In [None]:
data = {a:[], b:[], c:[], d:[], "dt":[]}
for i in range(len(output)):
    data[a].append(output[i]["a"])
    data[b].append(output[i]["b"])
    data[c].append(output[i]["c"])
    data[d].append(output[i]["d"])
    data["dt"].append(datetimes[i])   
datadf =  pd.DataFrame.from_dict(data)
datadf.set_index(datadf["dt"], inplace = True)
datadf.pop("dt")
datadf.head()

In [None]:
print(datadf.corr())
print()
print(datadf.describe())

I found it interesting that tweets related to the Go Fest event did not correlate all that well to general pokemon go tweets. Overall, the approval scores varied a lot over the month. The best day for pogo, with an approval of 85%, was 8/16. At the end of the analysis, I look into what happened that day.

In [None]:
datadf[datadf["pokemongo"]>85]

I plotted the raw data as points, lines, and then also as a rolling average to smooth out the high variance. The open space is when I was on vacation and didn't collect data. There are no blue dots/lines because they are all identical to and written over by the orange (since the search is case insensitive).

In [None]:
del datadf["#pokemongofest"]
datadf.plot(style=".", ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
datadf.plot(ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
# 3-point rolling mean to smooth out the high variance
datadf.rolling(3).mean().plot(ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
#, figsize=(15,10)

#plt.figure(figsize=(20,10))
#plt.plot_date(datadf, xdate=True, ydate=False)

I'm not sure why, but many of the early days found no tweets for Chicago. I can't remember now if I modified the query or if it started working better on its own.

In [None]:
cdatetimes = []
coutput = []

files = !ls
for filename in files:
    if "Csearch" in filename:
        print(filename)
        searchresults = pickle.load(open(filename, "rb"))
        cdatetimes.append(datetime.strptime(filename[:-11], "%Y-%m-%d %H:%M:%S.%f"))
        # run the pipeline once per file and keep its scores
        coutput.append(pipeline(searchresults))

In [None]:
coutput[:5]

In [None]:
cdata = {"chicagoSearch":[], "everywhereSearch":[], "dt":[]}
for i in range(len(coutput)):
    if "chicagoSearch" not in coutput[i].keys():
        cdata["chicagoSearch"].append(0)
    else: 
        cdata["chicagoSearch"].append(coutput[i]["chicagoSearch"])
    cdata["everywhereSearch"].append(coutput[i]["everywhereSearch"])
    cdata["dt"].append(cdatetimes[i])    
    
cdatadf =  pd.DataFrame.from_dict(cdata)
cdatadf.set_index(cdatadf["dt"], inplace = True)
cdatadf.pop("dt")
cdatadf.head()

I graph using a ymin of 10 because the days with a rating of 0 simply mean that no tweets were collected in Chicago at those times.
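
An alternative to recording 0 for those sessions would be to record a missing value, which pandas leaves out of `describe()` and `corr()` and which shows up as a gap in a line plot. The sketch below is not what this notebook ran; `score_or_nan` is a hypothetical helper and the scores are made up.

```python
import math


def score_or_nan(scores, key):
    """Return the score stored under key, or NaN when no tweets were scored for it."""
    return scores.get(key, math.nan)


# Made-up pipeline outputs: the first session found no Chicago tweets.
sessions = [
    {"everywhereSearch": 61.2},
    {"chicagoSearch": 55.0, "everywhereSearch": 58.4},
]
chicago_column = [score_or_nan(s, "chicagoSearch") for s in sessions]
print(chicago_column)
```

A column built this way could be passed to `pd.DataFrame.from_dict` just like the lists used in this notebook.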

In [None]:
print(cdatadf.corr())
print()
print(cdatadf.describe())

In [None]:
cdatadf.plot(style=".", ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
cdatadf.plot(ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
# 3-point rolling mean to smooth out the high variance
cdatadf.rolling(3).mean().plot(ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
#, figsize=(15,10)

#x=cdatadf.plot(style=".", figsize=(15,10), ylim=[40,90])
#plt.figure(figsize=(20,10))
#plt.plot_date(cdata["dt"], cdata["everywhereSearch"],label = "everywhereSearch", xdate=True, ydate=False)
#plt.plot_date(cdata["dt"], cdata["chicagoSearch"],label = "chicagoSearch", xdate=True, ydate=False)
#plt.legend()
#plt.ylim([40,90])

# Training a second twitter sentiment classifier

The previous classifier was probably not very accurate. It was based off a model that used emoticons to define the "gold standard," so the most important features for the classifier were :) and :( emoticons by far. I expect that this makes tweets without emoticons hard to analyze. At the same time, many of the tweets that I've seen could better be thought of as informative rather than emotive, so it's hard to know what kind of sentiment it should have.

Looking at some sample tweets, I would not have rated them as positively as the classifier did. For example:

I liked a @YouTube video https://t.co/3mnvMLL74a This Problem with Pokémon Go NEEDS to be Solved NOW...
0.7723498203721197

I found a second corpus of tweets online to try out: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

In [None]:
pwd = !pwd
print(pwd)
if not pwd[0] == "/Users/hjohnsen/Dropbox (Personal)/Data Science/Week-8-NLP-Databases":
    %cd "/Users/hjohnsen/Dropbox (Personal)/Data Science/Week-8-NLP-Databases/"   
trainingData=pd.read_csv("SentimentAnalysisDataset.csv")

In [None]:
trainingData.head()

In [None]:
del trainingData["ItemID"]
del trainingData["SentimentSource"]

In [None]:
trainingData.head()

I found an NLTK tokenizer that is designed for tweets.

In [None]:
tknzr = nltk.tokenize.casual.TweetTokenizer(preserve_case=False)
#tknzr.tokenize(trainingData["SentimentText"][0])

In [None]:
trainingData["tokenizedbag"]=trainingData["SentimentText"].map(tknzr.tokenize)

In [None]:
trainingData.head()

In [None]:
negData = trainingData[trainingData["Sentiment"]==0]["tokenizedbag"]
posData = trainingData[trainingData["Sentiment"]==1]["tokenizedbag"]

In [None]:
negbag = [(build_bag_of_words_features_filtered(i), "neg") for i in negData]
posbag = [(build_bag_of_words_features_filtered(i), "pos") for i in posData]

In [None]:
#print(len(negbag))
#print(len(posbag))

In [None]:
from nltk.classify import NaiveBayesClassifier
nsplit = int(.8*len(negbag))
psplit = int(.8*len(posbag))

sentiment_classifier2 = NaiveBayesClassifier.train(posbag[:psplit]+negbag[:nsplit])
print("Score on training set:")
print(nltk.classify.util.accuracy(sentiment_classifier2, posbag[:psplit]+negbag[:nsplit])*100)
print("Score on test set:")
print(nltk.classify.util.accuracy(sentiment_classifier2, posbag[psplit:]+negbag[nsplit:])*100)

sentiment_classifier2.show_most_informative_features()


While this sentiment classifier appears to be less accurate, its score is probably more believable than the other one's 99% accuracy (especially given that humans are only about 80% accurate).

# Using the new classifier to do the same analysis

In [None]:
pwd = !pwd
print(pwd)
if not pwd[0] == "/Users/hjohnsen/Dropbox (Personal)/Data Science/Week-8-NLP-Databases/pickles":
    %cd "/Users/hjohnsen/Dropbox (Personal)/Data Science/Week-8-NLP-Databases/pickles"  
# !ls

In [None]:
repeatFilterOn = True
# I will tokenize using the same method as my training set this time.
def tokenizeTweets2(tweetList):
    wordsList = []
    for tweet in tweetList:
        wordsList.append(tknzr.tokenize(tweet))
    return wordsList

#chicagoTokens = tokenizeTweets(chicagoTexts)

#chicagoBag = [build_bag_of_words_features_filtered(i) for i in chicagoTokens]

#classifications = []
#for tweet in chicagoBag:
#    classifications.append(sentiment_classifier.prob_classify(tweet))
#for i in range(10):
#    print(chicagoTexts[i])
#    print(classifications[i].prob("pos"))

def approvalRating(classifList):
    runningScore = 0
    count = 0
    for tweet in classifList:
        runningScore += tweet.prob("pos")
        count += 1
    return 100*runningScore/count

#approvalRating(classifications)

def pipeline2(query):
    scores = {}
    for place in query.keys():
        #print(place)
        if repeatFilterOn:
            statuses = filterRepeats(getStatuses(query, place))
        else:
            statuses = getStatuses(query, place)
        #print(statuses[0])
        bag = [build_bag_of_words_features_filtered(i) for i in tokenizeTweets2(statuses)]
        #print(bag[0])
        classifications = []
        for tweet in bag:
            classifications.append(sentiment_classifier2.prob_classify(tweet))
        #    print(classifications[-1:])
        nTweets = len(classifications)
        if nTweets == 0:
            print("No tweets saved; skipping")
        else:
            score = approvalRating(classifications)
            #print("number of tweets: ", nTweets)
            #print(place, score)
            scores[place]=score
    print(scores)
    return scores

# Using saved historical tweets to find trends over time

files = !ls
datetimes = []
output = []
for filename in files:
    if "search" in filename:
        if "Csearch" not in filename:
 #           print(filename)
            searchresults = pickle.load(open(filename, "rb"))
            datetimes.append(datetime.strptime(filename[:-10], "%Y-%m-%d %H:%M:%S.%f"))
            output.append(pipeline2(searchresults))


data2 = {a:[], b:[], c:[], d:[], "dt":[]}
for i in range(len(output)):
    data2[a].append(output[i]["a"])
    data2[b].append(output[i]["b"])
    data2[c].append(output[i]["c"])
    data2[d].append(output[i]["d"])
    data2["dt"].append(datetimes[i])   
datadf2 =  pd.DataFrame.from_dict(data2)
datadf2.set_index(datadf2["dt"], inplace = True)
datadf2.pop("dt")
datadf2.head()

print(datadf2.corr())
print()
print(datadf2.describe())

#datadf[datadf["pokemongo"]>85]


cdatetimes = []
coutput = []

files = !ls
for filename in files:
    if "Csearch" in filename:
        print(filename)
        searchresults = pickle.load(open(filename, "rb"))
        cdatetimes.append(datetime.strptime(filename[:-11], "%Y-%m-%d %H:%M:%S.%f"))
        # run the pipeline once per file and keep its scores
        coutput.append(pipeline2(searchresults))

coutput[:5]

cdata2 = {"chicagoSearch":[], "everywhereSearch":[], "dt":[]}
for i in range(len(coutput)):
    if "chicagoSearch" not in coutput[i].keys():
        cdata2["chicagoSearch"].append(0)
    else: 
        cdata2["chicagoSearch"].append(coutput[i]["chicagoSearch"])
    cdata2["everywhereSearch"].append(coutput[i]["everywhereSearch"])
    cdata2["dt"].append(cdatetimes[i])    
    
cdatadf2 =  pd.DataFrame.from_dict(cdata2)
cdatadf2.set_index(cdatadf2["dt"], inplace = True)
cdatadf2.pop("dt")
cdatadf2.head()

print(cdatadf2.corr())
print()
print(cdatadf2.describe())

#, figsize=(15,10)

#x=cdatadf.plot(style=".", figsize=(15,10), ylim=[40,90])
#plt.figure(figsize=(20,10))
#plt.plot_date(cdata["dt"], cdata["everywhereSearch"],label = "everywhereSearch", xdate=True, ydate=False)
#plt.plot_date(cdata["dt"], cdata["chicagoSearch"],label = "chicagoSearch", xdate=True, ydate=False)
#plt.legend()
#plt.ylim([40,90])

In [None]:
del datadf2["#pokemongofest"]
datadf2.plot(style=".", ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
datadf2.plot(ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
# 3-point rolling mean to smooth out the high variance
datadf2.rolling(3).mean().plot(ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
#, figsize=(15,10)

#plt.figure(figsize=(20,10))
#plt.plot_date(datadf, xdate=True, ydate=False)

#plt.figure(figsize=(20,10))
#plt.plot_date(data["dt"], data[a],label = a, xdate=True, ydate=False)
#plt.plot_date(data["dt"], data[b],label = b, xdate=True, ydate=False)
#plt.plot_date(data["dt"], data[c],label = c, xdate=True, ydate=False)
#plt.plot_date(data["dt"], data[d],label = d, xdate=True, ydate=False)
#plt.legend()


In [None]:
cdatadf2.plot(style=".", ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
cdatadf2.plot(ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")
# 3-point rolling mean to smooth out the high variance
cdatadf2.rolling(3).mean().plot(ylim=[10,90])
plt.xlabel("Date of query")
plt.ylabel("Positivity")

# Comparing the two classifiers
To get a sense of which classifier performed better, I asked each of them to classify some tweets. I did this on chicagoBag because it was already available in memory. However, it is not tokenized in the same way as either model's training data, which might skew things. I then classified some of the tweets as positive or negative myself and calculated the RMSE and log-likelihood of each classifier against my labels. Both had several hits and misses, with similar RMSEs (0.480237111 vs. 0.488219941 for the first and second classifier, respectively) and log-likelihoods (-7.958536425 vs. -7.708474697).

Although neither classifier seems hugely better than the other based on these numbers, if I had to choose one for further use I would probably pick the second, given that it was trained on a more representative and much larger dataset.
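
The exact formulas behind those two numbers are not shown in this notebook, so the sketch below assumes the usual definitions: RMSE between the predicted probability of "pos" and a hand label of 1 (positive) or 0 (negative), and the summed log-probability that each classifier assigned to the hand label. The function names and the sample values are made up for illustration.

```python
import math


def rmse(probs, labels):
    """Root-mean-square error between predicted P(pos) and 0/1 hand labels."""
    squared_errors = [(p - y) ** 2 for p, y in zip(probs, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def log_likelihood(probs, labels, eps=1e-12):
    """Sum of log-probabilities that the classifier gave to the hand labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # avoid log(0) for extreme predictions
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total


probs = [0.9, 0.2, 0.6, 0.4]   # made-up classifier outputs
labels = [1, 0, 0, 1]          # made-up hand labels
print(rmse(probs, labels))
print(log_likelihood(probs, labels))
```

A lower RMSE and a log-likelihood closer to zero both indicate predictions that agree better with the hand labels.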

In [None]:
for i,tweet in enumerate(chicagoBag[:10]):
    print(chicagoTexts[i])
    print(sentiment_classifier.prob_classify(tweet).prob("pos"))
    print(sentiment_classifier2.prob_classify(tweet).prob("pos"))

# Exploring specific days

In [None]:
pwd = !pwd
if not pwd[0] == "/Users/hjohnsen/Dropbox (Personal)/Data Science/Week-8-NLP-Databases/pickles":
    %cd "/Users/hjohnsen/Dropbox (Personal)/Data Science/Week-8-NLP-Databases/pickles"  
# !ls

Earlier, I found that approval of pogo peaked on August 16th. I couldn't identify clearly what announcements had been made that day by searching the web, so I wanted to look at the tweets themselves:

In [None]:
def tweetTester(tweet):
    # tokenize the same way as sentiment_classifier2's training data (lowercased)
    tokenized = tokenizeTweets2([tweet])
    bag = build_bag_of_words_features_filtered(tokenized[0])
    print(sentiment_classifier2.prob_classify(bag).prob("pos"))

In [None]:
peakday="2017-08-16 16:45:17.978716search.pkl"

peakdaytweets = pickle.load(open(peakday, "rb"))
for key in peakdaytweets.keys():
    rtcount = 0
    othercount = 0
    peakstatuses= getStatuses(peakdaytweets, key)
    print(key+ "="*100)
    for i in peakstatuses:
        print(i)
        if "RT @Pokemon: Spotted: Shiny Pikachu," in i:
            rtcount +=1
        else:
            othercount +=1
        tweetTester(i)
    print(rtcount)
    print(othercount)

Pogo's "best" day was a day when the shiny versions of the pikachu family came out. The tweet "RT @Pokemon: Spotted: Shiny Pikachu, Pichu, and Raichu in #PokemonGO! Be on the lookout for these Shiny versions as you explore:" was highly positive and retweeted a lot! (This search didn't filter out retweets, unlike the Chicago vs Everywhere query.) With numerous retweets of this very positive post, it was a good day according to the sentiment classifier.

Next I looked at some of the tweets on the first day, which didn't have quite as bad of a score as I expected given all the tweets and posts I was reading online that day.

In [None]:
firstday="2017-07-22 19:04:57.902835search.pkl"
firstdaytweets = pickle.load(open(firstday, "rb"))
for key in firstdaytweets.keys():
    firststatuses= getStatuses(firstdaytweets, key)
    print(key+ "="*100)
    for i in firststatuses:
        print(i)
        tweetTester(i)

I realized that I didn't filter out retweets or non-English tweets in this search query, and my sentiment analyzer has no idea what to do with non-English text. I was curious why my analyzer gave such tweets a positive score instead of something like 0.5, and I found that "RT" on its own seems to be a positive feature.

Additionally, by this time in the day, legendary pokemon had been announced, which buffered their approval somewhat.
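
Retweets and non-English tweets could also be dropped from the saved statuses before scoring. The sketch below assumes each saved status is a dictionary with a `text` field, a `lang` field, and, for retweets, a `retweeted_status` field, as in the search results pickled above; `is_retweet` and `keep_original_english` are hypothetical helpers that this notebook did not use, and the sample statuses are abbreviated.

```python
def is_retweet(status):
    """Treat a status as a retweet if it embeds the original or starts with 'RT @'."""
    return "retweeted_status" in status or status.get("text", "").startswith("RT @")


def keep_original_english(statuses, lang="en"):
    """Keep statuses that are not retweets and are tagged with the given language."""
    return [s for s in statuses if not is_retweet(s) and s.get("lang") == lang]


statuses = [
    {"text": "RT @Pokemon: Spotted: Shiny Pikachu", "lang": "en"},
    {"text": "『Pokemon GO』に伝説のポケモン", "lang": "ja"},
    {"text": "Pretty sad how bad the Pokemon Go Chicago Event turned out.", "lang": "en"},
]
kept = keep_original_english(statuses)
print([s["text"] for s in kept])
```

Only the last sample status survives both checks.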

In [None]:
tweet = "RT @famitsu: 『Pokemon GO』に伝説のポケモン“ルギア”、“フリーザー”が登場！ 　皆の力を結集し、伝説のポケモンをゲットしよう！  https://t.co/dIa4e5AE5g https://t.co/bGqgqE9xr3"
tweetTester(tweet)

In [None]:
tweet = "RT"
tweetTester(tweet)

In [None]:
tweet = "米国のポケモンGOイベントで大規模サーバ障害が発生、運営元ナイアンティックはチケット全額返金と100ドル分の詫びコイン、伝説のポケモン『ルギア』をイベント出席者全員に配布する対応を発表しました"
tweetTester(tweet)

In [None]:
firstday="2017-07-22 15:32:06.903443search.pkl"
firstdaytweets = pickle.load(open(firstday, "rb"))
for key in firstdaytweets.keys():
    firststatuses= getStatuses(firstdaytweets, key)
    print(key+ "="*100)
    for i in firststatuses:
        print(i)
        tweetTester(i)

In [None]:
firstday="2017-07-22 17:40:45.802043search.pkl"
firstdaytweets = pickle.load(open(firstday, "rb"))
for key in firstdaytweets.keys():
    firststatuses= getStatuses(firstdaytweets, key)
    print(key+ "="*100)
    for i in firststatuses:
        print(i)
        tweetTester(i)

Earlier in the day you can see some of the negative tweets. I think the worst was probably "Pretty sad how bad the Pokemon Go Chicago Event turned out. Cellular lines jammed up everywhere &amp; people boo-ing the CEO of Niantic" with a 3e-05 probability of being positive!

In [None]:
datadf[:10].plot(style=".", ylim=[10,90])
datadf[:10].plot(ylim=[10,90])
datadf[:10].rolling(3).mean().plot(ylim=[10,90])
cdatadf[:10].plot(style=".", ylim=[10,90])
cdatadf[:10].plot(ylim=[10,90])
cdatadf[:10].rolling(3).mean().plot(ylim=[10,90])

In [None]:
# This line is commented out because there is too much text in this notebook otherwise

#sentiment_classifier2.show_most_informative_features(10000)

According to the above, retweets are positive:

                    rt = 1                 pos : neg    =      2.7 : 1.0

In [None]:
# tokenize the same way as sentiment_classifier2's training data (lowercased)
tokenized = tokenizeTweets2(["RT"])
bag = build_bag_of_words_features_filtered(tokenized[0])
print(sentiment_classifier2.prob_classify(bag).prob("pos"))

In [None]:
tokenized = tokenizeTweets(["RT"])
bag = build_bag_of_words_features_filtered(tokenized[0])
print(sentiment_classifier.prob_classify(bag).prob("pos"))

In [None]:
len(datadf)

In [None]:
bag

In [None]:
tokenized = tokenizeTweets(["RT"])
bag = build_bag_of_words_features_filtered(tokenized[0])
print(sentiment_classifier.classify(bag))

In [None]:
x = []
y = []
for i,tweet in enumerate(chicagoBag[:50]):
    print(chicagoTexts[i])
    print(sentiment_classifier.prob_classify(tweet).prob("pos"))
    print(sentiment_classifier2.prob_classify(tweet).prob("pos"))
    x.append(sentiment_classifier.prob_classify(tweet).prob("pos"))
    y.append(sentiment_classifier2.prob_classify(tweet).prob("pos"))
plt.scatter(x,y)
plt.xlabel("Classifier 1 positivity score")
plt.ylabel("Classifier 2 positivity score")
