In this notebook, we will use the tweet-preprocessor package to clean our tweet text. We'll then detect each tweet's language with langdetect so that non-English tweets can be excluded from sentiment scoring, and use Stanford CoreNLP for sentiment analysis.

In [1]:
from langdetect import detect
import pandas as pd
import preprocessor as p
from string import punctuation
import time
from pycorenlp import StanfordCoreNLP

In [2]:
tweets = pd.read_csv('../data/data_modified/tweets/tweets.csv')

In [3]:
tweets['content'][0:10].tolist()

['RT @SaveTheVikesOrg: Drumming up #Vikings stadium support @mnstatefair w/ @nicolelindaman and @laurabesh before concert! #countryfansare ...',
 'RT @Haydollbaby: #steelersnation',
 'RT @MyPhilaEagles: Yo the bol @DeseanJackson10 is toooo damnnnnn FAST on #Madden12 hahaha #Eagles #NFL',
 "@DEZ_88....yo i'm headed to miami right after work tomorrow to u guys beat the dolphins!!!!#GOCOWBOYS",
 'RT @nfl: Tom Brady and @Ochocinco still looking for that chemistry: http://t.co/1IZPpQq #Patriots',
 "@JermichaelF88 I'm drafting now, how you feeling? Going to have an awesome year? #gopackgo",
 'I wish the #Seahawks would switch back to their old school uniforms...they were #ballerstatus vintage is boss!',
 "@ChrisJohnson28 #ifchrisjohnsonwaswhite he'd probably honor his contract and THEN get a new mega contract!!!!! #NFL #RaiderNation",
 'Falcons Shore Up Troubled Secondary In Two Short Days: The Falcoholic » \xa0 Thomas Dimitroff he... http://t.co/cH6zwIR #nfl #ATL #falcons',
 'RT @Jonathanst

In [4]:
%%time
# Clean all the tweets, removing leading and trailing punctuation
tweets['content_clean'] = [p.clean(tweet).strip(punctuation) for tweet in tweets['content'].tolist()]

CPU times: user 1min 2s, sys: 393 ms, total: 1min 3s
Wall time: 1min 3s


In [5]:
%%time
# Detect the language of each cleaned tweet using langdetect
langs = []
for text in tweets['content_clean']:
    try:
        langs.append(detect(text))
    except Exception:
        # langdetect raises when the text has no usable features (e.g. an empty string)
        langs.append('')
tweets['language'] = langs

CPU times: user 1h 31min 59s, sys: 2min 33s, total: 1h 34min 32s
Wall time: 1h 34min 37s
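
Because many tweets are retweets with identical text, one possible way to shorten this step is to run detection once per distinct text and reuse the result. The following is a minimal sketch, not what was run above: `detect_languages` is a hypothetical helper, and `fake_detect` is a stand-in so the example runs without langdetect installed. In the notebook, `detect` from langdetect would be passed as `detect_fn`.

```python
def detect_languages(texts, detect_fn):
    """Detect the language of each text, calling detect_fn once per distinct text.

    detect_fn is any callable that returns a language code or raises on
    failure (for example langdetect.detect). Failures are recorded as ''.
    """
    cache = {}
    langs = []
    for text in texts:
        if text not in cache:
            try:
                cache[text] = detect_fn(text)
            except Exception:
                cache[text] = ''
        langs.append(cache[text])
    return langs


# Stand-in detector (assumption, for illustration only): records each call,
# fails on empty text, and uses a toy rule to pick a language code.
calls = []

def fake_detect(text):
    calls.append(text)
    if not text:
        raise ValueError('no features in text')
    return 'es' if 'vamos' in text.lower() else 'en'


langs = detect_languages(['Go Pack Go', 'Vamos equipo', '', 'Go Pack Go'], fake_detect)
print(langs)       # ['en', 'es', '', 'en']
print(len(calls))  # 3 (the repeated text is served from the cache)
```

How much time this saves depends on how many duplicate texts the data contains, which is not measured here.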


In [8]:
# That was a long process, so let's save our progress
tweets.to_csv('../data/data_modified/tweets/tweets.csv', index=False)

Checking the data anecdotally, it looks like most tweets flagged as non-English, other than the Spanish ones, are either false positives or empty (i.e., only hashtags). We want to keep all tweets for count features, but we only want sentiment scores for non-Spanish, non-empty tweets.

For the purposes of this model, we will treat all 'empty text' tweets as 'neutral' sentiment. This is contestable, however: anecdotal evidence suggests these may actually be positive (most are hashtag-only support for a given team). We won't dive into this much here, but it is a clear area for further research.
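
The rule described above can be written as a small predicate. This is a minimal sketch, assuming the `language` codes and cleaned text produced earlier in the notebook; the helper names `needs_sentiment` and `final_sentiment` are hypothetical, and the example scores are toy values rather than real model output.

```python
def needs_sentiment(language, text):
    """Return True when a tweet should receive a sentiment score.

    Assumption: `language` is the langdetect code ('' when detection failed)
    and `text` is the cleaned tweet text.
    """
    # Cleaned text can be missing (e.g. NaN after a CSV round trip) or blank
    if not isinstance(text, str) or not text.strip():
        return False
    # Spanish tweets are excluded from sentiment analysis
    return language != 'es'


def final_sentiment(language, text, raw_sentiment):
    """Keep the raw score for eligible tweets; otherwise fall back to neutral (0)."""
    return raw_sentiment if needs_sentiment(language, text) else 0


# Toy rows: (language, cleaned text, illustrative raw score)
examples = [
    ('en', 'I wish the would switch back to their old school uniforms', -1),
    ('es', 'Vamos equipo', 1),
    ('', '', 0),
]
results = [final_sentiment(lang, text, score) for lang, text, score in examples]
print(results)  # [-1, 0, 0]
```

Using the predicate to skip the annotator call entirely for ineligible tweets would also avoid sending empty strings to the server, at the cost of a slightly more involved loop.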

In [19]:
# Start the Stanford CoreNLP server. First download CoreNLP (with the English models) from
# https://stanfordnlp.github.io/CoreNLP/, then run this command locally in the unzipped folder.
# The server listens on port 9000 by default.

# java -mx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 10000

In [23]:
%%time
# Score each sentence with CoreNLP and sum the sentence scores to get the tweet's overall sentiment
nlp = StanfordCoreNLP('http://localhost:9000')

sentiment_mappings = {
    'Verypositive': 2,
    'Positive': 1,
    'Negative': -1,
    'Verynegative': -2,
    'Neutral': 0
}
sentiments = []

for tweet in tweets['content_clean']:
    parsed_tweet = nlp.annotate(tweet,
                       properties={
                           'annotators': 'sentiment',
                           'outputFormat': 'json',
                           'timeout': 100000,
                       })
    sentiment = 0
    for s in parsed_tweet["sentences"]:
        sentiment += sentiment_mappings[s["sentiment"]]
    sentiments.append(sentiment)

CPU times: user 1h 9min 58s, sys: 9min 13s, total: 1h 19min 12s
Wall time: 18h 37min 46s
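
The scoring rule used in the loop above can be pulled out into a standalone function so it can be checked without a running server. This is a sketch, assuming the response has the shape the loop reads (a dict with a "sentences" list whose items carry a "sentiment" label). The name `score_response` and the fallback to 0 for unexpected responses are assumptions, not part of the original run.

```python
SENTIMENT_MAPPINGS = {
    'Verypositive': 2,
    'Positive': 1,
    'Neutral': 0,
    'Negative': -1,
    'Verynegative': -2,
}


def score_response(parsed_tweet):
    """Sum the per-sentence sentiment scores of one annotated tweet."""
    # Assumption: anything other than a dict (e.g. raw error text returned
    # instead of parsed JSON) is treated as neutral rather than raising.
    if not isinstance(parsed_tweet, dict):
        return 0
    sentences = parsed_tweet.get('sentences', [])
    return sum(SENTIMENT_MAPPINGS[s['sentiment']] for s in sentences)


# Hand-built response with two sentences: Positive (1) + Verynegative (-2)
fake_response = {'sentences': [{'sentiment': 'Positive'},
                               {'sentiment': 'Verynegative'}]}
print(score_response(fake_response))  # -1
```

Because the scores are summed rather than averaged, a tweet with more sentences can reach a larger absolute score than a single-sentence tweet.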


In [28]:
tweets['sentiment'] = sentiments
# Spanish tweets are scored as neutral (0)
tweets.loc[tweets['language'] == 'es', 'sentiment'] = 0

In [29]:
tweets.to_csv('../data/data_modified/tweets/tweets.csv', index=False)