##Sentiment analysis of weather related tweets

This notebook explores some initial modeling of sentiment analysis of weather related tweets. The datasets can be downloaded from Kaggle's "[Partly Sunny with a Chance of Hashtags](https://www.kaggle.com/c/crowdflower-weather-twitter/data)" competition.

Since the competition is over, we restrict ourselves to the training set in order to have a good basis for benchmarking. We make heavy use of *nltk* as a natural language processing (NLP) library to build a *bag-of-words* model, which we then use to train a Multinomial Naive Bayes classifier.

###Retrieving and basic cleansing of the data

In [1]:
import pandas as pd
pd.options.display.encoding = 'ascii'

In [2]:
### import the different corpora and packages from nltk as you need them
# import nltk
# nltk.download()

In [3]:
# use the first column (id) as the index
raw_df = pd.read_csv('./datasets/train.csv', index_col=0)
raw_df.head(5)

| id | tweet | state | location | s1 | s2 | s3 | s4 | s5 | w1 | w2 | ... | k6 | k7 | k8 | k9 | k10 | k11 | k12 | k13 | k14 | k15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Jazz for a Rainy Afternoon: {link} | oklahoma | Oklahoma | 0 | 0 | 1 | 0.0 | 0.0 | 0.8 | 0 | ... | 0 | 0.0 | 0 | 0.0 | 1 | 0 | 0 | 0.0 | 0 | 0 |
| 2 | RT: @mention: I love rainy days. | florida | Miami-Ft. Lauderdale | 0 | 0 | 0 | 1.0 | 0.0 | 0.196 | 0 | ... | 0 | 0.0 | 0 | 0.0 | 1 | 0 | 0 | 0.0 | 0 | 0 |
| 3 | Good Morning Chicago! Time to kick the Windy C... | idaho | | 0 | 0 | 0 | 0.0 | 1.0 | 0.0 | 0 | ... | 0 | 1.0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
| 6 | Preach lol! :) RT @mention: #alliwantis this t... | minnesota | Minneapolis-St. Paul | 0 | 0 | 0 | 1.0 | 0.0 | 1.0 | 0 | ... | 0 | 0.604 | 0 | 0.196 | 0 | 0 | 0 | 0.201 | 0 | 0 |
| 9 | @mention good morning sunshine | rhode island | Purgatory | 0 | 0 | 0 | 0.403 | 0.597 | 1.0 | 0 | ... | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 1.0 | 0 | 0 |


For completeness we will also show a snippet from the test set to see what kind of features we are allowed to use.

In [4]:
# use the first column (id) as the index
test_df = pd.read_csv('./datasets/test.csv', index_col=0)
test_df.head(2)

| id | tweet | state | location |
|---|---|---|---|
| 4 | Edinburgh peeps is it sunny?? #weather | | birmingham |
| 5 | SEEVERE T’STORM WARNING FOR TROUSDALE, NORTHW... | | Nashville |


###Label creation for Sentiment category

We start out by trying to model the sentiment category of a tweet. To this end we label each tweet with the category (1 to 5) of the sentiment field s1-s5 that has the maximum value.

In [5]:
def create_label(row):
    # row is of type pandas.Series; cast it to a list so we can use list.index
    lst = row.tolist()
    return lst.index(max(lst))+1

In [6]:
# apply works on columns by default; use axis=1 to apply the function to each row instead.
label_df = raw_df[['s1','s2','s3','s4','s5']].apply(create_label, axis=1)
label_df.head(5)

id
1    3
2    4
3    5
6    4
9    5
dtype: int64
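
As a cross-check, the same labels can be computed without a Python-level loop over rows. The following is a minimal sketch, assuming a small hand-built frame `toy` (a hypothetical name) that copies the s1-s5 values of three rows from the `head()` output above. Both variants return the first maximum when there is a tie.

```python
import pandas as pd

# Toy frame: s1-s5 values of ids 1, 2 and 9 from the head() output above
toy = pd.DataFrame(
    {"s1": [0, 0, 0], "s2": [0, 0, 0], "s3": [1, 0, 0],
     "s4": [0.0, 1.0, 0.403], "s5": [0.0, 0.0, 0.597]},
    index=pd.Index([1, 2, 9], name="id"),
)

def create_label(row):
    # same logic as in the notebook: 1-based position of the row maximum
    lst = row.tolist()
    return lst.index(max(lst)) + 1

loop_labels = toy.apply(create_label, axis=1)

# Vectorized variant: argmax gives the 0-based column position of the maximum
vector_labels = pd.Series(toy.values.argmax(axis=1) + 1, index=toy.index)

print(loop_labels.tolist(), vector_labels.tolist())
```

On the full set, the vectorized variant would replace `toy` with `raw_df[['s1','s2','s3','s4','s5']]`.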

###Tokenizing, stemming and term counting

In order to use the tweets as input for a machine learning algorithm we need to convert them into numerical features. One way to do this is to chop up each tweet into words, count how often each word occurs in it, and store these counts in a large sparse vector whose length is the size of the vocabulary of all tweets.
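
To make the idea concrete, here is a minimal sketch of such count vectors using only the standard library. The two sentences in `docs` are made up for illustration, and splitting on whitespace is a simplifying assumption, not the tokenization used later in this notebook.

```python
from collections import Counter

# Two made-up "tweets"
docs = ["i love rainy days", "rainy rainy afternoon"]

# The vocabulary is the sorted set of all words in all documents
vocab = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, with one entry per vocabulary word
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)    # ['afternoon', 'days', 'i', 'love', 'rainy']
print(vectors)  # [[0, 1, 1, 1, 1], [1, 0, 0, 0, 2]]
```

With a realistic vocabulary most entries of each vector are zero, which is why a sparse representation is used in practice.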

We start with a very simple approach: we concatenate all tweets into one string and count how often each term occurs.

In [7]:
from collections import Counter

full_tweet_string = raw_df.tweet.apply(lambda t: t.lower() + " ").sum()
Counter(full_tweet_string.split()).most_common(25)

[('the', 36622),
 ('weather', 23046),
 ('to', 20942),
 ('@mention', 19424),
 ('a', 18712),
 ('in', 18611),
 ('and', 17827),
 ('i', 16907),
 ('is', 14302),
 ('{link}', 13491),
 ('for', 12580),
 ('this', 12231),
 ('of', 10703),
 ('it', 9131),
 ('rt', 8999),
 ('on', 8060),
 ('@mention:', 8008),
 ('my', 7492),
 ("it's", 7130),
 ('at', 6974),
 ('out', 6397),
 ('be', 6122),
 ('its', 5900),
 ('storm', 5673),
 ('you', 5385)]

The simple counter already indicates a problem. There are a lot of very common words that obviously carry no signal, such as 'the', 'at', 'of', etc. We need to remove those stopwords. Another problem can be seen by comparing '@mention' and '@mention:', which should clearly be identified as the same word, meaning we need to remove the punctuation. Finally, we might want to identify 'storm' and 'stormy' as the same term, which requires stemming techniques.

The following code snippet provides a tokenizer that does all of the above and additionally appends the bigrams of the stemmed tokens:

In [8]:
from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer
from nltk import word_tokenize
from nltk.util import ngrams
import string

def custom_tokenizer(document_string):
    
    # define the stop word vocabulary
    stops = [unicode(word) for word in stopwords.words('english')] \
        + ["''", "``", 're:', 'fwd:', '-', '@mention', '@mention:', 'mention', 'link', ':', 'f.', '&'] 
    
    # create a default stemmer
    stemmer = EnglishStemmer()
    
    # return the stemmed list of words
    tokens = [stemmer.stem(unicode(word)) for word in word_tokenize(document_string.lower()) \
            if not (unicode(word.lower()) in stops or unicode(word.lower()) in list(string.punctuation))]
    
    return tokens + list(ngrams(tokens, 2))
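
The last line of the tokenizer returns the stemmed unigrams followed by their bigrams. As a minimal sketch of what that looks like, the following uses `zip` from the standard library, which for `n = 2` and no padding produces the same tuples as `nltk.util.ngrams`. The list `tokens` is a hand-written example of already stemmed tokens, not real tokenizer output.

```python
# Hand-written example of stemmed tokens (an assumption for illustration)
tokens = ["love", "raini", "day"]

# Pair each token with its successor to obtain the bigrams
bigrams = list(zip(tokens, tokens[1:]))

# Unigrams and bigrams end up in one feature list, as in custom_tokenizer
features = tokens + bigrams

print(features)
# ['love', 'raini', 'day', ('love', 'raini'), ('raini', 'day')]
```

Note that the bigrams are tuples while the unigrams are strings, so both kinds of feature can live in the same vocabulary without colliding.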

Let's have a look at the word counts again:

In [9]:
Counter(custom_tokenizer(full_tweet_string)).most_common(25)

[(u'weather', 34001),
 (u'...', 17465),
 (u"'s", 12792),
 (u'rt', 9189),
 (u'day', 8824),
 (u'storm', 7920),
 (u'sunni', 6601),
 (u'hot', 5952),
 (u'today', 5612),
 (u'outsid', 5357),
 (u'like', 4831),
 (u"n't", 4714),
 (u'rain', 4655),
 (u'sunshin', 4587),
 (u'get', 4550),
 (u'degre', 4441),
 (u'thunderstorm', 4388),
 (u'feel', 4339),
 (u'humid', 4229),
 (u'go', 4164),
 (u'cold', 4041),
 (u"'m", 4032),
 (u'wind', 3913),
 (u'raini', 3889),
 (u'good', 3766)]

This already looks much better, and we can identify weather-related terms such as 'storm', 'sunni', etc.
This will be a good starting point for the rest of the model.

###Model generation

One step that is still missing is how to use the *custom_tokenizer* to actually create feature vectors. Luckily, sklearn provides us with the right functionality to do just that.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df = 100, max_df = 10000, tokenizer = custom_tokenizer)

To ensure that we always use the same vectorization options, we need to "fit" the vectorizer to a set of tweets so that any new example is vectorized with the same vocabulary. Here we fit it on the full set, from which we will also extract the cross-validation data. We should be aware of the bias this introduces: the held-out tweets can never contain a previously unseen term, which is not the case when we run a real test set prediction.

In [11]:
# fit the vectorizer to the full set
X = vectorizer.fit_transform(raw_df.tweet.tolist())

In [12]:
X.shape

(77946, 1260)
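
To illustrate what fitting means for unseen terms, here is a minimal sketch on made-up toy sentences (not the competition data). It assumes the default `CountVectorizer` tokenizer instead of `custom_tokenizer`, so that it runs without any nltk downloads. A term that was not present during `fit` is silently dropped by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training sentences
train_docs = ["sunny day", "rainy day", "stormy night"]

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(train_docs)

# The learned vocabulary; column indices follow alphabetical order
vocab = sorted(toy_vectorizer.vocabulary_)

# 'blizzard' was never seen during fit, so it does not get a column
row = toy_vectorizer.transform(["sunny blizzard day"]).toarray()[0].tolist()

print(vocab)  # ['day', 'night', 'rainy', 'stormy', 'sunny']
print(row)    # [1, 0, 0, 0, 1]
```

The number of columns is fixed by `fit`; in the notebook it is additionally limited by `min_df` and `max_df`, which discard terms that occur in too few or too many tweets.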

####Splitting the set and training the model

As the training data and the labels are now separate and of different types, it is tedious to split them into training and test sets manually. Again sklearn has a tool for us: the train_test_split function (part of the cross_validation module in older releases, and of model_selection since version 0.18).

In [13]:
from sklearn.model_selection import train_test_split

df_train, df_test, y_train, y_test = train_test_split(raw_df, label_df, test_size = 0.3)

To avoid the above case of having the "perfect" vocabulary, let us now refit the vectorizer using only the training set:

In [14]:
# fit the vocabulary on the training tweets only, then reuse it for the held-out tweets
X_train = vectorizer.fit_transform(df_train.tweet.tolist())
X_test = vectorizer.transform(df_test.tweet.tolist())

We can now start to train the model. As this is a highly sparse problem, it lends itself to a Naive Bayes classifier.

In [15]:
from sklearn.naive_bayes import MultinomialNB

multi_nb_clf = MultinomialNB()
multi_nb_clf.fit(X_train, y_train)

multi_nb_clf.score(X_test, y_test)

0.60549948682860077

With only single terms (unigrams) and bigrams (which, interestingly enough, seem to have no impact at this level), and without any more fine-grained modeling, we already classify about 60% of the held-out tweets into the right category. This is a promising start, as there are still many easy and obvious ways to improve the classification. Some of them are:

* include higher n-grams in the tokenizer
* train models per state or per city
* include state and city as features
* include TF-IDF vectorizers
* use a different classifier?
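
As an illustration of the TF-IDF idea from the list above, the following is a minimal sketch on made-up toy data. The tweets in `toy_tweets` and the labels in `toy_labels` are hypothetical and do not correspond to the competition's categories, and the default tokenizer is used instead of `custom_tokenizer`. It is not a tuned model, only one possible way to wire the pieces together.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up tweets with hypothetical labels (1 and 4 are arbitrary here)
toy_tweets = [
    "i love this sunny day",
    "sunshine makes me happy",
    "i hate this cold rain",
    "awful storm and cold wind",
]
toy_labels = [4, 4, 1, 1]

# TF-IDF weighted unigrams and bigrams feeding a Multinomial Naive Bayes model
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(toy_tweets, toy_labels)

pred = model.predict(["cold storm", "happy sunshine"]).tolist()
print(pred)
```

Putting the vectorizer and the classifier into one pipeline also ensures that the vocabulary is only ever fitted on the data passed to `fit`, which avoids the vocabulary bias discussed earlier.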

##Miscellanea
###References

References are mostly given throughout the text. The most important ones are:

- [1] [Partly Sunny with a Chance of Hashtags](https://www.kaggle.com/c/crowdflower-weather-twitter)
- [2] [scikit-learn](http://scikit-learn.org/stable/index.html)
- [3] [nltk](http://www.nltk.org/index.html)
