# Sentiment classification on tweets about airlines

This notebook describes an attempt to classify tweets by sentiment. It covers the initial data exploration as well as the implementation of a classifier.

First we start by importing some necessary tools.

## What is in the dataset?

It's always good to start by exploring the data that we have available. To do this we load the raw CSV file using [Pandas][1] and check what the columns are.

  [1]: http://pandas.pydata.org/

In [None]:
import pandas as pd
rawData = pd.read_csv("../input/Tweets.csv")
list(rawData.columns.values)

We want to be able to determine the sentiment of a tweet without any other information but the tweet text itself, hence the 'text' column is our focus. Using the text we are going to try and predict 'airline_sentiment'. We also need to take into account 'airline_sentiment_confidence', but we will come back to that.

Let's take a look at what a typical record looks like.

In [None]:
rawData.head()

Let's take a look at which sentiments have been found.

In [None]:
sentiment_counts = rawData.airline_sentiment.value_counts()
number_of_tweets = rawData.tweet_id.count()
print(sentiment_counts)

It turns out that our dataset is skewed, with significantly more negative than positive tweets. We will focus on separating positive from negative tweets. Keep in mind that a classifier that always guessed "negative" would be right about 80% of the time on these two classes (9178 of 11541). That clearly wouldn't be a very useful classifier, but it is a baseline worth remembering.
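
The baseline figure can be checked with a couple of lines. This is a minimal sketch that hard-codes the counts quoted above instead of reading them from `sentiment_counts`; the variable names are made up for illustration.

```python
# Counts quoted in the text above (hard-coded here; in the notebook they
# would come from sentiment_counts)
negative_count = 9178
negative_plus_positive = 11541

# Accuracy of a classifier that always predicts "negative"
baseline_accuracy = negative_count / negative_plus_positive
print(f"Majority-class baseline: {baseline_accuracy:.1%}")
```

Any classifier we build should beat this number to be worth the effort.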

# Let's explore the text

We begin by checking which words are common in each of the classes. To investigate this we want to preprocess our data a little. Let's get rid of the most common English words (stop words) and punctuation.

In [None]:
# We need some ugly code to suppress deprecation warnings resulting from nltk on Kaggle
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# What characterizes text of different sentiments?

While we still haven't decided what classification method to use, it's useful to get an idea of how the different texts look. This might be an "old school" approach in the age of deep learning, but let's indulge ourselves nevertheless.

To explore the data we apply some crude preprocessing. We will tokenize and lemmatize using [Python NLTK][1], and transform to lower case. As words mostly matter in context we'll look at n-grams (3-grams below) instead of just individual tokens.
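
To make the n-gram idea concrete, here is a small sketch in plain Python. The helper `word_ngrams` is a hypothetical stand-in written for illustration; the notebook itself relies on `nltk.ngrams` for the same job.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["flight", "was", "cancelled", "again"]

# Pairs and triples of neighbouring words
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Longer n-grams capture more context, but each individual n-gram occurs less often, so counts become sparser as n grows.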

### Preprocessing
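Note that we remove the first two tokens of each tweet, with the intent of dropping the leading "@ airline_name" mention. Because non-letter characters are replaced with spaces before tokenizing, the "@" is already gone at that point, so the slice also drops the first word after the mention.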
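
To see what the letter-only substitution and the two-token slice do to a tweet, here is a small sketch. It uses a plain whitespace split as an approximation of `nltk.word_tokenize`, and the example tweet is only illustrative.

```python
import re

tweet = "@VirginAmerica plus you've added commercials to the experience... tacky."

# Same substitution as in normalize_tweet: every non-letter becomes a space
only_letters = re.sub("[^a-zA-Z]", " ", tweet)

# Whitespace split stands in for nltk.word_tokenize (an approximation)
tokens = only_letters.split()

# First tokens before and after dropping the first two
print(tokens[:3])
print(tokens[2:][:3])
```

Comparing the two printed lists shows exactly which tokens the slice discards.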


  [1]: http://www.nltk.org/

In [None]:
import re, nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()
negative_tweets = rawData.loc[rawData['airline_sentiment'] == 'negative'].text
def normalize_tweet(tweet):
    # Replace everything except ASCII letters with a space
    only_letters = re.sub("[^a-zA-Z]", " ", tweet)
    tokens = nltk.word_tokenize(only_letters)
    # Skip the first two tokens, then lemmatize the rest
    lemmas = [wordnet_lemmatizer.lemmatize(t) for t in tokens[2:]]
    lower_case = [l.lower() for l in lemmas]
    # Remove English stop words
    filtered_result = [l for l in lower_case if l not in stop_words]
    return filtered_result

preprocessed_negative_tweets = negative_tweets.apply(normalize_tweet)
print('Example preprocessed tweet:\n', preprocessed_negative_tweets.iloc[0])

In [None]:
from nltk import ngrams
def grams(tokens):
    return list(ngrams(tokens, 3))
negative_grams = preprocessed_negative_tweets.apply(grams)

And now some counting.

In [None]:
import collections
def count_words(rows):
    # Count every n-gram across all rows (one row per tweet)
    cnt = collections.Counter()
    for row in rows:
        for word in row:
            cnt[word] += 1
    return cnt

count_words(negative_grams).most_common(20)

We can already tell there's a pattern here. Phrases like "cancelled flight", "late flight", "booking problems" and "delayed flight" stand out clearly. Let's check the positive tweets.

In [None]:
positive_tweets = rawData.loc[rawData['airline_sentiment'] == 'positive'].text
preprocessed_positive_tweets = positive_tweets.apply(normalize_tweet)
positive_grams = preprocessed_positive_tweets.apply(grams)
count_words(positive_grams).most_common(20)

Some more good-looking patterns here. We can however see that with 3-grams clear patterns are rare. "great customer service" occurs 12 times in 2363 positive tweets, which really doesn't say much in general.

Satisfied that our data looks possible to work with, we begin to construct our first classifier.

# First Classifier
Let's start simple with a bag-of-words Support Vector Machine (SVM) classifier. Bag-of-words means that we represent each sentence by the words in it, ignoring their order. To make this representation useful for our SVM classifier we transform each sentence into a vector. The vector has the same length as our vocabulary, i.e. the list of all words observed in our training data, with each entry corresponding to one word. If a particular word is present, that entry in the vector is 1, otherwise 0. (With its default settings, the CountVectorizer used below stores the number of occurrences rather than a 0/1 flag; the two only differ when a word is repeated within a tweet.)
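
As an illustration of the binary bag-of-words representation described above, here is a hand-rolled sketch on two toy sentences. It is not how `CountVectorizer` is implemented; the sentences and the helper `to_binary_vector` are made up for the example.

```python
sentences = ["late flight again", "great flight great crew"]

# Vocabulary: every distinct word seen in the (toy) data, in sorted order
vocabulary = sorted({word for s in sentences for word in s.split()})

def to_binary_vector(sentence):
    """1 if the vocabulary word occurs in the sentence, otherwise 0."""
    words = set(sentence.split())
    return [1 if word in words else 0 for word in vocabulary]

vectors = [to_binary_vector(s) for s in sentences]
print(vocabulary)
for v in vectors:
    print(v)
```

Each vector has one entry per vocabulary word, so real tweet data yields very long and mostly empty vectors, which is why a sparse matrix is used in practice.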

To create these vectors we use the CountVectorizer from [sklearn][1]. Note that we make sure to have an index that maps vectorized data back to the original sentence. This will come in handy when inspecting output from the classifier later.
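
The idea of carrying an index along with the features can be sketched in plain Python. The sentences, feature vectors and variable names below are all hypothetical; the notebook does the equivalent with a sparse matrix and `train_test_split`.

```python
import random

sentences = ["late flight", "great crew", "lost luggage", "thanks for the help"]
vectors = [[1, 0], [0, 1], [1, 0], [0, 1]]  # made-up feature vectors

# Prepend each row's position so it survives shuffling and splitting
indexed_rows = [[i] + vec for i, vec in enumerate(vectors)]

rng = random.Random(0)
rng.shuffle(indexed_rows)
test_rows = indexed_rows[:2]

# Strip the index column again before handing rows to a model ...
test_index = [row[0] for row in test_rows]
test_features = [row[1:] for row in test_rows]

# ... and use it to look up the original sentence for any test row
recovered = [sentences[i] for i in test_index]
print(recovered)
```

Without the extra column, the shuffled rows could no longer be traced back to the text they came from.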


  [1]: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

## Preparing the data

In [None]:
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
# Series.as_matrix() was removed in pandas 1.0; tolist() gives the same list of strings
negative_data = preprocessed_negative_tweets.apply(' '.join).tolist()
positive_data = preprocessed_positive_tweets.apply(' '.join).tolist()
negative_targets = np.zeros((len(negative_data),1))
positive_targets = np.ones((len(positive_data),1))
raw_data = negative_data+positive_data
vectorized_data = count_vectorizer.fit_transform(raw_data)
targets = np.concatenate((negative_targets,positive_targets), axis=0).ravel()
indexed_data = hstack((np.array(range(0,vectorized_data.shape[0]))[:,None], vectorized_data))

To check the performance of our classifier we want to split our data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
data_train, data_test, targets_train, targets_test = train_test_split(indexed_data, targets, test_size=0.4, random_state=0)
data_train_index = data_train[:,0]
data_train = data_train[:,1:]
data_test_index = data_test[:,0]
data_test = data_test[:,1:]

## Fitting a classifier

We're now ready to fit a classifier to our data. We'll spend more time on hyperparameter tuning later, so for now we just pick some reasonable guesses.

In [None]:
from sklearn import svm
clf = svm.SVC(gamma=0.01, C=100., probability=True)
clf_settings = clf.fit(data_train, targets_train)

## Evaluation of results

In [None]:
clf.score(data_test, targets_test)

87% accuracy isn't great given the roughly 80% majority-class baseline, but it's not nothing. It's most likely possible to achieve a higher score with more tuning, or a more advanced approach. Let's check how it does on a couple of sentences.

In [None]:
sentences = count_vectorizer.transform([
    "What a great airline, the trip was a pleasure!",
    "My issue was quickly resolved after calling customer support. Thanks!",
    "What the hell! My flight was cancelled again. This sucks!",
    "Service was awful. I'll never fly with you again.",
    "You fuckers lost my luggage. Never again!",
    "I have mixed feelings about airlines. I don't know what I think.",
    ""
])
clf.predict_proba(sentences)

So while not a huge improvement over the baseline, we can see that it's doing a good job on these obvious sentences. 

## What is hard for the classifier?

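It's interesting to know which sentences are hard. To find out, let's apply the classifier to all our test sentences and sort them by the absolute difference between the two predicted class probabilities. A small difference means the classifier is unsure.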
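
The ranking idea is easier to see on a handful of made-up numbers. This is a minimal sketch with hypothetical probability pairs in the shape `predict_proba` returns for two classes; it is not the notebook's actual output.

```python
# Hypothetical [P(negative), P(positive)] pairs, one per test sentence
probabilities = [[0.95, 0.05], [0.52, 0.48], [0.10, 0.90], [0.40, 0.60]]

# Gap between the two class probabilities; a small gap means an uncertain prediction
gaps = [abs(p_neg - p_pos) for p_neg, p_pos in probabilities]

# Row positions ordered from most uncertain to most confident
order = sorted(range(len(gaps)), key=lambda i: gaps[i])
print(order)
```

The first entries of `order` point at the hardest sentences and the last entries at the easiest ones.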

In [None]:
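# Class probabilities per test sentence: [P(negative), P(positive)]
predictions_on_test_data = np.array(clf.predict_proba(data_test))
index = np.transpose(np.array([range(0,len(predictions_on_test_data))]))
indexed_predictions = np.concatenate((predictions_on_test_data, index), axis=1).tolist()
# Sort by the gap between the two probabilities. The row position became a float
# when concatenated with the probabilities, so cast it back to int for indexing.
hardest_test_sentences = sorted(list(map(lambda p : [abs(p[0]-p[1]), int(p[2])], indexed_predictions)), key=lambda p : p[0])
list(map(lambda p : raw_data[data_test_index[p[1]].toarray()[0][0]], hardest_test_sentences[0:20]))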

How about the easiest test sentences?

In [None]:
list(map(lambda p : raw_data[data_test_index[p[1]].toarray()[0][0]], hardest_test_sentences[-20:]))

In [None]:
list(map(lambda p : clf.predict_proba(data_test[p[1]]), hardest_test_sentences[-20:]))