## Airline Tweets NLP Analysis

This document shows the results of basic natural language processing (NLP) analysis on Twitter tweets about major US airlines scraped from the site during part of February 2015. Specifically, I create a word cloud and conduct sentiment analysis.

Contributors to the data set were asked to classify positive, negative, and neutral tweets.
Thus, for each tweet, I have the 'correct' answer for sentiment analysis purposes.

The data can be found at the URL below. To find the dataset, search for 'Airline' on the page.  
I specifically use the 16,000 row dataset uploaded on February 12, 2015 by CrowdFlower.  
I assume the upload date is incorrect as the data includes tweets from after 2/12/2015...

https://www.crowdflower.com/data-for-everyone/

Note that the actual dataset only contains 14,640 rows. I'm not sure where the discrepancy comes from, but it doesn't affect the analysis.

In the cell below, I import modules for the analysis and the data. Note that the file path is specific to my machine and may need to be modified if this code is run elsewhere.

In [82]:
# Import modules.
import pandas as pd
import wordcloud
from stemming.porter2 import stem
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.naive_bayes as sklearn_nb

# Import data
tweet_data = pd.read_csv('Documents/Github/airline-tweets-nlp-and-machine-learning/Airline-Sentiment-2-w-AA.csv', 
                         encoding = 'latin_1')

# Remove unneeded columns.
tweet_data = tweet_data[['airline_sentiment', 'text']]

Below is a sample of the data. Unfortunately, in this view, we can only see the beginning of the tweet text.

In [83]:
# View head of data.
tweet_data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials t...
2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,negative,@VirginAmerica it's really aggressive to blast...
4,negative,@VirginAmerica and it's a really big bad thing...


Now, I'll do some data cleaning on 'tweet_data.text'
- make all characters lowercase
- remove unneeded characters
- remove the airline Twitter handles

In [84]:
# Make tweets lowercase.
tweet_data.text = tweet_data.text.str.lower()

# Remove unneeded characters.
tweet_data.text = tweet_data.text.str.replace('[^\w\s]', 
                                              '')

# Remove airline Twitter handles.
# Note that I have not removed stopwords.
# This removal is done when creating the word cloud.
# Stemming is done in the next section.
tweet_data.text = tweet_data.text.str.replace('virginamerica', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('united', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('southwestair', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('jetblue', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('usairways', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('americanair', 
                                              '')

# For the rest of this section, I will turn 'tweet_data.text' into a single string.
tweet_data_string = str(tweet_data.text)

### Word Cloud:

The code in this section creates a wordcloud of the text in the tweet data.   
Now, I stem the words in 'tweet_data_string'.

In [85]:
# Split tweet words by spaces.
split_tweets = tweet_data_string.split(' ')

# Create new empty list to hold stemmed 'split_tweets'.
split_tweets_stemmed = []

# Stem the words in 'split_tweets'.
# There are empty list items, but the way I will proceed will make this irrelevant.
for word in split_tweets:
    split_tweets_stemmed.append(stem(word))
    
# Create 'stemmed_tweet_string' from 'split_tweets_stemmed'.
stemmed_tweet_string = ''
for word in split_tweets_stemmed:
    stemmed_tweet_string = stemmed_tweet_string + str(word) + ' '

In [86]:
# Create word cloud of the top 50 words (technically stems) in the tweet data (and remove stopwords).
tweet_data_wordcloud = wordcloud.WordCloud(background_color = 'black', 
                                max_words = 50, 
                                stopwords = wordcloud.STOPWORDS)
tweet_data_wordcloud.generate(stemmed_tweet_string)
plt.imshow(tweet_data_wordcloud)
plt.axis("off") # Remove graph axes
plt.show()



The output doesn't show up here, but the PNG file in the repository named 'airline_tweet_analysis_wordcloud.png' contains the wordcloud resulting from the code above. The larger a word, the more frequently it appears in the tweet data.

### Sentiment Analysis:

I conduct sentiment analysis in 2 ways:  
* Lexicon based (with pre-provided lists of positive and negative terms)   
* Naive Bayes Classification Model

#### Lexicon Based:

First, I create my own custom sets of positively and negatively associated words and use these for lexicon-based sentiment analysis. These lists are created using intuition and looking at some example tweets (so slight cheating/overfitting).

I will then count the number of positive and negative words in each tweet and use these counts to create a score to classify the tweets as 'positive', 'negative', or 'neutral'. Then, I'll compare my classification to the provided "answers".

In [87]:
# Create sets of positively and negatively associated terms.
positive_terms = {'amazing', 'good', 'great', 'awesome', 'thank', 'thanks', 'love', 'excited', 'amazing', 'polite', 'courteous', 'friendly', 'incredible'}
negative_terms = {'tacky', 'aggressive', 'obnoxious', 'bad', 'delay', 'worst', 'awful', 'cancel', 'cancelled', 'shitty', 'mess', 'fantastic', 'rude', 'mean', 'unfriendly'}

I will work with 'tweet_data.text'. This data has already been made lowercase, had unneeded characters removed, and had airline Twitter handles removed. Including stopwords does NOT affect this particular model (I'll just be counting positive and negative words in each tweet) and so I leave them in. 

Stemming the data obviously changes the words and so makes it difficult to create a custom list for classification. For this reason, I will not stem the tweet data for this model.

In [88]:
# For each 'tweet_data.text' create a 'split_tweet' column that is a list with 1 entry for each word.
# The resulting lists contain empty values, but this does not matter for this analysis.
tweet_data['split_tweet'] = tweet_data.text.str.split(' ')

Create 'positive_words' and 'negative_words' columns that count the number of positive and negative words in each cleaned tweet.

In [89]:
# Create 'positive_words' and 'negative_words' columns.
# These columns are initially populated with 'NA' and are correctly filled in below.
tweet_data['positive_words'] = 0
tweet_data['negative_words'] = 0

# Count 'positive_words' and 'negative_words' in each cleaned tweet.
# I loop through each row and use a nested loop to count positive and negative words in each 'split_tweet'.
# This step also takes a little while.
for i in range(0, len(tweet_data.text)):
    
    # Set count of positive and negative words to 0 for each row.
    positive_count = 0
    negative_count = 0
    
    for j in range(0, len(tweet_data.split_tweet[i])):
        
        if tweet_data.split_tweet[i][j] in positive_terms:
            positive_count += 1
        elif tweet_data.split_tweet[i][j] in negative_terms:
            negative_count += 1
    
    tweet_data.loc[i, 'positive_words'] = positive_count
    tweet_data.loc[i, 'negative_words'] = negative_count

The scoring metric I use is polarity and is computed as: (p - n) / (p + n)   
p[n] is the number of positive[negative] words in a tweet.

For each tweet, if polarity is less[greater] than 0, the tweet will be classified as negative[positive].   
Tweets with a polarity of 0 are classified as neutral.
Now I compute the polarity and classification for each tweet.

In [90]:
# Compute 'polarity' for each tweet.
tweet_data['polarity'] = (tweet_data.positive_words - tweet_data.negative_words) / (tweet_data.positive_words + tweet_data.negative_words)

# Classify each tweet as 'postive', 'negative', or 'neutral'.
tweet_data['lexicon_class'] = np.where(tweet_data.polarity > 0, 'positive', 
                                      np.where(tweet_data.polarity < 0, 'negative', 'neutral'))

Now, we can see lexicon-based sentiment analysis model results.

In [91]:
# Overall accuracy.

# Print total number of tweets.
print('Total Tweets: ' +
     str(len(tweet_data.index)))

# Compute total accuracy.
print('Overall Accuracy: ' + 
      str(round(100 * len(tweet_data[tweet_data.airline_sentiment == tweet_data.lexicon_class]) / len(tweet_data.index), 2)) + 
     '%')

# Compute accuracy by tweet classification category.

# Positive tweets.

# Create helper data frame.
positive = tweet_data[tweet_data.airline_sentiment == 'positive']

# Print total number of positive tweets.
print('Number of Positive Tweets: ' +
      str(len(positive.index)))

# Print positive tweet number and accuracy.
print('Accuracy on "positive" tweets (according to "answers"): ' +
      str(round(100 * len(positive[positive.airline_sentiment == positive.lexicon_class]) / len(positive.index), 2)) + 
     '% (' +
     str(len(positive[positive.airline_sentiment == positive.lexicon_class])) +
     ' positive tweets classified correctly' +
     ')')

# Neutral tweets.

# Create helper data frame.
neutral = tweet_data[tweet_data.airline_sentiment == 'neutral']

# Print total number of neutral tweets.
print('Number of Neutral Tweets: ' + 
      str(len(neutral.index)))

# Print neutral tweet number and accuracy.
print('Accuracy on "neutral" tweets (according to "answers"): ' +
      str(round(100 * len(neutral[neutral.airline_sentiment == neutral.lexicon_class]) / len(neutral.index), 2)) + 
     '% (' + 
     str(len(neutral[neutral.airline_sentiment == neutral.lexicon_class])) +
      ' neutral tweets classified correctly' +
     ')')

# Negative tweets.

# Create helper data frame.
negative = tweet_data[tweet_data.airline_sentiment == 'negative']

# Print total number of negative tweets.
print('Number of Negative Tweets: ' + 
      str(len(negative.index)))

# Print negative tweet number and accuracy.
print('Accuracy on "negative" tweets (according to "answers"): ' +
      str(round(100 * len(negative[negative.airline_sentiment == negative.lexicon_class]) / len(negative.index), 2)) +
     '% (' + 
     str(len(negative[negative.airline_sentiment == negative.lexicon_class])) + 
      ' negative tweets classified correctly' +
     ')')

Total Tweets: 14640
Overall Accuracy: 38.41%
Number of Positive Tweets: 2363
Accuracy on "positive" tweets (according to "answers"): 59.42% (1404 positive tweets classified correctly)
Number of Neutral Tweets: 3099
Accuracy on "neutral" tweets (according to "answers"): 85.48% (2649 neutral tweets classified correctly)
Number of Negative Tweets: 9178
Accuracy on "negative" tweets (according to "answers"): 17.11% (1570 negative tweets classified correctly)


This version of lexicon based sentiment analysis is actually more accurate over the entire data set than the version used in the R branch. In that version, I used a pre-provided list of positively and negatively associated terms.

**Further Exploration**:

These results are not great. They might improve if I expanded the list of positive/negative terms I used. Also, results might improve by using different classification thresholds. Actually, the way I did classification above does not take into account the denominator of the polarity at all. Finally, using a different scoring metric might yield better results.

#### Naive Bayes Classification Model:

I use 'tweet_data.text' in this section. In a previous section, 'tweet_data.text' was made lowercase, had unneeded characters removed, and had the airline tweet handles removed. Below, I remove English stopwords

In [95]:
# Clean up unneeded columns ('positive_words', 'negative_words', 'polarity').
# These commands can be optionally removed with no adverse effects.
tweet_data = tweet_data[['airline_sentiment', 'text', 'split_tweet', 'lexicon_class']]

# Create document term matrix from 'tweet_data'.

# Create 'count_vectorizer'.
# This is necessary to create the document term matrix.
# I remove accents, English stopwords, and terms that appear in less than 5 documents.
count_vectorizer = CountVectorizer(strip_accents = 'unicode', 
                                  stop_words = 'english', 
                                  min_df = 5, 
                                  binary = True)

# Create document term matrix.
tweet_tdm = count_vectorizer.fit_transform(tweet_data.text)
tweet_data.head()

Unnamed: 0,airline_sentiment,text,split_tweet,lexicon_class
0,neutral,what dhepburn said,"[, what, dhepburn, said]",neutral
1,positive,plus youve added commercials to the experienc...,"[, plus, youve, added, commercials, to, the, e...",negative
2,neutral,i didnt today must mean i need to take anothe...,"[, i, didnt, today, must, mean, i, need, to, t...",negative
3,negative,its really aggressive to blast obnoxious ente...,"[, its, really, aggressive, to, blast, obnoxio...",negative
4,negative,and its a really big bad thing about it,"[, and, its, a, really, big, bad, thing, about...",negative


In [53]:
# Create Naive Bayes classifier.
nb = sklearn_nb.MultinomialNB()

# Create training and test sets from 'tweet_data'.
# Both training and test sets contain approximately half the data.

# Set seed for reproducibility.
np.random.seed(79)

# Randomly sample index numbers for the training set.
training_rows = np.random.choice(len(tweet_data.index), 
                                 size = int(len(tweet_data.index) / 2), 
                                 replace = False)

# Create training and test sets.
tweet_tdm_train = tweet_data.iloc[training_rows]
tweet_tdm_test = tweet_data.iloc[-training_rows]

14640

In [36]:
# Stem each word in each list of 'tweet_data.split_tweet'. This takes a little while.
# I loop through each row, and then stem each element of 'split_tweet' in that row (via a nested loop).
for i in range(0, len(tweet_data.text)):
    for j in range(0, len(tweet_data.split_tweet[i])):
        tweet_data.split_tweet[i][j] = stem(tweet_data.split_tweet[i][j])

AttributeError: 'DataFrame' object has no attribute 'split_tweet'