## Airline Tweets NLP Analysis

This document shows the results of basic natural language processing (NLP) analysis on Twitter tweets about major US airlines scraped from the site during part of February 2015. Specifically, I create a word cloud and conduct sentiment analysis.

Contributors to the data set were asked to classify positive, negative, and neutral tweets.
Thus, for each tweet, I have the 'correct' answer for sentiment analysis purposes.

The data can be found at the URL below. To find the dataset, search for 'Airline' on the page.  
I specifically use the 16,000 row dataset uploaded on February 12, 2015 by CrowdFlower.  
I assume the upload date is incorrect as the data includes tweets from after 2/12/2015...

https://www.crowdflower.com/data-for-everyone/

Note that the actual dataset only contains 14,640 rows. I'm not sure where the discrepancy comes from, but it doesn't affect the analysis.

In the cell below, I import modules for the analysis and the data. Note that the file path is specific to my machine and may need to be modified if this code is run elsewhere.

In [1]:
# Import modules.
import pandas as pd
import wordcloud
from stemming.porter2 import stem
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.naive_bayes as sklearn_nb

# Import data
tweet_data = pd.read_csv('Documents/Github/airline-tweets-nlp-and-machine-learning/Airline-Sentiment-2-w-AA.csv', 
                         encoding = 'latin_1')

# Remove unneeded columns.
tweet_data = tweet_data[['airline_sentiment', 'text']]

Below is a sample of the data. Unfortunately, in this view, we can only see the beginning of the tweet text.

In [2]:
# View head of data.
tweet_data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials t...
2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,negative,@VirginAmerica it's really aggressive to blast...
4,negative,@VirginAmerica and it's a really big bad thing...


Now, I'll do some data cleaning on 'tweet_data.text'
- make all characters lowercase
- remove unneeded characters
- remove the airline Twitter handles

In [3]:
# Make tweets lowercase.
tweet_data.text = tweet_data.text.str.lower()

# Remove unneeded characters.
tweet_data.text = tweet_data.text.str.replace('[^\w\s]', 
                                              '')

# Remove airline Twitter handles.
# Note that I have not removed stopwords.
# This removal is done when creating the word cloud.
# Stemming is done in the next section.
tweet_data.text = tweet_data.text.str.replace('virginamerica', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('united', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('southwestair', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('jetblue', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('usairways', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('americanair', 
                                              '')

# Remove numbers from tweet data.
tweet_data.text = tweet_data.text.str.replace('0', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('1', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('2', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('3', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('4', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('5', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('6', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('7', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('8', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('9', 
                                              '')

### Word Cloud:

The code in this section creates a wordcloud of the text in the tweet data.   
Now, I stem the words in 'tweet_data_string'.

In [4]:
# For each 'tweet_data.text' create a 'split_tweet' column that is a list with 1 entry for each word.
# The resulting lists contain empty values.
tweet_data['split_tweet'] = tweet_data.text.str.split(' ')

# For each row in 'tweet_data', create a 'stemmed_text' column.
# This column is a string of the stemmed words in 'tweet_data.split_tweet'.
# First, I create the column as an empty string and then I actually fill it in.
tweet_data['stemmed_text'] = ''

for i in range(0, len(tweet_data.index)):
    
    for j in range(0, len(tweet_data.split_tweet[i])):
        
        if len(tweet_data.loc[i, 'split_tweet'][j]) > 0:
            tweet_data.stemmed_text[i] = tweet_data.stemmed_text[i] + ' ' + stem(tweet_data.loc[i, 'split_tweet'][j])

In [5]:
# Turn 'tweet_data.stemmed_text' into a string for purposes of creating the word cloud.
stemmed_text_string = str(tweet_data.stemmed_text)

# Create word cloud of the top 50 words (technically stems) in the tweet data (and remove stopwords).
# The wordcloud might look a bit odd because I'm using stemmed words.
tweet_data_wordcloud = wordcloud.WordCloud(background_color = 'black', 
                                max_words = 50, 
                                stopwords = wordcloud.STOPWORDS)
tweet_data_wordcloud.generate(stemmed_text_string)
plt.imshow(tweet_data_wordcloud)
plt.axis("off") # Remove graph axes
plt.show()



The output doesn't show up here, but the PNG file in the repository named 'airline_tweet_analysis_wordcloud.png' contains the wordcloud resulting from the code above. The larger a word, the more frequently it appears in the tweet data. Some of the terms in the word cloud may look odd because I'm using stemmed words.

### Sentiment Analysis:

I conduct sentiment analysis in 2 ways:  
* Lexicon based (with pre-provided lists of positive and negative terms)   
* Naive Bayes Classification Model

#### Lexicon Based:

First, I create my own custom sets of positively and negatively associated words and use these for lexicon-based sentiment analysis. These lists are created using intuition and looking at some example tweets (so slight cheating/overfitting).

I will then count the number of positive and negative words in each tweet and use these counts to create a score to classify the tweets as 'positive', 'negative', or 'neutral'. Then, I'll compare my classification to the provided "answers".

In [6]:
# Create sets of positively and negatively associated terms.
positive_terms = {'amazing', 'good', 'great', 'awesome', 'thank', 'thanks', 'love', 'excited', 'amazing', 'polite', 'courteous', 'friendly', 'incredible'}
negative_terms = {'tacky', 'aggressive', 'obnoxious', 'bad', 'delay', 'worst', 'awful', 'cancel', 'cancelled', 'shitty', 'mess', 'fantastic', 'rude', 'mean', 'unfriendly'}

I will work with 'tweet_data.split_tweet'. This column is the tweets split up by word. This column has already been made lowercase, had unneeded characters removed, and had airline Twitter handles removed. Including stopwords does NOT affect this particular model (I'll just be counting positive and negative words in each tweet) and so I leave them in. 

Stemming the data obviously changes the words and so makes it difficult to create a custom list for classification. For this reason, I will not use stemmed tweet data for this model.

Next, I create 'positive_words' and 'negative_words' columns that count the number of positive and negative words in each cleaned tweet.

In [7]:
# Create 'positive_words' and 'negative_words' columns.
# These columns are initially populated with 'NA' and are correctly filled in below.
tweet_data['positive_words'] = 0
tweet_data['negative_words'] = 0

# Count 'positive_words' and 'negative_words' in each cleaned tweet.
# I loop through each row and use a nested loop to count positive and negative words in each 'split_tweet'.
# This step also takes a little while.
for i in range(0, len(tweet_data.index)):
    
    # Set count of positive and negative words to 0 for each row.
    positive_count = 0
    negative_count = 0
    
    for j in range(0, len(tweet_data.split_tweet[i])):
        
        if tweet_data.loc[i, 'split_tweet'][j] in positive_terms:
            positive_count += 1
        elif tweet_data.loc[i, 'split_tweet'][j] in negative_terms:
            negative_count += 1
    
    tweet_data.loc[i, 'positive_words'] = positive_count
    tweet_data.loc[i, 'negative_words'] = negative_count

The scoring metric I use is polarity and is computed as: (p - n) / (p + n)   
p[n] is the number of positive[negative] words in a tweet.

For each tweet, if polarity is less[greater] than 0, the tweet will be classified as negative[positive].   
Tweets with a polarity of 0 are classified as neutral.
Now I compute the polarity and classification for each tweet.

In [8]:
# Compute 'polarity' for each tweet.
tweet_data['polarity'] = (tweet_data.positive_words - tweet_data.negative_words) / (tweet_data.positive_words + tweet_data.negative_words)

# Classify each tweet as 'postive', 'negative', or 'neutral'.
tweet_data['lexicon_class'] = np.where(tweet_data.polarity > 0, 'positive', 
                                      np.where(tweet_data.polarity < 0, 'negative', 'neutral'))

Now, we can see lexicon-based sentiment analysis model results.

In [9]:
# Overall accuracy.

# Print total number of tweets.
print('Total Tweets: ' +
     str(len(tweet_data.index)))

# Compute total accuracy.
print('Overall Accuracy: ' + 
      str(round(100 * len(tweet_data[tweet_data.airline_sentiment == tweet_data.lexicon_class]) / len(tweet_data.index), 2)) + 
     '%')

# Compute accuracy by tweet classification category.

# Positive tweets.

# Create helper data frame.
positive = tweet_data[tweet_data.airline_sentiment == 'positive']

# Print total number of positive tweets.
print('Number of Positive Tweets: ' +
      str(len(positive.index)))

# Print positive tweet number and accuracy.
print('Accuracy on "positive" tweets (according to "answers"): ' +
      str(round(100 * len(positive[positive.airline_sentiment == positive.lexicon_class]) / len(positive.index), 2)) + 
     '% (' +
     str(len(positive[positive.airline_sentiment == positive.lexicon_class])) +
     ' positive tweets classified correctly' +
     ')')

# Neutral tweets.

# Create helper data frame.
neutral = tweet_data[tweet_data.airline_sentiment == 'neutral']

# Print total number of neutral tweets.
print('Number of Neutral Tweets: ' + 
      str(len(neutral.index)))

# Print neutral tweet number and accuracy.
print('Accuracy on "neutral" tweets (according to "answers"): ' +
      str(round(100 * len(neutral[neutral.airline_sentiment == neutral.lexicon_class]) / len(neutral.index), 2)) + 
     '% (' + 
     str(len(neutral[neutral.airline_sentiment == neutral.lexicon_class])) +
      ' neutral tweets classified correctly' +
     ')')

# Negative tweets.

# Create helper data frame.
negative = tweet_data[tweet_data.airline_sentiment == 'negative']

# Print total number of negative tweets.
print('Number of Negative Tweets: ' + 
      str(len(negative.index)))

# Print negative tweet number and accuracy.
print('Accuracy on "negative" tweets (according to "answers"): ' +
      str(round(100 * len(negative[negative.airline_sentiment == negative.lexicon_class]) / len(negative.index), 2)) +
     '% (' + 
     str(len(negative[negative.airline_sentiment == negative.lexicon_class])) + 
      ' negative tweets classified correctly' +
     ')')

Total Tweets: 14640
Overall Accuracy: 38.41%
Number of Positive Tweets: 2363
Accuracy on "positive" tweets (according to "answers"): 59.42% (1404 positive tweets classified correctly)
Number of Neutral Tweets: 3099
Accuracy on "neutral" tweets (according to "answers"): 85.48% (2649 neutral tweets classified correctly)
Number of Negative Tweets: 9178
Accuracy on "negative" tweets (according to "answers"): 17.11% (1570 negative tweets classified correctly)


This version of lexicon based sentiment analysis is actually more accurate over the entire data set than the version used in the R branch. In that version, I used a pre-provided list of positively and negatively associated terms.

**Further Exploration**:

These results are not great. They might improve if I expanded the list of positive/negative terms I used. Also, results might improve by using different classification thresholds. Actually, the way I did classification above does not take into account the denominator of the polarity at all. Finally, using a different scoring metric might yield better results.

#### Naive Bayes Classification Model:

I use 'tweet_data.stemmed_text' in this section. In a previous section, this column was made lowercase, had unneeded characters removed, had airline tweet handles removed, and was stemmed. Below, I remove English stopwords.

In [10]:
# Clean up unneeded columns ('positive_words', 'negative_words', 'polarity').
# This command can be optionally removed with no adverse effects.
tweet_data = tweet_data[['airline_sentiment', 'text', 'split_tweet', 'stemmed_text', 'lexicon_class']]

# Create document term matrix from 'tweet_data'.
# In the R branch, I use information gain for each feature to do feature selection.
# In that branch, features giving no information are removed.
# In this branch, I do not compute information gain.

# Create 'count_vectorizer'.
# This is necessary to create the document term matrix.
# I remove accents, English stopwords, and terms that appear in less than a certain number of documents.
# I am more aggressive in removing terms because I kept hitting memory errors when converting back to a pandas data frame.
count_vectorizer = CountVectorizer(strip_accents = 'unicode', 
                                  stop_words = 'english', 
                                  min_df = 8,
                                  binary = True)

# Create document term matrix.
tweet_tdm = count_vectorizer.fit_transform(tweet_data.text)

# Turn 'tweet_tdm' into a data frame.
tweet_tdm = pd.DataFrame(tweet_tdm.toarray(), 
                        columns = count_vectorizer.get_feature_names())

# Add 'tweet_data.airline_sentiment' to 'tweet_tdm'.
# These are the classification task answers.
tweet_tdm['airline_sentiment'] = tweet_data.airline_sentiment

In [37]:
# Create training and test sets from 'tweet_data'.
# Both training and test sets contain approximately half the data.

# Set seed for reproducibility.
np.random.seed(79)

# Randomly sample index numbers for the training set.
training_rows = np.random.choice(len(tweet_data.index), 
                                 size = int(len(tweet_data.index) / 2), 
                                 replace = False)

# Create training and test sets.
tweet_tdm_train = tweet_tdm.iloc[training_rows]
tweet_tdm_test = tweet_tdm.iloc[-training_rows]

# Create Naive Bayes classifier.
nb = sklearn_nb.MultinomialNB()

# Train model
nb.fit(tweet_tdm_train.drop('airline_sentiment', 
                            axis = 1), 
       tweet_tdm_train.airline_sentiment)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [70]:
# Get predictions for training and test data and add them to 'tweet_tdm_train' and 'tweet_tdm_test' respectively.
# This throws a 'SettingWithCopyWarning' that I can't quite get around right now.
# So for now, I'll ignore it.
tweet_tdm_train['nb_pred'] = nb.predict(tweet_tdm_train.drop('airline_sentiment', 
                                    axis = 1))
tweet_tdm_test['nb_pred'] = nb.predict(tweet_tdm_test.drop('airline_sentiment', 
                                                           axis = 1))

# NEED TO DO CROSS VALIDATION AND ERROR SUPRRESSION

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,_u,_u_,_ua,_ua_ua,_uo,_uo_,_uoa,aa,aadvantage,able,...,youre,youve,yr,yrs,yup,yyz,zero,zone,airline_sentiment,nb_pred
803,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,negative,negative
2079,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,negative,negative
10578,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,negative,negative
13823,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,negative,negative
7173,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,positive,negative


I remove terms that appear in less than 8 documents for the model.   
Below, I compute various accuracy scores.

In [78]:
# Print various accuracy scores.

# Print training set accuracy.
print('Training Set Accuracy: ' + 
     str(100 * round(len(tweet_tdm_train[tweet_tdm_train.airline_sentiment == tweet_tdm_train.nb_pred].index) / 
        len(tweet_tdm_train.index), 4)) + 
     '%')

# Print cross validation set accuracy:
#TO DO

# Print test set accuracy.
print('Test Set Accuracy: ' + 
     str(100 * round(len(tweet_tdm_test[tweet_tdm_test.airline_sentiment == tweet_tdm_test.nb_pred].index) / 
                    len(tweet_tdm_train.index), 4)) + 
     '%')

Training Set Accuracy: 81.61%
Test Set Accuracy: 78.51%
