# Airline Tweets NLP Analysis

This document shows the results of basic natural language processing (NLP) analysis on tweets about major US airlines, scraped from Twitter during part of February 2015. Specifically, I create a word cloud and conduct sentiment analysis.

Contributors to the data set were asked to classify each tweet as positive, negative, or neutral.
Thus, for each tweet, I have the 'correct' answer for sentiment analysis purposes.

The data can be found at the URL below. To find the dataset, search for 'Airline' on the page.  
I specifically use the 16,000 row dataset uploaded on February 12, 2015 by CrowdFlower.  
I assume the upload date is incorrect, as the data includes tweets from after 2/12/2015.

https://www.crowdflower.com/data-for-everyone/

Note that the actual dataset only contains 14,640 rows. I'm not sure where the discrepancy comes from, but it doesn't affect the analysis.

In the cell below, I import modules for the analysis and the data. Note that the file path is specific to my machine and may need to be modified if this code is run elsewhere.

In [111]:
# Import modules.
import pandas as pd
import wordcloud
from stemming.porter2 import stem
import matplotlib.pyplot as plt

# Import data
tweet_data = pd.read_csv('Documents/Github/airline-tweets-nlp-and-machine-learning/Airline-Sentiment-2-w-AA.csv', 
                         encoding = 'latin_1')

# Remove unneeded columns.
tweet_data = tweet_data[['airline_sentiment', 'text']]

Below is a sample of the data. Unfortunately, in this view, we can only see the beginning of the tweet text.

In [112]:
# View head of data.
tweet_data.head()

|   | airline_sentiment | text |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |


Now, I'll do some data cleaning on 'tweet_data.text':
- make all characters lowercase
- remove characters that aren't letters, digits, underscores, or whitespace
- remove the airline Twitter handles

In [113]:
# Make tweets lowercase.
tweet_data.text = tweet_data.text.str.lower()

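# Remove all characters that are not word characters (letters, digits, underscore) or whitespace.
# regex=True is passed explicitly because pandas 2.0+ treats the pattern as a literal string by default.
tweet_data.text = tweet_data.text.str.replace(r'[^\w\s]', 
                                              '', 
                                              regex = True)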

# Remove airline Twitter handles.
# Note that I have not removed stopwords.
# This removal is done when creating the word cloud.
# Stemming is done in the next section.
tweet_data.text = tweet_data.text.str.replace('virginamerica', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('united', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('southwestair', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('jetblue', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('usairways', 
                                              '')
tweet_data.text = tweet_data.text.str.replace('americanair', 
                                              '')

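# For the rest of this section, I will turn 'tweet_data.text' into a single string.
# Note: str() on a Series returns its printed representation (index labels, a '...'
# truncation marker, and a dtype footer), not the full text, so I join the tweets instead.
tweet_data_string = ' '.join(tweet_data.text.astype(str))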
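
One side effect of replacing each handle as a plain substring is that ordinary words containing it are altered too (for example, 'reunited' becomes 're'). Below is a minimal sketch of one possible alternative, assuming the handles should be removed only where they appear as @-mentions, which means removing them before the punctuation is stripped. It uses the standard-library `re` module on a plain string rather than pandas, and `AIRLINE_HANDLES`, `clean_tweet`, and the sample tweet are illustrative names and data, not part of the original analysis.

```python
import re

# Airline handles as they appear after lowercasing (taken from the cell above).
AIRLINE_HANDLES = ["virginamerica", "united", "southwestair",
                   "jetblue", "usairways", "americanair"]

# Match a handle only when it directly follows "@", so ordinary words are left alone.
HANDLE_PATTERN = re.compile(r"@(?:" + "|".join(AIRLINE_HANDLES) + r")\b")

# Anything that is not a word character or whitespace.
PUNCT_PATTERN = re.compile(r"[^\w\s]")


def clean_tweet(text):
    """Lowercase, drop airline @-mentions, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = HANDLE_PATTERN.sub("", text)
    text = PUNCT_PATTERN.sub("", text)
    return " ".join(text.split())


# Made-up sample tweet for illustration.
sample = "@United we were reunited with our bags, thanks!"
print(clean_tweet(sample))  # we were reunited with our bags thanks
```

With pandas, the same function could be applied to every row with `tweet_data.text.apply(clean_tweet)`.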

Now, I stem the words in 'tweet_data_string'.

In [114]:
# Split tweet words on whitespace.
# Calling split() with no argument splits on any run of whitespace,
# so no empty list items are produced.
split_tweets = tweet_data_string.split()

# Stem the words in 'split_tweets'.
split_tweets_stemmed = [stem(word) for word in split_tweets]

# Create 'stemmed_tweet_string' from 'split_tweets_stemmed'.
stemmed_tweet_string = ' '.join(split_tweets_stemmed)
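
Splitting on a single space and splitting on whitespace in general behave differently when the text contains repeated spaces or line breaks, which is where empty list items come from. The short sketch below shows the difference on a made-up string; the variable names are illustrative only.

```python
# Made-up text with a double space and a line break.
text = "late  flight\ncancelled"

# Splitting on a single space keeps an empty string for the doubled space
# and leaves the line break inside a token.
by_space = text.split(' ')
print(by_space)       # ['late', '', 'flight\ncancelled']

# Splitting with no argument treats any run of whitespace as one separator.
by_whitespace = text.split()
print(by_whitespace)  # ['late', 'flight', 'cancelled']
```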

In [115]:
# Create word cloud of the top 50 words (technically stems) in the tweet data (and remove stopwords).
tweet_data_wordcloud = wordcloud.WordCloud(background_color = 'black', 
                                max_words = 50, 
                                stopwords = wordcloud.STOPWORDS)
tweet_data_wordcloud.generate(stemmed_tweet_string)
plt.imshow(tweet_data_wordcloud)
plt.axis("off") # Remove graph axes
plt.show()



The output doesn't show up here, but the repository includes the word cloud as a PNG file named 'airline_tweet_analysis_wordcloud.png'.
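
Because a word cloud sizes words by how often they occur, the same ranking can also be inspected as plain text. The following is a rough sketch using only the standard library, assuming a simple whitespace tokenization. `top_words`, `SAMPLE_STOPWORDS`, and the sample text are hypothetical, and the small stopword set is a stand-in for `wordcloud.STOPWORDS`, which is much larger.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption, not wordcloud.STOPWORDS).
SAMPLE_STOPWORDS = {"the", "a", "to", "is", "my", "and", "was"}


def top_words(text, n, stopwords):
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    words = [w for w in text.split() if w not in stopwords]
    return Counter(words).most_common(n)


# Made-up, already-stemmed sample text.
sample_text = "the flight delay and the flight was cancel to delay my flight"
print(top_words(sample_text, 2, SAMPLE_STOPWORDS))  # [('flight', 3), ('delay', 2)]
```

Calling `top_words(stemmed_tweet_string, 50, wordcloud.STOPWORDS)` would give a comparable top-50 list for the tweet data, although the word cloud library applies its own tokenization, so the two rankings may not match exactly.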