# Sentiment Analysis using Airline Tweets
Author: Matthew Huh

## Introduction

Companies such as airlines should care about what their customers say about them, especially when those customers can easily take their money to a competitor. I’ll be evaluating what types of comments people leave and whether each one is positive, negative, or somewhere in the middle. This feedback should tell companies what their customers think of them and how they are perceived.

## About the Data

The training data was obtained from CrowdFlower's dataset on Kaggle, and the testing set was obtained via Twitter's API from more recent tweets.

The training set carries far more information because people have reviewed each tweet to label its sentiment and, for negative tweets, the reason behind the complaint. That information may not be so easily extracted from the testing data set.

## Research Question

## Packages

## Source

https://www.kaggle.com/crowdflower/twitter-airline-sentiment


# To-do list

* Implement NLP methods (LSA, TF-IDF, LDA, NNMF, etc.)
* Run machine learning models
* Extract Twitter data for the testing data set

In [1]:
# Basic imports
import os
import time
import timeit
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Machine Learning packages
from sklearn import ensemble
from sklearn.feature_selection import chi2, f_classif, SelectKBest 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.preprocessing import normalize

# Natural Language processing
import nltk
import re
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.datasets import fetch_rcv1
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud, STOPWORDS

# Clustering packages
import sklearn.cluster as cluster
from sklearn.cluster import KMeans, MeanShift, estimate_bandwidth, SpectralClustering, AffinityPropagation
from scipy.spatial.distance import cdist

# Plotly packages
import plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly import tools
import cufflinks as cf
import ipywidgets as widgets
from scipy import special
py.offline.init_notebook_mode(connected=True)



In [2]:
# Import the data
airline_tweets = pd.read_csv("airline_tweets/Tweets.csv")

# Preview the dataset
airline_tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [3]:
# View the size of the dataset
airline_tweets.shape

(14640, 15)

This dataset has more information than we actually need for this project. We definitely need the text, since that is what we are evaluating; the sentiment, since that is what we are trying to measure; and the negative reason, to determine which clusters of complaints people are encountering. The remaining columns could have some impact on the outcome, but they are not what we are trying to measure, so they can be dropped before modelling. The cell below starts with the `tweet_id` identifier.

In [4]:
# Drop columns that have no predictive power
airline_tweets.drop(['tweet_id'], axis=1,inplace=True)
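
The narrowing described above could also be done by selecting the wanted columns into a new frame instead of dropping the unwanted ones. The sketch below is only an illustration: the toy `tweets` frame and the names `keep` and `slim` are hypothetical, and keeping `airline` alongside the three columns named in the text is an assumption made so that per-airline plots remain possible.

```python
import pandas as pd

# Toy frame standing in for airline_tweets (values are made up for illustration)
tweets = pd.DataFrame({
    "airline": ["Delta", "United"],
    "airline_sentiment": ["positive", "negative"],
    "negativereason": [None, "Late Flight"],
    "retweet_count": [0, 2],
    "text": ["@Delta great crew today", "@united two hours late again"],
})

# Columns named in the text, plus the airline (an assumption, for per-carrier plots)
keep = ["text", "airline_sentiment", "negativereason", "airline"]

# .copy() gives an independent frame, so the original stays untouched
slim = tweets[keep].copy()

print(slim.columns.tolist())
```

Selecting into a copy leaves the full frame available for the exploratory cells that follow.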

In [5]:
# Print unique airlines in the dataset
sorted(airline_tweets['airline'].unique())

['American', 'Delta', 'Southwest', 'US Airways', 'United', 'Virgin America']

In [6]:
# Count the unique values of each categorical variable
airline_tweets.select_dtypes(include=['object']).nunique()

airline_sentiment             3
negativereason               10
airline                       6
airline_sentiment_gold        3
name                       7701
negativereason_gold          13
text                      14427
tweet_coord                 832
tweet_created             14247
tweet_location             3081
user_timezone                85
dtype: int64

## Data Visualization

In [7]:
# View distribution of tweets by sentiment 
# (Changing colors to red/gray/green would be nice)
trace = go.Pie(labels=airline_tweets['airline_sentiment'].value_counts().index, 
              values=airline_tweets['airline_sentiment'].value_counts())

# Create the layout
layout = go.Layout(
    title = 'Tweet Sentiment',
    height = 400,
    width = 500,
    autosize = False
)

fig = go.Figure(data = [trace], layout = layout)
py.offline.iplot(fig, filename='cufflinks/simple')
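
The note about red/gray/green colors could be handled by keying a palette on the sentiment label, so that each color follows its label whatever order `value_counts()` returns. This is a minimal sketch on a toy Series; the `palette` dictionary and its color names are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy sentiment column standing in for airline_tweets['airline_sentiment']
sentiment = pd.Series(
    ["negative", "negative", "negative", "neutral", "neutral", "positive"]
)
counts = sentiment.value_counts()

# Hypothetical palette keyed by label rather than by position
palette = {"negative": "red", "neutral": "gray", "positive": "green"}

# Build the color list in the same order as the labels passed to the chart
colors = [palette[label] for label in counts.index]

print(list(counts.index), colors)
```

The resulting list can be passed to the pie trace as `marker=dict(colors=colors)`.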

In [8]:
# Show the number of tweets per airline, grouped by sentiment

trace1 = go.Bar(
    x = sorted(airline_tweets['airline'].unique()),
    y = airline_tweets[airline_tweets['airline_sentiment'] == 'negative'].groupby('airline')['airline_sentiment'].value_counts(),
    name = 'Negative',
    marker = dict(color='rgba(200,0,0,.7)')
)

trace2 = go.Bar(
    x = sorted(airline_tweets['airline'].unique()),
    y = airline_tweets[airline_tweets['airline_sentiment'] == 'neutral'].groupby('airline')['airline_sentiment'].value_counts(),
    name = 'Neutral',
    marker = dict(color='rgba(150,150,150,.7)')
)

trace3 = go.Bar(
    x = sorted(airline_tweets['airline'].unique()),
    y = airline_tweets[airline_tweets['airline_sentiment'] == 'positive'].groupby('airline')['airline_sentiment'].value_counts(),
    name = 'Positive',
    marker = dict(color='rgba(0,200,0,.7)')
)

data = [trace1, trace2, trace3]
layout = go.Layout(
    title = 'Sentiment per airline (Totals)',
    barmode='group',
    yaxis = dict(title='Number of tweets')
)

fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig, filename='stacked-bar')

In [9]:
# Show each airline's share of tweets by sentiment
# (each airline's counts are divided by its total number of tweets)

trace1 = go.Bar(
    x = sorted(airline_tweets['airline'].unique()),
    y = (airline_tweets[airline_tweets['airline_sentiment'] == 'negative'].groupby('airline')['airline_sentiment'].value_counts().values) / (airline_tweets['airline'].value_counts().sort_index().values),
    name = 'Negative',
    marker = dict(color='rgba(200,0,0,.7)')
)

trace2 = go.Bar(
    x = sorted(airline_tweets['airline'].unique()),
    y = (airline_tweets[airline_tweets['airline_sentiment'] == 'neutral'].groupby('airline')['airline_sentiment'].value_counts().values) / (airline_tweets['airline'].value_counts().sort_index().values),
    name = 'Neutral',
    marker = dict(color='rgba(150,150,150,.7)')
)

trace3 = go.Bar(
    x = sorted(airline_tweets['airline'].unique()),
    y = (airline_tweets[airline_tweets['airline_sentiment'] == 'positive'].groupby('airline')['airline_sentiment'].value_counts().values) / (airline_tweets['airline'].value_counts().sort_index().values),
    name = 'Positive',
    marker = dict(color='rgba(0,200,0,.7)')
)

data = [trace1, trace2, trace3]
layout = go.Layout(
    title = 'Sentiment per airline (Percentage)',
    barmode='stack',
    yaxis = dict(title='Share of tweets', tickformat='.0%')
)

fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig, filename='stacked-bar')
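
The shares plotted above divide two separately computed arrays, which only lines up when every airline has at least one tweet of every sentiment. A possible alternative, sketched here on a toy frame with made-up values, is `pd.crosstab` with `normalize='index'`, which fills missing combinations with zero. The name `share` is hypothetical.

```python
import pandas as pd

# Toy data: United has no neutral or positive tweets
tweets = pd.DataFrame({
    "airline": ["Delta", "Delta", "Delta", "Delta", "United", "United"],
    "airline_sentiment": [
        "negative", "neutral", "positive", "positive", "negative", "negative"
    ],
})

# Rows are airlines, columns are sentiments;
# normalize='index' scales each row so that it sums to 1
share = pd.crosstab(
    tweets["airline"], tweets["airline_sentiment"], normalize="index"
)

print(share)
```

Each column of `share` could then serve as the `y` values of one bar trace, with `share.index` as `x`.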

In [10]:
# Plots the complaint reasons, and their frequency
# (It might be nice to somehow show how common each reason is for each airline)

# The input is the number of negative tweets by reason
data = [go.Bar(
    x = airline_tweets.negativereason.value_counts().index,
    y = airline_tweets.negativereason.value_counts(),
    opacity = 0.7
)]

# Create the layout
layout = go.Layout(
    title = 'Negative Tweets by Reason',
    yaxis = dict(title='Number of tweets')
)

fig = go.Figure(data = data, layout = layout)
py.offline.iplot(fig, filename='cufflinks/simple')
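
For the note about showing how common each reason is for each airline, one option is a table of reason shares per airline. The sketch below assumes the same column names as the dataset but uses a toy frame with made-up rows; `reason_share` is a hypothetical name.

```python
import pandas as pd

# Toy data standing in for airline_tweets
tweets = pd.DataFrame({
    "airline": ["Delta", "Delta", "United", "United", "United", "United"],
    "airline_sentiment": [
        "negative", "negative", "negative", "negative", "negative", "positive"
    ],
    "negativereason": [
        "Late Flight", "Lost Luggage", "Late Flight", "Late Flight",
        "Bad Flight", None,
    ],
})

# Only negative tweets carry a complaint reason
negative = tweets[tweets["airline_sentiment"] == "negative"]

# Share of each reason within each airline; absent combinations become 0
reason_share = (
    negative.groupby("airline")["negativereason"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

print(reason_share)
```

A table in this shape maps directly onto a stacked bar chart or a heatmap with one row per airline.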

## Text Cleaning

In [11]:
def text_cleaner(text):
    # Replace the double dash '--' with a space so it does not glue words together
    text = re.sub(r'--', ' ', text)
    # Remove anything enclosed in square brackets (non-greedy match)
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text
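
To see what the three cleaning steps do, here is a small standalone sketch that mirrors them on a made-up tweet. The function name `clean` and the sample string are illustrative only.

```python
import re

def clean(text):
    text = re.sub(r"--", " ", text)       # double dash -> space
    text = re.sub(r"\[.*?\]", "", text)   # drop [bracketed] spans (non-greedy)
    return " ".join(text.split())         # collapse runs of whitespace

sample = "@united -- thanks [link]   for the help"
print(clean(sample))
# → @united thanks for the help
```

Mentions, apostrophes, and HTML entities such as `&amp;` pass through unchanged, which matches the cleaned output shown below.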

In [12]:
# Remove non-essential punctuation from the tweets
pd.options.display.max_colwidth = 200
airline_tweets['text'] = airline_tweets['text'].map(lambda x: text_cleaner(str(x)))
airline_tweets['text'].head()

0                                                                                               @VirginAmerica What @dhepburn said.
1                                                          @VirginAmerica plus you've added commercials to the experience... tacky.
2                                                           @VirginAmerica I didn't today... Must mean I need to take another trip!
3    @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
4                                                                           @VirginAmerica and it's a really big bad thing about it
Name: text, dtype: object

In [13]:
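# Requires the WordNet corpus: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Reduce each whitespace-separated word to its lemma and store the result
airline_tweets['text'] = airline_tweets['text'].map(
    lambda tweet: ' '.join(lemmatizer.lemmatize(word) for word in tweet.split())
)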

In [14]:
# # Modify values of sentiment to numerical values
# This block is useful before modelling, but it's pretty annoying right now
# sentiment = {'negative': -1, 'neutral': 0, 'positive': 1}
# airline_tweets['airline_sentiment'] = airline_tweets['airline_sentiment'].map(lambda x: sentiment[x])
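
One way to make the mapping above less intrusive is to write the numeric codes to a separate column, so the string labels stay available for plotting. This is a sketch on a toy frame; the column name `sentiment_code` is an assumption.

```python
import pandas as pd

# Toy frame standing in for airline_tweets
tweets = pd.DataFrame(
    {"airline_sentiment": ["negative", "neutral", "positive", "negative"]}
)

# Same encoding as the commented-out block
sentiment_codes = {"negative": -1, "neutral": 0, "positive": 1}

# Series.map accepts a dict directly; the original labels are left in place
tweets["sentiment_code"] = tweets["airline_sentiment"].map(sentiment_codes)

print(tweets)
```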

# Wordcloud generation

In [15]:
# Join every tweet into one string for the word cloud
text = " ".join(tweet for tweet in airline_tweets['text'])

# STOPWORDS is wordcloud's built-in set of English stop words
wordcloud = WordCloud(stopwords=STOPWORDS, max_words=50, background_color="white").generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# Natural Language Processing
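
As a warm-up for the tf-idf step, the sketch below runs `TfidfVectorizer` on three made-up sentences. It is a toy illustration, not the notebook's data, and shows that the result has one row per document and one column per term that survives stop-word removal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "tweets"
docs = [
    "the flight was late",
    "the crew was great",
    "late flight and lost luggage",
]

# stop_words='english' drops common words such as "the" before weighting
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Sparse matrix: one row per document, one column per remaining term
print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```

Rows are L2-normalized by default, so tweets of different lengths are weighted on a comparable scale.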

In [None]:
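## Creating tf-idf matrix from the cleaned tweet text
vectorizer = TfidfVectorizer(stop_words='english')
synopsis_tfidf = vectorizer.fit_transform(airline_tweets['text'])

# Getting the word list.
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
terms = vectorizer.get_feature_names()

# Number of topics.
# Assumption: one topic per labelled complaint reason
ntopics = airline_tweets['negativereason'].nunique()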

# Linking words to topics
def word_topic(tfidf,solution, wordlist):
    
    # Loading scores for each word on each topic/component.
    words_by_topic=tfidf.T * solution

    # Linking the loadings to the words in an easy-to-read way.
    components=pd.DataFrame(words_by_topic,index=wordlist)
    
    return components

# Extracts the top N words and their loadings for each topic.
def top_words(components, n_top_words):
    n_topics = range(components.shape[1])
    index= np.repeat(n_topics, n_top_words, axis=0)
    topwords=pd.Series(index=index)
    for column in range(components.shape[1]):
        # Sort the column so that highest loadings are at the top.
        sortedwords=components.iloc[:,column].sort_values(ascending=False)
        # Choose the N highest loadings.
        chosen=sortedwords[:n_top_words]
        # Combine loading and index into a string.
        chosenlist=chosen.index +" "+round(chosen,2).map(str) 
        topwords.loc[column]=chosenlist
    return(topwords)

# Number of words to look at for each topic.
n_top_words = 50