# Airline Sentiment Analysis

A sentiment analysis of the problems customers report with each major U.S. airline. The Twitter data was scraped in
February 2015. Contributors first classified each tweet as positive, negative, or neutral, and then
categorized the reasons behind negative tweets (such as "late flight" or "rude service").

# Data:
- **Tweets.csv:**
    - tweet_id
    - airline_sentiment
    - airline_sentiment_confidence
    - negativereason
    - negativereason_confidence
    - airline
    - airline_sentiment_gold
    - name
    - negativereason_gold
    - retweet_count
    - text
    - tweet_coord
    - tweet_created
    - tweet_location
    - user_timezone

## Data Summary


In [2]:
import re
import sys
import unicodedata

import nltk
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

pd.set_option('display.max_colwidth', None)  # show full tweet text in outputs

# Install the contractions package into the interpreter this notebook runs on
!{sys.executable} -m pip install contractions


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [14]:
# Loading data into pandas dataframe
data = pd.read_csv('./drive/MyDrive/Tweets.csv')

In [6]:
data.shape

(14640, 15)

In [12]:
data.info()  # call the method to print column dtypes and non-null counts

In [7]:
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials to the experience... tacky.,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I need to take another trip!,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse",,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing about it,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


## Data Columns


In [15]:
# Encode airline_sentiment as integer category codes (categories are sorted alphabetically):
# negative is 0
# neutral is 1
# positive is 2
data['airline_sentiment'] = data['airline_sentiment'].astype('category').cat.codes
labels = data['airline_sentiment']
# Share of tweets in each class
labels.value_counts(normalize=True, dropna=False).sort_index()

0    0.626913
1    0.211680
2    0.161407
Name: airline_sentiment, dtype: float64
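
The integer codes come from pandas sorting the category names alphabetically, which is easy to lose track of once the original strings are overwritten. A minimal sketch of how to recover the code-to-label mapping follows; the `sentiment` series here is a small stand-in for the real column, not data from `Tweets.csv`.

```python
import pandas as pd

# Stand-in for data['airline_sentiment'] before encoding (illustrative values only)
sentiment = pd.Series(["neutral", "positive", "negative", "negative"])

# Convert to a categorical; categories are ordered alphabetically by default
cat = sentiment.astype("category")

# Map each integer code back to its label
mapping = dict(enumerate(cat.cat.categories))
print(mapping)  # {0: 'negative', 1: 'neutral', 2: 'positive'}

# The codes line up with the mapping above
codes = cat.cat.codes
print(codes.tolist())  # [1, 2, 0, 0]
```

Keeping `mapping` around makes it straightforward to turn model predictions back into readable sentiment names later.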

In [16]:
# Keep only the tweet text; the sentiment labels were saved in `labels` above
data = data[['text']]
data.head()

Unnamed: 0,text
0,@VirginAmerica What @dhepburn said.
1,@VirginAmerica plus you've added commercials to the experience... tacky.
2,@VirginAmerica I didn't today... Must mean I need to take another trip!
3,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse"
4,@VirginAmerica and it's a really big bad thing about it


In [17]:
data.shape

(14640, 1)

## Data Pre-Processing


In [None]:
"""
a. Remove HTML tags.
b. Tokenize.
c. Remove numbers.
d. Remove special characters and punctuation.
e. Convert to lowercase.
f. Lemmatize or stem.
g. Join the words in each list back into a text string in the dataframe, so that each row
contains its data as text.
h. Print the first 5 rows of data after pre-processing.
"""
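
Steps a to g above can be sketched as a single function. This is one possible implementation, not the notebook's own: it assumes `PorterStemmer` for step f (stemming rather than lemmatization) and the regex-based `wordpunct_tokenize` for step b, both chosen because they need no NLTK data downloads. The `preprocess` name and the sample tweet are hypothetical.

```python
import re

from bs4 import BeautifulSoup
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

stemmer = PorterStemmer()


def preprocess(text):
    """Apply steps a-g to one tweet and return the cleaned string."""
    # a. Strip HTML tags and decode entities such as &amp;
    text = BeautifulSoup(text, "html.parser").get_text()
    # b. Tokenize on word/punctuation boundaries
    tokens = wordpunct_tokenize(text)
    # c + d. Drop digits, special characters, and punctuation from each token
    tokens = [re.sub(r"[^A-Za-z]", "", tok) for tok in tokens]
    # e. Lowercase, discarding tokens that became empty
    tokens = [tok.lower() for tok in tokens if tok]
    # f. Stem each token (assumption: stemming instead of lemmatization)
    tokens = [stemmer.stem(tok) for tok in tokens]
    # g. Join the tokens back into one string
    return " ".join(tokens)


# Hypothetical sample tweet, for illustration only
sample = "@VirginAmerica it's <b>really</b> aggressive &amp; my 2 flights were Late!"
cleaned = preprocess(sample)
print(cleaned)
```

In the notebook, the function could be applied with `data['text'] = data['text'].apply(preprocess)`, after which `data.head()` covers step h. Swapping in `WordNetLemmatizer` is possible but requires downloading the WordNet corpus first.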

## Vectorization


In [None]:
"""
a. Use CountVectorizer.
b. Use TfidfVectorizer.
"""

In [None]:
"""
What to do after text pre-processing:
- Bag of words
- Tf-idf
"""
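
The two vectorizers named above can be compared on a tiny corpus. The sketch below uses scikit-learn's `CountVectorizer` and `TfidfVectorizer` with default settings; the three `docs` strings are made-up placeholders standing in for pre-processed tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Placeholder corpus standing in for pre-processed tweet text
docs = [
    "flight late again",
    "great flight great crew",
    "late bag lost bag",
]

# Bag of words: each column is a vocabulary term, each cell a raw count
count_vec = CountVectorizer()
bow = count_vec.fit_transform(docs)

# Tf-idf: counts are reweighted so terms found in many documents count less
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

print(sorted(count_vec.vocabulary_))  # vocabulary terms in column order
print(bow.toarray())
print(tfidf.toarray().round(2))
```

Both calls return sparse matrices with one row per document and one column per vocabulary term. On the full dataset, a cap such as `max_features` may be worth considering to keep the matrices manageable.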

## Modelling and Evaluation

In [None]:
# Fit and evaluate the model using both types of vectorization.

In [None]:
# Build the classification model. 

In [None]:
# Evaluate the model.
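
The three cells above can be sketched as one loop that fits and scores the same classifier on each vectorization. The notebook does not name a model, so `LogisticRegression` is an assumption here, and `toy_texts` / `toy_labels` are invented placeholders; in the notebook they would be the pre-processed `data['text']` and `labels`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented placeholder data: 0 = negative, 1 = neutral, 2 = positive
toy_texts = [
    "flight delayed again terrible service",
    "lost my bag awful experience",
    "rude staff and late flight",
    "worst airline cancelled flight",
    "what time does the flight leave",
    "is there wifi on this plane",
    "which gate for the flight",
    "can i change my seat",
    "great crew thank you",
    "love the new seats awesome",
    "amazing service thanks",
    "best flight ever thank you",
]
toy_labels = [0] * 4 + [1] * 4 + [2] * 4

# Stratify so every class appears in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

results = {}
for name, vectorizer in [("bag-of-words", CountVectorizer()),
                         ("tf-idf", TfidfVectorizer())]:
    # Fit the vocabulary on the training split only, then reuse it for the test split
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Build and fit the classifier (assumed model choice)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_vec, y_train)

    # Evaluate on the held-out split
    results[name] = accuracy_score(y_test, model.predict(X_test_vec))
    print(name, results[name])
```

Because roughly 63% of the tweets in this dataset are negative, accuracy alone can look better than the model deserves; `sklearn.metrics.classification_report` or a confusion matrix gives a per-class view.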

## Conclusion

In [None]:
"""
Summarize your understanding of how the various pre-processing and vectorization steps were applied,
and of your model's performance on this dataset.
"""