# TWITTER US AIRLINE SENTIMENT ANALYSIS

In this project, I analyze the sentiment of tweets about US airlines in 12 steps:
> 1. High-level overview
> 2. Description of Input Data
> 3. Strategy for solving the problem and discussion of the expected solution
> 4. Metrics with justification
> 5. Data Preprocessing
> 6. EDA
> 7. Modeling
> 8. Hyperparameter tuning
> 9. Results
> 10. Comparison table
> 11. Conclusion
> 12. Improvement

### High-level overview 

Customer reviews play an important role in the service improvement process of airlines. Analyzing customer reviews will help airlines identify service problems they are facing to improve in the future.

The Twitter data was collected in February 2015. Contributors were asked first to classify each tweet as positive, negative, or neutral, and then to categorize the reason for each negative tweet (such as “incoming flight” or “service failure”).

With this dataset, I want to analyze what customers say about each airline's service and why they say it, and to use Machine Learning to classify each review as positive, negative or neutral.

###  Import Required Libraries

In [None]:
import pandas as pd 
import re
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
nltk.download(['stopwords','punkt','wordnet', 'omw-1.4'])
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import plotly.express as px

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ACER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ACER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ACER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ACER\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Description of Input Data

In [None]:
# Load the dataset using pandas
df = pd.read_csv('D:\\Udacity\\Twitter_US_Airline_Sentiment\\data\\Tweets.csv')

#Show first 5 rows of dataset
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [None]:
#Show all types of features in the dataset
df.dtypes

tweet_id                          int64
airline_sentiment                object
airline_sentiment_confidence    float64
negativereason                   object
negativereason_confidence       float64
airline                          object
airline_sentiment_gold           object
name                             object
negativereason_gold              object
retweet_count                     int64
text                             object
tweet_coord                      object
tweet_created                    object
tweet_location                   object
user_timezone                    object
dtype: object

In [None]:
#Show the shape of dataset
df.shape

(14640, 15)

### Strategy for solving the problem and discussion of the expected solution

### Metrics with justification

### Data Preprocessing

In [None]:
# Set of stop words in English and a single shared lemmatizer
enstopwords = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# New column: keep only the characters @, a-z and A-Z
df['text_with_no_airline_tag'] = df['text'].apply(lambda s: re.sub('[^@a-zA-Z]', ' ', s))

# Remove mentions such as @united from the string
df['text_with_no_airline_tag'] = df['text_with_no_airline_tag'].apply(lambda s: re.sub('@[a-zA-Z]+', ' ', s))

def clean_text(s):
    """Lowercase, remove stop words, tokenize and lemmatize; return one string."""
    words = [word for word in s.lower().split() if word not in enstopwords]
    tokens = word_tokenize(' '.join(words))
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)

# Text columns to clean. The result of apply() must be assigned back,
# otherwise the DataFrame is left unchanged.
# 'airline' is kept as-is: the EDA below groups and labels by the original airline names.
cols = ['text_with_no_airline_tag', 'text']
for col in cols:
    df[col] = df[col].apply(clean_text)

# Convert label value from string to numeric
map_label_dict = {'positive': 1, 'negative': -1, 'neutral': 0}
df['airline_sentiment'] = df['airline_sentiment'].map(map_label_dict)

# Keep only 4 columns
df = df[['airline', 'text_with_no_airline_tag', 'text', 'airline_sentiment']]
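
To make the effect of the two regular expressions concrete, here is a minimal sketch that applies them to a single string. The helper name `strip_mentions` and the sample tweet are made up for illustration, and the final whitespace collapse is an extra step that the cell above does not perform.

```python
import re

def strip_mentions(text):
    """Keep only letters and '@', then drop every @mention."""
    letters_only = re.sub('[^@a-zA-Z]', ' ', text)          # digits and punctuation become spaces
    no_mentions = re.sub('@[a-zA-Z]+', ' ', letters_only)   # removes @united, @VirginAmerica, ...
    return ' '.join(no_mentions.split())                    # collapse repeated spaces (illustration only)

sample = "@united flight UA123 delayed 3 hours, thanks a lot!"  # hypothetical tweet
print(strip_mentions(sample))
# flight UA delayed hours thanks a lot
```

Note that digits disappear together with punctuation, so a flight number such as `UA123` is reduced to `UA`.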

### EDA

#### Question 1: Which airline has the best and worst reviews?

In [None]:
# Calculate the percentage of positive, negative and neutral reviews for each airline
df['count'] = 1
dfVis = df.groupby(['airline', 'airline_sentiment'])['count'].sum().reset_index()
# Total number of reviews per airline (becomes count_y after the merge)
tmp = dfVis.groupby('airline')['count'].sum().reset_index()
dfVis = dfVis.merge(tmp, how='left', on='airline')
dfVis['per'] = dfVis['count_x'] / dfVis['count_y']
dfVis = dfVis[['airline', 'airline_sentiment', 'per', 'count_x']]
dfVis['airline_sentiment'] = dfVis['airline_sentiment'].replace({-1: 'negative', 0: 'neutral', 1: 'positive'})
dfVis.head()



Unnamed: 0,airline,airline_sentiment,per,count_x
0,American,negative,0.710402,1960
1,American,neutral,0.167814,463
2,American,positive,0.121783,336
3,Delta,negative,0.429793,955
4,Delta,neutral,0.325383,723
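
As a quick sanity check, the `per` values for American can be recomputed by hand from the `count_x` values in the table above. This is only a small arithmetic sketch; the dictionary name `american_counts` is introduced here for illustration.

```python
# Counts for American, copied from the count_x column above
american_counts = {'negative': 1960, 'neutral': 463, 'positive': 336}

# Total reviews for the airline, then each class as a share of that total
total = sum(american_counts.values())
shares = {label: count / total for label, count in american_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.6f}")
# negative: 0.710402
# neutral: 0.167814
# positive: 0.121783
```

The three shares match the `per` column and sum to 1, as expected for a per-airline percentage.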


From the dataset, we can plot 2 graphs to show the percentage and number of positive, negative and neutral reviews for each airline.

In [None]:
# Plot the percentage of positive, negative and neutral reviews for each airline
fig = px.bar(dfVis, x='airline', y='per', color='airline_sentiment')
fig.show()

In [None]:
# Plot the number of positive, negative and neutral reviews for each airline
fig = px.bar(dfVis, x='airline', y='count_x', color='airline_sentiment')
fig.show()

Across the 3 types of reviews, negative reviews account for the largest share. Among the 6 airlines, American Airlines, United Airlines and US Airways have a higher proportion of negative reviews than the other 3.

Although Delta, Southwest Airlines and Virgin America each received fewer reviews than the other 3 airlines, a higher percentage of their reviews are positive.

#### Question 2: What is the cause of negative reviews?

To answer this question, we plot 1-gram, 2-gram and 3-gram charts of the review text after it has gone through the data cleaning steps.
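
For readers unfamiliar with the term, an n-gram is a sequence of n consecutive words. The sketch below counts them with the standard library only. It is a simplified illustration, not the `CountVectorizer` approach used in the next cell (which also drops English stop words), and `count_ngrams` and `toy_reviews` are made-up names and data.

```python
from collections import Counter

def count_ngrams(texts, n):
    """Count word n-grams across an iterable of whitespace-separated strings."""
    counts = Counter()
    for text in texts:
        words = text.split()
        # Slide a window of n words over the text
        counts.update(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

toy_reviews = ["flight cancelled no refund", "flight cancelled again"]  # made-up examples
print(count_ngrams(toy_reviews, 2).most_common(1))
# [('flight cancelled', 2)]
```

Frequent 2-grams and 3-grams such as this one tend to point at concrete complaints more clearly than single words do.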

In [None]:
def _get_top_ngram(corpus, n, top_k=15):
    """Return the top_k most frequent n-grams in corpus as (ngram, count) pairs."""
    vec = CountVectorizer(ngram_range=(n, n),
                          max_df=0.9,
                          stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:top_k]


def ngrams(n, title, lis_type):
    """
    Plot the most common n-grams for each sentiment class.

    Input: the n-gram size, the title of the chart, and a list of 3 text Series in the order positive, negative, neutral

    Output: one bar chart per class showing the 15 most frequent n-grams
    """

    fig, axes = plt.subplots(1, 3, figsize=(18, 8))
    axes = axes.flatten()
    labels = ['Positive', 'Negative', 'Neutral']
    for texts, ax, label in zip(lis_type, axes, labels):
        top_ngrams = _get_top_ngram(texts, n)
        x, y = map(list, zip(*top_ngrams))
        sns.barplot(x=y, y=x, palette='rocket', ax=ax)
        ax.set_title(label)
        ax.set_xlabel('Count')
        ax.set_ylabel('Word')

    fig.suptitle(title, fontsize=24, va='baseline')
    plt.tight_layout()

In [None]:
# 1-gram of text with no airline hashtag
lis_text_with_no_airlinehastag = [
    df[df['airline_sentiment'] == 1]['text_with_no_airline_tag'],
    df[df['airline_sentiment'] == -1]['text_with_no_airline_tag'],
    df[df['airline_sentiment'] == 0]['text_with_no_airline_tag']
]
ngrams(1, "1-gram of text without airline hashtag", lis_text_with_no_airlinehastag)