
**MLS Case Study: Twitter US Airline Sentiment - Analysis**


## **Context:**

- **This dataset consists of tweets and retweets about problems with the major US airlines.**
- **The data spans 2015 and covers 14,640 tweets. Each tweet carries the feedback as plain text.**
- **It also includes feedback for the major US airlines.**


#### The purpose of this analysis is to explore the features and build a classification model that tags the sentiment of each tweet as positive, negative or neutral based on its text. In this analysis, we focus on the tweet text and the airline sentiment label.



**Dataset:**

The dataset has the following columns:

* tweet_id
* airline_sentiment
* airline_sentiment_confidence
* negativereason
* negativereason_confidence
* airline
* airline_sentiment_gold
* name
* negativereason_gold
* retweet_count
* text
* tweet_coord
* tweet_created
* tweet_location
* user_timezone

## **Steps:**
- Import the necessary libraries
- Get the data
- Explore the data
- Do feature engineering (create relevant columns based on existing columns)
- Plot the word cloud based on the relevant column
- Do pre-processing
  - Noise removal (special characters, HTML tags, numbers, stopword removal)
  - Lowercasing
  - Stemming / lemmatization
- Text to number: Vectorization
  - CountVectorizer
  - TfidfVectorizer
- Build a machine learning model for text classification
- Optimize the parameters to improve the accuracy
- Plot the word cloud based on the most important features
- Check the performance of the model
- Summary
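
The steps above can be sketched end to end on a handful of made-up tweets. This is a minimal sketch, assuming scikit-learn is installed; the toy tweets, the `toy_` names and the parameter values are illustrative and are not taken from the dataset or from the cells below.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy tweets and labels, used only to illustrate the workflow.
toy_texts = [
    "flight delayed again terrible service",
    "lost my bag worst airline",
    "thanks for the great flight",
    "love the friendly crew",
    "what time does the flight leave",
    "is there wifi on board",
]
toy_labels = ["negative", "negative", "positive", "positive", "neutral", "neutral"]

# Vectorize the text and classify it in a single pipeline object.
toy_model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),
    RandomForestClassifier(n_estimators=20, random_state=42),
)
toy_model.fit(toy_texts, toy_labels)

toy_predictions = toy_model.predict(["flight delayed and bag lost"])
print(toy_predictions)
```

With so few examples the prediction itself carries no meaning; the point is only the order of the steps (vectorize, fit, predict).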


In [None]:
pip install contractions

In [None]:
# import necessary libraries.

import re, string, unicodedata                          # Import Regex, string and unicodedata.
import contractions                                     # Import contractions library.
from bs4 import BeautifulSoup                           # Import BeautifulSoup.

import numpy as np                                      # Import numpy.
import pandas as pd                                     # Import pandas.
import nltk                                             # Import Natural Language Tool-Kit.

nltk.download('stopwords')                              # Download Stopwords.
nltk.download('punkt')
nltk.download('wordnet')

from nltk.corpus import stopwords                       # Import stopwords.
from nltk.tokenize import word_tokenize, sent_tokenize  # Import Tokenizer.
from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer.
import matplotlib.pyplot as plt                         # Import plt for visualization

import seaborn as sns

sns.set_theme()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# read Tweets file from Google Drive
import pandas as pd
data=pd.read_csv('/content/drive/MyDrive/colab/Project - Natural Language Processing/Tweets.csv')

In [None]:
# check the data shape
data.shape                                               # print shape of data.

- Tweets file contains 14,640 records and 15 columns

In [None]:
# checking sample data
pd.set_option('display.max_colwidth', None) # Display full dataframe information (non-truncated text column).
data.head()                                              # Print first 5 rows of data.

In [None]:
# check summary statistics of the numeric columns
data.describe()

In [None]:
# check data types of each column
data.info()

In [None]:
# convert object columns to category columns
data[data.select_dtypes(['object']).columns] = data.select_dtypes(['object']).apply(lambda x: x.astype('category'))

In [None]:
# check conversion
data.info()

In [None]:
# Dropping the tweet_id column from the dataframe as it is not used in the analysis
data.drop(columns=['tweet_id'], inplace=True)

In [None]:
# data category list
data.select_dtypes(['category']).columns

In [None]:
# check values of the category columns
cat_cols=data.select_dtypes(['category']).columns

for column in cat_cols:
    print(data[column].value_counts())
    print('-'*30)

In [None]:
data.isnull().sum(axis=0)                                # Check for NULL values.

**Categorical data analysis**

* airline_sentiment has three distinct values: negative, positive and neutral
* Tweet data is available for 6 airlines: United, US Airways, American, Southwest, Delta, Virgin America
* The airline_sentiment_gold, negativereason_gold and tweet_coord columns may not be useful for analysis.
* The main negativereason given is "Customer Service Issue". However, there is not enough data in this column.
* The following column doesn't have meaningful data: name
* The following columns don't have enough data: airline_sentiment_gold, negativereason_gold
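
The "not enough data" observations can be quantified as the share of missing values per column. The sketch below uses a hypothetical four-row frame (`toy`) that only mimics the sparsity pattern; in the notebook the same expression would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics the sparsity pattern described above.
toy = pd.DataFrame(
    {
        "airline_sentiment": ["negative", "positive", "neutral", "negative"],
        "negativereason": ["Late Flight", np.nan, np.nan, "Customer Service Issue"],
        "airline_sentiment_gold": [np.nan, np.nan, np.nan, "negative"],
    }
)

# Share of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```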

In [None]:
# define function to create labeled barplots to see individual columns data distribution


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [None]:
# let's check tweets by airlines
labeled_barplot(data, "airline", perc=True)

- 26% of the tweets are about United Airlines, which is the biggest bucket
- Virgin America has the lowest number of tweets

In [None]:
# let's look at the overall tweet sentiments
labeled_barplot(data, "airline_sentiment", perc=True)

- Most tweets (62.7%) reflect negative sentiment

In [None]:
# let's explore distribution of negative sentiments
labeled_barplot(data, "negativereason", perc=True)

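- Almost 20% of all tweets cite Customer Service Issues as the negative reason (the percentages are computed over all tweets, including those without a negative reason). This may be addressable with comparatively low investment.
- The next largest bucket is Late Flight, with 11.4%.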
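
The denominator matters when reading these percentages: `labeled_barplot` divides by the number of rows, so tweets without a negative reason are included. A minimal sketch with a hypothetical five-row series (`reasons`) shows the difference between the two shares.

```python
import numpy as np
import pandas as pd

# Hypothetical reasons: NaN marks tweets that have no negative reason.
reasons = pd.Series(
    ["Customer Service Issue", "Late Flight", np.nan, np.nan, "Customer Service Issue"]
)

# Share over all rows (the barplot helper divides by the column length).
share_of_all = reasons.value_counts() / len(reasons)

# Share over rows that actually have a reason (value_counts drops NaN by default).
share_of_negative = reasons.value_counts(normalize=True)

print(share_of_all["Customer Service Issue"])       # 2 of 5 rows
print(share_of_negative["Customer Service Issue"])  # 2 of 3 rows with a reason
```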

In [None]:
# defining function to plot the predictor and target variables
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # a single legend call; the legend is placed outside the plot area
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

In [None]:
# plot the airline_sentiment by airline
stacked_barplot(data, "airline", "airline_sentiment")

- United has the most feedback available
- Virgin America has the fewest tweets and also the lowest level of negative sentiment
- US Airways has the highest share of negative feedback

### **Word Cloud based on Tweet Text**

In [None]:
pip install wordcloud

In [None]:
# build the wordcloud to see the passenger sentiment
import wordcloud
def show_wordcloud(data, title):
    text = ' '.join(data['text'].astype(str).tolist())                 # Join all tweet texts into one string
    stopwords = set(wordcloud.STOPWORDS)                                  # Use the stopword list shipped with wordcloud
    
    fig_wordcloud = wordcloud.WordCloud(stopwords=stopwords,background_color='white',          # Configure the word cloud (stopwords, colors, size) and generate it
                    colormap='viridis', width=800, height=600).generate(text)
    
    plt.figure(figsize=(14,11), frameon=True)                             
    plt.imshow(fig_wordcloud)  
    plt.axis('off')
    plt.title(title, fontsize=30)
    plt.show()

In [None]:
# Let's look at the Negative sentiments visually
show_wordcloud(data[data.airline_sentiment == "negative"], title = "Negative Sentiments")

- The majority of complaints are about United, US Airways, AmericanAir and Southwest.
- Customer Service, Flight Delay, Cancellation and Baggage Problems are highlighted.

In [None]:
show_wordcloud(data[data.airline_sentiment == "positive"], title = "Positive Sentiments")

- JetBlue, SouthwestAir and AmericanAir show up in the positive sentiment.
- 'Thanks', 'Love', 'Appreciate', 'Amazing' and 'Awesome' are highlighted words.

In [None]:
# Only keeping relevant columns from the data, as these are useful for our analysis.
data_sel = data.loc[:,['airline_sentiment','text']]

In [None]:
data_sel.shape                                # Shape of data

- Only two columns, airline sentiment and tweet text, are retained for further analysis

In [None]:
# checking the data types
data_sel.info()

### **Text Pre-processing:**

- Remove HTML tags.
- Replace contractions in the string (e.g. replace I'm --> I am, and so on).
- Remove numbers.
- Tokenize the text.
- Remove non-ASCII characters and punctuation, and convert to lowercase.
- Remove stopwords.
- Lemmatize the tokens.

We have used the **NLTK library to tokenize words, remove stopwords and lemmatize the remaining words**.
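
To make the order of the steps concrete, here is a simplified, standard-library-only stand-in for the pipeline. It is a sketch under stated assumptions: `toy_clean` and `TOY_STOPWORDS` are hypothetical names, the tokenizer is a crude regular expression rather than NLTK's, only one contraction is handled, and lemmatization is left out.

```python
import re

# Tiny, hand-picked stopword list for illustration; the notebook uses NLTK's list.
TOY_STOPWORDS = {"the", "is", "a", "my", "and", "was", "i", "am"}

def toy_clean(text):
    """Apply simplified versions of the pre-processing steps to one string."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop anything that looks like an HTML tag
    text = text.replace("I'm", "I am")        # single hard-coded contraction
    text = re.sub(r"\d+", "", text)           # remove digits
    tokens = re.findall(r"[A-Za-z']+", text)  # crude tokenizer: runs of letters/apostrophes
    tokens = [t.lower() for t in tokens]      # lowercase
    return [t for t in tokens if t not in TOY_STOPWORDS]

print(toy_clean("<b>I'm</b> not happy, my flight 1234 was delayed"))
# ['not', 'happy', 'flight', 'delayed']
```

Note that "not" survives because it is absent from the toy stopword list, which mirrors the customization applied to the NLTK list in this notebook.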

In [None]:
# define function to remove html tags
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")                    
    return soup.get_text()

data_sel['text'] = data_sel['text'].apply(lambda x: strip_html(x))

data_sel.head()

- no HTML tags are present in the sample

In [None]:
# define function to remove contractions in the text
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

data_sel['text'] = data_sel['text'].apply(lambda x: replace_contractions(x))

data_sel.head()

- no contractions remain in the sample

In [None]:
# define function to remove numbers in the text
def remove_numbers(text):
  text = re.sub(r'\d+', '', text)
  return text

data_sel['text'] = data_sel['text'].apply(lambda x: remove_numbers(x))

data_sel.head()

- sample data checking looks good

In [None]:
# tokenize the tweets texts for further analysis
data_sel['text'] = data_sel.apply(lambda row: nltk.word_tokenize(row['text']), axis=1) # Tokenization of data

In [None]:
data_sel.head()                                                                    # Look at how tokenized data looks.

- tokenized data sample looks good

In [None]:
# prepare a custom stopword list for better results from the analysis
stopwords = nltk.corpus.stopwords.words('english')   # fully qualified, so the cell can be re-run after the name is reassigned

customlist = ['not', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
        "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
        "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
        "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Negations such as "not" and "couldn't" matter for sentiment, so they are taken out of the stopword list and stay in the text.

stopwords = set(stopwords) - set(customlist)         # a set gives fast membership tests in remove_stopwords
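
The effect of keeping negations can be seen on a single tokenized sentence. This is a small sketch with a hypothetical five-word stopword list (`base_stopwords`); NLTK's English list is much longer but also contains "not".

```python
# Hypothetical mini stopword list that, like NLTK's, contains the negation "not".
base_stopwords = {"i", "am", "the", "with", "not"}
negations_to_keep = {"not"}

tokens = ["i", "am", "not", "happy", "with", "the", "service"]

# Filtering with the unmodified list drops the negation.
without_negation = [t for t in tokens if t not in base_stopwords]

# Filtering with the customized list keeps it.
custom_stopwords = base_stopwords - negations_to_keep
with_negation = [t for t in tokens if t not in custom_stopwords]

print(without_negation)  # ['happy', 'service']
print(with_negation)     # ['not', 'happy', 'service']
```

Without the customization, "not happy" and "happy" would look identical to the model.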

In [None]:
#  using Multilingual Wordnet Data from OMW with newer Wordnet versions
import nltk
nltk.download('omw-1.4')

In [None]:
# define the word lemmatizer, lowercase conversion, punctuation and stopwords removal functions 
# apply on the sentiment text
lemmatizer = WordNetLemmatizer()

def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords:
            new_words.append(word)
    return new_words

def lemmatize_list(words):
    """Lemmatize each word in the list, treating every word as a verb (pos='v')"""
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    words = lemmatize_list(words)
    return ' '.join(words)

data_sel['text'] = data_sel.apply(lambda row: normalize(row['text']), axis=1)

In [None]:
# check sample data
data_sel.head(5)    

- data looks clean after the pre-processing steps are executed

### **Building the model based on CountVectorizer and Random Forest**

In [None]:
# Vectorization (Convert text data to numbers).
from sklearn.feature_extraction.text import CountVectorizer

bow_vec = CountVectorizer(max_features=3000)                # Keep only the 3000 most frequent features, as more features increase the processing time.
data_features = bow_vec.fit_transform(data_sel['text'])
data_features = data_features.toarray()                        # Convert the data features to array.

In [None]:
# check the data features shape
data_features.shape

- We have the same number of rows as earlier (14,640), but now with 3,000 feature columns.
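
How `max_features` shapes the matrix can be checked on a tiny corpus. The sketch below assumes scikit-learn is installed; `toy_corpus`, `full_vec` and `capped_vec` are hypothetical names and the three "tweets" are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned tweets.
toy_corpus = ["flight delay", "flight cancel", "thank great flight"]

# Without a cap, every distinct token becomes a column.
full_vec = CountVectorizer()
full_matrix = full_vec.fit_transform(toy_corpus)
print(full_matrix.shape)             # (3, 5)
print(sorted(full_vec.vocabulary_))  # ['cancel', 'delay', 'flight', 'great', 'thank']

# max_features keeps only the most frequent tokens across the corpus.
capped_vec = CountVectorizer(max_features=3)
capped_matrix = capped_vec.fit_transform(toy_corpus)
print(capped_matrix.shape)                 # (3, 3)
print("flight" in capped_vec.vocabulary_)  # True
```

The row count always equals the number of documents; only the column count depends on the vocabulary and the cap.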

In [None]:
#  double check the data types for each field
data_sel.info()

In [None]:
# convert object columns to category columns
data_sel[data_sel.select_dtypes(['object']).columns] = data_sel.select_dtypes(['object']).apply(lambda x: x.astype('category'))

In [None]:
labels = data['airline_sentiment']

In [None]:
# Split data into training and testing set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_features, labels, test_size=0.3, random_state=42)

In [None]:
# Convert labels from names to one-hot vectors.
# LabelBinarizer works similarly to OneHotEncoder

from sklearn.preprocessing import LabelBinarizer
enc = LabelBinarizer()
y_train_encoded = enc.fit_transform(y_train)
y_test_encoded=enc.transform(y_test)
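
What the encoder produces for three classes can be seen on a four-label toy list. This sketch assumes scikit-learn is installed; `toy_enc` and `toy_encoded` are hypothetical names.

```python
from sklearn.preprocessing import LabelBinarizer

toy_enc = LabelBinarizer()
toy_encoded = toy_enc.fit_transform(["negative", "neutral", "positive", "negative"])

# Classes are stored in sorted order; each row has a 1 in the column of its class.
print(toy_enc.classes_.tolist())  # ['negative', 'neutral', 'positive']
print(toy_encoded.tolist())       # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

# inverse_transform maps one-hot rows back to the original names.
toy_decoded = toy_enc.inverse_transform(toy_encoded).tolist()
print(toy_decoded)                # ['negative', 'neutral', 'positive', 'negative']
```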

In [None]:
# Using Random Forest to build model for the classification of reviews.
# Also calculating the cross validation score.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest = RandomForestClassifier(n_estimators=20, n_jobs=4)

forest = forest.fit(X_train, y_train_encoded)
print(forest)

print(np.mean(cross_val_score(forest, data_features, labels, cv=10)))

### **Optimizing the parameter: Number of trees in the random forest model(n_estimators)**

In [None]:

# Finding optimal number of base learners using k-fold CV ->
base_ln = [x for x in range(1, 25)]
base_ln

In [None]:
# K-Fold Cross - validation .
cv_scores = []
for b in base_ln:
    clf = RandomForestClassifier(n_estimators = b)
    scores = cross_val_score(clf, X_train, y_train_encoded, cv = 5, scoring = 'accuracy')
    cv_scores.append(scores.mean())

In [None]:
# plotting the error as the number of base learners increases
error = [1 - x for x in cv_scores]                                 # error corresponding to each number of estimators
optimal_learners = base_ln[error.index(min(error))]                # select the number of estimators with the minimum error
plt.plot(base_ln, error)                                           # plot misclassification error against the number of estimators
xy = (optimal_learners, min(error))
plt.annotate('(%s, %s)' % xy, xy = xy, textcoords='data')
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()
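
The selection rule used above is a plain argmin over the error list. A minimal sketch with hypothetical scores (the `toy_` values are made up and are not results from this notebook):

```python
# Hypothetical cross-validation accuracies for forests of 1 to 5 trees.
toy_base_ln = [1, 2, 3, 4, 5]
toy_cv_scores = [0.61, 0.66, 0.70, 0.69, 0.70]

toy_error = [1 - s for s in toy_cv_scores]

# list.index returns the first position of the minimum, so ties go to the smaller forest.
toy_optimal = toy_base_ln[toy_error.index(min(toy_error))]
print(toy_optimal)  # 3
```

Because the candidate range stops at 24 trees, the chosen value is the best within that range only; the error curve may still be falling at the upper end.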

In [None]:
# Training the best model and calculating accuracy on test data .
clf = RandomForestClassifier(n_estimators = optimal_learners)
clf.fit(X_train, y_train_encoded)
clf.score(X_test, y_test_encoded)

In [None]:
result =  clf.predict(X_test)                  #saving the prediction on test data as a result

In [None]:
# Obtaining the categorical values from y_test_encoded and y_pred
y_pred_arg=np.argmax(result,axis=1)
y_test_arg=np.argmax(y_test_encoded,axis=1)
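
One caveat with this step, stated as an assumption about scikit-learn's multi-output behaviour: a forest fitted on one-hot labels predicts each column separately, so a predicted row is not guaranteed to contain exactly one 1. For an all-zero row, `np.argmax` returns index 0, which here corresponds to 'negative'. The sketch below uses a made-up prediction array (`toy_result`) to show the effect.

```python
import numpy as np

class_names = ["negative", "neutral", "positive"]  # sorted order, as in enc.classes_

# Hypothetical multi-output predictions: the second row has no column set to 1.
toy_result = np.array([[0, 0, 1],
                       [0, 0, 0],
                       [0, 1, 0]])

toy_pred_arg = np.argmax(toy_result, axis=1)
print(toy_pred_arg.tolist())                   # [2, 0, 1]
print([class_names[i] for i in toy_pred_arg])  # ['positive', 'negative', 'neutral']

# Rows without any predicted class can be counted before taking the argmax.
n_undecided = int((toy_result.sum(axis=1) == 0).sum())
print(n_undecided)                             # 1
```

One way to avoid the ambiguity is to fit the classifier on the original one-dimensional label column, in which case `predict` returns exactly one class name per tweet.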


In [None]:
# Build the confusion matrix to see how the predictions are distributed among the classes.

from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test_arg, y_pred_arg)

In [None]:
# Print and plot the confusion matrix
f, ax = plt.subplots(figsize=(12, 12))
sns.heatmap(
    conf_mat,
    annot=True,
    linewidths=.4,
    fmt="d",
    square=True,
    ax=ax
)
# Setting the labels to both the axes
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(list(enc.classes_), rotation=40)
ax.yaxis.set_ticklabels(list(enc.classes_), rotation=20)
plt.show()

- Tweets with negative sentiment are identified with high accuracy
- The count of false neutral predictions is high
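
Observations like these can be backed by per-class recall and precision read off the confusion matrix. The matrix below (`toy_conf`) is hypothetical and is not a result from this notebook; in the notebook the same two lines would be applied to `conf_mat`.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
toy_labels = ["negative", "neutral", "positive"]
toy_conf = np.array([[80, 15, 5],
                     [30, 60, 10],
                     [10, 10, 80]])

recall = np.diag(toy_conf) / toy_conf.sum(axis=1)     # correct / all true members of the class
precision = np.diag(toy_conf) / toy_conf.sum(axis=0)  # correct / all predictions of the class

for name, r, p in zip(toy_labels, recall, precision):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
```

Low precision for a class means many false predictions of that class; low recall means many of its true members were assigned elsewhere.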



### **Word Cloud of top 40 important features from the CountVectorizer + Random Forest based model**

In [None]:


# Collect the 40 most important features of the trained model.
all_features = bow_vec.get_feature_names_out()          # Feature names from the vectorizer (get_feature_names was removed in newer scikit-learn)
feat = clf.feature_importances_
features = np.argsort(feat)[::-1]                       # Feature indices sorted by decreasing importance
top_features = ' '.join(all_features[i] for i in features[0:40])

from wordcloud import WordCloud
# A separate variable name avoids shadowing the imported wordcloud module.
top_wordcloud = WordCloud(background_color="white", colormap='viridis', width=2000,
                          height=1000).generate(top_features)

# Display the generated image:
plt.figure(figsize=(14, 11))                            # Create the figure before drawing on it
plt.imshow(top_wordcloud, interpolation='bilinear')
plt.title('Top 40 features WordCloud', fontsize=20)
plt.axis("off")
plt.show()
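
The top-feature selection relies on `np.argsort`, which sorts in ascending order and therefore has to be reversed. A minimal sketch with a hypothetical four-word vocabulary and made-up importances:

```python
import numpy as np

# Hypothetical vocabulary and importances (not values from the trained model).
toy_features = np.array(["delay", "thank", "flight", "cancel"])
toy_importances = np.array([0.30, 0.45, 0.05, 0.20])

# argsort sorts ascending, so reverse it to get the most important features first.
top_idx = np.argsort(toy_importances)[::-1][:2]
top_words = toy_features[top_idx].tolist()
print(top_words)  # ['thank', 'delay']
```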

### **Term Frequency(TF) - Inverse Document Frequency(IDF)**

In [None]:
# Using TfidfVectorizer to convert text data to numbers.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=3000)
data_features = vectorizer.fit_transform(data_sel['text'])

data_features = data_features.toarray()

data_features.shape
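
The weights behind these features can be reproduced by hand on a toy corpus. The sketch below assumes the vectorizer's default settings (smoothed IDF, raw term counts and L2 normalization) and uses only the standard library; `toy_docs`, `smooth_idf` and `tfidf_vector` are hypothetical names.

```python
import math

# Hypothetical cleaned tweets, already tokenized.
toy_docs = [["flight", "delay"], ["flight", "cancel"], ["thank", "flight"]]
n_docs = len(toy_docs)

def smooth_idf(term):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in toy_docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(doc):
    """Raw term count times IDF, scaled to unit Euclidean length."""
    weights = {term: doc.count(term) * smooth_idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

print(round(smooth_idf("flight"), 4))  # 1.0
print(round(smooth_idf("delay"), 4))   # 1.6931
print(tfidf_vector(toy_docs[0]))
```

A word such as "flight" that occurs in every document gets the lowest IDF, so rarer words such as "delay" dominate the vector of the first document.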

In [None]:
# Split data into training and testing set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_features, labels, test_size=0.3, random_state=44)

In [None]:
# Convert labels from names to one hot vectors.
# Labelbinarizer works similar to onehotencoder 

from sklearn.preprocessing import LabelBinarizer
enc = LabelBinarizer()
y_train_encoded = enc.fit_transform(y_train)
y_test_encoded=enc.transform(y_test)

In [None]:
# Using Random Forest to build model for the classification of reviews.
# Also calculating the cross validation score.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

import numpy as np

forest = RandomForestClassifier(n_estimators=20, n_jobs=4)

forest = forest.fit(X_train, y_train_encoded)

print(forest)

print(np.mean(cross_val_score(forest, data_features, labels, cv=5)))

In [None]:
# K - Fold Cross Validation .
cv_scores = []
for b in base_ln:
    clf = RandomForestClassifier(n_estimators = b)
    scores = cross_val_score(clf, X_train, y_train_encoded, cv = 5, scoring = 'accuracy')
    cv_scores.append(scores.mean())

In [None]:
# plotting the error as the number of base learners increases
error = [1 - x for x in cv_scores]                                              # error corresponding to each number of estimators
optimal_learners = base_ln[error.index(min(error))]                             # select the number of estimators with the minimum error
plt.plot(base_ln, error)                                                        # plot misclassification error against the number of estimators
xy = (optimal_learners, min(error))
plt.annotate('(%s, %s)' % xy, xy = xy, textcoords='data')
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()

In [None]:
# Training the best model and calculating accuracy on test data.
clf = RandomForestClassifier(n_estimators = optimal_learners)
clf.fit(X_train, y_train_encoded)
clf.score(X_test, y_test_encoded)

In [None]:
result = clf.predict(X_test)

In [None]:
# Obtaining the categorical values from y_test_encoded and y_pred

y_pred_arg=np.argmax(result,axis=1)
y_test_arg=np.argmax(y_test_encoded,axis=1)

In [None]:
# Build the confusion matrix to see how the predictions are distributed among the classes.

from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_test_arg, y_pred_arg)

In [None]:
# Print and plot the confusion matrix
f, ax = plt.subplots(figsize=(12, 12))
sns.heatmap(
    conf_mat,
    annot=True,
    linewidths=.4,
    fmt="d",
    square=True,
    ax=ax
)
# Setting the labels to both the axes
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(list(enc.classes_), rotation=40)
ax.yaxis.set_ticklabels(list(enc.classes_), rotation=20)
plt.show()

- A slightly higher number of negative-sentiment tweets is correctly identified by this model
- The count of false neutral predictions went even higher

### **Word Cloud of top 40 important features from the TF IDF + Random Forest based model**

In [None]:

# Collect the 40 most important features of the trained model.
all_features = vectorizer.get_feature_names_out()                            # Feature names from the vectorizer (get_feature_names was removed in newer scikit-learn)
feat = clf.feature_importances_
features = np.argsort(feat)[::-1]                                            # Feature indices sorted by decreasing importance
top_features = ' '.join(all_features[i] for i in features[0:40])

from wordcloud import WordCloud
# A separate variable name avoids shadowing the imported wordcloud module.
top_wordcloud = WordCloud(background_color="black", width=1000,
                          height=750).generate(top_features)

# Display the generated image:
plt.figure(figsize=(30, 30))                                                 # Create the figure before drawing on it
plt.imshow(top_wordcloud, interpolation='bilinear')
plt.title('Top 40 features WordCloud', fontsize=30)
plt.axis("off")
plt.show()

### **Summary**:

- We used a dataset which has **tweets in text format and their sentiment label as 'positive', 'negative' or 'neutral'.**
- The goal was to **build a model for text classification**.
- We **pre-processed the data** using various techniques and libraries.
- We **created word cloud plots** for the negative tweets, the positive tweets and the top 40 features.
- The **pre-processed data is converted to numbers (vectorized)**, so that we can feed the data into the model.
- We trained the model and optimized the number of trees (`n_estimators`), which **led to an increase in the overall accuracy.**
- After building the classification model, we **predicted the results for the test data.**
- Using the above techniques, the model performed reasonably well for a text classification task: negative tweets are identified with high accuracy, while the neutral class accounts for many of the errors.
- However, **we can still increase the accuracy of our model by increasing the size of the dataset used for model building** (we have currently used only 14,640 entries).
- We can also increase the **`max_features` parameter** in the vectorizer.
- We can apply **other model tuning and hyperparameter tuning techniques, as well as other pre-processing techniques** to increase the overall accuracy even further.
- We tried both CountVectorizer and TF-IDF. Both give similar results; however, slightly more negative tweets are correctly identified with TF-IDF.
