# Motivation

This notebook is a practice exercise in sentiment analysis with Keras, using the US airline Twitter dataset available at https://www.kaggle.com/crowdflower/twitter-airline-sentiment/downloads/twitter-airline-sentiment.zip/2

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
df = pd.read_csv("Tweets.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
tweet_id                        14640 non-null int64
airline_sentiment               14640 non-null object
airline_sentiment_confidence    14640 non-null float64
negativereason                  9178 non-null object
negativereason_confidence       10522 non-null float64
airline                         14640 non-null object
airline_sentiment_gold          40 non-null object
name                            14640 non-null object
negativereason_gold             32 non-null object
retweet_count                   14640 non-null int64
text                            14640 non-null object
tweet_coord                     1019 non-null object
tweet_created                   14640 non-null object
tweet_location                  9907 non-null object
user_timezone                   9820 non-null object
dtypes: float64(2), int64(2), object(11)
memory usage: 1.7+ MB


In [4]:
# Keep only the tweet text and its sentiment label; the task is restricted to classifying tweets as positive/neutral/negative

columns_to_keep = [
    "airline_sentiment",
    "text"
]

df = df[columns_to_keep].copy()  # copy so that columns added later do not trigger SettingWithCopyWarning

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 2 columns):
airline_sentiment    14640 non-null object
text                 14640 non-null object
dtypes: object(2)
memory usage: 228.8+ KB


In [6]:
df.head()

  airline_sentiment                                               text
0           neutral                @VirginAmerica What @dhepburn said.
1          positive  @VirginAmerica plus you've added commercials t...
2           neutral  @VirginAmerica I didn't today... Must mean I n...
3          negative  @VirginAmerica it's really aggressive to blast...
4          negative  @VirginAmerica and it's a really big bad thing...


In [7]:
df["airline_sentiment"].value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
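
The counts above show a strong class imbalance, which is worth quantifying before modelling. As a rough sketch (the variable names are illustrative and not part of the notebook), the share of each class and the accuracy of always predicting the majority class can be computed from the printed counts:

```python
# Class counts copied from the value_counts() output above.
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class reaches this accuracy.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label:<8} {share:.1%}")
print(f"majority-class baseline: {baseline_accuracy:.1%}")
```

Any model trained later should beat this baseline of about 63% accuracy to be of use.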

In [8]:
# Neither of the two remaining columns contains nulls (see the df.info() output above)

In [9]:
# The target variable needs to be encoded as an int.
# Labels start at 0 so they can be used directly as class indices when one-hot encoding.

sentiments = {
    "negative": 0,
    "neutral" : 1,
    "positive": 2,
}

df["sentiment"] = df["airline_sentiment"].map(sentiments)

In [10]:
# The text needs to be cleaned so it is lower case and contains no ASCII punctuation

In [11]:
import string
lookup_table = str.maketrans({key: None for key in string.punctuation})

In [12]:
df["cleaned_text"] = df["text"].apply(lambda x: x.lower().translate(lookup_table))

In [13]:
df["cleaned_text"].head()

0                     virginamerica what dhepburn said
1    virginamerica plus youve added commercials to ...
2    virginamerica i didnt today must mean i need t...
3    virginamerica its really aggressive to blast o...
4    virginamerica and its a really big bad thing a...
Name: cleaned_text, dtype: object
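
The cleaning step can be tried on a single string without pandas. This is a minimal sketch using the same `str.translate` approach; `clean_text` is a hypothetical helper name. Note that `string.punctuation` includes `@` and `#`, so mentions and hashtags lose their prefix but keep their text, while characters outside ASCII punctuation (emoji, curly quotes) pass through unchanged.

```python
import string

# Translation table that deletes every ASCII punctuation character.
lookup_table = str.maketrans("", "", string.punctuation)

def clean_text(text):
    """Lower-case the text and strip ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(lookup_table)

print(clean_text("@VirginAmerica What @dhepburn said."))
```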

# Modelling

In [14]:
X = df["cleaned_text"].values
y = df["sentiment"].values

In [15]:
from sklearn.model_selection import train_test_split

# TODO: decide whether to stratify the split so that train and test keep the same distribution of the target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
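
On the TODO above: with such an imbalanced target, a plain random split can leave the train and test sets with slightly different class proportions. `train_test_split` accepts a `stratify` argument (for example `stratify=y`) that preserves them. The sketch below shows the idea in plain Python so the mechanics are visible; `stratified_split` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with an imbalance similar in spirit to the dataset (illustrative only).
labels = ["negative"] * 60 + ["neutral"] * 20 + ["positive"] * 20
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both subsets keep the 60/20/20 ratio of the toy labels, which is what `stratify=y` would aim for on the real data.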

#### Tokenisation

We need to convert each string into a list of integer word indices for Keras.

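#### Does this include these special characters?

Open question: does the tokenizer's index reserve special entries like the ones below?

rev_idx[0] = 'padding_char'

rev_idx[1] = 'start_char'

rev_idx[2] = 'oov_char'

rev_idx[3] = 'unk_char'

Partly. `Tokenizer` never assigns index 0 to a word, so 0 is free to act as the padding value, and the `oov_token` passed below gets an index of its own. It does not insert a start character.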
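
To make the index convention concrete, here is a minimal pure-Python sketch of a word-level index in the style of the tokenizer used in this section: index 0 is left unassigned so it can serve as padding, the out-of-vocabulary token takes index 1 (which I believe matches recent Keras versions), and the remaining words are numbered by descending frequency. `build_word_index` and `texts_to_sequences` are simplified stand-ins written for illustration, not the Keras implementation.

```python
from collections import Counter

OOV_TOKEN = "oov"

def build_word_index(texts, oov_token=OOV_TOKEN):
    """Map each word to an integer; 0 is never assigned, oov_token gets 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    vocabulary = [oov_token] + ordered
    return {word: i for i, word in enumerate(vocabulary, start=1)}

def texts_to_sequences(texts, word_index, oov_token=OOV_TOKEN):
    """Replace every word by its index, using the OOV index for unseen words."""
    oov_index = word_index[oov_token]
    return [[word_index.get(w, oov_index) for w in text.split()] for text in texts]

train_texts = ["flight was late", "flight was great"]
word_index = build_word_index(train_texts)
print(word_index)
print(texts_to_sequences(["flight was cancelled"], word_index))
```

The unseen word "cancelled" is mapped to the OOV index. The unassigned index 0 is also why the vocabulary size later in the notebook is computed as `len(tokenizer.word_index) + 1`.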

In [16]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token="oov")

Using TensorFlow backend.


In [17]:
# Fit to the training set
tokenizer.fit_on_texts(X_train) 

In [18]:
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

In [26]:
# Vocabulary size: number of distinct words in the training set, plus one for the reserved index 0
num_words = len(tokenizer.word_index) + 1

### Target variable encoding

The integer labels need to be one-hot encoded for multi-class classification.

In [55]:
number_of_target_classes = len(df["sentiment"].unique())

In [56]:
from keras.utils import to_categorical

y_cat_train = to_categorical(y_train, num_classes=number_of_target_classes)
y_cat_test = to_categorical(y_test, num_classes=number_of_target_classes)
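
For reference, one-hot encoding turns each integer label into a row with a single 1. The sketch below is a plain-Python illustration, assuming labels in the range 0 to `num_classes - 1`; `one_hot` is a hypothetical helper, not the Keras function.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 in the label's column."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

print(one_hot([0, 2, 1], num_classes=3))
```

Negative labels are rejected explicitly here because array-style negative indexing would otherwise silently map -1 to the last column.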

### Padding for the model

Tweets are capped at 140 characters (280 since the limit was doubled), but the sequences going into the model are lists of word indices, so far fewer positions are needed. The cell below uses a maximum length of 100 tokens, with post padding to extend any tweets shorter than that.

In [66]:
from keras.preprocessing.sequence import pad_sequences

max_length = 100  # counted in word tokens, not characters; generous for a tweet and could probably be reduced

X_train_pad = pad_sequences(X_train, maxlen=max_length, padding="post")
X_test_pad = pad_sequences(X_test, maxlen=max_length, padding="post")

In [67]:
X_train_pad.shape

(10248, 100)
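
A plain-Python sketch of what post padding does, assuming 0 is the padding value and that sequences longer than the limit are truncated from the front (the default `truncating="pre"` behaviour of `pad_sequences`). `pad_post` is a hypothetical helper for illustration only.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence on the right with `value`; keep the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]  # drop leading tokens when the sequence is too long
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

print(pad_post([[5, 8], [3, 9, 4, 7, 2, 6]], maxlen=4))
```

Every row comes out with exactly `maxlen` entries, which is why `X_train_pad` has a fixed second dimension.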

# Model

In [71]:
from keras.layers import Input, Embedding, Dense, GRU, LSTM, Dropout, BatchNormalization
from keras.models import Model

In [72]:
embedding_dimensions = 300 # number of dimensions for the word vector embeddings
number_of_classes = 3 # positive/neutral/negative
dropout_ratio = 0.2 # fraction of units to drop during training, to reduce overfitting
vocabulary_size = num_words # the number of words found in the train set which have been encoded

In [73]:
inputs = Input(shape=(max_length,))

embedding = Embedding(input_dim=vocabulary_size, output_dim=embedding_dimensions, input_length=max_length)(inputs)
x = LSTM(32, dropout=dropout_ratio, recurrent_dropout=dropout_ratio)(embedding)
x = Dense(32, activation="relu")(x)
#x = Dropout(dropout_ratio)(x)
x = BatchNormalization()(x)
x = Dense(16, activation="relu")(x)
outputs = Dense(number_of_classes, activation="softmax")(x)

model = Model(inputs=inputs, outputs=outputs)
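
To get a feel for where the model's capacity sits, the trainable weights of each layer can be counted by hand. The sketch below uses the layer sizes from the cell above and an assumed vocabulary size of 15,000 (a placeholder; the real value is `num_words`). The LSTM formula counts four gates, each with input weights, recurrent weights and a bias. Batch normalisation adds a small number of further parameters, omitted here.

```python
embedding_dimensions = 300
lstm_units = 32
vocabulary_size = 15_000  # assumed placeholder; use num_words in the notebook

# Embedding: one vector of size embedding_dimensions per vocabulary index.
embedding_params = vocabulary_size * embedding_dimensions

# LSTM: 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * (lstm_units * (embedding_dimensions + lstm_units) + lstm_units)

# Dense layers: inputs x outputs weights, plus one bias per output.
dense_32_params = 32 * 32 + 32
dense_16_params = 32 * 16 + 16
output_params = 16 * 3 + 3

print(embedding_params, lstm_params, dense_32_params, dense_16_params, output_params)
```

Under this assumption the embedding holds about 4.5 million weights against roughly 43 thousand in the LSTM, so most of the model's parameters sit in the embedding layer.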

In [74]:
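from keras.optimizers import RMSprop

# Note: lr=0.05 is far above the RMSprop default of 0.001 and may make training unstable
model.compile(optimizer=RMSprop(lr=0.05),
              loss="categorical_crossentropy",
              metrics=["accuracy"])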

In [None]:
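from keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for a few epochs.
# patience=3 is an assumed starting value, not a tuned one.
early_stopping = EarlyStopping(monitor="val_loss", patience=3)

In [None]:
model.fit(X_train_pad,
          y_cat_train,
          batch_size=16,
          epochs=20,
          validation_split=0.2,
          callbacks=[early_stopping])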

Train on 8198 samples, validate on 2050 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
 448/8198 [>.............................] - ETA: 1:11 - loss: 0.9635 - acc: 0.6027