<a href="https://colab.research.google.com/github/jasonyang429/Twitter-Sentiment-Analysis-Simple/blob/main/Twitter_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Import all the necessary packages and libraries
import nltk
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import re
from nltk.corpus import twitter_samples
import string
from sklearn.utils import shuffle
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


nltk.download('twitter_samples')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

Here, I combine the positive and negative tweets and create a label for each of them.

"1" stands for positive-sentiment tweets and "0" stands for negative-sentiment tweets.

In [83]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

stop_words = set(stopwords.words('english'))

tweets = positive_tweets + negative_tweets

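# Label every positive tweet 1 and every negative tweet 0,
# using the actual list lengths instead of assuming an even split
Y = np.array([1] * len(positive_tweets) + [0] * len(negative_tweets))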

In [84]:
print(tweets[0])
print(Y[0])

#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
1


In the cell below, I implement a function that preprocesses each tweet into a list of tokens to pass into the network later.

Two concepts are relevant here: stemming and lemmatizing.

*    Stemming reduces a word to its stem by stripping suffixes with fixed rules, for example 'eating' -> 'eat'. The result is not always a dictionary word.
*    Lemmatizing maps a word to its dictionary form (lemma) using vocabulary and part-of-speech information, without changing its meaning. For example, 'better' -> 'good' <sup>[3]</sup>
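
As a quick illustration of the stemming bullet, the sketch below runs NLTK's `PorterStemmer` on a few words. The word list is made up for illustration. Because stemming only strips suffixes by rule, 'really' becomes 'realli', which is also the form that shows up in the tokenized output later in this notebook. The lemmatizer is left out here because it needs the WordNet data to be downloaded first.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words, not taken from the dataset
words = ["eating", "really", "reading"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```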

In [85]:
def preprocess_texts(tweet):
  # Lower case the sentence
  tweet = tweet.lower()

  # Remove all URLs starting with http, https, or www
  # ('http\S+' already matches https links)
  tweet = re.sub(r'http\S+|www\S+', '', tweet)

  # Remove @mentions entirely and strip the '#' symbol from hashtags
  tweet = re.sub(r'@\w+|#', '', tweet)

  # Remove punctuation (see reference no. 5).
  # str.maketrans(x, y, z) maps each character in x to the character at the
  # same position in y and deletes every character in z. With x and y empty,
  # the table only deletes the characters in string.punctuation.
  tweet = tweet.translate(str.maketrans('', '', string.punctuation))

  # Tokenize the tweets with nltk
  tweet_tokens = word_tokenize(tweet)

  # Remove the stopwords from the tweet
  tweet_tokens_without_stopwords = [token for token in tweet_tokens if token not in stop_words]

  # Stemming the words 
  stemmer = PorterStemmer()

  stemmed_tokens = [stemmer.stem(token) for token in tweet_tokens_without_stopwords]
  
  ### Alternative: uncomment this part to lemmatize instead of stemming.
  ### Doing both stemming and lemmatizing is not encouraged.
  # lemmatizer = WordNetLemmatizer()

  # lemmatized_tokens = []
  # for token, tag in pos_tag(tweet_tokens_without_stopwords):
  #   # Map the Penn Treebank tag to a WordNet part of speech
  #   if tag.startswith('NN'):
  #     pos = 'n'
  #   elif tag.startswith('VB'):
  #     pos = 'v'
  #   else:
  #     pos = 'a'
  #   lemmatized_tokens.append(lemmatizer.lemmatize(token, pos=pos))

  # return lemmatized_tokens
  return stemmed_tokens
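
To make the cleaning steps easier to follow, here is a minimal sketch that uses only the standard library. `clean_tweet` is a hypothetical helper, the sample tweet and its URL are made up, and a plain whitespace split stands in for `word_tokenize`, so the result can differ slightly from what `preprocess_texts` returns.

```python
import re
import string

def clean_tweet(tweet):
    """Hypothetical helper: the text-cleaning steps only, without NLTK."""
    tweet = tweet.lower()
    # Drop URLs
    tweet = re.sub(r'http\S+|www\S+', '', tweet)
    # Drop @mentions entirely; drop only the '#' of a hashtag
    tweet = re.sub(r'@\w+|#', '', tweet)
    # Delete every punctuation character
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    # Whitespace split stands in for word_tokenize (an assumption)
    return tweet.split()

sample = "#FollowFriday @France_Inte check https://t.co/abc :)"
cleaned = clean_tweet(sample)
print(cleaned)  # ['followfriday', 'check']
```

Note that the emoticon `:)` disappears together with the rest of the punctuation.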


Then, tokenize the words and pad the resulting sequences.

In [86]:
# Set the maximum number of words stored in the dictionary
VOCAB_SIZE = 10000

# Preprocess all tweets
tweets = [preprocess_texts(tweet) for tweet in tweets]

# Define the dictionary to store the vocabularies
# Maximum number of words stored is VOCAB_SIZE
tokenizers = Tokenizer(num_words=VOCAB_SIZE)
tokenizers.fit_on_texts(tweets)

# Convert each tweet into a sequence of word indices, then pad with
# trailing zeros so that all sequences have the same length
X = tokenizers.texts_to_sequences(tweets)
X = pad_sequences(X, padding='post')

# Convert the integer labels into one-hot (categorical) vectors
Y = tf.keras.utils.to_categorical(Y)

I spent two days debugging because I had forgotten to convert the Y labels into a one-hot (categorical) matrix while using categorical cross-entropy as the loss function.
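
For reference, `categorical_crossentropy` expects one-hot targets, which is why the conversion matters. The sketch below shows the same encoding with plain NumPy on three made-up labels. If you would rather keep integer labels, Keras also offers `sparse_categorical_crossentropy`.

```python
import numpy as np

# Integer class labels: 1 = positive, 0 = negative (made-up examples)
labels = np.array([1, 0, 1])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(2)[labels]
print(one_hot)
```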

In [138]:
print(tweets[20])
print(X[20])
print(Y[20])

['bc', 'realli', 'dont', 'feel', 'like', 'read']
[158  29  14  31   5 168   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0]
[0. 1.]
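
The padded vector above can be reproduced in miniature. The sketch below roughly mimics what `Tokenizer` and `pad_sequences(..., padding='post')` do, in pure Python: words are ranked by frequency, index 0 is reserved for padding, and shorter sequences get trailing zeros. `build_word_index` and `to_padded_sequences` are hypothetical helpers and the two token lists are made up; this is not the Keras implementation.

```python
from collections import Counter

def build_word_index(token_lists, vocab_size):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Keep the (vocab_size - 1) most frequent words, numbered from 1
    top_words = counts.most_common(vocab_size - 1)
    return {word: rank for rank, (word, _) in enumerate(top_words, start=1)}

def to_padded_sequences(token_lists, word_index):
    """Replace words by indices, drop unknown words, pad with trailing zeros."""
    sequences = [[word_index[t] for t in tokens if t in word_index]
                 for tokens in token_lists]
    max_len = max(len(seq) for seq in sequences)
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]

docs = [["feel", "like", "read"], ["like", "read", "read", "book", "today"]]
word_index = build_word_index(docs, vocab_size=10)
padded = to_padded_sequences(docs, word_index)
print(word_index)
print(padded)  # [[3, 2, 1, 0, 0], [2, 1, 1, 4, 5]]
```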


In [88]:
# Shuffle X, Y and tweets together so that they stay aligned
# random_state=0 works like a seed: it ensures the same order is obtained every time
X, Y, tweets = shuffle(X, Y, tweets, random_state=0)
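
The important property of shuffling several arrays together is that one and the same permutation is applied to all of them. A minimal NumPy sketch of that idea on made-up data (this illustrates the concept, it is not how `sklearn.utils.shuffle` is implemented internally):

```python
import numpy as np

features = np.array([[10, 11], [20, 21], [30, 31], [40, 41]])
labels = np.array([1, 2, 3, 4])

# One seeded permutation, applied to every array
rng = np.random.default_rng(0)
order = rng.permutation(len(features))

features_shuffled = features[order]
labels_shuffled = labels[order]

# Each row still sits next to its own label
print(features_shuffled[:, 0] // 10)
print(labels_shuffled)
```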

In [None]:
print(tweets[20])
print(X[20])
print(Y[20])

['bc', 'realli', 'dont', 'feel', 'like', 'read']
[158  29  14  31   5 168   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0]
[0. 1.]


In [89]:
# Fix the input dimensions and length for the embedding layer of the model
INPUT_DIMS = VOCAB_SIZE
INPUT_LENGTH = X.shape[1]

print(INPUT_LENGTH)
print(INPUT_DIMS)

27
10000


Here, I created the model with an LSTM layer. Feel free to try other layers, such as GRU or Conv1D.

In [112]:
# Specify the embedding dimensions for the model
EMBEDDING_DIMS = 64

# The model
# Embedding
#   maps each word index to a trainable 64-dimensional vector
# LSTM (Long Short-Term Memory)
#   recurrent layer that reads the sequence of word embeddings
# Dropout
#   for regularization
# Dense
#   fully connected layer with 2 neurons and softmax, one output per class
model = tf.keras.Sequential([
          tf.keras.layers.Embedding(INPUT_DIMS, EMBEDDING_DIMS, input_length=INPUT_LENGTH),
          tf.keras.layers.LSTM(16, recurrent_dropout=0.3, dropout=0.3, recurrent_regularizer='l2'),
          tf.keras.layers.Dropout(0.4),
          tf.keras.layers.Dense(2, activation='softmax')
])

model.summary()

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_16 (Embedding)     (None, 27, 64)            640000    
_________________________________________________________________
lstm_16 (LSTM)               (None, 16)                5184      
_________________________________________________________________
dropout_9 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_16 (Dense)             (None, 2)                 34        
Total params: 645,218
Trainable params: 645,218
Non-trainable params: 0
_________________________________________________________________
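
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types with the sizes from this notebook; the variable names are chosen only for this example.

```python
vocab_size, embedding_dims, lstm_units, num_classes = 10000, 64, 16, 2

# Embedding: one vector of size embedding_dims per vocabulary entry
embedding_params = vocab_size * embedding_dims

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dims + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per output unit
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 640000 5184 34 645218
```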


In [113]:
# Split into training, evaluation and test sets
TRAIN_SPLIT = int(0.9*len(X))

x_test = X[TRAIN_SPLIT:]
x_train = X[:TRAIN_SPLIT]

y_test = Y[TRAIN_SPLIT:]
y_train = Y[:TRAIN_SPLIT]

EVAL_SPLIT = int(0.9 * len(x_train))

x_eval = x_train[EVAL_SPLIT:]
x_train = x_train[:EVAL_SPLIT]

y_eval = y_train[EVAL_SPLIT:]
y_train = y_train[:EVAL_SPLIT]
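
To see what the two 90% cuts amount to, the sketch below applies the same slicing to a dummy array. `three_way_split` is a hypothetical helper, and the 10,000 samples are an assumption based on the corpus holding 5,000 positive and 5,000 negative tweets.

```python
import numpy as np

def three_way_split(data, train_fraction=0.9):
    """Hypothetical helper mirroring the slicing above: hold out a test
    set first, then carve an evaluation set out of what remains."""
    test_start = int(train_fraction * len(data))
    remaining, test = data[:test_start], data[test_start:]
    eval_start = int(train_fraction * len(remaining))
    return remaining[:eval_start], remaining[eval_start:], test

# Assumed dataset size of 10,000 samples
train, evaluation, test = three_way_split(np.arange(10000))
print(len(train), len(evaluation), len(test))  # 8100 900 1000
```

So roughly 81% of the data is used for training, 9% for evaluation and 10% for testing.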



In [114]:
# Training the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=32, epochs=10, validation_data=(x_eval, y_eval))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f7f6bd5a550>

The model is overfitting the training set. Possible improvements:

*   Add more regularization
*   Reduce model complexity
*   Use higher dropout rates



In [157]:
# Evaluating the model
model.evaluate(x_test, y_test)



[0.7940672039985657, 0.7599999904632568]
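
`model.evaluate` returns the loss followed by the metrics passed to `compile`, so the two numbers above are the categorical cross-entropy loss and the accuracy on the test set. Below is a minimal NumPy sketch of how these two quantities are defined, using made-up predictions rather than the model's real outputs.

```python
import numpy as np

# Made-up one-hot targets and predicted class probabilities
y_true = np.array([[0.0, 1.0], [1.0, 0.0]])
y_pred = np.array([[0.2, 0.8], [0.6, 0.4]])

# Categorical cross-entropy: -sum(y_true * log(y_pred)) per sample, then the mean
loss = float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Accuracy: fraction of samples whose most probable class matches the target
accuracy = float(np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1)))

print(round(loss, 4), accuracy)
```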

The cell below predicts the sentiment of any sentence.

In [156]:
your_text = ['I am so idiot']
your_text = [preprocess_texts(text) for text in your_text]
your_text = tokenizers.texts_to_sequences(your_text)
your_text = pad_sequences(your_text, maxlen=X.shape[1], padding='post')



def get_sentiment(text):
  # Class probabilities for the first (and only) sample: [negative, positive]
  sentiment = model.predict(text, batch_size=1, verbose=2)[0]
  print(sentiment)

  if np.argmax(sentiment) == 0:
    print("Negative sentiment")
  else:
    print("Positive sentiment")

get_sentiment(your_text)


1/1 - 0s
[0.867867   0.13213296]
Negative sentiment


References:

1. [Keras LSTM Twitter Sentiment Analysis](https://www.kaggle.com/vandalko/keras-lstm-twitter-sentiment-analysis)
2. [LSTM Sentiment Analysis | Keras](https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras)
3. [Building a Twitter Sentiment Analysis in Python](https://www.pluralsight.com/guides/building-a-twitter-sentiment-analysis-in-python)
4. [How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)](https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk)
5. [Python String maketrans() Method](https://www.w3schools.com/python/ref_string_maketrans.asp)


