# Sentiment Analysis - World Cup Tweets

* **Author:** Mitch Fehr

* This notebook builds a neural network for sentiment analysis using TensorFlow.
* The network is trained on tweets from the 2022 FIFA World Cup, so it is best suited to other soccer-related tweets.

* **Next Steps**:
  * Implement a pre-trained embedding layer
  * Add more data to combat overfitting

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import re
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from datasets import load_dataset

In [None]:
tf.random.set_seed(33)

In [None]:
# Hugging Face dataset
fifa_tweets = load_dataset("Tirendaz/fifa-world-cup-2022-tweets")

In [None]:
fifa_tweets = fifa_tweets["train"].to_pandas() # only has a train split
fifa_tweets.head()

In [None]:
print(f"This dataset has {fifa_tweets.shape[0]} rows and {fifa_tweets.shape[1]} features.")

In [None]:
# Drop irrelevant features
fifa_tweets.drop(columns=["Unnamed: 0", "Date Created", "Number of Likes", "Source of Tweet"], inplace=True)
fifa_tweets.head()

## EDA

* Checking for duplicates in data and other anomalies
* Looking at distribution of label values

In [None]:
fifa_tweets.info()

No missing values

In [None]:
fifa_tweets.describe()

It looks like there are some duplicate tweets, so let's handle those.

In [None]:
fifa_tweets[fifa_tweets.duplicated()]

In [None]:
fifa_tweets.drop_duplicates(inplace=True)
fifa_tweets.shape

In [None]:
sentiment_counts = fifa_tweets['Sentiment'].value_counts()

plt.bar(x=sentiment_counts.index, height=sentiment_counts.values)
plt.xlabel('Sentiment')
plt.ylabel('Counts')
plt.show()

This plot shows a good number of tweets for each sentiment, but significantly fewer negative ones. This imbalance could bias the model towards positive and neutral sentiments. We can account for this in training using class weights.
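
As a rough illustration of how class weights offset the imbalance, the sketch below reproduces the "balanced" heuristic, `n_samples / (n_classes * class_count)`, in plain Python. The label counts are made up for the example and are not taken from this dataset, and `balanced_class_weights` is a hypothetical helper rather than a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

# Made-up label distribution (NOT the real dataset): 0 = negative, 1 = neutral, 2 = positive
toy_labels = [0] * 20 + [1] * 40 + [2] * 40
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarest class receives the largest weight, so its mistakes contribute more to the loss during training.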

## Preprocessing

* Clean tweets to get rid of noise in the data
* Pass tweets into a tokenizer
* Pad the tokenized tweets so that every input has the same length
* Change sentiment values to integers for training
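
To make the padding step concrete, here is a minimal pure-Python sketch of what padding to a fixed length does. It assumes the Keras `pad_sequences` defaults, which pad and truncate at the front of each sequence. `pad_pre` is a hypothetical helper written only for illustration, and the token ids are invented.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` tokens of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded

# Invented token ids: one short tweet and one tweet longer than maxlen
toy_sequences = [[5, 12, 7], [3, 9, 4, 8, 2, 6]]
toy_padded = pad_pre(toy_sequences, maxlen=5)
print(toy_padded)
```

Every row ends up with exactly `maxlen` entries, which is what lets the tweets be stacked into a single input array.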

In [None]:
# Checking what kind of noise is in the tweets.
fifa_tweets.sample(5)

In [None]:
# Function to clean text
def clean_tweet(text: str) -> str:
    """
    Cleans a tweet to prepare for model training by getting rid of noisy text.

    - Removes all web addresses
    - Replaces new line characters with a space
    - Removes all digits
    - Replaces user mentions (@___) with 'user'
    - Removes punctuation (including the '#' in hashtags, keeping the text)

    Returns the cleaned, lowercased text.
    """
    # Remove web addresses
    cleaned_text = re.sub(r'https?://\S+', '', text)
    # Replace new line characters with a space so adjacent words are not joined
    cleaned_text = re.sub(r'\n', ' ', cleaned_text)
    # Remove digits
    cleaned_text = re.sub(r'\d', '', cleaned_text)
    # Replace user mentions; this must run before punctuation removal,
    # otherwise the '@' is already gone and the pattern never matches
    cleaned_text = re.sub(r'@\w+', 'user', cleaned_text)
    # Remove punctuation (including the '#' in hashtags)
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)

    return cleaned_text.lower().strip()


In [None]:
# Clean the tweets
fifa_tweets['Tweet'] = fifa_tweets['Tweet'].apply(clean_tweet)

# Tokenization
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>') # only consider the 10,000 most frequent words
tokenizer.fit_on_texts(fifa_tweets["Tweet"])
sequences = tokenizer.texts_to_sequences(fifa_tweets["Tweet"])

# Pad sequences so that they are all same length to feed into model
padded_sequences = pad_sequences(sequences, maxlen=100)

# Convert sentiment labels to integers
sentiment_map = {
    'positive': 2,
    'neutral': 1,
    'negative': 0
}
fifa_tweets["Sentiment"] = fifa_tweets["Sentiment"].map(sentiment_map)

## Model Training

* Split data into training and test sets (the validation set is carved out of the training data by `validation_split` in the `.fit` call)

* RNN architecture  
  1. **Embedding layer**  
  2. **Bidirectional Long Short-Term Memory (LSTM) Layer**  
  3. **Dropout layer** (for regularization)   
  4. **Bidirectional LSTM Layer**  
  5. **Dropout Layer**  
  6. **Dense Layer** with ReLU activation
  7. **Dropout Layer**
  8. **Output Dense Layer** with Softmax activation  
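
As a sanity check on the size of this architecture, the trainable parameters can be counted by hand. The sketch below assumes the layer sizes used in the model cell (a vocabulary of 10,000, embedding size 32, 32 LSTM units per direction, a 16-unit dense layer, and 3 output classes) and the standard LSTM parameter formula with biases. The helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus bias vector
    return input_dim * units + units

vocab_size, embed_dim, lstm_units = 10_000, 32, 32

layer_params = {
    "embedding": vocab_size * embed_dim,
    # a bidirectional wrapper holds two LSTMs, one per direction
    "bilstm_1": 2 * lstm_params(embed_dim, lstm_units),
    # the second BiLSTM receives the concatenated outputs of both directions
    "bilstm_2": 2 * lstm_params(2 * lstm_units, lstm_units),
    "dense": dense_params(2 * lstm_units, 16),
    "output": dense_params(16, 3),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:>10}: {count:,}")
print(f"{'total':>10}: {total_params:,}")
```

Dropout layers add no parameters. Under these assumptions most of the parameters sit in the embedding layer, and the totals can be compared against the output of `model.summary()`.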

In [None]:
X = padded_sequences
y = fifa_tweets["Sentiment"].values

# Splitting into training and testing splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state=33)

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)

# Model Architecture
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 32),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(3, activation='softmax')
])

# Wait 3 epochs with no improvement and then stop training
early_stopping = EarlyStopping(
    monitor='val_accuracy',
    patience=3,
    restore_best_weights=True # use best version of the model
)

# Compute class weights to account for the class imbalance
class_weights = class_weight.compute_class_weight(
    'balanced',
    classes=np.unique(y_train),
    y=y_train
)

class_weight_dict = dict(enumerate(class_weights))

# Model fitting
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=optimizer,
    metrics=['accuracy']
)

history = model.fit(
    X_train,
    y_train,
    epochs=20,
    validation_split=0.2,
    batch_size=32,
    callbacks=[early_stopping],
    class_weight=class_weight_dict
)

model.summary()

I tested several different architectures here. I lowered the learning rate from Adam's default of 0.001 to 0.0001 because the model converged too quickly, hinting at overfitting. I also tried more units in each layer, but that made the model more complex without really changing performance.

In [None]:
# Plot Training vs Validation accuracy
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

In [None]:
# Plot Training vs Validation loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Based on the accuracy and loss plots, the model converges quickly, within the first few epochs of training. Adding more data would probably help, since there are currently only around 20,000 tweets in the dataset. Luckily, the `EarlyStopping` callback (with `restore_best_weights=True`) restores the model to its best version.
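
The patience logic behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the Keras implementation: it assumes a metric where higher is better (such as validation accuracy), uses zero-based epoch numbers, and runs on invented scores. `simulate_early_stopping` is a hypothetical helper.

```python
def simulate_early_stopping(val_scores, patience=3):
    """Return (stop_epoch, best_epoch) for a metric where higher is better."""
    best_score = float("-inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training ends on the last epoch
    return len(val_scores) - 1, best_epoch

# Invented validation accuracies, NOT results from this notebook
toy_val_accuracy = [0.60, 0.68, 0.71, 0.70, 0.71, 0.69, 0.72]
stop_epoch, best_epoch = simulate_early_stopping(toy_val_accuracy, patience=3)
print(f"Stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

In this toy run, training stops before the later, higher score is ever reached, which is the trade-off a small patience value makes.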

## Final Evaluation

In [None]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {test_accuracy:.3f}')
print(f'Test Loss: {test_loss:.3f}')

## Using the Model

In [None]:
def predict(tweet: str):
  """
  Using the network, predicts the sentiment of a soccer tweet.

  - Preprocesses the tweet first
    - clean, tokenize and pad
  - Predict with model.predict() and change label from integer to sentiment

  Returns the predicted sentiment.
  """

  cleaned_tweet = clean_tweet(tweet)

  sample_sequence = tokenizer.texts_to_sequences([cleaned_tweet])
  sample_padded = pad_sequences(sample_sequence, maxlen=100)

  reverse_sentiment_map = {
      2: "Positive",
      1: "Neutral",
      0: "Negative"
  }

  # Returns the raw softmax probabilities
  prediction = model.predict(sample_padded)
  # Selects highest probability
  class_prediction = reverse_sentiment_map[np.argmax(prediction)]

  return class_prediction

In [None]:
sample_tweets = [
    "I'm not a huge fan of watching Ronaldo play. Hopefully Portugal loses.",
    "Soccer is super fun, and I really hope Argentina wins!"
]

for tweet in sample_tweets:
  prediction = predict(tweet)

  print(f"Tweet: {tweet}")
  print(f"Prediction: {prediction}")
  print()