## 1. Tweet preprocessing (previous datalab)

Before starting this DataLab, preprocess the tweets as you did in the previous one.

In [5]:
import nltk
from nltk.corpus import twitter_samples
import matplotlib.pyplot as plt
import random
import numpy as np
import pandas as pd

In [2]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\neilr\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [3]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [4]:
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

Number of positive tweets:  5000
Number of negative tweets:  5000


In [6]:
import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

In [7]:
# stopwords was already imported in the previous cell
nltk.download('stopwords')
stopwords_english = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\neilr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a list of tokens
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
    - Tokenizes
    - Removes stopwords and punctuation
    - Stems tokens
        
    """
    # YOUR CODE HERE #
    # Remove URLs using regular expression
    processed_tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet)
    
    # Remove only the # sign, keeping the hashtag text
    processed_tweet = re.sub(r'#([^\s]+)', r'\1', processed_tweet)
    
    # Tokenize the tweet using TweetTokenizer
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=False)
    processed_tweet = tokenizer.tokenize(processed_tweet)
    
    # Remove stopwords
    stopwords_english = set(stopwords.words('english'))
    processed_tweet = [word for word in processed_tweet if word not in stopwords_english]
    
    # Remove punctuation
    processed_tweet = [word for word in processed_tweet if word not in string.punctuation]
    
    # Stem the tokens
    stemmer = PorterStemmer()
    processed_tweet = [stemmer.stem(word) for word in processed_tweet]

    return processed_tweet
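
One detail of the punctuation filter above is worth knowing: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test, not a per-character test. The following is a minimal sketch using only the standard library; the sample tokens are made up for illustration.

```python
import string

# string.punctuation is one string: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
sample_tokens = ["!", ":)", "()", "happi", "..."]

for token in sample_tokens:
    # `in` on a string checks whether token occurs as a contiguous substring
    kept = token not in string.punctuation
    print(repr(token), "kept" if kept else "removed")
```

Single punctuation characters are removed and emoticons such as `:)` survive. A multi-character token like `()` would also be removed, because it happens to appear contiguously inside `string.punctuation`.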

In [9]:
tweet = all_positive_tweets[2277]
tweet_processed = tweet_processor(tweet)

print(tweet)
print(tweet_processed)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
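
To see what the two regular expressions do before tokenization, the sketch below applies them to the same sample tweet, using only the standard library. The variable names `no_urls` and `no_hash` are introduced here for illustration.

```python
import re

tweet = ("My beautiful sunflowers on a sunny Friday morning off :) "
         "#sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i")

# Step 1: drop anything that starts with http or www, up to the next whitespace
no_urls = re.sub(r'http\S+|www\S+|https\S+', '', tweet)

# Step 2: replace "#word" with "word" (the capture group keeps the hashtag text)
no_hash = re.sub(r'#([^\s]+)', r'\1', no_urls)

print(no_urls)
print(no_hash)
```

The hashtag words stay in the text, which is why `sunflow` and `friday` each appear twice in the processed output above.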


In [10]:
# 80% training 20% testing
positive_tweets_tr = all_positive_tweets[:4000]
positive_tweets_te = all_positive_tweets[4000:]

negative_tweets_tr = all_negative_tweets[:4000]
negative_tweets_te = all_negative_tweets[4000:]

In [11]:
def tweet_processor_list(tweet_list):
    # YOUR CODE HERE #
    processed_tweet_list = []
    for tweet in tweet_list:
        processed_tweet = tweet_processor(tweet)
        processed_tweet_list.append(processed_tweet)
    return processed_tweet_list

In [12]:
positive_tweets_tr = tweet_processor_list(positive_tweets_tr)
positive_tweets_te = tweet_processor_list(positive_tweets_te)

negative_tweets_tr = tweet_processor_list(negative_tweets_tr)
negative_tweets_te = tweet_processor_list(negative_tweets_te)

You already completed the steps up to this point in the previous DataLab. Let's do a quick sanity check:

In [13]:
assert len(positive_tweets_tr) == 4000
assert len(negative_tweets_tr) == 4000

assert len(positive_tweets_te) == 1000
assert len(negative_tweets_te) == 1000

assert type(positive_tweets_tr) is list
assert type(positive_tweets_tr[0]) is list
assert type(positive_tweets_tr[0][0]) is str

## 2. Converting tweets to numbers (previous datalab)

Repeat your code from the last datalab.

In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences




In [15]:
training_tweets = positive_tweets_tr + negative_tweets_tr
test_tweets = positive_tweets_te + negative_tweets_te

While creating the dataset, we can also create the labels. The first half of `training_tweets` and `test_tweets` is positive (label = 1) and the second half is negative (label = 0), so creating the labels is as easy as:

In [16]:
y_train = np.append(np.ones(len(positive_tweets_tr)),
                    np.zeros(len(negative_tweets_tr)))

y_test = np.append(np.ones(len(positive_tweets_te)),
                   np.zeros(len(negative_tweets_te)))

print(y_train.shape)
print(y_test.shape)

(8000,)
(2000,)
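
The labels only make sense if they follow the same order as the concatenated tweets. A small sketch on made-up toy data (the `toy_*` names are hypothetical) shows the alignment; `np.concatenate` is used here as an equivalent of the `np.append` call above.

```python
import numpy as np

# Hypothetical toy data: two positive and three negative "tweets"
toy_positive = [["happi", ":)"], ["great", "day"]]
toy_negative = [["sad", ":("], ["bad"], ["miss"]]

# Same ordering as training_tweets: positives first, then negatives
toy_tweets = toy_positive + toy_negative
toy_labels = np.concatenate([np.ones(len(toy_positive)),
                             np.zeros(len(toy_negative))])

for tokens, label in zip(toy_tweets, toy_labels):
    print(int(label), tokens)
```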


Remember that we already preprocessed and tokenized our tweets:

In [17]:
training_tweets[0]

['followfriday',
 '@france_int',
 '@pkuchly57',
 '@milipol_pari',
 'top',
 'engag',
 'member',
 'commun',
 'week',
 ':)']

The Keras `Tokenizer()` works on a list of strings here, so let's join each tweet's tokens back into a single space-separated string:

In [18]:
training_tweets_str = []
for tw in training_tweets:
    training_tweets_str.append(' '.join(tw))
    
test_tweets_str = []
for tw in test_tweets:
    test_tweets_str.append(' '.join(tw))

In [19]:
training_tweets_str[0]

'followfriday @france_int @pkuchly57 @milipol_pari top engag member commun week :)'
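
The two loops above can also be written as list comprehensions. The sketch below uses a made-up `toy_tweets` list and checks that splitting on a space recovers the original tokens.

```python
# Hypothetical tokenized tweets, in the same shape as training_tweets
toy_tweets = [["followfriday", "top", "engag", ":)"], ["happi", "friday"]]

# One string per tweet, tokens separated by a single space
toy_tweets_str = [" ".join(tokens) for tokens in toy_tweets]
print(toy_tweets_str)

# Splitting on a single space recovers the original tokens,
# as long as no token itself contains a space.
recovered = [s.split(" ") for s in toy_tweets_str]
print(recovered == toy_tweets)
```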

**Task 2.1**

Fit a `Tokenizer` on `training_tweets_str`. By default, the tokenizer strips the characters listed in its `filters` parameter (mostly punctuation). Set `filters=''` to disable this, because the tweets are already preprocessed and tokens such as `:)` must be kept.

In [20]:
# YOUR CODE HERE #
# Initialize the tokenizer
tokenizer = Tokenizer(filters='')

# Fit the tokenizer on training_tweets_str
tokenizer.fit_on_texts(training_tweets_str)

**Task 2.2**

Calculate the size of the vocabulary.

In [21]:
# YOUR CODE HERE #
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding (word indices start at 1)
print("Vocabulary size:", vocab_size)

Vocabulary size: 14886
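
To make the `+ 1` concrete, here is a rough pure-Python stand-in for how a frequency-ranked word index can be built. It is a sketch of the idea, not the Keras implementation, and `build_word_index` and the toy texts are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Rough stand-in for the word_index built by fit_on_texts (filters='')."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split(" "))
    # most_common() sorts by descending count; indices start at 1
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_texts = ["happi friday :)", "sad friday :(", "happi :)"]
toy_index = build_word_index(toy_texts)
toy_vocab_size = len(toy_index) + 1  # index 0 is left free for padding

print(toy_index)
print(toy_vocab_size)
```

Because the largest index equals the number of distinct words, an embedding layer needs `len(word_index) + 1` rows to cover indices `0` through the largest one.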


**Task 2.3**

Find the numbers that represent the words `'boy'`, `'girl'`, `'man'` and `'woman'`.

In [23]:
# YOUR CODE HERE #
word_to_index = tokenizer.word_index
numbers_representations = {word: index for word, index in word_to_index.items() if word in ['boy', 'girl', 'man', 'woman']}
numbers_representations

{'girl': 154, 'man': 203, 'boy': 354, 'woman': 897}
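
Since `word_index` is a plain dictionary, each word can also be looked up directly. The sketch below uses a stand-in dictionary holding only the four values printed above; `'robot'` is a hypothetical query that is absent from this stand-in (whether it is in the real vocabulary is not known).

```python
# Stand-in for tokenizer.word_index, using the four values printed above
word_index = {'girl': 154, 'man': 203, 'boy': 354, 'woman': 897}

query_words = ['boy', 'girl', 'man', 'woman', 'robot']

# dict.get returns None for words that are not in the dictionary
lookups = {word: word_index.get(word) for word in query_words}
print(lookups)
```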

**Task 2.4**

Convert training and test tweets to sequences and use padding.

Example tweet:

`'followfriday top engag member commun week :)'`

Corresponding sequence:

`[347, 221, 937, 400, 286, 52, 3]`

Padded sequence:

`array([347, 221, 937, 400, 286,  52,   3,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0], dtype=int32)`


For the padding arguments, use `padding='post'` and `maxlen=30`.

In [24]:
# YOUR CODE HERE #
# Convert training tweets to sequences
training_sequences = tokenizer.texts_to_sequences(training_tweets_str)

# Pad the sequences
training_padded = pad_sequences(training_sequences, padding='post', maxlen=30)

# Convert test tweets to sequences
test_sequences = tokenizer.texts_to_sequences(test_tweets_str)

# Pad the sequences
test_padded = pad_sequences(test_sequences, padding='post', maxlen=30)

In [25]:
assert training_padded.shape == (8000, 30)
assert test_padded.shape == (2000, 30)
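
The conversion and padding steps can be mimicked in pure Python to see what happens to each tweet. This is a sketch under two assumptions that match the Keras defaults used above: words missing from the index are skipped (no `oov_token` was set), and sequences longer than `maxlen` keep their last `maxlen` ids (the default `truncating='pre'`). The helper names `to_sequence` and `pad_post` are hypothetical.

```python
def to_sequence(text, word_index):
    """Map known words to their index; unknown words are skipped."""
    return [word_index[w] for w in text.split(" ") if w in word_index]

def pad_post(sequence, maxlen):
    """Keep the last `maxlen` ids, then append zeros on the right."""
    trimmed = sequence[-maxlen:]
    return trimmed + [0] * (maxlen - len(trimmed))

# Indices taken from the example in Task 2.4
word_index = {'followfriday': 347, 'top': 221, 'engag': 937, 'member': 400,
              'commun': 286, 'week': 52, ':)': 3}

seq = to_sequence('followfriday top engag member commun week :)', word_index)
padded = pad_post(seq, maxlen=30)
print(seq)
print(padded)
```

Words that appear only in the test tweets are therefore silently dropped from their sequences, since the tokenizer was fitted on the training tweets alone.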

## 3. Combine embedding layer with RNNs (this datalab)

Create a model with the following layers:
- Embedding layer
- Recurrent layer
- Dense layer

This is the minimum architecture. You can modify it to increase the number of layers or add new layers such as Dropout.

Keras provides a few recurrent layers:

- LSTM layer
- GRU layer
- SimpleRNN layer
- TimeDistributed layer
- Bidirectional layer
- ConvLSTM1D layer
- ConvLSTM2D layer
- ConvLSTM3D layer
- Base RNN layer

https://keras.io/api/layers/recurrent_layers/

For your recurrent layer try LSTM, GRU and SimpleRNN.

In [38]:
X_train = training_padded
X_test = test_padded

In [27]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((8000, 30), (8000,), (2000, 30), (2000,))

In [47]:
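from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, GRU, SimpleRNN, Dense

model = Sequential()

vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
embedding_dim = 100                         # size of each word vector
max_length = 30                             # must match maxlen used in pad_sequences

# Embedding: (batch, 30) integer ids -> (batch, 30, 100) dense vectors
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))
# LSTM returns only its final hidden state: (batch, 128)
model.add(LSTM(units=128))
# Single sigmoid unit outputs the probability that the tweet is positive
model.add(Dense(units=1, activation='sigmoid'))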

model.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 30, 100)           1488600   
                                                                 
 lstm_4 (LSTM)               (None, 128)               117248    
                                                                 
 dense_4 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1605977 (6.13 MB)
Trainable params: 1605977 (6.13 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
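
The parameter counts in the summary can be reproduced by hand, which also gives a feel for how the three suggested recurrent layers compare in size. The sketch below is plain arithmetic; the GRU line assumes the Keras default `reset_after=True`, which uses two bias vectors per gate.

```python
vocab_size, embedding_dim, units = 14886, 100, 128

# Embedding: one vector of length embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim

# One recurrent transformation: input weights + recurrent weights + bias
gate_params = (embedding_dim + units) * units + units

simple_rnn_params = gate_params      # a single transformation
lstm_params = 4 * gate_params        # input, forget, cell and output gates
# Assumption: Keras GRU default reset_after=True, which adds a second bias vector
gru_params = 3 * (gate_params + units)

dense_params = units * 1 + 1         # weights + bias for one output unit

total_lstm_model = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_lstm_model)
print(simple_rnn_params, gru_params)
```

Most of the parameters sit in the embedding layer, so swapping LSTM for GRU or SimpleRNN changes the total model size only slightly.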


**Task 3.2**

Compile the model by selecting a proper loss, optimizer and metric.

In [48]:
# YOUR CODE HERE #
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
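
Binary cross-entropy pairs naturally with a single sigmoid output because it scores a predicted probability against a 0/1 label. The following pure-Python sketch computes it by hand; the predictions are made up for illustration and the clipping constant `eps` is an assumption, not necessarily the value Keras uses.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up predictions for illustration only
confident_right = binary_crossentropy([1, 0], [0.99, 0.01])
confident_wrong = binary_crossentropy([1, 0], [0.01, 0.99])
print(confident_right, confident_wrong)
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily.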

**Task 3.3**

Train the model with `X_train` and `y_train`. Use `X_test` and `y_test` as validation data.

In [49]:
# YOUR CODE HERE #
# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


**Task 3.4**

Predict the class of a test tweet.

Example tweet:
`"back thnx god i'm happi :)"`

Model prediction:
`array([[0.99999976]], dtype=float32)`

In [50]:
test_tweets_str[1]

"@heyclairee back thnx god i'm happi :)"

In [53]:
# YOUR CODE HERE #
example_tweet = test_tweets_str[1]
# Tokenize and pad the example tweet
example_sequence = tokenizer.texts_to_sequences([example_tweet])
example_padded = pad_sequences(example_sequence, padding='post', maxlen=30)

# Predict the class of the example tweet
prediction = model.predict(example_padded)

print("Model prediction:", prediction)

Model prediction: [[0.9998839]]
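
The model outputs a probability, not a class. A minimal sketch of turning it into a label follows, assuming a cut-off of 0.5 (the source does not specify one) and using the printed value as a plain nested list.

```python
# Value printed by the cell above, written as a plain nested list
prediction = [[0.9998839]]

threshold = 0.5  # assumed cut-off; the source does not specify one
probability = prediction[0][0]

# Label 1 was assigned to positive tweets when y_train was built
label = 1 if probability >= threshold else 0
sentiment = "positive" if label == 1 else "negative"
print(label, sentiment)
```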


## 4. Bidirectional LSTM (this datalab)

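Try a bidirectional LSTM: wrap the recurrent layer in `Bidirectional` so that the sequence is read both forward and backward.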

https://keras.io/examples/nlp/bidirectional_lstm_imdb/
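
As a starting point, here is one possible bidirectional variant of the model from section 3. It is a sketch, not a reference solution: the layer sizes are copied from the earlier model, `bi_model` is a name introduced here, and the input shape is declared with `Input` instead of `input_length`.

```python
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size, embedding_dim, max_length = 14886, 100, 30  # values from the sections above

bi_model = Sequential([
    Input(shape=(max_length,)),
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    # Two LSTMs read the sequence in opposite directions;
    # their final states are concatenated (the default merge_mode='concat')
    Bidirectional(LSTM(units=128)),
    Dense(units=1, activation='sigmoid'),
])

bi_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
bi_model.summary()
```

Training works the same way as in Task 3.3, by calling `fit` with `X_train`, `y_train` and the validation data. The recurrent part has twice as many parameters as the single LSTM because the wrapper holds two copies of it.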