<a href="https://colab.research.google.com/github/jonesLevin/TensorFlow-Deep-Learning/blob/main/Natural_Language_Processing_(NLP)_With_TensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction To NLP Fundamentals in TensorFlow

## Getting Helper Functions

In [1]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2023-01-27 08:52:00--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2023-01-27 08:52:00 (107 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [2]:
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

## Get a Text Dataset
We will use the disaster tweet classification dataset from Kaggle's introductory NLP competition. Each tweet is labelled `1` if it describes a real disaster and `0` otherwise.

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# Unzip the data
unzip_data('nlp_getting_started.zip')

--2023-01-27 08:52:03--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.68.128, 74.125.24.128, 142.250.4.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.68.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2023-01-27 08:52:04 (756 KB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualizing the Data

In [4]:
import pandas as pd
import random
import tensorflow as tf

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
# Shuffle Training DataFrame
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [6]:
# Looking at the test dataframe
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [7]:
# How many examples of each class do we have
train_df_shuffled['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

In [8]:
# What is the total number of samples
len(train_df), len(test_df)

(7613, 3263)
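
The class counts above can be turned into proportions to judge how balanced the training set is. A minimal sketch, using only the numbers printed by `value_counts()` (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())  # should equal len(train_df)
class_share = {label: count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"target={label}: {share:.1%} of {total} training samples")
```

The split is roughly 57% to 43%, so the classes are only mildly imbalanced.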

## Splitting Training and Validation Sets

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled['text'].values,
                                                                            train_df_shuffled['target'].values,
                                                                            test_size=0.1,
                                                                            random_state=42)

In [11]:
train_sentences

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
       ...,
       'Near them on the sand half sunk a shattered visage lies... http://t.co/0kCCG1BT06',
       "kesabaran membuahkan hasil indah pada saat tepat! life isn't about waiting for the storm to pass it's about learning to dance in the rain.",
       "@ScottDPierce @billharris_tv @HarrisGle @Beezersun I'm forfeiting this years fantasy football pool out of fear I may win n get my ass kicked"],
      dtype=object)
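
As a sanity check on the split, the expected sizes can be worked out by hand. The sketch below assumes scikit-learn's usual behaviour of rounding the held-out portion up when `test_size` is a fraction and giving the remainder to training:

```python
import math

n_samples = 7613   # len(train_df) from the earlier output
test_size = 0.1    # fraction passed to train_test_split

# Assumption: the validation size is rounded up, the rest goes to training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```

In the notebook itself, `len(train_sentences)` and `len(val_sentences)` give the actual sizes.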

### Converting Text Into Numbers

#### Text Vectorization (Tokenization)

In [12]:
from tensorflow import keras
from tensorflow.keras.layers import TextVectorization

In [13]:
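text_vectorizer = TextVectorization(max_tokens=5000,  # vocabulary size cap (includes padding and [UNK])
                                    standardize='lower_and_strip_punctuation',  # default text cleanup
                                    split='whitespace',  # split tokens on whitespace
                                    ngrams=None,  # do not group tokens into n-grams
                                    output_mode='int',  # map each token to an integer index
                                    output_sequence_length=None,  # no fixed padding/truncation length
                                    pad_to_max_tokens=True)  # only relevant for 'multi_hot', 'count' and 'tf_idf' modes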

In [14]:
# Find the average number of tokens (words) in the training tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [15]:
# Set up text vectorization variables
max_vocab_length = 10000  # maximum number of words in the vocabulary
max_length = 15  # pad or truncate every tweet to the average token count found above

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode='int',
                                    output_sequence_length=max_length)

In [16]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

In [17]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
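
The output above shows six token ids followed by zeros, because the sentence is shorter than `max_length`. The following pure-Python sketch approximates what the layer does in `'int'` mode. `toy_vectorize` and `toy_vocab` are hypothetical names for illustration, not part of Keras, and the punctuation handling is only an approximation of `'lower_and_strip_punctuation'`:

```python
import string

def toy_vectorize(text, vocab, max_length):
    """Rough stand-in for TextVectorization(output_mode='int')."""
    # Lowercase and strip punctuation
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()  # split on whitespace
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens
    ids = [vocab.index(tok) if tok in vocab else 1 for tok in tokens]
    ids = ids[:max_length]                      # truncate long inputs
    return ids + [0] * (max_length - len(ids))  # pad short inputs with 0

toy_vocab = ["", "[UNK]", "a", "in", "my", "flood", "theres"]
print(toy_vectorize("There's a flood in my street", toy_vocab, max_length=8))
```

Here "street" is not in the toy vocabulary, so it maps to index 1, and the last two positions are padding.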

In [18]:
# Choose a random sentence from the training data and tokenize it
random_sentence = random.choice(train_sentences)
print(f'Original text:\n {random_sentence}\
      \n\nVectorized version:')
text_vectorizer([random_sentence])

Original text:
 ChinaÛªs stock market crash this summer has sparked interest from bargain hunters and bulls betting on a rebound. DÛ_ http://t.co/1yggZziZ9o      

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[1882,  413,  457,   85,   19,  270,   41, 1971, 2925,   20, 2551,
        2089,    7,    1,    1]])>

In [19]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5]
bottom_5_words = words_in_vocab[-5:]
print(f'Number of words in vocab: {len(words_in_vocab)}')
print(f'5 most common words: {top_5_words}')
print(f'5 least common words: {bottom_5_words}')

Number of words in vocab: 10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
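
Because the vocabulary is an ordered list, each token id is just a position in it: `''` (padding) sits at index 0 and `'[UNK]'` at index 1. A small sketch of mapping ids back to words, using a made-up vocabulary (`decode_ids` and `toy_vocab` are illustrative names):

```python
def decode_ids(token_ids, vocab):
    """Map token ids back to words, skipping padding (index 0)."""
    return [vocab[i] for i in token_ids if i != 0]

toy_vocab = ["", "[UNK]", "the", "a", "in", "flood"]
print(decode_ids([3, 5, 4, 2, 1, 0, 0], toy_vocab))
```

In the notebook, the same idea should work with `words_in_vocab` in place of `toy_vocab`.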


### Creating an Embedding Using an Embedding Layer

In [20]:
embedding = keras.layers.Embedding(input_dim=max_vocab_length,
                                   output_dim=128,
                                   input_length=max_length)

In [21]:
# Get a random sentence from the training set
random_sentence = random.choice(train_sentences)
print(f'Original text:\n {random_sentence}\
      \n\nEmbedded version:')

# Embed the random sentence
sample_embedded = embedding(text_vectorizer([random_sentence]))
sample_embedded

Original text:
 Deaths 7 http://t.co/xRJA0XpL40      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.03012526,  0.01143426, -0.03173655, ...,  0.00670987,
         -0.02137901, -0.04563645],
        [ 0.04318194,  0.04386887, -0.00402145, ...,  0.01285579,
          0.03140854, -0.01738174],
        [ 0.0126843 ,  0.02112294,  0.04537885, ..., -0.04702204,
          0.02700282, -0.01701692],
        ...,
        [ 0.03213111,  0.00990774,  0.00992664, ...,  0.03468205,
         -0.03392397,  0.0124316 ],
        [ 0.03213111,  0.00990774,  0.00992664, ...,  0.03468205,
         -0.03392397,  0.0124316 ],
        [ 0.03213111,  0.00990774,  0.00992664, ...,  0.03468205,
         -0.03392397,  0.0124316 ]]], dtype=float32)>
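
The last rows of the output above are identical because they all correspond to the padding id 0. Conceptually, an embedding layer is a lookup table with one row per token id. A minimal sketch with toy sizes, assuming small uniform random initial values similar to the Keras default:

```python
import random

random.seed(42)
vocab_size, embed_dim = 6, 4  # toy sizes; the notebook uses 10000 and 128

# One row of small random values per token id (assumed initial range)
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

token_ids = [3, 5, 0, 0]  # a vectorized sentence with two padding positions
embedded = [embedding_table[i] for i in token_ids]

print(len(embedded), len(embedded[0]))  # sequence length, embedding size
print(embedded[2] == embedded[3])       # both padding positions share row 0
```

During training, the rows of this table are the weights that get updated.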

In [22]:
# Check out a single token embedding
sample_embedded[0][0], sample_embedded[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.03012526,  0.01143426, -0.03173655,  0.00300337, -0.03401256,
        -0.01467349,  0.04967994, -0.01021887, -0.03014194, -0.02223145,
         0.02079246,  0.01427511,  0.04030127, -0.04717143, -0.02162863,
        -0.00054116,  0.02879354,  0.02984985,  0.01008482, -0.02633025,
         0.04377462,  0.04783454, -0.03473324, -0.00277481,  0.01512659,
         0.02849321, -0.04652622,  0.04349882,  0.03977272, -0.03295722,
         0.02685526, -0.0151438 ,  0.01323   , -0.04568079,  0.03337223,
         0.02067877, -0.02562715,  0.03020236,  0.01438209,  0.00122961,
         0.00097251,  0.01522081, -0.03970297, -0.00143702, -0.00511835,
         0.0174376 , -0.01138511,  0.03587771, -0.03590406,  0.01984977,
         0.04036276, -0.01375207,  0.03964858, -0.02341721, -0.03708751,
        -0.03112513,  0.00780678,  0.03404531, -0.04665877,  0.01029258,
         0.02070416, -0.00858735,  0.03272941, -0.0280553 ,  0.00878941,
  

## Modelling a Text Dataset and Running a Set of Experiments
We start with a baseline and build on it with a series of experiments:
* Model 0: Naive Bayes (baseline)
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM model (RNN)
* Model 3: GRU model (RNN)
* Model 4: Bidirectional LSTM model (RNN)
* Model 5: 1D CNN
* Model 6: TensorFlow Hub pretrained feature extractor
* Model 7: Same as model 6, trained on 10% of the training data

### Model 0: Getting a Baseline



In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [24]:
# Create tokenization and modelling pipeline
model_0 = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()), # Convert text into numbers 
    ('clf', MultinomialNB()) # model the text
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

In [25]:
# Evaluating the baseline model
baseline_score = model_0.score(val_sentences, val_labels)
print(f'The baseline model achieves an accuracy of: {baseline_score * 100:.2f}%')

The baseline model achieves an accuracy of: 79.27%


In [26]:
# Making predictions
baseline_preds = model_0.predict(val_sentences)

### Creating an Evaluation Function for the Experiments

In [27]:
# Function to evaluate: accuracy, precision, recall and f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
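  Calculates the accuracy, precision, recall and F1 score of a binary classification model.
  Accuracy is returned as a percentage; the other metrics are fractions between 0 and 1.
  """
  # Calculate model accuracy as a percentage
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate precision, recall and F1 score, weighted by the number of true samples per class
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')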
  model_results = {'accuracy': model_accuracy,
                   'precision': model_precision,
                   'recall': model_recall,
                   'f1': model_f1}
  return model_results

In [28]:
# Get baseline results
baseline_results = calculate_results(y_true=val_labels, y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}
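
With `average='weighted'`, each class's score is weighted by its number of true samples (its support). The sketch below reimplements weighted precision by hand on made-up labels to show the arithmetic. `weighted_precision` is an illustrative helper, not a scikit-learn function:

```python
def weighted_precision(y_true, y_pred):
    """Per-class precision, averaged with weights equal to each class's support."""
    total = 0.0
    for label in sorted(set(y_true)):
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        precision = correct / predicted if predicted else 0.0
        support = sum(1 for t in y_true if t == label)
        total += precision * support
    return total / len(y_true)

# Made-up labels: six samples of class 0, four of class 1
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(weighted_precision(y_true, y_pred))
```

Class 0 has precision 5/7 and class 1 has precision 2/3, so the weighted result is (6 × 5/7 + 4 × 2/3) / 10.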

### Model 1: Feed-Forward Neural Network

In [42]:
# Directory for saving TensorBoard logs
SAVE_DIR = 'model_logs'
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype=tf.string) # Each input is a single string
x = text_vectorizer(inputs) # Turn the input text into numbers
x = embedding(x) # Create an embedding of the numerical inputs
x = layers.GlobalAveragePooling1D()(x) # Average the token embeddings into one vector per tweet
outputs = layers.Dense(1, activation='sigmoid')(x) # Single sigmoid unit for binary classification
model_1 = tf.keras.Model(inputs, outputs, name='model_1_dense')
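
`GlobalAveragePooling1D` collapses the token axis by averaging each embedding dimension across all tokens in a sequence. A minimal pure-Python sketch of that averaging on made-up numbers (`global_average_pool` is an illustrative name):

```python
def global_average_pool(token_vectors):
    """Average a (tokens, features) list of lists over the token axis."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]

# Three tokens, each with a two-dimensional embedding
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(global_average_pool(token_vectors))  # [3.0, 4.0]
```

Applied to the 15 × 128 embedding of one tweet, this gives a single vector of 128 values for the final dense layer.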

In [43]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense_2 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
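
The parameter counts in the summary can be reproduced by hand: the embedding layer stores one 128-value vector per vocabulary entry, and the dense layer has one weight per input feature plus a bias. A short check using the values from the notebook:

```python
vocab_size = 10000  # max_vocab_length
embed_dim = 128     # output_dim of the Embedding layer

embedding_params = vocab_size * embed_dim  # one vector per vocabulary entry
dense_params = embed_dim * 1 + 1           # weights plus one bias for the single output unit

print(embedding_params, dense_params, embedding_params + dense_params)
```

Almost all of the model's parameters sit in the embedding table.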

In [44]:
# Compile the model
model_1.compile(loss='binary_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])

In [45]:
# Fit the model
model_1_history = model_1.fit(x=train_sentences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR, 
                                                                     experiment_name='model_1_dense')])

Saving TensorBoard log files to: model_logs/model_1_dense/20230127-090620
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [46]:
# Check the results
model_1.evaluate(val_sentences, val_labels)



[0.47643041610717773, 0.7795275449752808]

In [47]:
# Make some predictions and evaluate
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs.shape



(762, 1)

In [49]:
model_1_pred_probs[:10]

array([[0.3758435 ],
       [0.82052773],
       [0.99748516],
       [0.13065727],
       [0.11172882],
       [0.9292552 ],
       [0.9090359 ],
       [0.9925769 ],
       [0.9700318 ],
       [0.256143  ]], dtype=float32)

In [50]:
# Convert model prediction probabilities to label format
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 1.], dtype=float32)>
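
Rounding a sigmoid output amounts to applying a 0.5 decision threshold. The sketch below applies that threshold in plain Python to the ten probabilities printed earlier:

```python
# First ten prediction probabilities copied from the output above
pred_probs = [0.3758435, 0.82052773, 0.99748516, 0.13065727, 0.11172882,
              0.9292552, 0.9090359, 0.9925769, 0.9700318, 0.256143]

# Probabilities above 0.5 become class 1 (disaster), the rest class 0
pred_labels = [1 if p > 0.5 else 0 for p in pred_probs]
print(pred_labels)
```

The result matches the first ten entries of `model_1_preds`.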

In [52]:
# Calculate model 1 results
model_1_results = calculate_results(y_true=val_labels, y_pred=model_1_preds)
model_1_results

{'accuracy': 77.95275590551181,
 'precision': 0.7822644211580037,
 'recall': 0.7795275590551181,
 'f1': 0.7771404562571971}

In [53]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}
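
Printing the two result dictionaries side by side makes the comparison easier to read. A small sketch using the values shown above (accuracy is in percent, while the other metrics are fractions):

```python
baseline_results = {'accuracy': 79.26509186351706,
                    'precision': 0.8111390004213173,
                    'recall': 0.7926509186351706,
                    'f1': 0.7862189758049549}
model_1_results = {'accuracy': 77.95275590551181,
                   'precision': 0.7822644211580037,
                   'recall': 0.7795275590551181,
                   'f1': 0.7771404562571971}

# Positive differences would mean model 1 beats the baseline on that metric
differences = {metric: model_1_results[metric] - baseline_results[metric]
               for metric in baseline_results}
for metric, diff in differences.items():
    print(f"{metric}: {diff:+.4f}")
```

Every difference is negative, so in this run the dense model falls slightly short of the Naive Bayes baseline on all four metrics.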

## Visualizing Learned Embeddings

In [56]:
len(words_in_vocab)

10000

In [57]:
# Model 1 summary
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (G  (None, 128)              0         
 lobalAveragePooling1D)                                          
                                                                 
 dense_2 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0

In [60]:
# Get the weight matrix of embedding layer
embed_weights = model_1.get_layer('embedding').get_weights()
embed_weights[0].shape

(10000, 128)
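
The weight matrix has one row per vocabulary entry, so row `i` is the learned vector for `words_in_vocab[i]`. One common way to inspect such vectors is to export them as tab-separated files for an embedding visualization tool. The notebook does not show this step, so the following is only a sketch under that assumption, with toy data and a hypothetical helper `write_projector_files`:

```python
import io

def write_projector_files(vocab, weights, out_vectors, out_metadata):
    """Write one tab-separated vector and one word per line, skipping padding."""
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # index 0 is the padding token
        out_vectors.write("\t".join(str(x) for x in weights[index]) + "\n")
        out_metadata.write(word + "\n")

# Toy stand-ins for words_in_vocab and embed_weights[0]
toy_vocab = ["", "[UNK]", "the", "flood"]
toy_weights = [[0.0, 0.0], [0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]

vectors, metadata = io.StringIO(), io.StringIO()
write_projector_files(toy_vocab, toy_weights, vectors, metadata)
print(metadata.getvalue().splitlines())
```

With real data, the in-memory buffers would be replaced by files opened for writing.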