<a href="https://colab.research.google.com/github/jonesLevin/TensorFlow-Deep-Learning/blob/main/Natural_Language_Processing_(NLP)_With_TensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction To NLP Fundamentals in TensorFlow

## Getting Helper Functions

In [1]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2023-01-25 10:16:53--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2023-01-25 10:16:53 (112 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [23]:
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

## Get a Text Dataset
We will use the Kaggle "Natural Language Processing with Disaster Tweets" dataset: tweets labelled as describing a real disaster (`1`) or not (`0`).

In [24]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# Unzip the data
unzip_data('nlp_getting_started.zip')

--2023-01-25 10:16:58--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.68.128, 74.125.24.128, 142.250.4.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.68.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2023-01-25 10:16:59 (141 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]
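The body of `unzip_data` is not shown here; it comes from the downloaded helper file. The following is a minimal sketch of the same step, assuming the helper simply extracts every file from the archive into a target directory. The name `extract_archive` and the throwaway `demo.zip` are hypothetical and only serve the illustration.

```python
import os
import tempfile
import zipfile

def extract_archive(zip_path, target_dir="."):
    """Extract every member of zip_path into target_dir and return their names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
        return archive.namelist()

# Demonstration on a throwaway archive (not the course dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("train.csv", "id,text,target\n1,hello,0\n")
    extracted = extract_archive(demo_zip, tmp)
    train_exists = os.path.exists(os.path.join(tmp, "train.csv"))
```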


## Visualizing the Data

In [39]:
import pandas as pd
import random
import tensorflow as tf

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.head()

|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [5]:
# Shuffle Training DataFrame
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction |  | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge |  | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock |  | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


In [6]:
# Looking at the test dataframe
test_df.head()

|  | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 |  |  | Just happened a terrible car crash |
| 1 | 2 |  |  | Heard about #earthquake is different cities, s... |
| 2 | 3 |  |  | there is a forest fire at spot pond, geese are... |
| 3 | 9 |  |  | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 |  |  | Typhoon Soudelor kills 28 in China and Taiwan |


In [7]:
# How many examples of each class do we have?
train_df_shuffled['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
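The raw counts are easier to judge as proportions. The sketch below copies the two counts from the output and converts them to class shares; the share of the larger class is also the accuracy a model would get by always predicting that class on data with the same balance, which gives a floor for later comparisons.

```python
from collections import Counter

# Class counts copied from the value_counts() output
class_counts = Counter({0: 4342, 1: 3271})
total = sum(class_counts.values())

# Share of each class in the training data
class_share = {label: count / total for label, count in class_counts.items()}

# Accuracy of always predicting the majority class
majority_baseline = max(class_share.values())

for label, share in sorted(class_share.items()):
    print(f"class {label}: {share:.1%}")
```

The classes are moderately imbalanced (roughly 57% to 43%), so accuracy is still informative, but precision, recall and F1 are worth tracking alongside it.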

In [8]:
# What is the total number of samples?
len(train_df), len(test_df)

(7613, 3263)

## Splitting Training and Validation Sets

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled['text'].to_numpy(),
                                                                            train_df_shuffled['target'].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)
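To make the split sizes concrete, here is a plain-Python sketch of a shuffled 90/10 split. It is not scikit-learn's implementation and will not reproduce its exact shuffle; `simple_split` and the toy data are hypothetical. It assumes the validation size is rounded up, which for 7,613 rows gives 762 validation and 6,851 training examples, in line with the 6,851 training sentences that appear later in the notebook.

```python
import math
import random

def simple_split(items, labels, test_size=0.1, seed=42):
    """Shuffle indices once, then carve off the first n_test as the validation set."""
    n_test = math.ceil(len(items) * test_size)
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy stand-ins with the same number of rows as train_df
toy_items = [f"tweet {i}" for i in range(7613)]
toy_labels = [i % 2 for i in range(7613)]
tr_x, va_x, tr_y, va_y = simple_split(toy_items, toy_labels)
print(len(tr_x), len(va_x))
```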

### Converting Text Into Numbers

#### Text Vectorization (Tokenization)

In [11]:
from tensorflow import keras
from tensorflow.keras.layers import TextVectorization

In [12]:
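# Spell out the main arguments to show what each one controls
text_vectorizor = TextVectorization(max_tokens=5000,  # cap on vocabulary size (includes padding and OOV tokens)
                                    standardize='lower_and_strip_punctuation',  # lowercase and remove punctuation
                                    split='whitespace',  # split tokens on whitespace
                                    ngrams=None,  # single tokens only, no n-grams
                                    output_mode='int',  # map each token to an integer index
                                    output_sequence_length=None,  # no fixed length: pad to the longest sequence in the batch
                                    pad_to_max_tokens=True)  # documented as applying only to 'multi_hot', 'count' and 'tf_idf' modes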

In [13]:
# Find the average number of tokens (words) in the training tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [14]:
# Set up text vectorization variables
max_vocab_length = 10000  # maximum number of words in the vocabulary
max_length = 15  # output sequence length (the average tweet length found above)

text_vectorizor = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode='int',
                                    output_sequence_length=max_length)

In [15]:
# Fit the text vectorizer to the training text
text_vectorizor.adapt(train_sentences)

In [16]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street"
text_vectorizor([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
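A rough plain-Python sketch of what integer-mode vectorization does may help in reading this output. It is a simplification, not the Keras implementation: it assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, that the remaining indices are assigned by descending frequency, and that punctuation is removed before splitting (so "There's" becomes "theres"). The function names and the toy corpus are hypothetical.

```python
import string
from collections import Counter

PAD, OOV = "", "[UNK]"

def standardize(text):
    """Lowercase and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Reserve two slots for PAD and OOV, then fill with the most frequent tokens."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return [PAD, OOV] + most_common

def vectorize(text, vocab, seq_len):
    """Map tokens to indices, then truncate or right-pad with zeros to seq_len."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:seq_len]
    return ids + [0] * (seq_len - len(ids))

toy_corpus = ["A flood in my street", "Flood warning in the city", "My street is flooded"]
toy_vocab = build_vocab(toy_corpus, max_tokens=10)
toy_ids = vectorize("There's a flood in my street", toy_vocab, seq_len=8)
print(toy_ids)
```

In this toy run "theres" is not in the vocabulary, so it maps to index 1, and the two unused positions at the end are filled with 0, mirroring the trailing zeros in the tensor above.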

In [17]:
# Choose a random sentence from the training data and tokenize it
random_sentence = random.choice(train_sentences)
print(f'Original text:\n {random_sentence}\
      \n\nVectorized version:')
text_vectorizor([random_sentence])

Original text:
 @Jannet2208 I fell off someone's back and hit my head on concrete /: I was bleeding n shit      

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[   1,    8, 1856,  102, 2697,   88,    7,  244,   13,  331,   11,
        5939,    8,   23,  587]])>

In [18]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizor.get_vocabulary()
top_5_words = words_in_vocab[:5]
bottom_5_words = words_in_vocab[-5:]
print(f'Number of words in vocab: {len(words_in_vocab)}')
print(f'5 most common words: {top_5_words}')
print(f'5 least common words: {bottom_5_words}')

Number of words in vocab: 10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


### Creating an Embedding Using an Embedding Layer

In [19]:
embedding = keras.layers.Embedding(input_dim=max_vocab_length,  # size of the vocabulary
                                   output_dim=128,  # length of each token's embedding vector
                                   input_length=max_length)  # length of each input sequence

In [21]:
# Get a random sentence from the training set
random_sentence = random.choice(train_sentences)
print(f'Original text:\n {random_sentence}\
      \n\nEmbedded version:')

# Embed the random sentence
sample_embedded = embedding(text_vectorizor([random_sentence]))
sample_embedded

Original text:
 One Direction Is my pick for http://t.co/y9WvqKGbBI Fan Army #Directioners http://t.co/S5F9FcOmp8      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.04214131, -0.04265949, -0.01102252, ..., -0.03324126,
          0.00115684,  0.00484693],
        [-0.0131126 ,  0.02029145, -0.0170217 , ..., -0.0146867 ,
         -0.01929076,  0.02336286],
        [-0.03381758,  0.01767565,  0.01200159, ..., -0.02081579,
         -0.02625326, -0.00572349],
        ...,
        [-0.00262966,  0.0201077 ,  0.04173733, ..., -0.01839133,
         -0.01044907,  0.03568229],
        [-0.00262966,  0.0201077 ,  0.04173733, ..., -0.01839133,
         -0.01044907,  0.03568229],
        [-0.00262966,  0.0201077 ,  0.04173733, ..., -0.01839133,
         -0.01044907,  0.03568229]]], dtype=float32)>
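Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The sketch below mimics that in plain Python with a tiny, randomly filled table; the sizes and the initial value range are assumptions for illustration, and nothing here is trained. It also suggests why the last rows of the output above are identical: padded positions presumably all carry index 0, so they all fetch the same row.

```python
import random

vocab_size, embed_dim = 10, 4  # toy sizes; the notebook uses 10000 and 128
rng = random.Random(0)

# One row of embed_dim small random values per vocabulary index
table = [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

def embed(token_ids):
    """Replace each token index with its row from the table."""
    return [table[i] for i in token_ids]

token_ids = [7, 3, 0, 0]  # two real tokens followed by two padding zeros
embedded = embed(token_ids)

# Number of learnable values in the table
toy_params = vocab_size * embed_dim
notebook_params = 10000 * 128
print(len(embedded), len(embedded[0]), notebook_params)
```

With the notebook's settings the table holds 10000 × 128 = 1,280,000 values, all of which are updated during training.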

In [22]:
# Check out a single token embedding
sample_embedded[0][0], sample_embedded[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 4.2141307e-02, -4.2659488e-02, -1.1022519e-02,  4.5076024e-02,
        -3.8994662e-03, -3.6876835e-02,  4.7304001e-02,  2.0000387e-02,
        -4.9542010e-02, -1.2770940e-02, -1.1985563e-02, -4.0564369e-02,
        -3.5697334e-03,  3.3385046e-03,  4.1957978e-02, -2.8590573e-02,
        -3.8689744e-02, -2.6513470e-02, -4.5806397e-02, -2.5691498e-02,
        -1.2717485e-02,  1.5749935e-02, -1.3428163e-02, -4.1642189e-03,
         3.5681497e-02,  3.1557027e-02,  1.1629391e-02,  4.1593242e-02,
         4.5240629e-02,  2.8963316e-02,  1.6412865e-02,  3.5496388e-02,
        -4.3166544e-02, -2.5806308e-02,  2.3711290e-02,  1.7723553e-03,
         3.3076350e-02, -3.6892451e-02,  4.1192602e-02,  2.7794909e-02,
         1.0363616e-02,  1.3549235e-02, -3.3078469e-02,  3.7647989e-02,
        -3.4660518e-02,  8.5945837e-03, -4.1950010e-02, -4.3568600e-02,
         9.8970905e-03, -2.7870834e-02, -3.2217406e-02, -4.5007911e-02,
         3.5329

## Modelling a Text Dataset and Running a Set of Experiments
We start with a baseline and move on from there.
* Model 0: Naive Bayes (baseline)
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM model (RNN)
* Model 3: GRU model (RNN)
* Model 4: Bidirectional LSTM model (RNN)
* Model 5: 1D CNN
* Model 6: TensorFlow Hub pretrained feature extractor
* Model 7: Same as Model 6, trained on 10% of the training data

### Model 0: Getting a Baseline



In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [29]:
# Create tokenization and modelling pipeline
model_0 = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()), # Convert text into numbers 
    ('clf', MultinomialNB()) # model the text
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
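The first pipeline step turns each tweet into a vector of TF-IDF weights. The following plain-Python sketch shows the weighting idea, assuming the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row. It splits on whitespace only, which is a simplification of the real tokenizer, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per document, each L2-normalized."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in df}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {tok: tf[tok] * idf[tok] for tok in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

toy_docs = ["forest fire near town", "fire drill at school", "great day at school"]
toy_rows = tfidf(toy_docs)
print(toy_rows[0])
```

In the first toy document, "forest" appears in only one document while "fire" appears in two, so "forest" receives the larger weight. Tokens that are rare across the corpus count for more than common ones.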

In [30]:
# Evaluating the baseline model
baseline_score = model_0.score(val_sentences, val_labels)
print(f'The baseline model achieves an accuracy of: {baseline_score * 100:.2f}%')

The baseline model achieves an accuracy of: 79.27%


In [34]:
# Making predictions
baseline_preds = model_0.predict(val_sentences)

### Creating an Evaluation Function for the Experiments

In [35]:
# Function to evaluate: accuracy, precision, recall and f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates accuracy, precision, recall and f1-score of a binary classification model.

  Accuracy is returned as a percentage (0-100); precision, recall and f1 are
  weighted averages across classes on a 0-1 scale.
  """
  # Calculate model accuracy as a percentage
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1-score using a weighted average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
  model_results = {'accuracy': model_accuracy,
                   'precision': model_precision,
                   'recall': model_recall,
                   'f1': model_f1}
  return model_results

In [38]:
# Get baseline results
baseline_results = calculate_results(y_true=val_labels, y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}
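To see what the weighted averages mean, here is a plain-Python sketch that computes per-class precision, recall and F1 and then weights each class by its share of the true labels. It is an illustration of the idea, not scikit-learn's code; `weighted_prf` and the toy labels are hypothetical. One consequence is visible in the results above: weighted recall works out to the same number as accuracy (0.7927 versus 79.27%).

```python
def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 across all classes in y_true."""
    n = len(y_true)
    precision = recall = f1 = 0.0
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / n  # share of true labels belonging to class c
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 1]
toy_precision, toy_recall, toy_f1 = weighted_prf(toy_true, toy_pred)
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(toy_precision, toy_recall, toy_f1, toy_accuracy)
```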

### Model 1: Feed-Forward Neural Network

In [40]:
# Directory for TensorBoard logs
SAVE_DIR = 'model_logs'

# Build the model with the functional API
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype=tf.string) # Inputs are one-dimensional strings
x = text_vectorizor(inputs) # Turn the input text into numbers
x = embedding(x) # Create an embedding of the numberized inputs
# Note: with no pooling layer here, Dense is applied to each of the 15 tokens separately
outputs = layers.Dense(1, activation='sigmoid')(x)
model_1 = tf.keras.Model(inputs, outputs, name='model_1_dense')

In [41]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 dense (Dense)               (None, 15, 1)             129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________


In [42]:
# Compile the model
model_1.compile(loss='binary_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])

In [49]:
tf.expand_dims(train_sentences, axis=1).shape

TensorShape([6851, 1])

In [46]:
# Fit the model
model_1_history = model_1.fit(x=train_sentences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR, 
                                                                     experiment_name='model_1_dense')])

Saving TensorBoard log files to: model_logs/model_1_dense/20230125-111833
Epoch 1/5


ValueError: ignored
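The traceback is cut off, so the exact cause is not shown. A likely explanation, judging by the model summary, is a shape mismatch: the Dense layer outputs `(None, 15, 1)`, one prediction per token, while each label is a single number per tweet. A common remedy is to collapse the token axis before the Dense layer, for example by adding `x = layers.GlobalAveragePooling1D()(x)` right after the embedding. The plain-Python sketch below only illustrates what that averaging does to the shape; the function name and the numbers are made up for the illustration.

```python
def global_average_pool(sequence):
    """Average a (tokens, features) list of lists over the token axis."""
    n_tokens = len(sequence)
    return [sum(column) / n_tokens for column in zip(*sequence)]

# One toy "embedded" sentence: 3 tokens with 2 features each (made-up numbers)
embedded_sentence = [[0.5, 1.0], [1.5, 3.0], [1.0, 2.0]]
pooled = global_average_pool(embedded_sentence)

per_token_shape = (len(embedded_sentence), len(embedded_sentence[0]))
pooled_shape = (len(pooled),)
print(per_token_shape, "->", pooled_shape)
```

After pooling, each tweet is represented by a single feature vector, so a Dense layer with one unit would produce one prediction per tweet, matching the shape of the labels.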