<a href="https://colab.research.google.com/github/mhlaghari/Amazon-product-co-purchasing-network-metadata/blob/main/NLP_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP Fundamentals in TensorFlow

In [1]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-425fc176-dc4b-cbdb-9ef9-e9985d0108fe)


## Get helper functions

In [3]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

# Import series of helper functions 
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

--2023-05-23 04:58:37--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2023-05-23 04:58:37 (82.2 MB/s) - ‘helper_functions.py’ saved [10246/10246]



## Get Text data

In [4]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

#Unzip data 
unzip_data('nlp_getting_started.zip')

--2023-05-23 04:58:41--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.200.128, 74.125.68.128, 74.125.24.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.200.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2023-05-23 04:58:42 (752 KB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualizing a text dataset

In [5]:
import pandas as pd 
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [6]:
train_df['text'][0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [7]:
# Shuffle training dataframe 
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [8]:
# What does the test_data look like 
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [9]:
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
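
The counts above show the two classes are only mildly imbalanced. As a quick sketch of the arithmetic (the counts are copied from the output above; the variable names are illustrative), the snippet below works out the class fractions and the accuracy a trivial model would reach by always predicting the majority class. That figure is measured on the full training set, so treat it as a rough floor rather than an exact reference for the validation split.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
class_fractions = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class
majority_class_accuracy = max(class_fractions.values())

print(f"Total samples: {total}")
for label, fraction in class_fractions.items():
    print(f"Class {label}: {fraction:.1%}")
print(f"Always-predict-majority accuracy: {majority_class_accuracy:.1%}")
```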

In [10]:
# How many total samples 
len(train_df), len(test_df)

(7613, 3263)

In [11]:
# Let's visualize some random training samples
import random 
random_index = random.randint(0, len(train_df)-5)
for row in train_df_shuffled[['text', 'target']][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f'Target: {target}', '(real disaster)' if target>0 else '(not real disaster)')
  print(f'Text: \n{text}')
  print("---\n")

Target: 0 (not real disaster)
Text: 
Wish I had a personal hair twister
---

Target: 1 (real disaster)
Text: 
What would it look like if Hiroshima bomb hit Detroit?: Thursday marks the 70-year anniversary of the United S... http://t.co/6sy44kyYsD
---

Target: 1 (real disaster)
Text: 
Please allow me to reiterate it's not the weapon it's the mindset of the individual! #professional #help! -LEGION! https://t.co/2lGTZkwMqW
---

Target: 1 (real disaster)
Text: 
A big issue left undone is HOW to get home if adverse weather hits. @GoTriangle has no real emergency plan in place https://t.co/s7xdXuudcy
---

Target: 1 (real disaster)
Text: 
Salem 2 nuclear reactor shut down over electrical circuit failure on pump http://t.co/LQjjy1PTWT
---



In [12]:
# Split data into training and validation sets
from sklearn.model_selection import train_test_split
# Hold out 10% of the shuffled training data as a validation set
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled['text'].to_numpy(),
                                                                            train_df_shuffled['target'].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)

In [13]:
# Check the lengths 
len(train_sentences) , len(val_sentences)

(6851, 762)
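
These lengths follow from applying `test_size=0.1` to the 7,613 training samples. A small sketch of the arithmetic, assuming the validation count is rounded up and the training count rounded down (an assumption about scikit-learn's rounding that happens to match the output above):

```python
import math

n_samples = 7613   # len(train_df), from the earlier output
test_size = 0.1    # fraction held out for validation

# Assumption: validation count is rounded up, training count is rounded down
n_val = math.ceil(test_size * n_samples)
n_train = math.floor((1 - test_size) * n_samples)

print(n_train, n_val)
```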

In [14]:
# Check the first 10 samples 
train_sentences[:10] , train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

In [15]:
# Converting text into numbers 
train_sentences[:5]

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
       "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
       'Somehow find you and I collide http://t.co/Ee8RpOahPk'],
      dtype=object)

In [16]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization  # stable import path (TensorFlow 2.6+)

In [17]:
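# Use the default TextVectorization parameters
text_vectorization = TextVectorization(max_tokens=None,  # no cap on vocabulary size
                                       standardize="lower_and_strip_punctuation",  # lowercase text and remove punctuation
                                       split="whitespace",  # split tokens on whitespace
                                       ngrams=None,  # don't group tokens into n-grams
                                       output_mode='int',  # map each token to an integer index
                                       output_sequence_length=None,  # no fixed output length
                                       pad_to_max_tokens=False
                                       )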

In [18]:
len(train_sentences[0].split())

7

In [19]:
# Find the average number of tokens (words) in the training tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [20]:
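# Set up text vectorization variables
max_vocab_length = 10000  # maximum number of words to keep in the vocabulary
max_length = 15  # sequence length, set to the average number of tokens per tweet found above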

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode='int',
                                    output_sequence_length=max_length)

In [21]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

In [22]:
# Create a sample sentence and tokenize it 
sample_sentence ='Theres a flood in my street!'
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
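
To make the output above less opaque, here is a rough pure-Python sketch of what an integer text vectorizer does: lowercase and strip punctuation, split on whitespace, map each word to a vocabulary index, then pad with zeros or truncate to a fixed length. This is not the `TextVectorization` implementation; `standardize`, `build_vocab`, `vectorize` and the toy corpus are hypothetical names and data for illustration, so the indices will not match the real layer's.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and remove ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences, max_tokens):
    """Reserve index 0 for padding and index 1 for unknown words, then order words by frequency."""
    counts = Counter(word for s in sentences for word in standardize(s).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(sentence, vocab, sequence_length):
    """Map words to indices, then truncate or zero-pad to sequence_length."""
    word_to_index = {word: i for i, word in enumerate(vocab)}
    ids = [word_to_index.get(word, 1) for word in standardize(sentence).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Made-up corpus for illustration only
toy_corpus = ["There is a flood in my street!", "A flood in the city", "The street is closed"]
toy_vocab = build_vocab(toy_corpus, max_tokens=10)
toy_ids = vectorize("A flood in my garden", toy_vocab, sequence_length=8)
print(toy_vocab)
print(toy_ids)
```

In this sketch, "garden" never appears in the toy corpus, so it maps to the unknown index 1, and the trailing zeros are padding, mirroring the zeros at the end of the real tensor above.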

In [23]:
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f'Original Text: \n {random_sentence}\
\n\nVectorized version:')
text_vectorizer([random_sentence])

Original Text: 
 @whvholst @leashless And this is a structural problem rather than just a failure of competence by traditional soc democratic parties.

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[6793,    1,    7,   19,    9,    3,  384, 1064, 1263,   76,   29,
           3,  320,    6,    1]])>

In [24]:
# Get the unique words in the vocabulary 
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5]
least_5_words = words_in_vocab[-5:]
print(f'Number of words in vocab: {len(words_in_vocab)}')
print(f'5 common words in vocab: {top_5_words}')
print(f'5 least common words in vocab: {least_5_words}')

Number of words in vocab: 10000
5 common words in vocab: ['', '[UNK]', 'the', 'a', 'in']
5 least common words in vocab: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


### Creating an embedding using an embedding layer

To make our embedding, we're going to use TensorFlow's embedding layer (`tf.keras.layers.Embedding`).

The parameters we care most about for our embedding layer:
* `input_dim` = the size of the vocabulary
* `output_dim` = the size of the output embedding vector; for example, a value of 100 means each token is represented by a vector of length 100
* `input_length` = length of the sequences being passed to the embedding layer
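
Conceptually, an embedding layer is a trainable lookup table with `input_dim` rows and `output_dim` columns: each integer token selects one row. The following is a minimal pure-Python sketch of that idea with tiny, made-up sizes and random values standing in for learned weights (`toy_vocab_size`, `toy_embedding_dim` and `embed` are illustrative names, not part of Keras).

```python
import random

random.seed(42)

toy_vocab_size = 6     # plays the role of input_dim
toy_embedding_dim = 4  # plays the role of output_dim

# One row of small random values per token index
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
                   for _ in range(toy_vocab_size)]

def embed(token_ids):
    """Look up one embedding vector per token index."""
    return [embedding_table[token_id] for token_id in token_ids]

token_ids = [5, 1, 3, 1, 0, 0]  # a padded sequence in which index 1 appears twice
embedded = embed(token_ids)

print(len(embedded), len(embedded[0]))  # sequence length, embedding size
print(embedded[1] == embedded[3])       # the same token index selects the same row
```

Because the layer is a lookup, a token that repeats within a sequence (for example `[UNK]`) is represented by identical vectors at each position.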

In [25]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length,
                             output_dim=128,
                             input_length=max_length)

embedding

<keras.layers.core.embedding.Embedding at 0x7fe0b0096710>

In [26]:
print(f'Original text: \n {random_sentence}\
\n\nEmbedded version:')

sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text: 
 @whvholst @leashless And this is a structural problem rather than just a failure of competence by traditional soc democratic parties.

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.00708473,  0.01842835, -0.0381577 , ...,  0.0272238 ,
         -0.01001953, -0.01313305],
        [ 0.02386964, -0.04707037, -0.02547088, ...,  0.02497151,
         -0.01627138, -0.02054993],
        [ 0.04369897, -0.01185461, -0.01038635, ...,  0.00080677,
          0.01093136,  0.00030817],
        ...,
        [ 0.00473542, -0.01885836, -0.02623818, ...,  0.03179271,
          0.01500565, -0.03026978],
        [-0.02182528,  0.00997988, -0.03648349, ...,  0.02764324,
          0.04430462, -0.03896264],
        [ 0.02386964, -0.04707037, -0.02547088, ...,  0.02497151,
         -0.01627138, -0.02054993]]], dtype=float32)>

In [27]:
# Check out a single token's embedding
sample_embed[0][0], sample_embed[0][0].shape, random_sentence[0]

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.00708473,  0.01842835, -0.0381577 ,  0.00296819,  0.00891498,
         0.00997893, -0.01693913, -0.03544533, -0.04437518, -0.0484769 ,
        -0.01295454,  0.00467021, -0.03061861, -0.00883807,  0.01956381,
         0.01809511,  0.0028515 , -0.03298839, -0.01307914, -0.03292278,
        -0.02095683, -0.00551286,  0.00078695,  0.03122022,  0.00827272,
        -0.02952151, -0.03754707, -0.045051  , -0.01976395,  0.04512675,
         0.04800001,  0.0076166 , -0.00097601,  0.03582307,  0.02202768,
        -0.03225785, -0.01090016, -0.02932999,  0.01511348,  0.03941908,
        -0.04001164, -0.01229831, -0.03048295,  0.01758636,  0.04882666,
        -0.02205416,  0.01772512,  0.03899714, -0.00898293, -0.04382578,
        -0.04388292, -0.03865759,  0.02830834,  0.04504127,  0.00067047,
         0.01210872, -0.00353836,  0.04118088, -0.01210396,  0.04813256,
         0.04866682,  0.02874149,  0.02832279, -0.02112577, -0.04453593,
  

## Modelling a text dataset (running a series of experiments)

Now that we've got a way to turn our text sequences into numbers, it's time to start building a series of modelling experiments.

We'll start with a baseline and move on from there. 

* Model 0: Naive Bayes (baseline), chosen using the scikit-learn machine learning map
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM model (RNN)
* Model 3: GRU model (RNN)
* Model 4: Bidirectional-LSTM model (RNN)
* Model 5: 1D Convolutional Neural Network (CNN)
* Model 6: TensorFlow Hub Pretrained Feature Extractor (using transfer learning for NLP) 
* Model 7: Same as model 6 with 10% of training data

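How are we going to approach all of these?

By following the same standard steps for each experiment: build the model (and, for the TensorFlow models, compile it), fit it on the training data, then evaluate it on the validation data.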

### Model 0: Getting a baseline

As with all machine learning modelling experiments, it's important to create a baseline model so you've got a benchmark for future experiments to build on.

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline 

# Create tokenization and modelling pipeline
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),  # convert words to numbers using TF-IDF
    ("clf", MultinomialNB())  # model the text with multinomial Naive Bayes
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)
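
For intuition about the `tfidf` step, here is a rough pure-Python sketch of TF-IDF weighting on a toy corpus. It assumes a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is my understanding of the `TfidfVectorizer` defaults; the tokenization is a plain lowercase whitespace split, so this is not a drop-in replacement. `toy_docs`, `idf` and `tfidf_vector` are hypothetical names.

```python
import math
from collections import Counter

# Made-up documents for illustration only
toy_docs = ["forest fire near town", "fire drill at school", "great day at school"]
tokenized = [doc.lower().split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Smoothed inverse document frequency (assumed form, see the text above)."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_vector(tokens):
    """Term count times IDF for each word, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

first_doc = tfidf_vector(tokenized[0])
for word, weight in sorted(first_doc.items(), key=lambda item: -item[1]):
    print(f"{word:>8}: {weight:.3f}")
```

In the first toy document, "fire" also appears in a second document, so it receives a lower weight than the words unique to that document. Words that are rarer across the corpus count for more.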

In [30]:
# Evaluate our baseline model 
baseline_score = model_0.score(val_sentences, val_labels)
print(f'Our baseline model achieves an accuracy score of {baseline_score*100:.2f}%')

Our baseline model achieves an accuracy score of 79.27%


In [31]:
# Make predictions 
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

### Creating an evaluation function for our model experiments 



In [32]:
# Function to evaluate accuracy, precision, recall, F1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

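def calculate_results(y_true, y_pred):
  '''
  Calculates model accuracy, precision, recall, and F1 score of a binary classification model.

  Returns a dictionary with accuracy as a percentage (0-100) and
  precision, recall, and F1 as weighted-average fractions (0-1).
  '''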
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred)*100
  # Calculate model precision, recall, and F1-Score using "weighted average"
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
  model_results = {"accuracy": model_accuracy,
                   "precision": model_precision,
                   "recall": model_recall,
                   "f1": model_f1}

  return model_results


In [34]:
# Get baseline results 
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}
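
With `average='weighted'`, each metric is computed per class and then averaged with weights proportional to the number of true samples in each class. Here is a small pure-Python sketch of that calculation on made-up labels (`weighted_metrics` is a hypothetical helper, not a scikit-learn function):

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Support-weighted average of per-class precision, recall and F1."""
    support = Counter(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = n_true / len(y_true)  # fraction of samples that truly belong to this class
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals

# Made-up labels for illustration only
toy_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]
toy_results = weighted_metrics(toy_true, toy_pred)
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(toy_results)
print(toy_accuracy)
```

One consequence of this weighting is that weighted recall equals plain accuracy, which is why the baseline's `recall` above matches its `accuracy` divided by 100.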

### Model 1: Feed-forward neural network (dense model)


In [35]:
# Create a tensorboard callback (need to create a new one for each model)
from helper_functions import create_tensorboard_callback

# Create a directory to save TensorBoard logs
SAVE_DIR = "model_logs"

In [42]:
# Build model with the Functional API
from tensorflow.keras import layers 
inputs = layers.Input(shape=(1,), dtype=tf.string) # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the numberized inputs
x = layers.GlobalAveragePooling1D(name="global_avg_pool_layer")(x) # average the token embeddings over the sequence dimension
outputs = layers.Dense(1, activation='sigmoid')(x) # create the output layer, want binary output so use sigmoid function
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense")

In [43]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (TextV  (None, 15)               0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 global_avg_pool_layer (Glob  (None, 128)              0         
 alAveragePooling1D)                                             
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
N
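
The shapes and parameter counts in the summary can be checked by hand. The sketch below is pure Python with a made-up tiny input; `global_average_pool` is an illustrative stand-in for what `GlobalAveragePooling1D` computes on a single unmasked sequence. It averages token vectors over the sequence axis and recomputes the parameter counts from the sizes used in this notebook.

```python
def global_average_pool(sequence):
    """Average a list of equal-length token vectors into a single vector."""
    n_tokens = len(sequence)
    return [sum(values) / n_tokens for values in zip(*sequence)]

# Toy input: 3 tokens, each embedded in 2 dimensions
toy_sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(toy_sequence)
print(pooled)  # one vector per sequence, with the sequence axis removed

# Parameter counts using the sizes from this notebook
max_vocab_length = 10000
embedding_dim = 128
embedding_params = max_vocab_length * embedding_dim  # one vector per vocabulary entry
dense_params = embedding_dim * 1 + 1                 # weights plus one bias for the single output unit
print(embedding_params, dense_params, embedding_params + dense_params)
```

The same averaging is what turns the `(None, 15, 128)` embedding output into the `(None, 128)` input of the dense layer.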

In [44]:
# Compile model
model_1.compile(loss='binary_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

In [45]:
# Fit the model
model_1_history = model_1.fit(x=train_sentences,
                              y=train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name="model_1_dense")])

Saving TensorBoard log files to: model_logs/model_1_dense/20230523-055603
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [48]:
# Check the results
model_1.evaluate(val_sentences, val_labels)



[0.4786967635154724, 0.7808399200439453]

In [49]:
# Make some predictions and evaluate those
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs.shape



(762, 1)
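
The prediction array holds one sigmoid probability per validation sample, in a single column of shape `(762, 1)`. Before these can be compared with `val_labels`, they need to be flattened and converted to 0/1 labels. Here is a minimal sketch of that step, assuming a 0.5 decision threshold and using made-up probabilities in nested lists rather than the real array (`probs_to_labels` is a hypothetical helper):

```python
def probs_to_labels(pred_probs, threshold=0.5):
    """Flatten an (n, 1) nested list of probabilities and threshold it into 0/1 labels."""
    return [1 if row[0] > threshold else 0 for row in pred_probs]

toy_pred_probs = [[0.91], [0.12], [0.55], [0.49]]  # made-up values for illustration
toy_pred_labels = probs_to_labels(toy_pred_probs)
print(toy_pred_labels)
```

With the real predictions, something along the lines of `tf.squeeze(tf.round(model_1_pred_probs))` should produce the same kind of label vector, which could then be passed to `calculate_results` together with `val_labels`.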