
# Introduction To NLP Fundamentals in TensorFlow

## Getting Helper Functions

In [1]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2023-01-24 09:22:31--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2023-01-24 09:22:31 (99.4 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [2]:
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

## Get a Text Dataset
We will use the dataset from Kaggle's introductory NLP competition, in which each tweet is labelled as describing a real disaster (`target` = 1) or not (`target` = 0).

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# Unzip the data
unzip_data('nlp_getting_started.zip')

--2023-01-24 09:22:35--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.253.62.128, 172.253.115.128, 172.253.122.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.253.62.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2023-01-24 09:22:35 (71.3 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualizing the Data

In [4]:
import pandas as pd
import random

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
# Shuffle Training DataFrame
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [6]:
# Looking at the test dataframe
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [7]:
# How many examples of each class do we have
train_df_shuffled['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
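The counts above are easier to judge as proportions. The sketch below is a minimal illustration that copies the two counts from the output above by hand; the names `class_counts` and `class_fractions` are introduced here only for the example.

```python
from collections import Counter

# Class counts copied from the value_counts() output above
class_counts = Counter({0: 4342, 1: 3271})
total = sum(class_counts.values())

# Fraction of the training set that belongs to each class
class_fractions = {label: count / total for label, count in class_counts.items()}
for label, fraction in sorted(class_fractions.items()):
    print(f"class {label}: {fraction:.1%}")
```

The split is roughly 57% to 43%, so the classes are somewhat imbalanced but not severely so.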

In [8]:
# What is the total number of samples
len(train_df), len(test_df)

(7613, 3263)

## Splitting Training and Validation Sets

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled['text'].to_numpy(),
                                                                            train_df_shuffled['target'].to_numpy(),
                                                                            test_size=0.1,
                                                                            random_state=42)
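As a quick sanity check on the split, the expected set sizes can be worked out by hand. This is a sketch under the assumption that scikit-learn rounds the held-out count up and assigns the remaining rows to training; the variable names are illustrative only.

```python
import math

n_samples = 7613   # rows in train.csv, from len(train_df) above
test_size = 0.1    # same fraction passed to train_test_split

# Assumption: the validation count is rounded up, the rest goes to training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val
print(n_train, n_val)
```

Comparing these numbers with `len(train_sentences)` and `len(val_sentences)` confirms that the split behaved as intended.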

### Converting Text Into Numbers

#### Text Vectorization (Tokenization)

In [11]:
from tensorflow import keras
from tensorflow.keras.layers import TextVectorization

In [12]:
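# Example with the main arguments spelled out (this vectorizer is replaced by the one configured below)
text_vectorizor = TextVectorization(max_tokens=5000,  # maximum vocabulary size
                                    standardize='lower_and_strip_punctuation',
                                    split='whitespace',
                                    ngrams=None,
                                    output_mode='int',  # map each token to an integer index
                                    output_sequence_length=None,
                                    pad_to_max_tokens=True)  # only relevant for 'multi_hot', 'count' and 'tf_idf' output modes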

In [13]:
# Find the average number of tokens (words) in the training tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [14]:
# Set up text vectorization variables
max_vocab_length = 10000  # maximum number of words kept in the vocabulary
max_length = 15  # pad or truncate every tweet to the average tweet length (in tokens)

text_vectorizor = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode='int',
                                    output_sequence_length=max_length)

In [15]:
# Fit the text vectorizer to the training text
text_vectorizor.adapt(train_sentences)

In [16]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street"
text_vectorizor([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

In [17]:
# Choose a random sentence from the training data and tokenize it
random_sentence = random.choice(train_sentences)
print(f'Original text:\n {random_sentence}\
      \n\nVectorized version:')
text_vectorizor([random_sentence])

Original text:
 #Eyewitness media is actively embraced by #UK audiences. Read the report by @emhub on the impact of #UGC in news: http://t.co/6mBPvwiTxf      

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[654, 620,   9,   1,   1,  18, 915,   1, 193,   2, 329,  18,   1,
         11,   2]])>

In [18]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizor.get_vocabulary()
top_5_words = words_in_vocab[:5]
bottom_5_words = words_in_vocab[-5:]
print(f'Number of words in vocab: {len(words_in_vocab)}')
print(f'5 most common words: {top_5_words}')
print(f'5 least common words: {bottom_5_words}')

Number of words in vocab: 10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
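To make the mapping from text to integers concrete, here is a simplified pure-Python sketch of the same idea: standardize the text, build a frequency-ordered vocabulary with index 0 reserved for padding and index 1 for unknown words, then pad or truncate to a fixed length. It is not the Keras implementation; the helper names (`standardize`, `build_vocab`, `vectorize`) and the toy corpus are assumptions made for illustration, and Keras may order equally frequent words differently.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation (a rough stand-in for 'lower_and_strip_punctuation')."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences, max_tokens):
    """Index 0 is padding (''), index 1 is out-of-vocabulary ('[UNK]'), then words by frequency."""
    counts = Counter(word for s in sentences for word in standardize(s).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(sentence, vocab, sequence_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in standardize(sentence).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Hypothetical toy corpus, not taken from the dataset
toy_corpus = ["There's a flood in my street", "A fire in the street", "the flood is here"]
toy_vocab = build_vocab(toy_corpus, max_tokens=8)
toy_ids = vectorize("A flood in my kitchen!", toy_vocab, sequence_length=8)
print(toy_vocab)
print(toy_ids)
```

Words that did not make it into the vocabulary map to 1, which is why the vectorized tweet above contains several 1s, and short inputs are filled with trailing 0s.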


### Creating an Embedding Using an Embedding Layer

In [19]:
embedding = keras.layers.Embedding(input_dim=max_vocab_length,
                                   output_dim=128,
                                   input_length=max_length)

In [20]:
embedding

<keras.layers.core.embedding.Embedding at 0x7f63e00caf70>

In [22]:
# Get a random sentence from the training set
random_sentence = random.choice(train_sentences)
print(f'Original text:\n {random_sentence}\
      \n\nEmbedded version:')

# Embed the random sentence
sample_embedded = embedding(text_vectorizor([random_sentence]))
sample_embedded

Original text:
 Former Township fire truck being used in Philippines - Langley Times http://t.co/L90dCPV9Zu #Philippines      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.02047047,  0.04717774,  0.02208969, ..., -0.01302468,
         -0.00661403,  0.02895762],
        [ 0.01171187, -0.04750746, -0.04086875, ...,  0.01401928,
         -0.04918836, -0.03419223],
        [-0.0213108 , -0.02155674, -0.03504245, ..., -0.00695921,
          0.00399873, -0.01492119],
        ...,
        [ 0.01069112, -0.03639345,  0.00403202, ...,  0.0485805 ,
         -0.01971176,  0.02866713],
        [ 0.01069112, -0.03639345,  0.00403202, ...,  0.0485805 ,
         -0.01971176,  0.02866713],
        [ 0.01069112, -0.03639345,  0.00403202, ...,  0.0485805 ,
         -0.01971176,  0.02866713]]], dtype=float32)>
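Conceptually, an embedding layer is a lookup table with one trainable vector per token index. The pure-Python sketch below illustrates that idea with toy sizes; it is not how Keras implements the layer, the names `embedding_table` and `embed` are hypothetical, and the small uniform initial range is an assumption chosen to resemble the values printed above.

```python
import random

random.seed(42)
vocab_size, embedding_dim = 10, 4   # toy sizes; the notebook uses 10000 and 128

# One randomly initialised vector per token id; training would later adjust these values
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    """Look up one vector per token id, giving a (sequence_length, embedding_dim) nested list."""
    return [embedding_table[token_id] for token_id in token_ids]

token_ids = [7, 3, 5, 0, 0, 0]      # a short sequence padded with zeros
embedded = embed(token_ids)
print(len(embedded), len(embedded[0]))
print(embedded[3] == embedded[4] == embedded[5])
```

Every padding position shares index 0 and therefore the same vector, which is why the last rows of the embedded tweet above are identical.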

In [24]:
# Check out a single token embedding
sample_embedded[0][0], sample_embedded[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([ 0.02047047,  0.04717774,  0.02208969, -0.01023325, -0.0217353 ,
        -0.01831362, -0.03020023, -0.0449996 ,  0.03895458,  0.024732  ,
         0.01647003,  0.03423383,  0.03731117,  0.03981307,  0.03881622,
        -0.01426226,  0.00323788,  0.0449265 , -0.02035411, -0.00728257,
        -0.01188564,  0.02718005, -0.01574154, -0.04513613, -0.04353093,
         0.03476835,  0.02528851, -0.002297  , -0.02868397, -0.03729815,
         0.04547671, -0.0145273 , -0.04254128, -0.03323709,  0.00308675,
         0.02143674,  0.04656669, -0.03151812, -0.02718755,  0.03217509,
         0.00922196,  0.01395848, -0.01531209,  0.02072383,  0.03579365,
        -0.04981706, -0.00263063, -0.03444052,  0.03094924,  0.03292103,
        -0.02362932, -0.03600603,  0.03524664,  0.01535666,  0.02925712,
        -0.00468326,  0.02128382, -0.01384807,  0.02667285,  0.03227974,
         0.0328976 ,  0.01937688,  0.00308887, -0.04944065,  0.00696492,
  