<a href="https://colab.research.google.com/github/hrushikute/DataAnalytics/blob/master/NLP_introduction_to_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP fundamentals in TensorFlow

NLP aims to derive information from natural language, which could be a sequence of text or speech.
Many NLP problems are also referred to as sequence-to-sequence (seq2seq) problems.

## Check for GPU

In [8]:
!nvidia-smi -L

GPU 0: Tesla K80 (UUID: GPU-1a541e15-9b24-70dc-fb4f-a8980b6f5e78)


## Get Helper functions

In [9]:
! wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
# Import a series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

--2021-10-21 05:16:28--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2021-10-21 05:16:28 (83.6 MB/s) - ‘helper_functions.py’ saved [10246/10246]



## Get text dataset

The dataset we are going to use is Kaggle's introduction to NLP dataset (text samples of Tweets labelled as disaster or not disaster).

See the original source: https://www.kaggle.com/c/nlp-getting-started/overview

In [10]:
! wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

--2021-10-21 05:16:29--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.8.128, 74.125.23.128, 74.125.203.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.8.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2021-10-21 05:16:29 (91.7 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [11]:
# Unzip the data

unzip_data('nlp_getting_started.zip')

## Visualize a text dataset 

In [12]:
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_df.head(10)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [13]:
## Shuffle the training data.
train_df_shuffled = train_df.sample(frac=1, random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [14]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [15]:
# How many examples of each class?
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
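
The counts above suggest a mild class imbalance. As a quick check that uses only the numbers printed above (the `class_counts` dictionary is typed in by hand for illustration, not read from the DataFrame):

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 4342, 1: 3271}
total = sum(class_counts.values())

# Print each class's share of the training set as a percentage.
for label, count in class_counts.items():
    share = 100 * count / total
    print(f"Class {label}: {count} samples ({share:.1f}%)")
```

Roughly 57% of the training Tweets are labelled 0 (not a disaster) and 43% are labelled 1 (disaster).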

In [16]:
# How many total samples?
len(train_df), len(test_df)

(7613, 3263)

In [17]:
# Visualise some random training samples
import random

random_index = random.randint(0,len(train_df)-5)

for row in train_df_shuffled[["text","target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target : {target} ", "(real disaster)" if target > 0 else "(not a real disaster)")
  print(f"Text : \n{text}\n")
  print(f"------\n")


Target : 0  (not a real disaster)
Text : 
Police walk up on me I be blowin smoke in dey face  wanna lock me up cus I got dope shit is gay

------

Target : 0  (not a real disaster)
Text : 
I get this feeling that society will collapse or implode. So don't be a hero and play your part.

------

Target : 0  (not a real disaster)
Text : 
Various issues fail to derail homes bid http://t.co/zhsLl7swBh

------

Target : 1  (real disaster)
Text : 
airplane crashes on house in Colombia 12 people die in accident https://t.co/ZhJlfLBHZL

------

Target : 1  (real disaster)
Text : 
Today is the day Hiroshima got Atomic bomb 70 years ago.  - The 'sanitised narrative' of Hiroshima's atomic bombing http://t.co/GKpANz7vg0

------



# Split the data into training and validation sets

In [18]:
from sklearn.model_selection import train_test_split


In [19]:
# Use train_test_split to split training data into training and validation sets


train_sentences, val_sentences, train_label, val_label = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                          train_df_shuffled["target"].to_numpy(),
                                                                          test_size = 0.1, # use 10% of training data as validation data
                                                                          random_state = 42)
len(train_sentences), len(val_sentences),len(train_label),len(val_label)

(6851, 762, 6851, 762)
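
The sizes above follow from `test_size = 0.1`. As a rough check of the arithmetic, assuming the validation count is rounded up to the next whole sample and the remainder goes to training (an assumption that matches the output shown):

```python
import math

n_samples = 7613   # rows in the shuffled training DataFrame
test_size = 0.1    # fraction held out for validation

# Assumption: the validation count is rounded up; everything else is training data.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val
print(n_train, n_val)
```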

In [20]:
len(train_df_shuffled)

7613

In [21]:
# Check the first few samples from the training set

train_sentences[:10] , train_label[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object), array([0, 

# We need to convert the text data into numbers

When dealing with a text problem, one of the first things we have to do before we can build a model is convert our text to numbers.

There are a few ways to do this, namely:

* Tokenization - direct mapping of a token (a token could be a word or a character) to a number.

* Embedding - create a matrix of feature vectors, one for each token (the size of the feature vector can be defined, and this embedding can be learned).
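
The two ideas above can be sketched in plain Python. This is a minimal illustration, not how TensorFlow implements them: the toy corpus, the `tokenize` helper, the `embedding_dim` of 4, and the random initial vectors are all assumptions made for the example.

```python
import random

# Toy corpus (hypothetical sentences, not taken from the dataset).
corpus = ["forest fire near town", "fire crews near river"]

# Tokenization: map each unique word to an integer.
# Index 0 is reserved for padding and index 1 for unknown words
# (an assumed convention for this sketch).
vocab = {"": 0, "[UNK]": 1}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

def tokenize(sentence):
    """Convert a sentence into a list of integer token ids."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

# Embedding: one feature vector per token id. Here the vectors are random;
# in a model they would be updated during training.
embedding_dim = 4
rng = random.Random(42)
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

token_ids = tokenize("fire near volcano")   # "volcano" is not in the vocabulary
vectors = [embedding_matrix[i] for i in token_ids]
print(token_ids)
print(len(vectors), len(vectors[0]))
```

Each token id is simply a row index into the embedding matrix, so a sentence of three tokens becomes three vectors of length `embedding_dim`.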

## Text vectorization  (tokenization)



In [22]:
import tensorflow as tf

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Use mostly default TextVectorization parameters (max_tokens and pad_to_max_tokens are set explicitly)

text_vectorizer = TextVectorization(max_tokens=1000, # how many words in the vocabulary (automatically includes an out-of-vocabulary token)
                                    standardize='lower_and_strip_punctuation',
                                    split='whitespace',
                                    ngrams=None, # create groups of n words?
                                    output_mode='int', # how to map tokens to numbers
                                    output_sequence_length=None, # None pads each sequence to the length of the longest sequence
                                    pad_to_max_tokens=True)


In [23]:
# Find the average number of tokens (words) in the training tweets.
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [27]:
# Set up text vectorization variables

max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length of our sequences (e.g. how many words from a Tweet does the model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode='int',
                                    output_sequence_length=max_length)

In [28]:
# Fit the text vectorizer to training dataset

text_vectorizer.adapt(train_sentences)

In [30]:
# Create a sample sentence and tokenize it

sample_sentence = "There is a cloud burst on my street"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[  74,    9,    3, 3100, 2174,   11,   13,  698,    0,    0,    0,
           0,    0,    0,    0]])>
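
The trailing zeros in the output above are padding: the sentence has 8 tokens, and `output_sequence_length=15` fills the remaining 7 positions with 0. A simplified pure-Python sketch of the standardize, split, look up, and pad or truncate steps follows. The `toy_vocab` mapping and the `vectorize` helper are hypothetical and only approximate what `TextVectorization` does.

```python
import string

# Hypothetical vocabulary; 0 is padding and 1 is the out-of-vocabulary token.
# Unlike the fitted layer above, this toy vocabulary does not contain "cloud" or "burst".
toy_vocab = {"": 0, "[UNK]": 1, "there": 2, "is": 3, "a": 4, "on": 5, "my": 6, "street": 7}

def vectorize(sentence, vocab, max_length):
    """Lowercase, strip punctuation, split on whitespace, map to ids, then pad or truncate."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    ids = [vocab.get(token, vocab["[UNK]"]) for token in cleaned.split()]
    ids = ids[:max_length]                        # truncate sequences that are too long
    return ids + [0] * (max_length - len(ids))    # pad short sequences with 0

print(vectorize("There is a cloud burst on my street!", toy_vocab, max_length=15))
```

Words missing from the vocabulary map to 1, which is why the randomly chosen Tweet in the next cell contains several 1s.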

In [37]:
# Choose a random sentence from the training dataset and tokenize it

random_sentence = random.choice(train_sentences)
print(f"Original Text : \n{random_sentence}\
\n\nVectorized version : \n {text_vectorizer([random_sentence])}")

Original Text : 
@harbhajan_singh @StuartBroad8 i cant believe...is this d same stuart broad who was destroyed by our yuvi..????

Vectorized version : 
 [[   1 8005    8   98    1   19  902  726 2275 2176   65   23  351   18
   103]]


In [40]:
# Get the unique words in the vocabulary

words_in_vocab = text_vectorizer.get_vocabulary() # get all of the unique words from the training dataset

top_5_words = words_in_vocab[:5] # get the most common words from the vocabulary

bottom_5_words = words_in_vocab[-5:] # get the least common words

print(f"Total number of words :{len(words_in_vocab)}")
print(f"\nTop 5 words : {top_5_words}\
        \nBottom 5 words: {bottom_5_words}\n")

Total number of words :10000

Top 5 words : ['', '[UNK]', 'the', 'a', 'in']        
Bottom 5 words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
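
In the output above, index 0 is the empty string used for padding and index 1 is `[UNK]`, the out-of-vocabulary token; the remaining words run from most to least frequent. A rough sketch of how such a vocabulary could be built with `collections.Counter` on a made-up corpus (the `build_vocabulary` helper is hypothetical, not a TensorFlow function):

```python
from collections import Counter

def build_vocabulary(sentences, max_tokens):
    """Return a frequency-ordered vocabulary with reserved padding and unknown tokens."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Two slots are reserved for the padding and unknown tokens.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

# Made-up sentences: "the" appears 4 times, "fire" 3 times, "in" twice.
sentences = ["the fire in the hills", "the fire in town", "the fire"]
vocab = build_vocabulary(sentences, max_tokens=5)
print(vocab)
```

With `max_tokens=5`, only the three most frequent words are kept, so rarer words such as "hills" and "town" would map to `[UNK]`.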

