# Introduction to NLP Fundamentals in TensorFlow

Natural language processing (NLP) aims to derive information from natural language, such as text or speech.

Many NLP problems are also referred to as sequence-to-sequence (seq2seq) problems.

## Check for GPU

In [1]:
!nvidia-smi

zsh:1: command not found: nvidia-smi
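
The error above only means the `nvidia-smi` tool is not installed on this machine. A minimal sketch of a check that does not fail when the tool is missing, using only the standard library (the helper name `nvidia_smi_output` is hypothetical and not part of the original notebook):

```python
import shutil
import subprocess

def nvidia_smi_output():
    """Return nvidia-smi's report, or None when the tool is missing or fails."""
    # shutil.which looks the executable up on PATH without running it
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=False)
    return result.stdout if result.returncode == 0 else None

gpu_report = nvidia_smi_output()
print(gpu_report if gpu_report is not None else "nvidia-smi not available on this machine")
```
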


## Download the helper functions into the working folder


! wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

In [2]:
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

## Get a text dataset

The dataset we're going to be using is Kaggle's introduction to NLP dataset. It is a binary classification problem: label each tweet as describing a real disaster or not.

[Competition Link](https://www.kaggle.com/competitions/nlp-getting-started/overview)

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

unzip_data('nlp_getting_started.zip')

--2024-03-12 11:10:55--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 2404:6800:4009:805::201b, 2404:6800:4009:806::201b, 2404:6800:4009:809::201b, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|2404:6800:4009:805::201b|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: 'nlp_getting_started.zip.1'


2024-03-12 11:10:55 (4.80 MB/s) - 'nlp_getting_started.zip.1' saved [607343/607343]



## Become one with the data

In [4]:
# Read the data
import pandas as pd

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [5]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac= 1, random_state= 42)

train_df_shuffled.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction | | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge | | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock | | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


In [6]:
test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [7]:
# How many examples of each class
train_df['target'].value_counts()

target
0    4342
1    3271
Name: count, dtype: int64
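
To judge how balanced the two classes are, the counts above can be turned into proportions. This is a small worked example that hard-codes the counts printed by `value_counts()`; the variable names are illustrative only.

```python
# Counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
class_share = {label: count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print(f"target={label}: {share:.1%} of {total} tweets")
```

Roughly 57% of the tweets are labelled 0 and 43% are labelled 1, so the dataset is only mildly imbalanced.
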

In [8]:
# How many samples
len(train_df), len(test_df)

(7613, 3263)

In [9]:
# Let's visualise some random training examples
import random
random_index = random.randint(0, len(train_df) - 5)

for row in train_df_shuffled[['text', 'target']][random_index: random_index + 5].itertuples():
    _, text, target = row
    print(f"Target: {'Disaster' if target == 1 else 'Not a disaster'}")
    print(f"Text: {text}")
    print("----\n")


Target: Not a disaster
Text: Mike Magner Discusses A Trust Betrayed: http://t.co/GETBjip5Rh via @YouTube #military #veterans #environment
----

Target: Not a disaster
Text: Beware of your temper and a loose tongue! These two dangerous weapons combined can lead a person to the Hellfire #islam!
----

Target: Disaster
Text: Flood: Two people dead 60 houses destroyed in Kaduna: Two people have been reportedly killed and 60 houses ut... http://t.co/BDsgF1CfaX
----

Target: Disaster
Text: MH370: Aircraft debris found on La Reunion is from missing Malaysia Airlines ... - ABC Onlin... http://t.co/N3lNdJKYo3 G #Malaysia #News
----

Target: Not a disaster
Text: Whenever I have a meltdown and need someone @Becca_Caitlyn99 is always like 'leaving in 5' and I don't know how I got so lucky #blessed
----



### Split data into training and validation sets

In [10]:
from sklearn.model_selection import train_test_split

# Use train_test_split() to split the training data into training and validation datasets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    train_df_shuffled['text'].to_numpy(),
    train_df_shuffled['target'].to_numpy(),
    test_size= 0.1, # use 10% of the training data for validation
    random_state= 42)

In [11]:
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
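
The sizes above can be reproduced by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample and gives the remainder to the training set, which matches the output shown here.

```python
import math

n_samples = 7613   # rows in train_df
test_size = 0.1    # fraction passed to train_test_split()

# Assumption: the validation count is rounded up, the rest is used for training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```
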

In [12]:
# Check the first 10 sentences
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

## Converting text into numbers

When dealing with a text problem, one of the first things you'll have to do before you can build a model is convert your text to numbers.

There are a few ways to do this:
* Tokenization
* Embedding
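
The difference between the two can be sketched in plain Python. This is a toy illustration, not how TensorFlow implements either step: the example sentences, the `<pad>`/`<unk>` entries, and the helper names `tokenize` and `embed` are all assumptions made for the example, and the embedding values are random rather than learned.

```python
import random

sentences = ["forest fire near the town", "the town is safe"]

# Tokenization: give every distinct word its own integer ID.
# IDs 0 and 1 are reserved for padding and unknown words (assumed convention).
vocab = {"<pad>": 0, "<unk>": 1}
for sentence in sentences:
    for word in sentence.lower().split():
        vocab.setdefault(word, len(vocab))

def tokenize(sentence):
    """Map each word to its ID, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

# Embedding: give every ID a vector. A real model learns these values;
# here they are random placeholders.
rng = random.Random(42)
embedding_dim = 4
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

def embed(token_ids):
    """Look up one vector per token ID."""
    return [embedding_table[token_id] for token_id in token_ids]

token_ids = tokenize("the forest is safe")
vectors = embed(token_ids)
print(token_ids)                      # one integer per word
print(len(vectors), len(vectors[0]))  # one vector of size embedding_dim per word
```
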

### Text vectorization (tokenization)

In [13]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Use the default TextVectorization parameters
text_vectorizer = TextVectorization(max_tokens= None, # how many words in the vocab (None means no limit)
                                    standardize= 'lower_and_strip_punctuation', # how to clean the text
                                    split= 'whitespace', # how to split the text into tokens
                                    ngrams= None, # create groups of n words (None means no n-grams)
                                    output_mode= 'int', # what format the output should be in
                                    output_sequence_length= None, # how long the output sequences should be
                                    )
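
To get a feel for what `standardize= 'lower_and_strip_punctuation'` and `split= 'whitespace'` do before any numbers are assigned, here is a rough pure-Python approximation. It is not the Keras implementation: it assumes `string.punctuation` is close enough to the set of characters the layer removes, and the helper names are hypothetical.

```python
import string

def standardize(text):
    """Lowercase the text and drop punctuation (approximation of the Keras default)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def split_on_whitespace(text):
    """Split the text into tokens wherever whitespace occurs."""
    return text.split()

tweet = "Forest fire near La Ronge Sask. Canada"
tokens = split_on_whitespace(standardize(tweet))
print(tokens)
```
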

In [14]:
# Find the average number of tokens in the training tweets (rounded down)
sum(len(sentence.split()) for sentence in train_sentences) // len(train_sentences)

14

In [15]:
# Setup text vectorization variables
max_vocab_length = 10000 # max number of words to have in our vocab
max_length = 15 # max length our sequences will be

text_vectorizer = TextVectorization(max_tokens= max_vocab_length,
                                    output_mode= 'int',
                                    output_sequence_length= max_length)
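
With `output_sequence_length= max_length`, every tweet comes out as exactly 15 integers: shorter tweets are padded at the end and longer ones are cut off. A minimal sketch of that behaviour in plain Python, assuming a padding ID of 0 (the helper `pad_or_truncate` and the token IDs are made up for illustration):

```python
def pad_or_truncate(token_ids, max_length=15, pad_id=0):
    """Return exactly max_length IDs: clip long inputs, right-pad short ones."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

short_tweet = pad_or_truncate([5, 2, 7, 8])          # 4 tokens, padded with 11 zeros
long_tweet = pad_or_truncate(list(range(1, 21)))     # 20 tokens, last 5 dropped

print(short_tweet)
print(long_tweet)
```

Since the average tweet has about 14 tokens, a length of 15 keeps most tweets whole while bounding the input size.
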