# NLP BASICS 

NLP Tutorial
NLP - or Natural Language Processing - is shorthand for a wide array of techniques designed to help machines learn from text. Natural Language Processing powers everything from chatbots to search engines, and is used in diverse tasks like sentiment analysis and machine translation.

We start with a set of tasks collectively called **text normalization**, in which regular expressions play an important part. Normalizing text means converting it to a more convenient, standard form. For example, most of what we are going to do with language relies on first separating out, or **tokenizing**, words from running text. English words are often separated from each other by whitespace, but whitespace is not always sufficient. New York and rock ’n’ roll are sometimes treated as large words despite the fact that they contain spaces, while sometimes we’ll need to separate I’m into the two words I and am. For processing tweets or texts we’ll need to tokenize emoticons like :) or hashtags like #nlproc.
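
To make the whitespace problem concrete, here is a minimal sketch of a regular-expression tokenizer. The pattern list and the `tokenize` helper are illustrative assumptions, not part of any library, and they cover only a handful of emoticons. The sketch keeps I'm as a single token and does not merge multiword expressions like New York; both would need an extra step.

```python
import re

# Hypothetical pattern list. Order matters: earlier alternatives are tried first,
# so emoticons and hashtags win over the generic punctuation rule.
_PARTS = [
    r"[:;=][-']?[()DPp]",         # simple emoticons such as :) ;-) :D
    r"[#@]\w+",                   # hashtags and @mentions
    r"\w+(?:['\u2019]\w+)*",      # words, keeping internal apostrophes (I'm)
    r"[^\w\s]",                   # any other single punctuation character
]
TOKEN_RE = re.compile("|".join(_PARTS))


def tokenize(text):
    """Return the list of tokens found in text, skipping whitespace."""
    return TOKEN_RE.findall(text)


print(tokenize("I'm in New York :) #nlproc!"))
```

A plain `text.split()` on the same input would glue the exclamation mark onto the hashtag, which is the kind of case the extra patterns are meant to handle.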

Another part of text normalization is **lemmatization**, the task of determining that two words have the same root, despite their surface differences. For example, the words sang, sung, and sings are forms of the verb sing. The word sing is the common lemma of these words, and a lemmatizer maps from all of these to sing. Lemmatization is essential for processing morphologically complex languages like Arabic. **Stemming** refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word. Text normalization also includes **sentence segmentation**: breaking up a text into individual sentences, using cues like periods or exclamation points.
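
The difference between the three ideas can be sketched in a few lines. Everything below is a toy under stated assumptions: the suffix list, the lemma dictionary, and the function names are made up for illustration, and real stemmers and lemmatizers use far richer rules and lexicons.

```python
import re

# Hypothetical, deliberately tiny suffix list (checked in this order).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Irregular forms share no suffix pattern, so a lemmatizer needs a lookup table.
LEMMAS = {"sang": "sing", "sung": "sing", "sings": "sing"}


def naive_lemmatize(word):
    """Map a word to its lemma if it is in the toy dictionary, else return it unchanged."""
    return LEMMAS.get(word, word)


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print([naive_stem(w) for w in ["sings", "sang", "walking"]])
print([naive_lemmatize(w) for w in ["sings", "sang", "sung"]])
print(split_sentences("Strawberries are in big trouble. Scientists race to find solution!"))
```

The stemmer handles sings but leaves sang untouched, which is why lemmatization needs more than suffix stripping. The sentence splitter would also break incorrectly on abbreviations such as "Dr.", so punctuation cues alone are only a first approximation.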

In [1]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



/kaggle/input/nlp-getting-started/sample_submission.csv
/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv


In [2]:
pd.set_option('display.max_colwidth', None)

In [3]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
train_df.tail(12)


Unnamed: 0,id,keyword,location,text,target
7601,10859,,,#breaking #LA Refugio oil spill may have been costlier bigger than projected http://t.co/5ueCmcv2Pk,1
7602,10860,,,a siren just went off and it wasn't the Forney tornado warning ??,1
7603,10862,,,Officials say a quarantine is in place at an Alabama home over a possible Ebola case after developing symptoms... http://t.co/rqKK15uhEY,1
7604,10863,,,#WorldNews Fallen powerlines on G:link tram: UPDATE: FIRE crews have evacuated up to 30 passengers who were tr... http://t.co/EYSVvzA7Qm,1
7605,10864,,,on the flip side I'm at Walmart and there is a bomb and everyone had to evacuate so stay tuned if I blow up or not,1
7606,10866,,,Suicide bomber kills 15 in Saudi security site mosque - Reuters via World - Google News - Wall ... http://t.co/nF4IculOje,1
7607,10867,,,#stormchase Violent Record Breaking EF-5 El Reno Oklahoma Tornado Nearly Runs Over ... - http://t.co/3SICroAaNz http://t.co/I27Oa0HISp,1
7608,10869,,,Two giant cranes holding a bridge collapse into nearby homes http://t.co/STfMbbZFB5,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control wild fires in California even in the Northern part of the state. Very troubling.,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. http://t.co/zDtoyd8EbJ,1


In [4]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1, random_state=42) 
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-imaginable destruction.,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just got soaked in a deluge going for pads and tampons. Thx @mishacollins @/@,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe CoL police can catch a pickpocket in Liverpool Stree... http://t.co/vXIn1gOq4Q,1
132,191,aftershock,,Aftershock back to school kick off was great. I want to thank everyone for making it possible. What a great night.,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts develop a defensive self - one that decreases vulnerability. (3,0


Let's look at the class distribution.

In [5]:
classdis=train_df.target
classdis.value_counts()

target
0    4342
1    3271
Name: count, dtype: int64

Both classes are well represented: about 57% of the tweets are labeled 0 and about 43% are labeled 1.

In [6]:
train_df.describe()

Unnamed: 0,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0


Let's compute summary statistics of the text length (in characters) for every row, using an anonymous (lambda) function.

In [7]:
train_df["length"] = train_df["text"].apply(lambda x: len(x))  # number of characters per tweet
test_df["length"] = test_df["text"].apply(lambda x: len(x))

print("training data characteristics")
print(train_df["length"].describe())

print("testing data characteristics")
print(test_df["length"].describe())



training data characteristics
count    7613.000000
mean      101.037436
std        33.781325
min         7.000000
25%        78.000000
50%       107.000000
75%       133.000000
max       157.000000
Name: length, dtype: float64
testing data characteristics
count    3263.000000
mean      102.108183
std        33.972158
min         5.000000
25%        78.000000
50%       109.000000
75%       134.000000
max       151.000000
Name: length, dtype: float64
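
Note that `len(x)` on a string counts characters, not words. As a small sketch of the difference (the helper names are hypothetical), the same tweet text can be measured both ways:

```python
def char_length(text):
    """Number of characters, which is what len() returns for a string."""
    return len(text)


def word_length(text):
    """Number of whitespace-separated tokens."""
    return len(text.split())


example = "Two giant cranes holding a bridge collapse into nearby homes"
print(char_length(example), word_length(example))
```

The word count is the more relevant figure when a sequence length is expressed in tokens rather than characters.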


Let's look at the last 10 training examples.

In [8]:
print(f"Text:{train_df['text'].tail(10)}",f"Target:{train_df['target'].tail(10)}")

Text:7603     Officials say a quarantine is in place at an Alabama home over a possible Ebola case after developing symptoms... http://t.co/rqKK15uhEY
7604     #WorldNews Fallen powerlines on G:link tram: UPDATE: FIRE crews have evacuated up to 30 passengers who were tr... http://t.co/EYSVvzA7Qm
7605                           on the flip side I'm at Walmart and there is a bomb and everyone had to evacuate so stay tuned if I blow up or not
7606                    Suicide bomber kills 15 in Saudi security site mosque - Reuters via World - Google News - Wall ... http://t.co/nF4IculOje
7607       #stormchase Violent Record Breaking EF-5 El Reno Oklahoma Tornado Nearly Runs Over ... - http://t.co/3SICroAaNz http://t.co/I27Oa0HISp
7608                                                          Two giant cranes holding a bridge collapse into nearby homes http://t.co/STfMbbZFB5
7609                @aria_ahrary @TheTawniest The out of control wild fires in California even in the Northern part of 

**Split data into training and validation sets**
The test set has no labels, so we hold out 10% of the training data as a validation set. This lets us check model performance on unseen examples while training on the remaining 90%.


In [9]:
from sklearn.model_selection import train_test_split 
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1, 
                                                                            random_state=42)
   

In [10]:
len(train_sentences),len(val_labels)

(6851, 762)
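
The sizes above follow from rounding: 10% of 7613 rows is 761.3, and the validation share is rounded up to 762. The sketch below shows the idea in plain Python. `simple_train_val_split` is a hypothetical helper that mirrors the shuffle-then-slice logic; it will not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random


def simple_train_val_split(items, labels, val_fraction=0.1, seed=42):
    """Shuffle the indices once, then carve off the validation part."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(val_fraction * len(items))  # validation size, rounded up
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))


# Dummy data with the same number of rows as the training set above.
dummy_items = list(range(7613))
dummy_labels = [i % 2 for i in dummy_items]
tr_x, va_x, tr_y, va_y = simple_train_val_split(dummy_items, dummy_labels)
print(len(tr_x), len(va_x))
```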

# Data preprocessing: converting text into numbers
When dealing with a text problem, one of the first things you'll have to do before you can build a model is to convert your text to numbers.

There are a few ways to do this, namely:

- Tokenization: a direct mapping from each token (a token could be a word or a character) to a number
- Embedding: a matrix holding one feature vector per token (the size of the feature vector can be defined and the embedding can be learned)
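
Both ideas can be illustrated without any deep learning library. The sketch below is a toy under stated assumptions: the two sentences, `EMBED_DIM`, and the randomly initialized matrix are made up for illustration, and in a real model the embedding values would be learned during training.

```python
import random

sentences = ["the fire is out", "the flood is rising"]

# Tokenization: give each distinct word an integer ID (0 is kept free for padding).
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab) + 1)

token_ids = [[vocab[w] for w in s.split()] for s in sentences]

# Embedding: one vector of size EMBED_DIM per ID. Random here, learned in a real model.
EMBED_DIM = 4
rng = random.Random(0)
embedding_matrix = [
    [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(len(vocab) + 1)
]

# Looking up an ID returns its row of the matrix.
embedded = [[embedding_matrix[i] for i in ids] for ids in token_ids]

print(token_ids)
print(len(embedded[0]), len(embedded[0][0]))
```

The same word always maps to the same ID and therefore to the same vector, which is what lets a model share what it learns about a word across sentences.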

**Text vectorization**

In [11]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Initialize the TextVectorization layer with its default parameters written out explicitly
text_vectorizer = TextVectorization(
    max_tokens=None,  # Maximum size of the vocabulary (None means no limit)
    standardize="lower_and_strip_punctuation",  # Standardization method as a string
    split="whitespace",  # Tokenize based on whitespace
    ngrams=None,  # No n-gram creation
    output_mode="int",  # Map tokens to integers
    output_sequence_length=None  # Length of the output sequences (None means variable length)
)


With `max_tokens=None` the vocabulary size is unlimited. It is often beneficial to cap it at a specific number (e.g., `max_tokens=10000`), which keeps the model smaller and can help reduce overfitting to rare tokens.

In [12]:
# Setup text vectorization variables
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does a model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)
     

In [13]:
# Fit the text vectorizer instance to the training data using the adapt() method
text_vectorizer.adapt(train_sentences)

In [14]:
import random
# Choose a random sentence from the training dataset (different on each run) and vectorize it
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\n\nVectorized version:")
text_vectorizer([random_sentence])

Original text:
Strawberries are in big trouble. Scientists race to find solution. http://t.co/MqydXRLae7 http://t.co/EpJjkB4Be9        

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[8035,   22,    4,  335,  495, 2717, 2768,    5,  653, 2290,    1,
           1,    0,    0,    0]])>

In [15]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_10_words = words_in_vocab[:10] # the most common words in the vocab
bottom_10_words = words_in_vocab[-10:] # the least common words in the vocab
print(f"Most common words in vocab: {top_10_words}")
print(f"Least common words in vocab: {bottom_10_words}")

Most common words in vocab: ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is']
Least common words in vocab: ['painthey', 'painful', 'paine', 'paging', 'pageshi', 'pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
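
The vocabulary listing explains the vectorized output shown earlier: index 0 (the empty string) is padding, index 1 (`[UNK]`) stands for any token outside the vocabulary, and the remaining words are ordered by frequency. The following plain-Python sketch imitates that behavior. It is an approximation for illustration only, not the Keras implementation; `standardize`, `build_vocab`, and `vectorize` are hypothetical helpers, and the punctuation rule is only roughly equivalent to `lower_and_strip_punctuation`.

```python
import re
from collections import Counter

PAD, UNK = "", "[UNK]"


def standardize(text):
    """Lowercase and drop punctuation (a rough stand-in for the Keras default)."""
    return re.sub(r"[^\w\s]", "", text.lower())


def build_vocab(texts, max_tokens):
    """Reserve two slots for PAD and UNK, then add the most frequent words."""
    counts = Counter(word for t in texts for word in standardize(t).split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return [PAD, UNK] + most_common


def vectorize(text, vocab, seq_len):
    """Map words to IDs, truncate to seq_len, and pad with zeros on the right."""
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(w, 1) for w in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))


toy_texts = ["the fire is out", "the fire is", "the fire", "the"]
toy_vocab = build_vocab(toy_texts, max_tokens=5)
print(toy_vocab)
print(vectorize("The flood is out!", toy_vocab, seq_len=6))
```

In the toy run, flood and out fall outside the five-token vocabulary and become 1, and the trailing zeros are padding. The same mechanism would account for the two 1s (the URLs) and the three 0s in the strawberries example.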
