# NLP BASICS 

NLP Tutorial
NLP - or Natural Language Processing - is shorthand for a wide array of techniques designed to help machines learn from text. Natural Language Processing powers everything from chatbots to search engines, and is used in diverse tasks like sentiment analysis and machine translation.

We start with a set of tasks collectively called text **normalization**, in which regular expressions play an important part. Normalizing text means converting it to a more convenient, standard form. For example, most of what we are going to do with language relies on first separating out, or **tokenizing**, words from running text: the task of **tokenization**. English words are often separated from each other by whitespace, but whitespace is not always sufficient. New York and rock ’n’ roll are sometimes treated as single large words even though they contain spaces, while at other times we’ll need to split I’m into the two words I and am. For processing tweets or texts we’ll also need to tokenize emoticons like :) and hashtags like #nlproc.
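
As a rough illustration of why whitespace alone is not enough, here is a minimal regex-based tokenizer sketch. The pattern and the `tokenize` helper are hypothetical and written only for this example; they keep contractions, hashtags, and a few simple emoticons intact, and they do not merge multi-word names such as New York.

```python
import re

# Hypothetical pattern for illustration; alternatives are tried left to right,
# so emoticons and hashtags are matched before plain words and punctuation.
TOKEN_PATTERN = re.compile(
    r"""
    [:;=][-']?[)(DPp]      # simple emoticons, for example a colon followed by a bracket
    | \#\w+                # hashtags
    | @\w+                 # user mentions
    | \w+(?:'\w+)?         # words, keeping contractions in one piece
    | [^\w\s]              # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens found in `text`, in order of appearance."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("I'm learning #nlproc in New York :)")
print(tokens)
```

Splitting the same sentence with `str.split()` happens to give similar pieces here, but it would glue punctuation to neighbouring words in a sentence like "Help, fire!", which is where a pattern-based tokenizer starts to pay off.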

Another part of text normalization is lemmatization, the task of determining that two words have the same root, despite their surface differences. For example, the words sang, sung, and sings are forms of the verb sing. The word sing is the common lemma of these words, and a lemmatizer maps from all of these to sing. Lemmatization is essential for processing morphologically complex languages like Arabic. Stemming refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word. Text normalization also includes sentence segmentation: breaking up a text into individual sentences, using cues like periods or exclamation points.
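
The difference between the three ideas can be sketched with a few toy helpers. Everything below (`LEMMAS`, `lemmatize`, `naive_stem`, `split_sentences`) is hypothetical and deliberately simplistic: a real lemmatizer relies on a full dictionary and morphological analysis, and a real stemmer uses a much larger rule set.

```python
import re

# Tiny hand-written lemma table covering only the examples from the text.
LEMMAS = {"sang": "sing", "sung": "sing", "sings": "sing"}


def lemmatize(word):
    """Look the word up in the toy table; fall back to the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)


def naive_stem(word):
    """Strip at most one common suffix; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


words = ["sang", "sung", "sings"]
print([lemmatize(w) for w in words])
print([naive_stem(w) for w in words])
print(split_sentences("Stemming is crude. Is it useful? Often, yes!"))
```

The lemma lookup maps all three forms to sing, while the suffix stripper only handles sings and leaves the irregular forms sang and sung untouched. That gap is the reason lemmatization is described as the more thorough of the two.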

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



In [None]:
pd.set_option('display.max_colwidth', None)

In [None]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
train_df.tail(12)


In [None]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1, random_state=42) 
train_df_shuffled.head()

Let's look at the class distribution.

In [None]:
classdis=train_df.target
classdis.value_counts()

Both classes are well represented.

In [None]:
train_df.describe()

Let's summarize the text length of every row, computed with an anonymous (lambda) function.

In [None]:
train_df["length"] = train_df["text"].apply(lambda x: len(x))
test_df["length"] = test_df["text"].apply(lambda x: len(x))

print("training data characteristics")
print(train_df["length"].describe())

print("testing data characteristics")
print(test_df["length"].describe())



Let's look at the last ten training examples.

In [None]:
print(f"Text:{train_df['text'].tail(10)}",f"Target:{train_df['target'].tail(10)}")

**Split data into training and validation sets**

The test set doesn't have labels, so we hold out a validation set from the training data to check model performance during training.


In [None]:
from sklearn.model_selection import train_test_split 
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1, 
                                                                            random_state=42)
   

In [None]:
len(train_sentences),len(val_labels)

# Data preprocessing: converting text into numbers
When dealing with a text problem, one of the first things you'll have to do before you can build a model is to convert your text to numbers.

There are a few ways to do this, namely:

- Tokenization: a direct mapping from a token (a token could be a word or a character) to a number.
- Embedding: a matrix holding one feature vector per token (the size of the feature vector can be chosen, and the vectors can be learned during training).
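
To make the two ideas concrete, here is a small pure-Python sketch. The sentences, the reserved `<pad>` and `<unk>` entries, and the helper `encode` are all assumptions made for illustration. The embedding values are random here, whereas in a model they would be learned.

```python
import random

sentences = ["the flood hit the town", "the town is safe"]

# Tokenization: give every distinct word its own integer ID.
# IDs 0 and 1 are reserved for padding and unknown words (an assumed convention).
vocab = {"<pad>": 0, "<unk>": 1}
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))


def encode(sentence):
    """Map each word to its ID, using the unknown ID for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]


# Embedding: one small vector per ID, stored as the rows of a table.
rng = random.Random(42)
embedding_dim = 4
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)] for _ in vocab
]

ids = encode("the town flooded")
vectors = [embedding_table[i] for i in ids]
print(ids)                            # one integer per word
print(len(vectors), len(vectors[0]))  # number of tokens, embedding size
```

The word flooded never appeared in the toy sentences, so it falls back to the unknown ID, and every ID is then swapped for a row of the embedding table.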

**Text vectorization**

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Initialize the TextVectorization layer with its default parameters written out explicitly
text_vectorizer = TextVectorization(
    max_tokens=None,  # Maximum size of the vocabulary (None means no limit)
    standardize="lower_and_strip_punctuation",  # Standardization method as a string
    split="whitespace",  # Tokenize based on whitespace
    ngrams=None,  # No n-gram creation
    output_mode="int",  # Map tokens to integers
    output_sequence_length=None  # Length of the output sequences (None means variable length)
)


It's often beneficial to set max_tokens to a specific number (e.g., max_tokens=10000) to limit the vocabulary size. A smaller vocabulary keeps the model smaller and can help reduce overfitting to rare words.
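
The effect of capping the vocabulary can be sketched without TensorFlow. The helpers `build_vocab` and `vectorize` below are hypothetical and only loosely imitate a text vectorization layer: index 0 is assumed to be padding, index 1 the out-of-vocabulary token, and `sequence_length` is an illustrative fixed output length.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Keep the most frequent words; index 0 is padding, index 1 is unknown."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Two slots go to the reserved entries, so only max_tokens - 2 words fit.
    kept = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + kept


def vectorize(text, vocab, sequence_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


texts = ["fire in the forest", "the fire spread", "the storm passed"]
vocab = build_vocab(texts, max_tokens=4)
print(vocab)
print(vectorize("the fire reached the forest", vocab, sequence_length=6))
```

With room for only two real words, everything except the and fire collapses into the unknown index, which is the trade-off a vocabulary limit makes: fewer parameters in exchange for losing rare words.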

In [None]:
# Setup text vectorization variables
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does a model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)
     

In [None]:
# Fit the text vectorizer instance to the training data using the adapt() method
text_vectorizer.adapt(train_sentences)

In [None]:
import random
# Choose a random sentence from the training dataset (different on each run) and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\n\nVectorized version:")
text_vectorizer([random_sentence])

In [None]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_10_words = words_in_vocab[:10] # the most common words in the vocab (the first two entries are the padding token '' and '[UNK]')
bottom_10_words = words_in_vocab[-10:] # the least common words in the vocab
print(f"Most common words in vocab: {top_10_words}")
print(f"Least common words in vocab: {bottom_10_words}")

**Creating an Embedding using an Embedding Layer**

In [None]:
from tensorflow.keras import layers 

embedding = layers.Embedding(input_dim=max_vocab_length, # size of the vocabulary (number of distinct token IDs)
                             output_dim=128, # set the size of the embedding vector
                             embeddings_initializer="uniform", # default, initialize embedding vectors randomly
                             input_length=max_length # how long is each input (deprecated in recent Keras versions, where it can be omitted)
                             )

embedding

In [None]:
# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\n\nEmbedded version:")

# Embed the random sentence (vectorize it into token IDs first, then look up an embedding vector for each ID)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed
