# **Natural Language Processing Basics Notebook**

This Jupyter notebook covers the fundamentals of Natural Language Processing (NLP). It starts with text preprocessing, where lexical processing (tokenization, lemmatization, stemming, stop word removal, and POS tagging) prepares the raw text. It then turns to feature extraction techniques such as Bag-of-Words, TF-IDF, and n-grams (unigrams, bigrams, and the related skipgrams), which convert the preprocessed text into a numerical format suitable for machine learning models. Finally, it surveys deep learning models for NLP tasks: Recurrent Neural Networks (RNNs), their variants Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), and the Transformer-based BERT (Bidirectional Encoder Representations from Transformers) model.

In [1]:
# Data and dependencies load

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

pd.options.display.max_rows = 100
pd.options.display.max_columns = None

In [2]:
# Load data
with open('sample.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()

In [3]:
# Data
print(corpus)

The quick brown fox jumps over the lazy dog. It was a bright sunny day, and the fox was feeling particularly energetic. As it bounded through the forest, it caught sight of a dog lounging in a patch of sunlight, completely oblivious to its surroundings. The fox, being the mischievous creature that it is, decided to have a bit of fun. With a sly grin, it gathered its strength and leaped gracefully over the unsuspecting canine. The dog, startled by the sudden movement, let out a yelp of surprise before realizing what had transpired. The fox continued on its way, feeling quite pleased with itself for such a daring feat.


## **1. Text Preprocessing**

Text preprocessing is a crucial step in Natural Language Processing (NLP) that involves transforming raw text data into a format suitable for further analysis and modeling. This section covers various techniques used in text preprocessing.

### **Part 1: Lexical Processing**
Lexical processing is the foundation of working with text data in Natural Language Processing (NLP). It is the first stage of the pipeline, where raw text is split into individual units (tokens) and normalized before any higher-level analysis.

#### **1.1. Tokenization**
Tokenization is the process of breaking down a sequence of text into smaller units called tokens. These tokens can be words, subwords, or other meaningful elements, depending on the tokenization method used. Common tokenization techniques include word tokenization, subword tokenization (e.g., WordPiece, BPE), and character-level tokenization.

In [7]:
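# Tokenization with spaCy (requires the en_core_web_sm model to be installed)

import spacy

nlp = spacy.load("en_core_web_sm")

# Run the pipeline on the corpus and collect the text of each token
doc = nlp(corpus)
tokens = [token.text for token in doc]
print(tokens[:10])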
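
Independently of spaCy, the difference between word-level and character-level tokenization can be sketched with the standard library alone. The regular expression and the function names `word_tokenize` and `char_tokenize` are assumptions made for this illustration; production tokenizers handle many more cases (contractions, URLs, subwords).

```python
import re

def word_tokenize(text):
    # A token is either a run of letters/digits/apostrophes
    # or a single punctuation character.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def char_tokenize(text):
    # Every character becomes its own token.
    return list(text)

sentence = "The quick brown fox jumps over the lazy dog."
word_tokens = word_tokenize(sentence)
char_tokens = char_tokenize("fox")

print(word_tokens)
print(char_tokens)
```

Word tokens keep the vocabulary interpretable but leave unseen words unrepresented, while character tokens cover any input at the cost of much longer sequences; subword methods such as WordPiece and BPE sit between the two.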

#### **1.2. Lemmatization**
Lemmatization is the process of reducing a word to its base or root form, known as the lemma. It considers the context and the part of speech of the word to determine its correct lemma. For example, the words "went" and "going" would be lemmatized to "go."
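
To illustrate why the part of speech matters, here is a minimal dictionary-based sketch. The lookup table `LEMMA_TABLE` and the `lemmatize` function are hypothetical and cover only a handful of words; real lemmatizers rely on large lexicons and morphological rules.

```python
# Hypothetical lookup table: (lowercased word, coarse POS tag) -> lemma.
LEMMA_TABLE = {
    ("went", "VERB"): "go",
    ("going", "VERB"): "go",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("better", "ADJ"): "good",
}

def lemmatize(word, pos):
    # Fall back to the lowercased word when the table has no entry.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lemmatize("went", "VERB"))
print(lemmatize("meeting", "NOUN"))
print(lemmatize("meeting", "VERB"))
```

The same surface form "meeting" maps to two different lemmas depending on its tag, which a context-free method cannot reproduce.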

#### **1.3. Stemming**
Stemming is a simpler, rule-based approach to word normalization than lemmatization. It strips affixes (in practice mostly suffixes) from words to obtain a stem. Because stemming ignores context and part of speech, the result is not always a valid word or the correct base form; for example, a naive rule that strips "-ing" turns "lounging" into "loung".
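
A toy suffix-stripping stemmer shows the idea. This is a deliberately naive sketch, not the Porter algorithm or any library's implementation; the suffix list and the `min_stem_len` parameter are assumptions chosen for this example.

```python
# Suffixes are tried in this order; the first match wins.
SUFFIXES = ("ing", "ly", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["jumps", "feeling", "gathered", "lounging", "was"]:
    print(w, "->", naive_stem(w))
```

Rules like these are fast and need no dictionary, but they produce stems such as "loung" that are not words, which is the trade-off against lemmatization.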

#### **1.4. Stop Word Removal**
Stop words are commonly occurring words that often carry little or no semantic value, such as "the," "a," "is," and "and." Stop word removal is the process of filtering out these words from the text, as they can introduce noise and increase the dimensionality of the data without providing much useful information.
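
Stop word removal amounts to filtering tokens against a fixed list. The small `STOP_WORDS` set below is a hand-picked assumption for illustration; libraries ship much longer lists, and the right list depends on the task.

```python
import re

# Hand-picked stop list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "it", "was"}

def remove_stop_words(text):
    # Lowercase, keep alphabetic tokens, then drop anything in the stop list.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stop_words("The quick brown fox jumps over the lazy dog.")
print(filtered)
```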

#### **1.5. Part-of-Speech (POS) Tagging**
POS tagging assigns a grammatical label (e.g., noun, verb, adjective) to each word in a sentence. These labels describe the function of each word within the sentence structure and feed into later steps such as lemmatization.
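
The simplest possible tagger is a lexicon lookup with a default tag for unknown words. The `LEXICON` table, the tag names, and the `tag` function are hypothetical and serve only as a baseline sketch; practical taggers are statistical models that use the surrounding context to resolve ambiguous words.

```python
# Hypothetical mini-lexicon mapping lowercased words to a coarse tag.
LEXICON = {
    "the": "DET", "quick": "ADJ", "brown": "ADJ", "lazy": "ADJ",
    "fox": "NOUN", "dog": "NOUN", "jumps": "VERB", "over": "ADP",
}

def tag(tokens, default="NOUN"):
    # Unknown words receive the default tag.
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

tagged = tag("The quick brown fox jumps over the lazy dog".split())
for token, label in tagged:
    print(token, label)
```

A lookup like this always gives a word the same tag, so it cannot distinguish "jumps" as a verb from "jumps" as a plural noun.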

### **Part 2: Feature Extraction**

In NLP, feature extraction is another critical step after lexical processing. It focuses on transforming the preprocessed text data into a numerical format that machine learning algorithms can understand and process.

#### **1.6. Bag of Words (BoW)**
The Bag of Words (BoW) model is a simple and widely used technique for representing text data as vectors. In this model, each unique word in the corpus is assigned a unique index, and each document is represented as a vector of word counts or binary occurrences. Word order is discarded, which is why the representation is called a "bag".
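
The vocabulary-index construction described above can be sketched in a few lines. The two example documents and the names `vocab`, `index`, and `bow_vector` are assumptions made for this illustration.

```python
from collections import Counter

docs = [
    "the fox jumps over the dog",
    "the dog sleeps",
]

# Assign each unique word an index; sorting makes the order reproducible.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def bow_vector(doc, binary=False):
    # Count each word, then place the count (or 1) at the word's index.
    counts = Counter(doc.split())
    vec = [0] * len(vocab)
    for word, n in counts.items():
        vec[index[word]] = 1 if binary else n
    return vec

print(vocab)
print(bow_vector(docs[0]))
print(bow_vector(docs[0], binary=True))
print(bow_vector(docs[1]))
```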

#### **1.7. Term Frequency-Inverse Document Frequency (TF-IDF)**
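Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects how important a word is to a document relative to the rest of the corpus. It is calculated by multiplying the term frequency (TF) of the word in the document by its inverse document frequency (IDF) across the corpus. Words that are frequent in one document but rare elsewhere score highest, while words that appear in nearly every document score close to zero.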
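
A minimal sketch of the calculation, assuming one common textbook definition: TF is the count of the term divided by the document length, and IDF is the natural log of the number of documents divided by the number of documents containing the term. The example documents and function names are illustrative; library implementations often add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the fox jumps over the dog".split(),
    "the dog sleeps".split(),
    "the fox runs".split(),
]

def tf(term, doc):
    # Relative frequency of the term within one document.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Assumes the term occurs in at least one document (df > 0).
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "fox", "jumps"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

Under these definitions "the" scores zero because it appears in every document, and "jumps" outranks "fox" because it appears in only one.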

#### **1.8. N-grams**
N-grams are contiguous sequences of n items (e.g., words, characters) from a given text. Common types of n-grams include:

- **Unigrams**: Single words or characters.
- **Bigrams**: Sequences of two consecutive words or characters.
- **Skipgrams**: A generalization of n-grams in which the items need not be contiguous; a limited number of words or characters may be skipped between them.

N-grams are often used as features in NLP tasks, capturing local context and providing more information than individual words or characters alone.
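
Both contiguous n-grams and gapped pairs can be generated with simple slicing. The function names `ngrams` and `skip_bigrams` and the `max_skip` parameter are assumptions for this sketch; here a skip-bigram is taken to be an ordered pair of tokens with at most `max_skip` tokens between them.

```python
def ngrams(tokens, n):
    # Slide a window of width n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, max_skip):
    # Pair each token with later tokens that are at most max_skip tokens away.
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
print(skip_bigrams(tokens, max_skip=1))
```

With `max_skip=0` the skip-bigrams reduce to ordinary bigrams, which is one way to see them as a generalization.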

## **2. Modeling**

After preprocessing the text data, machine learning and deep learning models can be applied to NLP tasks such as text classification, sentiment analysis, named entity recognition, and machine translation.

### **Deep Learning Models for NLP**
- **Recurrent Neural Networks (RNNs)**
- **Long Short-Term Memory (LSTM)**
- **Gated Recurrent Unit (GRU)**
- **Bidirectional Encoder Representations from Transformers (BERT)**

#### **2.1. Recurrent Neural Networks (RNNs)**
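Recurrent Neural Networks (RNNs) are a type of neural network designed to process sequential data, such as text. They read the input one step at a time and maintain an internal hidden state that summarizes the context from previous inputs. In principle this state can carry information across many steps, but plain RNNs struggle to learn long-term dependencies in practice because gradients tend to vanish (or explode) over long sequences.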
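
The recurrence can be sketched without any deep learning library. The code below assumes the basic (Elman-style) update `h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)`; the weight values, dimensions, and the name `rnn_step` are arbitrary choices for illustration, and no training is shown.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    # One time step: combine the current input with the previous hidden state.
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(w * xj for w, xj in zip(W_xh[i], x))
        total += sum(w * hk for w, hk in zip(W_hh[i], h_prev))
        h.append(math.tanh(total))
    return h

# Arbitrary illustrative weights: 2 input features, 2 hidden units.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0, 0.0]  # initial hidden state
for x in sequence:
    h = rnn_step(x, h, W_xh, W_hh, b)

print(h)
```

The same weights are reused at every step; only the hidden state `h` changes, which is what lets the network handle sequences of any length.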

#### **2.2. Long Short-Term Memory (LSTM)**
Long Short-Term Memory (LSTM) is a variant of RNNs that mitigates the vanishing gradient problem, which can occur in traditional RNNs when handling long sequences. LSTMs add a separate cell state controlled by input, forget, and output gates, which lets them selectively remember or forget information and makes them better suited to modeling long sequences.
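
The gating mechanism can be sketched for a single scalar unit (one input feature, one hidden unit) using the standard LSTM update equations. The parameter layout, the values in `params` and `keep`, and the name `lstm_step` are assumptions for this illustration; real layers use weight matrices and learn them from data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # p maps each gate name to (input weight, recurrent weight, bias).
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new content to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # proposed new content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

params = {
    "forget": (0.5, 0.5, 1.0),
    "input": (0.5, 0.5, 0.0),
    "output": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)

# With the forget gate saturated open and the input gate shut,
# the cell state passes through the step essentially unchanged.
keep = dict(params, forget=(0.0, 0.0, 50.0), input=(0.0, 0.0, -50.0))
_, c_kept = lstm_step(0.3, 0.0, 2.0, keep)
print(c_kept)
```

The additive update `c = f * c_prev + i * g` is what allows information (and gradients) to survive across many steps when the forget gate stays close to 1.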

#### **2.3. Gated Recurrent Unit (GRU)**
Gated Recurrent Unit (GRU) is another variant of RNNs that mitigates the vanishing gradient problem. GRUs have a simpler architecture than LSTMs (two gates, update and reset, and no separate cell state) but often achieve comparable performance on NLP tasks.
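
A scalar GRU step, sketched under stated assumptions: one input feature, one hidden unit, and the convention that an update gate near 1 keeps the previous state (references differ on which way round the update gate is written). The parameter values and the name `gru_step` are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    # p maps each name to (input weight, recurrent weight, bias).
    w_x, w_h, bias = p["update"]
    z = sigmoid(w_x * x + w_h * h_prev + bias)   # update gate
    w_x, w_h, bias = p["reset"]
    r = sigmoid(w_x * x + w_h * h_prev + bias)   # reset gate
    w_x, w_h, bias = p["candidate"]
    # The reset gate scales how much of the old state feeds the candidate.
    h_tilde = math.tanh(w_x * x + w_h * (r * h_prev) + bias)
    # Blend the old state and the candidate; z near 1 keeps the old state.
    return (1.0 - z) * h_tilde + z * h_prev

params = {
    "update": (0.5, 0.5, 0.0),
    "reset": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h = 0.0
for x in [1.0, 0.5, -0.5]:
    h = gru_step(x, h, params)
print(h)

# Saturating the update gate makes the unit hold its previous state.
hold = dict(params, update=(0.0, 0.0, 50.0))
h_held = gru_step(0.3, 0.7, hold)
print(h_held)
```

Compared with an LSTM unit, there is a single state `h` and one fewer gate, which is where the smaller parameter count comes from.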

#### **2.4. Bidirectional Encoder Representations from Transformers (BERT)**
Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based language model that achieved state-of-the-art results on a range of NLP benchmarks when it was introduced. BERT is pre-trained on a large text corpus with self-supervised objectives (masked language modeling and next sentence prediction), which lets it capture rich bidirectional context and transfer that knowledge to downstream tasks through fine-tuning.
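
The masked language modeling objective is self-supervised because the training labels come from the text itself. The sketch below prepares one training example following the masking scheme described for BERT: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The toy vocabulary, whitespace tokenization, and the name `mask_tokens` are assumptions; the real model works on WordPiece tokens with special `[CLS]` and `[SEP]` markers.

```python
import random

MASK = "[MASK]"
VOCAB = ["fox", "dog", "forest", "sunlight"]  # toy vocabulary for random swaps

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)               # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)          # usually hide the token
            elif roll < 0.9:
                inputs.append(rng.choice(VOCAB))  # sometimes swap in a random one
            else:
                inputs.append(tok)           # sometimes leave it as is
        else:
            labels.append(None)              # position ignored by the loss
            inputs.append(tok)
    return inputs, labels

tokens = "the fox continued on its way feeling quite pleased with itself".split()
inputs, labels = mask_tokens(tokens, rng=random.Random(42))
print(inputs)
print(labels)
```

Because the model sees the tokens on both sides of each hidden position, predicting it requires using left and right context together, which is the "bidirectional" part of the name.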

This notebook provides an overview of the key steps and techniques in an end-to-end NLP pipeline, from text preprocessing to modeling. The specific techniques and models to use depend on the NLP task at hand and the characteristics of the data.