<a href="https://colab.research.google.com/github/msfasha/307307-BI-Methods-LLMs/blob/main/section_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Natural Language Processing (NLP) Basics**

#### **1.1 What is NLP and Why Is It Important?**
**Definition:**  
Natural Language Processing (NLP) is a field of Artificial Intelligence that enables computers to understand, interpret, and generate human language.<br> NLP bridges the gap between human communication and machine understanding, allowing businesses to analyze and leverage text data effectively.

**Applications of NLP in Business:**
1. **Customer Insights:** Sentiment analysis of customer reviews and feedback.
2. **Automation:** Automated chatbots for customer support.
3. **Content Generation:** Writing articles, generating reports, or creating marketing content.
4. **Decision Support:** Extracting insights from financial reports, contracts, and legal documents.


---

#### **1.2 Core Concepts in NLP**

**1.2.1 Text Preprocessing**  
Preprocessing prepares raw text data for analysis by cleaning and standardizing it. This step is critical because raw text contains noise like punctuation, special characters, and inconsistencies.

**Steps of Text Preprocessing with Code:**

1. **Tokenization:** Breaking text into smaller components, such as words or sentences.
2. **Removing Stopwords:** Filtering out common words (e.g., "is," "the") that don’t contribute much meaning.
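3. **Stemming and Lemmatization:** Reducing words to a base form for consistency. Stemming strips suffixes heuristically (e.g., "processing" becomes "process"), while lemmatization maps a word to its dictionary form.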

**Python Code:**

First, we need to install NLTK, one of Python's most widely used NLP libraries.<br>
The Natural Language Toolkit (NLTK) is a comprehensive library for natural language processing (NLP) in Python. It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.<br> NLTK also includes wrappers for industrial-strength NLP libraries. It is widely used for research and educational purposes due to its simplicity and extensive documentation.

In [1]:
! pip install nltk


**Example text**

In [10]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the resources used below
nltk.download('punkt_tab')  # tokenizer data used by word_tokenize in recent NLTK releases
nltk.download('punkt')      # tokenizer data used by older NLTK releases
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer

text = "Natural Language Processing is an exciting field of Artificial Intelligence!"

# 1. Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# 2. Removing Stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("Filtered Tokens:", filtered_tokens)

# 3. Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("Stemmed Tokens:", stemmed_tokens)

# 4. Lemmatization (each word is treated as a noun unless a `pos` argument is passed)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("Lemmatized Tokens:", lemmatized_tokens)

Tokens: ['Natural', 'Language', 'Processing', 'is', 'an', 'exciting', 'field', 'of', 'Artificial', 'Intelligence', '!']
Filtered Tokens: ['Natural', 'Language', 'Processing', 'exciting', 'field', 'Artificial', 'Intelligence', '!']
Stemmed Tokens: ['natur', 'languag', 'process', 'excit', 'field', 'artifici', 'intellig', '!']
Lemmatized Tokens: ['Natural', 'Language', 'Processing', 'exciting', 'field', 'Artificial', 'Intelligence', '!']


[nltk_data] Downloading package punkt_tab to /home/me/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /home/me/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/me/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/me/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


---

**1.2.2 Bag of Words (BoW) and TF-IDF**  
BoW and TF-IDF are techniques to convert text into numerical vectors for machine learning.

**Bag of Words:**
- Counts the frequency of words in a document.
- Doesn't consider the importance or meaning of words.
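
As a minimal sketch of the counting idea, the snippet below builds Bag of Words vectors by hand with `collections.Counter`. The two sample sentences and the helper name `bag_of_words` are illustrative assumptions, and the whitespace tokenizer is deliberately naive.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_vectors) for a list of strings."""
    # Naive tokenization: lowercase and split on whitespace (illustrative only).
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all distinct tokens.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary term; absent terms count as 0.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

# Hypothetical sample sentences, not taken from a real dataset.
docs = ["the cat sat on the mat", "the dog sat"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Each document becomes a row of counts over the shared vocabulary, so word order is lost and only frequencies remain.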

**TF-IDF (Term Frequency-Inverse Document Frequency):**
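- Assigns each word a weight that grows with its frequency in a document and shrinks with the number of documents that contain it.
- Words that appear in nearly every document therefore receive low weights, while words specific to a few documents receive high weights.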

**Python Code Example:**

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample text data
documents = [
    "NLP is amazing.",
    "Natural Language Processing is the future.",
    "NLP helps in understanding human language."
]

# Bag of Words
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print("Bag of Words Matrix:")
print(bow_matrix.toarray())
print("Feature Names:", vectorizer.get_feature_names_out())

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())
print("Feature Names:", tfidf_vectorizer.get_feature_names_out())

Bag of Words Matrix:
[[1 0 0 0 0 1 0 0 1 0 0 0]
 [0 1 0 0 0 1 1 1 0 1 1 0]
 [0 0 1 1 1 0 1 0 1 0 0 1]]
Feature Names: ['amazing' 'future' 'helps' 'human' 'in' 'is' 'language' 'natural' 'nlp'
 'processing' 'the' 'understanding']

TF-IDF Matrix:
[[0.68091856 0.         0.         0.         0.         0.51785612
  0.         0.         0.51785612 0.         0.         0.        ]
 [0.         0.44036207 0.         0.         0.         0.3349067
  0.3349067  0.44036207 0.         0.44036207 0.44036207 0.        ]
 [0.         0.         0.44036207 0.44036207 0.44036207 0.
  0.3349067  0.         0.3349067  0.         0.         0.44036207]]
Feature Names: ['amazing' 'future' 'helps' 'human' 'in' 'is' 'language' 'natural' 'nlp'
 'processing' 'the' 'understanding']
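
The TF-IDF values printed above can be reproduced by hand, which helps show where the numbers come from. The sketch below assumes scikit-learn's default settings (smoothed IDF and L2 normalization); the helper names `tokenize` and `tfidf_vector` are illustrative, not part of any library.

```python
import math
import re
from collections import Counter

documents = [
    "NLP is amazing.",
    "Natural Language Processing is the future.",
    "NLP helps in understanding human language.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring the default token pattern of scikit-learn's vectorizers.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(documents)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 normalization
    return [v / norm for v in raw]

tfidf = [tfidf_vector(tokens) for tokens in tokenized]
first = dict(zip(vocabulary, tfidf[0]))
print(round(first["amazing"], 8), round(first["is"], 8))
```

In the first document, "amazing" occurs in only one document and so gets a larger weight than "is" and "nlp", which each occur in two.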


---

#### **1.3 Introduction to Vectors**

**Definition:**  
A **vector** is a mathematical object that has both magnitude (size) and direction. In NLP, vectors are used to represent words, sentences, or documents as points in a multi-dimensional space. Understanding vectors is crucial for understanding word embeddings.

**1.3.1 Why Are Vectors Important in NLP?**
1. **Representation:** Words and phrases can be represented as numerical vectors, enabling mathematical operations.
2. **Similarity:** Vectors help measure similarity between words (e.g., cosine similarity).
3. **Operations:** Vectors enable computations like addition, subtraction, and scaling, which are useful for tasks like analogy generation.

**1.3.2 Basic Concepts of Vectors**
1. **Magnitude:** The length of a vector.
2. **Direction:** The orientation of the vector in space.
3. **Operations:**
   - Addition and subtraction of vectors.
   - Dot product (used in cosine similarity).
   - Scaling (multiplying a vector by a scalar).
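
For reference, the quantities listed above can be written compactly for two $n$-dimensional vectors $\mathbf{a}$ and $\mathbf{b}$. These are the standard definitions; the symbols are notation introduced here for illustration.

$$
\lVert \mathbf{a} \rVert = \sqrt{\sum_{i=1}^{n} a_i^2}, \qquad
\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i, \qquad
\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$

As a worked example, take $\mathbf{a} = (2, 3)$ and $\mathbf{b} = (4, 1)$:

$$
\lVert \mathbf{a} \rVert = \sqrt{2^2 + 3^2} = \sqrt{13} \approx 3.606, \qquad
\mathbf{a} \cdot \mathbf{b} = 2 \cdot 4 + 3 \cdot 1 = 11, \qquad
\cos\theta = \frac{11}{\sqrt{13}\,\sqrt{17}} \approx 0.740
$$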

**1.3.3 Visualizing Vectors**  
Vectors can be visualized in 2D or 3D space. For example:
- A word like "king" might be represented as a vector [0.8, 0.6].
- A word like "queen" might be represented as [0.7, 0.7].

**1.3.4 Practical Python Code for Vectors**

**Python Code: Basic Operations**

In [4]:
import numpy as np

# Define vectors
vector_a = np.array([2, 3])
vector_b = np.array([4, 1])

# Magnitude of a vector
magnitude_a = np.linalg.norm(vector_a)
print("Magnitude of vector_a:", magnitude_a)

# Addition of vectors
vector_sum = vector_a + vector_b
print("Sum of vector_a and vector_b:", vector_sum)

# Subtraction of vectors
vector_diff = vector_a - vector_b
print("Difference of vector_a and vector_b:", vector_diff)

# Dot product
dot_product = np.dot(vector_a, vector_b)
print("Dot product of vector_a and vector_b:", dot_product)

# Scaling a vector
scalar = 2
scaled_vector = scalar * vector_a
print("Scaled vector_a:", scaled_vector)

Magnitude of vector_a: 3.605551275463989
Sum of vector_a and vector_b: [6 4]
Difference of vector_a and vector_b: [-2  2]
Dot product of vector_a and vector_b: 11
Scaled vector_a: [4 6]


**Python Code: Cosine Similarity**

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

# Example vectors
vector_c = np.array([1, 0])
vector_d = np.array([0, 1])

# cosine_similarity expects 2D inputs, so each vector is wrapped in a list
similarity = cosine_similarity([vector_c], [vector_d])
print("Cosine similarity between vector_c and vector_d:", similarity[0][0])

Cosine similarity between vector_c and vector_d: 0.0
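
To make the result above less of a black box, here is a minimal sketch that computes cosine similarity directly from its definition (dot product divided by the product of the magnitudes). The function name `cosine` and the extra test vectors are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [0, 1]))    # perpendicular vectors
print(cosine([2, 3], [4, 6]))    # same direction, different length
print(cosine([1, 0], [-1, 0]))   # opposite directions
```

Perpendicular vectors score 0, vectors pointing the same way score approximately 1 regardless of their length, and opposite vectors score -1. This is why cosine similarity is described as comparing direction rather than magnitude.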


**1.3.5 Key Insights for NLP**
1. Vectors allow us to represent words in a way that computers can process.
2. Operations like the dot product and cosine similarity enable the measurement of relationships between words.
3. Adding and subtracting vectors can surface relationships between words (e.g., "king - man + woman ≈ queen").

---


#### **1.4 Introduction to Word Embeddings**
Traditional methods like BoW and TF-IDF fail to capture the semantic meaning of words. Word embeddings solve this by representing words as dense numerical vectors in a continuous vector space.

**Why Word Embeddings?**
- Words with similar meanings have similar vector representations.
- They capture relationships such as "king - man + woman ≈ queen."

**Example: Cosine Similarity of Word Vectors**
- Cosine similarity measures how similar two word vectors are in terms of direction.

**Python Code:**

In [7]:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Example word embeddings (hypothetical)
word_embeddings = {
    "king": np.array([0.8, 0.6]),
    "queen": np.array([0.7, 0.7]),
    "man": np.array([0.5, 0.3]),
    "woman": np.array([0.4, 0.5])
}

# Cosine Similarity
def calculate_cosine_similarity(word1, word2):
    vec1 = word_embeddings[word1]
    vec2 = word_embeddings[word2]
    similarity = cosine_similarity([vec1], [vec2])
    return similarity[0][0]

# Example Comparisons
print("Similarity between 'king' and 'queen':", calculate_cosine_similarity("king", "queen"))
print("Similarity between 'man' and 'woman':", calculate_cosine_similarity("man", "woman"))


Similarity between 'king' and 'queen': 0.9899494936611666
Similarity between 'man' and 'woman': 0.9374252720097653
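
The analogy "king - man + woman ≈ queen" can be sketched with the same toy values. This is only an illustration under stated assumptions: the 2D embeddings are hypothetical, "apple" is an extra made-up distractor word, and the helper names `cosine` and `analogy` are not from any library.

```python
import math

# Hypothetical 2-D embeddings: the first four reuse the toy values above;
# "apple" is an extra made-up distractor word.
embeddings = {
    "king": [0.8, 0.6],
    "queen": [0.7, 0.7],
    "man": [0.5, 0.3],
    "woman": [0.4, 0.5],
    "apple": [0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    # Exclude the query words themselves from the candidates.
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))
```

The query words are excluded from the candidates, which is common practice: with these toy values, "woman" would otherwise score slightly higher than "queen". Real embeddings have hundreds of dimensions and much larger vocabularies, so results there are less clean than in this sketch.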


### **Section 1 Summary**
1. NLP bridges the gap between human language and computers.
2. Text preprocessing is essential for cleaning and preparing data.
3. BoW and TF-IDF are basic text vectorization techniques but lack semantic understanding.
4. Word embeddings provide semantic meaning, enabling better NLP applications.

---

### **Section 1.5: Introduction to Arabic Language NLP**

#### **1.5.1 Why is Arabic NLP Challenging?**
1. **Rich Morphology:** Arabic has complex word structures (e.g., prefixes, suffixes, and infixes).
2. **Diacritics:** Words can have different meanings based on diacritics, which are often omitted in text.
3. **Word Order:** Arabic has a flexible word order compared to English.
4. **Variants:** Multiple dialects exist alongside Modern Standard Arabic (MSA), adding complexity.
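
The diacritics problem can be illustrated with a short sketch using only the standard library. It assumes that the marks to remove are the Unicode range U+064B to U+064F, U+0650 to U+0652, and the superscript alef U+0670; the helper name `strip_diacritics` is illustrative. The two words are written with escape codes so the marks are unambiguous.

```python
import re

# Arabic short-vowel and related marks (tashkeel): U+064B to U+0652,
# plus the superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def strip_diacritics(text):
    """Remove Arabic diacritic marks, leaving the base letters."""
    return DIACRITICS.sub("", text)

wrote = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
books = "\u0643\u064F\u062A\u064F\u0628"        # kutub, "books"

print(wrote, books)
print(strip_diacritics(wrote) == strip_diacritics(books))
```

Once the marks are removed, two words with different meanings share the same three letters, which is the ambiguity a model has to resolve from context when diacritics are omitted.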

#### **1.5.2 Libraries and Tools for Arabic NLP**
Here are some libraries that support Arabic:
1. **`nltk`:** Basic tokenization and stopword removal.
2. **`spacy`:** Tokenization and lemmatization for Arabic.
3. **`farasa`:** A tool specifically designed for Arabic NLP tasks like segmentation and diacritization.
4. **`pyarabic`:** General utilities for Arabic text processing.
5. **`Tashaphyne`:** For stemming Arabic words.

**1.5.3.1 Tokenization and Stopword Removal (Using `nltk`)**

In [8]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

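# Download the stopword lists (which include Arabic) and the tokenizer data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize in recent NLTK releases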

# Arabic text example
text = "الذكاء الاصطناعي يساعد في معالجة اللغة العربية بشكل كبير."

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Stopword removal
stop_words = set(stopwords.words("arabic"))
filtered_tokens = [word for word in tokens if word not in stop_words]
print("Filtered Tokens:", filtered_tokens)


Tokens: ['الذكاء', 'الاصطناعي', 'يساعد', 'في', 'معالجة', 'اللغة', 'العربية', 'بشكل', 'كبير', '.']
Filtered Tokens: ['الذكاء', 'الاصطناعي', 'يساعد', 'معالجة', 'اللغة', 'العربية', 'بشكل', 'كبير', '.']


[nltk_data] Downloading package stopwords to /home/me/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/me/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



**1.5.3.2 Stemming and Lemmatization (Using `Tashaphyne` and `spacy`)**

**Using `Tashaphyne` for Stemming:**

In [9]:
from tashaphyne.stemming import ArabicLightStemmer

# Initialize the stemmer
arabic_stemmer = ArabicLightStemmer()

# Example word
word = "المعالجة"
stemmed_word = arabic_stemmer.light_stem(word)
print("Stemmed Word:", stemmed_word)


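**Note:** This cell needs the Tashaphyne package, which can be installed with `pip install Tashaphyne`. Without it, the import fails with `ModuleNotFoundError: No module named 'tashaphyne'`.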

**Using `spacy` for Lemmatization:**

In [None]:
import spacy

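# Load an Arabic pipeline. The name "ar_core_news_sm" is assumed here: spaCy's official
# model releases may not include an Arabic pipeline, so check availability first and
# install the chosen model with `python -m spacy download <model_name>`.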
nlp = spacy.load("ar_core_news_sm")

# Example text
doc = nlp("السيارات تسير في الشوارع.")

# Lemmatization
lemmatized_words = [token.lemma_ for token in doc]
print("Lemmatized Words:", lemmatized_words)

**1.5.3.3 Morphological Analysis and Diacritization (Using `Farasa`)**

**Installing Farasa:**
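The Farasa toolkit is Java-based and can be downloaded from the [Farasa website](https://farasa.qcri.org/). The example below uses the **`farasa` Python wrapper**, which requires Java to be installed; the wrapper is distributed on PyPI under the name `farasapy` (`pip install farasapy`).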

**Farasa Example (Morphological Analysis):**

In [None]:
from farasa.segmenter import FarasaSegmenter

# Initialize the Farasa Segmenter
segmenter = FarasaSegmenter(interactive=True)

# Example text
text = "الذكاء الاصطناعي يساعدنا يومياً."

# Morphological segmentation
segmented_text = segmenter.segment(text)
print("Segmented Text:", segmented_text)

#### **1.5.4 Common Challenges and Solutions in Arabic NLP**
1. **Dialectal Variations:** Use tools like Farasa or train models on specific dialectal corpora.
2. **Handling Diacritics:** Use diacritization tools (like Farasa) for disambiguation.
3. **Rich Morphology:** Use advanced tokenization and morphological analyzers to handle affixes and infixes.

#### **1.5.5 Small Arabic NLP Project Idea**
**Sentiment Analysis for Arabic Text:**
Students can create a simple sentiment analysis tool for Arabic tweets or reviews by:
1. Preprocessing the text (tokenization, stopword removal, etc.).
2. Using a pre-trained Arabic word embedding (e.g., AraVec).
3. Training a classifier like Logistic Regression or a Neural Network.
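
One common way to connect steps 2 and 3 is to average the word vectors of a text into a single feature vector for the classifier. The sketch below assumes this averaging approach; the 3D values in `toy_embeddings` are made up and merely stand in for a real pre-trained model such as AraVec, and `document_vector` is an illustrative helper name.

```python
# Hypothetical 3-D embeddings standing in for a real pre-trained model.
toy_embeddings = {
    "خدمة": [0.2, 0.1, 0.5],      # "service"
    "ممتازة": [0.9, 0.8, 0.1],    # "excellent"
    "سيئة": [-0.8, -0.7, 0.2],    # "bad"
}

def document_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known words: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

positive = document_vector(["خدمة", "ممتازة"], toy_embeddings)
negative = document_vector(["خدمة", "سيئة"], toy_embeddings)
print(positive)
print(negative)
```

The resulting fixed-length vectors, paired with sentiment labels, can then be passed to a classifier such as Logistic Regression.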

### **Summary of Section 1.5**
1. Arabic NLP poses unique challenges due to its morphology, diacritics, and dialects.
2. Libraries like `nltk`, `spacy`, and `farasa` are useful for tasks like tokenization, stemming, and diacritization.
3. Practical examples help students grasp the differences between English and Arabic NLP processing.