<a href="https://colab.research.google.com/github/msfasha/307307-BI-Methods-LLMs/blob/main/section_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Introduction to Natural Language Processing (NLP)**

#### **1.1 What is NLP?**
Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) that enables computers to understand, interpret, and generate human language.<br> NLP bridges the gap between human communication and machine understanding, allowing businesses to analyze and leverage text data effectively.

**Applications of NLP in Business:**
1. **Customer Insights:** Sentiment analysis of customer reviews and feedback.
2. **Automation:** Automated chatbots for customer support.
3. **Content Generation:** Writing articles, generating reports, or creating marketing content.
4. **Decision Support:** Extracting insights from financial reports, contracts, and legal documents.


#### **1.2 Core Concepts in NLP**

**1.2.1 Text Preprocessing**  
Preprocessing prepares raw text data for analysis by cleaning and standardizing it. This step is critical because raw text contains noise like punctuation, special characters, and inconsistencies.
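
As a rough illustration of this kind of noise removal, the sketch below lowercases a string, replaces punctuation and special characters with spaces, and collapses repeated whitespace, using only Python's standard library. The `clean_text` helper and the sample review are hypothetical, and real pipelines often keep some characters (for example, apostrophes or currency symbols) depending on the task.

```python
import re

def clean_text(text):
    """Lowercase, drop punctuation/special characters, and normalize spaces."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

raw_review = "  GREAT product!!!   Would buy again :) #happy  "  # hypothetical input
print(clean_text(raw_review))  # great product would buy again happy
```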

**Steps of Text Preprocessing with Code:**

1. **Tokenization:** Breaking text into smaller components, such as words or sentences.
2. **Removing Stopwords:** Filtering out common words (e.g., "is," "the") that don’t contribute much meaning.
3. **Stemming and Lemmatization:** Reducing words to their root form for consistency.

**Python Code Example:**

First, we need to install NLTK, one of Python's most widely used NLP libraries.<br>
The Natural Language Toolkit (NLTK) is a comprehensive library for natural language processing (NLP) in Python. It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.<br> NLTK also includes wrappers for industrial-strength NLP libraries. It is widely used for research and educational purposes due to its simplicity and extensive documentation.

In [None]:
! pip install nltk


**Example text**

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the tokenizer models used by word_tokenize
# (newer NLTK releases look for 'punkt_tab'; older ones use 'punkt')
nltk.download('punkt_tab')
nltk.download('punkt')

# Download the stopword list and the WordNet data used by the lemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

text = "Natural Language Processing is an exciting field of Artificial Intelligence!"

# 1. Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# 2. Removing Stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("Filtered Tokens:", filtered_tokens)

# 3. Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("Stemmed Tokens:", stemmed_tokens)

# 4. Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("Lemmatized Tokens:", lemmatized_tokens)

---

**1.2.2 Bag of Words (BoW) and TF-IDF**  
BoW and TF-IDF are techniques to convert text into numerical vectors for machine learning.

**Bag of Words:**
- Counts the frequency of words in a document.
- Doesn't consider the importance or meaning of words.
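
To make the counting concrete, here is a minimal sketch that builds a Bag of Words matrix by hand with `collections.Counter`. The sample sentences are the ones used in the scikit-learn example in this section, and the tokenization rule (lowercase words of two or more characters) is an assumption chosen to mirror `CountVectorizer`'s default behavior.

```python
import re
from collections import Counter

documents = [
    "NLP is amazing.",
    "Natural Language Processing is the future.",
    "NLP helps in understanding human language.",
]

def tokenize(text):
    # Lowercase and keep words with at least two characters
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of word frequencies per document
counts = [Counter(tokenize(doc)) for doc in documents]

# The vocabulary is the sorted set of all words seen in any document
vocabulary = sorted(set(word for c in counts for word in c))

# Each row holds the counts of every vocabulary word for one document
bow_matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, so word order within a sentence is lost.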

**TF-IDF (Term Frequency-Inverse Document Frequency):**
- Assigns weights to words based on their frequency in a document and their rarity across all documents, so words that appear almost everywhere (e.g., "is") count for less.

**Python Code Example:**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample text data
documents = [
    "NLP is amazing.",
    "Natural Language Processing is the future.",
    "NLP helps in understanding human language."
]

# Bag of Words
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print("Bag of Words Matrix:")
print(bow_matrix.toarray())
print("Feature Names:", vectorizer.get_feature_names_out())

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())
print("Feature Names:", tfidf_vectorizer.get_feature_names_out())

Bag of Words Matrix:
[[1 0 0 0 0 1 0 0 1 0 0 0]
 [0 1 0 0 0 1 1 1 0 1 1 0]
 [0 0 1 1 1 0 1 0 1 0 0 1]]
Feature Names: ['amazing' 'future' 'helps' 'human' 'in' 'is' 'language' 'natural' 'nlp'
 'processing' 'the' 'understanding']

TF-IDF Matrix:
[[0.68091856 0.         0.         0.         0.         0.51785612
  0.         0.         0.51785612 0.         0.         0.        ]
 [0.         0.44036207 0.         0.         0.         0.3349067
  0.3349067  0.44036207 0.         0.44036207 0.44036207 0.        ]
 [0.         0.         0.44036207 0.44036207 0.44036207 0.
  0.3349067  0.         0.3349067  0.         0.         0.44036207]]
Feature Names: ['amazing' 'future' 'helps' 'human' 'in' 'is' 'language' 'natural' 'nlp'
 'processing' 'the' 'understanding']
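
The first row of the TF-IDF matrix above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The document-frequency values are read off the Bag of Words matrix.

```python
import math

n_docs = 3
# Number of documents containing each word of "NLP is amazing."
doc_freq = {"amazing": 1, "is": 2, "nlp": 2}
# Each word occurs once in that sentence
term_freq = {"amazing": 1, "is": 1, "nlp": 1}

# Smoothed IDF (assumed scikit-learn default, smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Raw TF-IDF, then L2 normalization so the row has unit length
raw = {t: term_freq[t] * idf[t] for t in term_freq}
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {t: v / norm for t, v in raw.items()}

for term, weight in tfidf.items():
    print(f"{term}: {weight:.8f}")
```

"amazing" appears in only one document, so it receives a higher weight than "is" and "nlp", which each appear in two.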


---

#### **1.3 Introduction to Vectors**

**Definition:**  
A **vector** is a mathematical object that has both magnitude (size) and direction. In NLP, vectors are used to represent words, sentences, or documents as points in a multi-dimensional space. Understanding vectors is crucial for understanding word embeddings.

**1.3.1 Why Are Vectors Important in NLP?**
1. **Representation:** Words and phrases can be represented as numerical vectors, enabling mathematical operations.
2. **Similarity:** Vectors help measure similarity between words (e.g., cosine similarity).
3. **Operations:** Vectors enable computations like addition, subtraction, and scaling, which are useful for tasks like analogy generation.

**1.3.2 Basic Concepts of Vectors**
1. **Magnitude:** The length of a vector.
2. **Direction:** The orientation of the vector in space.
3. **Operations:**
   - Addition and subtraction of vectors.
   - Dot product (used in cosine similarity).
   - Scaling (multiplying a vector by a scalar).

**1.3.3 Visualizing Vectors**

Vectors can be visualized in 2D or 3D space. For example:
- A word like "king" might be represented as a vector [0.8, 0.6].
- A word like "queen" might be represented as [0.7, 0.7].

**1.3.4 Practical Python Code for Vectors**

**Python Code: Basic Operations**

In [None]:
import numpy as np

# Define vectors
vector_a = np.array([2, 3])
vector_b = np.array([4, 1])

# Magnitude of a vector
magnitude_a = np.linalg.norm(vector_a)
print("Magnitude of vector_a:", magnitude_a)

# Addition of vectors
vector_sum = vector_a + vector_b
print("Sum of vector_a and vector_b:", vector_sum)

# Subtraction of vectors
vector_diff = vector_a - vector_b
print("Difference of vector_a and vector_b:", vector_diff)

# Dot product
dot_product = np.dot(vector_a, vector_b)
print("Dot product of vector_a and vector_b:", dot_product)

# Scaling a vector
scalar = 2
scaled_vector = scalar * vector_a
print("Scaled vector_a:", scaled_vector)

Magnitude of vector_a: 3.605551275463989
Sum of vector_a and vector_b: [6 4]
Difference of vector_a and vector_b: [-2  2]
Dot product of vector_a and vector_b: 11
Scaled vector_a: [4 6]


**Python Code: Cosine Similarity**

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Example vectors
vector_c = np.array([1, 0])
vector_d = np.array([0, 1])

# cosine_similarity expects 2D inputs, so each vector is wrapped in a list
similarity = cosine_similarity([vector_c], [vector_d])
print("Cosine similarity between vector_c and vector_d:", similarity[0][0])

Cosine similarity between vector_c and vector_d: 0.0
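
Under the hood, cosine similarity is the dot product of two vectors divided by the product of their magnitudes. A minimal NumPy sketch of that formula follows; the helper name `cosine` is hypothetical, and the vectors are the ones defined earlier in this section.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

vector_a = np.array([2, 3])
vector_b = np.array([4, 1])
vector_c = np.array([1, 0])
vector_d = np.array([0, 1])

print(cosine(vector_c, vector_d))  # 0.0 (perpendicular vectors)
print(cosine(vector_a, vector_a))  # approximately 1.0 (same direction)
print(round(cosine(vector_a, vector_b), 4))
```

This sketch does not guard against zero-length vectors, for which the formula is undefined.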


**1.3.5 Key Insights for NLP**
1. Vectors allow us to represent words in a way that computers can process.
2. Operations like the dot product and cosine similarity enable the measurement of relationships between words.
3. Adding and subtracting vectors can help derive new relationships (e.g., "king - man + woman ≈ queen").

---


#### **1.4 Introduction to Word Embeddings**
Traditional methods like BoW and TF-IDF fail to capture the semantic meaning of words. Word embeddings solve this by representing words as dense numerical vectors in a continuous vector space.

**Why Word Embeddings?**
- Words with similar meanings have similar vector representations.
- Vector arithmetic captures relationships like "king - man + woman ≈ queen."

**Example: Cosine Similarity of Word Vectors**
- Cosine similarity measures how similar two word vectors are in terms of direction.

**Python Code:**

In [None]:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Example word embeddings (hypothetical)
word_embeddings = {
    "king": np.array([0.8, 0.6]),
    "queen": np.array([0.7, 0.7]),
    "man": np.array([0.5, 0.3]),
    "woman": np.array([0.4, 0.5])
}

# Cosine Similarity
def calculate_cosine_similarity(word1, word2):
    vec1 = word_embeddings[word1]
    vec2 = word_embeddings[word2]
    similarity = cosine_similarity([vec1], [vec2])
    return similarity[0][0]

# Example Comparisons
print("Similarity between 'king' and 'queen':", calculate_cosine_similarity("king", "queen"))
print("Similarity between 'man' and 'woman':", calculate_cosine_similarity("man", "woman"))


Similarity between 'king' and 'queen': 0.9899494936611666
Similarity between 'man' and 'woman': 0.9374252720097653
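
The analogy "king - man + woman ≈ queen" can be tried on the same hypothetical 2D vectors. The sketch below adds one extra made-up word ("apple") so there is more than one candidate and, as is common when evaluating analogies, excludes the three input words from the search. With real embeddings the vectors would come from a trained model rather than being written by hand.

```python
import numpy as np

# Hypothetical embeddings; "apple" is an extra made-up entry for contrast
word_embeddings = {
    "king": np.array([0.8, 0.6]),
    "queen": np.array([0.7, 0.7]),
    "man": np.array([0.5, 0.3]),
    "woman": np.array([0.4, 0.5]),
    "apple": np.array([0.9, 0.1]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping inputs."""
    target = word_embeddings[a] - word_embeddings[b] + word_embeddings[c]
    candidates = {w: v for w, v in word_embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(solve_analogy("king", "man", "woman"))  # queen
```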


### **Section 1 Summary**
1. NLP bridges the gap between human language and computers.
2. Text preprocessing is essential for cleaning and preparing data.
3. BoW and TF-IDF are basic text vectorization techniques but lack semantic understanding.
4. Word embeddings provide semantic meaning, enabling better NLP applications.

## **Lecture 3: NLP for Business Applications**  

### **1. Sentiment Analysis for Customer Reviews**  
#### **Problem:**  
Businesses receive thousands of online reviews daily. Manually analyzing customer sentiment is time-consuming and impractical.  

#### **Solution:**  
NLP can automatically classify customer reviews into **positive, negative, or neutral** sentiments.  

#### **Business Impact:**  
- Helps companies understand customer satisfaction trends.  
- Identifies pain points to improve products/services.  
- Enables real-time feedback monitoring.  

#### **Example: Sentiment Classification Using Naïve Bayes**  
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sample Data
X_train = ["I love this product!", "This is the worst experience ever", "It's okay, nothing special"]
y_train = ["positive", "negative", "neutral"]

# Build Model
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

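# Prediction (this toy model only recognizes words seen in training;
# a review made up entirely of unseen words cannot be classified meaningfully)
print(model.predict(["I love this product so much"]))  # Output: ['positive']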
```

---

### **2. Resume Screening for HR Automation**  
#### **Problem:**  
Recruiters manually review hundreds of resumes to match candidates to job descriptions. This is inefficient and prone to bias.  

#### **Solution:**  
NLP can **extract skills** from resumes and calculate similarity scores between resumes and job descriptions.  

#### **Business Impact:**  
- Saves HR departments **time and effort** in recruitment.  
- Improves candidate-job **matching accuracy**.  
- Reduces human bias in resume evaluation.  

#### **Example: Resume Matching Using Cosine Similarity**  
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

job_description = "Looking for a data scientist skilled in Python, NLP, and machine learning."
resume = "Experienced data scientist with expertise in Python and NLP."

# Vectorize both texts together so they share one vocabulary
count_matrix = CountVectorizer().fit_transform([job_description, resume])
similarity = cosine_similarity(count_matrix)[0][1]

print(f"Resume Similarity Score: {similarity:.2f}")
```
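
The skill-extraction step mentioned in the solution can be sketched as simple keyword matching. The `SKILLS` list and the `extract_skills` helper below are hypothetical; a production system would more likely rely on a curated skill taxonomy or a trained entity recognizer.

```python
import re

# Hypothetical skill vocabulary for illustration only
SKILLS = ["python", "nlp", "machine learning", "sql"]

def extract_skills(text, skills=SKILLS):
    """Return the set of known skills that appear as whole words in the text."""
    lowered = text.lower()
    return {s for s in skills if re.search(r"\b" + re.escape(s) + r"\b", lowered)}

job_description = "Looking for a data scientist skilled in Python, NLP, and machine learning."
resume = "Experienced data scientist with expertise in Python and NLP."

required = extract_skills(job_description)
found = extract_skills(resume)

print("Matched skills:", sorted(required & found))  # ['nlp', 'python']
print("Missing skills:", sorted(required - found))  # ['machine learning']
print(f"Skill coverage: {len(required & found) / len(required):.2f}")  # 0.67
```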

---

### **3. Automating Customer Support with Chatbots**  
#### **Problem:**  
Customer support teams handle repetitive inquiries, leading to high costs and delays.  

#### **Solution:**  
NLP-powered chatbots **understand user queries** and **respond with predefined answers**.  

#### **Business Impact:**  
- Reduces customer support costs.  
- Provides instant responses to frequently asked questions (FAQs).  
- Enhances user experience with 24/7 availability.  

#### **Example: FAQ Chatbot Using Logistic Regression**  
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

queries = ["What are your working hours?", "How can I reset my password?", "What is your refund policy?"]
responses = ["We are open from 9 AM to 6 PM.", "Click 'Forgot Password' to reset it.", "Refunds take 5 days."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(queries)
y = np.array(responses)

model = LogisticRegression().fit(X, y)

print(model.predict(vectorizer.transform(["How do I change my password?"])))  
```
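
A classifier such as the one above always returns one of its known responses, even for an unrelated question. One possible alternative is to match the query against the stored questions by similarity and fall back to a default reply when the best match is weak. The sketch below uses Jaccard word overlap (swapped in for TF-IDF to stay dependency-free); the `FALLBACK` message and the `THRESHOLD` value are assumptions that would need tuning on real data.

```python
import re

faq = {
    "What are your working hours?": "We are open from 9 AM to 6 PM.",
    "How can I reset my password?": "Click 'Forgot Password' to reset it.",
    "What is your refund policy?": "Refunds take 5 days.",
}
FALLBACK = "Sorry, I am not sure. Let me connect you to an agent."  # hypothetical
THRESHOLD = 0.2  # arbitrary cut-off for illustration

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Share of words two token sets have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def answer(user_query):
    query_tokens = tokens(user_query)
    # Pick the stored question with the highest word overlap
    best_question = max(faq, key=lambda q: jaccard(query_tokens, tokens(q)))
    if jaccard(query_tokens, tokens(best_question)) < THRESHOLD:
        return FALLBACK
    return faq[best_question]

print(answer("How do I change my password?"))  # Click 'Forgot Password' to reset it.
print(answer("Do you sell gift cards?"))       # falls back to the default reply
```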

---

### **4. Extracting Key Information from Business Documents**  
#### **Problem:**  
Organizations need to quickly extract key insights from lengthy business reports and contracts.  

#### **Solution:**  
NLP automates **keyword extraction** and **summarization**.  

#### **Business Impact:**  
- Saves time in document review.  
- Helps decision-makers find **critical information** faster.  
- Improves compliance and contract analysis.  

#### **Example: Keyword Extraction Using TF-IDF**  
```python
from sklearn.feature_extraction.text import TfidfVectorizer

document = ["Our company specializes in AI, machine learning, and NLP solutions."]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(document)

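# With a single document every remaining term gets the same weight,
# so this simply lists the vocabulary after stop-word removal
keywords = vectorizer.get_feature_names_out()
print("Extracted Keywords:", keywords)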
```
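
Keyword ranking becomes more informative when the document is compared against other documents, because terms shared by all of them are down-weighted. The sketch below computes unnormalized TF-IDF scores by hand; the two extra sentences in `corpus` and the tiny `STOP_WORDS` set are made up, and the smoothed IDF formula is assumed to mirror scikit-learn's default.

```python
import math
import re
from collections import Counter

corpus = [
    "Our company specializes in AI, machine learning, and NLP solutions.",
    "Our company reported strong quarterly revenue.",  # made-up
    "Our company opened a new office.",                # made-up
]
STOP_WORDS = {"our", "in", "and"}  # tiny illustrative list

def tokenize(text):
    words = re.findall(r"\b\w\w+\b", text.lower())
    return [w for w in words if w not in STOP_WORDS]

docs = [tokenize(text) for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_scores(doc):
    term_freq = Counter(doc)
    return {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, tf in term_freq.items()}

scores = tfidf_scores(docs[0])
# Highest score first; ties are broken alphabetically
ranked = sorted(scores, key=lambda t: (-scores[t], t))
for term in ranked:
    print(f"{term}: {scores[term]:.3f}")
```

Here "company" lands at the bottom of the ranking because it occurs in every document, while terms unique to the first document share the top score.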

---

### **5. Spam Detection in Emails**  
#### **Problem:**  
Companies receive thousands of spam emails, leading to wasted time and security risks.  

#### **Solution:**  
NLP can **detect spam patterns** using machine learning models.  

#### **Business Impact:**  
- Improves email security.  
- Saves time by filtering out irrelevant messages.  
- Reduces exposure to phishing attacks.  

#### **Example: Spam Classification Using Naïve Bayes**  
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample emails
emails = ["Congratulations! You won a lottery.", "Meeting scheduled at 10 AM."]
labels = ["spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)

# The test email must share words with the training data;
# words the model has never seen are ignored
new_email = ["You won a prize!"]
print(model.predict(vectorizer.transform(new_email)))  # Output: ['spam']
```
### **Summary**  
✅ NLP enables **automation** and **insight extraction** in business applications.  
✅ Traditional NLP techniques like **tokenization, vectorization, and classification** solve real-world problems.  
✅ Next Step: **Deep Learning NLP (BERT, GPT)** for more **advanced** language understanding.  