<a href="https://colab.research.google.com/github/hgabrali/NLP-Foundations-to-Frontiers/blob/main/Natural_Language_Processing_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing: From Unstructured Data to Strategic Intelligence

## 1. Executive Summary
Natural Language Processing (NLP) serves as the critical bridge between the "noisy" reality of unstructured data (emails, social media, and clinical notes) and the strategic requirements of actionable business intelligence **(AWS)**. In an era where organizations generate massive volumes of text and voice data, NLP allows machines to interpret, manipulate, and comprehend human language to unlock significant competitive advantages **(AWS)**.

The core value proposition of modern NLP lies in its ability to transform raw text into structured knowledge through advanced preprocessing pipelines and **Graph-based Retrieval-Augmented Generation (GraphRAG)**, which enables multi-hop reasoning across complex document sets **(arXiv)**. For enterprise-scale applications, particularly within complex **Enterprise Resource Planning (ERP)** environments like **SAP S/4HANA**, GraphRAG has become essential for reasoning over configuration rules and transactional dependencies **(arXiv)**.

However, a significant strategic tension exists regarding efficiency vs. accuracy. In high-stakes sectors like healthcare and finance, the high computational costs of **Large Language Model (LLM)** based knowledge graph construction are often prohibitive. Research indicates that "industrial-grade" alternatives, such as **dependency-based knowledge graph construction**, can achieve **94%** of the performance of LLM-driven systems while remaining scalable and cost-effective **(arXiv)**. To reach this industrial-grade performance, organizations must first master the foundational technical stages required to prepare text for analysis.



---

## 2. Foundations of the NLP Preprocessing Pipeline
Text preprocessing is not merely a "cleanup" task; it is a strategic necessity for improving model performance and managing vocabulary size across diverse datasets **(Scale Events, Meegle)**. By transforming raw, noisy text into a consistent format, organizations ensure that downstream models are fed structured inputs that minimize semantic interference.

### 🔍 Evaluation of Essential Stages
* **Segmentation**: This stage partitions text into individual sentences. The primary technical challenge involves the ambiguity of punctuation; for instance, a period marks a sentence boundary but also appears in abbreviations like "Inc." **(Scale Events)**. Failure here leads to "fragmented data" that disrupts context; sentence-level units are far more amenable to syntactic parsing and enhance the performance of LLMs **(arXiv)**.
* **Tokenization**: Sentences are converted into "tokens" (individual words or phrases). This is the building block for all subsequent analysis, yet standard whitespace splitting often fails with contractions like "don't," which require specialized rules to preserve meaning **(Scale Events)**.
* **Case Normalization**: Most NLP pipelines lowercase all text for consistency, so that "The" and "the" map to a single vocabulary entry. The trade-off is that case-only distinctions, such as "Apple" (the brand) versus "apple" (the fruit), are lost **(Scale Events)**.
* **Spell Correction**: This step prevents typographical errors from diluting the vocabulary, which is essential for maintaining high recall in classification and retrieval tasks **(Scale Events)**.
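
The segmentation and tokenization stages above can be sketched with the standard library alone. This is a minimal illustration, not a production tokenizer: the `ABBREVIATIONS` set and both function names are assumptions made for the example, and a real pipeline would need a far larger abbreviation list or a trained sentence splitter.

```python
import re

# Hypothetical abbreviation list; a real pipeline would need a much larger one.
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "e.g.", "i.e."}

def split_sentences(text):
    """Split after ., ! or ? followed by whitespace, unless the word is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def tokenize(sentence):
    """Keep contractions such as "don't" together; emit punctuation as separate tokens."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

text = "Acme Inc. shipped the update. Users don't like it!"
sentences = split_sentences(text)
print(sentences)
print([tokenize(s) for s in sentences])
```

The period in "Inc." is not treated as a boundary because the preceding word is in the abbreviation set, and "don't" stays a single token because the pattern allows an apostrophe inside a word.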

### 🛡️ The "Noise" Reduction Layer
**Stop-word removal** involves filtering out frequently occurring words like "the," "is," and "of" that provide little discriminatory value for tasks like document categorization or sentiment analysis **(Scale Events, ResearchGate)**. Research indicates that removing these words reduces the feature space and enhances computational efficiency.

* **Arabic Context**: Arabic stop-word lists are expanded to include specific pronouns, days of the week, and months **(ResearchGate)**.
* **Low-Resource Languages**: Specific libraries have been developed, such as **LiHiSTO** for Hindi (containing 820 stop-words) and **LiSTOM** for Malayalam, to improve retrieval performance **(ResearchGate)**.

### 📊 Comparative Analysis: Stemming vs. Lemmatization
While both techniques reduce words to a base form, lemmatization offers a more sophisticated, context-aware approach.

| Feature | Stemming (Crude Heuristic) | Lemmatization (Morphological Analysis) |
| :--- | :--- | :--- |
| **Goal** | Chop off affixes to find a base "stem" **(Scale Events)**. | Identify the linguistically valid "lemma" or dictionary form **(Stack Overflow)**. |
| **Accuracy** | Lower; may produce non-words (e.g., naive suffix stripping turns "caring" into "car") **(Stack Overflow)**. | Higher; ensures a valid root (e.g., "caring" becomes "care") **(Stack Overflow)**. |
| **Computational Cost** | Low; faster as it uses simple rules or lookup tables **(Scale Events)**. | High; slower as it requires morphological analysis and dictionaries **(Scale Events)**. |
| **POS Awareness** | No; operates on individual words without context **(Stack Overflow)**. | Yes; uses Parts of Speech (POS) to determine meaning **(Stack Overflow)**. |

> **The "So What?" of Context Awareness**: Lemmatization is vital when word meaning depends on usage. For example, a lemmatizer can distinguish between "dove" as a noun (the bird) and "dove" as a verb (past tense of dive), whereas a stemmer would treat them identically. Similarly, it can identify that "better" has "good" as its lemma **(Stack Overflow)**.
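
To make the contrast concrete, here is a toy sketch of the two approaches. It is deliberately simplistic: `SUFFIXES`, `LEMMA_TABLE`, and both functions are hypothetical stand-ins for illustration, not how any particular library implements stemming or lemmatization.

```python
# Hypothetical suffix list for a crude stemmer.
SUFFIXES = ("ing", "ly", "ed", "es", "s")

# Hypothetical (word, part of speech) -> lemma table standing in for a real dictionary.
LEMMA_TABLE = {
    ("caring", "VERB"): "care",
    ("better", "ADJ"): "good",
    ("dove", "VERB"): "dive",
    ("dove", "NOUN"): "dove",
}

def naive_stem(word):
    """Strip the first matching suffix, with no knowledge of context or valid words."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def lookup_lemma(word, pos):
    """Return the dictionary form for a word given its part of speech."""
    return LEMMA_TABLE.get((word, pos), word)

print(naive_stem("caring"))            # suffix stripping only
print(lookup_lemma("caring", "VERB"))  # dictionary lookup with POS
print(lookup_lemma("dove", "NOUN"), lookup_lemma("dove", "VERB"))
```

The stemmer has no way to tell the two senses of "dove" apart because it never sees a part of speech, whereas the lookup returns a different lemma for each.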

---

## 3. Structural Analysis: Parsing and Information Extraction
Strategic NLP requires moving beyond "Bag of Words" approaches to understand grammatical relationships. This level of structural analysis allows systems to identify precise interactions between entities within a sentence **(arXiv)**.

### 🏗️ Contrast Parsing Methodologies
* **Constituency Parsing**: Breaks text into sub-phrases or hierarchical segments (Noun Phrases, Verb Phrases). It is most effective when extracting specific spans of text for phrase-level classification **(Stack Overflow)**.
* **Dependency Parsing**: Connects words according to binary head-dependent relations **(Stack Overflow)**. In the sentence *"The developer refactored the code,"* "refactored" is the head, "developer" is its subject, and "code" is its object **(arXiv)**.

### 🏭 Appraise Dependency Parsing in Enterprise Workflows
Research from **SAP** evaluates dependency-based knowledge graph construction as a cost-effective, "industrial-grade" alternative to LLM-driven extraction. The **DependencyExtractor** methodology follows five technical stages **(arXiv)**:

1. **Noun Phrase Extraction and Cleaning**: Identifying and normalizing the primary entities.
2. **Verb Processing**: Extracting the relation (the action) that links entities.
3. **Subject/Object Identification**: Mapping the directionality of the relationship.
4. **Special Pattern Recognition**: Handling technical syntax or domain-specific language structures.
5. **Triple Formation**: Constructing "subject-relation-object" sets for materialization in the knowledge graph.
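
The final stages (subject/object identification and triple formation) can be illustrated with a small sketch. It assumes the dependency arcs are already available: the `parse` list below is hand-written for the example sentence rather than produced by a parser, the `nsubj` and `obj` labels follow Universal Dependencies naming, and `extract_triples` is a hypothetical helper, not the DependencyExtractor implementation.

```python
from collections import namedtuple

# One arc per token: token index, word, index of its head, dependency label.
# Hand-written for illustration; in practice a dependency parser produces these.
Arc = namedtuple("Arc", "index word head label")

parse = [
    Arc(0, "The", 1, "det"),
    Arc(1, "developer", 2, "nsubj"),
    Arc(2, "refactored", 2, "ROOT"),
    Arc(3, "the", 4, "det"),
    Arc(4, "code", 2, "obj"),
]

def extract_triples(arcs):
    """Form (subject, relation, object) triples around each word that heads both a subject and an object."""
    triples = []
    for head in arcs:
        subjects = [a.word for a in arcs if a.head == head.index and a.label == "nsubj"]
        objects = [a.word for a in arcs if a.head == head.index and a.label == "obj"]
        for subj in subjects:
            for obj in objects:
                triples.append((subj, head.word, obj))
    return triples

triples = extract_triples(parse)
print(triples)
```

Noun-phrase cleaning and special pattern handling are omitted here; they would run before the triples are written to the graph.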

---

## 4. The Problem Space: Ambiguity, Noise, and Domain Jargon
The technical challenges of raw text remain a primary obstacle to enterprise-scale AI. Three primary hurdles define this space:

* **Semantic Ambiguity**: Word Sense Disambiguation is required for terms like "bat" or "right" (direction vs. legal claim). Without context, models may incorrectly correlate unrelated concepts **(AWS)**.
* **Linguistic Noise**: Digital communication involves abbreviations (e.g., "smth") and lengthened words (e.g., "hellooo"). Text Normalization is required to convert these into a **Canonical Representation** (e.g., "something" or "hello") **(Scale Events)**.
* **Domain-Specific Jargon**: General NLP rules often fail when processing scientific documents containing mathematical symbols and equations, or technical logs like **SAP Custom Code Migration (CCM)** logs **(Scale Events, arXiv)**.
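
As a rough sketch of the text normalization step described for linguistic noise, the snippet below collapses lengthened words and expands a few abbreviations. `ABBREVIATION_MAP` and the function names are assumptions for illustration; real systems rely on larger, domain-specific dictionaries.

```python
import re

# Hypothetical lookup table for chat-style abbreviations.
ABBREVIATION_MAP = {"smth": "something", "pls": "please", "thx": "thanks"}

def normalize_token(token):
    """Map a noisy token to a canonical form."""
    token = token.lower()
    # Collapse any character repeated three or more times to a single occurrence.
    token = re.sub(r"(.)\1{2,}", r"\1", token)
    return ABBREVIATION_MAP.get(token, token)

def normalize(text):
    """Normalize each whitespace-separated token and rejoin them."""
    return " ".join(normalize_token(t) for t in text.split())

print(normalize("Hellooo I need smth"))
```

The repetition rule is a crude heuristic: it handles "hellooo" but would turn "coool" into "col", so a dictionary check after collapsing is usually needed.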

---

## 5. Comparative Industry Analysis: Healthcare vs. Finance
Domain adaptation is a strategic necessity; general NLP rules often fail when applied to specialized clinical or financial market data **(arXiv, Shaip)**.

| Sector | NLP Focus Area | Strategic Objectives & Requirements |
| :--- | :--- | :--- |
| **Healthcare** | Clinical notes, Electronic Health Records (EHR), and physician dictation **(AWS, Shaip)**. | **Objectives**: Predictive diagnostics through the extraction of patterns in clinical history. <br> **Mandate**: Redaction of sensitive data for HIPAA and privacy compliance. |
| **Finance** | Earnings reports, risk flagging in contracts, and SAP Custom Code Migration (CCM) logs **(arXiv, Shaip)**. | **Objectives**: Alpha generation and risk mitigation. <br> **Mandate**: Mapping complex transactional dependencies and identifying risk flags within multi-system procurement modules. |

---

## 6. Advanced Architectures: GraphRAG and Edge Deployment
Enterprise environments are shifting toward **GraphRAG** to solve the limitations of traditional RAG in multi-hop reasoning. While standard RAG retrieves isolated snippets, GraphRAG utilizes a structured knowledge graph to enable traversal-based querying **(arXiv)**.
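
Traversal-based querying can be sketched as a breadth-first walk over stored triples. The entity and relation names below are made up for illustration and do not come from any SAP schema; `neighbors_within` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical triples; entity and relation names are invented for this example.
triples = [
    ("PurchaseOrder", "requires", "VendorMaster"),
    ("VendorMaster", "validated_by", "ComplianceCheck"),
    ("PurchaseOrder", "posts_to", "GeneralLedger"),
]

# Adjacency list: entity -> list of (relation, neighbor).
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def neighbors_within(start, hops):
    """Return every entity reachable from `start` in at most `hops` edges."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        frontier = {obj for node in frontier for _rel, obj in graph[node]} - seen
        seen |= frontier
    return seen - {start}

print(sorted(neighbors_within("PurchaseOrder", 1)))
print(sorted(neighbors_within("PurchaseOrder", 2)))
```

With one hop the query reaches only direct neighbors; with two hops it also reaches "ComplianceCheck", which a retriever returning isolated snippets about "PurchaseOrder" would not surface.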

### 📈 Evaluate Scalable GraphRAG
The SAP "Multi-model KG Construction Pipeline" contrasts two paths **(arXiv)**:
* **High-Quality Path**: Uses LLMs (GPT-4o) for extraction; accurate but slow and computationally expensive.
* **Lightweight Path**: Uses a dependency-parser-based builder. It maintains **94% of the performance** of the LLM path in context precision while using an architectural stack involving **iGraph** (in-memory graph store) and **Milvus** (Vector DB). Mechanisms like **One-hop traversal** and **Reciprocal Rank Fusion (RRF)** ensure low-latency, high-recall query performance.
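
Reciprocal Rank Fusion combines several ranked result lists by giving each document a score of 1 / (k + rank) per list and summing. The sketch below assumes two hypothetical result lists (one from vector search, one from graph traversal) and uses k = 60, a commonly used default; the constant used in the SAP pipeline is not stated here.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores the sum of 1 / (k + rank), with rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical result lists; the document IDs are placeholders.
vector_hits = ["doc_a", "doc_b", "doc_c"]
graph_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, graph_hits])
print(fused)
```

Documents that appear near the top of both lists rise above documents that appear in only one, without requiring the two retrievers to produce comparable raw scores.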

### 📱 Optimization for the Edge
Deploying NLP on resource-constrained mobile or IoT devices requires three core optimization techniques **(ICMLAS 2025)**:

1. **Pruning**: Eliminating redundant parameters to lower memory demands.
2. **Quantization**: Reducing precision (e.g., 32-bit to 8-bit integers) to decrease computational overhead.
3. **Knowledge Distillation**: Training "student" models to replicate "teacher" model behavior.
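
The first two techniques can be illustrated on a plain list of weights. This is a simplified sketch (symmetric per-tensor quantization and global magnitude pruning) with made-up weight values; it is not the procedure evaluated in the cited study, and knowledge distillation is not covered.

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * fraction)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

weights = [0.82, -0.05, 0.31, -1.27, 0.002, 0.64]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
pruned = prune_by_magnitude(weights, fraction=0.5)

print(quantized)
print(pruned)
```

Each restored weight differs from the original by at most half a quantization step, which is the precision given up in exchange for storing one byte per weight instead of four.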

#### **Quantified Impact of Optimization (ICMLAS 2025):**
* **Model Size**: Pruning can reduce model size by **60%** (e.g., from 500 MB to 200 MB).
* **Inference Speed**: Pruning can decrease inference time to **125 ms**.
* **Memory Efficiency**: Quantization reduces the memory footprint by **50%** while maintaining **91.2% accuracy**.
* **Training Time**: Knowledge distillation can halve training duration (from 30 hours to 15 hours).

---

## 7. Future Trajectories and Ethical Considerations
The NLP market is accelerating on a trajectory toward a projected value of **$39.37 billion by 2025 (Shaip)**. Industrial analysis indicates several defining trends:

* **Real-Time Translation**: Breaking language barriers with up to **98% accuracy** in spoken and written formats **(Shaip)**.
* **Emotional Intelligence**: Moving beyond sentiment to detect complex states like frustration, joy, or sarcasm in customer interactions **(Shaip)**.
* **Multilingual Support**: Google's **Universal Speech Model (USM)** aims to cover 1,000 languages, currently supporting over **400 languages (Shaip)**.
* **Market Dominance**: North America currently leads the global market with a **30.7% revenue share (Shaip)**.

> **The Ethical Mandate**: As NLP matures, "Ethical AI" is becoming a priority. Organizations face rising mandates to disclose training data sources to mitigate hallucinations and algorithmic biases in sensitive areas like hiring or lending **(Shaip)**.

### 🎯 Conclusion
Natural Language Processing has evolved from basic text cleanup into the essential engine for the **2025 "AI Era,"** enabling a sophisticated, structured understanding of the human world.

# **Reading in Text Data From Various File Formats**


**1. Reading Data from a CSV File:**

In [30]:
import pandas as pd

# Create a dummy DataFrame with a 'Review Text' column
dummy_data = {
    'Review Text': [
        'This product is amazing! I love it.',
        'It was okay, but could be better.',
        'Absolutely terrible, a complete waste of money.',
        'Good value for the price.',
        'Not bad, but not great either.'
    ]
}
dummy_df = pd.DataFrame(dummy_data)

# Save the dummy DataFrame to 'reviews.csv'
dummy_df.to_csv('reviews.csv', index=False)

print("Dummy 'reviews.csv' created successfully.")

Dummy 'reviews.csv' created successfully.
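
The cell above writes the CSV file but does not read it back. A minimal sketch of the read step follows, using the standard-library `csv` module as an alternative to pandas (`pd.read_csv` is used for the same purpose in the Hands-on Practice section). The file is written to a temporary directory so the example is self-contained; the two rows are a shortened copy of the dummy data.

```python
import csv
import os
import tempfile

# Write a small CSV to a temporary directory so the example is self-contained.
rows = [
    {"Review Text": "This product is amazing! I love it."},
    {"Review Text": "It was okay, but could be better."},
]
csv_path = os.path.join(tempfile.mkdtemp(), "reviews.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Review Text"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back and collect the "Review Text" column.
with open(csv_path, newline="", encoding="utf-8") as f:
    reviews = [row["Review Text"] for row in csv.DictReader(f)]

for review in reviews:
    print(review)
```

The second review contains a comma; the writer quotes the field and the reader restores it intact, which is why a CSV parser is preferable to splitting lines on commas.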


**2. Reading Data from a Plain Text File:**

In [48]:
# Create a dummy comments.txt file
file_path = "comments.txt"
with open(file_path, "w") as file:
    file.write("Loved the service!\n")
    file.write("Would not recommend.\n")
    file.write("Amazing experience overall.\n")

print("Dummy 'comments.txt' created successfully.")

Dummy 'comments.txt' created successfully.


In [49]:
# Read the text file
file_path = "comments.txt"
with open(file_path, "r") as file:
    lines = file.readlines()

# Print each line
for line in lines:
    print(line.strip())

Loved the service!
Would not recommend.
Amazing experience overall.


**3. Reading Data from a JSON File:**

In [50]:
[
    {"id": 1, "text": "This is amazing!"},
    {"id": 2, "text": "Not satisfied with the product."},
    {"id": 3, "text": "Would buy again!"}
]

[{'id': 1, 'text': 'This is amazing!'},
 {'id': 2, 'text': 'Not satisfied with the product.'},
 {'id': 3, 'text': 'Would buy again!'}]

In [51]:
import json

json_content = [
    {"id": 1, "text": "This is amazing!"},
    {"id": 2, "text": "Not satisfied with the product."},
    {"id": 3, "text": "Would buy again!"}
]

file_path = "data.json"
with open(file_path, "w") as file:
    json.dump(json_content, file, indent=4)

print("Dummy 'data.json' created successfully.")

Dummy 'data.json' created successfully.


In [52]:
import json

# Read the JSON file
file_path = "data.json"
with open(file_path, "r") as file:
    data = json.load(file)

# Extract and print text values
for record in data:
    print(record["text"])

This is amazing!
Not satisfied with the product.
Would buy again!


In [34]:
# Reading a .txt file


# Create a dummy data.txt file
file_path = "data.txt"
with open(file_path, "w") as file:
    file.write("This is the first line.\n")
    file.write("This is the second line.\n")
    file.write("And this is the third line.")

print("Dummy 'data.txt' created successfully.")

# Open the text file
file_path = "data.txt"
with open(file_path, "r") as file:
    # Read lines from the file
    lines = file.readlines()

# Print each line after stripping whitespace
for line in lines:
    print(line.strip())

Dummy 'data.txt' created successfully.
This is the first line.
This is the second line.
And this is the third line.


# **Common Preprocessing Techniques**

**1. Lowercasing:**

In [35]:
# Sample text
text = "Natural Language Processing is AMAZING!"

# Convert to lowercase
cleaned_text = text.lower()
print(cleaned_text)

natural language processing is amazing!


**2. Punctuation Removal:**

In [36]:
import re

# Sample text
text = "Hello, world! Welcome to NLP."

# Remove punctuation using regex
cleaned_text = re.sub(r"[^\w\s']", "", text)
print(cleaned_text)

Hello world Welcome to NLP


* The **re.sub()** function replaces every match of the pattern with an empty string.
* The regex pattern **[^\w\s']** matches any character that is not a word character **(\w)**, whitespace **(\s)**, or an apostrophe **(')**, so apostrophes in contractions such as "don't" are kept.

**3. Removing Extra Whitespaces:**

* The **.split()** method splits the text into words by whitespace.
* The **" ".join()** method reassembles the words into a single string, removing extra spaces.


In [37]:
# Sample text
text = "  This   is   a   sentence   with   extra   spaces.   "

# Remove extra whitespaces between the words
cleaned_text = " ".join(text.split())

print(cleaned_text)

This is a sentence with extra spaces.


**4. Removing Numbers:**

* The regex pattern **\d+** matches one or more digits in the text.
* **re.sub()** replaces the matched digits with an empty string.

In [38]:
import re

# Sample text
text = "The price is 100 dollars."

# Remove numbers using regex
cleaned_text = re.sub(r"\d+", "", text)
print(cleaned_text)

The price is  dollars.


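**5. Removing Specific Words:**

* The regex pattern **\b(and|or|but)\b** matches the whole words "and", "or", and "but". The match is case-sensitive, so "And" is left in place unless **flags=re.IGNORECASE** is passed.
* After removing the words, we use the **split()** and **join()** methods to clean up extra spaces. The surrounding quotation marks are not part of the match, so they remain in the output.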

In [39]:
import re

# Sample text
text = "Stop words like 'and', 'or', and 'but' can be removed."

# Replace specific words
cleaned_text = re.sub(r"\b(and|or|but)\b", "", text)
cleaned_text = " ".join(cleaned_text.split())
print(cleaned_text)

Stop words like '', '', '' can be removed.


**Challenge 1: Lowercasing and Punctuation Removal**


**Task:** Write a function clean_text() that takes a string and returns it in lowercase with punctuation removed.

In [40]:
import re

def clean_text(text):
    """Clean text by lowercasing and removing punctuation."""
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r"[^\w\s']", "", text)
    return text

# Test the function
sample_text = "Hello, NLP World!"
print(clean_text(sample_text))

hello nlp world


**Challenge 2: Removing Numbers and Extra Whitespaces**

**Task:** Write a function clean_text_numbers_spaces() that removes numbers and extra whitespaces from a string.

In [41]:
import re

def clean_text_numbers_spaces(text):
    """Clean text by removing numbers and extra spaces."""
    # Remove numbers
    text = re.sub(r"\d+", "", text)
    # Remove extra spaces
    text = " ".join(text.split())
    return text

# Test the function
sample_text = "This 123 text   has 456 extra spaces and 789 numbers."
print(clean_text_numbers_spaces(sample_text))

This text has extra spaces and numbers.


# Bag of Words: Implementation

In [42]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
documents = ["I love programming.", "Programming is fun."]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert the result to an array and print
print("Vocabulary:", vectorizer.get_feature_names_out())  # Get the vocabulary
print("BoW Matrix:\n", X.toarray())  # Display the document-term matrix

Vocabulary: ['fun' 'is' 'love' 'programming']
BoW Matrix:
 [[0 0 1 1]
 [1 1 0 1]]
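
To show what the vectorizer is doing, the same matrix can be rebuilt with the standard library. This is a sketch that mirrors `CountVectorizer`'s default behavior (lowercasing and keeping tokens of two or more word characters, which is why "I" is dropped); the `tokenize` helper is defined here for illustration.

```python
import re
from collections import Counter

def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

documents = ["I love programming.", "Programming is fun."]
tokenized = [tokenize(doc) for doc in documents]

# Vocabulary: sorted unique tokens across all documents.
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per vocabulary term, each cell a raw count.
bow_matrix = [[Counter(toks)[term] for term in vocabulary] for toks in tokenized]

print(vocabulary)
print(bow_matrix)
```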


# TF-IDF: Implementation

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
documents = ["I love programming.", "Programming is fun."]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert the result to an array and print
print("Vocabulary:", vectorizer.get_feature_names_out())  # Get the vocabulary
print("TF-IDF Matrix:\n", X.toarray())  # Display the TF-IDF matrix

Vocabulary: ['fun' 'is' 'love' 'programming']
TF-IDF Matrix:
 [[0.         0.         0.81480247 0.57973867]
 [0.6316672  0.6316672  0.         0.44943642]]
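
The numbers above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The `tfidf` function is written for illustration and is not scikit-learn's implementation.

```python
import math
import re
from collections import Counter

def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

def tfidf(documents):
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts documents containing the term.
    idf = {}
    for term in vocabulary:
        df = sum(term in toks for toks in tokenized)
        idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(v * v for v in row))  # L2 norm of the row
        matrix.append([v / norm if norm else 0.0 for v in row])
    return vocabulary, matrix

vocabulary, matrix = tfidf(["I love programming.", "Programming is fun."])
for row in matrix:
    print([round(v, 8) for v in row])
```

"programming" appears in both documents, so its IDF is 1.0 while "love", "is", and "fun" each get ln(3/2) + 1; after normalization this is why "programming" ends up with the smallest non-zero weight in each row.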


# 📊 Bag of Words vs. TF-IDF: A Technical Comparison

In modern Natural Language Processing (NLP), choosing the right vectorization strategy is critical for model performance. This table provides a detailed comparative analysis between the **Bag of Words (BoW)** model and **Term Frequency-Inverse Document Frequency (TF-IDF)**.

---

| Aspect | Bag of Words (BoW) | TF-IDF |
| :--- | :--- | :--- |
| **Word Frequency** | Utilizes raw word counts for vector representation. | Employs term frequency adjusted by the inverse document frequency. |
| **Order of Words** | Ignores word order and position within the text. | Ignores word order and position within the text. |
| **Focus** | Concentrates on the raw occurrence of words within an individual document. | Concentrates on the statistical significance of words within a broader corpus. |
| **Handling Common Words** | Weights every word by its raw count, so frequent but uninformative terms like "the" can dominate the vector. | Down-weights terms that appear in many documents and highlights rarer, more distinctive terms. |
| **Use Case** | Optimal for fundamental text analysis tasks where raw frequency is the primary metric. | Superior for tasks requiring sophisticated importance ranking, such as search engine indexing. |

---


### 💡 Key Takeaway
While **Bag of Words** is efficient for simple classification, **TF-IDF** usually gives a more informative representation because it down-weights terms that appear across most documents. Neither method captures word order or meaning.

**Challenge 1: Implement Bag of Words**

In [44]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
documents = ["Machine learning is fun.", "Deep learning is a subset of machine learning."]
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Display the vocabulary and the BoW matrix
print("Vocabulary:", vectorizer.get_feature_names_out())  # Get the vocabulary
# BoW Matrix:
print(X.toarray())  # Display the document-term matrix

Vocabulary: ['deep' 'fun' 'is' 'learning' 'machine' 'of' 'subset']
[[0 1 1 1 1 0 0]
 [1 0 1 2 1 1 1]]


**Challenge 2: Implement TF-IDF**

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
documents = ["Machine learning is fun.", "Deep learning is a subset of machine learning."]
# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Display the vocabulary and the TF-IDF matrix
print("Vocabulary:", vectorizer.get_feature_names_out())  # Get the vocabulary
# TF-IDF Matrix:
print(X.toarray())  # Display the TF-IDF matrix

Vocabulary: ['deep' 'fun' 'is' 'learning' 'machine' 'of' 'subset']
[[0.         0.63009934 0.44832087 0.44832087 0.44832087 0.
  0.        ]
 [0.40697968 0.         0.2895694  0.57913879 0.2895694  0.40697968
  0.40697968]]


**Challenge 3: Comparing Bag of Words and TF-IDF**

In [46]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample data
documents = ["Data science is exciting!", "Data science requires programming knowledge.", "Programming is essential for data science."]

# Initialize the CountVectorizer and TfidfVectorizer
vectorizer_bow = CountVectorizer()
vectorizer_tfidf = TfidfVectorizer()

# Fit and transform the documents using both vectorizers
X_bow = vectorizer_bow.fit_transform(documents)
X_tfidf = vectorizer_tfidf.fit_transform(documents)

# Display the results
# BoW Matrix:
print(X_bow.toarray())
# TF-IDF Matrix:
print(X_tfidf.toarray())

[[1 0 1 0 1 0 0 0 1]
 [1 0 0 0 0 1 1 1 1]
 [1 1 0 1 1 0 1 0 1]]
[[0.39148397 0.         0.66283998 0.         0.50410689 0.
  0.         0.         0.39148397]
 [0.32630952 0.         0.         0.         0.         0.55249005
  0.42018292 0.55249005 0.32630952]
 [0.30083189 0.50935267 0.         0.50935267 0.38737583 0.
  0.38737583 0.         0.30083189]]


# **Hands-on Practice**

**Challenge 1: Reading Text Data from Multiple File Formats**

**Task:**

Read in text data from three file formats:
* A CSV file that contains a column with text data. Extract the text from the column named "Review Text".
* A TXT file with multiple lines. Read the lines and print each one.
* A JSON file where the "text" key contains the text data. Extract and print the text for each entry.

In [65]:
import pandas as pd
import json

# Create a dummy sample.txt file
file_path_txt = "sample.txt"
with open(file_path_txt, "w") as file:
    file.write("This is the first line of sample text.\n")
    file.write("This is the second line.\n")
    file.write("And this is the third line.")

print("Dummy 'sample.txt' created successfully.")

# Create a dummy sample.json file
json_content = [
    {"id": 1, "text": "This is JSON text 1."},
    {"id": 2, "text": "This is JSON text 2."}
]
file_path_json_create = "sample.json"
with open(file_path_json_create, "w") as file:
    json.dump(json_content, file, indent=4)
print("Dummy 'sample.json' created successfully.")

# Read in the CSV file and extract the "Review Text" column
file_path_csv = "reviews.csv"
data_csv = pd.read_csv(file_path_csv)  # Read the CSV file
reviews = data_csv["Review Text"]  # Extract the "Review Text" column
print("\n--- Reading from reviews.csv ---")
print(reviews)

# Read in the TXT file and print each line
print("\n--- Reading from sample.txt ---")
with open(file_path_txt, "r") as txt_file:
    lines = txt_file.readlines()  # Use readlines() to get all lines
    for line in lines:
        print(line.strip())

# Read in the JSON file and extract the text data from the "text" key
print("\n--- Reading from sample.json ---")
file_path_json_read = "sample.json"
with open(file_path_json_read, "r") as json_file:
    data_json = json.load(json_file)
    for entry in data_json:
        print(entry["text"]) # Access the "text" key

Dummy 'sample.txt' created successfully.
Dummy 'sample.json' created successfully.

--- Reading from reviews.csv ---
0                This product is amazing! I love it.
1                  It was okay, but could be better.
2    Absolutely terrible, a complete waste of money.
3                          Good value for the price.
4                     Not bad, but not great either.
Name: Review Text, dtype: object

--- Reading from sample.txt ---
This is the first line of sample text.
This is the second line.
And this is the third line.

--- Reading from sample.json ---
This is JSON text 1.
This is JSON text 2.


In [64]:
import pandas as pd

# Create a dummy DataFrame with a 'Review Text' column
dummy_data = {
    'Review Text': [
        'This product is amazing! I love it.',
        'It was okay, but could be better.',
        'Absolutely terrible, a complete waste of money.',
        'Good value for the price.',
        'Not bad, but not great either.'
    ]
}
dummy_df = pd.DataFrame(dummy_data)

# Save the dummy DataFrame to 'reviews.csv'
dummy_df.to_csv('reviews.csv', index=False)

print("Dummy 'reviews.csv' created successfully.")

Dummy 'reviews.csv' created successfully.


**Challenge 2: String Operations for Text Cleaning**

**Task:**

Given a sample text, clean it by:

* Removing punctuation using the re library.
* Converting to lowercase.
* Stripping extra spaces.

In [66]:
import re

sample_text = "  Hello, World!! NLP is amazing, isn't it?  "

# Remove punctuation using regex
cleaned_text = re.sub(r"[^\w\s]", "", sample_text)

# Convert to lowercase
text_lowercase = cleaned_text.lower()

# Collapse repeated spaces and strip leading/trailing whitespace
final_text = " ".join(text_lowercase.split())
print(final_text)

hello world nlp is amazing isnt it


**Challenge 3: Tokenization and Stop-Words Removal**

**Task:**

Tokenize the given text and remove stop words using NLTK.

In [69]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")  # Tokenizer data required by word_tokenize in recent NLTK releases

sample_text = "NLP is a fascinating field of study with diverse applications."

# Tokenize the text
tokens = word_tokenize(sample_text)

# Remove stop words
stop_words = set(stopwords.words("english"))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

['NLP', 'fascinating', 'field', 'study', 'diverse', 'applications', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


**Challenge 4: Stemming and Lemmatization**

**Task:**

Perform stemming and lemmatization on a list of words. Specifying the part of speech for each word can improve lemmatization accuracy, but for this challenge use the lemmatizer's default configuration.

In [70]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

words = ["running", "flies", "better", "easily", "happiest"]

# Stem the words
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

# Lemmatize the words
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in words]
print("Stems:", stems)
print("Lemmas:", lemmas)

[nltk_data] Downloading package wordnet to /root/nltk_data...


Stems: ['run', 'fli', 'better', 'easili', 'happiest']
Lemmas: ['running', 'fly', 'better', 'easily', 'happiest']


To see how part-of-speech information improves lemmatization, the example below tags each word and passes the corresponding WordNet part of speech to the lemmatizer:

In [72]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')  # Needed for POS tagging
nltk.download('averaged_perceptron_tagger_eng') # Download the specific English tagger

words = ["running", "flies", "better", "easily", "happiest"]

# Stem the words
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

# Lemmatize with POS tagging
lemmatizer = WordNetLemmatizer()

# Function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun

# Get POS tags for each word.
# Note: pos_tag treats the list as one sentence, so tags for isolated words are approximate.
pos_tags = nltk.pos_tag(words)

# Lemmatize with POS tags
lemmas = []
for word, tag in pos_tags:
    wn_pos = get_wordnet_pos(tag)
    lemmas.append(lemmatizer.lemmatize(word, pos=wn_pos))

print("Stems:", stems)
print("Lemmas:", lemmas)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


Stems: ['run', 'fli', 'better', 'easili', 'happiest']
Lemmas: ['run', 'fly', 'well', 'easily', 'happiest']
