<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Feature Engineering for NLP in Python</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Tokenization, Lemmatization, Cleaning, POS Tagging, and NER)</span></div>

## Table of Contents

1. [Tokenization and Lemmatization](#section-1)
2. [Text Cleaning](#section-2)
3. [Part-of-Speech (POS) Tagging](#section-3)
4. [Named Entity Recognition (NER)](#section-4)
5. [Conclusion](#section-5)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. Tokenization and Lemmatization</span><br>

### 1.1 Introduction to Text Data
Natural Language Processing (NLP) involves processing and analyzing large amounts of natural language data. Before we can feed text into machine learning models, we must perform **Feature Engineering**.

Common sources of text data include:
*   News articles
*   Tweets
*   Comments
*   Reviews

### 1.2 Making Text Machine Friendly
Raw text is often messy and inconsistent. To a machine, "Dogs" and "dog" are completely different strings, even though they represent the same concept. Similarly, "won't" and "will not" mean the same thing but look different.

**Common inconsistencies:**
*   **Case sensitivity:** `Dogs`, `dog`
*   **Word forms:** `reduction`, `REDUCING`, `Reduce`
*   **Contractions:** `don't` vs `do not`, `won't` vs `will not`

### 1.3 Text Preprocessing Techniques
To standardize text, we apply several preprocessing techniques:
1.  Converting words into lowercase.
2.  Removing leading and trailing whitespaces.
3.  Removing punctuation.
4.  Removing stopwords.
5.  Expanding contractions.
6.  Removing special characters (numbers, emojis, etc.).
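
The steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the `CONTRACTIONS` and `STOPWORDS` tables are tiny hypothetical stand-ins, and the function name `preprocess` is an assumption made for this example. Punctuation is stripped from word edges before contractions are expanded so that the apostrophe inside `don't` survives long enough to be looked up.

```python
import string

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "won't": "will not"}
STOPWORDS = {"a", "an", "the", "is", "i", "do"}


def preprocess(text):
    """Return a list of cleaned, lowercase words."""
    text = text.lower().strip()                   # steps 1 and 2
    cleaned = []
    for word in text.split():
        word = word.strip(string.punctuation)     # step 3: punctuation at word edges
        word = CONTRACTIONS.get(word, word)       # step 5: expand known contractions
        cleaned.extend(word.split())              # "do not" becomes two words
    # step 6 drops non-alphabetic words, step 4 drops stopwords
    return [w for w in cleaned if w.isalpha() and w not in STOPWORDS]


print(preprocess("  I don't like 3 Dogs!!  "))
```

Whether negations such as `not` belong in the stopword list depends on the task; this sketch keeps them because they often change the meaning of a sentence.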

### 1.4 Tokenization
Tokenization is the process of splitting a string into its constituent parts, called **tokens**. These tokens can be words, punctuation marks, or numbers.

**Example 1:**
*   Input: `"I have a dog. His name is Hachi."`
*   Tokens: `["I", "have", "a", "dog", ".", "His", "name", "is", "Hachi", "."]`

**Example 2:**
*   Input: `"Don't do this."`
*   Tokens: `["Do", "n't", "do", "this", "."]`
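
To make the two examples concrete, here is a rough regex-based tokenizer that reproduces them. It is only a sketch: the pattern and the name `tokenize` are assumptions for illustration, it handles the `n't` contraction and nothing else (`I'm` would be split into three pieces), and it is not how spaCy tokenizes internally.

```python
import re

# Alternatives are tried left to right at each position:
#   n't          the contraction suffix as its own token
#   \w+(?=n't)   the stem that precedes n't (e.g. "Do" in "Don't")
#   \w+          any other run of letters or digits
#   [^\w\s]      a single punctuation character
TOKEN_PATTERN = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, contraction, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("I have a dog. His name is Hachi."))
print(tokenize("Don't do this."))
```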

#### Tokenization using spaCy
We will use the `spaCy` library for these tasks. First, we load the English model (`en_core_web_sm`), create a document object, and iterate over it to extract tokens.



In [None]:
# Install spaCy and download the model if not already installed
# !pip install spacy
# !python -m spacy download en_core_web_sm

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "Hello! I don't know what I'm doing here."

# Create a Doc object
doc = nlp(string)

# Generate list of tokens
tokens = [token.text for token in doc]
print(tokens)



### 1.5 Lemmatization
Lemmatization is the process of converting a word into its **base form** (lemma). This helps in reducing the vocabulary size and grouping similar words together.

**Examples:**
*   `reducing`, `reduces`, `reduced` $\rightarrow$ `reduce` (the noun `reduction` is a separate word with its own lemma, `reduction`)
*   `am`, `are`, `is` $\rightarrow$ `be`
*   `n't` $\rightarrow$ `not`
*   `'ve` $\rightarrow$ `have`
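
The idea can be illustrated with a toy lookup lemmatizer. The table below is a hypothetical one built from some of the examples above, and lowercasing every token before lookup is a simplifying assumption. Real lemmatizers cover the whole vocabulary and usually take the part of speech into account.

```python
# Hypothetical lookup table; a real lemmatizer is far more complete.
LEMMA_TABLE = {
    "reducing": "reduce", "reduces": "reduce", "reduced": "reduce",
    "am": "be", "are": "be", "is": "be",
    "n't": "not", "'ve": "have",
}


def lemmatize(tokens):
    """Map each token to its lemma, falling back to the lowercased token."""
    return [LEMMA_TABLE.get(tok.lower(), tok.lower()) for tok in tokens]


tokens = ["Reducing", "reduces", "reduced", "is", "are"]
lemmas = lemmatize(tokens)
# Five distinct surface forms collapse into two distinct lemmas.
print(len(set(tokens)), len(set(lemmas)))
```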

#### Lemmatization using spaCy
In spaCy, the lemma of a token is accessed using the `token.lemma_` attribute. Note that pronoun handling depends on the version: spaCy 2.x returned the placeholder `-PRON-` as the lemma of every pronoun, whereas spaCy 3.x returns an ordinary lemma for the pronoun instead.



In [None]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "Hello! I don't know what I'm doing here."

# Create a Doc object
doc = nlp(string)

# Generate list of lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Text Cleaning</span><br>

### 2.1 Text Cleaning Techniques
Text cleaning goes beyond simple tokenization. It involves removing noise that might confuse a machine learning model.

**Key Techniques:**
*   Removing unnecessary whitespaces and escape sequences (e.g., `\n`, `\t`).
*   Removing punctuation.
*   Removing special characters (numbers, emojis).
*   Removing stopwords.
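
The first two techniques need nothing beyond the standard library. The sketch below is one possible approach, with function names chosen for this example: `str.split()` without arguments treats any run of spaces, tabs, and newlines as a single separator, and `str.translate` deletes every ASCII punctuation character.

```python
import string


def remove_punctuation(text):
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def normalize_whitespace(text):
    """Collapse runs of whitespace (including \\t and \\n) into single spaces."""
    return " ".join(text.split())


raw = "OMG!!!! This is like    the best thing ever \t\n."
print(normalize_whitespace(remove_punctuation(raw)))
```

Removing punctuation first and normalizing whitespace second avoids the stray spaces that deleted punctuation would otherwise leave behind.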

### 2.2 The `isalpha()` Method
Python's string method `.isalpha()` returns `True` only if a string is non-empty and every character in it is alphabetic, which makes it a quick filter for digits, punctuation, and mixed tokens.



In [None]:
# Examples of isalpha()
print(f"'Dog'.isalpha(): {'Dog'.isalpha()}")       # True
print(f"'3dogs'.isalpha(): {'3dogs'.isalpha()}")   # False
print(f"'12347'.isalpha(): {'12347'.isalpha()}")   # False
print(f"'!'.isalpha(): {'!'.isalpha()}")           # False
print(f"'?'.isalpha(): {'?'.isalpha()}")           # False



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip: A Word of Caution</b> <br>
Be careful when strictly removing non-alphabetic characters.
<ul>
    <li><b>Abbreviations:</b> U.S.A, U.K (dots make them non-alpha).</li>
    <li><b>Proper Nouns:</b> word2vec, xto10x (contain numbers).</li>
</ul>
For nuanced cases, you should write custom functions using <b>regex</b>.
</div>
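
As a sketch of such a custom function, the pattern below accepts plain words, dotted abbreviations, and names that mix letters with digits, while still rejecting pure numbers and punctuation. The pattern and the name `is_word` are illustrative assumptions, not an exhaustive rule; real data will usually need further cases.

```python
import re

WORD_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.)+[A-Za-z]?    # dotted abbreviation: U.S.A, U.K, U.S.A.
    | [A-Za-z][A-Za-z0-9]*      # starts with a letter, may contain digits: word2vec
    """,
    re.VERBOSE,
)


def is_word(token):
    """Return True if the whole token matches one of the accepted shapes."""
    return WORD_PATTERN.fullmatch(token) is not None


for token in ["Dog", "U.S.A", "word2vec", "3dogs", "12347", "!"]:
    print(token, is_word(token))
```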

### 2.3 Removing Non-Alphabetic Characters
Let's clean a messy string containing punctuation, escape characters, and numbers.



In [None]:
import spacy

string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

# Load model
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)

# Generate list of lemmas
lemmas = [token.lemma_ for token in doc]

# Remove tokens that are not alphabetic
# Note: We keep '-PRON-' because older spaCy models use this for pronouns. 
# Even if your model returns the actual pronoun, isalpha() usually handles it well.
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() or lemma == '-PRON-']

# Print string after text cleaning
print(' '.join(a_lemmas))



### 2.4 Stopwords
Stopwords are words that occur extremely commonly in a language but often carry little specific meaning for classification tasks (e.g., articles, be-verbs, pronouns).
*   Examples: *the, is, at, which, on*.

#### Removing Stopwords using spaCy
spaCy provides a built-in list of stopwords.



In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Get the built-in set of English stopwords
stopwords = STOP_WORDS

string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

nlp = spacy.load('en_core_web_sm')
doc = nlp(string)

lemmas = [token.lemma_ for token in doc]

# Remove stopwords and non-alphabetic tokens:
# keep only alphabetic lemmas that are NOT in the stopwords set
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma.lower() not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))



### 2.5 Other Preprocessing Techniques
Depending on your data source, you might need:
*   **Removing HTML/XML tags**: Essential for web scraping.
*   **Replacing accented characters**: Converting `é` to `e`.
*   **Correcting spelling errors**: Using libraries like `TextBlob` or `pyspellchecker`.
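
The first two items can be sketched with the standard library alone. Both functions are simplified illustrations with names chosen for this example: the tag-stripping regex is naive and only suits simple markup (a real HTML parser is more robust), and the accent removal relies on Unicode NFKD decomposition, which splits an accented letter into a base letter plus a combining mark.

```python
import re
import unicodedata


def strip_tags(text):
    """Naively remove anything that looks like an HTML/XML tag."""
    return re.sub(r"<[^>]+>", "", text)


def strip_accents(text):
    """Replace accented letters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks that the decomposition produced.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_tags("<p>Hello <b>world</b></p>"))
print(strip_accents("caf\u00e9"))
```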

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> Always use only those text preprocessing techniques that are relevant to your specific application. Over-cleaning can lead to loss of information. </div>

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. Part-of-Speech (POS) Tagging</span><br>

### 3.1 What is POS Tagging?
POS Tagging is the process of assigning every word in a text its corresponding part of speech (noun, verb, adjective, etc.).

**Example:**
Input: `"Jane is an amazing guitarist."`

*   **Jane** $\rightarrow$ proper noun
*   **is** $\rightarrow$ auxiliary verb (tagged `AUX` by current spaCy models; older models may report `VERB`)
*   **an** $\rightarrow$ determiner
*   **amazing** $\rightarrow$ adjective
*   **guitarist** $\rightarrow$ noun

### 3.2 Applications
*   **Word-sense disambiguation**:
    *   "The **bear** is a majestic animal" (Noun)
    *   "Please **bear** with me" (Verb)
*   **Sentiment analysis**: Adjectives often carry sentiment.
*   **Question answering**.
*   **Fake news and opinion spam detection**.
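
The word-sense example can be made concrete with a small lookup keyed by word and POS tag. Everything here is hypothetical: the `SENSES` table, the sense descriptions, and the function `sense_of` exist only for illustration, and the POS tag is passed in by hand where a tagger would normally supply it.

```python
# Hypothetical sense inventory keyed by (lowercased word, coarse POS tag).
SENSES = {
    ("bear", "NOUN"): "large mammal",
    ("bear", "VERB"): "tolerate or endure",
}


def sense_of(word, pos):
    """Look up the sense of a word given its part of speech."""
    return SENSES.get((word.lower(), pos), "unknown")


print(sense_of("bear", "NOUN"))  # "The bear is a majestic animal"
print(sense_of("bear", "VERB"))  # "Please bear with me"
```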

### 3.3 POS Tagging using spaCy
spaCy makes POS tagging easy. The coarse-grained tag is available via the `token.pos_` attribute; the finer-grained, language-specific tag is available via `token.tag_`.



In [None]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Initialize string
string = "Jane is an amazing guitarist"

# Create a Doc object
doc = nlp(string)

# Generate list of tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]

print(pos)



### 3.4 POS Annotations in spaCy
Below is a table of common POS tags used in spaCy (based on the Universal Dependencies scheme).

| POS | Description | Examples |
| :--- | :--- | :--- |
| **ADJ** | adjective | big, old, green, incomprehensible, first |
| **ADP** | adposition | in, to, during |
| **ADV** | adverb | very, tomorrow, down, where, there |
| **AUX** | auxiliary | is, has (done), will (do), should (do) |
| **CONJ** | conjunction (legacy tag; current models emit **CCONJ**) | and, or, but |
| **CCONJ** | coordinating conjunction | and, or, but |
| **DET** | determiner | a, an, the |
| **PROPN** | proper noun | Jane, London, Google |
| **NOUN** | noun | guitarist, dog, table |
| **VERB** | verb | run, eat, play |
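
Tag counts are a simple way to turn these annotations into numeric features, for example the number of proper nouns or adjectives in a document. The sketch below assumes a list of `(text, tag)` pairs; the sample pairs are written by hand for illustration and are not the output of a tagger run.

```python
from collections import Counter


def pos_counts(tagged_tokens):
    """Count how often each POS tag occurs in a list of (text, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)


# Hand-written pairs for illustration only.
tagged = [
    ("Jane", "PROPN"), ("is", "AUX"), ("an", "DET"),
    ("amazing", "ADJ"), ("guitarist", "NOUN"),
]
counts = pos_counts(tagged)
print(counts["PROPN"], counts["ADJ"], counts["VERB"])
```

A `Counter` returns 0 for tags that never occur, so missing tags need no special handling when building a feature vector.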

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Named Entity Recognition (NER)</span><br>

### 4.1 What is NER?
Named Entity Recognition (NER) involves identifying and classifying named entities in text into predefined categories such as persons, organizations, countries, dates, etc.

**Example:**
Input: `"John Doe is a software engineer working at Google. He lives in France."`

*   **John Doe** $\rightarrow$ PERSON
*   **Google** $\rightarrow$ ORG (organization)
*   **France** $\rightarrow$ GPE (geopolitical entity: country, city, or state)

### 4.2 Applications
*   **Efficient search algorithms**.
*   **Question answering**.
*   **News article classification**.
*   **Customer service** (routing tickets based on entities mentioned).

### 4.3 NER using spaCy
In spaCy, named entities are accessed via the `doc.ents` property. Each entity is a span of one or more tokens and exposes its `text` and its `label_`.



In [None]:
import spacy

string = "John Doe is a software engineer working at Google. He lives in France."

# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)

# Generate named entities
# We iterate over doc.ents, not tokens
ne = [(ent.text, ent.label_) for ent in doc.ents]

print(ne)



### 4.4 NER Annotations in spaCy
spaCy supports more than 15 categories of named entities. Here are some common ones:

| Type | Description |
| :--- | :--- |
| **PERSON** | People, including fictional. |
| **NORP** | Nationalities or religious or political groups. |
| **FAC** | Buildings, airports, highways, bridges, etc. |
| **ORG** | Companies, agencies, institutions, etc. |
| **GPE** | Countries, cities, states. |
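
Once entities are extracted as `(text, label)` pairs, grouping them by label is a common next step, for instance to collect every organization mentioned in a document. This sketch uses hand-written pairs as an assumption; real output depends on the model, and the function name `group_entities` is chosen for this example.

```python
from collections import defaultdict


def group_entities(entities):
    """Group entity texts by their label, preserving order of appearance."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Hand-written pairs for illustration only.
entities = [("John Doe", "PERSON"), ("Google", "ORG"), ("France", "GPE"), ("Paris", "GPE")]
print(group_entities(entities))
```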

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip: A Word of Caution</b> <br>
<ul>
    <li><b>Not Perfect:</b> NER models are probabilistic and can make mistakes.</li>
    <li><b>Data Dependency:</b> Performance depends heavily on the training data.</li>
    <li><b>Specialization:</b> For nuanced cases (e.g., medical texts, legal documents), you often need to train models with specialized data.</li>
    <li><b>Language Specific:</b> Models are specific to the language they were trained on.</li>
</ul>
</div>

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Conclusion</span><br>

In this notebook, we have covered the foundational steps of **Feature Engineering for NLP** using Python and `spaCy`.

**Key Takeaways:**
1.  **Tokenization**: Breaking text into meaningful units (words/punctuation) is the first step in understanding text structure.
2.  **Lemmatization**: Reducing words to their base forms normalizes the text and reduces vocabulary size.
3.  **Text Cleaning**: Removing noise (stopwords, punctuation, special characters) is crucial, but must be done carefully to avoid losing valuable information (like proper nouns with numbers).
4.  **POS Tagging**: Identifying grammatical roles helps in disambiguating meaning and analyzing sentiment.
5.  **NER**: Extracting real-world entities (People, Orgs, Locations) adds semantic understanding to the text processing pipeline.

**Next Steps:**
With these features extracted and cleaned, your text data is now ready for vectorization (e.g., Bag of Words, TF-IDF) and subsequent modeling in machine learning algorithms.
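
As a preview of that step, a bag-of-words representation can be sketched with the standard library. This is a bare-bones illustration under simple assumptions (documents are already cleaned and whitespace-separated, and the function name `bag_of_words` is made up for this example); in practice a library vectorizer would normally be used.

```python
from collections import Counter


def bag_of_words(documents):
    """Return a sorted vocabulary and one count vector per document."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["good song good", "bad song"])
print(vocabulary)
print(vectors)
```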
