<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>INTRODUCTION TO NLP FEATURE ENGINEERING</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(FEATURE ENGINEERING FOR NLP IN PYTHON)</span></div>

## Table of Contents

1. [Introduction to Numerical and Textual Data](#section-1)
2. [One-Hot Encoding](#section-2)
3. [Text Pre-processing and Vectorization](#section-3)
4. [Basic Feature Concepts (POS & NER)](#section-4)
5. [Implementing Basic Feature Extraction](#section-5)
6. [Advanced Feature Extraction: Hashtags and Mentions](#section-6)
7. [Readability Tests](#section-7)
8. [Conclusion](#section-8)

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. INTRODUCTION TO NUMERICAL AND TEXTUAL DATA</span><br>

### Numerical Data
Machine learning models generally require numerical input. Standard datasets, like the famous Iris dataset, come in a format where features are already numerical.

**Iris Dataset Example:**

| sepal length | sepal width | petal length | petal width | class |
| :--- | :--- | :--- | :--- | :--- |
| 6.3 | 2.9 | 5.6 | 1.8 | Iris-virginica |
| 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 5.6 | 2.9 | 3.6 | 1.3 | Iris-versicolor |
| 6.0 | 2.7 | 5.1 | 1.6 | Iris-versicolor |
| 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |

### Textual Data
However, Natural Language Processing (NLP) deals with unstructured text. Before we can feed this into a model, we must perform feature engineering to convert text into numbers.

**Movie Review Dataset Example:**

| review | class |
| :--- | :--- |
| This movie is for dog lovers. A very poignant... | positive |
| The movie is forgettable. The plot lacked... | negative |
| A truly amazing movie about dogs. A gripping... | positive |

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. ONE-HOT ENCODING</span><br>

When dealing with categorical data (like gender, city, or product category), we cannot feed the raw strings to a model. One common technique for converting these categories into numbers is **one-hot encoding**, which creates one binary column per category.

### Conceptual Transformation

**Original Data:**
| sex |
| :--- |
| female |
| male |
| female |
| male |
| female |

**One-Hot Encoded Data:**
| sex | sex_female | sex_male |
| :--- | :--- | :--- |
| female | 1 | 0 |
| male | 0 | 1 |
| female | 1 | 0 |
| male | 0 | 1 |
| female | 1 | 0 |

### Implementation with Pandas
We can use the `pandas` library to perform this transformation automatically.



In [None]:
# Import the pandas library
import pandas as pd

# Create a sample dataframe to demonstrate
data = {'sex': ['female', 'male', 'female', 'male', 'female']}
df = pd.DataFrame(data)

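# Perform one-hot encoding on the 'sex' feature of df.
# dtype=int yields 1/0 columns; recent pandas versions default to True/False.
# Note that the original 'sex' column is replaced by the encoded columns.
df_encoded = pd.get_dummies(df, columns=['sex'], dtype=int)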

# Display the result
print(df_encoded)



***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. TEXT PRE-PROCESSING AND VECTORIZATION</span><br>

Before converting text to numbers, we often clean or "pre-process" the data to standardize it.

### Common Pre-processing Steps

1.  **Converting to lowercase:**
    *   Example: `Reduction` $\rightarrow$ `reduction`
2.  **Converting to base form (Stemming/Lemmatization):**
    *   Example: `reducing` $\rightarrow$ `reduce`
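
The two steps above can be sketched in plain Python. This is only an illustration, not a real lemmatizer: `LEMMA_LOOKUP` and `preprocess` are hypothetical names, and the tiny lookup table stands in for what a library such as spaCy or NLTK would normally provide.

```python
# Hypothetical lookup table standing in for a real lemmatizer.
LEMMA_LOOKUP = {
    "movies": "movie",
    "dogs": "dog",
    "lovers": "lover",
    "is": "be",
}


def preprocess(text):
    """Lowercase the text, then map each word to its base form if known."""
    # Step 1: convert to lowercase and split on whitespace
    words = text.lower().split()
    # Step 2: replace each word with its base form when the table has one
    return [LEMMA_LOOKUP.get(word, word) for word in words]


print(preprocess("This movie is for Dog Lovers"))
```

Words missing from the table pass through unchanged, which is why a dictionary-based approach like this only works for a small, known vocabulary.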

### Vectorization
Vectorization converts each document into a numerical vector. Two common approaches are counting word occurrences (bag-of-words) and weighting those counts with TF-IDF.

**Conceptual Vectorization Matrix:**

| 0 | 1 | 2 | ... | n | class |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.03 | 0.71 | 0.00 | ... | 0.22 | positive |
| 0.45 | 0.00 | 0.03 | ... | 0.19 | negative |
| 0.14 | 0.18 | 0.00 | ... | 0.45 | positive |
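
As a rough sketch of the simplest variant, the snippet below builds a count-based (bag-of-words) matrix using only the standard library. The toy corpus, `vocabulary`, and `count_vector` are illustrative assumptions. Plain counts are integers; fractional values like those in the table above would come from a weighting scheme such as TF-IDF.

```python
from collections import Counter

# Toy corpus (shortened, illustrative reviews)
docs = [
    "a movie for dog lovers",
    "the movie is forgettable",
    "an amazing movie about dogs",
]

# Build a sorted vocabulary: one column per distinct word
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})


def count_vector(doc, vocabulary):
    """Return word counts for `doc`, ordered like `vocabulary`."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]


# One row per document, one column per vocabulary word
matrix = [count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, so documents of different lengths become vectors of a fixed size.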

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. BASIC FEATURE CONCEPTS (POS & NER)</span><br>

Apart from raw vectorization, we can extract linguistic features from the text.

### Basic Features List
*   Number of words
*   Number of characters
*   Average length of words
*   Special tweet features (hashtags, mentions)

### Part-of-Speech (POS) Tagging
POS tagging assigns a grammatical category to each word.

| Word | POS |
| :--- | :--- |
| I | Pronoun |
| have | Verb |
| a | Article |
| dog | Noun |

### Named Entity Recognition (NER)
NER identifies proper nouns and classifies them into categories like Person, Organization, or Country.

| Noun | NER |
| :--- | :--- |
| Brian | Person |
| DataCamp | Organization |
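
One possible way to obtain POS tags and named entities is spaCy. The sketch below assumes spaCy and its small English model `en_core_web_sm` are installed; the helper name `analyze` is hypothetical, and the function returns `None` when either dependency is missing. spaCy reports universal POS tags (for example `PRON`, `VERB`, `DET`, `NOUN`), so the labels differ slightly from the tables above.

```python
def analyze(text, model_name="en_core_web_sm"):
    """Return (pos_tags, entities) for `text`, or None if spaCy is unavailable."""
    try:
        import spacy
        nlp = spacy.load(model_name)
    except (ImportError, OSError):
        # ImportError: spaCy is not installed; OSError: the model is missing
        return None

    doc = nlp(text)
    # Pair each token with its coarse part-of-speech tag
    pos_tags = [(token.text, token.pos_) for token in doc]
    # Pair each detected entity span with its entity label
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pos_tags, entities


result = analyze("Brian works at DataCamp.")
if result is None:
    print("spaCy or the en_core_web_sm model is not installed.")
else:
    pos_tags, entities = result
    print(pos_tags)
    print(entities)
```

The entities a statistical model finds depend on the model version, so the output for a given sentence is not guaranteed to match the NER table exactly.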

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. IMPLEMENTING BASIC FEATURE EXTRACTION</span><br>

In this section, we will write Python code to extract basic features from text strings.

### 5.1 Number of Characters
The built-in `len` function returns the number of characters in a string, counting spaces and punctuation.



In [None]:
# Compute the number of characters
text = "I don't know."
num_char = len(text)

# Print the number of characters
print(f"Text: {text}")
print(f"Character Count: {num_char}")

# --- Applying to a DataFrame ---
import pandas as pd
df = pd.DataFrame({'review': ["I don't know.", "I love Python.", "Data science is fun."]})

# Create a 'num_chars' feature
df['num_chars'] = df['review'].apply(len)
print("\nDataFrame with Character Counts:")
print(df)



### 5.2 Number of Words
To count words, we typically split the string on whitespace. Punctuation stays attached to the neighboring word, so `lamb.` counts as a single word.



In [None]:
# Split the string into words
text = "Mary had a little lamb."
words = text.split()

# Print the list containing words
print(f"Words list: {words}")

# Print number of words
print(f"Word count: {len(words)}")

# --- Function Implementation ---
def word_count(string):
    # Split the string into words
    words = string.split()
    # Return length of words list
    return len(words)

# Create num_words feature in df
df['num_words'] = df['review'].apply(word_count)
print("\nDataFrame with Word Counts:")
print(df)



### 5.3 Average Word Length
This feature can indicate the complexity of the vocabulary used.



In [None]:
# Function that returns average word length
def avg_word_length(x):
    # Split the string into words
    words = x.split()

    # Avoid division by zero if the string is empty or only whitespace
    if not words:
        return 0.0

    # Compute the length of each word (attached punctuation is counted)
    word_lengths = [len(word) for word in words]

    # Return the average word length
    return sum(word_lengths) / len(words)

# Create a new feature avg_word_length
df['avg_word_length'] = df['review'].apply(avg_word_length)
print("\nDataFrame with Average Word Length:")
print(df)



***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. ADVANCED FEATURE EXTRACTION: HASHTAGS AND MENTIONS</span><br>

For social media data (like tweets), platform-specific tokens such as hashtags (`#`) and mentions (`@`) can be informative features.

### Extracting Hashtags



In [None]:
# Function that returns number of hashtags
def hashtag_count(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of hashtags ('#' followed by at least one character)
    hashtags = [word for word in words if word.startswith('#') and len(word) > 1]
    
    # Return number of hashtags
    return len(hashtags)

# Test the function
tweet = "@janedoe This is my first tweet! #FirstTweet #Happy"
count = hashtag_count(tweet)

print(f"Tweet: {tweet}")
print(f"Hashtag Count: {count}")
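
Mentions can be counted the same way as hashtags. The sketch below mirrors the hashtag function; `mention_count` is a name chosen here for illustration, and the check treats any whitespace-separated token that starts with `@` and has at least one more character as a mention.

```python
def mention_count(string):
    """Count whitespace-separated tokens that look like @mentions."""
    # Split the string into words
    words = string.split()
    # Keep tokens that start with '@' and have at least one more character
    mentions = [word for word in words if word.startswith('@') and len(word) > 1]
    # Return number of mentions
    return len(mentions)


tweet = "@janedoe This is my first tweet! #FirstTweet #Happy"
print(f"Mention Count: {mention_count(tweet)}")
```

Applied to a DataFrame column of tweets, this function could be passed to `apply` just like `word_count` earlier.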



### Other Potential Features
Beyond hashtags, you can extract:
*   **Number of sentences**
*   **Number of paragraphs**
*   **Words starting with an uppercase**
*   **All-capital words**
*   **Numeric quantities**
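
A rough sketch of how the features listed above might be computed with the standard library follows. The function name `extra_features`, the dictionary keys, and the regular expressions are assumptions made for illustration; real text (abbreviations, ellipses, unusual numbers) would need more careful rules.

```python
import re


def extra_features(text):
    """Return a dict of rough counts for a few simple text features."""
    words = text.split()
    # Strip surrounding punctuation so "WOW!" is tested as "WOW"
    stripped = [word.strip(".,!?;:\"'()") for word in words]
    return {
        # Sentences: ., ! or ? followed by whitespace or end of text (crude)
        "num_sentences": len(re.findall(r"[.!?]+(?=\s|$)", text)),
        # Paragraphs: non-empty blocks separated by one or more blank lines
        "num_paragraphs": len([p for p in re.split(r"\n\s*\n", text) if p.strip()]),
        # Words whose first character is an uppercase letter
        "num_capitalized": sum(1 for w in stripped if w[:1].isupper()),
        # Words written entirely in capitals, at least two characters long
        "num_all_caps": sum(1 for w in stripped if len(w) > 1 and w.isupper()),
        # Numeric quantities such as 3 or 4.5
        "num_numeric": len(re.findall(r"\b\d+(?:\.\d+)?\b", text)),
    }


sample = "WOW! I watched 3 movies today.\n\nThe best one scored 4.5 stars."
print(extra_features(sample))
```

For the sample string this yields 3 sentences, 2 paragraphs, 3 capitalized words, 1 all-capital word, and 2 numeric quantities.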

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. READABILITY TESTS</span><br>

Readability tests determine how difficult an English passage is to understand. They often output a score corresponding to a grade level (e.g., primary school vs. college graduate).

### Overview
*   **Goal:** Determine readability of an English passage.
*   **Scale:** Ranges from primary school to college graduate level.
*   **Mechanism:** Mathematical formulas utilizing word count, syllable count, and sentence count.
*   **Applications:** Fake news detection, opinion spam detection.

### Common Readability Tests
1.  **Flesch reading ease**
2.  **Gunning fog index**
3.  **Simple Measure of Gobbledygook (SMOG)**
4.  **Dale-Chall score**

### 7.1 Flesch Reading Ease
Flesch reading ease is one of the oldest and most widely used readability tests. It depends on two factors:
1.  **Average Sentence Length:** Greater length $\rightarrow$ Harder to read.
2.  **Average Syllables per Word:** More syllables $\rightarrow$ Harder to read.

**Interpretation:** Higher score = Greater readability (Easier).

| Reading ease score | Grade Level |
| :--- | :--- |
| 90-100 | 5 |
| 80-90 | 6 |
| 70-80 | 7 |
| 60-70 | 8-9 |
| 50-60 | 10-12 |
| 30-50 | College |
| 0-30 | College Graduate |
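
The standard Flesch reading ease formula combines the two factors as 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The sketch below applies it to a short text. The syllable counter is a crude vowel-group heuristic assumed here for illustration, so its scores will not match a dedicated library exactly; `count_syllables` and `flesch_reading_ease` are hypothetical helper names.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(num_words, num_sentences, num_syllables):
    """Flesch reading ease computed from raw counts."""
    avg_sentence_length = num_words / num_sentences
    avg_syllables_per_word = num_syllables / num_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


text = "The cat sat on the mat. It was happy."
# Words: runs of letters and apostrophes
words = re.findall(r"[A-Za-z']+", text)
# Sentences: ., ! or ? followed by whitespace or end of text
sentences = re.findall(r"[.!?]+(?=\s|$)", text)
syllables = sum(count_syllables(word) for word in words)

score = flesch_reading_ease(len(words), len(sentences), syllables)
print(f"Words: {len(words)}, sentences: {len(sentences)}, syllables: {syllables}")
print(f"Flesch reading ease: {score:.1f}")
```

Very simple text such as this example can score above 100, since the formula itself has no upper cap at 100.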

### 7.2 Gunning Fog Index
Developed by Robert Gunning in 1952. Like Flesch reading ease, it depends on average sentence length, but instead of syllables per word it uses the percentage of "complex words" (words with three or more syllables).

**Interpretation:** Higher index = Lower readability (Harder).

| Fog index | Grade level |
| :--- | :--- |
| 17 | College graduate |
| 16 | College senior |
| 15 | College junior |
| 14 | College sophomore |
| 13 | College freshman |
| 12 | High school senior |
| 11 | High school junior |
| 10 | High school sophomore |
| 9 | High school freshman |
| 8 | Eighth grade |
| 7 | Seventh grade |
| 6 | Sixth grade |
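
The usual form of the Gunning fog formula is 0.4 × (average sentence length + percentage of complex words). The sketch below computes it from raw counts; `gunning_fog` is a hypothetical helper, and counting complex words is left to the caller because it depends on a syllable counter.

```python
def gunning_fog(num_words, num_sentences, num_complex_words):
    """Gunning fog index computed from raw counts."""
    avg_sentence_length = num_words / num_sentences
    percent_complex = 100 * num_complex_words / num_words
    return 0.4 * (avg_sentence_length + percent_complex)


# Worked example: 100 words, 5 sentences, 10 complex words
# 0.4 * (100 / 5 + 100 * 10 / 100) = 0.4 * (20 + 10) = 12
fog = gunning_fog(num_words=100, num_sentences=5, num_complex_words=10)
print(f"Gunning fog index: {fog:.1f}")
```

An index of 12 corresponds to a high school senior reading level in the table above.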

### 7.3 Implementation with `textatistic`
The `textatistic` library allows for easy calculation of these scores.

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> You may need to install the library first using <code>!pip install textatistic</code>. </div>



In [None]:
# Note: This code requires the textatistic library.
# If not installed, uncomment the line below:
# !pip install textatistic

try:
    # Import the Textatistic class
    from textatistic import Textatistic

    # Sample text
    text = "The quick brown fox jumps over the lazy dog. This is a simple sentence."

    # Create a Textatistic object and read its dictionary of readability scores
    readability_scores = Textatistic(text).scores

    # Print the Flesch reading ease and Gunning fog scores
    print(f"Flesch Score: {readability_scores['flesch_score']}")
    print(f"Gunning Fog Score: {readability_scores['gunningfog_score']}")

except ImportError:
    print("The 'textatistic' library is not installed. Please install it to run this block.")



***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. CONCLUSION</span><br>

In this notebook, we explored the fundamentals of Feature Engineering for NLP in Python.

**Key Takeaways:**
1.  **Data Types:** We distinguished between structured numerical data (like Iris) and unstructured textual data (like Movie Reviews).
2.  **Encoding:** We learned how to use One-Hot Encoding to convert categorical variables into numerical features using `pd.get_dummies`.
3.  **Basic Features:** We implemented Python functions to extract simple but powerful features such as character counts, word counts, and average word length.
4.  **Social Media Features:** We saw how to parse specific tokens like hashtags from tweets.
5.  **Readability:** We examined advanced linguistic features like the Flesch Reading Ease and Gunning Fog Index to quantify the complexity of a text.

**Next Steps:**
*   Apply these feature extraction techniques to a real-world dataset.
*   Explore more advanced vectorization techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
*   Feed these extracted features into machine learning classifiers (like Logistic Regression or Naive Bayes) to perform sentiment analysis or text classification.
