# **Natural Language Processing**

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) and Computational Linguistics that focuses on enabling machines to understand, interpret, and generate human language. It seeks to bridge the gap between human communication (natural language) and machine understanding, allowing computers to interact with us in a more intuitive way. NLP combines computer science, linguistics, and statistical models to process and analyze large amounts of natural language data. It has applications in various domains, such as:

- **Search engines** (e.g., Google Search)
- **Voice assistants** (e.g., Siri, Alexa)
- **Text analysis** (e.g., sentiment analysis, topic modeling)
- **Machine translation** (e.g., Google Translate)
- **Chatbots and customer service automation**
- **Social media monitoring** (e.g., detecting trends, sentiments)

The ability to process and understand human language opens the door for computers to assist in tasks traditionally requiring human cognition, such as summarizing documents, understanding context, extracting key information, and generating coherent responses.

### **Key Components of NLP**
NLP typically involves a series of tasks to understand and process natural language, including:

1. **Text Preprocessing:** This involves cleaning and preparing text for further analysis. Steps include tokenization, removing stopwords, stemming, and lemmatization.
2. **Part-of-Speech Tagging (POS):** Identifying the grammatical components of a sentence, such as nouns, verbs, adjectives, etc.
3. **Named Entity Recognition (NER):** Detecting entities (e.g., person names, locations, dates) within text.
4. **Dependency Parsing:** Analyzing the syntactic structure of a sentence to understand how words are related to each other.
5. **Sentiment Analysis:** Determining the sentiment (positive, negative, or neutral) expressed in a piece of text.
6. **Machine Translation:** Converting text from one language to another using AI models.
7. **Text Generation:** Generating new text based on learned patterns, as in chatbots or language models such as GPT-3.

NLP plays a crucial role in tasks such as search engine optimization (SEO), social media monitoring, language translation, and content personalization. As AI continues to evolve, the significance of NLP will only grow, offering even more sophisticated ways for machines to understand and interact with human language.

### **Challenges in NLP**
Despite significant advancements, NLP faces several challenges due to the complexity and ambiguity of human language. Some of these challenges include:

- **Ambiguity:** Words can have multiple meanings depending on context (e.g., "bat" could refer to an animal or a piece of sports equipment).
- **Contextual Understanding:** Understanding nuances like irony, sarcasm, and cultural references.
- **Language Variability:** Different dialects, slang, and informal language pose a challenge in accurate language processing.
- **Data Quality:** Large amounts of labeled data are required to train NLP models, and this data can often be noisy or biased.

Advanced techniques such as deep learning, transfer learning, and transformer models (e.g., BERT, GPT) address many of these challenges and have significantly improved the capabilities of NLP systems.


## **Natural Language Toolkit (NLTK)**

### **Overview of NLTK**
NLTK (Natural Language Toolkit) is one of the most popular Python libraries for working with human language data. It provides an extensive suite of tools and resources for:

- **Text preprocessing**: Tokenization, stemming, lemmatization, and cleaning text.
- **Linguistic analysis**: Part-of-speech (POS) tagging, syntactic parsing, and dependency analysis.
- **Text classification**: Feature extraction and classification for tasks like sentiment analysis.
- **Corpora access**: A wide variety of prebuilt datasets and lexicons, such as the Gutenberg Corpus and WordNet.

#### **Why NLTK?**
1. Comprehensive and versatile for NLP tasks, ranging from beginner to advanced use cases.
2. Provides access to standard corpora and pre-trained models.
3. Well-documented and easy to use for prototyping and educational purposes.





### **Key Concepts and Examples**

#### **1. Corpus**
A **corpus** is a collection of texts, often used as a dataset for training or testing NLP models. NLTK provides built-in corpora like `gutenberg` (classic books) and `brown` (annotated texts). You can also load your own datasets.

**Example: Loading and Exploring a Corpus**


In [2]:
# pip install nltk

import nltk
nltk.download('gutenberg')

from nltk.corpus import gutenberg
# List all files in the Gutenberg Corpus
print(gutenberg.fileids())

# Load text from a specific file (e.g., Jane Austen's "Emma")
text = gutenberg.raw('austen-emma.txt')
print(text[:500])  # Print the first 500 characters



['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died t


[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [4]:
text_poems = gutenberg.raw('blake-poems.txt')
print(text_poems[:600])


[Poems by William Blake 1789]

 
SONGS OF INNOCENCE AND OF EXPERIENCE
and THE BOOK of THEL


 SONGS OF INNOCENCE
 
 
 INTRODUCTION
 
 Piping down the valleys wild,
   Piping songs of pleasant glee,
 On a cloud I saw a child,
   And he laughing said to me:
 
 "Pipe a song about a Lamb!"
   So I piped with merry cheer.
 "Piper, pipe that song again;"
   So I piped: he wept to hear.
 
 "Drop thy pipe, thy happy pipe;
   Sing thy songs of happy cheer:!"
 So I sang the same again,
   While he wept with joy to hear.
 
 "Piper, sit thee down and write
   In a book, that all may read."
 So he vanish'd


In [5]:

text_bible = gutenberg.raw('bible-kjv.txt')
print(text_bible[:500])


[The King James Bible]

The Old Testament of the King James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.

1:3 And God said, Let there be light: and there was light.

1:4 And God saw the light, that it was good: and God divided the light
from the darkness.

1:5 And God called the light Da


In this example:

The gutenberg corpus includes texts from classic literature like Shakespeare and Jane Austen. By loading and inspecting a file (e.g., Emma), you can perform various analyses and operations on it, such as tokenization, sentiment analysis, or frequency analysis.
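
As a small illustration of the frequency analysis mentioned above, the sketch below counts words in the opening sentence of *Emma* using only the standard library. The regex tokenizer and the variable names are illustrative assumptions, not part of NLTK; with NLTK installed, `word_tokenize` could be used instead.

```python
import re
from collections import Counter

# Opening sentence of "Emma", copied from the corpus output shown above
sample = (
    "Emma Woodhouse, handsome, clever, and rich, with a comfortable home "
    "and happy disposition, seemed to unite some of the best blessings "
    "of existence; and had lived nearly twenty-one years in the world "
    "with very little to distress or vex her."
)

# Assumption: a simple regex tokenizer that lowercases the text and keeps
# hyphenated words such as "twenty-one" together
words = re.findall(r"[a-z]+(?:-[a-z]+)*", sample.lower())

# Count how often each word occurs
freq = Counter(words)
print(freq.most_common(1))  # the single most frequent word and its count
print(freq["with"], freq["of"])
```

The most frequent words here are function words such as "and", which is one reason stop word removal is often applied before this kind of analysis.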

## 2. Tokenization

Tokenization is the process of splitting a large body of text into smaller units called tokens. These tokens can be words, sentences, or even subword components. Tokenization is typically one of the first steps in any NLP pipeline because it breaks down the text into manageable pieces.

### Types of Tokenization:

- **Word Tokenization**: Splitting text into individual words. This is useful when you want to analyze word frequencies, run part-of-speech tagging, or perform other word-level operations.
- **Sentence Tokenization**: Splitting text into sentences. This is useful for sentence-level analysis such as sentiment analysis or parsing.

Tokenization is essential because it allows subsequent NLP processes to work with smaller, more meaningful components of text.

### Example: Word and Sentence Tokenization


In [8]:
text = "Tokenization is essential because it allows subsequent NLP processes to work with smaller, more meaningful components of text."
text_list = text.split()

my_text = text_list

for element in my_text:
    print(element)

Tokenization
is
essential
because
it
allows
subsequent
NLP
processes
to
work
with
smaller,
more
meaningful
components
of
text.



In [10]:
from nltk.tokenize import word_tokenize
def nltk_tokenization_pipeline(text):
    tokens = word_tokenize(text)
    return tokens

text = "Tokenization is essential because it allows subsequent NLP processes to work with smaller, more meaningful components of text."
tokens = nltk_tokenization_pipeline(text)
print("Tokens (nltk):", tokens)

Tokens (nltk): ['Tokenization', 'is', 'essential', 'because', 'it', 'allows', 'subsequent', 'NLP', 'processes', 'to', 'work', 'with', 'smaller', ',', 'more', 'meaningful', 'components', 'of', 'text', '.']


In [16]:
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural Language Processing is exciting. It opens up many possibilities!"
# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)
# Output: ['Natural Language Processing is exciting.', 'It opens up many possibilities!']

# Word tokenization
words = word_tokenize(text)
print(words)
# Output: ['Natural', 'Language', 'Processing', 'is', 'exciting', '.', 'It', 'opens', 'up', 'many', 'possibilities', '!']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


['Natural Language Processing is exciting.', 'It opens up many possibilities!']
['Natural', 'Language', 'Processing', 'is', 'exciting', '.', 'It', 'opens', 'up', 'many', 'possibilities', '!']


In this example:

- **Sentence tokenization** splits the text into sentences. NLTK's `sent_tokenize` uses the pre-trained Punkt model, which relies on sentence-ending punctuation such as periods (.), exclamation points (!), and question marks (?) while trying to avoid splitting at abbreviations.
- **Word tokenization** splits the text further into individual words. Notice how punctuation marks like periods and exclamation points are treated as separate tokens.
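
To see why sentence tokenization needs more than a punctuation rule, here is a minimal sketch of a naive splitter built on the standard `re` module. The helper `naive_sent_split` is hypothetical and is not how NLTK works internally; it only shows the failure mode that a trained sentence tokenizer is meant to avoid.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

sentences = naive_sent_split("Dr. Smith arrived. He was late!")
print(sentences)
# Output: ['Dr.', 'Smith arrived.', 'He was late!']
```

The abbreviation "Dr." is wrongly treated as the end of a sentence, so the text is split into three pieces instead of two.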

## 3. Stemming and Lemmatization

Both stemming and lemmatization are techniques used to reduce words to their base or root form, allowing models to treat different variations of a word (e.g., "running" vs. "run") as the same entity.

### Stemming:
Stemming strips affixes (most commonly suffixes) from words using fixed rules, so the resulting stems are not always valid words. It is fast and is commonly used in information retrieval.

- **Example**: "running" → "run", "better" → "better" (no change)

### Lemmatization:
Lemmatization is a more sophisticated technique that reduces words to their lemma, which is the dictionary form of the word. Unlike stemming, lemmatization ensures that the resulting words are valid dictionary entries. Lemmatization takes into account the word's context (part of speech) to produce the correct lemma.

- **Example**: "running" → "run" (verb), "better" → "good" (adjective)

### Example: Stemming vs. Lemmatization


In [17]:
text

'Tokenization is essential because it allows subsequent NLP processes to work with smaller, more meaningful components of text.'

In [16]:
from nltk.stem import PorterStemmer

# Note: stem() expects a single word. Passing the whole sentence treats it
# as one long "word", so the only visible change below is lowercasing.
# To stem a sentence, tokenize it first and stem each token.
stemmer = PorterStemmer()
stemmer.stem(text)

'tokenization is essential because it allows subsequent nlp processes to work with smaller, more meaningful components of text.'

In [18]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

# Initialize the stemmer and the lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming example
words_stemmed = [stemmer.stem(word) for word in ["better", "running", "studies", "flies", "happier", "illegal", "sentimental"]]
print("Stemmed words:", words_stemmed)
# Output: ['better', 'run', 'studi', 'fli', 'happier', 'illeg', 'sentiment']

# Lemmatization example (pos='v' treats every word as a verb)
words_lemmatized = [lemmatizer.lemmatize(word, pos='v') for word in ["better", "running", "studies", "flies", "happier", "journey", "wrote", 'slept', "sentimental"]]
print("Lemmatized words:", words_lemmatized)
# Output: ['better', 'run', 'study', 'fly', 'happier', 'journey', 'write', 'sleep', 'sentimental']


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Stemmed words: ['better', 'run', 'studi', 'fli', 'happier', 'illeg', 'sentiment']
Lemmatized words: ['better', 'run', 'study', 'fly', 'happier', 'journey', 'write', 'sleep', 'sentimental']


**In this example:**

- Stemming reduces *studies* to *studi* and *flies* to *fli*, which are not valid English words.

- Lemmatization reduces words to valid base forms, like *study* instead of *studi*. Because every word is lemmatized as a verb (`pos='v'`), the adjectives *better* and *happier* are left unchanged; mapping *better* to *good* requires `pos='a'`.
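
The difference between the two approaches can be sketched without NLTK. The code below is a toy illustration only: `toy_stem` applies a few hand-picked suffix rules (it is not the Porter algorithm), and `toy_lemmatize` looks words up in a tiny hypothetical lexicon keyed by word and part of speech (it is not WordNet).

```python
# Toy suffix rules (NOT the Porter algorithm): (suffix, replacement)
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least three characters of the original word as the stem
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Hypothetical mini-lexicon keyed by (word, part of speech)
LEMMA_TABLE = {
    ("studies", "v"): "study",
    ("studies", "n"): "study",
    ("better", "a"): "good",
    ("wrote", "v"): "write",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form if the (word, pos) pair is known."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_stem("studies"), toy_lemmatize("studies", "v"))      # studi study
print(toy_stem("wrote"), toy_lemmatize("wrote", "v"))          # wrote write
print(toy_lemmatize("better", "v"), toy_lemmatize("better", "a"))  # better good
```

The rule-based stemmer cannot relate *wrote* to *write*, while the lookup-based lemmatizer can, but only when it is given the right part of speech.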

## 4. Part-of-Speech (POS) Tagging

Part-of-speech tagging is the process of identifying the grammatical category (e.g., noun, verb, adjective) of each word in a sentence. POS tagging helps identify the role that each word plays in a sentence, enabling more advanced tasks like syntactic parsing or named entity recognition.

### Why is POS tagging useful?

- It helps in extracting useful information from text, such as identifying nouns, verbs, and adjectives.
- It is crucial for syntactic parsing, where the structure of a sentence is analyzed.
- It can also be used in tasks like named entity recognition (NER) to identify people, locations, and other entities in text.

### Example: POS Tagging


In [20]:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Tokenizer and tagger resources (resource names differ across NLTK versions)
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('averaged_perceptron_tagger')

# Sample text
text = "Natural Language Processing is exciting!"

# Tokenize, then tag each token with its part of speech
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
# Output: [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('exciting', 'VBG'), ('!', '.')]


[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('exciting', 'VBG'), ('!', '.')]


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [28]:
# from nltk import pos_tag
# from nltk.chunk import ne_chunk
# from nltk.tokenize import word_tokenize

# # Download necessary NLTK models/data
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

# # Example text
# text = "Barack Obama was born in Hawaii. He was elected president in 2008."

# # Tokenize the text into words
# tokens = word_tokenize(text)

# # Perform POS tagging on the tokens
# pos_tags = pos_tag(tokens)

# # Perform Named Entity Recognition (NER) on the POS tagged tokens
# named_entities = ne_chunk(pos_tags)

# # Print the named entities
# print(named_entities)

# # For a more readable output, we can define a function to extract named entities
# def extract_named_entities(ne_tree):
#     entities = []
#     for subtree in ne_tree:
#         if isinstance(subtree, nltk.Tree):  # If subtree is a named entity
#             entity_name = " ".join([token for token, pos in subtree.leaves()])
#             entity_type = subtree.label()
#             entities.append((entity_name, entity_type))
#     return entities

# # Extract and print the named entities in a readable format
# entities = extract_named_entities(named_entities)
# for entity in entities:
#     print(f"Entity: {entity[0]}, Type: {entity[1]}")

**In this example:**

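POS tagging assigns labels like:
- **JJ**: Adjective (e.g., Natural)
- **NNP**: Proper noun, singular (e.g., Language, Processing)
- **VBZ**: Verb, 3rd person singular present (e.g., is)
- **VBG**: Verb, gerund or present participle (e.g., exciting, which the tagger labels as a verb form here rather than as an adjective)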
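
Once text is tagged, the tags can be used to pull out words by grammatical category. The sketch below reuses the tagged output shown above as a hard-coded list, so it runs without NLTK; the variable names are illustrative assumptions.

```python
# Tagged tokens copied from the POS tagging output above
tagged = [
    ('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'),
    ('is', 'VBZ'), ('exciting', 'VBG'), ('!', '.'),
]

# Penn Treebank noun tags start with "NN", verb tags with "VB"
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]
adjectives = [word for word, tag in tagged if tag.startswith('JJ')]

print("Nouns:", nouns)            # ['Language', 'Processing']
print("Verbs:", verbs)            # ['is', 'exciting']
print("Adjectives:", adjectives)  # ['Natural']
```

Matching on the tag prefix groups related tags together, for example `NN`, `NNS`, `NNP`, and `NNPS` for nouns.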

## 5. Stop Words

Stop words are common words that appear frequently in natural language but carry little meaningful content on their own. In the context of Natural Language Processing (NLP), stop words are often removed during the preprocessing stage because they don't contribute much to the text's core meaning and can add noise to analyses.

### Why Are Stop Words Removed?

- **Frequency**: Words like "is", "the", "in", "on" are extremely common and appear in almost every text. They are considered redundant in most NLP tasks.
- **Noise Reduction**: Since they don't contain significant meaning, removing stop words helps to focus on words that convey more useful information.
- **Efficiency**: Reducing the size of the text by eliminating these words can make algorithms faster and more efficient, especially when dealing with large datasets.

### Examples of Stop Words:
- **Articles**: "the", "a", "an"
- **Pronouns**: "he", "she", "it", "they"
- **Prepositions**: "in", "on", "at", "by"
- **Conjunctions**: "and", "or", "but"
- **Auxiliary verbs**: "is", "are", "was", "were"
- **Other**: "to", "from", "of", "with", "about"

### Example Sentence:
- **Original**: "The dog is running in the park."
- **After Removing Stop Words**: "dog running park"

In this example, the stop words "The", "is", "in", and "the" are removed (along with the final period), leaving only the more meaningful words.

### Practical Use of Stop Words Removal:
- **Text Classification**: In tasks like sentiment analysis, stop words are removed so that models focus on important words (like "happy", "good", "love") that contribute to sentiment.
- **Search Engines**: Some search systems ignore stop words in queries to reduce index size and speed up matching.
- **Topic Modeling**: By removing stop words, algorithms can identify the key topics from text more efficiently.


In [41]:
from nltk.corpus import stopwords

nltk.download("stopwords")

# Tokenize the current text into words
tokens_text = word_tokenize(text)

# Load the English stop word list as a set for fast membership tests
stop_words = set(stopwords.words('english'))
print(stop_words)

# # here filter out the stop words..
# filtered_tokens = [word for word in tokens_text if word.lower() not in stop_words]
# print(filtered_tokens)

{'during', "you'll", 'its', 'with', "you've", 'yours', "weren't", 'further', 'y', 'more', 'they', 'd', 'no', 'where', 'hadn', 'so', 'what', "isn't", 'do', "shan't", 'when', 'mustn', 'own', 'up', 'will', 'such', 'myself', "mightn't", 'once', 'only', 'some', 'needn', 'of', "that'll", "should've", 'this', "needn't", 'doing', 'as', 'she', 'into', 'now', 'being', "won't", 'out', 'wouldn', 'same', 'all', 'shan', 'it', 'than', 'at', 'is', 'll', 'here', "hadn't", 'we', 'itself', 'through', 'and', 'on', 'off', "don't", 'me', 'you', 'hasn', "doesn't", 'or', 'whom', 'hers', 'were', "you'd", 'to', 'isn', 'ma', "mustn't", 'did', 'he', 'his', 'against', 'how', 'has', 'before', "didn't", "aren't", 'aren', 'too', 'haven', 'these', 'an', 'few', 'shouldn', 'be', 'weren', "you're", 'are', 'by', "shouldn't", 'won', 'down', 'i', 'ain', 'ours', 's', 'not', 'about', 'himself', 'below', 'their', 'while', 'but', 'above', 'ourselves', 'doesn', 'didn', 'which', "haven't", 'them', 'yourselves', "it's", 'been', 'e

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [38]:
from nltk.corpus import stopwords

# Download necessary NLTK resources
nltk.download('stopwords')

# Sample text
text = "This is a sample sentence that contains stop words."

# Tokenize the text into words
tokens = word_tokenize(text)

# Get the stop words for English
stop_words = set(stopwords.words('english'))

# Filter out stop words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Print the filtered tokens
print("Filtered Tokens:", filtered_tokens)


Filtered Tokens: ['sample', 'sentence', 'contains', 'stop', 'words', '.']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
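
The filtered tokens above still contain the period, because punctuation is not part of the stop word list. One possible way to drop it as well is sketched below with the standard `string` module. To stay self-contained, the token list is hard-coded and `stop_words` is a small illustrative subset, not NLTK's full English list.

```python
import string

# Tokens as produced by word_tokenize for the sample sentence above
tokens = ['This', 'is', 'a', 'sample', 'sentence', 'that',
          'contains', 'stop', 'words', '.']

# Assumption: a tiny illustrative subset of English stop words
stop_words = {'this', 'is', 'a', 'that'}

def is_punctuation(token):
    """True if every character in the token is an ASCII punctuation mark."""
    return all(ch in string.punctuation for ch in token)

# Drop stop words (case-insensitively) and punctuation-only tokens
filtered_tokens = [
    tok for tok in tokens
    if tok.lower() not in stop_words and not is_punctuation(tok)
]
print("Filtered Tokens:", filtered_tokens)
# Output: Filtered Tokens: ['sample', 'sentence', 'contains', 'stop', 'words']
```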




## 6. Vectorization

After cleaning and tokenizing the text, the next step in processing textual data for machine learning tasks is to convert it into numerical form. This is where **vectorization** comes in. Textual data is inherently unstructured, so we need to represent it in a structured, numerical form that algorithms can understand. There are two common strategies for vectorization: **Count Vectorization** and **TF-IDF Vectorization**.

---

### **1. Count Vectorization (Bag of Words Model)**


Count vectorization is a simple technique where the text is represented as a matrix of word counts. Each document is represented as a vector where each element corresponds to the frequency of a unique word from the corpus vocabulary. This model doesn't consider the context or word order, just how often a word appears.

- **For a single document**: We count the frequency of each word.
- **For multiple documents**: We build a matrix, where each row corresponds to a document, and each column corresponds to a word in the entire corpus. The cell value at position `(i, j)` will be the frequency of the word `j` in document `i`.

#### **Example**:

Let's say we have two documents:

- **Document 1**: "Apple is a fruit"
- **Document 2**: "Apple is red"

We extract the unique words from both documents: `['Apple', 'is', 'a', 'fruit', 'red']`.

Now, let's create the count matrix:

| Document   | Apple | is | a | fruit | red |
|------------|-------|----|---|-------|-----|
| Document 1 | 1     | 1  | 1 | 1     | 0   |
| Document 2 | 1     | 1  | 0 | 0     | 1   |
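
The table above can be reproduced with a few lines of standard-library Python. This is a minimal sketch that assumes whitespace tokenization and case-sensitive matching; in practice a library vectorizer would usually handle tokenization, lowercasing, and sparse storage.

```python
from collections import Counter

docs = ["Apple is a fruit", "Apple is red"]

# Assumption: simple whitespace tokenization
tokenized = [doc.split() for doc in docs]

# Build the vocabulary in order of first appearance
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# One row per document, one column per vocabulary word
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[word] for word in vocab])

print(vocab)         # ['Apple', 'is', 'a', 'fruit', 'red']
for row in count_matrix:
    print(row)       # [1, 1, 1, 1, 0] then [1, 1, 0, 0, 1]
```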

---

### **2. TF-IDF (Term Frequency - Inverse Document Frequency)**

TF-IDF is a more advanced vectorization strategy that considers not only the frequency of a word in a document (Term Frequency) but also how unique or rare the word is across the entire corpus (Inverse Document Frequency). It is useful for reducing the weight of common words that appear in many documents and emphasizing the importance of words that are more unique.

The **TF-IDF** score is the product of two components:
- **Term Frequency (TF)**: This measures how frequently a term appears in a document. The formula for TF is:

$$
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

- **Inverse Document Frequency (IDF)**: This measures how important a word is across the entire corpus. It gives higher weight to words that appear in fewer documents, indicating that they contain more information. The formula for IDF is:

$$
\text{IDF}(t, D) = \log \left( \frac{|D|}{1 + \text{DF}(t)} \right)
$$


  Where `DF(t)` is the number of documents containing the term `t`, and `|D|` is the total number of documents in the corpus.

The final **TF-IDF** value for a term in a document is:

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$


#### **Example**:

Let’s use the same documents as before:

- **Document 1**: "Apple is a fruit"
- **Document 2**: "Apple is red"

First, we calculate **Term Frequency (TF)**:
- **Document 1 (TF for 'Apple')**: 1/4 = 0.25
- **Document 2 (TF for 'Apple')**: 1/3 = 0.33

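Now, calculate **Inverse Document Frequency (IDF)**:
- The word "Apple" appears in both documents, so `DF('Apple') = 2`.
- There are 2 documents in total, so, using a base-10 logarithm, `IDF('Apple') = log(2 / (1 + 2)) = log(2/3) ≈ -0.176`. The value is negative because the `1 +` smoothing term makes the ratio smaller than 1 when a term appears in every document.

Now, calculate the **TF-IDF** for "Apple":
- **TF-IDF('Apple', Document 1)** = 0.25 * (-0.176) ≈ -0.044
- **TF-IDF('Apple', Document 2)** = (1/3) * (-0.176) ≈ -0.059

A low (here negative) score reflects that "Apple" occurs in every document and therefore does not help distinguish between them.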
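
The calculation can be checked with a short standard-library sketch that follows the TF and IDF formulas given above. It assumes whitespace tokenization and a base-10 logarithm; the function names are illustrative. Note that with the `1 + DF(t)` denominator, a term that appears in every document gets a negative IDF, and library implementations often use a different smoothing variant, so their numbers will not match these.

```python
import math

docs = ["Apple is a fruit", "Apple is red"]
corpus = [doc.split() for doc in docs]  # assumption: whitespace tokenization

def tf(term, tokens):
    """Term frequency: occurrences of the term divided by document length."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency: log10(|D| / (1 + DF(t)))."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / (1 + df))

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

print("IDF('Apple'):", round(idf("Apple", corpus), 3))
for i, tokens in enumerate(corpus, start=1):
    print(f"TF-IDF('Apple', Document {i}):", round(tf_idf("Apple", tokens, corpus), 3))
```

For a corpus this small the scores are not very informative; the example is only meant to make the arithmetic concrete.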


### **Comparison and Use Cases**

| Feature                | Count Vectorization                     | TF-IDF Vectorization                        |
|------------------------|-----------------------------------------|--------------------------------------------|
| **Purpose**            | Counts word frequency in each document  | Weighs word frequency by importance across documents |
| **Use Case**           | Useful for text classification or tasks where word frequency is key | Better for document classification, information retrieval, and search |

---

### **Conclusion**

Both **Count Vectorization** and **TF-IDF** are important methods for converting text into numerical form for machine learning. Count Vectorization is useful for simpler problems, while TF-IDF is better for distinguishing between more meaningful terms and common words. Depending on the task, one may be more applicable than the other. For example:
- **Count Vectorization** is good for simple classification tasks (e.g., spam detection).
- **TF-IDF** is better for tasks requiring more nuanced understanding, such as document similarity or retrieval.




## 7. Regex in NLP

Regular expressions (regex) are sequences of characters that define a search pattern. In Natural Language Processing (NLP), regex is used for identifying, matching, and manipulating text data. It allows us to efficiently search for specific patterns in text, such as dates, phone numbers, or specific words and phrases. Regex is useful for preprocessing text, performing tokenization, extracting data, and handling noisy text (such as special characters or unwanted spaces).

### How Regex is Used in NLP:
- **Text Cleaning**: Removing unnecessary characters (punctuation, extra spaces, etc.).


In [None]:
import re

# Text input
text = "mikawambua@124"

# Extract only alphabetic characters
alphabets = re.findall(r'[a-zA-Z]', text)

# Combine the extracted characters into a single string
result = ''.join(alphabets)

print(result)


In [45]:
import re
text = "mikawambua@124"

# Extract every alphabetic character (findall returns a list of single letters)
word = re.findall(r'[a-zA-Z]', text)
print(word)

['m', 'i', 'k', 'a', 'w', 'a', 'm', 'b', 'u', 'a']


In [49]:
# Remove every non-alphabetic character (digits, symbols, and spaces)
text = "mikawambua @124 benson"

word = re.sub(r'[^a-zA-Z]', "", text)
print(word)

mikawambuabenson


In [None]:

# Example text containing an email address
text = "You can reach me at mikawambua@domain.com for further information."

# Regex pattern for extracting email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find all matches for the email pattern
emails = re.findall(email_pattern, text)

# Print the extracted email addresses
print(emails)


In [58]:
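text = " *****  teller"
# lstrip() takes a set of characters, not a pattern: this strips any
# leading spaces and asterisks
text1 = text.lstrip(" *")
print(text1)

# Nothing is left to strip, so the result is unchanged
text2 = text1.lstrip(" ")
print(text2)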

teller
teller


In [59]:

import re
text = "Hello! This is a test sentence."
cleaned_text = re.sub(r'\s+', ' ', re.sub(r'[^\w\s]', '', text)).strip()
print(cleaned_text)  # Output: 'Hello This is a test sentence'


Hello This is a test sentence




- **Tokenization**: Splitting text into meaningful components like words or sentences.


In [None]:

text = "Hello! How are you?"
words = re.findall(r'\b\w+\b', text)
print(words)  # Output: ['Hello', 'How', 'are', 'you']


['Hello', 'How', 'are', 'you']




- **Information Extraction**: Identifying entities such as emails, phone numbers, and dates.


In [64]:
# Extract email addresses; "mikabenson@mail" is skipped because it has no top-level domain
text = "Contact us at support@example.com or mika@gmail.com  mikabenson@mail mika@wambua.co.ke john.doe@gmail.com"
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print(emails)  # Output: ['support@example.com', 'mika@gmail.com', 'mika@wambua.co.ke', 'john.doe@gmail.com']


['support@example.com', 'mika@gmail.com', 'mika@wambua.co.ke', 'john.doe@gmail.com']



- **Pattern Matching**: Searching for specific word patterns or structures.
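
As a small illustration of pattern matching, the sketch below searches for ISO-style dates and for words ending in "ing". The sample sentences and patterns are illustrative assumptions; real-world dates come in many more formats than the one matched here.

```python
import re

text = "The meeting moved from 2024-03-15 to 2024-04-01, call 555-0199."

# Match dates written as YYYY-MM-DD (four digits, two digits, two digits)
dates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text)
print(dates)  # Output: ['2024-03-15', '2024-04-01']

# Match whole words that end in "ing"
sentence = "Parsing and tagging are preprocessing steps."
ing_words = re.findall(r'\b\w+ing\b', sentence)
print(ing_words)  # Output: ['Parsing', 'tagging', 'preprocessing']
```

The phone-like string "555-0199" is not returned because it does not fit the four-two-two digit structure of the date pattern.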



