## Speaker Details

**Name**: Israel Odeajo 
**Expertise**: Data Science and Natural Language Processing Specialist  
**Bio**: Israel Odeajo has over a decade of experience in machine learning and NLP, with a focus on developing scalable AI solutions for real-world applications. He has contributed to numerous open-source projects and is a regular speaker at international conferences.

### Image

<img src="mine.jpg" alt="Speaker: Israel Odeajo" width="200" style="display: block; margin-left: auto; margin-right: auto;"/>

### Contact Information

- **Email**: isrealodeajo@gmail.com
- **LinkedIn**: [https://www.linkedin.com/in/odeajo-israel/](#)  
- **Twitter**: [@israelkingz1](#)


https://colab.research.google.com/drive/1fARnUHu9WGaLKtdgtrulghP5zCxoeRiq#scrollTo=7ef288ef

# What is NLP?

NLP stands for Natural Language Processing. It is the branch of Artificial Intelligence that gives machines the ability to understand and process human language, whether it comes as text or as audio.

## History of NLP

The history of Natural Language Processing is usually traced back to 1950, when Alan Mathison Turing published an article titled "Computing Machinery and Intelligence". This foundational work on Artificial Intelligence discusses the automatic interpretation and generation of natural language. As technology has evolved, different approaches have emerged to deal with NLP tasks.


## Applications of NLP

The applications of Natural Language Processing include:

- Text and speech processing, as in voice assistants such as Alexa and Siri
- Text classification, as in Grammarly, Microsoft Word, and Google Docs
- Information extraction, as in search engines such as DuckDuckGo and Google
- Chatbots and question answering systems, such as website bots
- Language translation, as in Google Translate
- Text summarization


## How to Use RegEx in Python?

To use Regular Expressions (RegEx) in Python, you first need to import the `re` module.

### Example:

Below is a Python code snippet that demonstrates how to use regular expressions to search for the word "portal" in a given string. It then prints the start and end indices of the matched word within the string.


In [3]:
import re 

s = 'GeeksforGeeks: A computer science portal for geeks'

match = re.search(r'portal', s) 

print('Start Index:', match.start()) 
print('End Index:', match.end()) 

Start Index: 34
End Index: 40


## Metacharacters

Metacharacters are characters with special meanings in regular expressions. They play a crucial role in crafting patterns for matching and manipulating strings. Below is a list of metacharacters used in the `re` module functions along with their descriptions:

| MetaCharacter | Description |
|---------------|-------------|
| `\`           | Used to drop the special meaning of the character following it |
| `[]`          | Represent a character class |
| `^`           | Matches the beginning of a string |
| `$`           | Matches the end of a string |
| `.`           | Matches any character, except for a newline |
| `\|`          | Means OR (matches either of the patterns separated by it) |
| `?`           | Matches zero or one occurrence of the preceding element |
| `*`           | Matches zero or more occurrences of the preceding element |
| `+`           | Matches one or more occurrences of the preceding element |
| `{}`          | Indicate the number of occurrences of a preceding regex to match |
| `()`          | Enclose a group of Regex patterns |

Understanding and using these metacharacters is fundamental to applying regular expressions effectively for pattern matching and text manipulation.
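
As a minimal sketch of several of these metacharacters in action, the snippet below uses only Python's built-in `re` module. The sample strings and variable names are made up for illustration.

```python
import re

text = "Order 66 shipped on 2024-01-15; order 7 is pending."

# \d is a digit and + means "one or more", so \d+ finds whole numbers
numbers = re.findall(r"\d+", text)

# ^ anchors the pattern to the start of the string, $ to the end;
# \. drops the special meaning of the dot
starts_with_order = bool(re.search(r"^Order", text))
ends_with_pending = bool(re.search(r"pending\.$", text))

# [] defines a character class: here, any single vowel
vowels = re.findall(r"[aeiou]", "regex")

# ? makes the preceding element optional
colours = re.findall(r"colou?r", "color and colour")

# () groups alternatives and | means OR
pets = re.findall(r"(cat|dog)s?", "cats and dog")

# {} sets an exact number of repetitions
date = re.search(r"\d{4}-\d{2}-\d{2}", text).group()

print(numbers)
print(starts_with_order, ends_with_pending)
print(vowels, colours, pets)
print(date)
```

Note that `re.findall` returns the contents of the group when the pattern contains exactly one group, which is why `pets` holds the animal names without the optional trailing `s`.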


# Tokenization in NLP

Tokenization in natural language processing (NLP) is a foundational technique that involves dividing a sentence or phrase into smaller units, known as tokens. These tokens can include words, dates, punctuation marks, or even fragments of words. This article covers the basics of tokenization, its types, and its use cases.

## What is Tokenization in NLP?

Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence, information engineering, and human-computer interaction. It focuses on programming computers to process and analyze large amounts of natural language data. The complexity of reading and understanding languages makes NLP a challenging field. Tokenization is a crucial first step in the NLP pipeline that influences the entire workflow.

Tokenization is the process of breaking down text into smaller units, or tokens. These tokens are typically words or sub-words in natural language processing. It is a critical step in many NLP tasks, including text processing, language modeling, and machine translation. Tokens serve as the building blocks for further processing and analysis, enabling the conversion of unstructured text into a structured form suitable for machine learning.

## Types of Tokenization

Tokenization can be classified into several types based on the text segmentation method:

### Word Tokenization

- **Description**: Divides the text into individual words.
- **Example**:  
  Input: "Tokenization is an important NLP task."  
  Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]

### Sentence Tokenization

- **Description**: Segments the text into sentences.
- **Example**:  
  Input: "Tokenization is an important NLP task. It helps break down text into smaller units."  
  Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]

### Subword Tokenization

- **Description**: Breaks down words into smaller units.
- **Example**:  
  Input: "tokenization"  
  Output: ["token", "ization"]

### Character Tokenization

- **Description**: Divides the text into individual characters.
- **Example**:  
  Input: "Tokenization"  
  Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
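
The four types above can be sketched with the standard library alone. This is a simplified illustration, not how production tokenizers work: the regular expressions ignore abbreviations and contractions, and the subword vocabulary is a tiny hand-written set (real subword vocabularies are learned from data). The function name `subword_tokenize` is hypothetical.

```python
import re

text = "Tokenization is an important NLP task. It helps break down text into smaller units."

# Word tokenization: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sentence tokenization: split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Character tokenization: every character becomes a token
char_tokens = list("Tokenization")


def subword_tokenize(word, vocab):
    """Greedy longest-match split of a word against a fixed vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character
            tokens.append(word[start])
            start += 1
    return tokens


subword_tokens = subword_tokenize("tokenization", {"token", "ization"})

print(word_tokens)
print(sentence_tokens)
print(char_tokens)
print(subword_tokens)
```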

## Need for Tokenization

Tokenization plays a crucial role in NLP for several reasons:

- **Effective Text Processing**: Simplifies raw text for easier processing and analysis.
- **Feature Extraction**: Enables numerical representation of text data for machine learning models.
- **Language Modeling**: Facilitates organized representations of language for text generation and language modeling.
- **Information Retrieval**: Essential for efficient indexing and searching based on words or phrases.
- **Text Analysis**: Supports NLP tasks like sentiment analysis and named entity recognition.
- **Vocabulary Management**: Helps manage a corpus's vocabulary by generating a list of distinct tokens.
- **Task-Specific Adaptation**: Can be customized to suit specific NLP tasks such as summarization and machine translation.
- **Preprocessing Step**: Transforms raw text into a format suitable for further statistical and computational analysis.


# Tokenizing Text Using NLTK in Python

To utilize the Natural Language Toolkit (NLTK) for text tokenization, it's necessary to first install NLTK on your system. NLTK is a vast library designed to assist with a wide range of Natural Language Processing (NLP) tasks, from tokenization to parsing and semantic reasoning.

## Installation Process

To install NLTK and set up your environment for NLP tasks, follow these steps:

1. **Install NLTK**:
   Open your terminal and execute the following command to install NLTK:

    ```bash
    pip install nltk
    ```

2. **Enter Python Shell**:
   After installation, enter the Python shell by typing `python` in your terminal.

3. **Import NLTK and Download Resources**:
   Inside the Python shell, import NLTK and download the necessary datasets, models, and resources by running:

    ```python
    import nltk
    nltk.download('all')
    ```

    This step may take some time, as NLTK includes a substantial number of tokenizers, chunkers, algorithms, and corpora that need to be downloaded.

## Understanding Key Terms

Before diving into tokenization, let's clarify a few terms frequently encountered in NLP:

- **Corpus**: A body of text, singular. "Corpora" is the plural form.
- **Lexicon**: Essentially, a dictionary that includes words and their meanings.
- **Token**: An individual piece or "entity" that results from breaking down text based on specific rules. For instance, a sentence can be tokenized into words, making each word a token. Similarly, paragraphs can be tokenized into sentences, making each sentence a token.

Tokenization is crucial as it transforms text into a more manageable and analyzable format by splitting it into sentences, words, or other meaningful elements.


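In [None]:
# Import NLTK and download the Punkt sentence tokenizer data
import nltk
nltk.download('punkt')
# Recent NLTK releases look for 'punkt_tab' instead, so request it as well
nltk.download('punkt_tab')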

The punkt package is a part of the Natural Language Toolkit (NLTK), which is a set of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.

Punkt divides a text into a list of sentences. It uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences. Knowing where sentences begin and end is particularly useful in tasks such as text summarization and sentiment analysis.

In [None]:
#tokenization
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
text = "Natural language processing is an exciting area. Huge budgets have been allocated for this."

In [None]:
sent_tokenization = sent_tokenize(text)
print(sent_tokenization)

In [None]:
word_token = word_tokenize(text)
print(word_token)

In [None]:
# Import NLTK's sentence and word tokenizers
from nltk.tokenize import sent_tokenize, word_tokenize

# Adjacent string literals inside parentheses are joined automatically,
# so no "+" or line-continuation backslashes are needed
text = ("Natural language processing (NLP) is a field "
	"of computer science, artificial intelligence "
	"and computational linguistics concerned with "
	"the interactions between computers and human "
	"(natural) languages, and, in particular, "
	"concerned with programming computers to "
	"fruitfully process large natural language "
	"corpora. Challenges in natural language "
	"processing frequently involve natural "
	"language understanding, natural language "
	"generation (frequently from formal, machine"
	"-readable logical forms), connecting language "
	"and machine perception, managing human-"
	"computer dialog systems, or some combination "
	"thereof.")

print(sent_tokenize(text))
print(word_tokenize(text))


# Tokenizing Text Using TextBlob

The TextBlob module is a Python library that provides a simple API for performing basic Natural Language Processing (NLP) tasks. It is built on top of the NLTK module, leveraging its rich set of features while offering an easier entry point for beginners.

## Installing TextBlob

To start using TextBlob for NLP tasks, including text tokenization, you need to install the library and its dependencies. Follow these steps to install TextBlob and download the necessary NLTK corpora:

1. **Install TextBlob**:
   Open your terminal and run the following command to install the TextBlob library:

    ```bash
    pip install -U textblob
    ```

2. **Download TextBlob Corpora**:
   After installing TextBlob, you need to download the necessary data for NLP tasks. Execute the following command in your terminal:

    ```bash
    python -m textblob.download_corpora
    ```

   Note: The installation and download process may take some time due to the large volume of tokenizers, chunkers, algorithms, and corpora that need to be downloaded.


In [None]:
# import the TextBlob class from the textblob library
from textblob import TextBlob 

text = ("Natural language processing (NLP) is a field " +
	"of computer science, artificial intelligence " +
	"and computational linguistics concerned with " +
	"the interactions between computers and human " +
	"(natural) languages, and, in particular, " +
	"concerned with programming computers to " +
	"fruitfully process large natural language " +
	"corpora. Challenges in natural language " +
	"processing frequently involve natural " +
	"language understanding, natural language " +
	"generation (frequently from formal, machine" +
	"-readable logical forms), connecting language " +
	"and machine perception, managing human-" +
	"computer dialog systems, or some combination " +
	"thereof.")
	
# create a TextBlob object 
blob_object = TextBlob(text) 

# tokenize paragraph into words. 
print(" Word Tokenize :\n", blob_object.words) 

# tokenize paragraph into sentences. 
print("\n Sentence Tokenize :\n", blob_object.sentences) 


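### What is Stemming?

Stemming reduces a word to its stem by stripping suffixes with heuristic rules. The resulting stem is not always a dictionary word.

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

In [None]:
# Tokenize the text defined above, then reduce each word to its stem
words = word_tokenize(text)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)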

### What is Lemmatization? 

In contrast to stemming, lemmatization is a lot more powerful. It looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words, aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
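
To make the contrast concrete, here is a deliberately tiny sketch. It assumes a toy suffix-stripping function (not the Porter algorithm) and a hand-written lemma dictionary; both `toy_stem` and `toy_lemmatize` are hypothetical helpers written only for this illustration.

```python
def toy_stem(word):
    """Strip a common suffix by rule; the result may not be a real word."""
    for suffix in ("ies", "ing", "ed", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical, hand-written lemma dictionary standing in for a real vocabulary
LEMMAS = {"studies": "study", "running": "run", "better": "good", "mice": "mouse"}


def toy_lemmatize(word):
    """Look the word up and return its dictionary form, if known."""
    return LEMMAS.get(word, word)


for w in ["studies", "running", "better", "mice"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based function yields fragments such as `stud` and `runn` and cannot relate `better` to `good`, while the dictionary lookup returns proper base forms. That is the trade-off the paragraph above describes.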

# Wordnet Lemmatizer

WordNet is a large, publicly available lexical database, originally built for English, with related wordnets covering many other languages. It records semantic relationships between words, and the WordNet-based lemmatizer is one of the earliest and most widely used lemmatization techniques in Natural Language Processing (NLP).

## Key Features of Wordnet

- **Semantic Links**: Wordnet organizes words into semantic relations, such as synonyms, making it a valuable resource for understanding the meaning of words in context.
- **Synsets**: Words are grouped into sets of synonyms, known as synsets, which represent a unique semantic concept. This grouping facilitates a deeper understanding of word semantics.

## How to Use Wordnet Lemmatizer in Python

Wordnet Lemmatizer is part of the NLTK (Natural Language Toolkit) library in Python, which offers a vast suite of tools for language processing.

### Installing NLTK and Wordnet

To use Wordnet Lemmatizer, you first need to install the NLTK package and then download the Wordnet data via NLTK. Here are the steps:

1. **Install NLTK**:
   Open your terminal or Anaconda prompt and install NLTK by running:

    ```bash
    pip install nltk
    ```

2. **Download Wordnet and Required NLTK Data**:
   After installing NLTK, you need to download the Wordnet dataset and the 'averaged_perceptron_tagger' which is required for part-of-speech tagging. In your Python console, execute the following commands:

    ```python
    import nltk
    nltk.download('wordnet')
    nltk.download('averaged_perceptron_tagger')
    ```

By following these steps, you'll have Wordnet and the necessary components installed, enabling you to use the Wordnet Lemmatizer for processing and analyzing text data in Python.


In [None]:
nltk.download('wordnet')

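In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Tokenize the text defined above, then reduce each word to its dictionary form (lemma)
words = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)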

### Removing stop words with NLTK

In [None]:
import nltk

nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = """This is a sample sentence,
				showing off the stop words filtration."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)
# Case-insensitive filtering: lowercase each token before checking
# whether it is present in stop_words
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

# Case-sensitive filtering, for comparison: capitalized stop words
# such as "This" are kept because stop_words is all lowercase
filtered_case_sensitive = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)
print(filtered_case_sensitive)


### Removing stop words with spaCy

spaCy is an object-oriented Python library for Natural Language Processing (NLP). It is used to pre-process text and sentences and to extract information from them.

Tokenization is the process of splitting a text or a sentence into segments called tokens. It is the first step of text preprocessing, and its output feeds subsequent steps such as text classification and lemmatization.

In NLP, stopwords are frequently filtered out to improve text analysis and computational efficiency. Eliminating them draws attention to the more important content words, which can improve the accuracy and relevance of NLP tasks. In the example below, spaCy tokenizes the text, and each token's `is_stop` attribute is used to filter out the stopwords.

In [None]:
import spacy

# Load spaCy English model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "There is a pen on the table"

# Process the text using spaCy
doc = nlp(text)

# Remove stopwords
filtered_words = [token.text for token in doc if not token.is_stop]

# Join the filtered words to form a clean text
clean_text = ' '.join(filtered_words)

print("Original Text:", text)
print("Text after Stopword Removal:", clean_text)


### Removing stop words with Gensim

In [None]:
from gensim.parsing.preprocessing import remove_stopwords

# Another sample text
new_text = "The majestic mountains provide a breathtaking view."

# Remove stopwords using Gensim
new_filtered_text = remove_stopwords(new_text)

print("Original Text:", new_text)
print("Text after Stopword Removal:", new_filtered_text)
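
The examples above share the same core idea: compare each token against a list of stop words and keep only the tokens that are not on it. A minimal, library-free sketch of that idea follows. The stop word set here is a short, hand-picked assumption (real libraries ship much longer lists), and `remove_stop_words` is a hypothetical helper name.

```python
# Hypothetical, hand-picked stop word list used only for this sketch
STOP_WORDS = {"the", "a", "an", "is", "on", "there", "of", "and"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split on whitespace and drop tokens found in the stop word set."""
    tokens = text.split()
    # Lowercase for the lookup so that "There" matches "there"
    kept = [t for t in tokens if t.lower() not in stop_words]
    return " ".join(kept)


print(remove_stop_words("There is a pen on the table"))
```

Using a `set` keeps each membership check fast, which matters once the list grows to a few hundred words.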


### Part of Speech Tagging using TextBlob

The TextBlob module is used for building programs for text analysis. One of its more powerful features is Part of Speech tagging.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

In [None]:
# import the TextBlob class from the textblob library
from textblob import TextBlob

text = ("Sukanya, Rajib and Naba are my good friends. " +
	"Sukanya is getting married next year. " +
	"Marriage is a big step in one’s life. " +
	"It is both exciting and frightening. " +
	"But friendship is a sacred bond between people. " +
	"It is a special kind of love between us. " +
	"Many of you must have tried searching for a friend " +
	"but never found the right one.")

# create a textblob object
blob_object = TextBlob(text)

# Part-of-speech tags can be accessed
# through the tags property of the blob object.

# print word with pos tag.
print(blob_object.tags)


In [None]:
# input string
string = "	 Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
print(string)


In [None]:
# convert to lower case
lower_string = string.lower()
print(lower_string)

In [None]:
import re
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
print(no_number_string)

In [None]:
# remove punctuation: everything except word characters and whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
print(no_punc_string)

In [None]:
# strip leading and trailing whitespace
no_wspace_string = no_punc_string.strip()
print(no_wspace_string)

In [None]:
# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)

In [None]:
# remove stopwords using the NLTK English stop word list
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# keep only the words that are not stop words and join them with single spaces
no_stpwords_string = ' '.join(w for w in lst_string if w not in stop_words)

# output
print(no_stpwords_string)
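
The cleaning steps above (lowercasing, removing numbers and punctuation, trimming whitespace, dropping stop words) can be collected into one reusable function. The following is a minimal sketch using only the standard library. The stop word set is a small hand-picked assumption standing in for the NLTK list, and `clean_text` is a hypothetical name.

```python
import re

# Hypothetical, hand-picked stop word list used only for this sketch
STOP_WORDS = {"in", "was", "a", "of", "the", "that", "is", "not", "and", "on", "with"}


def clean_text(text, stop_words=STOP_WORDS):
    """Apply the cleaning steps in the same order as the cells above."""
    text = text.lower()                  # 1. convert to lower case
    text = re.sub(r"\d+", "", text)      # 2. remove numbers
    text = re.sub(r"[^\w\s]", "", text)  # 3. remove punctuation
    words = text.split()                 # 4. split on whitespace (also trims the ends)
    words = [w for w in words if w not in stop_words]  # 5. remove stop words
    return " ".join(words)


print(clean_text("  Python 3.0, released in 2008, was a major revision of the language."))
```

The order of the steps matters: lowercasing first means the stop word lookup can use an all-lowercase list.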

In [None]:
# Preprocess a text for the Bag of Words model
import nltk
import re

# Sample text: the opening of a speech
text = "Beans. I was trying to explain to somebody as we were flying in, that’s corn. That’s beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we’re lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, and somehow he has not aged and I have. And it’s great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. He is somebody who set the path for so much outstanding public service here in Illinois. Now, I want to start by addressing the elephant in the room. I know people are still wondering why I didn’t speak at the commencement."
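# Split the text into sentences, then lowercase each one and replace
# non-word characters and repeated whitespace with single spaces
dataset = []
for sentence in nltk.sent_tokenize(text):
	sentence = sentence.lower()
	sentence = re.sub(r'\W', ' ', sentence)
	sentence = re.sub(r'\s+', ' ', sentence).strip()
	dataset.append(sentence)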


In [None]:
# Creating the Bag of Words model:
# count how often each word occurs across all sentences
word2count = {}
for data in dataset:
	words = nltk.word_tokenize(data)
	for word in words:
		word2count[word] = word2count.get(word, 0) + 1
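
The cell above builds the word counts. A Bag of Words model then represents each sentence as a vector of counts over a fixed vocabulary. The following is a minimal sketch of that second step using `collections.Counter`; the three short sentences are made-up stand-ins for the preprocessed `dataset`.

```python
from collections import Counter

# Made-up, already preprocessed sentences standing in for `dataset`
sentences = ["that s corn", "that s beans", "beans and corn"]

# Global word counts, playing the role of the word2count dictionary
word_counts = Counter(word for s in sentences for word in s.split())

# Fixed vocabulary, sorted so the column order is reproducible
vocabulary = sorted(word_counts)

# One count vector per sentence: entry i is how often vocabulary[i] occurs in it
vectors = []
for s in sentences:
    counts = Counter(s.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vec in vectors:
    print(vec)
```

With a large corpus, the vocabulary is often restricted to the most frequent words (for example via `word_counts.most_common(n)`) to keep the vectors short.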


In [None]:
# import required module for TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# assign documents
d0 = 'Geeks for geeks'
d1 = 'Geeks'
d2 = 'r2j'

# merge documents into a single corpus
string = [d0, d1, d2]

In [None]:
string

In [None]:
# create the vectorizer object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

In [None]:
# get idf values
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
	print(ele1, ':', ele2)

In [None]:
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())
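
To see where these numbers come from, the values can be recomputed by hand. The sketch below assumes `TfidfVectorizer` is running with its default settings: raw term counts for tf, the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each document vector. The helper name `tfidf_vector` is hypothetical.

```python
import math

# Lowercased tokens of d0, d1 and d2
docs = [["geeks", "for", "geeks"], ["geeks"], ["r2j"]]
n_docs = len(docs)
vocabulary = sorted({t for d in docs for t in d})

# Document frequency: in how many documents each term appears
df = {t: sum(t in d for d in docs) for t in vocabulary}

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}


def tfidf_vector(doc):
    """Raw count times idf for each vocabulary term, scaled to unit L2 length."""
    raw = [doc.count(t) * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


for t in vocabulary:
    print(t, ":", round(idf[t], 4))
print([round(x, 4) for x in tfidf_vector(docs[0])])
```

A term that appears in more documents, such as `geeks`, receives a lower idf than terms that appear in only one, which is why it is weighted down relative to its raw count.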

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')

In [None]:
tweets = [
 "dear @verizonsupport your service is straight 💩 in dallas.. been with y'all over a decade and this is all time low for y'all. i'm talking no internet at all.",
 "@verizonsupport I sent you a dm",
 "thanks to michelle et al at @verizonsupport who helped push my no-show-phone problem along. Order canceled successfully, and I ordered this for pickup today at the Apple store in the mall."
 ]

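In [None]:
# TextBlob sentiment: polarity ranges from -1 (negative) to 1 (positive),
# subjectivity from 0 (objective) to 1 (subjective)
from textblob import TextBlob

for tweet in tweets:
  blob = TextBlob(tweet)
  print(blob.sentiment)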

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [None]:
# VADER returns neg, neu and pos scores plus a normalized compound score for each tweet
for tweet in tweets:
  scores = sia.polarity_scores(tweet)
  print(scores)