# Introduction to Tokenization

Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be individual words, phrases, or even characters, depending on the level of granularity required. In the context of natural language processing (NLP), tokenization is one of the foundational steps, enabling machines to process and understand text efficiently.
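
As a quick illustration of granularity, the sketch below tokenizes one short sentence at the word level and at the character level. The sample sentence is made up for this example and is not part of the notebook's data.

```python
# Hypothetical sample sentence, used only to illustrate granularity.
sentence = "Tokenization breaks text into tokens"

# Word-level tokens: split on whitespace.
word_tokens = sentence.split()

# Character-level tokens: every character except whitespace.
char_tokens = [ch for ch in sentence if not ch.isspace()]

print(word_tokens)
print(len(word_tokens), "word tokens,", len(char_tokens), "character tokens")
```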

## Why Tokenization is Important
1. **Simplifies Text Processing**: By breaking text into manageable pieces, tokenization simplifies the process of analyzing and understanding text data.
2. **Facilitates NLP Tasks**: Most NLP tasks, such as text classification, sentiment analysis, or machine translation, require text to be tokenized as a preprocessing step.
3. **Reduces Redundancy**: Tokenization is the first step toward reducing redundancy. Once text is split into tokens, different forms of the same word (e.g., "running," "runs," "ran") can be normalized to a common base form through stemming or lemmatization.
4. **Improves Model Efficiency**: Tokenization creates structured input for machine learning models, enhancing their ability to learn patterns and relationships within the data.

In this notebook, we will explore various tokenization techniques, from simple approaches like whitespace splitting to more advanced library-based methods. We'll also see how tokenization helps in reducing redundancy and preparing text for further analysis.



In [1]:
import numpy as np



In [2]:
text1 = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''

text2 = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information 
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers 
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''

text3='codebasics: Hello, I am having an issue with my order # 412889912'

## Naive Tokenization: Splitting Text by Spaces

One of the simplest ways to tokenize text is by splitting it based on spaces. While this approach does not handle punctuation or account for complex language structures, it serves as a good starting point for understanding tokenization.

Example Code

In [3]:
text1[:100]

'\nConcentration of Risk: Credit Risk\nFinancial instruments that potentially subject us to a concentra'

In [4]:
len(text1.split())

176

In [5]:
len(set(text1.split()))

120

In [6]:
vocab={}
for token in text1.split():
    if token in vocab:
        vocab[token]+=1
    else:
        vocab[token]=1

In [7]:
len(vocab)

120
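
Whitespace splitting leaves punctuation attached to words, so a token such as `Risk:` is counted separately from `Risk`. The sketch below shows this on the first line of `text1` and counts tokens with `collections.Counter`, a standard-library alternative to the manual dictionary loop. The variable names are illustrative.

```python
from collections import Counter

# First line of text1, copied here so the sketch runs on its own.
sample = "Concentration of Risk: Credit Risk"

# Whitespace splitting keeps the colon attached to "Risk:".
counts = Counter(sample.split())

print(counts["Risk"], counts["Risk:"])
print(len(counts))
```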

## Importance of Lowercasing Before Tokenization

Text data often contains words in mixed cases, such as "Apple," "apple," or "APPLE." Without normalization, these variations are treated as distinct tokens, which can lead to redundancy and inconsistencies in downstream tasks.

Lowercasing is a simple yet effective preprocessing step that ensures case insensitivity in the tokenization process.

Example Code

In [8]:
text1

'\nConcentration of Risk: Credit Risk\nFinancial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,\nrestricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds\nor on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021\nand December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note\nhedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.\nConcentration of Risk: Supply Risk\nWe are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our\nproducts in a timely manner at prices, quality levels and volumes acc

In [9]:
text1.lower()

'\nconcentration of risk: credit risk\nfinancial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,\nrestricted cash, accounts receivable, convertible note hedges, and interest rate swaps. our cash balances are primarily invested in money market funds\nor on deposit at high credit quality financial institutions in the u.s. these deposits are typically in excess of insured limits. as of september 30, 2021\nand december 31, 2020, no entity represented 10% or more of our total accounts receivable balance. the risk of concentration for our convertible note\nhedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.\nconcentration of risk: supply risk\nwe are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our\nproducts in a timely manner at prices, quality levels and volumes acc

In [10]:
len(set(text1.lower().split()))

113
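
To see the effect in isolation, here is a small sketch using the three spellings mentioned above. The token list is invented for illustration.

```python
# Hypothetical tokens: the same word in three casings, plus one other word.
tokens = ["Apple", "apple", "APPLE", "banana"]

unique_raw = set(tokens)
unique_lower = {token.lower() for token in tokens}

print(len(unique_raw), len(unique_lower))
```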

### Importance of Stop Word Removal During Tokenization

Tokenization returns every word in the text, including many that carry little information on their own. **Stop word removal** filters out these frequent, low-content words so that downstream tasks work with a smaller and more relevant set of tokens.

### What Are Stop Words?

Stop words are common words that appear frequently in a language but carry very little meaningful information on their own. Examples of stop words include:

- Articles: "the", "a", "an"
- Prepositions: "in", "on", "at"
- Pronouns: "he", "she", "it"
- Conjunctions: "and", "but", "or"

These words, while essential for grammar and sentence structure, are often removed during tokenization because they don't contribute significantly to the meaning of a text in most NLP tasks.


In [11]:
import nltk
from nltk.corpus import stopwords
 
nltk.download('stopwords')



[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nmadali/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [13]:
len(text1.lower().split() )

176

In [14]:
len([token for token in text1.lower().split() if token not in stopwords.words('english')])

113

In [15]:
len(set([token for token in text1.lower().split() if token not in stopwords.words('english')]))

90
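
`stopwords.words('english')` returns a list, and the comprehensions above call it once per token. Building a set once and reusing it makes each membership test cheaper. The sketch below uses a small hand-written stop list as a stand-in so that it runs without the NLTK download; in the notebook you would presumably pass `stopwords.words('english')` to `set()` instead.

```python
# Stand-in stop list (an assumption for this sketch, not NLTK's full list).
stop_words = {"a", "of", "the", "to", "that", "and"}

sample = "financial instruments that potentially subject us to a concentration of credit risk"

# The set is built once and reused for every token.
filtered = [token for token in sample.split() if token not in stop_words]

print(filtered)
```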

### Importance of Regular Expressions for Removing Inadequate Tokens (e.g., HTML Tags) During Tokenization

In Natural Language Processing (NLP), one of the challenges when processing raw text is dealing with "inadequate tokens" that do not add meaningful information to the analysis. These tokens might include special characters, formatting elements, or structural components such as HTML tags. To clean the text and prepare it for further analysis, we often use **regular expressions (regex)** to remove these types of tokens efficiently.

### What Are Inadequate Tokens?

Inadequate tokens refer to elements in the text that do not contribute to its core meaning. Examples of such tokens include:

- **HTML Tags**: `<div>`, `<p>`, `<a>`, etc.
- **CSS Classes/IDs**: `class="content"`, `id="main"`, etc.
- **JavaScript Code**: `<script>`, `onclick="function()"`, etc.
- **URLs and Email Addresses**: `https://www.example.com`, `someone@example.com`
- **Special Characters**: Punctuation, extra spaces, non-alphanumeric characters that do not serve a semantic purpose.

These elements are often artifacts of how the text is structured or formatted on the web and do not add to the meaning of the text in most NLP applications.


In [16]:
import re

In [17]:
res = re.sub(r'[^a-zA-Z]', ' ', text3)
print(res.lower().split())

['codebasics', 'hello', 'i', 'am', 'having', 'an', 'issue', 'with', 'my', 'order']


In [18]:
text3

'codebasics: Hello, I am having an issue with my order # 412889912'
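
The pattern `[^a-zA-Z]` keeps letters only, so it also discards the order number in `text3`. When the goal is to drop markup rather than every non-letter, narrower patterns can help. The sketch below removes HTML tags and URLs from a made-up snippet. The patterns are simple illustrations and would not cover every case found in real web pages, where a dedicated HTML parser is usually more robust.

```python
import re

# Hypothetical HTML snippet, not taken from the notebook's data.
raw = '<p>Order <b>412889912</b> shipped. Track it at https://www.example.com/track</p>'

no_tags = re.sub(r'<[^>]+>', ' ', raw)              # drop anything that looks like a tag
no_urls = re.sub(r'https?://\S+', ' ', no_tags)     # drop http(s) URLs
tokens = re.findall(r'[a-z0-9]+', no_urls.lower())  # keep runs of letters and digits

print(tokens)
```

Tags are removed before URLs here so that a closing tag directly after a URL is not swallowed by the `\S+` in the URL pattern.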

### Lemmatization in Natural Language Processing

Lemmatization is an essential step in Natural Language Processing (NLP) that focuses on reducing words to their base or dictionary form, known as the **lemma**. Unlike stemming, which simply truncates words to remove suffixes, lemmatization considers the context and meaning of words to produce their correct base form. This step is crucial for normalizing words and improving the quality of text analysis in many NLP tasks.

### What is Lemmatization?

Lemmatization maps a word to its lemma, the base form under which it is listed in a dictionary. For example:

- **"running"** → **"run"**
- **"better"** → **"good"**
- **"geese"** → **"goose"**

Lemmatization typically uses part-of-speech (POS) tagging to understand the context of a word and map it to the correct base form. For instance, "running" is lemmatized to "run" when it is a verb, and "better" is lemmatized to "good" when it is an adjective. Note that the cells below use NLTK's `PorterStemmer`, which performs stemming rather than lemmatization.


In [19]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

In [20]:
text1.lower().split()

['concentration',
 'of',
 'risk:',
 'credit',
 'risk',
 'financial',
 'instruments',
 'that',
 'potentially',
 'subject',
 'us',
 'to',
 'a',
 'concentration',
 'of',
 'credit',
 'risk',
 'consist',
 'of',
 'cash,',
 'cash',
 'equivalents,',
 'marketable',
 'securities,',
 'restricted',
 'cash,',
 'accounts',
 'receivable,',
 'convertible',
 'note',
 'hedges,',
 'and',
 'interest',
 'rate',
 'swaps.',
 'our',
 'cash',
 'balances',
 'are',
 'primarily',
 'invested',
 'in',
 'money',
 'market',
 'funds',
 'or',
 'on',
 'deposit',
 'at',
 'high',
 'credit',
 'quality',
 'financial',
 'institutions',
 'in',
 'the',
 'u.s.',
 'these',
 'deposits',
 'are',
 'typically',
 'in',
 'excess',
 'of',
 'insured',
 'limits.',
 'as',
 'of',
 'september',
 '30,',
 '2021',
 'and',
 'december',
 '31,',
 '2020,',
 'no',
 'entity',
 'represented',
 '10%',
 'or',
 'more',
 'of',
 'our',
 'total',
 'accounts',
 'receivable',
 'balance.',
 'the',
 'risk',
 'of',
 'concentration',
 'for',
 'our',
 'convertibl

In [21]:
stemmer.stem('concentration')

'concentr'

In [22]:
len(set([stemmer.stem(token) for token in text1.lower().split() if token not in stopwords.words('english')]))

88
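
The cells above use NLTK's Porter stemmer, which strips suffixes by rule and can return non-words such as `concentr`. For contrast, the toy sketch below mimics dictionary-based lemmatization with a tiny hand-written lookup keyed by word and part of speech. The table and the `toy_lemmatize` helper are hypothetical; a real lemmatizer relies on much larger lexical resources.

```python
# Toy lemma table keyed by (word, part of speech). Entirely hand-written.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("geese", "NOUN"): "goose",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("running", "VERB"))
print(toy_lemmatize("better", "ADJ"))
print(toy_lemmatize("better", "ADV"))  # not in the table, so returned unchanged
```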

### Using `en_core_web_sm` for Natural Language Processing

`en_core_web_sm` is a small pre-trained English pipeline from **spaCy**, a popular open-source library for natural language processing (NLP) in Python. The pipeline performs tasks such as tokenization, part-of-speech tagging, named entity recognition, and dependency parsing. The `sm` in the name stands for "small", indicating that it is designed to run efficiently with limited computational resources.

### Key Features of `en_core_web_sm`

1. **Tokenization**: Breaks text into individual words, punctuation marks, and symbols.
2. **Part-of-Speech (POS) Tagging**: Identifies the grammatical category of each word in a sentence (e.g., noun, verb, adjective).
3. **Named Entity Recognition (NER)**: Identifies and classifies entities in text, such as persons, organizations, and locations.
4. **Dependency Parsing**: Analyzes the syntactic structure of a sentence and identifies the relationships between words.
5. **Lemmatization**: Reduces words to their base form (e.g., "running" becomes "run").
6. **Sentence Segmentation**: Splits text into sentences.


In [25]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [26]:
import spacy



# Load the small English pipeline and run it on the lowercased text
nlp = spacy.load('en_core_web_sm')
doc = nlp( text1.lower())

In [44]:
vocab=[]
for token in doc:
    if not (token.is_stop or token.is_punct):
        vocab.append(token.text)
    

In [47]:
len(set(vocab))

86

### Creating a Vocabulary of Unique Words After Tokenization

Once the text has been properly tokenized and unnecessary elements like stop words or HTML tags have been removed, the next important step in Natural Language Processing (NLP) is to create a **vocabulary**. A vocabulary is essentially a collection of unique words that appear in a given corpus or dataset. This step is crucial for transforming raw text into a structured form that can be used by machine learning models.

### What is a Vocabulary?

A vocabulary is a set of distinct words that have been extracted from the tokenized text. These words can then be used as features in various NLP tasks such as text classification, sentiment analysis, or machine translation. The vocabulary represents all the terms that the model will recognize and work with.

For example, the cells below build a vocabulary from the three sample texts defined earlier:


In [74]:
vocab_tmp=[]
for text in [text1, text2, text3]:
    doc = nlp( text.lower())
    for token in doc:
        if not (token.is_stop or token.is_punct):
            vocab_tmp.append(token.text)

In [73]:
'codebasics' in vocab_tmp

True

In [54]:
vocab={token:idx for idx,token in enumerate(set(vocab_tmp))}

In [56]:
vocab_inv={idx:token for token,idx in vocab.items()}

In [58]:
vocab_inv[0]

'hedges'

In [55]:
vocab

{'hedges': 0,
 'https://twitter.com/elonmusk': 1,
 'receivable': 2,
 'having': 3,
 'funds': 4,
 'https://twitter.com/teslarati': 5,
 'financial': 6,
 'necessary': 7,
 'prospects': 8,
 'single': 9,
 'consist': 10,
 'september': 11,
 'hello': 12,
 'total': 13,
 'related': 14,
 'business': 15,
 'volumes': 16,
 'codebasics': 17,
 'dependent': 18,
 'excess': 19,
 'products': 20,
 'balances': 21,
 'effect': 22,
 'order': 23,
 'highly': 24,
 '2020': 25,
 'musk': 26,
 'issue': 27,
 'typically': 28,
 'acceptable': 29,
 'twitter': 30,
 'entity': 31,
 'components': 32,
 'efficiently': 33,
 'inability': 34,
 'material': 35,
 'rate': 36,
 'deposit': 37,
 'market': 38,
 '30': 39,
 'deliver': 40,
 'marketable': 41,
 '31': 42,
 'insured': 43,
 'limits': 44,
 'balance': 45,
 'convertible': 46,
 'invested': 47,
 'leading': 48,
 'subject': 49,
 'banks': 50,
 'high': 51,
 'https://twitter.com/dummy_2_tesla': 52,
 'manage': 53,
 'interest': 54,
 '412889912': 55,
 'note': 56,
 'mitigated': 57,
 'concentrati

In [53]:
len(vocab)

105
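
Because `set` iteration order for strings can differ between Python sessions, the indices shown above may change from run to run. If reproducible indices matter, one option is to sort the unique tokens before numbering them. A minimal sketch, with a short hand-written token list standing in for `vocab_tmp` (one duplicate is included to show deduplication):

```python
# Stand-in for vocab_tmp; "order" appears twice on purpose.
tokens = ["codebasics", "hello", "having", "issue", "order", "412889912", "order"]

# Sorting gives every unique token a stable index.
token_to_id = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
id_to_token = {idx: token for token, idx in token_to_id.items()}

print(token_to_id)
print(id_to_token[0])
```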

### Bag of Words (BoW) Using NLTK

The **Bag of Words (BoW)** model is one of the simplest and most widely used techniques for text representation in Natural Language Processing (NLP). It converts text documents into a matrix of token counts. Each document is represented by a vector with one entry per vocabulary word, and each entry stores how often that word occurs in the document. The model is called a "bag of words" because it disregards the order and structure of words in the text and keeps only their frequencies.

In this section, we'll outline how to build a Bag of Words model with **NLTK** functions, then build a vector by hand from the spaCy tokens and the vocabulary created above.

### Steps to Implement Bag of Words Using NLTK

1. **Tokenization**:  
   The first step is to tokenize the text, breaking it down into individual words or tokens. NLTK provides the `word_tokenize` function for this purpose.

2. **Removing Stop Words**:  
   Once the text is tokenized, we often remove stop words (common words like "the", "is", etc.) because they don't provide useful information for analysis. NLTK provides a predefined list of stop words that we can filter out.

3. **Creating the Vocabulary**:  
   The next step is to create the vocabulary (the list of unique words) from the text. This vocabulary will serve as the basis for our Bag of Words representation.

4. **Vectorizing the Text**:  
   After preparing the vocabulary, each document is represented as a vector, where each element in the vector corresponds to the frequency of a specific word in the vocabulary.
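
The four steps can be sketched end to end in plain Python. To keep the example self-contained, a regular expression stands in for `word_tokenize` and a short hand-written stop list stands in for NLTK's. Both are assumptions of this sketch, as are the two sample documents and the names `bow_docs`, `bow_tokenize`, and `bow_vocab`.

```python
import re
from collections import Counter

# Hypothetical documents and a stand-in stop list.
bow_docs = ["The cash is on deposit.", "Cash and cash equivalents are the risk."]
bow_stop_words = {"the", "is", "on", "and", "are"}

def bow_tokenize(text):
    # Steps 1 and 2: lowercase, extract alphanumeric runs, drop stop words.
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in bow_stop_words]

tokenized = [bow_tokenize(doc) for doc in bow_docs]

# Step 3: vocabulary with a stable index per unique token.
unique_tokens = sorted({tok for doc in tokenized for tok in doc})
bow_vocab = {tok: idx for idx, tok in enumerate(unique_tokens)}

# Step 4: one count vector per document, ordered by vocabulary index.
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[tok] for tok in unique_tokens])

print(bow_vocab)
print(vectors)
```

Unlike a presence check, the `Counter` keeps repeated words, so "cash" contributes 2 to the second document's vector.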

### Example: Building a Bag of Words Model with NLTK

Let's go through an example of creating a Bag of Words model for a set of text documents using NLTK.

#### 1. Import Necessary Libraries and Download Resources

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
```


In [59]:
text3.lower().split()

['codebasics:',
 'hello,',
 'i',
 'am',
 'having',
 'an',
 'issue',
 'with',
 'my',
 'order',
 '#',
 '412889912']

In [121]:
len(text3.lower().split())

12

In [122]:
vec=np.zeros(len(vocab))

# Count every occurrence of each in-vocabulary token, so repeated words add up
tokens = [token.text  for token in nlp( text3.lower())  if not (token.is_stop or token.is_punct)]
for token in tokens:
    if token in vocab:
        vec[vocab[token]]+=1

In [123]:
'codebasics' in vocab

True

In [124]:
vocab['codebasics']

17

In [125]:
text3.lower().split()

['codebasics:',
 'hello,',
 'i',
 'am',
 'having',
 'an',
 'issue',
 'with',
 'my',
 'order',
 '#',
 '412889912']

In [126]:
text3

'codebasics: Hello, I am having an issue with my order # 412889912'

In [127]:
vec.shape

(105,)

In [128]:
np.where(vec==1)[0]

array([ 3, 12, 17, 23, 27, 55])

In [129]:
vocab_inv[3]

'having'

In [130]:
[vocab_inv[idx] for idx in  np.where(vec==1)[0]]

['having', 'hello', 'codebasics', 'order', 'issue', '412889912']