# Tokenization

Tokenization is the process of breaking a stream of text into smaller units called tokens. Tokens can be words, phrases, or symbols. Tokenization is a fundamental step in natural language processing (NLP) tasks, such as text classification, sentiment analysis, and machine translation.

There are many different ways to tokenize text. Some common methods include:

* **Word tokenization:** This is the most common type of tokenization. It breaks text into individual words.
* **Sentence tokenization:** This breaks text into individual sentences.
* **Character tokenization:** This breaks text into individual characters.
* **N-gram tokenization:** This breaks text into overlapping groups of N consecutive words. For example, 2-gram (bigram) tokenization would break the sentence "I love you" into the tokens "I love" and "love you".
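
The tokenization methods listed above can be sketched with the standard library alone. This is a minimal illustration, not how spaCy tokenizes: the function names and the regular expressions are assumptions chosen for this example, and the sentence splitter in particular is naive (it would break on abbreviations such as "Dr.").

```python
import re

def word_tokenize(text):
    # Words are runs of letters, digits, or apostrophes;
    # every other non-space character becomes its own token.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def sentence_tokenize(text):
    # Naive rule: split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def char_tokenize(text):
    # Every character is a token.
    return list(text)

def ngrams(tokens, n):
    # Slide a window of n tokens across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = word_tokenize("I love you")
print(words)                 # ['I', 'love', 'you']
print(ngrams(words, 2))      # ['I love', 'love you']
print(char_tokenize("NLP"))  # ['N', 'L', 'P']
print(sentence_tokenize("I love you. Do you love me?"))
```

Production tokenizers handle many more cases (contractions, URLs, abbreviations), which is one reason to prefer a library tokenizer over hand-written rules.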

The type of tokenization that is used depends on the specific NLP task. For example, word tokenization is typically used for text classification tasks, while sentence tokenization is typically used for machine translation tasks.

Tokenization is an important step in NLP because most downstream components, such as taggers, parsers, and vectorizers, operate on tokens rather than on a raw stream of characters. Breaking text into smaller units gives these components well-defined items to count, tag, and compare. This is essential for tasks such as text classification, sentiment analysis, and machine translation.

Here are some examples of tokenization:

* The sentence "I love you" can be tokenized as the words "I", "love", and "you".
* The phrase "the quick brown fox" can be tokenized as the words "the", "quick", "brown", and "fox".
* The acronym "NLP" is a single token under word tokenization; under character tokenization it becomes the tokens "N", "L", and "P".

Tokenization is a powerful tool that can be used to improve the performance of NLP tasks. By breaking text into smaller units, tokenization makes it easier for computers to understand the structure of text and the meaning of words and phrases.

In [1]:
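import spacy

# Load the small English pipeline (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")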

In [2]:
text = "Tokenization is a powerful tool that can be used to improve the performance of NLP tasks. By breaking text into smaller units, tokenization makes it easier for computers to understand the structure of text and the meaning of words and phrases."
doc = nlp(text)

# Print the first ten tokens of the parsed document
for token in doc[:10]:
    print(token)

Tokenization
is
a
powerful
tool
that
can
be
used
to


In [3]:
token1=doc[0]

In [4]:
dir(token1)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le

In [5]:
text = "Tokenization is a powerful tool that can be used to improve the performance of NLP tasks. By breaking text into smaller units, tokenization makes it easier for computers to understand the structure of text and the meaning of words and phrases."
doc = nlp(text)

# Show each token with its coarse-grained POS tag and a readable explanation of the tag
for token in doc[:10]:
    print(token, "|", token.pos_, "|", spacy.explain(token.pos_))

Tokenization | NOUN | noun
is | AUX | auxiliary
a | DET | determiner
powerful | ADJ | adjective
tool | NOUN | noun
that | PRON | pronoun
can | AUX | auxiliary
be | AUX | auxiliary
used | VERB | verb
to | PART | particle


# Stemming 

Stemming is a process in natural language processing (NLP) that reduces inflected words to a common stem, usually by stripping affixes. The stem does not have to be a valid word; it only has to be shared by related forms. For example, the words "running" and "runs" both reduce to the stem "run". An irregular form such as "ran" is usually left unchanged by a stemmer; mapping it to "run" requires lemmatization, which relies on a vocabulary and morphological analysis. Stemming is often used in tasks such as text classification and information retrieval.

There are two main types of stemming algorithms:

* **Rule-based stemmers:** These algorithms use a set of rules to remove inflectional endings from words. For example, a rule-based stemmer might have a rule that removes the "-ing" ending from verbs.
* **Statistical stemmers:** These algorithms use statistical methods to find the most likely stem for a word. For example, a statistical stemmer might look at the frequency of different word endings in a corpus of text to determine the most likely stem for a word.
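
A rule-based stemmer can be sketched in a few lines. The suffix rules below are hypothetical and chosen only for illustration; established stemmers such as the Porter stemmer use many more rules and conditions. The example also shows how crude rules go wrong: "running" becomes "runn", and the irregular form "ate" is not reduced at all.

```python
# Hypothetical suffix rules (suffix, replacement), tried in order.
SUFFIX_RULES = [("ing", ""), ("ies", "y"), ("es", ""), ("s", ""), ("ed", "")]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, unless the stem would become too short."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)] + replacement
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

for w in ["eating", "eats", "meeting", "running", "ate", "ability"]:
    print(w, "|", toy_stem(w))
```

The `min_stem_len` guard is an assumption of this sketch; it keeps short words such as "is" from being reduced to a single letter.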

Stemming is a useful technique for NLP, but it is not perfect. A stemmer can over-stem (conflate unrelated words) or under-stem (fail to conflate related forms), so it is important to evaluate a stemming algorithm on a specific task before relying on it.

In [6]:
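# spaCy does not include a stemmer; token.lemma_ gives each token's lemma (dictionary form) instead
text = "eating eats eat ate adjustable rafting ability meeting better"
doc = nlp(text)

for token in doc:
    print(token, "|", token.lemma_)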

eating | eat
eats | eat
eat | eat
ate | eat
adjustable | adjustable
rafting | raft
ability | ability
meeting | meeting
better | well


# POS (Part of Speech)

Part-of-speech (POS) tagging is the natural language processing (NLP) task of assigning each word in a text its part of speech. Parts of speech are lexical categories, such as noun, verb, and adjective, that describe the syntactic role of a word in a sentence.

POS tagging is a fundamental NLP task and is useful for many other NLP tasks, such as named entity recognition, parsing, and machine translation.

There are two main approaches to POS tagging:

* **Rule-based** POS tagging uses a set of hand-crafted rules to assign POS tags to words.
* **Statistical** POS tagging uses a statistical model to learn the probability of each word being assigned a particular POS tag.

Statistical POS taggers are generally more accurate than rule-based POS taggers, but they require a large annotated corpus to train the statistical model.
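
To make the rule-based approach concrete, here is a toy tagger that combines a small word list with suffix rules. The lexicon, the suffix rules, and the default tag are all hypothetical and far too small for real use; the sketch only illustrates what "hand-crafted rules" can look like.

```python
# Hypothetical mini-lexicon and suffix rules, for illustration only.
LEXICON = {"i": "PRON", "the": "DET", "a": "DET", "is": "AUX", "am": "AUX", "to": "PART"}
SUFFIX_TAGS = [("ing", "VERB"), ("ed", "VERB"), ("ly", "ADV"), ("ful", "ADJ"), ("tion", "NOUN")]

def toy_pos_tag(tokens):
    tagged = []
    for token in tokens:
        lower = token.lower()
        if lower in LEXICON:
            tag = LEXICON[lower]
        else:
            # The first matching suffix wins; unknown words default to NOUN.
            tag = next((t for suffix, t in SUFFIX_TAGS if lower.endswith(suffix)), "NOUN")
        tagged.append((token, tag))
    return tagged

print(toy_pos_tag("Tokenization is a powerful tool".split()))
# [('Tokenization', 'NOUN'), ('is', 'AUX'), ('a', 'DET'), ('powerful', 'ADJ'), ('tool', 'NOUN')]
```

Rules like these ignore context, so they cannot tell a noun use of a word from a verb use of the same word. That limitation is one reason statistical taggers are usually preferred.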

Here are some of the benefits of POS tagging:

* It can help to disambiguate words that belong to more than one part of speech. For example, the word "bank" can be a noun or a verb. POS tagging can help to determine which use is intended in a particular context.
* It can help to identify the grammatical structure of a sentence. For example, POS tags are a common input to parsers that identify the subject, verb, and object of a sentence.
* It can help to identify the relationships between words in a sentence. For example, POS tagging can help to identify the conjunctions that connect words and phrases.
* It can help to improve the accuracy of other NLP tasks such as named entity recognition and machine translation.

Here are some of the challenges of POS tagging:

* The accuracy of POS tagging can vary depending on the language. Some languages are more ambiguous than others, which makes it more difficult to assign POS tags accurately.
* The accuracy of POS tagging can also vary depending on the corpus. A tagger trained on a consistently annotated corpus will produce more accurate POS tags than one trained on a poorly annotated corpus.
* POS tagging can be computationally expensive. Statistical POS taggers require a large annotated corpus to train the statistical model, and building such a corpus and training on it can be time-consuming and expensive.

Despite the challenges, POS tagging is a valuable NLP task that can be used to improve the accuracy of many other NLP tasks.

In [7]:
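text1 = 'I am learning english'
text2 = 'I am ate apple'
doc = nlp(text2)  # swap in text1 to compare the tags

# Show each token with its fine-grained tag (token.tag_) and an explanation of the tag
for token in doc:
    print(token, '|', token.tag_, '|', spacy.explain(token.tag_))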

I | PRP | pronoun, personal
am | VBP | verb, non-3rd person singular present
ate | VBN | verb, past participle
apple | NN | noun, singular or mass


# Named Entity Recognition (NER)
Named Entity Recognition (NER) is a subtask of natural language processing (NLP) that identifies named entities in text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. NER is also used to extract structured information from unstructured text and is a key task in many NLP applications, such as question answering, information retrieval, and machine translation.

Here are some examples of named entities:

* Person names: John Smith, Jane Doe
* Organization names: Google, Microsoft, Apple
* Location names: New York City, London, Paris
* Medical codes: ICD-10, CPT, HCPCS
* Time expressions: 2023-04-29, 10:00 AM, 5 minutes
* Quantities: 100, 5000, 1000000
* Monetary values: $10, $50, $100
* Percentages: 10%, 20%, 30%

NER is a challenging task because it requires the ability to understand the context of a text and to identify the boundaries of named entities. There are a number of different approaches to NER, including rule-based, statistical, and machine learning-based approaches.

Rule-based NER systems use a set of hand-crafted rules to identify named entities. These rules are typically based on the knowledge of a particular domain, such as medicine or finance. Statistical NER systems use classical sequence models, such as hidden Markov models or conditional random fields, trained on a large corpus of text that has been labeled with named entities, typically with hand-engineered features. Machine learning-based NER systems, in the sense used here, train a neural network on a labeled corpus and learn their features from the data.
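
As a minimal sketch of the rule-based approach, the snippet below recognizes two entity types with regular expressions. The patterns and the function name `toy_ner` are assumptions made for this example; they cover only simple dollar amounts and percentages and say nothing about how spaCy's statistical recognizer works.

```python
import re

# Hypothetical patterns for two entity types.
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)?(?:\s(?:million|billion))?")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?%")),
]

def toy_ner(text):
    entities = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            entities.append((match.start(), match.group(), label))
    # Sort by start offset so entities come back in reading order.
    return [(span, label) for _, span, label in sorted(entities)]

print(toy_ner("Tesla Inc is going to acquire twitter for $45 billion, a 10% premium"))
# [('$45 billion', 'MONEY'), ('10%', 'PERCENT')]
```

Patterns like these work for entities with a regular surface form. Names of people and organizations rarely follow such patterns, which is where trained models become necessary.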

NER is a powerful tool that can be used to extract information from text. It is used in a variety of applications, such as:

* Question answering: NER can be used to identify the entities that are mentioned in a question. This information can then be used to find the relevant information in a knowledge base.
* Information retrieval: NER can be used to identify the entities that are mentioned in a document. This information can then be used to rank the document in a search results list.
* Machine translation: NER can be used to identify the entities that are mentioned in a source text. This information can then be used to translate the entities into the target language.


In [17]:
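from spacy import displacy

doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")

# List each detected entity with its label and an explanation of the label
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

# Highlight the entities inline (renders HTML when run in a notebook)
displacy.render(doc, style='ent')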

Tesla Inc | ORG | Companies, agencies, institutions, etc.
$45 billion | MONEY | Monetary values, including unit


In [31]:
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion Real Madrid club Barcelona, Chelsea,Uzbekistan,Fergana,Oxford")

for ent in doc.ents:
    print(ent.text, "|", ent.label_ , "|", spacy.explain(ent.label_))
    
# In a notebook, displacy.render displays the markup itself and returns None, so it is not wrapped in print()
displacy.render(doc, style='ent')

Tesla Inc | ORG | Companies, agencies, institutions, etc.
Twitter Inc | ORG | Companies, agencies, institutions, etc.
$45 billion | MONEY | Monetary values, including unit
Real Madrid | ORG | Companies, agencies, institutions, etc.
Barcelona | GPE | Countries, cities, states
Chelsea | GPE | Countries, cities, states
Uzbekistan | GPE | Countries, cities, states
Fergana | GPE | Countries, cities, states
Oxford | ORG | Companies, agencies, institutions, etc.

# Bag of Words
A bag-of-words (BoW) model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.

The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier. An early reference to "bag of words" in a linguistic context can be found in Zellig Harris's 1954 article "Distributional Structure". The bag-of-words model is one example of a vector space model.

To create a bag-of-words representation of a text, the following steps are typically performed:

1. The text is tokenized, i.e., it is split into individual words or tokens.
2. Stop words are removed. Stop words are common words that do not add much information to the representation, such as "the", "a", and "of".
3. The words are stemmed or lemmatized. Stemming and lemmatization reduce words to a base form, so that variants such as "eating" and "eats" are counted as the same word.
4. The words are counted. The number of times each word appears in the text is recorded.
5. Optionally, the words are sorted by frequency in descending order, i.e., the most frequent words are listed first. This is convenient for inspection but is not required for the vector representation.

The bag-of-words representation of a text is a vector of word counts. The length of the vector is equal to the number of words in the vocabulary. The value of each element in the vector is the number of times the corresponding word appears in the text.
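
The vector described above can be built with the standard library alone. This is a sketch under simple assumptions: tokens are separated by whitespace, the text is lowercased, and the helper names `build_vocabulary` and `bag_of_words` are invented for this example.

```python
from collections import Counter

def build_vocabulary(documents):
    # Sorted unique words, each mapped to a fixed position in the vector.
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def bag_of_words(document, vocabulary):
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector

docs = ["thor eat pizza", "loki tall", "loki eat pizza pizza"]
vocab = build_vocabulary(docs)
print(vocab)  # {'eat': 0, 'loki': 1, 'pizza': 2, 'tall': 3, 'thor': 4}
print([bag_of_words(d, vocab) for d in docs])
# [[1, 0, 1, 0, 1], [0, 1, 0, 1, 0], [1, 1, 2, 0, 0]]
```

The third vector holds a 2 in the "pizza" position, showing that the model keeps multiplicity while discarding word order.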

The bag-of-words model is a simple and effective way to represent text. It is easy to understand and implement, and it can be used with a variety of machine learning algorithms. However, the bag-of-words model does not take into account the order of words in a text, which can be important for some tasks.


# N-grams

In [49]:
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: the vocabulary contains single words (unigrams) only
v = CountVectorizer()

v.fit(["Thor Hatdowala is looking for job"])
v.vocabulary_

{'thor': 5, 'hatdowala': 1, 'is': 2, 'looking': 4, 'for': 0, 'job': 3}

In [55]:
# ngram_range=(1, 2) adds bigrams to the vocabulary alongside unigrams
v = CountVectorizer(ngram_range=(1, 2))

v.fit(["Thor Hatdowala is looking for job"])
v.vocabulary_

{'thor': 9,
 'hatdowala': 2,
 'is': 4,
 'looking': 7,
 'for': 0,
 'job': 6,
 'thor hatdowala': 10,
 'hatdowala is': 3,
 'is looking': 5,
 'looking for': 8,
 'for job': 1}

In [63]:
corpus = [
    "Thor ate pizza",
    "Loki is tall",
    "Loki is eating pizza"
]


def preprocess(text):
    """Drop stop words and punctuation, then join the lemmas of the remaining tokens."""
    doc = nlp(text)
    filtered_tokens = [
        token.lemma_ for token in doc
        if not (token.is_stop or token.is_punct)
    ]
    return " ".join(filtered_tokens)

In [64]:
preprocess("Loki is eating pizza")

'Loki eat pizza'

In [65]:
corpus_preprocessed = [preprocess(text) for text in corpus]
print(corpus_preprocessed)

['thor eat pizza', 'Loki tall', 'Loki eat pizza']


In [66]:
v = CountVectorizer(ngram_range=(1, 2))

v.fit(corpus_preprocessed)
v.vocabulary_

{'thor': 7,
 'eat': 0,
 'pizza': 5,
 'thor eat': 8,
 'eat pizza': 1,
 'loki': 2,
 'tall': 6,
 'loki tall': 4,
 'loki eat': 3}

In [70]:
v.transform(["Thor eat pizza"]).toarray()

array([[1, 1, 0, 0, 0, 1, 0, 1, 1]], dtype=int64)
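
To see where that array comes from, the count vector can be reproduced by hand. The sketch below copies the fitted vocabulary from the output above and counts unigrams and bigrams against it. Splitting on whitespace is a simplifying assumption; `CountVectorizer` uses its own token pattern, which gives the same tokens for this short example.

```python
# Vocabulary copied from the fitted CountVectorizer output (n-gram -> column index).
vocabulary = {'thor': 7, 'eat': 0, 'pizza': 5, 'thor eat': 8, 'eat pizza': 1,
              'loki': 2, 'tall': 6, 'loki tall': 4, 'loki eat': 3}

def count_vector(text, vocabulary, ngram_range=(1, 2)):
    tokens = text.lower().split()
    vector = [0] * len(vocabulary)
    low, high = ngram_range
    for n in range(low, high + 1):
        # Slide a window of n tokens over the text and count known n-grams.
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in vocabulary:
                vector[vocabulary[gram]] += 1
    return vector

print(count_vector("Thor eat pizza", vocabulary))  # [1, 1, 0, 0, 0, 1, 0, 1, 1]
```

Columns 0, 1, 5, 7, and 8 are set because they correspond to "eat", "eat pizza", "pizza", "thor", and "thor eat".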

# TF-IDF
TF-IDF stands for term frequency-inverse document frequency. It is a statistical measure that is used to evaluate the importance of a word in a document within a collection or corpus. TF-IDF is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

The tf-idf is the product of two statistics, term frequency and inverse document frequency.

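* Term frequency (tf): This is the number of times a word appears in a document. Some variants normalize this count by the length of the document.
* Inverse document frequency (idf): This is the logarithm of the number of documents in the corpus divided by the number of documents that contain the word. Words that occur in fewer documents receive a higher idf.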

The tf-idf of a word in a document is calculated as follows:

```
tf-idf = tf * idf
```
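
The formula can be checked with a few lines of arithmetic. The sketch below implements the textbook idf as defined above, next to the smoothed variant `ln((1 + N) / (1 + df)) + 1` that scikit-learn's `TfidfVectorizer` applies by default. The function names and the corpus statistics passed in (7 documents, 2 of them containing "eating") are assumptions made for this example.

```python
import math

def textbook_idf(n_docs, doc_freq):
    # log(N / df): number of documents divided by documents containing the term
    return math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    # ln((1 + N) / (1 + df)) + 1: the smoothed variant
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tf_idf(term, document_tokens, n_docs, doc_freq, idf=textbook_idf):
    tf = document_tokens.count(term)  # raw term frequency
    return tf * idf(n_docs, doc_freq)

tokens = "i am eating biryani and you are eating grapes".split()
print(tf_idf("eating", tokens, n_docs=7, doc_freq=2))
print(tf_idf("eating", tokens, n_docs=7, doc_freq=2, idf=smoothed_idf))
```

With the textbook definition, a word that appears in every document gets an idf of 0; the smoothed variant keeps it at 1, so such words are down-weighted rather than discarded. `TfidfVectorizer` also L2-normalizes each document vector by default, so its final scores differ from the raw products computed here.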

TF-IDF is a powerful tool that can be used to improve the performance of a variety of natural language processing (NLP) tasks. For example, it can be used to:

* Rank documents in a search results list
* Identify important words in a document
* Classify documents into different categories
* Extract information from text

TF-IDF is a simple but effective way to weight the words in a document, and it remains a widely used baseline in NLP.

In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

In [93]:
# Learn the vocabulary and IDF weights, and compute the TF-IDF matrix in one step
v = TfidfVectorizer()
transformed_output = v.fit_transform(corpus)
print(v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}


In [96]:
all_feature_names = v.get_feature_names_out()

# Print each term with its learned IDF weight; terms found in fewer documents score higher
for word in all_feature_names:
    index = v.vocabulary_.get(word)
    print(f"{word} {v.idf_[index]}")

already 2.386294361119891
am 2.386294361119891
amazon 2.386294361119891
and 2.386294361119891
announcing 1.2876820724517808
apple 2.386294361119891
are 2.386294361119891
ate 2.386294361119891
biryani 2.386294361119891
dot 2.386294361119891
eating 1.9808292530117262
eco 2.386294361119891
google 2.386294361119891
grapes 2.386294361119891
iphone 2.386294361119891
ironman 2.386294361119891
is 1.1335313926245225
loki 2.386294361119891
microsoft 2.386294361119891
model 2.386294361119891
new 1.2876820724517808
pixel 2.386294361119891
pizza 2.386294361119891
surface 2.386294361119891
tesla 2.386294361119891
thor 2.386294361119891
tomorrow 1.2876820724517808
you 2.386294361119891


# Word Vectors
A word vector is a representation of a word as a vector of real numbers. Word vectors are typically learned from a large corpus of text, and they can be used to represent the meaning of words, to measure the similarity between words, and to perform other natural language processing tasks.
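
Similarity between word vectors is commonly measured with cosine similarity, the cosine of the angle between two vectors. The sketch below uses made-up three-dimensional vectors purely for illustration; real word vectors are learned from data and typically have tens to hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors; the values carry no real meaning.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "pizza": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))  # close to 1
print(cosine_similarity(toy_vectors["king"], toy_vectors["pizza"]))  # much lower
```

A value near 1 means the vectors point in almost the same direction, and a value near 0 means they are nearly orthogonal.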

There are a number of different ways to learn word vectors. One common approach, used by the skip-gram variant of word2vec, is to train a neural network to predict the context of a word, given the word itself. The network learns to represent the meaning of words by predicting the words that are likely to appear around them, so words that occur in similar contexts end up with similar vectors.

Another common approach to learning word vectors is to use a statistical method called latent semantic analysis (LSA). LSA uses a technique called singular value decomposition (SVD) to reduce the dimensionality of a matrix of word co-occurrences. The resulting vectors are then used to represent the meaning of words.

Word vectors have been shown to be effective for a variety of natural language processing tasks, including:

* **Text classification:** Word vectors can be used to represent the content of text documents, which can then be used to classify the documents into different categories.
* **Named entity recognition:** Word vectors can be used as input features for models that identify named entities in text, such as people, places, and organizations.
* **Machine translation:** Word vectors can be used as input representations for models that translate text from one language to another.
* **Question answering:** Word vectors can be used to match questions against relevant passages in text documents.

Word vectors are a powerful tool for natural language processing because they represent the meaning of words in a compact, dense form that downstream models can use directly.