1. What are Corpora?
2. What are Tokens?
3. What are Unigrams, Bigrams, Trigrams?
4. How to generate n-grams from text?
5. Explain Lemmatization
6. Explain Stemming
7. Explain Part-of-speech (POS) tagging
8. Explain Chunking or shallow parsing
9. Explain Noun Phrase (NP) chunking
10. Explain Named Entity Recognition

### What are Corpora?
**Corpora:** In natural language processing (NLP), a corpus (plural: "corpora") is a large collection of text data used to train and evaluate NLP models such as language models and speech recognition systems.
Corpora matter because they give a model a large amount of text that is representative of the target domain or task.
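
To make the idea concrete, here is a minimal sketch that treats a list of strings as a corpus and computes a few basic statistics. The three sentences are made up for illustration, and splitting on whitespace is an assumption that real corpora rarely satisfy this cleanly.

```python
from collections import Counter

# A tiny, made-up corpus: each string stands in for one document.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog barked",
]

# Split every document on whitespace and count word frequencies.
word_counts = Counter(word for doc in corpus for word in doc.split())

num_documents = len(corpus)             # number of documents
num_tokens = sum(word_counts.values())  # total word occurrences
vocabulary = sorted(word_counts)        # distinct words

print(num_documents, num_tokens, len(vocabulary))
print(word_counts.most_common(1))
```

Counts like these (corpus size, vocabulary size, most frequent words) are usually the first thing to check when judging whether a corpus suits a task.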

### What are Tokens?
In natural language processing (NLP), a token is a sequence of characters that represents a single unit of meaning in a text. Tokens are the basic building blocks of text data and are used in various NLP tasks such as text classification, language translation, and language generation. The process of breaking a text into tokens is called tokenization.

There are several types of tokens that can be used in NLP, including:

*   Words
*   Sentences
*   Subwords
*   Characters
*   N-grams

The choice of tokenization depends on the specific NLP task and the language being studied. For example, word-level tokenization is often used for tasks such as text classification, while character-level tokenization is sometimes used for tasks such as text generation.
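
The sketch below shows the same text tokenized at three of the granularities listed above, using only Python's `re` module. The regular expressions are simplifying assumptions, not a robust tokenizer. Subword tokenization is left out because it normally needs a vocabulary learned from a corpus.

```python
import re

text = "NLP is fun. Tokens are its building blocks."

# Sentence-level: split at whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word-level: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, including spaces.
characters = list(text)

print(sentences)
print(words)
print(characters[:5])
```

The same input yields 2 sentence tokens, 10 word tokens (punctuation included), and one character token per character.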




### What are Unigrams, Bigrams, Trigrams?
In natural language processing (NLP), unigrams, bigrams, and trigrams are tokens made of one, two, and three consecutive words, respectively.

**Unigrams:** A unigram is a single word token. For example, in the sentence "I love to play soccer", the unigrams would be ["I", "love", "to", "play", "soccer"].

**Bigrams:** A bigram is a sequence of two words. For example, in the sentence "I love to play soccer", the bigrams would be ["I love", "love to", "to play", "play soccer"].

**Trigrams:** A trigram is a sequence of three words. For example, in the sentence "I love to play soccer", the trigrams would be ["I love to", "love to play", "to play soccer"].

These types of tokens are used in various NLP tasks such as text classification, language translation, and language generation. They are often used as features in NLP models.

For example, if you want to predict the next word in a sentence given the previous two words, trigrams are more informative features than unigrams. Similarly, in sentiment analysis, bigrams can carry more information than unigrams because they capture the sentiment conveyed by a combination of words, such as "not good".
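
As a rough illustration of the next-word case, the sketch below counts trigrams in a made-up training text and predicts the word most often seen after a given pair of words. The helper name `predict_next` is hypothetical, and a real language model would also need smoothing for unseen pairs.

```python
from collections import Counter, defaultdict

# Toy training text (made up for illustration).
tokens = "i love to play soccer and i love to play chess and i love to read".split()

# Map each pair of consecutive words to a count of the words that follow it.
next_word_counts = defaultdict(Counter)
for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
    next_word_counts[(first, second)][third] += 1

def predict_next(first, second):
    """Return the most frequent word seen after (first, second), or None."""
    followers = next_word_counts.get((first, second))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_next("love", "to"))   # play
print(predict_next("i", "love"))    # to
print(predict_next("we", "love"))   # None (pair never seen)
```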

In summary, unigrams, bigrams, and trigrams differ in how much context they capture: bigrams and trigrams give the model more information than individual words alone.

### How to generate n-grams from text?

In [1]:
from nltk import ngrams

text = "I love to play hockey"
tokens = text.split()

# Generate bigrams
bigrams = ngrams(tokens, 2)
print(list(bigrams))

# Generate trigrams
trigrams = ngrams(tokens, 3)
print(list(trigrams))

[('I', 'love'), ('love', 'to'), ('to', 'play'), ('play', 'hockey')]
[('I', 'love', 'to'), ('love', 'to', 'play'), ('to', 'play', 'hockey')]
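
For comparison, the same n-grams can be produced without NLTK. The sketch below is one possible dependency-free version. The helper name `make_ngrams` is an assumption of this example, not part of any library.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as a list of tuples."""
    # zip over n staggered views of the list; zip stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I love to play hockey".split()

bigrams = make_ngrams(tokens, 2)
trigram_strings = [" ".join(gram) for gram in make_ngrams(tokens, 3)]

print(bigrams)
print(trigram_strings)
```

Joining each tuple with a space gives the string form used in the definitions earlier. If `n` is larger than the number of tokens, the function returns an empty list.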


### Explain Lemmatization

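Lemmatization reduces a word to its lemma, the meaningful base form found in a dictionary (for example, "books" becomes "book"). Unlike stemming, which simply cuts off the end of a word, lemmatization relies on a vocabulary and on the word's part of speech. NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is passed, which is why verb forms such as "running" and "ran" are returned unchanged in the output below; calling `lemmatizer.lemmatize("running", pos="v")` returns "run".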

In [3]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [4]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "ran", "runner", "run", "caring", "books"]
lemma_words = [lemmatizer.lemmatize(word) for word in words]
print(lemma_words)

['running', 'ran', 'runner', 'run', 'caring', 'book']
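
To show why the part of speech matters, here is a toy lookup-based lemmatizer. The table and the function `toy_lemmatize` are hypothetical and hand-written for this example. Real lemmatizers consult a full lexicon such as WordNet.

```python
# Hypothetical, hand-written lemma table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("caring", "v"): "care",
    ("books", "n"): "book",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

words = ["running", "ran", "books"]

as_nouns = [toy_lemmatize(w) for w in words]           # default pos is noun
as_verbs = [toy_lemmatize(w, pos="v") for w in words]  # explicit verb pos

print(as_nouns)   # ['running', 'ran', 'book']
print(as_verbs)   # ['run', 'run', 'books']
```

With the default noun tag the verb forms pass through unchanged. Supplying the verb tag maps them to "run".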


### Explain Stemming

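Stemming is a technique that reduces words to a base form (the stem) by stripping affixes with heuristic rules. Because no dictionary is consulted, a stem is not guaranteed to be a real word, and irregular forms such as "ran" are left untouched, as the output below shows.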

In [5]:
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "ran", "runner", "run", "caring", "books"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)


['run', 'ran', 'runner', 'run', 'care', 'book']
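
The idea of rule-based suffix stripping can be sketched in a few lines. This is a deliberately naive toy, not the Porter algorithm. The suffix list and the minimum stem length of three characters are assumptions chosen for illustration.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["running", "ran", "runner", "run", "caring", "books"]
toy_stems = [toy_stem(w) for w in words]

print(toy_stems)  # ['runn', 'ran', 'runner', 'run', 'car', 'book']
```

The crude stems "runn" and "car" show why practical stemmers such as Porter's add further rules, for example undoubling consonants and restoring a final "e".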


### Explain Part-of-speech (POS) tagging
Part-of-speech (POS) tagging is the process of assigning a grammatical category to each word in a text based on its context. This is also known as grammatical tagging or word-category disambiguation. The goal of POS tagging is to identify the syntactic role of each word in a sentence, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.

In [6]:
import nltk

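# Download the POS tagger model and the tokenizer data
# (newer NLTK releases may ask for 'averaged_perceptron_tagger_eng' and 'punkt_tab' instead)
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')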

# Define the sentence
sentence = "The cat sat on the mat."

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# POS tag the tokenized sentence
tagged_sentence = nltk.pos_tag(tokens)

print(tagged_sentence)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
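
For intuition about what a tagger learns, the sketch below implements a common baseline: tag each word with the tag it carried most often in a hand-tagged training set. The training sentences are made up, and the function `baseline_tag` is hypothetical. Unlike the NLTK tagger above, this baseline ignores context entirely.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training set (made up), using Penn Treebank tags.
tagged_corpus = [
    [("The", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("the", "DT"), ("dog", "NN"), ("sat", "VBD"),
     ("on", "IN"), ("the", "DT"), ("mat", "NN")],
]

# Count how often each lowercased word carries each tag.
tag_counts = defaultdict(Counter)
for sentence in tagged_corpus:
    for word, tag in sentence:
        tag_counts[word.lower()][tag] += 1

def baseline_tag(tokens, default="NN"):
    """Tag each token with its most frequent training tag, else `default`."""
    tagged = []
    for token in tokens:
        counts = tag_counts.get(token.lower())
        tag = counts.most_common(1)[0][0] if counts else default
        tagged.append((token, tag))
    return tagged

result = baseline_tag(["The", "cat", "sat", "on", "the", "mat"])
print(result)
```

Words never seen in training fall back to the default tag, which is one reason context-aware taggers perform better on real text.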


### Explain Chunking or shallow parsing

Chunking, also known as shallow parsing, is a natural language processing technique that involves identifying and extracting phrases from unstructured text. The goal of chunking is to group words together into meaningful units, such as noun phrases, verb phrases, and prepositional phrases.

In chunking, the words in a sentence are tagged with their corresponding POS tags, and then grouped together into chunks based on a set of predefined rules. The resulting chunks are typically represented in a tree structure, with the chunks as the internal nodes and the individual words as the leaves.



In [7]:
import nltk

# Define the sentence
sentence = "The cat sat on the mat."

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# POS tag the tokenized sentence
tagged_sentence = nltk.pos_tag(tokens)

# Define the grammar for chunking
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Create a parser
parser = nltk.RegexpParser(grammar)

# Perform chunking
chunked_sentence = parser.parse(tagged_sentence)

print(chunked_sentence)


(S (NP The/DT cat/NN) sat/VBD on/IN (NP the/DT mat/NN) ./.)


### Explain Noun Phrase (NP) chunking

Noun Phrase (NP) chunking is a specific form of shallow parsing, which is used to extract noun phrases from unstructured text. The goal of NP chunking is to identify and group together words in a sentence that form a noun phrase.

In NP chunking, the words in a sentence are first tagged with their POS tags and then grouped into NP chunks by a set of predefined rules. The grammar `NP: {<DT>?<JJ>*<NN>}` used in the chunking example above is such a rule: an optional determiner, followed by any number of adjectives, followed by a noun (tag `NN`). As with chunking in general, the result is a tree with the NP chunks as internal nodes and the individual words as leaves.
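
The rule-based grouping step can be sketched without NLTK. The function `np_chunk` below is a hypothetical helper that applies the pattern "optional `DT`, any number of `JJ`, then `NN`" to an already tagged sentence. It is a simplified stand-in for a regular-expression chunk parser, not a reimplementation of one.

```python
def np_chunk(tagged):
    """Group tokens matching <DT>?<JJ>*<NN> into ("NP", [...]) chunks."""
    result = []
    i = 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                          # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":   # any adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":      # required noun
            result.append(("NP", tagged[i : j + 1]))
            i = j + 1
        else:
            result.append(tagged[i])                      # token stays outside any chunk
            i += 1
    return result

tagged_sentence = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
                   ("the", "DT"), ("mat", "NN"), (".", ".")]

chunks = np_chunk(tagged_sentence)

# Chunks hold a list in position 1; plain tokens hold a tag string there.
noun_phrases = [" ".join(word for word, _ in item[1])
                for item in chunks if isinstance(item[1], list)]

print(noun_phrases)  # ['The cat', 'the mat']
```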



### Explain Named Entity Recognition

Named Entity Recognition (NER) is a subtask of natural language processing that involves identifying and extracting specific information from unstructured text. The goal of NER is to locate and classify named entities in text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In [8]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [9]:
import nltk

# Define the sentence
sentence = "Barack Obama was born in Hawaii and served as the 44th President of the United States"

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# POS tag the tokenized sentence
tagged_sentence = nltk.pos_tag(tokens)

# Perform NER
ner_output = nltk.ne_chunk(tagged_sentence)

print(ner_output)


(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Hawaii/NNP)
  and/CC
  served/VBD
  as/IN
  the/DT
  44th/CD
  President/NNP
  of/IN
  the/DT
  (GPE United/NNP States/NNPS))
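
In the output above, "Barack" and "Obama" are tagged as two separate PERSON chunks. One possible post-processing step is to merge adjacent tokens that share a label. The sketch below works on a hand-flattened copy of that output, and the `(token, label)` list format is an assumption of this example. Note that the heuristic would also merge two distinct entities of the same type that happen to be adjacent.

```python
from itertools import groupby

# Flattened (token, label) view of the output above; None marks non-entities.
labeled = [
    ("Barack", "PERSON"), ("Obama", "PERSON"), ("was", None), ("born", None),
    ("in", None), ("Hawaii", "GPE"), ("and", None), ("served", None),
    ("as", None), ("the", None), ("44th", None), ("President", None),
    ("of", None), ("the", None), ("United", "GPE"), ("States", "GPE"),
]

# Group consecutive tokens that share a label and keep only the labeled runs.
entities = [
    (" ".join(token for token, _ in run), label)
    for label, run in groupby(labeled, key=lambda pair: pair[1])
    if label is not None
]

print(entities)
# [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE'), ('United States', 'GPE')]
```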
