# Natural Language Processing

A first introduction to Natural Language Processing.

## What is Natural Language Processing?

Natural Language Processing (NLP) is a field of artificial intelligence (AI) concerned with the interaction between computers and human language. Its primary goal is to enable computers to understand, interpret, and generate human language in ways that are meaningful and useful. NLP is used in applications such as language translation, sentiment analysis, and chatbots. Some key concepts within NLP are:

1. **Tokenization:**
   - Tokenization is the process of breaking down a text into individual units, or tokens, such as words or sentences. It's a fundamental step in NLP as it provides a structured way to analyze and process text.
   - Example: The sentence "I love natural language processing" would be tokenized into individual words: ["I", "love", "natural", "language", "processing"].
   
2. **Stemming:**
   - Stemming is a text normalization process that reduces words to a base or root form, typically by stripping suffixes with heuristic rules. The goal is to group related word forms together to improve text analysis; the resulting stem is not always a dictionary word.
   - Example: The words "jumping," "jumps," and "jumped" would all be stemmed to "jump."

3. **Lemmatization:**
   - Lemmatization is another text normalization method that reduces words to their base or dictionary form (lemma) while considering their part of speech.
   - Unlike stemming, lemmatization ensures that the resulting word is valid and meaningful.
   - Example: The word "better" can be lemmatized to "good" when it is treated as an adjective, because the lemmatizer recognizes it as the comparative form.

4. **Frequency Analysis:**
   - Frequency analysis involves counting how often words or phrases appear in a text corpus. It is a fundamental technique in NLP to understand the importance of words in a document.
   - By identifying frequently occurring words, you can extract insights, such as common themes or keywords within a text.

5. **N-grams:**
   - N-grams are contiguous sequences of n items (usually words) from a given sample of text. They are used to capture the context and relationships between words in a document.
   - Examples: 
     - 1-gram (unigram): "I"
     - 2-gram (bigram): "I am"
     - 3-gram (trigram): "I am learning"

6. **Vectorization:**
   - Vectorization is the process of converting text data into numerical vectors or arrays so that machine learning algorithms can work with it.
   - Common techniques include Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) for converting text into numerical features.

7. **Key Phrase Extraction:**
   - Key phrase extraction is the process of identifying and extracting important phrases or terms from a text document.
   - It helps in summarizing content, categorizing documents, and understanding the core topics within a document.

8. **Named Entity Recognition (NER):**
   - NER is a subtask of information extraction that identifies and classifies named entities (such as names of people, organizations, locations, dates, and more) within a text.
   - NER is used in various applications, including entity linking, content categorization, and information retrieval.
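
The n-gram examples in item 5 can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper name `ngrams` and the example sentence are chosen here for illustration and are not part of the notebook below.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting is enough for this small example.
tokens = "I am learning natural language processing".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams[0])  # first 1-gram
print(bigrams[0])   # first 2-gram
print(trigrams[0])  # first 3-gram
```

A sentence of six tokens yields six unigrams, five bigrams, and four trigrams; asking for an n larger than the sentence returns an empty list.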

In practice, these NLP concepts are often used together to perform more complex text analysis tasks. For example, you might preprocess text data by stemming, create n-grams to capture context, perform frequency analysis to find important words, and then use vectorization to prepare the data for machine learning models. Named Entity Recognition can be applied to extract specific information, and key phrase extraction can help summarize the content. These techniques collectively enable computers to understand and process human language for a wide range of applications.
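
To make the vectorization step more concrete, here is a small sketch of Bag of Words and TF-IDF in plain Python. The three toy documents and the function names are assumptions made for illustration. The IDF used is the plain textbook form `log(N / df)`; library implementations often use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (illustrative only).
docs = [
    "the princess met the wizard",
    "the wizard owned a crystal ball",
    "the princess began a quest",
]
tokenized = [doc.split() for doc in docs]

# Fixed, sorted vocabulary so that every vector has the same layout.
vocabulary = sorted({word for doc in tokenized for word in doc})

def bag_of_words(tokens):
    """Raw count of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def idf(word):
    """Inverse document frequency: log(N / number of documents containing word)."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)

def tf_idf(tokens):
    """Relative term frequency weighted by IDF, one value per vocabulary word."""
    counts = Counter(tokens)
    return [(counts[word] / len(tokens)) * idf(word) for word in vocabulary]

bow_vectors = [bag_of_words(doc) for doc in tokenized]
tfidf_vectors = [tf_idf(doc) for doc in tokenized]

print(vocabulary)
print(bow_vectors[0])
print([round(value, 3) for value in tfidf_vectors[0]])
```

Because "the" occurs in every document, its IDF is zero and it contributes nothing to the TF-IDF vectors, while a word that appears in only one document, such as "met", receives the highest weight.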

# Libraries

Install the following dependencies with `pip install`:
- nltk

In [1]:
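import nltk
# Download the NLTK resources used in this notebook.
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('punkt')                       # sentence tokenizer models
nltk.download('stopwords')                   # stop word lists
nltk.download('wordnet')                     # lexical database used by the lemmatizer
nltk.download('words')                       # word list used by the named entity chunker
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize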

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\VelzM\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\VelzM\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\VelzM\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\VelzM\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\VelzM\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\VelzM\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-

## The sample text

In [2]:
sample_text = """
Once upon a time, in a faraway land nestled between emerald forests and sparkling rivers, there lived a young and curious princess named Aurora. She had hair as golden as the sun and eyes as blue as the clearest skies. Aurora's heart was full of wonder, and her spirit yearned for adventure.
In the heart of the enchanted forest, a mysterious cottage stood hidden among ancient oak trees. Inside this cottage, an old and wise wizard named Merlin resided. Merlin possessed a magical crystal ball that could reveal glimpses of the future.
One bright morning, Princess Aurora, guided by her boundless curiosity, set off on a journey to seek Merlin's guidance. She ventured through the whispering woods, where the trees seemed to share their secrets with the wind. Birds with plumage of every hue serenaded her with songs of joy.
Upon reaching Merlin's cottage, the princess found the wise wizard studying ancient tomes by the warm glow of a crackling fireplace. Merlin welcomed her with a warm smile and offered his mystical crystal ball. 'Look into it, dear princess, and you shall see what destiny has in store for you,' he said.
Aurora gazed into the crystal ball, and it revealed visions of dragons soaring through the skies, castles in the clouds, and a magical quest that would test her courage and kindness. With newfound determination, she thanked Merlin and set out on her remarkable adventure.
As the princess embarked on her journey, she encountered talking animals, enchanted forests, and brave companions who joined her quest. Together, they faced challenges and learned valuable lessons about friendship, bravery, and the magic that resides within one's heart.
And so, Princess Aurora's fairy tale unfolded, filled with wonder, courage, and the belief that dreams could come true in a world where the impossible became possible.
"""

## Tokenizing

In [3]:
# Tokenize by sentence.
sentences = sent_tokenize(sample_text)

In [4]:
# Tokenize by word.
words = word_tokenize(sample_text)
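
To get a feel for what a tokenizer does, the sketch below implements a deliberately simplified tokenizer with regular expressions. It is not how `sent_tokenize` or `word_tokenize` work internally, and the function names are made up for this example. The sentence splitter, for instance, would wrongly split after abbreviations such as "Dr.".

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def simple_word_tokenize(text):
    """Return words (with an optional internal apostrophe) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

text = "Merlin welcomed her. Aurora's heart was full of wonder!"

print(simple_sent_tokenize(text))
print(simple_word_tokenize(text))
```

One visible difference: this sketch keeps "Aurora's" as a single token, whereas the NLTK output in this notebook shows it split into `'Aurora'` and `"'s"`.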

## Filtering Stop Words

In [5]:
# Load the English stop words as a set for fast lookups.
stop_words = set(stopwords.words("english"))

In [6]:
# Filter the stop words out of the sample text (case-insensitive match).
filtered_list = [word for word in words if word.casefold() not in stop_words]

In [7]:
# Display the filtered text.
filtered_list

['upon',
 'time',
 ',',
 'faraway',
 'land',
 'nestled',
 'emerald',
 'forests',
 'sparkling',
 'rivers',
 ',',
 'lived',
 'young',
 'curious',
 'princess',
 'named',
 'Aurora',
 '.',
 'hair',
 'golden',
 'sun',
 'eyes',
 'blue',
 'clearest',
 'skies',
 '.',
 'Aurora',
 "'s",
 'heart',
 'full',
 'wonder',
 ',',
 'spirit',
 'yearned',
 'adventure',
 '.',
 'heart',
 'enchanted',
 'forest',
 ',',
 'mysterious',
 'cottage',
 'stood',
 'hidden',
 'among',
 'ancient',
 'oak',
 'trees',
 '.',
 'Inside',
 'cottage',
 ',',
 'old',
 'wise',
 'wizard',
 'named',
 'Merlin',
 'resided',
 '.',
 'Merlin',
 'possessed',
 'magical',
 'crystal',
 'ball',
 'could',
 'reveal',
 'glimpses',
 'future',
 '.',
 'One',
 'bright',
 'morning',
 ',',
 'Princess',
 'Aurora',
 ',',
 'guided',
 'boundless',
 'curiosity',
 ',',
 'set',
 'journey',
 'seek',
 'Merlin',
 "'s",
 'guidance',
 '.',
 'ventured',
 'whispering',
 'woods',
 ',',
 'trees',
 'seemed',
 'share',
 'secrets',
 'wind',
 '.',
 'Birds',
 'plumage',
 '

## Stemming

In [8]:
stemmer = PorterStemmer()

In [9]:
stemmed_words = [stemmer.stem(word) for word in filtered_list]

In [10]:
stemmed_words

['upon',
 'time',
 ',',
 'faraway',
 'land',
 'nestl',
 'emerald',
 'forest',
 'sparkl',
 'river',
 ',',
 'live',
 'young',
 'curiou',
 'princess',
 'name',
 'aurora',
 '.',
 'hair',
 'golden',
 'sun',
 'eye',
 'blue',
 'clearest',
 'sky',
 '.',
 'aurora',
 "'s",
 'heart',
 'full',
 'wonder',
 ',',
 'spirit',
 'yearn',
 'adventur',
 '.',
 'heart',
 'enchant',
 'forest',
 ',',
 'mysteri',
 'cottag',
 'stood',
 'hidden',
 'among',
 'ancient',
 'oak',
 'tree',
 '.',
 'insid',
 'cottag',
 ',',
 'old',
 'wise',
 'wizard',
 'name',
 'merlin',
 'resid',
 '.',
 'merlin',
 'possess',
 'magic',
 'crystal',
 'ball',
 'could',
 'reveal',
 'glimps',
 'futur',
 '.',
 'one',
 'bright',
 'morn',
 ',',
 'princess',
 'aurora',
 ',',
 'guid',
 'boundless',
 'curios',
 ',',
 'set',
 'journey',
 'seek',
 'merlin',
 "'s",
 'guidanc',
 '.',
 'ventur',
 'whisper',
 'wood',
 ',',
 'tree',
 'seem',
 'share',
 'secret',
 'wind',
 '.',
 'bird',
 'plumag',
 'everi',
 'hue',
 'serenad',
 'song',
 'joy',
 '.',
 'upo
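
The stems above show two typical properties of the Porter stemmer: stems are lowercased and are often not dictionary words ('curiou', 'mysteri'), and related words do not always end up with the same stem ('curiou' for "curious" but 'curios' for "curiosity"). The short sketch below makes this easy to inspect; the word list is chosen here for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of one verb, followed by words taken from the sample text.
examples = ["jumping", "jumps", "jumped", "curious", "curiosity", "mysterious"]
stems = {word: stemmer.stem(word) for word in examples}

for word, stem in stems.items():
    print(f"{word:>12} -> {stem}")
```

If valid dictionary forms are needed, lemmatization is usually the better fit.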

## Lemmatizing

In [11]:
lemmatizer = WordNetLemmatizer()

In [12]:
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

In [13]:
lemmatized_words

['Once',
 'upon',
 'a',
 'time',
 ',',
 'in',
 'a',
 'faraway',
 'land',
 'nestled',
 'between',
 'emerald',
 'forest',
 'and',
 'sparkling',
 'river',
 ',',
 'there',
 'lived',
 'a',
 'young',
 'and',
 'curious',
 'princess',
 'named',
 'Aurora',
 '.',
 'She',
 'had',
 'hair',
 'a',
 'golden',
 'a',
 'the',
 'sun',
 'and',
 'eye',
 'a',
 'blue',
 'a',
 'the',
 'clearest',
 'sky',
 '.',
 'Aurora',
 "'s",
 'heart',
 'wa',
 'full',
 'of',
 'wonder',
 ',',
 'and',
 'her',
 'spirit',
 'yearned',
 'for',
 'adventure',
 '.',
 'In',
 'the',
 'heart',
 'of',
 'the',
 'enchanted',
 'forest',
 ',',
 'a',
 'mysterious',
 'cottage',
 'stood',
 'hidden',
 'among',
 'ancient',
 'oak',
 'tree',
 '.',
 'Inside',
 'this',
 'cottage',
 ',',
 'an',
 'old',
 'and',
 'wise',
 'wizard',
 'named',
 'Merlin',
 'resided',
 '.',
 'Merlin',
 'possessed',
 'a',
 'magical',
 'crystal',
 'ball',
 'that',
 'could',
 'reveal',
 'glimpse',
 'of',
 'the',
 'future',
 '.',
 'One',
 'bright',
 'morning',
 ',',
 'Princess

## Named Entity Recognition (NER)


In [14]:
# Tag the words by part of speech.
pos_tags = nltk.pos_tag(words)

In [15]:
# Recognize named entities.
tree = nltk.ne_chunk(pos_tags)

In [16]:
# Visualize the tree (opens a separate window, so a graphical environment is required).
tree.draw()

In [17]:
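# Extract named entities.
def extract_ne(text):
    """Return the set of named entities found in `text`."""
    words = word_tokenize(text, language='english')
    tags = nltk.pos_tag(words)
    # binary=True labels every entity as "NE" instead of a specific type.
    tree = nltk.ne_chunk(tags, binary=True)
    # Each named entity is a subtree whose leaves are (word, tag) pairs.
    return set(
        " ".join(word for word, _tag in subtree)
        for subtree in tree
        if hasattr(subtree, "label") and subtree.label() == "NE"
    )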

In [18]:
extract_ne(sample_text)

{'Aurora', 'Merlin', 'Princess Aurora'}

## Frequency Distribution

In [19]:
# Determine the frequency distribution of the sample text without the stop words.
frequency_distribution = FreqDist(filtered_list)

In [20]:
frequency_distribution.most_common(20)

[(',', 27),
 ('.', 17),
 ('Merlin', 6),
 ('Aurora', 5),
 ("'s", 5),
 ('princess', 4),
 ('heart', 3),
 ('cottage', 3),
 ('crystal', 3),
 ('ball', 3),
 ('forests', 2),
 ('named', 2),
 ('skies', 2),
 ('wonder', 2),
 ('adventure', 2),
 ('enchanted', 2),
 ('ancient', 2),
 ('trees', 2),
 ('wise', 2),
 ('wizard', 2)]
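
In the result above, the most frequent "words" are punctuation tokens and `"'s"`, and 'princess' is counted separately from 'Princess' because the counts are case-sensitive. A possible refinement, sketched here with `collections.Counter` (which `FreqDist` subclasses), is to keep only alphabetic tokens and fold case before counting. The short token list is a made-up stand-in for `filtered_list`, and `isalpha()` is a crude filter that would also drop numbers and hyphenated words.

```python
from collections import Counter

# Illustrative stand-in for the notebook's filtered_list.
tokens = ["Merlin", ",", "princess", ".", "Princess", "Aurora", "'s", "merlin", "."]

# Keep alphabetic tokens only and normalize case before counting.
content_tokens = [token.casefold() for token in tokens if token.isalpha()]
counts = Counter(content_tokens)

print(counts.most_common(3))
```

The same filter can be applied to `filtered_list` before it is passed to `FreqDist`.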

## References

- https://chat.openai.com/
- https://realpython.com/nltk-nlp-python/
- https://towardsdatascience.com/intro-to-nltk-for-nlp-with-python-87da6670dde