
## **What is NLP?**

Natural language processing (NLP) is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language. Organizations today have large volumes of voice and text data from various communication channels like emails, text messages, social media newsfeeds, video, audio, and more. They use NLP software to automatically process this data, analyze the intent or sentiment in the message, and respond in real time to human communication.

## **Why is NLP important?**

Natural language processing (NLP) is critical for analyzing text and speech data fully and efficiently. It can work through the differences in dialects, slang, and grammatical irregularities typical of day-to-day conversations.

Companies use it for several automated tasks, such as to:

*   Process, analyze, and archive large documents
*   Analyze customer feedback or call center recordings
*   Run chatbots for automated customer service
*   Answer who-what-when-where questions
*   Classify and extract text


## **Tokenization:**

Tokenization is the process of breaking text into smaller units called tokens. In natural language processing (NLP), tokens are typically words, but they can also be sentences or other meaningful sub-elements such as subwords or characters. Tokenization is the first step in many NLP tasks, because tokens are the building blocks for later linguistic analysis.

Example:

Input: "Tokenization is essential for NLP."

Tokens: ["Tokenization", "is", "essential", "for", "NLP", "."]
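
The split shown above can be approximated with a short regular expression. The sketch below is only an illustration built on Python's standard library, not NLTK's tokenizer, and the helper name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Tokenization is essential for NLP.")
print(tokens)  # ['Tokenization', 'is', 'essential', 'for', 'NLP', '.']
```

Real tokenizers handle cases this pattern does not. For example, it would split a hyphenated word such as "right-handed" into three tokens.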

In [6]:
# Tokenization of Paragraph/sentences
import nltk  # nltk: Natural Language Toolkit
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
paragraph = """Rohit Gurunath Sharma is an Indian international cricketer and the
current captain of India national cricket team in all formats. Known for his batting elegance,
Sharma is the record holder for most international sixes across all formats and most sixes in the
World Cups and in a calendar year. He plays as a right-handed batsman for India national cricket team
in international cricket, Mumbai Indians in IPL and for Mumbai in domestic cricket. He was the captain
of Indian national team which played in 2023 World cup final.
Sharma also captains Mumbai Indians and the team has won 5 titles in 2013, 2015, 2017, 2019 and 2020 under
his leadership, making him the most successful captain in IPL history, sharing this record with MS Dhoni (5 title wins in IPL).
With India, Sharma was a member of the team that won the 2007 T20 World Cup, and the 2013 ICC Champions Trophy, where he played
in the finals of both tournaments. Rohit is one of four players to have played in every edition of the ICC T20 World Cup, from
the inaugural edition in 2007 to the latest one in 2022."""

In [7]:
# Tokenizing Sentences
sentences = nltk.sent_tokenize(paragraph)

In [8]:
sentences

['Rohit Gurunath Sharma is an Indian international cricketer and the \ncurrent captain of India national cricket team in all formats.',
 'Known for his batting elegance, \nSharma is the record holder for most international sixes across all formats and most sixes in the \nWorld Cups and in a calendar year.',
 'He plays as a right-handed batsman for India national cricket team \nin international cricket, Mumbai Indians in IPL and for Mumbai in domestic cricket.',
 'He was the captain \nof Indian national team which played in 2023 World cup final.',
 'Sharma also captains Mumbai Indians and the team has won 5 titles in 2013, 2015, 2017, 2019 and 2020 under \nhis leadership, making him the most successful captain in IPL history, sharing this record with MS Dhoni (5 title wins in IPL).',
 'With India, Sharma was a member of the team that won the 2007 T20 World Cup, and the 2013 ICC Champions Trophy, where he played \nin the finals of both tournaments.',
 'Rohit is one of four players to hav

In [9]:
# Tokenizing Words
words = nltk.word_tokenize(paragraph)

## **Stemming:**

Stemming is a text normalization technique that reduces words to a root or base form by stripping affixes, most often suffixes. The goal is to map different inflected or derived forms of a word to the same stem, so that variations of a word are represented by the same token.

Example:

Original: "running"

Stemmed: "run"

Stemming is computationally cheaper than lemmatization, but the stems it produces are not always valid words (for example, "history" becomes "histori").
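
To make suffix stripping concrete, here is a deliberately tiny stemmer written with only the standard library. It is a toy under stated assumptions: the suffix list and the `toy_stem` name are invented for illustration, and the logic is far simpler than the Porter algorithm used later in this notebook.

```python
# Suffixes are tried in this order; the list is an assumption for illustration
SUFFIXES = ("ing", "ed", "es", "s")
VOWELS = "aeiou"

def toy_stem(word):
    """Strip one common suffix and undouble a trailing consonant (toy example)."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "runn" -> "run": drop a doubled final consonant
            if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "played", "formats", "titles"]:
    print(w, "->", toy_stem(w))
# running -> run
# played -> play
# formats -> format
# titles -> titl
```

The last line shows the trade-off described above: "titl" is not a valid English word, but every form of "title" that reduces to it will still be counted as the same token.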

In [16]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [17]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [12]:
stemmer = PorterStemmer()

In [18]:
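# Stem every word, dropping English stop words
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # The membership test is case-sensitive, so capitalized stop words such as "He"
    # pass through; PorterStemmer.stem() lowercases its output by default.
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # note: this overwrites the original sentence text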

In [19]:
sentences

['rohit gurunath sharma indian intern cricket current captain india nation cricket team format .',
 'known bat eleg , sharma record holder intern six across format six world cup calendar year .',
 'he play right-hand batsman india nation cricket team intern cricket , mumbai indian ipl mumbai domest cricket .',
 'he captain indian nation team play 2023 world cup final .',
 'sharma also captain mumbai indian team 5 titl 2013 , 2015 , 2017 , 2019 2020 leadership , make success captain ipl histori , share record ms dhoni ( 5 titl win ipl ) .',
 'with india , sharma member team 2007 t20 world cup , 2013 icc champion trophi , play final tournament .',
 'rohit one four player play everi edit icc t20 world cup , inaugur edit 2007 latest one 2022 .']
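
In the output above, "he" and "with" survive even though both are English stop words. The filter compares tokens such as "He" against a lowercase list, so the match fails, and the stemmer lowercases the token afterwards. A minimal sketch of the effect, using a small hand-written stop-word set (an assumption for illustration, not NLTK's full list):

```python
# A tiny stand-in for stopwords.words('english'); the real list is much longer
stop_words = {"he", "with", "the", "a", "of", "in", "was", "which"}

tokens = ["He", "was", "the", "captain", "of", "Indian", "national", "team"]

# Case-sensitive check: "He" is not equal to "he", so it is kept
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing before the membership test removes it
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['He', 'captain', 'Indian', 'national', 'team']
print(case_insensitive)  # ['captain', 'Indian', 'national', 'team']
```

Lowercasing each token before the stop-word check is one common way to avoid this.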

## **Lemmatization:**
Lemmatization is another text normalization technique that, like stemming, aims to reduce words to their base form. However, lemmatization considers the context and meaning of words, ensuring that the resulting base form (lemma) is a valid word. Lemmatization often involves using a vocabulary or a morphological analysis of words.

Example:

Original: "better"

Lemmatized: "good"

Lemmatization is more linguistically informed compared to stemming and tends to produce more accurate results, but it can be computationally more expensive.
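
The "better" to "good" example only holds when the word is treated as an adjective, which is why lemmatizers usually need part-of-speech information. The sketch below imitates that with a tiny hand-written lookup table. The table contents and the `toy_lemmatize` helper are assumptions made for illustration, not WordNet data.

```python
# (word, part of speech) -> lemma; "a" = adjective, "v" = verb, "n" = noun
LEMMA_TABLE = {
    ("better", "a"): "good",
    ("running", "v"): "run",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("sixes", "n"): "six",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not in the table."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("better", pos="a"))   # good
print(toy_lemmatize("better"))            # better (no noun entry, so unchanged)
print(toy_lemmatize("meeting", pos="v"))  # meet
print(toy_lemmatize("meeting", pos="n"))  # meeting
```

NLTK's `WordNetLemmatizer.lemmatize` is similar in one respect: it takes an optional `pos` argument and assumes a noun when none is given, so calling it without `pos` leaves many verbs and adjectives unchanged.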

In [22]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [23]:
lemmatizer = WordNetLemmatizer()

In [24]:
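# Lemmatization
# `sentences` was overwritten with stemmed text by the stemming cell above, so the lemmatizer
# sees stems here; re-run nltk.sent_tokenize(paragraph) first to lemmatize the original words.
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # lemmatize() treats every word as a noun unless a `pos` argument is given
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)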

In [25]:
sentences

['rohit gurunath sharma indian intern cricket current captain india nation cricket team format .',
 'known bat eleg , sharma record holder intern six across format six world cup calendar year .',
 'play right-hand batsman india nation cricket team intern cricket , mumbai indian ipl mumbai domest cricket .',
 'captain indian nation team play 2023 world cup final .',
 'sharma also captain mumbai indian team 5 titl 2013 , 2015 , 2017 , 2019 2020 leadership , make success captain ipl histori , share record m dhoni ( 5 titl win ipl ) .',
 'india , sharma member team 2007 t20 world cup , 2013 icc champion trophi , play final tournament .',
 'rohit one four player play everi edit icc t20 world cup , inaugur edit 2007 latest one 2022 .']

In [29]:
import nltk

para =  """I have three visions for India. In 3000 years of our history, people from all over
               the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture,
               their history and tried to enforce our way of life on them.
               Why? Because we respect the freedom of others.That is why my
               first vision is that of freedom. I believe that India got its first vision of
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India
               stands up to the world, no one will respect us. Only strength respects strength. We must be
               strong not only as a military power but also as an economic power. Both must go hand-in-hand.
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
               I see four milestones in my career"""




In [30]:

# Cleaning the texts
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
sent = nltk.sent_tokenize(para)
corpus = []
for i in range(len(sent)):
    review = re.sub('[^a-zA-Z]', ' ', sent[i])  # replace digits and punctuation with spaces
    review = review.lower()
    review = review.split()
    # drop stop words, then stem what remains
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
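
To see what the cleaning steps do before stemming and stop-word removal, the regular-expression and lowercasing steps can be run on a single sentence from `para`. This is a small standard-library sketch; the variable names are illustrative.

```python
import re

sentence = ("I believe that India got its first vision of this in 1857, "
            "when we started the War of Independence.")

# Replace every character that is not an ASCII letter with a space
letters_only = re.sub('[^a-zA-Z]', ' ', sentence)

# Lowercase and split on whitespace; runs of spaces collapse automatically
words = letters_only.lower().split()

print(words)
```

Digits such as 1857 and all punctuation disappear at this stage, and a hyphenated word like "hand-in-hand" would be split into three separate tokens.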



In [31]:
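# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)  # keep at most the 1500 most frequent terms
X = cv.fit_transform(corpus).toarray()     # dense matrix: one row per sentence, one column per term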

In [32]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
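
Each row of `X` corresponds to one cleaned sentence and each column to one vocabulary term, with the cell holding that term's count in the sentence. The pure-Python sketch below builds the same kind of matrix for a three-sentence toy corpus (invented for illustration) so the layout is easy to inspect. It does not reproduce every `CountVectorizer` default.

```python
from collections import Counter

toy_corpus = [
    "india vision freedom",
    "freedom respect freedom",
    "india develop nation",
]

# Vocabulary: every distinct term, sorted alphabetically to fix the column order
vocabulary = sorted({word for doc in toy_corpus for word in doc.split()})

# One row per document, one column per vocabulary term
matrix = []
for doc in toy_corpus:
    counts = Counter(doc.split())  # missing terms count as 0
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['develop', 'freedom', 'india', 'nation', 'respect', 'vision']
for row in matrix:
    print(row)
# [0, 1, 1, 0, 0, 1]
# [0, 2, 0, 0, 1, 0]
# [1, 0, 1, 1, 0, 0]
```

Word order is discarded: only the counts remain, which is why the representation is called a "bag" of words.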

## **Term Frequency-Inverse Document Frequency (TF-IDF):**
TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It combines Term Frequency (TF) and Inverse Document Frequency (IDF). TF measures the frequency of a word in a document, while IDF measures how unique or rare a word is across documents. The product of TF and IDF gives the TF-IDF score.

Definitions:

*   **Term Frequency (TF):** Number of times a word appears in a document.
*   **Inverse Document Frequency (IDF):** Logarithm of the total number of documents divided by the number of documents containing the word.

 **TF-IDF = TF * IDF**

TF-IDF helps identify important words in a document by assigning higher scores to words that are frequent in the document but rare across the entire corpus.
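
The formula above can be worked through by hand on a toy corpus. The sketch below uses the raw count as TF and log(N / df) as IDF, following the definitions in this section. The natural logarithm, the corpus, and the `tf_idf` helper are assumptions chosen for illustration. scikit-learn's `TfidfVectorizer`, used in the next cell, applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

toy_corpus = [
    "india vision freedom",
    "freedom respect freedom",
    "india develop nation",
]
docs = [doc.split() for doc in toy_corpus]
n_docs = len(docs)

def tf_idf(term, doc):
    """TF-IDF with raw-count TF and unsmoothed natural-log IDF (illustrative)."""
    tf = doc.count(term)                    # raw count in this document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    if df == 0:
        return 0.0                          # term never appears in the corpus
    idf = math.log(n_docs / df)             # log(N / df)
    return tf * idf

print(round(tf_idf("freedom", docs[1]), 3))  # 2 * ln(3/2) = 0.811
print(round(tf_idf("respect", docs[1]), 3))  # 1 * ln(3/1) = 1.099
```

Although "freedom" occurs twice in the second document, "respect" scores higher because it appears in only one document of the corpus.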

In [34]:
# Creating the TF-IDF matrix from the cleaned corpus list built in the example above
from sklearn.feature_extraction.text import TfidfVectorizer
cvv = TfidfVectorizer()
XX = cvv.fit_transform(corpus).toarray()

In [35]:
XX

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.25057734, 0.29539106,
        0.        ],
       [0.        , 0.28201784, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])