
## NLTK (Natural Language Toolkit)

NLTK is a Python library for working with human language data (text). It provides tools for tokenization, stemming, lemmatization, and parsing, along with some machine learning algorithms for text classification.

In [1]:
!pip install nltk



In [4]:
import nltk
nltk.download('punkt')       # Tokenizer models
nltk.download('wordnet')     # WordNet corpus for lemmatization
nltk.download('averaged_perceptron_tagger')  # POS tagging model
nltk.download('stopwords')   # Stopwords for filtering

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> Download
Command 'Download' unrecognized

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> l
Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger_eng Averaged Perceptron Tagger (JSON)
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] averaged_perceptron_tagger_rus Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] bcp47........

True

**Common NLTK Operations**

**a. Tokenization**

Tokenization is the process of breaking text into individual words (word tokenization) or sentences (sentence tokenization).

In [6]:
from nltk.tokenize import word_tokenize, sent_tokenize

# Example text
text = "Hello! This is a sample text for NLTK. Let's learn tokenization."

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)

# Word tokenization
words = word_tokenize(text)
print(words)

['Hello!', 'This is a sample text for NLTK.', "Let's learn tokenization."]
['Hello', '!', 'This', 'is', 'a', 'sample', 'text', 'for', 'NLTK', '.', 'Let', "'s", 'learn', 'tokenization', '.']
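
To make the difference between the two kinds of tokenization concrete, here is a rough standard-library sketch. It is not how NLTK works internally (`sent_tokenize` relies on the trained Punkt model, and `word_tokenize` applies Treebank-style rules); `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical helper names built on plain regular expressions.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    """Return runs of word characters, plus each punctuation mark on its own."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Hello! This is a sample text for NLTK. Let's learn tokenization."

print(simple_sent_tokenize(text))
# ['Hello!', 'This is a sample text for NLTK.', "Let's learn tokenization."]

print(simple_word_tokenize(text))
# ['Hello', '!', 'This', 'is', 'a', 'sample', 'text', 'for', 'NLTK', '.',
#  'Let', "'", 's', 'learn', 'tokenization', '.']
```

On this input the sentence split matches the output above, but the word split yields `'` and `s` as separate tokens where NLTK keeps `'s` together. A regex like this would also split wrongly after abbreviations such as "Dr.", which is the kind of case a trained sentence model is meant to handle.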


**b. Stopwords Filtering**

Stopwords are common words (e.g., "is", "the", "and") that are often removed during text processing.

In [None]:
from nltk.corpus import stopwords

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Filter stopwords from a list of words
words = word_tokenize("This is a simple example sentence.")
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)
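
The same list-comprehension pattern can be tried without any downloads. The sketch below swaps in a tiny hand-written stopword set (an assumption for illustration; NLTK's English list is considerably longer) and a pre-tokenized list in place of `word_tokenize`.

```python
# Hand-written stand-in for stopwords.words('english'); the real list is much longer
TOY_STOP_WORDS = {"this", "is", "a", "the", "and"}

# Pre-tokenized stand-in for word_tokenize("This is a simple example sentence.")
tokens = ["This", "is", "a", "simple", "example", "sentence", "."]

# Lowercase before the membership test so that "This" matches "this"
filtered = [tok for tok in tokens if tok.lower() not in TOY_STOP_WORDS]
print(filtered)  # ['simple', 'example', 'sentence', '.']

# Punctuation is not a stopword, so drop it separately if it is unwanted
content_words = [tok for tok in filtered if tok.isalpha()]
print(content_words)  # ['simple', 'example', 'sentence']
```

Storing the stopwords in a `set`, as the cell above also does, keeps each membership test constant-time, which matters once the token list grows.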

**c. Stemming**

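Stemming reduces words to a common stem by stripping suffixes with heuristic rules. For example, "running" becomes "run". The stem is not always a dictionary word ("easily" becomes "easili"), and irregular forms such as "ran" are left unchanged.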

In [7]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Stem a list of words
words = ["running", "ran", "runs", "easily", "fairly"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)

['run', 'ran', 'run', 'easili', 'fairli']
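
The output above shows both the strength and the limits of rule-based stemming. As a minimal sketch of the idea, the toy function below applies three suffix rules loosely modelled on the Porter approach. `toy_stem` is a hypothetical name, and the real `PorterStemmer` uses many more rules and conditions than this.

```python
VOWELS = "aeiou"

def toy_stem(word):
    """Strip a few suffixes, loosely imitating Porter-style rules."""
    word = word.lower()
    stripped = False
    for suffix in ("ing", "ed"):
        # Keep at least three letters of stem so "ring" or "red" stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            stripped = True
            break
    if stripped:
        # Undo consonant doubling left by the suffix: "runn" -> "run"
        if word[-1] == word[-2] and word[-1] not in VOWELS + "lsz":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        # Plain plural or third-person "s": "runs" -> "run"
        word = word[:-1]
    # A final "y" after a stem containing a vowel becomes "i": "easily" -> "easili"
    if word.endswith("y") and any(ch in VOWELS for ch in word[:-1]):
        word = word[:-1] + "i"
    return word

words = ["running", "ran", "runs", "easily", "fairly"]
print([toy_stem(w) for w in words])  # ['run', 'ran', 'run', 'easili', 'fairli']
```

Because the rules only look at spelling, "ran" is untouched and "easily" ends up as the non-word "easili", the same behaviour visible in the Porter output.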


**d. Lemmatization**

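Lemmatization is similar to stemming, but instead of stripping suffixes it looks words up in a vocabulary (here WordNet) and returns their base or dictionary form, the lemma. The result depends on the part of speech passed in; WordNetLemmatizer treats a word as a noun unless told otherwise.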

In [8]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Lemmatize words with their parts of speech (POS)
print(lemmatizer.lemmatize("running", pos='v'))  # 'run'
print(lemmatizer.lemmatize("better", pos='a'))   # 'good'

run
good
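
The role of the `pos` argument is easier to see with a toy lookup table. The sketch below is a hand-written stand-in, not WordNet: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the only behaviour borrowed from `WordNetLemmatizer` is the noun default for `pos`.

```python
# Hand-written (word, pos) -> lemma table; a stand-in for a real dictionary
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("geese", "n"): "goose",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; return it unchanged if absent."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("better", pos="a"))   # good
print(toy_lemmatize("running"))           # running (treated as a noun, no entry found)
```

The last call illustrates a common pitfall: without the right part of speech, the lookup misses and the word comes back unchanged.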


**e. Part-of-Speech (POS) Tagging**

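POS tagging labels each token with its grammatical category (noun, verb, adjective, and so on). pos_tag returns Penn Treebank tags, for example NNP for a proper noun and VBZ for a third-person singular present-tense verb.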

In [9]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Tokenize text
words = word_tokenize("John is learning NLTK.")

# Tag the tokens with POS
pos_tags = pos_tag(words)
print(pos_tags)

[('John', 'NNP'), ('is', 'VBZ'), ('learning', 'VBG'), ('NLTK', 'NNP'), ('.', '.')]
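
The lemmatizer in the previous section expects WordNet-style POS letters ('n', 'v', 'a', 'r'), while `pos_tag` returns Penn Treebank tags. A common way to bridge the two is a small mapping on the first letter of the tag. The sketch below is one such mapping under that assumption; `penn_to_wordnet` is a hypothetical helper name, and the tagged list is copied from the output above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # everything else, including punctuation

# Tagged tokens copied from the pos_tag output above
tagged = [("John", "NNP"), ("is", "VBZ"), ("learning", "VBG"), ("NLTK", "NNP"), (".", ".")]

print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('John', 'n'), ('is', 'v'), ('learning', 'v'), ('NLTK', 'n'), ('.', 'n')]
```

Each pair could then be passed to `lemmatizer.lemmatize(word, pos=...)`, so that "learning" is lemmatized as a verb rather than as a noun.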


**f. Named Entity Recognition (NER)**

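NER identifies named entities such as people, organizations, and locations in a text. In NLTK, ne_chunk needs the maxent_ne_chunker resource (and, typically, the words corpus), neither of which was downloaded in the setup cell. The cell below therefore raises a LookupError until nltk.download('maxent_ne_chunker') and nltk.download('words') have been run. Newer NLTK releases may ask for a differently named resource; the error message names the exact one to download.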

In [11]:
from nltk import ne_chunk

# Tokenize and tag
sentence = "Barack Obama was born in Hawaii."
words = word_tokenize(sentence)
pos_tags = pos_tag(words)

# Perform named entity recognition
named_entities = ne_chunk(pos_tags)
print(named_entities)

LookupError: 
**********************************************************************
  Resource maxent_ne_chunker not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('maxent_ne_chunker')

  For more information see: https://www.nltk.org/data.html

  Attempted to load chunkers/maxent_ne_chunker/PY3/english_ace_multiclass.pickle

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
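
Until the chunker resource is installed, the idea of grouping tagged tokens into entity spans can still be illustrated. The sketch below is a crude heuristic and not what `ne_chunk` does (the real chunker uses a trained classifier and also assigns labels such as PERSON or GPE). The POS tags are written by hand as an assumption, and `proper_noun_spans` is a hypothetical helper name.

```python
from itertools import groupby

# Hand-written Penn Treebank tags (an assumption; pos_tag was not run for this sketch)
tagged = [
    ("Barack", "NNP"), ("Obama", "NNP"), ("was", "VBD"), ("born", "VBN"),
    ("in", "IN"), ("Hawaii", "NNP"), (".", "."),
]

def proper_noun_spans(tagged_tokens):
    """Join each run of consecutive proper-noun tokens into one candidate entity."""
    spans = []
    is_proper_noun = lambda pair: pair[1] in ("NNP", "NNPS")
    for is_proper, group in groupby(tagged_tokens, key=is_proper_noun):
        if is_proper:
            spans.append(" ".join(word for word, _ in group))
    return spans

candidates = proper_noun_spans(tagged)
print(candidates)  # ['Barack Obama', 'Hawaii']
```

The heuristic finds spans but cannot tell a person from a place, and it would miss entities that contain lowercase words; those are the gaps a trained chunker is meant to fill.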


**g. WordNet for Synonyms and Antonyms**

WordNet is a lexical database that groups words into sets of synonyms (synsets), each with a definition, and records relations between words such as antonymy.

In [12]:
from nltk.corpus import wordnet

# List every synset (sense) of "good" with its definition
synsets = wordnet.synsets("good")
for syn in synsets:
    print(syn.name(), syn.definition())

# Print the first antonym of each lemma in the noun sense 'good.n.01', if it has one
for lemma in wordnet.synset('good.n.01').lemmas():
    if lemma.antonyms():
        print(lemma.antonyms()[0].name())

good.n.01 benefit
good.n.02 moral excellence or admirableness
good.n.03 that which is pleasing or valuable or useful
commodity.n.01 articles of commerce
good.a.01 having desirable or positive qualities especially those suitable for a thing specified
full.s.06 having the normally expected amount
good.a.03 morally admirable
estimable.s.02 deserving of esteem and respect
beneficial.s.01 promoting or enhancing well-being
good.s.06 agreeable or pleasing
good.s.07 of moral excellence
adept.s.01 having or showing knowledge and skill and aptitude
good.s.09 thorough
dear.s.02 with or in a close or intimate relationship
dependable.s.04 financially sound
good.s.12 most suitable or right for a particular purpose
good.s.13 resulting favorably
effective.s.04 exerting force or influence
good.s.15 capable of pleasing
good.s.16 appealing to the mind
good.s.17 in excellent physical condition
good.s.18 tending to promote physical well-being; beneficial to health
good.s.19 not forged
good.s.20 not left to
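
The listing above shows senses (synsets), not synonyms as such. In WordNet the synonyms of a word are the other lemma names in its synsets, and antonyms hang off individual lemmas. The sketch below mimics that structure with a hand-written miniature sense inventory. It is explicitly not WordNet data; `TOY_SENSES` and `synonyms_and_antonyms` are hypothetical names.

```python
# Hand-written miniature sense inventory (NOT WordNet data):
# sense id -> the lemmas sharing that sense, and their antonyms
TOY_SENSES = {
    "good.quality": {"lemmas": ["good", "fine"], "antonyms": ["bad"]},
    "good.moral": {"lemmas": ["good", "virtuous"], "antonyms": ["evil"]},
    "good.item": {"lemmas": ["good", "commodity"], "antonyms": []},
}

def synonyms_and_antonyms(word, senses):
    """Collect the other lemmas and the antonyms from every sense containing the word."""
    synonyms, antonyms = set(), set()
    for entry in senses.values():
        if word in entry["lemmas"]:
            synonyms.update(lemma for lemma in entry["lemmas"] if lemma != word)
            antonyms.update(entry["antonyms"])
    return sorted(synonyms), sorted(antonyms)

print(synonyms_and_antonyms("good", TOY_SENSES))
# (['commodity', 'fine', 'virtuous'], ['bad', 'evil'])
```

With the real corpus, the analogous loop would go over `wordnet.synsets(word)` and, for each synset, over `syn.lemmas()`, reading `lemma.name()` and `lemma.antonyms()`.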

## Summary of Key Functions

* Tokenization: nltk.tokenize.word_tokenize(), nltk.tokenize.sent_tokenize()
* Stopwords: nltk.corpus.stopwords.words('language')
* Stemming: nltk.stem.PorterStemmer().stem()
* Lemmatization: nltk.stem.WordNetLemmatizer().lemmatize()
* POS Tagging: nltk.pos_tag()
* Named Entity Recognition: nltk.ne_chunk()
* WordNet: nltk.corpus.wordnet


