# Natural Language Processing

### 🌐 Introduction to Natural Language Processing (NLP) in Machine Learning Using NLTK

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) and Machine Learning (ML) that focuses on enabling machines to understand, interpret, and generate human language. It bridges the gap between human communication and computer understanding.

---

### 🔍 What is NLP?

NLP combines linguistics and computer science to process and analyze large amounts of natural language data. Common tasks include:

* **Text classification** (e.g., spam detection)
* **Sentiment analysis**
* **Named Entity Recognition (NER)**
* **Machine translation**
* **Speech recognition**
* **Text summarization**

---

### 🧠 NLP in Machine Learning

In ML, NLP is used to train models that can make predictions or extract information from text. The pipeline typically involves:

1. **Text Preprocessing**
2. **Feature Extraction** (e.g., Bag of Words, TF-IDF)
3. **Model Training** (e.g., Naive Bayes, SVM, LSTM)
4. **Evaluation and Prediction**
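
The feature-extraction step can be illustrated without any library. The sketch below builds a bag of words and a plain TF-IDF weighting for a tiny, made-up corpus; the documents and the helper names `bag_of_words` and `tf_idf` are assumptions for illustration only. Libraries such as scikit-learn use smoothed variants of this formula, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); each document is already tokenized.
docs = [
    ["free", "prize", "click", "now"],
    ["meeting", "at", "noon"],
    ["click", "to", "join", "the", "meeting"],
]

def bag_of_words(tokens):
    """Map each token to the number of times it occurs in one document."""
    return Counter(tokens)

def tf_idf(docs):
    """Return one {token: weight} dict per document.

    tf  = count of the token in the document / document length
    idf = log(number of documents / number of documents containing the token)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = bag_of_words(doc)
        weights.append({
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights

weights = tf_idf(docs)
print(bag_of_words(docs[0]))
print(weights[0])
```

A token that occurs in fewer documents ("free") receives a higher weight than one shared across documents ("click"), which is the property a classifier such as Naive Bayes or an SVM can exploit.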

---

### 📦 Introduction to NLTK (Natural Language Toolkit)

**NLTK** is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources along with libraries for:

* Tokenization
* Lemmatization
* POS tagging
* Parsing
* WordNet access

Install NLTK:

```bash
pip install nltk
```

---

### ✅ Basic NLP Tasks Using NLTK

```python
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize in recent NLTK releases (e.g. 3.9)
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "Natural Language Processing makes it possible for machines to understand human language."

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Stopword Removal (build the set once instead of reloading the list for every token)
stop_words = set(stopwords.words('english'))
filtered = [word for word in tokens if word.lower() not in stop_words]
print("Filtered Tokens:", filtered)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered]
print("Lemmatized:", lemmatized)
```

---

### 📌 Why Use NLTK?

* Beginner-friendly
* Extensive documentation and corpora
* Excellent for prototyping NLP workflows
* Good integration with other ML libraries (e.g., Scikit-learn)

---




# NLTK Modules and Their Functionality

| **Module**                    | **Functionality**                                  | **Example Functions**                       | **Use Case**                                         |
| ----------------------------- | -------------------------------------------------- | ------------------------------------------- | ---------------------------------------------------- |
| `nltk.tokenize`               | Splits text into sentences or words (tokenization) | `word_tokenize()`, `sent_tokenize()`        | Preprocessing input text                             |
| `nltk.corpus`                 | Access to text corpora and lexical resources       | `stopwords.words()`, `names.words()`        | Working with language datasets                       |
| `nltk.stem`                   | Reduces words to their stem/root form              | `PorterStemmer()`, `LancasterStemmer()`     | Normalizing words (e.g., "running" → "run")          |
| `nltk.stem.WordNetLemmatizer` | Converts words to their lemma (dictionary form)    | `lemmatize()`                               | Semantic normalization (more accurate than stemming) |
| `nltk.probability`            | Support for frequency distributions                | `FreqDist()`                                | Word frequency analysis                              |
| `nltk.tag`                    | Assigns POS (Part of Speech) tags to tokens        | `pos_tag()`                                 | Syntax analysis                                      |
| `nltk.chunk`                  | Groups tokens into meaningful phrases              | `ne_chunk()`                                | Named Entity Recognition (NER)                       |
| `nltk.parse`                  | Syntax parsing of sentences                        | Various parsers                             | Tree-based parsing                                   |
| `nltk.classify`               | Text classification using machine learning         | `NaiveBayesClassifier`, `SklearnClassifier` | Sentiment analysis, spam detection                   |
| `nltk.translate`              | Tools for machine translation and BLEU scoring     | `bleu_score`                                | Translation evaluation                               |
| `nltk.draw`                   | Visualization of parse trees and relationships     | `tree.draw()`, `dispersion_plot()`          | NLP visualizations                                   |
| `nltk.metrics`                | Evaluation metrics for NLP models                  | `edit_distance()`, `accuracy()`             | Model evaluation                                     |
| `nltk.sentiment`              | Pre-built tools for sentiment analysis             | `SentimentIntensityAnalyzer`                | Polarity scoring (positive/negative)                 |

---



# Natural Language Toolkit


In [57]:
import nltk
nltk.__version__

'3.9.1'

In [58]:
# download the stopwords corpus for use in this notebook
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\PANDIT
[nltk_data]     JI\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [59]:
from nltk.corpus import stopwords
engWordsList = stopwords.words('english')

In [60]:
print(len(engWordsList))
print(engWordsList)

198
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 

In [61]:
hgwordLIst = stopwords.words('hinglish')
print(len(hgwordLIst))
print(hgwordLIst)

1036
['a', 'aadi', 'aaj', 'aap', 'aapne', 'aata', 'aati', 'aaya', 'aaye', 'ab', 'abbe', 'abbey', 'abe', 'abhi', 'able', 'about', 'above', 'accha', 'according', 'accordingly', 'acha', 'achcha', 'across', 'actually', 'after', 'afterwards', 'again', 'against', 'agar', 'ain', 'aint', "ain't", 'aisa', 'aise', 'aisi', 'alag', 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'andar', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'ap', 'apan', 'apart', 'apna', 'apnaa', 'apne', 'apni', 'appear', 'are', 'aren', 'arent', "aren't", 'around', 'arre', 'as', 'aside', 'ask', 'asking', 'at', 'aur', 'avum', 'aya', 'aye', 'baad', 'baar', 'bad', 'bahut', 'bana', 'banae', 'banai', 'banao', 'banaya', 'banaye', 'banayi', 'banda', 'bande', 'bandi', 'bane', 'bani', 'bas', 'bata', 'batao', 'bc', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforeh

# Hindi Words in NLP
***

Load Hindi words from NLTK's `indian` corpus.


In [62]:
nltk.download('indian')

[nltk_data] Downloading package indian to C:\Users\PANDIT
[nltk_data]     JI\AppData\Roaming\nltk_data...
[nltk_data]   Package indian is already up-to-date!


True

In [63]:
hindiList = nltk.corpus.indian.words('hindi.pos')
print(len(hindiList))   
print(hindiList)

9408
['पूर्ण', 'प्रतिबंध', 'हटाओ', ':', 'इराक', 'संयुक्त', ...]


Telugu

In [64]:
teluguList = nltk.corpus.indian.words('telugu.pos')
print(len(teluguList), teluguList[:6])

9999 ['4', '.', 'ఆడిట్', 'నిర్వహణ', 'ఆడిటర్', 'ఒక']


Bangla

In [65]:
banglaList = nltk.corpus.indian.words('bangla.pos')
print(len(banglaList), banglaList[:10])

10281 ['মহিষের', 'সন্তান', ':', 'তোড়া', 'উপজাতি', '৷', 'বাসস্থান-ঘরগৃহস্থালি', 'তোড়া', 'ভাষায়', 'গ্রামকেও']


In [66]:
# panjabiList = nltk.corpus.indian.words('panjabi.pos')
# print(len(panjabiList), panjabiList[:10])

In [67]:
import nltk
nltk.download('stopwords.pa')

[nltk_data] Error loading stopwords.pa: Package 'stopwords.pa' not
[nltk_data]     found in index


False

In [68]:
!pip install inltk


In [69]:
# from inltk.inltk import setup
# setup('pa')

# Sentence and Word Tokenizers

In [70]:
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize
textpara = "Geneva, May 26 (PTI) Switzerland has declared May 26 as the Science Day in honour of visiting President A P J Abdul Kalam. This was announced by the Swiss government, following his arrival here last night on a four-day state visit, considering his vast expertise in science and technology. Switzerland considers the President as the father of India missile programme. Kalam is the first Indian head of state visiting Switzerland after a gap of more than 30 years. Former President V V Giri's was the last high profile visit to the country."


[nltk_data] Downloading package punkt to C:\Users\PANDIT
[nltk_data]     JI\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\PANDIT
[nltk_data]     JI\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [71]:
print(textpara)
print(len(textpara))

Geneva, May 26 (PTI) Switzerland has declared May 26 as the Science Day in honour of visiting President A P J Abdul Kalam. This was announced by the Swiss government, following his arrival here last night on a four-day state visit, considering his vast expertise in science and technology. Switzerland considers the President as the father of India missile programme. Kalam is the first Indian head of state visiting Switzerland after a gap of more than 30 years. Former President V V Giri's was the last high profile visit to the country.
539


In [72]:
list1 = sent_tokenize(textpara)
print(len(list1))

5


In [73]:
for i in list1:
    print(i)

Geneva, May 26 (PTI) Switzerland has declared May 26 as the Science Day in honour of visiting President A P J Abdul Kalam.
This was announced by the Swiss government, following his arrival here last night on a four-day state visit, considering his vast expertise in science and technology.
Switzerland considers the President as the father of India missile programme.
Kalam is the first Indian head of state visiting Switzerland after a gap of more than 30 years.
Former President V V Giri's was the last high profile visit to the country.


`sent_tokenize` is smart enough to recognise abbreviations such as "Mr." and does not treat the period after them as the end of a sentence.
Besides the sentence tokenizers (`sent_tokenize` and `PunktSentenceTokenizer`), NLTK also provides a word tokenizer. They do the same job at different levels: the word tokenizer splits a sentence into words, while the sentence tokenizer splits a paragraph into sentences.

In [74]:
t1 = "Mr. Ram is the next CEO. He is ex student of AIML 6 months IIT ropar batch. He has great knowledge."
# list3 = PunktSentenceTokenizer(t1)
# print(len(list3))
list2 = sent_tokenize(t1)
print(len(list2))


3


In [75]:
from nltk.tokenize.punkt import PunktSentenceTokenizer

t1 = "Mr. Ram is the next CEO. He is ex student of AIML 6 months IIT ropar batch. He has great knowledge."
tokenizer = PunktSentenceTokenizer()
list3 = tokenizer.tokenize(t1)
print(len(list3))

4


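The output is 4 because a `PunktSentenceTokenizer()` created without training data has no abbreviation list, so it treats the period in "Mr." as a sentence boundary. `sent_tokenize` uses a pre-trained English Punkt model that already knows common abbreviations, which is why it returned 3.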
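
One way to make a bare `PunktSentenceTokenizer` handle "Mr." is to pass it an abbreviation list through `PunktParameters`. The following is a minimal sketch; the abbreviation set chosen here is an illustrative assumption, not a complete list.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

t1 = ("Mr. Ram is the next CEO. He is ex student of AIML 6 months "
      "IIT ropar batch. He has great knowledge.")

# Abbreviations are stored in lower case and without the trailing period.
params = PunktParameters()
params.abbrev_types = {"mr", "mrs", "dr"}

# Passing a PunktParameters object instead of training text reuses it as-is.
custom_tokenizer = PunktSentenceTokenizer(params)
sentences = custom_tokenizer.tokenize(t1)

print(len(sentences))
for s in sentences:
    print(s)
```

With "mr" registered as an abbreviation, the period after "Mr" is no longer treated as a sentence break, so the result matches what `sent_tokenize` gave for the same text.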

# Word Tokenize
*** 

In [76]:

nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to C:\Users\PANDIT
[nltk_data]     JI\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [77]:
wordlist=word_tokenize(list1[2])
print(list1[2])
print(len(wordlist),wordlist)

Switzerland considers the President as the father of India missile programme.
12 ['Switzerland', 'considers', 'the', 'President', 'as', 'the', 'father', 'of', 'India', 'missile', 'programme', '.']


In [78]:
print(len(wordlist),wordlist)
print(len(engWordsList), engWordsList)

12 ['Switzerland', 'considers', 'the', 'President', 'as', 'the', 'father', 'of', 'India', 'missile', 'programme', '.']
198 ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on',

In [79]:
198-12

186

# Data Processing

In [80]:
# Remove all stopwords from the given wordlist:
# take one word from wordlist at a time and
# check whether it is present in engWordsList;
#   if it is present, do not include it in resultList,
#   otherwise include it in resultList.

resultList = []
for w in wordlist:
    if w.lower() not in engWordsList:
        resultList.append(w)

print(resultList)

['Switzerland', 'considers', 'President', 'father', 'India', 'missile', 'programme', '.']


In [81]:
# Remove every word from the given wordlist that appears in engWordsList.
# Loop: take one word from wordlist at a time;
#   if the chosen word is in engWordsList, do not print it,
#   otherwise print the chosen word.
print("given ", wordlist)
print(engWordsList)

for w in wordlist:
    if w.lower() not in engWordsList:
        print(w)

# Next step: append the result to List2 instead of printing it.

given  ['Switzerland', 'considers', 'the', 'President', 'as', 'the', 'father', 'of', 'India', 'missile', 'programme', '.']
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on',

In [82]:
print ("given " , wordlist)
print(engWordsList)
List2 = []

for w in wordlist :
  if w.lower() in engWordsList : pass
  else : List2.append(w)
print (List2)

given  ['Switzerland', 'considers', 'the', 'President', 'as', 'the', 'father', 'of', 'India', 'missile', 'programme', '.']
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on',

In [83]:
print ("given " , wordlist)
print(engWordsList)
List2 = []

for w in wordlist :
  if not w.lower() in engWordsList : List2.append(w)
print (List2)

given  ['Switzerland', 'considers', 'the', 'President', 'as', 'the', 'father', 'of', 'India', 'missile', 'programme', '.']
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on',

In [84]:
print(wordlist[0]) #engWordsList

Switzerland


In [85]:
# check whether wordlist[0] is in engWordsList
mylist = []
if wordlist[0].lower() in engWordsList:
    print('true')
else:
    mylist.append(wordlist[0])  # not a stopword, so keep it

print(mylist)

['Switzerland']


In [86]:
mylist= []
for w in wordlist :
  # if not w in engWordsList :
  if  w not in engWordsList :
    mylist.append(w)

print(wordlist)
print(mylist)

['Switzerland', 'considers', 'the', 'President', 'as', 'the', 'father', 'of', 'India', 'missile', 'programme', '.']
['Switzerland', 'considers', 'President', 'father', 'India', 'missile', 'programme', '.']
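
The loops above test membership in a list for every token and keep the final `'.'`. A slightly more general sketch converts the stopword list to a set once and optionally drops punctuation tokens. The short `sample_stopwords` list is a stand-in for `engWordsList` so that the example runs without downloading NLTK data, and the helper name `remove_stopwords` is hypothetical.

```python
import string

# Small stand-in for engWordsList (assumption: the real list comes from NLTK).
sample_stopwords = ["the", "as", "of", "a", "an", "is"]
wordlist = ['Switzerland', 'considers', 'the', 'President', 'as', 'the',
            'father', 'of', 'India', 'missile', 'programme', '.']

def remove_stopwords(tokens, stopword_list, drop_punctuation=True):
    """Return the tokens that are neither stopwords nor (optionally) punctuation."""
    stop_set = set(stopword_list)  # set lookup is O(1); list lookup is O(n)
    kept = []
    for tok in tokens:
        if tok.lower() in stop_set:
            continue
        # A token made up only of punctuation characters, e.g. '.' or '--'
        if drop_punctuation and all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept

print(remove_stopwords(wordlist, sample_stopwords))
```

Passing `drop_punctuation=False` reproduces the behaviour of the notebook cells, which keep the trailing period.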


### Web scraping: the urllib module

In [87]:
import requests
import urllib.request
blogurl = "https://www.aajtak.in/"
recData = urllib.request.urlopen(blogurl)

In [88]:
pagedata= recData.read()
print(  pagedata )



In [89]:


# url = "https://www.aajtak.in/"
# headers = {
#     "User-Agent": "Mozilla/5.0"
# }
# response = requests.get(url, headers=headers)

# # Print or process HTML content
# print(response.text)

In [90]:
from bs4 import BeautifulSoup


soup = BeautifulSoup(pagedata, "html5lib")  # Correct instantiation
text = soup.get_text(strip=True)
print(text)

Hindi news, हिंदी न्यूज़ , Hindi Samachar, हिंदी समाचार, Latest News in Hindi, Breaking News in Hindi, ताजा ख़बरें, Aaj Tak Newsconst home_config = {}; home_config["is_US"] = 0; home_config["popular_shows"] = 1;var country_flag = '';
    var special_offer_story = "";
    const SwiperPendingCalls = [];
    const Delay_call = 0;
    const failSafe_delay = Delay_call + 1000;
     
    async function fetchDocument() {
        let response = await fetch('https://feeds.intoday.in/geocheck');
        if (response.status === 200) {
            let d = await response.text();
            let data = JSON.parse(d, true);
            //console.log(data['country_code']);
            country_flag = data['country_code'];
            if (typeof special_offer_story === "function") {
                special_offer_story();
            }
        }
    }
    fetchDocument();var GoogleAdTags = [];
    var is_sso_check = !0;
    var ssoUserDetail;
    var is_ad_free = "no";
    var refresh_callback = [];
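
The output above mixes visible text with JavaScript because the contents of `<script>` tags are text nodes too. With BeautifulSoup the usual remedy is to call `decompose()` on every `script` and `style` tag before `get_text()`. The sketch below shows the same idea using only the standard library's `html.parser` (a swapped-in technique, so that it runs without extra packages); the class name `VisibleTextParser` and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside <script>/<style>
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical page fragment, loosely modelled on the output above.
sample_html = """<html><head><title>Hindi news</title>
<script>var country_flag = '';</script>
<style>h1 { color: red; }</style></head>
<body><h1>Latest News in Hindi</h1><p>Aaj Tak</p></body></html>"""

print(visible_text(sample_html))
```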

    

In [91]:
if 'Obsession' in text:
    print("Obsession found in the text.")
if 'हिंदी' in text:
    print("हिंदी found in the text.")

हिंदी found in the text.


Count how many times हिंदी appears in the text.

In [92]:
count=0
for w in text.split():
    if 'हिंदी' in w:
        count += 1
print(f"Count of 'हिंदी': {count}")


Count of 'हिंदी': 11


* Show the 20 characters before and after हिंदी
* Print the position where हिंदी starts
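
The cells that follow work on whole words from `text.split()`. For a character-level answer to the two tasks above, `str.find` and slicing are enough. This is a minimal sketch; the helper name `find_with_context` and the sample string are assumptions for illustration.

```python
def find_with_context(text, keyword, width=20):
    """Yield (start_index, snippet) for every occurrence of keyword in text.

    The snippet holds up to `width` characters before and after the keyword.
    """
    start = text.find(keyword)
    while start != -1:
        left = max(0, start - width)           # do not run past the beginning
        right = start + len(keyword) + width   # slicing clips at the end
        yield start, text[left:right]
        start = text.find(keyword, start + 1)

sample = "Hindi news, हिंदी न्यूज़, Hindi Samachar, हिंदी समाचार, Latest News in Hindi"
for position, snippet in find_with_context(sample, "हिंदी"):
    print(position, repr(snippet))
```

Indices count Unicode code points, not bytes, so the positions can be used directly for slicing the same string.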

In [93]:
l1=text.split()
print(type(l1),len(l1))

# print(l1.index('हिंदी')

idx1 = l1.index('हिंदी')
for i in range(idx1-2, idx1+2):
    print(l1[i])
print(l1[:20])

<class 'list'> 26811
Hindi
news,
हिंदी
न्यूज़
['Hindi', 'news,', 'हिंदी', 'न्यूज़', ',', 'Hindi', 'Samachar,', 'हिंदी', 'समाचार,', 'Latest', 'News', 'in', 'Hindi,', 'Breaking', 'News', 'in', 'Hindi,', 'ताजा', 'ख़बरें,', 'Aaj']


In [94]:
# show the next 'हिंदी'
idx1 = l1.index('हिंदी' , idx1+1 , len(l1))
for i in range(idx1-2,idx1+2):
  print(l1[i])

Hindi
Samachar,
हिंदी
समाचार,


In [95]:
# show the next 'हिंदी'
idx1 = l1.index('हिंदी' , idx1+1 , len(l1))
for i in range(idx1-2,idx1+2):
  print(l1[i])

मतलबऔर
भीसाहित्यजब
हिंदी
के


In [96]:
# idx1 =0
# for i in range(10):
#   idx1 = l1.index('हिंदी' , idx1+1 , len(l1))
#   for i in range(idx1-2,idx1+2):
#     print(l1[i])
#   print ('---------')


In [97]:
idx1 = 0
for _ in range(10):
    try:
        idx1 = l1.index('हिंदी', idx1 + 1, len(l1))
    except ValueError:
        break
    for j in range(idx1 - 2, idx1 + 2):
        print(l1[j])
    print('---------')


Hindi
news,
हिंदी
न्यूज़
---------
Hindi
Samachar,
हिंदी
समाचार,
---------
मतलबऔर
भीसाहित्यजब
हिंदी
के
---------
"0120-4807100"}}{"@context":"http://schema.org/","@type":"WebPage","name":"Hindi
news,
हिंदी
न्यूज़
---------
Hindi
Samachar,
हिंदी
समाचार,
---------
ख़बरें","keywords":"Hindi
news,
हिंदी
न्यूज़
---------
Hindi
Samachar,
हिंदी
समाचार,
---------


# Other

In [98]:
import urllib.request
from bs4 import BeautifulSoup

blogurl = "https://aimljan24.glitch.me/"
# blogurl = 'https://www.ndtv.com/india-news/indian-space-start-up-uses-spy-satellite-tech-to-track-mosquitos-5563134'
recData = urllib.request.urlopen(blogurl)
pagedata = recData.read()
soup = BeautifulSoup(pagedata, "html5lib")  # Correct instantiation
text = soup.get_text(strip=True)
print(text)

AIML websiteAIMLjan24HomeStudentsaboutSign UpLoginN P S KapanyVinod DhamGo to Ajay Bhattlink1link2link3


In [99]:
token1 = [t for t in text.split()]
token1

['AIML',
 'websiteAIMLjan24HomeStudentsaboutSign',
 'UpLoginN',
 'P',
 'S',
 'KapanyVinod',
 'DhamGo',
 'to',
 'Ajay',
 'Bhattlink1link2link3']

In [100]:
stext = text.split()
print(type(stext), len(stext))

<class 'list'> 10


In [101]:
# search apj in webpage 
if 'apj' in text: print("APJ found in the text.")

In [102]:
# create a list of words from the text
words = [t for t in stext]
print (len(words), words[:20])

10 ['AIML', 'websiteAIMLjan24HomeStudentsaboutSign', 'UpLoginN', 'P', 'S', 'KapanyVinod', 'DhamGo', 'to', 'Ajay', 'Bhattlink1link2link3']


In [103]:
# show me all words containing 'apj'
for w in words:
    if 'apj' in w.lower(): print(w)

In [104]:
fd = nltk.FreqDist(words)
print(type(fd),fd)

<class 'nltk.probability.FreqDist'> <FreqDist with 10 samples and 10 outcomes>


In [106]:
for item, val in fd.items():
  if val>=1 :   print(item ," - " , val)

AIML  -  1
websiteAIMLjan24HomeStudentsaboutSign  -  1
UpLoginN  -  1
P  -  1
S  -  1
KapanyVinod  -  1
DhamGo  -  1
to  -  1
Ajay  -  1
Bhattlink1link2link3  -  1


In [107]:
def showFreq (url):
  recData = urllib.request.urlopen (url)
  pagedata= recData.read()
  soup=BeautifulSoup(pagedata,"html5lib")
  text=soup.get_text(strip=True)
  stext = text.split()
  words = [ t for t in stext]
  fd = nltk.FreqDist(words)
  for item, val in fd.items():
    if val>=2 :   print(item ," - " , val)
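
Because `showFreq` counts every whitespace-separated chunk, its output tends to be dominated by stopwords and by fragments of page scripts. A hedged sketch of a cleaner count is shown below, using `collections.Counter` from the standard library (`nltk.FreqDist` is a `Counter` subclass, so `most_common` works the same way there). The helper name `top_words` and the sample text are assumptions, and the ASCII-only pattern means non-English words are dropped.

```python
import re
from collections import Counter

def top_words(text, stopword_list=(), n=5):
    """Return the n most common alphabetic words in text, ignoring stopwords."""
    stop_set = {w.lower() for w in stopword_list}
    # Keep runs of ASCII letters only: this drops punctuation, digits and most
    # code fragments, but also any non-English words (assumption of this sketch).
    words = re.findall(r"[A-Za-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_set)
    return counts.most_common(n)

# Made-up text that imitates a scraped page with script residue.
sample_text = "Training of trainers }); var x = Training in Electronics & Training of staff"
print(top_words(sample_text, stopword_list=["of", "in"], n=2))
```

Identifiers from embedded JavaScript, such as `var`, still pass this filter, so it complements rather than replaces cleaning the HTML itself.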

In [108]:
url = "https://aimljul23f.glitch.me/"
url1 = "https://aimljul23f.glitch.me/cse"
url2 = "https://nielit.gov.in/chandigarh/index.php"
showFreq(url2)

of  -  42
India  -  2
:  -  4
Institute  -  2
Electronics  -  6
&  -  22
Information  -  5
@import  -  21
||  -  2
//--><!]]><!--//--><![CDATA[//><!--  -  2
()  -  16
{  -  64
function  -  8
jQuery('ul',  -  2
},  -  11
}  -  26
);  -  3
//  -  5
$(document).ready(function(){  -  5
var  -  16
=  -  38
navState  -  2
+  -  4
the  -  30
to  -  29
in  -  18
>  -  4
e.preventDefault();  -  4
});  -  43
function(){  -  2
this  -  4
Virtual  -  2
and  -  23
MoU  -  3
for  -  20
Digitization  -  2
NIELIT  -  5
Data  -  2
BOOTCAMPGoT  -  2
-  -  12
AdvancedGoT  -  3
BasicARVR  -  2
Energy  -  2
EOI  -  2
Certification  -  2
with  -  6
IIT  -  3
RoparShort  -  2
Term  -  6
Training  -  5
CoursesDigital  -  2
Literacy  -  3
Development  -  2
weeks  -  2
6  -  3
Industrial  -  2
Programmes  -  3
under  -  2
Joint  -  3
Short  -  3
between  -  5
With  -  10
Accredited  -  2
CentreBecome  -  2
NewsNIELIT  -  2
selected  -  8
contractual  -  8
post  -  9
Resource  -  9
Person  -  9
(Faculty-IT/CS)  

In [110]:
def showFreqofword(url, word2search):
    '''
    showFreqofword(url to be opened, word to be searched)
    Prints how many times word2search occurs on the page (case-insensitive).
    '''
    recData = urllib.request.urlopen(url)
    pagedata = recData.read()
    soup = BeautifulSoup(pagedata, "html5lib")
    text = soup.get_text(strip=True)
    words = [t.lower() for t in text.split()]
    fd = nltk.FreqDist(words)
    # FreqDist behaves like a dictionary, so the count can be looked up directly
    # (a word that never occurs has a count of 0).
    print(fd[word2search.lower()])

In [111]:
url1 = "https://aimljul23f.glitch.me/cse"
url4 = "https://avrtutorial.glitch.me/"
showFreqofword(url4 , "virtual" )

10


Exercises:

1. Search for the email of apj on the webpage.
2. Count how many emails are present on the webpage.
3. Display the text written in `<h1>` on the given webpage.
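
For the first two tasks (finding an email address and counting emails), a regular expression over the page text is one possible approach. The pattern below is deliberately simplified and is not a full email validator; the helper name `find_emails` and the sample addresses are hypothetical.

```python
import re

# Simplified pattern: good enough for an exercise, not a complete validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text):
    """Return the unique email-like strings in text, in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

# Hypothetical page text used only for illustration.
sample_text = ("Contact apj@example.org or admin@example.com. "
               "Write to apj@example.org again.")

emails = find_emails(sample_text)
print(len(emails), emails)                          # how many emails are present
print([e for e in emails if "apj" in e.lower()])    # emails that mention apj
```

On a real page, `sample_text` would be replaced by the text returned from `soup.get_text()`.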

### Find the text inside h1 or any other tag at a given URL

In [112]:
import nltk
import urllib.request
from bs4 import BeautifulSoup

url ="https://www.aajtak.in"
bytesrcd  = urllib.request.urlopen  (url)
txtrcd = bytesrcd.read()

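# Note: str() on a bytes object returns its printable form (b'...'), in which
# non-ASCII bytes appear as \x.. escapes; this is why the Hindi text in the
# outputs below is unreadable. txtrcd.decode('utf-8') would give readable text.
txtrcd = str(txtrcd)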
print (txtrcd)



In [113]:
#display the text inside h3 tag on the webpage
# step 1 search h3 in txtrcd
st = txtrcd.find( 'h3',0)
h3end = txtrcd.find( '/h3',0)
print(st, h3end)

50091 314909


In [115]:
# display the first <h3>...</h3> element
st = txtrcd.find('<h3>', 0)
# txtrcd[st : st+1800]
stend = txtrcd.find('</h3>', st)
print(st, stend)
print(txtrcd[st : stend + len('</h3>')])  # include the closing tag itself
# <h3> <a  title="कैसरगंज से बृजभूषण का कट सकता है टिकट, बेटे की उम्मीदवारी पर बन सकती बात"
# href="https://www.aajtak.in/elections/lok-sabha-election-2024/story/brijbhushan-sharan-singh-
# bjp-ticket-may-be-cut-from-kaiserganj-talks-may-be-held-on-son-candidature-ntc-1935740-2024-05-02" >
#  कैसरगंज से बृजभूषण का कट सकता है टिकट, बेटे की उम्मीदवारी पर बन सकती बात </a> </h3>

312473 314908
<h3>\n                            \n                            <a  title="\xe0\xa4\xac\xe0\xa5\x88\xe0\xa4\x95\xe0\xa4\xab\xe0\xa5\x81\xe0\xa4\x9f \xe0\xa4\xaa\xe0\xa4\xb0 \xe0\xa4\xa6\xe0\xa4\xbf\xe0\xa4\xb2\xe0\xa5\x8d\xe0\xa4\xb2\xe0\xa5\x80 \xe0\xa4\xb8\xe0\xa4\xb0\xe0\xa4\x95\xe0\xa4\xbe\xe0\xa4\xb0, \xe0\xa4\xb9\xe0\xa4\x9f \xe0\xa4\xb8\xe0\xa4\x95\xe0\xa4\xa4\xe0\xa4\xbe \xe0\xa4\xb9\xe0\xa5\x88 \'\xe0\xa4\x89\xe0\xa4\xae\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xa6\xe0\xa4\xb0\xe0\xa4\xbe\xe0\xa4\x9c\' \xe0\xa4\xb5\xe0\xa4\xbe\xe0\xa4\xb9\xe0\xa4\xa8\xe0\xa5\x8b\xe0\xa4\x82 \xe0\xa4\xaa\xe0\xa4\xb0 \xe0\xa4\xb2\xe0\xa4\x97\xe0\xa4\xbe \xe0\xa4\xac\xe0\xa5\x88\xe0\xa4\xa8, \xe0\xa4\xae\xe0\xa4\x82\xe0\xa4\xa4\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa5\x80 \xe0\xa4\xb8\xe0\xa4\xbf\xe0\xa4\xb0\xe0\xa4\xb8\xe0\xa4\xbe \xe0\xa4\xa8\xe0\xa5\x87 \xe0\xa4\x97\xe0\xa4\xbf\xe0\xa4\xa8\xe0\xa4\xbe\xe0\xa4\x88\xe0\xa4\x82 \xe0\xa4\xa8\xe0\xa4\x8f \xe0\xa4\xa8\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa4\xae

In [116]:
firsth3 = txtrcd[st : stend + len('</h3>')]
# apply BeautifulSoup to this fragment
txt2 = BeautifulSoup(firsth3, "html5lib")
print(txt2.get_text())

\n                            \n                            \n                                                                \xe0\xa4\xac\xe0\xa5\x88\xe0\xa4\x95\xe0\xa4\xab\xe0\xa5\x81\xe0\xa4\x9f \xe0\xa4\xaa\xe0\xa4\xb0 \xe0\xa4\xa6\xe0\xa4\xbf\xe0\xa4\xb2\xe0\xa5\x8d\xe0\xa4\xb2\xe0\xa5\x80 \xe0\xa4\xb8\xe0\xa4\xb0\xe0\xa4\x95\xe0\xa4\xbe\xe0\xa4\xb0, \xe0\xa4\xb9\xe0\xa4\x9f \xe0\xa4\xb8\xe0\xa4\x95\xe0\xa4\xa4\xe0\xa4\xbe \xe0\xa4\xb9\xe0\xa5\x88 \'\xe0\xa4\x89\xe0\xa4\xae\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xa6\xe0\xa4\xb0\xe0\xa4\xbe\xe0\xa4\x9c\' \xe0\xa4\xb5\xe0\xa4\xbe\xe0\xa4\xb9\xe0\xa4\xa8\xe0\xa5\x8b\xe0\xa4\x82 \xe0\xa4\xaa\xe0\xa4\xb0 \xe0\xa4\xb2\xe0\xa4\x97\xe0\xa4\xbe \xe0\xa4\xac\xe0\xa5\x88\xe0\xa4\xa8, \xe0\xa4\xae\xe0\xa4\x82\xe0\xa4\xa4\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa5\x80 \xe0\xa4\xb8\xe0\xa4\xbf\xe0\xa4\xb0\xe0\xa4\xb8\xe0\xa4\xbe \xe0\xa4\xa8\xe0\xa5\x87 \xe0\xa4\x97\xe0\xa4\xbf\xe0\xa4\xa8\xe0\xa4\xbe\xe0\xa4\x88\xe0\xa4\x82 \xe0\xa4\xa8\xe0\xa4\x8f \xe0\xa4\xa
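
The escaped `\xe0\xa4...` sequences above appear because `str()` applied to a bytes object returns its printable representation rather than decoded text. Decoding the bytes as UTF-8 first keeps the Hindi readable. With BeautifulSoup, `soup.find_all('h3')` followed by `get_text(strip=True)` on each element is the usual way to read a tag's text; the sketch below shows the same idea with the standard library's `html.parser` so that it runs without extra packages. The class name `TagTextParser`, the helper `text_in_tag`, and the sample bytes are assumptions for illustration.

```python
from html.parser import HTMLParser

class TagTextParser(HTMLParser):
    """Collect the text found inside every occurrence of one tag (e.g. 'h3')."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self._depth = 0      # > 0 while inside the wanted tag
        self._buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace and newlines into single spaces
                self.results.append(" ".join("".join(self._buffer).split()))
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

def text_in_tag(page_bytes, tag, encoding="utf-8"):
    html = page_bytes.decode(encoding)  # bytes -> readable str
    parser = TagTextParser(tag)
    parser.feed(html)
    parser.close()
    return parser.results

# Hypothetical fragment shaped like the <h3> block printed above.
sample_bytes = '<h3>\n  <a title="x" href="#"> दिल्ली सरकार </a>\n</h3><h3>Second</h3>'.encode("utf-8")

print(str("द".encode("utf-8")))          # printable form of the bytes, with escapes
print(text_in_tag(sample_bytes, "h3"))   # decoded, readable text
```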

# PunktSentenceTokenizer

In [117]:
text1 ="Hello Mr. Ram you r great. have a nice day "
list1 = sent_tokenize( text1)
print(len(list1), list1)

2 ['Hello Mr. Ram you r great.', 'have a nice day']


In [118]:
from nltk.tokenize import PunktSentenceTokenizer

In [119]:
tk1 = PunktSentenceTokenizer()
list2 = tk1.tokenize(text1)
print(len(list2) , list2)

3 ['Hello Mr.', 'Ram you r great.', 'have a nice day']


In [120]:
from nltk.tokenize import word_tokenize
print(word_tokenize(text1))

['Hello', 'Mr.', 'Ram', 'you', 'r', 'great', '.', 'have', 'a', 'nice', 'day']


### WordNet, lemmas, antonyms, synonyms

| Term        | Description                                                                                                 | Example                              |
| ----------- | ----------------------------------------------------------------------------------------------------------- | ------------------------------------ |
| **WordNet** | A lexical database that organizes words into synsets, with semantic relations like synonyms, antonyms, etc. | `wn.synsets("good")` returns synsets |
| **Synset**  | A set of synonyms that share the same meaning                                                               | `['happy', 'felicitous']`            |
| **Lemma**   | The base (dictionary) form of a word in a synset                                                            | `'run'` is lemma of `'running'`      |
| **Synonym** | A word with the same or very similar meaning                                                                | `'happy'` ↔ `'joyful'`               |
| **Antonym** | A word with the opposite meaning                                                                            | `'happy'` ↔ `'sad'`                  |


In [121]:
nltk.download('wordnet')
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to C:\Users\PANDIT
[nltk_data]     JI\AppData\Roaming\nltk_data...


In [122]:
mpain = wordnet.synsets("computer")
print(type(mpain))

<class 'list'>


In [123]:
mpain[0].definition()

'a machine for performing calculations automatically'

In [124]:
# see all
for mp in mpain:
  print( mp.definition())

a machine for performing calculations automatically
an expert at calculation (or at operating calculating machines)


In [125]:
for syn in wordnet.synsets('Computer'):
  for lemma in syn.lemmas():
    print(lemma.name())

computer
computing_machine
computing_device
data_processor
electronic_computer
information_processing_system
calculator
reckoner
figurer
estimator
computer


In [126]:
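# Print each lemma of "good"; a lemma that has an antonym is followed by it and a newline.
# Lemmas without an antonym stay on the same line, which is why the output runs together.
for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        print(lemma.name(), end=" - ")
        if lemma.antonyms():
            print(lemma.antonyms()[0].name())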

good - good - evil
goodness - evilness
good - bad
goodness - badness
commodity - trade_good - good - good - bad
full - good - good - evil
estimable - good - honorable - respectable - beneficial - good - good - good - just - upright - adept - expert - good - practiced - proficient - skillful - skilful - good - dear - good - near - dependable - good - safe - secure - good - right - ripe - good - well - effective - good - in_effect - in_force - good - good - serious - good - sound - good - salutary - good - honest - good - undecomposed - unspoiled - unspoilt - good - well - ill
good - thoroughly - soundly - good - 

In [128]:
# complete the code to show all synonyms

for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        print(lemma.name())

good
good
goodness
good
goodness
commodity
trade_good
good
good
full
good
good
estimable
good
honorable
respectable
beneficial
good
good
good
just
upright
adept
expert
good
practiced
proficient
skillful
skilful
good
dear
good
near
dependable
good
safe
secure
good
right
ripe
good
well
effective
good
in_effect
in_force
good
good
serious
good
sound
good
salutary
good
honest
good
undecomposed
unspoiled
unspoilt
good
well
good
thoroughly
soundly
good


# Word Stemming

In [129]:
from nltk.stem import PorterStemmer

In [130]:
stemmer =PorterStemmer()
print(stemmer.stem('heros'))

hero


In [131]:
print(stemmer.stem('classes'), stemmer.stem('buses'), stemmer.stem('increases'))

class buse increas
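
A stemmer strips suffixes by rule and never consults a dictionary, which is why the result can be a non-word such as `buse`. The small sketch below collects stems for several words at once; `PorterStemmer` needs no corpus download, and the helper name `stem_words` is hypothetical.

```python
from nltk.stem import PorterStemmer

def stem_words(words):
    """Return {word: stem} using NLTK's PorterStemmer."""
    stemmer = PorterStemmer()
    return {w: stemmer.stem(w) for w in words}

stems = stem_words(["classes", "buses", "heros"])
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Comparing these stems with the lemmatizer results further down shows the trade-off: stemming is fast and dictionary-free, while lemmatization returns real words such as `bus`.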


In [132]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [133]:
print(lemmatizer.lemmatize('increases'),lemmatizer.lemmatize('buses'),lemmatizer.lemmatize('classes'))

increase bus class


In [134]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))

play
playing
playing
playing


| POS Code | Meaning   | Example             |
| -------- | --------- | ------------------- |
| `"v"`    | Verb      | `'running' → 'run'` |
| `"n"`    | Noun      | `'wolves' → 'wolf'` |
| `"a"`    | Adjective | `'better' → 'good'` |
| `"r"`    | Adverb    | `'better' → 'well'` |


In [135]:
def showWord( myword):
  lemmatizer = WordNetLemmatizer()
  print(lemmatizer.lemmatize(myword, pos="v"))
  print(lemmatizer.lemmatize(myword, pos="n"))
  print(lemmatizer.lemmatize(myword, pos="a"))
  print(lemmatizer.lemmatize(myword, pos="r"))

In [136]:
showWord('mailing')

mail
mailing
mailing
mailing
