Natural Language Understanding (NLU) is the ability of a computer to understand human language as it is naturally spoken and written. Here are some applications of NLU:

1. **Named Entity Recognition (NER)**:
   Extract entities such as names of people, organizations, and locations, as well as time expressions, quantities, and monetary values, from the text.
   Libraries: `spaCy`, `nltk`

2. **Chatbots and Virtual Assistants**:
   Understand user queries and generate appropriate responses.
   Libraries: `Rasa`, `ChatterBot`

3. **Part-of-Speech Tagging**:
   Label words in a sentence with their respective parts of speech (e.g., noun, verb, adjective).
   Libraries: `spaCy`, `nltk`

4. **Text Classification**:
   Assign text to predefined categories, as in spam detection, sentiment analysis, or topic categorization.
   Libraries: `scikit-learn`, `TensorFlow`, `PyTorch`

5. **Syntax Parsing**:
   Analyze the grammatical structure of a sentence, establishing relationships between words.
   Libraries: `spaCy`, `nltk`

6. **Semantic Role Labeling**:
   Determine the semantic relationships between words in a sentence (e.g., who did what to whom).
   Libraries: `AllenNLP`

7. **Machine Translation**:
   Translate text from one language to another.
   Libraries: `OpenNMT`, `Transformers` from Hugging Face

8. **Coreference Resolution**:
   Identify which expressions (pronouns and noun phrases) refer to the same entity.
   Libraries: `spaCy` (with the `neuralcoref` extension), `AllenNLP`

9. **Text Summarization**:
   Generate a concise and meaningful summary of a longer text.
   Libraries: `Transformers` from Hugging Face, `Gensim`

10. **Question Answering**:
    Extract answers from a given text based on the posed question.
    Libraries: `Transformers` from Hugging Face



## Sentiment Analysis using TextBlob

In [4]:
%%capture
!pip install textblob

In [78]:
# Importing necessary modules
from textblob import TextBlob

In [79]:
# Spelling correction
blob = TextBlob("I havv good speling!")
print(blob.correct())

I have good spelling!


In [81]:
# Tokenization using TextBlob
from textblob import TextBlob
blob = TextBlob("TextBlob is a great Python library. It makes text processing easy!")
blob.sentences  # Split into sentences

[Sentence("TextBlob is a great Python library."),
 Sentence("It makes text processing easy!")]

In [82]:
blob.words      # Split into words

WordList(['TextBlob', 'is', 'a', 'great', 'Python', 'library', 'It', 'makes', 'text', 'processing', 'easy'])

In [83]:
blob.tags #POS Tags

[('TextBlob', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('Python', 'NNP'),
 ('library', 'NN'),
 ('It', 'PRP'),
 ('makes', 'VBZ'),
 ('text', 'JJ'),
 ('processing', 'VBG'),
 ('easy', 'JJ')]

In [84]:
blob.noun_phrases

WordList(['textblob', 'python', 'text processing'])

In [86]:
blob

TextBlob("TextBlob is a great Python library. It makes text processing easy!")

In [90]:
def analyze_sentiment(text):
    """Return sentiment polarity and subjectivity of a given text using TextBlob."""
    analysis = TextBlob(text)
    
    # The polarity is a float value within the range [-1.0 to 1.0]
    # where -1 indicates a negative sentiment and 1 indicates a positive sentiment.
    polarity = analysis.sentiment.polarity
    
    # Subjectivity is within the range [0.0 to 1.0] where 0.0 is very objective
    # and 1.0 is very subjective.
    subjectivity = analysis.sentiment.subjectivity
    
    return polarity, subjectivity

In [91]:
# Test the function
text_sample = "I love this product! It's amazing."
polarity, subjectivity = analyze_sentiment(text_sample)

print(f"Polarity: {polarity}, Subjectivity: {subjectivity}")

Polarity: 0.6125, Subjectivity: 0.75


In [92]:
# Test the function with different samples
samples = [
    "Paris is the capital of France.",
    "Water boils at 100 degrees Celsius under standard atmospheric pressure.",
    "I think chocolate ice cream is the best flavor ever.",
    "The movie was incredibly boring and felt like a waste of time."
]

In [93]:
for sample in samples:
    polarity, subjectivity = analyze_sentiment(sample)
    print(f"Text: '{sample}'\nPolarity: {polarity}, Subjectivity: {subjectivity}\n")

Text: 'Paris is the capital of France.'
Polarity: 0.0, Subjectivity: 0.0

Text: 'Water boils at 100 degrees Celsius under standard atmospheric pressure.'
Polarity: 0.0, Subjectivity: 0.0

Text: 'I think chocolate ice cream is the best flavor ever.'
Polarity: 1.0, Subjectivity: 0.3

Text: 'The movie was incredibly boring and felt like a waste of time.'
Polarity: -0.6, Subjectivity: 0.5
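
TextBlob's default analyzer is lexicon-based: it looks up words with known sentiment scores and combines them, which is why purely factual sentences come out at 0.0 for both values. The sketch below illustrates that idea only. `TOY_LEXICON` and `toy_sentiment` are hypothetical names with made-up scores, and the real analyzer also handles intensifiers and negation, which this sketch ignores.

```python
import re

# Hypothetical mini-lexicon: word -> (polarity, subjectivity).
# The values are illustrative only and are NOT TextBlob's real scores.
TOY_LEXICON = {
    "love": (0.5, 0.6),
    "amazing": (0.6, 0.9),
    "best": (1.0, 0.3),
    "boring": (-1.0, 1.0),
}

def toy_sentiment(text):
    """Average the lexicon scores of the words found in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        # No sentiment-bearing word found: neutral and objective.
        return 0.0, 0.0
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

print(toy_sentiment("Paris is the capital of France."))
print(toy_sentiment("The movie was incredibly boring."))
```

A sentence with no lexicon hits falls through to `(0.0, 0.0)`, mirroring the behaviour seen for the two factual samples above.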



## Word Sense Disambiguation

In [6]:
%%capture
!pip install -U pywsd

In [94]:
# Importing necessary modules
from pywsd.lesk import simple_lesk
from nltk.corpus import wordnet
import nltk

# # This is needed if you're using NLTK for the first time
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')
# nltk.download('punkt')

In [97]:
# Provide a sentence containing an ambiguous word ('bat' here) for disambiguation
# Alternative example: sentence = "I went to the bank to deposit my money."
sentence = "He picked up the bat and swung it"

# Using simple_lesk for WSD. It returns a WordNet synset.
disambiguated_sense = simple_lesk(sentence, 'bat')  # use 'bank' with the alternative sentence

print(f"Disambiguated sense: {disambiguated_sense}")
print(f"Definition: {disambiguated_sense.definition()}")

Disambiguated sense: Synset('squash_racket.n.01')
Definition: a small racket with a long handle used for playing squash
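
Lesk-style disambiguation picks the sense whose dictionary description shares the most words with the surrounding context. The sketch below shows that core idea in plain Python. The sense names, glosses, and stopword list are hypothetical and hand-written for illustration (not taken from WordNet), and `pywsd` builds richer signatures than this.

```python
# Hypothetical sense inventory for "bat"; glosses are paraphrased for
# illustration and are NOT WordNet definitions.
SENSES = {
    "bat.animal": "nocturnal flying mammal with wings",
    "bat.club": "wooden club swung to hit the ball in baseball or cricket",
}

# Small hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "it", "to", "in", "or", "with", "he", "up"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def toy_lesk(context, senses):
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = tokenize(context)
    overlaps = {
        name: len(context_words & tokenize(gloss))
        for name, gloss in senses.items()
    }
    best = max(overlaps, key=overlaps.get)
    return best, overlaps

best, overlaps = toy_lesk("He picked up the bat and swung it", SENSES)
print(best, overlaps)
```

Because the decision rests on exact word overlap, a single shared word can tip the result. That is one plausible reason overlap-based methods sometimes return an unexpected sense, as in the output above.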


## POS Tagging

In [98]:
import nltk
from nltk.tokenize import word_tokenize

# Downloading the Punkt Tokenizer and the averaged_perceptron_tagger (for POS tagging)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Prashant\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Prashant\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [99]:
text = "The upcoming India vs Pakistan match on Sunday faces a significant risk of being interrupted by rain again."

words = word_tokenize(text)
pos_tags = nltk.pos_tag(words)

print("Word".ljust(15), "POS".ljust(10))
print("-" * 25)
for word, tag in pos_tags:
    print(word.ljust(15), tag.ljust(10))

Word            POS       
-------------------------
The             DT        
upcoming        JJ        
India           NNP       
vs              NN        
Pakistan        NNP       
match           NN        
on              IN        
Sunday          NNP       
faces           VBZ       
a               DT        
significant     JJ        
risk            NN        
of              IN        
being           VBG       
interrupted     VBN       
by              IN        
rain            NN        
again           RB        
.               .         


#### `nltk` uses the Penn Treebank POS Tags

In [100]:
# Download the 'tagsets' resource, which nltk.help needs to display tag documentation
nltk.download('tagsets')

# Displaying the Penn Treebank POS Tags and their explanations
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\Prashant\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


### POS Tagging using spaCy

In [14]:
# %%capture
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [18]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

In [19]:
import spacy

# Load English tokenizer, POS tagger, and NER
nlp = spacy.load("en_core_web_sm")

In [101]:
text = "The upcoming India vs Pakistan match on Sunday faces a significant risk of being interrupted by rain again."

doc = nlp(text)

In [103]:
type(doc)

spacy.tokens.doc.Doc

In [20]:
print("Word".ljust(15), "POS".ljust(10))
print("-" * 25)
for token in doc:
    print(token.text.ljust(15), token.pos_.ljust(10))

Word            POS       
-------------------------
The             DET       
upcoming        ADJ       
India           PROPN     
vs              ADP       
Pakistan        PROPN     
match           NOUN      
on              ADP       
Sunday          PROPN     
faces           VERB      
a               DET       
significant     ADJ       
risk            NOUN      
of              ADP       
being           AUX       
interrupted     VERB      
by              ADP       
rain            NOUN      
again           ADV       
.               PUNCT     


**spaCy exposes coarse-grained Universal POS tags (`token.pos_`) as well as fine-grained, treebank-specific tags (`token.tag_`).**
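
To compare the nltk and spaCy results for the same sentence, the Penn Treebank tags can be projected onto Universal POS tags. The sketch below uses a simplified, hand-written mapping (`PTB_TO_UPOS` is a hypothetical name, and the one-to-one mapping is an assumption that does not hold in general) together with the tags copied from the two outputs above.

```python
# Simplified, hand-written mapping from Penn Treebank tags to Universal POS tags.
# Assumption: each treebank tag maps to exactly one universal tag.
PTB_TO_UPOS = {
    "DT": "DET", "JJ": "ADJ", "NN": "NOUN", "NNP": "PROPN", "IN": "ADP",
    "VBZ": "VERB", "VBG": "VERB", "VBN": "VERB", "RB": "ADV", ".": "PUNCT",
}

# (word, nltk tag, spaCy pos_) triples copied from the two outputs above.
tagged = [
    ("The", "DT", "DET"), ("upcoming", "JJ", "ADJ"), ("India", "NNP", "PROPN"),
    ("vs", "NN", "ADP"), ("Pakistan", "NNP", "PROPN"), ("match", "NN", "NOUN"),
    ("on", "IN", "ADP"), ("Sunday", "NNP", "PROPN"), ("faces", "VBZ", "VERB"),
    ("a", "DT", "DET"), ("significant", "JJ", "ADJ"), ("risk", "NN", "NOUN"),
    ("of", "IN", "ADP"), ("being", "VBG", "AUX"), ("interrupted", "VBN", "VERB"),
    ("by", "IN", "ADP"), ("rain", "NN", "NOUN"), ("again", "RB", "ADV"),
    (".", ".", "PUNCT"),
]

# Keep only the tokens where the projected nltk tag disagrees with spaCy.
mismatches = [
    (word, PTB_TO_UPOS.get(ptb, "X"), upos)
    for word, ptb, upos in tagged
    if PTB_TO_UPOS.get(ptb, "X") != upos
]

for word, mapped, upos in mismatches:
    print(f"{word}: nltk tag maps to {mapped}, spaCy says {upos}")
```

Under this mapping the two taggers disagree only on "vs" and "being", which shows how tagger and tagset choices affect downstream results.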

In [27]:
# Fine-grained tag labels known to the model's tagger (listed in full in the next cell)
pos_tags = nlp.get_pipe("tagger").labels

# Listing Universal POS Tags:
print("\nUniversal POS Tags in spaCy:")
print("-" * 50)
universal_tags = ["ADJ", "ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ",
                  "NOUN", "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]

for tag in universal_tags:
    print(f"{tag.ljust(6)}: {spacy.explain(tag)}")


Universal POS Tags in spaCy:
--------------------------------------------------
ADJ   : adjective
ADP   : adposition
ADV   : adverb
AUX   : auxiliary
CONJ  : conjunction
CCONJ : coordinating conjunction
DET   : determiner
INTJ  : interjection
NOUN  : noun
NUM   : numeral
PART  : particle
PRON  : pronoun
PROPN : proper noun
PUNCT : punctuation
SCONJ : subordinating conjunction
SYM   : symbol
VERB  : verb
X     : other


In [28]:
# List of unique fine-grained tags in the loaded model
tags = list(nlp.get_pipe("tagger").labels)
for tag in tags:
    print(f"{tag.ljust(5)} : {spacy.explain(tag)}")

$     : symbol, currency
''    : closing quotation mark
,     : punctuation mark, comma
-LRB- : left round bracket
-RRB- : right round bracket
.     : punctuation mark, sentence closer
:     : punctuation mark, colon or ellipsis
ADD   : email
AFX   : affix
CC    : conjunction, coordinating
CD    : cardinal number
DT    : determiner
EX    : existential there
FW    : foreign word
HYPH  : punctuation mark, hyphen
IN    : conjunction, subordinating or preposition
JJ    : adjective (English), other noun-modifier (Chinese)
JJR   : adjective, comparative
JJS   : adjective, superlative
LS    : list item marker
MD    : verb, modal auxiliary
NFP   : superfluous punctuation
NN    : noun, singular or mass
NNP   : noun, proper singular
NNPS  : noun, proper plural
NNS   : noun, plural
PDT   : predeterminer
POS   : possessive ending
PRP   : pronoun, personal
PRP$  : pronoun, possessive
RB    : adverb
RBR   : adverb, comparative
RBS   : adverb, superlative
RP    : adverb, particle
SYM   : symbol
TO   

In [109]:
# Download the German Corpus/Tokenizer/POS Tagger etc...
# !python -m spacy download de_core_news_sm

In [110]:
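# Load the German language model
# Note: this rebinds `nlp`, so every later `nlp(...)` call uses the German pipeline
# until the English model is loaded again.
nlp = spacy.load("de_core_news_sm")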

def pos_tag_spacy(text):
    doc = nlp(text)
    return [(token.text, token.pos_) for token in doc]

sentence = "Das ist ein einfacher Satz."
print(pos_tag_spacy(sentence))

  


[('Das', 'PRON'), ('ist', 'AUX'), ('ein', 'DET'), ('einfacher', 'ADJ'), ('Satz', 'NOUN'), ('.', 'PUNCT')]


## Noun Phrase Chunking

Chunking (also called shallow parsing) groups adjacent tokens into phrases without building a full parse tree. Its most common form, noun phrase chunking, extracts noun phrases from unstructured text.

In [111]:
text = """
PM Modi, Macron hold lunch meet, eye India-France ties at new heights of progress.
French president Emmanuel Macron said India did its utmost for the G-20 presidency to serve unity and peace.
Prime Minister Narendra Modi on Sunday had lunch with French president Emmanuel Macron following the conclusion of the G20 Summit.
"""

In [112]:
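# Caution: `nlp` was last loaded as the German model, so the chunks for this English text are unreliable.
doc = nlp(text)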

print("Noun Phrases using spaCy:")
print("-" * 50)
for chunk in doc.noun_chunks:
    print(chunk.text)

Noun Phrases using spaCy:
--------------------------------------------------

PM Modi
Macron

French
Emmanuel Macron
India

Prime
Minister Narendra Modi on Sunday
Emmanuel Macron
following the conclusion of the G20 Summit


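nltk takes a more manual approach: you define a chunk grammar as a regular expression over POS tags. The grammar below matches an optional determiner, any number of adjectives, and one singular noun.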

In [113]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import RegexpParser

# Download necessary resources
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

In [129]:
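# Tokenize and POS tag (the output below comes from a run where `text` held the
# Alice in Wonderland passage defined in the NER section)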
words = word_tokenize(text)
pos_tags = pos_tag(words)

# Define chunk grammar for noun phrases
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Parsing the sentence
parser = RegexpParser(grammar)
tree = parser.parse(pos_tags)

print("Noun Phrases using nltk:")
print("-" * 50)
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join([word for word, tag in subtree.leaves()]))

Noun Phrases using nltk:
--------------------------------------------------
the whimsical tale
a young girl
a rabbit
hole
a fantastical world
a myriad
bewildering adventure
journey
test
logic
reason
the very nature
reality
a dream


## Named Entity Recognition (NER)

Named Entity Recognition (NER) is used to identify named entities such as persons, organizations, dates, etc., in the text. Let's extract named entities from the given text using both spaCy and nltk.

In [121]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

# Download necessary resources
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

In [34]:
text = """Indian Prime Minister Narendra Modi, French president Emmanuel Macron hold lunch meet, eye India-France ties at new heights of progress
French president Emmanuel Macron said India did its utmost for the G-20 presidency to serve unity and peace.
Prime Minister Narendra Modi on Sunday had lunch with French president Emmanuel Macron following the conclusion of the G20 Summit."""

In [132]:
text = """PM Modi handed over the G20 presidency to Brazil's Lula Da Silva, marking the end of the G20 Summit in Delhi. The Prime Minister handed over the ceremonial gavel to the Brazilian President to mark the transfer of the Presidency at the third G20 session "One Future" at the Bharat Mandapam in Delhi's Pragati Maidan. With this, the two-day mega conclave of world leaders under India's G20 Presidency in the national capital came to an end."""

In [127]:
text = """"Alice in Wonderland," penned by Lewis Carroll, narrates the whimsical tale of a young girl named Alice who tumbles down a rabbit hole and finds herself in a fantastical world filled with bizarre creatures and perplexing situations. As she navigates this Wonderland, she encounters a myriad of eccentric characters, including the Cheshire Cat, the Mad Hatter, the Queen of Hearts, and the White Rabbit, each of whom adds to her bewildering adventure. Throughout her journey, Alice faces challenges that test her logic and reason, ultimately leading her to question the very nature of reality before she finally awakens, realizing her adventures were but a dream."""

In [128]:
# Tokenize and POS Tag
words = word_tokenize(text)
pos_tags = pos_tag(words)

# Extract named entities
named_entities_tree = ne_chunk(pos_tags)

print("Named Entities using nltk:")
print("-" * 50)
for subtree in named_entities_tree.subtrees(filter=lambda t: t.label() in ['GPE', 'PERSON', 'ORGANIZATION', 'DATE']):
    entity_name = ' '.join([word for word, tag in subtree.leaves()])
    print(entity_name, "-", subtree.label())

Named Entities using nltk:
--------------------------------------------------
Wonderland - GPE
Lewis Carroll - PERSON
Wonderland - GPE
Cheshire Cat - ORGANIZATION
Mad Hatter - ORGANIZATION
Hearts - GPE
Alice - ORGANIZATION


In [126]:
print("Common NER Tags in nltk:")
print("-" * 50)

nltk_tags = {
    "GPE": "Geopolitical Entity",
    "PERSON": "Person",
    "ORGANIZATION": "Organization",
    "LOCATION": "Location",
    "DATE": "Date",
    "TIME": "Time",
    "MONEY": "Money",
    "PERCENT": "Percentage",
    "FACILITY": "Facility",
    "GSP": "Geopolitical Subdivision"
}

for tag, description in nltk_tags.items():
    print(f"{tag.ljust(20)}: {description}")


Common NER Tags in nltk:
--------------------------------------------------
GPE                 : Geopolitical Entity
PERSON              : Person
ORGANIZATION        : Organization
LOCATION            : Location
DATE                : Date
TIME                : Time
MONEY               : Money
PERCENT             : Percentage
FACILITY            : Facility
GSP                 : Geopolitical Subdivision


In [131]:
print(text)

"Alice in Wonderland," penned by Lewis Carroll, narrates the whimsical tale of a young girl named Alice who tumbles down a rabbit hole and finds herself in a fantastical world filled with bizarre creatures and perplexing situations. As she navigates this Wonderland, she encounters a myriad of eccentric characters, including the Cheshire Cat, the Mad Hatter, the Queen of Hearts, and the White Rabbit, each of whom adds to her bewildering adventure. Throughout her journey, Alice faces challenges that test her logic and reason, ultimately leading her to question the very nature of reality before she finally awakens, realizing her adventures were but a dream.


In [135]:
# Extracting ALL the noun tags only
def extract_nouns(text):
    # Tokenize the text
    words = word_tokenize(text)
    
    # POS tagging
    tagged = pos_tag(words)
    
    # Filter noun tags
    noun_tags = ['NN', 'NNS', 'NNP', 'NNPS']
    nouns = [word for word, tag in tagged if tag in noun_tags]
    
    return nouns

text = "Alice and the White Rabbit entered Wonderland."
print(extract_nouns(text))

['Alice', 'White', 'Rabbit', 'Wonderland']


In [133]:
# spaCy (here `nlp` is still the German pipeline loaded earlier, hence the PER/LOC/MISC labels)
doc = nlp(text)

print("Named Entities using spaCy:")
print("-" * 50)
for ent in doc.ents:
    print(ent.text, "-", ent.label_)


Named Entities using spaCy:
--------------------------------------------------
PM - ORG
G20 - MISC
Brazil's Lula - PER
G20 - MISC
Delhi - LOC
The Prime - MISC
Brazilian President to mark the transfer of the Presidency at the - MISC
G20 - MISC
Mandapam - LOC
Delhi's - LOC
Pragati Maidan - PER
With - PER
G20 Presidency in the national capital - MISC


In [38]:
# Access the NER pipeline component
ner = nlp.get_pipe("ner")

print("NER Tags in spaCy:")
print("-" * 50)
for label in ner.labels:
    print(f"{label.ljust(20)}: {spacy.explain(label)}")


NER Tags in spaCy:
--------------------------------------------------
CARDINAL            : Numerals that do not fall under another type
DATE                : Absolute or relative dates or periods
EVENT               : Named hurricanes, battles, wars, sports events, etc.
FAC                 : Buildings, airports, highways, bridges, etc.
GPE                 : Countries, cities, states
LANGUAGE            : Any named language
LAW                 : Named documents made into laws.
LOC                 : Non-GPE locations, mountain ranges, bodies of water
MONEY               : Monetary values, including unit
NORP                : Nationalities or religious or political groups
ORDINAL             : "first", "second", etc.
ORG                 : Companies, agencies, institutions, etc.
PERCENT             : Percentage, including "%"
PERSON              : People, including fictional
PRODUCT             : Objects, vehicles, foods, etc. (not services)
QUANTITY            : Measurements, as of weigh

## Syntax Parsing

Syntax parsing is the process of analyzing a sentence to determine its syntactic structure according to a given grammar. The result is usually represented as a parse tree.

nltk provides parsers for context-free grammars. You first define a grammar (or load a predefined one) and then use it to parse a sentence.

In [39]:
from nltk import CFG
from nltk.parse.chart import ChartParser

# Define a simple grammar
grammar = CFG.fromstring("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I'
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas'
    V -> 'shot'
    P -> 'in'
""")

The cell above creates a simple context-free grammar using the `CFG.fromstring()` method. Each rule defines how a linguistic structure can be formed:

- S -> NP VP: A sentence (S) is made up of a noun phrase (NP) followed by a verb phrase (VP).
- PP -> P NP: A prepositional phrase (PP) is made of a preposition (P) followed by a noun phrase (NP).
- NP -> Det N | Det N PP | 'I': A noun phrase can be a determiner (Det) followed by a noun (N), a determiner followed by a noun and then a prepositional phrase, or simply the word 'I'.
- VP -> V NP | VP PP: A verb phrase can be a verb (V) followed by a noun phrase, or a verb phrase followed by a prepositional phrase.

The subsequent rules define the valid words for each category:
- Det -> 'an' | 'my': Valid determiners are "an" and "my".
- N -> 'elephant' | 'pajamas': Valid nouns are "elephant" and "pajamas".
- V -> 'shot': The only verb in this grammar is "shot".
- P -> 'in': The only preposition is "in".

The grammar is built around the classic example of syntactic ambiguity, "I shot an elephant in my pajamas." Under this grammar the sentence has two parses, depending on where the prepositional phrase "in my pajamas" attaches:

- I shot an elephant while I was in my pajamas (the PP attaches to the verb phrase).
- I shot an elephant that was in my pajamas (the PP attaches to the noun phrase).
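
The two readings correspond to two distinct parse trees. As a rough check that needs no nltk, the sketch below counts the trees the grammar licenses for a token sequence. It is a small memoized span-based counter (`count_parses` and `GRAMMAR` are hypothetical names), not nltk's chart parser, and it assumes the grammar has no empty productions and no unit rules between nonterminals.

```python
from functools import lru_cache

# The same grammar as above, written as nonterminal -> list of right-hand sides.
# Any symbol that is not a key is treated as a terminal (a literal word).
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}

def count_parses(tokens, start="S"):
    """Count the distinct parse trees for `tokens` rooted at `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        # Number of trees in which `symbol` covers tokens[i:j].
        if symbol not in GRAMMAR:  # terminal symbol
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(derive_seq(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def derive_seq(rhs, i, j):
        # Number of ways the symbol sequence `rhs` covers tokens[i:j].
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        total = 0
        # The first symbol takes tokens[i:k]; every remaining symbol
        # must still get at least one token.
        for k in range(i + 1, j - len(rhs) + 2):
            left = derive(rhs[0], i, k)
            if left:
                total += left * derive_seq(rhs[1:], k, j)
        return total

    return derive(start, 0, len(tokens))

print(count_parses("I shot an elephant in my pajamas".split()))
print(count_parses("I shot an elephant".split()))
```

The left-recursive rule `VP -> VP PP` is safe here because the inner `VP` always covers a strictly shorter span than the outer one.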

In [40]:
parser = ChartParser(grammar)

sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
    tree.pretty_print()

     S                                       
  ___|______________                          
 |                  VP                       
 |         _________|__________               
 |        VP                   PP            
 |    ____|___              ___|___           
 |   |        NP           |       NP        
 |   |     ___|_____       |    ___|_____     
 NP  V   Det        N      P  Det        N   
 |   |    |         |      |   |         |    
 I  shot  an     elephant  in  my     pajamas

     S                                   
  ___|__________                          
 |              VP                       
 |    __________|______                   
 |   |                 NP                
 |   |     ____________|___               
 |   |    |     |          PP            
 |   |    |     |       ___|___           
 |   |    |     |      |       NP        
 |   |    |     |      |    ___|_____     
 NP  V   Det    N      P  Det        N   
 |   |    |     |    

In [42]:
from spacy import displacy

sentence = "The cat sat on the mat."
doc = nlp(sentence)

# Using displacy.render() to visualize the dependency tree
displacy.render(doc, style="dep", jupyter=True)  # Use jupyter=False if not in a Jupyter environment


## Topic Modelling

In [163]:
news1 = """PM Modi, Macron hold lunch meet, eye India-France ties at new heights of progress
French president Emmanuel Macron said India did its utmost for the G-20 presidency to serve unity and peace.
Prime Minister Narendra Modi on Sunday had lunch with French president Emmanuel Macron following the conclusion of the G20 Summit.

Prime Minister Narendra Modi with French president Emmanuel Macron.(X/Narendra Modi)
“A very productive lunch meeting with President @EmmanuelMacron. We discussed a series of topics and look forward to ensuring India-France relations scale new heights of progress”, PM Modi posted on social media platform X, formerly Twitter.

The prime minister was accompanied by external affairs minister S Jaishankar and National Security Advisor Ajit Doval.

“We will further develop defence cooperation with India”, Macron was quoted by PTI as saying.

Earlier, Macron thanked Prime Minister Modi and hailed India's efforts for G20 presidency to serve peace and unity. “I thank PM Modi. Faithful to its principles India did its utmost for the G-20 presidency to serve unity and peace and send across the message of unity while Russia is still waging its aggression on Ukraine”, ANI quoted the French president as saying.

Macron also called for deep reform of international organisations. “We support a deep reform of global governance. Security Council but as well the World Bank and the IMF, they have to reflect today's reality in terms of demography and economy as well. And then we want to increase the available tools. That's why we want to replenish the World Bank and France is supporting that so that the emerging countries have a greater role to play”, the French president said."""

In [164]:
news2 = """Shaheen dismissed Rohit, Kohli, so focus was on him. But...': Gavaskar alerts Team India against another Pakistan star

The India batting great spoke in detail about the previous match between the two sides in the group stage.
Team India is preparing for a highly anticipated showdown with Pakistan in the Asia Cup Super 4 stage on Sunday. The two sides previously crossed paths during the group stage, but the match was abandoned due to rain in Pallekele. India played out their whole innings, scoring 266, but the match did leave the Indian team management wit significant concerns over their top-order failure. India were left reeling at 66/4 with Rohit Sharma and Virat Kohli being castled by left-armer Shaheen Afridi, while fellow pacers Naseem Shah and Haris Rauf also making life miserable for the Indian batters.

India's Virat Kohli walks back to the pavilion after his dismissal during the Asia Cup 2023 match between India and Pakistan(AFP)

However, Hardik Pandya's resilient innings of 87 runs and Ishan Kishan's impressive 82 helped India recover from the early setbacks, guiding them to a total of 266. Among Pakistan's bowlers, star pacer Shaheen Afridi delivered a standout performance, registering impressive figures of 4/35. But while Afridi did end as the innings' leading wicket-taker, former India captain and batting great Sunil Gavaskar emphasised on the Naseem Shah threat as the side prepares to face Pakistan again.

Also read: 'India have great players but I'm sorry, they're praising us too much': PAK great's hard-hitting remark on Rohit, Kohli
Gavaskar pointed out that the right-armer was terrific in his opening spell and created problems for Shubman Gill, insisting that his lethal out-swingers made it difficult for the Indian batters in the start.

“If you looked at those 10 overs, you must have noticed the way Naseem Shah bowled. His out-swingers were brilliant, playing him was pretty difficult. Shaheen Afridi did take two wickets of Rohit Sharma and Virat Kohli, so focus was on him. But the way Naseem bowled, that was terrific. Shubman Gill was playing majorly against him, and he was leaving him well,” Gavaskar said in a joint broadcast from Sports Tak and Samaa TV.

“If batters keep getting out at other end, it's important to stick around. Shubman isn't in the best form right now, so he knows he has to take responsibility. This was a 50-over game after all. He has the shots to make up for slow start. I guess that's why he was watchful, and a bit nervous. But Naseem Shah kept Shubman Gill quiet,” said Gavaskar further.

Rain threat looms large
The upcoming India vs Pakistan match on Sunday faces a significant risk of being interrupted by rain again. The Asian Cricket Council triggered controversy earlier this week by designating a reserve day for the Super 4 encounter between the two arch-rivals, particularly because no other game in the Super 4 was allocated reserve day benefits.

This decision was prompted by the high likelihood of rain on the scheduled match day, Sunday. However, the weather forecast indicates that there are also considerable chances of rain on Monday, which means that even the reserve day might not guarantee uninterrupted play."""

In [165]:
news = news1 + news2

In [58]:
# %%capture 
# !pip install gensim

In [166]:
import gensim
gensim.__version__

'4.2.0'

In [59]:
import gensim
from gensim import corpora
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# Download stopwords from NLTK
# import nltk
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')

In [62]:
def preprocess(text):
    # Tokenization
    tokens = word_tokenize(text)

    # Lowercasing
    tokens = [word.lower() for word in tokens]

    # Removing punctuations
    tokens = [word for word in tokens if word not in string.punctuation]

    # Removing stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens

In [167]:
text = news
tokens = preprocess(text)

# Create a dictionary that maps each unique token to an integer id
dictionary = corpora.Dictionary([tokens])  # the dictionary plays the role of the vocabulary

In [175]:
print(dictionary.keys())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,

In [172]:
# Convert the list of tokens into a bag-of-words corpus
bow_corpus = [dictionary.doc2bow(tokens)]
bow_corpus  # one list per document, each holding (token_id, count) pairs

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 10),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 2),
  (8, 3),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 4),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 4),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 2),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 2),
  (37, 3),
  (38, 2),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 2),
  (43, 1),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 2),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 2),
  (61, 4),
  (62, 1),
  (63, 2),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 2),
  (71, 1),
  (72, 1),
  (73, 1),
  (74, 1),
  (75, 1),
  (76, 2),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 3),
  (82, 1),
  (83, 1),
  (84, 1),
  (85, 2),
  (86, 1),
  (87, 1),
  (88, 1),
  (89, 1),
  (90, 2),
  (91, 1

In [178]:
len(bow_corpus)

1

In [177]:
bow_corpus[0]  # extracting the first nested list

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 10),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 2),
 (8, 3),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 1),
 (13, 1),
 (14, 1),
 (15, 1),
 (16, 1),
 (17, 1),
 (18, 1),
 (19, 1),
 (20, 4),
 (21, 1),
 (22, 1),
 (23, 1),
 (24, 1),
 (25, 4),
 (26, 1),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 2),
 (33, 1),
 (34, 1),
 (35, 1),
 (36, 2),
 (37, 3),
 (38, 2),
 (39, 1),
 (40, 1),
 (41, 1),
 (42, 2),
 (43, 1),
 (44, 1),
 (45, 1),
 (46, 1),
 (47, 1),
 (48, 1),
 (49, 1),
 (50, 1),
 (51, 1),
 (52, 1),
 (53, 1),
 (54, 1),
 (55, 2),
 (56, 1),
 (57, 1),
 (58, 1),
 (59, 1),
 (60, 2),
 (61, 4),
 (62, 1),
 (63, 2),
 (64, 1),
 (65, 1),
 (66, 1),
 (67, 1),
 (68, 1),
 (69, 1),
 (70, 2),
 (71, 1),
 (72, 1),
 (73, 1),
 (74, 1),
 (75, 1),
 (76, 2),
 (77, 1),
 (78, 1),
 (79, 1),
 (80, 1),
 (81, 3),
 (82, 1),
 (83, 1),
 (84, 1),
 (85, 2),
 (86, 1),
 (87, 1),
 (88, 1),
 (89, 1),
 (90, 2),
 (91, 1),
 (92, 1),
 (93, 1),
 (94, 1),
 (95, 2),
 (96, 1),
 (97, 1),
 (98, 1),
 (99, 1),
 (100, 1)

In [179]:
dictionary[100]  # text against the id 100

'formerly'
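
The `(token_id, frequency)` pairs above can be reproduced in plain Python to show what a bag of words holds. This is a minimal sketch, not gensim's implementation: `build_vocab`, `to_bow`, and the toy document are hypothetical names and data introduced here, and assigning ids in alphabetical order is an assumption made for illustration.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id (alphabetical order, by assumption)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

# Toy document, not the article used in this notebook
toy_doc = ["modi", "macron", "lunch", "modi", "india", "macron", "modi"]

token2id = build_vocab([toy_doc])
bow = to_bow(toy_doc, token2id)

# Reverse mapping, playing the role of dictionary[token_id]
id2token = {idx: tok for tok, idx in token2id.items()}

print(bow)
print([(id2token[i], n) for i, n in bow])
```

The order of the words is lost; only how often each id occurs in the document is kept, which is why one document becomes one list of pairs.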

In [148]:
# Extracting tokenized text from dictionary
extracted_tokens = [dictionary[id1] for id1, freq in bow_corpus[0]]

print(extracted_tokens)

["'india", "'m", "'re", "'s", '...', '10', '2023', '266', '4', '4/35', '50-over', '66/4', '82', '87', 'abandoned', 'accompanied', 'across', 'advisor', 'affair', 'afp', 'afridi', 'aggression', 'ajit', 'alert', 'allocated', 'also', 'among', 'ani', 'another', 'anticipated', 'arch-rivals', 'around', 'asia', 'asian', 'available', 'back', 'bank', 'batter', 'batting', 'benefit', 'best', 'bit', 'bowled', 'bowler', 'brilliant', 'broadcast', 'called', 'captain', 'castled', 'chance', 'concern', 'conclusion', 'considerable', 'controversy', 'cooperation', 'council', 'country', 'created', 'cricket', 'crossed', 'cup', 'day', 'decision', 'deep', 'defence', 'delivered', 'demography', 'designating', 'detail', 'develop', 'difficult', 'discussed', 'dismissal', 'dismissed', 'doval', 'due', 'earlier', 'early', 'economy', 'effort', 'emerging', 'emmanuel', 'emmanuelmacron', 'emphasised', 'encounter', 'end', 'ensuring', 'even', 'external', 'eye', 'face', 'failure', 'faithful', 'fellow', 'figure', 'focus', 'fol

In [180]:
text

"PM Modi, Macron hold lunch meet, eye India-France ties at new heights of progress\nFrench president Emmanuel Macron said India did its utmost for the G-20 presidency to serve unity and peace.\nPrime Minister Narendra Modi on Sunday had lunch with French president Emmanuel Macron following the conclusion of the G20 Summit.\n\nPrime Minister Narendra Modi with French president Emmanuel Macron.(X/Narendra Modi)\n“A very productive lunch meeting with President @EmmanuelMacron. We discussed a series of topics and look forward to ensuring India-France relations scale new heights of progress”, PM Modi posted on social media platform X, formerly Twitter.\n\nThe prime minister was accompanied by external affairs minister S Jaishankar and National Security Advisor Ajit Doval.\n\n“We will further develop defence cooperation with India”, Macron was quoted by PTI as saying.\n\nEarlier, Macron thanked Prime Minister Modi and hailed India's efforts for G20 presidency to serve peace and unity. “I tha

In [181]:
tokens  # list of all the terms in the corpus; this is NOT the vocabulary, so many words are repeated

['pm',
 'modi',
 'macron',
 'hold',
 'lunch',
 'meet',
 'eye',
 'india-france',
 'tie',
 'new',
 'height',
 'progress',
 'french',
 'president',
 'emmanuel',
 'macron',
 'said',
 'india',
 'utmost',
 'g-20',
 'presidency',
 'serve',
 'unity',
 'peace',
 'prime',
 'minister',
 'narendra',
 'modi',
 'sunday',
 'lunch',
 'french',
 'president',
 'emmanuel',
 'macron',
 'following',
 'conclusion',
 'g20',
 'summit',
 'prime',
 'minister',
 'narendra',
 'modi',
 'french',
 'president',
 'emmanuel',
 'macron',
 'x/narendra',
 'modi',
 '“',
 'productive',
 'lunch',
 'meeting',
 'president',
 'emmanuelmacron',
 'discussed',
 'series',
 'topic',
 'look',
 'forward',
 'ensuring',
 'india-france',
 'relation',
 'scale',
 'new',
 'height',
 'progress',
 '”',
 'pm',
 'modi',
 'posted',
 'social',
 'medium',
 'platform',
 'x',
 'formerly',
 'twitter',
 'prime',
 'minister',
 'accompanied',
 'external',
 'affair',
 'minister',
 'jaishankar',
 'national',
 'security',
 'advisor',
 'ajit',
 'doval',
 '

In [150]:
# Convert the list of tokens into a bag of words corpus
bow_corpus = [dictionary.doc2bow(tokens)]
print(bow_corpus)   # (token_id, frequency)

[[(0, 1), (1, 1), (2, 1), (3, 10), (4, 1), (5, 1), (6, 1), (7, 2), (8, 3), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 4), (21, 1), (22, 1), (23, 1), (24, 1), (25, 4), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 2), (33, 1), (34, 1), (35, 1), (36, 2), (37, 3), (38, 2), (39, 1), (40, 1), (41, 1), (42, 2), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 2), (56, 1), (57, 1), (58, 1), (59, 1), (60, 2), (61, 4), (62, 1), (63, 2), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 2), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 1), (80, 1), (81, 3), (82, 1), (83, 1), (84, 1), (85, 2), (86, 1), (87, 1), (88, 1), (89, 1), (90, 2), (91, 1), (92, 1), (93, 1), (94, 1), (95, 2), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 5), (104, 2), (105, 2), (106, 2), (107, 5), (108, 1), (109, 3), (110, 1

In [186]:
# Train the LDA model
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=8, id2word=dictionary, passes=15)
lda_model

<gensim.models.ldamodel.LdaModel at 0x14aff483688>

In [187]:
# Display the topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx}\nWords: {topic}\n")

Topic: 0
Words: 0.003*"india" + 0.003*"'s" + 0.003*"president" + 0.003*"match" + 0.003*"macron" + 0.003*"gavaskar" + 0.003*"kohli" + 0.003*"pakistan" + 0.003*"modi" + 0.003*"great"

Topic: 1
Words: 0.025*"india" + 0.018*"'s" + 0.013*"modi" + 0.013*"macron" + 0.011*"”" + 0.011*"“" + 0.011*"pakistan" + 0.011*"president" + 0.011*"match" + 0.009*"naseem"

Topic: 2
Words: 0.003*"india" + 0.003*"'s" + 0.003*"macron" + 0.003*"modi" + 0.003*"“" + 0.003*"match" + 0.003*"president" + 0.003*"french" + 0.003*"pakistan" + 0.003*"”"

Topic: 3
Words: 0.003*"india" + 0.003*"'s" + 0.003*"”" + 0.003*"“" + 0.003*"match" + 0.003*"gavaskar" + 0.003*"president" + 0.003*"modi" + 0.003*"naseem" + 0.003*"rain"

Topic: 4
Words: 0.003*"'s" + 0.003*"india" + 0.003*"modi" + 0.003*"“" + 0.003*"president" + 0.003*"minister" + 0.003*"match" + 0.003*"”" + 0.003*"macron" + 0.003*"pm"

Topic: 5
Words: 0.003*"india" + 0.003*"modi" + 0.003*"'s" + 0.003*"macron" + 0.003*"”" + 0.003*"pakistan" + 0.003*"gavaskar" + 0.003*"ko

In [153]:
import gensim
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')

In [154]:
def preprocess(text):
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens]
    tokens = [word for word in tokens if word not in string.punctuation]
    
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens
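
The check `word not in string.punctuation` in `preprocess` is a substring test against an ASCII-only string, which may explain why tokens such as `“`, `”`, and `...` survive into the vocabulary. The sketch below compares it with a stricter test based on Unicode character categories. `is_punctuation_token` and the sample list are hypothetical and are not part of the notebook's pipeline.

```python
import string
import unicodedata

def is_punctuation_token(token):
    """True when every character of the token is a Unicode punctuation mark."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

# Sample tokens of the kind seen in the outputs of this notebook
sample = ["modi", "“", "”", "...", "'s", "india-france", "!"]

# The membership test used in preprocess(): only ASCII substrings are dropped
kept_by_substring_test = [t for t in sample if t not in string.punctuation]

# Category-based test: curly quotes and multi-character punctuation are dropped too
kept_by_category_test = [t for t in sample if not is_punctuation_token(t)]

print(kept_by_substring_test)
print(kept_by_category_test)
```

Neither test removes `'s`, because it contains a letter; dropping it would need an extra entry in the stop-word set.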

In [188]:
tokens = [preprocess(paragraph) for paragraph in text.split("\n") if paragraph]

In [191]:
len(tokens)

20

In [192]:
tokens

[['pm',
  'modi',
  'macron',
  'hold',
  'lunch',
  'meet',
  'eye',
  'india-france',
  'tie',
  'new',
  'height',
  'progress'],
 ['french',
  'president',
  'emmanuel',
  'macron',
  'said',
  'india',
  'utmost',
  'g-20',
  'presidency',
  'serve',
  'unity',
  'peace'],
 ['prime',
  'minister',
  'narendra',
  'modi',
  'sunday',
  'lunch',
  'french',
  'president',
  'emmanuel',
  'macron',
  'following',
  'conclusion',
  'g20',
  'summit'],
 ['prime',
  'minister',
  'narendra',
  'modi',
  'french',
  'president',
  'emmanuel',
  'macron',
  'x/narendra',
  'modi'],
 ['“',
  'productive',
  'lunch',
  'meeting',
  'president',
  'emmanuelmacron',
  'discussed',
  'series',
  'topic',
  'look',
  'forward',
  'ensuring',
  'india-france',
  'relation',
  'scale',
  'new',
  'height',
  'progress',
  '”',
  'pm',
  'modi',
  'posted',
  'social',
  'medium',
  'platform',
  'x',
  'formerly',
  'twitter'],
 ['prime',
  'minister',
  'accompanied',
  'external',
  'affair',
 

In [194]:
dictionary = corpora.Dictionary(tokens)
dictionary.filter_extremes(no_below=5, no_above=0.6)
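
`filter_extremes` prunes the vocabulary by document frequency: `no_below` is an absolute number of documents and `no_above` is a fraction of the corpus. The plain-Python sketch below imitates that rule on toy data; `document_frequencies`, `filter_by_document_frequency`, and `toy_docs` are hypothetical, and the rounding of the upper bound with `int()` is an assumption about how the fraction is turned into a count.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each token, the number of documents that contain it."""
    dfs = Counter()
    for doc in docs:
        dfs.update(set(doc))  # set(): a token counts once per document
    return dfs

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens present in at least no_below documents and
    in at most no_above (a fraction) of all documents."""
    dfs = document_frequencies(docs)
    max_docs = int(no_above * len(docs))  # assumed rounding rule
    return sorted(t for t, df in dfs.items() if no_below <= df <= max_docs)

toy_docs = [
    ["india", "match", "kohli"],
    ["india", "match", "pakistan"],
    ["india", "modi", "macron"],
    ["india", "modi", "president"],
    ["india", "match", "rain"],
    ["india", "macron", "summit"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(kept)
```

In the toy corpus `india` is dropped because it appears in every document, and words seen in a single document are dropped as too rare. With only 20 short paragraphs, `no_below=5` is a strict threshold, so the resulting vocabulary is small.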

In [229]:
bow_corpus = [dictionary.doc2bow(doc_tokens) for doc_tokens in tokens]  # one bag of words per paragraph

lda_model = gensim.models.LdaModel(bow_corpus, num_topics=2, iterations=100,
                                   id2word=dictionary, passes=25, alpha='auto', eta='auto')

In [230]:
# https://radimrehurek.com/gensim/models/ldamodel.html

In [231]:
# Display topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx}\nWords: {topic}\n")

Topic: 0
Words: 0.130*"modi" + 0.130*"macron" + 0.113*"president" + 0.113*"“" + 0.113*"”" + 0.095*"french" + 0.088*"'s" + 0.086*"india" + 0.078*"gavaskar" + 0.035*"kohli"

Topic: 1
Words: 0.268*"india" + 0.174*"match" + 0.171*"pakistan" + 0.158*"'s" + 0.107*"kohli" + 0.040*"gavaskar" + 0.014*"french" + 0.014*"”" + 0.014*"“" + 0.014*"president"
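
When only the printed topic strings are available, they can be parsed back into `(term, weight)` pairs for numeric comparison. This is a small sketch using a regular expression; `TOPIC_TERM` and `parse_topic` are hypothetical helpers, and the input string is copied from the Topic 0 output above.

```python
import re

# Matches one 'weight*"term"' fragment of a printed topic
TOPIC_TERM = re.compile(r'([0-9]*\.?[0-9]+)\*"([^"]*)"')

def parse_topic(topic_string):
    """Return a list of (term, weight) pairs from a printed topic string."""
    return [(term, float(weight)) for weight, term in TOPIC_TERM.findall(topic_string)]

topic_0 = ('0.130*"modi" + 0.130*"macron" + 0.113*"president" + 0.113*"“" + 0.113*"”" '
           '+ 0.095*"french" + 0.088*"\'s" + 0.086*"india" + 0.078*"gavaskar" + 0.035*"kohli"')

terms = parse_topic(topic_0)
top_weight = sum(weight for _, weight in terms)

print(terms[:3])
print(round(top_weight, 3))  # probability mass covered by the ten listed terms
```

Inside the notebook itself, `lda_model.show_topics(formatted=False)` returns the pairs directly, so parsing is only needed for saved output.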



In [224]:
# Evaluate the model
coherence_model = CoherenceModel(model=lda_model, texts=tokens, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score}")

Coherence Score: 0.4243600442768668


In [232]:
import pandas as pd

# Extract the top 10 terms for each topic
topics = lda_model.show_topics(formatted=False, num_topics=2, num_words=10)

# Initialize empty lists to hold topic names and terms
topic_names = []
topic_terms = []

In [233]:
# Populate the lists
for idx, topic in topics:
    topic_names.append(f"Topic {idx}")  # Change this to provide more meaningful names if needed
    terms = ", ".join([word[0] for word in topic])
    topic_terms.append(terms)

In [234]:
# Create the DataFrame
df = pd.DataFrame({
    'Topic Name': topic_names,
    'Top 10 Terms': topic_terms
})

df

  Topic Name                                       Top 10 Terms
0    Topic 0  modi, macron, president, “, ”, french, 's, ind...
1    Topic 1  india, match, pakistan, 's, kohli, gavaskar, f...


In [None]:
# Topics 0, 1, and 3 are the same
# Topics 2 and 4 are similar

# hence the number of meaningful topics in this dataset is still 2
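
Judging whether two topics are "the same" by eye can be backed by a simple overlap score. The sketch below computes the Jaccard similarity between the top-term lists of each pair of topics; `jaccard` is a hypothetical helper, and the term lists are copied from the two-topic output shown earlier, not from the run these comments describe.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two collections of terms (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Top 10 terms per topic, copied from the two-topic model output
top_terms = {
    "Topic 0": ["modi", "macron", "president", "“", "”", "french", "'s",
                "india", "gavaskar", "kohli"],
    "Topic 1": ["india", "match", "pakistan", "'s", "kohli", "gavaskar",
                "french", "”", "“", "president"],
}

overlaps = {
    (x, y): jaccard(top_terms[x], top_terms[y])
    for x, y in combinations(top_terms, 2)
}

for pair, score in overlaps.items():
    print(pair, round(score, 3))
```

A high score means the top-term lists largely coincide. The score ignores term weights, so two topics can share most of their top terms and still rank them very differently.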

In [160]:
# Optional: Export the table to a CSV for sharing/presentation
df.to_csv('topics_table.csv', index=False)