<a href="https://colab.research.google.com/github/iami0npkr/Story/blob/main/Stemmer_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementing Stemmers for Two Languages and Comparing Them

Nannuri Pranay Kumar Reddy,
21085052,
EEE, B.Tech, 4th year,
IIT BHU Varanasi

In [2]:
# Installing necessary packages
!pip install spacy stanza nltk  # Installing spaCy, Stanza, and NLTK libraries




In [3]:
# Downloading spaCy's English model for tokenization
!python -m spacy download en_core_web_sm

# Downloading Stanza's Hindi model for tokenization and lemmatization
import stanza  # must be imported before stanza.download can be called
stanza.download('hi')


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 60.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: hi (Hindi) ...
INFO:stanza:File exists: /root/stanza_resources/hi/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources


In [1]:
# Importing necessary modules
import spacy
import stanza
import nltk
from nltk.stem import PorterStemmer
import pandas as pd


In [2]:

english_text = """Artificial intelligence enables machines to learn from experience.
It involves reasoning, problem-solving, and decision-making.
These capabilities allow AI to perform tasks that typically require human intelligence.
AI systems are also capable of understanding natural language.
They use data to improve and adapt over time.
Examples include speech recognition, image analysis, and autonomous driving.
Such advancements have transformed industries like healthcare and education."""


hindi_text = """कृत्रिम बुद्धिमत्ता मशीनों को अनुभव से सीखने में सक्षम बनाती है।
यह तर्क, समस्या-समाधान और निर्णय लेने में शामिल है।
यह क्षमताएँ एआई को उन कार्यों को करने की अनुमति देती हैं जिन्हें सामान्यतः मानव बुद्धिमत्ता की आवश्यकता होती है।
एआई प्रणालियाँ प्राकृतिक भाषा को समझने में सक्षम हैं।
वे समय के साथ सुधारने और अनुकूलित करने के लिए डेटा का उपयोग करते हैं।
उदाहरणों में भाषण पहचान, छवि विश्लेषण और स्वायत्त ड्राइविंग शामिल हैं।
ऐसी प्रगति ने स्वास्थ्य देखभाल और शिक्षा जैसे उद्योगों को बदल दिया है।"""


In [7]:
# Initialising Stanza pipeline for Hindi, excluding the 'mwt' processor
nlp_hi = stanza.Pipeline('hi', processors='tokenize,pos,lemma')

doc_hi = nlp_hi(hindi_text)

# Tokenising the Hindi text and storing the words
hindi_words = [word.text for sent in doc_hi.sentences for word in sent.words]


INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: hi (Hindi):
| Processor | Package       |
-----------------------------
| tokenize  | hdtb          |
| pos       | hdtb_charlm   |
| lemma     | hdtb_nocharlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!


In [3]:
# Loading spaCy's small English pipeline and processing the text
nlp_en = spacy.load('en_core_web_sm')
doc_en = nlp_en(english_text)

# Collecting the text of every token (punctuation and newline tokens included)
english_words = [token.text for token in doc_en]


In [4]:
# Stemming English words using Porter Stemmer
porter = PorterStemmer()
stemmed_english = [porter.stem(word) for word in english_words]
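
The token list passed to the stemmer still contains punctuation and newline tokens (rows such as `.` and `\n` appear in the printed table further down). If those should not count toward the vocabulary, one option is to filter them out first. The sketch below is a minimal, library-free illustration; `keep_word_tokens` is a hypothetical helper, not part of spaCy or NLTK, and its rule (keep a token only if it contains at least one letter or digit) is an assumption about what counts as a word.

```python
def keep_word_tokens(tokens):
    """Return only tokens that contain at least one letter or digit.

    Hypothetical helper: drops pure punctuation and whitespace tokens
    such as "." or a newline. It also works for Devanagari, because
    base consonants and independent vowels count as alphanumeric.
    """
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


sample_tokens = ["Artificial", "intelligence", ".", "\n", "It", "problem", "-", "solving"]
word_tokens = keep_word_tokens(sample_tokens)
print(word_tokens)
```

In this notebook the same filter could be applied to `english_words` and `hindi_words` before building the DataFrames. The unique-word counts printed later were computed without any such filtering.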


In [8]:
# Lemmatizing Hindi words using Stanza
hindi_lemmas = [word.lemma for sent in doc_hi.sentences for word in sent.words]


In [9]:
# For English, creating a DataFrame comparing the original and stemmed words
english_df = pd.DataFrame({
    "Original Word (English)": english_words,
    "Stemmed Word (English)": stemmed_english
}).drop_duplicates()

# For Hindi, creating a DataFrame comparing the original and lemmatized words
hindi_df = pd.DataFrame({
    "Original Word (Hindi)": hindi_words,
    "Lemmatized Word (Hindi)": hindi_lemmas
}).drop_duplicates()


In [10]:
# Comparing unique words before and after stemming/lemmatization
unique_word_comparison = pd.DataFrame({
    "Language": ["English", "Hindi"],
    "Unique Words (Before Stemming)": [len(set(english_words)), len(set(hindi_words))],
    "Unique Words (After Stemming)": [len(set(stemmed_english)), len(set(hindi_lemmas))]
})
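
To compare the two languages on the same scale, the absolute counts can be turned into a relative reduction, (before - after) / before. A minimal sketch follows; `vocabulary_reduction` is a hypothetical helper, and the example counts are the ones shown in this notebook's printed comparison table.

```python
def vocabulary_reduction(before, after):
    """Percentage by which the unique-token count shrank."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Counts taken from the notebook's printed comparison table.
counts = {"English": (61, 60), "Hindi": (66, 60)}
reductions = {lang: vocabulary_reduction(b, a) for lang, (b, a) in counts.items()}
for lang, pct in reductions.items():
    print(f"{lang}: {pct:.1f}% fewer unique tokens")
```

On a text this short the percentages are only indicative; a larger corpus would give a more meaningful comparison.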


In [12]:

# Setting pandas options to display all rows and columns without truncation
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Auto-detect the display width
pd.set_option('display.max_colwidth', None)  # Display full column content

print("=== Unique Word Comparison ===")
print(unique_word_comparison)

print("\n=== English Words (Original vs Stemmed) ===")
print(english_df)

print("\n=== Hindi Words (Original vs Lemmatized) ===")
print(hindi_df)


=== Unique Word Comparison ===
  Language  Unique Words (Before Stemming)  Unique Words (After Stemming)
0  English                              61                             60
1    Hindi                              66                             60

=== English Words (Original vs Stemmed) ===
   Original Word (English) Stemmed Word (English)
0               Artificial               artifici
1             intelligence               intellig
2                  enables                  enabl
3                 machines                 machin
4                       to                     to
5                    learn                  learn
6                     from                   from
7               experience                 experi
8                        .                      .
9                       \n                     \n
10                      It                     it
11                involves                 involv
12               reasoning                 reason
13

# Stemming Rules
English Stemming (Porter's Stemmer):

- Porter's Stemmer reduces words to a stem by stripping common suffixes such as -s, -ed, and -ing. The stem is not always a dictionary word.
- Examples from the output above: "reasoning" → "reason", "experience" → "experi".
- Grouping variants of the same word under one stem reduces the number of unique words and simplifies the text for further analysis.

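Hindi Stemming:

- A rule-based Hindi stemmer removes suffixes that mark plural forms, verb inflections, and case. The code above does not use such a stemmer: it uses Stanza's lemmatizer for Hindi, so its output is a dictionary form (lemma) rather than a truncated stem.
- Example: a lemmatizer aims for "लड़के" → "लड़का", while a suffix-stripping stemmer would give something like "सिखाती" → "सिख".
- Either way, reducing words to a base form simplifies the Hindi text and helps with tasks like information retrieval or sentiment analysis.
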
Key Stemming Rules:

- English: Focuses on removing plural endings (-s), verb endings (-ed, -ing), and derivational suffixes such as -ness and -ful.
- Hindi: Primarily removes plural markers (-e, -ye in transliteration), verb endings (-ना, -ते, -ती), and genitive markers (-का, -की).

For English, stemming collapses variants such as "running" and "runs" into "run".
For Hindi, stripping suffixes like -ती and -ते works similarly, reducing a word like "सिखाती" to "सिख".
In the sample above, the vocabulary shrinks from 61 to 60 unique tokens for English after stemming and from 66 to 60 for Hindi after lemmatization (the counts include punctuation tokens). Working with base forms in this way makes large amounts of text easier to process.
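
The Hindi suffix rules described in this section are not implemented in the notebook, which relies on Stanza's lemmatizer instead. As a rough illustration of what a rule-based stemmer could look like, here is a minimal longest-suffix-first sketch. The suffix list, the `MIN_STEM_LEN` threshold, and the function name `hindi_stem` are all assumptions made for this example, not an established Hindi stemmer; a realistic rule set would be considerably larger.

```python
# Illustrative suffixes only (assumed for this sketch), written as the
# Devanagari character sequences that end a word.
SUFFIXES = ["ाती", "ाते", "ाता", "ती", "ते", "ता", "ना", "ों", "ें", "े"]

# Try longer suffixes first so that "ाती" wins over "ती".
SUFFIXES_LONGEST_FIRST = sorted(SUFFIXES, key=len, reverse=True)

# Never strip so much that fewer than this many code points remain.
MIN_STEM_LEN = 2


def hindi_stem(word):
    """Strip the longest matching suffix, or return the word unchanged."""
    for suffix in SUFFIXES_LONGEST_FIRST:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


for w in ["सिखाती", "करना", "लड़के", "है"]:
    print(w, "->", hindi_stem(w))
```

Unlike a lemmatizer, this kind of stripping can leave a form that is not a word on its own (for instance, "लड़के" loses only its final vowel sign), which is the usual trade-off between stemming and lemmatization.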