# Stopwords

**Stopwords** are very common words in a language, such as "the," "is," "in," "and," and "a," that carry little meaning on their own. They are usually removed during text analysis because they appear so frequently and add little information.

For example, in the sentence:  
**"The cat is sitting on the mat."**  
The words **"the"**, **"is"**, and **"on"** are stopwords because they don't help us understand the main content (which is about the **cat**, **sitting**, and **mat**).
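The filtering described above can be sketched in plain Python. This is a minimal illustration, not NLTK's implementation: `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical names, and the stopword set is a small hand-picked sample rather than a real list.

```python
# Hypothetical hand-picked stopword set, for illustration only;
# a real list (for example NLTK's) is much longer.
SAMPLE_STOPWORDS = {"the", "is", "on", "in", "and", "a"}

def remove_stopwords(sentence, stopwords=SAMPLE_STOPWORDS):
    """Return the words of `sentence` that are not in `stopwords`."""
    # Strip surrounding punctuation and lowercase each word before comparing.
    words = [w.strip(".,!?\"'").lower() for w in sentence.split()]
    return [w for w in words if w and w not in stopwords]

content_words = remove_stopwords("The cat is sitting on the mat.")
print(content_words)  # ['cat', 'sitting', 'mat']
```

Using a `set` for the stopwords keeps each membership check fast, which matters once the list grows to a few hundred entries.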

In [14]:
paragraph = "Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human languages. The ultimate goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and useful. Techniques such as tokenization, part-of-speech tagging, and named entity recognition are commonly used in NLP applications. Despite the advances in NLP, challenges such as ambiguity, context, and language diversity still pose significant difficulties."

In [2]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()


In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\parth\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
from nltk.corpus import stopwords
en_stopwords = stopwords.words('english')

In [5]:
# Number of English stopwords (the exact count depends on the installed NLTK data version)
len(stopwords.words('english'))

179

In [6]:
len(stopwords.words('french'))

157

### 1. Convert the paragraph into sentences

In [23]:
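# sent_tokenize relies on NLTK's Punkt tokenizer data; if it is missing,
# NLTK raises a LookupError that names the resource to download.
from nltk.tokenize import sent_tokenize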

sentences = sent_tokenize(paragraph)

len(sentences)

4

In [24]:
sentences

['Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human languages.',
 'The ultimate goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and useful.',
 'Techniques such as tokenization, part-of-speech tagging, and named entity recognition are commonly used in NLP applications.',
 'Despite the advances in NLP, challenges such as ambiguity, context, and language diversity still pose significant difficulties.']
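To get a feel for what sentence splitting involves, here is a deliberately naive sketch that splits after `.`, `!`, or `?` when followed by whitespace. It is not how `sent_tokenize` works internally; `naive_sent_split` is a hypothetical helper, and it will mis-split abbreviations such as "Dr.", which is one reason to prefer a trained tokenizer.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "NLP is a subfield of AI. It studies human language! Is it hard?"
parts = naive_sent_split(sample)
print(len(parts))  # 3

# The naive rule breaks on abbreviations:
print(naive_sent_split("Dr. Smith arrived."))  # ['Dr.', 'Smith arrived.']
```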

### 2. Remove stopwords from the sentences

In [29]:
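# Remove stopwords, then stem the remaining tokens with PorterStemmer.
# Note: the membership test is case-sensitive and NLTK's list is lowercase,
# so a capitalized token such as "The" is not filtered out.

sentences = sent_tokenize(paragraph)
stop_set = set(en_stopwords)  # build the set once instead of once per token

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    stopwords_less_words = [stemmer.stem(word) for word in words if word not in stop_set]
    sentences[i] = ' '.join(stopwords_less_words)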

In [30]:
sentences

['natur languag process ( nlp ) subfield artifici intellig ( ai ) focus interact comput human languag .',
 'the ultim goal nlp enabl machin understand , interpret , generat human languag way meaning use .',
 'techniqu token , part-of-speech tag , name entiti recognit common use nlp applic .',
 'despit advanc nlp , challeng ambigu , context , languag divers still pose signific difficulti .']
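The second sentence in the output above still begins with "the". The original token was "The", and the stopword check compares it against a lowercase list, so it passes the filter and is only lowercased afterwards by the stemmer. A small sketch of the difference, using a hypothetical mini stopword set (`SAMPLE_STOPWORDS`) in place of NLTK's list:

```python
# Hypothetical mini stopword set standing in for NLTK's lowercase English list.
SAMPLE_STOPWORDS = {"the", "of", "is", "to"}

tokens = ["The", "ultimate", "goal", "of", "NLP", "is", "to", "enable", "machines"]

# Case-sensitive check: "The" is not equal to "the", so it slips through.
case_sensitive = [t for t in tokens if t not in SAMPLE_STOPWORDS]

# Lowercasing only for the comparison removes it and keeps the casing of the rest.
case_insensitive = [t for t in tokens if t.lower() not in SAMPLE_STOPWORDS]

print(case_sensitive)    # ['The', 'ultimate', 'goal', 'NLP', 'enable', 'machines']
print(case_insensitive)  # ['ultimate', 'goal', 'NLP', 'enable', 'machines']
```

In the notebook's loop, the equivalent change would be to test `word.lower()` against the stopword set instead of `word`.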

In [35]:
# Repeat the same pipeline with the Snowball stemmer for comparison
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
sentences = sent_tokenize(paragraph)
stop_set = set(en_stopwords)  # build the set once instead of once per token

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    stopwords_less_words = [stemmer.stem(word) for word in words if word not in stop_set]
    sentences[i] = ' '.join(stopwords_less_words)

In [36]:
sentences

['natur languag process ( nlp ) subfield artifici intellig ( ai ) focus interact comput human languag .',
 'the ultim goal nlp enabl machin understand , interpret , generat human languag way meaning use .',
 'techniqu token , part-of-speech tag , name entiti recognit common use nlp applic .',
 'despit advanc nlp , challeng ambigu , context , languag divers still pose signific difficulti .']

In [37]:
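# Using lemmatization instead of stemming.
# WordNetLemmatizer requires NLTK's WordNet data (nltk.download('wordnet')).
# pos='v' treats every token as a verb, so nouns such as "computers"
# or "languages" are left in their plural form.

from nltk.stem import WordNetLemmatizer

lemmtize = WordNetLemmatizer()
sentences = sent_tokenize(paragraph)
stop_set = set(en_stopwords)  # build the set once instead of once per token

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    stopwords_less_words = [lemmtize.lemmatize(word, pos='v') for word in words if word not in stop_set]
    sentences[i] = ' '.join(stopwords_less_words)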

In [38]:
sentences

['Natural language process ( NLP ) subfield artificial intelligence ( AI ) focus interaction computers human languages .',
 'The ultimate goal NLP enable machine understand , interpret , generate human language way meaningful useful .',
 'Techniques tokenization , part-of-speech tag , name entity recognition commonly use NLP applications .',
 'Despite advance NLP , challenge ambiguity , context , language diversity still pose significant difficulties .']

In [28]:
# 'ultimate' is not a verb, so lemmatizing it as one returns the word unchanged
lemmtize.lemmatize('ultimate', pos='v')

'ultimate'