
### Stop Words in NLP

Stop words are high-frequency words in a language (such as "the", "a", "is", "in") that are often removed during text preprocessing in NLP. They are usually filtered out because they carry little meaning on their own and can add noise to the data, which may hurt the performance of NLP models. Removing them lets later steps focus on the more informative terms, shrinks the vocabulary (and with it the dimensionality of count-based features), and makes algorithms more efficient. The right list of stop words depends on the specific NLP task and the language being processed.
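As a minimal sketch of the idea, the snippet below filters a sentence against a tiny hand-written stop list and reports how many tokens remain. The names `DEMO_STOP_WORDS` and `remove_stop_words` are illustrative assumptions, not part of NLTK; the rest of this notebook uses NLTK's English list instead.

```python
# Hypothetical, deliberately tiny stop list used for illustration only.
DEMO_STOP_WORDS = {"the", "a", "is", "in", "of", "and", "on"}

def remove_stop_words(tokens, stop_words=DEMO_STOP_WORDS):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "The cat is sleeping in the garden".split()
filtered = remove_stop_words(tokens)

# Fewer tokens means a smaller vocabulary for count-based features.
print(len(tokens), "->", len(filtered), filtered)
```

Here four of the seven tokens are dropped, leaving only the content words.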

In [15]:
text = """
Natural language processing (NLP) is a fascinating field at the intersection of computer science, artificial intelligence, and linguistics.
It focuses on enabling computers to understand, interpret, and generate human language.
This involves a wide range of tasks, including text classification, sentiment analysis, machine translation, question answering, and text summarization.
NLP has become increasingly important in today's data-driven world, with applications in various industries such as healthcare, finance, and customer service.
One of the fundamental steps in many NLP tasks is text preprocessing, which involves cleaning and preparing the text data for analysis.
This often includes tasks like tokenization (breaking down text into individual words or sub-word units), stemming or lemmatization (reducing words to their root form), and removing stop words.
Stop words are common words like "the", "a", "is", and "in" that often don't carry significant meaning and can be removed to reduce noise and improve the performance of NLP models.
Another important aspect of NLP is feature extraction, which involves converting text data into numerical representations that can be used by machine learning algorithms.
Common techniques include bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), and word embeddings.
Bag-of-words represents text as a collection of word counts, while TF-IDF assigns weights to words based on their frequency in a document and across a corpus.
Word embeddings, such as Word2Vec and GloVe, represent words as dense vectors in a continuous vector space, capturing semantic relationships between words.
NLP models can be broadly categorized into traditional machine learning models and deep learning models.
Traditional models like Naive Bayes and Support Vector Machines have been used for tasks like text classification, while deep learning models,
such as Recurrent Neural Networks (RNNs) and Transformers, have achieved state-of-the-art results in various NLP tasks, particularly in areas like machine translation and text generation.
The field of NLP is constantly evolving, with new techniques and models being developed.
With the increasing availability of large datasets and computational resources, NLP is expected to play an even more significant role in the future,
enabling more natural and intuitive interactions between humans and computers.
"""

In [16]:
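import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')  # stop-word lists returned by stopwords.words()
nltk.download('punkt_tab')  # tokenizer data for sent_tokenize / word_tokenize
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer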

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [17]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on
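Note that the list above includes the negations 'no', 'nor', and 'not'. Since the right stop list depends on the task, one option is to subtract such words before filtering. The sketch below uses a short excerpt of the list rather than the full NLTK list, and it assumes a sentiment-style task where negations matter; `NEGATIONS` and `custom_stop_words` are hypothetical names.

```python
# Short excerpt of the English list shown above (not the full list).
stop_words_excerpt = {"a", "is", "it", "no", "nor", "not", "of", "on"}

# Assumption: for sentiment-style tasks, negations carry meaning and are kept.
NEGATIONS = {"no", "nor", "not"}
custom_stop_words = stop_words_excerpt - NEGATIONS

tokens = "it is not a good idea".split()

default_kept = [tok for tok in tokens if tok not in stop_words_excerpt]
custom_kept = [tok for tok in tokens if tok not in custom_stop_words]

print(default_kept)  # the negation is lost
print(custom_kept)   # the negation survives
```

With the unmodified excerpt the sentence collapses to "good idea", which reverses its meaning; the customized set keeps "not".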

In [18]:
stemmer = PorterStemmer()
sentences = nltk.sent_tokenize(text)

### Remove Stop Words, Then Apply Stemming

In [19]:
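# Build the stop-word set once instead of rebuilding it for every token
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # The membership test is case-sensitive, so capitalized tokens such as "It" are kept
  words = [stemmer.stem(word) for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)
sentences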

['natur languag process ( nlp ) fascin field intersect comput scienc , artifici intellig , linguist .',
 'it focus enabl comput understand , interpret , gener human languag .',
 'thi involv wide rang task , includ text classif , sentiment analysi , machin translat , question answer , text summar .',
 "nlp becom increasingli import today 's data-driven world , applic variou industri healthcar , financ , custom servic .",
 'one fundament step mani nlp task text preprocess , involv clean prepar text data analysi .',
 'thi often includ task like token ( break text individu word sub-word unit ) , stem lemmat ( reduc word root form ) , remov stop word .',
 "stop word common word like `` '' , `` '' , `` '' , `` '' often n't carri signific mean remov reduc nois improv perform nlp model .",
 'anoth import aspect nlp featur extract , involv convert text data numer represent use machin learn algorithm .',
 'common techniqu includ bag-of-word , tf-idf ( term frequency-invers document frequenc ) , 
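The output above still contains 'it' and 'thi' at the start of some sentences. The stop-word check runs on the original tokens "It" and "This", which do not match the all-lowercase list, and the stemmer lowercases them afterwards. One possible fix is to lowercase each token before the lookup. The sketch below shows the difference on a hand-made token list with a small excerpt of the stop list, so it is an illustration rather than the notebook's actual pipeline.

```python
# Small excerpt of the lowercase stop list (not the full NLTK list).
stop_words_excerpt = {"it", "on", "a", "of"}

# Hand-made tokens resembling the start of the second sentence.
tokens = ["It", "focuses", "on", "enabling", "computers"]

# Case-sensitive lookup: "It" is not equal to "it", so it slips through.
case_sensitive = [t for t in tokens if t not in stop_words_excerpt]

# Lowercasing before the lookup catches capitalized stop words as well.
case_insensitive = [t for t in tokens if t.lower() not in stop_words_excerpt]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only for the lookup keeps the original casing of the surviving tokens, which matters for the lemmatization step, where case is preserved.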

### Remove Stop Words, Then Apply Lemmatization

In [20]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [21]:
sentences = nltk.sent_tokenize(text)

In [22]:
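# Build the stop-word set once instead of rebuilding it for every token
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # The membership test is case-sensitive, so capitalized tokens such as "It" are kept
  words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)
sentences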

['Natural language processing ( NLP ) fascinating field intersection computer science , artificial intelligence , linguistics .',
 'It focus enabling computer understand , interpret , generate human language .',
 'This involves wide range task , including text classification , sentiment analysis , machine translation , question answering , text summarization .',
 "NLP become increasingly important today 's data-driven world , application various industry healthcare , finance , customer service .",
 'One fundamental step many NLP task text preprocessing , involves cleaning preparing text data analysis .',
 'This often includes task like tokenization ( breaking text individual word sub-word unit ) , stemming lemmatization ( reducing word root form ) , removing stop word .',
 "Stop word common word like `` '' , `` '' , `` '' , `` '' often n't carry significant meaning removed reduce noise improve performance NLP model .",
 'Another important aspect NLP feature extraction , involves conver
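Both outputs still contain punctuation tokens such as `(`, `,`, and the quote markers produced by the tokenizer, because none of them are in the stop list. If those are unwanted, one simple option is to keep only tokens that contain at least one letter or digit. This is a sketch on a hand-made token list; `has_alnum` is a hypothetical helper, and the rule is an assumption that may be too aggressive or too lenient for some tasks.

```python
def has_alnum(token):
    """True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Hand-made tokens resembling the tokenizer output shown above.
tokens = ["Natural", "language", "processing", "(", "NLP", ")", ",",
          "``", "''", "data-driven", "n't", "."]

cleaned = [tok for tok in tokens if has_alnum(tok)]
print(cleaned)
```

Hyphenated words like "data-driven" survive, and so do contraction fragments like "n't", which would need separate handling.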