In [None]:
from google.colab import drive
drive.mount('/content/drive')

Step 1: Install NLTK
If NLTK is not already installed (Colab typically ships with it), uncomment and run the following command:


In [None]:
#!pip install nltk



**Importing NLTK and Downloading Resources:**
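Before you can use certain features (like tokenization, stopword removal, or lemmatization), you need to download the corresponding NLTK data. Newer NLTK releases may look for a resource named 'punkt_tab' instead of 'punkt'; if word_tokenize raises a LookupError, download the resource named in the error message: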

In [None]:
import nltk
nltk.download('punkt')      # For tokenization
nltk.download('stopwords')  # For stopword removal
nltk.download('wordnet')    # For lemmatization

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

**Tokenization:**
Tokenization is the process of splitting text into smaller chunks (usually words or sentences).

NLTK has built-in functions for this:

In [None]:
from nltk.tokenize import word_tokenize

text = "Hello, how are you today?"
tokens = word_tokenize(text)
print(tokens)

['Hello', ',', 'how', 'are', 'you', 'today', '?']


**Stopword Removal:**
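Stopwords are high-frequency function words (such as "the", "are", or "how") that carry little meaning on their own. Be careful with sentiment tasks: the English list also contains negations such as "not", so removing stopwords can change what a sentence says. NLTK provides stopword lists for several languages: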

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

['Hello', ',', 'today', '?']
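
The filtered list above still contains "," and "?", because punctuation is not part of the stopword list. A minimal sketch of one way to drop it, assuming the tokens from the previous output are hard-coded so the snippet runs on its own (the name word_tokens is illustrative, not from the original notebook):

```python
# Tokens left after stopword removal, copied from the output above.
filtered_tokens = ['Hello', ',', 'today', '?']

# Keep only tokens that contain at least one letter or digit,
# which discards pure-punctuation tokens such as ',' and '?'.
word_tokens = [tok for tok in filtered_tokens if any(ch.isalnum() for ch in tok)]
print(word_tokens)
```

Checking for any alphanumeric character, rather than requiring the whole token to be alphabetic, keeps tokens such as "n't" or "3rd" that mix letters with other characters.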


In [None]:
print(stop_words)

{"that'll", 'them', 'off', 'in', 'or', 'between', "it's", 'further', 'then', 'ourselves', 't', 'into', "wasn't", "wouldn't", 'out', 'having', 'not', 'ma', 'over', 'hasn', 'its', 'isn', 'couldn', 'a', 'because', 'themselves', 'to', 'there', 'am', 'should', 's', 'o', 'most', 'does', 'do', 'our', 'and', 'ain', "hadn't", 'did', 'ours', 'their', 'very', 'haven', 'shan', 'now', 'shouldn', 'so', 'was', 'it', 'yours', 'all', 'here', 'being', 'of', 'the', 'm', 'his', 'mightn', 'will', 'mustn', "mightn't", 'for', "didn't", 'if', 'about', 'under', 'aren', 'again', "she's", 've', "aren't", 'itself', 'but', 'wouldn', "weren't", 'how', 'these', 'through', 'during', "couldn't", 'll', 'me', 'weren', 'whom', 'doing', 'on', 'needn', 'as', "you'd", 'been', 'y', 'that', 'yourself', 'myself', "doesn't", 'after', 'any', "won't", 'before', 'with', 'can', 'up', 'while', "shan't", 'each', 'just', 'such', "you've", 'below', 'from', 'which', 'her', 'won', 'same', 'i', 'you', 'your', 'himself', 'own', 'they', 'do

**Stemming:**
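Stemming strips suffixes with heuristic rules to reduce a word to a stem. For example, "running" becomes "run". The stem is not always a dictionary word, and the Porter stemmer also lowercases its input by default, but the approach is fast and works well for some basic tasks: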

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_tokens]
print(stemmed_words)

['hello', ',', 'today', '?']


In this case, the only visible change is that "Hello" was lowercased, because none of the tokens carries a suffix to strip. Inflected forms such as "running" or "played" would be reduced to "run" and "play".
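
To see the stemmer change something, it helps to feed it inflected words. The sketch below uses three illustrative inputs that are not part of the original example; it only needs the nltk package itself, since PorterStemmer does not rely on downloaded data:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative inputs (assumed for this sketch, not from the notebook above).
words = ["running", "played", "studies"]

# Map each word to its Porter stem.
stems = {word: ps.stem(word) for word in words}
print(stems)
```

The stem of "studies" comes out as "studi", which shows that a stem is a truncated form rather than a dictionary word.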

**Lemmatization:**
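Lemmatization maps a word to its dictionary form (lemma) using WordNet's vocabulary and morphological rules instead of blind suffix stripping. NLTK’s WordNetLemmatizer does not look at the surrounding sentence: it treats every word as a noun unless you pass a part-of-speech tag through the pos argument, and it does not lowercase its input: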

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized_words)

['Hello', ',', 'today', '?']


Using spaCy: Step-by-Step
Now, let’s move to spaCy, which runs tokenization, part-of-speech tagging, and lemmatization together in a single processing pipeline.

Step 1: Install spaCy and Download a Language Model
First, install spaCy and its English language model:

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 51.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Step 2: Import spaCy and Load the Language Model
To start using spaCy, load the English language model:

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

Step 3: Tokenization
With spaCy, tokenization is built into the pipeline. Simply pass text through the nlp object:

In [None]:
doc = nlp("Hello, how are you today?")
tokens = [token.text for token in doc]
print(tokens)

['Hello', ',', 'how', 'are', 'you', 'today', '?']


spaCy automatically handles splitting the text into tokens.
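
Tokenization can also mean splitting text into sentences. The sketch below assumes a blank English pipeline with spaCy's rule-based sentencizer component, chosen here only so the snippet does not depend on the downloaded model; the names blank_nlp and sample_doc are illustrative. With en_core_web_sm loaded, doc.sents is available without adding a component.

```python
import spacy

# Assumption: a blank pipeline plus the rule-based "sentencizer",
# so no trained model is needed for this sketch.
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")

sample_doc = blank_nlp("Hello there. How are you today?")

# doc.sents yields one Span per detected sentence.
sentences = [sent.text for sent in sample_doc.sents]
print(sentences)
```

The sentencizer splits on sentence-final punctuation only, so it is a rougher tool than the parser-based boundaries a trained model provides.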

Step 4: Stopword Removal
spaCy has a built-in list of stopwords, and you can remove them easily:

In [None]:
filtered_tokens = [token.text for token in doc if not token.is_stop]
print(filtered_tokens)

['Hello', ',', 'today', '?']
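
As with NLTK, punctuation survives stopword removal. spaCy tokens also expose an is_punct flag, so both filters can be combined. This sketch assumes a blank English pipeline (blank_nlp, an illustrative name) so it runs without the downloaded model; the same list comprehension should work on the doc object created earlier.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because is_stop and
# is_punct are lexical attributes that do not need a trained model.
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp("Hello, how are you today?")

# Drop stop words and punctuation in one pass.
content_tokens = [
    token.text
    for token in sample_doc
    if not token.is_stop and not token.is_punct
]
print(content_tokens)
```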


Step 5: Lemmatization
Lemmatization is also built into spaCy. Each token has a .lemma_ attribute:

In [None]:
lemmatized_tokens = [token.lemma_ for token in doc if not token.is_stop]
print(lemmatized_tokens)

['hello', ',', 'today', '?']


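Because lemmatization runs as part of the pipeline, spaCy can use the part-of-speech tags assigned earlier in the same pass, so no separate lemmatizer call or manual tag is needed. Note that, unlike the NLTK result above, the lemma of "Hello" is lowercased to "hello".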