# NLP Foundations: From Text to Structure

Welcome to your first step in learning NLP! Natural Language Processing is the field of AI that gives machines the ability to understand, interpret, and generate human language.

In this notebook, we'll cover the essential foundations:
1. **Tokenization**: Breaking down text into smaller units (tokens).
2. **Stop Word Removal**: Filtering out common words that don't carry much meaning.
3. **Stemming & Lemmatization**: Reducing words to their base or root form.

First, let's make sure we have the necessary libraries.

In [1]:
!pip install nltk spacy
!python -m spacy download en_core_web_sm


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m26.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spa

## 1. Tokenization
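Tokenization is the process of splitting a string into smaller units called tokens: words, punctuation marks, or whole sentences. NLTK's `word_tokenize` treats punctuation as separate tokens and splits contractions, so "It's" becomes `It` and `'s`.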

In [3]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

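# Tokenizer models. Newer NLTK releases (3.9+) may look for 'punkt_tab' instead;
# if you hit a LookupError, run nltk.download('punkt_tab') as well.
nltk.download('punkt')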

text = "Hello there! Welcome to the world of NLP. It's fascinating, isn't it?"

words = word_tokenize(text)
sentences = sent_tokenize(text)

print(f"Original: {text}")
print(f"Words: {words}")
print(f"Sentences: {sentences}")

Original: Hello there! Welcome to the world of NLP. It's fascinating, isn't it?
Words: ['Hello', 'there', '!', 'Welcome', 'to', 'the', 'world', 'of', 'NLP', '.', 'It', "'s", 'fascinating', ',', 'is', "n't", 'it', '?']
Sentences: ['Hello there!', 'Welcome to the world of NLP.', "It's fascinating, isn't it?"]


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/amitprakash/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
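
To see what `word_tokenize` adds over a plain pattern match, here is a minimal sketch of a regex tokenizer that uses only Python's `re` module. The function name `naive_tokenize` and its pattern are illustrative assumptions, not part of NLTK. It breaks contractions at the apostrophe, whereas the output above shows `word_tokenize` producing the more useful pieces `'s` and `n't`.

```python
import re
from typing import List

def naive_tokenize(text: str) -> List[str]:
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("It's fascinating, isn't it?"))
# ['It', "'", 's', 'fascinating', ',', 'isn', "'", 't', 'it', '?']
```

The stray `'`, `s`, and `t` tokens are the kind of noise that a trained or rule-rich tokenizer is designed to avoid.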


## 2. Stop Word Removal
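Stop words are common words such as 'the', 'is', and 'in' that appear frequently but carry little meaning on their own. The filter below removes only the stop words: punctuation and contraction fragments such as `'s` and `n't` stay in the list.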

In [5]:
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

filtered_words = [w for w in words if w.lower() not in stop_words]

print(f"Words after stop-word removal: {filtered_words}")

Words after stop-word removal: ['Hello', '!', 'Welcome', 'world', 'NLP', '.', "'s", 'fascinating', ',', "n't", '?']


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/amitprakash/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
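
The filtered list above still contains punctuation and contraction fragments. One common way to drop them is to keep only alphabetic tokens. The sketch below copies the token list from the earlier output and uses a tiny hand-written stop set (`demo_stop_words`, an assumption standing in for NLTK's full English list) so that it runs without any downloads.

```python
# Tokens copied from the word_tokenize output earlier in this notebook.
tokens = ['Hello', 'there', '!', 'Welcome', 'to', 'the', 'world', 'of', 'NLP', '.',
          'It', "'s", 'fascinating', ',', 'is', "n't", 'it', '?']

# Hypothetical mini stop list; in the notebook you would use stopwords.words('english').
demo_stop_words = {'there', 'to', 'the', 'of', 'it', 'is'}

# str.isalpha() is False for '!', "'s", and "n't", so those tokens are dropped too.
content_words = [w for w in tokens if w.isalpha() and w.lower() not in demo_stop_words]

print(content_words)
# ['Hello', 'Welcome', 'world', 'NLP', 'fascinating']
```

Note that `isalpha()` also discards tokens containing digits or hyphens, which may or may not be what you want for a given task.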


## 3. Stemming vs. Lemmatization
Both aim to reduce words to their root form, but they do it differently:
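- **Stemming**: Chops off word endings using heuristic rules, so the result is not always a real word (e.g., 'playing' -> 'play', 'studies' -> 'studi'). It cannot relate irregular forms: the Porter stemmer leaves 'better' unchanged.
- **Lemmatization**: Uses a dictionary to find the actual linguistic root, or lemma (e.g., 'better' -> 'good' when the word is treated as an adjective).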
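
To make "heuristic rules" concrete, here is a deliberately tiny suffix stripper. It is a toy sketch, not the Porter algorithm; `toy_stem`, `SUFFIX_RULES`, and `min_stem_len` are hypothetical names chosen for illustration.

```python
# Ordered (suffix, replacement) rules; the first match wins.
SUFFIX_RULES = (("ies", "i"), ("ing", ""), ("ed", ""), ("s", ""))

def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip one suffix by rule, with no dictionary lookup."""
    for suffix, replacement in SUFFIX_RULES:
        stem_len = len(word) - len(suffix)
        # Only strip when enough of the word would remain (keeps 'sing' intact).
        if word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len] + replacement
    return word

for w in ["playing", "studies", "rocks", "better", "sing"]:
    print(w, "->", toy_stem(w))
# playing -> play
# studies -> studi
# rocks -> rock
# better -> better
# sing -> sing
```

The rules know nothing about meaning, which is why 'studi' is accepted as a stem and 'better' is never connected to 'good'.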

In [6]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

example_words = ["rocks", "corpora", "better", "playing", "happier"]

print("{0:20} {1:20} {2:20}".format("Word", "Stemming", "Lemmatization"))
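# pos='a' treats every word as an adjective, so nouns ('rocks', 'corpora')
# and verbs ('playing') come back from the lemmatizer unchanged.
for w in example_words:
    print("{0:20} {1:20} {2:20}".format(w, ps.stem(w), lemmatizer.lemmatize(w, pos='a')))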

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/amitprakash/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/amitprakash/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Word                 Stemming             Lemmatization       
rocks                rock                 rocks               
corpora              corpora              corpora             
better               better               good                
playing              play                 playing             
happier              happier              happy               
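
In the table above, only 'better' and 'happier' change in the Lemmatization column because every word was looked up as an adjective (`pos='a'`). A lemmatizer needs the right part of speech for each word. The sketch below mimics that behaviour with a hand-written lookup table so it runs without WordNet; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins, not NLTK objects.

```python
# Hypothetical mini lexicon keyed by (word, part of speech):
# 'n' = noun, 'v' = verb, 'a' = adjective.
TOY_LEMMAS = {
    ("rocks", "n"): "rock",
    ("corpora", "n"): "corpus",
    ("playing", "v"): "play",
    ("better", "a"): "good",
    ("happier", "a"): "happy",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return TOY_LEMMAS.get((word, pos), word)

tagged_words = [("rocks", "n"), ("corpora", "n"), ("better", "a"),
                ("playing", "v"), ("happier", "a")]

for word, pos in tagged_words:
    # Compare the lookup with the correct tag against forcing pos='a' for everything.
    with_tag = toy_lemmatize(word, pos)
    forced_adj = toy_lemmatize(word, "a")
    print(f"{word:10} correct tag: {with_tag:10} forced 'a': {forced_adj}")
```

With NLTK itself, the equivalent change is to pass `pos='n'` for the nouns and `pos='v'` for the verb when calling `lemmatizer.lemmatize`.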


## Wrap-up Task
Try creating a simple function that takes a raw string and returns a list of cleaned, lemmatized tokens!
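
If you want a starting point, here is one possible skeleton. It is a sketch under assumptions: `clean_text` is a hypothetical name, the regex stands in for `word_tokenize`, and the `lemmatize` argument defaults to doing nothing so the code runs without NLTK data. Swapping in `stopwords.words('english')` and a `WordNetLemmatizer().lemmatize` call is left as the exercise.

```python
import re
from typing import Callable, Iterable, List

def clean_text(
    text: str,
    stop_words: Iterable[str],
    lemmatize: Callable[[str], str] = lambda word: word,
) -> List[str]:
    """Lowercase, tokenize, drop stop words, then lemmatize what is left."""
    stop_set = {w.lower() for w in stop_words}
    # Keep only runs of letters; punctuation and digits are discarded.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(t) for t in tokens if t not in stop_set]

print(clean_text("Hello there! Welcome to the world of NLP.",
                 stop_words=["there", "to", "the", "of"]))
# ['hello', 'welcome', 'world', 'nlp']
```

Passing the lemmatizer in as a function keeps the pipeline easy to test and lets you compare stemming against lemmatization by changing one argument.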