# **1-Tokenization**

## **Definition:**  
The process of breaking text into smaller units such as words, subwords, or characters.

## **Types:**

### **1. Word Tokenization**  
Splitting text into individual words.

**Example:**  
Input:  
`"I study Machine Learning on GeeksforGeeks."`  
Output:  
`['I', 'study', 'Machine', 'Learning', 'on', 'GeeksforGeeks', '.']`  
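
A word tokenizer of this kind can be approximated with a single regular expression. The sketch below is only an illustration built on Python's standard `re` module, not the tokenizer of any particular library; the helper name `simple_word_tokenize` is hypothetical.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a "word").
    # [^\w\s] matches one character that is neither a word character
    # nor whitespace, i.e. a punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("I study Machine Learning on GeeksforGeeks.")
print(tokens)
# ['I', 'study', 'Machine', 'Learning', 'on', 'GeeksforGeeks', '.']
```

This pattern splits a contraction such as "I'm" into three tokens (`I`, `'`, `m`); library tokenizers typically add extra rules for such cases.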

---

### **2. Sentence Tokenization**  
Splitting text into individual sentences.

**Example:**  
Input:  
`"I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP."`  
Output:  
`['I study Machine Learning on GeeksforGeeks.', "Currently, I'm studying NLP."]`  
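
As a rough illustration, sentence boundaries can be approximated by splitting on whitespace that follows sentence-ending punctuation. This is a minimal sketch using only the standard `re` module; the helper name `simple_sent_tokenize` is hypothetical, and the rule is far simpler than what a trained sentence tokenizer does.

```python
import re

def simple_sent_tokenize(text):
    # Split on whitespace that directly follows '.', '!' or '?'.
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sentences = simple_sent_tokenize(
    "I study Machine Learning on GeeksforGeeks. Currently, I'm studying NLP."
)
print(sentences)
# ['I study Machine Learning on GeeksforGeeks.', "Currently, I'm studying NLP."]
```

The rule also splits after abbreviations, so "Dr. Smith arrived." becomes two pieces; handling such cases is the main reason to prefer a dedicated sentence tokenizer.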

---

### **3. Subword Tokenization**  
Breaking words into smaller units such as prefixes, suffixes, or individual characters, so that rare or unseen words can be represented by pieces that are already known.
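
One common way to split a word into subwords is greedy longest-match against a fixed vocabulary. The sketch below assumes a tiny hand-written vocabulary (`toy_vocab`) and a `##` prefix to mark word-internal pieces; both are illustrative choices, and real subword vocabularies are learned from data.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "un", "##happi", "##ness"}

print(subword_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(subword_tokenize("unhappiness", toy_vocab))   # ['un', '##happi', '##ness']
print(subword_tokenize("xyz", toy_vocab))           # ['[UNK]']
```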

---

## **Importance:**  
Tokenization is the first step in many NLP pipelines and directly impacts subsequent processing stages.


# **2-Stemming**

## **Definition:**  
The process of reducing words to a stem (an approximate root form) by stripping affixes with heuristic rules. Common English stemmers such as the Porter stemmer remove suffixes only.

## **Example:**  
- "running" → "run"  
- "runs" → "run"  

## **Difference from Lemmatization:**  
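- Stemming is a cruder, rule-based technique that may not always produce real words.  
- Example: "studies" → "studi" (stemming) vs. "studies" → "study" (lemmatization). A lemmatizer can also resolve irregular forms such as "better" → "good" (when tagged as an adjective), which the Porter stemmer leaves as "better".  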
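
The crude, rule-based nature of stemming can be seen in a deliberately naive suffix stripper. The sketch below is not the Porter algorithm; the suffix list, the minimum stem length of 3, and the name `naive_stem` are all assumptions made for illustration.

```python
# Illustrative suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, but keep at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("walking"))  # walk
print(naive_stem("studies"))  # studi
print(naive_stem("running"))  # runn
```

The outputs "studi" and "runn" are not real words, which is the trade-off stemming makes for speed and simplicity.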


In [1]:
from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("Communication"))


commun


# **3-Lemmatization**

## **Definition:**  
The process of reducing a word to its base or dictionary form, called a lemma.

## **Example:**  
- "running" → "run"  
- "ran" → "run"  

## **Importance:**  
Lemmatization groups the inflected forms of a word under a single dictionary entry, so later processing steps treat them as the same term.

## **Comparison with Stemming:**  
- Stemmers are faster and computationally less expensive than lemmatizers.  
- Lemmatization provides more accurate and meaningful results compared to stemming.
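
Conceptually, a lemmatizer looks a word up in a dictionary, usually together with its part of speech. The following is a minimal sketch of that idea; `LEMMA_TABLE` is a hypothetical four-entry table standing in for a real lexical database, and `toy_lemmatize` is not a library function.

```python
# Toy lookup table keyed by (word, part of speech); entries are illustrative.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("studies", "n"): "study",
}

def toy_lemmatize(word, pos="n"):
    # Return the dictionary form if (word, pos) is known,
    # otherwise return the word unchanged.
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("ran", "v"))     # run
print(toy_lemmatize("better", "a"))  # good
print(toy_lemmatize("ran"))          # ran (no noun entry, so unchanged)
```

The last call shows why the part-of-speech argument matters: with the wrong tag, the lookup misses and the word comes back as it went in.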


In [3]:
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [4]:
from nltk.stem import WordNetLemmatizer

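# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# "communication" is a noun, so with pos='v' (verb) WordNet finds no verb entry
# and the input is returned unchanged, capitalization included
print(lemmatizer.lemmatize("Communication", 'v'))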

Communication


# **4-Normalization**

Normalization is the process of converting text into a standard form in Natural Language Processing (NLP). It is done so that text data can be processed consistently, and it makes it easier for NLP models to understand and interpret the text. Common normalization techniques include converting text to lowercase (lowercasing), removing punctuation, and removing stop words.

#### 1. Lowercasing
Lowercasing converts every character in the text to lowercase. This ensures that words such as "Apple" and "apple" are treated as the same word.

**Example:**
- "Apple" → "apple"

#### 2. Removing Punctuation
Punctuation marks are removed in most NLP tasks because they usually do not change the meaning of the text. This keeps unnecessary symbols and marks from adding noise to the text.

**Example:**
- "Hello, world!" → "Hello world"
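
Punctuation can also be removed without a regular expression by deleting every character listed in `string.punctuation`. This is a small sketch using only the standard library; the helper name `strip_punctuation` is hypothetical, and it covers ASCII punctuation only.

```python
import string

def strip_punctuation(text):
    # str.maketrans("", "", chars) builds a table that deletes
    # every character contained in chars.
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_punctuation("Hello, world!"))  # Hello world
```

Non-ASCII marks such as curly quotes are not in `string.punctuation` and would survive this step.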

#### 3. Removing Stopwords
Stop words are words that occur frequently in text and usually carry little meaning. For example, words such as "the", "is", and "and" are considered unnecessary in most language processing tasks and can be removed.

**Example:**
- "This is a pen" → "pen"
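
Stop word removal is essentially a membership test against a fixed list. The sketch below assumes a tiny hand-written set (`STOP_WORDS`) purely for illustration; real stop word lists are much longer, and `remove_stop_words` is a hypothetical helper.

```python
# Small illustrative stop word set.
STOP_WORDS = {"this", "is", "a", "the", "and"}

def remove_stop_words(tokens):
    # Compare in lowercase so that "This" matches the entry "this".
    return [token for token in tokens if token.lower() not in STOP_WORDS]

print(remove_stop_words("This is a pen".split()))  # ['pen']
```

Because stop word lists are normally stored in lowercase, the comparison has to ignore case (or the text has to be lowercased first); otherwise "This" would be kept.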

In [18]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the Punkt tokenizer data used by nltk.word_tokenize
# (newer NLTK releases look for 'punkt_tab')
nltk.download('punkt_tab')

# Download the remaining NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample text
text = "Running is better than walking! Apples and oranges are different."

# 1. Lowercasing
text_lower = text.lower()
print("Lowercased text:", text_lower)

# 2. Removing punctuation (drop every character that is not a word character or whitespace)
text_no_punct = re.sub(r'[^\w\s]', '', text_lower)
print("Text without punctuation:", text_no_punct)

# 3. Tokenization (split the text into word tokens)
words = nltk.word_tokenize(text_no_punct)
print("Tokenized words:", words)

# 4. Removing stop words
stop_words = set(stopwords.words('english'))
words_no_stop = [word for word in words if word not in stop_words]
print("Text without stopwords:", words_no_stop)

# 5. Stemming (reduce each word to its stem)
ps = PorterStemmer()
words_stemmed = [ps.stem(word) for word in words_no_stop]
print("Stemmed words:", words_stemmed)

# 6. Lemmatization (the default pos='n' treats every word as a noun,
#    so verb forms such as "running" are returned unchanged)
lemmatizer = WordNetLemmatizer()
words_lemmatized = [lemmatizer.lemmatize(word) for word in words_no_stop]
print("Lemmatized words:", words_lemmatized)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lowercased text: running is better than walking! apples and oranges are different.
Text without punctuation: running is better than walking apples and oranges are different
Tokenized words: ['running', 'is', 'better', 'than', 'walking', 'apples', 'and', 'oranges', 'are', 'different']
Text without stopwords: ['running', 'better', 'walking', 'apples', 'oranges', 'different']
Stemmed words: ['run', 'better', 'walk', 'appl', 'orang', 'differ']
Lemmatized words: ['running', 'better', 'walking', 'apple', 'orange', 'different']
