**Introduction to NLP and Basic Concepts**

What is NLP (Natural Language Processing)?

NLP is a field of artificial intelligence that enables computers to understand, analyze, and interpret human language. It is used for tasks such as parsing language structure, analyzing text, sentiment analysis, and machine translation. NLP combines linguistic rules with machine learning and deep learning methods to handle the complexity of language.

**Tokenization (Word/Sentence Splitting):**
   
   Tokenization is the process of breaking a text into smaller pieces (tokens), usually words or sentences. It is typically the first step, performed before any analysis of the language structure.

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
text=input("Enter a sentence: ")

Enter a sentence: Today is the first day of the NLP lesson.


In [None]:
tokens=word_tokenize(text)
print(tokens)

['Today', 'is', 'the', 'first', 'day', 'of', 'the', 'NLP', 'lesson', '.']
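
To make the idea of splitting into words or sentences concrete, here is a minimal sketch that uses only Python's `re` module. The function names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration, and the rules are far simpler than what `word_tokenize` applies, so treat it as an approximation rather than a replacement.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character, so it becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split after ., ! or ? when whitespace follows.
    # Naive on purpose: abbreviations such as "Dr. Smith" would be split too.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Today is the first day of the NLP lesson. Tokenization comes first!"
print(simple_word_tokenize("Today is the first day of the NLP lesson."))
print(simple_sent_tokenize(sample))
```

For the single sentence used in the cell above, this sketch produces the same tokens as `word_tokenize`, but that agreement should not be expected for contractions, abbreviations, or other harder cases.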


**Stop Words:**
   
   Stop words are words that occur very frequently in a language but usually carry little information for analysis. Removing them (for example and, with, or) from the text lets the analysis focus on the more meaningful words.

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

In [None]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
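text = "Today is the first day of our NLP lesson."
# Use a new name so the imported `stopwords` corpus module is not overwritten
stop_words = set(stopwords.words('english'))

In [None]:
words = word_tokenize(text)
words_filtered = []

# Keep only the tokens that are not in the stop-word set (the check is case-sensitive)
for w in words:
  if w not in stop_words:
    words_filtered.append(w)

print(words_filtered)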

['Today', 'first', 'day', 'NLP', 'lesson', '.']
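
The stop-word list printed above is all lowercase, so a capitalized stop word such as "The" at the start of a sentence would pass a case-sensitive check. The sketch below shows one way to handle that by comparing in lowercase. `SAMPLE_STOP_WORDS` and `remove_stop_words` are hypothetical names, and the tiny set stands in for the full NLTK list so the example runs on its own.

```python
# Hypothetical, hand-picked stop-word set used only for this illustration;
# in the notebook the set would come from the NLTK English stop-word list.
SAMPLE_STOP_WORDS = {"the", "is", "of", "our", "a", "and"}

def remove_stop_words(tokens, stop_words):
    # Compare in lowercase so that "The" and "the" are both filtered,
    # while the surviving tokens keep their original casing.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "first", "day", "of", "our", "NLP", "lesson", "is", "today", "."]
print(remove_stop_words(tokens, SAMPLE_STOP_WORDS))
```

Punctuation tokens such as "." are not stop words, so they remain in the result. They can be dropped in a separate cleaning step if the analysis does not need them.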


**Lemmatization and Stemming (Finding Root):**
   
   **Stemming:** The process of reducing words to a root form by stripping affixes with simple rules. It ignores the grammatical structure of the word, so the result is not always a real word. <br>
   **Lemmatization:** The reduction of words to their dictionary form (lemma). It takes the grammatical structure of the word into account and generally gives results that are valid words of the language.

In [None]:
from nltk.stem import PorterStemmer,WordNetLemmatizer

In [None]:
text = "Running studies caring happier better good"
words = word_tokenize(text)
words

['Running', 'studies', 'caring', 'happier', 'better', 'good']

In [None]:
# stemming
stemmer=PorterStemmer()
stemmer_words = [stemmer.stem(word) for word in words]
print("Stemmer words:",stemmer_words)

Stemmer words: ['run', 'studi', 'care', 'happier', 'better', 'good']


In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
# lemmatization (lemmatize() treats each word as a noun unless a pos argument is given)
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("Lemmatized words:",lemmatized_words)

Lemmatized words: ['Running', 'study', 'caring', 'happier', 'better', 'good']


**Result** <br>
Stemming: Applies simple rules to strip a word down to a root form. It is fast but does not always produce real words (for example, "studies" becomes "studi"). <br>
Lemmatization: Considers the meaning of the word and reduces it to its dictionary form. It gives more accurate results, but it is somewhat slower and often needs part-of-speech information. In the output above every word was treated as a noun by default, which is why "caring" and "better" were left unchanged.
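
The contrast between the two approaches can be sketched without any library. The code below is a toy model, not how `PorterStemmer` or `WordNetLemmatizer` work internally: `SUFFIX_RULES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are all invented for this illustration, and the table holds only three hand-written entries.

```python
# Toy suffix rules, checked in order; a real stemmer has many more rules.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("running", "v"): "run",
    ("better", "a"): "good",
}

def toy_stem(word):
    # Rule-based: strip the first matching suffix, with no dictionary check.
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least 3 letters remain before the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemmatize(word, pos="n"):
    # Dictionary-based: the answer depends on the part of speech.
    # Falls back to the lowercased word when there is no entry.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_stem("studies"), toy_stem("running"))
print(toy_lemmatize("running"), toy_lemmatize("running", pos="v"))
print(toy_lemmatize("better"), toy_lemmatize("better", pos="a"))
```

The rule-based function returns fragments that are not real words, while the lookup returns real words only when it is given the right part of speech. That is the same trade-off described in the summary above.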

**Text Cleaning and Preprocessing**

Text cleaning and preprocessing is a critical step in natural language processing (NLP) projects because it turns raw text into usable data. It produces the clean, consistent data set needed for analysis and modeling.

**Removing Punctuation Marks:** <br>
Punctuation marks such as , . ! ? usually carry little information for text analysis and are often removed.

**Lowercase/Uppercase Conversion:** <br>
Converting all letters in the text to lowercase or uppercase puts the text into a standard form. This ensures that "NLP" and "nlp" are treated the same.

**Removing Numbers and Special Characters:** <br>
Numbers often carry little meaning for the analysis and are removed. Likewise, special characters such as @ and $ are removed from the text.

In [None]:
import re

text = "Hello World! This is a great opportunity to learn NLP: 2024."
cleaned_text = re.sub(r'[^\w\s]','',text) # remove punctuation
cleaned_text = re.sub(r'\d+','',cleaned_text) # remove numbers
cleaned_text = cleaned_text.lower() # convert to lowercase
print(cleaned_text)

hello world this is a great opportunity to learn nlp 
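
The output above ends with a trailing space where "2024" used to be. One possible way to package the three cleaning steps is a small helper that also collapses leftover whitespace. The name `clean_text` is hypothetical, and the order of the steps is simply the one used in the cell above.

```python
import re

def clean_text(text):
    # Drop punctuation and symbols such as @ or $.
    # Note: underscores survive because \w matches them.
    text = re.sub(r"[^\w\s]", "", text)
    # Drop runs of digits.
    text = re.sub(r"\d+", "", text)
    # Normalize case so that "NLP" and "nlp" are treated the same.
    text = text.lower()
    # Collapse repeated whitespace and trim both ends.
    return " ".join(text.split())

print(clean_text("Hello World! This is a great opportunity to learn NLP: 2024."))
```

Whether numbers should really be removed depends on the task. For texts where dates or amounts matter, the digit step could be skipped.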


Source: <br>

*   https://medium.com/@abhishekjainindore24/all-about-tokenization-stop-words-stemming-and-lemmatization-in-nlp-1620ffaf0f87
