<a href="https://colab.research.google.com/github/monokrrome/Linguistics-and-NLP-experiments/blob/main/Introduction_to_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction to NLP - Tokenization, Stop Words Removal, Stemming, POS Tagging, Lemmatization**

Natural Language Processing draws heavily on the field of linguistics. Linguistic analysis spans phonetics, morphology, syntax, semantics, discourse, and pragmatics, and these levels form the foundation of NLP tasks.

A body of text, or a collection of texts, that is analyzed and on which operations are performed is called a corpus. The corpus used in this introduction is a shortened, modified excerpt from Technical Report on Soil Degradation in Europe: an integrated economic and environmental assessment by G.J. van den Born, B.J. de Haan, D.W. Pearce, and A. Howarth. The shortened text is provided in the repository as a .txt file named 'soil.txt'.

In this introduction, I will demonstrate the basics of data preprocessing of a corpus for NLP tasks. The data preprocessing steps will include:
1. Tokenization
2. Stop Words Removal
3. Stemming
4. POS Tagging
5. Lemmatization


First, we install NLTK in the environment and download the tokenizer models it needs.

In [None]:
!pip install nltk



In [None]:
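import nltk
# Tokenizer models used by word_tokenize and sent_tokenize.
# Recent NLTK releases may ask for a differently named resource (e.g. 'punkt_tab');
# the LookupError message states which one to download.
nltk.download('punkt')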

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

**Tokenization**

Put simply, tokenization means breaking text into smaller units. Two types of tokenization are demonstrated here:
1. Word Tokenization - dividing the text into individual words
2. Sentence Tokenization - dividing the text into sentences

In [None]:
from nltk.tokenize import word_tokenize

sentence = "This is a sample tokenization example."
tokens = word_tokenize(sentence)

print(tokens)

['This', 'is', 'a', 'sample', 'tokenization', 'example', '.']
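To make the idea concrete without relying on NLTK, here is a minimal regex-based sketch of both kinds of tokenization. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical, and the rules are a deliberate simplification: NLTK's tokenizers handle abbreviations, contractions, and many other cases that this sketch does not.

```python
import re

def simple_word_tokenize(text):
    # A token is either a run of word characters (optionally joined by
    # internal hyphens or apostrophes) or a single punctuation character.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Naive rule: a sentence ends at '.', '!' or '?' when it is followed by
    # whitespace and an uppercase letter. Abbreviations will break this.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

sample = "This is a sample tokenization example."
print(simple_word_tokenize(sample))
# ['This', 'is', 'a', 'sample', 'tokenization', 'example', '.']

two_sentences = "Soil degradation is a growing concern. Flash floods may occur after torrential rains."
print(simple_sent_tokenize(two_sentences))
# ['Soil degradation is a growing concern.', 'Flash floods may occur after torrential rains.']
```

The word-level result matches the NLTK output above for this simple sentence; on real text the two will diverge, which is why the rest of this introduction uses NLTK.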


In [None]:
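from nltk.tokenize import word_tokenize, sent_tokenize

# Read-only mode is enough here; "r+" would also open the file for writing.
with open("soil.txt", "r") as soil:
  content = soil.read()

# Split the corpus into word-level and sentence-level tokens.
tokens = word_tokenize(content)
sentences = sent_tokenize(content)

print("word tokens: ", tokens)
print("\nsentence tokens: ", sentences)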

word tokens:  ['The', 'issue', 'of', 'soil', 'degradation', 'ranges', 'from', 'erosion', 'and', 'contamination', 'of', 'the', 'topsoil', 'to', 'overabstraction', 'and', 'contamination', 'of', 'ground', 'water', '.', 'Soil', 'degradation', 'is', 'an', 'issue', 'of', 'growing', 'concern', 'in', 'Europe', ':', '12', '%', 'of', 'total', 'European', 'land', 'area', 'has', 'been', 'affected', 'by', 'water', 'erosion', 'and', 'a', '4', '%', 'by', 'wind', 'erosion', '(', 'Dobris', '+3', ')', '.', 'The', 'loss', 'of', 'fertile', 'soil', 'itself', 'may', 'degrade', 'the', 'productivity', 'of', 'the', 'local', 'agriculture', '.', 'Also', ',', 'the', 'eroded', 'soil', 'being', 'deposed', 'downstream/wind', 'may', 'cause', 'considerable', 'damage', 'to', 'water', 'management', 'systems', 'by', 'filling', 'up', 'water', 'storage', 'reservoirs', '.', 'Flash', 'floods', 'may', 'occur', 'after', 'torrential', 'rains', 'if', 'water-absorbing', 'capacity', 'has', 'been', 'diminished', 'for', 'agri-econom

**Removal of stop words**

Stop words are very common words, such as articles and pronouns, that contribute little to the meaning of a document. Removing them shrinks the text that later steps have to process, which can make information retrieval and classification faster.

The NLTK library in Python provides a list of English stop words that helps with this task. It also includes stop word lists for many other languages, such as German, Indonesian, Portuguese, and Spanish.

In [None]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Read-only mode is enough here.
with open("soil.txt", "r") as soil:
  soilcontent = soil.read()

words = word_tokenize(soilcontent)
stop_words = set(stopwords.words("english"))

# Compare in lowercase so that capitalized stop words such as "The" are removed too.
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Original Words:", words)
print("Filtered Words:", filtered_words)

Original Words: ['The', 'issue', 'of', 'soil', 'degradation', 'ranges', 'from', 'erosion', 'and', 'contamination', 'of', 'the', 'topsoil', 'to', 'overabstraction', 'and', 'contamination', 'of', 'ground', 'water', '.', 'Soil', 'degradation', 'is', 'an', 'issue', 'of', 'growing', 'concern', 'in', 'Europe', ':', '12', '%', 'of', 'total', 'European', 'land', 'area', 'has', 'been', 'affected', 'by', 'water', 'erosion', 'and', 'a', '4', '%', 'by', 'wind', 'erosion', '(', 'Dobris', '+3', ')', '.', 'The', 'loss', 'of', 'fertile', 'soil', 'itself', 'may', 'degrade', 'the', 'productivity', 'of', 'the', 'local', 'agriculture', '.', 'Also', ',', 'the', 'eroded', 'soil', 'being', 'deposed', 'downstream/wind', 'may', 'cause', 'considerable', 'damage', 'to', 'water', 'management', 'systems', 'by', 'filling', 'up', 'water', 'storage', 'reservoirs', '.', 'Flash', 'floods', 'may', 'occur', 'after', 'torrential', 'rains', 'if', 'water-absorbing', 'capacity', 'has', 'been', 'diminished', 'for', 'agri-econ

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
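The filter above only drops tokens that appear in the stop word list, so punctuation tokens such as '.' and ':' remain in the result. The sketch below shows the same filtering logic on a short token list taken from the corpus, with an optional punctuation filter. The tiny stop word set and the helper name `remove_stop_words` are assumptions made for illustration; in practice you would pass NLTK's full English list.

```python
# A deliberately tiny, hand-picked stop word set (illustrative only).
TINY_STOP_WORDS = {"the", "of", "and", "is", "an", "in", "to", "a"}

def remove_stop_words(tokens, stop_words, drop_punctuation=False):
    kept = []
    for token in tokens:
        # Compare in lowercase so "The" matches "the".
        if token.lower() in stop_words:
            continue
        # Optionally skip tokens with no letter or digit, e.g. '.' or ':'.
        if drop_punctuation and not any(ch.isalnum() for ch in token):
            continue
        kept.append(token)
    return kept

tokens = ["The", "issue", "of", "soil", "degradation", "is", "an",
          "issue", "of", "growing", "concern", "in", "Europe", "."]

print(remove_stop_words(tokens, TINY_STOP_WORDS))
# ['issue', 'soil', 'degradation', 'issue', 'growing', 'concern', 'Europe', '.']

print(remove_stop_words(tokens, TINY_STOP_WORDS, drop_punctuation=True))
# ['issue', 'soil', 'degradation', 'issue', 'growing', 'concern', 'Europe']
```

Whether to drop punctuation depends on the task: it is usually noise for bag-of-words models, but sentence-level steps still need it.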


**Stemming**

Stemming reduces words to their stems by stripping affixes with fixed rules. Because these rules do not consider linguistic context, a stem is not necessarily an actual word: in the output below, 'degradation' becomes 'degrad'. By mapping many variations of a word to one stem, information retrieval and document clustering can be made faster.

We will use a popular stemming algorithm, the Porter stemmer, which NLTK provides as the PorterStemmer class. IBM explains this algorithm as, 'The Porter stemming algorithm classifies every character in a given token as either a consonant ("c") or vowel ("v"), grouping subsequent consonants as "C" and subsequent vowels as "V." The algorithm thereby represents every word token as a specific combination of consonant and vowel groups. For example, the word therefore is represented as CVCVCVCV, or C(VC)³V, with the exponent representing repetitions of consonant-vowel groups.'
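The consonant/vowel grouping described in the quote can be sketched in a few lines of plain Python. This is only an illustration of the pattern and of the repetition count (often called the measure), not NLTK's implementation; the function names are hypothetical. It assumes the usual Porter convention that 'y' counts as a vowel when it follows a consonant.

```python
def is_consonant(word, i):
    # Vowels are a, e, i, o, u, plus 'y' when it follows a consonant.
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    # Collapse runs of consonants into one "C" and runs of vowels into one "V".
    word = word.lower()
    pattern = []
    for i in range(len(word)):
        label = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != label:
            pattern.append(label)
    return "".join(pattern)

def measure(word):
    # Number of "VC" repetitions in the collapsed pattern.
    return cv_pattern(word).count("VC")

print(cv_pattern("therefore"), measure("therefore"))  # CVCVCVCV 3
print(cv_pattern("tree"), measure("tree"))            # CV 0
print(cv_pattern("trouble"), measure("trouble"))      # CVCV 1
```

The stemmer's suffix-stripping rules are conditioned on this count, so that very short stems are left alone.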

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

# Read-only mode is enough here.
with open("soil.txt", "r") as soil:
  soilcontent = soil.read()

words = word_tokenize(soilcontent)
# stem() also lowercases each token, as the output shows.
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)

['the', 'issu', 'of', 'soil', 'degrad', 'rang', 'from', 'eros', 'and', 'contamin', 'of', 'the', 'topsoil', 'to', 'overabstract', 'and', 'contamin', 'of', 'ground', 'water', '.', 'soil', 'degrad', 'is', 'an', 'issu', 'of', 'grow', 'concern', 'in', 'europ', ':', '12', '%', 'of', 'total', 'european', 'land', 'area', 'ha', 'been', 'affect', 'by', 'water', 'eros', 'and', 'a', '4', '%', 'by', 'wind', 'eros', '(', 'dobri', '+3', ')', '.', 'the', 'loss', 'of', 'fertil', 'soil', 'itself', 'may', 'degrad', 'the', 'product', 'of', 'the', 'local', 'agricultur', '.', 'also', ',', 'the', 'erod', 'soil', 'be', 'depos', 'downstream/wind', 'may', 'caus', 'consider', 'damag', 'to', 'water', 'manag', 'system', 'by', 'fill', 'up', 'water', 'storag', 'reservoir', '.', 'flash', 'flood', 'may', 'occur', 'after', 'torrenti', 'rain', 'if', 'water-absorb', 'capac', 'ha', 'been', 'diminish', 'for', 'agri-econom', 'reason', '.', 'accord', 'to', 'the', 'dobris+3', 'report', 'about', '115', 'million', 'hectar', 'ar
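To see why stemming helps retrieval and clustering, it can be useful to group words by their stem. The sketch below does this for four words from the corpus, using the same `PorterStemmer`; the word list and the `groups` variable are chosen here for illustration. It also shows a limitation visible in the output above: 'erosion' and 'eroded' are related, but they end up with different stems.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["degradation", "degrade", "erosion", "eroded"]

# Map each stem to the list of original words that reduce to it.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, members in groups.items():
    print(stem, "<-", members)
```

Words that share a stem are treated as the same term by anything that indexes or counts the stemmed tokens.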

**Parts of Speech Tagging - POS Tagging**

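As the name suggests, POS tagging assigns each word a part-of-speech tag. We will use the Averaged Perceptron Tagger, a widely used tagger from the NLTK library. The tags in the output, such as DT (determiner), NN (singular noun), and VBZ (third-person singular present verb), come from the Penn Treebank tag set.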

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize

# Read-only mode is enough here.
with open("soil.txt", "r") as soil:
  soilcontent = soil.read()

words = word_tokenize(soilcontent)
# pos_tag returns a list of (token, tag) tuples.
tagged_words = nltk.pos_tag(words)
print(tagged_words)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('The', 'DT'), ('issue', 'NN'), ('of', 'IN'), ('soil', 'NN'), ('degradation', 'NN'), ('ranges', 'VBZ'), ('from', 'IN'), ('erosion', 'NN'), ('and', 'CC'), ('contamination', 'NN'), ('of', 'IN'), ('the', 'DT'), ('topsoil', 'NN'), ('to', 'TO'), ('overabstraction', 'NN'), ('and', 'CC'), ('contamination', 'NN'), ('of', 'IN'), ('ground', 'NN'), ('water', 'NN'), ('.', '.'), ('Soil', 'NNP'), ('degradation', 'NN'), ('is', 'VBZ'), ('an', 'DT'), ('issue', 'NN'), ('of', 'IN'), ('growing', 'VBG'), ('concern', 'NN'), ('in', 'IN'), ('Europe', 'NNP'), (':', ':'), ('12', 'CD'), ('%', 'NN'), ('of', 'IN'), ('total', 'JJ'), ('European', 'JJ'), ('land', 'NN'), ('area', 'NN'), ('has', 'VBZ'), ('been', 'VBN'), ('affected', 'VBN'), ('by', 'IN'), ('water', 'NN'), ('erosion', 'NN'), ('and', 'CC'), ('a', 'DT'), ('4', 'CD'), ('%', 'NN'), ('by', 'IN'), ('wind', 'NN'), ('erosion', 'NN'), ('(', '('), ('Dobris', 'NNP'), ('+3', 'NNP'), (')', ')'), ('.', '.'), ('The', 'DT'), ('loss', 'NN'), ('of', 'IN'), ('fertile', 'J
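Once the text is tagged, the (token, tag) pairs can be processed like any other Python list. As a small sketch, the code below takes the first eight pairs from the output above, counts how often each tag occurs, and pulls out the nouns. The variable names and the short tag glossary are illustrative; the glossary covers only the tags in this sample.

```python
from collections import Counter

# First few (token, tag) pairs copied from the tagger output above.
tagged_sample = [
    ("The", "DT"), ("issue", "NN"), ("of", "IN"), ("soil", "NN"),
    ("degradation", "NN"), ("ranges", "VBZ"), ("from", "IN"), ("erosion", "NN"),
]

# Short glossary for the Penn Treebank tags that occur in this sample.
TAG_MEANINGS = {
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "IN": "preposition or subordinating conjunction",
    "VBZ": "verb, third-person singular present",
}

# Count how many tokens received each tag.
tag_counts = Counter(tag for _, tag in tagged_sample)
for tag, count in tag_counts.most_common():
    print(f"{tag:4} {count}  {TAG_MEANINGS[tag]}")

# Keep only the tokens tagged as nouns (all noun tags start with "NN").
nouns = [token for token, tag in tagged_sample if tag.startswith("NN")]
print(nouns)
# ['issue', 'soil', 'degradation', 'erosion']
```

The same pattern works on the full `tagged_words` list, provided the glossary is extended or the lookup uses `TAG_MEANINGS.get(tag, "")`.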

**Lemmatization**

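Lemmatization is similar to stemming in that it reduces different variations of a word to a common form. However, the variations are mapped to a 'lemma', the dictionary form of the word, and linguistic context and meaning can be taken into account while doing so. Note that WordNetLemmatizer.lemmatize treats every word as a noun unless a part of speech is passed, which is why 'has' comes out as 'ha' in the output below.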

In [None]:
import nltk
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Read-only mode is enough here.
with open("soil.txt", "r") as soil:
  soilcontent = soil.read()

words = word_tokenize(soilcontent)
# Without a pos argument, every token is lemmatized as a noun.
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print(lemmatized_words)

['The', 'issue', 'of', 'soil', 'degradation', 'range', 'from', 'erosion', 'and', 'contamination', 'of', 'the', 'topsoil', 'to', 'overabstraction', 'and', 'contamination', 'of', 'ground', 'water', '.', 'Soil', 'degradation', 'is', 'an', 'issue', 'of', 'growing', 'concern', 'in', 'Europe', ':', '12', '%', 'of', 'total', 'European', 'land', 'area', 'ha', 'been', 'affected', 'by', 'water', 'erosion', 'and', 'a', '4', '%', 'by', 'wind', 'erosion', '(', 'Dobris', '+3', ')', '.', 'The', 'loss', 'of', 'fertile', 'soil', 'itself', 'may', 'degrade', 'the', 'productivity', 'of', 'the', 'local', 'agriculture', '.', 'Also', ',', 'the', 'eroded', 'soil', 'being', 'deposed', 'downstream/wind', 'may', 'cause', 'considerable', 'damage', 'to', 'water', 'management', 'system', 'by', 'filling', 'up', 'water', 'storage', 'reservoir', '.', 'Flash', 'flood', 'may', 'occur', 'after', 'torrential', 'rain', 'if', 'water-absorbing', 'capacity', 'ha', 'been', 'diminished', 'for', 'agri-economic', 'reason', '.', '

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
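One way to bring linguistic context into lemmatization is to reuse the POS tags from the previous section. The WordNet lemmatizer expects single-letter part-of-speech codes ('n', 'v', 'a', 'r') rather than Penn Treebank tags, so a small mapping is needed. The helper `penn_to_wordnet` below is a hypothetical name and a common simplification, sketched here on a few pairs copied from the tagger output; it falls back to noun for any tag it does not recognize.

```python
def penn_to_wordnet(tag):
    # WordNet POS codes: 'a' adjective, 'v' verb, 'r' adverb, 'n' noun.
    if tag.startswith("J"):
        return "a"
    if tag.startswith("V"):
        return "v"
    if tag.startswith("R"):
        return "r"
    # Default to noun, which is also the lemmatizer's own default.
    return "n"

# (token, tag) pairs copied from the POS tagging output above.
tagged_sample = [
    ("ranges", "VBZ"), ("total", "JJ"), ("European", "JJ"),
    ("land", "NN"), ("has", "VBZ"), ("been", "VBN"), ("affected", "VBN"),
]

# Pair each token with the WordNet code it would be lemmatized under.
wordnet_pairs = [(token, penn_to_wordnet(tag)) for token, tag in tagged_sample]
print(wordnet_pairs)
```

With the `lemmatizer` defined in the cell above, each pair can then be passed as `lemmatizer.lemmatize(token, pos=wn_pos)`, so that verb forms such as 'has' are looked up as verbs rather than as nouns.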
