# Computational Linguistics with Python: Chapter 0
# Introduction to NLP & Text Data Processing

## What is NLP?

NLP stands for Natural Language Processing. Some consider it a sub-field of AI, others a standalone discipline. Its main goal is to help computers imitate the way humans understand and generate natural (human) languages.


From ChatGPT to Google Translate, many AI products that we use daily rely on NLP techniques.



I've written these tutorials as a smooth, high-level introduction for those who do not have prior Linguistics or NLP experience.



While we are going to start off with something a bit heavier on the programming side, bear in mind that these tutorials were designed to help you learn Linguistics as well as NLP. Therefore, expect well-defined theoretical explanations as well as Python examples.

## Preparing Text Data for NLP

As you probably already know, it's not easy for machines to work with text data. In this first chapter, we will take a look at various data preprocessing steps that will make our text data more suitable for NLP tasks.  

### Tokenization

Tokenization is the process through which we split the text into smaller pieces called tokens. Let's take a look at an example:

In [None]:
# Install nltk if you don't have it
!pip install nltk

In [None]:
import nltk

In [2]:
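# Download punkt, the tokenizer models used by word_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead)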
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
from nltk.tokenize import word_tokenize

# Full text
text = "Have I not done enough for you?"

# Break into tokens
tokens = word_tokenize(text)

print(tokens)

['Have', 'I', 'not', 'done', 'enough', 'for', 'you', '?']
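
To see what a tokenizer has to decide, here is a minimal regex-based sketch using only the standard library. It is not how `word_tokenize` works internally; the function name `simple_tokenize` and the pattern are assumptions made for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one punctuation character at a time.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Have I not done enough for you?"

# Naive whitespace splitting leaves the question mark glued to "you".
print(sentence.split())

# The regex version separates the punctuation into its own token.
print(simple_tokenize(sentence))
```

On this sentence the sketch yields the same tokens as the NLTK output above. It would likely differ on contractions: the regex splits "don't" into three pieces, whereas `word_tokenize` typically keeps "n't" together as one token.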


### Normalization

Normalization is performed to make your text data consistent and, therefore, more suitable to be processed by algorithms.

It often involves lowercasing the words and the removal of punctuation marks.

In [4]:
# Lowercase all words

my_text = "OH MAN! This is awesome!"

my_text = my_text.lower()

print(my_text)

oh man! this is awesome!


In [5]:
# Remove Punctuation

import string
text_without_punctuation = my_text.translate(str.maketrans('', '', string.punctuation))
print(text_without_punctuation)


oh man this is awesome
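
The two steps above can be combined into one helper. The sketch below is one possible arrangement, not a standard recipe: the name `normalize` and the extra whitespace-collapsing step are assumptions added for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse extra whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    # split() with no arguments drops runs of whitespace, so joining
    # the pieces with single spaces tidies up any gaps left behind.
    return " ".join(text.split())

print(normalize("OH MAN!   This is awesome!"))  # oh man this is awesome
```

Note that `string.punctuation` only covers ASCII symbols, so curly quotes or other Unicode punctuation would survive, and apostrophes are removed too ("Don't" becomes "dont"). Whether that is acceptable depends on the task.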


### Stopword Removal

Stopwords are frequently used words (such as "the", "is", and "in") that usually carry little meaning on their own. While there are exceptions, it is common NLP practice to remove them.

In [6]:
# Download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
text = "There is a red car in the gallery"
tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)


['red', 'car', 'gallery']
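
One of the exceptions worth knowing is negation: in a task such as sentiment analysis, dropping "not" can flip the meaning of a sentence. The sketch below illustrates this with a small hand-written stopword set rather than NLTK's list; `TOY_STOP_WORDS`, `NEGATIONS`, and `remove_stopwords` are hypothetical names introduced for this example.

```python
# Hypothetical mini stopword list, for illustration only.
TOY_STOP_WORDS = {"the", "was", "is", "a", "not"}

# Words we may want to protect even though they are on the list.
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, stop_words, keep=frozenset()):
    """Drop stopwords, except those explicitly listed in `keep`."""
    return [
        t for t in tokens
        if t.lower() not in stop_words or t.lower() in keep
    ]

tokens = ["The", "movie", "was", "not", "good"]

print(remove_stopwords(tokens, TOY_STOP_WORDS))
# ['movie', 'good']  (the negation is lost)

print(remove_stopwords(tokens, TOY_STOP_WORDS, keep=NEGATIONS))
# ['movie', 'not', 'good']
```

The same idea should carry over to the NLTK list by passing `stop_words` from the cell above as the second argument.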


### Stemming

Stemming basically means reducing a word to its root (stem), usually by stripping suffixes.

The main objective of stemming is to reduce the number of distinct word forms in the data, by ensuring that different forms of the same word (such as "study" and "studies") are treated the same way.
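
To make the idea concrete, here is a toy suffix-stripping sketch. It is deliberately simplistic and is not the Porter algorithm; the rule list `SUFFIX_RULES` and the function `toy_stem` are assumptions made up for this illustration.

```python
# (suffix, replacement) pairs, tried in order. Toy rules only.
SUFFIX_RULES = (("ies", "i"), ("ing", ""), ("ed", ""), ("s", ""), ("y", "i"))

def toy_stem(word):
    """Apply the first matching suffix rule; NOT the Porter algorithm."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        # Require at least 3 remaining characters so short words stay intact.
        if word.endswith(suffix) and len(stem) >= 3:
            return stem + replacement
    return word

words = ["study", "studies", "walked", "walking", "walks"]

print({w: toy_stem(w) for w in words})

# Five distinct surface forms collapse into two stems.
print(len(set(words)), "->", len({toy_stem(w) for w in words}))  # 5 -> 2
```

Note that a stem does not have to be a real word ("studi"); it only needs to be the same for related forms.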



In [8]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["study","studies"]
stems = [stemmer.stem(word) for word in words]
print(stems)


['studi', 'studi']


It seems to be working nicely. Let's try something else:

In [9]:
new_words = ["studies","geese"]
new_stems = [stemmer.stem(word) for word in new_words]
print(new_stems)

['studi', 'gees']


There is a problem. Stemming fails in certain cases, such as irregular plurals: "geese" becomes "gees" instead of "goose".

How can we improve this process?

With lemmatization.

### Lemmatization

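Lemmatization has the same goal as stemming; however, it achieves that goal more accurately. Instead of stripping suffixes by rule, it uses a dictionary (here, WordNet) to return the dictionary form (lemma) of each word.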



In [10]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [11]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
comp_words = ["studies","geese"]
lemmas = [lemmatizer.lemmatize(word) for word in comp_words]
print(lemmas)


['study', 'goose']
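
The reason a lemmatizer can handle "geese" is that it relies on a vocabulary rather than on suffix rules alone. The sketch below mimics that with a tiny hand-written lexicon; `TOY_LEXICON` and `toy_lemmatize` are hypothetical and far simpler than what WordNet-based lemmatization actually does.

```python
# Hypothetical mini lexicon mapping inflected forms to their lemmas.
TOY_LEXICON = {
    "geese": "goose",
    "studies": "study",
    "mice": "mouse",
    "went": "go",
}

def toy_lemmatize(word, lexicon=TOY_LEXICON):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return lexicon.get(word.lower(), word)

print([toy_lemmatize(w) for w in ["studies", "geese", "went", "car"]])
# ['study', 'goose', 'go', 'car']
```

With NLTK itself, `lemmatize` also accepts a `pos` argument that defaults to noun, so verbs generally need to be passed with `pos="v"` to be reduced to their base form.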


## Conclusion

Preprocessing text data is not the most entertaining thing you can do in the world.

However, if you care about the quality of your text-based analysis or want to understand the fundamentals of computational linguistics, you should pay attention to it.

In the next chapter, we will analyze the structure of sentences by delving into the delightful topic of Syntactic Analysis.

