# Natural Language Toolkit

`NLTK` is an <u>open-source library</u> that simplifies the complexities of natural language processing (NLP). It offers functionalities for tasks like tokenization, stemming, tagging, parsing, and more. NLTK is widely used in academia and industry for building applications involving text analysis, sentiment analysis, language understanding, and machine learning.

## What is Text Processing?

`Text processing` is the technique of transforming and analyzing textual data to make it easier to work with, especially for tasks like text mining, natural language processing (NLP), and machine learning. It involves cleaning, structuring, and preparing text so that it can be analyzed more effectively.

## Key Steps in Text Processing

- **Tokenization**
    - Tokenization is the process of breaking text down into smaller units called tokens. Given a sentence, the idea is to separate each word and build a vocabulary in which every distinct word is represented once. Words, numbers, and punctuation marks can all be tokens.
      - ***Word Tokenization*** is the process of splitting a sentence or text into individual words or tokens.
      - ***Sentence Tokenization*** (or segmentation) is the process of splitting text into individual sentences.
    > Note: This is roughly like calling `.split()` to turn your text into a list, except that a tokenizer also separates punctuation from words.
- **Word Casing**
  - In text processing, while `lowercasing` is commonly used for standardizing text, there are situations where `uppercasing` or `title casing` (capitalizing the first letter of each word) might be required.
- **Stop words**
  - When we present features from a text to a model, we may encounter a lot of noise. Stop-word removal drops common words like "the," "is," or "and" that do not contribute significant meaning to the analysis. With NLTK we can list all the stop words available for the English language.
- **Contractions**
  - Contractions are often expanded into their full forms to maintain uniformity and accuracy in text analysis. This step is critical since contractions like "can't" or "won't" are abbreviated forms of "cannot" and "will not," and failing to expand them may result in errors in text processing.
- **Stemming**
  - Stemming is the process of reducing a word to its root or stem by chopping off its suffixes or prefixes. The result may not always be a valid word, but it serves as a rough approximation to group similar words together.
  - It uses simple heuristic rules to remove common word endings, such as "-ing", "-ed", "-ly", "-s", etc. It doesn't take into account the meaning of the word or its context, which makes it quick and computationally cheap.
    * Types of Stemming:
        - Porter Stemmer: One of the most commonly used stemming algorithms. It follows a set of rules to iteratively trim words down to their base form.
        - Lancaster Stemmer: A more aggressive stemming algorithm that removes larger chunks from words but may lead to over-stemming (too much truncation).
        - Snowball Stemmer: An improvement over Porter stemming that is more flexible in handling word forms. Its English variant is also known as Porter2 and is exposed in NLTK as `EnglishStemmer`.

| Word | Porter | Lancaster |
| ---- | ------ | --------- |
| "running" | "run" | "run" |
| "runner" | "runner" | "run" |
| "happily" | "happili" | "happy" |
| "studies" | "studi" | "study" |
| "studying" | "studi" | "study" |


- **Lemmatization**
  - Lemmatization is the process of reducing a word to its lemma, or dictionary form, after taking its meaning and part of speech into account. Unlike stemming, lemmatization produces a valid word in the language. It uses a more complex process, often relying on a vocabulary or lexical database (like WordNet) to look up words and return their canonical form based on whether they are used as nouns, verbs, or adjectives.
    * Types of Lemmatization:
      - Rule-based Lemmatization: Relies on a set of rules to transform inflected words into their root form based on their part of speech.
      - Dictionary-based Lemmatization: Uses a lookup table (like WordNet) to find the base form of words.

| Word | Lemma (Verb) | Lemma (Noun) | Lemma (Adjective) |
| ---- | ------------ | ------------ | ----------------- |
| "running" | "run" | "running" | "running" |
| "better" | "better" | "better" | "good" |
| "geese" | "geese" | "goose" | "geese" |
| "studies" | "study" | "study" | "studies" |
| "studying" | "study" | "studying" | "studying" |

> Note: Lemmatization returns meaningful words by taking the part of speech into account. For example, "better" as an adjective returns "good," while as a verb it remains "better." When a word has no entry for the requested part of speech, NLTK's WordNet lemmatizer returns it unchanged.
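The contrast between stemming and lemmatization can be sketched with a toy example. The suffix list and the lookup table below are hypothetical stand-ins, far smaller than the rule sets of the Porter, Lancaster, or Snowball stemmers or a lexical database like WordNet. The sketch only illustrates why rule-based stripping can produce non-words while a part-of-speech lookup returns dictionary forms.

```python
# Toy illustration only: real stemmers use much richer rule sets.
SUFFIXES = ("ing", "ed", "ly", "s")  # hypothetical, deliberately tiny


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a lexical database such as WordNet.
LEMMA_TABLE = {
    ("geese", "n"): "goose",
    ("better", "a"): "good",
    ("studies", "n"): "study",
    ("running", "v"): "run",
}


def toy_lemmatize(word, pos):
    """Look up (word, part of speech); unknown pairs are returned unchanged."""
    return LEMMA_TABLE.get((word, pos), word)


for w in ["running", "studies", "happily"]:
    print(w, "->", toy_stem(w))        # runn, studie, happi

print(toy_lemmatize("better", "a"))    # good
print(toy_lemmatize("better", "v"))    # better
```

The stems `runn`, `studie`, and `happi` are not words, which is the trade-off stemming accepts for speed.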

### Additional Info
#### ***Regex*** or Regular Expression

***Regular expressions*** (regex or regexp) are extremely useful for extracting information from text by searching for one or more matches of a specific search pattern (i.e. a specific sequence of ASCII or Unicode characters).

***[An in-depth introduction to RegEx](https://medium.com/@victoriousjvictor/understanding-regular-expressions-regex-e1c048f5aa6c)*** <br>
***[Additional information](https://archive.is/63sjK#selection-613.0-617.125)***: a walk-through of common patterns with further examples.
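As a small illustration of searching text for a pattern, the sketch below uses Python's built-in `re` module on one sentence from the sample corpus used later in this notebook. The two patterns are examples chosen for this sentence, not general-purpose recipes.

```python
import re

text = "Hey! That's a grete deal! I just bought a /phone at $144 "

# \$ matches a literal dollar sign; \d+ matches one or more digits.
prices = re.findall(r"\$\d+", text)

# re.search returns the first match (or None); group(1) is the captured word.
match = re.search(r"bought a /?(\w+)", text)
item = match.group(1) if match else None

print(prices)  # ['$144']
print(item)    # phone
```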

### Import necessary Packages

In [14]:
import nltk 
import re # for Regex
import string

import warnings
warnings.filterwarnings('ignore')

In [15]:
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /home/iragca/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /home/iragca/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/iragca/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
myCorpus = [" The quick brown fox wasn't that quick and he couldn't win the race",
         "Hey! That's a grete deal! I just bought a /phone at $144 ",
         "@@You'll (learn) a **lot** in the book. Python fox is an amaewzing langage !@@"]

In [17]:
len(myCorpus)

3

### Remove Punctuation

In [18]:
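def rem_punct(my_str):
    """Return my_str without punctuation, symbols, or digits."""
    if my_str is None:
        return None

    # Raw string so the backslash is taken literally (no invalid "\," escape).
    # Digits are listed too, so numbers such as "144" are removed as well.
    punctuations = r'''!¡()-[]{};:'.'"“”`\,.<>/|?@#$%^—-=&_+0123456789•~*…'''

    # Keep only the characters that are not in the removal list.
    return "".join(char for char in my_str if char not in punctuations)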

In [19]:
for y in myCorpus:
    rempunct = rem_punct(y)
    print(rempunct)

 The quick brown fox wasnt that quick and he couldnt win the race
Hey Thats a grete deal I just bought a phone at  
Youll learn a lot in the book Python fox is an amaewzing langage 
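The notebook imports `string` but lists the characters to remove by hand. As an alternative sketch, the same idea can be written with `str.translate` and a table built from `string.punctuation` and `string.digits`. The helper name `strip_ascii_punct` is made up for this example, and it only covers ASCII symbols, so characters such as `“`, `”`, or `…` from the hand-written list would have to be appended to get identical behaviour.

```python
import string

# Translation table that deletes every ASCII punctuation mark and digit.
_DELETE_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def strip_ascii_punct(text):
    """Return text with ASCII punctuation and digits removed (None passes through)."""
    if text is None:
        return None
    return text.translate(_DELETE_TABLE)


print(strip_ascii_punct("Hey! That's a grete deal! I just bought a /phone at $144 "))
```

`str.translate` does the filtering in a single pass, which tends to matter only once the texts get long.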


### Tokenization

In [20]:
def sent_tokenize(text):
    # Split text into a list of sentences.
    sentences = nltk.sent_tokenize(text)
    return sentences

def word_tokenize(text):
    # Split text into a list of word and punctuation tokens.
    word_tokens = nltk.word_tokenize(text)
    return word_tokens

In [21]:
new_list = [rem_punct(sentence) for sentence in myCorpus]
for word in new_list:
    token = word_tokenize(word)
    print(token)

['The', 'quick', 'brown', 'fox', 'wasnt', 'that', 'quick', 'and', 'he', 'couldnt', 'win', 'the', 'race']
['Hey', 'Thats', 'a', 'grete', 'deal', 'I', 'just', 'bought', 'a', 'phone', 'at']
['Youll', 'learn', 'a', 'lot', 'in', 'the', 'book', 'Python', 'fox', 'is', 'an', 'amaewzing', 'langage']


In [22]:
sentence_corpus = 'Hello World! I found a new hobby to do. How is life?'

token_sent = sent_tokenize(sentence_corpus)
for sentence in token_sent:
    print(sentence)

Hello World!
I found a new hobby to do.
How is life?
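To see what a tokenizer adds over a plain `.split()`, the sketch below compares whitespace splitting with two simple regular expressions. The patterns are rough stand-ins written for this example; they are not how NLTK's tokenizers work internally.

```python
import re

text = 'Hello World! I found a new hobby to do. How is life?'

# Plain whitespace splitting leaves punctuation glued to the words.
whitespace_tokens = text.split()

# A run of word characters, or any single character that is neither word nor space.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

# Split on whitespace that follows a sentence-ending mark.
rough_sentences = re.split(r"(?<=[.!?])\s+", text)

print(whitespace_tokens[:2])  # ['Hello', 'World!']
print(regex_tokens[:3])       # ['Hello', 'World', '!']
print(rough_sentences)
```

The sentence pattern would wrongly split after abbreviations such as "Dr.", which is one reason to prefer a trained sentence tokenizer.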


### Removing Repeating Letters

In [23]:
from nltk.corpus import wordnet

def removeRepeatedCharacters(tokens):
    repeatPattern = re.compile(r'(\w*)(\w)\2(\w*)')
    
    def replace(word):
        while not wordnet.synsets(word):
            newWord = repeatPattern.sub(r'\1\2\3', word)
            if newWord == word:
                break
            word = newWord
        return word

    return [replace(word) for word in tokens]

> ***Wild Regex pattern appeared!*** <br>
> `r'(\w*)(\w)\2(\w*)'`: This pattern looks for repeated characters. `\2` is a backreference to the second group, so `(\w)\2` only matches a character that is immediately followed by a copy of itself.

> `r'\1\2\3'`: This replacement drops one copy of the repeated character per pass. The loop keeps applying it until WordNet recognizes the word or there is nothing left to remove.
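The role of the dictionary check is easier to see without WordNet. The sketch below reuses the same pattern but swaps `wordnet.synsets` for a plain set of known words; `collapse_repeats` and its `vocabulary` argument are hypothetical names used only for this illustration.

```python
import re

repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')


def collapse_repeats(word, vocabulary=frozenset()):
    """Drop one repeated character per pass until the word is known or stable."""
    while word.lower() not in vocabulary:
        shorter = repeat_pattern.sub(r'\1\2\3', word)
        if shorter == word:  # no doubled character left
            break
        word = shorter
    return word


print(collapse_repeats("Schoooool"))              # Schol  (over-corrected)
print(collapse_repeats("Schoooool", {"school"}))  # School (stops at a known word)
```

Without a vocabulary the legitimate double "o" is removed as well, which is why the loop stops as soon as the word is recognized.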

In [24]:
sentence = ' My Schoooooool is reeeeeeaaaallllllllly amaaaaaazingggg!'

sampleSentence = word_tokenize(sentence)
print("Original Sentence ", sampleSentence)
print("Corrected Sentence ", removeRepeatedCharacters(sampleSentence))

Original Sentence  ['My', 'Schoooooool', 'is', 'reeeeeeaaaallllllllly', 'amaaaaaazingggg', '!']
Corrected Sentence  ['My', 'School', 'is', 'realy', 'amazing', '!']


### Word Contraction

In [25]:
import contractions

for word in new_list:
    contra = contractions.fix(word)
    token = word_tokenize(contra)
    print(token)

['The', 'quick', 'brown', 'fox', 'was', 'not', 'that', 'quick', 'and', 'he', 'could', 'not', 'win', 'the', 'race']
['Hey', 'That', 'Is', 'a', 'grete', 'deal', 'I', 'just', 'bought', 'a', 'phone', 'at']
['You', 'Will', 'learn', 'a', 'lot', 'in', 'the', 'book', 'Python', 'fox', 'is', 'an', 'amaewzing', 'langage']
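The `contractions` package does the heavy lifting above. The idea behind contraction expansion can be sketched as a lookup table plus a regular expression. The table below is a hypothetical, deliberately tiny mapping written for this example, not the package's actual data, and unlike the package it needs the apostrophes to still be present.

```python
import re

# Hypothetical mini-table; a real list would be much longer.
CONTRACTIONS = {
    "wasn't": "was not",
    "couldn't": "could not",
    "can't": "cannot",
    "won't": "will not",
}

# One alternation of all known contractions, matched on word boundaries.
_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its full form (lowercased)."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("The quick brown fox wasn't that quick and he couldn't win the race"))
```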


### Remove Stop words

In [1]:
from nltk.corpus import stopwords

def remove_stopwords_orig(text):
    # `text` is a list of tokens; returns the kept tokens joined by spaces.
    stopWords = set(stopwords.words('english'))

    filtered_words = []
    for w in text:
        if w not in stopWords:
            filtered_words.append(w)
            filtered_words.append(' ')  # leaves one trailing space in the result

    return "".join(filtered_words)

In [2]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [27]:
for word in new_list:
    word = word.lower()
    contra = contractions.fix(word)
    token = word_tokenize(contra)
    no_stopwords = remove_stopwords_orig(token)
    print(no_stopwords)

quick brown fox quick could win race 
hey grete deal bought phone 
learn lot book python fox amaewzing langage 


> *Note*: <br>The stopwords list provided by nltk (or similar libraries) contains only lowercase versions of words. This is typical because stopwords are usually considered case-insensitive. If your input text has capitalized words (like "Is" or "I"), they will not match with the lowercase stopwords unless you normalize the case by converting all words to lowercase.
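The effect of case on stop-word matching can be shown with a tiny, hand-written stop-word set. The set below is a hypothetical subset used only for this example; the real NLTK list is the one printed above.

```python
# Hypothetical mini stop-word set, all lowercase like the NLTK list.
STOP_WORDS = {"the", "is", "and", "i", "a"}

tokens = ["The", "fox", "is", "quick", "and", "I", "agree"]

# Case-sensitive comparison: "The" and "I" slip through.
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Lowercasing each token before the comparison catches them.
case_normalized = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)   # ['The', 'fox', 'quick', 'I', 'agree']
print(case_normalized)  # ['fox', 'quick', 'agree']
```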

### Spelling

In [30]:
from autocorrect import Speller

def check_spelling(text):
    spellcheck = Speller(lang='en')
    return spellcheck(text)

In [31]:
for word in new_list:
    word = word.lower()
    contra = contractions.fix(word)
    token = word_tokenize(contra)
    no_stopwords = remove_stopwords_orig(token)
    corr_spell = check_spelling(no_stopwords)
    print(corr_spell)

quick brown fox quick could win race 
hey greet deal bought phone 
learn lot book python fox amazing language 


### Stemming

In [32]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

port_stem = PorterStemmer()
lancas_stem = LancasterStemmer()
snowb_stem = SnowballStemmer(language='english')

In [33]:
display(port_stem.stem('studying'),
       lancas_stem.stem('studying'),
       snowb_stem.stem('studying'))
print('----'*20)
display(port_stem.stem('running'),
       lancas_stem.stem('running'),
       snowb_stem.stem('running'))
print('----'*20)
display(port_stem.stem('costly'),
       lancas_stem.stem('costly'),
       snowb_stem.stem('costly'))

'studi'

'study'

'studi'

--------------------------------------------------------------------------------


'run'

'run'

'run'

--------------------------------------------------------------------------------


'costli'

'cost'

'cost'

### Lemmatization

In [34]:
from nltk.stem import WordNetLemmatizer

wlm = WordNetLemmatizer()

In [35]:
display(wlm.lemmatize('understanding', 'v'),
        wlm.lemmatize('running', 'v'),
        wlm.lemmatize('happiest', 'a'),
        wlm.lemmatize('children', 'n'))

'understand'

'run'

'happy'

'child'

---

In [36]:
corr_list = []
for word in new_list:
    word = word.lower()
    contra = contractions.fix(word)
    token = word_tokenize(contra)
    no_stopwords = remove_stopwords_orig(token)
    corr_spell = check_spelling(no_stopwords)
    corr_list.append(corr_spell)
    
for sentence in corr_list:
    token = word_tokenize(sentence)
    for word in token:
        word = wlm.lemmatize(word, 'v')
        print(word)

quick
brown
fox
quick
could
win
race
hey
greet
deal
buy
phone
learn
lot
book
python
fox
amaze
language


As you can see, removing punctuation, expanding contractions, and spell checking were all done here without NLTK itself: they rely on a custom function and on the `contractions` and `autocorrect` packages. These tasks help de-clutter our textual data, making it cleaner for more accurate analysis. These functions are part of what we call ***Data Cleaning***!

# Data Cleaning

## Data Preprocessing and Data Normalization in Text Data

<font size=4px><b><i>Data Preprocessing</i></b></font>
- for text involves a series of steps to clean, format, and transform raw text into a structured and analyzable form. It is the first stage before applying any machine learning or natural language processing (NLP) models.
> <b>Goal</b>: To reduce noise and prepare the text for further analysis by ensuring it's structured, clean, and ready for modeling.



<font size=4px><b><i>Data Normalization</i></b></font>
- refers to techniques used to standardize and format the text consistently. This ensures that variations in word forms, cases, and other features don't confuse text analysis models.
> <b>Goal</b>: To standardize and simplify the text by reducing variations and inconsistencies so that machine learning models can interpret it effectively.



In the context of Text Data, `Data Preprocessing` and `Data Normalization` are not *separate* stages, but rather integrated components of the same workflow that help us prepare unstructured text for deeper analysis.

Consider Data Preprocessing to be a comprehensive procedure that includes tasks such as cleaning, organizing, and transforming raw text. Tokenization, stopword removal, and handling noisy data like punctuation and special characters are all tasks that must be addressed during preprocessing. These are the core steps that ensure our text is manageable for analysis.

Within this larger process, Data Normalization serves as a more targeted stage. Normalization ensures text consistency by standardizing diverse representations of the same information.

Normalization, whether by converting everything to lowercase, expanding contractions, or consistently formatting numbers and dates, helps us reduce variability and ensures that our analysis is not thrown off by small, inconsequential differences in the data.

For example, in a preprocessing step, we might tokenize a sentence and remove stopwords. Normalization then ensures that variations in word form (for example, 'USA' vs. 'usa') do not lead to inaccurate insights.

Thus, Data Normalization is an important subset of Data Preprocessing, especially when dealing with textual data, and they work together to lay the groundwork for any text-based data science tasks. Without proper preprocessing and normalization, our models may struggle to derive useful insights from the data.
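The 'USA' vs. 'usa' point can be made concrete by counting values before and after a small normalization step. The `normalize` helper and the sample values are made up for this illustration; a real pipeline would decide which variations to merge based on the data.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


raw_values = ["USA", "usa", " Usa ", "United  States"]

before = Counter(raw_values)
after = Counter(normalize(value) for value in raw_values)

print(len(before))  # 4 distinct spellings
print(after)        # 'usa' counted three times, 'united states' once
```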



---

### Import Packages

In [40]:
import pandas as pd
import nltk

from sqlalchemy import create_engine
import json
import pdfplumber
import requests
from bs4 import BeautifulSoup
import unicodedata

import os
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

### Json Movie Dataset

In [41]:
path = './Datasets/movies/'
movies = os.listdir(path)
movie_list = list()

for years in tqdm(movies):
    for movie in os.listdir(os.path.join(path, years)):
        with open(os.path.join(path, years, movie), encoding="ISO-8859-1") as movie_jsn:
            movie_list.append(json.load(movie_jsn))

FileNotFoundError: [Errno 2] No such file or directory: './Datasets/movies/'

In [None]:
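movie_df = pd.DataFrame(movie_list)
# Treat empty or whitespace-only strings as missing values.
# Note: pandas 2.1+ deprecates DataFrame.applymap in favor of DataFrame.map.
movie_df = movie_df.applymap(lambda x: None if isinstance(x, str) and x.strip() == '' else x)
# movie_df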

## Data Normalization (Sample For DataFrames)

### Data Types

In [None]:
movie_df.info()

In [None]:
# ----- for integers and float ----- #
for cols in ['year', 'runtime']:
    movie_df[cols] = pd.to_numeric(movie_df[cols], errors='coerce')
# or you can use df[column].astype('int64') or df[column].astype('float64')

# ----- for strings ----- #
# Note: astype(str) also turns missing values into literal strings such as 'nan' or 'None'.
text_cols = ['name', 'categories', 'director', 'writer', 'actors', 'storyline']
movie_df[text_cols] = movie_df[text_cols].astype(str)

### Word Casing

In [None]:
movie_df['name'] = movie_df['name'].str.upper()
movie_df
# or you can do: movie_df['name'] = movie_df['name'].str.lower()
# or: movie_df['name'] = movie_df['name'].str.title()

### Date Formats

In [None]:
# -------- Normalize the data's date to YYYY-MM-DD -------- #
movie_df['release-date'] = pd.to_datetime(movie_df['release-date'], format='%Y-%m-%d', errors='coerce')
movie_df.info()  # info() prints its summary and returns None, so call it on its own
display(movie_df)
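With `format='%Y-%m-%d'` and `errors='coerce'`, any value that does not match that format becomes a missing date (`NaT`). The same parse-or-mark-missing idea can be sketched with the standard library, extended to accept more than one input format. The list of formats and the name `to_iso_date` are assumptions made for this example, not something the dataset is known to need.

```python
from datetime import datetime

# Hypothetical list of formats we are willing to accept, tried in order.
ACCEPTED_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def to_iso_date(value):
    """Return value as 'YYYY-MM-DD', or None if no accepted format fits."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except (ValueError, TypeError):  # wrong format, or not a string at all
            continue
    return None


for raw in ["2020-01-31", "31/01/2020", "2020/01/31", "sometime in 2020", None]:
    print(repr(raw), "->", to_iso_date(raw))
```

Returning `None` for unparseable values mirrors `errors='coerce'`: the bad entries stay visible as missing values rather than raising an error.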

## Data Preprocessing (Sample For DataFrames)

### Finding Null Values
<font size='2px'>At this point, we are only focusing on the <u>name</u> column up to the <u>storyline</u> column</font>

In [None]:
movie_df.isnull().sum()

In [None]:
movie_df[movie_df['runtime'].isna()]

In [None]:
movie_df[movie_df['release-date'].isna()]

In [None]:
# For this case, there is a value 'nan'
movie_df[movie_df['categories'] == 'nan']

#### Optional: Dropping Columns

In [None]:
sample = movie_df[['description', 'future', 'plot', 'released', 'genre', 'category', 'directors']]
for x in sample.columns:
    print(f'Column : {x}')
    display(movie_df[movie_df[x].notnull()])

In [None]:
movie_df.drop(columns=['description', 'future', 'plot', 'released', 'genre', 'category', 'directors'], inplace=True)

### Remove Punctuations, Numbers, and Symbols

In [None]:
def rem_punct(my_str):
    if my_str is None:
        return None
    
    punctuations = r'''!¡()-[]{};:'.'"“”`\,.<>/|?@#$%^—-=&_+0123456789•~*…'''  # raw string: the backslash is literal

    no_punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct += char
    return no_punct

In [None]:
movie_df['categories'] = movie_df['categories'].apply(rem_punct)
movie_df['categories'] = movie_df['categories'].apply(lambda row: ", ".join(row.split(' ')) if isinstance(row, str) else '')

In [None]:
movie_df

### Special Characters

In [None]:
def latin_char(name):
    if isinstance(name, str):
        normalized_text = unicodedata.normalize('NFKD', name)
        encoded_text = normalized_text.encode('ASCII', 'ignore')
        cleaned_name = encoded_text.decode('utf-8')
        return cleaned_name
    else:
        return name

In [None]:
movie_df['name'] = movie_df.name.apply(latin_char)
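To see what this kind of normalization does to individual strings, here is a standalone sketch of the same technique. NFKD splits an accented letter into a base letter plus a combining mark, and the ASCII encoding step then drops the mark. The function name `to_ascii` and the sample strings are made up for this example.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Amélie"))   # Amelie
print(to_ascii("Señor"))    # Senor
print(to_ascii("Øystein"))  # ystein
```

Letters such as "Ø" have no decomposition into a base letter, so they are dropped entirely. It is worth checking a few affected titles before relying on the cleaned names.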

In [None]:
movie_df.iloc[426]

### Check for Duplicates

In [None]:
dup_movies = movie_df[movie_df.duplicated(subset=['name'], keep=False)].sort_values('name')
dup_movies

In [None]:
dup_movies_year = movie_df[movie_df.duplicated(subset=['name', 'year'], keep=False)]#.sort_values('name')
dup_movies_year

`dup_movies_year` shows some discrepancies: entries that share the same movie title and release year but have different release dates.

#### Drop Duplicates

In [None]:
movie_df_new = movie_df.drop_duplicates(subset=['name', 'year'], keep='last')
movie_df_new

In [None]:
# ------- or you can use the parameter 'inplace' in drop_duplicates() ------- #
# ---------------- to overwrite directly the main dataframe ----------------- #

movie_df.drop_duplicates(subset=['name', 'year'], keep='last', inplace=True)
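What `keep='last'` means for a `subset` of columns can be illustrated without pandas: rows are grouped by the subset values, and only the last row seen for each group survives. The records below are invented placeholders, and the plain-dictionary approach is just a sketch of the idea, not how pandas implements it.

```python
# Invented placeholder rows; two of them share the same (name, year) pair.
records = [
    {"name": "EXAMPLE MOVIE", "year": 2001, "release-date": "2001-03-02"},
    {"name": "EXAMPLE MOVIE", "year": 2001, "release-date": "2001-06-15"},
    {"name": "EXAMPLE MOVIE", "year": 1984, "release-date": "1984-11-09"},
]

# Later rows overwrite earlier ones that have the same key.
last_seen = {}
for row in records:
    last_seen[(row["name"], row["year"])] = row

deduped = list(last_seen.values())

print(len(deduped))                # 2
print(deduped[0]["release-date"])  # 2001-06-15
```

Which duplicate is the correct one is a data question: `keep='last'` simply trusts whichever row happens to come later.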

In [None]:
movie_df

## Another Example!
<font size='3px'><b>Let's test this on a pdf dataset</b></font>

### PDF Dataset

In [None]:
with pdfplumber.open("./Datasets/sampledata_PDF.pdf") as pdf:
    tb = []
    for i in tqdm(range(len(pdf.pages))):
        page = ' '.join(pdf.pages[i].extract_text().split('\n')[1:-1])
        tb.append(page)

In [None]:
pdf_ = [text for text in tb if text]
pdf_ = ' '.join(pdf_)  # merge the non-empty pages into one string
pdf_

### Remove Punctuation

In [None]:
def rem_punct(my_str):
    
    if my_str is None:
        return None
    
    punctuations = r'''!¡()-[]{};:''"“”`\,<>/|?@#$%^—-=&_+0123456789•~*…'''  # raw string; periods are kept for sentence splitting

    no_punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct += char
    return no_punct

### Tokenization

In [None]:
def sent_tokenize(text):
    # Split text into a list of sentences.
    sentences = nltk.sent_tokenize(text)
    return sentences

def word_tokenize(text):
    # Split text into a list of word and punctuation tokens.
    word_tokens = nltk.word_tokenize(text)
    return word_tokens

In [None]:
pdf_ = rem_punct(pdf_)
pdf_sent = sent_tokenize(pdf_)[0:3]
pdf_sent

### Word Contraction

In [None]:
import contractions

pdf_contra=[]
for sent in pdf_sent:
    rempunct = rem_punct(sent)
    contra = contractions.fix(rempunct)
    token = word_tokenize(contra)[:-1]
    pdf_contra.append(' '.join(token))
    
pdf_contra

### Remove Stop words

In [None]:
from nltk.corpus import stopwords

def remove_stopwords_orig(text):
    # `text` is a list of tokens; returns the kept tokens joined by spaces.
    stopWords = set(stopwords.words('english'))

    filtered_words = []
    for w in text:
        if w not in stopWords:
            filtered_words.append(w)
            filtered_words.append(' ')  # leaves one trailing space in the result

    return "".join(filtered_words)

In [None]:
pdf_stop = []

for word in pdf_contra:
    word = word.lower()
    token = word_tokenize(word)
    no_stopwords = remove_stopwords_orig(token)
    pdf_stop.append(no_stopwords.strip())

pdf_stop

### Spelling

In [None]:
from autocorrect import Speller

def check_spelling(text):
    spellcheck = Speller(lang='en')
    return spellcheck(text)

In [None]:
pdf_spell = []

for word in pdf_stop:
    # token = word_tokenize(word)
    corr_spell = check_spelling(word)
    pdf_spell.append(corr_spell)
pdf_spell

### Stemming and Lemmatization

In [None]:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

wlm = WordNetLemmatizer()
snowb_stem = SnowballStemmer(language='english')

In [None]:
pdf_final = []

for sentence in pdf_spell:
    sentence = word_tokenize(sentence)
    pdf_final.append(sentence)

pdf_final_stem = [snowb_stem.stem(word) for sentence in pdf_final for word in sentence]
pdf_final_stem

In [None]:
pdf_final_lemm = [wlm.lemmatize(word, 'v') for sentence in pdf_final for word in sentence]
pdf_final_lemm