<a href="https://colab.research.google.com/github/kmrakmr/workshop_text_classification/blob/main/notebooks/01_getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with NLP

In this notebook, we will get familiar with the world of NLP.
Key takeaways from this notebook are:

- Learn how to load a textual dataset
- Understand the dataset using basic EDA
- Learn how to perform basic preprocessing/cleanup to prepare the dataset

![nlp_workflow.png](https://github.com/raghavbali/workshop_text_classification/blob/main/assets/nlp_workflow.png?raw=1)

## Key NLP Libraries

If you have been working in the Data Science/ML domain, you probably have a set of _go-to_ libraries and tools to do your magic. For instance, libraries like ``sklearn``, ``xgboost``, etc. are must-haves.

Similarly, the NLP domain has its set of favorites. The following are some of the popular ones:
- ``nltk``: a leading platform for building NLP applications. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing utilities.
- ``gensim``: a library for unsupervised topic modeling, document indexing, retrieval by similarity, and other NLP functionality, using modern statistical machine learning.
- ``spacy``: a library that provides "Industrial-Strength NLP" capabilities that scale and are blazingly fast.
- ``fasttext``: a library for learning word embeddings and text classification, created by Facebook's AI Research lab.
- ``huggingface`` 🤗: a community and data science platform that provides tools for building, training, and deploying ML models based on open-source code and technologies.

## Let's Read Some Shakespeare

__Project Gutenberg__ is an amazing project aimed at providing free access to some of the world's greatest classical works. This makes it a wonderful source of textual data for NLP practitioners to use and improve their understanding of textual data. Of course, you can improve your literary skills too 😃

``NLTK`` provides us with a nice interface for the _Gutenberg_ project. Apart from some key utilities, this nice and clean interface enables us to access a number of large textual datasets to play with. For this workshop, we will focus on Shakespeare's __Hamlet__.

In [None]:
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import gutenberg
import seaborn as sns
import re

%matplotlib inline
pd.options.display.max_columns=10000

In [None]:
# First things first, download the Gutenberg Project files
nltk.download('gutenberg')

In [None]:
# get the text for hamlet
hamlet_raw = gutenberg.open('shakespeare-hamlet.txt')
hamlet_raw = hamlet_raw.readlines()

In [None]:
# Let us print some text
print(hamlet_raw[:10])

## Quick Exploratory Analysis

Just like any other data science problem, the first step is to understand the dataset itself. NLP is no different.

In [None]:
# View a Few raw lines of text

# Add your code here

In [None]:
# Total Number of lines of text in Hamlet
print("Total lines in the book/corpus={}".format(len(hamlet_raw)))

In [None]:
# Total Number of lines of text excluding blank lines
hamlet_no_blanks = list(filter(None, [item.strip('\n') 
                               for item in hamlet_raw]))
hamlet_no_blanks[:5]
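
One detail worth checking when dropping blank lines: stripping only `'\n'` leaves whitespace-only lines in place, because a string of spaces is still truthy. The sketch below uses a small hand-typed sample (`sample_raw` is a hypothetical stand-in for the list returned by `readlines()`, not the actual corpus) to compare the two approaches.

```python
# Hypothetical stand-in for a few raw lines, as readlines() would return them.
sample_raw = [
    "Actus Primus. Scoena Prima.\n",
    "\n",
    "Enter Barnardo and Francisco two Centinels.\n",
    "   \n",
    "  Barnardo. Who's there?\n",
]

# Stripping only '\n' keeps the whitespace-only line, because "   " is truthy.
newline_only = [line.strip("\n") for line in sample_raw if line.strip("\n")]

# Stripping all surrounding whitespace drops that line as well.
fully_stripped = [line.strip() for line in sample_raw if line.strip()]

print(len(sample_raw), len(newline_only), len(fully_stripped))  # 5 4 3
```

Whether whitespace-only lines should count as blank depends on the analysis; for line counts and length plots, dropping them is usually the safer choice.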

In [None]:
# Total Number of non-blank lines of text in Hamlet

# Add your code here

### How Long are the sentences?

In [None]:
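# length (in characters) of every non-blank line
line_lengths = [len(sentence) for sentence in hamlet_no_blanks]
# `fill` replaces the deprecated `shade` argument in recent seaborn releases
p = sns.kdeplot(line_lengths, fill=True, color='yellow')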
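
A density plot gives the shape of the distribution; a few summary numbers can complement it. The following is a minimal sketch using only the standard library on a hand-typed sample (`sample_lines` is a hypothetical stand-in, not the full corpus). Note that `len()` counts characters, and that each element is a line of the play, which is not necessarily a complete sentence.

```python
import statistics

# Hypothetical stand-in for a few non-blank lines of the play.
sample_lines = [
    "Actus Primus. Scoena Prima.",
    "Enter Barnardo and Francisco two Centinels.",
    "  Barnardo. Who's there?",
]

# Character length of each line, including leading indentation.
sample_lengths = [len(line) for line in sample_lines]

summary = {
    "min": min(sample_lengths),
    "max": max(sample_lengths),
    "mean": statistics.mean(sample_lengths),
    "median": statistics.median(sample_lengths),
}
print(summary)
```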

## Tokenization

Splitting sentences into usable terms/words is an important aspect of preprocessing textual data. Tokenization is thus the process of identifying the right word boundaries.

In [None]:
# simple tokenizer
# splitting each sentence to get words
tokens = [item.split() for item in hamlet_no_blanks]
print(tokens[:5])
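
Whitespace splitting is the simplest possible tokenizer, and it leaves punctuation glued to the neighbouring word. As a rough comparison, the sketch below contrasts `str.split()` with a small regular expression that keeps runs of letters (optionally with an internal apostrophe). The regex is only an illustrative assumption, not the tokenizer used elsewhere in this notebook.

```python
import re

sample_line = "Barnardo. Who's there?"

# Whitespace splitting keeps punctuation attached to the neighbouring word.
whitespace_tokens = sample_line.split()

# A simple regex alternative: runs of letters, optionally with an internal apostrophe.
regex_tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sample_line)

print(whitespace_tokens)  # ['Barnardo.', "Who's", 'there?']
print(regex_tokens)       # ['Barnardo', "Who's", 'there']
```

Where the "right" word boundary lies (for example, whether `Who's` is one token or two) depends on the downstream task.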

In [None]:
# Let us visualize the distribution of tokens per sentence

## Add your code here

## A bit more clean-up
There can be a number of clean-up steps depending upon the kind of dataset and the problem we are solving. 

In this case, let us strip special characters (anything other than letters) from each term and drop the terms that end up empty.

In [None]:
# flatten the per-line token lists into a single list of words,
# strip non-letter characters and drop terms that become empty
words = list(filter(None, [re.sub(r'[^A-Za-z]', '', word)
                           for line in tokens for word in line]))
print(words[:20])

## Can you identify Top Occurring words?

In [None]:
# Add your code here
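
One possible way to approach this is `collections.Counter` from the standard library. The sketch below counts a tiny hand-typed word list (`sample_words` is a hypothetical stand-in, not the notebook's `words` list) so that the mechanics are visible before applying them to the full corpus.

```python
from collections import Counter

# Hypothetical stand-in for a cleaned list of words.
sample_words = ["to", "be", "or", "not", "to", "be", "that", "is", "the", "question"]

# Counter maps each distinct word to the number of times it occurs.
word_counts = Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs;
# ties keep the order in which the words were first seen.
top_two = word_counts.most_common(2)
print(top_two)  # [('to', 2), ('be', 2)]
```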

### Stopword Removal
As you can see from the above output, the top occurring terms are not of much use for understanding the context. In the NLP space, such high-frequency, low-information terms (articles, prepositions, pronouns, etc.) are called stopwords and are typically removed to reduce dimensionality and noise.

Thankfully, ``nltk`` provides a clean utility along with an extensible list of stopwords that we can use straight away.

In [None]:
import nltk

# download the stopword lists (only needed once per environment)
nltk.download('stopwords')

# print a few stop words
stopwords = nltk.corpus.stopwords.words('english')
stopwords[:10]

In [None]:
# Remove stopwords
# the stopword list is lowercase, so compare the lower-cased word
words = [word.lower() for word in words if word.lower() not in stopwords]
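
The order of lower-casing and filtering matters, because NLTK's English stopword list is all lowercase. The sketch below illustrates this with a hypothetical mini stopword set (`sample_stopwords`, much smaller than the real list) and a hand-typed word list.

```python
# Hypothetical mini stopword list; NLTK's English list is lowercase too.
sample_stopwords = {"the", "is", "to", "or", "not", "that"}
sample_words = ["To", "be", "or", "not", "to", "be", "That", "is", "the", "Question"]

# Testing membership before lower-casing lets capitalised stopwords slip through.
filter_then_lower = [w.lower() for w in sample_words if w not in sample_stopwords]

# Lower-casing inside the membership test catches them.
lower_then_filter = [w.lower() for w in sample_words if w.lower() not in sample_stopwords]

print(filter_then_lower)  # ['to', 'be', 'be', 'that', 'question']
print(lower_then_filter)  # ['be', 'be', 'question']
```

Converting the stopword list to a `set` also makes each membership test constant-time, which helps on larger corpora.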

In [None]:
# Top Words by occurrence after stopword removal

# Add your code here

## Text Preprocessing

So far we have covered some basics of preprocessing. Steps such as:
- Lower-casing
- Special character removal
- Stopword removal
- Removing blank lines and empty spaces

are performed time and again. There are a number of other steps as well, but those are mostly application dependent.

In [None]:
# A utility function to perform basic cleanup
def normalize_document(doc):
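    # remove special characters, keeping only letters and whitespace
    # (flags must be passed by keyword; the 4th positional argument is count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    # lower case and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document (requires NLTK's Punkt tokenizer data to be downloaded)
    tokens = nltk.word_tokenize(doc)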
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stopwords]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
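
To see what such a cleanup does to a single line, here is a minimal standard-library sketch of the same steps. It is an illustration under stated assumptions rather than a drop-in replacement: `normalize_line` and `SAMPLE_STOPWORDS` are hypothetical names, the stopword set is far smaller than NLTK's, and `str.split()` stands in for `nltk.word_tokenize` so that the example runs without downloading any tokenizer data.

```python
import re

# Hypothetical mini stopword list standing in for NLTK's English stopwords.
SAMPLE_STOPWORDS = {"the", "is", "to", "or", "not", "that"}

def normalize_line(line, stopwords=SAMPLE_STOPWORDS):
    """Strip non-letters, lower-case, split on whitespace, drop stopwords."""
    # keep only letters and whitespace
    line = re.sub(r"[^a-zA-Z\s]", "", line)
    # lower-case and split on any run of whitespace
    tokens = line.lower().split()
    # re-join the tokens that are not stopwords
    return " ".join(token for token in tokens if token not in stopwords)

print(normalize_line("To be, or not to be, that is the Question:"))  # be be question
```

With NLTK's full stopword list the result would be shorter still, since that list also contains words such as "be".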

In [None]:
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(hamlet_raw)
norm_corpus