# Introduction to NLP in Python
## Quest 1: NLP Basics for Text Preprocessing

### Tokenization

Tokenizers divide strings into lists of substrings. After installing the NLTK library, let's import it along with two of its built-in functions, `sent_tokenize` and `word_tokenize`.

In [3]:
import nltk
from nltk import sent_tokenize
from nltk import word_tokenize

1. `sent_tokenize`

The first function, `sent_tokenize`, splits the given text into sentences. This is especially useful when you are dealing with bigger chunks of text with longer sentences.

We will make use of the following sample paragraph from Oscar Wilde's *The Picture of Dorian Gray*. Run the cell below to split it into sentences.

In [4]:
sample = "The wind shook some blossoms from the trees, and the heavy lilac-blooms, with their clustering stars, moved to and fro in the languid air. A grasshopper began to chirrup by the wall, and like a blue thread a long thin dragon-fly floated past on its brown gauze wings. Lord Henry felt as if he could hear Basil Hallward’s heart beating, and wondered what was coming."
sentence_tokens = sent_tokenize(sample)

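If you encounter the "Resource punkt not found" error when running the cell above, run `nltk.download('punkt')` and then rerun the cell. Newer NLTK releases may ask for a `punkt_tab` resource instead; in that case, download the resource named in the error message.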
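To see why a dedicated sentence tokenizer is worth the download, here is a minimal sketch of a naive splitter built only on Python's `re` module. The function name `naive_sent_split` is made up for illustration, and this is not the algorithm NLTK uses; it simply splits wherever `.`, `!` or `?` is followed by whitespace.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]

text = (
    "The wind shook some blossoms from the trees. "
    "A grasshopper began to chirrup by the wall. "
    "Lord Henry wondered what was coming."
)

# Works on simple prose: three sentences come back
print(naive_sent_split(text))

# Breaks on abbreviations: 'Dr.' is wrongly treated as a sentence end
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

The second call shows the weakness of the naive rule: the abbreviation is cut off as its own "sentence". Handling cases like this is what a trained sentence tokenizer is designed for.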
<br/><br/>

2. `word_tokenize`

Likewise, the `word_tokenize` function splits the paragraph into individual word and punctuation tokens. Run the cell below to compare the outputs.

In [6]:
word_tokens = word_tokenize(sample)
print(word_tokens)

['The', 'wind', 'shook', 'some', 'blossoms', 'from', 'the', 'trees', ',', 'and', 'the', 'heavy', 'lilac-blooms', ',', 'with', 'their', 'clustering', 'stars', ',', 'moved', 'to', 'and', 'fro', 'in', 'the', 'languid', 'air', '.', 'A', 'grasshopper', 'began', 'to', 'chirrup', 'by', 'the', 'wall', ',', 'and', 'like', 'a', 'blue', 'thread', 'a', 'long', 'thin', 'dragon-fly', 'floated', 'past', 'on', 'its', 'brown', 'gauze', 'wings', '.', 'Lord', 'Henry', 'felt', 'as', 'if', 'he', 'could', 'hear', 'Basil', 'Hallward', '’', 's', 'heart', 'beating', ',', 'and', 'wondered', 'what', 'was', 'coming', '.']
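For intuition about what the output above contains, the following is a rough regex-based sketch using only the standard library. The names `TOKEN_PATTERN` and `simple_word_tokenize` are hypothetical, and the pattern is an assumption chosen for illustration, not NLTK's implementation: it keeps hyphenated words together and emits every punctuation mark as its own token.

```python
import re

# Either a run of word characters (optionally joined by hyphens),
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_word_tokenize(text):
    """Return word and punctuation tokens in the order they appear."""
    return TOKEN_PATTERN.findall(text)

print(simple_word_tokenize("A long thin dragon-fly floated past, and Lord Henry wondered."))
print(simple_word_tokenize("Basil Hallward’s heart"))
```

On this kind of prose the sketch separates punctuation and the curly apostrophe in the same way as the output above. It is much cruder than `word_tokenize`, though: for example, it splits a contraction such as "don't" into `don`, `'` and `t`.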


Additionally, feel free to experiment with different sentences and pieces of text by passing them through each tokenizer.

There are many more tokenizers in the NLTK library, each designed to produce tokens suited to a particular kind of data. You can learn more about tokenizers from the NLTK documentation [here](https://www.nltk.org/api/nltk.tokenize.html).

Return to the StackUp platform, where we will continue with the quest.

<br/><br/>

### Removing stop words

Stop words are common words that add little meaning to the text. Some stop words in English include conjunctions such as for, and, but, or, yet, so, and articles such as a, an, the.

NLTK has pre-defined stop words for English. Let's go ahead and import them by running the cell below.

In [7]:
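nltk.download('stopwords')
# Alias the corpus reader so the set below does not shadow it if this cell is rerun
from nltk.corpus import stopwords as nltk_stopwords
stopwords = set(nltk_stopwords.words('english'))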



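The set `stopwords` now contains the NLTK predefined stop words. Using the tokenized text from earlier, let's remove the stop words and keep the remaining tokens. Note that the comparison below is case-sensitive and the NLTK list is all lowercase, so capitalized tokens such as "The" and "A" are not removed.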

In [8]:
tokens = word_tokenize(sample)
stopwords_removed = [token for token in tokens if token not in stopwords]
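Because the membership test is case-sensitive, capitalized stop words slip through the filter. The sketch below shows one possible way to handle this by lowercasing each token before the lookup. It uses a tiny hand-written stop list (`toy_stop_words`, a hypothetical stand-in so the example runs without any downloads); with NLTK you would use the full set loaded above.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is much longer
toy_stop_words = {"the", "a", "an", "and", "from", "some"}

tokens = ["The", "wind", "shook", "some", "blossoms", "from", "the",
          "trees", ",", "and", "A", "grasshopper"]

# Case-sensitive lookup: 'The' and 'A' survive because only lowercase forms are listed
case_sensitive = [t for t in tokens if t not in toy_stop_words]

# Lowercase each token for the lookup, but keep the original spelling in the result
case_insensitive = [t for t in tokens if t.lower() not in toy_stop_words]

# Optionally drop tokens that contain no letters or digits (pure punctuation)
words_only = [t for t in case_insensitive if any(ch.isalnum() for ch in t)]

print(case_sensitive)
print(case_insensitive)
print(words_only)
```

Whether to lowercase or to drop punctuation depends on the task, so treat both steps as options rather than requirements.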


Now, let's head back to the StackUp platform, where we cover the third preprocessing technique in this quest.

<br/><br/>

### Stemming and Lemmatization

Here, we will experiment with the `PorterStemmer` and the `WordNetLemmatizer`. Recall from the quest that stemming strips the suffix from a word, while lemmatization takes into account the context and what the word means in the sentence.

Play around with different words to compare the outputs produced by a stemmer and a lemmatizer!

In [9]:

nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
stemmer = PorterStemmer()
lemma = WordNetLemmatizer()



Let's test both methods on the tokens that remain after stop word removal. The `plurals` list is not used in this run; swap it in for `stopwords_removed` to see how each method handles irregular plurals.

In [10]:
# Irregular plurals to experiment with (not used in the run below)
plurals = ['apples', 'octopuses', 'categories', 'criteria', 'tomatoes', 'matrices', 'hypotheses', 'radii', 'algae', 'cacti']

# Stem and lemmatize every token left after stop word removal
sample_stem = [stemmer.stem(token) for token in stopwords_removed]
sample_lemma = [lemma.lemmatize(token) for token in stopwords_removed]

print("Stemming results: ", sample_stem)
print("Lemmatization results: ", sample_lemma)

Stemming results:  ['the', 'wind', 'shook', 'blossom', 'tree', ',', 'heavi', 'lilac-bloom', ',', 'cluster', 'star', ',', 'move', 'fro', 'languid', 'air', '.', 'a', 'grasshopp', 'began', 'chirrup', 'wall', ',', 'like', 'blue', 'thread', 'long', 'thin', 'dragon-fli', 'float', 'past', 'brown', 'gauz', 'wing', '.', 'lord', 'henri', 'felt', 'could', 'hear', 'basil', 'hallward', '’', 'heart', 'beat', ',', 'wonder', 'come', '.']
Lemmatization results:  ['The', 'wind', 'shook', 'blossom', 'tree', ',', 'heavy', 'lilac-blooms', ',', 'clustering', 'star', ',', 'moved', 'fro', 'languid', 'air', '.', 'A', 'grasshopper', 'began', 'chirrup', 'wall', ',', 'like', 'blue', 'thread', 'long', 'thin', 'dragon-fly', 'floated', 'past', 'brown', 'gauze', 'wing', '.', 'Lord', 'Henry', 'felt', 'could', 'hear', 'Basil', 'Hallward', '’', 'heart', 'beating', ',', 'wondered', 'coming', '.']


In [11]:
print(sample,"\n")
print(sentence_tokens,"\n")
print(word_tokens,"\n")
print(stopwords_removed,"\n")
print("Stemming results: ", sample_stem,"\n")
print("Lemmatization results: ", sample_lemma)

The wind shook some blossoms from the trees, and the heavy lilac-blooms, with their clustering stars, moved to and fro in the languid air. A grasshopper began to chirrup by the wall, and like a blue thread a long thin dragon-fly floated past on its brown gauze wings. Lord Henry felt as if he could hear Basil Hallward’s heart beating, and wondered what was coming. 

['The wind shook some blossoms from the trees, and the heavy lilac-blooms, with their clustering stars, moved to and fro in the languid air.', 'A grasshopper began to chirrup by the wall, and like a blue thread a long thin dragon-fly floated past on its brown gauze wings.', 'Lord Henry felt as if he could hear Basil Hallward’s heart beating, and wondered what was coming.'] 

['The', 'wind', 'shook', 'some', 'blossoms', 'from', 'the', 'trees', ',', 'and', 'the', 'heavy', 'lilac-blooms', ',', 'with', 'their', 'clustering', 'stars', ',', 'moved', 'to', 'and', 'fro', 'in', 'the', 'languid', 'air', '.', 'A', 'grasshopper', 'began

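Compare the results produced above! The lemmatizer returns real dictionary words (`heavy`, `grasshopper`, `gauze`), whereas the stemmer lowercases every token and often produces truncated stems (`heavi`, `grasshopp`, `gauz`). The lemmatizer is also more accurate when it comes to getting the root word of more complex plurals. Because `lemmatize` treats each word as a noun unless a part-of-speech tag is passed, verb forms such as `moved` and `beating` are left unchanged here. However, it is important to note that with a large dataset, stemming comes in handy where performance is an issue.

And that sums up the three techniques for text preprocessing in NLP! **Return to the StackUp platform,** where we wrap up the quest and prepare the deliverables for submission.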
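To make the contrast between the two approaches concrete without any downloads, here is a toy sketch. Everything in it is hypothetical: `toy_stem` is a crude suffix stripper (not the Porter algorithm) and `toy_lemmatize` is a four-entry lookup table (not WordNet). It only illustrates the general idea that a stemmer applies fixed rules while a lemmatizer looks words up.

```python
def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for rule-based stemming."""
    for suffix in ("ies", "es", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary mapping irregular plurals to their lemmas
TOY_LEMMAS = {
    "categories": "category",
    "matrices": "matrix",
    "criteria": "criterion",
    "cacti": "cactus",
}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["apples", "categories", "matrices", "criteria", "cacti"]:
    print(f"{word:12} stem={toy_stem(word):10} lemma={toy_lemmatize(word)}")
```

The rule-based function is fast and needs no data, but it returns non-words such as `matric` and cannot touch `criteria` or `cacti`. The lookup returns proper words, but only for entries it knows about, which is why a real lemmatizer depends on a large lexical resource.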