# What is NLP (Natural Language Processing)?

NLP is a subfield of computer science and artificial intelligence concerned with interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to text and speech.  

**For example**, we can use NLP to create systems like speech recognition, document summarization, machine translation, spam detection, named entity recognition, question answering, autocomplete, predictive typing and so on.  

Nowadays, most of us have smartphones with speech recognition. These smartphones use NLP to understand what is said. Many laptops also ship with an operating system that has built-in speech recognition.

#### Some Examples
- `Cortana`

![image.png](attachment:image.png)

Microsoft Windows has a virtual assistant called Cortana that can recognize natural speech. You can use it to set up reminders, open apps, send emails, play games, track flights and packages, check the weather and so on.
You can read more about Cortana commands __[here](https://www.howtogeek.com/225458/15-things-you-can-do-with-cortana-on-windows-10)__.

- `Siri`

![image.png](attachment:image.png)

Siri is the virtual assistant in Apple Inc.’s iOS, watchOS, macOS, HomePod, and tvOS operating systems. Again, you can do a lot of things with voice commands: start a call, text someone, send an email, set a timer, take a picture, open an app, set an alarm, use navigation and so on.

- `Gmail`

![image.png](attachment:image.png)

Gmail, the email service developed by Google, uses spam detection to filter out spam emails.

![image.png](attachment:image.png)

### Why study NLP?
There’s a fast-growing collection of useful applications derived from this field of study. They range from simple to complex. Below are a few of them:
- Spell Checking, Keyword Search, Finding Synonyms.
- Extracting information from websites such as: product price, dates, location, people, or company names.
- Classifying: reading level of school texts, positive/negative sentiment of longer documents.
- Machine Translation.
- Spoken Dialog Systems.
- Complex Question Answering.

![image-2.png](attachment:image-2.png)

## Introduction to the NLTK library for Python 

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to many corpora and lexical resources. Also, it contains a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Best of all, NLTK is a free, open source, community-driven project.

In [None]:
# Install NLTK: run in a terminal:
# pip install nltk

# Download the NLTK data: run the following in a Python shell (started from a terminal).
# Calling nltk.download() with no arguments opens the interactive downloader.
import nltk 
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [None]:
#import the NLTK toolkit
import nltk

### The Basics of NLP for Text

We'll cover the following topics:
1. Sentence Tokenization
2. Word Tokenization
3. Text Lemmatization and Stemming
4. Stop Words
5. Regex
6. Bag-of-Words
7. TF-IDF

**1. Sentence Tokenization**  
Sentence tokenization (also called sentence segmentation) is the problem of dividing a string of written language into its component sentences. In English and some other languages, a first approximation is to split the text whenever we see sentence-ending punctuation such as a period, question mark, or exclamation mark. 

However, even in English this problem is not trivial, because the full stop character is also used in abbreviations. When processing plain text, a table of abbreviations that contain periods can help us avoid placing sentence boundaries incorrectly. 
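The abbreviation problem can be sketched in plain Python. This is only an illustration, not how NLTK works internally: the `ABBREVIATIONS` set, both function names, and the sample sentence are made up for this example, and a real abbreviation table would be much larger.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_with_abbreviations(text):
    """Split on sentence-ending punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences

sample = "Dr. Smith plays backgammon. He learned it from Mr. Jones."
print(naive_split(sample))
print(split_with_abbreviations(sample))
```

The naive version breaks the text after "Dr." and "Mr.", while the version that consults the abbreviation table returns the two intended sentences.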

`Example:`  
Let's look at a piece of text about a famous board game called backgammon.  
**"Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."**    

To apply sentence tokenization with `NLTK`, we can use the `nltk.sent_tokenize` function.

In [None]:
text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()

**2. Word Tokenization**  

Word tokenization (also called word segmentation) is the problem of dividing a string of written language into its component words. In English and many other languages that use some form of the Latin alphabet, the space is a good approximation of a word divider.  
However, splitting only on spaces does not always give the results we want: punctuation stays attached to the neighboring word, and some English compound nouns are written inconsistently, sometimes with a space. In most cases, we use a library to handle these details.
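To see why splitting on spaces alone falls short, here is a minimal sketch that compares `str.split` with a simple regular expression. The pattern is an assumption chosen for illustration; it is much cruder than the rules a tokenization library applies.

```python
import re

sentence = "Backgammon is one of the oldest known board games."

# Splitting on whitespace leaves the final period glued to "games."
space_tokens = sentence.split()

# Illustrative pattern: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(space_tokens[-1])
print(regex_tokens[-2:])
```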

`Example:`  
Let's use the sentences from the previous step and see how we can apply word tokenization on them. We can use the `nltk.word_tokenize` function.

In [None]:
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
    print()

**3. Text Lemmatization and Stemming**  
For grammatical reasons, documents can contain different forms of a word, such as drive, drives, driving. Sometimes we also have related words with a similar meaning, such as nation, national, nationality.
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.  

Source: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

`Examples:`  
- am, are, is => be  
- dog, dogs, dog’s, dogs’ => dog  

The result of applying this mapping to a text will be something like this:  
- the boy’s dogs are different sizes => the boy dog be differ size  

Stemming and lemmatization are special cases of normalization. However, they are different from each other.  

**Stemming:** Stemming is a rudimentary rule-based process that strips suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.  

**Lemmatization:** Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the base form (lemma) of a word. It makes use of a vocabulary (dictionary lookup of words) and morphological analysis (word structure and grammatical relations).

Source: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
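The difference between the two approaches can be sketched with a toy suffix stripper and a toy dictionary lookup. Both are deliberately simplistic stand-ins: `SUFFIXES`, `TOY_LEMMAS`, and the two functions are hypothetical, and they implement neither the Porter algorithm nor WordNet lemmatization.

```python
# Toy rule set and toy dictionary, for illustration only.
SUFFIXES = ("ing", "ly", "es", "s")  # tried in this order
TOY_LEMMAS = {"better": "good", "am": "be", "are": "be", "is": "be"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["playing", "dogs", "drives", "better", "is"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based stemmer handles "playing" and "dogs" but produces the non-word "driv" for "drives" and cannot relate "better" to "good"; the dictionary lookup can, but only for the words it knows.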

`Examples:`  
The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.  
The word "play" is the base form of the word "playing", and hence this is matched by both stemming and lemmatization.  
The word "meeting" can be either the base form of a noun or a form of a verb ("to meet"), depending on the context, e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
    """
    Print the results of stemming and lemmatization using the passed stemmer, lemmatizer, word and pos (part of speech)
    """
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos))
    print()

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "seen", pos = wordnet.VERB)
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "drove", pos = wordnet.VERB)

**4. Stop words**  

![image.png](attachment:image.png)

`Stop words` are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise. That's why we want to remove these irrelevant words.  

`Stop words` usually refer to the most common words such as "and", "the", "a" in a language, but there is no single universal list of stopwords. The list of the stop words can change depending on your application.
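As a small illustration of an application-specific list, the sketch below extends a base list with a domain word. Both sets are tiny hypothetical examples, not NLTK's list; in a collection of movie reviews the word "movie" appears almost everywhere, so one might choose to treat it as a stop word.

```python
# Tiny illustrative lists; a real general-purpose list is far longer.
BASE_STOP_WORDS = {"a", "and", "i", "it", "the", "this"}
DOMAIN_STOP_WORDS = BASE_STOP_WORDS | {"movie"}

def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["I", "like", "this", "movie"]
print(remove_stop_words(tokens, BASE_STOP_WORDS))    # general-purpose list
print(remove_stop_words(tokens, DOMAIN_STOP_WORDS))  # review-specific list
```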

NLTK has a predefined list of stop words covering the most common words. If you are using it for the first time, you need to download the stop words with this code: `nltk.download("stopwords")`.   

Once the download completes, we can import the `stopwords` package from `nltk.corpus` and use it to load the stop words.

In [None]:
#nltk.download("stopwords")

In [None]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

Let’s see how we can remove the stop words from a sentence.

In [None]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)

**5. Regex**  
![image.png](attachment:image.png)

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. Let’s see some basics.  

- `.` - match any character except newline  
- `\w` - match a word character (letter, digit, or underscore)  
- `\d` - match a digit  
- `\s` - match a whitespace character  
- `\W` - match a non-word character  
- `\D` - match a non-digit character  
- `\S` - match a non-whitespace character  
- `[abc]` - match any of a, b, or c  
- `[^abc]` - match any character except a, b, or c  
- `[a-g]` - match a character between a and g  
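A few of these patterns can be tried out with `re.findall`, which returns every non-overlapping match. The sample string is made up for this example.

```python
import re

sample = "Room 24, gate b7"

digits = re.findall(r"\d", sample)             # every single digit
words = re.findall(r"\w+", sample)             # runs of word characters
a_to_g = re.findall(r"[a-g]", sample)          # lowercase letters from a to g
punctuation = re.findall(r"[^\w\s]", sample)   # neither word character nor whitespace
spaces = re.findall(r"\s", sample)             # whitespace characters

print(digits, words, a_to_g, punctuation, len(spaces))
```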

We can use regex to apply additional filtering to our text. For example, we can remove all the non-word characters. In many cases, we don’t need the punctuation marks, and it’s easy to remove them with regex.  

In Python, the `re` module provides regular expression matching operations. We can use the `re.sub` function to replace the matches for a pattern with a replacement string.   
Let’s see an example where we replace all non-word characters with the space character.

In [None]:
import re
sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."
pattern = r"[^\w]"
print(re.sub(pattern, " ", sentence))

**6. Bag-of-words**  

![image.png](attachment:image.png)

Machine learning algorithms cannot work with raw text directly; we need to convert the text into vectors of numbers. This is called <ins>**feature extraction**</ins>.  

The `bag-of-words` model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.  

To use this model, we need to:  
1. Design a vocabulary of known words (also called tokens)
2. Choose a measure of the presence of known words

Any information about the order or structure of words is discarded. That’s why it’s called a bag of words. The model only captures whether a known word occurs in a document, not where in the document it occurs.
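Before turning to a library, the two steps can be sketched by hand. This is a minimal illustration under stated assumptions: the `tokenize` helper and its regular expression are made up for this example, chosen to lowercase the text and drop punctuation and one-character tokens, and the score is a plain 0/1 presence flag.

```python
import re

docs = ["I like this movie, it's funny.", "I hate this movie."]

def tokenize(text):
    """Lowercase the text and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: the vocabulary is the sorted set of all known tokens
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# Step 2: one vector per document, 1 if the word is present and 0 otherwise
vectors = [[1 if word in tokenize(doc) else 0 for word in vocabulary] for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one position per vocabulary word, and nothing in it records where in the sentence a word appeared.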

`Example`  
Let’s walk through the steps to create a bag-of-words model. In this example, we’ll use only four sentences to see how the model works. In real-world problems, you’ll work with much larger amounts of data.

**1. Load the Data**  

![image.png](attachment:image.png)

In [None]:
I like this movie, it's funny.
I hate this movie.
This was awesome! I like it.
Nice one. I love it.

To load the data, we can simply read the file and split it into lines.

In [None]:
with open("simple movie reviews.txt", "r") as file:
    documents = file.read().splitlines()
    
print(documents)

**2. Design the Vocabulary** 

![image.png](attachment:image.png)

Let’s get all the unique words from the four loaded sentences, ignoring case, punctuation, and one-character tokens. These words will be our vocabulary (known words).
We can use the `CountVectorizer` class from the sklearn library to design our vocabulary. The code for this appears together with the next step.

**3. Create the Document Vectors**

![image.png](attachment:image.png)

Next, we need to score the words in each document. The task here is to convert each raw text into a vector of numbers. After that, we can use these vectors as input for a machine learning model. The simplest scoring method is to mark the presence of words with 1 for present and 0 for absent.
Now, let’s see how we can create a bag-of-words model using the `CountVectorizer` class mentioned above. By default it stores word counts rather than a 0/1 flag; in our four sentences no word repeats within a sentence, so the two coincide.


In [None]:
# Import the libraries we need
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Step 2. Design the Vocabulary
# The default token pattern removes tokens of a single character. That's why we don't have the "I" and "s" tokens in the output
count_vectorizer = CountVectorizer()

# Step 3. Create the Bag-of-Words Model
bag_of_words = count_vectorizer.fit_transform(documents)
#print(bag_of_words)

# Show the Bag-of-Words Model as a pandas DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(bag_of_words.toarray(), columns = feature_names)


Here are our sentences. Now we can see how the bag-of-words model works.

![image.png](attachment:image.png)

**Additional Notes on the Bag of Words Model**  

![image.png](attachment:image.png)

The complexity of the bag-of-words model comes in deciding how to design the vocabulary of known words (tokens) and how to score the presence of known words.  

**Designing the Vocabulary**  

As the vocabulary grows, so does the vector representation of each document. In the example above, the length of the document vector is equal to the number of known words.  

In some cases, we can have a huge amount of data, and in these cases the vector that represents a document might have thousands or millions of elements. Furthermore, each document may contain only a few of the known words in the vocabulary.

Therefore, the vector representations will contain a lot of zeros. Vectors that consist mostly of zeros are called `sparse vectors`. Stored naively with every zero written out, they waste memory and computational resources.  
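One common way to cope with sparsity is to store only the non-zero entries. The sketch below illustrates the idea with a plain dictionary; the vocabulary size and the positions of the non-zero values are made up. The `fit_transform` call used earlier returns a SciPy sparse matrix for the same reason, which is why the example calls `.toarray()` before displaying it.

```python
# A document vector over a hypothetical 10,000-word vocabulary
vocabulary_size = 10_000
dense = [0] * vocabulary_size
dense[3] = 1       # word at index 3 occurs once
dense[4200] = 2    # word at index 4200 occurs twice

# Sparse form: keep only index -> value pairs for the non-zero entries
sparse = {index: value for index, value in enumerate(dense) if value != 0}

print(len(dense), len(sparse))
```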

We can reduce the number of known words in a bag-of-words model to lower the required memory and computational resources. The following text cleaning techniques help:

- Ignoring the case of the words
- Ignoring punctuation
- Removing the stop words from our documents
- Reducing the words to their base form (Text Lemmatization and Stemming)
- Fixing misspelled words

Another, more complex way to create a vocabulary is to use grouped words. This changes the scope of the vocabulary and allows the bag-of-words model to capture more detail about the document. This approach is called **`n-grams`**.

An n-gram is a **sequence** of n items (words, letters, digits, etc.). In the context of **text corpora**, n-grams typically refer to a sequence of words. A **unigram** is one word, a **bigram** is a sequence of two words, a **trigram** is a sequence of three words, and so on. The “n” in “n-gram” refers to the number of grouped words. Only the n-grams that appear in the corpus are modeled, not all possible n-grams.

**`Example`**  
Let’s look at all the bigrams for the following sentence:

`The office building is open today`

All the bigrams are:
- the office
- office building
- building is
- is open
- open today

A **bag-of-bigrams** representation captures some word order, so it can be more expressive than the plain bag-of-words approach, at the cost of a larger vocabulary.
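The bigrams of the example sentence can be generated with a short helper. The `ngrams` function below is written for this illustration; it slides a window of `n` tokens over the sentence.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The office building is open today".lower().split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

In scikit-learn, `CountVectorizer` accepts an `ngram_range` parameter that builds such a vocabulary directly, for example `ngram_range=(2, 2)` for bigrams only.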

**Scoring Words**  
Once we have created our vocabulary of known words, we need to score the occurrence of the words in our data. We saw one very simple approach: the binary approach (1 for presence, 0 for absence).  

Some additional scoring methods are:  
- **Counts.** Count the number of times each word appears in a document.
- **Frequencies.** Calculate how often each word appears in a document as a fraction of all the words in that document.
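Both scoring methods can be sketched with `collections.Counter`. The token list is a made-up, already tokenized document.

```python
from collections import Counter

# Hypothetical tokenized document
tokens = ["nice", "one", "nice", "movie"]

# Counts: how many times each word appears in the document
counts = Counter(tokens)

# Frequencies: each count divided by the total number of words in the document
frequencies = {word: count / len(tokens) for word, count in counts.items()}

print(counts)
print(frequencies)
```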

**7. TF-IDF**  
One problem with scoring word frequency is that the most frequent words in a document get the highest scores. These frequent words may not carry as much **“information gain”** for the model as some rarer, domain-specific words. One approach to fix that problem is to penalize words that are frequent across all the documents. This approach is called **TF-IDF**.  

TF-IDF, short for **term frequency-inverse document frequency**, is a statistical measure used to evaluate the importance of a word to a document in a collection or **corpus**.  

The TF-IDF scoring value increases proportionally to the number of times a word appears in the document, but it is offset by the number of documents in the corpus that contain the word.  

Let’s see the formula used to calculate a TF-IDF score for a given term x within a document y.
![image.png](attachment:image.png)


Now, let’s split this formula a little bit and see how the different parts of the formula work.

- **Term Frequency (TF):** a scoring of the frequency of the word in the current document.
![image.png](attachment:image.png)

- **Inverse Document Frequency (IDF):** a scoring of how rare the word is across documents.
![image-2.png](attachment:image-2.png)

Finally, we can use the previous formulas to calculate the TF-IDF score for a given term like this:
![image-3.png](attachment:image-3.png)
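The calculation can also be worked through by hand. The sketch below assumes the common textbook definitions, which may differ in detail from the formulas in the images: TF is the word's count divided by the document length, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the word. The token lists are the four review sentences, lowercased and without one-character tokens.

```python
import math

docs = [
    ["like", "this", "movie", "it", "funny"],
    ["hate", "this", "movie"],
    ["this", "was", "awesome", "like", "it"],
    ["nice", "one", "love", "it"],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    document_frequency = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / document_frequency)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "funny" occurs in one document, "this" in three of the four
print(tf_idf("funny", docs[0], docs))
print(tf_idf("this", docs[0], docs))
```

Both words have the same term frequency in the first sentence, but "funny" scores higher because it is rarer across the corpus. The values will not match scikit-learn's `TfidfVectorizer`, which by default uses a smoothed IDF and normalizes each document vector.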

**Example**  
In Python, we can use the **TfidfVectorizer** class from the sklearn library to calculate the TF-IDF scores for given documents. Let’s use the same sentences that we have used with the bag-of-words example.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfidf_vectorizer = TfidfVectorizer()
values = tfidf_vectorizer.fit_transform(documents)

# Show the Model as a pandas DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf_vectorizer.get_feature_names_out()
pd.DataFrame(values.toarray(), columns = feature_names)


Again, the sentences are shown here for easy comparison and a better understanding of how this approach works.
![image.png](attachment:image.png)

### POS Tagging
The primary goal of part-of-speech (POS) tagging is to identify the grammatical category of a given word (noun, pronoun, adjective, verb, adverb, etc.) based on its context. POS tagging looks at relationships within the sentence and assigns a corresponding tag to each word.

In [None]:
sent = "Albert Einstein was born in Ulm, Germany in 1879."

In [None]:
# Import only the functions used below
from nltk import word_tokenize, pos_tag
tokens = word_tokenize(sent)
print(tokens)

In [None]:
# Tag each token with its part of speech
pos_tag(tokens)

### Summary
You have learned the basics of NLP for text. More specifically, you have learned the following concepts:
- NLP is used to apply machine learning algorithms to text and speech.
- NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data.
- Sentence tokenization is the problem of dividing a string of written language into its component sentences.
- Word tokenization is the problem of dividing a string of written language into its component words.
- The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
- Stop words are words which are filtered out before or after processing of text. They usually refer to the most common words in a language.
- A regular expression is a sequence of characters that define a search pattern.
- The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.
- TF-IDF is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

### Important Libraries for NLP (Python)

- Scikit-learn: Machine learning in Python.
- Natural Language Toolkit (NLTK): A broad toolkit covering many NLP techniques.
- Pattern: A web mining module for Python with tools for NLP and machine learning.
- TextBlob: Easy-to-use NLP tools API, built on top of NLTK and Pattern.
- spaCy: Industrial-strength NLP with Python and Cython.
- Gensim: Topic modelling for humans.
- Stanford CoreNLP: NLP services and packages by the Stanford NLP Group.

### Bonus Topic

**How to identify what the web page is about using NLTK in Python**  

First, we will grab a webpage and analyze its text to see what the page is about.  

The `urllib` module will help us fetch the webpage.  

In [None]:
import urllib.request
response =  urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
html = response.read()
print(html)

It’s pretty clear from the link that the page is about NLP. Now let us see whether our code can correctly identify the page’s topic.  

We will use **`Beautiful Soup`**, which is a Python library for pulling data out of HTML and XML files. We will use Beautiful Soup to strip the HTML tags from our webpage text.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'html5lib')
text = soup.get_text(strip = True)
print(text)
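One possible next step, sketched here under stated assumptions, is to count which words occur most often once the stop words are removed. The `sample_text` string is a made-up stand-in for the extracted page text, and the stop-word set is a tiny illustrative list; with NLTK one could use `nltk.word_tokenize`, the `stopwords` corpus, and `nltk.FreqDist` for the same purpose.

```python
import re
from collections import Counter

# Hypothetical stand-in for the text extracted from the page above
sample_text = (
    "Natural language processing is a subfield of computer science. "
    "Language processing systems analyze natural language data."
)

# Tiny illustrative stop-word list
stop_words = {"a", "is", "of", "the"}

# Lowercase, keep alphabetic tokens, and drop the stop words
tokens = [t for t in re.findall(r"[a-z]+", sample_text.lower()) if t not in stop_words]

# The most frequent remaining words hint at what the text is about
frequencies = Counter(tokens)
print(frequencies.most_common(3))
```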