# Natural Language Processing (NLP):

**Natural Language Processing (NLP)** is a field of Computer Science and AI that gives machines the ability to understand human language and to assist with language-related tasks.

We will be using spaCy, Gensim and NLTK. These are different libraries that allow you to do NLP in Python. <br>
We'll also use scikit-learn for our machine learning problems, and TensorFlow and PyTorch for deep learning problems in NLP.

# Task 37: NLP Preprocessing:

In [13]:
import pandas as pd
import numpy as np
import nltk

## 1. Lowercasing:

Converting all text to lowercase makes the data uniform, so that variants such as `Decide` and `decide` are treated as the same word during preprocessing and in later stages of an NLP application, such as parsing.
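To make the effect concrete, here is a minimal sketch that counts words with and without lowercasing. The sample sentence and the helper `word_counts` are made up for illustration, and the whitespace split is only a rough stand-in for real tokenization.

```python
from collections import Counter

# Hypothetical sample text, loosely modelled on the review shown later.
sample = "Decide if it is a thriller or a drama. As a drama the movie is watchable, so decide."

def word_counts(text):
    # Rough tokenization: split on whitespace and strip surrounding punctuation.
    return Counter(word.strip(".,!?") for word in text.split())

raw_counts = word_counts(sample)
lower_counts = word_counts(sample.lower())

# Without lowercasing, "Decide" and "decide" are counted as two different words.
print(raw_counts["Decide"], raw_counts["decide"])
# After lowercasing, both occurrences are counted together.
print(lower_counts["decide"])
```

Without lowercasing the two spellings are counted separately; after lowercasing they collapse into a single entry.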

**About the Dataset**: <br>
The IMDB dataset, containing 50K movie reviews.

In [6]:
df = pd.read_csv("IMDB Dataset.csv")

In [7]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [8]:
# check no. of rows and columns
df.shape

(50000, 2)

In [9]:
df['review'][3]

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

You can see that some words contain capital letters. Let's convert the review to lowercase.

In [10]:
df['review'][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

Now let's apply it to all the reviews.

In [12]:
df['review'] = df['review'].str.lower()
df

| | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. <br /><br />the... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |
| ... | ... | ... |
| 49995 | i thought this movie did a down right good job... | positive |
| 49996 | bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | i am a catholic taught in parochial elementary... | negative |
| 49998 | i'm going to have to disagree with the previou... | negative |


## 2. Remove HTML Tags:

In [14]:
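import re

def remove_html_tags(text):
    # Non-greedy match: '<', then as few characters as possible, then '>'
    pattern = re.compile(r'<.*?>')
    # Replace every matched tag with an empty string
    return pattern.sub('', text)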

In [16]:
text = "<><br /><br /><p> Movie, hello world! <p> Click here to download </p>"

remove_html_tags(text)

' Movie, hello world!  Click here to download '

We defined a function that removes HTML tags using a regular expression, and it works on the test string. Let's apply it to the **50K Movie Reviews Dataset**.

In [17]:
df['review'] = df['review'].apply(remove_html_tags)

In [18]:
df['review'][3]

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

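The earlier output of this review contained several **HTML tags** (`<br />`). The output above shows the result after applying the function: the tags are gone. Note that each tag is replaced with an empty string, so sentences that were separated only by tags now run together (for example, `time.this`).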
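In the cleaned review above, text that was separated only by `<br />` tags ends up glued together (`time.this movie`). One possible variation, sketched below, replaces each tag with a space and then collapses repeated whitespace. The function name `remove_html_tags_keep_spacing` is hypothetical and not part of the original notebook.

```python
import re

TAG_RE = re.compile(r"<.*?>")   # same non-greedy tag pattern as above
SPACE_RE = re.compile(r"\s+")   # any run of whitespace

def remove_html_tags_keep_spacing(text):
    # Replace each tag with a space so neighbouring words stay separated.
    without_tags = TAG_RE.sub(" ", text)
    # Collapse the extra whitespace this introduces and trim the ends.
    return SPACE_RE.sub(" ", without_tags).strip()

sample = "fighting all the time.<br /><br />this movie is slower"
cleaned = remove_html_tags_keep_spacing(sample)
print(cleaned)  # fighting all the time. this movie is slower
```

Whether the extra space matters depends on the later steps. A tokenizer may split `time.this` differently from `time. this`.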

## 3. Tokenization:


**Tokenization** is a process of splitting text into meaningful segments.

In [None]:
import spacy

In [None]:
nlp = spacy.blank("en")

doc = nlp("Pakistan's IT industry generates approximately $2 Billion annually for the country. With an average annual growth rate of 30%")

for token in doc:
    print(token)

Pakistan
's
IT
industry
generates
approximately
$
2
Billion
annually
for
the
country
.
With
an
average
annual
growth
rate
of
30
%


First, I created a blank English pipeline (`nlp`) with `spacy.blank("en")`. By default it contains only a tokenizer (a **word tokenizer**).<br>
When you pass text to `nlp`, you get back a `Doc` object, which already holds the tokens.
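For contrast, here is a small sketch of what plain whitespace splitting does with the same sentence. It uses only Python's built-in `str.split()` and is not how spaCy tokenizes. It only shows why a dedicated tokenizer is useful.

```python
text = ("Pakistan's IT industry generates approximately $2 Billion annually "
        "for the country. With an average annual growth rate of 30%")

# Naive approach: split on whitespace only.
naive_tokens = text.split()

print(len(naive_tokens))   # 19 pieces, versus 23 tokens in the spaCy output above
print(naive_tokens[0])     # Pakistan's  (possessive stays attached)
print(naive_tokens[5])     # $2          (currency symbol stays attached)
print(naive_tokens[10])    # country.    (full stop stays attached)
print(naive_tokens[-1])    # 30%         (percent sign stays attached)
```

The spaCy tokenizer separates `'s`, `$`, `.` and `%` into their own tokens, which is what makes the token attributes in the next example possible.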

*   #### Some Text Analysis (Token Attributes):



In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token, "==>", "index: ", token.i,
          "is_alpha:", token.is_alpha,
          "is_punct:", token.is_punct,
          "like_num:", token.like_num,
          "is_currency:", token.is_currency,
          )

Apple ==> index:  0 is_alpha: True is_punct: False like_num: False is_currency: False
is ==> index:  1 is_alpha: True is_punct: False like_num: False is_currency: False
looking ==> index:  2 is_alpha: True is_punct: False like_num: False is_currency: False
at ==> index:  3 is_alpha: True is_punct: False like_num: False is_currency: False
buying ==> index:  4 is_alpha: True is_punct: False like_num: False is_currency: False
U.K. ==> index:  5 is_alpha: False is_punct: False like_num: False is_currency: False
startup ==> index:  6 is_alpha: True is_punct: False like_num: False is_currency: False
for ==> index:  7 is_alpha: True is_punct: False like_num: False is_currency: False
$ ==> index:  8 is_alpha: False is_punct: False like_num: False is_currency: True
1 ==> index:  9 is_alpha: False is_punct: False like_num: True is_currency: False
billion ==> index:  10 is_alpha: True is_punct: False like_num: True is_currency: False


These token attributes are useful for text analysis: in the output above, `like_num` flags both `1` and `billion`, and `is_currency` flags `$`.

*   #### Email extraction from Student Information Doc:

Let's assume it is snowing outside. As a teacher, I want to send an email to all my students to inform them about the holiday. I have a file of student data and I want to get my students' emails from it.

How can I extract all the emails? One option is a regular expression (regex), but in this case spaCy is more convenient.

In [None]:
with open("students.txt") as f:
    text = f.readlines()
text

['University of California, undergraduate students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Rishabh   5 June, 1882    rishabh@singh.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n',
 '\n',
 '\n']

Calling **'f.readlines()'** reads all the lines of the text file into a list of strings. I will convert this list into a single big string.

In [None]:
text = ' '.join(text)
text



`text` was a list of strings. `' '.join(text)` joins all the elements of the list into one string, using a space as the delimiter.
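As a small illustration of what the join produces, here is a sketch on two of the lines shown earlier (the rest are left out to keep it short). Each line keeps its trailing `\n`, and the space is inserted only between elements.

```python
# Two of the lines returned by f.readlines() above.
lines = [
    "Name\tbirth day   \temail\n",
    "Joe      1 May, 1997    joe@root.com\n",
]

# Join the list into one string, with a single space between elements.
joined = " ".join(lines)

print(repr(joined))
```

The result is one string in which the second line starts with the inserted space, right after the first line's newline.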

In [None]:
doc = nlp(text)

emails = []
for token in doc:
    if token.like_email:
        emails.append(token.text)
emails

['rishabh@singh.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']

This is just one simple use case; spaCy tokens expose other attributes too.
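For comparison, the regex route mentioned earlier could look roughly like the sketch below. The pattern `EMAIL_RE` is a simplified assumption that happens to cover the addresses in this file. It is not a complete email validator and not what spaCy's `like_email` does internally.

```python
import re

# The four student lines from students.txt, as shown above.
lines = [
    "Rishabh   5 June, 1882    rishabh@singh.com\n",
    "Maria\t12 April, 2001  maria@sharapova.com\n",
    "Serena  24 June, 1998   serena@williams.com \n",
    "Joe      1 May, 1997    joe@root.com\n",
]
text = " ".join(lines)

# Simplified pattern: local part, '@', domain, dot, top-level domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

emails = EMAIL_RE.findall(text)
print(emails)
```

The regex works here, but you have to write and maintain the pattern yourself, whereas `token.like_email` comes ready-made with the tokenizer.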

Let's move further.

**Stemming** and **Lemmatization** are essential steps in the pre-processing stage when building an **NLP** application.

## 4. Stemming:

**Stemming** uses fixed rules, such as removing the suffixes **able** and **ing**, to derive a base word. <br>
**For stemming we will use NLTK, because spaCy does not support stemming.**

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [None]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting",
         "something", "spoken", "instruction", "vocabulary"]

for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet
something | someth
spoken | spoken
instruction | instruct
vocabulary | vocabulari


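As you can see, the stemmer applies a fixed set of rules: it removes **'ing'** from **'eating'** to get **'eat'**, and so on. For **'ate'**, however, we still get **'ate'**, because the stemmer has no knowledge of the language; it only applies fixed rules. This is also why some results, such as **'abil'** and **'someth'**, are not real words.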
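To illustrate what "fixed rules" means, here is a deliberately tiny suffix-stripping sketch. It is a toy with three made-up rules, not the Porter algorithm that NLTK implements. The names `SUFFIXES` and `toy_stem` are hypothetical.

```python
# Toy rule set: suffixes to strip, checked in this order.
SUFFIXES = ("able", "ing", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    # No rule matched: return the word unchanged.
    return word

for word in ["eating", "eats", "ate", "adjustable", "something"]:
    print(word, "|", toy_stem(word))
```

Like the real stemmer, the toy version leaves `ate` unchanged and turns `something` into `someth`, because it only looks at characters and knows nothing about the language.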

## 5. Lemmatization:

**Lemmatization** uses knowledge of the language (a.k.a. linguistic knowledge) to derive a base word.<br>
For example: ate ---> eat <br>
The base word is called the **lemma**. <br>
So, **eat** is the lemma of **ate**.

In [None]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("eating eats eat ate adjustable rafting ability meeting something spoken instruction vocabulary better")

for token in doc:
    print(token, "|", token.lemma_)

eating | eat
eats | eat
eat | eat
ate | eat
adjustable | adjustable
rafting | raft
ability | ability
meeting | meeting
something | something
spoken | speak
instruction | instruction
vocabulary | vocabulary
better | well


As you can see, **'eating'** reduces to **'eat'**, and **'ate'** also reduces to **'eat'**. At the end, notice that **'better'** reduces to **'well'**. <br>
These mappings come from the trained model that we just loaded.
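The idea of a lemmatizer using linguistic knowledge can be sketched with a toy lookup table plus a part-of-speech hint. This is an illustration under simplifying assumptions, not spaCy's actual lemmatizer. The table `IRREGULAR` and the function `toy_lemma` are hypothetical.

```python
# Hypothetical mini table of irregular forms, keyed by (word, part of speech).
IRREGULAR = {
    ("ate", "VERB"): "eat",
    ("spoken", "VERB"): "speak",
}

def toy_lemma(word, pos):
    w = word.lower()
    # 1. Irregular forms need explicit knowledge; no suffix rule finds them.
    if (w, pos) in IRREGULAR:
        return IRREGULAR[(w, pos)]
    # 2. Strip "ing" only when the word is used as a verb.
    if pos == "VERB" and w.endswith("ing") and len(w) > 5:
        return w[:-3]
    # 3. Otherwise keep the word as it is.
    return w

print(toy_lemma("ate", "VERB"))      # eat
print(toy_lemma("eating", "VERB"))   # eat
print(toy_lemma("meeting", "VERB"))  # meet
print(toy_lemma("meeting", "NOUN"))  # meeting
```

The last two lines show why the part of speech matters: the same string can have different lemmas depending on how it is used, which may be why `meeting` stays `meeting` in the output above.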

## 6. Parts of Speech (POS) Tagging:

**POS tagging** is the process of labeling words in a text with their corresponding parts of speech (e.g., noun, verb, adjective). This helps algorithms understand the grammatical structure and meaning of a text and is an important step in **Natural language processing (NLP)**.

In [None]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("The experiment seemed straightforward and there were plenty of scientists willing to try it. It was wonderful to have a simple laboratory experiment on fusion to try after the decades of embarrassing attempts to control hot fusion.")

for token in doc:
    print(token, "|", token.pos_, "|", spacy.explain(token.pos_))

The | DET | determiner
experiment | NOUN | noun
seemed | VERB | verb
straightforward | ADJ | adjective
and | CCONJ | coordinating conjunction
there | PRON | pronoun
were | VERB | verb
plenty | NOUN | noun
of | ADP | adposition
scientists | NOUN | noun
willing | ADJ | adjective
to | PART | particle
try | VERB | verb
it | PRON | pronoun
. | PUNCT | punctuation
It | PRON | pronoun
was | AUX | auxiliary
wonderful | ADJ | adjective
to | PART | particle
have | VERB | verb
a | DET | determiner
simple | ADJ | adjective
laboratory | NOUN | noun
experiment | NOUN | noun
on | ADP | adposition
fusion | NOUN | noun
to | PART | particle
try | VERB | verb
after | ADP | adposition
the | DET | determiner
decades | NOUN | noun
of | ADP | adposition
embarrassing | ADJ | adjective
attempts | NOUN | noun
to | PART | particle
control | VERB | verb
hot | ADJ | adjective
fusion | NOUN | noun
. | PUNCT | punctuation
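Once tokens carry POS tags, simple analyses become easy. The sketch below copies the (token, tag) pairs for the first sentence from the output above into a plain list, so it runs without spaCy. It then counts the tags and picks out the nouns. The variable names are made up for illustration.

```python
from collections import Counter

# (token, POS tag) pairs for the first sentence, copied from the output above.
tagged = [
    ("The", "DET"), ("experiment", "NOUN"), ("seemed", "VERB"),
    ("straightforward", "ADJ"), ("and", "CCONJ"), ("there", "PRON"),
    ("were", "VERB"), ("plenty", "NOUN"), ("of", "ADP"),
    ("scientists", "NOUN"), ("willing", "ADJ"), ("to", "PART"),
    ("try", "VERB"), ("it", "PRON"), (".", "PUNCT"),
]

# How often does each part of speech occur?
pos_counts = Counter(pos for _, pos in tagged)

# Keep only the nouns, in their original order.
nouns = [token for token, pos in tagged if pos == "NOUN"]

print(pos_counts.most_common(3))
print(nouns)  # ['experiment', 'plenty', 'scientists']
```

With a real `Doc`, the same loop could read `token.text` and `token.pos_` directly instead of a hand-copied list.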
