<a href="https://colab.research.google.com/github/grosa1/hands-on-ml-tutorials/blob/master/tutorial_3/nlp_examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 3 - NLP examples

> ##### A Simplified Part-of-Speech Tagset:

|Tag | Meaning           | Examples
|----|-------------------|--------------------------------------|
|ADJ | adjective         | new, good, high, special, big, local 
|ADV | adverb            | really, already, still, early, now
|CNJ | conjunction       | and, or, but, if, while, although
|DET | determiner        | the, a, some, most, every, no
|EX  | existential       | there, there’s
|FW  | foreign word      | dolce, ersatz, esprit, quo, maitre
|MOD | modal verb        | will, can, would, may, must, should
|N   | noun              | year, home, costs, time, education
|NP  | proper noun       | Alison, Africa, April, Washington
|NUM | number            | twenty-four, fourth, 1991, 14:24
|PRO | pronoun           | he, their, her, its, my, I, us
|P   |preposition        | on, of, at, with, by, into, under
|TO  | the word to       | to
|UH  | interjection      | ah, bang, ha, whee, hmpf, oops
|V   | verb              | is, has, get, do, make, see, run
|VD  | past tense        | said, took, told, made, asked
|VG  | present participle| making, going, playing, working
|VN  | past participle   | given, taken, begun, sung
|WH  | wh determiner     | who, which, when, what, where, how

> ##### Some useful Python functions for NLP:

|Function | Meaning           | 
|----|-------------------|
|`s.startswith(t)`| Test if s starts with t
|`s.endswith(t)`|   Test if s ends with t
|`t in s`|          Test if t is contained inside s
|`s.islower()`|     Test if all cased characters in s are lowercase
|`s.isupper()`|     Test if all cased characters in s are uppercase
|`s.isalpha()`|       Test if all characters in s are alphabetic
|`s.isalnum()`|     Test if all characters in s are alphanumeric
|`s.isdigit()`|     Test if all characters in s are digits
|`s.istitle()`|     Test if s is titlecased (all words in s have initial capitals)
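
As a quick illustration of the tests in the table, the sketch below applies a few of them to a small token list. The list is an assumption made for this example; only built-in `str` methods are used.

```python
# Hand-picked tokens, chosen only to exercise the string tests above.
tokens = ["Hello", "Mr.", "Smith", "pinkish-blue", "cardboard", "25", "NLTK"]

titlecased = [t for t in tokens if t.istitle()]   # initial capital, rest lowercase
alphabetic = [t for t in tokens if t.isalpha()]   # letters only (no '.', '-', digits)
digits = [t for t in tokens if t.isdigit()]       # digits only
uppercase = [t for t in tokens if t.isupper()]    # all cased characters uppercase
hyphenated = [t for t in tokens if "-" in t]      # substring test

print("Titlecased:", titlecased)
print("Alphabetic:", alphabetic)
print("Digits:    ", digits)
print("Uppercase: ", uppercase)
print("Hyphenated:", hyphenated)
```

Note that `"Mr."` counts as titlecased but not as alphabetic, because of the trailing period.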


> ### Setup

In [0]:
!pip install nltk
!pip install spacy spacy-lookups-data
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


> ##### To use spaCy with CUDA:

In [0]:
#!pip install -U spacy[cuda]
#spacy.prefer_gpu()

In [0]:
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")

import nltk
nltk.download('punkt') # for tokenization

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

>### Text pre-processing

The first step is text normalization: split paragraphs into sentences and sentences into individual words. The text is then cleaned by removing punctuation and digits.
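
To make these steps concrete, here is a deliberately naive sketch that uses only regular expressions from the standard library. The patterns are assumptions chosen for illustration and are not what spaCy or NLTK do internally. It also shows why a dedicated tokenizer is worth using: the abbreviation "Mr." is wrongly treated as a sentence boundary.

```python
import re

text = "Hello Mr. Smith, how are you doing today? The weather is great."

# Naive sentence split: break after '.', '?' or '!' when followed by whitespace.
naive_sents = re.split(r"(?<=[.?!])\s+", text)

# Naive word split on the last sentence: runs of letters, digits or apostrophes.
naive_words = re.findall(r"[A-Za-z0-9']+", naive_sents[-1])

# Cleaning step: keep purely alphabetic tokens and lowercase them.
clean = [w.lower() for w in naive_words if w.isalpha()]

print(naive_sents)  # "Hello Mr." ends up as its own (wrong) sentence
print(naive_words)
print(clean)
```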

> ##### Word Tokenization (spaCy)

In [0]:
text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard 25"""

In [0]:
doc = nlp(text)
for token in doc:
    print(token.text)

Hello
Mr.
Smith
,
how
are
you
doing
today
?
The
weather
is
great
,
and
city
is
awesome
.


The
sky
is
pinkish
-
blue
.
You
should
n't
eat
cardboard
25


> ##### Sentence tokenization (NLTK)

In [0]:
from nltk.tokenize import sent_tokenize

In [0]:
tokenized_sents = sent_tokenize(text)
print(tokenized_sents)

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard 25"]


> ##### Word tokenization (NLTK)

In [0]:
sent = tokenized_sents[0]
tokens = nltk.word_tokenize(sent)
tokens

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?']

> ##### Punctuation and digits removal

In [0]:
tokens = [token.lower() for token in tokens if token.isalpha()]
tokens

['hello', 'smith', 'how', 'are', 'you', 'doing', 'today']

> ##### Stop words

In [0]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
stop_words = set(stopwords.words("english"))
filtered_sent = list()
for w in tokens:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:", tokens)
print("Filtered Sentence:", filtered_sent)

Tokenized Sentence: ['hello', 'smith', 'how', 'are', 'you', 'doing', 'today']
Filtered Sentence: ['hello', 'smith', 'today']


> ###  Stemming

Stemming reduces words to their stem, a base form that is not necessarily a dictionary word, by stripping affixes. The Porter stemming algorithm is one of the best-known stemmers for English.

In [0]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

In [0]:
ps = PorterStemmer()

words = ['connection', 'connected', 'connecting']
stemmed_words = list()
for w in words:
    stemmed_words.append(ps.stem(w))

print("Raw:", words)
print("Stemmed:", stemmed_words)

Raw: ['connection', 'connected', 'connecting']
Stemmed: ['connect', 'connect', 'connect']
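
To give an intuition for how rule-based stemming works, the following toy sketch strips a handful of suffixes. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions; the function name `toy_stem` and the suffix list are assumptions made for this example.

```python
def toy_stem(word, suffixes=("ing", "ion", "ed", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["connection", "connected", "connecting", "sing", "station"]
stems = [toy_stem(w) for w in words]

print("Raw:    ", words)
print("Stemmed:", stems)
```

The last word shows the typical weakness of blind suffix stripping: "station" is reduced to "stat", which is not a meaningful base form.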


> ### Lemmatization

The aim of lemmatization, like stemming, is to reduce words to a common base form. Unlike stemming, lemmatization uses lexical knowledge bases to get the correct base forms of words.

In [0]:
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
lem = WordNetLemmatizer()
stem = PorterStemmer()

word = "better"
print("Lemmatized Word:", lem.lemmatize(word, 'a'))
print("Stemmed Word:", stem.stem(word))

Lemmatized Word: good
Stemmed Word: better
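
The part-of-speech argument matters here: `lemmatize` treats a word as a noun unless told otherwise, so the `'a'` (adjective) argument is what lets "better" map to "good". The sketch below mimics this lookup behaviour with a tiny hand-written dictionary. The `LEXICON` entries and the `toy_lemmatize` helper are hypothetical and stand in for the real WordNet database.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
LEXICON = {
    ("better", "a"): "good",   # adjective reading
    ("better", "r"): "well",   # adverb reading
    ("geese", "n"): "goose",
    ("was", "v"): "be",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); fall back to the word itself."""
    return LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("better", "a"))  # found as an adjective
print(toy_lemmatize("better"))       # no noun entry, so the word is returned unchanged
print(toy_lemmatize("geese"))
```

Unlike a stemmer, a lookup of this kind can relate forms that share no common prefix, but it only works for words the knowledge base contains.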


>###  Part-of-Speech Tagging

Part-of-speech tagging assigns a part of speech (noun, verb, adjective, etc.) to each word of a given text, based on the word's definition and its context.

> ##### Using NLTK:

In [0]:
nltk.download('averaged_perceptron_tagger') # tagger model used by nltk.pos_tag

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [0]:
nltk.pos_tag(tokens)

[('hello', 'NN'),
 ('smith', 'VB'),
 ('how', 'WRB'),
 ('are', 'VBP'),
 ('you', 'PRP'),
 ('doing', 'VBG'),
 ('today', 'NN')]
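
NLTK returns Penn Treebank tags (`NN`, `VBG`, `WRB`, ...), which are finer-grained than the simplified tagset listed at the top of this tutorial. The sketch below maps one onto the other. The `PENN_TO_SIMPLE` table is a hand-written approximation made for this example (for instance, `IN` also covers subordinating conjunctions), not an official NLTK mapping.

```python
# Approximate mapping from Penn Treebank tags to the simplified tagset.
PENN_TO_SIMPLE = {
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "CC": "CNJ", "DT": "DET", "EX": "EX", "FW": "FW", "MD": "MOD",
    "NN": "N", "NNS": "N", "NNP": "NP", "NNPS": "NP",
    "CD": "NUM", "PRP": "PRO", "PRP$": "PRO", "IN": "P",
    "TO": "TO", "UH": "UH",
    "VB": "V", "VBP": "V", "VBZ": "V",
    "VBD": "VD", "VBG": "VG", "VBN": "VN",
    "WDT": "WH", "WP": "WH", "WP$": "WH", "WRB": "WH",
}

# Tagger output copied from the cell above.
tagged = [("hello", "NN"), ("smith", "VB"), ("how", "WRB"), ("are", "VBP"),
          ("you", "PRP"), ("doing", "VBG"), ("today", "NN")]

# Unknown tags are passed through unchanged.
simplified = [(word, PENN_TO_SIMPLE.get(tag, tag)) for word, tag in tagged]
print(simplified)
```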

> ##### Using spaCy:

In [0]:
from spacy import displacy

In [0]:
text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""

pos = list()
doc = nlp(text)
for token in doc:
    pos.append({'text': token.text, 
               'lemma': token.lemma_, 
               'pos': token.pos_, 
               'tag': token.tag_, 
               'dep': token.dep_,
               'shape': token.shape_, 
               'stop': token.is_stop})
pd.DataFrame(pos)

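|   | text  | lemma  | pos   | tag | dep      | shape | stop  |
|---|-------|--------|-------|-----|----------|-------|-------|
| 0 | Hello | hello  | INTJ  | UH  | intj     | Xxxxx | False |
| 1 | Mr.   | Mr.    | PROPN | NNP | compound | Xx.   | False |
| 2 | Smith | Smith  | PROPN | NNP | npadvmod | Xxxxx | False |
| 3 | ,     | ,      | PUNCT | ,   | punct    | ,     | False |
| 4 | how   | how    | ADV   | WRB | advmod   | xxx   | True  |
| 5 | are   | be     | AUX   | VBP | aux      | xxx   | True  |
| 6 | you   | -PRON- | PRON  | PRP | nsubj    | xxx   | True  |
| 7 | doing | do     | VERB  | VBG | ROOT     | xxxx  | True  |
| 8 | today | today  | NOUN  | NN  | npadvmod | xxxx  | False |
| 9 | ?     | ?      | PUNCT | .   | punct    | ?     | False |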


To show PoS tagging and word dependencies:

In [0]:
displacy.render(doc, style="dep", jupyter=True)

> ### Example of word dependency analysis with spaCy

In the following example, we have the message of a bug-fix commit. We want to extract the hash of the commit that introduced the bug.

In [0]:
commit_msg = "fixes a search bug introduced by 2508e124d0795678df9050ca5e9f38a469de2a6f and fixes a typo in the README.md"

In [0]:
doc = nlp(commit_msg)
tokens_info = list()
for token in doc:
  tokens_info.append({
      'text': token.text, 
      'pos': token.pos_, 
      'dep': token.dep_, 
      'head_text': token.head.text, 
      'head_pos': token.head.pos_, 
      'children': [child for child in token.children], 
      'ancestors': [t for t in token.ancestors]
  })

pd.DataFrame(tokens_info)

|   | text | pos | dep | head_text | head_pos | children | ancestors |
|---|------|-----|-----|-----------|----------|----------|-----------|
| 0 | fixes | NOUN | ROOT | fixes | NOUN | [bug, and, fixes] | [] |
| 1 | a | DET | det | bug | NOUN | [] | [bug, fixes] |
| 2 | search | NOUN | compound | bug | NOUN | [] | [bug, fixes] |
| 3 | bug | NOUN | dobj | fixes | NOUN | [a, search, introduced] | [fixes] |
| 4 | introduced | VERB | acl | bug | NOUN | [by] | [bug, fixes] |
| 5 | by | ADP | agent | introduced | VERB | [2508e124d0795678df9050ca5e9f38a469de2a6f] | [introduced, bug, fixes] |
| 6 | 2508e124d0795678df9050ca5e9f38a469de2a6f | NUM | pobj | by | ADP | [] | [by, introduced, bug, fixes] |
| 7 | and | CCONJ | cc | fixes | NOUN | [] | [fixes] |
| 8 | fixes | VERB | conj | fixes | NOUN | [typo, in] | [fixes] |
| 9 | a | DET | det | typo | NOUN | [] | [typo, fixes, fixes] |
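
The table shows the path that identifies the bug-introducing commit: the hash is the object (`pobj`) of the agent preposition "by", which in turn depends on the verb "introduced". A minimal sketch of that lookup follows. To stay self-contained it works on plain dictionaries copied from the table rather than on a spaCy `Doc`. The helper name `find_introducing_commit` is hypothetical, and matching heads by text is a simplification (with a real `Doc` one would follow `token.head` instead).

```python
import re

# Rows copied from the table above (only the columns needed here).
rows = [
    {"text": "fixes", "dep": "ROOT", "head_text": "fixes"},
    {"text": "a", "dep": "det", "head_text": "bug"},
    {"text": "search", "dep": "compound", "head_text": "bug"},
    {"text": "bug", "dep": "dobj", "head_text": "fixes"},
    {"text": "introduced", "dep": "acl", "head_text": "bug"},
    {"text": "by", "dep": "agent", "head_text": "introduced"},
    {"text": "2508e124d0795678df9050ca5e9f38a469de2a6f", "dep": "pobj", "head_text": "by"},
    {"text": "and", "dep": "cc", "head_text": "fixes"},
    {"text": "fixes", "dep": "conj", "head_text": "fixes"},
]

SHA1 = re.compile(r"[0-9a-f]{40}")  # full-length Git SHA-1 hash


def find_introducing_commit(rows, verb="introduced"):
    """Return the hash reached via: verb -> agent preposition -> pobj."""
    for row in rows:
        if row["dep"] != "pobj" or not SHA1.fullmatch(row["text"]):
            continue
        # The hash must hang off an agent preposition governed by the verb.
        for prep in rows:
            if (prep["text"] == row["head_text"]
                    and prep["dep"] == "agent"
                    and prep["head_text"] == verb):
                return row["text"]
    return None


print(find_introducing_commit(rows))
```

Compared with a plain regular expression over the message, the dependency check helps distinguish the hash that introduced the bug from any other hash the message might mention.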


> ##### To print dependency tree:

In [0]:
from nltk import Tree

def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_

In [0]:
for s in doc.sents:
    # pretty_print() prints the tree itself and returns None, so it is not wrapped in print()
    to_nltk_tree(s.root).pretty_print()

        fixes                                       
  ________|____________________________              
 |       bug                         fixes          
 |    ____|___________            _____|_______      
 |   |    |       introduced     |             in   
 |   |    |           |          |             |     
 |   |    |           by        typo       README.md
 |   |    |           |          |             |     
and  a  search 2508e124d0795678  a            the   
               df9050ca5e9f38a4                     
                   69de2a6f                         
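
The recursion in `to_nltk_tree` can also be illustrated without any library. The sketch below rebuilds the first part of the tree from `(text, head index)` pairs taken from the dependency table and prints it as an indented outline. The helper names `children_of` and `render` are assumptions made for this example.

```python
# (text, index of head) pairs for the first seven tokens;
# the root points to itself, as spaCy's token.head does.
tokens = [
    ("fixes", 0),
    ("a", 3),
    ("search", 3),
    ("bug", 0),
    ("introduced", 3),
    ("by", 4),
    ("2508e124d0795678df9050ca5e9f38a469de2a6f", 5),
]


def children_of(index):
    """Indices of the tokens whose head is the token at `index`."""
    return [i for i, (_, head) in enumerate(tokens) if head == index and i != index]


def render(index, depth=0):
    """Return the subtree rooted at `index` as a list of indented lines."""
    lines = ["  " * depth + tokens[index][0]]
    for child in children_of(index):
        lines.extend(render(child, depth + 1))
    return lines


# The root is the token that is its own head.
root = next(i for i, (_, head) in enumerate(tokens) if head == i)
print("\n".join(render(root)))
```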


> ### Useful links

* [NLTK](https://www.nltk.org/)

* [spaCy](https://spacy.io/) and [examples](https://spacy.io/usage/spacy-101)

* [Text summarization example](https://www.kaggle.com/harunshimanto/summarization-with-wine-reviews-using-spacy)

* [Sentiment analysis example](https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis)