## DEEP LEARNING SUMMIT

# Unstructured Data -- using the Structure

## Brushing up NLP Basics and Pre-requisites

Text is ubiquitous these days.

It is important to analyze text data accurately and make sense of it.

Text Analytics is often used interchangeably with NLP, but the two are somewhat different.

Text Analytics does not focus on the underlying structure of the text, that is, the natural language behind it.

Text Analytics is about analyzing words or sentences and their occurrences.

It is similar to building models on time series or image data: it treats words or characters as the unit of data (much like pixels in images) but does not attempt to understand the structure of a natural language (its grammar and semantics). NLP attempts to do that.

For instance, text analytics will consider all of the sentences below to express happiness, whereas NLP will say that none of them does.

1. She will be happy if she wins it.

2. Is he happy with his job?

3. He has not been happy for a week.
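A minimal sketch of why a pure word-occurrence approach gets these wrong. The names `HAPPY_WORDS` and `keyword_flags_happiness` are hypothetical and only illustrate the idea; they are not part of any library.

```python
import re

# Hypothetical keyword list; a real lexicon would be far larger.
HAPPY_WORDS = {"happy", "glad", "joyful"}

def keyword_flags_happiness(sentence):
    """Return True if any 'happy' keyword occurs, ignoring grammar entirely."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return any(token in HAPPY_WORDS for token in tokens)

sentences = [
    "She will be happy if she wins it.",   # conditional
    "Is he happy with his job?",           # question
    "He has not been happy for a week.",   # negation
]

flags = [keyword_flags_happiness(s) for s in sentences]
print(flags)
```

All three sentences are flagged, because the conditional, the question and the negation are invisible to an occurrence count. Recognizing them requires some model of sentence structure, which is what NLP adds.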

Let's first look at some NLP, since it is often used for pre-processing in text analytics problems.

Text is unstructured and has a lot of fillers!

"People like to see Sachin play. It always gets them interested in Cricket! It's good to see people show such interest. They love Sachin. He has given so much to the game. His strokes with his MRF bat are etched in people's memoris. Many current cricketers owe pursuing cricket to him."

Here, words like "to", "it", "in" and "it's" do not add any value to models, so let's see what we can do about them.

As they say, "your model is only as good as your data", so it is important to remove noise from any data.

In [1]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
text = "People like to see Sachin play. It always gets them interested in Cricket! It's good to see people show such interest. They love Sachin. He has given so much to the game. His strokes with his MRF bat are etched in people's memoris. Many current cricketers owe pursuing cricket to him."

Tokenization

Word Tokenization

Also Sentence Tokenization
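Sentence tokenization is not shown in the cells here, so the following is a rough sketch using only a regular expression. It is a simplified stand-in (the helper name `naive_sent_tokenize` is hypothetical) and does not handle abbreviations such as "Dr."; NLTK's `sent_tokenize` from `nltk.tokenize` is the more robust choice in practice.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (no abbreviation handling)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = ("People like to see Sachin play. It always gets them interested in Cricket! "
          "It's good to see people show such interest.")

sentences = naive_sent_tokenize(sample)
for s in sentences:
    print(s)
```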

Stop Words

In [3]:
stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(text)

# Keep only the tokens that are not in the stop-word list
filtered_sentence = [w for w in word_tokens if w not in stop_words]
 
print(word_tokens)
print(filtered_sentence)

['People', 'like', 'to', 'see', 'Sachin', 'play', '.', 'It', 'always', 'gets', 'them', 'interested', 'in', 'Cricket', '!', 'It', "'s", 'good', 'to', 'see', 'people', 'show', 'such', 'interest', '.', 'They', 'love', 'Sachin', '.', 'He', 'has', 'given', 'so', 'much', 'to', 'the', 'game', '.', 'His', 'strokes', 'with', 'his', 'MRF', 'bat', 'are', 'etched', 'in', 'people', "'s", 'memoris', '.', 'Many', 'current', 'cricketers', 'owe', 'pursuing', 'cricket', 'to', 'him', '.']
['People', 'like', 'see', 'Sachin', 'play', '.', 'It', 'always', 'gets', 'interested', 'Cricket', '!', 'It', "'s", 'good', 'see', 'people', 'show', 'interest', '.', 'They', 'love', 'Sachin', '.', 'He', 'given', 'much', 'game', '.', 'His', 'strokes', 'MRF', 'bat', 'etched', 'people', "'s", 'memoris', '.', 'Many', 'current', 'cricketers', 'owe', 'pursuing', 'cricket', '.']


Punctuation

Lower case

In [4]:
words = [w.lower() for w in filtered_sentence if w.isalpha()]

print(words)

['people', 'like', 'see', 'sachin', 'play', 'it', 'always', 'gets', 'interested', 'cricket', 'it', 'good', 'see', 'people', 'show', 'interest', 'they', 'love', 'sachin', 'he', 'given', 'much', 'game', 'his', 'strokes', 'mrf', 'bat', 'etched', 'people', 'memoris', 'many', 'current', 'cricketers', 'owe', 'pursuing', 'cricket']
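Note that "it", "they", "he" and "his" survive in the output above. This is most likely because the stop-word filter ran on the cased tokens ("It", "They") before lower-casing. A small sketch of the effect of the ordering, using a hypothetical hand-written stop-word set rather than NLTK's list:

```python
# Hypothetical, lower-case stop-word set used only for this illustration.
stop_words = {"it", "them", "in", "to", "the", "he", "his", "they"}
tokens = ["It", "always", "gets", "them", "interested", "in", "Cricket", "!"]

# Filter first, lower-case later: "It" does not match "it" and slips through.
filter_then_lower = [w.lower() for w in tokens
                     if w not in stop_words and w.isalpha()]

# Lower-case first, then filter: the capitalised stop word is removed too.
lower_then_filter = [w for w in (t.lower() for t in tokens)
                     if w.isalpha() and w not in stop_words]

print(filter_then_lower)
print(lower_then_filter)
```

Whether to lower-case first is a design choice: it removes more stop words, but it also discards the capitalization that PoS tagging and NER rely on, so those steps are better run on the raw tokens.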


PoS Tagging

Besides helping with lemmatization, PoS tags can be very useful for removing junk words before tf-idf generation in specific use cases.
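As a sketch of that idea, the snippet below keeps only tokens whose Penn Treebank tag marks a noun, verb or adjective. The choice of tags to keep (`KEEP_PREFIXES`) and the helper name `keep_content_words` are assumptions for illustration; the right set depends on the use case.

```python
# (word, tag) pairs in the shape produced by a PoS tagger; tags are Penn Treebank.
tagged = [("people", "NNS"), ("like", "IN"), ("see", "VB"), ("play", "NN"),
          ("it", "PRP"), ("always", "RB"), ("gets", "VBZ"),
          ("interested", "JJ"), ("cricket", "NN")]

# Assumption: only nouns, verbs and adjectives carry signal for this use case.
KEEP_PREFIXES = ("NN", "VB", "JJ")

def keep_content_words(pairs, prefixes=KEEP_PREFIXES):
    """Return the words whose tag starts with one of the given prefixes."""
    return [word for word, tag in pairs if tag.startswith(prefixes)]

content_words = keep_content_words(tagged)
print(content_words)
```

The surviving words can then be passed to the tf-idf step, so that prepositions, pronouns and adverbs never enter the vocabulary.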

NER

In [5]:
import nltk
from nltk import pos_tag
tagged_words = pos_tag(words)
print("PoS on cleaned text")
print(tagged_words)
tagged_text = pos_tag(word_tokens)
print("PoS on raw text")
print(tagged_text)

# Identify named entities (run on the raw, cased tokens)
entities = nltk.chunk.ne_chunk(tagged_text)
print("NER on raw text")
print(repr(entities))

# Drop stop words while keeping the tags assigned on the raw text
filtered_tagged_text = [(w, tag) for w, tag in tagged_text if w not in stop_words]

# Keep alphabetic tokens only and lower-case them; this replaces the tagged_words built above
tagged_words = [[w.lower(), tag] for w, tag in filtered_tagged_text if w.isalpha()]

PoS on cleaned text
[('people', 'NNS'), ('like', 'IN'), ('see', 'NN'), ('sachin', 'JJ'), ('play', 'VB'), ('it', 'PRP'), ('always', 'RB'), ('gets', 'VBZ'), ('interested', 'JJ'), ('cricket', 'NN'), ('it', 'PRP'), ('good', 'JJ'), ('see', 'NN'), ('people', 'NNS'), ('show', 'VBP'), ('interest', 'NN'), ('they', 'PRP'), ('love', 'VBP'), ('sachin', 'NN'), ('he', 'PRP'), ('given', 'VBN'), ('much', 'JJ'), ('game', 'NN'), ('his', 'PRP$'), ('strokes', 'NNS'), ('mrf', 'VBP'), ('bat', 'NN'), ('etched', 'VBN'), ('people', 'NNS'), ('memoris', 'VBP'), ('many', 'JJ'), ('current', 'JJ'), ('cricketers', 'NNS'), ('owe', 'VBP'), ('pursuing', 'VBG'), ('cricket', 'NN')]
PoS on raw text
[('People', 'NNS'), ('like', 'IN'), ('to', 'TO'), ('see', 'VB'), ('Sachin', 'NNP'), ('play', 'NN'), ('.', '.'), ('It', 'PRP'), ('always', 'RB'), ('gets', 'VBZ'), ('them', 'PRP'), ('interested', 'JJ'), ('in', 'IN'), ('Cricket', 'NN'), ('!', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('good', 'JJ'), ('to', 'TO'), ('see', 'VB'), ('people

Spell Correction

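Named entities should be excluded from spell correction; otherwise names get "corrected" into dictionary words, as happens with "sachin" becoming "machin" in the output below.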
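One way to sketch this is to skip tokens tagged as proper nouns. Using the `NNP`/`NNPS` tags as a proxy for named entities is an assumption made here for simplicity, and `toy_spell` is a hypothetical stand-in for a real spell checker.

```python
# Hypothetical stand-in for a spell checker: a tiny lookup of "corrections".
TOY_CORRECTIONS = {"memoris": "memories", "sachin": "machin"}

def toy_spell(word):
    return TOY_CORRECTIONS.get(word, word)

def correct_non_entities(tagged_tokens, spell_fn, protected_tags=("NNP", "NNPS")):
    """Spell-correct every token except those tagged as proper nouns."""
    return [[word if tag in protected_tags else spell_fn(word), tag]
            for word, tag in tagged_tokens]

tagged = [["sachin", "NNP"], ["memoris", "NN"], ["bat", "NN"]]
corrected = correct_non_entities(tagged, toy_spell)
print(corrected)
```

The name is left untouched while the genuine typo is still fixed. A fuller version could protect the tokens found by the NER step instead of relying on PoS tags alone.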

In [6]:
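# Note: this is the older autocorrect API; newer releases expose a Speller class instead of spell()
from autocorrect import spell
import re

print(spell('hte'))
print(spell('caaaar'))
# Collapse any run of a repeated character to two before correcting ("caaaar" -> "caar")
print(spell(re.sub(r'(.)\1+', r'\1\1', "caaaar")))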

the
caesar
car
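The substitution used in the last call can be examined on its own. The sketch below wraps it in a hypothetical helper, `squeeze_repeats`, to show what the spell checker actually receives.

```python
import re

def squeeze_repeats(word):
    """Collapse any run of two or more identical characters down to exactly two."""
    return re.sub(r"(.)\1+", r"\1\1", word)

examples = ["caaaar", "coooool", "happy", "hte"]
squeezed = {w: squeeze_repeats(w) for w in examples}
for original, result in squeezed.items():
    print(original, "->", result)
```

Keeping two characters rather than one preserves legitimate double letters such as the "pp" in "happy", while bringing elongated words close enough to a dictionary word for the spell checker to handle.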


In [7]:
spell_corrected = [[spell(re.sub(r'(.)\1+', r'\1\1', word)), tag] for word, tag in tagged_words]
print(spell_corrected)

[['people', 'NNS'], ['like', 'IN'], ['see', 'VB'], ['machin', 'NNP'], ['play', 'NN'], ['it', 'PRP'], ['always', 'RB'], ['gets', 'VBZ'], ['interested', 'JJ'], ['cricket', 'NN'], ['it', 'PRP'], ['good', 'JJ'], ['see', 'VB'], ['people', 'NNS'], ['show', 'VBP'], ['interest', 'NN'], ['they', 'PRP'], ['love', 'VBP'], ['machin', 'NNP'], ['he', 'PRP'], ['given', 'VBN'], ['much', 'JJ'], ['game', 'NN'], ['his', 'PRP$'], ['strokes', 'NNS'], ['MRF', 'NNP'], ['bat', 'NN'], ['etched', 'VBN'], ['people', 'NNS'], ['memories', 'NN'], ['many', 'JJ'], ['current', 'JJ'], ['cricketers', 'NNS'], ['owe', 'VBP'], ['pursuing', 'VBG'], ['cricket', 'NN']]


Stemming and Lemmatization

In [8]:
from nltk.corpus import wordnet

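def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the WordNet PoS constant the lemmatizer expects."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun, which is also the lemmatizer's default
        return wordnet.NOUN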

In [9]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize the spell-corrected tokens, passing the PoS tag as a hint
lemmatized_words = [[lemmatizer.lemmatize(word, get_wordnet_pos(tag)), tag] for word, tag in spell_corrected]
print(lemmatized_words)

# For comparison: lemmatize the cleaned tokens without a PoS hint (every word is treated as a noun)
lemmatized_without_pos = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_without_pos)

[['people', 'NNS'], ['like', 'IN'], ['see', 'VB'], ['machin', 'NNP'], ['play', 'NN'], ['it', 'PRP'], ['always', 'RB'], ['get', 'VBZ'], ['interested', 'JJ'], ['cricket', 'NN'], ['it', 'PRP'], ['good', 'JJ'], ['see', 'VB'], ['people', 'NNS'], ['show', 'VBP'], ['interest', 'NN'], ['they', 'PRP'], ['love', 'VBP'], ['machin', 'NNP'], ['he', 'PRP'], ['give', 'VBN'], ['much', 'JJ'], ['game', 'NN'], ['his', 'PRP$'], ['stroke', 'NNS'], ['MRF', 'NNP'], ['bat', 'NN'], ['etch', 'VBN'], ['people', 'NNS'], ['memory', 'NN'], ['many', 'JJ'], ['current', 'JJ'], ['cricketer', 'NNS'], ['owe', 'VBP'], ['pursue', 'VBG'], ['cricket', 'NN']]
['people', 'like', 'see', 'sachin', 'play', 'it', 'always', 'get', 'interested', 'cricket', 'it', 'good', 'see', 'people', 'show', 'interest', 'they', 'love', 'sachin', 'he', 'given', 'much', 'game', 'his', 'stroke', 'mrf', 'bat', 'etched', 'people', 'memoris', 'many', 'current', 'cricketer', 'owe', 'pursuing', 'cricket']


In [10]:
from nltk.stem import PorterStemmer
st = PorterStemmer()

stemmed_words = [st.stem(word) for word, _tag in spell_corrected]
print(stemmed_words)

['peopl', 'like', 'see', 'machin', 'play', 'it', 'alway', 'get', 'interest', 'cricket', 'it', 'good', 'see', 'peopl', 'show', 'interest', 'they', 'love', 'machin', 'he', 'given', 'much', 'game', 'hi', 'stroke', 'mrf', 'bat', 'etch', 'peopl', 'memori', 'mani', 'current', 'cricket', 'owe', 'pursu', 'cricket']
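To see where the two approaches diverge, the sketch below lines up a few tokens with the lemmas and stems printed in the outputs above. The rows are copied by hand from those outputs; nothing is recomputed here.

```python
# (spell-corrected token, lemma with PoS hint, Porter stem), copied from the outputs above.
rows = [
    ("people",     "people",     "peopl"),
    ("always",     "always",     "alway"),
    ("gets",       "get",        "get"),
    ("interested", "interested", "interest"),
    ("given",      "give",       "given"),
    ("his",        "his",        "hi"),
    ("strokes",    "stroke",     "stroke"),
    ("memories",   "memory",     "memori"),
    ("cricketers", "cricketer",  "cricket"),
    ("pursuing",   "pursue",     "pursu"),
]

# Tokens for which stemming and lemmatization disagree.
differing = [(tok, lem, stem) for tok, lem, stem in rows if lem != stem]
for tok, lem, stem in differing:
    print(f"{tok:<11} lemma={lem:<11} stem={stem}")
```

The stemmer strips suffixes by rule, so it can produce non-words ("peopl", "hi") and merge distinct words ("cricketers" and "cricket"), whereas the lemmatizer returns dictionary forms but depends on the PoS tag being right.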
