# Feature Engineering: Text Mining

* Load packages and make some data
* Frequency Counts
* Stemming
* Stop Words
* Part of Speech Tagging (POS)
* Named Entity Recognition
* Chunking

#### Required Packages and Text Data

In [71]:
# import packages
import pandas as pd
import numpy as np
import os
import nltk #natural language processing toolkit
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
import nltk.corpus  # bundled corpora (stop-word lists, WordNet, ...)
from nltk.tokenize import word_tokenize  # split a string into word and punctuation tokens
from nltk.probability import FreqDist  # frequency distribution over tokens
from nltk.stem import PorterStemmer  # stemming
from nltk.stem import LancasterStemmer  # stemming
from nltk.stem import WordNetLemmatizer  # lemmatization (dictionary-based, not stemming)
from nltk.corpus import stopwords  # stop words such as "the", "a", "of"
from nltk import ne_chunk  # named-entity chunker; expects tokenized, POS-tagged input

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gryka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gryka\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gryka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\gryka\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\gryka\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\gryka\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-

In [29]:
# some text and tokenizing it
text = "It prepare is it nothing blushes up brought wait. Or as gravity pasture limited evening on. Wise busy past both park when waiting an ye no. Nay likely her length sooner thrown waited lively income. The expense give gives giving given windows adapted sir. Wrong widen drawn ample eat off doors money. Offending wait waits belonging promotion provision an be oh consulted ourselves it. Blessing welcomed ladyship she met humoured sir breeding her. Six curiosity day assurance bed necessary."
# some random text from https://www.randomtextgenerator.com/
token = word_tokenize(text)

#### Frequency Counts

In [21]:
# build the frequency distribution of the tokens
fdist = FreqDist(token)
fdist

FreqDist({'.': 9, 'it': 2, 'wait': 2, 'an': 2, 'her': 2, 'sir': 2, 'It': 1, 'prepare': 1, 'is': 1, 'nothing': 1, ...})

In [22]:
# the ten most common tokens and their counts
fdist1 = fdist.most_common(10)
fdist1

[('.', 9),
 ('it', 2),
 ('wait', 2),
 ('an', 2),
 ('her', 2),
 ('sir', 2),
 ('It', 1),
 ('prepare', 1),
 ('is', 1),
 ('nothing', 1)]
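
The counts above treat `It` and `it` as different tokens, and the full stop is the most frequent item. A minimal sketch of normalizing before counting, assuming that lowercasing and dropping non-alphabetic tokens is acceptable for the task. It uses the standard library's `collections.Counter` (which `FreqDist` subclasses), and `sample_tokens` is a short hand-written list standing in for `token`:

```python
from collections import Counter

# Hypothetical stand-in for the `token` list produced by word_tokenize above.
sample_tokens = ["It", "prepare", "is", "it", "nothing", ".",
                 "wait", ".", "Wait", "waits", "."]

# Lowercase every token and keep only purely alphabetic ones (drops ".").
normalized = [tok.lower() for tok in sample_tokens if tok.isalpha()]

counts = Counter(normalized)
print(counts.most_common(3))
```

The same list comprehension can be applied to `token` before passing it to `FreqDist`.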

#### Stemming

Compare two stemmers (Porter and Lancaster) and a lemmatizer on a few inflected words.

In [38]:
# Stem the word "waiting" with PorterStemmer()
pst = PorterStemmer()
print(pst.stem("waiting"), "\n")

# Stem a list of words with PorterStemmer()
stm = ["waited", "waiting", "waits"]
for word in stm:
    print(word + ": " + pst.stem(word))

wait 

waited: wait
waiting: wait
waits: wait


In [34]:
# Use LancasterStemmer() to stem a list of words.
lst = LancasterStemmer()
stm = ["giving", "given", "given", "gave"]
for word in stm:
    print(word + ": " + lst.stem(word))

giving: giv
given: giv
given: giv
gave: gav
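
The two stemmers do not always agree, and Lancaster is generally the more aggressive of the two. A small sketch that runs both on the same words so the stems can be compared side by side; the word list and the `stems` dictionary are only illustrations, not part of the original notebook:

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Illustrative word list mixing the examples used in the two cells above.
words = ["waiting", "giving", "given", "gave"]

# Map each word to its (Porter stem, Lancaster stem) pair.
stems = {word: (porter.stem(word), lancaster.stem(word)) for word in words}

for word, (p_stem, l_stem) in stems.items():
    print(f"{word:10} porter={p_stem:10} lancaster={l_stem}")
```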


In [49]:
# Use WordNetLemmatizer to reduce words to their dictionary form (lemma)

lemmatizer = WordNetLemmatizer()

print("rocks:", lemmatizer.lemmatize("rocks"))
print("corpora:", lemmatizer.lemmatize("corpora"))

rocks: rock
corpora: corpus


#### Stop Words

In [60]:
# Remove English stop words from a tokenized sentence
stop_words = set(stopwords.words('english'))
text = "Cristiano Ronaldo was born on February 5, 1985, in Funchal, Madeira, Portugal."
text1 = word_tokenize(text.lower())
print(text1)
# Named `filtered` rather than `stopwords` so the imported corpus reader is not overwritten
filtered = [x for x in text1 if x not in stop_words]
print(filtered)

['cristiano', 'ronaldo', 'was', 'born', 'on', 'february', '5', ',', '1985', ',', 'in', 'funchal', ',', 'madeira', ',', 'portugal', '.']
['cristiano', 'ronaldo', 'born', 'february', '5', ',', '1985', ',', 'funchal', ',', 'madeira', ',', 'portugal', '.']
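
The filtered list above still contains punctuation tokens such as `,` and `.`. A sketch of dropping them in the same pass, assuming tokens with no letters or digits are not wanted. The three-word `stop_set` is a hypothetical stand-in for `stopwords.words('english')` so the snippet runs without NLTK data:

```python
# Tokens copied from the printed output above.
tokens = ['cristiano', 'ronaldo', 'was', 'born', 'on', 'february', '5', ',',
          '1985', ',', 'in', 'funchal', ',', 'madeira', ',', 'portugal', '.']

# Hypothetical stand-in for set(stopwords.words('english')).
stop_set = {"was", "on", "in"}

# Keep a token only if it is not a stop word and is alphanumeric
# (str.isalnum() is False for "," and ".").
content_tokens = [tok for tok in tokens if tok not in stop_set and tok.isalnum()]
print(content_tokens)
```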


#### Part of Speech Tagging (POS)

In [63]:
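text = "vote to choose a particular man or a group (party) to represent them in parliament"
tex = word_tokenize(text)  # tokenize the text
# Each token is tagged on its own here, so the tagger sees no context;
# that is why verbs such as "choose" and "represent" come out as NN below.
# Passing the whole list in one call, nltk.pos_tag(tex), lets it use neighbouring words.
for token in tex:
    print(nltk.pos_tag([token]))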

[('vote', 'NN')]
[('to', 'TO')]
[('choose', 'NN')]
[('a', 'DT')]
[('particular', 'JJ')]
[('man', 'NN')]
[('or', 'CC')]
[('a', 'DT')]
[('group', 'NN')]
[('(', '(')]
[('party', 'NN')]
[(')', ')')]
[('to', 'TO')]
[('represent', 'NN')]
[('them', 'PRP')]
[('in', 'IN')]
[('parliament', 'NN')]


#### Named Entity Recognition

In [72]:
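# Tokenize, POS-tag, then chunk named entities.
# Note: the curly apostrophe in "Google’s" is split into separate tokens (see the output).
text = "Google’s CEO Sundar Pichai introduced the new Pixel at Minnesota Roi Centre Event"
token = word_tokenize(text)
tags = nltk.pos_tag(token)
chunk = ne_chunk(tags)
chunk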

Tree('S', [Tree('PERSON', [('Google', 'NNP')]), ('’', 'NNP'), ('s', 'VBD'), Tree('ORGANIZATION', [('CEO', 'NNP'), ('Sundar', 'NNP'), ('Pichai', 'NNP')]), ('introduced', 'VBD'), ('the', 'DT'), ('new', 'JJ'), ('Pixel', 'NNP'), ('at', 'IN'), Tree('ORGANIZATION', [('Minnesota', 'NNP'), ('Roi', 'NNP'), ('Centre', 'NNP')]), ('Event', 'NNP')])
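
The result is an `nltk.Tree` in which named entities are subtrees and all other tokens are plain `(word, tag)` tuples. A sketch of pulling out the entity labels and their text, with the tree rebuilt by hand from the output above so that no NLTK models are needed; the `entities` list is an illustrative name. The labels are the chunker's guesses (it tags `Google` as `PERSON` here), so they should be checked rather than trusted.

```python
from nltk.tree import Tree

# Tree rebuilt by hand from the printed output above.
chunk = Tree('S', [
    Tree('PERSON', [('Google', 'NNP')]),
    ('’', 'NNP'), ('s', 'VBD'),
    Tree('ORGANIZATION', [('CEO', 'NNP'), ('Sundar', 'NNP'), ('Pichai', 'NNP')]),
    ('introduced', 'VBD'), ('the', 'DT'), ('new', 'JJ'), ('Pixel', 'NNP'), ('at', 'IN'),
    Tree('ORGANIZATION', [('Minnesota', 'NNP'), ('Roi', 'NNP'), ('Centre', 'NNP')]),
    ('Event', 'NNP'),
])

# Children that are Tree objects are named entities; tuples are untagged tokens.
entities = [
    (child.label(), " ".join(word for word, tag in child.leaves()))
    for child in chunk
    if isinstance(child, Tree)
]
print(entities)
```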

#### Chunking

In [76]:
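text = "We saw the yellow dog"
token = word_tokenize(text)
tags = nltk.pos_tag(token)
# NP chunk: an optional determiner, any number of adjectives, then a noun tagged NN
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
result = parser.parse(tags)
print(result)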

(S We/PRP saw/VBD (NP the/DT yellow/JJ dog/NN))
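
The same grammar can match more than one chunk per sentence, and the matched phrases can be read back from the tree. A minimal sketch, assuming hand-tagged tokens (so the tagger model is not needed); the sentence and the `noun_phrases` name are illustrative only:

```python
import nltk

# Hand-tagged tokens (an assumption, to avoid loading the POS tagger model).
tagged = [("The", "DT"), ("little", "JJ"), ("cat", "NN"), ("chased", "VBD"),
          ("a", "DT"), ("mouse", "NN")]

parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect the words under every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```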
