<center><h1>Data Curation and NLP - Big Data Technology</h1><br /></center>
<center>Tutorial 3 - Fred Amouzgar</center>

<center><img src="https://github.com/FredAmouzgar/BigData/blob/master/pics/NLP_wordmap.jpg?raw=true"></center>

<h3>NLP Libraries in Python (data pre-processing, Tokenisation and Curation)</h3><br />
1- NLTK: the most widely-mentioned NLP library<br />
2- SpaCy:  an industrial-strength NLP library built for performance<br />
3- Stanford CoreNLP: written in Java, with a Python wrapper; well known for its speed<br />
4- TextBlob: a user-friendly and intuitive NLTK interface<br />
5- Polyglot: an open-source NLP library<br />

<center><img src="https://github.com/FredAmouzgar/BigData/blob/master/pics/NLTK.png?raw=true"></center>

<h3>NLTK – Natural Language ToolKit</h3><br />
The most popular Python NLP library.<br />
It is a suite of libraries for statistical and symbolic Natural Language Processing.<br />
NLTK is mostly an educational and research tool.<br />
It is well suited to prototyping but heavy for production use.<br />
Relevant components:
- Lexical analysis: word and text tokenisers
- Part-of-Speech (POS) tagging: assigns a grammatical category to each word
- Named Entity Recognition (NER): locates and classifies named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values and percentages

<center><img src="https://github.com/FredAmouzgar/BigData/blob/master/pics/nlu_nlp.png?raw=true" width="600px" height="300px"></center>

<h3>NLTK Installation</h3><br />
- Find the Anaconda installation directory (Python and pip are usually under the Scripts folder)<br />
- Use the pip or conda command to install NLTK<br />
- Then download the corpora and components as shown below<br />
<center><img src="https://github.com/FredAmouzgar/BigData/blob/master/pics/nltk_install.png?raw=true"></center>
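
The download step can also be scripted instead of being done interactively. The sketch below is a minimal illustration, not the tutorial's own code: the helper name `download_resources` is hypothetical, and the resource names are assumptions that match older NLTK releases (newer releases may ask for variants such as `punkt_tab`).

```python
# Assumed resource names for tokenising, POS tagging and NE chunking.
# They may differ between NLTK versions; adjust them if NLTK reports a missing resource.
REQUIRED_RESOURCES = ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]


def download_resources(resources=REQUIRED_RESOURCES, downloader=None):
    """Try to download each resource and return the names that failed."""
    if downloader is None:
        import nltk  # imported lazily so the helper itself has no hard dependency
        downloader = nltk.download
    failed = []
    for name in resources:
        # nltk.download returns True on success and False otherwise.
        if not downloader(name):
            failed.append(name)
    return failed
```

Calling `download_resources()` once after installing NLTK should leave an empty list if everything was fetched.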

<h3>NLTK Tokenization</h3>

In [12]:
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize import TreebankWordTokenizer, RegexpTokenizer

sentence1 = "This is a normal sentence. This is the second sentence."
word_list = word_tokenize(sentence1)  # Returns a list of words
sents = sent_tokenize(sentence1) # Returns a list of sentences
print(sents, " has ", len(sents), " elements.")

sentence2 = "That's a good dog."
tokenizer = TreebankWordTokenizer()
sent2_words = tokenizer.tokenize(sentence2) # Returns a list like this: ['That', "'s", 'a', 'good', 'dog', '.']
print(sent2_words, " has ", len(sent2_words), " elements.")

sentence3 = "This/is-a.very+badly)formatted,,sentence…"
regex_tokenizer = RegexpTokenizer(r'\w+') # Dropping everything but alphanumerics
sent3_words = regex_tokenizer.tokenize(sentence3) # Returns a list of words, discarding all the other chars.
print(sent3_words, " has ", len(sent3_words), " elements.")

['This is a normal sentence.', 'This is the second sentence.']  has  2  elements.
['That', "'s", 'a', 'good', 'dog', '.']  has  6  elements.
['This', 'is', 'a', 'very', 'badly', 'formatted', 'sentence']  has  7  elements.
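
To see why the third tokenizer keeps only seven tokens, the same pattern can be tried with the standard library's `re` module. This is a sketch of the idea behind a `\w+` pattern, not a description of NLTK's internals.

```python
import re

sentence3 = "This/is-a.very+badly)formatted,,sentence…"

# \w+ matches runs of letters, digits and underscores;
# every other character (slashes, commas, the ellipsis) acts as a separator.
tokens = re.findall(r"\w+", sentence3)
print(tokens, "has", len(tokens), "elements.")
```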


<h3>NLTK – Part-of-Speech (POS) Tagging</h3>

In [14]:
import nltk
comment = "Fit my galaxy s2 perfectly even with the case on it. Comfortable enough to wear and can still use the screen through the plastic"
text = nltk.word_tokenize(comment)
p_tags = nltk.pos_tag(text)  # The MAIN tagger: returns a list of (word, tag) tuples
nouns = []
for word, tag in p_tags:
    print((word, tag))
    if tag == 'NN':  # Keep singular common nouns only
        nouns.append(word)
csv = ",".join(nouns)
csv  # Consists of a comma-separated string:
# 'galaxy,s2,case,screen,plastic'

('Fit', 'NNP')
('my', 'PRP$')
('galaxy', 'NN')
('s2', 'NN')
('perfectly', 'RB')
('even', 'RB')
('with', 'IN')
('the', 'DT')
('case', 'NN')
('on', 'IN')
('it', 'PRP')
('.', '.')
('Comfortable', 'JJ')
('enough', 'RB')
('to', 'TO')
('wear', 'VB')
('and', 'CC')
('can', 'MD')
('still', 'RB')
('use', 'VB')
('the', 'DT')
('screen', 'NN')
('through', 'IN')
('the', 'DT')
('plastic', 'NN')


'galaxy,s2,case,screen,plastic'
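
The tagger output is a plain list of (word, tag) tuples, so it can be regrouped with ordinary Python. The sketch below copies a few pairs from the output above and groups the words by tag; the variable name `words_by_tag` is only illustrative.

```python
from collections import defaultdict

# A few (word, tag) pairs copied from the tagger output above.
p_tags = [("my", "PRP$"), ("galaxy", "NN"), ("s2", "NN"), ("perfectly", "RB"),
          ("the", "DT"), ("case", "NN"), ("wear", "VB"), ("use", "VB"),
          ("screen", "NN"), ("plastic", "NN")]

# Collect the words under each tag, preserving their order in the text.
words_by_tag = defaultdict(list)
for word, tag in p_tags:
    words_by_tag[tag].append(word)

print(dict(words_by_tag))
```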

<h3>NLTK POS tag meanings</h3><br />
CC coordinating conjunction<br />
CD cardinal number<br />
DT determiner<br />
EX existential there (like: "there is" ... think of it like "there exists")<br />
FW foreign word<br />
IN preposition/subordinating conjunction<br />
JJ adjective 'big'<br />
JJR adjective, comparative 'bigger'<br />
JJS adjective, superlative 'biggest'<br />
LS list marker 1)<br />
MD modal could, will<br />
NN noun, singular 'desk'<br />
NNS noun plural 'desks'<br />
NNP proper noun, singular 'Harrison'<br />
NNPS proper noun, plural 'Americans'<br />
PDT predeterminer 'all the kids'<br />
POS possessive ending parent's<br />
PRP personal pronoun I, he, she<br />
PRP\$ possessive pronoun my, his, her<br />
RB adverb very, silently<br />
RBR adverb, comparative better<br />
RBS adverb, superlative best<br />
RP particle give up<br />
TO to go 'to' the store.<br />
UH interjection errrrrrrrm<br />
VB verb, base form take<br />
VBD verb, past tense took<br />
VBG verb, gerund/present participle taking<br />
VBN verb, past participle taken<br />
VBP verb, sing. present, non-3rd person take<br />
VBZ verb, 3rd person sing. present takes<br />
WDT wh-determiner which<br />
WP wh-pronoun who, what<br />
WP\$ possessive wh-pronoun whose<br />
WRB wh-adverb where, when<br />
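
Many of the tags above share a prefix: NN, NNS, NNP and NNPS are all nouns, and every verb tag starts with VB. The sketch below uses that regularity to collapse a tag into a coarse category. The helper `coarse_category` and its labels are hypothetical and are not part of NLTK.

```python
# Hypothetical mapping from tag prefix to a coarse word class.
COARSE_PREFIXES = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}


def coarse_category(tag):
    """Return a coarse word class for a Penn Treebank style tag."""
    for prefix, label in COARSE_PREFIXES.items():
        if tag.startswith(prefix):
            return label
    return "other"  # determiners, pronouns, prepositions, punctuation, ...


print(coarse_category("NNPS"), coarse_category("VBZ"), coarse_category("DT"))
```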

<h3>NLTK – Named Entity Recognition</h3><br />
- NER is the first step towards information extraction from unstructured text.
- It identifies the spans of text that refer to real-world entities (Person, Organization, Event) and assigns each one a category.

In [17]:
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "Amin and Fred are working for Macquarie university since 2017."
print(ne_chunk(pos_tag(word_tokenize(sentence)))) 

(S
  (PERSON Amin/NNP)
  and/CC
  (GPE Fred/NNP)
  are/VBP
  working/VBG
  for/IN
  (ORGANIZATION Macquarie/NNP)
  university/NN
  since/IN
  2017/CD
  ./.)
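
In the output above, Fred is labelled GPE (a geo-political entity) rather than PERSON, a reminder that the chunker's labels should be checked rather than trusted blindly. To work with the entities programmatically, the tree can be flattened into (text, label) pairs. The sketch below assumes the tree behaves like the `ne_chunk` result: iterating it yields either labelled subtrees or plain (word, tag) tuples. The helper names are hypothetical.

```python
def extract_entities(tree):
    """Return (entity text, label) pairs from an ne_chunk style tree."""
    entities = []
    for node in tree:
        # Entity chunks are subtrees exposing label() and leaves();
        # ordinary tokens are plain (word, tag) tuples without a label.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


def demo():
    """Run the tutorial sentence through NLTK (requires NLTK and its resources)."""
    from nltk import word_tokenize, pos_tag, ne_chunk
    sentence = "Amin and Fred are working for Macquarie university since 2017."
    return extract_entities(ne_chunk(pos_tag(word_tokenize(sentence))))
```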


<h2>Practical</h2><br /><hr />
<h3>Database:</h3><br />
1- Download the Amazon JSON file (<a href="https://drive.google.com/file/d/1VKFoztinuLthitrSxPva8TcKPghcPR0B/view?usp=sharing">Amazon.zip</a>) and extract it. It has 194,439 reviews for different cell phones.<br /><br />

2- Import it into MongoDB, using “amazon” as the database name and “reviews” as the collection name. Please note that this file is <b>NOT</b> a JSON array, so the <code>--jsonArray</code> option of the <code>mongoimport</code> command is not necessary.<br /><br />

3- From the Mongo command line, write a query that counts the imported reviews.<br /><br />
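
As a rough guide to the import and count steps, the sketch below builds the `mongoimport` command as an argument list and counts the documents from Python with pymongo. It rests on several assumptions: the file name `reviews.json` is a placeholder (the name of the extracted file is not given here), MongoDB is assumed to run locally on the default port, and the helper names are hypothetical.

```python
def build_import_command(json_path, db="amazon", collection="reviews"):
    """Return the mongoimport argument list (no --jsonArray, one document per line)."""
    return ["mongoimport", "--db", db, "--collection", collection, "--file", json_path]


def count_reviews(uri="mongodb://localhost:27017"):
    """Count the imported reviews; requires pymongo and a running MongoDB server."""
    from pymongo import MongoClient  # imported lazily: only needed for the live count
    client = MongoClient(uri)
    return client["amazon"]["reviews"].count_documents({})


# "reviews.json" is a placeholder for the extracted file.
print(" ".join(build_import_command("reviews.json")))
```

The printed command can be pasted into a terminal, or the list can be passed to `subprocess.run`.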

<hr />
<h3>Application:</h3><br /><br />
1- Develop a Python application that connects to the amazon database and queries all the reviews that contain “Nokia” in their “reviewText” field.<br /><br />

2- Use NLTK to identify all the keywords (verbs, adjectives and nouns) in the “reviewText” field.<br /><br />


3- Put the keywords in a comma-separated string and add it to the corresponding review in the database as a new field. Use the format below: 
“reviewKeywords”:“searching,battery,baby,monitor,Levana,BABYVIEW20,found,uses,Nokia,battery,was,nervous”<br /><br />

4- (Optional) Use Named Entity Recognition in NLTK to recognise the named entities among the extracted keywords. Add another field called Entities that contains these entities.<br /><br />
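
One possible shape for steps 1 to 3 is sketched below. It is a starting point under stated assumptions, not a reference solution: the helper names are hypothetical, keywords are taken to be tokens whose tag starts with NN, VB or JJ, and the `$regex` match on “Nokia” is case-sensitive. The `collection` argument is expected to be a pymongo collection such as `MongoClient()["amazon"]["reviews"]`.

```python
KEYWORD_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs and adjectives


def keywords_from_tags(tagged_tokens):
    """Join the words whose POS tag marks a noun, verb or adjective."""
    return ",".join(word for word, tag in tagged_tokens
                    if tag.startswith(KEYWORD_PREFIXES))


def nltk_tagger(text):
    """Tokenise and tag a review with NLTK (requires NLTK and its resources)."""
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(text))


def add_keywords(collection, tagger=nltk_tagger):
    """Add a reviewKeywords field to every review that mentions Nokia."""
    query = {"reviewText": {"$regex": "Nokia"}}
    updated = 0
    for review in collection.find(query):
        keywords = keywords_from_tags(tagger(review["reviewText"]))
        # $set adds the field without touching the rest of the document.
        collection.update_one({"_id": review["_id"]},
                              {"$set": {"reviewKeywords": keywords}})
        updated += 1
    return updated
```

Passing the tagger as a parameter keeps the database logic separate from NLTK, which makes the function easy to try on a handful of documents first.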

<h3>Testing:</h3>

- Query the database and check that all the Nokia reviews have the newly added field.
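
One way to run this check from Python is to count the Nokia reviews that still lack the field and confirm the count is zero. The sketch below assumes a pymongo collection and the field name “reviewKeywords”; the helper names are hypothetical.

```python
def missing_keywords_query(term="Nokia", field="reviewKeywords"):
    """Build a filter matching reviews that mention the term but lack the field."""
    return {"reviewText": {"$regex": term}, field: {"$exists": False}}


def all_reviews_annotated(collection):
    """Return True when no matching review is missing the keywords field."""
    return collection.count_documents(missing_keywords_query()) == 0
```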