# StoryAI Tests 
The goals of this notebook are to:
* learn how to recognize important words in a paragraph
* find and plot the images from the internet that are most representative of these words

These are intermediate steps toward the ultimate project goal.

NLP Toolkit: https://www.nltk.org
Hide's Lit Review: https://datascience.stackexchange.com/questions/5316/general-approach-to-extract-key-text-from-sentence-nlp

https://github.com/keon/awesome-nlp

Step One: Recognize important words in a paragraph!

Notes:

"You can consider using OpenNLP / StanfordNLP for Part of Speech"
Shallow NLP technique steps:

1) Convert the sentence to lowercase

2) Remove stopwords (common words in a language, such as for, very, and, of, are)

3) Extract n-grams, i.e., contiguous sequences of n items from the text (increasing n lets the model capture more context)

4) Assign a syntactic label (noun, verb, etc.) to each word

5) Extract knowledge from the text through semantic/syntactic analysis, i.e., try to retain the words that carry more weight in a sentence, such as nouns and verbs

1-gram, 2-gram, 3-gram
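
Shallow steps 1 to 3 above can be sketched in plain Python, without any NLP library. This is only an illustration: the stop-word set is a tiny hand-picked stand-in (an assumption, not NLTK's list), and `tokenize`, `remove_stop_words`, and `ngrams` are hypothetical helper names.

```python
import re

# Tiny illustrative stop-word set (assumption: a real run would use a full
# list, such as the one shipped with NLTK).
STOP_WORDS = {"the", "on", "while", "a", "an", "and", "of", "for", "are", "very"}


def tokenize(text):
    """Lowercase the text and keep only runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Arthur, the fun westie, rolled on the grass while the squirrel sneeked past."
tokens = remove_stop_words(tokenize(sentence))
for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

Larger values of n keep more context per item but produce fewer, sparser n-grams, so 1-grams to 3-grams are a common starting range.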

A deep NLP technique will give better results: rather than relying on n-grams, detect the relationships within each sentence and represent them as a more complex construction that retains the context. For additional info, please refer to https://stats.stackexchange.com/a/133680/66708

1) Get the syntactic relationship between each pair of words. 
2) Apply sentence segmentation to determine the sentence boundaries.
3) A parser is then applied to generate dependency relations, which represent the syntactic relations within each sentence.


In [2]:
test_sentence = "Arthur, the fun westie, rolled on the grass while the squirrel sneeked past."

In [2]:
# Step 1: convert the sentence to lowercase
test_lower = test_sentence.lower()
print(test_lower)

arthur, the fun westie, rolled on the grass while the squirrel sneeked past.


In [11]:
import nltk
#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('brown')
#nltk.download('averaged_perceptron_tagger')

In [4]:
# Step 2: tokenize and remove stop words
stop_words = set(nltk.corpus.stopwords.words('english'))
word_tokens = nltk.tokenize.word_tokenize(test_lower)
filtered_sentence = [word for word in word_tokens if word not in stop_words]
print(test_lower.split())   # naive whitespace split keeps punctuation attached
print(word_tokens)          # the NLTK tokenizer separates punctuation
print(filtered_sentence)

['arthur,', 'the', 'fun', 'westie,', 'rolled', 'on', 'the', 'grass', 'while', 'the', 'squirrel', 'sneeked', 'past.']
['arthur', ',', 'the', 'fun', 'westie', ',', 'rolled', 'on', 'the', 'grass', 'while', 'the', 'squirrel', 'sneeked', 'past', '.']
['arthur', ',', 'fun', 'westie', ',', 'rolled', 'grass', 'squirrel', 'sneeked', 'past', '.']


In [41]:
# Tagger: train an n-gram tagger chain on the Brown news corpus
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

# Regular-expression tagger used as the last-resort backoff;
# its final rule tags anything unmatched as NN
backoff = nltk.RegexpTagger([
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'(The|the|A|a|An|an)$', 'AT'),   # articles
        (r'.*able$', 'JJ'),                # adjectives
        (r'.*ness$', 'NN'),                # nouns formed from adjectives
        (r'.*ly$', 'RB'),                  # adverbs
        (r'.*s$', 'NNS'),                  # plural nouns
        (r'.*ing$', 'VBG'),                # gerunds
        (r'.*ed$', 'VBD'),                 # past tense verbs
        (r'.*', 'NN')                      # nouns (default)
        ])
train_sents = brown_tagged_sents[:]
# Chain the taggers: bigram -> unigram -> regular expressions
t1 = nltk.UnigramTagger(train_sents, backoff=backoff)
t2 = nltk.BigramTagger(train_sents, backoff=t1)



#t3 = nltk.tag.brill.BrillTaggerTrainer()

#t3 = nltk.tag.brill.BrillTagger(t2,)
#t2.tag(filtered_sentence)



In [46]:
# Part of speech tagger, best one
nltk.pos_tag(filtered_sentence)

[('arthur', 'NN'),
 (',', ','),
 ('fun', 'NN'),
 ('westie', 'NN'),
 (',', ','),
 ('rolled', 'VBD'),
 ('grass', 'NN'),
 ('squirrel', 'NN'),
 ('sneeked', 'VBD'),
 ('past', 'JJ'),
 ('.', '.')]
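
As a rough sketch of shallow step 5 (retain the words that carry the most weight, such as nouns and verbs), the tagged output above can be filtered by tag prefix. The tagged list is copied from the `nltk.pos_tag` output above; `KEEP_PREFIXES` and `keep_content_words` are hypothetical names chosen for this illustration.

```python
# Tagged tokens copied from the nltk.pos_tag output above.
tagged = [
    ("arthur", "NN"), (",", ","), ("fun", "NN"), ("westie", "NN"), (",", ","),
    ("rolled", "VBD"), ("grass", "NN"), ("squirrel", "NN"),
    ("sneeked", "VBD"), ("past", "JJ"), (".", "."),
]

# Penn Treebank noun tags start with "NN" and verb tags with "VB".
KEEP_PREFIXES = ("NN", "VB")


def keep_content_words(tagged_tokens, prefixes=KEEP_PREFIXES):
    """Keep only the words whose tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefixes)]


important = keep_content_words(tagged)
print(important)
```

Matching on the tag prefix rather than on exact tags also keeps variants such as NNS or VBG without listing each one.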

In [8]:
# Use TextBlob instead of calling NLTK directly
from textblob import TextBlob
from textblob.taggers import NLTKTagger

blob = TextBlob(test_sentence, pos_tagger=NLTKTagger())

In [26]:
print(blob.tags)
d = dict(blob.tags)  # word -> POS tag
print(blob.noun_phrases)
print(blob.words)
# Every trigram alongside the POS tags of its words
print([(' '.join(phrase), ' '.join(d[word] for word in phrase)) for phrase in blob.ngrams(n=3)])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)

# Drop stop words by keeping only common nouns (NN); don't grab NNP for now
protected_words = set(' '.join(blob.noun_phrases).split())
sentence_filter2 = [word for word, pos in blob.tags if pos in ['NN']] # ,'VBN','VBD'
print(sentence_filter2)
filter_sentence_final = []
# Keep a noun phrase only if every one of its words was tagged NN
# (compare whole words, not substrings of the phrase)
for phrase in blob.noun_phrases:
    if set(phrase.split()) <= set(sentence_filter2):
        filter_sentence_final.append(phrase)
# Add the remaining nouns that are not part of any noun phrase, in sentence order
filter_sentence_final.extend(dict.fromkeys(word for word in sentence_filter2 if word not in protected_words))
print(filter_sentence_final)

[('Arthur', 'NNP'), ('the', 'DT'), ('fun', 'NN'), ('westie', 'NN'), ('rolled', 'VBN'), ('on', 'IN'), ('the', 'DT'), ('grass', 'NN'), ('while', 'IN'), ('the', 'DT'), ('squirrel', 'NN'), ('sneeked', 'VBD'), ('past', 'JJ')]
['arthur', 'fun westie']
['Arthur', 'the', 'fun', 'westie', 'rolled', 'on', 'the', 'grass', 'while', 'the', 'squirrel', 'sneeked', 'past']
[('Arthur the fun', 'NNP DT NN'), ('the fun westie', 'DT NN NN'), ('fun westie rolled', 'NN NN VBN'), ('westie rolled on', 'NN VBN IN'), ('rolled on the', 'VBN IN DT'), ('on the grass', 'IN DT NN'), ('the grass while', 'DT NN IN'), ('grass while the', 'NN IN DT'), ('while the squirrel', 'IN DT NN'), ('the squirrel sneeked', 'DT NN VBD'), ('squirrel sneeked past', 'NN VBD JJ')]
0.024999999999999994
['fun', 'westie', 'grass', 'squirrel']
['fun westie', 'grass', 'squirrel']


Step Two: Find the most representative image from a search that returns many images.
How to solve:
1. Scrape a lot of images.
2. Filter by name or word.
3. Resize the images to the same pixel dimensions.
4. Apply a t-SNE transform to all images (e.g., with Multicore t-SNE).
5. Run density-based clustering to get one cluster, and remove all outliers.
6. Find the centroid of the cluster; the point closest to the centroid is the most representative image of the sample.
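
The last step of the plan above (pick the point closest to the cluster centroid) can be sketched with the standard library alone. This assumes the images have already been embedded (for example by t-SNE) and that outliers have already been removed; the coordinates below are made-up stand-ins, and `centroid` and `most_representative` are hypothetical helper names.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length points."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def most_representative(points):
    """Return the index of the point closest (Euclidean) to the centroid."""
    center = centroid(points)
    return min(range(len(points)), key=lambda i: math.dist(points[i], center))


# Made-up 2-D coordinates standing in for embedded images (assumption).
embedded = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.2)]
best = most_representative(embedded)
print(best, embedded[best])
```

Returning an index rather than the point itself makes it easy to map the result back to the original image file.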

Some literature:
Image database:
* https://www.engadget.com/2016/10/01/google-releases-massive-visual-databases-for-machine-learning/
* https://research.google.com/youtube8m/index.html
* https://datascience.stackexchange.com/questions/8552/find-most-representative-image
* https://github.com/openimages/dataset
* https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
* https://github.com/hardikvasa/google-images-download/issues
* Preprocessing: https://datascience.stackexchange.com/questions/5224/how-to-prepare-augment-images-for-neural-network

In [1]:
from google_images_download import google_images_download
response = google_images_download.googleimagesdownload()
# Download up to 20 images per keyword and print each image URL
arguments = {"keywords":"Polar bears,balloons,Beaches","limit":20,"print_urls":True}
response.download(arguments)


Item no.: 1 --> Item name = Polar bears
Evaluating...
Starting Download...
Image URL: https://upload.wikimedia.org/wikipedia/commons/thumb/6/66/Polar_Bear_-_Alaska_%28cropped%29.jpg/220px-Polar_Bear_-_Alaska_%28cropped%29.jpg
Completed Image ====> 1. 220px-polar_bear_-_alaska_%28cropped%29.jpg
Image URL: https://c402277.ssl.cf1.rackcdn.com/photos/2330/images/hero_full/polar-bear-hero.jpg?1345901694
Completed Image ====> 2. polar-bear-hero.jpg
Image URL: https://defenders.org/sites/default/files/styles/homepage-feature-2015/public/polar-bear_j.-lyle.png?itok=EAQm89Z4
Completed Image ====> 3. polar-bear_j.-lyle.png
Image URL: https://i.ytimg.com/vi/zNO0kxTClYo/maxresdefault.jpg
Completed Image ====> 4. maxresdefault.jpg
Image URL: https://polarbearsinternational.org/img/edu-center-D00002331.jpg
Completed Image ====> 5. edu-center-d00002331.jpg
Image URL: https://kids.nationalgeographic.com/content/dam/kids/photos/animals/Mammals/H-P/polar-bear-cub-on-mom.adapt.945.1.jpg
Completed Image 

KeyboardInterrupt: 