## Learning Automaton Forumscrape Project
# Data Analytics

This notebook explains how to derive some basic statistics from the forum post data we gathered from the [Fruits and Veggies forum](https://learningautomaton.ca/wp-content/uploads/2019/02/FruitsAndVeggiesForum/Knock%20Knock...%20-%20Fruits%20and%20Veggies.html) as part of my [Forumscrape Project](https://learningautomaton.ca/2019/01/ethical-forum-scraping-and-nlp-data-analytics-project/).

In the [previous post](https://learningautomaton.ca/2019/04/forum-scrape-project-data-processing-sanitation-and-anonymizing/) we cleaned and anonymized the data we scraped from the forum.

To make this notebook interactive I have embedded the post contents into a string and included it in the notebook.

In this notebook we will:
* Strip out unneeded words from our data
* Make a custom pipeline to tag words as being fruit or vegetable words, and count occurrences
* Find words describing fruits or veggies to get an idea of which is preferred on the forum

Tools: To do the language processing and analytics, we will use [spaCy](https://spacy.io/), a free, open-source Natural Language Processing library for Python.


## Set Up Labeling Pipeline, Strip Unneeded Words, Label Words of Interest
(In that order)

I recommend you take a look at the [spaCy doc page on pipelines](https://spacy.io/usage/processing-pipelines). 

In short, when you pass a string to a spaCy language object (created with the `spacy.load(languagemodel)` call), it passes the string through a series of pipes, each of which does one specific job, and stores the results in a `Doc` object. (For an iterable of strings, use `nlp.pipe` instead.)
    
For example, a fairly standard (and default) pipeline is as follows:
* The `tokenizer` pipe breaks the string into tokens (words), which are stored as `Token` objects in the `Doc`.
* The `tagger` pipe tries to identify which part of speech (noun, verb, adjective, etc.) best describes each token, and stores it in that token's `pos` attribute.
* The `parser` pipe adds a dependency label to each token, describing its grammatical relation to the other words in the sentence.
* The `ner` pipe tries to identify named entities and adds them to `doc.ents`.
* etc...
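
To make the data flow concrete, here is a minimal plain-Python sketch of the idea. It is a toy stand-in, not spaCy's implementation: `make_doc`, `toy_tagger`, `toy_ner` and `run_pipeline` are hypothetical names, and the "tagging" rule is deliberately crude. The only point is that each pipe receives the doc, adds its annotations, and hands it to the next pipe, so later pipes can rely on what earlier ones wrote.

```python
# Toy stand-in for a processing pipeline. This is NOT spaCy's implementation;
# every name here (make_doc, toy_tagger, toy_ner, run_pipeline) is hypothetical.

def make_doc(text):
    # "Tokenizer": split on whitespace and wrap each word in a dict.
    return {"tokens": [{"text": w} for w in text.split()], "ents": []}

def toy_tagger(doc):
    # "Tagger": a crude rule that marks capitalized words as proper nouns.
    for token in doc["tokens"]:
        token["pos"] = "PROPN" if token["text"][0].isupper() else "X"
    return doc

def toy_ner(doc):
    # "NER": reads the attribute written by the tagger and records entities.
    doc["ents"] = [t["text"] for t in doc["tokens"] if t["pos"] == "PROPN"]
    return doc

# The pipeline is an ordered list of (name, callable) pairs.
pipeline = [("tagger", toy_tagger), ("ner", toy_ner)]

def run_pipeline(text):
    doc = make_doc(text)
    for name, pipe in pipeline:
        doc = pipe(doc)  # each pipe takes the doc and returns it, annotated
    return doc

doc = run_pipeline("we love Apples and Peaches")
print(doc["ents"])
```

Swapping the order of the two pipes would break `toy_ner`, because the `pos` key it reads would not exist yet. Order matters in a real pipeline for the same reason.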

Note that we have added site-specific *slang* words for fruits and veggies to our match words so we capture them as well. In this data, "apples" are often referred to as "pulls". If we didn't explicitly label this word, it would be harder to find, as spaCy tags it as a verb, not a noun.

In [21]:
# Setup
import spacy
nlp = spacy.load('en_core_web_sm')
data = "Apples are the best. Us enlightened who call them, 'pulls love them more than any other fruit. 'Pulls are tasty, sweet, and crisp. All other fruits and those yucky vegetables are inferior. Discuss.. Ya, 'pulls are the best. 'Pulls are ripe, juicy, and taste sooooo good. Nothing else can match! I'd eat apples every day all day if I could, doctors be damned!. Yall get real. Apples are gross. The worst of the fruits, which are all bad. Everyone knows that veggies are powerfood. Get your weak apple sauce out of here.. Everyone knows grapes are superior you filthy casual. Enjoy your cyanide balls. Battlestar Galactica.. NO ONE READ THE ABOVE. THAT IS NOT ME.  Identity theft is not a joke, Jim! Millions of families suffer every year!. MICHAEL!  *storms off*. Oh, that's funny.  MICHAEL. From my head tomatoes.. Never Leaf Me.  We make a pearfect couple.. Peaches are better than all others. Peaches are the best, cabbage is the worst, end of story.  Smelly Cabbage; yuck. Corn cob rows!  Take me home,  to the placeeee,  I belonggg!!!!  WEST INDIANAAAAA  BUTTER MAMAAAAA  TAKE ME HOMEEEEE  CORN BOB ROWWWSS. If a squash can make you smile.... Lol, tomatoes and squash are both fruits! Get this compost out of the veggie master plate forum. Na, tomatoes and squash are fruits in name only. Acidic tomatoes and dirty squash, just like dirty tomatoes, should stay out of our sweet fruit forum. Like yucky lettuce, they belong in the vegetable forum, aka the compost.. Knock Knock. Who's There?. Banana. Banana who?. Knock Knock.... Who's There?... Banana. Banana who?. Knock Knock. Whos. ... There..... Orange!. Orange who?. ORANGE YOU GLAD I DIDN'T SAY BANANA??!?. You thought you'd pull a fast one on me, you little bitch? I'll have you know I graduated top of my class in the Tasty Peels, and I've been involved in numerous secret raids on Al-Quinoa, and I have over 300 confirmed spills. I am trained in vanilla warfare and I'm the top ripener in the entire US farmed forces. \n\
You are nothing to me but just another target grocery aisle. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of pies across the USA and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can eat you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in organic combat, but I have access to the entire pesticide arsenal of the United States Soybean Crops and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what appleholey retribution your little \"clever\" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. Your lettuce head is fucking dead, kiddo.. Get boiled, damn."
fruit_match_words = set(['pulls','fruit','apple','apricot','banana','cantaloupe','cherry','date','citrus','avocado','carissa','guava','citron','clementine','crabapple','grape','grapefruit','honeydew','lemon','lime','orange','mandarin','mango','papaya','peach','pear','pineapple','plantain','plum','pomelo','tangerine','watermelon'])
vegetable_match_words =  set(['vegetable','artichoke','eggplant','asparagu','broccoli','cabbage','cauliflower','celery','spinach','lettuce','onion','beet','carrot','potato','yam','turnip','squash','tomato','watercress'])

Because we only care about tagging nouns as "FRUIT" or "VEGETABLE", we are going to disable the default `ner` pipe (to avoid clashing labels) and add our own `EntityMatcher` pipes, which use a `PhraseMatcher` to determine which words to label.

First we make a generic `EntityMatcher` that takes in a spaCy language object, the name of the matcher (which can be used to modify or disable the matcher later on), an iterable of terms we want to match, and a label to apply to those terms if we find them.

We will also use a nifty feature added in spaCy v2.1 that allows us to choose what *attribute* of each token we want to match on. Because attributes are added by other pipes, we want our pipe to be last. We are going to match on the "LEMMA" attribute, which is the base form of the word (the lemma of "Apples" is "apple", the lemma of "going" is "go").
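
The difference between matching on the raw text and matching on the lemma can be sketched in plain Python. This is only an illustration: `TOY_LEMMAS` and `toy_lemma` are hypothetical stand-ins for spaCy's lemmatizer, which is far more complete than a three-entry lookup table.

```python
# Hypothetical mini lemma table; spaCy derives lemmas from its own
# lemmatizer, this dict only stands in for it.
TOY_LEMMAS = {"apples": "apple", "peaches": "peach", "going": "go"}

def toy_lemma(word):
    # Lowercase, then look up the base form; unknown words map to themselves.
    word = word.lower()
    return TOY_LEMMAS.get(word, word)

match_terms = {"apple", "peach"}
words = "Apples are going well with peaches".split()

# Matching on the raw text misses every inflected or capitalized form.
text_hits = [w for w in words if w in match_terms]
# Matching on the lemma catches them.
lemma_hits = [w for w in words if toy_lemma(w) in match_terms]

print(text_hits)   # no hits on the raw text
print(lemma_hits)  # both plural forms are found
```

This is why the match word lists only need the base form of each word: plurals and capitalized variants collapse to the same lemma before the comparison happens.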

In [22]:
# Make a generic EntityMatcher pipe
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    # Takes in the spaCy language model initialized by spacy.load('en_core_web_sm'), the name of the matcher,
    # a list of terms to match on, and the label to apply.
    def __init__(self, nlp, matcher_name, terms, label):
        # We run each term through the language model to make a doc object for each one, so we can access the "LEMMA" property of the
        # match terms (required to make use of the attr='LEMMA' feature, even though the terms are already in lemma form).
        patterns = [nlp(text) for text in terms]
        # The PhraseMatcher will find tokens in the doc that match the match terms
        self.matcher = PhraseMatcher(nlp.vocab, attr='LEMMA')
        # Add our patterns (one for each FRUIT match word, for example) and the label to apply ("FRUIT")
        self.matcher.add(label, None, *patterns)
        # Set the name of the matcher
        self.name = matcher_name

    # When called on a Doc object
    def __call__(self, doc):
        # Get the list of matches from the matcher we created above
        matches = self.matcher(doc)
        # Make a Span (list of sequential tokens) for the matches, and save them to the doc.ents list
        # (The matcher only returns the index of the first and last token of the match, so we need to make a Span that gets the words in between)
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            doc.ents = list(doc.ents) + [span]
        # Return the Doc object, with our changes
        return doc

In [23]:
# Make a new matcher for our list of fruit words and list of vegetable words
fruit_matcher = EntityMatcher(nlp, "fruit_matcher", fruit_match_words, "FRUIT")
vegetable_matcher = EntityMatcher(nlp, "vegetable_matcher", vegetable_match_words, "VEGETABLE")

# Disable the default ner pipe, which causes conflicts as it mis-labels some fruits/vegetables as named entities based on context.
nlp.disable_pipes("ner")
# Add our pipes to the language model.
# Note: spaCy will not let you add the same pipe twice. If you want to make changes to the pipe, you need to reload the model or use
# nlp.replace_pipe('nameOfOldPipe', newPipe), or nlp.remove_pipe('nameOfOldPipe'), nlp.add_pipe(newPipe)
nlp.add_pipe(fruit_matcher)
nlp.add_pipe(vegetable_matcher)

nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1d9da228c88>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1d9db4f59a8>),
 ('fruit_matcher', <__main__.EntityMatcher at 0x1d9da166c18>),
 ('vegetable_matcher', <__main__.EntityMatcher at 0x1d9da226978>)]
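
With both matchers in the pipeline, one possible way to count occurrences is to tally the entities by label. The sketch below is an assumption about how that step could look, not the notebook's own code: `count_labels` and `FakeEnt` are hypothetical names. The helper only relies on each entity exposing a string `label_`, as spaCy `Span` objects do, so it is demonstrated here on stand-in objects and runs without spaCy installed.

```python
from collections import Counter, namedtuple

def count_labels(ents):
    # Tally entities by their string label (spaCy spans expose it as `label_`).
    return Counter(ent.label_ for ent in ents)

# With the pipeline above, the call would be: count_labels(nlp(data).ents)
# Stand-in entities so the sketch runs on its own:
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])
sample = [
    FakeEnt("Apples", "FRUIT"),
    FakeEnt("pulls", "FRUIT"),
    FakeEnt("cabbage", "VEGETABLE"),
]

counts = count_labels(sample)
print(counts["FRUIT"], counts["VEGETABLE"])
```

A `Counter` returns 0 for labels it has never seen, so asking for a label that did not occur in the data does not raise an error.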

<pre><code>if can_learn: 
    learn()</code></pre>