## Learning Automaton Forumscrape Project
# Data Analytics

This notebook explains how to derive some basic statistics from the forum post data we gathered from the [Fruits and Veggies forum](https://learningautomaton.ca/wp-content/uploads/2019/02/FruitsAndVeggiesForum/Knock%20Knock...%20-%20Fruits%20and%20Veggies.html) as part of my [Forumscrape Project](https://learningautomaton.ca/2019/01/ethical-forum-scraping-and-nlp-data-analytics-project/).

In the [previous post](https://learningautomaton.ca/2019/04/forum-scrape-project-data-processing-sanitation-and-anonymizing/) we cleaned and anonymized the data we scraped from the forum.

You can see a text representation of the current state of the database [here](https://github.com/johneddcooper/forumscrape/blob/master/db_refactoring/data_stg3_word_strip).



The end state is that the data pulled has been cleaned, stripped of potential [personally identifying information](https://en.wikipedia.org/wiki/Personal_data), and is ready to be processed for data statistics. 

I delve deeper into the topic of personal information, and why it is important to sanitize it, in the [Ethics Analysis](https://learningautomaton.ca/2019/01/forum-scrape-project-ethics/) for this project.

## Methodology

For our data and needs, this will be a three-part process:
1. Refactor the database to make it more easily accessible, and remove unnecessary fields;
2. Replace usernames with unique random identifiers (we don't want to strip them out completely, as we can still derive anonymous user statistics);
3. Remove unnecessary, potentially identifying, and unknown words and data from post titles and content.

We will iteratively write the changes to a copy of the database in each phase.
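
Step 2 can be sketched in isolation. The following is a minimal illustration, not the project's actual code: the helper name `pseudonymize`, the `username` field, and the sample posts are all assumptions made for the example.

```python
import secrets

def pseudonymize(posts, field="username"):
    """Replace each distinct value of `field` with a random identifier.

    Returns the original -> identifier mapping so it can be discarded
    (or stored securely) once the rewrite is done.
    """
    mapping = {}
    used = set()
    for post in posts:
        name = post[field]
        if name not in mapping:
            token = secrets.token_hex(8)
            # Re-draw in the (very unlikely) case of a collision.
            while token in used:
                token = secrets.token_hex(8)
            used.add(token)
            mapping[name] = token
        post[field] = mapping[name]
    return mapping

# Hypothetical sample data standing in for documents read from the database.
sample_posts = [
    {"username": "alice", "content": "Knock knock."},
    {"username": "bob", "content": "Who is there?"},
    {"username": "alice", "content": "Lettuce."},
]
mapping = pseudonymize(sample_posts)
```

Because the same user always maps to the same identifier, per-user statistics such as post counts remain possible without keeping the original names.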

## Tools

For database reading we will use PyMongo to connect to a locally running MongoDB server.

To do the language processing and analytics, we will use [spaCy](https://spacy.io/), a free, open-source Natural Language Processing library for Python.

In [4]:
import pymongo
from pymongo import MongoClient
# Point the MongoClient at our database. In this case, the MongoDB server is on the local network at 192.168.2.70, port 27017.
client = MongoClient('192.168.2.70', 27017)
# Select the database that holds the data; in my case it is called fruitsandveggies.
db = client['fruitsandveggies']

In [5]:
import spacy
# We use spaCy's en_core_web_sm language model for demonstration.
# Normally we would use the md or lg models, as they are more accurate and offer more features.
nlp = spacy.load('en_core_web_sm')

In [3]:
# Sanitize post titles
# Remove stop words, proper nouns, numbers, symbols, unknowns, punctuation other than periods, and new lines.

# Pull the posts from the database.
# Note: PyMongo returns each document as a dict, so fields are read as post['thread_title'].
collection = db['aggregate-posts-out']
posts = collection.find({})

for post in posts:
    content_doc = nlp(post['thread_title']) # Construct a new nlp doc, which automatically tokenizes and tags; returns a spaCy Doc object
    # Go through each token in the post title, only keeping the ones that don't match our stripping rules
    keep_tokens = [token for token in content_doc if not( # Iterating over a spaCy Doc yields the tokens within the Doc
        token.is_stop or # remove stop words, i.e. words that aren't useful for most NLP problems
        (token.pos_ in ['PROPN','NUM','SYM','X','PUNCT'] and not token.text==".") or # remove proper nouns, numbers, symbols, unknowns, and all punctuation except periods, so we can delineate sentences
        #token.is_oov or # Remove words that are out of vocabulary (oov). We might lose some useful data, however, to be safe we take it out as we can't identify it.
        token.is_space # remove whitespace tokens, such as new lines
    )]
    print(post['thread_title'],keep_tokens) # prints once for each post, so thread titles will be repeated

# Code to update the changes to a DB, run from within the for loop
#     collection.update_many(
#        {'thread_title':post['thread_title']},
#        {'$set':{'thread_title':str(keep_tokens)}}
#     )

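The stripping rule can also be written as a standalone predicate, which makes it easier to test and to reuse for both titles and content. This is only a sketch: `should_strip`, `STRIP_POS`, and the `FakeToken` stand-in are hypothetical names, and the part-of-speech tags below are assigned by hand so that the example runs without a spaCy model. A real spaCy `Token` exposes the same `text`, `pos_`, `is_stop`, and `is_space` attributes.

```python
from dataclasses import dataclass

# Part-of-speech tags we strip, matching the list used in the cells.
STRIP_POS = {"PROPN", "NUM", "SYM", "X", "PUNCT"}

@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the attributes we read from a spaCy Token."""
    text: str
    pos_: str
    is_stop: bool = False
    is_space: bool = False

def should_strip(token):
    """Return True if the token matches one of the stripping rules."""
    if token.is_stop or token.is_space:
        return True
    # Periods are kept so that sentences can still be delineated.
    return token.pos_ in STRIP_POS and token.text != "."

# Hand-tagged sample title; real tags would come from nlp(...).
title = [
    FakeToken("Knock", "VERB"),
    FakeToken(",", "PUNCT"),
    FakeToken("the", "DET", is_stop=True),
    FakeToken("Lettuce", "PROPN"),
    FakeToken("joke", "NOUN"),
    FakeToken(".", "PUNCT"),
]
kept = [t.text for t in title if not should_strip(t)]
print(kept)
```

With spaCy, the same predicate could be applied as `[t for t in nlp(text) if not should_strip(t)]`.
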
In [None]:
# Sanitize post contents
# Remove stop words, proper nouns, numbers, symbols, unknowns, punctuation other than periods, and new lines.

# Pull the posts from the database.
# Note: PyMongo returns each document as a dict, so fields are read as post['content'].
collection = db['aggregate-posts-out']
posts = collection.find({})

for post in posts:
    content_doc = nlp(post['content'])
    # Go through each token in the post content, only keeping the ones that don't match our stripping rules
    keep_tokens = [token for token in content_doc if not(
        token.is_stop or # remove stop words, i.e. words that aren't useful for most NLP problems
        (token.pos_ in ['PROPN','NUM','SYM','X','PUNCT'] and not token.text==".") or # remove proper nouns, numbers, symbols, unknowns, and all punctuation except periods, so we can delineate sentences
        #token.is_oov or # Remove words that are out of vocabulary (oov). We might lose some useful data, however, to be safe we take it out as we can't identify it.
        token.is_space # remove whitespace tokens, such as new lines
    )]
    print(post['thread_title'],keep_tokens)


# Code to update the changes to a DB, run within the for loop
#     collection.update_one(
#        {'_id':post['_id']},
#        {'$set':{'content':str(keep_tokens)}}
#     )

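Note that `str(keep_tokens)` stores the printed form of a Python list, brackets and commas included. If plain text is preferred, one possible alternative is to join the token texts before writing them back. The sketch below only builds the filter and update documents; `build_update` is a hypothetical helper, and the `_id` value and token texts are made up for illustration.

```python
def build_update(post_id, kept_texts, field="content"):
    """Build the (filter, update) pair for writing cleaned text back.

    kept_texts is a list of token strings, e.g. [t.text for t in keep_tokens].
    """
    cleaned = " ".join(kept_texts)
    return {"_id": post_id}, {"$set": {field: cleaned}}

# Hypothetical values standing in for post['_id'] and the kept token texts.
filter_doc, update_doc = build_update(42, ["knock", "joke", "."])
print(filter_doc, update_doc)
```

The two documents could then be passed to `collection.update_one(filter_doc, update_doc)`. Joining on spaces leaves a space before each period, which is harmless if the text is tokenized again later.
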
For demonstration, let's see all of the tokens that match our stripping rules:

In [None]:
stripped_tokens = set()

# Query the database again, since a cursor can only be iterated once.
posts = db['aggregate-posts-out'].find({})

for post in posts:
    content_doc = nlp(post['content'])
    throwaway_tokens = [token for token in content_doc if
        token.is_stop or # stop words, i.e. words that aren't useful for most NLP problems
        (token.pos_ in ['PROPN','NUM','SYM','X','PUNCT'] and not token.text==".") or # proper nouns, numbers, symbols, unknowns, and all punctuation except periods
        token.is_oov or # words that are out of vocabulary (oov). In recent spaCy versions this means the token has no word vector, so results depend on the model used.
        token.is_space # whitespace tokens, such as new lines
    ]
    for token in throwaway_tokens:
        stripped_tokens.add(token.text) # store the text, so identical words from different posts are only listed once

print(stripped_tokens)

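A set shows which tokens were removed, but not how often. As a possible extension, `collections.Counter` can tally them. This is a sketch under assumptions: `tally_stripped` is a hypothetical helper, and the sample lists stand in for the per-post token texts collected in the loop.

```python
from collections import Counter

def tally_stripped(token_texts_per_post):
    """Count how often each stripped token text occurs across all posts."""
    counts = Counter()
    for texts in token_texts_per_post:
        # Lower-case so that "The" and "the" are counted together.
        counts.update(text.lower() for text in texts)
    return counts

# Hypothetical stripped tokens from two posts.
sample = [["the", "Lettuce", "!"], ["The", "a", "!"]]
counts = tally_stripped(sample)
print(counts.most_common(2))
```

Sorting by frequency makes it easier to spot whether a rule is removing something that appears often enough to matter.
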
In summary, we have seen:
* Refactoring a DB using MongoDB aggregation pipelines
* Assigning a unique number to users and replacing them in the DB
* Stripping out unwanted words from the DB using simple language processing with spaCy

In the next post, we will dive more into spaCy and NLP, and how we can derive some useful information from our cleaned dataset.

If you have any comments or recommendations, please let me know by leaving a comment at https://learningautomaton.ca/forum-scrape-project---db-editing.

<pre><code>if can_learn: 
    learn()</code></pre>