## Learning Automaton Forumscrape Project
# Data Processing, Sanitation, and Anonymizing

This notebook explains the steps used to sanitize the data gathered in the [previous post](https://learningautomaton.ca/2019/03/forum-scrape-project-scrapy-spider/) in this project.

The start state is a MongoDB containing thread and post data scraped from the [Fruits and Veggies forum](https://learningautomaton.ca/wp-content/uploads/2019/02/FruitsAndVeggiesForum/Knock%20Knock...%20-%20Fruits%20and%20Veggies.html). 

You can see a text representation of the database here: [Raw Data Text File](https://github.com/johneddcooper/forumscrape/blob/master/db_refactoring/data_raw)

This notebook contains a text copy of the data before and after each stage, so you can see the changes without needing a working database. Some steps work with the included text data; others require a working database, and these are clearly indicated.

The end state is that the data pulled has been cleaned, stripped of potential [personally identifying information](https://en.wikipedia.org/wiki/Personal_data), and is ready to be processed for data statistics. 

I delve deeper into the topic of personal information and why it is important to sanitize in the [Ethics Analysis](https://learningautomaton.ca/2019/01/forum-scrape-project-ethics/) for this project.

## Methodology

For our data and needs, this will be a three-part process:
1. Refactor the database to make it more easily accessible, and remove unnecessary fields;
2. Replace usernames with unique random identifiers (we don't want to strip them out completely, as we can still derive anonymous user statistics from them);
3. Remove unnecessary, potentially identifying, and unknown words and data from post titles and content.

We will iteratively write the changes to a copy of the database in each phase.

## Tools

For database reading and writing we will use PyMongo, to connect to a locally running MongoDB server.  

To find data that needs to be stripped, we will use [spaCy](https://spacy.io/), a free, open-source Natural Language Processing library for Python.


## Stage 1 - Refactoring

**Note that for the following steps to work you will need to be running this notebook locally (not on Binder) and have a running MongoDB server.** See the [Forumscrape - Scrapy Spider](https://learningautomaton.ca/2019/03/forum-scrape-project-scrapy-spider/) post for details.

The data is currently stored in the following schema:

* thread_1
    * _id
    * title
    * url
    * posts
        * post_1
            * user
            * title
            * content
            * datetime
        * post_2
             <br>...
* thread_2
    <br>...
    
 
This schema was useful while we were scraping, but nested documents are not fun to work with. Since we are only going to do simple language statistics on the post content, we can strip out the thread URL and post datetime and promote the posts to top-level documents. At the end, our schema will look like this:
* post_1
    * thread_title
    * user
    * content
* post_2
    <br>...
    
<br>

*Note: a MongoDB "document" is an item in the database. It is similar to a dictionary object, in that it holds field:value pairs and can contain other documents (or lists of documents) as values.*

<br>

To accomplish the refactoring, we will make use of MongoDB's [Aggregation Pipelines](https://docs.mongodb.com/manual/core/aggregation-pipeline/).

Aggregation Pipelines are extremely powerful, but they can be complex. I won't go into all that is possible, just the steps we need.

Pipelines run in sequential stages. The stages we need are:
* unwind: make a new top-level document for each sub-document in an embedded array. This will turn every post from sub-documents embedded in threads into their own top-level documents.
* project: reshape each document into a new structure. The previous step left each post as a single sub-document embedded in a copy of its thread. We pull the data out of the embedded sub-document into the top-level document.
* sort: sort (serves to group) the posts by thread title. Not required, but makes for easier reading and verifying the results.
* out: write the results of the pipeline to the given collection in the database. If this stage is omitted, the pipeline call instead returns a cursor over the transformed documents without making any changes to the DB. This is useful for testing the pipeline before writing, especially for large databases.
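
To make the stages above more concrete, here is a rough plain-Python imitation of what unwind, project, and sort do to a list of dictionaries. It is only a sketch: the sample threads are made up, the helper functions `unwind` and `project` are my own names, and MongoDB does the real work server-side.

```python
from operator import itemgetter

# Hypothetical sample threads in the nested schema (made up for illustration).
threads = [
    {"_id": 1, "title": "Pears", "url": "https://example.com/pears",
     "posts": [
         {"user": "alice", "title": "Pears", "content": "Pears are great.", "datetime": "2019-01-01"},
         {"user": "bob", "title": "Re: Pears", "content": "Agreed.", "datetime": "2019-01-02"},
     ]},
    {"_id": 2, "title": "Apples", "url": "https://example.com/apples",
     "posts": [
         {"user": "carol", "title": "Apples", "content": "Apples rule.", "datetime": "2019-01-03"},
     ]},
]

def unwind(docs, field):
    """Mimic $unwind: yield one copy of each document per element of its array field."""
    for doc in docs:
        for element in doc[field]:
            yield {**doc, field: element}

def project(docs):
    """Mimic our $project stage: keep three fields, lifted to the top level."""
    for doc in docs:
        yield {
            "thread_title": doc["title"],
            "user": doc["posts"]["user"],
            "content": doc["posts"]["content"],
        }

# Mimic $sort on thread_title (ascending).
flat_posts = sorted(project(unwind(threads, "posts")), key=itemgetter("thread_title"))

for post in flat_posts:
    print(post)
```

Each printed dictionary corresponds to one post, carrying its parent thread's title along with it, which is the shape we want in the new collection.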

We construct an array with the parameters we need and pass it to the collection.aggregate(pipeline) function.

You can see the end result [here](https://github.com/johneddcooper/forumscrape/blob/master/db_refactoring/data_stg1_refactored).

In [23]:
import pymongo
from pymongo import MongoClient
# Point the MongoClient at our database server. Here it is reachable at 192.168.2.70 on port 27017; use 'localhost' if your server runs on the same machine.
client = MongoClient('192.168.2.70', 27017)
# The DB within the MongoDB server the data is stored on, in my case, in a DB called fruitsandveggies
db=client['fruitsandveggies']

In [34]:
# The 'threads' collection holds my raw data, that was gathered by the Scrapy spider in the previous post
collection = db['threads']

# making an array with the pipeline parameters, to be passed in using the collection.aggregate() function
pipeline =  [ 
    { "$unwind": "$posts" }, # Makes a copy of each thread document, for each embedded post document, containing only that post
    { "$project": 
         { "_id": 0, 
          "thread_title": "$title", # Make explicit that the field is the title of the parent thread 
          "user": "$posts.user",  # Pull the post user from the sub-document into the document
          "content": "$posts.content" } }, # Pull the post content from the sub-document into the document
    { "$sort": { "thread_title": 1 } }, # Sort the post documents by title
    { "$out": "aggregate-posts-out_3" } # Write the results to the "aggregate-posts-out_3" collection within the database
    ]

# Call aggregate on the collection, with our pipeline details. Because these steps are memory intensive, we set allowDiskUse to True to allow MongoDB to write temporary files to disk during the operation.
# The call will fail otherwise for any non-trivial data set.
collection.aggregate(pipeline, allowDiskUse=True)

# the call returns a cursor that can be used to navigate the results of the pipeline.


<pymongo.command_cursor.CommandCursor at 0x7ae4080>
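
As mentioned in the description of the out stage, the pipeline can be tried without writing anything. One way to do that, sketched below, is to drop the `$out` stage and add a `$limit` stage so only a few documents come back. The `preview` helper is a hypothetical name of mine, and the final call is left as a comment because it needs a running MongoDB server.

```python
# Same stages as the pipeline above, repeated here so the snippet is self-contained.
pipeline = [
    {"$unwind": "$posts"},
    {"$project": {"_id": 0,
                  "thread_title": "$title",
                  "user": "$posts.user",
                  "content": "$posts.content"}},
    {"$sort": {"thread_title": 1}},
    {"$out": "aggregate-posts-out_3"},
]

# Drop the $out stage so nothing is written, and cap the preview at a few documents.
preview_pipeline = [stage for stage in pipeline if "$out" not in stage] + [{"$limit": 5}]

def preview(collection, stages):
    """Print the documents the pipeline would produce (hypothetical helper)."""
    for doc in collection.aggregate(stages, allowDiskUse=True):
        print(doc)

# With a running MongoDB server:
# preview(db['threads'], preview_pipeline)
```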

## Stage 2 - Replace Username

As part of the ethics analysis for this project, we identified a need to hide usernames, both because they can be used to link the data to users on other sites and because they might contain personally identifying information.

The steps are as follows:
1. Get all posts from the DB
2. Make a dict with usernames as keys
3. Assign a unique random number to each user
4. Replace all instances of the username with the unique random number in the DB
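
The steps above can be tried without a database. The following is a minimal sketch on a made-up list of usernames (`post_users`, `user_ids`, and `anonymized` are illustrative names, not part of the project code):

```python
from random import sample

# Hypothetical usernames, one entry per post (duplicates are expected).
post_users = ["alice", "bob", "alice", "carol", "bob"]

# Steps 2 and 3: collect the distinct usernames, then pair each one with a number
# drawn without replacement from 1..n, so no two users share an identifier.
unique_users = sorted(set(post_users))
ids = sample(range(1, len(unique_users) + 1), k=len(unique_users))
user_ids = dict(zip(unique_users, ids))

# Step 4: swap every username for its identifier.
anonymized = [user_ids[name] for name in post_users]
print(anonymized)
```

Posts by the same user still share an identifier, so per-user statistics remain possible. Note that anyone holding `user_ids` can reverse the replacement, so the mapping should not be kept alongside the cleaned data.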

You can see the results [here](https://github.com/johneddcooper/forumscrape/blob/master/db_refactoring/data_stg2_username_replace)

In [35]:
# Gather all posts, get set of usernames, assign a random unique number to each user, replace username with the unique identifier in the database

from random import sample

collection = db['aggregate-posts-out_3']
post_dict = dict()
user_dict = dict()
# Call collection.find() with two arguments: {} to match every post, and {'_id':1, 'user':1} to return only those fields.
# It returns a cursor object that can be used to iterate over the results of the query.
posts = collection.find({}, {'_id':1, 'user':1})

for post in posts:
    user_dict[post['user']] = None # Make a dict with each user as a key. Ensures we don't assign multiple random ids to the same user

num_users = len(user_dict.keys())
number_sample=sample(range(1,num_users+1), k=num_users) # Build the range 1..n, where n is the number of users, and draw all n values without replacement (this serves to shuffle the numbers)

for user in user_dict.keys():
    user_dict[user] = number_sample.pop() # Assign each user a number

# Our cursor object was exhausted by the loop above, so we get a new one with a new find query
posts = collection.find({}, {'_id':1, 'user':1})
for post in posts:
    collection.update_one( # Update each post with the new random user id in place of the username
        {'_id':post['_id']},
        {'$set':{'user':user_dict[post['user']]}},
    )

## Stage 3 - Strip out Unwanted Words

In this stage, we will process the post titles and contents using spaCy.

I will go into greater detail on spaCy in the next post, where we will derive some useful statistics and meaning from the data. As a quick introduction:

spaCy, once we have loaded one of the language models, will allow us to feed it strings. It will produce a spaCy `Doc` object, which is a series of tokens (a token being defined as a word, space, or punctuation) that were derived from the fed string.

It will automatically generate a number of useful fields for each token object, such as what type of word spaCy thinks it is (noun, adverb, etc.), whether it is a [stop word](https://en.wikipedia.org/wiki/Stop_words) (a word that is filtered out as generally not useful for NLP tasks), whether the word is in the language model's vocabulary, and so on.

Based on the post data, we will throw out tokens if one of the following applies:
* it is a stop word
* it is a proper noun, number, symbol, unknown, or punctuation other than a period
* it is a new line
* the word is not in our model's vocabulary (this check is commented out in the demonstration cells below)

After applying these steps we should be left with useful, common words that are neither unique nor unusual. This will make language processing for statistics easier in the next step and makes it highly unlikely that personally identifying information slips through.
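
The rules can also be written as a single predicate. Here is a minimal sketch that runs without spaCy installed: `FakeToken` is a hypothetical stand-in carrying only the attributes the rules read, and `should_strip` and `STRIP_POS` are names of my own choosing. The attribute names (`is_stop`, `pos_`, `text`, `is_space`, `is_oov`) match those on spaCy's tokens, so the same function should accept real tokens as well.

```python
from collections import namedtuple

# Stand-in for a spaCy token, carrying only the attributes the rules use (hypothetical).
FakeToken = namedtuple("FakeToken", "text pos_ is_stop is_space is_oov")

# Part-of-speech tags we throw away: proper nouns, numbers, symbols, unknowns, punctuation.
STRIP_POS = {"PROPN", "NUM", "SYM", "X", "PUNCT"}

def should_strip(token, check_oov=False):
    """Return True if the token matches any of the stripping rules."""
    return (
        token.is_stop
        or (token.pos_ in STRIP_POS and token.text != ".")  # periods are kept
        or token.is_space
        or (check_oov and token.is_oov)  # optional out-of-vocabulary rule
    )

# A hand-tagged example sentence (the tags are assigned by hand, not by a model).
tokens = [
    FakeToken("I", "PRON", True, False, False),
    FakeToken("love", "VERB", False, False, False),
    FakeToken("Vancouver", "PROPN", False, False, False),
    FakeToken("peaches", "NOUN", False, False, False),
    FakeToken("!", "PUNCT", False, False, False),
    FakeToken(".", "PUNCT", False, False, False),
    FakeToken("\n", "SPACE", False, True, False),
]

kept = [t.text for t in tokens if not should_strip(t)]
print(" ".join(kept))
```

Keeping the out-of-vocabulary rule behind a flag makes it easy to compare results with and without it.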

In order to make this part of the notebook interactive, I have captured the output of the above stages into named tuples below, so you can try the NLP yourself. Note that the syntax is slightly different if you are using data pulled from a DB, and I have indicated where it differs.

The results of applying this stage on the database can be found [here](https://github.com/johneddcooper/forumscrape/blob/master/db_refactoring/data_stg3_word_strip).

In [30]:
import spacy
nlp = spacy.load('en_core_web_sm') # we use spaCy's en_core_web_sm language model for demonstration, however, we would use the md or lg models normally as they are more accurate and offer more features 

In [36]:
# Sanitize post titles 
# Remove stop words, proper nouns, numbers, symbols, unknowns, punctuation other than periods, unknown words, and new lines.


# If pulling from a database...
collection = db['aggregate-posts-out_3']
posts = collection.find({})
# Note: if pulling the data from a database, post['thread_title'] needs to be used

for post in posts:
    content_doc = nlp(post['thread_title']) # Construct a new nlp doc, which automatically tokenizes and tags, returns a spaCy Doc object 
    # Go through each token in the post title, only keeping the ones that don't match our stripping rules
    keep_tokens = [token.text for token in content_doc if not( # Iterating over a spaCy Doc returns the tokens within the Doc 
        token.is_stop or # remove stop words, i.e. words that aren't useful for most NLP problems
        (token.pos_ in ['PROPN','NUM','SYM','X','PUNCT'] and not token.text==".") or # remove proper nouns, numbers, symbols, unknowns, and all punctuation except periods, so we can delineate sentences.
        #token.is_oov or # Remove words that are out of vocabulary (oov). We might lose some useful data, however, to be safe we take it out as we can't identify it.
        token.is_space # Remove whitespace tokens, such as new lines
    )]
#    print(post['thread_title'],keep_tokens) # will print once for each post, so there will be duplicates
    print(' '.join(keep_tokens))
    # Write the change to the DB from within the for loop; update_many rewrites every post that shares this thread title
    collection.update_many(
       {'thread_title':post['thread_title']},
       {'$set':{'thread_title':' '.join(keep_tokens)}}
    )

' pulls best fruit
' pulls best fruit
' pulls best fruit
' pulls best fruit
Bears . Beets .
Bears . Beets .
Bears . Beets .
Bears . Beets .
Love
Love
Going na Eat Lot Peaches
Going na Eat Lot Peaches
like talk tomatoes
like talk tomatoes
like talk tomatoes

In [37]:
# Sanitize post contents
# Remove stop words, proper nouns, numbers, symbols, unknowns, punctuation other than periods, unknown words, and new lines.

# If pulling from a database...
collection = db['aggregate-posts-out_3']
posts = collection.find({})
# Note: if pulling the data from a database, post['content'] needs to be used

for post in posts:
    content_doc = nlp(post['content']) 
    # Go through each token in the post content, only keeping the ones that don't match our stripping rules
    keep_tokens = [token.text for token in content_doc if not( 
        token.is_stop or # remove stop words, i.e. words that aren't useful for most NLP problems
        (token.pos_ in ['PROPN','NUM','SYM','X','PUNCT'] and not token.text==".") or # remove proper nouns, numbers, symbols, unknowns, and all punctuation except periods, so we can delineate sentences.
        #token.is_oov or # Remove words that are out of vocabulary (oov). We might lose some useful data, however, to be safe we take it out as we can't identify it.
        token.is_space # Remove whitespace tokens, such as new lines
    )]
    print(' '.join(keep_tokens))
    # Write the change to the DB from within the for loop, one post at a time
    collection.update_one(
       {'_id':post['_id']},
       {'$set':{'content':' '.join(keep_tokens)}}
    )

knows knows apples enlightened pulls best . fruits yucky vegetables inferior . Discuss .
pulls best . best ' pulls ripe juicy taste sooooo good . match eat apples day day doctors damned
real . Apples gross . worst fruits bad . knows veggies power food . week apple sauce .
knows grapes best filthy casual Enjoy cyanide balls
.
. . theft joke Millions families suffer year
storms
Oh funny .
head tomatoes .
. pearfect couple .
Peaches better . Peaches best cabbage worst end story . & yuck
Corn cob rows home placeeee belonggg ROWWWSS
squash smile
Lol tomatoes squash fruits compost veggie master plate forum .
Na tomatoes squash fruits . Acidic tomatoes dirty squash like dirty tomatoes stay sweet fruit forum . Like yucky lettuce belong vegetable forum aka compost .



Banana



Banana

s .

Orange
GLAD DIDN'T
thought pull fast little bitch know graduated class involved numerous secret raids confirmed spills . trained vanilla warfare ripener entire farmed forces . target grocery aisle . wipe fu

For demonstration, let's see all of the tokens that we stripped out:

In [15]:
stripped_tokens = set()

# If pulling from a database, request a fresh cursor first: a cursor that has already
# been iterated yields nothing, which leaves the set below empty.
# posts = collection.find({})

for post in posts:
    content_doc = nlp(post.content) # named-tuple access; use post['content'] if pulling from a database
    throwaway_tokens = [token for token in content_doc if  
        # token.is_stop or # remove stop words, or words that aren't useful for most NLP problems
        (token.pos_ in ['PROPN','NUM','SYM','X','PUNCT'] and not token.text==".") or # remove proper nouns, numbers, symbols, unknowns, and all punctuation except periods, so we can delineate sentences.
        token.is_oov or # Remove words that are out of vocabulary (oov). We might lose some useful data, however, to be safe we take it out as we can't identify it.
        token.is_space # Remove whitespace tokens, such as new lines
    ]
    for token in throwaway_tokens:
        stripped_tokens.add(token.text) # store the text so that repeated words collapse into one entry

print(stripped_tokens)

set()


In summary, we have seen:
* Refactoring a DB using MongoDB aggregation pipelines
* Assigning a unique random number to each user and replacing the usernames with it in the DB
* Stripping out unwanted words from the DB using simple language processing with spaCy

In the next post, we will dive more into spaCy and NLP, and how we can derive some useful information from our cleaned dataset.

If you have any comments or recommendations, please let me know by leaving a comment at https://learningautomaton.ca/forum-scrape-project---db-editing.

<pre><code>if can_learn: 
    learn()</code></pre>