# Cleaning and exploring a collection of tweets
We are going to explore the keywords_list collection in MongoDB.
As of the current date, the collection has been filled for several days with streamed tweets containing the keyword "tobacco".

To explore the data set we need:
1. functions to interface with MongoDB
    - query data
    - load data into MongoDB
    
2. functions to explore the data

3. functions to manipulate the data
    - cleaning
    - feature engineering

We approach the problem by splitting it into 2 sections:
- __Define the functions__
- __Use the functions__

__Note:__ When we query MongoDB we obtain a cursor object that yields all the documents that matched the query; to work with them we just need to do _FOR document IN cursor_
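
The iteration pattern in the note can be sketched without a live database. The generator `fake_cursor` below is a hypothetical stand-in for the object returned by a query, not part of PyMongo; like a real cursor it yields one document at a time and is exhausted after a single pass.

```python
def fake_cursor():
    """Hypothetical stand-in for a PyMongo cursor: yields documents lazily."""
    yield {"text": "first tweet about tobacco"}
    yield {"text": "second tweet about tobacco"}

cursor = fake_cursor()

# FOR document IN cursor: each document behaves like a Python dict
texts = [document["text"] for document in cursor]

# A second pass finds nothing: the cursor was consumed by the first loop
leftover = list(cursor)
```

With a real PyMongo cursor, `cursor.rewind()` resets it for another pass; alternatively the documents can be copied into a list once and reused.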

Since we work with tweet object documents, it is important to always keep in mind the __anatomy of a tweet__; we use the following online JSON parser to display our documents in a plain way.

In [1]:
%%HTML
<iframe width="100%" height="1000" src="http://www.jsoneditoronline.org/"></iframe>

-------

# Define the functions

In [11]:
# packages

import pymongo
import sys
import json

import nltk
import numpy as np
import pandas as pd
import re
import os
import codecs

### load_from_mongo()
INPUT:
- __mongo_db__: name of the database that holds the collection we are interested in (string)
- __mongo_db_coll__: name of the collection that holds the documents we are interested in (string)
- __return_cursor__: if set to True, return a cursor object, i.e. an iterable over the documents that match our query; if False (the default), return those documents as a list
- __criteria__: the query, enclosed in {} and written in the same JavaScript-like syntax as the Mongo shell (a Python dict); if omitted, all documents match
- __projection__: a second argument that specifies which keys of each document we want returned, see [db.collection.find()](https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection)
- __mongo_conn_kw__: optional keyword arguments passed on to pymongo.MongoClient

OUTPUT:
- __list__ or __cursor__: by default we obtain a list of the documents matching our query; with return_cursor=True we obtain the cursor (iterable object) instead

In [2]:
def load_from_mongo(mongo_db, mongo_db_coll, return_cursor=False,
                    criteria=None, projection=None, **mongo_conn_kw):
    
    client = pymongo.MongoClient(**mongo_conn_kw)    
    db = pymongo.database.Database(client, mongo_db)
    coll = db.get_collection(mongo_db_coll)

    if criteria is None:
        criteria = {}
    
    if projection is None:
        cursor = coll.find(criteria)
    else:
        cursor = coll.find(criteria, projection)
    
    if return_cursor:
        return cursor
    else:
        return [ item for item in cursor ]
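
A sketch of how `criteria` and `projection` might be built for this collection. The helper `build_tweet_query` and the choice of returned keys are assumptions made for illustration; `$regex` and `$options` are standard MongoDB query operators. The call to `load_from_mongo` is left commented out because it needs a running MongoDB instance.

```python
def build_tweet_query(keyword):
    """Hypothetical helper: return (criteria, projection) for tweets matching keyword."""
    # criteria: case-insensitive match of the keyword inside the tweet text
    criteria = {"text": {"$regex": keyword, "$options": "i"}}
    # projection: 1 keeps a key, 0 drops it; _id is returned unless excluded
    projection = {"text": 1, "entities": 1, "_id": 0}
    return criteria, projection

criteria, projection = build_tweet_query("tobacco")

# With a MongoDB server running, the query would be issued as:
# cur = load_from_mongo('streamingAPI', 'top_occurrences0', return_cursor=True,
#                       criteria=criteria, projection=projection)
```

Keeping only `text` and `entities` is enough for the functions defined in this notebook and reduces the amount of data transferred per tweet.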

### save_to_mongo()
INPUT:
- __data__: the document (dict) or list of documents we want to insert
- __mongo_db__: name of the database that holds the collection we are interested in (string)
- __mongo_db_coll__: name of the collection in which we want to insert the documents (string)
- __mongo_conn_kw__: optional keyword arguments passed on to pymongo.MongoClient

OUTPUT:
- __ids__: the _id value(s) of the inserted document(s)

In [None]:
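def save_to_mongo(data, mongo_db, mongo_db_coll, **mongo_conn_kw):
    
    client = pymongo.MongoClient(**mongo_conn_kw)
    db = pymongo.database.Database(client, mongo_db)
    coll = db.get_collection(mongo_db_coll)

    # Collection.insert() is deprecated since PyMongo 3.0 and removed in 4.0;
    # insert_one()/insert_many() replace it and expose the new _id values
    if isinstance(data, dict):
        return coll.insert_one(data).inserted_id
    return coll.insert_many(data).inserted_ids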

### natural language processing functions
- _processTweet()_: clean the text of a tweet; takes a __tweet document__ (it reads the `text` key) and returns the cleaned __text__
- _processWord()_: process a single word; takes and returns a __single word__
- _getStopWordList()_: takes the path of a __text file__ containing English stopwords (one per line) and returns a __list of stopwords__
- _getWordsVector()_: takes a __tweet document__, preprocesses its text and returns the __list of words__ contained in it

In [13]:
def processTweet(tweet):
    #Convert the tweet text to lower case
    tweet = tweet['text'].lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert "rt @username" to RETWEET
    tweet = re.sub(r'(rt\s)@[^\s]+', 'RETWEET', tweet)
    #Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    #Collapse runs of white space into a single space
    tweet = re.sub(r'[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #Trim leading and trailing quotes
    tweet = tweet.strip('\'"')
    return tweet


def processWord(w):
    #collapse runs of 2 or more repetitions of a character into exactly 2 occurrences
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    w = pattern.sub(r"\1\1", w)
    #strip punctuation
    w = w.strip('\'"?,.')
    #keep the word only if it starts with a letter and contains just letters, digits and hyphens;
    #otherwise replace it with the placeholder 'ABC'
    val = re.search(r"^[a-zA-Z][a-zA-Z0-9-]*$", w)
    if val is None:
        w = 'ABC'
    return w


def getStopWordList(stopWordListFileName):
    #start from the placeholder tokens produced by processTweet() and processWord()
    stopWords = ['AT_USER', 'URL', 'RETWEET', 'ABC']
    #read the stopwords file (one word per line) and extend the list
    with open(stopWordListFileName, 'r') as fp:
        for line in fp:
            stopWords.append(line.strip())
    return stopWords


def getWordsVector(tweet):
    # initialize vector
    wordsVector = []
    #initialize stopWords
    stopWords = getStopWordList('data/stopwords.txt')
    #process tweet and split into words
    tweet = processTweet(tweet)
    words = tweet.split()
    for w in words:
        w = processWord(w)
        if w in stopWords:
            continue
        else:
            wordsVector.append(w.lower())
    return wordsVector

In [15]:
# test
tweet = {"text" : "RT @StepInPuddles: \"See through the trees\" alcohol marker pen and watercolour 8\" X 11\" please retweet if you like https://t.co/tbtrExK9rW"}
print(tweet['text'])
print("processTweet()")
processed_text = processTweet(tweet)
print(processed_text)
print("processWord()")
words_list = []
for w in processed_text.split():
    w_processed = processWord(w)
    words_list.append(w_processed)
print(words_list)
print("getStopWordList()")
stwl = getStopWordList('data/stopwords.txt')
print("getWordsVector()")
wv = getWordsVector(tweet)

print(wv)

RT @StepInPuddles: "See through the trees" alcohol marker pen and watercolour 8" X 11" please retweet if you like https://t.co/tbtrExK9rW
processTweet()
RETWEET "see through the trees" alcohol marker pen and watercolour 8" x 11" please retweet if you like URL
processWord()
['RETWEET', 'see', 'through', 'the', 'trees', 'alcohol', 'marker', 'pen', 'and', 'watercolour', 'ABC', 'x', 'ABC', 'please', 'retweet', 'if', 'you', 'like', 'URL']
getStopWordList()
getWordsVector()
['trees', 'alcohol', 'marker', 'pen', 'watercolour', 'please', 'retweet']


### extract_entities_from_collection()
INPUT:
- __cursor_object__: PyMongo cursor object returned as the result of a MongoDB query (any iterable of tweet documents works)

OUTPUT:
- __result__: returns a dictionary containing the 3 __lists__ of extracted entities, result = {"words": words, "screen_names": screen_names, "hashtags": hashtags}

In [16]:
def extract_entities_from_collection(cursor_object):
    words = []
    screen_names = []
    hashtags = []
    
    result = {"words": words, "screen_names": screen_names, "hashtags": hashtags}

    for tweet in cursor_object:
        
        wordsVector = getWordsVector(tweet)
        for word in wordsVector:
            words.append(word)

        for user_mention in tweet['entities']['user_mentions']: 
            screen_names.append(user_mention['screen_name'])

        for hashtag in tweet['entities']['hashtags']:
            hashtags.append(hashtag['text'].lower())

    return result
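
To see which fields the function reads, here is a minimal sketch run on hand-made documents instead of a cursor. The two sample tweets are invented for illustration and contain only the keys used above; the word extraction step is left out so that the block does not depend on `data/stopwords.txt`.

```python
# Invented sample documents with only the keys the extraction code reads
sample_tweets = [
    {"text": "RT @who: tobacco kills #health",
     "entities": {"user_mentions": [{"screen_name": "who"}],
                  "hashtags": [{"text": "Health"}]}},
    {"text": "quit smoking today",
     "entities": {"user_mentions": [], "hashtags": []}},
]

screen_names = []
hashtags = []
for tweet in sample_tweets:
    # each user mention is a dict; the handle lives under 'screen_name'
    screen_names.extend(m["screen_name"] for m in tweet["entities"]["user_mentions"])
    # each hashtag is a dict; 'text' holds the tag without the leading '#'
    hashtags.extend(h["text"].lower() for h in tweet["entities"]["hashtags"])
```

Documents that lack the `entities` key would raise a `KeyError` in this loop, just as documents without `text` do in `processTweet()`.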

----------

# Use the functions

Today is July 28, 2016; so far we have streamed tweets containing the keyword "tobacco" for 5 days ("keywords_list" in MongoDB).
We are going to explore what this collection contains so far.

### Count occurrences of collection entities: words, hashtags, screen_names

In [17]:
from collections import Counter
from prettytable import PrettyTable

cur = load_from_mongo('streamingAPI', 'top_occurrences0', return_cursor=True)

results = extract_entities_from_collection(cur)

for label, data in (('Word', results["words"]), 
                    ('Screen Name', results["screen_names"]), 
                    ('Hashtag', results["hashtags"])):
    pt = PrettyTable(field_names=[label, 'Count']) 
    c = Counter(data)
    for kv in c.most_common(150):  # the 150 most frequent entries
        pt.add_row(kv)
    pt.align[label], pt.align['Count'] = 'l', 'r' # Set column alignment
    print(pt)
    
# if "KeyError: 'text'" is raised, run the following query in the mongo shell
# to delete the documents that have no text field
# db.getCollection('top_occurrences0').remove({"text": { $exists: false }})

+-----------------+-------+
| Word            | Count |
+-----------------+-------+
| health          | 30163 |
| cancer          | 17523 |
| smoke           | 11441 |
| weed            |  8456 |
| smoking         |  7700 |
| alcohol         |  5880 |
| da              |  5805 |
| marijuana       |  5534 |
| time            |  3745 |
| women           |  3573 |
| breast          |  3572 |
| care            |  3268 |
| mental          |  2906 |
| cannabis        |  2854 |
| stop            |  2834 |
| survivors       |  2828 |
| looking         |  2486 |
| people          |  2422 |
| path            |  2226 |
| blocking        |  2150 |
| quic            |  2142 |
| medical         |  2139 |
| via             |  2121 |
| addiction       |  2080 |
| minutes         |  2070 |
| toby            |  1979 |
| spent           |  1975 |
| sandwhich       |  1935 |
| mf              |  1920 |
| bread           |  1916 |
| ham             |  1915 |
| drinking        |  1864 |
| look            | 

### Find n-grams

Since hashtags are a subset of words (we also treat each hashtag as a word), we want to update the list of streamed keywords with the top words.
Problem: when adjectives occur, it would be more helpful to have __n-grams__ than single words.

__TODO__: check for n-grams (focus on 3-grams and 5-grams)

[n-grams](https://marcobonzanini.com/2015/03/17/mining-twitter-data-with-python-part-3-term-frequencies/)
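
As a starting point for the TODO, here is a minimal sketch of n-gram counting in plain Python. The helper `ngrams` and the sample word lists are assumptions made for illustration; in this notebook the input would be the per-tweet output of `getWordsVector()`, so that no n-gram spans two different tweets.

```python
from collections import Counter

def ngrams(words, n):
    """Return the list of n-grams (as tuples) of consecutive words."""
    return list(zip(*[words[i:] for i in range(n)]))

# Invented word vectors, one list per tweet
word_vectors = [
    ["breast", "cancer", "survivors"],
    ["stop", "smoking", "breast", "cancer"],
]

bigram_counts = Counter()
for words in word_vectors:
    # count the 2-grams of each tweet separately
    bigram_counts.update(ngrams(words, 2))

top_bigram = bigram_counts.most_common(1)[0]
```

The same loop works for 3-grams and 5-grams by changing `n`; tweets shorter than `n` words simply contribute nothing. NLTK, already imported above, provides `nltk.util.ngrams` for the same purpose.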