# WordUp

## Introduction

This notebook uses Naive Bayes to classify Reddit comments as either "up" or "down" based on score (upvotes minus downvotes).  The input is the personalfinance subreddit for November 2015.  The initial queries, run in Google BigQuery, calculated percentile values and extracted only the extreme "up" or "down" voted comments (i.e., those below the 3rd or above the 97th percentile).
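
The percentile filtering itself was done in BigQuery and that query is not reproduced here. As a rough illustration only, the same idea might look like this in plain Python. The function name `extreme_comments` and the `(body, score)` row layout are assumptions for this sketch, and it is written for Python 3.8+ (for `statistics.quantiles`) rather than the Python 2 used in the rest of the notebook.

```python
import statistics

def extreme_comments(rows, low_pct=3, high_pct=97):
    """Keep rows scoring below the low percentile or above the high percentile.

    rows is a list of (body, score) pairs with numeric scores (hypothetical layout).
    """
    scores = [score for _, score in rows]
    # With n=100, quantiles() returns the 99 cut points between percentiles,
    # so cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(scores, n=100)
    low, high = cuts[low_pct - 1], cuts[high_pct - 1]
    return [(body, score) for body, score in rows if score < low or score > high]
```

The exact cut-off values depend on the interpolation method, so the result may differ slightly from what BigQuery returns for the same data.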

## Imports

In [17]:
%matplotlib inline
import nltk
import csv
import re
import pandas as pd
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from ggplot import *
from mpl_toolkits.mplot3d import Axes3D
import codecs
import cStringIO

## Read data from CSV file downloaded from Google bucket

Helper classes needed to read Unicode CSV files in Python 2. See https://docs.python.org/2/library/csv.html

In [18]:
class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> list of unicode
        Read the next row and return its fields as a list of Unicode strings.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self

In [20]:
with open('results-20151224-154845.csv', 'rb') as f:
    reader = UnicodeReader(f)
    pflist = list(reader)
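
For comparison, the recoder classes above are only needed on Python 2. In Python 3 the `csv` module works on text directly, so the same read can be sketched as below. The function name `read_unicode_csv` is made up for this example; the `utf-8-sig` encoding is taken from the `UnicodeReader` default above.

```python
import csv
import io

def read_unicode_csv(path, encoding="utf-8-sig"):
    """Return every row of a CSV file as a list of str fields."""
    # newline="" leaves line-ending handling to the csv module, which keeps
    # newlines embedded in quoted fields (common in comment bodies) intact.
    with io.open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f))
```

Usage would then be `pflist = read_unicode_csv('results-20151224-154845.csv')`, assuming the file is UTF-8 with or without a byte order mark.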

In [21]:
pflist[0]

[u'body',
 u'score_hidden',
 u'archived',
 u'name',
 u'author',
 u'author_flair_text',
 u'downs',
 u'created_utc',
 u'subreddit_id',
 u'link_id',
 u'parent_id',
 u'score',
 u'retrieved_on',
 u'controversiality',
 u'gilded',
 u'id',
 u'subreddit',
 u'ups',
 u'distinguished',
 u'author_flair_css_class']

Delete the header row.

In [22]:
del pflist[0] 

Keep only the body (column 0) and score (column 11) of each row.

In [23]:
pflist_body_score = map(lambda line: [line[0], line[11]], pflist)

In [24]:
pflist_body_score[1001]

[u"This is a case where your neighbor's car insurance will be responsible.  This will be covered under their property damage liability coverage.  Depending on your state, the required amount of coverage can be low.  If your state has low requirements and they carried the state minimums then your homeowner's insurance would pick up the rest of the damage after your neighbor's car insurance is exhausted.\n\nI suspect that your neighbor may be getting their car insurance non renewed in the near future since their 8 year old was a) operating their car, and b) hit a fence and deck it.",
 u'106']
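
The hard-coded indexes 0 and 11 depend on the column order of this particular export. A possible alternative, sketched here with a hypothetical helper `select_columns`, looks the positions up from the header row instead. It assumes it is given the full list including the header, so it would replace both the `del pflist[0]` step and the `map` call rather than follow them.

```python
def select_columns(rows, names):
    """Drop the header row and keep only the named columns, in the order given."""
    header, data = rows[0], rows[1:]
    # Look up each requested column once; raises ValueError if a name is missing.
    indexes = [header.index(name) for name in names]
    return [[row[i] for i in indexes] for row in data]
```

For example, `select_columns(pflist, ['body', 'score'])` would give the same `[body, score]` pairs without relying on column positions.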

In [28]:
import re
def cleanup(body):
    body = re.sub("&gt;", ">", body) # Recode HTML codes
    body = re.sub("&lt;", "<", body)
    body = re.sub("&amp;", "&", body)
    body = re.sub("&nbsp;", " ", body)
    body = re.sub(ur"^[deleted]$", "", body) # Remove deleted
    body = re.sub("http[[:alnum:][:punct:]]*", " ", body) # Remove URL
    body = re.sub("/r/[[:alnum:]]+|/u/[[:alnum:]]+", " ", body) # Remove /r/subreddit, /u/user
    # body = re.sub("(>.*?\\n\\n)+", " ", body) # Remove quoted comments
    body = re.sub("[[:cntrl:]]", " ", body) # Remove control characters (\n, \b)
    body = re.sub("'", "", body) # Remove single quotation marks (contractions)
    body = re.sub("[[:punct:]]", " ", body) # Remove punctuation
    body = re.sub("\\s+", " ", body) # Replace multiple spaces with single space
    body = body.strip() # doesn't work for unicode
    # body = body.decode('utf-8').strip()
    body = body.lower() # Lower case
    return body # Return body (cleaned up text)

In [29]:
def label(score):
    if int(score) <= -1: return 'neg'
    else: return 'pos'

In [30]:
# clean up body, change numerical score to pos or neg
pflist_clean = map(lambda line: [cleanup(line[0]), label(line[1])], pflist_body_score)

In [31]:
pflist_clean[100:150]

[[u'not selling a "dream" car man.. i bought the car because i was in a fucked place mentally and i wanted something to remind me of what im even working for. instead of a shitter',
  'neg'],
 [u'i knew i had the accounts but just never had guidance on what to do. can the hospital still assist financially? i thought once they hand it over to collections the collections agency takes over all financial matters.',
  'neg'],
 [u'hes wrong.. call the collection company. ask for a settlement for 3k or less with a pay to delete. it has to be in writing and dont give them access to your bank accounts.',
  'pos'],
 [u'30k is absolute crap for a college degree in chicago. 45 would be the absolute floor for a fresh graduate, much less someone with several years experience. waiters make more. the problem is these temp to hire programs are so deceptive and take advantage of people early in their careers who arent yet confident in themselves. they say 90 days and then x salary act like everyone gets

In [32]:
len(pflist_clean)

6250

In [33]:
pflist_clean[1]

[u'i found the whole-life seller.', 'pos']

In [34]:
# pflist_unicode = map(lambda line: [unicode(line[0]), line[1]], pflist_clean)

In [35]:
pflist_tokens = map(lambda line: [nltk.word_tokenize(line[0]), line[1]], pflist_clean)

In [36]:
pflist_clean

[[u'exactly. its not supposed to be based on quantifiable metrics or whether the seats are leather or not. there are no statistics to measure. "luxury" is about marketing and public perception, not about features.',
  'pos'],
 [u'i found the whole-life seller.', 'pos'],
 [u'north korea is known to be able to produce superbills that are very difficult to tell from the real ones.',
  'pos'],
 [u'it satisfies no human need; it has utility because people are willing to pay for it. its fiat metal.',
  'neg'],
 [u'[deleted]', 'neg'],
 [u'[removed]', 'neg'],
 [u'start looking at your day to day. look at your day and ask yourself "is what im doing today going to help my tomorrow?" if not, look at where you can make changes **and then work to make those changes.**',
  'pos'],
 [u'unless there is an actual price to pay.', 'neg'],
 [u'blows my mind someone would pay rent in cash. if i handed $1k to my landlord every month in bills im sure they would be extremely wary of me.',
  'pos'],
 [u'[remov

See above: u'[deleted]' is still there, so cleanup() isn't working.  I need to set up tests to debug it and make sure it works (TDD!).  The book "Python Cookbook" might have the answers for doing regex with Unicode, but check free sources first.
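
One likely cause is not Unicode but the regex syntax. Python's `re` module does not support POSIX bracket classes such as `[[:alnum:]]` or `[[:punct:]]`, and `^[deleted]$` is read as a character class matching a single letter, not the literal text `[deleted]`. Below is a minimal sketch of a corrected `cleanup()`, written for Python 3 and not yet run against the full data set. Two choices in it are my assumptions, not part of the original: it also blanks `[removed]`, and it treats only ASCII punctuation (`string.punctuation`) as punctuation.

```python
import re
import string

# Character class built from ASCII punctuation (assumption: curly quotes
# and other non-ASCII punctuation are left alone).
_PUNCT = re.compile("[%s]" % re.escape(string.punctuation))

def cleanup(body):
    """Normalize a comment body to lower-case words separated by single spaces."""
    # Recode HTML entities; &amp; goes last so "&amp;gt;" is decoded only once.
    for entity, char in (("&gt;", ">"), ("&lt;", "<"), ("&nbsp;", " "), ("&amp;", "&")):
        body = body.replace(entity, char)
    # Blank out placeholder bodies; the brackets must be escaped to match literally.
    body = re.sub(r"^\[(deleted|removed)\]$", "", body.strip())
    body = re.sub(r"http\S*", " ", body)            # Remove URLs
    body = re.sub(r"/[ru]/\w+", " ", body)          # Remove /r/subreddit, /u/user
    body = re.sub(r"[\x00-\x1f\x7f]", " ", body)    # Remove control characters
    body = body.replace("'", "")                    # Remove apostrophes (contractions)
    body = _PUNCT.sub(" ", body)                    # Replace punctuation with spaces
    body = re.sub(r"\s+", " ", body)                # Collapse runs of whitespace
    return body.strip().lower()
```

A few assertions in the TDD spirit, such as `cleanup("[deleted]") == ""` and `cleanup("I found the whole-life seller.") == "i found the whole life seller"`, would be a reasonable first test set before rerunning the cells above.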