# Stoneburner, Kurt
- ## DSC 550 - Week 02

In [67]:
# //****************************************************************************************
# //*** Imports and display options for the notebook.
# //****************************************************************************************

import os
import sys
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
import re
import unicodedata

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

### 2.2 Exercise: Build Your Text Classifiers ###

**1. You can find the dataset controversial-comments.jsonl for this exercise in the Weekly Resources: Week 2 Data Files.**

Pre-processing Text: For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame. Then,

In [2]:
#//*** A temporary dictionary holds one list of values per JSON key. pd.read_json generated an error, likely because
#//*** the file is not a single JSON document: each line is its own JSON object.
#//*** Read the file line by line.
#//*** Parse each line of JSON, then each key/value pair. Each value is appended to the list stored at tdict[key].
#//*** This works as long as every line of the input file has the same keys.
#//*** Not sure what the canonical method is for converting the items into a DataFrame, but this technique has
#//*** worked well in DSC530 and DSC540.

#//*** Temporary Dictionary
tdict = {}

#//*** Read JSON into lists based on keys.
with open('z_controversial-comments.jsonl', "r") as f:
    
    #//*** Initialize tdict. Each Key is used in both the JSON and tdict. This works on JSON of any length but is
    #//*** limited to a flat construct which is fine for 2-D arrays.
    #//*** 1.) Read the first line of the file
    #//*** 2.) Convert the first line of JSON to a dictionary
    #//*** 3.) Get each key/value in dictionary items
    for key,value in json.loads(f.readline()).items():
            #//*** Initialize a list of value, using tdict[key]
            tdict[key] = [value]
    
    #//*** Process each remaining line.
    for line in f:
        
        #//*** 1.) Convert each line to a dictionary
        #//*** 2.) get each key/value in dictionary
        for key,value in json.loads(line).items():
            
            #//*** Add Value to the list associated with tdict[key]
            tdict[key].append(value)
#//*** Build the DataFrame directly from tdict: each key becomes a column, each list becomes the column data
con_df = pd.DataFrame(tdict)

#//*** Delete tdict. It is no longer needed and is a 200 MB+ object
del tdict
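
The file has one JSON object per line, a layout that `pd.read_json` can parse when called with `lines=True`. The sketch below shows that alternative to the manual loop on a small in-memory sample. The two sample records are made up for illustration, and whether this call handles the full comments file without error was not checked here.

```python
import io

import pandas as pd

# Hypothetical two-record sample in the same shape as the comments file:
# one JSON object per line, each with a "con" flag and a "txt" string.
sample_jsonl = (
    '{"con": 0, "txt": "First comment"}\n'
    '{"con": 1, "txt": "Second comment"}\n'
)

# lines=True tells pandas to parse each line as its own JSON object.
sample_df = pd.read_json(io.StringIO(sample_jsonl), lines=True)

print(sample_df)
```

For the real file, the first argument would be the path instead of the `StringIO` object, for example `pd.read_json('z_controversial-comments.jsonl', lines=True)`.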

**A. Convert all text to lowercase letters.**

In [None]:
#//*** Convert to lower case
con_df['txt'] = con_df['txt'].str.lower()

**B. Remove all punctuation from the text.**

In [85]:
#//*** Remove new lines. I didn't see any samples of \r\n, but it is common enough, so remove it if it exists.
#//*** The optional \r means this one pattern also covers plain \n new lines.
#//*** regex=True is explicit because newer pandas versions treat the pattern as a literal string by default.
con_df['txt'] = con_df['txt'].str.replace(r'\r?\n', "", regex=True)

#//*** Remove html entities, observed entities are &gt; and &lt;. All HTML entities begin with & and end with ;.
#//*** Match one entity at a time. A greedy '&.*;' would delete everything between the first & and the last ;.
con_df['txt'] = con_df['txt'].str.replace(r'&#?\w+;', "", regex=True)

#//*** Remove elements flagged as [removed]
con_df['txt'] = con_df['txt'].str.replace(r'\[removed\]', "", regex=True)

#//*** Remove elements flagged as [deleted]
con_df['txt'] = con_df['txt'].str.replace(r'\[deleted\]', "", regex=True)

#//*** Some text should be empty after the removal of [removed] and [deleted].
#//*** Drop the empty rows. The .copy() avoids a SettingWithCopyWarning on the column assignment below.
con_df = con_df[con_df['txt'].str.len() > 0].copy()

#//*** Remove punctuation using the example from the book
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P') )
con_df['txt'] = con_df['txt'].str.translate(punctuation)
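
To make the punctuation step concrete, here is a small standalone check of the same translation table on a plain string. The sample sentence is invented. The table only covers Unicode categories that start with `P`, so symbols such as `$` and `+` (categories `Sc` and `Sm`) are left in place.

```python
import sys
import unicodedata

# Same construction as the cell above: map every code point whose Unicode
# category starts with 'P' to None, which str.translate treats as "delete".
punctuation = dict.fromkeys(
    i for i in range(sys.maxunicode)
    if unicodedata.category(chr(i)).startswith('P')
)

sample = "don't stop, it costs $5 + tax!"  # hypothetical sample text
cleaned = sample.translate(punctuation)

print(cleaned)  # dont stop it costs $5 + tax
```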


**C. Remove stop words.**
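
No code for this step appears in the notebook, so the following is only a sketch of one way to do it. The helper name `remove_stop_words` and the tiny `demo_stop_words` set are hypothetical. In the notebook the set would more likely come from `stopwords.words('english')`, which requires the NLTK stop word corpus to be downloaded first.

```python
def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)


# Hypothetical stand-in for set(stopwords.words('english')).
demo_stop_words = {"this", "is", "the", "a", "to", "and"}

sample = "this subreddit is terrible"
print(remove_stop_words(sample, demo_stop_words))  # subreddit terrible
```

Applied to the DataFrame, this could look like `con_df['txt'].apply(lambda t: remove_stop_words(t, demo_stop_words))`.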

**D. Apply NLTK’s PorterStemmer.**
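
This step is also not implemented in the notebook. A minimal sketch, assuming NLTK is installed: `PorterStemmer` is rule-based and needs no extra corpus download, and `stem_text` is a hypothetical helper name.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def stem_text(text):
    """Stem each whitespace-separated token and rejoin with single spaces."""
    return " ".join(stemmer.stem(word) for word in text.split())


print(stem_text("running comments"))  # run comment
```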

In [84]:

print(con_df['txt'][400:420])
#print(punctuation)

439    bill will fuck a bagel if it hasnt been too lo...
440                               ucuteman got rekt lulz
442    i dont think anyone cares about his business m...
443    i mean hes going to take away my and my spouse...
444    if you are wanting to argue that something the...
445                   as long as we get to keep chicago 
446    everytime you say that i doubt more and more t...
447    working momsnever underestimate how stupid lef...
448    if you had put that effort into your first res...
449    1 why dont you post any positive trump news in...
450    ive been saying and thinking this for a year a...
452                           this subreddit is terrible
453    mitt romney is so right  for once there is a g...
454    are you dense enough to be taking the comments...
455    i think it boils down to them not understandin...
456    is that true cant the electors vote without th...
457    a lotta research on that being done on thedona...
458    especially when it was a

In [48]:
print(con_df[100:200])

     con                                                txt
100    0  this is the way government is supposed to be! ...
101    0  maybe they are cheating, we should keep drugs ...
102    1  when trump first hired her, everyone on both s...
103    0  uh... fyi who brought up race?and the left is ...
104    0  there we go, that i can somewhat agree with.i ...
105    0  lol. it is genuinely hilarious how worked up y...
106    0  they were betting on continuing the rigged gam...
107    0  i don't think you see the point: i don't need ...
108    0  well, since they have such a minority she's no...
109    0  he's said nothing. stepping down from manageme...
110    0  but it was caused by lack of banking regulatio...
111    0  the state of indiana will give the company tax...
112    0  runs circles around media, controls the narrat...
113    0  doesn't exist, anymore than his "400+ arrested...
114    0  for the midwest, yes. for states like north ca...
115    0  in this particular instance, i

**2. Now that the data is pre-processed, you will apply three different techniques to get it into a usable form for model-building. Apply each of the following steps (individually) to the pre-processed data.**

A. Convert each text entry into a word-count vector (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook).
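
A minimal sketch of the word-count approach using scikit-learn's `CountVectorizer`, assuming scikit-learn is installed. The three short documents are invented stand-ins for pre-processed comments.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-processed comments (lowercase, no punctuation).
docs = [
    "subreddit terrible terrible",
    "keep chicago",
    "subreddit keep posting",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each word to its column index in the matrix.
terrible_col = vectorizer.vocabulary_["terrible"]
print(counts.shape)                        # (3, 5)
print(counts.toarray()[0, terrible_col])   # 2
```

The result is a sparse matrix, which keeps memory use manageable when the vocabulary is large.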

B. Convert each text entry into a part-of-speech tag vector (see section 6.7 in the Machine Learning with Python Cookbook).
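
A sketch of how tagged tokens could be turned into a fixed-length vector. To keep it self-contained, the `(word, tag)` pairs below are hand-written stand-ins for what `nltk.pos_tag` would return (the tagger needs its model data downloaded first), and `pos_count_vector` is a hypothetical helper. This version counts tags; a one-hot presence encoding is another option.

```python
from collections import Counter


def pos_count_vector(tagged_tokens, tag_vocabulary):
    """Count how often each tag in tag_vocabulary occurs in one document."""
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    return [tag_counts[tag] for tag in tag_vocabulary]


# Hand-written stand-in for nltk.pos_tag output (Penn Treebank style tags).
tagged = [
    ("this", "DT"), ("subreddit", "NN"), ("is", "VBZ"),
    ("a", "DT"), ("terrible", "JJ"), ("place", "NN"),
]

# Fixed column order so every document maps to a vector of the same length.
tag_vocabulary = ["DT", "JJ", "NN", "VB", "VBZ"]

print(pos_count_vector(tagged, tag_vocabulary))  # [2, 1, 2, 0, 1]
```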

C. Convert each entry into a term frequency-inverse document frequency (tfidf) vector (see section 6.9 in the Machine Learning with Python Cookbook).
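
A minimal sketch of the TF-IDF approach using scikit-learn's `TfidfVectorizer`, assuming scikit-learn is installed. The documents are invented so that one word appears in all of them, which shows how the weighting discounts common terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-processed comments; "trump" appears in every document.
docs = [
    "trump news today",
    "trump rally tonight",
    "trump tweets again",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()  # rows are L2-normalized by default

vocab = tfidf.vocabulary_
# Within the first document both words occur once, but "news" is rarer
# across the corpus, so it receives the larger weight.
print(weights[0, vocab["news"]] > weights[0, vocab["trump"]])  # True
```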

**Follow-Up Question**

For the three techniques in problem (2) above, give an example where each would be useful.

NOTE

Running these steps on all of the data can take a while, so feel free to cut down on the number of texts (maybe 50,000) if your program takes too long to run. But be sure to select the text entries randomly!
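
One way to draw the random subset is `DataFrame.sample`. The sketch below uses a small made-up frame (`demo_df`) in place of `con_df`, and the `random_state` value is an arbitrary choice that only makes the draw repeatable.

```python
import pandas as pd

# Hypothetical small frame standing in for con_df.
demo_df = pd.DataFrame({
    "con": [0, 1] * 50,
    "txt": [f"comment {i}" for i in range(100)],
})

# Draw rows at random without replacement; random_state makes the draw repeatable.
# For the real data the note suggests something like n=50000.
sample_df = demo_df.sample(n=10, random_state=42)

print(len(sample_df))  # 10
```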

In [4]:
# //*** CODE HERE

In [5]:
# //*** CODE HERE