# Stoneburner, Kurt
- ## DSC 550 - Week 02

In [90]:
#//**** Project imports.
#//*** The nltk libraries involve additional downloads. The Try blocks automatically download the content if it's not
#//*** present. This feels like good form and being a digital nomad, it should run on whichever workstation I happen
#//*** to be on.
import os
import sys
import json 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import unicodedata
import time

#//*** nltk - Natural Language toolkit
import nltk

#//**** Requires the punkt module. Download if it doesn't exist
try:
    type(nltk.punkt)
except:
    nltk.download('punkt')
    
from nltk.corpus import stopwords

#//*** Stopwords requires an additional download
try:
    type(stopwords)
except:
    nltk.download('stopwords')

from nltk.tokenize import word_tokenize

from nltk.stem.porter import PorterStemmer

from nltk import pos_tag

#//pos_tag requires an additional download

try:
    pos_tag(["the","quick","brown","fox"])
except: 
    nltk.download('averaged_perceptron_tagger')

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

#//*** Convenience function to 
#//*** Take a time value and display the difference
#//*** Return the difference
def cum_time(input_time):
    tot_time = round(time.time() - input_time,2)
    
    print(f"Process Time: {int(tot_time/60)}m {tot_time % 60}s")
    
    return tot_time
    

### 2.2 Exercise: Build Your Text Classifiers ###

**1. You can find the dataset controversial-comments.jsonl for this exercise in the Weekly Resources: Week 2 Data Files.**

Pre-processing Text: For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame. Then,

In [91]:
#//************************************************************************************************************************
#//*** In hindisght a little research would have revealed I could have used pd.read_json() with the lines=True attribute.
#//*** Instead, I borrowed code from my previous projects and parsed each line into a list dictionary.
#//************************************************************************************************************************
#//*** Read the file line by line
#//*** Parse each line of JSON. Parse each Key / Value pair. Each value is appeneded to a list. The lists are managed
#//*** with tdict[key]. As long as the input file has the same number of keys for each line, then this works.
#//*** Not sure what the canonical method is for converting items into a dataframe. But this technique has worked well
#//*** in DSC530 and DSC540.
#//************************************************************************************************************************

#//*** Temporary Dictionary
tdict = {}

#//*** Start Timing the process
start_time = time.time()
total_job_time = 0

#//*** Read JSON into lists based on keys.
with open('z_controversial-comments.jsonl', "r") as f:
    
    #//*** Initialize tdict. Each Key is used in both the JSON and tdict. This works on JSON of any length but is
    #//*** limited to a flat construct. It works well for 2-D arrays.
    #//*** 1.) Read the first line of the file
    #//*** 2.) Convert the first line of JSON to a dictionary
    #//*** 3.) Get each key/value in dictionary items
    for key,value in json.loads(f.readline()).items():
            #//*** Initialize a list of value, using tdict[key]
            tdict[key] = [value]
    
    #//*** Process each remaining lines.
    for line in f:
        
        #//*** 1.) Convert each line to a dictionary
        #//*** 2.) get each key/value in dictionary
        for key,value in json.loads(line).items():
            
            #//*** Add Value to the list associated with tdict[key]
            tdict[key].append(value)
            
#//*** Initialize a new dataframe
con_df = pd.DataFrame()

#//*** Loop through tdict, add each key as a column with value as the column data
for key,value in tdict.items():
    con_df[key] = value

#//*** Delete tdict. It is unused and a 200mb+ object
del tdict

#//*** Display process time and keep a running time of all step 1 processes
total_job_time += cum_time(start_time)

Process Time: 0m 5.79s


**A. Convert all text to lowercase letters.**

In [92]:
#//*** Start Timing the process
start_time = time.time()

#//*** Convert to lower case
con_df['txt'] = con_df['txt'].str.lower()

#//*** Display process time and keep a running time of all step 1 processes
total_job_time += cum_time(start_time)

Process Time: 0m 0.83s


**B. Remove all punctuation from the text.**

In [93]:
#//*** Start Timing the process
start_time = time.time()

#//*** Remove new lines, I didn't see any samples of \r\n. But it is common enough. Replace it if it exists
con_df['txt'] = con_df['txt'].str.replace(r'\r?\n',"")

#//*** Remove plain ]n new lines
con_df['txt'] = con_df['txt'].str.replace(r'\n',"")

#//*** Remove html entities, observed entities are &gt; and &lt;. All HTML entities begin with & and end with ;.
#//*** Let's use regex to remove html entities
con_df['txt'] = con_df['txt'].str.replace(r'&.*;',"")

#//*** Remove elements flagged as [removed]
con_df['txt'] = con_df['txt'].str.replace(r'\[removed\]',"")

#//*** Remove elements flagged as [deleted]
con_df['txt'] = con_df['txt'].str.replace(r'\[deleted\]',"")

#//*** Some text should be empty with the removal of [removed] and [deleted]
#//*** Remove the empty text
con_df = con_df[ con_df['txt'].str.len() > 0]

#//*** Remove punctuation using the example from the book
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P') )
con_df['txt'] = con_df['txt'].str.translate(punctuation)

#//*** Display process time and keep a running time of all step 1 processes
total_job_time += cum_time(start_time)

Process Time: 0m 12.12s


**C. Remove stop words.**

In [94]:
#//*** Start Timing the process
start_time = time.time()

#//*** Tokenize conf_df['txt']
#//*** This Takes a wee bit to run
con_df['txt'] = con_df['txt'].apply(word_tokenize)

#//*** Display process time and keep a running time of all step 1 processes
total_job_time += cum_time(start_time)

Process Time: 2m 46.28s


In [95]:
#//*** Start Timing the process
start_time = time.time()

#//*** I'm not pythonic enough to do this on one line.
#//*** This function removes stop_words from a list.
#//*** Works with dataframe.apply()
def remove_stop_words(input_list):
    #//*** Load Stopwords   
    for word in input_list:
        if word in stop_words:
            input_list.remove(word)
    return input_list

#//*** The stop_words include punctuation. Stop Word Contractions will not be filtered out.
stop_words = []

#//*** Remove apostrophies from the stop_words
for stop in stopwords.words('english'):
    stop_words.append(stop.replace("'",""))

#//*** Remove Stop words from the tokenized strings in the 'process' column
con_df['txt'] = con_df['txt'].apply(remove_stop_words)

#//*** If I was cool, I'd do an additional filter pass and remove any Tokenized words with a length of 1
#//*** Then do a second pass and remove rows that have no items.

#//*** Display process time and keep a running time of all step 1 processes
total_job_time += cum_time(start_time)

Process Time: 0m 32.5s


**D. Apply NLTK’s PorterStemmer.**

In [97]:
#//*** Start Timing the process
start_time = time.time()

#/*** Create Stemmer
porter = PorterStemmer()

#//*** Pre stemming sample
print(con_df['txt'][400:420])

#//*** It's a pythonic answer
#//*** 1.) Apply() an action to each row
#//*** 2.) lambda word_list, each row is treated as word_list for the subsequent expression
#//*** 3.) The base [ word for word in wordlist] would return each word in word_list as a list. 
#//*** 4.) [porter.stem(word) for word in word_list] - performs stemming on each word and returns a list
con_df['txt'] = con_df['txt'].apply(lambda word_list: [porter.stem(word) for word in word_list] )

#//*** post stemming sample
print(con_df['txt'][400:420])

#//*** Display process time and keep a running time of all step 1 processes
total_job_time += cum_time(start_time)

439    [bill, fuck, bagel, it, been, long, of, toaste...
440                          [ucuteman, got, rekt, lulz]
442    [dont, think, anyone, cares, his, business, ma...
443    [mean, hes, going, take, away, spouses, health...
444    [wanting, argue, something, did, past, make, d...
445                       [long, we, get, keep, chicago]
446    [everytime, say, doubt, that, actually, adult,...
447    [working, momsnever, underestimate, stupid, le...
448    [you, put, effort, your, first, response, conv...
449    [1, dont, post, positive, trump, news, instead...
450    [ive, saying, thinking, for, year, its, depres...
452                                [subreddit, terrible]
453    [mitt, romney, so, right, once, guy, in, offic...
454    [you, dense, enough, be, taking, comments, sec...
455    [think, boils, not, understanding, lobbysts, t...
456    [that, true, cant, electors, vote, without, po...
457    [lotta, research, that, done, thedonald, trump...
458      [especially, it, a, 17

In [146]:
print(f"Total Step One Job Time: {int(total_job_time/60)}m {total_job_time % 60}s ")

Total Step One Job Time: 11m 11.649999999999977s 


In [116]:
#//*** Save the Dataframe to a CSV to speed up later processing sessions.
#//*** This ended up being a huge time sink. 
#//*** This is commented for safety. It is uncommented when needed

#con_df.to_csv("z_wk02_controversial_words_df.csv")

**2. Now that the data is pre-processed, you will apply three different techniques to get it into a usable form for model-building. Apply each of the following steps (individually) to the pre-processed data.**

In [175]:
#//*** Start Timing the process
start_time = time.time()
total_job_time = 0

#//*** For Step 2 We'll start from the CSV
con_df = pd.read_csv("z_wk02_controversial_words_df.csv")

#//*** Delete the first column, For some reason CSV likes to make the index also the first column. 
con_df.pop(con_df.columns[0])

#//*** In hindsight it is better to not move into and out of a CSV. It would have been a more efficient use of time
#//*** To just reprocess the data for every session.
#//*** The Tokenized words are stored to the CSV as a literal string representation of a list.
#//*** remove the list syntax to get a string
#//*** This process could have been improved by detokenizing before writing to CSV
#//*** I kept with this process since I'm fully invested in the sunken cost fallacy.

con_df['txt'] = con_df['txt'].str.replace("[","").str.replace("]","").str.replace(",","").str.replace("'","")

#//*** Create a tokenized columm for section 2B
con_df['token'] = con_df['txt'].apply(word_tokenize)

print()

#//*** Display the Process Time
cum_time(start_time)


Process Time: 2m 16.03s


In [4]:
#//*** Import Section 2 libraries.
#//*** This is ill form for standard python. But feels appropriate for Notebooks.
from sklearn.feature_extraction.text import CountVectorizer

A. Convert each text entry into a word-count vector (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook).

In [177]:
#//*** Start Timing the process
start_time = time.time()

#//*** Create the bag of words feature matrix
#//*** Initialize a new instance of the Counter Vectorizer
count_vector = CountVectorizer()

#//*** Vectorize the word count into a bag of words column
#//*** This is awesome in the sense that so much work is abstracted behind a single line of code.
#//*** I don't quite understand what is going on here. I'm assuming we will build upon this work and clarity
#//*** (or confounding) will come later.

con_df['bow'] = count_vector.fit_transform(con_df['txt'])

print(con_df['bow'])

#//*** Display the Process Time
cum_time(start_time)
print()

0           (0, 341165)\t1\n  (0, 127813)\t1\n  (0, 1342...
1           (0, 341165)\t1\n  (0, 127813)\t1\n  (0, 1342...
2           (0, 341165)\t1\n  (0, 127813)\t1\n  (0, 1342...
3           (0, 341165)\t1\n  (0, 127813)\t1\n  (0, 1342...
4           (0, 341165)\t1\n  (0, 127813)\t1\n  (0, 1342...
                                ...                        
874134      (0, 341165)\t1\n  (0, 127813)\t1\n  (0, 1342...
874135      (0, 341165)\t1\n  (0, 127813)\t1\n  (0, 1342...
874136      (0, 341165)\t1\n  (0, 127813)\t1\n  (0, 1342...
874137      (0, 341165)\t1\n  (0, 127813)\t1\n  (0, 1342...
874138      (0, 341165)\t1\n  (0, 127813)\t1\n  (0, 1342...
Name: bow, Length: 874139, dtype: object
Process Time: 0m 22.38s



In [171]:
#//*** Pick a line for verification. 
#//*** We can verify one word get is used 3x times. Which matches the results. This test is by no means definitive
#//*** but it feels 'right'. This is a cursory check opposed to a rigourous one.
print(con_df['txt'].iloc[3])
print(con_df['bow'].iloc[3])

get frustrat reason want it way becaus foundat more complex problem they advanc grade can get decent grade an sat type test math i realli understand lot mathemat way get the right answer lot time can figur a common sens work around a lot the question i would ill prepar take colleg level math cours despit the abov averag math score theyr just tri to bust the kid ball
  (0, 341165)	1
  (0, 127813)	1
  (0, 134216)	3
  (0, 283442)	2
  (0, 313271)	1
  (0, 42431)	1
  (0, 162200)	1
  (0, 217710)	1
  (0, 93569)	1
  (0, 320954)	1
  (0, 348493)	1
  (0, 113660)	1
  (0, 326408)	1
  (0, 170067)	1
  (0, 160770)	1
  (0, 253172)	2
  (0, 132834)	1
  (0, 215996)	1
  (0, 74674)	1
  (0, 337416)	1
  (1, 32283)	1
  (1, 261894)	1
  (1, 204312)	1
  (1, 240809)	1
  (2, 133807)	2
  :	:
  (874136, 88880)	1
  (874136, 344351)	1
  (874136, 248570)	1
  (874136, 135318)	1
  (874136, 213820)	1
  (874136, 39375)	1
  (874136, 326775)	1
  (874136, 257601)	1
  (874136, 162589)	1
  (874136, 225432)	1
  (874136, 27998)	1
 

B. Convert each text entry into a part-of-speech tag vector (see section 6.7 in the Machine Learning with Python Cookbook).

In [183]:
#//*** Start Timing the process
start_time = time.time()

#//*** Applying the speech tag vector is easy to code, and really hard on the CPU cycles.
#//*** Runs the part-of-speech tag against the tokenized data.
con_df['pos_tag'] = con_df['token'].apply(pos_tag)

print(con_df['pos_tag'])

#//*** Display the Process Time
cum_time(start_time)

0         [(well, RB), (great, JJ), (he, PRP), (someth, ...
1         [(are, VBP), (right, JJ), (mr, NN), (presid, NN)]
2         [(have, VB), (given, VBN), (input, JJ), (apart...
3         [(get, VB), (frustrat, JJ), (reason, NN), (wan...
4         [(am, VBP), (far, RB), (expert, JJ), (tpp, NN)...
                                ...                        
874134               [(payer, NN), (immort, NN), (all, DT)]
874135    [(genuin, NN), (cant, NN), (understand, NN), (...
874136    [(remind, NN), (subreddit, NN), (for, IN), (ci...
874137      [(k, NN), (explain, NN), (or, CC), (anyth, NN)]
874138    [(ya, JJ), (sociopath, NN), (known, VBN), (cel...
Name: pos_tag, Length: 874139, dtype: object
Process Time: 33m 26.450000000000045s


2006.45

C. Convert each entry into a term frequency-inverse document frequency (tfidf) vector (see section 6.9 in the Machine Learning with Python Cookbook).

In [186]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [194]:
#//*** Start Timing the process
start_time = time.time()

#//*** Initialize the Vectorizer
tfidf = TfidfVectorizer()

#//*** Build the feature matrix, which is a weighted sparse matrix
con_df['tfidf'] = tfidf.fit_transform(con_df['txt'])


#//*** Display the Process Time
cum_time(start_time)

#//*** Print the output to check the results.
print(con_df['tfidf'])

print(con_df['tfidf'].iloc[-1])

Process Time: 0m 21.37s
0           (0, 337416)\t0.18676189328943824\n  (0, 7467...
1           (0, 337416)\t0.18676189328943824\n  (0, 7467...
2           (0, 337416)\t0.18676189328943824\n  (0, 7467...
3           (0, 337416)\t0.18676189328943824\n  (0, 7467...
4           (0, 337416)\t0.18676189328943824\n  (0, 7467...
                                ...                        
874134      (0, 337416)\t0.18676189328943824\n  (0, 7467...
874135      (0, 337416)\t0.18676189328943824\n  (0, 7467...
874136      (0, 337416)\t0.18676189328943824\n  (0, 7467...
874137      (0, 337416)\t0.18676189328943824\n  (0, 7467...
874138      (0, 337416)\t0.18676189328943824\n  (0, 7467...
Name: tfidf, Length: 874139, dtype: object
  (0, 337416)	0.18676189328943824
  (0, 74674)	0.4725873408381811
  (0, 215996)	0.15491479249290988
  (0, 132834)	0.21614319633286316
  (0, 253172)	0.27581616050694946
  (0, 160770)	0.12617802969160016
  (0, 170067)	0.10295162508673708
  (0, 326408)	0.2751599031144539
  (0

**Follow-Up Question**

For the three techniques in problem (2) above, give an example where each would be useful.

NOTE

Running these steps on all of the data can take a while, so feel free to cut down on the number of texts (maybe 50,000) if your program takes too long to run. But be sure to select the text entries randomly!

#### Use Case: Bag of Words ####
Bag of words can used to build a quick baseline model using a few lines of code and relatively few CPU cycles. A big  downside of teh bag of words method is it removes word order which is vital for providing context. In situations where the context is domain specific, bag of words can be very effective.

Use cases include:<br>
- **Spam Filtering** - Text classification which filters email as spam or legitimate
- **Sentiment Analysis** - Classification of people's opinion expressed in a piece of text
- **Intention Mining** - Determine a future decision of a person based on the text.

#### Use Case: Part of Speech Tag Vector ####
Part-of-speech tagging in the process of breaking down language in to key categories on a word by word basis. Parts-of-Speech Tagging is used to help identify the context of a word, which involve identifying what type of word is used in its context. Common tags are: Pronouns, Verbs, Nouns, Prepositions, Conjunctions, Adverbs, and Adjectives.

A typical use case involves identifying nouns and adjectives in a sentence to classify the intent of text.

#### Use Case:  Term Frequency-inverse Document Frequency (tfidf) Vector ####
TF-IDF is a statistical measure to evaluate how relevant a word to a document in a collection of documents.
Use case:<br>
- **Information Retrieval**: TF-IDF was invented for document search and can be used to for words that are relevant to a search term
- **Keyword Extraction**: TF-IDF can be used determine document keywords. The highest scoring document words are the most releveant to that document and can be considered keywords