# Stoneburner, Kurt
- ## DSC 550 - Week 04

A large portion of this code takes my text cleaning code from Week 02 - Step 1 and refactors it into multiple functions. As we continue our study of natural language processing, it makes sense to build a framework that cleans data consistently. These steps appear to be consistent (at this point) and are definitely repeatable. I'm an inveterate reductionist; streamlining production workflows into a standardized toolkit is a process I'm constantly refining out in the deflationary world of television news production.

I started with the provided lexicon-based classifier. It works by counting the instances of the words good, special, and bad. The total of negative words is subtracted from the total of positive words to provide a basic (if bespoke and overfitted) sentiment analysis. One interesting test case for the model is the statement 'Today is neither a good day or a bad day!', which includes the negation word neither. In this context it translates to not good and not bad, which is a neutral sentiment. If a classifier handles negation words, it is likely that neither would be applied only to 'neither a good day', which becomes not good; combined with bad, that makes the sentiment appear negative instead of neutral.
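The negation problem described above can be sketched in a few lines of plain Python. This is an illustration only: the `NEGATORS` set, the three-token `window`, and both function names are assumptions made for the example, not part of the provided classifier.

```python
import re

# Three-word lexicon mirroring the words counted by the provided classifier.
LEXICON = {"good": 1, "special": 1, "bad": -1}
# Hypothetical negation words, assumed for this sketch.
NEGATORS = {"not", "neither", "nothing"}

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def plain_score(text):
    """Sum the lexicon weights of every word; negation is ignored."""
    return sum(LEXICON.get(word, 0) for word in tokenize(text))

def naive_negation_score(text, window=3):
    """Flip the sign of lexicon words found within `window` tokens after a negator."""
    score = 0
    flips_left = 0
    for word in tokenize(text):
        if word in NEGATORS:
            flips_left = window
            continue
        weight = LEXICON.get(word, 0)
        if flips_left > 0:
            weight = -weight
            flips_left -= 1
        score += weight
    return score

sentence = "Today is neither a good day or a bad day!"
print(plain_score(sentence))           # good and bad cancel out
print(naive_negation_score(sentence))  # 'good' is flipped, 'bad' is not
```

With this windowed rule, neither reaches good but not bad, so the sentence scores below zero even though a human reader would call it neutral.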

I applied the VADER sentiment analyzer since it is included in the nltk library. VADER is a pre-built, lexicon- and rule-based model designed to evaluate sentiment in social media posts. This seemed like a good candidate for this assignment.

In [1]:
#//**** Project imports.
#//*** The nltk libraries involve additional downloads. The Try blocks automatically download the content if it's not
#//*** present. This feels like good form and being a digital nomad, it should run on whichever workstation I happen
#//*** to be on.
import os
import sys
import json 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import unicodedata
import time
import textblob

#//*** nltk - Natural Language toolkit
import nltk

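#//**** Requires the punkt tokenizer models. Download them if they are not installed.
#//**** nltk.data.find() raises LookupError when a resource is missing.
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
    
#//*** Check for the VADER lexicon
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
except LookupError:
    nltk.download('vader_lexicon')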

from nltk.stem.porter import PorterStemmer

from nltk import pos_tag

#//*** pos_tag requires an additional download

try:
    pos_tag(["the","quick","brown","fox"])
except LookupError: 
    nltk.download('averaged_perceptron_tagger')

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

#//*** Convenience function: take a start time, print the elapsed time
#//*** as minutes and seconds, and return the elapsed seconds.
def cum_time(input_time):
    tot_time = round(time.time() - input_time,2)
    
    #//*** Round the remainder so float noise (e.g. 3.819999999999993) is not printed
    print(f"Process Time: {int(tot_time/60)}m {round(tot_time % 60,2)}s")
    
    return tot_time
    

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\family\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
#//*** Read the raw data into a Series

#//****************************************************
#//*** It's a Dataframe, not a Text File you ninny!
#//****************************************************
#with open("z_wk04_DailyComments.csv", "r") as file:
#    raw_text = pd.Series(file.readlines())

#//*** Load the CSV file into a dataframe
this_df = pd.read_csv("z_wk04_DailyComments.csv")

#//*** Manually classify each statement to check our results
#//*** -1 = Negative
#//***  0 = Neutral
#//***  1 = Positive

this_df['actual'] = [0,1,1,0,-1,0,1]

print(this_df)


  Day of Week                                        comments  actual
0      Monday                             Hello, how are you?       0
1     Tuesday                            Today is a good day!       1
2   Wednesday  It's my birthday so it's a really special day!       1
3    Thursday       Today is neither a good day or a bad day!       0
4      Friday                           I'm having a bad day.      -1
5    Saturday       There' s nothing special happening today.       0
6      Sunday                      Today is a SUPER good day!       1


In [3]:
#//*** Apply Common Cleanup operations
#//*** In anticipation of re-using this text cleanup code, I'm adding some robustness to the function.
#//*** Options passed in input_options can disable features that default to True.
#//*** Whether an action is skipped or executed is based on a boolean value stored in action_dict.
#//*** Key values will default to True. If code needs to be defaulted to False, a default_false list can be added later

#//*** All Boolean option keys are stored in key_list. This speeds up the coding of the action_dict.
#//*** As options are added, append their keys to key_list.
def mr_clean_text(input_series, input_options={}):
    
    #//*** import time library. Python caches loaded modules, so importing again is cheap.
    import time
    
    #//*** Start Timing the process
    start_time = time.time()

    
    #//*** Add some data validation. I'm preparing this function for additional use. I'm checking if future users (ie future me)
    #//*** may throw some garbage at this function. Experience has taught me to fail safely wherever possible.

    #//*** All kwargs are listed here. These initialize TRUE by default.
    key_list = [ "lower", "newline", "html", "remove_empty", "punctuation" ]
    
    #//*** Build Action Dictionary
    action_dict = { } 
    
    #//*** Build the keys from key_list and default them to True
    for key in key_list:
        action_dict[key] = True
        
    #//*** Loop through the input kwargs (if any). Assign the action_dict values based on the kwargs:
    for key,value in input_options.items():
        print(key,value)
        action_dict[key] = value
    
    
    #//*************************************************************************
    #//*** The Cleanup/Processing code is a straight lift from DSC550 - Week02
    #//*************************************************************************
    #//*** Convert to Lower Case, Default to True
    if action_dict["lower"]:
        input_series = input_series.str.lower()
    
   
    #//*** Remove New Lines
    if action_dict["newline"]:
        #//*** Remove both \r\n and bare \n line endings in one pass.
        #//*** regex=True is explicit because pandas 2.0 changed the default to literal matching.
        input_series = input_series.str.replace(r'\r?\n',"",regex=True)

    #//*** Remove html entities, observed entities are &gt; and &lt;. All HTML entities begin with & and end with ;.
    #//*** Match one entity at a time: a greedy '&.*;' would delete everything between the first & and the last ; on a line.
    if action_dict["html"]:
        input_series = input_series.str.replace(r'&#?\w+;',"",regex=True)

    #//*** Remove the empty lines
    if action_dict["remove_empty"]:
        input_series = input_series[ input_series.str.len() > 0]

    #//*** Remove punctuation
    if action_dict["punctuation"]:
        #//*** Libraries needed for punctuation removal. Python caches loaded modules in
        #//*** sys.modules, so importing again here costs almost nothing.
        import sys
        import unicodedata
        
        #//*** Replace comma and period with a space. regex=False keeps "." literal.
        for punct in [",","."]:
            input_series = input_series.str.replace(punct," ",regex=False)

        #//*** Remove punctuation using the example from the book
        punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P') )
        input_series = input_series.str.translate(punctuation)

    print(f"Text Cleaning Time: {time.time() - start_time}")

    return input_series
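
The entity-removal step depends on how greedy the regular expression is. The sketch below uses only the standard `re` module to compare a greedy pattern such as `&.*;` with a narrower one that matches a single entity at a time. The narrower pattern and the sample string are assumptions chosen for illustration, not values from the Week 02 corpus.

```python
import re

# Greedy: '.*' runs from the first '&' to the LAST ';' on the line.
GREEDY_ENTITY = re.compile(r"&.*;")
# Narrower (assumed for this sketch): '&', optional '#', word characters, then ';'.
SINGLE_ENTITY = re.compile(r"&#?\w+;")

sample = "a &gt; b and c &lt; d; done"

greedy_result = GREEDY_ENTITY.sub("", sample)
single_result = SINGLE_ENTITY.sub("", sample)

print(repr(greedy_result))  # the text between the two entities is lost as well
print(repr(single_result))  # only the two entities are removed
```

The greedy version also swallows ordinary text whenever a line holds more than one semicolon, which is easy to miss on a small sample.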



In [4]:
#//*** Tokenize a Series containing Strings.
#//*** Breaking this out into its own function for later reuse.
#//*** Not a lot of code here, but it helps to keep the libraries localized. This creates standardization for future
#//*** Stoneburner projects. Also has the ability to add functionality as needed.

def tokenize_series(input_series):
    
    #//*** Imports are cached in sys.modules, so repeating them inside the function is cheap
    import nltk
    import time
    
    word_tokenize = nltk.tokenize.word_tokenize 
    
    #//*** Start Timing the process
    start_time = time.time()
    
    input_series = input_series.apply(word_tokenize)
    
    print(f"Tokenize Time: {time.time() - start_time}")
    
    return input_series


In [5]:
#//*** Remove Stop words from the input list
def remove_stop_words(input_series):
    
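    #//*** This function removes stop_words from a list of tokens.
    #//*** Works with series.apply()
    def apply_stop_words(input_list):

        #//*** Build a new list rather than removing items while iterating: in-place removal
        #//*** skips the token after each removed word, which is how stop words such as
        #//*** "are" survive in the output recorded below.
        return [word for word in input_list if word not in stop_words]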

    #//*** Imports are cached in sys.modules, so repeating them inside the function is cheap
    import nltk
    import time
        
    stopwords = nltk.corpus.stopwords

    #//*** Stopwords requires an additional download.
    #//*** The corpus loads lazily, so request the word list to trigger the LookupError.
    try:
        stopwords.words('english')
    except LookupError:
        nltk.download('stopwords')


    #//*** Start Timing the process
    start_time = time.time()


    #//*** The NLTK stop words contain apostrophes (e.g. "don't"). Punctuation has already been
    #//*** stripped from the text, so strip apostrophes here too or contractions will not be filtered out.
    #//*** A set gives constant-time membership tests.
    stop_words = {stop.replace("'","") for stop in stopwords.words('english')}

    
    #//*** Remove Stop words from the tokenized strings in the 'process' column
    #input_series = input_series.apply(remove_stop_words,stop_words)
    
    input_series = input_series.apply(apply_stop_words)

    print(f"Stop Words Time: {time.time() - start_time}")
    
    return input_series
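
Removing items from a list while looping over it skips the element that follows each removal. That is consistent with the tokens recorded for 'Hello, how are you?', where the stop word are survives. Here is a minimal sketch in plain Python; the three-word `STOP_WORDS` set and both function names are hypothetical stand-ins for the NLTK list and the notebook's helper.

```python
# Tiny hypothetical stop list standing in for the NLTK English stop words.
STOP_WORDS = {"how", "are", "you"}

def remove_in_place(tokens):
    """Remove stop words while iterating over the same list (skips neighbours)."""
    tokens = list(tokens)          # copy so the caller's list is untouched
    for word in tokens:
        if word in STOP_WORDS:
            tokens.remove(word)    # shifts later items left under the loop index
    return tokens

def remove_by_filter(tokens):
    """Build a new list that keeps only the non-stop words."""
    return [word for word in tokens if word not in STOP_WORDS]

tokens = ["hello", "how", "are", "you"]
print(remove_in_place(tokens))   # 'are' slips through
print(remove_by_filter(tokens))  # every stop word is dropped
```

After how is removed, are slides into the slot the loop has already visited, so it is never tested.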

In [6]:
def apply_stemmer(input_series,trim_single_words = True):
    #//*** Imports are cached in sys.modules, so repeating them inside the function is cheap
    import nltk
    import time

    #//*** Instantiate the Stemmer
    porter = nltk.stem.porter.PorterStemmer()

    #//*** Start Timing the process
    start_time = time.time()
    
    #//*** 1.) Apply() an action to each row
    #//*** 2.) lambda word_list, each row is treated as word_list for the subsequent expression
    #//*** 3.) The base [ word for word in word_list] would return each word in word_list as a list. 
    #//*** 4.) [porter.stem(word) for word in word_list] - performs stemming on each word and returns a list
    input_series = input_series.apply(lambda word_list: [porter.stem(word) for word in word_list] )
    
    #//*** Remove Single letter words after stemming.
    #//*** Filter into a new list; removing while iterating would skip the word after each removal.
    if trim_single_words:
        input_series = input_series.apply(lambda word_list: [word for word in word_list if len(word) >= 2] )

    print(f"Apply Stemmer Time: {time.time() - start_time}")
    return input_series


In [7]:
#//*** Clean text: Remove punctuation, convert to lowercase, remove blank lines, new lines and html objects.
#//*** Tokenize
#//*** Remove Stop Words
#//*** Stem Words
#//*** Considering adding a lemmatization routine.

#//*** Cleaned and processed text is stored in Token Form
this_df['tokens'] = apply_stemmer(remove_stop_words(tokenize_series(mr_clean_text(this_df['comments']))))

#//*** Convert the tokenized words into a string
this_df['processed'] = this_df['tokens'].apply(lambda word_list: ' '.join(word_list)  )
print(this_df)

Text Cleaning Time: 0.3500540256500244
Tokenize Time: 0.01500248908996582
Stop Words Time: 0.008817911148071289
Apply Stemmer Time: 0.00099945068359375
  Day of Week                                        comments  actual  \
0      Monday                             Hello, how are you?       0   
1     Tuesday                            Today is a good day!       1   
2   Wednesday  It's my birthday so it's a really special day!       1   
3    Thursday       Today is neither a good day or a bad day!       0   
4      Friday                           I'm having a bad day.      -1   
5    Saturday       There' s nothing special happening today.       0   
6      Sunday                      Today is a SUPER good day!       1   

                                     tokens                          processed  
0                              [hello, are]                          hello are  
1                        [today, good, day]                     today good day  
2  [my, birthday, it

In [8]:
# Basic Text analyzer included in this week's materials for reference
# Analyzing text for whether comments are positive or negative

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

corpus = this_df['processed']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("")
print("vectorized words")
print("")
print(vectorizer.get_feature_names_out().tolist())
print("")
print("Identify Feature Words - Matrix View")
print("")
print( X.toarray())

df = pd.DataFrame({'text' : corpus})

#check for positive words and negative words
df['positive1'] = df.text.str.count('good')
df['positive2']= df.text.str.count('special')
df['negative'] = df.text.str.count('bad')
df['TotScore'] = df.positive1 + df.positive2 - df.negative

print("")
print(df)

Z = sum(df['TotScore'])
print("")
print("Overall Score:  ",Z)



vectorized words

['are', 'bad', 'birthday', 'day', 'good', 'happen', 'hello', 'im', 'it', 'my', 'neither', 'noth', 'realli', 'special', 'super', 'today']

Identify Feature Words - Matrix View

[[1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0]
 [0 1 0 2 1 0 0 0 0 0 1 0 0 0 0 1]
 [0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1]
 [0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1]]

                                text  positive1  positive2  negative  TotScore
0                          hello are          0          0         0         0
1                     today good day          1          0         0         1
2  my birthday it realli special day          0          1         0         1
3     today neither good day bad day          1          0         1         0
4                         im bad day          0          0         1        -1
5          noth special happen today          0          1         0         1
6    

In [9]:
#//*** Return a categorical value based on the vader score
def categorize_vader(input_score):

    #//*** Less than -.33 Sentiment is negative
    if  input_score < -.33:
        return -1

    #//*** Greater than .33 Sentiment is positive
    if  input_score > .33:
        return 1

    #//*** Everything else is neutral
    return 0


#//*** Use NLTK Vader SentimentIntensityAnalyzer to build a sentiment score
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

#//************************************************************************
#//*** Generate a categorical value based on the Vader Sentiment Score.
#//************************************************************************
#//*** Just because I can do this with one line of code, doesn't mean I should.
#//*** This pythonic style can get a bit ridiculous.
#//*** Let's unpack my monstrosity, which I simplified by removing a list comprehension. Yay!
#//**************************************************************************************************************
#//*** Apply Vader to this_df['comments'], retrieve just the compound score: 
#//***         this_df['comments'].apply( lambda words : analyzer.polarity_scores(words)['compound'] )
#//*** analyzer.polarity_scores(words) returns a dictionary containing positive, negative, neutral 
#//*** and compound (combined) scores. The ['compound'] key retrieves only the compound value.
#//*** The second lambda function converts the Vader Compound score to a categorical sentiment value.
#//***         ...apply(lambda score : categorize_vader(score))
#//*** The second lambda pipes each Vader score through the custom function categorize_vader() which returns -1,0,1 for
#//*** each value.

#//*** Version 1: This includes an unneeded list comprehension to filter out the compound value.
#//***            It's a nice Rube Goldberg touch.
#this_df['vader'] = pd.Series( [score['compound'] for score in this_df['comments'].apply( lambda words : analyzer.polarity_scores(words) )]).apply(lambda score : categorize_vader(score))

#//*** This is more readable as two chained lambda functions.
this_df['vader'] = this_df['comments'].apply( lambda words : analyzer.polarity_scores(words)['compound'] ).apply(lambda score : categorize_vader(score))

#//*** Display the Actual Categorical Values vs the Vader Model Values
print(this_df[ ['actual','vader'] ])

print(f"Vader Scored {round(len(this_df[this_df['actual'] == this_df['vader'] ]) / len(this_df['vader']),3)*100}%  [ {len(this_df[this_df['actual'] == this_df['vader'] ])} of {len(this_df)} ] on unprocessed text")

    

   actual  vader
0       0      0
1       1      1
2       1      1
3       0     -1
4      -1     -1
5       0      0
6       1      1
Vader Scored 85.7%  [ 6 of 7 ] on unprocessed text


In [10]:
#//*** Run vader again on the Processed text, just to see if there is a difference.
#//*** And let's break apart the compound lambda statement to make it more legible.
this_df['vader'] = this_df['processed'].apply( lambda words : analyzer.polarity_scores(words)['compound'])
this_df['vader'] = this_df['vader'].apply(lambda score : categorize_vader(score))

print(this_df[ ['actual','vader'] ])

print(f"Vader Scored {round(len(this_df[this_df['actual'] == this_df['vader'] ]) / len(this_df['vader']),3)*100}%  [ {len(this_df[this_df['actual'] == this_df['vader'] ])} of {len(this_df)} ] on processed text")


   actual  vader
0       0      0
1       1      1
2       1      1
3       0      0
4      -1     -1
5       0      1
6       1      1
Vader Scored 85.7%  [ 6 of 7 ] on processed text
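
Both runs report 6 of 7, but the two tables show that they do not miss the same row. The sketch below copies the label columns from the recorded outputs into plain lists and reports which rows disagree. The helper names `accuracy` and `misses` are hypothetical.

```python
# Label columns copied from the two result tables above.
actual          = [0, 1, 1,  0, -1, 0, 1]
vader_raw_text  = [0, 1, 1, -1, -1, 0, 1]   # VADER on the unprocessed comments
vader_processed = [0, 1, 1,  0, -1, 1, 1]   # VADER on the cleaned, stemmed text

def accuracy(truth, predicted):
    """Fraction of rows where the predicted label equals the manual label."""
    hits = sum(t == p for t, p in zip(truth, predicted))
    return hits / len(truth)

def misses(truth, predicted):
    """(row, manual label, predicted label) for every disagreement."""
    return [(row, t, p)
            for row, (t, p) in enumerate(zip(truth, predicted))
            if t != p]

for name, predicted in [("unprocessed", vader_raw_text),
                        ("processed", vader_processed)]:
    print(name, round(accuracy(actual, predicted), 3), misses(actual, predicted))
```

On the unprocessed text the miss is row 3, the 'neither ... or' sentence. On the processed text it is row 5, where 'nothing special' has been reduced to stems and VADER reads special as positive.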


In [11]:
#//*** Start Timing the process
start_time = time.time()
total_job_time = 0

#//*** For Step 2 We'll start from the CSV
con_df = pd.read_csv("z_wk02_controversial_words_df.csv")



#//*** Controversial Words was quite the lengthy project. Cleaning the corpus took over twelve minutes. 

con_df['txt'] = con_df['txt'].str.replace("[","",regex=False).str.replace("]","",regex=False).str.replace(",","",regex=False).str.replace("'","",regex=False)

#//*** Create a tokenized column for section 2B
#con_df['token'] = con_df['txt'].apply(nltk.tokenize.word_tokenize)


#//*** Display the Process Time
cum_time(start_time)
print(con_df[10:20])

Process Time: 0m 6.11s
    Unnamed: 0  con                                                txt
10          11    0  meaningless word keep fire contain he power ca...
11          12    0  obama declar dictat life honestli would be ups...
12          13    0  classic case us govern depart interior give a ...
13          14    0  he a commun organ support redistribut wealth w...
14          15    0                                 stop cri unattract
15          16    0  believ is good time invok thishttpswwwredditco...
16          17    0     you explain there death threat obama got elect
17          18    0  cours doe do think is poor person onli give ha...
18          19    0  submiss been automat remov it either link shor...
19          21    0  would be a pretti common lay man term suppli s...


In [12]:
#//*** Start Timing the process
start_time = time.time()
#//*** Build the VADER dictionary attribute for each line
vader_dict = con_df['txt'].apply( lambda words : analyzer.polarity_scores(words) )

print("Raw Vader Scoring Time:")
#//*** Display the Process Time
x = cum_time(start_time)



Raw Vader Scoring Time:
Process Time: 4m 3.819999999999993s
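
The long tail of digits in a reading like `3.819999999999993s` is floating-point residue from taking a rounded total modulo 60. One way to avoid it, sketched here with a hypothetical `format_elapsed` helper, is to split the total with `divmod` and let the format spec do the rounding.

```python
def format_elapsed(total_seconds):
    """Return elapsed time as '<minutes>m <seconds>s' with two decimals."""
    minutes, seconds = divmod(round(total_seconds, 2), 60)
    # ':.2f' rounds for display, so float residue never reaches the output.
    return f"{int(minutes)}m {seconds:.2f}s"

print(format_elapsed(243.82))
print(format_elapsed(6.11))
```

The helper only changes how the number is displayed; the value returned by the timing function can stay a plain float for later arithmetic.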


In [41]:
start_time = time.time()

list_dict = {}

#//*** Initialize the list_dict with empty arrays
for key in vader_dict[0].keys():
    list_dict[key] = [ ]

#//*** Convert each dictionary value to an array
for row in vader_dict:
    for key,value in row.items():
        list_dict[key].append(value)

for key,value in list_dict.items():
    con_df[key] = list_dict[key]

print(con_df)

#//*** Display the Process Time
x = cum_time(start_time)

        Unnamed: 0  con                                                txt  \
0                0    0  well great he someth those belief he in offic ...   
1                1    0                                are right mr presid   
2                2    0  have given input apart say am wrong have argum...   
3                3    0  get frustrat reason want it way becaus foundat...   
4                4    0  am far expert tpp would tend agre lot problem ...   
...            ...  ...                                                ...   
874134      949994    0                                   payer immort all   
874135      949995    0  genuin cant understand anyon support at point ...   
874136      949996    0  remind subreddit for civil discussionhttpswwwr...   
874137      949997    0                                 k explain or anyth   
874138      949999    0  ya sociopath known celebr posit feel you fuck ...   

          neg    neu    pos  compound  
0       0.163  0.639  0

In [13]:
#//*** Start Timing the process
start_time = time.time()
con_df['categorical'] = con_df['compound'].apply(lambda score : categorize_vader(score))
print("Categorize Vader Scoring Time:")
#//*** Display the Process Time
x = cum_time(start_time)


In [None]:
#//*** Start Timing the process
start_time = time.time()

#//*** The categorical labels live in 'categorical' and the raw VADER score in 'compound'
total_cat_negative = len(con_df[con_df['categorical'] == -1])
total_cat_neutral = len(con_df[con_df['categorical'] == 0])
total_cat_positive = len(con_df[con_df['categorical'] == 1])
print(f"Total Positive Posts: {total_cat_positive} [{total_cat_positive/len(con_df):.2%}]" )
print(f"Total Neutral  Posts: {total_cat_neutral} [{total_cat_neutral/len(con_df):.2%}]" )
print(f"Total Negative Posts: {total_cat_negative} [{total_cat_negative/len(con_df):.2%}]" )
print(f"Corpus VADER Categorical Sentiment Sum: {con_df['categorical'].sum()}")

print(f"VADER Corpus Compound Score: {con_df['compound'].sum()}")
#//*** Display the Process Time
x = cum_time(start_time)

In [None]:
#con_df[ (con_df['compound'] > -.33) & (con_df['compound'] < .33) ]

In [None]:
"""
#//*** I'm still wrapping my head around tfidf.
#//*** This is here for contemplative reference

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

countvectorizer = CountVectorizer(analyzer= 'word', stop_words='english')
tfidfvectorizer = TfidfVectorizer(analyzer='word',stop_words= 'english')

# convert the documents into a matrix
count_wm = countvectorizer.fit_transform(this_df['processed'])
tfidf_wm = tfidfvectorizer.fit_transform(this_df['processed'])

# retrieve the terms found in the corpus
# With the same parameters, CountVectorizer and TfidfVectorizer return the same terms from get_feature_names_out()
count_tokens = countvectorizer.get_feature_names_out()
tfidf_tokens = tfidfvectorizer.get_feature_names_out()
df_countvect = pd.DataFrame(data = count_wm.toarray(),columns = count_tokens)
df_tfidfvect = pd.DataFrame(data = tfidf_wm.toarray(),columns = tfidf_tokens)
print("Count Vectorizer\n")
print(df_countvect)
print("\nTF-IDF Vectorizer\n")
print(df_tfidfvect)

"""