# Jonathan Bunch

12 September 2021

Bellevue University

DSC550-T301

---

# Week 2 Exercises: Build Your Text Classifiers

In [1]:
# Import libraries.
import numpy as np
import pandas as pd
import unicodedata
import sys
import nltk
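import sklearn
# Import the scikit-learn submodules used below explicitly; depending on the installed
# version, "import sklearn" alone may not expose them as attributes.
import sklearn.feature_extraction.text
import sklearn.preprocessing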

### Pre-processing Text:

For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame.

In [3]:
# This is easily accomplished with the pandas read_json() function.
# The file appears to be line-delimited (one JSON object per line), which we can handle by setting the "lines" parameter to True.
df_raw = pd.read_json("controversial-comments.jsonl", lines=True)

# My computer had a difficult time processing the full sample, so I will select a random subset to work with.
df = df_raw.sample(n=50000, random_state=12)

# View the first five observations to check for the expected results.
df.head()

Unnamed: 0,con,txt
365742,0,Down vote this dick in your mouth.
217278,0,I just remembered I was raped by Trump too. I ...
21252,0,Yalec Taldwin works primarily in Bollywood.
402752,1,Bbbbut the liberals hate Snowden and Wikileaks...
258448,0,This sub has gone to shit. I'm only here for t...
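
For reference, `lines=True` tells pandas to treat each line of the file as a separate JSON object. The stdlib-only sketch below shows the same idea on two made-up records. The field names `con` and `txt` match the columns above, but the record contents are invented for illustration and are not taken from the data set.

```python
import io
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"con": 0, "txt": "First example comment."}\n'
    '{"con": 1, "txt": "Second example comment."}\n'
)

# Parse each non-empty line independently, as a line-delimited reader would.
records = [json.loads(line) for line in sample_jsonl if line.strip()]

print(len(records))       # number of parsed records
print(records[1]["txt"])  # text of the second record
```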


Then:

A. Convert all text to lowercase letters.

B. Remove all punctuation from the text.

C. Remove stop words.

D. Apply NLTK’s PorterStemmer.

In [4]:
# First, I will create a translation table that maps punctuation characters to None values.
punc_tbl = dict.fromkeys(i for i in range(sys.maxunicode + 1) if unicodedata.category(chr(i)).startswith('P'))

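# Load the list of stop words from the nltk library, stored as a set for fast membership tests.
# Uncomment on first use to download the required data; word_tokenize also needs the
# tokenizer models ('punkt', or 'punkt_tab' on newer NLTK releases).
# nltk.download('stopwords')
# nltk.download('punkt')
stop_words = set(nltk.corpus.stopwords.words('english'))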

# Create a Stemmer from the nltk PorterStemmer function.
porter = nltk.stem.porter.PorterStemmer()


def prep_text(t):
    """Convert text to lowercase, remove all punctuation, tokenize the text into words, remove all stop words,
    apply the PorterStemmer to each word, and recombine the stemmed words into a single string."""
    # Strip any leading or trailing whitespace and convert all text to lowercase.
    text = t.strip().lower()
    # Apply the translation table to the text to remove punctuation characters.
    text = text.translate(punc_tbl)
    # Next, we will tokenize the text into individual words.
    tok_text = nltk.tokenize.word_tokenize(text)
    # Create a new list of tokenized words that are NOT stop words.
    tok_text_nsw = [word for word in tok_text if word not in stop_words]
    # Now we can apply the nltk PorterStemmer function to stem the tokenized words.
    tok_text_fin = [porter.stem(word) for word in tok_text_nsw]
    # Combine the cleaned and stemmed words back into a string.
    text_fin = " ".join(tok_text_fin)
    return text_fin
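
To make the punctuation step concrete, here is a minimal stdlib-only sketch of the same translation-table technique applied to an invented sentence. The names `punct_table`, `sample`, and `cleaned` are illustrative and not part of the notebook. Two things are worth noticing: contractions collapse into single tokens once the apostrophe is removed, and characters such as `$` and `+` are classified by Unicode as symbols rather than punctuation, so they are left in place.

```python
import sys
import unicodedata

# Map every code point whose Unicode category starts with "P" (punctuation) to None.
punct_table = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)

# Invented sample sentence, lowercased first and then stripped of punctuation.
sample = "I'm here, aren't I? It cost $5 + tax."
cleaned = sample.lower().translate(punct_table)

print(cleaned)  # im here arent i it cost $5 + tax
```

Because punctuation is removed before stop words are filtered, a contraction such as "I'm" reaches the stop word check as "im". That may explain why tokens like `im` survive in the pre-processed text, although this depends on the contents of the stop word list in the installed NLTK version.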


In [5]:
# Apply the pre-processing function to the text data and add the results to the table as a new column.
df['prepped_text'] = df['txt'].apply(prep_text)

# It might also be useful to add the tokenized version of this pre-processed text as another new column.
df['prepped_tokens'] = df['prepped_text'].apply(nltk.tokenize.word_tokenize)

# View a sample of the results.
df.head()

Unnamed: 0,con,txt,prepped_text,prepped_tokens
365742,0,Down vote this dick in your mouth.,vote dick mouth,"[vote, dick, mouth]"
217278,0,I just remembered I was raped by Trump too. I ...,rememb rape trump rememb 5 day elect,"[rememb, rape, trump, rememb, 5, day, elect]"
21252,0,Yalec Taldwin works primarily in Bollywood.,yalec taldwin work primarili bollywood,"[yalec, taldwin, work, primarili, bollywood]"
402752,1,Bbbbut the liberals hate Snowden and Wikileaks...,bbbbut liber hate snowden wikileak great news,"[bbbbut, liber, hate, snowden, wikileak, great..."
258448,0,This sub has gone to shit. I'm only here for t...,sub gone shit im idioci nov 8 unsubscrib,"[sub, gone, shit, im, idioci, nov, 8, unsubscrib]"


### Now that the data is pre-processed, you will apply three different techniques to get it into a usable form for model-building. Apply each of the following steps (individually) to the pre-processed data.

A. Convert each text entry into a word-count vector

In [11]:
# Create a vectorizer from the sklearn CountVectorizer() function.
count = sklearn.feature_extraction.text.CountVectorizer()

# Apply the vectorizer to the prepared text feature.
wcv = count.fit_transform(df.prepped_text)

# View some details of the matrix.
wcv

<50000x32766 sparse matrix of type '<class 'numpy.int64'>'
	with 809954 stored elements in Compressed Sparse Row format>

In [12]:
# The matrix is not very interesting to view this way, but it does appear to be the expected result.
wcv.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [25]:
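# View some of the words corresponding to the features.
# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions, use get_feature_names_out().
count.get_feature_names()[2000:2020]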

['admir',
 'admiralti',
 'admiss',
 'admissionth',
 'admit',
 'admitt',
 'admittedhttpwwwmotherjonescompolitics201609donaldtrumpliesaboutdealingsmafiafigur',
 'admittedli',
 'admonish',
 'ado',
 'adolesc',
 'adolf',
 'adolph',
 'adopt',
 'adopteesor',
 'ador',
 'adrenalin',
 'adress',
 'adroitli',
 'adsrath']
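
As a rough illustration of what a word-count vector contains, the sketch below builds one by hand with the standard library on three invented documents. It is a conceptual sketch only, not how `CountVectorizer` is implemented, and the names `docs`, `vocab`, and `count_rows` are hypothetical.

```python
from collections import Counter

# Hypothetical pre-processed documents (already lowercased and stemmed).
docs = ["vote trump elect", "trump trump elect day", "great news"]

# The vocabulary is the sorted set of all tokens; each token becomes one column.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# Each row counts how often each vocabulary term occurs in one document.
count_rows = []
for doc in docs:
    counts = Counter(doc.split())
    count_rows.append([counts.get(tok, 0) for tok in vocab])

print(vocab)       # ['day', 'elect', 'great', 'news', 'trump', 'vote']
print(count_rows)  # [[0, 1, 0, 0, 1, 1], [1, 1, 0, 0, 2, 0], [0, 0, 1, 1, 0, 0]]
```

One difference to keep in mind: with its default token pattern, `CountVectorizer` ignores tokens shorter than two characters, so single-character tokens such as `5` or `8` in the pre-processed text would not become features.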

B. Convert each text entry into a part-of-speech tag vector

In [13]:
# Create a list that will store the part of speech tag for each word in each observation.
tagged_list = []

# Loop through the tokenized words in each observation, tagging each word with its part of speech.
# Then, add the part-of-speech tags for each observation to the tagged_list.
for val in df.prepped_tokens.values:
    tags = nltk.pos_tag(val)
    tagged_list.append([tag for word, tag in tags])

# Create a one-hot binarizer from the sklearn MultiLabelBinarizer() function.
one_hot_multi = sklearn.preprocessing.MultiLabelBinarizer()

# Apply the binarizer to the tagged_list to create one-hot encoded part-of-speech vectors.
psv = one_hot_multi.fit_transform(tagged_list)

# View the resulting array.
psv

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [22]:
# View the corresponding feature names.
one_hot_multi.classes_

array(['$', "''", 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS',
       'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP',
       'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD',
       'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``'],
      dtype=object)
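
Note that `MultiLabelBinarizer` records only whether each tag appears in an observation, not how often. The stdlib-only sketch below contrasts that presence encoding with a count encoding on invented tag lists. The count variant is an alternative shown for comparison, not something the notebook computes, and the names `tag_lists`, `classes`, `presence`, and `tag_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical tag sequences, one list per document (Penn Treebank style tags).
tag_lists = [["NN", "NN", "VB"], ["JJ", "NN"], ["VB", "VB", "RB"]]

# Columns are the sorted set of tags seen anywhere in the data.
classes = sorted({tag for tags in tag_lists for tag in tags})

# Presence encoding: 1 if the tag occurs at least once in the document, else 0.
presence = [[int(tag in tags) for tag in classes] for tags in tag_lists]

# Count encoding: how many times each tag occurs in the document.
tag_counts = [[Counter(tags)[tag] for tag in classes] for tags in tag_lists]

print(classes)     # ['JJ', 'NN', 'RB', 'VB']
print(presence)    # [[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
print(tag_counts)  # [[0, 2, 0, 1], [1, 1, 0, 0], [0, 0, 1, 2]]
```

Which encoding is more appropriate depends on the downstream model. Counts preserve more information, for example when comparing how verb-heavy two comments are.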

C. Convert each entry into a term frequency-inverse document frequency (tfidf) vector

In [14]:
# Create the tfidf vectorizer.
tfidf = sklearn.feature_extraction.text.TfidfVectorizer()

# Apply the vectorizer to create a tfidf feature matrix.
tfv = tfidf.fit_transform(df.prepped_text)

# View details about the matrix.
tfv

<50000x32766 sparse matrix of type '<class 'numpy.float64'>'
	with 809954 stored elements in Compressed Sparse Row format>

In [15]:
# View the matrix.
tfv.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [33]:
# View some of the words corresponding to the features.
list(tfidf.vocabulary_.items())[:10]

[('vote', 31386),
 ('dick', 8024),
 ('mouth', 19521),
 ('rememb', 24211),
 ('rape', 23706),
 ('trump', 29656),
 ('day', 7457),
 ('elect', 9075),
 ('yalec', 32481),
 ('taldwin', 28425)]
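
To show where the tfidf weights come from, the sketch below computes them by hand on three invented documents. It assumes the documented scikit-learn defaults (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row). It is a conceptual sketch rather than the library's implementation, and all variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-processed documents.
docs = ["vote trump elect", "trump trump elect day", "great news trump"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed inverse document frequency: rarer terms receive larger weights.
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

# Multiply term counts by idf, then scale each row to unit L2 norm.
tfidf_rows = []
for toks in tokenized:
    tf = Counter(toks)
    raw = [tf[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    tfidf_rows.append([w / norm for w in raw])

# "trump" appears in every document, so it gets the minimum idf of 1.0.
print(idf["trump"])
# In the first document, the rarer term "vote" outweighs the common term "trump".
print(tfidf_rows[0][vocab.index("vote")] > tfidf_rows[0][vocab.index("trump")])
```

This also explains why the tfidf matrix has the same shape and number of stored elements as the word-count matrix: the nonzero positions are the same, and only the values are reweighted.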

Follow-Up Question:

For the three techniques in problem (2) above, give an example where each would be useful.

1. Word-count vectors are useful for identifying the domain or sentiment of a text. For example, a high count of domain-specific words suggests that the text belongs to that domain.
2. Part-of-speech vectors are useful for extracting intent from a text. For example, user text entries could be filtered for a high number of verbs, which may indicate an intent to act.
3. Tfidf vectors are useful for quantifying how important each word is to a document relative to the full set of documents being analyzed. One example is analyzing posts on a message board thread to identify "trolling", i.e., posts that use few (or none) of the terms used by most of the other posts.