# Creating features for Reddit comment toxicity classifier models
### John Burt


### Introduction:

This notebook reads the comment data and creates feature data files for training and testing the classifier models I'll be evaluating as toxic comment detectors. The feature data is split because not all classifier models will use all feature data.

Input data:
- comments collected for each sub, with PCA-based toxicity score added.
- input dir: ./data_scored

Output to:
- two feature files per subreddit: one containing the prepared comment text plus metadata and toxicity score, the other containing the Doc2Vec vectors for each comment. The two are kept in separate files to save space.
- output dir: ./data_for_models
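
As a rough sketch of the output layout, the two files for one subreddit can be located from the file name pattern used by the feature functions later in this notebook (`features_text_<sub>.csv` and `features_doc2vec_<sub>.csv`). The helper name `feature_paths` is hypothetical and only serves the illustration.

```python
import os

def feature_paths(subname, destdir='./data_for_models/'):
    """Return (text_path, doc2vec_path) for one subreddit.

    Hypothetical helper: it only mirrors the file name pattern
    used when the feature files are written.
    """
    text_path = os.path.join(destdir, 'features_text_' + subname + '.csv')
    d2v_path = os.path.join(destdir, 'features_doc2vec_' + subname + '.csv')
    return text_path, d2v_path

text_path, d2v_path = feature_paths('aww')
print(text_path)
print(d2v_path)
```

A model that only needs the text can then read the first file and skip the much larger vector file.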

There will be three feature sets:

- Comment metadata features: I include only one feature from the metadata downloaded with each comment: the comment author's "comment karma", an aggregate of the scores on all comments the user has posted. Other metadata features were excluded because they either did not correlate with toxicity score or correlated only after the comment had been posted for a while. Time-dependent associations are not helpful as features, since I want this classifier to detect toxic comments shortly after they are posted. Comment karma, however, already exists at the time of the comment and can therefore be used as a feature for the models.

- Comment text: The original comment text. 

- Doc2Vec vectors: For each subreddit, a Doc2Vec embedding model is trained using all of the sub's text samples. Then all comments in the sub are converted into Doc2Vec vectors to be used as model features.



In [1]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import pandas as pd
pd.options.display.max_columns = 100

import numpy as np
import datetime
import time
import csv
import glob

# source data folder 
srcdir = './data_scored/' # dir of comment data with PCA scores added
destdir = './data_for_models/'

# the subreddits I'll be analyzing
sub2use = ['aww', 'funny', 'todayilearned','askreddit',
           'photography', 'gaming', 'videos', 'science',
           'politics', 'politicaldiscussion',             
           'conservative', 'the_Donald']

# load all labelled CSVs
dfs = []
for subname in sub2use:
    pathname = srcdir+'comment_sample_'+subname+'_scored.csv'
    print('reading',pathname)
    tdf = pd.read_csv(pathname)
    dfs.append(tdf)

# combine all subreddit datasets into one  
df = pd.concat(dfs)

# remove any deleted or removed comments 
df = df[(df.text!='[deleted]') & (df.text!='[removed]')]

# drop samples with NaNs
df = df.dropna()

# drop duplicates
df = df.drop_duplicates(subset='comment_ID')

# reformat parent ids to match comment ids
df.parent_ID = df.parent_ID.str.replace('t1_','')
df.parent_ID = df.parent_ID.str.replace('t3_','')

print('\nTotal comment samples read:',df.shape[0])

reading ./data_scored/comment_sample_aww_scored.csv
reading ./data_scored/comment_sample_funny_scored.csv
reading ./data_scored/comment_sample_todayilearned_scored.csv
reading ./data_scored/comment_sample_askreddit_scored.csv
reading ./data_scored/comment_sample_photography_scored.csv
reading ./data_scored/comment_sample_gaming_scored.csv
reading ./data_scored/comment_sample_videos_scored.csv
reading ./data_scored/comment_sample_science_scored.csv
reading ./data_scored/comment_sample_politics_scored.csv
reading ./data_scored/comment_sample_politicaldiscussion_scored.csv
reading ./data_scored/comment_sample_conservative_scored.csv
reading ./data_scored/comment_sample_the_Donald_scored.csv

Total comment samples read: 3251370


## Prepare the text

Clean up the text. The helper below supports these optional steps, applied in this order:
- Make lowercase
- Remove stop words
- Replace non-alphanumeric characters with spaces
- Stem
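
The cleaning steps can be sketched on a single string with the standard library alone. This is a minimal illustration, not the notebook's helper: `clean_text` and the tiny `STOPWORDS` set are hypothetical, and stemming is left out because it needs NLTK. The steps run in the same order as the helper in the next cell (lowercase, stop words, regex).

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only;
# the notebook itself uses NLTK's English stop word list.
STOPWORDS = {'the', 'a', 'is', 'and', 'of'}

def clean_text(text, removestop=True):
    """Lowercase, optionally drop stop words, then replace every
    character that is not alphanumeric or whitespace with a space."""
    text = text.lower()
    if removestop:
        text = ' '.join(w for w in text.split() if w not in STOPWORDS)
    return re.sub(r'[^a-zA-Z0-9\s]', ' ', text)

cleaned = clean_text("The cat's toy is AWESOME!!")
print(cleaned.split())
```

Because punctuation is replaced with a space rather than deleted, "cat's" becomes the two tokens "cat" and "s". Lowercasing first matters for stop word removal, since the membership test is case-sensitive.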

In [2]:
import re
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords as sw

# function to prepare text for NLP analysis
def process_comment_text(comments, 
                         stemmer=None, 
                         regexstr=None, lowercase=True,
                         removestop=False,
                         verbose=True):
    """Helper function to pre-process text.
        Combines several preprocessing steps: lowercase, 
            remove stop, regex text cleaning, stemming"""
    
    if type(stemmer) == str:
        if stemmer.lower() == 'porter':
            stemmer = PorterStemmer()
        elif stemmer.lower() == 'snowball':
            stemmer = SnowballStemmer(language='english')
        else:
            stemmer = None
            
    processed = comments
    
    # make text lowercase
    if lowercase == True:
        if verbose: print('make text lowercase')
        processed = processed.str.lower()
        
    # remove stop words
    # NOTE: matching is case-sensitive, so capitalized stop words
    # are kept when lowercase=False
    if removestop == True:
        if verbose: print('remove stop words')
        stopwords = sw.words("english")
        processed = processed.map(lambda text: ' '.join([word for word in text.split() if word not in stopwords]))
        
    # apply regex expression
    if regexstr is not None:
        if verbose: print('apply regex expression')
        regex = re.compile(regexstr) 
        # regex=True is required for a compiled pattern in recent pandas versions
        processed = processed.str.replace(regex, ' ', regex=True)
        
    # stemming
    # NOTE: stemming makes all lowercase
    if stemmer is not None:
        if verbose: print('stemming')
        processed = processed.map(lambda x: ' '.join([stemmer.stem(y) for y in x.split(' ')]))
        
    if verbose: print('done')
         
    return processed


### Create the text features file.

- Do minimal text processing 
- File includes comment metadata and toxicity score.

In [3]:
def generate_text_features(df, subname, destdir):
    print('generating text features')
    # specify parameters for text prep
    processkwargs = {
        'stemmer': None, #'snowball', # snowball stemmer
        'regexstr': r'[^a-zA-Z0-9\s]', # replace all but alphanumeric and whitespace chars
        'lowercase':True, # make lowercase
        'removestop':False # remove stop words 
                    }
    df['text'] = process_comment_text(df['text'], **processkwargs, verbose=True)
    df.to_csv(destdir+'features_text_'+subname+'.csv',index=False)
    


### Create the Doc2Vec features file

- Train a Doc2Vec embedding model for each subreddit.
- Convert each comment's text to a 100-dimensional Doc2Vec vector to use as features
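
Doc2Vec trains on documents given as lists of word tokens, each paired with a tag. The sketch below shows that input shape without requiring gensim: `TaggedDoc` is a stand-in namedtuple with the same two fields (`words`, `tags`) as gensim's `TaggedDocument`, and the two comments are made up for illustration.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (same field names), so the
# sketch runs without gensim installed. This is an assumption made
# for illustration, not the class the notebook imports.
TaggedDoc = namedtuple('TaggedDoc', 'words tags')

# Made-up example comments
comments = ['I love this photo', 'worst post ever']

# Each document is a list of tokens; the tag is the row position
documents = [TaggedDoc(words=text.split(), tags=[i])
             for i, text in enumerate(comments)]

for doc in documents:
    print(doc.tags, doc.words)

# Iterating an unsplit string yields characters, not words
print(list('worst post')[:5])
```

Splitting on whitespace before tagging keeps training consistent with inference, where the comment text is also split into tokens before it is converted to a vector.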

In [4]:
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models.word2vec import Word2Vec # the word2vec model gensim class
from nltk.tokenize import word_tokenize
from time import time
import pickle

def generate_doc2vec_features(df, subname, destdir):
    print('generating doc2vec features')
    # specify parameters for text prep
    processkwargs = {
        'stemmer': None, #'snowball', # snowball stemmer
        'regexstr': r'[^a-zA-Z0-9\s]', # replace all but alphanumeric and whitespace chars
        'lowercase':False, # no lowercasing in this step
        'removestop':False # remove stop words 
                    }
    d2v_text = process_comment_text(df['text'], **processkwargs, verbose=True)
    
    # convert text to tagged document format
    # (Doc2Vec expects each document as a list of tokens, not a raw string)
    documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(d2v_text)]

    print('  Creating new model...')

    # some relevant doc2vec model parameters
    ndims = 100 # number of embedding features
    windowsize = 2 # word distance window size
    
    # train doc2vec model
    model_d2v = Doc2Vec(documents, vector_size=ndims, window=windowsize, 
                    min_count=1, workers=4)
    
    # create array to contain the embedding vectors
    vec = np.zeros([df.shape[0], model_d2v.vector_size])
    print('  vector dataframe shape:',vec.shape)

    print('  vectorizing comments...')
    t0 = time()

    # convert the text into embedding vectors
    for i,text in enumerate(d2v_text):
        vec[i,:] = model_d2v.infer_vector(text.split())
        
    # convert doc2vec vector features into a dataframe
    d2v_df = pd.DataFrame(vec, columns=['dv_'+str(i) for i in range(vec.shape[1])])

    # save only features
    d2v_df.to_csv(destdir+'features_doc2vec_'+subname+'.csv', index=False)


### Generate separate feature files for each sampled subreddit:

- text features (plus meta info and generated features)
- Doc2Vec features
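
Since the Doc2Vec file stores only the vector columns, its rows line up with the text feature file of the same subreddit by position: row *i* in one file describes the same comment as row *i* in the other. Below is a minimal sketch of such a positional join using the standard library. The helper `join_by_position` and the in-memory sample data are hypothetical; real files have many more columns and rows.

```python
import csv
import io

# Made-up miniature stand-ins for the two feature files of one sub
text_csv = "comment_ID,text\nc1,nice photo\nc2,worst post ever\n"
d2v_csv = "dv_0,dv_1\n0.12,-0.40\n0.05,0.33\n"

def join_by_position(text_file, d2v_file):
    """Merge the rows of two CSV files by row position.

    Raises ValueError if the row counts differ, because a positional
    join would then silently pair the wrong rows.
    """
    text_rows = list(csv.DictReader(text_file))
    d2v_rows = list(csv.DictReader(d2v_file))
    if len(text_rows) != len(d2v_rows):
        raise ValueError('feature files have different row counts')
    return [{**t, **d} for t, d in zip(text_rows, d2v_rows)]

rows = join_by_position(io.StringIO(text_csv), io.StringIO(d2v_csv))
print(rows[0])
```

The row count check is the important part of the design: there is no shared key column in the vector file, so position is the only link between the two files.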

In [5]:
# generate separate feature files for each subreddit sampled from
sublist = df['sub_name'].unique()

for subname in sublist:
    print('\nprocessing sub',subname)
    subdf = df[df['sub_name']==subname]
    generate_text_features(subdf, subname, destdir)
    generate_doc2vec_features(subdf, subname, destdir)


processing sub aww
generating text features
make text lowercase
apply regex expression
done
generating doc2vec features
apply regex expression
done
  Creating new model...
  vector dataframe shape: (217359, 100)
  vectorizing comments...

processing sub funny
generating text features
make text lowercase
apply regex expression
done
generating doc2vec features
apply regex expression
done
  Creating new model...
  vector dataframe shape: (255422, 100)
  vectorizing comments...

processing sub todayilearned
generating text features
make text lowercase
apply regex expression
done
generating doc2vec features
apply regex expression
done
  Creating new model...
  vector dataframe shape: (254687, 100)
  vectorizing comments...

processing sub askreddit
generating text features
make text lowercase
apply regex expression
done
generating doc2vec features
apply regex expression
done
  Creating new model...
  vector dataframe shape: (198972, 100)
  vectorizing comments...

processing sub photograph