# Data Cleaning: Prompt-Level

## What this bit-o-code does:
Take the .json prompt data scraped from Reddit via Perl, clean out the absurd data errors, and output a usable .csv for analysis.

Note: this file is a Python version (translation?) of the data cleaning process originally implemented in R. The R version of this file can be found here.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import re
import os
import codecs
import os.path
import seaborn as sns
from sklearn import feature_extraction
import statsmodels.formula.api as smf  #because ols
import statsmodels.api as sm

In [3]:
%pylab inline
pd.options.mode.chained_assignment = None  # default='warn'
                                           # because recodes

Populating the interactive namespace from numpy and matplotlib


In [4]:
# Where is my data?
# NOTE: _pd indicates the python version of these files
dir_project       = '/home/marvin/Desktop/Insight/WritingPrompts/'
dir_in_prompt     = dir_project + 'PromptCSV/'
dir_in_story      = dir_project + 'ResponseCSV/'
data_prompt_main  = dir_project + 'allPrompt_pd.csv'
data_prompt_clean = dir_project + 'allPrompt_edited_pd.csv'

data_prompt_main

'/home/marvin/Desktop/Insight/WritingPrompts/allPrompt_pd.csv'

## Import Loop!
Prompts were first pulled by unix timestamp ranges, with a second iteration for NSFW material. This resulted in approximately 2300 prompt-level files.

To combine, loop over these files and merge them into a single large dataset for analysis. Because this is time-consuming, save the output to a file; when re-running, check whether that file already exists and skip the loop if it does.

In [16]:
#list my files
list_myFiles = [ f for f in os.listdir(dir_in_prompt) if os.path.isfile(os.path.join(dir_in_prompt,f)) ]
print("# of prompt files found: " + str(len(list_myFiles)))


# Import loop!
forceImport = 1 #set to 0 to force a re-import
if((forceImport == 0) or not(os.path.isfile(data_prompt_main)) ):
    print("... running prompt import loop")
    
    # read every prompt file, then combine them in a single concat
    list_frames = [ pd.read_csv(os.path.join(dir_in_prompt, f)) for f in list_myFiles ]
    prompt1     = pd.concat(list_frames)
    
    prompt1.to_csv(data_prompt_main)

# Import the combined csv file
allPrompt = pd.read_csv(data_prompt_main)
allPrompt.shape

# of prompt files found: 2277


(130274, 48)

In [17]:
allPrompt.dtypes

Unnamed: 0                  int64
domain                     object
banned_by                  object
media_embed                object
subreddit                  object
selftext_html              object
selftext                   object
likes                      object
suggested_sort             object
user_reports               object
secure_media               object
link_flair_text            object
id                         object
gilded                    float64
archived                   object
clicked                    object
report_reasons             object
author                     object
media                      object
score                     float64
approved_by                object
over_18                    object
hidden                     object
thumbnail                  object
subreddit_id               object
edited                     object
link_flair_css_class       object
author_flair_css_class     object
downs                     float64
mod_reports   

## Clean some variables
Clean out strings, and create prompt-level features in the process

In [18]:
# 'Genre' identifier is the link_flair_text,
# variable needs a bit of cleaning first, though
allPrompt['genre'] = allPrompt['link_flair_text']
allPrompt['genre'] = allPrompt['genre'].map(str.strip)

recoder = {
    'Established Universe?' : 'Established Universe',
    'Constrained Writing/Media Prompt' : 'Constrained Writing',
    'NSFW?' : 'NSFW',
    'Writing prompt' : 'Writing Prompt',
    'Writing Prompt - DID NOT ACTUALLY HAPPEN' : 'Writing Prompt',
    'Writing Prompt - Not Actual News' : 'Writing Prompt',
    'Writing Prompt NOT /R/ASKREDDIT' : 'Writing Prompt',
    'Writing Prompt SHUT UP ABOUT THE WIRE' : 'Writing Prompt',
    'Writing Prompt [WP]' : 'Writing Prompt',
    'Flash Fiction CONTEST!' : 'Flash Fiction',
    'ALL THE PROMPTS' : 'Writing Prompt',
    'Poetry Prompt' : 'Poetry'}

allPrompt['genre'] = allPrompt['genre'].replace(recoder)
allPrompt.groupby('genre').size()


genre
Constrained Writing         3358
Constructive Criticism       713
Contest!                      24
Continuing Story             238
Established Universe        8992
Flash Fiction               1396
Historical Prompt            109
Image Prompt                3783
Media Prompt                1109
Moderator Post               349
Music Prompt                 216
NSFW                           3
Off Topic                   1688
Poetry                        10
Prompt Inspired             1655
Prompt Me                   1230
Reality Fiction              692
Rewriting                     56
Workshop                      28
Writing Prompt                2
Writing Prompt            104220
null                         403
dtype: int64
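
The genre table above lists `Writing Prompt` twice, which suggests that some labels still differ in whitespace that `str.strip` does not touch (for example doubled spaces or tabs inside the label). That cause is an assumption; printing the labels with `repr` would confirm it. A minimal sketch of a more aggressive whitespace normalization, using made-up labels rather than the real data:

```python
import pandas as pd

# Hypothetical labels: same text, different whitespace
labels = pd.Series(['Writing Prompt', ' Writing  Prompt ', 'Writing\tPrompt'])

# collapse every run of whitespace to a single space, then trim the ends
normalized = labels.str.replace(r'\s+', ' ', regex=True).str.strip()

print(normalized.value_counts())
```

If the duplicates come from something other than whitespace (different capitalization, stray punctuation), an extra entry in `recoder` would be the simpler fix.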

Reddit has a flag for whether (and when) the prompt was later edited.

  want: 
   - editRec:  a dummy for whether the prompt was edited
   - editTime: how long (in seconds, from the UNIX timestamps) until the prompt was edited

In [19]:
# editRec: dummy for whether edited
recoder = { ' false' : 0}
allPrompt['editRec'] = allPrompt['edited']
allPrompt['editRec'].replace(recoder, inplace=True)
editRec = allPrompt['editRec']
editRec[editRec > 1] = 1
allPrompt['editRec'] = editRec
print(allPrompt.groupby('editRec').size())

# editNum: temp variable
allPrompt['editNum'] = allPrompt['edited']
allPrompt['editNum'] = allPrompt['editNum'].map(str.strip)
recoder = { 'false' : 0}
allPrompt['editNum'].replace(recoder, inplace=True)

# editTime: if prompt was edited, how long (in seconds) after posting?
#  set to 0 if prompt was never edited
#  also set to 0 any edit made more than 1 day (86400 s) after posting
#          (covers less than .5% of the data)
allPrompt['created']  = allPrompt['created_utc'] * allPrompt['editRec']
allPrompt[['editNum', 'created']] = allPrompt[['editNum', 'created']].astype(float)
allPrompt['editTime'] = allPrompt['editNum'] - allPrompt['created']
editTime = allPrompt['editTime']
editTime[editTime > 86400] = 0 
allPrompt['editTime'] = editTime

editRec
0    125313
1      4961
dtype: int64
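
The recode above compares a column that mixes strings and numbers (`editRec > 1`), which appears to rely on Python 2 comparison rules. A minimal sketch of the same two features built with `pd.to_numeric` instead, assuming `edited` holds either the text `false` or a UNIX timestamp with leading whitespace; the toy values are hypothetical:

```python
import pandas as pd

# Hypothetical toy data, not taken from the real prompt files
df = pd.DataFrame({
    'created_utc': [1000.0, 2000.0, 3000.0],
    'edited': [' false', ' 2600.0', ' 200000.0'],
})

# non-numeric entries such as 'false' become NaN
edit_stamp = pd.to_numeric(df['edited'].str.strip(), errors='coerce')

# editRec: 1 if the prompt was ever edited, else 0
df['editRec'] = edit_stamp.notna().astype(int)

# editTime: seconds from posting to edit, 0 if never edited
df['editTime'] = (edit_stamp - df['created_utc']).fillna(0)

# zero out edits made more than one day after posting
df.loc[df['editTime'] > 86400, 'editTime'] = 0
```

This skips the temporary `editNum` and `created` columns, since the subtraction is done directly on the parsed timestamp.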


In [20]:
# isAmod: is the author of the prompt a mod for the subreddit?
#         
recoder = { 
          ' MOD' : 1,
          ' null' : 0,
          ' ' : 0
}
allPrompt['isAmod'] = allPrompt['author_flair_css_class']
allPrompt['isAmod'].replace(recoder, inplace=True)
print(allPrompt.groupby('isAmod').size())

isAmod
0    128619
1      1655
dtype: int64
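
A possible shorthand for the same dummy, shown on hypothetical flair values: strip the whitespace first and test for `MOD` directly. Unlike the recoder above, this maps every non-`MOD` value to 0, which is an assumption about what the data should look like.

```python
import pandas as pd

# Hypothetical flair values, mimicking the leading whitespace in the raw data
flair = pd.Series([' MOD', ' null', ' ', ' MOD'])

# 1 where the stripped flair is exactly 'MOD', 0 for everything else
is_a_mod = (flair.str.strip() == 'MOD').astype(int)
```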


In [21]:
# remove some columns (including 'Unnamed: 0')
#  these columns either (1) don't vary in the data,
#                       (2) were recoded into a more usable format, or
#                       (3) were renamed to eliminate leading whitespace
allPrompt['distinguished'] = allPrompt[' distinguished']
allPrompt = allPrompt.drop(['Unnamed: 0',
                            'domain', 'banned_by', 'media_embed', 'subreddit',
                            'selftext_html', 'likes', 'suggested_sort',
                            'user_reports', 'secure_media', 'link_flair_text',
                            'clicked', 'report_reasons', 'media', 'approved_by',
                            'hidden', 'thumbnail', 'subreddit_id', 'editNum',
                            'link_flair_css_class', 'author_flair_css_class',
                            'downs', 'mod_reports', 'secure_media_embed', 
                            'saved', 'removal_reason', 'stickied', 'created',
                            'permalink', 'visited', 'num_reports', 'nsfw',
                            ' distinguished'],axis=1)
allPrompt.dtypes

selftext              object
id                    object
gilded               float64
archived              object
author                object
score                float64
over_18               object
edited                object
is_self               object
name                  object
url                   object
author_flair_text     object
title                 object
created_utc          float64
ups                  float64
num_comments         float64
genre                 object
editRec               object
editTime             float64
isAmod                 int64
distinguished        float64
dtype: object

In [22]:
# dealing with dates
#   everyone's least-favorite fruit / variable type

created_dt = pd.to_datetime(allPrompt.created_utc, unit='s')  # parse once, reuse below
allPrompt['year']  = created_dt.dt.year
allPrompt['month'] = created_dt.dt.month
allPrompt['day']   = created_dt.dt.day
allPrompt['wDay']  = created_dt.dt.dayofweek  # 0 = Monday ... 6 = Sunday
print(allPrompt.groupby('wDay').size()) # fewest prompts on saturday (fri/sun also lower)

# num_comments field has two observations with -1; recode them to 0
recoder = { -1 : 0}
allPrompt['num_comments'] = allPrompt['num_comments'].replace(recoder)
allPrompt['prompt_length'] = allPrompt['title'].str.len()
allPrompt['all_length']    = allPrompt['prompt_length'] + allPrompt['selftext'].str.len()

wDay
0    19119
1    19832
2    19870
3    19641
4    18525
5    15936
6    17351
dtype: int64
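
One caveat on `all_length`: if `selftext` contains missing values (an assumption; the raw files may instead hold a blank or the string `null`), the sum comes out as NaN for those rows. A minimal sketch of a guard, on hypothetical prompts:

```python
import numpy as np
import pandas as pd

# Hypothetical prompts; the second has no body text
df = pd.DataFrame({
    'title': ['[WP] A dragon', '[WP] Rain'],
    'selftext': ['More detail', np.nan],
})

df['prompt_length'] = df['title'].str.len()

# treat a missing body as an empty string so the sum stays defined
df['all_length'] = df['prompt_length'] + df['selftext'].fillna('').str.len()
```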


In [23]:
# id variable has whitespace, remove it
allPrompt['id'] = allPrompt['id'].map(str.strip)

## Prepare to merge to comment-level
Since some variable names are used at both the prompt and comment levels, rename the prompt-level variables by prepending 'p_'.

In [25]:
new_columns = allPrompt.columns.values
new_columns = 'p_' + new_columns

allPrompt.columns = new_columns

In [26]:
allPrompt.columns.values

array(['p_selftext', 'p_id', 'p_gilded', 'p_archived', 'p_author',
       'p_score', 'p_over_18', 'p_edited', 'p_is_self', 'p_name', 'p_url',
       'p_author_flair_text', 'p_title', 'p_created_utc', 'p_ups',
       'p_num_comments', 'p_genre', 'p_editRec', 'p_editTime', 'p_isAmod',
       'p_distinguished', 'p_year', 'p_month', 'p_day', 'p_wDay',
       'p_prompt_length', 'p_all_length'], dtype=object)
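
For reference, pandas also offers `DataFrame.add_prefix`, which should give the same renaming in one call. A small sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical stand-in for the prompt-level data
df = pd.DataFrame({'id': ['abc'], 'title': ['[WP] Rain']})

# prepend 'p_' to every column name
df = df.add_prefix('p_')

print(list(df.columns))
```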

## Write cleaned prompt-level data to data_prompt_clean

In [27]:
allPrompt.to_csv(data_prompt_clean)
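
Note that `to_csv` writes the row index by default, which is typically where a column like the `Unnamed: 0` dropped earlier comes from when the file is read back. A small round-trip sketch, using an in-memory buffer and hypothetical data, of how `index=False` avoids it:

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned prompt data
df = pd.DataFrame({'p_id': ['abc', 'def'], 'p_score': [10.0, 3.0]})

# default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to drop the index here depends on how the comment-level merge reads this file, so treat this as an option rather than a required change.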