Here we walk through the data cleaning. It is performed on the data that was actually used, which has extreme empirical oversampling of the target class. Aside from insights drawn from direct, sporadic inspection of the data, however, the procedures are in principle blind to the overall picture of the data and could be applied to other Reddit comments used to supplement this data set.

The procedures used here are also located in the <font color='red'>put package here</font> subpackage of the <font color='red'>module</font>, along with a table of the regular expressions used and their intent. Our goal is to normalize the text, both for analysis and to make a larger sample more tractable for global learning and analysis on a single machine. The resulting file, used for model training, is already located under the `data_sets` directory; this notebook shows how it is created but does not *need* to be run for the other notebooks to work.

We begin by importing the "crude working data" created in the <font color='green'>data harvesting</font> notebook.

In [17]:
!conda install pyarrow
#Take this out if it doesn't work

^C


In [10]:
import pandas as pd
import pickle
import re
from nltk.tokenize import RegexpTokenizer 


#names are supplied explicitly, so no header row is read from the file
working_data=pd.read_csv(r'..\data_sets\crude_data_enc_utf8.csv',names=['snapshot','subreddit','label'],
                        sep='|',encoding='utf-8',header=None)

In [23]:
pd.read_parquet(r'..\data_sets\crude_data_parq.parquet')

ArrowIOError: Invalid parquet file. Corrupt footer.

In [31]:
#Importing is still failing; it is not clear why everything that can go wrong is going wrong.
#Proceed as though the data is good, and do not forget to undo this drop.

working_data.dropna(inplace=True)

#TODO: you need to delete this cell

t=pd.read_csv(r'..\crude_data_pipe_sep.csv',header=None,names=['snapshot','subreddit','label'],sep='|',encoding='UTF-8')


#better but still bad somehow.  How is it making a mistake still? 

We first convert everything to lowercase and then try to drop anything problematic. There are two types of meaningful tokens we want to be aware of besides words. The first is website URLs: we want to preserve this information as best we can, at least initially (consider that dropping websites evenly across the board would be as hard as finding a way to condense them). The second is internal Reddit username and subforum links, which are written as /u/user_name or /r/subreddit_name. Nominally the links should begin with a /, though investigation shows that this is often not the case. The name itself can consist of a sequence of letters, numbers, or underscores. We'd like to preserve this structure (we don't want to confuse "/r/politics", a political board on Reddit, with the ordinary term "politics", for instance). We'll write further text processing with this in mind; for now, though, we'll pass a regex substitution to normalize the expressions:

We will match anything of the form /u/valid_name_format or /r/valid_name_format and drop the leading /. These steps, which are best completed before the next step, can be accessed independently through the `basic_reddit_preformatting` method of `ex_id_tools.text_cleaning`.

In [37]:
working_data['snapshot']=working_data['snapshot'].str.lower()
#send all words to lowercase
working_data['snapshot']=working_data.snapshot.str.replace(r'(?<![^\s])/(?=\w/[\w]+)',' ',regex=True)
#clean /x/yz-type strings: the leading / is replaced by a space; regex=True is needed on newer pandas

<font color='red'>output</font>
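
As a quick check of the substitution above, the sketch below applies the same pattern to a few made-up strings using plain `re`. The helper name `strip_leading_slash` and the sample strings are ours, for illustration only; they are not part of `ex_id_tools`.

```python
import re

# Same pattern as the cell above: a "/" at the start of a token (preceded by
# whitespace or the start of the string), followed by one word character,
# another "/", and a name made of letters, digits, or underscores.
LEADING_SLASH = re.compile(r'(?<![^\s])/(?=\w/[\w]+)')

def strip_leading_slash(text):
    """Hypothetical helper: lowercase the text, then blank out the leading '/'."""
    return LEADING_SLASH.sub(' ', text.lower())

examples = [
    "/r/Politics is a board",     # leading slash at the start of the string
    "ask /u/some_user about it",  # leading slash after a space
    "see r/politics or and/or",   # no leading slash: left unchanged
]
cleaned = [strip_leading_slash(e) for e in examples]
for before, after in zip(examples, cleaned):
    print(repr(before), '->', repr(after))
```

Note that the slash is replaced by a space rather than removed, so a link that follows a space leaves two spaces behind; this is harmless once the text is tokenized.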

We then filter out non-English subforums. This is nominally part of data cleaning, but it cannot sensibly be done before basic text cleaning. It was done by computing the ratio of non-English stopwords to English stopwords in each entry, then sorting the data by that ratio to find subreddits that may not be English-based. This is quite a crude method, but it is computationally feasible in a way that more sophisticated means (presumably a pretrained classifier, or a larger lexicon) would not be.

The functionality is given in the `text_cleaning` submodule, under the name `crude_language_detect_toll`. The resulting series of ratios is used to define a sorted dataframe, and repeated slices can then be displayed to find non-English subreddits. The list of those found for this data set is saved as `foreign_sub_list.p` in the `data_sets` folder.
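
To make the stopword-ratio idea concrete, here is a minimal sketch of how such a score could be computed and used for sorting. It is not the implementation in `text_cleaning`: the function name `foreign_to_english_ratio`, the tiny stopword sets, the sample rows, and the choice to add one to the denominator are all assumptions made for illustration.

```python
# Tiny illustrative stopword sets (assumed for this sketch only; a real
# version would draw on much larger lists).
ENGLISH_STOPS = {"the", "and", "is", "of", "to", "in", "that", "it"}
FOREIGN_STOPS = {"der", "die", "und", "ist", "le", "la", "et", "est", "el", "que", "los"}

def foreign_to_english_ratio(text):
    """Hypothetical helper: ratio of non-English to English stopword hits.

    One is added to the denominator so that entries with no English
    stopwords do not cause a division by zero.
    """
    words = text.lower().split()
    foreign = sum(w in FOREIGN_STOPS for w in words)
    english = sum(w in ENGLISH_STOPS for w in words)
    return foreign / (english + 1)

# Made-up (subreddit, snapshot) rows.
rows = [
    ("ASKREDDIT", "the cat is in the house and it is happy"),
    ("DE", "der hund ist im haus und die katze ist draussen"),
]

# Sort so the most foreign-looking entries come first, ready to be inspected by hand.
ranked = sorted(rows, key=lambda r: foreign_to_english_ratio(r[1]), reverse=True)
for subreddit, text in ranked:
    print(subreddit, foreign_to_english_ratio(text))
```

The score is only a ranking aid; deciding which subreddits go on the foreign list still means inspecting the top of the sorted data by hand, as described above.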

In [48]:
foreign_list=pickle.load(open(r'..\data_sets\foreign_sub_list.p','rb'))
working_data.drop(working_data[working_data['subreddit'].isin(foreign_list)].index,inplace=True)

Unnamed: 0,snapshot,subreddit,label
0,wtf one guess where i live incredibly aggrava...,0XPROJECT,0.0
4,lol meant to reference the guy who deleted his...,1819,0.0
33,sounds fun as hell i have just the guardian fa...,2B2T,0.0
38,thats what im leaning toward as well but a sea...,3DPRINTING,0.0
39,no problem and id recommend both if you arent ...,3DPRINTING,0.0
...,...,...,...
20998,chapotrashhouse is this regarding my inability...,NEOLIBERAL,1.0
20999,this whole russia thing is like a more deprave...,NEOLIBERAL,1.0
21000,the new zealand national party released an ad ...,NEOLIBERAL,1.0
21001,source on the 80 http://wwwnberorg/papers/w149...,NEOLIBERAL,1.0


In [None]:
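from nltk.corpus import stopwords as nltk_stopwords
stopwords=set(nltk_stopwords.words('english'))
#assumption: 'stopwords' is not defined above, so NLTK's English list stands in for it
reddit_tokenizer=RegexpTokenizer(r'http[s]*[:]*[\w_/]+|www[\w/_]+|[ur]/[\w]+|[a-z]{2,}')

working_data['snapshot']=working_data['snapshot'].map(reddit_tokenizer.tokenize)
working_data['snapshot']=working_data['snapshot'].map(lambda l: [s for s in l if s not in stopwords])
#tokenize, then drop stopwords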
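
To see what the tokenizer pattern keeps, the sketch below runs it over a made-up snapshot. It uses `re.findall` instead of `RegexpTokenizer` so that it runs without NLTK; as far as we know the two return the same tokens for a pattern with no capture groups, but treat that as an assumption. `SAMPLE_STOPWORDS` is a tiny stand-in list, not the one used for the real data.

```python
import re

# The tokenizer pattern from the cell above: urls, u/name or r/name links,
# and lowercase words of two or more letters.
TOKEN_PATTERN = r'http[s]*[:]*[\w_/]+|www[\w/_]+|[ur]/[\w]+|[a-z]{2,}'

# Stand-in stopword list, assumed for illustration only.
SAMPLE_STOPWORDS = {"on", "the", "and"}

def tokenize(text):
    """Hypothetical helper: find all pattern matches, then drop stopwords."""
    tokens = re.findall(TOKEN_PATTERN, text)
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]

tokens = tokenize("source on r/politics http://wwwnberorg/papers/w149 i agree")
print(tokens)
```

The subreddit link and the url each survive as a single token, while the one-letter word "i" matches none of the alternatives and is dropped.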

At this point, initial modelling was attempted. The results prompted further cleaning, which is done below. First, we take out the websites we had previously chosen to keep. Note that in filtering and tokenizing we run the risk of losing some information, particularly around non-English top-level domains (co.uk, now appearing as "co uk", begins to float around, for instance). This helps motivate aggressive feature dropping later on (already likely a good idea) to suppress the effects of both the existing noise and the incidental noise we introduce in trying to better isolate good features.

We should be aware that at this point we are going to lose some messages written in the not-unpopular format <font color='gold'>O F T Y P I N G L I K E T H I S</font>, since the tokenizer keeps only words of two or more letters.

We then attempt to collapse repetitions, as users will often type "hahahahahahahahahahahaha", "hahahahahahahaha", etc., and those should likely be treated as the same. Having handled this, we can safely drop words longer than 45 characters, which were read in incorrectly at some point or are otherwise unwanted, such as long internal/diagnostic text or websites. Note that this choice is made for computability and expedience; a preferable method, absent those constraints, would recover at least some information from strings that offer readily exploitable opportunities to be split.

In [26]:
working_data['snapshot']=(working_data['snapshot']).map(lambda s:
                        [word for word in s if not re.match(r'http[s]*[:]*[\w_/]+|www[\w/_]+',word)])
#"url_start" in .text_cleaning.regex_store
working_data['snapshot']=working_data['snapshot'].map(
        lambda s: [re.sub(r'(\w{1,3}\w{0,3})\1{3,}',r'\g<1>\g<1>',w) for w in s])
#"repetitions_in" and "repetitions_out" in .text_cleaning.regex_store.  This will condense a unit of 1 to 6
#characters that is repeated four or more times in a row down to two copies

working_data['snapshot']=working_data['snapshot'].apply(lambda r: [w for w in r if len(w)<=45])
#Drop long strings


NameError: name 're' is not defined
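
The effect of the repetition pattern and the length cut-off is easiest to see on a few made-up tokens. The sketch below reuses the pattern from the cell above; the helper name `collapse_repeats` and the sample words are ours, for illustration only.

```python
import re

# A unit of 1 to 6 word characters followed by at least three more copies of itself.
REPEAT = re.compile(r'(\w{1,3}\w{0,3})\1{3,}')

def collapse_repeats(word):
    """Hypothetical helper: collapse a unit repeated 4+ times down to two copies."""
    return REPEAT.sub(r'\g<1>\g<1>', word)

long_word = "abcdefghijklmnopqrstuvwxyz" * 2   # 52 characters, no immediate repeats

words = ["hahahahahaha", "hahahahaha", "nooooooo", "banana", long_word]
collapsed = [collapse_repeats(w) for w in words]

# Same length filter as the cell above: keep tokens of at most 45 characters.
kept = [w for w in collapsed if len(w) <= 45]
print(kept)
```

Both spellings of the laugh end up as the same token, "banana" is untouched because its "an" repeats only twice, and the 52-character string is dropped by the length filter.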

We then join our strings back together and resume further filtering on the remaining data, again informed by the results of intermediate models. First, we'll handle the abundant formatting characters like 'nbsp' that appear.
We'll use the list of character entity references from https://www.w3schools.com/html/html_entities.asp. For ease, we rejoin the snapshot column before working. In dropping these we take special care with respect to the details of the English language: a word ending with "nbsp" probably just has that attached by mistake, but a word ending with "amp" is likely just a normal English word. An opportunity exists here for further filtering by examining the distribution of, say, words ending in -amp and the five words that occur on either side, to see what can and cannot be safely extrapolated for the exact data being examined. This was done to some extent here, and further informs the regex used: word boundaries are omitted only where doing so means *no* words are mistakenly dropped. The patterns are quite conservative as a result.

We first extract the center of anything likely intended to be of the form <font color='blue'>**\<**</font>text<font color='blue'>**\>**</font>, which may be misrepresented in our data as <font color='blue'>**lt**</font>text<font color='blue'>**gt**</font>. We then drop the expected HTML entities. We will be introducing errant vocabulary here, especially for simple HTML tags; most are short and will be dropped, or should be eliminated through feature extraction, but it is something to pay attention to.

In [None]:
html_entities=r'|'.join([r'[\w]*nbsp[\w]*',r'\blt[\w]*',r'\bgt[\w]*',r'\bamp\b',r'\bquot\b',r'\bpos\b'])
#html_entities in  .text_cleaning.regex_store
working_data['snapshot']=working_data['snapshot'].str.replace(r'lt(\w*)gt',r'\g<1>',regex=True)
#'html_tag_like_in' and 'html_tag_like_out' in text_cleaning.regex_store

working_data['snapshot']=working_data['snapshot'].str.replace(html_entities,'',regex=True)
#drop the remaining entity residue; regex=True is needed on newer pandas
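
The rejoin step itself is not shown in the cells here, so the sketch below puts the pieces together on one made-up token list: it joins the tokens with spaces, keeps the center of lt...gt strings, and then drops the entity residue. The helper name `clean_entities` and the sample tokens are assumptions for illustration; the patterns are copied from the cell above.

```python
import re

HTML_ENTITIES = r'|'.join([r'[\w]*nbsp[\w]*', r'\blt[\w]*', r'\bgt[\w]*',
                           r'\bamp\b', r'\bquot\b', r'\bpos\b'])

def clean_entities(tokens):
    """Hypothetical helper: rejoin a token list and strip HTML entity residue."""
    text = ' '.join(tokens)                       # rejoin the token list into one string
    text = re.sub(r'lt(\w*)gt', r'\g<1>', text)   # keep the center of lt...gt
    return re.sub(HTML_ENTITIES, '', text)        # drop what is left of the entities

result = clean_entities(['ltboldgt', 'camp', 'amp', 'nbspword', 'quot', 'lamp'])
print(result.split())
```

The word boundaries around "amp" are what protect "camp" and "lamp", while anything containing "nbsp" is removed outright. The substitutions leave extra spaces behind, which is why the result is split before printing.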

Some likely website addresses still remain; further, errant http or https strings were found at the end of innocuous words. In general this should be handled with careful grouping and splitting to preserve what follows (the complexity of the first regex pattern used expects and reflects that), but investigation showed that here we can simply drop said patterns.

In [None]:
working_data['snapshot']=working_data['snapshot'].str.replace(r'\b\w*http(?:[^s\s]|(?:s\w))+\w*\b','',regex=True)
#likely_url in regex_store
working_data['snapshot']=working_data['snapshot'].str.replace(r'(\w)http[s]*',r'\g<1>',regex=True)
#'http_end_in' and 'http_end_out' in regex_store

working_data['snapshot']=working_data['snapshot'].str.replace(r'\w+(?:png|gif|jpg|jpeg)\w*','',regex=True)
#image_endings in regex_store
#drop likely image links, note this is a little more information-compromising
#than might be expected due to the popularity of written out expressions like 'sad.jpg'

working_data['snapshot']=working_data['snapshot'].str.replace(r'\bcom\b','',regex=True)
#isolated_com in regex_store

working_data['snapshot']=working_data['snapshot'].str.replace(r'\bwww\b','',regex=True)
#isolated_www in regex_store

working_data['snapshot']=working_data['snapshot'].str.replace(r'http[s]*(?=\s)','',regex=True)
working_data['snapshot']=working_data['snapshot'].str.replace(r'(?<=\s)http[s]*','',regex=True)
#strings starting or ending with http
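
The order of these substitutions matters, so the sketch below runs the same patterns in sequence over one made-up string using plain `re`. The helper name `drop_url_residue` and the sample text are ours, for illustration only.

```python
import re

# (pattern, replacement) pairs in the same order as the cell above.
URL_STEPS = [
    (r'\b\w*http(?:[^s\s]|(?:s\w))+\w*\b', ''),  # likely urls
    (r'(\w)http[s]*', r'\g<1>'),                 # http(s) stuck to the end of a word
    (r'\w+(?:png|gif|jpg|jpeg)\w*', ''),         # likely image links
    (r'\bcom\b', ''),                            # isolated "com"
    (r'\bwww\b', ''),                            # isolated "www"
    (r'http[s]*(?=\s)', ''),                     # http(s) just before whitespace
    (r'(?<=\s)http[s]*', ''),                    # http(s) just after whitespace
]

def drop_url_residue(text):
    """Hypothetical helper: apply each substitution in turn."""
    for pattern, replacement in URL_STEPS:
        text = re.sub(pattern, replacement, text)
    return text

sample = "look httpswwwimgurcom wordhttp sadjpg www com http end"
result = drop_url_residue(sample)
print(result.split())
```

The first pattern removes the run-together url but deliberately leaves "wordhttp" alone, so that the second pattern can trim the trailing "http" and keep "word". The remaining steps clear the image name and the stray "www", "com", and "http" fragments.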

More cleaning can likely be done. In particular, phrases like "submission removed" appear repeatedly and represent some automatic action whose text showed up in the post contents. As we see elsewhere, though, we are able to get a good model with reasonable features from what results here.

As promised, we can create our desired data by just running:

In [1]:
from ex_id_tools import text_cleaning
text_cleaning.clean_series(working_data,foreign_list=pickle.load(open(r'..\data_sets\foreign_sub_list.p','rb')))

ModuleNotFoundError: No module named 'ex_id_tools'