In [1]:
#This notebook starts from the trimmed dataframes: rows with NaNs and posts
#marked '[removed]' have already been dropped, so we begin with cleaner data.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
lpt_raw = pd.read_csv('./data/lpt_trimmed.csv')
ulpt_raw = pd.read_csv('./data/ulpt_trimmed.csv')

In [4]:
#Probably not necessary, but just in case let's check:
lpt_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7778 entries, 0 to 7777
Data columns (total 72 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     7778 non-null   int64  
 1   all_awardings                  7778 non-null   object 
 2   allow_live_comments            7778 non-null   bool   
 3   author                         7778 non-null   object 
 4   author_flair_background_color  0 non-null      float64
 5   author_flair_css_class         0 non-null      float64
 6   author_flair_text              0 non-null      float64
 7   author_flair_text_color        197 non-null    object 
 8   awarders                       7778 non-null   object 
 9   banned_by                      0 non-null      float64
 10  can_mod_post                   7778 non-null   bool   
 11  contest_mode                   7778 non-null   bool   
 12  created_utc                    7778 non-null   i

In [5]:
ulpt_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7527 entries, 0 to 7526
Data columns (total 75 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     7527 non-null   int64  
 1   all_awardings                  7527 non-null   object 
 2   allow_live_comments            7527 non-null   bool   
 3   author                         7527 non-null   object 
 4   author_flair_background_color  0 non-null      float64
 5   author_flair_css_class         0 non-null      float64
 6   author_flair_text              0 non-null      float64
 7   author_flair_text_color        177 non-null    object 
 8   awarders                       7527 non-null   object 
 9   banned_by                      0 non-null      float64
 10  can_mod_post                   7527 non-null   bool   
 11  contest_mode                   7527 non-null   bool   
 12  created_utc                    7527 non-null   i

In [6]:
#Both dataframes have full title, selftext, and subreddit columns. This is where I will start.
lpt_df = lpt_raw[['title', 'selftext', 'subreddit']]
ulpt_df = ulpt_raw[['title', 'selftext', 'subreddit']]
lpt_df.head()

Unnamed: 0,title,selftext,subreddit
0,"LPT: When packing for a move, use your clothes...","Most people will unpack the kitchen early on, ...",LifeProTips
1,LPT: To avoid giving clicks/views to clickbait...,Most of the time it will give you the name of ...,LifeProTips
2,LPT: Kindness is not weakness.,"Before I go on, I hope this doesn’t get taken ...",LifeProTips
3,LPT: Bought online and shipping delayed? Check...,For example- I ordered several times from Newe...,LifeProTips
4,"LPT: Guys, if you are on a date with a girl, s...",(If physical attraction and all that is there ...,LifeProTips


In [7]:
ulpt_df.head()

Unnamed: 0,title,selftext,subreddit
0,ULPT Request: Tax fraud on Mercari?,"So, I make a lot of money off mercari. And it ...",UnethicalLifeProTips
1,ULPT: HOW TO GET CONSISTENT SEX,psa: this probably works best if your single\n...,UnethicalLifeProTips
2,ULPT Request: Need to fuck with someone’s head.,Someone I know is being an asshole and trollin...,UnethicalLifeProTips
3,ULPT REQUEST: how do I get someone fired from ...,how can I get someone fired from their current...,UnethicalLifeProTips
4,ULPT Request: Calling in sick at work,"Hey guys, I've got some exams coming up in a f...",UnethicalLifeProTips


Nearly every post title in these subreddits starts with either LPT or ULPT. This is a problem: the tag would give away the label immediately when my classifier tries to decide whether a post is LPT or ULPT. I need to deal with this somehow:

In [8]:
len(lpt_df[lpt_df['title'].str.contains('LPT')])/len(lpt_df)

0.9915145281563383

In [9]:
#'ULPT' contains 'LPT', so a single substring check covers both tags
len(ulpt_df[ulpt_df['title'].str.contains('LPT')])/len(ulpt_df)

0.9602763385146805
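
A note on matching several tags at once: a Python expression such as `'LPT' or 'ULPT'` evaluates to `'LPT'`, because `or` returns its first truthy operand. The check still gives the right share here only because 'ULPT' contains 'LPT'. A minimal sketch on made-up titles (the `titles` Series is hypothetical) showing this, along with a regex alternation that states both options explicitly:

```python
import pandas as pd

# Hypothetical titles, for illustration only
titles = pd.Series(['ULPT: example one', 'LPT: example two', 'no tag here'])

# `or` returns its first truthy operand, so this is just 'LPT'
pattern = 'LPT' or 'ULPT'

# Share of titles containing the substring 'LPT' (this also matches 'ULPT')
share_substring = titles.str.contains(pattern).mean()

# A regex alternation names both tags explicitly
share_regex = titles.str.contains(r'ULPT|LPT', regex=True).mean()
```

If the two tags did not overlap, only the alternation form would count both.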

In [10]:
#About 99% of LPT titles and 96% of ULPT titles contain the respective tag. Let's remove them.

In [11]:
#Strip lpt and ulpt from both ends of the title and post text, lowercasing first:
lpt_df['title'] = lpt_df['title'].str.lower().str.strip(r'?u[lpt]')
lpt_df['selftext'] = lpt_df['selftext'].str.lower().str.strip(r'?u[lpt]')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lpt_df['title'] = lpt_df['title'].str.lower().str.strip(r'?u[lpt]')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lpt_df['selftext'] = lpt_df['selftext'].str.lower().str.strip(r'?u[lpt]')
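
The warning above appears because `lpt_df` was created by selecting columns from `lpt_raw`, so pandas cannot tell whether the assignment is meant to reach the original dataframe. A minimal sketch, assuming a small stand-in dataframe (`raw` and its contents are hypothetical), of taking an explicit copy before modifying it:

```python
import pandas as pd

# Hypothetical stand-in for the raw dataframe
raw = pd.DataFrame({
    'title': ['LPT: First Tip', 'LPT: Second Tip'],
    'selftext': ['Some Text', 'More Text'],
    'subreddit': ['LifeProTips', 'LifeProTips'],
    'score': [1, 2],
})

# .copy() makes the subset an independent dataframe, so later
# assignments are unambiguous and leave `raw` untouched
subset = raw[['title', 'selftext', 'subreddit']].copy()
subset['title'] = subset['title'].str.lower()
```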


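The pattern only caught most of them: `str.strip` treats its argument as a set of characters to trim from both ends rather than as a regex, so tags in the middle of the text remain: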

In [12]:
len(lpt_df[lpt_df['selftext'].str.contains('lpt')])

179
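
A minimal sketch on made-up strings (the `text` Series is hypothetical) contrasting `Series.str.strip`, which trims a set of characters from both ends, with `Series.str.replace` and `regex=True`, which removes a pattern wherever it occurs. This is one possible alternative, not what the cells in this notebook ran:

```python
import pandas as pd

# Hypothetical lowercased post text, for illustration only
text = pd.Series([
    'ulpt request: tax fraud?',
    'psa: see the lpt above',
    'lpt: plan ahead',
])

# strip() trims the characters ? u [ l p t ] from both ends only,
# so it also eats the leading 'p' of 'psa' and the trailing '?'
stripped = text.str.strip(r'?u[lpt]')

# replace() with regex=True removes whole-word 'lpt'/'ulpt' anywhere,
# ignoring case, then trims surrounding whitespace
replaced = text.str.replace(r'\bu?lpt\b', '', case=False, regex=True).str.strip()
```

With `strip`, the second string keeps its mid-sentence 'lpt' but loses its first letter, while `replace` removes the tag and leaves the rest of the text intact.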

Apply the same stripping to the ULPT dataframe:

In [13]:
ulpt_df['title'] = ulpt_df['title'].str.lower().str.strip(r'?u[lpt]')
ulpt_df['selftext'] = ulpt_df['selftext'].str.lower().str.strip(r'?u[lpt]')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ulpt_df['title'] = ulpt_df['title'].str.lower().str.strip(r'?u[lpt]')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ulpt_df['selftext'] = ulpt_df['selftext'].str.lower().str.strip(r'?u[lpt]')


ULPT threads also contain some lpt references. We might want to remove these, since they would probably confuse my future algorithm. In real-world use, text being screened for unethical content is unlikely to come from the LPT or ULPT communities on Reddit anyway:

In [14]:
len(ulpt_df[ulpt_df['selftext'].str.contains('lpt')])

154

In [15]:
#Let's drop all the instances of lpt and ulpt we can find:
lpt_df.drop(lpt_df[lpt_df['title'].str.contains(r'(?i)lpt')].index, inplace=True)
lpt_df.drop(lpt_df[lpt_df['selftext'].str.contains(r'(?i)lpt')].index, inplace=True)
ulpt_df.drop(ulpt_df[ulpt_df['title'].str.lower().str.contains('ulpt')].index, inplace=True)
ulpt_df.drop(ulpt_df[ulpt_df['selftext'].str.lower().str.contains('ulpt')].index, inplace=True)
ulpt_df.drop(ulpt_df[ulpt_df['title'].str.lower().str.contains('lpt')].index, inplace=True)
ulpt_df.drop(ulpt_df[ulpt_df['selftext'].str.lower().str.contains('lpt')].index, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(
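
The same filtering can also be written with a single boolean mask instead of repeated in-place drops. A minimal sketch on a made-up dataframe (the `df` contents are hypothetical); because 'ulpt' contains 'lpt', one case-insensitive check per column covers both tags:

```python
import pandas as pd

# Hypothetical posts, for illustration only
df = pd.DataFrame({
    'title': [': plan ahead', 'request: an idea', 'see this ULPT'],
    'selftext': ['nothing tagged', 'as seen on lpt', 'plain text'],
})

# True where either column mentions lpt/ulpt, ignoring case
has_tag = (
    df['title'].str.contains('lpt', case=False)
    | df['selftext'].str.contains('lpt', case=False)
)

# Keep only untagged rows; .copy() gives an independent dataframe
clean = df[~has_tag].copy()
```

Building a new dataframe from the mask avoids modifying a slice in place, which is what triggers the warning shown above.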


In [16]:
lpt_df.shape

(7562, 3)

In [17]:
ulpt_df.shape

(7346, 3)

Let's verify that no lpt remains anywhere in the title or text:

In [18]:
lpt_df[lpt_df['title'].str.contains(r'(?i)lpt')]

Unnamed: 0,title,selftext,subreddit


In [19]:
lpt_df[lpt_df['selftext'].str.contains(r'(?i)lpt')]

Unnamed: 0,title,selftext,subreddit


In [26]:
ulpt_df[ulpt_df['title'].str.contains(r'(?i)lpt')]

Unnamed: 0,title,selftext,subreddit


In [20]:
#Dropping posts that contain a standalone 'u' between spaces, in case a ulpt somewhere got cut down to just u
ulpt_df.drop(ulpt_df[ulpt_df['selftext'].str.contains(r'(?i) u ')].index, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [21]:
len(ulpt_df)

7312
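
Worth keeping in mind that a pattern like `(?i) u ` flags any standalone 'u' between two spaces, which includes 'u' used as shorthand for 'you', and it misses a 'u' at the very start or end of the text. A minimal sketch on made-up strings (the `posts` Series is hypothetical) showing which rows such a filter would flag:

```python
import pandas as pd

# Hypothetical post text, for illustration only
posts = pd.Series([
    'can u help me',       # standalone 'u' between spaces
    'u should try this',   # 'u' at the start, no leading space
    'you are welcome',     # 'u' is part of a longer word
    'thank u',             # 'u' at the end, no trailing space
])

# Same pattern as in the cell above: a space, 'u' (any case), a space
flagged = posts.str.contains(r'(?i) u ')
```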

Let's inspect our new dataframes:

In [24]:
lpt_df.head()

Unnamed: 0,title,selftext,subreddit
0,": when packing for a move, use your clothes to...","most people will unpack the kitchen early on, ...",LifeProTips
1,: to avoid giving clicks/views to clickbaity n...,most of the time it will give you the name of ...,LifeProTips
2,: kindness is not weakness.,"before i go on, i hope this doesn’t get taken ...",LifeProTips
3,: bought online and shipping delayed? check gu...,for example- i ordered several times from newe...,LifeProTips
4,": guys, if you are on a date with a girl, shut...",(if physical attraction and all that is there ...,LifeProTips


In [25]:
ulpt_df.head()

Unnamed: 0,title,selftext,subreddit
0,request: tax fraud on mercari,"so, i make a lot of money off mercari. and it ...",UnethicalLifeProTips
1,: how to get consistent sex,sa: this probably works best if your single\n\...,UnethicalLifeProTips
2,request: need to fuck with someone’s head.,someone i know is being an asshole and trollin...,UnethicalLifeProTips
3,request: how do i get someone fired from thei...,how can i get someone fired from their current...,UnethicalLifeProTips
4,request: calling in sick at work,"hey guys, i've got some exams coming up in a f...",UnethicalLifeProTips


In [26]:
lpt_df.to_csv('./data/lpt_cleaned.csv')
ulpt_df.to_csv('./data/ulpt_cleaned.csv')
both_df = pd.concat([lpt_df, ulpt_df])
both_df.reset_index(inplace=True)
both_df.head()

Unnamed: 0,index,title,selftext,subreddit
0,0,": when packing for a move, use your clothes to...","most people will unpack the kitchen early on, ...",LifeProTips
1,1,: to avoid giving clicks/views to clickbaity n...,most of the time it will give you the name of ...,LifeProTips
2,2,: kindness is not weakness.,"before i go on, i hope this doesn’t get taken ...",LifeProTips
3,3,: bought online and shipping delayed? check gu...,for example- i ordered several times from newe...,LifeProTips
4,4,": guys, if you are on a date with a girl, shut...",(if physical attraction and all that is there ...,LifeProTips


In [27]:
both_df.drop('index', axis=1, inplace=True)

In [28]:
both_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14874 entries, 0 to 14873
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      14874 non-null  object
 1   selftext   14874 non-null  object
 2   subreddit  14874 non-null  object
dtypes: object(3)
memory usage: 348.7+ KB


In [29]:
both_df.to_csv('./data/both_cleaned.csv')
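
One side effect of `to_csv` with its default settings is that the index is written as an extra column, which shows up as `Unnamed: 0` when the file is read back (as in the `info()` output of the trimmed files earlier). A minimal sketch, assuming tiny stand-in dataframes and an in-memory buffer instead of a real file (`a`, `b`, and `buffer` are hypothetical), of combining `reset_index(drop=True)` with `index=False`:

```python
import io
import pandas as pd

# Hypothetical stand-ins for the two cleaned dataframes
a = pd.DataFrame({'title': ['one'], 'subreddit': ['LifeProTips']})
b = pd.DataFrame({'title': ['two'], 'subreddit': ['UnethicalLifeProTips']})

# drop=True discards the old index instead of adding an 'index' column
both = pd.concat([a, b]).reset_index(drop=True)

# index=False keeps the index out of the CSV
buffer = io.StringIO()
both.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

Whether to switch depends on what notebook 03 expects, since it may already account for the extra column.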

Great. Now we have cleaned LPT and ULPT dataframes, plus a combined one, ready for analysis and modeling. Please continue on to notebook 03.