# Data Cleaning

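This notebook loads the subreddit posts, checks them for duplicates and missing values, normalizes the text of each post and title, and exports the cleaned result to CSV.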

In [14]:
# standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# text cleaning, REGEX imports
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer

pd.set_option('display.max_colwidth', 200)

#read in and inspect the data

data = pd.read_csv('./data/subreddits.csv')
data.head()

Unnamed: 0,num_comments,score,selftext,subreddit,title
0,1,0,**Maybe it's *staged* right**?! We all know the Koch Bros and Der Juden control Drumpf! The NWO is **INEVITABLE!**\n\n\nTRUMP BOOTS AUTHOR FROM GOLF COURSE\n\nDonald Trump personally booted the au...,C_S_T,Koch Brother and Trump biographer kicked out of Trump Golf Course
1,21,31,[Part 1: It Starts With Your Thinking](https://www.reddit.com/r/C_S_T/comments/5k1b7h/how_not_to_get_sick_part_1_it_starts_with_your/) \n[Part 2: How Emotions Affect Your Health](https://www.redd...,C_S_T,How Not To Get Sick - Part 3: Discarding Your Victim Mentality
2,20,5,"**""Truth"" is invalid anywhere but inside Formal Logic. It doesn't exist in any physical reality.**\n\nPeople talk about ""Truth"" quiet casually, and they have a lot of different meanings for it, bu...",C_S_T,"""Truth"" in invalid anywhere but inside Formal Logic. It doesn't exist in any physical reality."
3,22,22,"Preface: I’d like to preface this text by admitting to speaking with major generalities when considering humanity. Additionally, I want to recognize how lengthy this read is, but when considering ...",C_S_T,Finding Truth in the Modern World: The Generation that Believed It Was Its Thoughts is now Awakening From the Illusion
4,6,21,The key to understanding the globalists’ strategy in implementing the New World Order is to understand that there are actually two NWOs: a Western-fronted decoy New World Order and a BRICS-fronted...,C_S_T,"The emerging United Nations-based, BRICS-fronted New World Order is the REAL NWO the globalists have been working towards..."


## Check for Duplicates, Null Values, and Removed Posts

In [19]:
# Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html
# Rows whose title repeats an earlier title. On inspection these don't appear to be actual duplicates, so they stay.
data[data.duplicated(subset='title')]

# data.isna().sum()  # no null values

# data[data['title'] == 'removed']  # no removed posts that need to be cleaned out of the dataset

Unnamed: 0,num_comments,score,selftext,subreddit,title
242,0,1,"Hey I swear I'm not karma whoring, I reached my 4000 word limit and didn't want to delete anything. I've started looking more into the court cases and I've been finding some absolute crazy stuff. ...",C_S_T,Ted Bundy Was Not A Serial Killer
320,11,1,"1 Hearken, O ye people of my church, saith the voice of him who dwells on high, and whose eyes are upon all men; yea, verily I say: Hearken ye people from afar; and ye that are upon the islands of...",C_S_T,Doctrine and Covenants 1
560,3,10,"Source link:\n\nhttps://eraoflight.com/2017/01/11/the-collective-the-powerful-influx-of-light/#more-20541\n\n\nGreetings, dear ones! We are pleased to have this moment to speak with you today.\nWe...",conspiracy,"One of the greater areas of shift and change has to do with individuals and groups beginning to grasp that they need no ""middle man"" to speak to Creator God/Goddess"
582,32,1,How do we fix this? \n\nI used to buy into the libertarian ideas about free market healthcare. Charity and volunteers to cover people at the bottom. The more I shape my views I don't think that is...,conspiracy,Healthcare: how do you think we can fix it?
614,63,23,I am going to give you everything I have collected up to this point. I am doing this because I think we are at a breaking point and after 2018 a lot of this information will be irrelevant. I am al...,conspiracy,[MEGAPOST] I spent 2017 trying to make some solid conclusions...
730,5,1,One of the greatest obstacles humanity has in freeing itself from its shackles of slavery and finally realizing its full potential is its flawed perceptions about other living beings and Life itse...,conspiracy,Arriving at the Core of Our Problems - Humanity's Prison of Mental Labels and Its Inability to See the Life Beyond Them
779,7,1,**Direct link to the newest True Pundit article:**\n\n[https://truepundit.com/blackmail-disgraced-fbi-officials-threaten-to-release-weiners-laptop-evidence-exposing-clintons-if-indicted/](https://...,conspiracy,True Pundit: FBI Officials are using Weiner's Laptop Evidence to avoid Prison
785,2,1,"**The Danger of Transparent Blockchains for Individuals and Businesses**\n\nTransparent blockchains do not only affect those who use cryptocurrencies in the black market, but also those who work w...",conspiracy,Privacy Coin Review &amp; Introduction to the Web of Trust
824,61,1,"* *""All things by immortal power, Near or far, Hiddenly / To each other linked are, That thou canst not stir a flower / Without troubling of a star...""* - [Francis Thompson](https://imgur.com/2O2e...",conspiracy,"""Every part of the universe is concentrated to a point; and that point is so marvelous a thing..."" - Leonardo da Vinci"
995,322,1,"A few broken doors and windows at the Capitol does not compare to all the leftist rioters that burned, damaged, and looted hundreds of local business in cities all across the country for many mont...",conspiracy,The exaggerated outrage at the Capitol protest from the media and politicians is all for political power and propaganda purposes. Establishment politicians are just threatened that right wingers a...
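The three checks in this section can be sketched on a toy frame. This is a minimal illustration, not the notebook's own code: the `toy` data is made up, and the `'[removed]'` / `'[deleted]'` placeholder strings are an assumption about how removed posts show up in the body text, so verify them against the actual scrape. Matching on title and body together is a stricter duplicate test than matching on title alone.

```python
import pandas as pd

# Toy frame standing in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    'title': ['Post A', 'Post A', 'Post B', 'Post C'],
    'selftext': ['same body', 'same body', '[removed]', None],
})

# Rows that repeat an earlier row on BOTH title and body.
exact_dupes = toy[toy.duplicated(subset=['title', 'selftext'])]

# Number of missing values in each column.
null_counts = toy.isna().sum()

# Bodies that are only a placeholder (assumed markers, see the note above).
placeholder_mask = toy['selftext'].isin(['[removed]', '[deleted]'])

print(len(exact_dupes), null_counts['selftext'], placeholder_mask.sum())
```

If any of these counts is nonzero on the real data, the matching rows can be inspected before deciding whether to drop them.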


In [22]:
data['selftext'][0]

"**Maybe it's *staged* right**?! We all know the Koch Bros and Der Juden control Drumpf! The NWO is **INEVITABLE!**\n\n\nTRUMP BOOTS AUTHOR FROM GOLF COURSE\n\nDonald Trump personally booted the author of an unflattering biography off Trump International Golf Club in West Palm Beach on Friday. Harry Hurt III, who penned the 1993 biography, Lost Tycoon: The Many Lives of Donald J. Trump, had come to play with billionaire industrialist David. H. Koch, a Trump club member, and two other golfers. Hurt, who has a scratch handicap and plays in colorful knickers, walked over to Trump on the practice range prior to his group’s assigned tee time, only to suffer a tongue lashing from the president-elect. “I said, ‘Congratulations, sir,’ and shook his hand,” Hurt recalls. “Trump said, ‘You were rough on me, Harry. Really rough. That shit you wrote.’” Hurt says he looked Trump in the eye, and said, “It’s all true,” to which Trump rejoined, “Not in the way you wrote it.” Among the juicy tidbits in 

In [23]:
# stop word list, plus a set for fast membership checks
stop_word_list = stopwords.words('english')
stop_word_set = set(stop_word_list)

# clean rev function from breakfast hour
def cleaner_rev(review):
    """Lowercase, tokenize, drop English stop words, lemmatize, and rejoin into one string."""
    # Instantiate the lemmatizer and tokenizer
    # (raw string so the regex escapes are not treated as Python string escapes)
    lemmatizer = WordNetLemmatizer()
    my_tokenizer = RegexpTokenizer(r"[\w']+|\$[\d\.]+")

    # Tokenize words
    words = my_tokenizer.tokenize(review.lower())

    # Remove stop words
    no_stops = [i for i in words if i not in stop_word_set]

    # Lemmatize (no stemming is applied)
    rev_lem = [lemmatizer.lemmatize(i) for i in no_stops]

    # Put words back together
    return ' '.join(rev_lem)
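To see what the tokenizer pattern in `cleaner_rev` keeps and discards, the sketch below applies the same regular expression with the standard library's `re.findall`. It assumes that `RegexpTokenizer` in its default (non-gap) mode returns the pattern's matches in the same way. The sample sentence and the `tokenize` helper are made up for illustration.

```python
import re

# Same pattern the cleaning function passes to RegexpTokenizer.
PATTERN = r"[\w']+|\$[\d\.]+"

def tokenize(text):
    """Lowercase the text, then return every non-overlapping match of PATTERN."""
    return re.findall(PATTERN, text.lower())

sample = "Don't pay $5.00 for https://example.com/a_b!"  # invented example sentence
tokens = tokenize(sample)
print(tokens)
```

Punctuation such as `:`, `/`, and `.` matches neither alternative, so a URL is broken into separate pieces, which is why cleaned text contains fragments like `http www reddit com`. Underscores count as word characters, so identifiers like `a_b` survive as a single token.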


In [24]:
cleaner_rev(data['selftext'][1])

"part 1 start thinking http www reddit com r c_s_t comment 5k1b7h how_not_to_get_sick_part_1_it_starts_with_your part 2 emotion affect health http www reddit com r c_s_t comment 5l5drq how_not_to_get_sick_part_2_how_emotions_affect disclaimer doctor often disagree thing commonly say sharing experience belief led that's good old entropy brings u much chaos let u call unavoidable fate destiny blah blah blah i'll reduce entropy purpose 'k taught theory came nothing moment somehow decided something thus exploded right taught shit went shit whatsoever going grew pokemon taught great great great great great ancestor monkey great great great great ancestor fish blah blah blah blah ancestor simple cell survive divide whoo get excited thinking ol' og grampa nothing exploded grandchild something anything exciting think we're taught life blow wind nothing became something expanding outward incredible http www etymonline com index php term incredible rate someday it'll collapse back right started 

In [25]:
data['clean_selftext'] = data['selftext'].map(cleaner_rev)
data['clean_title'] = data['title'].map(cleaner_rev)

In [26]:
data.head()

Unnamed: 0,num_comments,score,selftext,subreddit,title,clean_selftext,clean_title
0,1,0,**Maybe it's *staged* right**?! We all know the Koch Bros and Der Juden control Drumpf! The NWO is **INEVITABLE!**\n\n\nTRUMP BOOTS AUTHOR FROM GOLF COURSE\n\nDonald Trump personally booted the au...,C_S_T,Koch Brother and Trump biographer kicked out of Trump Golf Course,maybe staged right know koch bros der juden control drumpf nwo inevitable trump boot author golf course donald trump personally booted author unflattering biography trump international golf club w...,koch brother trump biographer kicked trump golf course
1,21,31,[Part 1: It Starts With Your Thinking](https://www.reddit.com/r/C_S_T/comments/5k1b7h/how_not_to_get_sick_part_1_it_starts_with_your/) \n[Part 2: How Emotions Affect Your Health](https://www.redd...,C_S_T,How Not To Get Sick - Part 3: Discarding Your Victim Mentality,part 1 start thinking http www reddit com r c_s_t comment 5k1b7h how_not_to_get_sick_part_1_it_starts_with_your part 2 emotion affect health http www reddit com r c_s_t comment 5l5drq how_not_to_g...,get sick part 3 discarding victim mentality
2,20,5,"**""Truth"" is invalid anywhere but inside Formal Logic. It doesn't exist in any physical reality.**\n\nPeople talk about ""Truth"" quiet casually, and they have a lot of different meanings for it, bu...",C_S_T,"""Truth"" in invalid anywhere but inside Formal Logic. It doesn't exist in any physical reality.",truth invalid anywhere inside formal logic exist physical reality people talk truth quiet casually lot different meaning 1 valid meaning everything else colloquialism treated accordingly truth exi...,truth invalid anywhere inside formal logic exist physical reality
3,22,22,"Preface: I’d like to preface this text by admitting to speaking with major generalities when considering humanity. Additionally, I want to recognize how lengthy this read is, but when considering ...",C_S_T,Finding Truth in the Modern World: The Generation that Believed It Was Its Thoughts is now Awakening From the Illusion,preface like preface text admitting speaking major generality considering humanity additionally want recognize lengthy read considering nature man lot said text speaks think power change life hope...,finding truth modern world generation believed thought awakening illusion
4,6,21,The key to understanding the globalists’ strategy in implementing the New World Order is to understand that there are actually two NWOs: a Western-fronted decoy New World Order and a BRICS-fronted...,C_S_T,"The emerging United Nations-based, BRICS-fronted New World Order is the REAL NWO the globalists have been working towards...",key understanding globalists strategy implementing new world order understand actually two nwos western fronted decoy new world order brics fronted real new world order understand globalists creat...,emerging united nation based brics fronted new world order real nwo globalists working towards


In [27]:
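# split() with no argument returns [] for an empty string, so empty cleaned text counts as 0 words, not 1
data['wordcount_clean_selftext'] = [len(i.split()) for i in data['clean_selftext']]
data['wordcount_clean_title'] = [len(i.split()) for i in data['clean_title']]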
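The word counts come from splitting each cleaned string and taking the length of the result. As a small illustration (the helper names are hypothetical and not part of the notebook), splitting on a literal space and splitting on any whitespace agree for normally spaced text but disagree for empty or irregularly spaced strings:

```python
def count_words(text):
    """Count tokens separated by any run of whitespace."""
    return len(text.split())

def count_words_literal_space(text):
    """Count pieces separated by single space characters (empty pieces included)."""
    return len(text.split(' '))

samples = ['true ok believe', '', 'a  b']  # normal, empty, double space
for s in samples:
    print(repr(s), count_words(s), count_words_literal_space(s))
```

Because `cleaner_rev` joins tokens with single spaces, the two approaches only differ here when a post or title is empty after cleaning.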

In [35]:
data.head(5)

Unnamed: 0,num_comments,score,selftext,subreddit,title,clean_selftext,clean_title,wordcount_clean_selftext,wordcount_clean_title
0,1,0,**Maybe it's *staged* right**?! We all know the Koch Bros and Der Juden control Drumpf! The NWO is **INEVITABLE!**\n\n\nTRUMP BOOTS AUTHOR FROM GOLF COURSE\n\nDonald Trump personally booted the au...,C_S_T,Koch Brother and Trump biographer kicked out of Trump Golf Course,maybe staged right know koch bros der juden control drumpf nwo inevitable trump boot author golf course donald trump personally booted author unflattering biography trump international golf club w...,koch brother trump biographer kicked trump golf course,160,8
1,21,31,[Part 1: It Starts With Your Thinking](https://www.reddit.com/r/C_S_T/comments/5k1b7h/how_not_to_get_sick_part_1_it_starts_with_your/) \n[Part 2: How Emotions Affect Your Health](https://www.redd...,C_S_T,How Not To Get Sick - Part 3: Discarding Your Victim Mentality,part 1 start thinking http www reddit com r c_s_t comment 5k1b7h how_not_to_get_sick_part_1_it_starts_with_your part 2 emotion affect health http www reddit com r c_s_t comment 5l5drq how_not_to_g...,get sick part 3 discarding victim mentality,950,7
2,20,5,"**""Truth"" is invalid anywhere but inside Formal Logic. It doesn't exist in any physical reality.**\n\nPeople talk about ""Truth"" quiet casually, and they have a lot of different meanings for it, bu...",C_S_T,"""Truth"" in invalid anywhere but inside Formal Logic. It doesn't exist in any physical reality.",truth invalid anywhere inside formal logic exist physical reality people talk truth quiet casually lot different meaning 1 valid meaning everything else colloquialism treated accordingly truth exi...,truth invalid anywhere inside formal logic exist physical reality,249,9
3,22,22,"Preface: I’d like to preface this text by admitting to speaking with major generalities when considering humanity. Additionally, I want to recognize how lengthy this read is, but when considering ...",C_S_T,Finding Truth in the Modern World: The Generation that Believed It Was Its Thoughts is now Awakening From the Illusion,preface like preface text admitting speaking major generality considering humanity additionally want recognize lengthy read considering nature man lot said text speaks think power change life hope...,finding truth modern world generation believed thought awakening illusion,1022,9
4,6,21,The key to understanding the globalists’ strategy in implementing the New World Order is to understand that there are actually two NWOs: a Western-fronted decoy New World Order and a BRICS-fronted...,C_S_T,"The emerging United Nations-based, BRICS-fronted New World Order is the REAL NWO the globalists have been working towards...",key understanding globalists strategy implementing new world order understand actually two nwos western fronted decoy new world order brics fronted real new world order understand globalists creat...,emerging united nation based brics fronted new world order real nwo globalists working towards,1248,14


In [34]:
data.tail(5)

Unnamed: 0,num_comments,score,selftext,subreddit,title,clean_selftext,clean_title,wordcount_clean_selftext,wordcount_clean_title
995,322,1,"A few broken doors and windows at the Capitol does not compare to all the leftist rioters that burned, damaged, and looted hundreds of local business in cities all across the country for many mont...",conspiracy,The exaggerated outrage at the Capitol protest from the media and politicians is all for political power and propaganda purposes. Establishment politicians are just threatened that right wingers a...,broken door window capitol compare leftist rioter burned damaged looted hundred local business city across country many month left stormed capitol like right anything left statue would smashed wou...,exaggerated outrage capitol protest medium politician political power propaganda purpose establishment politician threatened right winger actually went government instead locally owned business,99,22
996,4,1,# [Anthony Tata](https://en.wikipedia.org/wiki/Anthony_Tata) - Under Secretary of Defense for Policy \n\nHere are some of his past tweets [https://imgur.com/a/KDLpvsi](https://imgur.com/a/KDLpvsi...,conspiracy,DOD and other personnel who should be investigated for the failed security at the Capitol,anthony tata http en wikipedia org wiki anthony_tata secretary defense policy past tweet http imgur com kdlpvsi http imgur com kdlpvsi also tweeted california rep maxine water nancy pelosi said al...,dod personnel investigated failed security capitol,1153,6
997,27,1,"he theory (truth) is that the 'meteor' never actually hit earth, which 'caused the dinosaurs to die'. This is what the media tells you, but it is NOT TRUE! If this was the case, then the Earth wo...",conspiracy,Dinosaurs still live on Earth!,theory truth 'meteor' never actually hit earth 'caused dinosaur die' medium tell true case earth would destroyed would today instead small meteor hit kazakhstan causing crater created dinosaur att...,dinosaur still live earth,47,4
998,70,1,I’ve been pro-Trump and a Q believer since day 1. I told many here I’d be back to rub it in January when Trump was elected and Q was proven true. I also said I’d be back to admit if I was wrong. \...,conspiracy,I was pro-Trump and a Q believer since day 1. I swore I’d be back to admit if I was wrong.,pro trump q believer since day 1 told many back rub january trump elected q proven true also said back admit wrong scammed fucking hurt mad guess similar many million others wanted believe still c...,pro trump q believer since day 1 swore back admit wrong,103,11
999,46,1,Let’s say it’s all true. Whatever your favorite conspiracy theory. 9/11 was an inside job. Fake news. QAnon. Lin Wood. Aliens. Whatever.\n\nSuppose it’s all true. So what? I’ve spent years now stu...,conspiracy,"“It’s all true!!!” Ok, I believe you. So what?",let say true whatever favorite conspiracy theory 9 11 inside job fake news qanon lin wood alien whatever suppose true spent year studying stuff compiling evidence thinking figuring thing nobody gi...,true ok believe,105,3


In [36]:
#export cleaned data to csv
data.to_csv('./data/clean_subreddits.csv', index=False)