## Fixing the Extracted Information

Originally, I did not clean the text extracted from the ZST archive, because I planned to do the cleaning in PySpark. However, I ran into problems reading the CSV into a Spark DataFrame because some of the text fields contained commas and line breaks. I therefore had to do some cleaning of the file before using it in PySpark.
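
To illustrate the problem, here is a small sketch using only the standard library. The sample row is made up. A field containing a comma and a line break is valid CSV once it is quoted, but anything that splits records on line breaks sees two malformed rows instead of one.

```python
import csv
import io

# Hypothetical row: the title contains a comma and a line break.
row = ["abc12", "First line, with a comma\nsecond line", "3"]

buf = io.StringIO()
csv.writer(buf).writerow(row)
raw = buf.getvalue()

# Splitting on line breaks, as a line-oriented reader does, cuts the record in two.
naive_records = raw.splitlines()

# A quote-aware parser recovers the single original record.
parsed = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_records))  # 2
print(len(parsed))         # 1
```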

I also used this notebook to check my work on the submission dataset before I felt comfortable with PySpark.

In [1]:
import pandas as pd
import csv

# Fixing DF

In [None]:
import os

# open() does not expand '~', so resolve the home directory explicitly
in_path = os.path.expanduser('~/desktop/30123/project/cons_submissions_abortion_retry.csv')
out_path = os.path.expanduser('~/desktop/30123/project/output_retry.csv')

with open(in_path, 'r', newline='', encoding='utf-8') as infile, \
     open(out_path, 'w', newline='', encoding='utf-8') as outfile:

    reader = csv.reader(infile)
    writer = csv.writer(outfile)

    for row in reader:
        # replace commas and line breaks inside each field with spaces
        # so that every record fits on a single line
        corrected_row = [' '.join(item.replace(',', ' ').splitlines()) for item in row]
        writer.writerow(corrected_row)
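
A quick way to confirm that this kind of cleanup worked is to check that no field still contains a comma or line break and that every row has the same number of fields. The sketch below applies the same replacement logic to an in-memory sample. `clean_row` and the sample data are illustrative and are not part of the notebook above.

```python
import csv
import io

def clean_row(row):
    """Replace commas and line breaks inside each field with spaces."""
    return [" ".join(item.replace(",", " ").splitlines()) for item in row]

# Hypothetical input with an embedded comma and an embedded line break.
sample = 'id,title,num_comments\r\na1,"Hello, world",2\r\nb2,"two\nlines",0\r\n'

rows = [clean_row(r) for r in csv.reader(io.StringIO(sample, newline=""))]

# After cleaning, every row should have the same width and no field
# should contain a comma or a line break.
widths = {len(r) for r in rows}
dirty = [f for r in rows for f in r if "," in f or "\n" in f or "\r" in f]

print(rows)
print(widths, dirty)
```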

# Fixing Duplicates

I think that while I was experimenting with reading from the ZST and writing to the CSV, I made a mistake in the writing step and some posts were written more than once. This removes the duplicates.
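
One plausible way this can happen (an assumption, since the extraction script is not shown here) is re-running a script that appends to the same CSV and writes the header each time. The sketch below reproduces that on an in-memory buffer with made-up rows. `append_run`, `HEADER`, and `POSTS` are hypothetical names.

```python
import csv
import io

HEADER = ["id", "title", "num_comments"]
# Made-up posts standing in for one extraction run.
POSTS = [["a1", "first post", "2"], ["b2", "second post", "0"]]

def append_run(buf):
    """Simulate one run that appends the header and all posts to the same file."""
    writer = csv.writer(buf)
    writer.writerow(HEADER)
    writer.writerows(POSTS)

buf = io.StringIO()
append_run(buf)
append_run(buf)  # the second run duplicates every post and repeats the header

rows = list(csv.reader(io.StringIO(buf.getvalue(), newline="")))
data = rows[1:]  # everything after the first header line

# The repeated header shows up as a data row, alongside the duplicated posts.
stray_headers = [r for r in data if r == HEADER]
unique_posts = list(dict.fromkeys(tuple(r) for r in data if r != HEADER))

print(len(data), len(stray_headers), len(unique_posts))
```

In a file like this, dropping exact duplicates removes the repeated posts but still leaves one copy of the stray header line among the data rows.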

In [59]:
subs = pd.read_csv('~/desktop/30123/project/output.csv')

print(subs.shape)
subs = subs.drop_duplicates(keep='first')
print(subs.shape)

(15974, 7)
(10372, 7)


In [60]:
subs.to_csv('~/desktop/30123/project/output_fixed.csv', index=False)

In [66]:
# checking the values in num_comments
comments = subs['num_comments']
comments.unique()

# need to get rid of the bad row (a repeated header line)
subs = subs[subs['num_comments'] != 'num_comments'].copy()
print(subs.shape)

# and change to int
subs['num_comments'] = subs['num_comments'].astype(int)

(10371, 7)
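
As an alternative sketch, not what the cell above does, `pd.to_numeric` with `errors='coerce'` handles any non-numeric value in the column rather than only a repeated header. The CSV text below is made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV in which the header line appears a second time as a data row.
raw = "id,num_comments\na1,2\nid,num_comments\nb2,0\n"
df = pd.read_csv(io.StringIO(raw))

# The stray header forces the whole column to be read as strings.
print(df["num_comments"].tolist())  # ['2', 'num_comments', '0']

# Coerce non-numeric values to NaN, drop those rows, then cast to int.
counts = pd.to_numeric(df["num_comments"], errors="coerce")
df = df[counts.notna()].copy()
df["num_comments"] = counts[counts.notna()].astype(int)

print(df["num_comments"].tolist())  # [2, 0]
```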


In [67]:
# ids of posts that have at least one comment
posts_with_comments = subs.loc[subs['num_comments'] > 0, 'id'].tolist()

In [68]:
posts_with_comments

['72v18',
 'ahsop',
 'bcj1g',
 'bjsrb',
 'c45dz',
 'chxh4',
 'cnyoj',
 'd799v',
 'er128',
 'f0gf5',
 'f5hcf',
 'g1xdt',
 'g2neq',
 'gf0kc',
 'gufsw',
 'gv102',
 'gvs0r',
 'j6r1v',
 'kqj7z',
 'l7gor',
 'lb91h',
 'lru2q',
 'm62tz',
 'm9q6u',
 'mgzhr',
 'noljl',
 'nx94f',
 'o35bm',
 'o3aty',
 'o7ycp',
 'of65j',
 'omr7t',
 'opt2s',
 'pbrzo',
 'pl24d',
 'pr5g2',
 'q0mqi',
 'qeq83',
 'rdeqn',
 'riez2',
 'rk5z4',
 'rrvh8',
 'rzm05',
 'sar5s',
 'tnrrg',
 'uax69',
 'ue7vm',
 'ug6wr',
 'ufrvq',
 'uer77',
 'ugqzg',
 'unwah',
 'uxq6v',
 'uzuw0',
 'v440x',
 'v43ih',
 'v31uk',
 'vit43',
 'vibg9',
 'vjpjo',
 'vr6by',
 'wez1h',
 'wmopv',
 'wvvt7',
 'x5304',
 'xcnkj',
 'xppdo',
 'xvv66',
 'yabg3',
 'y9cw4',
 'yas5v',
 'yer7t',
 'yh17h',
 'yi3pw',
 'ylpbs',
 'ylna9',
 'ymvyo',
 'ypj5f',
 'ype1n',
 'yp2o0',
 'yo38j',
 'yqhbl',
 'ytdpe',
 'yt0iw',
 'yssu3',
 'yz38a',
 'yyrm7',
 'yyayp',
 'z06tc',
 'z2jmb',
 'z8q20',
 'zcj21',
 'zcci3',
 'zds0x',
 'zgmtf',
 'zi45x',
 'zk55b',
 'zk525',
 '100rwp',
 '1091y4'

# double checking on 'all'

In [2]:
subs_all = pd.read_csv('~/desktop/30123/project/cons_submissions_all.csv')

In [3]:
# need to get rid of the bad row (a repeated header line)
subs_all = subs_all[subs_all['num_comments'] != 'num_comments'].copy()
print(subs_all.shape)

# and change to int
subs_all['num_comments'] = subs_all['num_comments'].astype(int)

(1033863, 7)


In [4]:
subs_all_w_coms = subs_all[subs_all['num_comments'] > 0]

In [5]:
subs_all_w_coms.head(30)

Unnamed: 0,title,date_posted,author,id,flare,num_comments,url
6,Ron Paul and Raw Milk,2008-03-06 02:38:58,temjrpgh,6b1e5,,1,/r/Conservative/comments/6b1e5/ron_paul_and_ra...
11,Grandfather Kills Two Men Robbing Neighbors Ho...,2008-07-08 17:46:49,michael123,6qt4i,,1,/r/Conservative/comments/6qt4i/grandfather_kil...
12,Is Politics Theater?,2008-07-10 21:01:25,[deleted],6r7by,,1,/r/Conservative/comments/6r7by/is_politics_the...
13,Whites to be Minority in US by 2042,2008-08-14 03:27:05,michael123,6w8tf,,1,/r/Conservative/comments/6w8tf/whites_to_be_mi...
14,Sex and Politics,2008-08-15 17:12:21,michael123,6whlu,,1,/r/Conservative/comments/6whlu/sex_and_politics/
17,Teach Your Children to Defend Themselves! by J...,2008-08-30 18:00:00,HansGruen,6ytmt,,1,/r/Conservative/comments/6ytmt/teach_your_chil...
21,I've always wondered why the internet is so li...,2008-09-22 04:41:35,[deleted],72qew,,4,/r/Conservative/comments/72qew/ive_always_wond...
22,Men Aren't Allowed To Talk Abortion,2008-09-22 19:01:29,cldnails,72v18,,5,/r/Conservative/comments/72v18/men_arent_allow...
28,The Truth Manifesto - A Conservative Christian...,2008-09-25 18:34:10,PREZman,73jha,,1,/r/Conservative/comments/73jha/the_truth_manif...
30,The Truth Manifesto - A Conservative Christian...,2008-09-26 16:51:12,PREZman,73qks,,1,/r/Conservative/comments/73qks/the_truth_manif...


In [6]:
pd.set_option('display.max_colwidth', None)
subs_all_w_coms[subs_all_w_coms['id'] == '2onax2']

Unnamed: 0,title,date_posted,author,id,flare,num_comments,url
111210,Obamacare Architect Jonathan Gruber: Abortion of 'Marginal Children' a 'Social Good',2014-12-08 14:13:18,chabanais,2onax2,,124,/r/Conservative/comments/2onax2/obamacare_architect_jonathan_gruber_abortion_of/


In [74]:
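print(subs_all.shape)
# NOTE: this line deduplicates `subs`, not `subs_all`, so the two shapes printed
# here are identical by construction; checking `subs_all` itself would need
# subs_all.drop_duplicates(keep='first')
subs = subs.drop_duplicates(keep='first')
print(subs_all.shape)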

(1033863, 7)
(1033863, 7)
