# Filter Data
1. Load data (in chunks)
2. Select only one category
3. Select only root comments
4. Filter out comments whose text is missing (NaN)

Overall, these steps reduce the amount of data that later steps have to handle.

Input (preprocessed file):
- /mnt/data/group07/johannes/proc_data/merged_comments.csv

Output (one file per category):
- /mnt/data/group07/johannes/proc_data/{category}_comments.csv
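
The steps listed above can be sketched on a tiny in-memory CSV. This is only an illustration under assumptions: the rows in `TOY_CSV` are made up, the helper `filter_comments` is a hypothetical name that does not appear in this notebook, and the three conditions are combined into one boolean mask instead of being applied one after another.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for merged_comments.csv (made-up rows).
TOY_CSV = """article_id,comment_id,comment_text,parent_comment_id
1,10,first root comment,
1,11,a reply,10
2,12,,
2,13,root comment on another article,
3,14,root comment outside the category,
3,15,another reply,14
"""

def filter_comments(csv_source, article_ids, chunksize=3):
    """Read a CSV in chunks and keep non-empty root comments of the given articles."""
    kept = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        mask = (
            chunk["comment_text"].notnull()            # drop comments without text
            & chunk["parent_comment_id"].isnull()      # keep root comments only
            & chunk["article_id"].isin(article_ids)    # keep the chosen articles only
        )
        kept.append(chunk[mask].drop(columns=["parent_comment_id"]))
    # Concatenate once after all chunks have been filtered.
    return pd.concat(kept)

toy_result = filter_comments(io.StringIO(TOY_CSV), article_ids=[1, 2])
print(toy_result)
```

Only the comments with `comment_id` 10 and 13 survive all three conditions in this toy input.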

In [1]:
import pandas as pd
pd.__version__

'0.22.0'

In [26]:
category = "sport"  # alternatively: "politics"

In [9]:
articles = pd.read_csv('/mnt/data/datasets/newspapers/guardian/articles.csv')
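# keep only articles of the chosen category; regex=False matches the URL literally ("." is not a wildcard)
articles = articles[articles['article_url'].str.contains("https://www.theguardian.com/" + category + "/", regex=False)] # overwrite to save memory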
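
By default, `str.contains` treats its pattern as a regular expression, so an unescaped `.` matches any character, and the pattern may match anywhere in the string. The sketch below uses made-up URLs (not taken from the dataset) to compare a literal substring match with a prefix match via `str.startswith`. Which of the two is appropriate depends on how the URLs in `articles.csv` are formed, which is not checked here.

```python
import pandas as pd

# Made-up URLs for illustration only.
urls = pd.Series([
    "https://www.theguardian.com/sport/2016/example",
    "https://www.theguardian.com/politics/2016/example",
    "https://mirror.example/?src=https://www.theguardian.com/sport/x",
])
prefix = "https://www.theguardian.com/sport/"

# Literal match anywhere in the string.
contains_mask = urls.str.contains(prefix, regex=False)
# Literal match at the start of the string only.
prefix_mask = urls.str.startswith(prefix)

print(contains_mask.tolist())  # [True, False, True]
print(prefix_mask.tolist())    # [True, False, False]
```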

In [10]:
articles

Unnamed: 0,article_id,article_url
1236,1238,https://www.theguardian.com/sport/gallery/2017...
1262,1264,https://www.theguardian.com/sport/2016/dec/31/...
1357,1359,https://www.theguardian.com/sport/2016/aug/25/...
1387,1389,https://www.theguardian.com/sport/2016/jul/19/...
1443,1445,https://www.theguardian.com/sport/shortcuts/20...
1468,1470,https://www.theguardian.com/sport/2016/may/03/...
1475,1477,https://www.theguardian.com/sport/2016/apr/26/...
1514,1516,https://www.theguardian.com/sport/2016/mar/20/...
1564,1566,https://www.theguardian.com/sport/2016/feb/08/...
1735,1737,https://www.theguardian.com/sport/2015/oct/05/...


In [11]:
filename = '/mnt/data/group07/johannes/proc_data/merged_comments.csv'

chunksize = 10 ** 6
comments_list = []

for chunk in pd.read_csv(filename, chunksize=chunksize):
    comment_chunk = chunk[chunk['comment_text'].notnull()] # drop comments without text (NaN)
    comment_chunk = comment_chunk[comment_chunk['parent_comment_id'].isnull()] # keep only root comments (no parent)
    comment_chunk = comment_chunk[comment_chunk['article_id'].isin(articles['article_id'])] # keep only comments on articles of the chosen category
    comment_chunk = comment_chunk.drop(['parent_comment_id'], axis=1) # always NaN for root comments, so no longer needed
    print(comment_chunk.shape)
    comments_list.append(comment_chunk)
    print(len(comments_list))

# collecting the chunks in a list and concatenating once is faster than growing a DataFrame chunk by chunk
comments = pd.concat(comments_list)

(3133, 7)
1
(213, 7)
2
(1442, 7)
3
(161978, 7)
4
(600, 7)
5
(1925, 7)
6
(1884, 7)
7
(1920, 7)
8
(2753, 7)
9
(577, 7)
10
(82, 7)
11
(300, 7)
12
(17048, 7)
13
(4572, 7)
14
(1242, 7)
15
(1261, 7)
16
(28411, 7)
17
(122935, 7)
18
(1062, 7)
19
(3344, 7)
20
(1355, 7)
21
(2849, 7)
22
(256, 7)
23
(195, 7)
24
(17356, 7)
25
(972, 7)
26
(962, 7)
27
(31212, 7)
28
(121489, 7)
29
(1797, 7)
30
(2264, 7)
31
(1994, 7)
32
(42, 7)
33
(1142, 7)
34
(16876, 7)
35
(884, 7)
36
(144415, 7)
37
(2238, 7)
38
(1375, 7)
39
(1636, 7)
40
(340, 7)
41
(14126, 7)
42
(30390, 7)
43
(111466, 7)
44
(2039, 7)
45
(1900, 7)
46
(308, 7)
47
(13876, 7)
48
(664, 7)
49
(22733, 7)
50
(59612, 7)
51
(79897, 7)
52
(1415, 7)
53
(10071, 7)
54
(130839, 7)
55
(1782, 7)
56
(552, 7)
57
(55408, 7)
58
(77945, 7)
59
(111175, 7)
60
(116215, 7)
61
(32663, 7)
62
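
The loop cell gathers the filtered chunks in a list and concatenates them once. A small sketch of why, using hypothetical toy frames in place of the real chunks: both approaches give the same result, but growing the frame step by step copies all rows gathered so far on every iteration, whereas a single `pd.concat` copies each row once. No timings are measured here.

```python
import pandas as pd

# Hypothetical small frames standing in for the filtered chunks.
pieces = [pd.DataFrame({"comment_id": range(i * 3, i * 3 + 3)}) for i in range(4)]

# Pattern used in the loop cell: collect first, then concatenate once.
once = pd.concat(pieces)

# Alternative: grow the frame step by step (each step copies everything gathered so far).
grown = pieces[0]
for piece in pieces[1:]:
    grown = pd.concat([grown, piece])

print(once.equals(grown))  # True
print(len(once))           # 12
```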


In [17]:
comments

Unnamed: 0.1,Unnamed: 0,article_id,author_id,comment_id,timestamp,upvotes,comment_text
95259,95259,1238,162,93196072,2017-02-14T17:25:30Z,2,lindsay is staggering and great from the looks...
95260,95260,1238,22121,93183940,2017-02-14T14:26:22Z,2,"great shots . the youth chess tournament , the..."
95261,95261,1238,31270,93225104,2017-02-15T08:42:20Z,1,two great photos -the horseracing and the ches...
95262,95262,1238,31271,93184685,2017-02-14T14:36:25Z,0,good idea for a series with the chess and some...
100510,100510,1264,14,90313450,2017-01-01T16:59:43Z,4,watched it this afternoon . perhaps i'm not th...
100511,100511,1264,19219,90276423,2016-12-31T20:37:43Z,3,will miss nick luck and simon holt and tanya a...
100512,100512,1264,25725,90272831,2016-12-31T18:07:24Z,5,"matt chapman is a loud mouth , think big mac w..."
100514,100514,1264,28706,90317319,2017-01-01T18:34:19Z,0,horseracings long pr advert makes the question...
100515,100515,1264,28706,90273500,2016-12-31T18:33:50Z,2,“ it may be less analytical than what you may ...
100516,100516,1264,32729,90298389,2017-01-01T11:10:15Z,2,if we want people to watch racing on a saturda...


In [27]:
# note: by default to_csv also writes the index as a first column without a name
comments.to_csv("/mnt/data/group07/johannes/proc_data/" + category + "_comments.csv")
print(comments.shape)

(1583407, 7)
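
The `Unnamed: 0` and `Unnamed: 0.1` columns in the tables of this notebook are consistent with an index that was written by `to_csv` and read back as a regular column. A minimal sketch of that round trip on a made-up frame, using in-memory buffers instead of files:

```python
import io
import pandas as pd

# Made-up frame with a non-contiguous index, as left behind by filtering.
df = pd.DataFrame({"comment_id": [10, 13], "upvotes": [2, 0]}, index=[0, 3])

# Default: the index is written as a first column with an empty header.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'comment_id', 'upvotes']
print(cols_without)  # ['comment_id', 'upvotes']
```

If the index should be kept in the file, reading it back with `pd.read_csv(..., index_col=0)` restores it as the index instead of adding another column.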


In [9]:
comments[100000:]

Unnamed: 0,article_id,author_id,comment_id,comment_text,timestamp,upvotes,ommen t_t ex
166063,1793,39337,58438101,The bloke needs a personality transplant..not ...,2015-08-30T02:26:32Z,0,he bloke needs a personality transplant .. not...
166065,1793,40706,58406983,I wouldn't be surprised if his wife threw a li...,2015-08-29T12:02:32Z,1,wouldn't be surprised if his wife threw a litt...
166067,1793,40706,58400583,You were a disastrous Prime Minister who could...,2015-08-29T09:43:24Z,13,you were a disastrous prime minister who could...
166068,1793,41227,58405148,Cringe... Cringe...,2015-08-29T11:21:14Z,3,ringe ... cringe ..
166071,1793,44087,58392982,"It was AN eager, if slightly stiff, Kevin Rudd...",2015-08-29T05:50:15Z,7,"it was an eager , if slightly stiff , kevin ru..."
166074,1793,45539,58416760,"At least Rudd has something to say, not like o...",2015-08-29T16:28:46Z,5,"at least rudd has something to say , not like ..."
166075,1793,45568,58392840,The interesting fact is he can speak Mandarin ...,2015-08-29T05:41:02Z,11,he interesting fact is he can speak mandarin i...
166081,1793,45716,58393830,Could you imagine Gillard being the presenter?...,2015-08-29T06:40:27Z,8,ould you imagine gillard being the presenter ?...
166082,1793,47642,58393497,Forever the Media Tart,2015-08-29T06:23:30Z,9,orever the media tar
166083,1793,47850,58479529,Watched Kev hosting Amanpour ... he was knowle...,2015-08-30T20:26:03Z,0,watched kev hosting amanpour ... he was knowle...


In [10]:
comments[comments['comment_id'] == 63592243]

Unnamed: 0,article_id,author_id,comment_id,comment_text,timestamp,upvotes,ommen t_t ex
4915477,47823,43374,63592243,How about a paradigm shift - try running the r...,2015-11-19T16:59:46Z,32,ow about a paradigm shift - try running the ra...


In [8]:
comments[comments['upvotes'] > 0]

Unnamed: 0,article_id,author_id,comment_id,comment_text,timestamp,upvotes
12998,168,1128,12651598,Cameron ' s chauvinism runs through him like a...,2011-10-02T13:09:35Z,5
12999,168,1473,12654152,The Tories know women have been voting for th...,2011-10-02T18:40:14Z,3
13000,168,2068,12650257,[ think that politicians in power come across...,2011-10-02T10:55:16Z,7
13001,168,228,12650590,@runnel [ Despite the massive steps in Female...,2011-10-02T11:28:17Z,6
13002,168,228,12650312,Still trying to choke down the nausea engende...,2011-10-02T11:00:55Z,5
13003,168,2306,12653164,[ we ask voters and experts why women are grow...,2011-10-02T16:39:02Z,3
13004,168,243,12649474,I ' d find it really interesting to see what ...,2011-10-02T09:26:19Z,11
13005,168,261,12649916,That Johnson chin . You could smash diamonds w...,2011-10-02T10:17:50Z,1
13008,168,5542,12654939,"[ Audrey Pyle , , volunteer at Age Concern sh...",2011-10-02T19:51:19Z,1
13009,168,5543,12654823,Why any woman would vote Tory is a mystery to...,2011-10-02T19:41:18Z,4


In [9]:
comments[comments['upvotes'] == 0]

Unnamed: 0,article_id,author_id,comment_id,comment_text,timestamp,upvotes
13006,168,5540,12658849,Rachel Johnson [ many women I ' ve met are in ...,2011-10-03T08:28:02Z,0
13007,168,5541,12658338,Due to an absence of any ( coherent ) policie...,2011-10-03T06:02:16Z,0
25229,369,1094,10174270,Will the Guardian still refer to Mrs Milliband...,2011-03-30T21:56:49Z,0
25235,369,2025,10172430,@samro [ To say that a couple who . . . are mo...,2011-03-30T18:49:19Z,0
25236,369,2025,10168008,@fatbelly [ I wish the future Mr &amp Mrs Ed ...,2011-03-30T14:23:01Z,0
25237,369,2025,10167967,@uncleal [ @David I was at the march . Every ...,2011-03-30T14:21:00Z,0
25250,369,2434,10163457,Congrats Ed . . . sure we can ' t talk you out...,2011-03-30T10:19:34Z,0
25256,369,3184,10172664,Will the Guardian still refer to Mrs Milliband...,2011-03-30T19:08:29Z,0
25259,369,3438,10172504,This is almost as exciting and important as w...,2011-03-30T18:55:25Z,0
25260,369,4038,10162749,"WEd Milliband , Sideshow Bob to Wills n Kate ...",2011-03-30T09:36:00Z,0


In [12]:
comments

Unnamed: 0,article_id,author_id,comment_id,comment_text,timestamp,parent_comment_id,upvotes
0,1,1,14606180,So you are saying that the demonisation of fe...,2012-02-11T12:49:42Z,,0
5,1,1,14604722,It might be helpful for me to summarise the t...,2012-02-11T10:25:55Z,,0
9,1,1,14597711,I ' m called away to a Valentine ' s Day dinn...,2012-02-10T18:15:51Z,,0
33,1,1,14585559,As a general question to the thread is that t...,2012-02-10T09:19:49Z,,1
36,1,1,14584716,I can ' t help feeling that the virulently an...,2012-02-10T08:00:10Z,,0
40,1,12,14583858,[ At Davos ( fewer than one in five women del...,2012-02-10T02:18:03Z,,0
42,1,14,14583539,If we ' re going to go for quotas then let ' ...,2012-02-10T01:09:51Z,,7
47,1,17,14583270,The vast majority of homeless people are men ....,2012-02-10T00:31:36Z,,20
48,1,18,14583244,"are you insane , millions of women are being ...",2012-02-10T00:27:42Z,,2
49,1,19,14583226,To deny women face discrimination in spheres ...,2012-02-10T00:26:21Z,,5
